2eig
Evolutionary trace report by report maker
September 6, 2008

CONTENTS

1 Introduction
2 Chain 2eigA
  2.1 P19664 overview
  2.2 Multiple sequence alignment for 2eigA
  2.3 Residue ranking in 2eigA
  2.4 Top ranking residues in 2eigA and their position on the structure
    2.4.1 Clustering of residues at 25% coverage.
    2.4.2 Overlap with known functional surfaces at 25% coverage.
    2.4.3 Possible novel functional surfaces at 25% coverage.
3 Notes on using trace results
  3.1 Coverage
  3.2 Known substitutions
  3.3 Surface
  3.4 Number of contacts
  3.5 Annotation
  3.6 Mutation suggestions
4 Appendix
  4.1 File formats
  4.2 Color schemes used
  4.3 Credits
    4.3.1 Alistat
    4.3.2 CE
    4.3.3 DSSP
    4.3.4 HSSP
    4.3.5 LaTex
    4.3.6 Muscle
    4.3.7 Pymol
  4.4 Note about ET Viewer
  4.5 Citing this work
  4.6 About report maker
  4.7 Attachments

1 INTRODUCTION
From the original Protein Data Bank entry (PDB id 2eig):
Title: seed lectin (isoform)
Compound: Mol id: 1; molecule: lectin; chain: a, b, c, d
Organism, scientific name: Lotus Tetragonolobus;
2eig contains a single unique chain 2eigA (230 residues long) and its homologues 2eigD, 2eigC, and 2eigB.

2 CHAIN 2EIGA
2.1 P19664 overview
From SwissProt, id P19664, 81% identical to 2eigA:
Description: Anti-H(O) lectin (LTA).
Organism, scientific name: Lotus tetragonolobus (Winged pea).
Taxonomy: Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; ; eurosids I; ; ; Papilionoideae; ; Lotus.
Function: L-fucose specific lectin.
Similarity: Belongs to the leguminous lectin family.
About: This Swiss-Prot entry is copyright. It is produced through a collaboration between the Swiss Institute of Bioinformatics and the EMBL outstation - the European Bioinformatics Institute. There are no restrictions on its use as long as its content is in no way modified and this statement is not removed.

2.2 Multiple sequence alignment for 2eigA
For the chain 2eigA, the alignment 2eigA.msf (attached) with 61 sequences was used. The alignment was assembled through combination of BLAST searching on the UniProt database and alignment using Muscle program. It can be found in the attachment to this report, under the name of 2eigA.msf. Its statistics, from the alistat program are the following:

Lichtarge lab 2006

Fig. 1. Residues 1-115 in 2eigA colored by their relative importance. (See Appendix, Fig. 11, for the coloring scheme.)

Fig. 2. Residues 116-230 in 2eigA colored by their relative importance. (See Appendix, Fig.11, for the coloring scheme.)

Fig. 3. Residues in 2eigA, colored by their relative importance. Clockwise: front, back, top and bottom views.

Format: MSF
Number of sequences: 61
Total number of residues: 13061
Smallest: 191
Largest: 230
Average length: 214.1
Alignment length: 230
Average identity: 46%
Most related pair: 99%
Most unrelated pair: 23%
Most distant seq: 40%
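The identity figures above follow the definitions given in Section 4.3.1 (idents / MIN(len1, len2), then the average, maximum and minimum over all pairs, and the minimum of each sequence's best relative). The following is only a minimal sketch of those definitions, not the alistat implementation; the function names and the set of gap characters are assumptions made for illustration.

```python
from itertools import combinations

GAP_CHARS = "-.~"  # assumed gap symbols; MSF files commonly use '.' or '~'


def pairwise_identity(row_a, row_b):
    """Percent identity of two aligned rows: idents / MIN(len1, len2)."""
    idents = sum(1 for x, y in zip(row_a, row_b)
                 if x == y and x not in GAP_CHARS)
    len1 = sum(1 for x in row_a if x not in GAP_CHARS)  # unaligned length
    len2 = sum(1 for y in row_b if y not in GAP_CHARS)
    return 100.0 * idents / min(len1, len2)


def alignment_stats(rows):
    """Average, most related, most unrelated pair and most distant sequence."""
    n = len(rows)
    ident = {}
    for i, j in combinations(range(n), 2):
        ident[(i, j)] = ident[(j, i)] = pairwise_identity(rows[i], rows[j])
    pairs = [ident[(i, j)] for i, j in combinations(range(n), 2)]
    # best relative of each sequence, then the minimum of those N numbers
    best_relative = [max(ident[(i, j)] for j in range(n) if j != i)
                     for i in range(n)]
    return {
        "average": sum(pairs) / len(pairs),
        "most_related": max(pairs),
        "most_unrelated": min(pairs),
        "most_distant_seq": min(best_relative),
    }


# Consistency check on the reported numbers: total residues / sequences.
average_length = round(13061 / 61, 1)  # 214.1, as reported above
```

The last line only checks that the reported total number of residues and number of sequences agree with the reported average length; the sketch does not guard against sequences consisting entirely of gaps.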

Furthermore, 1% of residues show as conserved in this alignment. The alignment consists of 98% eukaryotic (98% plantae) sequences. (Descriptions of some sequences were not readily available.) The file containing the sequence descriptions can be found in the attachment, under the name 2eigA.descr.

2.3 Residue ranking in 2eigA
The 2eigA sequence is shown in Figs. 1–2, with each residue colored according to its estimated importance. The full listing of residues in 2eigA can be found in the file called 2eigA.ranks sorted in the attachment.

2.4 Top ranking residues in 2eigA and their position on the structure
In the following we consider residues ranking among top 25% of residues in the protein. Figure 3 shows residues in 2eigA colored by their importance: bright red and yellow indicate more conserved/important residues (see Appendix for the coloring scheme). A Pymol script for producing this figure can be found in the attachment.

Fig. 4. Residues in 2eigA, colored according to the cluster they belong to: red, followed by blue and yellow are the largest clusters (see Appendix for the coloring scheme). Clockwise: front, back, top and bottom views. The corresponding Pymol script is attached.

2.4.1 Clustering of residues at 25% coverage. Fig. 4 shows the top 25% of all residues, this time colored according to clusters they belong to. The clusters in Fig. 4 are composed of the residues listed in Table 1.

cluster color   size   member residues
red             54     3,8,17,28,30,41,42,43,44,45,
                       48,51,59,60,61,62,64,66,68,
                       81,83,84,86,94,102,103,104,
                       105,115,117,118,119,121,138,
                       139,145,161,163,164,165,172,
                       192,200,202,203,204,205,206,
                       207,214,221,222,223,225

Table 1. Clusters of top ranking residues in 2eigA.
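The "top 25%" selection that feeds Table 1 rests on the coverage value defined in Section 4.1 (the share of residues on the structure that have this rho or smaller). The sketch below illustrates that definition only; it is not the report maker code, and the function names and the dictionary input are assumptions for illustration.

```python
from bisect import bisect_right


def coverage_from_rho(rho):
    """Map residue -> coverage, given a dict residue -> rho (ET score).

    Coverage of a residue is the fraction of residues whose rho is
    less than or equal to its own rho.
    """
    scores = sorted(rho.values())
    n = len(scores)
    # bisect_right counts how many scores are <= the residue's own score
    return {res: bisect_right(scores, score) / n for res, score in rho.items()}


def top_residues(rho, cutoff=0.25):
    """Residues with coverage <= cutoff, most important (smallest cvg) first."""
    cvg = coverage_from_rho(rho)
    selected = [res for res in rho if cvg[res] <= cutoff]
    return sorted(selected, key=lambda res: (cvg[res], res))
```

Residues with tied rho share the same coverage under this definition, so a cutoff may admit slightly fewer residues than the nominal percentage suggests.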

2.4.2 Overlap with known functional surfaces at 25% coverage. The name of the ligand is composed of the source PDB identifier and the heteroatom name used in that file.

NAG binding site. Table 2 lists the top 25% of residues at the interface with 2eigANAG1001 (nag). The following table (Table 3) suggests possible disruptive replacements for these residues (see Section 3.6).

res   type   subst's (%)    cvg    noc/bb   dist (Å)   antn
222   S      S(98) T(1)     0.04   8/0      3.57       site

Table 2. The top 25% of residues in 2eigA at the interface with NAG. (Field names: res: residue number in the PDB entry; type: amino acid type; substs: substitutions seen in the alignment; with the percentage of each type in the bracket; noc/bb: number of contacts with the ligand, with the number of contacts realized through backbone atoms given in the bracket; dist: distance of closest approach to the ligand.)

res   type   disruptive mutations
222   S      (KR)(FQMWH)(NELPI)(Y)

Table 3. List of disruptive mutations for the top 25% of residues in 2eigA, that are at the interface with NAG.

Fig. 5. Residues in 2eigA, at the interface with NAG, colored by their relative importance. The ligand (NAG) is colored green. Atoms further than 30Å away from the geometric center of the ligand, as well as on the line of sight to the ligand were removed. (See Appendix for the coloring scheme for the protein chain 2eigA.)

Figure 5 shows residues in 2eigA colored by their importance, at the interface with 2eigANAG1001.

Manganese (ii) ion binding site. Table 4 lists the top 25% of residues at the interface with 2eigAMN1101 (manganese (ii) ion). The following table (Table 5) suggests possible disruptive replacements for these residues (see Section 3.6).

res   type   subst's (%)                         cvg    noc/bb   dist (Å)   antn
127   D      D(63) .(19) S(9) E(3) N(3)          0.19   5/1      2.15       site
145   S      S(63) P(27) L(3) D(1) .(1) F(1)     0.20   1/0      4.19
118   E      E(68) V(18) A(9) C(1) F(1)          0.22   4/0      2.19       site

Table 4. The top 25% of residues in 2eigA at the interface with manganese (ii) ion. (Field names: res: residue number in the PDB entry; type: amino acid type; substs: substitutions seen in the alignment; with the percentage of each type in the bracket; noc/bb: number of contacts with the ligand, with the number of contacts realized through backbone atoms given in the bracket; dist: distance of closest approach to the ligand.)

res   type   disruptive mutations
127   D      (R)(FWH)(Y)(VCAG)
145   S      (R)(K)(H)(Q)
118   E      (HR)(Y)(FW)(K)

Table 5. List of disruptive mutations for the top 25% of residues in 2eigA, that are at the interface with manganese (ii) ion.

Fig. 6. Residues in 2eigA, at the interface with manganese (ii) ion, colored by their relative importance. The ligand (manganese (ii) ion) is colored green. Atoms further than 30Å away from the geometric center of the ligand, as well as on the line of sight to the ligand were removed. (See Appendix for the coloring scheme for the protein chain 2eigA.)

Figure 6 shows residues in 2eigA colored by their importance, at the interface with 2eigAMN1101.

Calcium ion binding site. Table 6 lists the top 25% of residues at the interface with 2eigACA1102 (calcium ion). The following table (Table 7) suggests possible disruptive replacements for these residues (see Section 3.6).

res   type   subst's (%)                    cvg    noc/bb   dist (Å)   antn
127   D      D(63) .(19) S(9) E(3) N(3)     0.19   4/0      2.39       site

Table 6. The top 25% of residues in 2eigA at the interface with calcium ion. (Field names: res: residue number in the PDB entry; type: amino acid type; substs: substitutions seen in the alignment; with the percentage of each type in the bracket; noc/bb: number of contacts with the ligand, with the number of contacts realized through backbone atoms given in the bracket; dist: distance of closest approach to the ligand.)

res   type   disruptive mutations
127   D      (R)(FWH)(Y)(VCAG)

Table 7. List of disruptive mutations for the top 25% of residues in 2eigA, that are at the interface with calcium ion.

Fig. 7. Residues in 2eigA, at the interface with calcium ion, colored by their relative importance. The ligand (calcium ion) is colored green. Atoms further than 30Å away from the geometric center of the ligand, as well as on the line of sight to the ligand were removed. (See Appendix for the coloring scheme for the protein chain 2eigA.)

Figure 7 shows residues in 2eigA colored by their importance, at the interface with 2eigACA1102.

Interface with 2eigD. By analogy with 2eigC – 2eigD interface. Table 8 lists the top 25% of residues at the interface with 2eigD. The following table (Table 9) suggests possible disruptive replacements for these residues (see Section 3.6).

res   type   subst's (%)                               cvg    noc/bb   dist (Å)
141   N      N(90) S(3) T(3) D(1) G(1)                 0.18   25/2     3.13
145   S      S(63) P(27) L(3) D(1) .(1) F(1)           0.20   11/11    3.56
105   T      F(77) .(8) I(4) L(3) Y(3) T(1) V(1)       0.23   3/0      3.33

Table 8. The top 25% of residues in 2eigA at the interface with 2eigD. (Field names: res: residue number in the PDB entry; type: amino acid type; substs: substitutions seen in the alignment; with the percentage of each type in the bracket; noc/bb: number of contacts with the ligand, with the number of contacts realized through backbone atoms given in the bracket; dist: distance of closest approach to the ligand.)

Fig. 8. Residues in 2eigA, at the interface with 2eigD, colored by their relative importance. 2eigD is shown in backbone representation. (See Appendix for the coloring scheme for the protein chain 2eigA.)

res   type   disruptive mutations
141   N      (FYWH)(R)(E)(M)
145   S      (R)(K)(H)(Q)
105   T      (R)(K)(Q)(H)

Table 9. List of disruptive mutations for the top 25% of residues in 2eigA, that are at the interface with 2eigD.
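The two string formats used in these tables, the substitutions column (for example N(90) S(3)) and the bracketed mutation groups (for example (FYWH)(R)(E)(M)), can be read mechanically. The sketch below is an illustration written for this purpose, not part of report maker; the function names are hypothetical. The last helper applies the advice of Section 3.2 (avoid amino acid types already seen in the alignment) to a row of suggestions.

```python
import re


def parse_mutation_groups(field):
    """Split '(FYWH)(R)(E)(M)' into [['F','Y','W','H'], ['R'], ['E'], ['M']].

    Group order is preserved exactly as printed in the table.
    """
    return [list(group) for group in re.findall(r"\(([A-Z]+)\)", field)]


def parse_substitutions(field):
    """Map each type seen in the alignment to its percentage.

    '.' stands for a gap; a type printed without a bracket (below 1%,
    see Section 3.2) is mapped to None.
    """
    seen = {}
    compact = field.replace(" ", "")
    for aa, pct in re.findall(r"([A-Z.])(?:\((\d+)\))?", compact):
        seen[aa] = int(pct) if pct else None
    return seen


def unseen_suggestions(mutations, substitutions):
    """Drop suggested types that already occur at this alignment position."""
    observed = set(parse_substitutions(substitutions))
    groups = []
    for group in parse_mutation_groups(mutations):
        kept = [aa for aa in group if aa not in observed]
        if kept:
            groups.append(kept)
    return groups
```

For the rows of Tables 8 and 9 this filter leaves the suggestions unchanged. For residue 164 in Tables 12 and 13 it would keep only Y and H, since R and T also appear among the substitutions listed for that position; whether that matters in practice is for the experimenter to judge.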

Figure 8 shows residues in 2eigA colored by their importance, at the interface with 2eigD.

2.4.3 Possible novel functional surfaces at 25% coverage. One group of residues is conserved on the 2eigA surface, away from (or substantially larger than) other functional sites and interfaces recognizable in PDB entry 2eig. It is shown in Fig. 9. The right panel shows (in blue) the rest of the larger cluster this surface belongs to.

Fig. 9. A possible active surface on the chain 2eigA. The larger cluster it belongs to is shown in blue.

The residues belonging to this surface "patch" are listed in Table 10, while Table 11 suggests possible disruptive replacements for these residues (see Section 3.6).

res   type   substitutions(%)                  cvg    antn
222   S      S(98)T(1)                         0.04   site
3     F      F(98)Y(1)                         0.06
8     F      F(95)Y(3).(1)                     0.08
45    Y      Y(96)H(1)F(1)                     0.10
17    Q      Q(85)L(3)E(3)D(4).(1)M(1)         0.16
44    L      L(37)F(55)S(3)A(1)I(1)            0.17
14    L      L(88)M(1)I(6)V(1).(1)             0.25

Table 10. Residues forming surface "patch" in 2eigA.

res   type   disruptive mutations
222   S      (KR)(FQMWH)(NELPI)(Y)
3     F      (K)(E)(Q)(D)
8     F      (K)(E)(Q)(D)
45    Y      (K)(Q)(EM)(N)
17    Q      (Y)(H)(FTW)(CG)
44    L      (R)(Y)(K)(H)
14    L      (Y)(R)(H)(T)

Table 11. Disruptive mutations for the surface patch in 2eigA.

Another group of surface residues is shown in Fig. 10. The right panel shows (in blue) the rest of the larger cluster this surface belongs to.

Fig. 10. Another possible active surface on the chain 2eigA. The larger cluster it belongs to is shown in blue.

The residues belonging to this surface "patch" are listed in Table 12, while Table 13 suggests possible disruptive replacements for these residues (see Section 3.6).

res   type   substitutions(%)                          cvg
84    F      F(100)                                    0.01
66    F      F(98)H(1)                                 0.02
62    F      F(95)W(1)L(1)T(1)                         0.08
172   L      L(90)M(3)I(4)F(1)                         0.09
225   S      S(90)I(4)A(3).(1)                         0.11
61    S      S(91)D(1)N(6)                             0.14
86    L      L(73)I(24)V(1)                            0.14
88    P      P(93)R(3)S(1)H(1)                         0.15
59    V      V(72)L(21)E(3)T(1)A(1)                    0.22
115   V      V(91)I(1)F(3)E(1)L(1)                     0.23
164   I      T(67)S(6)I(8)D(4)K(3)A(6)R(1)E(1)         0.24

Table 12. Residues forming surface "patch" in 2eigA.

res   type   disruptive mutations
84    F      (KE)(TQD)(SNCRG)(M)
66    F      (E)(K)(TQD)(SNCG)
62    F      (K)(E)(Q)(DR)
172   L      (YR)(T)(H)(SKECG)
225   S      (R)(K)(H)(Q)
61    S      (R)(FKWH)(YM)(Q)
86    L      (YR)(H)(T)(KE)
88    P      (Y)(TR)(E)(KCHG)
59    V      (R)(K)(Y)(H)
115   V      (R)(Y)(K)(E)
164   I      (Y)(R)(H)(T)

Table 13. Disruptive mutations for the surface patch in 2eigA.

3 NOTES ON USING TRACE RESULTS
3.1 Coverage
Trace results are commonly expressed in terms of coverage: the residue is important if its "coverage" is small - that is if it belongs to some small top percentage of residues [100% is all of the residues in a chain], according to trace. The ET results are presented in the form of a table, usually limited to the top 25% of residues (or to some nearby percentage), sorted by the strength of the presumed evolutionary pressure. (I.e., the smaller the coverage, the stronger the pressure on the residue.) Starting from the top of that list, mutating a couple of residues should affect the protein somehow, with the exact effects to be determined experimentally.

3.2 Known substitutions
One of the table columns is "substitutions" - other amino acid types seen at the same position in the alignment. These amino acid types may be interchangeable at that position in the protein, so if one wants to affect the protein by a point mutation, they should be avoided. For example, if the substitutions are "RVK" and the original protein has an R at that position, it is advisable to try anything but R, V, or K. Conversely, when looking for substitutions which will not affect the protein, one may try replacing R with K, or (perhaps more surprisingly) with V. The percentage of times the substitution appears in the alignment is given in the immediately following bracket. No percentage is given in the cases when it is smaller than 1%. This is meant to be a rough guide - due to rounding errors these percentages often do not add up to 100%.

3.3 Surface
To detect candidates for novel functional interfaces, first we look for residues that are solvent accessible (according to the DSSP program) by at least 10Å², which is roughly the area needed for one water molecule to come in contact with the residue. Furthermore, we require that these residues form a "cluster" of residues which have a neighbor within 5Å of any of their heavy atoms.
Note, however, that, if our picture of protein evolution is correct, the neighboring residues which are not surface accessible might be equally important in maintaining the interaction specificity - they should not be automatically dropped from consideration when choosing the set for mutagenesis. (Especially if they form a cluster with the surface residues.)

3.4 Number of contacts
Another column worth noting is denoted "noc/bb"; it tells the number of contacts heavy atoms of the residue in question make across the interface, as well as how many of them are realized through the backbone atoms (if all or most contacts are through the backbone, mutation presumably won't have strong impact). Two heavy atoms are considered to be "in contact" if their centers are closer than 5Å.

3.5 Annotation

If the residue annotation is available (either from the pdb file or from other sources), another column, with the header "annotation" appears. Annotations carried over from PDB are the following: site (indicating existence of related site record in PDB), S-S (disulfide bond forming residue), hb (hydrogen bond forming residue), jb (james bond forming residue), and sb (for salt bridge forming residue).

3.6 Mutation suggestions
Mutation suggestions are completely heuristic and based on complementarity with the substitutions found in the alignment. Note that they are meant to be disruptive to the interaction of the protein with its ligand. The attempt is made to complement the following properties: small [AVGSTC], medium [LPNQDEMIK], large [WFYHR], hydrophobic [LPVAMWFI], polar [GTCY]; positively [KHR], or negatively [DE] charged, aromatic [WFYH], long aliphatic chain [EKRQM], OH-group possession [SDETY], and NH2 group possession [NQRK]. The suggestions are listed according to how different they appear to be from the original amino acid, and they are grouped in round brackets if they appear equally disruptive. From left to right, each bracketed group of amino acid types resembles more strongly the original (i.e. is, presumably, less disruptive). These suggestions are tentative - they might prove disruptive to the fold rather than to the interaction. Many researchers will choose, however, the straightforward alanine mutations, especially in the beginning stages of their investigation.

Fig. 11. Coloring scheme used to color residues by their relative importance. (Color bar labeled COVERAGE, with marks at 100%, 50%, 30% and 5%, and RELATIVE IMPORTANCE.)

4 APPENDIX
4.1 File formats
Files with extension "ranks sorted" are the actual trace results. The fields in the table in this file:
• alignment# number of the position in the alignment
• residue# residue number in the PDB file
• type amino acid type
• rank rank of the position according to older version of ET
• variability has two subfields:
  1. number of different amino acids appearing in this column of the alignment
  2. their type
• rho ET score - the smaller this value, the lesser variability of this position across the branches of the tree (and, presumably, the greater the importance for the protein)
• cvg coverage - percentage of the residues on the structure which have this rho or smaller
• gaps percentage of gaps in this column

4.2 Color schemes used
The following color scheme is used in figures with residues colored by cluster size: black is a single-residue cluster; clusters composed of more than one residue are colored according to this hierarchy (ordered by descending size): red, blue, yellow, green, purple, azure, turquoise, brown, coral, magenta, LightSalmon, SkyBlue, violet, gold, bisque, LightSlateBlue, orchid, RosyBrown, MediumAquamarine, DarkOliveGreen, CornflowerBlue, grey55, burlywood, LimeGreen, tan, DarkOrange, DeepPink, maroon, BlanchedAlmond.
The colors used to distinguish the residues by the estimated evolutionary pressure they experience can be seen in Fig. 11.

4.3 Credits
4.3.1 Alistat alistat reads a multiple sequence alignment from the file and shows a number of simple statistics about it. These statistics include the format, the number of sequences, the total number of residues, the average and range of the sequence lengths, and the alignment length (e.g. including gap characters). Also shown are some percent identities. A percent pairwise alignment identity is defined as (idents / MIN(len1, len2)) where idents is the number of exact identities and len1, len2 are the unaligned lengths of the two sequences. The "average percent identity", "most related pair", and "most unrelated pair" of the alignment are the average, maximum, and minimum of all (N)(N-1)/2 pairs, respectively. The "most distant seq" is calculated by finding the maximum pairwise identity (best relative) for all N sequences, then finding the minimum of these N numbers (hence, the most outlying sequence). alistat is copyrighted by HHMI/Washington University School of Medicine, 1992-2001, and freely distributed under the GNU General Public License.

4.3.2 CE To map ligand binding sites from different source structures, report maker uses the CE program: http://cl.sdsc.edu/. Shindyalov IN, Bourne PE (1998) "Protein structure alignment by incremental combinatorial extension (CE) of the optimal path". Protein Engineering 11(9) 739-747.

4.3.3 DSSP In this work a residue is considered solvent accessible if the DSSP program finds it exposed to water by at least 10Å², which is roughly the area needed for one water molecule to come in contact with the residue. DSSP is copyrighted by W. Kabsch, C. Sander and MPI-MF, 1983, 1985, 1988, 1994 1995, CMBI version by [email protected] November 18,2002, http://www.cmbi.kun.nl/gv/dssp/descrip.html.

4.3.4 HSSP Whenever available, report maker uses HSSP alignment as a starting point for the analysis (sequences shorter than 75% of the query are taken out, however); R. Schneider, A. de Daruvar, and C. Sander. "The HSSP database of protein structure-sequence alignments." Nucleic Acids Res., 25:226–230, 1997. http://swift.cmbi.kun.nl/swift/hssp/

4.3.5 LaTex The text for this report was processed using LaTeX; Leslie Lamport, "LaTeX: A Document Preparation System Addison-Wesley," Reading, Mass. (1986).

4.3.6 Muscle When making alignments "from scratch", report maker uses Muscle alignment program: Edgar, Robert C. (2004), "MUSCLE: multiple sequence alignment with high accuracy and high throughput." Nucleic Acids Research 32(5), 1792-97. http://www.drive5.com/muscle/

4.3.7 Pymol The figures in this report were produced using Pymol. The scripts can be found in the attachment. Pymol is an open-source application copyrighted by DeLano Scientific LLC (2005). For more information about Pymol see http://pymol.sourceforge.net/. (Note for Windows users: the attached package needs to be unzipped for Pymol to read the scripts and launch the viewer.)

4.4 Note about ET Viewer
Dan Morgan from the Lichtarge lab has developed a visualization tool specifically for viewing trace results. If you are interested, please visit: http://mammoth.bcm.tmc.edu/traceview/
The viewer is self-unpacking and self-installing. Input files to be used with ETV (extension .etvx) can be found in the attachment to the main report.

4.5 Citing this work
The method used to rank residues and make predictions in this report can be found in Mihalek, I., I. Reš, O. Lichtarge. (2004). "A Family of Evolution-Entropy Hybrid Methods for Ranking of Protein Residues by Importance" J. Mol. Bio. 336: 1265-82. For the original version of ET see O. Lichtarge, H. Bourne and F. Cohen (1996). "An Evolutionary Trace Method Defines Binding Surfaces Common to Protein Families" J. Mol. Bio. 257: 342-358.
report maker itself is described in Mihalek I., I. Reš and O. Lichtarge (2006). "Evolutionary Trace Report Maker: a new type of service for comparative analysis of proteins." Bioinformatics 22:1656-7.

4.6 About report maker
report maker was written in 2006 by Ivana Mihalek. The 1D ranking visualization program was written by Ivica Reš. report maker is copyrighted by Lichtarge Lab, Baylor College of Medicine, Houston.

4.7 Attachments
The following files should accompany this report:
• 2eigA.complex.pdb - coordinates of 2eigA with all of its interacting partners
• 2eigA.etvx - ET viewer input file for 2eigA
• 2eigA.cluster report.summary - Cluster report summary for 2eigA
• 2eigA.ranks - Ranks file in sequence order for 2eigA
• 2eigA.clusters - Cluster descriptions for 2eigA
• 2eigA.msf - the multiple sequence alignment used for the chain 2eigA
• 2eigA.descr - description of sequences used in 2eigA msf
• 2eigA.ranks sorted - full listing of residues and their ranking for 2eigA
• 2eigA.2eigANAG1001.if.pml - Pymol script for Figure 5
• 2eigA.cbcvg - used by other 2eigA – related pymol scripts
• 2eigA.2eigAMN1101.if.pml - Pymol script for Figure 6
• 2eigA.2eigACA1102.if.pml - Pymol script for Figure 7
• 2eigA.2eigD.if.pml - Pymol script for Figure 8
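For readers who want to process the attached ranks files, the field list in Section 4.1 suggests a simple reader. The sketch below is an assumption-laden illustration, not a description of the actual file layout: it assumes one residue per line, nine whitespace-separated columns in the order listed in Section 4.1 (with the two variability subfields as separate columns), and comment lines starting with "%". The names RankRow and parse_ranks are hypothetical. Check these assumptions against the attached file before relying on the result.

```python
from typing import Iterable, List, NamedTuple


class RankRow(NamedTuple):
    alignment_pos: int  # alignment#: position in the alignment
    residue: str        # residue#: kept as text, it may not be a plain integer
    aa_type: str        # type: amino acid type
    rank: float         # rank according to the older version of ET
    n_types: int        # variability, subfield 1: number of different types
    types_seen: str     # variability, subfield 2: the types themselves
    rho: float          # ET score (smaller means less variable)
    cvg: float          # coverage
    gaps: float         # percentage of gaps in the column


def parse_ranks(lines: Iterable[str], comment: str = "%") -> List[RankRow]:
    """Read rows under the ASSUMED nine-column layout; skip everything else."""
    rows = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith(comment) or len(fields) != 9:
            continue  # blank line, comment, or a line that does not fit
        try:
            rows.append(RankRow(
                int(fields[0]), fields[1], fields[2], float(fields[3]),
                int(fields[4]), fields[5],
                float(fields[6]), float(fields[7]), float(fields[8]),
            ))
        except ValueError:
            continue  # non-numeric value where a number was expected
    return rows
```

Lines that do not match the assumed layout are skipped silently, so comparing the number of parsed rows with the chain length (230 for 2eigA) is a reasonable first sanity check.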
