1yz4
Evolutionary trace report by report maker
September 26, 2008

CONTENTS

1 Introduction
2 Chain 1yz4A
  2.1 Q6PGN7 overview
  2.2 Multiple sequence alignment for 1yz4A
  2.3 Residue ranking in 1yz4A
  2.4 Top ranking residues in 1yz4A and their position on the structure
    2.4.1 Clustering of residues at 25% coverage.
    2.4.2 Overlap with known functional surfaces at 25% coverage.
    2.4.3 Possible novel functional surfaces at 25% coverage.
3 Notes on using trace results
  3.1 Coverage
  3.2 Known substitutions
  3.3 Surface
  3.4 Number of contacts
  3.5 Annotation
  3.6 Mutation suggestions
4 Appendix
  4.1 File formats
  4.2 Color schemes used
  4.3 Credits
    4.3.1 Alistat
    4.3.2 CE
    4.3.3 DSSP
    4.3.4 HSSP
    4.3.5 LaTeX
    4.3.6 Muscle
    4.3.7 Pymol
  4.4 Note about ET Viewer
  4.5 Citing this work
  4.6 About report maker
  4.7 Attachments

1 INTRODUCTION
From the original Protein Data Bank entry (PDB id 1yz4):
Title: Crystal structure of dusp15
Compound: Mol id: 1; molecule: dual specificity phosphatase-like 15 isoform a; chain: a, b; fragment: catalytic domain; synonym: dusp15; ec: 3.1.3.48; engineered: yes; mutation: yes
Organism, scientific name: Homo sapiens
1yz4 contains a single unique chain 1yz4A (159 residues long) and its homologue 1yz4B.

2 CHAIN 1YZ4A

2.1 Q6PGN7 overview
From SwissProt, id Q6PGN7, 99% identical to 1yz4A:
Description: Dual specificity phosphatase 15, isoform a.
Organism, scientific name: Homo sapiens.
Taxonomy: Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Catarrhini; Hominidae; Homo.

2.2 Multiple sequence alignment for 1yz4A
For the chain 1yz4A, the alignment 1yz4A.msf (attached) with 58 sequences was used. The alignment was assembled through a combination of BLAST searching on the UniProt database and alignment using the Muscle program. It can be found in the attachment to this report, under the name of 1yz4A.msf. Its statistics, from the alistat program, are the following:

Lichtarge lab 2006

Fig. 1. Residues -2-156 in 1yz4A colored by their relative importance. (See Appendix, Fig. 9, for the coloring scheme.)

Format: MSF
Number of sequences: 58
Total number of residues: 8870
Smallest: 121
Largest: 159
Average length: 152.9
Alignment length: 159
Average identity: 35%
Most related pair: 98%
Most unrelated pair: 21%
Most distant seq: 35%

Fig. 2. Residues in 1yz4A, colored by their relative importance. Clockwise: front, back, top and bottom views.
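The identity figures above follow the definitions given in the Credits entry for alistat (Section 4.3.1): a pairwise identity is idents / MIN(len1, len2) computed over unaligned lengths, the average, most related and most unrelated values are the mean, maximum and minimum over all N(N-1)/2 pairs, and the most distant sequence is the minimum of each sequence's best-relative identity. The Python sketch below illustrates those definitions on a made-up three-sequence alignment. The function names, the set of gap characters and the toy sequences are assumptions for illustration, not alistat's actual code.

```python
from itertools import combinations

GAP_CHARS = set("-.~")  # assumption: characters treated as alignment gaps


def percent_identity(a, b):
    """Percent identity of two aligned rows: idents / MIN(len1, len2)."""
    idents = sum(1 for x, y in zip(a, b) if x == y and x not in GAP_CHARS)
    len1 = sum(1 for x in a if x not in GAP_CHARS)  # unaligned length of a
    len2 = sum(1 for y in b if y not in GAP_CHARS)  # unaligned length of b
    return 100.0 * idents / min(len1, len2)


def alignment_stats(rows):
    """Summary identities over all N(N-1)/2 pairs of aligned rows."""
    n = len(rows)
    ident = {(i, j): percent_identity(rows[i], rows[j])
             for i, j in combinations(range(n), 2)}
    values = list(ident.values())
    # Best relative of each sequence: its highest identity to any other one.
    best_relative = [
        max(ident[(min(i, j), max(i, j))] for j in range(n) if j != i)
        for i in range(n)
    ]
    return {
        "average": sum(values) / len(values),
        "most_related": max(values),
        "most_unrelated": min(values),
        "most_distant": min(best_relative),
    }


# Hypothetical toy alignment, not taken from 1yz4A.msf.
toy_rows = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDE--HIRV"]
toy_stats = alignment_stats(toy_rows)
print(toy_stats)
```

Because the denominator is the shorter unaligned length, a short fragment that matches wherever it aligns can score a high identity, which is worth keeping in mind when reading the "most related pair" value.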

Furthermore, 3% of residues show as conserved in this alignment.
The alignment consists of 96% eukaryotic (62% vertebrata, 12% arthropoda, 1% fungi, 3% plantae), and 1% viral sequences. (Descriptions of some sequences were not readily available.) The file containing the sequence descriptions can be found in the attachment, under the name 1yz4A.descr.

2.3 Residue ranking in 1yz4A
The 1yz4A sequence is shown in Fig. 1, with each residue colored according to its estimated importance. The full listing of residues in 1yz4A can be found in the file called 1yz4A.ranks_sorted in the attachment.

2.4 Top ranking residues in 1yz4A and their position on the structure
In the following we consider residues ranking among top 25% of residues in the protein. Figure 2 shows residues in 1yz4A colored by their importance: bright red and yellow indicate more conserved/important residues (see Appendix for the coloring scheme). A Pymol script for producing this figure can be found in the attachment.

2.4.1 Clustering of residues at 25% coverage.
Fig. 3 shows the top 25% of all residues, this time colored according to clusters they belong to. The clusters in Fig. 3 are composed of the residues listed in Table 1.

Fig. 3. Residues in 1yz4A, colored according to the cluster they belong to: red, followed by blue and yellow are the largest clusters (see Appendix for the coloring scheme). Clockwise: front, back, top and bottom views. The corresponding Pymol script is attached.

Table 1.
cluster color   size   member residues
red             39     8,12,13,14,15,20,26,31,33,34,
                       36,37,50,57,67,73,74,85,86,
                       87,88,90,91,93,94,95,97,101,
                       102,104,114,118,122,127,128,
                       130,131,134,135

Table 1. Clusters of top ranking residues in 1yz4A.
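Section 3.3 describes a cluster as a set of residues that have a neighbor within 5 Å of any of their heavy atoms. One way to read that criterion is as connected components of a residue graph in which two residues are linked when any pair of their heavy atoms lies closer than the cutoff. The Python sketch below follows that reading. The function name, the input layout and the coordinates are hypothetical, and this is not necessarily how report maker builds the clusters in Table 1.

```python
from math import dist


def cluster_residues(heavy_atoms, cutoff=5.0):
    """Group residues into clusters linked by heavy-atom proximity.

    heavy_atoms maps a residue number to a list of (x, y, z) coordinates.
    Returns clusters as sorted lists, largest cluster first.
    """
    def are_neighbors(a, b):
        return any(dist(p, q) < cutoff
                   for p in heavy_atoms[a] for q in heavy_atoms[b])

    unvisited = set(heavy_atoms)
    clusters = []
    while unvisited:
        seed = min(unvisited)
        unvisited.remove(seed)
        cluster, stack = [seed], [seed]
        while stack:  # depth-first walk over the neighbor graph
            current = stack.pop()
            linked = [r for r in unvisited if are_neighbors(current, r)]
            for r in linked:
                unvisited.remove(r)
                cluster.append(r)
                stack.append(r)
        clusters.append(sorted(cluster))
    clusters.sort(key=lambda c: (-len(c), c[0]))
    return clusters


# Hypothetical coordinates in Angstroms, not taken from the 1yz4 structure.
toy_atoms = {
    8: [(0.0, 0.0, 0.0)],
    12: [(3.0, 0.0, 0.0)],
    13: [(6.0, 0.0, 0.0), (7.0, 1.0, 0.0)],
    94: [(20.0, 0.0, 0.0)],
    95: [(23.0, 0.0, 0.0)],
    62: [(50.0, 50.0, 50.0)],
}
toy_clusters = cluster_residues(toy_atoms)
print(toy_clusters)
```

Note that residues 8 and 13 end up in the same cluster even though they are more than 5 Å apart, because residue 12 links them; membership is transitive under this reading.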

2.4.2 Overlap with known functional surfaces at 25% coverage.
The name of the ligand is composed of the source PDB identifier and the heteroatom name used in that file.

B-octylglucoside binding site. Table 2 lists the top 25% of residues at the interface with 1yz4BOG210 (b-octylglucoside). The following table (Table 3) suggests possible disruptive replacements for these residues (see Section 3.6).

Table 2.
res   type   subst's(%)               cvg    noc/bb   dist(Å)
94    R      R(100)                   0.03   10/0     3.06
90    A      A(74)V(6)F(3)M(13)C(1)   0.17   3/2      4.17

Table 2. The top 25% of residues in 1yz4A at the interface with b-octylglucoside. (Field names: res: residue number in the PDB entry; type: amino acid type; subst's: substitutions seen in the alignment, with the percentage of each type in the bracket; noc/bb: number of contacts with the ligand, with the number of contacts realized through backbone atoms given after the slash; dist: distance of closest approach to the ligand.)

Table 3.
res   type   disruptive mutations
94    R      (TD)(SYEVCLAPIG)(FMW)(N)
90    A      (KER)(Y)(D)(Q)

Table 3. List of disruptive mutations for the top 25% of residues in 1yz4A that are at the interface with b-octylglucoside.

Fig. 4. Residues in 1yz4A, at the interface with b-octylglucoside, colored by their relative importance. The ligand (b-octylglucoside) is colored green. Atoms further than 30 Å away from the geometric center of the ligand, as well as on the line of sight to the ligand, were removed. (See Appendix for the coloring scheme for the protein chain 1yz4A.)

Figure 4 shows residues in 1yz4A colored by their importance, at the interface with 1yz4BOG210.

Sulfate ion binding site. Table 4 lists the top 25% of residues at the interface with 1yz4SO4101 (sulfate ion). The following table (Table 5) suggests possible disruptive replacements for these residues (see Section 3.6).

Table 4.
res   type   subst's(%)               cvg    noc/bb   dist(Å)
57    D      D(100)                   0.03   12/0     3.79
91    G      G(100)                   0.03   11/11    2.79
93    S      S(100)                   0.03   17/12    3.01
94    R      R(100)                   0.03   30/8     2.66
95    S      S(96)A(3)                0.04   3/2      4.52
88    S      C(93)S(1)G(5)            0.07   22/12    2.50
90    A      A(74)V(6)F(3)M(13)C(1)   0.17   18/14    2.84

Table 4. The top 25% of residues in 1yz4A at the interface with sulfate ion. (Field names as in Table 2.)

Table 5.
res   type   disruptive mutations
57    D      (R)(FWH)(KYVCAG)(TQM)
91    G      (KER)(FQMWHD)(NYLPI)(SVA)
93    S      (KR)(FQMWH)(NYELPI)(D)
94    R      (TD)(SYEVCLAPIG)(FMW)(N)
95    S      (KR)(QH)(FYEMW)(N)
88    S      (KR)(FMWH)(Q)(E)
90    A      (KER)(Y)(D)(Q)

Table 5. List of disruptive mutations for the top 25% of residues in 1yz4A that are at the interface with sulfate ion.

Fig. 5. Residues in 1yz4A, at the interface with sulfate ion, colored by their relative importance. The ligand (sulfate ion) is colored green. Atoms further than 30 Å away from the geometric center of the ligand, as well as on the line of sight to the ligand, were removed. (See Appendix for the coloring scheme for the protein chain 1yz4A.)

Figure 5 shows residues in 1yz4A colored by their importance, at the interface with 1yz4SO4101.

Interface with 1yz4B. Table 6 lists the top 25% of residues at the interface with 1yz4B. The following table (Table 7) suggests possible disruptive replacements for these residues (see Section 3.6).

Table 6.
res   type   subst's(%)                       cvg    noc/bb   dist(Å)
73    F      F(89)Y(5)L(1)K(1)Q(1)            0.15   1/0      4.49
50    Y      Y(81)T(5)L(3)V(1)Q(1).(5)H(1)    0.20   9/9      3.75

Table 6. The top 25% of residues in 1yz4A at the interface with 1yz4B. (Field names as in Table 2.)

Table 7.
res   type   disruptive mutations
73    F      (E)(T)(K)(D)
50    Y      (K)(Q)(M)(ER)

Table 7. List of disruptive mutations for the top 25% of residues in 1yz4A that are at the interface with 1yz4B.

Fig. 6. Residues in 1yz4A, at the interface with 1yz4B, colored by their relative importance. 1yz4B is shown in backbone representation. (See Appendix for the coloring scheme for the protein chain 1yz4A.)

Figure 6 shows residues in 1yz4A colored by their importance, at the interface with 1yz4B.

Sulfate ion binding site. By analogy with 1yz4B – 1yz4SO4104 interface. Table 8 lists the top 25% of residues at the interface with 1yz4SO4104 (sulfate ion). The following table (Table 9) suggests possible disruptive replacements for these residues (see Section 3.6).

Table 8.
res   type   subst's(%)                       cvg    noc/bb   dist(Å)
94    R      R(100)                           0.03   11/0     2.97
37    I      V(62)I(18)A(10)L(3)C(3)M(1)      0.21   3/3      4.55

Table 8. The top 25% of residues in 1yz4A at the interface with sulfate ion. (Field names as in Table 2.)

Table 9.
res   type   disruptive mutations
94    R      (TD)(SYEVCLAPIG)(FMW)(N)
37    I      (R)(Y)(H)(TKE)

Table 9. List of disruptive mutations for the top 25% of residues in 1yz4A that are at the interface with sulfate ion.

Fig. 7. Residues in 1yz4A, at the interface with sulfate ion, colored by their relative importance. The ligand (sulfate ion) is colored green. Atoms further than 30 Å away from the geometric center of the ligand, as well as on the line of sight to the ligand, were removed. (See Appendix for the coloring scheme for the protein chain 1yz4A.)

Figure 7 shows residues in 1yz4A colored by their importance, at the interface with 1yz4SO4104.

2.4.3 Possible novel functional surfaces at 25% coverage.
One group of residues is conserved on the 1yz4A surface, away from (or substantially larger than) other functional sites and interfaces recognizable in PDB entry 1yz4. It is shown in Fig. 8. The right panel shows (in blue) the rest of the larger cluster this surface belongs to.

Fig. 8. A possible active surface on the chain 1yz4A. The larger cluster it belongs to is shown in blue.

The residues belonging to this surface "patch" are listed in Table 10, while Table 11 suggests possible disruptive replacements for these residues (see Section 3.6).

Table 10.
res   type   substitutions(%)               cvg
57    D      D(100)                         0.03
87    H      H(100)                         0.03
93    S      S(100)                         0.03
94    R      R(100)                         0.03
128   N      N(98).(1)                      0.05
122   R      R(91)K(8)                      0.06
88    S      C(93)S(1)G(5)                  0.07
134   Q      Q(94)E(1)L(1).(1)              0.10
31    I      I(82)V(15)F(1)                 0.13
36    S      S(27)N(48)T(15)C(8)            0.14
73    F      F(89)Y(5)L(1)K(1)Q(1)          0.15
67    F      F(82)L(8)W(8)                  0.16
90    A      A(74)V(6)F(3)M(13)C(1)         0.17
8     V      V(29)I(70)                     0.18
114   V      C(5)A(82)V(5)S(5)L(1)          0.18
26    L      L(86).(1)Y(3)M(5)F(1)I(1)      0.19
20    A      P(1)A(81)G(1)V(5)L(1)S(8)      0.20
50    Y      Y(81)T(5)L(3)                  0.20
continued

Table 10. continued
res   type   substitutions(%)                   cvg
(50   Y, continued)  V(1)Q(1).(5)H(1)
118   I      V(72)I(15)T(3)L(5)C(1)M(1)         0.23
130   G      G(65)S(13)A(1)N(15)H(1).(1)        0.23
13    Y      Y(75)F(22)W(1)                     0.24
33    H      H(56)A(18)K(1)S(1)Y(18)M(1)        0.24
62    P      N(43)D(46)P(5)S(1)R(1)Y(1)         0.25

Table 10. Residues forming surface "patch" in 1yz4A.

Table 11.
res   type   disruptive mutations
57    D      (R)(FWH)(KYVCAG)(TQM)
87    H      (E)(TQMD)(SNKVCLAPIG)(YR)
93    S      (KR)(FQMWH)(NYELPI)(D)
94    R      (TD)(SYEVCLAPIG)(FMW)(N)
128   N      (Y)(FTWH)(SVCAG)(ER)
122   R      (T)(YD)(SVCAG)(FELWPI)
88    S      (KR)(FMWH)(Q)(E)
134   Q      (Y)(H)(FTW)(CG)
31    I      (R)(Y)(T)(KE)
36    S      (R)(K)(FWH)(M)
73    F      (E)(T)(K)(D)
67    F      (KE)(T)(QD)(R)
90    A      (KER)(Y)(D)(Q)
8     V      (YR)(KE)(H)(QD)
114   V      (R)(K)(YE)(H)
26    L      (R)(Y)(T)(H)
20    A      (R)(KY)(E)(H)
50    Y      (K)(Q)(M)(ER)
118   I      (R)(Y)(H)(K)
130   G      (E)(KR)(M)(D)
13    Y      (K)(Q)(E)(M)
33    H      (E)(D)(Q)(T)
62    P      (R)(Y)(H)(T)

Table 11. Disruptive mutations for the surface patch in 1yz4A.

3 NOTES ON USING TRACE RESULTS

3.1 Coverage
Trace results are commonly expressed in terms of coverage: the residue is important if its "coverage" is small, that is, if it belongs to some small top percentage of residues [100% is all of the residues in a chain], according to trace. The ET results are presented in the form of a table, usually limited to the top 25% of residues (or to some nearby percentage), sorted by the strength of the presumed evolutionary pressure. (I.e., the smaller the coverage, the stronger the pressure on the residue.) Starting from the top of that list, mutating a couple of residues should affect the protein somehow, with the exact effects to be determined experimentally.

3.2 Known substitutions
One of the table columns is "substitutions": other amino acid types seen at the same position in the alignment. These amino acid types may be interchangeable at that position in the protein, so if one wants to affect the protein by a point mutation, they should be avoided. For example, if the substitutions are "RVK" and the original protein has an R at that position, it is advisable to try anything but RVK. Conversely, when looking for substitutions which will not affect the protein, one may try replacing R with K, or (perhaps more surprisingly) with V. The percentage of times the substitution appears in the alignment is given in the immediately following bracket. No percentage is given in the cases when it is smaller than 1%. This is meant to be a rough guide: due to rounding errors these percentages often do not add up to 100%.

3.3 Surface
To detect candidates for novel functional interfaces, first we look for residues that are solvent accessible (according to the DSSP program) by at least 10 Å², which is roughly the area needed for one water molecule to come in contact with the residue. Furthermore, we require that these residues form a "cluster" of residues which have a neighbor within 5 Å of any of their heavy atoms.
Note, however, that, if our picture of protein evolution is correct, the neighboring residues which are not surface accessible might be equally important in maintaining the interaction specificity. They should not be automatically dropped from consideration when choosing the set for mutagenesis. (Especially if they form a cluster with the surface residues.)

3.4 Number of contacts
Another column worth noting is denoted "noc/bb"; it tells the number of contacts heavy atoms of the residue in question make across the interface, as well as how many of them are realized through the backbone atoms (if all or most contacts are through the backbone, mutation presumably won't have strong impact). Two heavy atoms are considered to be "in contact" if their centers are closer than 5 Å.

3.5 Annotation
If the residue annotation is available (either from the pdb file or from other sources), another column, with the header "annotation", appears. Annotations carried over from PDB are the following: site (indicating existence of related site record in PDB), S-S (disulfide bond forming residue), hb (hydrogen bond forming residue), jb (james bond forming residue), and sb (for salt bridge forming residue).

3.6 Mutation suggestions
Mutation suggestions are completely heuristic and based on complementarity with the substitutions found in the alignment. Note that they are meant to be disruptive to the interaction of the protein with its ligand. The attempt is made to complement the following properties: small [AVGSTC], medium [LPNQDEMIK], large [WFYHR], hydrophobic [LPVAMWFI], polar [GTCY]; positively [KHR], or negatively [DE] charged, aromatic [WFYH], long aliphatic chain [EKRQM], OH-group possession [SDETY], and NH2 group possession [NQRK]. The suggestions are listed according to how different they appear to be from the original amino acid, and they are grouped in round brackets if they appear equally disruptive. From left to right, each bracketed group of amino acid types resembles more strongly the original (i.e. is, presumably, less disruptive). These suggestions are tentative: they might prove disruptive to the fold rather than to the interaction. Many researchers will choose, however, the straightforward alanine mutations, especially in the beginning stages of their investigation.

4 APPENDIX

4.1 File formats
Files with extension "ranks_sorted" are the actual trace results. The fields in the table in this file:
• alignment# number of the position in the alignment
• residue# residue number in the PDB file
• type amino acid type
• rank rank of the position according to older version of ET
• variability has two subfields:
  1. number of different amino acids appearing in this column of the alignment
  2. their type
• rho ET score - the smaller this value, the lesser variability of this position across the branches of the tree (and, presumably, the greater the importance for the protein)
• cvg coverage - percentage of the residues on the structure which have this rho or smaller
• gaps percentage of gaps in this column

4.2 Color schemes used
The following color scheme is used in figures with residues colored by cluster size: black is a single-residue cluster; clusters composed of more than one residue are colored according to this hierarchy (ordered by descending size): red, blue, yellow, green, purple, azure, turquoise, brown, coral, magenta, LightSalmon, SkyBlue, violet, gold, bisque, LightSlateBlue, orchid, RosyBrown, MediumAquamarine, DarkOliveGreen, CornflowerBlue, grey55, burlywood, LimeGreen, tan, DarkOrange, DeepPink, maroon, BlanchedAlmond.
The colors used to distinguish the residues by the estimated evolutionary pressure they experience can be seen in Fig. 9.

[Figure 9 color bar: COVERAGE scale marked 100%, 50%, 30%, 5%; RELATIVE IMPORTANCE axis]
Fig. 9. Coloring scheme used to color residues by their relative importance.

4.3 Credits

4.3.1 Alistat
alistat reads a multiple sequence alignment from the file and shows a number of simple statistics about it. These statistics include the format, the number of sequences, the total number of residues, the average and range of the sequence lengths, and the alignment length (e.g. including gap characters). Also shown are some percent identities. A percent pairwise alignment identity is defined as (idents / MIN(len1, len2)) where idents is the number of exact identities and len1, len2 are the unaligned lengths of the two sequences. The "average percent identity", "most related pair", and "most unrelated pair" of the alignment are the average, maximum, and minimum of all (N)(N-1)/2 pairs, respectively. The "most distant seq" is calculated by finding the maximum pairwise identity (best relative) for all N sequences, then finding the minimum of these N numbers (hence, the most outlying sequence). alistat is copyrighted by HHMI/Washington University School of Medicine, 1992-2001, and freely distributed under the GNU General Public License.

4.3.2 CE
To map ligand binding sites from different source structures, report maker uses the CE program: http://cl.sdsc.edu/. Shindyalov IN, Bourne PE (1998) "Protein structure alignment by incremental combinatorial extension (CE) of the optimal path". Protein Engineering 11(9) 739-747.

4.3.3 DSSP
In this work a residue is considered solvent accessible if the DSSP program finds it exposed to water by at least 10 Å², which is roughly the area needed for one water molecule to come in contact with the residue. DSSP is copyrighted by W. Kabsch, C. Sander and MPI-MF, 1983, 1985, 1988, 1994, 1995, CMBI version by [email protected] November 18, 2002, http://www.cmbi.kun.nl/gv/dssp/descrip.html.

4.3.4 HSSP
Whenever available, report maker uses HSSP alignment as a starting point for the analysis (sequences shorter than 75% of the query are taken out, however); R. Schneider, A. de Daruvar, and C. Sander. "The HSSP database of protein structure-sequence alignments." Nucleic Acids Res., 25:226–230, 1997. http://swift.cmbi.kun.nl/swift/hssp/

4.3.5 LaTeX
The text for this report was processed using LaTeX; Leslie Lamport, "LaTeX: A Document Preparation System", Addison-Wesley, Reading, Mass. (1986).

4.3.6 Muscle
When making alignments "from scratch", report maker uses the Muscle alignment program: Edgar, Robert C. (2004), "MUSCLE: multiple sequence alignment with high accuracy and high throughput." Nucleic Acids Research 32(5), 1792-97. http://www.drive5.com/muscle/

4.3.7 Pymol
The figures in this report were produced using Pymol. The scripts can be found in the attachment. Pymol is an open-source application copyrighted by DeLano Scientific LLC (2005). For more information about Pymol see http://pymol.sourceforge.net/. (Note for Windows users: the attached package needs to be unzipped for Pymol to read the scripts and launch the viewer.)

4.4 Note about ET Viewer
Dan Morgan from the Lichtarge lab has developed a visualization tool specifically for viewing trace results. If you are interested, please visit:
http://mammoth.bcm.tmc.edu/traceview/
The viewer is self-unpacking and self-installing. Input files to be used with ETV (extension .etvx) can be found in the attachment to the main report.

4.5 Citing this work
The method used to rank residues and make predictions in this report can be found in Mihalek, I., I. Reš, O. Lichtarge. (2004). "A Family of Evolution-Entropy Hybrid Methods for Ranking of Protein Residues by Importance" J. Mol. Bio. 336: 1265-82. For the original version of ET see O. Lichtarge, H. Bourne and F. Cohen (1996). "An Evolutionary Trace Method Defines Binding Surfaces Common to Protein Families" J. Mol. Bio. 257: 342-358.
report maker itself is described in Mihalek I., I. Reš and O. Lichtarge (2006). "Evolutionary Trace Report Maker: a new type of service for comparative analysis of proteins." Bioinformatics 22:1656-7.

4.6 About report maker
report maker was written in 2006 by Ivana Mihalek. The 1D ranking visualization program was written by Ivica Reš. report maker is copyrighted by Lichtarge Lab, Baylor College of Medicine, Houston.

4.7 Attachments
The following files should accompany this report:
• 1yz4A.complex.pdb - coordinates of 1yz4A with all of its interacting partners
• 1yz4A.etvx - ET viewer input file for 1yz4A
• 1yz4A.cluster_report.summary - Cluster report summary for 1yz4A
• 1yz4A.ranks - Ranks file in sequence order for 1yz4A
• 1yz4A.clusters - Cluster descriptions for 1yz4A
• 1yz4A.msf - the multiple sequence alignment used for the chain 1yz4A
• 1yz4A.descr - description of sequences used in 1yz4A msf
• 1yz4A.ranks_sorted - full listing of residues and their ranking for 1yz4A
• 1yz4A.1yz4BOG210.if.pml - Pymol script for Figure 4
• 1yz4A.cbcvg - used by other 1yz4A – related pymol scripts
• 1yz4A.1yz4SO4101.if.pml - Pymol script for Figure 5
• 1yz4A.1yz4B.if.pml - Pymol script for Figure 6
• 1yz4A.1yz4SO4104.if.pml - Pymol script for Figure 7
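
As a closing illustration of the "cvg" field described in Section 4.1 (the percentage of residues on the structure which have this rho or smaller) and of the top-25% cutoff used throughout Section 2.4, the Python sketch below derives coverage from a set of rho scores and selects the residues at or below a coverage threshold. The rho values, residue numbers and function names are hypothetical, and coverage is expressed as a fraction, as in the cvg columns of the tables. This is a reading of the stated definition, not report maker's own code.

```python
def coverage_from_rho(rho_by_residue):
    """Coverage of each residue: fraction of residues with this rho or smaller."""
    total = len(rho_by_residue)
    all_rho = list(rho_by_residue.values())
    return {
        residue: sum(1 for other in all_rho if other <= rho) / total
        for residue, rho in rho_by_residue.items()
    }


def top_residues(cvg_by_residue, cutoff=0.25):
    """Residues whose coverage is at or below the cutoff, most important first."""
    selected = [r for r, cvg in cvg_by_residue.items() if cvg <= cutoff]
    return sorted(selected, key=lambda r: (cvg_by_residue[r], r))


# Hypothetical rho scores for eight residues (smaller rho = less variable).
toy_rho = {94: 1.0, 57: 1.0, 88: 1.4, 90: 2.1, 50: 2.3, 62: 2.9, 100: 3.5, 101: 4.0}
toy_cvg = coverage_from_rho(toy_rho)
print(top_residues(toy_cvg))
```

Residues that tie on rho share the same coverage under this definition, which is consistent with several fully conserved positions carrying an identical cvg value in the tables.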

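
The "noc/bb" column explained in Section 3.4 can be illustrated in the same spirit: two heavy atoms count as in contact when their centers are closer than 5 Å, and the second number counts the contacts made through backbone atoms. The Python sketch below assumes that each contact is one residue-atom/partner-atom pair and that the backbone atoms are N, CA, C and O; both are assumptions, as are the function name and the coordinates.

```python
from math import dist

BACKBONE_ATOMS = {"N", "CA", "C", "O"}  # assumption: standard backbone atom names


def count_contacts(residue_atoms, partner_atoms, cutoff=5.0):
    """Return (noc, bb) for one residue against an interface partner.

    residue_atoms: list of (atom_name, (x, y, z)) heavy atoms of the residue.
    partner_atoms: list of (x, y, z) heavy atoms of the ligand or other chain.
    """
    noc = 0  # all heavy-atom pairs closer than the cutoff
    bb = 0   # the subset of those pairs that involve a backbone atom
    for name, xyz in residue_atoms:
        for other in partner_atoms:
            if dist(xyz, other) < cutoff:
                noc += 1
                if name in BACKBONE_ATOMS:
                    bb += 1
    return noc, bb


# Hypothetical serine-like residue and two ligand atoms, in Angstroms.
toy_residue = [
    ("N", (0.0, 0.0, 0.0)),
    ("CA", (1.5, 0.0, 0.0)),
    ("CB", (2.5, 1.0, 0.0)),
    ("OG", (3.5, 1.5, 0.0)),
]
toy_ligand = [(6.0, 1.0, 0.0), (7.6, 1.0, 0.0)]
toy_noc, toy_bb = count_contacts(toy_residue, toy_ligand)
print(f"{toy_noc}/{toy_bb}")
```

A residue whose contacts are mostly backbone-mediated would show two similar numbers, which Section 3.4 treats as a hint that a side-chain mutation may have little effect on the interface.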