Protein Folding, De Novo Protein Design, and Peptide Identification in Proteomics

Christodoulos A. Floudas
Princeton University
Department of Chemical Engineering
Program of Applied and Computational
Department of Operations Research and Financial Engineering
Center for Quantitative Biology

REVOLUTION OF GENOMICS

[Figure labels: DNA (genome), RNA (transcriptome), Protein (proteome); Nucleus; Cell; Protein - Protein, Nucleic Acid - Protein, and Protein - Ligand interactions; Feedback; Design of Drugs and Inhibitors; Metabolic Networks; Phenotype.]

Outline

From Sequence to Structure
• Structure Prediction in Protein Folding: ASTRO-FOLD
• Helix Prediction
• Beta Sheet & Disulfide Bridges
• 3-D Structure Prediction

From Structure to Function
• De Novo Peptide Design
• Sequence Selection
• Fold Specificity
• Design of Inhibitors for Complement 3

Peptide and Protein Identification via Tandem Mass Spectrometry

Structure Prediction In Protein Folding: Outline

• Introduction to Protein Structure Prediction
• Free Energy Calculations in Oligo-peptides
• Prediction of Helical Segments
• Prediction of Beta Sheet
• Prediction of Loop Structures
• Derivation of Restraints
• Prediction of Protein Tertiary Structure

Structure Prediction In Protein Folding

Review Articles
• Klepeis J.L., H.D. Schafroth, K.M. Westerberg, and C.A. Floudas, "Deterministic Global Optimization and Ab Initio Approaches for the Structure Prediction of Polypeptides, Dynamics of Protein Folding and Protein-Protein Interactions", Advances in Chemical Physics, 120, 265-457 (2002).
• Floudas C.A., "Research Challenges, Opportunities and Synergism in Systems Engineering and Computational Biology", AIChE Journal, 51, 1872-1884 (2005).
• Floudas C.A., H.K. Fung, S.R. McAllister, M. Monnigmann, and R. Rajgaria, "Advances in Protein Structure Prediction and De Novo Protein Design: A Review", Chemical Engineering Science, 61, 966-988 (2006).
• Floudas C.A., "Computational Methods in Protein Structure Prediction", Biotechnology and Bioengineering, 97, 207-213 (2007).

Protein Primary Structure
• Made up primarily of amino acids

[Figure: amino acid structure with the amino group (H2N), the α carbon carrying the side chain R, and the carboxyl group (COOH) labeled.]

• This “alphabet” is often represented by a one-letter abbreviation

PDB: 1q4sA
MHRTSNGSHATGGNLPDVASHYPVAYEQTLDGTVGFVIDEMTPERATASVEVTDTLRQRWGLVHGGAYCALAEMLA
TEATVAVVHEKGMMAVGQSNHTSFFRPVKEGHVRAEAVRIHAGSTTWFWDVSLRDDAGRLCAVSSMSIAVRPRRD

Proteins: Sequence of Amino Acids
• 20 letter alphabet
• 3M known sequences
• 200 amino acids/domain
• 25K elucidated structures

Protein Secondary Structure

• Local structural motifs defined by hydrogen bonding patterns: α-helix, β-sheet

Protein Dihedral Angles
• Fixed bond length
• Fixed bond angles
• Dihedral angles as variable representation

Image from Creighton, 1993

Why Protein Folding ?
• Human Genome Project (Nature, Vol 409, 15 Feb 2001)
• 3 x 10^9 base pairs
• ~ 3.1 x 10^4 genes
• 10^4 functional proteins

• Computational Structural Biology
• Protein interactions; drug design
• Role of mutations

Protein Folding Challenges

Structure Prediction
Can we predict the 3-D structure from only the 1-D amino acid sequence ?

Dynamics
Can we elucidate the mechanism of the folding process ?

How does the sequence of amino acids physically fold into the 3-D structure ?

Protein Structure Prediction

Amino acid sequence [PDB: 1q4sA]
MHRTSNGSHATGGNLPDVASHYPVAYEQTLDGTVGFVIDEMTPERATASVEVTDTLRQRWGLVHGGAYCALAEMLA
TEATVAVVHEKGMMAVGQSNHTSFFRPVKEGHVRAEAVRIHAGSTTWFWDVSLRDDAGRLCAVSSMSIAVRPRRD

[Figure: the same sequence annotated first with helical structure, then with beta strand and sheet structure, followed by the 3D protein structure.]

Protein Folding: Advances

• Modeling / Comparative Modeling
  – The probe and template sequences are evolutionarily related
  – Honig et al.; Sali et al.; Fischer et al.; Rost et al.
• Fold Recognition / Threading
  – For the query sequence, determine closest matching structure from a library of known folds by scoring function
  – Skolnick et al.; Jones et al.; Bryant et al.; Xu et al.; Elber et al.; Baker et al.; Rychlewski & Ginalski; Honig et al.

Protein Folding: Advances

• First Principles with Database Information
  – Secondary and/or tertiary information from databases/statistical methods
  – Levitt et al.; Baker et al.; Skolnick, Kolinski et al.; Friesner et al.
• First Principles without Database Information
  – Physicochemical models with most general application
  – Scheraga et al.; et al.; Floudas et al.

ASTRO-FOLD

Klepeis, J.L. and C.A. Floudas, Biophys. J. 85, 2119 (2003).

Free Energy Calculations in Oligopeptide Folding

Relevant References:
• Maranas C.D., I.P. Androulakis and C.A. Floudas, "A Deterministic Global Optimization Approach for the Protein Folding Problem", DIMACS Series in Discrete Mathematics and Theoretical Computer Science, pp. 133-150 (1995).
• Androulakis I.P., C.D. Maranas and C.A. Floudas, "Prediction of Oligopeptide Conformations via Deterministic Global Optimization", Journal of Global Optimization, 11, pp. 1-34, June (1997).
• Klepeis J.L. and C.A. Floudas, "A Comparative Study of Global Minimum Energy Conformations of Hydrated Peptides", Journal of Computational Chemistry, 20, pp. 636-654 (1999).
• Westerberg K.M. and C.A. Floudas, "Locating All Transition States and Studying Reaction Pathways of Potential Energy Surfaces", Journal of Chemical Physics, 110, pp. 9259-9296 (1999).
• Klepeis J.L. and C.A. Floudas, "Free Energy Calculations for Peptides via Deterministic Global Optimization", Journal of Chemical Physics, 110, pp. 7491-7512 (1999).
• Westerberg K.M. and C.A. Floudas, "Dynamics of Peptide Folding: Transition States and Reaction Pathways of Solvated and Unsolvated Tetra-Alanine", Journal of Global Optimization, 15, 261-297 (1999).

Prediction of Helical Segments from First Principles

Relevant References:
• Klepeis J.L. and C.A. Floudas, "Ab Initio Prediction of Helical Segments in Polypeptides", Journal of Computational Chemistry, 23, 245-266 (2002).
• Subramani A. and C.A. Floudas, "A Novel Approach for the Prediction of Helices and Beta Strands", in preparation (2008).

ASTRO-FOLD (Klepeis & Floudas, 2002c)

Understanding Helix Formation

Physical Characteristics of Helices
• Well defined backbones and hydrogen bonding patterns
• Different types of helices
  • α : hydrogen bonding every fourth residue (3.6 residues per turn)
  • 3_10 : backbone turn every three residues
Physical Understanding of Protein Folding
• Two competing explanations
  • Local forces : hierarchical folding
  • Non-local forces : hydrophobic collapse
Experimental Evidence for Helix Formation
• Helix formation proceeds rapidly
• Sequence sufficient to identify initiation / termination

Helix formation dominated by local forces

Helix Prediction : Key Ideas (Klepeis & Floudas 2002a)

Overlapping oligopeptides
Decompose polypeptide to identify local sites of helix formation and termination

Ensemble of low energy states
Calculate properties of proteins using data from many low energy states rather than a single state

Free energy calculations (Klepeis & Floudas 2000)
Model proteins using detailed energy calculations including entropic and solvation contributions

Deterministic global optimization (Floudas 2000)
Predict low energy states using powerful global optimization approaches such as αBB

Overlapping Oligopeptides

• Decompose polypeptide sequence into smaller oligopeptide sequences

• Pentapeptides • Heptapeptides • Nonapeptides

• Capture local interactions governing helix formation

• Combine free energy calculations to get prediction

Ensemble of Low Energy States (Klepeis & Floudas 1999)

Generate low energy states along with global minimum energy state

Mathematical formulation
• Nonconvex optimization problem
• Requires global optimization search

Free Energy Algorithm (Klepeis & Floudas 1999)

Free energy calculations require methods for finding low energy local minima

αBB based approaches capitalize on
• ability to identify domains of low energy
• information from lower bounding function L(x)
• initialize and locally minimize lower bounding functions multiple times in each domain
• unique lower bounding minima serve as initial points for minimizations of original E(x)

Search Techniques

αBB Deterministic Global Optimization (Floudas & coworkers 1994, 1995, 1996, 1998, 1999, 2000)
• based on branch-and-bound framework
• convergence through successive subdivision at each level in the branch-and-bound tree to generate non-increasing upper bounds (original problem) and non-decreasing lower bounds (convexified problem)
• guaranteed ε-convergence for C2 NLPs

Conformational Space Annealing (CSA) (Scheraga & coworkers 1997, 1998)
• stochastic prediction of global minimum energy state
• genetic algorithm updates produce low energy states
• anneal using deviation between energy states
• termination criterion is heuristic

Free Energy Calculations (Klepeis & Floudas 2002a)

F = Fvac + Fcavity + Fsolvation + Fionization

Atomistic level free energy calculations
• include both enthalpic and entropic contributions
• model potential energy using semi-empirical force field
• employ harmonic approximation for entropic effects
• include cavity formation energy
• calculate solvation / ionization energies from solution of Poisson-Boltzmann equation (Honig & coworkers 1988, 1993, 1995)
• calculate total free energies for ensemble of low energy states
• employ efficient search techniques via global optimization

Overall Free Energy

  Potential      Fvac          Scheraga & coworkers
- Entropic       TSvac
+ Cavity         Fcavity       Honig & coworkers 1988, 1993, 1995
+ Polarization   Fsolvation    Honig & coworkers 1988, 1993, 1995
+ Ionization     Fionization   Honig & coworkers 1988, 1993, 1995

Probability of Helix Formation (Klepeis & Floudas 2002a)
• Calculate probability of conformer i from free energy
• Cluster probabilities for helical (AAA) conformers
• Classify residues using probability of central peptides
• Probability calculation for residue j (pentapeptide)

Computational Study : 1R69
N-terminal domain of phage 434 repressor protein
• 69 residue fragment of N-terminal domain
• Dimer involved with operator sites in phage genome
• N-terminal domain binds to DNA
• Compact hydrophobic interior
• Five helices
• Extended first helix
• Helix 2 and 3 form helix-turn-helix motif

Computational Study : 1R69

Predicted   Exp
2-11        1-12
16-22       16-22
31-34       28-35
48-50       45-50
56-62       56-64

Experimental
• Helices from 1-12, 16-22, 28-35, 45-50, 56-64

Comparison with PSIPRED
• Helices from 2-12, 17-24, 28-35, 42-52, 56-59

Computational Study : 9WGA
Wheat germ agglutinin
• Dimeric plant lectin
• 171 residue chains with four domains each (A, B, C and D)
• Internal repeating domain of 43 residues
• Homology in all Cys and many Gly positions
• Four disulfide bridges per domain
• Tight fold
• No helical or beta structure

Computational Study : 9WGA

Experimental

• Helices from 1-12, 16-22, 28-35, 45-50, 56-64 Comparison with PSIPRED

• Helices from 2-12, 17-24, 28-35, 42-52, 56-59
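The helix-prediction steps used in this section (decompose the sequence into overlapping oligopeptides, weight each low-energy conformer by its free energy, and sum the weights of the helical conformers) can be sketched in a few lines. This is a minimal illustration only: it assumes a standard Boltzmann weighting of conformer free energies, which may differ from the exact expression in Klepeis & Floudas, and the function names and the example ensemble are hypothetical.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol K)

def overlapping_windows(sequence, size):
    """Return every contiguous window of `size` residues, in sequence order."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]

def boltzmann_probabilities(free_energies, temperature=298.15):
    """Turn conformer free energies (kcal/mol) into normalized probabilities.

    Assumption: plain Boltzmann weighting, p_i ~ exp(-F_i / RT).
    """
    beta = 1.0 / (R_KCAL * temperature)
    f_min = min(free_energies)  # shift by the minimum for numerical stability
    weights = [math.exp(-beta * (f - f_min)) for f in free_energies]
    total = sum(weights)
    return [w / total for w in weights]

def helix_probability(conformers, temperature=298.15):
    """Sum the probabilities of conformers whose central tripeptide is 'AAA'."""
    probs = boltzmann_probabilities([f for _, f in conformers], temperature)
    return sum(p for (label, _), p in zip(conformers, probs) if label == "AAA")

# Hypothetical ensemble for one pentapeptide: (central-state label, free energy)
ensemble = [("AAA", -10.0), ("AAA", -9.5), ("AEA", -9.0), ("EEE", -8.0)]
windows = overlapping_windows("MHRTSNGSHAT", 5)  # first 11 residues of 1q4sA
p_helix = helix_probability(ensemble)
```

An 11-residue stretch yields 7 pentapeptides and 5 heptapeptides, so the number of free energy optimizations shrinks as the oligopeptide length grows.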

Prediction of Beta Sheet Topologies via Integer Linear Optimization

Relevant References:

• Klepeis J.L. and C.A. Floudas, "Prediction of Beta-Sheet Topology and Disulfide Bridges in Polypeptides", Journal of Computational Chemistry, 24, 191-208 (2003).

ASTRO-FOLD (Klepeis & Floudas, 2002c)

Formation of β-Sheets

Major challenge for accurate structure prediction
• Prediction of β-strand location not accurate
• No reliable method for β-sheet topology
• Antiparallel β-sheets
• Parallel β-sheets
• How to treat formation of disulfide bridges
Physical Understanding
• Local forces not as dominant (as for helices)
• Non-local forces are significant
• Hydrophobic collapse
• Tertiary contacts
Experimental Evidence
• Hydrophobic collapse proceeds rapidly

Hydrophobic forces drive β-sheet formation

β-Sheet Prediction (Klepeis & Floudas, 2002b)

β-Strand Protocol
• Classify nonhelical and all cysteine residues
    Hydrophobic H : Leu, Ile, Val, Phe, Met, Cys, Tyr, Trp
    Bridge B : Ala, Thr
    Turn T : Asn, Asp, Gly, Pro, Ser
    Other N : Arg, Lys, Glu, Gln, His
• Scan sequence for Hydrophobic residues and identify Hydrophobic to Hydrophobic segments
• Build β strands using rules for intervening residues
• Scan sequence and identify Turn to Turn segments
• Modify β strands which enclose, intersect or are enclosed within the Turn to Turn segments

93 % strand prediction accuracy for 11000 strands from over 2000 PDB sequences (50 to 150 amino acids)
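The first two steps of the β-strand protocol above (residue classification and the scan for Hydrophobic to Hydrophobic segments) can be sketched as follows. The four residue classes are taken from the slide; the rule for intervening residues is not given there, so the `max_gap` parameter is a hypothetical stand-in, and the Turn to Turn modification step and the exclusion of helical residues are not implemented.

```python
# Residue classes as listed in the protocol (one-letter codes).
CLASSES = {
    "H": "LIVFMCYW",  # Hydrophobic: Leu, Ile, Val, Phe, Met, Cys, Tyr, Trp
    "B": "AT",        # Bridge: Ala, Thr
    "T": "NDGPS",     # Turn: Asn, Asp, Gly, Pro, Ser
    "N": "RKEQH",     # Other: Arg, Lys, Glu, Gln, His
}
RESIDUE_CLASS = {aa: cls for cls, letters in CLASSES.items() for aa in letters}

def classify(sequence):
    """Map a one-letter amino acid sequence to its H/B/T/N class string."""
    return "".join(RESIDUE_CLASS[aa] for aa in sequence)

def hydrophobic_segments(sequence, max_gap=2):
    """Return (start, end) index pairs of Hydrophobic-to-Hydrophobic segments.

    Assumption: two hydrophobic residues belong to the same segment when at
    most `max_gap` residues lie between them. This is an illustrative rule,
    not the rule set of the original protocol.
    """
    positions = [i for i, c in enumerate(classify(sequence)) if c == "H"]
    segments = []
    for pos in positions:
        if segments and pos - segments[-1][1] - 1 <= max_gap:
            segments[-1][1] = pos          # extend the current segment
        else:
            segments.append([pos, pos])    # start a new segment
    # Keep only segments that span at least two hydrophobic residues.
    return [(start, end) for start, end in segments if end > start]
```

For example, `hydrophobic_segments("LAVGGGGF")` groups Leu and Val into one candidate segment and leaves the isolated Phe out.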

Illustration of strand Superstructure: Bovine Pancreatic Trypsin Inhibitor

Residue-based Concepts

• Identify set i and assign hydrophobicity index Hi to nonhelical Hydrophobic and all cysteine residues

• Assign binary variable yij to each possible unique Residue-to-Residue contact

Strand-based Concepts

• Identify set si and assign hydrophobicity weight Ssi according to superstructure of potential β-strands

• Assign binary variable wsi,sj to each possible unique Strand-to-Strand contact

β-strand Superstructure
• Model incorporates ALL possible alternatives
• Number of postulated strands may be greater than actual
• Many possible sheet arrangements are allowable

Formulation : Key Concepts (Klepeis & Floudas 2002b)

Binary variables
0-1 variables are used to characterize residue-to-residue and strand-to-strand contacts

Linear objective function
Objective is to maximize the hydrophobic potential as controlled by the binary variables

Linear constraints
Constraints account for different combinations of residue and strand contacts (e.g., parallel/antiparallel)

Integer cuts
Iterative addition of these constraints allows for the generation of a ranked list of optimal solutions

Constraint Functions

• Allowable antiparallel combinations

• Allowable parallel combinations

• Limit number of strand contacts to (2)
• Disallow extended β-ladders
• Disallow double intersecting loops

Objective Function

• Maximization of hydrophobic potential
• Additional disulfide contact energy

Computational Study of β Prediction
Bovine Pancreatic Trypsin Inhibitor
• 58 residue protein
• Inhibits serine proteases
• Three disulfide bonds
• Cys5-Cys55, Cys14-Cys38, Cys30-Cys51
• Two antiparallel strands (Strand 1 to Strand 2)
• Hydrophobic residue match (Strand 1 to Strand 3)

Computational Study of β Prediction
Bovine Pancreatic Trypsin Inhibitor

• Disulfide bridge matches : 5-55, 14-38, 30-51
• Residue matches (Strand 1 to Strand 2) : 18-34 (Ile-Val), 19-33 (Ile-Phe), 23-29 (Tyr-Leu)
• Residue match (Strand 1 to Strand 3) : 22-45 (Phe-Phe)
PSIPRED comparison
• 2 strands : 19-24 (79 %) and 29-35 (54 %)
• β-sheet configuration can NOT be identified

Computational Study of β Prediction
SmD3 : Small Nuclear Ribonucleoprotein (T0059)
• 75 residue protein
• Common fold of SH3 proteins
• N-terminal helix
• Set of antiparallel β-sheets
• Barrel-like topology

Computational Study of β Prediction
SmD3 : Small Nuclear Ribonucleoprotein (T0059)

• Multiple global optima with six contacts
• Consistent features include matches between strands 1-2 and 1-8
• One of the six global optima corresponds to experimental observations
PSIPRED comparison
• Length of β strands inconsistent with experimental results
• β-sheet configuration can NOT be identified
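The role of integer cuts in the formulation described earlier in this section can be illustrated on a toy problem. The sketch below is not the Klepeis & Floudas model: it keeps only binary residue-to-residue contact variables, a linear hydrophobic objective, and a single constraint (each residue takes part in at most one contact), and it replaces an ILP solver with brute-force enumeration so that it runs without extra libraries. The hydrophobicity values are hypothetical. After each solve, the cut sum(y_k for k in ones) - sum(y_k for k in zeros) <= |ones| - 1 excludes the solution just found, so repeated solves return a ranked list.

```python
from itertools import combinations, product

def solve(pairs, weights, cuts):
    """Brute-force the 0-1 problem: maximize sum(w * y) subject to each
    residue appearing in at most one contact and to all integer cuts."""
    best, best_val = None, float("-inf")
    for y in product((0, 1), repeat=len(pairs)):
        used = [r for k, pair in enumerate(pairs) if y[k] for r in pair]
        if len(used) != len(set(used)):
            continue  # some residue appears in more than one contact
        if any(sum(y[k] for k in ones) - sum(y[k] for k in zeros) > len(ones) - 1
               for ones, zeros in cuts):
            continue  # excluded by an integer cut from an earlier solution
        val = sum(w * yk for w, yk in zip(weights, y))
        if val > best_val:
            best, best_val = y, val
    return best, best_val

def ranked_solutions(hydrophobicity, n_solutions):
    """Return up to n_solutions (contacts, objective) pairs, best first."""
    residues = sorted(hydrophobicity)
    pairs = list(combinations(residues, 2))
    weights = [hydrophobicity[i] + hydrophobicity[j] for i, j in pairs]
    cuts, results = [], []
    for _ in range(n_solutions):
        y, val = solve(pairs, weights, cuts)
        if y is None:
            break  # every feasible assignment has been cut off
        results.append(([p for p, yk in zip(pairs, y) if yk], val))
        ones = [k for k, yk in enumerate(y) if yk]
        zeros = [k for k, yk in enumerate(y) if not yk]
        cuts.append((ones, zeros))  # integer cut excluding this solution
    return results

# Hypothetical hydrophobicity indices keyed by residue number.
toy = {18: 1.0, 22: 2.0, 33: 1.5, 45: 2.5}
ranking = ranked_solutions(toy, 4)
```

In this toy case the first three solutions tie at the same objective value, which mirrors how a contact formulation can return multiple global optima that must then be distinguished by other means.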

Prediction of Loop Structures with Flexible Stems

Relevant References:

• Klepeis J.L. and C.A. Floudas, "Analysis and Prediction of Loop Segments in Protein Structures", Computers and Chemical Engineering, 29, 423-436 (2005). • Monnigmann M. and C.A. Floudas, "Protein Loop Structure Prediction with Flexible Stem Geometries", Proteins, 61, 748-762 (2005). Outline

Protein Loop Prediction
• Motivational Examples
• Problem Definition
• Current Approaches
• Limitations
• Loop Prediction Strategies
• Overlapping Oligopeptides
• Pivot Point Constraints
• Computational Studies
• Bovine Pancreatic Trypsin Inhibitor
• Immunoglobulin Binding Domain of Protein G
• T0114 : Antifungal Protein (Streptomyces Tendae)
• Conclusions

Functional Significance of Loops

Endotoxins
• Natural pesticide used in agriculture
• CryIIIA class is active against potato beetle

Function
• Bind to receptor proteins
• Insert into membrane
• Function as ion channels

Structure-Function
• 3 surface loops involved in receptor binding & specificity
• Alanine replacements affect receptor binding (loops 1 & 3), membrane binding (loop 2) and toxicity

[Figure: structure with Loop 1, Loop 2 and Loop 3 labeled.]

Functional Significance of Loops

Serine proteases
• Enzymes involved in a variety of physiological processes
• Function as digestive enzymes
• Function as regulators & cell differentiation

Chymotrypsin-like serine protease • Catalytic residues bridge beta-barrel • Substrate specificity determined by three adjacent surface loops • No direct contact with substrate

Trypsin-like serine protease • Subsite specificity in addition to primary • Two surface loops flank catalytic residues • Kinetic studies highlight preferential substrate specificity for certain subclasses Protein Folding Problem

• To predict a protein’s native three-dimensional conformation from its linear amino acid sequence defines the Protein Folding Problem
• Predictive ability would allow for the production of improved drugs and biocatalysts, and foster a better understanding of molecular biochemistry and biophysics
• Anfinsen’s hypothesis: the native state of the protein corresponds to the lowest free energy of the system

• Express protein free energy as a function of the conformation of the protein (atomic coordinates)
• Use global optimization techniques to deduce the global minimum energy conformation (native state)

Loop Modeling: Mini-Protein Folding Problem
• Structure determined from sequence of segment
• Sequence-structure variability is integral to loop functionality
• Loops are not as well conserved or regular as basic secondary structure
• Identical loop segments in different proteins have unrelated conformations
• Structure influenced by stem regions that flank loop

Loop segment (8 residues)

N-terminal stem

C-terminal stem

Loop stems
• Stems consist of main chain atoms that precede and follow the loop
• Stem regions offer topological constraints
• Stem regions indicate the orientation for the rest of the protein

Methods for Loop Modeling

Database Methods
Find template that fits the two stem regions of the loop segment
(Greer & co-workers; Cohen & co-workers; Levitt; Karplus & coworkers; Chothia & coworkers)
• Search through all proteins, not just homologous proteins
• Many sequences fit, sort according to
  • Geometric criteria for stem regions (orientation)
  • Sequence similarity between loop and target segments
• Superpose and anneal onto stem regions
Limitations
(1) requires correct loop conformation to exist in database
(2) exponential increase with length for geometric search
(3) only feasible for loops < 8 residues and specific classes

Ab-Initio Methods
Conformational search guided by scoring or energy function
(Sali & co-workers; Scheraga & co-workers; Friesner & co-workers; Karplus & coworkers; Honig & coworkers)
• Generate large numbers of loop conformations between loop stems
• Models range from
  • Unified atom models to all atom models (solvation forces)
  • Cartesian to internal coordinates in discrete or continuous space
• Optimization approaches include local minimization, molecular dynamics, systematic searches, simulated annealing, Monte Carlo
Limitations
(1) native conformation does not provide lowest energy
(2) effectiveness of conformational search procedure
(3) how to handle loop flexibility (conformational entropy)

Existing Approaches to Loop Structure Prediction

Colony energy minimization
• Xiang, Z., Soto, C., and Honig, B., PNAS 99, 7432-7437, 2002
• Colony energy = potential energy – f(# close neighbors)
• favors conformers in broad energy basins
• random backbone, loop closure, 2000 conformers for each loop, rotamer libraries for side chains
• 553 loops of lengths 5 to 12 residues
• 2/3 of conformers improved when colony energy is used

Loop reconstruction with dihedral angle sampling
• DePristo, M., de Bakker, P.I.W., Lovell, S.C., and Blundell, T.L., Proteins 51, 41-55, 2003
• de Bakker, et al., Proteins 51, 21-40, 2003
• backbone angle sampling, probability distributions p(φ,ψ), resolution up to 5°x5°
• 400 loops, 2-12 residues
• energy minimization with AMBER force field, including Generalized Born/surface area solvation model

[Figure: probability distribution p over phi and psi, each from –150 to 150.]

Existing Approaches to Loop Structure Prediction

Loop reconstruction by hierarchical clustering
• Jacobson, M.P., Pincus, D.L., Rappa, C.S., Day, T.J.F., Honig, B., Shaw, D.E., Friesner, R.A., Proteins 55, 351-367, 2004
• Backbone angle sampling with probability functions p(φ,ψ), resolution 5°x5°
• unique in that up to 10^6 conformers are generated
• clustering and filters used to reject decoys before energy minimization
• filters based on information on surrounding protein: steric clashes, loop closure, distance to remainder of protein
• test set of 833 loops from 4-12 residues length
Data-driven methods
• Baker and coworkers, Proteins 55, 656-677, 2004: fragment database, heuristic scoring function
• Deane and Blundell, Protein Science 10, 599-612, 2001: consensus method that combines database of known loops and set of decoys
• Zhang et al.: efficient statistical energy function that compares favorably to physically based energy functions

Derivation of Restraints

Dihedral angle restraints
• Backbone dihedral angles restrained according to classification of residue as either helix or strand
Distance restraints
• Cα-Cα distance restraints for hydrogen bond network of helix (residues i and i+4)
• Cα-Cα distance restraints for hydrogen bonds between residues in opposing strands

Bounds on loop residues
• Perform free energy calculations to derive tighter constraints on backbone variables of loop residues

Using Loop Stems

Topological Constraints
• In general, distance and orientation between loop stems are not known
Available Constraints
• 4 basic classes of loop stem combinations
• Distance constraints only known for beta-sheet connection
• Typically 4.5 - 6.5 angstroms between opposing residues in loop stems
• Usually such loops are relatively short (< 5 residues)

[Figure: antiparallel beta-sheet with the N-terminal stem, C-terminal stem and stem-to-stem distance labeled.]

Protein Loop Prediction Strategies

Free Energy Calculations for Loop Modeling

• Overlapping Oligopeptides

• Pivot Point Constraints

Overlapping Oligopeptides

• Decompose loop segment and 3 flanking stem residues at each end into smaller oligopeptide sequences

• Pentapeptides • Heptapeptides • Nonapeptides

• Impose appropriate bounds on residues in loop stem regions (helix or strand)

• Combine free energy calculations to derive tighter bounds on dihedral angles for residues in loop segment

Probability of Conformational States
• Calculate probability of conformer i from free energy
• Cluster probabilities for (YYY) conformational states
• Classify residues using probability of central peptides
• Probability calculation for residue j (pentapeptide)

Tighter Backbone Constraints

Example: 5 residue loop, helical segment at N-terminal stem, strand segment at C-terminal stem

Step 1 : Initialization
• Select free energy model
• Set bounds for dihedral angles of helix
• Set bounds for dihedral angles of strand
Step 2 : Overlapping Pentapeptides
• 7 free energy based optimizations
• Impose appropriate helix/strand bounds
• Calculate cumulative probabilities for conformational state of each residue (pYYY)
Step 3 : Overlapping Heptapeptides
• 5 free energy based optimizations
• Impose appropriate helix/strand bounds
• Impose reduced bounds from pentapeptides
• Calculate cumulative probabilities for conformational state of each residue (pYYYYY)
Proceed until full sequence simulation
• Use final bounds in tertiary structure prediction

Overall Free Energy

  Potential      Fvac          Scheraga & coworkers
- Entropic       TSvac
+ Cavity         Fcavity       Honig & coworkers 1988, 1993, 1995
+ Polarization   Fsolvation    Honig & coworkers 1988, 1993, 1995
+ Ionization     Fionization   Honig & coworkers 1988, 1993, 1995

Ensemble of Low Energy States (Klepeis & Floudas 1999)

Generate low energy states along with global minimum energy state

Mathematical formulation
• Nonconvex optimization problem
• Requires global optimization search

The αBB Framework (Floudas 2000; Floudas & co-workers; Adjiman et al. 1998, 2001)
• Based on a branch-and-bound framework
• Upper bound on the global solution is obtained by solving the full nonconvex problem to local optimality
• Lower bound is determined by solving a valid convex underestimation of the original problem
• Convergence is obtained by successive subdivision of the region at each level in the branch-and-bound tree
• Guaranteed ε-convergence for C2 NLPs

Protein Loop Prediction Strategies

Free Energy Calculations for Loop Modeling

• Overlapping Oligopeptides

• Pivot Point Constraints

Pivot Point Constraints

Goals
• Improve conformational search
• Derive distance restraints for tertiary structure prediction
Pivot Point
• Define Cα of internal loop residue as the pivot point
• From free energy calculations of smaller oligopeptides derive cumulative probability ranges for two Cα to Cα distances between end residues (or other internal residues) and the pivot residue (d1 and d2)
• For most probable distance ranges (d1 and d2) sweep out different ranges of the angle θ and calculate dx range
• Impose dx ranges in free energy calculations of larger oligopeptides

Tighter Distance Constraints

Example: 5 residue loop
Pivot Point : Residue 3
d1 : Cα(1) to Cα(3)
d2 : Cα(3) to Cα(5)
dx : Cα(1) to Cα(5)
• 4 ranges for θ and dx
• Free energy calculations for all 4 ranges constrained
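The dx range for a given pair of pivot distances follows from the law of cosines applied to the triangle formed by the two end Cα atoms and the pivot Cα. The sketch below assumes that the swept angle is the angle at the pivot between the two Cα to Cα vectors and that d1 = d2 for each tabulated distance; both are inferences from the numbers on the slide rather than statements made there, and the function names are hypothetical.

```python
import math

# The four angle ranges (degrees) swept at the pivot, widest separation first.
ANGLE_BINS = [(135, 180), (90, 135), (45, 90), (0, 45)]

def dx_range(d1, d2, theta_min_deg, theta_max_deg):
    """Return (min, max) end-to-end distance dx for an angle range at the pivot.

    Law of cosines: dx^2 = d1^2 + d2^2 - 2 * d1 * d2 * cos(theta).
    dx grows monotonically with theta on [0, 180], so the extremes of the
    angle range give the extremes of dx.
    """
    def dx(theta_deg):
        theta = math.radians(theta_deg)
        squared = d1 * d1 + d2 * d2 - 2.0 * d1 * d2 * math.cos(theta)
        return math.sqrt(max(0.0, squared))  # guard against tiny negatives
    return dx(theta_min_deg), dx(theta_max_deg)

def dx_table(d):
    """dx ranges, rounded to 0.1, for d1 = d2 = d over the four angle bins."""
    return [tuple(round(v, 1) for v in dx_range(d, d, lo, hi))
            for lo, hi in ANGLE_BINS]
```

Under these assumptions, `dx_table(5.5)` and `dx_table(6.0)` give the same minimum and maximum values as the corresponding rows of the distance table for this example.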

            θ = 135-180    θ = 90-135     θ = 45-90      θ = 0-45
Distance    Min    Max     Min    Max     Min    Max     Min    Max
5.5         10.2   11.0    7.8    10.2    4.2    7.8     0.0    4.2
6.0         11.1   12.0    8.5    11.1    4.6    8.5     0.0    4.6
6.5         12.0   13.0    9.1    12.0    5.0    9.1     0.0    5.0
7.0         12.9   14.0    9.9    12.9    5.4    9.9     0.0    5.4

6.0-6.5     11.1   13.0    8.5    12.0    4.6    9.2     0.0    5.0
Unique      12.0   13.0    9.1    11.1    5.0    8.5     0.0    4.6
Range       1.0            2.0            3.5            4.6

Constrained Formulation

Objective
• Nonconvex atomistic level force field

Constraints
• Enforce bounds on backbone variables
• Enforce upper / lower distances through square well constraints

Torsion Angle Dynamics (Wuthrich & coworkers)

Initialization
• Difficult to identify low energy feasible structures
Torsion Angle Dynamics
• Identify feasible low energy structures (satisfy constraints)
• Fast evaluation of simplified force field (steric based)
• Unconstrained formulation using penalty functions
Implementation
• Solve equations of motion as preprocessing for each constrained minimization

Good Initial Points for Local Optimization

Structure Prediction: Bovine Pancreatic Trypsin Inhibitor (56 AAs)

Backbone variable restraints
• α-helical : 2-5, 47-54
• β-strand : 17-23, 29-35, 44-46
Distance restraints
• Two β-sheet contacts
• 32 lower and upper Cα-Cα for helix and β-sheets
• 6 lower and upper S for disulfide bridge
Tertiary fold
• Best energy : -428.0 kcal/mol
• RMSD : 4.0 Å

Bovine Pancreatic Trypsin Inhibitor

RPDFCLEPPYTGPCKARIIRYFYNAKAGLCQTFVYGGCRAKRNNFKSAEDCMRTCGGA
Secondary structure elements, in order: Helix 1, Beta 1, Beta 2, Beta 3, Helix 2

Loop     Type          Loop length    RMSD     Loop + 6 residues    RMSD
Loop 1   Helix-Beta    11 residues    2.6 Å    17 residues          3.2 Å
Loop 2   Beta-Beta     5 residues     0.4 Å    11 residues          1.4 Å
Loop 3   Beta-Beta     8 residues     2.0 Å    14 residues          2.9 Å

Large Scale Testing

• Set of 15 proteins
• Previous CASP competitions
• Benchmark systems
• Tested within context of ASTRO-FOLD
• Consistent results over all lengths
• Comparable to best results in which loop stems are fixed (+6 residues)

• Set of proteins from CASP5
• Wide range of loops tested
• Mixed structure predictions
• Comparisons available starting December 2002

[Figure axis labels: RMSD, Loop length, Count.]

Loop Structure Prediction with Flexible Stem Geometry

Flexible Stem Geometry

Assess quality of loop structure prediction with flexible stem geometries
• use in ASTRO-FOLD ab initio structure prediction
  Klepeis, J.L., Floudas, C.A., Biophysical J. 85, 2119-2146, 2003;
  Klepeis, J.L., Floudas, C.A., J. Comput. Chem. 24(2), 191-208, 2003;
  Klepeis, J.L., Floudas, C.A., J. Comput. Chem. 23, 245-266, 2002;
  Klepeis, J.L., Pieja, M.T., Floudas, C.A., Biophysical J. 84, 869-882, 2003
• investigate limit of prediction accuracy if long range interactions are neglected

[Figure: chain diagram with the loop residues labeled.]

Loop Structure Prediction Methodology

Create ensemble by dihedral angle sampling
– extracted p(φ,ψ) from ~2500 loops
– sampled p(φ,ψ) at 5ºx5º resolution
– created ensembles of 2000 conformers for each loop

[Figure: probability distributions p over phi and psi (each from –150 to 150) for ARG and PRO.]

Structure optimization with first principles force field
– Dunbrack rotamer library
– ECEPP/3 force field for structure optimization

Clustering to identify conformers that are close to native

New Use of Clustering

Clustering has been used before to
• group conformers
• select conformers that represent groups
New use of clustering
• discard conformers that are far from native
First steps of approach
• calculate pairwise RMSDs for ensemble
• choose RMSD threshold t
• for each conformer, record number of conformers with RMSD <= t

Clustering Example

• threshold t= 3.0Å

• large clusters for small RMSDs, but unfortunately also for large RMSDs

• not always advisable to consider centroid of largest cluster

• threshold t= 3.5Å

• increasing threshold shows that clusters with large RMSDs are small basins only

• large clusters with small RMSDs survive

Clustering Example

• threshold t= 4.0Å

• for sufficiently large threshold the distribution is monotonic

• tail with large RMSDs becomes apparent

• threshold t= 4.5Å

• the distribution is more conservative the larger the threshold

• for sufficiently large threshold, clusters of conformers with large RMSDs can be discarded

Iterative Clustering Algorithm

1. Choose thresholds t and N, choose critical cluster size Ncrit

2. Calculate cluster sizes Ni for all conformers in ensemble

3. If Ni>Ncrit for all i, stop

4. Discard conformers that generate clusters of size Ni <= Ncrit

5. Go back to step 2

Results

Treated >3000 loops, length 3+4+3 through 3+14+3
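The iterative clustering algorithm listed above can be sketched directly from its five steps. This is a minimal illustration under stated assumptions: the threshold t is kept fixed across iterations, a conformer's cluster size counts its neighbors within t but not the conformer itself, and conformers are discarded when their cluster size does not exceed Ncrit. The function name, the `max_steps` safeguard and the one-dimensional stand-in for pairwise RMSDs are hypothetical.

```python
def iterative_clustering(rmsd, t, n_crit, max_steps=10):
    """Return the indices of conformers that survive iterative clustering.

    rmsd   : symmetric matrix (list of lists) of pairwise RMSDs
    t      : RMSD threshold defining cluster membership
    n_crit : critical cluster size; conformers with size <= n_crit are dropped
    """
    alive = list(range(len(rmsd)))
    for _ in range(max_steps):
        # Step 2: cluster size of each surviving conformer (self excluded).
        sizes = {i: sum(1 for j in alive if j != i and rmsd[i][j] <= t)
                 for i in alive}
        # Step 3: stop once every surviving conformer has a large cluster.
        if all(sizes[i] > n_crit for i in alive):
            break
        # Step 4: discard conformers whose clusters are too small.
        alive = [i for i in alive if sizes[i] > n_crit]
    return alive

# Toy ensemble: points on a line stand in for conformers, and the absolute
# difference stands in for the pairwise RMSD (purely illustrative).
points = [0.0, 0.5, 1.0, 1.2, 9.0, 9.3, 20.0]
toy_rmsd = [[abs(a - b) for b in points] for a in points]
survivors = iterative_clustering(toy_rmsd, t=1.5, n_crit=1)
```

In the toy ensemble the dense group near the origin survives, while the small group and the isolated outlier are discarded in the first pass.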

[Figure: RMSD to native [Å] versus number of residues including stems (10 to 20), comparing selection by energy, by colony energy, and by cluster size after k = 0, 1, 2 clustering steps (curves a, b, c).]

• Surprisingly, energy almost as good as colony energy
• Clustering always improves result
• For all loops, algorithm stops after 2 or 3 clustering steps
• RMSD grows only linearly with length for at least 20 residues

Quality of Ensembles

[Figure: RMSD to native [Å] versus number of residues including stems (10 to 20); k = 2 cluster size; k = 2 cluster size, select five; best in ensemble (+)]

• Quality of ensembles never restricts result • Linear for at least up to 20 residues, but slopes differ • Gap reduced when considering 5 representatives • Slopes equal when considering 5 representatives Results for CASP6 Targets

[Figure, two panels: RMSD to native [Å] versus total number of residues including stems (10 to 14). Left panel: energy, colony energy, and cluster size after k = 0, 1, 2 clustering steps (a, b, c). Right panel: k = 2 cluster size; k = 2 cluster size, select five; best in ensemble (+)]

Comparison to Previous Results

Comparison difficult
• flexible stem residues
• fixed stems in all previous results; we solve harder problem
• number of residues includes 3+3 stem residues in our case
• stem residues have tighter probability distributions

Results of comparison
• Jacobson et al. result with fixed stems better; use information on stem geometry, if available
• new method results in very favorable slope
• new method is better than or only slightly worse than methods for fixed stems

[Figure: RMSD [Å] versus number of residues (4 to 20) for Xiang et al., de Bakker et al., Jacobson et al., and our result]

Structure Prediction In Protein Folding: Outline

• Introduction to Protein Structure Prediction • Free Energy Calculations in Oligo-peptides • Prediction of Helical Segments • Prediction of Beta Sheet Topologies • Prediction of Loop Structures • Derivation of Restraints • Prediction of Protein Tertiary Structure Derivation of Restraints

Relevant References: • Klepeis J.L. and C.A. Floudas, "ASTRO-FOLD: a combinatorial and global optimization framework for ab initio prediction of three-dimensional structures of proteins from the amino acid sequence", Biophysical Journal, 85, 2119-2146 (2003). • McAllister S.R. and C.A. Floudas, "Enhanced Bounding Techniques to Reduce the Protein Conformational Search Space", submitted for publication, 2008.

Derivation of Restraints

• Dihedral angle restraints – For residues with α-helix or β-sheet classification – For loop residues using the best identified conformer from loop modeling efforts • Distance restraints – Helical hydrogen bond network (i,i+4) – α-helical topology predictions – β-sheet topology predictions
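As a small illustration of the helical hydrogen bond network (i,i+4), the sketch below enumerates the residue pairs that would receive a distance restraint inside each predicted helix. The function name and the inclusive (start, end) segment format are assumptions made for this example; the actual distance bounds are not given on the slide and are deliberately left out.

```python
def helical_hbond_pairs(helices):
    """helices: list of (start, end) residue indices, inclusive.
    Returns the (i, i+4) pairs that lie entirely inside each helix."""
    return [
        (i, i + 4)
        for start, end in helices
        for i in range(start, end - 3)
    ]

# Hypothetical helix spanning residues 5 to 12
pairs = helical_hbond_pairs([(5, 12)])
print(pairs)  # [(5, 9), (6, 10), (7, 11), (8, 12)]
```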

Klepeis, JL and Floudas, CA. Journal of Global Optimization. (2003) Structure Prediction In Protein Folding: Outline

• Introduction to Protein Structure Prediction • Free Energy Calculations in Oligo-peptides • Prediction of Helical Segments • Prediction of Beta Sheet Topologies • Prediction of Loop Structures • Derivation of Restraints • Prediction of Protein Tertiary Structure Prediction of Protein Tertiary Structure

Relevant References: • Klepeis J.L. and C.A. Floudas, "Ab Initio Tertiary Structure Prediction of Proteins", Journal of Global Optimization, 25, 113-140 (2003). • Klepeis J.L., M. Pieja, and C.A. Floudas, "A New Class of Hybrid Global Optimization Algorithms for Peptide Structure Prediction: Integrated Hybrids", Computer Physics Communications, 151, 121-140 (2003). • Klepeis J.L., M. Pieja, and C.A. Floudas, "A New Class of Hybrid Global Optimization Algorithms for Peptide Structure Prediction: Alternating Hybrids and Application to Met-Enkephalin and Melittin", Biophysical Journal, 84, 869-882 (2003). • Klepeis J.L. and C.A. Floudas, "ASTRO-FOLD: a combinatorial and global optimization framework for ab initio prediction of three-dimensional structures of proteins from the amino acid sequence", Biophysical Journal, 85, 2119-2146 (2003). • Klepeis J.L., Y. Wei, M.H. Hecht, and C.A. Floudas, "Ab Initio Prediction of the 3-Dimensional Structure of a De Novo Designed Protein: A Double Blind Case Study", Proteins, 58, 560-570 (2005).

Tertiary structure prediction

Constrained optimization • Problem definition

• Atomistic level force field (ECEPP/3)

• Distance constraints

αBB Global Optimization • Based on a branch-and-bound framework • Upper bound on the global solution is obtained by solving the full nonconvex problem to local optimality • Lower bound is determined by solving a valid convex underestimation of the original problem • Convergence is obtained by successive subdivision of the region at each level in the branch & bound tree • Guaranteed ε-convergence for C² NLPs
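The branch-and-bound bullets above can be illustrated in one dimension. The sketch below is a toy under stated assumptions, not the ECEPP/3 formulation: the test function `f` is an illustrative choice, `ALPHA` is derived by hand from a bound on its second derivative, the convex underestimator is minimized by ternary search, and the upper bound is taken by evaluating `f` at the underestimator's minimizer rather than by a local nonconvex solve.

```python
import heapq
import math

def f(x):
    # Illustrative nonconvex test function (not from the slides)
    return math.sin(x) + math.sin(10.0 * x / 3.0)

# f''(x) >= -(1 + 100/9), so alpha >= (1 + 100/9) / 2 makes the underestimator convex
ALPHA = 0.5 * (1.0 + 100.0 / 9.0) + 1e-3

def underestimator(x, lo, hi):
    # alphaBB form: f(x) + alpha * (lo - x) * (hi - x), which is <= f(x) on [lo, hi]
    return f(x) + ALPHA * (lo - x) * (hi - x)

def minimize_convex(g, lo, hi, iters=100):
    # Ternary search is valid because g is convex on [lo, hi]
    a, b = lo, hi
    for _ in range(iters):
        m1 = a + (b - a) / 3.0
        m2 = b - (b - a) / 3.0
        if g(m1) <= g(m2):
            b = m2
        else:
            a = m1
    x = 0.5 * (a + b)
    return x, g(x)

def alpha_bb(lo, hi, eps=1e-6, max_nodes=100000):
    x0, lb0 = minimize_convex(lambda x: underestimator(x, lo, hi), lo, hi)
    best_x, ub = x0, f(x0)
    heap = [(lb0, lo, hi)]  # nodes ordered by lower bound
    nodes = 0
    while heap and nodes < max_nodes:
        lb, a, b = heapq.heappop(heap)
        if ub - lb <= eps:  # smallest lower bound is within eps of the incumbent
            break
        mid = 0.5 * (a + b)
        for c, d in ((a, mid), (mid, b)):  # subdivide the region
            x, child_lb = minimize_convex(lambda z: underestimator(z, c, d), c, d)
            nodes += 1
            if f(x) < ub:  # any feasible point gives a valid upper bound
                best_x, ub = x, f(x)
            if child_lb < ub - eps:  # otherwise the node is fathomed
                heapq.heappush(heap, (child_lb, c, d))
    return best_x, ub

x_star, f_star = alpha_bb(2.7, 7.5)
print(round(x_star, 3), round(f_star, 4))
```

The gap between `f` and its underestimator is at most `ALPHA` times the squared half-width of the interval, so the bounds tighten as the region is subdivided, which is the mechanism behind the ε-convergence statement.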

Floudas, CA and co-workers, 1995-2006 Adjiman, CS, et al. Computers and Chemical Engineering. (1998a,b) Torsion Angle Dynamics

• Why? Difficult to identify low energy feasible points • Fast evaluation of steric based force field • Unconstrained formulation with penalty functions

• Implemented by solving equations of motion as preprocessing for each constrained minimization Guntert, P, et al. Journal of Molecular Biology. (1997) Klepeis, JL, et al. Journal of Computational Chemistry. (1999) Klepeis, JL and Floudas, CA. Computers and Chemical Engineering. (2000) Conformational Space Annealing

• Induce variations – Mutations – Crossovers • Subject to local energy minimization • Anneal through the gradual reduction of space

Scheraga and co-workers, 1997-2006. Lee, JH, et al. Journal of Computational Chemistry. (1997) Tertiary Structure Prediction

• Hybrid global optimization approach – αBB deterministic global optimization – Conformational Space Annealing (CSA) • Modifications – Streamlined parallel implementation – Integrated a rotamer optimization stage for quick energetic improvements – Improved initial point selection using a torsion angle dynamics based annealing procedure from CYANA*

*Guntert, P, et al. Journal of Molecular Biology. (1997)

Global Optimization: Alternating Hybrid • The αBB and CSA algorithms have complementary strengths and drawbacks • Implement hybrid algorithm to capture strengths of both • Parallelize by dividing problem, assigning subproblems

Klepeis, JL, et al. Biophysical Journal. (2003)

Alternating Hybrid: Implementation • All secondary nodes begin performing αBB iterations • Once the CSA bank is full, CSA takes control of a subset of secondary nodes

[Figure: primary processor with αBB control and CSA control; secondary processors]

Rotamer Side Chain Optimization

• Side chain packing is crucial to the stability and specificity of the native state • Rotamer optimization is a quick way to alleviate steric clashes • Better starting point for constrained nonlinear minimization Tertiary structure prediction

Results

Results – Tertiary Structure Prediction

• PDB: 1nre
Lowest energy predicted structure of 1nre (color) versus native 1nre (gray): Energy -1395.48, RMSD 6.63
Lowest RMSD predicted structure of 1nre (color) versus native 1nre (gray): Energy -1340.45, RMSD 3.52

Results – Tertiary Structure Prediction

• PDB: 1hta
Lowest energy predicted structure of 1hta (color) versus native 1hta (gray): Energy -941.02, RMSD 6.70
Lowest RMSD predicted structure of 1hta (color) versus native 1hta (gray): Energy -915.57, RMSD 2.58

Tertiary structure prediction

Blind studies

CASP7 Results – T311 (87/97 aa)

• PSIPRED used for α-helical prediction • A small number of loose, homology based constraints introduced

RMSD 5.75

TS 3

CASP7 Results – T335 (42/85 aa)

• PSIPRED used for α-helical prediction • Distance constraints imposed based on α-helical topology prediction results

RMSD 2.71

TS 1

Results – Blind Tertiary Structure Prediction

• S836
Lowest energy predicted structure of s836 (color) versus native s836 (gray): Energy -1740.11, RMSD 2.84
Lowest RMSD predicted structure of s836 (color) versus native s836 (gray): Energy -1697.88, RMSD 2.39

CASP7 Results – T382 (121/123 aa)

• PSIPRED, PROFsec used for α-helical prediction • Distance constraints imposed based on α-helical topology prediction results

TS 2

CASP7 Results – T340 (90 aa)

• Selection by energy alone may not identify the lowest RMSD structure

CASP7 Results – T354 (120/130 aa)

• Incorrect topology predictions can misdirect global conformational search

[Figure: strand topology diagrams (strands 1 to 5), predicted TS3 versus native; antiparallel and parallel arrangements]

CASP7 Results – T351 (60/117 aa)

• Overall RMSD can be deceiving, hiding a correct topology prediction

[Figure: native versus predicted TS1; RMSD 9.69]

CASP7 – T367 Comparison

• PSIPRED used for α-helical prediction • Tight distance constraints imposed based on α-helical topology prediction results

Alpha-helical Topology and Tertiary Structure Prediction of Globular Proteins

McAllister S.R., Mickus B.E., J.L. Klepeis, and C.A. Floudas, "A Novel Approach for Alpha-Helical Topology Prediction in Globular Proteins: Generation of Interhelical Restraints", Proteins, 65, 930-952 (2006). Outline

• Predicting α-helical contacts – Probability development – Model – Results • Predicting α-helical contacts in α/β proteins – Distance bounding – Model – Results • Structure prediction of α-helical proteins – Framework – Results

ASTRO-FOLD

Helix Prediction -Detailed atomistic modeling -Simulations of local interactions (Free Energy Calculations)

β-sheet Prediction -Novel hydrophobic modeling -Predict list of optimal topologies (Combinatorial Optimization, MILP)

Loop Structure Prediction -Dihedral angle sampling -Discard conformers by clustering (Novel Clustering Methodology)

Derivation of Restraints -Dihedral angle restrictions -Cα distance constraints (Reduced Search Space)

Overall 3D Structure Prediction -Structural data from previous stages -Prediction via novel solution approach (Global Optimization and Molecular Dynamics)

Klepeis, JL and Floudas, CA. Biophys J. (2003) Overview

• Problem – Topology prediction of globular α-helical proteins • Approach • Thesis: Topology is based on certain Inter-helical Hydrophobic to Hydrophobic Contacts – Create a dataset of helical proteins – Develop inter-helical contact probabilities – Apply two novel mixed-integer optimization models (MILP) • Level 1 - PRIMARY contacts • Level 2 - WHEEL contacts

McAllister, Mickus, Klepeis, Floudas. Proteins. 2006, 65:930-952. Dataset Selection

• Protein Sources – 229 PDBSelect25 [1] database – 62 CATH [2] database – 20 Zhang et al. [3] – 7 Huang et al. [4] • Restrictions – No β-sheets, at least 2 α-helices – No highly similar sequences • Dataset – 318 proteins in the database set

[1] Hobohm, U. and C. Sander. Prot Sci 3 (1994) 522. [2] Orengo, C.A. et al. Structure 5 (1997) 1093. [3] Zhang, C. et al. PNAS 99 (2002) 3581. [4] Huang, E.S. et al. J Mol Biol 290 (1999) 267.

McAllister, Mickus, Klepeis, Floudas. Proteins. 2006, 65:930-952.

Probability Development

•Contact Types –PRIMARY contact •Minimum distance hydrophobic contact between 4.0 Å and 10.0 Å –WHEEL contact •Only WHEEL position hydrophobic contacts between 4.0 Å and 12.0 Å •Classified as parallel or antiparallel contacts

McAllister, Mickus, Klepeis, Floudas. Proteins. 2006, 65:930-952.

Model Overview • Formulation: Maximize inter-helical residue-residue contact probabilities – Binary variables indicate antiparallel helical contact – Binary variables indicate residue contact • Goal: Produce a rank-ordered list of the most likely helical contacts – Contacts used to restrict conformational space explored during protein tertiary structure prediction

McAllister, Mickus, Klepeis, Floudas. Proteins. 2006, 65:930-952. Pairwise Model Objective

•Level 1 Objective –Maximize probability of pairwise residue-residue contacts

McAllister, Mickus, Klepeis, Floudas. Proteins. 2006, 65:930-952. Pairwise Model Constraints

•Level 1 Constraints –At most one contact per position

–Helix-helix interaction direction

–Linking interaction variables
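To make the Level 1 idea concrete, the sketch below selects contacts by brute-force enumeration instead of a MILP solver. It is a simplified stand-in under stated assumptions: only the "at most one contact per position" rule and a per-helix-pair cap (named after the MAX_CONTACT parameter) are enforced, the direction, linking, kink, and numbering constraints are omitted, and every residue index and probability in the candidate list is hypothetical.

```python
from itertools import combinations

def best_contacts(candidates, max_contact=2):
    """candidates: list of (residue_a, residue_b, helix_pair, probability).
    Returns the subset maximizing summed probability under two rules:
    each residue appears in at most one contact, and each helix pair
    carries at most max_contact contacts."""
    best, best_score = (), 0.0
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            residues = [r for a, b, _, _ in subset for r in (a, b)]
            if len(residues) != len(set(residues)):
                continue  # a residue position is used twice
            pairs = [hp for _, _, hp, _ in subset]
            if any(pairs.count(hp) > max_contact for hp in set(pairs)):
                continue  # too many contacts for one helix pair
            score = sum(p for _, _, _, p in subset)
            if score > best_score:
                best, best_score = subset, score
    return list(best), best_score

# Hypothetical candidate contacts (indices and probabilities are made up)
candidates = [
    (3, 25, "1-2", 0.30),
    (7, 21, "1-2", 0.25),
    (3, 21, "1-2", 0.40),
    (21, 40, "2-3", 0.20),
]
chosen, score = best_contacts(candidates, max_contact=2)
print(chosen, score)
```

The single most probable contact (3, 21) is not selected here, because it blocks both residues 3 and 21 and so scores lower than the compatible pair of contacts. Enumeration is exponential in the number of candidates, which is why a mixed-integer formulation is the practical choice for real helical proteins.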

McAllister, Mickus, Klepeis, Floudas. Proteins. 2006, 65:930-952. Pairwise Model Constraints •Level 1 Constraints –Restrict number of contacts between a given helix pair (MAX_CONTACT)

–Vary the number of helix-helix interactions (SUBTRACT)

McAllister, Mickus, Klepeis, Floudas. Proteins. 2006, 65:930-952. Pairwise Model Constraints •Level 1 Constraints –Allow for and Limit helical kinks

McAllister, Mickus, Klepeis, Floudas. Proteins. 2006, 65:930-952. Pairwise Model Constraints

•Level 1 Constraints –Consistent numbering

[Figure: consistent numbering of contact residues i, j, k, l]

McAllister, Mickus, Klepeis, Floudas. Proteins. 2006, 65:930-952. Pairwise Model Constraints

• Feasible topologies

[Figure: feasible topologies; labels 1, m, n, p]

Pairwise Model Objective • Level 2 Objective – Maximize the sum of predicted wheel probabilities

McAllister, Mickus, Klepeis, Floudas. Proteins. 2006, 65:930-952. Pairwise Model Constraints

•Level 2 Constraints –Require at most one wheel contact for a specified primary contact

•Level 2 Aim –Distinguish between equally likely Level 1 predictions –Increase the total number of contact predictions

McAllister, Mickus, Klepeis, Floudas. Proteins. 2006, 65:930-952. Results – 2-3 helix bundles

PDB:1mbh in PyMol PDB:1nre in PyMol

McAllister, Mickus, Klepeis, Floudas. Proteins. 2006, 65:930-952.

Results – 1nre Contact Predictions
• subtract 0, max_contact 2

PRIMARY Contact   PRIMARY Distance   WHEEL Contact   WHEEL Distance   Helix-Helix Interaction
25L-49L           6.0                28L-45L         9.1              1-2 A
28L-83V           12.7               -               -                1-3 P
45L-85L           9.3                49L-81L         8.1              2-3 A
51I-77L           9.3                -               -                2-3 A

Results – 1hta Contact Predictions
• subtract 0, max_contact 1

PRIMARY Contact   PRIMARY Distance   Helix-Helix Interaction
5I-28L            9.1                1-2 A
46L-62L           8.4                2-3 A

Results – Contact Prediction Summary

McAllister, Mickus, Klepeis, Floudas. Proteins. 2006, 65:930-952. Summary

• Thesis: Topology of alpha helical globular proteins is based on inter- helical hydrophobic to hydrophobic contacts • Validated on alpha helical globular proteins Outline

• Protein structure prediction overview • Predicting α-helical contacts – Probability development – Model – Results • Predicting α-helical contacts in α/β proteins – Distance bounding – Model – Results • Structure prediction of α-helical proteins – Framework – Results

ASTRO-FOLD for α-helical Bundles

Helix Prediction -Detailed atomistic modeling -Simulations of local interactions (Free Energy Calculations)

Interhelical Contacts -Maximize common residue pairs -Rank-order list of topologies (MILP Optimization Model)

Loop Structure Prediction -Dihedral angle sampling -Discard conformers by clustering (Novel Clustering Methodology)

Derivation of Restraints -Dihedral angle restrictions -Cα distance constraints (Reduced Search Space)

Overall 3D Structure Prediction -Structural data from previous stages -Prediction via novel solution approach (Global Optimization and Molecular Dynamics)

McAllister, Floudas. Proceedings, BIOMAT 2005.

Hybrid Global Optimization Algorithm • All secondary nodes begin performing αBB iterations • Once the CSA bank is full, CSA takes control of a subset of secondary nodes

Primary processor
αBB control • Maintains list of lower bounding subregions • Tracks overall upper and lower bounds • Defines branching directions • Sends and receives work to and from αBB work nodes
CSA control • Maintains CSA bank • Maintains queue of αBB minima for bank increases • Handles bank updates • Sends and receives work to and from CSA work nodes
Idle work • Performs shear movements and perturbations on CSA structures • Only executed during idle time of primary processor

Secondary processors
αBB work • Torsion angle dynamics • Rotamer optimization • Minimization of lower bounding function • Minimization of upper bounding function
CSA work • Rotamer optimization • Minimization of CSA trial conformation

McAllister and Floudas. 2007, Submitted for publication.

Results – Tertiary Structure Prediction

• PDB: 1nre
Lowest energy predicted structure of 1nre (color) versus native 1nre (gray): Energy -1395.48, RMSD 6.63
Lowest RMSD predicted structure of 1nre (color) versus native 1nre (gray): Energy -1340.45, RMSD 3.52

Results – Tertiary Structure Prediction

• PDB: 1hta
Lowest energy predicted structure of 1hta (color) versus native 1hta (gray): Energy -941.02, RMSD 6.70
Lowest RMSD predicted structure of 1hta (color) versus native 1hta (gray): Energy -915.57, RMSD 2.58

Results – Blind Tertiary Structure Prediction (Collaboration with Michael Hecht)

• S836
Lowest energy predicted structure of s836 (color) versus native s836 (gray): Energy -1740.11, RMSD 2.84
Lowest RMSD predicted structure of s836 (color) versus native s836 (gray): Energy -1697.88, RMSD 2.39

Advances In De Novo Protein Design

Christodoulos A. Floudas Princeton University Department of Chemical Engineering Program of Applied and Computational Mathematics Department of Operations Research and Financial Engineering Center for Quantitative Biology De Novo Protein Design

Relevant References:
• Klepeis J.L., C.A. Floudas, D. Morikis, C.G. Tsokos, E. Argyropoulos, L. Spruce, and J.D. Lambris, "Integrated Computational and Experimental Approach for Lead Optimization and Design of Compstatin Variants with Improved Activity", Journal of the American Chemical Society, 125 (28), 8422-8423 (2003).
• Morikis D., A.M. Soulika, B. Mallik, J.L. Klepeis, C.A. Floudas, and J.D. Lambris, "Improvement of the anti-C3 activity of compstatin using rational and combinatorial approaches", Biochemical Society Transactions, 32, 28-32 (2003).
• Klepeis J.L., C.A. Floudas, D. Morikis, and J.D. Lambris, "Design of Peptide Analogs with Improved Activity using a Novel de novo Protein Design Approach", Industrial and Engineering Chemistry Research, 43, 3817-3826 (2004).
• Fung H.K., Rao S., Floudas C.A., Prokopyev O., Pardalos P.M., and F. Rendl, "Computational Comparison Studies of Quadratic Assignment Like Formulations for the In Silico Sequence Selection Problem in De Novo Protein Design", Journal of Combinatorial Optimization, 10, 41-60 (2005).
• Fung H.K., Taylor M.S. and C.A. Floudas, "Novel Formulations for the Sequence Selection Problem in de Novo Protein Design with Flexible Templates", Optimization Methods and Software, 22 (1), 51-71 (2007).
• Fung H.K., Floudas C.A., Taylor M.S., Zhang L., and D. Morikis, "Towards Full Sequence De Novo Protein Design with Flexible Templates for Human Beta-Defensin-2", Biophysical J., 94, 584-599 (2008).
• Taylor M.S., Fung H.K., Rajgaria R., Filizola M., Weinstein H., and C.A. Floudas, "Mutations Affecting the Oligomerization Interface of G-Protein Coupled Receptors Revealed by a Novel De Novo Protein Design Framework", Biophysical J., 94, 2470-2481 (2008).

Review Articles
• Floudas C.A., "Research Challenges, Opportunities and Synergism in Systems Engineering and Computational Biology", AIChE Journal, 51, 1872-1884 (2005).
• Floudas C.A., H.K. Fung, S.R. McAllister, M. Monnigmann, and R. Rajgaria, "Advances in Protein Structure Prediction and De Novo Protein Design: A Review", Chemical Engineering Science, 61, 966-988 (2006).
• Fung H.K., Welsh W.J., and C.A. Floudas, "Computational De Novo Peptide and Protein Design: Rigid Templates versus Flexible Templates", IECR, 47, 993-1001 (2008).

Outline
• Motivational Examples
• De Novo Protein Design: Definition
• De Novo Design: Background
• Advances and Limitations
• Novel Two-Stage De Novo Protein Design Approach
• Sequence Selection - Force Fields: Distance Dependent - Quadratic Assignment-like Models - Compstatin, Human beta defensin-2
• Fold Specificity and Validation - First Principles based: Astro-Fold - NMR like framework
• Computational Studies
• Design of Inhibitors for Complement 3
• Design of C3a
• Conclusions & Acknowledgements

Complement System
• ~30 distinct plasma proteins that interact to attack/eliminate pathogens
• Activated via (3) interacting pathways
• (A) Classical : Antibody-binding (IgM, IgG) to pathogens
• (B) Lectin : Mannose binding protein to carbohydrates on bacteria or viruses
• (C) Alternative : Spontaneous binding to pathogens
(A) : Adaptive/Acquired Immune Response; (B), (C) : Innate/Natural Immune Response
• Activation of the Complement System results in :
• opsonization of pathogens (C3b; C4b)
• recruitment of inflammatory cells (C3a; C5a; C4a+)
• killing of pathogens (C5b, C6, C7, C8, C9) MAC: Membrane Attack Complex
Unregulated Activation: Host Cell Damage

Need for Inhibitors

Condition                                 Annual US Patient Population
• Acute Complement-Mediated Conditions
• Myocardial Infarction (Heart Attack)    1,500,000
• Coronary Artery Bypass                  363,000
• Stroke                                  600,000
• Chronic Complement-Mediated Conditions
• Rheumatoid Arthritis                    2,100,000
• Alzheimer’s Disease                     4,000,000
• Systemic Lupus Erythematosus            500,000

Complement Pathways

• C3 is central in the complement system • C3 activation/inhibition essential for all functions of the complement system • C3 acts as “double edge sword” • Promotes phagocytosis • Host cell damage - cytolysis

with Prof. John Lambris (Univ. of Pennsylvania) and Prof. Dimitris Morikis (Univ. of California, Riverside)

Complement 3 : Design of Inhibitors
With Prof. John Lambris, University of Pennsylvania, School of Medicine
With Prof. Dimitris Morikis, University of California at Riverside

Compstatin : Synthetic Inhibitor • 13 amino acid cyclic peptide ICVVQDWGHHRCT • Disulfide bridge • beta-turn

Objective: Design improved inhibitors (Compstatin-like inhibitors)

C3a
with Prof. John Lambris (Univ. of Pennsylvania) and Prof. Dimitris Morikis (Univ. of California, Riverside)
Biologically active fragment of C3 component in the complement pathway
A potent mediator of inflammation

Background • 77 residues, 3 S-S bonds, 4 α-helices • C-terminal primary binding site (LGLAR) • Super-potent peptide (12-15 times more active than natural C3a) WWGKKYRASKLGLAR corresponding to positions 63-77 identified by Ember et al., Biochemistry, 1991 • Extensive sequence-activity studies by Ember et al., Biochemistry, 1991 • Ideal target for pharmaceutical development because of its small size and the fact that no complement inhibitor is yet available in clinic

Functions • Binds to C3a receptor (C3aR) with nanomolar affinity • structure of positions 1-12 not resolved

Structure of C3a

[Figure: structure of C3a with N-terminal, C-terminal, and helices II, III, and IV labeled; residue labels 13, 17, 23, 37, 41, 47, 69, 77]

• Segment 1 – 12 is not shown because structure data are not available. • helix I is segment 5 – 15.

Antibacterial Peptides (with Prof. D. Morikis)

Beta-Defensins
• Family of antimicrobial peptides
• Cationic peptides of 28-42 AAs
• Structure for only (2) human BDs
• Structure-function unknown
• Low sequence identity
• hBD-2 10x more potent than hBD-1

Sequence alignment, positions 25-44:
hBD-2    IGDPVTCLKSGAICHPVFCP
hBD-1    ..DHYNCVSSGGQCLYSACP
mBD-7    .NSKRACYREGGECLQ.RCI
mBD-8    .NEPVSCIRNGGICQY.RCI
hBD-3    TLQKYYCRVRGGRCAVLSCL
hBD-4    FELDRICGYGTARCRK.KCR
mBD-1    ..DQYKCLQHGGFCLRSSCP
mBD-2    .AELDHCHTNGGYCVRAICP
mBD-3    INNPVSCLRKGGRCWN.RCI
mBD-4    INNPITCMTNGAICWG.PCP
bBD-1    ..DFASCHTNGGICLPNRCP
bBD-2    ..NHVTCRINRGFCVPIRCP
bBD-12   ..GPLSCGRNGGVCIPIRCP

Sequence alignment, positions 45-64:
hBD-2    RRYKQIGTCGLPGTKCCKKP
hBD-1    IFTKIQGTCYRGKAKCCK..
mBD-7    GLFHKIGTC.NFRFKCCKFQ
mBD-8    GLRHKIGTC.GSPFKCCK..
hBD-3    PKEEQIGKCSTRGRKCCRRK
hBD-4    SQEYRIGRCPNTYA.CCLRK
mBD-1    SNTKLQGTCKPDKPNCCKS.
mBD-2    PSARRPGSCFPEKNPCCKYM
mBD-3    GNTRQIGSCGVPFLKCCKRK
mBD-4    TAFRQIGNCGHFKVRCCKIR
bBD-1    GHMIQIGICFRPRVKCCRSW
bBD-2    GRTRQIGTCFGPRIKCCRSW

Objective: Design improved antibacterial peptides

De Novo Protein Design

Define target template: Backbone coordinates for N, Cα, C, O and possibly Cα-Cβ vectors from PDB
Human β-Defensin-2 hbd-2 (PDB: 1fqq)

Design folded protein: Which amino acid sequences will stabilize this target structure? Full sequence design

Mayo et al.; Hellinga et al.; DeGrado et al.; Saven et al.; Hecht et al.

Challenges: In silico sequence selection; Fold validation/specificity

Combinatorial complexity - Backbone length : n - Amino acids per position : m - m^n possible sequences

De Novo Protein Design: challenges
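The size of the sequence space, m^n for m amino acids at each of n positions, is easiest to appreciate on a log scale. The short calculation below is only a worked arithmetic check; the function name is a hypothetical helper.

```python
import math

def log10_sequence_count(m, n):
    """log10 of m**n, the number of sequences with m choices at n positions."""
    return n * math.log10(m)

# 20 amino acids at each of 100 positions
exponent = log10_sequence_count(20, 100)
print(round(exponent, 1))  # 130.1, so 20**100 is roughly 1.3e130
```

Python integers are arbitrary precision, so `20**100` can also be evaluated exactly; it has 131 decimal digits.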

• Flexibility of Backbone Templates
• Full sequence combinatorial design of proteins of practical size still challenging
Full combinatorial design of a 100-residue protein: 20^100 (about 10^130) amino acid sequences, (20r)^100 rotamer sequences to consider! (r: average number of rotamers per amino acid)
- Currently often only possible to design core, boundary or surface regions of small protein domains (25 – 74 residues) (Gordon et al., J. Comput. Chem., 2003)
• De novo protein design: NP-hard problem (Pierce and Winfree, 2002; Fung, Rao, Floudas, Prokopyev, Pardalos, Rendl, 2005)
- No exact polynomial-time algorithms known
- For exponential-time algorithms, computation time varies exponentially with number of design positions

Background and Advances
• Stochastic Methods: MC, Genetic Algorithms

Tuffery et al. (1991); Desjarlais, Handel (1995), (2003) • Probabilistic Approaches, Combinatorial Libraries

Saven & co-workers (2000), (2001), (2004)
• Deterministic Methods - Self-Consistent Mean Field (Koehl, Delarue, 1994) - Self-Consistent Mean Field and MC (Koehl, Levitt, 1999a,b) - Dead End Elimination Criterion (Mayo & co-workers; Desjarlais, Handel, 1995, 1999; Hellinga & co-workers; Desmet et al. 1992; Goldstein, 1994; Pierce et al. 2000)
• Iterative Sequence-Structure (Kuhlman et al. 2003)

Background Different De Novo Protein Design Approaches
• Stochastic methods:
• Monte Carlo (Metropolis et al., J. Chem. Phys., 1953) - Perturb the structure by some random change in residue or rotamer. Move is accepted if its Boltzmann probability is higher than a random number drawn from [0, 1].

• Genetic algorithms (Tuffery et al., J. Biomol. Struct. Dyn., 1991) - Random sequences are allowed to mutate, cross over, and reproduce. High energy sequences are eliminated from population.

• Stochastic methods do not guarantee convergence to the global energy minimum

Background Different De Novo Protein Design Approaches
• Combinatorial Libraries: Probabilistic approach Zou and Saven, J. Mol. Bio., 2000 Kono and Saven, J. Mol. Bio., 2001 Park et al., Current Opinion in Structural Biology, 2004

- set of 20 probabilities for each design position

- maximize the total conformational entropy subject to constraints:

$$\max\; S = -\sum_{i}\sum_{\alpha,\,r(\alpha)} w_i(\alpha, r(\alpha)) \ln w_i(\alpha, r(\alpha))$$

subject to $\sum_{\alpha,\,r(\alpha)} w_i(\alpha, r(\alpha)) = 1$ for each position $i$ and $E = E_o$, with the constraints enforced through Lagrange multipliers

i: position, α: amino acid, r(α): rotamer

- provide framework for designing and interpreting protein combinatorial experiments

Background Different De Novo Protein Design Approaches
• Deterministic methods:
• Self-Consistent Mean Field (Koehl and Delarue, J. Mol. Bio., 1994; Lee, J. Mol. Bio., 1994)

- refines iteratively a conformational matrix

- initial guess for conformational matrix: $CM(i,k) = 1/K_i$ for rotamer $k$ of residue $i$

- mean field energy:

$$E(i,k) = U(x_{ikC}) + U(x_{ikC}, x_{0C}) + \sum_{j=1,\, j \neq i}^{N} \sum_{l=1}^{K_j} CM(j,l)\, U(x_{ikC}, x_{jlC})$$

where $x_{ikC}$ are the coordinates of atoms of residue $i$ described by rotamer $k$, and $x_{0C}$ are the coordinates of atoms in the template

- update conformational matrix:

$$CM_1(i,k) = \frac{e^{-E(i,k)/RT}}{\sum_{l=1}^{K_i} e^{-E(i,l)/RT}}, \qquad CM = \lambda\, CM_1 + (1 - \lambda)\, CM, \quad \text{optimal } \lambda = 0.9$$

- convergence criterion (e.g., $10^{-4}$) is set to define self-consistency

Background Different De Novo Protein Design Approaches
• Self-Consistent Mean Field + Monte Carlo (Koehl and Levitt, J. Mol. Bio., 1999 I, II)
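The self-consistent mean field iteration from the Koehl and Delarue slide (uniform initial matrix, mean field energy, Boltzmann update, mixing with weight 0.9, convergence tolerance) can be sketched as below. This is a minimal illustration with assumptions: `self_e[i][k]` lumps together the rotamer's internal and rotamer-template terms, `pair_e` is a user-supplied rotamer-rotamer energy function, the default `rt` value is illustrative, and the two-residue energies in the example are invented.

```python
import math

def scmf(self_e, pair_e, rt=0.6, lam=0.9, tol=1e-4, max_iter=1000):
    """self_e[i][k]: energy of rotamer k of residue i with itself and the template.
    pair_e(i, k, j, l): rotamer-rotamer energy. Returns the conformational matrix."""
    n = len(self_e)
    cm = [[1.0 / len(row)] * len(row) for row in self_e]  # initial guess 1/K_i
    for _ in range(max_iter):
        new_cm = []
        for i in range(n):
            energies = []
            for k in range(len(self_e[i])):
                e = self_e[i][k]
                for j in range(n):
                    if j != i:  # mean field: average over rotamers of residue j
                        e += sum(cm[j][l] * pair_e(i, k, j, l)
                                 for l in range(len(self_e[j])))
                energies.append(e)
            e_min = min(energies)  # shift for stability; cancels in the ratio
            weights = [math.exp(-(e - e_min) / rt) for e in energies]
            z = sum(weights)
            new_cm.append([w / z for w in weights])
        # Mix new and old matrices: CM = lam * CM_1 + (1 - lam) * CM
        mixed = [[lam * a + (1.0 - lam) * b for a, b in zip(r1, r0)]
                 for r1, r0 in zip(new_cm, cm)]
        change = max(abs(a - b) for r1, r0 in zip(mixed, cm) for a, b in zip(r1, r0))
        cm = mixed
        if change < tol:  # self-consistency reached
            break
    return cm

# Hypothetical two-residue system with two rotamers each
toy_self = [[0.0, 2.0], [0.0, 2.0]]
toy_pair = lambda i, k, j, l: 0.5 if k != l else 0.0
cm = scmf(toy_self, toy_pair)
print([[round(p, 3) for p in row] for row in cm])
```

Each row of the returned matrix is a probability distribution over the rotamers of one residue; in the toy system rotamer 0 dominates at both positions because of its lower self energy.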

- energy function: full-atomistic using self-consistent mean field approach
- “Design in” for stability and “design out” for specificity, using the approximation of a random energy model:

Initial sequence S0 → randomly exchange 2 positions → new sequence S1 → calculate P = exp[-(E1-E0)/kT] → draw W = Ran[0,1] → if W < P, set S0 = S1 and E0 = E1 → repeat N times

Partition function is assumed to depend on amino acid composition but not the ordered sequence → specificity (confirmed by fold recognition techniques) achieved by optimizing sequence space and holding amino acid composition fixed

Background Different De Novo Protein Design Approaches
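The position-exchange Monte Carlo loop from the Koehl and Levitt slide (exchange two positions, compute P = exp[-(E1-E0)/kT], accept if a uniform random W is below P) can be sketched as follows. The energy function here is a made-up toy, not a mean field energy, and the names `swap_monte_carlo`, `toy_energy`, `kt`, and `seed` are assumptions for illustration.

```python
import math
import random

def swap_monte_carlo(sequence, energy, n_steps=1000, kt=1.0, seed=0):
    """Metropolis search over orderings of a fixed amino acid composition."""
    rng = random.Random(seed)
    current = list(sequence)
    e0 = energy(current)
    for _ in range(n_steps):
        i, j = rng.sample(range(len(current)), 2)  # pick two distinct positions
        trial = current[:]
        trial[i], trial[j] = trial[j], trial[i]    # exchange them
        e1 = energy(trial)
        # P = exp(-(E1 - E0)/kT), capped at 1 for downhill moves
        p = 1.0 if e1 <= e0 else math.exp(-(e1 - e0) / kt)
        w = rng.random()                           # W = Ran[0, 1)
        if w < p:
            current, e0 = trial, e1
    return "".join(current), e0

def toy_energy(seq):
    # Hypothetical penalty: one unit for every leucine at an odd index
    return sum(1.0 for idx, aa in enumerate(seq) if aa == "L" and idx % 2 == 1)

start = "LLLLKKKK"
final_seq, final_e = swap_monte_carlo(start, toy_energy)
print(final_seq, final_e)
```

Because every move is an exchange of two positions, the amino acid composition of the starting sequence is preserved throughout, which is the property the slide relies on when it holds composition fixed.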

• Deterministic methods:
• Dead-End Elimination (Desmet et al., Nature, 1992; Voigt et al., J. Mol. Bio., 2000; Pierce et al., J. Comput. Chem., 2000; Gordon et al., J. Comput. Chem., 2003)
- systematically eliminate rotamers that are incompatible with the lowest energy sequence

- energy function (rotamer-template plus rotamer-rotamer terms):

$$E = \sum_{i=1}^{N} E(i_r) + \sum_{i=1}^{N-1} \sum_{j>i}^{N} E(i_r, j_s)$$

- different dead-end elimination criteria:

Original DEE:
$$E(i_r) + \sum_{j \neq i}^{N} \min_{s} E(i_r, j_s) > E(i_t) + \sum_{j \neq i}^{N} \max_{s} E(i_t, j_s)$$

Simple Goldstein DEE:
$$E(i_r) - E(i_t) + \sum_{j \neq i}^{N} \min_{s}\left[E(i_r, j_s) - E(i_t, j_s)\right] > 0$$

Simple Split DEE:
$$E(i_r) - E(i_t) + \sum_{j,\, j \neq k \neq i}^{N} \left\{\min_{u}\left[E(i_r, j_u) - E(i_t, j_u)\right]\right\} + \left[E(i_r, k_v) - E(i_t, k_v)\right] > 0$$

Background Different De Novo Protein Design Approaches
• Dead-End Elimination

Generalized DEE criterion:
$$E(c) - E(c') + \sum_{j=1}^{p} \min_{s}\left[E(c, j_s) - E(c', j_s)\right] > 0$$

c: query cluster, c’: comparison cluster

Looger & Hellinga, J. Mol. Bio., 2001

Fundamental Assumptions of Dead-End Elimination Fixed backbone template Discrete set of rotamers Background
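The Goldstein form of dead-end elimination can be sketched and checked against brute-force enumeration on a small problem. This is an illustrative sketch under assumptions: energies are random numbers rather than a force field, pair energies are stored once per position pair in a dictionary, and the helper names (`pair`, `total_energy`, `goldstein_dee`, `brute_force`) are hypothetical.

```python
import itertools
import random

def pair(pair_e, i, r, j, s):
    # Pair energies are stored once per position pair (i < j)
    return pair_e[(i, j)][r][s] if i < j else pair_e[(j, i)][s][r]

def total_energy(assign, self_e, pair_e):
    n = len(assign)
    e = sum(self_e[i][assign[i]] for i in range(n))
    e += sum(pair(pair_e, i, assign[i], j, assign[j])
             for i in range(n) for j in range(i + 1, n))
    return e

def goldstein_dee(self_e, pair_e):
    """Return, per position, the set of rotamers surviving Goldstein elimination."""
    n = len(self_e)
    alive = [set(range(len(row))) for row in self_e]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for r in sorted(alive[i]):
                for t in sorted(alive[i] - {r}):
                    gap = self_e[i][r] - self_e[i][t] + sum(
                        min(pair(pair_e, i, r, j, s) - pair(pair_e, i, t, j, s)
                            for s in alive[j])
                        for j in range(n) if j != i
                    )
                    if gap > 0:  # r is worse than t in every context: dead end
                        alive[i].discard(r)
                        changed = True
                        break
    return alive

def brute_force(self_e, pair_e, choices):
    return min(itertools.product(*choices),
               key=lambda a: total_energy(a, self_e, pair_e))

# Hypothetical random energies: 4 positions, 3 rotamers each
rng = random.Random(1)
N, K = 4, 3
self_e = [[rng.uniform(-5, 5) for _ in range(K)] for _ in range(N)]
pair_e = {(i, j): [[rng.uniform(-0.5, 0.5) for _ in range(K)] for _ in range(K)]
          for i in range(N) for j in range(i + 1, N)}

alive = goldstein_dee(self_e, pair_e)
gmec_full = brute_force(self_e, pair_e, [range(K)] * N)
gmec_pruned = brute_force(self_e, pair_e, [sorted(a) for a in alive])
print(alive, gmec_full)
```

A rotamer is removed only when some other surviving rotamer at the same position beats it against every surviving neighbor choice, so the lowest energy assignment found by enumeration is never pruned. The sketch shares the two assumptions listed on the slide: a fixed backbone template and a discrete rotamer set.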

• Deterministic methods guarantee convergence to the global minimum

• Self-Consistent Mean Field and Dead-End Elimination methods either: 1. assume a fixed template 2. fix the amino acid composition 3. consider an average set of templates 4. consider a discrete set of rotamers

True backbone flexibility is not allowed

De Novo Protein Design: Advances

Conferring novel functions onto template
• DEZYMER program for designing metalloproteins (Richards and Hellinga, J. Mol. Bio., 1991; Richards et al., J. Mol. Bio., 1991; Benson et al., Proc. Nat. Acad. Sci. USA, 2000)
1. First identify the catalytic functional groups that catalyze the desired reaction
2. Relocate these groups from mother sequence to the best positions in the de novo designed protein

• Succeeded in creating Zn, FeS, and Cu binding sites in thioredoxin, a protein which normally does not bind metals (Richards et al., J. Mol. Bio., 1991)

De Novo Protein Design: Advances

Better stability and specificity • Engineered α-lytic protease showed over 200-fold preference for one substrate kind over another Wilson et al., J. Mol. Bio., 1991

• Redesigned compstatin (complement 3 inhibitor): best variant found to be 16-fold more potent than the parent peptide Klepeis et al., J. Am. Chem. Soc., 2003 Klepeis et al., Ind. & Eng. Chem. Research, 2004 • Higher stability by having more hydrophobic amino acids in the core than parent proteins Kuhlman & Baker, Current Opinion in Structural Biology, 2004

•Redesign protein-protein interfaces Kortemme & Baker, Current Opinion in Chem. Bio., 2004 De Novo Protein Design: Advances

Locking proteins into particular conformations • Enforced integrin I, a cell-surface adhesion receptor that binds with complement component iC3b, to adopt either the open or closed conformation Shimaoka et al., Nat. Struct. Bio., 2000

•Restricted amino-terminal domain of calmodulin to its calcium saturated closed form. Kraemer-Pecore et al., Current Opinion in Chem. Bio., 2001

many more other successes, but… Flexibility of Backbone Templates

• Scaling down the atomic van der Waals radii by a factor (~5-10%)

Dahiyat, B.I. & Mayo, S.L. (1997) Proc. Natl. Acad. Sci. USA, 94, 10172-10177. - Overestimation of attractive forces between atoms and the possibility of atom overpacking • Considering a fixed set of rotamers (DEE) or changing super-secondary structure parameters which alter relative orientation and distance between secondary structures

Su, A. & Mayo, S.L. (1997) , Protein Science, 6, 1701-1707. - Only a subset of possible conformations is considered • Generating ensemble of random structures from template

Desjarlais, J.R. & Handel, T.M. (1999),J. Mol. Bio., 289, 305-318.

- Solve each structure in the ensemble assuming fixed backbone and apply genetic algorithms and Monte Carlo sampling to combine results into a single low energy structure - Only a random subset of possible conformations is considered Flexibility of Backbone Templates • Iterating between sequence space and structure space

Saunders, C.T. & Baker, D. (2005) J. Mol. Bio., 346, 631-644.
Kuhlman, B., Dantas, G., Ireton, G.C., Varani, G., Stoddard, B.L. & Baker, D. (2003) Science, 302, 1364-1368.

[Diagram: cycle between structure and sequence]
- structure → sequence: sequence design using RosettaDesign (MC search, 12-6 LJ energy function)
- sequence → structure: structure prediction using Rosetta (MC energy minimization)
• Performed for each of an ensemble of starting structures around the target fold

- Backbone flexibility is only indirectly addressed, by transitions between similar structures in the structure space

Flexibility of Backbone Templates

De novo protein design framework allows true backbone flexibility

• Cα-Cα distances: $d^{L}_{C_\alpha C_\alpha} \le d_{C_\alpha C_\alpha} \le d^{U}_{C_\alpha C_\alpha}$. Incorporated in stage 1 through the use of distance bins and in stage 2 through lower and upper bounds.
• Dihedral angles: each angle lies between a lower and an upper bound. Incorporated in stage 2 through the template-constrained folding calculation. Bounds are ±10°.

• Cα-Cα distances and dihedral angles are bounded continuous functions
  Floudas, AIChE J. (2005); Klepeis, Floudas, Morikis, Lambris, JACS (2003), IERC (2004); Fung, Rao, Floudas, Prokopyev, Pardalos, Rendl, J. Comb. Optim. (2005); Fung, Taylor, Floudas, Opt. Meth. Soft. (2007)

- NMR ensemble
- MD with GB
- MD with explicit water molecules

De Novo Protein Design Framework

Sequence selection stage: generates a rank-ordered list of sequences with the lowest energies by solving an integer linear programming (ILP) model
- Quadratic-assignment-like models (Klepeis et al., JACS (2003); Klepeis et al., IERC (2004); Fung et al., J. Comb. Optim. (2005); Fung, Taylor, Floudas, OMS (2007))
- Distance-dependent Cα-Cα and centroid-centroid forcefields (Loose, Klepeis, Floudas, Proteins (2004); Rajgaria, McAllister, Floudas, Proteins (2006); Rajgaria, McAllister, Floudas, Proteins, accepted (2007))

Fold specificity stage: calculates the specificity of each sequence to the flexible design template using the full-atomistic force fields AMBER and ECEPP/3
- First principles via ASTRO-FOLD (Klepeis and Floudas, Biophys. J., 2003)
- NMR structure refinement-based method via CYANA and TINKER using AMBER

De Novo Protein Design Framework
Klepeis, Floudas, Lambris, Morikis 2003, 2004; Fung, Taylor, Floudas 2005, 2007

Sequence selection
• Identify target template for desired fold; specify coordinates of backbone
• Identify possible residue mutations
• Distance dependent pairwise potentials
• Generate rank-ordered energetic list from mixed-integer linear (MILP) model

Fold Validation: Specificity
• Model selected sequences using flexible, detailed energetics
• Employ global optimization for free system
• Employ global optimization for system constrained to template
• Calculate relative probability for structures similar to desired fold

A High Resolution Cα-Cα and Side Chain Centroid Based Distance Dependent Force Field

R. Rajgaria, S. R. McAllister, and C. A. Floudas, Proteins, 70, 950-970 (2008)
R. Rajgaria, S. R. McAllister, and C. A. Floudas, Proteins, 65(2), 726-741 (2006)
C. Loose, J.L. Klepeis, and C.A. Floudas, Proteins, 54, 303-314 (2004)

Objectives

• Create a distance-dependent force field to find native protein folds.
• Design a training procedure that makes the force field robust, using large-scale linear optimization.
• Test our force field against well-performing distance-dependent force fields (e.g., TE-13)¹ by attempting to identify the native fold of novel proteins.

¹ Tobi, D.; Elber, R. Distance-Dependent, Pair Potential for Protein Folding. Proteins: Structure, Function, and Genetics 2000, 41, 40-46.

Force Field – Formulation*

Table 1: Bin Definition

Bin ID   Cα-Cα distance [Å]
1        3-4
2        4-5
3        5-5.5
4        5.5-6
5        6-6.5
6        6.5-7
7        7-8
8        8-9

• Cα-Cα distance dependent
• 8-bin definition (ID); more resolution for bins 3 to 6
• 210 amino-acid combinations (IC)
• 1680 energy variables (one per amino-acid combination IC and bin ID, 210 × 8)
• Energy calculation: sum of pairwise interactions at a particular distance
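The energy calculation on this slide (a sum of pairwise terms, each selected by amino-acid pair and distance bin) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `theta` dictionary layout, and the convention that a bin includes its lower edge are assumptions made for the example.

```python
from bisect import bisect_right

# Bin edges in Angstrom, taken from Table 1 (8 bins between 3 and 9 A).
BIN_EDGES = [3.0, 4.0, 5.0, 5.5, 6.0, 6.5, 7.0, 8.0, 9.0]


def bin_id(distance):
    """Return the 1-based bin ID for a Ca-Ca distance, or None outside 3-9 A.

    Assumption: each bin includes its lower edge and excludes its upper edge.
    """
    if distance < BIN_EDGES[0] or distance >= BIN_EDGES[-1]:
        return None
    return bisect_right(BIN_EDGES, distance)


def pair_key(res_a, res_b):
    """Unordered amino-acid pair, so (LEU, ALA) and (ALA, LEU) share a parameter."""
    return tuple(sorted((res_a, res_b)))


def total_energy(residues, distances, theta):
    """Sum the pairwise energy parameters over all residue pairs i < k.

    residues  : list of residue names, one per position
    distances : square matrix of Ca-Ca distances in Angstrom
    theta     : dict mapping (pair_key, bin_id) -> energy (hypothetical layout)
    """
    energy = 0.0
    n = len(residues)
    for i in range(n):
        for k in range(i + 1, n):
            b = bin_id(distances[i][k])
            if b is None:
                continue  # pairs outside the binned range contribute nothing
            energy += theta.get((pair_key(residues[i], residues[k]), b), 0.0)
    return energy


# Tiny made-up example: three residues, two parameters.
theta = {(("ALA", "LEU"), 3): -1.5, (("LEU", "LEU"), 8): 0.25}
residues = ["LEU", "ALA", "LEU"]
distances = [[0.0, 5.2, 8.5], [5.2, 0.0, 12.0], [8.5, 12.0, 0.0]]
example_energy = total_energy(residues, distances, theta)
```

With 20 amino acids there are 20 × 21 / 2 = 210 unordered pairs, which multiplied by 8 bins gives the 1680 energy variables quoted on the slide.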

Variable

Parameter

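*Loose. C., Klepeis. J.L., and Floudas. C.A., Proteins, 2004, 54, 303-314.
*Rajgaria. R., McAllister. S., and Floudas C.A. Proteins, 2006, 65, 726-741.

Force Field – Formulation*

Anfinsen's hypothesis was used as the main criterion for energy evaluation: the native conformation should have a lower energy than its decoys.

Bounds on the energy parameters: $\theta_{IC,ID} \in [-25, 25]$ for every amino-acid combination IC and bin ID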

Many more constraints – based on physical properties of interacting amino acids

*Loose. C., Klepeis. J.L., and Floudas. C.A., Proteins ,2004, 54, 303-314. *Rajgaria. R., McAllister. S., and Floudas C.A. Proteins, 2006, 65, 726-741. Constraints

A yhm ,n

wsi ,sj

• Smooth profile (Tobi and Elber*, 2000)
• Decrease in effectiveness at long distances

*Tobi. D., and Elber. R., Proteins, 2000, 41, 40-46.

Hydrophobic Constraints*
• Captures interaction between certain amino acids (Bahar and Jernigan, 1997)
• Favorable interaction at 4-6.5 Å between hydrophobic groups
• "Energy well" formation at around 4.5 to 5.0 Å (below this, "steric effects"; above this, "insufficient contact")

Table 2: Amino Acid Classification

*Loose. C., Klepeis. J.L., and Floudas. C.A., Proteins, 2004, 54, 303-314.

High Resolution Decoys* - Idea

Goal
To create a large number of near-native protein structures for a non-homologous set of proteins that span the Protein Data Bank.

Hypothesis
High quality near-native structures maintain similar Cα-Cα distances for the hydrophobic residues contained in the elements of secondary structure.

*Rajgaria. R., McAllister. S., and Floudas C.A. Proteins, 2006, 65, 726-741. High Resolution Decoys - Generation

The Set: 1482 non-homologous proteins from Skolnick and co-workers*.

Method

– Identify hydrophobic residues in the secondary structure

– If protein contains little secondary structure then consider all hydrophobic residues.

– Introduce a range of distance variations for the selected residues (8 values between 0.5 Å and 5.0 Å).

– An ensemble of 200 structures is created through torsion angle dynamics enforcing the distance bounds on the hydrophobic core.

*Zhang. Y. and Skolnick J. PNAS, 2004, 101, 7594-99.

Method and Implementation

Training
• 1250 proteins and 500 decoys of each protein
• LP formulation to optimize energy parameters
• Due to limited computer memory, only a small subset of high quality decoys was used at a time
• An iterative dropping scheme was used to include all decoys in force field generation

Testing
• High Resolution Set: 150 randomly selected proteins, 500 decoys of each protein
• Medium Resolution Set: 151 randomly selected proteins, 200 decoys of each protein

Minimum RMSD distribution, High Resolution Set
RMSD to native [Å]   Training Set   Test Set
0.0-0.5              12             1
0.5-1.0              458            60
1.0-1.5              607            74
1.5-2.0              173            15

Minimum RMSD distribution, Medium Resolution Set
RMSD to native [Å]   Test Set
3.0-16.0             150

Results* – Testing the Force Field

Evaluation Metrics
– average rank
– number of first ranked proteins
– average RMSD
– Z-score

Test on High Resolution Decoys

FF name   Avg. Rank   # Firsts   Avg. RMSD (Å)   Avg. Z-score
HR        1.87        113        0.45            2.11
LKF       39.45       17         1.72            1.55
TE-13     19.94       92         0.81            3.15
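The evaluation metrics listed for these tests can be sketched in a few lines. The slides do not spell out the definitions, so this example assumes common ones: the rank of the native structure is its position when native and decoys are sorted by energy, and the Z-score is the gap between the mean decoy energy and the native energy in units of the decoy standard deviation. All function names and the sample energies are illustrative.

```python
from statistics import mean, pstdev


def native_rank(native_energy, decoy_energies):
    """Rank of the native structure among native + decoys (1 = lowest energy)."""
    return 1 + sum(1 for e in decoy_energies if e < native_energy)


def z_score(native_energy, decoy_energies):
    """Assumed definition: (mean decoy energy - native energy) / decoy std. dev."""
    return (mean(decoy_energies) - native_energy) / pstdev(decoy_energies)


def summarize(test_set):
    """test_set: list of (native_energy, [decoy energies]) tuples, one per protein."""
    ranks = [native_rank(n, d) for n, d in test_set]
    zs = [z_score(n, d) for n, d in test_set]
    return {
        "avg_rank": mean(ranks),
        "firsts": sum(r == 1 for r in ranks),
        "avg_z": mean(zs),
    }


# Made-up energies for two proteins with two decoys each.
example = [(-10.0, [-8.0, -6.0]), (-5.0, [-7.0, -3.0])]
summary = summarize(example)
```

A force field that ranks the native first for most proteins and gives a large positive Z-score separates native structures from decoys well; the average RMSD column is not covered by this sketch.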

*Rajgaria. R., McAllister. S., and Floudas C.A. Proteins, 2006, 65, 726-741.

Results – Testing the Cα-Cα Distance Dependent Force Field

Test on Medium Resolution Decoys
• Common proteins between the HR training set and the LKF test set were removed

FF name   Avg. Rank   # Firsts   Avg. Z-score   Avg. RMSD (Å)
HR        4.32        86/110*    3.83           1.90
LKF       5.84        93/151     3.08           3.51
TE-13**   17.36       43/131     2.01           not available

**Tobi. D., and Elber. R., Proteins, 2000, 41, 40-46.

Side Chain Centroid Based Force Field

*Rajgaria. R., McAllister. S. R., and Floudas C.A. Side Chain Based Force Field

Importance
– The Cα-based formulation disregards the presence of the side chain atoms
– Inclusion of side chain atoms might improve the energy estimation
– Side chain dependence is needed for protein design problems

Need to revisit the interaction center definition
– the "effective" distance range might be different
– define "centroid"

Force Field – Side Chain Based Formulation

Side Chain Centroid definition
[Figure: side chain centroids illustrated for Pro, Arg, Phe, and Tyr]

Force Field – Side Chain Based Formulation

• Side chain centroid distance dependence
• 6-bin definition (ID); more resolution for bins 2 to 5
• 210 amino-acid combinations (IC)
• 1260 energy variables (210 × 6)
• Energy calculation: sum of pairwise interactions

Table 5: 6-Bin Definition

Bin ID   Centroid distance [Å]
1        4-5
2        5-5.5
3        5.5-6
4        6-6.5
5        6.5-7
6        7-8

Table 6: 7-Bin Definition

Bin ID   Centroid distance [Å]
1        4-5
2        5-5.5
3        5.5-6
4        6-6.5
5        6.5-7
6        7-8
7        8-9

Results – Testing the Force Field

Testing Centroid Based Force Field on High Resolution Decoys

FF name     Avg. Rank   # Firsts   Avg. RMSD (Å)   Avg. Z-score
6bin-HRSC   2.49        128/148    0.29            3.62
7bin-HRSC   2.01        125/148    0.32            3.39
HR          1.87        113/150    0.45            2.11
LKF         39.45       17/150     1.72            1.55
TE-13*      19.94       92/148     0.81            3.15

*Rajgaria. R., McAllister. S. R., and Floudas C.A., Proteins, Accepted for publication, (2007)

*Tobi. D., and Elber. R., Proteins, 2000, 41, 40-46.

De Novo Protein Design Framework
Klepeis, Floudas, Lambris, Morikis 2003, 2004; Fung, Taylor, Floudas 2005, 2007

Sequence selection
• Identify target template for desired fold; specify coordinates of backbone
• Identify possible residue mutations
• Introduce distance dependent pairwise potential based on Cα
• Generate rank-ordered list from mixed-integer linear (MILP) model

Fold Validation: Specificity
• Model selected sequences using flexible, detailed energetics
• Employ global optimization for free system
• Employ global optimization for system constrained to template
• Calculate relative probability for structures similar to desired fold

Sequence Selection : Key Ideas

• Consider a template peptide of n positions

• At each position $i = 1, 2, \ldots, n$ there can be $j = 1, 2, \ldots, m_i$ mutations

• Define equivalent sets $k = 1, 2, \ldots, n$ and $l = 1, 2, \ldots, m_k$
• Require $k > i$ to represent all unique interactions
• Introduce 0-1 variables to indicate possible mutations at a given position

\[
y_i^j = \begin{cases} 1 & \text{if residue type } j \text{ is in position } i \\ 0 & \text{otherwise} \end{cases}
\qquad
y_k^l = \begin{cases} 1 & \text{if residue type } l \text{ is in position } k \\ 0 & \text{otherwise} \end{cases}
\]

Mixed-integer Nonlinear Model

\[
\min_{y_i^j,\, y_k^l} \sum_{i=1}^{n} \sum_{j=1}^{m_i} \sum_{k=i+1}^{n} \sum_{l=1}^{m_k} E_{ik}^{jl}(x_i, x_k)\, y_i^j\, y_k^l
\quad \text{subject to} \quad \sum_{j=1}^{m_i} y_i^j = 1 \quad \forall i
\]

• $E_{ik}^{jl}$ is the energy for the protein with residue j at position i and residue l at position k
• taken from a pairwise distance dependent energy function of the Cα positions $x_i$ and $x_k$
• parameters derived from MILP model to select native over low energy decoys

Important Remarks
• $y_i^j$ and $y_k^l$ are binary variables that control the residue type at a given position
• Binary variables appear bilinearly: the model is nonconvex

Mixed-integer Linear Reformulation

• Transform bilinear combinations to linear variables $w_{ik}^{jl}$ (Floudas 1995)
• Reproduce properties of the original formulation with constraints:
  - if $y_i^j = y_k^l = 0$ then $w_{ik}^{jl} = 0$
  - if $y_i^j = y_k^l = 1$ then $w_{ik}^{jl} = 1$
  - if $y_i^j$ or $y_k^l = 0$ then $w_{ik}^{jl} = 0$
• Use Reformulation Linearization Technique (RLT) based constraints (Sherali & coworkers) to reduce the integrality gap
• Prove global optimality

Compstatin
Potent inhibitor of third component of complement
with Dr. John Lambris (Univ. of Pennsylvania) and Dr. Dimitri Morikis (Univ. of California, Riverside)

Structural features
• Cyclic, 13 residues
• Disulfide Bridge Cys2-Cys12
• Central beta-turn Gln5-Asp6-Trp7-Gly8
• Hydrophobic core
• Acetylated form displays higher inhibitory activity

Functional features
• Binds to and inactivates third component of complement
• Structure of bound complex not yet available

Sequence Selection : Compstatin
Design a more potent C3 inhibitor

Variable positions
• Conserve cystine residues (maintain cyclic nature of peptide)
• Conserve turn residues (do not overstabilize the turn)

Consensus results from top sequences

Position   Exp
1          A, V
4          Y, V
9          T, F, A
10         H
11         T, V, A, F, H
13         V, A, F

Key finding from computations
• His conserved at position 10
• Position 11 provides most variation: maintain Arg
• Selections at positions 4 and 9 allow for turn flexibility

New Enhancements of Quadratic Assignment like Models

• Three new algorithmic enhancement components to consider:
  1. RLT with inequality constraints
  2. Triangle inequalities
  3. Preprocessing via DEE theorem

• Different combinations of the three components incorporated into the QA-like model to check for computational performance
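The preprocessing idea can be illustrated with a small sketch. The slides do not state which form of the DEE theorem was used, so this example assumes the original dead-end elimination criterion applied to a model with pair energies only: residue type r at position i is removed when some alternative t has a worst-case interaction sum that is still lower than the best-case sum of r. The data layout, function names, and energies below are made up for the illustration.

```python
from itertools import product


def pair_energy(energy, i, r, k, s):
    """Look up a pair energy regardless of position order (table stores i < k)."""
    return energy[(i, r, k, s)] if i < k else energy[(k, s, i, r)]


def dee_pass(energy, choices):
    """One pass of the original DEE criterion for a pair-energy-only model.

    r at position i is eliminated if an alternative t exists with
    sum_k min_s E(i_r, k_s) > sum_k max_s E(i_t, k_s).
    """
    pruned = {}
    for i, types in choices.items():
        others = [k for k in choices if k != i]
        best = {r: sum(min(pair_energy(energy, i, r, k, s) for s in choices[k])
                       for k in others) for r in types}
        worst = {r: sum(max(pair_energy(energy, i, r, k, s) for s in choices[k])
                        for k in others) for r in types}
        pruned[i] = [r for r in types
                     if not any(best[r] > worst[t] for t in types if t != r)]
    return pruned


def brute_force_min(energy, choices):
    """Enumerate every sequence and return (lowest energy, sequence)."""
    positions = sorted(choices)
    best = None
    for combo in product(*(choices[p] for p in positions)):
        e = sum(pair_energy(energy, positions[a], combo[a], positions[b], combo[b])
                for a in range(len(positions))
                for b in range(a + 1, len(positions)))
        if best is None or e < best[0]:
            best = (e, combo)
    return best


# Made-up instance: residue type "B" at position 0 is always unfavorable.
choices = {0: ["A", "B"], 1: ["A", "B"], 2: ["A", "B"]}
energy = {}
for i, k in [(0, 1), (0, 2), (1, 2)]:
    for r in choices[i]:
        for s in choices[k]:
            if i == 0 and r == "B":
                energy[(i, r, k, s)] = 5.0
            else:
                energy[(i, r, k, s)] = -1.0 if r == s else 0.5

pruned = dee_pass(energy, choices)
optimum_before = brute_force_min(energy, choices)
optimum_after = brute_force_min(energy, pruned)
```

Because the elimination test is strict, the global minimum sequence is never removed, so the optimization solved after preprocessing has the same optimum over a smaller search space.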

Fung, Rao, Floudas, Prokopyev, Pardalos, Rendl, J. Comb. Opt. (2005)

Stage one: New formulation for sequence selection

Fung, Floudas et al., J. Comb. Optim., 2005; Fung, Taylor, Floudas et al., OMS, 2007

New sequence selection model for single template structure:

\[
\min_{y_i^j,\, y_k^l} \sum_{i=1}^{n} \sum_{j=1}^{m_i} \sum_{k=i+1}^{n} \sum_{l=1}^{m_k} E_{ik}^{jl}(x_i, x_k)\, w_{ik}^{jl}
\]
subject to
\[
\sum_{j=1}^{m_i} y_i^j = 1 \quad \forall i
\]
\[
\sum_{j=1}^{m_i} w_{ik}^{jl} = y_k^l \quad \forall i,\ k > i,\ l
\]
\[
\sum_{l=1}^{m_k} w_{ik}^{jl} = y_i^j \quad \forall i,\ k > i,\ j
\]
\[
y_i^j,\ y_k^l,\ w_{ik}^{jl} \in \{0, 1\} \quad \forall i,\ j,\ k > i,\ l
\]

- obtained by declaring $w_{ik}^{jl}$ as binary and using new reduction properties
- we compared the performance of the two models

Sequence selection models for flexible template with multiple structures: NEW MODELS

• We developed two different models:

1. Formulation using a weighted average of the structures
   In the objective function, $E_{ik}^{jl}$ becomes the weighted average $\sum_n w_{ikn} E_{ik}^{jl}$ over the distance bins $n$, with

\[
w_{ikn} = \frac{\text{no. of structures where } dis(i,k) \text{ falls into dist. bin } n}{\text{total no. of structures}}
\]
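The bin weights for a flexible template can be computed directly from the ensemble of template structures. The sketch below is illustrative only: it reuses the 8-bin Cα-Cα edges from the HR force field table, assumes that a bin includes its lower edge, and assumes that structures whose distance falls outside all bins still count in the denominator. The function names and the per-bin energies are made up.

```python
from bisect import bisect_right
from collections import Counter

# 8-bin Ca-Ca definition in Angstrom (assumed lower-edge-inclusive bins).
BIN_EDGES = [3.0, 4.0, 5.0, 5.5, 6.0, 6.5, 7.0, 8.0, 9.0]


def bin_id(distance):
    """1-based bin ID of a distance, or None if it falls outside every bin."""
    if distance < BIN_EDGES[0] or distance >= BIN_EDGES[-1]:
        return None
    return bisect_right(BIN_EDGES, distance)


def bin_weights(distances):
    """Weights w_ikn for one position pair (i, k).

    distances: dis(i, k) measured in each template structure.
    Returns {bin n: fraction of structures whose distance falls in bin n}.
    """
    counts = Counter(bin_id(d) for d in distances)
    total = len(distances)
    return {n: c / total for n, c in counts.items() if n is not None}


def weighted_pair_energy(weights, energy_by_bin):
    """Weighted-average energy of one residue pair over the distance bins."""
    return sum(w * energy_by_bin[n] for n, w in weights.items())


# Four template structures for one pair of positions (made-up distances).
weights = bin_weights([5.2, 5.3, 5.6, 6.1])
# Hypothetical per-bin energies for one residue pair (j, l).
pair_energy = weighted_pair_energy(weights, {3: -2.0, 4: -1.0, 5: 0.0})
```

Here half of the structures place the pair in bin 3 and a quarter each in bins 4 and 5, so the averaged pair energy is 0.5 × (-2.0) + 0.25 × (-1.0) + 0.25 × 0.0.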

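\[
\min_{y_i^j,\, y_k^l} \sum_{i=1}^{n} \sum_{j=1}^{m_i} \sum_{k=i+1}^{n} \sum_{l=1}^{m_k} \sum_{d=1}^{b_m} E_{ik}^{jl}(x_i, x_k)\, wt(x_i, x_k, d)\, w_{ik}^{jl}
\]
subject to
\[
\sum_{j=1}^{m_i} y_i^j = 1 \quad \forall i
\]
\[
\sum_{j=1}^{m_i} w_{ik}^{jl} = y_k^l \quad \forall i,\ k > i,\ l
\]
\[
\sum_{l=1}^{m_k} w_{ik}^{jl} = y_i^j \quad \forall i,\ k > i,\ j
\]
\[
y_i^j,\ y_k^l,\ w_{ik}^{jl} \in \{0, 1\} \quad \forall i,\ j,\ k,\ l
\]

[Figure: flexible design template with positions i and k; energy versus dis(i,k) [Å] with bin edges 3, 4, 5, 5.5, 6, 6.5, 7, 8, 9 for the HR forcefield (Rajgaria and Floudas, 2006); weights $w_{ik2}$, $w_{ik3}$, $w_{ik4}$ marked]

Sequence selection models for flexible template with multiple structures

2. Formulation using binary distance bin variables – Most General Model
   $E_{ik}^{jl}$ is changed to $\sum_n b_{ikn} E_{ik}^{jl}$ in the objective function, where $b_{ikn} = 1$ if $dis(i,k)$ falls into dist. bin $n$ and $0$ otherwise, s.t. $\sum_n b_{ikn} = 1$

\[
\min_{y_i^j,\, y_k^l} \sum_{i=1}^{n} \sum_{j=1}^{m_i} \sum_{k=i+1}^{n} \sum_{l=1}^{m_k} \sum_{d:\, disbin(x_i, x_k, d) = 1} E_{ik}^{jl}(x_i, x_k)\, z_{ikd}^{jl}
\]
subject to
\[
\sum_{j=1}^{m_i} y_i^j = 1 \quad \forall i
\]
\[
\sum_{j=1}^{m_i} w_{ik}^{jl} = y_k^l \quad \forall i,\ k > i,\ l
\]
\[
\sum_{l=1}^{m_k} w_{ik}^{jl} = y_i^j \quad \forall i,\ k > i,\ j
\]
\[
\sum_{d:\, disbin(x_i, x_k, d) = 1} b_{ikd} = 1 \quad \forall i,\ k > i
\]
\[
b_{ikd} + w_{ik}^{jl} - 1 \le z_{ikd}^{jl} \le b_{ikd} \quad \forall i,\ j,\ k > i,\ l,\ d
\]
\[
\sum_{d:\, disbin(x_i, x_k, d) = 1} z_{ikd}^{jl} = w_{ik}^{jl} \quad \forall i,\ j,\ k > i,\ l
\]
\[
b_{ikd} + b_{kpd'} \le 1 \quad \forall i,\ k > i,\ p > k,\ d,\ d'
\]
where the last constraint applies to bin pairs $(d, d')$ with $disbin(x_i, x_k, d) = 1$ and $disbin(x_k, x_p, d') = 1$ that violate the triangle-inequality condition on $dis(i,p)$ and the bounds and midpoints of bins $d$ and $d'$, and
\[
y_i^j,\ y_k^l,\ w_{ik}^{jl},\ b_{ikd},\ b_{kpd'},\ z_{ikd}^{jl} \in \{0, 1\}
\]

[Figure: flexible design template with positions i and k; energy versus dis(i,k) [Å] with bin edges 3, 4, 5, 5.5, 6, 6.5, 7, 8, 9 for the HR forcefield (Rajgaria and Floudas, 2006)]

Antibacterial Peptides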

Beta-Defensins

• Family of antimicrobial peptides
• Cationic peptides of 28-42 AAs
• Structure for only (2) human BDs
• Structure-function unknown
• Low sequence identity
• hBD-2 10x more potent than hBD-1

Sequence alignment, positions 25-44:
  hBD-2    IGDPVTCLKSGAICHPVFCP
  hBD-1    ..DHYNCVSSGGQCLYSACP
  mBD-7    .NSKRACYREGGECLQ.RCI
  mBD-8    .NEPVSCIRNGGICQY.RCI
  hBD-3    TLQKYYCRVRGGRCAVLSCL
  hBD-4    FELDRICGYGTARCRK.KCR
  mBD-1    ..DQYKCLQHGGFCLRSSCP
  mBD-2    .AELDHCHTNGGYCVRAICP
  mBD-3    INNPVSCLRKGGRCWN.RCI
  mBD-4    INNPITCMTNGAICWG.PCP
  bBD-1    ..DFASCHTNGGICLPNRCP
  bBD-2    ..NHVTCRINRGFCVPIRCP
  bBD-12   ..GPLSCGRNGGVCIPIRCP

Sequence alignment, positions 45-64:
  hBD-2    RRYKQIGTCGLPGTKCCKKP
  hBD-1    IFTKIQGTCYRGKAKCCK..
  mBD-7    GLFHKIGTC.NFRFKCCKFQ
  mBD-8    GLRHKIGTC.GSPFKCCK..
  hBD-3    PKEEQIGKCSTRGRKCCRRK
  hBD-4    SQEYRIGRCPNTYA.CCLRK
  mBD-1    SNTKLQGTCKPDKPNCCKS.
  mBD-2    PSARRPGSCFPEKNPCCKYM
  mBD-3    GNTRQIGSCGVPFLKCCKRK
  mBD-4    TAFRQIGNCGHFKVRCCKIR
  bBD-1    GHMIQIGICFRPRVKCCRSW
  bBD-2    GRTRQIGTCFGPRIKCCRSW

Objective
Design improved antibacterial peptides

De Novo Design of hβD-2
Structural features of human beta defensin-2:

Structural Feature   Positions
β strands            14-16, 25-28, 36-39
α helix              5-10
S-S bonds            8-37, 15-30, 20-38
β-turns              16-19, 21-24, 32-35
Hairpins             25-29
Bulges               27, 28, 37

Constraints to add to model:
At least 2 hydrophobics on each β strand:

\[
\sum_{i=14}^{16} \left( y_i^{Ala} + y_i^{Cys} + y_i^{Ile} + y_i^{Leu} + y_i^{Met} + y_i^{Phe} + y_i^{Trp} + y_i^{Tyr} + y_i^{Val} \right) \ge 2
\]
\[
\sum_{i=25}^{28} \left( y_i^{Ala} + y_i^{Cys} + y_i^{Ile} + y_i^{Leu} + y_i^{Met} + y_i^{Phe} + y_i^{Trp} + y_i^{Tyr} + y_i^{Val} \right) \ge 2
\]

De Novo Design of hβD-2
More constraints to add…

Total number of hydrophobics more than wild type sequence for higher stability (Kuhlman & Baker, Cur. Op. Struct. Bio., 2004):

\[
\sum_{i} \left( y_i^{Ala} + y_i^{Cys} + y_i^{Ile} + y_i^{Leu} + y_i^{Met} + y_i^{Phe} + y_i^{Trp} + y_i^{Tyr} + y_i^{Val} \right) \ge 17
\]

• Using PSI-BLAST, a homology search was run to determine properties that are conserved among hβD and similar sequences.
• Conserved properties are translated into constraints:

Charge properties of the 97 hβD homologs from PSI-BLAST

                     lower bound   upper bound
Positive charges     5             10
Negative charges     0             2
Net charges          4             9
Total helix charge   0             +3

\[
5 \le \sum_{i} \left( y_i^{Arg} + y_i^{Lys} \right) \le 10
\]
\[
0 \le \sum_{i} \left( y_i^{Asp} + y_i^{Glu} \right) \le 2
\]
\[
4 \le \sum_{i} \left( y_i^{Arg} + y_i^{Lys} - y_i^{Asp} - y_i^{Glu} \right) \le 9
\]
\[
0 \le \sum_{i=5}^{10} \left( y_i^{Arg} + y_i^{Lys} - y_i^{Asp} - y_i^{Glu} \right) \le 3
\]

De Novo Design of hβD-2
More constraints to add…

Amino acid occurrence of the 97 hβD homologs from PSI-BLAST, and the bounds imposed on the number of occurrences $\sum_i y_i^{aa}$ in the model:

Amino acid   Homologs: lower   Homologs: upper   Model constraint
Ala          0                 3                 0 to 3
Arg          1                 9                 1 to 9
Asn          0                 6                 0 to 6
Asp          0                 2                 0 to 2
Cys          4                 7                 6 to 6
Gln          0                 3                 0 to 3
Glu          0                 3                 0 to 3
Gly          3                 7                 none
His          0                 4                 0 to 4
Ile          0                 6                 0 to 6
Leu          0                 4                 0 to 4
Lys          0                 7                 0 to 7
Met          0                 3                 0 to 3
Phe          0                 4                 0 to 4
Pro          0                 5                 5 to 5
Ser          0                 6                 0 to 6
Thr          0                 4                 0 to 4
Trp          0                 1                 0 to 2
Tyr          0                 4                 0 to 4
Val          0                 6                 0 to 6

• No. of Cys fixed at 6 (3 S-S bridges)
• No. of Pro (inflexible) fixed at 5 (same as native sequence)
• Max. Trp set to 2 to allow greater flexibility
• No constraint on Gly

De Novo Design of hβD-2
In Silico Sequence Selection
• Pos 1, 3, 12, 31, and 34 fixed at Gly
• Pos 5, 17, 21, 33, and 41 fixed at Pro
• Pos 8, 15, 20, 30, 37, and 38 fixed at Cys
• Full combinatorial optimization (all 20 amino acids allowed) for other positions

Complexity: $20^{25}$ or $3.4 \times 10^{32}$ sequences
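The size of this search space follows from the number of positions left free after fixing Gly, Pro, and Cys. A short check, using the fixed positions listed for the in silico sequence selection (the variable names are illustrative):

```python
# Fixed positions in the 41-residue template, as listed for sequence selection.
FIXED = {
    "Gly": [1, 3, 12, 31, 34],
    "Pro": [5, 17, 21, 33, 41],
    "Cys": [8, 15, 20, 30, 37, 38],
}
SEQUENCE_LENGTH = 41
AMINO_ACIDS = 20

n_free = SEQUENCE_LENGTH - sum(len(v) for v in FIXED.values())
search_space = AMINO_ACIDS ** n_free
formatted = f"{search_space:.1e}"
```

With 16 of the 41 positions fixed, 25 positions remain and each can take any of the 20 amino acids.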

Global energy minimum solution:

Pos. 1-10:    Gly Arg Gly Tyr Pro Arg Asn Cys Asp Thr
Pos. 11-20:   Lys Gly Tyr Tyr Cys Tyr Pro Met Ala Cys
Pos. 21-30:   Pro Arg His Arg His Phe Phe His Met Cys
Pos. 31-41:   Gly Met Pro Gly Phe Phe Cys Cys Ala His Pro

Quadratic Assignment Like Formulation Comparison

• Problem 2: full combinatorial optimization at pos. 2, 4, 6, 7, 9, 10, 11, 13, 14, and 16 (10 positions in total); all other pos. fixed at their native residues
  Sequence search space = 1.0 × 10^13
• Problem 3: full combinatorial optimization at pos. 2, 4, 6, 7, 9, 10, 11, 13, 14, 16, 18, 19, 22, 23, and 24 (15 positions in total); all other pos. fixed at their native residues
  Sequence search space = 3.3 × 10^19
• Problem 4: fix native Gly at pos. 1, 3, 12, 28, 31, and 34; fix native Pro at pos. 5, 17, 21, 33, and 41; fix native Cys at pos. 8, 15, 20, 30, 37, and 38; full combinatorial optimization at all other positions (24 positions in total)
  Sequence search space = 1.7 × 10^31
• Problem 5: only fix native Cys at pos. 8, 15, 20, 30, 37, and 38; full combinatorial optimization at all other positions (35 positions in total)
  Sequence search space = 3.4 × 10^45

Quadratic Assignment Like Formulation Comparison

Computation time comparison of the 12 formulations:

Problem     Sequence search space   F1      F2            F3       F4       F5      F6      F7 (no cutoff)   F7 (cutoff = -40)
Problem 1   1.3 × 10^8              0.14    0.30          0.05     0.04     0.05    0.15    0.23             0.21
Problem 2   1.0 × 10^13             1.93    34874         12.80    65.04    13.23   2.16    44.02            3.01
Problem 3   3.3 × 10^19             3.01    70.14% gap*   137.85   2052.2   278.0   3.22    64.39            2.87
Problem 4   1.7 × 10^31             38.14   -             -        -        -       31.67   -                29.06
Problem 5   3.4 × 10^45             74713   -             -        -        -       30006   -                65575

*Obtained after 100,000 CPU sec

Problem     Sequence search space   F8      F9 (cutoff = -40)   F10 (cutoff = -40)   F11     F12 (cutoff = -40)
Problem 1   1.3 × 10^8              0.16    0.11                0.16                 0.17    0.11
Problem 2   1.0 × 10^13             2.15    2.26                2.01                 2.52    2.10
Problem 3   3.3 × 10^19             2.94    3.31                3.03                 3.43    3.04
Problem 4   1.7 × 10^31             31.08   35.48               35.92                25.00   36.15
Problem 5   3.4 × 10^45             32657   52276               61872                24388   57569

The cutoff refers to the cutoff for triangle inequalities. 67% reduction in CPU time (Problem 5, F11 versus F1).

Formulations
F1: Base case - original O(n^2) formulation from Klepeis et al. (2003)(2004)
F2: original O(n^2) formulation without RLTs
F3: O(n) formulation from Oral and Kettani (1990)(1992)
F4: O(n) formulation from Oral and Kettani (1990)(1992)
F5: O(n) formulation from Pardalos et al. (2004)
F6: original O(n^2) formulation with inequality RLT constraints
F7: original O(n^2) formulation with inequality RLT constraints and triangle inequalities
F8: original O(n^2) formulation with inequality RLT constraints and preprocessing
F9: original O(n^2) formulation with inequality RLT constraints and triangle inequalities and preprocessing
F10: original O(n^2) formulation with triangle inequalities
F11: original O(n^2) formulation with preprocessing
F12: original O(n^2) formulation with triangle inequalities and preprocessing

Sequence selection: Comparison

• Test case: human β-defensin-2 (41 amino acids; 3 Cys-Cys bridges)

First problem
- mutate all positions except CYS; allow all 20 amino acids for each mutated position
- no biological constraints

Problem complexity   No. of linear biological constraints   CPU time, old formulation [s]   CPU time, new formulation [s]
3.4 × 10^45          none                                   53,263                          649

Second problem
- mutation set derived from SASA patterning
  SASA < 20%: core. Allow only hydrophobic amino acids.
  SASA 20-50%: intermediate. Allow all amino acids except CYS.
  SASA > 50%: surface. Allow only hydrophilic amino acids.
- 49 biological constraints: bounds on charges, hydrophobic content, and amino acid occurrence from PSI-BLAST

Problem complexity   No. of linear biological constraints   CPU time, old formulation [s]   CPU time, new formulation [s]
6.4 × 10^37          49                                     4,578                           14

- CPU times generated using CPLEX 9.0 on a single Pentium IV 3.2 GHz processor

De Novo Protein Design Framework
Klepeis, Floudas, Lambris, Morikis 2003, 2004; Fung, Taylor, Floudas 2005, 2007

Sequence selection
• Identify target template for desired fold; specify coordinates of backbone
• Identify possible residue mutations
• Introduce distance dependent pairwise potential based on Cα
• Generate rank-ordered energetic list from mixed-integer linear (MILP) model

Fold Validation: Specificity
• Model selected sequences using flexible, detailed energetics
• Employ global optimization for free system
• Employ global optimization for system constrained to template
• Calculate relative probability for structures similar to desired fold

Fold Validation : Astro-Fold based

How to discriminate among the selected sequences

For each selected sequence solve (2) folding problems
• Free folding calculation

• Template constrained folding calculation

Quantify the specificity of the ensemble of structures similar to the template using probability calculation

Enhanced ASTRO-FOLD

Helix Prediction (Free Energy Calculations)
- Detailed atomistic modeling
- Simulations of local interactions

β-sheet Prediction (Combinatorial Optimization)
- Novel hydrophobic modeling
- Predict list of optimal topologies

Branches: α proteins; α/β proteins

Interhelical Contacts (MILP Optimization Model)
- Maximize common residue pairs
- Rank-order list of topologies

Flexible Stems Loop Prediction (Novel Clustering Methodology)
- Dihedral angle sampling
- Discard conformers by clustering

Derivation of Restraints
- Dihedral angle restrictions
- Cα-Cα distance constraints

Improved Distance Restraints
- Iterative LP-based bound tightening approach

Tertiary Structure Prediction (Global Optimization and Torsional Angle Dynamics)
- Structural data from previous stages
- Prediction via novel solution approach

Force Field for High and Medium Resolution Decoys (Large-scale linear programming)
- Novel linear programming approach
- Distinguishes high resolution structures

Fold Validation: NMR like framework

• The protein structure prediction method ASTRO-FOLD, based on first principles and deterministic global optimization, is computationally expensive for large proteins (>200 residues)
• New fold specificity calculation method

For each amino acid sequence:

CYANA
• input upper and lower Cα-Cα distance bounds and torsional angle bounds based on observations from the design templates
• ensemble generation via high temperature torsion angle dynamics and annealing torsion dynamics

Hundreds of structures

TINKER: local energy minimization and evaluation
• BFGS Quasi-Newton optimization algorithm
• directed by AMBER full-atomistic force field

Stage two: Fold Validation

• For each sequence from stage one, a specificity factor to the design template(s) is calculated
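A specificity factor of this kind compares two conformer ensembles through their Boltzmann-weighted sums: the sum over conformers of the new sequence divided by the sum over conformers of the native sequence. The sketch below is a minimal illustration under stated assumptions: energies are taken to be in kcal/mol, the temperature default and the function names are made up, and a log-sum-exp form is used only to avoid overflow for large negative energies.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K); energy units are an assumption


def log_boltzmann_sum(energies, temperature):
    """Return log(sum_i exp(-E_i / kT)) computed in a numerically stable way."""
    beta = 1.0 / (K_B * temperature)
    e_min = min(energies)
    shifted = sum(math.exp(-beta * (e - e_min)) for e in energies)
    return -beta * e_min + math.log(shifted)


def fold_specificity(new_energies, native_energies, temperature=298.15):
    """Ratio of Boltzmann sums: conformers of the new sequence over the native sequence."""
    log_ratio = (log_boltzmann_sum(new_energies, temperature)
                 - log_boltzmann_sum(native_energies, temperature))
    return math.exp(log_ratio)


# Made-up conformer energies for illustration.
same = fold_specificity([-10.0], [-10.0])
doubled = fold_specificity([-10.0, -10.0], [-10.0])
improved = fold_specificity([-12.0, -11.5], [-10.0, -9.5])
```

A value above 1 indicates that the template-like ensemble of the new sequence is more favorable than that of the native sequence under this measure.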

FOLD SPECIFICITY CALC.

\[
f_{spec} = \frac{\sum_{i \,\in\, \text{conformers of new sequence}} \exp\left(-E_i / kT\right)}{\sum_{i \,\in\, \text{conformers of native sequence}} \exp\left(-E_i / kT\right)}
\]

Framework allows for true backbone flexibility

• True backbone flexibility: bounded continuous distance and dihedral angles

• Stage one
  - distance dependence of energy is discretized into bins
  - models for flexible template with multiple structures

• Stage two
  - upper and lower bounds on distance and dihedral angles input by user
  - CYANA and TINKER-AMBER consider all possible combinations of continuous distance and angle values between bounds

[Figure: flexible design template with positions i and k; energy versus dis(i,k) [Å] with bin edges 3, 4, 5, 5.5, 6, 6.5, 7, 8, 9 for the HR forcefield (Rajgaria and Floudas, 2006)]

De Novo Design of Inhibitors for Complement 3: Compstatin variants

with Prof. J.D. Lambris (U. Penn) and Prof. D. Morikis (UC, Riverside)

Compstatin Analogs
[Figure: compstatin analogs; legend: > 3x, 0.5 - 3x, < 0.5x]

De Novo Protein Design Framework

Klepeis, Floudas, Lambris, Morikis 2003, 2004; Fung, Taylor, Floudas, 2005, 2007

Sequence selection
• Identify target template for desired fold; specify coordinates of backbone
• Identify possible residue mutations
• Introduce distance dependent pairwise potential based on Cα
• Generate rank-ordered energetic list from mixed-integer linear (MILP) model

Fold Validation
• Model selected sequences using flexible, detailed energetics
• Employ global optimization for free system
• Employ global optimization for system constrained to template
• Calculate relative probability for structures similar to desired fold

Fold Specificity : Compstatin
• Determine ensemble for Free & Template systems
• Find probability for portion of Free ensemble within some deviation of Template ensemble

Compstatin Analogs
[Figure: compstatin analogs; legend: > 3x, 0.5 - 3x, < 0.5x]

In Silico De Novo Design

Ac-compstatin Analog Ac-V4Y/H9A x7 x16 Analog Ac-W4Y/H9A x45 Klepeis, Floudas, Morikis, Tsokos, Argyropoulos, Spruce, Lambris (2003) J. American Chemical Society. Klepeis, Floudas, Morikis, Lambris (2004) Ind. & Eng. Chem. Res. Fung, Floudas (2005) Redesign of Complement 3a

with Prof. J.D. Lambris (U. Penn) and Prof. D. Morikis (UC, Riverside)

Complement 3a

• Background:
  - fragment of the complement 3 protein, active mediator of inflammation
  - 77 residues, 3 S-S bonds, 4 α-helices
  - sequence of C-terminal (pos 73-77) primary binding site: LGLAR
  - extensive sequence-activity studies by Ember et al. (1991)
  - super-potent peptide (12-15 times more active than natural C3a), WWGKKYRASKLGLAR (pos 63-77), identified by Ember et al. (1991)

• De novo design of C3a:
  - redesign pos 63-68, 70-72
  - goal: identify peptides that are more active than natural C3a

[Figure: C3a structure with residues 17, 23, 37, 41, 47, 69, and 77 labeled; residues 1-12 not shown]

De novo design of C3a

• We used 3 sets of design templates:

1. Single structure from X-ray crystallography
  • Huber et al., 1980
  • Resolution: 3.2 Å
  • Residues 1 to 12 missing
  • Side-chain information is also missing

2. Flexible templates from MD simulations with GB implicit solvation
  • Initial structure: composite of C3a domain of C3's crystal structure (Janssen et al., 2005) for Val1-Ala70 and Huber et al.'s crystal structure for Ser71-Arg77
  • Starting from 10 ns, one structure generated at each 1 ns increment
  • 11 flexible template structures in total

3. Flexible templates from MD simulations with explicit water molecules
  • Structures generated using the same method as for the previous set of flexible templates, except that water molecules were treated explicitly in the MD simulations
  • 11 flexible template structures in total

Flexible design template with multiple structures for C3a

[Figure: overlaid template structures. Green: from MD simulations with GB implicit solvation. Magenta: from MD simulations with explicit water molecules.]

De novo design of C3a

• Structural deviation among the 3 sets of flexible design templates:

[Figure: overlaid structures. Cyan: single structure from X-ray crystallography. Purple: structure at 5 ns from MD simulations with GB implicit solvation. Green: structure at 5 ns from MD simulations with explicit water molecules.]

• The structural deviation means we should use all three sets of templates to get different predictions for active sequences

De novo design of C3a

• Forcefields and models applied for sequence selection:

Design templates                          Forcefields                                        Sequence selection models
Single X-ray crystal structure            HR Cα-Cα forcefield only                           Basic model for single structure
MD simulations with GB implicit solvent   HR Cα-Cα forcefield; HR centroid-centroid forcefield   Weighted average model for multiple structures; binary distance bin model for multiple structures
MD simulations with explicit water        HR Cα-Cα forcefield; HR centroid-centroid forcefield   Weighted average model for multiple structures; binary distance bin model for multiple structures

• Generated 500 sequences for each

De novo design of C3a

• Mutation set for sequence selection

Position   SASA    Classification   Mutated?
63         54.6%   surface          yes
64         41.1%   intermediate     yes
65         51.6%   surface          yes
66         49.9%   intermediate     yes
67         31.0%   intermediate     yes
68         46.4%   intermediate     yes
69         51.3%   surface          no
70         41.8%   intermediate     yes
71         36.4%   intermediate     yes
72         55.1%   surface          yes

Allowed residues: each mutated position uses one of the sets AILMFYWVRNDQEGHKST, AILMFYWV, or RNDQEGHKST; position 69 is kept as R.

Problem complexity = 2.59 × 10^9

• Biological constraint
  - Maintain native charge on helix (charge = +3):

\[
\sum_{i=63}^{69} \left( y_i^{Arg} + y_i^{Lys} - y_i^{Asp} - y_i^{Glu} \right) = 3
\]

• Fold specificity stage: upper and lower bounds on angles and distances are based on observations about the flexible template(s). Sequences from flexible templates generated with MD simulations

Design templates from MD simulations with Design templates from MD simulations with GB implicit solvation explicit water molecules

Forcefield: HR centroid-centroid (Rajgaria et al., 2007)

Sequences synthesized

Name       Sequence          Flexible templates   Sequence selection model
WT         LRRQHARASHLGLAR   n. a.                n. a.
WR-15      WWRSKWREEQLGLAR   ex. water            binary dist. bin
WR-15-1    WWQRRWRDEQLGLAR   gen. Born            wt. avg.
WR-15-2    WWRRQWRDEQLGLAR   gen. Born            wt. avg.
WR-15-3    WWQRRWRDERLGLAR   gen. Born            wt. avg.
WR-15-4    WWQRRWRDNQLGLAR   gen. Born            wt. avg.
WR-15-5    WWQRRWREERLGLAR   gen. Born            binary dist. bin
WR-15-6    WWRRQWRDERLGLAR   gen. Born            binary dist. bin
WR-15-7    WWRRQWRENQLGLAR   gen. Born            binary dist. bin
WR-15-8    WWRRSWREERLGLAR   gen. Born            binary dist. bin
WR-15-9    WWRNRWRENRLGLAR   gen. Born            binary dist. bin
WR-15-10   WWGKKYRASKLGLAR   n. a.                n. a.
WR-15-11   WWRRQWREDHLGLAR   ex. water            wt. avg.
WR-15-12   WWNRKWREDHLGLAR   ex. water            wt. avg.
WR-15-13   WWRRQWREEQLGLAR   ex. water            binary dist. bin
WR-15-15   WWRRQWREDKLGLAR   ex. water            binary dist. bin
WR-15-16   WWRRQWREEHLGLAR   ex. water            binary dist. bin
WR-15-17   WWRRHWREDQLGLAR   ex. water            binary dist. bin
WR-15-18   WWRRQWREEKLGLAR   ex. water            binary dist. bin
WR-15-19   WWRRQWREQKLGLAR   ex. water            binary dist. bin

WR-15-10 is the super-agonist from Ember et al., 1991.

Conclusions

De Novo Peptide Design : Structure to Function
• Novel method for sequence selection
  • Distance dependent pairwise interaction energy
  • MILP reformulation: Quadratic Assignment-Like
  • RLT constraints
  • Preprocessing via DEE
  • New Formulation
• Quantification of fold specificity
  • Template flexibility
  • Constrained and free energy calculations
  • Ranking of sequence-structure specificity
• Sequence Selection for Compstatin, Human beta defensin-2, C3a, and HIV-1
• Fold specificity for Compstatin analogs, C3a

Functionally enhanced peptides for C3 inhibition

Discovery in Proteomics: De Novo and Hybrid Methods via Tandem Mass Spectrometry

Professor Christodoulos A. Floudas
Department of Chemical Engineering
Program in Applied and Computational Mathematics
Department of Operations Research and Financial Engineering
Center for Quantitative Biology
Princeton University

Peptide Identification in Proteomics

Relevant Publications

• P.A. DiMaggio and C.A. Floudas, A mixed-integer optimization framework for de novo peptide identification, AIChE Journal, 53(1), 160-173 (2007).
• P.A. DiMaggio and C.A. Floudas, De novo peptide identification via tandem mass spectrometry and integer linear optimization, Anal. Chem., 79, 1433-1446 (2007).
• P.A. DiMaggio, C.A. Floudas, B. Lu, and J.R. Yates, A hybrid methodology for peptide identification using integer linear optimization, local database search, and QTOF or OrbiTrap tandem mass spectrometry, J. Proteome Res., 7, 1584-1593 (2008).

Presentation Outline

• Introduction to proteomics and review of peptide and protein identification using tandem mass spectrometry
• Survey of existing de novo and database algorithms for peptide identification
• De novo approach for peptide identification, PILOT
• Hybrid approach for peptide identification, based on our mixed-integer linear optimization model and algorithmic framework (denoted as PILOT_SEQUEL)
• Comparative studies of PILOT, PILOT_SEQUEL, and existing state-of-the-art database and hybrid methods on tandem MS from Ion Trap, QTOF, and OrbiTrap mass analyzers

Proteomics: Bottom-Up Peptide and Protein Identification via Tandem MS

[Figure: bottom-up workflow with experimental and computational branches. Protein level: protein sample, protein identifications (A, B, C), validation. Peptide level: enzymatic digestion, peptide mixture, peptide identifications (LKYVI, STCMYAR, DILNGG, GAWKLK, ILFAD), peptide grouping, validation. MS-MS spectra level: mixture separation, MS-MS sequencing, MS-MS spectra, database search.]

Problem Introduction and Definition

• Fundamental problem in proteomics: protein and peptide identification and quantification
• Advances in high-throughput experimentation: high-performance liquid chromatography (HPLC) coupled with tandem mass spectrometry (MS/MS)

[Figure: protein/peptide mixture, HPLC, electrospray ionization (ESI), fragmentation, MS and MS/MS]

Tandem MS/MS from CID

• Collision-induced dissociation (CID) causes a positively charged peptide to fragment along its backbone, producing several types of fragment ions in the tandem mass spectrum (e.g., a, b, c, x, y, and z ions)

[Figure: hypothetical parent peptide* with C-terminal residue R or K]

• Objective: Use these fragment ions to construct the amino acid sequence of the parent peptide

• Issues: The type of an ion peak (a, b, c, x, y, or z) in a tandem mass spectrum is not known a priori, and the primary sequence of a candidate peptide must be derived using ions of the same type
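To make the ion types concrete, the following is a minimal sketch of how singly charged b- and y-ion m/z values follow from a peptide sequence. It is an illustration only, not part of the original slides: the function names are hypothetical, and the residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in Da (standard values; I and L are isobaric).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O
PROTON = 1.00728   # mass of a proton


def b_ions(peptide):
    """Singly charged b-ion m/z values: N-terminal prefixes plus one proton."""
    out, total = [], 0.0
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        out.append(total + PROTON)
    return out


def y_ions(peptide):
    """Singly charged y-ion m/z values: C-terminal suffixes plus water plus one proton."""
    out, total = [], 0.0
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        out.append(total + WATER + PROTON)
    return out
```

For a tryptic peptide ending in R, the first y-ion computed this way lies near 175 Da. Consecutive ions of the same series differ by exactly one residue mass, which is why a candidate sequence has to be read from ions of a single type.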

*Adapted from http://www.matrixscience.com/help/fragmentation_help.html

Proteomics

• Specific Aim 1: Investigate and develop a de novo computational approach for peptide identification based exclusively on information of the ion peaks in the peptide spectrum
• Specific Aim 2: Study and develop a new hybrid in silico method which will combine the de novo approach of Specific Aim 1 with database techniques for peptide identification
• Specific Aim 3: Incorporate uncertainty into the de novo framework to address experimental uncertainty in problem parameters
• Specific Aim 4: Study and develop computational methods for protein identification given the de novo prediction and/or hybrid prediction of the individual peptides
• Specific Aim 5: Research and develop computational methods and experimental protocols for protein quantification

Peptide & Protein Identification via Tandem MS

• Database-based methods
  • Correlate the experimental spectra with spectra of peptides/proteins which exist in the databases
  • SEQUEST – Eng et al. (1994), Mascot – Perkins et al. (1999), SCOPE – Bafna and Edwards (2001), MS-CONVOLUTION and MS-ALIGNMENT – Pevzner et al. (2001), Popitam – Hernandez et al. (2003)
• De Novo Methods
  • Predict peptides without sequence databases
  • Exhaustive listing; sub-sequencing; graphical
  • Graph theory and shortest path algorithms
  • Graph theory and dynamic programming
  • Bayesian scoring of random peptides
  • Lutefisk – Taylor and Johnson (1997, 2001), SHERENGA – Dancik et al. (1999), PEAKS – Ma et al. (2003), NovoHMM – Fischer et al. (2005), PepNovo – Frank and Pevzner (2005), EigenMS – Bern and Goldberg (2006)

Challenges

• Tandem MS are missing ion peaks due to incomplete fragmentation and/or instruments with low mass-to-charge ratio (m/z) cutoff (e.g., ion trap mass analyzers)
• Incorporating parametric uncertainty in the measured values for ion peaks during peptide identification
• Existing de novo techniques enumerate an exhaustive number of candidate sequences from the tandem mass spectrum
• No straightforward method for including post-translational modifications into existing frameworks

Introduction to De Novo Peptide Identification

The De Novo Peptide Identification Problem: Given the tandem mass spectrum (MS/MS) of a peptide, derive the primary sequence of the peptide without consulting other sources of information (i.e., protein databases)

Derive candidate sequences by connecting the ion mass peaks by the weights of amino acids

[Figure: tandem mass spectrum; x-axis: mass-to-charge ratio]

Q: Which of these possible primary sequences corresponds to the correct peptide?
YLYKNAR
YLFPMTR
YLYELAR
YFEELAR
YEYLLAR
YLYKKGR
YFEKNAR
YLY[171.06]AAR

Traditional De Novo Methods

Spectrum Graph Approach* Transform tandem MS/MS into a spectrum graph, where:

paths on the graph = amino acid sequences

• Solve via dynamic programming
• Nodes assigned probabilistic weights
• Highest scoring path is selected

[Figure: spectrum graph; x-axis: mass-to-charge ratio]

* Taylor and Johnson (1997, 2001), Dancik et al. (1999), Fernandez de Cossio et al. (2000), Chen et al. (2001), Lubeck et al. (2002), Cannon and Jarman (2003), Chen and Bingwen (2003), Jarman et al. (2003), Frank and Pevzner (2005), Bern and Goldberg (2006)

Database Methods

[Figure: raw tandem MS/MS for YLYEIAR (MP = 926.45 Da) shown against predicted* spectra for peptides from a protein database: YLYEIAR, YLYQNVK, and NRIISLLV]

Which predicted spectrum matches the experimental spectrum in question?

*Predicted spectra generated with MassAnalyzer 1.03

Database Methods (cont'd)

• Cross-Correlation (e.g., SEQUEST*)

Experimental Spectrum → x
Predicted Spectrum → y

Determine “mathematical overlap”

Displacement value Discrete Fourier Transforms
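As a rough illustration of the "mathematical overlap", the sketch below correlates two binned intensity vectors x and y at a displacement tau, then scores the match as the zero-displacement correlation minus the average correlation at nearby nonzero displacements. This scoring form and the window size are assumptions based on common descriptions of SEQUEST-style scoring, not taken from the slides. The sums are evaluated directly rather than through discrete Fourier transforms, which give the same values faster for long spectra.

```python
def cross_correlation(x, y, tau):
    """R(tau) = sum_i x[i] * y[i + tau], treating out-of-range bins as zero."""
    total = 0.0
    for i in range(len(x)):
        j = i + tau
        if 0 <= j < len(y):
            total += x[i] * y[j]
    return total


def xcorr_score(x, y, window=75):
    """Correlation at zero displacement minus the mean over nonzero displacements.

    The background window (-window..window, excluding 0) is an assumed convention.
    """
    background = [
        cross_correlation(x, y, tau)
        for tau in range(-window, window + 1)
        if tau != 0
    ]
    return cross_correlation(x, y, 0) - sum(background) / len(background)
```

A predicted spectrum whose peaks line up with the experimental peaks only after shifting scores poorly, because the shifted overlap is counted as background rather than signal.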

*Eng et al. (1994)

Database Methods (cont'd)

• Probabilistic Matching* (e.g., Mascot, SCOPE)

Predict primarily y- and b-ions, and their offsets, based on the following formulae:

Q: Is an ion match with the experimental spectrum actual or random?

A:
• Likelihood ratio hypothesis test (Bafna and Edwards (2001), Havilio et al. (2003))
• Null hypothesis (Sadygov and Yates (2003))
• Integration of spectral dependencies into model (Bafna and Edwards (2001), Havilio et al. (2003))
• Empirically estimated probabilities

*Perkins et al. (1999), Bafna and Edwards (2001), Pevzner et al. (2001), Havilio et al. (2003), Hernandez et al. (2003), Sadygov and Yates (2003)

Drawbacks of Existing Methods

• De Novo Methods
  • Exhibit variable prediction accuracies
  • Computationally intensive → exhaustive enumeration
  • Many are instrument dependent

• Database Methods
  • False predictions if the protein is missing from the database
  • Difficult to identify post-translational modifications / mutations
  • Often exhibit dependencies on training data sets and databases

Our Approach to Address the Peptide Identification Problem

Novel Technique: Using Mixed-Integer Linear Optimization (MILP) to formulate the peptide sequencing problem

Binary variables (0-1 variables) define whether or not peaks (p_i) and paths between peaks (w_i,j) are used in the construction of the candidate sequence, where 1 indicates yes and 0 indicates no

P.A. DiMaggio and C.A. Floudas, AIChE Journal, 53(1), 160-173 (2007).

Algorithmic Framework

Components of Framework:

I. Preprocessing of Tandem MS Data

II. Mathematical Model for Peptide Identification

III. Postprocessing of Candidate Sequences (validation of candidate sequences)
  • De Novo: PILOT, Cross-Correlation
  • Hybrid: PILOT_SEQUEL, Database Alignment

I. Preprocessing Algorithm

• Determine boundary condition (BC_tail) for the N-terminus of the y-ion series (algorithm for N-terminus boundary ions)
• For tryptic peptides, the C-terminus amino acid is K → 147 Da or R → 175 Da
• Identify multiply-charged ions
  • For high-resolution instruments only
  • Measure distance between isotopes
• Identify neutral losses of small molecules, e.g., -H2O, -NH3
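The preprocessing checks listed above can be sketched as follows. This is an assumption-laden illustration, not the authors' preprocessing algorithm: the function names, tolerances, and the (m/z, intensity) peak representation are hypothetical, and the constants are standard monoisotopic values.

```python
# Singly charged y1 ions expected for tryptic peptides (K or R at the C-terminus).
Y1_MZ = {"K": 147.113, "R": 175.119}
ISOTOPE_SPACING = 1.00335                      # 13C - 12C mass difference, Da
NEUTRAL_LOSS = {"H2O": 18.01056, "NH3": 17.02655}


def c_terminal_candidates(peaks, tol=0.5):
    """Residues whose y1 ion appears among the (m/z, intensity) peaks."""
    return sorted(
        aa for aa, mz in Y1_MZ.items()
        if any(abs(p - mz) <= tol for p, _ in peaks)
    )


def charge_from_isotopes(mz_a, mz_b):
    """Charge state implied by the m/z spacing of two adjacent isotope peaks."""
    return round(ISOTOPE_SPACING / abs(mz_b - mz_a))


def neutral_loss_pairs(peaks, tol=0.02):
    """(parent m/z, loss name, satellite m/z) for singly charged peak pairs."""
    found = []
    for parent, _ in peaks:
        for other, _ in peaks:
            for name, loss in NEUTRAL_LOSS.items():
                if abs((parent - other) - loss) <= tol:
                    found.append((parent, name, other))
    return found
```

Isotope peaks of an ion with charge z are spaced roughly 1/z apart in m/z, so resolving that spacing requires a high-resolution instrument, consistent with the restriction noted on the slide.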

II. Mathematical Model: Objective Function

Illustration using the y-ion series for YLYELAR

• Maximize the use of high intensity peaks in constructing the candidate sequence
• Based on the observation that y- and b-ions are consistently the most abundant peaks in intensity in MS/MS

[Figure: y-ion series for YLYELAR; x-axis: mass-to-charge ratio]

II. Mathematical Model: Constraints

Conservation of Mass

tolerance “relaxes” equality
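One way to read the tolerance-relaxed mass conservation is as a filter on which peak pairs may be connected at all: a path between peaks i and j is admissible only if their m/z difference matches an amino acid mass within the tolerance. The sketch below illustrates that reading; the function name and default tolerance are hypothetical, and it is not the constraint set from the cited paper.

```python
# Monoisotopic residue masses in Da (standard values; I and L are isobaric).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def feasible_edges(peak_mz, tol=0.05):
    """Map each peak pair (i, j), i < j, to the residues matching m_j - m_i within tol."""
    edges = {}
    for i, mi in enumerate(peak_mz):
        for j in range(i + 1, len(peak_mz)):
            diff = peak_mz[j] - mi
            hits = [aa for aa, m in RESIDUE_MASS.items() if abs(diff - m) <= tol]
            if hits:
                edges[(i, j)] = hits
    return edges
```

With a loose tolerance, near-isobaric residues such as Q and K can both match the same pair of peaks, which is one source of ambiguity in de novo sequences.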

Boundary Conditions (BC)

• BC elements are dependent on ion type
• BC elements are checked in a preprocessing algorithm
• If elements are missing then the BC set is adjusted

Complementary Ions (b/y, a/x, c/z)

Eliminates ions of different type

II. Mathematical Model: MILP

Relationship between p_i and w_i,j

Flow conservation law
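To illustrate what the binary variables and flow conservation encode, the sketch below selects a single chain of peaks from one boundary peak to the other that maximizes total intensity. It uses dynamic programming over mass-ordered peaks as a simple stand-in; it is not the MILP from the cited paper, and it omits complementary-ion and tolerance constraints. All names are hypothetical.

```python
def best_path(peaks, edges, source, sink):
    """Maximum-intensity chain of peaks from source to sink.

    peaks: list of (m/z, intensity), sorted by m/z.
    edges: set of admissible (i, j) pairs with i < j (the w_ij allowed to be 1).
    Returns (score, path), where path lists the selected peaks (p_i = 1),
    or None if the sink cannot be reached.
    """
    n = len(peaks)
    best = [None] * n                      # best[i] = (score, predecessor)
    best[source] = (peaks[source][1], None)
    for j in range(source + 1, n):
        for i in range(source, j):
            if best[i] is not None and (i, j) in edges:
                cand = best[i][0] + peaks[j][1]
                if best[j] is None or cand > best[j][0]:
                    best[j] = (cand, i)
    if best[sink] is None:
        return None
    path, node = [], sink
    while node is not None:                # walk predecessors back to the source
        path.append(node)
        node = best[node][1]
    return best[sink][0], path[::-1]
```

Every interior peak on the returned path has exactly one incoming and one outgoing connection, which is the property the flow conservation law enforces in the optimization model.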

P.A. DiMaggio and C.A. Floudas, AIChE Journal, 53(1), 160-173 (2007).

II. Two-Stage Framework

• During the Stage I calculations, the candidate sequence is constructed using only single amino acid weights

• Most tandem MS are missing ion peaks due to incomplete fragmentation and/or instruments with low m/z cutoff (e.g., ion trap mass analyzers)

• Stage II calculations allow for combinations of amino acids to bridge the gap between missing ion peaks

• Combinations of amino acids are penalized in the objective function to favor use of single amino acid weights in derivation of candidate sequences
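The Stage II idea of bridging a gap can be sketched by enumerating which residue combinations sum to the mass difference between two peaks. This is a hypothetical helper for illustration, not the Stage II model itself; the tolerance and the limit on combination size are assumptions, and the penalty term in the objective is not modeled here.

```python
from itertools import combinations_with_replacement

# Monoisotopic residue masses in Da (standard values). I is omitted because it
# is isobaric with L and would only duplicate every L-containing combination.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def bridging_combinations(gap, max_len=2, tol=0.03):
    """Unordered residue combinations (size 2..max_len) whose mass matches gap within tol."""
    residues = sorted(RESIDUE_MASS)
    hits = []
    for k in range(2, max_len + 1):
        for combo in combinations_with_replacement(residues, k):
            if abs(sum(RESIDUE_MASS[a] for a in combo) - gap) <= tol:
                hits.append(combo)
    return hits
```

For a gap near 186.08 Da this returns the pairs A+D, E+G, and S+V; the order of residues within each pair cannot be determined from the gap alone, so each pair stands for two possible sub-sequences.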

III. De Novo Post-Processing Algorithm

• Amino acid permutations are substituted for weights in candidate sequences from Stage II calculations
• No current models exist for accurate prediction of ion intensity trends as a function of peptide composition for generalized mass analyzers
• Assume normalized intensity distribution + reward / penalty based on observation/absence of supporting ions
• Cross-correlate all theoretical mass spectra of candidate peptide sequences with the experimental tandem mass spectrum

P.A. DiMaggio and C.A. Floudas, AIChE Journal, 53(1), 160-173 (2007).

De Novo Algorithm

Components of Framework:

I. Preprocessing of Tandem MS Data

II. Mathematical Model for Peptide Identification

III. Postprocessing of Candidate Sequences

PILOT: Peptide identification via Mixed-Integer Linear Optimization and Tandem mass spectrometry

Illustrative Example for PILOT: DAFLGSFLYEYSR

Raw MS/MS spectrum → Preprocessing Algorithm
• C-terminal amino acid: R → peak at 175 Da
• N-terminus boundary conditions, BC_tail: DA, AD, SV, VS, EG, or GE → no supporting y_(n-1) ion
• Adjust BC_tail → m/z(y_(n-2) ion) = 1381.69 Da

Flow: filtered spectrum, identified peaks → Stage I sequences → integer cuts → Stage II sequences → all sequences → postprocessing

Stage I Sequences
Candidate Sequence            Objf
F(L/I)GSF(L/I)YHGANR          2.9850
F(L/I)GSF(L/I)YHAGNR          2.9674
F(L/I)GSF(L/I)YQHNR           2.9499
F(L/I)GSF(L/I)YHQNR           2.9374
F(L/I)GSF(L/I)YEYSR           2.8547
F(L/I)GSF(L/I)YEH(L/I)R       2.7544
F(L/I)GSF(L/I)YE(L/I)HR       2.7444
F(L/I)GSF(L/I)YYESR           2.6968
F(L/I)GSF(L/I)YYTDR           2.6391

Stage II Sequences
Candidate Sequence            Objf
F(L/I)GSF(L/I)Y[194.06]ANR    2.8323
F(L/I)GSF(L/I)Y[208.10]GNR    2.8148
F(L/I)GSF(L/I)Y[265.12]NR     2.7847
F(L/I)GSF(L/I)[172.04]QGANR   2.6501
F(L/I)GSF(L/I)[171.04]EGANR   2.6501
F(L/I)GSF(L/I)[171.04]SVANR   2.6351
F(L/I)GSF(L/I)[171.04]GEANR   2.6351
F(L/I)GSF(L/I)[171.04]DAANR   2.6351
F(L/I)GSF(L/I)[172.04]NAANR   2.6351

(On the slide, typography distinguishes high confidence residues from low confidence residues; the formatting is not preserved here.)

PostProcessing: DAFLGSFLYEYSR

P.A. DiMaggio and C.A. Floudas, AIChE Journal, 53(1), 160-173 (2007).

De Novo Comparative Study

To benchmark the performance of PILOT, we tested it on several tandem mass spectra from

• Quadrupole time-of-flight spectra, QTOF (higher resolution)
• Ion trap spectra (lower resolution, low m/z cutoff)

and compared the predictions to other state-of-the-art de novo methods, namely:

• Lutefisk, LutefiskXP – J.A. Taylor and R.S. Johnson, Anal. Chem., 73, 2594-2604 (2001).
• PEAKS – B. Ma et al., Rapid Commun. Mass Spec., 17, 2337-2342 (2003).
• NovoHMM – B. Fischer et al., Anal. Chem., 77, 7265-7273 (2005).
• PepNovo – A. Frank and P. Pevzner, Anal. Chem., 77, 964-973 (2005).
• EigenMS – M. Bern and D. Goldberg, J. Comp. Biol., 13(2), 364-378 (2006).

De Novo Comparative Study: Ion Trap MS/MS

• Open Proteomics Database*: contains MS/MS spectra for 5 different organisms recorded with ESI-Ion Trap mass spectrometers

• Mass spectra accompanied with predictions from SEQUEST

Which identifications are correct?

• Assignments examined on an individual basis for quality:
  1. Xcorr > 2.2 and ΔCn > 0.1 for +2 charge state
  2. Consistent identification with Mascot
  3. (Number of observed b and y ions) / (Number of predicted b and y ions)

  Xcorr = cross-correlation score computed by SEQUEST
  ΔCn = normalized difference in cross-correlation value between the #1 and #2 hits in the search

• Organism studied: Mycobacterium smegmatis
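The first and third quality criteria can be expressed as a small filter. This is a sketch under assumptions: ΔCn is taken to be the score difference divided by the top score, which is a common convention but is not spelled out on the slide, and the function names are hypothetical. The consistency check against Mascot is not modeled.

```python
def delta_cn(xcorr_top, xcorr_second):
    """Difference between the top two Xcorr scores, normalized by the top score (assumed)."""
    return (xcorr_top - xcorr_second) / xcorr_top


def passes_score_filter(xcorr_top, xcorr_second, xcorr_min=2.2, delta_cn_min=0.1):
    """Criterion 1 for +2 charge state spectra: Xcorr > 2.2 and delta Cn > 0.1."""
    return xcorr_top > xcorr_min and delta_cn(xcorr_top, xcorr_second) > delta_cn_min


def ion_coverage(n_observed, n_predicted):
    """Criterion 3: fraction of predicted b and y ions observed in the spectrum."""
    return n_observed / n_predicted
```

A top hit that scores 3.0 with a runner-up at 2.9 fails the filter even though its Xcorr is high, because the two candidates are too close to separate with confidence.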

*http://bioinformatics.icmb.utexas.edu/OPD/

De Novo Comparative Study: Ion Trap MS/MS

P.A. DiMaggio and C.A. Floudas, Anal. Chem., 79, 1433-1446 (2007).

De Novo Comparative Study: QTOF MS/MS

• Quadrupole time-of-flight (QTOF) spectra have better resolution than ion trap spectra
• Examined QTOF data for a mixture of 4 known proteins*:
  • Alcohol dehydrogenase (yeast)
  • Myoglobin (horse)
  • Albumin (bovine, BSA)
  • Cytochrome C (horse)

• Spectra were assessed for quality based on the metric:

(i = intensity of ion peak i)

*http://www.csd.uwo.ca/~bma/peaks/

De Novo Comparative Study: QTOF MS/MS

P.A. DiMaggio and C.A. Floudas, Anal. Chem., 79, 1433-1446 (2007).

De Novo Method Summary

• Developed accurate de novo framework, PILOT, for the identification of peptides via tandem mass spectrometry (MS/MS)
• PILOT outperformed several state-of-the-art de novo methods in a comparative study for ion trap and QTOF tandem mass spectra
• Key elements of de novo framework:
  • Novel mixed-integer linear optimization (MILP) formulation for peptide identification
  • Preprocessing algorithm for filtering spectra and identifying important ion peaks
  • Post-processing algorithm for cross-correlating theoretical tandem mass spectra with experimental tandem mass spectrum

Hybrid Method for Peptide Identification

• Main Idea: Protein databases can be used to resolve ambiguous residue assignments from de novo sequence predictions
• Combine strengths of de novo and database methods

361 HPEYAVSVLL RLAKEYEATL EDCCAKEDPH ACYATVFDKL
    HPEYAVEGLL R

• Local database search tools, such as FASTA*, can be utilized to align de novo sequences in a protein database

… but several modifications are necessary

*W. Pearson and D. Lipman. PNAS, 85:2444-2448, 1988.

II. Modified ILP Model for Hybrid Method

Conservation of Mass relaxed by tolerance

Complementary Ions

Relationship between p_i and w_i,j

Boundary Conditions for C-terminus and N-terminus

Tryptic Peptide

Flow conservation law

III. PostProcessing using Modified FASTA Algorithm

• Scoring Matrices: modify BLOSUM matrix to conserve mass between query and template sequences

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X
A  5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5  1
R -5  5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5  1
N -5 -5  5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5  1
D -5 -5 -5  5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5  1
C -5 -5 -5 -5  5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5  1
Q -5 -5 -5 -5 -5  5 -5 -5 -5 -5 -5  5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5  1
E -5 -5 -5 -5 -5 -5  5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5  1
G -5 -5 -5 -5 -5 -5 -5 15 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5  1
H -5 -5 -5 -5 -5 -5 -5 -5  5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5  1
I -5 -5 -5 -5 -5 -5 -5 -5 -5  5  5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5  1
L -5 -5 -5 -5 -5 -5 -5 -5 -5  5  5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5  1
K -5 -5 -5 -5 -5  5 -5 -5 -5 -5 -5  5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5  1
M -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5  5 -5 -5 -5 -5 -5 -5 -5 -5 -5  1
F -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5  5 -5 -5 -5 -5 -5 -5 -5 -5  1
P -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 15 -5 -5 -5 -5 -5 -5 -5  1
S -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5  5 -5 -5 -5 -5 -5 -5  1
T -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5  5 -5 -5 -5 -5 -5  1
W -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5  5 -5 -5 -5 -5  1
Y -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5  5 -5 -5 -5  1
V -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5  5 -5 -5  1
B -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5  5 -5  1
Z -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5  5  1
X  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1

• Hashing: ktup = 4 to optimize only high quality sequence matches

• Smith and Waterman Optimization
  • Isobaric residues: Q and GA, W and SV, N and GG

• Explicit Conservation of Mass between template & query

• Tryptic Peptide Databases

P.A. DiMaggio, C.A. Floudas, B. Lu, and J.R. Yates, J. Proteome Res., 7, 1584-1593 (2008).

Hybrid Algorithm

Components of Framework:

I. Preprocessing of Tandem MS Data

II. Mathematical Model for Peptide Identification

III. Postprocessing of Candidate Sequences

PILOT_SEQUEL: Peptide identification via Mixed-Integer Linear Optimization, Tandem mass spectrometry, and local SEQUEnce aLignment

Hybrid Approach for Peptide Identification: Distributed Computing Framework

Beowulf Cluster: 80 nodes with dual Intel Xeon 3.0 GHz processors

Example for PILOT_SEQUEL: VEADIAGHGQEVLIR

Raw MS/MS spectrum → Preprocessing Algorithm
• C-terminal amino acid: peaks observed at 147 and 175 Da → allowed to be both K and R
• N-terminus boundary conditions, BC_tail: no N-terminal pair with significant score. Upper bounding calculations select 1192 Da and 1378 Da as N-terminal boundary conditions

Flow: filtered spectrum, N- and C-terminal boundary conditions → Stage I sequences → integer cuts → Stage II sequences → query all sequences in nr protein database

Stage I Sequences
Candidate Sequence   Objf
IAGHGQEVIIR          3.43
ANNAGHGQEVIIR        3.38
IAGHGWAVIIR          3.32
ANNAGHGWAVIIR        3.26
IAGHGQQVNIR          3.04
ANNAGHGQQVNIR        2.98
IACFEEVIIR           2.78
ANNACFEEVIIR         2.73
ATVVGHGQQVNIR        2.41

Stage II Sequences
Candidate Sequence   Objf
IAG[194.05]QEVIIR    3.41

(On the slide, bold font marks high confidence residues; the formatting is not preserved here.)

De novo peptide: VEW IAGHGQEVLIR
Database protein: LKTEAEMKVEADLAGHGQEVLIR

Selected Peptide: VEADIAGHGQEVLIR
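The alignment step in this example can be illustrated with the mass-conserving scoring rules from the modified matrix (5 for a match, 15 for G or P, 5 for the I/L and Q/K pairs, 1 for X, and -5 otherwise). The sketch below slides a de novo candidate along a database sequence without gaps; it is a simplified stand-in for the modified FASTA procedure, with hypothetical function names, and it omits hashing, Smith and Waterman optimization, and isobaric multi-residue substitutions.

```python
# Residue pairs that the modified matrix scores as matches (I/L and Q/K).
EQUIVALENT = {"I": "L", "K": "Q"}


def score(a, b):
    """Substitution score following the modified scoring matrix."""
    if a == "X" or b == "X":
        return 1
    if EQUIVALENT.get(a, a) != EQUIVALENT.get(b, b):
        return -5
    return 15 if a == b and a in "GP" else 5


def best_ungapped_hit(query, target):
    """Best (score, offset) for the query placed at every offset in the target."""
    best = None
    for offset in range(len(target) - len(query) + 1):
        s = sum(score(q, t) for q, t in zip(query, target[offset:]))
        if best is None or s > best[0]:
            best = (s, offset)
    return best
```

Applied to the top Stage I candidate IAGHGQEVIIR and the database protein fragment shown above, the best placement covers LAGHGQEVLIR; the I/L differences cost nothing because the two residues have the same mass.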

Hybrid Comparative Study: OrbiTrap MS/MS

• OrbiTrap instruments have 2-3 times the resolution of conventional mass spectrometers.
• Examined 380 OrbiTrap tandem MS for a control mixture of 16 known proteins from bovine, bovine serum, horse, chicken, rabbit, and E. coli.
• SEQUEST was used to search a target protein database of 5009 proteins (appended with a database containing the reversed sequences of these proteins).
• The validity of the spectrum/peptide matches was assessed using DTASelect*.
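A minimal sketch of how a target database can be extended with reversed sequences follows. The "REV_" prefix, the function names, and the decoy-based error estimate are assumptions for illustration; the slides state only that reversed sequences were appended and that matches were assessed with DTASelect.

```python
def with_reversed_decoys(proteins):
    """Return a copy of the database with a reversed decoy added for every protein.

    proteins: dict mapping protein id -> amino acid sequence.
    Decoy entries are keyed by a hypothetical 'REV_' prefix.
    """
    combined = dict(proteins)
    for pid, seq in proteins.items():
        combined["REV_" + pid] = seq[::-1]
    return combined


def decoy_to_target_ratio(hit_ids):
    """Ratio of decoy hits to target hits among accepted matches (assumed error estimate)."""
    decoys = sum(1 for pid in hit_ids if pid.startswith("REV_"))
    targets = len(hit_ids) - decoys
    return decoys / targets if targets else 0.0
```

Because a reversed sequence has the same length and composition as its target but is not expected to occur in the sample, matches to decoy entries give a rough indication of how often random matches pass the acceptance criteria.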

*D.L. Tabb et al. J. Proteome Res., 1:21-26, 2002.

Hybrid Comparative Study: OrbiTrap MS/MS

*De novo algorithm trained on OrbiTrap tandem MS+. A sequence tag algorithm was used to perform database search in place of the direct lookup method based on hashing.

Residues predicted for de novo sequences:
PILOT    4249/4364 (0.974)
PepNovo  2958/4364 (0.678)

+A. Frank et al. J. Proteome Res., 6:114-123, 2007.

Hybrid Comparative Study: OrbiTrap MS/MS

P.A. DiMaggio, C.A. Floudas, B. Lu, and J.R. Yates, J. Proteome Res., 7, 1584-1593 (2008).

Hybrid Method Summary

• Developed accurate hybrid framework for the identification of peptides via tandem mass spectrometry (MS/MS), which combines strengths of de novo and database techniques
• PILOT_SEQUEL outperformed several state-of-the-art algorithms for hybrid and database peptide identification in a comparative study using OrbiTrap tandem mass spectra.
• Major components of hybrid method:
  • Modified integer linear optimization (ILP) formulation for peptide identification
  • Enhanced implementation of FASTA
  • Distributed computing framework for performing several sequence alignment calculations
