A Probabilistic Constraint-based Approach to Structure Prediction

1 Background and Objectives

Protein folding and prediction

Protein folding is the process by which a protein assumes its functional shape or conformation. It is believed that the dynamical folding of a protein to its native conformation is determined by the amino acid sequence of the protein. Given the usefulness of known protein structures in such valuable tasks as rational drug design, protein structure prediction is a highly active field of research [3, 10]. During recent decades various methods have been developed along the following two main lines [3]. The first line of methods, called homology modeling, relies on assembling structures using structural fragments of similar sequences available in sources such as the PDB [5]. The second line of methods, called ab initio methods, predicts the structure from the sequence alone, without relying on similarity at the fold level between the modeled sequence and any of the known structures. The folding of any particular protein is an extremely complex process, and simulation of the folding of even a small protein remains an insurmountable challenge to state-of-the-art computers. Many ab initio methods therefore adopt reduced models [18], such as lattice models, which alleviate the complexity of the process because of the reduced number of degrees of freedom. Reduced models cannot be expected to consistently generate predictions with high accuracy. Nevertheless, the low-resolution results obtained can be used to narrow the possible conformations from an exponentially large number to a number small enough that more computationally expensive methods can be applied.

The constraint-based approach to protein structure prediction

Constraint programming [9, 32] is a declarative programming paradigm well suited for describing and solving combinatorial optimization problems. Constraint languages allow for not only high-level modeling but also efficient solving of problems. Recent implementations are amenable to large-scale problems thanks to the availability of constraint solving algorithms such as propagation algorithms for finite-domain constraints [22], sophisticated heuristics [14], and advanced compilation techniques [8, 36]. Constraint programming has found its way into many application areas such as design, scheduling, configuration, bioinformatics, and graphical user interface design [6, 12, 15, 13, 33, 35]. Several constraint programming systems are available now. The PI, Neng-Fa Zhou, has designed and implemented an event-handling language, called AR (Action Rules), and developed a state-of-the-art and award-winning1 constraint solver in AR available with B-Prolog [36].

1 The solver was ranked top in two categories in the Second International Solvers Competition (http://www.cril.univ-artois.fr/CPAI06/).

Several attempts have been made to apply constraint programming to protein structure prediction [2, 21, 25]. All the systems adopt lattice models for proteins and use simplified energy functions in the optimization. There are different ways to map lattice models for proteins into CSP (Constraint Satisfaction Problem) models. A straightforward way is to treat amino acids, or residues, as variables and lattice positions as the domains of the variables. The only enforced constraints in this model are neighborhood constraints, which ensure that every pair of consecutive residues is placed in neighboring coordinates. Auxiliary constraints, such as constraints for breaking symmetries, are used to enhance performance. The constraint-based approach has produced some exciting results. For example, Backofen and Will report in [2] that their program outperforms all other approaches for lattice HP models and is able to find optimal structures and prove their optimality for proteins of up to 200 residues. In [26] a goal is set to handle proteins of lengths up to 500.
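To make this CSP formulation concrete, the following Python sketch enumerates self-avoiding folds of an HP sequence on the 2D square lattice by backtracking, scoring each fold by its H-H contacts. It is only an illustration of the model described above, not the implementation of any of the cited systems, and all function names are our own; an exhaustive search like this is practical only for tiny sequences.

```python
def hp_fold(seq):
    """Backtracking search for a minimum-energy fold of an HP sequence
    on the 2D square lattice: residues are the variables, lattice points
    the domains, and the enforced constraints are chain connectivity and
    self-avoidance (exhaustive, so only tiny sequences are practical)."""
    n = len(seq)
    best = [float("inf"), None]            # [best energy, best fold]
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))

    def fold_energy(pos):
        occupied = {p: i for i, p in enumerate(pos)}
        e = 0
        for i, (x, y) in enumerate(pos):
            if seq[i] != 'H':
                continue
            for dx, dy in moves:
                j = occupied.get((x + dx, y + dy))
                if j is not None and j > i + 1 and seq[j] == 'H':
                    e -= 1                 # one H-H contact, counted once
        return e

    def extend(pos, used):
        if len(pos) == n:
            e = fold_energy(pos)
            if e < best[0]:
                best[0], best[1] = e, list(pos)
            return
        x, y = pos[-1]
        for dx, dy in moves:
            q = (x + dx, y + dy)
            if q not in used:              # self-avoidance constraint
                pos.append(q); used.add(q)
                extend(pos, used)
                pos.pop(); used.remove(q)

    # fix the first two residues to break translation/rotation symmetry,
    # in the spirit of the symmetry-breaking constraints mentioned above
    extend([(0, 0), (1, 0)], {(0, 0), (1, 0)})
    return best[0], best[1]
```

For the four-residue sequence HPPH the only achievable contact is between the two H residues folded into a U shape, giving energy -1.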

Logic-based probabilistic learning

The past few years have witnessed a tremendous interest in logic-based probabilistic learning, as testified by an increasing number of formalisms and systems (e.g., BLP [17], CLP(BN) [11], PRM [20], PRISM [29], and SLP [24]). Logic-based probabilistic learning is a multidisciplinary research area that integrates relational or logic formalisms, probabilistic reasoning mechanisms, and machine learning and data mining principles. It has found its way into many application areas including bioinformatics, diagnosis and troubleshooting, stochastic language processing, information retrieval, linkage analysis and discovery, and robot control. The PI has been involved in the development of PRISM,2 an extension of B-Prolog that integrates logic programming, probabilistic reasoning, and EM learning [29, 30, 38]. PRISM allows for the description of independent probabilistic choices and their consequences in general logic programs. PRISM supports parameter learning: for a given set of possibly incomplete observed data, PRISM can estimate the probability distributions that best explain the data. PRISM is suitable for applications such as learning parameters of stochastic grammars, training stochastic models for gene sequence analysis, user modeling, and obtaining probabilistic information for tuning system performance. PRISM offers incomparable flexibility compared with specific statistical tools such as Hidden Markov Models (HMMs) [28], Probabilistic Context-Free Grammars (PCFGs) [34], and discrete Bayesian networks [7, 27]. Thanks to an efficient tabling system [39] and other optimization techniques, the latest version of PRISM is able to handle large volumes of data. For example, a natural language application written in PRISM is able to train probabilistic models using corpora of tens of thousands of sentences.

There is an increasing interest in using machine learning techniques to learn heuristics for search.
For CSPs, the heuristics used to order variables and values can have a dramatic effect on performance. Various kinds of strategies have been proposed (e.g., the first-fail principle for ordering variables [22] and the min-conflict strategy for ordering values [16]). Nevertheless, for many problems users still have to experiment with different heuristics and tune them manually, and this process can be very tedious and painful.
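The first-fail principle mentioned above can be sketched with a small backtracking solver that also performs forward checking. This is a generic Python illustration with names of our own choosing, not the heuristics machinery of B-Prolog or any other cited system.

```python
def solve(domains, neighbors, first_fail=True):
    """Backtracking search with forward checking for a binary CSP such
    as graph coloring.  With first_fail=True, the next variable chosen
    is the one with the fewest remaining values (the first-fail
    principle); otherwise variables are taken in a fixed order."""
    domains = {v: list(d) for v, d in domains.items()}
    assignment = {}

    def search():
        if len(assignment) == len(domains):
            return dict(assignment)
        free = [v for v in domains if v not in assignment]
        var = min(free, key=lambda v: len(domains[v])) if first_fail else free[0]
        for val in list(domains[var]):
            assignment[var] = val
            pruned, ok = [], True
            # forward checking: remove val from unassigned neighbors
            for w in neighbors[var]:
                if w not in assignment and val in domains[w]:
                    domains[w].remove(val)
                    pruned.append(w)
                    if not domains[w]:
                        ok = False       # a domain was wiped out: fail early
                        break
            if ok:
                result = search()
                if result is not None:
                    return result
            for w in pruned:             # undo pruning before backtracking
                domains[w].append(val)
            del assignment[var]
        return None

    return search()
```

For example, 3-coloring a triangle A-B-C with a pendant vertex D attached to C succeeds, and every returned coloring assigns different colors to adjacent vertices.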

Objectives

We propose a probabilistic constraint-based approach to protein structure prediction. Our approach is based on the reasonable assumption that two similar proteins will have similar structures. In this project we will develop probabilistic models and train them using proteins in the PDB with known structures. These probabilistic models will then be used to predict structures for other proteins. The number of proteins with known three-dimensional conformations in the PDB is on the order of 16,000 and is increasing rapidly, while the number of conformation classes has remained at about 500 for some

2 Available from http://mi.cs.titech.ac.jp/prism/.

time and is not expected to grow beyond 1000.3 It is expected that a trained probabilistic model can guide the search, leading to a better lower bound and thus a dramatic reduction of the search space. This project also aims to improve the constraint-based methods used in existing systems. Concretely, we propose the following: (1) tailor our solver to the protein structure prediction problem; (2) design and implement new global constraints and propagation algorithms for the problem; and (3) investigate the effectiveness of other constraint solving techniques, such as symmetry-breaking and dual-modeling techniques, on this problem.

Merits and impacts

The main contribution of the project will be a high-performance specialized solver for protein structure prediction that takes advantage of probability distributions learned from existing proteins with known structures. The system will be based on our efficient constraint logic programming system B-Prolog and the probabilistic learning system PRISM, and will incorporate the following research results:

• Probabilistic models: A naive probabilistic model would facilitate learning but could hardly produce results useful enough to guide search. On the other hand, a sophisticated model entails exponential learning time. An optimal probabilistic model that balances the learning complexity against the quality of the results will be developed through experiments.

• Special-purpose constraints: Special-purpose constraints, including routing, symmetry-breaking, and dual-modeling constraints, will be identified for the problem and implemented using the AR language available in B-Prolog.

This is a multidisciplinary research project which will involve both computer scientists and biologists. The research results will be disseminated through several avenues, and the resulting system from the project will be made available to the public.

2 Protein Structure Prediction as a Constraint Satisfaction and Optimization Problem

Proteins are amino acid chains, made up from 20 different amino acids, also referred to as residues, that fold into unique three-dimensional structures. Let R = r1, ..., rn be a sequence of residues. A structure of R is represented by a sequence of points p1, ..., pn in three-dimensional space, where pi = (xi, yi, zi). The following generic formula defines the energy of a structure:

E = Σ_i Σ_{j>i} E_{r_i r_j} · Δ_{p_i p_j}

where E_{r_i r_j} specifies the contact energy of residues r_i and r_j, and Δ_{p_i p_j} is a value dependent on the distance between p_i and p_j in the space. A structure is a self-avoiding walk in the space, and structure prediction is an optimization problem which infers feasible structures with the lowest energy for a protein. The problem is an insurmountable challenge to the fastest computers even for small proteins. During the last three decades various kinds of reduced models with reduced degrees of conformational freedom, especially lattice models, have been used [19]. Protein structure prediction can be easily modeled as a discrete constraint satisfaction problem if a lattice model is used: residues are treated

3 http://scop.mrc-lmb.cam.ac.uk/scop/.

as variables whose domains are lattice points, and the only enforced constraints are neighborhood constraints, which ensure that every pair of consecutive residues (r_i, r_{i+1}) is placed into neighboring positions. In the simple cubic lattice model the space is assumed to be packed with unit cubes and the allowed positions are lattice points, which are the intersecting points of the cubes. The Euclidean distance between two residue positions p_i = (x_i, y_i, z_i) and p_j = (x_j, y_j, z_j) is approximated as follows:

D_{p_i p_j} = |x_i − x_j| + |y_i − y_j| + |z_i − z_j|

The positions p_i and p_j are neighbors if D_{p_i p_j} = 1. In the simple cubic lattice model, each position has up to 6 neighbors. In the FCC (face-centered-cubic) lattice model [1], each cube is assumed to have size 2 and the allowed positions for residues include the central point of each face in addition to the lattice points. The Euclidean distance can be approximated as in the simple cubic model, but two positions p_i and p_j are neighbors if D_{p_i p_j} = 2. In the FCC lattice model, each position has up to 12 neighbors. Protein structure prediction is NP-complete even for very simplified models such as the HP model [23, 4], in which residues are divided into two types called H (hydrophobic) and P (polar) and a very simple energy function is used. Because of the NP-completeness, brute-force search is out of the question, and hence heuristics must be used to cut off the search space and guide search toward an optimal solution with as few backtracks as possible.
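The distance, neighborhood, and energy definitions above can be sketched in Python as follows. This is only an illustration; in particular, the FCC neighbor test follows the Manhattan-distance approximation stated in the text, and all names are our own.

```python
def manhattan(p, q):
    """D_{pq} = |x_i - x_j| + |y_i - y_j| + |z_i - z_j|, the distance
    used above to approximate Euclidean distance on the lattice."""
    return sum(abs(a - b) for a, b in zip(p, q))

def is_neighbor(p, q, lattice="cubic"):
    """Neighborhood test: distance 1 on the simple cubic lattice (up to
    6 neighbors), distance 2 on the FCC lattice (up to 12 neighbors),
    following the approximation used in the text."""
    return manhattan(p, q) == (1 if lattice == "cubic" else 2)

def energy(residues, positions, contact_energy, lattice="cubic"):
    """E = sum_i sum_{j>i} E_{r_i r_j} * Delta_{p_i p_j}, taking Delta
    to be 1 for non-consecutive residues in contact and 0 otherwise."""
    e = 0
    for i in range(len(residues)):
        for j in range(i + 2, len(residues)):   # skip consecutive pairs
            if is_neighbor(positions[i], positions[j], lattice):
                e += contact_energy[residues[i]][residues[j]]
    return e
```

For the HP model, contact_energy would assign -1 to H-H pairs and 0 to all others, so a U-shaped fold of HPPH has energy -1.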

3 Results from Prior Research

The results from the PI's prior research on constraint solving and probabilistic learning4 will serve as the basis for the proposed research.

3.1 Prior research results on constraint solving

A constraint satisfaction problem (CSP) consists of a set of variables, each with a domain of values, and a set of constraints, each of which specifies the allowed combinations of values for a subset of the variables. A solution is an assignment of values to the variables that satisfies the constraints. Many combinatorial optimization problems can be formulated naturally as CSPs. Some problems originally belonging to seemingly different types, such as planning problems and traveling salesman problems, have also been tackled as CSPs. Constraint programming is a declarative programming paradigm well suited for modeling and solving CSPs. Constraint solvers for finite-domain constraints typically use a technique called propagation (see, e.g., [22, 31] for tutorials) to enforce a local consistency property. There has been a need for an efficient and expressive implementation language for constraint propagation. The PI has designed and implemented an event-driven language called AR (Action Rules) and used it to implement an efficient solver for finite-domain constraints [36]. The solver is available with B-Prolog. The event constructs facilitate the implementation of global constraints and problem-specific constraints. The AR language has also been used to program interactions in graphical user interfaces [37] and to compile CHR (Constraint Handling Rules) [], a language capable of describing constraint reasoning rules. The solver to be used for the proposed protein structure prediction project represents the state of the art. The following table gives benchmark results comparing the B-Prolog solver (BP) version 6.9 with SICStus Prolog (Sics) version 3.12.5.

4 Supported by grants from CUNY Research Foundation, CUNY Software Institute, Brooklyn College, and AIST Japan.

Table 1: Benchmark results (CPU time in seconds; 1.3GHz CPU, 1GB RAM, Windows XP).

Program    BP     Sics    Sics/BP
block      5.32   17.87    3.35
color      3.84   39.50   10.27
hamming    0.82    2.50    3.02
hamilton   0.73    2.12    2.89
knapsack   0.95    5.65    5.93
protein   10.07   24.98    2.47
schur      6.68   49.40    7.38
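The local-consistency propagation discussed in this subsection can be illustrated with AC-3, a classic arc-consistency algorithm. The Python sketch below is only a schematic stand-in for the AR-based propagators in B-Prolog; the representation of constraints as predicates over ordered variable pairs is our own choice.

```python
from collections import deque

def ac3(domains, constraints):
    """AC-3 arc-consistency propagation.  `constraints` maps an ordered
    pair (x, y) to a predicate on (value of x, value of y); a value is
    pruned from x's domain when no value of y supports it, and arcs
    pointing into x are then re-examined."""
    domains = {v: set(d) for v, d in domains.items()}
    queue = deque(constraints)
    while queue:
        x, y = queue.popleft()
        pred = constraints[(x, y)]
        removed = {a for a in domains[x]
                   if not any(pred(a, b) for b in domains[y])}
        if removed:
            domains[x] -= removed
            if not domains[x]:
                return None              # inconsistency detected
            queue.extend(arc for arc in constraints if arc[1] == x)
    return domains
```

For instance, with X, Y in 1..3 and the constraint X < Y, propagation alone shrinks the domains to X in {1, 2} and Y in {2, 3} without any search.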

3.2 Prior research results on logic-based probabilistic learning

4 Proposed Research

In constraint solving, the ordering of values can have a dramatic effect on performance. It is presumed in bioinformatics that homologous proteins almost always have similar 3D structures. One question arises: how can we make use of the existing proteins with known structures to enhance the performance of constraint solving for protein structure prediction? This research proposes integrating the probabilistic approach with the constraint-based approach to the problem.
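One simple way the proposed integration could work is sketched below: estimate turn probabilities from the folds of proteins with known structures and use them to order domain values most-probable-first during search. This is a hypothetical Python stand-in for a PRISM-trained model; the function names, the first-order conditioning on the previous turn, and the turn alphabet {-1, 0, 1} are all illustrative assumptions.

```python
from collections import Counter

def learn_turn_model(training_folds):
    """Estimate counts for P(turn | previous turn) from folds of
    proteins with known structures, each fold given as a sequence of
    turns in {-1, 0, 1} (a hypothetical stand-in for a model whose
    parameters would be learned with PRISM)."""
    counts = {t: Counter() for t in (-1, 0, 1)}
    for fold in training_folds:
        for prev, cur in zip(fold, fold[1:]):
            counts[prev][cur] += 1
    return counts

def order_values(model, prev_turn):
    """Value-ordering heuristic: try the most probable turns first."""
    c = model[prev_turn]
    return sorted((-1, 0, 1), key=lambda t: -c[t])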

4.1 Representation of protein structures

In the basic representation scheme, residues in a protein structure are treated as variables whose domains are lattice positions. As absolute coordinates are used, a structure can have many different representations, and hence the basic representation is not suited for our purpose. We need to change this representation scheme so that a structure has as few representations as possible. With such a representation scheme, probabilistic models can be established and proteins with known structures can be used to train these models. In Unger and Moult [] a protein structure is represented by the angles formed by neighboring links in a fold. On a 2D lattice, two neighboring links can form one of the following three angles: 90 (left turn), 180 (straight), and 270 (right turn). A 360 degree angle is not possible since a valid fold must be a self-avoiding walk on the lattice. We use three small integers to represent the angles: -1 for a left turn, 0 for a straight connection, and 1 for a right turn. In this scheme, a structure can still have multiple representations. For example, the U shape on the 2D lattice can be represented as (1, 1) or (-1, -1), which are mirror images of one another.
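The turn-based representation can be made concrete as follows. This Python sketch decodes a turn sequence into 2D lattice coordinates and checks self-avoidance; the function names and the convention of fixing the first link along the x axis are our own illustrative choices.

```python
def turns_to_coords(turns):
    """Decode a 2D-lattice fold from its turn sequence (-1 left,
    0 straight, 1 right), fixing the first residue at the origin and
    the first link along the x axis."""
    dx, dy = 1, 0                  # current heading
    coords = [(0, 0), (1, 0)]      # first two residues are fixed
    for t in turns:
        if t == 1:                 # right turn
            dx, dy = dy, -dx
        elif t == -1:              # left turn
            dx, dy = -dy, dx
        x, y = coords[-1]
        coords.append((x + dx, y + dy))
    return coords

def self_avoiding(coords):
    """A valid fold must visit each lattice point at most once."""
    return len(set(coords)) == len(coords)
```

Decoding (1, 1) and (-1, -1) yields the two mirror-image U shapes mentioned above, while a turn sequence such as (1, 1, 1) revisits the origin and is rejected as not self-avoiding.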

4.2 Probabilistic models

Dal Palù, Dovier, and Fogolari [25] implemented a constraint-based solution for conformations in the face-centered cubic lattice. Their solution is much more generic because it uses an energy function based on all pairs of the 20 amino acids and incorporates secondary structure information; that is, it preserves the known structure of subsequences regardless of what the optimum conformation may be. Statistical analysis and constraint solving are two major approaches to solving optimization problems. Both approaches have been applied to protein structure prediction.

Structural templates

5 Intellectual Merits and Broader Impact

6 Experience and Capabilities of the PI

7 Plan of Activities

References

[1] Richa Agarwala, Serafim Batzoglou, Vlado Dancik, Scott E. Decatur, Martin Farach, Sridhar Hannenhalli, and Steven Skiena. Local rules for protein folding on a triangular lattice and generalized hydrophobicity in the HP model. In SODA '97: Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 390–399, Philadelphia, PA, USA, 1997. Society for Industrial and Applied Mathematics.
[2] Rolf Backofen and Sebastian Will. A constraint-based approach to fast and exact structure prediction in three-dimensional protein models. Constraints, An International Journal, to appear.
[3] David Baker and Andrej Sali. Protein structure prediction and structural genomics. Science, 294:93–96, 2001.
[4] Bonnie Berger and Tom Leighton. Protein folding in the hydrophobic-hydrophilic (HP) model is NP-complete. In RECOMB '98: Proceedings of the Second Annual International Conference on Computational Molecular Biology, pages 30–39, New York, NY, USA, 1998. ACM Press.
[5] Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. Shindyalov, and Philip E. Bourne. The Protein Data Bank. Nucleic Acids Research, 28(1):235–242, 2000.
[6] A. Borning, R. Lin, and K. Marriott. Constraint-based document layout for the web. ACM Multimedia Systems Journal, 2000.
[7] Enrique Castillo, Jose Manuel Gutierrez, and Ali S. Hadi. Expert Systems and Probabilistic Network Models. Springer, 1st edition, 1997.
[8] Philippe Codognet and Daniel Diaz. Compiling constraints in clp(FD). Journal of Logic Programming, 27(3):185–226, 1996.
[9] Jacques Cohen. Constraint logic programming. Communications of the ACM, 33(7), 1990.
[10] Jacques Cohen. Bioinformatics - an introduction for computer scientists. ACM Computing Surveys, 36(2):122–158, 2004.
[11] Vitor Santos Costa, David Page, Maleeha Qazi, and James Cussens. CLP(BN): Constraint logic programming for probabilistic knowledge. In Proceedings of the 2003 Conference on Uncertainty in Artificial Intelligence (UAI-03). Morgan Kaufmann Publishers, 2003.
[12] I. Cruz, K. Marriott, and P. van Hentenryck. Special issue on constraints, graphics, and visualization. Constraints, An International Journal, 3(1), 1998.
[13] D. Gilbert, R. Backofen, and R.H.C. Yap (eds.). Special issue on bioinformatics. Constraints, An International Journal, 6(2-3), 2001.

[14] Rina Dechter. Constraint Processing. Morgan Kaufmann Publishers, 2003.
[15] Mehmet Dincbas, Helmut Simonis, and Pascal van Hentenryck. Solving large combinatorial problems in logic programming. Journal of Logic Programming, 8:75–93, 1990. Special Issue: Logic Programming Applications.
[16] Daniel Frost and Rina Dechter. Look-ahead value ordering for constraint satisfaction problems. In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI'95, pages 572–578, 1995.
[17] Kristian Kersting and Luc De Raedt. Bayesian logic programs. In J. Cussens and A. Frisch, editors, Proceedings of the Work-in-Progress Track at the 10th International Conference on Inductive Logic Programming, pages 138–155, 2000.
[18] Andrzej Kolinski and Jeffrey Skolnick. Reduced models of proteins and their applications. Polymer, 45:511–524, 2003.
[19] Andrzej Kolinski and Jeffrey Skolnick. Reduced models of proteins and their applications. Polymer, 45:511–524, 2004.
[20] D. Koller. Probabilistic relational models. In S. Džeroski and P. Flach, editors, Proceedings of the 9th International Workshop on Inductive Logic Programming, volume 1634 of Lecture Notes in Artificial Intelligence, pages 3–13. Springer-Verlag, 1999. Invited paper.
[21] Ludwig Krippahl and Pedro Barahona. Applying constraint programming to protein structure determination. In Constraint Programming, pages 289–302, 1999.
[22] V. Kumar. Algorithms for constraint satisfaction problems: A survey. AI Magazine, 13:32–44, 1992.
[23] K. F. Lau and Ken A. Dill. A lattice statistical mechanics model of the conformational and sequence spaces of proteins. Macromolecules, 22:3986–3997, 1989.
[24] S. Muggleton. Stochastic logic programs. In L. De Raedt, editor, Advances in Inductive Logic Programming, pages 254–264. IOS Press, 1996.
[25] Alessandro Dal Palù, Agostino Dovier, and Federico Fogolari. Constraint logic programming approach to protein structure prediction. BMC Bioinformatics, 5:186, 2004.
[26] Alessandro Dal Palù, Agostino Dovier, and Enrico Pontelli. Heuristics, optimizations, and parallelism for protein structure prediction in CLP(FD). In PPDP, pages 230–241, 2005.
[27] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann Publishers, 1987.
[28] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77:257–286, 1989.
[29] Taisuke Sato and Y. Kameya. Parameter learning of logic programs for symbolic-statistical modeling. Journal of Artificial Intelligence Research, pages 391–454, 2001.

[30] Taisuke Sato, Yoshitaka Kameya, and Neng-Fa Zhou. Generative modeling with failure in PRISM. In IJCAI, pages 847–852, 2005.
[31] E. Tsang. Foundations of Constraint Satisfaction. Academic Press, 1993.

[32] P. van Hentenryck and V. Saraswat (eds.). Strategic directions in constraint programming. ACM Computing Surveys, 28(4):701–728, 1996.
[33] M.G. Wallace. Practical applications of constraint programming. Constraints Journal, 1(1), 1996.
[34] C. S. Wetherell. Probabilistic languages: A review and some open questions. ACM Computing Surveys, 12(4):361–379, 1980.
[35] Neng-Fa Zhou. CGLIB - a constraint-based graphics library. Software Practice and Experience, 33(13):1199–1216, 2003.
[36] Neng-Fa Zhou. Programming finite-domain constraint propagators in action rules. To appear in Theory and Practice of Logic Programming (TPLP), 2006.
[37] Neng-Fa Zhou and Taisuke Sato. Efficient fixpoint computation in linear tabling. In Fifth ACM-SIGPLAN International Conference on Principles and Practice of Declarative Programming, pages 275–283, 2003.
[38] Neng-Fa Zhou, Taisuke Sato, and Koiti Hasida. Toward a high-performance system for symbolic and statistical modeling. In IJCAI Workshop on Learning Statistical Models from Relational Data, pages 153–159, 2003.
[39] Neng-Fa Zhou, Taisuke Sato, and Yi-Dong Shen. Linear tabling strategies and optimizations. Theory and Practice of Logic Programming (TPLP), submitted; preliminary results appear in ACM PPDP'03 and ACM PPDP'04.