<<

275

Algorithmic Strategies in Combinatorial

Deborah Goldman* Sorin Istrail~ Giuseppe Lancia ~ Antonio Piccolboni ~ Brian Walenz ¶

Abstract chemical compounds various measures, or indices, that is a powerful new technology provide correlations with the likelihood of biological ac- in drug design and molecular recognition. It is a wet- tivity. There are 2D measures (at the level of the chem- laboratory methodology aimed at "massively parallel" ical graph) and 3D measures (at the level of coordi- screening of chemical compounds for the discovery of nates for its in the 3D space). In our context, compounds that have a certain biological activity. The "biological activity" is a complex process of molecular power of the method comes from the interaction be- recognition, binding, and possible conformation change tween experimental design and computational model- between one small compound, and a large biological ing. Principles of "rational" drug design are used in the complex (e.g., a protein complex). It is very diffi- construction of combinatorial libraries to speed up the cult to capture the notion of biological activity within discovery of lead compounds with the desired biological the framework of numerical measures at the compound activity. level. However, some measures were found that that This paper presents algorithms, software develop- work well. One notorious example is the Wiener index ment and computational complexity analysis for prob- defined as the sum of pairwise shortest path distances lems arising in the design of combinatorial libraries for between atoms in the chemical graphs of the compound. . We provide exact polynomial time al- It correlates with physicochemical characteristics such gorithms and intractability results for several Inverse as the boiling point. A variety of chemical topological Problems - formulated as (chemical) graph reconstruc- (2D) and topographical (3D) indices were introduced tion problems - related to the design of combinatorial and much research was performed towards the under- libraries. These are the first rigorous algorithmic re- standing of their correlation with various types of ac- sults in the literature. We also present results pro- tivities. vided by our combinatorial chemistry software package A chemical index is a map from the set of chemical OCOTILLO for combinatorial design using real compounds to the Real numbers. One could think of data libraries. The package provides exact solutions for the co-domain of this function as the "activity space". general inverse problems based on shortest-path topo- Compounds with similar activity are mapped "close" logical indices. Our results are superior both in accuracy in the space. Typically huge numbers of compounds are and computing time to the best software reports pub- mapped to identical, or near identical index values. In a lished in the literature. For 5-peptoid design, the com- natural way, given some activity level/value, or a region putation is rigorously reduced to an exahustive search in the activity space, one wants to design chemical of about 2% of the search space; the exact solutions are compounds having that index value, or whose index is found in a few minutes. in that region. Solving these types of inverse problems is the subject of our paper. The input data for these 1 Introduction computational problems are laboratory experiments, where some lead compounds were identified. The 1.1 The Combinatorial Chemistry Framework problem is to generate new laboratory experiments that Chemical Indices and Inverse Design Prob- will accelerate the likelihood of discovering new, more lems based on them. The area of quantitative powerful, compounds. In order to do so we have to structure-activity relationship (QSAR) identified for solve inverse problems based on specific indices. One wants several solutions for the inverse problem that are ~Berkeiey, [email protected] as "diverse" (different chemcial structure) as possible. tSandia National Laboratories, [email protected] Based on them, a new combinatorial library is created, **Universityof Padova, lancia~dei.unipd.it and new lead compounds are discovered. §UC Davis, piccolboQucdavis.edu ¶Sandia National Laboratories, [email protected] 276

Chemical Graph Reconstruction Problems. duction of a simple computational filter - the flower New types of graph reconstruction problems occur in compression - and show how it can be used to group this area whose solutions are needed for the design of many graphs which have related Wiener indices and combinatorial libraries. One type involves constructing discard, at once, whole families of unfeasable solutions graphs or trees having a given topological index. A sec- without examining their members in detail. Our al- ond type involves selecting chemical fragments from a gorithms can be easily generalized to find all (or any) library and creating "artificial proteins", called combi- feasible whose topological index of interest is natorail , that match a given index. within some given range from a specific target. Our software package OCOTILLO contains the implemen- 1.2 Algorithmic Challenges tations of several algorithms that exactly solve inverse In this paper we will consider in particular the problems based on general shortest-paths indices. Wiener index (the sum of the distances in the graph between each pair of vertices), which is probably the 1.3 Previous Work most widely known ([1]). Combinatorial chemistry research started in the The Wiener index, W, was devised by the early 1990s (see [5, 7, 8, 9] for early develoments and Harold Wiener in 1947 [2], who found a strong correla- history). tion between W and a variety of physical and chemical A lot of studies were devoted topological indices and properties of , and arenes. correlations with biological activity [10, 11, 12, 13, 14], With respect to the inverse problem on unrestricted including an entire book "Chemical Graph Threory", N. graphs, we will show that in general it has a simple Trinajstic [15]. solution both in its decision (does a graph with a given Heuristic approaches to combinatorial chemistry Wiener index exist?) and construction versions. The design problems are discussed in [3, 16]. problem however becomes more complex if we add the constraint that the graph must be a tree. For this case 1.4 An outline of the paper we give a pseudo-polynomial dynamic programming The remainder of the paper is organized as follows. procedure which builds a tree with a given Wiener index In section 2 we introduce some suitable notation. Sec- (if one exists), but we do not know the complexity of the tion 3 is devoted to the inverse Wiener index problem decision version: While analyzing the inverse problem for general graphs (subsection 3.1) and trees (subsec- on trees, we come to the definition of a new interesting tion 3.2). Section 4 discusses the problem of recon- topological property, that is the loads distribution for structing a tree from its set of splits. In section 5 we the edges. We show that finding a tree whose edges address the problem of building a peptoid with a given have given load values is NP-complete, and describe a Wiener index. Subsection 5.1 contains a polynomial search procedure which solves the problem very quickly algorithm, based on dynamic programming, for find- in practice. ing one such peptoid, while subsection 5.2 describes a As far as the construction of peptoids is concerned, fast search procedure capable of listing all feasible so- our work focuses on inverse problems based on 2D and lutions and reports on our computational results of the 3D QSAR descriptors (which include the Wiener index, OCOTILLO package. but also the Pairs, the Bemis-Kuntz histogram of triangles) that have been proven effective in a num- 2 Preliminary Definitions ber of projects for selecting active from large DEFINITION 2.1. Given a graph G = (V, E), by dG(i, j) databases. Formulated as graph reconstruction prob- we denote the shortest path (i.e. with the smallest lems, a typical inverse problem is defined as follows. number of edges) between two vertices i and j. If G Given a combinatorial library for peptides with N units, is a tree, then dG(i,j) is the length of the unique path with fragment libraries for every position of maximum between i and j. We simply write d(i, j) if the graph or size L and an integer W, find a set of high diversity tree is understood from the context. peptides whose Wiener index is W. We present a polynomial time algorithm, based on As is customary, we may often denote by n, or n(G), dynamic programming, for such inverse problem. Fur- the number of nodes of a graph. We denote by Kn the ther, we describe a software implementation of a search complete graph on n nodes. Sn is a star on n nodes (all algorithm, capable of finding all possible solutions, that nodes but one are leaves). Pn is a path of n nodes. outperform the existing methods proposed in the liter- For ease of notation, in the following definition and ature (see e.g. [3, 4, 5, 6]). Our strategy is based on in the remainder of the paper, when we write ~,jey, an effective pruning of the search space, via the intro- the summation has to be understood as actually re- 277

In order to prove this theorem, we need the follow- ing lemma:

LEMMA 3.1. For every graph G = (V, E) with diameter v j 2 and Wiener index W, the graph G' = (V, EU {e}) for VI ] V3 e ~ E has Wiener index W - 1. Proof. Let e -- (vl, v2). Clearly riG(v1, v2) = 2 and de, (vl, v2) -- 1. Any other distance is preserved by this transformation. •

We are now ready to prove Theorem 3.1.

Proof. Let Go = Sn, the star of size n. We have Figure 1: A 3-peptoid; the three fragments are anchored w(Go) = (n - 1) 2 and the diameter of Go is two. Let on a linear scaffold at positions vl, v2 and v3. G1 be the graph obtained by adding to Go an edge not already contained in it. G1 is either K~ or has diameter stricted to pairs of distinct vertices. two, and by the above lemma w(G1) = w(Go) - 1. It it possible to repeat this procedure until the graph DEFINITION 2.2. Given a graph G = (V,E), its Wiener obtained is Kn and w(Kn) = n(n - 1)/2. At any step, index w(G) is the total node-to-node path length. That the lemma guarantees that w(Gk) = w(Gk-1) - 1. Thus is, w(G) = ~'~.i,jev dG(i,j). each number in the interval IN = [n(n - 1)/2, (n - 1) 2] is the Wiener index of Gk for some k. The following graphs are used to describe formally Since the intervals overlap for n > 4, and including the problem of the combinatorial synthesis of specific the interval values for n = 4, we find for W > 5 there is a molecular structures. graph G such that w(G) -- W. 1,3 and 4 are the Wiener DEFINITION 2.3. A (chemical) fragment is a graph G index of P2 (a path of length 2), K3 and P3, resp. To with a special vertex v denoted as its anchor, or hook- prove that there is no graph G such that w(G) = 2, it ing point. A peptoid is a graph obtained by join- is enough to observe that the graph on n nodes with the smallest Wiener index is Kn, and the one with the ing in a linear fashion from left to right, k fragments largest is Pn, but w(P2) = 1 and w(K3) = 3. • G1, . . . , Gk via a path through their hooking points (Fig- ure i). Note that, when k = 1, a fragment is a spe- The theorem is constructive and leads in a straight- cial case of a peptoid. For a peptoid D = (V, E), by forward way to an algorithm solving the search problem, l(D) := ~icu dG(i,vk) we denote the total distance of that is outputting a graph with the required given in- all vertices from the rightmost hooking point vk. For dex. Since the size of the graph is polynomial in the k = 1, lO gives the total distance from all nodes of a Wiener index and a number can be represented with a fragment to its anchor. logarithmic number of bits, this algorithm can be classi- We can think of a rooted tree as a special case of fied as pseudo-polynomial -- that is, it is polynomial in fragment whose hooking point is its root. Henceforth the parameters describing the problem but not on the we have the following definition for rooted trees. size of the representation of the input. More in detail, the computation time is dominated by the time neces- DEFINITION 2.4. Given a tree T = (V,E) with root sary to output the graph, that is O(n2). Since for the v C V, the total distance of its vertices from the root class of graphs considered n(n - 1)/2 _< W < (n - 1) 2, is l(T) := ~,.i~y d(i, v). the time complexity is also O(W).

3 The Inverse Wiener Index Problem 3.2 The inverse Wiener problem for trees We have developed graph theoretic results for the The problem we will be concerned with in this reconstruction problem based on the Wiener index. subsection is the following: given a positive integer W, find whether there exists a tree T s.t. w(T) = W. We 3.1 The inverse Wiener index problem for will consider also the problem of finding such a tree. graphs Clearly, Theorem 3.1 involves non-trees, and thus it does not apply to this more constrained setting. Indeed, THEOREM 3.1. For any W # 2, 5 there exists a graph there are many integers that are the Wiener index of G such that w(G) = W some graph but not of any tree. Using an algorithm we 278

will describe in the following, we checked exhaustively for W < 10000 and 159 turns out to be the largest such example. This experimental evidence together with the w(T) = Z analogy with the case of graphs leads to the following v,wEV conjecture: = Z + Z + v,wE V1 v ,wE V2 CONJECTURE 3.1. Every positive integer but a finite set1 is the Wiener index of some tree. Z e(v,w) vE V1 ,wEt'? The above conjecture appeared first in [17], where = w(r~) + w(T2) + it was verified for W up to 1206 by a complete enumer- Z (d(v, vl)+l+d(v2,w)) ation of all unlabeled non isomorphic trees of up to 20 vE V1 ,wE V2 nodes. If true, it would imply that the decision problem = w(r~) + w(r2) + l(T1)n(r.,) + is trivial, but the proof would not necessarily lead to an efficient solution of the search problem. l(T2)n(T1) + n(Ti)n(T2)

3.2.1 A recurrence relation for the Wiener in- dex It is possible to prove a recurrence relation for the 3.2.2 A dynamic programming algorithm for Wiener index of trees which is closely related to the the inverse Wiener index problem one we will prove for peptoids in 5.1. Let T = (V, E) This recurrence relation leads naturally to a dy- be a tree and (vi,v2) an edge. Let T1 = (Vi,Ei) and namic programming algorithm for the problem of find- 27.2 = (V2, E2) be the two trees obtained by removing ing a tree T with assigned w(T), l(T) and n(T) . (Vl, v2). Let us assume that T and T1 are rooted in vl The key observation is the following: every tree with and 2"2 in v2. We have the following recurrence for w(-), at least one edge can be decomposed in the way dic- l(-) and n(-): tated by the above recurrence, that is by removing an edge. Whatever the edge removed, we obtain two trees THEOREM 3.2. Ti,i = 1, 2, and for each i, w(Ti) < w(T), l(Ti) < l(T) and n(Ti) < n(T). Let us define a matrix M so that MW, L,N be 1 if there is a tree T such that w(T) = W, (3.1) n(T) = n(T1) + n(T2) l(T) = L, and n(T) = N, 0 otherwise. According to the (3.2) l(T) = l(T1) + l(T2) + n(T2) above recurrence MW, L,N can be computed if MW,,L,,N, (3.3) w(T) = w(Ti) + w(T2) + l(T1)n(T2) + is known for every W' < W, L' < L and N' < N. This I(T2)n(T1) + n(Ti)n(T2) implies that it is possible to compute the entries of M, starting form the initial value Mo,oA = 1 and evaluating Proof. 3.1 is obvious. To prove 3.2 we use the definition to 0 all the entries corresponding to W, L, N values out- of l(-) and rearrange the summations slightly, as follows: side feasible bounds, proceeding in an orderly fashion. This algorithm solves as well the inverse Wiener index problem: given W, we compute upper bounds t(T) = Zd(vl,v) for the largest L and N such that the triple (W, L, N) vEV is feasible. Then we fill the matrix M up to the entry = d(Vl, + E d(vl, v) MW, L,N. If for any L' <_ L, N' < N MW, L,,N, = 0 then vEV1 t, EV2 there is no tree T such that w(T) = ~V. = e(v , + Z v) + 1) + 1 The algorithm can be extended so as to return a vE Vi vE V2 tree with the required properties: as it is customary in = *(T1) + t(T2) + n(r2) dynamic programming, it is enough to stor~, whenever an entry of M is set to 1, the indexes of the two entries to which the recurrence relation has been successfully The same technique leads to the proof of 3.3 applied. In our implementation we use a technique related to dynamic programming called memoization. Instead --r--NNamely:{2, 3, 5, 6, 7, 8, 11, 12, 13, 14, 15, 17, 19, 21, 22, 23, of filling the matrix M bottom up, this technique ap- 24, 26, 27, 30, 33, 34, 37, 38, 39, 41, 43, 45, 47, 51, 53, 55, 60, 61, plies recursively the recurrence relation. To avoid re- 69, 73, 77, 78, 83, 85, 87, 89, 91, 99, 101, 106, 113, 147, 159} computation of the same entries, intermediate results 279 get stored in M. It can be thought of as a top-down of a tree and rdeg(-) the degree of its root. As for The- version of basic dynamic programming. It is worth not- orem 3.2, let T = (V,E) be a tree and (vl,v2) an edge. ing that without storage of intermediate results the time Let T1 = (171, El) and T2 = (V2, E2) be the two trees ob- complexity would blow-up exponentially, because of the tained by removing (vl, v2). Let us assume that T and repeated recomputation of the same entries in M. This T1 are rooted in vl and T2 in v2. We have the following: technique is valuable when an algorithm can be termi- THEOREM 3.3. nated without filling completely the matrix M. Other- wise, the number of entries evaluated is the same, but mdeg(T) = max(mdeg(T1), mdeg(T2), there is a slight overhead due to function calls and stack rdeg(T1) + 1, rdeg(T2) + 1) management. For our problem, memoization turns out to be much faster for "yes" instances. For example, rdeg(T) = rdeg(T1) + 1 (524,36,19) is a "yes" instance and requires less than one Together with Theorem 3.2, Theorem 3.3 character- second to compute, while (525,36,19) is a "no" instance, izes the existence of a tree with the required properties requiring 145 seconds. This example is rather extreme, and thus can be used to define a dynamic programming but this behavior is absolutely consistent. This evidence algorithm. This time, though, the matrix M contain- prompts for further research along different lines: ing the partial solutions will have five dimensions, to • quantify and analyze this asymmetry between account also for mdeg(.) and rdeg(-). The worst case "yes" and "no" instances; bound for the running time has to be updated accord- ingly. • exploit it to make the computation more efficient (is We turn now to k-ary trees. To develop a recur- it safe to "give up" after a reasonably short running rence for the Wiener index in this case, we still rely on time? In exploiting the recurrence, is it faster to the quantities that proved useful so far -- namely w(.), compute many entries in parallel and stop when the l(-) and n(T) --, but we decompose a tree in a differ- first successful computation is over?) ent way. Instead of using cuts as before we exploit the As a further algorithmic refinement, already ex- definition of k-axy tree. Let T ---- (V~ E) be a k-ary tree ploited in the above mentioned experimental results, and let Ti = (Vi, Ei) be the k subtrees hanging from its we adopt also a divide and conquer strategy, whenever root. We can prove a yet more complex recurrence for possible. Since there are many possible ways of using the Wiener index in this case. the recurrence relation, we try first the ones for which THEOREM 3.4. n(T1) ~- n(T2). This way we proceed directly to the smallest possible sub-problems. This approach is in- (3.4) n(T) = 2 n(Ti) + 1 effective in the worst case (consider W = (N - I) 2, L -- N - I, that is a star on N nodes), but suggests a (3.5) l(T) = Z (l(Ti) + n(Ti)) sensible order in which to proceed. i The pseudo-code is given in Appendix A. (3.6) w(T) = + l(ri) + ,,(rd) + i 3.2.3 Recurrence relations for the Wiener in- dex of bounded degree trees and k-ary trees Z l(Ti)n(Tj) + E 2n(T~)n(Tj) Often the graphs of molecular structures have in- i~j i

nodes on the smallest shore of the unique cut identified of the form we just described. Inductively, the tree must bye. The load of the edge, denoted by l(e), is the number necessarily contain 3m paths of length mini{s/} consist- s(e) × (n- s(e)) of paths in T which contain the edge e. ing of edges with splits min{si}, min{si} - 1,... , 1 (for example, each edge with a split of two must necessar- By using the loads, we can rewrite the Wiener index ily be connected to an edge with a split of one). At for a tree as w(T) = ~-~eeE l(e)- this point, we conclude that, in fact, we must have 3m The last bit of the Wiener index and the last bit similar paths of length si each starting from an edge of n are not independent, as the following proposition with split si (which contain the former paths) since we shows. This result appears also in [17, 18], where it are now only able to attach loads >_ min{s~} and, by as- is derived by considering trees as bipartite graphs and sumption on the si sizes, max{s/} < 2 min{si}: for each arguing on the parity of paths. We give a much simpler edge with split s between min{s~} + 1 and max{si}, the proof. only edge with smaller load that we can attach must PROPOSITION 4.1. Any tree with an odd number of have load exactly s - 1. Finally, also from the bounds nodes has an even Wiener index. on the s~, we infer that exactly three paths depart from each leaf of an initial m-star, the edges of which all Proof. For each edge either s(e) or n - s(e) is even, so have split size B + 1. As above, the s~ values of the the load is even. • edges departing from the star edges provide a solution to the instance of of 3-PARTITION. Since the reduction The problem of finding a tree of a given Wiener is clearly polynomial time computable, this completes index asks therefore to find n - 1 loads whose sum is W. the proof of the theorem. [] This prompted us to the following question: assume we are given such loads; can we find the tree? Since for a The problem of reconstructing a tree from its set fixed n the loads uniquely determine the splits, we can of splits can be solved by the following enumerative rephrase the problem as: given splits sl,..., sn-1 find algorithm. Sort the splits so as to have sl > ... > a tree T such that the edges of T have the given input sn-1 = 1. Starting with a tree consisting of a single splits. This is a problem of tree reconstruction and the node of weight n, we insert the edges one at a time, set of splits can be viewed as yet another topological ending Up with a tree on n nodes, each of weight 1. At property that characterizes a family of trees. Further- step k we more, the problem of reconstructing a tree from its 1. look -exaustively- for a node i whose weight w~ is set of splits is interesting on its own. Unfortunately, larger than sk the reconstruction problem turns out to be NP-complete 2. augment: attach to node i a new node node j, setting wj := Sk and decreasing wi to wi - sk. THEOREM 4.1. The problem, SPLITS, of reconstruct- ing a tree from its set of splits is NP-complete. Note that at step 1 we may have to break ties. The presence of these ties is what makes the algorithm Proof. We reduce from the problem, 3-PARTITION. In exponential, since we may have to backtrack from a this problem we are given a bound, B, and 3m elements, wrong choice. It is not immediate that this algorithm Sl,... ,s3m, such that for each i E {1,... ,3m},B/4 < does indeed work. For instance, the sorting of the .si si < B/2. The problem asks whether there exists a is crucial, as the following example shows: Take n = 4 partition of the {si} into into 3-element disjoint sets and sl = s2 = 1, s3 = 2. Then there is no way of such that the sum~of the elements in each set is B. placing the split 2 after having placed the two splits We map the instance of 3-PARTITION to the 1. So we need to show that it is enough to consider following instance of SPLITS: the value B + 1 ap- the sorted permutation of the splits out of the (n - 1)! pears m times and, for each i, we include the values, possibilities. si,s~ - 1,... ,1. If we are given a yes instance of 3- PARTITION, we can build a tree in the following way: PROPOSITION 4.2. If sl > ... > sn-1 is a YES the root has m children (an m-star) each corresponding instance of SPLITS, then the algorithm terminates with to a 3-element set in the solution to 3-PARTITION and a feasible solution. then departing from each of these there are three paths of length equal to the size of items that belong to that Proof. We may reason backwards by starting from set. It can be easily verified that we obtain the given the tree and finding the correct sequence of nodes to splits. Conversely, suppose we are given a tree with the augment. Let T be a feasible solution. Give weight 1 to set of splits listed above. We show that it is necessarily each node of T and repeat the following operation, for 281 k = 1 to n - 1, until T has only one node. Take a leaf i l(-) denote the sum of thedistances to the rightmost of T of minimum weight among the leaves. Let (i,j(k)) anchor in a peptoid, or the sum of the distances to the be the unique edge out of i. Delete node i and increase anchor of a fragment from our library (which is just a Wj(k) as Wj(k) :: Wj(k) -~- Wi. By looking backwards at peptoid with one anchor). Remove the edge linking the the sequence of trees thus obtained, we see a possible run rightmost two anchors of a peptoid, P, leaving a smaller of the algorithm which augments on j(k - 1),... ,j(1) peptoid, P', and a fragment, F. The dynamic program- creating edges of decreasing splits. • ming algorithm follows from the recurrences which we present below. Note that by storing one solution (if one This argument also implies that for YES-instances exists) in each entry of the table we build, we can out- there always exists a choice of nodes to augment which put a solution with Wiener index W if one exists. The requires no backtrack, and indeed this is what hap- recurrences follow. pened on the vast majority of small examples which we tried initially, before proving that the problem is NP- n(P) = n(P') +n(F) complete. We then performed a more exaustive testing l(P) = l(P') + n(P') + l(F) in the following way. We generate an unlabeled tree, uniformly at random (as described in [19]), then com- w(P) = w(P') + w(G) + n(P')l(F) + pute its splits and try to reconstruct it (or a different n(F)l(P') + n(P')n(F). feasible solution). Ten instances for each value of n = 10, 20,... , 100 were solved immediately, while for n _> 110 the algo- rithm started incurring in some long runs every once in 5.2 A Fast Enumerative Algorithm a while. By performing the selection at step 1 in an or- In this section, we present a general method for derly fashion (i.e. try the available nodes by increasing inverse problems based on shortest-paths topological order of weight) we solved all generated problems, for n indices. We also present results from our software up to 300, in less than 1 second each. The good average package OCOTILLO on actual combinatorial libraries. performance raises an interesting theoretical question on the probability that the search algorithm may find a THEOREM 5.2. The Wiener index of a linear-scaffold solution withouth backtracking (or within a small num- peptoid constructed with fragments (Figure 1) is ber of tries) on a tree generated u.a.r. N N N 5 Inverse Problems for Peptoid Design W = Z Z [ndj + (j - i)ninj + nflil + Z wi In this section we consider the following problem. In i=l j=i+l i=i the framework of combinatorial chemistry we are given where ni is the number of nodes in fragment i, wi is the a fragment library, and values (lists, histograms) for Wiener index of the fragment and li is the sum of the some index. We want to find combinatorial peptoids distance from each node to the anchor. (a compound of elements from the given library) that match exactly that index. Proof. Consider the compound in Figure 1. When we 5.1 A dynamic program for peptoid construc- compute the shortest path between any two atoms, there are two cases: either the two atoms are in the tion same fragment, or they are not. For all pairs of atoms THEOREM 5.1. One can compute in polynomial time that are in the same fragment we ~e-compute the sum whether there exists a peptoid with a given Wiener of the distance between each pair and denote this value index, W, and, if so, output a solution. w -- it is just the Wiener index of the fragment. For pairs of atoms that are in different fragments, Proof. We use a dynamic programming algorithm sim- the shortest path between the two atoms is always ilar to the algorithm given to find a tree of a particular through the two anchors associated with the fragments. Wiener index. Note that W is bounded by a polyno- We break this path into three components: mial in the size of the peptoid, N, and the library size, L. Assume we have precomputed the Wiener indices of 1. The shortest path from atom i to its anchor. the fragments in the library. Number the anchors along the peptoid, say from left to right, by 1 through N. We 2. The shortest path along the scaffold. build up our peptoid from left to right by adding a frag- ment from the library to each anchor sequentially. Let 3. the shortest path from atom j to its anchor. 282

The sum of distances between all pairs of atoms in two different fragments is: 4 7r,. i----i "~'~- ! ...... T---'T ...... T'--7----~ .... " ...... I .~ \ ! ! ! i ! ! = E iEF. jEF~ 3~" i~-- I I \r-"" i' ! ~ i ...... 3~-,. i / \ ] i i i i l I / \ i i i i i i = nb E d(i, Va) + nanbd(Va,Vb) + .~ zso~ ! ...... , --~_~__~..._.~ .... ._~__._._~ iEF,, I I /~ ' \i i ! i i i i ~ 9~. -----.~----/-- .'-4-.----~-.--.~,A ...... ~ ...... i ...... ~.---.-.~ ...... ~...... jEF~

= nbl~ + nanbd(va,Vb) + nalb o95*, ! i\ :: i ' ! [ ] ~ : ~ : i ! where Va is the anchor atom of fragment a and na is the number of atoms in fragment a. O'O0~"~S ~'.~ 72'2 11~1~799 ff~6 2S804 29!4T3 33il42 £IO

The Wiener index is now the sum of P over all pairs Wiener Index of fragments, plus the sum of the Wiener index of the individual fragments. Figure 2: The percent of constructed flower- N N N configuration peptoids that need to be examined in de- tail. A histogram of the number of peptoids with a spe- i=l j:i+l i=1 cific Wiener index follows almost the same curve, which N N N explains why the pruning algorithm does not prune uni- : E E [nilj + ninjd(vi,vj)+njli] + Ewi formly for all Wiener index -- there are just more pep- i:1 j:i+l i----1 toids that match.

If, as in our case, the scaffold is a linear chain, then the distance from the anchor of fragment i to the anchor of The proof closely follows that for Theorem 5.2 and is fragment j is IJ - i], and, omitted. We let D be the difference between F on a set N N N of fragments and W on a set of fragments given the W .~- E E [nilj Jr (j - i)ninj --t- njl'il "Jr"E wi ordering ~r: i=1 j=i+l i=1 N N D(Tr) = W - F = Z E (J - i)n~(i)n,,(j). i~l j=i+l We can rewrite the Wiener index equation as For a given set of fragments, the ordering, 7rmi~, with N N N N N the smallest Wiener index is also the ordering with the W = E ni E li + E E (j - i)nrnj + E (wi -nili) smallest value of D. This forms our pruning search. i=1 ~=z i=1 j=~+l i=1 If the Wiener index we are looking for is smaller than minD + F we do not need to check any orderings for Suppose that we treat the entire scaffold as a the correct Wiener index. single vertex, fort~xample, in Figure 1, we would for each set of N fragments do compress vl, v2 and v3 to a single vertex. In effect, if mind + F < Wta~get 4_ maxD + F then we have constructed a peptoid with an unordered set examine all orderiugs of the set of fragments for of fragments, rather than an ordered list of fragments. Wtarget. This is called the flower compression, and is at the heart else of our fast search method. discard all orderings of the set of fragments. As a first approximation to the minimum and THEOREM 5.3. The Wiener index of a flower- maximum, we can replace each ni with the smallest or compressed peptoid is largest value of n in the peptoid, for example, N N N N N N 3 - N n~ni~. i=l j=i+l i:l minD > E E (j - i)n~i~ = 6 i=l j=i+l 283

amine fragments (resulting in 164 different w,l,n values) for each position. This configuration results in 2.6e12 possible peptoids. A brute force enumeration using the w, l, n compu- t lJ ~ i i tation explained earlier required 51,786 cpu-seconds, or 50.5e6 peptoids per second. For comparison, we can -o.i~w ...... 41___i__ ....] i: !i estimate that without the w, l, n computation, the enu- meration would be at least (350/165) 5 ~ 43 times as ,,,~ > ..... +../_....~ ..... &._.~ ...... ~ .._~ long -- about one month of cpu time -- without even considering that the Wiener index computation is also il i '. '~! ! : ! ! ! ! more difficult. Applying the flower-compression prun- ing algorithm achieves a significant speedup as can be a,, r- ...... -2---@ ...... ~ ..... - .... v- -~- --{-- +-- --~ seen in Figure 3, requring anywhere from 540 seconds 5401 ~2 24 28 32 36 10 41 CO (4,860e6 peptoids per second) to 3,500 seconds (750e6 Wiener Index peptoids per second).

6 Acknowledgements Figure 3: The CPU time required to search. The authors would like to thank Jean-Loup Faulon and Diana Roe for useful discussions regarding this However, this bound is very weak -- at Wiener index paper. This work was supported in part by Sandia 9000, we need to examine 32.8% of the peptoids in National Laboratories, operated by Lockheed Martin detail, compared to 4.7% when using the optimal values for the U.S. Department of Energy under contract for minD and maxD. No. DE-AC04-94AL85000 and by the Mathematics, Information, and Computational Science Program of CONJECTURE 5.1. Given nl < n2 < ... < aN, the the Office of Science of the U.S. Department of Energy. ordering for the optimal minimum value of D is: D. Goldman was supported by an American Fellowship from the American Association of University Women 2i- 1 ifi < N/2 Educational Foundation. 7rmin(i) = 2(g-i+l) if i > N/2

References CONJECTURE 5.2. An algorithm to compute the order- ing for the optimal maximum value of D, given nl <_ [1] Fiftieth Anniversary of the Wiener Index, Discrete n.2 < ... < aN, is: Applied Mathematics Special Issue, Vol. 80, no. 1 Lp := 1; L := 0; Gutman, I., Klavzar, S. and Mohar, B. eds., 122 pages, Rp := N; R := 0; 1997 for i := N downto 1 do [2] Wiener, H., Structural determination of paraffin boil- if R >_ L then ing points, J. Amer. Chem. Soc., 69 (1947) 17-20 7rma~(Lp) := i; Lp := Lp + 1; L := L + ni; [3] Sheridan, R., P. and Kearsley, S., K., Using a Ge- else netic Algorithm To Suggest Combinatorial Libraries, J. Chem. Inf. Comput. Sci., 35 (1995) 310-320 7rmax(Rp) := i; Rp := Rp - 1; R := R + ni; [4] Venkatasubramanian, V., Chan, K. and Caruthers, J. For example, M., Evolutionary Design of Molecules with Desired Properties Using the Genetic Algorithm, J. Chem. Inf. i 1 2 3 4 5 6 7 8 Comput. Sci., 35 (1995) 188-195 [5] Gordon, Douglas J., Bellott, Emile M., and Tenen- ni baum, Boris, Using a Genetic Algorithm to Select an nTr,~i,~ ( i ) Optimum Combinatorial Library Using a Subset of n.=a= i z~i5 iO fl £ O IO I I l~ Available Input Materials, Exploiting Molecular Di- versity: Refining Small Molecule Libraries, La Jolla, Both conjectures have been extensively tested. California, February 1-5, 1999. As seen in Figure 2, not many flower peptoids pass [6] Singh, Jasbir et. al., Application of Genetic Algorithms the test. to Combinatorial Synthesis: A Computatorial Synthe- Computational Results. We tested the perfor- sis: A Computational Approach to Lead Identification mance of the pruning algorithm by searching for a five- and Lead Optimization, J. Am Chem. Soc., Vol. 118, fragment peptoid using a fragment library with 350 (1996), 1669-1676 284

[7] Gallop, Mark A., Barrett, Ronald W., Dover, William mation Content of 2D and 3D Structural Descriptors J., Fodor, Stephen P. A., Gordon, Eric M., Applica- Relevant to Ligand-Receptor Binding, J. Chem. Inf. tions of Combinatorial Technologies to Drug Discov- Comput. Sci., Vol. 37, (1997) 1-9 ery. Background and Peptide Combinatorial Libraries, [22] Needham, Diane E., Wei, I-Chen, and Seybold, Paul Journal of , Vol. 37, No. 9 (1994) G., Molecular Modeling of the Physical Properties of 1233-1251 Alkanes, J. Am Chem. Soc., Vol. 110, (1998), 4186- [8] Zheng, Weifaa, Cho, Sung Jin, and Tropsha, Alexan- 4194 der, Rational Combinatorial Library Design. 1. Focus- [23] Plunkett, Matthew J. and Ellman, Jonathan A., Com- 2D: A new Approach to the Design of Targeted Com- binatorial Chemistry and New Drugs, Scientific Amer- binatorial Chemical Libraries, J. Chem. Inf. Comput. ican, April 1997, 69-73 Sci., Vol. 38, (1998) 251-258 [24] Mohar, Bojan, A Novel Definition of the Weiner index [9] Zheng, Weifan, Cho, Sung Jin, and Tropsha, Alexan- for Trees, J. Chem. Inf. Comput. Sci, Vol. 33, (1993), der, Rational Combinatorial Library Design. 2. Ra- 153-154 tional Design of Targeted Combinatorial Peptide Li- braries using Chemical Similarity Probe and the In- A Appendix: pseudo-code for dynamic verse SAR Approaches, J. Chem. Inf. Comput. Sci., programming algorithm for the inverse Vol. 38, (1998) 259-268 Wiener index problem [10] Carhart, Raymond E., Smith, Dennis H., and Venkataraghavan, R., Atom Pairs as Molecular Fea- In the pseudo-code description of the algorithm that fol- tures in Structure-Activity Studies: Definition and Ap- lows, we assume that the matrix M has been initialized plications, J. Chem. Inf. Comput. Sei., Vol. 25, No. 25 to a value "undefined" but for M0,0,1 = 1. (1985) 64-73 [11] Bemis, Guy W. and Kuntz, Irwin D., A fast and ef- tree (W,L,N) ficient method for 2D and 3D molecular shape de- ifN3-N<6WV(N-1) ~>WVL scription, J. Computer-Aided Molecular Design, Vol. N(N - 1)/2 then 6 (1992) 607-628 return 0 [12] Good, Andrew C. and Kuntz, Irwin D., Investigating if MW, L,N ~£ undefined then the extension of pairwise distance mea- sures to triplet-based descriptors, J. Computer-Aided return MW, L,N Molecular Design, Vol. 9, (1995) 373-379 if N = 1 then [13] Rouvray, D.H., The Search for Useful Topological return 0 Indices in Chemistry, American Scientist, Vol. 61, No. forN1 := N/2 to N- l do 6, (1973), 729-735. N2:=N-N1 [14] Sabljid, Aleksandar and Trinajstid, Nenad, Quantita- forL1 :=NI-ltoL-N2 do tive structure-activity relationships: the role of topo- L2 := L- L1 - N2; logical indices, Acta Pharm. Jugosl., Vol 31, (1981), for WI := L1 to W - L1N2 - L2NI - N1N2 do 189-214. W2 := W - l~ - L12~:2 - L2N1 - N1N2 [15] Trinajstic, N., Chemical Graph Theory, CRC Press, if tree(W1, L1, N1) = 1 A tree(VV½, L2, N2) = 1992. 1 then [16] Gillet, V. J. and Willett, P. and Bradshaw, J. and Green, D.V.S, Selecting Combinatorial Libraries to MW, L,N :: 1 Optimize Diversity and Physical Properties, J. Chem. return 1 Inf. Comput. Sei., Vol. 39, No. 1 (1999) 169-177 forL1 :--NI-ltoL-N1 do [17] Lepovid, M. and Gutman, I., A Collective Property of L2 :-- L - L1 - J¥1 Trees and Chemical Trees, J. Chem. Inf. Comput. Sci., for W1 := L1 to W - L1N2 - L2N1 - N1N2 do Vol. 38, No. 5 (1998) 823-826 W2 := W - W1 - L1N2 - L2N1 - NIN2 [18] Bonchev, D., Gutman, I. and Polansky, O., Parity if tree(W1, L1, N1) = 1 A tree(W:, L2, N2) = of the Distance Numbers and Wiener Numbers of 1 then Bipartite Graphs, Commun. Math. Chem., 22 (1987) I~IW,L,N :: 1 209-214 return 1 [19] Wilf~ H. S., The Uniform Selection of Free Trees, MW, L,N := 0 Journal of Algorithms 2 (1981) 204-207 return 0 [20] Brown, Robert D. and Martin, Yvonne C., Use of Structure-Activity Data to Compare Structure-Based Clustering Methods and Descriptors for Use in Com- pound Selection, J. Chem. Inf. Comput. Sci., Vol. 25, (1985) 64-73 [21] Brown, Robert D. and Martin, Yvonne C., The Infor-