Finding Diverse and Similar Solutions in Constraint Programming

Emmanuel Hebrard Brahim Hnich Barry O’Sullivan Toby Walsh NICTA and UNSW University College Cork University College Cork NICTA and UNSW Sydney, Australia Ireland Ireland Sydney, Australia [email protected] [email protected] [email protected] [email protected]

Abstract Parkes, & Roy 1998). Finally, suppose you are trying to ac- It is useful in a wide range of situations to find solutions quire constraints interactively. You ask the user to look at which are diverse (or similar) to each other. We therefore solutions and propose additional constraints to rule out in- define a number of different classes of diversity and simi- feasible or undesirable solutions. To speed up this process, larity problems. For example, what is the most diverse set it may help if each solution is as different as possible from of solutions of a constraint satisfaction problem with a given previously seen solutions. cardinality? We first determine the computational complexity In this paper, we introduce several new problems classes, of these problems. We then propose a number of practical so- each focused on a similarity or diversity problem associated lution methods, some of which use global constraints for en- with a NP-hard problems like constraint satisfaction. We forcing diversity (or similarity) between solutions. Empirical distinguish between offline diversity and similarity problems evaluation on a number of problems show promising results. (where we compute the whole set of solutions at once) and their online counter-part (where we compute solutions in- Introduction crementally). We determine the computational complexity Computational complexity deals with a variety of differ- of these problems, and propose practical solution methods. ent problems including decision problems (e.g. “Is there a Finally, we present some promising experimental results. solution?”), function problems (e.g. “Return a solution”), and counting problems (e.g. “How many solutions exist?”). Formal background However, much less has been said about problems where we A binary relation R over strings is polynomial-time decid- wish to find a set of solutions that are diverse or similar to able iff there is a deterministic Turing machine deciding each other. For brevity, we shall call these diversity and sim- the language {x; y | (x, y) ∈ R} in polynomial time (Pa- ilarity problems. Existing work in this area mostly focuses padimitriou 1994). A binary relation R is polynomially bal- on finding pairs of solutions that satisfy some distance con- anced iff (x, y) ∈ R implies |y| < |x|k for some k.A straint (Bailleux & Marquis 1999; Crescenzi & Rossi 2002; language belongs to NP iff there is a polynomial-time Angelsmark & Thapper 2004). However, there is a wider decidable and polynomially balanced relation R such that variety of problems that one may wish to consider. L = {x | (x, y) ∈ R}. We let Sol(x) = {y | (x, y) ∈ R}. In product configuration, similarity and diversity prob- We assume that we have some symmetric, reflexive, total lems arise as preferences are elicited and suggestions are and polynomially bounded distance function, δ, between presented to the user. Suppose you want to buy a car. There strings. For example, if L is SAT, this might be the Ham- are various constraints on what you want and what is avail- ming distance between truth assignments. Note, however, able for sale. For example, you cannot buy a cheap Ferrari that most of our complexity results will hold for any func- nor a convertible with a sunroof. You begin by asking for tion that is polynomially bounded. Finally, we use 0 to de- a set of solutions as diverse as possible. You then pick the note FALSE and 1 to denote TRUE. most preferred car from this set. However, as not all the details are quite right, you ask to see a set of solutions as Offline diversity and similarity similar as possible to this car. As a second example, suppose you are scheduling staff in We define two decision problems at the core of diver- a hospital. Unfortunately, the problem is very dynamic and sity and similarity problems. These ask if there is a sub- uncertain. People are sure to phone in ill, and unforseen op- set of solutions of size k at least (or at most) d distance apart. For brevity, we define max(δ, S) = max δ(y, z) and erations to be required. To ensure that the schedule is robust y,z∈S to such changes and can be repaired with minor disruption, min(δ, S) = min δ(y, z). you might look for a schedule which has many similar solu- y,z∈S tions nearby. Supermodels are based on this idea (Ginsberg, dDISTANTkSET (resp. dCLOSEkSET) Copyright °c 2005, American Association for Artificial Intelli- Instance. Given a polynomial-time decidable and gence (www.aaai.org). All rights reserved. polynomially balanced relation R, a distance function

AAAI-05 / 372 δ and some string x. Computational complexity Question. Does there exist S with S ⊆ Sol(x), For some of these problem classes, specifically those for |S| = k and min(δ, S) ≥ d (resp. max(δ, S) ≤ d). which the CSP is an input (e.g. dDISTANT/CLOSEkSET), We might also consider the average distance between so- the size of the output (a set of solutions) may not be poly- lutions instead of the minimum or maximum. As we have nomially bounded by the size of the input. Therefore we two parameters, d and k, we can choose to fix one and op- will always make the assumption that k is polynomial in the timize the other. That is, we can fix the size of the set of size of the input. With such an assumption, dDISTANTkSET solutions and maximize their diversity (similarity). Alterna- and dCLOSEkSET are in NP. Since dDISTANTkSET and tively, we can fix the diversity (similarity) and maximize the dCLOSEkSET decide membership in R when k = 1, both size of the set of solutions returned. Alternatively, we could problems are trivially NP-complete. We now show that both MAXDIVERSEkSET and look for a Pareto optimal set or combine d and k into a single NP [log n] metric. MAXSIMILARkSET are FP -complete. This is the class of all languages decided by a polynomial-time oracle MAXDIVERSEkSET (resp. MAXSIMILARkSET) which on input of size n asks a total number of O(log n) NP Instance. Given a polynomial-time decidable and queries (Papadimitriou 1994). polynomially balanced relation R, a distance function Theorem 1 MAXDIVERSEkSET and MAXSIMILARkSET δ and some string x. are FPNP [log n]-complete. Question. Find S with S ⊆ Sol(x), |S| = k, and for all S0 with S0 ⊆ Sol(x), |S0| = k, min(δ, S) ≥ min(δ, S0) (resp. max(δ, S) ≤ max(δ, S0)). Proof. MAXDIVERSEkSET (resp. MAXSIMILARkSET) ∈ FPNP [log n]: The distance function is polynomial in MAXdDISTANTSET (resp. MAXdCLOSESET) the size of the input, n. Using binary search, we call Instance. Given a polynomial-time decidable and dDISTANTkSET (resp. dCLOSEkSET) to determine the ex- polynomially balanced relation R, a distance function act value of d. This requires only logarithmically many δ and some string x. adaptive NP queries. Question. Find S with S ⊆ Sol(x), min(δ, S) ≥ Completeness: We sketch a reduction from MAX CLIQUE 0 d (resp. max(δ, S) ≤ d), and for all S with IZE NP [log n] 0 0 0 S which is FP -complete (Papadimitriou 1994). min(δ, S ) ≥ d (resp. max(δ, S ) ≤ d), |S| ≥ |S |. Given a graph G = hV,Ei, with set of nodes V and set of For example, in the product configuration problem, we edges E,MAX CLIQUE SIZE determines the size of the the might want to see the 10 most diverse solutions. This is an largest clique in G. We define a CSP as follows. We intro- instance of the MAXDIVERSEkSET problem. On the other duce a Boolean variable Bi, called a graph-node variable hand, we might want to see all solutions at most 2 features for each node i in V . For each a, b ∈ V , if ha, bi ∈/ E, we different to our current best solution. This is an instance of enforce that Ba and Bb cannot both be 1. We add an addi- d n e + 1 the MAXdCLOSESET problem. tional 2 Boolean variables, called distance variables, and enforce that they are 0 iff every graph-node variable is 0. This CSP always admits the solution with 0 assigned to Online diversity and similarity all Boolean variables. So that MAXDIVERSEkSET finds the In some situations, we may be computing solutions one by clique of maximum size, we define the distance between two one. For example, suppose we show the user several possi- solutions, δ(sa, sb), as the Hamming distance defined over ble cars to buy but none of them are appealing. We would all Boolean variables. Thus, MAXDIVERSEkSET(P ) with now like to find a car for sale which is as diverse from these k = 2 will always have one of the solutions assigning 0 as possible. This suggests the following two online prob- to each Boolean variable and the other solution assigning lems in which we find a solution most distant (close) to 1 to those graph-node variables representing nodes in the a set of solutions. For brevity, we define max(δ, S, y) = largest clique possible, and 0 to all the additional distance max δ(y, z) and min(δ, S, y) = min δ(y, z). variables, since these two solutions are maximally diverse z∈S z∈S wrt Hamming Distance. Suppose this is not the case, i.e., MOSTDISTANT (resp. MOSTCLOSE) both solutions assign some node variables to 0 and some Instance. Given a polynomial-time decidable and others to 1. Since all extra variables are set to 1, the maxi- polynomially balanced relation R, a distance function mum achievable distance is n. Moreover, if the clique with δ, some string x and a subset S of strings. largest cardinality has m nodes, two such solutions cannot Question. Find y with y ∈ Sol(x)−S, such that for all be more than 2m assignments apart. However, the solution 0 0 0 with maximum clique and the solution with all variables as- y with y ∈ Sol(x)−S, max(δ, S, y) ≥ max(δ, S, y ) n (resp. min(δ, S, y) ≤ min(δ, S, y0)). signed to 0 are d 2 e + 1 + m assignments apart, which is strictly greater than min(n, 2m). Hence, we have an answer We might also be interested in finding not one but a set to MAX CLIQUE SIZE(G). of solutions distant from each other and from a given set. The reduction for MAXSIMILARkSET is analogous. Alternatively, we might be interested in finding a larger set However, we need to define the distance between a pair of n of solutions than the ones we currently have with a given solutions, sa and sb as n + d 2 + 1e − δ(sa, sb) if sa 6= sb, diversity or similarity. and 0 otherwise, where δ is the Hamming distance. 2

AAAI-05 / 373 MAXdDISTANTSET and MAXdCLOSESET may require Finally, we show that the online diversity problem MOST- exponential time just to write out their answers (since the DISTANT and similarity problem MOSTCLOSE are both answer set may itself be exponentially large for certain FPNP [log n]-complete. queries). We therefore assume d (resp. n − d) to be at most Theorem 3 MOSTDISTANT and MOSTCLOSE are O(log n). Furthermore, we assume δ to be the Hamming FPNP [log n]-complete. distance. These assumptions are sufficient to ensure that the answer set is polynomially bounded and MAXdCLOSESET Proof. MOSTDISTANT (resp. MOSTCLOSE) ∈ (resp. MAXdDISTANTSET) is FPNP [log n]-complete. FPNP [log n]: We need to have an algorithm that calls at most Theorem 2 MAXdCLOSESET and MAXdDISTANTSET a logarithmic number of NP queries. Consider the deci- NP [log n] are FP -complete. sion version of MOSTDISTANT, where the solution has to have at least k discrepancies with a given set of vectors, V. Proof. MAXdCLOSESET (resp. MAXdDISTANTSET) is in This problem is in NP, the witness being the solution, since FPNP [log n]: With the assumptions on d and δ, the cardinal- checking the number of discrepancies is polynomial. By bi- ity of the set of solutions returned is polynomially bounded. nary search, we need at most O(log n) calls to this oracle to By binary search on k, using a dCLOSEkSET oracle, we determine the most distant solution. can find the largest possible d-close set. Thus, we only Completeness: We sketch a reduction from MAX CLIQUE need a logarithmic number of calls to an NP oracle to an- SIZE.We define a CSP in the same way as in the complete- swer both questions. We omit the details of the proof for ness proof of Theorem 1 but only use the graph node vari- MAXdDISTANTSET since it is similar (n − d is bounded ables. We let vector v ∈ V be a string of 0s of length n = instead of d). |V |. In order that MOSTDISTANT finds the clique of max- Completeness: We sketch a reduction from MAX CLIQUE imum size, we define the distance between two solutions, SIZE where G = hV,Ei denotes a graph with a set of n δ(sa, sb) as the Hamming distance. We have an answer to nodes V and set of edges E. We now define a CSP as MAX CLIQUE SIZE(G) iff MOSTDISTANT(P, {v}). The 2 follows. We introduce n + 1 variables. The first vari- reduction for MOSTCLOSE is analogous, except we define able X0 takes its value in the set {1, 2, . . . , n}, and the the distance between solutions sa and sb as n − δ(Sa,Sb), other variables are partitioned in n groups of n variables, if sa 6= sb, and 0 otherwise to give a reflexive distance Xi ∈ {0, 1, . . . , n}. We introduce the following constraints: measure. We have an answer to MAX CLIQUE SIZE(G) iff MOSTCLOSE(P, {v}). 2 ∀j, X(j−1).n+X0 = 0 & hX0, ji ∈ E ⇒ X(X0−1).n+j = 0,

∀j 6= X0, k, (k 6= X0 ∨ 6 ∃hk, ji ∈ E) ⇒ X(k−1).n+j = X0. Solution Methods The entire solution is entailed by the value i assigned to X0, We have developed a number of complete and heuristic al- i.e. the ith variable of every group must be assigned to 0, the gorithms for solving similarity and diversity problems. jth variable of the ith group is assigned to 0 iff hi, ji ∈ E and to i otherwise. Finally, all other variables are set to i. Complete Methods based on Reformulation It is therefore easy to see that we have exactly one solu- Given a CSP, we can solve similarity and diversity problems tion per node in the graph. Let δ(sa, sb) be the Hamming using a reformulation approach. Specifically, we build a new distance between the solutions sa and sb. Let i (resp. j), CSP as follows. We create k copies of the CSP, each copy be the value assigned to X0 in sa (resp. sb), we show that with a different set of variable names. We add k(k−1)/2 ex- δ(s , s ) = 2 iff hi, ji ∈ E. Suppose first that hi, ji ∈ E a b tra variables to represent the Hamming distance dj between and consider the variable X . In s this variable i (i−1).n+j a each pair of vectors of variables. We let min be the min- is set to 0 as hi, ji ∈ E, moreover, it is assigned to 0 in j j th imum of the di ’s and max be the maximum of the di ’s. sb, as are all j variables of each group. We can repeat this reasoning for X , hence δ(s , s ) ≥ 2. Now Starting with this basic model, we now describe how to ex- (j−1).n+i a b tend it for each of the following problems: consider any other variable Xk. Either Xk = i in sa, or Xk = j in sb, however, no variable is set to j in sa nor dDISTANTkSET/dCLOSEkSET: We also enforce that min is assigned to i in sb. We thus have δ(sa, sb) = 2. Sup- (resp. max) is at least (resp. at most) d. pose now that hi, ji 6∈ E. The last step in our reasoning is MAXDIVERSEkSET/MAXSIMILARkSET: We introduce still valid, whilst now we have X(i−1).n+j = i in sa and an objective function that maximizes (resp. minimizes) X(j−1).n+i = j in sb. Therefore, δ(sa, sb) = 0. We first min (resp. max). showed that a set of solutions of the CSP map to a set of nodes of G, then we showed that if the distance threshold d MOSTDISTANT/MOSTCLOSE: We set k to |V | + 1 where is set to 2, then this set of nodes is a clique. Since the set V is the input set of solutions and assign |V | sets of vari- with maximum cardinality will be returned, we obtain the ables to the given solutions in V . We also introduce an objective function that maximizes (resp. minimizes) min solution to MAX CLIQUE SIZE(G) from MAXdCLOSESET. (resp. max). The reduction for MAXdDISTANTSET is analogous. How- ever, here we define the distance between a pair of solutions, We then solve the reformulated problem using 2 sa and sb as n + 1 − δ(sa, sb), if sa 6= sb, and 0 otherwise off-the-shelf constraint programming solvers. For to give a reflexive distance measure. 2 MAXdDISTANTSET (resp. MAXdCLOSESET, we find

AAAI-05 / 374 all solutions and construct a graph whose nodes are the we add V3 = h0, 1, 0, 1 ..., 0, 1i to this example, then set- solutions. There is an edge between two nodes iff the ting a variable with even index to 1 implies that the distance pairwise Hamming distance is at least d (resp. at most d). grows by 1, whereas it grows by 2 if we set it to 0, and by 3 We then find a maximum clique in the resulting graph. For if we use any other value. Suppose that all variables take problems with many solutions, the constructed graph may their values in {0, 1, 2}. The assignment with maximum be prohibitively large. Hamming distance sets to 2 every variable. The distance between this assignment and V1,V2,V3 is 3n. Now suppose Complete Method for MOSTDISTANT/MOSTCLOSE that d = 3n − 1, then for any i, Xi = 0 and Xi = 1 are arc We will show in the next section that the problems MOST- inconsistent, whilst Xi = 2 is GAC. DISTANT/MOSTCLOSE can be used to approximate all the We now give an algorithm for enforcing GAC on others. Therefore, we introduce here an efficient com- DiverseΣ(X1,...,Xn,V1,...,Vk, d): plete algorithm for MOSTDISTANT/MOSTCLOSE based on Algorithm 1 Diverse (X ,...,X ,V ,...,V , d) global diversity/similarity constraints. We define and study Σ 1 n 1 k four such global constraints that use Hamming distance. Let Dist ← 0; 1 foreach Xi do d be an integer, V = {v1, . . . , vk} be a set of vectors of size th foreach vj ∈ D(Xi) do n, and vj[i] be the i element of vj. We define the following Occ[i][j] ← |{l | Vl[i] = vj }|; global constraints. Dist ← Dist +k − min(Occ[i]); i=Xn,j=k if Dist < d then Fail; SimilarΣ(X1,...,Xn, V, d) ⇔ (Xi 6= vj [i]) ≤ d Best[i] ← min(Occ[i]); i=1,j=1 2 foreach Xi do 3 D(X ) ← {v |v ∈ D(X ) ∧ d ≤ (Dist+Best[i]−Occ[i][j])}; i=Xn,j=k i j j i DiverseΣ(X1,...,Xn, V, d) ⇔ (Xi 6= vj [i]) ≥ d i=1,j=1 Theorem 4 Algorithm 1 maintains GAC on Xi=n Similar (X ,...,X , V, d) ⇔ maxj=k (X 6= v [i]) ≤ d DiverseΣ(X1,...,Xn,V1,...,Vk, d) and runs in max 1 n j=1 i j O(n(d + k)) d i=1 where is the maximum domain size. Xi=n j=k v ∈ D(X ) Diversemin(X1,...,Xn, V, d) ⇔ minj=1 (Xi 6= vj [i]) ≥ d Proof. Soundness: suppose that j i after propaga- i=1 tion. We construct an assignment satisfying the constraint, and involving X = v as follows: For each variable X For reasons of space, we only consider here the diver- i j l other than Xi we assign this variable to the value vm such sity constraints. The results are analogous for the similarity that Occ[l][m] is minimum, that is, equal to Best[l]. We ones. We propose a Branch & Bound algorithm for MOST- P therefore have x=n,y=k(X 6= v [x]) = Dist−Best[i]+ DISTANT that uses these global constraints as follows. We x=1,y=1 x y first post a global Diverse constraint with d = 1. For ev- k−Occ[i][j] for this assignment, however, line 3 ensures that ery solution s that is found during search, we set d to the this value is greater than d. Hence Xi = j is GAC. Hamming distance between V and s plus one, and continue Completeness: Suppose that vj 6∈ D(Xi) after propaga- searching. When the algorithm eventually proves unsatisfia- tion. Then we have Dist − Best[i] + k − Occ[i][j] < d, bility, the last solution found is optimal. We expect propaga- where Dist is the sum of Best[l] for all l ∈ [1..n]. It there- tion on the Diverse constraint to prune parts of the search fore means that the assignment where every variable but Xi space. takes the value occurring the least in V1,...,Vk, is not a support (a valid extension) for Xi = j. Moreover any other assignment would have a greater distance to V1,...,Vk, and DiverseΣ: A constraint like the global DiverseΣ con- would therefore not be a support. Hence Xi = j is not GAC. straint is generalized arc consistent (GAC), if every value Worst case : The loop 1 has complex- for every variable can be extended to a tuple satisfying the ity O(nd + kd). The values in Occ[i][j] can be computed constraint. Consider first the case where there is only one in two passes. The first pass set Occ[i][j] to 0 for every vector V as argument of DiverseΣ. As long as the num- value j of every variable j (O(nd)). The second pass incre- ber of variables Xi still containing V [i] in their domain is ments Occ[i][j] by one for each vector Vl such that Vl[i] = j greater than d, the constraint is GAC. If this number is less (O(nk)). The second loop (2) is in O(nd). 2 than d, then we fail. Finally, when there are exactly d such variables, we can remove V [i] from D(Xi) for all i. The situation is more subtle when we look for assign- Diversemin: Unfortunately, when we use the minimum ments close to several vectors at once. For instance, consider (resp. maximum) distance, maintaining GAC on a global the following two vectors: Diverse (Similar) constraint becomes intractable. We therefore should look to enforce a lesser level of consistency. V1 = h0, 0,..., 0i,V2 = h1, 1,..., 1i Theorem 5 GAC is NP-hard to propagate on Diversemin. Even if all the domains contain the values 0 and 1, the max- imum distance to V1 and V2 is n, as any solution has to dis- Proof. We reduce 3SAT to the problem of find- agree with either V1 or V2 on every variable. Moreover, if ing a consistent assignment for Diversemin. Given a

AAAI-05 / 375 Boolean formula in n variables (xi, for all i ∈ [1..n]) solutions are chosen first. We also ran the same experiment and m clauses (ci, for all i ∈ [1..m]), we construct without a specific value ordering. We also report results by the Diversemin(X1,...,Xn,V1,...,Vm, d) constraint in simply shuffling the values in the domains and starting a new which Xi = {i, ¬i} for all i ∈ [1..n] and each Vj for all search without using a value ordering. j ∈ [1..m] represents one of the m clauses. If the clause cl is {xi, ¬xj, xk} then Vl[i] = i, Vl[j] = ¬j and Vl[k] = k. 60 greedy (1th sol) greedy (2nd sol) All other values are set to n + 1. If d = n − 2, then the con- greedy+vo (1th sol) 50 greedy+vo (2nd sol) shuffle (1th sol) structed Diversemin constraint has a satisfying assignment shuffle (2nd sol) iff the 3SAT formula has a model. The solution we find cor- 40 responds to an assignment of the variables. Moreover, the 30 distance dummy values “consume” exactly n − 3 differences. There- 20 fore the solution has to share at least one element with each clause, and thus satisfies it. Notice that in this case the so- 10 0 lution we find corresponds to a model where all literal are 0.001 0.01 0.1 1 10 100 negated. 2 cputime Figure 1: Solving MAXDIVERSE3SET for the Renault Megane Heuristic Approaches configuration benchmark. Methods based on reformulation may give prohibitively large CSPs. We therefore propose an approximation scheme The results for the Renault benchmark are given in Fig- that is based on the Branch & Bound algorithm for MOST- ure 1. The distance between solutions, averaged over all DISTANT/MOSTCLOSE, that approximates the others. Vari- instances, is plotted on the y-axis, against the cpu-time re- ants of the following greedy heuristic can approximate quired to obtain it. To obtain different instances, upon which dCLOSEkSET and MAXdCloseSet problem by using the to base an average, we simply shuffled the values. For each Branch & Bound algorithm for MOSTCLOSE. method (i.e., greedy with value ordering, greedy without value ordering, and shuffling), the same initial solution is Algorithm 2 greedy found, then the first curve corresponds to the first diverse Data : P = (X,D,C), K, d solution from it, and the second curve to the next diverse Result : V solution (from both previous solutions). find a solution v; We also ran some experiments on random binary CSPs, to V ← {v} ; observe how performance changes with the constrainedness while |V | < K & minx,y∈V δ(x, y) > d do 1 find most diverse solution u to previous solutions V 0; of the problem. Results are given in Figure 2. All instances V ← V ∪ {u}; have 100 variables, domain size is 10, and 275 binary con- straints. We generated 500 instances with tightness equal to 0.5 and 0.52 (0.52 is the phase transition), and 1500 in- Finding a solution that maximizes similarity to previ- stances with tightness equal to 0.53, as few of them were ous solutions (Line 1) corresponds to solving MOSTDIS- satisfiable. TANT. Observe that if we keep only the first test for the The results on random CSPs show that our approximation “while” loop, then we approximate MAXDIVERSEkSET. method is more efficient as the constrainedness of the prob- Similarly, if we keep only the second test, we approximate lem increases. For loosely constrained problems the sim- MAXdDistantSet. This method easily applies to diversity. ply shuffling method can be competitive since many possi- ble diverse solutions exist. We also observe that choosing Experiments the value that is used the least in previous solutions tends to We ran experiments using the Renault Megane Configura- improve greatly performance on loosely constrained prob- tion benchmark (Amilhastre, Fargier, & Marguis 2002), a lems, but when the tightness of the constraints increases this large real-world configuration problem. The variables rep- value heuristic can slow the algorithm down as it chooses resent various options such as engine type, interior options, “successful” values last. etc. The problem consists of 101 variables, domain size Overall, the optimization method without value order- varies from 1 to 43, and there are 113 table constraints, many ing is more efficient on more tightly constrained problems. of them non-binary. The number of solutions to the prob- However, the same method with value ordering is consis- lem is over 1.4 × 1012. We report the results obtained us- tently the best choice. Indeed, on easier problems the heuris- ing the greedy algorithm and a Branch & Bound procedure tic finds a distant solution quickly, whilst on harder prob- taking advantage of the DiverseΣ constraint. We solved lems the the Branch & Bound procedure combined with the MAXDIVERSEkSet for k set to 3, motivated by work in rec- DiverseΣ constraint can find far more distant solutions than ommender systems that argues that the optimal sized set the shuffling approach. to present to a user contains 3 elements (Shimazu 2001). The Renault Megane configuration problem is loosely The same variable ordering heuristic (H 1 dom/deg x in constrained and thus admits a large number of solutions. It is (Bessiere,` Chmeiss, & Sa¨ıs 2001)) was used for all in- not surprising therefore that the best method uses the value stances. The values that were used the least in the previous ordering heuristic. We are able to find 2 diverse solutions

AAAI-05 / 376 100 70 50 greedy (1th sol) greedy (1th sol) greedy (1th sol) 90 greedy (2nd sol) greedy (2nd sol) 45 greedy (2nd sol) greedy+vo (1th sol) 60 greedy+vo (1th sol) greedy+vo (1th sol) 80 greedy+vo (2nd sol) greedy+vo (2nd sol) 40 greedy+vo (2nd sol) shuffle (1th sol) shuffle (1th sol) shuffle (1th sol) shuffle (2nd sol) shuffle (2nd sol) shuffle (2nd sol) 70 50 35

60 30 40 50 25

distance distance 30 distance 40 20

30 20 15

20 10 10 10 5

0 0 0 0.001 0.01 0.1 1 10 100 0.001 0.01 0.1 1 10 100 0.001 0.01 0.1 1 10 100 cputime cputime cputime

Figure 2: Solving MAXDIVERSE3SET for random CSPs (100 variables, 10 values, 275 constraints). Tightness is 0.5/0.52/0.53,respectively. such that half of the variables are assigned differently within results on some real-world problems are very encouraging. 3 seconds, or 3 such solutions within 15 seconds. In future work, we will study problems that combine both similarity and diversity. Specifically, given a solution (or set Related work of solutions), we may wish to find a set of solutions that are A number of researchers have studied the complexity of as similar as possible to the given one(s), but are as mutually finding another solution to a problem. For example, given diverse as possible. a graph and a Hamiltonian cycle with this graph, it is NP- Acknowledgements. Hebrard and Walsh are supported by complete to decide if the graph contains another Hamilto- the National ICT Australia, which is funded through the nian cycle. However, constraints are not typically placed Australian Government’s Backing Australia’s Ability ini- on the relationship between the two solutions. One excep- tiative, in part through the Australian Research Council. tion is work in constraint satisfaction where we ensure that Hnich is supported by Science Foundation Ireland (Grant the Hamming distance between the two solutions is max- 00/PI.1/C075). O’Sullivan is supported by Enterprise Ire- imized (Angelsmark & Thapper 2004; Crescenzi & Rossi land (Grant SC/02/289). 2002) or that the solutions are within some distance of each other (Bailleux & Marquis 1999). The problem classes in- References troduced here subsume these problems. Amilhastre, J.; Fargier, H.; and Marguis, P. 2002. Con- Diversity and similarity are also important concepts in sistency restoration and explanations in dynamic CSPs – case-based reasoning and recommender systems (Bridge & application to configuration. AIJ 135:199–234. Ferguson 2002; Shimazu 2001; Smyth & McClave 2001). Typically, we want chose a diverse set of cases from the Angelsmark, O., and Thapper, J. 2004. Algorithms for the case-base, but retrieve cases based on similarity. Also more maximum hamming distance problem. In Proceedings of recent work in decision support systems has focused on the CSCLP-04, 128–141. use of diversity for finding diverse sets of good solutions as Bailleux, O., and Marquis, P. 1999. DISTANCE-SAT: well as the optimum one (Løkketangen & Woodruff 2005). Complexity and algorithms. In AAAI/IAAI, 642–647. Bessiere,` C.; Chmeiss, A.; and Sa¨ıs, L. 2001. Conclusions Neighborhood-based variable ordering heuristics for the Motivated by a number of real-world applications that re- constraint satisfaction problem. In Proceedings of CP- quire solutions that are either diverse or similar, we have pro- 2001, 565–569. posed a range of similarity and diversity problems. We have Bridge, D., and Ferguson, A. 2002. Diverse product recom- presented a detailed analysis of their computational com- mendations using an expressive language for case retrieval. plexity (see Table 1), and developed a number of algorithms In Procs. of EC-CBR, 43–57. and global constraints for solving them. Our experimental Crescenzi, P., and Rossi, G. 2002. On the hamming dis- tance of constraint satisfaction problems. TCS 288:85–100. Ginsberg, M. L.; Parkes, A. J.; and Roy, A. 1998. Super- Table 1: A summary of the complexity results. models and robustness. In AAAI/IAAI, 334–339. Løkketangen, A., and Woodruff, D. L. 2005. A distance dDISTANTkSET NP-complete function to support optimized selection decisions. Decision dCLOSEkSET NP-complete Support Systems 39(3):345–354. MAXDIVERSEkSET FPNP [log n]-complete Papadimitriou, C. 1994. Computational Complexity. MAXSIMILARkSET FPNP [log n]-complete Addison-Wesley. MAXdDISTANTSET FPNP [log n]-complete Shimazu, H. 2001. Expertclerk: Navigating shoppers’ buy- MAXdCLOSESET FPNP [log n]-complete ing process with the combination of asking and proposing. MOSTDISTANT FPNP [log n]-complete In Proceedings of IJCAI-2001, 1443–1448. MOSTCLOSE FPNP [log n]-complete Smyth, B., and McClave, P. 2001. Similarity vs diversity. In Procs of IC-CBR, 347–361.

AAAI-05 / 377