INTEGRATING PROBABILISTIC REASONING WITH CONSTRAINT SATISFACTION

by

Eric I-Hung Hsu

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy, Graduate Department of Computer Science, University of Toronto

Copyright © 2011 by Eric I-Hung Hsu

Abstract

Integrating Probabilistic Reasoning with Constraint Satisfaction

Eric I-Hung Hsu

Doctor of Philosophy

Graduate Department of Computer Science

University of Toronto

2011

We hypothesize and confirm that probabilistic reasoning is closely related to constraint satisfaction at a formal level, and that this relationship yields effective algorithms for guiding constraint satisfaction and constraint optimization solvers.

By taking a unified view of probabilistic inference and constraint reasoning in terms of graphical models, we first associate a number of formalisms and techniques between the two areas. For instance, we characterize search and inference in constraint reasoning as summation and multiplication (or disjunction and conjunction) in the probabilistic space; necessary but insufficient consistency conditions for solutions to constraint problems (like arc-consistency) mirror approximate objective functions over probability distributions (like the Bethe free energy); and the polytope of feasible points for marginal probabilities represents the linear relaxation of a particular constraint satisfaction problem.

While such insights synthesize an assortment of existing formalisms from varied research communities, they also yield an entirely novel set of “bias estimation” techniques that contribute to a growing body of research on applying probabilistic methods to constraint problems. In practical terms, these techniques estimate the percentage of solutions to a constraint satisfaction or optimization problem wherein a given variable is assigned a given value. By devising search methods that incorporate such information as heuristic guidance for variable and value ordering, we are able to outperform existing solvers on problems of interest from constraint satisfaction and constraint optimization–as represented here by the SAT and MaxSAT problems.

Further, for MaxSAT we present an “equivalent transformation” process that normalizes the weights in constraint optimization problems, in order to encourage prunings of the search tree during branch-and-bound search. To control such computationally expensive processes, we determine promising situations for using them throughout the course of an individual search process. We accomplish this using a reinforcement learning-based control module that seeks a principled balance between the exploration of new strategies and the exploitation of existing experiences.

Acknowledgements

I would like to recognize the support and friendship of Sheila McIlraith, my academic advisor, throughout the preparation of this dissertation, the conduct of the research that it represents, and the overall process of completing the doctoral program. Thanks also to my committee members, Fahiem Bacchus and Chris Beck, as well as the external examiner, Gilles Pesant, for their many valuable corrections, criticisms, questions, and suggestions, as well as their constant advocacy and flexibility. For their guidance and inspiration earlier in my career, I would also like to acknowledge Barbara Grosz, Karen Myers, Charlie Ortiz, and Wheeler Ruml.

Many other researchers have made direct contributions, small or large, to the specific project described in this dissertation. A partial listing would include: Dimitris Achlioptas, Ronan Le Bras, Jessica Davies, Rina Dechter, Niklas Eén, John Franco, Brendan Frey, Inmar Givoni, Vibhav Gogate, Geoff Hinton, Holger Hoos, Frank Hutter, Matthew Kitching, Lukas Kroc, Chu Min Li, Victor Marek, Mike Molloy, Christian Muise, Pandu Nayak, Andrew Ng, Ashish Sabharwal, Horst Samulowitz, Sean Weaver, Tomáš Werner, Lin Xu, Alessandro Zanarini, and many anonymous reviewers who have applied their attention and intellect to previous publications. I am glad to be in the professional and/or personal company of such researchers, as well as the many more who did not have a direct role in this specific project but have enriched my experience to this point.

Finally, the dissertation is dedicated to my mate, Sara Mostafavi.

Contents

I Foundations 13

1 Computing Marginal Probabilities over Graphical Models 14

1.1 Graphical Models and Marginal Probabilities ...... 15

1.1.1 Basic Factor Graphs: Definitions, Notation, and Terminology . . . . . 16

1.1.2 Transforming a Factor Graph: the Dual Graph and Redundant Graph . . 19

1.1.3 The Marginal Computation Problem ...... 23

1.2 Exact Methods for Computing Marginals ...... 27

1.2.1 Message Passing for Trees ...... 27

1.2.2 Transforming Arbitrary Graphs into Trees ...... 31

1.2.3 Other Methods: Cycle-Cutset and Recursive Conditioning ...... 38

1.3 Inexact Methods for Computing Marginals ...... 39

1.3.1 Message-Passing Algorithms ...... 39

1.3.2 Gibbs Sampling and MCMC ...... 41

1.4 The Most Probable Explanation (MPE) Problem ...... 42

2 Message-Passing Techniques for Computing Marginals 50

2.1 Belief Propagation as Optimization ...... 50

2.1.1 Gibbs Free Energy ...... 51

2.1.2 Mean Field Free Energy Approximation ...... 53

2.1.3 Bethe Free Energy Approximation ...... 55

2.1.4 Kikuchi Free Energy Approximation ...... 57

2.2 Other Methods as Optimization: Approximating the Marginal Polytope ...... 58

3 Solving Constraint Satisfaction Problems 62

3.1 Factor Graph Representation of Constraint Satisfaction ...... 63

3.2 Complete Solution Principles for Constraint Satisfaction ...... 70

3.2.1 Search ...... 71

3.2.2 Inference ...... 77

3.3 Incomplete Solution Principles for Constraint Satisfaction ...... 84

3.3.1 Decimation ...... 85

3.3.2 Local Search ...... 86

3.4 The Optimization, Counting, and Unsatisfiability Problems ...... 89

4 Theoretical Models for Analyzing Constraint Problems 91

4.1 The Phase Transition for Random Constraint Satisfaction Problems ...... 92

4.1.1 Defining Random Problems and the Phase Transition ...... 92

4.1.2 Geometry of the Solution Space ...... 94

4.2 The Survey Propagation Model of Boolean Satisfiability ...... 97

4.3 Backbone Variables and Backdoor Sets ...... 98

4.3.1 Definitions and Related Work ...... 98

4.3.2 Relevance to the Research Goals ...... 100

5 Integrated View of Probabilistic and Constraint Reasoning 102

5.1 Relationship between Probabilistic Inference and Constraint Satisfaction . . . . 103

5.1.1 Representing Constraint/Probabilistic Reasoning as Discrete/Continuous Optimization ...... 105

5.1.2 Duality in Probabilistic Reasoning, Constraint Satisfaction, and Numerical/Combinatorial Optimization ...... 113

5.1.3 Node-Merging ...... 116

5.1.4 Adding Constraints During the Solving Process ...... 117

5.1.5 Correspondences between Alternative Approaches ...... 118

5.2 EMBP: Expectation Maximization Belief Propagation ...... 119

5.2.1 Deriving the EMBP Update Rule within the EM Framework...... 120

5.2.2 M-Step ...... 122

5.2.3 E-Step ...... 125

5.2.4 The EMBP Update Rule ...... 127

5.2.5 Relation to (Loopy) BP and Other Methods ...... 129

5.2.6 Practical Yield of EMBP ...... 130

II Solving Constraint Satisfaction Problems 132

6 A Family of Bias Estimators for Constraint Satisfaction 133

6.1 Existing Methods and Design Criteria ...... 134

6.2 Deriving New Bias Estimators for SAT ...... 139

6.2.1 The General EMBP Framework for SAT ...... 140

6.2.2 Representing the Q-Distribution in Closed Form ...... 143

6.2.3 Deriving EMBP-L ...... 144

6.2.4 Deriving EMBP-G ...... 146

6.2.5 Deriving EMSP-L ...... 148

6.2.6 Deriving EMSP-G ...... 149

6.2.7 Other Bias Estimators ...... 151

6.3 Interpreting the Bias Estimators ...... 153

7 Using Bias Estimators in Backtracking Search 158

7.1 Architecture and Design Decisions for the VARSAT Integrated Solver . . . . . 160

7.1.1 The General Design of VARSAT ...... 160

7.1.2 Specific Design Decisions within VARSAT ...... 161

7.2 Standalone Performance of SAT Bias Estimators ...... 165

7.2.1 Experimental Setup ...... 166

7.2.2 Findings ...... 167

7.2.3 Other Experiments ...... 171

7.3 Performance of Bias Estimation in Search ...... 173

7.3.1 Limitations and Conclusions ...... 177

III Solving Constraint Optimization Problems 180

8 A Family of Bias Estimators for Constraint Optimization 181

8.1 Existing Methods and Design Criteria ...... 182

8.1.1 Branch-and-Bound Search ...... 183

8.1.2 Other Probabilistic Approaches ...... 185

8.2 Creating an Approximate Distribution over MaxSAT Solutions ...... 186

8.3 Deriving New Bias Estimators for MaxSAT ...... 188

9 Using Bias Estimators in Branch-and-Bound Search 191

9.1 Architecture and Design Decisions for the MAXVARSAT Integrated Solver . . 191

9.2 Performance of Bias Estimation in Search ...... 195

9.2.1 Random Problems ...... 195

9.2.2 General Problems ...... 196

9.2.3 Limitations and Conclusions ...... 198

10 Computing MHET for Constraint Optimization 201

10.1 Motivation and Definitions ...... 202

10.2 Adopting the Max-Sum Diffusion Algorithm to Perform MHET on MaxSAT . 209

10.3 Implementation and Results ...... 217

10.3.1 Problems for which MHET is Beneficial, Neutral, or Harmful Overall . 217

10.3.2 Direct Measurement of Pruning Capability, and Proportion of Runtime 219

10.3.3 Practical Considerations: Blending the Two Pruning Methods . . . . . 220

10.4 Related Work and Conclusions ...... 222

10.4.1 Comparison with the Virtual Arc-Consistency Algorithm ...... 223

10.4.2 Conclusion ...... 224

IV Pragmatics 227

11 Online Control of Constraint Solvers 228

11.1 Existing Automatic Control Methods ...... 229

11.2 Reinforcement Learning Representation of Branch-and-Bound Search . . . . . 233

11.2.1 Markov Decision Processes and Q-Learning ...... 234

11.2.2 Theoretical Features of Q-Learning...... 237

11.3 Application to MaxSAT ...... 237

11.4 Experimental Design and Results ...... 240

11.4.1 Example Policy Generated Dynamically by Q-Learning ...... 244

11.4.2 Concluding Remarks ...... 246

12 Future Work 249

12.1 General Program for Future Research ...... 249

12.2 Specific Research Opportunities ...... 251

12.2.1 Conceptual Research Directions ...... 251

12.2.2 Algorithmic Research Directions ...... 253

12.2.3 Implementational Research Directions ...... 255

13 Conclusion 256

Bibliography 259

List of Tables

5.1 Primal/Dual concepts from constraint satisfaction and probabilistic reasoning. . 115

5.2 EM formulation for estimating marginals using the redundant factor graph transformation ...... 121

9.1 Performance of MAXVARSAT on large random problems...... 199

9.2 Performance of MAXVARSAT on problems from 2009 MaxSAT Evaluation . 200

10.1 Performance of MHET on problems from 2009 MaxSAT Evaluation ...... 226

11.1 Performance of Q-Learning on problems from 2009 MaxSAT Evaluation . . . . 247

11.2 Example policy developed by Q-Learning ...... 248

List of Figures

1 Map of topics appearing in this dissertation, organized by chapter...... 3

1.1 Example Factor Graph...... 17

1.2 Transformations of an Example Factor Graph...... 21

1.3 Variable Elimination Message Update Rules...... 28

1.4 Node Merging to Eliminate Cycles in an Example Factor Graph...... 33

1.5 Node Merging to Eliminate Cycles in an Example Factor Graph, Lexicographic Ordering ...... 37

2.1 Belief Propagation (Variable Elimination) Message Update Rules...... 56

2.2 Notional representation of marginal polytope and two approximations...... 59

3.1 Example 3-SAT Problem: as Factor Graph and as CNF Theory...... 66

3.2 Solutions to Example 3-SAT Problem, and Resulting Biases...... 67

3.3 Example QWH Problem : as Latin Square with Erasures, and as Factor Graph. . 69

4.1 Phase transition in satisfiability of random problems ...... 94

4.2 Notional representation of solution geometry for random problems generated near the phase transition in satisfiability ...... 95

5.1 Optimization Representation of the Constraint Satisfaction Problem...... 106

5.2 Optimization Representation of Computing Marginals on a CSP...... 107

5.3 Belief Propagation Message Update Rules...... 115

5.4 Example redundant factor graph ...... 122

5.5 The completed EMBP update rule...... 127

6.1 “v is the sole support of c.”...... 143

6.2 Update rules for the belief propagation (BP) family of bias estimators...... 153

6.3 Update rules for the survey propagation (SP) family of bias estimators...... 154

7.1 RMS error of all bias estimates over 500 instances of increasing backbone size. 167

7.2 Average strength of estimated bias, over same 500 instances...... 168

7.3 Average success rate in predicting the correct sign for backbone variables. . . . 170

7.4 Average strength ranking of the variable most strongly biased toward the wrong polarity ...... 171

7.5 True versus Estimated Bias...... 172

7.6 Fluctuations in Bias Estimates...... 172

7.7 Adjusting the threshold parameter: average runtime for various settings. . . . . 173

7.8 Total/Survey runtimes averaged over 100 random problems, n = 250, α ≡ m/n = 4.11 ...... 174
7.9 Comparison on Random Problems, 50 ≤ n ≤ 500, α ≡ m/n = 4.11 ...... 176

8.1 “v is the sole support of c.” (MaxSAT version.) ...... 189

9.1 Bias estimators for MaxSAT...... 194

10.1 Example MaxSAT problem and MHET ...... 205

13.1 Topic map...... 257

Introduction

The research presented here relies on a unifying account of probabilistic reasoning and constraint satisfaction that abstracts away from many of the problem variations and algorithmic techniques developed independently in these areas, by appealing to graphical models and their correspondence to numerical or combinatorial optimization over algebraic structures. The thesis is that we can thus translate methods from one area and apply them to the other, and ultimately create a new class of heuristics for solving constraint satisfaction and constraint optimization problems (“CSPs” and “COPs”). The four-part structure of this document corresponds to the formal, algorithmic, and pragmatic contributions of the research program.

Part I considers formal foundations within the probabilistic inference and constraint satisfaction research areas, employing a unified account based on graphical models. Two chapters each serve to characterize the two areas from this unified perspective, and an additional chapter summarizes the most relevant correspondences that come to light. This integration builds on certain existing accounts developed independently in the two areas, but is a substantially new contribution in terms of clarifying, compiling, and augmenting such knowledge within a fixed formal representation. Further, the integration motivates the derivation of a novel marginal estimation framework, “EMBP,” that will be used through the rest of the document.

Parts II and III of the dissertation comprise its algorithmic contribution: “bias estimation” techniques derived using the insights revealed by the formal foundations. Such techniques estimate the probability of finding a particular variable assigned to a particular value if we were to somehow sample from the set of solutions to a problem. Thus Part II presents a family of bias estimators for the constraint satisfaction problem; one chapter concerns the techniques themselves, while a second chapter concerns their integration within a modern SAT solver along with experimental results. Part III adapts the bias estimators to the constraint optimization problem–again, one chapter defines the adaptation and a second depicts the integration with a modern MaxSAT solver. Here, though, a third chapter presents an additional “equivalent transformation” method with origins in probabilistic graphical models, and describes its adoption as a pruning mechanism within branch-and-bound search.

Part IV of the dissertation consists of three chapters. The first considers the tuning of parameters for controlling probabilistic methods that have been embedded within a constraint solver, and also presents a controller for constraint optimizers that uses reinforcement learning to develop an advantageous strategy online, i.e. through the course of solving a single problem.

The remaining two chapters discuss future possibilities for research, and make concluding observations.

Figure 1 lays out the space of topics and decisions that make up the chapters described above. In the figure, bold-faced arrows and subjects represent the main line of research that has actually been explored to date, culminating in the constraint reasoning methods at the center of the figure. (The central arrow labeled “EMBP” represents the thesis of this research–that a unified perspective on probabilities and constraints can motivate new methods for improving constraint solvers.) Other topics are not depicted in bold, and appear beneath dashed arrows–these represent alternative formulations or techniques that have not been applied in the solvers; together with the current program they span a space of alternative approaches that encompasses both related research, and more often, still-open possibilities for future investigation.

The remainder of this introduction will summarize the chapters described above and situate them with respect to the overall research program and its thesis.

[Figure 1: Map of topics appearing in this dissertation, organized by chapter.]

Part I: Foundations

The use of probabilistic reasoning to solve constraint satisfaction problems is a growing field of study whose most exciting developments have only emerged within the last decade–thus the existing body of literature remains small in scope and depth. However, in joining two of the most prominent areas of computer science, such research exploits a large space of well-established concepts and techniques from machine learning and symbolic reasoning. Here, this broad foundation will eventually motivate a new solution family that not only achieves concrete empirical gains in solver efficiency, but additionally yields a more formal understanding of the connections between probability and logic, in the context of graphical models and continuous/discrete optimization.

To that end, Chapter 1 begins by defining graphical models as a representation of how global consistency can decompose into local consistency; constraint satisfaction problems are cast within this specific framework in terms of 0/1-probabilities (or later, smoothed approximations of such probabilities). After defining a small number of fundamental graph transformations that are key to synthesizing the large body of knowledge on how to make such models tractable, we proceed to define the problem of computing marginal probabilities from graphical models representing joint probability distributions.

We first consider exact methods for calculating marginal probabilities on factor graphs; while impractical for the task at hand, they motivate the remaining (inexact) approaches, and also serve as analogues to the logical inference procedures considered in later sections. Of the inexact methods, we briefly mention sampling-based approaches based on “Markov Chain Monte Carlo” methods as a promising direction for future research, while message-passing techniques represent the actual line that has been pursued to date. Another alternative research direction to the one taken here is the “MPE” problem, which solicits the most likely configuration of all the variables jointly, instead of their individual marginal probabilities. Only inexact MPE methods will be tractable for constraint satisfaction, and these are unlikely to solve a problem outright–but they may yet find some future application within a framework like the one described here.

Chapter 2 extends our treatment of approximate message-passing algorithms for computing marginals, to a greater level of detail. Focusing on the loopy belief propagation algorithm, we can consider the Bethe free energy function that it optimizes, and shift from algorithmic definitions of message-passing estimators to functional ones. By the end of Part I, this perspective will be useful in characterizing the very act of marginal estimation as a (weighted) constraint problem in and of itself–we seek a set of marginal distributions with maximum likelihood, subject to particular consistency constraints between such distributions. The choice of consistency constraints distinguishes a particular estimation method, and in subsequent chapters we will develop our own such methods using constraints that are specifically adapted to our SAT and MaxSAT problem domains.

In Chapter 3 we turn to constraint problems and the two fundamental methodologies that can solve them to completion: search and inference. We define some associated incomplete methods, namely local search and decimation. These will be relevant in Chapters 7 and 9, where we integrate probabilistic techniques within an overall solution framework. Finally, we will define some alternative constraint problems beyond satisfaction and optimization, as these can be addressed in the future using an apparatus similar to the one described here. While the material in this chapter is largely expository of existing knowledge, here we persist in formulating it in terms of the graphical and algebraic paradigms established in the first chapter, allowing for an eventual synthesis in Chapter 5.

Chapter 4 reviews theoretical models of random constraint problems from near the phase transition region in satisfiability. Such problems are the most intrinsically difficult (particularly, in terms of the interaction of constraints, as opposed to more pragmatic challenges like large problem sizes) for existing complete and incomplete methods. This chapter explains why such problems are of interest, and defines the concepts of backbone and backdoor variables, along with the survey propagation (“SP”) model for Boolean satisfiability (“SAT”). In terms of the overall line of research, the purpose of this section is to motivate the designs that appear in the subsequent parts–a main component of the thesis is that marginal estimation methods can be successful at finding the most important variables to set during search, as defined in relation to the concepts defined here.

Finally, Chapter 5 takes inventory of the various correspondences between probabilistic and constraint reasoning that appear within the unified account of the preceding chapters. To wit, search and inference in constraint reasoning can be viewed as summation and multiplication (or disjunction and conjunction) in the probabilistic space; necessary but insufficient consistency conditions for solutions to constraint problems (like arc-consistency) mirror approximate objective functions over probability distributions (like the Bethe free energy); and linear relaxations of particular constraint problems correspond to various approximations to the polytope of feasible points for probabilistic marginalization. Beyond the intrinsic interest of this comparison, its yield is the derivation of new marginal estimation methods developed specifically for constraint reasoning (plus a bounding technique developed in Chapter 10). In particular, we complete Chapter 5 by deriving the “Expectation Maximization Belief Propagation” (EMBP) framework for expressing the constrained-optimization nature of marginal estimation in general form, and thus present a domain-independent EMBP update rule for marginal estimation. This rule represents a novel framework for marginal estimation that can be customized for any specific reasoning task; indeed it will form the basis for the techniques in the remainder of the research presented here.

Part II: Solving Constraint Satisfaction Problems

With the second part of the dissertation, we turn from its conceptual contribution to specific algorithmic contributions for constraint satisfaction. The methods derived in this part have seen success on certain general constraint problems, and we state these at the start. The remainder of Part II, though, concerns the Boolean satisfiability (“SAT”) problem, as a special case of constraint satisfaction.

Chapter 6 presents our research on how the general EMBP framework of Chapter 5 can be specialized to constraint satisfaction. The basic method is general and has been applied to a number of CSPs, but here the focus will be on SAT. The decision to work with SAT is motivated by clarity of presentation and also the fact that SAT is the area where the overall methodology used here has been most fully studied. So, here we demonstrate the derivation of a novel family of bias estimators for SAT, and also interpret the rules at an intuitive level.

Chapter 7 concerns more pragmatic issues that arise when integrating bias estimators within an overall backtracking search framework. Here we report the best known practices for doing so; in short, the estimators are used as a variable-ordering heuristic in which variables whose estimated bias distributions are strongest (that is, most extreme or skewed) are assigned first, to the value with greater probability. The estimator runs each time a variable is assigned, and is deactivated once the skew of the most strongly-skewed bias distribution falls below a certain threshold, as sketched below. The remainder of this chapter describes the overall implementation of such a solver, along with experimental results.
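To make that control loop concrete, here is a minimal Python sketch of the heuristic; the solver interface (estimate_biases, assign, has_unassigned) and the threshold value are hypothetical placeholders for illustration, not VARSAT's actual API.

```python
def bias_guided_search_step(solver, threshold=0.7):
    """Sketch of bias-guided variable/value ordering (hypothetical interface).

    estimate_biases() is assumed to return {var: {value: probability}} for
    all unassigned variables; assign() fixes a variable and propagates.
    """
    while solver.has_unassigned():
        biases = solver.estimate_biases()
        # A variable's "skew" is the probability of its most likely value.
        var = max(biases, key=lambda v: max(biases[v].values()))
        value = max(biases[var], key=biases[var].get)
        if biases[var][value] < threshold:
            return  # all biases too weak: hand off to the default heuristic
        solver.assign(var, value)  # branch on the strongest bias first
```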

Part III: Solving Constraint Optimization Problems

The third part of the dissertation transitions from constraint satisfaction to constraint optimization, and from a primary focus on Boolean satisfiability to a new focus on Boolean optimization (“MaxSAT”). Constraint optimization can be principally distinguished from the satisfaction problem by the need to account for an entire search tree: the goal of satisfaction is to find a single branch that violates no clauses, but even if we happen to begin with an optimal branch on solving an optimization problem, we must still process the entire tree and determine that no other branch has a better score.

Chapter 8 alters the “0/1” probability distribution that served to distinguish unsatisfying configurations of constraint satisfaction problems from solution configurations in Chapter 6. Here, in the constraint optimization context, a parameterized distribution instead assigns a non-zero probability to each configuration, but discounts said probability by an exponential penalty factor that increases in proportion to the weight of constraints that the configuration violates. Thus, sub-optimal configurations have exponentially less weight than optimal ones. Applying the same basic derivation that was used for SAT to this altered distribution results in a new family of bias estimation techniques for the MaxSAT problem.

Chapter 9 integrates the bias estimators within a full MaxSAT solver, addressing the same variety of design issues as in Chapter 7. Here a number of differences in design settings reflect the shift from constraint satisfaction to optimization, and likewise the experiments reflect a new domain where longer run-times and smaller problems better motivate high-cost/high-reward techniques like the probabilistic ones used here.

In a departure from Part II of the dissertation, Part III contains a third chapter contributing an additional probabilistically motivated technique for use in constraint optimization. Still motivated by the conceptual integration of probabilistic and constraint reasoning, in Chapter 10 we introduce a “Minimum-Height Equivalent Transformation” algorithm to compute lower bounds for a branch-and-bound constraint optimizer. We adapt the technique to MaxSAT, where it determines a minimum amount of weight for clauses that will have to remain unsatisfied if we extend a partial assignment representing the current branch of search. Experimental results demonstrate the usefulness of this technique within a state-of-the-art MaxSAT solver.

Part IV: Pragmatics and Conclusions

The final part of the dissertation assesses its various contributions as a whole, and considers the general problem of how to tune the integration of a probabilistic technique within a backtracking or branch-and-bound solver.

Chapter 11 surveys the existing body of research on parameter tuning, where typically, multiple runs over example problems are used to learn settings that will hopefully generalize to good performance on future problems that are similar to those in the example set. In contrast, this chapter introduces a novel approach based on reinforcement learning that adjusts the settings for heuristics and bounding techniques online–that is, while solving a single problem instance. The approach is better-motivated on constraint optimization problems than for constraint satisfaction, and experimental results demonstrate its usefulness over a variety of MaxSAT problem domains.

Chapter 12 outlines future research directions in light of the conceptual integration of probabilistic and constraint reasoning. Based on the correspondence, a variety of alternative marginal estimation techniques can be used as bias estimators for constraint satisfaction and constraint optimization, or for other problem variations like those described in Part I. In the opposite direction, the conceptual contribution of the thesis also paves the way for constraint techniques to improve probabilistic inference methods by efficiently tightening the consistency constraints that underlie their characteristic objective functions.

Finally, Chapter 13 summarizes the evidence for concluding that a unifying account of probabilistic reasoning and constraint satisfaction can yield useful methods for solving constraint problems, and makes other closing remarks.

Contributions of the Dissertation Research

The primary contributions of the research presented in this document can be summarized as follows:

1. Review of probabilistic reasoning and constraint satisfaction from the perspective of graphical models. Most of the general ideas (e.g. duality, node-merging, dynamic programming) that underlie the foundational concepts in Part I are known in various forms wherever they have inspired the parallel development of separate techniques in constraint satisfaction and probabilistic reasoning over graphical models. The contribution of the background chapters is to synthesize these ideas by stating them explicitly within the factor graph formalism, so that it is possible, for example, to describe an algorithm as a particular form of “inference combined with node-merging on the dual of the problem,” as a further step beyond merely observing that many of the same loosely-defined ideas seem to inspire multiple techniques across the two areas of study. This formal perspective is key to deriving the novel frameworks and algorithms that follow.

2. Statement of marginal estimation as relaxed constraint satisfaction over the marginal polytope. In further support of the claim that the explicit representation of common concepts between probabilistic reasoning and constraint satisfaction is a contribution of the dissertation, we exhibit the culmination of such an understanding in the first part of Chapter 5. Namely, we draw on existing explanations of the marginal polytope from the area of machine learning [144], as well as a recent review of past research from computer vision [146, 134], to represent the task of marginal estimation as a relaxed version of a constraint satisfaction problem representing the set of feasible marginal distributions for a given graphical model. We make a novel statement of the correspondence between the marginal estimation task and constraint satisfaction by presenting them as continuous and discrete optimization problems, respectively, that share the same underlying structure. Prior to such a statement the semantics of algorithms like loopy belief propagation were not well-understood outside of a few special cases [48]; here such a semantics falls out directly from such a correspondence. (Namely, belief propagation alternates between soft search and soft inference over a constraint satisfaction problem reflecting the pairwise conditional dependencies encoded by a factor graph.) A number of similar observations emerge from this correspondence, as well as suggestions for new work in cases where certain concepts have been further developed in constraint satisfaction than in probabilistic reasoning, or vice versa.

3. The EMBP (“Expectation Maximization Belief Propagation”) bias estimation framework. The second half of Chapter 5 introduces a new marginal approximation framework that is built on the formal insights of the first half. EMBP can be applied to any marginal estimation task for which BP, or any other variational method, is typically applied. Its strengths are guaranteed convergence, plus the explicit representation of an extensional object that is optimized by any such estimation method. This object can be represented by customized closed forms specialized to a given application domain, in order to achieve better accuracy or efficiency of computation.

4. Novel bias estimation rules for SAT and MaxSAT. A number of ad hoc heuristics may share similar goals in estimating likely settings for variables, but to date the application of marginal estimation to constraint reasoning remains a new and active area of study. All such probabilistic methods have relied on belief propagation as a fundamental means of estimation–either applied to a theoretically sophisticated “survey propagation” model to produce the “SP” estimation rule [23, 29], or in combination with graph-structural join-graph operations to produce “IJGP” [88]. In Chapters 6 and 8 we present four new estimation methods (“EMBP-L,” “EMSP-L,” “EMBP-G,” and “EMSP-G”) specialized especially for SAT and MaxSAT within the EMBP framework, and explain their derivation and operation alike in terms of soft consistency conditions as per the correspondences in Part I.

5. Novel solver implementations for SAT and MaxSAT. In addition to deriving new bias estimation rules, we have designed and implemented two complete solvers that integrate arbitrary bias estimation techniques within existing state-of-the-art SAT and MaxSAT software. The resulting solvers, called VARSAT and MAXVARSAT, represent the first integration of bias estimation techniques of any kind within a complete solution method; the former is built upon the MINISAT solver [55], and the latter is built upon the MAXSATZ solver [104]. In Chapters 7 and 9 we identify the key design decisions and best known settings for these solvers, and assess their performance on a variety of benchmark problems. Additionally, in Chapter 7 we assess the performance of the SAT bias estimators in isolation.

6. The MHET (“Minimum-Height Equivalent Transformation”) lower bounding technique for MaxSAT. A further result of characterizing constraint reasoning in terms of numerical graphical models (as opposed to simple constraint networks) is a new lower-bounding technique for MaxSAT that extends the problem definition to allow for fractional and varying weights on clauses. In Chapter 10 we consult the previous work in computer vision [146, 134] to formulate a “height minimization” formalism for MaxSAT that computes lower bounds for triggering prunings during branch-and-bound search. We develop efficient algorithms that overcome the exponential space requirements of the formalism by exploiting the specific structure of MaxSAT clauses, and implement them within MAXVARSAT. Finally, we assess the behavior of the bounding technique within the solver, using a variety of benchmark problems, and begin to consider simple ways of controlling its activation during search.

7. A technique based on reinforcement learning for online solver configuration. In Chapter 11 we further pursue the question of how to control the activation and deactivation of solver features so as to accrue maximum informational value for their computational expense. Techniques like bias estimation and MHET are very costly in terms of runtime, in comparison to existing heuristics. At the same time, they are typically significantly more powerful in terms of their effects on backtracking or pruning. Thus it is well-motivated to avoid running them at every node of search, while employing them at points where they are most likely to make an impact. In contrast to existing approaches to solver configuration that rely on training sets of example problems, we present a novel framework based on reinforcement learning that adjusts the behavior of the solver online, that is, through the course of solving a single problem instance. We implement the framework within MAXVARSAT and assess its usefulness in controlling the application of the MHET lower bounding technique.

Part I

Foundations

Chapter 1

Computing Marginal Probabilities over Graphical Models

In this chapter we describe the probabilistic reasoning task of computing marginal probabilities over a joint distribution represented by a graphical model, employing the factor graph format for such a model. After briefly considering the representation of a constraint satisfaction or constraint optimization problem within this framework, we turn to methods for computing marginals. We first consider exact algorithms that operate on trees, and then algorithms that operate on arbitrary graphs (essentially, by transforming them into trees). We then turn to inexact algorithms that attempt to apply the same message-passing rules as the exact methods, even if a graph is not a tree, as well as algorithms based on sampling. Finally, we briefly digress and consider the “most probable explanation” problem, an alternative query to the marginal computation problem that underlies the remainder of the research presented in this dissertation.


1.1 Graphical Models and Marginal Probabilities

Graphical models can be viewed as data structures for representing joint probability distributions over sets of variables. The variables are associated with nodes that connect to form a graph whose structure represents a factorization of the variables’ joint distribution. That is, conditional independences between variable values correspond directly to the structure of a graph; specific probabilities are associated with various entities within a graph according to the type of graphical model being used. (For example, Bayesian networks use nodes to represent prior probabilities or conditional probabilities with respect to neighboring nodes, while Markov random fields associate unnormalized joint probabilities with cliques of nodes.)

Throughout this document we will use the “factor graph” [101] variety of graphical model, first introduced as a generalization of the Tanner graphs [147] used in error-correcting codes. Bayesian networks [123, 84] and Markov random fields (MRFs) [126, 94] comprise the other two most popular types of model. However, factor graphs are a more expedient format for the presentation at hand for three reasons. First, they can be extended to structurally represent a wider class of factorizations than the other two models [58]; secondly, such flexibility allows for more explicit comparisons with the constraint graphs considered in Chapter 3. Finally, they allow a clearer semantics for approximate algorithms that seek to unify reasoning concepts from constraints and probability. That being said, individual models are easily translated (perhaps with some loss of structural factorization information) between the three formalisms, as explained, for instance, in [151]. Taken together, the formalisms generalize many familiar application-specific techniques like dynamic Bayes nets [43], hidden Markov models [12], and Kalman filters [86], as well as the Ising and Potts models at the root of statistical physics [13].

We will begin by defining the basic (primal) factor graph model, then proceed to define more sophisticated (dual and redundant) transformations of the factor graph that will be useful in future sections. Returning to the primal factor graph, we define the problem of computing marginals and explain how factor graphs represent the factorization structure behind such computations.

1.1.1 Basic Factor Graphs: Definitions, Notation, and Terminology

Definition 1 (Factor Graphs). A factor graph is a bipartite graph containing two types of nodes: variables and factors (a.k.a. functions). Associated with each such node is a variable or a function over variables, respectively. Usually we will not need to distinguish between variables/factors and their associated nodes in the graph; as we will see, the edges in a graph are defined unambiguously by the scopes of the functions. Thus, we will typically denote a factor graph G = (X,F ) solely in terms of its variables X and factors F .

Definition 2 (Variables, Values, and Neighborhoods). Let X = {x1, x2, . . . , xn} be the set of variables in a given factor graph. Typically, we will use ‘i’ as an index over such sets. Let D = {v1, v2, . . . , vd} be the set of values that each xi can hold. In actuality each variable can have its own arbitrary (possibly continuous) domain, but here we make the domains identical and finite, for simpler presentation. Values in D will typically be indexed by the subscript ‘j’. In a factor graph, a variable is represented by a node that shares edges with a set of nodes representing factors that contain the variable in their scope. A given variable xi’s neighborhood ηi denotes such a set of factors.

Definition 3 (Assignments and Projections). An assignment (sometimes called a “labeling”) π : X → D is a function associating variables with their values. We can project an assignment π over a subset S of variables on which it is defined, producing a new assignment π|S ranging over exactly S and returning the same values as π on elements of S. Assignments that are defined over all variables in a graph are called (total) configurations, as opposed to partial assignments. The symbol ~x ∈ CONFIGS(X) will typically serve to denote a configuration, where CONFIGS(X) = D^n is the set of all possible configurations of the variables X in a factor graph.

Definition 4 (Factors). Let F = {f1, f2, . . . , fm} be the set of factors in a factor graph. Typically, we will use ‘a’ as an index over such sets. Associated with each factor fa is a scope (or less conventionally, a “signature”) σa ⊆ X representing the set of variables comprising the domain of function fa. Thus each fa : D^{|σa|} → R⁺ ∪ {0} is a function that receives assignments to the variables in its scope, and maps them to the non-negative reals. By construction, the edges of a factor graph connect each factor fa with exactly the variables in its scope σa; in this sense a factor’s scope is the analogue of a variable’s neighborhood.

Definition 5 (Extensional View of Factors and their Instantiations). An extension (a.k.a. “tuple”) za : σa → D is an assignment to exactly the variables in the scope of some factor fa. Thus, “za(xi) = vj” indicates that the extension za assigns the value vj to variable xi. Extensions za and zb to two different factors fa and fb are consistent iff they make the same assignment to any variables their factors share between their scopes: ∀xi ∈ σa ∩ σb, za(xi) = zb(xi). We stretch notation a bit further by expressing the instantiation of a factor by some extension as fa(za) ∈ R⁺ ∪ {0}. This represents the value of fa when evaluated with its arguments assigned to the values specified by za. A final notational convention is to represent extensions as vectors whenever there is a natural lexicographic ordering over a function’s parameters. For instance, if σa = {x2, x4, x5} then za = ⟨0, 1, 2⟩ indicates that za assigns x2 to 0, x4 to 1, and x5 to 2.
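As a small illustration of Definition 5 (not code from the thesis), extensions can be encoded as Python dictionaries, making the consistency condition ∀xi ∈ σa ∩ σb, za(xi) = zb(xi) a one-line check; the second extension below is a hypothetical example.

```python
def consistent(z_a, z_b):
    """Extensions (dicts from variable names to values) are consistent iff
    they agree on every variable shared between their factors' scopes."""
    return all(z_a[x] == z_b[x] for x in set(z_a) & set(z_b))

z_a = {"x2": 0, "x4": 1, "x5": 2}   # the extension <0, 1, 2> from Definition 5
z_b = {"x4": 1, "x5": 2, "x6": 0}   # a hypothetical extension sharing x4, x5
assert consistent(z_a, z_b)          # they agree on both shared variables
assert not consistent(z_a, {"x4": 0, "x6": 1})
```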

[Figure 1.1: Example Factor Graph.]

Example 1 (Depicting a Factor Graph). Figure 1.1 [101] depicts a factor graph with variables X = {x1, . . . , x5} and functions F = {f1, . . . , f5}. Their associated nodes are drawn as circles and squares, respectively. If we assume a common domain D = {0, 1, 2} for all variables, then there are 3^5 possible configurations of X. Factor f3 has scope σ3 = {x1, x2, x3}, and thus each of its 3^3 possible extensions is a mapping from these three variables onto D. Such an extension, for example z3 = {(x1, 0), (x2, 1), (x3, 2)} ≡ ⟨0, 1, 2⟩, is consistent with an extension z4 to f4 iff z4(x3) = 2. That is, they must agree on x3; their projections onto {x3} (denoted z3|{x3} and z4|{x3}) must be identical. Finally, the instantiation of f3 by z3 is the value traditionally denoted as f3(0, 1, 2).

The semantics of a factor graph are such as to map each configuration of its variables to a non-negative real value; the graph’s structure shows how this value factorizes into a product of individual instantiations. In particular, a factor graph G defined as above realizes a function WG over configurations ~x ∈ CONFIGS(X):

$$W_G(\vec{x}) \;=\; \prod_{f_a \in F} f_a\big(\vec{x}\,\big|_{\sigma_a}\big) \tag{1.1}$$

In other words, to calculate the value of a configuration, we can just perform several function evaluations and multiply the results together. In its role as a probabilistic graphical model, a factor graph represents the joint probability distribution over X, whose contents are considered random variables. Thus WG should be interpreted as issuing weights over joint configurations; such weights must then be normalized to form a proper distribution:

$$p(\vec{x}) \;=\; \frac{1}{N}\, W_G(\vec{x}), \qquad \text{where} \quad N \,\triangleq \sum_{\vec{x} \in \mathrm{CONFIGS}(X)} W_G(\vec{x}) \tag{1.2}$$

N is known as the normalizing constant or partition function (of G); its calculation is equivalent in structure and complexity to the marginal computation problem considered in Section 1.1.3. At this point it is easy to see how factor graphs encompass alternative approaches like MRFs and Bayes nets. In converting an MRF to a factor graph, each clique induces a single factor connected to exactly the clique’s members; the factor simply realizes the clique’s potential function. To convert a Bayes net, we attach a factor to each node and additionally connect it to any of the node’s parents; such factors represent priors or conditional probability distributions depending on whether there are in fact any parents for the node in question.
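To ground equations (1.1) and (1.2), the following Python sketch evaluates WG and p by brute force under an assumed encoding (factors as (scope, function) pairs); it is purely illustrative, and its enumeration of all d^n configurations is exactly the exponential cost that the methods of Sections 1.2 and 1.3 aim to avoid.

```python
import math
from itertools import product

def weight(factors, config):
    """W_G(x) = product over factors f_a of f_a(x | sigma_a), per (1.1).

    factors: list of (scope, fn); scope is a tuple of variable names and
    fn maps a tuple of their values to a non-negative float.
    config: dict assigning a value to every variable."""
    return math.prod(fn(tuple(config[x] for x in scope))
                     for scope, fn in factors)

def probability(variables, domain, factors, config):
    """p(x) = W_G(x) / N, per (1.2); N is computed by exhaustive summation."""
    N = sum(weight(factors, dict(zip(variables, vals)))
            for vals in product(domain, repeat=len(variables)))
    return weight(factors, config) / N
```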

Given our ultimate interest in applying the factor graph representation to constraint reasoning, we look ahead and consider two example types of factor graphs before formally defining their associated reasoning tasks in Chapter 3.

Example 2 (Constraint Satisfaction Problems). As elaborated in Chapter 3, a constraint satisfaction problem can be represented as a factor graph whose factors are all 0/1 functions. That is, all instantiations evaluate to either 0, expressing the violation of a constraint, or to 1, expressing the satisfaction of a constraint. The weight function WG for such a factor graph equals one if and only if a configuration meets all of the constraints, indicating the satisfaction of the problem as a whole. Ultimately, our aim is to compute approximate marginal probabilities for variables, informed by the factor graph representation, and use them as heuristics for solving constraint satisfaction problems.
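For instance, a SAT clause such as (x1 ∨ ¬x2) can be encoded as a 0/1 factor compatible with the sketch above; the helper below is a hypothetical illustration rather than the thesis's implementation.

```python
def clause_factor(signs):
    """0/1 factor for a clause over its scope: signs[i] is 1 for a positive
    literal and 0 for a negated one; the factor returns 1.0 iff at least
    one literal is satisfied by the extension."""
    def f(extension):  # extension: tuple of 0/1 values, one per scope variable
        return 1.0 if any(v == s for v, s in zip(extension, signs)) else 0.0
    return f

# (x1 or not x2) over scope (x1, x2): violated only by x1=0, x2=1.
f = clause_factor((1, 0))
assert f((0, 1)) == 0.0 and f((1, 1)) == 1.0
```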

Example 3 (Constraint Optimization Problems). In Chapter 8 we will also consider the constraint optimization problem, wherein constraints are assigned numeric weights (or “penalties”) to represent their relative importance. Here we seek a configuration that minimizes the sum of weights for the constraints it violates. Instead of representing constraints by 0/1 functions, we will use functions that evaluate to 1 if a constraint is satisfied, and an exponentially small value otherwise–the exact value is dependent on the weight of the constraint and a general scaling parameter y. The weight function WG for such a factor graph will assign exponentially more weight to an optimal configuration than to any other; the model is approximate in that other configurations still have some small weight, but this discrepancy vanishes as y approaches infinity.
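One natural instantiation of such a smoothed factor (the precise form is given in Chapter 8; the choice of e^(−y·w) below is an assumption made for illustration) returns 1 when a weight-w constraint is satisfied and an exponentially discounted value otherwise.

```python
import math

def weighted_clause_factor(signs, w, y):
    """Soft factor for a clause of weight w: 1.0 if satisfied, exp(-y*w) if
    violated. A configuration violating total weight W then gets overall
    weight exp(-y*W), so optimal configurations dominate as y grows."""
    def f(extension):
        satisfied = any(v == s for v, s in zip(extension, signs))
        return 1.0 if satisfied else math.exp(-y * w)
    return f
```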

1.1.2 Transforming a Factor Graph: the Dual Graph and Redundant Graph

We will soon consider the problem of computing marginal probabilities over factor graphs, but at this point the groundwork is already sufficient for defining a pair of factor graph transformations that will prove useful in future sections. First, we present a “dual” transformation that treats functions as though they were variables ranging over tuples, and variables as though they were (constraint) functions enforcing consistency over tuples. Secondly, we present a further “redundant” transformation that combines elements of the primal and the dual into one factor graph. Both transformations produce factor graphs that are equivalent to the original “primal” version. Their purpose is to expose the type of primal/dual optimization that underlies both message-passing techniques for computing marginals, as well as search and inference techniques for solving constraint problems.

Definition 6 (Kronecker δ-Function). δ(a, b) = 1 if a = b; otherwise it equals 0.

Dual Graph. Let G be a factor graph with variables, values, and factors indexed as in the previous definitions. Then its dual formulation DUAL(G) = (Z, C ∪ S) consists of dual variables Z = {z1, . . . , zm} representing extensions to corresponding factors in F, plus factors C = {c1, . . . , cn} and S = {s1, . . . , sm} representing “consistency” and “scoring,” respectively. The basic idea is to convert factors into variables representing tuples, and variables into c-factors enforcing consistency between tuples. In addition, we will add unary s-factors to score the tuples by simulating the original factors in the primal graph.

Thus if Zi is a vector of all dual variables whose associated scopes contain variable xi, then ci(Zi) = ∏_{za,zb ∈ Zi} δ(za(xi), zb(xi)), which evaluates to 1 if the extensions all project onto the same assignment for xi, and 0 otherwise. Further, to each dual variable za we connect a scoring factor sa where sa(za) = fa(za). Each sa simulates the result of instantiating factor fa from the original graph with the extension za defined for the dual graph.
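Continuing the dictionary encoding of extensions from earlier, a minimal sketch of the consistency factor ci (again illustrative, not the thesis's data structures):

```python
import math

def delta(a, b):
    """Kronecker delta of Definition 6."""
    return 1.0 if a == b else 0.0

def consistency_factor(x_i, extensions):
    """c_i(Z_i): 1.0 iff every pair of extensions whose scopes contain x_i
    projects x_i onto the same value, 0.0 otherwise."""
    return math.prod(delta(z_a[x_i], z_b[x_i])
                     for z_a in extensions for z_b in extensions)

assert consistency_factor("x1", [{"x1": 0, "x2": 1}, {"x1": 0, "x3": 2}]) == 1.0
assert consistency_factor("x1", [{"x1": 0, "x2": 1}, {"x1": 1, "x3": 2}]) == 0.0
```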

Example 4 (Dualizing a Graph). Figure 1.2(b) illustrates the dualization process for a four-node factor graph where the variable domain is D = {0, 1}. Factors f1 and f2 both have scope σ1 = σ2 = {x1, x2}. So in the dual they become variables Z1 and Z2, which have identical domains consisting of the cross product of x1’s and x2’s: z1, z2 ∈ {⟨0, 0⟩, ⟨0, 1⟩, ⟨1, 0⟩, ⟨1, 1⟩}. Consistency factor c1 ensures that the two extension variables agree on the value of x1: c1(z1, z2) = δ(z1(x1), z2(x1)), and likewise for c2 with respect to x2. Finally, scoring factor s1 simulates the instantiation of f1 with z1. For example, s1(⟨0, 1⟩) = f1(0, 1).

[Figure 1.2: Transformations of an Example Factor Graph. (a) Example primal graph. (b) Detailed example of a dual transformation. (c) Dual transformation of original example. (d) Redundant transformation of original example.]

Example 5 (Dualizing the Example Graph). Figure 1.2(c) depicts the dual transformation of the earlier example factor graph, reproduced in 1.2(a). In the dual graph, c1 is a 0/1 function guaranteeing a value of zero for configurations that assign dual variables z1 and z3 to contradictory values. In particular, such values must correspond, in the primal graph, to tuples of f1 and f3 that assign the same value to x1. Returning to the dual graph, z3 instantiates s3 to produce the same value as if f3 were instantiated in the primal graph with x1, x2, and x3 set according to the value of z3.

Because each application of transformation DUAL adds new nodes to a factor graph, the resulting factor graphs are clearly not identical to the original. Instead, we demonstrate functional equivalence according to a mapping between configurations in the primal and dual versions of a problem. The only complication is that the mapping is not one-to-one: it is possible to set the dual variables inconsistently and thus without any corresponding valid configuration of the primal variables. However, in the case of any probabilistic factor graph G, this issue is moot by the design of DUAL(G): it will evaluate to zero for any such inconsistent dual configuration. This is the same as defining invalid configurations to have zero probability. For technical purposes, then, we can extend the definition of a configuration to allow a single additional possible value ‘⊥’ (“false”). This configuration is defined to always have zero probability:

∀G, WG(⊥) = 0.

Proposition 1. For any probabilistic factor graph G, G ≡ DUAL(G).

Proof. The equivalence is easy to observe once we allow that invalid dual configurations have zero probability. Consider a mapping M between configurations of the two graphs that matches normal primal configurations with (valid) dual configurations that assign dual variables according to their projections onto corresponding primal variables. Additionally, the primal configuration ‘⊥’ corresponds to the dual configuration ‘⊥’ as well as any inconsistent dual configuration. It is then straightforward to demonstrate that WG(~x) = WDUAL(G)(M(~x)) for any primal configuration ~x, and WG(M⁻¹(~z)) = WDUAL(G)(~z) for any dual configuration ~z, establishing the equivalence between the primal and dual versions of a factor graph.

Redundant Graph. The dual graph can be further transformed by replacing its consistency factors with variable nodes directly representing the original variables from the primal graph. These nodes are then connected to the rest of the graph by introducing new consistency factors of the form ci,a ensuring that any relevant extension za uses the value for xi that is specified by its node. Thus, the term “redundant” is meant to reflect how a variable’s value is encoded by a graph configuration multiple times: once in its variable node and once in each of the extension nodes representing the factors in its primal neighborhood.

Formally, the redundant transformation of a primal graph G = (X, F) is REDUNDANT(G) = (X ∪ Z, C′ ∪ S), where Z and S are defined as in DUAL(G), and C′ = {ci,a : 1 ≤ i ≤ n, fa ∈ ηi}. Each ci,a connects a primal variable xi to a subsuming dual variable za, ensuring consistency: ci,a(vj, za) = δ(vj, za(xi)).
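A one-line sketch of these new consistency factors, under the same illustrative dictionary encoding as before:

```python
def primal_dual_consistency(v_j, z_a, x_i):
    """c_{i,a}(v_j, z_a) = delta(v_j, z_a(x_i)): the value at the primal node
    for x_i must match the dual extension's projection onto x_i."""
    return 1.0 if z_a[x_i] == v_j else 0.0

# The tuple chosen for a dual variable must project x1 onto the primal value.
assert primal_dual_consistency(0, {"x1": 0, "x2": 1, "x3": 2}, "x1") == 1.0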

Example 6 (Making the Example Graph Redundant). Figure 1.2(d) shows the redundant transformation of the example factor graph, combining the primal variables of 1.2(a) with the scoring factors and dual variables of 1.2(c). Consistency is now enforced between primal variables and all their corresponding duals. For instance, the factor c1,1 ensures that the chosen tuple for z1 projects x1 onto the same value held by the node for x1, while the factor c1,3 ensures that the chosen value of z3 does the same.

Proposition 2. For any probabilistic factor graph G, G ≡ REDUNDANT(G).

Proof. The justification is straightforward as in Proposition 1; corresponding consistent configurations evaluate to the same value in the two graphs, and inconsistent configurations evaluate to zero.

In Section 1.2 we will be able to view factors in a primal graph as potential functions over buckets of variables. Under these equivalent transformations, the dual variables Z can be viewed as explicit statements of such structure. Further, their surrounding c-factors embody the sort of "running intersection" property between such buckets that figures prominently in variable elimination techniques. So while adjacent nodes in "join-trees" (or "junction-trees", etc. [123]) might pass messages about shared variables, such sets of variables are now represented explicitly and individually as extension variables in the redundant graph. At this point, however, we can leave the transformations aside and proceed to define the marginal computation problem in terms of the primal factor graph.

1.1.3 The Marginal Computation Problem

Given a factor graph representing a joint probability distribution over its variables, we wish to compute the marginal probability for each variable to hold each of the possible values in D. That is, we want a distribution $P(x_i = v_j)$ on the variable's value "all other things being equal," or in the absence of any information about the other variables. The problem is #P-complete in complexity [101].

Marginal probabilities can be represented by a matrix $\Theta \in [0, 1]^{n \times d}$ where each row contains $d$ multinomial parameters for a given variable. For clarity, then, we can use $\theta_i(v_j) \triangleq \theta[i, j]$ to denote the probability that variable $x_i$ holds value $v_j$. Thus each $\theta_i$ must act as a proper probability distribution, as in $\forall i : 1 \leq i \leq n, \sum_{j=1}^{d} \theta_i(v_j) = 1$. (The character 'µ' is often used instead of 'θ', which appears here in recognition of the parameter estimation perspective of [79, 78].)

Definition 7 (Bias and Bias Distributions). For consistency with [79, 78, 81, 102] we will also call θi the (multinomial) bias distribution for a variable xi, and refer to each θi(vj) as an individual bias. That is, the term “bias” can be freely substituted for “marginal probability”; in practice it will be used in cases where the underlying probability distribution represents a constraint satisfaction or constraint optimization problem.

Definition 8 (Survey). Also for notational consistency with previous work [79] that applies the computation of marginals to constraint reasoning, we will use the term “survey” to denote a set of bias distributions, one for each variable in a joint distribution. We will typically use ‘Θ’ to denote a survey.

Definition 9 (Computing the Bias/Marginal Probability, Summary). The marginal probability of a variable $x_i$ holding a value $v_j$ can be defined operationally as a summary over all configurations that contain the assignment in question. Essentially we query $G = (X, F)$ for the joint probability of each such configuration and add these probabilities to get the desired marginal:

\theta_i(v_j) = \frac{1}{N} \sum_{x_1} \cdots \sum_{x_{i-1}} \sum_{x_{i+1}} \cdots \sum_{x_n} W_G(\langle x_1, \ldots, x_{i-1}, v_j, x_{i+1}, \ldots, x_n \rangle) \qquad (1.3)

Each summation represents the summary of a given variable with respect to $W_G$. N is the normalizing constant from (1.2), and can be disregarded for present purposes: typically we will seek a whole bias distribution for a variable rather than an individual bias, so we can get N at no additional cost by summing an unnormalized version of (1.3) over all $v_j \in D$. By dropping this constant and expressing the summation over entire configurations that make the assignment in question, we can represent the marginal probability in a more compact form:

\theta_i(v_j) \propto \sum_{\vec{x} : x_i = v_j} W_G(\vec{x}) \qquad (1.4)
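Read operationally, (1.4) is a brute-force enumeration. The following sketch computes an entire survey this way (the list-of-(scope, function) representation of F and all names are our own illustrative choices; the exponential enumeration is precisely what the rest of this chapter works to avoid):

    from itertools import product

    def survey_by_enumeration(n, d, factors):
        """Compute every theta_i(v_j) per Eq. (1.4): sum the unnormalized
        weight W_G of each configuration into the bin for each variable's
        value. `factors` is a list of (scope, fn) pairs, where scope is a
        tuple of variable indices and fn maps the scoped values to a weight."""
        theta = [[0.0] * d for _ in range(n)]
        for x in product(range(d), repeat=n):        # every configuration
            w = 1.0
            for scope, fn in factors:                # W_G factorizes over F
                w *= fn(*(x[k] for k in scope))
            for i in range(n):
                theta[i][x[i]] += w
        for row in theta:                            # recover N by summing an
            total = sum(row)                         # unnormalized row over v_j
            row[:] = [v / total for v in row]
        return theta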

The summary process will appear in various forms throughout the current section. If we view each summation in (1.3) as an outer loop over a series of additions, the fundamental purpose of graphical models is to specify a structural decomposition where we can push such loops into tighter inner loops (i.e., “to the right,”) according to the factorization of WG.

Example 7 (Factorizing). Returning to the example factor graph of Figure 1.1, we can calculate the marginal probability of $x_2$ being 0 by rote, according to the expression:

\theta_2(0) \propto \sum_{x_1} \sum_{x_3} \sum_{x_4} \sum_{x_5} \big( f_1(x_1) \cdot f_2(0) \cdot f_3(x_1, 0, x_3) \cdot f_4(x_3, x_4) \cdot f_5(x_3, x_5) \big) \qquad (1.5)

However, factoring over sums allows us to push summations as far to the right of (1.5) as possible, yielding an equivalent expression that represents a more efficient computation:

= f_2(0) \cdot \sum_{x_1} \left( f_1(x_1) \cdot \sum_{x_3} \left( f_3(x_1, 0, x_3) \cdot \Big( \sum_{x_4} f_4(x_3, x_4) \Big) \cdot \Big( \sum_{x_5} f_5(x_3, x_5) \Big) \right) \right) \qquad (1.6)

In other words, if we operationalize (1.5) as four outer loops over domain $D = \{0, 1, 2\}$, we must perform $4 \cdot 3^4$ multiplications and $3^4 - 1$ additions. In contrast, by creating inner loops according to (1.6), we avoid the duplication of intermediate results and thus perform only $1 + (3 \cdot (3 \cdot 2))$ multiplications and $2 + 3 \cdot 2 + 3 \cdot (2 + 2)$ additions.

As another example, pushing summaries to the right creates another efficient computation for the marginal probability of $x_3$ being 0:

\theta_3(0) \propto \left( \sum_{x_1} \sum_{x_2} f_3(x_1, x_2, 0) \cdot f_1(x_1) \cdot f_2(x_2) \right) \cdot \left( \sum_{x_4} f_4(0, x_4) \right) \cdot \left( \sum_{x_5} f_5(0, x_5) \right) \qquad (1.7)

Here, the expression only requires $2 \cdot 3^2 + 2$ multiplications and $3^2 - 1 + 2 + 2$ additions.
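The savings can be checked directly in code; below, the factored form (1.7) is evaluated with its summations pushed in, under the assumption that f1, ..., f5 are available as Python callables over D = {0, 1, 2} (the names are ours):

    # Evaluating theta_3(0), up to normalization, per Eq. (1.7): the sums
    # over x4 and x5 are each computed once, outside any loop over x1, x2.
    D = range(3)
    s4 = sum(f4(0, x4) for x4 in D)
    s5 = sum(f5(0, x5) for x5 in D)
    s12 = sum(f3(x1, x2, 0) * f1(x1) * f2(x2) for x1 in D for x2 in D)
    theta3_0_unnormalized = s12 * s4 * s5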

The process of extracting factors that have been distributed over sums can be viewed operationally as a dynamic programming technique where repeated expressions evade recalculation; instead of calculating $av + bv + cv$ for different values of $v$, we calculate $v \cdot (a + b + c)$ for each $v$, computing the parenthesized value just once. There are three important points to note about the factorization process in relation to the general problem of computing marginal probabilities over a graphical model:

1. Messages. We can interpret the intermediate quantities that evade re-computation as comprehensive summaries of all relevant influence from certain sets of variables. For instance, the two innermost expressions in Eq. (1.6) contain all relevant information about the possible settings of $x_4$ and $x_5$, for a fixed configuration of the preceding variables. Such summaries form the fundamental units of the message-passing process presented in the next section.

2. Structure and Variable Ordering. The form of a factorization depends on the structure of the factor graph, as well as the order in which we push variables' summations into inner loops (in Example 7 the ordering was reverse-lexicographic). These two basic considerations will appear repeatedly as we consider reasoning over graphical models and constraint satisfaction alike. For instance, we will see that the induced width concept underpinning the complexity of variable elimination-type algorithms can define an optimal elimination ordering on variables. In terms of graph structure, we will find that many algorithms cannot be applied exactly to graphs with loops; at this point this is already apparent if we observe that the expression $\sum_{x_1} \sum_{x_2} f_1(x_1, x_2) \cdot f_2(x_1, x_2)$ cannot be subjected to our factorization process.¹ A final consideration for finding hidden tractability in otherwise intractable problems is the internal structure of the functions themselves; here this is left out by our focus on graph structure, but it can be captured by our own bias estimation techniques, which appear later in this document.

3. The Algebraic Semiring. The overall formulation of computing marginals from sum and product is just one instance of a larger class of problems defined over arbitrary commutative semirings of the form $(S, \oplus, \otimes)$ [20]. Here $S$ is $\mathbb{R}^+ \cup \{0\}$, and we have selected standard addition for $\oplus$ and standard multiplication for $\otimes$. But we can generalize the formulation herein to a larger class of problems by substituting for the summation operators in Eq. (1.3) and the multiplications in Eq. (1.1). For instance, substituting max for $\oplus$ and retaining multiplication for $\otimes$ yields the most probable explanation problem of Section 1.4. Other substitutions can produce game-theoretic principles like minimax [142], logical compilation formalisms like AND/OR trees [49], database query optimization structures like join/project-trees [115], certain preference frameworks for constraint optimization [20], or any of a variety of statistical methodologies [3].

¹We can, however, compact $f_1$ and $f_2$ into a single factor representing their product; we will see that this is one of the essential components of the variable elimination process.
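To illustrate the semiring view, the brute-force enumeration sketched earlier can be parameterized directly by $(\oplus, \otimes)$; the function and argument names are our own, and real algorithms would push $\oplus$ inward rather than enumerate:

    from functools import reduce
    from itertools import product
    import operator

    def semiring_query(n, d, factors, i, oplus, otimes, identity):
        """Naive semiring summary of variable i: combine each configuration's
        weight with otimes, then fold configurations into a per-value table
        with oplus (starting from oplus's identity element)."""
        out = [identity] * d
        for x in product(range(d), repeat=n):
            w = reduce(otimes, (fn(*(x[k] for k in scope))
                                for scope, fn in factors))
            out[x[i]] = oplus(out[x[i]], w)
        return out

    # Marginals (sum-product):  oplus=operator.add, otimes=operator.mul, identity=0.0
    # MPE values (max-product): oplus=max,          otimes=operator.mul, identity=0.0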

1.2 Exact Methods for Computing Marginals

We can now present two types of algorithms for computing marginals exactly. The central technique is a message-passing dynamic programming approach that summarizes variables one at a time while avoiding repeated calculations whenever possible. This method is only correct and convergent to a fixed point on tree-structured graphs. Thus, the second set of approaches consists of ways to process arbitrary graphs into trees at the expense of exponential time and space on a graph’s tree-width (which, roughly speaking, represents the degree to which the graph is not a tree.)

1.2.1 Message Passing for Trees

Limiting ourselves to tree-structured factor graphs for now, we consider a message-passing algorithm for computing marginals exactly. The core dynamic programming idea that we will present here has appeared in many areas under different names. It was originally introduced as "belief propagation" [123] for probabilistic models, and the "junction-tree" or "sum-product" algorithm [3, 101] in coding and information theory; here we avoid both of these terms to prevent confusion with loopy versions presented later in Section 1.3. Together with a node-merging approach described in the next section, the idea has been formulated as "bucket elimination" over a variety of commutative semirings, as well [45].² We will present the approach in a way that clarifies its close connection to the node-merging variant, and call it "variable elimination on a tree," in recognition of this correspondence.

²To be more precise, bucket elimination is variable elimination on its own. The introduction of mini-buckets to handle only some of the functions that belong in a bucket distinguishes the overall approach.

Algorithm 1 produces a survey (that is, a marginal distribution for each variable) by making two passes through the factor graph: from the leaves to an arbitrarily chosen root and from the root back down to the leaves. As depicted in Figure 1.3, variables pass messages to factors that summarize the messages of all the variable’s other factors, by multiplicative average (1.8).

Essentially, they tell a factor how they are likely to be set, according to the other factors.

Similarly, factors send messages to variables summarizing all the factor’s other variables, by a summary process that sums the variable out from the set of possible extensions, and weights each resulting extension by its instantiation of the factor (1.9). This tells a variable how it ought to be set, according to the influence of the other variables in the subtree rooted at that factor.

Finally, weights are normalized into probabilities in order to form proper bias distributions.

The complexity of variable elimination on a tree is linear on the size of the tree.

\mu_{x_i \to f_a}(v) = \prod_{f_b \in \eta_i \setminus \{f_a\}} \mu_{f_b \to x_i}(v) \qquad (1.8)

\mu_{f_a \to x_i}(v) = \sum_{z_a : z_a(x_i) = v} \left( f_a(z_a) \cdot \prod_{x_j \in \sigma_a \setminus \{x_i\}} \mu_{x_j \to f_a}(z_a(x_j)) \right) \qquad (1.9)

Figure 1.3: Variable Elimination Message Update Rules.

Thus it is worth emphasizing that each message is not a single value, but rather, a unary function over a variable's domain. That is, a message conveys $|D|$ values representing the output of said function for each possible value of the variable $x_i$ associated with the message. Thus a variable-to-factor message $\mu_{x_i \to f_a}(\cdot)$ can be seen as an unnormalized distribution over the various values $x_i$ might take, according to the other factors in its neighborhood, besides $f_a$. Specifically, we construct a function that, for a given value $v_j$, evaluates each of the other factors' messages at $v_j$ and multiplies the results together. Likewise, factor-to-variable message $\mu_{f_a \to x_i}(\cdot)$ summarizes $x_i$ according to $f_a$ and the status of the other variables in its scope. In particular, we sum over extensions to the factor, with $x_i$ fixed to a given $v_j$. For each extension, we multiply its weight according to $f_a$ together with the decoupled probability of generating the extension: for each variable in $\sigma_a$ other than $x_i$, we factor in the (unnormalized) probability that it holds the value specified by the current extension, according to its other factors. We can view the message as an expected value for the instantiation of $f_a$ should $x_i$ choose $v_j$ as its value: the expectation is with respect to a distribution defined over the other assignments in the extension; and the probability of such an extension is compiled as products of the other variables' messages, via a mean-field [85] approximation that is exact for trees but not for arbitrary graphs.
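In code, the two update rules of Figure 1.3 might look as follows (a sketch with our own data layout: messages are stored in dictionaries keyed by (factor, variable) or (variable, factor) pairs, each message being a list of |D| values):

    from itertools import product
    from math import prod

    def var_to_factor(i, a, d, eta, f2v):
        """Eq. (1.8): pointwise product of incoming factor messages,
        excluding the recipient factor a."""
        return [prod(f2v[(b, i)][v] for b in eta[i] if b != a)
                for v in range(d)]

    def factor_to_var(a, i, d, scope, fn, v2f):
        """Eq. (1.9): for each value v, sum over extensions z_a that clamp
        x_i to v, weighting each by f_a and the other variables' messages."""
        others = [j for j in scope if j != i]
        out = [0.0] * d
        for v in range(d):
            for ext in product(range(d), repeat=len(others)):
                z = dict(zip(others, ext))
                z[i] = v
                w = fn(*(z[j] for j in scope))
                for j in others:
                    w *= v2f[(j, a)][z[j]]
                out[v] += w
        return out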

Another perspective is to view the algorithm as several simultaneous instances of the factoring operation in Eq. (1.6). Between the two (downward and upward) passes, a variable will receive a single incoming message from each of the factors in its neighborhood; the order of receipt depends on the chosen root. This factoring process is but a continuous (i.e. "soft" or "probabilistic") version of the sorts of inference techniques that can be applied to the constraint satisfaction problems defined in Chapter 3.

Example 8 (Tracing Variable Elimination on a Tree). For the example factor graph of Figure 1.1, Algorithm 1 will at some point produce a total of three incoming messages for the node representing $x_3$. If we focus on $x_3$'s bias toward the value 0, then by the equations in Figure 1.3, these messages are:

\mu_{f_3 \to x_3}(0) = \sum_{x_1} \sum_{x_2} \big( f_3(x_1, x_2, 0) \cdot \mu_{x_1 \to f_3}(x_1) \cdot \mu_{x_2 \to f_3}(x_2) \big)
\mu_{f_4 \to x_3}(0) = \sum_{x_4} \big( f_4(0, x_4) \cdot \mu_{x_4 \to f_4}(x_4) \big) \qquad (1.10)
\mu_{f_5 \to x_3}(0) = \sum_{x_5} \big( f_5(0, x_5) \cdot \mu_{x_5 \to f_5}(x_5) \big)

By re-applying the update and initialization rules to the messages appearing in these expressions, we get:

\mu_{f_3 \to x_3}(0) = \sum_{x_1} \sum_{x_2} \big( f_3(x_1, x_2, 0) \cdot f_1(x_1) \cdot f_2(x_2) \big)
\mu_{f_4 \to x_3}(0) = \sum_{x_4} \big( f_4(0, x_4) \cdot 1 \big) \qquad (1.11)
\mu_{f_5 \to x_3}(0) = \sum_{x_5} \big( f_5(0, x_5) \cdot 1 \big)

Thus, the algorithm ultimately equates $x_3$'s bias toward 0 to the normalized product of these three expressions:

\theta_3(0) = \frac{1}{N} \cdot \left( \sum_{x_1} \sum_{x_2} f_3(x_1, x_2, 0) \cdot f_1(x_1) \cdot f_2(x_2) \right) \cdot \left( \sum_{x_4} f_4(0, x_4) \right) \cdot \left( \sum_{x_5} f_5(0, x_5) \right) \qquad (1.12)

On reviewing Example 7, then, we can confirm that the algorithm's result is identical to that of pushing summations to the right (Eq. (1.7)). Analogous results hold for the other variables in the factor graph; the algorithm computes all marginals at once.

Adopting the concept of summarization clarifies the term "elimination" in "variable elimination." For each successive node in a tree-ordering, we essentially eliminate the node from the current factor graph by summary. Just as we created Equation (1.7) by successively pushing in the summation for $x_5$, then $x_4$, then $x_2$ and $x_1$, we can replace this same sequence of variables with messages summarizing their probabilistic influence: $x_4$ and $x_5$ are eliminated and replaced with $\sum_{x_4} f_4(0, x_4)$ and $\sum_{x_5} f_5(0, x_5)$, respectively, and then the entirety of influence by $x_2$ and $x_1$ is replaced by $\sum_{x_1} \sum_{x_2} f_3(x_1, x_2, 0) \cdot f_1(x_1) \cdot f_2(x_2)$, producing the final result in Equation (1.12). More sophisticated processes are needed to perform the same process on graphs with loops, but in the next section we will present these in a way that emphasizes their foundation on this basic elimination concept.

From the highest-level perspective, connectivity in the graph corresponds to lines of influence between variables; tree structure ensures that root variable $x_r$ separates the graph into disjoint subtrees that do not interact except through $x_r$. During the leaf-to-root pass, it will receive messages summarizing the totality of influence received from each subtree. On the downward pass, it will summarize, for each subtree, the totality of influence arising from the other subtrees. The same holds for all other variables but with different timing than for the root: at some point, they receive an incoming message from each of the disjoint subtrees that they root. Averaging these messages by taking their product (and normalizing) tells a variable all it needs to know to determine its marginal distribution. From this perspective it is clear that the story breaks down in the absence of tree structure. Thus, exact methods for general graphs revolve around the general aim of transforming graphs with cycles into trees.

1.2.2 Transforming Arbitrary Graphs into Trees

In this section we expand Algorithm 1, VARIABLE ELIMINATION ON A TREE, to compute exact marginals on arbitrary graphs, meaning those that have loops. The general approach is well-known in many forms, such as "clustering" and "join-trees" in the probabilistic reasoning community [101, 123], "bucket elimination" in the field of constraint processing [45], and the "Kikuchi" family of free energy approximations in statistical physics and machine learning [92, 151]. However, here we will apply duality to adopt a "node-merging" perspective that better unifies these lines of research. The basic premise is to combine variable nodes that constitute cycles into a single "hypernode" whose domain is the cross product of the original variables'.

Definition 10 (Hypernode). Much as duality represents all the variables in a factor's scope as a single Z variable over its extensions, a hypernode denoted $v_R$ is a distinguished variable representing an entire set of regular variables $R = (r_1, \ldots, r_k)$. Thus, its domain is $|D|^k$ and we can denote its projection onto a regular variable $x_i$ as $v_R|_{x_i}$.

Node Merging. We define the operation of merging a subset $R \subseteq X$ of the variables in a factor graph $G = (X, F)$ by $\mathrm{MERGE}_R(G) = ((X \setminus R) \cup \{v_R\}, F')$, where the elements of R must be contiguous in G; that is, they must be separated only by single factor nodes. The elements of R are thus replaced by a single new hypernode $v_R$ with domain $D^{|R|}$. Let $H = \{h_a \in F : \sigma_a \cap R \neq \emptyset\}$ be the set of factors in G that have an element of R as a parameter. Then $F' = (F \setminus H) \cup \mathrm{COMPACT}(\mathrm{PROJECT}(H))$; we leave untouched any factors that do not connect to $v_R$, and process the rest by the compaction and projection operations defined below.

PROJECT(H) replaces each $h_a \in H$ with a new factor $h_b$ where $\sigma_b = (\sigma_a \setminus R) \cup \{v_R\}$. In other words, the factor node for h stays connected as before to variables outside of R, and replaces all its connections to variables in R with a single edge to the new hypernode $v_R$. Instantiating $h_b$ produces the same value as instantiating $h_a$ with the hypernode projected onto the relevant parameters in the original factor graph. Formally, then, the value of $h_b$ on some extension $z_b$ equals that of $h_a$ on a specially constructed $z_a$: $h_b(z_b) = h_a(z_a)$, where $\forall x \notin R, z_a(x) = z_b(x)$ and $\forall r \in R, z_a(r) = (z_b(v_R))|_r$. Here '|' is a projection operator that receives a vector representing values for all the variables in R, and retrieves the value associated with an individual r.

COMPACT(F) completes the merging process by consolidating redundant factors. Specifically, if one factor's scope is a subset of another's, then we use the product of the two factors as a single replacement factor with the larger scope. More formally, then, the result of compaction is the fixed point of F with respect to repeated application of an operator over pairs of functions $\langle f_a, f_b \rangle$ where $f_a, f_b \in F$ and $\sigma_a \subseteq \sigma_b$. The operator replaces $f_a$ and $f_b$ with a new factor $f_{\{a,b\}}$ with scope $\sigma_b$ and value $f_{\{a,b\}}(z_b) = f_a(z_b|_{\sigma_a}) \cdot f_b(z_b)$.
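The projection step admits a compact functional sketch (our own representation: scope is the factor's ordered variable list, R an ordered list of the merged variables, and a hypernode value a tuple indexed consistently with R; compaction is omitted for brevity):

    def project(scope, fn, R, hyper_id):
        """PROJECT: rebuild factor f_a as h_b, reading every variable of R
        off a single final hypernode argument."""
        keep = [x for x in scope if x not in R]
        new_scope = keep + [hyper_id]
        def h(*args):
            vals = dict(zip(keep, args[:-1]))   # variables outside R
            vals.update(zip(R, args[-1]))       # unpack the hypernode's tuple
            return fn(*(vals[x] for x in scope))
        return new_scope, h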

Example 9 (Merging Nodes). Figure 1.4(a) extends our example factor graph with two new variables, $x_6$ and $x_7$; each introduces a loop into the graph structure. Figure 1.4(b) details the merging operation on a simpler factor graph containing a single cycle, where we declare the variables' domains to be $\{0, 1\}$ for simplicity. $\mathrm{MERGE}_{\{x_1, x_2\}}$ produces a new two-variable factor graph, where hypernode $v_{\{1,2\}} \in \{\langle 0,0 \rangle, \langle 0,1 \rangle, \langle 1,0 \rangle, \langle 1,1 \rangle\}$ represents the two merged nodes, and $x_3$ remains unchanged. By projection, we replace $f_1$ with a unary function $h_1(v_{\{1,2\}})$ where for instance $h_1(\langle 0,0 \rangle) = f_1(0,0)$. Similarly, we replace $f_2$ with a binary function $h_2(v_{\{1,2\}}, x_3)$ where for instance $h_2(\langle 0,1 \rangle, 0) = f_2(0,1,0)$. By compaction, we replace these two newly-created functions with a single binary function $f'_{\{1,2\}}(v_{\{1,2\}}, x_3)$ where for instance $f'_{\{1,2\}}(\langle 0,1 \rangle, 0) = f_1(0,1) \cdot f_2(0,1,0)$, thus completing the transformation. (Recall that by notational convention we order the arguments to a function lexicographically.)

[Figure 1.4: Node Merging to Eliminate Cycles in an Example Factor Graph. (a) Example factor graph with loops. (b) Detailed example of a node-merging operation. (c) Partial node-mergings on original example. (d) Completed node-mergings on original example.]

Returning to the example factor graph, Figure 1.4(c) shows the intermediate results of two merging operations, on $\{x_1, x_2, x_6\}$ and $\{x_3, x_7\}$. Here $h_1(v_{\{1,2,6\}})$ simulates the operation of $f_1$ on $x_1$ and $x_6$. For instance, $h_1(\langle 0,1,2 \rangle) = f_1(0,2)$. Likewise, $h_2(\langle 0,1,2 \rangle) = f_2(1,2)$. Also, $h_4(v_{\{3,7\}}, x_4)$ replaces $f_4(x_3, x_4, x_7)$ and $h_5(v_{\{3,7\}}, x_5)$ replaces $f_5(x_3, x_5, x_7)$. So under projection, $h_4(\langle 0,1 \rangle, 2) = f_4(0,2,1)$ and likewise $h_5(\langle 0,1 \rangle, 2) = f_5(0,2,1)$. Finally, $h_3(v_{\{1,2,6\}}, v_{\{3,7\}})$ must project its two arguments down to the relevant variables for $f_3$, namely $\sigma_3 = \{x_1, x_2, x_3\}$. Thus, for example, $h_3(\langle 0,1,2 \rangle, \langle 0,1 \rangle) = f_3(0,1,0)$.

Figure 1.4(d) completes the transformation by applying compaction to $h_1$, $h_2$, and $h_3$, as $h_3$'s scope is a superset of the others'. Thus, $f'_{\{1,2,3\}}(v_{\{1,2,6\}}, v_{\{3,7\}})$ consolidates the three intermediate factors into a single product. So for example, $f'_{\{1,2,3\}}(\langle 0,1,2 \rangle, \langle 0,1 \rangle) = f_1(0,2) \cdot f_2(1,2) \cdot f_3(0,1,0)$.

Proposition 3. For any factor graph $G = (X, F)$ and contiguous subset of its variables $R = \{r_1, \ldots, r_k\} \subseteq X$, $G \equiv \mathrm{MERGE}_R(G)$.

Proof. It is straightforward to demonstrate that the two factor graphs give identical scores if we map configurations according to the projection of the hypernode onto its original constituents. Formally, given a configuration c of G, we map to a configuration c′ for $\mathrm{MERGE}_R(G)$ where $\forall x \in X - R, c'(x) = c(x)$ and $c'(v_R) = c(r_1) \times \ldots \times c(r_k)$. In the other direction, we map a configuration c′ of $\mathrm{MERGE}_R(G)$ to a configuration c of G where $\forall x \in X - R, c(x) = c'(x)$ and $\forall r \in R, c(r) = (c'(v_R))|_r$.

The purpose of merging nodes is to process a graph into a tree so that we can apply Algorithm 1. Note that there is no reason we cannot perform $\mathrm{MERGE}_X(G)$ to reduce a factor graph into a single variable representing total configurations of G and a single compacted factor encoding all of G's factors. Indeed, we will see in Section 3.2.2 that this is akin to solving an entire constraint problem by "inference." The problem is that passing a single message on this graph is equivalent to solving the original marginalization on G, as in Equation (1.3). In general, merging nodes exponentiates the space complexity of representing and reasoning over the resulting hypernodes. Thus, we must be judicious in merging nodes, doing just enough to ensure tractability (tree structure in this case), while avoiding the creation of hypernodes with prohibitively large domains.

Definition 11 (Tree Decomposition, Tree-Width). [132]. We first define the width of a merged graph as the cardinality of the largest hypernode. (More precisely, in case of repeated merging operations, we take the cardinality of the largest set of variables from the original graph that are represented by some single hypernode.) A tree decomposition of a factor graph is any tree-structured result of some series of node-merging operations. Finally, the tree-width of a factor graph is the minimum width across all possible tree decompositions of the graph.

Example 10 (Calculating Tree-Width). The tree-width of the factor graph in Figure 1.4(a) is 3: we have demonstrated a pair of merging operations that produce a tree of width 3 ($v_{\{1,2,6\}}$ is the largest hypernode, representing 3 of the original variables), and in fact there exists no other series of merging operations that creates a tree of lesser width.

In short, when given a factor graph with loops, we would like to identify an optimal series of merging operations to produce a tree of minimum width, thus achieving the graph's tree-width. However, it is NP-complete to determine the tree-width of a graph and find such a corresponding tree decomposition, easy as it might seem on small examples [9]. So in concluding this section we present two algorithms whose performance depends on how close they come to the optimal tree decomposition; the first is a simple abstraction, while the second is a special case of the first.

The first algorithm is called "cluster-tree elimination" in [46]. It removes all cycles from a graph by some intelligent, undetermined sequence of merging operations, and then computes marginals on the resulting tree by Algorithm 1, VARIABLE ELIMINATION ON A TREE, as before. Finally, to get a bias $\theta_i(v_j)$ for a variable $x_i$ that has been merged into a hypernode $v_R$ in the tree decomposition, we project down by summing over all assignments to $v_R$ that map $x_i$ to $v_j$: $\theta_i(v_j) = \sum_{r : r|_{x_i} = v_j} \theta_R(r)$, where r represents each possible value of variable $v_R$ in the tree decomposition.
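This final projection step is a direct summation (a sketch; we assume theta_R is a dictionary from hypernode value-tuples to probabilities, and R an ordered list of the hypernode's original variables):

    def project_bias(theta_R, R, i, d):
        """Project a hypernode marginal down to variable i: accumulate the
        probability of every hypernode value whose i-component equals v_j."""
        k = R.index(i)
        theta_i = [0.0] * d
        for tup, p in theta_R.items():
            theta_i[tup[k]] += p
        return theta_i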

Thus, the algorithm’s complexity is exponential on the width of the tree decomposition: the bottleneck is that VARIABLE ELIMINATION ON A TREE must represent and compute messages over all possible values of a hypernode. The number of such values is D|R| where R is the set of variables in the original graph that are represented by the hypernode. There are a variety of algorithms for finding narrow tree decompositions, either to optimality in time exponential on the tree-width, or approximately in linear or polynomial time [21]. In practice, however, most methods for computing marginals do not operate over a directly stated tree decomposition as in the first line of Algorithm 2. Rather, as exemplified in the second algorithm, they indirectly operate over an implicit decomposition that is implied by some variable ordering–the focus is thus on optimizing such orderings.

We call the second algorithm "variable elimination on an arbitrary factor graph" to better match the widest usage. It represents a specific instance of the first whereby we construct a specific tree decomposition according to a given variable ordering. That is to say, it is equivalent to running "cluster-tree elimination" on a tree decomposition determined by the supplied variable ordering. Similarly to before, then, the complexity depends on how well we choose this ordering, and the same remedies come into play.

Canonical statements of Algorithm 3 do not explicitly separate the formation of a tree decomposition from the ensuing message-passing process, as we do here to highlight connections with Algorithms 1 and 2. Rather, they introduce an "induced width" concept that parallels tree-width, and that is essentially identical for our purposes: a given elimination ordering over variables induces a particular tree decomposition whose width once again exponentially governs the complexity of our overall algorithm. Thus the optimal ordering corresponds to the optimal tree decomposition that we might find by whatever other means under CLUSTER-TREE ELIMINATION.

Our presentation also seeks to emphasize the correspondence to VARIABLE ELIMINATION ON A TREE, via the node-merging concept. In that algorithm, a leaf node could safely send a message containing the entirety of its influence, essentially summarizing itself out of the continuing joint distribution calculation. Such information was then passed further up the tree, where messages could now convey the entirety of influence for an entire subtree with respect to the subsequent variables found higher in the tree. This process is equivalent to merging all the nodes in a subtree into a single hypernode: tree structure guarantees that the information embedded in a subtree can be recursively computed by a series of individual summaries, and then passed to the rest of the graph in the form of a single message from the subtree's root. Here, we perform the same recursive summary process whereby a variable can send a message representing the entirety of influence from all previous variables in the ordering. In the absence of tree structure, though, this requires a series of non-unary merging operations that can produce hypernodes, and thus, an exponential blow-up in the complexity of such messages.

Example 11 (Tracing Variable Elimination on a Loopy Graph). Figure 1.5(a) reproduces the loopy graph of Section 1.2.2 to illustrate the elimination process over the variable ordering $(x_1, x_2, x_3, x_4, x_5, x_6, x_7)$. Figure 1.5(b) shows the dual graph U; the depiction of scoring factors is simplified by means of unlabeled solid boxes, for clarity.

[Figure 1.5: Node Merging to Eliminate Cycles in an Example Factor Graph, Lexicographic Ordering. (a) Example factor graph with loops. (b) Dual of the factor graph. (c) Partial variable elimination, by x7 and x6. (d) Completed variable elimination transformation.]

Figure 1.5(c) shows an intermediate stage where $x_7$ and $x_6$ have been processed. Also for clarity, hypernodes are named after the sets of primal variables that they represent, as opposed to the indices of their constituent dual variables (which are based on primal factor indices). Thus we see that in processing $x_7$ we have merged the neighbors of $c_7$, namely $Z_4$ and $Z_5$, to form a hypernode that corresponds to primal variables $x_3$, $x_4$, $x_5$, and $x_7$. Scoring functions $s_4$ and $s_5$ are conjoined into a single factor, and $c_4$ and $c_5$ can be dropped automatically here, or more rigorously, when we reach them in the ordering. (They are irrelevant to begin with since they are unary.) A similar process occurs when we process $x_6$, merging $Z_1$ and $Z_2$ into a single hypernode. For now, the figure inaccurately depicts $c_1$ and $c_2$ as two separate factors with identical domains. Instead, recall that as part of the merging process they will be compacted; their product will form a single factor that simultaneously forces consistency over projections onto both $x_1$ and $x_2$ for $v_{\{1,2,6\}}$ and $Z_3$.

Thus in Figure 1.5(d) the completed transformation exhibits "running intersection" [123]: hypernodes are separated by consistency factors that represent exactly the intersections of their constituent variables, and any variable appearing in two hypernodes ("cliques") appears in all hypernodes connecting them along any path. In forming the final result we process $x_3$ by compacting $Z_3$, and noting that there is nothing to process for $x_2$ and $x_1$. Running VARIABLE ELIMINATION ON A TREE over this structure is equivalent to more canonical presentations of variable elimination; a message from each hypernode represents the entirety of influence for variables toward the end of the ordering, and maintains running intersection for the remainder of the variables. Note that $(x_1, \ldots, x_7)$ is not an optimal ordering, as the induced width of the resulting tree decomposition, 4, does not achieve the original graph's tree-width of 3.

1.2.3 Other Methods: Cycle-Cutset and Recursive Conditioning

In addition to basic VARIABLE ELIMINATION, there are other related approaches to reasoning over graphs that use similar principles, but translated into alternative algorithmic frameworks. For instance, the cycle-cutset scheme [44] and the closely-related but more general framework of recursive conditioning [39] both seek to decompose a network by identifying (perhaps recursively) key variables whose absence would disconnect a graph into a tree, or some other set of tractable subproblems. The original idea arose in the context of search and constraint satisfaction, where algorithms could iterate over various instantiations of such variables and solve the remaining problems via inference. However, an overarching theme of this document is that probabilistic and symbolic reasoning are unified under the semiring framework: search and summation correspond to ⊕, while inference and product correspond to ⊗. Accordingly it is no surprise that the approach has since been applied to a variety of probabilistic tasks [19, 39]. Within the context of the current presentation, conditioning on certain variables is akin to eliminating them first; the only distinction is between searching over their various instantiations and choosing the first successful one, or of performing them all and adding the results together as part of the summary process. In more recent years, a node-splitting [50, 30] approach has also extended the same basic concept to the approximation of marginals, a problem considered in the next section. From the perspective of the current presentation, the approach is like eliminating a node by merging, but decoupling otherwise relevant factors during the merging process.

1.3 Inexact Methods for Computing Marginals

Real-world problems of any significant difficulty correspond to factor graphs with high tree-widths, making them intractable for exact marginal computation methods. Indeed, at a basic level the very meaning of tree-width corresponds directly to the non-triviality of a problem.

Thus, we turn to inexact techniques that produce surveys attempting to approximate the true marginal distributions of the variables in a problem. The first class of techniques is still based on the message-passing framework of the previous section, while the second takes an alterna- tive sampling-based approach.

1.3.1 Message-Passing Algorithms

Despite their stand-alone infeasibility, exact methods for computing surveys merit a solid un- derstanding because they provide the underlying motivation and operational basis for inexact methods. So while inexact message-passing techniques comprise the main line followed in our overall research program, we will demonstrate that their actual algorithms closely follow those of exact methods. Thus, we will see that extensions to exact methods can usually be applied di- rectly to inexact ones, and that the simplest way to initially understand inexact message-passing is as a “wrongful” application of exact message-passing.

In particular, Algorithm 4 simply follows the same message-passing format as VARIABLE ELIMINATION ON TREES, except that it dispenses with the apparatus of two passes between root and leaves. Instead, it sends messages continuously with no guarantee of convergence, and simply hopes to converge naturally on some local maximum in likelihood, while otherwise relying on artificial means of termination. The algorithm is commonly called "loopy belief propagation," suggesting that at a crude level, we are just duplicating Algorithm 1 on an arbitrary graph even though it has loops. (In Chapter 2, we will portray the algorithm with more mathematical sophistication, as the approximate minimization of KL-divergence between our estimated marginals and an approximation of the true marginals [151].) Perhaps surprisingly, the approach has worked extremely well on certain applications in coding theory [17, 106, 145]. For brevity, we will simply refer to the algorithm as (loopy) "belief propagation," or "BP."

We can realize the check for CONVERGENCE at Line 9 by a variety of conditions. The standard approach is to first check whether the messages have reached a fixed point, by de- tecting whether any have changed since the previous iteration. As such natural convergence is not guaranteed, we can instead declare convergence by some artificial secondary conditions as simple as reaching a given number of iterations or as sophisticated as measuring improvement in statistical measures of survey likelihood. BP is performing a gradient-ascent local search over the space of surveys, attempting to converge at some local maximum in likelihood. Ac- cordingly, another popular approach to artificial convergence is to gradually dampen messages, scaling successive updates down by a given factor until “cooling” to some “frozen state” [127].
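Both devices are simple to state (a sketch over the message layout assumed earlier; the damping coefficient and tolerance are our own illustrative parameters):

    def damped(old_msg, new_msg, alpha=0.5):
        """Convex combination of successive messages; driving alpha toward 1
        over time 'cools' the updates toward a frozen state."""
        return [alpha * o + (1 - alpha) * n for o, n in zip(old_msg, new_msg)]

    def converged(msgs, prev, tol=1e-6):
        """Natural-convergence test: no message component moved more than tol."""
        return all(abs(a - b) <= tol
                   for key in msgs for a, b in zip(msgs[key], prev[key]))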

From the perspective of local search, then, it is also natural that there are various ways to perform the initialization at Line 1. Typical approaches sample the initial messages uniformly at random. From any given starting point, Algorithm 4 will deterministically stop at some final survey approximating the true marginal distribution. So it is possible to restart from multiple initializations, random or otherwise, and to compile the results of such runs by averaging or by taking the maximum with respect to some measure of likelihood.

The important thing to remember is that BP only approximates the marginals, and rather crudely, too, as we will see. In fact, we will later demonstrate that BP is a probabilistic version of the arc (pairwise) consistency algorithm considered in Chapter 3 when discussing constraint satisfaction. We can arbitrarily boost its accuracy at the expense of complexity by merging nodes to form simpler but wider graphs; this corresponds to higher forms of consistency within constraint satisfaction. Thus a hybrid method like "iterative join-graph propagation" (IJGP) can mitigate the inaccuracy of BP by performing a partial tree decomposition as in CLUSTER-TREE ELIMINATION; we merge some quantity of nodes that might still be insufficient to produce an actual tree, but when running BP on the resulting graph, we get a better approximation than on the original version [47].

1.3.2 Gibbs Sampling and MCMC

Here we consider a different set of techniques for computing approximate marginal probabilities, based on explicit sampling from various distributions constructed to simulate that of a given factor graph. The motivation is not only to suggest a potential step for future work and better define the path that has been taken to date, but also to clarify the relationships between the methods presented so far, and those to come in Chapter 3. To that end, we present the "Gibbs sampling" method for estimating marginal probabilities on a factor graph [61]. The basic idea is to create a series of configurations $c^{(1)}, c^{(2)}, \ldots, c^{(T)}$ according to the decomposed joint probability distribution; the proportion of such configurations containing a specific assignment determines our estimate of a variable's bias. Under certain conditions, the central limit theorem can guarantee that our estimate approaches the true marginal distribution as the series length T approaches infinity.

It is non-trivial to create a series of samples that fully explores the space of configurations, though. Thus, the algorithm specializes the more general "Markov chain Monte Carlo" ("MCMC") class of algorithms that generate each new sample according to a conditional distribution that depends entirely on the previous sample [5]. In this case, we transition an existing configuration to a new one by changing one variable assignment at a time. By our graphical representation, we can nicely decompose the conditional probability of a particular variable assignment given an assignment to all the other variables. In the equations below, let π represent said partial assignment to all variables in $X \setminus \{x_i\}$, and let $(\pi \cup \{x_i = v_j\})$ represent a complete configuration where $x_i$ is assigned $v_j$ and the remaining variables are assigned according to π:

p(x_i = v_j \mid \pi) = \frac{p(x_i = v_j, \pi)}{p(\pi)} = \frac{\prod_{f_a \in F} f_a\big((\pi \cup \{x_i = v_j\})|_{\sigma_a}\big)}{N \cdot p(\pi)} \propto \prod_{f_a \in \eta_i} f_a\big((\pi \cup \{x_i = v_j\})|_{\sigma_a}\big) \qquad (1.13)

The first equality is by Bayes' rule, and the second inserts the weighting function represented by the factor graph. The normalizing constant N can absorb the other divisor $p(\pi)$, which is constant across assignments to $x_i$. Thus, we have proportionality to the final expression, which does not require normalization across $v_j$'s for the purposes of sampling in the completed Algorithm 5.
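Eq. (1.13) yields the sampling step at the heart of Algorithm 5 (a sketch reusing the earlier factor representation; only the neighborhood eta[i] is consulted, and we assume at least one candidate value has non-zero weight):

    import random

    def gibbs_step(config, i, d, eta, scopes, fns):
        """Resample config[i] from p(x_i | pi) per Eq. (1.13): each candidate
        value is weighted by the product of the factors in x_i's neighborhood,
        with all other variables held fixed."""
        weights = []
        for v in range(d):
            config[i] = v
            w = 1.0
            for a in eta[i]:
                w *= fns[a](*(config[j] for j in scopes[a]))
            weights.append(w)
        # No explicit normalization needed: random.choices accepts
        # unnormalized weights.
        config[i] = random.choices(range(d), weights=weights)[0]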

Given that MCMC methods essentially take random walks biased toward the most "important" (loosely speaking, the most likely) portions of the search space over configurations, it is natural that they should be very similar to local search algorithms for solving constraint problems. As we will see when representing such problems as factor graphs, it is too much to ask that an algorithm directly visit a series of configurations that satisfy an entire constraint problem; this would be tantamount to having solved the problem in the first place. Instead, each iteration of the loop at Line 4 adjusts a variable to its most likely value according to local information represented by the factors in its neighborhood. This is the direct correspondent of greedily setting a variable to satisfy its own constraints during local search.

However, the principle of focusing on a single most likely region must not be taken too far. A final note is important for any effort to apply MCMC techniques to constraint satisfaction: such algorithms require an irreducible sample space guaranteeing "ergodicity" [5]. Essentially, the structure of our factor graph must allow that from any current configuration, Algorithm 5 can eventually reach any other configuration, perhaps over the course of several samples, with non-zero probability. This is often achieved simply by adding a randomized element to any rule for taking a step, persistently allowing for arbitrary changes to the configuration with small probability.

1.4 The Most Probable Explanation (MPE) Problem

Before concluding this chapter, we make one more digression from the main line of research that will nevertheless enable both future opportunities and a deeper perspective on the comparisons at hand. Here we define a second typical query over graphical models, aside from the marginal computation problem: the "Most Probable Explanation" ("MPE") problem asks for the most likely configuration of all the variables at once, under a joint distribution.³ So, for a factor graph $G = (X, F)$ it returns the total assignment that maximizes G's weight function (the usual normalizer N is constant across configurations and thus irrelevant):

\mathrm{MPE}(G) = \operatorname*{argmax}_{\vec{x}} W_G(\vec{x}) \equiv \operatorname*{argmax}_{\vec{x}} \prod_{f_a \in F} f_a(\vec{x}|_{\sigma_a}) \qquad (1.14)

Stepping back to our overall research program, computing a bias distribution tells us the probability that a variable should be set to each of its possible values in solving a constraint problem, in isolation from the biases of other variables. MPE gives us a chance to assign all variables at once: if we define a distribution that gives a weight of one to all solutions of a constraint problem, and zero weight to all other configurations (as we will do in Chapter 3), then any solution will be a valid answer to the MPE query.

However, we are unlikely to simply find such a solution outright by such means: MPE is NP-complete (just like constraint satisfaction) and further, there is currently no approximation with bounded quality [121]. In essence, MPE algorithms must traverse the exponentially-numerous facets of the "marginal polytope" described in Section 2.2, while marginal computations only focus on finding their mean under vector representation [144].

Interestingly, then, the most successful complete methods for MPE currently use branch-and-bound search methods from discrete optimization, perhaps employing tree decompositions or similar techniques for general tractability [42, 109] or for computing bounds [139, 122]. As for approximate methods that quickly find a configuration that is locally maximal in likelihood, the most popular approach is the "max-product" algorithm (originally called "belief revision") [123]. This technique is essentially belief propagation applied to an alternate instance of the algebraic semiring. Here, ⊕ is the max operator, and ⊗ remains multiplication. Thus, it is identical to loopy belief propagation except with maximization substituted for the summation in Figure 1.3. Correspondingly, MAX-PRODUCT is exact and naturally convergent on tree-structured factor graphs, but approximate and not guaranteed to converge on arbitrary graphs.

³In most general form, the query allows us to condition on a set of observed variable assignments; for our purposes this set will always be empty, so by "MPE" we will typically mean "most probable explanation, with no evidence." Also, somewhat confusingly, this problem is typically known as the "MAP" or "Maximum a Posteriori" problem within the machine learning community, though this term has been used differently in other research on graphical models.
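Concretely, only the accumulation step of the factor-to-variable update changes relative to the sum-product sketch given earlier (names remain our own illustrative choices):

    from itertools import product

    def factor_to_var_max(a, i, d, scope, fn, v2f):
        """Max-product variant of Eq. (1.9): max substituted for the sum."""
        others = [j for j in scope if j != i]
        out = [0.0] * d                       # weights are non-negative
        for v in range(d):
            for ext in product(range(d), repeat=len(others)):
                z = dict(zip(others, ext))
                z[i] = v
                w = fn(*(z[j] for j in scope))
                for j in others:
                    w *= v2f[(j, a)][z[j]]
                out[v] = max(out[v], w)       # the only change from sum-product
        return out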

Thus deploying an approximate or anytime MPE-solver to aid the search for a solution to a constraint satisfaction problem might amount to using one kind of search to aid another kind. That is to say, MPE can provide useful information, but it will still have to be employed within some overarching framework like the one that the current line of research uses to employ marginals.

Summary. To this point we have defined the marginal computation problem within a factor graph formalism, along with methods for performing exact and approximate computations. Subjectively speaking, factor graphs are well-known within the machine learning community, but transformations like DUAL and MERGE are not commonly used; in Chapter 3 we will see that such operations are much more prominent within the area of constraint satisfaction. On the other hand, research on constraint satisfaction has traditionally focused on constraint graphs representing variables alone. (In such graphs, edges denote the appearance of two variables within the scope of a common constraint.) Accordingly, the constraint programming research community has historically focused on variable-interaction structure, as opposed to constraint structure, in the context of reasoning over graphical models (as opposed to that of, say, developing heuristics). The presentation in this chapter is motivated by our own requirements, where we will ultimately apply probabilistic algorithms to constraint satisfaction, while developing closed forms that capture the specific structure of a constraint reasoning task. Before proceeding to consider constraint satisfaction, though, we first present a more advanced perspective on marginal estimation in the next chapter.

Algorithm 1: Variable Elimination on a Tree-Structured Factor Graph

Data: tree-structured factor graph G = (X, F).
Result: marginal probability distribution θ_i(·) associated with each variable.

1  With an arbitrary variable node as root, choose any node ordering O = ⟨o_1, ..., o_{|G|}⟩ that begins at the leaf nodes and ends at the root, with all children ordered before any parent.
2  Initialize leaf variable nodes to just send the identity message μ_{x→f}(v) = 1, and leaf factor nodes (whose functions are by definition unary) to just send their own functions as messages μ_{f→x}(v) = f(v).
   // FIRST PASS: LEAVES SEND MESSAGES TO ROOT.
3  for c ← 1 to |G| do
4      if node o_c represents a variable x_i then
5          Send variable-to-factor message μ_{x_i→f_a}(·) to (unique) parent factor f_a.
6      end
7      if node o_c represents a factor f_a then
8          Send factor-to-variable message μ_{f_a→x_i}(·) to (unique) parent variable x_i.
9      end
10 end
   // SECOND PASS: ROOT SENDS MESSAGES TO LEAVES.
11 for c ← |G| to 1 do
12     if node o_c represents a variable x_i then
13         Send variable-to-factor message μ_{x_i→f_a}(·) to each child factor f_a.
14     end
15     if node o_c represents a factor f_a then
16         Send factor-to-variable message μ_{f_a→x_i}(·) to each child variable x_i.
17     end
18 end
   // FORM MARGINAL DISTRIBUTIONS.
19 forall x_i ∈ X do
       // AVERAGE INCOMING MESSAGES (MULTIPLICATIVELY).
20     forall v_j ∈ D do
21         ω_i(v_j) ← ∏_{f_a ∈ η_i} μ_{f_a→x_i}(v_j).
22     end
       // NORMALIZE WEIGHTS TO FORM PROPER DISTRIBUTIONS.
23     N ← Σ_{v_j ∈ D} ω_i(v_j).
24     forall v_j ∈ D do
25         θ_i(v_j) ← ω_i(v_j)/N.
26     end
27 end

Algorithm 2: Cluster-Tree Elimination on an Arbitrary Factor Graph

Data: factor graph G = (X, F).
Result: marginal probability distribution θ_i(·) associated with each variable.

1  Form a tree decomposition of G by some intelligently chosen series of MERGE operations.
2  Run VARIABLE ELIMINATION ON A TREE (Algorithm 1) on the tree decomposition.
3  Project the results back down onto the variables of G.

Algorithm 3: Variable Elimination on an Arbitrary Factor Graph

Data: factor graph G = (X, F).
Result: marginal probability distribution θ_i(·) associated with each variable.

1  Intelligently choose some variable ordering O = ⟨o_1, ..., o_n⟩.
2  Assign U = (Z, C ∪ S) ← DUAL(G), where Z, C, and S represent extensions, consistency, and scoring as in the definition of the dual transformation.
   // BUILD CLIQUE TREE, FROM END OF ORDERING TO BEGINNING.
3  processed ← ∅.
4  for c ← n to 1 do
5      Let x_i be the primal variable ordered o_c in O.
6      Let c_i be the consistency factor representing x_i in the dual graph, U.
       // MERGE ALL UNPROCESSED DUAL VARIABLES TOUCHING c_i.
7      Let R be σ_{c_i} \ processed.
8      U ← MERGE_R(U).
9      processed ← processed ∪ {R}.
10 end
   // WE NOW HAVE A (DUAL) TREE DECOMPOSITION.
11 Run VARIABLE ELIMINATION ON A TREE (Algorithm 1) on U.
12 Project the results back down onto the variables of G.

Algorithm 4: Loopy Belief Propagation (BP)

Data: factor graph G = (X, F).
Result: approximate marginal probability distribution θ_i(·) associated with each variable.

1  Initialize variables and factors to send arbitrary initial messages.
2  repeat
       // SEND ONE ROUND OF MESSAGES.
3      for a ← 1 to m do
4          Send factor-to-variable message μ_{f_a→x}(·) to each adjacent variable x ∈ σ_a.
5      end
6      for i ← 1 to n do
7          Send variable-to-factor message μ_{x_i→f}(·) to each adjacent factor f ∈ η_i.
8      end
9  until CONVERGENCE
   // FORM BIAS DISTRIBUTIONS.
10 forall x_i ∈ X do
       // AVERAGE INCOMING MESSAGES (MULTIPLICATIVELY).
11     forall v_j ∈ D do
12         ω_i(v_j) ← ∏_{f_a ∈ η_i} μ_{f_a→x_i}(v_j).
13     end
       // NORMALIZE WEIGHTS TO FORM PROPER DISTRIBUTIONS.
14     N ← Σ_{v_j ∈ D} ω_i(v_j).
15     forall v_j ∈ D do
16         θ_i(v_j) ← ω_i(v_j)/N.
17     end
18 end

Algorithm 5: Gibbs Sampling for Computing Marginals Approximately

Data: factor graph G = (X, F).
Result: approximate marginal probability distribution θ_i(·) associated with each variable.

1  Initialize c^{(0)} to some arbitrary configuration of the variables X.
   // SAMPLE A SERIES OF CONFIGURATIONS.
2  for τ ← 1 to T do
3      c^{(τ)} ← c^{(τ−1)}.
       // CONSTRUCT CONFIGURATION ONE ASSIGNMENT AT A TIME.
4      for i ← 1 to n do
5          Sample v ∼ p(x_i = v | c^{(τ)}|_{X∖{x_i}}) as per Eq. (1.13).
6          c^{(τ)}(x_i) ← v.   // (INFLUENCE IS IMMEDIATE ON REMAINING ASSIGNMENTS.)
7      end
8  end
   // FORM BIAS DISTRIBUTIONS.
9  forall x_i ∈ X do
       // COUNT NUMBER OF SAMPLES FEATURING EACH ASSIGNMENT.
10     forall v_j ∈ D do
11         ω_i(v_j) ← Σ_{τ=1}^{T} δ(c^{(τ)}(x_i), v_j).
12     end
       // NORMALIZE WEIGHTS TO FORM PROPER DISTRIBUTIONS.
13     N ← Σ_{v_j ∈ D} ω_i(v_j).
14     forall v_j ∈ D do
15         θ_i(v_j) ← ω_i(v_j)/N.
16     end
17 end

Chapter 2

Message-Passing Techniques for Computing Marginals

Before turning to the constraint problems that we will solve with guidance from marginal estimation methods, we will first give a more comprehensive account of inexact message-passing techniques. Specifically, we re-state (loopy) belief propagation as an approximate optimization method for an objective, the Bethe free energy, that itself approximates the true marginal distribution. We then situate it within a family of methods distinguished by their own characteristic approximating objectives. The numerical-optimization perspective assumed here will bear fruit in Chapter 5, where we compare marginal estimation and constraint satisfaction in the context of optimization, and introduce the EMBP marginal approximation method that will be instantiated for SAT and MAXSAT in Parts II and III.

2.1 Belief Propagation as Optimization

Having narrowed the field of marginal computation methods to approximate message-passing techniques, we concluded the previous chapter having characterized (loopy) belief propagation as a "wrongful" application of variable elimination to graphs with loops. Here we will instead present it as a well-defined approximate optimization method for an objective function that itself approximates the true set of marginals for a distribution. For a number of years, researchers considered the application of the belief propagation rules to graphs with loops to be surprisingly effective in practice (especially for implementing "turbo-codes"), but hard to formalize in terms of semantics and convergence behavior [17, 106, 48]. Eventually, it was explained in terms of the Bethe free energy, which was developed independently in the field of statistical physics [151]. Instead of defining BP operationally as applying local variable elimination messages to a graph with cycles instead of to a tree, here we can present it functionally as a method for optimizing the Bethe free energy. As observed before, the algorithm is not guaranteed to converge, and furthermore, when it does converge it is only to a local minimum of this objective function. Thus another way of phrasing the relationship is that the fixed points of the BP updates are local minima of the Bethe free energy.

2.1.1 Gibbs Free Energy

Generalizing the narrative of [151] to factor graphs with functions of arbitrary arity (as opposed to a pairwise specialization of the Markov Random Fields mentioned in Section 1.1), we begin by introducing the "Gibbs free energy" that the Bethe free energy approximates. This can be thought of as the "true" free energy insofar as it directly represents the Kullback-Leibler ("KL") divergence between an approximation of the joint probability distribution and the true joint probability. KL-divergence is a foundational measure in statistics that is large and positive when two distributions defined over the same space differ greatly over the probabilities that they assign within this space, and drops to zero when they are identical. By applying Boltzmann's law and some convenient settings to constants, the KL-divergence between an estimated joint distribution q(·) and a true distribution p(·) yields the following expression for the Gibbs free energy:

G(q(\cdot), p(\cdot)) \;=\; -\sum_{\vec{x}} q(\vec{x}) \log p(\vec{x}) \;+\; \sum_{\vec{x}} q(\vec{x}) \log q(\vec{x}) \;+\; \log N \qquad (2.1)

The first term in (2.1) is known as the mean free energy. Consider each term in this summation: as the probability p(~x) of a configuration ~x approaches 0, the logarithm of this probability approaches negative infinity; as the probability approaches 1, its logarithm increases toward 0. So, to minimize free energy, q(~x) should assign smaller probabilities to configurations where p(~x) is small, and larger probabilities to configurations on which p(~x) puts more probability mass. In fact, then, mean free energy is not minimized by a q(·) that matches p(·) across the set of all configurations, but by a q(·) that puts all probability mass exclusively on configurations with maximal probability under p(·).

The second term in (2.1) represents the (negative) entropy of q(·). The term is always negative, with a maximum value approaching 0 when q(·) puts all probability mass on a single configuration, and with minimum value when q(·) is the uniform distribution over configurations. Together, the first two terms of the Gibbs free energy balance the assignment of q(·)'s probability mass to the most likely configurations under p(·), versus an imperative to assign mass evenly across configurations. In fact, the minimum value for the sum of mean free energy and negative entropy is −log N, and this is achieved when q(·) is identical to p(·).

Hence the final term of the expression, called the Helmholtz free energy, is constant with respect to q(·) and can therefore be ignored for the purposes of optimization. (Recall that N denotes the partition function of p(·), as in Equation (1.2) in Chapter 1.) As a whole, then, (2.1) equals 0 when the two distributions are equal, as is consistent with the assertion that it represents the KL-divergence between two distributions.
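This identity is easy to confirm numerically. Below is a small check in Python, under the reading (suggested by the presence of the Helmholtz term) that the p(~x) inside the first term of (2.1) stands for the unnormalized weight whose partition function is N; the weights chosen are arbitrary:

    import numpy as np

    def gibbs_free_energy(q, w):
        # w: unnormalized weights over configurations, with partition function N
        N = w.sum()
        mean_free_energy = -(q * np.log(w)).sum()   # first term of (2.1)
        neg_entropy = (q * np.log(q)).sum()         # second term of (2.1)
        return mean_free_energy + neg_entropy + np.log(N)

    w = np.array([2.0, 1.0, 1.0])     # hypothetical weights over 3 configurations
    p = w / w.sum()                   # the true (normalized) distribution
    q = np.array([0.5, 0.3, 0.2])     # a candidate approximating distribution

    kl = (q * np.log(q / p)).sum()
    print(np.isclose(gibbs_free_energy(q, w), kl))    # True: G equals KL(q || p)
    print(np.isclose(gibbs_free_energy(p, w), 0.0))   # True: G vanishes at q = p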

What is the purpose of defining a measure between two joint probability distributions, when the measure is minimized by simply choosing the approximate joint distribution q(·) to be identical to the true joint distribution p(·)? The idea is to choose q(·) from a limited family of possible distributions, as characterized by a given structural restriction that must hold across all members of such a family. By expressing such structure in terms of marginal probabilities, we can then seek marginal estimates that most closely match p(·) when aggregated into a joint probability distribution according to the design of the family. This is the basis for variational methods [85]: we constrain the way in which marginal probabilities combine to form a joint distribution, and accept the loss of exactness in order to efficiently optimize the surrogate by taking advantage of the constraints. Thus, a given variational approach is characterized by the type of structure that it assumes for relating marginal probabilities to the joint probability, together with algorithms for exploiting such structure to optimize the resulting surrogate to the Gibbs free energy.

2.1.2 Mean Field Free Energy Approximation

We will soon be able to express belief propagation as an optimization algorithm for the Bethe approximation to the Gibbs free energy of a given factor graph, and then proceed to consider further elaborations of the variational framework. First, though, we consider a simple “mean field” approximation, for clarity and also because it will elucidate the derivation of EMBP in Section 5.2.

For a given factor graph G = (X, F), the mean field approximation represents a marginal distribution qi(·) over each variable in X, and assumes that the joint distribution over X factorizes exactly into the product of the marginal distributions:

q_{MF}(\vec{x}) \;=\; \prod_i q_i(x_i) \qquad (2.2)

In other words, mean field methods assume zero correlation between all variables. (In the equation and those that follow, the expression xi represents the value assigned to variable xi in configuration ~x, as this is easier to parse than the notationally correct expression ~x(xi).)

To produce the mean field approximation to the Gibbs free energy, we substitute qMF(·) for q(·) in (2.1). For p(·), we appeal to the semantics of a factor graph whereby the true probability (ignoring normalization, which is captured by the Helmholtz term) is the product of the factors in F. The result is as follows:

G_{MF}(q_{MF}(\cdot), p(\cdot)) \;=\; -\sum_{\vec{x}} \prod_i q_i(x_i) \log\Bigl(\prod_a f_a(z_a)\Bigr) + \sum_{\vec{x}} \prod_i q_i(x_i) \log\Bigl(\prod_i q_i(x_i)\Bigr) + \log N

\;=\; -\sum_a \sum_{z_a} \Bigl(\prod_{i : x_i \in \sigma_a} q_i(z_a(x_i))\Bigr) \log f_a(z_a) + \sum_i \sum_{x_i} q_i(x_i) \log q_i(x_i) + \log N \qquad (2.3)

In accord with this expression, mean field methods try to choose a set of marginal distributions that, when individually instantiated according to the assignments in a given configuration, yield a product that is high when the true joint distribution is high. This imperative is distributed uniformly and additively across configurations, and is balanced by an entropy objective that is similarly distributed across variables.

The end result is that by making the very strong assumption that all variables are uncorrelated, we produce an expression that is easy to optimize; Equation (2.3) is linear in each qi(xi) apart from that variable's own entropy term, and can be minimized by a variety of stochastic gradient descent methods, subject to the simplex constraints that the qi(xi)'s are non-negative and sum to 1 for a given i. The resulting proper probability distributions qi(·) constitute our estimates for the marginal probability distributions over X. In comparison with other variational methods, then, mean field algorithms are generally among the least accurate and most efficient to compute.
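To illustrate both the ease of optimization and the attraction to local minima, here is a sketch of one standard coordinate-update scheme for this kind of objective (the particular scheme, the factor, and the initialization are illustrative assumptions, not the thesis's method): each marginal is repeatedly reset in proportion to the exponentiated expectation of log f under the other marginals, which decreases the mean field free energy at every step.

    import numpy as np

    f = np.array([[9.0, 1.0],
                  [1.0, 9.0]])     # a single factor f(x1, x2) favoring x1 = x2
    q1 = np.array([0.6, 0.4])      # slightly asymmetric initial marginals
    q2 = np.array([0.6, 0.4])

    for _ in range(100):
        q1 = np.exp(np.log(f) @ q2); q1 /= q1.sum()   # q1(v) ∝ exp(E_q2[log f(v,·)])
        q2 = np.exp(np.log(f).T @ q1); q2 /= q2.sum()

    print(q1, q2)   # settles near (0.75, 0.25) for both: a symmetry-breaking
                    # local minimum, even though the true marginals are uniform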

Another perspective on the underlying assumption of mean field methods is that they treat factor graphs as though all variables were disconnected from one another. We judge the marginal probability of a variable holding a particular value in proportion to the simple average of its functions' valuations when instantiated by extensions that contain this assignment. The fact that certain extensions might be more or less likely, based on the marginal distributions of the other variables in the function's scope, is not captured. Thus we do not represent any influence from other variables on the way that this one is set, and only consult the variable's functions in isolation. In Chapter 3 we will see that the analogous constraint satisfaction condition is node consistency.

2.1.3 Bethe Free Energy Approximation

From the approximation underlying mean field methods, we turn now to the more sophisticated Bethe free energy approximation that underlies belief propagation [18]. In contrast to mean field algorithms, which only represent single-variable marginal estimates denoted qi(·) for each variable xi, the Bethe approximation additionally represents marginal probabilities over function extensions, denoted qa(·) for each function fa. Referring to the definition of factor graph duality in Section 1.1.2, we are thus representing single-variable marginals in both the primal and dual versions of the factor graph, and accordingly the primal/dual nature of optimizing by BP is reflected in the alternation between variable-to-function and function-to-variable messages.

In addition to the standard constraints whereby q_i(v) \geq 0, q_a(z_a) \geq 0, and \sum_v q_i(v) = \sum_{z_a} q_a(z_a) = 1 for all i and a, the Bethe free energy also includes a marginalization constraint for each function fa in the factor graph and each variable xi in the function's signature: \sum_{z_a : z_a(x_i) = v} q_a(z_a) = q_i(v). That is, the marginal distribution for any function extension must itself reduce to the marginal distribution for any individual variable in the function's signature, should we sum over all settings to the other variables in the signature. Viewed in terms of the redundant problem, we seek marginal probabilities over dual variables, but require pairwise consistency between two such variables when their corresponding functions in the primal problem share a variable between their signatures. Combined with these constraints, then, the Bethe free energy is expressed as follows:

G_B(q_B(\cdot), p(\cdot)) \;=\; -\sum_{\vec{x}} \prod_a q_a(z_a) \log\Bigl(\prod_a f_a(z_a)\Bigr) + \sum_{\vec{x}} \prod_a q_a(z_a) \log\Bigl(\prod_a q_a(z_a)\Bigr) + \log N

\;=\; -\sum_a \sum_{z_a} q_a(z_a) \log f_a(z_a) + \sum_a \sum_{z_a} q_a(z_a) \log q_a(z_a) + \log N \qquad (2.4)

(As with xi in the expression for mean field free energy, in the first line of (2.4) we use za to represent the value of extension za that is encompassed by ~x.) By adding Lagrange multipliers to enforce the constraints, it is possible to take first derivatives of the expression to define zero-gradient fixed points that correspond exactly to the fixed points of the belief propagation algorithm. (See [151] for a full derivation on the simplified pairwise MRF model.) Within the belief propagation messages, reproduced below, the function-to-variable messages are actually the values of the Lagrange multipliers for the constraints forcing the distributions on extensions to marginalize onto single-variable marginals.

\mu_{x_i \to f_a}(v) \;=\; \prod_{f_b \in \eta_i \setminus \{f_a\}} \mu_{f_b \to x_i}(v)

\mu_{f_a \to x_i}(v) \;=\; \sum_{z_a : z_a(x_i) = v} \Bigl( f_a(z_a) \cdot \prod_{x_j \in \sigma_a \setminus \{x_i\}} \mu_{x_j \to f_a}(z_a(x_j)) \Bigr)

Figure 2.1: Belief Propagation (Variable Elimination) Message Update Rules.

Re-examining these two rules, we can view the first as what will ultimately become the single-variable marginal distribution for xi to hold value v. (Recall that the final step is to reintroduce \mu_{f_a \to x_i}(v) to the product and then normalize.) By similar means, the function-to-variable messages are the same in the dual: they can ultimately be used to represent a marginal distribution on function extensions that contain the assignment xi = v. This interpretation is materialized by the second update rule, which represents the marginalization of the weighted average value of the function, over such extensions. The average is weighted by the probability of each such extension; it is here that the approximation underlying BP comes in. Specifically, we assume zero correlation between the other variables in fa's extension and construct the probability of any given extension as the product of variable-to-factor messages.

One perspective on belief propagation is that it treats arbitrary factor graphs as if they were trees, and not merely in the sense that it is correct on trees and approximate on graphs with cycles. In particular, we judge the marginal probability of a variable holding a particular value in proportion to the weighted average of its functions' valuations when instantiated by extensions that contain this assignment. To determine the weight of each such extension, we actually do consult the distributions of the other variables in a function's signature, rather than weighting all extensions uniformly as in mean field methods. But, to do so, we just multiply together the estimated marginal probabilities for the other variable assignments comprising the extension. In so doing, we do not represent correlations between the settings of these other variables. For a tree this is unnecessary because such variables are separated by the current variable's node; in the presence of cycles this is not necessarily the case. In short, we only enforce consistency between marginal distributions over function extensions (i.e. over dual variables) and marginal distributions over individual (primal) variables. The extension marginal distribution must itself marginalize down to each individual variable's marginal, but need not do so jointly in capturing correlations between these individual variables. In Chapter 3 we will see that the analogous constraint satisfaction condition is (generalized) arc-consistency.
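A direct transcription of the Figure 2.1 updates is short enough to state in full. The sketch below runs them on a hypothetical chain of two factors (fa over x0 and x1, fb over x1 and x2), which is a tree, so loopy BP is exact here; the final loop reintroduces all incoming messages and normalizes, as described above. All names and tables are illustrative:

    import numpy as np
    from itertools import product

    factors = {'fa': ((0, 1), np.array([[3.0, 1.0], [1.0, 1.0]])),
               'fb': ((1, 2), np.array([[1.0, 1.0], [1.0, 3.0]]))}
    variables = {0, 1, 2}
    nbrs = {i: [a for a, (scope, _) in factors.items() if i in scope]
            for i in variables}
    m_vf = {(i, a): np.ones(2) for a in factors for i in factors[a][0]}
    m_fv = {(a, i): np.ones(2) for a in factors for i in factors[a][0]}

    for _ in range(10):
        for (i, a) in m_vf:     # variable-to-function: product over other factors
            msg = np.ones(2)
            for b in nbrs[i]:
                if b != a:
                    msg = msg * m_fv[(b, i)]
            m_vf[(i, a)] = msg / msg.sum()
        for (a, i) in m_fv:     # function-to-variable: weighted marginalization
            scope, table = factors[a]
            msg = np.zeros(2)
            for za in product(range(2), repeat=len(scope)):
                w = table[za]
                for j, xj in zip(scope, za):
                    if j != i:
                        w = w * m_vf[(j, a)][xj]
                msg[za[scope.index(i)]] += w
            m_fv[(a, i)] = msg / msg.sum()

    for i in variables:   # beliefs: product of incoming messages, normalized
        b = np.ones(2)
        for a in nbrs[i]:
            b = b * m_fv[(a, i)]
        print(i, b / b.sum())   # exact marginals: [.625 .375], [.5 .5], [.375 .625]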

2.1.4 Kikuchi Free Energy Approximation

Given the node-merging operation already elaborated in Section 1.2.2, it is easy to describe a further refinement of the Bethe free energy, and a correspondingly more accurate but more costly algorithm. In particular, the Kikuchi free energy is built from marginal distributions over clusters of functions and clusters of variables, as opposed to single functions and variables [92]. The resulting approximation is more accurate in representing correlations among functions; if the choice of clusters is sufficiently extensive, the approximation is exact. However, the number of values that can be held by a cluster of variables or functions grows exponentially with its size, hence the greater complexity of optimization. In short, a given Kikuchi approximation to a factor graph is equivalent to the Bethe approximation for a transformation of the graph that consists of some corresponding set of node-merging operations. The associated optimization methods, the most prominent being “generalized belief propagation,” can be derived by applying the belief propagation rules to this transformed graph, and have proved useful for a number of tasks, particularly coding [152, 113, 145].

2.2 Other Methods as Optimization: Approximating the Marginal Polytope

The examples to this point exhibit a general three-step process for estimating marginal probabilities from a factor graph. First, we express the KL-divergence between a hypothesized joint distribution q(·) and the true distribution p(·) encoded by the factor graph, creating the Gibbs free energy. Secondly, we introduce a restricted encoding for q(·) that represents the single-variable marginals directly, or else some other quantities from which such marginals can be derived. Finally, we substitute the encoding into the Gibbs free energy and optimize the resulting expression to find minimizing values for the represented quantities. Based on the sophistication of the chosen encoding, the resulting marginal estimates can be more or less accurate, and the appropriate optimization methods can be more or less efficient. (Mean field methods are convergent, but only to local minima of their associated free energy, while belief propagation may or may not converge, also to local minima of the Bethe free energy.)

From this perspective it is clear that alternative formulations are possible. Indeed, an active line of research has been to characterize the space of probabilistic inference methods in terms of the “marginal polytope” of a given factor graph [138, 143]. For the purposes of this presentation, we can view the polytope as enclosing the space of surveys (i.e. sets of marginal distributions) that are realizable for a factor graph with a given graph structure, irrespective of the specific functions associated with the graph. (In particular, we do not have to represent distributions from the full exponential family, and can substitute “mean parameters” for “expected values of sufficient statistics.” For the discrete domains that we will be working with, the sufficient statistics are simply indicators for whether each variable holds each of its possible values. Thus, the polytope defines a space of surveys.)

The marginal probabilities for the variables in a factor graph comprise a set that must fall within this polytope; the inference task means finding a maximally likely such survey with respect to the functions in the factor graph. In other words, the marginal polytope constrains the space of possible solutions to the marginal computation task; when we wish to minimize the Gibbs free energy in Eq. 2.1, our chosen distribution q(·) must marginalize to a set of distributions within this space, in order to correctly capture the structure of the graph. The problem is that the number of facets to such a polytope grows exponentially with both the size and the width of a factor graph. So, the significance of the marginal polytope is conceptual, in terms of its relationship to the q(·) distribution decomposition chosen during the second step of the variational process described above.

For illustration, Figure 2.2 reproduces a drawing from [144] that depicts the relationship between the marginal polytope and the surrogates introduced by the mean field and Bethe approximations, at a highly notional level.

[Figure drawing omitted: two panels labeled “Mean Field Approximation” and “Bethe Approximation”; legend: marginal polytope of G, (extreme) realizable surveys, (extreme) pseudo-surveys.]

Figure 2.2: Notional representation of marginal polytope and two approximations.

In the figure, the true marginal polytope for a given graph structure appears as a gray shaded area; any point within this area represents a possible survey that would be consistent with this structure. The vertices of the polytope represent surveys wherein every variable is subject to a marginal distribution that puts all weight on a single value; the faces represent linear combinations of these extreme distributions.

The most relevant aspect to note for our purposes is that mean field methods use an inner approximation to the marginal polytope. The figure depicts this as a non-convex space situated entirely within the true marginal polytope. The zero-correlation condition that the joint distribution should perfectly factorize into individual marginal distributions is certainly sufficient for the feasibility of the resulting survey, but it is not necessary. Thus, there are possible surveys that cannot be represented by the mean field approximation because they include correlations between variables.

In the case of the Bethe free energy, the condition that marginal probabilities over extensions must themselves marginalize to individual variable marginals defines a strictly larger space of surveys that fully contains the true marginal polytope. Again, the marginalization condition only operates in a pairwise fashion between function extensions and constituent variables: two individual functions may marginalize correctly over some shared variable, but if they are connected by additional paths through the factor graph then there can be additional correlations that are not represented by this marginalization process. In other words, pairwise consistency is necessary, but not sufficient, for surveys to belong to the relative interior of the marginal polytope. Accordingly, Figure 2.2 illustrates a space of pseudo-surveys surrounding the marginal polytope, representing collections of marginal distributions that are pairwise consistent, but that do not correspond to any survey that is realizable on a factor graph with the given structure. Such “pseudo-marginals” are part of the inaccuracy that can complicate the use of belief propagation, in addition to its attraction to local rather than global optima, and its lack of guaranteed convergence (in the absence of damping or other artificial means).

Recent research has exploited the marginal polytope conceptualization to develop improved variational methods for probabilistic inference [143, 138]. The general approach of such research is to introduce new constraints on the decomposition of the q(·) approximator to the true joint probability, including the generation of new constraints during the inference process itself (much as clause learning introduces new constraints through the execution of a modern SAT solver, as described in Chapter 3). The hope is that such new constraints can exploit the special structure inherent to a given problem instance or domain. To the extent that such research abstracts away from the specific constraints underlying various variational methods, it is similar in spirit to the “EMBP” method presented later in this document. However, EMBP is derived by a different route involving the “expectation maximization” algorithm. On the other hand, some may be surprised to learn that the two paths have similar origins, and similar end results, as expectation maximization itself has an energy-minimization interpretation based on KL-divergence.

Summary. For our own purposes the key insight is that the outputs of approximate marginal estimation techniques do not have to be understood exclusively by operational semantics, that is, as the fixed points of specific update rules. Instead, we can take a functional perspective and view a given estimation technique as generating a local optimum with respect to a given well-defined model. The model itself represents an approximation to the marginal polytope, and it is the nature of this approximation that characterizes a particular marginal estimation technique. Such an understanding paves the way for two contributions in Chapter 5: a formal statement of marginal estimation itself as a linear relaxation of constraint satisfaction, and the “EMBP” framework for bias estimation. The next step, though, is to consider basic concepts from constraint satisfaction from the perspective of the factor graph formalism, in order to facilitate such an integration.

Chapter 3

Solving Constraint Satisfaction Problems

In this chapter we turn from the numerical perspective of probabilistic reasoning and consider the symbolically-oriented concepts that underlie constraint satisfaction. At the same time, by retaining factor graphs as a basic unifying representation, we will observe a number of analogues between such continuous and discrete forms of optimization. Probabilistic and constraint reasoning are more closely related than it might otherwise appear, considering the present-day divergence between research communities like machine learning and constraint programming.

Recall that our overall research interest is to compute marginal probabilities, i.e. biases, over the variables in a constraint problem. Thus by casting such problems in terms of factor graphs, we show that we can use any existing marginal computation method on such problems.

Further, we show that standard solution methods for constraint satisfaction can often be cast as hardened versions of the same techniques. Eventually, such insights allow the creation of the methods appearing in Parts II and III.

So, in this chapter we first define the constraint satisfaction problem while employing a straightforward factor graph encoding, and then state the basic backtracking search method for solving such problems. By applying our graphical formalism, we can expand from the classical origins of search and exhibit connections with probabilistic reasoning. (Essentially, search corresponds to conditioning and the ‘⊕’ operator in a semiring, while inference techniques are the analogues of averaging and ‘⊗’.)

In this chapter, we will focus on constraint satisfaction, rather than constraint optimization problems–the latter are a straightforward variant that will be considered in detail in Part III.

Other than bounding methods for branch-and-bound search, the majority of techniques used in constraint optimization are adapted from constraint satisfaction.

3.1 Factor Graph Representation of Constraint Satisfaction

We begin by defining the general constraint satisfaction problem in terms of factor graphs, then proceed to illustrate the formulation by a pair of example problem classes.

Definition 12 (Constraint Satisfaction Problem). A constraint satisfaction problem (“CSP”) corresponds to a factor graph G = (X, F) defined as before in Section 1.1, with the additional requirement that each factor, a.k.a. constraint, must be a 0/1-function. (That is, every function is onto {0, 1}.) To solve the problem, we must find a configuration ~x of the variables that evaluates to 1: W_G(~x) = 1. Such a solution (a.k.a. “model”) satisfies the problem as a whole; an arbitrary (possibly partial) assignment π satisfies a given constraint fa iff f_a(\pi|_{\sigma_a}) = 1. We will denote the set of solutions to a constraint satisfaction problem G as SOLS(G). The decision version of a CSP is to determine whether G is unsatisfiable (denoted “G ⊢ ⊥”), meaning that SOLS(G) = ∅. Finally, it can be useful to assume a predicate interpretation of CSPs whereby configurations represent states of the world, and a CSP interprets such states as true (1) or false (0).

In other words, our factor graph defines a satisfying configuration as one that meets all the requirements represented by some set of local constraints; such constraints are designed to return 1 for satisfaction, and 0 for dissatisfaction. Only by satisfying all of the constraints at once can a configuration achieve a weight of 1 and thereby solve the problem: already we see that multiplication and logical conjunction play similar roles under the semiring formalism first referenced in Section 1.1.3. Finding a solution or proving unsatisfiability for a constraint satisfaction problem are both NP-complete in complexity; in fact such problems are the original foundation for the very theory of NP-completeness [34].

Interpreted probabilistically, a constraint satisfaction factor graph G defines a uniform distribution over its set of solutions, assuming that the problem is satisfiable (i.e., the set is not empty):

 1  0 : ~x/∈ SOLS(G) p(~x) = WG(~x) = (3.1) N  1  |SOLS(G)| : ~x ∈ SOLS(G) Recall that if we were to somehow solve the MPE problem defined in Section 1.4, then any maximally likely configuration would have weight 1, and constitute a solution to the CSP.

Indeed, the solution method for MPE would most likely involve a weighted analogue of the search techniques presented shortly. However, in our main line of research, we will not ultimately query a CSP directly for a solution configuration. Rather, we will treat it as a uniform joint distribution over all solutions, and compute its underlying marginal probabilities. Such biases are then used to find a solution within a modern constraint satisfaction solver. In other words, a variable's bias represents the probability of finding a particular variable assignment if we were to somehow draw a sample uniformly at random from the solution set of a CSP:

θi(vj) ≡ p(~x(xi) = vj | ~x ∈ SOLS(G)) (3.2)

This is remarkable in that we can implicitly (and approximately) sample from a set for which we have no explicit representation; if we could actually construct the set SOLS(G) for direct sampling, we would have already gone far beyond solving the constraint satisfaction problem.

With our eventual goals in mind, we can now consider two example classes of CSPs, beginning with the most important in terms of both historical development and fundamental generality: Boolean satisfiability.

Definition 13 (Boolean Satisfiability). A general Boolean satisfiability (“SAT”) problem is a constraint satisfaction problem where all variables have the boolean domain D = {0, 1}. In this chapter, the value 0 is used interchangeably with the term “negative” and the symbol ‘−’, while 1 can be associated with “positive” and depicted ‘+’. A boolean satisfiability problem can also be called a “theory”, as suggested by its interpretation as a set of rules constraining any model of some system. Interpreted as predicate functions that map configurations to 0 or 1, SAT theories can also be represented as propositional Boolean formulas; we will typically use the symbol ‘Φ’ to represent such a formula.

Important research has sought to expand the scope of practical SAT-solving to incorporate constraints expressed as arbitrary Boolean formulas [57]. The methods presented throughout this document are general and may be applied, with some modification, to such formulations. However, to date the vast majority of theoretical and practical results for Boolean satisfiability have been based on a single canonical form; an arbitrary problem can be methodically converted to this “conjunctive normal form” by straightforward means (possibly with blow-up in the size of the problem representation).

Definition 14 (CNF, k-SAT, Clause-to-Variable Ratio). A SAT problem in conjunctive normal form (“CNF”) represents a conjunction of disjunctions of literals; a literal represents either a variable or its negation. That is, the factor graph for such a problem has a single, fixed format for its constraints: a “clause” representing a disjunction of literals. A k-SAT problem further restricts such clauses to each disjoin exactly k literals. An important property of any problem in CNF is its clause-to-variable ratio α = m/n, representing, intuitively, the constrainedness of a problem; when it is high there are probably few or no solutions, and when it is low the solutions are likely numerous and easy to find. We will see in Chapter 4 that this is especially relevant to problems with random structure.

Less formally, we will speak of a variable xi's “positive” and “negative” literals, xi and ¬xi, respectively. Further, in a given theory, a variable's “positive clauses” C_i^+ are those in which its positive literal appears as a disjunct, and likewise for its “negative clauses” C_i^- and its negative literal. Similarly, V_c^+ denotes the set of variables that appear in clause c as a positive literal, and likewise for V_c^- and variables appearing as negative literals.

Φ = C1 ∧ · · · ∧ C8

C1 = ( x1 ∨ x2 ∨ ¬x3)      C5 = ( x1 ∨ ¬x3 ∨ x5)
C2 = (¬x1 ∨ ¬x2 ∨ ¬x4)     C6 = ( x1 ∨ ¬x4 ∨ x5)
C3 = ( x1 ∨ ¬x2 ∨ ¬x5)     C7 = ( x2 ∨ x4 ∨ x5)
C4 = (¬x1 ∨ x3 ∨ ¬x4)      C8 = (¬x3 ∨ x4 ∨ ¬x5)

n = 5, m = 8, α = 1.6

[Factor graph drawing omitted: variables x1–x5 connected to factors f1–f8, with a solid edge where a variable appears positively in a clause and a broken edge where it appears negatively.]

Figure 3.1: Example 3-SAT Problem: as Factor Graph and as CNF Theory.

Example 12 (Depicting a 3-SAT Problem). Figure 3.1 shows the factor graph and formulaic representations of an example 3-SAT problem. While a general factor graph can only depict a factor's scope, via graph connectivity, here we can additionally convey the exact structure of each constraint by drawing positive and negative edges as solid and broken lines, respectively. A positive edge indicates that a clause contains a variable's positive literal, and likewise for negative edges and literals; thus, for example, we see that factor f1 realizes the disjunction x1 ∨ x2 ∨ ¬x3. Accordingly, the formulaic representation Φ consists of m = 8 disjunctions. Conjoining such clauses constructs the overall formula over n = 5 variables. Thus the clause-to-variable ratio α is 8/5 = 1.6.

The example problem has nine solutions, as listed in Figure 3.2. From this table we can calculate exact marginals by tallying entries along each column: for instance x1 is set positively in six out of the nine solutions, and negatively in the remaining three.

  x1 x2 x3 x4 x5
⟨ 0, 0, 0, 0, 1 ⟩
⟨ 0, 0, 0, 1, 1 ⟩        θ1(1) = 6/9, θ1(0) = 3/9
⟨ 0, 1, 0, 0, 0 ⟩        θ2(1) = 4/9, θ2(0) = 5/9
⟨ 1, 0, 0, 0, 1 ⟩        θ3(1) = 3/9, θ3(0) = 6/9
⟨ 1, 0, 1, 1, 0 ⟩        θ4(1) = 3/9, θ4(0) = 6/9
⟨ 1, 0, 1, 1, 1 ⟩        θ5(1) = 5/9, θ5(0) = 4/9
⟨ 1, 1, 0, 0, 0 ⟩
⟨ 1, 1, 0, 0, 1 ⟩
⟨ 1, 1, 1, 0, 0 ⟩

Figure 3.2: Solutions to Example 3-SAT Problem, and Resulting Biases.

As an aside, note that we could also have tried to approximate the biases in Figure 3.2 by simply considering the number of positive and negative clauses for each variable. For instance, in Figure 3.1, x1 appears as a positive literal in four out of the six clauses in its neighborhood, and sure enough its true positive bias is 2/3. In fact, we will see that many traditional SAT-solving heuristics are based implicitly or explicitly on these kinds of measurements, and this particular estimate, θi(+) ≈ |C_i^+| / (|C_i^+| + |C_i^-|), will serve as an experimental control when assessing the probabilistic methods developed in future sections.¹
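Since the instance is tiny, both the exact biases of Figure 3.2 and this clause-counting control can be verified by brute force. A short Python sketch follows (the encoding of literals as (index, sign) pairs is an illustrative choice):

    from itertools import product

    clauses = [[(1, 1), (2, 1), (3, 0)], [(1, 0), (2, 0), (4, 0)],
               [(1, 1), (2, 0), (5, 0)], [(1, 0), (3, 1), (4, 0)],
               [(1, 1), (3, 0), (5, 1)], [(1, 1), (4, 0), (5, 1)],
               [(2, 1), (4, 1), (5, 1)], [(3, 0), (4, 1), (5, 0)]]

    sols = [x for x in product([0, 1], repeat=5)
            if all(any(x[i - 1] == s for i, s in c) for c in clauses)]
    print(len(sols))                              # 9 solutions, as in Figure 3.2
    for i in range(1, 6):                         # tally each column of the table
        print(f"theta_{i}(1) = {sum(x[i - 1] for x in sols)}/{len(sols)}")
    # theta_1(1) = 6/9, theta_2(1) = 4/9, theta_3(1) = 3/9, ...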

Each time a probabilistic method outperforms an existing heuristic technique on any particular measure, then, it represents a success for the approximate over the heuristic. The methods we develop can give inaccurate results, but they do so in approximating a precise probabilistic quantity with purposeful semantics on a well-defined model. So during experimentation, for example, we are able to benefit by measuring the quality of approximation in isolation from the actual usefulness of the targeted quantity. In contrast, the heuristics that we will supplement or replace are approximate in a different sense: either we consider them to be exact computations for quantities of unknown formal motivation that cannot be defined except post hoc, or alternatively we believe that they can be successful only insofar as they happen to implicitly approximate some other even more ill-specified quantity. Thus the development of heuristics has historically been trial-and-error, while the research path presented herein allows us to explicitly measure various improvements in more accurately estimating a well-defined objective (variable bias), whose own effectiveness can be theoretically justified and experimentally validated as well. And in so doing we can draw on the wide array of existing and novel techniques discussed in Section 1.

¹ In fact, from the perspective of the variational approximations considered in Section 2.1, the crude approach of using the ratio of positive to negative clauses is in fact an instance of the mean field approximation. That is, we bias a variable by averaging the influence of its surrounding clauses, without considering the influence of any other variables that appear in such clauses.

For simplicity, future sections will focus mostly on Boolean satisfiability problems in conjunctive normal form. As mentioned, though, the same general ideas can be applied to arbitrary SAT problems, and they have also proved successful for non-Boolean constraint satisfaction problems [78, 102] like the quasigroup-with-holes problem [71], defined below in two stages.

Definition 15 (Latin Square). A Latin square of order d is represented by a d × d array of “cells.” Each cell takes a value from D = {1, . . . , d} such that no value occurs twice in any row or column. (The array denotes the multiplication table for an algebraic quasigroup.)

Definition 16 (Quasigroup with Holes Problem, Hole-to-Cell Ratio). The (balanced) quasigroup-with-holes (“QWH”) problem represents a Latin square that has had a given percentage of entries erased at random by a process detailed in [91]. To solve a problem means finding valid values for the erased (“hole”) cells. The remaining cells are considered “fixed.” A useful measure for the constrainedness of a QWH problem is the hole-to-cell ratio representing the percentage of its entries that have been erased to form holes.

We can represent QWH as a constraint satisfaction problem by associating a variable xr,c of domain D = {1, . . . , d} with each cell, where r is the row and c is the column of the cell. Leaving the holes aside, we constrain each fixed cell xr,c via a single unary constraint fr,c that evaluates to 1 if the variable holds the value to which the cell is fixed, and 0 otherwise. Finally we have 2d “alldiff” [131] constraints, one over each row and column. Such a constraint evaluates to 1 if none of its arguments are equal, and 0 otherwise.

Algebraic structures like quasigroups are closely studied in artificial intelligence because their underlying combinatorics capture the essence of realistic problems like scheduling, error-correcting coding, and resource allocation. Recently, EMBP-based methods were shown to be state-of-the-art when applied to QWH and other similar problems [102]. (Instrumental in such successes was an independently-developed version of global consistency that plays a similar role to that of the version that will be applied in Chapter 6.)

[Figure drawing omitted. Panels: (a) Latin Square; (b) QWH Problem; (c) Two (Unique) Solutions to the QWH Problem; (d) Factor Graph Representation of QWH Problem, with cell variables x1,1 through x5,5 connected to row constraints row1–row5, column constraints col1–col5, and unary factors for the fixed cells.]

Figure 3.3: Example QWH Problem: as Latin Square with Erasures, and as Factor Graph.

Example 13 (Depicting a QWH Problem). Figure 3.3 shows the cellular and factor graph representations of an example QWH problem. We begin in Fig. 3.3(a) with a particular 5 × 5 Latin square, and create the QWH problem in Fig. 3.3(b) by randomly erasing 17 of the entries. There are exactly two ways to fill in the holes and re-create a valid Latin square; these two solutions are shown in Fig. 3.3(c). Finally, Fig. 3.3(d) presents the factor graph encoding for this CSP. An array of variables represents the cells of the Latin square, while five row and five column constraints ensure that all entries in a row or column take on different values. Thus, for instance:

row_1(x_{1,1}, x_{1,2}, x_{1,3}, x_{1,4}, x_{1,5}) \;=\; \begin{cases} 1 & \text{if } \forall\, 1 \leq i < j \leq 5,\ x_{1,i} \neq x_{1,j}; \\ 0 & \text{otherwise.} \end{cases}

Finally, a unary function fr,c constrains each of the problem's fixed cells; in the figure these appear as smaller factors labeled only by the values to which the cells are fixed. So the neighborhood of variable x5,4, for instance, includes not only row5 and col4, but also f5,4, where f5,4(x5,4) = 1 if x5,4 = 3 and 0 otherwise. As there are only two satisfying configurations to this factor graph, it is easy to state the variables' exact bias distributions. For instance, θ5,1(·) gives probability 1 to the value 4, and 0 to all other values, while θ4,5(·) gives probability 1/2 to both 2 and 3, and 0 to all others.
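To make the 0/1 semantics of these constraints concrete, here is a minimal Python sketch of an “alldiff” factor, applied to a hypothetical order-3 example rather than the order-5 instance of Figure 3.3:

    def alldiff(*vals):
        # 1 iff no two arguments are equal, 0 otherwise
        return int(len(set(vals)) == len(vals))

    # an order-3 square with a single hole at row 0, column 2:
    square = [[0, 1, None],
              [1, 2, 0],
              [2, 0, 1]]
    for v in range(3):
        square[0][2] = v
        rows_ok = all(alldiff(*row) for row in square)
        cols_ok = all(alldiff(*col) for col in zip(*square))
        print(v, rows_ok and cols_ok)   # only v = 2 satisfies every constraint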

3.2 Complete Solution Principles for Constraint Satisfaction

We now consider the general principles of search and inference, which comprise the core of complete solution methods for constraint satisfaction problems. Upon presenting these principles we can demonstrate their connections to probabilistic reasoning on arbitrary factor graphs. For motivation, if we interpret a CSP as a predicate representing the truth of its satisfying assignments, there are two fundamental facts that we can operationalize to find such a solution. The first is a disjunction over all |D|^n configurations of the variables, plus unsatisfiability (‘⊥’): (at least) one of the configurations must be true, or the CSP must be unsatisfiable. The second is a conjunction over all m constraints: if we interpret an output of ‘1’ as truth, we know that all factors must be true for any solution to hold. In the language of the algebraic semiring, where computing marginals on an arbitrary factor graph is “sum-product” and computing MPE is “max-product”, we can call the task of deciding a CSP factor graph the “disjunction-conjunction” problem. (In practice one is unlikely to encounter a non-constructive decision procedure for CSPs: most conceivable procedures for deciding CSPs would also produce a solution or a proof of unsatisfiability.) In other words, we now wish to evaluate the same type of expression as in Section 1, but with different operators:

\sum_{\vec{x}} \prod_{f_a \in F} f_a(\vec{x}|_{\sigma_a}) \;\;\xrightarrow{\;\text{becomes}\;}\;\; \bigvee_{\vec{x}} \bigwedge_{f_a \in F} f_a(\vec{x}|_{\sigma_a}) \qquad (3.3)

Accordingly, search is simply a system for iterating through the disjunction of possible configurations, rotely at first, or by eliminating related configurations in blocks when integrating backtracking, inference, and other more advanced methods. Likewise, inference combines conjoined constraints into fewer, more powerful ones; in the limit the conjunction of constraints completely specifies the predicate form of a CSP, so it must be possible (if expensive) to combine them into an explicit listing of its models. By treating search as ⊕ and inference as ⊗, we can view the former as a conditioning operation in correspondence with those of probabilistic methods, while the latter corresponds to the sort of elimination that was previously realized by summary and other forms of averaging.

3.2.1 Search

Backtracking search can be roughly understood as “trying all configurations” depth-first, via a recursively-generated tree of individual variable assignments. To that end, Algorithm 6 builds up a set of assignments to individual variables in a CSP until reaching a contradiction with some constraint, at which point it backtracks and tries a new value for the most recently-assigned variable. If all values are exhausted, it backtracks to try a new value for the previously assigned variable. If the algorithm ever manages to reach a complete configuration without violating any constraints, it reports this solution. Otherwise, on exhausting all possible backtracks, it reports unsatisfiability from the top level of search. From its recursive design it follows that each call to RECURSIVE-BACKTRACKING is like solving a new CSP, one where all assigned variables have already been fixed according to the current branch of our search tree. Only after conditioning a CSP on each of the top-level variable's values, and solving the |D| resulting CSPs, can we be sure of resolving the decision version of the original problem.

Algorithm 6: Backtracking Search
Data: constraint satisfaction problem G = (X, F).
Result: solution, or ‘⊥’ (“unsatisfiable”).
1 return RECURSIVE-BACKTRACKING(∅, G).
2 subroutine RECURSIVE-BACKTRACKING(assignments, csp)
3 begin
4     if (assignments violate csp) then return ‘⊥’.
5     if (|assignments| = n) then return assignments.
6     var ← CHOOSE-VARIABLE(assignments, csp).
7     foreach val in ORDER-VALUES(var, assignments, csp) do
8         result ← RECURSIVE-BACKTRACKING(assignments ∪ {var = val}, csp).
9         if result = ‘⊥’ then continue else return result.
10    end
11    return ‘⊥’.
12 end
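An executable rendering of Algorithm 6 takes only a few lines of Python. In this sketch the CSP representation, and the naive lexicographic orderings standing in for CHOOSE-VARIABLE and ORDER-VALUES, are illustrative assumptions:

    def recursive_backtracking(assignments, domains, constraints):
        # Line 4: a fully-instantiated constraint evaluating to 0 is a violation
        if any(all(v in assignments for v in scope)
               and not pred({v: assignments[v] for v in scope})
               for scope, pred in constraints):
            return None                                          # '⊥'
        if len(assignments) == len(domains):                     # Line 5
            return assignments
        var = next(v for v in domains if v not in assignments)   # Line 6
        for val in domains[var]:                                 # Line 7
            result = recursive_backtracking({**assignments, var: val},
                                            domains, constraints)
            if result is not None:                               # Line 9
                return result
        return None                                              # Line 11: '⊥'

    # e.g., 2-coloring a triangle is unsatisfiable:
    doms = {v: [0, 1] for v in 'abc'}
    cons = [(('a', 'b'), lambda s: s['a'] != s['b']),
            (('b', 'c'), lambda s: s['b'] != s['c']),
            (('a', 'c'), lambda s: s['a'] != s['c'])]
    print(recursive_backtracking({}, doms, cons))   # None, i.e. '⊥'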

All state-of-the-art complete CSP solvers employ this basic search format. To some extent, they can succeed or fail according to the cleverness of their data structures and other implementational details, but at an algorithmic level there are three critical areas for designing a state-of-the-art solver:

1. Variable- and value-ordering decisions, to be discussed presently.

2. Inference methods that propagate the consequences of decisions over constraints.

3. Advanced techniques for analyzing conflicts to infer new constraints, and for periodically restarting a search under new orderings.

In Algorithm 6, the function calls for ordering variables (Line 6) and values (Line 7) are the main mechanism for adding intelligence to backtracking search. When considering the marginal computation (sum-product) problem, we have already seen that the order in which we condition on variables can have a drastic effect on the complexity of our overall computation, as bounded below by a problem's tree-width. By the correspondence to constraint satisfaction in Eq. (3.3), then, it is clear that our goals in distributing disjunctions over conjunctions include the same factorization goals we had when pushing summations to the right in Example 7, or ordering our elimination in Algorithm 3. Namely, such operations help disentangle a graph into simpler components that can be solved more or less disjointly according to the resulting width; similarly, choosing a variable at Line 6 produces |D| new subproblems to be solved recursively.

Such decomposition goals focus on the graph structure of a problem; when specializing the general factor graph framework to the case of CSPs, there are additional types of problem structure that arise from the 0/1 nature of constraints [95]. Firstly, setting a particular variable can force the values of remaining variables. So, pushing a disjunction over a specific variable assignment can not only factorize the weighting function W_G for a CSP, it can also eliminate other disjuncts from consideration by making their factors 0. Another way of putting it is that the conjunction in Eq. (3.3) takes the min of its values; there is no need to extend a chain of assignments that violates a constraint. Similarly, the equation's outer disjunction is like a max over configurations; if we can find a solution then there is obviously no need to examine all the remaining configurations as when summing weights to form a marginal. This perspective yields a second, presumably intuitive goal for exploiting problem structure: we would like to find a solution, if one exists, as early as possible in our iteration through disjoined configurations.

In light of these three types of problem structure, variable- and value-ordering heuristics tend to pursue some combination of three goals in confronting the combinatorial challenges of constraint satisfaction: to decompose a problem in terms of graph structure, to trigger inferences that prune out large portions of the search space automatically, or to pursue a branch of search that will likely lead to a solution. In fact the latter two goals can lead to divergent heuristic strategies; “fail-first” strategies pursue a broad but shallow search tree where we backtrack often but early, while “succeed-first” strategies select branches that appear more likely to succeed, creating a deep but narrow search [16].

Example 14 (Variable- and Value-Ordering for SAT Search). In combination with appropriate inference procedures, Algorithm 6 constitutes the Davis-Putnam-Logemann-Loveland (“DPLL”) method when applied to SAT problems [40]. If we consider a default lexicographic ordering for variables and values (for values, ‘0’ is attempted before ‘1’), we can observe how DPLL exploits graph structure via the same sort of factoring operations as in Example 7. This results in simplified recursive calls to RECURSIVE-BACKTRACKING, as demonstrated below for the example SAT problem in Figure 3.1:

C1 = ( x1 ∨ x2 ∨ ¬x3)      C5 = ( x1 ∨ ¬x3 ∨ x5)
C2 = (¬x1 ∨ ¬x2 ∨ ¬x4)     C6 = ( x1 ∨ ¬x4 ∨ x5)
C3 = ( x1 ∨ ¬x2 ∨ ¬x5)     C7 = ( x2 ∨ x4 ∨ x5)
C4 = (¬x1 ∨ x3 ∨ ¬x4)      C8 = (¬x3 ∨ x4 ∨ ¬x5)

\bigvee_{x_1}\bigvee_{x_2}\bigvee_{x_3}\bigvee_{x_4}\bigvee_{x_5} \bigwedge_{a=1}^{8} C_a \;=\; \bigvee_{x_1}\bigvee_{x_2}\bigvee_{x_3} \Bigl( C_1 \wedge \bigvee_{x_4} \bigl( C_2 \wedge C_4 \wedge \bigvee_{x_5} ( C_3 \wedge C_5 \wedge C_6 \wedge C_7 \wedge C_8 ) \bigr) \Bigr) \qquad (3.4)

For instance, when operationalizing the three outer disjunctions of (3.4) we are able to backtrack immediately should we violate C1, without considering the settings of x4 and x5; otherwise, our recursive calls proceed with x1, x2, and x3 fixed to values that satisfy C1, and we can drop this clause from consideration. In other words the concept of backtracking is fundamentally a factoring process over the algebraic semiring, as follows from a given variable ordering and the annihilation/idempotence of 0/1. With regard to the latter, on presenting backtracking search we have claimed that ordering can capture additional sorts of structure that go beyond graph connectivity. As an example, suppose we attempt a succeed-first approach by first assigning 0 to x4:

= \bigvee_{x_1}\bigvee_{x_2}\bigvee_{x_3}\bigvee_{x_5} \bigwedge_{a=1}^{8} \bigl(C_a : \{x_4 = 0\}\bigr) \;\vee\; \bigvee_{x_1}\bigvee_{x_2}\bigvee_{x_3}\bigvee_{x_5} \bigwedge_{a=1}^{8} \bigl(C_a : \{x_4 = 1\}\bigr) \qquad (3.5)

Here the notation indicates that we will perform recursive calls where the constraints are modified to reflect that x4 has been assigned a particular value. By the structure of our algorithm, we will explore the first parenthesized disjunction in its entirety before considering the second. Suppose our ordering heuristics next direct us to assign 1 to x1, resulting in a new subproblem that will be solved in its entirety before examining the other three possible assignments to {x1, x4}. This first subproblem is represented below with its constraints modified or dropped to reflect the assignments:

\bigvee_{x_2}\bigvee_{x_3}\bigvee_{x_5} \bigwedge_{a=1}^{8} \bigl(C_a : \{x_4 = 0,\ x_1 = 1\}\bigr)
\;=\; \bigvee_{x_2}\bigvee_{x_3}\bigvee_{x_5} \bigl(C_7 : \{x_4 = 0,\ x_1 = 1\}\bigr) \wedge \bigl(C_8 : \{x_4 = 0,\ x_1 = 1\}\bigr) \qquad (3.6)
\;=\; \bigvee_{x_2}\bigvee_{x_3}\bigvee_{x_5} (x_2 \vee x_5) \wedge (\neg x_3 \vee \neg x_5)

By the idempotence of 1, we have eliminated clauses that are already satisfied by our assignments: C2 and C4 by x4 = 0, C1, C3, and C5 by x1 = 1, and C6 by both; for the remaining clauses, we project over the assignments to produce two simplified constraints. The overall point is that by fortuitous choice of assignments, we have created a greatly simplified subproblem within which we are very likely to find an assignment, and quickly. The same cannot be said for all the other three subproblems that we have pushed to the bottom of the recursive activation stack, nor for other possible orderings of the variables and their values.
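The projection and elimination steps above are mechanical enough to state directly. A small Python sketch (with clauses encoded as lists of signed integers, an illustrative convention) reproduces the reduction of Eq. (3.6):

    def condition(cnf, assignment):
        # drop satisfied clauses; delete falsified literals from the rest
        out = []
        for clause in cnf:
            if any(abs(l) in assignment and (l > 0) == bool(assignment[abs(l)])
                   for l in clause):
                continue              # idempotence of 1: clause already satisfied
            out.append([l for l in clause if abs(l) not in assignment])
        return out

    phi = [[1, 2, -3], [-1, -2, -4], [1, -2, -5], [-1, 3, -4],
           [1, -3, 5], [1, -4, 5], [2, 4, 5], [-3, 4, -5]]
    print(condition(phi, {4: 0, 1: 1}))   # [[2, 5], [-3, -5]], matching Eq. (3.6)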

Presumably, the idea that we would like to set the most important variables early in Algorithm 6, to values that satisfy the most clauses, is unsurprising and simple to intuit. In expressing the idea in terms of factoring, our goal is to unify the algorithmic with the mathematical: whenever BACKTRACKING SEARCH can evade the exponential complexity of constraint satisfaction in operation, there is a corresponding syntactic representation using equations as in Example 14. The efficiencies that are symbolized by such representations in turn translate into the algebraic properties of associativity, idempotence, and annihilation. On translating such algebraic abstractions to the marginal computation problem, then, we will eventually be able to view the very process of computing a survey as a search and inference process in a relaxed space of fractional variable assignments.

Arguably there is nothing profound in distributing multiplication over addition, or in multiplying by one and zero; but ultimately, expressing search in terms of these formal concepts will allow the unified account of mainstream probabilistic and combinatorial reasoning in Chapter 5. To that end, we provide a second, even more obvious example of sensible ordering, again in order to exhibit its algebraic characterization.

Example 15 (Variable- and Value-Ordering for QWH Search). Any reasonable search method for QWH would first assign each variable representing a fixed cell with its single allowable value; in fact most implementations would probably process such cells separately and perform actual search over the hole variables alone. Still, even this trivial observation can find algebraic expression. In the example from Figure 3.3, suppose our variable ordering begins by assigning the variables in the first row, lexicographically (i.e., from left to right). Further, suppose that values are also ordered lexicographically, augmented by an inference procedure that instructs us not to re-use values within a row. The operation of BACKTRACKING SEARCH under this ordering can be expressed as follows:

\bigvee_{x_{1,1}} \bigvee_{x_{1,2}} \bigvee_{x_{1,3}} \bigvee_{x_{1,4}} \bigvee_{x_{1,5}} \Bigl( f_{1,4}(x_{1,4}) \wedge f_{1,5}(x_{1,5}) \wedge row_1(x_{1,1}, x_{1,2}, x_{1,3}, x_{1,4}, x_{1,5}) \wedge \cdots \Bigr) \qquad (3.7)

Three outer loops serve to fix the first three cells of the row, followed by a loop that eventually assigns 1 to the fourth and another that assigns 0 to the fifth while also ensuring that all cells in the row hold different values. The expression is inefficient, not merely because we must iterate through values of x1,4 and x1,5, but more importantly, because we can set the first three cells to arbitrary values whose inconsistencies with the fixed cells cannot be detected until reaching the inner loops. In particular, the first branch of search will assign 0, 1, and 2 to the first three cells in the row, then backtrack on finding no feasible value for the fourth. The search will then attempt 0, 1, 3, then 0, 1, 4, and so forth before eventually reaching 2, 3, 4, the first partial assignment under our ordering that can be extended to feasible values for the fourth and fifth cells. In contrast, an ordering that sets fixed cells first does not have this problem:

\bigvee_{x_{1,4}} \bigvee_{x_{1,5}} \Bigl( f_{1,4}(x_{1,4}) \wedge f_{1,5}(x_{1,5}) \wedge \bigvee_{x_{1,1}} \bigvee_{x_{1,2}} \bigvee_{x_{1,3}} \bigl( row_1(x_{1,1}, x_{1,2}, x_{1,3}, x_{1,4}, x_{1,5}) \wedge \cdots \bigr) \Bigr) \qquad (3.8)

So, we see that ordering is essential for exploiting not only graph structure, but also constraint structure whereby we want to evaluate 0-terms as early as possible, should we ever encounter them at all. Instead of encountering the same dead end repeatedly in different subtrees, we wish to resolve the issue once and for all as early in the search as possible; this is equivalent to distributing a 0 over conjunctions in order to remove entire blocks of disjuncts from our factored algebraic representation of a CSP.

3.2.2 Inference

We now consider inference, both as a second complete solution method, and also in its more common role as a supplement to search. In both cases we will retain the formalism of graphical models over an algebraic semiring, while remembering that the fundamental imperative of tractable reasoning is to find structure, whether in the connectivity of related variables, or in the special properties of constraints, or in the interplay of the two. In its most general form, inference is an operator that combines multiple constraints to make a new one, by applying the properties of multiplication in a semiring. In one form, this might produce a unary constraint that simply fixes a variable to a single value. Alternatively, it might eliminate only a subset of the values in a single variable's domain, or a subset of the tuples that might instantiate a set of variables.

We will begin by considering localized consistency between constraints as the root of inference. Accordingly, we will define the (i, c)-consistency family of inference rules, taking particular note of the special cases of generalized (1, 1) and relational (|σa| − 1, 1) arc-consistency [46]. We then describe three main modes for using inference, illustrating each with examples from SAT. The first mode is as an outright solution procedure, and the second is as an adjunct to search that generates new constraints for guidance. The third way of using inference, propagation, is a special case of the second where we generate singleton unary constraints that fix variables directly whenever we reach a subproblem where their values are forced.

Definition 17 ((i, c)-Consistency, Generalized/Relational Arc Consistency). Let C ⊆ F be any subset of the constraints in a CSP G = (X, F), with |C| = c. Then let V ⊆ \bigcup_{f_a \in C} \sigma_a be any subset of all the variables in the scopes of any such constraints, with |V| = i. Further, an assignment π to the variables in V is feasible iff it satisfies all constraints f_b : \sigma_b \subseteq V defined entirely over variables in V. The CSP is (i, c)-consistent iff for any such constraint set C and variable set V, plus any single additional variable x ∉ V, any feasible assignment to the variables in V can be extended by an assignment to x such that the resulting assignment to i + 1 variables is also feasible. The assignment to x that ensures feasibility to a given assignment to the variables in V is called that assignment's support. A CSP is strongly (i, c)-consistent if it is (j, c)-consistent for all j ≤ i. Under this framework, (generalized) arc-consistency is (1, 1)-consistency and relational arc-consistency [46] is (d − 1, 1)-consistency, where d is the size of the largest variable neighborhood.

Essentially, (i, c)-consistency ensures that we cannot make a joint assignment to i variables that seems promising (in satisfying constraints defined entirely over these variables) but actually creates a future conflict with a single additional variable. Textbook definitions of “i-consistency” focus on networks of binary constraints; the low computational complexity of such networks makes them poorly-motivated for discussing SAT (as well as most other non-didactic constraint satisfaction domains). Generalized definitions of “(i, j)-consistency” address this limitation by handling constraints of arity bounded by j; but here the emphasis is on the interplay between constraints in a problem rather than variables in a constraint. Thus, we have created a differently-parameterized format for defining various levels of consistency in terms of sets of constraints.

A simpler concept, for single variables with respect to individual constraints, could be defined as (0, 1)-consistency with some abuse of notation; but instead we define it separately below.

Definition 18 ((Generalized) Node Consistency). A variable xi from a CSP is (generalized) node consistent iff every feasible assignment to the variable can be extended to support any constraint fa ∈ ηi. Note that here feasibility means only the satisfaction of any unary constraints on xi, and support means that for fa there exists some assignment to the variables in σa \ {xi} that satisfies fa when extended by the assignment to xi.

The purpose of enforcing consistency is to remove values from a variable’s domain, or in the dual, from the extensions of constraints, when we know that individually or jointly they cannot lead to any solution. Here such removals will be realized formally by adding extra constraints to a constraint satisfaction problem, but in practice they can also be effected by explicit deletions of particular values from certain data structures, or implicitly by other means. For our purposes there is no need to go into depth in describing specific algorithms for enforcing consistency. In general, they perform a straightforward search for unsupported values or tuples; the degree of sophistication for such algorithms depends on efficient data structures for performing this search over multiple passes, and for removing values.
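To make this support-search concrete, the following minimal sketch (our own illustration, with a hypothetical data layout; real implementations use far more efficient data structures) removes unsupported values over multiple passes until a fixed point, in the spirit of generalized arc-consistency enforcement:

```python
from itertools import product

def enforce_gac(domains, constraints):
    """Minimal generalized arc-consistency sketch (not optimized).

    domains: dict mapping variable -> set of candidate values.
    constraints: list of (scope, predicate) pairs, where scope is a tuple
    of variables and predicate tests a full assignment to that scope.
    Removes unsupported values in place; returns False if some domain
    is wiped out (inconsistency detected), True otherwise.
    """
    def has_support(var, val, scope, pred):
        others = [v for v in scope if v != var]
        # Search all extensions of var=val over the remaining variables.
        for combo in product(*(domains[v] for v in others)):
            assign = dict(zip(others, combo))
            assign[var] = val
            if pred(assign):
                return True
        return False

    changed = True
    while changed:  # repeat passes until no value is removed
        changed = False
        for scope, pred in constraints:
            for var in scope:
                for val in list(domains[var]):
                    if not has_support(var, val, scope, pred):
                        domains[var].discard(val)
                        changed = True
                        if not domains[var]:
                            return False  # domain wipe-out
    return True
```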

Enforcing a particular level of (i, c)-consistency, though, does not guarantee an immediate solution to the problem. The limitation is that while some assignment to a given set of i variables may be supported by some value of a variable x, and by some other value of another variable x′, it may be the case that x and x′ cannot hold these two supporting values simultaneously due to some constraint between them. This specific possibility is actually eliminated if c ≥ 2, but clearly the same problem can obtain between a larger number of supports for any given value of c short of m, i.e., the size of the entire constraint set F. With this in mind we will now consider how inference techniques, explained in terms of consistency, can be used to solve CSPs on their own, or in tandem with search.

Using inference alone to solve a problem. As suggested immediately above, repeatedly enforcing increasing degrees of consistency of a problem, while expensive, can yield a solution and/or decide satisfiability. By raising i all the way to n = |X| or c all the way to m = |F|, one can find a solution directly by assigning successive variables, without backtracking. (Unsatisfiability is determined by whether a primal or dual variable has all possible values removed during the process of achieving consistency.)

Example 16 (Resolution; The Davis-Putnam Algorithm for SAT). We first define the resolution rule with respect to a single variable within a SAT theory in CNF. For a given CNF theory represented as a factor graph G = (X, F), and a variable x ∈ X, we define the output of RESOLUTION(G, x) to be a new factor graph $G' = (X \setminus \{x\}, F')$, with $F'$ defined as follows. First, collect all clauses in which x appears as a positive or negative literal, respectively, forming the sets $C_x^+$ and $C_x^-$. Then, for each pair of clauses $(c^+, c^-) \in C_x^+ \times C_x^-$, “resolve” the clauses to produce a new clause containing all the literals of $c^+$ and $c^-$, except for x. Let $C'$ denote the set of all such newly created clauses; then $F' = F - C_x^+ - C_x^- + C'$.

The Davis-Putnam (“DP”) algorithm [41] successively performs the resolution of each variable in the theory: G ← RESOLUTION(G, $x_i$) for i = 1 to n. If an empty clause is ever encountered during this process, DP reports unsatisfiability (each successive theory is equisatisfiable to the original G). Otherwise, it can reconstruct a model by retaining all the generated clauses $C'$ and arbitrarily choosing variable assignments from $x_1$ to $x_n$ while obeying such clauses; this process will not require any backtracking.
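As a concrete illustration of the resolution step and the DP loop, consider the following sketch (our own rendering, with clauses represented as frozensets of nonzero integers, negative integers denoting negated variables; tautologies are dropped, which is sound):

```python
def resolve_out(clauses, x):
    """One DP step: eliminate variable x by resolving C_x+ against C_x-."""
    pos = {c for c in clauses if x in c}
    neg = {c for c in clauses if -x in c}
    rest = clauses - pos - neg
    new = set()
    for cp in pos:
        for cn in neg:
            resolvent = (cp - {x}) | (cn - {-x})
            # Skip tautologies containing both a literal and its negation.
            if not any(-lit in resolvent for lit in resolvent):
                new.add(resolvent)
    return rest | new  # F' = F - C_x+ - C_x- + C'

def davis_putnam(clauses, variables):
    """Returns True iff satisfiable; an empty clause signals UNSAT."""
    for x in variables:
        clauses = resolve_out(clauses, x)
        if frozenset() in clauses:
            return False
    return True
```

For instance, davis_putnam({frozenset({1, 2}), frozenset({-1, 2}), frozenset({-2})}, [1, 2]) returns False: resolving out x1 yields the unit clause (x2), which then resolves against (¬x2) to produce the empty clause.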

It is straightforward to see that each resolution step of DP performs a variable elimination,

as defined in Section 1.2.1. Each such elimination constructs a clause-to-variable message

reporting an unnormalized bias “distribution” over the current variable, by summing over its

possible extensions–as prescribed by its own clause structure and the product of distributions

over preceding variables. The resulting message will assign a 0 or 1 to each of the variable’s

polarities; if both are assigned a value of 0 the theory is unsatisfiable. To construct a variable-to-

clause message, a variable takes the product of all such incoming clause-to-variable messages.

For a particular assignment to this variable to be feasible, then, as represented by assigning

a value of 1 in the outgoing message, the assignment must evaluate to 1 in all the incoming

messages.

There is only one substantial difference between DP and a direct application of the variable elimination rules in Figure 1.3 to a SAT problem. Instead of being correct on a tree and “incorrect” on a graph with cycles, as is the case with BP, it is instead tractable (poly-time) on a tree and exponentially costly in proportion to the width of an arbitrary graph. This is because DP essentially combines the variable elimination algorithm with a node-merging process, as defined in Section 1.2.2, on the dual of the graph (i.e., on the function nodes). Upon composing the outgoing variable-to-clause message for a variable, we are able to act as though the eliminated variable has no further influence on subsequent variables in the ordering, as if it were a leaf in a tree with respect to each of its factors. In fact DP makes this assumption valid by completely disconnecting the variable from the graph, which is why the algorithm is able to remove it completely from the theory. Specifically, resolving the variable by taking the cross product of all clauses is exactly the node-merging process, but on the functions rather than on the variable. Schematically, it is as though we took all the function nodes connected to the variable in the factor graph, and replaced them with a single function node connected to all the variables that had previously shared a function with the one being eliminated. In creating this merged function from the cross product of all the variable’s clauses, we incur the cost of an exponentially-sized representation, according to the induced width of the graph and the chosen ordering.

As a result, the DP algorithm is able to eventually achieve an implicit form of (1, m)-

consistency over the theory: any variable assignment that is feasible with respect to all the

original constraints, and all of those generated by the algorithm, can be extended to a full

satisfying configuration. (An empty clause renders any assignment infeasible.)

Using inference to add constraints during search. The remaining two modes for using inference involve its integration within a backtracking search process. Under the first of these modes, a solver can generate or “learn” new constraints during search, by seeking a reason

for each backtracking event and representing this reason as a new constraint. Originally, the

idea may have developed from a highly limited version known as “conflict-based backjumping,” which does not attempt to retain learned constraints for future use [128]. The full concept originated within the field of SAT-solving [110, 117] as “clause learning” and was later extended to general constraint satisfaction [90] in the form of “nogood learning.” In the context of finding a solution to a satisfiable CSP, the primary motivation for learning new constraints is to avoid repeating the same mistake later in search; if a problem is unsatisfiable, the hope is that the additional constraints will shorten the proof. An additional, often under-appreciated motivation is that learned constraints can work together with variable-ordering heuristics like “VSIDS,” which direct search toward the variables that have been involved in the most recent conflicts [117]. In this case clause learning works together with the heuristic to keep search localized within a focused area of the overall problem.

Example 17 (Clause Learning in SAT). Within a SAT solver, clause learning occurs when we make a series of variable assignments, either through decisions by our variable- and value- ordering heuristics, or through implications from a propagation process like those described immediately below, and discover a conflict with one of the clauses in our theory. At this point we know that the set of assignments as a whole cannot be extended to a solution configura- tion, a fact which itself comprises a new constraint on the problem. Different clause learning schemes choose particular subsets of the variable assignment that are sufficient to explain the conflict; one simple scheme is to choose only the decision-based assignments since they im- ply the remainder [14]. Taken together, the negations of all the assignments in the chosen set comprise a clause that can be added to the original theory.
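A minimal sketch of the simple decision-based scheme just mentioned (our own illustration; production solvers instead analyze an implication graph to derive stronger cuts such as 1-UIP clauses):

```python
def learn_clause_from_decisions(trail):
    """Decision-based clause learning sketch.

    trail: list of (literal, is_decision) pairs in assignment order,
    ending in a conflict. Since the decision assignments imply the rest
    of the trail, negating just the decision literals yields a clause
    that is entailed by the theory and blocks this branch in the future.
    """
    return frozenset(-lit for lit, is_decision in trail if is_decision)

# Example: decisions x1=True and x3=False, with x2=False implied between
# them, led to a conflict; the learned clause is (not x1 or x3).
trail = [(1, True), (-2, False), (-3, True)]
assert learn_clause_from_decisions(trail) == frozenset({-1, 3})
```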

Viewing clause learning as a form of inference is consistent with the previous comparison of inference and search. Indeed, the set of clauses that are learned during search could instead have been derived as a series of inferences from the original clauses, much as the Davis-Putnam procedure solves a problem entirely by inference; a choice of learning scheme corresponds to a resolution proof format that can be stronger or weaker than that of DP [14]. Thus, while the node-merging (variable elimination) or clause-merging (resolution) rules described in Section

1.2 can be applied explicitly during pre-processing or interleaved with search [46], they can also arise organically as a by-product of seemingly unrelated learning mechanisms.

Using inference to propagate constraints. A final means of integrating inference into a search process takes the form of “propagators,” or specialized functions that automatically enact the consequences of extending the current branch of search by a given variable assignment, by making any additional variable assignments that are implied by the assignment. In other words, upon assigning some variable x in the current subproblem, some constraint in its neighborhood might now force some other variable x′ to hold a particular value v′. Without a propagator to realize this, the search would continue through an exponential subtree of assignments to variables other than x′, many of which might be inconsistent with x′ = v′. For any such branch, this would not be recognized until the variable ordering heuristic tried to assign x′ and found that no values were possible. Instead, a propagator exploits the special structure of a given constraint to automatically enforce generalized arc-consistency by fixing variables when their feasible domains have been reduced to singletons.

Example 18 (Unit Propagation for SAT). Almost any modern SAT solver includes a mecha-

nism for performing unit propagation each time a variable is assigned. This entails a means of detecting clauses that now have all but one of their literals violated by the resulting par- tial assignment. The variable for this literal is now the “sole support” for this clause, and is automatically assigned to its satisfying value. This assignment may now trigger further prop- agations, resulting either in a simpler problem or an immediate backtrack. Unit propagation enforces (1, 1)-consistency: for any constraint in a CNF problem, any variable assignment is consistent with any feasible second assignment as long as it does not violate a unit clause.
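The following naive sketch illustrates the propagation loop (a plain scan for unit clauses; real solvers use watched-literal schemes to avoid re-examining every clause):

```python
def unit_propagate(clauses, assignment):
    """Naive unit propagation sketch.

    clauses: iterable of frozensets of int literals (negative = negated).
    assignment: dict mapping variable -> bool, extended in place.
    Returns False on a conflict (some clause fully falsified), else True.
    """
    def value(lit):
        var = abs(lit)
        if var not in assignment:
            return None  # literal not yet decided
        return assignment[var] == (lit > 0)

    changed = True
    while changed:  # forced assignments may trigger further propagations
        changed = False
        for clause in clauses:
            if any(value(l) is True for l in clause):
                continue  # clause already satisfied
            unassigned = [l for l in clause if value(l) is None]
            if not unassigned:
                return False  # all literals violated: immediate backtrack
            if len(unassigned) == 1:  # "sole support": forced assignment
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0
                changed = True
    return True
```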

Example 19 (Unit Propagation and Alldiff Propagation for QWH). Recall that in our representation of the Quasigroup with Holes problem, certain “fixed” variables are attached to unary constraints that evaluate to 1 only if they hold a single particular value. It is obvious to humans that in trying to assign a variable x within a QWH problem, one should not choose the same value v as the one prescribed for any such fixed variable x′ in the same row r (or column). This is unclear, however, to any naive CSP solver that fails to implement generalized arc-consistency. In particular, {row1} is a constraint set from whose signature we can choose a single variable x that can feasibly hold value v according to this set. However, if we consider the single additional variable x′, we can observe that there is no feasible assignment over {x, x′}, as they are constrained by a function set containing {row1} and the unary constraint fixing x′ to v. So, any solver that does not immediately enforce generalized arc-consistency will be wasting its effort on any further variable assignments performed between this point and the attempt to assign x′, at which point it backtracks.

We can also consider an analogous alldiff propagation process for the row (or column) constraint of a variable x that we have just assigned to value v. To enforce generalized arc-consistency with respect to this constraint, we immediately remove v from the feasible domain of all other variables in the same row (or column) as x. Conceptually, this is accomplished by a unary constraint on each such variable, stating that it cannot hold the value v; in implementation it is done by removing v from some representation of such a variable’s domain.
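A sketch of this removal for a QWH-style grid (our own illustration, assuming domains are stored per cell):

```python
def alldiff_propagate(domains, row, col, value, n):
    """Sketch of alldiff propagation on an n-by-n QWH-style grid.

    domains: dict mapping (r, c) -> set of candidate values.
    After assigning `value` at cell (row, col), remove it from every other
    cell in the same row and column; return the cells whose domains became
    singletons, which are candidates for further (unit-style) propagation.
    """
    domains[(row, col)] = {value}
    newly_fixed = []
    for k in range(n):
        for cell in ((row, k), (k, col)):
            if cell != (row, col) and value in domains[cell]:
                domains[cell].discard(value)
                if len(domains[cell]) == 1:
                    newly_fixed.append(cell)
    return newly_fixed
```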

3.3 Incomplete Solution Principles for Constraint Satisfaction

Having considered complete solution methods in the form of search, inference, and combinations of the two, we now consider incomplete methods. Such methods can find a solution to a constraint satisfaction problem if one exists, but may nevertheless fail to do so on any individual execution. Thus, they cannot prove a problem unsatisfiable: failure to find solutions does not rule out their existence. These methods are not used directly in the solvers that we will introduce in future sections, but defining them here will allow us to characterize such solvers in terms of our integrated view of constraint-based and probabilistic reasoning. Additionally, incomplete methods are the underlying search framework for related research that will be discussed in future sections as well [23].

3.3.1 Decimation

We will define two incomplete methods, and call the first and simplest of these “decimation.”

Essentially, we attempt to guess a solution to a CSP, by building up a configuration from suc- cessive “blocks” of variable assignments–without backtracking.

Algorithm 7: Decimation Data: constraint satisfaction problem G = (X,F ), decimation block size parameter d.

Result: solution, or FAILURE.

1 return RECURSIVE-DECIMATION(∅, G).

2 subroutine RECURSIVE-DECIMATION(assignments, csp)

3 begin

4 if (assignments violate csp) then return FAILURE.

5 if (|assignments| = n) then return assignments.

6 decimation-block ← CHOOSE-VARIABLES-AND-VALUES(assignments, csp, d).

7 return RECURSIVE-DECIMATION(assignments ∪ decimation-block, csp).

8 end

The algorithm is essentially Algorithm 6 for search, but without backtracking, and while

making d variable assignments at a time. The only opportunity for intelligence arises in that

instead of simply guessing a solution outright, we are permitted to examine the subproblem

resulting from each call to RECURSIVE-DECIMATION and apply the CHOOSE-VARIABLES-

AND-VALUES heuristic to select the next decimation block. Each such block will consist of

the d variable assignments that we consider most likely to be in a solution configuration.

The only way to find a solution by decimation is to avoid ever making a variable assign-

ment that eliminates all remaining solution configurations that can be built from the current

partial assignment. Difficult as this may seem, we will see in Chapter 6 that the survey prop-

agation solver that has proved so successful on large, hard, random problems, essentially uses

Algorithm 7 with a decimation block size of 1 and a heuristic based on bias estimation [22].
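For concreteness, here is a minimal iterative rendering of Algorithm 7 (the violation test and the CHOOSE-VARIABLES-AND-VALUES heuristic are supplied as hypothetical callbacks; in Chapter 6 the latter is realized by bias estimation):

```python
def decimate(variables, violates, choose_block, d):
    """Decimation sketch following Algorithm 7 (no backtracking).

    violates(assignments) tests the partial assignment against the CSP;
    choose_block(assignments, d) returns up to d new variable->value
    entries judged most likely to appear in a solution configuration.
    """
    assignments = {}
    while True:
        if violates(assignments):
            return None  # FAILURE: a bad block is never undone
        if len(assignments) == len(variables):
            return assignments  # complete satisfying configuration
        block = choose_block(assignments, d)
        if not block:
            return None  # heuristic could not extend the assignment
        assignments.update(block)
```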

3.3.2 Local Search

The second, and more general incomplete solution method that we will consider is local search

[75]. The basic idea is to always maintain a complete assignment to all the variables in a problem; initially this configuration is constructed at random. If ever we find that the current configuration satisfies all the constraints in a problem, then we have found a solution, and we return it. Otherwise, we alter the configuration by changing one variable assignment, in hopes of eventually reaching a solution. The pseudocode for this process appears below as Algorithm

8.

Algorithm 8: Local Search Data: constraint satisfaction problem G = (X,F ).

Result: solution, or FAILURE.

1 configuration ← random assignment to all variables in X.

2 repeat

3 if configuration is a solution then

4 return configuration.

5 else

6 var ← CHOOSE-VARIABLE(G, configuration).

7 val ← CHOOSE-VALUE(G, configuration, var).

8 configuration ← STEP(configuration, var, val).

9 end

10 until ⟨termination-condition⟩

11 return FAILURE.

The intelligence in the algorithm is once again encapsulated in heuristics for choosing

which variable to alter next, and to what value. In the algorithm we call each alteration to

a single variable within the configuration a “step,” reflecting a stochastic optimization process

directed by the heuristic decisions. A straightforward heuristic is to greedily make new variable assignments that maximize the number of newly satisfied constraints, while minimizing the number of currently satisfied constraints that will become unsatisfied. Clearly, though, this can lead to local maxima in the number of clauses satisfied by a configuration. So a second component of many local search algorithms is a random element for occasionally taking steps that are generated at random or that otherwise break from the greedy maximization strategy.

The algorithm terminates according to some condition that may be as simple as reaching a maximum number of iterations, or as particular as reaching a particular point in the space of configurations from which the heuristics will not be able to generate a step. At this point the process ends in failure–the search did not find a solution, but cannot conclude that none exists.
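For concreteness, the following WalkSAT-flavored sketch instantiates Algorithm 8 for SAT, combining the greedy net-gain step with occasional random steps; the noise parameter, step bound, and clause representation are our own assumptions rather than details fixed by the text:

```python
import random

def local_search_sat(clauses, n_vars, max_steps=10000, noise=0.3):
    """Greedy local search with random walk steps (a WalkSAT-style sketch).

    clauses: list of lists of int literals (negative = negated).
    Returns a satisfying assignment dict, or None on FAILURE.
    """
    config = {v: random.choice([True, False]) for v in range(1, n_vars + 1)}

    def sat(lit):
        return config[abs(lit)] == (lit > 0)

    def num_unsat():
        return sum(1 for c in clauses if not any(sat(l) for l in c))

    for _ in range(max_steps):
        unsat = [c for c in clauses if not any(sat(l) for l in c)]
        if not unsat:
            return config  # current configuration is a solution
        clause = random.choice(unsat)
        if random.random() < noise:
            var = abs(random.choice(clause))  # random step
        else:
            # Greedy step: flip whichever variable of the violated clause
            # leaves the fewest unsatisfied clauses overall.
            best_score, var = None, None
            for lit in clause:
                v = abs(lit)
                config[v] = not config[v]  # tentative flip
                score = num_unsat()
                config[v] = not config[v]  # undo
                if best_score is None or score < best_score:
                    best_score, var = score, v
        config[var] = not config[var]
    return None  # FAILURE: termination condition (step bound) reached
```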

Not only is local search relevant to our project due to its use in related algorithms, but it also merits discussion from the perspective of probabilistic reasoning. Simply by virtue of being an optimization problem–either on the number of satisfied constraints in a CSP, or on the weight of satisfied constraints in the case of a COP–it bears a general similarity to the characterization of marginal estimation as optimization in Chapter 5.

This superficial similarity becomes more developed when one considers local search al- gorithms where CHOOSE-VARIABLE and CHOOSE-VALUE together select the new variable assignment randomly, with probability explicitly or implicitly proportionate to the resulting net increase in the number of satisfied constraints [114, 135]. Here each step is itself discrete, as Algorithm 8 makes a single all-or-nothing alteration to the current configuration, but under expectation the behavior of the algorithm can be viewed in terms of fractional assignments under the same linear objective as the one at the core of marginal estimation.

In particular we can imagine a simple process that translates fractional assignments (or biases) into hard (0/1) assignments, and thus surveys into configurations, by sampling a hard assignment with probabilities in proportion to a given fractional assignment. Given a set of fractional assignments (like a survey), then, this generative process can sample hard assignments independently across all variables in order to produce a configuration. Under this model, then, let SAT represent the event that we possess a configuration that satisfies a given CSP, while assignment is the event of observing a particular variable assignment in that solution. Then randomized, greedy local search methods can be viewed as maximizing P(SAT|assignment)

with respect to a fractional representation of assignment. That is, under expectation it will

choose variable assignments that maximize the chances of satisfying all the constraints in a

problem. In our own project the goal is to estimate the marginal probabilities of finding partic-

ular variable assignments when sampling from the set of solutions to the CSP, cf. Example 12.

Again, then, we are choosing a fractional value for assignment (namely, a distribution over the

variable’s domain), but this time in order to optimize P (assignment|SAT). The two conditional probabilities are proportional to each other across values of assignment, due to Bayes’ Rule.

In essence, then, greedy local search and bias estimation over constraint satisfaction problems are almost the same task.
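To restate this proportionality in symbols (assuming, as a simplification, that the prior P(assignment) is uniform across the values being compared):

```latex
% Bayes' Rule relating the two objectives; the final proportionality
% across values of `assignment' relies on the uniform-prior assumption.
\[
  P(\mathrm{assignment} \mid \mathrm{SAT})
    \;=\; \frac{P(\mathrm{SAT} \mid \mathrm{assignment})\,
                P(\mathrm{assignment})}{P(\mathrm{SAT})}
    \;\propto\; P(\mathrm{SAT} \mid \mathrm{assignment}).
\]
```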

This explains the success of local search CSP-solving techniques that are based on discretized Lagrangian primal/dual optimization methods [136]; these mirror the dual messages of approximate bias estimators, as expressed most directly in Section 5.2 by EMBP’s explicit distributions on both variable assignments and function extensions. Similarly, a number of theoretical approximations for SAT, as well as its optimization variant defined in the next section, rely on fractional variable assignments [62, 63]. At the same time, local search methods that were not explicitly designed to estimate marginals often do so implicitly, by biasing themselves according to search experiences that constitute approximators to marginal probabilities [155].

The distinction between our approach and local search is that in estimating bias we are not seeking a solution to the problem directly; doing so would be akin to the MPE problem defined in Section 1.4. Rather, we will use bias estimates as a heuristic within complete backtracking search. Interestingly, though, if we revisit the Gibbs sampling algorithm of Section 2.1.1, we can observe that it essentially performs the same greedy random walk in estimating marginals that local search performs in seeking a solution. In Gibbs sampling the purpose of the random walk is not to find a solution (though that is certainly a welcome outcome). Rather, by measuring how often we find a particular variable assignment in the trail of such a walk, we form an estimate of that assignment’s marginal probability. The assumption is that solutions must be in some way correlated with near-solutions, and this can be more or less true on random problems as the clause-to-variable ratio varies through the phase transition thresholds described in the next chapter. Here, though, it suffices to note the close correspondence between two alternative approaches from the two main branches of our conceptual foundation: in considering probabilistic reasoning we bypassed sampling methods in favor of message-passing, and in constraint reasoning we forgo local search in favor of complete backtracking search. The idea that near-solutions may or may not be informative of true solutions can be successfully applied in a variety of unexpected forms, e.g. [15]; but a further exploration of this theme is left for future work.

3.4 The Optimization, Counting, and Unsatisfiability Problems

We conclude this chapter by defining some alternative constraint reasoning tasks. The first of these, constraint optimization, will be a focus of the present project, along with regular constraint satisfaction. The remaining tasks, model counting and proving unsatisfiability, bear mention here but are otherwise left to future research.

Constraint Optimization. Constraint optimization problems (“COPs”) are variants of constraint satisfaction problems where instead of seeking a configuration that satisfies all constraints, we seek a configuration that maximizes the number of satisfied constraints. Typically we will actually consider the more general class of partial weighted COPs, where we specify numeric weights on constraints and maximize the sum of weights across satisfied constraints, and where certain “hard” constraints must be satisfied by any solution. We will consider constraint optimization, as realized for the Boolean domain by the MaxSAT problem, in Part III of this document.

Model Counting. Given a constraint satisfaction problem, the task of model counting is to determine the number of solution configurations. This problem is the analogue of marginal estimation in that the partition function N of Equation (1.2) represents exactly the number of solutions under the factor graph interpretation of CSPs defined in Section 3.1. In other words, any algorithm that can solve the marginalization problem exactly can also do so for the model-counting problem, and vice-versa. Accordingly, a number of approximate model counters are based on approximate probabilistic reasoning [98, 31, 67, 66]. Exact model counting is #P-complete in worst-case complexity.

Proving Unsatisfiability. Most complete methods for seeking a solution to a given constraint satisfaction problem can also produce a proof of unsatisfiability if the problem has no solutions.

However, problems that are known to be unsatisfiable may arise from a variety of applications, most notably verification [74] or, more generally, any application based on the construction of refutation proofs. Thus it is worth noting for future work that the bias estimation techniques developed in this dissertation can be applied to such scenarios in a customized fashion.

Summary. Here we have reviewed key concepts from constraint satisfaction from the perspective of factor graphs and the algebraic semiring, so that we can ultimately apply such concepts to integrate probabilistic reasoning with constraint satisfaction, both formally and at an implementational level. Before proceeding to do so, though, we next present a final background chapter that concerns theoretical models for the sorts of difficult random problems that historically motivated the initial application of statistical methods to constraint satisfaction [23].

Chapter 4

Theoretical Models for Analyzing Constraint Problems

In the preceding chapter, with its account of complete solution methods for constraint satis- faction, we identified variable-ordering and value-ordering heuristics as the main means of applying intelligence in order to escape from worst-case runtime behavior. We can now con- sider some salient problem properties that can determine the success of a particular solving strategy. Here we primarily review existing work, in terms of our own formal framework, and highlight the aspects that are most relevant to the current project. This analysis particularly emphasizes “random” and “structured” problems, as opposed to “industrial” ones, but later we will assess our solvers on the full range of problem instances.

Specifically, we will consider the phase transition phenomenon and how it affects the space of solutions to problems within the transition area; such problems are both the motivation and the primary successful application for the “Survey Propagation” (SP) bias estimator for

SAT [23]. We will then consider the model underlying this bias estimator, as we will later adapt it within the EMBP framework of Chapter 5 in forming our own estimators. Finally, we will consider “backbone” and “backdoor” variables as keys to the underlying structure of solutions to both random and structured CSPs. In Part II, our own bias estimators will target

such variables, especially backbone variables, within the overall process of search.

4.1 The Phase Transition for Random Constraint Satisfaction Problems

We first define, by construction, the sort of “random” problem that we will consider through the rest of this document. While a similar analysis has been conducted for random constraint satisfaction problems like quasigroup with holes [71, 10, 82], the main emphasis in the phase transition literature is on SAT. Based on this definition of random problems, we state the phase transition conjecture and illustrate its empirical manifestation. Next, we consider the geometry of the set of solutions to a random problem, and how this can be observed to vary across the phase transition area. The purpose of this study is to tie such phenomena to the success of various solving methods.

4.1.1 Defining Random Problems and the Phase Transition

Definition 19 (Random k-SAT Problem, Parameterized by n and α). To generate a random k-SAT problem over n variables and with clause-to-variable ratio α, begin with a set X of variables, where |X| = n, and an empty set of clauses C = ∅. Repeatedly generate new clauses by randomly selecting k variables from X without replacement and forming a positive or negative literal for each such variable uniformly and independently at random. Add each such clause to C (unless C already contains the clause) and terminate when |C| > α · n. Recall

that α is called the “clause-to-variable ratio”.
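A direct transcription of this generator (a sketch; the set representation silently skips duplicate clauses, matching the definition):

```python
import random

def random_ksat(n, k, alpha, seed=None):
    """Generate a random k-SAT instance per Definition 19.

    Returns a set of frozensets of int literals (negative = negated),
    with roughly alpha * n distinct clauses over variables 1..n.
    """
    rng = random.Random(seed)
    clauses = set()
    target = int(alpha * n)
    while len(clauses) < target:
        chosen = rng.sample(range(1, n + 1), k)  # k distinct variables
        # Choose each polarity uniformly and independently at random.
        clause = frozenset(v if rng.random() < 0.5 else -v for v in chosen)
        clauses.add(clause)  # duplicates are absorbed by the set
    return clauses
```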

Consider the likelihood that a random problem is satisfiable, given varying values of α. If

the clause-to-variable ratio is extremely low, the problem is under-constrained and very likely

to be satisfiable; further, it is presumable that a solver of almost any design will be able to

find a solution easily. Likewise, if α is extremely high then we can presume not only that the problem is likely to be unsatisfiable, but also that solvers will be able to make this determination efficiently. Random SAT problems are of interest to a variety of researchers outside of computer science, in disciplines like statistical physics and thermodynamics, because of what happens between these two extremes [116].

Specifically, as α increases so does the probability that a random problem is unsatisfiable– but, the transition is not gradual. Rather, much as water abruptly becomes ice at a certain temperature point, random problems undergo a “phase transition” in satisfiability across values of α. Figure 4.1 illustrates the phase transition by plotting a series of random 3-SAT problems with n = 300 and with α varying across the horizontal axis. For each value of α we have gen- erated 1000 instances and determined satisfiability or unsatisfiability by solving each instance to completion (using the VARSAT solver presented in Chapter 7). We then plot the proportion of each set of instances that consists of unsatisfiable problems.

In the figure, we observe a jump in the probability of producing an unsatisfiable problem, around the neighborhood of α = 4.28. A standing conjecture in theoretical computer science and statistical physics is that for a given k, this phase transition occurs at a fixed clause-to-variable ratio αu, for sufficiently large n. (Here ‘u’ is used as a symbol to indicate “unsatisfiability”.) To be more formal, the hypothesis is the existence of an αu such that for any ε > 0, formulas with density αu − ε are satisfiable with high probability and formulas with density αu + ε are unsatisfiable with high probability, as n → ∞. The ratio αu is referred to as the “phase transition threshold” (in satisfiability).

At this time, there is no known way to calculate αu for arbitrary k, or even to prove its existence in general. However, there exists a proof for a weaker version of the general conjecture that expresses the threshold as a function of both n and k [59]. For the special case of 2-SAT, αu = 1 is known analytically; additionally, there is a bound of $2^k \log 2 - O(k)$ on any αu as k becomes large [2]. For the special case of 3-SAT, though, this bound is relatively loose; here the best known bounds are that 3.52 ≤ αu ≤ 4.51 [53, 87]. Non-rigorously, statistical

physicists have estimated a value of αu ≈ 4.267 for 3-SAT, using the replica method [112] for approximating the regular structure of a random graph.

[Figure 4.1: Phase transition in satisfiability of random problems. Here we have generated 1000 random 3-SAT problems with n = 300 variables for each clause-to-variable ratio α marked along the horizontal axis (ranging from 3.6 to 5.0); the vertical axis plots the proportion of these problems that were unsatisfiable.]

4.1.2 Geometry of the Solution Space

Aside from its relevance to natural science, the phase transition has attracted interest as an explanation for the performance limits of various SAT-solving approaches. The explanation is in terms of the geometry of solution space, where we consider the set of solutions to a satisfi- able random problem in terms of the Hamming distances between such configurations. Figure

4.2 is a notional depiction of how this space varies as α increases across the phase transition region. Solutions are depicted as solid circles, while near-solution configurations (which satisfy almost all, but not all, of the clauses) appear as fainter circles. Within the limitations of a two-dimensional representation, the placement of configurations represents their Hamming distances: configurations are considered adjacent if they differ by a single variable assignment.

Finally, larger dotted outlines group adjacent configurations into arbitrary “clusters” of interest.

[Figure 4.2: Notional representation of solution geometry for random problems generated near the phase transition in satisfiability. Approximate threshold values are for 3-SAT: αsub ≈ 3.860, αshatter ≈ 4.210, αcond ≈ 4.254, and αu ≈ 4.267; annotations mark the regions in which any algorithm, local search, belief propagation, and survey propagation remain effective.]

In the figure, we see various thresholds occurring just below the unsatisfiability threshold

αu, corresponding to abrupt changes in the geometry of the solution set, along with limits on the applicability of various solving methodologies. For extremely low values of α, almost any

configuration is a solution or near-solution–thus solutions are likely to cover the whole space of

configurations and to be tightly interconnected, forming a single cluster. Accordingly, any con-

ceivable and reasonable solving technique should find a solution relatively quickly. As α rises

beyond a “submodularity” threshold αsub, the space of solutions narrows; but all solutions are still likely to lie, for the most part, in a single large cluster of solutions and near-solutions. Thus,

local search algorithms can continue to do well in this region because near-solutions are very

likely to be near actual solutions. And as for bias estimation, belief propagation is still likely

to do well–it is unlikely to build its estimates from solution-impoverished regions of the space.

The approximate location of the submodularity threshold for 3-SAT is αsub ≈ 3.86 [112, 100].

Importantly, as α continues to increase, but still before reaching the phase transition in un- satisfiability, the space of solutions has been observed to “shatter” into an exponentially large number of solution and near-solution clusters, where each such cluster is in turn exponentially small and where clusters that contain exclusively near-solutions exponentially outnumber those that contain solutions [116, 70, 1]. This impedes both local search and belief propagation alike.

In the first case, there are exponentially many local optima to entrap the search process, and in the latter, surveys will include information about exponentially many near-solution clusters, versus solution clusters. The survey propagation model described in the next section was de- signed to extend the feasibility of bias estimation into this region, and has done so with good success [23]. A number of incomplete accounts attempt to explain why SP continues to work well beyond the shattering threshold [107, 29], but perhaps the most compelling explanation is that solution clusters in this region tend to have many “frozen” variables, and the more sophisticated survey propagation model is more sensitive to this fact [1]. Within a specific solution cluster, frozen variables are those that hold the same value in all solutions comprising the cluster; flipping their value when holding all other assignments constant would produce an unsatisfying configuration. The survey propagation model is the topic of the next section, and “backbone” variables are considered in the section after; within the constraints defining a specific solution cluster, frozen variables are backbones. For 3-SAT, the shattering threshold is

estimated to be near 4.210 [112].

Recently, statistical physicists have posited one more threshold just before the phase tran-

sition in problem satisfiability. As α increases toward 4.267 (in the case of 3-SAT), the set of solutions has been observed to “condense” into a sub-exponential number of clusters with sizes varying according to a Poisson-Dirichlet process [100, 7]. Because the size of such a cluster is not correlated with whether it contains a solution, and bias estimators like survey propagation weight all clusters according to size, irrespective of whether they contain any solutions as opposed to near-solutions alone, this threshold represents a hypothetical limit to the power of such methods.

While the properties of this final phase transition have not been well-studied to date, in summary it suffices to note that random problems from near the phase transition in satisfiability are among the most difficult for all known solvers. Recently the challenge of solving such problems has begun to motivate the design of probabilistic solvers, including those considered here, and any successes in meeting this challenge will be of intrinsic interest to both computer science and statistical physics.

4.2 The Survey Propagation Model of Boolean Satisfiability

In this section we will present the survey propagation bias estimation method, to use our own terminology, primarily by defining the underlying survey propagation model. The survey prop- agation bias estimator simply consists of the belief propagation algorithm of Chapter 1, applied to this more sophisticated model of variable assignments [24]. To this point we have used a two-element domain where every variable in an assignment can be set positively or negatively.

Here, we consider a third possibility where a variable is unconstrained, meaning that even though it appears positively (negatively) in a given satisfying configuration, its value could be individually flipped to negative (positive) and the result would still be a solution.

Definition 20 (The Survey Propagation Model). [112]. Let c be a satisfying configuration for a given constraint satisfaction problem Φ, and let configuration c′ be identical to c except that the assignment to a given variable x is flipped. (That is, c′(x) = 0 iff c(x) = 1.) If c′ is not a solution to Φ, then x is constrained positively, or negatively, according to its value in c. Notationally, for such cases we can apply the label ‘+’ or ‘−’ to the value of x instead of ‘1’ or ‘0’. Otherwise, x is unconstrained within this solution assignment, and we can apply the label ‘*’ (designating the “joker” or “wild” state) under the survey propagation model.

Thus, when applying the belief propagation technique to the survey propagation model, we

are asking for the marginal probability of finding a variable positively constrained, negatively

constrained, or unconstrained, if we were to sample from the space of solutions uniformly at random. This produces the survey propagation bias estimator, which still must be embedded within an overall solving framework; the original means of doing so [23] will be reviewed in the next chapter.

As for the bias estimator itself, it was designed to be effective on random problems with clause-to-variable ratios that exceed the shattering threshold, where an exponential number of solution clusters and near-solution clusters arise. As suggested previously, the hypothetical explanation for the success of the survey propagation model in this region is that by allowing for joker assignments, it is possible to group multiple clusters into single “cores” of meta-solutions; at this point the hypothesis remains unconfirmed [100].

4.3 Backbone Variables and Backdoor Sets

We now consider two final theoretical concepts that will be relevant to the constraint solvers in Parts II and III. Backbone variables are constrained to a particular value, in all solutions to a CSP (as opposed to all the solutions in a particular cluster, as in the discussion of frozen variables with respect to the SP model). Backdoors are sets of variables that greatly simplify a problem when set a particular way.

4.3.1 Definitions and Related Work

Definition 21 (Backbone Variables). [116]. Let SOLS(Φ) be the set of solutions to a constraint satisfaction problem Φ = (X,C). Then variable x ∈ X is a backbone variable of Φ iff s(x)

is identical for all s ∈ SOLS(Φ).
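For intuition, backbones can be identified exactly on small instances by enumerating SOLS(Φ), as in this brute-force sketch (exponential in the number of variables, for illustration only):

```python
from itertools import product

def backbones(n_vars, clauses):
    """Brute-force backbone identification for a small CNF instance.

    clauses: iterable of iterables of int literals (negative = negated).
    Returns {variable: forced_value} for variables holding the same value
    in every solution; empty if the instance is unsatisfiable.
    """
    solutions = []
    for bits in product([False, True], repeat=n_vars):
        config = dict(enumerate(bits, start=1))
        if all(any(config[abs(l)] == (l > 0) for l in c) for c in clauses):
            solutions.append(config)
    if not solutions:
        return {}
    return {v: solutions[0][v] for v in range(1, n_vars + 1)
            if all(s[v] == solutions[0][v] for s in solutions)}
```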

Recall that backbone variables occur most frequently in the most difficult random problems,

near the phase transition in satisfiability [23]. During backtracking search, it is crucial to

never set a backbone to its incorrect value; doing so guarantees that the sub-tree rooted at this variable assignment is unsatisfiable, and any further search below this point will ultimately be wasted effort. Thus a number of CSP heuristics are designed to implicitly favor backbone variables, or implicitly defined proxies, for early assignment, perhaps by noting the number of constraints on a variable and their relative proportions in influencing it to hold a particular value [54, 155, 111, 129].

Presumably, it is also intuitive that setting a backbone to the correct value will greatly simplify a problem, since the variable is likely to be highly constrained across solutions; thus its assignment to this correct value should remove many constraints from the resulting subproblem. However, this is not precisely the case: setting backbone variables correctly may be necessary, but it is not sufficient, for efficient search.

This motivates a conceptualization with more operationally-oriented semantics. Though backdoor variables may be harder to identify formally, their definition exactly captures the efficiency property that we desire, in terms of an explicit solving process.

Definition 22 (Backdoors). [148]. Let Φ = (X, C) be a satisfiable constraint satisfaction problem, and let P be an arbitrary inference procedure defined over CSPs. Then B ⊆ X, B ≠ ∅, is a backdoor set of Φ with respect to P iff there exists some assignment π to the variables of B such that P returns a solution when applied to Φ|π (i.e., to the simplified subproblem resulting from applying π to Φ).

In other words, if the first |B| variables that we set constitute a backdoor to a given satis-

fiable problem, and we happen to configure them a particular way, then we will immediately

find a solution without backtracking, upon applying the inference procedure P to the resulting sub-problem. In comparison to backbones, then, it may be harder to identify such variables em- pirically, without already having solved a problem several times; but in turn the definition more closely captures our aspirations in trying to search efficiently. At a theoretical level, backdoor variables are the subject of ongoing research into complexity results and methods of identifi- cation for restricted versions of certain constraint satisfaction problems [120, 133, 93, 52, 60].

To handle unsatisfiable problems as well as satisfiable ones, there is also a more comprehensive definition for backdoors; note the switch from existential to universal quantification over assignments to the backdoor variables.

Definition 23 (Strong Backdoors). [148]. Let Φ, P, and B be defined as before. B is a strong backdoor of Φ with respect to P iff for all assignments π to the variables in B, P will either

find a solution for Φ|π or prove that it is unsatisfiable.

Small backdoor sets have been observed in a variety of random and structured problems

[148]. Their formulation was motivated by the observation of a heavy-tailed distribution on the runtime of backtracking solvers on such problems of interest. In particular, if a randomized solver is run on the same problem multiple times with different seeds, then the distribution over run-times may primarily favor short runs (corresponding to fortuitous early variable as- signments) overall, but will also include a non-negligible proportion of very long runs in corre- spondence with a power law on the probability density function [70, 68]. In fact, it has become an essential design feature of modern constraint solvers to randomly restart if a problem has not been solved within a certain increasing time threshold; one explanation for the popularity of this feature is that, in essence, it is more important to find backdoor variables–possibly by luck–than to be maximally persistent in search [69].

4.3.2 Relevance to the Research Goals

The algorithmic contributions of this dissertation are based on the identification of backbone variables, plus an expanded prerogative to identify near-backbone variables as well. While other heuristics are inspired by the formulation of backbones and seek them implicitly, we will identify them explicitly (though approximately) by calculating biases. To be concrete, a variable is a backbone if and only if its marginal distribution over assignments, with respect to a uniform distribution on solutions, puts all weight on a single value. In the case of SAT, for instance, the biases for a backbone variable are 0/1. Further, if we consider variables whose bias distributions are highly skewed toward particular values, though not necessarily 100% so, then setting such near-backbones to such values during search will encourage sub-trees that are rich in solutions. In the context of bias estimation, the hope is that highly skewed bias estimates will correlate beneficially with backbones and near-backbones. CHAPTER 4. THEORETICAL MODELSFOR ANALYZING CONSTRAINT PROBLEMS 101

In Chapter 7 we will directly measure the ability of various bias estimators to identify backbones, before assessing their general usefulness in search. Backdoors remain a subject of formal interest, but their identification in practice remains an aim for future work. One possibility is that under the expanded survey propagation model, backdoor variables have weak marginal probabilities for being in the “joker” state.

Summary. We have now completed the theoretical background that underlies the remainder of the research described in this document. At a formal level, we will design marginal estimation methods for constraint satisfaction with the goal of identifying key variables like backbones and near-backbones. More pragmatically, we will test the resulting solvers on the sorts of random problems described in this chapter, for comparison with alternative approaches. In the next chapter, we will integrate all the theoretical background to derive a formal comparison of marginal estimation and constraint satisfaction, from the perspective of numerical/discrete optimization, and present a general estimation framework that we will subsequently customize for constraint reasoning.

Chapter 5

Integrated View of Probabilistic and Constraint Reasoning

To this point we have constructed a formal foundation in support of the algorithmic contribu- tions that will emerge from the next chapter onward. In assuming a somewhat non-standard perspective for defining marginal computation and constraint satisfaction, we have facilitated a number of comparisons between the two problem areas. These comparisons constitute a contribution in their own right–they not only motivate the algorithms to be considered shortly, but they also establish the full space of possibilities for integrating theories and techniques from the two areas. To a certain extent, the correspondence between the two areas has been known to researchers working at their intersection, but more so in the form of folklore and unstated inspiration than in terms of formalization or explicit synthesis. Thus, in this chapter we will make a number of specific observations and re-conceptualizations that may have only been suggested in passing during our prior account of reasoning over probabilistic models and constraint satisfaction problems.

Grouping and formalizing such points into a single account motivates the creation of a new marginal estimation framework that makes explicit the constraint-reasoning aspect of the optimization that underlies marginal estimation. In the second part of this chapter, then, we

introduce the EMBP (“Expectation Maximization Belief Propagation”) framework, in general form for marginal estimation. The methodology is to re-derive a method like belief propagation from first principles, appealing to the expectation maximization [51] technique for parameter estimation over models with hidden variables. EMBP is characterized by a Q(·) distribution over function extensions that can be customized by various closed forms for the purpose of marginalization over specific domains. Here we present a product-decomposition representation of Q(·) that can be applied to any domain; in Chapters 6 and 8 we will derive special forms for SAT and MaxSAT.

5.1 Relationship between Probabilistic Inference and Constraint Satisfaction

Ultimately, we will solve constraint problems by embedding customized bias estimators as variable- and value-ordering heuristics for backtracking search. Here, though, we will first observe that this idea is less contrived than it may initially appear; the task of calculating marginals is itself tantamount to solving a variation on the constraint satisfaction problem. In particular, we will demonstrate that marginal computation is a linear relaxation of constraint satisfaction with non-linear objective and constraints. Accordingly, the various approximations to the marginal polytope employed by different estimation methods are continuous analogues to various inference procedures for CSP solvers, and domain-specific global constraints can improve the efficiency and accuracy of such estimators–just as they have for constraint solvers.

In general, we will see that the same principles that apply across combinatorial and numerical optimization (constraint-driven search, duality, structural simplification, constraint generation, etc.) apply to these two problems once we have formulated them in optimization form.

This leads to a second main observation, that the inherent duality of the algebraic semir- ing appears simultaneously as summation and multiplication in probabilistic reasoning, and as search and inference in constraint satisfaction. Where before the meanings of factor-to-variable CHAPTER 5. INTEGRATED VIEWOF PROBABILISTICAND CONSTRAINT REASONING 104 and variable-to-factor messages within loopy belief propagation were not well-understood, now they can be viewed as applications of search or inference within the relaxed space of fractional variable assignments and fractional satisfaction. The same holds for expectation and maxi- mization steps within EM, when applied to marginal estimation in formulating EMBP. This correspondence will allow us to apply arbitrary soft inference or soft search principles to our bias estimators, as facilitated by the explicit statement of this duality within EMBP. In the next chapter we will demonstrate this process for constraint satisfaction, by integrating closed forms that improve the accuracy of our bias estimators.

A third observation is best articulated in terms of the factor graph, whose bipartite structure has served to represent such duality throughout our account of probabilistic reasoning and constraint satisfaction. The induced width of the graph is an intuitive representation for the complexity of reasoning over the interactions within a specific problem, so in this setting it is natural that the “node-merging” operation defined in Section 1.2.2 has been widely used to reduce that complexity at cost of greater memory requirements. Here the observation is that a number of seemingly unrelated techniques essentially employ just this operation, either on its own or in concert with some other methodology, as instantiated by the particular task to which we have applied our factor graph formalism. The practical yield is a more systematic understanding of such techniques.

A fourth general principle is the practice of adding new constraints during optimization.

At the most abstract level, the aim is not to add arbitrary new constraints inferred indiscrimi- nately from arbitrary combinations of original ones, but rather to reason over the most relevant constraints with respect to the current state of the optimization process.

Finally, we will consider additional parallels between alternative branches of research aside from the main line of study taken here, with an eye to related and future work.

5.1.1 Representing Constraint/Probabilistic Reasoning as Discrete/Continuous Optimization

We begin by comparing marginal estimation on a constraint satisfaction problem (i.e., “bias

estimation”) with the task of solving a constraint satisfaction problem outright. The ultimate

aim is to state both as optimization problems, distinguished by continuous versus discrete do-

mains. Within this framework, we will translate equivalent techniques between the two areas,

applying new semantics to quantities that have previously been poorly understood. A further

contribution is the identification of techniques that have been developed primarily in just one

of the areas and that can be transferred to the other; one example that we realize in the next

chapter is the application of global constraints from constraint satisfaction to the task of bias

estimation.

First, recall the factor graph format of a constraint satisfaction problem Φ = (X,F ), repro-

duced from Section 3.1 in fully expanded form:

$$\sum_{\vec{x}} \; \prod_{f_a \in F} f_a(\vec{x}|_{\sigma_a}) \qquad (5.1)$$

As stated before, $\vec{x}$ is a configuration of the variables in X and the factors are 0/1 functions by definition. In the expression, we sum over configurations and, for each one, determine whether the product of constraint instantiations is 0 or 1. So, the expression is positive if and only if the CSP is satisfiable, and zero if and only if the CSP is unsatisfiable. To state the problem as an optimization, we change the representation of configurations from a vector of variable values $\vec{x}$ to an indicator for each variable/value pair. Specifically, we define Ii[vj] ∈ {0, 1} for each xi ∈ X and each vj ∈ D; recall that D is the (common) domain for each variable. Here, the indicator Ii[vj] equals 1 if the configuration assigns value vj to variable xi; otherwise it equals 0. Notationally, the Ii’s are grouped into a single vector I. By these notational means, we can state an alternative optimization-based representation of the constraint satisfaction problem, as depicted in Figure 5.1.

$$\max_{\mathbf{I}} \; \prod_{f_a \in F} \left[ \sum_{z_a} \Big( \prod_{x_i \in \sigma_a} I_i[z_a(x_i)] \Big) \cdot f_a(z_a) \right]$$

$$\text{such that } \forall i, j: I_i[v_j] \in \{0, 1\} \quad \text{and} \quad \forall i: \sum_{v_j} I_i[v_j] = 1.$$

Figure 5.1: Optimization Representation of the Constraint Satisfaction Problem.

In this optimization, the two sets of constraints ensure that I represents a valid configuration of the variables in X: a given variable either holds a given value, or it does not, and it must hold exactly one of its values. The objective function is a product over constraints: the expression inside square braces evaluates to 0 if a constraint is violated, and 1 if it is satisfied. So, the entire objective's maximum value is either 1, indicating satisfiability (with a solution encoded by the maximizing I), or 0, indicating unsatisfiability.

To parse the contents of the square braces, recall that za denotes a factor extension. We iterate (by summation) over all possible extensions to the current constraint, and determine (by product) whether a given extension is in fact realized by the configuration encoded by I. That is, the innermost parenthesized expression will evaluate to 1 for a unique extension za, in particular the one wherein each variable in the constraint's scope is in fact assigned to the value specified by za; otherwise the expression's value is 0. Having identified the single extension realized by our chosen configuration, we multiply by the constraint evaluation over this extension to produce the desired value representing the violation or satisfaction of the constraint.
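For concreteness, the following minimal sketch evaluates the Figure 5.1 objective for a given hard assignment. The data layout is hypothetical rather than anything prescribed by this thesis: factors[a] maps each scope-ordered value tuple to the 0/1 output of fa, scopes[a] lists the indices of the variables in σa, and I[i][v] holds the indicator values.

```python
from itertools import product

def csp_objective(I, factors, scopes, domain):
    """Return 1 if the configuration encoded by I satisfies every factor,
    and 0 otherwise, exactly as in Figure 5.1."""
    total = 1
    for a, f in enumerate(factors):
        bracket = 0
        for z in product(domain, repeat=len(scopes[a])):  # all extensions z_a
            realized = 1
            for pos, i in enumerate(scopes[a]):           # is z_a the one encoded by I?
                realized *= I[i][z[pos]]
            bracket += realized * f[z]                    # weight by f_a(z_a)
        total *= bracket                                  # product over constraints
    return total
```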

The purpose of stating the constraint satisfaction problem as this somewhat convoluted discrete optimization is to compare with the marginal computation problem, which we first expressed as an optimization in the previous section. Here, instead of "hard" assignments to the variables as by I above, we choose distributions over the variables' values that can be interpreted as "soft" assignments subject to specific constraints. Notationally, recall that we

must determine a single-variable marginal probability θi[vj] for each xi ∈ X, vj ∈ D, as well as marginal distributions on factor extensions, which we will denote θa[za] for each fa ∈ F, za ∈ CONFIGS(σa). We will group these two sets of marginals into a single vector Θ; to obtain an answer to the marginal computation problem we simply read off the single-variable marginals from the maximizing Θ. The optimization itself is over the factor extension marginals, but they are constructed from single-variable marginals according to the characteristic approximation of a particular marginal estimation method. In Figure 5.2, we state the exact optimization in its most general form.

\[
\max_{\Theta} \; \prod_{f_a \in F} \left[ \sum_{z_a} \theta_a(z_a) \cdot f_a(z_a) \right]
\]
such that \(\forall i, j: \theta_i[v_j] \in [0, 1]\), \(\forall i: \sum_{v_j} \theta_i[v_j] = 1\),
\(\forall a, z_a: \theta_a[z_a] \in [0, 1]\), \(\forall a: \sum_{z_a} \theta_a[z_a] = 1\),
and CONSISTENT(Θ).

Figure 5.2: Optimization Representation of Computing Marginals on a CSP.

The optimization pursues the maximum likelihood principle in choosing the fractional assignment Θ that maximizes the factor graph's value \(W_G(\vec{x}) = \prod_{f_a \in F} f_a(\vec{x}|_{\sigma_a})\), under expectation. Because we are now representing marginal probabilities over entire function extensions, we can directly substitute a given extension indicator into the objective, in place of the inner-parenthesized product of Figure 5.1. The difference is that instead of summing over function extensions in order to select the single za that is realized by a configuration I, we now sum over extensions to calculate an expectation for the function evaluation, as weighted by a probability distribution θa(·) over possible extensions. Thus the maximum value of the objective is not 0/1, but in fact the partition function (model count) of the CSP.

As for the constraints, note that we have essentially applied a linear relaxation to the original constraint satisfaction problem: single node marginals are now in [0, 1] instead of {0, 1}, while the same sort of sum-to-one constraint that originally ensured that each Ii was a valid configuration now ensures that θi is a valid distribution. This is the grounds for referring to the act of computing bias as making "soft" or "fractional" variable assignments.

The final constraint simply enforces consistency between the single-variable and function-extension marginals, ensuring that they fall within the marginal polytope of Section 2.2. Turning to approximate marginal computation techniques, recall that a given estimator is distinguished by its approximation of this constraint. By formulating maximum-likelihood marginal estimation and constraint satisfaction under a common abstract optimization format, we can formally recognize the parallels between such constraints and inference methods for CSPs.

Observation 1 (Mean-Field marginal estimation is the continuous analogue of constraint satisfaction with node consistency). Recall from Section 2.2 that one draconian means of enforcing consistency is to state each marginal probability on a function extension as the product of constituent single-variable marginals: \(\forall a, z_a: \theta_a(z_a) = \prod_{x_i \in \sigma_a} \theta_i[z_a(x_i)]\). Under this "mean-field" approximation, there is no correlation between variables, which is a sufficient but not necessary condition on agreement between function extension marginals.

If we use this mean-field equation in place of CONSISTENT(Θ) in Figure 5.2, then we can in fact eliminate the use of function extension marginals entirely. That is, in the objective we simply replace θa(za) with \(\prod_{x_i \in \sigma_a} \theta_i[z_a(x_i)]\), and retain only the first two constraints. The resulting optimization representation of mean-field marginal estimation is as follows:

\[
\max_{\Theta} \; \prod_{f_a \in F} \left[ \sum_{z_a} \left( \prod_{x_i \in \sigma_a} \theta_i[z_a(x_i)] \right) \cdot f_a(z_a) \right] \tag{5.2}
\]
such that \(\forall i, j: \theta_i[v_j] \in [0, 1]\) and \(\forall i: \sum_{v_j} \theta_i[v_j] = 1\).

On producing this representation, we now have exactly the same optimization problem as in Figure 5.1, but with θi[vj] ∈ [0, 1] replacing Ii[vj] ∈ {0, 1}. (Constraint solvers typically enforce node-consistency automatically during pre-processing, by removing values from a variable's domain if they violate any unary constraints on the variable.)
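A companion sketch of the Eq. (5.2) objective, under the same hypothetical data layout as before, makes the relaxation tangible: fractional biases theta[i][v] in [0, 1] replace the 0/1 indicators, and the return value becomes an expected degree of satisfaction rather than a satisfiability flag.

```python
from itertools import product

def mean_field_objective(theta, factors, scopes, domain):
    total = 1.0
    for a, f in enumerate(factors):
        expectation = 0.0
        for z in product(domain, repeat=len(scopes[a])):
            weight = 1.0
            for pos, i in enumerate(scopes[a]):
                weight *= theta[i][z[pos]]   # probability of z_a under independence
            expectation += weight * f[z]     # expected evaluation of f_a
        total *= expectation
    return total
```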

Observation 2 (Belief propagation marginal estimation is the continuous analogue of constraint satisfaction with generalized arc-consistency). Recall, again from Section 2.2, that the belief propagation algorithm is known to optimize the Bethe free energy of its marginal estimates [151]. The Bethe approximation requires that marginal probabilities for factor extensions marginalize down to individual variable marginals:

\[
\forall a, x_i \in \sigma_a, v_j: \; \sum_{z_a:\, z_a(x_i) = v_j} \theta_a[z_a] = \theta_i[v_j].
\]

This requires a certain level of consistency between factor extensions, in that factors that share a variable must marginalize down to the same probability distribution for that individual variable–but only in isolation. Thus, though necessary, this condition is not sufficient for Θ to lie within the marginal polytope. For example, a factor extension marginal θa(x1, x2, x3) might obey \(\sum_{x_2, x_3} \theta_a[v_j, x_2, x_3] = \theta_1[v_j]\) and \(\sum_{x_1, x_3} \theta_a[x_1, v_k, x_3] = \theta_2[v_k]\), and likewise another factor extension marginal θb(x1, x2, x4) might obey \(\sum_{x_2, x_4} \theta_b[v_j, x_2, x_4] = \theta_1[v_j]\) and \(\sum_{x_1, x_4} \theta_b[x_1, v_k, x_4] = \theta_2[v_k]\). But there is no guarantee that \(\sum_{x_3} \theta_a[v_j, v_k, x_3] = \sum_{x_4} \theta_b[v_j, v_k, x_4]\). However, this sort of situation cannot arise if the CSP's factor graph is a tree, because the marginal probability of any pair (x1, x2) will be decomposable as the product of the two individual variable marginal distributions θ1[·] and θ2[·]. Under our optimization

representation of marginal estimation, the belief propagation objective is expressed as follows:

\[
\max_{\Theta} \; \prod_{f_a \in F} \left[ \sum_{z_a} \theta_a(z_a) \cdot f_a(z_a) \right] \tag{5.3}
\]
such that \(\forall i, j: \theta_i[v_j] \in [0, 1]\), \(\forall i: \sum_{v_j} \theta_i[v_j] = 1\),
\(\forall a, z_a: \theta_a[z_a] \in [0, 1]\), \(\forall a: \sum_{z_a} \theta_a[z_a] = 1\),
and \(\forall a, x_i \in \sigma_a, v_j: \sum_{z_a:\, z_a(x_i) = v_j} \theta_a[z_a] = \theta_i[v_j]\).

For comparison we will now represent the constraint satisfaction problem, augmented by the requirement of arc-consistency, as an optimization as well. First, to match our general statement of marginal computation in Figure 5.2, we introduce a direct representation of extension indicators into the inner-parenthesized product in Figure 5.1. Thus we employ discrete indicators Ia[·] and Ii[·] restricted to {0, 1} instead of continuous bias variables θa[·] and θi[·] restricted to [0, 1]:

\[
\max_{I} \; \prod_{f_a \in F} \left[ \sum_{z_a} I_a[z_a] \cdot f_a(z_a) \right]
\]
such that \(\forall i, j: I_i[v_j] \in \{0, 1\}\), \(\forall i: \sum_{v_j} I_i[v_j] = 1\),
\(\forall a, z_a: I_a[z_a] \in \{0, 1\}\), \(\forall a: \sum_{z_a} I_a[z_a] = 1\),
and CONSISTENT(I).

Here the requirement CONSISTENT(I) simply means that a particular extension indicator is 1 if and only if the individual variable indicators for the assignments in that extension are all 1 as well. Now suppose that we augment the optimization process by first adding constraints that disallow the selection of function extensions that violate their associated constraints: \(\forall a, z_a: f_a(z_a) \geq I_a(z_a)\); let these be called "satisfaction constraints." Then, if we

proceed to introduce the Bethe free energy constraints to our optimization, but over indicator variables instead of biases, they will read: \(\forall a, x_i \in \sigma_a, v_j: \sum_{z_a:\, z_a(x_i) = v_j} I_a(z_a) = I_i(v_j)\). To complete the augmentation, we then substitute the satisfaction constraints into these Bethe free energy constraints to produce the following constraints for the optimization: \(\forall a, x_i \in \sigma_a, v_j: \sum_{z_a:\, z_a(x_i) = v_j} f_a(z_a) \geq I_i(v_j)\). The constraint states that we should not attempt a variable assignment if it does not appear in any satisfying extension of some constraint in which the variable appears, which is another way of stating generalized arc-consistency. At the end of this formal path it is possible to just drop the representation of function extension indicators and just add such constraints to the original optimization problem, producing the following:

\[
\max_{I} \; \prod_{f_a \in F} \left[ \sum_{z_a} \left( \prod_{x_i \in \sigma_a} I_i[z_a(x_i)] \right) \cdot f_a(z_a) \right]
\]
such that \(\forall i, j: I_i[v_j] \in \{0, 1\}\), \(\forall i: \sum_{v_j} I_i[v_j] = 1\),
and \(\forall a, x_i \in \sigma_a, v_j: \sum_{z_a:\, z_a(x_i) = v_j} f_a(z_a) \geq I_i(v_j)\).

In the end, then, by taking the Bethe free energy constraints that underlie belief propagation, and applying them directly to the optimization representation of constraint satisfaction, we can produce the optimization representation of constraint satisfaction augmented with arc-consistency. In practice arc-consistency is typically enforced by pre-processing, or continuously during search in the case of unit propagation for SAT.
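The derived condition can be checked mechanically. The sketch below, again with illustrative names rather than code from this project, tests whether a variable/value pair is supported in the generalized arc-consistency sense: the sum of fa over supporting extensions must be nonzero for every constraint whose scope contains the variable.

```python
from itertools import product

def gac_supported(i, v, factors, scopes, domain):
    """True iff value v for variable i appears in at least one satisfying
    extension of every factor containing i (the derived >= constraint)."""
    for a, f in enumerate(factors):
        if i not in scopes[a]:
            continue
        pos = scopes[a].index(i)
        support = sum(f[z]
                      for z in product(domain, repeat=len(scopes[a]))
                      if z[pos] == v)
        if support == 0:          # violates sum f_a(z_a) >= I_i[v_j] with I_i[v_j] = 1
            return False
    return True
```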

Given these example observations, the point of emphasis is not merely that marginal computation is useful for constraint satisfaction, but that in fact they are two versions of the same problem, applied either to a continuous probability space or a discrete space of configurations.

A number of associated correspondences fall into line accordingly, by similar reasoning. For instance, the necessity and insufficiency of local consistency (e.g. arc-consistency) for a CSP reflects the outer-boundedness of approximations (e.g. pairwise/Bethe) for the marginal polytope. We will see below that the two types of belief propagation update messages in Section 1.2.1 have hardened counterparts from constraint reasoning, and that the merging operation of Section 1.2.2 can also be used to reduce the tree-width of a CSP–either directly, or indirectly through the implicit actions of knowledge compilation techniques.

In other words, a solver that performs the sum-product calculation on a constraint satisfaction problem, while making hard variable assignments over hard constraints, is essentially doing a specific form of search and inference to produce a solution to the CSP. An estimator performing exactly the same sum-product calculation, but with soft variable assignments over soft constraints, is essentially doing a specific form of maximum likelihood estimation to produce an approximate survey over the CSP. This is the formal basis for the empirical observations of succeeding chapters, where our bias estimators prove to be powerful but expensive techniques for constraint satisfaction, in comparison to existing heuristics.

In terms of motivating concrete algorithmic improvements, observe that global constraints are widely used within the field of constraint satisfaction in order to support richer problem models, and, when combined with propagators for efficient inference over such constraints, to greatly improve the efficiency of the solving process. However, the equivalent concept has not been widely applied within the probabilistic inference community, due most likely to a historical emphasis on grid MRFs for computer vision, and a cultural dogma that associates tractability with locality. To remedy this situation, there have been some very recent efforts to introduce global constraints to the MPE problem in a vision domain [141], and in the case of the present project we will see global consistency realized by the "EMBP-G" and "EMSP-G" bias estimators appearing in the next chapter. In particular, in deriving these estimators we will apply a global constraint that enriches our model by essentially connecting all variables that appear in any constraint in a given variable's neighborhood–while maintaining efficiency by capturing the corresponding soft propagation method in closed form.

5.1.2 Duality in Probabilistic Reasoning, Constraint Satisfaction, and Numerical/Combinatorial Optimization

Given our optimization-based formulations of constraint satisfaction and bias estimation, it follows naturally that persistent algorithmic patterns should be reflected across the two areas.

First, consider the most general formulation of duality in optimization: the solution to a constrained primal problem corresponds to a solution to some dual problem where the primal constraints are represented by variables whose values represent the tightness or "activation" of such constraints in the primal problem. Another convention in the formulation of duality is that the dual of the dual problem must be equivalent to the primal problem. Because we have described the optimization problems in terms of factor graphs, the dual graph transformation of Section 1.1.2 constitutes a natural formalization of this principle; it is straightforward to ascertain that this formulation meets the conventions for duality employed in the field of optimization.

The purpose of this observation is to illustrate that a number of existing concepts spanning various fields could have been derived by directly applying basic concepts to either the primal or dual factor graph representation of a CSP, or of a bias estimation problem. For example, if we consider a SAT problem in CNF, then the primal problem is a set of conjoined clauses, and the straightforward primal solution method is to search the set of all possible variable assignments, as realized to completion by DPLL. The straightforward dual solution method is to apply inference and merge all clauses into a single disjunction of terms, as realized by DP. This single clause is empty if the problem is unsatisfiable; in any case it constitutes a DNF (disjunction of conjunctions) representation of the original problem. If we consider the dual of a SAT problem in CNF, then variables range over factor extensions and the constraints enforce agreement between extensions on their assignments to shared variables. To search the dual problem by building a partial assignment over such dual variables is to create a subproblem representing the resolution of all clauses in the partial assignment, which is the main object of DP. To apply inference to the dual problem is to reason over the disjunction of possible primal variable assignments, which returns us to the original version of DPLL. In other words, applying the primal solution method (search) to the dual problem corresponds to applying the dual solution method (inference) to the primal problem, and applying the dual solution method to the dual problem corresponds to applying the primal solution method to the primal problem.

This relationship formally justifies the common practice of referring to search as the dual of inference and vice versa; for the primal and dual factor graph formulations of SAT, this is in turn reflected by the fact that CNF and DNF are dual representations of the same theory.

In the case of marginal (or bias) estimation, we have already observed that applying primal/dual optimization yields the variable-to-factor and factor-to-variable update rule structure of belief propagation and associated message-passing algorithms. Given our formulation of calculating marginals as soft constraint satisfaction, then, it is natural to associate these two types of messages with soft primal and dual CSP solution methods. Previously the semantics of such messages were not well-understood, outside of special-case analysis that partially captures the main idea described here [47] when applied to extreme values. Later, the semantics of the messages were linked operationally to the Bethe free energy underlying the present account [151]. Now, we can assume the perspective that a variable-to-factor message essentially performs a variable elimination of the variable in question as encoded by an (unnormalized) distribution over its domain; this amounts to a soft, simultaneous conditioning of the variable onto all its possible values at once. The factor-to-variable message does the same in the dual; in other words, the factors iterate over all their extensions at once in order to summarize their influence on the variable in question; this amounts to an inference procedure over all the constraints in a variable's neighborhood. The message-passing rules are reproduced below as Figure 5.3.

\[
\mu_{x_i \to f_a}(v) = \prod_{f_b \in \eta_i \setminus \{f_a\}} \mu_{f_b \to x_i}(v)
\]
\[
\mu_{f_a \to x_i}(v) = \sum_{z_a:\, z_a(x_i) = v} \left[ f_a(z_a) \cdot \prod_{x_j \in \sigma_a \setminus \{x_i\}} \mu_{x_j \to f_a}(z_a(x_j)) \right]
\]

Figure 5.3: Belief Propagation Message Update Rules.
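The following sketch spells out the two update rules of Figure 5.3 in executable form, under an assumed (not canonical) data layout: messages are dictionaries from domain values to unnormalized weights, f2v and v2f hold factor-to-variable and variable-to-factor messages keyed by (factor, variable) and (variable, factor) pairs, and neighbors[i] lists the factors touching xi.

```python
from itertools import product

def msg_var_to_factor(i, a, f2v, neighbors, domain):
    """mu_{x_i -> f_a}(v): product of incoming factor messages, excluding f_a."""
    out = {}
    for v in domain:
        w = 1.0
        for b in neighbors[i]:           # factors touching x_i
            if b != a:
                w *= f2v[(b, i)][v]
        out[v] = w
    return out

def msg_factor_to_var(a, i, f, scope, v2f, domain):
    """mu_{f_a -> x_i}(v): sum over extensions z_a with z_a(x_i) = v."""
    pos = scope.index(i)
    out = {v: 0.0 for v in domain}
    for z in product(domain, repeat=len(scope)):   # all extensions of f_a
        w = f[z]
        for p, j in enumerate(scope):
            if j != i:
                w *= v2f[(j, a)][z[p]]
        out[z[pos]] += w
    return out
```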

A similar process is at the heart of existing CSP methodologies like recursive conditioning [39], which is based on search, and bucket elimination [45], which is based on inference. The duality will be made more explicit by the EMBP method, which was formulated with this principle in mind from the beginning. There the E-step corresponds to inference, whereby we use soft variable assignments to hypothesize a distribution over factor extensions, while the M-step corresponds to search, where we softly assign a variable according to these hypothesized distributions. Table 5.1 summarizes a number of corresponding concepts.

Primal Concept             | Dual Concept               | Principle from CSP or Probabilistic Reasoning
---------------------------|----------------------------|----------------------------------------------
Sum                        | Product                    | Algebraic Semiring Operations
Disjunction                | Conjunction                | Algebraic Semiring, Interpreted for CSP
Conditioning               | Elimination                | (Structural) Graph Simplification
Search (i.e. DPLL)         | Inference (i.e. DP)        | Solution Method for CSP (i.e. SAT)
Variable-to-Factor Message | Factor-to-Variable Message | BP Update Equations for Marginal Estimation
(M)aximization Step        | (E)xpectation Step         | EM Formulation of Marginal Estimation
Entropy Minimization       | Free Energy Minimization   | Gibbs Free Energy Formulation of Marginal Estimation

Table 5.1: Primal/Dual concepts from constraint satisfaction and probabilistic reasoning.

Again the ultimate purpose of making such comparisons is to better understand an array of techniques that have been developed more or less independently along diverging historical paths of research. Placing them into a single space clarifies areas that can yet be developed or improved by transferring concepts from one area to another. For instance, we have observed that conditioning, as realized most commonly by backtracking search, is a fairly straightforward primal method for solving CSPs; any intelligence goes toward identifying variables that form cutsets, backbones, backdoors, etc., as opposed to any notion of "higher-order" conditioning. On the other hand, the notion of inference is highly developed for CSPs, with a number of higher-order conditions on combinations of constraints and variables, many of which are representable within the (i, c)-consistency framework of Section 3.2.2. We have seen in the previous section that global constraints can have an analogue in marginal calculations, for instance by creating closed forms for messages within EMBP. Here we note that the very message-passing format of such algorithms is itself pegged to (1, 1)-consistency: messages in BP or steps in EMBP perform soft generalized arc-consistency, while another choice of consistency level would produce new message-passing formats. There is no reason that messages must be passed exclusively edge-wise, between single variables and single factors; instead groups of variables can send messages to groups of factors and vice versa, just as high-order inference in a CSP propagates hard information between groups of variables and constraint functions. A germ of the same idea is behind region-based approximation methods for calculating marginals, as well as more structural approximations to the task [152, 31, 32, 33]. But by and large the subject remains open to future research, and in all certainty it has not been explored systematically in terms of the account of duality presented here.

5.1.3 Node-Merging

The node-merging operation defined in Section 1.2.2 is another recurrent theme that comes to light when constraint satisfaction and probabilistic inference are formulated in terms of graphical models. A number of algorithms or methodologies in both fields essentially perform just this operation, on the primal or the dual version of the graph, or on a partially conditioned modification of the graph. In particular, we have seen that one way of viewing the Kikuchi approximation to the Gibbs free energy [92], or the generalized belief propagation algorithm [151], is as regular belief propagation on a graph that has merged some of the variable nodes into hypernodes. Likewise the iterative join-graph propagation method takes a more direct approach by explicitly merging certain nodes and then running BP [47]. Here the main motivation is to improve accuracy; in constraint satisfaction the emphasis on improving run-time (at the cost of memory) has spurred the development of explicit node-merging techniques like cluster-tree elimination [89], as well as more refined versions like mini-bucket elimination [50].

Indeed, as an integral component of inference, node merging will solve a problem outright if performed exhaustively. To this extent it can be viewed as a knowledge-compilation technique predicated on distributing product over sum; recall, for instance, that if all variables in a CNF SAT problem are merged in turn, the resulting hypernode will simply be a DNF representation of the same theory. Any repeated elements that can be re-used over different portions of the compilation constitute the space savings accrued by such techniques as binary decision diagrams [103] or AND/OR graphs [49]. To be more specific, consider the (i, c)-framework whereby up to i + 1 variables must be jointly supported by up to c constraints, simultaneously. Any increment in i is enforced, at a conceptual level, by node-merging on the primal problem; the domain of any hyper-node on i + 1 variables can only contain tuples that support the corresponding constraints. Likewise, to increment c we conceptually check any hyper-node in the dual graph, representing combinations of k variables, for emptiness.
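As a small illustration of the operation itself, the sketch below (a hypothetical representation, not drawn from any cited system) merges two variables into a hypernode whose domain is the set of value pairs permitted by the binary factors over them; performing such merges exhaustively solves the problem outright, at exponential memory cost.

```python
from itertools import product

def merge(domain_i, domain_j, factors_ij):
    """Return the filtered joint domain of a hypernode {x_i, x_j}:
    pairs (vi, vj) allowed by every binary 0/1 factor over (x_i, x_j)."""
    return [(vi, vj)
            for vi, vj in product(domain_i, domain_j)
            if all(f[(vi, vj)] == 1 for f in factors_ij)]
```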

As the node-merging idea has already been widely applied in constraint satisfaction and probabilistic inference, the main concrete gain from its formalization is to understand its effects separately from those of other features in approaches that were developed without explicitly formalizing the merging process.

5.1.4 Adding Constraints During the Solving Process

One aspect of the correspondence between constraint satisfaction and probabilistic reasoning that has not been widely exploited is the practice of adding constraints during the problem-solving process. Such techniques have a substantial history within the field of numerical optimization, dating back to the use of cutting planes for mixed integer programming [72]. In constraint satisfaction, clause learning [110, 117] and, to a lesser extent, no-good learning [90] are established practices as well. More recently, the idea has crossed to probabilistic inference [138] in the form of cutting planes and odd-cycle inequalities applied to the marginal polytope. In all these cases the goal is not to learn arbitrary constraints, but ones that will refine or otherwise guide the current state of the solving process.

To date, the systematic study of adding such constraints has not been attempted in a top-down fashion across probabilistic reasoning and constraint satisfaction. In principle, there are as many possible approaches to clause learning or refinements of the marginal polytope as there are kinds of inference. Not all of these approaches will be applicable in every setting, but the openness of this area reflects the historically separate development of the principle within different fields, rather than the impossibility of transferring new inference rules for use in constraint generation.

5.1.5 Correspondences between Alternative Approaches

Finally, we consider some of the alternative techniques and problems that were not ultimately chosen for use in the current project, and identify two main correspondences of interest. In our main line of research we have selected message-passing for estimating marginals instead of Gibbs sampling; to solve constraint problems we integrate such estimates within complete backtracking search, instead of local search. Interestingly, though, we have noted before that Gibbs sampling performs a greedy random walk over the space of configurations in order to locally maximize the number of satisfied constraints. In so doing, it may follow exactly the same rules as a stochastic local search algorithm; the distinction is that the purpose of such a walk under Gibbs sampling is to estimate marginals from the entire trail of visited configurations, rather than to reach a solution configuration outright.

Likewise, our concerns here are the constraint satisfaction and constraint optimization problems, rather than model counting. As noted before, though, the partition function that results from marginalizing our representation of a CSP corresponds exactly to the number of solutions to the problem. Unsurprisingly, then, there are a number of existing efforts to use probabilistic inference for model counting [98, 31, 67, 66]. In the other direction, it is now also well-established that constraint optimization can be used to perform probabilistic inference [28]. The sensibility of such efforts becomes clear upon formulating the two tasks in terms of optimization, as we have done here. In the case of techniques and problems that are excluded from the current project, the persistence of correspondences within such topics only corroborates the claim that constraint satisfaction and probabilistic reasoning are more closely related and amenable to cross-fertilization than might be suggested by the relative isolation of their research communities.

5.2 EMBP: Expectation Maximization Belief Propagation

Now we present a new alternative to belief propagation that combines its variable-to-function/function-to-variable message passing structure with the guaranteed convergence of mean field methods. More importantly, it incorporates an explicit representation for problem-specific constraints, so they can be integrated directly into the basic framework. In general, then, the algorithm's accuracy ranges from that of mean field methods in the absence of any problem-specific design, all the way to matching and exceeding that of belief propagation when configured for specific purposes like the constraint reasoning tasks in Parts II and III. Before configuring the algorithm for such tasks, though, we first explain and derive it here in general form.

Expectation Maximization (EM) [51, 119] is a probabilistic algorithm traditionally used to fit a series of observations Y to a model with parameters Θ, in the presence of hidden variables Z. Starting from an initial random setting of Θ, the Expectation Step ("E-Step") constructs a Q(·) distribution over Z, estimating the likelihood of hidden variable configurations with respect to Θ. Next the Maximization Step ("M-Step") assigns Θ a new value maximizing the expected log-likelihood of the observations according to the current Q(·). The two steps are repeated to guaranteed convergence at a local maximum in likelihood. (Here we can note that the notation for previous topics was selected to suggest a common way of thinking for both EM and BP.)
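A generic skeleton of this loop, with caller-supplied e_step, m_step, and expected_loglik callables standing in for the problem-specific pieces, might look as follows; the names are illustrative only.

```python
def em(theta0, e_step, m_step, expected_loglik, tol=1e-9, max_iters=1000):
    theta = theta0
    prev = float("-inf")
    for _ in range(max_iters):
        Q = e_step(theta)               # E-Step: Q over hidden variables Z
        theta = m_step(Q)               # M-Step: re-fit parameters against Q
        cur = expected_loglik(Q, theta)
        if cur - prev <= tol:           # monotone ascent, so this terminates
            break
        prev = cur
    return theta
```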

Here we will instead view the EM framework as a general primal/dual optimization technique, in order to "re-derive" a formula very similar to BP using the redundant reformulation of a factor graph. The resulting derivation clarifies the relationship between EM, BP, and EMBP, so that a single canonical representation can be applied to future problems with a degree of control over efficiency and approximation that mirrors the similar freedom we observed across the class of variational methods.

5.2.1 Deriving the EMBP Update Rule within the EM Framework.

Here we show how to derive EMBP by treating the extraction of marginals from a factor graph as a maximum-likelihood parameter estimation task within the EM framework. The derivation is motivated by the redundant transformation of a factor graph, but it is different from simply running the standard sum-product algorithm on the transformed graph. Rather, if we view a distribution over configurations from a frequentist perspective, then we can imagine that such configurations represent samples from a generative process driven by the marginal probabilities of individual variables.

That is, a factor graph is in itself a sufficient statistic for a hypothetical process of sampling complete assignments to all of its variables. If we assume an approximate (mean-field) model whereby such configurations are generated by individual variables operating independently, then their individual marginal probabilities constitute the parameters of this model. Thus the task of estimating marginals is reformulated as the task of fitting an approximate model to an imaginary series of observations.

It is also worth noting here that the resulting methodology will not involve an explicit sampling process over hidden variables. Instead we will optimize the setting Θ of parameters with respect to a Q(·) distribution that captures the relative weights of such samples in closed form. To that end, Table 5.2 shows how a factor graph can be recast in terms of the observations Y, hidden variables Z, and parameters Θ of the EM framework.

Intuitively, we will tell EM that we have indirect access to some randomly sampled configuration of the variables in a factor graph. We posit that there is some underlying configuration to the dual variables (factor extensions), but that we were unable to observe this configuration Z. All we know is that the extensions are consistent with one another: Y = [1]^{nm}. (This constant setting of Y is conceptual and does not have to be manipulated directly when deriving EMBP.) Ultimately we ask the framework to hypothesize a (consistent) dual configuration via a Q(·) distribution, and to find the most likely marginals Θ on individual variables with respect to Q(·).

Vector | Status     | Interpretation                                                                                | Domain
-------|------------|-----------------------------------------------------------------------------------------------|-------
Y      | Observed   | Consistency: whether variables agree with extensions, i.e. all c_{i,a} factors evaluate to 1.  | {0, 1}^{nm} (notional)
Z      | Unobserved | Extensions: tuples of variable assignments that instantiate the original factors.              | {z_a}^m, with z_a: σ_a → D
Θ      | Parameters | Marginals: marginal probabilities for original variables (a.k.a. mean parameters).             | {θ_i}^n, with θ_i: D → [0, 1] and \(\sum_{j=1}^{d} \theta_i[v_j] = 1\)

Table 5.2: EM formulation for estimating marginals using the redundant factor graph transformation.

In terms of the redundant transformation of a factor graph, this can be viewed as a primal-dual optimization process where we softly assign factor extensions to their most likely values according to the marginals of variables in their signatures, as well as the weights that the factors assign to resulting instantiations. In turn, we softly assign variables to their most likely values with regard to the most likely extensions. The first operation is the E-Step, and the second is the M-Step; the two are repeated to guaranteed convergence, even on factor graphs with loops. The example redundant factor graph from Chapter 1 is reproduced here as Figure 5.4 for reference within the derivation.

[Figure omitted in this transcription: the redundant factor graph pairs each dual extension node Z1, ..., Z5 with its original factor s1, ..., s5, and connects each extension node to the primal variable nodes X1, ..., X5 in its signature through consistency factors c1,1, c1,3, c2,2, c2,3, c3,3, c3,4, c3,5, c4,4, and c5,5.]

Figure 5.4: Example redundant factor graph.

5.2.2 M-Step

In presenting the derivation we begin with the M-Step for clarity. Here our basic goal is to find variable biases Θ that maximize the expected log-likelihood of consistency Y within some hidden set of factor extensions Z. The expectation is with respect to an artificial distribution over extensions Q(Z) that we will construct in the subsequent E-Step. So, we set \(\Theta' = \operatorname{argmax}_{\Theta} \mathcal{L}(Q(\cdot), \Theta)\), where \(\mathcal{L}(Q(\cdot), \Theta) = \mathcal{F}(Q(\cdot), \Theta) - \mathcal{P}(\Theta)\) is a Lagrangian function comprised of the expectation in question and a penalty term that forces each θi to be a proper probability distribution:

\begin{align}
\mathcal{F}(Q(\cdot), \Theta) &= E_Q[\log P(Z, Y \mid \Theta)], \tag{5.4a}\\
\mathcal{P}(\Theta) &= \sum_{i=1}^{n} \lambda_i \left( \sum_{j=1}^{d} \theta_i[v_j] - 1 \right) \tag{5.4b}
\end{align}

The penalty term P(Θ) introduces a series of Lagrange multipliers, one for each variable xi, that allows unlimited penalty for any bias distribution that does not sum to exactly one. The main object of interest is the expected log-likelihood represented by F(Q(·), Θ):

\begin{align}
\mathcal{F}(\Theta) &= E_Q[\log P(Z, Y \mid \Theta)] \tag{5.5a}\\
&= \sum_{Z} Q(Z) \log P(Z, Y \mid \Theta) \tag{5.5b}\\
&\propto \sum_{Z} Q(Z) \log \prod_{a=1}^{m} p(z_a \mid \Theta) \tag{5.5c}\\
&\cong \sum_{Z} Q(Z) \log \prod_{a=1}^{m} \; \prod_{i:\, x_i \in \sigma_a} \theta_i[z_a(x_i)] \tag{5.5d}\\
&= \sum_{Z} Q(Z) \sum_{a=1}^{m} \; \sum_{i:\, x_i \in \sigma_a} \log \theta_i[z_a(x_i)] \tag{5.5e}\\
&= \sum_{a=1}^{m} \sum_{Z} Q(Z) \sum_{i:\, x_i \in \sigma_a} \log \theta_i[z_a(x_i)] \tag{5.5f}
\end{align}

Our goal is to express the expected log-likelihood in a form that can be readily optimized in terms of Q(·). Recall that Y indicates the observation of consistency between extensions; if we consider only consistent values of Z from this point forward, we can drop Y from future steps.

Thus, Equation (5.5c) focuses on consistent dual configurations, decomposing them into individual extensions. Equation (5.5d) further decomposes each such extension into the individual probabilities of the variable assignments it represents: the probability of generating a specific extension za is the product of probabilities for generating the prescribed assignment za(xi) for each of the variables xi in its signature. Such probabilities are precisely the biases represented by θi. Note that in the dual, this second decomposition represents a mean field approximation that upper-bounds the free energy underlying our EM formulation. Thus, while EMBP is convergent, its results are still approximate.

Equation (5.5e) rearranges the expression by converting logarithms of products into sums of logarithms, and by exchanging summations. Now we are ready to recombine F(Q(·), Θ) with P(Θ) and optimize the convex function L(Q(·), Θ) by setting its first derivative to zero:

\begin{align}
\frac{d\mathcal{L}}{d\theta_i[v_j]} &= 0 \tag{5.6a}\\
\Rightarrow \quad \Bigg( \sum_{a:\, x_i \in \sigma_a} \; \sum_{Z:\, z_a(x_i) = v_j} Q(Z) \cdot \frac{1}{\theta_i[v_j]} \Bigg) - \lambda_i &= 0 \tag{5.6b}\\
\Rightarrow \quad \theta_i[v_j] &= \Bigg( \sum_{a:\, x_i \in \sigma_a} \; \sum_{Z:\, z_a(x_i) = v_j} Q(Z) \Bigg) \Big/ \lambda_i \tag{5.6c}\\
\Rightarrow \quad \lambda_i &= \sum_{j=1}^{d} \Bigg( \sum_{a:\, x_i \in \sigma_a} \; \sum_{Z:\, z_a(x_i) = v_j} Q(Z) \Bigg) \tag{5.6d}
\end{align}

We treat each marginal probability θi[vj] as a variable and differentiate L(Θ) with respect to each such component of Θ. Thus Equation (5.6b) constitutes an M-Step rule for updating a variable xi's bias toward value vj with respect to a fixed Q(·). The numerator directs us to sum through the signatures in which xi appears, and add up the Q(Z)-weights of all dual configurations Z wherein the corresponding extension has xi set to vj. The denominator is just a normalizing function to make θi a proper distribution, as in (5.6c); this can be confirmed by summing both sides of (5.6b) over j from 1 to d.

In reference to the redundant transformation of the factor graph in Figure 5.4, the formula specifies, for example, how to set the multinomial parameter for the assignment x3 = v according to a presumed Q(·) distribution over values of z3, z4, and z5. For each of these extensions, we determine the marginal probability that they project x3 to v, by summation. These values are then arithmetically averaged to create an overall weight for x3 to hold the value v. Weights for other values in D can be constructed in the same way, and then normalized to form a proper probability distribution θ3(·).
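In code, the M-Step of (5.6b)/(5.6c) is a short normalization loop. The sketch below assumes hypothetical q[a] tables mapping each extension tuple of factor a to its current probability, with the other names as in the earlier snippets.

```python
def m_step(i, neighbors, q, scopes, domain):
    """New bias for variable i: for each value, sum the probability mass
    that each neighboring factor's q_a places on that assignment, then
    normalize (the normalizer plays the role of lambda_i)."""
    weights = {}
    for v in domain:
        w = 0.0
        for a in neighbors[i]:                       # factors with x_i in scope
            pos = scopes[a].index(i)
            w += sum(p for z, p in q[a].items() if z[pos] == v)
        weights[v] = w
    norm = sum(weights.values())
    return {v: w / norm for v, w in weights.items()}
```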

5.2.3 E-Step

For the second part of our derivation we use an existing estimate Θ of variable marginals to construct a new Q(·) distribution over factor extensions. In particular, pursuant to the EM framework, we will set Q(Z) = P(Z|X, Θ). In fact, it is straightforward to observe that this is the maximizer of the log-likelihood function L(Q(·), Θ) for a fixed Θ.

Note, however, that we do not necessarily have to construct an explicit table of probabilities over all possible consistent configurations to a dual problem. If in possession of such a table, we would only use it for the specific operation of summing rows where specific extensions project specific variables to specific values, as in the numerator of the M-Step (5.6b). The relevant operation is reproduced as the left-hand expression below:

\[
\sum_{a:\, x_i \in \sigma_a} \; \sum_{Z:\, z_a(x_i) = v_j} Q(Z) \;=\; \sum_{a:\, x_i \in \sigma_a} \; \sum_{z_a:\, z_a(x_i) = v_j} q_a(z_a) \tag{5.7}
\]

Later we can use closed forms to represent the left-hand side of (5.7) as a whole; in general, though, we can decompose Q(Z) into a set of independent distributions {qa(za), 1 ≤ a ≤ m} over extensions, just as we applied the mean-field approximation to the primal variables. (For traditional applications of EM this decomposition is not considered an approximation, as the components of Z are considered latent variables driving independent trials of some probabilistic process; but here the Z's represent function extensions and are thus certainly correlated.)

Under this approximation, we now need only consider the sum of qa probabilities for individual extensions za that support a specific variable assignment xi = vj. Taken literally, the time complexity of this process is exponential over the size of σa, but the closed forms for specific applications can reduce this dramatically–for instance to linear time over signatures in the case of SAT. Still, here we present the process in its most general form: we determine the probability that an extension supports a particular variable assignment by marginalizing out all of its remaining variables. This is analogous to the "summary" operation in, for example, [101], and similarly, it encumbers us with some extra notation.

The s-scope \(\sigma_{a-i} = \sigma_a - \{x_i\}\) of a factor fa with respect to a variable xi represents the set of variables in the domain of fa, excluding xi.

The s-extension \(z_{a-i}: \sigma_{a-i} \to D\) of fa with respect to xi is an extension representing an assignment to all the variables in the domain of fa, excluding xi.

The s-instantiation \(f_a(z_{a-i}, x_i = v_j) \in \mathbb{R}^+ \cup \{0\}\) of a factor fa with respect to a corresponding s-extension \(z_{a-i}\) and variable assignment xi = vj represents the value of fa when evaluated with xi set to vj and all other arguments set according to \(z_{a-i}\). Using this notation we can finally express the summary process for estimating the probability of a variable assignment with respect to a specific factor:

\[
\sum_{z_a:\, z_a(x_i) = v_j} q_a(z_a) \;\propto\; \sum_{z_{a-i}} \left[ \left( \prod_{i':\, x_{i'} \in \sigma_{a-i}} \theta_{i'}[z_{a-i}(x_{i'})] \right) \cdot f_a(z_{a-i}, x_i = v_j) \right] \tag{5.8}
\]

The right-hand side can be normalized across all extensions to produce proper equality, but because we will ultimately substitute this expression into the M-Step (Equation (5.6b)), which will itself be normalized, the step is unnecessary here. The formula weights the probability of tuple za mapping xi to vj by iterating through all dual values that do so and summing their individual probabilities.

Estimating such individual probabilities decomposes into the two factors at the right-hand side of Eq. (5.8). The first exhibits a generative flavor in considering the likelihood of producing a given tuple by way of the individual marginals of its constituent variables. The second expression factors in the a priori likelihood of ever seeing this tuple, by using it to instantiate the associated factor.

Returning to the redundant graph in Figure 5.4, for example, and assuming a domain of D = {0, 1}, we can apply Eq. (5.7) to the assignment x3 = 0. For the M-Step, we only need to know the sum, over extensions z3, z4, and z5, of the Q(Z) probabilities for dual configurations wherein the current extension includes x3 = 0. By the equation, we can approximate the term for z3, for example, by consulting a decomposed q3(·) distribution over the series of extension values z3 that project z3(x3) = 0. Equation (5.8) gives the expression that we substitute for this sum of q3(z3)'s:

\[
\sum_{z_3:\, z_3(x_3) = 0} q_3(z_3) \;\propto\;
\theta_1[0] \cdot \theta_2[0] \cdot f_3(0, 0, 0)
+ \theta_1[0] \cdot \theta_2[1] \cdot f_3(0, 1, 0)
+ \theta_1[1] \cdot \theta_2[0] \cdot f_3(1, 0, 0)
+ \theta_1[1] \cdot \theta_2[1] \cdot f_3(1, 1, 0) \tag{5.9}
\]

For each pair of assignments for x1 and x2, we multiply the likelihoods of generating these two assignments (according to θ1 and θ2) with the value of f3 when they are combined with the assignment x3 = 0. (In the above example we assumed that the arguments to f3 are ordered lexicographically, as in f3(x1, x2, x3).)
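The same summary operation reads naturally as a loop over s-extensions; the following sketch (illustrative layout as before) computes the unnormalized support that a factor lends to a single variable assignment, per Eq. (5.8).

```python
from itertools import product

def e_step_support(a, i, v, theta, f, scopes, domain):
    """Unnormalized probability that factor a supports x_i = v, summing the
    generative weight of each s-extension times the s-instantiation of f_a."""
    pos = scopes[a].index(i)
    others = [j for j in scopes[a] if j != i]           # the s-scope sigma_{a-i}
    total = 0.0
    for z_rest in product(domain, repeat=len(others)):  # s-extensions z_{a-i}
        w = 1.0
        for j, vj in zip(others, z_rest):
            w *= theta[j][vj]                           # theta_{i'}[z_{a-i}(x_{i'})]
        z = list(z_rest)
        z.insert(pos, v)                                # s-instantiation with x_i = v
        total += w * f[tuple(z)]
    return total
```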

5.2.4 The EMBP Update Rule

To complete the derivation of EMBP we can substitute Equations (5.7) and (5.8) into (5.6b), producing the single update rule for individual biases shown in Figure 5.5. (Again the denominator Ni is a normalizing value representing the summation of the numerator over 1 ≤ j ≤ d.)

\[
\theta_i[v_j] \;\leftarrow\; \left( \sum_{a:\, x_i \in \sigma_a} \; \sum_{z_{a-i}} \left[ \left( \prod_{i':\, x_{i'} \in \sigma_{a-i}} \theta_{i'}[z_{a-i}(x_{i'})] \right) \cdot f_a(z_{a-i}, x_i = v_j) \right] \right) \Big/ \; N_i
\]

Figure 5.5: The completed EMBP update rule.

The overall EMBP algorithm begins by selecting an initial Θ at random. We then consider each variable xi in sequence, calculating the numerator in Figure 5.5 for each of xi's possible values. Summing these weights produces the normalizer Ni and we can now update the bias distribution θi before proceeding in sequence to the next variable. We can call a single pass through all the variables an "iteration;" at the end of each we check whether any bias has changed since the previous iteration. If so, we begin a new pass through the variables; otherwise we terminate with Θ set to a local maximum in approximate likelihood. Due entirely to the properties of EM, the overall EMBP algorithm is guaranteed to converge after some number of iterations.
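Putting the pieces together, a minimal version of this loop, built on the hypothetical e_step_support helper sketched earlier (all names remain illustrative), could read as follows; neighbors maps each variable index to the factors containing it.

```python
def embp(theta, factors, scopes, neighbors, domain, tol=1e-6, max_iters=100):
    """Sequentially recompute each bias from the Figure 5.5 numerator,
    normalize, and repeat until no bias changes (guaranteed by EM)."""
    for _ in range(max_iters):
        changed = False
        for i in neighbors:                              # one "iteration"
            weights = {v: sum(e_step_support(a, i, v, theta, factors[a],
                                             scopes, domain)
                              for a in neighbors[i])
                       for v in domain}
            norm = sum(weights.values())                 # the normalizer N_i
            new = {v: w / norm for v, w in weights.items()}
            if any(abs(new[v] - theta[i][v]) > tol for v in domain):
                changed = True
            theta[i] = new
        if not changed:                                  # converged
            break
    return theta
```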

Theorem 5.2.1 (Convergence of EMBP). With each iteration of the EMBP update rule in Figure 5.5, the expected log-likelihood F(Q(·), Θ) (Equation (5.4)) increases monotonically.

Proof. The proof follows directly from the EM framework. Let Θ and Θ′ denote the settings of Θ before and after a given iteration, respectively, and likewise for Q(·) and Q′(·). By construction, F(Q′(·), Θ′) ≥ F(Q′(·), Θ); during the M-step Θ′ is set by maximizing F(·) given Q′(·). Similarly, F(Q′(·), Θ) ≥ F(Q(·), Θ); during the E-step the chosen Q′(·) maximizes F(·) given Θ, within the space of product-decomposable distributions.

In future chapters we will substitute specialized closed-form representations of Q(·) for constraint satisfaction and constraint optimization, in place of the inner-parenthesized product decomposition of Figure 5.5. As noted in existing studies of the EM algorithm itself [119], strictly maximizing the expected log-likelihood with respect to Q(·) is not necessary for convergence, so long as each successive choice of Q(·) increases this function in comparison to the previous one, as in the second step of the proof.

The complexity of a single iteration of EMBP is bounded by \(O(m^2 n d^k)\), where k is the size of the largest scope across functions in a given factor graph. (Recall that n is the number of variables, m is the number of factors, and d is the size of the variables' domain.) As with the general class of variational methods, though, specific domains can be amenable to closed-form representations that seek to reduce the runtime or the degree of approximation for the general expression. Doing just this will produce the family of bias estimators that comprise the algorithmic contribution of this project.

5.2.5 Relation to (Loopy) BP and Other Methods

When written out within the same notational framework as used here, the regular Belief Propagation algorithm produces almost the same update rule as in Figure 5.5!¹ The only difference is the outer summation over factors that neighbor a given variable: changing this summation into a product produces regular BP. In EMBP, the summation results naturally from taking logarithms and applying Jensen's Inequality to get guaranteed convergence [51]. This produces a softer, arithmetic average over the weights that a variable node receives from its neighboring factors. On the other hand, the harsher, geometric average of BP embodies its inherent non-convergence. If both approaches are viewed as coordinate ascent in likelihood over the space of bias distributions [119, 151], it is possible to demonstrate that EMBP takes smaller steps that are guaranteed never to increase the distance to the nearest local maximum, while BP can take larger ones that overshoot maxima and produce orbiting non-convergence.
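A tiny numerical illustration (with made-up weights) of this one-line difference: EMBP combines a variable's incoming factor weights by arithmetic mean, while BP combines them multiplicatively.

```python
supports = [0.9, 0.5, 0.1]          # weights from three neighboring factors

embp_weight = sum(supports) / len(supports)   # softer, arithmetic: 0.5
bp_weight = 1.0
for s in supports:
    bp_weight *= s                            # harsher, geometric flavor: 0.045
print(embp_weight, bp_weight)
```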

The otherwise close correspondence between EMBP and BP makes sense in light of how EM and BP are both based on the minimization of the Gibbs free energy [119, 151]. Each seeks to minimize the KL-Divergence between two distributions; one goal of deriving EMBP from first principles is to demonstrate the primal/dual nature of this process. In BP it is clear that variables weight themselves by summarizing reports from surrounding factors. The EM framework emphasizes that such reports should themselves be interpreted as summaries over constraint extensions, and provides a different way of combining them that produces convergence. Such a means of weighting constraint extensions in the dual problem can be viewed as a limited form of node merging (Section 1.2.2), providing another perspective on how EMBP is able to handle loops.

Such similarities beg the question of how EMBP differs from EM itself. At the lowest level, it cannot be different at all, as it was derived entirely within the EM framework. Rather, the significance of the primal/dual derivation lies in its isolation of the relationship between parameter estimation and inferring approximate marginals, and the similarity between the resulting rule and regular BP–which was designed with no such relationship in mind. That being said, there are a number of other methods that can also address various shortcomings of BP. Expectation Propagation (EP) [113] and related "moment-matching" methods assume a loosely similar perspective to EMBP's in seeking to summarize the behavior of factors as though they were variables, but such methods do not guarantee convergence. In the previous section, we also considered research centered on the "marginal polytope" of valid marginal distributions for a given graph structure [138, 143]. Together with novel approximations to the entropy term of the free energy underlying such an optimization, alternative approximations to the marginal polytope can produce new methods that also converge, but that minimize different objective functions than those underlying BP and EMBP. Finally, Generalized Belief Propagation (GBP) [151] has also been previously mentioned as another means of improving on regular belief propagation.

¹ Usually BP is written as two sets of rules, but in the context of sequential updates there is a natural way to combine them into a single rule, since estimated marginals for a variable node can be calculated at any point in the algorithm by taking the product of incoming messages [101].

5.2.6 Practical Yield of EMBP

Here we have derived a general EMBP framework for estimating marginal probabilities from a factor graph. The end result is the update rule in Figure 5.5; for a given application, an algorithm designer can substitute a closed form expression for the expression in square brackets, instead of the product decomposition used in the equation. Moving forward, the next chapter will consider constraint satisfaction and constraint optimization problems, and how to encode uniform distributions over their solutions in factor graph form. Ultimately, in Parts II and III we will be able to substitute closed forms for such problems, and produce a family of marginal estimators especially designed to compute surveys over constraint problems. But first, we will define such problems within the factor graph representational framework, consider sophisticated models of such problems, and integrate all of these subjects with the probabilistic concepts considered to this point. It is this integration that motivates the use of the closed forms and the resulting rules that we will eventually embed within actual solvers.

Summary. With this chapter we conclude Part I of our presentation, having contributed a formal comparison of marginal estimation and constraint satisfaction. The practical yield is the general EMBP marginal estimation framework. In Parts II and III we will apply EMBP to constraint satisfaction and constraint optimization, respectively, as represented by the SAT and MaxSAT problems. In each such part, we begin with a chapter that derives and explains custom bias estimation rules for the corresponding problem, based on EMBP. A second chapter then considers the integration of such rules with a full-featured, modern constraint solver. In the case of Part III, a third chapter will additionally concern a lower-bounding technique that is only relevant in the case of constraint optimization.

Part II

Solving Constraint Satisfaction Problems

Chapter 6

A Family of Bias Estimators for Constraint Satisfaction

Up to this point we have laid out a formal foundation for integrating probabilistic and constraint-oriented solution methods–both conceptually and, more tangibly, by introducing the expectation maximization-belief propagation ("EMBP") framework. In this chapter we turn to the algorithmic contributions that result from this foundation, focusing first on the constraint satisfaction problem before turning to constraint optimization in Part III. Given our conceptual recognition of message-passing for marginal estimation as a relaxed analogue of search and inference, it is well motivated to consider how such methods can be used to solve CSPs. To this end, we will first review existing approaches, primarily those that combine regular belief propagation with a decimation-based search framework. Then, motivated by Chapter 4's analysis of the structure underlying hard problems, we identify desired properties for our own solver.

We then turn to the main subject of this chapter, a new family of bias estimators built upon EMBP. Recall that as defined in most general form in Chapter 5, the EMBP framework uses a marginal estimation update rule that sums the probabilities of various function extensions according to a hypothesized Q-distribution. In that chapter, the final update rule employed a general product decomposition to represent Q(·). But, motivated by our desired properties for a bias estimator, we will presently introduce closed forms that exploit the special 0/1 structure of constraint satisfaction. The particular choice of closed form depends on the relatively "local" or "global" character of its enabling constraint, as well as its choice of model for configurations of variables.

Such rules have proved useful for general constraint satisfaction problems like quasigroups with holes, nonograms, and generalized Latin squares [88, 78, 102]. In fact, the most recent of these studies has demonstrated the usefulness of the same type of global consistency criterion that characterizes the best-performing bias estimator to be derived in this chapter. Here, though, we restrict our treatment to the basic Boolean satisfiability problem, both for simplicity of presentation, and because SAT has undergone closer study under the present general framework of applying bias estimation to search [22, 79, 81, 80]. So, by the end of this chapter we will have in hand a total of eight bias estimation methods: four novel methods based on EMBP, plus regular BP, regular SP, and two low-computational-expense control methods that represent simpler heuristics than those arising directly from probabilistic models. Chapter 7 will then install the estimators within a state-of-the-art SAT solver, and evaluate the results.

6.1 Existing Methods and Design Criteria

The general interchange of ideas between research on probabilistic graphical models and constraint networks is well-motivated by the formal connections described up to this point, and there is a correspondingly active field of recent research that has begun to exploit this interchange. Outside of the constraint satisfaction problem, ongoing projects have also tied probabilistic methods to other forms of constraint reasoning, by using message-passing or sampling to perform approximate model counting [64, 65, 67, 98], or, in the other direction, by using exact model counting to perform exact probabilistic inference [28]. Also within the area of probabilistic inference on graphical models, an ongoing direction of research has been to develop new approximation and bounding techniques that are often analogous to graph concepts within the constraint satisfaction area of study [30, 31, 33]. Another series of approaches performs a similar process to bias estimation without relying on probabilistic reasoning [153, 154]. The object of this research is again to develop heuristics that can identify variable assignments that are most likely to occur in solutions, but exactly and locally to individual constraints. In particular, such "solution-counting" heuristics rely on specialized algorithms for exactly calculating the number of satisfying extensions to an individual constraint that contain a given variable assignment; here the approximation is that such extensions are not reconciled across constraints for consistency.

The application of probabilistic bias estimation to constraint satisfaction, though, represents its own well-defined sub-area of study within constraint programming. Aside from the more general CSP applications listed above, its most influential application has been to SAT problems by way of the Survey Propagation ("SP") algorithm [112, 23, 22, 97, 107]. Here we distinguish between the three-state survey propagation model of Chapter 4, where variables can be positively constrained, negatively constrained, or not constrained at all, and the survey propagation algorithm, which combines belief propagation over this model with a specific overall search framework based on decimation. (As discussed shortly, there has been no published effort to produce a complete solver that uses survey propagation, up until the current project.)

The survey propagation algorithm as implemented by the publicly available "sp-1.4" code [25] can be explained as follows. The algorithm repeatedly computes surveys using the survey propagation bias estimator. Upon computing a survey, it assigns a set ("block") of variables according to the "decimation" block size parameter. The block is chosen by sorting the variables according to the absolute value of the difference between their positive and negative biases. The top decimate variables are chosen for the block, and they are assigned to the value to which they are more strongly biased. After assigning a block, the algorithm removes any newly satisfied clauses from the theory, reduces the remaining clauses to reflect the assignment, and performs unit propagation, cf. Section 3.2.2.

Algorithm 9: Survey Propagation Solver Framework
Input: SAT theory Φ, strength threshold thresh, decimation block size decimate.
Output: Solution configuration, or else FAIL.
1 Compute initial survey Θ over initial theory Φ using SP.
2 while (maximal strength across bias distributions > thresh) do
3     Update Φ by fixing the decimate most strongly-biased variables toward their stronger polarity.
4     Simplify Φ, using unit propagation.
5     If this produces an empty clause, output FAIL.
6     If this completes a solution configuration, output it.
7     Compute new survey Θ over current theory Φ using SP.
8 end
9 Run WALKSAT local search on Φ for a given time period. Output solution if found, otherwise output FAIL.

If the assignment has violated the theory, the process simply fails; the theory may or may not be satisfiable. Otherwise, if assigning the decimation block has not violated any clauses, the process repeats with a new survey and another decimation. If at some point all the variables are successfully assigned by these means, then the algorithm terminates with a solution.
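To make the block-selection step concrete, here is a minimal sketch of one decimation round; the survey layout (a map from each unassigned variable to a (positive, negative) bias pair) and the helper name are assumptions for illustration, not the sp-1.4 implementation:

    # Minimal sketch of one decimation step (hypothetical data layout).
    def decimation_block(survey, decimate):
        """Pick the `decimate` variables with the largest bias gap and
        assign each to its more strongly biased polarity."""
        ranked = sorted(survey,
                        key=lambda v: abs(survey[v][0] - survey[v][1]),
                        reverse=True)
        # True means "set positive"; ties default to positive.
        return {v: survey[v][0] >= survey[v][1] for v in ranked[:decimate]}

For example, with biases {1: (0.9, 0.1), 2: (0.4, 0.6), 3: (0.55, 0.45)} and decimate = 2, variables 1 and 2 form the block, set to true and false respectively.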

However, in practice the algorithm's main loop almost always terminates well before this point, when the strongest bias across variables falls below a minimum "threshold." The strength threshold has two equally compelling motivations. Firstly, each survey is expensive to compute, because of the O(nm) runtime cost for each iteration when n may be on the order of 10^4 for typical random problems, and because of the threat of non-convergence each time we attempt to compute a survey. Though non-convergence can be handled conveniently by setting a maximum number of iterations, or by gradually dampening each step, this can be costly in time. Further, it has been demonstrated that non-convergence by SP is closely correlated with poor marginal estimates [99]. The second motivation is that the need for surveys should decrease along with the constrainedness of successive subproblems, as we continuously seek out the most constrained "backbone" or "near-backbone" variables for assignment. If a problem exhibits the theoretical properties described in Section 4.1, then there should only be a few of these more "important" variables. Once these variables are set, the remaining problem should be easily solved by local search, as realized in this case by the WALKSAT solver [135].

The most salient characteristic of the framework is that it does not perform backtracking. Most likely, one motivation for designing the survey propagation solver in this way is that continuously updating the data structures for passing messages over a factor graph can be complicated to implement, as the current subproblem keeps growing and shrinking with each assignment or backtrack. Thus, the solver is not complete: if it does not find a solution, the theory may or may not be satisfiable. Another motivation, though, is that for the large uniformly random problems that the algorithm was designed to study, backtracking is not always necessary. Indeed, survey propagation has proved very successful on such problems; on structured problems, though, this is not the case, as the solver almost always ends in failure, either by causing a contradiction, or by creating a problem that WALKSAT cannot solve in any reasonable amount of time.

Creating a SAT solver that is robust to a variety of problem classes motivates the use of all modern features like pre-processing, clause learning, restarts, and dynamic variable ordering, not to mention backtracking. It also motivates the design of new bias estimators with complementary strengths and weaknesses to those of the SP bias estimator at the heart of Algorithm 9:

• Guaranteed convergence is not only more formally satisfying, but it will also be a practical necessity, both for efficiency and for finding quality local optima in survey likelihood. And when the estimator does converge, we would like it to do so naturally in as few iterations as possible, instead of, for instance, by a protracted dampening process. Convergence is an inherent property of EMBP.

• Completeness, backtracking, and modern SAT-solving features will require that the estimator be computable without the use of overly sophisticated data structures that are hard to update as search progresses. The update-rule formulation of EMBP is a suitable format for meeting this requirement.

• Efficiency, in terms of computational complexity, will become increasingly important as we accrue longer run times solving larger problems. Within the EMBP framework we would like to design rules that are at least as efficient as those of SP, on a per-iteration basis.

• Accuracy of bias estimates will be important, as before, but slightly less so in two ways. First, with backtracking, a mistake no longer wreaks catastrophe either by eliminating all solutions, or by creating a satisfiable but highly-constrained problem that WALKSAT cannot solve. Secondly, then, it is important to recall the impossibility of eliminating all solutions unless a variable is set one way when its true bias is 100% toward the other polarity. Especially given that an underlying complete solver can be successful in solving highly-constrained sub-problems, then, the concept of a "mistake" is now greatly narrowed in scope.

• "Reliability" can thus be used to designate a more important property than simple accuracy: a more reliable bias estimator does not have to be more accurate on average, over all bias distributions in a survey. But, it should not make an extremely inaccurate estimate for any given variable, especially among those for which its estimates are strongest. For instance, it should not assign a stronger positive bias than negative bias to a backbone variable whose true bias is entirely negative.

6.2 Deriving New Bias Estimators for SAT

Having identified certain desirable properties for the bias estimators that we will use for SAT-solving, we can now present four EMBP update rules that are designed to best meet these goals. By using EMBP, we guarantee convergence; in Chapter 7 we will embed the rules within a full-featured backtracking solver, achieving completeness and robustness over non-random problems. For use in this setting, the rules will be efficient to compute; any loss of accuracy will also be assessed in the next chapter. In this section we will first review the relevant features of the EMBP framework as specialized to SAT, and exhibit the derivation of each rule; for purposes of understanding, we will then interpret each rule operationally, explaining why it is intuitive as a means of biasing a variable according to the biases of surrounding variables and clauses.

The four update rules arise from choosing either the two-valued model of BP, or the three-valued model of SP, and then applying Expectation Maximization (EM) with either a local or global consistency approximation. Thus the four methods are called "EMBP-L," "EMBP-G," "EMSP-L," and "EMSP-G." To review the difference between the BP and SP models, recall that they both attempt to label a variable with '+' or '-'. Importantly, this indicates that the variable is positively or negatively constrained, as opposed to merely appearing positively or negatively in some satisfying assignment. (Hence, our use here of '+' and '-' rather than '0' and '1'.) In other words, when we examine a satisfying assignment there must exist some clause that is entirely dependent on a given variable being positive in order for us to label the variable '+'. Under BP all variables are assumed to be constrained, while SP introduces the additional '*' state indicating that a variable is unconstrained, i.e. all clauses are already satisfied by the other variables.

The algebra of how clauses are supported will come into play when we consider the local and global consistency approximations. The local consistency considered here will correspond to generalized arc-consistency: according to a specific clause c, variable v can hold a value like '+' only if the other variables of c are individually consistent with v being positive. Note that this cannot capture whether v is truly constrained to be positive; each individual clause might report that it is fine with v being labeled '+', but between them there is no way to determine whether v is constrained to be positive. When we move to the more global form of consistency in the following derivations, this shortcoming is remedied at potentially greater computational cost.

6.2.1 The General EMBP Framework for SAT

Recall from Chapter 5 that the EM framework [51] encompasses a vector of observations Y, and seeks some model parameters Θ that maximize the log-likelihood of having seen Y. This likelihood consists of a posterior probability represented by some model designed for the domain at hand; we seek the best setting of Θ for fitting the model to the observations. Maximizing log P(Y|Θ) would ordinarily be straightforward, but for the additional complication that we posit some latent variables Z that contributed to the generation of Y, but that we did not get to observe. That is, we want to set Θ to maximize log P(Y, Z|Θ), but cannot marginalize on Z.

So, we bootstrap by constructing an artificial probability distribution Q(Z) to estimate P(Z|Y, Θ), and then use this distribution to maximize the expected log-likelihood log P(Y, Z|Θ) with respect to Θ. In a sense we simulate the marginalization over Z by using Q(Z) as a surrogate weight:

E_Q[log P(Z, Y|Θ)] = Σ_Z Q(Z) log P(Z, Y|Θ).

The first step of hypothesizing a Q(·) distribution is called the E-Step, and the second phase of setting Θ is the M-Step. The two are repeated until convergence to a local maximum in likelihood, which is guaranteed.
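Schematically, the alternation can be written as a short loop; the function names, the flat parameter sequence, and the convergence test below are illustrative assumptions rather than any particular implementation:

    def em(theta, e_step, m_step, max_iters=100, tol=1e-6):
        """Generic EM skeleton: alternate E- and M-Steps until the parameters
        stabilize (EM guarantees convergence to a local likelihood maximum).
        `theta` is assumed to be a flat sequence of numeric parameters."""
        for _ in range(max_iters):
            q = e_step(theta)        # E-Step: hypothesize Q(Z) to estimate P(Z|Y, theta)
            new_theta = m_step(q)    # M-Step: maximize E_Q[log P(Y, Z | theta)]
            if max(abs(a - b) for a, b in zip(new_theta, theta)) < tol:
                return new_theta
            theta = new_theta
        return theta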

To harness the EM framework for estimating bias in SAT, the trick is to tell EM that we have seen that all the clauses are satisfied, but not how exactly each clause went about choosing satisfying variable assignments for support. We ask the algorithm to find the variable biases that will best support our claimed sighting, via some hypothesized support configuration. This produces the desired P(Θ|SAT).

As in the general derivation, then, Y will be a vector of all 1's; here the meaning is that all clauses are satisfied. (Once again we will not really need to refer to this variable directly when performing our derivation.) Each triple (θ_v(+), θ_v(−), θ_v(∗)) ∈ Θ represents variable v's probability of being positively constrained, negatively constrained, or unconstrained, respectively. In other words, the use of 'θ' to represent bias is no coincidence; the parameters of our model are exactly the bias distributions that comprise the survey that we would ultimately like to compute. Finally, Z contains some support configuration z_c for each clause c, denoting the assignment to the clause's variables that results in satisfaction. In the language of duality, then, z_c represents the (satisfying) extension for c. The overall process derived below can be seen as a primal/dual optimization iterating over the original and extensional forms of the SAT problem. In constructing Q(·) we weight the various values z_c that the clauses can hold, where the values range over tuples corresponding to the clauses' variables. In setting Θ we probabilistically label variables according to the weights of the clause extensions under Q(·).

The end result, by appealing directly to the same derivation as in Chapter 5 (until the very last step of decomposing the Q-distribution) is a general update rule to use as a starting point for the EMBP bias estimators:

θ_v(s) = ω_v^s / N,   where   ω_v^s ≜ Σ_{c ∈ C_v} Σ_{Z : z_c[v] = s} Q(Z).       (6.1)

Here we have introduced some new notation. First recall that N represents the normalizing constant that makes bias distributions sum to one. We introduce ω_v^s to denote variable v's unnormalized weight toward s ∈ {'+', '−', '∗'}. The semantics of (6.1) are that we construct the weight of assignment v = '+', for instance, by averaging over all of v's clauses arithmetically. For each such clause, we determine the probability that its extension is consistent in assigning '+' to v by summing the Q-probabilities of all global dual configurations for which this is the case. Rather than designing a Q-distribution that explicitly represents a probability for all 2^n valid dual configurations, we will substitute specific, approximate, closed forms for these sums over consistent Q(Z)'s, in order to generate the EMBP bias estimators that are the yield of this chapter.

Such closed forms will be expressed in terms of the biases of the other variables appearing in each clause; thus we iterate between using Q(·) to update Θ and using Θ to update Q(·), within a single rule. The formal flexibility to include such forms is a specific feature of EMBP's design, and is key to the creation of the update rules that we will contribute shortly.

First, though, we introduce a few more notational conventions that will be convenient from this point forward. For better readability we will use the notation θ_v^+, θ_v^−, and θ_v^∗ as shorthand for the marginal probabilities θ_v(+), θ_v(−), and θ_v(∗) for each variable v. Also for convenience, we will use square braces for mapping extensions to the values of their constituent variables, in recognition of how they are treated as variables in the sense of the dual problem, and simultaneously, as functions (assignments) as per the basic definitions in Section 1.1.1. Thus, "z_a" refers to a particular choice of extension over the signature of some clause a, while "z_a[x_i]" is the value that this extension assigns to variable x_i.

At this point it is also worth revisiting the difference between regular BP (or SP) and EM-based methods. The weights here are constructed by summation and will be normalized into an arithmetic average. Similar BP/SP rules could have been constructed merely by changing the sums into products, bypassing the entire tree-based message-passing motivation in Chapter 1. Doing so would produce a harsher geometric average. If we view message passing algorithms in terms of conjugate gradient ascent in the space of surveys, this in turn means that BP/SP will take larger steps than EM versions even when they are both following the same gradient to the same local optimum in likelihood. These larger steps are the root of non-convergence for BP/SP. In applying the basic EM framework, we introduce logarithms in order to satisfy Jensen's Inequality [51]; the summations in these weights can be seen as a reflection of these logarithms.

To summarize up to this point, the EMBP methods will all form bias distributions by normalizing over their weights toward various assignments, as below:

θ_v^{+′} ← ω_v^+ / (ω_v^+ + ω_v^− + ω_v^∗)     θ_v^{−′} ← ω_v^− / (ω_v^+ + ω_v^− + ω_v^∗)     θ_v^{∗′} ← ω_v^∗ / (ω_v^+ + ω_v^− + ω_v^∗)       (6.2)

Each ω_v^s is defined as in (6.1); what remains is to choose various closed forms for the summation over Q(Z) probabilities, producing various bias estimator update rules.

6.2.2 Representing the Q-Distribution in Closed Form

As described previously, the EM framework is first initialized with a randomly-generated Θ, which it uses to produce a Q-distribution that is used in turn to produce a new Θ, repeating until guaranteed convergence. In this section we show how the current setting of Θ can be used to produce two different closed-form expressions for Σ_{Z : z_c[v] = s} Q(Z), as per the E-Step of EM; the versions differ in using local or global approximations, and determine whether we produce EMBP-L versus EMBP-G, or EMSP-L versus EMSP-G. In the process of doing so, a specific form arises often enough to merit an abbreviation and some explanation:

σ(v, c) ≜ ∏_{i ∈ V_c^+ \ {v}} θ_i^− · ∏_{j ∈ V_c^− \ {v}} θ_j^+

Figure 6.1: "v is the sole support of c."

The expression σ(v, c) indicates the probability that variable v is the sole support of clause c. We iterate through all the other literals and consult their variables' biases toward the opposite polarity from that of the literal. In other words, the formula forms a product expressing the probability that the variables for all positive literals turned out constrained to be negative, and all negative literals turned out positive. Thus, c is solely dependent on v for support, and v must be positively or negatively constrained, based on whether it appears in c as a positive or negative literal.
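As a concrete illustration, σ(v, c) can be computed directly from the current bias estimates; the clause encoding (a set of signed integer literals) and the bias maps are assumptions for illustration:

    def sole_support(v, clause, theta_pos, theta_neg):
        """sigma(v, c) of Figure 6.1: the probability that variable v is the
        sole support of `clause`, given current bias estimates."""
        prob = 1.0
        for lit in clause:          # e.g. clause = {+1, -2, +3}
            u = abs(lit)
            if u == v:
                continue            # skip v itself
            # A positive literal u fails to support the clause when u is
            # constrained negative, and vice-versa for a negative literal.
            prob *= theta_neg[u] if lit > 0 else theta_pos[u]
        return prob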

Under survey propagation, a variable is free to assume the '*' value if there exists no clause for which it is the sole support. (Whether it actually must do so is controlled by the optional ρ parameter that is not discussed here [107].) Under regular belief propagation, there is no '*' value available, codifying the assumption that every variable must be the sole support of some clause. Pragmatically speaking, this is like evenly dividing the weight assigned to θ_v^∗ between θ_v^+ and θ_v^− after every iteration of the message-passing process.

The first step in representing the summation over Q(Z)'s that appears in (6.1) is to divide a variable's clauses into those in which it appears as a positive literal, and those in which it appears negatively, as shown below for the case of ω_v^+:

ω_v^+ = Σ_{c ∈ C_v} Σ_{Z : z_c[v] = '+'} Q(Z)
      = Σ_{c ∈ C_v^+} Σ_{Z : z_c[v] = '+'} Q(Z) + Σ_{c ∈ C_v^−} Σ_{Z : z_c[v] = '+'} Q(Z)       (6.3)

Intuitively, the weight is just the sum (over each of v's clauses) of probabilities representing whether the dual configuration Z lists v as positive within each clause. By separating this summation into two summations, over v's positive and negative clauses, we can consider the two distinct cases of whether v can be positive according to a clause where it appears as a positive literal, versus the more demanding case whereby it appears as a negative literal. We will first express these two cases using a local-consistency approximation.

6.2.3 Deriving EMBP-L

In Equation 6.3, consider the first summation, over v’s positive clauses. As noted, we will not literally create any sort of a table listing probabilities for all possible total clause configurations that could instantiate Z. Rather, we note that if we could consult such a table, we would iterate through its entries and construct a running sum of probabilities for entries that make v positive for the current clause c.

The local approximation is so-called because it corresponds to generalized arc-consistency; a necessary (but not sufficient) condition for v to be assigned '+' is for every other variable appearing in the clause to be individually set in some way that supports this. For the first summation in (6.3), there is no reason why v could not be '+'; it appears as a positive literal in each c, after all. Actually, to be labeled '+', v must be constrained positively in some clause; within this clause all other variables must be set to unsatisfying polarities. Under generalized arc-consistency, though, no single clause can be certain of determining this scenario individually with respect to v. So, there is a probability of 1 that the other variables in the clause will allow this value, i.e. that they will be generalizedly arc-consistent. Thus, we can replace the inner summation with the constant value of 1. In so substituting we enable a single update rule that performs both of the E- and M-Steps at the same time, by directly representing the E-Step within the rule we derived for the M-Step.

An additional note is that because generalized arc-consistency is necessary but not sufficient, such a substitution represents a (trivial) upper bound on the probability in question. Later we will be able to make tighter bounds by moving to global consistency and to the survey propagation model.

Turning to the second summation, over clauses where v appears as a negative literal, we note that here it is possible for the other variables to be unsupportive of v's assignment to '+'. Specifically, if all other variables are constrained not to satisfy the clause, then v must not be '+'. In other words, the assignment v = '+' is not generalizedly arc-consistent with the other variables exactly when it is the sole support, as denoted by σ(v, c). Thus, instead of explicitly going through all possible extensions as represented by Z and summing the probabilities of those that assign v to '+', we can analytically substitute a single expression for this entire sum. As shown in the second line of the derivation below, this expression is 1 − σ(v, c).

EMBP-L:
ω_v^+ = Σ_{c ∈ C_v^+} Σ_{Z : z_c[v] = '+'} Q(Z) + Σ_{c ∈ C_v^−} Σ_{Z : z_c[v] = '+'} Q(Z)
      ≈ Σ_{c ∈ C_v^+} [1] + Σ_{c ∈ C_v^−} [1 − σ(v, c)]
      = |C_v^+| + |C_v^−| − Σ_{c ∈ C_v^−} σ(v, c)
      = |C_v| − Σ_{c ∈ C_v^−} σ(v, c)       (6.4)

ω_v^− ≈ |C_v| − Σ_{c ∈ C_v^+} σ(v, c)

ω_v^∗ = 0

The remaining lines for ω_v^+ simplify summations over the constant term 1 into cardinalities of the sets being summed over. An analogous process applies for ω_v^−, and ω_v^∗ can just be set to zero since we are using the BP model.
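A minimal sketch of one EMBP-L update appears below, reusing the sole_support() helper from above and normalizing as in Eq. 6.2; the bookkeeping of positive/negative clause lists is an assumed data layout, not a prescribed implementation:

    def embp_l_update(v, pos_clauses, neg_clauses, theta_pos, theta_neg):
        """One EMBP-L update (Eq. 6.4) for variable v; returns new (theta+, theta-)."""
        n_clauses = len(pos_clauses) + len(neg_clauses)
        # The weight toward '+' is reduced by the chance that some negative
        # clause needs v as its sole support, and vice-versa for '-'.
        w_pos = n_clauses - sum(sole_support(v, c, theta_pos, theta_neg)
                                for c in neg_clauses)
        w_neg = n_clauses - sum(sole_support(v, c, theta_pos, theta_neg)
                                for c in pos_clauses)
        total = w_pos + w_neg       # omega* = 0 under the BP model
        return w_pos / total, w_neg / total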

6.2.4 Deriving EMBP-G

An alternative way of representing the summation over Q(Z) applies a global rather than a local sense of consistency to BP, producing the EMBP-G update rule:

EMBP-G:
ω_v^+ = Σ_{c ∈ C_v^+} Σ_{Z : z_c[v] = '+'} Q(Z) + Σ_{c ∈ C_v^−} Σ_{Z : z_c[v] = '+'} Q(Z)
      ≈ Σ_{c ∈ C_v^+} [1] + Σ_{c ∈ C_v^−} [∏_{c′ ∈ C_v^−} (1 − σ(v, c′))]
      = |C_v^+| + |C_v^−| ∏_{c ∈ C_v^−} (1 − σ(v, c))       (6.5)

ω_v^− ≈ |C_v^−| + |C_v^+| ∏_{c ∈ C_v^+} (1 − σ(v, c))

ω_v^∗ = 0

As before, there is no reason why a positive clause would prohibit v from holding the value '+', so the first substitution is again for the constant 1. But turning to the second summation over negative clauses, we take a more global view of consistency in assessing whether v can be positively constrained. Whereas generalized arc-consistency did not care whether other negative clauses would allow this assignment, so long as z_c[v] = '+' for the current clause, we will now enforce consistency across all negative clauses. For example, if c_1 = (¬x ∨ y_1 ∨ z_1) and c_2 = (¬x ∨ y_2 ∨ z_2), then local consistency would implicitly include models where z_{c_1}[x] chooses '+' even if z_{c_2}[x] holds the value '-'. Under global consistency, we say that a negative clause can hold an extension where v is '+' only if all negative clauses allow it. Thus we substitute a product of probabilities for events wherein all negative clauses do not need v as a sole support.

This produces a tighter upper bound on the sum of Q(Z)'s for Z's wherein clause c has v set to '+' in its extension. Why use a local consistency rather than a global consistency approximation, then? For other situations like CSPs, or even for the SP case in SAT, such an approximation can be much faster to compute than a global one. In the particular case of EMBP-L vs. EMBP-G the computational cost is comparable, but it is well-motivated to empirically confirm that EMBP-G's tighter bound does in fact eventually yield a better bias estimator than EMBP-L's, and ultimately a better search heuristic [81].

The rest of the above equations again group constant terms, and present the analogous result for ω_v^−. As before, ω_v^∗ is set to 0.

6.2.5 Deriving EMSP-L

Next, we lift the condition that ω_v^∗ = 0 by switching to the SP model. Now variables can be set to the '*' state whereby they are not the sole support of any clause. Once again we start with local (generalized arc-) consistency.

EMSP-L:
ω_v^+ ≈ |C_v| − Σ_{c ∈ C_v^−} σ(v, c)

ω_v^− ≈ |C_v| − Σ_{c ∈ C_v^+} σ(v, c)

ω_v^∗ = Σ_{c ∈ C_v^+} Σ_{Z : z_c[v] = '∗'} Q(Z) + Σ_{c ∈ C_v^−} Σ_{Z : z_c[v] = '∗'} Q(Z)
      ≈ Σ_{c ∈ C_v^+} [1 − σ(v, c)] + Σ_{c ∈ C_v^−} [1 − σ(v, c)]
      = |C_v| − Σ_{c ∈ C_v} σ(v, c)       (6.6)

The first two rules of EMSP-L, for ω_v^+ and ω_v^−, are actually the same as for EMBP-L. We now consider the question of whether v can be labeled unconstrained, and thus assigned the value '*'. We use the same upper bound that we used on whether a negative clause could make v positive: v must not be the sole support of such a clause. Here, though, we make the substitution twice: v can be the sole support of neither a positive nor a negative clause. Thus the end result has the same form as the other two rules, except that it iterates over all clauses.

6.2.6 Deriving EMSP-G

We now derive the final EMBP update rule by applying global consistency to the SP framework, resulting in EMSP-G. Here, when we determine whether v can be '+', we must consider some stricter criteria than those we faced in working with BP or with local consistency. We begin as usual with the sum of probabilities over Z's where a positive clause c allows the assignment v = '+'. Under global consistency, it is not enough that just c allows it; we should consider all the other positive clauses. And under the survey propagation model, it is not enough that all positive clauses simply allow it; one of them must actively require it, or else the actual state of v is '*' rather than '+'. (This second requirement could not be enforced by any individual clause under local consistency.) Thus for each positive clause we substitute the probability that some positive clause requires v as its sole support. In the derivation below, this probability is calculated as the negation of the event that all positive clauses do not require v.

For the second summation, over negative clauses, we use the same expression that we used before for EMBP-G: an individual negative clause is allowed to assign '+' to v only if no negative clause needs v for sole support. This completes the rule for ω_v^+, and an analogous process holds for ω_v^−:

EMSP-G:
ω_v^+ = Σ_{c ∈ C_v^+} Σ_{Z : z_c[v] = '+'} Q(Z) + Σ_{c ∈ C_v^−} Σ_{Z : z_c[v] = '+'} Q(Z)
      ≈ Σ_{c ∈ C_v^+} [1 − ∏_{c′ ∈ C_v^+} (1 − σ(v, c′))] + Σ_{c ∈ C_v^−} [∏_{c′ ∈ C_v^−} (1 − σ(v, c′))]
      = |C_v^+| [1 − ∏_{c ∈ C_v^+} (1 − σ(v, c))] + |C_v^−| ∏_{c ∈ C_v^−} (1 − σ(v, c))

ω_v^− ≈ |C_v^−| [1 − ∏_{c ∈ C_v^−} (1 − σ(v, c))] + |C_v^+| ∏_{c ∈ C_v^+} (1 − σ(v, c))       (6.7)

ω_v^∗ = Σ_{c ∈ C_v^+} Σ_{Z : z_c[v] = '∗'} Q(Z) + Σ_{c ∈ C_v^−} Σ_{Z : z_c[v] = '∗'} Q(Z)
      ≈ Σ_{c ∈ C_v^+} [∏_{c′ ∈ C_v} (1 − σ(v, c′))] + Σ_{c ∈ C_v^−} [∏_{c′ ∈ C_v} (1 − σ(v, c′))]
      = |C_v| ∏_{c ∈ C_v} (1 − σ(v, c))

The final step is to express the sum of Q(Z)'s for Z's that allow positive or negative clauses to assign '*' to v in their extensions. In either case, this is just the probability that all clauses, be they positive or negative, do not need v. So we substitute the same expression as before to indicate that v is not a sole support, except over both positive and negative clauses, completing the rule for ω_v^∗. This completes the derivation of the four EMBP-based methods for bias estimation in SAT.

All methods are guaranteed to converge, but still seek to maximize the same log-likelihood (Eq. 5.5 in Section 5.2) as regular BP and SP. When BP and SP do converge, they can return better or worse answers than these EMBP-based methods. The next chapter will compare the accuracy of the various methods, in terms of both average accuracy, and in terms of variance. Together with update rules representing regular BP and SP, the four EMBP update rules appear in Figures 6.2 and 6.3.
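To tie the pieces together, the sketch below iterates the EMSP-G rule of Eq. 6.7 to convergence, using the normalization of Eq. 6.2 and the earlier sole_support() helper; the in-place sweep order, tolerance, and data layout are illustrative assumptions:

    def emsp_g_survey(variables, pos_clauses, neg_clauses,
                      theta_pos, theta_neg, tol=1e-4, max_iters=100):
        """Iterate the EMSP-G update (Eq. 6.7) until the survey stabilizes.
        pos_clauses[v]/neg_clauses[v] list the clauses where v appears
        positively/negatively; biases are updated in place (asynchronous sweeps)."""
        for _ in range(max_iters):
            delta = 0.0
            for v in variables:
                prod_pos = prod_neg = 1.0
                for c in pos_clauses[v]:
                    prod_pos *= 1.0 - sole_support(v, c, theta_pos, theta_neg)
                for c in neg_clauses[v]:
                    prod_neg *= 1.0 - sole_support(v, c, theta_pos, theta_neg)
                n_pos, n_neg = len(pos_clauses[v]), len(neg_clauses[v])
                w_pos = n_pos * (1.0 - prod_pos) + n_neg * prod_neg
                w_neg = n_neg * (1.0 - prod_neg) + n_pos * prod_pos
                w_star = (n_pos + n_neg) * prod_pos * prod_neg
                total = w_pos + w_neg + w_star
                if total == 0.0:
                    continue        # variable appears in no clauses; leave as-is
                new_pos, new_neg = w_pos / total, w_neg / total  # Eq. 6.2
                delta = max(delta, abs(new_pos - theta_pos[v]),
                            abs(new_neg - theta_neg[v]))
                theta_pos[v], theta_neg[v] = new_pos, new_neg
            if delta < tol:
                break
        # Any residual '*' mass (1 - theta+ - theta-) is split evenly between
        # the polarities when the final survey is compiled for search.
        return theta_pos, theta_neg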

6.2.7 Other Bias Estimators

Recognizing that most existing heuristics for ordering variables and values during search have some notion of variable "importance" and have some way of setting them to their most likely values, we must question whether the full-blown apparatus of probabilistic reasoning offers anything in addition to a principled formal foundation. Such heuristics are unlikely to have explicitly defined objective functions, or formally-defined approximation relationships with such objectives. But implicitly, they may have a probabilistic flavor in terms of scoring variables according to their surrounding constraints and the settings of other variables, and they may compare favorably or not with probabilistic methods in terms of effectiveness in guiding search versus computational expense. So, here we define two simple control methods for comparison with the probabilistic techniques.

LC ("Literal Count") simplifies a heuristic that was not explicitly designed to rigorously infer bias, but that still comprises the core of a highly successful system for refuting unsatisfiable SAT instances [54]. The heuristic measures the effect of setting a variable v to be positive or negative by directly counting the degrees of freedom for the remaining literals in v's clauses. As such, information propagates between interconnected variables much as update messages travel through the probabilistic methods. If repeated recursively to no end, the heuristic simply solves the problem directly; instead LC performs a single iteration of lookahead for use as an experimental control. The specific update rules for LC are as follows:

ω_v^+ = Σ_{c ∈ C_v^+} ( ∏_{i ∈ V_c^−} |C_i^+| · ∏_{j ∈ V_c^+} |C_j^−| )

ω_v^− = Σ_{c ∈ C_v^−} ( ∏_{i ∈ V_c^−} |C_i^+| · ∏_{j ∈ V_c^+} |C_j^−| )       (6.8)

ω_v^∗ = 0

As before we have specified three weights which can be normalized to form positive, negative and joker biases as in Eq. 6.2; in setting the joker value to zero, the heuristic subscribes to the basic BP model whereby all variables are assumed to be constrained by some clause. Note that the update rules assume a similar form to those of the previously designed techniques, even though they were designed as heuristics using human ingenuity rather than a specific combination of probabilistic model and closed-form approximation. For instance, we can view a variable’s weight toward its positive bias as an additive aggregation, over positive clauses, of a heuristic assessment of how much that clause needs the variable for support. Instead of our more fine-grained formula for sole support, which determines the probability that all of the other variables do not support the current clause by directly consulting the current bias estimate for each such neighbor, we now heuristically estimate this probability, for a given neighbor, by the number of clauses in which it appears under the polarity that does not support the current clause.

CC ("Clause Count") is an even simpler baseline method that just counts the number of clauses containing a given variable as a positive literal, and the number wherein it appears negatively. The ratio of these two counts serves to estimate the variable's bias distribution: θ_v^+ ← |C_v^+|/|C_v| and θ_v^− ← |C_v^−|/|C_v|. It may be of interest that if one were to apply the basic Mean-Field marginal approximation framework of Section 2.1.2 to our probabilistic model of SAT solutions, then the resulting estimator would be precisely CC.
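CC itself is only a few lines under the same assumed clause bookkeeping as in the earlier sketches:

    def clause_count_bias(v, pos_clauses, neg_clauses):
        """CC baseline: estimate v's bias distribution from occurrence counts."""
        n_pos, n_neg = len(pos_clauses[v]), len(neg_clauses[v])
        total = n_pos + n_neg
        return n_pos / total, n_neg / total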

ω_v^+ = ∏_{c ∈ C_v^−} (1 − σ(v, c))          ω_v^+ = |C_v| − Σ_{c ∈ C_v^−} σ(v, c)
ω_v^− = ∏_{c ∈ C_v^+} (1 − σ(v, c))          ω_v^− = |C_v| − Σ_{c ∈ C_v^+} σ(v, c)
(a) Regular BP update rule.                  (b) EMBP-L update rule.

ω_v^+ = |C_v^−| ∏_{c ∈ C_v^−} (1 − σ(v, c)) + |C_v^+|
ω_v^− = |C_v^+| ∏_{c ∈ C_v^+} (1 − σ(v, c)) + |C_v^−|
(c) EMBP-G update rule.

Figure 6.2: Update rules for the belief propagation (BP) family of bias estimators.

6.3 Interpreting the Bias Estimators

In order to better understand the differences between the probabilistic bias estimation methods, we will now interpret them operationally, in terms of arbitrary heuristic quantities, supplementing the denotational semantics arising from their underlying models and the EMBP framework. In light of the argument that non-probabilistic heuristics may calculate similar quantities without appealing to any probabilistic model, it can be illustrative to view the bias estimators as intuitively reasonable rules that someone might have happened to have designed in the absence of such a model.

In general, the bias estimators all weight a variable toward a constrainedly positive or negative assignment, or toward a neutral state; normalization is not really relevant in the context of choosing a value for the variable according to which bias is highest. The estimators' rules construct such weights in different ways, appealing to the "sole-support" formula of Figure 6.1, as well as the number of clauses in which a variable appears as a positive or negative literal. Recall that the rules for the six probabilistic methods described below appear in Figures 6.2 and 6.3.

" # + Q Q + P ωv = (1 − σ(v, c)) · ρ 1 − (1 − σ(v, c)) ωv = |Cv| − σ(v, c) − + − c∈Cv c∈Cv c∈Cv

" # − P ω− = Q (1 − σ(v, c)) · ρ 1 − Q (1 − σ(v, c)) ωv = |Cv| − σ(v, c) v + + − c∈Cv c∈Cv c∈Cv

∗ P ∗ Q ωv = |Cv| − σ(v, c) ωv = (1 − σ(v, c)) c∈Cv c∈Cv (a) Regular SP update rule. (b) EMSP-L update rule. " # + − Q + Q ωv = |Cv | (1 − σ(v, c)) + |Cv | 1 − (1 − σ(v, c)) − + c∈Cv c∈Cv

" # − + Q − Q ωv = |Cv | (1 − σ(v, c)) + |Cv | 1 − (1 − σ(v, c)) + − c∈Cv c∈Cv

∗ Q ωv = |Cv| (1 − σ(v, c)) c∈Cv (c) EMSP-G update rule.

Figure 6.3: Update rules for the survey propagation (SP) family of bias estimators. CHAPTER 6. A FAMILY OF BIAS ESTIMATORS FOR CONSTRAINT SATISFACTION 155

6.2 and 6.3.

BP can first be viewed as generating the probability that v should be positive according to the odds that one of its positive clauses is completely dependent on v for support. That is, v appears as a positive literal in some c ∈ C_v^+ for which every other positive literal i turns out negative (with probability θ_i^−), and for which every negative literal ¬j turns out positive (with probability θ_j^+). This combination of unsatisfying events would be represented by the expression σ(v, c). However, a defining characteristic of BP is its assumption that every v is the sole support of at least one clause. (Further, v cannot simultaneously support both a positive and a negative clause since we are sampling from the space of satisfying assignments.) Thus, we should view Figure 6.2(a) as weighing the probability that no negative clause needs v (implying that v is positive by assumption), versus the probability that no positive clause needs v for support.

EMBP-L introduces summations in contrast to the product-based formulation of BP. This results in milder, arithmetic averages as opposed to harsher geometric averages. This is one reflection of an EM-based method’s convergence versus the non-convergence of regular BP and SP. Recall that all propagation methods can be viewed as energy minimization techniques whose successive updates form paths to local optima in the landscape of survey likelihood [79].

By taking smaller, arithmetic steps, EMBP-L (along with all the other EMBP-based methods) is guaranteed to proceed from its initial estimate to the nearest optimum; BP and SP take larger, geometric steps, and can therefore overshoot optima. This difference explains why BP and SP can explore a larger area of the space of surveys, even when initialized from the same point as EMBP-L, but it also leads to their non-convergence. Empirically, EMBP-L and EMBP-G usually converge in three or four iterations for most of the SAT instances studied in this document, whereas BP and SP typically require at least ten or so, if they converge at all.

Intuitively, the equation in Figure 6.2(b) (additively) reduces the weight on a variable's positive bias according to the chances that it is needed by negative clauses, and vice-versa. Such reductions are taken from a smoothing constant representing the number of clauses a variable appears in overall; highly connected variables might be assumed to have less extreme bias distributions than those with fewer constraints.

EMBP-G is also based on smoother, arithmetic averages, but employs the broader, "global," view of consistency. In the final result, this is partly reflected by the way that Figure 6.2(c) weights a variable's positive bias by going through each negative clause (in multiplying by |C_v^−|) and uniformly adding the chance that all negative clauses are satisfied without v. In contrast, when EMBP-L iterates through the negative clauses, it considers their satisfaction on an individual basis, without regard to how the clauses' means of satisfaction might interact with one another. So local consistency is more sensitive to individual clauses in that it will subtract a different value for each clause from the total weight, instead of using the same value uniformly. At the same time, the uniform value that global consistency does apply for each constraint reflects the satisfaction of all clauses at once.

SP eliminates the assumption that every variable is the sole support of some clause, by introducing the possibility that a variable is not constrained at all in a given satisfying assignment. Thus, in examining the weight on the positive bias in Figure 6.3(a), it is no longer sufficient to represent the probability that no negative clause needs v. Rather, we explicitly factor in the condition that some positive clause needs v, by complementing the probability that no positive clause needs it. This acknowledges the possibility that no negative clause needs v, but no positive clause needs it either. As seen in the equation for ω_v^∗, such mass goes toward the unconstrained "joker" state.

For the purposes of guiding search in the next chapter, any probability mass for θ_v^∗ will be evenly distributed between θ_v^+ and θ_v^− when the final survey is compiled. This reflects how the event of finding a solution with v labeled as unconstrained indicates that there exists one otherwise identical solution with v set to true, and another with it set to false. So while the joker state plays a role between iterations in setting a variable's bias, the final result omits it for the purposes of bias estimation.

EMSP-L and EMSP-G are analogous to their BP counterparts, extended to weight the third '*' state where a variable may be unconstrained. So similarly, they can be understood as convergent alternatives to SP that take a locally or globally consistent view of finding a solution, respectively.

Chapter 7

Using Bias Estimators in Backtracking Search

Having designed a set of customized bias estimators for our specific aims in solving constraint satisfaction problems, we can now take a more applied direction and consider how exactly to use them within an overall solving process. While bias estimators produce intuitively useful information about the solutions to problem instances, they cannot actually solve a problem on their own. We have seen that previous research has embedded them within a "decimation" framework that computes a single sequence of estimates, while setting a block of one or more most strongly-biased variables after each estimate [23]. However, this construction cannot backtrack or take advantage of modern advances in systematic DPLL search like the clause learning inference method (Section 3.2.2) or the use of sophisticated random restart schemes designed to exploit heavy-tail runtime distributions (Section 4.3). Rather, the decimation process either reaches a solution directly by a series of fortuitous variable assignments, or it ends in failure without determining satisfiability or unsatisfiability. Toward the upper reaches of the phase transition threshold, failure occurs about half the time (at α = 4.4 for satisfiable problems). For industrial SAT problems with "real-world" structure, the combination is entirely unusable: the SP bias estimator within a decimation search framework cannot solve such problems after days of trial and error.

To test the hypothesis that probabilistic reasoning can be used for general constraint satisfaction, we too will focus on the area of SAT. The first step will be to present the VARSAT complete SAT-solver, which essentially consists of the full-featured, modern MINISAT solver [55] plus modifications for embedding arbitrary bias estimators as heuristics for ordering both variables and values, simultaneously. Here we will consider the most relevant design issues for making best use of the information contained in surveys, and state the best settings known to date through trial-and-error experimentation. VARSAT represents the first application of bias estimation within a complete solver.

Before considering the overall performance of the resulting solver, though, we first assess the various bias estimation methods in isolation, breaking down the hypothesis by first confirming that the bias estimators accomplish what they were designed to accomplish, and only then assessing whether such accomplishments are useful to search within VARSAT. To that end, we will run the various estimators on smaller problems whose exact bias distributions are known by exhaustive calculation, and thereby assess their accuracy in terms of various metrics of interest.

Finally, we summarize the overall performance of VARSAT, across the entire spectrum of problems used in the 2009 SAT competition. Ultimately we conclude that the estimators do what they were designed to do, in efficiently making accurate and reliable bias estimates, and that these estimates are useful to search in terms of reducing backtracking during search. As an overall solver, the current implementation of VARSAT is superior to the state-of-the-art for complete solvers on random problems, and is competitive with incomplete and complete solvers on other problems, subject to limitations on problem size.

7.1 Architecture and Design Decisions for the VARSAT Integrated Solver

The VARSAT solver uses any given bias estimation method to choose a set of variables and values for assignment, within a complete backtracking solver that exploits all the features native to MINISAT. A high-level pseudo-code representation for VARSAT appears below as Algorithm 10.

7.1.1 The General Design of VARSAT

Though VARSAT is implemented iteratively rather than recursively, for clarity of presentation we closely follow the recursive backtracking algorithm of Chapter 3. Each activation begins by using the built-in random restart and (unit) propagation methods of the MINISAT solver; on backtracking it uses the built-in "first unique implication point" clause learning mechanism [55].

To perform decimation, a queue of variable assignments retains up to decimate variable assignments as decided by a given survey. The contents of the queue are applied in a backtracking fashion, in lieu of any other variable or value ordering scheme, until the queue is empty. When the queue is indeed empty, we either apply the default VSIDS heuristic [55], if surveys have been deactivated, or else proceed to compute a new survey on the current subproblem using a chosen bias estimation method.

On computing a survey, we check whether the maximum difference between positive and negative bias across variables is above the parameterized threshold for continuing to use surveys; if not, we deactivate them throughout the subtree of search rooted at the current node. Otherwise, we consult the survey to fill the decimation queue with our estimate of the decimate most biased variables, and then perform a new activation of the algorithm in order to begin emptying the queue.

Thus the algorithm chooses a variable assignment exclusively by following the decimation queue, or else using the default heuristics if the queue is empty and surveys have been deactivated within the current sub-tree. Whenever such an assignment is chosen, VARSAT performs the branching operation according to the standard mechanisms in MINISAT, while efficiently updating the data structures required for computing surveys.

7.1.2 Specific Design Decisions within VARSAT

Algorithm 10 features a number of points where designers can make decisions that will impact its overall performance on a particular problem or set of problems. Here we consider the five most salient design considerations for the VARSAT system: the decimation block size for fixing variables on completing a survey, a branching strategy for using the survey to order variables and values, the threshold for deactivating the entire bias estimation apparatus, a policy for using learned clauses in surveys, and finally, the choice of bias estimation technique.

Other than the choice of branching strategy, all of these decisions seek a profitable sacrifice in accuracy in return for a reduction in the amount of time devoted to computing surveys. Here it is important to reiterate that in searching for a satisfying solution, "reliability" is more important than pure accuracy. Even if our bias estimator instructs us to set a variable positively when its true bias is 90% negative, there still exists some set of solutions in the resulting subproblem. Thus our system will always proceed directly to a solution without even backtracking, so long as we never set a variable to a polarity for which it has a true bias of zero.

Decimation Block Size The cost of computing surveys is high relative to the other features used in a modern SAT-solver. One way to mitigate this cost is by using a larger decimation block size, i.e. by setting more variables at once each time we compute a survey. To this point we have found that it is still best to use a decimation block size of 1, as a general setting that works well across all domains. The motivation for a small decimation block size is to account for correlations between variable biases. When setting a block of multiple variables, we are approximating a joint probability over their mutual configuration by the product of their individual marginal probabilities. If we instead set a single variable and then compute a new survey, the resulting values are conditioned upon this previous decision.

For instance, a single survey might report that v_1 is usually true within the space of solutions, and that v_2 is usually false, even though the two events happen simultaneously with relative infrequency. Instead of attempting to assign multiple variables at once, then, we assign a single variable first and simplify the resulting problem. In subsequent surveys the other variables' biases would thus be conditioned on this first assignment. In practice the extra accuracy provided by this property is worth the greater computational cost.

A related practice attempts to reduce the amount of time spent on each individual survey, as opposed to the number of surveys overall. Recall that for the bias estimators that are based on update rules, the estimates are initialized arbitrarily. Thus for our first survey along a branch of search, we initially choose values uniformly at random; subsequent surveys are then initialized using the final estimates of the previous one. In terms of accuracy, the general motivation is that in pursuing a given branch of the backtracking search for a SAT solution, we wish to consistently sample the same region within the local search space over surveys. Initializing new surveys where the previous one left off would seem to encourage such continuity, though the phenomenon has not been validated empirically by direct measurement. In terms of efficiency, though, it has been observed that this practice leads to fewer iterations over the update rules, per survey.

Branching Strategy We have also tested several branching strategies for using surveys as variable- and value-ordering heuristics. In addition to the "conflict-avoiding" strategy of setting the most strongly biased variable to its stronger value, we have also tried to "fail-first" or streamline a problem via the "conflict-seeking" strategy of setting the strongest variable to its weaker value [16]. Additional approaches involve different ways of blending the two: for instance, one strategy seeks to trigger propagations and build up a strong database of learned clauses by seeking conflicts, and then proceeds to find a solution within this greatly restricted search space by switching to conflict avoidance. A second motivator for seeking conflicts is unsatisfiable problems. While surveys are not well-defined for such problems, seeking conflicts can lead to shorter proofs of unsatisfiability and thus faster run-times; since we must account for the entire breadth of the search tree, we should order variables so that conflicts occur on the shallowest subtrees possible.

For mixtures of satisfiable and unsatisfiable problems like those comprising the test cases for recent SAT-solving contests, it turns out that in practice the single best strategy is the (presumably) most intuitive one of avoiding conflicts, based on trial-and-error experimentation. Thus, the pseudo-code in Algorithm 10 shows us seeking the most strongly biased variables, and setting them to their polarity of greater estimated bias.

Deactivation Threshold A third consideration when integrating with a backtracking solver is that any bias estimator can be governed by the "threshold" parameter expressed in terms of the most strongly biased variable in a survey. For instance, if this parameter is set to 0.6, then we only persist in using surveys so long as their most strongly biased variables have a gap of at least 0.6 between their positive and negative biases. As soon as we receive a survey where the strongest bias for a variable does not exceed this gap, we deactivate the bias estimation process and revert to using MINISAT's default variable and value ordering heuristic until the solver triggers a restart. (Note that setting this parameter to 0.0 is the same as directing the solver to never deactivate the bias estimator.)

The underlying motivation is that problems may contain a few important variables that are more constrained than the rest, and that the rest of the variables should be easy to set once these few have been assigned correctly. For various theoretical reasons this is thought to be of special relevance within the phase-transition region in problem hardness [23], as described in Chapter 4. In general, surveys are highly valuable but also very expensive; they can take the majority of total runtime depending on how this and other parameters are set. So it is critical to turn them off after the most important decisions have been made.

For typical SAT contest problems, we have found 0.9 to be a good value in combination with the other settings decided in this section. (Each time the solver performs a restart in solving a given problem, the bias estimation module is re-instated if previously deactivated. This will typically result in the execution of about one survey per thousand variable assignments for the most difficult contest problems, which typically require tens of millions of variable assignments.) If we limit our consideration exclusively to random problems, then a good setting is 0.6.
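Putting the conflict-avoiding branching strategy and the deactivation threshold together, the per-survey decision logic can be sketched as follows; the names and survey layout are illustrative assumptions, not VARSAT's actual interface:

    def choose_decision(survey, thresh):
        """Return a (variable, value) decision under conflict-avoiding branching,
        or None when the strongest bias gap falls below `thresh` (deactivating
        surveys in this subtree until the next restart)."""
        best = max(survey, key=lambda v: abs(survey[v][0] - survey[v][1]))
        if abs(survey[best][0] - survey[best][1]) < thresh:
            return None  # fall back to the default heuristic (e.g., VSIDS)
        return best, survey[best][0] >= survey[best][1]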

Integrating Learned Clauses Integrating learned clauses into surveys represents another balance between accuracy and runtime efficiency. On the one hand, learned clauses are all implied by a theory, and their influence is implicitly captured by the original clauses of a problem. On the other hand, they may provide especially useful information about the specific area of the search space that a solver is currently exploring. Time-wise, the expense is that update rules must now iterate through additional clauses when estimating biases; at an implementational level there is also an overhead cost in managing the memory for registering learned clauses for inclusion in the calculation of surveys and for unregistering them when they are periodically purged from the clause database.

The trade-off is accomplished by a parameter that states the maximum size of learned clause that is to be included in a survey. (Shorter clauses are more valuable in the sense that they contain more information, and they are also faster to process since they appear in the update rules of fewer variables.) In practice we have found it best to integrate all learned 4-clauses and below into survey computations. For the SAT-contest problems, though, few such clauses are ever learned, and the improvement over not using any learned clauses at all is small.

Bias Estimation Technique Finally, the choice of bias estimator represents a trade-off between runtime and accuracy or other desirable survey qualities. This decision interacts most sensitively with the way the other design issues are resolved. For instance, if we are decimating one variable at a time and seek solutions by branching on the one with strongest bias, then it does no good to improve our accuracy on a majority of the variables if the one variable with strongest bias is set incorrectly.

For random problems, stronger global constraints and the richer SP model make EMSP-G the best bias estimator, despite the greater cost of computing such constraints and performing three updates per variable instead of two. For industrial problems from recent SAT contests, the global constraints are still valuable but the SP model no longer seems to be worthwhile; here EMBP-G is the method of choice. (Under trial-and-error experimentation, the surveys are not any less accurate when using EMSP-G rather than EMBP-G; the extra computational expense of including a third update rule appears to be the sole disadvantage.)

7.2 Standalone Performance of SAT Bias Estimators

Before testing whether good bias estimates are effective in guiding search, we first determine whether our own techniques are in fact good estimators, in isolation. We will define the quality of a survey in terms of a number of metrics, and calculate these metrics using SAT problems that are small enough so that we can calculate their true biases by exhaustively enumerating all solutions. The experiments in this section were conducted with the help of Christian Muise.

We begin by concretely defining two items of terminology in order to expedite the description of more complex concepts to follow.

Definition 24 (Strength of Bias, for SAT). We will refer to a variable as "positively biased" with respect to a true or estimated bias distribution, if under this distribution its positive bias exceeds its negative bias, and likewise for calling a variable "negatively biased." Within the survey propagation model, the mass on the joker state is left out from this comparison. Similarly the "strength" of a bias distribution indicates how much it favors one polarity over the other, as defined by the absolute difference between its positive and negative biases. For bias distributions over the survey propagation model, any probability mass for the third '*' state is first evenly split between the positive and negative bias for the purposes of calculating strength.
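Definition 24 translates directly into code; a minimal sketch under the same assumed (θ+, θ−, θ*) triple representation used in the earlier sketches:

    def strength(theta_pos, theta_neg, theta_star=0.0):
        """Strength per Definition 24: split any '*' mass evenly between the
        polarities, then take the absolute difference."""
        pos = theta_pos + theta_star / 2.0
        neg = theta_neg + theta_star / 2.0
        return abs(pos - neg)

For example, strength(0.5, 0.1, 0.4) evaluates to 0.4.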

Definition 25 (Bias Profile). Analogously to our use of Θ to denote a survey consisting of estimated bias distributions θi for each variable xi in a given problem instance, we will use

Φ to denote a set of φi representing true marginal distributions over individual variables. We

will distinguish such a true set of marginal distributions by calling it a “profile” rather than a

“survey”.

7.2.1 Experimental Setup

We now define the test regimes for evaluating the eight bias estimators of the previous chapter.

In general, the selected problems are small enough to allow the exact calculation of true bias

profiles, or to allow repeated experiments in a reasonable amount of time. Further, they are randomly generated so as to isolate the approaches’ average-case behavior. Finally, the problems are drawn from near the phase transition, with α ≡ m/n set to 4.11, in order to exclude trivially easy instances from study. While the results presented in the next section are limited to this

fixed window of problem types, we have observed consistent behavior under trial-and-error

experimentation when scaling the problems and backbones to larger sizes, and when increasing α beyond 4.11 within the space of satisfiable problems. For structured problems, though,

we have only anecdotal reports of success, and leave the systematic study of such less-regular

domains to future research.

Specifically, we have consulted an existing library of 100-variable satisfiable instances with

controlled backbone size [137]. From the library, we extract 100 instances each of problems

with backbones of size 10, 30, 50, 70, and 90. By alternately fixing each variable to one

value then the other, and passing the resulting problems to a model-counter, we were able to exhaustively calculate the true bias distribution φi for every variable xi in each problem. We then assess the bias estimators by running each of them on each of the problem instances. That is, we do not pursue any search for a solution, but instead apply each bias estimator to the top-level problem (i.e. to the root of the search tree alone.) Thus, we can compare the estimated surveys with the true profiles–recall that in our estimates, the probability mass for the ‘*’ state is evenly split between that of the ‘+’ and ‘-’ states upon completing a survey via an estimator that uses the survey propagation model. (Within a variable-ordering heuristic that prefers the most strongly-biased variable, splitting the estimated mass for the ‘*’ state is equivalent to simply ignoring it during the decision-making process.)

Figure 7.1: RMS error of all bias estimates over 500 instances of increasing backbone size.

7.2.2 Findings

As with all the experiments described through the rest of this document, a number of results are available online [76]. Here we summarize those results that are most relevant to the overall goal of measuring bias estimation ability in isolation.

Basic Accuracy. Figure 7.1 shows root-mean-squared error for bias estimates on all variables in all runs described above: $\sqrt{\sum_i (\theta_i^+ - \phi_i^+)^2 / n}$. Aside from BP, which is discussed later, the remaining methods are grouped into two bands of linearly increasing error.
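As a minimal sketch of this error metric, assuming surveys and profiles are given as parallel lists of positive-bias values:

import math

def rms_error(theta_pos, phi_pos):
    # Root-mean-squared error between estimated and true positive biases.
    n = len(theta_pos)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(theta_pos, phi_pos)) / n)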

The best band in terms of average accuracy contains the two global methods, EMBP-G and EMSP-G, along with the two control methods, LC and CC, and BP. SP and the local

EM methods comprise the less accurate band. Prior study indicates that problems with larger backbones are more constrained and usually harder to solve [116]. Here it appears that their biases are harder to estimate as well.

Figure 7.2: Average strength of estimated bias, over same 500 instances.

However, the flat lines in the assessment of strength

(Figure 7.2) below demonstrate that the methods are bound to make predictions of constant

average strength regardless of problem constrainedness. That is, the problems in the two figures

contain increasingly strong biases; the increase in error in the first figure can be at least partially

explained by the methods’ inability to make stronger predictions in turn.

Strength. Recall that the strength of a variable’s bias is the distance between its positive

and negative biases; backbones have strength 1, and evenly split variables have strength 0. So

the plot in Figure 7.2 averages the formula $(\sum_v |\theta_v^+ - \theta_v^-|)/n$ across runs, and again we see the same groupings of algorithms in terms of strength. BP is the most wildly inaccurate, and also makes the strongest predictions by far. Again the global and control methods form one group, of moderate estimators, and SP joins the local methods in making the most conservative estimates.

Within the EMBP-based methods, the consistency of average strength across increasingly constrained problems reflects the use of smoothing sums. Recall from Figures 6.2 and 6.3 in the previous chapter that the local-consistency rules subtract a sole-support calculation that depends on the biases of surrounding variables, from a constant representing the total number of clauses in which a variable appears; the same constant appears in all update rules, and its influence is reflected by the weakest grouping of estimators within the graph. Similarly, EMBP estimators based on local consistency (along with the CC control) exhibit multiplicative smoothing constants that moderate average strength. Here the influence is lesser because the rules distinguish between the number of clauses that contain a variable positively or negatively as two separate quantities, and discriminate between these quantities in estimating bias.

Backbone Identification Rate. Figure 7.3 considers the special case of backbone identification. Because the estimators are not prone to offer many predictions of 100% bias strength, we instead measure whether they correctly predict the polarity of known backbone variables.

Looking ahead to the context of search, it is not critical to set backbones early, but it is certainly critical to set them correctly (during solution-seeking search). To that end, the vertical axis of the plot shows the percentage of backbone variables for which an estimator predicts the correct polarity, averaged across the 100 problems in each of the five test sets. Here, “predicting the correct polarity” means that the estimated positive bias exceeds the negative estimate for a backbone variable that is indeed constrained to be positive in all solutions; and likewise the negative estimate must exceed the positive in the case of a negative backbone variable.
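A minimal sketch of this success rate, assuming a survey maps each variable to a (positive, negative) pair and that the backbone polarities are known in advance:

def backbone_identification_rate(survey, backbones):
    # backbones maps each backbone variable to True (constrained positive
    # in all solutions) or False (constrained negative).
    correct = sum(1 for v, positive in backbones.items()
                  if (survey[v][0] > survey[v][1]) == positive)
    return correct / len(backbones)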

Here the same bands of bias estimators that appeared in Figure 7.1 are now compressed into a more mixed band of linearly decreasing success rates; when there are more backbones

(up to 90 out of 100 variables) it becomes harder to get the same percentage of them correct.

The uncharacteristically poor performance of CC provides an extra insight into the difficulty of predicting backbones. Because it always biases a variable according to the proportion of its positive and negative clause counts, all of CC’s incorrect predictions and at least some of its correct ones involve variables that appear in the theory more often than not with the opposite sign from the one they are constrained to hold! So for instance, in a problem that contains 50% backbone variables, at least 30% of those appear more often as literals with the wrong polarity than with the right one.

Figure 7.3: Average success rate in predicting the correct sign for backbone variables.

“Reliability”. A final quantity of interest when looking ahead to search is the rank of the

first wrong bias with respect to strength. If we are fixing the most strongly biased variable to its stronger value, then a survey’s accuracy on the other, weaker estimates is irrelevant.

Further, when we fix a variable to some polarity, we can never eliminate all solutions unless its true bias is 100% in the direction of the opposite polarity. In this light, Figure 7.4 averages the results of ranking the variables in a survey by strength, and reading down from the top to find the most highly ranked variable that was actually estimated in the wrong direction. A variable’s estimate is “in the wrong direction” if the positive estimate exceeds the negative estimate when in fact the true negative bias exceeds the true positive bias, and likewise if the negative estimate is greater when in fact the true positive bias is greater. (So at its strictest, the

definition will penalize an estimate of $(\theta_v^+, \theta_v^-) = (0.51, 0.49)$ even when the true bias for a variable is $(\phi_v^+, \phi_v^-) = (0.49, 0.51)$.)
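The following sketch computes this rank under the strict definition just given; the pair-based data structures are assumptions of the illustration:

def rank_of_first_wrong_bias(survey, profile):
    # Rank variables by estimated strength (rank 1 = strongest), then scan
    # downward for the first variable whose estimated polarity disagrees
    # with its true polarity.
    by_strength = sorted(survey,
                         key=lambda v: abs(survey[v][0] - survey[v][1]),
                         reverse=True)
    for rank, v in enumerate(by_strength, start=1):
        if (survey[v][0] > survey[v][1]) != (profile[v][0] > profile[v][1]):
            return rank
    return None  # every estimate points in the right direction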

Figure 7.4: Average strength ranking of the variable most strongly biased toward the wrong polarity.

Here BP exhibits excellent average performance; for backbones of size ten, for instance, its thirty strongest estimates typically turn out to predict the correct sign, before the thirty-first is biased the wrong way. As we will see, outside of BP and SP, the ordering between methods closely follows their overall effectiveness when embedded within VARSAT.

7.2.3 Other Experiments

Some of the other experiments we conducted do not directly evaluate our hypotheses, but serve a useful role in visualization. Figures 7.5 and 7.6 give two examples of the sorts of results that are available online [76].

Figure 7.5 plots true vs. estimated biases; points along the diagonal line are the most accurate, and those along the top and the bottom of the frame are backbones. It is this sort of frame that can be generated each time we compute a survey along a single branch of search.

Similarly, Figure 7.6 shows how the estimated biases for the variables in a problem can

fluctuate more or less wildly during the process of computing a single survey, depending on the chosen technique and the character of the problem instance.

Figure 7.5: True versus Estimated Bias.

Figure 7.6: Fluctuations in Bias Estimates.

Figure 7.7: Adjusting the threshold parameter: average runtime for various settings. (Axes: average runtime in seconds versus minimum threshold for survey use, based on strongest bias across variables; BP: N/A.)

7.3 Performance of Bias Estimation in Search

Finally, we can consider the overall performance of the bias estimators embedded in backtracking search, as realized by the VARSAT solver. A number of initial experiments are not depicted here, but were used to determine the parameter settings detailed in Section 7.1. As a single example, Figure 7.7 summarizes the process of tuning the threshold parameter for each survey method, for use on random problems. (Recall that VARSAT only persists in using surveys when their strongest variables exceed this value; otherwise it deactivates them and default heuristics take over until restart.) The vertical axis plots the average runtime of VARSAT on one hundred 250-variable problems with clause-to-variable ratio 4.11, as the threshold setting varies along the horizontal axis. The dashed level line represents the fixed runtime of the default MINISAT heuristic without any bias estimation. Each line in the graph represents changes in average runtime when a given heuristic is run with different threshold values. Lines that are discontinuous to the left represent combinations that could not solve some problem in less than three minutes if the threshold was set too low. (BP timed out on some problem at every threshold setting, and thus does not appear on the graph.)

Figure 7.8: Total/Survey runtimes averaged over 100 random problems, n = 250, α ≡ m/n = 4.11. (Average number of surveys per run: EMSP-G 165.4, EMBP-G 78.9, LC 132.8, EMSP-L 23.0, EMBP-L 12.8, CC 75.1, SP 13.0, default 0.0.)

EMBP-L, EMSP-L, and SP make mild estimates and thus are not sensitive to the deactivation threshold parameter within the range covered by the graph. Typically they make a small number of surveys containing strong estimates before the strength of subsequent surveys falls well below 0.3 and they are deactivated until the next restart. As for the remaining approaches, as the parameter increases they first reach feasibility when they deactivate themselves early enough to prevent wrong decisions that cause timeouts. Their plots then slope downward to some optimal setting before rising somewhat as VARSAT increasingly ignores them when it ought not to. Also observe that most of the methods are still below the default line when the threshold is set to 1.0–if we only follow the bias estimators when they claim to have found a backbone or near-backbone variable, then search performance still improves slightly.

Figure 7.8 compiles the lowest points from Figure 7.7 to compare the average run-times of the various heuristics when using their best settings. Additionally, the bars are broken down to show the proportion of runtime that was devoted to computing surveys. The bar labels also indicate the average number of surveys that each method wound up producing at its optimal threshold setting. The most important general observation about the relative performance of these methods is that it roughly corresponds to their accuracy as bias estimators. This supports our hypothesis of a correlation between bias and efficient search: better bias estimators tended to produce better SAT solvers when employed as variable and value ordering heuristics. (In the most extreme case, directing VARSAT to follow the opposite of a good bias estimator’s guidance greatly degraded its ability to find solutions, as did the application of artificial bias estimates that were generated uniformly at random.)

Among the other metrics besides average accuracy, the set of “reliability” results (concerning rank of the first wrong estimate) represents the next-strongest criterion for a good heuristic.

For instance, the control heuristic CC performs worse than the EMBP-based heuristics overall, despite exhibiting a relatively acceptable RMS error rate in Figure 7.1. One explanation is that it is the worst method with respect to rank of first wrong estimate, and with respect to backbone identification, as well. However, good rankings are not enough, as BP timed out so frequently as to be omitted from the figures. The erratic behavior of BP and its subsequent failure as a search heuristic suggest future investigation into variance and entropy across surveys: a quick check of our data showed that it suffered much greater variance than the other approaches in almost every assessment of quality; this is consistent with its non-convergence (observed on about 5% of runs), and with the strong biases engendered by its strong, unsmoothed assumption that every variable is a sole support to some clause.

Figure 7.9 compares VARSAT and default MINISAT on hard random problems of increasing size. Here EMSP-G was used as the bias estimator, with deactivation threshold 0.6, to perform conflict-avoiding search with a decimation block size of one, while only consulting learned clauses of size four or below in forming surveys.

For each problem size marked in the graph, the two solvers were run on 1000 satisfiable instances that were randomly generated with a clause-to-variable ratio of 4.11. The average runtime on such problems is plotted in log-scale. The last two data points for default MINISAT represent lower bounds; on a percentage of the runs MINISAT timed out by failing to solve an instance within 10,800 seconds (three hours).

Figure 7.9: Comparison on Random Problems, 50 ≤ n ≤ 500, α ≡ m/n = 4.11. (Average runtime in seconds, log scale; timeout = 10,800 sec.; on the last two default data points, 3% and 11% of runs timed out.)

However, the real strength of MINISAT and other DPLL-based solvers is industrial problems, particularly as defined by inclusion in the “industrial” category of the semi-annual SAT-Race

solver competition [8]. Using such data sets from the most recent competition, it is possible to compare VARSAT and regular MINISAT. Here the best configuration of VARSAT is to use EMBP-G with a deactivation threshold of 0.9, and the other parameters remaining the same. On running the two solvers on a collection of 125 past problems with a timeout of 15 minutes, they were both able to solve 65 of them, and both failed to solve 46. Of the former,

MINISAT was generally faster though both were within the time limit. Additionally, there were

4 problems that only VARSAT could solve, compared to 10 for MINISAT.

Across this entire set of industrial instances, VARSAT reduces the number of backtracks, usually by at least one order of magnitude. However, runtime is almost always proportionately slower. This is because calculating surveys on industrial problems is extremely expensive–even more so than on the random problems discussed earlier. Industrial problems from the test set include many more (tens or hundreds of thousands of) variables. On some of the larger problems, survey computation takes up more than 99% of runtime, and just initializing the required data structures can take minutes under the current implementation. The overall performance of

VARSAT was still good enough that it placed fourth in the category for “crafted” problems at

the 2008 SAT-Race. The evaluation was divided into divisions for random, industrial, and crafted problems, where the last of these is characterized by problems directly encoded by humans

to feature specific combinatorial properties that are designed to be challenging for existing

solvers. While VARSAT does indeed represent a philosophical departure from the existing

state-of-the-art in SAT solving, the main explanation for its success in this category is that the

problems in this category tend to be smaller.

7.3.1 Limitations and Conclusions

Ultimately our most concrete conclusions on the usefulness of probabilistic reasoning in solving constraint satisfaction problems must be based entirely on the qualities of VARSAT. Thus,

it is well-motivated to consider its limitations within the larger space of SAT-solving methods. In general, the number of variables in a problem is the best predictor for the success of

VARSAT.

For random problems, we have seen that VARSAT’s bias estimators are superior to the

survey propagation bias estimator, in isolation. But, for large random problems (n > 10^4), the survey propagation decimation-based solver [23] is more suitable than the complete backtracking VARSAT solver. Here the advantage of the survey propagation solver is essentially that of simplicity: because it does not include any modern technologies, or even backtracking, it is simple to represent a large problem in memory, without paging. Thus the size of problems that the survey propagation solver can handle is not the real accomplishment; rather, it is that the solver can successfully process them at all without representing more sophisticated structures in memory. When the best of the EMBP-based bias estimators (in terms of the measures of accuracy used above) is embedded within the same decimation-based framework, it produces comparable results to those of the standard SP solver. But, as with the standard SP solver, this combination is not effective on non-random problems. On the other hand, the random problem division of the SAT evaluation is dominated by incomplete local search algorithms, designed

to handle hundreds of thousands of variables.

For industrial problems, the challenge of large problem size persists. Here the issue is not

exclusively one of memory management, but additionally encompasses the entire philosophy

of the solution process. Modern solvers are adept at handling industrial instances with millions

of clauses, and on successful runs they may fail to touch a majority of these clauses in memory,

while finding a solution. Watch literals or similar data structures monitor whether a clause has

been violated; building up an assignment without actually accessing such clauses is key to

efficient solving. This phenomenon is contrary to the fundamental character of bias estimation,

where we consult all the clauses in a problem multiple times to calculate a single survey; the

difference motivates more approximate methods. For example, such a method might consult

only some of the clauses in a variable’s neighborhood when calculating its bias; such a subset

might favor clauses that have recently participated in conflicts.

At this point, though, such possibilities for handling larger problems remain an unexplored

body of future work. Here, we have demonstrated that our own contributed techniques are

effective at estimating variable bias, and that variable bias is itself useful in guiding search. As

for the VARSAT solver, it surpasses the state-of-the-art for complete solvers on hard random

problems, and is competitive with incomplete solvers, exclusively on problems of size n = 10^3 or below.

Algorithm 10: VARSAT Algorithm
Input: SAT theory Φ, strength threshold thresh, decimation block size decimate, bias estimation method ESTIMATOR.
Data: Queue of variable assignments decimation-queue (static).
Output: Solution configuration, or else ‘⊥’ (unsatisfiable).
1  decimation-queue ← ∅.
2  return BACKTRACKING(∅, Φ, false).
3  subroutine BACKTRACKING(assignments, Φ, use-default?)
4  begin
5      Restart if prescribed.
6      Simplify Φ using prescribed inference procedures.
7      if (|assignments| = n) then return assignments.
8      if (assignments violate Φ) then
9          Perform clause learning as prescribed.
10         return ‘⊥’.
11     end
       // IF DECIMATION QUEUE CONTAINS ANY ASSIGNMENT, USE IT NEXT...
12     if (decimation-queue ≠ ∅) then
13         (var, val) ← POP(decimation-queue).
14     // ...OR ELSE USE DEFAULT HEURISTICS BELOW THIS DEPTH...
15     else if (use-default?) then
16         (var, val) ← CHOOSE-VARIABLE-AND-VALUE(assignments, Φ).
17     // ...OR ELSE CONTINUE USING BIAS ESTIMATORS AS HEURISTICS.
18     else
19         Θ ← COMPUTE-SURVEY(assignments, Φ, ESTIMATOR).
20         if (STRENGTH(Θ) < thresh) then
21             return BACKTRACKING(assignments, Φ, true).
22         else
23             Fill decimation-queue with variable assignments fixing the decimate most strongly-biased variables toward their stronger polarity.
24             return BACKTRACKING(assignments, Φ, false).
25         end
26     end
       // MAKE THE VARIABLE ASSIGNMENT AND BACKTRACK IF NECESSARY.
27     result ← BACKTRACKING(assignments ∪ {var = val}, Φ, use-default?).
28     if result = ‘⊥’ then
29         return BACKTRACKING(assignments ∪ {var = ¬val}, Φ, use-default?).
30     else
31         return result.
32     end
33 end

Part III

Solving Constraint Optimization Problems

Chapter 8

A Family of Bias Estimators for Constraint Optimization

In this chapter, we turn from the constraint satisfaction problem to the constraint optimization

problem, and for the purposes of this presentation, from SAT to MaxSAT. As defined below,

we will accommodate the most general (weighted, partial) version of the problem, and will

seek exact solutions.

In a general sense the MaxSAT problem is a better-motivated application for bias estimation

techniques than SAT is: as we have seen, the estimators tend to be very expensive, and very

powerful, in terms of minimizing conflicts. Recall from Chapter 5 that one perspective on the

bias estimation task is that it attempts to solve a (linearly) relaxed version of the SAT problem

itself, outright. Bias distributions correspond to soft variable assignments, maximizing the

variable’s fractional support to a set of clauses, as P(Θ|SAT) ∝ P(SAT|Θ).

The most pertinent property of MaxSAT is that finding a provably optimal solution by means of search is fundamentally harder than searching for a single satisfying assignment to a SAT problem. For SAT, if we manage to follow a branch of variable assignments to completion without violating any clauses, then we have solved the problem. For MaxSAT, we must continue searching; we must account for every branch of the complete search tree in order

to be sure that our solution is indeed optimal. Thus, MaxSAT problems typically take much longer to solve than SAT problems of comparable size, motivating the use of high-cost/high-reward heuristics. At a lower level, because the types of problems that are on the frontier of practicality for contemporary MaxSAT solvers tend to have hundreds or thousands of variables, rather than tens of thousands or more, the memory requirements for storing floating point bias values for all variables are much easier to accommodate under implementation.

In this chapter, then, we define the MaxSAT problem and the basic branch-and-bound search framework used to account for the entire search tree of assignments without necessarily exploring every branch to completion. We then modify the factor graph representation for a uniform distribution over SAT assignments to produce a representation for MaxSAT problems.

Here, to accommodate the fact that the optimization criterion for MaxSAT is additive, we define a scaled exponential distribution that approximates a uniform distribution over MaxSAT solutions. In particular, maximum-value configurations will have exponentially more weight than non-maxima, as scaled by an order-of-magnitude parameter. As this parameter approaches infinity, the distribution approaches a uniform distribution on solutions.

Then, in Chapter 9 we turn again to consider ways of integrating the bias estimators with a state-of-the-art solver–in this case a branch-and-bound MaxSAT solver–and report on empirical results. Further, in Chapter 10 we devote a third chapter to a bounding technique based on

“equivalent transformations” motivated by the concept of linear relaxations on the factor graph of a constraint optimization problem.

8.1 Existing Methods and Design Criteria

Definition 26 (MaxSAT). A partial weighted MaxSAT problem (in CNF) is a triple Φ =

(X,C,W ):

• X is a set of n Boolean variables. As before, we refer to 0 as the “negative” value and

1 as the “positive.” Assignments, configurations, and the like are also defined the same

way as for SAT, in Section 3.1.

• C is a set of m clauses defined over X. As before, each clause is a disjunction of positive

or negative literals formed from variables in X. We will typically index members of C

with the subscript a. Clause ca is satisfied by a given configuration A iff it contains a literal whose variable is mapped to the corresponding polarity in the configuration.

• $W \in (\mathbb{R}^+ \cup \{0\})^m$ associates a non-negative weight with each clause in C. We will denote by “wa” the weight associated with clause ca. For partial weighted MaxSAT problems, hard clauses must be satisfied by any solution (and can thus be considered to have infinite weight), as distinguished from soft clauses.

Let the term “slack” denote the sum of the weights of the clauses that a configuration violates.

A solution to a MaxSAT problem as defined here is a configuration that: firstly, satisfies all of the hard clauses in C; and secondly, minimizes slack across all configurations meeting the first condition. Any configuration that meets the first condition is “feasible”, though not necessarily a solution. In short, a solution to a MaxSAT problem minimizes the weight of violated soft clauses while satisfying all hard clauses, where weight is aggregated additively. In future sections it will occasionally be useful to speak of maximization rather than minimization. The

“score” of an assignment to all the variables in a MaxSAT problem is the sum of the weights associated with the clauses that it satisfies. Then we can redefine the concept of a solution to a

MaxSAT problem as a configuration that achieves maximum possible score without violating any hard clauses. The two definitions are equivalent because for any given configuration, the sum of its slack and its score is fixed to be the sum of the weights in W that are associated with soft clauses. Thus maximizing the score is equivalent to minimizing the slack.
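A minimal sketch of these two quantities (the clause representation is an assumption of the example; hard clauses can be modeled here with weight float('inf')):

def slack_and_score(config, clauses):
    # clauses is a list of (literals, weight) pairs, where each literal is a
    # (variable, polarity) pair and config maps variables to booleans.
    slack = score = 0.0
    for literals, weight in clauses:
        if any(config[var] == polarity for var, polarity in literals):
            score += weight   # satisfied: weight counts toward the score
        else:
            slack += weight   # violated: weight counts toward the slack
    return slack, score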

8.1.1 Branch-and-Bound Search

The state of the art for exact MaxSAT solving encompasses a variety of approaches, perhaps reflecting the diversity of commonly-used weighted partial MaxSAT problem classes [6]. Some problems are so highly constrained by their hard clauses that there are very few potential solutions; solving such a problem may be tantamount to finding SAT solutions and simply comparing their weights. Other problems feature few or no hard clauses, and exhibit more of a pure optimization character; other problems are a mixture of the two. Accordingly, MaxSAT is currently solved by consulting an embedded SAT-solving process [6], by reasoning over unsatisfiable cores [108], by integrating SAT-solving techniques [6], or with assistance from knowledge compilation techniques [125]. However, a majority of these solvers is built entirely upon a basic branch-and-bound search approach, and many of the remainder somehow integrate their distinguishing techniques within some version of this framework [104, 4, 73, 105].

The framework is a straightforward variation on backtracking search, and exhibits the same algebraic properties. Instead of operationalizing algebraic shortcuts like multiplication by zero, though, it takes a more nuanced approach by reasoning over sums or products that cannot exceed certain bounds, and thus do not need to be calculated to completion. The method itself appears in recursive form as Algorithm 11.

The algorithm begins by determining an initial upper bound on slack value for a problem solution. This may be accomplished by an incomplete method like local search, or simply by choosing infinity. Each time we complete a configuration during RECURSIVE-BACKTRACKING, and its slack is better (lesser) than the current upper bound, we store this new value for the upper bound, along with the configuration that achieved it. If an assignment violates any hard clauses outright, it is discarded. (A more practical way to handle hard clauses is to assign them a weight greater than the sum of weights of all soft clauses; this will not introduce any inefficiencies so long as we select an initial upper bound that is below this weight.) In either case, the solver backtracks in order to continue searching the entire tree.
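A sketch of the weighting trick just mentioned, reusing the assumed clause representation from the earlier example:

def harden_by_weight(soft_clauses, hard_clauses):
    # Give every hard clause a weight exceeding the total soft weight; a
    # configuration violating one can then never beat an initial upper
    # bound chosen below this weight.
    top = sum(w for _, w in soft_clauses) + 1
    return soft_clauses + [(literals, top) for literals, _ in hard_clauses]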

Branch-and-bound algorithms are characterized primarily by the method they use for COMPUTE-LOWER-BOUND; Chapter 10 will describe a means of doing so based on probabilistic methods.

The lower bound is an admissible estimate on the weight we will have to incur over violated clauses no matter how we extend the current partial assignment. If this slack value exceeds that of our best assignment encountered so far, then we can backtrack immediately, pruning this subtree of the overall search. Thus Algorithm 11 has a means of accounting for every branch of search without necessarily visiting every search node. State-of-the-art solvers typically employ modified versions of SAT inference techniques, like resolution, lookahead, the pure literal rule, and unit propagation, to compute lower bounds.

Otherwise, the algorithm is very similar to Algorithm 6 for backtracking search: from Lines

20 to 23, variable-ordering and value-ordering heuristics dynamically direct the order that we search the tree, possibly in hopes of finding configurations with good scores early in the search, or possibly in hopes of exceeding the current upper bound as early as possible to prune larger subtrees. In the next chapter we will consider this trade-off, as our bias estimators can be used to pursue either goal, and in Chapter 11 we consider reinforcement learning as one preliminary means of optimizing over such trade-offs during search.

8.1.2 Other Probabilistic Approaches

Just as the survey propagation solver first embedded the survey propagation bias estimator within a non-backtracking solver, there also exist two incomplete MaxSAT solvers that use a modified version of the bias estimator [11, 29] in the same way. (One method optionally employs a crude version of backtracking that is more akin to restarts: the solver occasionally undoes a certain percentage of its assignments at random, but cannot keep track of whether it has visited the entire search space. Thus, it is still an incomplete solution method.) The two solvers differ primarily by the interpretation of the ‘*’ (joker) state of the SP model; several behavioral differences result.

The goal of improving on the state of the art for bias-directed MaxSAT-solving again mo- tivates many of the same qualities cited in Chapter 6 for applying bias estimation to SAT, such as convergence, completeness, efficiency, and reliability. Furthermore, the bias estimators for both of these solvers are derived from an exponential weighting scheme similar to the one described below. As such, they are very sensitive to a “scaling” parameter that undermines CHAPTER 8. A FAMILY OF BIAS ESTIMATORS FOR CONSTRAINT OPTIMIZATION 186

the estimators’ accuracy when set too low for a problem, and prevents convergence when set

too high. Thus there is an additional motivation for guaranteed convergence in the case of

MaxSAT.

8.2 Creating an Approximate Distribution over MaxSAT Solutions

In order to derive bias estimators for solving MaxSAT problems, we will have to replace the

factor graph representation defined in Chapter 3 for SAT. The semantics of a SAT problem

are multiplicative over the satisfaction of its clauses: if 1 indicates satisfaction and 0 indicates

dissatisfaction, then the product of such values across clauses represents whether the entire

theory is satisfied. Thus, it is straightforward to encode such problems using factor graphs,

whose own semantics are multiplicative over function evaluations.

On the other hand, clause weights in MaxSAT are aggregated additively as per the definition above. Furthermore, while our probabilistic model can assign a probability of 0 to any configuration that violates a hard constraint, what value should it give the remaining configurations? Unfortunately, the model will have to give some mass to non-optimal but feasible configurations–if it were able to give 0 mass to such configurations, we would have already solved the MaxSAT problem by the very act of representing its graphical model.

So, we must accept that non-optimal configurations will have some probability mass. But, we will adapt an existing approach to statistical reasoning over MaxSAT [11] and formulate a model wherein they have exponentially less mass than optimal configurations. And, by linearly scaling the exponent in question using a parameter y, we will be able to approach an actual 0/1 distribution over solutions as y approaches infinity.

Definition 27 (Factor graph representation of MaxSAT). Let Φ be a weighted partial MaxSAT

problem Φ = (X,C,W). Then let FG(Φ) be a factor graph (X,F) with function set F one-to-one mappable onto C. Each clause c ∈ C corresponds to a function f ∈ F whose scope is the set of variables appearing in the clause, as before in the case of SAT. While f still evaluates to 1 on extensions that satisfy c, though, here it evaluates to exp{−y · wc} for unsatisfying extensions–unless c is a hard clause, in which case f evaluates to 0 on unsatisfying

extensions. Here y is the scaling parameter of the model.

Thus, for a MaxSAT problem Φ defined as above, FG(Φ) represents a joint probability

distribution on X with the following density function:

   0 :  if ~x violates a hard clause; 1  p(~x) = W (~x) = N FG(Φ) otherwise, where U(~x) is the  1 Q  exp{−y · wc} :  N  c∈U(~x) set of clauses violated by ~x. (8.1)

Under this model, a configuration’s probability is inversely proportional to $e^y$ raised to the power of the sum of the weights of clauses that it violates. (Or, it is zero if it violates a hard constraint.) So, it is easy to verify that if $\Delta w$ is the difference in slack values between a true optimum and a configuration for which the weight of violated clauses is second-least, then the probability of any given optimum is greater than the probability assigned to any non-optimum by a factor of $e^{y \cdot \Delta w}$. Accordingly, y can be seen as scaling the order of magnitude by which we prefer better configurations. As it increases, the probabilities corresponding to non-optimal configurations approach zero.
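A minimal sketch of the unnormalized mass that Equation 8.1 assigns to a configuration (the triple-based clause representation is an assumption of the illustration):

import math

def model_weight(config, clauses, y):
    # clauses is a list of (literals, weight, is_hard) triples; config maps
    # variables to booleans.
    mass = 1.0
    for literals, weight, is_hard in clauses:
        if not any(config[var] == polarity for var, polarity in literals):
            if is_hard:
                return 0.0                    # hard violation: zero mass
            mass *= math.exp(-y * weight)     # soft violation: penalty
    return mass

Dividing the masses of two feasible configurations whose slacks differ by Δw recovers exactly the factor described above.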

For bounded values of y, though, our bias estimation process is still in fact influenced by near-optima, much as local search algorithms may be biased toward near-solutions in their explorative behaviors. Though it may seem a strong assumption to think that maximally-likely surveys should be located generally near to almost-maximally-likely ones, let it be reinforced that there is no way to explicitly eliminate non-optimal sample configurations, be they implicitly or explicitly sampled, without having already solved the MaxSAT problem outright. At any rate, as long as y is sufficiently high, the influence of such configurations can be rendered negligible.

If convergence is guaranteed, then for the purposes of solving a problem by choosing the

most strongly biased variables and setting them according to their stronger polarities, it may

not be very important for y to be so high as to drive the model to exactness. For if the end

effect is that the strongest bias now estimates that variable xi is constrained positively in 99.9% of optima instead of 99%, then there will probably be no effect on the actual actions taken by the incorporating solver. This hypothesis will be tested in the next chapter; the final step before proceeding is to define the bias estimators that result from applying this model to the same derivations that we used for SAT.

8.3 Deriving New Bias Estimators for MaxSAT

Having defined a factor graph representation for MaxSAT problems that encodes a uniform distribution over solutions but for a vanishingly small proportion of mass that is assigned to feasible but non-optimal configurations, we can now derive new bias estimators within the

EMBP framework. This is actually quite straightforward, given the process that was already presented for the case of SAT, in Chapter 6. In fact, we are able to derive MaxSAT versions of EMBP-L, EMBP-G, EMSP-L, and EMSP-G by following almost exactly the same lines of reasoning. (It is also simple to design a weighted version of the CC estimator for use as an experimental control.)

In the end, it so happens that the totality of differences between the two sets of rules can be encoded by altering the recurring formula for sole support. Recall that given a clause and a variable, this formula estimates the probability that the clause will be unsatisfied by all the other variables that it contains as literals, thereby constraining the variable in question to serve as its support. In doing this calculation, the formula consults the current values for those neighboring variables’ bias distributions and estimates the probability that all of them are constrained in an unsupporting fashion. By modifying the version of this formula that was used for SAT and substituting into the same update rules defined previously in Figures 6.2 and 6.3 of Chapter

6, we can represent exactly the equivalent estimators for MaxSAT. Thus, for brevity, here we forgo the derivation of the update rules and instead display only the modified formula for sole support, as Figure 8.1.

$$\sigma_{MAX}(v, c) \triangleq \left( \prod_{i \in V_c^+ \setminus \{v\}} \theta_i^- \prod_{j \in V_c^- \setminus \{v\}} \theta_j^+ \right) \cdot \left(1 - \exp\{-y \cdot w_c\}\right)$$

Figure 8.1: “v is the sole support of c.” (MaxSAT version.)

The product in large parentheses represents the old formula for sole support; here we introduce an additional factor that approaches 1 from below with an increase in clause weight, or in the scaling parameter. One way to interpret this modification and its effect is to reconceptualize optimization in MaxSAT as a relaxed version of solution-finding in SAT. Specifically, we change the semantics of SAT by imagining that any given clause has an extremely small chance of somehow being satisfied exogenously, without appealing to any one of its literals for support. So, to determine the chance that v is constrained to support clause c, we estimate the probability that every other variable is simultaneously unable to take a satisfying value, and additionally, that the independent chance for the clause to somehow be satisfied exogenously has also failed. Here the chance of a clause being satisfied without appealing to any of its variables for support is exp{−y·wc}, which becomes smaller for more strongly-weighted clauses or stricter values of y.
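A minimal sketch of the modified sole-support formula, under an assumed clause representation (positive variables, negative variables, weight), with θ mapping each variable to a (positive, negative) bias pair:

import math

def sigma_max(v, clause, theta, y):
    pos_vars, neg_vars, weight = clause
    # Probability that every other literal in the clause is unsupporting.
    prob = 1.0
    for i in pos_vars:
        if i != v:
            prob *= theta[i][1]   # neighbor biased negative: no support
    for j in neg_vars:
        if j != v:
            prob *= theta[j][0]   # neighbor biased positive: no support
    # Discount by the chance that the clause is satisfied "exogenously".
    return prob * (1.0 - math.exp(-y * weight))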

In other words, bias estimation methods for constraint optimization could have been invented by relaxing bias estimators from constraint satisfaction. But it is arguably more fruitful to specify the precise probability model over which they optimize their surveys, as we have done here, in order to make design decisions (like choosing a value for the scaling parameter), and to understand the space of possible models as a precondition for future research.

Algorithm 11: Branch-and-bound search (for MaxSAT)
Data: MaxSAT problem Φ = (X,C,W), best configuration found so far, upper bound on minimum weight of unsatisfied clauses.
Result: solution, or ‘⊥’ (“hard constraints unsatisfiable”).
1  best ← ∅.
2  upper ← INITIAL-UPPER-BOUND(Φ).
3  RECURSIVE-BACKTRACKING(∅, Φ).
4  if (best = ∅) then
5      return ⊥.
6  else
7      return best.
8  end
9  subroutine RECURSIVE-BACKTRACKING(assignments, Φ)
10 begin
11     if (|assignments| = n) then
12         if (SLACK(assignments, Φ) < upper) then
13             best ← assignments.
14             upper ← SLACK(assignments, Φ).
15         end
16         return.
17     end
18     if (assignments violate hard clauses of Φ) then return.
19     if (COMPUTE-LOWER-BOUND(Φ) > upper) then return.
20     var ← CHOOSE-VARIABLE(assignments, Φ).
21     foreach val in ORDER-VALUES(var, assignments, Φ) do
22         RECURSIVE-BACKTRACKING(assignments ∪ {var = val}, Φ).
23     end
24 end

Chapter 9

Using Bias Estimators in Branch-and-Bound Search

Having designed a means of applying bias estimation to constraint optimization, we can now consider its usefulness in solving real-world problems. Again the focus will be on the specific area of Boolean satisfiability; just as Chapter 7 embedded SAT bias estimators within a state-of-the-art complete backtracking SAT-solver, here we embed MaxSAT bias estimators within the state-of-the-art MAXSATZ solver [104]. The resulting solver, MAXVARSAT, is available online [77].

As before we begin by describing relevant design decisions, and then proceed to report on experimental results for large random problems as well as the full suite of problems featured in the most recent MaxSAT Evaluation, from 2009 [8].

9.1 Architecture and Design Decisions for the MAXVARSAT Integrated Solver

In the previous chapter, we described the adaptation of our bias estimators to constraint optimization, as well as the branch-and-bound framework (Algorithm 11) for solving them to completion. To produce the MAXVARSAT solver we use bias estimation as a variable-and-value ordering heuristic at Lines 20 and 21 of the algorithm. The architecture is not described in detail here because it closely follows the embedding of SAT bias estimators within MINISAT to create VARSAT. Here we instead note the best known ways of resolving various design considerations, including those that are shared with the SAT framework, as well as other considerations unique to MaxSAT.

Shared Design Considerations. As before with SAT, we have the option of using a single survey to branch on more than one variable at a time, in the interest of amortizing the expense of calculating surveys, but at the cost of introducing inaccuracy due to correlations between variable biases within the block. Once again we have found that a decimation block size of

1 works best, in conjunction with the other settings described here. In other words, we pay the cost of calculating a survey for each individual variable assignment, in order to ensure that each successive survey is conditioned on all previous assignments.

Also in analogy with SAT, we have a choice of solution-seeking or conflict-seeking branching strategies. For the former, we use bias estimates to set variables to their most likely values. In this case the motivation is to quickly find good configurations, i.e. to tighten the upper bound on the minimum weight of clauses that must be violated by any solution. By choosing conflict-seeking branching strategies, we instead use bias estimates to make the worst possible assignments in terms of satisfying clauses. Here the benefit is to eliminate larger sub-trees by pruning at shallower levels of search; conflict-seeking assignments engender stronger and earlier lower bounds. The motive of shallower backtracks is stronger in the setting of complete constraint optimization, where we know that we must account for the entire search tree of variable assignments. We will discuss a combination of these two approaches shortly; if constrained to use only one, then as before we have found it much better to rely on a solution-seeking branching strategy.

Another feature that MAXVARSAT shares with VARSAT is a deactivation threshold for cutting off the computation of surveys beneath a particular search node, whenever the maximum difference between a variable’s positive and negative biases falls below a particular value.

For general use across the spectrum of problems considered in this chapter, we have found (by trial and error) that 0.4 is a good value for this deactivation threshold parameter; performance is fairly robust across values ranging from 0.1 to 0.8.

Learned clauses play a less prominent role within MaxSAT, for reasons unknown to the author and to the body of published research in general. Still, we have the option of integrating such clauses into the computation of surveys. Doing so does not constitute any improvement or hindrance to the performance of MAXVARSAT on the problems considered here.

Finally, we have our choice of bias estimator, and as before we have found EMBP-G

(EMBP over a two-state model with a global consistency approximation) and EMSP-G (the same but over the three-state survey propagation model) to be most effective, with EMSP-G the slightly more effective of the two. Thus the bias estimation results reported below are exclusively in terms of EMSP-G (and the experimental control strategy CC that relies on the

(weighted) proportion of a variable’s appearances as a positive or negative literal). Figure 9.1 exhibits these two rules. (In the figure, we have redefined cardinality over sets of clauses to account for weight: $|C| \triangleq \sum_{c \in C} w_c$.)

Design Considerations Unique to MaxSAT. As explained in the preceding chapter, adapting our bias estimation framework to MaxSAT introduces an additional parameter y that scales the penalty for violating clauses. We have found MAXVARSAT robust to variation in y; a value of 5 works best under trial and error, but is not drastically better than other settings between

2 and 20. When the value is too high, accuracy suffers due to numerical precision issues.

In contrast, other bias estimation methods for MaxSAT are extremely sensitive to y and are typically unusable if y falls outside an interval of size 3 or so [11, 29]. Here we see one advantage of guaranteed convergence: such studies have found that when y is outside such

a problem-specific interval, the bias estimation process does not converge, and in fact must

terminate without any useful progress toward an accurate estimate. So, these methods must

perform a line search by which they attempt the same problem multiple times using varying values of y.

(a) CC rule for MaxSAT:

$$\omega_v^+ = |C_v^+|, \qquad \omega_v^- = |C_v^-|$$

(b) Sole-support formula for MaxSAT:

$$\sigma(v, c) \triangleq \left( \prod_{i \in V_c^+ \setminus \{v\}} \theta_i^- \prod_{j \in V_c^- \setminus \{v\}} \theta_j^+ \right) \cdot \left(1 - \exp\{-y \cdot w_c\}\right)$$

(c) EMSP-G update rule:

$$\omega_v^+ = |C_v^+| \prod_{c \in C_v^-} (1 - \sigma(v, c)) + |C_v^-| \left[ 1 - \prod_{c \in C_v^+} (1 - \sigma(v, c)) \right]$$

$$\omega_v^- = |C_v^-| \prod_{c \in C_v^+} (1 - \sigma(v, c)) + |C_v^+| \left[ 1 - \prod_{c \in C_v^-} (1 - \sigma(v, c)) \right]$$

$$\omega_v^* = |C_v| \prod_{c \in C_v} (1 - \sigma(v, c))$$

Figure 9.1: Bias estimators for MaxSAT.
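As one illustration of how the EMSP-G rule of Figure 9.1(c) might be realized, here is a hedged sketch that reuses the sigma_max function sketched in Chapter 8; normalizing the three ω values into a bias distribution is an assumption of the illustration, and it presumes v appears in at least one clause:

import math

def emsp_g_update(v, pos_clauses, neg_clauses, theta, y):
    # pos_clauses / neg_clauses: clauses containing v as a positive or
    # negative literal, in the (pos_vars, neg_vars, weight) format assumed
    # by sigma_max. Weighted cardinality sums clause weights.
    card = lambda cs: sum(c[2] for c in cs)
    prod = lambda cs: math.prod(1.0 - sigma_max(v, c, theta, y) for c in cs)
    omega_pos = (card(pos_clauses) * prod(neg_clauses)
                 + card(neg_clauses) * (1.0 - prod(pos_clauses)))
    omega_neg = (card(neg_clauses) * prod(pos_clauses)
                 + card(pos_clauses) * (1.0 - prod(neg_clauses)))
    all_clauses = pos_clauses + neg_clauses
    omega_star = card(all_clauses) * prod(all_clauses)
    z = omega_pos + omega_neg + omega_star
    return omega_pos / z, omega_neg / z, omega_star / z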

A final design feature that is unique to MaxSAT arises from the desire to establish a good upper bound early in search by attempting to violate as few clauses as possible, but then to quickly violate this bound once it is established, by attempting to violate as many clauses as possible as early as possible. Accordingly, MAXVARSAT provides for a simple scheme that allows the solver to switch between these modes partway through a search process. Specifically, the user can supply a fixed number of seconds to search before switching, or if a cutoff time is supplied for terminating search prematurely, then the user can alternatively specify a percentage of the allotted time. The solver begins by following a solution-seeking branching strategy where the most strongly biased variable is assigned first, and to the polarity with stronger bias. Upon reaching the switching point, the solver then proceeds as usual, except now it follows a conflict-seeking strategy that again seeks the most strongly-biased variable for assignment, but sets it to its weaker polarity. In practice we have found that MAXVARSAT’s performance improves slightly but noticeably with respect to this feature. In the contest-problem experiments that follow, we use a setting of switching strategies after 50% of the allotted runtime has passed.
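A minimal sketch of this mode switch; the names and the wall-clock test are illustrative assumptions:

import time

def choose_value(theta_v, start_time, cutoff, switch_fraction=0.5):
    # Before the switching point, follow the stronger polarity
    # (solution-seeking); afterward, follow the weaker one (conflict-seeking).
    # Returns True to assign the positive value, False for the negative.
    solution_seeking = (time.monotonic() - start_time
                        < switch_fraction * cutoff)
    stronger_is_positive = theta_v[0] > theta_v[1]
    return stronger_is_positive if solution_seeking else not stronger_is_positive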

9.2 Performance of Bias Estimation in Search

We now assess the performance of MAXVARSAT on a variety of problems, beginning with large random problems and then considering the array of problem sets in the most general division of the most recent MaxSAT evaluation.

9.2.1 Random Problems

Table 9.1 compares the performance of MAXVARSAT, using EMSP-G or CC as a bias estimator, with regular MAXSATZ, on large, hard, random MaxSAT problems. We consider values of α ranging across and beyond the phase transition in satisfiability, from 4.2 to 4.9.

Each value of α corresponds to a test set of one hundred problems. Each problem contains n = 1000 variables, and is generated as previously defined for SAT. (Note that this produces unweighted, non-partial problems, meaning essentially that all weights are 1 and there are no clauses that must be satisfied.) True solution configurations cannot be calculated efficiently for these problems due to their size; accordingly none of the three configurations is able to efficiently solve such instances to completion. Instead, we note the current upper bound on the number of clauses that must be violated (i.e., the slack) for any configuration before terminating each solver after one hour of execution. Given that the solvers thus have no incentive to process suboptimal regions of the configuration space, the branching strategy remains solution-seeking through the entire course of an execution. The solver with the least final upper bound on termination has made better progress toward the true optimum, if it has not found it already.

For each value of α the table lists the average final upper bound for each configuration, across the one hundred problems in the corresponding problem set. For any individual problem the upper bound computed by MAXVARSAT using EMSP-G was always tighter than or equal to that of MAXVARSAT using CC, or default MAXSATZ; further, the performance of CC was in fact identical to that of regular MAXSATZ across all problems. Thus, the average upper bound for each test set is lower for EMSP-G and identical for the other two configurations.

In the third column we also show the standard deviation in the final upper bound found by

EMSP-G, across each problem set. Here we see that EMSP-G’s margin of superiority never exceeds one standard deviation, and decreases as problems become more constrained. On the other hand, we re-iterate that EMSP-G equaled or bettered the outputs of the other two methods on

100% of the problems considered. In the fourth column, we display the percentage of problems for which it was strictly better; in all other cases it found the same bound as the default and

CC.

Using EMSP-G also produces consistently fewer backtracks, despite making roughly the same number of variable assignments within its one hour execution. The average number of backtracks performed within the one hour runtime across problems in a set appears below each of the three configurations. The data shows the effectiveness of EMSP-G in persistently seeking configurations that violate the fewest possible clauses–in so doing, it continues deeper down each branch of search before being forced to violate a sufficient number of clauses for the lower-bounding mechanism to produce a value exceeding the upper bound.

A final observation is that the number of backtracks generally decreases as α increases and problems become more constrained. The basic intuition is that more heavily constrained problems inherently trigger more backtracks because there are more ways to trigger a lower bound that exceeds the upper bound; however no explanation has been verified experimentally to this point.

9.2.2 General Problems

To assess the general utility of bias estimation for solving MaxSAT problems from a variety of domains, we compare with MAXSATZ on the weighted partial problem sets from the 2009

MaxSAT evaluation. Table 9.2 depicts the results for MAXVARSAT with EMSP-G as a bias estimator, versus regular MAXSATZ, on all problems from the evaluation. Here, all runs were conducted with a thirty-minute timeout for consistency with the contest rules. MAXVARSAT was configured with y = 5 and a deactivation threshold of 0.4; additionally it was instructed to switch from a solution-seeking branching strategy to a conflict-seeking one, after fifteen minutes for any execution lasting to that point.

For each solver and problem set, we list the number of problems in the set that were solved within the thirty-minute timeout period, using bold face when one solver has handled a greater number of problems than the other. The remaining values show the number of backtracks and the overall runtime across problems in the set that were solved successfully. Thus we bold all three values if a solver has solved more problems, even if the average execution required more backtracks or time. When the two solvers successfully process an equal number of problems from a set, then we use bold to indicate which required fewer backtracks or less runtime.

Finally, in the case of MAXVARSAT we also indicate the percentage of run-time that was dedicated to calculating EMSP-G surveys.

For the first seven problem sets listed, MAXVARSAT solves more instances than MAXSATZ.

With the exception of the planning problem, and the satellite scheduling test sets generated by encoding the same problem in two different ways, the difference is substantial. Problems in this set are generally the hardest for both solvers, and accordingly the extra power of computing surveys pays off relative to the extra cost. This is reflected in the far smaller number of backtracks that MAXVARSAT requires, as well as the larger proportion of run-time devoted to surveys, relative to the other problem sets.

The next grouping of five problem sets, with the exception of the mixed integer programming problem, consists of problems that are known to be very easy for a variety of solvers participating in MaxSAT evaluations. As such, the two solvers both handle all the problems in such sets without difficulty. Still, though, we observe that bias estimation results in fewer backtracks; but in this case, longer run-times result as well.

Finally, the mixed integer programming and Bayesian inference problems are difficult for both solvers. In the case of the former, MAXVARSAT can still solve the same three problems that are solved by regular MAXSATZ; but in the latter problem set its performance lags significantly. At this point the author has not determined any concrete explanation for this difference.

9.2.3 Limitations and Conclusions

Here we have demonstrated that for thirteen general MaxSAT problem sets of interest, bias estimation improves or maintains the performance of MAXSATZ on all but one. For the random problems, though, MAXVARSAT and MAXSATZ alike suffer from the same scaling issues as described in relation to VARSAT and MINISAT in Chapter 7. Specifically, incomplete methods based on local search or decimation can process problems of size n = 10^4, while n = 10^3 is the limit for a problem that MAXVARSAT can solve in a matter of days. Still, the performance of MAXVARSAT on MaxSAT confirms the hypothesis that customized bias estimation techniques can have a positive effect on complete solution methods for constraint optimization.

             Solved with EMSP-G                 Solved with Default    Solved with CC
  α      UB avg  backtracks  UB std  best       UB avg  backtracks     UB avg  backtracks
 4.2      5.83     2,213K     1.90    23%        6.25     4,184K        6.25     3,663K
 4.3      8.16     2,202K     1.98    10%        8.25     3,849K        8.25     3,567K
 4.4     10.53     2,157K     2.22     4%       10.56     3,609K       10.56     3,536K
 4.5     13.01     2,077K     2.31     4%       13.04     3,295K       13.04     3,471K
 4.6     16.04     2,042K     2.59     3%       16.06     3,068K       16.06     3,357K
 4.7     20.31     2,014K     2.66     3%       20.32     2,878K       20.32     3,199K
 4.8     22.20     1,962K     2.84     2%       22.21     2,705K       22.21     2,094K
 4.9     25.56     1,928K     2.86     2%       25.57     2,566K       25.57     2,921K

Table 9.1: Performance of EMSP-G, Default, and CC heuristics on large random problems.

We compare the performance of MAXVARSAT instantiated with the EMSP-G or CC bias estimator (with y = 5 and deactivation threshold of 0.4, using solution-seeking search for 100% of the allotted cutoff period) versus the default MAXSATZ variable-ordering and value-ordering heuristics, on sets of one hundred 1,000-variable regular MaxSAT problems with α varying across the phase transition in satisfiability. For each of these three solvers, we compare the average upper bound on the weight of unsatisfied clauses after one hour of execution. As none of the runs completed within this period, the minimum of these three upper bounds is itself an upper bound on the true minimum slack for any configuration. In each case, we also list the number of backtracks performed during the one hour of execution. For the best configuration, using EMSP-G, we additionally show the standard deviation in the final upper bound found across problems in a given set. Also, we show the percentage of problems within a given set for which this configuration gives a better bound than either of the other two methods. The default and CC heuristics never produced a better bound than EMSP-G, so for these configurations the percentage is always zero and is not displayed.

                           Solved w/ EMSP-G                    Solved w/ Default
 Test Suite                 #   backtracks  runtime  surveys     #   backtracks  runtime
 auction-paths (88)        87        208K       71      19%     76     14,087K      192
 mancoosi-config (80)      77          5K      147      78%     40      5,424K     1271
 planning (56)             52          9K      160      34%     50         97K      128
 quasigroup (25)           21         34K       33      45%     13        564K       93
 warehouse-place (18)       6        213K      561      23%      1          6K        0
 satellite-dir (21)         3        285K       25      19%      2        593K        7
 satellite-log (21)         3      3,264K      159      22%      2        578K        9
 factor (186)             186          60        2      14%    186         32K       13
 auction-schedule (84)     84        630K      119      13%     84      1,504K       38
 part. rand. 2-SAT (90)    90         457       21       8%     90         511        8
 part. rand. 3-SAT (60)    60         25K      128      12%     60         35K       20
 miplib (12)                3         18K      583      10%      3         30K      192
 bayes-mpe (191)           10          8K        3      24%     21        787K      152

Table 9.2: Performance on weighted partial problems from the 2009 MaxSAT Evaluation.

We compare the performance of MAXVARSAT instantiated with the EMSP-G bias estimator (with y = 5 and deactivation threshold of 0.4, switching to conflict-guided search after 50% of the allotted cutoff period) versus the default MAXSATZ variable-ordering and value-ordering heuristics, on the full suite of most general (weighted partial) problems from the most recent MaxSAT evaluation. Each entry lists the number of problems from a given test set that were solved within a 30-minute cutoff period, and the average number of backtracks and seconds of CPU time required for each successful run. (All three entries appear in bold when a configuration solves more problems than the other; otherwise backtracks and CPU time are highlighted to indicate the more efficient configuration.) Additionally, for the configuration using EMSP-G we note the percentage of overall run-time that was used for computing surveys. The number of problems in each test set appears next to the set's name, in parentheses.

Chapter 10

Computing MHET for Constraint Optimization

The thesis of this dissertation is that an integrated perspective on probabilistic and constraint reasoning is valuable not only for its intrinsic interest, but also for developing new algorithms.

We have already seen that bias estimation can be viewed as a linear relaxation of constraint satisfaction where we make fractional assignments to variables in order to maximize the fractional satisfaction of constraints. We have also seen that the general task of calculating marginal probabilities on arbitrary graphical models in turn entails the maximization of a linear likelihood objective subject to non-convex constraints, and that various approximations to such constraints result in various estimation methods. In Chapters 6 and 8 the yield was a family of customized bias estimators designed for constraint satisfaction and constraint optimization.

Here we derive an additional algorithm that computes lower bounds for constraint optimization, drawing once more upon our graphical model-based integration. Again we illustrate the approach on the MaxSAT problem by integrating with the same MAXSATZ solver described in the previous chapter. Instead of forming fractional variable assignments, here we will perform a linear relaxation on the dual of a constraint optimization problem, allowing fractional extension values across constraints. In this case the constraint on our optimization is that such transformations must maintain equivalence to the original problem, and the objective is to minimize problem "height," which we will be able to view as tightening our lower bound, according to the definitions that follow. The overall process will be designated "minimum-height equivalent transformation," or "MHET."

Thus, instead of a uniform or near-uniform distribution over solutions to a constraint problem, the object of MHET is the constraint optimization problem itself. As we have seen before with other adaptations of probabilistic concepts to constraint satisfaction, the resulting algorithm can be seen as a generalization of existing inference procedures that aspire to produce tighter lower bounds for branch-and-bound search. Indeed, we will survey an alternative algorithm that was independently developed as a way to perform soft arc-consistency, and note that it minimizes the same objective within the same space of equivalent problems.

In this chapter we will first motivate and define the notion of minimum-height equivalent transformation for MaxSAT, and then present efficient algorithms for calculating problem height and iteratively applying MHETs to local optimality. Next we describe the implementation of MHET within MAXVARSAT and present empirical results. Finally, we will relate the framework to our integrative account of probability and constraints, as well as existing research, before making concluding observations.

10.1 Motivation and Definitions

Our goal is to derive lower bounds by optimistically assuming that we can achieve full problem height, i.e., that we can achieve the maximum score for each clause. Making this bound non-trivial first requires an extension to the language of MaxSAT problems, where clauses can now give varying weights to different configurations of their variables. Then, given a particular basic problem, we can seek a problem in the extended space of problems that is equivalent in that it yields the same overall score for any configuration, but has minimal height. If we consider the sum total of all the weight that is available across all the clauses of the original problem, then any difference with the height of the equivalent problem represents a lower bound on the problem's slack, or on the amount of weight that must be left unsatisfied by any configuration. In other words, the difference between the height of the original problem and that of the equivalent problem is a lower bound on the weight of clauses that must be left unsatisfied by any assignment to the variables in the problem. The concept of MHET originates from a formal analysis of vision problems that was originally developed in the Soviet Union (in the absence of high-powered computers!), and that was recently reviewed in the context of relating probabilistic reasoning to constraint satisfaction [134, 146].

To describe our approach to computing lower bounds by MHET, we first define the MaxSAT problem conventionally under a graphical representation akin to our previous treatment of factor graphs, but with summation substituted for product when interpreting the graph. Particularly, in the previous chapter we used a factor graph to represent a near-uniform distribution over all the solutions to a given MaxSAT problem; here, in contrast, we represent the MaxSAT problem itself in the form of a factor graph, with addition used as the ⊗ and the ⊕ operator alike. By proceeding to adopt an extensional representation of clauses, we can then extend the space of problems, and specify how a problem from this extended space can be equivalent to a given conventional problem. We then present a specific "equivalent transformation" process that encodes a mapping from a conventional problem to an equivalent extended one. Finally, we define the notion of problem height, which serves as our minimization criterion within the space of equivalent problems, and is the basis for MHET lower bounds.

Definition 28 (Graphical Representation of MaxSAT). Let Φ = (X, C, W) be a (weighted partial) MaxSAT problem defined as in Chapter 8. Recall that the score of an assignment to all the variables in a MaxSAT problem is the sum of the weights associated with the clauses that it satisfies. (The slack of a given configuration is the difference between the sum of all the weights of its clauses, and the score of that configuration.) A solution to a MaxSAT problem is a configuration that achieves the maximum possible score, or rather, minimizes the slack. As before with standard factor graphs, we will represent MaxSAT problems as bipartite graphs containing two types of nodes: variables and functions. Associated with each such node is a variable or a function over variables, respectively; usually we will not need to distinguish between variables/functions and their associated nodes in the graph. Edges in the graph connect variable nodes to nodes for functions in which they appear as arguments. To interpret a MaxSAT problem Φ = (X, C, W) as a factor graph G = (X, F), first let X correspond to the variables in the problem as before. Then let each clause f_a ∈ F be a two-valued function that evaluates to 0 if it is not satisfied by the way its variables are assigned, and w_a otherwise. As usual we will call the set of variables appearing as literals in a clause f_a the "scope" of f_a, denoted σ_a. Similarly, we call the set of clauses containing literals for a variable x_i its "neighborhood," denoted η_i.
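To make this representation concrete, the following minimal Python sketch encodes the factor-graph view, assuming clauses arrive as tuples of signed integer literals (DIMACS-style); the class and method names are illustrative, not those of the actual solver implementation.

from collections import defaultdict

class MaxSATFactorGraph:
    # Bipartite variable/function view of a weighted MaxSAT problem.
    def __init__(self, clauses, weights):
        self.clauses = clauses     # e.g. [(1,), (-2,), (-1, 2)]
        self.weights = weights     # e.g. [12, 6, 18]
        # scope sigma_a: the variables appearing as literals in clause a
        self.scope = [frozenset(abs(l) for l in c) for c in clauses]
        # neighborhood eta_i: the clauses containing literals for variable i
        self.nbhd = defaultdict(list)
        for a, c in enumerate(clauses):
            for l in c:
                self.nbhd[abs(l)].append(a)

    def factor_value(self, a, assignment):
        # f_a is two-valued: w_a if the assignment satisfies clause a, else 0
        sat = any(assignment[abs(l)] == (l > 0) for l in self.clauses[a])
        return self.weights[a] if sat else 0

    def score(self, assignment):
        # score = sum of the weights of satisfied clauses
        return sum(self.factor_value(a, assignment)
                   for a in range(len(self.clauses)))

On the two-variable example of Figure 10.1 below, MaxSATFactorGraph([(1,), (-2,), (-1, 2)], [12, 6, 18]).score({1: True, 2: True}) evaluates to 12 + 0 + 18 = 30.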

Definition 29 (Extensional Representation). Under the extensional, or dual, representation, we explicitly list the value of a function f_a for each of the 2^{|σ_a|} possible assignments to the variables in its scope. As usual we call such assignments "extensions" of f_a. Further, we can condition f_a on an individual variable assignment x_i = v_j by discarding those extensions that do not contain the assignment, and removing x_i from the scope of f_a; thus we define a new function with scope σ_a \ {x_i}.

Example 20. Figure 10.1(a) shows a simple two-variable MaxSAT problem in clausal form. Figure 10.1(b) depicts the same problem as a factor graph with the clauses represented as functions in extensional form. Conditioning the function f_c on the variable assignment x1 = 1, for example, yields a new function defined over x2 whose value is 0 if x2 is assigned the value 0, and 18 if x2 is 1.

Definition 30 (Extended Representation; Equivalent Problems). Under the extensional representation, we can now extend our definition of MaxSAT problems from using clauses that can be represented exclusively by two-valued functions, to arbitrary functions that can give distinct positive or negative values to each possible extension of their arguments. That is, each function f_a can now map {0, 1}^{σ_a} to the reals, as opposed to {0, w_a}. Where before the score of a given configuration was the sum of weights for clauses that it satisfied, now the score is the sum of function values according to the configuration. We can call such problems "extended MaxSAT" problems, of which traditional MaxSAT problems are a subclass. Two extended MaxSAT problems Φ = (X, F) and Φ′ = (X, F′) are equivalent iff they yield the same score for all configurations over the variables in X.

(a) MaxSAT Problem:
    Φ = (X, F, W);  X = {x1, x2};  F = {f_a, f_b, f_c}
    f_a = x1           w_a = 12
    f_b = ¬x2          w_b = 6
    f_c = ¬x1 ∨ x2     w_c = 18

(b) Extensional Representation:
    x1  f_a(x1)       x2  f_b(x2)       x1  x2  f_c(x1, x2)
    0      0          0      6          0   0      18
    1     12          1      0          0   1      18
                                        1   0       0
                                        1   1      18

(c) Equivalent Transformation to Extended MaxSAT Problem:
    ψ−_{1,a} = −8,  ψ+_{1,a} = 2,    ψ−_{1,c} = 8,   ψ+_{1,c} = −2
    ψ−_{2,b} = −2,  ψ+_{2,b} = −10,  ψ−_{2,c} = 2,   ψ+_{2,c} = 10

    x1  f′_a(x1)          x2  f′_b(x2)          x1  x2  f′_c(x1, x2)
    0   0 + 8 = 8         0   6 + 2 = 8         0   0   18 − 8 − 2 = 8
    1   12 − 2 = 10       1   0 + 10 = 10       0   1   18 − 8 − 10 = 0
                                                1   0   0 + 2 − 2 = 0
                                                1   1   18 + 2 − 10 = 10

Figure 10.1: Example MaxSAT problem, represented (a) conventionally as clauses and (b) extensionally as a factor graph. Equivalent extended MaxSAT problem (c) under equivalent transformation Ψ, whose every pair of components ψ−_{i,a} and ψ+_{i,a} appears beside the edge connecting variable x_i to function f_a. (The equivalent transformation also introduces two new unary functions over x1 and x2, but these both assign a score of 0 to any extension, and are therefore not depicted.)

Definition 31 (Equivalent Transformations). Here our interest is in extended MaxSAT problems that are equivalent to traditional ones that we wish to solve. Given a traditional MaxSAT problem Φ = (X, F, W), we can encode a specific equivalent extended problem Φ′ = (X, F′) by means of an equivalent transformation Ψ ∈ R^{|X|·|F|·|{+,−}|}. Ψ is a vector of individual positive and negative potentials, which are defined between variables and functions in their neighborhoods, and are denoted ψ+_{i,a} and ψ−_{i,a} respectively. Each such pair of potentials acts to "rescale" a function's values by instructing f_a to subtract ψ+_{i,a} from the value of each extension wherein x_i is assigned positively, and to subtract ψ−_{i,a} from those where it is assigned negatively. Thus, if Φ′ is the result of applying equivalent transformation Ψ to Φ, then for each f_a ∈ F the rescaled f′_a ∈ F′ is defined as follows:

\[
f'_a(z_a) \;=\; f_a(z_a) \;-\; \sum_{(x_i = val) \in z_a} \begin{cases} \psi^-_{i,a} & \text{if } val \text{ is } 0;\\ \psi^+_{i,a} & \text{if } val \text{ is } 1. \end{cases} \tag{10.1}
\]

In the expression, z_a stands for a given extension over the variables of f_a. Recall that because f_a represents a clause that evaluates to either 0 or w_a, the expression above can be recast as the negative sum of all potentials corresponding to the given extension, plus w_a if the extension is any of the 2^{|σ_a|} − 1 that satisfy the original clause. (This perspective will be crucial to the algorithms in Section 10.2, where we will evade the computational expense of explicit extensional function representations.)

Subtracting a series of arbitrary real numbers from the values of particular function extensions yields a transformation, but to be sure this is not necessarily an equivalent one. To achieve exactly the same score for Φ′ as for Φ under any assignment, we augment F′ with a unary function f′_i for each variable x_i ∈ X. These functions serve to add back in all the positive or negative rescalings associated with the variable, according to how it is assigned:

\[
f'_i(\{x_i = val\}) \;=\; \begin{cases} \sum_{a \in \eta_i} \psi^-_{i,a} & \text{if } val \text{ is } 0;\\ \sum_{a \in \eta_i} \psi^+_{i,a} & \text{if } val \text{ is } 1. \end{cases} \tag{10.2}
\]

(If F already contains a unary, or "unit," clause for variable x_i, it does not matter with respect to the formalism if we create two functions with the same scope in F′, or perform some sort of special handling when implementing the algorithms in Section 10.2.)
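To make Eqs. (10.1) and (10.2) concrete, the following continues the MaxSATFactorGraph sketch by applying a transformation to explicit extensional tables; this is feasible only for tiny examples, since the tables grow exponentially with clause size, and the psi[(i, a)] = (neg, pos) encoding of each potential pair is an assumption of the sketch rather than the thesis's data structure.

from itertools import product

def rescaled_factor(graph, a, psi):
    # Extensional table of f'_a per Eq. (10.1): subtract psi^-_{i,a} on
    # extensions that assign x_i negatively, psi^+_{i,a} on positive ones.
    scope = sorted(graph.scope[a])
    table = {}
    for values in product([False, True], repeat=len(scope)):
        ext = dict(zip(scope, values))
        v = graph.factor_value(a, ext)
        for i in scope:
            neg, pos = psi[(i, a)]
            v -= pos if ext[i] else neg
        table[values] = v
    return table

def unary_factor(graph, i, psi):
    # f'_i per Eq. (10.2): add the distributed potentials back in.
    neg = sum(psi[(i, a)][0] for a in graph.nbhd[i])
    pos = sum(psi[(i, a)][1] for a in graph.nbhd[i])
    return {False: neg, True: pos}

Applying the potentials of Figure 10.1(c) reproduces the transformed tables shown there, and summing the rescaled factors together with the unary functions recovers the original score for every assignment.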

In summary, an equivalent transformation Ψ encodes two potentials for every variable/function pair, and thus comprises a rescaling of all the functions (clauses) in a standard MaxSAT problem. Applying Ψ to standard problem Φ = (X, F, W) yields the equivalent extended problem Φ′ = (X, F′) where F′ = {f′_a : f_a ∈ F} ∪ {f′_i : x_i ∈ X}, with each function f′_a and f′_i defined as in (10.1) and (10.2). Clearly the original score under Φ equals the new score under Φ′ for any assignment to the variables in X: if x_i is assigned positively, for example, then f′_a will subtract ψ+_{i,a} from the original score for each clause f_a in which x_i appears; f′_i will in turn add each such ψ+_{i,a} back into the score.

Example 21. Figure 10.1(c) applies equivalent transformation Ψ to the example problem, yielding an equivalent extended MaxSAT problem. Individual potentials comprising Ψ appear beside corresponding edges in the factor graph. Intuitively, variable x1 has directed function f′_c to decrease its value by 8 if x1 is assigned negatively; this is offset by also instructing f′_a to increase its value by 8 if x1 is 0. By such means, (b) and (c) yield the same score for any assignment to x1 and x2.

Definition 32 (Height). The height of a function f_a is its maximum value across extensions. The height of an extended MaxSAT problem is the sum of the heights of its functions. A minimum-height equivalent transformation for a given MaxSAT problem is one that produces an equivalent problem whose height is no greater than that of any other equivalent problem. The height of a problem is an upper bound on the maximum score that can be achieved by any assignment to its variables; so the motivation for finding minimum-height equivalent transformations is to tighten this bound. (As a final step we will convert this upper bound on problem score to a lower bound on problem slack, in order to situate the MHET apparatus within the traditional notational scheme for branch-and-bound search as applied to MaxSAT.)

Example 22. The height of the regular MaxSAT problem in Figure 10.1(b) is 12 + 6 + 18 = 36; here we optimistically assume that we can accrue the maximum score for each clause, though in reality this cannot be done consistently by a single assignment to all the variables. In contrast, the height of the extended problem (c) is 10 + 10 + 10 = 30. In fact, 30 is the minimum height across all problems equivalent to (b), so Ψ is a minimum-height equivalent transformation. (Furthermore, the height of (c) is a tight upper bound on the maximum score achievable for (b): the score of 30 corresponding to {x1 = 1, x2 = 1} is maximal across assignments.)

Creating a lower bound on slack. In summary, we use the height of a problem to bound the maximum achievable score for a problem from above. Given a particular standard MaxSAT problem, then, we can consider the space of equivalent extended problems, and aspire to find one yielding the tightest possible upper bound, by minimizing height. Recall, though, that the canonical branch-and-bound algorithm for MaxSAT (cf. Algorithm 11) is typically defined in terms of lower bounds on problem slack rather than upper bounds on problem score. (Recall that slack is the sum of weights for unsatisfied clauses, while score is the sum of weights for satisfied clauses.)

As demonstrated in Proposition 4, we can calculate a lower bound on the minimum slack achievable on a problem by subtracting the upper bound from the sum of all clause weights from the original problem.

Proposition 4. The sum of the clause weights associated with a given (weighted) MaxSAT problem, minus the height of any equivalent problem, is a lower bound on the slack of any configuration within the original problem.

Proof. Let Φ = (X, C, W) be a MaxSAT problem, and let $\vec{x}$ be an arbitrary configuration of X. Then let SAT($\vec{x}$, c_a) be a predicate indicating whether $\vec{x}$ satisfies clause c_a ∈ C. We can thus define

\[
\mathrm{SCORE}(\vec{x}, \Phi) = \sum_{c_a \in C:\, \mathrm{SAT}(\vec{x}, c_a)} w_a, \qquad \mathrm{SLACK}(\vec{x}, \Phi) = \sum_{c_a \in C:\, \neg\mathrm{SAT}(\vec{x}, c_a)} w_a, \qquad \mathrm{WEIGHTS}(\Phi) = \sum_{c_a \in C} w_a
\]

to denote the score or slack of a given configuration, or the sum of clause weights, with respect to a given MaxSAT problem. Further, let Φ′ = (X, F) be any extended MaxSAT problem in factor graph format that is equivalent to Φ, and let

\[
\mathrm{HEIGHT}(\Phi') = \sum_{f_a \in F} \max_{\vec{x}} f_a(\vec{x}|_{\sigma_a})
\]

denote the height of such a problem. Then we have that $\max_{\vec{x}} \mathrm{SCORE}(\vec{x}, \Phi') \leq \mathrm{HEIGHT}(\Phi')$ by construction. And because Φ and Φ′ are equivalent, we can substitute to produce $\max_{\vec{x}} \mathrm{SCORE}(\vec{x}, \Phi) \leq \mathrm{HEIGHT}(\Phi')$. Further, for all $\vec{x}$, SLACK($\vec{x}$, Φ) = WEIGHTS(Φ) − SCORE($\vec{x}$, Φ) by definition. Therefore,

\[
\min_{\vec{x}} \mathrm{SLACK}(\vec{x}, \Phi) \;\geq\; \mathrm{WEIGHTS}(\Phi) - \mathrm{HEIGHT}(\Phi').
\]

Note that before transformation, the sum of the weights of clauses in an original MaxSAT problem is equivalent to that problem's height. To summarize, then, the height of a problem is an upper bound on its score. Performing MHET tightens this bound by minimizing height within the space of equivalent problems. Finally, the difference between the original problem height and the minimized problem height is a lower bound on the weight of clauses that will be unsatisfied by any configuration. This constitutes the lower-bounding technique that we will embed in MAXVARSAT.
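On the running example this arithmetic is immediate; the following few lines of Python (continuing the sketches above, with the heights taken from Example 22) illustrate the bound of Proposition 4:

# Proposition 4 on the Figure 10.1 example; problem_height applies
# Definition 32 to extensional tables like those from rescaled_factor.
def problem_height(tables):
    # height of an extended problem = sum of per-function maxima
    return sum(max(t.values()) for t in tables)

weights_total = 12 + 6 + 18   # WEIGHTS(Phi) = 36 = height of the original
height_equiv = 10 + 10 + 10   # HEIGHT(Phi') = 30, per Example 22
print(weights_total - height_equiv)   # 6: lower bound on the slack of any
                                      # configuration (tight here: 36 - 30)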

10.2 Adapting the Max-Sum Diffusion Algorithm to Perform MHET on MaxSAT

Our overall goal of providing height-based lower bounds to branch-and-bound MaxSAT solvers requires efficient algorithms for calculating function heights, and for pursuing minimum-height equivalent transformations. To calculate heights in extended problems, we sidestep the inherently exponential cost of representing functions extensionally. For finding MHETs we apply an approximate "max-sum diffusion" algorithm that circumvents the inherent NP-completeness of the task. (The general concept of equivalent transformation is not limited to the potential-based structure defined here; if optimal general height-minimization could be done in polynomial time, then SAT could be solved in polynomial time as well, because any satisfiable problem has a minimum-height equivalent problem whose height is the number of clauses.) In particular, the algorithm searches a restricted space of equivalent problems to find one with minimum height, namely the space of problems that can be encoded in terms of the original using only the potential-based equivalent transformations defined in the previous section.

Algorithm 12 depicts the overall process of producing lower bounds. Given the subproblem induced by the series of variable assignments that we have made up to a certain point during branch-and-bound search, we first find a height-minimizing equivalent transformation Ψ. We then calculate the difference between the total weight available from the problem and its height under the equivalent transformation, producing a lower bound on the slack that we must produce should we continue down this branch of search. Aside from the step of actually pursuing a minimum-height equivalent transformation, the complexity of Algorithm 12 and its subroutines is O(d(n + m)), where d is the largest degree of any node in the problem's factor graph representation.

Algorithm 12: LOWER-BOUND-BY-MIN-HEIGHT
Input : Weighted MaxSAT problem Φ = (X, F, W).
Output: Lower bound on weight of unsatisfied clauses.
1  Ψ ← MIN-HEIGHT-EQUIV-TRANSFORM(Φ).
2  lb ← 0.
3  foreach f_a ∈ F do
4      lb ← lb + w_a.
5  end
6  foreach f_a ∈ F do
7      lb ← lb − CLAUSE-HEIGHT(f_a, w_a, Ψ).
8  end
9  foreach x_i ∈ X do
10     lb ← lb − VARIABLE-HEIGHT(x_i, Ψ).
11 end
12 return lb.
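A direct Python transliteration of Algorithm 12 is short, assuming the clause_height, variable_height, and min_height_equiv_transform helpers sketched alongside Algorithms 13 through 15 below:

def lower_bound_by_min_height(graph):
    # Algorithm 12: total available weight, minus the height of the
    # equivalent problem produced by diffusion (Proposition 4).
    psi = min_height_equiv_transform(graph)
    lb = sum(graph.weights)
    for a in range(len(graph.clauses)):
        lb -= clause_height(graph, a, psi)
    for i in graph.nbhd:
        lb -= variable_height(graph, i, psi)
    return lb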

The "variable-height" function at Line 10 of the algorithm indicates the height of the unary function that the equivalent transformation process introduces for each variable, as in Eq. (10.2). Algorithm 13 performs this fairly straightforward calculation. (This depiction and the following are written for clarity rather than efficiency; many of the calculations can be cached and re-used within and between procedures.)

Algorithm 13: VARIABLE-HEIGHT
Input : Variable x_i, potentials comprising equivalent transformation Ψ.
Output: Height of x_i under equivalent transformation.
   // VARIABLE'S POSITIVE/NEGATIVE RESCALED WEIGHT IS SUM OF
   // THE POS/NEG POTENTIALS IT DISTRIBUTES ACROSS ITS CLAUSES.
1  neg ← Σ_{a ∈ η_i} ψ−_{i,a},  pos ← Σ_{a ∈ η_i} ψ+_{i,a}.
   // HEIGHT IS MAXIMUM RESCALED WEIGHT.
2  return max(pos, neg).
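Under the psi[(i, a)] = (neg, pos) encoding assumed in the earlier sketches, Algorithm 13 amounts to a few lines of Python:

def variable_height(graph, i, psi):
    # Height of the introduced unary function f'_i: the larger of its
    # two rescaled values from Eq. (10.2).
    neg = sum(psi[(i, a)][0] for a in graph.nbhd[i])
    pos = sum(psi[(i, a)][1] for a in graph.nbhd[i])
    return max(pos, neg)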

To calculate the height of a transformed problem clause at Line 7, we cannot tractably employ any explicit representation of extension values akin to the tables in Figure 10.1(c). Instead, we rely on the fixed structure of equivalent transformations by variable-clause potentials, and observe that with one exception, the highest-scoring extension for a clause is always the one for which the sum of associated potentials is least. The single exception occurs when the extension with least rescaling happens to be the unique extension that does not satisfy the original clause; then we do not accrue the original clause weight. In this case, we must determine whether gaining this weight is worth the extra rescaling that our score will suffer should we flip the assignment of a single variable and thus satisfy the clause. This is the basis for calculating a transformed clause's height under a given equivalent transformation, in Algorithm 14.

Line 15 of Algorithm 14 determines the decrease in score that we suffer on substituting the second-least rescaled extension for the least-rescaled one identified earlier in the algorithm. To do so, we identify the single variable with least difference between its two potentials (recall that all extensions satisfy an original clause, except one). If we consider flipping the assignment to this variable, we can thereby determine whether sacrificing the clause's weight or suffering additional rescaling will yield the function's highest score.

Algorithm 14: CLAUSE-HEIGHT
Input : Factor f_a, weight w_a, potentials comprising equivalent transformation Ψ.
Output: Height of f_a under equivalent transformation.
   // GET MINIMUM-POTENTIAL ASSIGNMENT TO VARIABLES IN f_a.
1  assignment ← {}.
2  potentials ← 0.
3  foreach x_i ∈ σ_a do
4      if ψ−_{i,a} < ψ+_{i,a} then
5          assignment ← assignment ∪ {x_i = 0}.
6          potentials ← potentials + ψ−_{i,a}.
7      else
8          assignment ← assignment ∪ {x_i = 1}.
9          potentials ← potentials + ψ+_{i,a}.
10     end
11 end
12 if assignment satisfies f_a then
       // HEIGHT IS FACTOR'S WEIGHT AFTER MINIMUM RESCALING.
13     return w_a − potentials.
14 else
       // HEIGHT DEPENDS ON WHETHER FACTOR'S WEIGHT IS
       // MORE VALUABLE THAN DIFFERENCE BETWEEN MINIMUM
       // AND NEXT-MOST-MINIMUM RESCALING.
15     difference ← ∞.
16     foreach x_i ∈ σ_a do
17         if |ψ−_{i,a} − ψ+_{i,a}| < difference then
               // FLIPPING THIS VARIABLE YIELDS LEAST INCREASE
               // IN MINIMUM RESCALING (AND SATISFIES f_a).
18             difference ← |ψ−_{i,a} − ψ+_{i,a}|.
19         end
20     end
21     if w_a > difference then
22         return w_a − potentials − difference.
23     else
24         return −potentials.
25     end
26 end
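The corresponding Python sketch of Algorithm 14 follows; like the algorithm itself, it never materializes an extensional table:

def clause_height(graph, a, psi):
    # Choose the least-rescaled polarity for each variable in the scope.
    assignment, potentials = {}, 0.0
    for i in sorted(graph.scope[a]):
        neg, pos = psi[(i, a)]
        if neg < pos:
            assignment[i] = False
            potentials += neg
        else:
            assignment[i] = True
            potentials += pos
    if any(assignment[abs(l)] == (l > 0) for l in graph.clauses[a]):
        # Height is the factor's weight after minimum rescaling.
        return graph.weights[a] - potentials
    # Otherwise weigh gaining w_a against the cheapest single flip; any
    # flip satisfies the clause, since only one extension falsifies it.
    difference = min(abs(psi[(i, a)][0] - psi[(i, a)][1])
                     for i in graph.scope[a])
    if graph.weights[a] > difference:
        return graph.weights[a] - potentials - difference
    return -potentials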

The remaining algorithmic task is to find equivalent transformations that minimize problem height. In Algorithm 15 we adapt the max-sum diffusion algorithm [96] developed from an older line of research in the Soviet Union [134] and recently reviewed in the context of probabilistic inference and computer vision [146]. The algorithm is guaranteed to converge, but only to a local minimum in problem height, rather than a global minimum. Lines 6 through 14 comprise the core of the process, effecting a "height-averaging" that rescales the factors surrounding a given variable so that they all have the same height when conditioned on either assignment to the variable. For a given positive or negative assignment, this is realized by calculating the height of each function when conditioned on the assignment (by subjecting the clause-height algorithm to the simple changes depicted as Algorithm 16) and forming an average. We then update the variable's potentials for this polarity to the difference between each clause's conditional height and this average. The time complexity for each iteration of the algorithm is O(d(n + m)), where n is the number of variables in a problem, m is the number of functions, and d is the maximum number of variables appearing in the scope of a function.

Algorithm 15: MIN-HEIGHT-EQUIV-TRANSFORM
Input : Weighted MaxSAT problem Φ = (X, F, W).
Output: Height-minimizing equivalent transformation Ψ ∈ R^{|X|·|F|·|{+,−}|}.
   // INITIALIZE ALL VARIABLE-FACTOR POTENTIALS TO ZERO.
1  foreach x_i ∈ X, f_a ∈ F do
2      ψ+_{i,a} ← 0, ψ−_{i,a} ← 0.
3  end
4  repeat
5      foreach x_i ∈ X do
           // FIND MAX RESCALED SCORE FOR EACH OF VARIABLE'S
           // FUNCTIONS, CONDITIONED ON EACH POLARITY.
6          foreach f_a ∈ η_i do
7              α+_{i,a} ← COND-HEIGHT(f_a, w_a, x_i = 1, Ψ).
8              α−_{i,a} ← COND-HEIGHT(f_a, w_a, x_i = 0, Ψ).
9          end
           // AVERAGE MAX RESCALED SCORES ACROSS FUNCTIONS.
10         µ+ ← Σ_{a ∈ η_i} α+_{i,a} / |η_i|.
11         µ− ← Σ_{a ∈ η_i} α−_{i,a} / |η_i|.
           // SET THE VARIABLE'S POTENTIALS TO RESCALE EACH
           // FUNCTION'S FORMER MAX SCORE TO THE AVERAGE.
12         foreach f_a ∈ η_i do
13             ψ+_{i,a} ← α+_{i,a} − µ+.
14             ψ−_{i,a} ← α−_{i,a} − µ−.
15         end
16     end
17 until convergence
18 return Ψ.

Algorithm 16: COND-HEIGHT
   // ALGORITHM IS THE SAME AS CLAUSE-HEIGHT, BUT MODIFIED
   // TO REFLECT ASSUMPTION THAT x_i IS ALREADY FIXED TO val.
1  Change Line 1 to "assignment ← {x_i = val}."
2  Change Line 3 to "foreach x_j ∈ σ_a \ {x_i} do."
3  Change Line 16 to "foreach x_j ∈ σ_a \ {x_i} do."
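In Python, the diffusion loop of Algorithm 15 can be sketched as below, with an epsilon threshold and an iteration cap standing in for the convergence test discussed next; cond_height re-derives CLAUSE-HEIGHT under the modifications of Algorithm 16 rather than patching it, which is an illustrative shortcut.

def cond_height(graph, a, i, val, psi):
    # CLAUSE-HEIGHT with x_i already fixed to val; x_i's own potential
    # is excluded, per Algorithm 16.
    assignment, potentials = {i: val}, 0.0
    others = graph.scope[a] - {i}
    for j in sorted(others):
        neg, pos = psi[(j, a)]
        assignment[j] = neg >= pos        # least-rescaled polarity
        potentials += min(neg, pos)
    if any(assignment[abs(l)] == (l > 0) for l in graph.clauses[a]):
        return graph.weights[a] - potentials
    if not others:                        # unit clause falsified by x_i = val
        return -potentials
    difference = min(abs(psi[(j, a)][0] - psi[(j, a)][1]) for j in others)
    if graph.weights[a] > difference:
        return graph.weights[a] - potentials - difference
    return -potentials

def min_height_equiv_transform(graph, eps=1e-6, max_iters=1000):
    # Algorithm 15: zero-initialize potentials, then height-average until
    # no potential moves by more than eps.
    psi = {(i, a): (0.0, 0.0) for i in graph.nbhd for a in graph.nbhd[i]}
    for _ in range(max_iters):
        delta = 0.0
        for i in graph.nbhd:
            clauses = graph.nbhd[i]
            a_pos = {a: cond_height(graph, a, i, True, psi) for a in clauses}
            a_neg = {a: cond_height(graph, a, i, False, psi) for a in clauses}
            mu_pos = sum(a_pos.values()) / len(clauses)
            mu_neg = sum(a_neg.values()) / len(clauses)
            for a in clauses:
                new = (a_neg[a] - mu_neg, a_pos[a] - mu_pos)
                old = psi[(i, a)]
                delta = max(delta, abs(new[0] - old[0]), abs(new[1] - old[1]))
                psi[(i, a)] = new
        if delta < eps:
            break
    return psi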

Convergence of the algorithm is readily demonstrated by directly adapting a theorem from the review of the originating MHET research [146]. (Technically the theorem allows the algorithm to stagnate at the same non-minimal height, so long as the height never increases, but this has never been observed in practice. Also, the potential number of iterations is unbounded, but convergence can be enforced by an epsilon-valued threshold.)

Theorem 10.2.1 (Monotonically decreasing problem height). After each iteration of Line 5 in Algorithm 15, the height of Φ cannot increase.

Proof. At the beginning of the loop, variable x_i and its neighborhood η_i contribute the quantity

\[
\max\Big\{\sum_{a \in \eta_i} \psi^-_{i,a},\; \sum_{a \in \eta_i} \psi^+_{i,a}\Big\} + \sum_{a \in \eta_i} \max\big\{\alpha^+_{i,a} - \psi^+_{i,a},\; \alpha^-_{i,a} - \psi^-_{i,a}\big\}
\]

(the single-variable height of x_i plus the height of each clause) to the height of Φ. At the end of the loop, they contribute

\[
\max\Big\{\sum_{a \in \eta_i} \psi^+_{i,a} + \sum_{a \in \eta_i} \big(\alpha^+_{i,a} - \psi^+_{i,a}\big),\; \sum_{a \in \eta_i} \psi^-_{i,a} + \sum_{a \in \eta_i} \big(\alpha^-_{i,a} - \psi^-_{i,a}\big)\Big\}.
\]

The second quantity cannot exceed the first because it maximizes the sum of two quantities instead of summing the maximizations of the same two quantities.

Intuitively, the proof observes that in first calculating problem height, we are free to cast x_i positively in choosing the highest-scoring extension for one of its functions, and to still cast it negatively when taking the height of another. After iterating on x_i, we are still free to do so, but all of x_i's functions will yield the same height when it is set positively, and they will all share another common height when x_i is set negatively; we only consider which of these two is greater. As for correctness of the lower bound, it is easiest to appeal directly to the equivalence property of equivalent transformations: no matter what values Algorithm 15 assigns to its components, Ψ will by construction define a problem that yields the same score as Φ, for any variable assignment.

10.3 Implementation and Results

We have implemented the algorithms described above within the MAXVARSAT solver described in the previous chapter, which is in turn built upon the 2009 version of the state-of-the-art MAXSATZ solver [104]. As before we evaluate the usefulness of MHET as a lower-bounding framework on nineteen test sets comprising the most general divisions (weighted, and weighted partial MaxSAT) of the 2009 MaxSAT Evaluation [8]. The experiments were run on machines with 2.66 GHz CPUs using 512 MB of memory. To isolate the benefit of using MHET, we use the built-in variable-ordering and value-ordering heuristics of MAXSATZ, rather than the heuristics of the previous chapter.

Table 10.1 compares the performance of MAXVARSAT with and without the computation of a minimum-height equivalent transformation at every search node. The basic trade-off is between the increased computational cost of computing the transformations, versus the opportunity to search a reduced space due to the extra prunings triggered by such transformations.

10.3.1 Problems for which MHET is Beneficial, Neutral, or Harmful Overall

The rows of the table group the nineteen problem sets into three categories, by comparing the number of problems solved using the two versions of MAXSATZ within a 30-minute timeout.

The categories can be understood as containing problem sets for which adding MHET improves overall performance, makes no significant difference, or degrades overall performance. The principal statistics appear in the first two groupings of three columns each. The first column in each such grouping displays the number of problems from a given test set that were solved before timeout. The next columns show, for problems that were solved within the timeout period, the average number of backtracks and overall average runtimes. (Thus, a version may show a lower average runtime even if it solved fewer problems.)

For the first six problem sets in the table, MAXSATZ with MHET embedded solves a greater number of problems without timing out when compared to regular MAXSATZ. Looking ahead to the third grouping of columns, entitled "Prunings Triggered," we can characterize these problems as ones where MHET allows for a significant number of prunings, while the original version of MAXSATZ could find few or none. (The measures for prunings are described in the next section.) Here the difference is enough to outweigh the added computational cost of performing MHET. The problems in this category are diverse in their specific characteristics; one general property that they have in common, relative to the other entries in the table, is that they are fairly difficult to solve, meaning that fewer problems within a set can be solved in the time allotted, and that these problems themselves require more time. This yields a straightforward explanation for why the greater power of MHET is worth the computational expense. A secondary phenomenon that remains to be explained is that the clauses in these problems tend to have large weights; at this point it is unclear whether this is actually beneficial to the MHET formalism, and why this might be.

For the middle grouping of problem sets, adding MHET slows MAXSATZ overall for all but two test sets; but the difference in runtime is negligible in the sense that the two versions still solve the same number of problems within the 30-minute cutoff period. This grouping is characterized by problems that are already extremely easy or extremely hard for both versions.

Still, it is worth noting that for all but the (extremely easy) random problems, MHET continues the trend of finding additional prunings and drastically reducing the search space, while regular MAXSATZ performs far more branches, but without the computational overhead of MHET, which tends to double overall runtime. So, while the two versions of the solver are comparable here, particularly in terms of the number of problems solved before the cutoff time, such parity is achieved by divergent means.

For the final grouping of three problem sets, adding MHET prevents MAXSATZ from solving problems that it was originally able to solve within the 30-minute timeout. While MHET is still able to find a significant number of new prunings, for these problem sets its overhead cost is prohibitive. Such problems are characterized by the largest numbers of variables across the entire evaluation (thousands, in the case of the Mancoosi configuration problems). Thus, the greater cost of doing MHET on such problems makes it not worthwhile. Still, it is important to note that in terms of algorithmic complexity, MHET is not especially more costly than existing heuristics, as such heuristics also tend to have low-order polynomial complexity on the number of variables. Rather, the problems are just large in general and present a challenge to any inference-based heuristic.

10.3.2 Direct Measurement of Pruning Capability, and Proportion of Runtime

Turning to the third (“Prunings Triggered”) grouping of columns, we can corroborate some overall trends across all categories of problems. Here we have re-run each test suite with a new configuration of MAXSATZ that runs both “MHET” and the “original” MAXSATZ inference rules (Rules 1 and 2 from [104]) each time a variable is fixed, regardless of whether one or the other has already triggered a pruning. This way, we can identify pruning opportunities that could be identified exclusively by one technique or the other, or by both. Additionally, if neither technique triggers a pruning, we identify a third class of “other” prunings–those that are triggered by additional mechanisms built into the solver like unit propagation, the pure literal rule, and most prominently, look-ahead.

Thus, the column labeled "MHET" denotes the average number of nodes where MHET alone was able to trigger a pruning, while the original MAXSATZ inference rules were not able to do so, and vice versa for the column labeled "orig." The third column identifies situations where "both" sets of techniques were able to trigger a pruning, while the fourth identifies situations where neither could determine that a pruning was possible, but one was triggered nonetheless by the "other" built-in processes within MAXSATZ. With these figures we see that although MHET always triggers more prunings than the original MAXSATZ inference rules, the test sets for which it causes longer runtimes on solving the same number of problems correspond almost exactly to those where an even greater number of prunings could have been triggered by the third class of built-in mechanisms.

This observation is reinforced by the final column of the results table, which shows the percentage of total solver runtime that was used to calculate MHET as well as the rest of the pruning mechanisms (both “original” and “other” combined). Such figures were calculated by extracting timestamps from within the implementation of MAXVARSAT. Here we see that while MHET computations are extremely valuable, they are also very expensive in terms of time. For the three test suites where adding MHET increases overall runtime, the “other” built-in pruning mechanisms are able to find a large number of prunings, de-motivating the use of MHET. Still, the low-order polynomial complexity of performing MHET suggests that its percentage cost will always be within a constant factor of the other mechanisms’.

10.3.3 Practical Considerations: Blending the Two Pruning Methods

The experiments demonstrate that MHET is able to generate unique lower bounds and trigger prunings that were not possible using the existing methods. For at least some of the problems, this ability is strictly beneficial; and for the majority it does not do any harm in terms of the number of problems solved within the timeout period. For practical considerations, though, it is compelling to seek a finer-grained approach to using MHET versus the existing rules. Recall that in the experiments above, the new and existing lower-bounding techniques are run at every single node of the search space. Instead, it may be worthwhile to attempt to prune at only a portion of the nodes in the search space, and to choose between MHET and the old techniques in some way that balances computational cost with greater pruning power. This goal will be the main focus of the next chapter, but here we will describe a simple alternative to the reinforcement-learning-based techniques developed therein.

To be specific, we wish to formulate a blend of using either no lower-bounding techniques, old techniques alone, MHET alone, or both, at selected nodes in the search space. On the one hand, we are generally more likely to prune toward the bottom of the search tree, so running a bounding technique here is more likely to pay off with a pruning. On the other hand, if we do manage to trigger a pruning higher in the search tree, then the pruned subtree will be exponentially larger. In this light, a simple heuristic is to run the old bounding techniques at every node, and run MHET once every 10^α node expansions, where α is a parameter. Such a scheme shows an exponential preference for running the method lower in the search tree, because there are exponentially more such nodes. Naturally, using low values of α will be effective on problems from Table 10.1 where MHET was more successful, while increasingly high values of α will simulate using the old techniques alone, producing better results on the corresponding problems. In practice, we have found that this method produces a reasonable improvement on the test sets considered, with α set to about 3. The results themselves appear in the next chapter.
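The following minimal sketch illustrates the thinning scheme, under assumed solver hooks old_lower_bound and mhet_lower_bound rather than MAXSATZ's actual interfaces:

class ThinnedMHETBounding:
    # Run the old rules at every node; run MHET once every 10**alpha
    # node expansions.
    def __init__(self, alpha=3):
        self.period = 10 ** alpha
        self.expansions = 0

    def lower_bound(self, node):
        self.expansions += 1
        lb = old_lower_bound(node)        # cheap rules, every node
        if self.expansions % self.period == 0:
            # Exponentially more nodes lie deep in the tree, so a fixed
            # period implicitly favors the lower levels of search.
            lb = max(lb, mhet_lower_bound(node))
        return lb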

The main contribution of the next chapter is a more sophisticated approach to blending four choices of pruning strategy (nothing, old, MHET, old+MHET), while learning from experience and consulting the current state of the search process. We will use reinforcement learning to balance between exploring the effectiveness of the various actions in different situations within search, while also exploiting our experiences to prefer those actions that have done well when applied to comparable situations in the past. We will measure "doing well" by penalizing decisions for the runtimes they incur, while rewarding them according to how many nodes they are able to prune. And "situations" are realized by state variables that represent our current depth within the search tree and the current gap between the upper bound and the most recently computed lower bound. Within the results presented here, though, we have isolated the effect of MHET by running it at every node, and demonstrated its usefulness in triggering unique prunings and minimizing backtracks, while in some cases improving overall runtime.
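As a preview of that design, here is a heavily hedged sketch of the state, reward, and action-selection ingredients just described; the discretization, reward weighting, and epsilon-greedy rule are illustrative assumptions, not the settings of the next chapter.

import random

ACTIONS = ("nothing", "old", "mhet", "old+mhet")

def state(depth, upper_bound, lower_bound, bucket=10):
    # A "situation": bucketed search depth plus the current bound gap.
    return (depth // bucket, min(int(upper_bound - lower_bound), 100))

def reward(nodes_pruned, runtime_secs, time_penalty=1.0):
    # Reward prunings; penalize the runtime the decision incurred.
    return nodes_pruned - time_penalty * runtime_secs

def choose_action(q_table, s, epsilon=0.1):
    # Epsilon-greedy balance between exploration and exploitation.
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table.get((s, a), 0.0))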

10.4 Related Work and Conclusions

We have demonstrated a Minimum-Height Equivalent Transformation framework for computing a new type of lower bound that can improve the performance of branch-and-bound search for MaxSAT. Interestingly, the notion of equivalent transformation defined over an expanded space of extensionally represented functions can be seen as a "soft" generalization of existing inference rules within the current state of the art. More precisely, just as a linear programming relaxation drives the formal apparatus in the original account of MHET [134], so can we view the algorithms of Section 10.2 as sound relaxations of traditional inference rules like those of MAXSATZ, whose proofs of soundness in fact appeal to an integer programming formulation [104]. For instance, the problem in Figure 10.1 is simple enough to have been solved by one such rule from [104] (essentially, weighted resolution or arc-consistency). But by softening the rules to work over fractional weights and functions with more than two values, and by allowing the consequences of a rescaling operation to propagate throughout the factor graph over multiple iterations, we achieve a finer granularity and a more complete process to reduce problem height on real problems (a goal that is not motivated by existing approaches, which seek exclusively to create empty clauses).

Similarly, within the research area of weighted constraint satisfaction problems (a generalization of MaxSAT), a growing body of work has also sought to process weighted constraint problems into equivalent ones that produce tighter bounds [35, 38, 36]. In keeping with the previous discussion of related MaxSAT techniques, such efforts originated from the translation of hard (consistency-based) inference procedures from regular constraint problems to weighted ones. Again the work presented in this chapter differs primarily in using fractional weights and multiple rescaling iterations, thus allowing weight to be re-distributed further along chains of interacting variables, and at a finer granularity. More recently, the same review linking constraint satisfaction and probabilistic reasoning [146] has been simultaneously and independently applied to this research on weighted CSP, introducing fully fractional weights and multiple iterations of transformation to solve satellite scheduling and radio frequency assignment WCSP problems [37]. The principal differentiating characteristics of the framework presented here are its specialization for MaxSAT, and the associated algorithms for making MHET tractable on Boolean problems in conjunctive normal form. Further, we have used a different overall algorithm for minimizing problem height, and evaluated the result over a variety of problem types while isolating the effect of MHET in comparison to the existing pruning techniques. In contrast, in addition to the scheduling and frequency assignment experiments, the other publication focuses on a more in-depth treatment of the framework's theoretical foundations.

10.4.1 Comparison with the Virtual Arc-Consistency Algorithm

In more detail, the related research from within the WCSP community is centered on a height-minimization approach known as "augmenting paths" in the originating research by Schlesinger, while the algorithm presented here is an adaptation of the "max-sum diffusion" algorithm in the same work [134, 146]. Thus the two projects use two different algorithms to minimize the same function (the height of a constraint optimization problem).

At a theoretical level, it is possible to demonstrate that every fixed-point of max-sum diffusion is also a fixed-point of the augmenting paths algorithm. Without going into the details of defining the latter algorithm, a brief proof sketch is to exploit the same condition used in the theorem at the end of Section 10.2, which proves the convergence of max-sum diffusion. In essence, the following condition holds for every variable at a fixed-point for the execution of max-sum diffusion on a particular problem: choosing the height-maximizing polarity for the variable with respect to each clause in its neighborhood independently (and thus, possibly inconsistently) will not result in a different height than if we choose a single fixed polarity to maximize height across all clauses simultaneously and additively. This equilibrium corresponds to sending an empty message within the augmenting path algorithm (not described here), which yields a fixed point for that algorithm as well.

Consequently one can ask whether the two algorithms will reach the same fixed point on every problem. To answer this negatively, the recent study using augmenting paths [37] cites an example problem for which it will not terminate. In our own algorithm, this problem is circumvented by a convergence epsilon-parameter that overrides such non-convergence when manifested by the diffusion process.

At a more abstract level, the correspondence between probabilistic reasoning and constraint satisfaction that we defined in Chapter 5 can fruitfully characterize the difference between the augmenting paths (virtual arc-consistency) and the max-sum diffusion algorithms. Recall that the belief propagation algorithm for estimating marginal probabilities is actually a specific method for optimizing the Bethe free energy function, and that the operation of this method is founded on messages that correspond to soft arc-consistency on the relaxed constraint satisfaction representation of the marginal polytope (Section 5.1.1, Observation 2). Each iteration of the augmenting path algorithm, then, achieves soft arc-consistency over a single variable/constraint pair; it thereby corresponds to a single belief propagation update rule over a max-sum problem. In contrast, the diffusion algorithm achieves a specific form of arc-consistency amongst all the constraints corresponding to a specific variable at the same time, as restricted to an operation that averages the height of all the constraints in order to achieve consistency; this is the cause of its convergence behavior.

10.4.2 Conclusion

In conclusion, the MaxSAT solver presented here reflects an MHET apparatus that is decades-old [134], but that requires a good deal of adaptation and algorithmic design to achieve the gains depicted in Table 10.1. Several features of that foundation suggest directions of future work. For instance, it is possible to link the height-minimization framework to the entire field of research in performing statistical inference on probabilistic models (like Markov Random Fields). This link suggests algorithmic alternatives to the "diffusion" method presented here, and an opportunity to improve the generated bounds or (more importantly) reduce their computational cost. Here, though, we have adapted an existing formulation motivated by computer vision to introduce a new bounding technique to the MaxSAT research area. We have developed efficient algorithms to make the framework feasible for problems in clausal form, and tested its utility on a variety of contemporary problems.

                          Solved w/ MHET       Solved w/o MHET        Prunings Triggered        Runtime %
 Test Suite                #   brnchs  time     #    brnchs   time   mhet   orig.  both  other  mhet  rest
 auction-paths (88)       87    201K     74    76   14,087K   192     57      0      0      0    82     4
 random-frb (34)          34      1K    187     9      789K    12     1K      0      0      0    82     5
 ramsey (48)              39     40K     99    37       28K     4     4K      2     87    11K    12     1
 warehouse-place (18)      6    208K    520     1        6K     0    636      0      0      0    42     6
 satellite-dir (21)        3    632K    105     2      593K     7    33K      0      0      0    75     5
 satellite-log (21)        3  1,012K    475     2      578K     9    95K      0      0     89    59     4
 factor (186)            186      38      0   186       32K    13      3      0      0     17    67    26
 part. rand. 2-SAT (90)   90     511     15    90       511     8    143      0      0    333    57    40
 part. rand. 3-SAT (60)   60     35K     72    60       35K    20    473      0      4    33K    57     8
 rand. 2-SAT (80)         80     236      1    80       236     1      5      0      0    229    66    29
 rand. 3-SAT (80)         80     61K    171    80       62K    56    645      0      1    52K    76    13
 maxcut-d (62)            57     15K     89    57    37,145    82     5K      0      4    13K    70    23
 planning (56)            50      9K    147    50       97K   128    17K    149    162    27K    76    13
 quasigroup (25)          13    110K    156    13      564K    93     4K     1K     87    185    87     7
 maxcut-s (5)              4      3K     35     4        4K    22      5      0      2     3K    41    33
 miplib (12)               3     24K    325     3       30K   192     25      0      0     10    92     8
 mancoosi-config (80)     51    107K   1445    40    5,424K  1271     4K     23      3     95    78    13
 bayes-mpe (191)          10      5K      2    21      787K   152   299K      0      0     6K    82    14
 auction-schedule (84)    82    549K    164    84    1,504K    38     33      0      0      0    83     7

Table 10.1: Performance of MHET on weighted problems from MaxSAT Evaluation 2009.

The first two groups of columns depict the performance of MAXSATZ with and without MHET-derived lower bounds. Each entry lists the number of problems from a given test set (the number of instances in a test set appears in parentheses) that were solved within a 30-minute cutoff period, and the average number of backtracks and seconds of CPU time required for each successful run. (All three entries appear in bold when a configuration solves more problems than the other; otherwise backtracks and CPU time are highlighted to indicate the more efficient configuration.) The third group of columns tracks the number of prunings that the various lower bounding mechanisms are able to generate, as described in the text. The final pair of columns depicts the average percentage of total runtime devoted to performing the MHET lower-bounding technique, versus the rest ("orig." and "other" from the previous column heading, combined).

Part IV

Pragmatics

Chapter 11

Online Control of Constraint Solvers

We have seen in Chapter 5 that probabilistic inference, be it the marginal estimation task or the minimum-height equivalent transformation process of the previous chapter, is formally comparable to constraint reasoning, both in terms of structure and, accordingly, in terms of complexity. In this sense, then, the algorithmic designs of Parts II and III essentially solve linear relaxations of constraint problems in order to guide a solution process for the original problems themselves: bias estimates correspond to fractional variable assignments, and equivalent transformations correspond to fractional inference operations on constraints. Accordingly, we have observed experimentally that these probabilistic techniques make for comparatively high-cost, high-reward heuristics when embedded within modern solvers: runtimes can approach the cost of algorithms that are normally used to pre-process problems, or even solve them to completion, while the outputs of such techniques can drastically reduce backtracking, possibly to the point that they in fact represent outright solutions.

Such observations motivate the judicious application of such probabilistic techniques during problem-solving. Already we have seen a variety of control parameters that attempt to activate our methods when they will have the most impact on search, while forgoing their expense when the impact is unlikely to outweigh the cost. For instance, the deactivation threshold and decimation block size parameters try to get the most use from the fewest survey computations, and the α parameter thins the application of MHET according to a fixed format that puts an exponential preference on lower levels of the search tree. In this chapter we will consider automated methods for controlling search, either by tuning parameters within a fixed framework, or by directly choosing whether to employ a given method at arbitrary points during search. To perform the latter, we present a novel method based on reinforcement learning, and demonstrate its usefulness in controlling the MHET lower bounding technique of Chapter 10.

Thus in this chapter we first review existing approaches for tuning parameters within an existing control framework, and for choosing heuristics by training on sets of example problems. We then argue for an online approach to controlling parameters that can formulate and adjust a strategy during a single problem-solving execution, without relying on sets of training problems. To realize this approach, we define a reinforcement learning representation for backtracking search, and review the Q-Learning algorithm for computing a policy within this representation. Finally, we present the experimental results.

11.1 Existing Automatic Control Methods

Here we introduce terminology for expressing the goals of this chapter, and situate such goals within the space of existing research on learning and search. The performance of a search framework depends primarily on the effectiveness of its variable/value-ordering heuristics and, in the case of branch-and-bound search, its lower-bounding techniques for triggering prunings. (Here we will isolate the effect of our learning framework by focusing exclusively on two lower-bounding techniques, but the framework is general with respect to other heuristics and techniques.) In particular, the following terminology reflects the goal of using the best heuristic or technique at different points during search, either within a problem, or across multiple problem instances.

Search Strategy. A search strategy is a means of deciding which heuristics and/or techniques to run at a given search node, conditioned exclusively on information about the current subproblem. A simple strategy is to always run a specific lower-bounding technique in hopes of triggering a pruning, and then to use a specific variable/value-ordering heuristic if no pruning results–at every search node. But here “strategy” may also indicate a combination of techniques and heuristics, or even a series of rules that determine which heuristics or techniques to use according to statistics over the current branch of search.

Dynamic Solver. Our goal is to produce a dynamic solver, meaning one that is free to change strategy in two distinct but related senses. First, it can employ different strategies for different problem instances, hopefully in ways that are specialized to the properties of specific problems or problem sets. Furthermore, a dynamic solver is also free to change strategies online, or within a single execution, hopefully in a way that exploits its experiences from earlier during the search process; the distinction is between a fixed strategy that follows specific rules for triggering different reactions to different subproblems, and one where the rules themselves can change, in addition to the actions.

A number of existing perspectives on learning and search use multiple runs over a training set of example problem instances to formulate a strategy for use on future problems. The presumption is that the future problems are similar enough to the training problems so as to exhibit whichever properties are relevant to the strategy's success. One approach is to compile a “portfolio” strategy that chooses from a variety of heuristics, or even complete solvers, according to the properties of a given problem [150]. A probabilistic model (e.g. linear regression) then uses prior experiences on training problems with similar properties to formulate the final strategy. Another approach, focused on individual solvers, is to tune a solver's control parameters by local search over parameter settings [83]. For each step of local search, the settings are adjusted and the resulting solver is run over all the problems in a training set for feedback. Often, such a training set can naturally correspond to the problems comprising an individual industrial benchmark suite.

Additional methods that use reinforcement learning algorithms similar to the one described here nevertheless retain the use of a training set [124, 149, 56]. Such frameworks use each problem-solving run of a solver, under a particular configuration, as a single episode for tuning that solver. That is, feedback to the learning algorithm is at the level of a single execution on a single problem. In contrast, we wish to find a learning signal from within a single execution.

Solvers that learn from training sets meet the first condition for a dynamic solver because they can choose a customized strategy for a given problem based on past executions on the training problems. However, they do not meet the second condition, because the output of the learning process is still a static strategy that will not change through the course of solving a new problem. If the learning process is unsuccessful in choosing a suitable strategy for a given problem, there is no way to adjust the strategy based on negative feedback during search. The suitability of such a strategy depends on the number and relevance of similar problems that are available to form a training set. This observation holds for any method that forms a uniform strategy, whether by using a probabilistic model or local search to do so, or even if the method happens to use reinforcement learning to perform the same task.

Here our motivation is to create a dynamic solver that makes adjustments not only across problem instances, but also while solving a single instance. The other learning approaches have seen significant success in the presence of suitable training data, but relying on the availability of such data is not always well-motivated. Firstly, we should not understate the convenience of an off-the-shelf solver that does not need a special configuration or training period over example problems for each new problem type it encounters. More saliently, a second observation is that the availability of suitable training sets is often overestimated. Not only is the CSP-contest format of grouping problems into benchmark suites a fairly artificial usage scenario for general purpose constraint solvers, but many previous studies on learning have conflated training and test data in evaluating an algorithm.

Test problems are used to evaluate the strategy that the training process produces. When the test set comprises problems that were already used in the training set, this naturally increases the perceived success of the learning algorithm. While this model is well motivated by problem-solving contests that carry over a proportion of past problems for use in future contests, it is hard to imagine other applications where a user who would like to solve a given problem would first customize their solver by solving that very problem. Thus, while learning across problems using training sets is still a powerful and well-motivated approach in many cases, studies that use training problems as test problems may confuse the question of just how many training problems are required for efficient learning.

That being said, the framework to be presented is still amenable to a training-set usage scenario. In particular, the online reinforcement learning framework described below could also be interpreted as a particular learning technique for single-problem training sets. The final strategy that is in place at the end of a dynamically learning execution on one problem can be applied uniformly to a second problem with similar structure. Such a design would be equivalent to those of existing approaches that use reinforcement learning on training sets to formulate uniform solvers. In the case of larger training sets, some manner of average over final strategies can constitute a uniform strategy for test problems. Here we do not test the power of this design–our main motivation is to learn through the course of solving a single problem.

Thus we feel that there is some motivation for a dynamic solver that learns without training.

A similar motivation has inspired previous work that uses look-ahead to probe the various paths resulting from the application of different strategies from the same node, within a single execution [118]. Such probes serve not only to decide the current strategy, but also to supply predictive information about which future state and action combinations will be successful. But without a specific formalism for integrating such information, the overall approach is itself heuristic in nature. As described in the next section, we will employ the “Q-Learning” methodology to perform such an integration, drawing on existing designs from the field of reinforcement learning. Perhaps the most similar approach to the one described here applied a reinforcement learning framework to a “switching” design for iteratively running an assortment of algorithms in sequence, to solve an individual problem [26]. For a given iteration, each algorithm is allotted a greater or lesser amount of run time according to its degree of success during the previous iteration. The framework proved successful at switching between anytime local search algorithms, in solving problems where progress is easily measured by constantly-improving upper bounds: the only form of communication between solvers is the current upper bound, and improvement in this bound is the only signal for favoring one algorithm over another. This design is not as well-motivated for complete branch-and-bound search over problems where a large percentage of the runtime is devoted to eliminating large portions of the search space without necessarily improving the bound.

11.2 Reinforcement Learning Representation of Branch-and-Bound Search

Based on our motivations, we will hypothesize that it is possible to exploit prior experience “dynamically,” or online within a single execution of search. To test this hypothesis, we have implemented a Q-Learning system within the MAXVARSAT solver described in the two preceding chapters, in order to blend two different lower bounding techniques into an overall strategy that is reactive to the properties of the current subproblem in combination with prior experiences with similar subproblems.

The goal of designing a dynamically learning solver raises the question of how prior events within a single execution can inform future decisions. One of the central insights of this chapter is that while such a question may be less motivated for constraint satisfaction, it does make sense for constraint optimization. If we are simply looking for a single solution to a given constraint problem, then any experience short of actually finding one is not directly useful to the search.1 And on finding a solution, the problem-solving process is over anyway–looking ahead to the language of reinforcement learning, it is hard to think of a reward signal from which to learn online in a constraint satisfaction setting.

1One exception is scenarios where solutions are guaranteed to be “close” to near-solutions in some sense, in which case the problem is easy and amenable to local search. Another possible exception arises from viewing clause learning as a process where previous backtracking events in search can be used to prevent equivalent mistakes in the future.

On the other hand, complete, exact solvers for constraint optimization problems must account for the entire search tree–either by direct examination, as accomplished by node expansions, or by prunings, as justified by bounding techniques. This introduces an episodic structure by which we can measure our progress through the search by progressively accounting for an increasing number of the exponentially-numerous branches of the search space. (Since finding a solution to a CSP does not necessitate exploring every branch, there is no equivalent sense of progress for such a process.) So, for constraint optimization we can determine the beneficial effects of our decisions in terms of node expansions, and more significantly, in terms of prunings. Naturally, such a process will be approximate, but it need not be ad hoc or entirely heuristic if we exploit existing formalisms from reinforcement learning.

Similarly, it is not formally guaranteed that experiences with certain branches within an individual search tree must yield any information for efficiently processing other branches. But just as the hope of supervised learning is that problems from a training set have some relationship with unseen instances that are subsequently drawn from such a set, a key aspect of the hypothesis is that the various branches of a search space cannot be perfectly uncorrelated in problems of interest. That is, we hope that if certain actions trigger numerous node expansions or prunings under certain conditions during search, they are likely to do so under similar conditions later in the same search.

11.2.1 Markov Decision Processes and Q-Learning

Thus our hypothesis is that pruning events from prior branches of branch-and-bound search can guide our decisions over future branches, in a constraint optimization setting. To test this hypothesis, we will first review a specific framework for applying past experiences to future decisions in a dynamic fashion. We begin by describing the basic reinforcement learning framework in terms of Markov Decision Processes (MDPs). The presentation is adapted from a variety of basic texts, e.g. [140].

We can represent an MDP by a tuple M = (S, A, T, R), where:

• S is a set of states and A is a set of actions.

• T : S × A → S is a (probabilistic) transition function that (stochastically) maps a prior state to a subsequent one, upon performing an action.

• R : S × A × S → ℝ is a reward function that assigns a utility to each event where we execute a particular action in a particular state and in so doing transition to a particular new state.

For a given MDP we can define a policy P : S → A that maps each state to a recommended action. Narrowly construed, the goal of reinforcement learning is to learn a policy that maximizes the (future-discounted) cumulative reward if it is executed (in this case, indefinitely) within a particular MDP. In this setting, the transition function will not be known exactly (nor will the reward function).

Q-Learning is a specific “model-free” reinforcement learning framework that seeks an optimal policy without building an explicit representation of transition probabilities in the course of operation. Rather, Q-Learning maintains a function Q : S × A → ℝ representing the “value” of performing a particular action within a particular state. (Here the function will be implemented in tabular form.) The value in question incorporates both the immediate reward that is expected when performing the action in this state, as well as the (discounted) subsequent value of the expected ensuing state, if we are to perform in turn the most “valuable” action for that state. To accomplish this without having to model T, the framework applies the recurrence at Line 3 of Algorithm 17.

The algorithm is called each time we are about to perform a new action; for our own purposes this will correspond to expanding a new node of the search tree. For each state/action pair, we use statically allocated memory to retain a Q-value, as well as a counter for how many times we have previously performed the action when situated in the current state. We also retain a record of the previous state and action. The first step of the algorithm is to determine

Algorithm 17: Q-LEARNING
Data: Q[|S|, |A|], C[|S|, |A|]: Q-value and visit count for each state/action pair.
Data: s, a: state and action corresponding to previous invocation of Q-LEARNING.
1 Determine current state s′, and reward r′ for previous action.
2 C[s, a]++.
3 Q[s, a] ← Q[s, a] + α(C[s, a]) (r′ + γ max_{a′} Q[s′, a′] − Q[s, a]).
4 s ← s′.
5 a ← EXPLORE/EXPLOIT(Q, C).
6 return a.

the current state and calculate the reward received for our previous action. In the sequel this will mean deriving statistics about the subproblem engendered by the current branch of search, and measuring how long the previous action took as well as how many nodes it pruned, if any.

After updating the counter for our previous state/action combination, we now perform the main update, to the Q-value for this pair. We adjust it by a specific value based on the reward that we have just accrued, plus the difference between the Q-value of the best action we believe we can execute from our current state, and the Q-value of the previous state/action pair. The discounting constant γ ∈ (0, 1) is typically set close to 1; discounting encourages a greater emphasis on immediate rewards, which is both pragmatically motivated and necessary for the optimization criterion of expected perpetual reward to be well-conditioned. The learning rate function α : N → [0, 1] returns a fraction that should start near 1 and decrease toward 0 as we visit (s, a) an increasing number of times. The effect is to allow new experiences to greatly affect our initial estimates of Q-values, but to gradually dampen their effect so as to eventually converge to a more uniform policy.
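To make this update concrete, the following is a minimal sketch, in Python, of the recurrence at Line 3 of Algorithm 17. The class name, the dictionary-based tables, and the generic learning-rate schedule are illustrative assumptions for exposition, not the solver's own; the solver-specific schedule and state encoding are described in Section 11.3.

    from collections import defaultdict

    class TabularQLearner:
        # Minimal sketch of Algorithm 17: tabular Q-Learning with visit counts.
        # The learning-rate schedule below is a generic placeholder.

        def __init__(self, n_actions, gamma=0.99):
            self.Q = defaultdict(float)   # Q[(state, action)] -> estimated value
            self.C = defaultdict(int)     # visit count for each (state, action)
            self.n_actions = n_actions
            self.gamma = gamma            # discount factor
            self.prev = None              # (state, action) from the last call

        def alpha(self, n):
            # Learning rate: near 1 for early visits, decaying toward 0.
            return 1.0 / (1.0 + n)

        def step(self, state, reward, choose_action):
            # Called each time we are about to perform a new action
            # (for our purposes, each time we expand a search node).
            if self.prev is not None:
                s, a = self.prev
                self.C[(s, a)] += 1
                best_next = max(self.Q[(state, a2)]
                                for a2 in range(self.n_actions))
                td = reward + self.gamma * best_next - self.Q[(s, a)]
                self.Q[(s, a)] += self.alpha(self.C[(s, a)]) * td
            action = choose_action(self.Q, self.C, state)  # EXPLORE/EXPLOIT
            self.prev = (state, action)
            return action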

Finally, at Line 5 of the algorithm we consult the Q-values to decide which action to perform next. On the one hand, we would like to exploit our existing knowledge by choosing the action with the highest value in tandem with the current state. On the other hand, we would like to expand our base of experiences and explore actions that have been infrequently sampled from this state. Within reinforcement learning this tradeoff is made explicit by a chosen exploration/exploitation function, as informed by the Q-values and the visit counts for the various actions available in the current state. Common approaches allow for a constant amount of noise that promotes the occasional selection of a completely random action, in addition to an initial exploration-heavy period where we first try to use a diversity of actions from each state, before placing greater emphasis on exploitation. The implementation described below is based on such typical approaches.

11.2.2 Theoretical Features of Q-Learning

Under certain conditions on the learning rate and exploration/exploitation function, Q-Learning can be shown to converge to the optimum policy for a fixed MDP. Additional conditions allow for theorems that bound the worst-case rate of convergence to such optimum policies, or even the worst-case difference between the optimal sequence of actions and the one determined under expectation by a given learning process [27]. Such results reflect a solid formal footing for the endeavor of learning dynamically during search, and motivate future research investigating the structure of various search spaces. At this point, though, such results cannot be applied directly because the reward function for our domain cannot be stated explicitly in closed form. Instead, at a highly conceptual level one can contemplate the existence of some exact function that specifies the effect of applying a given heuristic or technique at any arbitrary point during the search; such theorems are well-founded with respect to this function, but for practical purposes the function will remain unknown.

11.3 Application to MaxSAT

To validate our hypothesis that prior experience from within a single run of branch-and-bound search can inform future decisions for constraint optimization, we have implemented the described Q-Learning framework within MAXVARSAT. The learning module maintains a dynamic policy that blends the “standard” set of lower-bounding techniques already built into MAXSATZ (weighted resolution, unit propagation, pure literal rule, and one-step lookahead) with the MHET technique of the previous chapter. As we have seen, the standard techniques use data structures that are already built into the overall solver, and are more efficient to compute; MHET is more expensive in terms of run time, but is very powerful and can trigger unique prunings that are not available to the standard techniques. Regular MAXSATZ runs the standard techniques at every search node; previously we have also run MHET after every α node expansions. Now we introduce the opportunity to blend the two techniques at any given node during search by choosing to run each of them or not, balancing our learned experiences with the desire to explore diverse actions. At this point we state our representation of the MaxSAT-solving process within the Q-Learning framework.

States. We represent state by a pair of quantities: the current depth of the branch we are searching (i.e. the number of variables that have been assigned), and the gap between the current upper bound and the most recently computed lower bound on this branch of search. Shallower depths generally represent a smaller chance of pruning, but exponentially larger savings in node expansions when pruning is achieved. They also represent somewhat greater computational costs for lower bounding techniques because fewer variables are already fixed. Smaller gaps represent situations where pruning is more likely to be imminent, but where the rewards may also be correspondingly lesser if we are already close to backtracking anyway, by variable assignments alone.

To keep the number of states tractable, depth and gap values are grouped into a fixed number of brackets. Each bracket groups a fixed percentage of the variables in the case of depth, and a fixed range of integer values in the case of gap. Because these areas are more important, brackets have finer granularity at greater depths and smaller gap values. Specifically, we use 18 “depth brackets” to represent situations where we have assigned up to 50%, 75%, 85%, 90%, 92%, 94%, 95%, 96%, 96.5%, 97%, 97.5%, 98%, 98.5%, 99%, 99.25%, 99.5%, 99.75%, or up to 100% of the variables to yield the current branch of search. We use six “gap brackets” to aggregate states where the upper bound exceeds the lower bound by a weight value of 1, 2, 3, 4 to 7, 8 to 63, or 64 and greater (cf. Table 11.2).
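As an illustration, this bracketing can be realized by a simple lookup; the following sketch assumes the bracket boundaries shown in Table 11.2, and the function and constant names are ours rather than the solver's.

    import bisect

    # Depth brackets: cumulative fractions of assigned variables (18 brackets).
    DEPTH_CUTOFFS = [0.50, 0.75, 0.85, 0.90, 0.92, 0.94, 0.95, 0.96, 0.965,
                     0.97, 0.975, 0.98, 0.985, 0.99, 0.9925, 0.995, 0.9975, 1.0]

    # Gap brackets: UB - LB of 1, 2, 3, 4-7, 8-63, or 64 and greater.
    GAP_CUTOFFS = [1, 2, 3, 7, 63]

    def state_of(depth, n_vars, gap):
        # Map a (depth, gap) pair to a discrete (depth bracket, gap bracket)
        # state; with these cutoffs there are 18 x 6 = 108 states in all.
        frac = depth / n_vars
        depth_bracket = bisect.bisect_left(DEPTH_CUTOFFS, frac)   # 0..17
        gap_bracket = bisect.bisect_left(GAP_CUTOFFS, gap)        # 0..5
        return (depth_bracket, gap_bracket)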

Actions. At each node of the search tree, we can run neither lower-bounding technique (action “0”), standard techniques alone (“1”), MHET alone (“2”), or both standard techniques and MHET (“3”). (The actions are numbered in increasing order of computational expense, and pruning capability.)

Rewards. The usefulness of each action is based (negatively) on the amount of time it took to execute and (positively) on the amount of time the solver would have required to process any subtree that the action managed to prune. The motivation for this scheme is to balance the tradeoff between computational expense and improved pruning by expressing them in the same quantity: time. The time cost of an action is measured by recording timestamps between calls to the Q-Learning module. This allows the cost to incorporate miscellaneous overhead like simplifying a theory after fixing a variable, clause learning, etc. To calculate a reward for pruning, we rely on an estimate of how long it takes to expand a node within the current problem. To reflect fluctuations of this value across different regions of the search space, we use a moving average over the last one thousand expansions. By consulting the depth of the search at the time of a successful pruning, we calculate the size of the pruned subtree and multiply by this value. In particular, if d is the difference between the number of variables in a problem and the current depth of search, then pruning at this point eliminates 2^d search nodes.

Exploration Strategy, Learning Rate, and Discount Rate. To balance exploitation and exploration, we employ the common reinforcement learning practice of greatly preferring actions that have not yet been attempted from a given state more than a given number of times [140]. In particular, we use a threshold value of 100 previous executions of a particular action from a given state. That is, from a given state, our first priority is to attempt any action whose visit count, defined with respect to this state, remains below the threshold. After this initial exploration phase, we exploit our experiences by choosing the action with highest Q-value for a given state with high probability. In particular, at any given point we will perform a completely random action with probability 0.01, and choose the action with highest Q-value otherwise.

In updating Q-values, we gradually harden our policy by using a learning rate function of α(n) = 10,000/(9,999 + n), where n is the number of times we have visited the current state. As we visit a state increasingly often, this function progressively scales the update to the state's Q-values downward to 0.

In evaluating future rewards, we use a discount factor of γ = 0.99. This value is closer to 1 than in many reinforcement learning applications, reflecting the relatively long time horizon (in terms of number of node expansions) for solving a constraint optimization problem.

Transition Model. Recall that Q-Learning is a model-free method. Therefore, it never explicitly or implicitly represents, for instance, the probability that a particular action will transition the search from one particular depth bracket to another more or less advantageous one. Rather, the average “value” of the states that tend to ensue from a particular state/action pair is built into the pair's Q-value by the update rule.

11.4 Experimental Design and Results

The experiments summarized in Table 11.1 evaluate the effectiveness of the implemented reinforcement learning system on fourteen test sets, again comprising the most general (weighted partial) division of the 2009 MaxSAT Evaluation [8]. The experiments were run on machines with 2.66 GHz CPUs using 512 MB of memory.

Each of the four main groupings of columns shows the performance of a particular version of MAXSATZ. The first embodies a fixed uniform policy of running both standard and MHET lower bounding techniques at every node of the search tree. The second consists of regular MAXSATZ, in pursuing a fixed uniform policy of running standard techniques alone at every node. The third grouping of columns represents our reinforcement learning approach to blending the two techniques. The final grouping reflects the simple alternative blending method that produces a (technically) dynamic solver: we run standard techniques at every node, and additionally attempt to prune using MHET every α = 1000 node expansions. Recall that this policy is not entirely unmotivated: there are exponentially more nodes at the bottom of the search tree, so MHET will run exponentially more often on deeper nodes, only occasionally attempting shallower nodes in hopes of achieving a larger pruning.

Within each grouping of columns, the table depicts the number of problems from a given test set that a particular configuration can solve within a 30-minute timeout window. It also shows, over problems that were solved within the timeout period, the average number of backtracks and average overall runtimes. (Thus, a version may show a lower average runtime even if it solved fewer problems.) The most basic conclusion to draw from the table is that in terms of problems solved, using Q-learning to produce a dynamic policy is never as bad as the worse of the two fixed, uniform policies, except in one case (at the bottom of the table). With three additional exceptions, Q-learning is always able to match or exceed the better of the two fixed solvers, despite the overhead of pursuing exploratory actions and performing learning.

Problems where Learning Does Better than any Original Solver. At a finer level of examination, the first grouping of three rows in the table consists of problems where in fact, the learning solver outperforms both of the standard solvers. Closer examination has revealed that problems in this category tend to be mostly amenable to standard lower bounding techniques, but still feature a significant number of opportunities for pruning by MHET. Performing MHET at every node is therefore prohibitive due to its cost, but leaving it out altogether keeps the solver from handling additional problems without timing out. Further, it is important to run MHET at strategic points in the search, as identified by Q-Learning: the control represented by the fourth grouping of columns actually runs MHET a roughly comparable number of times on these problems, but does not show any improvement over the standard uniform policy.

Problems where Learning Matches the Better Original Solver. The second, largest grouping shows problems where the Q-learning method does as well as the better of the two original solvers. For the first three problems, the learner first progresses through its initial exploration-intensive period and then successfully learns to simulate the better of the two fixed policies before time runs out. This is reflected by final policies that assign 0's and 1's to most states, or 0's and 2's, depending on whether standard techniques or MHET are the most suited for these problems, respectively.

Interestingly, for the quasigroup completion test set, learning achieves essentially the same performance as both fixed solvers, but on examining its final policies one typically observes a wide variety of actions prescribed across various states of search. Thus, it achieves comparable performance by very different means. This is not so for the control method in the fourth grouping of columns, whose performance is, as usual, very similar to the fixed policy of using standard techniques alone. This difference suggests that unless one is intelligent about using MHET in specific situations, one must fill the entire search process with MHET operations to observe any (positive or negative) effect.

The last three problems in this group are very easy for all three solvers. The policies generated by learning do still tend to be mixed, as for the quasigroup problem. Here, though, this does not reflect the discovery of a new strategy for factoring and random MaxSAT; rather, it simply does not matter much what actions the solver takes, as the problems are typically solved before the initial exploration phase is even completed.

Problems where Learning Matches (or Exceeds) the Worse Original Solver. For the problems in the third grouping of rows, the reinforcement learning mechanism fails to match the better of the two uniform strategies, but is at least as good as the other. (In the case of the warehouse placement test set, it solves one more problem than regular MAXSATZ.) Manual study of the policies for these test sets reveals that the learning framework does not actually simulate the worse of the original solvers (by including MHET often in the case of warehouse placement, or by excluding it when solving the other two problems). Rather, it does eventually learn close approximations to the superior fixed solver, but in contrast to the previous grouping, it does not always succeed in doing so before the timeout period. The small number of problems for which this happens drops its overall performance to the same number of solved problems as the worse fixed solver. This generalization was corroborated by observing better performance on extending the timeout period from 30 to 45 minutes.

Problem where Learning does Worse than any Original Solver. For the Mancoosi configuration problem set comprising the “industrial” category of the MaxSAT evaluation, adding reinforcement learning degrades performance to the point that fewer problems are solved than by either of the fixed solvers, or even the experimental control. This set tends to feature the largest problem instances, with numbers of variables typically in the thousands. However, the complexity of the Q-Learning system does not directly depend on the number of variables; one possibility is that a larger number of depth brackets is needed to represent this larger state space. Alternatively, it is possible that applying MHET at any point on these problems is so harmful that the initial exploration phase already renders the learning solver hopeless. However, the uniform strategy of applying MHET at all times still does better than the learning method; a possible further explanation is that for four of the problems, MHET is actually useful but must be applied all the time from the beginning. Further experimentation is required to test these hypotheses.

General Observations. The above analysis is in terms of number of problems solved within a given cutoff period. In terms of overall runtime and number of backtracks, Q-Learning is typically a little bit slower than the better uniform solver. The runtime difference is not as troubling as it may first seem, because the overhead of performing Q-learning is constant with respect to problem size (the dependency is with respect to the number of states and actions). So, with increasing timeout periods, the overhead involved with Q-learning will shrink relative to the rest of the problem-solving process. In terms of backtracks, the discrepancy is due to the exploration period where the learning module is biased toward trying every action from every state a fixed number of times before solidifying its reliance on past experiences. With longer timeout periods, the effects of this period will become relatively smaller as well, but only with respect to a fixed threshold on the number of state/action visits before terminating early exploration; we may actually wish to increase this threshold on encountering harder problems and longer timeouts in the future.

A final observation is that Q-Learning outperforms the control method depicted in the fourth column of the table on all but the last problem set, even though the α-parameter for using MHET every one thousand expansions was optimized (by trial and error) for the entire set of problems as a whole. This supports the hypothesis that the flexibility to use different lower bounding techniques non-uniformly at different points in search is at least as useful as the flexibility to use different combinations of techniques in a uniform but unfixed fashion.

11.4.1 Example Policy Generated Dynamically by Q-Learning

Table 11.2 shows an example policy generated by Q-Learning through the course of solving an individual problem instance from the Quasigroup Completion test set. As described in the caption, the row and column labels denote the various states of search represented within our learner, in terms of the depth of the current branch of search (rows), and the gap between the current upper bound and the most recently computed lower bound (columns). Zero entries correspond to null actions whereby we forgo the opportunity to trigger a pruning in order to save the time required to compute a lower bound. The roughly upper-left-triangular format of the non-zero entries corroborates the intuition that attempting to prune is most likely to be worthwhile when the gap between upper bound and recent lower bound is smaller, and when we are higher in the search tree–here the reward for pruning is exponentially larger.

Roughly speaking, MHET has proven more effective for this individual problem than the standard lower-bounding techniques, especially toward the bottom of the search tree and when the gap between lower and upper bound is small. This is reflected by the presence of two's corresponding to states where the prescribed action is to run MHET alone, versus one's indicating that standard techniques alone have proved more successful. Entries marked “3” indicate that we should pay the extra computational expense of running both MHET and standard techniques, in hopes that one will trigger a pruning. These tend to appear higher in the search tree, where the reward for triggering a pruning is greater (in terms of the size of the pruned subtree, and also in the sense that lower in the tree we are already close to triggering backtrack anyway, merely by assigning variables), and also when the gap between the current upper bound and the most recent lower bound is larger and it is therefore harder to trigger a pruning.

Also note the almost all-zero row corresponding to the depth bracket [732,734). Rather than reflecting a natural property of 747-variable quasigroup completion problems, it almost certainly demonstrates the approximate nature of the whole learning methodology. In essence the solver just happened to encounter almost no success when attempting to prune partial assignments of length 732 or 733–due partly to the unpredictable nature of pruning itself, and partly to the behavior of the exploration policy that would have encouraged repeated attempts at this level. A final observation is that if one were to develop an improved static policy for similar quasigroup completion problems by directly consulting this table, then the all-zero rows and columns suggest a simple rule that could be added to any non-dynamic solver: so long as we maintain a one-percent chance of performing a random bounding action at any search node, we can deactivate any lower bounding techniques when fewer than two thirds of the variables have been fixed, and when the most recently computed lower bound was not within seven of the current upper bound.
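For concreteness, such a rule might be rendered as the following guard; this is a sketch only, with the function name ours and the thresholds read off Table 11.2 as just described, and a real solver would still have to decide which bounding technique to run when the guard succeeds.

    import random

    def should_attempt_bound(depth, n_vars, gap):
        # Static rule suggested by the all-zero rows and columns of Table 11.2:
        # keep a one-percent chance of a random bounding action, but otherwise
        # skip lower bounding until two thirds of the variables are fixed and
        # the most recent lower bound is within seven of the upper bound.
        if random.random() < 0.01:
            return True
        return depth >= (2 * n_vars) / 3 and gap < 8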

Recall, however, that a “dynamic” policy, as defined here, is flexible both within a single problem, and across problems. That is, the policy shown in the table was captured at the culmination of a (successful) search process; at any other point during the search the entries would have been different, reflecting the solver's experiences up to that point as determined by its exploration/exploitation policy. And across problems, the same upper-triangular format does not necessarily persist. For instance, in managing to solve the one additional Warehouse Placement problem that timed out when using a fixed policy of applying MHET at every node, Q-Learning eventually learns to forgo both MHET and even standard pruning techniques, producing a table that is mostly zero's, with a small number of one's as well.

11.4.2 Concluding Remarks

We have adapted a Q-Learning model-free reinforcement learning framework to a modern branch-and-bound MaxSAT solver. The resulting solver is not itself revolutionary; there is a limit to the possible empirical improvement from blending a single pair of techniques. But as a preliminary proof of concept that prior experience can be used to guide branch-and-bound search for constraint optimization dynamically, i.e. within a single run, the solver motivates future efforts to integrate more comprehensive collections of heuristics and techniques within the general reinforcement learning framework. In this light, future directions of research include opportunities to incorporate additional reinforcement learning variations like finite time horizons, whereby the learning process can be episodically re-initialized, for improved sensitivity to the most recently traversed areas of the search tree. Also, a richer representation of state may result in more powerful policies, with the subject of function approximation coming to prominence in order to remedy the intractability of a tabular Q-function representation over much larger state spaces.

                      Solved Using          Solved Using          Solved Using          Solved Using Standard
                      MHET+Standard         Standard Only         Q-Learning            at Each Node, and
                      at Each Node          at Each Node                                MHET Every 10^3
Test Suite            #    brnchs    time   #    brnchs    time   #    brnchs    time   #    brnchs    time
planning (56)         50   9K        147    50   97K       128    55   335K      65     50   97K       133
ramsey (48)           39   40K       99     37   28K       4      40   46K       92     37   28K       4
miplib (12)           3    24K       325    3    30K       192    3    4K        44     3    30K       165
auction-paths (88)    87   201K      74     76   14087K    192    87   296K      86     76   13959K    191
satellite-dir (21)    3    632K      105    2    593K      7      3    2146K     349    2    590K      7
satellite-log (21)    3    1012K     475    2    578K      9      3    2320K     542    2    576K      9
quasigroup (25)       13   110K      156    13   564K      93     13   1100K     218    13   564K      94
factor (186)          186  38        0      186  32K       13     186  74        0      184  19K       6
rand. 2-SAT (90)      90   511       15     90   511       8      90   998       13     90   511       9
rand. 3-SAT (90)      60   35K       72     60   35K       20     60   47K       60     60   35K       20
bayes-mpe (191)       10   5K        2      21   787K      152    19   496K      227    21   786K      154
warehouse (18)        6    208K      520    1    6K        0      2    4K        3      1    6K        0
auction-sched (84)    82   549K      164    84   1504K     38     82   1062K     93     1    6K        0
configuration (80)    5    1107K     1445   40   5424K     1271   1    4708K     1358   34   5037K     1304

Table 11.1: Performance of Q-Learning on problems from MaxSAT Evaluation 2009.

The first two groups of columns depict the performance of MAXSATZ with fixed, uniform policies for computing lower bounds. The first corresponds to a policy of using both the MHET and the standard techniques at every search node; the second represents the use of standard techniques alone, again at every search node. The third group of columns shows the performance of MAXSATZ using Q-Learning to develop a dynamic policy to choose from four actions at each node: no lower bounding, standard techniques alone, MHET alone, or the use of both standard techniques and MHET. The fourth grouping represents the control dynamic policy that uses standard techniques at every node, and MHET at every thousandth node expansion. Each grouping of three columns lists the number of problems from a given test set that were solved within a 30-minute cutoff period, and the average number of backtracks and seconds of CPU time required for each successful run. (All three entries appear in bold when a configuration solves more problems than any other; otherwise backtracks and CPU time are highlighted to indicate the more efficient configuration(s) for these metrics.) The number of problems in each test set appears in parentheses after the name of the set.

                      UB-LB ∈   UB-LB ∈   UB-LB ∈   UB-LB ∈   UB-LB ∈   UB-LB ∈
                      [1,1]     [2,2]     [3,3]     [4,7]     [8,63]    [64,...)
Depth ∈ [  0, 386)    0         0         0         0         0         0
Depth ∈ [387, 572)    0         0         0         0         0         0
Depth ∈ [573, 646)    3         2         1         3         0         0
Depth ∈ [647, 683)    2         1         1         3         0         0
Depth ∈ [684, 697)    2         2         2         3         0         0
Depth ∈ [698, 711)    3         2         1         3         0         0
Depth ∈ [712, 718)    3         2         3         1         0         0
Depth ∈ [719, 725)    0         2         1         0         0         0
Depth ∈ [726, 728)    1         0         2         0         0         0
Depth ∈ [729, 731)    3         2         3         0         0         0
Depth ∈ [732, 734)    0         1         0         0         0         0
Depth ∈ [735, 737)    0         2         1         0         0         0
Depth ∈ [738, 740)    2         1         0         0         0         0
Depth ∈ [741, 743)    2         2         0         0         0         0
Depth ∈ [744, 744]    2         2         0         0         0         0
Depth ∈ [745, 745]    2         0         0         0         0         0
Depth ∈ [746, 746]    2         0         0         0         0         0
Depth ∈ [747, 747]    0         0         0         0         0         0

Table 11.2: Example Policy Developed by Q-Learning for a Quasigroup Completion Problem.

The table depicts the final policy that emerges at the end of a single run of MAXSATZ using Q-Learning to form a dynamic policy, for a 747-variable problem from the Quasigroup Completion test set. States are specified by row labels corresponding to the current depth of search, along with column labels corresponding to the gap between the current upper bound and the most recently computed lower bound. For each such state s, the table's entries represent the most rewarding action a to perform in this context, i.e. the a that maximizes Q(s, a). The action entries correspond to: 0, do not compute any lower bound (i.e. do not attempt to prune); 1, compute lower bound using standard techniques only; 2, compute lower bound using MHET only; 3, compute lower bounds using both standard techniques and MHET.

Chapter 12

Future Work

Here we summarize some main avenues for future research that are motivated by the present thesis: that probabilistic reasoning and constraint satisfaction are closely related on a formal level, and that adapting representational and computational ideas across the two areas can produce effective new algorithms.

12.1 General Program for Future Research

One compelling way to characterize research in artificial intelligence is as the adaptation of theoretical and algorithmic frameworks from existing areas like statistical inference or symbolic reasoning, to new applications that go beyond the original motivations for developing such frameworks. For instance, statistical regression can be used by machines to recognize human voices, and combinatorial search can be used to schedule the activities of electronic agents.

The process of adaptation is not straightforward or trivial–it requires both creativity and a deep formal understanding of the existing theories and algorithms. To extend this observation to a greater level of detail, here we have sought in turn to present probabilistic and constraint reasoning as special realizations of a general underlying formal and algorithmic framework. Namely, within the representation of graphical models and the sum-product framework, we view both tasks in terms of constrained optimization, either numerically in the case of continuous marginal probabilities from a joint distribution, or combinatorially in the case of discrete variables from a constraint satisfaction problem (cf. Figures 5.1 and 5.2).

Just as it is an oversimplification to describe machine learning as a rote rehash of statistics, or planning and scheduling as a glorified restatement of search, it is important to recognize that probabilistic reasoning and constraint satisfaction have their own special characteristics as problems. But in fact it is these special properties themselves that motivate a better understanding of basic optimization principles in order to cross-fertilize ideas between the two areas. To do so, it is necessary to explicitly recognize the formal correspondence between probabilistic and constraint reasoning, a task that is impossible without the sort of unified account presented here in Chapter 5. While folklore and introductory texts have noted a variety of correspondences based on common approaches to constraint-graph structure, and increasingly, to the decomposition of such structure through the course of a problem-solving process, an even more abstract account based on optimization allows for even greater interplay between the two areas of research.

The use of marginal estimation as a combination of soft search and inference to guide complete constraint satisfaction and constraint optimization solvers, as we have done here, represents one navigation of this space at the intersection of probabilistic and constraint reasoning. A general research program is to further explore this space by transferring approaches from one area to the other, or creating approaches that are inspired by both areas, as in the case of EMBP. As already suggested in Chapter 5, for instance, an optimization perspective on marginal estimation motivates the generation of new constraints online during a solving process, as a probabilistic analogue of clause learning in SAT or nogood learning for CSPs. In the other direction, by viewing constraint satisfaction as an optimization we were able to view the application of existing bias estimators as a soft form of inference between individual variable/constraint pairs; just as the “max-sum diffusion algorithm” of Section 10.2 breaks from this format in updating all constraints simultaneously with respect to a given variable, so can more flexible message-passing schedules from machine learning inspire new inference methods for bias estimation, or for constraint satisfaction itself. Below we elaborate some additional opportunities for future work in greater detail.

12.2 Specific Research Opportunities

Within the general research program described above, we can distinguish specific research opportunities by whether they are conceptual, algorithmic, or implementational in focus.

12.2.1 Conceptual Research Directions

Applying global constraints to marginal estimation. Because the project described in this dissertation was founded on the integration of probabilistic reasoning with constraint satisfaction, it follows that many of the possibilities for future work involve the further transfer of ideas between these two areas. While here we have primarily applied algorithms and ideas from probabilistic inference to constraint satisfaction, one particularly promising idea is to transfer the concept of a global constraint from constraint satisfaction to probabilistic reasoning. Typically factors in probabilistic reasoning applications have highly localized structure, but there is no reason why a global factor with specific properties cannot be applied to various reasoning tasks by exploiting closed forms.

For instance, in the field of constraint reasoning a global constraint like the alldiff function [131] may range over many if not all the variables in a given problem. If we consider the resulting constraint graph or factor graph representation strictly in terms of graph structure, then, it is unlikely to appear tractable from the perspective of induced width and general variable elimination (cf. Section 1.2.2). However, for the specific purposes of enforcing generalized arc-consistency with respect to such a constraint, it is possible to create closed form “propagator” procedures that eliminate inconsistent values from variables in the constraint's scope according to assignments to the other variables [131]; indeed many practitioners view the fundamental distinguishing focus of the entire constraint programming enterprise (versus more general areas like “search”) to be the design of such propagators.

Because in Chapter 5 we have characterized marginal estimation as a relaxed version of a specific constraint satisfaction task, we can now motivate the design of similar “soft” propagators for use within probabilistic reasoning. Instead of passing messages calculated over all possible extensions to a function in a factor graph, which prohibits the incorporation of global function scopes, we can instead use closed forms to determine the relevant quantity without iterating explicitly through extensions. Here we have proposed EMBP as a general framework for integrating such forms; in this case we determine the sum over Q(·) probabilities for all extensions containing a particular variable assignment, by using closed forms expressed in terms of the sole support of a clause (Figure 6.2.2). Perhaps because of a historical emphasis on computer vision and grid Markov random fields, this idea has not been widely pursued in machine learning outside of some very recent research [141]; one central dogma of machine learning research has been that locality (of functions in a probabilistic graphical model) is the basis for tractability.

Targeting backdoor variables instead of backbones. A second conceptual direction for future research that is a more direct extension of the present effort is to identify the backdoor variables defined in Chapter 4 within problem instances. In the present project we use bias estimates to target backbone variables in a natural way: such variables appear with a single polarity across all solutions, so their bias distributions should be 0/1. Thus we have confirmed that stronger bias estimates are indeed correlated with backbone variables. In the case of backdoors, or sets of variables that greatly simplify a problem with respect to a given inference procedure, the path forward is less obvious. One possibility is that backdoor variables may have weaker biases toward the unconstrained “joker” state within the survey propagation model.

A more ambitious possibility is to integrate a model of the unit propagation inference procedure, or any other procedure that we may elect to apply to the definition of a backdoor set, within our probabilistic model. Specifically, such a program would aim to directly model the probability that a given variable assignment will trigger a propagation, as opposed to the probability that it will appear in a solution to a given problem. One superficial way of describing the resulting heuristic strategy would be as a probabilistic version of impact-based search [130].

12.2.2 Algorithmic Research Directions

Developing alternative bias estimation update rules. By characterizing solvers such as the one associated with survey propagation, as well as our own, explicitly in terms of the calculation of marginals, we immediately suggest alternative approaches to the main line of research taken here. Since any method for computing marginals can be used in place of the methods developed here, it is worth considering the larger space of variational methods for use as bias estimators. Here the suggestion is not to indiscriminately substitute some assortment of off-the-shelf marginal estimation packages according to the latest developments in machine learning, in hopes that one of them will happen to work better than the methods presented here. Indeed one of the conclusions that can be drawn from the development of the EMBP bias estimators is that a first-principles approach to algorithm development can advance the state of the art by abstracting away from the historical contingencies and assumptions that underlie current practices: by viewing marginal estimation as an arbitrary optimization task defined over distributions (as suggested by [151]), we were able to adopt a parameter estimation perspective and apply the EM framework to obtain a novel algorithmic framework. A deeper, principled understanding of variational methods and the marginal polytope can only encourage future algorithmic advances, especially given the rich history of online constraints like cutting planes within the area of discrete optimization [72].

Treating local search as a bias estimation technique. Recall from Chapter 1 that sampling-based methods represent an entire alternative approach to estimating marginals, in lieu of the message-passing techniques comprising the main line of research here. At the end of that chapter, we introduced random walk-based estimators like the Gibbs sampler. Such techniques are unlikely to give good estimates unless we customize them to constraint satisfaction, as we have done for message-passing through the EMBP framework.

In Chapter 5 we observed a correspondence between the random walks underlying such

marginal estimation techniques, and random walks that are conducted to solve constraint sat-

isfaction problems directly as a means of local search. This comparison yields the possibility

of employing local search techniques that have been developed especially for constraint satis-

faction, not only in hopes of finding a solution directly, but also as bias estimators for guiding

some overall search process.

Specifically, any local search technique that combines an exploitation strategy that is at-

tracted to near-solutions (and thus, hopefully, solutions) with an ergodic exploration strategy

that perpetually prevents entrapment within a set of local optima, itself represents a particular

policy for conducting a random walk over the space of configurations. Where before such a

walk was viewed as useless unless it ended at a solution configuration, the sampling approach

to bias estimation views the trail of configurations comprising the walk as a series of sam-

ples; the proportion of configurations containing a particular variable assignment is the basis

for measuring that variable’s bias toward the corresponding value. This raises the empirical

question of whether near-solutions are correlated with solutions in applications of theoretical

or practical interest.
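As a purely illustrative rendering of this idea, the Python sketch below runs a WalkSAT-style walk over clause-list SAT instances and reads biases off the trail of visited configurations. The flip heuristic, noise parameter, and step budget are assumptions made for the sketch, not a tuned method from this dissertation.

```python
import random

def breaks(clauses, assign, var):
    """Number of currently satisfied clauses that flipping `var` would falsify."""
    flipped = dict(assign)
    flipped[var] = not flipped[var]
    def sat(clause, a):
        return any(a[abs(lit)] == (lit > 0) for lit in clause)
    return sum(1 for c in clauses if sat(c, assign) and not sat(c, flipped))

def walk_biases(clauses, n_vars, steps=10000, noise=0.5, seed=0):
    """Run a WalkSAT-style random walk and treat its trail as samples:
    the fraction of visited configurations assigning a variable True is
    reported as that variable's estimated bias."""
    rng = random.Random(seed)
    assign = {v: rng.random() < 0.5 for v in range(1, n_vars + 1)}
    true_counts = {v: 0 for v in assign}
    taken = 0
    for _ in range(steps):
        unsat = [c for c in clauses
                 if not any(assign[abs(l)] == (l > 0) for l in c)]
        if not unsat:
            break                              # walked onto a solution outright
        clause = rng.choice(unsat)
        if rng.random() < noise:               # ergodic exploration move
            lit = rng.choice(clause)
        else:                                  # greedy exploitation move
            lit = min(clause, key=lambda l: breaks(clauses, assign, abs(l)))
        assign[abs(lit)] = not assign[abs(lit)]
        taken += 1
        for v, val in assign.items():
            true_counts[v] += int(val)
    return {v: (true_counts[v] / taken if taken else 0.5) for v in assign}
```

The returned dictionary can then be fed to the same decimation or value-ordering machinery as any message-passing survey, which is the sense in which local search doubles as a bias estimator.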

Solving variations of the marginal estimation, constraint satisfaction, and constraint optimization problems. Recall that in Chapter 1 we defined the MPE (also known as "MAP") problem as an alternative query to computing marginals; in Chapter 3 we defined the model counting and unsatisfiability problems as alternatives to constraint satisfaction and constraint optimization. Here we have applied marginals to CSP and COP, but all other combinations of queries and problems can be attempted as future work as well. Solution methods for MPE typically involve search, so applying MPE to CSP may be redundant to a certain extent; but, as mentioned above, the influence between probabilistic reasoning and constraint reasoning can flow in both directions, and thus CSP concepts like global constraints can be applied to the MPE task within machine learning.
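One small illustration of how closely the two queries are related: on a chain-structured model, the same dynamic program answers both, with summation yielding (unnormalized) marginals and maximization yielding the value of the MPE configuration. The toy potentials below are assumptions chosen purely for illustration.

```python
import numpy as np

def forward_pass(unary, pairwise, combine):
    """Eliminate variables left to right along a chain.  `combine` is the
    semiring 'summary' operator: np.sum yields unnormalized marginals of
    the last variable; np.max yields the value of the MPE configuration."""
    msg = unary[0]
    for psi, phi in zip(pairwise, unary[1:]):
        msg = combine(msg[:, None] * psi, axis=0) * phi
    return msg

# Toy three-variable chain with binary states (values are assumptions):
unary = [np.array([0.6, 0.4]), np.array([0.5, 0.5]), np.array([0.2, 0.8])]
pairwise = [np.array([[0.9, 0.1], [0.1, 0.9]])] * 2

marginal = forward_pass(unary, pairwise, np.sum)   # sum-product query
mpe_value = forward_pass(unary, pairwise, np.max)  # max-product query
print(marginal / marginal.sum(), mpe_value.max())
```

The one-operator difference is precisely what makes it plausible to port techniques, such as global constraints, between the two settings.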

12.2.3 Implementational Research Directions

Improving the control of heuristics and techniques within CSP and COP solvers. At a more pragmatic level, the VARSAT and MAXVARSAT solvers involve many parameters and design options that have, to date, been tuned primarily by manual methods. To achieve competitive success in future SAT and MaxSAT evaluations, these can be tuned to problem sets from past incarnations of those contests using automated methods like the ones discussed in the previous chapter [149, 83]. More generally, the reinforcement learning method of that chapter was used to control only a single design dimension, partly to isolate the effect of the learning mechanism, but also because the research is preliminary. Future success in such competitions may rest on extending this framework to all the features of a given solver.
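For concreteness, a minimal epsilon-greedy sketch in the spirit of such online control follows; the reward signal and the technique interface are assumptions made for illustration, and Chapter 11's actual algorithm is not reproduced here.

```python
import random

class EpsilonGreedyController:
    """Choose among competing solver techniques during a single run,
    balancing exploration of untried options against exploitation of
    the empirically best one so far."""

    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms   # running mean reward per technique

    def choose(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))           # explore
        return max(range(len(self.counts)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# At each decision point the solver might reward, say, propagation
# achieved per unit of heuristic time (a hypothetical signal):
#   ctrl = EpsilonGreedyController(n_arms=2)
#   arm = ctrl.choose()
#   reward = run_technique(arm)      # hypothetical solver hook
#   ctrl.update(arm, reward)
```

Extending such a controller to every design dimension at once is the multi-dimensional generalization alluded to above.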

Improving performance in contests and evaluations. Better competitive performance will also require a means of handling larger problems. We have seen that our current surveys are very accurate, and most likely more accurate than strictly necessary for guiding search. This motivates calculating surveys using only some of the clauses in a variable's neighborhood, sacrificing some degree of accuracy in return for greater efficiency.

One way of viewing the idea is in terms of modern SAT solvers, which typically find solutions without accessing most of the clauses stored in memory. Given the observation in Chapter 5 that computing marginals is a linear relaxation of constraint satisfaction, omitting clauses from surveys appears to be a natural analogue to this phenomenon.
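One way to realize such a truncation is sketched below, under assumptions of our own choosing: the size bound and the preference for short clauses are illustrative, not tuned settings from VARSAT or MAXVARSAT.

```python
import random

def truncated_neighborhoods(clauses, n_vars, max_clauses=20, seed=0):
    """Map each variable to at most `max_clauses` of the clauses that
    mention it; shorter clauses are kept first, since they constrain
    the variable most tightly, with random tie-breaking."""
    rng = random.Random(seed)
    neigh = {v: [] for v in range(1, n_vars + 1)}
    for clause in clauses:
        for lit in clause:
            neigh[abs(lit)].append(clause)
    for v, cs in neigh.items():
        rng.shuffle(cs)            # randomize before the stable sort
        cs.sort(key=len)           # prefer short (tight) clauses
        neigh[v] = cs[:max_clauses]
    return neigh
```

Any of the bias estimators could then be run against `neigh[v]` in place of a variable's full clause neighborhood, trading survey accuracy for scalability.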

Chapter 13

Conclusion

The thesis of this dissertation is that probabilistic reasoning is closely related to constraint satisfaction at a formal level, and that this relationship yields effective algorithms for guiding constraint satisfaction and constraint optimization solvers. In supporting this thesis we have pursued a wide variety of subjects across the preceding chapters; the organization of this pursuit is reproduced from the introduction as Figure 13.1.

In Chapter 1 we reviewed the body of research on computing marginal probabilities over graphical models, adding our own emphasis on the algebraic character of this task. In so doing we have explicitly formalized a number of recurring graph operations, like dualization and node-merging, that are well-known as implicit or ad hoc components of a variety of techniques in probabilistic inference and constraint satisfaction alike.

In Chapter 2 we reviewed recent research that recharacterizes message-passing algorithms for estimating marginals in terms of constrained numerical optimization.

In Chapter 3 we reviewed solution methods for constraint satisfaction from within the framework of graphical models; the correspondence is straightforward, but has not previously been elaborated as fully as it is here. To wit, we cast search and inference as algebraic graph operations, and reformulate familiar concepts like (i,c)-consistency in terms of the elementary operations of Chapter 1, like variable elimination and node-merging in the dual.


[Figure 13.1 appears here: a topic map of the dissertation, charting Chapters 1 through 13 across two tracks. One track covers computing marginals over a graphical model (theoretical analysis; message passing with BP and SP; the marginal polytope; inexact methods including Gibbs sampling and MCMC; exact methods), the other covers solving a constraint network, whether a CSP or COP instance (backbones and backdoors; inference methods; complete search; incomplete local search). The tracks are integrated through the bias estimators (EMBP-L, EMBP-G, EMSP-L, EMSP-G, with decimation), solver settings (bias estimator, deactivation point, decimation size, fail/succeed-first), minimum-height equivalent transforms (max-sum diffusion, BP/SP), related problems (#SAT, most probable explanation (MPE), UNSAT), and the control of the methods by parameter settings and reinforcement learning.]

Figure 13.1: Topic map.

In Chapter 4 we reviewed theoretical models of problem hardness, compiling recent research from across the statistical physics literature; here the novel contribution is limited to a straightforward association of identifying backbones with computing variable bias. Otherwise, the concepts reviewed there help define the aspirations of the chapters that follow.

In Chapter 5 we carried out the program of applying a unified perspective across the previous chapters, primarily by stating a formal correspondence, in terms of optimization, between constraint satisfaction and computing biases. From this novel equivalence, and as facilitated by the unified perspective, we illustrated a number of secondary correspondences, including the recurrence of duality, graph or function simplification, and constraint generation. Some of these correspondences have been known informally as folklore, whereas others do not appear to be known at all; at any rate, to date they have not appeared at once within a single integrated account.

This claim is reflected in the novelty of the techniques appearing in subsequent chapters, all of which exploit this understanding for motivation.

Further, by exploiting this correspondence we were able to state a novel link from free energy to KL-divergence to Expectation Maximization, and we made the underlying optimization process explicit in deriving the general EMBP bias estimation methodology. EMBP is a novel approximate reasoning framework that adopts a parameter estimation perspective in order to perform marginal estimation.
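For orientation, the identity underlying this chain of links can be stated as follows; this is a standard variational identity, sketched here with generic factor notation as an assumption rather than as a reproduction of the chapter's derivation.

```latex
% For a target distribution p(x) = \frac{1}{Z}\prod_a f_a(x) and any trial
% distribution b(x) over configurations x:
F(b) \;=\;
  \underbrace{\sum_x b(x)\Big(-\ln \textstyle\prod_a f_a(x)\Big)}_{\text{average energy}}
  \;-\;
  \underbrace{\Big(-\sum_x b(x)\ln b(x)\Big)}_{\text{entropy } H(b)}
\;=\; \mathrm{KL}\big(b \,\|\, p\big) \;-\; \ln Z .
```

Since ln Z is a constant of the instance, minimizing the free energy over trial distributions is equivalent to minimizing KL(b || p); and that divergence is in turn the quantity driven to zero by the E-step in the free-energy view of EM [119], which is what licenses importing the EM machinery into marginal estimation.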

In Chapter 6 we thereby presented a family of novel techniques for bias estimation, by applying the EMBP formalism of Chapter 5 to SAT and exploiting the understanding of inference that we developed in Chapter 3. In contrast to prior heuristics that are merely inspired by backbones and other phenomena reviewed in Chapter 4, these methods (approximately) optimize a specific likelihood function defined over a specific probabilistic structure, namely a uniform distribution over the solutions to a problem instance.

In Chapter 7 we contributed an architecture for integrating such techniques within a complete backtracking solver, and identified good general settings for various features of that architecture. We have implemented the architecture in the form of the VARSAT solver, and assessed the performance and limitations of our bias estimators and solver alike. This represents the first integration of the survey propagation bias estimator, as well as of our own methods, within a complete modern SAT solver. The novel EMBP-based techniques outperform all existing methods at pure bias estimation, while VARSAT is superior on hard random problems of restricted size but makes no contribution to the state of the art (local search) for problems with tens of thousands of variables or more.

In Chapter 8 we adapted our bias estimators to MaxSAT using an existing technique for representing a near-uniform distribution over solutions, whereby non-solutions receive exponentially small weight.

In Chapter 9 we expanded our architecture for integrating bias estimation in order to accommodate special characteristics of the MaxSAT problem. Again we identified effective design settings, and realized the design as the MAXVARSAT solver. Under experimentation the solver is effective for random problems of limited size, and significantly outperforms the state-of-the-art MAXSATZ solver on general problems from the most recent MaxSAT evaluation.

In Chapter 10 we contributed the "MHET" height-minimization framework for calculating lower bounds, a framework that was simultaneously pursued elsewhere, inspired by the same pre-existing research [134] but otherwise developed independently [37]. Still, we present unique efficient algorithms for realizing the framework within the SAT problem domain, and assess its usefulness on problem sets from the same MaxSAT evaluation considered in the previous chapter. The method improves the state of the art on the majority of problem sets from this collection.

In Chapter 11 we contributed a novel means of controlling the various techniques and features of a solver like VARSAT or MAXVARSAT, using reinforcement learning. In contrast to existing methods that rely on training examples, our method learns an effective strategy online, that is, over the course of solving a single instance. We demonstrated that the algorithm is able to mix two techniques so that it almost always matches, at minimum, the performance of exclusively using the better of the two.

In Chapter 12 we identified promising future research possibilities, all inspired by the integrated view of probabilistic reasoning and constraint satisfaction. Arguably, the simplest and most concrete of these is to improve the solver's performance on contest problems, in order to enter one of the annual SAT or MaxSAT evaluations. At this point, though, there is sufficient evidence to conclude that there is indeed a closer connection between probabilistic reasoning and constraint satisfaction than one might surmise from their respective research communities, and that this connection yields concrete algorithms by which probabilistic methods can guide complete solution methods for constraint problems.

Bibliography

[1] Dimitris Achlioptas and Federico Ricci-Tersenghi. Random formulas have frozen variables. SIAM Journal on Computing, 39(1), 2009.

[2] Dimitris Achlioptas and Yuval Peres. The threshold for random k-SAT is 2^k log 2 − O(k). Journal of the American Mathematical Society, 17(4), 2004.

[3] Srinivas Aji and Robert McEliece. The generalized distributive law. IEEE Transactions on Information Theory, 46(2):325–343, 2000.

[4] Teresa Alsinet, Felip Manyà, and Jordi Planes. An efficient solver for weighted Max-SAT. Journal of Global Optimization, 41(1), 2008.

[5] Christophe Andrieu, Nando de Freitas, Arnaud Doucet, and Michael Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1):5–43, 2003.

[6] Carlos Ansótegui, Maria Luisa Bonet, and Jordi Levy. Solving (weighted) partial MaxSAT through satisfiability testing. In Proc. of 12th International Conference on Theory and Applications of Satisfiability Testing (SAT ’09), Swansea, Wales, UK, 2009.

[7] John Ardelius, Erik Aurell, and Supriya Krishnamurthy. Clustering of solutions in hard satisfiability problems. Journal of Statistical Mechanics: Theory and Experiment, P10012, 2007.

[8] Josep Argelich, Chu Min Li, Felip Manyà, and Jordi Planes. Fourth Max-SAT evaluation. http://www.maxsat.udl.cat/09/, 2009.

[9] Stefan Arnborg, Derek G. Corneil, and Andrzej Proskurowski. Complexity of finding embeddings in a k-tree. SIAM Journal on Algebraic and Discrete Methods, 8(2):277–284, 1987.

[10] Roman Barták. On generators of random quasigroup problems. In Proc. of Annual Workshop on Constraint Solving and Constraint Logic Programming (CSCLP ’05), Uppsala, Sweden, 2005.

[11] Demian Battaglia, Michal Kolář, and Riccardo Zecchina. Minimizing energy below the glass thresholds. Physical Review E, 70, 2004.

[12] Leonard E. Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, 41, 1966.

[13] Rodney J. Baxter. Exactly Solved Models in Statistical Mechanics. Academic Press, 1982.

[14] Paul Beame, Henry Kautz, and Ashish Sabharwal. Understanding the power of clause learning. In Proc. of 18th International Joint Conference on Artificial Intelligence (IJCAI ’03), Acapulco, Mexico, 2003.

[15] J. Christopher Beck. Multi-point constructive search. Journal of Artificial Intelligence Research, 29, 2007.

[16] J. Christopher Beck, Patrick Prosser, and Richard J. Wallace. Trying again to fail-first. Recent Advances in Constraints, Lecture Notes in Artificial Intelligence, 3419, 2005.

[17] Claude Berrou, Alain Glavieux, and Punya Thitimajshima. Near Shannon limit error-correcting coding and decoding: Turbo-codes. In Proc. of IEEE International Conference on Communications (ICC ’93), Geneva, Switzerland, pages 1064–1070, 1993.

[18] Hans A. Bethe. Statistical theory of superlattices. Proceedings of the Royal Society of London, Series A, 150(7):552–575, 1935.

[19] Bozhena Bidyuk and Rina Dechter. Cutset sampling for Bayesian networks. Journal of Artificial Intelligence Research, 28:1–48, 2007.

[20] Stefano Bistarelli, Ugo Montanari, and Francesca Rossi. Semiring-based constraint logic programming: Syntax and semantics. ACM Transactions on Programming Languages and Systems, 23(1):1–29, 2001.

[21] Hans L. Bodlaender. A linear-time algorithm for finding tree-decompositions of small treewidth. SIAM Journal on Computing, 25(6):1305–1317, 1996.

[22] A. Braunstein, M. Mézard, and R. Zecchina. Survey propagation: An algorithm for satisfiability. Random Structures and Algorithms, 27:201–226, 2005.

[23] A. Braunstein and R. Zecchina. Survey propagation as local equilibrium equations. Journal of Statistical Mechanics: Theory and Experiment, P06007, 2004.

[24] A. Braunstein and R. Zecchina. Learning by message passing in networks of discrete synapses. Physical Review Letters, 96(5), 2006.

[25] Alfredo Braunstein, Michele Leone, Marc Mézard, Martin Weigt, and Riccardo Zecchina. SP-1.4 survey propagation implementation. http://www.ictp.trieste.it/~zecchina/SP/, 2005.

[26] Tom Carchrae and J. Christopher Beck. Applying machine learning to low-knowledge control of optimization algorithms. Computational Intelligence, 21(4), 2005.

[27] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[28] Mark Chavira and Adnan Darwiche. On probabilistic inference by weighted model counting. Artificial Intelligence, 172(6–7):772–799, 2008.

[29] Hai Leong Chieu, Wee Sun Lee, and Yee-Whye Teh. Cooled and relaxed survey propagation for MRFs. In Proc. of 22nd Conference on Neural Information Processing Systems (NIPS ’07), Vancouver, Canada, 2007.

[30] Arthur Choi, Mark Chavira, and Adnan Darwiche. Node splitting: A scheme for generating upper bounds in Bayesian networks. In Proc. of 23rd Conference on Uncertainty in Artificial Intelligence (UAI ’07), Vancouver, Canada, 2007.

[31] Arthur Choi and Adnan Darwiche. Approximating the partition function by deleting and then correcting for model edges. In Proc. of 24th Conference on Uncertainty in Artificial Intelligence (UAI ’08), Helsinki, Finland, 2008.

[32] Arthur Choi and Adnan Darwiche. Focusing generalizations of belief propagation on targeted queries. In Proc. of 23rd Conference on Artificial Intelligence (AAAI ’08), Chicago, IL, 2008.

[33] Arthur Choi and Adnan Darwiche. Many-pairs mutual information for adding structure to belief propagation approximations. In Proc. of 23rd Conference on Artificial Intelligence (AAAI ’08), Chicago, IL, 2008.

[34] Stephen A. Cook. The complexity of theorem proving procedures. In Proc. of Third Annual ACM Symposium on Theory of Computing, pages 151–158, 1971.

[35] M. C. Cooper, S. de Givry, M. Sanchez, T. Schiex, and M. Zytnicki. Virtual arc consistency for weighted CSP. In Proc. of 23rd National Conference on Artificial Intelligence (AAAI ’08), Chicago, IL, pages 253–258, 2008.

[36] M. C. Cooper and T. Schiex. Arc consistency for soft constraints. Artificial Intelligence, 154(1–2):199–227, 2004.

[37] M. C. Cooper, S. de Givry, M. Sanchez, T. Schiex, M. Zytnicki, and T. Werner. Soft arc consistency revisited. Artificial Intelligence, (to appear), 2010.

[38] M. C. Cooper, S. de Givry, and T. Schiex. Optimal soft arc consistency. In Proc. of 20th International Joint Conference on Artificial Intelligence (IJCAI ’07), Hyderabad, India, pages 68–73, 2007.

[39] Adnan Darwiche. Recursive conditioning. Artificial Intelligence, 126(1–2):5–41, 2001.

[40] Martin Davis, George Logemann, and Donald Loveland. A machine program for theorem-proving. Communications of the ACM, 5(7):394–397, 1962.

[41] Martin Davis and Hilary Putnam. A computing procedure for quantification theory. Journal of the ACM, 7(3):201–215, 1960.

[42] Simon de Givry and Thomas Schiex. Exploiting tree decomposition and soft local consistency in weighted CSP. In Proc. of 21st National Conference on Artificial Intelligence (AAAI ’06), Boston, MA, 2006.

[43] Thomas Dean and Keiji Kanazawa. A model for reasoning about persistence and causation. Computational Intelligence, 5(2):142–150, 1989.

[44] Rina Dechter. Enhancement schemes for constraint processing: Backjumping, learning and cutset decomposition. Artificial Intelligence, 41(3):273–312, 1990.

[45] Rina Dechter. Bucket elimination: A unifying framework for reasoning. Artificial Intelligence, 113(1–2):41–85, 1999.

[46] Rina Dechter. Constraint Processing. Morgan Kaufmann, San Mateo, 2003.

[47] Rina Dechter, Kalev Kask, and Robert Mateescu. Iterative join-graph propagation. In Proc. of 18th International Conference on Uncertainty in Artificial Intelligence (UAI ’02), Edmonton, Canada, pages 128–136, 2002.

[48] Rina Dechter and Robert Mateescu. A simple insight into properties of iterative belief propagation. In Proc. of 19th International Conference on Uncertainty in Artificial Intelligence (UAI ’03), Acapulco, Mexico, 2003.

[49] Rina Dechter and Robert Mateescu. AND/OR search spaces for graphical models. Artificial Intelligence, 171(2–3), 2006.

[50] Rina Dechter and Irina Rish. Mini-buckets: A general scheme for approximating inference. Journal of the ACM, 50(2):107–153, 2003.

[51] Arthur Dempster, Nan Laird, and Donald Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1–39, 1977.

[52] Bistra Dilkina, Carla P. Gomes, and Ashish Sabharwal. Tradeoffs in the complexity of backdoor detection. In Principles and Practice of Constraint Programming – CP 2007, 2007.

[53] Olivier Dubois, Yacine Boufkhad, and Jacques Mandler. Typical random 3-SAT formulae and the satisfiability threshold. In Proc. of 11th ACM-SIAM Symposium on Discrete Algorithms (SODA ’00), San Francisco, CA, 2000.

[54] Olivier Dubois and Gilles Dequen. A backbone-search heuristic for efficient solving of hard 3-SAT formulae. In Proc. of 17th International Joint Conference on Artificial Intelligence (IJCAI ’01), Seattle, WA, 2001.

[55] Niklas Eén and Niklas Sörensson. An extensible SAT-solver. In Proc. of 6th International Conference on Theory and Applications of Satisfiability Testing (SAT ’03), Portofino, Italy, 2003.

[56] Susan L. Epstein, Eugene C. Freuder, Richard Wallace, Anton Morozov, and Bruce Samuels. The adaptive constraint engine. In Proc. of 8th International Conference on Constraint Processing (CP ’02), Ithaca, U.S.A., pages 525–540, 2002.

[57] John Franco, Michal Kouril, John Schlipf, Jeffrey Ward, Sean Weaver, Michael Dransfield, and W. Mark Vanfleet. SBSAT: a state-based, BDD-based satisfiability solver. In Theory and Applications of Satisfiability Testing, pages 398–410. Springer LNCS 2919, Berlin, 2004.

[58] Brendan J. Frey. Extending factor graphs so as to unify directed and undirected graphical models. In Proc. of 19th International Conference on Uncertainty in Artificial Intelligence (UAI ’03), Acapulco, Mexico, 2003.

[59] Ehud Friedgut. Necessary and sufficient conditions for sharp thresholds and the k-SAT problem. Journal of the American Mathematical Society, 12(1), 1999.

[60] Yong Gao. Phase transitions and complexity of weighted satisfiability and other intractable parameterized problems. In Proc. of 23rd Conference on Artificial Intelligence (AAAI ’08), Chicago, IL, 2008.

[61] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984.

[62] M. X. Goemans and D. P. Williamson. New 3/4-approximation algorithms for the maximum satisfiability problem. SIAM Journal on Discrete Mathematics, 7:656–666, 1994.

[63] M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42:1115–1145, 1995.

[64] Vibhav Gogate and Rina Dechter. A new algorithm for sampling CSP solutions uniformly at random. In Proc. of 12th International Conference on Constraint Processing (CP ’06), Nantes, France, 2006.

[65] Vibhav Gogate and Rina Dechter. Approximate counting by sampling the backtrack-free search space. In Proc. of 22nd Conference on Artificial Intelligence (AAAI ’07), Vancouver, Canada, 2007.

[66] Vibhav Gogate and Rina Dechter. SampleSearch: A scheme that searches for consistent samples. In Proc. of 11th International Conference on Artificial Intelligence and Statistics (AISTATS ’07), San Juan, PR, 2007.

[67] Vibhav Gogate and Rina Dechter. Studies in solution sampling. In Proc. of 23rd Conference on Artificial Intelligence (AAAI ’08), Chicago, IL, 2008.

[68] Carla Gomes, César Fernández, Bart Selman, and Christian Bessière. Statistical regimes across constrainedness regions. Constraints, 10(4):317–337, 2005.

[69] Carla Gomes, Bart Selman, Nuno Crato, and Henry Kautz. Boosting combinatorial search through randomization. In Proc. of 15th National Conference on Artificial Intelligence (AAAI ’98), Madison, WI, 1998.

[70] Carla Gomes, Bart Selman, Nuno Crato, and Henry Kautz. Heavy-tailed phenomena in satisfiability and constraint satisfaction problems. Journal of Automated Reasoning, 24(1–2):67–100, 2000.

[71] Carla Gomes and David Shmoys. Completing quasigroups or latin squares: A structured graph coloring problem. In Proc. of Computational Symposium on Graph Coloring and its Generalizations (COLOR ’02), Ithaca, NY, 2002.

[72] Ralph E. Gomory. Outline of an algorithm for integer solutions to linear programs. Bulletin of the American Mathematical Society, 64, 1958.

[73] F. Heras, J. Larrosa, and A. Oliveras. MiniMaxSAT: An efficient weighted Max-SAT solver. Journal of Artificial Intelligence Research, 31, 2008.

[74] Marc Herbstritt. Satisfiability and Verification: From Core Algorithms to Novel Application Domains. Suedwestdeutscher Verlag fuer Hochschulschriften (SVH), 2009.

[75] Holger Hoos and Thomas Stützle. Stochastic Local Search: Foundations and Applications. Morgan Kaufmann, San Francisco, 2004.

[76] Eric Hsu. VARSAT SAT-solver homepage. http://www.cs.toronto.edu/~eihsu/VARSAT/, 2008.

[77] Eric Hsu. MAXVARSAT MaxSAT-solver homepage. http://www.cs.toronto.edu/~eihsu/MAXVARSAT/, 2010.

[78] Eric Hsu, Matthew Kitching, Fahiem Bacchus, and Sheila McIlraith. Using EM to find likely assignments for solving CSP's. In Proc. of 22nd Conference on Artificial Intelligence (AAAI ’07), Vancouver, Canada, 2007.

[79] Eric Hsu and Sheila McIlraith. Characterizing propagation methods for boolean satisfiability. In Proc. of 9th International Conference on Theory and Applications of Satisfiability Testing (SAT ’06), Seattle, WA, 2006.

[80] Eric Hsu and Sheila McIlraith. VARSAT: Integrating novel probabilistic inference techniques with DPLL search. In Proc. of 12th International Conference on Theory and Applications of Satisfiability Testing (SAT ’09), Swansea, Wales, UK, 2009.

[81] Eric Hsu, Christian Muise, J. Christopher Beck, and Sheila McIlraith. Probabilistically estimating backbones and variable bias: Experimental overview. In Proc. of 14th International Conference on Constraint Processing (CP ’08), Sydney, Australia, 2008.

[82] Tudor Hulubei and Barry O’Sullivan. Heavy-tailed runtime distributions: Heuristics, models, and optimal refutations. In Proc. of 12th International Conference on Constraint Programming (CP ’06), Nantes, France, 2006.

[83] Frank Hutter, Holger H. Hoos, Kevin Leyton-Brown, and Thomas Stützle. ParamILS: an automatic algorithm configuration framework. Journal of Artificial Intelligence Research, 36:267–306, 2009.

[84] Finn V. Jensen. An Introduction to Bayesian Networks. Springer Verlag, New York, 1996.

[85] M. I. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. In M. I. Jordan, editor, Learning in Graphical Models. MIT Press, 1998.

[86] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME, 82(Series D):35–45, 1960.

[87] Alexis C. Kaporis, Lefteris M. Kirousis, and Efthimios G. Lalas. The probabilistic analysis of a greedy satisfiability algorithm. Random Structures and Algorithms, 28(4), 2006.

[88] Kalev Kask, Rina Dechter, and Vibhav Gogate. Counting-based look-ahead schemes for constraint satisfaction. In Proc. of 10th International Conference on Constraint Processing (CP ’04), Toronto, Canada, 2004.

[89] Kalev Kask, Rina Dechter, Javier Larrosa, and Avi Pfeffer. Cluster-tree decompositions for reasoning in graphical models. Artificial Intelligence, 166(1–2), 2005.

[90] George Katsirelos. Nogood Processing in CSP's. PhD thesis, University of Toronto, 2008.

[91] Henry Kautz, Yongshao Ruan, Dimitris Achlioptas, Carla Gomes, Bart Selman, and Mark Stickel. Balance and filtering in structured satisfiable problems. In Proc. of 17th International Joint Conference on Artificial Intelligence (IJCAI ’01), Seattle, WA, 2001.

[92] Ryoichi Kikuchi. A theory of cooperative phenomena. Physical Review, 81(6):988–1003, 1951.

[93] Philip Kilby, John Slaney, Sylvie Thiébaux, and Toby Walsh. Backbones and backdoors in satisfiability. In Proc. of 20th National Conference on Artificial Intelligence (AAAI ’05), Pittsburgh, PA, 2005.

[94] Ross Kindermann and J. Laurie Snell. Markov Random Fields and their Applications. American Mathematical Society, Providence, RI, 1980.

[95] Matthew Kitching. Decomposition and Symmetry in Constraint Optimization Problems. PhD thesis, University of Toronto, 2010.

[96] V. A. Kovalevsky and V. K. Koval. A diffusion algorithm for decreasing energy of the max-sum labeling problem. Unpublished manuscript, approx. 1975.

[97] Lukas Kroc, Ashish Sabharwal, and Bart Selman. Survey propagation revisited. In Proc. of 23rd International Conference on Uncertainty in Artificial Intelligence (UAI ’07), Vancouver, Canada, 2007.

[98] Lukas Kroc, Ashish Sabharwal, and Bart Selman. Leveraging belief propagation, backtrack search, and statistics for model counting. In Proc. of Fifth International Conference on Integration of AI and OR Techniques (CP-AI-OR ’08), Paris, France, 2008.

[99] Lukas Kroc, Ashish Sabharwal, and Bart Selman. Message-passing and local heuristics as decimation strategies for satisfiability. In Proc. of 24th ACM Symposium on Applied Computing (SAC ’09), Honolulu, HI, 2009.

[100] Florent Krzakala, Andrea Montanari, Federico Ricci-Tersenghi, Guilhem Semerjian, and Lenka Zdeborová. Gibbs states and the set of solutions of random constraint satisfaction problems. Proceedings of the National Academy of Sciences, 104(25):10318–10323, 2007.

[101] Frank R. Kschischang, Brendan J. Frey, and Hans-Andrea Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 2001.

[102] Ronan Le Bras, Alessandro Zanarini, and Gilles Pesant. Efficient generic search heuristics within the EMBP framework. In Proc. of 15th International Conference on Constraint Processing (CP ’09), Lisbon, Portugal, 2009.

[103] William Lee. Representation of switching circuits by binary-decision programs. Bell System Technical Journal, 38, 1959.

[104] Chu Min Li, Felip Manyà, and Jordi Planes. New inference rules for Max-SAT. Journal of Artificial Intelligence Research, 30, 2007.

[105] Han Lin, Kaile Su, and Chu Min Li. Within-problem learning for efficient lower bound computation in Max-SAT solving. In Proc. of 23rd Conference on Artificial Intelligence (AAAI ’08), 2008.

[106] David J. C. MacKay. Good error-correcting codes based on very sparse matrices. IEEE Transactions on Information Theory, 45:399–431, 1999.

[107] Elitza Maneva, Elchanan Mossel, and Martin Wainwright. A new look at survey propagation and its generalizations. Journal of the ACM, 54(4):2–41, 2007.

[108] Vasco M. Manquinho, João P. Marques Silva, and Jordi Planes. Algorithms for weighted Boolean optimization. In Proc. of 12th International Conference on Theory and Applications of Satisfiability Testing (SAT ’09), Swansea, Wales, UK, 2009.

[109] Radu Marinescu and Rina Dechter. Best-first AND/OR search for most probable explanations. In Proc. of 23rd Conference on Uncertainty in Artificial Intelligence (UAI ’07), Vancouver, Canada, 2007.

[110] João Marques-Silva and Karem A. Sakallah. GRASP: A search algorithm for propositional satisfiability. IEEE Transactions on Computers, 48(5), 1999.

[111] Mohamed El Bachir Menaï. A two-phase backbone-based search heuristic for partial MAX-SAT: An initial investigation. In Proc. of 18th International Conference on Innovations in Applied Artificial Intelligence (IEA/AIE ’05), Bari, Italy, 2005.

[112] M. Mézard, G. Parisi, and R. Zecchina. Analytic and algorithmic solution of random satisfiability problems. Science, 297(5582):812–815, 2002.

[113] Thomas Minka. A family of approximate algorithms for Bayesian inference. PhD thesis, 2001.

[114] Steven Minton, Andy Philips, Mark D. Johnston, and Philip Laird. Minimizing conflicts: A heuristic repair method for constraint-satisfaction and scheduling problems. Artificial Intelligence, 58:161–205, 1992.

[115] Chaitanya Mishra and Nick Koudas. Join reordering by join simulation. In Proc. of IEEE International Conference on Data Engineering (ICDE ’09), Shanghai, China, 2009.

[116] Rémi Monasson, Riccardo Zecchina, Scott Kirkpatrick, Bart Selman, and Lidror Troyansky. Determining computational complexity from characteristic phase transitions. Nature, 400(7):133–137, 1999.

[117] Matthew Moskewicz, Conor Madigan, Ying Zhao, Lintao Zhang, and Sharad Malik. Chaff: engineering an efficient SAT solver. In Proc. of 38th Design Automation Conference (DAC ’01), Las Vegas, Nevada, 2001.

[118] Alexander Nareyek. Choosing search heuristics by non-stationary reinforcement learning. In Metaheuristics: Computer Decision-Making, pages 523–544. Kluwer Academic Publishers, 2003.

[119] Radford Neal and Geoffrey Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, 1998.

[120] Naomi Nishimura, Prabhakar Ragde, and Stefan Szeider. Detecting backdoor sets with respect to Horn and binary clauses. In Proc. of 7th International Conference on Theory and Applications of Satisfiability Testing (SAT ’04), Vancouver, Canada, 2004.

[121] James Park. MAP complexity results and approximation methods. In Proc. of 18th Conference on Uncertainty in Artificial Intelligence (UAI ’02), Alberta, Canada, pages 388–396, 2002.

[122] James Park and Adnan Darwiche. Approximating MAP using local search. In Proc. of 17th Conference on Uncertainty in Artificial Intelligence (UAI ’01), Seattle, WA, pages 403–410, 2001.

[123] Judea Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.

[124] Smiljana Petrovic. Learning to Combine Heuristics to Solve Constraint Satisfaction Problems. PhD thesis, City University of New York, 2008.

[125] Knot Pipatsrisawat, Akop Palyan, Mark Chavira, Arthur Choi, and Adnan Darwiche. Solving weighted Max-SAT problems in a reduced search space: A performance analysis. Journal on Satisfiability, Boolean Modeling, and Computation (JSAT), 4, 2008.

[126] Christopher J. Preston. Gibbs States on Countable Sets. Cambridge University Press, Cambridge, U.K., 1974.

[127] Marco Pretti. A message-passing algorithm with damping. Journal of Statistical Mechanics, 2005(11):P11008, 2005.

[128] Patrick Prosser. Hybrid algorithms for the constraint satisfaction problem. Computational Intelligence, 9(3), 1993.

[129] Adam Prügel-Bennett. Finding critical backbone structures with genetic algorithms. In Proc. of 9th Conference on Genetic and Evolutionary Computation (GECCO ’07), London, England, pages 1343–1348, 2007.

[130] Philippe Refalo. Impact-based search strategies for constraint programming. In Proc. of 10th International Conference on Constraint Processing (CP ’04), Toronto, Canada, 2004.

[131] Jean-Charles Régin. A filtering algorithm for constraints of difference in CSP's. In Proc. of 12th Conference on Artificial Intelligence (AAAI ’94), Seattle, WA, pages 362–367, 1994.

[132] Neil Robertson and Paul Seymour. Graph minors III: Planar tree-width. Journal of Combinatorial Theory, Series B, 36:49–64, 1984.

[133] Yongshao Ruan, Henry Kautz, and Eric Horvitz. The backdoor key: A path to understanding problem hardness. In Proc. of 19th National Conference on Artificial Intelligence (AAAI ’04), San Jose, CA, 2004.

[134] M. I. Schlesinger. Sintaksicheskiy analiz dvumernykh zritelnikh signalov v usloviyakh pomekh (Syntactic analysis of two-dimensional visual signals in noisy conditions). Kibernetika, 4, 1976.

[135] Bart Selman, Henry Kautz, and Bram Cohen. Local search strategies for satisfiability testing. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 26, 1996.

[136] Y. Shang and B. W. Wah. A discrete Lagrangian-based global-search method for solving satisfiability problems. Journal of Global Optimization, 12(1):61–99, 1998.

[137] Josh Singer, Ian Gent, and Alan Smaill. Backbone fragility and the local search cost peak. Journal of Artificial Intelligence Research, 12:235–270, 2000.

[138] David Sontag and Tommi Jaakkola. New outer bounds on the marginal polytope. In Proc. of 22nd Conference on Neural Information Processing Systems (NIPS ’07), Vancouver, Canada, 2007.

[139] X. Sun, M. J. Druzdzel, and C. Yuan. Dynamic weighting A* search-based MAP algorithm for Bayesian networks. In Proc. of 20th International Joint Conference on Artificial Intelligence (IJCAI ’07), Hyderabad, India, pages 2385–2390, 2007.

[140] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[141] Daniel Tarlow, Inmar Givoni, and Richard Zemel. HOP-MAP: Efficient message passing with high order potentials. In Proc. of 13th International Conference on Artificial Intelligence and Statistics (AISTATS ’10), Sardinia, Italy, 2010.

[142] John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100:295–320, 1928.

[143] Martin Wainwright, Tommi Jaakkola, and Alan Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51:2313–2335, 2005.

[144] Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Technical Report 649, University of California, Berkeley, 2003.

[145] Y. Wang, J. Zhang, M. Fossorier, and J. Yedidia. Reduced latency iterative decoding of LDPC codes. In IEEE Conference on Global Telecommunications (GLOBECOM), 2005.

[146] Tomáš Werner. A linear programming approach to the max-sum problem: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(7), 2007.

[147] Niclas Wiberg, Hans-Andrea Loeliger, and Ralf Kötter. Codes and iterative decoding on general graphs. European Transactions on Telecommunications, 6(Sep./Oct.):513–525, 1995.

[148] Ryan Williams, Carla Gomes, and Bart Selman. Backdoors to typical case complexity. In Proc. of 18th International Joint Conference on Artificial Intelligence (IJCAI ’03), Acapulco, Mexico, 2003.

[149] Lin Xu, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. SATzilla: portfolio-based algorithm selection for SAT. Journal of Artificial Intelligence Research, 32:565–606, 2008.

[150] Yuehua Xu, David Stern, and Horst Samulowitz. Learning adaptation to solve constraint satisfaction problems. In Proc. of Third Workshop on Learning and Intelligent Optimization (LION 3), Trento, Italy, 2009.

[151] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. In Bernhard Nebel and Gerhard Lakemeyer, editors, Exploring Artificial Intelligence in the New Millennium, pages 239–256. Morgan Kaufmann, 2003.

[152] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7):2282–2312, 2005.

[153] Alessandro Zanarini and Gilles Pesant. Solution counting algorithms for constraint-centered search heuristics. In Proc. of 13th International Conference on Constraint Processing (CP ’07), Providence, RI, 2007.

[154] Alessandro Zanarini and Gilles Pesant. More robust counting-based search heuristics with alldifferent constraints. In Proc. of 7th International Conference on Integration of AI and OR (CPAIOR ’10), Bologna, Italy, 2010.

[155] Weixiong Zhang. Configuration landscape analysis and backbone guided local search. Part I: Satisfiability and maximum satisfiability. Artificial Intelligence, 158(1):1–26, 2004.