A GPU Implementation of Large Neighborhood Search for Solving Constraint Optimization Problems

A GPU Implementation of Large Neighborhood Search for Solving Constraint Optimization Problems F. Campeotto and A. Dovier and F. Fioretto and E. Pontelli 1 Abstract. Constraint programming has gained prominence as an Processing Units—GPGPUs). Several libraries and programming en- effective and declarative paradigm for modeling and solving com- vironments (e.g., the Compute Unified Device Architecture (CUDA) plex combinatorial problems. In particular, techniques based on lo- created by NVIDIA) have been made available to allow programmers cal search have proved practical to solve real-world problems, pro- to access GPGPUs and exploit their computational power. viding a good compromise between optimality and efficiency. In In this paper, we propose the design and implementation of a spite of the natural presence of concurrency, there has been rela- novel constraint solver that exploits parallel Local Search (LS) us- tively limited effort to use novel massively parallel architectures— ing GPGPU architectures to solve constraint optimization problems. such as those found in modern Graphical Processing Units (GPUs)— The optimization process is performed in parallel on multiple large to speedup constraint programming algorithms, in general, and local promising regions of the search space, with the aim of improving the search methods, in particular. This paper describes a novel frame- quality of the current solution. The local search model pursued is a work which exploits parallelism from a popular local search method variant of Large Neighborhood Search (LNS) [18, 3]. LNS is a local (the Large Neighborhood Search method), using GPUs. search techniques characterized by an alternation of destroy and repair methods—used to destroy and repair part of a solution of a problem. The perturbation of the solution caused by each destroy-repair 1 Introduction step is significant, enabling the exploration of a large neighborhood of the search space within each iteration. Each neighborhood is ex- Constraint programming (CP) is a declarative paradigm designed to plored using local search strategies and the best neighborhood (i.e, provide high-level modeling and resolution of combinatorial search the one that better improves the quality of the solution) is selected problems. It is attracting widespread commercial interest and it is to update the variables accordingly. The use of GPGPUs allows us now becoming the method choice for modeling many types of opti- to speed-up this search process and represents an alternative way to mization problems (e.g., [11, 19, 2]), possibly combined with other enhance performance of constraint solvers. The main contributions techniques. A problem is modeled using variables, each of them com- of this paper are: ing with its own domain (typically, a finite set of values), and a set 1) Novel design and implementation of a constraint solver perform- of constraints (i.e., relations) among variables. The model is given as ing parallel search. Unlike the traditional approaches to parallelism, input to a constraint tool (e.g., [17, 10]), which explores the search we take advantage of the computational power of GPGPUs. These space of possible solutions, alternating non-deterministic variable as- architectures can provide thousands of parallel processing units and signments and deterministic constraint propagation. The goal is typ- they are readily available in the form of graphic cards in most mod- ically to find one (all) solution(s), the best one, or determine that the ern computers. To the best of our knowledge this is the first constraint problem is unsatisfiable. The programmer might help the search by solver system that uses GPGPU to perform parallel local search. tuning search parameters or providing problem-specific knowledge. 2) A general framework that exploits Single-Instruction Multiple- Although this declarative approach allows one to model a broad Threads (SIMT) parallelism to speed-up local search strategies. We class of optimization problems with relatively ease, real-world com- will present six different local search strategies that can be used to binatorial optimization problems are often characterized by huge explore in parallel multiple large neighborhoods. These strategies are search spaces (e.g., [6]) and heterogeneous sets of constraints. If implemented by making very localized changes in the definition of a this is the case, incomplete search strategies (e.g., local search strate- neighborhood. Hence, the user needs only to specify the structure of gies) are usually preferred to exact approaches that would require a neighborhood, without worrying about how it is actually performed prohibitive time to determine an optimal solution. the underlying parallel computation. Recent technological trends have made highly parallel hardware 3) A hybrid method for solving constraint optimization problems that platforms and corresponding programming models available to the uses local search strategies on large neighborhoods of variables. Usu- broad users community—transforming high performance computing ally, large neighborhood are explored using standard CP techniques. from a specialized domain for complex scientific computing into a Instead, we present an approach based on local search to find the general purpose and widely accessible model for everyday comput- neighborhood that improves the objective function the most among a ing. One of the most successful efforts is represented by the use of the large set of different neighborhoods. multicores available in modern Graphical Processing Units (GPUs) for general purpose parallel computing (General Purpose Graphical 2 Background 1 Dept. Mathematics & Computer Science, University of Udine and Dept. A Constraint Satisfaction Problem (CSP) [14] is defined as P = Computer Science, New Mexico State University (X; D; C) where: (1) X = hx1; : : : ; xni is an n-tuple of variables; (2) D = hDx1 ;:::;Dxn i is an n-tuple of domains, each data structures being processed, threads are grouped in blocks, and associated to a distinct variable in X, and (3) C is a finite set have access to several memory levels, each with different properties of constraints on variables in X, where a constraint c on the m in terms of speed, organization and capacity. CUDA maps blocks variables xi1 ; : : : ; xim , denoted as c(xi1 ; : : : ; xim ), is a relation (coarse-grain parallelism) to the SMs for execution. Each SM sched- c(x ; : : : ; x ) ⊆ im Dxj .A solution of a CSP is a tuple ules the threads in a block (fine-grain parallelism) on its computing i1 im ×j=i1 n xi hs1; : : : ; sni 2 ×i=1D s.t. for each c(xi1 ; : : : ; xim ) 2 C, we cores in chunks of 32 threads (warps) at a time. Blocks are organized have hsi1 ; : : : ; sim i 2 c(xi1 ; : : : ; xim ). P is (in)consistent if it has in a 3D grid, and hence a kernel is executed by a grid of parallel (no) solutions. A Constraint Optimization Problem (COP) is a pair thread blocks. Threads within a block can communicate by reading n xi Q = (P; g) where P is a CSP, and g : ×i=1D ! N is a cost func- and writing a common area of memory (shared memory). Commu- tion. Given Q, we seek a solution s of P such that g(s) is minimal nication between blocks and communication between the blocks and (maximal) among all solutions of P. the host is realized through a large slow global memory. Typical CSP solvers alternate two steps: (a) Selection of a variable The development of CUDA programs that efficiently exploit SIMT and non-deterministic assignment of a value from its domain (label- parallelism is a challenging task. Several factors are critical in gain- ing), and (b) Propagation of the assignment through the constraints, ing performance. Memory levels have significantly different sizes to reduce the admissible values of the remaining variables and possi- (e.g., registers are in the order of dozens per thread, shared mem- bly detect inconsistencies (constraint propagation). COP solvers fol- ory is in the order of a few kilobytes per block) and access times, lows the same scheme but they explore the space of possible solution and various optimization techniques are available (e.g., coalesced of of the problem in order to find the optimal one, e.g., using branch memory accesses to contiguous locations into a single memory trans- and bound techniques. A COP solver stops whenever exploration is action). Thus, optimization of CUDA programs requires a thorough complete or a given limit is reached (e.g., time/number of improving understanding of GPU’s hardware characteristics. solutions), returning the best solution found so far. Local Search (LS) techniques [1, 14] deal with COPs and are based on the idea of iteratively improving a candidate solution s by minor 3 Solver Design and Implementation “modifications” in order to reach another solution s0 from s. The set Overall Structure of the Solver: The structure of our constraint of allowed modifications is called the neighborhoods of s and it is solver is based on the general design recently presented in [5]— often defined by means of a neighborhood function η applied to s. LS where a GPU architecture is used to perform parallel constraint prop- methods rely on the existence of a candidate solution. Most problems agation within a traditional event-driven constraint propagation en- typically have a naive (clearly not optimal) solution. If this is not the gine [16]. We adopt this design to compute a first feasible solution case, some constraints can be relaxed and a LS method is used with to be successively improved via LNS.2 Variable’s domains are repre- a cost function based on the number of unsatisfied constraint: when sented using bit-masks stored in ` unsigned int 32-bit variables (for a solution of cost 0 is found, it will be used as a starting point for a suitable `), while the status of the computation at every node of the the original CSP. Other techniques (e.g., a constraint solver) might search tree is represented by a vector of bit-masks corresponding to be used to determine the initial candidate solution.

Load more