Undefined 1 (2014) 1–5. IOS Press

Distributed Multi-Agent Optimization for Smart Grids and Home Automation

Ferdinando Fioretto^a,*, Agostino Dovier^b, Enrico Pontelli^c

^a Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, MI, USA. E-mail: fi[email protected]
^b Department of Mathematics, Computer Science, and Physics, University of Udine, Udine, Italy. E-mail: [email protected]
^c Department of Computer Science, New Mexico State University, NM, USA. E-mail: [email protected]

Abstract. Distributed Constraint Optimization Problems (DCOPs) have emerged as one of the prominent multi-agent architectures to govern agents' autonomous behavior in cooperative multi-agent systems (MAS), in which several agents coordinate with each other to optimize a global cost function while taking into account their local preferences. They represent a powerful approach to the description and resolution of many practical problems. However, typical real-world MAS applications are characterized by complex dynamics and interactions among a large number of entities, which translate into hard combinatorial problems, posing significant challenges from a computational and coordination standpoint. This paper reviews two methods that promote a hierarchical parallel model for solving DCOPs, with the aim of improving the performance of DCOP algorithms. The first is a Multi-Variable Agent (MVA) DCOP decomposition, which exploits the co-locality of an agent's variables, allowing the adoption of efficient centralized techniques to solve an agent's subproblem. The second is the use of Graphics Processing Units (GPUs) to speed up a class of DCOP algorithms. Finally, exploiting this hierarchical parallel model, the paper presents two critical applications of DCOPs to demand response (DR) programs in smart grids: the Multi-agent Economic Dispatch with Demand Response (EDDR), which provides an integrated approach to economic dispatch and DR for power systems, and the Smart Home Device Scheduling (SHDS) problem, which formalizes the device scheduling and coordination problem across multiple smart homes to reduce energy peaks.

Keywords: DCOP, GPUs, Smart Homes

1. Introduction

The power network is the largest operating machine on earth, generating more than US$400 billion a year.¹ A significant concern in power networks is for the energy providers to be able to generate enough power to supply the demands at any point in time. Short-term demand peaks are, however, hard to predict and hard to address while meeting increasingly high levels of sustainability. Thus, in the modern smart electricity grid, the energy providers can exploit the demand-side flexibility of the consumers to reduce the peaks in load demand.

This control mechanism is called demand response (DR), and it can be obtained by scheduling shiftable loads (i.e., a portion of power consumption that can be moved from one time slot to another) from peak to off-peak hours [16,21,37]. Due to concerns about privacy and users' autonomy, such an approach requires a decentralized, yet coordinated and cooperative, strategy, with active participation of both the energy providers and the consumers.

In a cooperative multi-agent system (MAS), multiple autonomous agents interact to pursue personal goals and to achieve shared objectives. The Distributed Constraint Optimization Problem (DCOP) model [11,25,40] is an elegant formalism to describe cooperative multi-agent problems that are distributed in nature and where a collection of agents attempts to optimize a global objective within the confines of localized communication. DCOPs have been applied to solve a variety of coordination and resource allocation problems [20,42] and sensor networks [9], and have been shown to be suitable to model various DR programs in the smart grid [15,16,24,31].

DCOP resolution algorithms are classified as either complete or incomplete. Complete algorithms guarantee the computation of an optimal solution to the problem, employing one of two broad approaches: distributed search-based techniques [25,26,39] or distributed inference-based techniques [27,36]. In search-based techniques, agents visit the search space by selecting value assignments and communicating them to other agents. Inference-based techniques rely instead on the notion of agent belief, describing the best cost an agent can achieve for each value assignment of its variables. These beliefs drive the value-selection process of the agents to find an optimal solution to the problem.

Despite their success, the adoption of DCOPs on large, complex problem instances faces several limitations, including restrictive assumptions on the modeling of the problem and the inability of current solvers to capitalize on the presence of structural information.

In this paper, we review a Multi-Variable Agent (MVA) DCOP decomposition that allows agents to solve complex subproblems efficiently, and a technique to speed up the execution of inference-based algorithms via graphics processing unit (GPU) parallel architectures. These techniques enable us to design practical algorithms to efficiently solve large, complex DCOPs. We review two DR applications of DCOPs: one where the goal is to optimize the power generator dispatch while taking into account dispatchable loads, and another in the context of home automation, in which a home agent coordinates the schedule of smart devices in order to satisfy the user preferences while minimizing the global peaks of energy consumption.

The paper is organized as follows: Section 2 recalls the main definitions and describes two important complete and approximated algorithms for solving DCOPs. Section 3 briefly reviews the GPU programming model. Section 4 describes a decomposition technique for solving DCOPs in which each agent is responsible for solving a complex subproblem. Section 5 describes a DCOP solving technique that is accelerated through the use of GPUs. Section 6 reviews two complex multi-agent applications of DR that are solved using DCOP techniques. The first application, presented in Section 7, optimizes demand response at large scale, from a system operator perspective, by acting on large electric generators and loads. The second application, presented in Section 8, minimizes the energy peaks through a demand response program that makes use of automated schedules of smart home devices. Finally, Section 9 concludes the work.

2. Background: DCOP

Let us start by defining the theoretical framework that is used for modeling smart grid and home automation problems.

2.1. Preliminaries

A Distributed Constraint Optimization Problem [12,25,40] is a tuple P = ⟨𝒳, 𝒟, ℱ, 𝒜, α⟩, where:
– 𝒳 = {x1, ..., xn} is a set of variables;
– 𝒟 = {D_x1, ..., D_xn} is a set of finite domains, where D_xi is the domain of xi, for all xi ∈ 𝒳;
– ℱ = {f1, ..., fm} is a set of cost functions (or constraints), where fi : Π_{xj ∈ x^i} D_xj → R+ ∪ {∞} and x^i ⊆ 𝒳 is the set of the variables relevant to fi, called its scope;
– 𝒜 = {a1, ..., ap} is a set of agents; and
– α : 𝒳 → 𝒜 maps each variable to one agent.

With a slight abuse of notation, we also consider the function α : ℘(𝒜) → ℘(𝒳),² where α(A) denotes the set of all variables controlled by the agents in A.

A partial assignment σ_X is an assignment of values to a set of variables X ⊆ 𝒳 that is consistent with the domains of the variables; i.e., it is a partial function σ_X : 𝒳 → ∪_{i=1}^n D_xi such that, for each xj ∈ 𝒳, if σ_X(xj) is defined (i.e., xj ∈ X), then σ_X(xj) ∈ D_xj.

* Corresponding author. E-mail: fi[email protected]. The research summarized in the paper was developed during Fioretto's PhD program at the University of Udine (b) and New Mexico State University (c).
¹ Source: U.S. Energy Information Administration.
² ℘(S) is the power set of a set S.
0000-0000/14/$00.00 © 2014 – IOS Press and the authors. All rights reserved.
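To make the definition concrete, the tuple ⟨𝒳, 𝒟, ℱ, 𝒜, α⟩ can be encoded directly as a small data structure. The following is a minimal Python sketch; the class and field names are our own, purely illustrative choices, and the toy instance is not taken from the paper:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

INF = float("inf")

@dataclass
class CostFunction:
    scope: Tuple[str, ...]               # x^i: the variables relevant to f_i
    costs: Dict[Tuple[int, ...], float]  # assignment of the scope -> cost

    def __call__(self, assignment: Dict[str, int]) -> float:
        key = tuple(assignment[v] for v in self.scope)
        return self.costs.get(key, INF)  # missing rows act as hard constraints

@dataclass
class DCOP:
    variables: List[str]        # X
    domains: Dict[str, List[int]]  # D
    functions: List[CostFunction]  # F
    agents: List[str]           # A
    alpha: Dict[str, str]       # alpha: variable -> owning agent

# A toy instance: two Boolean variables owned by two agents,
# with one soft constraint preferring x1 == x2.
p = DCOP(
    variables=["x1", "x2"],
    domains={"x1": [0, 1], "x2": [0, 1]},
    functions=[CostFunction(("x1", "x2"),
                            {(0, 0): 0, (0, 1): 5, (1, 0): 5, (1, 1): 0})],
    agents=["a1", "a2"],
    alpha={"x1": "a1", "x2": "a2"},
)
print(p.functions[0]({"x1": 0, "x2": 1}))  # evaluates one cost function -> 5
```

A partial assignment is then simply a dictionary covering a subset of the variables, and consistency with 𝒟 corresponds to each stored value belonging to the variable's listed domain.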

The cost

F(σ_X) = Σ_{f_i ∈ ℱ, x^i ⊆ X} f_i(σ_{x^i})

of an assignment σ_X is the sum of the evaluations of the constraints whose scope falls within X. A solution is a (partial) assignment σ_X (written σ for shorthand) for all the variables of the problem, i.e., with X = 𝒳, whose cost is finite: F(σ_X) ≠ ∞. The goal is to find a solution with minimum cost:

x* = argmin_x F(x).

Let us observe that the notion of constraint is expressed by the cost functions. Using terminology derived from local search [30], all constraints are considered to be "soft". "Hard" constraints, i.e., constraints defining combinations of values for their variables that are either allowed or violated, can be represented through the value set {0, ∞}, where ∞ represents an infeasible value combination.

Given a DCOP P, G_P = (𝒳, E_ℱ), where E_ℱ = {{x, y} : (∃ f_i ∈ ℱ) {x, y} ⊆ x^i}, is the constraint graph of P. The representation as a constraint graph cannot deal explicitly with k-ary constraints (with k > 2). A typical artifact to deal with such constraints is to introduce a virtual variable that monitors the value assignments for all the variables in the scope of the constraint and generates the cost values [2]. One of the existing variables can take the role of the virtual variable.

Following [14], we introduce the concepts of local variables of an agent a_i: L_i = {x_j ∈ 𝒳 | α(x_j) = a_i}, and of its interface variables: B_i = {x_j ∈ L_i | ∃ x_k ∈ 𝒳 ∧ ∃ f_s ∈ ℱ : α(x_k) ≠ a_i ∧ {x_j, x_k} ⊆ x^s}. This concept defines a clear separation between the agents' local problems and the agents' problem dependencies. The former is the set of functions involving exclusively the variables controlled by the agent. The latter is defined through the functions whose scope involves variables controlled by some other agent. Such a separation allows us to give a structure to the constraint graph, as follows.

For each agent a_i ∈ 𝒜, its local constraint graph G_i = (L_i, E_ℱi) is the subgraph of the constraint graph G_P where E_ℱi = {{x_a, x_b} | (∃ f_j ∈ ℱ)({x_a, x_b} ⊆ x^j ∧ x^j ⊆ L_i)}.

The agent interaction graph A_P = (𝒜, E_𝒜) is the graph where E_𝒜 = {{a_i, a_j} | (∃ f_k ∈ ℱ)({a_i, a_j} ⊆ α(x^k))}. For each a_i ∈ 𝒜, we denote with N_ai = {a_j ∈ 𝒜 | {a_i, a_j} ∈ E_𝒜} the set of its neighbors. A widespread assumption in the communication model of a DCOP algorithm is that each agent can communicate exclusively with its neighboring agents in the agent interaction graph.

A DFS pseudo-tree arrangement of A_P is a subgraph T_P = (𝒜, E_T) of A_P such that E_T is a spanning tree of A_P and, for each f_i ∈ ℱ and x, y ∈ x^i, α(x) and α(y) appear on the same branch of T_P. Edges of A_P that appear in T_P are referred to as tree edges, while the remaining edges of A_P are referred to as back edges. Figure 1 (right) shows an example of a pseudo-tree for the constraint graph in Figure 1 (left).

Example 1. Let us suppose that there are three agents. Agent 1 controls variables x1, x2, x3; agent 2 controls variables x4 and x5; and agent 3 controls variable x6. For example, x2, x5, x6 could represent power generators (x6 is the most powerful of the three), x1 and x4 represent lamps, and x3 a reading device that requires light. The domains for the variables are D_x1 = ··· = D_x5 = {0, 1}, while D_x6 = {0, 1, 2, 3}. Constraints between the variables are encoded through the cost functions represented by the following tables ("−" means "any"):

  A  B | f_{A,B}      x4 x5 | f_{x4,x5}      x2 x5 x6 | f_{x2,x5,x6}
  0  0 |  ∞           0  0  |  0             −  −  0  |  0
  0  1 |  ∞           0  1  |  5             1  1  −  |  10
  1  0 |  10          1  0  |  10            1  0  −  |  0
  1  1 |  0           1  1  |  0             0  −  i  |  10·i

The leftmost table models the cases (A, B) = (x1, x2) and (A, B) = (x1, x3). Intuitively, the cost ∞ is used to ensure that lamp x1 is switched on. If x1 is switched on but does not use energy from generator x2, a penalty of 10 is enforced. Reading is important; thus, x3 must be set to 1 as well. However, reading without light causes a penalty. The middle table shows that the combination of lamp x4 switched on and generator x5 active is allowed without any penalty (its cost is 0); if generator x5 is working, it is a pity to keep x4 switched off, and the costs are computed accordingly. These constraints generate the local constraint graphs. Let us focus now on the agents' interactions (rightmost table): if the local generator x2 is not working, energy from x6 (in agent 3) must be used, and this is costly.

In the following, we describe two popular DCOP algorithms: a complete inference-based algorithm and an incomplete search-based algorithm.
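The role of the cost functions, and of ∞ as a hard constraint, can be illustrated by evaluating the tables of Example 1 programmatically. The following Python sketch encodes our reading of the tables; the "−" (any) rows of the ternary table are handled by a case analysis, which is an assumption on our part:

```python
INF = float("inf")

# Cost tables of Example 1, as reconstructed above.
def f_lamp(a, b):            # used for scopes (x1, x2) and (x1, x3)
    return {(0, 0): INF, (0, 1): INF, (1, 0): 10, (1, 1): 0}[(a, b)]

def f_x4_x5(x4, x5):
    return {(0, 0): 0, (0, 1): 5, (1, 0): 10, (1, 1): 0}[(x4, x5)]

def f_x2_x5_x6(x2, x5, x6):  # '-' (any) rows handled by case analysis
    if x2 == 1:
        return 10 if x5 == 1 else 0
    return 10 * x6           # local generator off: draw on x6, pay 10 per unit

def total_cost(s):
    """F(sigma): sum of all constraint evaluations over a full assignment."""
    return (f_lamp(s["x1"], s["x2"]) + f_lamp(s["x1"], s["x3"])
            + f_x4_x5(s["x4"], s["x5"])
            + f_x2_x5_x6(s["x2"], s["x5"], s["x6"]))

sigma = {"x1": 1, "x2": 1, "x3": 1, "x4": 1, "x5": 1, "x6": 0}
print(total_cost(sigma))     # finite cost => sigma is a solution
```

Any assignment with x1 = 0 evaluates to ∞ through the leftmost table, so it is not a solution; the goal is to search over the finite-cost assignments for the minimum.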

Fig. 1. Constraint Graph (left) and Pseudo-Tree (right) of the DCOP of Example 1. x2, x5, x6 are the interface variables. Boldface edges are tree edges; {x2, x5} is a back edge.

These algorithms were originally proposed in the context in which each agent controls a single variable. We thus restrict our attention to this special case, and describe a generalization technique to handle multiple-variable agents in Section 4.

2.2. DPOP

The Distributed Pseudo-tree Optimization Procedure (DPOP) [27] is one of the most popular DCOP resolution algorithms. DPOP is a complete, inference-based algorithm consisting of three phases.

In the first phase, the variables are ordered, through a depth-first search visit of G_P, into a DFS pseudo-tree. The pseudo-tree construction is achieved through a distributed algorithm (e.g., [1]). A set of variables, called the separator set sep(x_i), is computed for each node x_i: sep(x_i) contains all the ancestors of x_i in the pseudo-tree that are connected, via tree edges or back edges, to x_i or to one of the descendants of x_i in the pseudo-tree.

In the second phase, called the UTIL propagation phase, each agent, starting from the leaves of the pseudo-tree, executes two operations: (1) it aggregates the costs in its subtree for each value combination of the variables in its separator and the variables it controls, and (2) it eliminates the variables it controls by optimizing over them (i.e., for each combination of values for the variables in its separator, it selects the one with the smallest cost). The aggregated costs are encoded in a UTIL message, which is propagated from children to their parents, up to the root.

In the third phase, called the VALUE propagation phase, each agent, starting from the root of the pseudo-tree, selects the optimal value for its variables. The optimal values are calculated based on the UTIL messages received from its children and the VALUE message received from its parent. The VALUE messages contain the optimal values of the variables and are propagated from parents to their children, down to the leaves of the pseudo-tree.

The time and space complexities of DPOP are dominated by the UTIL propagation phase, which is exponential in the maximum number of variables in a separator set sep(x_i). The other two phases require a polynomial number of linear-sized messages (in the number of variables of the problem), and the complexity of the local operations is at most linear in the size of the domain [27].

2.3. MGM

Maximum Gain Message (MGM) [23] is an incomplete, search-based algorithm that performs a distributed local search to solve a DCOP. Each agent a_i starts by assigning a random value to each of its variables. The agent then sends this information to all of its neighbors in G_P. Upon receiving the values of its neighbors, each agent calculates the maximum gain (i.e., the maximum decrease in cost) it can obtain by changing its values, and sends this information to all of its neighbors as well. Upon receiving the gains of its neighbors, it changes its values if its gain is the largest among those of its neighbors. This process is repeated until a termination condition is met. MGM provides no quality guarantees on the returned solution.

Let us indicate with l = max_{a_i ∈ 𝒜} |N_ai| the maximum number of neighbors of an agent in the constraint graph, and with d = max_{x_i ∈ 𝒳} |D_xi| the maximum domain size. MGM agents perform O(ld) operations in each iteration, as each agent needs to compute the cost of each of its values by taking into account the values of all of its neighbors. The memory requirement per MGM agent is O(l), since it needs to store the values of all of its neighboring agents. In terms of communication requirements, each MGM agent sends O(l) messages, one to each of its neighboring agents, and each message is of constant size O(1), as it contains either the agent's current value or the agent's current gain.

3. Background: GPU

Modern Graphics Processing Units (GPUs) are massively parallel architectures, offering thousands of computing cores and a rich memory hierarchy to support graphical processing. NVIDIA's Compute Unified Device Architecture (CUDA) [32] aims at enabling the use of the multiple cores of a graphics card to accelerate general-purpose (non-graphical) applications by providing programming models and APIs that enable the full programmability of the GPU. The computational model supported by CUDA is Single-Instruction Multiple-Threads (SIMT), where multiple threads perform the same operation on multiple data points simultaneously.

A GPU is constituted by a series of Streaming Multiprocessors (SMs), whose number depends on the specific class of GPU. For example, the Fermi architecture provides 16 SMs, as illustrated in Figure 2 (left). Each SM contains a number of computing cores, each containing an ALU and a floating-point processing unit. Figure 2 (right) shows a typical CUDA logical architecture. A CUDA program is a C/C++ program that includes parts meant for execution on the CPU (referred to as the host) and parts meant for parallel execution on the GPU (referred to as the device). A parallel computation is described by a collection of GPU kernels, where each kernel is a function to be executed by several threads. When mapping a kernel to a specific GPU, CUDA schedules groups of threads (blocks) on the SMs. In turn, each SM partitions the threads within a block into warps³ for execution; the warp represents the smallest work unit on the device. Each thread instantiated by a kernel can be identified by a unique, sequential identifier (T_id), which allows differentiating both the data read by each thread and the code to be executed.

3.1. Memory Organization

GPU and CPU are, in general, separate hardware units with physically distinct memory types connected by a system bus. Thus, for the device to execute some computation invoked by the host and to return the results to the caller, a data flow needs to be established from the host memory to the device memory and vice versa. The device memory architecture is entirely different from that of the host, in that it is organized in several levels differing from each other in both physical and logical characteristics.

Each thread can utilize a small number of registers,⁴ which have thread lifetime and visibility. Threads in a block can communicate by reading and writing a common area of memory, called shared memory. The total amount of shared memory per block is typically 48 KB. Communication between blocks, and communication between the blocks and the host, is realized through a large global memory. The data stored in the global memory has global visibility and lifetime: it is visible to all threads within the application (including the host) and lasts for the duration of the host allocation.

Apart from lifetime and visibility, the different memory types also have different dimensions, bandwidths, and access times. Registers are the fastest memory, with accesses typically consumed within a clock cycle per instruction, while the global memory is the slowest but largest memory accessible by the device, with access times ranging from 300 to 600 clock cycles. The shared memory is partitioned into 32 logical banks, each serving exactly one request per cycle. Shared memory has a minimal access latency, provided that the memory accesses of multiple threads are mapped to different memory banks.

3.2. Bottlenecks and Common Optimization Practices

While it is relatively simple to develop correct GPU programs (e.g., by incrementally modifying an existing sequential program), it is nevertheless challenging to design an efficient solution. Several factors are critical to gaining performance. In this section, we discuss a few common practices that are important for the design of efficient CUDA programs.

Memory bandwidth is widely considered to be a critical bottleneck for the performance of GPU applications. Accessing global memory is relatively slow compared to accessing the shared memory in a CUDA kernel. However, even if not cached, global accesses that cover a segment of 128 contiguous bytes of data are fetched at once. Thus, most of the global memory access latency can be hidden if the GPU kernel employs a coalesced memory access pattern. Figure 3 (left) illustrates an example of a coalesced memory access pattern, in which aligned threads in a warp access aligned entries in a memory segment, resulting in a single transaction. Coalesced memory accesses thus allow the device to reduce the number of fetches to global memory for every thread in a warp. In contrast, when threads adopt a scattered data access pattern (Figure 3 (right)), the device serializes the memory transaction, drastically increasing its access latency.

Data transfers between the host and device memory are performed through a system bus, which translates to slow transactions. In general, it is convenient to store the data onto the device memory. Additionally, batching small memory transfers into a large one will reduce most of the per-transfer processing overhead [32].

³ A warp is typically composed of 32 threads.
⁴ In modern devices, each SM allots 64 KB of register space.
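Returning to MGM (Section 2.3), one synchronous round of the protocol can be sketched as a centralized simulation. The toy problem below, and all the names in it, are our own illustration rather than the algorithm's original presentation; each agent owns one Boolean variable and the edges carry binary cost tables:

```python
# A tiny DCOP: one variable per agent, binary cost tables on the graph edges.
domains = {"a": [0, 1], "b": [0, 1], "c": [0, 1]}
edges = {("a", "b"): {(0, 0): 2, (0, 1): 0, (1, 0): 0, (1, 1): 2},
         ("b", "c"): {(0, 0): 2, (0, 1): 0, (1, 0): 0, (1, 1): 2}}

def local_cost(agent, val, values):
    """Cost of the constraints incident to `agent`, given its neighbors' values."""
    cost = 0
    for (u, w), table in edges.items():
        if u == agent:
            cost += table[(val, values[w])]
        elif w == agent:
            cost += table[(values[u], val)]
    return cost

def mgm_round(values):
    # 1) Each agent computes its best unilateral value change and the gain ...
    gains, best = {}, {}
    for ag, dom in domains.items():
        cur = local_cost(ag, values[ag], values)
        costs = {v: local_cost(ag, v, values) for v in dom}
        best[ag] = min(costs, key=costs.get)
        gains[ag] = cur - costs[best[ag]]
    # 2) ... and moves only if its gain strictly beats all its neighbors' gains.
    new = dict(values)
    for ag in domains:
        nbrs = [u for e in edges for u in e if ag in e and u != ag]
        if gains[ag] > 0 and all(gains[ag] > gains[n] for n in nbrs):
            new[ag] = best[ag]
    return new

v = {"a": 0, "b": 0, "c": 0}   # "random" initial assignment, fixed for the demo
for _ in range(5):             # termination condition: a fixed round budget
    v = mgm_round(v)
print(v)
```

Note how the strict comparison of gains guarantees that two neighboring agents never move in the same round, which is what makes MGM's cost monotonically non-increasing despite the lack of optimality guarantees.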

Fig. 2. Fermi Hardware Architecture (left) and CUDA Logical Architecture (right)

Fig. 3. Coalesced (left) and scattered (right) data access patterns.

The organization of the data into data structures, and the data access patterns, play a fundamental role in the efficiency of GPU computations. Due to the computational model employed by the GPU, it is important that each thread in a warp executes the same branch of execution. When this condition is not satisfied (e.g., two threads execute different branches of a conditional construct), the degree of concurrency typically decreases, as the execution of threads performing separate control flows has to be serialized. This is referred to as branch divergence, a phenomenon that has been intensely analyzed within the High Performance Computing (HPC) community [6,8,18].

The line of work in this paper, focused on using GPUs to improve the performance of DCOP solvers, extends recent approaches that make use of GPUs to enhance the performance of constraint solvers [4,5,7,10].

4. DCOP Applications Challenge: Agents with Multiple Variables

The vast majority of DCOP resolution algorithms have been proposed in a restrictive setting, in which each agent controls exactly one variable. While this assumption simplifies the algorithm description and its development, it does not hold true in many of the applications of our interest. For instance, microgrids are self-sustainable and autonomous organizations composed of generators and loads within an electric smart grid; these can be modeled as agents whose variables represent the amounts of energy to dispatch, transmit, and withdraw from the grid. In a smart home, multiple devices can be coordinated by a single autonomous home assistant, which, too, has to control a large number of variables.

There are two commonly used reformulation techniques to cope with DCOPs in which agents control multiple variables [3,41]: (i) compilation, where each agent creates a new pseudo-variable whose domain is the Cartesian product of the domains of all the variables of the agent; and (ii) decomposition, where each agent creates a pseudo-agent for each of its variables. While both techniques are relatively simple, they can be inefficient. In compilation, the memory requirement for each agent grows exponentially with the number of variables that it controls. In decomposition, the DCOP algorithms will treat two pseudo-agents as independent entities, resulting in unnecessary computation and communication costs.

To overcome these limitations, the following subsection discusses the Multi-Variable Agent (MVA) DCOP decomposition, originally proposed in [14]. It enables a separation between the agents' local subproblems, which can be solved independently using centralized solvers, and the DCOP global problem, which requires coordination among the agents. The decomposition does not lead to any additional privacy loss, and it enables the use of different centralized and distributed solvers in a hierarchical and parallel fashion.

4.1. The MVA Framework

In the MVA decomposition, a DCOP problem P is decomposed into |𝒜| subproblems P_i = (L_i, B_i, E_ℱi), where P_i is associated with agent a_i ∈ 𝒜. In addition to the decomposed problem P_i, each agent receives: (1) the global DCOP algorithm P_G, which is common to all agents in the problem and defines the agent's coordination protocol and the behavior associated with the receipt of a message, and (2) the local algorithm P_L, which can differ between agents and is used to solve the agent's subproblem.

Figure 4 shows a flowchart illustrating the four conceptual phases in the execution of the MVA framework for each agent a_i:

Phase 1—Wait: The agent waits for a message to arrive. If the received message results in a new value assignment σ(x_r, k) for an interface variable x_r of B_i, then the agent proceeds to Phase 2; if not, it proceeds to Phase 4. The value k ∈ N establishes an enumeration of the interface variables' assignments.

Phase 2—Check: The agent checks whether it has performed a new complete assignment for all of its interface variables, indexed with k ∈ N. If it has, the agent proceeds to Phase 3; otherwise, it returns to Phase 1.

Phase 3—Local Optimization: When a complete assignment is given, the agent passes control to a local solver, which solves the following problem:

  minimize    Σ_{f_j ∈ ℱ_i} f_j(x^j)
  subject to: x_r = σ(x_r, k),  ∀ x_r ∈ B_i

where ℱ_i = {f_i ∈ ℱ | x^i ⊆ L_i}. Solving this problem results in finding the best assignment for the agent's local variables, given the particular assignment of its interface variables. Notice that the local solver P_L is independent of the DCOP structure and can be customized based on the agent's local requirements. For example, agents can use GPU-optimized solvers, if they have access to GPUs, or use off-the-shelf CP, MIP, or ASP solvers. Once the agent solves its subproblem, it proceeds to Phase 4.

Phase 4—Global Optimization: The agent processes the new assignment as established by the DCOP algorithm P_G, executes the necessary communications, and returns to Phase 1.

The agents can execute these phases independently of one another, because they exploit the co-locality of their local variables without any additional privacy loss, which is a fundamental aspect in DCOPs [17].

In addition, the local optimization process can operate on m ≥ 1 combinations of value assignments of the interface variables before passing control to the next phase. This is the case when the agent explores m different assignments for its interface variables in Phases 2 and 3. These operations are performed by storing the best local solutions and their corresponding costs in a cost table of size m, which can be visualized as a cache memory. The minimum value of m depends on the choice of the global DCOP algorithm P_G. For example, for common search-based algorithms such as asynchronous forward-bounding (AFB) it is 1, while for common inference-based algorithms such as DPOP it is exponential in the size of the separator set.

The correctness, completeness, and asymptotic complexity of the MVA decomposition, together with an extensive evaluation, are presented in [14].

5. DCOP Applications Challenge: Scalability through Hierarchical Parallel Models

As mentioned in Section 2.2, the aggregation and elimination operations within the UTIL propagation phase of DPOP are the most expensive operations of the algorithm. In this section, we review the GPU-based Distributed Bucket Elimination (GpuBE) framework [13], which extends DPOP by exploiting GPU parallelism within the aggregation and elimination (i.e., UTIL) operations in order to speed them up. The key observation that allows us to parallelize these operations is that the computation of the cost for each value combination in a UTIL message is independent of the computation of the other combinations. The use of a GPU architecture allows us to exploit such independence by concurrently exploring several value combinations of the UTIL message computed by the aggregation operator, as well as by concurrently eliminating variables.

5.1. GPU Data Structures

To fully utilize the parallel computational power of GPUs, the data structures need to be designed in such a way as to limit the amount of information exchanged between the CPU host and the GPU device, minimizing the accesses to the slow device global memory while ensuring that the enforced data access pattern is coalesced.
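A common way to meet both requirements is to ship each table to the device as a flat cost array and to recover value assignments from array positions via mixed-radix arithmetic. The following Python sketch (our own illustration; the variable names and the example scope are hypothetical) shows the index/assignment bijection for lexicographically ordered rows:

```python
from itertools import product

# Scope and domains of a hypothetical table; the rows are implicit:
# row i corresponds to the i-th assignment of `scope` in lexicographic order.
scope = ["x1", "x2", "x3"]
doms = {"x1": [0, 1], "x2": [0, 1, 2], "x3": [0, 1]}

def index_of(assignment):
    """Mixed-radix 'perfect hash': assignment -> row index."""
    idx = 0
    for v in scope:
        idx = idx * len(doms[v]) + doms[v].index(assignment[v])
    return idx

def assignment_of(idx):
    """Inverse mapping: row index -> assignment (decoded right-to-left)."""
    out = {}
    for v in reversed(scope):
        d = doms[v]
        out[v] = d[idx % len(d)]
        idx //= len(d)
    return out

# The two maps are mutually inverse over the whole (implicit) table:
rows = list(product(*(doms[v] for v in scope)))
assert all(index_of(dict(zip(scope, r))) == i for i, r in enumerate(rows))
assert all(tuple(assignment_of(i)[v] for v in scope) == r
           for i, r in enumerate(rows))
print(index_of({"x1": 1, "x2": 2, "x3": 0}))
```

Because the mapping is pure arithmetic on the index, a GPU thread can locate its row from its identifier alone, with no table of variable values ever stored on the device.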

Fig. 4. MVA Execution Flow Chart.

To do so, we store in the device global memory exclusively the minimal information required to compute the UTIL functions, which is communicated to the GPU once, at the beginning of the computation. This allows the GPU kernels to communicate with the CPU host exclusively to exchange the results of the aggregation and elimination processes.

We introduce the following concept: a UTIL-table is a 4-tuple T = ⟨S, R, χ, ⪯⟩, where:
– S ⊆ 𝒳 is a list of variables denoting the scope of T.
– R is a list of tuples of values, each tuple having length |S|. Each element in this list (called a row of T) specifies an assignment of values for the variables in S that is consistent with their domains. We denote with R[i] the tuple of values corresponding to the i-th row of R, for i = 1, ..., |R|.
– χ is a list of length |R| of cost values corresponding to the costs of the assignments in R. In particular, the element χ[i] represents the cost of the assignment R[i] for the variables in S, with i = 1, ..., |R|.
– ⪯ denotes an ordering relation used to sort the variables in the list S. In turn, the value assignments and the cost values in each row of R and χ, respectively, obey the same ordering.

As a technical note, a UTIL-table T is mapped onto the GPU device so as to store exclusively the cost values χ, not the associated variable values. We assume that the rows of R are sorted in lexicographic order; thus, the i-th entry χ[i] is associated with the i-th permutation R[i] of the variable values in S, in lexicographic order. This strategy allows us to employ a simple, perfect hashing to efficiently associate row numbers with variables' values. Additionally, all the data stored in the GPU global memory is organized in mono-dimensional arrays, to facilitate coalesced memory accesses.

5.2. GPU-based Constraint Aggregation

The constraint aggregation takes as input two UTIL-tables, T_i and T_o, and aggregates the cost values in χ_i to those of χ_o for all the corresponding assignments of the variables shared by the scopes of the two UTIL-tables. We refer to T_i and T_o as the input and output UTIL-tables, respectively.

Consider the example in Figure 5: the cost values χ_i of the input UTIL-table T_i (left) are aggregated to the cost values χ_o of the output UTIL-table T_o (right), which were initialized to 0. The rows of the two tables with identical value assignments for the shared variables x2 and x3 are shaded with the same color.

To optimize the performance of the GPU operations and to avoid unnecessary data transfers to/from the GPU global memory, we transfer only the list of cost values χ of each UTIL-table that needs to be aggregated, and we employ a simple, perfect hashing function to efficiently associate row numbers with variables' values. This allows us to compute the indexes into the cost vector of the input UTIL-table relying exclusively on the information of the thread ID and, thus, to avoid accessing the scope S and the assignment vectors R of the input and output UTIL-tables. We refer the interested reader to [13] for the complete details.

To exploit the highest degree of parallelism offered by the GPU device, we:

1. map one GPU thread T_id to one element of the output UTIL-table, and
2. adopt the ordering relation ⪯ for each input and output UTIL-table processed.

Ti To 5.3. GPU-based Variable Elimination Tid x2 x3 i x1 x2 x3 o 0 [0] 0 0 2 + 0 0 0 0 + 2 [0] 1 [1] 0 1 0 + 0 0 1 0 + 0 [1] The variable elimination takes as input a UTIL-table 2 P [2] 1 0 1 + 0 1 0 0 + 1 [2] To and a variable xi So. It removes xo from the 3 [3] 1 1 3 + 0 1 1 0 + 3 [3] 4 UTIL-table’s scope by optimizing over its cost rows. + 1 0 0 0 + 2 [4] 5 As a result, the output UTIL-table rows list the unique + 1 0 1 0 + 0 [5] 6 zt u + 1 1 0 0 + 1 [6] assignments for the value combinations of So xi in 7 + 1 1 1 0 + 3 [7] the input UTIL-table Ro which minimizes the costs P values for each d Dxi . Figure6 illustrates this Fig. 5. Example of aggregation of two tables on GPU. process, where the variable x3 is eliminated from the UTIL-table To. The column being eliminated is high- lighted yellow in the input UTIL-table. The different To To row colors identify the unique assignments for the re- o Tid o x1 x2 x3 x1 x2 maining variables x1, x2, and exposes the high de- [0] 0 0 0 5 0 min 0 0 3 [0] gree of parallelization that is associated to such opera- [1] min [1] 0 0 1 3 0 0 1 4 tion. To exploit this level of parallelization, we adopt a [2] min 1 0 4 [2] 0 1 0 4 1 paradigm similar to that employed in the aggregation [3] 0 1 1 7 1 min 1 1 5 [3] operation on GPU, where each thread is responsible of [4] 1 0 0 5 2 [5] 1 0 1 4 2 the computation of a single output element. [6] 1 1 0 5 3 The variable elimination is executed in parallel by a [7] 1 1 1 6 3 number of GPU threads equal to the number of rows of the output UTIL-table. Each thread identifies its row Fig. 6. Example of variable elimination on GPU. index ro within the output UTIL-table cost values χo, given its thread ID. It hence retrieves an input row in- dex r to the value of the first χ input UTIL-table Adopting such techniques allows each thread to be re- j o row to analyze. 
Note that, as the variable to eliminate sponsible for performing exactly two reads and one is listed last in the scope of the UTIL-table, it is pos- write from/to the GPU global memory. Additionally, sible to retrieve each unique assignment for the pro- the ordering relation enforced on the UTIL-tables al- jected output bucket table, simply by offsetting r by lows us to exploit the locality of data and to encour- o the size of D . Additionally, all elements listed in age coalesced data accesses. As illustrated in Figure5, xi χ rr s, . . . , χ rr |D |s differ exclusively on the this paradigm allows threads (whose IDs are identi- o j o j xi value assignment to the variable x (see Figure6). fied in red by their T ’s) to operate on contiguous i id Thus, the GPU kernel evaluates the input UTIL-table chunks of data and, thus, minimizes the number of ac- cost values associated to each element in the domain of tual read (from the input UTIL-table, on the right) and x , by incrementing the row index r , |D | ¡ 1 times, write (onto the output UTIL-table, on the left) opera- i j xi and chooses the minimum cost value. At last, it saves tions from/to the global memory performed by a group the best cost found to the associated output row. of threads with a single data transaction.5 Note that each thread reads |D | adjacent values of As a technical detail, the UTIL-tables are created xi the vector χo, and writes one value in the same vec- and processed so that the variables in their scope are tor. Thus, this algorithm (1) perfectly fits the SIMT sorted according to the order . This means that the T paradigm, (2) minimizes the accesses to the global variables with the highest priority appear first in the memory as it encourages a coalesced data access pat- scope list, while the variables to be eliminated always tern, and (3) uses a relatively small amount of global appear last. 
Such detail allows us to efficiently encode memory, as it recycles the memory area allocated for the elimination operation on the GPU, as explained the input UTIL-table, to output the cost values for the next. output UTIL-table. For additional details on these operations, and on the 5Accesses to the GPU global memory are cached into cache lines theoretical and experimental results of the GPU paral- of 128 Bytes, and can be fetched by all requiring threads in a warp. lel version of DPDOP, we refer the reader to [13]. 10 F. Fioretto et al. / Distributed Multi-Agent Optimization for Smart Grids and Home Automation
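The indexing logic at the heart of both kernels can be illustrated with a small sequential sketch (Python standing in for the CUDA kernels of [13]; the function names are ours). Each loop iteration plays the role of one GPU thread: it derives its input row index from the output row index alone, via the perfect hashing induced by the lexicographic ordering of the rows, so neither the scopes S nor the assignment vectors R are read in the hot loop. The numbers reproduce the examples of Figures 5 and 6.

```python
# Sequential sketch of the GPU aggregation and elimination kernels.
# One loop iteration = one GPU thread (Tid).

def index_to_assignment(idx, domain_sizes):
    """Decode a lexicographic row index into one value per scope variable."""
    values = []
    for size in reversed(domain_sizes):
        values.append(idx % size)
        idx //= size
    return list(reversed(values))

def assignment_to_index(values, domain_sizes):
    """Perfect hash: encode an assignment back into its lexicographic index."""
    idx = 0
    for value, size in zip(values, domain_sizes):
        idx = idx * size + value
    return idx

def aggregate(scope_o, doms_o, chi_o, scope_i, doms_i, chi_i):
    """chi_o[tid] += chi_i[r], where row r shares the values of the common variables."""
    pos = [scope_o.index(x) for x in scope_i]   # positions of the shared variables
    out = list(chi_o)
    for tid in range(len(out)):                 # one "thread" per output row
        row = index_to_assignment(tid, doms_o)
        out[tid] += chi_i[assignment_to_index([row[p] for p in pos], doms_i)]
    return out

def eliminate(doms_o, chi_o):
    """Drop the last scope variable: each 'thread' minimizes |D| adjacent costs."""
    d = doms_o[-1]
    return [min(chi_o[r * d : r * d + d]) for r in range(len(chi_o) // d)]

# Figure 5: Ti over (x2, x3) with costs [2, 0, 1, 3], To over (x1, x2, x3).
agg = aggregate(["x1", "x2", "x3"], [2, 2, 2], [0] * 8,
                ["x2", "x3"], [2, 2], [2, 0, 1, 3])
# Figure 6: eliminating x3 from the 8-row table by pairwise minimization.
elim = eliminate([2, 2, 2], [5, 3, 4, 7, 5, 4, 5, 6])
```

Because the variable to eliminate is listed last in the scope (enforced by the ordering ⪯T), the |Dxi| candidate costs of each output row are adjacent in χ, which is what makes the per-thread reads coalesced on an actual GPU.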

6. Applications to the Smart Grid

With the growing complexity of the current power grid, there is an increasing need for intelligent operations coordinating energy supply and demand. A key feature of the smart grid vision is that intelligent mechanisms will coordinate the production, transmission, and consumption of energy in a distributed way, guaranteeing sustainability, reliability, and resilience. The distributed nature of the grid makes cooperative multi-agent based solutions a natural fit. A critical problem in this domain is that of minimizing the peak load demands, as these are expensive both from a system operational perspective and for the risk associated with power failures due to overload. Demand response (DR) strategies allow customers to make autonomous decisions on their energy consumption and have been shown to improve various power system operations, including load balancing and frequency regulation.

From a large-scale power generation perspective, the economic dispatch (ED) of power generators is applied to allocate the power to be produced by each generator, minimizing the production costs while satisfying the physical constraints of the power system. Demand response programs enable customers to make informed decisions regarding their energy consumption and can be used to reduce the total peak demand, reshape the demand profile, and thus increase grid sustainability. Maintaining a constant balance between generation and consumption of power is critical for effective power system operations. Section 7 presents a proactive DCOP approach to the economic dispatch which includes demand response and can quickly respond to the dynamic changes of the grid loads (within a few minutes).

At a smaller scale, from a consumer perspective, demand response can be used to reduce the residential electricity spending by designing smart buildings capable of making autonomous decisions to control power loads, production, and transmission. Through the proliferation of smart devices (e.g., smart thermostats, smart washers) in our homes and offices, the automation of buildings within the broader smart grid is becoming crucial. Home automation is defined as the automated control of the building's electrical devices, with the objective of improving the comfort of the occupants, improving energy efficiency, and reducing operational costs. Section 8 focuses on the scheduling of smart devices in a decentralized way. In the proposed model, each household is responsible for the schedule of the devices in her building, under the assumption that each user has personal preferences and comfort levels. The DR model enables users to act cooperatively to reduce the global energy peaks.

7. Multi-agent Economic Dispatch with Demand Response

In traditional operations, ED and DR are implemented separately, despite the strong interdependencies between these two problems. Fioretto et al. [16] proposed an integrated approach that solves the ED and DR problems to simultaneously maximize the benefits of the customers and minimize the generation costs. We survey this approach next.

7.1. The EDDR Model

A power grid can be viewed as a network of nodes (called buses in the power systems literature) that are connected by distribution cables and whose loads are served by (electromechanical) power generators. Typically, a group of such power generators is located in a single power station, and a number of power stations are distributed in different geographic areas. Such a power grid can be visualized as an undirected graph (V, E), in which buses (in V) are modeled as graph nodes and transmission lines (in E ⊆ V × V) as edges. This representation captures the ability of the current to flow in both directions in a circuit.

An n-bus power system is considered, where each bus injects and withdraws power through a generator g ∈ G and a load l ∈ L, respectively, where G and L are, respectively, the set of generators and the set of loads in the problem. Load buses can be dispatchable or non-dispatchable, based on whether it is possible to defer a portion of the load.

We model each bus as an agent, capable of making autonomous decisions on its power consumption and generation. We assume that there is exactly one generator and one load in each bus. The case with multiple loads and generators per bus can be easily transformed into this simpler model by precomputing the best operational conditions for each output combination of loads and generation power.

When a load difference is revealed to the power system, the ED problem is to regulate the output of the appropriate units so that the new generation output meets the new load and the generators are at economic dispatch (i.e., they are running efficiently according to some objective function). Near real-time power consumption monitoring from smart meters allows short-term load prediction, which can supply the smart grid with predictions of the power consumption levels.

The D-EDDR problem with a time horizon H is presented in [16]. Due to a rippling effect on the generators' power-cost curve, the generators' cost functions contain nonlinear terms, which makes the optimization problem non-convex. Additionally, due to the non-linearity of the transmission line capacity constraints, Fioretto et al. consider a relaxation of the D-EDDR problem in which the transmission line capacity constraints are modeled as soft constraints with a penalty on the degree of their violation. This approach is similar to the one used by researchers in the power systems community [34,38].

The D-EDDR problem has been modeled as a Dynamic DCOP [28], which is a sequence of DCOPs ⟨P1, P2, ...⟩, where each Pt is associated with a particular time step t. Each DCOP Pt in the Dynamic DCOP has the same agents A as the network buses; each agent a ∈ A controls two variables, a.xg and a.xl, for a generator and a load. The domains of these variables correspond to the possible power that can be injected by generator g, or withdrawn by load l, respectively. In addition to the cost functions associated with the generators' costs and the loads' utilities, the problem includes hard constraints, which represent the physical constraints of the energy flow conservation and the generators' ramp constraints, and soft constraints, which represent the transmission line capacities.

7.2. R-DEEDS

The Relaxed Distributed Efficient ED with DR Solver (R-DEEDS) [16] is a multi-agent solver for the (relaxed) D-EDDR problem, based on the dynamic DCOP model described above. R-DEEDS bears similarities with DPOP (Section 2.2) and is composed of four phases: (1) Pseudo-tree Generation, (2) UTIL Initialization, (3) UTIL Propagation, and (4) VALUE Propagation.

The first and last phases of R-DEEDS are identical to those of DPOP. In the UTIL Initialization phase, each agent initializes its UTIL-tables by computing, for all the possible combinations of load and generator outputs within the bus controlled by the agent, and for each time step in the horizon, their costs and their effects on each of the transmission lines of the network. The UTIL Propagation phase is similar to that of DPOP: R-DEEDS agents propagate the UTIL-tables up to the root agent, analogously to agents in DPOP. However, differently from DPOP, when an agent aggregates the UTIL-tables of its children, it also updates the effect on the congestion of the transmission lines of the network. Finally, the root agent checks the satisfiability of some "global" constraints:

– If none of the values in its UTIL-table satisfies all the constraints, the problem is insufficiently relaxed. Thus, the root increases a parameter ω by a factor of 2 based on the difference with respect to its value in the last iteration,⁶ propagates this information to all the agents, informing them to reinitialize their UTIL-tables with the new updated ω, and the new relaxed problem is solved in a new iteration.
– If there is a row satisfying all the constraints, then the problem may be overly relaxed. Thus, the root decreases ω by a factor of 2 based on the difference with respect to its value in the last iteration, and the new relaxed problem is solved in a new iteration.

This process repeats itself until some criterion is met (e.g., a maximum number of iterations, or convergence of the solution).

⁶ The larger the value of ω, the more relaxed the problem. The relaxed problem is identical to the original one if ω = 0.

A number of the operations of the algorithm can be sped up through parallelization using GPUs. In particular, the initialization and the consistency checks of the different rows of the UTIL-table can be performed independently from each other. Additionally, the aggregation of the different rows of the UTIL-tables can also be computed in parallel, fitting well the SIMT processing paradigm. Thus, R-DEEDS adopts a scheme similar to that described in Section 5 to implement a GPU version of the UTIL Initialization and UTIL Propagation phases.

7.3. Empirical Evaluation

We evaluate the proposed algorithms on 5 standard IEEE Bus Systems,⁷ all having non-convex solution spaces, and ranging from 5 to 118 buses. All the generators are set with a 5MW ramp rate limit. We use a 1MW discretization unit for each load and generator. Thus, the maximum domain sizes of the variables are between 100 and 320 in every experiment. The horizon H is set to 12, and we define a parameter Hopt that denotes the number of time steps considered by the R-DEEDS agents at a time. Thus, R-DEEDS first solves the subproblem with the first Hopt time steps. Next, it solves the following subproblem with

⁷ http://publish.illinois.edu/smartergrid/
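The iterative adjustment of the penalty weight ω can be sketched as follows. The text specifies the update only loosely ("by a factor of 2 based on the difference with respect to its value in the last iteration"); the sketch below reads it as a binary-search-style refinement and is a hypothetical interpretation, not the algorithm of [16]. `solve_relaxed` is a stand-in for one full UTIL/VALUE pass under penalty weight ω.

```python
# Hypothetical sketch of the outer relaxation loop around R-DEEDS.
# `solve_relaxed(omega)` stands in for one complete solve of the relaxed
# problem and returns (solution, satisfies_global_constraints).

def tune_relaxation(solve_relaxed, omega0=1.0, max_iters=10):
    omega, step = omega0, omega0
    best = None
    for _ in range(max_iters):
        solution, feasible = solve_relaxed(omega)
        step /= 2.0                # halve the adjustment at each iteration
        if feasible:
            best = (solution, omega)
            omega -= step          # possibly overly relaxed: tighten omega
        else:
            omega += step          # insufficiently relaxed: increase omega
        omega = max(omega, 0.0)    # omega = 0 recovers the original problem
    return best

# Toy stand-in solver: feasible exactly when omega >= 3.
demo = tune_relaxation(lambda w: ("plan", w >= 3), omega0=8.0, max_iters=20)
```

The loop keeps the least relaxed ω for which a feasible row was found, mirroring the stopping criteria mentioned in the text (iteration cap or convergence).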

Table 1. R-DEEDS CPU and GPU Runtimes, Speedups (in parentheses), and Normalized Solution Qualities.

IEEE    Simulated Runtime (sec)                                                                           Normalized Quality
Buses   CPU Implementation                    GPU Implementation
        Hopt=1  Hopt=2  Hopt=3  Hopt=4        Hopt=1         Hopt=2         Hopt=3          Hopt=4        Hopt=1  Hopt=2  Hopt=3  Hopt=4
5       0.010   0.044   3.44    127.5         0.025 (0.4x)   0.038 (1.2x)   0.128 (26.9x)   2.12 (60.2x)  0.8732  0.8760  0.9569  1.00
14      0.103   509.7   –       –             0.077 (1.3x)   3.920 (130x)   61.70 (n/a)     –             0.6766  0.8334  1.00    –
30      0.575   9084    –       –             0.241 (2.4x)   79.51 (114x)   –               –             0.8156  1.00    –       –
57      4.301   –       –       –             0.676 (6.4x)   585.4 (n/a)    –               –             0.8135  1.00    –       –
118     174.4   –       –       –             4.971 (35.1x)  –              –               –             1.00    –       –       –

the next Hopt time steps, and so on, until the entire problem is solved. The satisfiability of the constraints between the subsequent Hopt subproblems is ensured in Phase 2 of R-DEEDS.

We generate 30 instances for each test case. The performance of the algorithms is evaluated using the simulated runtime metric [33], and we imposed a timeout of 5 hours. Results are averaged over all instances. These experiments are performed on an AMD Opteron 6276, 2.3GHz, with 128GB of RAM, equipped with a GeForce GTX TITAN GPU device with 6GB of global memory. If an instance cannot be completed, it is due to runtime limits for the CPU implementation of the algorithm, and to memory limits for the GPU implementation. We set the number of iterations to 10.

We do not report an evaluation against existing (Dynamic) DCOP algorithms, as the standard (Dynamic) DCOP models cannot fully capture the complexities of the D-EDDR problem (e.g., environment variables with continuous values).

Table 1 tabulates the runtimes in seconds of R-DEEDS with both the CPU and the GPU implementations at varying Hopt. It also tabulates the average solution quality found, normalized with the best solution found for each IEEE bus. We make the following observations:

– The solution quality increases as Hopt increases.
– For the smaller test case with Hopt = 1, the CPU implementation is faster than the GPU one. However, for all other configurations, the GPU implementation is much faster than its CPU counterpart, with speedups of up to 130 times. The reported speedups increase with increasing complexity of the problem (i.e., bus size and Hopt).
– The GPU implementation scales better than the CPU implementation. The latter could not solve any of the instances for the configurations with Hopt ≥ 3 for the 14-bus system and Hopt ≥ 2 for the 57-bus system.

We observe that the algorithms report satisfiable instances within the first four iterations of the iterative process.

To validate the solutions returned by R-DEEDS, we tested the stability of the returned control strategy for the IEEE 30-Bus System on a dynamic simulator (Matlab SimPowerSystems) that employs a detailed time-domain model and recreates conditions analogous to those of physical systems. Figure 7 shows the dynamic response of the system frequency, whose nominal value is 60 Hz. The system frequency is at the nominal value when the power supply and demand are balanced. If more power is produced than consumed, the frequency rises, and vice versa. Deviations from the nominal frequency value would damage synchronous machines and other appliances. Thus, to ensure stable operating conditions, it is important to ensure that the system frequency deviation is confined within a small interval (typically 0.1 Hz).

Fig. 7. Dynamic Simulation of the IEEE 30 Bus with heavy loads.

In our experiment, the first 60 seconds of the simulation are tested enforcing the ED solution in a full load scenario (100% of full load), using the same setting as in [22]. The rest of the simulation deploys the D-EDDR solution returned by R-DEEDS. While the resolution of the simulation is in microseconds, R-DEEDS agents send only the desired power injected and withdrawn (schedules), computed offline, in one-minute intervals; the simulator interpolates the power generated between such intervals. This scenario reflects real-life conditions, where control signals are sent to actual generators at intervals of several minutes. As expected, the frequency deviation is more stable when the load predictions are accurate. Crucially, the deviation in both cases is within 0.05 Hz, thereby ensuring system stability. In addition, we observe, during the simulation, that the D-EDDR solution can reduce the total load by up to 68.5%, showing the DR contribution to peak demand reduction. Finally, the frequency response of the D-EDDR solution converges faster than that of the ED-only solution, with smaller fluctuations. The reason is that the supply-demand balance is better maintained by coordinating the generators and the loads in the system simultaneously.

8. Smart Home Device Scheduling

Residential and commercial buildings are progressively being automated through the introduction of smart devices (e.g., smart thermostats, circulator heating, washing machines). In addition, a variety of smart plugs, which allow users to control devices by remotely switching them on and off intelligently, are now commercially available. Device scheduling can, therefore, be executed by users, without the control of a centralized authority. This schema can be used for demand-side management in a DR program. However, uncoordinated scheduling may be detrimental to DR performance, failing to reduce peak load demands [35]. For an effective DR, a coordinated device scheduling within a neighborhood of buildings is necessary. Privacy concerns arise when users share resources or cooperate to find suitable schedules. This section surveys the Smart Home Device Scheduling (SHDS) problem, which formalizes the problem of coordinating smart device schedules across multiple smart homes as a multi-agent system, presented initially in [15].

8.1. The Problem

A Smart Home Device Scheduling (SHDS) problem is defined by the tuple ⟨H, Z, L, PH, PZ, H, θ⟩, where:

– H = {h1, h2, ...} is a neighborhood of smart homes, capable of communicating with one another.
– Z = ⋃hi∈H Zi is a set of smart devices, where Zi is the set of devices in the smart home hi (e.g., vacuum cleaning robot, smart thermostat).
– L = ⋃hi∈H Li is a set of locations, where Li is the set of locations in the smart home hi (e.g., living room, kitchen).
– PH is the set of state properties of the smart homes (e.g., cleanliness, temperature).
– PZ is the set of device state properties (e.g., battery charge for a vacuum cleaning robot).
– H is the planning horizon of the problem, and T = {1, ..., H} denotes the set of time points.
– θ : T → R represents the real-time pricing schema adopted by the energy utility company, which expresses the cost per kWh of energy consumed by consumers.

Finally, Ωp is used to denote the set of all possible states for state property p ∈ PH ∪ PZ (e.g., all the different levels of cleanliness for the cleanliness property). Figure 8 shows an illustration of a neighborhood of smart homes, with each home controlling a set of smart devices.

Fig. 8. Illustration of a Neighborhood of Smart Homes.

8.1.1. Smart Devices

For each home hi ∈ H, the set of smart devices Zi is partitioned into a set of actuators Ai and a set of sensors Si. Actuators can affect the states of the home (e.g., heaters and ovens can affect the temperature in the home) and possibly their own states (e.g., vacuum cleaning robots drain their battery power when running). On the other hand, sensors monitor the states of the home.

Each device z ∈ Zi of a home hi ∈ H is defined by a tuple ⟨ℓz, Az, γzH, γzZ⟩, where ℓz ∈ Li denotes the relevant location in the home where it can act or sense, Az is the set of actions that it can perform, and γzH : Az → ℘(PH) and γzZ : Az → ℘(PZ) map the actions of the device to the relevant state properties of the home and to those of the device, respectively.
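A device tuple ⟨ℓz, Az, γzH, γzZ⟩ can be encoded directly. The sketch below uses a layout of our own (the SHDS formalization [15] prescribes no data structure) and instantiates it for the living-room vacuum cleaning robot that serves as the running example of this section.

```python
# One possible encoding of the device tuple <l_z, A_z, gamma_H, gamma_Z>.
from dataclasses import dataclass

@dataclass(frozen=True)
class Device:
    location: str        # l_z: where the device can act or sense
    actions: frozenset   # A_z: the actions the device can perform
    gamma_h: dict        # action -> home state properties it affects
    gamma_z: dict        # action -> device state properties it affects

# The vacuum cleaning robot: running affects the room's cleanliness and
# drains the battery; charging affects only the battery.
robot = Device(
    location="living_room",
    actions=frozenset({"run", "charge", "stop"}),
    gamma_h={"run": {"cleanliness"}, "charge": set(), "stop": set()},
    gamma_z={"run": {"charge"}, "charge": {"charge"}, "stop": set()},
)
```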

S R R C C R R C R Device Schedule 8.1.3. Scheduling Rules

0 15 30 30 30 45 60 60 75 Cleanliness (%) The SHDS introduces two types of scheduling rules:

65 40 15 35 55 30 5 25 0 Battery Charge (%) Active Scheduling Rules (ASRs) that define user- defined objectives on a desired state of the home Goal 75 75 (e.g., the living room is cleaned by 1800 hrs). Battery Charge (%) Battery Charge 60 60 Passive Scheduling Rules (PSRs) that define implicit 45 45 constraints on devices that must hold at all times 30 30 (e.g., the battery charge on a vacuum cleaning robot Cleanliness (%) 15 15 is always between 0% and 100%). 0 0 1400 1500 1600 1700 1800 Example 3 The following ASR defines a goal state Time Deadline where the living room floor is at least 75% clean Fig. 9. Smart Home Device Scheduling Example (i.e., at least 75% of the floor is cleaned by a vacuum cleaning robot) by 1800 hrs:

living_room cleanliness ≥ 75 before 1800,

and the following PSRs state that the battery charge of the vacuum cleaning robot zv needs to be between 0% and 100% of its full charge at all times:

zv charge ≥ 0 always  ∧  zv charge ≤ 100 always.

We denote with R_p^[ta→tb] a scheduling rule over a state property p ∈ P_H ∪ P_Z and time interval [ta, tb]. Each scheduling rule indicates a goal state at a home location or on a device ℓ_{R_p} ∈ L_i ∪ Z_i of a particular state property p that must hold over the time interval [ta, tb] ⊆ T. The scheduling rule's goal state is either the desired state of a home if it is an ASR (e.g., the cleanliness level of the room floor) or a required state of a device or a home if it is a PSR (e.g., the battery charge of the vacuum cleaning robot).

Each rule is associated with a set of actuators Φ_p ⊆ A_i that can be used to reach the goal state. For instance, in Example 3, Φ_p corresponds to the vacuum cleaning robot zv, which can operate on the living room floor. Additionally, a rule is associated with a sensor s_p ∈ S_i capable of sensing the state property p. Finally, in a PSR, the device can also sense its own internal states. The ASR of Example 3 is illustrated in Figure 9 by dotted red lines on the graph. The PSRs are not shown as they must hold for all time steps.

Example 2. Consider a vacuum cleaning robot zv with location ℓ_zv = living_room. The set of possible actions is A_zv = {run, charge, stop}, and the mappings are:

γ^H_zv : run → {cleanliness};  charge → ∅;  stop → ∅,
γ^Z_zv : run → {charge};  charge → {charge};  stop → ∅,

where ∅ represents a null state property.

8.1.2. Device Schedules

To control the energy profile of a smart home, we need to describe the behavior of the smart devices acting in the smart home during the time horizon. We formalize this concept with the notion of device schedules. We use ξ_z^t ∈ A_z to denote the action of device z at time step t, and ξ_X^t = {ξ_z^t | z ∈ X} to denote the set of actions of the devices in X ⊆ Z at time step t. A schedule ξ_X^[ta→tb] = ⟨ξ_X^ta, ..., ξ_X^tb⟩ is a sequence of actions for the devices in X ⊆ Z within the time interval from ta to tb.

Consider the illustration of Figure 9. The top row shows a possible schedule ⟨R, R, C, C, R, R, C, R⟩ for a vacuum cleaning robot starting at time 1400 hrs, where each time step is 30 minutes. The robot's actions at each time step are shown in the colored boxes with letters in them: red with 'S' for stop, green with 'R' for run, and blue with 'C' for charge.

At a high level, the goal of the SHDS problem is to find a schedule for each of the devices in every smart home that achieves some user-defined objectives (e.g., the home is at a particular temperature within a time window, the home is at a particular cleanliness level by some deadline), which may be personalized for each home. We refer to these objectives as scheduling rules.

8.1.4. Feasibility of Schedules

To ensure that a goal state can be achieved within the desired time window, the system uses a predictive model of the various state properties. This predictive model captures the evolution of a state property over time and how this state property is affected by a given joint action of the relevant actuators.
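To make the scheduling-rule semantics concrete, the following Python sketch (with hypothetical names; not the implementation used in the paper) encodes a rule as a window of feasible states over the time horizon and checks a predicted state trajectory against it, as in the ASR of Example 3:

```python
# Hypothetical encoding of scheduling rules (illustrative only).
# A rule constrains a state property over a time window: the ASR of
# Example 3 requires cleanliness >= 75 at the 1800 hrs deadline, while a
# PSR keeps the battery charge within [0, 100] at every time step.
from dataclasses import dataclass

@dataclass
class SchedulingRule:
    prop: str        # state property p
    low: float       # feasible states R_p[t]: low <= state <= high
    high: float
    t_start: int     # time window [ta, tb], as time-step indices
    t_end: int

    def satisfied_by(self, trajectory):
        """Check the predicted states within the rule's time window."""
        return all(self.low <= trajectory[t] <= self.high
                   for t in range(self.t_start, self.t_end + 1))

# ASR of Example 3: at the last step (the 1800 hrs deadline), cleanliness >= 75.
asr = SchedulingRule("cleanliness", 75, 100, 7, 7)
# PSR: battery charge always within [0, 100] over the 8-step horizon.
psr = SchedulingRule("charge", 0, 100, 0, 7)

cleanliness = [0, 25, 25, 25, 50, 75, 75, 75]  # an illustrative trajectory
print(asr.satisfied_by(cleanliness))            # True
```

Note that the ASR only constrains the deadline step, so any trajectory reaching 75% cleanliness by then satisfies it, while the PSR constrains every step of the horizon.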

A predictive model Γ_p for a state property p is a function Γ_p : Ω_p × ∏_{z∈Φ_p} (A_z ∪ {⊥}) → Ω_p ∪ {⊥}, where ⊥ denotes an infeasible state and Γ_p(⊥, ·) = ⊥. In other words, the model describes the transition of state property p from state ω_p ∈ Ω_p at time step t to time step t+1 when it is affected by a set of actuators Φ_p running joint actions ξ^t_{Φ_p}:

Γ^{t+1}_p(ω_p, ξ^t_{Φ_p}) = ω_p + ∆_p(ω_p, ξ^t_{Φ_p}),

where ∆_p(ω_p, ξ^t_{Φ_p}) is a function describing the effect of the actuators' joint action ξ^t_{Φ_p} on state property p. We assume here, without loss of generality, that the states of properties are numeric; when this is not the case, a mapping of the possible states to a numeric representation can be easily defined.

Example 4. Consider the charge state property of the vacuum cleaning robot zv. Assume it has 65% charge at time step t and its action is ξ^t_zv at that time step. Thus:

Γ^{t+1}_charge(65, ξ^t_zv) = 65 + ∆_charge(65, ξ^t_zv)

∆_charge(ω, ξ^t_zv) =
  min(20, 100 − ω)   if ξ^t_zv = charge ∧ ω < 100
  −25                if ξ^t_zv = run ∧ ω ≥ 25
  0                  if ξ^t_zv = stop
  ⊥                  otherwise

In other words, at each time step, the charge of the battery will increase by 20% if it is charging (until it is fully charged), decrease by 25% if it is running (until it has less than 25% charge), and not change if it is stopped.

The predictive model of the example above models a device state property. Notice that a recursive invocation of a predictive model allows us to predict the trajectory of a state property p for future time steps, given a schedule of actions of the relevant actuators Φ_p. Given a state property p, its current state ω_p at time step ta, and a schedule ξ^[ta→tb]_{Φ_p} of the relevant actuators Φ_p, the predicted state trajectory π_p(ω_p, ξ^[ta→tb]_{Φ_p}) of that state property is defined as:

π_p(ω_p, ξ^[ta→tb]_{Φ_p}) = Γ^{tb}_p(Γ^{tb−1}_p(... Γ^{ta+1}_p(ω_p, ξ^{ta}_{Φ_p}) ..., ξ^{tb−1}_{Φ_p}), ξ^{tb}_{Φ_p}).

Consider the device scheduling example in Figure 9. The predicted state trajectories of the charge and cleanliness state properties are shown in the second and third rows. These trajectories are predicted given the schedule of actions that the vacuum cleaning robot will take, shown in the first row of the figure. The predicted trajectories of these state properties are also illustrated in the graph, where the dark gray line shows the states for the robot's battery charge and the black line shows the states for the cleanliness of the room.

Notice that each active and passive scheduling rule defines a set of feasible state trajectories. For example, the active scheduling rule of Example 3 allows all possible state trajectories as long as the state at time step 1800 is no smaller than 75. We use R_p[t] ⊆ Ω_p to denote the set of states that are feasible according to rule R_p of state property p at time step t. Thus, in order to verify whether a schedule satisfies a scheduling rule, it is sufficient to check that the predicted state trajectories are within the set of feasible state trajectories of that rule.

More formally, a schedule ξ^[ta→tb]_{Φ_p} satisfies a scheduling rule R^[ta→tb]_p (written ξ^[ta→tb]_{Φ_p} ⊨ R^[ta→tb]_p) iff:

∀t ∈ [ta, tb] : π_p(ω^{ta}_p, ξ^[ta→t]_{Φ_p}) ∈ R_p[t],

where ω^{ta}_p is the state of state property p at time step ta. A schedule is feasible if it satisfies all the passive and active scheduling rules of each home in the SHDS problem.

8.1.5. Cost of Schedules

In addition to finding feasible schedules, the goal in the SHDS problem is to optimize for the aggregated total cost of the energy consumed. Each action ξ ∈ A_z of device z ∈ Z_i in home h_i ∈ H has an associated energy consumption ρ_z : A_z → R+, expressed in kWh. The aggregated energy across all devices consumed by h_i at time step t under trajectory ξ^[1→H]_{Z_i} is:

E^t_i(ξ^[1→H]_{Z_i}) = ℓ^t_i + Σ_{z∈Z_i} ρ_z(ξ^t_z),

where ℓ^t_i is the home background load produced at time t, which includes all non-schedulable devices (e.g., TV, refrigerator) and sensor devices (which are always active), and ξ^t_z is the action of device z at time t in the schedule ξ^[1→H]_{Z_i}. The cost c_i(ξ^[1→H]_{Z_i}) associated with schedule ξ^[1→H]_{Z_i} in home h_i is:

c_i(ξ^[1→H]_{Z_i}) = Σ_{t∈T} E^t_i(ξ^[1→H]_{Z_i}) · θ(t)    (1)

where θ(t) is the energy price per kWh at time t.

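The energy and cost aggregation of Equation (1) also admits a direct sketch (illustrative, with hypothetical names; the tariff, loads, and consumption values are made up):

```python
# Sketch of Equation (1): the energy E_i^t at each time step is the home
# background load plus the consumption of each scheduled device action,
# and the schedule cost c_i weighs each step's energy by the tariff theta(t).

def energy(t, background_load, device_actions, rho):
    """E_i^t: background load plus energy of the devices' actions at step t."""
    return background_load[t] + sum(rho[a] for a in device_actions[t])

def schedule_cost(horizon, background_load, device_actions, rho, theta):
    """c_i of Equation (1): per-step energy weighted by the price theta(t)."""
    return sum(energy(t, background_load, device_actions, rho) * theta(t)
               for t in range(horizon))

rho = {"run": 0.5, "charge": 0.2, "stop": 0.0}  # kWh per action (made up)
background = [1.0, 1.2]                          # background load per step, kWh
actions = [["run"], ["charge"]]                  # one device, two time steps
price = lambda t: 0.198                          # flat tariff ($/kWh)
print(round(schedule_cost(2, background, actions, rho, price), 4))
```

In the paper's setting, θ(t) is not flat but follows the tiered real-time pricing scheme described in the empirical evaluation.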
8.1.6. Optimization Objective

The objective of an SHDS problem is that of minimizing the following weighted bi-objective function:

min_{ξ^[1→H]_{Z_i}}  α_c · C^sum + α_e · E^peak    (2)

s.t.:  ξ^[ta→tb]_{Φ_p} ⊨ R^[ta→tb]_p    (∀h_i ∈ H, R^[ta→tb]_p ∈ R_i)    (3)

where C^sum = Σ_{h_i∈H} c_i(ξ^[1→H]_{Z_i}) is the aggregated monetary cost across all homes h_i; E^peak = Σ_{t∈T} Σ_{H_j∈𝓗} (Σ_{h_i∈H_j} E^t_i(ξ^[1→H]_{Z_i}))² is a quadratic penalty function on the aggregated energy consumption across all homes h_i; and α_c, α_e ∈ R are weights. Additionally, 𝓗 is a partition of H defining a set of home coalitions H_j, within each of which homes may share their energy profiles with one another so as to optimize the energy peaks. These coalitions can be exploited by a distributed algorithm to (1) parallelize computations between multiple groups and (2) avoid data exposure over long distances or sensitive areas. Finally, constraint (3) defines the valid trajectories for each scheduling rule r ∈ R_i, where R_i is the set of all scheduling rules of home h_i.

8.2. DCOP Representation

A DCOP mapping of the SHDS problem is as follows:

AGENTS: Each agent a_i ∈ A in the DCOP is mapped to a home h_i ∈ H.

VARIABLES and DOMAINS: Each agent a_i controls: For each actuator z ∈ A_i and time t ∈ T, a variable x^t_{i,z} whose domain is the set of actions in A_z. The sensors in S_i are considered always active, and thus not directly controlled by the agent. Additionally, an auxiliary interface variable x̂^t_i whose domain is [0, Σ_{z∈Z_i} ρ_z(argmax_{a∈A_z} ρ_z(a))] and which represents the aggregated energy consumed by all the devices in the home at each time step t.

CONSTRAINTS: There are three types of constraints:

Local soft constraints (i.e., constraints that involve only variables controlled by one agent) whose costs correspond to the weighted summation of monetary costs, as defined in Equation (1).

Local hard constraints that enforce Constraint (3). Feasible schedules incur a cost of 0, while infeasible schedules incur a cost of ∞.

Global soft constraints (i.e., constraints that involve variables controlled by different agents) whose costs correspond to the peak energy consumption, as defined in the second term of Equation (2).

The neighbors N_{a_i} of agent a_i are defined as all the agents in the coalition that contains h_i.

8.3. MVA-based MGM

The SHDS problem requires several homes to solve a complex scheduling subproblem, which involves multiple variables, and to optimize the resulting energy profiles among the neighborhood of homes. The problem can be thought of as each agent solving a local complex subproblem, whose optimization is influenced by the energy profiles of the agent's neighbors, and by a coordination algorithm that allows agents to exchange their newly computed energy profiles. We thus adopt the multiple-variable agent (MVA) decomposition [14] to delegate the resolution of the agent's local problem to a centralized solver, while managing inter-agent coordination through a message-passing procedure similar to that operated by MGM, described in Section 2.3. The resulting algorithm is called Smart Home MGM (SH-MGM). SH-MGM is a distributed algorithm that operates in synchronous cycles. The algorithm first finds a feasible DCOP solution and then iteratively improves it, at each cycle, until convergence or timeout. We leave the details out and refer the interested reader to [15].

8.4. Empirical Evaluation

Our empirical evaluation compares the SH-MGM algorithm against an uncoordinated greedy approach, where each agent computes a feasible schedule for its home devices without optimizing over its cost and the aggregated peak energy incurred.

Each agent controls nine smart actuators to schedule (a Kenmore oven and dishwasher, a GE clothes washer and dryer, an iRobot vacuum cleaner, a Tesla electric vehicle, an LG air conditioner, a Bryant heat pump, and an American water heater) and five sensors. We selected these devices as they can be available in a typical (smart) home and they have published statistics (e.g., energy consumption profiles). The algorithms take as input a list of smart devices to schedule, as well as their associated scheduling rules and the real-time pricing scheme adopted. Each device has an associated active scheduling rule that is randomly generated for each agent and a number of passive rules that must always hold. To find local schedules at each SH-MGM iteration, each agent uses a Constraint Programming solver (we adopt the JaCoP solver, http://www.jacop.eu/) as a subroutine. An extended description of the smart device properties, the structural parameters (i.e., size, material, heat loss) of the homes, and the predictive models for the homes' and devices' state properties is reported in [19]. Finally, we set H = 12 and adopted a pricing scheme used by the Pacific Gas & Electric Co. for its customers in parts of California (https://goo.gl/vOeNqj/), which accounts for seven tiers ranging from $0.198 per kWh to $0.849 per kWh.

In our first experiment, we implemented the algorithm on an actual distributed system of Raspberry Pis. A Raspberry Pi (called "PI" for short) is a bare-bone single-board computer with limited computation and storage capabilities. We used the Raspberry Pi 2 Model B, with a quad-core 900MHz CPU and 1GB of RAM. We implemented the SH-MGM algorithm using the Java Agent Development Environment (JADE) framework (http://jade.tilab.com/), which provides agent abstractions and peer-to-peer agent communication based on asynchronous message passing. Each PI implements the logic for one agent, and the agents' communication is supported through JADE, using a wired network connected to a router.

We set up our experiments with seven PIs, each controlling the nine smart actuators and five sensors described above. All agents belong to the same coalition. We set the weights α_c and α_e of the objective (see Equation (2)) to the equal value of 0.5. Figure 10 (left) illustrates the results of this experiment, where we imposed a 60-second timeout for the CP solver. As expected, the SH-MGM solution improves with an increasing number of cycles, providing an economic advantage for the users as well as a peak energy reduction, when compared to the uncoordinated schema. These results thus show the feasibility of using a local search-based schema, implemented on hardware with limited storage and processing power, to solve a complex problem.

In our second set of experiments, we generated synthetic microgrid instances by sampling neighborhoods in three cities in the United States (Des Moines, IA; Boston, MA; and San Francisco, CA) and estimating the density of houses in each city. The average density (in houses per square kilometer) is 718 in Des Moines, 1357 in Boston, and 3766 in San Francisco. For each city, we created a 200m×200m grid, where the distance between intersections is 20m, and randomly placed houses in this grid until the density is the same as the sampled density. We then divided the city into k (= |𝓗|) coalitions, where each home can communicate with all homes in its coalition. In generating our SHDS topologies, we verified that the resulting topology is connected. Finally, we ensured that there are no disjoint coalitions; this is analogous to the fact that microgrids are all connected to each other via the main power grid.

In these experiments, we imposed a 10-second timeout for the agents' CP solver. We first evaluate the effects of the weights α_c and α_e (see Equation (2)) on the quality of the solutions found by the algorithms. We evaluate three configurations: (α_c = 1.0, α_e = 0.0), (α_c = 0.5, α_e = 0.5), and (α_c = 0.0, α_e = 1.0), using randomly generated microgrids with the density of Des Moines, IA. Figure 10 (middle) shows the total energy consumption per hour (in kWh) of the day under each configuration. Figure 10 (right) illustrates the average daily cost paid by one household under the different objective weights. The uncoordinated greedy approach achieves the worst performance, with the highest energy peak (at 2852 kWh) and the most expensive daily cost ($3.84). The configuration that disregards the local cost reduction (α_c = 0) reports the best energy peak reduction (peak at 461 kWh) but the highest daily cost among the coordinated configurations ($2.31). On the other extreme, the configuration that disregards the global peak reduction (α_e = 0) reports the worst energy peak reduction (peak at 1738 kWh) but the lowest daily cost among the coordinated configurations ($1.44). Finally, the configuration with α_c = α_e = 0.5 reports intermediate results, with a highest peak at 539 kWh and a daily cost of $2.18. For a more extensive analysis of the results, we refer the reader to [15].

9. Conclusions

Responding to the recent call to action for AI researchers to contribute towards the smart grid vision [29], this paper presented two important applications of Distributed Constraint Optimization Problems (DCOPs) for demand response (DR) programs in smart grids. The multi-agent Economic Dispatch with Demand Response (EDDR) provides an integrated approach to the economic dispatch and the DR model for
power systems. The Smart Home Device Scheduling (SHDS) problem formalizes the device scheduling and coordination problem across multiple smart homes to reduce the energy peaks.

[Fig. 10. Physical experimental results with PIs (left: solution cost vs. number of cycles); synthetic experimental results varying α_c and α_e (middle: total energy consumption in kWh over time; right: average cost in $/day per configuration).]

Due to the complexity of these problems, the adoption of off-the-shelf DCOP resolution algorithms is infeasible. Thus, the paper introduces a methodology to exploit a structural problem decomposition (called MVA) for DCOPs, which enables agents to solve complex local subproblems efficiently. It further discusses a general GPU parallelization schema for inference-based DCOP algorithms, which produces speedups of up to two orders of magnitude.

In the context of EDDR applications, evaluations on a state-of-the-art power systems simulator show that the solutions found are stable within acceptable frequency deviations. In the context of home automation, the proposed algorithm was deployed on a distributed system of Raspberry Pis, each capable of controlling and scheduling smart devices through hardware interfaces. The experimental results show that this algorithm outperforms a simple uncoordinated solution on realistic small-scale experiments as well as large-scale synthetic experiments.

Therefore, this work continues to pave the bridge between the AI and power systems communities, highlighting the strengths and applicability of AI techniques in solving power system problems.

Acknowledgments

We would like to thank all the colleagues that have worked with us in this challenging research area, namely, Federico Campeotto, Alessandro Dal Palù, Rina Dechter, Khoi D. Hoang, Ping Hou, William Kluegel, Muhammad Aamir Iqbal, Tiep Le, Tran Cao Son, Athena M. Tabakhi, Makoto Yokoo, Roie Zivan, and, in particular, William Yeoh.

A. Dovier is partially supported by INdAM GNCS 2014–2017 grants and by PRID ENCASE. E. Pontelli has been supported by NSF grants CNS-1440911, CBET-14016539, and HRD-1345232.

References

[1] Baruch Awerbuch. A new distributed depth-first-search algorithm. Information Processing Letters, 20(3):147–150, 1985.
[2] Emma Bowring, Milind Tambe, and Makoto Yokoo. Multiply-constrained distributed constraint optimization. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 1413–1420, 2006.
[3] David Burke and Kenneth Brown. Efficiently handling complex local problems in distributed constraint optimisation. In Proceedings of the European Conference on Artificial Intelligence (ECAI), pages 701–702, 2006.
[4] Federico Campeotto, Agostino Dovier, Ferdinando Fioretto, and Enrico Pontelli. A GPU implementation of large neighborhood search for solving constraint optimization problems. In Proceedings of the European Conference on Artificial Intelligence (ECAI), pages 189–194, 2014.
[5] Federico Campeotto, Alessandro Dal Palù, Agostino Dovier, Ferdinando Fioretto, and Enrico Pontelli. Exploring the use of GPUs in constraint solving. In Proceedings of the Practical Aspects of Declarative Languages (PADL), pages 152–167, 2014.
[6] Imen Chakroun, Mohand-Said Mezmaz, Nouredine Melab, and Ahcene Bendjoudi. Reducing thread divergence in a GPU-accelerated branch-and-bound algorithm. Concurrency and Computation: Practice and Experience, 25(8):1121–1136, 2013.
[7] Alessandro Dal Palù, Agostino Dovier, Andrea Formisano, and Enrico Pontelli. CUD@SAT: SAT solving on GPUs. Journal of Experimental & Theoretical Artificial Intelligence, 27(3):293–316, 2015.
[8] Gregory Frederick Diamos, Benjamin Ashbaugh, Subramaniam Maiyuran, Andrew Kerr, Haicheng Wu, and Sudhakar Yalamanchili. SIMD re-convergence at thread frontiers. In Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture, pages 477–488, 2011.
[9] Alessandro Farinelli, Alex Rogers, Adrian Petcu, and Nicholas Jennings. Decentralised coordination of low-power embedded devices using the Max-Sum algorithm. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 639–646, 2008.
[10] Ferdinando Fioretto, Federico Campeotto, Luca Da Rin Fioretto, William Yeoh, and Enrico Pontelli. GD-GIBBS: a GPU-based sampling algorithm for solving distributed constraint optimization problems. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 1339–1340, 2014.
[11] Ferdinando Fioretto, Enrico Pontelli, and William Yeoh. Distributed constraint optimization problems and applications: A survey. CoRR, abs/1602.06347, 2016.
[12] Ferdinando Fioretto, Enrico Pontelli, and William Yeoh. Distributed constraint optimization problems and applications: A survey. Journal of Artificial Intelligence Research (JAIR), (to appear), 2018.
[13] Ferdinando Fioretto, Enrico Pontelli, William Yeoh, and Rina Dechter. Accelerating exact and approximate inference for (distributed) discrete optimization with GPUs. Constraints, 23(1):1–43, 2018.
[14] Ferdinando Fioretto, William Yeoh, and Enrico Pontelli. Multi-variable agents decomposition for DCOPs. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 2480–2486, 2016.
[15] Ferdinando Fioretto, William Yeoh, and Enrico Pontelli. A multiagent system approach to scheduling devices in smart homes. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 981–989, 2017.
[16] Ferdinando Fioretto, William Yeoh, Enrico Pontelli, Ye Ma, and Satishkumar Ranade. A DCOP approach to the economic dispatch with demand response. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2017.
[17] Rachel Greenstadt, Jonathan Pearce, and Milind Tambe. Analysis of privacy loss in DCOP algorithms. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 647–653, 2006.
[18] Tianyi David Han and Tarek S. Abdelrahman. Reducing branch divergence in GPU programs. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, pages 3:1–3:8, 2011.
[19] William Kluegel, Muhammad A. Iqbal, Ferdinando Fioretto, William Yeoh, and Enrico Pontelli. A realistic dataset for the smart home device scheduling problem for DCOPs. In Autonomous Agents and Multiagent Systems: AAMAS 2017 Workshops, Visionary Papers, Revised Selected Papers, pages 125–142. Springer, 2017.
[20] Thomas Léauté and Boi Faltings. Distributed constraint optimization under stochastic uncertainty. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 68–73, 2011.
[21] T. Logenthiran, D. Srinivasan, and T. Shun. Demand side management in smart grid using heuristic optimization. IEEE Transactions on Smart Grid, 3(3):1244–1252, 2012.
[22] Ye Ma, Wei Zhang, Wenxin Liu, and Qinmin Yang. Fully distributed social welfare optimization with line flow constraint consideration. IEEE Transactions on Industrial Informatics, 11(6):1532–1541, 2015.
[23] Rajiv T. Maheswaran, Jonathan P. Pearce, and Milind Tambe. Distributed algorithms for DCOP: A graphical-game-based approach. In Proceedings of the ISCA 17th International Conference on Parallel and Distributed Computing Systems, pages 432–439, 2004.
[24] Sam Miller, Sarvapali D. Ramchurn, and Alex Rogers. Optimal decentralised dispatch of embedded generation in the smart grid. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 281–288, 2012.
[25] Pragnesh Modi, Wei-Min Shen, Milind Tambe, and Makoto Yokoo. ADOPT: Asynchronous distributed constraint optimization with quality guarantees. Artificial Intelligence, 161(1–2):149–180, 2005.
[26] Arnon Netzer, Alon Grubshtein, and Amnon Meisels. Concurrent forward bounding for distributed constraint optimization problems. Artificial Intelligence, 193:186–216, 2012.
[27] Adrian Petcu and Boi Faltings. A scalable method for multiagent constraint optimization. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 1413–1420, 2005.
[28] Adrian Petcu and Boi Faltings. Superstabilizing, fault-containing distributed combinatorial optimization. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 449–454, 2005.
[29] Sarvapali D. Ramchurn, Perukrishnen Vytelingum, Alex Rogers, and Nicholas R. Jennings. Putting the 'smarts' into the smart grid: A grand challenge for artificial intelligence. Communications of the ACM, 55(4):86–97, 2012.
[30] Francesca Rossi, Peter van Beek, and Toby Walsh, editors. Handbook of Constraint Programming. Elsevier, 2006.
[31] Pierre Rust, Gauthier Picard, and Fano Ramparany. Using message-passing DCOP algorithms to solve energy-efficient smart environment configuration problems. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 468–474, 2016.
[32] Jason Sanders and Edward Kandrot. CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley, 2010.
[33] Evan Sultanik, Pragnesh Jay Modi, and William C. Regli. On modeling multiagent task scheduling as a distributed constraint optimization problem. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 1531–1536, 2007.
[34] D. I. Sun, B. Ashley, B. Brewer, A. Hughes, and W. F. Tinney. Optimal power flow by Newton approach. IEEE Transactions on Power Apparatus and Systems, PAS-103(10):2864–2880, 1984.
[35] Menkes Van Den Briel, Paul Scott, Sylvie Thiébaux, et al. Randomized load control: A simple distributed approach for scheduling smart appliances. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2013.
[36] Meritxell Vinyals, Juan A. Rodríguez-Aguilar, and Jesús Cerquides. Constructing a unifying theory of dynamic programming DCOP algorithms via the generalized distributive law. Autonomous Agents and Multi-Agent Systems, 22(3):439–464, 2011.
[37] T. Voice, P. Vytelingum, S. Ramchurn, A. Rogers, and N. Jennings. Decentralised control of micro-storage in the smart grid. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 1421–1427, 2011.
[38] S. Wang, S. Shahidehpour, D. Kirschen, S. Mokhtari, and G. Irisarri. Short-term generation scheduling with transmission and environmental constraints using an augmented Lagrangian relaxation. IEEE Transactions on Power Systems, 10(3):1294–1301, 1995.
[39] William Yeoh, Ariel Felner, and Sven Koenig. BnB-ADOPT: An asynchronous branch-and-bound DCOP algorithm. Journal of Artificial Intelligence Research, 38:85–133, 2010.
[40] William Yeoh and Makoto Yokoo. Distributed problem solving. AI Magazine, 33(3):53–65, 2012.
[41] Makoto Yokoo, editor. Distributed Constraint Satisfaction: Foundations of Cooperation in Multi-agent Systems. Springer, 2001.
[42] Roie Zivan, Harel Yedidsion, Steven Okamoto, Robin Glinton, and Katia Sycara. Distributed constraint optimization for teams of mobile sensing agents. Journal of Autonomous Agents and Multi-Agent Systems, 29(3):495–536, 2015.