Permuted Orthogonal Block-Diagonal Transformation Matrices for Large Scale Optimization Benchmarking Ouassim Ait Elhara, Anne Auger, Nikolaus Hansen
To cite this version:
Ouassim Ait Elhara, Anne Auger, Nikolaus Hansen. Permuted Orthogonal Block-Diagonal Transformation Matrices for Large Scale Optimization Benchmarking. GECCO 2016, Jul 2016, Denver, United States. pp. 189-196, 10.1145/2908812.2908937. hal-01308566v3
HAL Id: hal-01308566 https://hal.inria.fr/hal-01308566v3 Submitted on 11 May 2016
Ouassim Ait ElHara (LRI, Univ. Paris-Sud, Université Paris-Saclay; TAO Team, Inria), Anne Auger (Inria; LRI, Univ. Paris-Sud, Université Paris-Saclay), Nikolaus Hansen (Inria; LRI, Univ. Paris-Sud, Université Paris-Saclay). E-mail: firstname.ait [email protected], [email protected], [email protected]
ABSTRACT

We propose a general methodology to construct large-scale testbeds for the benchmarking of continuous optimization algorithms. Our approach applies an orthogonal transformation on raw functions that involves only a linear number of operations in order to obtain large scale optimization benchmark problems. The orthogonal transformation is sampled from a parametrized family of transformations that are the product of a permutation matrix times a block-diagonal matrix times a permutation matrix. We investigate the impact of the different parameters of the transformation on its shape and on the difficulty of the problems for separable CMA-ES. We illustrate the use of the above defined transformation in the BBOB-2009 testbed as a replacement for the expensive orthogonal (rotation) matrices. We also show the practicability of the approach by studying the computational cost and its applicability in a large scale setting.

Keywords

Large Scale Optimization; Continuous Optimization; Benchmarking.

1. INTRODUCTION

The general context of this paper is numerical optimization in a so-called black-box scenario. More precisely, one is interested in estimating the optimum of a function

    x_opt = arg min f(x) ,  (1)

with x ∈ S ⊆ R^d and d ≥ 1 the dimension of the problem. The black-box scenario means that the only information on f that an algorithm can use is the objective function value f(x) at a given queried point x. There are two conflicting objectives: (i) estimating x_opt as precisely as possible with (ii) the least possible cost, that is, the least number of function evaluations.

Benchmarking is a compulsory task to evaluate the performance of optimization algorithms designed to solve (1). In the context of optimization, it consists in running an algorithm on a set of test functions and extracting different performance measures from the generated data. To be meaningful, the functions on which the algorithms are tested should relate to real-world problems. The COCO platform is a benchmarking platform for comparing continuous optimizers [3]. In COCO, the philosophy is to provide meaningful and quantitative performance measures. The test functions are comprehensible, modeling well-identified difficulties encountered in real-world problems such as ill-conditioning, multi-modality, non-separability, skewness and the lack of a structure that can be exploited.

In this paper, we are interested in designing a large-scale test suite for benchmarking large-scale optimization algorithms. For this purpose we build on the BBOB-2009 testbed of the COCO platform. We consider a continuous optimization problem to be of large scale when the number of variables is in the hundreds or thousands. Here, we start by extending the dimensions used in COCO (up to 40) and consider values of d up to at least 640.

There are already some attempts at large scale continuous optimization benchmarking. The CEC special sessions and competitions on large-scale global optimization have extended some of the standard functions to the large scale setting [9, 8]. In the CEC large scale testbed, four classes of functions are introduced depending on their level of separability: (a) separable functions; (b) partially separable functions with a single group of dependent variables; (c) partially separable functions with several independent groups, each group comprised of dependent variables; and (d) fully non-separable functions. In the partially separable functions, variables are divided into groups and the overall fitness is defined as the sum of the fitness of one or more functions applied to each of these groups independently. Dependencies are obtained by applying an orthogonal matrix to the variables of a group. By limiting the sizes of the groups, the orthogonal matrices remain of reasonable size, which allows their application in a large scale setting.

The approach we consider in this paper is to start with the BBOB-2009 testbed [6] from the COCO platform and replace the transformations that do not scale linearly with the problem dimension d with more efficient variants, while trying to preserve these functions' properties.

The paper is organized as follows: in Section 2 we motivate and introduce our particular transformation matrix for large scale benchmarking and identify its different parameters. Section 3 treats the effects of the different parameters on the transformed problem and its difficulty. We choose possible default values for the parameters in Section 4 and give some implementation details and complexity measures in Section 5, where we also illustrate how the transformation can be applied in COCO. We sum up in Section 6.

GECCO '16, July 20-24, 2016, Denver, CO, USA. © 2016 ACM. ISBN 978-1-4503-4206-3/16/07. DOI: http://dx.doi.org/10.1145/2908812.2908937

2. BBOB-2009 RATIONALE AND THE CASE FOR LARGE SCALE

Our building of large-scale functions within the COCO framework is based on the BBOB-2009 testbed [6]. We explain in this section how this testbed is built and how we intend to make it large-scale friendly.

2.1 The BBOB-2009 Testbed

The BBOB-2009 testbed relies on a number of raw functions from which 24 different problems are generated. Here, the notion of raw function designates functions in their basic form, applied to a non-transformed (canonical basis) search space. For example, we call the raw ellipsoid function the convex quadratic function with a diagonal Hessian matrix whose ith element equals 2 × 10^{6(i−1)/(d−1)}, that is, f_Elli(x) = Σ_{i=1}^{d} 10^{6(i−1)/(d−1)} x_i^2. On the other hand, the separable ellipsoid function used in BBOB-2009 (f2) sees the search space transformed by a translation and an oscillation T_osz (4); that is, it is called on z = T_osz(x − x_opt) instead of x, with x_opt ∈ R^d. Table 1 shows normalized and generalized examples of such widely used raw functions.

The 24 BBOB-2009 test functions are obtained by applying a series of transformations to such raw functions that serve two main purposes: (i) have non-trivial problems that cannot be solved by simply exploiting some of their properties (separability, optimum at a fixed position, ...) and (ii) allow to generate different instances, ideally of similar difficulty, of a same problem. The second item is made possible by the random nature of some of the transformations (a combination of the function identifier and the instance number is provided as seed to the random number generator). All functions have at least one random transformation, which allows this instantiation to work on all of them.

First, the optimal solution x_opt and its fitness f_opt are generated randomly. These allow the optimal solution and its fitness to be non-centered at zero:

    f(x) = f_raw(z) + f_opt ,  (2)

with z = x − x_opt and f_raw a raw function.

In addition to the translation by x_opt, the search space might be perturbed using one or both of the deterministic transformations T_asy and T_osz, defined coordinate-wise for i = 1 ... d as

    [T_asy^β(x)]_i = x_i^{1 + β (i−1)/(d−1) √(x_i)} if x_i > 0, and x_i otherwise ,  (3)

    [T_osz(x)]_i = sign(x_i) exp(x̂_i + 0.049 (sin(c1 x̂_i) + sin(c2 x̂_i))) ,  (4)

where [x]_i = x_i, x̂_i = log(|x_i|) if x_i ≠ 0 and 0 otherwise, and (c1, c2) = (10, 7.9) when x_i > 0, (c1, c2) = (5.5, 3.1) otherwise. T_osz applies a non-linear oscillation to the coordinates while T_asy introduces a non-symmetry between the positive and negative values.

The variables are also scaled using a diagonal matrix Λ^α whose ith diagonal element equals α^{(1/2)(i−1)/(d−1)}.

In order to introduce non-separability and coordinate system independence, another transformation consists in applying an orthogonal matrix to the search space: x ↦ z = Rx, with R ∈ R^{d×d} an orthogonal matrix.

If we take the Rastrigin function (f15) from [6] (and normalize it in our case), it combines all the transformations defined above:

    f15(x) = f_raw^Rastrigin(z) + f_opt ,  (5)

where z = R Λ^{10} Q T_asy^{0.2}(T_osz(R (x − x_opt))), f_raw^Rastrigin as defined in Table 1, and R and Q two d × d orthogonal matrices.

2.2 Extension to Large Scale

Complexity-wise, all the transformations above, bar the application of an orthogonal matrix, can be computed in linear time, and thus remain applicable in practice in a large scale setting. The problems in COCO are defined, in most cases, by combining the above defined transformations. Additional transformations are of linear cost and can be applied in a large scale setting without affecting computational complexity.

Our idea is to derive a computationally feasible large scale optimization test suite from the BBOB-2009 testbed [6], while preserving the main characteristics of the original functions. To achieve this goal, we replace the computationally expensive transformations identified above, namely full orthogonal matrices, with orthogonal transformations of linear computational complexity: permuted orthogonal block-diagonal matrices.

2.2.1 Objectives

The main properties we want the replacement transformation matrix to satisfy are to:

1. Have (almost) linear cost: both the amount of memory needed to store the matrix and the computational cost of applying the transformation matrix to a solution must scale, ideally, linearly with d, or at most in d log(d) or d^{1+ε} with ε ≪ 1.

2. Introduce non-separability: the desired scenario is to have a parameter or set of parameters that allows to control the difficulty and level of non-separability of the resulting problem in comparison to the original, non-transformed, problem.

3. Preserve, apart from separability, the properties of the raw function: as in the case when using a full orthogonal matrix, we want to preserve the condition number and eigenvalues of the original function when it is convex-quadratic.

2.2.2 Block-Diagonal Matrices

Sparse matrices, and more specifically band matrices, satisfy Property 1. However, a full band matrix (all elements in the band are non-zero) cannot be orthogonal, and thus does not satisfy Property 3, unless the band-width is 1 (see the appendix). This means that the matrix is diagonal, which contradicts Property 2.

A special case of band matrices that can be non-diagonal and still orthogonal are block-diagonal matrices. A block-diagonal matrix B is a matrix of the form

    B = [ B1  0  ···  0
          0  B2  ···  0
          0   0   ⋱   0
          0   0  ···  Bnb ] ,  (6)

where nb is the number of blocks and Bi, 1 ≤ i ≤ nb, are square matrices of sizes si × si satisfying si ≥ 1 and Σ_{i=1}^{nb} si = d.

Property 1 relies upon the number of non-zero entries of B, which is

    Σ_{i=1}^{d} Σ_{j=1}^{d} 1_{b(i,j)≠0} = Σ_{i=1}^{nb} si^2 ,  (7)

where 1_{b(i,j)≠0} is the indicator function that equals 1 when b(i,j) ≠ 0 and 0 otherwise, and b(i,j) is the ith row, jth column element of B. If we simplify and consider blocks of equal sizes s, bar possibly the last block, then nb = ⌈d/s⌉. We end up with at most d × s non-zero entries. Then, in order to satisfy Property 1 with a number of non-zero elements linear in d, s needs to be independent of d. In the case of different block-sizes, the same reasoning can be applied to the largest block-size instead of s.

For Property 2, as long as the matrices Bi in (6) are not diagonal, the functions that are originally separable become non-separable, since correlations between variables are introduced.

Now for Property 3, B is orthogonal if and only if all the matrices Bi are orthogonal. We want to have the matrices Bi uniformly distributed in the set of orthogonal matrices of the same size (the orthogonal group O(si)). We first generate square matrices with entries i.i.d. standard normally distributed. Then we apply the Gram-Schmidt process to orthogonalize these matrices. The resulting Bi, and thus B, are orthogonal and satisfy Property 3.

Orthogonal block-diagonal matrices are the raw transformation matrices for our large scale functions. Their parameters are:

• d, which defines the size of the matrix,

• {s1, ..., snb}, the block sizes, where nb is the number of blocks.

It is also possible to consider rank-deficient matrices with a limited number, d_eff, of non-zero columns. We end up with d × d_eff non-zero entries. Such matrices result in transformed problems with a low effective dimension. This approach was introduced, tested and discussed in another work.

2.2.3 Permutations

With the model above, we have seen that we do introduce non-separability. However, dependencies remain limited to the blocks of B, so variables from different blocks have no chance of being correlated (unless they already are in the raw function). Such a property seems rather artificial and could be exploited by algorithms.

To prevent this, two permutations are applied, one to the rows of B and the other to its columns. The purpose of these permutations is to hide the block-diagonal structure of B. The overall transformation matrix R becomes

    R = P_left B P_right ,  (8)

with P_left, P_right two permutation matrices, that is, matrices that have one and only one non-zero element (equal to 1) per row and per column. Property 3 remains satisfied since permutation matrices are orthogonal, and the product of orthogonal matrices is an orthogonal matrix.

In effect, the permutation P_left shuffles the variables in which the raw function is defined¹. Hence, regularities coming from the ordering of variables in this definition, e.g. monotonously increasing coefficients, are perturbed. As a result, the permutation determines which coefficients are used within the blocks defined by B and, consequently, the block condition numbers and the difficulty of the problem (see the appendix). The permutation P_right shuffles the variables which are accessed by the optimization algorithm, thereby hiding the block-structure of B. Most numerical optimization algorithms are invariant to the choice of P_right.

¹ Given a separable quadratic function f(x) = x^T D x, then f(P_left B P_right x) = (P_right x)^T B^T [P_left^T D P_left] B (P_right x), which illustrates that P_left permutes the coefficients of the diagonal matrix D ("variables in which the raw function is defined").

Generating the Random Permutations.

When applying the permutations, especially P_left, one wants to remain in control of the difficulty of the resulting problem. Ideally, the permutation should have a parameterization that easily allows to control the difficulty of the transformed problem.

We define our permutations as series of ns successive swaps. To have some control over the difficulty, we want each variable to travel, on average, a fixed distance from its starting position. For this to happen, we consider truncated uniform swaps.

In a truncated uniform swap, the second swap variable is chosen uniformly at random among the variables that are within a fixed range rs of the first swap variable. Let i be the index of the first variable to be swapped and j that of the second swap variable; then

    j ∼ U({lb(i), ..., ub(i)} \ {i}) ,  (9)

where U(S) is the uniform distribution over the set S, lb(i) = max(1, i − rs) and ub(i) = min(d, i + rs).

When rs ≤ (d − 1)/2, the average distance between the first and second swap variable ranges from (√2 − 1) rs + 1/2 to rs/2 + 1/2. It is maximal when the first swap variable is at least rs away from both extremes or is one of them.

Algorithm 1 describes the process of generating a permutation using a series of truncated uniform swaps. The parameters for generating these permutations are:

• d, the number of variables,

• ns, the number of swaps. Values proportional to d allow to make the next parameter the only free one,

• rs, the swap range, eventually the only free parameter. The swap range can be equivalently defined in the form rs = ⌊rr d⌋, with rr ∈ [0, 1]. Each variable moves on average about rr × 50% of the maximal distance d.

The indexes of the variables are taken in a random order thanks to the permutation π. This is done to avoid any bias with regards to which variables are selected as first swap variables when fewer than d swaps are applied.
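The construction described above (normally distributed blocks orthogonalized by Gram-Schmidt, assembled into B, then row- and column-permuted as in (8)) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the COCO implementation; the function names are ours, and the permutations are passed as index vectors rather than explicit permutation matrices:

```python
import numpy as np

def random_orthogonal_block(s, rng):
    """Sample an s x s orthogonal matrix: Gram-Schmidt
    orthogonalization of a matrix with i.i.d. standard
    normal entries."""
    a = rng.standard_normal((s, s))
    q = np.empty_like(a)
    for k in range(s):
        v = a[:, k].copy()
        for j in range(k):
            # subtract the projection onto the already-built basis
            v -= (q[:, j] @ a[:, k]) * q[:, j]
        q[:, k] = v / np.linalg.norm(v)
    return q

def block_diagonal(d, s, rng):
    """Orthogonal block-diagonal matrix B, cf. Eq. (6), with blocks
    of equal size s (a possibly smaller last block)."""
    B = np.zeros((d, d))
    for start in range(0, d, s):
        size = min(s, d - start)
        B[start:start + size, start:start + size] = \
            random_orthogonal_block(size, rng)
    return B

def permuted_block_diagonal(d, s, p_left, p_right, rng):
    """R = P_left B P_right, cf. Eq. (8): p_left permutes the rows
    of B and p_right its columns (index-vector form)."""
    B = block_diagonal(d, s, rng)
    return B[p_left][:, p_right]
```

Since permuting rows and columns neither changes the number of non-zero entries nor breaks orthogonality, the resulting R stays orthogonal with at most d × s non-zero entries, so storing and applying it costs O(d s) instead of O(d^2).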
These stationary d Output: a vector p ∈ N , defining a permutation. values are independent of the relative swap range rr, except 1: p ← (1, . . . , d) for relatively small dimensions in particular with d swaps. 2: generate a uniformly random permutation π This is due to the fact that the probability of swapping a variable back to its original position increases as the swap 3: for 1 ≤ k ≤ ns do range decreases. This probability is also affected by the 4: i ← π(k), xπ(k) is the first swap variable actual value of the swap range r = br dc: the larger the 5: lb ← max(1, i − rs) s r swap range, the smaller the probability of a variable being 6: ub ← min(d, i + rs) swapped back to its original position. 7: S ← {lb, . . . , ub} r {i} 8: Sample j uniformly in S With this, we know that for large enough dimensions, only few variables are left unmoved. We fix the number of swaps 9: swap pi and pj 10: end for to ns = d, leaving only rs as free parameter. 11: return p 3. DIFFICULTY SCALING WITH MATRIX 1.0 AND FUNCTION FEATURES Now that we have decided upon the transformation ma- rs =⌊d/1⌋ 0.8 trix that we want to apply to the raw functions, namely rs =⌊d/3⌋ permuted orthogonal block-diagonal matrix (8), and identi- fied its free parameters, (s ) and r , we are interested r =⌊d/10⌋ i 1≤i≤nb s 0.6 s in how these two parameters, in addition to the problem rs =⌊d/25⌋ dimension d, affect the transformed problem. 0.4 rs =⌊d/50⌋ 3.1 Impact of the Parameters on Structure of the Transformation Matrix 0.2 rs =⌊d/75⌋
touched vars ratio vars touched First, we are interested in the effect of (si)1≤i≤nb , rs and d rs =⌊d/100⌋ 0.0 on the shape of the sparse orthogonal matrix defined in (8). 20 40 80 160 320 640 1280 2560 5120 To this end, we investigate the dependency of the sum of the d distances to the diagonal of the non-zero elements (dToD) on Figure 1: Proportion of variables that are moved these parameters: from their original position when applying Algo- d d rithm1 versus problem dimension d. Each graph X X 1 dToD(R) = r(i,j)6=0|i − j| , (10) corresponds to a different swap range rs indicated i=1 j=1 in the legend. Solid lines: d swaps; dashed lines: d/2 where 1r 6=0 is the indicator function on the condition swaps. When rs = 0, no swap is performed. The (i,j) averages of 51 repetitions per data point are shown. r(i,j) 6= 0 and r(i,j) is element (i, j) of R. This helps us mea- Standard deviations (not shown) are at most 10% of sure how different from a diagonal matrix the transformed the mean. matrix is, which gives us an idea of the non-separability of the resulting problem. In Figure2, we show a normalized dToD of a matrix R = variables when less than d swaps are applied. We start with PleftB with blocks of equal sizes depending on the values of p initially the identity permutation. We apply the swaps d, s and rs. For a full square matrix of size d, the corre- sponding dToD is defined in (9) taking pπ(1), pπ(2), . . . , pπ(ns), successively, as first swap variable (that replace i in (9)). The resulting 1 d (full) = d(d − 1)(d + 1) . (11) vector p is returned as the desired permutation. ToD 3 Which means that for a block-diagonal matrix B with blocks Impact of the Number of Swaps and Variable Pool Type. of equal sizes s: Given the way the permutations are generated, some vari- d s 1 dToD(B) = (s − 1)(s + 1) + c(c − 1)(c + 1) , (12) ables may end up being swapped more than once. 
This leads s 3 3 to variables potentially moving further than rs away from where c = d − s × bd/sc (which accounts in case for the last their starting positions or (with low probability) even end- block of size c < s). ing up in their original position after being swapped back. In order to better understand the effect of each parameter, In Figure1, we look into the proportion of variables that we considered the following model for r ≥ 1/d: are affected by the swap strategy. That is variables that end r up in positions different from their initial ones. A different dˆ (α, d, s, r ) = α (d + α )α2 × (s + α )α4 × (r + α )α6 approach (that we call static while the current version is ToD r 0 1 3 r 5 s α8 dynamic) was tried where Line9 in Algorithm1 swaps, in- × ( + α7) , (13) −1 −1 −1 rr d stead, pi with pj , where p is the, dynamically updated, inverse permutation of p. This has shown no significantly where α = (α0, . . . , α8) are the free parameters of the fit to different results. be optimized. 3 10 blocks are relatively small. It is invariant to the coordinate s =2 permutation Pright which allows us to restrict our study to 2 s =16
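Algorithm 1 and the dToD measure (10) translate directly into Python; the following is a minimal sketch under our own naming, using 0-based indices instead of the paper's 1-based ones:

```python
import numpy as np

def truncated_uniform_permutation(d, n_swaps, r_s, rng):
    """Algorithm 1: build a permutation of 0..d-1 as a series of
    n_swaps truncated uniform swaps of range r_s."""
    p = np.arange(d)                 # start from the identity permutation
    if r_s == 0:                     # no swap is performed
        return p
    order = rng.permutation(d)       # pi: unbiased choice of first swap variables
    for k in range(n_swaps):
        i = order[k % d]             # first swap variable (wraps if n_swaps > d)
        lo, hi = max(0, i - r_s), min(d - 1, i + r_s)
        candidates = [x for x in range(lo, hi + 1) if x != i]
        j = rng.choice(candidates)   # second swap variable, cf. Eq. (9)
        p[i], p[j] = p[j], p[i]
    return p

def d_to_d(R):
    """Sum of the distances to the diagonal of the non-zero
    entries of R, cf. Eq. (10)."""
    i, j = np.nonzero(R)
    return int(np.abs(i - j).sum())
```

As a sanity check, `d_to_d` on a full d × d matrix reproduces the closed form (11), d(d − 1)(d + 1)/3.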
s) 10 Pleft and B. By comparing to the original, non transformed × s =128 d
( and separable problem, we get an idea of how non-separable
/ 1 r =0 10 s the problem becomes. This helps us decide on the parame- ToD r =⌊d/100⌋ d s ter choice. Considering that sep-CMA-ES is a rather trivial 0