Conic Reformulations for Kullback-Leibler Divergence Constrained Distributionally Robust Optimization and Applications

Home , Divergence (statistics)

An International Journal of Optimization and Control: Theories & Applications ISSN:2146-0957 eISSN:2146-5703 Vol.11, No.2, pp.139-151 (2021) http://doi.org/10.11121/ijocta.01.2021.001001

RESEARCH ARTICLE

Conic reformulations for Kullback-Leibler divergence constrained distributionally robust optimization and applications

Burak Kocuk*

Industrial Engineering Program, Faculty of Engineering and Natural Sciences, Sabancı University, Istanbul, Turkey [email protected]

ARTICLEINFO ABSTRACT

Article History: In this paper, we consider a Kullback-Leibler divergence constrained distri- Received 12 July 2020 butionally robust optimization model. This model considers an ambiguity Accepted 11 January 2021 set that consists of all distributions whose Kullback-Leibler divergence to an Available 19 April 2021 empirical distribution is bounded. Utilizing the fact that this divergence mea- Keywords: sure has an exponential cone representation, we obtain the robust counterpart Distributionally robust optimization of the Kullback-Leibler divergence constrained distributionally robust opti- Stochastic programming mization problem as a dual exponential cone constrained program under mild Conic programming assumptions on the underlying optimization problem. The resulting conic re- Newsvendor problem formulation of the original optimization problem can be directly solved by a Uncapacitated facility location commercial conic programming solver. We specialize our generic formulation to problem two classical optimization problems, namely, the Newsvendor Problem and the Uncapacitated Facility Location Problem. Our computational study in an out- AMS Classiﬁcation 2010: of-sample analysis shows that the solutions obtained via the distributionally 90C15; 90C25; 90C90 robust optimization approach yield signiﬁcantly better performance in terms of the dispersion of the cost realizations while the central tendency deteriorates only slightly compared to the solutions obtained by stochastic programming.

1. Introduction size of the resulting deterministic equivalent formulation grows larger with the size of the scenarios. Secondly, an expectation may not be an appropriate performance measure for risk-averse Decision making under uncertainty is one of the decision makers. Thirdly, satisfying constraints most challenging tasks in operations research. for each scenario might be too restrictive. The re- Two paradigms are predominantly used in the spective remedies for these shortcomings are pro- literature to address uncertainty: stochastic proposed such as sample average approximation to gramming and robust optimization. In the clas- limit the problem size, risk-averse objective func- sical stochastic programming [1,2], a predefined tion for a more appropriate performance measure set of scenarios (or samples) are determined, ei- and chance constraints to allow constraint satis- ther taken directly from observed data or after faction with high probability. However, the im- fitting an appropriate distribution. Then, the ob- plicit assumption of stochastic programming re- jective function is replaced with an expectation mains, which is the need to assume a distribution taken with respect to the random elements, and by analyzing the data or fitting one. Unfortu- constraints are copied for each scenario. In addi- nately, this step may not be performed satisfac- tion to the assumption about knowing the under- torily in all cases. lying distribution, this basic stochastic programming approach has some limitations: Firstly, the *Corresponding author 139 140 B. Kocuk / IJOCTA, Vol.11, No.2, pp.139-151 (2021)

In robust optimization [3–5], a predefined uncer- Especially, if the robust counterpart can be ex- tainty set, which includes all possible values of the pressed as a conic program for which the under- uncertain elements, is used. Then, the optimiza- lying cone admits a self-concordant barrier, then tion is performed with the aim of optimizing with efficient polynomial-time interior point methods respect to the worst possible realization from the can be applied directly [22]. This desired prop- uncertainty set. There are two main advantages erty holds for linear programs, second-order cone of using robust optimization. Firstly, the decision programs and semidefinite programs, which ap- maker does not need to make any assumptions pear extensively in both robust and DR optimiza- about the distribution of the uncertain elements tion literature. We note that the efficiency of in the problem as opposed to the stochastic pro- the conic programming solvers specialized in these gramming approach. Secondly, the deterministic three problem classes has improved considerably. equivalent (or so-called the robust counterpart) There is some recent interest in conic programs for formulation of the robust optimization problem which the underlying cone is not self-dual, such as typically has the same computational complex- exponential cone. There are two main reasons: i) ity as the deterministic version of the problem exponential cone has extensive expressive power under reasonable assumptions on the uncertainty that is useful to model optimization problems in- sets. On the other hand, the main disadvantage of volving the exponential and logarithm functions the robust optimization approach is that depend- (see e.g. [23]), and ii) a practical implementation ing on the construction of the uncertainty set, it of a primal-dual interior point method is devel- might lead to overly conservative solutions, which oped [24] although its polynomial-time complex- might have poor performance in central tendency ity has not been proven yet. Our paper will ex- such as expectation. ploit both the expressive power of the exponen- Distributionally robust optimization (DRO) is a tial cone and the practical implementation that relatively new paradigm that aims to combine sto- can be used to solve the resulting optimization chastic programming and robust optimization ap- problem, as detailed below. proaches. The main modeling assumption of DRO is that some partial information about the dis- The Kullback-Leibler (KL) divergence [25] is a tribution governing the underlying uncertainty is popular divergence measure in information the- available, and the optimization is performed with ory that can be used to quantify the divergence respect to the worst distribution from an ambigu- of one distribution from another (see Definition 8) ity set, which contains all distributions consistent and we prove that it is exponential cone repre- with this partial information. There are mainly sentable (see Definitions 3 and 6, and Proposi- two streams in the DRO literature based on how tion 2). Although the robust counterpart of KL the ambiguity set is defined: moment-based and divergence constrained DRO is proven to be a distance-based. tractable convex program [26], to the best of our In moment-based DRO, ambiguity sets are de- knowledge, its exponential cone representability fined as the set of distributions whose first few mo- has not been exploited in the literature before. ments are assumed to be known or constrained to Also, its practical performance against stochas- lie in certain subsets. If certain structural proper- tic programming has not been analyzed in detail ties hold for the ambiguity sets such as convexity except for a limited number of applications from (or conic representability), then tractable convex power systems [27,28]. (or conic reformulations) can be obtained [6–8]. In this paper, we consider KL divergence con- In distance-based DRO, ambiguity sets are de- strained DRO problems and propose their dual fined as the set of distributions whose distance exponential cone constrained reformulation un- (or divergence) from a reference distribution is der the mild assumption of conic representabil- constrained. For Wasserstein distance [9–12] and ity. This allows us to solve the corresponding φ−divergence [13–17] constrained DRO, tractable robust counterpart using a conic programming convex reformulations have been proposed. Re- solver such as MOSEK [29]. We also present cently, chance constrained DRO problems have how the generic formulation can be specialized also drawn attention [18–21]. for two classical problems: Newsvendor and Un- As summarized above, in many cases, tractable capacitated Facility Location. Although the DRO convex robust counterparts or reformulations can methodology has been applied to variations of be obtained for robust and distributionally ro- these problems [30–35], to the best of our knowl- bust (DR) optimization problems. However, an edge, their KL divergence constrained versions even more special structure such as conic repre- have not been studied in detail. Our computa- sentability can be preferred whenever available. tional results suggest that solutions obtained via Conic reformulations for KL divergence constrained DRO and applications 141 a DR approach give slightly higher cost realiza- 2.1. Convex analysis tions when central tendencies such as mean and Rm median are considered compared to solutions ob- For a set X ⊆ , we denote its interior as tained via stochastic programming in an out-of- int(X), its relative interior as ri(X) and its clo- sample analysis. However, the dispersion (measure as cl(S). We use the shorthand notation [n] sured by the standard deviation and range of the for the set {1,...,n}. cost realizations) and the risk (measured by the We will first review some basic concepts from con- average of worst cost realizations and the third vex analysis related to cones. quartile) metrics improve significantly with solu- Definition 1 (Regular cone). A cone K ⊆ Rm is tions obtained via a DR approach. called regular if it is closed, convex, pointed and Our main contribution in this paper is to exploit full-dimensional. the exponential cone representation of the KL- Examples of regular cones include the nonnega- divergence to solve DRO problems with ambigu- tive orthant, Lorentz (or second-order) cone and ity sets defined using this measure1. Unlike the the cone of positive semidefinite matrices. We previous literature, which treats such problems will refer to these cones as canonical cones in this as “general convex programs” (see, for instance, paper. [13]), we utilize the general-purpose conic programming solver MOSEK in our computations. Definition 2 (Dual cone). The dual cone to a m m The conic representability property can be quite cone K ⊆ R is defined as K∗ = {y ∈ R : useful in practice since one can then include other xT y ≥ 0, ∀x ∈ K}. (mixed-integer) conic representable sets and func- It is well-known that the dual cone to a regular tions to the underlying optimization problem and, cone is also regular. In addition, the three canon- hence, model a wide variety of real-life situations. ical cones mentioned above are self-dual. Although we carry out the computational ex- We will now define the exponential cone, which is periments for two classical problems, we remark the key ingredient of this paper. that our approach is rather general and it can be adapted to various settings. To give a few exam- Definition 3 (Exponential cone). The exponen- ples, one can use our framework in applications tial cone, denoted as Kexp, is defined as 3 x3/x2 from portfolio optimization [7,10], image process- Kexp = cl {x ∈ R : x1 ≥ x2e , x2 > 0} . ing [9, 16], asset pricing [13], multidimensional knapsack problem [12, 20] and logistics [35, 36]. As opposed to the three canonical cones mentioned above, the exponential cone is not self-dual The rest of the paper is organized as follows: In although it is a regular cone. Section 2, we review basic concepts from convex analysis and probability theory which serve as the Proposition 1 (See e.g. [23]). The dual cone to basis of our main result about conic reformula- the exponential cone (or simply the dual exponen- tion of KL divergence constrained DRO problems tial cone) is given as in Section 3. Then, we analyze two applications, (Kexp )∗ = namely, the Newsvendor Problem in Section 4 and R3 (s2−s3)/s3 the Uncapacitated Facility Location Problem in cl {s ∈ : s1 ≥−s3e , s3 < 0} . Section 5, and present the results of our compu- The following definitions are instrumental in the tational study. Finally, we conclude our paper in description of conic programming problems: Section 6. Definition 4 (Conic inequality). A conic inequality with respect to a regular cone K is defined as x K y, meaning that x − y ∈ K. We will denote the relation x ∈ int(K) alternatively as x ≻K 0. 2. Preliminaries Definition 5 (Conic representability of a set). A set X ⊆ Rn is called conic representable if it can Before stating our main result in Section 3, we will be expressed as first review some important concepts from convex Rn Rk analysis in Section 2.1 and probability theory in X = {x ∈ : ∃y ∈ : Ax + By K b}, Section 2.2. for some appropriately chosen regular cone K.

1We also note that two other φ−divergence measures called Burg entropy and J−divergence have the same representation property. 142 B. Kocuk / IJOCTA, Vol.11, No.2, pp.139-151 (2021)

If any of the variables in this representation is in- Proposition 3. Let q ∈ ri(∆S). Then, teger, then X is called a mixed-integer conic rep- ǫ(q):= sup {DKL(p||q)} = log(1/ min{qs}). resentable set. p∈∆S s∈[S] Deﬁnition 6 (Conic representability of a func- Proof. Notice that the objective function of tion). A function is called conic representable if the optimization problem supp∈∆S {DKL(p||q)} its epigraph is a conic representable set. is convex and its feasible region is a polytope. Therefore, there exists an optimal solution which In this paper, when we say that “a set or function is an extreme point of ∆S. Observe that the ex- is conic representable”, we will implicitly assume treme points of ∆S are the unit vectors in RS, that the cone used in its representation is either s s denoted bye ˜ for s ∈ [S] (note thate ˜s′ = 1 for one of the three canonical cones or the (dual) ex- ′ s s = s , ande ˜s′ = 0 otherwise). ponential cone. s Let us now compute DKL(˜e ||q) for some s ∈ [S]. In fact, we have 2.2. Probability theory S s s s Deﬁnition 7 (Probability simplex). The proba- DKL(˜e ||q)= e˜s′ log(˜es′ /qs) bility simplex in dimension S is denoted as sX′=1 S S s s s s S RS =e ˜s log(˜es/qs)+ e˜ ′ log(˜e ′ /qs) ∆ := p ∈ + : ps = 1 . s s X′ s=1 s′=1 X s 6=s

The following function can be used to measure the = log(1/qs). “divergence” of one distribution from another. Here, we use the fact that limx→0+ x log(x/y) = 0 Deﬁnition 8 (KL Divergence). For two discrete for y > 0 in the last equality (recall that q ∈ S probability distributions p ∈ ∆S and q ∈ ri(∆S), ri(∆ ), which implies that qs > 0, s ∈ [S]). the KL divergence from p to q is deﬁned as Finally, we have

S ǫ(q)= sup {DKL(p||q)} p∈∆S DKL(p||q) := ps log(ps/qs). s Xs=1 = max{DKL(˜e ||q)} s∈[S] We note that the KL Divergence does not define = max{log(1/qs)} a distance metric between two probability distri- s∈[S] butions since it is not symmetric. However, it has = log(1/ min{qs}). the following useful property. s∈[S] Proposition 2. Let p ∈ ∆S and q ∈ ri(∆S). Then, the function DKL(p||q) is exponential cone Proposition 3 is useful to quantify the ambiguity representable. sets in KL divergence constrained DRO problems as we will see later. Proof. Due to Definition 6, it suffices to show that the epigraph of the function DKL(p||q) is an 3. Main results exponential cone representable set. Since the set {(x,y,t) ∈ R3 : t ≥ x log(x/y)} has the exponen- In this section, we present our main result about the reformulation of a KL divergence constrained tial cone representation (y,x, −t) ∈ Kexp [37], we obtain an exponential cone representation for the DRO problem as a conic program under mild conditions. function DKL(p||q) as follows: S S {(p,q,ǫ) ∈ ∆ × ri(∆ ) × R : DKL(p||q) ≤ ǫ} 3.1. Generic problem formulation S = (p,q,ǫ): ∃δ ∈ RS : δ ≤ ǫ, We first give the generic problem setting consid- s ered in this paper. Suppose that there are m ran- Xs=1 dom variables ξi ∈ R, i ∈ [m], each with a discrete (q ,p , −δ ) ∈ K , s ∈ [S] . s s s exp distribution qi ∈ ri(∆Si ) estimated from the historical data as i i i The following proposition gives an upper bound Pr(ξ = ds)= qs s ∈ [Si], i on the KL-divergence of a given distribution from where {ds : s ∈ [Si]} is the set of observed real- any other distribution. izations of ξi, i ∈ [m]. Under this probabilistic Conic reformulations for KL divergence constrained DRO and applications 143

i setting, we define the ambiguity set 0 0 i qs ps i i i i Si i i i −1 0 i Kexp  0  s ∈ [Si] (3d) P (q ,ǫ ) := {p ∈ ∆ : DKL(p ||q ) ≤ ǫ }, δ 0 1 s 0 for i ∈ [m], where ǫi ∈ R controls the divergence     + i R i R from the historical data (or robustness level). ps ∈ +, δs ∈ s ∈ [Si]. (3e) Then, we consider the following KL divergence Here, constraints (3b)-(3e) model the relation constrained DRO problem, pi ∈Pi(qi,ǫi), as stated in Proposition 2. m Recall that ǫi > 0 for each i ∈ [m]. Then, each E i i min h(y)+ max ξi [H (y,ξ )] , (1) inner maximization problem (3) satisfies essential y∈Y pi∈Pi(qi,ǫi) i i Xi=1 strict feasibility [38] (e.g. consider ps = qs and s i where each expectation is taken with respect to an δi = ǫ /|Si| for s ∈ [Si]), and its optimal value i i i i i i ambiguous distribution p ∈ P (q ,ǫ ). In prob- is bounded above (e.g. by maxs∈[Si] H (y,ds)). lem (1), h is a real-valued function defined on Rn; Therefore, strong duality holds between prob- Hi is a real-valued function defined on Rn × R; lem (3) and its conic dual given as follows: Rn and Y is a subset of . Observe that given y Si decisions, the inner maximization problem is de- i i i i i min α + ǫ β + qsus (4a) i composable over the random elements ξ , i ∈ [m]. Xs=1 Although we are assuming discrete probability i i i i s.t. α − vs ≥ H (y,ds) s ∈ [Si] (4b) distributions in this paper, our framework can be i i used to solve problems involving continuous dis- β + ws = 0 s ∈ [Si] (4c) i i tributions via a finite support approximation. α ∈ R,β ∈ R+ (4d) i i i 3.2. Robust counterpart and conic (us,vs,ws) ∈ (Kexp)∗ s ∈ [Si]. (4e) reformulation i i i i i Here, α , β and (us,vs,ws) are the dual variables associated with the primal constraints (3b), (3c) We will now obtain the robust counterpart [4] of and (3d), respectively. Notice that problem (4) is problem (1) utilizing Conic Duality under mild a dual exponential cone constrained program. conditions. As the final step in the proof, we write the dual Theorem 1. Consider the KL divergence con- of each inner maximization problem and obtain strained DRO problem (1) as described in Sec- the robust counterpart of problem (1) as prob- tion 3.1, and assume that ǫi > 0, i ∈ [m]. Then, lem (2). the robust counterpart is given as follows: We will now discuss the consequences of Theo- m Si min h(y)+ αi + ǫiβi + qi ui (2a) rem 1 under additional structural properties such s s as convexity and conic representability. Xi=1 Xs=1 i i i i s.t. α − vs ≥ H (y,ds) i ∈ [m]; s ∈ [Si] (2b) Corollary 1. Consider the KL divergence constrained DRO problem (1) as described in Theo- βi + wi = 0 i ∈ [m]; s ∈ [S ] (2c) s i rem 1. In addition, let us assume that Y is a con- i i i i α ∈ R,β ∈ R+ i ∈ [m] (2d) vex set, h(y) and H (y,ξ ) are convex functions i i i in y, i ∈ [m]. Then, the robust counterpart (2) is (us,vs,ws) ∈ (Kexp )∗ i ∈ [m]; s ∈ [Si] (2e) a convex program. y ∈Y. (2f) Corollary 2. Consider the KL divergence con- Proof. We will start the proof by analyzing the strained DRO problem (1) as described in The- inner maximization problems. Given a vector orem 1. In addition, let us assume that Y is y ∈ Y, let us write the i-th inner maximization a (mixed-integer) conic representable set, h(y) problem explicitly as the following exponential and Hi(y,ξi) are conic representable functions, cone constrained program: i ∈ [m]. Then, the robust counterpart (2) is a Si dual exponential cone constrained (mixed-integer) i i i max H (y,ds)ps (3a) program. Xs=1 Si As an application of Corollary 2, we will consider i the Newsvendor Problem in Section 4 and the s.t. ps =1 (3b) Xs=1 Uncapacitated Facility Location Problem in Sec- Si tion 5. The common characteristic of these two i i problems is that the set Y is a mixed-integer lin- δs ≤ ǫ (3c) Xs=1 ear set, the function h is a linear function and the 144 B. Kocuk / IJOCTA, Vol.11, No.2, pp.139-151 (2021) functions Hi are the maxima of linear functions ξ (that is, m = 1), representing the unknown de- (hence, they are polyhedrally representable). mand, we will omit the superscript i for convenience. 3.3. Extension to joint discrete probability distributions 4.1. Problem formulation

We will now extend our analysis to the case of Consider the generic formulation (1) with the fol- Z joint discrete probability distributions with a fi- lowing specifications: We let y ∈Y := + be the nite support. Suppose that we have a vector of order quantity, and consider functions random variables Ξ ∈ Rm with a joint probability h(y) := cy, S distribution q ∈ ri(∆ ) estimated from the his- where c is the variable order cost, and torical data as H(y,ξ) := cb max{ξ − y, 0} + ch max{y − ξ, 0}, Pr(Ξ = D˜s)= qs s ∈ [S], where c is the back-order penalty for unsatisfied m b where {D˜s : s ∈ [S]}⊆ R is the set of observed demand and ch is the inventory cost. Notice that realizations of Ξ. H(y,ξ) is a piecewise linear convex function in y Consider the ambiguity set and can be rewritten as S P(q,ǫ) := {p ∈ ∆ : DKL(p||q) ≤ ǫ}, H(y,ξ) = max{−cby + cbξ,chy − chξ}. where ǫ ∈ R+, and the following KL divergence This observation will be useful to linearize con- constrained DRO problem: straint (2c).

min h(y)+ max EΞ[H(y, Ξ)] . (5) By omitting i indices and simplifying the notation y∈Y p∈P(q,ǫ) of problem (2) by taking into account the special Here, h is a real-valued function deﬁned on Rn; H structure of the newsvendor problem, we obtain is a real-valued function deﬁned on Rn × Rm; and the following dual exponential cone constrained Y is a subset of Rn. Then, we have the following MIP as its robust counterpart: result: S min cy + α + ǫβ + q u (6a) Theorem 2. Consider the KL divergence con- s s Xs=1 strained DRO problem (5) as described above, and assume that ǫ > 0. Then, the robust counterpart s.t. α − vs ≥−cby + cbds s ∈ [S] (6b) is given as follows: α − vs ≥ chy − chds s ∈ [S] (6c)

S β + ws = 0 s ∈ [S] (6d) min h(y)+ α + ǫβ + qsus Z R R y ∈ +,α ∈ ,β ∈ + (6e) Xs=1 (us,vs,ws) ∈ (Kexp)∗ s ∈ [S]. (6f) s.t. α − vs ≥ H(y, D˜s) s ∈ [S] β + ws = 0 s ∈ [S] 4.2. Computations α ∈ R,β ∈ R+ 4.2.1. Experimental setup (us,vs,ws) ∈ (Kexp )∗ s ∈ [S] y ∈Y. To compare the eﬀect of robustness level of KL divergence constrained DR version of the Newsven- We will omit the proof of Theorem 2 due to its dor Problem, we propose Algorithm 1. Note that similarity to the proof of Theorem 1. We remark setting ǫ = 0 in problem (6) reduces it to the sto- that results similar to Corollaries 1 and 2 can be chastic programming approach while larger values also obtained in this case under the assumptions of ǫ lead to more robustness (and conservative- of convexity and conic representability, respec- ness) in solutions. tively. Algorithm 1. Input: A probability distribution D, the number of samples R, the set of ro- 4. Application to the Newsvendor bustness levels T . problem 1: Sample R random variates from D for training, and obtain the empirical distribution q In this section, we analyze a toy example, the KL and the maximum KL divergence ǫ(q) as com- divergence constrained DR version of the single- puted in Proposition 3. period, single-product Newsvendor Problem. In 2: Solve problem (6) with ǫ := θǫ(q) for each this case, since there is only one random variable θ ∈T to obtain a decision y∗(θ). Conic reformulations for KL divergence constrained DRO and applications 145

3: Sample R random variates from D for test- which we attribute to its right-skewness. How- ing, and then compute the cost realizations for ever, the order quantities in those cases are very each realization under the decision y∗(θ). high, which result in overly conservative policies and deteriorated performance measures. We implement Algorithm 1 in the Python programming language and use MOSEK 9.2 [29] to solve the dual exponential cone constrained MIP Table 1. Summary results for problem (6). the Newsvendor Problem with Uniform(0, 10) and R = 100. 4.2.2. Results θ y∗ Avg. St. Dev. Worst 10% For this illustration, we choose the following cost 0.00 4 8.08 2.92 13.80 coeﬃcients: 0.05 4 8.08 2.92 13.80 c = 1, cb = 2, ch = 1. 0.10 5 8.67 2.22 12.80 0.15 5 8.67 2.22 12.80 We will now specify the parameters of Algo- 0.20 5 8.67 2.22 12.80 rithm 1. Firstly, we experiment with three dif- 0.25 6 9.53 1.94 12.00 ferent discrete distributions: • Discrete Uniform Distribution with parameters 0 and 10, denoted as Table 2. Summary results for Uniform(0, 10). the Newsvendor Problem with • Binomial Distribution with parameters 10 Binomial(10, 0.5) and R = 100. and 0.5, denoted as Binomial(10, 0.5). • Poisson Distribution with parameter 5, θ y∗ Avg. St. Dev. Worst 10% denoted as Poisson(5). 0.00 4 6.76 2.38 11.40 We sample R = 100 random variates separately 0.05 5 7.02 1.67 10.40 to obtain “training” and “test” datasets. Then, 0.10 5 7.02 1.67 10.40 we repeat the experiments for the following “ro- 0.15 5 7.02 1.67 10.40 bustness” levels: 0.20 5 7.02 1.67 10.40 0.25 5 7.02 1.67 10.40 T := {0.00, 0.05, 0.10, 0.15, 0.20, 0.25}. The summary statistics of our experiments are reported in Tables 1-3 for Uniform, Binomial and Table 3. Summary results for Poisson distributions, respectively. In particular, the Newsvendor Problem with we report the average and standard deviation of Poisson(5) and R = 100. the cost realizations, abbreviated as “Avg.” and ∗ “St. Dev.”, respectively. In addition, we compute θ y Avg. St. Dev. Worst 10% the average of the worst 10% of the realizations, 0.00 4 7.59 3.04 13.60 abbreviated as “Worst 10%”, to quantify the risk. 0.05 5 7.73 2.40 12.60 We observe that as the robustness level θ in- 0.10 5 7.73 2.40 12.60 creases, the optimal order quantity y∗ increases 0.15 5 7.73 2.40 12.60 (recall that θ = 0.00 corresponds to the stochas- 0.20 6 8.44 1.86 12.00 tic programming approach). Moreover, with in- 0.25 6 8.44 1.86 12.00 creasing θ, the average cost increases while the standard deviation and the average of worst realizations decrease for each distribution. This is In addition to the summary statistics, we also an expected behavior when robust optimization provide the box plots of the cost realizations in is utilized. We note that Binomial distribution Figures 1-3 for Uniform, Binomial and Poisson is the least sensitive with respect to θ as the or- distributions, respectively. We observe that as der quantity (and performance measures) do not the robustness level θ increases, the median of change after θ ≥ 0.05. On the other hand, Uni- the cost realizations increases while the range form and Poisson distributions are more sensitive shrinks for each distribution. We also note that with respect to this parameter. the maximum and upper quartile values decrease We also repeat the experiments with even higher for θ ∈ [0.05, 0.15]. This is a desired property values of θ and observe that only the results cor- since it implies that the risk of the stochastic pro- responding to the Poisson distribution changes, gramming approach (θ = 0.00) can be lowered. 146 B. Kocuk / IJOCTA, Vol.11, No.2, pp.139-151 (2021) 5. Application to the uncapacitated facility location Problem

In this section, we analyze the KL divergence constrained DR version of the Uncapacitated Facility Location (UFL) Problem.

5.1. Deterministic version

We first remind the reader the deterministic version of the well-known UFL Problem. Suppose i Figure 1. Box plot of the results that we have m customers, each with demand d , for the Newsvendor Problem with i ∈ [m]. The demand must be satisfied by open- Uniform(0, 10) and R = 100. ing new facilities. There are n potential facilities, each with a fixed cost of fj, j ∈ [n]. The unit transportation cost between each customer i and facility j is given as tij, i ∈ [m], j ∈ [n]. The objective is to minimize the total fixed cost and transportation cost. The UFL Problem can be modeled as an integer program by defining two sets of binary decision variables. The first set of decision variables, denoted as yj, represent the status of each facility j, and the second set of decision variables, denoted as xij, represents the assignment of a customer i to a facility j. The complete model is given as follows: Figure 2. Box plot of the results n m for the Newsvendor Problem with i min fjyj + d tijxij (7a) Binomial(10, 0.5) and R = 100. Xj=1 Xi=1 n s.t. xij = 1 i ∈ [m] (7b) Xj=1

xij ≤ yj i ∈ [m]; j ∈ [n] (7c)

xij ∈ {0, 1} i ∈ [m]; j ∈ [n] (7d)

yj ∈ {0, 1} i ∈ [m]; j ∈ [n]. (7e) Here, constraint (7b) guarantees that each customer is served by exactly one facility while constraint (7c) ensures that each customer is served by an open facility. We point out two useful observations about the Figure 3. Box plot of the results UFL Problem. Firstly, in any feasible solution to for the Newsvendor Problem with problem (7), at least one facility must be opened. Poisson(5) and R = 100. Therefore, we must have n y ≥ 1. (8) As a ﬁnal comparison, we formulate the DR ver- j Xj=1 sion of the Newsvendor Problem assuming that ∗ the ﬁrst and second moments are known (in our Secondly, given any optimal y vector, the opti- case, estimated from the training data) following mal objective function value can be computed as [39]. We observe that the optimal order quantity n m ∗ ∗ i fjyj + d min∗ {tij}, (9) y obtained from this approach turns out to be j:y =1 identical to the optimal order quantity obtained Xj=1 Xi=1 j from the stochastic programming approach in our since each customer can be served by the closest experimental setting. open facility. Conic reformulations for KL divergence constrained DRO and applications 147

5.2. Distributionally robust version Case 2: Let l 6∈ T ∗, and choose any j∗ ∈ T ∗. Then, we have Now, suppose that we replace the deterministic demand di with a random variable ξi having an zl = tl − (tl − tj∗ ) − (tl − tj) ∗ empirical distribution qi ∈ ri(∆Si ), with realiza- j∈T \{Xj }:til>tij tions di , s ∈ [S ]. Then, the DR version of the s i = tj∗ − (tl − tj) UFL Problem can be modeled as an instance of ∗ j∈T \{Xj }:til>tij the generic model (1) with Y := {y ∈ {0, 1}n : ≤ t ∗ . (8)} as follows: We choose functions j n This analysis indicates that n h(y) := fjyj Xj=1 max tl − yj max{tl − tj, 0} = max zl = tl∗ , l∈[n] l∈[n] Xj=1 and ∗ ∗ i i i where l ∈ T . Hence, we conclude that equa- H (y,ξ ) := ξ min {tij}, i = 1,...,m, j:yj =1 tion (10) holds true. due to (9). In the remainder of this subsection, we An alternative proof of Lemma 1 can be obtained will obtain the robust counterpart of the KL di- via LP duality: First, one would write the prob- vergence constrained DR UFL Problem as a dual lem min{tj : yj = 1} as an IP by introducing ad- exponential cone constrained MIP by the help of ditional binary variables xj. Secondly, this IP can Lemma 1. be relaxed as an LP due to the totally unimodular structure. Then, the extreme points of the feasi- 5.2.1. A lemma ble region of the dual LP can be characterized, enabling the dual LP to be solved in closed form The following lemma will be critical to linearize (see dual based arguments in [40,41]). constraint (2c). 5.2.2. The final formulation Rn Lemma 1. Let t ∈ + be a given vector and consider the function g(y) : {0, 1}n → R defined Taking into account the special structure of the UFL Problem and utilizing Lemma 1 by setting as g(y) := min{tj : yj = 1}. Then, for any i y ∈ {0, 1}n satisfying (8), we have g := H for each i ∈ [m], we obtain the following n dual exponential cone constrained MIP: S g(y)= max tl − yj max{tl − tj, 0} . (10) n m i l=1,...,n i i i i i Xj=1 min fjyj + α + ǫ β + q u (11a) s s Xj=1 Xi=1 Xs=1 Proof. n n Let y ∈ {0, 1} satisfying (8) be given. i i i We define the following nonempty sets T := {j : s.t. α − vs ≥ ds til − yj max{til − tij, 0} ∗ Xj=1 yj = 1} and T := argmin{tj : j ∈ T }. Notice ∗ that g(y) = tl for l ∈ T . Also, let us define the i ∈ [m]; s ∈ [Si]; l ∈ [n] (11b) quantity (2c) − (2e), (7e), (8). n zl := tl − yj max{tl − tj, 0}, l ∈ [n], 5.3. Computations Xj=1 5.3.1. Experimental setup for convenience. Notice that we have yj max{tl−tj, 0} = We utilize Algorithm 2 to compare the effect of ro- 0 if j 6∈ T bustness level to KL divergence constrained DR  version of the UFL Problem. This algorithm is 0 if j ∈ T and til ≤ tij .  quite similar to Algorithm 1 used for the analysis til − tij if j ∈ T and til >tij of the Newsvendor Problem.  This observation helps us to rewrite zl as Algorithm 2. Input: A probability distribution

zl = tl − (tl − tj), l ∈ [n]. D, the number of samples R, the set of ro-

j∈TX:tl>tj bustness levels T . 1: Sample R random variates from D for each Now, we will look at the following cases to com- customer i ∈ [m] for training, and obtain the pute or bound zl: empirical distribution qi and the maximum ∗ ∗ i Case 1: Let l ∈ T . Then, we have zl∗ = tl∗ . KL divergence ǫ(q ) for each i ∈ [m]. 148 B. Kocuk / IJOCTA, Vol.11, No.2, pp.139-151 (2021)

2: Solve problem (11) with ǫi := θǫ(qi) for each stochastic programming approach (θ = 0.00), the i ∈ [m] and θ ∈T to obtain a decision vector first policy becomes optimal whereas in the DR y∗(θ). approach (θ ≥ 0.05), the second policy becomes 3: Sample R random variates from D for each optimal. We note that considering the ambiguity i ∈ [m] for testing, and then compute the cost of the demand distributions increases the average realizations for each realization under the de- cost only slightly whereas both the standard decision vector y∗(θ). viation and the average of the worst 10% of the realizations decrease significantly. We remind the 5.3.2. Results reader that the total fixed cost of the stochastic For this illustration, we assume that there are 12 programming approach is only 5 while the fixed customers and three potential facilities located in cost of the DR approach is 20. This also shows the unit interval. Their precise locations are given that the corresponding transportation cost, which as is affected by the random uncertainty, is significantly smaller in the DR approach. 2ω − 1 35 − 2ω : ω ∈ [6] ∪ : ω ∈ [6] , 36 36 Table 4. Summary results for the Uniform(0, 10) and UFL Problem with 2Ω − 1 and R = 100. : Ω ∈ [3] , 6 θ y∗ Avg. St. Dev. Worst 10% respectively, and are shown in Figure 4. 0.00 0,1,0 23.11 3.47 29.19 0.05 1,0,1 24.52 0.92 26.10 0.10 1,0,1 24.52 0.92 26.10 0.15 1,0,1 24.52 0.92 26.10 Figure 4. Locations of the cus- 0.20 1,0,1 24.52 0.92 26.10 tomers (circles) and potential facili- 0.25 1,0,1 24.52 0.92 26.10 ties (diamonds). Table 5. Summary results for the As evident from the figure, potential facilities UFL Problem with Binomial(10, are located evenly across the unit interval and 0.5) and R = 100. there are two clusters of customers which are ∗ also distributed evenly in their respective regions. θ y Avg. St. Dev. Worst 10% The fixed cost of opening facilities are given as 0.00 0,1,0 24.87 1.88 28.15 0.05 1,0,1 24.97 0.52 25.86 f1 = f3 = 10 for the two facilities in the middle of these clusters (marked by a large diamond), and 0.10 1,0,1 24.97 0.52 25.86 0.15 1,0,1 24.97 0.52 25.86 f2 = 5 for the other facility (marked by a small diamond). Finally, the unit transportation cost 0.20 1,0,1 24.97 0.52 25.86 between a facility-customer pair is assumed to be 0.25 1,0,1 24.97 0.52 25.86 equal to the their distance from each other. Table 6. Summary results for the We specify the parameters of Algorithm 2 regard- UFL Problem with Poisson(5) and ing the generation of random variates similar to R = 100. that of Algorithm 1 as described in Section 4.2.2. The summary statistics of our experiments are θ y∗ Avg. St. Dev. Worst 10% reported in Tables 4-6 for Uniform, Binomial and 0.00 0,1,0 24.96 2.67 29.89 Poisson distributions, respectively. We first ob- 0.05 1,0,1 24.99 0.74 26.34 serve that the optimal solutions and the perfor- 0.10 1,0,1 24.99 0.74 26.34 mance measures are similar for every distribution, 0.15 1,0,1 24.99 0.74 26.34 therefore, we will summarize our observations to- 0.20 1,0,1 24.99 0.74 26.34 gether. Due to the choice of parameters and the 0.25 1,0,1 24.99 0.74 26.34 locations of the facilities and customers as can be seen from Figure 4, there is a fundamental trade- In addition to the summary statistics, we also pro- off in this instance: We can i) either open a single vide the box plots of the cost realizations in Fig- facility at the middle of the line segment with the ures 5-7 for Uniform, Binomial and Poisson dis- lower fixed cost and serve customers via longer tributions, respectively. We observe that the me- distances, or ii) open two facilities at the middle dian of the cost realizations either stays the same of two customer clusters with higher fixed cost or increases slightly in the DR approach while and serve customers via shorter distances. In the the range shrinks significantly compared to the Conic reformulations for KL divergence constrained DRO and applications 149 stochastic programming approach for each distri- exponential cone constrained reformulations uti- bution. We also note that the maximum and up- lizing the exponential cone representability prop- per quartile values decrease considerably with the erty of KL divergence and Conic Duality. The DR approach (especially for Binomial and Poisson resulting robust counterpart can be solved by distributions). a commercial conic programming solver directly. We specialized our results to the Newsvendor and UFL Problems by providing problem spe- cific reformulations, and conducted a computational analysis comparing the performance of the solutions obtained via DR approach and stochastic programming from different aspects. We observed that although the mean and median of the cost realizations deteriorate slightly when the DR approach is preferred; the range, standard deviation and worst case values of the cost realizations improve significantly compared to stochastic programming approach. Some future research directions seem promising. Figure 5. Box plot of the results for Firstly, by utilizing the semidefinite programming Uniform(0, the UFL Problem with approximations of the matrix logarithm [42], we 10) and R = 100. can try to formulate and solve KL-divergence constrained DRO problems that involve multivariate normal distributions. Secondly, we would like to test the success of the proposed method on different problems with real datasets. Lastly, we may adapt our results to the decision-dependent setting, which is a recent active research area in the DRO literature [35, 36, 43].

Acknowledgments

The author would like to thank Dr. Beste Bas- ciftci for her comments on an earlier version of Figure 6. Box plot of the results for the UFL Problem with Binomial(10, this paper. 0.5) and R = 100. References [1] Birge, J. R., & Louveaux, F. (2011). Intro- duction to stochastic programming. Springer Science & Business Media. [2] Shapiro, A., Dentcheva, D., & Ruszczy´nski, A. (2014). Lectures on stochastic programming: modeling and theory. Society for Indus- trial and Applied Mathematics. [3] Ben-Tal, A., & Nemirovski, A. (2002). Robust optimization–methodology and applications. Mathematical Programming, 92(3), 453-480. Figure 7. Box plot of the results for [4] Ben-Tal, A., El Ghaoui, L., & Nemirovski, A. the UFL Problem with Poisson(5) (2009). Robust optimization. Princeton Uni- and R = 100. versity Press. [5] Bertsimas, D., Brown, D. B., & Caramanis, C. (2011). Theory and applications of robust 6. Conclusion optimization. SIAM Review, 53(3), 464-501. [6] Popescu, I. (2007). Robust mean-covariance In this paper, we analyzed the KL divergence con- solutions for stochastic optimization. Opera- strained DRO problems and proposed their dual tions Research, 55(1), 98–112. 150 B. Kocuk / IJOCTA, Vol.11, No.2, pp.139-151 (2021)

[7] Delage, E. and Ye, Y. (2010). Distributionally joint chance constrained optimization prob- robust optimization under moment uncer- lems. SIAM Journal on Optimization, 28(2), tainty with application to data-driven prob- 1151-1182. lems. Operations Research, 58(3), 595-612. [21] Yanıko˘glu, I.˙ (2019). Robust reformulations [8] Wiesemann, W., Kuhn, D., & Sim, M. (2014). of ambiguous chance constraints with dis- Distributionally robust convex optimization. crete probability distributions. An Interna- Operations Research, 62(6), 1358-1376. tional Journal of Optimization and Control: [9] Gao, R. & Kleywegt, A. J. (2016). Distribu- Theories & Applications, 9(2), 236-252. tionally robust stochastic optimization with [22] Nesterov, Y., & Nemirovski, A. (1994). Wasserstein distance. Optimization Online. Interior-point polynomial algorithms in con- [10] Mohajerin Esfahani, P. & Kuhn, D. (2018). vex programming. Society for Industrial and Data-driven distributionally robust optimiza- Applied Mathematics. tion using the Wasserstein metric: perfor- [23] Serrano, S. A. (2015). Algorithms for unsym- mance guarantees and tractable reformula- metric cone optimization and an implementa- tions. Mathematical Programming, 171(1), tion for problems with the exponential cone. 115-166. PhD Thesis. Stanford University. [11] Hanasusanto, G. A., & Kuhn, D. (2018). [24] Dahl, J., & Andersen, E. D. (2019). A primal- Conic programming reformulations of two- dual interior-point algorithm for nonsymmet- stage distributionally robust linear programs ric exponential-cone optimization. Optimiza- over Wasserstein balls. Operations Research, tion Online. 66(3), 849-869. [25] Kullback, S., & Leibler, R. A. (1951). On in- [12] Xie, W. (2019). On distributionally robust formation and suﬃciency. Annals of Mathe- chance constrained programs with Wasser- matical Statistics, 22(1), 79-86. stein distance. Mathematical Programming, 1- [26] Hu, Z., & Hong, L. J. (2013). Kullback- 41. Leibler divergence constrained distribution- [13] Ben-Tal, A., Den Hertog, D., De Waegenaere, ally robust optimization. Optimization On- A., Melenberg, B., & Rennen, G. (2013). Ro- line. bust solutions of optimization problems af- [27] Chen, Y., Guo, Q., Sun, H., Li, Z., Wu, W., fected by uncertain probabilities. Manage- & Li, Z. (2018). A distributionally robust op- ment Science, 59(2), 341-357. timization model for unit commitment based [14] Klabjan, D., Simchi-Levi, D., & Song, M. on Kullback–Leibler divergence. IEEE Trans- (2013). Robust stochastic lot-sizing by means actions on Power Systems, 33(5), 5147-5160. of histograms. Production and Operations [28] Li, Z., Wu, W., Zhang, B., & Tai, X. (2018). Management, 22(3), 691-710. Kullback–Leibler divergence-based distribu- [15] Jiang, R., & Guan, Y. (2016). Data-driven tionally robust optimisation model for heat chance constrained stochastic program. Math- pump day-ahead operational schedule to ematical Programming, 158(1-2), 291-327. improve PV integration. IET Generation, [16] Yanıko˘glu, I.,˙ den Hertog, D., & Kleijnen, J. Transmission & Distribution, 12(13), 3136- P. (2016). Robust dual-response optimization. 3144. IIE Transactions, 48(3), 298-312. [29] MOSEK ApS. (2020). MOSEK optimizer [17] Lam, H. (2019). Recovering best statistical API for Python. guarantees via the empirical divergence-based [30] Hanasusanto, G. A., Kuhn, D., Wallace, S. distributionally robust optimization. Opera- W., & Zymler, S. (2015). Distributionally ro- tions Research, 67(4), 1090-1105. bust multi-item newsvendor problems with [18] Zymler, S., Kuhn, D., & Rustem, B. (2013). multimodal demand distributions. Mathemat- Distributionally robust joint chance con- ical Programming, 152(1-2), 1-32. straints with second-order moment informa- [31] Natarajan, K., Sim, M., & Uichanco, tion. Mathematical Programming, 137(1-2), J. (2018). Asymmetry and ambiguity in 167-198. newsvendor models. Management Science, [19] Yanıko˘glu, I.,˙ & den Hertog, D. (2013). 64(7), 3146-3167. Safe approximations of ambiguous chance [32] Lee, S., Kim, H., & Moon, I. (2020). A constraints using historical data. INFORMS data-driven distributionally robust newsven- Journal on Computing, 25(4), 666-681. dor model with a Wasserstein ambiguity set. [20] Xie, W., & Ahmed, S. (2018). On determin- Journal of the Operational Research Society, istic reformulations of distributionally robust 1-19. Conic reformulations for KL divergence constrained DRO and applications 151

[33] Lu, M., Ran, L., & Shen, Z. J. M. (2015). Re- [40] Erlenkotter, D. (1978). A dual-based proce- liable facility location design under uncertain dure for uncapacitated facility location. Op- correlated disruptions. Manufacturing & Ser- erations Research, 26(6), 992-1009. vice Operations Management, 17(4), 445-455. [41] Conn, A. R., & Cornuejols, G. (1990). A pro- [34] Santiváñez, J. A., & Carlo, H. J. (2018). jection method for the uncapacitated facil- Reliable capacitated facility location problem ity location problem. Mathematical Program- with service levels. EURO Journal on Trans- ming, 46(1-3), 273-298. portation and Logistics, 7(4), 315-341. [42] Fawzi, H., Saunderson, J., & Parrilo, P. A. [35] Basciftci, B., Ahmed, S., & Shen, S. (2020). (2019). Semidefinite approximations of the Distributionally robust facility location prob- matrix logarithm. Foundations of Computa- lem under decision-dependent stochastic de- tional Mathematics, 19(2), 259-296. mand. European Journal of Operational Re- [43] Luo, F., & Mehrotra, S. (2020). Distribution- search. doi:10.1016/j.ejor.2020.11.002. ally robust optimization with decision depen- [36] Noyan, N., Rudolf, G., & Lejeune, M. dent ambiguity sets. Optimization Letters, 14, (2018). Distributionally robust optimization 2565-2594. with decision-dependent ambiguity set. Opti- mization Online. [37] MOSEK ApS. (2020). MOSEK modeling Burak Kocuk is an assistant professor at the Indus- cookbook. trial Engineering Program, Sabancı University. He ob- [38] Ben-Tal, A., & Nemirovski, A. (2001). Lec- tained his BS degrees in Industrial Engineering and tures on modern convex optimization: analy- Mathematics, and MS degree in Industrial Engineer- sis, algorithms, and engineering applications. ing from Bo˘gazi¸ci University. He obtained his PhD degree of Operations Research at the School of Industrial Society for Industrial and Applied Mathemat- and Systems Engineering, Georgia Institute of Tech- ics. nology. Before joining Sabancı University, he was a [39] Scarf, H. A. (1958). A min-max solution of postdoctoral fellow at the Tepper School of Business, an inventory problem. In: K. J. Arrow, S. Carnegie Mellon University. His current research fo- Karlin and H. E. Scarf, Eds., Studies in the cuses on mixed-integer nonlinear programming and Mathematical Theory of Inventory and Pro- stochastic optimization problems, from both theoreti- duction, Stanford University Press, Califor- cal and methodological aspects. nia, 201-209. https://orcid.org/0000-0002-4218-1116

An International Journal of Optimization and Control: Theories & Applications (http://ijocta.balikesir.edu.tr)

This work is licensed under a Creative Commons Attribution 4.0 International License. The authors retain ownership of the copyright for their article, but they allow anyone to download, reuse, reprint, modify, distribute, and/or copy articles in IJOCTA, so long as the original authors and source are credited. To see the complete license contents, please visit http://creativecommons.org/licenses/by/4.0/.