Probability in the Engineering and Informational Sciences (2021), 1–9. doi:10.1017/S0269964821000395

RESEARCH ARTICLE

A regret lower bound for assortment optimization under the capacitated MNL model with arbitrary revenue parameters

Yannik Peeters¹ and Arnoud V. den Boer¹,²

¹ Amsterdam Business School, University of Amsterdam, Muidergracht 12, 1018 TV Amsterdam, The Netherlands. E-mail: [email protected]
² Korteweg-de Vries Institute for Mathematics, University of Amsterdam, Park 105-107, 1098 XG Amsterdam, The Netherlands.

Keywords: Assortment optimization, Incomplete information, Multinomial logit model, Regret lower bound

Abstract

In this note, we consider dynamic assortment optimization with incomplete information under the capacitated multinomial logit choice model. Recently, it has been shown that the regret (the cumulative expected revenue loss caused by offering suboptimal assortments) that any decision policy endures is bounded from below by a constant times $\sqrt{NT}$, where $N$ denotes the number of products and $T$ denotes the time horizon. This result is shown under the assumption that the product revenues are constant, and thus leaves the question open whether a lower regret rate can be achieved for nonconstant revenue parameters. In this note, we show that this is not the case: we show that, for any vector of product revenues, there is a positive constant such that the regret of any policy is bounded from below by this constant times $\sqrt{NT}$. Our result implies that policies that achieve $O(\sqrt{NT})$ regret are asymptotically optimal for all product revenue parameters.

1. Introduction

In this note, we consider the problem of assortment optimization under the multinomial logit (MNL) choice model with a capacity constraint on the size of the assortment and incomplete information about the model parameters. This problem has recently received considerable attention in the literature (see, e.g., [1–7,9,10]). Two notable recent contributions are from Agrawal et al. [1,2], who construct decision policies based on Thompson Sampling and Upper Confidence Bounds, respectively, and show that the regret of these policies—the cumulative expected revenue loss compared with the benchmark of always offering an optimal assortment—is, up to logarithmic terms, bounded by a constant times $\sqrt{NT}$, where $N$ denotes the number of products and $T \geq N$ denotes the length of the time horizon. These upper bounds are complemented by the recent work from Chen and Wang [3], who show that the regret of any policy is bounded from below by a positive constant times $\sqrt{NT}$, implying that the policies by Agrawal et al. [1,2] are (up to logarithmic terms) asymptotically optimal. The lower bound by Chen and Wang [3] is proven under the assumption that the product revenues are constant—that is, each product generates the same amount of revenue when sold. In practice, it often happens that different products have different marginal revenues, and it is a priori not completely clear whether the policies by Agrawal et al. [1,2] are still asymptotically optimal or whether a lower regret can be achieved. In addition, Chen and Wang [3] assume that $K$, the maximum number of products allowed in an assortment, is bounded by $\frac{1}{4} \cdot N$, but point out that this constant $\frac{1}{4}$ can probably be increased. In this note, we settle this open question by proving a $\sqrt{NT}$ regret lower bound for any given vector of product revenues. This implies that policies with $O(\sqrt{NT})$ regret are asymptotically optimal regardless

© The Author(s), 2021. Published by Cambridge University Press. This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.

of the product revenue parameters. Furthermore, our result is valid for all $K < \frac{1}{2} N$, thereby confirming the intuition of Chen and Wang [3] that the constraint $K \leq \frac{1}{4} N$ is not tight.

2. Model and main result

We consider the problem of dynamic assortment optimization under the MNL choice model. In this model, the number of products is $N \in \mathbb{N}$. Henceforth, we abbreviate the set of products $\{1, \ldots, N\}$ as $[N]$. Each product $i \in [N]$ yields a known marginal revenue to the seller of $w_i > 0$. Without loss of generality due to scaling, we can assume that $w_i \leq 1$ for all $i \in [N]$. Each product $i \in [N]$ is associated with a preference parameter $v_i \geq 0$, unknown to the seller. Each offered assortment $S \subseteq [N]$ must satisfy a capacity constraint, that is, $|S| \leq K$ for capacity constraint $K \in \mathbb{N}$, $K \leq N$. For notational convenience, we write

$$\mathcal{A}_K := \{S \subseteq [N] : |S| \leq K\}$$

for the collection of all assortments of size at most 𝐾, and

$$\mathcal{S}_K := \{S \subseteq [N] : |S| = K\}$$

for the collection of all assortments of exact size $K$. Let $T \in \mathbb{N}$ denote a finite time horizon. Then, at each time $t \in [T]$, the seller selects an assortment $S_t \in \mathcal{A}_K$ based on the purchase information available up to and including time $t - 1$. Thereafter, the seller observes a single purchase $Y_t \in S_t \cup \{0\}$, where product $0$ indicates a no-purchase. The purchase probabilities within the MNL model are given by

$$\mathbb{P}(Y_t = i \mid S_t = S) = \frac{v_i}{1 + \sum_{j \in S} v_j},$$

for all $t \in [T]$ and $i \in S \cup \{0\}$, where we write $v_0 := 1$. The assortment decisions of the seller are described by his/her policy: a collection of probability distributions $\pi = (\pi(\cdot \mid h) : h \in H)$ on $\mathcal{A}_K$, where
$$H := \bigcup_{t \in [T]} \{(S, Y) : Y \in S \cup \{0\},\ S \in \mathcal{A}_K\}^{t-1}$$

is the set of possible histories. Then, conditionally on $h = (S_1, Y_1, \ldots, S_{t-1}, Y_{t-1})$, assortment $S_t$ has distribution $\pi(\cdot \mid h)$, for all $h \in H$ and all $t \in [T]$. Let $\mathbb{E}_v^\pi$ be the expectation operator under policy $\pi$ and preference vector $v \in \mathcal{V} := [0, \infty)^N$. The objective for the seller is to find a policy $\pi$ that maximizes the total accumulated revenue or, equivalently, minimizes the accumulated regret:

$$\Delta_\pi(T, v) := T \cdot \max_{S \in \mathcal{A}_K} r(S, v) - \sum_{t=1}^{T} \mathbb{E}_v^\pi[r(S_t, v)],$$

where $r(S, v)$ is the expected revenue of assortment $S \subseteq [N]$ under preference vector $v \in \mathcal{V}$:
$$r(S, v) := \frac{\sum_{i \in S} v_i w_i}{1 + \sum_{i \in S} v_i}.$$
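For concreteness, the following minimal Python sketch computes these quantities numerically. It is purely illustrative: the instance size, parameter values, and function names are ours and do not come from the paper.

```python
import numpy as np

def choice_probabilities(S, v):
    """MNL purchase probabilities over the offered set S and the no-purchase option 0 (v_0 = 1)."""
    denom = 1.0 + sum(v[i] for i in S)
    probs = {0: 1.0 / denom}
    probs.update({i: v[i] / denom for i in S})
    return probs

def expected_revenue(S, v, w):
    """r(S, v) = sum_{i in S} v_i w_i / (1 + sum_{i in S} v_i)."""
    denom = 1.0 + sum(v[i] for i in S)
    return sum(v[i] * w[i] for i in S) / denom

# Illustrative instance: N = 5 products, capacity K = 2 (values are ours, not from the paper).
rng = np.random.default_rng(0)
N, K = 5, 2
w = rng.uniform(0.5, 1.0, size=N)   # marginal revenues, scaled into (0, 1]
v = rng.uniform(0.0, 1.0, size=N)   # preference parameters (unknown to the seller)
S = (0, 3)                          # an assortment with |S| <= K
print(choice_probabilities(S, v))
print(expected_revenue(S, v, w))
```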

In addition, we define the worst-case regret:

$$\Delta_\pi(T) := \sup_{v \in \mathcal{V}} \Delta_\pi(T, v).$$
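Continuing the sketch above (and reusing expected_revenue, v, w, K, and rng from it), the regret objective can be illustrated by a Monte Carlo estimate for a naive benchmark policy that offers a uniformly random size-$K$ assortment each round; this policy and the brute-force maximization over $\mathcal{A}_K$ (feasible only for small $N$) are our own illustrative choices.

```python
from itertools import combinations

def best_revenue(v, w, K):
    """max_{S in A_K} r(S, v), by brute force over all assortments of size at most K."""
    N = len(v)
    return max(expected_revenue(S, v, w)
               for k in range(K + 1)
               for S in combinations(range(N), k))

def estimated_regret_random_policy(T, v, w, K, rng):
    """Monte Carlo estimate of Delta_pi(T, v) for the policy that offers a uniformly
    random size-K assortment in every round, ignoring all purchase feedback."""
    opt = best_revenue(v, w, K)
    N = len(v)
    return sum(opt - expected_revenue(tuple(rng.choice(N, size=K, replace=False)), v, w)
               for _ in range(T))

print(estimated_regret_random_policy(T=1000, v=v, w=w, K=K, rng=rng))
```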

The main result, presented below, states that the regret of any policy can uniformly be bounded from below by a constant times $\sqrt{NT}$.

Theorem 1. Suppose that $K < N/2$. Then, there exists a constant $c_1 > 0$ such that, for all $T \geq N$ and for all policies $\pi$,
$$\Delta_\pi(T) \geq c_1 \sqrt{NT}.$$

3. Proof of Theorem 1

3.1. Proof outline

The proof of Theorem 1 can be broken up into four steps. First, we define a baseline preference vector $v^0 \in \mathcal{V}$ and we show that under $v^0$ any assortment $S \in \mathcal{S}_K$ is optimal. Second, for each $S \in \mathcal{A}_K$, we define a preference vector $v^S \in \mathcal{V}$ by
$$v_i^S := \begin{cases} v_i^0 (1 + \epsilon), & \text{if } i \in S, \\ v_i^0, & \text{otherwise}, \end{cases} \qquad (3.1)$$

for some $\epsilon \in (0, 1]$. For each such $v^S$, we show that the instantaneous regret from offering a suboptimal assortment $S_t$ is bounded from below by a constant times the number of products $|S \setminus S_t|$ of $S$ that are not offered in $S_t$; cf. Lemma 1 below. This lower bound takes into account how much the assortments $S_1, \ldots, S_T$ overlap with $S$ when the preference vector is $v^S$. Third, let $N_i$ denote the number of times product $i \in [N]$ is contained in $S_1, \ldots, S_T$, that is,
$$N_i := \sum_{t=1}^{T} \mathbb{1}\{i \in S_t\}.$$

Then, we use the Kullback–Leibler (KL) divergence and Pinsker's inequality to upper bound the difference between the expected value of $N_i$ under $v^S$ and $v^{S \setminus \{i\}}$; see Lemma 2. Fourth, we apply a randomization argument over $\{v^S : S \in \mathcal{S}_K\}$, we combine the previous steps, and we set $\epsilon$ accordingly to conclude the proof. The novelty of this work is concentrated in the first two steps. The third and fourth step closely follow the work of Chen and Wang [3]. These last steps are included (1) because of slight deviations in our setup, (2) for the sake of completeness, and (3) since the proof techniques are extended to the case where $K/N < 1/2$. In the work of Chen and Wang [3], the lower bound is shown for $K/N \leq 1/4$, but the authors already mention that this constraint can probably be relaxed. Our proof confirms that this is indeed the case.

3.2. Step 1: Construction of the baseline preference vector

Let $w := \min_{i \in [N]} w_i > 0$ and define the constant

$$s := \frac{w^2}{3 + 2w}.$$

Note that $s < w$. The baseline preference vector is formally defined as
$$v_i^0 := \frac{s}{K(w_i - s)}, \quad \text{for all } i \in [N].$$

Now, the expected revenue for any $S \in \mathcal{A}_K$ under $v^0$ can be rewritten as

$$r(S, v^0) = \frac{\sum_{i \in S} v_i^0 w_i}{1 + \sum_{i \in S} v_i^0} = \frac{\frac{s}{K} \sum_{i \in S} \frac{w_i}{w_i - s}}{1 + \frac{s}{K} \sum_{i \in S} \frac{1}{w_i - s}} = \frac{s \sum_{i \in S} \frac{w_i}{w_i - s}}{K - |S| + \sum_{i \in S} \frac{w_i}{w_i - s}}.$$

The expression on the right-hand side is only maximized by assortments 푆 with maximal size |푆| = 퐾, in which case

$$r(S, v^0) = \max_{S' \in \mathcal{A}_K} r(S', v^0) = s.$$

It follows that all assortments 푆 with size 퐾 are optimal.
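This identity is easy to verify numerically. The following self-contained Python sketch (with illustrative, arbitrarily chosen parameters) builds $v^0$ from a random revenue vector and checks that every size-$K$ assortment attains expected revenue exactly $s$, while strictly smaller assortments fall short.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
N, K = 6, 3
w = rng.uniform(0.5, 1.0, size=N)             # product revenues in (0, 1]
w_min = w.min()
s = w_min**2 / (3 + 2 * w_min)                # the constant s from Step 1
v0 = s / (K * (w - s))                        # baseline preferences v_i^0 = s / (K (w_i - s))

def r(S, v):
    """Expected revenue r(S, v) under the MNL model."""
    return sum(v[i] * w[i] for i in S) / (1.0 + sum(v[i] for i in S))

full = [r(S, v0) for S in combinations(range(N), K)]
small = [r(S, v0) for S in combinations(range(N), K - 1)]
print(np.allclose(full, s))                   # True: every size-K assortment yields exactly s
print(max(small) < s)                         # True: strictly smaller assortments are suboptimal
```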

3.3. Step 2: Lower bound on the instantaneous regret of $v^S$

For the second step, we bound the instantaneous regret under $v^S$.

Lemma 1. Let 푆 ∈S퐾 . Then, there exists a constant 푐2 > 0, only depending on 푤 and 푠, such that, for all 푡 ∈[푇] and 푆푡 ∈A퐾 ,

$$\max_{S' \in \mathcal{A}_K} r(S', v^S) - r(S_t, v^S) \geq c_2 \frac{\epsilon |S \setminus S_t|}{K}.$$

As a consequence,
$$T \cdot \max_{S' \in \mathcal{A}_K} r(S', v^S) - \sum_{t=1}^{T} r(S_t, v^S) \geq c_2 \epsilon \left( T - \frac{1}{K} \sum_{i \in S} N_i \right). \qquad (3.2)$$

Proof. Fix $S \in \mathcal{S}_K$. First, note that since $\epsilon \leq 1$, for any $S' \in \mathcal{A}_K$, it holds that

$$\sum_{i \in S'} v_i^S \leq \frac{2s}{w - s}. \qquad (3.3)$$

Second, let $S^* \in \arg\max_{S' \in \mathcal{A}_K} r(S', v^S)$ and $\varrho^* := r(S^*, v^S)$. By rewriting the inequality $\varrho^* \geq r(S', v^S)$ for all $S' \in \mathcal{A}_K$, we find that for all $S' \in \mathcal{A}_K$,
$$\varrho^* \geq \sum_{i \in S'} v_i^S (w_i - \varrho^*). \qquad (3.4)$$

Let 푡 ∈[푇] and 푆푡 ∈A퐾 . Then, it holds that

$$\begin{aligned}
r(S^*, v^S) - r(S_t, v^S) &= \varrho^* - \frac{\sum_{i \in S_t} v_i^S w_i}{1 + \sum_{i \in S_t} v_i^S} \\
&= \frac{1}{1 + \sum_{i \in S_t} v_i^S} \left( \varrho^* + \sum_{i \in S_t} v_i^S \varrho^* - \sum_{i \in S_t} v_i^S w_i \right) \\
&\geq \frac{w - s}{w + s} \left( \varrho^* - \sum_{i \in S_t} v_i^S (w_i - \varrho^*) \right) \\
&\geq \frac{w - s}{w + s} \left( \sum_{i \in S} v_i^S (w_i - \varrho^*) - \sum_{i \in S_t} v_i^S (w_i - \varrho^*) \right) \\
&= \frac{w - s}{w + s} \Bigg( \underbrace{\sum_{i \in S} v_i^S (w_i - s) - \sum_{i \in S_t} v_i^S (w_i - s)}_{(a)} - \underbrace{(\varrho^* - s) \left( \sum_{i \in S} v_i^S - \sum_{i \in S_t} v_i^S \right)}_{(b)} \Bigg).
\end{aligned}$$

Here, the first inequality is due to (3.3) and the second inequality follows from (3.4) with $S' = S$. Next, note that since $|S_t| \leq K$ and $|S| = K$, we find that

$$|S_t \setminus S| \leq |S \setminus S_t|. \qquad (3.5)$$

Now, term $(a)$ can be bounded from below as
$$\begin{aligned}
(a) &= \sum_{i \in S \setminus S_t} v_i^S (w_i - s) - \sum_{i \in S_t \setminus S} v_i^S (w_i - s) \\
&= \frac{s}{K} \left( (1 + \epsilon) |S \setminus S_t| - |S_t \setminus S| \right) \\
&\geq s \frac{\epsilon |S \setminus S_t|}{K}. \qquad (3.6)
\end{aligned}$$

Here, at the final inequality, we used (3.5). Next, term (푏) can be bounded from above as

$$(b) \leq \underbrace{|\varrho^* - s|}_{(c)} \cdot \underbrace{\left| \sum_{i \in S} v_i^S - \sum_{i \in S_t} v_i^S \right|}_{(d)}.$$

Now, for term $(c)$, we note that $v_i^S \geq v_i^0$ for all $i \in [N]$. In addition, since $r(S^*, v^0) \leq s$,
$$\begin{aligned}
\varrho^* - s &\leq \frac{\sum_{i \in S^*} v_i^S w_i}{1 + \sum_{i \in S^*} v_i^S} - \frac{\sum_{i \in S^*} v_i^0 w_i}{1 + \sum_{i \in S^*} v_i^0} \\
&\leq \frac{1}{1 + \sum_{i \in S^*} v_i^0} \sum_{i \in S^*} (v_i^S - v_i^0) w_i \\
&\leq \sum_{i=1}^{N} (v_i^S - v_i^0) = \epsilon \sum_{i \in S} v_i^0 \leq \epsilon \frac{s}{w - s} \leq \epsilon.
\end{aligned}$$

This entails an upper bound for $(c)$. Term $(d)$ is bounded from above as
$$\begin{aligned}
(d) &\leq \sum_{i \in S \setminus S_t} v_i^S + \sum_{i \in S_t \setminus S} v_i^S \\
&\leq (1 + \epsilon) \sum_{i \in S \setminus S_t} v_i^0 + \sum_{i \in S_t \setminus S} v_i^0 \\
&\leq (1 + \epsilon) \frac{s}{K(w - s)} |S \setminus S_t| + \frac{s}{K(w - s)} |S_t \setminus S| \\
&\leq 3 \frac{s}{w - s} \frac{|S \setminus S_t|}{K}.
\end{aligned}$$


Here, at the final inequality, we used (3.5) and the fact that $\epsilon \leq 1$. Now, we combine the upper bounds of $(c)$ and $(d)$ to find that
$$(b) \leq \frac{3 s^2}{(w - s)^2} \cdot \frac{\epsilon |S \setminus S_t|}{K}. \qquad (3.7)$$

It follows from (3.6) and (3.7) that
$$r(S^*, v^S) - r(S_t, v^S) \geq \frac{w - s}{w + s} \left( s - \frac{3 s^2}{(w - s)^2} \right) \frac{\epsilon |S \setminus S_t|}{K} \geq c_2 \frac{\epsilon |S \setminus S_t|}{K},$$

where
$$c_2 := \frac{w - s}{w + s} \left( s - \frac{3 s^2}{(w - s)^2} \right).$$

Note that the constant $c_2$ is positive if $(w - s)^2 > 3s$. This follows from $s = w^2/(3 + 2w)$ since
$$(w - s)^2 - 3s > w^2 - s(3 + 2w) = 0.$$

Statement (3.2) follows from the additional observation
$$\sum_{t=1}^{T} |S \setminus S_t| = TK - \sum_{t=1}^{T} |S \cap S_t| = TK - \sum_{i \in S} N_i. \qquad \square$$
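As a numerical sanity check on Lemma 1 (illustrative only; the instance, helper names, and tolerance are ours), one can draw a random instance, build $v^S$ as in (3.1), and verify that the instantaneous regret of every $S_t \in \mathcal{A}_K$ is at least $c_2 \epsilon |S \setminus S_t| / K$.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
N, K, eps = 6, 2, 0.5                         # K < N/2 as in Theorem 1, eps in (0, 1]
w = rng.uniform(0.5, 1.0, size=N)             # revenues in (0, 1]
w_min = w.min()
s = w_min**2 / (3 + 2 * w_min)
v0 = s / (K * (w - s))                        # baseline preference vector
c2 = (w_min - s) / (w_min + s) * (s - 3 * s**2 / (w_min - s)**2)

def r(S, v):
    """Expected revenue r(S, v) under the MNL model."""
    return sum(v[i] * w[i] for i in S) / (1.0 + sum(v[i] for i in S))

def assortments(max_size):
    """All assortments of size at most max_size."""
    for k in range(max_size + 1):
        yield from combinations(range(N), k)

S = (0, 1)                                    # a fixed assortment S in S_K
vS = v0.copy()
vS[list(S)] *= 1 + eps                        # the perturbed vector v^S from (3.1)

opt = max(r(Sp, vS) for Sp in assortments(K))
ok = all(opt - r(St, vS) >= c2 * eps * len(set(S) - set(St)) / K - 1e-12
         for St in assortments(K))
print(c2 > 0, ok)                             # both True on this instance
```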

3.4. Step 3: KL divergence and Pinsker's inequality

We denote the dependence of the expected value and the probability on the preference vector $v^S$ as $\mathbb{E}_S[\cdot]$ and $\mathbb{P}_S(\cdot)$ for $S \in \mathcal{A}_K$. In addition, we write $S \setminus i$ instead of $S \setminus \{i\}$. The lemma below states an upper bound on the KL divergence of $\mathbb{P}_S$ and $\mathbb{P}_{S \setminus i}$ and uses Pinsker's inequality to derive an upper bound on the absolute difference between the expected value of $N_i$ under $v^S$ and $v^{S \setminus i}$.

Lemma 2. Let $S \in \mathcal{S}_K$, $S' \in \mathcal{A}_K$, and $i \in S$. Then, there exists a constant $c_3$, only depending on $w$ and $s$, such that
$$\mathrm{KL}(\mathbb{P}_S(\cdot \mid S') \,\|\, \mathbb{P}_{S \setminus i}(\cdot \mid S')) \leq c_3 \frac{\epsilon^2}{K}.$$
As a consequence,
$$|\mathbb{E}_S[N_i] - \mathbb{E}_{S \setminus i}[N_i]| \leq \sqrt{2 c_3}\, \frac{\epsilon T^{3/2}}{\sqrt{K}}. \qquad (3.8)$$

Proof. Let $P$ and $Q$ be arbitrary probability measures on $S' \cup \{0\}$. It can be shown (see, e.g., Lemma 3 from Chen and Wang [3]) that

$$\mathrm{KL}(P \,\|\, Q) \leq \sum_{j \in S' \cup \{0\}} \frac{(p_j - q_j)^2}{q_j},$$

where $p_j$ and $q_j$ are the probabilities of outcome $j$ under $P$ and $Q$, respectively. We apply this result for $p_j$ and $q_j$ defined as
$$p_j := \frac{v_j^S}{1 + \sum_{\ell \in S'} v_\ell^S} \quad \text{and} \quad q_j := \frac{v_j^{S \setminus i}}{1 + \sum_{\ell \in S'} v_\ell^{S \setminus i}},$$

for $j \in S' \cup \{0\}$. First, note that by (3.3), for all $j \in S' \cup \{0\}$,

$$q_j \geq \frac{v_j^0}{1 + \frac{2s}{w - s}} = \frac{w - s}{w + s} v_j^0.$$

Now, we bound $|p_j - q_j|$ from above for $j \in S' \cup \{0\}$. Note that for $j = 0$ it holds that

$$|p_0 - q_0| = \frac{\left| \sum_{\ell \in S'} v_\ell^S - \sum_{\ell \in S'} v_\ell^{S \setminus i} \right|}{\left( 1 + \sum_{\ell \in S'} v_\ell^S \right) \left( 1 + \sum_{\ell \in S'} v_\ell^{S \setminus i} \right)} \leq |(1 + \epsilon) v_i^0 - v_i^0| = v_i^0 \epsilon.$$

For $j \neq i$, since $\epsilon \leq 1$, we find that

$$|p_j - q_j| = v_j^S |p_0 - q_0| \leq 2 v_j^0 v_i^0 \epsilon.$$

For 푗 = 푖, we find that

$$\begin{aligned}
|p_i - q_i| &= v_i^0 |p_0 - q_0 + \epsilon p_0| \\
&\leq v_i^0 (|p_0 - q_0| + \epsilon p_0) \\
&\leq v_i^0 (v_i^0 + 1) \epsilon.
\end{aligned}$$

Therefore, we conclude that

$$\begin{aligned}
\mathrm{KL}(\mathbb{P}_S(\cdot \mid S') \,\|\, \mathbb{P}_{S \setminus i}(\cdot \mid S')) &\leq \sum_{j \in S' \cup \{0\}} \frac{(p_j - q_j)^2}{q_j} \\
&\leq \frac{(p_0 - q_0)^2}{q_0} + \sum_{j \in S' : j \neq i} \frac{(p_j - q_j)^2}{q_j} + \frac{(p_i - q_i)^2}{q_i} \\
&\leq \frac{w + s}{w - s} \left( (v_i^0 \epsilon)^2 + 4 (v_i^0 \epsilon)^2 \sum_{j \in S' : j \neq i} v_j^0 + v_i^0 (v_i^0 + 1)^2 \epsilon^2 \right) \\
&\leq \frac{s(w + s)}{(w - s)^2} \left( \frac{s}{w - s} + \frac{4 s^2}{(w - s)^2} + \frac{w^2}{(w - s)^2} \right) \frac{\epsilon^2}{K} \\
&= c_3 \frac{\epsilon^2}{K},
\end{aligned}$$

where
$$c_3 := \frac{s(w + s)}{(w - s)^2} \left( \frac{s}{w - s} + \frac{4 s^2 + w^2}{(w - s)^2} \right).$$

Next, note that the entire probability measures P푆 and P푆\푖 depend on 푇. Then, as a consequence of the chain rule of the KL divergence, we find that

$$\mathrm{KL}(\mathbb{P}_S \,\|\, \mathbb{P}_{S \setminus i}) \leq c_3 \frac{\epsilon^2 T}{K}.$$

Now, statement (3.8) follows from

$$\begin{aligned}
|\mathbb{E}_S[N_i] - \mathbb{E}_{S \setminus i}[N_i]| &\leq \sum_{n=0}^{T} n\, |\mathbb{P}_S(N_i = n) - \mathbb{P}_{S \setminus i}(N_i = n)| \\
&\leq T \sum_{n=0}^{T} |\mathbb{P}_S(N_i = n) - \mathbb{P}_{S \setminus i}(N_i = n)| \\
&= 2T \max_{B \subseteq \{0, \ldots, T\}} |\mathbb{P}_S(N_i \in B) - \mathbb{P}_{S \setminus i}(N_i \in B)| \\
&\leq 2T \sup_{A} |\mathbb{P}_S(A) - \mathbb{P}_{S \setminus i}(A)| \\
&\leq T \sqrt{2\, \mathrm{KL}(\mathbb{P}_S \,\|\, \mathbb{P}_{S \setminus i})}, \qquad (3.9)
\end{aligned}$$

where the step in (3.9) follows from, for example, Proposition 4.2 from Levin et al. [8] and we used Pinsker’s inequality at the final inequality. 
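The single-period KL bound of Lemma 2 can be checked numerically in the same spirit (again with our own illustrative values and helper names): for a fixed $S \in \mathcal{S}_K$ and $i \in S$, the divergence between the purchase distributions under $v^S$ and $v^{S \setminus i}$ stays below $c_3 \epsilon^2 / K$ for every offered assortment $S' \in \mathcal{A}_K$.

```python
import numpy as np
from itertools import combinations

def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions with full support."""
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sum(p * np.log(p / q)))

def purchase_dist(offered, v):
    """Distribution of the single purchase over the offered assortment plus the no-purchase option 0."""
    denom = 1.0 + sum(v[i] for i in offered)
    return np.array([1.0 / denom] + [v[i] / denom for i in offered])

rng = np.random.default_rng(3)
N, K, eps = 6, 2, 0.5
w = rng.uniform(0.5, 1.0, size=N)
w_min = w.min()
s = w_min**2 / (3 + 2 * w_min)
v0 = s / (K * (w - s))
c3 = s * (w_min + s) / (w_min - s)**2 * (s / (w_min - s) + (4 * s**2 + w_min**2) / (w_min - s)**2)

S, i = (0, 1), 0                              # S in S_K and a product i in S
vS = v0.copy();  vS[list(S)] *= 1 + eps       # v^S
vSi = v0.copy(); vSi[[j for j in S if j != i]] *= 1 + eps   # v^{S \ i}

ok = all(kl(purchase_dist(Sp, vS), purchase_dist(Sp, vSi)) <= c3 * eps**2 / K + 1e-12
         for k in range(K + 1) for Sp in combinations(range(N), k))
print(ok)                                     # True: the per-round bound of Lemma 2 holds
```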

3.5. Step 4: Proving the main result

With all the established ingredients, we can finalize the proof of the lower bound on the regret.

Proof of Theorem 1. Since $v^S \in \mathcal{V}$ for all $S \in \mathcal{S}_K$ and by Lemma 1, we know that
$$\begin{aligned}
\Delta_\pi(T) &\geq \frac{1}{|\mathcal{S}_K|} \sum_{S \in \mathcal{S}_K} \Delta_\pi(T, v^S) \\
&\geq c_2 \epsilon \Bigg( T - \underbrace{\frac{1}{|\mathcal{S}_K|} \sum_{S \in \mathcal{S}_K} \frac{1}{K} \sum_{i \in S} \mathbb{E}_S[N_i]}_{(a)} \Bigg). \qquad (3.10)
\end{aligned}$$

We decompose $(a)$ into two terms:
$$(a) = \underbrace{\frac{1}{|\mathcal{S}_K|} \sum_{S \in \mathcal{S}_K} \frac{1}{K} \sum_{i \in S} \mathbb{E}_{S \setminus i}[N_i]}_{(b)} + \underbrace{\frac{1}{|\mathcal{S}_K|} \sum_{S \in \mathcal{S}_K} \frac{1}{K} \sum_{i \in S} \left( \mathbb{E}_S[N_i] - \mathbb{E}_{S \setminus i}[N_i] \right)}_{(c)}.$$

Recall that $c = K/N \in (0, 1/2)$. By summing over $S' = S \setminus i$ instead of over $S$, we bound $(b)$ from above by
$$(b) = \frac{1}{|\mathcal{S}_K|} \sum_{S' \in \mathcal{S}_{K-1}} \frac{1}{K} \sum_{i \notin S'} \mathbb{E}_{S'}[N_i] \leq \frac{|\mathcal{S}_{K-1}|}{|\mathcal{S}_K|}\, T \leq \frac{c}{1 - c}\, T,$$
where the first inequality follows from $\sum_{i \in [N]} \mathbb{E}_{S'}[N_i] \leq TK$, and the second inequality from
$$\frac{|\mathcal{S}_{K-1}|}{|\mathcal{S}_K|} = \frac{\binom{N}{K-1}}{\binom{N}{K}} = \frac{K}{N - K + 1} \leq \frac{K/N}{1 - K/N}.$$
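The binomial identity behind the last display is easily verified numerically; the values below are arbitrary illustrative choices.

```python
from math import comb

N, K = 20, 7
print(comb(N, K - 1) / comb(N, K), K / (N - K + 1))      # both equal 0.5
print(K / (N - K + 1) <= (K / N) / (1 - K / N))          # True, since N - K + 1 >= N - K
```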

Now, $(c)$ can be bounded by applying Lemma 2:
$$(c) \leq \sqrt{2 c_3}\, \frac{\epsilon T^{3/2}}{\sqrt{K}} = \frac{\sqrt{2 c_3}\, \epsilon T^{3/2}}{\sqrt{c}\sqrt{N}}.$$


By plugging the upper bounds on $(b)$ and $(c)$ in (3.10), we obtain
$$\begin{aligned}
\Delta_\pi(T) &\geq c_2 \epsilon \left( T - \frac{c}{1 - c}\, T - \frac{\sqrt{2 c_3}\, \epsilon T^{3/2}}{\sqrt{c}\sqrt{N}} \right) \\
&= c_2 \epsilon \left( \frac{1 - 2c}{1 - c}\, T - \frac{\sqrt{2 c_3}\, \epsilon T^{3/2}}{\sqrt{c}\sqrt{N}} \right).
\end{aligned}$$
Now, we set $\epsilon$ as
$$\epsilon = \min\left\{ 1,\ \frac{(1 - 2c)\sqrt{c}}{2(1 - c)\sqrt{2 c_3}} \sqrt{N/T} \right\}.$$
This yields, for all $T \geq N$,
$$\Delta_\pi(T) \geq \min\left\{ \frac{c_2 \sqrt{2 c_3}}{\sqrt{c}}\, T,\ \frac{c_2 (1 - 2c)^2 \sqrt{c}}{8 (1 - c) \sqrt{2 c_3}} \sqrt{NT} \right\}.$$
Finally, note that for $T \geq N$ it follows that $T \geq \sqrt{NT}$ and therefore
$$\Delta_\pi(T) \geq c_1 \sqrt{NT},$$

where
$$c_1 := \min\left\{ \frac{c_2 \sqrt{2 c_3}}{\sqrt{c}},\ \frac{c_2 (1 - 2c)^2 \sqrt{c}}{8 (1 - c) \sqrt{2 c_3}} \right\} > 0. \qquad \square$$
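To see that these constants are strictly positive for a given revenue vector, one can evaluate them directly; the short sketch below (with an illustrative choice of $w = \min_i w_i$ and $c = K/N$) computes $s$, $c_2$, $c_3$, and $c_1$ as defined above.

```python
def lower_bound_constants(w_min, c):
    """Evaluate s, c2, c3, c1 from the proof for w_min = min_i w_i in (0, 1] and c = K/N < 1/2."""
    s = w_min**2 / (3 + 2 * w_min)
    c2 = (w_min - s) / (w_min + s) * (s - 3 * s**2 / (w_min - s)**2)
    c3 = s * (w_min + s) / (w_min - s)**2 * (s / (w_min - s) + (4 * s**2 + w_min**2) / (w_min - s)**2)
    c1 = min(c2 * (2 * c3)**0.5 / c**0.5,
             c2 * (1 - 2 * c)**2 * c**0.5 / (8 * (1 - c) * (2 * c3)**0.5))
    return s, c2, c3, c1

print(lower_bound_constants(w_min=1.0, c=0.25))   # all four constants are positive
```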

Acknowledgments. The authors thank the Editor in Chief and an anonymous reviewer for their positive remarks and useful comments.

Competing interest. The authors declare no conflict of interest.

References

[1] Agrawal, S., Avadhanula, V., Goyal, V., & Zeevi, A. (2017). Thompson sampling for the MNL-bandit. In Conference on Learning Theory (COLT), pp. 76–78.
[2] Agrawal, S., Avadhanula, V., Goyal, V., & Zeevi, A. (2019). MNL-bandit: a dynamic learning approach to assortment selection. Operations Research 67(5): 1453–1485.
[3] Chen, X. & Wang, Y. (2018). A note on a tight lower bound for MNL-bandit assortment selection models. Operations Research Letters 46(5): 534–537.
[4] Chen, X., Wang, Y., & Zhou, Y. (2018). An optimal policy for dynamic assortment planning under uncapacitated multinomial logit models. ArXiv e-print. https://arxiv.org/abs/1805.04785.
[5] Cheung, W.C. & Simchi-Levi, D. (2017). Assortment optimization under unknown multinomial logit choice models. ArXiv e-print. https://arxiv.org/abs/1704.00108.
[6] Farias, V.F., Jagabathula, S., & Shah, D. (2013). A nonparametric approach to modeling choice with limited data. Management Science 59(2): 305–322.
[7] Kallus, N. & Udell, M. (2016). Dynamic assortment personalization in high dimensions. ArXiv e-print. https://arxiv.org/abs/1610.05604.
[8] Levin, D.A., Peres, Y., & Wilmer, E.L. (2017). Markov chains and mixing times, 2nd ed. Providence, RI: American Mathematical Society.
[9] Rusmevichientong, P., Shen, Z.-J.M., & Shmoys, D.B. (2010). Dynamic assortment optimization with a multinomial logit choice model and capacity constraint. Operations Research 58(6): 1666–1680.
[10] Sauré, D. & Zeevi, A. (2013). Optimal dynamic assortment planning with demand learning. Manufacturing & Service Operations Management 15(3): 387–404.

