
arXiv:1607.02914v1 [cs.IT] 11 Jul 2016

Minimum Description Length Principle in Supervised Learning with Application to Lasso

Masanori Kawakita and Jun'ichi Takeuchi

Abstract—The minimum description length (MDL) principle in supervised learning is studied. One of the most important theories for the MDL principle is Barron and Cover's theory (BC theory), which gives a mathematical justification of the MDL principle. The original BC theory, however, can be applied to supervised learning only approximately and limitedly. Though Barron et al. recently succeeded in removing a similar approximation in the case of unsupervised learning, their idea cannot be essentially applied to supervised learning in general. To overcome this issue, an extension of BC theory to supervised learning is proposed. The derived risk bound has several advantages inherited from the original BC theory. First, the risk bound holds for finite sample size. Second, it requires remarkably few assumptions. Third, the risk bound has the form of the redundancy of a two-stage code, like the MDL procedure. Hence, the proposed extension gives a mathematical justification of the MDL principle in supervised learning. As an important example of application, new risk and (probabilistic) regret bounds of lasso with random design are derived. The derived risk bound holds for any finite sample size n and feature number p, even if n ≪ p, without boundedness of features, in contrast to past work. Behavior of the regret bound is investigated by numerical simulations. We believe that this is the first extension of BC theory to supervised learning with random design without approximation.

Index Terms—lasso, risk bound, random design, MDL principle, supervised learning.

M. Kawakita and J. Takeuchi are with the faculty of Informatics and Electrical Engineering, Kyushu University, 744 Motooka, Nishi-Ku, Fukuoka city, 819-0395 Japan (e-mail: [email protected]). This work was supported in part by JSPS KAKENHI Grant Number 25870503 and by the Okawa Foundation for Information and Telecommunications. This material will be presented in part at the 33rd International Conference on Machine Learning, New York city, NY, USA.

I. INTRODUCTION

There have been various techniques to evaluate the performance of machine learning methods theoretically. Taking lasso [1] as an example, lasso has been analyzed by empirical process theory [2], [3], [4], nonparametric statistics [5], [6], statistical physics [7], [8], [9] and so on. In general, most of these techniques require either an asymptotic assumption (sample number n and/or feature number p go to infinity) or various technical assumptions like boundedness of features. Some of them are too restrictive for practical use. In this paper, we try to develop another way to evaluate the performance of machine learning methods with as few assumptions as possible. An important candidate for this purpose is Barron and Cover's theory (BC theory), which is one of the most famous results for the MDL principle [10], [11], [12], [13], [14]. The MDL principle claims that the shortest description of a given set of data leads to the best hypotheses about the data source.
A famous model selection criterion based on the MDL principle was proposed by Rissanen [10]. This criterion corresponds to the codelength of a two-stage code, in which one first encodes a statistical model used to encode the data and then encodes the data with that statistical model. An MDL estimator is defined as the minimizer of the total codelength of this two-stage code. The theory of Barron and Cover [15] guarantees that the risk of the MDL estimator, defined with the Rényi divergence [16], is tightly bounded from above by the redundancy of the corresponding two-stage code. Because the shortest description of the data in terms of two-stage codes yields the smallest risk upper bound, this result gives a mathematical justification of the MDL principle. Furthermore, BC theory holds for finite n without any complicated technical conditions. However, BC theory has been applied to supervised learning only approximately or limitedly. Though it seems to be widely recognized that BC theory is applicable to both unsupervised and supervised learning, and this is not false, BC theory actually cannot be applied to supervised learning without a certain condition (Condition 1 defined in Section III). This condition is critical in the sense that its lack breaks a key technique of BC theory. The literature [17] is, to our knowledge, the only example of application of BC theory to supervised learning. However, that work assumed a specific setting in which Condition 1 is satisfied, and the derived risk bound may not be sufficiently tight due to imposing Condition 1 forcedly, which will be explained in Section III. Another well-recognized disadvantage is the necessity of quantization of the parameter space. Barron et al. proposed a way to avoid the quantization [18], [19] and derived a risk bound of lasso in the fixed design case as an example. However, their idea cannot be applied to supervised learning in general. The main difficulty again stems essentially from Condition 1. Actually, the risk bound of lasso with fixed design was derived in an essentially unsupervised setting, as explained later. The fixed design, however, is not satisfactory for evaluating the generalization error of supervised learning, and the random design setting is essentially more difficult to solve. In this paper, we propose an extension of BC theory to supervised learning with random design without quantization. The derived risk bound inherits most of the advantages of the original BC theory. The main term of our risk bound again has the form of the redundancy of a two-stage code. Thus, our extension also gives a mathematical justification of the MDL principle in supervised learning. It should be remarked, however, that an additional condition is required for the exact redundancy interpretation. We also derive new risk and regret bounds of lasso with random design under normality of features. This application is not trivial at all and requires much more effort than both the above extension itself and its derivation in fixed design cases. We will try to derive those bounds in a manner not specific to our setting but rather applicable to several other settings. Interestingly, the redundancy and regret interpretations of the above bounds are exactly justified without any additional condition in the case of lasso.

Such a codelength, given x^n, can again be decomposed by the chain rule as

L̃(y^n, θ̃ | x^n) = L̃(y^n | x^n, θ̃) + L̃(θ̃ | x^n).
Here, L̃(y^n | x^n, θ̃) expresses a codelength to describe y^n using the model indexed by θ̃, given x^n. The greatest advantage of our theory is that it requires almost no assumptions: neither an asymptotic assumption (n → ∞) nor technical assumptions such as boundedness of features.
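As a concrete illustration of the two-stage code discussed above, the following minimal sketch (Python) picks, over a quantized grid, the parameter that minimizes the total description length L̃(θ̃) + (−log p_θ̃(y^n | x^n)). The Gaussian regression model, the grid, and the uniform code on it are toy assumptions for illustration, not the paper's exact construction.

import itertools
import numpy as np

def neg_log_lik(theta, X, y, sigma2=1.0):
    # -log p_theta(y^n | x^n) in nats for y_i ~ N(x_i^T theta, sigma2)
    n = len(y)
    resid = y - X @ theta
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * resid @ resid / sigma2

def mdl_two_stage(X, y, candidates, model_codelength):
    # minimize the total two-stage codelength  L~(theta~) + (-log p_theta~(y^n | x^n))
    totals = [model_codelength(th) + neg_log_lik(th, X, y) for th in candidates]
    best = int(np.argmin(totals))
    return candidates[best], totals[best]

rng = np.random.default_rng(0)
n, p = 50, 2
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5]) + rng.normal(size=n)

grid = np.arange(-2.0, 2.25, 0.25)                  # quantized parameter values
candidates = [np.array(t) for t in itertools.product(grid, repeat=p)]
uniform_code = lambda th: p * np.log(len(grid))     # log(grid size) nats per coordinate

theta_mdl, total = mdl_two_stage(X, y, candidates, uniform_code)
print(theta_mdl, total)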

BC theory requires the model description length to satisfy a little bit stronger Kraft's inequality, defined as follows.

Definition 1. Let β be a real number in (0, 1). We say that a function h(θ̃) satisfies the β-stronger Kraft's inequality if

Σ_θ̃ exp( −β h(θ̃) ) ≤ 1,

where the summation is taken over the range of θ̃ in its context.
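To make Definition 1 concrete, the small numerical check below (a toy example with assumed codelengths, not taken from the paper) illustrates the condition: an ordinary prefix code satisfies Σ exp(−ℓ) ≤ 1, so inflating its lengths by 1/β gives a function satisfying the β-stronger Kraft's inequality, whereas the uninflated lengths generally do not.

import numpy as np

beta = 0.5
# Codelengths (in nats) of a valid prefix code on 8 symbols: log(8) nats each,
# so sum exp(-ell) = 8 * (1/8) = 1 and ordinary Kraft's inequality is tight.
ell = np.full(8, np.log(8.0))
h = ell / beta                       # inflate lengths by 1/beta

print(np.exp(-ell).sum())            # 1.0  (ordinary Kraft's inequality)
print(np.exp(-beta * h).sum())       # 1.0  (beta-stronger Kraft's inequality holds for h)
print(np.exp(-beta * ell).sum())     # about 2.83 > 1: ell itself is NOT beta-stronger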

The following condition is indispensable for the application of BC theory to supervised learning.

Condition 1 (indispensable condition). Both the quantized space and the model description length are independent of x^n, i.e.,

Θ̃(x^n) = Θ̃,  L̃(θ̃ | x^n) = L̃(θ̃).   (5)

Under Condition 1, BC theory [15] gives the following two theorems for supervised learning. Though these theorems were shown only for the case of the Hellinger distance in the original literature [15], we state them with the Rényi divergence.

Theorem 2. Let β be a real number in (0, 1). Assume that L̃ satisfies the β-stronger Kraft's inequality. Under Condition 1,

E_{p̄_*(x^n,y^n)}[ d_λ^n(p_*, p_θ̈) ]
  ≤ E_{p̄_*(x^n,y^n)}[ inf_{θ̃∈Θ̃} { log ( p_*(y^n|x^n) / p_θ̃(y^n|x^n) ) + L̃(θ̃) } ]   (6)
  = E_{p̄_*(x^n,y^n)}[ log ( p_*(y^n|x^n) / p̃_2(y^n|x^n) ) ]   (7)

for any λ ∈ (0, 1 − β].

Theorem 3. Let β be a real number in (0, 1). Assume that L̃ satisfies the β-stronger Kraft's inequality. Under Condition 1,

Pr( (1/n) d_λ^n(p_*, p_θ̈) − (1/n) log ( p_*(y^n|x^n) / p̃_2(y^n|x^n) ) ≥ τ ) ≤ e^{−nτβ}

for any λ ∈ (0, 1 − β].

Since the right side of (7) is just the redundancy of the prefix two-stage code, Theorem 2 implies that we obtain the smallest upper bound of the risk by compressing the data most with the two-stage code. That is, Theorem 2 is a mathematical justification of the MDL principle. We remark that, by interchanging the infimum and the expectation in (6), the right side of (6) becomes a quantity called the "index of resolvability" [15], which is an upper bound of the redundancy. It is remarkable that BC theory requires no assumption except Condition 1 and the β-stronger Kraft's inequality. However, Condition 1 is a somewhat severe restriction. Both the quantization and the model description length can depend on x^n in the definitions. In view of the MDL principle, this dependence is favorable because the total description length can then be minimized according to x^n flexibly. If we instead use a model description length that is uniform over X^n, the total codelength must be longer in general. Hence, a data-dependent model description length is more desirable. Actually, this observation suggests that the bound derived in [17] may not be sufficiently tight. In addition, the restriction imposed by Condition 1 excludes a practically important case, 'lasso with column normalization' (explained below), from the scope of application. However, it is essentially difficult to remove this restriction, as noted in Section I.

Another concern is quantization. The quantization for the encoding is natural in view of the MDL principle. Our target, however, is an application to usual estimators or machine learning algorithms themselves, including lasso. A trivial example of such an application is the penalized maximum likelihood estimator (PMLE)

θ̂(x^n, y^n) := argmin_{θ∈Θ} { −log p_θ(y^n | x^n) + L(θ | x^n) },

where L : Θ × X^n → [0, ∞) is a certain penalty. Similarly to the quantized case, let us define

p_2(y^n | x^n) := p_θ̂(y^n | x^n) · exp( −L(θ̂ | x^n) ),

that is,

−log p_2(y^n | x^n) = min_{θ∈Θ} { −log p_θ(y^n | x^n) + L(θ | x^n) }.

Note that, however, p_2(y^n | x^n) is not necessarily a sub-probability distribution, in contrast to the quantized case; this will be discussed in detail in Section IV-A. The PMLE is a wide class of estimators including many useful methods like Ridge regression [21], lasso, the Dantzig Selector [22] and any Maximum-A-Posteriori estimator of Bayes estimation. If we can accept θ̈ as an approximation of θ̂ (by taking L̃ = L), we have a risk bound by direct application of BC theory. However, the quantization is unnatural in view of machine learning applications. Besides, we cannot use any data-dependent L. Barron et al. proposed an important notion, 'risk validity', to remove the quantization [23], [19], [24].

Definition 4 (risk validity). Let β be a real number in (0, 1) and λ be a real number in (0, 1 − β]. For fixed x^n, we say that a penalty function L(θ | x^n) is risk valid if there exist a quantized space Θ̃(x^n) ⊂ Θ and a model description length L̃(θ̃ | x^n) satisfying the β-stronger Kraft's inequality such that Θ̃(x^n) and L̃(θ̃ | x^n) satisfy

∀y^n ∈ Y^n,
max_{θ∈Θ} { d_λ(p_*, p_θ | x^n) − log ( p_*(y^n|x^n) / p_θ(y^n|x^n) ) − L(θ | x^n) }
  ≤ max_{θ̃∈Θ̃(x^n)} { d_λ(p_*, p_θ̃ | x^n) − log ( p_*(y^n|x^n) / p_θ̃(y^n|x^n) ) − L̃(θ̃ | x^n) },   (8)

where

d_λ(p, r | x^n) := −(1/(1 − λ)) log E_{p(y^n|x^n)}[ ( r(y^n|x^n) / p(y^n|x^n) )^{1−λ} ].

Note that their original definition in [19] was presented only for the case where λ = 1 − β. Here, d_λ(p, r | x^n) is the Rényi divergence for fixed design (x^n is fixed). Hence, d_λ(p, r | x^n) does not depend on q_*(x^n), in contrast to the Rényi divergence for random design d_λ(p, r) defined by (1). Barron et al. proved that θ̂ has bounds similar to Theorems 2 and 3 for any risk valid penalty in the fixed design case. Their way is excellent because it does not require any additional condition other than the risk validity. However, the risk evaluation only for a particular x^n, like E_{p_*(y^n|x^n)}[ d_λ(p_*, p_θ̂ | x^n) ], is unsatisfactory for supervised learning.
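To illustrate the difference between the two divergences just contrasted, the sketch below evaluates both for a Gaussian linear model. The per-sample fixed-design divergence is the convex quadratic λ‖X(θ − θ*)‖² / (2nσ²) (a standard Rényi-between-Gaussians computation, consistent with the paper's later remark that the fixed-design divergence is essentially a mean square error), while the single-sample random-design divergence has the non-convex closed form log(1 + t/c) / (2(1 − λ)), with c = σ²/(λ(1 − λ)) and t = (θ − θ*)ᵀ Σ (θ − θ*), which the paper derives later in Section V; here it is checked against a Monte Carlo estimate. All numbers are toy values.

import numpy as np

rng = np.random.default_rng(5)
lam, sigma2, p = 0.6, 1.0, 3
Sigma = np.diag([1.0, 2.0, 0.5])
theta_star = np.array([1.0, 0.0, -1.0])
theta = np.array([0.5, 0.5, -1.0])
tdiff = theta - theta_star

# Random design: closed form vs Monte Carlo estimate of -log E[(p_theta/p_*)^(1-lam)]/(1-lam)
c = sigma2 / (lam * (1 - lam))
t = tdiff @ Sigma @ tdiff
d_random = np.log(1 + t / c) / (2 * (1 - lam))

m = 400000
x = rng.multivariate_normal(np.zeros(p), Sigma, size=m)
y = x @ theta_star + np.sqrt(sigma2) * rng.normal(size=m)
log_ratio = ((y - x @ theta_star) ** 2 - (y - x @ theta) ** 2) / (2 * sigma2)  # log(p_theta/p_*)
d_mc = -np.log(np.mean(np.exp((1 - lam) * log_ratio))) / (1 - lam)

# Fixed design: per-sample divergence for one sampled design matrix X with n rows
n = 200
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
d_fixed = lam * np.sum((X @ tdiff) ** 2) / (2 * sigma2 * n)

print(d_random, d_mc, d_fixed)   # d_random and d_mc should agree to Monte Carlo accuracy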

In order to evaluate the so-called 'generalization error' of supervised learning, we need to evaluate the risk with random design, i.e., E_{p̄_*(x^n,y^n)}[ d_λ^n(p_*, p_θ̂) ]. However, it is essentially difficult to apply their idea to random design cases as it is. Let us explain this by using lasso as an example. Readers unfamiliar with lasso can refer to the head of Section IV-B for its definition. By extending the definition of risk validity to random design straightforwardly, we obtain the following definition.

Definition 5 (risk validity in random design). Let β be a real number in (0, 1) and λ be a real number in (0, 1 − β]. We say that a penalty function L(θ | x^n) is risk valid if there exist a quantized space Θ̃ ⊂ Θ and a model description length L̃(θ̃) satisfying the β-stronger Kraft's inequality such that

∀x^n ∈ X^n, ∀y^n ∈ Y^n,
max_{θ∈Θ} { d_λ^n(p_*, p_θ) − log ( p_*(y^n|x^n) / p_θ(y^n|x^n) ) − L(θ | x^n) }
  ≤ max_{θ̃∈Θ̃} { d_λ^n(p_*, p_θ̃) − log ( p_*(y^n|x^n) / p_θ̃(y^n|x^n) ) − L̃(θ̃) }.   (9)

In contrast to the fixed design case, (8) must hold not only for a fixed x^n ∈ X^n but for all x^n ∈ X^n. In addition, Θ̃ and L̃(θ̃) must be independent of x^n due to Condition 1. The form of the Rényi divergence d_λ^n(p_*, p_θ) also differs from d_λ(p_*, p_θ | x^n) of the fixed design case in general. Let us rewrite (9) equivalently as

∀x^n ∈ X^n, ∀y^n ∈ Y^n, ∀θ ∈ Θ,
min_{θ̃∈Θ̃} { d_λ^n(p_*, p_θ) − d_λ^n(p_*, p_θ̃) + log ( p_θ(y^n|x^n) / p_θ̃(y^n|x^n) ) + L̃(θ̃) } ≤ L(θ | x^n).   (10)

For short, we write the inside part of the minimum on the left side of (10) as H(θ, θ̃, x^n, y^n). We need to evaluate min_θ̃ { H(θ, θ̃, x^n, y^n) } in order to derive risk valid penalties. However, this seems to be considerably difficult. To our knowledge, the technique used by Chatterjee and Barron [19] is the best way to evaluate it, so we also employ it in this paper. A key premise of their idea is that taking θ̃ close to θ is not a bad choice for evaluating min_θ̃ H(θ, θ̃, x^n, y^n). Regardless of whether this is true or not, the premise seems natural and meaningful in the following sense. If we quantize the parameter space finely enough, the quantized estimator θ̈ is expected to behave almost like θ̂ with the same penalty and is expected to have a similar risk bound. If we could take θ̃ = θ, then H(θ, θ̃, x^n, y^n) would equal L̃(θ), which would imply that L̃(θ) is a risk valid penalty with a risk bound similar to the quantized case. Note that, however, we cannot match θ̃ to θ exactly because θ̃ must lie on the fixed quantized space Θ̃. So, Chatterjee and Barron randomized θ̃ over the grid points of Θ̃ around θ and evaluated the expectation with respect to it. This is clearly justified because min_θ̃ { H(θ, θ̃, x^n, y^n) } ≤ E_θ̃[ H(θ, θ̃, x^n, y^n) ]. By using a carefully tuned randomization, they succeeded in removing the dependency of E_θ̃[ H(θ, θ̃, x^n, y^n) ] on y^n. Let us write the resultant expectation as H′(θ, x^n) := E_θ̃[ H(θ, θ̃, x^n, y^n) ] for convenience. Any upper bound L(θ | x^n) of H′(θ, x^n) is a risk valid penalty. By this fact, risk valid penalties should basically depend on x^n in general. If not (L(θ | x^n) = L(θ)), L(θ) must bound max_{x^n} H′(θ, x^n), which makes L(θ) much larger. This is again unfavorable in view of the MDL principle. In particular, H′(θ, x^n) includes a term that is unbounded with regard to x^n in linear regression cases, which originates from the third term of the left side of (10). This can be seen by checking Section III of [19]. Though their setting is fixed design, this fact is also true for random design. Hence, as long as we use their technique, derived risk valid penalties must depend on x^n in such cases. However, the ℓ1 norm used in the usual lasso does not depend on x^n. Hence, the risk validity seems to be useless for lasso. However, the following weighted ℓ1 norm plays an important role here:

‖θ‖_{w,1} := Σ_{j=1}^p w_j |θ_j|,  where w := (w_1, ..., w_p)^T,  w_j := sqrt( (1/n) Σ_{i=1}^n x_ij² ).

The lasso with this weighted ℓ1 norm is equivalent to an ordinary lasso with column normalization, such that each column of the design matrix has the same norm. The column normalization is theoretically and practically important. Hence, we try to find a risk valid penalty of the form L_1(θ | x^n) = µ_1 ‖θ‖_{w,1} + µ_2, where µ_1 and µ_2 are real coefficients. Indeed, there seems to be no other useful penalty dependent on x^n for the usual lasso. In contrast to fixed design cases, however, there are severe difficulties in deriving a meaningful risk bound with this penalty. We explain this intuitively. The main difficulty is caused by Condition 1. As described above, our strategy is to take θ̃ close to θ. Suppose now that this is ideally almost realizable for any choice of x^n, y^n, θ. This implies that H(θ, θ̃, x^n, y^n) is almost equal to L̃(θ). On the other hand, for each fixed θ, the weighted ℓ1 norm of θ can be made arbitrarily small by making x^n small accordingly. Therefore, the penalty µ_1 ‖θ‖_{w,1} + µ_2 is almost equal to µ_2 in this case. This implies that µ_2 must bound max_θ L̃(θ), which is infinity in general. If L̃ could depend on x^n, we could resolve this problem. However, L̃ must be independent of x^n. This issue does not seem to be specific to lasso. Another major issue is the Rényi divergence d_λ^n(p_*, p_θ). In the fixed design case, the Rényi divergence d_λ(p_*, p_θ | x^n) is a simple convex function of θ, which makes its analysis easy. In contrast, the Rényi divergence d_λ^n(p_*, p_θ) in the random design case is not convex and is more complicated than that of fixed design cases, which makes it difficult to analyze. We will describe why the non-convexity of the loss function makes the analysis difficult in Section V-G. The difficulties that we face when we use the techniques of [19] in the random design case are not limited to these. We do not explain them here because doing so would require the readers to understand their techniques in detail. However, we remark that these difficulties seem to make their techniques useless for supervised learning with random design. We propose a remedy to solve these issues in a lump in the next section.
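A small numerical sketch (Python, toy data) of the weighted ℓ1 norm defined above and of its equivalence to an ordinary ℓ1 penalty after column normalization of the design matrix:

import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 8
X = rng.normal(size=(n, p)) * rng.uniform(0.5, 3.0, size=p)  # columns with different scales
theta = rng.normal(size=p)

w = np.sqrt((X ** 2).mean(axis=0))        # w_j = sqrt((1/n) * sum_i x_ij^2)
weighted_l1 = np.sum(w * np.abs(theta))   # ||theta||_{w,1}

# Column normalization: X' = X / w (each column gets unit root-mean-square) and
# theta' = w * theta, so that X theta = X' theta' and ||theta||_{w,1} = ||theta'||_1.
X_norm = X / w
theta_prime = w * theta

print(np.allclose(X @ theta, X_norm @ theta_prime))          # True
print(np.isclose(weighted_l1, np.abs(theta_prime).sum()))    # True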

IV. MAIN RESULTS that both bounds become tightest when λ = 1 β because n − In this section, we propose a way to extend BC theory to the R´enyi divergence dλ(p, r) is monotonically increasing supervised learning and derive a new risk bound of lasso. in terms of λ (see [12] for example). We call the quantity n n n n log(1/p2(y x )) ( log(1/p∗(y x ))) in Theorem 8 − | − − | n n ‘regret’ of the two-stage code p2 on the given data (x ,y ) A. Extension of BC Theory to Supervised Learning in this paper, though the ordinary regret is defined as the There are several possible approaches to extend BC theory codelength difference from log(1/p (yn xn)), where θˆ θˆmle mle to supervised learning. A major concern is how tight a resultant denotes the maximum likelihood estimator.| Compared to the n risk bound is. Below, we propose a way that gives a tight risk usual BC theory, there is an additional term (1/β) log(1/Pǫ ) upper bound for at least lasso. A key idea is to modify the in the risk bound (11). Due to the property of the typical set, risk validity condition by introducing a so-called typical set of this term decreases to zero as n . Therefore, the first xn. We postulate that a probability distribution of stochastic term is the main term, which has→ a form∞ of redundancy of process x1, x2, , is a member of a certain class Px. two-stage code like the quantized case. Hence, this theorem ··· Pn Furthermore, we define x by the set of marginal distribution gives a justification of the MDL principle in supervised n P n n of x1, x2, , x of all elements of x. We assume that learning. Note that, however, log p2(y x ) needs to satisfy ··· n Pn − | we can define a typical set Aǫ for each q∗ x , i.e., Kraft’s inequality in order to interpret the main term as a n n ∈ Pr(x A ) 1 as n . This is possible if q∗ is conditional redundancy exactly. A sufficient conditions for this ∈ ǫ → → ∞ stationary and ergodic for example. See [25] for detail. For was introduced by [24] and is called ‘codelength validity’. short, Pr(xn An) is written as P n hereafter. We modify the ∈ ǫ ǫ Definition 9 (codelength validity). We say that L(θ xn) is risk validity by using the typical set. | codelength valid if there exist a quantized subset Θ(xn) Definition 6 (ǫ-risk validity). Let β,ǫ be real numbers in Θ and a model description length L˜(θ˜ xn) satisfying Kraft’s⊂ n | (0, 1) and λ be a real number in (0, 1 β]. We say that L(θ x ) inequality such that e Pn n − Pn | is ǫ-risk valid for (λ, β, x , Aǫ ) if for any q∗ x , there n n ∈ p∗(y x ) exist a quantized subset Θ(q ) Θ and a model description n Y n n ∗ y , max log n| n L(θ x ) ˜ ˜ ⊂ ∀ ∈ θ∈Θ − pθ(y x ) − | length L(θ q∗) satisfying β-stronger Kraft’s inequality such | | n p (yn xn) o that e ∗ ˜ ˜ n (13) max log n| n L(θ x ) ≤ ˜ e n − p (y x ) − | n n n Y n θ∈Θ(x ) θ˜ x Aǫ , y , n | o n ∀ ∈ ∀ ∈ n n for each x . n p∗(y x ) n max dλ(p∗,pθ) log n| n L(θ x ) θ∈Θ − pθ(y x ) − | We note that both the quantization and the model description n | n n o n n p∗(y x ) length on it depend on x in contrast to the ǫ-risk validity. max d (p∗,p˜) log | L˜(θ˜ q∗) . e λ θ n n This is because the fixed design setting suffices to justify the ≤ θ˜∈Θ(q∗) − pθ˜(y x ) − | n | o redundancy interpretation. Let us see that log p2(y x) can be Note that both Θ and L˜ can depend on the unknown exactly interpreted as a codelength if L(−θ xn) is codelength| n | distribution q∗(x ). This is not problematic because the fi- valid. First, we assume that Y , the range of y, is discrete. 
For n n nal penalty L doese not depend on the unknown q∗(x ). A each x , we have n difference from (10) is the restriction of the range of x onto n n the typical set. From here to the next section, we will see exp( ( log p2(y x ))) n Y n − − | how this small change solves the problems described in the y X∈ previous section. First, we show what can be proved for ǫ-risk n n n = exp max log pθ(y x ) L(θ x ) θ∈Θ { | − | } valid penalties. yn X   Theorem 7 (risk bound). Define En as a conditional ǫ n n ˜ ˜ n n n n n exp max log pθ˜(y x ) L(θ x ) expectation with regard to p¯∗(x ,y ) given that x Aǫ . Let ≤ ˜ e n | − | ∈ yn θ∈Θ(x ) ! β,ǫ be arbitrary real numbers in (0, 1). For any λ (0, 1 β], X n o n Pn n ∈ − n n ˜ ˜ n if L(θ x ) is ǫ-risk valid for (λ, β, x , Aǫ ), exp log pθ˜(y x ) L(θ x ) | ≤ n e | − | n n y θ˜∈Θ(xn) p∗(y x ) 1 1 X X   Endn(p ,p ) En log | + log . (11) ǫ λ ∗ θˆ ǫ n n n ˜ ˜ n n n ≤ p2(y x ) β Pǫ = exp L(θ x ) pθ˜(y x ) 1. | − | n | ≤ e n y Theorem 8 (regret bound). Let β,ǫ be arbitrary real θ˜∈XΘ(x )   X n numbers in (0, 1). For any λ (0, 1 β], if L(θ x ) is ǫ- Hence, log p (yn xn) can be exactly interpreted as a code- Pn n ∈ − | − 2 | risk valid for (λ, β, x , Aǫ ), length of a prefix code. Next, we consider the case where Y n n n is a continuous space. The above inequality trivially holds by dλ(p∗,pθˆ) 1 p∗(y x ) n Pr log n| n τ replacing the sum with respect to y with an integral. Thus, n − n p2(y x ) ≥ n n  n |  p2(y x ) is guaranteed to be a sub-probability density func- exp( nτβ)+1 Pǫ . (12) | n n ≤ − − tion. Needless to say, log p2(y x ) cannot be interpreted as A proof of Theorem 7 is described in Section V-A, while a codelength as itself− in continuous| cases. As is well known, a proof of Theorem 8 is described in Section V-B. Note however, a difference ( log p (yn xn)) ( log p (yn xn)) − 2 | − − ∗ | 6 can be exactly interpreted as a codelength difference by way of In particular, taking β = (α + 1)/2 yields the tightest bound quantization. See Section III of [15] for details. This indicates n n D n 1 p∗(y x ) that both the redundancy interpretation of the fist term of (11) Ep¯∗ [ α (p∗,pˆ)] Ep¯∗ log | θ ≤ λ(α) p (yn xn) and the regret interpretation of the (negative) second term in  2 |  P n 1 the left side of the inequality in the first line of (12) are ǫ + log n justified by the codelength validity. Note that, however, the λ(α)(λ(α)+ α) Pǫ (1 P n) ǫ-risk validity does not imply the codelength validity and vice + − ǫ . (19) versa in general. λ(α)(λ(α)+ α) We discuss about the conditional expectation in the risk Its proof will be described in Section V-C. Though it is not n n bound (11). This conditional expectation seems to be hard so obvious when the condition “p2(y x ) is a sub-probability | to be replaced with the usual (unconditional) expectation. distribution” is satisfied, we remark that the codelength valid- The main difficulty arises from the unboundedness of the ity of L(θ xn) is its simple sufficient condition. The second | loss function. Indeed, we can immediately show a similar and the third terms of the right side vanish as n due → ∞ risk bound with unconditional expectation for bounded loss to the property of the typical set. The boundedness of loss functions. As an example, let us consider a class of divergence, function is indispensable for the proof. On the other hand, called α-divergence [26] it seems to be impossible to bound the risk for unbounded loss functions. 
Our remedy for this issue is the risk evaluation D n α (p, r) := based on the conditional expectation on the typical set. Be- 1+α n n n n 2 cause x lies out of Aǫ with small probability, the conditional 4 r(y x ) n n n n n 1 | q∗(x )p(y x )dx dy . expectation is likely to capture the expectation of almost all 1 α2 − p(yn xn) | Z     cases. In spite of this fact, if one wants to remove the unnatural − | (14) conditional expectation, Theorem 8 offers a more satisfactory The α-divergence approaches KL divergence as α 1 [27]. bound. Note that the right side of (12) also approaches to zero More exactly, →± as n . We→ remark ∞ the relationship of our result with KL divergence D n D n D n D n D n lim α (p, r)= (p, r), lim α (p, r)= (r, p). (p, r). Because of (3) or (15), it seems to be possible α→−1 α→1 (15) to obtain a risk bound with KL divergence. However, it is impossible because taking λ 1 in (11) or α 1 in We also note that the α-divergence with α = 0 is four times → → ± the squared (19) makes the bounds diverge to the infinity. That is, we cannot derive a risk bound for the risk with KL divergence 2,n by BC theory, though we can do it for the R´enyi divergence dH (p, r)= 2 and the α-divergence. It sounds somewhat strange because KL p(yn xn) r(yn xn) q (xn)p(yn xn)dxndyn, (16) | − | ∗ | divergence seems to be related the most to the notion of the Zp p  MDL principle because it has a clear information theoretical which has been studied and used in statistics for a long interpretation. This issue originates from the original BC time. We focus here on the following two properties of α- theory and has been casted as an open problem for a long divergence: time. (i) The α-divergence is always bounded: Finally, we remark that the effectiveness of our proposal in real situations depends on whether we can show the risk D n 2 α (p, r) [0, 4/(1 α )] (17) validity of the target penalty and derive a sufficiently small ∈ − n n bound for log(1/Pǫ ) and 1 Pǫ . Actually, much effort is for any p, r and α ( 1, 1). required to realize them for lasso.− (ii) The α-divergence is∈ bounded− by the R´enyi divergence as 1 α B. Risk Bound of Lasso in Random Design dn (p, r) − D n(p, r) (18) (1−α)/2 ≥ 2 α In this section, we apply the approach in the previous section to lasso and derive new risk and regret bounds. In a setting of for any p, r and α ( 1, 1). See [14] for its proof. p ∈ − lasso, training data (xi,yi) i =1, 2, ,n obey { ∈ℜT ×∗ ℜ| ··· } As a corollary of Theorem 7, we obtain the following risk a usual regression model yi = xi θ + ǫi for i =1, 2, ,n, ∗ ··· bound. where θ is a true parameter and ǫi is a Gaussian noise having zero and a known σ2. By introducing Y := Corollary 1. Let β,ǫ be arbitrary real numbers in (0, 1). (y ,y , ,yn)T , E := (ǫ ,ǫ , ,ǫ )T and an n p matrix Define a function λ(t) := (1 t)/2. For any α [2β 1, 1), if 1 2 1 2 n X := [x···x x ]T , we have··· a vector/matrix expression× of L(θ xn) is ǫ-risk valid for (−λ(α), β, Pn, An) ∈and p−(yn xn) 1 2 n x ǫ 2 the regression··· model Y = Xθ∗ + E . The parameter space Θ is a| sub-probability distribution, | is p. The dimension p of parameter θ can be greater than n. n n ℜ n 1 p∗(y x ) The lasso estimator is defined by E ∗ [D (p ,p )] E ∗ log | p¯ α ∗ θˆ ≤ λ(α) p¯ p (yn xn) 1  2  ˆ n n 2 n | n θ(x ,y ) := arg min 2 Y Xθ 2 + µ1 θ w,1 , Pǫ 1 (1 Pǫ ) θ∈Θ 2nσ k − k k k + log n + − ,  (20) λ(α)β Pǫ λ(α)(λ(α)+ α) 7
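The following is a minimal proximal-gradient (ISTA) sketch for the weighted-ℓ1 lasso objective in (20), ‖Y − Xθ‖² / (2nσ²) + µ1 ‖θ‖_{w,1}. The paper's simulations use a proximal gradient method [28], but the step-size rule, iteration count, regularization level and data below are illustrative assumptions, not the authors' implementation.

import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def weighted_lasso_ista(X, y, mu1, sigma2=1.0, n_iter=2000):
    n, p = X.shape
    w = np.sqrt((X ** 2).mean(axis=0))                        # column weights w_j
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / (n * sigma2))   # 1 / Lipschitz constant of the gradient
    theta = np.zeros(p)
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ theta) / (n * sigma2)
        theta = soft_threshold(theta - step * grad, step * mu1 * w)
    return theta

rng = np.random.default_rng(3)
n, p = 200, 1000                                              # n << p, as in the simulations
X = rng.normal(size=(n, p))
theta_star = np.zeros(p)
theta_star[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]
y = X @ theta_star + rng.normal(size=n)

theta_hat = weighted_lasso_ista(X, y, mu1=0.1)
print(np.nonzero(np.abs(theta_hat) > 1e-6)[0][:10])           # indices of selected features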

where µ1 is a positive real number (regularization coefficient). Note that the weighted ℓ1 norm is used in (20), though the original lasso was defined with the usual ℓ1 norm in [1]. As explained in Section III, θˆ corresponds to the usual lasso with ‘column normalization’. When xn is Gaussian with zero mean, we can derive a risk valid weighted ℓ1 penalty by choosing an appropriate typical set. Lemma 1. For any ǫ (0, 1), define ∈ Pn n n 0 magnification x := q(x ) = Πi=1N(xi , Σ) non-singular Σ , { | |n 2 } n n (1/n) i=1 xij Aǫ := x j, 1 ǫ 1+ ǫ ,(21) ∀ − ≤ Σjj ≤ n P o where N(x µ, Σ) is a Gaussian distribution with a mean
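A Monte Carlo sketch of the typical set A_ǫ^n of (21) under the Gaussian design assumed in Lemma 1 (with Σ = I_p as a toy assumption): x^n belongs to A_ǫ^n if and only if every column's empirical second moment (1/n) Σ_i x_ij² stays within a factor 1 ± ǫ of Σ_jj. The code estimates P_ǫ^n = Pr(x^n ∈ A_ǫ^n) empirically and compares it with the exponential lower bound 1 − 2p·exp(−nǫ²/7) stated in Lemma 2 below.

import numpy as np

rng = np.random.default_rng(4)
n, p, eps, trials = 200, 50, 0.5, 2000
sigma_diag = np.ones(p)                        # Sigma = I_p, so Sigma_jj = 1

def in_typical_set(X, sigma_diag, eps):
    ratios = (X ** 2).mean(axis=0) / sigma_diag
    return np.all((ratios >= 1 - eps) & (ratios <= 1 + eps))

hits = sum(in_typical_set(rng.normal(size=(n, p)), sigma_diag, eps) for _ in range(trials))
print("Monte Carlo estimate of P_eps^n:", hits / trials)
print("Lemma 2 lower bound            :", 1 - 2 * p * np.exp(-n * eps ** 2 / 7))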

| 0 1 2 3 4 5 vector µ and a matrix Σ. Here, Σjj denotes the 0.0 0.2 0.4 0.6 0.8 1.0 jth diagonal element of Σ and xij denotes the jth element of λ xi. Assume a linear regression setting: p (yn xn) = Πn N(y xT θ∗, σ2), Fig. 1. Plot of p(λ + 8)/8(1 − λ) against λ. ∗ | i=1 i| i p (yn xn) = Πn N(y xT θ, σ2). θ | i=1 i| i n Let β be a real number in (0, 1) and λ be a real number in Next, we show that Pǫ exponentially approaches to one as n (0, 1 β]. The weighted ℓ1 penalty L1(θ x )= µ1 θ w,1 +µ2 n increases. − Pn n | k k is ǫ-risk valid for (λ, β, x , Aǫ ) if Lemma 2 (Exponential Bound of Typical Set). Suppose n log 4p λ +8√1 ǫ2 log 2 that xi N(xi 0, Σ) independently. For any ǫ (0, 1), µ1 − , µ2 . (22) ∼ | ∈ ≥ sβσ2(1 ǫ) · 4 ≥ β p − n n We describe its proof in Section V-F. The derivation is much Pǫ 1 2exp (ǫ log(1 + ǫ)) (24) ≥ − − 2 − more complicated and requires more techniques, compared to   n  1 2p exp (ǫ log(1 + ǫ)) the fixed design case in [19]. This is because the R´enyi diver- ≥ − − 2 − gence is a usual mean square error (MSE) in the fixed design  nǫ2  1 2p exp . case, while it is not in the random design case in general. In ≥ − − 7 addition, it is important for the risk bound derivation to choose   an appropriate typical set in a sense that we can show that P n ǫ See Section V-H for its proof. In the lasso case, it is often approaches to one sufficiently fast and we can also show the postulated that p is much greater than n. Due to Lemma 2, ǫ-risk validity of the target penalty with the chosen typical set. 1 P n is O(p exp( nǫ2/7)), which also implies that the In case of lasso with normal design, the typical set An defined ǫ ǫ second− term in (11)· can− be negligibly small even if n p. in (21) satisfies such properties. In this sense, the exponential bound is important for lasso.≪ Let us compare the coefficient of the risk valid weighted ℓ 1 Combining Lemmas 1 and 2 with Theorems 7 and 8, we obtain penalty with the fixed design case in [19]. They showed that the following theorem. the weighted ℓ1 norm satisfying Theorem 10. For any , define 2n log 4p log 2 ǫ (0, 1) µ , µ (23) ∈ 1 ≥ σ2 2 ≥ β r Pn n n 0 x := q(x ) = Πi=1N(xi , Σ) non-singular Σ , is risk valid in the fixed design case. The condition for µ2 { | | } (1/n) n x2 is the same, while the condition for µ1 in (22) is more strict n n i=1 ij Aǫ := x j, 1 ǫ 1+ ǫ . than that of the fixed design case. We compare them by taking ∀ − ≤ Σjj ≤ n P o β = 1 λ (the tightest choice) and ǫ = 0 in (22) because ǫ can be− negligibly small for sufficiently large n. The minimum Assume a linear regression setting: µ1 for the risk validity in the random design case is p (yn xn) = Πn N(y xT θ∗, σ2), λ +8 ∗ i=1 i i n| n n | T 2 8(1 λ) pθ(y x ) = Πi=1N(yi xi θ, σ ). s − | | times that for the fixed design case. Hence, the smallest value Let β be a real number in (0, 1). For any λ (0, 1 β], if of regularization coefficient µ1 for which the risk bound holds ∈ − in the random design is always larger than that of the fixed design case for any λ (0, 1) but its extent is not so large log 4p λ +8√1 ǫ2 log 2 ∈ µ1 − , µ2 , unless λ is extremely close to 1 (See Fig. 1). ≥ snβσ2(1 ǫ) · 4 ≥ nβ − 8 the lasso estimator θˆ(xn,yn) in (20) has a risk bound By the ǫ-risk validity, we obtain
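The comparison with the fixed design case can be reproduced directly: taking β = 1 − λ and ǫ = 0, the smallest risk-valid µ1 in the random design case is sqrt((λ + 8)/(8(1 − λ))) times the fixed design value, which is the factor plotted in Fig. 1. A short computation of this magnification factor (Python):

import numpy as np

def mu1_magnification(lam):
    # sqrt((lambda + 8) / (8 * (1 - lambda))), the ratio of the minimum mu1 values
    return np.sqrt((lam + 8.0) / (8.0 * (1.0 - lam)))

for lam in [0.1, 0.3, 0.5, 0.7, 0.9, 0.99]:
    print(f"lambda = {lam:4.2f}: factor = {mu1_magnification(lam):6.3f}")
# The factor grows slowly with lambda and only blows up as lambda approaches 1.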

n n θ n n n Eǫ [dλ(p∗,pθˆ(xn,yn))] Eǫ exp β max Fλ (x ,y ) L(θ x ) ≤ θ∈Θ − | 2 ∗ 2 h  n ˜ oi n Y Xθ 2 Y Xθ 2 n θ n n ˜ ˜ E inf k − k −k − k + µ θ + µ Eǫ exp β max Fλ (x ,y ) L(θ q∗) ǫ 2 1 w,1 2 ≤ ˜ e − | "θ∈Θ 2nσ  k k # θ∈Θ   h  n ˜ oi n n θ n n E exp β F (x ,y ) L˜(θ˜ q∗) p log 1 2exp 2 (ǫ log(1 + ǫ)) ǫ λ − − − , (25) ≤ e − | θ˜∈Θ(q∗) − nβ  X h   i ˜ ˜ n θ˜ n n and a regret bound = exp( βL θ q∗) E exp βF (x ,y ) . (28) − | ǫ λ ˜ e θ∈XΘ(q∗)  h  i dλ(p∗,pˆ n n ) θ(x ,y ) ≤ The following fact is an extension of the key technique of BC 2 ∗ 2 Y Xθ 2 Y Xθ 2 theory: inf k − k −k − k + µ1 θ w,1 + µ2 + τ θ∈Θ 2nσ2 k k    (26) n θ˜ n n Eǫ exp βFλ (x ,y ) " # with probability at least   n n β p˜(y x ) n p = exp βdn(p ,p ) En θ | 1 2exp (ǫ log(1 + ǫ)) exp( τnβ), (27) λ ∗ θ˜ ǫ p (yn xn) − − 2 − − − " ∗ |  #     n n β which is bounded below by 1 n pθ˜(y x ) exp βd (p ,p˜) E ∗ | ≤ P n λ ∗ θ p¯ p (yn xn) ǫ " ∗ |  # 1 O (p exp( nκ))  − · − 1 n n = exp βd (p ,p˜) exp βd (p ,p˜) with κ := min ǫ2/7,τβ . P n λ ∗ θ − 1−β ∗ θ { } ǫ 1   1 Since n and n are i.i.d. now, n . Hence, n n x y dλ(p, r)= ndλ(p, r) n exp βdλ(p∗,pθ˜) exp βdλ(p∗,pθ˜) = n . we presented the risk bound as a single-sample version in (25) ≤ Pǫ − Pǫ   n n by dividing the both sides by n. Finally, we remark that the The first inequality holds because Ep¯∗(xn,yn) [A] Pǫ Eǫ [A] following interesting fact holds for the lasso case. for any non-negative random variable A. The second≥ inequality holds because of the monotonically increasing property of Lemma 3. Assume a linear regression setting: n dλ(p∗,pθ) in terms of λ. Thus, the right side of (28) is p (yn xn) = Πn N(y xT θ∗, σ2), bounded as ∗ | i=1 i| i n n n T 2 ˜ pθ(y x ) = Π N(yi x θ, σ ). ˜ ˜ n θ n n | i=1 | i exp( βL θ q∗) Eǫ exp βFλ (x ,y ) e − | n θ˜∈Θ(q∗) If µ1 and µ2 satisfy (22), then the weighted ℓ1 norm L(θ x )= X  h  i | 1 1 µ1 θ w,1 + µ2 is codelength valid. exp( βL˜ θ˜ q ) . k k ≤ P n − | ∗ ≤ P n ǫ ˜ e ǫ That is, the weighted ℓ1 penalties derived in Lemma 1 are θ∈XΘ(q∗)  not only ǫ-risk valid but also codelength valid. Its proof will Hence, we have an important inequality be described in Section V-I. By this fact, the redundancy and regret interpretation of the main terms in (25) and (26) are 1 n θ n n n Eǫ exp β max Fλ (x ,y ) L(θ x ) . (29) justified. It also indicates that we can obtain the unconditional P n ≥ θ∈Θ − | ǫ    risk bound with respect to α-divergence for those weighted ℓ1 Applying Jensen’s inequality to (29), we have penalties by Corollary 1 without any additional condition. 1 n θ n n n exp Eǫ β max Fλ (x ,y ) L(θ x ) P n ≥ θ∈Θ − | V. PROOFS OF THEOREMS,LEMMAS AND COROLLARY ǫ    ˆ exp En β F θ(xn,yn) L(θˆ xn) . We give all proofs to the theorems, the lemmas and the ≥ ǫ λ − | corollary in the previous section. Thus, we have h  i log P n p (yn xn) ǫ En dn(p ,p ) log ∗ | L(θˆ xn) . A. Proof of Theorem 7 − β ≥ ǫ λ ∗ θˆ − p (yn xn) − |  θˆ |  Here, we prove our main theorem. The proof proceeds along Rearranging the terms of this inequality, we have the state- with the same line as [19] though some modifications are ment. necessary.

Proof. Define B. Proof of Theorem 8 p (yn xn) It is not necessary to start from scratch. We reuse the proof F θ(xn,yn) := dn(p ,p ) log ∗ | . λ λ ∗ θ − p (yn xn) of Theorem 7. θ | 9

Proof. We can start from (29). For convenience, we define following reason. By the decomposition of expectation, we

n n have ξ(x ,y ) n n n p∗(y x ) 1 E n n I n (x ) log θ n n n p¯∗(x ,y ) Aǫ n| n = max Fλ (x ,y ) L(θ x ) p2(y x ) θ∈Θ n − | h | i n n dn(p ,p ) 1 p (y n xn) L(θ xn) n p∗(y x ) λ ∗ θ ∗ = E ∗ n I n (x )E ∗ n n log | . = max log | | . q (x ) Aǫ p (y |x ) n n e n − n p (yn xn) − n p2(y x ) θ∈Θ  θ    |  | Since n n is a sub-probability distribution by the as- By Markov’s inequality and (29), p2(y x ) sumption, the| conditional expectation part is non-negative. n n n n n Pr (ξ(x ,y ) τ x A ) Therefore, removing the indicator function I n (x ) cannot ≥ | ∈ ǫ Aǫ = Pr(exp(nβξ(xn,yn)) exp(nβτ) xn An) decrease this quantity. The final part of the statement follows ≥ | ∈ ǫ exp( nτβ) from the fact that taking λ =1 β makes the bound in (11) − . tightest because of the monotonically− increasing property of ≤ P n ǫ R´enyi divergence with regard to λ. Hence, we obtain Again, we remark that the sub-probability condition of Pr (ξ(xn,yn) τ) p (yn xn) can be replaced with a sufficient condition ≥ 2 | = P n Pr (ξ(xn,yn) τ xn An) “L(θ xn) is codelength valid.” In addition, the sub-probability ǫ ǫ | n ≥ n| n∈ n n condition can be relaxed to +(1 Pǫ )Pr(ξ(x ,y ) τ x / Aǫ ) n − n n n ≥ n| ∈ n n n n Pǫ Pr (ξ(x ,y ) τ x Aǫ )+(1 Pǫ ) sup p2(y x )dy < , ≤ ≥ | ∈ − xn∈X n | ∞ exp( nτβ)+(1 P n). Z ≤ − − ǫ under which the bound increases by (1 n n n n − The proof completes by noticing that Pǫ ) log supxn∈X n p2(y x )dy . ˆ | (1/n) F θ(xn,yn) L(θˆ xn) ξ(xn,yn) for any xn λ R n − | ≤ D. Renyi´ Divergence and Its Derivatives and y .  In this section and the next section, we prove a series of C. Proof of Corollary 1 lemmas, which will be used to derive risk valid penalties for lasso. First, we show that the R´enyi divergence can be The proof is obtained immediately from Theorem 7. λ understood by defining p¯θ (x, y) in Lemma 4. Then, their n Proof. Let again Eǫ denote a conditional expectation with explicit forms in the lasso setting are calculated in Lemma n n n n n regard to p¯∗(x ,y ) given that x Aǫ . Let further IA(x ) 5. be an indicator function of a set A ∈ X n. The unconditional Lemma 4. Define a probability distribution p¯λ(x, y) by risk is bounded as ⊂ θ q (x)p (y x)λp (y x)1−λ n λ ∗ ∗ θ E ∗ [D (p ,p )] p¯θ (x, y) := | | , p¯ α ∗ θˆ Zλ n n n n θ = Ep¯∗ [IAn (x )D (p∗,pˆ)]+Ep¯∗ [(1 IAn (x ))D (p∗,pˆ)] ǫ α θ − ǫ α θ where Zλ is a normalization constant. Then, the Renyi´ diver- 4 θ P nEn[D n(p ,p )] + (1 P n) gence and its first and second derivatives are written as ≤ ǫ ǫ α ∗ θˆ − ǫ · 1 α2 P n (1 P−n) 1 λ ǫ n n ǫ dλ(p∗,pθ) = − log Zθ , Eǫ [dλ(α)(p∗,pθˆ)] + − 1 λ ≤ λ(α) λ(α)(λ(α)+ α) − n n n ∂dλ(p∗,pθ) P p∗(y x ) 1 1 = E λ [sθ(y x)] , (30) ǫ n p¯θ Eǫ log n| n + log n ∂θ − | ≤ λ(α) p2(y x ) β Pǫ ∂2d (p ,p )  n |  λ ∗ θ (1 P ) = Ep¯λ [Gθ(x, y)] + − ǫ ∂θ∂θT − θ λ(α)(λ(α)+ α) (1 λ)Varp¯λ (sθ(y x)) , (31) n n n − − θ | 1 n p∗(y x ) Pǫ 1 = E ∗ I n (x ) log | + log where Var (A) denotes a of A with respect λ(α) p¯ Aǫ p (yn xn) λ(α)β P n p 2 ǫ to and  n |  p (1 Pǫ ) + − ∂ log pθ(y x) λ(α)(λ(α)+ α) sθ(y x) := | , n n n | ∂θ 1 p∗(y x ) Pǫ 1 2 Ep¯∗ log | + log ∂ log pθ(y x) ≤ λ(α) p (yn xn) λ(α)β P n Gθ(x, y) := | .  2 |  ǫ ∂θ∂θT (1 P n) + − ǫ . Proof. The normalizing constant is rewritten as λ(α)(λ(α)+ α) p (y x) 1−λ The first and second inequalities follow from the two proper- Zλ = q (x)p (y x) θ | dxdy θ ∗ ∗ | p (y x) ties of α-divergence in (17) and (18) respectively. The third Z  ∗  1−| λ inequality follows from Theorem 7 because λ(α) (0, 1 β) pθ(y x) ∈ − = Ep¯∗(x,y) | . by the assumption. 
The last inequality holds because of the p (y x) " ∗ |  # 10

T ∗ 2 Thus, the R´enyi divergence is written as If we assume that p∗(y x) = N(y x θ , σ ) (i.e., linear regression setting), | | 1 λ dλ(p∗,pθ)= log Zθ . λ T 2 −1 λ pθ (y x) = N(y x θ(λ), σ ), − | | 1 T 2 λ q∗(x)exp (x θ¯) Next, we calculate the partial derivative of log Zθ as λ 2c qθ (y x) = −λ , λ | Zθ  ∂ log Zθ ∂dλ(p∗,pθ) λ T ∂θ = E λ [xx ]θ,¯ 2 qθ 1 ∂Zλ ∂θ σ = θ ∂2d (p ,p ) λ λ λ λ ∗ θ T T ¯ Z ∂θ = Eqλ [xx ] Varqλ xx θ . (32) θ ∂θ∂θT σ2 θ − σ2c θ 1 p (y x) 1−λ ∂ p (y x) 1−λ θ θ If we additionally assume that q∗(x)= N(x 0, Σ) with a non- = λ Ep¯∗ | log | | Z p∗(y x) ∂θ p∗(y x) singular covariance matrix Σ, θ " |   |  # 1−λ λ 0 λ 1 λ pθ(y x) ∂ log pθ(y x) qθ (x) = N(x , Σθ ), = − Ep¯∗ | | | Zλ p (y x) ∂θ ∂d (p ,p ) λ c θ " ∗ |  # λ ∗ θ = Σ1/2θ¯′, (33) ∂θ σ2 c + θ¯′ 2 1 λ λ 1−λ  2  = − q∗(x)p∗(y x) pθ(y x) sθ(y x)dxdy 2 k k Zλ | | | ∂ dλ(p∗,pθ) λ c θ = Σ Z T 2 ¯′ 2 = (1 λ)Ep¯λ [sθ(y x)]. ∂θ∂θ σ c + θ 2 − θ |  k k  2λ c T Therefore, the first derivative is Σ1/2θ¯′ θ¯′ Σ1/2, 2 ¯′ 2 2 − σ (c + θ 2) λ  k k  ∂dλ(p∗,pθ) 1 ∂ log Zθ  (34) = = Ep¯λ [sθ(y x)] . ∂θ −1 λ ∂θ − θ | − where Furthermore, we have Σ1/2θ¯′(θ¯′)T Σ1/2 Σλ := Σ . θ ¯′ 2 ∂ logp ¯λ(x, y) ∂ q (x)p (y x)λp (y x)1−λ − c + θ 2 θ = log ∗ ∗ | θ | k k ∂θ ∂θ Zλ Proof. By completing squares, we can rewrite p¯λ(x, y) as  θ  θ λ ∂ log pθ(y x) ∂ log Z λ = (1 λ) | θ p¯θ (x, y) − ∂θ − ∂θ T ∗ 2 T 2 q∗(x) λ(y x θ ) + (1 λ)(y x θ) = (1 λ)sθ(y x) (1 λ)Ep¯λ [sθ(y x)] − − − θ = 1 exp 2 − | − − | 2 2 λ − 2σ (2πσ ) Zθ ! = (1 λ) sθ(y x) Ep¯λ [sθ(y x)] . − | − θ | q∗(x)   = n Hence, 2 2 λ (2πσ ) Zθ 2 T 2 T ∗ 2 ∂ dλ(p∗,pθ) (y x θ(λ)) + λ(1 λ)(x (θ θ)) − − − T exp 2 ∂θ∂θ · − 2σ ! λ T  λ ∂ logp ¯θ (x, y) q (x) λ(1 λ)(xT θ¯)2 = sθ(y x)¯pθ (x, y) ∗ T 2 − | ∂θ = λ exp − 2 N(y x θ(λ), σ ). Z   Zθ − 2σ | ∂sθ(y x)   λ λ T 2 +¯pθ (x, y) T| dxdy Hence, p (y x) is N(y x θ(λ), σ ). Integrating y out, we also ∂θ θ | | T have = Ep¯λ (1 λ)sθ(y x) sθ(y x) Ep¯λ [sθ(y x)] 1 T ¯ 2 − θ − | | − θ | q∗(x)exp (x θ) qλ(x) = − 2c . 2   θ λ ∂ log pθ(y x) Zθ  + T | ∂θ∂θ When q∗(x)= N(0, Σ), 2  ∂ log pθ(y x) 1 T −1 1 T ¯¯T = Ep¯λ | (1 λ) λ exp 2 x Σ x 2c x θθ x − θ ∂θ∂θT − − q (x) = − −   θ (2π)p/2 Σ 1/2Zλ T | | θ  1 T −1 1 ¯¯T Ep¯λ sθ(y x) Ep¯λ [sθ(y x)] sθ(y x) Ep¯λ [sθ(y x)] exp x Σ + θθ x · θ | − θ | | − θ | = − 2 c . (35)   p/2 1/2 λ  2   (2π) Σ Zθ ∂ log pθ(y x) | |   = Ep¯λ | (1 λ)Varp¯λ (sθ(y x)) . −1 − θ ∂θ∂θT − − θ | Since Σ is strictly positive definite by the assumption, Σ +   (1/c)θ¯θ¯T is non-singular. Hence, by the inverse formula (Lemma 8 in Appendix), Lemma 5. Let −1 ¯¯T λ −1 1 ¯¯T Σθθ Σ Σθ = Σ + θθ =Σ θ(λ) := λθ∗ + (1 λ)θ, θ¯ := θ θ∗, θ¯′ := Σ1/2θ,¯ c − c + θ¯T Σθ¯ − −   σ2 Σ1/2θ¯′(θ¯′)T Σ1/2 c := . = Σ . (36) λ(1 λ) − c + θ¯′ 2 − k k2 11

λ 0 λ Therefore, qθ (x) = N(x , Σθ ). The score function and By (31) combined with (37), the Hessian of R´enyi divergence Hessian of log p (y x) are | is calculated as θ | 2 1 T ∂ dλ(p∗,pθ) sθ(y x) = x(y x θ), | σ2 − ∂θ∂θT ∂2 log p (y x) 1 2 θ T 1 T 1 T λ T ¯ | = xx . (37) = Ep¯λ [xx ] (1 λ) Eqλ [xx ]+ Varqλ xx θ ∂θ∂θT −σ2 σ2 θ − − σ2 θ σ4 θ   2  Using (30), the first derivative is obtained as λ T λ (1 λ) T ¯ = Eqλ [xx ] − Varqλ xx θ σ2 θ − σ4 θ ∂dλ(p∗,pθ) λ T λ T ¯  = Ep¯λ [sθ(y x)] = Eqλ [xx ] Varqλ xx θ . ∂θ − θ | σ2 θ − σ2c θ  = Eqλ Epλ [sθ(y x)] T − θ θ | When q∗(x) = N(0, Σ), Var λ xx θ¯ is calculated as fol- qθ h 1 i T lows. Note that = Eqλ Epλ x(y x θ)  − θ θ σ2 −    T T T T λ λ T Var λ (xx θ¯)= E λ (xx θ¯)(xx θ¯) (Σ θ¯)(Σ θ¯) . 1 T qθ qθ θ θ = Eqλ xx (θ(λ) θ) − − θ σ2 −    T T  T λ The (j1, j2) element of E λ xx θ¯θ¯ xx is calculated as T qθ = E λ xx θ¯ 2 qθ σ     p ¯ 0 T ¯¯T T ¯ ¯ because θ(λ) θ = λθ. When q∗(y x)= N( , Σ), Eqλ xx θθ xx = θj3 θj4 Eqλ [xj1 xj2 xj3 xj4 ] , − − | θ j1j2 θ j3,j4=1 h  i X ∂dλ(p∗,pθ) λ λ ¯ = 2 Σθ θ. ∂θ σ where xj denotes the jth element of x only here. Thus, we need all the fourth-moments of qλ(x). We rewrite Σλ as S From (36), we have θ θ to reduce notation complexity hereafter. By the formula of Σ1/2θ¯′(θ¯′)T Σ1/2θ¯ moments of Gaussian distribution, we have Σλθ¯ = Σθ¯ θ ¯′ 2 − c + θ 2 k k E λ [xj1 xj2 xj3 xj4 ]= Sj1j2 Sj3j4 + Sj1j3 Sj2j4 + Sj2j3 Sj1j4 . ¯′ 2 qθ 1 θ 2 = Σ 2 θ¯′ k k Σ1/2θ¯′ − c + θ¯′ 2  k k2  Therefore, the above quantity is calculated as c = Σ1/2θ¯′, (38) c + θ¯′ 2 T T T 2 E λ xx θ¯θ¯ xx  k k  qθ j1j2 which gives (33). Though (34) can be obtained by differenti- h p i ¯ ¯  ating (33), we derive it by way of (31) here. To calculate the = θj3 θj4 (Sj1j2 Sj3j4 + Sj1j3 Sj2j4 + Sj2j3 Sj1j4 ) λ j3,j4=1 covariance matrix of sθ in terms of p¯θ , we decompose sθ as X ¯T ¯ ¯ ¯ = θ SθSj1j2 + 2(Sθ)j1 (Sθ)j2 . 1 T T T sθ(y x) = x(y x θ(λ)+ x θ(λ) x θ) | σ2 − − Summarizing these as a matrix form, we have 1 λ = x(y xT θ(λ)) xxT θ.¯ σ2 σ2 T T T T T − − E λ xx θ¯θ¯ xx = (θ¯ Sθ¯)S +2Sθ¯(Sθ¯) . qθ 2 T Note that the covariance of (1/σ )x(y x θ(λ)) and   2 T − T ¯ (λ/σ )xx θ¯ vanishes since As a result, Varqλ (xx θ) is obtained as − θ T T T λ ¯ T ¯ ¯T ¯ ¯¯T ¯¯T Ep¯ [x(y x θ(λ))(xx θ) ] Varqλ (xx θ) = (θ Sθ)S +2Sθθ S Sθθ S θ − θ − T T ¯ T ¯¯T ¯T ¯ = Eqλ xx (x θ)Epλ (y x θ(λ)) =0. = Sθθ S + (θ Sθ)S. (39) θ θ − h  i Therefore, we have Using (38), the first and second terms of (39) are calculated as Varp¯λ (sθ) θ 2 c T 1 λ Sθ¯θ¯T S = Σ1/2θ¯′ θ¯′ Σ1/2, T T ¯ ′ 2 2 = Varp¯λ x(y x θ(λ)) + Varp¯λ xx θ (c + θ¯ ) θ σ2 − θ σ2  k k2      c  1 λ2 θ¯T Sθ¯ = (θ¯)T Σ1/2θ¯′ T 2 T T ¯ ′ 2 = Ep¯λ (y x θ(λ)) xx + Varqλ xx θ c + θ¯ σ4 θ − σ4 θ  k k2  2 c θ¯′ 2 1  T λ  T  = 2 . = E λ xx + Var λ xx θ¯ k k′ 2 2 qθ 4 qθ c + θ¯ σ σ k k2    12

Combining these, Define c(t c) 2 f(t) := − ∂ dλ(p∗,pθ) (c + t)2 ∂θ∂θT 2 for t 0. Checking the properties of f(t), we have λ λ c T ≥ = S Σ1/2θ¯′ θ¯′ Σ1/2 σ2 − σ2c (c + θ¯′ 2)2 f(0) = 1,   k k2  − c θ¯′ 2  f(c) = 0, + k k2 S c + θ¯′ 2 f( ) = 0,  k k2   ∞ λ c df(t) c(3c t) = S = − . σ2 c + θ¯′ 2 dt (t + c)3  k k2  λ c T Therefore, . As a result, we Σ1/2θ¯′ θ¯′ Σ1/2 maxt∈[0,∞) f(t) = f(3c)=1/8 −σ2 (c + θ¯′ 2)2 obtain  k k2  ∂2d (p ,p ) λ λ c Σ1/2θ¯′(θ¯′)T Σ1/2 λ ∗ θ Σ. = Σ − ∂θ∂θT  8σ2 σ2 c + θ¯′ 2 − c + θ¯′ 2  k k2   k k2  λ c T Σ1/2θ¯′ θ¯′ Σ1/2 −σ2 (c + θ¯′ 2)2  k k2  F. Proof of Lemma 1 λ c  = Σ We are now ready to derive ǫ-risk valid weighted ℓ1 σ2 c + θ¯′ 2  k k2  penalties. 2λ c T Σ1/2θ¯′ θ¯′ Σ1/2. Proof. Similarly to the rewriting from (8) to (10), we can − σ2 (c + θ¯′ 2)2  2  rewrite the condition for ǫ-risk validity as k k  n n n Y n x Aǫ , y , θ Θ, ∀ ∈ ∀ ∈ ∀ ∈ n n n n pθ(y x ) min d (p ,p ) d (p ,p˜) + log | + L˜(θ˜ q ) E. Upper Bound of Negative Hessian e λ ∗ θ λ ∗ θ n n ∗ ˜ ∗ − p˜(y x ) | θ∈Θ(q ) θ | Using Lemma 5 in Section V-D, we show that the negative n loss variation part o codelength validity part Hessian of the R´enyi divergence is bounded from above. n| {z } L(θ x ). | {z (41)} Lemma 6. Assume that q (x)= N(x 0, Σ) and p (y x)= ≤ | ∗ ∗ We again write the inside part of the minimum in (41) N(y xT θ∗, σ2), where Σ is non-singular.| For any θ,θ∗,| | as H(θ, θ,˜ xn,yn). As described in Section III, the direct ∂2d (p ,p ) λ minimization of H(θ, θ,˜ xn,yn) seems to be difficult. Instead λ ∗ θ Σ, (40) − ∂θ∂θT  8σ2 of evaluating the minimum explicitly, we borrow a nice where A B implies that B A is positive semi-definite. randomization technique introduced in [19] with some modifi- ˜ n n  − cations. Their key idea is to evaluate not minθ˜ H(θ, θ, x ,y ) Proof. By Lemma 5, we have ˜ n n directly but its expectation Eθ˜[H(θ, θ, x ,y )] with respect to 2 a dexterously randomized θ˜ because the expectation is larger ∂ d (p ,p ) 2λ c T λ ∗ θ = Σ1/2θ¯′ θ¯′ Σ1/2 ∗ ∗ ∗ ∗ T T 2 ¯′ 2 2 than the minimum. Let us define w := (w1 , w2, , wp) , − ∂θ∂θ σ (c + θ 2) ∗ ∗ ∗ ··· ∗  k k  where w = Σjj and W := diag(w , , w ). We λ c  j 1 ··· p Σ. quantize Θ as −σ2 c + θ¯′ 2 p  2  ∗ −1 p k k Θ(q∗) := δ(W ) z z Z , (42) For any nonzero vector v p, { | ∈ } ∈ℜ where δ > 0 is a quantization width and Z is the set of all T 2 e n vT Σ1/2θ¯′ θ¯′ Σ1/2v = vT Σ1/2θ¯′ integers. Though Θ depends on x in fixed design cases [19], we must remove the dependency to satisfy the ǫ-risk validity  Σ1/2v 2 θ¯′ 2 = vT ( θ¯′ 2 Σ)v ˜ ≤ k k2 ·k k2 k k2 as above. For eache θ, θ is randomized as δ by Cauchy-Schwartz inequality. Hence, we have w∗ mj with prob. mj mj j ⌈ ⌉ −⌊ ⌋ δ ˜ ∗ 1/2 ′ ′ T 1/2 ′ 2 θj = w mj with prob. mj mj , (43) Σ θ¯ θ¯ Σ θ¯ Σ.  j ⌊ ⌋ ⌈ ⌉− 2 δ k k  ∗  w mj with prob. 1 ( mj mj ) Thus,  j − ⌈ ⌉−⌊ ⌋  ∗ ˜ 2 where mj := wj θj /δ and each component of θ is statistically ∂ dλ(p∗,pθ) independent of each other. Its important properties are − ∂θ∂θT ′ 2 ˜ 2λ c θ¯ λ c E˜[θ]= θ, (unbiasedness) 2 Σ Σ θ 2 k k′ 2 2 2 ′ 2  σ (c + θ¯ ) − σ c + θ¯ E˜[ θ˜ ]= θ , (44)  k k2   k k2  θ | | | | λ c( θ¯′ 2 c) δ 2 ˜ ˜ ′ ′ ′ = Σ. E˜[(θj θj )(θj θj )] I(j = j ) θj , 2 k k ′−2 2 θ ∗ σ (c + θ¯ ) − − ≤ wj | |  k k2  13

where θ˜ denotes a vector whose jth component is the Combining with the loss variation part, we obtain an upper | | ˜ ˜ n n absolute value of θj and similarly for θ . Using these, we bound of Eθ˜[H(θ, θ, x ,y )] as ˜ n n | | can bound Eθ˜[H(θ, θ, x ,y )] as follows. The loss variation part in (41) is the main concern because it is more complicated p 2 than squared error of fixed design cases. Let us consider the δnλ δn wj log 4p log 2 θ ∗ + θ + θ ∗ + . 16σ2 k kw ,1 2σ2 w∗ | j | βδ k kw ,1 β following Taylor expansion j=1 j X n T n n ∂dλ(p∗,pθ) d (p ,p ) d (p ,p˜)= (θ˜ θ) λ ∗ θ − λ ∗ θ − ∂θ − Since xn An, we have   ∈ ǫ 1 ∂2dn(p ,p ◦ ) λ ∗ θ ˜ ˜ T (45) Tr T (θ θ)(θ θ) , −2 ∂θ∂θ − − (1 ǫ)w∗ w (1 + ǫ)w∗.   − j ≤ j ≤ j where θ◦ is a vector between θ and θ˜. The first term in the p p right side of (45) vanishes after taking expectation with respect n n Thus, we can bound E˜[H(θ, θ,˜ x ,y )] by the data-dependent to ˜ because ˜ . As for the second term, we obtain θ θ Eθ˜[θ θ]=0 weighted ℓ norm θ as − 1 k kw,1 ∂2dn(p ,p ◦ ) Tr λ ∗ θ (θ˜ θ)(θ˜ θ)T − ∂θ∂θT − − ˜ n n   Eθ˜[H(θ, θ, x ,y )] T nλ p 2 Tr Σ θ˜ θ θ˜ θ δnλ θ w,1 δn√1+ ǫ w log 4p θ w,1 ≤ 8σ2 − − k k + j θ + k k   ≤ 16σ2 √ 2σ2 w | j | βδ √    1 ǫ j=1 j 1 ǫ by Lemma 6. Thus, expectation of the loss variation part with − X − log 2 respect to θ˜ is bounded as + β δnλ n n ∗ δn λ √1+ ǫ log 4p log 2 E˜ dλ(p∗,pθ) dλ(p∗,p˜) θ w ,1. (46) θ − θ ≤ 16σ2 k k = + + θ w,1+ . σ2 16√1 ǫ 2 δβ√1 ǫ k k β The codelength validity part in (41) have the same form as that   −  −  for the fixed design case in its appearance. However, we need Because this holds for any , we can minimize the upper to evaluate it again in our setting because both Θ and L˜ are δ > 0 different from those in [19]. The likelihood term is calculated bound with respect to δ, which completes the proof. as e 1 2(Y Xθ)T X(θ θ˜)+Tr XT X(θ˜ θ)(θ˜ θ)T . 2σ2 − − − −   G. Some Remarks on the Proof of Lemma 1 Taking expectation with respect to θ˜, we have n n The main difference of the proof from the fixed design case pθ(y x ) n 2 ˜ ˜ T Eθ˜ log n| n = 2 Eθ˜ Tr W (θ θ)(θ θ) is in the loss variation part. In the fixed design case, the R´enyi p˜(y x ) 2σ − − θ n  |  p h  i divergence dλ(p∗,pθ x ) is convex in terms of θ. When the δn w2 | j θ , R´enyi divergence is convex, the negative Hessian is negative ≤ 2σ2 w∗ | j | j=1 j semi-definite for all θ. Hence, the loss variation part is trivially X bounded above by zero. On the other hand, dλ(p∗,pθ) is not where W := diag(w1, w2, , wp). We define a codelength convex in terms of θ. This can be intuitively seen by deriving ··· p function C(z) := z 1 log 4p+log 2 over Z . Note that C(z) k k the explicit form of dλ(p∗,pθ) instead of checking the positive satisfies Kraft’s inequality. Let us define a codelength function semi-definiteness of its Hessian. From (35), we have on Θ(q∗) as

1 1 ∗ 1 ∗ log 2 exp 1 xT (Σλ)−1x L˜(θ˜eq∗) := C W θ˜ = W θ˜ 1 log 4p + . (47) λ 2 θ | β δ βδ k k β Zθ = − dx   (2π)p/2 Σ 1/2 Z | |  ˜ −1/2 λ 1/2 −1/2 λ −1/2 1/2 By this definition, L satisfies β-stronger Kraft’s inequality and = Σ Σθ = Σ Σθ Σ n ∗ | | | | | | does not depend on x but depends on q∗(x) through W . By 1/2 1 ¯′ ¯′ T taking expectation with respect to θ˜, we have = Ip θ θ − c + θ¯′ 2  2  k k 1/2 log 4p log 2 ′ 2 ′ ′ T E˜ L˜(θ˜ q∗) = θ w∗,1 + θ¯ θ¯ θ¯ θ | βδ k k β = I k k2 , (48) p ¯′ 2 ¯′ ¯′ h i − c + θ 2 θ 2 θ 2 because of (44). Thus, the codelength validity part is bounded  k k  k k   k k 

above by

where I is the identity matrix of dimension p. Prof. A. p 2 p δn wj log 4p log 2 R. Barron suggested in a private discussion that λ can be θ + θ ∗ + . Zθ 2σ2 w∗ | j | βδ k kw ,1 β j=1 j simplified more as follows. Let Q := [q1, q2, , qp] be an X ··· 14

′ ′ orthogonal matrix such that q1 := θ¯ / θ¯ 2. Using this, we settings, we believe that providing Lemmas 4 and 5 would be have k k useful in some cases. ¯′ 2 ¯′ ¯′ T θ 2 θ θ Ip k k − c + θ¯′ 2 θ¯′ θ¯′  k k2  k k2  k k2  H. Proof of Lemma 2 θ¯′ 2 = QQT k k2 q qT n n ¯′ 2 1 1 Here, we show that x distributes out of Aǫ with exponen- − c + θ 2  k k  p tially small probability with respect to n. θ¯′ 2 = 1 k k2 q qT + q qT ¯′ 2 1 1 j j − c + θ 2 n   k k  j=2 Proof. The typical set Aǫ can be decomposed covariate-wise p X c as = q qT + q qT ′ 2 1 1 j j c + θ¯ p 2 j=2 n n  k k  X Aǫ = Πj=1Aǫ (j), ¯′ 2 c/(c + θ 2) 0 0 0 An(j) := x n (w∗)2 ( x 2/n) ǫ(w∗)2 k k ··· ǫ j ∈ℜ j − k j k2 ≤ j 0 10 0 x n ∗ 2 2 ∗ 2  ···  =  j (wj ) wj ) ǫ(w j ) , = Q 0 01 0 QT . ∈ℜ − ≤ . . . ···. .  . . . . .   T  . . . . .  where xj := (x1j , x2j , , xnj ) and the above Π denotes   ···  0 00 1  a direct product of sets. From its definition, w2 is subject  ···  j   to a Gamma distribution Ga((n/2), (2s)/n) when xj λ ∼ Hence, the resultant Zθ is obtained as Πn N(x 0, (w∗)2). We write w2 as z and (w∗)2 as s (the i=1 j | j j j 1 index is dropped for legibility). We rewrite the Gamma T 2 j θ¯′ θ¯′ Zλ = I γ( θ¯′ 2) distribution g(z; s) in the form of : θ˜ p 2 ′ ′ − k k θ¯ 2 θ¯ 2 n n k k  k k  n 2 −1 1 n 2s Γ( ) nz 2s 2 c 2 g(z; s) := Ga , = 2 exp = . 2 n z 2s n ¯′ 2 − c + θ 2    n   k k  n 2 nz 2s 2 n Thus, we have a simple expression of the R´enyi divergence as = exp − log z log Γ 2 − 2s − n 2     1 θ¯′ 2 = exp(C(z)+ νz ψ(ν)) ,   d (p ,p )= log 1+ k k2 . (49) − λ ∗ θ 2(1 λ) c −   where From this form, we can easily know that the R´enyi divergence is not convex. When the R´enyi divergence is non-convex, it n 2 n C(z) := − log z, ν := , is unclear in general whether and how the loss variation part 2 −2s   is bounded above. This is one of the main reasons why the ψ(ν) := log( ν)−n/2Γ(n/2). derivation becomes more difficult than that of the fixed design − case. That is, ν is a natural parameter and z is a sufficient , so We also mention an alternative proof of Lemma 1 based that the expectation parameter η(s) is Eg(z;s)[z]. The relation- on (49). We provided Lemma 4 to calculate Hessian of the ship between the variance parameter s and natural/expectation R´enyi divergence. However, the above simple expression of the parameters are summarized as R´enyi divergence is somewhat easier to differentiate, while the n n expression based on (48) is somewhat hard to do it. Therefore, ν(s) := , η(ν)= . we can twice differentiate the above R´enyi divergence directly −2s −2ν in order to obtain Hessian instead of Lemma 5 in our Gaussian For exponential families, there is a useful Sanov-type inequal- setting. However, there is no guarantee that such a simplifica- ity (Lemma 7 in Appendix). Using this Lemma, we can bound tion is always possible in general setting. In our proof, we tried x n Pr( j / Aǫ (j)) as follows. For this purpose, it suffices to to give a somewhat systematic way which is easily applicable ∈ 2 ∗2 ∗2 bound the probability of the event wj wj wj ǫ. When to other settings to some extent. Suppose now, for example, ∗ 2 ′ | − |≤ s = (wj ) and s = s(1 ǫ), we are aim at deriving ǫ-risk valid ℓ1 penalties for lasso ± when q∗(x) is subject to non-Gaussian distribution. By (32) in D T (ν(s ǫs),ν) Lemma 5, it suffices only to bound Var λ (xx θ¯) in the sense ± qθ T n n n of positive semi-definiteness because Eqλ [xx ] is negative = s(1 ǫ) log(1 ǫ) − θ −2s(1 ǫ) − −2s ± − 2 ± semi-definite. 
In general, it seemingly depends on a situation  ±  n 1   n which is better, the direct differential or using (32). In our = 1 s(1 ǫ) log(1 ǫ) Gaussian setting, we imagine that the easiest way to calculate −2s (1 ǫ) − ± − 2 ±  n   ± n Hessian for most readers is to calculate the first derivative by = (1 (1 ǫ)) log(1 ǫ) the formula (30) and then to differentiate it directly, though − 2 − ± − 2 ± n  this depends on readers’ background knowledge. For other = ( ǫ log(1 ǫ)) , 2 ± − ± 15 where D is the single data version of the KL-divergence de- condition, we obtained upper bounds on the terms (i) and (ii) n n n fined by (2). It is easy to see that ǫ log(1+ǫ) ǫ log(1 ǫ) of (51) respectively, and shown that L1(θ v ) with v Aǫ is for any 0 <ǫ< 1. By Lemma 7,− we obtain ≤− − − not less than the sum of both upper bounds| if (22) is satisfied.∈ 2 ∗2 ∗2 A point is that the upper bound on the term (i) we derived is Pr( wj wj ǫwj ) n n | − |≤ a non-negative function of θ (see (46)). Hence, if v Aǫ 2 ∗2 ∗2 ∗2 2 ∗2 n ∈ = 1 Pr(wj wJ ǫwj or wJ wj ǫwj ) and (22) hold, L (θ v ) is an upper bound on the term (ii), − − ≥ − ≥ 1 2 ∗2 ∗2 ∗2 2 ∗2 which is not less than| = 1 Pr(wj wJ ǫwj ) Pr(wJ wj ǫwj ) − −n ≥ − − ≥ n n 1 exp (ǫ log(1 + ǫ)) pθ(y v ) ˜ ˜ ∗ min log n| n + L(θ q ) . ≥ − − 2 − ˜ e ∗ p˜(y v ) | n  θ∈Θ(q )  θ |  exp ( ǫ log(1 ǫ)) Pn n − − 2 − − − Now, assume (22) and let us take q∗ x given x , such  n  n 2 ∈ 1 2exp (ǫ log(1 + ǫ)) . that Σjj is equal to (1/n) i=1 xij for all j. Then we have ≥ − − 2 − n n x Aǫ , which implies n   ∈ P Hence Pǫ can be bounded below as n n n pθ(y x ) ∗ p ˜ ˜ n n n x n L1(θ x ) min log n| n + L(θ q ) . Pǫ = Pr(x Aǫ ) = Πj=1(1 Pr( j / Aǫ (j))) | ≥ ˜ e ∗ p (y x ) | ∈ − ∈ θ∈Θ(q ) θ˜ n p  |  1 2exp (ǫ log(1 + ǫ)) ∗ n ˜ ˜ ∗ ≥ − − 2 − Since q is determined by x and L(θ q ) satisfies Kraft’s   n  inequality, the codelength validity condition| holds for L . 1 2p exp (ǫ log(1 + ǫ)) . 1 ≥ − − 2 −   The last inequality follows from (1 t)p 1 pt for any VI. NUMERICAL SIMULATIONS − ≥ − t [0, 1] and p 1. To simplify the bound, we can do more. We investigate behavior of the regret bound (26). In the The∈ maximum positive≥ real number such that, for any a ǫ regret bound, we take β =1 λ with which the regret bound [0, 1], aǫ2 (1/2)(ǫ log(1 + ǫ)) is (1 log 2)/2. Then,∈ − becomes tightest. Furthermore, µ1 and µ2 are taken as their the maximum≤ integer a− such that (1 log− 2)/2 1/a is 7, 1 − ≥ 1 smallest values in (22). As described before, we cannot obtain which gives the last inequality in the statement. the exact bound for KL divergence which gives the most famous loss function, the mean square error (MSE), in this I. Proof of Lemma 3 setting. This is because the regret bound diverges to the infinity We can prove this lemma by checking the proof of Lemma as λ 1 unless n is accordingly large enough. That is, we → 1. can obtain only the approximate evaluation of the MSE. The precision of that approximation depends on the sample size Proof. Let n n. We do not employ the MSE here but another famous loss L1(θ x ) := µ1 θ w,1 + µ2. 2 | k k function, squared Hellinger distance dH (for a single data). Similarly to the rewriting from (9) to (10), we can restate the The Hellinger distance was defined in (16) as n sample version n 2 2,1 2 codelength validity condition for L1(θ x ) as “there exist a (i.e., dH = dH ). 
I. Proof of Lemma 3

We can prove this lemma by checking the proof of Lemma 1.

Proof. Let $L_1(\theta \mid x^n) := \mu_1 \|\theta\|_{w,1} + \mu_2$. Similarly to the rewriting from (9) to (10), we can restate the codelength validity condition for $L_1(\theta \mid x^n)$ as "there exist a quantized subset $\tilde{\Theta}(x^n)$ and a model description length $\tilde{L}(\tilde{\theta} \mid x^n)$ satisfying the usual Kraft's inequality, such that
\[
\forall x^n \in X^n,\ \forall y^n \in Y^n,\ \forall \theta \in \Theta, \qquad
\min_{\tilde{\theta} \in \tilde{\Theta}(x^n)} \left\{ \log_e \frac{p_{\theta}(y^n \mid x^n)}{p_{\tilde{\theta}}(y^n \mid x^n)} + \tilde{L}(\tilde{\theta} \mid x^n) \right\} \le L_1(\theta \mid x^n)." \qquad (50)
\]
Recall that (22) is a sufficient condition for the $\epsilon$-risk validity of $L_1$; in fact, it was derived as a sufficient condition for the proposition that $L_1(\theta \mid x^n)$ bounds from above
\[
E_{\tilde{\theta}}[H(\theta, \tilde{\theta}, v^n, y^n)]
= \underbrace{E_{\tilde{\theta}}\bigl[ d^n_{\lambda}(p_*, p_{\theta}) - d^n_{\lambda}(p_*, p_{\tilde{\theta}}) \bigr]}_{\text{(i)}}
+ \underbrace{E_{\tilde{\theta}}\left[ \log_e \frac{p_{\theta}(y^n \mid v^n)}{p_{\tilde{\theta}}(y^n \mid v^n)} + \tilde{L}(\tilde{\theta} \mid q_*) \right]}_{\text{(ii)}} \qquad (51)
\]
for any $q_* \in P^n_x$, $v^n \in A^n_\epsilon$, $y^n \in Y^n$, $\theta \in \Theta$, where $\tilde{\theta}$ was randomized on $\tilde{\Theta}(q_*)$ and $(\tilde{\Theta}(q_*), \tilde{L}(\tilde{\theta} \mid q_*))$ were defined by (42) and (47); in particular, $\tilde{L}(\tilde{\theta} \mid q_*)$ satisfies the $\beta$-stronger Kraft's inequality. Recall that $H(\theta, \tilde{\theta}, x^n, y^n)$ is the inside part of the minimum in (41). Here, we used $v^n$ instead of $x^n$ so as to discriminate it from the above fixed $x^n$. To derive the sufficient condition, we obtained upper bounds on the terms (i) and (ii) of (51), respectively, and showed that $L_1(\theta \mid v^n)$ with $v^n \in A^n_\epsilon$ is not less than the sum of both upper bounds if (22) is satisfied. A point is that the upper bound on the term (i) we derived is a non-negative function of $\theta$ (see (46)). Hence, if $v^n \in A^n_\epsilon$ and (22) hold, $L_1(\theta \mid v^n)$ is an upper bound on the term (ii), which is not less than
\[
\min_{\tilde{\theta} \in \tilde{\Theta}(q_*)} \left\{ \log_e \frac{p_{\theta}(y^n \mid v^n)}{p_{\tilde{\theta}}(y^n \mid v^n)} + \tilde{L}(\tilde{\theta} \mid q_*) \right\}.
\]
Now, assume (22) and let us take $q_* \in P^n_x$ given $x^n$, such that $\Sigma_{jj}$ is equal to $(1/n)\sum_{i=1}^n x_{ij}^2$ for all $j$. Then we have $x^n \in A^n_\epsilon$, which implies
\[
L_1(\theta \mid x^n) \ge \min_{\tilde{\theta} \in \tilde{\Theta}(q_*)} \left\{ \log_e \frac{p_{\theta}(y^n \mid x^n)}{p_{\tilde{\theta}}(y^n \mid x^n)} + \tilde{L}(\tilde{\theta} \mid q_*) \right\}.
\]
Since $q_*$ is determined by $x^n$ and $\tilde{L}(\tilde{\theta} \mid q_*)$ satisfies Kraft's inequality, the codelength validity condition holds for $L_1$.

VI. NUMERICAL SIMULATIONS

We investigate the behavior of the regret bound (26). In the regret bound, we take $\beta = 1 - \lambda$, with which the regret bound becomes tightest. Furthermore, $\mu_1$ and $\mu_2$ are taken as their smallest values in (22). As described before, we cannot obtain the exact bound for the KL divergence, which gives the most famous loss function, the mean square error (MSE), in this setting. This is because the regret bound diverges to infinity as $\lambda \to 1$ unless $n$ is accordingly large enough. That is, we can obtain only an approximate evaluation of the MSE. The precision of that approximation depends on the sample size $n$.

We do not employ the MSE here but another famous loss function, the squared Hellinger distance $d_H^2$ (for a single datum). The Hellinger distance was defined in (16) as the $n$ sample version (i.e., here $d_H^2 = d_H^{2,1}$). We can obtain a regret bound for $d_H^2(p_*, p_{\hat{\theta}})$ by (26) because two times the squared Hellinger distance $2d_H^2$ is bounded by the Bhattacharyya divergence ($d_{0.5}$) in (4) through the relationship (18). We set $n = 200$, $p = 1000$ and $\Sigma = I_p$ to mimic a typical situation of sparse learning. The lasso estimator is calculated by a proximal gradient method [28]; a minimal illustrative solver of this type is sketched below. To make the regret bound tight, we take $\tau = 0.03$, which is close to zero compared to the main term (regret). For this $\tau$, Fig. 2 shows the plot of (27) against $\epsilon$. We should choose the smallest $\epsilon$ for which the regret bound still holds with large probability. Our choice is $\epsilon = 0.5$, at which the value of (27) is 0.81.
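The sketch below shows one way to compute the lasso estimator with a plain proximal gradient (ISTA-type) iteration in the spirit of the method cited as [28]. It is an illustrative reimplementation under our own choices of regularization weight, sparsity level and noise level, not the exact code or settings used for the experiments reported here.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1 (coordinatewise soft thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    # Minimize (1/(2n)) * ||y - X theta||_2^2 + lam * ||theta||_1 by ISTA
    # with a fixed step size 1/L, L = largest eigenvalue of X^T X / n.
    n = X.shape[0]
    L = np.linalg.norm(X, 2) ** 2 / n          # Lipschitz constant of the gradient
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta

# Illustrative data: n = 200, p = 1000, covariates with identity covariance.
rng = np.random.default_rng(2)
n, p, k, sigma = 200, 1000, 10, 1.0            # k and sigma are our own choices
X = rng.normal(size=(n, p))
theta_star = np.zeros(p)
theta_star[:k] = 1.0
y = X @ theta_star + sigma * rng.normal(size=n)

theta_hat = lasso_ista(X, y, lam=0.1)          # lam is an arbitrary illustrative weight
print(np.count_nonzero(theta_hat), np.linalg.norm(theta_hat - theta_star))
```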

We show the results in Figs. 3-5. These plots show the values of $d_{0.5}$, $2d_H^2$ and the regret bound obtained in a hundred repetitions with different signal-to-noise ratios (SNR) $E_{q_*}[(x^T\theta^*)^2]/\sigma^2$ (that is, different $\sigma^2$). From these figures and other experiments, we observed that $2d_H^2$ almost always equaled $d_{0.5}$ (they almost overlapped). As the SN ratio got larger, the regret bound became looser; for example, it was about six times larger than $2d_H^2$ when the SNR is 10. One of the reasons is that the $\epsilon$-risk validity condition is too strict to bound the loss function when the SNR is high. Hence, a possible way to improve the risk bound is to restrict the parameter space $\Theta$ used in the $\epsilon$-risk validity to a range of $\hat{\theta}$, which is expected to be considerably narrower than $\Theta$ due to high SNR.

Fig. 2. Plot of (27) against $\epsilon \in (0, 1)$ when $n = 200$, $p = 1000$ and $\tau = 0.03$. The dotted vertical line indicates $\epsilon = 0.5$.

Fig. 3. Plot of $d_{0.5}$ (Bhattacharyya div.), $2d_H^2$ (Hellinger dist.) and the regret bound with $\tau = 0.03$ in the case that SNR = 1.5.

Fig. 4. Plot of $d_{0.5}$ (Bhattacharyya div.), $2d_H^2$ (Hellinger dist.) and the regret bound with $\tau = 0.03$ in the case that SNR = 10.

Fig. 5. Plot of $d_{0.5}$ (Bhattacharyya div.), $2d_H^2$ (Hellinger dist.) and the regret bound with $\tau = 0.03$ in the case that SNR = 0.5.

In contrast, the regret bound is tight when the SNR is 0.5, as shown in Fig. 5. Finally, we remark that the regret bound dominated the Rényi divergence over all trials, even though the regret bound is probabilistic. One of the reasons is the looseness of the lower bound (27) on the probability with which the regret bound holds. This suggests that $\epsilon$ could be reduced further if we could derive a tighter bound.

VII. CONCLUSION

We proposed a way to extend the original BC theory to supervised learning by using a typical set. Similarly to the original BC theory, our extension also gives a mathematical justification of the MDL principle for supervised learning. As an application, we derived new risk and regret bounds for lasso. The derived bounds still retain various advantages of the original BC theory; in particular, they require remarkably few assumptions. Our next challenge is to apply our proposal to non-normal cases for lasso and to other machine learning methods.

APPENDIX

SANOV-TYPE INEQUALITY

The following lemma is a special case of the result in [29]. Below, we give a simpler proof. In the lemma, we denote a one-dimensional random variable by $X$ and its corresponding one-dimensional variable by $x$.

Lemma 7. Let
\[
x \sim p_{\theta}(x) := \exp(\theta x - \psi(\theta)),
\]
where $x$ and $\theta$ are of one dimension. Then,
\[
\Pr_{\theta}(X \ge \eta') \le \exp(-D(\theta', \theta)) \quad \text{if } \eta' \ge \eta,
\qquad
\Pr_{\theta}(X \le \eta') \le \exp(-D(\theta', \theta)) \quad \text{if } \eta' \le \eta,
\]
where $\eta$ is the expectation parameter corresponding to the natural parameter $\theta$ and similarly for $\eta'$. The symbol $D$ denotes the single sample version of the KL-divergence defined by (2).

Proof. In this setting, the KL divergence is calculated as
\[
D(\theta, \theta') = E_{p_{\theta}}\!\left[ \log \frac{p_{\theta}(X)}{p_{\theta'}(X)} \right] = (\theta - \theta')\eta - \psi(\theta) + \psi(\theta').
\]
Assume $\eta' - \eta \ge 0$. Because of the monotonicity between the natural parameter and the expectation parameter of an exponential family,
\[
X \ge \eta' \;\Leftrightarrow\; (\theta' - \theta)X \ge (\theta' - \theta)\eta'
\;\Leftrightarrow\; \exp\bigl((\theta' - \theta)X\bigr) \ge \exp\bigl((\theta' - \theta)\eta'\bigr).
\]
By Markov's inequality, we have
\begin{align*}
\Pr_{\theta}\bigl( \exp((\theta' - \theta)X) \ge \exp((\theta' - \theta)\eta') \bigr)
&\le \frac{E_{p_{\theta}}[\exp((\theta' - \theta)X)]}{\exp((\theta' - \theta)\eta')} \\
&= \int \exp(\theta x - \psi(\theta)) \exp((\theta' - \theta)x)\, dx \cdot \exp(-(\theta' - \theta)\eta') \\
&= \int \exp(\theta' x - \psi(\theta))\, dx \cdot \exp(-(\theta' - \theta)\eta') \\
&= \exp(\psi(\theta')) \exp(-\psi(\theta)) \cdot \exp(-(\theta' - \theta)\eta') \\
&= \exp\bigl( -\bigl( (\theta' - \theta)\eta' - \psi(\theta') + \psi(\theta) \bigr) \bigr).
\end{align*}
The other inequality can also be proved in the same way.
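To make Lemma 7 concrete, one convenient special case (our illustrative choice, not one used in the paper) is the Gaussian family $p_\theta(x) = \exp(\theta x - \theta^2/2)$ with respect to the standard normal base measure, for which $\psi(\theta) = \theta^2/2$, $\eta = \theta$ and $D(\theta', \theta) = (\theta' - \theta)^2/2$. The following Monte Carlo sketch compares the tail probability with the bound.

```python
import numpy as np

# Check of the Sanov-type bound (Lemma 7) for an illustrative Gaussian family:
# X ~ N(theta, 1), eta = theta, D(theta', theta) = (theta' - theta)^2 / 2,
# so Pr_theta(X >= eta') <= exp(-(eta' - theta)^2 / 2) whenever eta' >= theta.
rng = np.random.default_rng(3)
theta, eta_prime, trials = 0.0, 2.0, 1_000_000   # arbitrary illustrative values

x = rng.normal(loc=theta, scale=1.0, size=trials)
empirical = np.mean(x >= eta_prime)
bound = np.exp(-(eta_prime - theta) ** 2 / 2.0)
print(empirical, bound)   # the empirical tail probability should not exceed the bound
```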
INVERSE MATRIX FORMULA

Lemma 8. Let $A$ be a non-singular $m \times m$ matrix. If $c$ and $d$ are both $m \times 1$ vectors and $A + cd^T$ is non-singular, then
\[
(A + cd^T)^{-1} = A^{-1} - \frac{A^{-1} c d^T A^{-1}}{1 + d^T A^{-1} c}.
\]
See, for example, Corollary 1.7.2 in [30] for its proof.
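Lemma 8 is the Sherman–Morrison formula; the following NumPy sketch (an illustration we add, with randomly generated $A$, $c$ and $d$) verifies it numerically.

```python
import numpy as np

# Numerical check of Lemma 8 (Sherman-Morrison formula) on random inputs.
rng = np.random.default_rng(4)
m = 6
A = rng.normal(size=(m, m)) + m * np.eye(m)   # diagonally shifted, hence non-singular
c = rng.normal(size=(m, 1))
d = rng.normal(size=(m, 1))

A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + c @ d.T)
rhs = A_inv - (A_inv @ c @ d.T @ A_inv) / (1.0 + float(d.T @ A_inv @ c))
print(np.max(np.abs(lhs - rhs)))   # should be numerically negligible
```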
ACKNOWLEDGMENT

We thank Professor Andrew Barron for fruitful discussion. The form of the Rényi divergence (49) is the result of a simplification suggested by him. Furthermore, we learned the simple proof of Lemma 7 from him. We also thank Mr. Yushin Toyokihara for his support.

REFERENCES

[1] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society Series B, vol. 58, no. 1, pp. 267-288, 1996.
[2] F. Bunea, A. Tsybakov, and M. Wegkamp, “Sparsity oracle inequalities for the Lasso,” Electronic Journal of Statistics, vol. 1, pp. 169-194, 2007.
[3] ——, “Aggregation for Gaussian regression,” Annals of Statistics, vol. 35, no. 4, pp. 1674-1697, 2007.
[4] T. Zhang, “Some sharp performance bounds for least squares regression with L1 regularization,” Annals of Statistics, vol. 37, no. 5A, pp. 2109-2144, 2009.
[5] P. J. Bickel, Y. Ritov, and A. B. Tsybakov, “Simultaneous analysis of Lasso and Dantzig selector,” Annals of Statistics, vol. 37, no. 4, pp. 1705-1732, 2009.
[6] P. L. Bartlett, S. Mendelson, and J. Neeman, “ℓ1-regularized linear regression: persistence and oracle inequalities,” Probability Theory and Related Fields, vol. 154, no. 1-2, pp. 193-224, 2012.
[7] M. Bayati, J. Bento, and A. Montanari, “The LASSO risk: asymptotic results and real world examples,” Advances in Neural Information Processing Systems, pp. 145-153, 2010.
[8] M. Bayati and A. Montanari, “The LASSO risk for Gaussian matrices,” IEEE Transactions on Information Theory, vol. 58, no. 4, pp. 1997-2017, 2012.
[9] M. Bayati, M. Erdogdu, and A. Montanari, “Estimating LASSO risk and noise level,” Advances in Neural Information Processing Systems, pp. 1-9, 2013.
[10] J. Rissanen, “Modeling by shortest data description,” Automatica, vol. 14, no. 5, pp. 465-471, 1978.
[11] A. R. Barron, J. Rissanen, and B. Yu, “The minimum description length principle in coding and modeling,” IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2743-2760, 1998.
[12] P. D. Grünwald, The Minimum Description Length Principle. MIT Press, 2007.
[13] P. D. Grünwald, I. J. Myung, and M. A. Pitt, Advances in Minimum Description Length: Theory and Applications. MIT Press, 2005.
[14] J. Takeuchi, “An introduction to the minimum description length principle,” in A Mathematical Approach to Research Problems of Science and Technology (book chapter). Springer, 2014, pp. 279-296.
[15] A. R. Barron and T. M. Cover, “Minimum complexity density estimation,” IEEE Transactions on Information Theory, vol. 37, no. 4, pp. 1034-1054, 1991.
[16] A. Rényi, “On measures of entropy and information,” in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 547-561, 1961.
[17] K. Yamanishi, “A learning criterion for stochastic rules,” Machine Learning, vol. 9, no. 2-3, pp. 165-203, 1992.
[18] A. R. Barron and X. Luo, “MDL procedures with ℓ1 penalty and their statistical risk,” in Proceedings of the First Workshop on Information Theoretic Methods in Science and Engineering, Tampere, Finland, August 18-20, 2008.
[19] S. Chatterjee and A. R. Barron, “Information theory of penalized likelihoods and its statistical implications,” arXiv:1401.6714v2 [math.ST], 27 Apr. 2014.
[20] A. Bhattacharyya, “On a measure of divergence between two statistical populations defined by their probability distributions,” Bulletin of the Calcutta Mathematical Society, vol. 35, pp. 99-109, 1943.
[21] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, 2001.
[22] E. J. Candès and T. Tao, “The Dantzig selector: statistical estimation when p is much larger than n,” Annals of Statistics, vol. 35, no. 6, pp. 2313-2351, 2007.
[23] A. R. Barron, C. Huang, J. Q. Li, and X. Luo, “MDL, penalized likelihood and statistical risk,” in Proceedings of the IEEE Information Theory Workshop, Porto, Portugal, May 4-9, 2008.
[24] S. Chatterjee and A. R. Barron, “Information theoretic validity of penalized likelihood,” in 2014 IEEE International Symposium on Information Theory, pp. 3027-3031, 2014.
[25] T. M. Cover and J. A. Thomas, Elements of Information Theory, ser. A Wiley-Interscience Publication. Wiley-Interscience, 2006.
[26] A. Cichocki and S. Amari, “Families of alpha- beta- and gamma-divergences: flexible and robust measures of similarities,” Entropy, vol. 12, no. 6, pp. 1532-1568, 2010.
[27] S. Amari and H. Nagaoka, Methods of Information Geometry. AMS & Oxford University Press, 2000.
[28] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183-202, 2009.
[29] I. Csiszár, “Sanov property, generalized I-projection and a conditional limit theorem,” The Annals of Probability, vol. 12, pp. 768-793, 1984.
[30] J. R. Schott, Matrix Analysis for Statistics, 2nd edition. John Wiley & Sons, 2005.