How SVMs can estimate quantiles and the median

Ingo Steinwart
Information Sciences Group CCS-3
Los Alamos National Laboratory
Los Alamos, NM 87545, USA
[email protected]

Andreas Christmann
Department of Mathematics
Vrije Universiteit Brussel
B-1050 Brussels, Belgium
[email protected]

Abstract

We investigate regression based on the pinball loss and the $\epsilon$-insensitive loss. For the pinball loss a condition on the data-generating distribution $P$ is given that ensures that the conditional quantiles are approximated with respect to $\|\cdot\|_{L_1(P_X)}$. This result is then used to derive an oracle inequality for an SVM based on the pinball loss. Moreover, we show that SVMs based on the $\epsilon$-insensitive loss estimate the conditional median only under certain conditions on $P$.

1 Introduction

Let $P$ be a distribution on $X \times Y$, where $X$ is an arbitrary set and $Y \subset \mathbb{R}$ is closed. The goal of quantile regression is to estimate the conditional quantile, i.e., the set-valued function

$$F^*_{\tau,P}(x) := \big\{ t \in \mathbb{R} : P((-\infty, t] \mid x) \ge \tau \text{ and } P([t, \infty) \mid x) \ge 1 - \tau \big\}, \qquad x \in X,$$

where $\tau \in (0, 1)$ is a fixed constant and $P(\cdot \mid x)$, $x \in X$, is the (regular) conditional probability. For conceptual simplicity (though mathematically this is not necessary) we assume throughout this paper that $F^*_{\tau,P}(x)$ consists of singletons, i.e., there exists a function $f^*_{\tau,P} : X \to \mathbb{R}$, called the conditional $\tau$-quantile function, such that $F^*_{\tau,P}(x) = \{ f^*_{\tau,P}(x) \}$, $x \in X$. Let us now consider the so-called $\tau$-pinball loss $L_\tau : \mathbb{R} \times \mathbb{R} \to [0, \infty)$ defined by $L_\tau(y, t) := \psi_\tau(y - t)$, where $\psi_\tau(r) = (\tau - 1) r$ if $r < 0$, and $\psi_\tau(r) = \tau r$ if $r \ge 0$. Moreover, given a (measurable) function $f : X \to \mathbb{R}$ we define the $L_\tau$-risk of $f$ by $\mathcal{R}_{L_\tau,P}(f) := \mathbb{E}_{(x,y) \sim P}\, L_\tau(y, f(x))$. Now recall that $f^*_{\tau,P}$ is, up to zero sets, the only function that minimizes the $L_\tau$-risk, i.e. $\mathcal{R}_{L_\tau,P}(f^*_{\tau,P}) = \inf \mathcal{R}_{L_\tau,P}(f) =: \mathcal{R}^*_{L_\tau,P}$, where the infimum is taken over all $f : X \to \mathbb{R}$. Based on this observation several estimators minimizing a (modified) empirical $L_\tau$-risk were proposed (see [5] for a survey on both parametric and non-parametric methods) for situations where $P$ is unknown, but i.i.d. samples $D := ((x_1, y_1), \dots, (x_n, y_n))$ drawn from $P$ are given. In particular, [6, 4, 10] proposed an SVM that finds a solution $f_{D,\lambda} \in H$ of

$$\arg\min_{f \in H}\; \lambda \|f\|_H^2 + \frac{1}{n} \sum_{i=1}^n L_\tau(y_i, f(x_i)), \qquad (1)$$

where $\lambda > 0$ is a regularization parameter and $H$ is a reproducing kernel Hilbert space (RKHS) over $X$. Note that this optimization problem can be solved by considering the dual problem [4, 10], but since this technique is nowadays standard in machine learning we omit the details. Moreover, [10] contains an exhaustive empirical study as well as some theoretical considerations. Empirical methods estimating quantiles with the help of the pinball loss typically obtain functions $f_D$ for which $\mathcal{R}_{L_\tau,P}(f_D)$ is close to $\mathcal{R}^*_{L_\tau,P}$ with high probability. However, in general this only implies that $f_D$ is close to $f^*_{\tau,P}$ in a very weak sense (see [7, Remark 3.18]), and hence there is so far only little justification for using $f_D$ as an estimate of the quantile function. Our goal is to address this issue by showing that under certain realistic assumptions on $P$ we have an inequality of the form
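To make (1) concrete, the following minimal sketch fits a kernel quantile regression by plain subgradient descent on the coefficients of the representer-theorem expansion $f = \sum_i a_i k(\cdot, x_i)$. It is only an illustration: the dual QP of [4, 10] is the standard route, and all names, kernel parameters, and step sizes below are illustrative choices rather than anything taken from the cited work.

```python
import numpy as np

def gauss_kernel(X1, X2, gamma=1.0):
    """Gaussian RBF kernel matrix between two point sets."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def svm_quantile(X, y, tau=0.5, lam=1e-3, gamma=1.0, steps=2000, lr=0.05):
    """Approximately solves (1) via subgradient descent on f = sum_i a_i k(., x_i)."""
    n = len(y)
    K = gauss_kernel(X, X, gamma)
    a = np.zeros(n)
    for _ in range(steps):
        f = K @ a
        # subgradient of psi_tau(y - f) in f: -tau where y > f, 1 - tau where y < f
        g = np.where(y > f, -tau, 1.0 - tau)
        # gradient of lam * ||f||_H^2 = lam * a^T K a plus the empirical loss term
        a -= lr * (2 * lam * (K @ a) + (K @ g) / n)
    return lambda Xnew: gauss_kernel(np.atleast_2d(Xnew), X, gamma) @ a

# toy usage: estimate the conditional 0.8-quantile of y = sin(pi x) + noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 1))
y = np.sin(np.pi * X[:, 0]) + 0.3 * rng.standard_normal(200)
f_hat = svm_quantile(X, y, tau=0.8)
print(np.mean(y <= f_hat(X)))   # empirical coverage, roughly 0.8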

$$\|f - f^*_{\tau,P}\|_{L_1(P_X)} \le c_P \sqrt{\mathcal{R}_{L_\tau,P}(f) - \mathcal{R}^*_{L_\tau,P}}. \qquad (2)$$

We then use this inequality to establish an oracle inequality for SVMs defined by (1). In addition, we illustrate how this oracle inequality can be used to obtain learning rates and to justify a data-dependent method for finding the hyper-parameters $\lambda$ and $H$. Finally, we generalize the methods for establishing (2) to investigate the role of $\epsilon$ in the $\epsilon$-insensitive loss used in standard SVM regression.
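A small Monte Carlo experiment makes (2) plausible before the formal development: for Gaussian conditionals the $L_1$-distance of a perturbed quantile function grows linearly in the perturbation, while the excess pinball risk grows quadratically, so the square root of the latter controls the former. The model below is an assumed toy example, not one from the results above.

```python
import numpy as np
from scipy.stats import norm

tau = 0.7
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200_000)
m = np.sin(np.pi * x)                      # toy location function
y = m + rng.standard_normal(x.size)        # P(.|x) = N(m(x), 1)
f_star = m + norm.ppf(tau)                 # conditional tau-quantile function

def risk(f):
    """Monte Carlo estimate of the pinball risk of a predictor f (array of values)."""
    r = y - f
    return np.mean(np.where(r >= 0, tau * r, (tau - 1) * r))

for shift in [0.05, 0.1, 0.2, 0.4]:
    f = f_star + shift                     # deliberately perturbed quantile estimate
    print(shift, np.sqrt(risk(f) - risk(f_star)))
# the printed square roots scale like the L1 distance (= shift), as in (2)
```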

2 Main results

In the following $X$ is an arbitrary, non-empty set equipped with a $\sigma$-algebra, and $Y \subset \mathbb{R}$ is a closed non-empty set. Given a distribution $P$ on $X \times Y$ we further assume throughout this paper that the $\sigma$-algebra on $X$ is complete with respect to the marginal distribution $P_X$ of $P$, i.e., every subset of a $P_X$-zero set is contained in the $\sigma$-algebra. Since the latter can always be ensured by increasing the original $\sigma$-algebra in a suitable manner, we note that this is not a restriction at all.

Definition 2.1 A distribution $Q$ on $\mathbb{R}$ is said to have a $\tau$-quantile of type $\alpha > 0$ if there exist a $\tau$-quantile $t^* \in \mathbb{R}$ and a constant $c_Q > 0$ such that for all $s \in [0, \alpha]$ we have

$$Q\big( (t^*, t^* + s) \big) \ge c_Q s \quad \text{and} \quad Q\big( (t^* - s, t^*) \big) \ge c_Q s. \qquad (3)$$

It is not difficult to see that a distribution $Q$ having a $\tau$-quantile of some type $\alpha$ has a unique $\tau$-quantile $t^*$. Moreover, if $Q$ has a Lebesgue density $h_Q$ then $Q$ has a $\tau$-quantile of type $\alpha$ if $h_Q$ is bounded away from zero on $[t^* - \alpha, t^* + \alpha]$, since we can use $c_Q := \inf\{ h_Q(t) : t \in [t^* - \alpha, t^* + \alpha] \}$ in (3). This assumption is general enough to cover many distributions used in parametric statistics such as Gaussian, Student's $t$, and logistic distributions (with $Y = \mathbb{R}$), Gamma and log-normal distributions (with $Y = [0, \infty)$), and uniform and Beta distributions (with $Y = [0, 1]$).

The following definition describes distributions on $X \times Y$ whose conditional distributions $P(\cdot \mid x)$, $x \in X$, have the same $\tau$-quantile type $\alpha$.

Definition 2.2 Let $p \in (0, \infty]$, $\tau \in (0, 1)$, and $\alpha > 0$. A distribution $P$ on $X \times Y$ is said to have a $\tau$-quantile of $p$-average type $\alpha$ if $Q_x := P(\cdot \mid x)$ has $P_X$-almost surely a $\tau$-quantile of type $\alpha$ and the function $b : X \to (0, \infty)$ defined by $b(x) := c_{P(\cdot \mid x)}$, where $c_{P(\cdot \mid x)}$ is the constant in (3), satisfies $b^{-1} \in L_p(P_X)$.

Let us now give some examples of distributions having $\tau$-quantiles of $p$-average type $\alpha$.
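For distributions with a Lebesgue density the constant $c_Q$ of Definition 2.1 can be computed numerically as the infimum of the density over $[t^* - \alpha, t^* + \alpha]$. The short sketch below does this for two of the families just mentioned; the parameter values are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import norm, t as student_t

def type_constant(dist, tau, alpha, grid=10_001):
    """Return (t*, c_Q) with c_Q = inf of the density on [t* - alpha, t* + alpha]."""
    t_star = dist.ppf(tau)                       # the tau-quantile of the distribution
    s = np.linspace(t_star - alpha, t_star + alpha, grid)
    return t_star, dist.pdf(s).min()             # numerical infimum on a fine grid

print(type_constant(norm(), tau=0.5, alpha=1.0))           # Gaussian
print(type_constant(student_t(df=3), tau=0.9, alpha=0.5))  # Student's t
```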

Example 2.3 Let $P$ be a distribution on $X \times \mathbb{R}$ with marginal distribution $P_X$ and regular conditional probability $Q_x((-\infty, y]) := 1/(1 + e^{-z})$, $y \in \mathbb{R}$, where $z := (y - m(x))/\sigma(x)$, $m : X \to \mathbb{R}$ describes a location shift, and $\sigma : X \to [\beta, 1/\beta]$ describes a scale modification for some constant $\beta \in (0, 1]$. Let us further assume that the functions $m$ and $\sigma$ are measurable. Thus $Q_x$ is a logistic distribution having the positive and bounded Lebesgue density $h_{Q_x}(y) = e^{-z}/(1 + e^{-z})^2$, $y \in \mathbb{R}$. The $\tau$-quantile function is $t^*(x) := f^*_{\tau,Q_x} = m(x) + \sigma(x) \log\big(\frac{\tau}{1-\tau}\big)$, $x \in X$, and we can choose $b(x) = \inf\{ h_{Q_x}(t) : t \in [t^*(x) - \alpha, t^*(x) + \alpha] \}$. Note that $h_{Q_x}(m(x) + y) = h_{Q_x}(m(x) - y)$ for all $y \in \mathbb{R}$, and $h_{Q_x}(y)$ is strictly decreasing for $y \in [m(x), \infty)$. Some calculations show

$$b(x) = \min\big\{ h_{Q_x}(t^*(x) - \alpha),\, h_{Q_x}(t^*(x) + \alpha) \big\} = \min\Big\{ \frac{u_1(x)}{(1 + u_1(x))^2},\, \frac{u_2(x)}{(1 + u_2(x))^2} \Big\} \in \Big[ c_{\alpha,\beta}, \frac{1}{4} \Big],$$

where $u_1(x) := \frac{1-\tau}{\tau} e^{-\alpha/\sigma(x)}$, $u_2(x) := \frac{1-\tau}{\tau} e^{\alpha/\sigma(x)}$, and $c_{\alpha,\beta} > 0$ can be chosen independently of $x$ because $\sigma(x) \in [\beta, 1/\beta]$. Hence $b^{-1} \in L_\infty(P_X)$ and $P$ has a $\tau$-quantile of $\infty$-average type $\alpha$.

Example 2.4 Let $\tilde{P}$ be a distribution on $X \times Y$ with marginal distribution $\tilde{P}_X$ and regular conditional probability $\tilde{Q}_x := \tilde{P}(\cdot \mid x)$ on $Y$. Furthermore, assume that $\tilde{Q}_x$ is $\tilde{P}_X$-almost surely of $\tau$-quantile type $\alpha$. Let us now consider the family of distributions $P$ with marginal distribution $\tilde{P}_X$ and regular conditional distributions $Q_x := \tilde{P}\big( (\cdot - m(x))/\sigma(x) \mid x \big)$, $x \in X$, where $m : X \to \mathbb{R}$ and $\sigma : X \to (\beta, 1/\beta)$ are as in the previous example. Then $Q_x$ has a $\tau$-quantile $f^*_{\tau,Q_x} = m(x) + \sigma(x) f^*_{\tau,\tilde{Q}_x}$ of type $\alpha\beta$, because we obtain for $s \in [0, \alpha\beta]$ the inequality

$$Q_x\big( (f^*_{\tau,Q_x}, f^*_{\tau,Q_x} + s) \big) = \tilde{Q}_x\big( (f^*_{\tau,\tilde{Q}_x}, f^*_{\tau,\tilde{Q}_x} + s/\sigma(x)) \big) \ge b(x) s / \sigma(x) \ge b(x) \beta s.$$

Consequently, $P$ has a $\tau$-quantile of $p$-average type $\alpha\beta$ if and only if $\tilde{P}$ does have a $\tau$-quantile of $p$-average type $\alpha$.

The following theorem shows that for distributions having a quantile of $p$-average type the conditional quantile can be estimated by functions that approximately minimize the pinball risk.
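The closed form for $b(x)$ in Example 2.3 is easy to verify numerically: since the logistic density is unimodal and symmetric about $m(x)$, its infimum over $[t^*(x) - \alpha, t^*(x) + \alpha]$ is attained at an endpoint. The sketch below checks this for one arbitrary choice of $\tau$, $\alpha$, $m$, and $\sigma$ (all values assumed for illustration).

```python
import numpy as np

def logistic_pdf(y, m, sigma):
    """Logistic density, up to the 1/sigma factor, as written in the text."""
    z = (y - m) / sigma
    return np.exp(-z) / (1 + np.exp(-z)) ** 2

tau, alpha, m, sigma = 0.3, 0.5, 1.2, 0.8
t_star = m + sigma * np.log(tau / (1 - tau))   # conditional tau-quantile
u1 = (1 - tau) / tau * np.exp(-alpha / sigma)
u2 = (1 - tau) / tau * np.exp(alpha / sigma)
closed = min(u1 / (1 + u1) ** 2, u2 / (1 + u2) ** 2)
direct = min(logistic_pdf(t_star - alpha, m, sigma),
             logistic_pdf(t_star + alpha, m, sigma))
print(closed, direct)   # the two values agree
```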

Theorem 2.5 Let $p \in (0, \infty]$, $\tau \in (0, 1)$, $\alpha > 0$ be real numbers, and $q := \frac{p}{p+1}$. Moreover, let $P$ be a distribution on $X \times Y$ that has a $\tau$-quantile of $p$-average type $\alpha$. Then for all $f : X \to \mathbb{R}$ satisfying $\mathcal{R}_{L_\tau,P}(f) - \mathcal{R}^*_{L_\tau,P} \le 2^{-\frac{p+2}{p+1}} \alpha^{\frac{2p}{p+1}}$ we have

$$\|f - f^*_{\tau,P}\|_{L_q(P_X)} \le \sqrt{2}\, \|b^{-1}\|_{L_p(P_X)}^{1/2} \sqrt{\mathcal{R}_{L_\tau,P}(f) - \mathcal{R}^*_{L_\tau,P}}.$$

Our next goal is to establish an oracle inequality for SVMs defined by (1). To this end let us assume $Y = [-1, 1]$. Then we have $L_\tau(y, \bar{t}) \le L_\tau(y, t)$ for all $y \in Y$, $t \in \mathbb{R}$, where $\bar{t}$ denotes $t$ clipped to the interval $[-1, 1]$, i.e., $\bar{t} := \max\{-1, \min\{1, t\}\}$. Since this yields $\mathcal{R}_{L_\tau,P}(\bar{f}) \le \mathcal{R}_{L_\tau,P}(f)$ for all functions $f : X \to \mathbb{R}$ we will focus on clipped functions $\bar{f}$ in the following. To describe the approximation error of SVMs we need the approximation error function $A(\lambda) := \inf_{f \in H} \lambda \|f\|_H^2 + \mathcal{R}_{L_\tau,P}(f) - \mathcal{R}^*_{L_\tau,P}$, $\lambda > 0$. Recall that [8] showed $\lim_{\lambda \to 0} A(\lambda) = 0$ if the RKHS $H$ is dense in $L_1(P_X)$. We also need the covering numbers, which for $\varepsilon > 0$ are defined by

$$\mathcal{N}\big( B_H, \varepsilon, L_2(\mu) \big) := \min\big\{ n \ge 1 : \exists\, x_1, \dots, x_n \in L_2(\mu) \text{ with } B_H \subset \cup_{i=1}^n (x_i + \varepsilon B_{L_2(\mu)}) \big\}, \qquad (4)$$

where $\mu$ is a distribution on $X$, and $B_H$ and $B_{L_2(\mu)}$ denote the closed unit balls of $H$ and the Hilbert space $L_2(\mu)$, respectively. Given a finite sequence $D = ((x_1, y_1), \dots, (x_n, y_n)) \in (X \times Y)^n$ we write $D_X := (x_1, \dots, x_n)$, and $\mathcal{N}(B_H, \varepsilon, L_2(D_X)) := \mathcal{N}(B_H, \varepsilon, L_2(\mu))$ if $\mu$ is the empirical measure defined by $D_X$. Finally, we write $L_\tau \circ f$ for the function $(x, y) \mapsto L_\tau(y, f(x))$. With these preparations we can now recall the following oracle inequality, shown in more generality in [9].
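The clipping step used above is easy to check empirically: the pinball loss is convex in $t$ with its minimum at $t = y \in [-1, 1]$, so moving a prediction toward $[-1, 1]$ can only decrease the loss pointwise, hence also the risk. A minimal sketch with synthetic data (all values illustrative):

```python
import numpy as np

def clip(t):
    """Clip predictions to [-1, 1], i.e. t -> max{-1, min{1, t}}."""
    return np.maximum(-1.0, np.minimum(1.0, t))

def pinball(y, t, tau=0.5):
    r = y - t
    return np.where(r >= 0, tau * r, (tau - 1) * r)

rng = np.random.default_rng(1)
y = rng.uniform(-1, 1, 10_000)          # labels live in Y = [-1, 1]
f = 3 * rng.standard_normal(10_000)     # an unclipped predictor
assert np.all(pinball(y, clip(f)) <= pinball(y, f) + 1e-12)
print(pinball(y, clip(f)).mean(), pinball(y, f).mean())  # clipped risk is smaller
```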

Theorem 2.6 Let $P$ be a distribution on $X \times [-1, 1]$ for which there exist constants $v \ge 1$ and $\vartheta \in [0, 1]$ with

$$\mathbb{E}_P \big( L_\tau \circ \bar{f} - L_\tau \circ f^*_{\tau,P} \big)^2 \le v\, \Big( \mathbb{E}_P \big( L_\tau \circ \bar{f} - L_\tau \circ f^*_{\tau,P} \big) \Big)^\vartheta \qquad (5)$$

for all $f : X \to \mathbb{R}$. Moreover, let $H$ be a RKHS over $X$ for which there exist $\varrho \in (0, 1)$ and $a \ge 1$ with

$$\sup_{D \in (X \times Y)^n} \log \mathcal{N}\big( B_H, \varepsilon, L_2(D_X) \big) \le a \varepsilon^{-2\varrho}, \qquad \varepsilon > 0. \qquad (6)$$

Then there exists a constant $K_{\varrho,v}$ depending only on $\varrho$ and $v$ such that for all $\varsigma \ge 1$, $n \ge 1$, and $\lambda > 0$ we have with probability not less than $1 - 3e^{-\varsigma}$ that

$$\mathcal{R}_{L_\tau,P}(\bar{f}_{D,\lambda}) - \mathcal{R}^*_{L_\tau,P} \le 8 A(\lambda) + 30 \sqrt{\frac{A(\lambda)\, \varsigma}{\lambda n}} + \Big( \frac{K_{\varrho,v}\, a}{\lambda^\varrho n} \Big)^{\frac{1}{2 - \vartheta + \varrho(\vartheta - 1)}} + \frac{K_{\varrho,v}\, a}{\lambda^\varrho n} + 5 \Big( \frac{32 v \varsigma}{n} \Big)^{\frac{1}{2 - \vartheta}}.$$

Moreover, [9] showed that oracle inequalities of the above type can be used to establish learning rates and to investigate data-dependent parameter selection strategies. For example, if we assume that there exist constants $c > 0$ and $\beta \in (0, 1]$ such that $A(\lambda) \le c \lambda^\beta$ for all $\lambda > 0$, then $\mathcal{R}_{L_\tau,P}(\bar{f}_{T,\lambda_n})$ converges to $\mathcal{R}^*_{L_\tau,P}$ with rate $n^{-\gamma}$, where $\gamma := \min\big\{ \frac{\beta}{\beta(2 - \vartheta + \varrho(\vartheta - 1)) + \varrho}, \frac{2\beta}{\beta + 1} \big\}$ and $\lambda_n = n^{-\gamma/\beta}$. Moreover, [9] shows that this rate can also be achieved by selecting $\lambda$ in a data-dependent way with the help of a validation set. Let us now consider how these learning rates in terms of risks translate into rates for $\|\bar{f}_{T,\lambda} - f^*_{\tau,P}\|_{L_q(P_X)}$. To this end we assume that $P$ has a $\tau$-quantile of $p$-average type $\alpha$ for $\tau \in (0, 1)$. Using the Lipschitz continuity of $L_\tau$ and Theorem 2.5 we then obtain

$$\mathbb{E}_P \big( L_\tau \circ \bar{f} - L_\tau \circ f^*_{\tau,P} \big)^2 \le \mathbb{E}_P |\bar{f} - f^*_{\tau,P}|^2 \le \|\bar{f} - f^*_{\tau,P}\|_\infty^{2-q}\, \mathbb{E}_P |\bar{f} - f^*_{\tau,P}|^q \le c \big( \mathcal{R}_{L_\tau,P}(\bar{f}) - \mathcal{R}^*_{L_\tau,P} \big)^{q/2}$$

for all $f$ satisfying $\mathcal{R}_{L_\tau,P}(\bar{f}) - \mathcal{R}^*_{L_\tau,P} \le 2^{-\frac{p+2}{p+1}} \alpha^{\frac{2p}{p+1}}$, i.e. we have a bound (5) with $\vartheta := q/2$ for clipped functions with small excess risk. Arguing carefully to handle the restriction on $\bar{f}$ we then see that $\|\bar{f}_{T,\lambda} - f^*_{\tau,P}\|_{L_q(P_X)}$ can converge as fast as $n^{-\gamma}$, where

$$\gamma := \min\Big\{ \frac{\beta}{\beta(4 - q + \varrho(q - 2)) + 2\varrho},\, \frac{\beta}{\beta + 1} \Big\}.$$

To illustrate the latter let us assume that $H$ is a Sobolev space $W^m(X)$ of order $m \in \mathbb{N}$ over $X$, where $X$ is the unit ball in $\mathbb{R}^d$. Recall from [3] that $H$ satisfies (6) for $\varrho := d/(2m)$ if $m > d/2$, and in this case $H$ also consists of continuous functions. Furthermore, assume that we are in the ideal situation $f^*_{\tau,P} \in W^m(X)$, which implies $\beta = 1$. Then the learning rate for $\|\bar{f}_{T,\lambda} - f^*_{\tau,P}\|_{L_q(P_X)}$ becomes $n^{-1/(4 - q(1 - \varrho))}$, which for $\infty$-average type distributions reduces to $n^{-2m/(6m+d)} \approx n^{-1/3}$.

Let us finally investigate whether the $\epsilon$-insensitive loss, defined by $L(y, t) := \max\{0, |y - t| - \epsilon\}$ for $y, t \in \mathbb{R}$ and fixed $\epsilon > 0$, can be used to estimate the median, i.e. the $(1/2)$-quantile.

Theorem 2.7 Let $L$ be the $\epsilon$-insensitive loss for some $\epsilon > 0$ and $P$ be a distribution on $X \times \mathbb{R}$ which has a unique median $f^*_{1/2,P}$. Furthermore, assume that all conditional distributions $P(\cdot \mid x)$, $x \in X$, are atom-free, i.e. $P(\{y\} \mid x) = 0$ for all $y \in \mathbb{R}$, and symmetric, i.e. $P(h(x) + A \mid x) = P(h(x) - A \mid x)$ for all measurable $A \subset \mathbb{R}$ and a suitable function $h : X \to \mathbb{R}$. If the conditional distributions have positive mass concentrated around $f^*_{1/2,P}(x) \pm \epsilon$, then $f^*_{1/2,P}$ is the only minimizer of $\mathcal{R}_{L,P}$.

Note that using [7] one can show that for distributions specified in the above theorem the SVM using the $\epsilon$-insensitive loss approximates $f^*_{1/2,P}$ whenever the SVM is $\mathcal{R}_{L,P}$-consistent, i.e. $\mathcal{R}_{L,P}(f_{T,\lambda}) \to \mathcal{R}^*_{L,P}$ in probability; see [2]. More advanced results in the sense of Theorem 2.5 seem also possible, but are out of the scope of this paper.
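The rate exponents above are easy to evaluate. The helper below simply codes the displayed formula for $\gamma$ and reproduces the Sobolev example with $\beta = 1$ and $q = 1$ (the choices of $m$ and $d$ are illustrative).

```python
def gamma_norm_rate(beta, rho, q):
    """Exponent gamma for the L_q-norm rate n^{-gamma}, as displayed above."""
    return min(beta / (beta * (4 - q + rho * (q - 2)) + 2 * rho),
               beta / (beta + 1))

m, d = 2, 1                              # Sobolev order and input dimension
rho = d / (2 * m)                        # entropy exponent from [3]
print(gamma_norm_rate(1.0, rho, 1.0))    # equals 1/(4 - q(1 - rho)) here
print(2 * m / (6 * m + d))               # = 2m/(6m + d), the rate quoted in the text
```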

3 Proofs

Let us first recall some notions from [7], which investigated surrogate losses in general and the question how approximate risk minimizers approximate exact risk minimizers in particular. To this end let $L : X \times Y \times \mathbb{R} \to [0, \infty)$ be a measurable function which we call a loss in the following. For a distribution $P$ and an $f : X \to \mathbb{R}$ the $L$-risk is then defined by $\mathcal{R}_{L,P}(f) := \mathbb{E}_{(x,y) \sim P} L(x, y, f(x))$, and, as usual, the Bayes $L$-risk is denoted by $\mathcal{R}^*_{L,P} := \inf \mathcal{R}_{L,P}(f)$, where the infimum is taken over all (measurable) $f : X \to \mathbb{R}$. In addition, given a distribution $Q$ on $Y$ the inner $L$-risks were defined by $\mathcal{C}_{L,Q,x}(t) := \int_Y L(x, y, t)\, dQ(y)$, $x \in X$, $t \in \mathbb{R}$, and the minimal inner $L$-risks were denoted by $\mathcal{C}^*_{L,Q,x} := \inf_{t \in \mathbb{R}} \mathcal{C}_{L,Q,x}(t)$, $x \in X$. Moreover, following [7] we usually omit the indices $x$ or $Q$ if $L$ is independent of $x$ or $y$, respectively. Obviously, we have

$$\mathcal{R}_{L,P}(f) = \int_X \mathcal{C}_{L,P(\cdot|x),x}\big( f(x) \big)\, dP_X(x), \qquad (7)$$

and [7, Theorem 3.2] further shows that $x \mapsto \mathcal{C}^*_{L,P(\cdot|x),x}$ is measurable if the $\sigma$-algebra on $X$ is complete. In this case it was also shown that the intuitive formula $\mathcal{R}^*_{L,P} = \int_X \mathcal{C}^*_{L,P(\cdot|x),x}\, dP_X(x)$ holds, i.e. the Bayes $L$-risk is obtained by minimizing the inner risks and subsequently integrating with respect to the marginal distribution $P_X$. Based on this observation the basic idea in [7] is to consider both steps separately. In particular, it turned out that the sets of $\varepsilon$-approximate minimizers

$$\mathcal{M}_{L,Q,x}(\varepsilon) := \big\{ t \in \mathbb{R} : \mathcal{C}_{L,Q,x}(t) < \mathcal{C}^*_{L,Q,x} + \varepsilon \big\}, \qquad \varepsilon \in [0, \infty],$$

and the set of exact minimizers $\mathcal{M}_{L,Q,x}(0^+) := \bigcap_{\varepsilon > 0} \mathcal{M}_{L,Q,x}(\varepsilon)$ play a crucial role. As in [7] we again omit the subscripts $x$ and $Q$ in these definitions if $L$ happens to be independent of $x$ or $y$, respectively.

Now assume we have two losses $L_{\mathrm{tar}} : X \times Y \times \mathbb{R} \to [0, \infty]$ and $L_{\mathrm{sur}} : X \times Y \times \mathbb{R} \to [0, \infty]$, and that our goal is to estimate the excess $L_{\mathrm{tar}}$-risk by the excess $L_{\mathrm{sur}}$-risk. This issue was investigated in [7], where the main device was the so-called calibration function $\delta_{\max}(\cdot, Q, x)$ defined by

$$\delta_{\max}(\varepsilon, Q, x) := \begin{cases} \inf_{t \in \mathbb{R} \setminus \mathcal{M}_{L_{\mathrm{tar}},Q,x}(\varepsilon)} \mathcal{C}_{L_{\mathrm{sur}},Q,x}(t) - \mathcal{C}^*_{L_{\mathrm{sur}},Q,x} & \text{if } \mathcal{C}^*_{L_{\mathrm{sur}},Q,x} < \infty, \\ \infty & \text{if } \mathcal{C}^*_{L_{\mathrm{sur}},Q,x} = \infty, \end{cases}$$

for all $\varepsilon \in [0, \infty]$. In the following we sometimes write $\delta_{\max,L_{\mathrm{tar}},L_{\mathrm{sur}}}(\varepsilon, Q, x) := \delta_{\max}(\varepsilon, Q, x)$ whenever we need to explicitly mention the target and surrogate losses. In addition, we follow our convention which omits $x$ or $Q$ whenever this is possible. Now recall that [7, Lemma 2.9] showed

$$\delta_{\max}\big( \mathcal{C}_{L_{\mathrm{tar}},Q,x}(t) - \mathcal{C}^*_{L_{\mathrm{tar}},Q,x},\, Q,\, x \big) \le \mathcal{C}_{L_{\mathrm{sur}},Q,x}(t) - \mathcal{C}^*_{L_{\mathrm{sur}},Q,x}, \qquad t \in \mathbb{R}, \qquad (8)$$

if both $\mathcal{C}^*_{L_{\mathrm{tar}},Q,x} < \infty$ and $\mathcal{C}^*_{L_{\mathrm{sur}},Q,x} < \infty$. Before we use (8) to establish an inequality between the excess risks of $L_{\mathrm{tar}}$ and $L_{\mathrm{sur}}$, we finally recall that the Fenchel-Legendre bi-conjugate $g^{**} : I \to [0, \infty]$ of a function $g : I \to [0, \infty]$ defined on an interval $I$ is the largest convex function $h : I \to [0, \infty]$ satisfying $h \le g$. In addition, we write $g^{**}(\infty) := \lim_{t \to \infty} g^{**}(t)$ if $I = [0, \infty)$. With these preparations we can now establish the following generalization of [7, Theorem 2.18].

Theorem 3.1 Let $P$ be a distribution on $X \times Y$ with $\mathcal{R}^*_{L_{\mathrm{tar}},P} < \infty$ and $\mathcal{R}^*_{L_{\mathrm{sur}},P} < \infty$, and assume that there exist $p \in (0, \infty]$ and functions $b : X \to [0, \infty]$ and $\delta : [0, \infty) \to [0, \infty)$ such that

$$\delta_{\max}\big( \varepsilon, P(\cdot \mid x), x \big) \ge b(x)\, \delta(\varepsilon), \qquad \varepsilon \ge 0,\ x \in X, \qquad (9)$$

and $b^{-1} \in L_p(P_X)$. Then for $q := \frac{p}{p+1}$, $\bar{\delta} := \delta^q : [0, \infty) \to [0, \infty)$, and all $f : X \to \mathbb{R}$ we have

$$\bar{\delta}^{**}\big( \mathcal{R}_{L_{\mathrm{tar}},P}(f) - \mathcal{R}^*_{L_{\mathrm{tar}},P} \big) \le \|b^{-1}\|_{L_p(P_X)}^q \big( \mathcal{R}_{L_{\mathrm{sur}},P}(f) - \mathcal{R}^*_{L_{\mathrm{sur}},P} \big)^q.$$

Proof: Let us first consider the case $\mathcal{R}_{L_{\mathrm{tar}},P}(f) < \infty$.
Since $\bar{\delta}^{**}$ is convex and satisfies $\bar{\delta}^{**}(\varepsilon) \le \bar{\delta}(\varepsilon)$ for all $\varepsilon \in [0, \infty)$, we see by Jensen's inequality that

$$\bar{\delta}^{**}\big( \mathcal{R}_{L_{\mathrm{tar}},P}(f) - \mathcal{R}^*_{L_{\mathrm{tar}},P} \big) \le \int_X \bar{\delta}^{**}\big( \mathcal{C}_{L_{\mathrm{tar}},P(\cdot|x),x}(f(x)) - \mathcal{C}^*_{L_{\mathrm{tar}},P(\cdot|x),x} \big)\, dP_X(x). \qquad (10)$$

Moreover, using (8) and (9) we obtain

$$b(x)\, \delta\big( \mathcal{C}_{L_{\mathrm{tar}},P(\cdot|x),x}(t) - \mathcal{C}^*_{L_{\mathrm{tar}},P(\cdot|x),x} \big) \le \mathcal{C}_{L_{\mathrm{sur}},P(\cdot|x),x}(t) - \mathcal{C}^*_{L_{\mathrm{sur}},P(\cdot|x),x}$$

for $P_X$-almost all $x \in X$ and all $t \in \mathbb{R}$. By (10), the definition of $\bar{\delta}$, and Hölder's inequality in the form of $\|\cdot\|_q \le \|\cdot\|_p \cdot \|\cdot\|_1$, we thus find that $\bar{\delta}^{**}( \mathcal{R}_{L_{\mathrm{tar}},P}(f) - \mathcal{R}^*_{L_{\mathrm{tar}},P} )$ is less than or equal to

$$\int_X b^{-q}(x) \big( \mathcal{C}_{L_{\mathrm{sur}},P(\cdot|x),x}(f(x)) - \mathcal{C}^*_{L_{\mathrm{sur}},P(\cdot|x),x} \big)^q\, dP_X(x)$$
$$\le \Big( \int_X b^{-p}\, dP_X \Big)^{q/p} \Big( \int_X \mathcal{C}_{L_{\mathrm{sur}},P(\cdot|x),x}(f(x)) - \mathcal{C}^*_{L_{\mathrm{sur}},P(\cdot|x),x}\, dP_X(x) \Big)^q$$
$$\le \|b^{-1}\|_{L_p(P_X)}^q \big( \mathcal{R}_{L_{\mathrm{sur}},P}(f) - \mathcal{R}^*_{L_{\mathrm{sur}},P} \big)^q.$$

Let us finally deal with the case $\mathcal{R}_{L_{\mathrm{tar}},P}(f) = \infty$. If $\bar{\delta}^{**}(\infty) = 0$ there is nothing to prove and hence we assume $\bar{\delta}^{**}(\infty) > 0$. Following the proof of [7, Theorem 2.13] we then see that there exist constants $c_1, c_2 \in (0, \infty)$ satisfying $t \le c_1 \bar{\delta}^{**}(t) + c_2$ for all $t \in [0, \infty]$. From this we obtain

$$\infty = \mathcal{R}_{L_{\mathrm{tar}},P}(f) - \mathcal{R}^*_{L_{\mathrm{tar}},P} \le c_1 \int_X \bar{\delta}^{**}\big( \mathcal{C}_{L_{\mathrm{tar}},P(\cdot|x),x}(f(x)) - \mathcal{C}^*_{L_{\mathrm{tar}},P(\cdot|x),x} \big)\, dP_X(x) + c_2$$
$$\le c_1 \int_X b^{-q}(x) \big( \mathcal{C}_{L_{\mathrm{sur}},P(\cdot|x),x}(f(x)) - \mathcal{C}^*_{L_{\mathrm{sur}},P(\cdot|x),x} \big)^q\, dP_X(x) + c_2,$$

where the last step is analogous to our considerations for $\mathcal{R}_{L_{\mathrm{tar}},P}(f) < \infty$. By $b^{-1} \in L_p(P_X)$ and Hölder's inequality we then conclude $\mathcal{R}_{L_{\mathrm{sur}},P}(f) - \mathcal{R}^*_{L_{\mathrm{sur}},P} = \infty$.

Our next goal is to determine the inner risks and their minimizers for the pinball loss. To this end recall (see, e.g., [1, Theorem 23.8]) that given a distribution $Q$ on $\mathbb{R}$ and a non-negative function $g : \mathbb{R} \to [0, \infty)$ we have

$$\int_{\mathbb{R}} g\, dQ = \int_0^\infty Q(g \ge s)\, ds. \qquad (11)$$

Proposition 3.2 Let $\tau \in (0, 1)$ and $Q$ be a distribution on $\mathbb{R}$ with $\mathcal{C}^*_{L_\tau,Q} < \infty$, and let $t^*$ be a $\tau$-quantile of $Q$. Then there exist $q_+, q_- \in [0, \infty)$ with $q_+ + q_- = Q(\{t^*\})$, and for all $t \ge 0$ we have

$$\mathcal{C}_{L_\tau,Q}(t^* + t) - \mathcal{C}_{L_\tau,Q}(t^*) = t q_+ + \int_0^t Q\big( (t^*, t^* + s) \big)\, ds, \quad \text{and} \qquad (12)$$

$$\mathcal{C}_{L_\tau,Q}(t^* - t) - \mathcal{C}_{L_\tau,Q}(t^*) = t q_- + \int_0^t Q\big( (t^* - s, t^*) \big)\, ds. \qquad (13)$$

Proof: Let us consider the distribution $Q^{(t^*)}$ defined by $Q^{(t^*)}(A) := Q(t^* + A)$ for all measurable $A \subset \mathbb{R}$. Then it is not hard to see that $0$ is a $\tau$-quantile of $Q^{(t^*)}$. Moreover, we obviously have $\mathcal{C}_{L_\tau,Q}(t^* + t) = \mathcal{C}_{L_\tau,Q^{(t^*)}}(t)$ and hence we may assume without loss of generality that $t^* = 0$. Then our assumptions together with $Q((-\infty, 0]) + Q([0, \infty)) = 1 + Q(\{0\})$ yield $\tau \le Q((-\infty, 0]) \le \tau + Q(\{0\})$, i.e., there exists a $q_+$ satisfying $0 \le q_+ \le Q(\{0\})$ and

$$Q((-\infty, 0]) = \tau + q_+. \qquad (14)$$

Let us now compute the inner risks of $L_\tau$. To this end we first assume $t \ge 0$. Then we have

$$\int_{y < t} (y - t)\, dQ(y) = \int_{y < t} y\, dQ(y) - t\, Q((-\infty, t))$$

$$\delta_{\max,\breve{L}_P,L}\big( \varepsilon, P(\cdot \mid x), x \big) = \delta_{\max,\breve{L},L}\big( \varepsilon, P(\cdot \mid x) \big), \qquad \varepsilon \in [0, \infty],\ x \in X. \qquad (17)$$

In other words, the self-calibration function can be utilized in Theorem 3.1.
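As a numerical sanity check of this machinery, the sketch below evaluates the excess inner risk of the pinball loss via Proposition 3.2 (with $q_+ = 0$ for an atom-free $Q$) and compares it with the lower bound $c_Q\, \delta(\varepsilon)$ with the Huber-type $\delta$ used in the proof of Theorem 2.5 below. The standard Gaussian $Q$ and all constants are assumed for illustration.

```python
import numpy as np
from scipy.stats import norm

tau, alpha = 0.5, 1.0
t_star = norm.ppf(tau)                 # = 0 for the standard normal
c_Q = norm.pdf(t_star + alpha)         # inf of the density on [t* - alpha, t* + alpha]

def excess_inner_risk(eps, k=100_000):
    # C(t* + eps) - C(t*) = int_0^eps Q((t*, t* + s)) ds, by (12) with q_+ = 0
    s = (np.arange(k) + 0.5) * eps / k                       # midpoint rule
    return np.mean(norm.cdf(t_star + s) - norm.cdf(t_star)) * eps

def delta_huber(eps):
    return eps ** 2 / 2 if eps <= alpha else alpha * eps - alpha ** 2 / 2

for eps in [0.1, 0.5, 1.0, 2.0]:
    print(eps, excess_inner_risk(eps), c_Q * delta_huber(eps))
# in each row the excess inner risk dominates c_Q * delta(eps), as claimed
```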

Proof of Theorem 2.5: Let $Q$ be a distribution on $\mathbb{R}$ with $\mathcal{C}^*_{L,Q} < \infty$ and let $t^*$ be the only $\tau$-quantile of $Q$. Then the formulas of Proposition 3.2 show

$$\delta_{\max,\breve{L},L}(\varepsilon, Q) = \min\Big\{ \varepsilon q_+ + \int_0^\varepsilon Q\big( (t^*, t^* + s) \big)\, ds,\ \varepsilon q_- + \int_0^\varepsilon Q\big( (t^* - s, t^*) \big)\, ds \Big\}, \qquad \varepsilon \ge 0,$$

where $q_+$ and $q_-$ are the real numbers defined in Proposition 3.2. Let us additionally assume that the $\tau$-quantile $t^*$ is of type $\alpha$. For the Huber-type function $\delta(\varepsilon) := \varepsilon^2/2$ if $\varepsilon \in [0, \alpha]$, and $\delta(\varepsilon) := \alpha\varepsilon - \alpha^2/2$ if $\varepsilon > \alpha$, a simple calculation then yields $\delta_{\max,\breve{L},L}(\varepsilon, Q) \ge c_Q\, \delta(\varepsilon)$, where $c_Q$ is the constant satisfying (3). Let us further define $\bar{\delta} : [0, \infty) \to [0, \infty)$ by $\bar{\delta}(\varepsilon) := \delta^q(\varepsilon^{1/q})$, $\varepsilon \ge 0$. In view of Theorem 3.1 we then need to find a convex function $\hat{\delta} : [0, \infty) \to [0, \infty)$ such that $\hat{\delta} \le \bar{\delta}$. To this end we define $\hat{\delta}(\varepsilon) := s_p \varepsilon^2$ if $\varepsilon \in [0, s_p a_p]$, and $\hat{\delta}(\varepsilon) := a_p \varepsilon - s_p^{p+2} a_p^2$ if $\varepsilon > s_p a_p$, where $a_p := \alpha^q$ and $s_p := 2^{-q/p}$. Then $\hat{\delta} : [0, \infty) \to [0, \infty)$ is continuously differentiable and its derivative is increasing, and thus $\hat{\delta}$ is convex. Moreover, we have $\hat{\delta}' \le \bar{\delta}'$ and hence $\hat{\delta} \le \bar{\delta}$, which in turn implies $\hat{\delta} \le \bar{\delta}^{**}$. Now we find the assertion by (16), (17), and Theorem 3.1.

The proof of Theorem 2.7 follows immediately from the following lemma.

Lemma 3.3 Let $Q$ be a symmetric, atom-free distribution on $\mathbb{R}$ with median $t^* = 0$. Then for $\epsilon > 0$ and $L$ being the $\epsilon$-insensitive loss we have $\mathcal{C}_{L,Q}(0) = \mathcal{C}^*_{L,Q} = 2 \int_\epsilon^\infty Q([s, \infty))\, ds$, and if $\mathcal{C}_{L,Q}(0) < \infty$ we further have

$$\mathcal{C}_{L,Q}(t) - \mathcal{C}_{L,Q}(0) = \int_{\epsilon - t}^{\epsilon} Q([s, \epsilon])\, ds + \int_\epsilon^{\epsilon + t} Q([\epsilon, s])\, ds, \qquad \text{if } t \in [0, \epsilon],$$

$$\mathcal{C}_{L,Q}(t) - \mathcal{C}_{L,Q}(\epsilon) = \int_0^{t - \epsilon} Q([s, \infty))\, ds - \int_{2\epsilon}^{\epsilon + t} Q([s, \infty))\, ds + 2 \int_0^{t - \epsilon} Q([0, s])\, ds \ge 0, \qquad \text{if } t > \epsilon.$$

In particular, if $Q([\epsilon - \delta, \epsilon + \delta]) = 0$ for some $\delta > 0$ then $\mathcal{C}_{L,Q}(\delta) = \mathcal{C}^*_{L,Q}$.

Proof: Because $L(y, t) = L(-y, -t)$ for all $y, t \in \mathbb{R}$ we only have to consider $t \ge 0$. For later use we note that for $0 \le a \le b \le \infty$ Equation (11) yields

$$\int_a^b y\, dQ(y) = a\, Q([a, b]) + \int_a^b Q([s, b])\, ds. \qquad (18)$$

Moreover, the definition of $L$ implies

$$\mathcal{C}_{L,Q}(t) = \int_{-\infty}^{t - \epsilon} (t - y - \epsilon)\, dQ(y) + \int_{t + \epsilon}^{\infty} (y - \epsilon - t)\, dQ(y).$$

Using the symmetry of $Q$ yields $\int_{-\infty}^{t - \epsilon} (-y)\, dQ(y) = \int_{\epsilon - t}^{\infty} y\, dQ(y)$ and hence we obtain

$$\mathcal{C}_{L,Q}(t) = \int_0^{t-\epsilon} Q((-\infty, t - \epsilon])\, ds - \int_0^{t+\epsilon} Q([t + \epsilon, \infty))\, ds + \int_{\epsilon - t}^{t + \epsilon} y\, dQ(y) + 2 \int_{t + \epsilon}^{\infty} y\, dQ(y). \qquad (19)$$

Let us first consider the case $t \ge \epsilon$. Then the symmetry of $Q$ yields $\int_{\epsilon - t}^{t + \epsilon} y\, dQ(y) = \int_{t - \epsilon}^{t + \epsilon} y\, dQ(y)$, and hence (18) implies

$$\mathcal{C}_{L,Q}(t) = \int_0^{t-\epsilon} Q([\epsilon - t, \infty))\, ds + \int_0^{t-\epsilon} Q([t - \epsilon, t + \epsilon])\, ds + \int_{t-\epsilon}^{t+\epsilon} Q([s, t + \epsilon])\, ds + 2 \int_{t+\epsilon}^{\infty} Q([s, \infty))\, ds + \int_0^{t+\epsilon} Q([t + \epsilon, \infty))\, ds.$$

Using

$$\int_{t-\epsilon}^{t+\epsilon} Q([s, t + \epsilon))\, ds = \int_0^{t+\epsilon} Q([s, t + \epsilon))\, ds - \int_0^{t-\epsilon} Q([s, t + \epsilon))\, ds$$

we further obtain

$$\int_{t-\epsilon}^{t+\epsilon} Q([s, t + \epsilon))\, ds + \int_0^{t+\epsilon} Q([t + \epsilon, \infty))\, ds + \int_{t+\epsilon}^{\infty} Q([s, \infty))\, ds = \int_0^{\infty} Q([s, \infty))\, ds - \int_0^{t-\epsilon} Q([s, t + \epsilon))\, ds.$$

From this and $\int_0^{t-\epsilon} Q([t - \epsilon, t + \epsilon])\, ds - \int_0^{t-\epsilon} Q([s, t + \epsilon])\, ds = -\int_0^{t-\epsilon} Q([s, t - \epsilon])\, ds$ follows

$$\mathcal{C}_{L,Q}(t) = -\int_0^{t-\epsilon} Q([s, t - \epsilon])\, ds + \int_0^{t-\epsilon} Q([\epsilon - t, \infty))\, ds + \int_{t+\epsilon}^{\infty} Q([s, \infty))\, ds + \int_0^{\infty} Q([s, \infty))\, ds.$$

The symmetry of $Q$ implies $\int_0^{t-\epsilon} Q([\epsilon - t, t - \epsilon])\, ds = 2 \int_0^{t-\epsilon} Q([0, t - \epsilon])\, ds$, and we get

$$-\int_0^{t-\epsilon} Q([s, t - \epsilon])\, ds + \int_0^{t-\epsilon} Q([\epsilon - t, \infty))\, ds = 2 \int_0^{t-\epsilon} Q([0, s))\, ds + \int_0^{t-\epsilon} Q([s, \infty))\, ds.$$

This and

$$\int_{t+\epsilon}^{\infty} Q([s, \infty))\, ds + \int_0^{\infty} Q([s, \infty))\, ds = 2 \int_{t+\epsilon}^{\infty} Q([s, \infty))\, ds + \int_0^{t+\epsilon} Q([s, \infty))\, ds$$

yields

$$\mathcal{C}_{L,Q}(t) = 2 \int_0^{t-\epsilon} Q([0, s))\, ds + \int_0^{t-\epsilon} Q([s, \infty))\, ds + 2 \int_{t+\epsilon}^{\infty} Q([s, \infty))\, ds + \int_0^{t+\epsilon} Q([s, \infty))\, ds.$$

By

$$\int_0^{t-\epsilon} Q([s, \infty))\, ds + \int_0^{t+\epsilon} Q([s, \infty))\, ds = 2 \int_0^{t-\epsilon} Q([s, \infty))\, ds + \int_{t-\epsilon}^{t+\epsilon} Q([s, \infty))\, ds$$

we obtain

$$\mathcal{C}_{L,Q}(t) = 2 \int_0^{t-\epsilon} Q([0, \infty))\, ds + 2 \int_{t+\epsilon}^{\infty} Q([s, \infty))\, ds + \int_{t-\epsilon}^{t+\epsilon} Q([s, \infty))\, ds$$

if $t \ge \epsilon$. Let us now consider the case $t \in [0, \epsilon]$. Analogously we obtain from (19) that

$$\mathcal{C}_{L,Q}(t) = \int_0^{\epsilon - t} Q([\epsilon - t, t + \epsilon])\, ds + \int_{\epsilon - t}^{\epsilon + t} Q([s, t + \epsilon])\, ds + 2 \int_{\epsilon + t}^{\infty} Q([s, \infty))\, ds + 2 \int_0^{\epsilon + t} Q([\epsilon + t, \infty))\, ds - \int_0^{\epsilon - t} Q([\epsilon - t, \infty))\, ds - \int_0^{\epsilon + t} Q([\epsilon + t, \infty))\, ds.$$

Combining this with

$$\int_0^{\epsilon - t} Q([\epsilon - t, t + \epsilon])\, ds - \int_0^{\epsilon - t} Q([\epsilon - t, \infty))\, ds = -\int_0^{\epsilon - t} Q([\epsilon + t, \infty))\, ds$$

and $\int_0^{\epsilon + t} Q([\epsilon + t, \infty))\, ds - \int_0^{\epsilon - t} Q([\epsilon + t, \infty))\, ds = \int_{\epsilon - t}^{\epsilon + t} Q([\epsilon + t, \infty))\, ds$ we get

$$\mathcal{C}_{L,Q}(t) = \int_{\epsilon - t}^{\epsilon + t} Q([\epsilon + t, \infty))\, ds + \int_{\epsilon - t}^{\epsilon + t} Q([s, t + \epsilon])\, ds + 2 \int_{\epsilon + t}^{\infty} Q([s, \infty))\, ds$$
$$= \int_{\epsilon - t}^{\epsilon + t} Q([s, \infty))\, ds + 2 \int_{\epsilon + t}^{\infty} Q([s, \infty))\, ds = \int_{\epsilon - t}^{\infty} Q([s, \infty))\, ds + \int_{\epsilon + t}^{\infty} Q([s, \infty))\, ds.$$

Hence $\mathcal{C}_{L,Q}(0) = 2 \int_\epsilon^{\infty} Q([s, \infty))\, ds$. The expressions for $\mathcal{C}_{L,Q}(t) - \mathcal{C}_{L,Q}(0)$, $t \in (0, \epsilon]$, and $\mathcal{C}_{L,Q}(t) - \mathcal{C}_{L,Q}(\epsilon)$, $t > \epsilon$, given in Lemma 3.3 follow by using the same arguments. Hence one exact minimizer of $\mathcal{C}_{L,Q}(\cdot)$ is the median $t^* = 0$. The last assertion is a direct consequence of the formula for $\mathcal{C}_{L,Q}(t) - \mathcal{C}_{L,Q}(0)$ in the case $t \in (0, \epsilon]$.

References

[1] H. Bauer. Measure and Integration Theory. De Gruyter, Berlin, 2001.
[2] A. Christmann and I. Steinwart. Consistency and robustness of kernel based regression. Bernoulli, 15:799–819, 2007.
[3] D.E. Edmunds and H. Triebel. Function Spaces, Entropy Numbers, Differential Operators. Cambridge University Press, 1996.
[4] C. Hwang and J. Shim. A simple quantile regression via support vector machine. In Advances in Natural Computation: First International Conference (ICNC), pages 512–520. Springer, 2005.
[5] R. Koenker. Quantile Regression. Cambridge University Press, 2005.
[6] B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12:1207–1245, 2000.
[7] I. Steinwart. How to compare different loss functions. Constr. Approx., 26:225–287, 2007.
[8] I. Steinwart, D. Hush, and C. Scovel. Function classes that approximate the Bayes risk. In Proceedings of the 19th Annual Conference on Learning Theory, COLT 2006, pages 79–93. Springer, 2006.
[9] I. Steinwart, D. Hush, and C. Scovel. An oracle inequality for clipped regularized risk minimizers. In Advances in Neural Information Processing Systems 19, pages 1321–1328, 2007.
[10] I. Takeuchi, Q.V. Le, T.D. Sears, and A.J. Smola. Nonparametric quantile estimation. J. Mach. Learn. Res., 7:1231–1264, 2006.