How Svms Can Estimate Quantiles and the Median
Total Page:16
File Type:pdf, Size:1020Kb
How SVMs can estimate quantiles and the median Ingo Steinwart Andreas Christmann Information Sciences Group CCS-3 Department of Mathematics Los Alamos National Laboratory Vrije Universiteit Brussel Los Alamos, NM 87545, USA B-1050 Brussels, Belgium [email protected] [email protected] Abstract We investigate quantile regression based on the pinball loss and the ǫ-insensitive loss. For the pinball loss a condition on the data-generating distribution P is given that ensures that the conditional quantiles are approximated with respect to 1. This result is then used to derive an oracle inequality for an SVM based kon · thek pinball loss. Moreover, we show that SVMs based on the ǫ-insensitive loss estimate the conditional median only under certain conditions on P . 1 Introduction Let P be a distribution on X Y , where X is an arbitrary set and Y R is closed. The goal of quantile regression is to estimate× the conditional quantile, i.e., the set valued⊂ function F ∗ (x) := t R :P ( , t] x τ and P [t, ) x 1 τ , x X, τ,P ∈ −∞ | ≥ ∞ | ≥ − ∈ where τ (0, 1) is a fixed constant and P( x), x X, is the (regular) conditional probability. For conceptual∈ simplicity (though mathematically· | this is∈ not necessary) we assume throughout this paper that F ∗ (x) consists of singletons, i.e., there exists a function f ∗ : X R, called the conditional τ,P τ,P → τ-quantile function, such that F ∗ (x) = f ∗ (x) , x X. Let us now consider the so-called τ,P { τ,P } ∈ τ-pinball loss L : R R [0, ) defined by L (y, t) := ψ (y t), where ψ (r) = (τ 1)r, if τ × → ∞ τ τ − τ − r < 0, and ψτ (r) = τr, if r 0. Moreover, given a (measurable) function f : X R we define the L -risk of f by (f) :=≥ E L (y, f(x)). Now recall that f ∗ is up to→ zero sets the only τ RLτ ,P (x,y)∼P τ τ,P function that minimizes the -risk, i.e. ∗ ∗ , where the infi- Lτ Lτ ,P(fτ,P) = inf Lτ ,P(f) =: Lτ ,P mum is taken over all f : X R. Based onR this observation severalR estimatorsR minimizing a (mod- → ified) empirical Lτ -risk were proposed (see [5] for a survey on both parametric and non-parametric methods) for situations where P is unknown, but i.i.d. samples D := ((x1, y1),..., (xn, yn)) drawn from P are given. In particular, [6, 4, 10] proposed an SVM that finds a solution fD,λ H of n ∈ 2 1 arg min λ f H + Lτ (yi, f(xi)) , (1) f∈H k k n i=1 X where λ > 0 is a regularization parameter and H is a reproducing kernel Hilbert space (RKHS) over X. Note that this optimization problem can be solved by considering the dual problem [4, 10], but since this technique is nowadays standard in machine learning we omit the details. Moreover, [10] contains an exhaustive empirical study as well some theoretical considerations. Empirical methods estimating quantiles with the help of the pinball loss typically obtain functions for which is close to ∗ with high probability. However, in general this only fD Lτ ,P(fD) Lτ ,P R ∗ R implies that fD is close to fτ,P in a very weak sense (see [7, Remark 3.18]), and hence there is so far only little justification for using fD as an estimate of the quantile function. Our goal is to address this issue by showing that under certain realistic assumptions on P we have an inequality of the form ∗ ∗ f f cP L ,P(f) . (2) k − τ,PkL1(PX ) ≤ R τ − RLτ ,P q We then use this inequality to establish an oracle inequality for SVMs defined by (1). In addition, we illustrate how this oracle inequality can be used to obtain learning rates and to justify a data- dependent method for finding the hyper-parameter λ and H. Finally, we generalize the methods for establishing (2) to investigate the role of ǫ in the ǫ-insensitive loss used in standard SVM regression. 2 Main results In the following X is an arbitrary, non-empty set equipped with a σ-algebra, and Y R is a closed non-empty set. Given a distribution P on X Y we further assume throughout this⊂ paper that the × σ-algebra on X is complete with respect to the marginal distribution PX of P, i.e., every subset of a PX -zero set is contained in the σ-algebra. Since the latter can always be ensured by increasing the original σ-algebra in a suitable manner we note that this is not a restriction at all. Definition 2.1 A distribution Q on R is said to have a τ-quantile of type α > 0 if there exists a τ-quantile t∗ R and a constant c > 0 such that for all s [0, α] we have ∈ Q ∈ Q (t∗, t∗ + s) c s and Q (t∗ s, t∗) c s . (3) ≥ Q − ≥ Q It is not difficult to see that a distribution Q having a τ-quantile of some type α has a unique τ- ∗ quantile t . Moreover, if Q has a Lebesgue density hQ then Q has a τ-quantile of type α if hQ is ∗ ∗ ∗ ∗ bounded away from zero on [t α, t +α] since we can use cQ := inf hQ(t): t [t α, t +α] in (3). This assumption is general− enough to cover many distributions{ used in parametric∈ − statistics} such as Gaussian, Student’s t, and logistic distributions (with Y = R), Gamma and log-normal distributions (with Y = [0, )), and uniform and Beta distributions (with Y = [0, 1]). ∞ The following definition describes distributions on X Y whose conditional distributions P( x), x X, have the same τ-quantile type α. × · | ∈ Definition 2.2 Let p (0, ], τ (0, 1), and α > 0. A distribution P on X Y is said to have a ∈ ∞ ∈ × τ-quantile of p-average type α, if Qx := P( x) has PX -almost surely a τ-quantile type α and b : X (0, ) defined by b(x):=c , where· | c is the constant in (3), satisfies b−1 L (P ). → ∞ P( · |x) P( · |x) ∈ p X Let us now give some examples for distributions having τ-quantiles of p-average type α. Example 2.3 Let P be a distribution on X R with marginal distribution PX and regular condi- ×−z tional probability Qx ( , y] := 1/(1+e ), y R, where z := y m(x) /σ(x), m : X R describes a location shift,−∞ and σ : X [β, 1/β] describes∈ a scale modification− for some constant→ → β (0, 1]. Let us further assume that the functions m and σ are measurable. Thus Qx is a logistic ∈ −z −z 2 R distribution having the positive and bounded Lebesgue density hQx (y) = e /(1 + e ) , y . The -quantile function is ∗ ∗ τ , , and we can choose∈ τ t (x) := fτ,Qx = m(x) + σ(x) log( 1−τ ) x X ∗ ∗ ∈ b(x) = inf hQx (t): t [t (x) α, t (x) + α] . Note that hQx (m(x) + y) = hQx (m(x) y) for all y R,{ and h (y)∈is strictly− decreasing for}y [m(x), ). Some calculations show − ∈ Qx ∈ ∞ ∗ ∗ u1(x) u2(x) 1 b(x) = min hQx (t (x) α), hQx (t (x)+α) = min 2 , 2 cα,β , , − (1+) u1(x) (1+) u2(x) ∈ 4 1−τ −α/σ(x) 1−τ α/σ(x) n o where u1(x) := τ e , u2(x) := τ e and cα,β > 0 can be chosen independent of x, because σ(x) [β, 1/β]. Hence b−1 L (P ) and P has a τ-quantile of -average type α. ∈ ∈ ∞ X ∞ Example 2.4 Let P˜ be a distribution on X Y with marginal distribution P˜ and regular con- × X ditional probability Q˜ x := P(˜ x) on Y . Furthermore, assume that Q˜ x is P˜X -almost surely of τ-quantile type α. Let us now· | consider the family of distributions P with marginal distribu- tion P˜ and regular conditional distributions Q := P˜ ( m(x))/σ(x) x , x X, where X x · − ∈ m : X R and σ : X (β, 1/β) are as in the previous example. Then Qx has a τ-quantile ∗ → ∗ → fτ,Q = m(x) + σ(x)f ˜ of type αβ, because we obtain for s [0, αβ] the inequality x τ,Qx ∈ ∗ ∗ ˜ ∗ ∗ Qx (fτ,Q , fτ,Q + s) = Qx (f ˜ , f ˜ + s/σ(x)) b(x)s/σ(x) b(x)βs . x x τ,Qx τ,Qx ≥ ≥ Consequently, P has a τ-quantile of p-average type αβ if and only if P˜ does have a τ-quantile of p-average type α. The following theorem shows that for distributions having a quantile of p-average type the condi- tional quantile can be estimated by functions that approximately minimize the pinball risk. Theorem 2.5 Let p (0, ], τ (0, 1), α > 0 be real numbers, and q := p . Moreover, let ∈ ∞ ∈ p1+ P be a distribution on X Y that has a τ-quantile of p-average type α. Then for all f : X R × − p+2 2p → satisfying (f) ∗ 2 1p+ α 1p+ we have RLτ ,P − RLτ ,P ≤ ∗ √ −1 1/2 ∗ f f L (P ) 2 b Lτ ,P(f) . k − τ,Pk q X ≤ k kLp(PX ) R − RLτ ,P q Our next goal is to establish an oracle inequality for SVMs defined by (1). To this end let us assume Y = [ 1, 1]. Then we have Lτ (y, t¯) Lτ (y, t) for all y Y , t R, where t¯ denotes t clipped to the interval− [ 1, 1], i.e., t¯ := max ≤1, min 1, t .