PRIL: Perceptron Ranking Using Interval Labeled Data

Naresh Manwani [email protected] IIIT Hyderabad, India

Abstract

In this paper, we propose an online learning algorithm, PRIL, for learning ranking classifiers using interval labeled data and show its correctness. We show that it converges in a finite number of steps if there exists an ideal classifier such that the rank it assigns to an example always lies in the example's label interval. We then generalize this mistake bound result to the general case. We also provide a regret bound for the proposed algorithm. We further propose a multiplicative update algorithm for PRIL, called M-PRIL, and provide its correctness and convergence results. We show the effectiveness of PRIL through its performance on various datasets.

1. Introduction

Ranking (also called ordinal classification) is an important problem in machine learning. Ranking differs from multi-class classification in the sense that there is an ordering among the class labels. Consider, for example, product ratings on online retail stores, which are based on customer reviews, product quality, price and many other factors. Usually these ratings are numbered between 1-5. While these numbers can be thought of as class labels, there is also an ordering among them that has to be taken into account. This problem has been well studied in machine learning and is referred to as ordinal classification or ranking.

In general, an ordinal classifier can be completely defined by a linear function and a set of K − 1 thresholds (K being the number of classes). Each threshold corresponds to a class; thus, the thresholds should have the same order as their corresponding classes. The classifier decides the rank (class) based on the relative position of the linear function value with respect to the different thresholds. One can also learn a nonlinear classifier by using an appropriate nonlinear transformation. A number of discriminative approaches based on the risk minimization principle have been proposed for learning ordinal classifiers. Variants of large margin frameworks for learning ordinal classifiers are proposed in Shashua & Levin (2002) and Chu & Keerthi (2005). The order of the thresholds can be maintained either explicitly or implicitly: in the explicit approach, the ordering is posed as a constraint in the optimization problem itself, while the implicit approach captures the ordering by posing separability conditions between every pair of classes. Li & Lin (2006) propose a generic method which converts learning an ordinal classifier into learning a binary classifier with weighted examples. A classical online algorithm for learning linear classifiers is proposed in Rosenblatt (1958). Crammer & Singer (2001b) extended the Perceptron learning algorithm to ordinal classifiers.

In the approaches discussed so far, the training data has the correct class label for each feature vector. However, in many cases we may not know the exact label; instead, we may have an interval in which the true label lies. Such a scenario is discussed in Antoniuk et al. (2015; 2016). In this setting, an interval label is provided for each example, and it is assumed that the true label of the example lies in this interval. In Antoniuk et al. (2016), a large margin framework for batch learning is proposed using an interval insensitive loss function.

In this paper, we propose an online algorithm for learning an ordinal classifier using interval labeled data. We name the proposed approach PRIL (Perceptron ranking using interval labeled data). Our approach is based on the interval insensitive loss function. To the best of our knowledge, this is the first online ranking algorithm using interval labeled data. We show the correctness of the algorithm by showing that after each iteration it maintains the ordering of the thresholds. We derive mistake bounds for the proposed algorithm in both the ideal and the general setting. In the ideal setting, we show that the algorithm stops after making a finite number of mistakes. We also derive a regret bound for the algorithm. We further propose a multiplicative update algorithm for PRIL (called M-PRIL), show its correctness and find its mistake bound.

The rest of the paper is organized as follows. In Section 2, we describe the problem of learning an ordinal classifier using interval labeled data. In Section 3, we discuss the proposed online algorithm for learning an ordinal classifier using interval labeled data, and we derive the mistake bounds and the regret bound in Section 3.2. We present the experimental results in Section 5 and conclude with some remarks on future work in Section 6.

2. Ordinal Classification using Interval Labeled Data

Let X ⊆ R^d be the instance space and Y = {1, . . . , K} be the label space. Our objective is to learn an ordinal classifier h : X → Y of the following form:

h(x) = 1 + Σ_{k=1}^{K−1} I{w·x > θ_k} = min_{i∈[K]} {i : w·x − θ_i < 0},

where w ∈ R^d and θ ∈ R^{K−1} are the parameters to be optimized, and we take θ_K = ∞ by convention. The parameters θ = [θ_1 . . . θ_{K−1}] should be such that θ_1 ≤ θ_2 ≤ . . . ≤ θ_{K−1}. The classifier splits the real line into K consecutive intervals using the thresholds θ_1, . . . , θ_{K−1} and then decides the class label based on the interval in which w·x falls.
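For concreteness, a minimal Python sketch of this prediction rule is given below (an illustration under our own naming conventions; the function and variable names are not from the paper).

import numpy as np

def predict_rank(w, thresholds, x):
    # thresholds must satisfy theta_1 <= ... <= theta_{K-1};
    # the predicted rank is the first i with w.x - theta_i < 0,
    # or K if the score exceeds every threshold.
    score = np.dot(w, x)
    K = len(thresholds) + 1
    for i, theta in enumerate(thresholds, start=1):
        if score - theta < 0:
            return i
    return K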

Here, we assume that for each example x, the annotator provides an interval [y_l, y_r] ∈ Y × Y (y_l ≤ y_r). The interval annotation means that the true label y of example x lies in the interval [y_l, y_r]. Let S = {(x^1, y_l^1, y_r^1), . . . , (x^T, y_l^T, y_r^T)} be the training set.

The discrepancy between the predicted label and the corresponding label interval can be measured using the interval insensitive loss (Antoniuk et al., 2016):

L_I^{MAE}(f(x), θ, y_l, y_r) = Σ_{i=1}^{y_l−1} I{f(x) < θ_i} + Σ_{i=y_r}^{K−1} I{f(x) ≥ θ_i},    (1)

where the subscript I stands for interval. This loss function takes value 0 whenever θ_{y_l−1} ≤ f(x) ≤ θ_{y_r}. However, it is discontinuous. A convex surrogate of this loss function is as follows (Antoniuk et al., 2016):

L_I^{IMC}(f(x), y_l, y_r, θ) = Σ_{i=1}^{y_l−1} max(0, −f(x) + θ_i) + Σ_{i=y_r}^{K−1} max(0, f(x) − θ_i).    (2)

Here IMC stands for the implicit constraints for ordering of the thresholds θ_i. For a given example-interval pair {x, (y_l, y_r)}, the loss L_I^{IMC}(f(x), y_l, y_r, θ) becomes zero only when

f(x) − θ_i ≥ 0  ∀i ∈ {1, . . . , y_l − 1}
f(x) − θ_i ≤ 0  ∀i ∈ {y_r, . . . , K − 1}.
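As an illustration, the two losses can be computed as in the following Python sketch (a minimal rendering of eqs. (1) and (2); names are ours, not from the paper).

def interval_mae_loss(score, thresholds, y_l, y_r):
    # Eq. (1): count thresholds below the interval that the score fails to exceed
    # and thresholds at or above the interval that the score reaches.
    left = sum(score < thresholds[i] for i in range(y_l - 1))
    right = sum(score >= thresholds[i] for i in range(y_r - 1, len(thresholds)))
    return left + right

def interval_imc_loss(score, thresholds, y_l, y_r):
    # Eq. (2): convex hinge-like surrogate of the interval insensitive loss.
    left = sum(max(0.0, thresholds[i] - score) for i in range(y_l - 1))
    right = sum(max(0.0, score - thresholds[i]) for i in range(y_r - 1, len(thresholds)))
    return left + right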

Note that if f(x) < θ_i for some i ∈ {1, . . . , y_l − 1}, then f(x) < θ_j, ∀j = y_r, . . . , K − 1, because θ_1 ≤ θ_2 ≤ . . . ≤ θ_{K−1}. Similarly, if f(x) > θ_i for some i ∈ {y_r, . . . , K − 1}, then f(x) > θ_j, ∀j = 1, . . . , y_l − 1. Let Ī = {1, . . . , y_l − 1} ∪ {y_r, . . . , K − 1}. Then, we define z_i, ∀i ∈ Ī, as follows:

z_i = +1  ∀i ∈ {1, . . . , y_l − 1}
z_i = −1  ∀i ∈ {y_r, . . . , K − 1}.    (3)

Thus, L_I^{IMC}(f(x), y_l, y_r, θ) = 0 requires that z_i(f(x) − θ_i) ≥ 0, ∀i ∈ Ī, and L_I^{IMC} can be re-written as

L_I^{IMC}(f(x), y_l, y_r, θ) = Σ_{i∈Ī} max(0, −z_i(f(x) − θ_i)).

3. Perceptron Ranking using Interval Labeled Data

In this section, we propose an online algorithm for ranking using the interval insensitive loss described in eq. (2). Our algorithm is based on stochastic gradient descent on L_I^{IMC}. We derive the algorithm for a linear classifier, which means f(x) = w·x; thus, the parameters to be estimated are w and θ. We initialize with w^1 = 0 and θ^1 = 0. Let w^t, θ^t be the estimates of the parameters at the beginning of trial t, let x^t be the example observed at trial t, and let [y_l^t, y_r^t] be its label interval. Taking the learning rate η = 1, w^{t+1} and θ^{t+1} are found as follows:

w^{t+1} = w^t − η ∇_w L_I^{IMC}(w·x^t, y_l^t, y_r^t, θ) |_{w^t, θ^t}
        = w^t + Σ_{i∈Ī^t} z_i^t x^t I{z_i^t(w^t·x^t − θ_i^t) < 0}

θ_i^{t+1} = θ_i^t − η ∂L_I^{IMC}(w·x^t, y_l^t, y_r^t, θ)/∂θ_i |_{w^t, θ^t}
         = θ_i^t − z_i^t I{z_i^t(w^t·x^t − θ_i^t) < 0}   ∀i ∈ Ī^t
θ_i^{t+1} = θ_i^t                                         ∀i ∉ Ī^t.

Thus, only those constraints which are not satisfied participate in the update. The violation of the i-th constraint contributes z_i^t x^t to w^{t+1} and −z_i^t to θ_i^{t+1}; the thresholds θ_i^t, i ∉ Ī^t, are not updated in trial t. The complete approach is described in Algorithm 1. It is important to note that when exact labels are given to Algorithm 1 instead of partial labels, it becomes the same as the algorithm proposed in (Crammer & Singer, 2001b). PRIL can be easily extended to learn nonlinear classifiers using kernel methods.

Algorithm 1 Perceptron Ranking using Interval Labeled Data (PRIL)

Input: Training dataset S
Initialize: w^1 = 0, θ_1^1 = θ_2^1 = . . . = θ_{K−1}^1 = 0
for t ← 1 to T do
    Get example x^t and its interval (y_l^t, y_r^t)
    for i ← 1 to y_l^t − 1 do: z_i^t = +1
    for i ← y_r^t to K − 1 do: z_i^t = −1
    Initialize τ_i^t = 0, ∀i ∈ [K − 1]
    for i ∈ Ī^t do
        if z_i^t (w^t·x^t − θ_i^t) ≤ 0 then τ_i^t = z_i^t
    end for
    w^{t+1} = w^t + Σ_{i=1}^{K−1} τ_i^t x^t
    θ_i^{t+1} = θ_i^t − τ_i^t, i = 1 . . . K − 1
end for
Output: h(x) = min_{i∈[K]} {i : w^{T+1}·x − θ_i^{T+1} < 0}

Algorithm 2 Kernel PRIL

Input: Training dataset S
Output: τ_1^t, . . . , τ_{K−1}^t, t = 1 . . . T
Initialize: f^1(·) = 0, θ_1^1 = . . . = θ_{K−1}^1 = 0
for t ← 1 to T do
    Get example x^t and its interval (y_l^t, y_r^t)
    for i ← 1 to y_l^t − 1 do: z_i^t = +1
    for i ← y_r^t to K − 1 do: z_i^t = −1
    Initialize τ_i^t = 0, ∀i ∈ [K − 1]
    for i ∈ Ī^t do
        if z_i^t (f^t(x^t) − θ_i^t) ≤ 0 then τ_i^t = z_i^t
    end for
    f^{t+1}(·) = f^t(·) + Σ_{i∈Ī^t} τ_i^t κ(x^t, ·)
    θ_i^{t+1} = θ_i^t − τ_i^t, ∀i ∈ [K − 1]
end for
Output: h(x) = min_{i∈[K]} {i : f^{T+1}(x) − θ_i^{T+1} < 0}
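To make the update concrete, here is a minimal Python sketch of Algorithm 1 under our own naming conventions (an illustration of the update rule, not the authors' reference implementation).

import numpy as np

def pril_train(examples, K, dim):
    """examples: iterable of (x, y_l, y_r) with x a numpy array of length dim."""
    w = np.zeros(dim)
    theta = np.zeros(K - 1)          # theta[i] stores threshold theta_{i+1}
    for x, y_l, y_r in examples:
        score = np.dot(w, x)
        tau = np.zeros(K - 1)
        # Constraints below the interval (z_i = +1) and above it (z_i = -1).
        for i in range(y_l - 1):
            if score - theta[i] <= 0:
                tau[i] = +1.0
        for i in range(y_r - 1, K - 1):
            if score - theta[i] >= 0:
                tau[i] = -1.0
        # Additive PRIL update: violated constraints move w and the thresholds.
        w = w + tau.sum() * x
        theta = theta - tau
    return w, theta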

3.1. Kernel PRIL

We can easily extend the proposed algorithm PRIL to learn nonlinear classifiers using kernel functions. We see that the classifier learnt after t trials by PRIL is completely determined by the coefficients τ_i^s, i ∈ {1, . . . , K − 1}, s ∈ [t], as w^{t+1} = Σ_{s=1}^{t} Σ_{i∈Ī^s} τ_i^s x^s and θ_i^{t+1} = −Σ_{s=1}^{t} τ_i^s, i = 1 . . . K − 1. Also, f^{t+1}(x) can be written as f^{t+1}(x) = Σ_{s=1}^{t} Σ_{i∈Ī^s} τ_i^s (x^s·x). Thus, we can replace the inner product with a suitable kernel function κ : X × X → R and represent f^{t+1}(x) as

f^{t+1}(x) = Σ_{s=1}^{t} Σ_{i∈Ī^s} τ_i^s κ(x^s, x).

The ordinal classifier learnt after T trials is

h(x) = min_{i∈[K]} { i : Σ_{s=1}^{T} ( Σ_{j∈Ī^s} τ_j^s κ(x^s, x) + τ_i^s ) < 0 }.

The complete description of kernel PRIL is provided in Algorithm 2.

3.2. Analysis

We first show that PRIL preserves the ordering of the thresholds θ_1, . . . , θ_{K−1}.

Lemma 1 (Order Preservation). Let w^t and θ^t = [θ_1^t . . . θ_{K−1}^t]^T be the current parameters of the ranking classifier, where θ_1^t ≤ θ_2^t ≤ . . . ≤ θ_{K−1}^t. Let x^t be the instance fed to PRIL at trial t and [y_l^t, y_r^t] be its corresponding rank interval. Let w^{t+1} and θ^{t+1} = [θ_1^{t+1} . . . θ_{K−1}^{t+1}]^T be the resulting parameters after the PRIL update. Then θ_1^{t+1} ≤ θ_2^{t+1} ≤ . . . ≤ θ_{K−1}^{t+1}.

Proof: Note that θ_i^t ∈ Z, ∀i ∈ {1, . . . , K − 1}, ∀t ∈ {1, . . . , T}, as PRIL initializes θ_i^1 = 0, ∀i ∈ {1, . . . , K − 1}, and every update changes θ_i by −1, 0 or +1. To show that PRIL preserves the ordering of the thresholds, we consider the following cases.

1. i ∈ {1, . . . , y_l^t − 2}: We see that

θ_{i+1}^{t+1} − θ_i^{t+1} = θ_{i+1}^t − θ_i^t − z_{i+1}^t I{z_{i+1}^t(w^t·x^t − θ_{i+1}^t) ≤ 0} + z_i^t I{z_i^t(w^t·x^t − θ_i^t) ≤ 0}
                        = θ_{i+1}^t − θ_i^t + I{w^t·x^t − θ_i^t ≤ 0} − I{w^t·x^t − θ_{i+1}^t ≤ 0},

where we used the fact that z_i^t = +1, ∀i ∈ {1, . . . , y_l^t − 1}. There can be only two cases.

(a) θ_{i+1}^t = θ_i^t: In this case, we simply get θ_{i+1}^{t+1} = θ_i^{t+1}.

(b) θ_{i+1}^t > θ_i^t: Since θ_i^t ∈ Z ∀i, ∀t, we get θ_{i+1}^t ≥ θ_i^t + 1. This means θ_{i+1}^{t+1} − θ_i^{t+1} ≥ 1 + I{w^t·x^t − θ_i^t ≤ 0} − I{w^t·x^t − θ_{i+1}^t ≤ 0}. But I{w^t·x^t − θ_i^t ≤ 0} − I{w^t·x^t − θ_{i+1}^t ≤ 0} ∈ {−1, 0}. Thus, θ_{i+1}^{t+1} − θ_i^{t+1} ≥ 0.

2. i = y_l^t − 1: In this case θ_{i+1}^{t+1} = θ_{i+1}^t as per the update rule. Also, z_i^t = +1. Thus, using the fact that θ_{i+1}^t − θ_i^t ≥ 0, we get

θ_{i+1}^{t+1} − θ_i^{t+1} = θ_{i+1}^t − θ_i^t + z_i^t I{z_i^t(w^t·x^t − θ_i^t) ≤ 0} = θ_{i+1}^t − θ_i^t + I{w^t·x^t − θ_i^t ≤ 0} ≥ 0.

3. i ∈ {y_l^t, . . . , y_r^t − 2}: PRIL does not update thresholds in this range, i.e., θ_i^{t+1} = θ_i^t, ∀i ∈ {y_l^t, . . . , y_r^t − 1}. Thus, θ_{i+1}^{t+1} = θ_{i+1}^t ≥ θ_i^t = θ_i^{t+1}, ∀i ∈ {y_l^t, . . . , y_r^t − 2}.

4. i = y_r^t − 1: In this case θ_i^{t+1} = θ_i^t as per the update rule. Also, z_{i+1}^t = −1. Thus, using the fact that θ_{i+1}^t − θ_i^t ≥ 0, we get

θ_{i+1}^{t+1} − θ_i^{t+1} = θ_{i+1}^t − θ_i^t − z_{i+1}^t I{z_{i+1}^t(w^t·x^t − θ_{i+1}^t) ≤ 0} = θ_{i+1}^t − θ_i^t + I{w^t·x^t − θ_{i+1}^t ≥ 0} ≥ 0.

5. i ∈ {y_r^t, . . . , K − 2}: We see that

θ_{i+1}^{t+1} − θ_i^{t+1} = θ_{i+1}^t − θ_i^t − z_{i+1}^t I{z_{i+1}^t(w^t·x^t − θ_{i+1}^t) ≤ 0} + z_i^t I{z_i^t(w^t·x^t − θ_i^t) ≤ 0}
                        = θ_{i+1}^t − θ_i^t − I{w^t·x^t − θ_i^t ≥ 0} + I{w^t·x^t − θ_{i+1}^t ≥ 0},

where we used the fact that z_i^t = −1, ∀i ∈ {y_r^t, . . . , K − 1}. There can be only two cases.

(a) θ_{i+1}^t = θ_i^t: In this case, we simply get θ_{i+1}^{t+1} = θ_i^{t+1}.

(b) θ_{i+1}^t > θ_i^t: Since θ_i^t ∈ Z ∀i, ∀t, we get θ_{i+1}^t ≥ θ_i^t + 1. This means θ_{i+1}^{t+1} − θ_i^{t+1} ≥ 1 + I{w^t·x^t − θ_{i+1}^t ≥ 0} − I{w^t·x^t − θ_i^t ≥ 0}. But I{w^t·x^t − θ_{i+1}^t ≥ 0} − I{w^t·x^t − θ_i^t ≥ 0} ∈ {0, −1}. Thus, θ_{i+1}^{t+1} − θ_i^{t+1} ≥ 0.

This completes the proof.

We now show that PRIL makes a finite number of mistakes if there exists an ideal interval ranking classifier.

Theorem 1. Let S = {(x^1, y_l^1, y_r^1), . . . , (x^T, y_l^T, y_r^T)} be an input sequence. Let R_2 = max_{t∈[T]} ||x^t||_2 and c = min_{t∈[T]}(y_r^t − y_l^t). Let there exist γ > 0, w* ∈ R^d and θ_1*, . . . , θ_{K−1}* ∈ R such that ||w*||_2^2 + Σ_{i=1}^{K−1}(θ_i*)^2 = 1 and min_{i∈Ī^t} z_i^t(w*·x^t − θ_i*) ≥ γ, ∀t ∈ [T]. Then

Σ_{t=1}^{T} L_I^{MAE}(f^t(x^t), θ^t, y_l^t, y_r^t) ≤ (R_2^2 + 1)(K − c − 1) / γ^2,

where f^t(x^t) = w^t·x^t.

Proof: Let v* = [w*^T θ_1* θ_2* . . . θ_{K−1}*]^T and v^t = [(w^t)^T θ_1^t θ_2^t . . . θ_{K−1}^t]^T. Suppose Algorithm 1 makes a mistake at trial t, and let M^t = {i ∈ Ī^t | z_i^t(w^t·x^t − θ_i^t) ≤ 0} be the set of indices of the constraints which are not satisfied at trial t. Let m^t = |M^t| be the number of such constraints.

Thus,

v^{t+1}·v* = (w^t + Σ_{i=1}^{K−1} τ_i^t x^t)·w* + Σ_{i=1}^{K−1}(θ_i^t − τ_i^t) θ_i*
          = v^t·v* + Σ_{i=1}^{K−1} τ_i^t (w*·x^t − θ_i*)
          = v^t·v* + Σ_{i∈M^t} z_i^t (w*·x^t − θ_i*)
          ≥ v^t·v* + Σ_{i∈M^t} γ = v^t·v* + γ m^t,

where we have used the fact that z_i^t(w*·x^t − θ_i*) ≥ γ. Summing both sides from t = 1 to T and using v^1 = 0, we get

v^{T+1}·v* ≥ v^1·v* + γ Σ_{t=1}^{T} m^t = γ Σ_{t=1}^{T} m^t.    (4)

Now, we upper bound ||v^{T+1}||_2^2:

||v^{t+1}||_2^2 = ||v^t||_2^2 + 2 Σ_{i=1}^{K−1} τ_i^t w^t·x^t + (Σ_{i=1}^{K−1} τ_i^t)^2 ||x^t||_2^2 − 2 Σ_{i=1}^{K−1} τ_i^t θ_i^t + Σ_{i=1}^{K−1}(τ_i^t)^2
              = ||v^t||_2^2 + 2 Σ_{i∈M^t} τ_i^t (w^t·x^t − θ_i^t) + (Σ_{i∈M^t} τ_i^t)^2 ||x^t||_2^2 + Σ_{i∈M^t}(τ_i^t)^2.

We know that ||x^t||_2^2 ≤ R_2^2 ∀t, Σ_{i∈M^t}(τ_i^t)^2 = m^t, (Σ_{i∈M^t} τ_i^t)^2 ≤ (m^t)^2 and τ_i^t(w^t·x^t − θ_i^t) = z_i^t(w^t·x^t − θ_i^t) ≤ 0, ∀i ∈ M^t. Thus, ||v^{t+1}||_2^2 − ||v^t||_2^2 ≤ (m^t)^2 R_2^2 + m^t. Summing both sides from t = 1 to T and using v^1 = 0, we get

||v^{T+1}||_2^2 ≤ R_2^2 Σ_{t=1}^{T} (m^t)^2 + Σ_{t=1}^{T} m^t.    (5)

Now, using the Cauchy-Schwarz inequality, v^{T+1}·v* ≤ ||v^{T+1}||_2 ||v*||_2 = ||v^{T+1}||_2. Using eqs. (4) and (5), we get

γ^2 (Σ_{t=1}^{T} m^t)^2 ≤ ||v^{T+1}||_2^2 ≤ R_2^2 Σ_{t=1}^{T} (m^t)^2 + Σ_{t=1}^{T} m^t
⇒ Σ_{t=1}^{T} m^t ≤ (1/γ^2) ( R_2^2 (Σ_{t=1}^{T} (m^t)^2) / (Σ_{t=1}^{T} m^t) + 1 ).

But m^t ≤ K − (y_r^t − y_l^t) − 1 ≤ K − c − 1, so Σ_{t=1}^{T} (m^t)^2 ≤ (K − c − 1) Σ_{t=1}^{T} m^t, which gives

Σ_{t=1}^{T} m^t ≤ (R_2^2 (K − c − 1) + 1) / γ^2 ≤ (R_2^2 + 1)(K − c − 1) / γ^2.

Since L_I^{MAE}(f^t(x^t), θ^t, y_l^t, y_r^t) ≤ m^t, this means

Σ_{t=1}^{T} L_I^{MAE}(f^t(x^t), θ^t, y_l^t, y_r^t) ≤ (R_2^2 + 1)(K − c − 1) / γ^2.

In Theorem 1, we assumed that there exists an ideal classifier defined by w* ∈ R^d and θ*. Let v* = [w*' θ*']' and f*(x^t) = w*·x^t. Thus,

L_I^{MAE}(f*(x^t), θ*, y_l^t, y_r^t) = 0, ∀t ∈ [T],

which means z_i^t(w*·x^t − θ_i*) ≥ 0, ∀i ∈ Ī^t, ∀t ∈ [T], where z_i^t, i ∈ Ī^t, are as described in eq. (3). Now we define x̂_i^t ∈ R^{d+K−1}, ∀i ∈ Ī^t, as follows:

x̂_i^t = [(x^t)' 0 . . . 0 −1 0 . . . 0]', i ∈ Ī^t,    (6)

where the component values at locations d + 1, . . . , d + K − 1 are all set to 0, except for location d + i, which is set to −1. Thus, we have z_i^t(v*·x̂_i^t) ≥ 0, ∀i ∈ Ī^t, ∀t ∈ [T]; that is, v* correctly classifies all the x̂_i^t, ∀i ∈ Ī^t, ∀t ∈ [T]. However, in general, for a given dataset, we may not know whether such an ideal classifier exists. Next, we derive the mistake bound for this general setting.

Theorem 2. Let S = {(x^1, y_l^1, y_r^1), . . . , (x^T, y_l^T, y_r^T)} be an input sequence. Let γ > 0, R_2 = max_{t∈[T]} ||x^t||_2 and c = min_{t∈[T]}(y_r^t − y_l^t). Then, for any w ∈ R^d and θ ∈ R^{K−1} such that ||w||_2^2 + ||θ||_2^2 = 1, we get

Σ_{t=1}^{T} L_I^{MAE}(f^t(x^t), θ^t, y_l^t, y_r^t) ≤ (D + √(R_2^2 + 1))^2 (K − c − 1) / γ^2,

where f^t(x^t) = w^t·x^t, D^2 = Σ_{t=1}^{T} Σ_{i∈Ī^t} (d_i^t)^2 and d_i^t = max[0, γ − z_i^t(w·x^t − θ_i)], i ∈ Ī^t, ∀t ∈ [T].

Proof: We proceed by constructing x̃_i^t ∈ R^{d+(T+1)(K−1)}, ∀i ∈ Ī^t, corresponding to every x^t, t ∈ [T], as follows:

x̃_i^t = [(x̂_i^t)' 0 . . . 0 ∆ 0 . . . 0]',

where x̂_i^t is as described in eq. (6). The first d + K − 1 components of x̃_i^t are the same as x̂_i^t and all remaining elements are set to 0, except for location d + (K − 1)t + i, which is set to ∆. Let u^t ∈ R^{K−1} be defined as

u^t = (1/(∆Z)) [z_1^t d_1^t . . . z_{y_l^t−1}^t d_{y_l^t−1}^t 0 . . . 0 z_{y_r^t}^t d_{y_r^t}^t . . . z_{K−1}^t d_{K−1}^t],

where d_i^t = max[0, γ − z_i^t(w·x^t − θ_i)], i ∈ Ī^t, ∀t ∈ [T]. We now construct ṽ ∈ R^{d+(T+1)(K−1)} as

ṽ = [w'/Z  θ'/Z  u^1 u^2 . . . u^T]',

where Z is chosen such that ||ṽ||_2 = 1. Thus,

Z^2 = 1 + Σ_{t=1}^{T} Σ_{i∈Ī^t} (d_i^t)^2 / ∆^2 = 1 + D^2/∆^2.

We also see that ||x̃_i^t||_2^2 = ||x^t||_2^2 + 1 + ∆^2 ≤ R_2^2 + ∆^2 + 1, ∀i ∈ Ī^t, ∀t ∈ [T]. Moreover,

z_i^t ṽ·x̃_i^t = z_i^t(w·x^t − θ_i)/Z + d_i^t/Z ≥ z_i^t(w·x^t − θ_i)/Z + (γ − z_i^t(w·x^t − θ_i))/Z = γ/Z.

Thus, ṽ correctly classifies all the examples x̃_i^t with margin at least γ/Z. Therefore, by using the mistake bound given in Theorem 1, we get

Σ_{t=1}^{T} L_I^{MAE}(f^t(x^t), θ^t, y_l^t, y_r^t) ≤ Z^2 (R_2^2 + ∆^2 + 1) K_1 / γ^2,

where K_1 = K − c − 1. To find the least upper bound, we minimize the RHS with respect to ∆; the minimum is attained at ∆* = (D^2(R_2^2 + 1))^{1/4}. Replacing ∆ by ∆*, we get

Σ_{t=1}^{T} L_I^{MAE}(f^t(x^t), θ^t, y_l^t, y_r^t) ≤ (D + √(R_2^2 + 1))^2 (K − c − 1) / γ^2.

REGRET ANALYSIS

Let S = {(x^1, y_l^1, y_r^1), . . . , (x^T, y_l^T, y_r^T)} be the input sequence and let {w^t, θ^t}_{t=1}^{T} be the sequence of parameter vectors generated by an online ranking algorithm A. The regret of algorithm A is defined as

R_T(A) = Σ_{t=1}^{T} L_I^{IMC}(w^t·x^t, θ^t, y_l^t, y_r^t) − min_{(w,θ)∈Ω} Σ_{t=1}^{T} L_I^{IMC}(w·x^t, θ, y_l^t, y_r^t).

For online gradient descent applied to convex cost functions, the regret bound analysis is given by Zinkevich (2003). The objective function of PRIL is also convex; motivated by this, we now derive a regret bound for PRIL.

Theorem 3. Let S = {(x^1, y_l^1, y_r^1), . . . , (x^T, y_l^T, y_r^T)} be an input sequence such that x^t ∈ R^d, ∀t ∈ [T]. Let R_2 = max_{t∈[T]} ||x^t||_2 and c = min_{t∈[T]}(y_r^t − y_l^t). Let Ω = {v ∈ R^{d+K−1} : ||v||_2 ≤ Λ}. Let v^t = (w^t, θ^t), t = 1 . . . T + 1, be the sequence of vectors such that ∀t ≥ 1, v^{t+1} = v^t − ∇_t, where ∇_t belongs to the sub-gradient set of L_I^{IMC}(w·x^t, θ, y_l^t, y_r^t) at v^t. Then

R_T(PRIL) ≤ (1/2) Λ^2 + (T/2)(R_2^2 + 1)(K − c − 1).

Proof: Let v = (w, θ) ∈ Ω. Using the convexity of L_I^{IMC}, we get

L_I^{IMC}(w^t·x^t, θ^t, y_l^t, y_r^t) − L_I^{IMC}(w·x^t, θ, y_l^t, y_r^t)
  ≤ ∇L_I^{IMC}(w^t·x^t, θ^t, y_l^t, y_r^t)·(v^t − v)
  = (v^t − v^{t+1})·(v^t − v)    (using the update in Algorithm 1)
  = (1/2) [ ||v − v^t||^2 − ||v − v^{t+1}||^2 + ||v^t − v^{t+1}||^2 ]
  = (1/2) [ ||v − v^t||^2 − ||v − v^{t+1}||^2 ] + (1/2) ||∇L_I^{IMC}(w^t·x^t, θ^t, y_l^t, y_r^t)||^2.

But

||∇L_I^{IMC}(w^t·x^t, θ^t, y_l^t, y_r^t)||^2 = ||Σ_{i∈Ī^t} τ_i^t x^t||^2 + Σ_{i∈Ī^t} (τ_i^t)^2
  ≤ (||x^t||^2 + 1) Σ_{i∈Ī^t} |τ_i^t|
  ≤ (R_2^2 + 1)(K − y_r^t + y_l^t − 1) ≤ (R_2^2 + 1)(K − c − 1).

Thus,

L_I^{IMC}(w^t·x^t, θ^t, y_l^t, y_r^t) − L_I^{IMC}(w·x^t, θ, y_l^t, y_r^t) ≤ (1/2)[ ||v − v^t||^2 − ||v − v^{t+1}||^2 ] + (1/2)(R_2^2 + 1)(K − c − 1).    (7)

Summing eq. (7) on both sides from t = 1 to T, we get

Σ_{t=1}^{T} L_I^{IMC}(w^t·x^t, θ^t, y_l^t, y_r^t) − Σ_{t=1}^{T} L_I^{IMC}(w·x^t, θ, y_l^t, y_r^t)
  ≤ (1/2) [ ||v − v^1||^2 − ||v − v^{T+1}||^2 ] + (T/2)(R_2^2 + 1)(K − c − 1)
  ≤ (1/2) ||v||^2 + (T/2)(R_2^2 + 1)(K − c − 1),

where we have used the facts that ||v − v^{T+1}||^2 ≥ 0 and v^1 = 0. Since ||v||_2^2 ≤ Λ^2 for every v ∈ Ω, and the above holds for any v = (w, θ) ∈ Ω, we have

R_T(PRIL) ≤ (1/2) Λ^2 + (T/2)(R_2^2 + 1)(K − c − 1).

4. Multiplicative PRIL

In this section, we propose a multiplicative algorithm for PRIL called M-PRIL, in which the weight vector w and the thresholds θ are modified in a multiplicative manner. The algorithm is inspired by the Winnow algorithm proposed for learning linear predictors (Kivinen & Warmuth, 1997). In M-PRIL, at every iteration t, the weight vector w^t and the thresholds θ^t are maintained such that ||w^t||_1 + ||θ^t||_1 = 1. The complete algorithm is described in Algorithm 3.

Algorithm 3 M-PRIL

Input: Training dataset S
Initialize: w_i^1 = 1/(K + d − 1), ∀i ∈ [d]; θ_k^1 = 1/(K + d − 1), ∀k ∈ [K − 1]
for t ← 1 to T do
    Get example x^t and its interval (y_l^t, y_r^t)
    for i ← 1 to y_l^t − 1 do: z_i^t = +1
    for i ← y_r^t to K − 1 do: z_i^t = −1
    Initialize τ_i^t = 0, ∀i ∈ [K − 1]
    for i ∈ Ī^t do
        if z_i^t (w^t·x^t − θ_i^t) ≤ 0 then τ_i^t = z_i^t
    end for
    Z^t = Σ_{i=1}^{d} w_i^t exp(η x_i^t Σ_{k=1}^{K−1} τ_k^t) + Σ_{k=1}^{K−1} θ_k^t exp(−η τ_k^t)
    for i ∈ [d] do: w_i^{t+1} = (1/Z^t) w_i^t exp(η x_i^t Σ_{k=1}^{K−1} τ_k^t)
    for k ∈ [K − 1] do: θ_k^{t+1} = (1/Z^t) θ_k^t exp(−η τ_k^t)
end for
Output: h(x) = min_{i∈[K]} {i : w^{T+1}·x − θ_i^{T+1} < 0}
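The multiplicative update can be sketched in Python as follows (our own illustrative rendering of a single step of Algorithm 3, with a hypothetical learning-rate argument eta; not the authors' code).

import numpy as np

def mpril_update(w, theta, x, y_l, y_r, eta=0.1):
    """One M-PRIL step. w (length d) and theta (length K-1) are positive and
    jointly sum to one; they are re-normalized after the exponentiated update."""
    K = len(theta) + 1
    score = np.dot(w, x)
    tau = np.zeros(K - 1)
    for i in range(y_l - 1):                 # z_i = +1 constraints
        if score - theta[i] <= 0:
            tau[i] = +1.0
    for i in range(y_r - 1, K - 1):          # z_i = -1 constraints
        if score - theta[i] >= 0:
            tau[i] = -1.0
    w_new = w * np.exp(eta * x * tau.sum())
    theta_new = theta * np.exp(-eta * tau)
    Z = w_new.sum() + theta_new.sum()        # normalization keeps the 1-norm at 1
    return w_new / Z, theta_new / Z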

Lemma 2 (Order Preservation). Let w^t ∈ R^d and θ^t ∈ R^{K−1} denote the current parameters of the ranking classifier, and assume that θ_1^t ≤ θ_2^t ≤ . . . ≤ θ_{K−1}^t. Let w^{t+1} and θ^{t+1} be the new parameters generated by the M-PRIL algorithm after observing (x^t, y_l^t, y_r^t). Then θ_1^{t+1} ≤ θ_2^{t+1} ≤ . . . ≤ θ_{K−1}^{t+1}.

Proof: Note that θ_i^t > 0, ∀i ∈ {1, . . . , K − 1}, ∀t ∈ {1, . . . , T}, as M-PRIL initializes θ_i^1 = 1/(K + d − 1), ∀i ∈ {1, . . . , K − 1}, and the multiplicative updates preserve positivity. To show that M-PRIL preserves the ordering of the thresholds, we consider the following cases.

1. i ∈ {1, . . . , y_l^t − 2}: We see that

θ_{i+1}^{t+1} − θ_i^{t+1} = (1/Z^t) ( θ_{i+1}^t e^{−η τ_{i+1}^t} − θ_i^t e^{−η τ_i^t} ).

We know that τ_i^t = z_i^t I{z_i^t(w^t·x^t − θ_i^t) ≤ 0} and z_i^t = +1, ∀i ∈ {1, . . . , y_l^t − 1}. There can be only two cases.

(a) θ_{i+1}^t = θ_i^t: In this case, we simply get θ_{i+1}^{t+1} = θ_i^{t+1}.

(b) θ_{i+1}^t > θ_i^t: We see that I{w^t·x^t − θ_i^t ≤ 0} ≤ I{w^t·x^t − θ_{i+1}^t ≤ 0}, from which it follows that θ_{i+1}^t e^{−η τ_{i+1}^t} ≥ θ_i^t e^{−η τ_i^t}. Thus, θ_{i+1}^{t+1} − θ_i^{t+1} ≥ 0.

2. i = y_l^t − 1: In this case τ_{i+1}^t = 0. Thus, using the fact that θ_{i+1}^t − θ_i^t ≥ 0, we get

θ_{i+1}^{t+1} − θ_i^{t+1} = (1/Z^t)( θ_{i+1}^t − θ_i^t e^{−η τ_i^t} ) ≥ (1/Z^t)( θ_{i+1}^t − θ_i^t ) ≥ 0,    ∵ τ_i^t ∈ {0, 1}.

3. i ∈ {y_l^t, . . . , y_r^t − 2}: In this case τ_i^t = 0, ∀i ∈ {y_l^t, . . . , y_r^t − 1}. Thus, θ_i^{t+1} = (1/Z^t) θ_i^t for all i in this range, and therefore θ_{i+1}^{t+1} = (1/Z^t) θ_{i+1}^t ≥ (1/Z^t) θ_i^t = θ_i^{t+1}, ∀i ∈ {y_l^t, . . . , y_r^t − 2}.

4. i = y_r^t − 1: In this case τ_i^t = 0 and z_{i+1}^t = −1. Thus, using the fact that θ_{i+1}^t − θ_i^t ≥ 0, we get

θ_{i+1}^{t+1} − θ_i^{t+1} = (1/Z^t)( θ_{i+1}^t e^{−η τ_{i+1}^t} − θ_i^t ) ≥ (1/Z^t)( θ_{i+1}^t − θ_i^t ) ≥ 0,    ∵ τ_{i+1}^t ∈ {−1, 0}.

5. i ∈ {y_r^t, . . . , K − 2}: We see that

θ_{i+1}^{t+1} − θ_i^{t+1} = (1/Z^t)( θ_{i+1}^t e^{−η τ_{i+1}^t} − θ_i^t e^{−η τ_i^t} ),

where z_i^t = −1, ∀i ∈ {y_r^t, . . . , K − 1}. There can be only two cases.

(a) θ_{i+1}^t = θ_i^t: In this case τ_{i+1}^t = τ_i^t, and we simply get θ_{i+1}^{t+1} − θ_i^{t+1} = (1/Z^t)( θ_{i+1}^t e^{−η τ_{i+1}^t} − θ_i^t e^{−η τ_i^t} ) = 0.

(b) θ_{i+1}^t > θ_i^t: We see that

I{w^t·x^t − θ_i^t ≤ 0} ≤ I{w^t·x^t − θ_{i+1}^t ≤ 0}
⇒ I{z_i^t(w^t·x^t − θ_i^t) ≤ 0} ≤ I{z_{i+1}^t(w^t·x^t − θ_{i+1}^t) ≤ 0}
⇒ τ_{i+1}^t ≤ τ_i^t
⇒ −η τ_{i+1}^t ≥ −η τ_i^t
⇒ e^{−η τ_{i+1}^t} ≥ e^{−η τ_i^t}
⇒ θ_{i+1}^t e^{−η τ_{i+1}^t} ≥ θ_i^t e^{−η τ_i^t}.

Thus, θ_{i+1}^{t+1} − θ_i^{t+1} ≥ 0.

Thus, M-PRIL preserves the order of the thresholds in consecutive rounds. We now derive the mistake bound for M-PRIL.

Theorem 4. Let S = {(x^1, y_l^1, y_r^1), . . . , (x^T, y_l^T, y_r^T)} be an input sequence given to the M-PRIL algorithm. Let ||x^t||_∞ ≤ 1, ∀t ∈ [T], and c = min_{t∈[T]}(y_r^t − y_l^t). Let there exist γ > 0, w* ∈ R^d and θ_1*, . . . , θ_{K−1}* ∈ R such that ||w*||_1 + ||θ*||_1 = 1 and min_{i∈Ī^t} z_i^t(w*·x^t − θ_i*) ≥ γ, ∀t ∈ [T]. Then the ranking loss of M-PRIL satisfies

Σ_{t=1}^{T} L_I^{MAE}(f^t(x^t), θ^t, y_l^t, y_r^t) ≤ (K − c − 1)^2 log(K + d − 1) / γ^2,

where f^t(x^t) = w^t·x^t.

Proof: Let v^t = (w^t, θ^t). We start by analyzing the change in the KL-divergence between v* and v^t. Let

∆_t = D_KL(v* || v^{t+1}) − D_KL(v* || v^t).

The logic is to bound Σ_{t∈[T]} ∆_t from above and below. We first derive the upper bound:

∆_t = Σ_{i=1}^{d} w_i* log(w_i^t / w_i^{t+1}) + Σ_{k=1}^{K−1} θ_k* log(θ_k^t / θ_k^{t+1})
    = Σ_{i=1}^{d} w_i* log( Z^t / e^{η x_i^t Σ_{k=1}^{K−1} τ_k^t} ) + Σ_{k=1}^{K−1} θ_k* log( Z^t / e^{−η τ_k^t} )
    = log Z^t [ Σ_{i=1}^{d} w_i* + Σ_{k=1}^{K−1} θ_k* ] − η Σ_{k=1}^{K−1} τ_k^t (w*·x^t − θ_k*)
    = log Z^t − η Σ_{k∈M^t} z_k^t (w*·x^t − θ_k*)
    ≤ log Z^t − η γ m^t,    ∵ z_k^t(w*·x^t − θ_k*) ≥ γ, ∀k ∈ Ī^t.

Let c_1 = K − c − 1 and note that |x_i^t Σ_{k=1}^{K−1} τ_k^t| ≤ c_1. We now bound log Z^t. Using the convexity of the exponential on [−c_1, c_1],

Z^t = Σ_{i=1}^{d} w_i^t e^{η x_i^t Σ_{k=1}^{K−1} τ_k^t} + Σ_{k=1}^{K−1} θ_k^t e^{−η τ_k^t}
   ≤ Σ_{i=1}^{d} w_i^t [ ((1 + x_i^t Σ_k τ_k^t / c_1)/2) e^{η c_1} + ((1 − x_i^t Σ_k τ_k^t / c_1)/2) e^{−η c_1} ]
     + Σ_{k=1}^{K−1} θ_k^t [ ((1 − τ_k^t / c_1)/2) e^{η c_1} + ((1 + τ_k^t / c_1)/2) e^{−η c_1} ]
   = (1/2)( e^{η c_1} + e^{−η c_1} ) [ Σ_{i=1}^{d} w_i^t + Σ_{k=1}^{K−1} θ_k^t ] + ((e^{η c_1} − e^{−η c_1})/(2 c_1)) Σ_{k=1}^{K−1} τ_k^t ( w^t·x^t − θ_k^t )
   ≤ (1/2)( e^{η c_1} + e^{−η c_1} ),    ∵ τ_k^t (w^t·x^t − θ_k^t) ≤ 0, ∀k, and Σ_{i} w_i^t + Σ_{k} θ_k^t = 1.

Thus, ∆_t becomes

∆_t ≤ log( (1/2)( e^{η(K−c−1)} + e^{−η(K−c−1)} ) ) − η γ m^t
    ≤ m^t [ log( (1/2)( e^{η(K−c−1)} + e^{−η(K−c−1)} ) ) − η γ ],

where we used the fact that m^t ≥ 1 on mistake trials.

Summing ∆_t over t = 1 to T, we get

Σ_{t=1}^{T} ∆_t ≤ log( (1/2)( e^{η(K−c−1)} + e^{−η(K−c−1)} ) ) Σ_{t=1}^{T} m^t − η γ Σ_{t=1}^{T} m^t.

Now, we find a lower bound on Σ_{t=1}^{T} ∆_t as follows:

Σ_{t=1}^{T} ∆_t = Σ_{t=1}^{T} [ D_KL(v* || v^{t+1}) − D_KL(v* || v^t) ] = D_KL(v* || v^{T+1}) − D_KL(v* || v^1)
              ≥ −D_KL(v* || v^1)    ∵ D_KL(v* || v^{T+1}) ≥ 0
              ≥ −log(K + d − 1).

Comparing the lower and the upper bound on Σ_{t=1}^{T} ∆_t, we get

Σ_{t=1}^{T} m^t ≤ log(K + d − 1) / [ η γ − log( (1/2)( e^{η(K−c−1)} + e^{−η(K−c−1)} ) ) ].    (8)

The RHS above is minimized by taking

η = (1/(2(K − c − 1))) log( ((K − c − 1) + γ) / ((K − c − 1) − γ) ).

By substituting this value of η in eq. (8), we get

Σ_{t=1}^{T} m^t ≤ log(K + d − 1) / g( γ/(K − c − 1) ),

where g(t) = ((1 + t)/2) log(1 + t) + ((1 − t)/2) log(1 − t). It can be shown that g(t) ≥ t^2/2 for 0 ≤ t ≤ 1. We know that

γ ≤ z_k^t(w*·x^t − θ_k*)    (by the definition of (w*, θ*))
  = z_k^t v*·x̂_k^t    (x̂_k^t is defined in eq. (6))
  ≤ ||v*||_1 ||x̂_k^t||_∞    (using Hölder's inequality)
  = max(||x^t||_∞, 1) = 1,    ∵ ||x^t||_∞ ≤ 1.

Thus, γ/(K − c − 1) ≤ 1. Using g(t) ≥ t^2/2 in the above bound, we get

Σ_{t=1}^{T} L_I^{MAE}(f^t(x^t), θ^t, y_l^t, y_r^t) ≤ Σ_{t=1}^{T} m^t ≤ (K − c − 1)^2 log(K + d − 1) / γ^2.

Thus, M-PRIL converges after making a finite number of mistakes if there exists an ideal interval ranking classifier for the given training data.
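As a quick numerical sanity check (not part of the formal argument), the inequality g(t) ≥ t^2/2 used above can be verified on a grid with a few lines of Python:

import numpy as np

def g(t):
    # g(t) = ((1+t)/2) log(1+t) + ((1-t)/2) log(1-t)
    return 0.5 * (1 + t) * np.log(1 + t) + 0.5 * (1 - t) * np.log(1 - t)

ts = np.linspace(0.0, 0.999, 1000)
assert np.all(g(ts) >= ts ** 2 / 2 - 1e-12)   # holds on the sampled grid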

5. Experiments

We now present experimental results showing the effectiveness of the proposed approach. We first describe the datasets used.

5.1. Dataset Description

We show simulation results on 3 datasets, described below (a small generator sketch for the synthetic data follows the list).

1. Synthetic Dataset: We generate points x ∈ R^2 uniformly at random from the unit square [0, 1]^2. For each point, the rank is assigned from the set {1, . . . , 5} as y = max_r {r : 10(x_1 − 0.5)(x_2 − 0.5) + ξ > b_r}, where b = (−∞, −1, −0.1, 0.25, 1) and ξ ∼ N(0, 0.125) (normally distributed with zero mean and a standard deviation of 0.125). A visualization of the synthetic dataset is provided in Figure 1. We generated 100 sequences of instance-rank pairs, each of length 10000.

2. Parkinson's Telemonitoring Dataset: This dataset (Lichman, 2013) contains biomedical voice measurements for 42 patients in various stages of Parkinson's disease. There are 5875 observations in total, with 20 features; the target variable is total-UPDRS. In the dataset, total-UPDRS spans the range 7-55, with higher values representing more severe disability. We divide the range of total-UPDRS into 10 parts. We normalize each feature independently to zero mean and unit standard deviation.

3. Abalone Dataset: The age of an abalone (Lichman, 2013) is determined by counting the number of rings. There are 8 features and 4177 observations. The number of rings varies from 1 to 29; however, the distribution is very skewed, so we divide the whole range into 4 parts, namely 1-7, 8-9, 10-12 and 13-29.
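A minimal Python sketch of the synthetic data generator described in item 1 above (our own rendering of that construction; names and the seed are ours) is given below.

import numpy as np

def generate_synthetic(n=10000, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 2))
    noise = rng.normal(0.0, 0.125, size=n)
    b = np.array([-np.inf, -1.0, -0.1, 0.25, 1.0])   # b_1, ..., b_5
    scores = 10.0 * (X[:, 0] - 0.5) * (X[:, 1] - 0.5) + noise
    # Rank = largest r such that the noisy score exceeds b_r.
    y = np.array([np.max(np.where(s > b)[0]) + 1 for s in scores])
    return X, y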

The training data is comprised of two parts: one part contains the absolute (exact) labels for the feature vectors and the other contains partial labels. In our experiments, we keep 25% of the examples with the correct label and the remaining 75% with partial labels.

Generating Interval Labels: Let there be K categories. We consider two different methods of generating interval labels (a small code sketch follows the list).

1. For y ∈ {2, . . . , K − 1}, the interval label is chosen randomly between [y − 1, y] and [y, y + 1], where y is the true label. For class label 1, the interval label is set to [1, 2]; for class label K, it is set to [K − 1, K].

2. For y ∈ {2, . . . , K − 1}, the interval label is set to [y − 1, y + 1], where y is the true label. For class label 1, the interval label is set to [1, 2]; for class label K, it is set to [K − 1, K].
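For illustration, the two interval-labeling schemes above can be implemented as follows (a sketch under our own naming; "scheme 1" and "scheme 2" refer to the two methods just listed).

import random

def interval_label(y, K, scheme=1, rng=random):
    """Return (y_l, y_r) containing the true label y under the given scheme."""
    if y == 1:
        return (1, 2)
    if y == K:
        return (K - 1, K)
    if scheme == 1:
        # Type 1: randomly pick [y-1, y] or [y, y+1].
        return (y - 1, y) if rng.random() < 0.5 else (y, y + 1)
    # Type 2: always the symmetric interval [y-1, y+1].
    return (y - 1, y + 1)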

Figure 1. Synthetic Dataset.

5.2. Experimental Setup

Kernel Functions Used: We used the following kernel functions for the different datasets (simple implementations are sketched after the list).

• Synthetic: κ(x_1, x_2) = (x_1·x_2 + 1)^2
• Parkinson's Telemonitoring: κ(x_1, x_2) = x_1·x_2
• Abalone: κ(x_1, x_2) = (x_1·x_2 + 1)^3
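These kernels correspond to the following simple Python functions (an illustrative sketch; x1 and x2 are feature vectors).

import numpy as np

def linear_kernel(x1, x2):
    return np.dot(x1, x2)                 # Parkinson's Telemonitoring

def poly2_kernel(x1, x2):
    return (np.dot(x1, x2) + 1.0) ** 2    # Synthetic

def poly3_kernel(x1, x2):
    return (np.dot(x1, x2) + 1.0) ** 3    # Abalone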
5.3. Varying the Fraction of Partial Labels in PRIL

We now discuss the performance of PRIL when we vary the fraction of partial labels in the training set. We train with 60%, 70%, 80%, 90% and 100% of the examples carrying partial labels. We compute the loss at each round with the same partial label used for updating the hypothesis; for PRIL, at each time step, we compute the average of L^{MAE} (defined in eq. (1)). The results are shown in Figure 2. We see that for all the datasets the average mean absolute error (MAE) decreases as the number of rounds (T) increases. Also, the average MAE decreases as the fraction of examples with partial labels increases; we observe this for all 3 datasets and both types of partial labels. This happens because the allowed range of predicted values is larger for partial labels than for an exact label. Consider an example x^t with partial label [y_l^t, y_r^t] (y_r^t > y_l^t) and exact label y^t. The MAE for this example with a partial label is a sum of K − y_r^t + y_l^t − 1 losses, whereas it becomes a sum of K − 1 losses if the label for x^t is exact. Thus, as we increase the fraction of partial labels in the training set, the average MAE decreases. We see that training with no partial labels gives the largest average MAE values for all the datasets.

5.4. Comparisons With Other Approaches

We compare the proposed algorithm with (a) PRank (Crammer & Singer, 2001b) using the exact labels, (b) Widrow-Hoff (Widrow & Hoff, 1988), posing the problem as regression, and (c) multi-class Perceptron (Crammer & Singer, 2001a). For PRIL, at each time step, we compute the average of L^{MAE} (defined in eq. (1)). For PRank, Widrow-Hoff and MC-Perceptron, we compute the average absolute error (1/T) Σ_{t=1}^{T} |ŷ_t − y_t|. We repeat the process 100 times and average the instantaneous losses across the 100 runs.

We train PRIL with 100% partially labeled examples (Type 1 or Type 2), but when we compute the MAE, we use the exact labels of the examples. That is, we use the model trained by PRIL with all partial labels and measure its performance with respect to the exact labels. This sets a harder evaluation criterion for PRIL. Moreover, in practice, we want a single predicted label and we want to see how close it is to the original label. Figure 3 shows the comparison of PRIL with PRank, Widrow-Hoff (WH) and Multi-Class Perceptron (MCP).

We see that PRIL does better than Widrow-Hoff. This happens because Widrow-Hoff does not consider the categorical nature of the labels even though it respects the ordering of the labels.

Figure 2. Experiment: Varying the fraction of examples with partial labels. Variation in average MAE performance as the fraction of partial labels in the training set is varied. The loss is computed using the partial labels.

On the other hand, MCP is a complex model for solving a ranking problem and it also does not consider the ordering of the labels. We see that for the Synthetic and Parkinson's datasets PRIL performs better than MCP.

We observe that the performance of PRIL is comparable to PRank (Crammer & Singer, 2001b); on the Abalone and Parkinson's datasets, it performs similarly to PRank. This means that PRIL is able to recover the underlying classifier even when we have only partial labels. This is a very interesting finding, as we do not need to provide exact labels; all we need is a range around the exact label. Thus, PRIL appears to be a better way to deal with ordinal classification when there is uncertainty in the labels.

6. Conclusions

We proposed a new online algorithm called PRIL for learning ordinal classifiers when we only have partial labels for the examples. We showed the correctness of the proposed algorithm and that PRIL converges after making a finite number of mistakes whenever there exists an ideal partial labeling for every example. We also provided the mistake bound for the general case and a regret bound for PRIL. We further proposed a multiplicative update algorithm for PRIL (M-PRIL), showed its correctness and found its mistake bound. We experimentally showed that PRIL is a very effective algorithm when we have partial labels.

References

Antoniuk, Kostiantyn, Franc, Vojtech, and Hlavac, Vaclav. Interval insensitive loss for ordinal classification. In Proceedings of the Sixth Asian Conference on Machine Learning, volume 39 of Proceedings of Machine Learning Research, pp. 189-204, Nha Trang City, Vietnam, Nov 2015.

Antoniuk, Kostiantyn, Franc, Vojtech, and Hlavac, Vaclav. V-shaped interval insensitive loss for ordinal classification. Machine Learning, 103(2):261-283, May 2016.

Chu, Wei and Keerthi, S. Sathiya. New approaches to support vector ordinal regression. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pp. 145-152, 2005.

Crammer, Koby and Singer, Yoram. Ultraconservative online algorithms for multiclass problems. In 14th Annual Conference on Computational Learning Theory, COLT 2001, pp. 99-115, Amsterdam, The Netherlands, July 2001a.

Crammer, Koby and Singer, Yoram. Pranking with ranking. In Proceedings of the 14th International Conference on Neural Information Processing Systems, NIPS'01, pp. 641-647, 2001b.

Kivinen, Jyrki and Warmuth, Manfred K. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1-63, 1997.

Li, Ling and Lin, Hsuan-Tien. Ordinal regression by extended binary classification. In Proceedings of the 19th International Conference on Neural Information Processing Systems (NIPS), pp. 865-872, 2006.

Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386-408, 1958.

Shashua, Amnon and Levin, Anat. Ranking with large margin principle: Two approaches. In Proceedings of the 15th International Conference on Neural Information Processing Systems, NIPS'02, pp. 961-968, Vancouver, British Columbia, Canada, 2002.

Widrow, Bernard and Hoff, Marcian E. Adaptive switching circuits. In Neurocomputing: Foundations of Research, pp. 123-134. MIT Press, 1988.

Zinkevich, Martin. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, ICML'03, pp. 928-935, 2003.

Figure 3. PRIL vs. PRank, Widrow-Hoff and Multi-Class Perceptron (MCP). PRIL trained using partial labels and evaluated on exact labels.