Normalized Online Learning
Stéphane Ross (Carnegie Mellon University, Pittsburgh, PA, USA), Paul Mineiro (Microsoft, Bellevue, WA, USA), John Langford (Microsoft Research, New York, NY, USA)

Abstract

We introduce online learning algorithms which are independent of feature scales, proving regret bounds dependent on the ratio of scales existent in the data rather than the absolute scale. This has several useful effects: there is no need to pre-normalize data, the test-time and test-space complexity are reduced, and the algorithms are more robust.

1 Introduction

Any learning algorithm can be made invariant by initially transforming all data to a preferred coordinate system. In practice many algorithms begin by applying an affine transform to features so they are zero mean with standard deviation 1 [Li and Zhang, 1998]. For large data sets in the batch setting this preprocessing can be expensive, and in the online setting the analogous operation is unclear. Furthermore, preprocessing is not applicable if the inputs to the algorithm are generated dynamically during learning, e.g., from an on-demand primal representation of a kernel [Sonnenburg and Franc, 2010], virtual examples generated to enforce an invariant [Loosli et al., 2007], or a machine learning reduction [Allwein et al., 2001].

When normalization techniques are too expensive or impossible, we can either accept a loss of performance due to the use of misnormalized data or design learning algorithms which are inherently capable of dealing with unnormalized data. In the field of optimization, it is a settled matter that algorithms should operate independently of an individual dimension's scaling [Oren, 1974]. The same structure defines natural gradients [Wagenaar, 1998], and in the stochastic setting, results indicate that for the parametric case the Fisher metric is the unique invariant metric satisfying a certain regular and monotone property [Corcuera and Giummole, 1998]. Our interest here is in the online learning setting, where this structure is rare: typically regret bounds depend on the norm of the features.

The biggest practical benefit of invariance to feature scaling is that learning algorithms "just work" in a more general sense. This is of significant importance in online learning settings, where fiddling with hyper-parameters is common, and this work can be regarded as an alternative to investigations of optimal hyper-parameter tuning [Bergstra and Bengio, 2012, Snoek et al., 2012, Hutter et al., 2013]. With a normalized update, users do not need to know (or remember) to pre-normalize input datasets, and the need to worry about hyper-parameter tuning is greatly reduced. In practical experience, it is common for those unfamiliar with machine learning to create and attempt to use datasets without proper normalization.

Eliminating the need to normalize data also reduces computational requirements at both training and test time. For particularly large datasets this can become important, since the computational cost in time and RAM of doing normalization can rival the cost of doing the machine learning (or even worse for naive centering of sparse data). Similarly, for applications which are constrained by testing time, knocking out the need for feature normalization allows more computational performance with the same features, or better prediction performance when using the freed computational resources for more features.

1.1 Adversarial Scaling

Adversarial analysis is fairly standard in online learning. However, an adversary capable of rescaling features can induce unbounded regret in common gradient descent methods. As an example, consider the standard regret bound for projected online convex subgradient descent after T rounds using the best learning rate in hindsight [Zinkevich, 2003],

    R \le \sqrt{T}\, \|w^*\|_2 \max_{t \in 1:T} \|g_t\|_2 .

Here w^* is the best predictor in hindsight and \{g_t\} is the sequence of instantaneous gradients encountered by the algorithm. Suppose w^* = (1, 1) \in \mathbb{R}^2 and imagine scaling the first coordinate by a factor of s. As s \to \infty, \|w^*\|_2 approaches 1, but unfortunately for a linear predictor the gradient is proportional to the input, so \max_{t \in 1:T} \|g_t\|_2 can be made arbitrarily large. Conversely, as s \to 0, the gradient sequence remains bounded but \|w^*\|_2 becomes arbitrarily large. In both cases the regret bound can be made arbitrarily poor. This is a real effect rather than a mere artifact of analysis, as indicated by experiments with a synthetic two-dimensional dataset in Figure 1.

[Figure 1 (plot "Loss as feature scale varies"; x-axis: scale of first feature, 0.001 to 1000; y-axis: test squared loss; curves: Adaptive grad., NAG (this paper)): A comparison of performance of NAG (this paper) and adaptive gradient [McMahan and Streeter, 2010, Duchi et al., 2011] on a synthetic dataset with varying scale in the first feature.]
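To make this concrete, the sketch below (a hypothetical construction of ours, not the paper's synthetic experiment: it assumes squared loss with gradients evaluated at w = 0) computes both factors of the bound as the scale s of the first feature varies.

```python
import numpy as np

# Rescaling the first feature by s leaves the prediction problem unchanged,
# yet the two factors in R <= sqrt(T) * ||w*||_2 * max_t ||g_t||_2
# cannot both remain bounded as s -> infinity or s -> 0.
rng = np.random.default_rng(0)
T = 1000
x = rng.uniform(-1.0, 1.0, size=(T, 2))
w_star = np.array([1.0, 1.0])            # best predictor before rescaling
y = x @ w_star

for s in (1e-3, 1.0, 1e3):
    xs = x * np.array([s, 1.0])          # adversary rescales the first feature
    ws = np.array([1.0 / s, 1.0])        # best predictor after rescaling
    # squared-loss gradients g_t = 2 (w . x_t - y_t) x_t, evaluated at w = 0
    grads = 2.0 * (xs @ np.zeros(2) - y)[:, None] * xs
    g_max = np.linalg.norm(grads, axis=1).max()
    bound = np.sqrt(T) * np.linalg.norm(ws) * g_max
    print(f"s={s:g}  ||w*||_2={np.linalg.norm(ws):.3g}  "
          f"max||g_t||_2={g_max:.3g}  bound={bound:.3g}")
```

As s grows the gradient factor explodes, and as s shrinks the ||w*||_2 factor does, mirroring the two limits discussed above.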
Adaptive first-order online methods [McMahan and Streeter, 2010, Duchi et al., 2011] also have this vulnerability, despite adapting the geometry to the input sequence. Consider a variant of the adaptive gradient update (without projection),

    w_{t+1} = w_t - \eta\, \mathrm{diag}\Big(\sum_{s=1}^{t} g_s g_s^\top\Big)^{-1/2} g_t ,

which has an associated regret bound of order

    \|w^*\|_2\, d^{1/2} \sqrt{\inf_{S} \Big\{ \sum_{t=1}^{T} \langle g_t, S^{-1} g_t \rangle : S \succeq 0,\ \mathrm{tr}(S) \le d \Big\}} .

Again, by manipulating the scaling of a single axis this can be made arbitrarily poor.
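For reference, here is a minimal per-coordinate sketch of this unprojected diagonal update (our own NumPy phrasing; the epsilon guard against division by zero is an addition, not part of the method as stated).

```python
import numpy as np

def adagrad_step(w, G, g, eta=0.5, eps=1e-12):
    """One unprojected diagonal adaptive gradient step,
    w_{t+1} = w_t - eta * diag(sum_s g_s g_s^T)^(-1/2) g_t,
    where G carries the running per-coordinate sum of squared gradients."""
    G = G + g * g                         # update diag(sum_s g_s g_s^T)
    w = w - eta * g / (np.sqrt(G) + eps)  # per-coordinate inverse-root scaling
    return w, G
```

The projection step of the original method is omitted here, matching the variant displayed above.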
The online Newton step [Hazan, 2006] algorithm has a regret bound independent of units, as we address here. Unfortunately, ONS space and time complexity grows quadratically with the length of the input sequence, but the existence of ONS motivates the search for computationally viable scale-invariant online learning rules.

Similarly, the second order perceptron [Cesa-Bianchi et al., 2005] and AROW [Crammer et al., 2009] partially address this problem for hinge loss. These algorithms are not unit-free, because they have hyperparameters whose optimal value varies with the scaling of features, and again have running times that are superlinear in the dimensionality.

More recently, diagonalized second order perceptron and AROW have been proposed [Orabona et al., 2012]. These algorithms are linear time, but their analysis is generally not unit-free since it explicitly depends on the norm of the weight vector. Corollary 3 is unit invariant. A comparative analysis of empirical performance would be interesting to observe.

The use of unit-invariant updates has been implicitly studied with asymptotic analysis and empirics. For example, [Schaul et al., 2012] uses a per-parameter learning rate proportional to an estimate of gradient squared divided by variance and second derivative. Relative to this work, we prove finite regret bound guarantees for our algorithm.

1.2 Contributions

We define normalized online learning algorithms which are invariant to feature scaling, then show that these are interesting algorithms theoretically and experimentally.

We define a scaling adversary for online learning analysis. The critical additional property of this adversary is that algorithms with bounded regret must have updates which are invariant to feature scale. We prove that our algorithm has a small regret against this more stringent adversary.

We then experiment with this learning algorithm on a number of datasets. For pre-normalized datasets, we find that it makes little difference as expected, while for unnormalized or improperly normalized datasets this update rule offers large advantages over standard online update rules. All of our code is a part of the open source Vowpal Wabbit project [Ross et al., 2012].

2 Notation

Throughout this draft, the indices i, j indicate elements of a vector, while the index t, T or a particular number indicates time. A label y is associated with some features x, and we are concerned with linear prediction \sum_i w_i x_i resulting in some loss for which a gradient g can be computed with respect to the weights. Other notation is introduced as it is defined.

3 The algorithm

We start with the simplest version of a scale-invariant online learning algorithm. NG (Normalized Gradient Descent) is presented in Algorithm 1. NG adds scale invariance to online gradient descent, making it work for any scaling of features within the dataset.

Algorithm 1 NG (learning rate η_t)
1. Initially w_i = 0, s_i = 0, N = 0
2. For each timestep t, observe example (x, y)
   (a) For each i, if |x_i| > s_i
       i. w_i ← w_i s_i^2 / |x_i|^2
       ii. s_i ← |x_i|
   (b) ŷ = Σ_i w_i x_i
   (c) N ← N + Σ_i x_i^2 / s_i^2
   (d) For each i,
       i. w_i ← w_i − η_t (t/N) (1/s_i^2) ∂L(ŷ, y)/∂w_i

Algorithm 2 NAG (learning rate η)
1. Initially w_i = 0, s_i = 0, G_i = 0, N = 0
2. For each timestep t, observe example (x, y)
   (a) For each i, if |x_i| > s_i
       i. w_i ← w_i s_i / |x_i|
       ii. s_i ← |x_i|
   (b) ŷ = Σ_i w_i x_i
   (c) N ← N + Σ_i x_i^2 / s_i^2
   (d) For each i,
       i. G_i ← G_i + (∂L(ŷ, y)/∂w_i)^2
       ii. w_i ← w_i − η √(t/N) (1/(s_i √G_i)) ∂L(ŷ, y)/∂w_i

Without s and N, this algorithm simplifies to standard stochastic gradient descent. The vector element s_i stores the magnitude of feature i according to s_{ti} = max_{t' ∈ {1...t}} |x_{t'i}|. These are updated and maintained online in step 2.(a).ii, and used to rescale the per-feature update. As this running maximum is potentially sensitive to outliers, we also consider a squared-norm version of NAG, which we refer to as sNAG; this is a straightforward modification: we simply keep the accumulator s_i = Σ_t x_{ti}^2 and use √(s_i / t) in the update rule.
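To make the pseudocode concrete, the following is a minimal NumPy sketch of Algorithm 2 (NAG) specialized to squared loss; the class name, the squared-loss gradient, and the guards against division by zero are our additions, not part of the paper.

```python
import numpy as np

class NAG:
    """Sketch of Algorithm 2 (NAG) above, specialized to squared loss
    L(y_hat, y) = (y_hat - y)^2, so dL/dw_i = 2 (y_hat - y) x_i."""

    def __init__(self, dim, eta=1.0):
        self.w = np.zeros(dim)   # weights w_i
        self.s = np.zeros(dim)   # per-feature scales s_i = max_t |x_{ti}|
        self.G = np.zeros(dim)   # per-feature sums of squared gradients G_i
        self.N = 0.0             # accumulator N
        self.t = 0               # timestep t
        self.eta = eta           # learning rate

    def update(self, x, y):
        self.t += 1
        # 2.(a): if |x_i| > s_i, rescale w_i by s_i / |x_i| and raise s_i to |x_i|
        grow = np.abs(x) > self.s
        self.w[grow] *= self.s[grow] / np.abs(x[grow])
        self.s[grow] = np.abs(x[grow])
        # 2.(b): predict
        y_hat = float(self.w @ x)
        # 2.(c): N <- N + sum_i x_i^2 / s_i^2 (never-seen features contribute 0)
        nz = self.s > 0
        self.N += float(np.sum(x[nz] ** 2 / self.s[nz] ** 2))
        # 2.(d): adaptive, scale-normalized per-feature gradient step
        g = 2.0 * (y_hat - y) * x
        self.G += g * g
        if self.N > 0:
            active = nz & (self.G > 0)
            self.w[active] -= (self.eta * np.sqrt(self.t / self.N)
                               * g[active]
                               / (self.s[active] * np.sqrt(self.G[active])))
        return y_hat


# usage sketch: the first feature arrives on a wildly different scale
learner = NAG(dim=2, eta=1.0)
for x, y in [(np.array([1000.0, 1.0]), 3.0),
             (np.array([-250.0, 0.5]), -1.0)]:
    print(learner.update(x, y))
```

The sNAG variant described above would instead accumulate s_i = Σ_t x_{ti}^2 and use √(s_i / t) where this sketch uses s_i.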