Online

Vladyslav Kolbasin, data scientist at GridDynamics, lecturer at NTU KhPI, dep. KMMM

Definition

• Online machine learning (OML) is a method of machine learning in which:
• data becomes available in sequential order
• each new sample is used to update our best predictor for future data at each step
• This is in contrast to batch learning techniques, which generate the best predictor by learning on the entire training data set at once

Difference to standard approach

• Online prediction refers to the problem of prediction in the online protocol (sequential prediction problems):
• Nature outputs some side information
• Predictor outputs a prediction
• Nature outputs an observation
• The cycle is repeated
• Difference from the usual setting: the test and train datasets are the same, but the distinction between train and test is made through time


Motivation example

• A problem:
• What is papaya?
• How to find a good papaya?
• Buy a papaya and let's try to predict whether it is ok.

Motivation

• Training over the entire dataset can be computationally infeasible
• So we should train in mini-batches (or one sample at a time)
• It is used in situations where the algorithm has to adapt dynamically to new patterns in the data
• Data changes too fast – decrease the lag
• Fast tuning to important data trends

How to solve it?

• Recursive adaptive algorithms (Robbins and Monro, 1951)
• Stochastic approximation (Kushner and Clark, 1978)
• Adaptive filtering (Haykin, 2002, 2010)

Model types

• Statistical models
• Data samples are usually assumed to be i.i.d.
• The algorithm just has limited access to the data
• Adversarial models
• They look at the learning problem as a game between two players (the learner vs. the data generator)
• The goal is to minimize losses regardless of the move played by the other player

I. Statistical online models

• Gradient descent
• Kalman filtering
• Kernel models
• SVM
• Folding-in

1. Batch gradient descent

$w_{t+1} = w_t - \gamma_t \nabla_w \hat{J}_L(w_t) = w_t - \gamma_t \frac{1}{L}\sum_{i=1}^{L} \nabla_w Q(x_i, w_t)$

• When the learning rates $\gamma_t$ are small enough, the algorithm converges towards a local minimum of the empirical risk $\hat{J}_L(w)$
• We can speed up convergence by replacing the scalar $\gamma_t$ with a positive definite matrix
• We have to keep all points of the training dataset
• We compute the average gradient over all points
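As a minimal illustration (not from the slides), here is a sketch of the batch step for a hypothetical squared loss $Q(x_i, w) = \frac{1}{2}(w^T x_i - y_i)^2$; the data and step size are placeholders.

```python
import numpy as np

def batch_gradient_step(w, X, y, lr):
    """One batch GD step for Q(x_i, w) = 0.5 * (w @ x_i - y_i)**2:
    the gradient is averaged over all L training points, so every
    step has to touch the whole dataset."""
    residuals = X @ w - y              # shape (L,)
    grad = X.T @ residuals / len(y)    # average per-sample gradient
    return w - lr * grad

# toy usage: X is (L, d), y is (L,)
# w = np.zeros(X.shape[1])
# for _ in range(100):
#     w = batch_gradient_step(w, X, y, lr=0.1)
```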

Online gradient descent

• We don't do averaging:

$w_{t+1} = w_t - \gamma_t \nabla_w Q(x_t, w_t)$

• Use just one point at a time
• We hope that the random selection will not perturb the average behavior of the algorithm
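A matching sketch of the online update (one sample per step), again with a hypothetical squared loss and a hypothetical decreasing learning-rate schedule:

```python
def online_gradient_step(w, x_t, y_t, lr_t):
    """One online GD step: w_{t+1} = w_t - lr_t * grad Q(x_t, w_t).
    Only the current sample (x_t, y_t) is used; nothing is averaged."""
    grad = (w @ x_t - y_t) * x_t      # gradient of 0.5*(w @ x_t - y_t)**2
    return w - lr_t * grad

# hypothetical streaming loop with gamma_t = 0.1 / (1 + t):
# for t, (x_t, y_t) in enumerate(stream):
#     w = online_gradient_step(w, x_t, y_t, 0.1 / (1 + t))
```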

Online gradient descent

• So we observe noisy, erratic behavior
• Is there convergence?

Online gradient descent

• The main question: is there convergence?

• Theory: Yes!
• When the learning rate decreases at an appropriate rate
• We will reach the global minimum if the objective function is convex; otherwise we will almost surely reach a local minimum
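The "appropriate rate" is usually taken to be the classical Robbins-Monro step-size conditions, $\sum_{t=1}^{\infty} \gamma_t = \infty$ and $\sum_{t=1}^{\infty} \gamma_t^2 < \infty$, satisfied for example by $\gamma_t = \gamma_0 / t$.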

Gradient descent optimizations

• There are a couple of GD optimizations:
• Momentum (Sutton, R. S. 1986)
• Nesterov accelerated gradient (Nesterov, Y. 1983)
• AdaGrad (Duchi, J., Hazan, E., & Singer, Y. 2011)
• Adadelta (an extension of AdaGrad)
• Stochastic average gradient (Le Roux, Schmidt, and Bach, 2012)

AdaGrad

• SGD with a per-parameter learning rate
• Larger effective learning rates for sparser parameters

$g_t = \nabla_w Q(x_t, w_t)$

$G_{t,jj} = \sum_{k=1}^{t} g_{k,j}^2$

$w_{t+1} = w_t - \gamma \cdot \mathrm{diag}(G_t + \varepsilon)^{-1/2} \odot g_t$
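A minimal numpy sketch of this update (variable names are mine, not from the slides); G accumulates the squared gradients coordinate-wise:

```python
import numpy as np

def adagrad_step(w, grad, G, lr=0.01, eps=1e-8):
    """AdaGrad: divide the step for each coordinate by the root of the
    accumulated squared gradients, so rarely-updated (sparse) parameters
    keep a larger effective learning rate."""
    G = G + grad ** 2                       # running sum of squared gradients
    w = w - lr * grad / np.sqrt(G + eps)    # per-parameter step size
    return w, G
```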

• Useful for sparse problems (for example, NLP and image recognition)
• Used at Google
• Used for the GloVe model

2. Kalman filtering

• Recursive filter
• Quasi-Newton algorithm

$K_t = H_t^{-1}$

$K_{t+1} = K_t - \frac{(K_t x_t)(K_t x_t)^T}{1 + x_t^T K_t x_t}$

$w_{t+1} = w_t + K_{t+1}(y_t - w_t^T x_t)\, x_t$
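A sketch of these recursions for a linear model in numpy (this is the standard recursive-least-squares form; the variable names and the initialization are mine). K plays the role of the inverse Hessian, updated by a rank-one correction so no per-step matrix inversion is needed:

```python
import numpy as np

def rls_step(w, K, x_t, y_t):
    """One recursive-least-squares ('Kalman') update."""
    Kx = K @ x_t
    K = K - np.outer(Kx, Kx) / (1.0 + x_t @ Kx)   # K_{t+1}
    w = w + K @ x_t * (y_t - w @ x_t)             # quasi-Newton step
    return w, K

# typical (hypothetical) initialization: w = zeros(d), K = (1/lam) * I
```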

3. Online Kernel models

• Batch regime:

$\alpha = K(\lambda)^{-1} Y = (K + \lambda I)^{-1} Y$

$K_{i,j} = \kappa(x_i, x_j), \quad i, j = 1, \dots, r$
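For reference, the batch solution above is a single linear solve; a minimal sketch assuming an RBF kernel and numpy (function and parameter names are mine):

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """kappa(x, z) = exp(-gamma * ||x - z||^2)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_kernel_ridge(X, y, lam=1e-2):
    """Batch regime: alpha = (K + lam * I)^{-1} y."""
    K = rbf_kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)
```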

Online Kernel models

• The main problem: how to update the inverse matrix
• It is possible to do it in 2 stages:

$K_{n-1,r}^{-1}(\lambda) \rightarrow K_{n-1,r-1}^{-1}(\lambda) \rightarrow K_{n,r}^{-1}(\lambda)$

Online Kernel models

$\alpha_n = (K_{n-1,r}(\lambda))^{-1}(\lambda\, \alpha_{n-1} + Y_{n-1,r})$

$K_{n-1,r-1}^{-1} = R_r K_{n-1,r}^{-1} R_r^T - (e_1^T K_{n-1,r}^{-1} e_1)^{-1}\, R_r K_{n-1,r}^{-1} e_1 e_1^T K_{n-1,r}^{-1} R_r^T$

$K_{n,r}^{-1}(\lambda) = \begin{pmatrix} A & B \\ C & \delta_n^{-1} \end{pmatrix}$

$A = K_{n-1,r-1}^{-1} + \delta_n^{-1}\, K_{n-1,r-1}^{-1}\, k_{n-1,r-1}(x_n)\, k_{n-1,r-1}^T(x_n)\, K_{n-1,r-1}^{-1}$

Online Kernel models

$B = -\delta_n^{-1}\, K_{n-1,r-1}^{-1}\, k_{n-1,r-1}(x_n)$

$C = B^T = -\delta_n^{-1}\, k_{n-1,r-1}^T(x_n)\, K_{n-1,r-1}^{-1}$

$\delta_n = \lambda + k_{n,n} - k_{n-1,r-1}^T(x_n)\, K_{n-1,r-1}^{-1}\, k_{n-1,r-1}(x_n)$

$R_r = (0_r \;\; I_{r-1}), \qquad e_1 = (1\; 0\; \dots\; 0)^T$

$k_{n-1,r-1}(x_n) = (\kappa(x_n, x_{n-r+1}) \;\dots\; \kappa(x_n, x_{n-1}))^T$

$k_{n,n} = \kappa(x_n, x_n)$
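As an illustration of the "grow" stage above (appending the new point $x_n$ to the regularized inverse Gram matrix), here is a sketch using the block-inverse formulas for $A$, $B$, $C$ and $\delta_n$; the names are mine and the "shrink" stage (dropping the oldest point via $R_r$ and $e_1$) is omitted:

```python
import numpy as np

def grow_inverse(K_inv, k_vec, k_nn, lam):
    """Extend (K + lam*I)^{-1} by one row/column for a new point:
    delta = lam + k_nn - k^T K^{-1} k, then assemble the block inverse."""
    Kk = K_inv @ k_vec
    delta = lam + k_nn - k_vec @ Kk
    A = K_inv + np.outer(Kk, Kk) / delta      # top-left block
    B = -Kk[:, None] / delta                  # top-right block (column)
    return np.block([[A, B],
                     [B.T, np.array([[1.0 / delta]])]])
```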

Online Kernel models

• The issue is complexity

• Accumulate new observations in the Gram matrix
• Control the complexity of the Gram matrix:
• Sparsification
• Pruning

Online Kernel models

• Approximate Linear Dependency (Engel 2004)

$\delta_t = \min_a \left\| \sum_{j=1}^{m_{t-1}} a_j\, \phi(\tilde{x}_j) - \phi(x_t) \right\|^2 \le \nu$

$\delta_t = \min_a \left\{ a^T \tilde{K}_{t-1} a - 2 a^T \tilde{k}_{t-1}(x_t) + k_{tt} \right\}$
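A sketch of the ALD test: the quadratic above is minimized at $a = \tilde{K}_{t-1}^{-1}\tilde{k}_{t-1}(x_t)$, so $\delta_t$ reduces to $k_{tt} - \tilde{k}^T \tilde{K}^{-1} \tilde{k}$; the new point is added to the dictionary only when $\delta_t$ exceeds the threshold $\nu$ (names below are hypothetical):

```python
def ald_delta(K_dict_inv, k_vec, k_tt):
    """delta_t = k_tt - k^T K^{-1} k; a small delta means phi(x_t) is
    approximately linearly dependent on the current dictionary."""
    a = K_dict_inv @ k_vec        # minimizer of the quadratic criterion
    return k_tt - k_vec @ a

# usage: add x_t to the dictionary only if ald_delta(...) > nu
```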

• Novelty criterion (Haykin 2010)

Online Kernel models

• More about kernels and recursive models: "Kernel Adaptive Filtering: A Comprehensive Introduction" (Adaptive and Learning Systems for Signal Processing, Communications and Control Series) by Haykin

4. Online SVMs

• The most famous is the LASVM algorithm (Léon Bottou, 2005-2011)
• There is a package for R: https://cran.r-project.org/web/packages/lasvmR/index.html

• Uses 2 steps: PROCESS & REPROCESS
• In general, it adds and deletes support vectors

5. Folding-in

• SVD decomposition for recommender systems

$\mathrm{SVD}(A) = U \times S \times V^T$

$A \approx U_k \times S_k \times V_k^T$

$P_{i,j} = r_i + (U_k \sqrt{S_k})(i) \cdot (\sqrt{S_k}\, V_k^T)(j)$

• Challenge: building the SVD is time consuming
• How to add a new product or a new customer?

Incremental Singular Value Decomposition

1) New product p (m×1):

$p' = p^T U_k S_k^{-1}$

2) New customer c (1×n):

$c' = c\, V_k S_k^{-1}$
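A minimal numpy sketch of folding a new customer (row) into an existing truncated SVD via $c' = c V_k S_k^{-1}$; variable names and the usage lines are mine:

```python
import numpy as np

def fold_in_customer(c, Vk, sk):
    """Project a new customer's rating vector c (length n) into the
    existing k-dimensional item space: c' = c @ Vk @ diag(1/sk)."""
    return (c @ Vk) / sk          # sk: vector of the k singular values

# usage sketch:
# U, s, Vt = np.linalg.svd(A, full_matrices=False)
# Uk, sk, Vk = U[:, :k], s[:k], Vt[:k].T
# new_customer_coords = fold_in_customer(new_ratings, Vk, sk)
```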

II. Adversarial models

• Definition:
• Player chooses $w_t$
• Adversary chooses $l_t(w)$
• Player suffers loss $l_t(w_t)$
• Need to minimize the cumulative loss

• Some standard algorithms:
• Follow the leader (FTL) – a toy sketch follows below
• Follow the regularised leader (FTRL)
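To make the protocol concrete, here is a toy Follow-the-Leader sketch (mine, not from the slides) for one-dimensional squared losses $l_t(w) = (w - z_t)^2$, where the leader is simply the running mean of the observed $z_t$:

```python
def follow_the_leader(z_stream):
    """FTL: at each round play the w that minimizes the cumulative loss
    sum_s (w - z_s)^2 seen so far, i.e. the running mean of z_1..z_t."""
    total, count, w = 0.0, 0, 0.0
    for z_t in z_stream:
        loss = (w - z_t) ** 2     # loss suffered with the current choice
        total += z_t
        count += 1
        w = total / count         # leader for the next round
        yield w, loss
```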

Adversarial models

• We will not look at these models in detail, but they have several advantages:
• In contrast to statistical (stochastic) machine learning, adversarial algorithms make no stochastic assumptions about the data they observe, and can even handle situations where the data is generated by a malicious adversary
• So no i.i.d. assumption!

Pros and Cons of Online ML algorithms

• Online algorithms are:
• Often much faster
• More memory-efficient
• Able to adapt when the best predictor changes over time
• But they are:
• Hard to maintain in production
• Hard to evaluate
• Prone to convergence problems

Applications

• Online Machine Learning can be used for:
• Computer vision
• Recommender systems
• Predicting stock market trends
• Deciding which ads to present on a web page
• IoT applications

• Put your application here…

Useful Links

• http://sebastianruder.com/optimizing-gradient-descent/index.html
• Kernel Adaptive Filtering: A Comprehensive Introduction (Haykin 2010)
• The Kernel Recursive Least Squares Algorithm (Engel 2003)
• Foundations of Machine Learning (M. Mohri, A. Rostamizadeh, and A. Talwalkar 2012)

Thank you!

Questions?

[email protected]