
Adaptive Filtering

Jose C. Principe and Weifeng Liu

Computational NeuroEngineering Laboratory (CNEL), University of Florida
[email protected], [email protected]

Acknowledgments

Dr. Badong Chen, Tsinghua University and Post-Doc at CNEL

NSF ECS – 0300340 and 0601271 (Neuroengineering program)

Outline

1. Optimal adaptive signal processing fundamentals
   Learning strategy
   Linear adaptive filters
2. Least-mean-square in kernel space
   Well-posedness analysis of KLMS
3. Affine projection algorithms in kernel space
4. Extended recursive least squares in kernel space
5. Active learning in kernel adaptive filtering

Wiley Book (2010)

Papers are available at www.cnel.ufl.edu

Part 1: Optimal adaptive signal processing fundamentals

Machine Learning Problem Definition for Optimal System Design

Assumption: examples are drawn independently from an unknown probability distribution P(u, y) that represents the rules of Nature.

Expected risk: R(f) = ∫ L(f(u), y) dP(u, y). We want the f that minimizes R(f) among all functions, but we use a mapper class F and in general f* ∉ F. The best we can attain is f*_F ∈ F, the minimizer of R(f) within F. Moreover, P(u, y) is unknown by definition.

Empirical risk: R_N(f) = (1/N) Σ_i L(f(u_i), y_i). Instead we compute f_N ∈ F that minimizes R_N(f). Vapnik-Chervonenkis theory tells us when this will work, but the optimization is computationally costly.

Exact estimation of f_N is done through optimization.

Machine Learning Strategy

The optimality conditions in learning and optimization theories are mathematically driven:
- Learning theory favors cost functions that ensure a fast estimation rate when the number of examples increases (small estimation error bound).
- Optimization theory favors super-linear algorithms (small approximation error bound).
What about the computational cost of these optimal solutions, in particular when the data sets are huge? The estimation error will be small, but we cannot afford super-linear solutions: algorithmic complexity should be as close as possible to O(N).

Statistical Signal Processing: Adaptive Filtering Perspective

Adaptive filtering also seeks optimal models for time series. The linear model is well understood and therefore widely applied. Optimal linear filtering is regression in functional spaces, where the user controls the size of the space by choosing the model order. The problems are fourfold:
- Application conditions may be nonstationary, i.e. the model must adapt continuously to track changes.
- In many important applications data arrive in real time, one sample at a time, so on-line learning methods are necessary.
- Optimal algorithms must obey physical constraints: FLOPS, memory, response time, battery power.
- It is unclear how to go beyond the linear model.
Although the optimality problem is the same as in machine learning, these constraints make the computational problem different.

Machine Learning + Statistical SP: Change the Design Strategy

Since achievable solutions are never optimal (non-reachable class of functions, empirical risk), the goal should be to get quickly to the neighborhood of the optimal solution to save computation. The two types of errors are

R(f_N) − R(f*) = [R(f*_F) − R(f*)] + [R(f_N) − R(f*_F)]
                 approximation error   estimation error

But f_N is difficult to obtain, so why not allow a third error (the optimization error),

R(f̃_N) − R(f_N) = ρ

provided f̃_N is computationally simpler to obtain. So the problem is to find F, N and ρ for each application.

Leon Bottou: The Tradeoffs of Large Scale Learning, NIPS 2007 tutorial

Learning Strategy in Biology

In biology, optimality is stated in relative terms: the best possible response within a fixed time and with the available (finite) resources. Biological learning shares the constraints of both small and large learning-theory problems, because it is limited by the number of samples and also by the computation time. Design strategies for optimal signal processing are more similar to the biological framework than to the machine learning framework. What matters is "how much the error decreases per sample for a fixed memory/FLOP cost". It is therefore no surprise that the most successful algorithm in adaptive signal processing is the least mean square (LMS) algorithm, which never reaches the optimal solution, but is O(L) and continuously tracks the optimal solution!

Extensions to Nonlinear Systems

Many algorithms exist to solve the on-line linear regression problem:
- LMS: stochastic gradient descent
- LMS-Newton: handles eigenvalue spread, but is expensive
- Recursive least squares (RLS): tracks the optimal solution with the available data
Nonlinear solutions either append nonlinearities to linear filters (not optimal) or require the availability of all the data (Volterra, neural networks), and are not practical.
Kernel-based methods offer a very interesting alternative to neural networks. Provided that the adaptation algorithm is written as an inner product, one can take advantage of the "kernel trick", and nonlinear filters in the input space are obtained. The primary advantage of doing gradient descent learning in an RKHS is that the performance surface is still quadratic, so there are no local minima, while the filter is nonlinear in the input space.

Adaptive Filtering Fundamentals

On-Line Learning for Linear Filters

[Figure: block diagram of the adaptive system; a transversal filter w_i produces the output y(i) from the input u_i, and the error e(i) = d(i) − y(i) drives the adaptive weight-control mechanism.]

Notation:
  w_i   weight estimate at time i (vector, dim = L)
  u_i   input at time i (vector)
  e(i)  estimation error at time i (scalar)
  d(i)  desired response at time i (scalar)
  e_i   estimation error at iteration i (vector)
  d_i   desired response at iteration i (vector)
  G_i   capital letters denote matrices

The current estimate w_i is computed in terms of the previous estimate, w_{i-1}, as

w_i = w_{i-1} + G_i e_i

where e_i is the model prediction error arising from the use of w_{i-1}, and G_i is a gain term.

On-Line Learning for Linear Filters

[Figure: contour plot of the quadratic performance surface in the weight space (W1, W2) with the weight tracks of the adaptation (MEE, FP-MEE).]

w_i = w_{i-1} − η ∇J_{i-1}              (gradient descent)
w_i = w_{i-1} − η H^{-1} ∇J_{i-1}        (Newton's method)

J = E[e²(i)],    lim_{i→∞} E[w_i] = w*,    η is the step size.

On-Line Learning for Linear Filters

Gradient descent learning for linear mappers also has great properties: it accepts an unbiased sample-by-sample estimator that is easy to compute (O(L)), leading to the famous LMS algorithm

w_i = w_{i-1} + η u_i e(i)

The LMS is a robust (H∞) estimation algorithm. For small step sizes, the points visited during adaptation always belong to the input data manifold (dimension L), since the algorithm always moves in the direction opposite to the gradient.
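For reference, here is a minimal NumPy sketch of the LMS recursion above (tap-delay-line input, step size η); the function name and signature are illustrative assumptions, not part of the original material:

```python
import numpy as np

def lms(u, d, L=5, eta=0.1):
    """Least-mean-square adaptation of an L-tap FIR filter.

    u : 1-D input signal, d : desired signal of the same length.
    Returns the weight track and the a-priori errors.
    """
    N = len(d)
    w = np.zeros(L)
    e = np.zeros(N)
    W = np.zeros((N, L))
    for i in range(L - 1, N):
        ui = u[i - L + 1:i + 1][::-1]   # tap-delay vector [u(i), ..., u(i-L+1)]
        e[i] = d[i] - w @ ui            # a-priori error e(i) = d(i) - w_{i-1}^T u_i
        w = w + eta * ui * e[i]         # w_i = w_{i-1} + eta * u_i * e(i)
        W[i] = w
    return W, e
```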

On-Line Learning for Non-Linear Filters?

Can we generalize w_i = w_{i-1} + G_i e_i to nonlinear models,

y = w^T u   →   y = f(u),

and create the nonlinear mapping incrementally as

f_i = f_{i-1} + G_i e_i ?

[Figure: block diagram with a universal approximator f_i in place of the linear filter; the error e(i) = d(i) − y(i) drives the adaptive weight-control mechanism.]

Part 2: Least-mean-squares in kernel space

Non-Linear Methods - Traditional (Fixed topologies)

Hammerstein and Wiener models
- An explicit nonlinearity followed (preceded) by a linear filter
- Nonlinearity is problem dependent
- Do not possess the universal approximation property

Multi-layer perceptrons (MLPs) with back-propagation
- Non-convex optimization
- Local minima

Least-mean-square for radial basis function (RBF) networks
- Non-convex optimization for adjustment of centers
- Local minima

Volterra models, recurrent networks, etc.

Non-linear Methods with kernels

Universal approximation property (kernel dependent)
Convex optimization (no local minima)
Still easy to compute (kernel trick)
But require regularization

Sequential (On-line) Learning with Kernels

(Platt 1991) Resource-allocating networks
- Heuristic
- No convergence and well-posedness analysis
(Frieb 1999) Kernel adaline
- Formulated in batch mode
- Well-posedness not guaranteed
(Kivinen 2004) Regularized kernel LMS with explicit regularization
- Solution is usually biased
(Engel 2004) Kernel recursive least-squares
(Vaerenbergh 2006) Sliding-window kernel recursive least-squares

Neural Networks versus Kernel Filters

                                   ANNs   Kernel filters
Universal approximators            YES    YES
Convex optimization                NO     YES
Model topology grows with data     NO     YES
Require explicit regularization    NO     YES/NO (KLMS)
Online learning                    YES    YES
Computational complexity           LOW    MEDIUM

ANNs are semi-parametric, nonlinear approximators. Kernel filters are non-parametric, nonlinear approximators.

Kernel Methods

Kernel filters operate in a very special Hilbert space of functions called a Reproducing Kernel Hilbert Space (RKHS). An RKHS is a Hilbert space of functions in which every point evaluation is a bounded (finite) functional. Operating with functions seems complicated, and it is! But it becomes much easier in an RKHS if we restrict the computation to inner products, and most linear algorithms can be expressed as inner products. Remember the FIR filter:

y(n) = Σ_{i=0}^{L−1} w_i x(n − i) = w^T x(n)

Kernel Methods

Moore-Aronszajn theorem
Every symmetric positive definite function of two real variables in E, κ(x, y), defines a unique Reproducing Kernel Hilbert Space (RKHS) H with
(I)  ∀x ∈ E,  κ(·, x) ∈ H
(II) ∀x ∈ E, ∀f ∈ H,  f(x) = <f, κ(·, x)>_H
Example: the Gaussian kernel κ(x, y) = exp(−h ||x − y||²).

Mercer's theorem
Let κ(x, y) be symmetric positive definite. The kernel can be expanded in the series

κ(x, y) = Σ_{i=1}^{m} λ_i φ_i(x) φ_i(y)

Construct the transform as

φ(x) = [√λ₁ φ₁(x), √λ₂ φ₂(x), ..., √λ_m φ_m(x)]^T

so that the inner product satisfies <φ(x), φ(y)> = κ(x, y).

Kernel methods

Mate L., Hilbert Space Methods in Science and Engineering, Adam Hilger, 1989
Berlinet A. and Thomas-Agnan C., Reproducing Kernel Hilbert Spaces in Probability and Statistics, Kluwer, 2004

Basic idea of on-line kernel filtering

Transform the data into a high-dimensional feature space F:  φ_i := φ(u_i)
Construct a linear model in the feature space:  y = <Ω, φ(u)>_F
Adapt the parameters iteratively with gradient information:

Ω_i = Ω_{i-1} − η ∇J_i

Compute the output:

f_i(u) = <Ω_i, φ(u)>_F = Σ_{j=1}^{m_i} a_j κ(u, c_j)

Universal approximation theorem: for the Gaussian kernel and a sufficiently large m_i, f_i(u) can approximate any continuous input-output mapping arbitrarily closely in the L_p norm.

Growing network structure

[Figure: growing RBF network; each new input adds a center c_j with coefficient a_j.]

Ω_i = Ω_{i-1} + η e(i) φ(u_i)

f_i = f_{i-1} + η e(i) κ(u_i, ·)

Kernel Least-Mean-Square (KLMS)

Least-mean-square:  w_0 = 0,   e(i) = d(i) − w_{i-1}^T u_i,   w_i = w_{i-1} + η u_i e(i)

Transform the data into a high-dimensional feature space F: φ_i := φ(u_i)

Ω_0 = 0
e(i) = d(i) − <Ω_{i-1}, φ(u_i)>_F
Ω_i = Ω_{i-1} + η φ(u_i) e(i)

Step by step:

e(1) = d(1) − <Ω_0, φ(u_1)>_F = d(1)
Ω_1 = Ω_0 + η φ(u_1) e(1) = a_1 φ(u_1)
e(2) = d(2) − <Ω_1, φ(u_2)>_F = d(2) − a_1 <φ(u_1), φ(u_2)>_F = d(2) − a_1 κ(u_1, u_2)
Ω_2 = Ω_1 + η φ(u_2) e(2) = a_1 φ(u_1) + a_2 φ(u_2)
...
Ω_i = η Σ_{j=1}^{i} e(j) φ(u_j)
f_i(u) = <Ω_i, φ(u)>_F = Σ_{j=1}^{i} η e(j) κ(u_j, u)

The RBF centers are the samples, and the weights are the (scaled) errors!

Kernel Least-Mean-Square (KLMS)

f_{i-1} = η Σ_{j=1}^{i-1} e(j) κ(u(j), ·)

f_{i-1}(u(i)) = η Σ_{j=1}^{i-1} e(j) κ(u(j), u(i))

e(i) = d(i) − f_{i-1}(u(i))

f_i = f_{i-1} + η e(i) κ(u(i), ·)
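A minimal NumPy sketch of the KLMS recursion above, with a Gaussian kernel; the function names and parameter defaults are illustrative assumptions, not from the original slides:

```python
import numpy as np

def gauss_kernel(x, y, h=1.0):
    """Gaussian kernel kappa(x, y) = exp(-h * ||x - y||^2)."""
    return np.exp(-h * np.sum((x - y) ** 2))

def klms(U, d, eta=0.2, h=1.0):
    """Kernel LMS: the centers are the past inputs, the weights are eta*e(j).

    U : (N, L) array of input vectors, d : (N,) desired signal.
    Returns the centers, the coefficients and the a-priori errors.
    """
    centers, alphas, errors = [], [], []
    for i in range(len(d)):
        # prediction f_{i-1}(u(i)) = sum_j a_j * kappa(c_j, u(i))
        y = sum(a * gauss_kernel(c, U[i], h) for a, c in zip(alphas, centers))
        e = d[i] - y                 # a-priori error e(i)
        centers.append(U[i])         # new center c_i = u(i)
        alphas.append(eta * e)       # new coefficient a_i = eta * e(i)
        errors.append(e)
    return centers, alphas, np.array(errors)
```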

Free Parameters in KLMS: Step size

Traditional wisdom in LMS still applies here.

η < N / tr[G_φ] = N / Σ_{j=1}^{N} κ(u(j), u(j))

where G_φ is the Gram matrix and N its dimensionality. For translation-invariant kernels, κ(u(j), u(j)) = g₀ is a constant independent of the data, so the bound becomes η < 1/g₀. The misadjustment is therefore

M = (η / 2N) tr[G_φ]

Free Parameters in KLMS Rule of Thumb for h

Although KLMS is not kernel density estimation, these rules of thumb still provide a starting point. Silverman's rule can be applied,

h = 1.06 min{σ, R/1.34} N^{−1/(5L)}

where σ is the input data standard deviation, R is the interquartile range, N is the number of samples and L is the dimension. Alternatively, look at the dynamic range of the data, assume it is uniformly distributed, and select h to put about 10 samples within 3σ.

Free Parameters in KLMS: Kernel Design

The kernel defines the inner product in the RKHS. Any positive definite function can be used (Gaussian, polynomial, Laplacian, etc.), but we should choose a kernel that yields a class of functions that allows universal approximation. A strictly positive definite function is preferred because it yields universal mappers (Gaussian, Laplacian).
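A quick sketch of the Silverman rule of thumb above for initializing the Gaussian kernel size; pooling the per-dimension spreads into a single scalar is an assumption made here for simplicity, and the function name is illustrative:

```python
import numpy as np

def silverman_kernel_size(U):
    """h = 1.06 * min(std, IQR/1.34) * N^(-1/(5L)) for an (N, L) input array."""
    N, L = U.shape
    std = np.mean(np.std(U, axis=0))        # pooled standard deviation
    q75, q25 = np.percentile(U, [75, 25])   # pooled interquartile range
    spread = min(std, (q75 - q25) / 1.34)
    return 1.06 * spread * N ** (-1.0 / (5 * L))
```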

See Sriperumbudur et al., On the Relation Between Universality, Characteristic Kernels and RKHS Embedding of Measures, AISTATS 2010.

Free Parameters in KLMS: Kernel Design

Estimate and minimize the generalization error, e.g. cross validation

Establish and minimize a generalization error upper bound, e.g. VC dimension

Estimate and maximize the posterior probability of the model given the data using Bayesian inference

Free Parameters in KLMS Bayesian model selection

The posterior probability of a Model H (kernel and parameters θ) given the data is

p(H_i | d, U) = p(d | U, H_i) p(H_i) / p(d | U)

where d is the desired output and U is the input data. This is hardly ever done for the kernel function itself, but it can be applied to θ and leads to Bayesian principles to adapt the kernel parameters.

Free Parameters in KLMS: Maximal marginal likelihood

J(H_i) = max_θ [ −(1/2) d^T (G + σ_n² I)^{-1} d − (1/2) log|G + σ_n² I| − (N/2) log(2π) ]

Sparsification

The filter size increases linearly with the number of samples! If the RKHS is compact and the environment stationary, there is no need to keep increasing the filter size. The issue is that we would like to implement it on-line! Two ways to cope with growth:
- Novelty criterion
- Approximate linear dependency
The first is very simple and intuitive to implement.

Sparsification: Novelty Criterion

The present dictionary is C(i) = {c_j}_{j=1}^{m_i}. When a new data pair (u(i+1), d(i+1)) arrives, first compute the distance to the present dictionary,

dis = min_{c_j ∈ C(i)} ||u(i+1) − c_j||

If it is smaller than a threshold δ1, do not create a new center. Otherwise, check whether the prediction error is larger than δ2 before augmenting the dictionary. Typically δ1 ≈ 0.1 × kernel size and δ2 ≈ sqrt(MSE).
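A minimal sketch of the novelty-criterion test described above; the function name and argument conventions are assumptions for illustration:

```python
import numpy as np

def novelty_criterion(centers, u_new, e_new, delta1, delta2):
    """Return True if the new pair should be added to the dictionary.

    centers : list of existing center vectors
    u_new   : candidate input, e_new : its prediction error
    delta1, delta2 : distance and error thresholds (see text above)
    """
    if not centers:
        return True
    dis = min(np.linalg.norm(u_new - c) for c in centers)
    return dis > delta1 and abs(e_new) > delta2
```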

Sparsification Approximate Linear Dependency

Engel proposed to estimate the distance to the linear span of the centers, i.e. compute

dis = min_b || φ(u(i+1)) − Σ_{c_j ∈ C} b_j φ(c_j) ||

which can be computed as

dis² = κ(u(i+1), u(i+1)) − h(i+1)^T G^{-1}(i) h(i+1)

where h(i+1) collects the kernel evaluations between u(i+1) and the dictionary centers.
- Only increase the dictionary if dis is larger than a threshold
- Complexity is O(m²)
- Easy to estimate in KRLS (dis ~ r(i+1))
- The sum can be simplified to the nearest center, and then ALD defaults to NC:

dis = min_{c_j ∈ C} || φ(u(i+1)) − φ(c_j) ||
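A small sketch of the ALD test above with a Gaussian kernel; G_inv is assumed to be the maintained inverse Gram matrix of the dictionary, and all names are illustrative:

```python
import numpy as np

def gauss_kernel(x, y, h=1.0):
    return np.exp(-h * np.sum((x - y) ** 2))

def ald_test(centers, G_inv, u_new, threshold, h=1.0):
    """Approximate-linear-dependency test: accept u_new only if dis^2 > threshold.

    Returns (accept, dis2, k_vec); k_vec can be reused to update the Gram matrix.
    """
    k_vec = np.array([gauss_kernel(c, u_new, h) for c in centers])
    dis2 = gauss_kernel(u_new, u_new, h) - k_vec @ G_inv @ k_vec
    return dis2 > threshold, dis2, k_vec
```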

KLMS - Mackey-Glass Prediction

ẋ(t) = −0.1 x(t) + 0.2 x(t−τ) / (1 + x(t−τ)^10),    τ = 30

[Figure: learning curves for LMS (η = 0.2) and KLMS (a = 1, η = 0.2); regularization worsens performance.]

Performance-Growth tradeoff

[Figure: performance versus network growth with the novelty criterion; δ1 = 0.1, δ2 = 0.05, η = 0.1, a = 1.]

KLMS - Nonlinear channel equalization

f_i(u) = <Ω_i, φ(u)>_F = Σ_{j=1}^{i} η e(j) κ(u_j, u)

z_t = s_t + 0.5 s_{t−1},    r_t = z_t − 0.9 z_t² + n_σ

[Figure: growing RBF network for KLMS; at each step a new center and coefficient are added:]

c_{m_i} ← u_i
a_{m_i} ← η e(i)

Nonlinear channel equalization

Algorithms      Linear LMS (η=0.005)   KLMS (η=0.1, no regularization)   RN (regularized, λ=1)
BER (σ = .1)    0.162 ± 0.014          0.020 ± 0.012                     0.008 ± 0.001
BER (σ = .4)    0.177 ± 0.012          0.058 ± 0.008                     0.046 ± 0.003
BER (σ = .8)    0.218 ± 0.012          0.130 ± 0.010                     0.118 ± 0.004

κ(u_i, u_j) = exp(−0.1 ||u_i − u_j||²)

Algorithms               Linear LMS   KLMS   RN
Computation (training)   O(L)         O(i)   O(i³)
Memory (training)        O(L)         O(i)   O(i²)
Computation (test)       O(L)         O(i)   O(i)
Memory (test)            O(L)         O(i)   O(i)

Why don’t we need to explicitly regularize the KLMS? Self-regularization property of KLMS

o Assume the data model di () =Ω+ ( ϕ i ) vi () then for any unknown vector Ω 0 the following inequality holds i − 2 ∑ = |()ej vj ()| j 1 <= i−1 1, for all i1,2,..., N η −12||Ω+o || |vj ( ) | 2 ∑ j=1 As long as the matrix { η − 1 I − ϕ ( i )ϕ ( i ) T } is positive definite. So H∞ robustness − ||ev ||21<η || Ω+o || 2 2 || || 2

And Ω ( n ) is upper bounded 2o 22 ||Ω< ||ση (|| Ω+ || 2 η ||v || ) σ1 is the largest N 1 eigenvalue of Gφ The solution norm of KLMS is always upper bounded i.e. the algorithm is well posed in the sense of Hadamard. Liu W., Pokarel P., Principe J., “The Kernel LMS Algorithm”, IEEE Trans. Signal Processing, Vol 56, # 2, 543 – 554, 2008. Regularization Techniques

Learning from finite data is ill-posed and a priori information to enforce Smoothness is needed. Norm The key is to constrain the solution norm constraint In Least Squares constraining the norm yields 1 N Ω = −ΩTϕ 22Ω < J( )∑ ( di ( )i ) , subject to || || C N i=1 In Bayesian modeling, the norm is the prior. (Gaussian process) N 1 T 22Gaussian distributed prior J(Ω ) =∑ ( di ( ) −Ωϕλi ) + || Ω || N i=1 In statistical learning theory, the norm is associated with the model capacity and hence the confidence of uniform convergence! (VC dimension and structural risk minimization) Tikhonov Regularization

In numerical analysis the method is to constrain the condition number of the solution matrix (or its eigenvalues) The singular value decomposition of Φ can be written S 0 T S= diag{ s , s ,..., s } Φ = P Q 12 r 0 0 Singular value Ω d(i) = ϕ(i)T Ω0 +ν (i) The pseudo inverse to estimate in is Ω = Pdiag[s−1,..., s−1,0....0]QT d PI 1 r which can be still ill-posed (very small sr). Tikhonov regularized the least square solution to penalize the solution norm to yield 2 J (Ω) = d − ΦT Ω + λ Ω ss Ω= 1 r T Pdiag(22 ,..., ,0,...,0)Q d ss1 ++λλr 2 Notice that if λ = 0, when sr is very small, sr/(sr + λ) = 1/sr → ∞. 2 However if λ > 0, when sr is very small, sr/(sr + λ) = sr/ λ → 0. Tikhonov and KLMS
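A small NumPy sketch of this Tikhonov-regularized least squares solution computed through the SVD, as in the expression above; the function name and the data layout (columns of Φ are the transformed inputs) are assumptions:

```python
import numpy as np

def tikhonov_ls(Phi, d, lam):
    """Return P diag(s/(s^2 + lam)) Q^T d for the model d ~ Phi^T w.

    Phi : (m, N) matrix whose columns are the (transformed) input vectors
    d   : (N,) desired vector, lam : regularization parameter lambda
    """
    P, s, Qt = np.linalg.svd(Phi, full_matrices=False)
    gain = s / (s ** 2 + lam)     # regularized inverse singular values
    return P @ (gain * (Qt @ d))
```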

Tikhonov and KLMS

For finite data and using small-step-size theory, denote φ_i = φ(u_i) ∈ R^m and

R_φ = (1/N) Σ_{i=1}^{N} φ_i φ_i^T = P Λ P^T

Assume the correlation matrix is singular, with

ς₁ ≥ ... ≥ ς_k > ς_{k+1} = ... = ς_m = 0

From LMS theory it is known that

E[ε_n(i)] = (1 − ης_n)^i ε_n(0)

E[|ε_n(i)|²] = η J_min / (2 − ης_n) + (1 − ης_n)^{2i} ( |ε_n(0)|² − η J_min / (2 − ης_n) )

Define Ω(i) − Ω° = Σ_{n=1}^{m} ε_n(i) P_n. With Ω(0) = 0, so that ε_j(0) = −Ω°_j,

E[Ω(i)] = Ω° + Σ_{j=1}^{m} (1 − ης_j)^i ε_j(0) P_j = Σ_{j=1}^{m} [1 − (1 − ης_j)^i] Ω°_j P_j

and, for η ≤ 1/ς_max,

||E[Ω(i)]||² ≤ Σ_{j=1}^{m} (Ω°_j)² = ||Ω°||²

Liu W., Pokarel P., Principe J., "The Kernel LMS Algorithm", IEEE Trans. Signal Processing, Vol 56, # 2, 543-554, 2008.

Tikhonov and KLMS

In the worst case, substitute the optimal weight by the pseudo-inverse:

E[Ω(i)] = P diag[ (1 − (1 − ης₁)^i) s₁^{-1}, ..., (1 − (1 − ης_r)^i) s_r^{-1}, 0, ..., 0 ] Q^T d

Regularization function for finite N:

No regularization      s_n^{-1}
Tikhonov               [s_n² / (s_n² + λ)] · s_n^{-1}
Truncated SVD (PCA)    s_n^{-1} if s_n > th, 0 if s_n ≤ th
KLMS                   [1 − (1 − η s_n²/N)^N] · s_n^{-1}

[Figure: regularization functions of KLMS, Tikhonov and truncated SVD versus the singular value.]

The step size and N control the regularization function in KLMS.

Liu W., Principe J., "The Well-posedness Analysis of the Kernel Adaline", Proc. WCCI, Hong Kong, 2008.

The minimum norm initialization for KLMS

The initialization Ω(0) = 0 gives the minimum possible norm solution. Writing

Ω_i = Σ_{n=1}^{m} c_n P_n,    with ς₁ ≥ ... ≥ ς_k > 0 and ς_{k+1} = ... = ς_m = 0,

||Ω_i||² = Σ_{n=1}^{k} ||c_n||² + Σ_{n=k+1}^{m} ||c_n||²

[Figure: geometric illustration of the minimum-norm solution in weight space.]

Liu W., Pokarel P., Principe J., "The Kernel LMS Algorithm", IEEE Trans. Signal Processing, Vol 56, # 2, 543-554, 2008.

KLMS and the Data Space

KLMS search is insensitive to the zero-eigenvalue directions:

E[ε_n(i)] = (1 − ης_n)^i ε_n(0)

E[|ε_n(i)|²] = η J_min / (2 − ης_n) + (1 − ης_n)^{2i} ( |ε_n(0)|² − η J_min / (2 − ης_n) )

So if ς_n = 0, then E[ε_n(i)] = ε_n(0) and E[|ε_n(i)|²] = |ε_n(0)|². The zero-eigenvalue directions do not affect the MSE either:

J(i) = E[|d − Ω_i^T φ|²]

J(i) = J_min + Σ_{n=1}^{m} η J_min ς_n / (2 − ης_n) + Σ_{n=1}^{m} ς_n ( |ε_n(0)|² − η J_min / (2 − ης_n) ) (1 − ης_n)^{2i}

KLMS only finds solutions on the data subspace! It does not care about the null space!

Liu W., Pokarel P., Principe J., “The Kernel LMS Algorithm”, IEEE Trans. Signal Processing, Vol 56, # 2, 543 – 554, 2008. Energy Conservation Relation

The fundamental energy conservation relation holds in RKHS!

Energy conservation in RKHS

||Ω̃(i)||²_F + e_a²(i) / κ(u(i), u(i)) = ||Ω̃(i−1)||²_F + e_p²(i) / κ(u(i), u(i))

where Ω̃ is the weight error in feature space and e_a, e_p are the a priori and a posteriori errors.

Upper bound on the step size for mean square convergence:

η ≤ 2 E[||Ω*||²_F] / ( E[||Ω*||²_F] + σ_v² )

Steady-state mean square performance:

lim_{i→∞} E[e_a²(i)] = η σ_v² / (2 − η)

[Figure: theoretical and simulated EMSE versus the step size η.]

Chen B., Zhao S., Zhu P., Principe J., "Mean Square Convergence Analysis of the Kernel Least Mean Square Algorithm", submitted to IEEE Trans. Signal Processing.

Effects of Kernel Size

[Figure: EMSE learning curves for kernel sizes σ = 0.2, 1.0, 20 (left) and steady-state EMSE versus kernel size, simulation and theory (right).]

The kernel size affects the convergence speed! (How to choose a suitable kernel size is still an open problem.)

However, it does not affect the final misadjustment! (universal approximation with infinite samples)

Part 3: Affine projection algorithms in kernel space

The big picture for gradient-based learning

[Figure: family of gradient-based algorithms and their kernelized versions: LMS, leaky LMS and normalized LMS (Kivinen 2004); the affine projection algorithms APA, leaky APA and Newton APA, which reduce to the LMS family for K = 1 and approach RLS for K = i; the Adaline (Frieb 1999) and RLS (Engel 2004); and the extended RLS.]

We have kernelized versions of all of them. The extended RLS is a weighted RLS model extended with states.

Liu W., Principe J., “Kernel Affine Projection Algorithms”, European J. of Signal Processing, ID 784292, 2008.

Affine projection algorithms

Solve min_w J(w) = E|d − w^T u|², which yields w° = R_u^{-1} r_du.

There are several ways to approximate this solution iteratively.

Gradient descent method:

w(0),   w(i) = w(i−1) + η [r_du − R_u w(i−1)]

Newton's recursion:

w(0),   w(i) = w(i−1) + η (R_u + εI)^{-1} [r_du − R_u w(i−1)]

LMS uses a stochastic gradient that approximates R̂_u = u(i)u(i)^T and r̂_du = d(i)u(i).

Affine projection algorithms (APA) utilize better approximations. Therefore APA is a family of online gradient-based algorithms of intermediate complexity between LMS and RLS.

Affine projection algorithms

APA are of the general form

U(i) = [u(i−K+1), ..., u(i)]   (L×K),    d(i) = [d(i−K+1), ..., d(i)]^T

R̂_u = (1/K) U(i)U(i)^T,    r̂_du = (1/K) U(i)d(i)

Gradient:

w(0),   w(i) = w(i−1) + η U(i)[d(i) − U(i)^T w(i−1)]

Newton:

w(i) = w(i−1) + η (U(i)U(i)^T + εI)^{-1} U(i)[d(i) − U(i)^T w(i−1)]

Notice that (U(i)U(i)^T + εI)^{-1} U(i) = U(i)(U(i)^T U(i) + εI)^{-1}, so

w(i) = w(i−1) + η U(i)[U(i)^T U(i) + εI]^{-1}[d(i) − U(i)^T w(i−1)]
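A small NumPy sketch of one APA update in the Newton-like form above; the function name and defaults are illustrative:

```python
import numpy as np

def apa_newton_step(w, U, d, eta=0.5, eps=1e-3):
    """One APA step: w(i) = w(i-1) + eta*U*[U^T U + eps*I]^{-1}(d - U^T w).

    U : (L, K) matrix of the last K input vectors, d : (K,) desired values.
    """
    e = d - U.T @ w                                    # K a-priori errors
    K = U.shape[1]
    return w + eta * U @ np.linalg.solve(U.T @ U + eps * np.eye(K), e)
```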

Affine projection algorithms

If a regularized cost function is preferred,

min_w J(w) = E|d − w^T u|² + λ ||w||²

the gradient method becomes

w(0),   w(i) = (1 − ηλ) w(i−1) + η U(i)[d(i) − U(i)^T w(i−1)]

Newton:

w(i) = (1 − ηλ) w(i−1) + η (U(i)U(i)^T + εI)^{-1} U(i)d(i)

or

w(i) = (1 − ηλ) w(i−1) + η U(i)[U(i)^T U(i) + εI]^{-1} d(i)

Kernel Affine Projection Algorithms

With the correspondence w ≡ Ω in the feature space:
- KAPA-1 and KAPA-2 use the least squares cost, while KAPA-3 and KAPA-4 are regularized.
- KAPA-1 and KAPA-3 use gradient descent; KAPA-2 and KAPA-4 use the Newton update.
- KAPA-4 does not require the calculation of the error, by rewriting the error with the matrix inversion lemma and using the kernel trick.
- One does not have access to the weights, so a coefficient recursion is needed, as in KLMS. Care must be taken to minimize computations.

KAPA-1

[Figure: growing RBF network for KAPA-1; the newest center is added and the last K coefficients are updated:]

c_{m_i} ← u_i
a_{m_i} ← η e_i(i)
a_{m_i−1} ← a_{m_i−1} + η e_i(i−1)
...
a_{m_i−K+1} ← a_{m_i−K+1} + η e_i(i−K+1)

f_i(u) = <Ω_i, φ(u)>_F = Σ_{j=1}^{i} a_j κ(u_j, u)

KAPA-1

f_i = f_{i-1} + η Σ_{j=i−K+1}^{i} e(i; j) κ(u(j), ·)

a_i(i) = η e(i; i)
a_j(i) = a_j(i−1) + η e(i; j),    j = i−K+1, ..., i−1
a_j(i) = a_j(i−1),                j = 1, ..., i−K
C(i) = {C(i−1), u(i)}
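A minimal NumPy sketch of the KAPA-1 coefficient recursion above, where all K errors are computed with the previous expansion before the update; the function names, defaults and Gaussian kernel choice are illustrative (no error reuse, so this naive version costs O(iK) per step):

```python
import numpy as np

def gauss_kernel(x, y, h=1.0):
    return np.exp(-h * np.sum((x - y) ** 2))

def kapa1(U, d, K=10, eta=0.1, h=1.0):
    """KAPA-1: update the last K coefficients with the errors e(i; j).

    U : (N, L) inputs, d : (N,) desired signal.
    """
    centers, a = [], []
    def predict(x):
        return sum(aj * gauss_kernel(c, x, h) for aj, c in zip(a, centers))
    for i in range(len(d)):
        centers.append(U[i])
        a.append(0.0)                   # new coefficient starts at zero
        lo = max(0, i - K + 1)
        # all errors use the previous expansion f_{i-1} (new coefficient is zero)
        errors = [d[j] - predict(U[j]) for j in range(lo, i + 1)]
        for j, e_ij in zip(range(lo, i + 1), errors):
            a[j] += eta * e_ij
    return centers, a
```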

Error reusing to save computation

For KAPA-1, KAPA-2 and KAPA-3, calculating the K errors is expensive (kernel evaluations):

e_i(k) = d(k) − φ_k^T Ω_{i-1},    (i − K + 1 ≤ k ≤ i)

K times the computation? No: save the previous errors and reuse them,

e_{i+1}(k) = d(k) − φ_k^T Ω_i = d(k) − φ_k^T (Ω_{i-1} + η Φ_i e_i)
           = (d(k) − φ_k^T Ω_{i-1}) − η φ_k^T Φ_i e_i
           = e_i(k) − η Σ_{j=i−K+1}^{i} e_i(j) φ_k^T φ_j

We still need e_{i+1}(i+1), which requires i kernel evaluations, so the cost per update is O(i + K²).

KAPA-4

KAPA-4: Smoothed Newton’s method.

Φ_i = [φ_i, φ_{i−1}, ..., φ_{i−K+1}],    d_i = [d(i), d(i−1), ..., d(i−K+1)]^T

There is no need to compute the error:

w(i) = (1 − ηλ) w(i−1) + η Φ(i)[Φ(i)^T Φ(i) + λI]^{-1} d(i)

The topology can still be put in the same RBF framework. Efficient ways to compute the inverse are necessary; the sliding-window computation yields a complexity of O(K²).

KAPA-4

a_k(i) = η d̃(i),                        k = i
a_k(i) = (1 − η) a_k(i−1) + η d̃(k),      i − K + 1 ≤ k ≤ i − 1
a_k(i) = (1 − η) a_k(i−1),                1 ≤ k ≤ i − K

where

d̃(i) = (G(i) + λI)^{-1} d(i)

How do we invert the K-by-K matrix (εI + Φ_i^T Φ_i) and avoid O(K³)?

Sliding window Gram matrix inversion

Φ_i = [φ_i, φ_{i−1}, ..., φ_{i−K+1}],    Gr_i = Φ_i^T Φ_i

Sliding window: write

Gr_i + λI = [ a   b^T ;  b   D ],        Gr_{i+1} + λI = [ D   h ;  h^T   g ]

Assume (Gr_i + λI)^{-1} = [ e   f^T ;  f   H ] is known. Then

D^{-1} = H − f f^T / e

s = ( g − h^T D^{-1} h )^{-1}        (inverse of the Schur complement of D)

(Gr_{i+1} + λI)^{-1} = [ D^{-1} + (D^{-1}h)(D^{-1}h)^T s     −(D^{-1}h) s ;
                         −(D^{-1}h)^T s                       s ]

The complexity is O(K²).
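A small NumPy sketch of this O(K²) sliding-window inverse update; the function name and the assumption that the departing sample occupies the first row/column of the current matrix are illustrative:

```python
import numpy as np

def sw_inverse_update(G_inv, h, g):
    """Sliding-window update of a regularized Gram-matrix inverse.

    G_inv : inverse of the current K x K matrix [[a, b^T], [b, D]], where the
            first row/column belongs to the sample leaving the window.
    h, g  : kernel vector and (regularized) self-kernel of the new sample.
    Returns the inverse of the new K x K matrix [[D, h], [h^T, g]].
    """
    e = G_inv[0, 0]
    f = G_inv[1:, 0]
    H = G_inv[1:, 1:]
    D_inv = H - np.outer(f, f) / e             # remove the oldest sample
    s = 1.0 / (g - h @ D_inv @ h)              # inverse Schur complement of D
    Dh = D_inv @ h
    top_left = D_inv + s * np.outer(Dh, Dh)
    return np.block([[top_left, -s * Dh[:, None]],
                     [-s * Dh[None, :], np.array([[s]])]])
```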

Relations to other methods

Recursive Least-Squares

The RLS algorithm estimates a weight vector w(i−1) by minimizing the cost function

min_w Σ_{j=1}^{i−1} |d(j) − u(j)^T w|²

The solution is

w(i−1) = (U(i−1)U(i−1)^T)^{-1} U(i−1)d(i−1)

and can be recursively computed as

w(i) = w(i−1) + [P(i−1)u(i) / (1 + u(i)^T P(i−1)u(i))] [d(i) − u(i)^T w(i−1)]

where P(i) = (U(i)U(i)^T)^{-1}. Start with zero weights and P(0) = λ^{-1} I.

r(i) = 1 + u(i)^T P(i−1)u(i)
k(i) = P(i−1)u(i) / r(i)
e(i) = d(i) − u(i)^T w(i−1)
w(i) = w(i−1) + k(i)e(i)
P(i) = P(i−1) − k(i)k(i)^T r(i)

Kernel Recursive Least-Squares

The KRLS algorithm estimates a weight function w(i) by minimizing

min_w Σ_{j=1}^{i} |d(j) − w^T φ(j)|² + λ ||w||²

The solution in the RKHS becomes

w(i) = Φ(i) [λI + Φ(i)^T Φ(i)]^{-1} d(i) = Φ(i) a(i),    a(i) = Q(i) d(i)

Q(i)^{-1} can be computed recursively as

Q(i)^{-1} = [ Q(i−1)^{-1}   h(i) ;  h(i)^T   λ + φ(i)^T φ(i) ],    h(i) = Φ(i−1)^T φ(i)

From this we can also recursively compute Q(i):

z(i) = Q(i−1) h(i)
r(i) = λ + κ(u(i), u(i)) − z(i)^T h(i)
Q(i) = r(i)^{-1} [ Q(i−1) r(i) + z(i) z(i)^T    −z(i) ;   −z(i)^T    1 ]

and compose a(i) recursively:

e(i) = d(i) − h(i)^T a(i−1)
a(i) = [ a(i−1) − z(i) r^{-1}(i) e(i) ;  r^{-1}(i) e(i) ]

with initial conditions Q(1) = [λ + κ(u(1), u(1))]^{-1}, a(1) = Q(1)d(1).

KRLS

[Figure: growing RBF network for KRLS; the new center is added and all previous coefficients are corrected:]

c_{m_i} ← u_i
a_{m_i} ← r(i)^{-1} e(i)
a_{m_i−j} ← a_{m_i−j} − r(i)^{-1} e(i) z_j(i)

f_i(u) = Σ_{j=1}^{i} a_j(i) κ(u(j), u)

Engel Y., Mannor S., Meir R., "The kernel recursive least squares algorithm", IEEE Trans. Signal Processing, 52 (8), 2275-2285, 2004.

KRLS

f_i = f_{i-1} + r(i)^{-1} e(i) [ κ(u(i), ·) − Σ_{j=1}^{i−1} z_j(i) κ(u(j), ·) ]

a_i(i) = r(i)^{-1} e(i)
a_j(i) = a_j(i−1) − r(i)^{-1} e(i) z_j(i),    j = 1, ..., i−1
C(i) = {C(i−1), u(i)}
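A compact NumPy sketch of this KRLS recursion (no sparsification), with a Gaussian kernel; names and defaults are illustrative:

```python
import numpy as np

def gauss_kernel(x, y, h=1.0):
    return np.exp(-h * np.sum((x - y) ** 2))

def krls(U, d, lam=0.1, h=1.0):
    """Kernel RLS: grows Q(i) and the coefficient vector a(i) as in the text.

    U : (N, L) inputs, d : (N,) desired signal. Returns the coefficients.
    """
    Q = np.array([[1.0 / (lam + gauss_kernel(U[0], U[0], h))]])
    a = np.array([Q[0, 0] * d[0]])
    for i in range(1, len(d)):
        k = np.array([gauss_kernel(U[j], U[i], h) for j in range(i)])   # h(i)
        z = Q @ k
        r = lam + gauss_kernel(U[i], U[i], h) - z @ k
        e = d[i] - k @ a
        # Q(i) from Q(i-1), z(i) and r(i)
        Q = np.block([[Q * r + np.outer(z, z), -z[:, None]],
                      [-z[None, :], np.ones((1, 1))]]) / r
        # coefficient update
        a = np.concatenate([a - z * (e / r), [e / r]])
    return a
```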

Regularization

The well-posedness discussion for KLMS holds for any other gradient descent method, like KAPA-1 and KAPA-3. If the Newton method is used, additional regularization is needed to invert the Hessian matrix, as in KAPA-2 and normalized KLMS. Recursive least squares embeds the regularization in the initialization.

Computation complexity

Prediction of Mackey-Glass

[Figure: complexity/performance comparison for the Mackey-Glass prediction, L = 10, with K = 10, K = 50 and sliding-window KRLS.]

Simulation 1: noise cancellation, n(i) ~ uniform [−0.5, 0.5]

u(i) = n(i) − 0.2 u(i−1) − u(i−1) n(i−1) + 0.1 n(i−1) + 0.4 u(i−2)
     = H(n(i), n(i−1), u(i−1), u(i−2))

Simulation 1: Noise Cancellation

κ(u(i), u(j)) = exp(−||u(i) − u(j)||²)

K = 10

Simulation 1: Noise Cancellation

[Figure: segments (samples 2500-2600) of the noisy observation and of the noise-cancellation results of NLMS, KLMS-1 and KAPA-2.]

Simulation-2: nonlinear channel equalization

z_t = s_t + 0.5 s_{t−1},    r_t = z_t − 0.9 z_t² + n_σ

K = 10, σ = 0.1

Simulation-2: nonlinear channel equalization

Nonlinearity changed (inverted signs)

Gaussian Processes

A Gaussian process is a stochastic process (a family of random variables) where all the pairwise correlations are Gaussian distributed. The family, however, is not necessarily over time (as in time series). For instance, in regression, if we denote the output of a learning system by y(i) given the input u(i) for every i, the conditional probability is

p(y(1), ..., y(n) | u(1), ..., u(n)) = N(0, σ_n² I + G(i))

where σ_n is the observation Gaussian noise and G(i) is the Gram matrix

G(i) = [ κ(u(1), u(1))  ...  κ(u(1), u(i)) ;  ...  ;  κ(u(i), u(1))  ...  κ(u(i), u(i)) ]

and κ is the covariance function (symmetric and positive definite), just like the Gaussian kernel used in KLMS. Gaussian processes can be used with advantage in Bayesian inference.

Gaussian Processes and Recursive Least-Squares

The standard linear regression model with Gaussian noise is

f(u) = u^T w,    d = f(u) + ν

where the noise is IID, zero mean, with variance σ_n². The likelihood of the observations given the input and weight vector is

p(d(i) | U(i), w) = Π_{j=1}^{i} p(d(j) | u(j), w) = N(U(i)^T w, σ_n² I)

To compute the posterior over the weight vector we need to specify the prior, here a Gaussian p(w) = N(0, σ_w² I), and use Bayes rule:

p(w | U(i), d(i)) = p(d(i) | U(i), w) p(w) / p(d(i) | U(i))

Since the denominator is a constant, the posterior is shaped by the numerator and is approximately given by

p(w | U, d) ∝ exp( −(1/2) (w − w(i))^T [ (1/σ_n²) U(i)U(i)^T + (1/σ_w²) I ] (w − w(i)) )

with mean w(i) = (U(i)U(i)^T + (σ_n²/σ_w²) I)^{-1} U(i)d(i) and covariance [ (1/σ_n²) U(i)U(i)^T + (1/σ_w²) I ]^{-1}.

Therefore, RLS computes the posterior in a Gaussian process model one sample at a time.

KRLS and Nonlinear Regression

It is easy to demonstrate that KRLS in fact estimates online nonlinear regression with a Gaussian noise model, i.e.

f(u) = φ(u)^T w,    d = f(u) + ν

where the noise is IID, zero mean, with variance σ_n². By a similar derivation, the posterior mean and covariance are

w(i) = (Φ(i)Φ(i)^T + (σ_n²/σ_w²) I)^{-1} Φ(i)d(i),    [ (1/σ_n²) Φ(i)Φ(i)^T + (1/σ_w²) I ]^{-1}

Although the weight function is not accessible, we can create predictions at any point in the space with KRLS as

E[f̂(u)] = φ(u)^T Φ(i) (Φ(i)^T Φ(i) + (σ_n²/σ_w²) I)^{-1} d(i)

with variance

σ²(f(u)) = σ_w² φ(u)^T φ(u) − σ_w² φ(u)^T Φ(i) (Φ(i)^T Φ(i) + (σ_n²/σ_w²) I)^{-1} Φ(i)^T φ(u)

Part 4: Extended Recursive least squares in kernel space

Extended Recursive Least-Squares

STATE model

x_{i+1} = F_i x_i + n_i
d_i = U_i^T x_i + v_i

Start with w_{0|−1} and P_{0|−1} = Π^{-1}.

Notation:
  x_i        state vector at time i
  w_{i|i−1}  state estimate at time i using data up to i−1

Special cases:
- Tracking model (F is a time-varying scalar):  x_{i+1} = α x_i + n_i,   d(i) = u_i^T x_i + v(i)
- Exponentially weighted RLS:                    x_{i+1} = α x_i,         d(i) = u_i^T x_i + v(i)
- Standard RLS:                                  x_{i+1} = x_i,            d(i) = u_i^T x_i + v(i)

Recursive equations

The recursive update equations

Initialization:       w_{0|−1} = 0,   P_{0|−1} = λ^{-1} β^{-1} I
Conversion factor:    r_e(i) = λ^i + u_i^T P_{i|i−1} u_i
Gain factor:          k_{p,i} = α P_{i|i−1} u_i / r_e(i)
Error:                e(i) = d(i) − u_i^T w_{i|i−1}
Weight update:        w_{i+1|i} = α w_{i|i−1} + k_{p,i} e(i)

P_{i+1|i} = |α|² [ P_{i|i−1} − P_{i|i−1} u_i u_i^T P_{i|i−1} / r_e(i) ] + λ^i q I

Notice that

u^T ŵ_{i+1|i} = α u^T ŵ_{i|i−1} + α u^T P_{i|i−1} u_i e(i) / r_e(i)

If we have transformed data, how do we calculate φ(u_k)^T P_{i|i−1} φ(u_j) for any k, i, j?

New Extended Recursive Least-squares

Theorem 1:   P_{j|j−1} = ρ_{j−1} I − H_{j−1} Q_{j−1} H_{j−1}^T,   ∀j

where ρ_{j−1} is a scalar, H_{j−1} = [u_0, ..., u_{j−1}] and Q_{j−1} is a j×j matrix, for all j.

Proof (by mathematical induction): P_{0|−1} = λ^{-1} β^{-1} I, so ρ_{−1} = λ^{-1} β^{-1} and Q_{−1} = 0. Then

P_{i+1|i} = |α|² [ P_{i|i−1} − P_{i|i−1} u_i u_i^T P_{i|i−1} / r_e(i) ] + λ^i q I
          = |α|² [ ρ_{i−1} I − H_{i−1} Q_{i−1} H_{i−1}^T − (ρ_{i−1} I − H_{i−1} Q_{i−1} H_{i−1}^T) u_i u_i^T (ρ_{i−1} I − H_{i−1} Q_{i−1} H_{i−1}^T) / r_e(i) ] + λ^i q I
          = (|α|² ρ_{i−1} + λ^i q) I − |α|² H_i [ Q_{i−1} + f_{i−1,i} f_{i−1,i}^T r_e^{-1}(i)    −ρ_{i−1} f_{i−1,i} r_e^{-1}(i) ;
                                                  −ρ_{i−1} f_{i−1,i}^T r_e^{-1}(i)               ρ_{i−1}² r_e^{-1}(i) ] H_i^T

Liu W., Principe J., "Extended Recursive Least Squares in RKHS", in Proc. 1st Workshop on Cognitive Signal Processing, Santorini, Greece, 2008.

New Extended Recursive Least-squares

Theorem 2:   ŵ_{j|j−1} = H_{j−1} a_{j|j−1},   ∀j

where H_{j−1} = [u_0, ..., u_{j−1}] and a_{j|j−1} is a j×1 vector, for all j.

Proof (by mathematical induction again): ŵ_{0|−1} = 0, a_{0|−1} = 0, and

ŵ_{i+1|i} = α ŵ_{i|i−1} + k_{p,i} e(i)
          = α H_{i−1} a_{i|i−1} + α P_{i|i−1} u_i e(i) / r_e(i)
          = α H_{i−1} a_{i|i−1} + α (ρ_{i−1} I − H_{i−1} Q_{i−1} H_{i−1}^T) u_i e(i) / r_e(i)
          = α H_{i−1} a_{i|i−1} + α ρ_{i−1} u_i e(i) / r_e(i) − α H_{i−1} f_{i−1,i} e(i) / r_e(i)
          = H_i [ α a_{i|i−1} − α f_{i−1,i} e(i) r_e^{-1}(i) ;  α ρ_{i−1} e(i) r_e^{-1}(i) ]

Extended RLS: New Equations

Initialization:  a_{0|−1} = 0,  ρ_{−1} = λ^{-1} β^{-1},  Q_{−1} = 0     (equivalently w_{0|−1} = 0, P_{0|−1} = λ^{-1} β^{-1} I)

k_{i−1,i} = H_{i−1}^T u_i,    f_{i−1,i} = Q_{i−1} k_{i−1,i}
r_e(i) = λ^i + ρ_{i−1} u_i^T u_i − k_{i−1,i}^T f_{i−1,i}          (was r_e(i) = λ^i + u_i^T P_{i|i−1} u_i)
e(i) = d(i) − k_{i−1,i}^T a_{i|i−1}                                (was e(i) = d(i) − u_i^T w_{i|i−1})
a_{i+1|i} = [ α a_{i|i−1} − α f_{i−1,i} r_e^{-1}(i) e(i) ;  α ρ_{i−1} r_e^{-1}(i) e(i) ]
                                                                   (was k_{p,i} = α P_{i|i−1} u_i / r_e(i),  w_{i+1|i} = α w_{i|i−1} + k_{p,i} e(i))
ρ_i = |α|² ρ_{i−1} + λ^i q
Q_i = |α|² [ Q_{i−1} + f_{i−1,i} f_{i−1,i}^T r_e^{-1}(i)    −ρ_{i−1} f_{i−1,i} r_e^{-1}(i) ;
             −ρ_{i−1} f_{i−1,i}^T r_e^{-1}(i)               ρ_{i−1}² r_e^{-1}(i) ]
                                                                   (was P_{i+1|i} = |α|² [P_{i|i−1} − P_{i|i−1} u_i u_i^T P_{i|i−1}/r_e(i)] + λ^i q I)

Assume a general nonlinear state-space model

s(i+1) = g(s(i)),    d(i) = h(u(i), s(i)) + ν(i)

Then there is a feature-space representation in which the model is linear in the (transformed) state,

x(s(i+1)) = A x(s(i)),    d(i) = φ(u(i))^T x(s(i)) + ν(i),    with φ(u)^T φ(u′) = κ(u, u′)

Extended Kernel Recursive Least-squares

Initialization:  a_{0|−1} = 0,  ρ_{−1} = λ^{-1} β^{-1},  Q_{−1} = 0

k_{i−1,i} = [κ(u_0, u_i), ..., κ(u_{i−1}, u_i)]^T
f_{i−1,i} = Q_{i−1} k_{i−1,i}
r_e(i) = λ^i + ρ_{i−1} κ(u_i, u_i) − k_{i−1,i}^T f_{i−1,i}
e(i) = d(i) − k_{i−1,i}^T a_{i|i−1}

Update on the weights:

a_{i+1|i} = [ α a_{i|i−1} − α f_{i−1,i} r_e^{-1}(i) e(i) ;  α ρ_{i−1} r_e^{-1}(i) e(i) ]

Update on the P matrix:

ρ_i = |α|² ρ_{i−1} + λ^i q
Q_i = |α|² [ Q_{i−1} + f_{i−1,i} f_{i−1,i}^T r_e^{-1}(i)    −ρ_{i−1} f_{i−1,i} r_e^{-1}(i) ;
             −ρ_{i−1} f_{i−1,i}^T r_e^{-1}(i)               ρ_{i−1}² r_e^{-1}(i) ]

Ex-KRLS

[Figure: growing RBF network for Ex-KRLS; the new center is added and all coefficients are rescaled and corrected:]

c_{m_i} ← u_i
a_{m_i} ← α ρ_{i−1} r_e^{-1}(i) e(i)
a_{m_i−1} ← α a_{m_i−1} − α f_{i−1,i}(i−1) r_e^{-1}(i) e(i)
...
a_1 ← α a_1 − α f_{i−1,i}(1) r_e^{-1}(i) e(i)

f_i(u) = <Ω_i, φ(u)>_F = Σ_{j=1}^{i} a_j κ(u_j, u)

Simulation-3: Lorenz time series prediction

Simulation-3: Lorenz time series prediction (10 steps)

Simulation 4: Rayleigh channel tracking

[Figure: tracking setup; 1,000 symbols s_t pass through a 5-tap Rayleigh multipath fading channel (f_D = 100 Hz, t = 8×10⁻⁵ s), a tanh nonlinearity, and additive noise (σ = 0.005) to produce r_t.]

Rayleigh channel tracking

Algorithms              MSE (dB) (noise variance 0.01, f_D = 50 Hz)   MSE (dB) (noise variance 0.001, f_D = 200 Hz)
ε-NLMS                  -13.51                                        -9.39
RLS                     -14.25                                        -9.55
Extended RLS            -14.26                                        -10.01
Kernel RLS              -20.36                                        -12.74
Kernel extended RLS     -20.69                                        -13.85

κ(u_i, u_j) = exp(−0.1 ||u_i − u_j||²)

Computation complexity

Algorithms               Linear LMS   KLMS   KAPA      ex-KRLS
Computation (training)   O(L)         O(i)   O(i+K²)   O(i²)
Memory (training)        O(L)         O(i)   O(i+K)    O(i²)
Computation (test)       O(L)         O(i)   O(i)      O(i)
Memory (test)            O(L)         O(i)   O(i)      O(i)

At time or iteration i.

Part 5: Active learning in kernel adaptive filtering

Active data selection

Why?
- The kernel trick may seem a "free lunch"! The price we pay is memory and pointwise evaluations of the function.
- Generalization (Occam's razor)

But remember we are working in an on-line scenario, so most of the methods out there need to be modified.

Active data selection

The goal is to build a constant-length (fixed budget) filter in RKHS. There are two complementary methods of achieving this goal:
- Discard unimportant centers (pruning)
- Accept only some of the new centers (sparsification)

Apart from heuristics, in either case a methodology is needed to evaluate the importance of the centers for the overall nonlinear function approximation. Another requirement is that this evaluation should be no more computationally expensive than the filter adaptation.

Previous Approaches – Sparsification

Novelty condition (Platt, 1991)
- Compute the distance to the current dictionary:
  dis = min_{c_j ∈ D(i)} ||u(i+1) − c_j||
- If it is less than a threshold δ1, discard.
- If the prediction error e(i+1) = d(i+1) − φ(i+1)^T Ω(i) is larger than another threshold δ2, include the new center.

Approximate linear dependency (Engel, 2004)
- If the new input is (approximately) a linear combination of the previous centers, discard it:
  dis² = min_b || φ(u(i+1)) − Σ_{c_j ∈ D(i)} b_j φ(c_j) ||²
  which is the Schur complement of the Gram matrix and fits KAPA-2 and KAPA-4 very well. The problem is the computational complexity.

Previous Approaches – Pruning

Sliding window (Vaerenbergh, 2010): impose a fixed dictionary size m_i.

The learning system: y(u; T(i))
Already processed (your dictionary): D(i) = {u(j), d(j)}_{j=1}^{i}
A new data pair {u(i+1), d(i+1)} arrives.
How much new information does it contain? Is this the right question? Or: how much information does it contain with respect to the learning system y(u; T(i))?

Information measure

Hartley and Shannon’s definition of information How much information it contains? I(i +1) = −ln p(u(i +1),d(i +1))

Learning is unlike digital communications: The machine never knows the joint distribution! When the same message is presented to a learning system information (the degree of uncertainty) changes because the system learned with the first presentation! Need to bring back MEANING into information theory! Surprise as an information measure

Learning is very much like an experiment that we do in the laboratory. Fedorov (1972) proposed to measure the importance of an experiment as the Kulback Leibler distance between the prior (the hypothesis we have) and the posterior (the results after measurement). Mackay (1992) formulated this concept under a Bayesian approach and it has become one of the key concepts in active learning. Surprise as an information measure

IS (x) = −log(q(x))

y(u;T (i))

ST (i) (u(i +1)) = CI(i +1) = −ln p(u(i +1) |T (i)) Shannon versus Surprise

Shannon Surprise (absolute (conditional information) information) Objective Subjective

Receptor Receptor independent dependent (on time and agent) Message is Message has meaningless meaning for the agent Evaluation of conditional information (surprise)

Gaussian process theory CI (i +1) = −ln[ p(u(i +1),d(i +1) |T (i))] = (d(i +1) − dˆ(i +1))2 ln 2π + lnσ (i +1) + − ln[ p(u(i +1) |T (i))] 2σ 2 (i +1) where

ˆ + = + T σ 2 + −1 d (i 1) h(i 1) [ n I G(i)] d(i) σ 2 + = σ 2 +κ + + − + T σ 2 + −1 + (i 1) n (u(i 1),u(i 1)) h(i 1) [ n I G(i)] h(i 1) Interpretation of conditional information (surprise) CI(i +1) = −ln[ p(u(i +1),d(i +1) |T (i))] = (d(i +1) − dˆ(i +1))2 ln 2π + lnσ (i +1) + − ln[ p(u(i +1) |T (i))] 2σ 2 (i +1)

Prediction error e(i+1) = d(i+1) − d̂(i+1): a large error means large conditional information.
Prediction variance σ²(i+1): a small error with large variance gives large CI; a large error with small variance also gives large CI (abnormal).
Input distribution p(u(i+1) | T(i)): a rare occurrence gives large CI.

Input distribution p(u(i+1) | T(i))

Memoryless assumption: p(u(i+1) | T(i)) = p(u(i+1))

Memoryless uniform assumption:

p(u(i+1) | T(i)) = const.

Unknown desired signal

Averaging CI over the posterior distribution of the output:

CI(i+1) = ln σ(i+1) − ln[ p(u(i+1) | T(i)) ]

Memoryless uniform assumption

CI (i +1) = lnσ (i +1)

This is equivalent to approximate linear dependency!

Redundant, abnormal and learnable

Abnormal:    CI(i+1) > T₁

Learnable:   T₁ ≥ CI(i+1) ≥ T₂

Redundant:   CI(i+1) < T₂

We still need to find a systematic way to select these thresholds, which are hyperparameters.
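A minimal sketch of evaluating the surprise criterion with a GP model and classifying a candidate sample; the function names, the dropped −ln p(u|T) term (memoryless uniform assumption), and the thresholds T1 > T2 are assumptions for illustration:

```python
import numpy as np

def surprise(G, d, u_hist, u_new, d_new, kernel, sigma_n=0.1):
    """Conditional information CI of a new pair (u_new, d_new) under a GP model.

    G : Gram matrix of the processed inputs, d : their desired values,
    u_hist : list of processed inputs, kernel : kernel function kappa(x, y).
    The -ln p(u|T) term is treated as a constant and omitted.
    """
    h = np.array([kernel(u, u_new) for u in u_hist])           # h(i+1)
    A = G + sigma_n ** 2 * np.eye(len(d))
    d_hat = h @ np.linalg.solve(A, d)                           # predictive mean
    var = sigma_n ** 2 + kernel(u_new, u_new) - h @ np.linalg.solve(A, h)
    return 0.5 * np.log(2 * np.pi * var) + (d_new - d_hat) ** 2 / (2 * var)

def classify(ci, T1, T2):
    """Label a sample as abnormal, learnable or redundant (assumes T1 > T2)."""
    if ci > T1:
        return "abnormal"
    if ci >= T2:
        return "learnable"
    return "redundant"
```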

Active online GP regression (AOGR)

Compute the conditional information:
- If redundant: throw away.
- If abnormal: throw away (outlier examples), or use controlled gradient descent (non-stationary case).
- If learnable: kernel recursive least squares (stationary), extended KRLS (non-stationary).

Simulation-5: nonlinear regression — learning curve

Simulation-5: nonlinear regression — redundancy removal

(Note on the figure: the label T1 should read T2.)

Simulation-5: nonlinear regression

Simulation-5: nonlinear regression— abnormality detection (15 outliers)

AOGR=KRLS Simulation-6: Mackey-Glass time series prediction

AOGR=KRLS Simulation-7: CO2 concentration forecasting

Quantized Kernel Least Mean Square A common drawback of sparsification methods: the redundant input data are purely discarded! Actually the redundant data are very useful and can be, for example, utilized to update the coefficients of the current network, although they are not so important for structure update (adding a new center). Quantization approach: the input space is quantized, if the current quantized input has already been assigned a center, we don’t need to add a new, but update the coefficient of that center with the new information! Intuitively, the coefficient update can enhance the utilization efficiency of that center, and hence may yield better accuracy and a more compact network.

Chen B., Zhao S., Zhu P., Principe J. Quantized Kernel Least Mean Square Algorithm, submitted to IEEE Trans. Neural Networks

Quantized Kernel Least Mean Square  f0 = 0 Quantization in Input Space  ei()= di () − fi−1 (u ()) i   fii= f−1 +ηκ ei()( Q[u. (), i] ) 

Quantization in RKHS Ω(0) = 0  ei()= di () −−Ω ( i 1)T ϕ () i  ΩΩ()i= ( i −+ 1)η ei ()Q [ϕ () i] Using the quantization method to compress the input (or feature) space and hence to compact the RBF Quantization operator structure of the kernel adaptive filter

Quantized Kernel Least Mean Square

The key problem is the vector quantization (VQ): Information Theory? Information Bottleneck? …… Most of the existing VQ algorithms, however, are not suitable for online implementation because the codebook must be supplied in advance (which is usually trained on an offline data set), and the computational burden is rather heavy. A simple online VQ method: 1. Compute the distance between u(i) and C(i-1)

dis(u(i), C(i−1)) = min_{1 ≤ j ≤ size(C(i−1))} || u(i) − C_j(i−1) ||

2. If dis(u(i), C(i−1)) ≤ ε_U, keep the codebook unchanged and quantize u(i) into the closest code-vector,
   j* = arg min_{1 ≤ j ≤ size(C(i−1))} || u(i) − C_j(i−1) ||,    a_{j*}(i) = a_{j*}(i−1) + η e(i)
3. Otherwise, update the codebook, C(i) = {C(i−1), u(i)}, and quantize u(i) as itself.
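A minimal NumPy sketch of QKLMS with this simple online vector quantizer; the function name and defaults are illustrative:

```python
import numpy as np

def gauss_kernel(x, y, h=1.0):
    return np.exp(-h * np.sum((x - y) ** 2))

def qklms(U, d, eta=0.5, eps_u=0.1, h=1.0):
    """Quantized KLMS: reuse the nearest center when it is within eps_u,
    otherwise add a new center (codebook growth).

    U : (N, L) inputs, d : (N,) desired signal.
    """
    centers, a = [], []
    for i in range(len(d)):
        y = sum(aj * gauss_kernel(c, U[i], h) for aj, c in zip(a, centers))
        e = d[i] - y
        if centers:
            dists = [np.linalg.norm(U[i] - c) for c in centers]
            j_star = int(np.argmin(dists))
            if dists[j_star] <= eps_u:
                a[j_star] += eta * e      # update the closest code-vector only
                continue
        centers.append(U[i])              # otherwise grow the codebook
        a.append(eta * e)
    return centers, a
```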

Quantized Kernel Least Mean Square

Quantized energy conservation relation:

||Ω̃(i)||²_F + e_a²(i)/κ(u_q(i), u_q(i)) = ||Ω̃(i−1)||²_F + e_p²(i)/κ(u_q(i), u_q(i)) + β_q

A sufficient condition for mean square convergence: for all i,

(C1)   E[ e_a(i) Ω̃(i−1)^T φ_q(i) ] > 0

(C2)   0 < η ≤ 2 E[ e_a(i) Ω̃(i−1)^T φ_q(i) ] / ( E[e_a²(i)] + σ_v² )

Steady-state mean square performance:

max{ (η σ_v² − 2ξ_γ) / (2 − η), 0 } ≤ lim_{i→∞} E[e_a²(i)] ≤ (η σ_v² + 2ξ_γ) / (2 − η)

Quantized Kernel Least Mean Square

Static Function Estimation

d(i) = 0.2 × [ exp(−(u(i)+1)²/2) + exp(−(u(i)−1)²/2) ] + v(i)

[Figure: steady-state EMSE with its theoretical upper and lower bounds (EMSE ≈ 0.0171) and final network size versus the quantization factor γ.]

Quantized Kernel Least Mean Square

Short Term Lorenz Time Series Prediction

[Figure: testing MSE and network size versus iteration for QKLMS, NC-KLMS and SC-KLMS.]

Quantized Kernel Least Mean Square

Short Term Lorenz Time Series Prediction

[Figure: testing MSE and network size versus iteration for QKLMS, NC-KLMS, SC-KLMS and KLMS.]

Redefinition of On-line Kernel Learning

Notice how the problem constraints affected the form of the learning algorithms. On-line learning: a process by which the free parameters and the topology of a 'learning system' are adapted through a process of stimulation by the environment in which the system is embedded. Error-correction learning + memory-based learning: what an interesting (biologically plausible?) combination.

Impacts on Machine Learning

KAPA algorithms can be very useful in large-scale learning problems: just sample the data randomly from the database and apply on-line learning algorithms. There is an extra optimization error associated with these methods, but they can easily be fit to the machine constraints (memory, FLOPS) or to processing-time constraints (best solution in x seconds).

Information Theoretic Learning (ITL)

This class of algorithms can be extended to ITL cost functions and also beyond regression (classification, clustering, ICA, etc.). See

IEEE SP MAGAZINE, Nov 2006

Or ITL resource www.cnel.ufl.edu