
Applications of Geometry

• Clustering based on divergence: cluster centers
• Stochastic reasoning: belief propagation in graphical models
• Support vector machines
• Bayesian framework and restricted Boltzmann machines
• Natural gradient in multilayer perceptron learning
• Independent component analysis
• Sparse signal analysis and Minkovskian gradient
• Convex optimization

Clustering of Points

D = {θ_1, θ_2, θ_3, ..., θ_N}

Clustering: find the center of a cluster of points.
Center of a cluster C = {θ_1, ..., θ_m}:

θ* = argmin_θ Σ_{θ_i ∈ C} D[θ : θ_i]

m  * is simply the arithmetic average in the dual coordinate system

D[: iii ] () ()

 D[:ii ]  

()0i 1 *  m  i k‐means: clustering algorithm k‐means: clustering algorithm

1. Choose k cluster centers.
2. Classify the patterns of D into clusters (each point goes to the nearest center).
3. Calculate the new cluster centers (dual-coordinate averages).
Repeat steps 2–3 until convergence; a minimal sketch follows.
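A minimal runnable sketch (my own illustration, not from the slides): k-means with a Bregman divergence, where each new center is the arithmetic average taken in the dual coordinates η = ∇ψ(θ). It assumes ψ(x) = Σ x log x (generalized KL divergence on positive vectors); all function names are illustrative.

```python
# Sketch of k-means with a Bregman divergence generated by psi(x) = sum x log x
# (generalized KL on positive vectors), so grad psi and its inverse are explicit.
import numpy as np

def grad_psi(x):            # eta = grad psi(theta): dual coordinates
    return np.log(x) + 1.0

def grad_psi_inv(eta):      # back from dual to primal coordinates
    return np.exp(eta - 1.0)

def divergence(theta, theta_i):   # D[theta : theta_i] for psi = sum x log x
    return np.sum(theta * np.log(theta / theta_i) - theta + theta_i)

def bregman_kmeans(data, k, n_iter=20, rng=np.random.default_rng(0)):
    centers = data[rng.choice(len(data), k, replace=False)]
    for _ in range(n_iter):
        # classify each point to the nearest center w.r.t. D[center : point]
        labels = np.array([np.argmin([divergence(c, x) for c in centers])
                           for x in data])
        # new center: arithmetic average in the dual coordinates
        for j in range(k):
            members = data[labels == j]
            if len(members):
                centers[j] = grad_psi_inv(grad_psi(members).mean(axis=0))
    return centers, labels

data = np.random.default_rng(1).random((60, 3)) + 0.1   # strictly positive toy points
centers, labels = bregman_kmeans(data, k=3)
print(centers)
```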

Voronoi Diagram: Boundaries of Clusters

The boundary between clusters 1 and 2 is the set D[θ : θ_1*] = D[θ : θ_2*].

Total Bregman divergence (Vemuri)

tBD(p : q) = ( ψ(p) − ψ(q) − ∇ψ(q)·(p − q) ) / √(1 + |∇ψ(q)|²)

t-center of C:

 w *   ii  wi

1 w  i 2 1 i  Conformal change of divergence

Conformal change of divergence

D̃[p : q] = λ(q) D[p : q]

gpgij    ij

T̃_ijk = λ ( T_ijk + s_k g_ij + s_j g_ik + s_i g_jk ),   s_i = ∂_i log λ

t-center is robust

 E  1,,; n y

   1   zy;, n influence function z ;y

zyc as : robust 1[:][:DD y] minimize ( i  )  Nww1  iN1  y    zy,  G 1 wy

y zy ,  G 1 Robust: z is bounded 1 y 2

zyy,       yy  wy 2 1 y 1 2 Euclidean case f  x  2 wy 1 MPEG7 database

• Great intraclass variability and small interclass dissimilarity.

Shape representation: first clustering, then retrieval.

Other tBD applications

Diffusion tensor imaging (DTI) analysis [Vemuri]
• Interpolation
• Segmentation

Baba C. Vemuri, Meizhu Liu, Shun-ichi Amari and Frank Nielsen, "Total Bregman Divergence and its Applications to DTI Analysis," IEEE TMI, to appear.

Information Geometry of Stochastic Reasoning

Belief Propagation in Graphical Models

• Shun-ichi Amari (RIKEN BSI)
• Shiro Ikeda (Inst. Statist. Math.)
• Toshiyuki Tanaka (Kyoto U.)

Stochastic Reasoning: Graphical Model

p(x, y, z, r, s),   p(x, y, z | r, s),   x, y, z, ... ∈ {−1, 1}

Stochastic reasoning: compute q(x_1, x_2, x_3, ... | observation).

x = (x_1, x_2, x_3, ...),   x_i ∈ {1, −1}

x̂ = argmax q(x_1, x_2, x_3, ...): maximum-likelihood estimator

x̂_i = sgn E[x_i]: least bit-error-rate estimator

Mean Value

Marginalization: projection to independent distributions

q_0(x) = q_1(x_1) q_2(x_2) ... q_n(x_n)

q_i(x_i) = Σ_{x \ x_i} q(x_1, ..., x_n)   (marginal)

E_q[x] = E_{q_0}[x]

q(x) = k exp{ Σ_i h_i x_i + Σ_{r=1}^L c_r(x) },   c_r(x): clique functions, e.g. c_r(x) = x_i x_j,   x_i ∈ {−1, 1}

q(x) = exp{ Σ w_ij x_i x_j + Σ h_i x_i − ψ }: Boltzmann machines, spin glasses, neural networks, turbo codes, LDPC codes.

Computationally difficult.

qExx  

qcxxexp rq  mean‐field approximation belief propagation tree propagation, CCCP (convex‐concave) Information Geometry of Mean Field Approximation qx() • m‐projection Dq[:] p  qx ()log • e‐projection x p()x

M {()} px m q = argmin D[q:p] 0 ii i 0 e q = argmin D[p:q] 0

px() M0 m‐projection keeps the expectation of x

The m-projection p*(x) = p(x, θ*) ∈ M_0 of q(x) satisfies

Σ_x t(x) { q(x) − p*(x) } = 0,   t(x) = x,

so E_q[x] = E_{p*}[x].
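A tiny numerical check of this property (a sketch under my own assumptions: q is an explicit table over binary variables, and the m-projection onto M_0 is the product of marginals):

```python
# Sketch: the m-projection of q onto the independent distributions M_0 is the
# product of its marginals, and it preserves the expectations E[x_i].
import numpy as np
import itertools

rng = np.random.default_rng(0)
n = 3
q = rng.random(2 ** n)
q /= q.sum()                                   # joint distribution over {-1,1}^n
states = np.array(list(itertools.product([-1, 1], repeat=n)))

# marginals q_i(x_i) and the m-projection p0(x) = prod_i q_i(x_i)
marg = [{v: q[states[:, i] == v].sum() for v in (-1, 1)} for i in range(n)]
p0 = np.array([np.prod([marg[i][s[i]] for i in range(n)]) for s in states])

# both distributions give the same mean values E[x_i]
print(states.T @ q)     # E_q[x]
print(states.T @ p0)    # E_{p0}[x], identical up to rounding
```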

Information Geometry

Submanifolds M_0, M_r, M_{r'}, ...:

qx() exp{ cr () x   Mp000 xx,exp  0 0

Mprrrxxx,exp c r r  r rL 1, , Belief Prop Algorithm

Messages: ξ_r = Σ_{r' ≠ r} ζ_{r'}; the belief for clique r collects the messages from all other cliques.

px(, rrrr ) exp{() c x x } px(,tt ) px (, ) 00rr r tt1t rrrr0 p (xcx , )  : belief for r ( ) tt11 rr  ' rr' t1  t1  0  r    Equilibrium of BP  ,ξr 

1) m‐condition

The m-flat submanifold M* through p_0(x, θ*) and the p_r(x, ξ_r*) lies in the same m-leaf, i.e.
E_{p_0(x, θ*)}[x] = E_{p_r(x, ξ_r*)}[x] for every r.

2) e-condition

The e-flat submanifold E* spanned by p_0(x, θ*) and the p_r(x, ξ_r*) contains q(x):
θ* = (1/(L − 1)) Σ_r ξ_r*.

Belief Propagation: e-condition OK

BP keeps the e-condition at every step: (θ_0; ξ_1, ξ_2, ..., ξ_L) → (θ_0'; ξ_1', ξ_2', ..., ξ_L').

CCCP: m-condition OK — θ_0 is chosen at every step so that E_{p_0(x, θ_0)}[x] = E_{p_r(x, ξ_r)}[x].

θ_0' = Σ_r ζ_r'

Convex-Concave Computational Procedure (CCCP): A. Yuille

F(θ) = F_1(θ) − F_2(θ),   F_1, F_2 convex

Update: ∇F_1(θ^{t+1}) = ∇F_2(θ^t)
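A one-dimensional toy run of this update (my own example): take F_1(θ) = θ⁴/4 and F_2(θ) = θ²/2, both convex, so F = F_1 − F_2 has its positive minimum at θ = 1 and the CCCP step ∇F_1(θ^{t+1}) = ∇F_2(θ^t) becomes θ^{t+1} = (θ^t)^{1/3}:

```python
# CCCP sketch: minimize F(t) = F1(t) - F2(t) with F1(t) = t**4/4, F2(t) = t**2/2.
# The update grad F1(t_new) = grad F2(t_old) reads t_new**3 = t_old, so
# t_new = t_old**(1/3), which converges to the minimizer t = 1.
theta = 0.2
for step in range(10):
    theta = theta ** (1.0 / 3.0)   # solve grad F1(theta_new) = grad F2(theta_old)
    print(step, theta)
# F decreases monotonically at every step, with no step-size tuning and no inner loop
```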

Elimination of double loops.

Geometry of Support Vector Machines: modification of the kernel

Simple perceptron: decision surface f(x) = w·x + b

distance to the surface: d = |w·x + b| / |w|

minimize |w|² under the constraints y_i (w·x_i + b) ≥ 1

L(w, b, α) = ½ |w|² − Σ_i α_i { y_i (w·x_i + b) − 1 }

w = Σ_i α_i y_i x_i,   support vectors: x_i with α_i > 0

f(x, w) = Σ_i α_i y_i x_i·x + b

Embedding in higher-dimensional space

xz() x f ()xwxb  ()

z: infinite-dimensional,   K(x, x') = z(x)·z(x')

Kernel trick:  ∫ K(x, x') φ_i(x') dx' = λ_i φ_i(x),   z(x) = ( √λ_i φ_i(x) )

f(x, w) = Σ_i α_i y_i K(x_i, x)

ds² = |z(x + dx) − z(x)|² = g_ij(x) dx^i dx^j,
g_ij(x) = ∂_i z(x)·∂_j z(x) = ∂²K(x, x') / ∂x^i ∂x'^j |_{x'=x}

Conformal Transformation of the Kernel

K̃(x, x') = ρ(x) ρ(x') K(x, x'),   e.g. ρ(x) = exp{ −λ f(x)² } or ρ(x) = Σ_i exp{ −κ |x − x_i*|² } over the support vectors x_i*,

g̃_ij(x) = ρ(x)² g_ij(x) + ∂_i ρ(x) ∂_j ρ(x) + ...,

which magnifies the metric (and hence the margin) near the decision boundary.
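A short sketch of this conformal modification (assumptions: an RBF base kernel and a Gaussian ρ centered on given support vectors; all names and parameters are illustrative):

```python
# Sketch: conformal modification of a kernel, K~(x, x') = rho(x) rho(x') K(x, x'),
# with rho(x) = sum_i exp(-kappa * |x - sv_i|^2) over given support vectors.
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def rho(X, support_vectors, kappa=2.0):
    d2 = ((X[:, None, :] - support_vectors[None, :, :]) ** 2).sum(-1)
    return np.exp(-kappa * d2).sum(axis=1)

def conformal_kernel(X, Y, support_vectors):
    return rho(X, support_vectors)[:, None] * rho(Y, support_vectors)[None, :] \
           * rbf_kernel(X, Y)

# the induced metric is magnified near the support vectors (near the boundary),
# which is the point of the modification
X = np.array([[0.0, 0.0], [1.0, 1.0]])
sv = np.array([[0.9, 1.1]])
print(conformal_kernel(X, X, sv))
```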

Basic Principles of Unsupervised and Supervised Learning: Toward Deep Learning

Shun-ichi Amari (RIKEN Brain Science Institute)
Collaborators: R. Karakida, M. Okada (U. Tokyo)

Self-Organization + Supervised Learning
Deep Learning
RBM: Restricted Boltzmann Machine
Auto-Encoder, Recurrent Net

Dropout
Contrastive divergence
Boltzmann Machine
RBM: Restricted Boltzmann Machine
Simple Hebbian Self-Organization
Interaction of Hidden Neurons
Bayesian Duality:

data x ↔ parameters (higher-order concepts)

Curved exponential family

RBM: hidden h, visible v, with h = Wv and v = hW (linear / Gaussian case)

Gaussian Boltzmann Machine: Equilibrium Solution

General Solution

• The equilibrium is diagonalized by an eigen-decomposition of the data covariance.
• One can choose any m (≤ k) of the eigenvalues to form a solution.

Stable solution: the case m = k.

Contrastive Divergence

• RBM: visible layer v, hidden layer h, weights W
  – a 2-layered probabilistic neural network
  – no connections within a layer

• How to train an RBM: sampling; maximum-likelihood (ML) learning is hard.

Gibbs sampling: start from the input and iterate toward the equilibrium distribution (figure).

Many iterations of Gibbs sampling demand too much computational time.

Contrastive Divergence Solution
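For concreteness, a minimal CD-1 update for a Bernoulli RBM (a hedged sketch; the data, sizes, learning rate, and names are my own assumptions):

```python
# Minimal CD-1 sketch for a Bernoulli RBM: one Gibbs half-step v -> h -> v' -> h'
# replaces the intractable equilibrium statistics of ML learning.
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, eps = 6, 3, 0.1
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b, c = np.zeros(n_vis), np.zeros(n_hid)              # visible / hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0):
    ph0 = sigmoid(v0 @ W + c)                        # P(h=1 | v0)
    h0 = (rng.random(n_hid) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + b)                      # reconstruction P(v=1 | h0)
    v1 = (rng.random(n_vis) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # gradient estimate: <v h>_data - <v h>_reconstruction
    return np.outer(v0, ph0) - np.outer(v1, ph1), v0 - v1, ph0 - ph1

data = (rng.random((200, n_vis)) < 0.3).astype(float)   # toy binary data
for epoch in range(10):
    for v in data:
        dW, db, dc = cd1_step(v)
        W += eps * dW; b += eps * db; c += eps * dc
```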

Two Manifolds
Geometry of CD_n (contrastive divergence)
Bernoulli-Gaussian RBM

Simulation (R. Karakida), ICA setting: number of neurons N = M = 2, σ = 1/2; sources p(s), mixing, input, CD learning, ICA solution, uniform output distribution (figure).

Independent sources are extracted in the Gaussian-Bernoulli RBM.

Information Geometry of Neuromanifolds: Natural Gradient and Singularities

Multilayer Perceptron

Shun-ichi Amari, RIKEN Brain Science Institute

Mathematical Neurons

y = φ( Σ_i w_i x_i − h ) = φ( w·x − h )

u = w·x − h,   y = φ(u)

Multilayer Perceptrons

y = Σ_i v_i φ( w_i·x ) + n   (n: noise)

x = (x_1, x_2, ..., x_n)

 1 2  pyxx;exp c  y f , 2

fvxwx,  ii  

θ = (w_1, ..., w_m; v_1, ..., v_m)

Multilayer Perceptron: Neuromanifold

space of functions

yf x, θ

f(x, θ) = Σ_i v_i φ( w_i·x )

θ = (v_1, ..., v_m; w_1, ..., w_m)

Universal Function Approximator

N x   viai x   i1 m   x  vi i x  i1  

φ_i(x) = φ( w_i·x )

Learning from examples:   ŷ = f(x, θ̂)

examples D = { (x_1, y_1), ..., (x_n, y_n) }

learning = estimation

Backpropagation: gradient learning

yfx (, ) n examples :yy11 ,xx , tt , training set

E(x, y; θ) = ½ ( y − f(x, θ) )² = −log p(y | x; θ) + const

θ_{t+1} = θ_t − ε ∂E/∂θ,   f(x, θ) = Σ_i v_i φ( w_i·x )

Multilayer Perceptron: Neuromanifold (Metric and Topology)


Metric: Riemannian manifold

g_ij(θ) = E[ ∂_i log p(y | x; θ) ∂_j log p(y | x; θ) ]

ds² = |dθ|² = Σ g_ij dθ^i dθ^j = dθ^T G dθ

Topology: Neuromanifold

• Metrical structure

• Topological structure → singularities

Geometry of Singular Models

yv wx n v ||w  0 v

Parameter Space S

S = { y = Σ_i v_i φ( w_i·x ) + n }

Equivalence

1) v_i = 0 or w_i = 0

2) w_i = w_j (then only v_i + v_j matters)

M = S/~

Topological Singularities

S: parameter space,   M = S/~: behavioral space

Two hidden units:

yv11wx v 2 w2 x n

singular set in S: v_1 v_2 |w_1 − w_2| = 0, i.e. v_1 = 0, v_2 = 0, or w_1 = w_2

On w_1 = w_2 = w the output reduces to (v_1 + v_2) φ( w·x ).

Gaussian Mixtures

 1 2  px() vii exp  x  w  2 Gaussian mixture

p(x; v, w_1, w_2) = (1 − v) ψ(x − w_1) + v ψ(x − w_2)

112  x exp x 2 2  singular: ww12,10 v v v

(Figure: the singular region in the (w_1, w_2) parameter plane.)

Learning, Estimation, and Model Selection

E_gen = E[ D[ p_0(y|x) : p(y|x; θ̂) ] ],   E_train = E[ D[ p_emp(y|x) : p(y|x; θ̂) ] ]

E_gen = d / (2n),   d: dimension of the model
E_gen = E_train + d / n

Problems of Backprop

ttttt lx(, y ;)

• slow convergence: plateau (saddle)

• local minima

Flaws of MLP

slow convergence: plateau in the training-error curve (figure)

local minima

Boosting, bagging, SVM (alternatives)

Steepest Direction --- Natural Gradient

l(θ): loss function,   dθ: small change of θ

∇l = ( ∂l/∂θ_1, ..., ∂l/∂θ_n ),   θ_{t+1} = θ_t − ε ∇l(x_t, y_t; θ_t)

|dθ|² = Σ g_ij dθ^i dθ^j = dθ^T G dθ

Natural gradient: maximize dl = ∇l·dθ under |dθ|² = ε²  ⇒  dθ ∝ G^{-1} ∇l = ∇̃l

θ_{t+1} = θ_t − ε ∇̃l(x_t, y_t; θ_t)

Information Geometry of MLP: Natural Gradient Learning (S. Amari; H.Y. Park)

l G1   Adaptive natural gradient learning

Ĝ^{-1}_{t+1} = (1 + ε) Ĝ^{-1}_t − ε Ĝ^{-1}_t ∇f (∇f)^T Ĝ^{-1}_t
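A compact sketch of adaptive natural-gradient learning for a small one-hidden-layer perceptron with squared loss (teacher function, sizes, and step sizes are my own assumptions):

```python
# Sketch of adaptive natural-gradient learning for f(x) = sum_i v_i tanh(w_i . x)
# with squared loss; Ginv tracks the inverse Fisher matrix online.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = 0.5 * rng.standard_normal((n_hid, n_in))
v = 0.5 * rng.standard_normal(n_hid)
dim = n_hid * n_in + n_hid
Ginv = np.eye(dim)
eta, eps = 0.05, 0.01

def forward_and_grad(x):
    h = np.tanh(W @ x)
    f = v @ h
    grad_v = h
    grad_W = (v * (1.0 - h ** 2))[:, None] * x[None, :]   # df/dW_ij
    return f, np.concatenate([grad_W.ravel(), grad_v])    # df/dtheta

def target(x):                       # teacher function to be learned (illustrative)
    return np.sin(x[0]) + 0.5 * x[1]

for t in range(2000):
    x = rng.standard_normal(n_in)
    y = target(x)
    f, df = forward_and_grad(x)
    # adaptive estimate of the inverse Fisher matrix
    Ginv = (1 + eps) * Ginv - eps * (Ginv @ np.outer(df, df) @ Ginv)
    grad_l = -(y - f) * df           # gradient of the squared loss
    step = eta * (Ginv @ grad_l)
    W -= step[:n_hid * n_in].reshape(n_hid, n_in)
    v -= step[n_hid * n_in:]
```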

1 EGT  1 n

E[ KL[ p(x, θ_0) : p(x, θ̂) ] ] = d / (2n)   →   AIC, MDL

Landscape of the error at a singularity: Milnor attractor

Dynamics of Learning

dθ/dt = −ε ∇l   (ordinary gradient),   dθ/dt = −ε G^{-1} ∇l   (natural gradient)

Reduced dynamics near the singularity:
du/dt = f(u, z),   dz/dt = k(u, z),   du/dz = f(u, z) / k(u, z),
with trajectories of the form u² = ½ log z² + c.

Coordinate transformation (two hidden units w_1, w_2, v_1, v_2):
w = (v_1 w_1 + v_2 w_2) / (v_1 + v_2),   u = w_1 − w_2,
v = v_1 + v_2,   z = (v_1 − v_2) / (v_1 + v_2).

Dynamics of Learning: natural gradient works!
dθ/dt = −ε ∇l   vs.   dθ/dt = −ε G^{-1} ∇l

Dynamic vector fields, general case: in one regime the |z| < 1 part is stable, in the other the |z| > 1 part is stable (figures).

Adaptive natural gradient works well.

Signal Processing

ICA: Independent Component Analysis

xsttA xs tt

sparse component analysis

positive (non-negative) matrix factorization

mixture and unmixing of independent signals

x_i = Σ_{j=1}^n A_ij s_j,   x = As

Independent Component Analysis

s → A → x → W → y   (mixing A, demixing W)

xsAx iijj As x yxWWA 1

observations: x(1), x(2), ..., x(t);   recover: s(1), s(2), ..., s(t)

Semiparametric Statistical Model

p(x; W, r) = |det W| r(Wx)

WA 1 , rs ( ) : unknown rr  i

Given x(1), x(2), ..., x(t):

Natural Gradient

l yW,  WWW T W Space of Matrices : Lie group

WW d ddX  WW -1 W 1 W

I I  dX

ddddWXXWWWW2 tr TTT tr  1 d l l WWT W dX : non-holonomic basis Information Geometry of ICA

S ={p(y)}

I = { q(y) = q_1(y_1) q_2(y_2) ... q_n(y_n) }: independent distributions

{ p(Wx) }: natural gradient,   l(W) = KL[ p(y; W) : q(y) ]

estimating functions: stability, efficiency for unknown r(y)

Estimating Functions

F(y, W):   E_{W,r}[ F(y, W) ] = 0 for all W, r;   E_{W,r}[ F(y, W') ] ≠ 0 for W' ≠ W

 Fy,Wt 0 ytt Wx t learning

ΔW_t = η_t F(y_t, W_t)

Admissible class:

FyW ,{ I  yy T }W  FRWFyW , {}  yyij   yyW ji    RWFWxt  0: estimating equation   on-line learning : WRWFWxtttt F canonical estimating function:  I W Basis Given: overcomplete case Sparse Solution

Basis Given: Overcomplete Case
Sparse Solution

x = A s = Σ_i s_i a_i,   A: given overcomplete basis → many solutions s

sparse: many s_i = 0

x_t = A s_t:   find a sparse ŝ with A ŝ = x

generalized inverse (L2-norm): min Σ_i ŝ_i²
sparse solution (L1-norm): min Σ_i |ŝ_i|

Minkovskian Gradient

min φ(θ),   φ: convex function,   under the constraint F(θ) ≤ c

11 2 typical case:   y XG (*)(*)T  22 1 p F   ; ppp 2,  1, 1/ 2 p  i Optimization under Sparsity Condition:

y X  ( n)

θ: k-sparse

number of observations needed: m ≥ 2k log n

Linear Regression

y_t = Σ_{i=1}^k θ_i x_{it} + n_t,   t = 1, 2, ..., N

y = Xθ + n,   X = (x_1, x_2, ..., x_N)^T

Overcomplete: N < k, under-determined → infinitely many solutions

The sparsity constraint solves the problem.

Maximum Likelihood Estimator

θ̂ = argmin Σ_t φ( y_t − θ·x_t ),   φ(u) = u²/2

Σ_t φ'( y_t − θ·x_t ) x_t = 0

Euclidean case (Gaussian noise):

G = X^T X,   θ̂ = G^{-1} X^T y

θ̂ = (X^T X)^{-1} X^T y = X† y: not sparse

Sparse Solution

min φ(θ) = min D[θ : θ*],   φ(θ) = φ(θ*) + ∇φ(θ*)·(θ − θ*) + ...,   ∇φ(θ*) = 0

L_p penalty: F_p(θ) ≤ c

F_0(θ) = #{ i : θ_i ≠ 0 }: the sparsest solution

F_1(θ) = Σ_i |θ_i|: the L1 solution
F_p, 0 < p ≤ 1: sparse solutions (overcomplete case)
p = 2: F_2(θ) = Σ_i θ_i²: the generalized-inverse solution

Orthogonal projection, dual projection

min_β φ(β) = D[β : β*] under F(β) = c:   dual-geodesic projection of β* onto the constraint surface.

(Fig. 1: L1-constrained optimization. Constraint surfaces F(β) = c in R² for p > 1 (convex), p = 1, and p < 1 (non-convex).)

Problem P_c: min φ(β) under F(β) ≤ c   (LASSO)
solution β_c (β_c = 0 at c = 0)

Problem P_λ: min φ(β) + λ F(β)   (LARS form)
solution β_λ (β_λ = 0 for large λ)

For p ≥ 1 the two solutions β_c and β_λ coincide under the correspondence λ = λ(c).
For p < 1 the correspondence λ = λ(c) can be multiple-valued and non-continuous; the stability of the two formulations differs.

Projection from β* to F(β) = c (information geometry)

(Figure: projection from β* onto the surface F(β) = c; Fig. 5: the subgradient.)

LASSO path and LARS path (stagewise solution)

min φ(β) under F(β) ≤ c

min φ(β) + λ F(β)

β_c, β_λ:   c ↔ λ correspondence

Active set and gradient

A = { i : β_i ≠ 0 }: active set

∂_i F = p |β_i|^{p−1} sgn(β_i),   i ∈ A;   ∂_i F ∈ [−1, 1] (subgradient, p = 1),   i ∉ A

Solution path

On the active set, ∇_A φ(β_c) + λ_c ∇_A F(β_c) = 0;   off the active set, |∂_i φ(β_c)| ≤ λ_c.

Differentiating along the path:
dβ_c/dc = −λ̇_c K^{-1} ∇_A F(β_c),   K = G + λ_c ∇∇F(β_c),   λ̇_c = dλ_c/dc.

For L1: ∇∇F_1 = 0 and ∇F_1 = (sgn β_i), so the path is piecewise linear.

Solution path in the subspace of the active set

Δβ_A ∝ −K_A^{-1} ∇_A F: the active direction

At a turning point the active set A changes.

Gradient Descent Method

min L(θ + a) under Σ g_ij a_i a_j = ε²
∇L = { ∂_i L(θ) }: covariant,   ∇̃L = { g^{ij} ∂_j L(θ) }: contravariant
θ_{t+1} = θ_t − c ∇̃L(θ_t)

Steepest direction of L

Ggaaa  ij  i j Riemannian metric

p Gtaa t G 0 Minkovskian metric

p Gaa   i .1 p Lp-norm

Steepest direction under |a|_p = const:   ∇L + λ ∇_a(|a|_p^p) = 0,   ∂/∂a_i |a|_p^p = p |a_i|^{p−1} sgn a_i

f_i = ∂_i F(θ): the ordinary gradient of the objective

1 p1 aciii sgn ff

Natural (Riemannian) gradient: for G(a) = Σ g_ij a_i a_j the steepest direction is ∇̃F = G^{-1} f (Euclidean case: ∇̃F = f).

Minkovskian gradient:   (∇̃F)_i = c |f_i|^{1/(p−1)} sgn f_i

0     1  Fcsgn f   1 i   0     0  if arg max i

max f ff i ij 

1, for ii  and j ,  F  i 0otherwise.
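A toy implementation of this p → 1 update for the least-squares objective F(θ) = ½|y − Xθ|² (my own sketch: it amounts to forward-stagewise regression, moving only the coordinate with the largest gradient by a small step ε):

```python
# Sketch of the p -> 1 Minkovskian-gradient (forward-stagewise) update for
# F(theta) = 0.5 * |y - X theta|^2: at each step move only the coordinate with
# the largest gradient magnitude by eps * sign.  Data are made up.
import numpy as np

rng = np.random.default_rng(0)
N, k = 30, 10
X = rng.standard_normal((N, k))
theta_true = np.zeros(k)
theta_true[[1, 4]] = [2.0, -1.5]                  # sparse ground truth
y = X @ theta_true + 0.05 * rng.standard_normal(N)

theta = np.zeros(k)
eps = 0.01
for t in range(300):
    f = X.T @ (y - X @ theta)                     # -grad F = X^T residual
    i = np.argmax(np.abs(f))                      # largest component only
    theta[i] += eps * np.sign(f[i])               # p -> 1 Minkovskian step
print(np.round(theta, 2))                         # sparse estimate near theta_true;
# more steps move it toward the (non-sparse) least-squares solution
```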

   F tt1 LASSO Try for various p, p->1 Try for various noise function LASSO and flat geometry Extended LARS (p = 1) and Minkovskian gradient

L_p norm:   |a|_p = ( Σ_i |a_i|^p )^{1/p};   steepest direction: maximize ⟨f, a⟩ under |a|_p = 1.

sgn , max , ,   ii1 N 1 1A   0, otherwise      if arg max i max f ff i ij 

1, for ii  and j ,  F  i 0otherwise.

   F tt1 LARS 2 * c 2  2 1 22    2 1   

   -trajectory and-trajectory

    

 

Example: 1-dimensional L_{1/2} constraint, non-convex optimization.

Non-convex constraint L_{1/2}:   P_c: min φ(β) under |β|^{1/2} ≤ c

P_λ: φ'(β) + λ ∂|β|^{1/2} = 0;   the solution β̂_λ = R_λ(β_0) is given by Xu Zongben's (half-)thresholding operator R_λ.

The constrained solution β̂_c can jump as c varies (non-convexity).

 An Example of the greedy path

(Figure: a greedy path in the (β_1, β_2) plane.)

LASSO and LARS:

 p 1 :c is non-sparse p 1: sparse

2



1

Minkovskian gradient path:   ∇φ(β_c) + λ_c ∇F(β_c) = 0,
dβ_c/dc = −λ̇_c (G + λ_c H)^{-1} ∇F(β_c),   H = ∇∇F(β_c).

Solution path: in the non-convex case β_c is not continuous and not monotone; it can jump.

Linear Programming

constraints: Σ_j A_ij x_j ≥ b_i

max cxii

 x logAijxb j i i inner method Convex Cone Programming

P: the cone of positive semi-definite matrices

convex potential function

dual geodesic approach

Axb , min cx

Support vector machines

Convex Programming: Inner Method

LP:   Ax ≥ b,   min c·x

Barrier function:   φ(x) = −Σ_i log( Σ_j A_ij x_j − b_i )

Simplex method; inner (interior-point) method.
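A small sketch of the inner (barrier) method for a toy LP (the problem, schedule, and names are my own assumptions): minimize c·x under Ax ≥ b by Newton steps on t c·x − Σ_i log(a_i·x − b_i) for increasing t, i.e. by following the central path:

```python
# Sketch of the inner (barrier / interior-point) method for a small LP:
#   minimize c.x  subject to  A x >= b.
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # x1 >= 0, x2 >= 0, x1 + x2 <= 4
b = np.array([0.0, 0.0, -4.0])
c = np.array([-1.0, -2.0])                              # minimize -x1 - 2 x2
x = np.array([1.0, 1.0])                                # strictly feasible start

def barrier(z, t):
    s = A @ z - b
    return np.inf if np.any(s <= 0) else t * (c @ z) - np.log(s).sum()

t = 1.0
for outer in range(25):
    for inner in range(20):                             # Newton steps for fixed t
        s = A @ x - b
        grad = t * c - A.T @ (1.0 / s)
        hess = A.T @ np.diag(1.0 / s**2) @ A
        dx = np.linalg.solve(hess, grad)
        step = 1.0
        while barrier(x - step * dx, t) > barrier(x, t) and step > 1e-12:
            step *= 0.5                                 # stay feasible, do not increase
        x = x - step * dx
    t *= 2.0                                            # tighten the barrier (central path)
print(np.round(x, 3))                                   # close to the optimal vertex (0, 4)
```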

Polynomial-Time Algorithm

The complexity (number of iterations, step size) is governed by the curvature H_M of the trajectory.

min c·x:   follow x_t along the (dual) geodesic: x_{t+1} = x_t + Δx_t.

Integration of evidences:

x_1, x_2, ..., x_m:   arithmetic mean, geometric mean, harmonic mean, α-mean

Generalized mean: f-mean.   f(u): monotone function, the f-representation of u.

fa() fb () mab(,) f1 { } f 2

scale-free: m_f(ca, cb) = c m_f(a, b)

α-representation: f_α(u) = u^{(1−α)/2}   (α ≠ 1),   f_1(u) = log u

α-mean: m_α( p_1(s), p_2(s) )

Various means:
α = −1: (a + b)/2   (arithmetic)
α = 1:  √(ab)   (geometric)
α = 3:  2ab/(a + b)   (harmonic)
α = 0:  ( (√a + √b)/2 )² = (a + b + 2√(ab))/4
α → ∞: min(a, b),   α → −∞: max(a, b)
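A few lines to evaluate these α-means numerically (a sketch using the convention f_α(u) = u^{(1−α)/2}, f_1 = log from above):

```python
# alpha-means as f-means with f_alpha(u) = u**((1-alpha)/2) (alpha != 1), log u (alpha = 1)
import numpy as np

def alpha_mean(a, b, alpha):
    if alpha == 1:
        return np.sqrt(a * b)                      # geometric mean (f = log)
    r = (1.0 - alpha) / 2.0
    return ((a**r + b**r) / 2.0) ** (1.0 / r)

for alpha in (-1, 0, 1, 3):
    print(alpha, alpha_mean(2.0, 8.0, alpha))
# -1 -> 5.0 (arithmetic), 0 -> 4.5, 1 -> 4.0 (geometric), 3 -> 3.2 (harmonic)
```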

Any other mean?

α-Family of Distributions

Given p_1(s), ..., p_k(s):   p(x; t) = f^{-1}( Σ_i t_i f( p_i(x) ) )

mixture family:   p_mix(s) = Σ_{i=1}^k t_i p_i(s),   Σ t_i = 1

exponential family:   log p_exp(s) = Σ t_i log p_i(s) − ψ

α-geodesic projection

robust estimator