<<

2013 Summer Graduate School of the Institute of Statistical Mathematics, ISM Seminar Room 1, September 26, 14:40–17:50

ISM Summer School 2013

Information Geometry

Shinto Eguchi

The Institute of Statistical Mathematics, Japan.  Email: [email protected], [email protected]  URL: http://www.ism.ac.jp/~eguchi/  Outline

Information geometry

historical remarks

treasure example

Information class

robust statistical methods

robust ICA

robust kernel PCA

2 Historical comment

R. A. Fisher (1922); C. R. Rao (1945); geometry: B. Efron's 1975 paper, with comment by P. Dawid; S. Amari's 1982 paper and 1984 work on learning; RSS 150 Workshop; quantum.

3 Dual Riemannian Geometry

Dual Riemannian geometry gives a reformulation of Fisher's foundation of statistics.

Cf. Einstein field equations (1916). Information geometry aims to geometrize the dualistic structure between modeling and estimation.

Cf. Erlangen program (Klein, 1872) http://en.wikipedia.org/wiki/Erlangen_program

Estimation is the projection of data onto the model: "estimation is an action by projection and the model is an object to be projected." The interplay of action and object is elucidated by a geometric structure.

4 2 x 2 table

The space of all 2×2 probability tables is identified with a regular tetrahedron. The space of all independent 2×2 tables corresponds to a ruled surface inside the regular tetrahedron.

We know the ruled surface is exponentially geodesic (e-geodesic), but does anyone know whether the ruled surface is minimal?   5 regular tetrahedron

[Figures, slides 5–6: 2×2 probability tables placed in the regular tetrahedron. The labelled tables P, Q, R, S include the uniform table (1/4 in every cell) and tables supported on a single row or column with entries p and 1 − p; one table is decomposed as a mixture of other tables with weights 1/3 and 2/3.]

7 Ruled surface

[Figure, slide 7: the ruled surface of independent tables inside the tetrahedron, viewed from two angles, with the vertices P, Q, R, S marked.]

8 A 2×2 table with rows C, D and columns A, B has cells x, y, z, w, row totals e = x + y, f = z + w, column totals g = x + z, h = y + w, and grand total n = x + y + z + w. An independent table has cell probabilities pq, p(1−q), (1−p)q, (1−p)(1−q). The chi-squared statistic

χ² = n (xw − yz)² / (e f g h)

vanishes on the independence surface, since n { pq(1−p)(1−q) − p(1−q)(1−p)q }² / { p(1−p)q(1−q) } = 0. The surface is e-geodesical in the coordinates (θ₁₁, θ₁₂, θ₂₁) = ( log(p₁₁/p₂₂), log(p₁₂/p₂₂), log(p₂₁/p₂₂) ), where p₁₁ + p₁₂ + p₂₁ + p₂₂ = 1.

[Figure, slide 9: the independence surface plotted in IR³ in the coordinates { ( log(pq/((1−p)(1−q))), log(p/(1−p)), log(q/(1−q)) ) : 0 < p < 1, 0 < q < 1 } = { (x + y, x, y) : x ∈ IR, y ∈ IR }, which is a plane.]

10 Two parametrization

An independent table indexed by (p, q) can be plotted either in the mixture coordinates ( pq, p(1−q), (1−p)q ) or in the exponential coordinates ( log(pq/((1−p)(1−q))), log(p/(1−p)), log(q/(1−q)) ).

[Figure, slide 11: the independence surface in the two coordinate systems; it is curved in the mixture coordinates and a plane in the exponential coordinates.]
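As a concrete check (a minimal sketch, not from the slides, assuming numpy; the helper names are mine), the following computes both coordinate vectors for a given (p, q) and verifies that the exponential coordinates satisfy the plane relation θ₁₁ = θ₁₂ + θ₂₁, i.e. the surface is the plane {(x + y, x, y)}.

import numpy as np

def mixture_coords(p, q):
    # (p11, p12, p21); p22 = (1 - p)(1 - q) is determined by the others
    return np.array([p * q, p * (1 - q), (1 - p) * q])

def exponential_coords(p, q):
    # (log p11/p22, log p12/p22, log p21/p22)
    return np.array([
        np.log(p * q / ((1 - p) * (1 - q))),
        np.log(p / (1 - p)),
        np.log(q / (1 - q)),
    ])

p, q = 0.3, 0.7
theta = exponential_coords(p, q)
# In exponential coordinates the independence surface is the plane {(x + y, x, y)}:
assert np.isclose(theta[0], theta[1] + theta[2])
print(mixture_coords(p, q), theta)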

11 Kullback-Leibler Divergence

Let p(x) and q(x) be probability density functions. Then the Kullback-Leibler divergence is defined by

D_KL(p, q) = ∫ p(x) log{ p(x) / q(x) } dx.

[Figure: densities p, q, r in the space of all densities.]

12 Two one-parameter families

Let P be the space of all pdfs on a data space.

m-geodesic:  { (1 − t) p(x) + t q(x) : t ∈ [0, 1] }

e-geodesic:  { c_t p(x)^{1−t} q(x)^t : t ∈ [0, 1] },  where c_t = 1 / ∫ p(x)^{1−t} q(x)^t dx.

[Figure: the two geodesics joining p and q, together with a third density r.]

13 Pythagoras theorem Amari-Nagaoka (1982)

[Figure: densities p, q, r forming a right-angled triangle of an m-geodesic and an e-geodesic.]

14 Proof
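A sketch of the standard statement and its one-line proof, reconstructed from the definitions on the previous slides and assumed here to be the content of slides 13–14 (orthogonality is taken with respect to the Fisher inner product at q):

\[ D_{KL}(p,r) = D_{KL}(p,q) + D_{KL}(q,r) \quad \text{if the m-geodesic joining } p, q \text{ meets the e-geodesic joining } q, r \text{ orthogonally at } q. \]

\[ \text{Proof sketch: } D_{KL}(p,q) + D_{KL}(q,r) - D_{KL}(p,r) = \int \{p(x) - q(x)\}\,\{\log r(x) - \log q(x)\}\,dx, \]

which is the pairing of the m-tangent vector p − q with the e-tangent vector log r − log q at q, and it vanishes under the orthogonality assumption.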

15 ABC in

Riemannian metric defines an inner product on any tangent space of a manifold.

(M, g),   g : X(M) × X(M) → IR

Geodesic = a minimal arc between x and y:

argmin over c = { x(t) : x(0) = x, x(1) = y } of ∫₀¹ g( ẋ(t), ẋ(t) ) dt

A linear connection defines parallelism along a vector field.

∇ : X(M) × X(M) → X(M)

(1) ∇_{fX} Y = f ∇_X Y,

(2) ∇_X (fY) = f ∇_X Y + (Xf) Y   (f ∈ F(M), X, Y ∈ X(M))   Cf. slide 41

Componentwise, ∇ is determined by the coefficients Γᵏ_{ij}:  ∇_{∂_i} ∂_j = Σ_k Γᵏ_{ij} ∂_k.

16 Geodesic

A one-parameter family (curve) C = { θ(t) : −∞ < t < ∞ } is called a geodesic with respect to a linear connection { Γⁱ_{jk}(θ) : 1 ≤ i, j, k ≤ p } if

θ̈ⁱ(t) + Σ_{j=1}^{p} Σ_{k=1}^{p} Γⁱ_{jk}(θ(t)) θ̇ʲ(t) θ̇ᵏ(t) = 0   (i = 1, …, p)

(acceleration term plus a quadratic form in the speed).

Remark 2: If Γⁱ_{jk}(θ) = 0 (1 ≤ i, j, k ≤ p), any geodesic is a straight line (Newton's law of inertia). This property is not invariant under reparametrization. Under a change from the parameter θ to ω the geodesic C satisfies

ω̈ᵃ(t) + Σ_{b=1}^{p} Σ_{c=1}^{p} Γ̃ᵃ_{cb}(ω(t)) ω̇ᵇ(t) ω̇ᶜ(t) = 0   (a = 1, …, p),

where

Γ̃ᵃ_{cb}(ω) = Σ_{i,j,k} Bᵃ_i Bʲ_b Bᵏ_c Γⁱ_{jk}(θ) + Σ_i Bᵃ_i ∂Bⁱ_b/∂ωᶜ,   Bᵃ_i = ∂ωᵃ/∂θⁱ,  Bⁱ_a = ∂θⁱ/∂ωᵃ,  for ω = ω(θ).

17 Change rule on connections

Let us consider a transform from the parameter θ to ω. Then the geodesic C is written as { ω(t) = ω(θ(t)) : −∞ < t < ∞ }. Hence we have

ω̇ᵃ(t) = Σ_{i=1}^{p} Bᵃ_i(θ(t)) θ̇ⁱ(t)   (a = 1, …, p),   Bᵃ_i = ∂ωᵃ/∂θⁱ,

ω̈ᵃ(t) = Σ_{i,j} ∂Bᵃ_i(θ(t))/∂θʲ θ̇ⁱ(t) θ̇ʲ(t) + Σ_i Bᵃ_i(θ(t)) θ̈ⁱ(t)   (a = 1, …, p),

which yields

ω̈ᵃ(t) + Σ_{b,c} Γ̃ᵃ_{cb}(ω(t)) ω̇ᵇ(t) ω̇ᶜ(t) = 0   (a = 1, …, p),

where

Γ̃ᵃ_{cb}(ω) = Σ_{i,j,k} Bᵃ_i Bʲ_b Bᵏ_c Γⁱ_{jk}(θ) + Σ_i Bᵃ_i ∂Bⁱ_b/∂ωᶜ,   Bᵃ_i = ∂ωᵃ/∂θⁱ,  Bⁱ_a = ∂θⁱ/∂ωᵃ.

18 What are reference books?

Foundations of Differential Geometry (Wiley Classics Library) Shoshichi Kobayashi, Katsumi Nomizu

Differential-Geometrical Methods in Statistics, Shun-Ichi Amari, Springer, 1985

Methods of Information Geometry, Shun-Ichi Amari, Hiroshi Nagaoka, American Mathematical Society (2001)   19 Statistical model and information

Statistical model  M = { p_θ(x) = p(x, θ) : θ ∈ Θ }  (Θ ⊂ IRᵖ)

Score vector  s(x, θ) = (∂/∂θ) log p(x, θ)

Space of score vectors  T_θ = { αᵀ s(·, θ) : α ∈ IRᵖ }

Fisher information metric  g_θ(u, v) = E_θ{ u v }  (u, v ∈ T_θ)

Fisher information matrix  I_θ = E_θ{ s(x, θ) s(x, θ)ᵀ }
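As an illustration (a minimal numerical sketch, not part of the slides; assumes numpy), the empirical average of s(x, θ) s(x, θ)ᵀ over a large sample drawn from p(·, θ) approximates I_θ; for the normal model N(μ, σ²) the exact matrix is diag(1/σ², 1/(2σ⁴)).

import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 1.0, 2.0
x = rng.normal(mu, np.sqrt(sigma2), size=200_000)

# Score vector s(x, theta) for theta = (mu, sigma^2) of the normal model
s = np.stack([(x - mu) / sigma2,
              -0.5 / sigma2 + (x - mu) ** 2 / (2 * sigma2 ** 2)], axis=1)

I_hat = s.T @ s / len(x)          # Monte Carlo estimate of E[s s^T]
I_exact = np.diag([1 / sigma2, 1 / (2 * sigma2 ** 2)])
print(np.round(I_hat, 3), I_exact)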

20 Score vector space

Note 1: For u = αᵀ s(·, θ) ∈ T_θ,

E_θ(u) = ∫ u(x, θ) p(x, θ) dx = ∫ αᵀ s(x, θ) p(x, θ) dx = αᵀ ∫ s(x, θ) p(x, θ) dx = 0.

For u = αᵀ s, v = βᵀ s ∈ T_θ,

g_θ(u, v) = ∫ u(x, θ) v(x, θ) p(x, θ) dx = αᵀ { ∫ s(x, θ) s(x, θ)ᵀ p(x, θ) dx } β = αᵀ I_θ β.

The space of all random variables with mean 0 and finite variance is

S_θ = { t(x, θ) : E_θ(t(x, θ)) = 0, V_θ(t(x, θ)) < ∞ }.

Note 2: S_θ is an infinite-dimensional vector space that includes T_θ.

For u(x, θ) = αᵀ s(x, θ), the variables

u(x, θ)ᵏ − E_θ{ u(x, θ)ᵏ }  and  ∂ᵏ u(x, θ)/∂θ_{i₁}⋯∂θ_{i_k} − E_θ{ ∂ᵏ u(x, θ)/∂θ_{i₁}⋯∂θ_{i_k} },  k = 1, 2, …,

belong to S_θ.  Cf. Tsiatis (2006)   21 Linear connections

Score vector  s(x, θ) = ( s_i(x, θ) )_{i=1}^{p}

m-connection:  Γ^{(m) i}_{jk}(θ) = Σ_{i′=1}^{p} g^{i i′} [ E_θ{ (∂_j s_k) s_{i′} } + E_θ{ s_k s_j s_{i′} } ]   (1 ≤ i, j, k ≤ p)

e-connection:  Γ^{(e) i}_{jk}(θ) = Σ_{i′=1}^{p} g^{i i′} E_θ{ (∂_j s_k) s_{i′} }   (1 ≤ i, j, k ≤ p)

m-geodesic

e-geodesic

22 Semiparametric theory

1 Introduction to Semiparametric Models 2 Hilbert Space for Random Vectors 3 The Geometry of Influence Functions 4 Semiparametric Models 5 Other Examples of Semiparametric Models 6 Models and Methods for Missing Data 7 Missing and Coarsening at Random for Semiparametric Models 8 The Nuisance Tangent Space and Its Orthogonal Complement 9 …14.

Semiparametric Theory and Missing Data (Springer Series in Statistics) Anastasios Tsiatis 23 Geodesical models

Definition. (i) A statistical model M is said to be e-geodesical if

p(x), q(x) ∈ M  ⟹  c_t p(x)^{1−t} q(x)^t ∈ M  (t ∈ (0, 1)),  where c_t = 1 / ∫ p(x)^{1−t} q(x)^t dx.

(ii) A statistical model M is said to be m-geodesical if

p(x), q(x) ∈ M  ⟹  (1 − t) p(x) + t q(x) ∈ M  (t ∈ (0, 1)).

Note: Let P be the space of all probability density functions. By definition P is e-geodesical and m-geodesical. However, the theoretical framework for P is not perfectly complete.

Cf. Pistone and Sempi (1995, Ann. Statist.)   24 Two types of modeling

Let p0(x) be a pdf and t (x) a p-dimensional statistic.

Exponential model  M^(e) = { p(x, θ) = p₀(x) exp{ θᵀ t(x) − ψ(θ) } : θ ∈ Θ },  Θ = { θ ∈ IRᵖ : ψ(θ) < ∞ }

Matched mean model  M^(m) = { p(x) : E_p{ t(x) } = E_{p₀}{ t(x) } }

Let { p_i(x) : i = 0, 1, …, I } be a set of pdfs.

Exponential model  M^(e) = { p(x, θ) = p₀(x) exp{ Σ_{i=1}^{I} θ_i log( p_i(x)/p₀(x) ) − ψ(θ) } : θ ∈ Θ },  Θ = { θ : ψ(θ) < ∞ }

Mixture model  M^(m) = { p(x, θ) = (1 − Σ_{i=1}^{I} θ_i) p₀(x) + Σ_{i=1}^{I} θ_i p_i(x) : θ ∈ Θ },
Θ = { θ = (θ₁, …, θ_I) : Σ_{i=1}^{I} θ_i ≤ 1, θ_i ≥ 0 (i = 1, …, I) }

25 Statistical functional

A functional f(p) defined for pdfs p(x) is called a statistical functional (Hampel, 1974).

A statistical functional f(p) is said to be Fisher-consistent for a model M = { p_θ(x) : θ ∈ Θ } if f(p) satisfies

(1) f(p) ∈ Θ,

(2) f(p_θ) = θ  (θ ∈ Θ).

Example. For the normal model M = { p_θ(x) = (2πσ²)^{−1/2} exp{ −(x − μ)²/(2σ²) }, θ = (μ, σ²) ∈ IR × IR₊ },

f(p) = ( ∫ x p(x) dx,  ∫ x² p(x) dx − (∫ x p(x) dx)² )ᵀ

is Fisher-consistent for θ = (μ, σ²).

26 Transversality

Let f(p) be a Fisher-consistent functional. Define

L_θ(f) = { p : f(p) = θ },

which is called a leaf transverse to M.

(1) L_θ(f) ∩ M = { p_θ }

(2) ∪_{θ∈Θ} L_θ(f) is a (local) neighborhood of M.

[Figure: the model M with leaves L_θ(f) and L_θ*(f) through p_θ and p_θ*.]

27 Foliation structure

Statistical model  M = { p_θ(x) = p(x, θ) : θ ∈ Θ }  (Θ ⊂ IRᵖ)

Foliation:  P = ∪_{θ∈Θ} L_θ(f).  Decomposition of tangent spaces:

T_{p_θ}(P) = T_{p_θ}(M) ⊕ T_{p_θ}(L_θ(f))

For every t ∈ T_{p_θ}(P) there exist u ∈ T_{p_θ}(M) and v ∈ T_{p_θ}(L_θ(f))

such that t(x, θ) = u(x, θ) + v(x, θ).

28 Transversality for MLE

Let p₀(x) be a pdf and t(x) a p-dimensional statistic.
Exponential model  M^(e) = { p(x, θ) = p₀(x) exp{ θᵀ t(x) − ψ(θ) } : θ ∈ Θ },  Θ = { θ ∈ IRᵖ : ψ(θ) < ∞ }

For the exponential model M^(e) the MLE functional

f_ML(p) := argmax_θ E_p{ log p(x, θ) }

is written by

f_ML(p) = arg_θ { E_p{ t(x) } = E_{p(·,θ)}{ t(x) } }.

Hence the leaf of the foliation associated with f_ML is a matched mean model:

L_θ(f_ML) = { p : E_p{ t(x) } = E_{p(·,θ)}{ t(x) } }   29 Maximum likelihood foliation

[Figure, slide 30: the exponential model M^(e) and the leaf L_θ(f_ML) through p_θ, with points p₁, p₂, p₃ on the leaf and r₁, r₂ off it.]

30 Estimating function

Statistical model  M = { p_θ(x) = p(x, θ) : θ ∈ Θ }  (Θ ⊂ IRᵖ)

A p-variate function u(x, θ) is an unbiased estimating function if, by definition,

E_{p(·,θ)}{ u(x, θ) } = 0  and  det( E_{p(·,θ)}{ ∂u(x, θ)/∂θ } ) ≠ 0  (θ ∈ Θ).

The statistical functional f(p) = argsolve_θ { E_p u(x, θ) = 0 } is Fisher-consistent, and the leaf transverse to M,

L_θ(f) = { p : E_p u(x, θ) = 0 },

is m-geodesical.   31 Minimum divergence geometry

D : M×M → R is an information divergence on a statistical model M

(i) D(p, q) ≥ 0 with equality if and only if p = q;  (ii) D is differentiable on M×M.

Let

Then we get a Riemannian metric and dual connections on M (Eguchi, 1983,1992)
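A sketch of the construction as given in Eguchi (1983, 1992), assumed here to be the intended definitions: writing D(θ, θ′) = D(p_θ, p_{θ′}), set

\[ g^{(D)}_{ij}(\theta) = -\left.\partial_{\theta_i}\partial_{\theta'_j} D(\theta,\theta')\right|_{\theta'=\theta}, \quad \Gamma^{(D)}_{ij,k}(\theta) = -\left.\partial_{\theta_i}\partial_{\theta_j}\partial_{\theta'_k} D(\theta,\theta')\right|_{\theta'=\theta}, \quad \Gamma^{(D^*)}_{ij,k}(\theta) = -\left.\partial_{\theta'_i}\partial_{\theta'_j}\partial_{\theta_k} D(\theta,\theta')\right|_{\theta'=\theta}. \]

The two connections are mutually dual with respect to g^(D); taking D = D_KL recovers the Fisher metric together with the m- and e-connections.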

32 Remarks: the induced g^(D) is a Riemannian metric

Cf. Slide 21 33 U divergence

U is a convex function,

U cross-entropy

U entropy

U divergence
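As a working reference, a sketch of the definitions used in Eguchi's U-divergence framework (assuming a convex U with u = U′ > 0 and ξ = u⁻¹; the slide's own normalization may differ):

\[ C_U(p,q) = \int \{\, U(\xi(q(x))) - p(x)\,\xi(q(x)) \,\}\,dx \quad\text{(U cross-entropy)}, \]

\[ H_U(p) = C_U(p,p) \quad\text{(U entropy)}, \qquad D_U(p,q) = C_U(p,q) - H_U(p) \ \ge\ 0 \quad\text{(U divergence)}. \]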

34 Example of U divergence

KL divergence

Beta (power) divergence

Note
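For concreteness, a sketch of the two standard special cases under the same conventions (Basu et al. 1998; constants may be arranged differently on the slide):

\[ U(t)=\exp(t):\qquad D_U(p,q) = \int p\log\frac{p}{q}\,dx = D_{KL}(p,q), \]

\[ U_\beta(t)=\tfrac{1}{\beta+1}(1+\beta t)^{(\beta+1)/\beta}:\qquad D_\beta(p,q) = \frac{1}{\beta}\int p\,(p^\beta - q^\beta)\,dx - \frac{1}{\beta+1}\int (p^{\beta+1} - q^{\beta+1})\,dx, \]

with D_β → D_KL as β → 0.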

35 Geometric formula with DU

36 Triangle with DU

mixture geodesic / U-geodesic

[Figure: the triangle formed by p, q, r, with the mixture geodesic joining p and q and the U-geodesic joining q and r.]

37 Information divergence class and robust statistical methods

38 Light and shadow of MLE

Light: 1. Invariance under data transformations; 2. Asymptotic efficiency.
Shadow: 1. Non-robustness; 2. Over-fitting.

Log-likelihood: sufficiency and efficiency.

Likelihood method: log / exp.  U-method: ξ / u.

39 Max U-entropy distribution

Let us fix a statistic t(x).

Equal mean space

U-model

Euler-Lagrange’s 1st variation

40 U-estimate

Let p(x) be a data density function with statistical model

U-loss function

U-empirical loss function

U-estimate

U-estimating function
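A sketch of these quantities as they are usually written in this framework (ξ = u⁻¹, u = U′; an assumption about the intended displays, consistent with the weighted-score influence function on slide 44):

\[ L_U(\theta) = C_U(p, p_\theta) = -\int p(x)\,\xi(p(x,\theta))\,dx + \int U(\xi(p(x,\theta)))\,dx \quad\text{(U-loss)}, \]

\[ \hat L_U(\theta) = -\frac{1}{n}\sum_{i=1}^n \xi(p(x_i,\theta)) + \int U(\xi(p(x,\theta)))\,dx \quad\text{(U-empirical loss)}, \qquad \hat\theta_U = \operatorname*{argmin}_\theta \hat L_U(\theta), \]

\[ \frac{\partial}{\partial\theta}\hat L_U(\theta) = -\frac{1}{n}\sum_{i=1}^n w(x_i,\theta)\,s(x_i,\theta) + \int w(x,\theta)\,s(x,\theta)\,p(x,\theta)\,dx = 0, \qquad w(x,\theta) = p(x,\theta)\,\xi'(p(x,\theta)), \]

so that the β-power case gives the weight w(x, θ) = p(x, θ)^β appearing in the normal-model estimating equations below.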

41 42 -minimax

Let us fix a statistic t(x). Equal mean space. U-estimator for the U-model.

U-model

U-estimator of mean parameter

We observe that

which implies

Furthermore,

43 Influence function

Statistical functional:  T_U(G) = argmin_θ { −∫ ξ(p(x, θ)) dG(x) + ∫ U(ξ(p(x, θ))) dx }

Influence function:  IF(T, x) = (∂/∂ε) T( (1 − ε) F + ε δ_x ) |_{ε=0}

IF(T_U, x) = J_U(θ)⁻¹ [ w(x, θ) S(x, θ) − E_θ{ w(x, θ) S(x, θ) } ]

GES(T_U) = sup_x ‖ IF(T_U, x) ‖

44 Efficiency

Asymptotic variance:  √n ( θ̂_U − θ ) →_D N( 0, J_U(θ)⁻¹ H_U(θ) J_U(θ)⁻¹ ),
where J_U(θ) = E( w(X, θ) S(X, θ) S(X, θ)ᵀ ),  H_U(θ) = Var( w(X, θ) S(X, θ) ).

Information inequality:  I(θ)⁻¹ ≤ J_U(θ)⁻¹ H_U(θ) J_U(θ)⁻¹

( equality holds iff U(x) = exp(x) )

45 Normal mean

[Figure, slide 46: influence functions for the normal mean; left panel, β-power estimates with β = 0, 0.01, 0.05, 0.075, 0.1, 0.125; right panel, η-sigmoid estimates with η = 0, 0.015, 0.15, 0.4, 0.8, 2.5; horizontal axis from −6 to 6.]

β-power estimates   η-sigmoid estimates

U_η(t) = exp(t) + η t   46 Gross Error Sensitivity

β-power estimates η-sigmoid estimates

β      efficiency  GES   |  η      efficiency  GES
0      1           ∞     |  0      1           ∞
0.01   0.97        6.16  |  0.015  0.972       1.9
0.05   0.861       2.92  |  0.15   0.873       1.04
0.075  0.799       2.47  |  0.4    0.802       0.678
0.1    0.742       2.21  |  0.8    0.753       0.455
0.125  0.689       2.04  |  2.5    0.694       0.197

47 Multivariate normal

The pdf is  f(y, μ, Σ) = ( (2π)ᵖ det Σ )^{−1/2} exp{ −(1/2) (y − μ)ᵀ Σ⁻¹ (y − μ) }.

Estimating equation:

 1   w(xi ,θ)(xi  μ)  0 n  1  w(x,θ){(x  μ)(x  μ)T }  c ()  n  i i U θ  (μ, ) 48 U-algorithm θk  μk ,k   θk1  μk1,k1 

繰り返し重み付け平均と分散   T  w(xi , k )(xi  k )(xi  k )  w(xi , k ) x j   k 1  k1   ,  w(xi , k )  w(xi ,k ) cU det k Under a mild condition

LU (k1)  LU (k ) ( k 1,...)
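A minimal runnable sketch of this kind of iteratively reweighted update, using β-power weights w(x, θ) = f(x, θ)^β and a simple (1 + β) normalization of the covariance that is Fisher-consistent under the normal model; the constant c_U and the exact normalization on the slide are not reproduced here, so this is an illustration rather than the slide's algorithm (assumes numpy and scipy).

import numpy as np
from scipy.stats import multivariate_normal

def beta_weighted_normal(X, beta=0.1, n_iter=50):
    """Iteratively reweighted mean/covariance with beta-power weights
    w(x) = f(x; mu, Sigma)**beta (simplified variant of the U-algorithm)."""
    n, p = X.shape
    mu, Sigma = X.mean(axis=0), np.cov(X, rowvar=False)
    for _ in range(n_iter):
        w = multivariate_normal.pdf(X, mean=mu, cov=Sigma) ** beta
        mu = w @ X / w.sum()
        R = X - mu
        Sigma = (1 + beta) * (R * w[:, None]).T @ R / w.sum()
    return mu, Sigma

# Toy check: 5% of gross outliers barely move the estimate,
# unlike the plain sample mean and covariance.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(8, 0.5, (5, 2))])
print(beta_weighted_normal(X))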

49 Simulation

0  1 0.5  5   4 0 ε-conmination (1)     Gε  (1-ε)N   ,    N   ,   0 0.5 2  -5  0 1

0 1 0 0 9 9 (2)     Gε  (1-ε)N   ,     N   ,   0 0 1 0 9 9

ε  0  ε  0.05 ˆ (1) KL error DKL (θ,θ0 ) 3.03  39.24 under G (2) with MLE θˆ 2.70  16.65 under G 

50 β-power estimates v.s. η-sigmoid estimates

           β      KL error   |   η        KL error
G_ε(1)     0      39.24      |   0        39.24
           0.01   35.30      |   0.0001   23.04
           0.05   20.93      |   0.00025   6.70
           0.10    8.91      |   0.0005    4.46
           0.20   12.40      |   0.00075   4.64
           0.30   31.64      |   0.001     6.04

G_ε(2)     0      16.5       |   0        16.5
           0.01   13.91      |   0.0005    3.36
           0.05    6.65      |   0.00075   3.19
           0.10    5.14      |   0.001     3.9
           0.20   12.20      |   0.002     3.04
           0.30   29.68      |   0.003     3.07

51 Plot of MLEs (no outliers)

100 replications with sample size 100 under the normal distribution; parameters (σ₁₁, σ₁₂, σ₂₂) of Σ.

[Figure, slide 52: scatter of the estimates of (σ₁₁, σ₁₂, σ₂₂); η-estimate (η = 0.0025), β-estimate (β = 0.1), and MLE; true value (1, 0, 1).]

52 Plot of MLEs with outliers

100 replications with sample size 100 under G_ε(2).

[Figure, slide 53: the same scatter in the presence of outliers; η-estimate (η = 0.0025), β-estimate (β = 0.1), and MLE; true value (1, 0, 1).]

53 Selection for tuning parameter

Squared loss function:  Loss(θ̂) = (1/2) ∫ { f(y, θ̂) − g(y) }² dy

CV(β) = (1/n) Σ_{i=1}^{n} [ −f(x_i, θ̂_β^{(−i)}) + (1/2) ∫ f(y, θ̂_β)² dy ]

β̂ = argmin_β CV(β)

Approximation:  θ̂_β^{(−i)} ≈ θ̂_β − (1/(n−1)) IF(x_i, θ̂_β)

54 Normal with mean (0, 0)ᵀ and variance [[.26, −.1], [−.1, .26]]

MLE (signal only):  mean (0.054, −0.081),  covariance [[.228, −.126], [−.126, .261]]

MLE (contaminated):  mean (0.204, −0.184),  covariance [[1.059, −.263], [−.263, .383]]

β-estimate (contaminated):  mean (0.086, −0.132),  covariance [[.293, −.134], [−.134, .286]]

[Figure: CV(β) against β over 0.025–0.175, minimized near β̂ = 0.07.]

55 What is ICA?

Cocktail party effect

Blind source separation:  sources s₁, s₂, …, s_m

observations x₁, x₂, …, x_m

56 ICA model

(Independent signals)  s = (s₁, …, s_m) ~ p(s) = p₁(s₁) ⋯ p_m(s_m),  E(S₁) = 0, …, E(S_m) = 0

(Linear mixture of signals)  W ∈ IR^{m×m}, μ ∈ IR^m such that x = W⁻¹ s + μ

f(x, W, p) = |det(W)| p₁( w₁(x − μ₁) ) ⋯ p_m( w_m(x − μ_m) )

The aim is to learn W from a dataset (x₁, …, x_n)

in which p(s) = p₁(s₁) ⋯ p_m(s_m) is unknown.

57 ICA likelihood

Log-likelihood function

n m (W )    log p j (w j (xi  μ j ))  log | det(W ) | i1 j1 Estimating equation  (x,W , p) F (x ,W , p)    ( I  h(Wx) (Wx)T )W T W m

 log p1 (s1 )  log pm (sm ) h(s)  ( ,, )  s1  sm Natural gradient algorithm, Amari et al. (1996) 58 Beta-ICA    n  power equation 1  f (xi ,W, ) F(xi ,W, )  B (W,) n i1   decomposability:  s  t

E[{p(w X  )} ]   q q  qs,qt

 E[{ps (ws X  s )} hs (ws X  s )]   E[{pt (wt X  t ))} wt X ]  0
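A minimal runnable sketch of a natural-gradient ICA update of the kind cited above (Amari et al. 1996), using a tanh nonlinearity as the assumed source log-density derivative and an optional β-power weight on each observation to mimic the robustified idea; the exact slide equations (the bias term B_β, the handling of μ) are not reproduced, and the Laplace-density weighting is my own assumption.

import numpy as np

def natural_gradient_ica(X, n_iter=2000, lr=0.01, beta=0.0):
    """Natural-gradient ICA: W <- W + lr * (I - tanh(Y) Y^T / n) W,
    with optional beta-power down-weighting of observations."""
    n, m = X.shape
    Xc = X - X.mean(axis=0)
    W = np.eye(m)
    for _ in range(n_iter):
        Y = Xc @ W.T                      # current source estimates, rows = samples
        if beta > 0:                      # beta-power weights (robustified variant)
            logf = -np.sum(np.abs(Y), axis=1) + np.log(abs(np.linalg.det(W)))
            w = np.exp(beta * (logf - logf.max()))
            w /= w.sum()
        else:
            w = np.full(n, 1.0 / n)
        G = np.eye(m) - (np.tanh(Y) * w[:, None]).T @ Y
        W += lr * G @ W
    return W

# Toy demo: two Laplace (super-Gaussian) sources mixed linearly.
rng = np.random.default_rng(2)
S = rng.laplace(size=(2000, 2))
A = np.array([[1.0, 0.5], [0.5, 1.0]])
X = S @ A.T
W = natural_gradient_ica(X)
print(np.round(W @ A, 2))   # approximately a scaled permutation matrix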

59 Likelihood ICA

[Figure, slides 59–60: 150 signals drawn from U(0,1) × U(0,1) and mixed by a 2×2 mixture matrix; scatter plots of the mixed data and of the sources recovered by likelihood ICA.]

60 Non-robustness

[Figure, slide 61: the same data with 50 additional Gaussian noise points from N(0,1) × N(0,1); the likelihood-ICA separation is badly degraded.]

[Figure, slides 62–63: minimum β-power ICA (β = 0.2) applied to the contaminated data.]

We considered the entropy and divergence obtained by imposing projective invariance on the Tsallis power entropy. Within the family of distributions with fixed mean and variance, the maximum projective power entropy model was derived in a form that includes the normal, t-, and Wigner distributions. A natural estimation method for the mean and variance on this model was proposed, and the estimators were shown to be the sample mean and the sample variance.

63 Projective γ-power divergence

γ-cross entropy

γ-diagonal entropy

γ-divergence

Cf. Tsallis (1988), Basu et al (1998), Fujisawa-Eguchi (2008), Eguchi-Kato (2010)

Remark: γ = 0 reduces to the Boltzmann-Shannon entropy and the Kullback-Leibler divergence

64 Max γ-entropy

Equal moment space

γ-entropy

Max γ-entropy

γ-model

65 γ-estimation

Parametric model

γ-loss function

γ-estimator

Gaussian

γ-estimator

Remark. Cf. Fujisawa-Eguchi (2008)   66 γ-estimation on the γ-model

γ-loss function

Theorem.

67 (γ-estimator, γ′-model)

loss function \ model     |  normal model        |  γ-model (robust model)
0-loss (− log-likelihood) |  MLE                 |  MLE
γ-loss                    |  robust γ-estimator  |  moment estimator

68 Gamma-ICA

Mixture of independent signals

Whitening

Preprocessed form

Statistical model

γ-ICA

69 70 71 Outline

Boosting learning algorithms for supervised/unsupervised learning; U-loss functions

72 Which chameleon wins?

73 Pattern recognition…

74 What is pattern recognition?

□ There are a lot of examples for pattern recognition.

□ In principle pattern recognition is a prediction problem for class label

that encodes what humans find interesting and important.

□ The human brain intrinsically wants to label phenomena with a few words, for

example (good, bad), (yes, no), (dead, alive), (success, failure), (effective, no effect)….

□ The brain predicts the class label from empirical evidence.

75 Practice

● Character recognition ● face recognition

● voice recognition ● finger print recognition ● Image recognition ● speaker recognition

☆ Credit scoring ☆ Drug response ☆ Medical screening ☆ Treatment effect ☆ Default prediction ☆ Failure prediction ☆ Weather forecast ☆ Infectious disease

76 Binary classification Feature vector Class label

score

classifier

In the binary case y = +1, −1 (G = 2)

77

Let p( x, y) be a pdf of random vector ( x, y).

p(B, C) = ∫_B { Σ_{y∈C} p(x, y) } dx

Marginal densities:  p(y) = ∫_X p(x, y) dx,   p(x) = Σ_{y∈{1,…,G}} p(x, y)

Conditional densities:  p(y | x) = p(x, y)/p(x),   p(x | y) = p(x, y)/p(y)

p(x, y) = p(x | y) p(y) = p(y | x) p(x)

p(x, y 1) p(y 1| x)  p(x, y  1) p(y  1| x) 78 Error rate

Feature vector x, class label y.

The classifier h_F has error rate

Err(h_F) = Pr( h_F(x) ≠ y )

Err(h_F) = Σ_{i ≠ j} Pr( h_F(x) = i, y = j ) = 1 − Σ_{i=1}^{G} Pr( h_F(x) = i, y = i )

Training error

Test error

79 False negative/positive

True Positive False Positive

False Negative True Negative

FPR

FNR

80 Bayes rule

Let p(y|x) be the conditional probability of y given x.

The classifier h_Bayes(x) = sgn( F₀(x) ) leads to the Bayes rule.

Theorem 1. For any classifier h,

Err(h_Bayes) ≤ Err(h).

Note: The optimal classifier is equivalent to the likelihood ratio. However, in practice p(y|x) is unknown, so we have to
learn h_Bayes(x) from the training data set.
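As a reading aid (an assumption, not a formula taken from the slide): F₀ can be taken to be the log posterior-odds score, whose sign is all the Bayes rule uses,

\[ F_0(x) = \frac{1}{2}\log\frac{p(y=1\mid x)}{p(y=-1\mid x)}, \qquad h_{\mathrm{Bayes}}(x) = \operatorname{sgn} F_0(x). \]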

81 Error rate for Bayes rule

Discriminant space associated with Bayes classifier is given by

In general, when a classifier h associates with spaces

82 Err(h) ≥ Err(h_Bayes)   Cf. McLachlan (2002)   Boost learning

Can weak learners be combined into a strong learner?

Weak learner (classifier) = error rate slightly less than 0.5. Strong learner (classifier) = error rate close to that of the Bayes classifier.

Boost by filter (Schapire, 1990); Bagging, Arcing (bootstrap) (Breiman, Friedman, Hastie); AdaBoost (Schapire, Freund, Bartlett, Lee)   83 Set of weak learners

Decision stumps

Linear classifiers

Neural net, SVM, k-nearest neighbour

Note: not strong but a variety of characters

84 Exponential loss function

Let D = {(x₁, y₁), …, (x_n, y_n)} be a training (example) set.

Empirical exponential loss function for a score function F(x):

L_exp^D(F) = (1/n) Σ_{i=1}^{n} exp{ −y_i F(x_i) }

Expected exponential loss function for a score function F(x):

L_exp^E(F) = ∫_X { Σ_{y∈{−1,1}} exp{ −y F(x) } q(y | x) } q(x) dx

where q(y|x) is the conditional distribution given x, q(x) is the pdf of x.

85 Adaboost

1. Initial: w₁(i) = 1/n (i = 1, …, n), F₀(x) = 0.
2. For t = 1, …, T:
   ε_t(f) = Σ_i I( y_i ≠ f(x_i) ) w_t(i) / Σ_{i′} w_t(i′),
   (a) ε_t(f_(t)) = min_{f ∈ F} ε_t(f)
   (b) α_t = (1/2) log{ (1 − ε_t(f_(t))) / ε_t(f_(t)) }
   (c) w_{t+1}(i) = w_t(i) exp( −α_t f_(t)(x_i) y_i )
3. Output sign( F_T(x) ), where F_T(x) = Σ_{t=1}^{T} α_t f_(t)(x).
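A compact runnable sketch of this algorithm with decision stumps as the weak learners F (an illustration using numpy; the stump search and the toy data are mine, not the slides'):

import numpy as np

def best_stump(X, y, w):
    """Weak learner: exhaustive search over decision stumps sign(s*(x_j - c))."""
    n, p = X.shape
    best = (np.inf, None)
    for j in range(p):
        for c in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = s * np.sign(X[:, j] - c + 1e-12)
                err = w @ (pred != y)
                if err < best[0]:
                    best = (err, (j, c, s))
    return best

def adaboost(X, y, T=50):
    n = len(y)
    w = np.full(n, 1.0 / n)
    stumps, alphas = [], []
    for _ in range(T):
        err, (j, c, s) = best_stump(X, y, w / w.sum())
        err = max(err, 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)          # step (b)
        pred = s * np.sign(X[:, j] - c + 1e-12)
        w = w * np.exp(-alpha * y * pred)              # step (c)
        stumps.append((j, c, s)); alphas.append(alpha)
    def F(Xnew):                                       # F_T(x) = sum_t alpha_t f_(t)(x)
        return sum(a * s * np.sign(Xnew[:, j] - c + 1e-12)
                   for a, (j, c, s) in zip(alphas, stumps))
    return F

# Toy data: labels given by a linear rule; stumps combine to approximate it.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
F = adaboost(X, y, T=30)
print("training error:", np.mean(np.sign(F(X)) != y))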

86 Learning algorithm

[Diagram: the training set {(x₁, y₁), …, (x_n, y_n)} is reweighted in each round; the weights w_t(1), …, w_t(n) yield the weak learner f_(t)(x) and its coefficient α_t, for t = 1, …, T.]

Final learner:  F_T(x) = Σ_{t=1}^{T} α_t f_(t)(x)   87 Update weight

Update w_t(i) → w_{t+1}(i): multiply by e^{α_t} if f_(t)(x_i) ≠ y_i, and by e^{−α_t} if f_(t)(x_i) = y_i.

Weighted error rates: ε_t(f_(t)), ε_{t+1}(f_(t)), ε_{t+1}(f_(t+1)).

The worst case: ε_{t+1}(f_(t)) = 1/2.

88 ε_{t+1}(f_(t)) = 1/2:

ε_{t+1}(f_(t)) = Σ_{i=1}^{n} I( f_(t)(x_i) ≠ y_i ) w_{t+1}(i) / Σ_{i′} w_{t+1}(i′)

= Σ_i I( f_(t)(x_i) ≠ y_i ) exp{ −α_t y_i f_(t)(x_i) } w_t(i) / Σ_i exp{ −α_t y_i f_(t)(x_i) } w_t(i)

= exp{α_t} Σ_i I( f_(t)(x_i) ≠ y_i ) w_t(i) / [ exp{α_t} Σ_i I( f_(t)(x_i) ≠ y_i ) w_t(i) + exp{−α_t} Σ_i I( f_(t)(x_i) = y_i ) w_t(i) ]

= √{ ε_t(f_(t)) (1 − ε_t(f_(t))) } / [ √{ ε_t(f_(t)) (1 − ε_t(f_(t))) } + √{ ε_t(f_(t)) (1 − ε_t(f_(t))) } ] = 1/2

89 Update in exponential loss

L_exp(F) = (1/n) Σ_{i=1}^{n} exp{ −y_i F(x_i) }.   Consider F(x) → F(x) + α f(x).

L_exp(F + α f) = (1/n) Σ_{i=1}^{n} exp{ −y_i F(x_i) } exp{ −α y_i f(x_i) }

= (1/n) Σ_{i=1}^{n} exp{ −y_i F(x_i) } [ e^{α} I( f(x_i) ≠ y_i ) + e^{−α} I( f(x_i) = y_i ) ]

= L_exp(F) { e^{α} ε(f) + e^{−α} (1 − ε(f)) },

where ε(f) = Σ_{i=1}^{n} I( f(x_i) ≠ y_i ) exp{ −y_i F(x_i) } / { n L_exp(F) }.   90 Sequential optimization

L_exp(F + α f) = L_exp(F) { ε(f) e^{α} + (1 − ε(f)) e^{−α} } ≥ 2 L_exp(F) √{ ε(f)(1 − ε(f)) },

since ε(f) e^{α} + (1 − ε(f)) e^{−α} − 2√{ ε(f)(1 − ε(f)) } = ( √{ε(f)} e^{α/2} − √{1 − ε(f)} e^{−α/2} )² ≥ 0.

Equality holds if and only if α_opt = (1/2) log{ (1 − ε(f)) / ε(f) }.

91 AdaBoost = Seq-min exponential loss

min_{α ∈ IR} L_exp(F_{t−1} + α f_(t)) = 2 L_exp(F_{t−1}) √{ ε(f_(t)) (1 − ε(f_(t))) }

α_opt = (1/2) log{ (1 − ε(f_(t))) / ε(f_(t)) }

(a) f_(t) = argmin_{f ∈ F} ε_t(f)

(b) α_t = argmin_{α ∈ IR} L_exp(F_{t−1} + α f_(t))

(c) w_{t+1}(i) = w_t(i) exp{ −α_t y_i f_(t)(x_i) }   92 Simulation (completely separable)

Feature space [−1, 1] × [−1, 1].

[Figure: the true decision boundary inside the square.]

93 Set of linear classifiers

Linear classification machines

Random generation.

[Figure: randomly generated linear decision boundaries on [−1, 1]².]

94 Learning process

[Figure, slide 95: six snapshots of the learning process on [−1, 1]²; Iter = 1, train err = 0.21; Iter = 13, train err = 0.18; Iter = 17, train err = 0.10; Iter = 23, train err = 0.10; Iter = 31, train err = 0.095; Iter = 47, train err = 0.08.]

95 Final decision boundary

[Figure: left, contour of F(x); right, sign(F(x)).]

96 Learning curve

[Figure: training error curve, decreasing from about 0.2 to about 0.05 over 250 iterations; horizontal axis, iteration number.]

97 KL divergence

Label set Feature space

Nonnegative function

Note: For conditional distributions p(y|x), q(y|x) given x with common marginal density p(x) we can write

Then,

98 Twin of KL loss function

For the data distribution q(x, y) = q(x) q(y | x) we model the conditional as

m₁(y | x) = Σ_{g=1}^{G} exp{ F(x, g) − F(x, y) },   m₂(y | x) = exp{ F(x, y) } / Σ_{g=1}^{G} exp{ F(x, g) }.

Then

D_KL(q, m) = ∫_X Σ_{y∈{−1,1}} { q(y | x) log[ q(y | x) / m(y | x) ] − q(y | x) + m(y | x) } q(x) dx

Log loss / Exp loss   99 Bound for exponential loss

Empirical exp loss

Expected exp loss

Theorem.

100 Variational calculus

101 On AdaBoost FDA, or Logistic regression

Parametric approach to Bayes classifier

AdaBoost

Parametric approach to Bayes classifier, but dimension and basis function are flexible

The stopping time T can be selected according to the state of learning.

102 U-Boost

U-empirical loss function

In a context of classification

Unnormalized U-loss

Normalized U-loss

103 U-Boost (binary)

Unnormalized U-loss

Note

Bayes risk consistency

F*(x) is Bayes risk consistent because 104 AdaBoost with margin

Eta-loss function

regularized

105 EtaBoost for Mislabels Expected Eta-loss function

Optimal score

The variational argument leads to

Mislabel modeling

106 EtaBoost

1. Initial settings: w₁(i) = 1/n (i = 1, …, n), F₀(x) = 0.
2. For m = 1, …, T:
   ε_m(f) = Σ_{i=1}^{n} I( y_i ≠ f(x_i) ) w_m(i),
   (a) ε_m(f_(m)) = min_f ε_m(f)
   (b)
   (c) w_{m+1}(i) = w_m(i) exp( −α*_m f_(m)(x_i) y_i )
3. Output sign( F_T(x) ), where F_T(x) = Σ_{t=1}^{T} α_t f_(t)(x).   107 A toy example

108 Examples partly mislabeled

109 AdaBoost vs. EtaBoost

AdaBoost EtaBoost 110 ROC curve, AUC

(X, Y),  X ∈ IRᵖ,  Y ∈ {0, 1}

Discriminant function: F(X); threshold: c.  F(X) > c ⟹ positive (Y = 1),  F(X) ≤ c ⟹ negative (Y = 0).

False positive rate:  FPR(c) = P( F(X) > c | Y = 0 ).  True positive rate:  TPR(c) = P( F(X) > c | Y = 1 ).

Receiver operating characteristic (ROC) curve:  ROC(F) = { (FPR(c), TPR(c)) | c ∈ IR }.  Cf. Bamber (1975), Pepe et al. (2003).

Area under the curve:  AUC(F) = ∫₀¹ TPR(c′) d FPR(c′).

[Figure: ROC curve (TPR against FPR) with the AUC shaded.]
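A small numerical illustration (numpy): the empirical counterpart of AUC(F) is the Wilcoxon two-sample statistic Ĉ(F) that the AUCBoost slides below use; the direct pairwise-comparison form is shown here, not an optimized implementation.

import numpy as np

def empirical_auc(scores_pos, scores_neg):
    """C_hat(F) = (1/(n0*n1)) * sum_k sum_i I(F(x_1k) > F(x_0i)),
    i.e. the Wilcoxon two-sample statistic / empirical AUC."""
    s1 = np.asarray(scores_pos)[:, None]   # F(x_1k), shape (n1, 1)
    s0 = np.asarray(scores_neg)[None, :]   # F(x_0i), shape (1, n0)
    return np.mean(s1 > s0)

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, 100)   # scores of the Y = 1 sample
neg = rng.normal(0.0, 1.0, 120)   # scores of the Y = 0 sample
print(empirical_auc(pos, neg))    # about 0.76 for a unit mean shift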

111 112 113 114 115 116 117 Results of 5-fold CV (AUCBoost). [Figure: CV criterion plotted against λ.]

118 119 120 121 Relation between the two-sample problem and AUC boosting

The learning algorithm of AUCBoost is designed so that, at each step t, the Wilcoxon test statistic of the current discriminant function F_t(x),

Ĉ(F_t) = (1/(n₀ n₁)) Σ_{k=1}^{n₁} Σ_{i=1}^{n₀} I( F_t(x_{1k}) > F_t(x_{0i}) ),

is increased in the steepest direction at step (t+1), F_{t+1}(x) = F_t(x) + α_{t+1} f_{t+1}(x). In this way all single-marker analyses are integrated while Ĉ(F_t) ≤ Ĉ(F_{t+1}) ≤ ⋯; in this sense every single-marker analysis can be incorporated into the boosting. Eguchi-Komori (2009)

122 123 124 Partial AUC

pAUC = ∫_{c′ ≥ c} TPR(c′) d FPR(c′)

= P( F(X₀) < F(X₁), F(X₀) ≥ c )

[Figure, slide 125: score distributions F(x) of healthy and diseased subjects with threshold c, the corresponding ROC curves (TPR against FPR), and the shaded partial AUC between FPR = 0 and a chosen upper limit.]

126 Robust for noninformative genes

127 Robustness for noninformative genes

128 Generating function

True U / true positive probability

129 Boost learning algorithm

130 Bayes risk consistency

131 Benchmark data

132 True density

[Figure, slide 133: 3-D plot of the true density.]

133 U-Boost learning

U-loss function:  L_U(f) = −(1/n) Σ_{i=1}^{n} ξ( f(x_i) ) + ∫_{IRᵖ} U( ξ(f(x)) ) dx

Dictionary of density functions:

W = { g_λ(x) : g_λ(x) ≥ 0, ∫ g_λ(x) dx = 1, λ ∈ Λ }

Learning space = U-model:  W*_U = ξ⁻¹( co(ξ(W)) ) = { ξ⁻¹( Σ_λ π_λ ξ(g_λ(x)) ) }.

Let f(x, π) = ξ⁻¹( Σ_λ π_λ ξ(g_λ(x)) ). Then f(x, (0, …, 1, …, 0)) = g_λ(x), so W ⊂ W*_U.

Goal: find f* = argmin_{f ∈ W*_U} L_U(f).

134 Inner step in the convex hull

ξ(W) = { ξ(g_λ(x)) : λ ∈ Λ },   W*_U = ξ⁻¹( co ξ(W) )

Goal:  f* = argmin_{f ∈ W*_U} L_U(f),   f̂(x) = ξ⁻¹( π̂₁ ξ(f̂₁(x)) + ⋯ + π̂_k̂ ξ(f̂_k̂(x)) )

[Figure: dictionary elements g₁, …, g₇ in W, the U-model W*_U, the target f*(x), and the current estimate f(x).]

135 Convergence of Algorithm

L_U(f_k) − L_U(f_{k+1}) ≥ D_U(f_k, f_{k+1})

[Figure: the path ξ(f₁), …, ξ(f_k), ξ(f_{k+1}) approaching ξ(f*) inside co(ξ(W)).]

L_U(f₁) − L_U(f_{k+1}) ≥ Σ_{j=1}^{k} D_U(f_j, f_{j+1})

ξ-mixture:  ξ(f_K) = (1 − ε_K) ξ(f_{K−1}) + ε_K ξ(g_K), which expands to a convex combination π₁ ξ(g₁) + ⋯ + π_K ξ(g_K) approaching ξ(f*).

136 Simplified boosting

f̂_K = ξ⁻¹( Σ_{k=1}^{K} π_k ξ(g_k) )

For k = 0, …, K − 1:  f_{k+1} = ξ⁻¹( (1 − ε_{k+1}) ξ(f_k) + ε_{k+1} ξ(g_{k+1}) ),  such that

g_{k+1} = argmin_{g ∈ W} L_U( ξ⁻¹( (1 − ε) ξ(f_k) + ε ξ(g) ) ),   ε_{k+1} = c / (c + k),

with the initial f₁ = argmin_{g ∈ W} L_U(g).

137 Non-asymptotic bound

Theorem. Assume that the data distribution has a density g(x) and that

(A)  sup{ U″(ζ) { ξ(φ) − ξ(ψ) }² : φ, ψ ∈ W, ζ ∈ co(ξ(W)) } ≤ b²_U < ∞.

Then we have  IE_g D_U(g, f̂_K) ≤ FA(g, W*_U) + EE(g, W) + IE(K, W),  where

FA(g, W*_U) = inf_{f ∈ W*_U} D_U(g, f)   (functional approximation)

EE(g, W) = 2 IE_g { sup_{f ∈ W} | (1/n) Σ_{i=1}^{n} ξ(f(x_i)) − IE_g ξ(f) | }   (estimation error)

IE(K, W) = c² b²_U / (K + c − 1)   (c: step-length constant)   (iteration effect)

Remark. Trade-off between FA(p, W*_U) and EE(p, W).

138 Lemmas

Lemma 1. Let f̃ be a density estimator and f* = ξ⁻¹( Σ_{j=1}^{N} p_j ξ(φ_j(x)) ). Then

Σ_{j=1}^{N} p_j L_U( ξ⁻¹( (1 − ε) ξ(f̃(x)) + ε ξ(φ_j(x)) ) ) − L_U(f*) ≤ (1 − ε){ L_U(f̃) − L_U(f*) } + δ_U(f̃, f*, ε).

Lemma 2. Let f̃ be in ξ⁻¹( co(ξ(W)) ). Then  δ_U(f̃, f*, ε) ≤ b²_U ε²   (ε ∈ [0, 1]).

Lemma 3. Under assumption (A),  L_U(f̂_K) − inf_f L_U(f) ≤ c² b²_U / (K + c − 1),  where the infimum is over f = ξ⁻¹( Σ_j π_j ξ(φ_j) ).

Lemma 4. If L_U(f̂_K) − inf_f L_U(f) ≤ Δ, then

IE_g D_U(g, f̂_K) ≤ inf_f D_U(g, f) + Δ + 2 IE_g sup_{f ∈ W} | (1/n) Σ_{i=1}^{n} ξ(f(x_i)) − IE_g ξ(f) |.

139 [Figure, slide 140: density estimates from the two boosting methods compared with KDE and RSDE on the test densities C (skewed-unimodal), H (bimodal), and L (quadrimodal).]

140 Future of information geometry

Deepening geometry · Poincaré conjecture · optimal transport · statistics · free probability

learning Compressed sensing Semiparametrics

quantum Dynamical systems

Mathematical evolutionary theory Dempster-Shafer theory

141 Poincaré conjecture 1904

Every simply connected, closed 3-manifold is homeomorphic to the 3-sphere

Ricci flow by Richard Hamilton in 1981; Perelman 2002, 2003. Optimal control theory due to Pontryagin and Bellman.

Wasserstein space:  d_W(f, g)² = min{ E‖X − Y‖² : X ~ f, Y ~ g }

Optimal transport

142 Optimal transport

Theorem (Brenier, 1991)

Talagrand inequality

Log-Sobolev inequality

Optimal transport theory has been extended to general manifolds.

C. Villani (2010): optimal transport theory recognized with a Fields Medal.

Geometers consider not a space, but a distribution family on the space.

143 Thank you

144