2015 Workshop on High-Dimensional Statistical Analysis, Dec. 11 (Friday) – 15 (Tuesday), Humanities and Social Sciences Center, Academia Sinica, Taiwan

Information Geometry and Spontaneous Data Learning

Shinto Eguchi, Institute of Statistical Mathematics, Japan

This talk is based on joint work with Osamu Komori and Atsumi Ohara, University of Fukui.

Outline

A short review of information geometry

Kolmogorov-Nagumo φ-path in a function space

Generalized mean and φ-divergence geometry

U-divergence geometry

Minimum U-divergence and density estimation

A short review of IG

Nonparametric space of probability density functions

Information geometry is discussed on the product space.

Bartlett’s identity

Parametric model

Bartlett’s first identity

Bartlett’s second identity

Metric and connections

Information metric

Mixture connection

Exponential connection

Rao (1945), Dawid (1975), Amari (1982)

Geodesic curves and surfaces in IG

m-geodesic curve

e-geodesic curve

m-geodesic surface

e-geodesic surface

Kullback-Leibler

K-L divergence

1. KL divergence is the expected log-likelihood ratio

2. Maximum likelihood is minimum KL divergence. Akaike (1974)

3. KL divergence induces the m-connection and e-connection

Eguchi (1983)

Pythagoras

Thm

Amari-Nagaoka (2001)

Pf

Exponential model

Exponential model

Mean parameter

For …

For … Amari (1982)

Degenerate Bartlett identity

Minimum KL leaf

Exponential model

Mean equal space

Pythagoras foliation

log & exp

Bartlett identities; KL-divergence; e-connection, e-geodesic; m-connection, m-geodesic; Pythagoras identity

Exponential model; Pythagoras foliation; mean equal space

Path geometry

{ m-geodesic, e-geodesic, … }

Kolmogorov-Nagumo mean

The K-N mean is defined for positive numbers.

K-N mean in Y

Def. K-N mean

Cf. Naudts (2009)

The φ-path connecting f and g

Def. the c-path connecting f and g; Thm. the φ-path connecting f and g is obtained in the limit c → 1; Pf.

Examples of the φ-path

Exm 0

Exm 1

Exm 2

Exm 3

Identities of the φ-density

Model

1st identity

2nd identity

because

Generalized mean and variance

Def

Note

Generalized mean and variance (cont.)

Exm

Bartlett identity

Model

Bartlett identity

Bartlett identities

Tangent space of Y

Tangent space

Riemannian metric

Expectation gives the tangent space

Topological properties of T_f depend on φ.

If φ = log, then T_f is too large to do statistics on Y. Cf. Pistone (1992).

Parallel transport

Def. A vector field is parallel along a curve.

A curve is a φ-geodesic.

Cf. Amari (1982).

φ-geodesic

Thm. If … is the φ-geodesic curve

Proof.

φ-divergence

φ-cross entropy

φ-entropy

φ-divergence

Note: the φ-divergence is the KL-divergence if φ = log.

Divergence geometry

Def. Let D be a divergence on M,

with the Riemannian metric on M:

and the pair of affine connections on M:

φ-divergence geometry

The metric:

The affine connection pair:

φ-Pythagorean theorem

Thm. (φ-geodesic; dual φ-geodesic)

Pf. [Figure: triangle with vertices f, g, h.]

φ-Pythagorean foliation

φ-mean equal space

φ-mean

φ-Bartlett identities; φ-divergence; φ-connection, φ-geodesic; dual φ-connection, dual φ-geodesic; φ-Pythagoras identity

φ-model; φ-Pythagoras foliation; φ-mean equal space

What is the statistical meaning of the φ-mean and φ-variance?

U-divergence

U-cross-entropy

U-entropy

U-divergence

Note

Exm

U-divergence geometry

The metric associated with U-divergence:

Affine connections associated with U-divergence:

Thm (i) U-geodesic is mixture geodesic.

(ii) U*-geodesic is the φ-geodesic (with φ = ξ).

U-geometry ≠ φ-geometry

φ-metric on a model M

U-metric on a model M

φ-connection

U*-connection

Triangle with D_U

Thm. (mixture geodesic; φ-geodesic)

Pf. [Figure: triangle with vertices f, g, h.]

U-estimation

Let g(x) be a data density function with statistical model …

U-loss function

U-empirical loss function

U-estimator for θ

U-estimator under the φ-model

φ-model

U-empirical loss function

U-estimator for θ

The U-estimator under the φ-model is analogous to the MLE under the exponential model.

Potential function

Def. We call … the potential function on the φ-model.

Note

We define the mean parameter by …

Cf. mean parameter

Thm. The U-estimator for the mean parameter is given by the sample mean.

Pythagoras foliation

Thm

Pf

Pythagoras foliation (cont.)

U-Boost learning for density estimation

U-loss function:
$$L_U(f) = -\frac{1}{n}\sum_{i=1}^{n} \xi(f(x_i)) + \int_{\mathbb{R}^p} U\big(\xi(f(x))\big)\,dx$$

Dictionary of density functions:
$$W = \Big\{\, g_\lambda(x) : g_\lambda(x) \ge 0,\ \int g_\lambda(x)\,dx = 1,\ \lambda \in \Lambda \,\Big\}$$

Learning space (the ξ-model):
$$W_U^{*} = \xi^{-1}\big(\mathrm{co}(\xi(W))\big) = \Big\{\, \xi^{-1}\Big(\sum_{\lambda} \pi_\lambda\, \xi(g_\lambda(x))\Big) \,\Big\}$$

Let $f(x,\pi) = \xi^{-1}\big(\sum_{\lambda} \pi_\lambda\, \xi(g_\lambda(x))\big)$. Then $f(x,(0,\dots,1,\dots,0)) = g_\lambda(x)$, so $W \subset W_U^{*}$.

Goal: find $f^{*} = \arg\min_{f \in W_U^{*}} L_U(f)$.

U-Boost algorithm

(A) Find $f_1 = \arg\min_{g \in W} L_U(g)$.

(B) Update $f_{k+1} = \xi^{-1}\big((1-\pi_{k+1})\,\xi(f_k) + \pi_{k+1}\,\xi(g_{k+1})\big)$, where
$$(\pi_{k+1},\, g_{k+1}) = \arg\min_{(\pi, g) \in (0,1) \times W} L_U\Big(\xi^{-1}\big((1-\pi)\,\xi(f_k) + \pi\,\xi(g)\big)\Big).$$

(C) Select $K$, and set $\hat f = \xi^{-1}\big((1-\pi_K)\,\xi(f_{K-1}) + \pi_K\,\xi(g_K)\big)$.

 1  *  Example 2. Power entropy f (x)  ( k gk (x) ) k  *  g (x) k If   0, f (x)  exp( k log k )   gk (x) Friedman et al (1984) k k * Klemela (2007) If 1, f (x)   k gk (x) k Inner step in the convex hull   * 1 W  {t(x,) : } WU   (co(W ))

* * 1 ˆ ˆ ˆ ˆ Goal : f  argmin LU ( f ) f (x)  ( 1 ( f1(x))  kˆ( fkˆ (x))) * f WU

[Figure: dictionary elements g_1, …, g_7 in W, the learning space W_U^*, and the target f^*(x).]

Non-asymptotic bound

Theorem. Assume that the data distribution has a density $g(x)$ and that
$$\text{(A)}\qquad \sup_{(\zeta,\, g,\, g') \,\in\, \mathrm{co}(\xi(W)) \times W \times W} U''(\zeta)\,\{\xi(g) - \xi(g')\}^2 \;\le\; b_U.$$
Then
$$\mathbb{E}_g\, D_U(g, \hat f_K) \;\le\; \mathrm{FA}(g, W_U^{*}) + \mathrm{EE}(g, W) + \mathrm{IE}(K, W),$$
where
$$\mathrm{FA}(g, W_U^{*}) = \inf_{f \in W_U^{*}} D_U(g, f) \quad \text{(functional approximation)},$$
$$\mathrm{EE}(g, W) = 2\,\mathbb{E}_g \sup_{f \in W} \Big| \frac{1}{n}\sum_{i=1}^{n} \xi(f(x_i)) - \mathbb{E}_g\, \xi(f) \Big| \quad \text{(estimation error)},$$
$$\mathrm{IE}(K, W) = \frac{2\,c^2\, b_U}{K + c - 1} \quad (c:\ \text{step-length constant}) \quad \text{(iteration effect)}.$$

Remark. There is a trade-off between $\mathrm{FA}(g, W_U^{*})$ and $\mathrm{EE}(g, W)$.

β-Boost vs. KDE and RSDE (Girolami-He, 2004)

[Simulation panels: C (skewed-unimodal), H (bimodal), L (quadrimodal) test densities.]

Conclusion

K-N mean; φ-path

E_φ(·), Cov_φ(·); φ-geodesic

φ-divergence (G(φ), ∇*(φ), ∇(φ)); (φ-geodesic, dual φ-geodesic)

U-divergence (G(U), ∇*(U), ∇(U)); (φ-geodesic, m-geodesic)

ξ-path: $W_U^{*} = \xi^{-1}(\mathrm{co}(\xi(W)))$

Future problems

Tangent space

Path space

Future problems (cont.)

φ-mean, φ-variance, φ-divergence

These are natural ideas from the IG viewpoint, and we can build φ-efficiency in estimation theory, but:

What is the statistical meaning of the φ-mean and φ-variance?

Can we define a φ-version of a random sample?

Thank you
