2015 Workshop on High-Dimensional Statistical Analysis, Dec. 11 (Friday) – 15 (Tuesday), Humanities and Social Sciences Center, Academia Sinica, Taiwan
Information Geometry and Spontaneous Data Learning
Shinto Eguchi, The Institute of Statistical Mathematics, Japan
This talk is based on joint work with Osamu Komori and Atsumi Ohara, University of Fukui.
Outline
Short review of information geometry
Kolmogorov-Nagumo mean
φ-path in a function space
Generalized mean and variance
φ-divergence geometry
U-divergence geometry
Minimum U-divergence and density estimation
A short review of IG
Nonparametric space
Space of statistics
Information geometry is discussed on the product space
Bartlett’s identity
Parametric model
Bartlett’s first identity
Bartlett’s second identity
Metric and connections
Information metric
Mixture connection
Exponential connection
Rao (1945), Dawid (1975), Amari (1982)
Geodesic curves and surfaces in IG
m-geodesic curve e-geodesic curve
m-geodesic surface
e-geodesic surface
Kullback-Leibler
K-L divergence
1. KL divergence is the expected log-likelihood ratio.
2. Maximum likelihood is minimum KL divergence. Akaike (1974)
3. KL divergence induces the m-connection and e-connection. Eguchi (1983)
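Point 2 can be checked numerically. A minimal sketch (the sample, grid, and variable names are my own choices): for the model N(μ, 1), the average log-likelihood over a sample is, up to an additive constant, −(1/2n) Σ (x_i − μ)², so its maximizer is the sample mean, which is also the minimizer of the empirical KL divergence.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=1.5, scale=1.0, size=5000)

# Average log-likelihood of N(mu, 1) over the sample, up to an additive
# constant: -(1/2n) * sum_i (x_i - mu)^2
mus = np.linspace(0.0, 3.0, 301)
avg_loglik = np.array([-np.mean((sample - mu) ** 2) / 2.0 for mu in mus])

# Maximizing likelihood = minimizing the empirical KL divergence;
# both are attained (on the grid, up to grid resolution) at the sample mean
mle = mus[int(np.argmax(avg_loglik))]
```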
Pythagoras
Thm
Amari-Nagaoka (2001)
Pf
Exponential model
Mean parameter
Cf. Amari (1982)
Degenerate Bartlett identity
Minimum KL leaf
Exponential model
Mean equal space
Pythagoras foliation
log + exp
log & exp
Bartlett identities, KL-divergence, e-connection, e-geodesic, m-connection, m-geodesic, Pythagoras identity
exponential model, Pythagoras foliation, mean equal space
Path geometry
{ m-geodesic, e-geodesic, … }
Kolmogorov-Nagumo mean
The K-N mean of positive numbers x_1, …, x_n is φ⁻¹((1/n) Σ_{i=1}^n φ(x_i)).
K-N mean in Y
Def. K-N mean
Cf. Naudts (2009)
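As a small numerical illustration of the K-N mean (function names are mine): φ⁻¹ of the average of φ(x_i), with φ = id, log, and t ↦ 1/t recovering the arithmetic, geometric, and harmonic means.

```python
import numpy as np

def kn_mean(x, phi, phi_inv):
    # Kolmogorov-Nagumo mean: phi^{-1}( (1/n) * sum_i phi(x_i) )
    return phi_inv(np.mean(phi(x)))

x = np.array([1.0, 2.0, 4.0])

arith = kn_mean(x, lambda t: t, lambda t: t)             # arithmetic mean
geom = kn_mean(x, np.log, np.exp)                        # geometric mean
harm = kn_mean(x, lambda t: 1.0 / t, lambda t: 1.0 / t)  # harmonic mean
```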
φ-path
Def. The φ-path connecting g and f: f_t = φ⁻¹((1 − t) φ(g) + t φ(f) − c_t), with c_t the normalizing constant.
Thm (Pf omitted). Examples of φ-path:
Exm 0
Exm 1
Exm 2
Exm 3
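The examples above can be tried on a grid; a hedged sketch (the densities and function names are my choices): φ = id gives the mixture (m-)geodesic, φ = log gives the exponential (e-)geodesic, and the normalizing constant c_t is handled by renormalizing on the grid.

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]

def npdf(x, mu, sd=1.0):
    return np.exp(-(x - mu) ** 2 / (2 * sd**2)) / (sd * np.sqrt(2 * np.pi))

f, g = npdf(x, -2.0), npdf(x, 2.0)

def phi_path(f, g, t, phi, phi_inv):
    # phi^{-1}((1 - t) phi(f) + t phi(g)), renormalized (the c_t constant)
    h = phi_inv((1 - t) * phi(f) + t * phi(g))
    return h / (h.sum() * dx)

m_mid = phi_path(f, g, 0.5, lambda u: u, lambda u: u)  # m-geodesic midpoint
e_mid = phi_path(f, g, 0.5, np.log, np.exp)            # e-geodesic midpoint

# The e-midpoint of N(-2,1) and N(2,1) is N(0,1); the m-midpoint is the
# bimodal half-half mixture. Both are normalized densities on the grid.
```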
Identities of φ-density
Model
1st identity
2nd identity
because
Generalized mean and variance
Def
Note
Generalized mean and variance
Exm
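One concrete reading, as a hedged sketch (this is my own assumption about the definitions, not necessarily the talk's): take the generalized mean of a sample to be its K-N mean, and the generalized variance to be the variance of the φ-transformed sample.

```python
import numpy as np

def phi_mean(x, phi, phi_inv):
    # K-N-type generalized mean of a sample (assumed definition)
    return phi_inv(np.mean(phi(x)))

def phi_var(x, phi):
    # variance of the phi-transformed sample (assumed definition)
    return np.var(phi(x))

x = np.array([1.0, 2.0, 4.0, 8.0])
m = phi_mean(x, np.log, np.exp)  # geometric mean = 2**1.5
v = phi_var(x, np.log)           # = var(log x) = 1.25 * (log 2)**2
```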
Bartlett identity
Model
Bartlett identity
Bartlett identities
Tangent space of Y
Tangent space
Riemannian metric
Expectation gives the tangent space
Topological properties of Tf depend on φ.
If φ = log, then Tf is too large to do statistics on Y. Cf. Pistone (1992)
Parallel transport
Def. A vector field is parallel along a curve
A curve is a φ-geodesic
Cf. Amari (1982).
φ-geodesic
Thm. If the curve is the φ-geodesic curve
Proof.
φ-divergence
φ-cross-entropy
φ-entropy
φ-divergence
Note: the φ-divergence is the KL-divergence if φ = log
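The Note can be checked numerically under one illustrative form of the φ-divergence (an assumption of mine for this sketch): D_φ(f, g) = ∫ f (φ(f) − φ(g)) dx, which for φ = log is exactly the KL divergence.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 8001)
dx = x[1] - x[0]

def npdf(x, mu):
    return np.exp(-(x - mu) ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

f, g = npdf(x, 0.0), npdf(x, 1.0)

def div_phi(f, g, phi):
    # illustrative phi-divergence: int f * (phi(f) - phi(g)) dx
    return np.sum(f * (phi(f) - phi(g))) * dx

kl = div_phi(f, g, np.log)
# closed form: KL(N(0,1) || N(1,1)) = (0 - 1)^2 / 2 = 1/2
```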
Divergence geometry
Def.
Let M be a statistical model,
with the Riemannian metric on M:
and the pair of affine connections on M:
φ-divergence geometry
The metric
Affine connection pair
φ-Pythagorean theorem
Thm. φ-geodesic
φ*-geodesic
Pf. (triangle with vertices f, g, h)
φ-Pythagorean foliation
φ-mean equal space
φ-mean
φ-Bartlett identities, φ-divergence
φ-connection, φ-geodesic, φ*-connection, φ*-geodesic, φ-Pythagoras identity
φ-model, φ-Pythagoras foliation, φ-mean equal space
What is the statistical meaning of φ-mean and φ-variance?
U-divergence
U-cross-entropy
U-entropy
U-divergence
Note
Exm
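One standard form of the U-divergence (the Bregman-type construction with ξ = (U′)⁻¹) can be evaluated on a grid; with U = exp, so that ξ = log, it reduces to the KL divergence, matching the Note. The grid and names below are my choices.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 8001)
dx = x[1] - x[0]

def npdf(x, mu):
    return np.exp(-(x - mu) ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

g, f = npdf(x, 0.0), npdf(x, 1.0)

def u_divergence(g, f, U, xi):
    # D_U(g, f) = int { U(xi(f)) - U(xi(g)) - g * (xi(f) - xi(g)) } dx
    integrand = U(xi(f)) - U(xi(g)) - g * (xi(f) - xi(g))
    return np.sum(integrand) * dx

d = u_divergence(g, f, np.exp, np.log)
# with U = exp this equals KL(g || f) = 1/2 for N(0,1) vs N(1,1)
```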
U-divergence geometry
The metric associated with U-divergence:
Affine connections associated with U-divergence:
Thm (i) U-geodesic is mixture geodesic.
(ii) U*-geodesic is the φ-geodesic
U-geometry ≠ φ-geometry
φ-metric on a model M
U-metric on a model M
φ-connection
U*-connection
Triangle with D_U
Thm. mixture geodesic
U-geodesic
Pf. (triangle with vertices f, g, h)
U-estimation
Let g(x) be a data density function with a statistical model
U-empirical loss function
U-estimator for
U-estimator under the φ-model
φ-model
U-empirical loss function
U-estimator for
The U-estimator under the φ-model is analogous to the MLE under the exponential model.
Potential function
Def. We call this the potential function on the φ-model.
Note
We define the mean parameter by
Cf. mean parameter
Thm. The U-estimator for the mean parameter is given by the sample mean.
Pythagoras foliation
Thm
Pf
Pythagoras foliation
U-Boost learning for density estimation
U-loss function: L_U(f) = −(1/n) Σ_{i=1}^n ξ(f(x_i)) + ∫ U(ξ(f(x))) dx
Dictionary of density functions: W = { g_λ(x) : g_λ(x) ≥ 0, ∫ g_λ(x) dx = 1, λ ∈ Λ }
Learning space = φ-model: W_U* = φ⁻¹(co(φ(W))) = { φ⁻¹(Σ_k π_k φ(g_{λ_k}(x))) }. Let f(x, π) = φ⁻¹(Σ_k π_k φ(g_{λ_k}(x))). Then f(x, (0, …, 1, …, 0)) = g_{λ_k}(x).
Goal: find f* = argmin_{f ∈ W_U*} L_U(f)
U-Boost algorithm
(A) Find f_1 = argmin_{g ∈ W} L_U(g)
(B) Update f_k → f_{k+1} = φ⁻¹((1 − π_{k+1}) φ(f_k) + π_{k+1} φ(g_{k+1})), where (π_{k+1}, g_{k+1}) = argmin_{(π, g) ∈ (0,1) × W} L_U(φ⁻¹((1 − π) φ(f_k) + π φ(g)))
(C) Select K, and set f̂ = φ⁻¹((1 − π_K) φ(f_{K−1}) + π_K φ(g_K))
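The steps (A)-(C) can be sketched in the simplest setting φ = id (mixture path) and U = exp (log-loss), with a fixed small menu of step lengths π; the dictionary, grid, and step sizes below are illustrative choices of mine, not the talk's.

```python
import numpy as np

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])
grid = np.linspace(-8.0, 8.0, 2001)
dx = grid[1] - grid[0]

def kernel(center, h=1.0):
    # normalized Gaussian bump on the grid: one dictionary element
    k = np.exp(-(grid - center) ** 2 / (2 * h**2))
    return k / (k.sum() * dx)

# dictionary W: candidate densities at a coarse set of locations
dictionary = [kernel(c) for c in np.linspace(-6.0, 6.0, 25)]

def u_loss(f):
    # empirical U-loss for U = exp: negative average log-likelihood
    vals = np.interp(data, grid, f)
    return -np.mean(np.log(np.maximum(vals, 1e-300)))

# (A) best single dictionary element
f = min(dictionary, key=u_loss)

# (B) greedy updates along the phi = id path: f <- (1 - pi) f + pi g
for _ in range(10):
    pi, g = min(
        ((p, h) for p in (0.1, 0.2, 0.3) for h in dictionary),
        key=lambda pg: u_loss((1 - pg[0]) * f + pg[0] * pg[1]),
    )
    f = (1 - pi) * f + pi * g

# (C) the final f is the boosted estimate; convex steps keep it normalized
```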
Example 2. Power entropy: f*(x) ∝ (Σ_k π_k g_k(x)^β)^{1/β}
If β → 0, f*(x) ∝ exp(Σ_k π_k log g_k(x)). Friedman et al. (1984)
If β = 1, f*(x) = Σ_k π_k g_k(x). Klemelä (2007)
Inner step in the convex hull: W = { t(x, λ) : λ ∈ Λ }, W_U* = φ⁻¹(co(φ(W)))
Goal: f* = argmin_{f ∈ W_U*} L_U(f), with f̂(x) = φ⁻¹(π̂_1 φ(f̂_1(x)) + … + π̂_K φ(f̂_K(x)))
(Figure: dictionary elements g_1, …, g_7 in W and the boosted estimate f(x) in W_U*)
Non-asymptotic bound
Theorem. Assume that the data distribution has a density g(x) and that
(A) sup_{(λ, f, g) ∈ co(φ(W)) × W × W} U''(λ) {φ(f) − φ(g)}² ≤ b_U.
Then IE_g D_U(g, f̂_K) ≤ FA(g, W_U*) + EE(g, W) + IE(K, W), where
FA(g, W_U*) = inf_{f ∈ W_U*} D_U(g, f) (functional approximation)
EE(g, W) = 2 IE_g sup_{f ∈ W} | (1/n) Σ_{i=1}^n ξ(f(x_i)) − IE_p ξ(f) | (estimation error)
IE(K, W) = c² b_U / K, c > 1 a step-length constant (iteration effect)
Remark. There is a trade-off between FA(p, W_U*) and EE(p, W).
β-Boost: simulation comparison with KDE and RSDE (Girolami-He, 2004)
(Figure: test densities C (skewed-unimodal), H (bimodal), L (quadrimodal))
Conclusion
K-N mean → φ-path
Generalized mean and variance E_φ(·), Cov_φ(·); φ-geodesic
φ-divergence → (G(φ), ∇(φ), ∇*(φ)); (φ-geodesic, φ*-geodesic)
U-divergence → (G(U), ∇(U), ∇*(U)); (U-geodesic, m-geodesic)
φ-path: W_U* = φ⁻¹(co(φ(W)))
Future problems
Tangent space
Path space
Future problems
φ-mean, φ-variance, φ-divergence
These are natural ideas from the IG viewpoint. We can build φ-efficiency in estimation theory, but:
What is the statistical meaning of φ-mean and φ-variance?
Can we define a φ-version of a random sample?
Thank you