2013 ISM Summer Graduate School, The Institute of Statistical Mathematics, Seminar Room 1, September 26, 14:40-17:50
ISM Summer School 2013
Information Geometry
Shinto Eguchi
The Institute of Statistical Mathematics, Japan  Email: [email protected], [email protected]  URL: http://www.ism.ac.jp/~eguchi/
Outline
Information geometry
historical remarks
treasure example
Information divergence class
robust statistical methods
robust ICA
robust kernel PCA
2 Historical comment
(Timeline: R. A. Fisher, 1922; C. R. Rao, 1945, geometry; B. Efron, 1975 paper, with comments by A. P. Dawid; S. Amari, 1982 paper and 1984; applications spreading to statistics, learning, and quantum; RSS 150th-anniversary workshop.)
3 Dual Riemannian Geometry
Dual Riemannian geometry gives reformulation for Fisher’s foundation
Cf. Einstein field equations (1916). Information Geometry aims to geometrize the dualistic structure between modeling and estimation.
Cf. Erlangen program (Klein, 1872) http://en.wikipedia.org/wiki/Erlangen_program
Estimation is projection of the data onto the model: "estimation is an action by projection, and the model is the object to be projected." The interplay of action and object is elucidated by a geometric structure.
4 2 x 2 table
The space of all 2×2 tables is associated with a regular tetrahedron; the space of all independent 2×2 tables is associated with a ruled surface inside the regular tetrahedron.
We know the ruled surface is exponential-geodesic, but does anyone know whether the ruled surface is minimal?
5 regular tetrahedron
(Slides 5-6: the regular tetrahedron of 2×2 probability tables. The vertices are the point-mass tables, labelled P, Q, R, S, e.g. P = [[0,1],[0,0]], R = [[0,0],[1,0]], S = [[0,0],[0,1]]; the barycenter is the uniform table [[1/4,1/4],[1/4,1/4]], and edge points have the form [[0,p],[0,1-p]]. A worked example expresses a table as a convex combination of the vertex tables with weights 1/3 and 2/3.)
7 Ruled surface
(Figure: the ruled surface of independent tables inside the tetrahedron P, Q, R, S.)
8 A 2×2 count table with cells x, y, z, w has row sums e = x + y, f = z + w, column sums g = x + z, h = y + w, and total n = x + y + z + w. The corresponding independent probability table has cells pq, p(1-q), (1-p)q, (1-p)(1-q) with margins (p, 1-p) and (q, 1-q).
The numerator of the chi-square statistic is proportional to (xw - yz)²; for an independent table, {pq·(1-p)(1-q) - p(1-q)·(1-p)q}² = 0.
The independence surface is e-geodesical in the coordinates (θ_11, θ_12, θ_21) = (log π_11/π_22, log π_12/π_22, log π_21/π_22), where π_22 = 1 - π_11 - π_12 - π_21.
The probability simplex S_3 ⊂ IR^3 (figure).
In log-odds coordinates the independence surface is
{ (log pq/((1-p)(1-q)), log p/(1-p), log q/(1-q)) : 0 < p < 1, 0 < q < 1 } = { (x + y, x, y) : x ∈ IR, y ∈ IR },
i.e. a plane.
10 Two parametrizations
(p, q) ↦ (pq, p(1-q), (1-p)q)   (mixture coordinates)
(p, q) ↦ (log pq/((1-p)(1-q)), log p/(1-p), log q/(1-q))   (exponential coordinates)
(Figure: the independence surface plotted in each coordinate system.)
11 Kullback-Leibler Divergence
Let p(x) and q(x) be probability density functions. Then the Kullback-Leibler divergence is defined by
D_KL(p, q) = ∫ p(x) log{ p(x)/q(x) } dx.
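As a small numerical illustration of this definition (not from the slides; the function and example are my own), the Python snippet below evaluates D_KL between two normal densities by numerical integration and checks the result against the closed form.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def kl_divergence(p, q, lo=-20.0, hi=20.0):
    """Numerically integrate D_KL(p, q) = ∫ p(x) log(p(x)/q(x)) dx."""
    value, _ = quad(lambda x: p(x) * np.log(p(x) / q(x)), lo, hi)
    return value

p = norm(loc=0.0, scale=1.0).pdf    # p = N(0, 1)
q = norm(loc=1.0, scale=2.0).pdf    # q = N(1, 4)

# Closed form for normals: log(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 s2^2) - 1/2
closed_form = np.log(2.0) + (1.0 + 1.0) / (2 * 4.0) - 0.5
print(kl_divergence(p, q), closed_form)    # both ≈ 0.443
```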
12 Two one-parameter families
Let P be the space of all pdfs on a data space. Connecting p, q ∈ P:
m-geodesic: r_t(x) = (1 - t) p(x) + t q(x)
e-geodesic: r_t(x) = exp{ (1 - t) log p(x) + t log q(x) - κ(t) },
where κ(t) = log ∫ p(x)^{1-t} q(x)^t dx is the normalizing constant.
(Figure: the m-geodesic and e-geodesic joining p and q, with an intermediate point r.)
13 Pythagoras theorem, Amari-Nagaoka (1982)
(Figure: right triangle with vertices p, q, r, one side an m-geodesic and the other an e-geodesic.)
14 Proof
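The theorem statement and its proof were displayed formulas on the original slides; a standard reconstruction consistent with Amari-Nagaoka (1982) is as follows. If the m-geodesic joining p and q meets the e-geodesic joining q and r orthogonally at q, then
D_KL(p, q) + D_KL(q, r) = D_KL(p, r).
Proof sketch: directly from the definition,
D_KL(p, r) - D_KL(p, q) - D_KL(q, r) = ∫ ( p(x) - q(x) ) ( log q(x) - log r(x) ) dx,
and the right-hand side is exactly the inner product of the m-tangent vector p - q and the e-tangent vector log q - log r, which vanishes under the orthogonality assumption.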
15 ABC in differential geometry
Riemannian metric defines an inner product on any tangent space of a manifold.
(M, g),   g : X(M) × X(M) → IR
Geodesic = a minimal arc between x and y:
argmin over curves c = { x(t) : x(0) = x, x(1) = y } of ∫_0^1 √( g( ẋ(t), ẋ(t) ) ) dt
Linear connection defines parallelism along a vector field.
∇ : X(M) × X(M) → X(M)
(1) ∇_{fX} Y = f ∇_X Y,
(2) ∇_X (fY) = f ∇_X Y + (Xf) Y   (f ∈ F(M), X, Y ∈ X(M))
Cf. slide 41. Componentwise: ∇_{∂_i} ∂_j = Σ_{k=1}^p Γ^k_{ij} ∂_k
16 Geodesic
A one-parameter family (curve) C = { θ(t) : t } is called a geodesic with respect to a linear connection { Γ^i_{jk}(θ) : 1 ≤ i, j, k ≤ p } if
θ̈^i(t) + Σ_{j=1}^p Σ_{k=1}^p Γ^i_{jk}(θ(t)) θ̇^j(t) θ̇^k(t) = 0   (i = 1, ..., p),
that is, the acceleration vanishes after correction by the connection terms.
Remark: if Γ^i_{jk}(θ) ≡ 0 (1 ≤ i, j, k ≤ p), every geodesic is a straight line (Newton's law of inertia). This property is not invariant under reparametrization; the transform rule from parameter θ to ω is as follows.
17 Change rule on connections
Consider a transform from parameter θ to ω, and write B^a_i = ∂ω^a/∂θ^i, B^i_a = ∂θ^i/∂ω^a. The geodesic C is written as { ω(t) = ω(θ(t)) : t }, hence
ω̇^a(t) = Σ_{i=1}^p B^a_i(θ(t)) θ̇^i(t)   (a = 1, ..., p),
ω̈^a(t) = Σ_{i=1}^p Σ_{j=1}^p ∂B^a_i/∂θ^j (θ(t)) θ̇^i(t) θ̇^j(t) + Σ_{i=1}^p B^a_i(θ(t)) θ̈^i(t)   (a = 1, ..., p).
Substituting the geodesic equation gives
ω̈^a(t) + Σ_{b=1}^p Σ_{c=1}^p Γ̃^a_{cb}(ω(t)) ω̇^b(t) ω̇^c(t) = 0   (a = 1, ..., p),
with the transformed connection
Γ̃^a_{cb}(ω) = Σ_{i=1}^p Σ_{j=1}^p Σ_{k=1}^p B^a_i Γ^i_{jk}(θ) B^j_b B^k_c + Σ_{i=1}^p B^a_i ∂B^i_b/∂ω^c.
18 What are reference books?
Foundations of Differential Geometry (Wiley Classics Library) Shoshichi Kobayashi, Katsumi Nomizu
Differential-Geometrical Methods in Statistics, Shun-Ichi Amari, Springer, 1985
Methods of Information Geometry, Shun-Ichi Amari and Hiroshi Nagaoka, American Mathematical Society (2001)
19 Statistical model and information
Statistical model M = { p_θ(x) = p(x, θ) : θ ∈ Θ }   (Θ ⊂ IR^p)
Score vector s(x, θ) = (∂/∂θ) log p(x, θ)
Space of score vectors T_θ = { α^T s(·, θ) : α ∈ IR^p }
Fisher information metric g_θ(u, v) = E_θ{ u v }   (u, v ∈ T_θ)
Fisher information matrix I_θ = E_θ{ s(x, θ) s(x, θ)^T }
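A quick Monte Carlo check of the last two definitions (an illustrative sketch, not part of the slides): for the normal model with θ = (μ, σ), the sample average of s(x, θ) s(x, θ)^T approaches the known Fisher information matrix diag(1/σ², 2/σ²).

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 2.0
x = rng.normal(mu, sigma, size=200_000)

# Score vector s(x, theta) = d/d(mu, sigma) log N(x; mu, sigma^2), one row per observation
scores = np.stack([(x - mu) / sigma**2,
                   -1.0 / sigma + (x - mu)**2 / sigma**3], axis=1)

I_hat = scores.T @ scores / len(x)            # Monte Carlo estimate of E_theta{ s s^T }
I_exact = np.array([[1 / sigma**2, 0.0],
                    [0.0, 2 / sigma**2]])      # closed-form Fisher information
print(np.round(I_hat, 3))
print(I_exact)
```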
20 Score vector space
Note 1: for u = α^T s(·, θ) ∈ T_θ,
E_θ(u) = ∫ u(x, θ) p(x, θ) dx = ∫ α^T s(x, θ) p(x, θ) dx = α^T ∫ s(x, θ) p(x, θ) dx = 0.
For u = α^T s(·, θ), v = β^T s(·, θ) ∈ T_θ,
g_θ(u, v) = ∫ u(x, θ) v(x, θ) p(x, θ) dx = α^T { ∫ s(x, θ) s(x, θ)^T p(x, θ) dx } β = α^T I_θ β.
The space of all random variables with mean 0 and finite variance is
S_θ = { t(x, θ) : E_θ( t(x, θ) ) = 0, V_θ( t(x, θ) ) < ∞ }.
Note 2: S_θ is an infinite-dimensional vector space that includes T_θ. For u(x, θ) = α^T s(x, θ), the centered products
s_{i_1}(x, θ) ⋯ s_{i_k}(x, θ) - E_θ{ s_{i_1}(x, θ) ⋯ s_{i_k}(x, θ) },  k = 1, 2, ...,
also belong to S_θ. Cf. Tsiatis (2006)
21 Linear connections
Score vector s(x, θ) = ( s_i(x, θ) )_{i=1}^p
m-connection: Γ^{(m) i}_{jk}(θ) = Σ_{i'=1}^p g^{ii'} [ E_θ{ (∂_j s_k) s_{i'} } + E_θ{ s_j s_k s_{i'} } ]   (1 ≤ i, j, k ≤ p)
e-connection: Γ^{(e) i}_{jk}(θ) = Σ_{i'=1}^p g^{ii'} E_θ{ (∂_j s_k) s_{i'} }   (1 ≤ i, j, k ≤ p)
m-geodesic: r_t(x) = (1 - t) p(x) + t q(x)
e-geodesic: r_t(x) ∝ p(x)^{1-t} q(x)^t
22 Semiparametric theory
Contents: 1. Introduction to Semiparametric Models; 2. Hilbert Space for Random Vectors; 3. The Geometry of Influence Functions; 4. Semiparametric Models; 5. Other Examples of Semiparametric Models; 6. Models and Methods for Missing Data; 7. Missing and Coarsening at Random for Semiparametric Models; 8. The Nuisance Tangent Space and Its Orthogonal Complement; 9.-14. …
Semiparametric Theory and Missing Data (Springer Series in Statistics) Anastasios Tsiatis 23 Geodesical models
Definition. (i) A statistical model M is said to be e-geodesical if
p(x), q(x) ∈ M ⇒ c_t p(x)^{1-t} q(x)^t ∈ M   (t ∈ (0, 1)),  where c_t = 1/∫ p(x)^{1-t} q(x)^t dx.
(ii) A statistical model M is said to be m-geodesical if
p(x), q(x) ∈ M ⇒ (1 - t) p(x) + t q(x) ∈ M   (t ∈ (0, 1)).
Note: Let P be the space of all probability density functions. By definition P is e-geodesical and m-geodesical; however, the theoretical framework for P is not yet perfectly complete.
Cf. Pistone and Sempi (1995, Ann. Statist.)
24 Two types of modeling
Let p0(x) be a pdf and t (x) a p-dimensional statistic.
Exponential model M^(e) = { p(x, θ) = p_0(x) exp{ θ^T t(x) - ψ(θ) } : θ ∈ Θ },  Θ = { θ ∈ IR^p : ψ(θ) < ∞ }
Matched-mean model M^(m) = { p(x) : E_p{ t(x) } = E_{p_0}{ t(x) } }
Let { p_i(x) : i = 0, 1, ..., I } be a set of pdfs.
Exponential model M^(e) = { p(x, θ) = p_0(x) exp{ Σ_{i=1}^I θ_i log( p_i(x)/p_0(x) ) - ψ(θ) } : θ ∈ Θ },  Θ = { θ ∈ IR^I : ψ(θ) < ∞ }
Mixture model M^(m) = { p(x, θ) = (1 - Σ_{i=1}^I θ_i) p_0(x) + Σ_{i=1}^I θ_i p_i(x) : θ ∈ Θ },
Θ = { θ = (θ_1, ..., θ_I) : 0 ≤ Σ_{i=1}^I θ_i ≤ 1, θ_i ≥ 0 (i = 1, ..., I) }
25 Statistical functional
A functional f(p), defined for pdfs p(x), is called a statistical functional (Hampel, 1974).
A statistical functional f(p) is said to be Fisher-consistent for a model
M = { p_θ(x) : θ ∈ Θ } if f(p) satisfies (1) f(p) ∈ Θ,
(2) f(p_θ) = θ   (θ ∈ Θ).
Example. For the normal model M = { p_θ(x) = (2πσ²)^{-1/2} exp{ -(x - μ)²/(2σ²) }, θ = (μ, σ²) ∈ IR × IR_+ },
f(p) = ( ∫ x p(x) dx, ∫ x² p(x) dx - (∫ x p(x) dx)² )^T is Fisher-consistent for θ = (μ, σ²).
26 Transversality
Let f(p) be a Fisher-consistent functional, and define
L_θ(f) = { p : f(p) = θ },
which is called a leaf transverse to M. Then
(1) L_θ(f) ∩ M = { p_θ },
(2) the leaves { L_θ(f) : θ ∈ Θ } fill a local neighborhood of M.
(Figure: leaves L_θ(f), L_{θ*}(f) crossing M at p_θ and p_{θ*}.)
27 Foliation structure
Statistical model M = { p_θ(x) = p(x, θ) : θ ∈ Θ }   (Θ ⊂ IR^p)
Foliation: P = ∪_{θ∈Θ} L_θ(f) ⊃ M
Decomposition of tangent spaces: T_{p_θ}(P) = T_{p_θ}(M) ⊕ T_{p_θ}(L_θ(f))
For every t ∈ T_{p_θ}(P) there exist u ∈ T_{p_θ}(M) and v ∈ T_{p_θ}(L_θ(f))
such that t(x, θ) = u(x, θ) + v(x, θ).
(Figure: the leaf L_θ(f) through p_θ with tangent vectors u, v, t.)
28 Transversality for MLE
Let p_0(x) be a pdf and t(x) a p-dimensional statistic, and consider the exponential model
M^(e) = { p(x, θ) = p_0(x) exp{ θ^T t(x) - ψ(θ) } : θ ∈ Θ },  Θ = { θ ∈ IR^p : ψ(θ) < ∞ }.
For the exponential model M^(e) the MLE functional
f_ML(p) := argmax_{θ∈Θ} E_p{ log p(x, θ) }
is written as
f_ML(p) = arg{ θ ∈ Θ : E_p{ t(x) } = E_{p(·,θ)}{ t(x) } }.
Hence the foliation associated with f_ML is a matched-mean model:
L_θ(f_ML) = { p : E_p{ t(x) } = E_{p(·,θ)}{ t(x) } }
29 Maximum likelihood foliation
(Figure: the exponential model M^(e) with leaves L_θ(f_ML); data points p_1, p_2, p_3 on one leaf map to the same p_θ, while r_1, r_2 lie on another leaf.)
30 Estimating function
Statistical model M = { p_θ(x) = p(x, θ) : θ ∈ Θ }   (Θ ⊂ IR^p)
A p-variate function u(x, θ) is an unbiased estimating function if
E_{p(·,θ)}{ u(x, θ) } = 0  and  det( E_{p(·,θ)}{ ∂u(x, θ)/∂θ } ) ≠ 0   (θ ∈ Θ).
The statistical functional f(p) = argsolve_θ { E_p u(x, θ) = 0 }
is Fisher-consistent, and the leaf transverse to M,
L_θ(f) = { p : E_p u(x, θ) = 0 },
is m-geodesical.
31 Minimum divergence geometry
D : M×M → R is an information divergence on a statistical model M
(i) D(p, q) ≥ 0 with equality if and only if p = q; (ii) D is differentiable on M×M.
From the second and third derivatives of D along the diagonal of M×M we get a Riemannian metric and a pair of dual connections on M (Eguchi, 1983, 1992).
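The defining formulas were shown as an image on the slide; the standard construction (Eguchi, 1983), written in a coordinate system θ of M with ∂_i = ∂/∂θ_i acting on the first argument and ∂'_j = ∂/∂θ'_j on the second, is
g^(D)_{ij}(θ) = - ∂_i ∂'_j D(p_θ, p_{θ'}) |_{θ'=θ},
Γ^(D)_{ij,k}(θ) = - ∂_i ∂_j ∂'_k D(p_θ, p_{θ'}) |_{θ'=θ},
Γ^(D*)_{ij,k}(θ) = - ∂'_i ∂'_j ∂_k D(p_θ, p_{θ'}) |_{θ'=θ},
and the two connections are dual with respect to g^(D).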
32 Remarks: the induced bilinear form g^(D) is a Riemannian metric.
Cf. slide 21
33 U divergence
U is a convex function,
U cross-entropy
U entropy
U divergence
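The displayed formulas on this slide were images; a hedged reconstruction of the standard U-divergence framework used in Eguchi's papers, with ξ = (U')^{-1}, is
U cross-entropy:   C_U(p, q) = ∫ { U( ξ(q(x)) ) - p(x) ξ(q(x)) } dx,
U entropy:   H_U(p) = C_U(p, p),
U divergence:   D_U(p, q) = C_U(p, q) - H_U(p) ≥ 0,
the nonnegativity following from the convexity of U.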
34 Example of U divergence
KL divergence
Beta (power) divergence
Note
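The examples were likewise images; standard instances of D_U (a hedged reconstruction, not copied from the slide) are
U(t) = exp(t), ξ = log:   D_U(p, q) = ∫ { q(x) - p(x) - p(x) log( q(x)/p(x) ) } dx, the (extended) KL divergence;
U_β(t) = (1 + βt)^{(β+1)/β}/(β + 1), ξ_β(s) = (s^β - 1)/β:
D_β(p, q) = ∫ { ( q(x)^{β+1} - p(x)^{β+1} )/(β + 1) - p(x)( q(x)^β - p(x)^β )/β } dx,
the beta (power) divergence (Basu et al., 1998), which reduces to the KL divergence as β → 0.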
35 Geometric formula with DU
36 Triangle with DU
mixture geodesic
U geodesic
(Figure: triangle with vertices p, q, r joined by a mixture geodesic and a U-geodesic.)
37 Information divergence class and robust statistical methods
38 Light and shadow of MLE
Light: 1. invariance under data transformations; 2. asymptotic efficiency.
Shadow: 1. non-robustness; 2. over-fitting.
Log-likelihood on exponential family Sufficiency and efficiency
Likelihood method log exp
U-method u
39 Max U-entropy distribution
Let us fix a statistic t(x).
Equal mean space
U-model
Euler-Lagrange’s 1st variation
40 U-estimate
Let p(x) be a data density function with statistical model
U-loss function
U-empirical loss function
U-estimate
U-estimating function
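The loss formulas on slides 40-41 were images; a hedged reconstruction consistent with the U-divergence above (D_U between a data density p and the model p_θ, dropping θ-free terms) reads
U-loss:   L_U(θ) = ∫ U( ξ(p(x, θ)) ) dx - E_p{ ξ(p(x, θ)) },
U-empirical loss:   L̂_U(θ) = ∫ U( ξ(p(x, θ)) ) dx - (1/n) Σ_{i=1}^n ξ(p(x_i, θ)),
U-estimate:   θ̂_U = argmin_θ L̂_U(θ),
U-estimating function:   (1/n) Σ_{i=1}^n (∂/∂θ) ξ(p(x_i, θ)) - ∫ p(x, θ) (∂/∂θ) ξ(p(x, θ)) dx = 0.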
41
42 Minimax
Let us fix a statistic t(x).
Equal mean space
U-estimator for U-model
U-model
U-estimator of mean parameter
We observe that
which implies
Furthermore,
43 Influence function
Statistical functional: T_U(G) = argmin_θ { - ∫ ξ(p(x, θ)) dG(x) + ∫ U( ξ(p(x, θ)) ) dx }
Influence function: IF(T, x) = (∂/∂ε) T(G_ε) |_{ε=0},  G_ε = (1 - ε) F + ε δ_x
IF(T_U, x) = J_U(θ)^{-1} [ w(x, θ) S(x, θ) - E_θ{ w(x, θ) S(x, θ) } ]
GES(T_U) = sup_x || IF(T_U, x) ||
44 Efficiency
Asymptotic variance: √n ( θ̂_U - θ ) →_D N( 0, J_U(θ)^{-1} H_U(θ) J_U(θ)^{-1} ),
where J_U(θ) = E( w(X, θ) S(X, θ) S(X, θ)^T ),  H_U(θ) = Var( w(X, θ) S(X, θ) ).
Information inequality: I(θ)^{-1} ≤ J_U(θ)^{-1} H_U(θ) J_U(θ)^{-1}
( equality holds iff U(x) = exp(x) )
45 Normal mean
(Figure: influence functions for the normal mean; β-power estimates with β = 0, 0.01, 0.05, 0.075, 0.1, 0.125 and η-sigmoid estimates with η = 0, 0.015, 0.15, 0.4, 0.8, 2.5.)
U(t) = exp(t) + η t
46 Gross Error Sensitivity
β-power estimates:
β:          0    0.01   0.05   0.075  0.1    0.125
efficiency: 1    0.97   0.861  0.799  0.742  0.689
GES:        ∞    6.16   2.92   2.47   2.21   2.04
η-sigmoid estimates:
η:          0    0.015  0.15   0.4    0.8    2.5
efficiency: 1    0.972  0.873  0.802  0.753  0.694
GES:        ∞    1.9    1.04   0.678  0.455  0.197
47 The multivariate normal pdf is
f(y, μ, Σ) = ( (2π)^p det Σ )^{-1/2} exp{ -(1/2) (y - μ)^T Σ^{-1} (y - μ) }.
Estimating equations:
(1/n) Σ_{i=1}^n w(x_i, θ)( x_i - μ ) = 0
(1/n) Σ_{i=1}^n w(x_i, θ){ (x_i - μ)(x_i - μ)^T - Σ } = c_U(Σ),   θ = (μ, Σ)
48 U-algorithm: θ_k = (μ_k, Σ_k) → θ_{k+1} = (μ_{k+1}, Σ_{k+1})
Iteratively reweighted mean and covariance:
μ_{k+1} = Σ_i w(x_i, θ_k) x_i / Σ_i w(x_i, θ_k),
Σ_{k+1} = Σ_i w(x_i, θ_k)(x_i - μ_k)(x_i - μ_k)^T / { Σ_i w(x_i, θ_k) - c_U det Σ_k }.
Under a mild condition,
L_U(θ_{k+1}) ≤ L_U(θ_k)   (k = 1, 2, ...).
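A minimal sketch of the iteratively reweighted scheme above, assuming the β-power weight w(x, θ) = f(x; μ, Σ)^β and omitting the U-dependent bias-correction term c_U (so the covariance carries a small downward bias); the function name and toy data are mine, not from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def beta_power_estimate(x, beta=0.1, n_iter=100):
    """Iteratively reweighted mean and covariance with weights w_i = f(x_i; mu, Sigma)^beta.
    Simplified sketch: the bias-correction constant c_U of the slides is omitted."""
    mu, sigma = x.mean(axis=0), np.cov(x.T)                  # start from the MLE
    for _ in range(n_iter):
        w = multivariate_normal(mu, sigma).pdf(x) ** beta    # beta-power weights
        mu = (w[:, None] * x).sum(axis=0) / w.sum()
        d = x - mu
        sigma = (w[:, None, None] * np.einsum('ni,nj->nij', d, d)).sum(axis=0) / w.sum()
    return mu, sigma

rng = np.random.default_rng(1)
clean = rng.multivariate_normal([0, 0], np.eye(2), size=950)
outliers = rng.multivariate_normal([5, -5], np.eye(2), size=50)   # 5% contamination
x = np.vstack([clean, outliers])
print(beta_power_estimate(x, beta=0.1))   # stays near ([0, 0], I), resisting the outliers
```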
49 Simulation
ε-contamination models:
(1) G_ε = (1 - ε) N( (0, 0), [[1, 0.5], [0.5, 2]] ) + ε N( (5, -5), [[4, 0], [0, 1]] )
(2) G_ε = (1 - ε) N( (0, 0), [[1, 0], [0, 1]] ) + ε N( (0, 0), [[9, 0], [0, 9]] )
KL error D_KL(θ̂, θ_0) of the MLE θ̂:
under G^(1): 3.03 (ε = 0), 39.24 (ε = 0.05)
under G^(2): 2.70 (ε = 0), 16.65 (ε = 0.05)
50 β-power estimates vs. η-sigmoid estimates
KL error under G^(1):
β:        0      0.01    0.05     0.10    0.20     0.30
KL error: 39.24  35.30   20.93    8.91    12.40    31.64
η:        0      0.0001  0.00025  0.0005  0.00075  0.001
KL error: 39.24  23.04   6.70     4.46    4.64     6.04
KL error under G^(2):
β:        0      0.01    0.05     0.10    0.20     0.30
KL error: 16.5   13.91   6.65     5.14    12.20    29.68
η:        0      0.0005  0.00075  0.001   0.002    0.003
KL error: 16.5   3.36    3.19     3.9     3.04     3.07
51 Plot of MLEs (no outliers)
(Figure: 100 replications with sample size 100 under the normal distribution; estimates of (σ_11, σ_12, σ_22) for the η-estimate (η = 0.0025), the β-estimate (β = 0.1), and the MLE, scattered around the true value (1, 0, 1).)
52 Plot of MLEs with outliers
(Figure: 100 replications with sample size 100 under G^(2); scatter of the η-estimate (η = 0.0025), the β-estimate (β = 0.1), and the MLE around the true value (1, 0, 1).)
53 Selection for tuning parameter
Squared loss: Loss(θ̂_β) = (1/2) ∫ { f(y, θ̂_β) - g(y) }² dy
Cross-validation: CV(β) = -(1/n) Σ_{i=1}^n f( x_i, θ̂_β^(-i) ) + (1/2) ∫ f(y, θ̂_β)² dy
β̂ = argmin_β CV(β)
Approximation: θ̂_β^(-i) ≈ θ̂_β - (1/(n-1)) IF(x_i, θ̂_β)
54 Example: normal data with mean (0, 0) and covariance [[0.26, -0.1], [-0.1, 0.26]].
MLE on the clean signal: mean (0.054, -0.081), covariance [[0.228, -0.126], [-0.126, 0.261]].
MLE under contamination: mean (0.204, -0.184), covariance [[1.059, -0.263], [-0.263, 0.383]].
β-estimate under contamination: mean (0.086, -0.132), covariance [[0.293, -0.134], [-0.134, 0.286]].
(Figure: the CV(β) curve over β ∈ [0.025, 0.175], minimized near β̂ = 0.07.)
55 What is ICA?
Cocktail party effect
Blind source separation: sources s_1, s_2, ..., s_m
observed mixtures x_1, x_2, ..., x_m
56 ICA model
(Independent signals)  s = (s_1, ..., s_m) ~ p(s) = p_1(s_1) ⋯ p_m(s_m),  E(S_1) = 0, ..., E(S_m) = 0
(Linear mixture of signals)  there exist W ∈ IR^{m×m}, μ ∈ IR^m such that x = W^{-1} s + μ
f(x, W, μ, p) = | det(W) | p_1( w_1(x - μ_1) ) ⋯ p_m( w_m(x - μ_m) )
The aim is to learn W from a dataset (x_1, ..., x_n)
in which p(s) = p_1(s_1) ⋯ p_m(s_m) is unknown.
57 ICA likelihood
Log-likelihood function:
ℓ(W) = Σ_{i=1}^n Σ_{j=1}^m log p_j( w_j(x_i - μ_j) ) + n log | det(W) |
Estimating function:
F(x, W, p) = ( I_m - h(Wx)(Wx)^T ) W^{-T},
where h(s) = ( -∂ log p_1(s_1)/∂s_1, ..., -∂ log p_m(s_m)/∂s_m ).
Natural gradient algorithm, Amari et al. (1996)
58 Beta-ICA
β-power estimating equation:
(1/n) Σ_{i=1}^n f(x_i, W, μ)^β F(x_i, W, μ) = B_β(W, μ)
Decomposability: for s ≠ t,
E[ { p_s(w_s X) }^β h_s(w_s X) { p_t(w_t X) }^β w_t X ] = E[ { p_s(w_s X) }^β h_s(w_s X) ] E[ { p_t(w_t X) }^β w_t X ] = 0.
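A toy sketch of the natural-gradient update cited above (Amari et al., 1996); this is not the slide's own code, and the cubic nonlinearity h(y) = y³ is my choice for sub-Gaussian (e.g. uniform) sources.

```python
import numpy as np

def natural_gradient_ica(x, lr=0.05, n_iter=3000, seed=0):
    """Square-case ICA: repeat W <- W + lr * (I - h(y) y^T / n) W with y = x W^T.
    Sketch only; h(y) = y**3 assumes sub-Gaussian sources."""
    rng = np.random.default_rng(seed)
    x = x - x.mean(axis=0)                      # center the observations
    n, m = x.shape
    W = rng.normal(size=(m, m)) / np.sqrt(m)    # initial unmixing matrix
    for _ in range(n_iter):
        y = x @ W.T                             # current source estimates
        grad = np.eye(m) - (y**3).T @ y / n     # I - E[h(y) y^T]
        W = W + lr * grad @ W                   # natural-gradient step
    return W

rng = np.random.default_rng(1)
s = rng.uniform(-1, 1, size=(5000, 2))          # two independent uniform sources
A = np.array([[1.0, 0.5], [0.5, 1.0]])          # mixing matrix, so W should approximate A^{-1}
x = s @ A.T
W = natural_gradient_ica(x)
print(np.round(W @ A, 2))   # ≈ a scaled permutation of the identity when separation succeeds
```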
59 Likelihood ICA
(Figure, slide 59: 150 signals from U(0,1) × U(0,1), the mixing matrix W, the mixed observations, and the sources recovered by likelihood ICA.)
60 Non-robustness
(Figure: the same experiment with 50 added Gaussian noise points from N(0,1) × N(0,1); the likelihood-ICA solution is distorted by the noise, illustrating its non-robustness.)
(Figure, slides 61-62: minimum β-power ICA with β = 0.2 applied to the contaminated data.)
We considered the entropy and divergence obtained by imposing projective invariance on the Tsallis power entropy. On the family of distributions with given mean and variance, the maximum projective power entropy distribution model was derived in a form that contains the normal, t-, and Wigner distributions. A direct estimation method for the mean and variance on this model was proposed, and the estimators were shown to be the sample mean and the sample variance.
63 Projective γ-power divergence
γ-cross entropy
γ-diagonal entropy
γ-divergence
Cf. Tsallis (1988), Basu et al (1998), Fujisawa-Eguchi (2008), Eguchi-Kato (2010)
Remark: γ = 0 reduces to the Boltzmann-Shannon entropy and the Kullback-Leibler divergence.
64 Max γ-entropy
Equal moment space
γ-entropy
Max γ-entropy
γ-model
65 γ-estimation
Parametric model
γ-loss function
γ-estimator
Gaussian
γ-estimator
Remark. Cf. Fujisawa-Eguchi (2008)
66 γ-estimation on the γ-model
γ-loss function
Theorem.
67 (γ-estimator, γ'-model)
loss \ model             normal model    γ-model
0-loss (log-likelihood)  MLE             MLE
γ-loss (robust)          γ-estimator     moment estimator
68 Gamma-ICA
Mixture of independent signals
Whitening
Preprocessed form
Statistical model
γ-ICA
69 70 71 Outline
Boosting learning algorithms for supervised/unsupervised learning; U-loss functions
72 Which chameleon wins?
73 Pattern recognition…
74 What is pattern recognition?
□ There are many examples of pattern recognition.
□ In principle, pattern recognition is a prediction problem for a class label
that encodes human interest and importance.
□ Originally, the human brain wants to label phenomena with a few words, for
example (good, bad), (yes, no), (dead, alive), (success, failure), (effective, no effect), ….
□ The brain intrinsically predicts the class label from empirical evidence.
75 Practice
● Character recognition ● Face recognition
● Voice recognition ● Fingerprint recognition ● Image recognition ● Speaker recognition
☆ Credit scoring ☆ Drug response ☆ Medical screening ☆ Treatment effect ☆ Default prediction ☆ Failure prediction ☆ Weather forecast ☆ Infectious disease
76 Binary classification Feature vector Class label
score
classifier
In the binary case, y = 1 or -1 (G = 2).
Let p(x, y) be the pdf of the random vector (x, y).
p(B, C) = Σ_{y∈C} ∫_B p(x, y) dx
Marginal densities: p(y) = ∫_X p(x, y) dx,  p(x) = Σ_{y∈{1,...,G}} p(x, y)
Conditional densities: p(y | x) = p(x, y)/p(x),  p(x | y) = p(x, y)/p(y)
p(x, y) = p(x | y) p(y) = p(y | x) p(x)
In particular, p(x, y = 1) > p(x, y = -1) ⇔ p(y = 1 | x) > p(y = -1 | x).
78 Error rate
Feature vector x, class label y.
A classifier h_F has error rate
Err(h_F) = Pr( h_F(x) ≠ y ) = Σ_{i≠j} Pr( h_F(x) = i, y = j ) = 1 - Σ_{i=1}^G Pr( h_F(x) = i, y = i ).
Training error
Test error
79 False negative/positive
True Positive False Positive
False Negative True Negative
FPR
FNR
80 Bayes rule
Let p(y|x) be a conditional probability of y given x.
The classifier h_Bayes(x) = sgn( F_0(x) ) gives the Bayes rule.
Theorem 1. For any classifier h,
Err(h_Bayes) ≤ Err(h).
Note: the optimal classifier is equivalent to the likelihood ratio. However, in practice p(y | x) is unknown, so we have to
learn h_Bayes(x) from the training data set.
81 Error rate for Bayes rule
Discriminant space associated with Bayes classifier is given by
In general, when a classifier h associates with spaces
82 Err(h) ≥ Err(h_Bayes). Cf. McLachlan, 2002
Boost learning
Can weak learners be combined into a strong learner?
Weak learner (classifier) = error rate slightly less than 0.5.
Strong learner (classifier) = error rate close to that of the Bayes classifier.
Boosting by filtering (Schapire, 1990); Bagging, Arcing (bootstrap) (Breiman, Friedman, Hastie); AdaBoost (Schapire, Freund, Bartlett, Lee)
83 Set of weak learners
Decision stumps
Linear classifiers
Neural nets, SVMs, k-nearest neighbors
Note: each member need not be strong, but the set should offer a variety of characters.
84 Exponential loss function
Let {(x_1, y_1), ..., (x_n, y_n)} be a training (example) set.
Empirical exponential loss function for a score function F(x):
L_exp^D(F) = (1/n) Σ_{i=1}^n exp{ -y_i F(x_i) }
Expected exponential loss function for a score function F(x):
L_exp(F) = ∫_X [ Σ_{y∈{-1,1}} exp{ -y F(x) } q(y | x) ] q(x) dx,
where q(y | x) is the conditional distribution of y given x and q(x) is the pdf of x.
85 AdaBoost
1. Initialize: w_1(i) = 1/n (i = 1, ..., n), F_0(x) = 0.
2. For t = 1, ..., T, with weighted error rate ε_t(f) = Σ_i w_t(i) I( y_i ≠ f(x_i) ) / Σ_{i'} w_t(i'):
(a) choose f_(t) such that ε_t(f_(t)) = min_{f∈F} ε_t(f)
(b) set α_t = (1/2) log{ (1 - ε_t(f_(t))) / ε_t(f_(t)) }
(c) update w_{t+1}(i) = w_t(i) exp( -α_t f_(t)(x_i) y_i )
3. Output sign( F_T(x) ), where F_T(x) = Σ_{t=1}^T α_t f_(t)(x).
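A compact Python sketch of exactly these steps, using decision stumps as the weak-learner class F (the slide's own experiments use linear classifiers; the stumps, helper names, and toy data here are my choices).

```python
import numpy as np

def adaboost_stumps(x, y, T=40):
    """AdaBoost (steps 1-3 above) with decision stumps as weak learners.
    x: (n, d) features, y: labels in {-1, +1}."""
    n, d = x.shape
    w = np.full(n, 1.0 / n)                         # step 1: uniform weights, F_0 = 0
    ensemble = []
    for _ in range(T):
        best = None
        for j in range(d):                          # step 2(a): minimize weighted error
            for thr in np.unique(x[:, j]):
                for s in (+1, -1):
                    pred = s * np.sign(x[:, j] - thr + 1e-12)
                    err = w @ (pred != y) / w.sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, s, pred)
        err, j, thr, s, pred = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))   # step 2(b)
        w = w * np.exp(-alpha * y * pred)                   # step 2(c): reweight
        ensemble.append((alpha, j, thr, s))
    return ensemble

def predict(ensemble, x):
    F = sum(a * s * np.sign(x[:, j] - thr + 1e-12) for a, j, thr, s in ensemble)
    return np.sign(F)                               # step 3: sign of F_T(x)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(200, 2))
y = np.where(x[:, 0] * x[:, 1] > 0, 1, -1)          # a non-linearly-separable toy problem
print((predict(adaboost_stumps(x, y), x) == y).mean())   # training accuracy
```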
86 Learning algorithm
(Diagram: at each round t the weighted training set {(x_1, y_1), ..., (x_n, y_n)} with weights w_t(1), ..., w_t(n) yields a weak classifier f_(t)(x) with coefficient α_t; the final learner is F_T(x) = Σ_{t=1}^T α_t f_(t)(x).)
87 Update weight
Update w_t(i) → w_{t+1}(i) = w_t(i) exp( -α_t f_(t)(x_i) y_i ):
multiply by e^{-α_t} if f_(t)(x_i) = y_i, and by e^{α_t} otherwise.
Weighted error rate: ε_t(f_(t)) → ε_{t+1}(f_(t)), then ε_{t+1}(f_(t+1)).
The worst case: ε_{t+1}(f_(t)) = 1/2.
88 Indeed,
ε_{t+1}(f_(t)) = Σ_{i=1}^n w_{t+1}(i) I( f_(t)(x_i) ≠ y_i ) / Σ_{i'} w_{t+1}(i')
= Σ_i I( f_(t)(x_i) ≠ y_i ) exp{ -α_t y_i f_(t)(x_i) } w_t(i) / Σ_i exp{ -α_t y_i f_(t)(x_i) } w_t(i)
= e^{α_t} ε_t(f_(t)) / [ e^{α_t} ε_t(f_(t)) + e^{-α_t} ( 1 - ε_t(f_(t)) ) ]
= √( ε_t(f_(t)) (1 - ε_t(f_(t))) ) / [ 2 √( ε_t(f_(t)) (1 - ε_t(f_(t))) ) ] = 1/2,
using e^{α_t} = √( (1 - ε_t(f_(t))) / ε_t(f_(t)) ).
89 Update in exponential loss
L_exp(F) = (1/n) Σ_{i=1}^n exp{ -y_i F(x_i) }.  Consider the update F(x) → F(x) + α f(x):
L_exp(F + α f) = (1/n) Σ_{i=1}^n exp{ -y_i F(x_i) } exp{ -α y_i f(x_i) }
= (1/n) Σ_{i=1}^n exp{ -y_i F(x_i) } [ e^{α} I( f(x_i) ≠ y_i ) + e^{-α} I( f(x_i) = y_i ) ]
= L_exp(F) { e^{α} ε(f) + e^{-α} ( 1 - ε(f) ) },
where ε(f) = Σ_{i=1}^n I( f(x_i) ≠ y_i ) exp{ -y_i F(x_i) } / { n L_exp(F) }.
90 Sequential optimization
L_exp(F + α f) = L_exp(F) { ε(f) e^{α} + ( 1 - ε(f) ) e^{-α} } ≥ 2 L_exp(F) √( ε(f) ( 1 - ε(f) ) ),
with equality if and only if α = α_opt = (1/2) log{ ( 1 - ε(f) ) / ε(f) }.
91 AdaBoost = sequential minimization of the exponential loss
min_{α∈IR} L_exp( F_{t-1} + α f_(t) ) = 2 L_exp(F_{t-1}) √( ε(f_(t)) { 1 - ε(f_(t)) } ),
attained at α_opt = (1/2) log{ ( 1 - ε(f_(t)) ) / ε(f_(t)) }.
Hence the AdaBoost steps are
(a) f_(t) = argmin_{f∈F} ε_t(f)
(b) α_t = argmin_{α∈IR} L_exp( F_{t-1} + α f_(t) )
(c) w_{t+1}(i) = w_t(i) exp{ -α_t y_i f_(t)(x_i) }
92 Simulation (completely separable)
(Figure: feature space [-1, 1] × [-1, 1] with the true decision boundary.)
93 Set of linear classifiers
Linear classification machines, randomly generated.
(Figure: a random sample of linear classifiers on [-1, 1] × [-1, 1].)
94 Learning process
(Figure: snapshots of the combined classifier during boosting.
Iter = 1, train err = 0.21; Iter = 13, train err = 0.18; Iter = 17, train err = 0.10;
Iter = 23, train err = 0.10; Iter = 31, train err = 0.095; Iter = 47, train err = 0.08.)
95 Final decision boundary
(Figure: contour of F(x) and sign(F(x)) after training.)
96 Learning curve
(Figure: training error versus iteration number, up to 250 iterations; the training curve decreases from about 0.2 toward 0.05.)
97 KL divergence
Label set {1, ..., G}, feature space X.
Nonnegative functions p(y | x), q(y | x).
Note: for conditional distributions p(y | x), q(y | x) given x with a common marginal density p(x), we can write
D_KL(p, q) = ∫_X Σ_y p(y | x) log{ p(y | x) / q(y | x) } p(x) dx.
98 Twin of KL loss function
For the data distribution q(x, y) = q(x) q(y | x) we model the conditional distribution in two ways,
m_1(y | x) = Σ_{g=1}^G exp{ F(x, g) - F(x, y) }   (unnormalized),
m_2(y | x) = exp{ F(x, y) } / Σ_{g=1}^G exp{ F(x, g) }   (normalized).
Then the (extended) divergence is
D_KL(q, m) = ∫_X Σ_y { q(y | x) log[ q(y | x) / m(y | x) ] - q(y | x) + m(y | x) } q(x) dx.
Log loss   Exp loss
99 Bound for exponential loss
Empirical exp loss
Expected exp loss
Theorem.
100 Variational calculus
101 On AdaBoost vs. FDA or logistic regression
FDA / logistic regression: a parametric approach to the Bayes classifier.
AdaBoost:
also a parametric approach to the Bayes classifier, but the dimension and the basis functions are flexible.
The stopping time T can be selected according to the state of learning.
102 U-Boost
U-empirical loss function
In a context of classification
Unnormalized U-loss
Normalized U-loss
103 U-Boost (binary)
Unnormalized U-loss
Note
Bayes risk consistency
F*(x) is Bayes risk consistent because 104 AdaBoost with margin
Eta-loss function
regularized
105 EtaBoost for Mislabels Expected Eta-loss function
Optimal score
The variational argument leads to
Mislabel modeling
106 EtaBoost
1. Initial settings: w_1(i) = 1/n (i = 1, ..., n), F_0(x) = 0.
2. For m = 1, ..., T, with ε_m(f) = Σ_{i=1}^n I( y_i ≠ f(x_i) ) w_m(i):
(a) choose f_(m) such that ε_m(f_(m)) = min_{f∈F} ε_m(f)
(b) set the coefficient α_m* by minimizing the η-loss
(c) update w_{m+1}(i) = w_m(i) exp( -α_m* f_(m)(x_i) y_i )
3. Output sign( F_T(x) ), where F_T(x) = Σ_{t=1}^T α_t* f_(t)(x).
108 Examples partly mislabeled
109 AdaBoost vs. EtaBoost
AdaBoost EtaBoost 110 ROC curve, AUC
(X, Y), X ∈ IR^p, Y ∈ {0, 1}
Discriminant (score) function F(X), threshold c:  F(X) ≥ c → positive (Y = 1),  F(X) < c → negative (Y = 0).
False positive rate: FPR(c) = P( F(X) ≥ c | Y = 0 )
True positive rate: TPR(c) = P( F(X) ≥ c | Y = 1 )
Receiver operating characteristic curve: ROC(F) = { ( FPR(c), TPR(c) ) | c ∈ IR }
Area under the ROC curve: AUC(F) = ∫ TPR(c') dFPR(c')
Cf. Bamber (1975), Pepe et al. (2003)
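An empirical version of this AUC is the Wilcoxon-Mann-Whitney statistic; the short sketch below (illustrative only, the toy scores are not from the slides) computes it directly.

```python
import numpy as np

def empirical_auc(score_pos, score_neg):
    """Empirical AUC = fraction of (positive, negative) pairs ranked correctly (ties count 1/2)."""
    s1 = np.asarray(score_pos)[:, None]   # F(x_1k), k = 1, ..., n1
    s0 = np.asarray(score_neg)[None, :]   # F(x_0i), i = 1, ..., n0
    return (s1 > s0).mean() + 0.5 * (s1 == s0).mean()

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, 300)           # scores F(X) for diseased subjects (Y = 1)
neg = rng.normal(0.0, 1.0, 500)           # scores F(X) for healthy subjects (Y = 0)
print(empirical_auc(pos, neg))            # ≈ Phi(1/sqrt(2)) ≈ 0.76 for this toy setting
```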
(Slides 111-117: figures.)
Results of 5-fold CV for AUCBoost (tuning parameter λ)
(Slides 118-121: figures.)
Relation between the two-sample problem and AUC boosting
The AUCBoost learning algorithm is designed so that, sequentially at each step t, the Wilcoxon test statistic of the current discriminant function F_t(x),
Ĉ(F_t) = ( 1/(n_0 n_1) ) Σ_{k=1}^{n_1} Σ_{i=1}^{n_0} I( F_t(x_{1k}) > F_t(x_{0i}) ),
increases in the steepest direction at the (t+1)-th step, F_{t+1}(x) = F_t(x) + α_{t+1} f_{t+1}(x).
In this way all single-marker analyses are integrated so that Ĉ(F_t) ≤ Ĉ(F_{t+1}); in this sense every single-marker analysis can be incorporated into boosting. Eguchi-Komori (2009)
(Slides 122-124: figures.)
Partial AUC
pAUC = ∫_{c' ≥ c} TPR(c') dFPR(c')
= P( F(X_0) < F(X_1), F(X_0) ≥ c )
(Figure: score densities F(x) for healthy and diseased subjects with threshold c, the corresponding TPR(c) and FPR(c), and the ROC curves with the pAUC region shaded at the low-FPR end.)
126 Robustness for noninformative genes
127 Robustness for noninformative genes
128 Generating function U
True positive probability
129 Boost learning algorithm
130 Bayes risk consistency
131 Benchmark data
132 True density
(Figure: the true density surface over a two-dimensional grid, and the dictionary W.)
133 U-Boost learning
U-loss function: L_U(f) = -(1/n) Σ_{i=1}^n ξ( f(x_i) ) + ∫_{IR^p} U( ξ(f(x)) ) dx
Dictionary of density functions: W = { g_λ(x) : g_λ(x) ≥ 0, ∫ g_λ(x) dx = 1, λ ∈ Λ }
Learning space = U-model: W_U* = ξ^{-1}( co( ξ(W) ) ) = { ξ^{-1}( Σ_λ π_λ ξ(g_λ(x)) ) }.
Let f(x, π) = ξ^{-1}( Σ_λ π_λ ξ(g_λ(x)) ); then f(x, (0, ..., 1, ..., 0)) = g_λ(x), so W ⊂ W_U*.
Goal: find f* = argmin_{f ∈ W_U*} L_U(f).
134 Inner step in the convex hull
W ⊂ W_U* = ξ^{-1}( co ξ(W) )
Goal: f* = argmin_{f ∈ W_U*} L_U(f), approximated by f̂(x) = ξ^{-1}( π̂_1 ξ(f̂_1(x)) + ⋯ + π̂_k̂ ξ(f̂_k̂(x)) ).
(Figure: dictionary elements g_1, ..., g_7 with the estimates f(x), f̂(x) inside W_U*.)
135 Convergence of Algorithm
L_U(f_k) - L_U(f_{k+1}) ≥ D_U(f_k, f_{k+1}), hence
L_U(f_1) - L_U(f_{k+1}) ≥ Σ_{j=1}^k D_U(f_j, f_{j+1}).
π-mixture: ξ(f_K) = (1 - π_K) ξ(f_{K-1}) + π_K ξ(g_K) = π_1* ξ(g_1) + ⋯ + π_K* ξ(g_K)
136 Simplified boosting
f̂_K(x) = ξ^{-1}( Σ_{k=1}^K π_k ξ(g_k(x)) )
For k = 0, ..., K-1,
f_{k+1} = ξ^{-1}( (1 - ε_{k+1}) ξ(f_k) + ε_{k+1} ξ(g_{k+1}) ),
where g_{k+1} = argmin_{g∈W} L_U( ξ^{-1}( (1 - ε_{k+1}) ξ(f_k) + ε_{k+1} ξ(g) ) ),
with step lengths ε_{k+1} determined by a step-length constant c, and the initial element f_1 = argmin_{g∈W} L_U(g).
137 Non-asymptotic bound
Theorem. Assume that the data distribution has a density g(x) and that
(A) sup_{φ, φ' ∈ co(ξ(W)), π ∈ [0,1]} U''( (1 - π) φ + π φ' ) { φ - φ' }² ≤ b_{U,W}².
Then we have
IE_g D_U(g, f̂_K) ≤ FA(g, W_U*) + EE(g, W) + IE(K, W), where
FA(g, W_U*) = inf_{f ∈ W_U*} D_U(g, f)   (functional approximation)
EE(g, W) = 2 IE_g{ sup_{f∈W} | (1/n) Σ_{i=1}^n ξ( f(x_i) ) - IE_g ξ(f) | }   (estimation error)
IE(K, W) = c² b_{U,W}² / (K + c - 1)   (c: step-length constant)   (iteration effect)
Remark. There is a trade-off between FA(p, W_U*) and EE(p, W).
138 Lemmas
Lemma 1. Let f̃ be a density estimator and f* = ξ^{-1}( Σ_{j=1}^N p_j ξ(φ_j(x)) ). Then
L_U( ξ^{-1}( (1 - π) ξ(f̃(x)) + π Σ_{j=1}^N p_j ξ(φ_j(x)) ) ) - L_U(f*) ≤ (1 - π){ L_U(f̃) - L_U(f*) } + Δ(f̃, f*, π).
Lemma 2. Let f̃ be in ξ^{-1}( co(ξ(W)) ). Then Δ(f̃, f*, π) ≤ π² b_U²   (π ∈ [0, 1]).
Lemma 3. Under assumption (A), L_U(f̂_K) - inf_{f ∈ W_U*} L_U(f) ≤ c² b_U² / (K + c - 1), where f = ξ^{-1}( Σ_j π_j ξ(φ_j) ).
Lemma 4. If L_U(f̂_K) ≤ inf_{f ∈ W_U*} L_U(f) + δ, then
IE_g D_U(g, f̂_K) ≤ inf_{f ∈ W_U*} D_U(g, f) + δ + 2 IE_g sup_{f∈W} | (1/n) Σ_i ξ( f(x_i) ) - IE_g ξ(f) |.
139 -Boost vs. KDE and RSDE
(Figure: estimated densities for the test cases C (skewed-unimodal), H (bimodal), L (quadrimodal).)
140 Future of information geometry
Deepening geometry: Poincaré conjecture, optimal transport, free probability
Statistics and learning: compressed sensing, semiparametrics
Quantum information, dynamical systems
Mathematical evolutionary theory, Dempster-Shafer theory
141 Poincaré conjecture 1904
Every simply connected, closed 3-manifold is homeomorphic to the 3-sphere
Ricci flow by Richard Hamilton in 1981; Perelman 2002, 2003. Optimal control theory due to Pontryagin and Bellman.
Wasserstein space: d_W(f, g)² = min{ E||X - Y||² : X ~ f, Y ~ g }
Optimal transport
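A one-dimensional illustration of this definition (illustrative only, not from the slides): in one dimension the optimal coupling pairs the sorted samples, so the 2-Wasserstein distance can be estimated by sorting.

```python
import numpy as np

def w2_distance_1d(x, y):
    """Sample 2-Wasserstein distance between two equal-size 1-D samples via quantile coupling."""
    xs, ys = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((xs - ys) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 100_000)
y = rng.normal(2.0, 1.0, 100_000)
print(w2_distance_1d(x, y))   # ≈ 2.0; for equal-variance normals d_W equals the mean shift
```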
142 Optimal transport
Theorem (Brenier, 1991)
Talagrand inequality
Log-Sobolev inequality
Optimal transport theory is being extended to general manifolds.
C. Villani (2010): optimal transport theory recognized with the Fields Medal.
Geometers now consider not only a space, but a family of distributions on the space.
143 Thank you
144