Supervised Learning: Setup

Laplace approximation

Fall 2019

Instructor: Shandian Zhe ([email protected]), School of Computing


Outline

• Laplace approximation
• Bayesian logistic regression


Laplace approximation

• Objective: construct a Gaussian distribution that approximates the target distribution
• Method: second-order Taylor expansion at the posterior mode (i.e., the MAP estimate)

Laplace approximation

• Given a joint probability p(θ, D), how do we compute (approximate) the posterior p(θ | D)?
• Let us do MAP estimation first:

  θ_0 = argmax_θ log p(θ, D) = argmax_θ log p(θ | D)

  f(θ) ≜ log p(θ, D)
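To make the MAP step concrete, here is a minimal sketch (mine, not from the slides) that finds θ_0 by numerically maximizing the log joint with SciPy. The conjugate Gaussian model inside `log_joint` is a made-up example, chosen so the answer can be checked analytically.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical example: theta has a N(0, 1) prior and each observation is N(theta, 1),
# so log p(theta, D) = log p(theta) + log p(D | theta) up to constants.
data = np.array([0.8, 1.3, 0.4, 1.1])

def log_joint(theta):
    log_prior = -0.5 * theta**2
    log_lik = -0.5 * np.sum((data - theta) ** 2)
    return log_prior + log_lik

# MAP estimation: theta_0 = argmax_theta log p(theta, D)
res = minimize(lambda t: -log_joint(t[0]), x0=np.zeros(1))
theta0 = res.x[0]
print("MAP estimate theta_0 =", theta0)   # analytically: sum(data) / (N + 1)
```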

Laplace approximation

• We then expand the log joint probability at the posterior mode:

  f(θ) ≈ f(θ_0) + ∇f(θ_0)^T (θ − θ_0) + ½ (θ − θ_0)^T ∇∇f(θ_0) (θ − θ_0)

  ∇f(θ_0) = 0
  ∇∇f(θ_0) ⪯ 0    (Why?)

• Therefore

  f(θ) ≈ f(θ_0) − ½ (θ − θ_0)^T A (θ − θ_0),    where A = −∇∇f(θ_0) ⪰ 0

Laplace approximation

  f(θ) ≈ f(θ_0) − ½ (θ − θ_0)^T A (θ − θ_0),    A = −∇∇f(θ_0) ⪰ 0

  p(θ, D) ≈ p(θ_0, D) exp( −½ (θ − θ_0)^T A (θ − θ_0) )    ← Gaussian!

  p(θ | D) ≈ N(θ | θ_0, A^{-1}),    A = −∇∇ log p(θ, D) |_{θ=θ_0}
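Putting the pieces together, the following sketch (an illustration, not the lecture's code) implements the generic recipe: find the mode θ_0 numerically, evaluate the Hessian of the log joint there by finite differences, and return the Gaussian N(θ | θ_0, A^{-1}). The two-dimensional `log_joint` is a toy stand-in for log p(θ, D).

```python
import numpy as np
from scipy.optimize import minimize

def log_joint(theta):
    # Toy unnormalized log density (stands in for log p(theta, D) in the slides).
    A_true = np.array([[2.0, 0.5], [0.5, 1.0]])
    return -0.5 * theta @ A_true @ theta

def numerical_hessian(f, x, eps=1e-5):
    """Finite-difference Hessian of a scalar function f at x."""
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps**2)
    return H

def laplace_approximation(log_joint, theta_init):
    # 1) Mode: theta_0 = argmax log p(theta, D)
    res = minimize(lambda t: -log_joint(t), theta_init)
    theta0 = res.x
    # 2) Curvature: A = -Hessian of log p(theta, D) at theta_0
    A = -numerical_hessian(log_joint, theta0)
    # 3) Gaussian approximation: p(theta | D) ≈ N(theta | theta_0, A^{-1})
    return theta0, np.linalg.inv(A)

mean, cov = laplace_approximation(log_joint, np.zeros(2))
print(mean)   # ≈ [0, 0]
print(cov)    # ≈ inverse of A_true
```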

Laplace approximation (Bishop, PRML §4.4, p. 215)

[Figure 4.14: the Laplace approximation applied to the distribution p(z) ∝ exp(−z²/2) σ(20z + 4), where σ(z) = (1 + e^{−z})^{-1} is the logistic sigmoid. The left plot shows the normalized distribution p(z) (yellow: true) together with the Laplace approximation centred on the mode z_0 of p(z) (red: Laplace approx.); the right plot shows the negative logarithms of the corresponding curves.]

We can extend the Laplace method to approximate a distribution p(z) = f(z)/Z defined over an M-dimensional space z. At a stationary point z_0 the gradient ∇f(z) will vanish. Expanding around this stationary point we have

  ln f(z) ≈ ln f(z_0) − ½ (z − z_0)^T A (z − z_0)    (4.131)

where the M × M Hessian matrix A is defined by

  A = −∇∇ ln f(z) |_{z=z_0}    (4.132)

and ∇ is the gradient operator. Taking the exponential of both sides we obtain

  f(z) ≈ f(z_0) exp( −½ (z − z_0)^T A (z − z_0) ).    (4.133)

The distribution q(z) is proportional to f(z) and the appropriate normalization coefficient can be found by inspection, using the standard result (2.43) for a normalized multivariate Gaussian, giving

  q(z) = (|A|^{1/2} / (2π)^{M/2}) exp( −½ (z − z_0)^T A (z − z_0) ) = N(z | z_0, A^{-1})    (4.134)

where |A| denotes the determinant of A. This Gaussian distribution will be well defined provided its precision matrix, given by A, is positive definite, which implies that the stationary point z_0 must be a local maximum, not a minimum or a saddle point.

In order to apply the Laplace approximation we first need to find the mode z_0, and then evaluate the Hessian matrix at that mode. In practice a mode will typically be found by running some form of numerical optimization algorithm.
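As a concrete check of (4.131)–(4.134), here is a short sketch (not from the slides) that computes the Laplace approximation for the one-dimensional density of Figure 4.14, p(z) ∝ exp(−z²/2) σ(20z + 4): locate the mode z_0 numerically and set the variance to the inverse of the negative second derivative of ln f at z_0.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def log_f(z):
    # Unnormalized log density from Figure 4.14: f(z) = exp(-z^2 / 2) * sigmoid(20 z + 4)
    return -0.5 * z**2 + np.log(1.0 / (1.0 + np.exp(-(20 * z + 4))))

# Mode z_0 of f(z), searched over the plotted range
res = minimize_scalar(lambda z: -log_f(z), bounds=(-2, 4), method="bounded")
z0 = res.x

# Precision A = -d^2 ln f / dz^2 at z_0, by a central finite difference
eps = 1e-4
A = -(log_f(z0 + eps) - 2 * log_f(z0) + log_f(z0 - eps)) / eps**2

# Laplace approximation: q(z) = N(z | z_0, 1/A)
print("mode z_0 =", z0, "variance =", 1.0 / A)
```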

Outline

• Laplace approximation
• Bayesian logistic regression

4.5. Bayesian Logistic Regression (Bishop, PRML, p. 217)

where θMAP is the value of θ at the mode of the posterior distribution, and A is the Hessian matrix of second derivatives of the negative log posterior

  A = −∇∇ ln p(D | θ_MAP) p(θ_MAP) = −∇∇ ln p(θ_MAP | D).    (4.138)

The first term on the right hand side of (4.137) represents the log likelihood evaluated using the optimized parameters, while the remaining three terms comprise the 'Occam factor' which penalizes model complexity.

If we assume that the Gaussian prior distribution over parameters is broad, and that the Hessian has full rank, then we can approximate (4.137) very roughly using

  ln p(D) ≈ ln p(D | θ_MAP) − ½ M ln N    (4.139)

where N is the number of data points, M is the number of parameters in θ, and we have omitted additive constants. This is known as the Bayesian Information Criterion (BIC) or the Schwarz criterion (Schwarz, 1978). Note that, compared to AIC given by (1.73), this penalizes model complexity more heavily.
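As a small illustration of (4.139), the helper below scores a fitted model by its maximized log likelihood minus the (M/2) ln N penalty; the log-likelihood values and parameter counts are made up for the example.

```python
import numpy as np

def bic_approx(log_lik_at_map, num_params, num_points):
    """Rough model-evidence approximation (4.139):
    ln p(D) ≈ ln p(D | theta_MAP) - (M / 2) * ln N."""
    return log_lik_at_map - 0.5 * num_params * np.log(num_points)

# Hypothetical comparison of two fitted models on the same N = 500 points:
print(bic_approx(log_lik_at_map=-1210.4, num_params=5,  num_points=500))
print(bic_approx(log_lik_at_map=-1203.9, num_params=25, num_points=500))
```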

Complexity measures such as AIC and BIC have the virtue of being easy to evaluate, but can also give misleading results. In particular, the assumption that the Hessian matrix has full rank is often not valid since many of the parameters are not 'well-determined'. We can use the result (4.137) to obtain a more accurate estimate of the model evidence starting from the Laplace approximation, as we illustrate in the context of neural networks in Section 5.7.

4.5. Bayesian Logistic Regression

We now turn to a Bayesian treatment of logistic regression. Exact Bayesian inference for logistic regression is intractable. In particular, evaluation of the posterior distribution would require normalization of the product of a prior distribution and a likelihood function that itself comprises a product of logistic sigmoid functions, one for every data point. Evaluation of the predictive distribution is similarly intractable. Here we consider the application of the Laplace approximation to the problem of Bayesian logistic regression (Spiegelhalter and Lauritzen, 1990; MacKay, 1992b).

4.5.1 Laplace approximation

Recall from Section 4.4 that the Laplace approximation is obtained by finding the mode of the posterior distribution and then fitting a Gaussian centred at that mode. This requires evaluation of the second derivatives of the log posterior, which is equivalent to finding the Hessian matrix.
Bayesian logistic regression

• Logistic regression model: p(C_1 | φ) = y(φ) = σ(w^T φ)    (4.87), where σ(·) is the logistic sigmoid, with derivative dσ/da = σ(1 − σ)    (4.88).
• Given a data set {φ_n, t_n}, where t_n ∈ {0, 1} and φ_n = φ(x_n), with n = 1, ..., N, the likelihood function can be written

  p(t | w) = ∏_{n=1}^N y_n^{t_n} (1 − y_n)^{1 − t_n}    (4.89)

where t = (t_1, ..., t_N)^T and y_n = p(C_1 | φ_n). As usual, we can define an error function by taking the negative logarithm of the likelihood, which gives the cross-entropy error function in the form

  E(w) = −ln p(t | w) = −∑_{n=1}^N { t_n ln y_n + (1 − t_n) ln(1 − y_n) }    (4.90)

where y_n = σ(a_n) and a_n = w^T φ_n. Taking the gradient of the error function with respect to w, we obtain

  ∇E(w) = ∑_{n=1}^N (y_n − t_n) φ_n    (4.91)

where we have made use of (4.88). We see that the factor involving the derivative of the logistic sigmoid has cancelled, leading to a simplified form for the gradient of the log likelihood. In particular, the contribution to the gradient from data point n is given by the 'error' y_n − t_n between the target value and the prediction of the model, times the basis function vector φ_n. Furthermore, comparison with (3.13) shows that this takes precisely the same form as the gradient of the sum-of-squares error function for the linear regression model. If desired, we could make use of the result (4.91) to give a sequential algorithm in which patterns are presented one at a time, and in which each of the weight vectors is updated using (3.22), where ∇E_n is the nth term in (4.91).

It is worth noting that maximum likelihood can exhibit severe over-fitting for data sets that are linearly separable. This arises because the maximum likelihood solution occurs when the hyperplane corresponding to σ = 0.5, equivalent to w^T φ = 0, separates the two classes and the magnitude of w goes to infinity. In this case, the logistic sigmoid function becomes infinitely steep in feature space, corresponding to a Heaviside step function, so that every training point from each class k is assigned a posterior probability p(C_k | x) = 1. Furthermore, there is typically a continuum of such solutions because any separating hyperplane will give rise to the same posterior probabilities at the training data points, as will be seen later in Figure 10.13. Maximum likelihood provides no way to favour one such solution over another, and which solution is found in practice will depend on the choice of optimization algorithm and on the parameter initialization. Note that the problem will arise even if the number of data points is large compared with the number of parameters in the model, so long as the training data set is linearly separable. The singularity can be avoided by inclusion of a prior and finding a MAP solution for w, or equivalently by adding a regularization term to the error function.

Because we seek a Gaussian representation for the posterior distribution, it is natural to begin with a Gaussian prior, which we write in the general form

  p(w) = N(w | m_0, S_0)    (4.140)

where m_0 and S_0 are fixed hyperparameters. The posterior distribution over w is given by

  p(w | t) ∝ p(w) p(t | w)    (4.141)

where t = (t_1, ..., t_N)^T. Taking the log of both sides, and substituting for the prior distribution using (4.140) and for the likelihood function using (4.89), we obtain

  ln p(w | t) = −½ (w − m_0)^T S_0^{-1} (w − m_0) + ∑_{n=1}^N { t_n ln y_n + (1 − t_n) ln(1 − y_n) } + const    (4.142)

where y_n = σ(w^T φ_n). To obtain a Gaussian approximation to the posterior distribution, we first maximize the posterior distribution to give the MAP (maximum posterior) solution w_MAP, which defines the mean of the Gaussian. The covariance is then given by the inverse of the matrix of second derivatives of the negative log likelihood, which takes the form

  S_N^{-1} = −∇∇ ln p(w | t) = S_0^{-1} + ∑_{n=1}^N y_n (1 − y_n) φ_n φ_n^T.    (4.143)

The Gaussian approximation to the posterior distribution therefore takes the form

  q(w) = N(w | w_MAP, S_N).    (4.144)

Having obtained a Gaussian approximation to the posterior distribution, there remains the task of marginalizing with respect to this distribution in order to make predictions.
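A minimal sketch of (4.140)–(4.144) for Bayesian logistic regression, assuming a Gaussian prior N(w | m_0, S_0): find w_MAP by numerically minimizing the negative of (4.142), then form S_N from (4.143). The synthetic data and the helper names (`fit_laplace_logistic`, etc.) are mine, not the lecture's.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_laplace_logistic(Phi, t, m0, S0):
    """Laplace approximation to the logistic-regression posterior:
    returns (w_MAP, S_N) so that q(w) = N(w | w_MAP, S_N), following (4.140)-(4.144)."""
    S0_inv = np.linalg.inv(S0)

    def neg_log_posterior(w):
        y = np.clip(sigmoid(Phi @ w), 1e-12, 1 - 1e-12)
        log_prior = -0.5 * (w - m0) @ S0_inv @ (w - m0)           # Gaussian prior (4.140)
        log_lik = np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))  # cross-entropy terms
        return -(log_prior + log_lik)                              # negative of (4.142)

    w_map = minimize(neg_log_posterior, x0=np.zeros(Phi.shape[1])).x
    y = sigmoid(Phi @ w_map)
    # Precision of the Gaussian approximation, eq. (4.143)
    SN_inv = S0_inv + (Phi * (y * (1 - y))[:, None]).T @ Phi
    return w_map, np.linalg.inv(SN_inv)

# Tiny synthetic example (made up): 1-D inputs with a bias feature phi(x) = [1, x].
rng = np.random.default_rng(0)
x = rng.normal(size=100)
t = (rng.uniform(size=100) < sigmoid(2.0 * x - 0.5)).astype(float)
Phi = np.column_stack([np.ones_like(x), x])
w_map, S_N = fit_laplace_logistic(Phi, t, m0=np.zeros(2), S0=10.0 * np.eye(2))
print("w_MAP:", w_map, "posterior variances:", np.diag(S_N))
```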
4.5.2 Predictive distribution

The predictive distribution for class C_1, given a new feature vector φ(x), is obtained by marginalizing with respect to the posterior distribution p(w | t), which is itself approximated by a Gaussian distribution q(w), so that

  p(C_1 | φ, t) = ∫ p(C_1 | φ, w) p(w | t) dw ≈ ∫ σ(w^T φ) q(w) dw    (4.145)

with the corresponding probability for class C_2 given by p(C_2 | φ, t) = 1 − p(C_1 | φ, t). To evaluate the predictive distribution, we first note that the function σ(w^T φ) depends on w only through its projection onto φ. Denoting a = w^T φ, we have

  σ(w^T φ) = ∫ δ(a − w^T φ) σ(a) da    (4.146)

where δ(·) is the Dirac delta function. From this we obtain

  ∫ σ(w^T φ) q(w) dw = ∫ σ(a) p(a) da    (4.147)

where p(a) is the (Gaussian) marginal density of a = w^T φ under q(w).

Bayesian logistic regression

• Predictive distribution: given a new input φ,

  q(w) = N(w | w_MAP, S_N)

  a = w^T φ

  q(a) = N(a | w_MAP^T φ, φ^T S_N φ)

  p(C_1 | φ, t) ≈ ∫ σ(a) q(a) da    ← numerical quadrature
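A sketch of the quadrature step, assuming `w_map` and `S_N` come from a fit like the one above: a = w^T φ is Gaussian under q(w) with mean w_MAP^T φ and variance φ^T S_N φ, so the one-dimensional integral ∫ σ(a) q(a) da can be evaluated with Gauss–Hermite quadrature. The helper name `predictive_prob` is mine.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive_prob(phi, w_map, S_N, n_quad=50):
    """p(C_1 | phi, t) ≈ ∫ sigma(a) N(a | mu_a, var_a) da, evaluated by
    Gauss-Hermite quadrature, with mu_a = w_MAP^T phi and var_a = phi^T S_N phi."""
    mu_a = w_map @ phi
    var_a = phi @ S_N @ phi
    # Gauss-Hermite rules integrate f(x) exp(-x^2) dx; substitute a = mu + sqrt(2 var) x.
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    a = mu_a + np.sqrt(2.0 * var_a) * nodes
    return np.sum(weights * sigmoid(a)) / np.sqrt(np.pi)

# Usage with the (hypothetical) w_map, S_N from the previous sketch:
# print(predictive_prob(np.array([1.0, 0.3]), w_map, S_N))
```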

What you need to know

• The general idea of the Laplace approximation
• Being able to implement it
