Machine Learning
Bert Kappen
SNN Radboud University, Nijmegen
Gatsby Unit, UCL London
October 9, 2020

Course setup

• 3 ec course, 7 weeks.
• Lectures are prerecorded and available through Brightspace.
• Examination is based on weekly exercises only:
  – You form a group of at most 3 persons.
  – Each week, you make the exercises with your group.
  – You can ask questions in the tutorial class.
  – Your group hands the exercises in before the next tutorial class. This is a hard rule, because answers may be discussed in the tutorial class.
• All course materials (slides, exercises) and the schedule are available via http://www.snn.ru.nl/~bertk/machinelearning/

Content

1. Probabilities, Bayes rule, information theory
2. Model selection
3. Classification, perceptron, gradient descent learning rules
4. Multi-layered perceptrons
5. Deep learning
6. Graphical models, latent variable models, EM
7. Variational autoencoders

Lecture 1a

Based on MacKay ch. 2: Probability, Entropy and Inference
• Probabilities
• Bayesian inference, inverse probabilities, Bayes rule

Forward probabilities

Forward probabilities are the usual way to compute the probabilities of possible outcomes given a probability model:
$$p(x|f) \;\rightarrow\; p(\mathrm{outcome}|f)$$

Exercise 2.4. An urn contains B black balls and W white balls, K = W + B balls in total. We draw a ball from the urn N times with replacement. What is the probability of drawing N_B black balls?

Define f = B/K, the fraction of black balls in the urn. Then
$$p(N_B = 0) = (1-f)^N, \qquad p(N_B = 1) = N f (1-f)^{N-1}, \qquad p(N_B) = \binom{N}{N_B} f^{N_B} (1-f)^{N-N_B}$$
Expected value and variance:
$$\mathbb{E}[N_B] = \sum_{N_B=0}^{N} p(N_B)\, N_B = \ldots = Nf, \qquad \mathbb{V}[N_B] = \sum_{N_B=0}^{N} p(N_B)\,(N_B - Nf)^2 = \ldots = Nf(1-f)$$
Suppose K = 10, B = 2, f = 0.2. When N = 5, N_B ≈ 1 ± 0.9. When N = 400, N_B ≈ 80 ± 8.

Inverse probabilities

Inverse probabilities do the converse: given a specific outcome, what is the probability that it was generated by a model with parameters f?
$$\mathrm{outcome} \;\rightarrow\; p(f|\mathrm{outcome})$$

Exercise 2.6. There are 11 urns u = 0, ..., 10, each containing 10 balls. Urn u has u black balls and 10 − u white balls. We select one urn at random and draw N times with replacement from that urn. The outcome is that after N = 10 draws there are N_B = 3 black balls. What is the probability that urn u was selected?

We treat u and N_B as random variables. N is given and provides the context, or condition, in which the probabilities are calculated.
$$p(u, N_B|N) = p(u|N)\, p(N_B|u, N), \qquad p(u|N_B, N) = \frac{p(u|N)\, p(N_B|u, N)}{p(N_B|N)}$$
$$p(N_B|u, N) = \binom{N}{N_B} f_u^{N_B} (1-f_u)^{N-N_B}, \qquad f_u = \frac{u}{10}, \qquad p(u|N) = p(u) = \frac{1}{11}$$
$$p(N_B|N) = \sum_{u=0}^{10} p(u)\, p(N_B|u, N) = \frac{1}{11} \sum_{u=0}^{10} \binom{N}{N_B} f_u^{N_B} (1-f_u)^{N-N_B}$$
$$p(u|N_B, N) = \frac{f_u^{N_B} (1-f_u)^{N-N_B}}{\sum_{u'=0}^{10} f_{u'}^{N_B} (1-f_{u'})^{N-N_B}}$$

[Figure. Left: joint probability p(u, N_B|N). Right: conditional probability p(u|N_B = 3, N = 10), with values
u = 0: 0, 1: 0.063, 2: 0.22, 3: 0.29, 4: 0.24, 5: 0.13, 6: 0.047, 7: 0.0099, 8: 0.00086, 9: 0.0000096, 10: 0.]
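The posterior above is easy to check numerically. The following short Python sketch is not part of the original slides and assumes only NumPy; it reproduces the values in the table above and also prints the predictive probability of drawing another black ball, which is worked out in "Exercise 2.6 continued" below.

```python
import numpy as np

# Exercise 2.6: 11 urns u = 0..10, urn u has fraction f_u = u/10 black balls.
# Observed: N_B = 3 black balls in N = 10 draws with replacement.
N, N_B = 10, 3
f = np.arange(11) / 10.0                  # f_u for u = 0, ..., 10

prior = np.full(11, 1 / 11)               # uniform prior p(u)
likelihood = f**N_B * (1 - f)**(N - N_B)  # binomial coefficient cancels in the posterior
posterior = prior * likelihood
posterior /= posterior.sum()              # Bayes' rule: p(u | N_B, N)

for u, p in enumerate(posterior):
    print(f"u = {u:2d}   p(u | N_B=3, N=10) = {p:.4f}")

# Predictive probability that the next ball drawn is black (see the continuation below):
print("p(black | N_B=3, N=10) =", round((f * posterior).sum(), 3))   # 0.333
```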
Bayesian inference

Note that our 'inference' has resulted in a distribution over u. We do not know which urn was selected, only probabilities. This is the best we can do given the data. This is called Bayesian inference. In general, when the models are parametrized by θ, the procedure is:

1. Take a given data set.
2. Specify the prior over models, p(θ).
3. Compute the likelihood of the observed data under the model with parameters θ: p(data|θ).
4. Compute the posterior using Bayes' rule:
$$p(\theta|\mathrm{data}) = \frac{p(\mathrm{data}|\theta)\, p(\theta)}{p(\mathrm{data})}, \qquad p(\mathrm{data}) = \int d\theta\, p(\mathrm{data}|\theta)\, p(\theta)$$

Exercise 2.6 continued

We draw another ball from the same urn. Given the observations so far, what is the probability that it is black? This is computed from the posterior:
$$p(\mathrm{black}|N_B = 3, N = 10) = \sum_{u=0}^{10} p(\mathrm{black}|u, N_B, N)\, p(u|N_B, N) = \sum_{u=0}^{10} f_u\, p(u|N_B, N) = \sum_{u=0}^{10} f_u\, \frac{f_u^{N_B} (1-f_u)^{N-N_B}}{\sum_{u'=0}^{10} f_{u'}^{N_B} (1-f_{u'})^{N-N_B}} = 0.333$$
Compare: the prediction from the most likely urn alone (u = 3) would give p(black|u = 3) = f_3 = 3/10 = 0.3.

Exercise 2.7. The bent coin

A bent coin has probability f to come up heads. We do not know f. We toss the coin N times and get N_H heads. What is f?

f is fixed but unknown. The key conceptual step is to treat f as a random variable. Assume a prior over f, p(f); it expresses our (subjective) prior belief in the value of f. Given f and N, we know the likelihood of the observation:
$$p(N_H|f, N) = \binom{N}{N_H} f^{N_H} (1-f)^{N-N_H}$$
The posterior is
$$p(f|N_H, N) = \frac{p(N_H|f, N)\, p(f)}{p(N_H|N)}$$

Intermezzo: the Beta distribution

The Beta distribution is a probability density over a continuous random variable x ∈ [0, 1], defined by two shape parameters a, b > 0:
$$\mathrm{Beta}(x|a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, x^{a-1} (1-x)^{b-1}, \qquad 0 \le x \le 1$$
The prefactor ensures normalisation:
$$\int_0^1 dx\, x^{a-1} (1-x)^{b-1} = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$$

[Figure: Beta(x|a, b) for (a, b) = (0.1, 0.1), (1, 1), (2, 3) and (8, 4).]

Mean and variance:
$$\mathbb{E}[x] = \frac{a}{a+b}, \qquad \mathbb{V}[x] = \frac{ab}{(a+b)^2(a+b+1)}$$
Note that with a = fN and b = (1 − f)N we get
$$\mathbb{E}[x] = f, \qquad \mathbb{V}[x] = \frac{f(1-f)}{N+1},$$
so the distribution becomes more peaked for large N.

Exercise 2.7. The bent coin, continued

Assume that we have no prior knowledge of f, so p(f) = 1. The posterior is
$$p(f|N_H, N) = \frac{p(N_H|f, N)\, p(f)}{p(N_H|N)} = \frac{\binom{N}{N_H}}{p(N_H|N)}\, f^{N_H} (1-f)^{N-N_H} = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, f^{a-1} (1-f)^{b-1}$$
with N_H = a − 1 and N − N_H = b − 1. From this we can infer that
$$p(N_H|N) = \binom{N}{N_H} \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)} = \frac{N!}{N_H!\,(N-N_H)!}\, \frac{N_H!\,(N-N_H)!}{(N+1)!} = \frac{1}{N+1}$$
where we used Γ(x) = (x − 1)! for integer x. Since p(N_H|N) = ∫ df p(N_H|f, N) p(f), we conclude that when integrating over all models, all outcomes N_H are equally likely.

So, given our experiment in which we have tossed the coin N times and observed N_H heads, what is the probability to observe another head?
$$p(H|N_H, N) = \int df\, p(H|f)\, p(f|N_H, N) = \int df\, f\, \mathrm{Beta}(f|a = N_H + 1, b = N - N_H + 1) = \frac{a}{a+b} = \frac{N_H + 1}{N + 2}$$
Suppose N = N_H = 1. The naive answer is f = 1 and thus p(H|f) = 1. The Bayesian answer is p(f|N_H, N) = 2f, which gives p(H|N_H, N) = 2/3.
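As a sanity check on the rule-of-succession result p(H|N_H, N) = (N_H + 1)/(N + 2), here is a small Python sketch (again not part of the slides, assuming NumPy). It compares the closed-form answer with a brute-force average of f under the posterior evaluated on a grid.

```python
import numpy as np

# Bent coin (Exercise 2.7) with a flat prior p(f) = 1 on [0, 1]:
# the posterior is Beta(f | a = N_H + 1, b = N - N_H + 1), so the
# predictive probability of another head is a / (a + b) = (N_H + 1) / (N + 2).
def predict_head(N_H, N):
    a, b = N_H + 1, N - N_H + 1
    return a / (a + b)

# Numerical cross-check: discretise f on a grid and average f under the posterior.
def predict_head_numeric(N_H, N, n_grid=100_000):
    f = np.linspace(0.0, 1.0, n_grid)
    weights = f**N_H * (1 - f)**(N - N_H)   # flat prior, so posterior ∝ likelihood
    weights /= weights.sum()                # normalise on the grid
    return (f * weights).sum()

print(predict_head(1, 1), predict_head_numeric(1, 1))     # both ≈ 2/3, the Bayesian answer
print(predict_head(3, 10), predict_head_numeric(3, 10))   # (3 + 1) / (10 + 2) = 1/3
```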
Lecture 1b

Based on MacKay ch. 2: Probability, Entropy and Inference
• Entropy as information
• Maximum entropy, and its relation to exponential models
• The KL divergence
  – use in variational approximation
  – relation to maximum likelihood

Entropy as information

Information is a measure of the 'degree of surprise' that a certain value gives us, given that we know the distribution. Unlikely events are informative, likely events less so; thus information decreases with the probability of the event. Given p(x) = δ(x) (no uncertainty), observing x gives us no additional information.

Let us denote by h(x) the information of x. Then, if x and y are independent, we want information to be additive: h(x, y) = h(x) + h(y). Since p(x, y) = p(x)p(y), we see that h(x) = −log2 p(x) is a good candidate to quantify the information in x. The expected information
$$H[x] := \mathbb{E}[-\log_2 p(x)] = -\sum_x p(x) \log_2 p(x)$$
is the entropy of the distribution p.

[Figure: two example distributions over outcomes, with entropies H = 1.77 and H = 3.09.]

When p is sharply peaked (p(x_1) = 1, p(x_2) = ... = p(x_M) = 0), the entropy is
$$H[x] = -1 \log 1 - (M-1)\, 0 \log 0 = 0$$
When p is flat (p(x_i) = 1/M), the entropy is maximal:
$$H[x] = -M\, \frac{1}{M} \log \frac{1}{M} = \log M$$
In this section on entropy as information, log denotes log2.
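A minimal Python sketch (not from the slides, assuming NumPy) of the entropy formula, using the convention 0 log 0 = 0 from the peaked example above. It confirms H = 0 for a fully peaked distribution and H = log2 M for a flat one.

```python
import numpy as np

# Entropy in bits, with the convention 0 * log2(0) = 0 used on the slide.
def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0                           # skip zero-probability outcomes
    return -np.sum(p[nz] * np.log2(p[nz]))

M = 8
peaked = np.eye(M)[0]                    # p(x_1) = 1, all other outcomes 0
flat = np.full(M, 1 / M)                 # uniform distribution over M outcomes

print(entropy(peaked))                   # 0.0
print(entropy(flat), np.log2(M))         # both equal log2(M) = 3 bits
```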