Laplace’s rule of succession in Information Geometry

Yann Ollivier

CNRS & Paris-Saclay University, France

Second International Conference on Geometric Science of Information (GSI 2015), École polytechnique, October 29, 2015


Sequential prediction

Sequential prediction problem: given observations x_1, ..., x_t, build a probabilistic model p^{t+1} for x_{t+1}, iteratively.

Example: given that w women and m men entered this room, what is the probability that the next person who enters is a woman/man?

Common performance criterion for prediction: cumulated log-loss

L_T := -\sum_{t=0}^{T-1} \log p^{t+1}(x_{t+1} \mid x_{1\dots t})

to be minimized. This corresponds to compression cost, and is also equal to square loss for Gaussian models.
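To make the criterion concrete, here is a minimal Python sketch (not from the talk) that accumulates the log-loss of a sequential predictor on a binary sequence; the predictor interface is an assumption made for this illustration.

import math

def cumulated_log_loss(xs, predictor):
    # L_T = -sum_t log p^{t+1}(x_{t+1} | x_{1..t}) for a binary sequence xs.
    # `predictor(past)` must return the predicted probability that the next
    # symbol equals 1, given the list `past` of previous symbols (0/1).
    loss = 0.0
    for t in range(len(xs)):
        p1 = predictor(xs[:t])                  # prediction made before seeing xs[t]
        p = p1 if xs[t] == 1 else 1.0 - p1
        loss -= math.log(p)
    return loss

# A constant predictor p(1) = 0.5 pays log 2 per symbol:
print(cumulated_log_loss([1, 0, 1, 1, 0, 1], lambda past: 0.5))   # 6 * log 2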


Maximum likelihood estimator

Maximum likelihood strategy: Fix a parametric model p_θ(x). At each time, use the best parameter based on past observations:

\theta_t^{ML} := \arg\max_\theta \prod_{s \leq t} p_\theta(x_s) = \arg\min_\theta \left\{ -\sum_{s \leq t} \log p_\theta(x_s) \right\}

Then, predict x_{t+1} using this \theta_t^{ML}:

p^{ML}(x_{t+1} \mid x_{1\dots t}) := p_{\theta_t^{ML}}(x_{t+1})

This is the maximum likelihood or “plug-in” estimator.

Heavily used in machine learning. Argmax often computed via gradient descent.
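A minimal sketch of the plug-in strategy for a Bernoulli model, where the argmax is available in closed form so no gradient descent is needed; the function name and interface are mine, chosen for this example.

def ml_plugin_predictor(past):
    # Fit theta_t^ML = (#ones) / t on the past, then plug it in to predict x_{t+1}.
    t = len(past)
    if t == 0:
        raise ValueError("the plug-in predictor is undefined before the first observation")
    theta_ml = sum(past) / t
    return theta_ml            # predicted probability that the next symbol is 1

# After seeing w = 3 women (coded 0) and m = 1 man (coded 1):
print(ml_plugin_predictor([0, 0, 0, 1]))   # 0.25 = m / (w + m)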



Problems with the ML estimator

Example: Given that there are w women and m men in this room, the next person to enter will be a man with probability m/(w + m) according to the ML estimator.

=⇒ The ML estimator is not satisfying:

▶ How do you predict the first observation?

▶ Zero-frequency problem: If you have seen only women so far, the probability to see a man is estimated to 0.

▶ Often overfits in machine learning.



Laplace’s rule of succession

Laplace suggested a quick fix for these problems: add one to the counts of each possibility. That is, predict according to

p(\text{woman}) = \frac{w + 1}{w + m + 2}, \qquad p(\text{man}) = \frac{m + 1}{w + m + 2}

instead of w/(w+m) and m/(w+m). This is Laplace’s rule of succession.

▶ Solves the zero-frequency problem: After having seen t women and no men, the probability to see a man is estimated to 1/(t + 2).

▶ Generalizes to other discrete data (“additive smoothing”).

▶ May seem arbitrary but has a beautiful Bayesian interpretation.
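A sketch of the rule and its “additive smoothing” generalization to discrete data (names and interface are mine); a pseudocount of 1 gives Laplace’s rule, other values give other smoothers.

from collections import Counter

def additive_smoothing_predictor(past, alphabet, pseudocount=1.0):
    # Add `pseudocount` to the count of every symbol before normalizing.
    # With pseudocount = 1 this is exactly Laplace's rule of succession.
    counts = Counter(past)
    total = len(past) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}

# w = 3 women and m = 0 men seen so far:
print(additive_smoothing_predictor(["W"] * 3, ["W", "M"]))
# {'W': 0.8, 'M': 0.2}  -- no zero-frequency problem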


Bayesian predictors

Bayesian predictors start with a parametric model p_θ(x) together with a prior α(θ) on θ. At time t, the next symbol x_{t+1} is predicted by mixing all possible models p_θ with all values of θ,

p^{\alpha\text{-Bayes}}(x_{t+1} \mid x_{1\dots t}) = \int_\theta p_\theta(x_{t+1})\, q_t(\theta)\, d\theta

where q_t(θ) is the Bayesian posterior on θ given data x_{1...t},

q_t(\theta) \propto \alpha(\theta) \prod_{s \leq t} p_\theta(x_s)


Proposition (folklore). For Bernoulli distributions on a binary variable, e.g., {woman, man}, Laplace’s rule coincides with the Bayesian predictor with a uniform prior on the Bernoulli parameter θ ∈ [0; 1].
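A quick numerical check of the proposition in the Bernoulli case (my own sketch, using scipy for the integral over θ): mixing over a uniform prior reproduces (m + 1)/(w + m + 2).

from scipy.integrate import quad

def bayes_predictor_uniform(w, m):
    # p(next person is a man | data), Bernoulli parameter theta = P(man),
    # uniform prior on theta in [0, 1], computed by numerical integration.
    likelihood = lambda th: th**m * (1 - th)**w
    numerator, _ = quad(lambda th: th * likelihood(th), 0.0, 1.0)
    evidence, _ = quad(likelihood, 0.0, 1.0)
    return numerator / evidence

w, m = 5, 2
print(bayes_predictor_uniform(w, m))    # 0.3333...
print((m + 1) / (w + m + 2))            # Laplace's rule: 3/9 = 0.3333...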


Bayesian predictors (2)

Bayesian predictors

▶ solve the zero-frequency problem

▶ have theoretical guarantees

▶ do not overfit

but

▶ are difficult to compute: one must perform an integral over θ, and keep track of the Bayesian posterior, which is an arbitrary function of θ.

Is there a simple way to approximate Bayesian predictors that would generalize Laplace’s rule?



Theorem. Let (p_θ) be an exponential family of probability distributions.

Under suitable regularity conditions, these two predictors coincide at first order in 1/t:

▶ The Bayesian predictor using the non-informative Jeffreys prior α(θ) ∝ √(det I(θ)), with I(θ) the Fisher information matrix.

▶ The average

\frac{1}{2}\, p^{ML} + \frac{1}{2}\, p^{SNML}

of the maximum likelihood predictor and the “sequential normalized maximum likelihood” predictor [Shtarkov 1987, Roos, Rissanen...].

“The predictors p and p′ coincide at first order” means that

p'(x_{t+1} \mid x_{1\dots t}) = p(x_{t+1} \mid x_{1\dots t}) \left(1 + O(1/t^2)\right)

for any sequence (x_t), assuming both are non-zero.


The sequential normalized maximum likelihood predictor

The SNML predictor p^{SNML} is defined as follows. For each possible value y of x_{t+1}, let

\theta^{ML+y} := \arg\max_\theta \left\{ \log p_\theta(y) + \sum_{s \leq t} \log p_\theta(x_s) \right\}

be the value of the ML estimator if this value of x_{t+1} had already been observed.

Define

q(y) := p_{\theta^{ML+y}}(y)

Usually q is not a probability distribution: ∫_y q(y) > 1.

=⇒ Rescale q:

p^{SNML} := \frac{q}{\int q}

and use this for prediction of x_{t+1}.
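A sketch of the SNML construction for the Bernoulli case, where θ^{ML+y} has a closed form; after normalization it is exactly Laplace’s rule, as stated just below. The function name is mine.

def snml_bernoulli(past):
    # SNML predictor for a Bernoulli model on {0, 1}:
    # q(y) = p_{theta^{ML+y}}(y), then rescale q so that it sums to 1.
    t, k = len(past), sum(past)
    q = {}
    for y in (0, 1):
        theta = (k + y) / (t + 1)        # ML estimate if y had already been observed
        q[y] = theta if y == 1 else 1 - theta
    z = q[0] + q[1]                      # > 1 in general
    return {y: q[y] / z for y in q}

# 3 ones out of 5 observations:
print(snml_bernoulli([1, 1, 1, 0, 0]))   # {0: 3/7, 1: 4/7}, i.e. Laplace's rule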


Example: For a Bernoulli distribution, the SNML predictor is the same as Laplace’s rule.

The theorem states that the Bayesian predictor with canonical Jeffreys prior is approximately the average of the ML and SNML predictors.

=⇒ For Bernoulli distributions, we recover the “add-one-half” rule for the Jeffreys prior (Krichevsky–Trofimov estimator).

▶ Relatively easy to compute.

▶ Different estimators usually differ at first order in 1/t (e.g., the ML estimator or Bayesian estimators with different priors). The theorem is precise at first order in 1/t, so it recovers these differences.

▶ Multiplicative error (1 + O(1/t^2)) in the theorem yields at most a bounded difference on cumulated log-loss.
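A small numerical check of the theorem in the Bernoulli case (my own sketch, using the closed forms above): the gap between the Krichevsky–Trofimov predictor and the ML/SNML average decays like 1/t², i.e., gap·t² stays bounded.

def kt_predictor(t, k):
    # Krichevsky-Trofimov ("add-one-half") = Bayes with the Jeffreys prior.
    return (k + 0.5) / (t + 1)

def ml_snml_average(t, k):
    p_ml = k / t
    p_snml = (k + 1) / (t + 2)           # SNML = Laplace's rule for Bernoulli
    return 0.5 * (p_ml + p_snml)

for t in (10, 100, 1000):
    k = t // 3                           # keep the empirical frequency roughly fixed
    gap = abs(kt_predictor(t, k) - ml_snml_average(t, k))
    print(t, gap, gap * t**2)            # gap * t^2 stays bounded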


Corollary: From ML to Bayes

For exponential families, there is an explicit approximate formula to compute the Bayesian predictor with Jeffreys prior if the ML predictor is known:

p^{Jeffreys}(x_{t+1}) \approx p^{ML}(x_{t+1}) \left(1 + \frac{1}{2t}\, \|\partial_\theta \log p_\theta(x_{t+1})\|^2_{\text{Fisher}} - \frac{\dim(\theta)}{2t}\right)

up to O(1/t^2).

Here \|\partial_\theta \log p_\theta(x_{t+1})\|_{\text{Fisher}} is the norm of the gradient of log p_θ(x_{t+1}) in the Riemannian metric given by the Fisher information matrix. (Compare “flattened ML” [Kotłowski–Grünwald–de Rooij 2010].)

Note: valid only when these probabilities are non-zero, so this does not solve the zero-frequency problem.
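A sketch checking the corollary for the Bernoulli family (my computation, not from the slides), using I(θ) = 1/(θ(1−θ)) and dim(θ) = 1: the formula applied to the ML predictor matches the exact Jeffreys (Krichevsky–Trofimov) predictor up to O(1/t²).

def jeffreys_approx_from_ml(t, k, x_next):
    # Corollary formula for a Bernoulli model (valid when 0 < k < t).
    theta = k / t                                    # theta_t^ML
    p_ml = theta if x_next == 1 else 1 - theta
    score = 1 / theta if x_next == 1 else -1 / (1 - theta)   # d/dtheta log p_theta(x_next)
    fisher_norm_sq = score**2 * theta * (1 - theta)  # squared Fisher norm of the score
    dim_theta = 1
    return p_ml * (1 + fisher_norm_sq / (2 * t) - dim_theta / (2 * t))

for t in (10, 100, 1000):
    k = t // 3
    exact = (k + 0.5) / (t + 1)                      # exact Jeffreys (KT) predictor
    approx = jeffreys_approx_from_ml(t, k, x_next=1)
    print(t, abs(exact - approx) * t**2)             # stays bounded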

From ML to Bayes (2)

Proof of the corollary (idea):

\theta^{ML+x_{t+1}} \approx \theta^{ML} + \frac{1}{t}\, \tilde\nabla_\theta \log p_\theta(x_{t+1})

where \tilde\nabla_\theta = I(\theta)^{-1} \frac{\partial}{\partial\theta} is Amari’s natural gradient given by the Fisher matrix.


“When adding a data point, the ML estimator moves by 1/t times the natural gradient of the new point’s log-likelihood.”
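For the Bernoulli family this natural-gradient step can be checked directly (my sketch): with I(θ)⁻¹ = θ(1−θ), the step lands within O(1/t²) of the exact recomputed ML estimator (k + y)/(t + 1).

def natural_gradient_step(t, k, y):
    # One natural-gradient step of size 1/t for a Bernoulli model.
    theta = k / t
    score = y / theta - (1 - y) / (1 - theta)        # d/dtheta log p_theta(y)
    return theta + (theta * (1 - theta) / t) * score # I(theta)^{-1} = theta(1-theta)

for t in (10, 100, 1000):
    k, y = t // 3, 1
    exact = (k + y) / (t + 1)                        # ML estimator after also observing y
    approx = natural_gradient_step(t, k, y)
    print(t, abs(exact - approx) * t**2)             # stays bounded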

What about other priors? Then a similar theorem holds if the definition of the SNML predictor pSNML is modified as


Arbitrary Bayesian priors

Consider a Bayesian prior with density β(θ) with respect to the Jeffreys prior,

\alpha(\theta) = \beta(\theta) \sqrt{\det I(\theta)}

Then a similar theorem holds if the definition of the SNML predictor p^{SNML} is modified as

q(y) := \beta\!\left(\theta^{ML+y}\right)^2 p_{\theta^{ML+y}}(y), \qquad p^{SNML} := \frac{q}{\int q}

for each possible value y of x_{t+1}.
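A numerical sanity check of this modification in the Bernoulli case. This is my own sketch, and it assumes that the “similar theorem” again equates the α-Bayes predictor with the average of p^ML and the modified p^SNML, with the squared β as reconstructed above; taking β(θ) = θ makes the prior α a Beta(3/2, 1/2), whose exact predictive probability is (k + 3/2)/(t + 2).

def snml_beta_bernoulli(t, k, beta=lambda th: th):
    # Modified SNML for a Bernoulli model: q(y) = beta(theta^{ML+y})^2 * p_{theta^{ML+y}}(y).
    q = {}
    for y in (0, 1):
        th = (k + y) / (t + 1)
        q[y] = beta(th)**2 * (th if y == 1 else 1 - th)
    z = q[0] + q[1]
    return {y: q[y] / z for y in q}

for t in (10, 100, 1000):
    k = t // 3
    exact_bayes = (k + 1.5) / (t + 2)                # Bayes predictive for prior Beta(3/2, 1/2)
    avg = 0.5 * (k / t + snml_beta_bernoulli(t, k)[1])
    print(t, abs(exact_bayes - avg) * t**2)          # stays bounded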


Posterior means: a systematic shift between ML and Bayes

Another way to compare ML and Bayesian predictors is to use test functions.

“Is the Bayesian posterior approximately centered around the ML estimator?”

=⇒ No!


Let f(θ) be a smooth test function of θ. Then there is a systematic direction of the difference between f(θ^{ML}) and the Bayesian posterior mean of f. For exponential families and the Jeffreys prior, this difference is approximately

\frac{1}{t}\, \partial_\theta f \cdot V(\theta^{ML})

where V(θ) is an intrinsic vector field on Θ, independent from f:

V(\theta) = \frac{1}{4}\, I(\theta)^{-1} \cdot T(\theta) \cdot I(\theta)^{-1}

with I the Fisher matrix and T the skewness tensor [Amari–Nagaoka],

T_{ijk}(\theta) := \mathbb{E}_{x \sim p_\theta}\!\left[ \frac{\partial \ln p_\theta(x)}{\partial \theta_i}\, \frac{\partial \ln p_\theta(x)}{\partial \theta_j}\, \frac{\partial \ln p_\theta(x)}{\partial \theta_k} \right]

Conclusions

▶ For exponential families, Bayesian predictors can be approximated using modified ML predictors.

▶ The difference between Bayesian and ML predictors can be computed from the Fisher metric.

▶ There is a systematic direction of the shift from ML to Bayesian posterior means.

▶ Extends to non-i.i.d. models if p_θ(x_{t+1} | x_{1...t}) is an exponential family.


Thank you!