Laplace's Rule of Succession in Information Geometry

Yann Ollivier, CNRS & Paris-Saclay University, France
Second International Conference on Geometric Science of Information (GSI 2015), École polytechnique, October 29, 2015

Sequential prediction

Sequential prediction problem: given observations x_1, ..., x_t, build a probabilistic model p^{t+1} for x_{t+1}, iteratively.

Example: given that w women and m men entered this room, what is the probability that the next person who enters is a woman/man?

Common performance criterion for prediction: cumulated log-loss

    L_T := - \sum_{t=0}^{T-1} \log p^{t+1}(x_{t+1} \mid x_{1...t})

to be minimized. This corresponds to compression cost, and is also equal to square loss for Gaussian models.

Maximum likelihood estimator

Maximum likelihood strategy: Fix a parametric model p_\theta(x). At each time, take the best parameter based on past observations:

    \theta_t^{ML} := \arg\max_\theta \prod_{s \le t} p_\theta(x_s) = \arg\min_\theta \Big\{ - \sum_{s \le t} \log p_\theta(x_s) \Big\}

Then, predict x_{t+1} using this \theta_t^{ML}:

    p^{ML}(x_{t+1} \mid x_{1...t}) := p_{\theta_t^{ML}}(x_{t+1})

This is the maximum likelihood or "plug-in" estimator. Heavily used in machine learning. The argmax is often computed via gradient descent.
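To make the plug-in strategy concrete, here is a minimal Python sketch (illustrative only, not from the talk; the function names and the example sequence are invented) of the ML predictor for a Bernoulli model, together with the cumulated log-loss L_T it incurs.

```python
import math

def ml_bernoulli_prediction(past):
    """Plug-in ML predictor: P(next = 1) is the empirical frequency of 1s so far."""
    if not past:
        return None  # the ML estimator gives no prediction before the first observation
    return sum(past) / len(past)

def cumulated_log_loss(sequence, predictor):
    """L_T = -sum_t log p^{t+1}(x_{t+1} | x_{1..t}) for a sequential predictor."""
    loss = 0.0
    for t, x in enumerate(sequence):
        p_one = predictor(sequence[:t])
        if p_one is None:
            p_one = 0.5  # arbitrary convention for the very first prediction
        p_x = p_one if x == 1 else 1.0 - p_one
        loss += -math.log(p_x) if p_x > 0 else math.inf
    return loss

# Hypothetical data: 1 = woman, 0 = man entering the room.
observations = [1, 1, 1, 0, 1, 0, 1, 1]
print(cumulated_log_loss(observations, ml_bernoulli_prediction))
```

On any binary sequence that contains both symbols, this predictor assigns probability 0 the first time the second distinct symbol appears, so the printed L_T is infinite; this is the zero-frequency problem discussed next.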
Problems with the ML estimator

Example: Given that there are w women and m men in this room, the next person to enter will be a man with probability m/(w+m) according to the ML estimator.

⟹ The ML estimator is not satisfying:
- How do you predict the first observation?
- Zero-frequency problem: If you have seen only women so far, the probability to see a man is estimated to 0.
- Often overfits in machine learning.

Laplace's rule of succession

Laplace suggested a quick fix for these problems: add one to the counts of each possibility. That is, predict according to

    p(\text{woman}) = \frac{w+1}{w+m+2},    p(\text{man}) = \frac{m+1}{w+m+2}

instead of w/(w+m) and m/(w+m). This is Laplace's rule of succession.
- Solves the zero-frequency problem: After having seen t women and no men, the probability to see a man is estimated to 1/(t+2).
- Generalizes to other discrete data ("additive smoothing").
- May seem arbitrary but has a beautiful Bayesian interpretation.
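Continuing the illustrative sketch (same caveats: invented names and data, not from the talk), Laplace's rule only changes the prediction function; on the same kind of sequence its cumulated log-loss stays finite, unlike the plug-in estimator's.

```python
import math

def laplace_bernoulli_prediction(past):
    """Laplace's rule of succession: add one to the count of each possibility.

    P(next = 1) = (number of 1s + 1) / (t + 2) after t observations.
    Unlike the ML plug-in, this is defined at t = 0 (it gives 1/2) and never
    assigns probability 0 to either outcome.
    """
    return (sum(past) + 1) / (len(past) + 2)

# Hypothetical data again: 1 = woman, 0 = man entering the room.
observations = [1, 1, 1, 0, 1, 0, 1, 1]

loss = 0.0
for t, x in enumerate(observations):
    p_one = laplace_bernoulli_prediction(observations[:t])
    loss += -math.log(p_one if x == 1 else 1.0 - p_one)
print(loss)  # finite cumulated log-loss, even though a 0 follows a run of 1s
```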
Bayesian predictors

Bayesian predictors start with a parametric model p_\theta(x) together with a prior \alpha(\theta) on \theta.

Proposition (folklore). For Bernoulli distributions on a binary variable, e.g., {woman, man}, Laplace's rule coincides with the Bayesian predictor with a uniform prior on the Bernoulli parameter \theta \in [0, 1].
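For completeness, here is a sketch of the standard computation behind this proposition (not spelled out above): under the uniform prior, the Bayesian predictive probability is a ratio of Beta integrals.

```latex
% Bayesian predictor with the uniform prior \alpha(\theta) = 1 on [0,1],
% after observing w women and m men (Bernoulli likelihood \theta^w (1-\theta)^m):
\[
  p(\text{woman} \mid w \text{ women}, m \text{ men})
  = \frac{\int_0^1 \theta \cdot \theta^w (1-\theta)^m \, d\theta}
         {\int_0^1 \theta^w (1-\theta)^m \, d\theta}
  = \frac{B(w+2,\, m+1)}{B(w+1,\, m+1)}
  = \frac{w+1}{w+m+2}.
\]
% Here B(a,b) = \int_0^1 \theta^{a-1} (1-\theta)^{b-1} \, d\theta
%             = (a-1)!\,(b-1)!/(a+b-1)! for integers a, b,
% so the ratio simplifies to (w+1)/(w+m+2): exactly Laplace's rule.
```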
