Seminar Explainable AI
Module 00: Primer on Probability, Information and Learning from Data
Andreas Holzinger
Human-Centered AI Lab (Holzinger Group), Institute for Medical Informatics/Statistics, Medical University Graz, Austria, and Explainable AI-Lab, Alberta Machine Intelligence Institute, Edmonton, Canada
[email protected]
Last update: 26-09-2019
Remark
This is the version for printing and reading. The lecture version is didactically different.
Keywords: Probability · Probability Distribution · Probability Density · Frequentist/Bayesian · Continuous/Discrete · Independent/Dependent · Identical/Non-Identical · Correlation/Causation · Joint Probability · Conditional Probability
Keywords: Information and Learning from Data · Information Entropy · Mutual Information · Kullback-Leibler Divergence · Maximum Likelihood Estimation · Maximum a Posteriori Estimate · Bayesian Estimate · Bayesian Learning · Fisher Information · Marginal Likelihood
Book recommendations (selection)
Christopher M. Bishop 2006. Pattern Recognition and Machine Learning, New York, Springer.
Kevin P. Murphy 2012. Machine learning: a probabilistic perspective, Cambridge (MA), MIT Press.
David Barber 2012. Bayesian reasoning and machine learning, Cambridge, Cambridge University Press.
Ian Goodfellow, Yoshua Bengio & Aaron Courville 2016. Deep Learning, Cambridge (MA), MIT Press.
For those students who are interested in decision making, this is a standard reference: Michael W. Kattan (ed.) 2009. Encyclopedia of Medical Decision Making, London: Sage.

Agenda
- 00 Mathematical Notations
- 01 Probability Distribution and Density
- 02 Expectation and Expected Utility Theory
- 03 Joint Probability / Conditional Probability
- 04 Independent, identically distributed (i.i.d.)
- 05 Bayes and Laplace
- 06 Information Theory & Entropy
00 Mathematical Notations
Most used – at a glance
Linear algebra notations
Probability notations (1/3)
Probability notations (2/3)
Probability notations (3/3)
Note on the parameter θ (theta): credit to Zoubin Ghahramani
http://mlg.eng.cam.ac.uk/zoubin
Machine learning builds on 3 mathematical pillars:
- 1) linear algebra,
- 2) probability/statistics, and
- 3) optimization.
- The typical data organization is in arrays (matrices):
- rows represent the samples (data items),
- columns represent the attributes (features, representations, covariates); much of practical ML is “feature engineering”.
- Simplest case: each training input $x_i$ is a d-dimensional vector of numbers, but $x_i$ could also be a complex object (image, sentence, graph, time series, molecular shape, etc.); a minimal data-matrix sketch follows below.
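Below is a minimal sketch (Python/NumPy) of this rows-as-samples, columns-as-features convention; the matrix X and its values are invented purely for illustration:

```python
import numpy as np

# Hypothetical data matrix X: n = 4 samples (rows), d = 3 features (columns).
X = np.array([
    [5.1, 3.5, 1.4],   # sample 1
    [4.9, 3.0, 1.4],   # sample 2
    [6.2, 3.4, 5.4],   # sample 3
    [5.9, 3.0, 5.1],   # sample 4
])

n_samples, n_features = X.shape       # (4, 3)
x_0 = X[0]                            # one training input: a d-dimensional vector
feature_means = X.mean(axis=0)        # column-wise (per-feature) statistics
print(n_samples, n_features, x_0, feature_means)
```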
See a typical ML pipeline here: https://www.datasciencecentral.com/profiles/blogs/data-version-control-iterative-machine-learning

Discussion: What makes a good feature?
https://www.youtube.com/watch?v=N9fDIAflCMY
Allan Jepson & Whitman Richards 1992. What makes a good feature? Spatial Vision in Humans and Robots, 89-126.
Avrim L. Blum & Pat Langley 1997. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97, (1), 245-271, doi:10.1016/S0004-3702(97)00063-5.
Practical Example: Probabilistic Reasoning System
Remark: compare with early Expert Systems (see module 3)
Avi Pfeffer 2016. Practical probabilistic programming, Shelter Island (NY), Manning Publications.
Please have a look at the lectures by Frank Wood, UBC, formerly Oxford.
Jan-Willem van de Meent, Brooks Paige, Hongseok Yang & Frank Wood 2018. An Introduction to Probabilistic Programming. arXiv preprint arXiv:1809.10756.

Example for a graphical model
01 Probability
What is probability p(x)?
- Probability is the formal study of the laws of chance and of managing uncertainty; it allows us to measure (many) events.
- Frequentist* view: probability as long-run relative frequency, e.g. of a coin toss (see the simulation sketch below).
- Bayesian* view: probability as a measure of belief (this is what made machine learning successful).
- $p(x) = 1$ means that the event $x$ occurs for certain.
- Information is a measure for the reduction of uncertainty.
- If something is 100 % certain, its uncertainty = 0.
- Uncertainty is maximal if all choices are equally probable (i.i.d. = independent and identically distributed).
- Uncertainty (as information) sums up for independent sources.
*) Bayesian vs. Frequentist: please watch the excellent video of Kristin Lennox (2016): https://www.youtube.com/watch?v=eDMGDhyDxuY
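As a minimal sketch of the frequentist view (plain Python/NumPy; the seed and sample sizes are arbitrary choices), the relative frequency of heads in repeated fair coin tosses approaches 0.5 as the number of tosses grows:

```python
import numpy as np

rng = np.random.default_rng(42)

# Frequentist view: probability as the long-run relative frequency of an event.
for n in [10, 100, 10_000, 1_000_000]:
    tosses = rng.integers(0, 2, size=n)   # 1 = heads, 0 = tails, fair coin
    print(f"n = {n:>9,}: relative frequency of heads = {tosses.mean():.4f}")
```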
Probability Density Function (PDF) vs. Cumulative Distribution Function (CDF)
- Discrete distributions: a probability mass function (PMF) assigns a probability to each outcome, e.g. the outcomes 1, 2, 3, 4, 5, 6 of a fair die.
- Continuous distributions: the probability density function (PDF) $p(x)$ and the cumulative distribution function (CDF) are related by $F(a) = P(X \le a) = \int_{-\infty}^{a} p(x)\,dx$.
[Figure: bar plot of a discrete distribution over the outcomes 1-6, and a continuous PDF with the area up to $a$ shaded to illustrate the CDF]
A numeric sketch of PDF vs. CDF follows below.
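A minimal sketch of the PDF/CDF distinction, assuming SciPy is available (the evaluation point $a = 1$ and the die example are arbitrary illustrations):

```python
from scipy.stats import norm

# Standard normal: pdf(x) is a density; cdf(a) = P(X <= a) is a probability.
a = 1.0
print("pdf(1.0) =", norm.pdf(a))     # density value, NOT itself a probability
print("cdf(1.0) =", norm.cdf(a))     # ~0.8413, a genuine probability
print("P(-1 <= X <= 1) =", norm.cdf(1.0) - norm.cdf(-1.0))   # ~0.6827

# Discrete case (fair die): the PMF puts 1/6 on each outcome in {1,...,6},
# and the CDF is the running sum, a step function.
pmf = {k: 1 / 6 for k in range(1, 7)}
print("die: P(X <= 4) =", sum(p for k, p in pmf.items() if k <= 4))   # 4/6
```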
Probability Density Function and Probability Distribution
John Kruschke 2014. Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan, Amsterdam et al., Academic Press.
https://brilliant.org/wiki/multivariate-normal-distribution
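To make the multivariate case concrete, here is a hedged sketch (SciPy assumed; the mean, covariance, and sample size are invented for illustration) of evaluating and sampling a 2-D Gaussian:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-D Gaussian: mean vector and positive-definite covariance matrix.
mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])      # strong positive correlation between the dims

mvn = multivariate_normal(mean=mean, cov=cov)
print("density at the mean:", mvn.pdf(mean))

samples = mvn.rvs(size=5_000, random_state=0)
print("empirical covariance:\n", np.cov(samples, rowvar=False))  # close to cov
```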
02 Expectation and Expected Utility Theory
Estimate Confidence Interval: Uncertainty matters!
Image by Katharina Holzinger

Expected Utility Theory: $E(U \mid d)$
For a single decision variable $D$, an agent can select $D = d$ for any $d \in \mathrm{dom}(D)$. The expected utility of decision $D = d$ is $E(U \mid d) = \sum_{x} p(x \mid d)\,U(x, d)$.
http://www.eoht.info/page/Oskar+Morgenstern
An optimal single decision is the decision whose expected utility is maximal: $d^{*} = \arg\max_{d \in \mathrm{dom}(D)} E(U \mid d)$.
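A minimal sketch of these two formulas in Python; the decision names, state distributions, and utilities are entirely hypothetical toy numbers:

```python
import numpy as np

# E(U | d) = sum_x p(x | d) * U(x, d) over three world states.
p_x_given_d = {                     # p(x | d): state distribution per decision
    "treat": np.array([0.7, 0.2, 0.1]),
    "wait":  np.array([0.3, 0.4, 0.3]),
}
utility = {                         # U(x, d): utility of each state per decision
    "treat": np.array([100.0, 40.0, -50.0]),
    "wait":  np.array([ 90.0, 10.0, -80.0]),
}

expected_utility = {d: float(p_x_given_d[d] @ utility[d]) for d in p_x_given_d}
d_star = max(expected_utility, key=expected_utility.get)   # argmax_d E(U | d)
print(expected_utility, "->", d_star)
```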
John von Neumann & Oskar Morgenstern 1944. Theory of Games and Economic Behavior, Princeton: Princeton University Press.

03 Joint Probability / Conditional Probability
Independence and Conditional Independence
Please review chapter 2.2.4, page 30 of David Barber's book; in summary:
- Independence: $x \perp z$ iff $p(x, z) = p(x)\,p(z)$.
- Conditional independence: $x \perp z \mid y$ iff $p(x, z \mid y) = p(x \mid y)\,p(z \mid y)$ for all states of $y$.
When we condition on y: are x and z independent? Not necessarily; a minimal numeric counterexample (a collider, i.e. “explaining away”) follows below.
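A hedged sketch of explaining away in plain Python (the OR-gate construction is an invented toy example, not from the slides): x and z are independent fair coins and y = x OR z; conditioning on y = 1 makes x and z dependent.

```python
import itertools

# Joint p(x, z, y) with x, z independent fair coins and y = x OR z (a collider).
joint = {}
for x, z in itertools.product([0, 1], repeat=2):
    joint[(x, z, x | z)] = 0.25

def p(pred):
    """Sum the joint probability over all (x, z, y) satisfying pred."""
    return sum(pr for (x, z, y), pr in joint.items() if pred(x, z, y))

py1  = p(lambda x, z, y: y == 1)                                # p(y=1) = 3/4
p_xz = p(lambda x, z, y: x == 1 and z == 1 and y == 1) / py1    # p(x=1,z=1|y=1) = 1/3
p_x  = p(lambda x, z, y: x == 1 and y == 1) / py1               # p(x=1|y=1) = 2/3
p_z  = p(lambda x, z, y: z == 1 and y == 1) / py1               # p(z=1|y=1) = 2/3
print(p_xz, "vs", p_x * p_z)   # 1/3 vs 4/9 -> NOT conditionally independent
```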
Please review chapter 10.5, page 325 of David Barber's book.

04 Independent, identically distributed (IID, i.i.d.) data
Please review chapter 8.6, page 180ff. of David Barber's book.
Watch these examples: https://www.youtube.com/watch?v=lhzndcgCXeo
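As a hedged illustration of why i.i.d. matters (SciPy assumed; the data and candidate parameters are invented): for i.i.d. data the likelihood factorizes, $p(x_1, \ldots, x_n \mid \theta) = \prod_i p(x_i \mid \theta)$, so the log-likelihood is a simple sum over data points:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=100)   # hypothetical i.i.d. sample

def log_likelihood(mu, sigma, xs):
    # i.i.d. => log p(x_1..x_n | mu, sigma) = sum_i log p(x_i | mu, sigma)
    return np.sum(norm.logpdf(xs, loc=mu, scale=sigma))

# With sigma fixed, the sample mean is the maximum-likelihood estimate of mu.
for mu in [0.0, 1.0, float(data.mean()), 3.0]:
    print(f"mu = {mu:.3f}: log-likelihood = {log_likelihood(mu, 1.0, data):.2f}")
```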
05 Bayes and Laplace
Bayes and Laplace on one single slide
Pierre Simon de Laplace (1749-1827)
Note: the widely shown portrait is probably not Bayes, but this image is heavily in use.
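To make Bayes' rule, $p(\theta \mid D) \propto p(D \mid \theta)\,p(\theta)$, concrete, here is a minimal conjugate coin-flip sketch (SciPy assumed; the prior and the 7-heads/3-tails data are invented for illustration):

```python
from scipy.stats import beta

# Coin with unknown heads-probability theta, Beta(a, b) prior, Bernoulli
# likelihood. By conjugacy the posterior is Beta(a + heads, b + tails).
a, b = 1, 1                  # uniform prior Beta(1, 1)
heads, tails = 7, 3          # hypothetical observed data D

posterior = beta(a + heads, b + tails)
print("posterior mean:", posterior.mean())   # (a+heads)/(a+b+heads+tails) = 8/12
print("95% credible interval:", posterior.interval(0.95))
```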
Please refer to the excellent lectures of Zoubin Ghahramani for profound details: http://mlg.eng.cam.ac.uk/zoubin

06 Measuring Information
Probability > Information > Entropy
- Information is a measure for the reduction of uncertainty.
- If something is 100 % certain, its uncertainty = 0.
- Uncertainty is maximal if all choices are equally probable (i.i.d.).
- Uncertainty (as information) sums up for independent sources; see the sketch below.
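These bullets match the defining properties of Shannon entropy, $H(X) = -\sum_x p(x) \log_2 p(x)$. A minimal sketch (plain NumPy; the example distributions are arbitrary):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(X) = -sum_x p(x) log2 p(x), in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # convention: 0 * log 0 = 0
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.5, 0.5]))    # 1.0 bit: maximal when choices are equally probable
print(entropy([0.9, 0.1]))    # ~0.469 bits: less uncertain
print(entropy([1.0, 0.0]))    # 0.0 bits: 100 % certain -> zero uncertainty

# Additivity for independent sources: H(X, Y) = H(X) + H(Y).
joint = np.outer([0.5, 0.5], [0.9, 0.1]).ravel()
print(entropy(joint), "==", entropy([0.5, 0.5]) + entropy([0.9, 0.1]))
```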
Andreas Holzinger, Matthias Hörtenhuber, Christopher Mayer, Martin Bachler, Siegfried Wassertheurer, Armando Pinho & David Koslicki 2014. On Entropy-Based Data Mining. In: Lecture Notes in Computer Science, LNCS 8401. Berlin Heidelberg: Springer, pp. 209-226, doi:10.1007/978-3-662-43968-5_12.

Entropy as measure for disorder
http://www.scottaaronson.com

An overview of the History of Entropy
- Bernoulli (1713): Principle of Insufficient Reason
- Bayes (1763), Laplace (1770): how to calculate the state of a system with a limited number of expectation values
- Maxwell (1859), Boltzmann (1871), Gibbs (1902): statistical modeling of problems in physics
- Pearson (1900): Goodness-of-Fit measure
- Fisher (1922): Maximum Likelihood
- Jeffreys, Cox (1939-1948): Statistical Inference
- Shannon (1948): Information Theory
These strands feed into Bayesian statistics, entropic methods, and generalized entropy (see next slide).
See next slide; confer also with: Golan, A. (2008) Information and Entropy Econometrics: A Review and Synthesis. Foundations and Trends in Econometrics, 2, 1-2, 1-145.

Towards a Taxonomy of Entropic Methods
Entropic methods and generalized entropies:
- Jaynes (1957): Maximum Entropy (MaxEn)
- Rényi (1961): Rényi entropy
- Adler et al. (1965): Topological Entropy (TopEn)
- Mowshowitz (1968): Graph Entropy
- Posner (1975): Minimum Entropy (MinEn)
- Tsallis (1988): Tsallis entropy
- Pincus (1991): Approximate Entropy (ApEn)
- Rubinstein (1997): Cross Entropy (CE)
- Richman (2000): Sample Entropy (SampEn)
A sketch of one of these measures, ApEn, follows below.
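As a hedged sketch of one entry in this taxonomy, here is a compact implementation of Approximate Entropy (Pincus 1991) following its standard definition, ApEn = Φ^m(r) − Φ^{m+1}(r); the test signals and tolerance are arbitrary choices:

```python
import numpy as np

def apen(u, m=2, r=0.2):
    """Approximate Entropy: low for regular/predictable series, higher for irregular."""
    u = np.asarray(u, dtype=float)
    n = len(u)

    def phi(m):
        # All overlapping templates of length m.
        x = np.array([u[i:i + m] for i in range(n - m + 1)])
        # Chebyshev distance between every pair of templates.
        d = np.max(np.abs(x[:, None, :] - x[None, :, :]), axis=2)
        c = np.mean(d <= r, axis=1)        # C_i: fraction of templates within r
        return np.mean(np.log(c))

    return phi(m) - phi(m + 1)

rng = np.random.default_rng(0)
regular = np.sin(np.linspace(0, 20 * np.pi, 500))   # highly regular signal
noisy = rng.standard_normal(500)                    # white noise

print("ApEn(sine)  =", apen(regular, r=0.2 * np.std(regular)))   # small
print("ApEn(noise) =", apen(noisy, r=0.2 * np.std(noisy)))       # larger
```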
Holzinger, A., Hörtenhuber, M., Mayer, C., Bachler, M., Wassertheurer, S., Pinho, A. & Koslicki, D. 2014. On Entropy-Based Data Mining. In: Holzinger, A. & Jurisica, I. (eds.) Lecture Notes in Computer Science, LNCS 8401. Berlin Heidelberg: Springer, pp. 209-226.

Example of the usefulness of ApEn (1/3)
Holzinger, A., Stocker, C., Bruschi, M., Auinger, A., Silva, H., Gamboa, H. & Fred, A. 2012. On Applying Approximate Entropy to ECG Signals for Knowledge Discovery on the Example of Big Sensor Data. In: Huang, R., Ghorbani, A., Pasi, G., Yamaguchi, T., Yen, N. & Jin, B. (eds.) Active Media Technology, Lecture Notes in Computer Science, LNCS 7669. Berlin Heidelberg: Springer, pp. 646-657. EU Project EMERGE (2007-2010).

Example of the usefulness of ApEn (2/3)