
Applied Logic, Lecture 4 part 2 – Bayesian inductive reasoning

Marcin Szczuka

Institute of Informatics, The University of Warsaw

Monographic lecture, Spring semester 2018/2019

Those illiterate in the general theory still keep asking why, of all things, Trurl probabilized the dragon rather than an elf or a dwarf. They do so out of ignorance, since they do not know that the dragon is simply more probable than a dwarf ...

Stanisław Lem, The Cyberiad, "The Third Sally, or The Dragons of Probability"

Lecture plan

1 Introduction

2 Bayesian reasoning

3 Bayesian prediction and decision support
  Classification problems
  Selecting hypothesis – MAP and ML
  Bayesian Optimal Classifier
  Naïve Bayes classifier

4 Hypothesis selection – general issues

Measure of support/possibility

Recall that from an inductive (quasi-)formal system that we dare to call inductive logic we expect a measure of support. This measure gives us the degree to which the truthfulness of the premises influences the truthfulness of the conclusions. We require:

1 Fulfillment of the Criterion of Adequacy (CoA).

2 Ensuring that the degree of confidence in the inferred conclusion is no greater than the confidence in the premises and rules.

3 Ability to clearly discern between proper conclusions (hypotheses) and nonsensical ones.

4 Intuitive interpretation.

Probabilistic inference

From the very beginning researchers tried to connect inductive reasoning with probability and/or statistics. Over time probability-based reasoning, in particular Bayesian reasoning, has established itself as a central focal point for philosophers and logicians working on the formalisation of inductive systems (inductive logics). Early elements of probabilistic reasoning can be found in the works of Pascal, Fermat, and others. A modern, formal approach to inductive logic based on the notions of similarity and probability was proposed by John Maynard Keynes in A Treatise on Probability (1921). Rudolf Carnap developed these ideas further in his Logical Foundations of Probability (1950) and other works, which are now considered a cornerstone of probabilistic logic. After the mathematical theory of probability was "ordered" by Kolmogorov, probabilistic reasoning gained more traction as a proper, formal theory.

Probabilistic inductive logic

In the case of inductive logics, in particular those based on probability, there is very little point in considering the strict formal consequence relation ⊢ and its relationship with the relation |=. For the relation |= we usually consider a support (probability) mapping rather than exact logical consequences.

Support mapping (function)
A function P : L → [0, 1], where L is a set of statements (a language), is called a support function if for all statements A, B, C in L the following hold:

1 There exists at least one pair D, E ∈ L for which P(D|E) < 1.

2 If B |= A then P(A|B) = 1.

3 If |= (B ≡ C) then P(A|B) = P(A|C).

4 If C |= ¬(A ∧ B) then either P(A ∨ B|C) = P(A|C) + P(B|C) or ∀D∈L P(D|C) = 1.

5 P((A ∧ B)|C) = P(A|(B ∧ C)) × P(B|C).
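As a concrete illustration (not part of the original slides), the sketch below builds a support function P(A|B) from an ordinary discrete probability distribution over truth assignments of two propositional variables and spot-checks a few of the conditions above. The tiny language and the uniform distribution are invented for the example.

```python
from itertools import product

# A toy "language": statements are predicates over truth assignments
# to two propositional variables p and q.
worlds = list(product([True, False], repeat=2))   # all assignments (p, q)
prob = {w: 0.25 for w in worlds}                  # made-up uniform distribution

def P(A, B=lambda w: True):
    """Support (conditional probability) of statement A given statement B."""
    mass_B = sum(prob[w] for w in worlds if B(w))
    if mass_B == 0:
        return 1.0        # matches the escape clause of condition 4
    mass_AB = sum(prob[w] for w in worlds if A(w) and B(w))
    return mass_AB / mass_B

p = lambda w: w[0]
q = lambda w: w[1]

# Condition 1 (non-triviality): some pair with support below 1.
assert P(p, q) < 1
# Condition 2: if B |= A then P(A|B) = 1, e.g. (p and q) |= p.
assert P(p, lambda w: p(w) and q(w)) == 1
# Condition 5: P(A and B | C) = P(A | B and C) * P(B | C).
A, B, C = p, q, (lambda w: True)
assert abs(P(lambda w: A(w) and B(w), C) - P(A, lambda w: B(w) and C(w)) * P(B, C)) < 1e-9
```

Any probability measure induces a support function in this way, although, as noted on the next slide, the conditions do not determine P uniquely.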

Probabilistic inductive logic

It is easy to see that the conditions for the support function P are a re-formulation of the axioms for a probability measure. In the definition of P the operator | corresponds to logical entailment, i.e., the basic step in reasoning. It is also easy to see that the mapping P is not uniquely defined. The conditions for P are essentially the same as for (unconditional) probability. It suffices to set P(A) = P(A|(D ∨ ¬D)) for some sentence (event) D. However, these conditions also allow for establishing the value P(A|C) when the probability of the event C is 0 (P(C) = P(C|(D ∨ ¬D)) = 0). Condition 1 (non-triviality) in the definition of P can also be expressed as

∃A∈LP ((A ∧ ¬A)|(A ∨ ¬A)) < 1.

Probability

At this point we need to introduce the (simplified) axioms for a probability measure that we will use further on. In order to clearly distinguish it from the previous notation we will use Pr to denote the probability measure.

Axioms for discrete probability (Kolmogorov)

1 For each event A ⊆ Ω the value Pr(A) ∈ [0, 1].

2 Unit measure Pr(Ω) = 1.

3 Additivity – if A1, ..., An are mutually exclusive events, then

∑_{i=1}^{n} Pr(Ai) = 1  ⇒  Pr(B) = ∑_{i=1}^{n} Pr(B|Ai) · Pr(Ai).

Axiom 2 (unit measure) may be a source of some concern for us.

Properties of probability

Pr(A ∧ B) = Pr(B) · Pr(A|B) = Pr(A) · Pr(B|A)

Pr(A ∨ B) = Pr(A) + Pr(B) − Pr(A ∧ B)

Pr(A|B) – (conditional) probability of A given B:

Pr(A|B) = Pr(A ∧ B) / Pr(B)

Bayes' rule

Pr(A|B) = Pr(B|A) · Pr(A) / Pr(B)
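A minimal numeric sketch of Bayes' rule (my own illustration, not from the slides); the prior, likelihood and false-positive rate below are invented numbers for a hypothetical diagnostic-test scenario.

```python
# Hypothetical numbers: A - "condition present", B - "test positive".
prior_A = 0.01                         # Pr(A)
likelihood_B_given_A = 0.95            # Pr(B|A)
false_positive_rate = 0.05             # Pr(B|not A)

# Pr(B) via the total-probability formula from the axioms slide.
prob_B = likelihood_B_given_A * prior_A + false_positive_rate * (1 - prior_A)

# Bayes' rule: Pr(A|B) = Pr(B|A) * Pr(A) / Pr(B)
posterior_A_given_B = likelihood_B_given_A * prior_A / prob_B
print(round(posterior_A_given_B, 3))   # ~0.161: a positive test raises 1% to about 16%
```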


For reasons that will become clear in the next part of the lecture, we will use the following notation.

T ⊂ X – set of premises (evidence set) coming from a (huge) universe X.

h ∈ H – conclusion (hypothesis) coming from some (huge) set of hypotheses H.

VS_{H,T} – version space, i.e., the subset of H containing hypotheses that are consistent with T.

Inference rule (Bayes')
For a hypothesis h ∈ H and an evidence set T ⊂ X:

Pr(h|T) = Pr(T|h) · Pr(h) / Pr(T)

The probability (level of support) of the conclusion (hypothesis) h is established on the basis of the support of the premises (evidence) and the degree to which the hypothesis justifies the existence of the evidence (premises).

Remarks

Pr(h|T) – a posteriori (posterior) probability of the hypothesis h given the premises (evidence data) T. That is what we are looking for.

Pr(T) – probability of the premises (evidence data) T. Fortunately, we do not have to know it if we are only interested in comparing posteriors of hypotheses. If, for some reason, we need to calculate it directly, then we may have a problem.

We need to calculate Pr(h) and Pr(T|h). For the moment we assume that we can do that and that H is known.

Pr(T|h) determines the degree to which h justifies the appearance (truthfulness) of the premises in T.

Decision support tasks

The real usefulness of the Bayesian approach is visible in its practical applications. The most popular of these is decision support (classification). Decision support (classification) is an example of using inductive inference methods such as prediction, induction by enumeration, and eliminative induction. We are going to discuss Bayesian classifiers, i.e., algorithms (procedures) that "learn" the probability of the decision value (classification) for new cases on the basis of cases observed previously (the training set). By restricting the reasoning task to prediction of the decision value we can produce a computationally viable, automated tool.

Classifiers – basic notions

The domain (space, universe) is a set X from which we draw examples. An element x ∈ X is referred to as an example (instance, case, record, entity, vector, object, row). An attribute (feature, variable, measurement) is a function

a : X → A.

Set A is called attribute value set or attribute domain. We assume that each example x ∈ X is completely represented by the vector

a1(x), ..., an(x),

where ai : X → Ai for i = 1, ..., n. The number n is sometimes called the size (length) of an example. For our purposes we usually distinguish a special decision attribute (decision, class), traditionally denoted by dec or d.

Tabular data

Outlook   Temp  Humid   Wind   EnjoySpt
sunny     hot   high    FALSE  no
sunny     hot   high    TRUE   no
overcast  hot   high    FALSE  yes
rainy     mild  high    FALSE  yes
rainy     cool  normal  FALSE  yes
rainy     cool  normal  TRUE   no
overcast  cool  normal  TRUE   yes
sunny     mild  high    FALSE  no
...       ...   ...     ...    ...
rainy     mild  high    TRUE   no

Classifier

Training set (training sample) T ⊆ X corresponds to the set of premises.

T^d – subset of the training data with decision d, which corresponds to the set of premises supporting a particular hypothesis.

T^d_{ai=v} – subset of the training data with attribute ai equal to v and decision d. This corresponds to the set of premises of a particular type supporting a particular hypothesis.

The hypothesis space H is now limited to the set of possible decision values, i.e., conditions (dec = d), where d ∈ Vdec.

Classification task
Given a training sample T, determine the best (most probable) value of dec(x) for a previously unseen case x ∈ X (x ∉ T).

Question: How to choose the best value of the decision?
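A small sketch (my own, not from the slides) of how the sets T^d and T^d_{ai=v} translate into simple counts over a tabular training sample; the handful of rows below is copied from the weather table shown earlier, and all function names are ad hoc.

```python
# A few rows of the weather table; "EnjoySpt" plays the role of the decision dec.
T = [
    {"Outlook": "sunny",    "Temp": "hot",  "Humid": "high",   "Wind": "FALSE", "EnjoySpt": "no"},
    {"Outlook": "overcast", "Temp": "hot",  "Humid": "high",   "Wind": "FALSE", "EnjoySpt": "yes"},
    {"Outlook": "rainy",    "Temp": "mild", "Humid": "high",   "Wind": "FALSE", "EnjoySpt": "yes"},
    {"Outlook": "rainy",    "Temp": "cool", "Humid": "normal", "Wind": "TRUE",  "EnjoySpt": "no"},
]

def T_d(T, d, dec="EnjoySpt"):
    """T^d: training rows whose decision equals d."""
    return [x for x in T if x[dec] == d]

def T_d_ai_v(T, d, a, v, dec="EnjoySpt"):
    """T^d_{ai=v}: training rows with decision d and attribute a equal to v."""
    return [x for x in T if x[dec] == d and x[a] == v]

print(len(T_d(T, "yes")))                            # |T^yes| = 2
print(len(T_d_ai_v(T, "yes", "Outlook", "rainy")))   # |T^yes_{Outlook=rainy}| = 1
```

These counts are exactly what the Naïve Bayes classifier discussed later is built from.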

Hypothesis selection – MAP and ML

In Bayesian classification we want to find the most probable decision value for a new example x, given the collection of previously seen (training) examples and the attribute values for x. So, using Bayes' formula, we need to find a hypothesis h (decision value) that maximises the support (empirical probability).

MAP – Maximum A Posteriori hypothesis
Given a training set T we attempt to classify an example x ∈ X using the hypothesis hMAP ∈ H by assigning to the object x the decision value given by:

hMAP = argmax_{h∈H} Pr(h|T) = argmax_{h∈H} Pr(T|h) · Pr(h)

In MAP we choose the hypothesis that is the most probable.

Hypothesis selection – ML

ML – Maximum Likelihood hypothesis
Given a training set T we attempt to classify an example x ∈ X using the hypothesis hML ∈ H by assigning to the object x the decision value given by:

hML = argmax_{h∈H} Pr(T|h).

In the ML approach we choose the hypothesis that best explains (makes most likely) the existence of our training sample. Note that the hypothesis hML may itself have low probability, yet be very well adjusted to our particular data.
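A toy sketch contrasting the two selection rules (purely illustrative; the three hypotheses and their priors Pr(h) and likelihoods Pr(T|h) are invented numbers).

```python
# Invented priors Pr(h) and likelihoods Pr(T|h) for three hypothetical hypotheses.
prior      = {"h1": 0.70, "h2": 0.25, "h3": 0.05}
likelihood = {"h1": 0.10, "h2": 0.40, "h3": 0.90}

# MAP maximises Pr(T|h) * Pr(h); ML maximises Pr(T|h) alone.
h_map = max(prior, key=lambda h: likelihood[h] * prior[h])
h_ml  = max(prior, key=lambda h: likelihood[h])

print(h_map)   # "h2": 0.40 * 0.25 = 0.100 beats 0.070 (h1) and 0.045 (h3)
print(h_ml)    # "h3": largest likelihood, despite having a low prior
```

This mirrors the remark above: hML may have a low prior probability and still fit the particular data best.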

Discussion of ML and MAP

Both methods require knowledge of Pr(T|h). In the case of MAP we also need Pr(h) to be able to use Bayes' formula.

MAP is quite natural, but has major drawbacks. In particular, it promotes the dominating decision value.

Both methods assume that the training set is error-free and that the hypothesis we look for is in H.

ML is close to the intuitive understanding of inductive reasoning. In the process of selecting a hypothesis we go for the one that gives the best reason for the existence of the particular training set we have.

The MAP rule selects the most probable hypothesis, while we are rather interested in selecting the most probable decision value for an example. Consider Vdec = {0, 1}, H = {hMAP, h1, . . . , hm}, ∀1≤i≤m hi(x) = 0, hMAP(x) = 1, and

Pr(hMAP|T) < ∑_{i=1}^{m} Pr(hi|T).

MAP still assigns the value 1 to x, although the hypotheses voting for 0 jointly carry more posterior probability.

Finding probabilities

Pr(h) – the easier part. We may either be given a probability (by the learning method) or treat all hypotheses equally. In the latter case:

Pr(h) = 1 / |H|

The problem is the size of H. It may be a HUGE space. Also, in practice, we may not even know the whole of H.

Pr(T|h) – the harder part. Notice that we are in fact only interested in decision making. We want to know the probability that a sample T will be consistent (will have the same decision) with the hypothesis h. This yields:

Pr(T|h) = 1 if h ∈ VS_{H,T}, and Pr(T|h) = 0 if h ∉ VS_{H,T}.

Unfortunately, the problem with size of H is still present.
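Under these two assumptions (uniform prior, 0/1 likelihood) the posterior collapses to a uniform distribution over the version space; below is a tiny sketch of that computation, with a made-up hypothesis space and version space.

```python
# Made-up hypothesis space H and version space VS_{H,T}.
H  = ["h1", "h2", "h3", "h4", "h5"]
VS = {"h2", "h4"}                                       # hypotheses consistent with T

prior      = {h: 1 / len(H) for h in H}                 # Pr(h) = 1/|H|
likelihood = {h: (1.0 if h in VS else 0.0) for h in H}  # Pr(T|h)

evidence  = sum(likelihood[h] * prior[h] for h in H)    # Pr(T)
posterior = {h: likelihood[h] * prior[h] / evidence for h in H}
print(posterior)   # 0.5 for h2 and h4, 0.0 elsewhere: uniform over VS_{H,T}
```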

ML and MAP in practice

MAP and/or ML, despite serious practical limitations, can still be used in some special cases, given that:

the hypothesis space is very restricted (and reasonably small);

we use MAP and/or ML to score a (few) competing hypotheses constructed by other methods. This relates to the topics of stacking, coupled classifiers and layered learning.

Bayesian Optimal Classifier

The Bayesian Optimal Classifier (BOC) always returns the most probable decision value for an example. In this respect it cannot be beaten by any other classifier in terms of the true (global) error. Sadly, the BOC is not very useful from a practical point of view since it uses the entire hypothesis space. Let c(.) be the desired decision (target concept) and T a training sample. Then

hBOC = argmax_{d∈Vdec} Pr(c(x) = d|T)

where:

Pr(c(x) = d|T) = ∑_{h∈H} Pr(c(x) = d|h) · Pr(h|T)

Pr(c(x) = d|h) = 1 if h(x) = d, and 0 if h(x) ≠ d.

The hypothesis returned by BOC may not belong to H.
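A compact sketch of the BOC voting scheme defined above (illustrative only; the posteriors Pr(h|T) are invented, and each hypothetical hypothesis is represented simply by the decision it assigns to the example x).

```python
# Invented posteriors Pr(h|T) and the decision h(x) each hypothesis assigns to x.
posterior  = {"hMAP": 0.4, "h1": 0.3, "h2": 0.3}
prediction = {"hMAP": 1,   "h1": 0,   "h2": 0}
decisions  = {0, 1}                                  # Vdec

def boc_decision(posterior, prediction, decisions):
    """Pr(c(x)=d|T) = sum over h with h(x)=d of Pr(h|T); return the best d."""
    support = {d: sum(p for h, p in posterior.items() if prediction[h] == d)
               for d in decisions}
    return max(support, key=support.get)

print(boc_decision(posterior, prediction, decisions))   # 0: jointly 0.6 beats 0.4
```

Note how this reverses the MAP choice from the earlier discussion: hMAP alone predicts 1, but the hypotheses predicting 0 jointly carry more posterior probability.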

Naïve Bayes classifier

Let x∗ be a new example that we need to classify. We should select a hypothesis h such that:

h(x∗) = argmax_{d∈Vdec} Pr(c(x) = d | ⋀_{i=1}^{n} ai(x) = ai(x∗))

Hence, from Bayes' formula:

argmax_{d∈Vdec} Pr(c(x) = d) · Pr(⋀_{i=1}^{n} ai(x) = ai(x∗) | c(x) = d)

If we (naïvely) assume that the attributes are independent as random variables, then:

argmax_{d∈Vdec} Pr(c(x) = d) · ∏_{i=1}^{n} Pr(ai(x) = ai(x∗) | c(x) = d)

All that is left to do is to estimate Pr(c(x) = d) and Pr(ai(x) = v | c(x) = d) from the data.

NBC – technical details

Usually, we employ an m-estimate to get

Pr(ai(x) = v | c(x) = d) = (|T^d_{ai=v}| + m·p) / (|T^d| + m)

where m is an integer parameter, and p is a prior estimate of the probability of the attribute value within the decision class. Usually, if no background knowledge is given, we set m = |Ai| and p = 1/|Ai|, where Ai is the (finite) set of values of the attribute ai.

Complexity of NBC
For each training example we have to modify the counts for the decision class and for particular attribute values. That is, in total, O(n · |T|) basic computational steps.

The complexity of NBC is the lowest "rational" estimate for any classification algorithm without prior knowledge. Also, each step in NBC is fast and cheap, hence the method is computationally efficient.
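A self-contained sketch of a Naïve Bayes classifier using the m-estimate above, trained on a few rows of the weather table from earlier (my own illustration; the counting scheme follows the slides, everything else, including variable names, is assumed).

```python
from collections import Counter, defaultdict

# A subset of the weather table; "EnjoySpt" is the decision attribute.
T = [
    {"Outlook": "sunny",    "Temp": "hot",  "Humid": "high",   "Wind": "FALSE", "EnjoySpt": "no"},
    {"Outlook": "sunny",    "Temp": "hot",  "Humid": "high",   "Wind": "TRUE",  "EnjoySpt": "no"},
    {"Outlook": "overcast", "Temp": "hot",  "Humid": "high",   "Wind": "FALSE", "EnjoySpt": "yes"},
    {"Outlook": "rainy",    "Temp": "mild", "Humid": "high",   "Wind": "FALSE", "EnjoySpt": "yes"},
    {"Outlook": "rainy",    "Temp": "cool", "Humid": "normal", "Wind": "FALSE", "EnjoySpt": "yes"},
    {"Outlook": "rainy",    "Temp": "cool", "Humid": "normal", "Wind": "TRUE",  "EnjoySpt": "no"},
    {"Outlook": "overcast", "Temp": "cool", "Humid": "normal", "Wind": "TRUE",  "EnjoySpt": "yes"},
    {"Outlook": "sunny",    "Temp": "mild", "Humid": "high",   "Wind": "FALSE", "EnjoySpt": "no"},
]
DEC = "EnjoySpt"
ATTRS = [a for a in T[0] if a != DEC]
VALUES = {a: {x[a] for x in T} for a in ATTRS}           # attribute domains A_i

# One pass over the data: the counts |T^d| and |T^d_{a=v}|  ->  O(n * |T|) steps.
class_count = Counter(x[DEC] for x in T)
value_count = defaultdict(Counter)                       # (attribute, d) -> value counts
for x in T:
    for a in ATTRS:
        value_count[(a, x[DEC])][x[a]] += 1

def p_attr(a, v, d):
    """m-estimate of Pr(a(x)=v | c(x)=d) with m = |A_i| and p = 1/|A_i|."""
    m = len(VALUES[a])
    return (value_count[(a, d)][v] + m * (1 / m)) / (class_count[d] + m)

def classify(x_new):
    """argmax over d of Pr(c(x)=d) * prod_i Pr(a_i(x)=a_i(x*) | c(x)=d)."""
    best, best_score = None, -1.0
    for d in class_count:
        score = class_count[d] / len(T)                  # Pr(c(x)=d)
        for a in ATTRS:
            score *= p_attr(a, x_new[a], d)
        if score > best_score:
            best, best_score = d, score
    return best

print(classify({"Outlook": "sunny", "Temp": "cool", "Humid": "high", "Wind": "TRUE"}))  # "no"
```

Note that the training work is a single counting pass over T, in line with the O(n · |T|) complexity estimate above.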

Requirements for hypotheses

On a higher level of abstraction we can demand that the hypothesis not only be the best (most probable) one, but also the simplest one. This may be seen as a special application of lex parsimoniæ (Occam's razor). We prefer the simplest explanation, i.e., the hypothesis that requires – according to William of Occam – the least amount of assumptions. In practice, lex parsimoniæ is frequently replaced by the simpler Minimum Description Length (MDL) principle.

MDL – Minimum Description Length
MDL recommends the simplest method for re-encoding the data with the use of a hypothesis, i.e., the hypothesis that gives the best compression. Choosing that particular hypothesis yields the shortest algorithm for the reproduction of the data. In classification, this usually means the shortest hypothesis.

MDL in Bayesian classification

Bayesian classifiers are considered one of the best methods for producing MDL-compliant hypotheses. For the purposes of comparing description lengths, in the example below we define the length via the (binary) logarithm of the corresponding probability. Taking the logarithm of Bayes' formula, we get:

log Pr(h|T) = log Pr(h) + log Pr(T|h) − log Pr(T)

Substituting L(.) for −log Pr(.) we obtain:

L(h|T ) = L(h) + L(T |h) − L(T )

where L(h),L(T |h) represent the length of hypothesis h and length of data T (given h). In both cases we assume that the encoding is known and optimal.
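A tiny numeric check of this identity (illustrative; the probabilities are made up and description lengths are measured in bits as −log₂ Pr).

```python
from math import log2

# Made-up probabilities for one hypothesis h and data set T.
pr_h, pr_T_given_h, pr_T = 0.25, 0.125, 0.0625

L = lambda prob: -log2(prob)                 # description length in bits
pr_h_given_T = pr_T_given_h * pr_h / pr_T    # Bayes' formula

# L(h|T) = L(h) + L(T|h) - L(T): both sides come out to 1.0 bit here.
print(L(pr_h_given_T), L(pr_h) + L(pr_T_given_h) - L(pr_T))
```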

MDL in Bayesian classification

Ultimately, we select a hypothesis that is the best w.r.t. MDL:

hMDL = argmin_{h∈H} ( L_{Enc_H}(h) + L_{Enc_D}(T|h) )

Assuming that Enc_H and Enc_D are optimal encodings of the hypotheses and the data, respectively, we obtain: hMDL = hMAP. Intuitively, MDL helps to find the right balance between the quality and the complexity of a hypothesis. The MDL principle is frequently used for scoring candidate hypotheses constructed by other means. It is also applicable to the task of simplifying existing hypotheses, for example in the filtering of decision rule sets and in decision tree pruning. It also provides an effective stop criterion for many practical algorithms.


MDL is also connected with the more general notion of Kolmogorov Complexity (descriptive complexity, Kolmogorov–Chaitin complexity, algorithmic entropy). The Kolmogorov Complexity of a finite or infinite sequence of symbols (stream of data) is defined as the length of the simplest (shortest) algorithm that generates this data. Naturally, the notion of algorithm length is quite complicated and requires a formal definition. Such a definition is usually given with the use of formal languages and Turing machines. In most non-trivial cases the task of calculating the Kolmogorov complexity of a sequence is very hard, frequently practically impossible (undecidable).

Let us consider two finite sequences of digits:

1415926535897932384626433832795028841971 – has a very low Kolmogorov complexity, since there exists a very simple algorithm that generates the decimal expansion of π.

5230619672181840811135324016881717004139 – is a random sequence with potentially very high Kolmogorov complexity.