Applied Logic Lecture 4 part 2 – Bayesian inductive reasoning
Marcin Szczuka
Institute of Informatics, The University of Warsaw
Monographic lecture, Spring semester 2018/2019
Marcin Szczuka (MIMUW) Applied Logic 2019 1 / 34

Those illiterate in the general theory of probability still keep asking why, of all things, Trurl probabilized a dragon instead of an elf or a dwarf. They do so out of ignorance, for they do not know that a dragon is simply more probable than a dwarf ...

Stanisław Lem, The Cyberiad, "Fable Three, or Dragons of Probability"
Lecture plan
1 Introduction
2 Bayesian reasoning
3 Bayesian prediction and decision support
  Classification problems
  Selecting hypothesis - MAP and ML
  Bayesian Optimal Classifier
  Naïve Bayes classifier
4 Hypothesis selection – general issues
Measure of truth/possibility
Recall that from an inductive (quasi-)formal system that we dare to call an inductive logic we expect a measure of support. This measure gives us the degree to which the truthfulness of the premises influences the truthfulness of the conclusions. We require:
1 Fulfillment of the Criterion of Adequacy (CoA).
2 Ensuring that the degree of confidence in the inferred conclusion is no greater than the confidence in the premises and inference rules.
3 Ability to clearly discern between proper conclusions (hypotheses) and nonsensical ones.
4 Intuitive interpretation.
Probabilistic inference
From the earliest onset researchers tried to match the inductive reasoning paradigm with probability and/or statistics. Over time probability-based reasoning, in particular Bayesian reasoning, has established itself as a central focal point for philosophers and logicians working on the formalisation of inductive systems (inductive logics). Elements of probabilistic reasoning can be found in the works of Pascal, Fermat, and others. The modern, formal approach to inductive logic based on the notions of similarity and probability was proposed by John Maynard Keynes in A Treatise on Probability (1921). Rudolf Carnap developed these ideas further in his Logical Foundations of Probability (1950) and other works, which are now considered a cornerstone of probabilistic logic. After the mathematical theory of probability was "ordered" by Kolmogorov, probabilistic reasoning gained more traction as a proper, formal theory.
Probabilistic inductive logic
In the case of inductive logics, in particular those based on probability, there is very little point in considering the strict formal consequence relation ⊢ and its relationship with the relation |=. For the relation |= we usually consider a support (probability) mapping rather than exact logical consequences.

Support mapping (function)
A function P : L → [0, 1], where L is a set of statements (a language), is called a support function if for all statements A, B, C in L the following hold:
1 There exists at least one pair of statements D, E ∈ L for which P(D|E) < 1.
2 If B |= A then P(A|B) = 1.
3 If |= (B ≡ C) then P(A|B) = P(A|C).
4 If C |= ¬(A ∧ B) then either P(A ∨ B|C) = P(A|C) + P(B|C) or ∀D∈L P(D|C) = 1.
5 P((A ∧ B)|C) = P(A|(B ∧ C)) × P(B|C).
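The conditions above can be sanity-checked on a tiny finite model. In the sketch below (all names and the uniform measure are assumptions for illustration, not part of the lecture) statements are modelled as sets of "possible worlds", and P(A|B) is ordinary conditional probability:

```python
# Checking the support-function conditions on a tiny finite model.
# Worlds are truth assignments to two propositional variables p, q;
# the uniform measure over worlds is an assumed toy choice.
worlds = [(p, q) for p in (0, 1) for q in (0, 1)]
prob = {w: 0.25 for w in worlds}

def P(A, B=None):
    # Conditional support P(A|B); by convention P(.|B) = 1 when Pr(B) = 0,
    # mirroring the "or all supports equal 1" escape clause in condition 4.
    B = set(worlds) if B is None else B
    pb = sum(prob[w] for w in B)
    return 1.0 if pb == 0 else sum(prob[w] for w in A & B) / pb

Ap = {w for w in worlds if w[0]}   # statement "p"
Aq = {w for w in worlds if w[1]}   # statement "q"
An = set(worlds) - Ap              # statement "not p"
All = set(worlds)                  # the tautology "p or not p"

assert P(Ap, Ap & Aq) == 1.0                        # condition 2: B |= A
assert P(Ap | An, All) == P(Ap, All) + P(An, All)   # condition 4: additivity
assert P(Ap & Aq, All) == P(Ap, Aq) * P(Aq, All)    # condition 5: chain rule
print("support-function conditions hold on this model")
```

The model is deliberately minimal; the point is only that any genuine probability measure satisfies the listed conditions.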
Marcin Szczuka (MIMUW) Applied Logic 2019 6 / 34 Probabilistic inductive logic
It's easy to see that the conditions for the support function P are a reformulation of the axioms for a probability measure. In the definition of P the operator | corresponds to logical entailment, i.e., the basic step in reasoning. It is easy to see that the mapping P is not uniquely defined. The conditions for P are essentially the same as for (unconditional) probability: it suffices to set P(A) = P(A|(D ∨ ¬D)) for some sentence (event) D. However, these conditions also allow for establishing the value P(A|C) when the probability of the event C is 0 (P(C) = P(C|(D ∨ ¬D)) = 0). Condition 1 (non-triviality) in the definition of P can also be expressed as

∃A∈L P((A ∧ ¬A)|(A ∨ ¬A)) < 1.
Probability
At this point we need to introduce the (simplified) axioms for a probability measure that we will use from now on. To clearly discern from the previous notation, we will use Pr to denote the probability measure.

Axioms for discrete probability (Kolmogorov)
1 For each event A ⊆ Ω the value Pr(A) ∈ [0, 1].
2 Unit measure Pr(Ω) = 1.
3 Additivity – if A1, ..., An are mutually exclusive events, then

∑_{i=1}^{n} Pr(Ai) = 1 ⇒ Pr(B) = ∑_{i=1}^{n} Pr(B|Ai) Pr(Ai).
Axiom 2 (unit measure) may be a source of some concern for us.
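Axiom 3 (total probability) is easy to verify numerically. A minimal sketch with assumed values for a two-event partition:

```python
# Assumed toy numbers: two mutually exclusive, exhaustive events A1, A2
# and the conditional probabilities of B given each of them.
pr_A = [0.3, 0.7]          # Pr(A1), Pr(A2); they sum to 1
pr_B_given_A = [0.9, 0.2]  # Pr(B|A1), Pr(B|A2)

# Pr(B) = sum_i Pr(B|Ai) * Pr(Ai)
pr_B = sum(pb * pa for pb, pa in zip(pr_B_given_A, pr_A))
print(round(pr_B, 2))  # 0.9*0.3 + 0.2*0.7 = 0.41
```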
Properties of probability
Pr(A ∧ B) = Pr(B) · Pr(A|B) = Pr(A) · Pr(B|A)

Pr(A ∨ B) = Pr(A) + Pr(B) − Pr(A ∧ B)

Pr(A|B) – the (conditional) probability of A given B:

Pr(A|B) = Pr(A ∧ B) / Pr(B)

Bayes' rule

Pr(A|B) = Pr(B|A) · Pr(A) / Pr(B)
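A small numeric illustration of Bayes' rule; all probabilities below are assumed for the example (think of A as a rare condition and B as an imperfect positive test):

```python
# Assumed values: Pr(A) = 0.01, Pr(B|A) = 0.99, Pr(B|not A) = 0.05.
pr_A = 0.01
pr_B_given_A = 0.99
pr_B_given_notA = 0.05

# Pr(B) by total probability, then Pr(A|B) by Bayes' rule.
pr_B = pr_B_given_A * pr_A + pr_B_given_notA * (1 - pr_A)
pr_A_given_B = pr_B_given_A * pr_A / pr_B
print(round(pr_A_given_B, 4))  # 0.1667 -- observing B still leaves A unlikely
```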
Bayesian inference
For reasons that will become clear in the next part of the lecture, we will use the following notation:
T ⊂ X – the set of premises (evidence set) coming from a (huge) universe X.
h ∈ H – a conclusion (hypothesis) coming from some (huge) set of hypotheses H.
VS_{H,T} – the version space, i.e., the subset of H containing hypotheses that are consistent with T.

Inference rule (Bayes')
For a hypothesis h ∈ H and evidence set T ⊂ X:

Pr(h|T) = Pr(T|h) · Pr(h) / Pr(T)
The probability (level of support) of a conclusion (hypothesis) h is established on the basis of the support of the premises (evidence) and the degree to which the hypothesis justifies the existence of the evidence (premises).

Remarks
Pr(h|T) – the a posteriori (posterior) probability of hypothesis h given the premises (evidence data) T. That is what we are looking for.
Pr(T) – the probability of the premises (evidence data) T. Fortunately, we do not have to know it if we are only interested in comparing posterior probabilities of hypotheses. If, for some reason, we need to calculate it directly, then we may have a problem.
We need to calculate Pr(h) and Pr(T|h). For the moment we assume that we can do that and that H is known.
Pr(T|h) determines the degree to which h justifies the appearance (truthfulness) of the premises in T.
Decision support tasks
The real usefulness of the Bayesian approach is visible in its practical applications, the most popular of which is decision support (classification). Decision support (classification) is an example of using inductive inference methods such as prediction, argument by analogy, and eliminative induction. We are going to construct Bayesian classifiers, i.e., algorithms (procedures) that "learn" the probability of a decision value (classification) for new cases on the basis of cases observed previously (the training sample). By restricting the reasoning task to prediction of the decision value we can produce a computationally viable, automated tool.
Classifiers - basic notions
The domain (space, universe) is the set X from which we draw examples. An element x ∈ X is called an example (instance, case, record, entity, vector, object, row). An attribute (feature, variable, measurement) is a function
a : X → A.
Set A is called attribute value set or attribute domain. We assume that each example x ∈ X is completely represented by the vector
a1(x), ..., an(x),
where ai : X → Ai for i = 1, ..., n. The number n is sometimes called the size (length) of an example. For our purposes we usually distinguish a special decision attribute (decision, class), traditionally denoted dec or d.

Tabular data
Outlook   Temp  Humid   Wind   EnjoySpt
sunny     hot   high    FALSE  no
sunny     hot   high    TRUE   no
overcast  hot   high    FALSE  yes
rainy     mild  high    FALSE  yes
rainy     cool  normal  FALSE  yes
rainy     cool  normal  TRUE   no
overcast  cool  normal  TRUE   yes
sunny     mild  high    FALSE  no
......
rainy     mild  high    TRUE   no
Classifier
The training set (training sample) T ⊆ X corresponds to the set of premises.
T^d – the subset of training data with decision d, which corresponds to the set of premises supporting a particular hypothesis.
T^d_{ai=v} – the subset of training data with attribute ai equal to v and decision d. This corresponds to the set of premises of a particular type supporting a particular hypothesis.
The hypothesis space H is now limited to the set of possible decision values, i.e., conditions (dec = d), where d ∈ Vdec.

Classification task
Given a training sample T, determine the best (most probable) value of dec(x) for a previously unseen case x ∈ X (x ∉ T).

Question: How to choose the best value of the decision?
Hypothesis selection - MAP
In Bayesian classification we want to find the most probable decision value for new example x given the collection of previously seen (training) examples and attribute values for x. So, using Bayes’ formula we need to find a hypothesis h (decision value) that maximises support (empirical probability).
MAP - Maximum A Posteriori hypothesis Given training set T we attempt to classify example x ∈ X using hypothesis hMAP ∈ H by assigning to object x the decision value given by:
hMAP = arg max_{h∈H} Pr(h|T) = arg max_{h∈H} Pr(T|h) · Pr(h)

In MAP we choose the hypothesis that is the most probable.
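A minimal sketch of MAP selection over a small hypothesis space; the hypothesis names, priors, and likelihoods below are assumed for illustration:

```python
# Assumed priors Pr(h) and likelihoods Pr(T|h) for three hypotheses.
priors = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
likelihoods = {"h1": 0.1, "h2": 0.4, "h3": 0.35}

# MAP: maximise Pr(T|h) * Pr(h); Pr(T) is a common factor and can be dropped.
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])
print(h_map)  # h2: 0.3*0.4 = 0.12 beats 0.05 (h1) and 0.07 (h3)
```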
Hypothesis selection - ML
ML - Maximum Likelihood hypothesis Given training set T we attempt to classify example x ∈ X using hypothesis hML ∈ H by assigning to object x the decision value given by:
hML = arg max_{h∈H} Pr(T|h).
In the ML approach we choose the hypothesis that best explains (makes most likely) the existence of our training sample. Note that the hypothesis hML may itself have low probability, yet be very well adjusted to our particular data.
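The contrast between ML and MAP can be sketched with assumed numbers; here the likelihood-maximising hypothesis has a low prior, so the two rules disagree:

```python
# Assumed values: h2 explains the data best but is a priori unlikely.
likelihoods = {"h1": 0.10, "h2": 0.40, "h3": 0.35}  # Pr(T|h)
priors = {"h1": 0.80, "h2": 0.05, "h3": 0.15}       # Pr(h)

h_ml = max(likelihoods, key=likelihoods.get)                     # ML rule
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])    # MAP rule
print(h_ml, h_map)  # h2 h1 -- an improbable hypothesis can still maximise likelihood
```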
Discussion of ML and MAP

Both methods require the knowledge of Pr(T|h). In the case of MAP we also need Pr(h) to be able to use Bayes' formula.
MAP is quite natural, but has major drawbacks. In particular, it promotes the dominating decision value.
Both methods assume that the training set is error-free and that the hypothesis we look for is in H.
ML is close to the intuitive understanding of inductive learning. In the process of selecting a hypothesis we go for the one that gives the best reason for the existence of the particular training set we have.
The MAP rule selects the most probable hypothesis, while we are rather interested in selecting the most probable decision value for an example. Consider Vdec = {0, 1}, H = {hMAP, h1, ..., hm}, with hi(x) = 0 for all 1 ≤ i ≤ m, hMAP(x) = 1, and

Pr(hMAP|T) < ∑_{i=1}^{m} Pr(hi|T).

Then the most probable decision value for x is 0, even though the single most probable hypothesis predicts 1.

Finding probabilities
Pr(h) – the easier part. We may either be given a probability (by the learning method) or treat all hypotheses equally. In the latter case:

Pr(h) = 1 / |H|

The problem is the size of H. It may be a HUGE space. Also, in reality, we may not even know the whole H.
Pr(T|h) – the harder part. Notice that we are in fact only interested in decision making. We want to know the probability that a sample T will be consistent (will have the same decision) with hypothesis h. This yields:

Pr(T|h) = 1 if h ∈ VS_{H,T}
Pr(T|h) = 0 if h ∉ VS_{H,T}
Unfortunately, the problem with size of H is still present.
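The 0/1 "consistency" likelihood above can be sketched directly; the toy sample and the two candidate hypotheses below are assumptions for illustration:

```python
# Assumed toy sample T: pairs (x, d) of an example and its decision,
# and a tiny hypothesis space of predicate functions.
T = [(1, 0), (2, 0), (3, 1), (4, 1)]
H = {
    "ge3": lambda x: int(x >= 3),      # consistent with every pair in T
    "even": lambda x: int(x % 2 == 0), # disagrees with T on x = 2
}

def likelihood(h):
    # Pr(T|h) = 1 iff h belongs to the version space VS_{H,T}
    return 1 if all(h(x) == d for x, d in T) else 0

print({name: likelihood(h) for name, h in H.items()})  # {'ge3': 1, 'even': 0}
```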
ML and MAP in practice
MAP and/or ML, despite serious practical limitations, can still be used in some special cases, given that:
The hypothesis space is very restricted (and reasonably small).
We use MAP and/or ML to score a (small) set of competing hypotheses constructed by other means. This relates to the topics of stacking, coupled classifiers, and layered learning.
Bayesian Optimal Classifier
The Bayesian Optimal Classifier (BOC) always returns the most probable decision value for an example. In this respect it cannot be beaten by any other algorithm in terms of true (global) error. Sadly, the BOC isn't very useful from a practical point of view, since it uses the entire hypothesis space. Let c(.) be the desired decision (target concept) and T the training sample. Then
hBOC = arg max_{d∈Vdec} Pr(c(x) = d|T)

where:

Pr(c(x) = d|T) = ∑_{h∈H} Pr(c(x) = d|h) Pr(h|T)

Pr(c(x) = d|h) = 1 if h(x) = d, and 0 if h(x) ≠ d.
The hypothesis returned by BOC may not belong to H.
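A sketch of the BOC computation with assumed posteriors Pr(h|T) and hypothesis predictions h(x) for one fixed example x:

```python
# Assumed values: three hypotheses with their posteriors and their
# predictions for a single example x; decision values are 0 and 1.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}  # Pr(h|T)
prediction = {"h1": 1, "h2": 0, "h3": 0}         # h(x)

def boc(decisions=(0, 1)):
    # Pr(c(x)=d|T) = sum over h with h(x)=d of Pr(h|T)
    support = {d: sum(p for h, p in posteriors.items() if prediction[h] == d)
               for d in decisions}
    return max(support, key=support.get), support

print(boc())  # (0, {0: 0.6, 1: 0.4}) -- differs from the single MAP hypothesis h1
```

Note how this resolves the MAP pitfall discussed earlier: the posterior mass of the hypotheses predicting 0 outweighs the single most probable hypothesis.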
Naïve Bayes classifier
Let x∗ be a new example that we need to classify. We should select a hypothesis h such that:

h(x∗) = arg max_{d∈Vdec} Pr(c(x) = d | ∧_{i=1}^{n} ai(x) = ai(x∗))

Hence, from Bayes' formula:

arg max_{d∈Vdec} Pr(c(x) = d) · Pr(∧_{i=1}^{n} ai(x) = ai(x∗) | c(x) = d)

If we (naïvely) assume that the attributes are independent as random variables, then:

arg max_{d∈Vdec} Pr(c(x) = d) · ∏_{i=1}^{n} Pr(ai(x) = ai(x∗) | c(x) = d)

All that is left to do is to estimate Pr(c(x) = d) and Pr(ai(x) = v | c(x) = d) from data.

NBC - technical details
Usually, we employ an m-estimate to get
Pr(ai(x) = v | c(x) = d) = (|T^d_{ai=v}| + mp) / (|T^d| + m)

where m is an integer parameter and p is the prior probability of the decision class. Usually, if no background knowledge is given, we set m = |Ai| and p = 1/|Ai|, where Ai is the (finite) set of values of attribute ai.

Complexity of NBC
For each example we have to modify the counts for its decision class and for its particular attribute values, i.e., O(n · |T|) basic computational steps in total.
The complexity of NBC is the lowest "rational" estimate for any classification algorithm without prior knowledge. Also, each step in NBC is fast and cheap, hence the method is computationally efficient.
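The whole procedure, including the m-estimate with m = |Ai| and p = 1/|Ai| (so mp = 1), can be sketched on the weather table from the earlier slide; this is a minimal illustration, not a production implementation:

```python
from collections import Counter, defaultdict

# The weather table from the slides: (Outlook, Temp, Humid, Wind) -> EnjoySpt.
rows = [
    (("sunny", "hot", "high", "FALSE"), "no"),
    (("sunny", "hot", "high", "TRUE"), "no"),
    (("overcast", "hot", "high", "FALSE"), "yes"),
    (("rainy", "mild", "high", "FALSE"), "yes"),
    (("rainy", "cool", "normal", "FALSE"), "yes"),
    (("rainy", "cool", "normal", "TRUE"), "no"),
    (("overcast", "cool", "normal", "TRUE"), "yes"),
    (("sunny", "mild", "high", "FALSE"), "no"),
]

class_count = Counter(d for _, d in rows)   # |T^d| per decision d
pair_count = defaultdict(Counter)           # |T^d_{ai=v}| per (attribute, value)
values = defaultdict(set)                   # observed value set A_i per attribute
for x, d in rows:
    for i, v in enumerate(x):
        pair_count[d][(i, v)] += 1
        values[i].add(v)

def p_attr(i, v, d):
    # m-estimate: (|T^d_{ai=v}| + m*p) / (|T^d| + m) with m = |A_i|, p = 1/|A_i|,
    # so the numerator correction m*p equals 1.
    m = len(values[i])
    return (pair_count[d][(i, v)] + 1.0) / (class_count[d] + m)

def classify(x):
    # arg max_d Pr(c(x)=d) * prod_i Pr(ai(x)=v_i | c(x)=d)
    def score(d):
        s = class_count[d] / len(rows)
        for i, v in enumerate(x):
            s *= p_attr(i, v, d)            # naive independence assumption
        return s
    return max(class_count, key=score)

print(classify(("rainy", "mild", "high", "TRUE")))  # no
```

The predicted value agrees with the label of the matching row in the table above.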
Requirements for hypotheses
On a higher level of abstraction we can demand that a hypothesis not only be the best (most probable) explanation, but also the simplest one. This may be seen as a special application of lex parsimoniæ (Occam's razor). We prefer the simplest explanation, i.e., the hypothesis that requires – according to William of Occam – the least amount of assumptions. In practice, lex parsimoniæ is frequently replaced by the simpler Minimum Description Length (MDL) principle.

MDL - Minimum Description Length
MDL recommends the hypothesis that gives the simplest re-encoding of the data, i.e., the best compression. Choosing such a hypothesis produces the shortest algorithm for reproducing the data. In classification, this usually means the shortest hypothesis.
MDL in Bayesian classification
Bayesian classifiers are considered one of the best methods for producing MDL-compliant hypotheses. For the purpose of comparing description lengths below, we measure the length of a description by the (binary) logarithm of its probability. Taking the logarithm of Bayes' formula, we get:
log Pr(h|T) = log Pr(h) + log Pr(T|h) − log Pr(T)

Substituting L(.) for − log Pr(.) we obtain:

L(h|T) = L(h) + L(T|h) − L(T)

where L(h), L(T|h) represent the length of the hypothesis h and the length of the data T (given h), respectively. In both cases we assume that the encoding is known and optimal.
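The correspondence between code lengths and probabilities can be sketched with assumed values; minimising L(h) + L(T|h) selects the same hypothesis as maximising Pr(h) · Pr(T|h):

```python
from math import log2

# Assumed toy probabilities for two competing hypotheses.
pr_h = {"simple": 0.6, "complex": 0.4}            # Pr(h)
pr_T_given_h = {"simple": 0.05, "complex": 0.10}  # Pr(T|h)

def L(p):
    # description length in bits of an event of probability p
    return -log2(p)

# total code length L(h) + L(T|h) for each hypothesis
total = {h: L(pr_h[h]) + L(pr_T_given_h[h]) for h in pr_h}
best = min(total, key=total.get)
print(best)  # complex -- since 0.4*0.10 > 0.6*0.05, MDL and MAP agree
```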
MDL in Bayesian classification
Ultimately, we select a hypothesis that is the best w.r.t. MDL:
hMDL = arg min_{h∈H} L_{EncH}(h) + L_{EncD}(T|h)

Assuming that EncH and EncD are optimal encodings of hypotheses and of data, respectively, we get hMDL = hMAP.
Intuitively, MDL helps to find the right balance between the quality and the simplicity of a hypothesis.
The MDL principle is frequently used for scoring candidate hypotheses constructed by other means. It is also applicable to the task of simplifying existing hypotheses, for example in the filtering of decision rule sets and in decision tree pruning. It also provides an effective stop criterion for many practical algorithms.
Kolmogorov complexity
MDL is also connected with the more general notion of Kolmogorov complexity (descriptive complexity, Kolmogorov–Chaitin complexity, algorithmic entropy).
The Kolmogorov complexity of a finite or infinite sequence of symbols (stream of data) is defined as the length of the simplest (shortest) algorithm that generates this data.
Naturally, the notion of algorithm length is quite complicated and requires a formal definition. Such a definition is usually given with the use of formal languages and Turing machines.
In most non-trivial cases the task of calculating the Kolmogorov complexity of a sequence is very hard, frequently practically impossible (undecidable).
Let's consider two finite sequences of numbers:
1415926535897932384626433832795028841971 – has a very low Kolmogorov complexity, since there exists a very simple algorithm that generates the decimal expansion of π.
5230619672181840811135324016881717004139 – is a random sequence with potentially very high Kolmogorov complexity.
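Kolmogorov complexity itself is uncomputable, but compressed size gives a crude upper-bound proxy for it. A hedged sketch contrasting a highly regular sequence with the random-looking digits above:

```python
import zlib

# A periodic 40-character sequence versus the 40-digit random-looking
# sequence from the slide; compressed size is only a rough proxy for
# Kolmogorov complexity, not the quantity itself.
regular = ("0123456789" * 4).encode()
random_ish = b"5230619672181840811135324016881717004139"

# The periodic sequence compresses well; the random-looking one barely at all.
print(len(zlib.compress(regular)), len(zlib.compress(random_ish)))
```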