
Applied Logic, Lecture 4 part 2 – Bayesian inductive reasoning

Marcin Szczuka

Institute of Informatics, The University of Warsaw

Monographic lecture, Spring semester 2018/2019

Those illiterate in the general theory still keep asking why, of all things, Trurl probabilized the dragon rather than an elf or a dwarf. They do so out of ignorance, since they do not know that the dragon is simply more probable than a dwarf ...

Stanisław Lem, The Cyberiad, "The Third Sally, or The Dragons of Probability"

Lecture plan

1 Introduction

2 Bayesian reasoning

3 Bayesian prediction and decision support
  Classification problems
  Selecting hypothesis – MAP and ML
  Bayesian Optimal Classifier
  Naïve Bayes classifier

4 Hypothesis selection – general issues

Measure of support/possibility

Recall that from an inductive (quasi-)formal system that we dare to call inductive logic we expect a measure of support. This measure gives us the degree to which the truthfulness of the premises influences the truthfulness of the conclusions. We require:

1 Fulfillment of the Criterion of Adequacy (CoA).

2 Ensuring that the degree of confidence in the inferred conclusion is no greater than the confidence in the premises and rules.

3 Ability to clearly discern between proper conclusions (hypotheses) and nonsensical ones.

4 Intuitive interpretation.

Probabilistic inference

From the very beginning researchers tried to connect inductive reasoning with probability and/or statistics. Over time probability-based reasoning, in particular Bayesian reasoning, has established itself as a central focal point for philosophers and logicians working on the formalisation of inductive systems (inductive logics). Early elements of probabilistic reasoning can be found in the works of Pascal, Fermat, and others. A modern, formal approach to inductive logic based on the notions of similarity and probability was proposed by John Maynard Keynes in A Treatise on Probability (1921). Rudolf Carnap developed these ideas further in his Logical Foundations of Probability (1950) and other works, which are now considered a cornerstone of probabilistic logic. After the mathematical theory of probability was "ordered" by Kolmogorov, probabilistic reasoning gained more traction as a proper, formal theory.

Probabilistic inductive logic

In the case of inductive logics, in particular those based on probability, there is very little point in considering the strict formal consequence relation ⊢ and its relationship with the relation |=. For the relation |= we usually consider a support (probability) mapping rather than exact logical consequences.

Support mapping (function)
A function P : L → [0, 1], where L is a set of statements (a language), is called a support function if for all statements A, B, C in L the following hold:

1 There exists at least one pair D, E ∈ L for which P(D|E) < 1.

2 If B |= A then P(A|B) = 1.

3 If |= (B ≡ C) then P(A|B) = P(A|C).

4 If C |= ¬(A ∧ B) then either P(A ∨ B|C) = P(A|C) + P(B|C) or ∀D∈L P(D|C) = 1.

5 P((A ∧ B)|C) = P(A|(B ∧ C)) × P(B|C).
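As a concrete illustration (not part of the original slides), the sketch below builds a support function P(A|B) from an ordinary discrete probability distribution over truth assignments of two propositional variables and spot-checks a few of the conditions above. The tiny language and the uniform distribution are invented for the example.

```python
from itertools import product

# A toy "language": statements are predicates over truth assignments
# to two propositional variables p and q.
worlds = list(product([True, False], repeat=2))   # all assignments (p, q)
prob = {w: 0.25 for w in worlds}                  # made-up uniform distribution

def P(A, B=lambda w: True):
    """Support (conditional probability) of statement A given statement B."""
    mass_B = sum(prob[w] for w in worlds if B(w))
    if mass_B == 0:
        return 1.0        # matches the escape clause of condition 4
    mass_AB = sum(prob[w] for w in worlds if A(w) and B(w))
    return mass_AB / mass_B

p = lambda w: w[0]
q = lambda w: w[1]

# Condition 1 (non-triviality): some pair with support below 1.
assert P(p, q) < 1
# Condition 2: if B |= A then P(A|B) = 1, e.g. (p and q) |= p.
assert P(p, lambda w: p(w) and q(w)) == 1
# Condition 5: P(A and B | C) = P(A | B and C) * P(B | C).
A, B, C = p, q, (lambda w: True)
assert abs(P(lambda w: A(w) and B(w), C) - P(A, lambda w: B(w) and C(w)) * P(B, C)) < 1e-9
```

Any probability measure induces a support function in this way, although, as noted on the next slide, the conditions do not determine P uniquely.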

Probabilistic inductive logic

It is easy to see that the conditions for the support function P are a re-formulation of the axioms for a probability measure. In the definition of P the operator | corresponds to logical entailment, i.e., the basic step in reasoning. It is also easy to see that the mapping P is not uniquely defined. The conditions for P are essentially the same as for (unconditional) probability. It suffices to set P(A) = P(A|(D ∨ ¬D)) for some sentence (event) D. However, these conditions also allow for establishing the value P(A|C) when the probability of the event C is 0 (P(C) = P(C|(D ∨ ¬D)) = 0). Condition 1 (non-triviality) in the definition of P can also be expressed as

∃A∈LP ((A ∧ ¬A)|(A ∨ ¬A)) < 1.

Probability

At this point we need to introduce the (simplified) axioms for a probability measure that we will use further on. In order to clearly distinguish it from the previous notation we will use Pr to denote the probability measure.

Axioms for discrete probability (Kolmogorov)

1 For each event A ⊆ Ω the value Pr(A) ∈ [0, 1].

2 Unit measure Pr(Ω) = 1.

3 Additivity – if A1, ..., An are mutually exclusive events, then

∑_{i=1}^{n} Pr(Ai) = 1  ⇒  Pr(B) = ∑_{i=1}^{n} Pr(B|Ai) · Pr(Ai).

Axiom 2 (unit measure) may be a source of some concern for us.

Properties of probability

Pr(A ∧ B) = Pr(B) · Pr(A|B) = Pr(A) · Pr(B|A)

Pr(A ∨ B) = Pr(A) + Pr(B) − Pr(A ∧ B)

Pr(A|B) – (conditional) probability of A given B:

Pr(A|B) = Pr(A ∧ B) / Pr(B)

Bayes' rule

Pr(A|B) = Pr(B|A) · Pr(A) / Pr(B)
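A minimal numeric sketch of Bayes' rule (my own illustration, not from the slides); the prior, likelihood and false-positive rate below are invented numbers for a hypothetical diagnostic-test scenario.

```python
# Hypothetical numbers: A - "condition present", B - "test positive".
prior_A = 0.01                         # Pr(A)
likelihood_B_given_A = 0.95            # Pr(B|A)
false_positive_rate = 0.05             # Pr(B|not A)

# Pr(B) via the total-probability formula from the axioms slide.
prob_B = likelihood_B_given_A * prior_A + false_positive_rate * (1 - prior_A)

# Bayes' rule: Pr(A|B) = Pr(B|A) * Pr(A) / Pr(B)
posterior_A_given_B = likelihood_B_given_A * prior_A / prob_B
print(round(posterior_A_given_B, 3))   # ~0.161: a positive test raises 1% to about 16%
```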


For reasons that will become clear in the next part of the lecture, we will use the following notation.

T ⊂ X – set of premises (evidence set) coming from a (huge) universe X.

h ∈ H – conclusion (hypothesis) coming from some (huge) set of hypotheses H.

VS_{H,T} – version space, i.e., the subset of H containing hypotheses that are consistent with T.

Inference rule (Bayes')
For a hypothesis h ∈ H and an evidence set T ⊂ X:

Pr(h|T) = Pr(T|h) · Pr(h) / Pr(T)

The probability (level of support) of the conclusion (hypothesis) h is established on the basis of the support of the premises (evidence) and the degree to which the hypothesis justifies the existence of the evidence (premises).

Remarks

Pr(h|T) – a posteriori (posterior) probability of the hypothesis h given the premises (evidence data) T. That is what we are looking for.

Pr(T) – probability of the premises (evidence data) T. Fortunately, we do not have to know it if we are only interested in comparing posteriors of hypotheses. If, for some reason, we need to calculate it directly, then we may have a problem.

We need to calculate Pr(h) and Pr(T|h). For the moment we assume that we can do that and that H is known.

Pr(T|h) determines the degree to which h justifies the appearance (truthfulness) of the premises in T.

Decision support tasks

The real usefulness of the Bayesian approach is visible in its practical applications. The most popular of these is decision support (classification). Decision support (classification) is an example of using inductive inference methods such as prediction, induction by enumeration, and eliminative induction. We are going to discuss Bayesian classifiers, i.e., algorithms (procedures) that "learn" the probability of the decision value (classification) for new cases on the basis of cases observed previously (the training set). By restricting the reasoning task to prediction of the decision value we can produce a computationally viable, automated tool.

Classifiers – basic notions

The domain (space, universe) is a set X from which we draw examples. An element x ∈ X is referred to as an example (instance, case, record, entity, vector, object, row). An attribute (feature, variable, measurement) is a function

a : X → A.

Set A is called attribute value set or attribute domain. We assume that each example x ∈ X is completely represented by the vector

a1(x), ..., an(x),

where ai : X → Ai for i = 1, ..., n. The number n is sometimes called the size (length) of an example. For our purposes we usually distinguish a special decision attribute (decision, class), traditionally denoted by dec or d.

Tabular data

Outlook   Temp  Humid   Wind   EnjoySpt
sunny     hot   high    FALSE  no
sunny     hot   high    TRUE   no
overcast  hot   high    FALSE  yes
rainy     mild  high    FALSE  yes
rainy     cool  normal  FALSE  yes
rainy     cool  normal  TRUE   no
overcast  cool  normal  TRUE   yes
sunny     mild  high    FALSE  no
...       ...   ...     ...    ...
rainy     mild  high    TRUE   no

Classifier

Training set (training sample) T ⊆ X corresponds to the set of premises.

T^d – subset of the training data with decision d, which corresponds to the set of premises supporting a particular hypothesis.

T^d_{ai=v} – subset of the training data with attribute ai equal to v and decision d. This corresponds to the set of premises of a particular type supporting a particular hypothesis.

The hypothesis space H is now limited to the set of possible decision values, i.e., conditions (dec = d), where d ∈ Vdec.

Classification task
Given a training sample T, determine the best (most probable) value of dec(x) for a previously unseen case x ∈ X (x ∉ T).

Question: How to choose the best value of the decision?
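A small sketch (my own, not from the slides) of how the sets T^d and T^d_{ai=v} translate into simple counts over a tabular training sample; the handful of rows below is copied from the weather table shown earlier, and all function names are ad hoc.

```python
# A few rows of the weather table; "EnjoySpt" plays the role of the decision dec.
T = [
    {"Outlook": "sunny",    "Temp": "hot",  "Humid": "high",   "Wind": "FALSE", "EnjoySpt": "no"},
    {"Outlook": "overcast", "Temp": "hot",  "Humid": "high",   "Wind": "FALSE", "EnjoySpt": "yes"},
    {"Outlook": "rainy",    "Temp": "mild", "Humid": "high",   "Wind": "FALSE", "EnjoySpt": "yes"},
    {"Outlook": "rainy",    "Temp": "cool", "Humid": "normal", "Wind": "TRUE",  "EnjoySpt": "no"},
]

def T_d(T, d, dec="EnjoySpt"):
    """T^d: training rows whose decision equals d."""
    return [x for x in T if x[dec] == d]

def T_d_ai_v(T, d, a, v, dec="EnjoySpt"):
    """T^d_{ai=v}: training rows with decision d and attribute a equal to v."""
    return [x for x in T if x[dec] == d and x[a] == v]

print(len(T_d(T, "yes")))                            # |T^yes| = 2
print(len(T_d_ai_v(T, "yes", "Outlook", "rainy")))   # |T^yes_{Outlook=rainy}| = 1
```

These counts are exactly what the Naïve Bayes classifier discussed later is built from.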

Hypothesis selection – MAP and ML

In Bayesian classification we want to find the most probable decision value for a new example x, given the collection of previously seen (training) examples and the attribute values for x. So, using Bayes' formula, we need to find a hypothesis h (decision value) that maximises the support (empirical probability).

MAP – Maximum A Posteriori hypothesis
Given a training set T we attempt to classify an example x ∈ X using the hypothesis hMAP ∈ H by assigning to the object x the decision value given by:

hMAP = argmax_{h∈H} Pr(h|T) = argmax_{h∈H} Pr(T|h) · Pr(h)

In MAP we choose the hypothesis that is the most probable.

Hypothesis selection – ML

ML – Maximum Likelihood hypothesis
Given a training set T we attempt to classify an example x ∈ X using the hypothesis hML ∈ H by assigning to the object x the decision value given by:

hML = argmax_{h∈H} Pr(T|h).

In the ML approach we choose the hypothesis that best explains (makes most likely) the existence of our training sample. Note that the hypothesis hML may itself have low probability, yet be very well adjusted to our particular data.
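A toy sketch contrasting the two selection rules (purely illustrative; the three hypotheses and their priors Pr(h) and likelihoods Pr(T|h) are invented numbers).

```python
# Invented priors Pr(h) and likelihoods Pr(T|h) for three hypothetical hypotheses.
prior      = {"h1": 0.70, "h2": 0.25, "h3": 0.05}
likelihood = {"h1": 0.10, "h2": 0.40, "h3": 0.90}

# MAP maximises Pr(T|h) * Pr(h); ML maximises Pr(T|h) alone.
h_map = max(prior, key=lambda h: likelihood[h] * prior[h])
h_ml  = max(prior, key=lambda h: likelihood[h])

print(h_map)   # "h2": 0.40 * 0.25 = 0.100 beats 0.070 (h1) and 0.045 (h3)
print(h_ml)    # "h3": largest likelihood, despite having a low prior
```

This mirrors the remark above: hML may have a low prior probability and still fit the particular data best.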

Discussion of ML and MAP

Both methods require knowledge of Pr(T|h). In the case of MAP we also need Pr(h) to be able to use Bayes' formula.

MAP is quite natural, but has major drawbacks. In particular, it promotes the dominating decision value.

Both methods assume that the training set is error-free and that the hypothesis we look for is in H.

ML is close to the intuitive understanding of inductive reasoning. In the process of selecting a hypothesis we go for the one that gives the best reason for the existence of the particular training set we have.

The MAP rule selects the most probable hypothesis, while we are rather interested in selecting the most probable decision value for an example. Consider Vdec = {0, 1}, H = {hMAP, h1, . . . , hm}, ∀1≤i≤m hi(x) = 0, hMAP(x) = 1, and

Pr(hMAP|T) < ∑_{i=1}^{m} Pr(hi|T).

MAP still assigns the value 1 to x, although the hypotheses voting for 0 jointly carry more posterior probability.

Finding probabilities

Pr(h) – the easier part. We may either be given a probability (by the learning method) or treat all hypotheses equally. In the latter case:

Pr(h) = 1 / |H|

The problem is the size of H. It may be a HUGE space. Also, in practice, we may not even know the whole of H.

Pr(T|h) – the harder part. Notice that we are in fact only interested in decision making. We want to know the probability that a sample T will be consistent (will have the same decision) with the hypothesis h. This yields:

Pr(T|h) = 1 if h ∈ VS_{H,T}, and Pr(T|h) = 0 if h ∉ VS_{H,T}.

Unfortunately, the problem with size of H is still present.
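Under these two assumptions (uniform prior, 0/1 likelihood) the posterior collapses to a uniform distribution over the version space; below is a tiny sketch of that computation, with a made-up hypothesis space and version space.

```python
# Made-up hypothesis space H and version space VS_{H,T}.
H  = ["h1", "h2", "h3", "h4", "h5"]
VS = {"h2", "h4"}                                       # hypotheses consistent with T

prior      = {h: 1 / len(H) for h in H}                 # Pr(h) = 1/|H|
likelihood = {h: (1.0 if h in VS else 0.0) for h in H}  # Pr(T|h)

evidence  = sum(likelihood[h] * prior[h] for h in H)    # Pr(T)
posterior = {h: likelihood[h] * prior[h] / evidence for h in H}
print(posterior)   # 0.5 for h2 and h4, 0.0 elsewhere: uniform over VS_{H,T}
```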

ML and MAP in practice

MAP and/or ML, despite serious practical limitations, can still be used in some special cases, given that:

the hypothesis space is very restricted (and reasonably small);

we use MAP and/or ML to score a (few) competing hypotheses constructed by other methods. This relates to the topics of stacking, coupled classifiers and layered learning.

Bayesian Optimal Classifier

The Bayesian Optimal Classifier (BOC) always returns the most probable decision value for an example. In this respect it cannot be beaten by any other classifier in terms of the true (global) error. Sadly, the BOC is not very useful from a practical point of view since it uses the entire hypothesis space. Let c(.) be the desired decision (target concept) and T a training sample. Then

hBOC = argmax_{d∈Vdec} Pr(c(x) = d|T)

where:

Pr(c(x) = d|T) = ∑_{h∈H} Pr(c(x) = d|h) · Pr(h|T)

Pr(c(x) = d|h) = 1 if h(x) = d, and 0 if h(x) ≠ d.

The hypothesis returned by BOC may not belong to H.
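A compact sketch of the BOC voting scheme defined above (illustrative only; the posteriors Pr(h|T) are invented, and each hypothetical hypothesis is represented simply by the decision it assigns to the example x).

```python
# Invented posteriors Pr(h|T) and the decision h(x) each hypothesis assigns to x.
posterior  = {"hMAP": 0.4, "h1": 0.3, "h2": 0.3}
prediction = {"hMAP": 1,   "h1": 0,   "h2": 0}
decisions  = {0, 1}                                  # Vdec

def boc_decision(posterior, prediction, decisions):
    """Pr(c(x)=d|T) = sum over h with h(x)=d of Pr(h|T); return the best d."""
    support = {d: sum(p for h, p in posterior.items() if prediction[h] == d)
               for d in decisions}
    return max(support, key=support.get)

print(boc_decision(posterior, prediction, decisions))   # 0: jointly 0.6 beats 0.4
```

Note how this reverses the MAP choice from the earlier discussion: hMAP alone predicts 1, but the hypotheses predicting 0 jointly carry more posterior probability.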

Naïve Bayes classifier

Let x∗ be a new example that we need to classify. We should select a hypothesis h such that:

h(x∗) = argmax_{d∈Vdec} Pr(c(x) = d | ⋀_{i=1}^{n} ai(x) = ai(x∗))

Hence, from Bayes' formula:

argmax_{d∈Vdec} Pr(c(x) = d) · Pr(⋀_{i=1}^{n} ai(x) = ai(x∗) | c(x) = d)

If we (naïvely) assume that the attributes are independent as random variables, then:

argmax_{d∈Vdec} Pr(c(x) = d) · ∏_{i=1}^{n} Pr(ai(x) = ai(x∗) | c(x) = d)

All that is left to do is to estimate Pr(c(x) = d) and Pr(ai(x) = v | c(x) = d) from the data.

NBC – technical details

Usually, we employ an m-estimate to get

Pr(ai(x) = v | c(x) = d) = (|T^d_{ai=v}| + m·p) / (|T^d| + m)

where m is an integer parameter, and p is a prior estimate of the probability of the attribute value within the decision class. Usually, if no background knowledge is given, we set m = |Ai| and p = 1/|Ai|, where Ai is the (finite) set of values of the attribute ai.

Complexity of NBC
For each training example we have to modify the counts for the decision class and for particular attribute values. That is, in total, O(n · |T|) basic computational steps.

The complexity of NBC is the lowest "rational" estimate for any classification algorithm without prior knowledge. Also, each step in NBC is fast and cheap, hence the method is computationally efficient.
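A self-contained sketch of a Naïve Bayes classifier using the m-estimate above, trained on a few rows of the weather table from earlier (my own illustration; the counting scheme follows the slides, everything else, including variable names, is assumed).

```python
from collections import Counter, defaultdict

# A subset of the weather table; "EnjoySpt" is the decision attribute.
T = [
    {"Outlook": "sunny",    "Temp": "hot",  "Humid": "high",   "Wind": "FALSE", "EnjoySpt": "no"},
    {"Outlook": "sunny",    "Temp": "hot",  "Humid": "high",   "Wind": "TRUE",  "EnjoySpt": "no"},
    {"Outlook": "overcast", "Temp": "hot",  "Humid": "high",   "Wind": "FALSE", "EnjoySpt": "yes"},
    {"Outlook": "rainy",    "Temp": "mild", "Humid": "high",   "Wind": "FALSE", "EnjoySpt": "yes"},
    {"Outlook": "rainy",    "Temp": "cool", "Humid": "normal", "Wind": "FALSE", "EnjoySpt": "yes"},
    {"Outlook": "rainy",    "Temp": "cool", "Humid": "normal", "Wind": "TRUE",  "EnjoySpt": "no"},
    {"Outlook": "overcast", "Temp": "cool", "Humid": "normal", "Wind": "TRUE",  "EnjoySpt": "yes"},
    {"Outlook": "sunny",    "Temp": "mild", "Humid": "high",   "Wind": "FALSE", "EnjoySpt": "no"},
]
DEC = "EnjoySpt"
ATTRS = [a for a in T[0] if a != DEC]
VALUES = {a: {x[a] for x in T} for a in ATTRS}           # attribute domains A_i

# One pass over the data: the counts |T^d| and |T^d_{a=v}|  ->  O(n * |T|) steps.
class_count = Counter(x[DEC] for x in T)
value_count = defaultdict(Counter)                       # (attribute, d) -> value counts
for x in T:
    for a in ATTRS:
        value_count[(a, x[DEC])][x[a]] += 1

def p_attr(a, v, d):
    """m-estimate of Pr(a(x)=v | c(x)=d) with m = |A_i| and p = 1/|A_i|."""
    m = len(VALUES[a])
    return (value_count[(a, d)][v] + m * (1 / m)) / (class_count[d] + m)

def classify(x_new):
    """argmax over d of Pr(c(x)=d) * prod_i Pr(a_i(x)=a_i(x*) | c(x)=d)."""
    best, best_score = None, -1.0
    for d in class_count:
        score = class_count[d] / len(T)                  # Pr(c(x)=d)
        for a in ATTRS:
            score *= p_attr(a, x_new[a], d)
        if score > best_score:
            best, best_score = d, score
    return best

print(classify({"Outlook": "sunny", "Temp": "cool", "Humid": "high", "Wind": "TRUE"}))  # "no"
```

Note that the training work is a single counting pass over T, in line with the O(n · |T|) complexity estimate above.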

Requirements for hypotheses

On a higher level of abstraction we can demand that the hypothesis not only be the best (most probable) one, but also the simplest one. This may be seen as a special application of lex parsimoniæ (Occam's razor). We prefer the simplest explanation, i.e., the hypothesis that requires – according to William of Occam – the least amount of assumptions. In practice, lex parsimoniæ is frequently replaced by the simpler Minimum Description Length (MDL) principle.

MDL – Minimum Description Length
MDL recommends the simplest method for re-encoding the data with the use of a hypothesis, i.e., the hypothesis that gives the best compression. Choosing that particular hypothesis yields the shortest algorithm for the reproduction of the data. In classification, this usually means the shortest hypothesis.

MDL in Bayesian classification

Bayesian classifiers are considered one of the best methods for producing MDL-compliant hypotheses. For the purposes of comparing description lengths, in the example below we define the length via the (binary) logarithm of the corresponding probability. Taking the logarithm of Bayes' formula, we get:

log Pr(h|T) = log Pr(h) + log Pr(T|h) − log Pr(T)

Substituting L(.) for −log Pr(.) we obtain:

L(h|T ) = L(h) + L(T |h) − L(T )

where L(h),L(T |h) represent the length of hypothesis h and length of data T (given h). In both cases we assume that the encoding is known and optimal.
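A tiny numeric check of this identity (illustrative; the probabilities are made up and description lengths are measured in bits as −log₂ Pr).

```python
from math import log2

# Made-up probabilities for one hypothesis h and data set T.
pr_h, pr_T_given_h, pr_T = 0.25, 0.125, 0.0625

L = lambda prob: -log2(prob)                 # description length in bits
pr_h_given_T = pr_T_given_h * pr_h / pr_T    # Bayes' formula

# L(h|T) = L(h) + L(T|h) - L(T): both sides come out to 1.0 bit here.
print(L(pr_h_given_T), L(pr_h) + L(pr_T_given_h) - L(pr_T))
```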

MDL in Bayesian classification

Ultimately, we select a hypothesis that is the best w.r.t. MDL:

hMDL = argmin_{h∈H} ( L_{Enc_H}(h) + L_{Enc_D}(T|h) )

Assuming that Enc_H and Enc_D are optimal encodings of the hypotheses and the data, respectively, we obtain: hMDL = hMAP. Intuitively, MDL helps to find the right balance between the quality and the complexity of a hypothesis. The MDL principle is frequently used for scoring candidate hypotheses constructed by other means. It is also applicable to the task of simplifying existing hypotheses, for example in the filtering of decision rule sets and in decision tree pruning. It also provides an effective stop criterion for many practical algorithms.


MDL is also connected with the more general notion of Kolmogorov Complexity (descriptive complexity, Kolmogorov–Chaitin complexity, algorithmic entropy). The Kolmogorov Complexity of a finite or infinite sequence of symbols (stream of data) is defined as the length of the simplest (shortest) algorithm that generates this data. Naturally, the notion of algorithm length is quite complicated and requires a formal definition. Such a definition is usually given with the use of formal languages and Turing machines. In most non-trivial cases the task of calculating the Kolmogorov complexity of a sequence is very hard, frequently practically impossible (undecidable).

Let us consider two finite sequences of digits:

1415926535897932384626433832795028841971 – has a very low Kolmogorov complexity, since there exists a very simple algorithm that generates the decimal expansion of π.

5230619672181840811135324016881717004139 – is a random sequence with potentially very high Kolmogorov complexity.