Statistics for Data Science

MSc Data Science WiSe 2019/20

Prof. Dr. Dirk Ostwald

1 (1) Probability spaces

2 Bibliographic remarks

The presented material is standard and can be found in any introductory textbook on statistical inference. Wasserman (2004, Chapter 1) and DeGroot and Schervish (2012, Sections 1.1 - 1.3) are closest in spirit. Further excellent introductions to modern probability theory include Billingsley (1978), Fristedt and Gray (1997), and Rosenthal (2006).

3 Probability spaces

• Probability spaces

• Elementary probabilities

4 Probability spaces

• Probability spaces

• Elementary probabilities

5 Probability spaces

Definition (Probability space)

A probability space is a triple (Ω, A, P), where

• Ω is a set of elementary outcomes ω,

• A is a σ-algebra, i.e., A is a set with the following properties:
  o Ω ∈ A,
  o A is closed under the formation of complements, i.e., if A ∈ A, then also A^c := Ω \ A ∈ A,
  o A is closed under countable unions, i.e., if A_1, A_2, ... ∈ A, then ∪_{i=1}^∞ A_i ∈ A.

• P is a mapping P : A → [0, 1] with the following properties:
  o P(A) ≥ 0 for all A ∈ A and P(Ω) = 1,
  o if A_1, A_2, ... ∈ A are pairwise disjoint, then P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).
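To make the definition concrete, the following minimal Python sketch (an illustration added here, not part of the slides; the names Omega, p_elem, power_set, and A_sigma are hypothetical, and fair-die numbers are assumed) builds a finite outcome set, uses its power set as σ-algebra, defines P by summing elementary probabilities, and checks the defining properties on this example.

from itertools import chain, combinations

# Illustrative finite probability space (hypothetical names, fair-die numbers assumed)
Omega = frozenset({1, 2, 3, 4, 5, 6})
p_elem = {w: 1 / 6 for w in Omega}  # elementary probabilities P({w})

def power_set(s):
    """Return the power set of s as a set of frozensets."""
    s = list(s)
    return {frozenset(c) for c in chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))}

A_sigma = power_set(Omega)  # sigma-algebra on the finite outcome set Omega

def P(A):
    """P(A), evaluated as the sum of the elementary probabilities of the outcomes in A."""
    return sum(p_elem[w] for w in A)

# Defining properties, checked on this finite example
assert Omega in A_sigma                                               # Omega is an event
assert all(Omega - A in A_sigma for A in A_sigma)                     # closed under complements
assert all(A | B in A_sigma for A in A_sigma for B in A_sigma)        # closed under (finite) unions
assert all(P(A) >= 0 for A in A_sigma) and abs(P(Omega) - 1) < 1e-12  # P(A) >= 0 and P(Omega) = 1

On a finite Ω only finite unions need to be checked explicitly, since countable unions of events from a finite σ-algebra reduce to finite ones.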

6 Probability spaces

[Slide image: title page and excerpts from the English edition of A. N. Kolmogorov, Foundations of the Theory of Probability, Chelsea Publishing Company, New York, 1956 (translation edited by Nathan Morrison).]

“The purpose of this monograph is to give an axiomatic foundation for the theory of probability. The author set himself the task of putting in their natural place, among the general notions of modern mathematics*, the basic concepts of probability theory - concepts which until recently were considered to be quite peculiar”. (*e.g., set theory, mappings, Lebesgue integrals)

Kolmogorov, A.N. (1933) Grundbegriffe der Wahrscheinlichkeitsrechnung

7 Selected remarks

• Probability spaces are used as abstract models of random experiments.
• Probability spaces are special cases of measure spaces (Ω, A, µ).
• Measure spaces are mathematical models for assigning volume to sets.
• Elementary outcomes are “realized” according to P({ω}).
• Probability spaces unify finite, countable (N), and uncountable (R) sets.
• Probability spaces offer a language spanning discrete and continuous probability.
• σ-algebras are “complete sets of events (which comprise elementary outcomes)”.
• For finite outcome spaces, the typical σ-algebra is the power set P(Ω).
• For the outcome space R, the Borel σ-algebra fulfills the same function.
• The probabilistic characteristics of (Ω, A, P) are defined by P.
• Probability spaces often disappear behind random variables and distributions.
• Probability spaces are useful in the definition of random fields, Markov kernels, ...

8 Probability spaces

Example (Throwing a die)

• It is reasonable to consider the elementary outcome set Ω := {1, 2, 3, 4, 5, 6}.
• The elementary outcome set Ω := {·, ··, ···, ····, ·····, ······} also works.

• The power set P({1, 2, 3, 4, 5, 6}) contains all possible events A_i, for example:

Any number occurs              A_1 = Ω
A number larger than 4 occurs  A_2 = {5, 6}
An even number occurs          A_3 = {2, 4, 6}
A six occurs                   A_4 = {6}
One, three, or six occurs      A_5 = {1, 3, 6}
No number occurs               A_6 = ∅

• P can be defined by specification of P({ω}) for all ω ∈ Ω.
• Because the elementary events {ω} with ω ∈ A are pairwise disjoint, the probabilities of all events A ∈ A can be evaluated as P(A) = Σ_{ω ∈ A} P({ω}).
• A fair die would have P({ω}) := 1/6 for all ω ∈ Ω.
• A biased die could have P({1}) = P({2}) = P({6}) := 1/9 and P({3}) = P({4}) = P({5}) := 2/9 (see the sketch below).
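A possible way to carry out this evaluation numerically is sketched below (an illustrative Python snippet, not from the slides; exact fractions are used to avoid rounding error).

from fractions import Fraction

# Elementary probabilities of the fair and the biased die from the example
Omega = {1, 2, 3, 4, 5, 6}
fair = {w: Fraction(1, 6) for w in Omega}
biased = {1: Fraction(1, 9), 2: Fraction(1, 9), 6: Fraction(1, 9),
          3: Fraction(2, 9), 4: Fraction(2, 9), 5: Fraction(2, 9)}

def P(A, p_elem):
    """P(A) = sum of P({w}) over the outcomes w in A."""
    return sum(p_elem[w] for w in A)

events = {"A_1 (any number)": Omega,
          "A_2 (larger than 4)": {5, 6},
          "A_3 (even)": {2, 4, 6},
          "A_4 (a six)": {6},
          "A_5 (one, three, or six)": {1, 3, 6},
          "A_6 (no number)": set()}

for name, A in events.items():
    print(f"{name}: fair {P(A, fair)}, biased {P(A, biased)}")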

9 Probability spaces

• Probability spaces

• Elementary probabilities

10 Elementary probabilities

Definition (Probability measure)

Let Ω denote an outcome space and A denote a σ-algebra on Ω. A function

P : A → R, A ↦ P(A) (1)

that satisfies the following axioms

(1) P(A) ≥ 0 for all A ∈ A
(2) P(Ω) = 1
(3) If A_1, A_2, ... ∈ A are pairwise disjoint, then P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i) (σ-additivity)

is called a probability measure or probability.

Remarks • P(A) can be interpreted as the idealized long run frequency of observing A. • P(A) can be interpreted as the subjective degree of belief that A is true. • Frequentist and Bayesian interpretations use the same formal framework.

11 Elementary probabilities

Some properties of probabilities

• P(∅) = 0 and 0 ≤ P(A) ≤ 1
• A ⊂ B ⇒ P(A) ≤ P(B)
• P(A^c) = 1 − P(A)
• A ∩ B = ∅ ⇒ P(A ∪ B) = P(A) + P(B)
• P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Exemplary proof

With the additivity of P for disjoint events, we have:

P(A ∪ B) = P(A ∩ B^c) + P(A ∩ B) + P(A^c ∩ B)
         = P(A ∩ B^c) + P(A ∩ B) + P(A^c ∩ B) + P(A ∩ B) − P(A ∩ B)       (2)
         = P((A ∩ B^c) ∪ (A ∩ B)) + P((A^c ∩ B) ∪ (A ∩ B)) − P(A ∩ B)
         = P(A) + P(B) − P(A ∩ B)


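As a numerical cross-check (an added illustration, reusing the fair die example with the uniform probability P(A) = |A|/|Ω|), the last property can be verified for two concrete events:

from fractions import Fraction

Omega = {1, 2, 3, 4, 5, 6}
P = lambda A: Fraction(len(A), len(Omega))  # fair die: P(A) = |A| / |Omega|

A = {2, 4, 6}  # an even number occurs
B = {5, 6}     # a number larger than 4 occurs
assert P(A | B) == P(A) + P(B) - P(A & B)   # 2/3 = 1/2 + 1/3 - 1/6
print(P(A | B))                             # 2/3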

12 Elementary probabilities

Definition (Independent events) Two events A ∈ A and B ∈ A are independent, if

P(A ∩ B) = P(A)P(B). (3)

A set of events {A_i | i ∈ I} ⊂ A for an arbitrary index set I is independent, if for every finite subset J ⊂ I

P(∩_{j∈J} A_j) = Π_{j∈J} P(A_j). (4)

Remarks

• Independence of events often arises by design of the probabilistic model.
• Independence models the absence of deterministic and stochastic influences.
• Disjoint events with positive probability are not independent (see the sketch below):
  P(A)P(B) > 0, but P(A ∩ B) = P(∅) = 0, thus P(A ∩ B) ≠ P(A)P(B).
• The arbitrary subset condition for |I| events ensures their pairwise independence.
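The product criterion can be checked directly on the fair die example (an added sketch; the uniform probability P(A) = |A|/|Ω| is assumed):

from fractions import Fraction

Omega = {1, 2, 3, 4, 5, 6}
P = lambda A: Fraction(len(A), len(Omega))  # fair die

def independent(A, B):
    """True if P(A ∩ B) = P(A) P(B)."""
    return P(A & B) == P(A) * P(B)

print(independent({2, 4, 6}, {1, 2, 3, 4}))  # True:  1/3 = (1/2)(2/3)
print(independent({1, 2}, {5, 6}))           # False: disjoint events with positive probability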

13 Elementary probabilities

Definition (Conditional probability)

If P(B) > 0, then the conditional probability of A ∈ A given B ∈ A is

P(A|B) = P(A ∩ B) / P(B). (5)

For any fixed B, P(·|B) is a probability measure, i.e., P(·|B) ≥ 0, P(Ω|B) = 1, and if A_1, A_2, ... are disjoint, P(∪_{i=1}^∞ A_i | B) = Σ_{i=1}^∞ P(A_i|B).

Remarks • P(A|B) is “the fraction of times A occurs among those in which B occurs”. • P(A|B) is the normalized version of P(A ∩ B). • Defining the joint probability P(A ∩ B) defines P(A|B). • The rules of probability apply to the events on the left of the vertical bar.

• Usually P(A|B) ≠ P(B|A), e.g., P(Death|Hanging) ≠ P(Hanging|Death) (see the sketch below).
• An extension to P(B) = 0 is possible, but technically more challenging.
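A small numerical illustration of the definition (an added sketch for the fair die, not part of the slides) also shows that P(A|B) and P(B|A) generally differ:

from fractions import Fraction

Omega = {1, 2, 3, 4, 5, 6}
P = lambda A: Fraction(len(A), len(Omega))  # fair die

def P_cond(A, B):
    """Conditional probability P(A|B) = P(A ∩ B) / P(B); requires P(B) > 0."""
    return P(A & B) / P(B)

A = {6}              # a six occurs
B = {2, 4, 6}        # an even number occurs
print(P_cond(A, B))  # 1/3, the fraction of even outcomes that are a six
print(P_cond(B, A))  # 1, every outcome in A = {6} is even, so P(A|B) differs from P(B|A)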

14 Elementary probabilities

Theorem (Conditional probability for independent events) If A, B ∈ A are independent, then

P(A|B) = P(A ∩ B) / P(B) = P(A)P(B) / P(B) = P(A). (6)

Remark • Given independence, knowledge of B does not change the probability of A.

Theorem (Joint and conditional probabilities)

For any A, B ∈ A with P(A) > 0 and P(B) > 0,

P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A). (7)

Remark • Joint probabilities can be constructed from conditional and total probabilities.

15 Elementary probabilities

Theorem (Law of total probability)

Let A_1, ..., A_k be a partition of Ω, i.e., ∪_{i=1}^k A_i = Ω and A_i ∩ A_j = ∅ for all 1 ≤ i, j ≤ k with i ≠ j. Then, for any event B ∈ A

P(B) = Σ_{i=1}^k P(B|A_i)P(A_i). (8)

Proof

For i = 1, ..., k, let C_i := B ∩ A_i, so ∪_{i=1}^k C_i = B and C_i ∩ C_j = ∅ for 1 ≤ i, j ≤ k, i ≠ j.

Thus P(B) = Σ_{i=1}^k P(C_i) = Σ_{i=1}^k P(B ∩ A_i) = Σ_{i=1}^k P(B|A_i)P(A_i).

□

Remark

• The total probability of B is a weighted average of the conditional probabilities of B (see the sketch below).

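A minimal numerical sketch (the partition size, priors P(A_i), and conditional probabilities P(B|A_i) below are made-up values for illustration) evaluates the law of total probability:

# Law of total probability: P(B) = sum_i P(B|A_i) P(A_i)
priors = {"A_1": 0.5, "A_2": 0.3, "A_3": 0.2}       # P(A_i), a partition of Omega
likelihoods = {"A_1": 0.9, "A_2": 0.5, "A_3": 0.1}  # P(B|A_i)

P_B = sum(likelihoods[a] * priors[a] for a in priors)
print(P_B)  # 0.9*0.5 + 0.5*0.3 + 0.1*0.2 = 0.62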
16 Elementary probabilities

Theorem (Bayes theorem)

Let A_1, ..., A_k be a partition of Ω with P(A_i) > 0 for all i = 1, ..., k. If P(B) > 0, then for each i = 1, ..., k

P(A_i|B) = P(B|A_i)P(A_i) / Σ_{j=1}^k P(B|A_j)P(A_j). (9)

Proof Using the definition of conditional probability twice and the law of total probability, we have

P(A_i|B) = P(A_i ∩ B) / P(B) = P(B|A_i)P(A_i) / P(B) = P(B|A_i)P(A_i) / Σ_{j=1}^k P(B|A_j)P(A_j). (10)

□

Remarks

• There is nothing “Bayesian” about Bayes theorem.
• Bayes theorem is a means to compute conditional probabilities.

• P(A_i) is often called the prior and P(A_i|B) the posterior.

• P(B|A_i) is sometimes called the likelihood and P(B) the evidence (see the sketch below).
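Continuing the made-up numbers from the law of total probability sketch above, the posterior probabilities P(A_i|B) can be computed as follows (an added illustration, not part of the slides):

# Bayes theorem: P(A_i|B) = P(B|A_i) P(A_i) / sum_j P(B|A_j) P(A_j)
priors = {"A_1": 0.5, "A_2": 0.3, "A_3": 0.2}       # P(A_i)
likelihoods = {"A_1": 0.9, "A_2": 0.5, "A_3": 0.1}  # P(B|A_i)

evidence = sum(likelihoods[a] * priors[a] for a in priors)  # P(B), law of total probability
posteriors = {a: likelihoods[a] * priors[a] / evidence for a in priors}

print(posteriors)                # P(A_1|B) ≈ 0.726, P(A_2|B) ≈ 0.242, P(A_3|B) ≈ 0.032
print(sum(posteriors.values()))  # the posteriors sum to 1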

17 References

Billingsley, P. (1978). Probability and measure. John Wiley & Sons, Collier-Macmillan Publishers.
DeGroot, M. H. and Schervish, M. J. (2012). Probability and statistics. Pearson Education.
Fristedt, B. E. and Gray, L. F. (1997). A modern approach to probability theory. Birkhäuser.
Rosenthal, J. S. (2006). A first look at rigorous probability theory. World Scientific Publishing Company.
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics. Springer.
