Statistics for Data Science

MSc Data Science WiSe 2019/20

Prof. Dr. Dirk Ostwald

1 (1) Probability spaces

2 Bibliographic remarks

The presented material is standard and can be found in any introductory textbook on statistical inference. Wasserman (2004, Chapter 1) and DeGroot and Schervish (2012, Sections 1.1 - 1.3) are closest in spirit. Further excellent introductions to modern probability theory include Billingsley (1978), Fristedt and Gray (1997), and Rosenthal (2006).

3 Probability spaces

• Probability spaces

• Elementary probabilities

4 Probability spaces

• Probability spaces

• Elementary probabilities

5 Probability spaces

Definition (Probability space)

A probability space is a triple (Ω, A, P), where

• Ω is a set of elementary outcomes ω,

• A is a σ-algebra, i.e., A is a set with the following properties:
  o Ω ∈ A,
  o A is closed under the formation of complements, i.e., if A ∈ A, then also A^c := Ω \ A ∈ A,
  o A is closed under countable unions, i.e., if A_1, A_2, ... ∈ A, then ∪_{i=1}^∞ A_i ∈ A.

• P is a mapping P : A → [0, 1] with the following properties:
  o P(A) ≥ 0 for all A ∈ A and P(Ω) = 1,
  o if A_1, A_2, ... ∈ A are pairwise disjoint, then P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).
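To make the definition concrete, the following minimal Python sketch (an illustration added here, not part of the slides; the names Omega, p_elem, power_set, and A_sigma are hypothetical, and fair-die numbers are assumed) builds a finite outcome set, uses its power set as σ-algebra, defines P by summing elementary probabilities, and checks the defining properties on this example.

from itertools import chain, combinations

# Illustrative finite probability space (hypothetical names, fair-die numbers assumed)
Omega = frozenset({1, 2, 3, 4, 5, 6})
p_elem = {w: 1 / 6 for w in Omega}  # elementary probabilities P({w})

def power_set(s):
    """Return the power set of s as a set of frozensets."""
    s = list(s)
    return {frozenset(c) for c in chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))}

A_sigma = power_set(Omega)  # sigma-algebra on the finite outcome set Omega

def P(A):
    """P(A), evaluated as the sum of the elementary probabilities of the outcomes in A."""
    return sum(p_elem[w] for w in A)

# Defining properties, checked on this finite example
assert Omega in A_sigma                                               # Omega is an event
assert all(Omega - A in A_sigma for A in A_sigma)                     # closed under complements
assert all(A | B in A_sigma for A in A_sigma for B in A_sigma)        # closed under (finite) unions
assert all(P(A) >= 0 for A in A_sigma) and abs(P(Omega) - 1) < 1e-12  # P(A) >= 0 and P(Omega) = 1

On a finite Ω only finite unions need to be checked explicitly, since countable unions of events from a finite σ-algebra reduce to finite ones.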

6 Probability spaces

[Slide image: title page and excerpts from the English edition of A. N. Kolmogorov, Foundations of the Theory of Probability, Chelsea Publishing Company, New York, 1956 (translation edited by Nathan Morrison).]

“The purpose of this monograph is to give an axiomatic foundation for the theory of probability. The author set himself the task of putting in their natural place, among the general notions of modern mathematics*, the basic concepts of probability theory - concepts which until recently were considered to be quite peculiar”. (*e.g., set theory, mappings, Lebesgue integrals)

Kolmogorov, A.N. (1933) Grundbegriffe der Wahrscheinlichkeitsrechnung

7 Selected remarks

• Probability spaces are used as abstract models of random experiments.
• Probability spaces are special cases of measure spaces (Ω, A, µ).
• Measure spaces are mathematical models for assigning volume to sets.
• Elementary outcomes are “realized” according to P({ω}).
• Probability spaces unify finite, countable (N), and uncountable (R) sets.
• Probability spaces offer a language spanning discrete and continuous probability.
• σ-algebras are “complete sets of events (which comprise elementary outcomes)”.
• For finite outcome spaces, the typical σ-algebra is the power set P(Ω).
• For the outcome space R, the Borel σ-algebra fulfills the same function.
• The probabilistic characteristics of (Ω, A, P) are defined by P.
• Probability spaces often disappear behind random variables and distributions.
• Probability spaces are useful in the definition of random fields, Markov kernels, ...

8 Probability spaces

Example (Throwing a die)

• It is reasonable to consider the elementary outcome set Ω := {1, 2, 3, 4, 5, 6}.
• The elementary outcome set Ω := {·, ··, ···, ····, ·····, ······} also works.

• The power set P({1, 2, 3, 4, 5, 6}) contains all possible events A_i, for example:

Any number occurs              A_1 = Ω
A number larger than 4 occurs  A_2 = {5, 6}
An even number occurs          A_3 = {2, 4, 6}
A six occurs                   A_4 = {6}
One, three, or six occurs      A_5 = {1, 3, 6}
No number occurs               A_6 = ∅

• P can be defined by specification of P({ω}) for all ω ∈ Ω.
• Because the elementary events {ω} with ω ∈ A are pairwise disjoint, the probabilities of all events A ∈ A can be evaluated as P(A) = Σ_{ω ∈ A} P({ω}).
• A fair die would have P({ω}) := 1/6 for all ω ∈ Ω.
• A biased die could have P({1}) = P({2}) = P({6}) := 1/9 and P({3}) = P({4}) = P({5}) := 2/9 (see the sketch below).
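A possible way to carry out this evaluation numerically is sketched below (an illustrative Python snippet, not from the slides; exact fractions are used to avoid rounding error).

from fractions import Fraction

# Elementary probabilities of the fair and the biased die from the example
Omega = {1, 2, 3, 4, 5, 6}
fair = {w: Fraction(1, 6) for w in Omega}
biased = {1: Fraction(1, 9), 2: Fraction(1, 9), 6: Fraction(1, 9),
          3: Fraction(2, 9), 4: Fraction(2, 9), 5: Fraction(2, 9)}

def P(A, p_elem):
    """P(A) = sum of P({w}) over the outcomes w in A."""
    return sum(p_elem[w] for w in A)

events = {"A_1 (any number)": Omega,
          "A_2 (larger than 4)": {5, 6},
          "A_3 (even)": {2, 4, 6},
          "A_4 (a six)": {6},
          "A_5 (one, three, or six)": {1, 3, 6},
          "A_6 (no number)": set()}

for name, A in events.items():
    print(f"{name}: fair {P(A, fair)}, biased {P(A, biased)}")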

9 Probability spaces

• Probability spaces

• Elementary probabilities

10 Elementary probabilities

Definition (Probability measure)

Let Ω denote an outcome space and A denote a σ-algebra on Ω. A function

P : A → R, A ↦ P(A) (1)

that satisfies the following axioms

(1) P(A) ≥ 0 for all A ∈ A
(2) P(Ω) = 1
(3) If A_1, A_2, ... ∈ A are pairwise disjoint, then P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i) (σ-additivity)

is called a probability measure or probability.

Remarks • P(A) can be interpreted as the idealized long run frequency of observing A. • P(A) can be interpreted as the subjective degree of belief that A is true. • Frequentist and Bayesian interpretations use the same formal framework.

11 Elementary probabilities

Some properties of probabilities

• P(∅) = 0 and 0 ≤ P(A) ≤ 1
• A ⊂ B ⇒ P(A) ≤ P(B)
• P(A^c) = 1 − P(A)
• A ∩ B = ∅ ⇒ P(A ∪ B) = P(A) + P(B)
• P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Exemplary proof

With the additivity of P for disjoint events, we have:

P(A ∪ B) = P(A ∩ B^c) + P(A ∩ B) + P(A^c ∩ B)
         = P(A ∩ B^c) + P(A ∩ B) + P(A^c ∩ B) + P(A ∩ B) − P(A ∩ B)       (2)
         = P((A ∩ B^c) ∪ (A ∩ B)) + P((A^c ∩ B) ∪ (A ∩ B)) − P(A ∩ B)
         = P(A) + P(B) − P(A ∩ B)


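As a numerical cross-check (an added illustration, reusing the fair die example with the uniform probability P(A) = |A|/|Ω|), the last property can be verified for two concrete events:

from fractions import Fraction

Omega = {1, 2, 3, 4, 5, 6}
P = lambda A: Fraction(len(A), len(Omega))  # fair die: P(A) = |A| / |Omega|

A = {2, 4, 6}  # an even number occurs
B = {5, 6}     # a number larger than 4 occurs
assert P(A | B) == P(A) + P(B) - P(A & B)   # 2/3 = 1/2 + 1/3 - 1/6
print(P(A | B))                             # 2/3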

12 Elementary probabilities

Definition (Independent events) Two events A ∈ A and B ∈ A are independent, if

P(A ∩ B) = P(A)P(B). (3)

A set of events {A_i | i ∈ I} ⊂ A for an arbitrary index set I is independent, if for every finite subset J ⊂ I

P(∩_{j∈J} A_j) = Π_{j∈J} P(A_j). (4)

Remarks

• Independence of events often arises by design of the probabilistic model.
• Independence models the absence of deterministic and stochastic influences.
• Disjoint events with positive probability are not independent (see the sketch below):
  P(A)P(B) > 0, but P(A ∩ B) = P(∅) = 0, thus P(A ∩ B) ≠ P(A)P(B).
• The arbitrary subset condition for |I| events ensures their pairwise independence.
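The product criterion can be checked directly on the fair die example (an added sketch; the uniform probability P(A) = |A|/|Ω| is assumed):

from fractions import Fraction

Omega = {1, 2, 3, 4, 5, 6}
P = lambda A: Fraction(len(A), len(Omega))  # fair die

def independent(A, B):
    """True if P(A ∩ B) = P(A) P(B)."""
    return P(A & B) == P(A) * P(B)

print(independent({2, 4, 6}, {1, 2, 3, 4}))  # True:  1/3 = (1/2)(2/3)
print(independent({1, 2}, {5, 6}))           # False: disjoint events with positive probability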

13 Elementary probabilities

Definition (Conditional probability)

If P(B) > 0, then the conditional probability of A ∈ A given B ∈ A is

P(A|B) = P(A ∩ B) / P(B). (5)

For any fixed B, P(·|B) is a probability measure, i.e., P(·|B) ≥ 0, P(Ω|B) = 1, and if A_1, A_2, ... are disjoint, P(∪_{i=1}^∞ A_i | B) = Σ_{i=1}^∞ P(A_i|B).

Remarks • P(A|B) is “the fraction of times A occurs among those in which B occurs”. • P(A|B) is the normalized version of P(A ∩ B). • Defining the joint probability P(A ∩ B) defines P(A|B). • The rules of probability apply to the events on the left of the vertical bar.

• Usually P(A|B) ≠ P(B|A), e.g., P(Death|Hanging) ≠ P(Hanging|Death) (see the sketch below).
• An extension to P(B) = 0 is possible, but technically more challenging.
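A small numerical illustration of the definition (an added sketch for the fair die, not part of the slides) also shows that P(A|B) and P(B|A) generally differ:

from fractions import Fraction

Omega = {1, 2, 3, 4, 5, 6}
P = lambda A: Fraction(len(A), len(Omega))  # fair die

def P_cond(A, B):
    """Conditional probability P(A|B) = P(A ∩ B) / P(B); requires P(B) > 0."""
    return P(A & B) / P(B)

A = {6}              # a six occurs
B = {2, 4, 6}        # an even number occurs
print(P_cond(A, B))  # 1/3, the fraction of even outcomes that are a six
print(P_cond(B, A))  # 1, every outcome in A = {6} is even, so P(A|B) differs from P(B|A)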

14 Elementary probabilities

Theorem (Conditional probability for independent events) If A, B ∈ A are independent, then

P(A|B) = P(A ∩ B) / P(B) = P(A)P(B) / P(B) = P(A). (6)

Remark • Given independence, knowledge of B does not change the probability of A.

Theorem (Joint and conditional probabilities)

For any A, B ∈ A with P(A) > 0 and P(B) > 0,

P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A). (7)

Remark • Joint probabilities can be constructed from conditional and total probabilities.

15 Elementary probabilities

Theorem (Law of total probability)

Let A_1, ..., A_k be a partition of Ω, i.e., ∪_{i=1}^k A_i = Ω and A_i ∩ A_j = ∅ for all 1 ≤ i, j ≤ k with i ≠ j. Then, for any event B ∈ A

P(B) = Σ_{i=1}^k P(B|A_i)P(A_i). (8)

Proof

For i = 1, ..., k, let C_i := B ∩ A_i, so ∪_{i=1}^k C_i = B and C_i ∩ C_j = ∅ for 1 ≤ i, j ≤ k, i ≠ j.

Thus P(B) = Σ_{i=1}^k P(C_i) = Σ_{i=1}^k P(B ∩ A_i) = Σ_{i=1}^k P(B|A_i)P(A_i).

□

Remark

• The total probability of B is a weighted average of the conditional probabilities of B (see the sketch below).

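A minimal numerical sketch (the partition size, priors P(A_i), and conditional probabilities P(B|A_i) below are made-up values for illustration) evaluates the law of total probability:

# Law of total probability: P(B) = sum_i P(B|A_i) P(A_i)
priors = {"A_1": 0.5, "A_2": 0.3, "A_3": 0.2}       # P(A_i), a partition of Omega
likelihoods = {"A_1": 0.9, "A_2": 0.5, "A_3": 0.1}  # P(B|A_i)

P_B = sum(likelihoods[a] * priors[a] for a in priors)
print(P_B)  # 0.9*0.5 + 0.5*0.3 + 0.1*0.2 = 0.62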
16 Elementary probabilities

Theorem (Bayes theorem)

Let A_1, ..., A_k be a partition of Ω with P(A_i) > 0 for all i = 1, ..., k. If P(B) > 0, then for each i = 1, ..., k

P(A_i|B) = P(B|A_i)P(A_i) / Σ_{j=1}^k P(B|A_j)P(A_j). (9)

Proof Using the definition of conditional probability twice and the law of total probability, we have

P(A_i|B) = P(A_i ∩ B) / P(B) = P(B|A_i)P(A_i) / P(B) = P(B|A_i)P(A_i) / Σ_{j=1}^k P(B|A_j)P(A_j). (10)

□

Remarks

• There is nothing “Bayesian” about Bayes theorem.
• Bayes theorem is a means to compute conditional probabilities.

• P(A_i) is often called the prior and P(A_i|B) the posterior.

• P(B|A_i) is sometimes called the likelihood and P(B) the evidence (see the sketch below).
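Continuing the made-up numbers from the law of total probability sketch above, the posterior probabilities P(A_i|B) can be computed as follows (an added illustration, not part of the slides):

# Bayes theorem: P(A_i|B) = P(B|A_i) P(A_i) / sum_j P(B|A_j) P(A_j)
priors = {"A_1": 0.5, "A_2": 0.3, "A_3": 0.2}       # P(A_i)
likelihoods = {"A_1": 0.9, "A_2": 0.5, "A_3": 0.1}  # P(B|A_i)

evidence = sum(likelihoods[a] * priors[a] for a in priors)  # P(B), law of total probability
posteriors = {a: likelihoods[a] * priors[a] / evidence for a in priors}

print(posteriors)                # P(A_1|B) ≈ 0.726, P(A_2|B) ≈ 0.242, P(A_3|B) ≈ 0.032
print(sum(posteriors.values()))  # the posteriors sum to 1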

17 References

Billingsley, P. (1978). Probability and measure. John Wiley & Sons, Collier-Macmillan Publishers.
DeGroot, M. H. and Schervish, M. J. (2012). Probability and statistics. Pearson Education.
Fristedt, B. E. and Gray, L. F. (1997). A modern approach to probability theory. Birkhäuser.
Rosenthal, J. S. (2006). A first look at rigorous probability theory. World Scientific Publishing Company.
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics. Springer.
