Statistics for Data Science
MSc Data Science WiSe 2019/20
Prof. Dr. Dirk Ostwald
(1) Probability spaces
Bibliographic remarks
The presented material is standard and can be found in any introductory textbook on statistical inference. Wasserman (2004, Chapter 1) and ? (Sections 1.1 - 1.3) are closest in spirit. Further excellent introductions to modern probability theory include Billingsley (1978), Fristedt and Gray (1997), and Rosenthal (2006).
Probability spaces
• Probability spaces
• Elementary probabilities
Probability spaces
Definition (Probability space)
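A probability space is a triple $(\Omega, \mathcal{A}, \mathbb{P})$, where
• $\Omega$ is a set of elementary outcomes $\omega$,
• $\mathcal{A}$ is a σ-algebra on $\Omega$, i.e., $\mathcal{A}$ is a set of subsets of $\Omega$ with the following properties:
  o $\Omega \in \mathcal{A}$,
  o $\mathcal{A}$ is closed under the formation of complements, i.e., if $A \in \mathcal{A}$, then also $A^c := \Omega \setminus A \in \mathcal{A}$,
  o $\mathcal{A}$ is closed under countable unions, i.e., if $A_1, A_2, ... \in \mathcal{A}$, then $\cup_{i=1}^{\infty} A_i \in \mathcal{A}$,
• $\mathbb{P}$ is a mapping $\mathbb{P} : \mathcal{A} \to [0, 1]$ with the following properties:
  o $\mathbb{P}(A) \ge 0$ for all $A \in \mathcal{A}$ and $\mathbb{P}(\Omega) = 1$,
  o if $A_1, A_2, ... \in \mathcal{A}$ are pairwise disjoint, then $\mathbb{P}\left(\cup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mathbb{P}(A_i)$.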
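For a finite outcome set, the three σ-algebra properties in the definition above can be checked by brute force. The following is a minimal sketch, not part of the original material; the helper names `power_set` and `is_sigma_algebra` are hypothetical, and the check relies on the assumption that the collection of events is finite, so that closure under pairwise unions already gives closure under countable unions.

```python
from itertools import chain, combinations


def power_set(omega):
    """Return all subsets of omega as a set of frozensets."""
    items = list(omega)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return {frozenset(s) for s in subsets}


def is_sigma_algebra(omega, events):
    """Check the three sigma-algebra properties for a FINITE collection of events."""
    omega = frozenset(omega)
    contains_omega = omega in events
    closed_under_complements = all(omega - a in events for a in events)
    # With finitely many events, every countable union reduces to a finite union,
    # so checking pairwise unions is sufficient.
    closed_under_unions = all(a | b in events for a in events for b in events)
    return contains_omega and closed_under_complements and closed_under_unions


omega = {1, 2, 3}
A = power_set(omega)
too_small = {frozenset(), frozenset({1}), frozenset(omega)}  # complement of {1} is missing

print(len(A))                              # number of events in the power set
print(is_sigma_algebra(omega, A))          # power set passes the check
print(is_sigma_algebra(omega, too_small))  # fails: {2, 3} is not included
```

The collection `too_small` illustrates why closure under complements matters: it contains the event {1} but not the event "not 1".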
[Figure: scanned pages from A. N. Kolmogorov, Foundations of the Theory of Probability, Second English Edition, translation edited by Nathan Morrison, with an added bibliography by A. T. Bharucha-Reid, Chelsea Publishing Company, New York, 1956.]

“The purpose of this monograph is to give an axiomatic foundation for the theory of probability. The author set himself the task of putting in their natural place, among the general notions of modern mathematics*, the basic concepts of probability theory - concepts which until recently were considered to be quite peculiar”. (*e.g., set theory, mappings, Lebesgue integrals)
Kolmogorov, A.N. (1933) Grundbegriffe der Wahrscheinlichkeitsrechnung

Selected remarks
• Probability spaces are used as abstract models of random experiments.
• Probability spaces are special cases of measure spaces $(\Omega, \mathcal{A}, \mu)$.
• Measure spaces are mathematical models for assigning volume to sets.
• Elementary outcomes are “realized” according to $\mathbb{P}(\{\omega\})$.
• Probability spaces unify finite, countable ($\mathbb{N}$), and uncountable ($\mathbb{R}$) outcome sets.
• Probability spaces offer a language spanning discrete and continuous probability.
• σ-algebras are “complete sets of events (which comprise elementary outcomes)”.
• For finite outcome spaces, the typical σ-algebra is the power set $\mathcal{P}(\Omega)$.
• For the outcome space $\mathbb{R}$, the Borel σ-algebra fulfills the same function.
• The probabilistic characteristics of $(\Omega, \mathcal{A}, \mathbb{P})$ are defined by $\mathbb{P}$.
• Probability spaces often disappear behind random variables and distributions.
• Probability spaces are useful in the definition of random fields, Markov kernels, ...

Example (Throwing a die)
• It is reasonable to consider the elementary outcome set $\Omega := \{1, 2, 3, 4, 5, 6\}$.
• The elementary outcome set $\Omega := \{\cdot,\ \cdot\cdot,\ \cdot\cdot\cdot,\ \cdot\cdot\cdot\cdot,\ \cdot\cdot\cdot\cdot\cdot,\ \cdot\cdot\cdot\cdot\cdot\cdot\}$ also works.
• The power set $\mathcal{P}(\{1, 2, 3, 4, 5, 6\})$ contains all possible events $A_i$, for example:

  Any number occurs               $A_1 = \Omega$
  A number larger than 4 occurs   $A_2 = \{5, 6\}$
  An even number occurs           $A_3 = \{2, 4, 6\}$
  A six occurs                    $A_4 = \{6\}$
  One, three, or six occurs       $A_5 = \{1, 3, 6\}$
  No number occurs                $A_6 = \emptyset$

• $\mathbb{P}$ can be defined by specification of $\mathbb{P}(\{\omega\})$ for all $\omega \in \Omega$.
• Because the singleton events $\{\omega\}$ for $\omega \in \Omega$ are pairwise disjoint, the probability of every event $A \in \mathcal{A}$ can be evaluated as $\mathbb{P}(A) = \sum_{\omega \in A} \mathbb{P}(\{\omega\})$.
• A fair die would have $\mathbb{P}(\{\omega\}) := 1/6$ for all $\omega \in \Omega$.
• A biased die could have $\mathbb{P}(\{1\}) = \mathbb{P}(\{2\}) = \mathbb{P}(\{6\}) := 1/9$ and $\mathbb{P}(\{3\}) = \mathbb{P}(\{4\}) = \mathbb{P}(\{5\}) := 2/9$.

Elementary probabilities

Definition (Probability measure)
Let $\Omega$ denote an outcome space and $\mathcal{A}$ denote a σ-algebra on $\Omega$. A function
$$\mathbb{P} : \mathcal{A} \to \mathbb{R}, \quad A \mapsto \mathbb{P}(A) \tag{1}$$
that satisfies the following axioms
(1) $\mathbb{P}(A) \ge 0$ for all $A \in \mathcal{A}$,
(2) $\mathbb{P}(\Omega) = 1$,
(3) if $A_1, A_2, ... \in \mathcal{A}$ are pairwise disjoint, then $\mathbb{P}\left(\cup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mathbb{P}(A_i)$ (σ-additivity),
is called a probability measure or probability.

Remarks
• $\mathbb{P}(A)$ can be interpreted as the idealized long run frequency of observing $A$.
• $\mathbb{P}(A)$ can be interpreted as the subjective degree of belief that $A$ is true.
• Frequentist and Bayesian interpretations use the same formal framework.
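The die example and the probability axioms above can be made concrete with exact rational arithmetic. The following is a minimal sketch under the assumption that the outcome set is finite and that a probability measure is represented by a dictionary of elementary probabilities; the names `fair`, `biased`, and `prob` are chosen for illustration only.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

# Elementary probabilities P({w}) for the fair and the biased die of the example.
fair = {w: Fraction(1, 6) for w in omega}
biased = {
    1: Fraction(1, 9), 2: Fraction(1, 9), 6: Fraction(1, 9),
    3: Fraction(2, 9), 4: Fraction(2, 9), 5: Fraction(2, 9),
}


def prob(event, elementary):
    """P(A) as the sum of P({w}) over all w in A (finite additivity)."""
    return sum((elementary[w] for w in event), Fraction(0))


# The example events A1, ..., A6 from the die example.
events = {
    "A1": omega,
    "A2": {5, 6},
    "A3": {2, 4, 6},
    "A4": {6},
    "A5": {1, 3, 6},
    "A6": set(),
}

for name, event in events.items():
    print(name, prob(event, fair), prob(event, biased))

# Axiom checks: non-negativity, normalization, additivity for two disjoint events.
for elementary in (fair, biased):
    assert all(p >= 0 for p in elementary.values())
    assert prob(omega, elementary) == 1
    assert prob({5, 6} | {1, 3}, elementary) == prob({5, 6}, elementary) + prob({1, 3}, elementary)
```

Using `Fraction` instead of floating-point numbers keeps the normalization check exact; with floats, the sum of six copies of 1/6 need not equal 1 exactly.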
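Some properties of probabilities
• $\mathbb{P}(\emptyset) = 0$ and $0 \le \mathbb{P}(A) \le 1$
• $A \subset B \Rightarrow \mathbb{P}(A) \le \mathbb{P}(B)$
• $\mathbb{P}(A^c) = 1 - \mathbb{P}(A)$
• $A \cap B = \emptyset \Rightarrow \mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B)$
• $\mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B) - \mathbb{P}(A \cap B)$

Exemplary proof
With the additivity of $\mathbb{P}$ for disjoint events, we have:
$$\begin{aligned}
\mathbb{P}(A \cup B) &= \mathbb{P}(A \cap B^c) + \mathbb{P}(A \cap B) + \mathbb{P}(A^c \cap B) \\
&= \mathbb{P}(A \cap B^c) + \mathbb{P}(A \cap B) + \mathbb{P}(A^c \cap B) + \mathbb{P}(A \cap B) - \mathbb{P}(A \cap B) \\
&= \mathbb{P}\left((A \cap B^c) \cup (A \cap B)\right) + \mathbb{P}\left((A^c \cap B) \cup (A \cap B)\right) - \mathbb{P}(A \cap B) \\
&= \mathbb{P}(A) + \mathbb{P}(B) - \mathbb{P}(A \cap B)
\end{aligned} \tag{2}$$

Definition (Independent events)
Two events $A \in \mathcal{A}$ and $B \in \mathcal{A}$ are independent, if
$$\mathbb{P}(A \cap B) = \mathbb{P}(A)\mathbb{P}(B). \tag{3}$$
A set of events $\{A_i \mid i \in I\} \subset \mathcal{A}$ for an arbitrary index set $I$ is independent, if for every finite subset $J \subset I$
$$\mathbb{P}\left(\cap_{j \in J} A_j\right) = \prod_{j \in J} \mathbb{P}(A_j). \tag{4}$$

Remarks
• Independence of events often arises by design of the probabilistic model.
• Independence models the absence of deterministic and stochastic influences.
• Disjoint events with positive probability are not independent: $\mathbb{P}(A)\mathbb{P}(B) > 0$, but $\mathbb{P}(A \cap B) = \mathbb{P}(\emptyset) = 0$, thus $\mathbb{P}(A \cap B) \neq \mathbb{P}(A)\mathbb{P}(B)$.
• The condition on every finite subset $J \subset I$ implies, in particular, the pairwise independence of the events; the converse does not hold in general.

Definition (Conditional probability)
If $\mathbb{P}(B) > 0$, then the conditional probability of $A \in \mathcal{A}$ given $B \in \mathcal{A}$ is
$$\mathbb{P}(A|B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}. \tag{5}$$
For any fixed $B$, $\mathbb{P}(\cdot|B)$ is a probability measure, i.e., $\mathbb{P}(A|B) \ge 0$ for all $A \in \mathcal{A}$, $\mathbb{P}(\Omega|B) = 1$, and if $A_1, A_2, ... \in \mathcal{A}$ are pairwise disjoint, then $\mathbb{P}\left(\cup_{i=1}^{\infty} A_i | B\right) = \sum_{i=1}^{\infty} \mathbb{P}(A_i|B)$.

Remarks
• $\mathbb{P}(A|B)$ is “the fraction of times $A$ occurs among those in which $B$ occurs”.
• $\mathbb{P}(A|B)$ is the normalized version of $\mathbb{P}(A \cap B)$.
• Defining the joint probability $\mathbb{P}(A \cap B)$ defines $\mathbb{P}(A|B)$.
• The rules of probability apply to the events on the left of the vertical bar.
• Usually $\mathbb{P}(A|B) \neq \mathbb{P}(B|A)$, e.g., $\mathbb{P}(\text{Death}|\text{Hanging}) \neq \mathbb{P}(\text{Hanging}|\text{Death})$.
• An extension to $\mathbb{P}(B) = 0$ is possible, but technically more challenging.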
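The properties, the independence criterion, and the conditional probability formula above can be verified numerically on a fair six-sided die. This is an illustrative sketch only; the events `even`, `at_most_4`, and `larger_than_4` and the helper functions are assumptions made for the example, not part of the original material.

```python
from fractions import Fraction

omega = frozenset({1, 2, 3, 4, 5, 6})


def prob(event):
    """P(A) for a fair die: |A| / |Omega|."""
    return Fraction(len(event), len(omega))


def cond_prob(a, b):
    """P(A|B) = P(A ∩ B) / P(B), defined only for P(B) > 0."""
    if prob(b) == 0:
        raise ValueError("P(B) must be positive")
    return prob(a & b) / prob(b)


def independent(a, b):
    """Check the product criterion P(A ∩ B) = P(A) P(B)."""
    return prob(a & b) == prob(a) * prob(b)


even = frozenset({2, 4, 6})
at_most_4 = frozenset({1, 2, 3, 4})
larger_than_4 = frozenset({5, 6})
six = frozenset({6})

# P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
assert prob(even | at_most_4) == prob(even) + prob(at_most_4) - prob(even & at_most_4)

# P(A^c) = 1 - P(A)
assert prob(omega - even) == 1 - prob(even)

print(independent(even, at_most_4))           # product criterion holds for this pair
print(independent(at_most_4, larger_than_4))  # disjoint with positive probability
print(cond_prob(six, even), cond_prob(even, six))  # P(A|B) and P(B|A) differ
```

The last line shows the asymmetry of conditioning: a six occurs in one third of the even throws, whereas every six is an even throw.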
Theorem (Conditional probability for independent events)
If $A, B \in \mathcal{A}$ are independent and $\mathbb{P}(B) > 0$, then
$$\mathbb{P}(A|B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)} = \frac{\mathbb{P}(A)\mathbb{P}(B)}{\mathbb{P}(B)} = \mathbb{P}(A). \tag{6}$$

Remark
• Given independence, knowledge of $B$ does not change the probability of $A$.

Theorem (Joint and conditional probabilities)
For any $A, B \in \mathcal{A}$ with $\mathbb{P}(A) > 0$ and $\mathbb{P}(B) > 0$
$$\mathbb{P}(A \cap B) = \mathbb{P}(A|B)\mathbb{P}(B) = \mathbb{P}(B|A)\mathbb{P}(A). \tag{7}$$

Remark
• Joint probabilities can be constructed from conditional and total probabilities.

Theorem (Law of total probability)
Let $A_1, ..., A_k$ be a partition of $\Omega$, i.e., $\cup_{i=1}^{k} A_i = \Omega$ and $A_i \cap A_j = \emptyset$ for all $1 \le i, j \le k$ with $i \neq j$, and let $\mathbb{P}(A_i) > 0$ for $i = 1, ..., k$. Then, for any event $B \in \mathcal{A}$
$$\mathbb{P}(B) = \sum_{i=1}^{k} \mathbb{P}(B|A_i)\mathbb{P}(A_i). \tag{8}$$

Proof
For $i = 1, ..., k$, let $C_i := B \cap A_i$, so $\cup_{i=1}^{k} C_i = B$ and $C_i \cap C_j = \emptyset$ for $1 \le i, j \le k$, $i \neq j$. Thus
$$\mathbb{P}(B) = \sum_{i=1}^{k} \mathbb{P}(C_i) = \sum_{i=1}^{k} \mathbb{P}(B \cap A_i) = \sum_{i=1}^{k} \mathbb{P}(B|A_i)\mathbb{P}(A_i).$$

Remark
• The total probability of $B$ is a weighted average of the conditional probabilities $\mathbb{P}(B|A_i)$, with weights $\mathbb{P}(A_i)$.

Theorem (Bayes theorem)
Let $A_1, ..., A_k$ be a partition of $\Omega$ with $\mathbb{P}(A_i) > 0$ for $i = 1, ..., k$. If $\mathbb{P}(B) > 0$, then for each $i = 1, ..., k$
$$\mathbb{P}(A_i|B) = \frac{\mathbb{P}(B|A_i)\mathbb{P}(A_i)}{\sum_{j=1}^{k} \mathbb{P}(B|A_j)\mathbb{P}(A_j)}. \tag{9}$$

Proof
Using the definition of conditional probability twice and the law of total probability, we have
$$\mathbb{P}(A_i|B) = \frac{\mathbb{P}(A_i \cap B)}{\mathbb{P}(B)} = \frac{\mathbb{P}(B|A_i)\mathbb{P}(A_i)}{\mathbb{P}(B)} = \frac{\mathbb{P}(B|A_i)\mathbb{P}(A_i)}{\sum_{j=1}^{k} \mathbb{P}(B|A_j)\mathbb{P}(A_j)}. \tag{10}$$

Remarks
• There is nothing “Bayesian” about Bayes theorem.
• Bayes theorem is a means to compute conditional probabilities.
• $\mathbb{P}(A_i)$ is often called prior and $\mathbb{P}(A_i|B)$ posterior.
• $\mathbb{P}(B|A_i)$ is sometimes called likelihood and $\mathbb{P}(B)$ evidence.

References
Billingsley, P. (1978). Probability and measure. John Wiley & Sons, Collier-Macmillan Publishers.
DeGroot, M. H. and Schervish, M. J. (2012). Probability and Statistics. Pearson Education.
Fristedt, B. E. and Gray, L. F. (1997). A modern approach to probability theory. Birkhäuser.
Rosenthal, J. S. (2006). A first look at rigorous probability theory. World Scientific Publishing Company.
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics. Springer.