Summary of Basic Probability Theory Math 218, Mathematical Statistics

8. For any two events E and F , P (E) = P (E \ F ) + P (E \ F c): 9. If event E is a subset of event F , then P (E) ≤ P (F ). Summary of basic probability theory 10. Statement 7 above is called the principle of Math 218, Mathematical Statistics inclusion and exclusion. It generalizes to more than two events. D Joyce, Spring 2016 n n [ X X P Er = P (Ei) − P (Ei \ Ej) Sample space. A sample space consists of a un- r=1 i=1 i<j derlying set Ω, whose elements are called outcomes, X a collection of subsets of Ω called events, and a + P (Ei \ Ej \ Ek) − · · · function P on the set of events, called a probability i<j<k n−1 function, satisfying the following axioms. + (−1) P (E1 \ E2 \···\ En) 1. The probability of any event is a number in In words, to find the probability of a union of the interval [0; 1]. n events, first sum their individual probabilities, 2. The entire set Ω is an event with probability then subtract the sum of the probabilities of all P (Ω) = 1. their pairwise intersections, then add back the sum 3. The union and intersection of any finite or of the probabilities of all their 3-way interections, countably infinite set of events are events, and the then subtract the 4-way intersections, and continue complement of an event is an event. adding and subtracting k-way intersections until 4. The probability of a disjoint union of a finite you finally stop with the probability of the n-way or countably infinite set of events is the sum of the intersection. probabilities of those events, [ X Random variables notation. In order to de- P ( Ei) = P (Ei): scribe a sample space, we frequently introduce a i i symbol X called a random variable for the sam- From these axioms a number of other properties ple space. With this notation, we can replace can be derived including these. the probability of an event, P (E), by the notation P (X 2 E), which, by itself, doesn't do much. But 5. The complement Ec = Ω − E of an event E is many events are built from the set operations of an event, and complement, union, and intersection, and with the random variable notation, we can replace those by P (Ec) = 1 − P (E): logical operations for `not', òr', and ànd'. For in- c 6. The empty set ; is an event with probability stance, the probability P (E [ F ) can be written P (;) = 0. as P (X 2 E but X2 = F ). 7. For any two events E and F , Also, probabilities of finite events can be written in terms of equality. For instance, the prob- P (E [ F ) = P (E) + P (F ) − P (E \ F ); ability of a singleton, P (fag), can be written as P (X=a), and that for a doubleton, P (fa; bg) = therefore P (X=a or X=b). One of the main purposes of the random variable P (E [ F ) ≤ P (E) + P (F ): notation is when we have two uses for the same 1 sample space. For instance, if you have a fair die, Also, the probability of a singleton fbg can be found the sample space is Ω = f1; 2; 3; 4; 5; 6g where the as a limit 1 probability of any singleton is 6 . If you have two fair dice, you can use two random variables, X and P (fbg) = lim (F (b) − F (a)): Y , to refer to the two dice, but each has the same a!b sample space. (Soon, we'll look at the joint distri- From these, probabilities of unions of intervals can bution of (X; Y ), which has a sample space defined be computed. Sometimes, the c.d.f. is simply called on Ω × Ω. the distribution, and the sample space is identified with this distribution. Random variables and cumulative distribution functions. A sample space can have any set Discrete distributions. Many sample distribu- as its underlying set, but usually they're related tions are determined entirely by the probabilities of to numbers. Often the sample space is the set of their outcomes, that is, the probability of an event real numbers R, and sometimes a power of the real E is numbers Rn. X X The most common sample space only has two el- P (E) = P (X=x) = P (fxg): ements, that is, there are only two outcomes. For x2E x2E instance, flipping a coin as two outcomes|Heads and Tails; many experiments have two outcomes| The sum here, of course, is either a finite or count- Success and Failure; and polls often have two ably infinite sum. Such a distribution is called a dis- outcomes|For and Against. Even though these crete distribution, and when there are only finitely events aren't numbers, it's useful to replace them many outcomes x with nonzero probabilities, it is by numbers, namely 0 and 1, so that Heads, Suc- called a finite distribution. cess, and For are identified with 1, and Tails, Fail- A discrete distributions is usually described in ure, and Against are identified with 0. Then the terms of a probability mass function (p.m.f.) f de- sample space can have R as its underlying set. fined by When the sample space does have R as its underlying set, the random variable X is called a real f(x) = P (X=x) = P (fxg): random variable. With it, the probability of an interval like [a; b], which is P ([a; b]), can then be de- This p.m.f. is enough to determine this distribution scribed as P (a ≤ X ≤ b). Unions of intervals can since, by the definition of a discrete distribution, also be described, for instance P ((−∞; 3) [ [4; 5]) the probability of an event E is can be written as P (X < 3 or 4 ≤ X ≤ 5). X When the sample space is R, the probability P (E) = f(x): function P is determined by a cumulative distri- x2E bution function (c.d.f.) F as follows. The function F : R ! R is defined by In many applications, a finite distribution is uni- form, that is, the probabilities of its outcomes are F (x) = P (X ≤ x) = P ((−∞; x]): all the same, 1=n, where n is the number of outcomes with nonzero probabilities. When that is Then, from F , the probability of a half-open inter- the case, the field of combinatorics is useful in find- val can be found as ing probabilities of events. Combinatorics includes various principles of counting such as the multipli- P ((a; b]) = F (b) − F (a): cation principle, permutations, and combinations. 2 Continuous distributions. When the cumula- For a continuous random variable, it is defined in tive distribution function F for a distribution is terms of the probability density function f as differentiable function, we say it's a continuous dis- Z 1 tribution. Such a distribution is determined by a E(X) = µX = xf(x) dx: probability density function f. The relation be- −∞ 0 tween F and f is that f is the derivative F of F , There is a physical interpretation where this and F is the integral of f. mean is interpreted as a center of gravity. Z x Expectation is a linear operator. That means F (x) = f(t) dt −∞ that the expectation of a sum or difference is the difference of the expectations Conditional probability and independence. If E and F are two events, with P (F ) 6= 0, then E(X + Y ) = E(X) + E(Y ); the conditional probability of E given F is defined and that's true whether or not X and Y are inde- to be P (E \ F ) pendent, and also P (EjF ) = : P (F ) E(cX) = c E(X) Two events, E and F , neither with probability 0, are said to be independent, or mutually indepen- where c is any constant. From these two properties dent, if any of the following three logically equiva- it follows that lent conditions holds E(X − Y ) = E(X) − E(Y ); P (E \ F ) = P (E) P (F ) P (EjF ) = P (E) and, more generally, expectation preserves linear combinations P (F jE) = P (F ) n ! n X X E c X = c E(X ): Bayes' formula. This formula is useful to invert i i i i i=1 i=1 conditional probabilities. It says P (EjF ) P (F ) Furthermore, when X and Y are independent, P (F jE) = then E(XY ) = E(X) E(Y ), but that equation P (E) doesn't usually hold when X and Y are not inde- P (EjF ) P (F ) = pendent. P (EjF ) P (F ) + P (EjF c) P (F c) where the second form is often more useful in prac- Variance and standard deviation. The vari- tice. ance of a random variable X is defined as 2 2 2 2 Expectation. The expected value E(X), also Var(X) = σX = E((X − µX ) ) = E(X ) − µX called the expectation or mean µ , of a random X where the last equality is provable. Standard devia- variable X is defined differently for the discrete and tion, σ, is defined as the square root of the variance. continuous cases. Here are a couple of properties of variance. First, For a discrete random variable, it is a weighted if you multiply a random variable X by a constant average defined in terms of the probability mass c to get cX, the variance changes by a factor of the function f as square of c, that is X E(X) = µ = xf(x): X Var(cX) = c2 Var(X): x 3 That's the main reason why we take the square The moment generating function.

Load more