
1.3 Probability theory

1.3.1 Basics

Probability theory starts with a non-definable notion of an experiment, which has possible outcomes. The set of all possible outcomes is called the sample space and is usually denoted by Ω. In all problems dealing with probability, the first step is to identify the sample space.

Example 1.5. An experiment consists of tossing a coin three times; here is the sample space:

Ω = {HHH,HHT,HTH,HTT,THH,THT,TTH,TTT },

and an elementary outcome is ω_i = (a_1, a_2, a_3), a_i ∈ {H, T}. Obviously, |Ω| = 2^3 = 8.

Example 1.6. Consider an experiment of choosing a graph at random from the set of all graphs on 4 vertices with 3 edges. The sample space here is G(n, m), where n = 4 and m = 3. All the elementary events are presented in the figure below. Note that we consider the vertices of the graph to be distinguishable (labeled). How many non-isomorphic graphs are there?
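The sample space of Example 1.5 can be enumerated directly; here is a minimal sketch using Python's standard library:

```python
from itertools import product

# Sample space for three coin tosses (Example 1.5):
omega = list(product("HT", repeat=3))
print(len(omega))  # 8 = 2**3 elementary outcomes
print(omega[0])    # ('H', 'H', 'H')
```

Each element of `omega` is an ordered triple (a_1, a_2, a_3) with a_i ∈ {H, T}, exactly as in the text.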

Example 1.7. If you are asked to specify the sample space for the experiment that consists of picking n balls out of an urn containing M balls, you should ask the follow-up questions whether we care about the order of the balls or not, and which sampling procedure is used. The answer crucially depends on these details. Start with an urn that contains M distinguishable balls, and perform sampling with replacement (i.e., after each drawing we return the ball into the urn). If an outcome of our experiment

[Figure 1.6: the sample space for G(4, 3), i.e., the 20 labeled graphs on 4 vertices with 3 edges; here |Ω| = \binom{\binom{4}{2}}{3} = \binom{6}{3} = 20.]

is a sample of n balls, then what is |Ω|? To answer this question we need to distinguish between ordered and unordered samples, i.e., whether we care about the exact order in which the balls appear or not. For an ordered sample we have ω_i = (a_1, . . . , a_n), where each a_j can take any of the M values. Hence here |Ω| = M^n. If, however, we consider the unordered samples:

Ω = {ω : ω = {a_1, . . . , a_n}, a_i ∈ {1, . . . , M}},

then the answer is not straightforward to come up with (except that it should be smaller than M^n). Let us prove that

N(M, n) := |Ω| = \binom{M + n − 1}{n}

in this case. I use induction. First, note that for k ≤ M

N(k, 1) = k = \binom{k}{1}.

Now assume that N(k, n) = \binom{k + n − 1}{n} for k ≤ M; I need to show that this formula continues to hold when n is replaced with n + 1. For an unordered sample we can always assume that it is arranged as

a1 ≤ a2 ≤ ... ≤ an ≤ an+1.

We have that the number of unordered samples with a_1 = 1 is N(M, n), with a_1 = 2 it is N(M − 1, n), etc., and with a_1 = M it is N(1, n) = 1. Hence,

N(M, n + 1) = N(M, n) + N(M − 1, n) + ... + N(1, n)
            = \binom{M + n − 1}{n} + \binom{M + n − 2}{n} + ... + \binom{n}{n}
            = \left( \binom{M + n}{n + 1} − \binom{M + n − 1}{n + 1} \right) + \left( \binom{M + n − 1}{n + 1} − \binom{M + n − 2}{n + 1} \right) + ... + \left( \binom{n + 1}{n + 1} − \binom{n}{n + 1} \right)
            = \binom{M + n}{n + 1}.

Here we used the fact that

\binom{k + 1}{l} = \binom{k}{l} + \binom{k}{l − 1}.

If we need to perform the sampling without replacement, we need n ≤ M. For the ordered samples one has

(M)_n := |Ω| = M(M − 1) ... (M − n + 1).

Note that if n = M, we obtain permutations of the set of balls, the total number of which is M! := 1 · 2 · ... · M (and of course 0! := 1). For the unordered samples we do not care about the order of the balls in our sample, hence here

|Ω| = \frac{(M)_n}{n!} = \frac{M(M − 1) ... (M − n + 1)}{n!} = \frac{M!}{n!(M − n)!} = \binom{M}{n}.
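All four counting formulas of Example 1.7 can be cross-checked by brute-force enumeration. A sketch with hypothetical values M = 5, n = 3, using Python's `itertools`:

```python
from itertools import (product, permutations, combinations,
                       combinations_with_replacement)
from math import comb, perm

M, n = 5, 3            # hypothetical urn size and sample size
balls = range(1, M + 1)

# ordered, with replacement: |Ω| = M**n
assert len(list(product(balls, repeat=n))) == M**n
# unordered, with replacement: |Ω| = C(M + n - 1, n)
assert len(list(combinations_with_replacement(balls, n))) == comb(M + n - 1, n)
# ordered, without replacement: |Ω| = (M)_n = M(M-1)...(M-n+1)
assert len(list(permutations(balls, n))) == perm(M, n)
# unordered, without replacement: |Ω| = C(M, n)
assert len(list(combinations(balls, n))) == comb(M, n)
```

Each `itertools` generator produces exactly one of the four sample spaces discussed above, so the lengths must match the corresponding formulas.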

Example 1.8. Distribution of n objects in M cells (think about distributing n particles among M energy states). Assume that we assign numbers 1, 2, . . . , M to the cells and 1, 2, . . . , n to the balls. If all the balls are distinguishable, then putting n balls into M cells amounts to having an ordered sample (a_1, . . . , a_n), where a_i is the number of the cell into which the ith ball was put. However, if we do not distinguish the balls, then an outcome is an unordered sample {a_1, . . . , a_n}, where a_i is the number of the cell into which an object was put at step i. Hence we get the correspondence

ordered samples ↔ distinguishable objects unordered samples ↔ indistinguishable objects

In an analogous way we get

sampling with replacement ↔ a cell may get any number of balls
sampling without replacement ↔ a cell can get at most one ball

Hence we have calculated the sizes of the sample spaces in all four cases for our example!

Problem 1.16. Give a combinatorial proof for the number of outcomes in the case of putting n indistinguishable balls into M cells such that any cell may contain any number of balls.

The next important thing to specify is the set of events F. An event A ∈ F is a subset of Ω; in our case of finite Ω, F is usually taken to be the power set 2^Ω, i.e., the set of all subsets of Ω. The events are sets, and we can do the usual set operations with them (taking the union, complement, intersection, difference). In the jargon of probability theory, if A, B ∈ F, the event A ∩ B reads "both A and B occurred", the event A ∪ B reads "either A or B or both occurred", the event A \ B reads "A occurred and B did not", and the event A̅ := Ω \ A reads "A did not occur". For example, the event A that the graph is not connected in the case of G(4, 3) consists of four outcomes (see the figure above). Given the sample space and the set of events, to finally specify the probability space one needs the probability P : F → R, for which the following axioms hold:

1. P{A} ∈ [0, 1] for any A ∈ F and P{Ω} = 1.

2. If {A_i ∈ F : i ∈ I} is a finite collection of pairwise disjoint events, then

P{⋃_{i∈I} A_i} = Σ_{i∈I} P{A_i}.

The triple (Ω, F, P) is called the probability space. When we dealt with the Ramsey numbers, we used the classical probability model, which assigns to each event A the probability

P{A} = |A| / |Ω|,

so that each outcome ω_i ∈ Ω has the probability

P{ω_i} = 1 / |Ω|.

The classical probability model is also called the uniform probability space on Ω. Now we can return to Example 1.8 and ask which probability space one should pick to solve one or another problem. This question is not as obvious as it might seem at first sight. For example, a simple question would be: what is more probable, to observe 11 or 12 points when two dice are tossed? To answer this question we first must decide whether the outcomes (5, 6) and (6, 5) are considered different. If they are different (and we talk about ordered samples), then

P{11 points observed} = 2/36,

where |Ω| = 36, and

P{12 points observed} = 1/36.

However, if we think that (5, 6) and (6, 5) are the same outcome, then

P{11 points observed} = 1 / \binom{6 + 2 − 1}{2} = 1/21 = P{12 points observed}.

Those who have played dice probably know that 11 points are observed somewhat more frequently than 12, hence the first approach is the one we need to use so as not to contradict nature. However, if we move, e.g., to the realm of physical particles, then other sample spaces have to be chosen. Consider the statistical physics problem of describing the (random) distribution of particles in some region, subdivided into smaller ones. It would be natural to assume that any configuration of particles has probability M^{−n}, where n is the number of particles and M is the number of subregions; in physics this is called Maxwell–Boltzmann statistics. Now we know from experiments that this statistics does not apply to any known type of particles! Actually, photons, for example, satisfy Bose–Einstein statistics, where the probability of any configuration is \binom{M + n − 1}{n}^{−1} (i.e., the particles are indistinguishable and any subregion can accommodate more than one particle), and protons obey Fermi–Dirac statistics, where the probability of any configuration is \binom{M}{n}^{−1} (the particles are indistinguishable and any subregion can accommodate only one particle). Note that the general problem of probability theory is not to figure out how to assign probabilities to the outcomes, but, given the probabilities of outcomes, to present a framework to infer the probabilities of more complex events. In general, for our finite sample space Ω we can define the probability of an event A as

P{A} = Σ_{ω_i ∈ A} P{ω_i},

and the axioms above follow from this definition. Using the axioms above, we can deduce the following properties of probability (proof left as an exercise):
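The dice comparison above (11 versus 12 points, with ordered outcomes) can be checked by direct enumeration of the 36 equiprobable ordered pairs:

```python
from itertools import product

# ordered samples: |Ω| = 36 equiprobable outcomes
rolls = list(product(range(1, 7), repeat=2))
p11 = sum(1 for a, b in rolls if a + b == 11) / len(rolls)
p12 = sum(1 for a, b in rolls if a + b == 12) / len(rolls)
print(p11, p12)  # 2/36 vs 1/36
```

The event {11 points} = {(5, 6), (6, 5)} has two ordered outcomes, while {12 points} = {(6, 6)} has only one, which is exactly the asymmetry the unordered model misses.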

1. P{∅} = 0.
2. P{A̅} = 1 − P{A}.

3. If A ⊆ B, then P{A} ≤ P{B}.

4. If {A_i : i ∈ I} is a finite set of events, then

P{⋃_{i∈I} A_i} ≤ Σ_{i∈I} P{A_i}.

5. For any events A, B,

P{A ∪ B} = P{A} + P{B} − P{A ∩ B}.

Here is one more example of a probability space.

Example 1.9 (Bernoulli trials). Assume that we are tossing a coin such that the probability of observing heads is p and the probability of observing tails is q = 1 − p for one toss. Now consider a series of n tosses, such that an outcome of our experiment is the ordered sample ω_i = (a_1, . . . , a_n), a_j ∈ {H, T}. It is natural to put

P{ω} = p^{#{heads in n trials}} q^{n − #{heads in n trials}}.

Thus we have built a probability space by defining the outcomes, the total number of which is 2^n, and defining their probabilities. Now we can calculate probabilities of events. For example, the probability of the event A that in the series of n trials we observe exactly k heads ("k successes") is (prove this)

P{A} = \binom{n}{k} p^k (1 − p)^{n−k}.

This probability is called binomial. The name can be explained by the following:

Σ_{ω∈Ω} P{ω} = Σ_{k=0}^{n} \binom{n}{k} p^k q^{n−k} = (p + q)^n = 1.
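The binomial probabilities just derived can be checked numerically; the sketch below uses the hypothetical parameters n = 10, p = 0.3 and verifies that the probabilities sum to one:

```python
from math import comb

def binom_pmf(k, n, p):
    # P{exactly k successes in n Bernoulli trials}
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3  # hypothetical parameters
total = sum(binom_pmf(k, n, p) for k in range(n + 1))
assert abs(total - 1) < 1e-12  # (p + q)**n = 1
```

This is the numeric counterpart of the binomial-theorem identity ending the example above.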

1.3.2 Conditional probability and independence

Consider a probability space (Ω, F, P) and assume that some event B happened. Here is the question: how do the probabilities of other events A ∈ F change given this additional knowledge?

Example 1.10. Let A be the event that after one throw of a die we observe a square of a natural number, and B be the event that an even number of points is observed. Obviously here P{A} = 2/6, P{B} = 3/6, A = {ω_1, ω_4}, B = {ω_2, ω_4, ω_6}, P{A ∩ B} = 1/6. The probability of A under the condition that B occurred is given by

P{A | B} = P{A ∩ B} / P{B} = (1/6) / (3/6) = 1/3.

This example leads to the following definition.

Definition 1.11. The conditional probability of event A given that event B occurred is defined as

P{A | B} = P{A ∩ B} / P{B},

provided that P{B} ≠ 0.
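Example 1.10 can be reproduced by listing the die outcomes as sets and applying the definition directly; a small sketch:

```python
omega = range(1, 7)                                 # one throw of a die
A = {w for w in omega if int(w ** 0.5) ** 2 == w}   # perfect squares: {1, 4}
B = {w for w in omega if w % 2 == 0}                # even points: {2, 4, 6}

p = lambda E: len(E) / 6                            # uniform probability
p_A_given_B = p(A & B) / p(B)
print(p_A_given_B)  # (1/6) / (3/6) = 1/3
```

Conditioning on B simply renormalizes: only the outcomes inside B remain possible, and A is measured relative to them.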

The last formula can be rewritten, if P{A} ≠ 0, as

P{B | A} = P{A ∩ B} / P{A}.

Hence we get what is sometimes called the theorem of the product of probabilities:

P {A ∩ B} = P {A | B} P {B} = P {B | A} P {A} .

Problem 1.17. Generalize the last equality to the case

P {A1 ∩ ... ∩ Ak} .
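The generalization asked for is the chain rule of conditional probabilities. Checking one instance numerically is not a proof, but it is a useful sanity check; the events below are hypothetical, chosen on the ordered two-dice sample space:

```python
from itertools import product
from fractions import Fraction

omega = list(product(range(1, 7), repeat=2))        # two dice, |Ω| = 36
p = lambda E: Fraction(len(E), len(omega))          # uniform probability
cond = lambda E, F: p(E & F) / p(F)                 # P(E | F), needs p(F) > 0

# three hypothetical events
A1 = {w for w in omega if w[0] >= 3}
A2 = {w for w in omega if w[1] <= 4}
A3 = {w for w in omega if (w[0] + w[1]) % 2 == 0}

# chain rule: P(A1 ∩ A2 ∩ A3) = P(A1) P(A2 | A1) P(A3 | A1 ∩ A2)
lhs = p(A1 & A2 & A3)
rhs = p(A1) * cond(A2, A1) * cond(A3, A1 & A2)
assert lhs == rhs
```

Exact `Fraction` arithmetic avoids any floating-point doubt about the equality.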

Problem 1.18. Assume that H_1, . . . , H_k are disjoint events (i.e., H_i ∩ H_j = ∅ for any i ≠ j) such that H_1 ∪ . . . ∪ H_k = Ω. Prove that for any event A

P{A} = Σ_{i=1}^{k} P{A | H_i} P{H_i}.

This is sometimes called the formula of total probability. Also prove Bayes' theorem:

P{H_i | A} = P{A | H_i} P{H_i} / Σ_{j=1}^{k} P{A | H_j} P{H_j}.

Problem 1.19. In an urn there are M white and N black balls. One ball is picked and set aside, without looking at the ball's color. What is the probability that the next randomly picked ball is white?

Conditional probability allows us to give a definition of independence, which is a central notion of probability theory:

Definition 1.12. Two events A, B are independent if

P {A | B} = P {A} , P {B | A} = P {B} , or, equivalently, P {A ∩ B} = P {A} P {B} .

A system of events A_1, . . . , A_k is called independent if for any j ≤ k and any indices i_1 < . . . < i_j one has

P{A_{i_1} ∩ . . . ∩ A_{i_j}} = P{A_{i_1}} . . . P{A_{i_j}}.

This definition gives theoretical grounds for the operations we performed when considering the example of Bernoulli trials.

Problem 1.20. Find an example of a system of events such that they are pairwise independent (i.e., for any pair A_i, A_j we have P{A_i ∩ A_j} = P{A_i} P{A_j}) but not independent in the sense of the last definition.
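One classical construction of pairwise but not mutually independent events (often attributed to Bernstein) uses two fair coin tosses; its properties can be verified by enumeration:

```python
from itertools import product

omega = list(product("HT", repeat=2))   # two fair coins, uniform, |Ω| = 4
p = lambda E: len(E) / 4

A = {w for w in omega if w[0] == "H"}   # first toss is heads
B = {w for w in omega if w[1] == "H"}   # second toss is heads
C = {w for w in omega if w[0] == w[1]}  # both tosses agree

# pairwise independent:
assert p(A & B) == p(A) * p(B)
assert p(A & C) == p(A) * p(C)
assert p(B & C) == p(B) * p(C)
# but not independent as a system: 1/4 != 1/8
assert p(A & B & C) != p(A) * p(B) * p(C)
```

Any two of the events behave independently, yet knowing two of them determines the third, which is exactly the failure the problem asks for.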

1.3.3 Random variables

Definition 1.13. Let (Ω, F, P) be a probability space and Ω be finite. Any function

X : Ω → R is called a random variable.

Since in our case Ω is finite, it means that we’ll deal only with the situations of the form

X : Ω → {x_1, . . . , x_k}, where x_1, . . . , x_k are the (distinct) values that the random variable X takes. Consider the probabilities that X takes, e.g., the value x_i:

pi = P {ω : X(ω) = xi} .

Hence, together with the discrete values of the random variable, we obtain the corresponding probabilities

{p_1, . . . , p_k}, Σ_i p_i = 1,

which are called the distribution or probability mass function of the random variable X. In the following I will abuse the notation P{X = x_i} to actually mean p_i, or, in more detail, P{ω : X(ω) = x_i}. Random variables which are defined by a discrete set of values and corresponding discrete probability distributions are called (surprise, surprise) discrete. Here are some examples of random variables.

• Indicator random variable. Consider the indicator function of an event A ∈ F:

1_A(ω) = 1 if ω ∈ A, and 1_A(ω) = 0 if ω ∉ A.

• Uniform discrete random variable. This is a random variable taking the values {1, . . . , n} with equal probabilities

P{X = i} = 1/n.

• Bernoulli random variable. This is a random variable that takes only two values {1, 0} with probabilities {p, q}, p + q = 1.

• Binomial random variable. A random variable X has a binomial distribution with probability of success p if it takes values {0, 1, . . . , n} with probabilities

p_i = P{X = i} = \binom{n}{i} p^i (1 − p)^{n−i}.

Note that we have two parameters for this distribution: p and n. You should check that the sum of the probabilities is 1.

• Poisson random variable. A random variable X has a Poisson distribution with parameter λ > 0 if it takes the countable set of values {0, 1, . . . , n, . . .} with probabilities

p_i = P{X = i} = \frac{λ^i}{i!} e^{−λ}.

Formally, this random variable is not covered by our definitions, since the sample space on which it is defined is necessarily infinite. However, in some cases we will need to use this random variable, hence its definition. Note here that the sum of all p_i is one (recall the Taylor series for the exponential function).
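The Taylor-series remark can be checked numerically: summing the Poisson probabilities over a truncated range (the tail beyond it is negligible for the hypothetical λ below) recovers e^{−λ} Σ λ^i/i! ≈ 1.

```python
from math import exp, factorial

def poisson_pmf(i, lam):
    # P{X = i} for a Poisson random variable with parameter lam
    return lam**i / factorial(i) * exp(-lam)

lam = 3.5  # hypothetical parameter
total = sum(poisson_pmf(i, lam) for i in range(100))
assert abs(total - 1) < 1e-12
```

The range is truncated at 100 only to keep the computation finite; for λ = 3.5 the omitted tail is far below the tolerance used.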

• Normal random variable. Sometimes the sample space is not only infinite but uncountable. Therefore, the random variables defined on such a space cannot be described by a discrete set of probabilities (finite or countable). For some (not all!) of these random variables a probability density function p(x) ≥ 0 (often abbreviated pdf) can be defined, such that ∫_{−∞}^{∞} p(x) dx = 1. In this case the sums are replaced with the corresponding integrals. Random variables having probability density functions are called absolutely continuous. The most important example of an absolutely continuous random variable is a normal random variable, with pdf

p(x) = \frac{1}{\sqrt{2π} σ} e^{−(x−µ)²/(2σ²)}, µ ∈ R, σ > 0.

Often the notation X ∼ N(µ, σ) is used.

Random variables X and Y defined on the same sample space Ω are independent if the events {ω : X(ω) ≤ x} and {ω : Y (ω) ≤ y} are independent for any x, y. In other words, one must have

P {X ≤ x, Y ≤ y} = P {{ω : X(ω) ≤ x} ∩ {ω : Y (ω) ≤ y}} = P {X ≤ x} P {Y ≤ y} , ∀x, y ∈ R.

For discrete random variables, with which we mostly deal, the condition of independence is somewhat simplified: discrete random variables X and Y are independent if

P{X = x_i, Y = y_j} = P{X = x_i} P{Y = y_j} for all i, j.

This definition can be generalized for the set of three or more random variables, defined on the same sample space.

1.3.4 Characteristics of the random variables. Chebyshev’s inequality

Here we will talk exclusively about discrete random variables. We also use the notation A_i = {ω : X(ω) = x_i} for the random variable X. Note that the A_i are mutually disjoint and ⋃_i A_i = Ω.

Definition 1.14. The mean, or expectation, or mathematical expectation of a random variable X = Σ_{i=1}^{n} x_i 1_{A_i} is the number

E(X) = Σ_{i=1}^{n} x_i P{A_i} = Σ_{i=1}^{n} x_i p_i.

If we deal with a countable set of values of a random variable, we need to require additionally that the corresponding series be absolutely convergent. Here are some properties of the expectation whose proofs are left as an exercise.

1. If X ≥ 0 then E(X) ≥ 0.
2. E(aX + bY) = a E(X) + b E(Y), where a and b are constants.
3. If X ≥ Y then E(X) ≥ E(Y).
4. |E(X)| ≤ E(|X|).
5. If X and Y are independent then E(XY) = E(X) E(Y).

6. If X = 1_A then E(X) = P{A}.

Problem 1.21. Show that if X is a Bernoulli random variable then E(X) = p; if X has a binomial distribution then E(X) = np; if X has a Poisson distribution then E(X) = λ.
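The expectations in Problem 1.21 can be checked numerically (this is a sanity check for hypothetical parameter values, not a proof):

```python
from math import comb, exp, factorial

def expect(values, probs):
    # E(X) = Σ x_i p_i
    return sum(x * q for x, q in zip(values, probs))

n, p, lam = 12, 0.25, 2.0  # hypothetical parameters

# Bernoulli: E(X) = p
assert abs(expect([1, 0], [p, 1 - p]) - p) < 1e-12
# binomial: E(X) = n p
binom_mean = expect(range(n + 1),
                    [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)])
assert abs(binom_mean - n * p) < 1e-12
# Poisson (series truncated far in the tail): E(X) = λ
pois_mean = expect(range(60),
                   [lam**k / factorial(k) * exp(-lam) for k in range(60)])
assert abs(pois_mean - lam) < 1e-9
```

The Poisson sum is truncated at 60 terms, which for λ = 2 leaves an error far below the stated tolerance.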

If X is a random variable with values {x_1, . . . , x_k}, then any function f : {x_1, . . . , x_k} → R of it is also a random variable, whose expectation can be found as (this actually requires proof)

E(f(X)) = Σ_{i=1}^{k} f(x_i) p_i.

For example, for any X the expectation of X² is E(X²) = Σ_{i=1}^{k} x_i² p_i. Another very important characteristic of a random variable is its variance:

Var(X) = E((X − E(X))²) = Σ_{i=1}^{k} (x_i − E(X))² p_i = Σ_{i=1}^{k} x_i² p_i − (E(X))² = E(X²) − (E(X))².

We have that Var(X) ≥ 0, Var(a + bX) = b² Var(X), Var(a) = 0, and Var(X + Y) = Var(X) + Var(Y) for independent X and Y.

Problem 1.22. Find Var(X) for binomial and Poisson random variables.

Very often we will need to estimate deviations of various random variables from, e.g., the expectation. The first step here is often Chebyshev's inequality:

Proposition 1.15 (Chebyshev's inequality). Let (Ω, F, P) be a probability space and X : Ω → R_+ be a nonnegative random variable. Then

P{X ≥ ϵ} ≤ E(X)/ϵ

for any ϵ > 0.

Proof. Notice that

X = X 1_{X≥ϵ} + X 1_{X<ϵ} ≥ X 1_{X≥ϵ} ≥ ϵ 1_{X≥ϵ}.

Then, by the properties of expectations,

E(X) ≥ ϵ E( 1X≥ϵ) = ϵ P {X ≥ ϵ} . 

Corollary 1.16. For any ϵ > 0 and any random variable X we have

P{|X| ≥ ϵ} ≤ E(|X|)/ϵ,
P{|X| ≥ ϵ} = P{X² ≥ ϵ²} ≤ E(X²)/ϵ²,
P{|X − E(X)| ≥ ϵ} ≤ Var(X)/ϵ².

Finally, we will usually deal with finite sample spaces, but will be considering sequences of probability spaces such that n → ∞. It is said that some property holds asymptotically almost surely (abbreviated a.a.s.) if its probability converges to one as n approaches infinity.
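The variance form of the corollary can be illustrated on a binomial random variable (hypothetical parameters): the exact tail probability is computed from the pmf and compared against the Chebyshev bound Var(X)/ϵ².

```python
from math import comb

n, p = 20, 0.5  # hypothetical binomial parameters
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

mean = sum(k * q for k, q in zip(range(n + 1), pmf))            # n p = 10
var = sum((k - mean)**2 * q for k, q in zip(range(n + 1), pmf))  # n p (1-p) = 5

eps = 4.0
tail = sum(q for k, q in zip(range(n + 1), pmf) if abs(k - mean) >= eps)
assert tail <= var / eps**2  # Chebyshev: P{|X - E X| >= eps} <= Var(X)/eps**2
print(tail, var / eps**2)
```

As is typical, the bound is far from tight: for these parameters the true tail is several times smaller than Var(X)/ϵ², but the inequality is what we will use in asymptotic arguments.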

1.4 Asymptotic notations

Consider two functions f and g defined on N. By definition f ∈ o(g) (very often we abuse the notation and write f = o(g), or f(n) = o(g(n))) if there exists a function α(n) such that α(n) → 0 as n → ∞ and f(n) = α(n)g(n). Sometimes, if g(n) ≠ 0 for all n, a more convenient definition is used:

f ∈ o(g) ⇔ lim_{n→∞} f(n)/g(n) = 0.

If f(n) → 0 then, obviously, f ∈ o(c) for any constant c. Usually one writes f = o(1) in this case (though o(2) would be perfectly legitimate, I have never seen this notation). One has n = o(n²), n² = o(n³), and so on. By definition f ∈ O(g) if there exists a constant c > 0 such that the inequality |f(n)| ≤ c|g(n)| holds for all n. If f = o(g) then f = O(g) (what about the opposite?). Less frequent notations are: f = Ω(g) if g = O(f). The notation f = Θ(g) means that f = O(g) and f = Ω(g) (e.g., n² = Θ(n² + n)). Finally, f ∼ g means f(n) = (1 + o(1))g(n).

Using the asymptotic notations we can define the small world property of a graph. Consider a graph G(V, E) of order n. It is called small world if its diameter diam G satisfies

diam G = Θ(log n).

Another possible measure is

L(G) = Σ_{u,v∈S} d(u, v) / |S|,

where S is the set of pairs of distinct vertices of G with the property that for any u, v ∈ S the distance d(u, v) is finite. If G is connected of order n, then

L(G) = \binom{n}{2}^{−1} Σ_{u,v∈V} d(u, v),

so that L(G) is the average distance in G. For a small world graph it is usually required that L(G) = Θ(log log n).
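Both diam G and the average distance L(G) can be computed by breadth-first search from every vertex; a sketch assuming graphs given as adjacency-list dicts, with the cycle C_8 as a hypothetical connected example (averaging over ordered pairs gives the same value as over unordered pairs, since d is symmetric):

```python
from collections import deque

def distances(adj, s):
    # BFS distances from vertex s in an unweighted graph
    d = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in d:
                d[v] = d[u] + 1
                q.append(v)
    return d

def diameter_and_L(adj):
    n = len(adj)
    dists = [distances(adj, s) for s in adj]
    diam = max(max(d.values()) for d in dists)
    total = sum(d[v] for s, d in zip(adj, dists) for v in d if v != s)
    return diam, total / (n * (n - 1))  # average distance over ordered pairs

# hypothetical example: the cycle on 8 vertices
n = 8
cycle = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
diam, L = diameter_and_L(cycle)
print(diam, L)
```

For C_8 this gives diam G = 4 and L(G) = 16/7, both of order n rather than log n, so cycles are decidedly not small world.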
