Note Set 1: Review of Basic Concepts in Probability
Padhraic Smyth, Department of Computer Science, University of California, Irvine
January 2021

This set of notes is intended as a brief refresher on probability. As a student reading these notes you will likely have seen (in other classes) most or all of the ideas discussed below. Nonetheless, even though these ideas are relatively straightforward, they are the key building blocks for more complex ideas in probabilistic learning. Thus, it is important to be familiar with these ideas in order to understand the material we will be discussing later in the course.

1 Discrete Random Variables and Distributions

Consider a variable $A$ that can take a finite set of values $\{a_1, \ldots, a_m\}$. We can define a probability distribution (also referred to as a probability mass function) for $A$ by specifying a set of numbers $\{P(A = a_1), \ldots, P(A = a_m)\}$, where $0 \le P(A = a_i) \le 1$ and where $\sum_{i=1}^{m} P(A = a_i) = 1$. We can think of $A = a_i$ as an event that either is or is not true, and $P(A = a_i)$ is the probability of this particular event being true. We can also think of $A = a_i$ as a proposition or logical statement about the world that is either true or false. Probability expresses our uncertainty about whether the proposition $A = a_i$ is true or false.

Notational comment: for convenience, we follow a typical convention in probability notation and will often shorten $P(A = a_i)$ to just $P(a_i)$, and will also often use $a$ (and $P(a)$) to indicate a generic value for $A$, i.e., one of the possible $a_i$ values.

We call $A$ a random variable as long as the set of values $\{a_1, \ldots, a_m\}$ is mutually exclusive and exhaustive. Mutually exclusive means that the random variable $A$ can only take one value at a time; another way to state this is that $P(A = a_i \text{ AND } A = a_j) = 0 \;\; \forall \, i, j, \; i \neq j$ (the symbol $\forall$ means "for all"), i.e., the probability of the event occurring where $A$ takes two different values is 0 (i.e., it is deemed impossible).
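The two defining conditions of a probability mass function above (each probability lies in $[0, 1]$, and the probabilities sum to 1) are easy to check mechanically. Below is a small Python sketch; the dictionary representation and the example values are illustrative choices, not part of the notes:

```python
import math

def is_valid_distribution(dist):
    """Check that `dist` (value -> probability) defines a probability
    mass function: each P(a_i) is in [0, 1] and the P(a_i) sum to 1."""
    probs = list(dist.values())
    return all(0.0 <= p <= 1.0 for p in probs) and math.isclose(sum(probs), 1.0)

# A valid distribution over m = 3 values of a hypothetical variable A:
P_A = {"a1": 0.2, "a2": 0.5, "a3": 0.3}
print(is_valid_distribution(P_A))                     # True
print(is_valid_distribution({"a1": 0.9, "a2": 0.3}))  # False: sums to 1.2
```

Note the use of `math.isclose` rather than an exact equality test, since floating-point sums of probabilities rarely equal 1 exactly.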
Exhaustive means that the variable $A$ always takes one of the values in the set $\{a_1, \ldots, a_m\}$, i.e., there are no other values it can take.

EXAMPLE 1: In medical diagnosis we often want to be able to predict whether a patient has a particular disease or not, given other measurements; this is a particular type of prediction problem known as classification that we will discuss later in more detail. Let $C$ be a variable representing what disease a particular patient has. Consider for example $C$ taking values in $\{c_1, c_2\}$, where $c_1$ = has the flu and $c_2$ = does not have the flu. In this case $C$ is a random variable since $c_1$ and $c_2$ are clearly mutually exclusive and exhaustive, and we will be uncertain in general about the actual true value of $C$ for any particular patient.

Now consider another set of events: $d_1$ = has the flu, $d_2$ = has malaria, and $d_3$ = healthy. Could $\{d_1, d_2, d_3\}$ be used as the set of events to define a random variable? The answer is no, since the events are not mutually exclusive: a person could have both the flu and malaria (this has a very small probability of happening, but this probability is not zero, i.e., it is possible). Nor are the events exhaustive: a person could be in none of these states, since they could in general have some condition other than flu or malaria.

EXAMPLE 2: Consider modeling the probability distribution of English words in a particular set of text documents. Let $W$ be the word variable and let $w_i$ be a particular word. For example, we might want to know the probability that a particular word $w_i$ will be typed by a user in a text editor or will be spoken by a user. Such probabilities are used in practice in a variety of applications such as speech recognition, automated machine translation, and so forth.
In order to treat $W$ as a random variable we need to ensure that the words being spoken or written by a user are mutually exclusive and exhaustive by defining $W$ appropriately. For example, for mutual exclusion, we could define $W$ to be the next word in a sequence of words, which by definition means there will always be a single unique "next word"; we would also need to include in our vocabulary a special symbol for the end of a sequence. The requirement of exhaustivity is more difficult to satisfy. We could define the set of possible words $\{w_1, \ldots, w_m\}$ as the set of all words in an English dictionary, but which dictionary should we use? And what about words not in the dictionary, such as regional variations of words, slang words, newly-emerging words, proper nouns such as "California," and so on? This is a problem commonly faced in text modeling, since it is impossible to know all future words that might occur in practice. One way to get around this is to limit the set of words being modeled to (say) the $m = 20{,}000$ most frequent words in a particular set of text, and then define an additional symbol $w_{20,001}$ to represent "all other words," ensuring that $\sum_i P(w_i) = 1$. An interesting problem in this context, which we will not discuss at this point, is how much probability mass we should assign to the event "all other words," or equivalently, how likely are we to encounter previously unseen words in the future?

Before we finish with discrete random variables we note that not all discrete random variables take values from a finite set; some instead take values from a countably infinite set. For example, we can define a random variable $A$ taking values from the set of positive integers $\{1, 2, 3, \ldots\}$. The sum $\sum_{i=1}^{\infty} P(A = i)$ must converge to 1, of course.
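The vocabulary-truncation idea in Example 2, mapping every word outside the $m$ most frequent to a single "all other words" symbol so that the probabilities still sum to 1, can be sketched in a few lines of Python. The tiny corpus, the cutoff $m = 3$, and the `<other>` token below are all illustrative choices, not from the notes:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

# Keep only the m most frequent words; everything else maps to one symbol.
m = 3
counts = Counter(corpus)
vocab = {w for w, _ in counts.most_common(m)}

OTHER = "<other>"  # stand-in for the "all other words" symbol in the text
mapped = [w if w in vocab else OTHER for w in corpus]

# Empirical distribution over the truncated vocabulary plus OTHER.
n = len(mapped)
P_W = {w: c / n for w, c in Counter(mapped).items()}
print(abs(sum(P_W.values()) - 1.0) < 1e-12)  # True: still sums to 1
```

Because every out-of-vocabulary word is folded into `OTHER`, no probability mass is lost and the distribution remains exhaustive by construction.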
For random variables taking values in a countably infinite set it is impossible to explicitly represent a probability distribution directly by a table of numbers; even in the finite case this may often be inconvenient or inefficient. In such cases we use a parametric model for the distribution, i.e., a function that describes how $P(i)$ varies as a function of $i$ and as a function of one or more parameters of the model.

EXAMPLE 3: An example of a probability distribution defined on the positive integers is the geometric distribution, $P(A = i) = (1 - \alpha)^{i-1}\alpha, \; i = 1, 2, \ldots$, where $\alpha$ is a parameter of the model and $0 < \alpha < 1$. This can be used, for example, to describe the distribution of the number of coin tosses needed until the first "heads" appears in a coin-tossing experiment (i.e., $i - 1$ consecutive "tails" followed by a head), where $\alpha$ is the probability of a head occurring on each coin toss.

2 Continuous Random Variables and Density Functions

The variables we discussed above, such as $A$ and $C$, take values that can be put in one-to-one correspondence with sets of integers, and are often referred to as discrete random variables. It is also useful to be able to build and use probability models for real-valued or continuous random variables, e.g., a random variable $X$ that can take values $x$ anywhere on the real line. (Sidenote: as is often done elsewhere in the literature we will sometimes use the term "distribution" rather than "density" when referring to a probability density function; the terminology can sometimes be somewhat loose, but the key distinction to keep in mind is whether a random variable is discrete or continuous.)

Generally speaking, where we have sums for discrete random variables we will usually have integrals for continuous random variables. So, instead of the requirement that $\sum_i P(x_i) = 1$ we have $\int p(x)\,dx = 1$. Here $p(x)$ is the probability density function for the variable $X$, where $p(x) \ge 0 \;\; \forall x$.
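The normalization requirement above, $\int p(x)\,dx = 1$, can be checked numerically for any candidate density. The sketch below uses a midpoint Riemann sum; the density $p(x) = 2x$ on $[0, 1]$ is an arbitrary illustrative choice, not one from the notes:

```python
def p(x):
    """A simple valid density: p(x) = 2x on [0, 1], 0 elsewhere.
    It is non-negative everywhere and its area is 1."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

# Midpoint Riemann sum approximating the integral of p over [0, 1].
n = 100_000
dx = 1.0 / n
area = sum(p((k + 0.5) * dx) * dx for k in range(n))
print(round(area, 6))  # 1.0
```

A finer grid (larger `n`) would be needed for densities that vary more sharply; for this linear density the midpoint rule is already essentially exact.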
We can calculate the probability that the variable $X$ lies between two values $a$ and $b$ by integrating $p(x)$ between $a$ and $b$, i.e., $P(a \le X \le b) = \int_a^b p(x)\,dx$.

EXAMPLE 4: The uniform density is defined as $p(x) = \frac{1}{b-a}$ for $a \le x \le b$, and 0 otherwise. We sometimes use the notation $X \sim U(a, b)$ to denote that the random variable $X$ has a particular type of density (in this case the uniform density $U$). Here $a$ and $b$ are the parameters of the uniform density function.

[Figure 1: An example of two Gaussian density functions with different parameter values.]

Note that while we had the option of describing a discrete probability distribution (for a finite set) with a list or table of probabilities, we can't define a continuous density function this way; instead we must parametrize the function and describe it via its functional form and its parameters.

Note also that although $\int p(x)\,dx = 1$, this does not imply that $p(x)$ itself needs to be less than 1. For example, for the uniform density, if $a = 0$ and $b = 0.1$, then $p(x) = \frac{1}{0.1 - 0.0} = 10$ for $a \le x \le b$. The key point is that the area under the density function $p(x)$ is constrained to be 1, but the height of the function can be any non-negative quantity.
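A minimal Python sketch of Example 4, assuming only the definitions in the text: since the uniform density is constant on $[a, b]$, the integral $\int_{lo}^{hi} p(x)\,dx$ reduces to the density's height times the length of the overlap with $[a, b]$, and the height itself can exceed 1:

```python
def uniform_pdf(x, a, b):
    """Density of U(a, b): 1/(b-a) on [a, b], 0 elsewhere."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_prob(lo, hi, a, b):
    """P(lo <= X <= hi) for X ~ U(a, b). The integral of a constant
    density is height * length of the overlap of [lo, hi] with [a, b]."""
    overlap = max(0.0, min(hi, b) - max(lo, a))
    return overlap / (b - a)

# The height of the density on [0, 0.1] is 10, well above 1 ...
print(uniform_pdf(0.05, 0.0, 0.1))       # 10.0
# ... yet the total probability (area) over the support is still 1.
print(uniform_prob(0.0, 0.1, 0.0, 0.1))  # 1.0
```

Clamping the overlap at 0 handles query intervals that fall entirely outside the support, where the probability is 0.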