Probability Theory and Statistics
With a view towards bioinformatics
Lecture notes
Niels Richard Hansen
Department of Applied Mathematics and Statistics, University of Copenhagen
November 2005

Prologue
Flipping coins and rolling dice are two commonly occurring examples in an introductory course on probability theory and statistics. They represent archetypical experiments where the outcome is uncertain – no matter how many times we roll the dice we are unable to predict the outcome of the next roll. We use probabilities to describe the uncertainty; a fair, classical die has probability 1/6 for each side to turn up. Elementary probability computations can to some extent be handled based on intuition and common sense. The probability of getting yatzy in a single throw is for instance
\[ \frac{6}{6^5} = \frac{1}{6^4} = 0.0007716. \]
The argument for this and many similar computations is based on the pseudo theorem that the probability for any event equals
\[ \frac{\text{number of favourable outcomes}}{\text{number of possible outcomes}}. \]
Getting yatzy consists of the six favourable outcomes with all five dice facing the same side upwards. We call the formula above a pseudo theorem because, as we will show in Section 1.4, it is only the correct way of assigning probabilities to events under a very special assumption about the probabilities describing our experiment. The special assumption is that all outcomes are equally probable – something we tend to believe if we don't know any better. It must, however, be emphasised that without proper training most people will either get it wrong or have to give up if they try computing the probability of anything except the most elementary events – even when the pseudo theorem applies. There exist numerous tricky probability questions where intuition somehow breaks down and wrong conclusions can be drawn easily if one is not extremely careful. A good challenge could be to compute the probability of getting yatzy in three throws with the usual rules and provided that we always hold as many equal dice as possible.
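The three-throw challenge can at least be approached by simulation. Below is a minimal sketch in Python (these notes otherwise use R, cf. Appendix A) of the greedy strategy described above: after each throw we hold the largest group of equal dice and reroll the rest. The seed and number of repetitions are arbitrary choices.

```python
import random
from collections import Counter

def yatzy_in_three(rng):
    # hold as many equal dice as possible, reroll the rest, three throws in total
    kept = []
    for throw in range(3):
        roll = kept + [rng.randint(1, 6) for _ in range(5 - len(kept))]
        face, count = Counter(roll).most_common(1)[0]
        kept = [face] * count
        if count == 5:
            return True
    return False

rng = random.Random(1)
N = 100_000
p_hat = sum(yatzy_in_three(rng) for _ in range(N)) / N
print(p_hat)   # estimated probability of yatzy within three throws
```

The simulation suggests a probability of a few percent – markedly larger than the single-throw probability 0.0007716, but still an event one should not count on.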
Figure 1: The average number of times that the dice sequence 1,4,3 comes out before the sequence 2,1,4 as a function of the number of times the dice game has been played.
The yatzy problem can in principle be solved by counting – simply write down all combinations and count the number of favourable and possible combinations. Then the pseudo theorem applies. It is a futile task but in principle a possibility. But in many cases it is impossible to rely on counting – even in principle. As an example let's consider a simple dice game with two participants: First you choose a sequence of three dice throws, 1, 4, 3, say, and then I choose 2, 1, 4. We throw the dice until one of the two sequences comes out, and you win if 1, 4, 3 comes out first and otherwise I win. If the outcome is
4, 6, 2, 3, 5, 1, 3, 2, 4, 5, 1, 4, 3, then you win. It is natural to ask with what probability you will win this game. In addition, it is clearly a quite boring game, since we have to throw the dice many times and simply wait for one of the two sequences to occur. Another question could therefore be how boring the game is. Can we for instance compute the probability of having to throw the dice more than 20, or perhaps 50, times before any of the two sequences shows up? The problem that we encounter here is first of all that the pseudo theorem does not apply, simply because there is an infinite number of favourable as well as possible outcomes. The event that you win consists of all finite sequences of throws ending with 1, 4, 3 without 2, 1, 4 occurring somewhere as three subsequent throws. Moreover, these outcomes are certainly not equally probable. By developing the theory of probabilities we obtain a framework for solving problems like this and doing many other even more subtle computations. And if we cannot compute the solution we might be able to obtain an answer to our questions using computer simulations. Admittedly problems do not solve themselves just by developing some probability theory, but with a sufficiently well developed theory it becomes easier to solve problems of interest, and, what should not be underestimated, it becomes possible to clearly state which problems we are solving and on which assumptions the solution rests.
Figure 2: Playing the dice game 5000 times, this graph shows how the games are distributed according to the number of times we had to throw the dice before one of the sequences 1,4,3 or 2,1,4 occurred.
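Graphs like Figures 1 and 2 can be produced by simulating the game. A minimal Python sketch (seed and the 5000 repetitions are arbitrary choices mirroring the figures) could look as follows; it estimates both your winning probability and how boring the game is.

```python
import random

def play(rng, seq_a=(1, 4, 3), seq_b=(2, 1, 4)):
    # throw a fair die until one of the two sequences comes out
    last, n = (), 0
    while True:
        last = (last + (rng.randint(1, 6),))[-3:]
        n += 1
        if last == seq_a:
            return "you", n
        if last == seq_b:
            return "me", n

rng = random.Random(0)
results = [play(rng) for _ in range(5000)]
wins = sum(w == "you" for w, _ in results) / len(results)
boring = sum(n > 20 for _, n in results) / len(results)
print(wins)     # estimated probability that 1,4,3 comes out first
print(boring)   # estimated probability of needing more than 20 throws
```

Replacing 20 by 50 in the last computation answers the other question posed above; the exact probabilities require the theory developed in these notes.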
Enough about dice! After all, this is going to be about probability theory and statistics in bioinformatics. But some of the questions that we encounter in bioinformatics, especially in biological sequence analysis, are nevertheless very similar to those we asked above. If we don't know any better, and we typically don't, the majority of sequenced DNA found in the publicly available databases is considered as being random. This typically means that we regard the DNA sequences as the outcome of throwing a four-sided die, with sides A, C, G, and T, a tremendously large number of times. One purpose of regarding DNA sequences as being random is to have a background model, which can be used for evaluating the performance of methods for locating interesting and biologically important sequence patterns in DNA sequences. If we have developed a method for detecting novel protein coding genes, say, in DNA sequences we need to have such a background model for the DNA that is not protein coding. Otherwise we cannot tell how likely it is that our method finds false protein coding genes, i.e. that the method claims that a segment of DNA is protein coding even though it is not. Thus we need to compute the probability that our method claims that a random DNA sequence – with random having the meaning above – is a protein coding gene. If this is unlikely we believe that our method indeed finds truly protein coding genes. A simple gene finder can be constructed as follows: After the start codon, ATG, a number of nucleotides occur before one of the stop codons, TAA, TAG, TGA, is reached for the first time. Our protein coding gene finder then claims that if more than 100 nucleotides occur before any of the stop codons is reached then we have a gene. So what is the chance of getting more than 100 nucleotides before reaching a stop codon for the first time? The similarity to determining how boring our little dice game is should be clear.
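Under the uniform background model this question has both an easy simulation and a closed-form answer, sketched in Python below. The sketch assumes codons are read in frame after ATG, each of the 64 codons equally likely and 3 of them (TAA, TAG, TGA) being stops; since lengths are multiples of 3, more than 100 nucleotides means at least 34 consecutive non-stop codons.

```python
import random

def orf_nt_length(rng):
    # nucleotides after ATG until the first in-frame stop codon;
    # each codon is one of 64 equally likely, 3 of which are stops
    n = 0
    while rng.randrange(64) >= 3:   # codon is not a stop
        n += 3
    return n

rng = random.Random(0)
N = 100_000
sim = sum(orf_nt_length(rng) > 100 for _ in range(N)) / N
exact = (61 / 64) ** 34   # 34 non-stop codons give length >= 102 > 100
print(sim, exact)         # both roughly 0.195
```

So under this model roughly one in five open reading frames started at a random ATG would exceed 100 nucleotides, which already hints that the simple gene finder needs a more careful threshold analysis.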
The sequence of nucleotides occurring between a start and a stop codon is called an open reading frame, and what we are interested in is thus how the lengths of open reading frames are distributed in random DNA sequences. Probability theory is not restricted to the analysis of the performance of methods on random sequences, but also provides the key ingredient in the construction of such methods – for instance more advanced gene finders. As mentioned above, if we don't know any better we often regard DNA as being random, but we actually know that a protein coding gene is certainly not random – at least not in the sense above. There is some sort of regularity, for instance the codon structure, but still a substantial variation. A sophisticated probabilistic model of a protein coding DNA sequence may be constructed to capture both the variation and the regularities in protein coding DNA genes, and if we can then compute the probability of a given DNA sequence being a protein coding gene we can compare it with the probability that it is "just random". The construction of probabilistic models brings us to the topic of statistics. Where probability theory is concerned with computing probabilities of events under a given probabilistic model, statistics deals to a large extent with the opposite: given that we have observed certain events as outcomes of an experiment, how do we find a suitable probabilistic model of the mechanism that generated those outcomes? How we can use probability theory to make this transition from data to model, and in particular how we can analyse the methods developed and the results obtained using probability theory, is the major topic of these notes.

Contents
1 Probability Theory
  1.1 Introduction
  1.2 Sample spaces
  1.3 Probability measures
  1.4 Probability measures on discrete sets
  1.5 Probability measures on the real line
  1.6 Random variables
    1.6.1 Transformations of random variables
    1.6.2 Several random variables – multivariate distributions
  1.7 Conditional distributions and independence
    1.7.1 Conditional probabilities and independence
    1.7.2 Random variables, conditional distributions and independence
  1.8 Simulations
    1.8.1 The binomial, multinomial, and geometric distributions
  1.9 Entropy
  1.10 Sums, Maxima and Minima
  1.11 Local alignment – a case study

A R
  A.1 Obtaining and running R
  A.2 Manuals, FAQs and online help
  A.3 The R language, functions and scripts
    A.3.1 Functions, expression evaluation, and objects
    A.3.2 Writing functions and scripts
  A.4 Graphics
    A.4.1 Example
  A.5 Packages
    A.5.1 Bioconductor
  A.6 Literature

B Mathematics
  B.1 Sets
  B.2 Limits and infinite sums
  B.3 Integration
    B.3.1 Gamma and beta integrals

1 Probability Theory
1.1 Introduction
Probability theory provides the foundation for doing statistics. It is the mathematical framework for discussing experiments with an outcome that is uncertain. The purpose of probability theory is to capture the mathematical essence of a quantification of uncertainty, which is done abstractly by specifying what properties such a quantification should have. Subsequently, based on the abstract definition, we derive properties of probabilities and give a number of examples. This approach is axiomatic and mathematical, and the mathematical treatment is self-contained and independent of any interpretation of probabilities we might have. The interpretation is, however, what gives probability theory its special flavour and makes it applicable. We give a mathematical presentation of probability theory to develop a proper language, and to get accustomed to the vocabulary used in probability theory and statistics. However, we cannot and will not try to derive and give mathematical proofs of everything we need along the way.
1.2 Sample spaces
We will throughout use E to denote a set, called the sample space, such that elements x ∈ E represent the outcomes of an experiment we want to conduct. We use small letters like x, y, z to denote elements in E. An event, A ⊆ E, is a subset of E, and we will use capital letters like A, B, C to denote events. We use the word experiment in a wide sense. We may have a real wet lab experiment in mind or another classical empirical data collection process in mind. But we may also have a database search or some other algorithmic treatment of existing data in mind – or even an experiment carried out entirely on a computer (a computer simulation). The notion of a general sample space may seem abstract and difficult. In practice the choice of sample space often comes quite naturally. If we throw a die the sample space
is the set {1, 2, 3, 4, 5, 6}. When modelling random nucleic acids (DNA) the sample space is {A, C, G, T}. These are both examples of finite sample spaces. A very common, infinite sample space is the set of real numbers, R, which we use as sample space in most cases where we measure a single, quantitative variable. That could be temperature, pH-value, pressure, concentration of a chemical compound, weight or height of an individual, and many, many other things. A more complicated sample space is found if we for instance consider mass spectrometry. In mass spectrometry one measures the amount of ions in a sample as a function of the ratio between the mass and the charge (the m/z-ratio) of the ions, which produces a mass spectrum. The sample space is a set of functions – the set of potential mass spectra. We will typically regard the mass spectrum as a continuous function of the m/z-ratio, and the sample space is thus the set of continuous functions on an appropriate interval of m/z-values.
Figure 1.1: A typical raw mass spectrum. The sample is regarded as a (peaky and oscillating) continuous intensity of molecules as a function of the ratio between the mass and charge of the molecule (the m/z-ratio).
Microarray experiments constitute another class of experiments where the outcome takes its value in a complicated sample space. Generally speaking, a microarray is a plate divided into a number of probes, or spots, that are arranged in a rectangular grid. When the microarray is exposed to a sample of fluorescent RNA-molecules (in fact, cDNA reverse transcribed from the RNA) specific molecules prefer to bind to specific probes, and by subsequently scanning the array one obtains a light intensity as a function of the position on the plate. The light intensity in the vicinity of a given probe is a measure of the amount of RNA that binds to the probe and thus a measure of the (relative) amount of that particular RNA-molecule in the sample. The sample space is a set of two-dimensional functions – the set of potential light intensities as a function of position on the plate. In most cases the actual outcome of the experiment is not stored, but only some discretised version or representation of the outcome. This is mostly due to technical reasons, but also due to (sometimes unnecessary) preprocessing of the samples by computer programs. For microarrays, the light intensities are first of all stored as an image of a certain resolution, but this representation is typically reduced even further to a single quantity for each probe on the array.
Figure 1.2: The (log-)intensities from an Affymetrix microarray experiment. The sample can be regarded – and visualised – as a 2-dimensional gray scale image. Light values correspond to high intensities and dark values to low intensities.
1.3 Probability measures
If we are about to conduct an uncertain experiment with outcome in E we use probabilities to describe the result of the experiment prior to actually performing the experiment. Since the outcome of the experiment is uncertain we cannot pinpoint any particular element x ∈ E and say that x will be the outcome. Rather, we assign to any event A ⊆ E a measure of how likely it is that the event will occur, that is, how likely it is that the outcome of the experiment will be an element x belonging to A.

Definition 1.3.1. A probability measure P assigns to all events A ⊆ E a value P(A) ∈ [0, 1]. P must fulfil that:
(i) P (E) = 1,
(ii) if A_1, ..., A_n are disjoint events
\[ P(A_1 \cup \dots \cup A_n) = P(A_1) + \dots + P(A_n). \]

Property (ii) above is known as additivity of a probability measure. It is crucial that the events are disjoint for additivity to hold. We collect a few useful and direct consequences of the definition. For any event A, we can write E = A ∪ A^c with A and A^c disjoint, hence by additivity
\[ 1 = P(E) = P(A) + P(A^c). \]
From this we obtain
\[ P(A) = 1 - P(A^c). \tag{1.1} \]
In particular, for the empty event ∅ we have
\[ P(\emptyset) = 1 - P(E) = 0. \]
If A, B ⊆ E are two events such that A ⊆ B we have that B = A ∪ (B \ A) where A and B \ A are disjoint. By additivity again we obtain that
\[ P(B) = P(A) + P(B \setminus A), \]
hence since P(B \ A) ≥ 0 it follows that if A ⊆ B then
\[ P(A) \leq P(B). \tag{1.2} \]
Finally, let A, B ⊆ E be any two events – not necessarily disjoint – then with C = A ∩ B we have that A = (A \ C) ∪ C with A \ C and C disjoint, thus by additivity
\[ P(A) = P(A \setminus C) + P(C). \]
Moreover, A ∪ B = (A \ C) ∪ B with the two sets on the right hand side being disjoint, thus by additivity again
\[ P(A \cup B) = P(A \setminus C) + P(B) = P(A) - P(C) + P(B) = P(A) + P(B) - P(A \cap B). \tag{1.3} \]

Intuitively speaking, the result states that the probability of the union A ∪ B is the sum of the probabilities for A and B, but when the sets are not disjoint we have "counted" the probability of the intersection A ∩ B twice. Thus we have to subtract it.
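On a finite sample space these identities can be checked mechanically. A small Python sketch using exact rational arithmetic; the two-dice events A and B are arbitrary illustrations, not taken from the text.

```python
from fractions import Fraction
from itertools import product

# uniform probability measure on two throws of a die
E = set(product(range(1, 7), repeat=2))
def P(A):
    return Fraction(len(A & E), len(E))

A = {x for x in E if x[0] == 6}          # first throw shows a six
B = {x for x in E if x[0] + x[1] == 7}   # the two throws sum to seven
# the inclusion-exclusion relation (1.3)
assert P(A | B) == P(A) + P(B) - P(A & B)
print(P(A | B))   # 11/36
```

Here A and B are not disjoint (they share the outcome (6, 1)), so naive addition of P(A) and P(B) would overcount exactly P(A ∩ B) = 1/36.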
Note that the abstract definition of a probability measure doesn’t say anything about how to compute the probability of an event in a concrete case. But we are on the other hand assured that if we have any probability measure, it assigns a probability to any event and the relations derived above hold. If we for instance manage to come up with a probability measure on the sample space of mass spectra as produced by mass spectrometry, we may have no idea about how to compute the probability of the event that a mass spectrum has between 3 and 6 peaks in the m/z-range 1000-4000. However, you can rest assured that the event can be assigned a probability. The assignment of a probability P (A) to any event A is a quantification of the uncertainty involved in our experiment. The closer to one P (A) is the more certain it is that the event A occurs and the closer to zero the more uncertain. Some people, especially those involved in gambling, find it useful to express uncertainty in terms of odds. Given a probability measure P we define for any event A the odds that A occurs as
\[ \xi(A) = \frac{P(A)}{1 - P(A)} = \frac{P(A)}{P(A^c)}. \tag{1.4} \]

Thus we assign to the event A ⊆ E a value ξ(A) ∈ [0, ∞], and like the probability measure this provides a quantification of the uncertainty. The larger ξ(A) is, the more certain it is that A occurs. A certain event (when P(A) = 1) is assigned odds ∞. It follows that we can get the probabilities back from the odds by the formula
\[ P(A) = \frac{1}{1 + \xi(A^c)}. \tag{1.5} \]
Odds are used in betting situations because the odds tell how fair¹ bets should be constructed. If two persons, player one and player two, say, make a bet about whether event A or event A^c occurs, how much should the loser pay the winner for this to be a fair bet? If player one believes that A occurs and is willing to bet 1 kroner, then for the bet to be fair player two must bet ξ(A^c) kroner on event A^c. For gambling, this is the way British bookmakers report odds – they say that odds for event A are ξ(A^c) to 1 against. With 1 kroner at stake and winning, you are paid ξ(A^c) back in addition to the 1 kroner. Continental European bookmakers report the odds as ξ(A^c) + 1, which includes what you staked. The frequency interpretation states that the probability of an event A should equal the long run frequency of the occurrence of event A if we repeat the experiment indefinitely. That is, suppose that we perform n identical experiments all governed by the probability measure P and with outcome in the set E, and suppose that we observe x_1, ..., x_n; then we can compute the frequency of occurrences of event A
\[ \varepsilon_n(A) = \frac{1}{n} \sum_{i=1}^{n} 1_A(x_i). \tag{1.6} \]

¹Fair means that on average the players should both win (and lose) 0 kroner, cf. the frequency interpretation.
Here 1_A is the indicator function for the event A, so that 1_A(x_i) equals 1 if x_i ∈ A and 0 otherwise. We sometimes also write 1(x_i ∈ A) instead of 1_A(x_i). We see that ε_n(A) is the fraction of experiments in which the event A occurs. As n grows large, the frequency interpretation says that ε_n(A) must be approximately equal to P(A). Note that this is not a mathematical result! It is the interpretation of what we want the probabilities to mean. To underpin the interpretation we can show that the mathematical theory based on probability measures really is suitable for approximating frequencies from experiments, but that is another story. The frequency interpretation provides the rationale for using odds in the construction of fair bets. If the two players repeat the same bet n times with n suitably large, and if the bet is fair according to the definition above with player one betting 1 kroner on event A each time, then in the i'th bet, player one won
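The frequency interpretation is easy to illustrate by simulation. The Python sketch below computes the relative frequency (1.6) for throws of a fair die; the event A and the seed are arbitrary choices.

```python
import random

def eps_n(A, n, rng):
    # relative frequency (1.6) of the event A in n simulated die throws
    return sum(rng.randint(1, 6) in A for _ in range(n)) / n

A = {1, 2}   # the event "the throw shows 1 or 2", with P(A) = 1/3
rng = random.Random(0)
for n in (100, 10_000, 1_000_000):
    print(n, eps_n(A, n, rng))   # approaches 1/3 as n grows
```

For small n the frequency fluctuates noticeably around 1/3; for n = 1,000,000 it typically agrees with P(A) to three decimals.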
\[ \xi(A^c) 1_A(x_i) - 1_{A^c}(x_i), \]
because this equals ξ(A^c) if event A came out and −1 if A^c came out. Considering all n bets, player one will on average have won
\[ \frac{1}{n} \sum_{i=1}^{n} \left[ \xi(A^c) 1_A(x_i) - 1_{A^c}(x_i) \right] = \xi(A^c)\varepsilon_n(A) - \varepsilon_n(A^c) \simeq \xi(A^c)P(A) - P(A^c) = \frac{P(A^c)}{P(A)} P(A) - P(A^c) = 0. \]

Likewise player two will on average have won approximately 0 kroner if n is sufficiently large. This is why we say that the bet is fair. Note, however, that the total sum that player one has won/lost is
\[ \sum_{i=1}^{n} \left[ \xi(A^c) 1_A(x_i) - 1_{A^c}(x_i) \right], \]
which may deviate substantially from 0 – and the deviation will become larger as n increases.

The Bayesian interpretation allows us to choose the probability measure describing the experiment subjectively. In this interpretation the probability measure is not given objectively by the experiment but reflects our minds and our knowledge of the experiment before conducting it. A theoretical justification of the Bayesian interpretation is that we can make everything into a betting situation. Thus we ask ourselves which odds (for all events) we believe to be fair in a betting situation before conducting the experiment. Note that this is an entirely subjective process – there is no theory dictating what fairness means – but we are nevertheless likely to have an opinion about what is fair and what is not. It is possible to show that if we make up our minds about fair odds in a consistent manner² we necessarily end up with a probability measure defined by (1.5). The probabilities don't represent the long term frequencies when repeating the experiment. After having conducted
the experiment once we have gained new information, which we might want to take into account when deciding what we then believe to be fair odds.

²Not being consistent means that if someone knows our choice of fair odds he can construct a bet in such a way that he will win money with certainty.

The two interpretations are fundamentally different in nature and have given rise to a number of discussions and diverging opinions in the statistical literature. The statistical methodologies developed based on either interpretation also differ a lot – at least in principle. On the other hand there are many practical similarities, and most Bayesian methods have a frequentistic interpretation and vice versa. Discussions about which of the interpretations, if any, is the correct one are of a philosophical and meta-mathematical nature – we cannot prove that any interpretation is correct, though pros and cons for the two interpretations are often based on mathematical results. We will not pursue this discussion further, but it must be made clear that these notes are based exclusively on the frequency interpretation of probabilities.

Example 1.3.2 (Coin Flipping). When we flip a coin it lands with either heads or tails upwards. The sample space for describing such an experiment is
E = {heads, tails}
with the four events: ∅, {heads}, {tails}, E. Any probability measure P on E must fulfil P(∅) = 0 and P(E) = 1 by definition, together with
P(tails) = 1 − P(heads),
hence the probability measure is completely determined by p = P(heads). Note here the convention that when we consider an event consisting of only a single outcome like {heads} we usually drop the curly brackets. This extremely simple experiment is not so interesting in itself, but it serves the purpose of being a fundamental building block for many more interesting sample spaces and probability measures.
Maybe it is not coin flipping directly that we use as a building block but some other binary experiment with two possible outcomes. There are several often encountered ways of coding or naming the outcome of a binary experiment. The most commonly used sample space for encoding the outcome of a binary experiment is E = {0, 1}. When using this naming convention we talk about a Bernoulli experiment. In the coin flipping context we can let 1 denote heads and 0 tails; then if x_1, ..., x_n denote the outcomes of n flips of the coin we see that x_1 + ... + x_n is the total number of heads. Moreover, we see that
\[ P(x) = p^x (1-p)^{1-x} \tag{1.7} \]
because either x = 1 (heads), in which case p^x = p and (1−p)^{1−x} = 1, or x = 0 (tails), in which case p^x = 1 and (1−p)^{1−x} = 1 − p.
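A quick Python sketch checking (1.7) and counting heads in repeated Bernoulli experiments; the value of p and the seed are arbitrary choices.

```python
import random

p = 0.3   # hypothetical probability of heads
def point_prob(x, p):
    # the point probabilities (1.7) of a Bernoulli experiment
    return p**x * (1 - p)**(1 - x)

assert point_prob(1, p) == p and point_prob(0, p) == 1 - p

# the number of heads in n flips is the sum x_1 + ... + x_n
rng = random.Random(0)
n = 100_000
flips = [rng.random() < p for _ in range(n)]
print(sum(flips) / n)   # close to p, cf. the frequency interpretation
```

The distribution of the sum itself is the binomial distribution, taken up in Section 1.8.1.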
1.4 Probability measures on discrete sets
If E is an infinite set we may be able to arrange the elements in E in a sequential way;
E = {x_1, x_2, x_3, ...}.
That is, we may be able to give each element in E a positive integer index such that each i ∈ N corresponds to exactly one element x_i ∈ E. If we can do so, we say that E is countably infinite. One example of a countably infinite set is the positive integers themselves, N = {1, 2, 3, ...}. Also the integers Z = {..., −2, −1, 0, 1, 2, ...} form a countably infinite set, and perhaps more surprisingly the rational numbers Q form a countably infinite set. The set of real numbers is, however, not countable. This was proved by the famous diagonal argument of Georg Cantor. It is particularly simple to define probability measures on finite and countably infinite sets, and this is the subject of the present section. Defining probability measures on the real line is a more subtle problem, which we defer to the following section.

Definition 1.4.1. We call a set E discrete if it is either finite or countably infinite.
The DNA, RNA and amino acid alphabets are three examples of finite and hence discrete sets. Example 1.4.2. Considering any finite set E we can define the set
E* = {x_1 x_2 ... x_n | n ∈ N, x_i ∈ E},
which is the set of all sequences of finite length from E. We claim that E* is discrete. If E is the DNA-alphabet, say, it is clear that there is an infinite number of DNA-sequences of finite length in E*, but it is no problem to list them as a sequence in the following way:
A, C, G, T, AA, AC, AG, AT, CA,..., TG, TT, AAA, AAC,...
Hence we first list all sequences of length one, then those of length two, those of length three, four, five and so on and so forth. We can encounter the use of E∗ as sample space if we want a probabilistic model of all protein coding DNA-sequences.
To define a probability measure on E in principle requires that we assign a number, P(A), to every event A ⊆ E. For a finite set of size n there are 2^n different events, so even for sets of quite moderate size, n = 20, say, to list 2^20 = 1048576 probabilities is not a very practical way of defining a probability measure – not to mention the problems we would get checking that the additivity property is fulfilled. Fortunately there is another and somewhat more tractable way of defining a probability measure on a discrete set.
Definition 1.4.3. If E is discrete we call (p(x))_{x∈E} a vector of point probabilities indexed by E if 0 ≤ p(x) ≤ 1 and
\[ \sum_{x \in E} p(x) = 1. \tag{1.8} \]
We define a probability measure P on E with point probabilities (p(x))_{x∈E} by
\[ P(A) = \sum_{x \in A} p(x) \tag{1.9} \]
for all events A ⊆ E.
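Definition 1.4.3 translates directly into code. A Python sketch with a hypothetical loaded die as illustration – the point probabilities below are invented for the example and checked against condition (1.8):

```python
from fractions import Fraction

# point probabilities for a (hypothetical) loaded die
p = {1: Fraction(1, 4), 2: Fraction(1, 4), 3: Fraction(1, 8),
     4: Fraction(1, 8), 5: Fraction(1, 8), 6: Fraction(1, 8)}
assert sum(p.values()) == 1          # condition (1.8)

def P(A):
    # the probability measure defined by (1.9)
    return sum(p[x] for x in A)

print(P({1, 2}))   # 1/2
print(P({5, 6}))   # 1/4
```

Additivity for disjoint events holds automatically for a measure defined this way, since the sums over disjoint sets simply concatenate.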
R Box 1.4.4 (Vectors). A fundamental data structure in R is a vector of e.g. integers or reals. A vector of real-valued numbers can be typed in by a command like
> x <- c(1.3029,4.2,5.3,3.453,342.34)
This results in a vector x of length five. Even a single number is always regarded as a vector of length one in R. The c above should therefore be understood as a concatenation of five vectors of length one. If we define another vector of length two by
> y <- c(1.3,2.3)
we can concatenate x and y to give the vector
> z <- c(x,y)
Then z is a vector of length seven containing the numbers 1.3029, 4.2, 5.3, 3.453, 342.34, 1.3, 2.3. A probability vector in R is simply a vector of positive real numbers that sum to one.
Example 1.4.5 (Random Amino Acids). As mentioned in the prologue a sequence of nucleic acids is often regarded as being random if we don’t know any better. Likewise, a sequence of amino acids (a protein) is regarded as being random if we don’t know any better. In these notes a protein is simply a finite sequence from the sample space
E = {A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V}.
To specify a probability measure on E we have to specify point probabilities p(x) for x ∈ E according to Definition 1.4.3. A valid choice is
p(x) = 0.05 for all x ∈ E. Clearly p(x) ∈ [0, 1] and since there are 20 amino acids
\[ \sum_{x \in E} p(x) = 20 \times 0.05 = 1. \]
Under this probability measure all amino acids are equally likely, and it is known as the uniform distribution on E, cf. the example below. A more reasonable probability measure on E is given by the frequencies of the occurrence of the amino acids in real proteins, cf. the frequency interpretation of probabilities. The Robinson-Robinson probabilities come from a survey of a large number of proteins. They read:
Amino acid   Probability
A            0.079
R            0.051
N            0.045
D            0.054
C            0.019
E            0.063
Q            0.043
G            0.074
H            0.022
I            0.051
L            0.091
K            0.057
M            0.022
F            0.039
P            0.052
S            0.071
T            0.058
W            0.013
Y            0.032
V            0.064
Using the Robinson-Robinson probabilities, some amino acids are much more probable than others. Leucine (L) for instance is the most probable with probability 0.091 whereas Tryptophan (W) is the least probable with probability 0.013.
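Point probabilities like these are also what we need to simulate "random" proteins. A Python sketch sampling a sequence of amino acids under the Robinson-Robinson probabilities (the seed and length 50 are arbitrary choices):

```python
import random

# Robinson-Robinson point probabilities from the table above
rr = {"A": .079, "R": .051, "N": .045, "D": .054, "C": .019,
      "E": .063, "Q": .043, "G": .074, "H": .022, "I": .051,
      "L": .091, "K": .057, "M": .022, "F": .039, "P": .052,
      "S": .071, "T": .058, "W": .013, "Y": .032, "V": .064}
assert abs(sum(rr.values()) - 1) < 1e-9   # condition (1.8)

rng = random.Random(0)
protein = "".join(rng.choices(list(rr), weights=list(rr.values()), k=50))
print(protein)   # one simulated protein of length 50
```

In a long simulated sequence the relative frequency of each amino acid approaches its point probability, cf. the frequency interpretation.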
Example 1.4.7 (Uniform distribution). If E is a finite set of size n we can define a probability measure on E by the point probabilities
\[ p(x) = \frac{1}{n} \]
for all x ∈ E. Clearly p(x) ∈ [0, 1] and \(\sum_{x \in E} p(x) = 1\). This distribution is called the uniform distribution on E.

If P is the uniform distribution on E and A ⊆ E is any event it follows by the definition of P that
\[ P(A) = \sum_{x \in A} \frac{1}{n} = \frac{|A|}{n} \]
with |A| denoting the number of elements in A. Since the elements in E are all possible but we only regard those in A as favourable, this result gives rise to the formula
\[ P(A) = \frac{\text{number of favourable outcomes}}{\text{number of possible outcomes}}, \]
which is valid only when P is the uniform distribution. Even though the formula looks innocent, it can be quite involved to apply it in practice. It may be easy to specify the sample space E and the favourable outcomes in the event A, but counting the elements in A can be difficult. Even counting the elements in E can sometimes be difficult too.
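When E is small enough to enumerate, the counting formula can be applied directly in code. Revisiting the single-throw yatzy probability from the prologue, a Python sketch:

```python
from itertools import product
from fractions import Fraction

# all 6^5 equally probable outcomes of throwing five dice
outcomes = list(product(range(1, 7), repeat=5))
favourable = [x for x in outcomes if len(set(x)) == 1]   # yatzy: all five equal
print(Fraction(len(favourable), len(outcomes)))          # 1/1296 = 0.0007716...
```

Here both |A| = 6 and |E| = 6^5 = 7776 are trivial to count by brute force; the difficulty the example warns about appears for events whose combinatorial structure is less transparent.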
Math Box 1.4.6 (General Probability Measures). In Definition 1.3.1 we only required that the probability measure P should be additive in the sense that for a finite sequence A_1, ..., A_n of disjoint sets we have that
\[ P(A_1 \cup \dots \cup A_n) = P(A_1) + \dots + P(A_n). \]
Probability measures are usually required to be σ-additive, meaning that for any infinite sequence A_1, A_2, A_3, ... of disjoint sets it holds that
\[ P(A_1 \cup A_2 \cup A_3 \cup \dots) = P(A_1) + P(A_2) + P(A_3) + \dots \tag{1.10} \]
It is a perfectly natural requirement and, as it stands, it may seem a quite innocent extension. If P for instance is a probability measure on a countably infinite set E given by point probabilities (p(x))_{x∈E}, it may be observed that (1.10) is fulfilled automatically by the definition of P from point probabilities. Requiring that P is σ-additive is, however, a more serious business when dealing with probability measures on non-discrete sample spaces. It is, moreover, problematic to assign a probability to every subset of E. In general one needs to restrict attention to a collection ℰ of subsets, which is required to contain E and to fulfil:

• if A ∈ ℰ then A^c ∈ ℰ
• if A_1, A_2, ... ∈ ℰ then A_1 ∪ A_2 ∪ ... ∈ ℰ

A collection ℰ of subsets fulfilling those requirements is called a σ-algebra, and the sets in ℰ are called measurable. If one doesn't make this restriction a large number of "natural" probability measures don't exist. The most commonly occurring σ-algebras contain so many sets that it requires sophisticated mathematical arguments to show that there indeed exist sets not in the σ-algebra. From a practical, statistical point of view it is unlikely that we ever encounter non-measurable sets which we really want to assign a probability. However, in some areas of mathematical statistics and probability theory, measurability problems are present all the time – especially problems with verifying that sets of interest are actually measurable.
Example 1.4.8 (The Poisson Distribution). The (infinite) Taylor expansion of the exponential function is given as
\[ \exp(\lambda) = \sum_{n=0}^{\infty} \frac{\lambda^n}{n!}, \quad \lambda \in \mathbb{R}. \]
If $\lambda > 0$ the numbers
\[ p(n) = \exp(-\lambda) \frac{\lambda^n}{n!} \]
are positive and
\[ \sum_{n=0}^{\infty} \exp(-\lambda) \frac{\lambda^n}{n!} = \exp(-\lambda) \sum_{n=0}^{\infty} \frac{\lambda^n}{n!} = \exp(-\lambda)\exp(\lambda) = 1. \]
Hence $(p(n))_{n \in \mathbb{N}_0}$ can be regarded as point probabilities for a probability measure on the non-negative integers $\mathbb{N}_0$. This distribution is called the Poisson distribution. A typical application of the Poisson distribution is as a model for the number of times a rather unlikely event occurs in a large number of replications of an experiment. In bioinformatics one can for instance use the Poisson distribution as a model of the number of (non-overlapping) occurrences of a given sequence pattern in a long DNA-sequence.
1.5 Probability measures on the real line
Defining a probability measure on the real line $\mathbb{R}$ poses, to an even larger extent than in the previous section, the problem: how are we going to represent the assignment of a probability to every event in a manageable way? One way of doing so is through distribution functions.
Definition 1.5.1. For a probability measure $P$ on $\mathbb{R}$ we define the corresponding distribution function $F : \mathbb{R} \to [0, 1]$ by
\[ F(x) = P((-\infty, x]). \]
That is, $F(x)$ is the probability that under $P$ the outcome is less than or equal to $x$.
We immediately observe that since $(-\infty, y] \cup (y, x] = (-\infty, x]$ for $y < x$, additivity gives $F(y) + P((y, x]) = F(x)$, hence
\[ P((y, x]) = F(x) - F(y). \]

Theorem 1.5.2. A distribution function $F$ has the following three properties:

(i) $F$ is increasing: if $x_1 \leq x_2$ then $F(x_1) \leq F(x_2)$.
(ii) $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$.
(iii) $F$ is right continuous at any $x \in \mathbb{R}$: $\lim_{\varepsilon \to 0, \varepsilon > 0} F(x + \varepsilon) = F(x)$.

Proof: This is not going to be a completely rigorous proof. The first property follows from (1.2) since if $x_1 \leq x_2$ then $(-\infty, x_1] \subseteq (-\infty, x_2]$, and therefore
\[ F(x_1) = P((-\infty, x_1]) \leq P((-\infty, x_2]) = F(x_2). \]
The two other properties of $F$ are consequences of what is known as continuity of probability measures. Intuitively, as $x$ tends to $-\infty$ the set $(-\infty, x]$ shrinks towards the empty set $\emptyset$. This implies that
\[ \lim_{x \to -\infty} F(x) = P(\emptyset) = 0. \]
Similarly, when $x \to \infty$ the set $(-\infty, x]$ grows to the whole of $\mathbb{R}$ and
\[ \lim_{x \to \infty} F(x) = P(\mathbb{R}) = 1. \]
Finally, by similar arguments, when $\varepsilon > 0$ tends to 0 the set $(-\infty, x + \varepsilon]$ shrinks towards $(-\infty, x]$, hence
\[ \lim_{\varepsilon \to 0, \varepsilon > 0} F(x + \varepsilon) = P((-\infty, x]) = F(x). \]

It is of course useful from time to time to know that a distribution function satisfies properties (i), (ii), and (iii) in Theorem 1.5.2, but that these three properties completely characterise the probability measure is more surprising.

Theorem 1.5.3. If $F : \mathbb{R} \to [0, 1]$ is a function that satisfies properties (i), (ii), and (iii) in Theorem 1.5.2, there is precisely one probability measure $P$ on $\mathbb{R}$ such that $F$ is the distribution function for $P$.

This theorem not only tells us that the distribution function completely characterises $P$ but also that we can specify a probability measure just by specifying its distribution function. This is a useful result but also a result of considerable depth, and a proof of the theorem is beyond the reach of these notes.

Example 1.5.4 (Logistic distribution). The logistic distribution has distribution function
\[ F(x) = \frac{1}{1 + \exp(-x)}. \]
The function is continuous, and the reader is encouraged to check that the properties of the exponential function ensure that properties (i) and (ii) for a distribution function also hold for this function.

Example 1.5.5 (The Gumbel distribution). The distribution function defined by
\[ F(x) = \exp(-\exp(-x)) \]
defines a probability measure on $\mathbb{R}$, which is known as the Gumbel distribution. We leave it for the reader to check that the function indeed fulfils properties (i), (ii) and (iii). The Gumbel distribution plays a role in the significance evaluation of local alignment scores, see Section 1.11.

Figure 1.3: The logistic distribution function (left, see Example 1.5.4). The Gumbel distribution function (right, see Example 1.5.5). Note the characteristic S-shape of both distribution functions.

If our sample space $E$ is discrete but actually a subset of the real line, $E \subseteq \mathbb{R}$, like $\mathbb{N}$ or $\mathbb{Z}$, we have two different ways of defining and characterising probability measures on $E$: through point probabilities or through a distribution function. The connection is given by
\[ F(x) = P((-\infty, x]) = \sum_{y \leq x} p(y). \]

A number of distributions are defined in terms of a density. In contrast to distribution functions, not all probability measures have a density, e.g. those distributions that are given by point probabilities on $\mathbb{N}$. However, for probability measures that really live on $\mathbb{R}$, densities play to a large extent the same role as point probabilities do for probability measures on a discrete set.

Definition 1.5.6. A probability measure $P$ is said to have density $f : \mathbb{R} \to [0, \infty)$ if
\[ P(A) = \int_A f(y)\,dy \]
for all events $A \subseteq \mathbb{R}$. In particular, if $A = [a, b]$, for $a < b$,
\[ P(A) = \int_a^b f(y)\,dy. \]
The distribution function for such a probability measure is given by
\[ F(x) = \int_{-\infty}^x f(y)\,dy. \]

The reader may be unfamiliar with doing integrations over an arbitrary event $A$.
If $f$ is a continuous function and $A = [a, b]$ is an interval, it should be well known that the integral
\[ \int_a^b f(y)\,dy = \int_{[a,b]} f(y)\,dy \]
is the area under the graph of $f$ from $a$ to $b$. It is possible for more complicated sets $A$ to assign a kind of generalised area to the set under the graph of $f$ over $A$. We will not go into any further details.

An important observation is that we can specify a distribution function $F$ by
\[ F(x) = \int_{-\infty}^x f(y)\,dy \tag{1.11} \]
if just $f : \mathbb{R} \to [0, \infty)$ is a positive function that fulfils that
\[ \int_{-\infty}^{\infty} f(y)\,dy = 1. \tag{1.12} \]
Indeed, if the total area from $-\infty$ to $\infty$ under the graph of $f$ equals 1, the area under $f$ from $-\infty$ to $x$ is smaller (but always positive since $f$ is positive) and therefore
\[ F(x) = \int_{-\infty}^x f(y)\,dy \in [0, 1]. \]
When $x \to -\infty$ the area shrinks to 0, hence $\lim_{x \to -\infty} F(x) = 0$, and when $x \to \infty$ the area increases to the total area under $f$, which we assumed to equal 1 by (1.12). Finally, a function given by (1.11) will always be continuous, from which the right continuity at any $x$ follows.

That a probability measure $P$ on $\mathbb{R}$ is given by the density $f$ means that the probability of a small interval around $x$ is proportional to the length of the interval with proportionality constant $f(x)$. Thus if $\varepsilon > 0$ is small, so small that $f$ can be regarded as almost constantly equal to $f(x)$ on the interval $[x - \varepsilon, x + \varepsilon]$, then
\[ P([x - \varepsilon, x + \varepsilon]) = \int_{x - \varepsilon}^{x + \varepsilon} f(y)\,dy \simeq 2\varepsilon f(x), \]
where $2\varepsilon$ is the length of the interval $[x - \varepsilon, x + \varepsilon]$.

Example 1.5.7 (The Normal Distribution). The normal or Gaussian distribution on $\mathbb{R}$ is the probability measure with density
\[ f(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right). \]
It is not trivial to check that $\int_{-\infty}^{\infty} f(x)\,dx = 1$, but this is indeed the case. Nor is it possible to analytically compute the distribution function
\[ F(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x \exp\left(-\frac{y^2}{2}\right) dy. \]

Figure 1.4: The density (left) and the distribution function (right) for the normal distribution (Example 1.5.7).
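Although $F$ has no closed form in elementary functions, it is easy to approximate numerically. The following Python sketch (with a hypothetical integration cutoff and step count of our choosing) compares a crude midpoint Riemann sum with the standard special-function representation $F(x) = (1 + \operatorname{erf}(x/\sqrt{2}))/2$:

```python
import math

def normal_density(x):
    """Density of the standard normal distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def normal_cdf_numeric(x, lower=-10.0, steps=100_000):
    """Midpoint-rule approximation of F(x); truncating the integral at -10
    is harmless because the density is vanishingly small below that point."""
    h = (x - lower) / steps
    return h * sum(normal_density(lower + (i + 0.5) * h) for i in range(steps))

for x in (-1.0, 0.0, 1.96):
    via_erf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    print(x, normal_cdf_numeric(x), via_erf)
```

The two columns agree to many digits, which is exactly why tables (or library routines such as R's pnorm) are used in place of an analytic formula.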
The normal distribution is the single most important distribution in statistics. There are several reasons for this. One reason is that a rich and detailed theory about the normal distribution and a large number of statistical models based on the normal distribution can be developed. Another reason is that the normal distribution actually turns out to be a reasonable approximation of many other distributions of interest – that being a practical observation as well as a theoretical result (the Central Limit Theorem). We refrain from a detailed treatment of the statistical theory based on the normal distribution, and refer to standard introductory books on statistics.

Example 1.5.8 (The Exponential Distribution). Fix $\lambda > 0$ and define
\[ f(y) = \lambda \exp(-\lambda y), \quad y \geq 0. \]
Let $f(y) = 0$ for $y < 0$. Clearly, $f(y)$ is positive, and we find that
\[ \int_{-\infty}^{\infty} f(y)\,dy = \int_0^{\infty} \lambda \exp(-\lambda y)\,dy = \left[-\exp(-\lambda y)\right]_0^{\infty} = 1. \]
For the last equality we use the convention $\exp(-\infty) = 0$ together with the fact that $\exp(0) = 1$. We also find the distribution function
\[ F(x) = \int_{-\infty}^x f(y)\,dy = \int_0^x \lambda \exp(-\lambda y)\,dy = \left[-\exp(-\lambda y)\right]_0^x = 1 - \exp(-\lambda x) \]
for $x \geq 0$ (and $F(x) = 0$ for $x < 0$).

Figure 1.5: The density (left) and the distribution function (right) for the exponential distribution with intensity parameter $\lambda = 1$ (Example 1.5.8).

It is quite common, as for the exponential distribution above, that we only want to specify a probability measure living on an interval $I \subseteq \mathbb{R}$. By "living on" we mean that $P(I) = 1$. If the interval is of the form $[a, b]$, say, we will usually only specify the density $f(x)$ (or alternatively the distribution function $F(x)$) for $x \in [a, b]$, with the understanding that $f(x) = 0$ for $x \notin [a, b]$ (for the distribution function, $F(x) = 0$ for $x < a$ and $F(x) = 1$ for $x > b$).

Example 1.5.9 (The Uniform Distribution). Let $[a, b] \subseteq \mathbb{R}$ be an interval and define the function $f : \mathbb{R} \to [0, \infty)$ by
\[ f(x) = \frac{1}{b - a} 1_{[a,b]}(x). \]
That is, $f$ is constantly equal to $1/(b - a)$ on $[a, b]$ and 0 outside. Then we find that
\[ \int_{-\infty}^{\infty} f(y)\,dy = \int_a^b \frac{1}{b - a}\,dy = \frac{1}{b - a} \times (b - a) = 1. \]
Since $f$ is clearly positive, it is a density for a probability measure on $\mathbb{R}$. This probability measure is called the uniform distribution on the interval $[a, b]$. The distribution function can be computed (for $a \leq x \leq b$) as
\[ F(x) = \int_{-\infty}^x f(y)\,dy = \int_a^x \frac{1}{b - a}\,dy = \frac{x - a}{b - a}. \]
In addition, $F(x) = 0$ for $x \leq a$ and $F(x) = 1$ for $x \geq b$.

Figure 1.6: The density (left) and the distribution function (right) for the uniform distribution on the interval $[0, 1]$ (Example 1.5.9).

R Box 1.5.10. Distribution functions and densities for a number of standard probability measures on $\mathbb{R}$ are directly available within R. The convention is that if a distribution is given the R-name name then pname(x) gives the distribution function evaluated at x and dname(x) gives the density evaluated at x. The normal distribution has the R-name norm, so pnorm(x) and dnorm(x) give the distribution function and density respectively for the normal distribution. Likewise the R-name for the exponential distribution is exp, so pexp(x) and dexp(x) give the distribution function and density respectively for the exponential distribution. For the exponential distribution, pexp(x,3) gives the distribution function at x with intensity parameter $\lambda = 3$.

Example 1.5.11 (The B-distribution). The density for the B-distribution (pronounced $\beta$-distribution) with parameters $\lambda_1, \lambda_2 > 0$ is given by
\[ f(x) = \frac{1}{B(\lambda_1, \lambda_2)} x^{\lambda_1 - 1} (1 - x)^{\lambda_2 - 1} \]
for $x \in [0, 1]$. Here $B(\lambda_1, \lambda_2)$ is the B-function, cf. Appendix B. This two-parameter class of distributions on the unit interval $[0, 1]$ is quite flexible. For $\lambda_1 = \lambda_2 = 1$ one gets the uniform distribution on $[0, 1]$, but for other parameters we can get a diverse set of shapes for the density – see Figure 1.7 for two particular examples. Since the B-distribution always lives on $[0, 1]$, it is frequently encountered as a model of a random probability – or rather a random frequency. In population genetics, for instance, the B-distribution is found as a model for the frequency of occurrences of one out of two alleles in a population. The shape of the distribution, i.e. the proper values of $\lambda_1$ and $\lambda_2$, then depends upon issues such as the mutation rate and the migration rates.

Figure 1.7: The density of the B-distribution (Example 1.5.11) with parameters $\lambda_1 = 4$ together with $\lambda_2 = 2$ (left) and $\lambda_1 = 0.5$ together with $\lambda_2 = 3$ (right).

From a basic calculus course the intimate relation between integration and differentiation should be well known.

Theorem 1.5.12. If $F$ is a differentiable distribution function, the derivative $f(x) = F'(x)$ is a density for the distribution given by $F$. That is,
\[ F(x) = \int_{-\infty}^x F'(y)\,dy. \]

Example 1.5.13 (Logistic distribution). The density for the logistic distribution is found to be
\[ f(x) = F'(x) = \frac{\exp(-x)}{(1 + \exp(-x))^2}. \]

Example 1.5.14 (Gumbel distribution). The density for the Gumbel distribution is found to be
\[ f(x) = F'(x) = \exp(-x) \exp(-\exp(-x)). \]

Figure 1.8: The density for the logistic distribution (left, see Example 1.5.13) and the density for the Gumbel distribution (right, see Example 1.5.14). The density for the Gumbel distribution is clearly skewed, whereas the density for the logistic distribution is symmetric and quite similar to the density for the normal distribution.

1.6 Random variables

It will be a considerable advantage to be able to talk about the unrealised outcome of an experiment as a variable whose value has not yet been disclosed to us.
Before the experiment is conducted the outcome is not known, and it is described by a probability distribution, but when the experiment is over a particular value in the sample space will be the result. The notion of a random variable, sometimes called a stochastic variable, allows us to talk about the outcome, which will take some particular value when the experiment is over, before actually conducting the experiment. We use capital letters like $X$, $Y$ and $Z$ to denote random variables.

In these notes a random variable is not a precisely defined mathematical object but rather a useful notational convention. If an experiment has sample space $E$ and the governing probability measure is $P$, we will say that the outcome $X$ is a random variable that takes values in $E$ and has distribution $P$. For an event $A \subseteq E$ we will use the notational convention
\[ \mathbb{P}(X \in A) = P(A). \]
It is important to understand that there is always a probability measure associated to a random variable – this measure being the distribution of the random variable. Sometimes we do not tell or know exactly what the distribution is, but rather have several potential probability measures in mind as candidates for the distribution of the random variable. We have more to say about this in the following chapters on statistics, which deal with figuring out, based on realisations of the experiment, what the distribution of the random variable(s) was. Note that in many cases (but not in these notes) a random variable is silently assumed to take values in $\mathbb{R}$. We do not make this restriction, and if $X$ is taking values in $\mathbb{R}$ we will usually explicitly write that $X$ is a real valued random variable.

Example 1.6.1. A binary experiment with sample space $E = \{0, 1\}$ is called a Bernoulli experiment. A random variable $X$ representing the outcome of such a binary experiment is called a Bernoulli variable. The probability $p = \mathbb{P}(X = 1)$ is often called the success probability.
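A Bernoulli experiment is simple to simulate. The sketch below (in Python, with an arbitrarily chosen success probability and a fixed seed for reproducibility; the notes' own computations use R) shows the relative frequency of 1's settling near $p$ over many replications:

```python
import random

def bernoulli(p, rng):
    """One Bernoulli experiment: 1 with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0

rng = random.Random(0)  # fixed seed so the run is reproducible
p = 0.3                 # arbitrary success probability for the illustration

outcomes = [bernoulli(p, rng) for _ in range(100_000)]

# The relative frequency of 1's should be close to the success probability p.
print(sum(outcomes) / len(outcomes))
```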
1.6.1 Transformations of random variables

A transformation of a random variable is given by a map from the sample space into another sample space. This produces another random variable, whose distribution we are interested in. Transformations penetrate probability theory and statistics to an extreme degree. It is crucial to understand how random variables and distributions on one sample space allow us to produce a range of transformed random variables and distributions. If $E$ and $E'$ are two sample spaces, a transformation is a map
\[ h : E \to E' \]
that assigns the transformed outcome $h(x)$ in $E'$ to the outcome $x$ in $E$. If $X$ is a random variable, the $h$-transformed random variable of $X$, denoted by $h(X)$, is the random variable whose value is $h(x)$ if $X = x$. We use the notation
\[ h^{-1}(A) = \{x \in E \mid h(x) \in A\} \]
for $A \subseteq E'$ to denote the event of outcomes in $E$ for which the transformed outcome ends up in $A$.

Definition 1.6.2. If $P$ is a probability measure on $E$, the transformed probability measure, $h(P)$, on $E'$ is given by
\[ h(P)(A) = P(h^{-1}(A)) = P(\{x \in E \mid h(x) \in A\}) \]
for any event $A \subseteq E'$. If $X$ is a random variable with distribution $P$, the distribution of $h(X)$ is $h(P)$.

We observe from the definition that for $A \subseteq E'$
\[ \mathbb{P}(h(X) \in A) = h(P)(A) = P(h^{-1}(A)) = \mathbb{P}(X \in h^{-1}(A)). \]
This notation, $\mathbb{P}(h(X) \in A) = \mathbb{P}(X \in h^{-1}(A))$, is quite suggestive – to find the distribution of $h(X)$ we "move" $h$ from the variable to the set by taking the "inverse". Indeed, if $h$ has an inverse, i.e. there is a function $h^{-1} : E' \to E$ such that
\[ h(x) = y \Leftrightarrow x = h^{-1}(y) \]
for all $x \in E$ and $y \in E'$, then
\[ h(X) = y \Leftrightarrow X = h^{-1}(y). \]

Example 1.6.3 (Indicator random variables). Let $X$ be a random variable taking values in the sample space $E$ and $A \subseteq E$ any event in $E$. Define the transformation
\[ h : E \to \{0, 1\} \]
by
\[ h(x) = 1_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \in A^c \end{cases}. \]
Thus $h$ is the indicator function for the set $A$. The corresponding transformed random variable $Y = h(X) = 1_A(X)$ is called an indicator random variable.
We sometimes also write
\[ Y = 1(X \in A) \]
to show that $Y$ indicates whether $X$ takes its value in $A$ or not. Since $Y$ takes values in $\{0, 1\}$, it is a Bernoulli variable with success probability
\[ p = \mathbb{P}(Y = 1) = \mathbb{P}(X \in A). \]

Example 1.6.4 (Sign and symmetry). Let $X$ be a real valued random variable whose distribution is given by the distribution function $F$, and consider $h(x) = -x$. Then the distribution function for $Y = h(X) = -X$ is
\[ G(x) = \mathbb{P}(Y \leq x) = \mathbb{P}(X \geq -x) = 1 - \mathbb{P}(X < -x) = 1 - F(-x) + \mathbb{P}(X = -x). \]
We say that (the distribution of) $X$ is symmetric if $G(x) = F(x)$ for all $x \in \mathbb{R}$. That is, $X$ is symmetric if $X$ and $-X$ have the same distribution.

Example 1.6.5 (Location and Scale). Let $X$ denote a real valued random variable with distribution given by the distribution function $F : \mathbb{R} \to [0, 1]$. Consider the transformation $h : \mathbb{R} \to \mathbb{R}$ given by
\[ h(x) = \sigma x + \mu \]
for some constants $\mu \in \mathbb{R}$ and $\sigma > 0$. Then
\[ Y = h(X) = \sigma X + \mu \]
has distribution function
\[ G(x) = \mathbb{P}(h(X) \leq x) = \mathbb{P}\left(X \leq \frac{x - \mu}{\sigma}\right) = F\left(\frac{x - \mu}{\sigma}\right). \]
If $F$ is differentiable, we know that the distribution of $X$ has density $f(x) = F'(x)$. We observe that $G$ is also differentiable, and applying the chain rule for differentiation we find that the distribution of $Y$ has density
\[ g(x) = G'(x) = \frac{1}{\sigma} f\left(\frac{x - \mu}{\sigma}\right). \]
The random variable $Y$ is a translated and scaled version of $X$. The parameter $\mu$ is known as the location parameter and $\sigma$ as the scale parameter. We observe that from a single distribution given by the distribution function $F$ we obtain a two-parameter family of distributions by translation and scaling. It follows that, for instance, the normal distribution with location parameter $\mu$ and scale parameter $\sigma$ has density
\[ \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right). \]
The abbreviation
\[ X \sim N(\mu, \sigma^2) \]
is often used to denote a random variable $X$, which is normally distributed with location parameter $\mu$ and scale parameter $\sigma$.

R Box 1.6.6.
For some of the standard distributions on the real line $\mathbb{R}$, one can easily supply additional parameters specifying the location and scale. For the normal distribution, pnorm(x,1,2) equals the distribution function at x with location parameter $\mu = 1$ and scale parameter $\sigma = 2$. Similarly, plogis(x,1,2) gives the distribution function for the logistic distribution at x with location parameter $\mu = 1$ and scale parameter $\sigma = 2$.

1.6.2 Several random variables – multivariate distributions

If $X_1, \ldots, X_n$ are $n$ random variables taking values in the sample spaces $E_1, \ldots, E_n$, we can bundle the variables into a single random variable. The bundled variable $X = (X_1, \ldots, X_n)$ takes values in the product space
\[ E_1 \times \ldots \times E_n = \{(x_1, \ldots, x_n) \mid x_1 \in E_1, \ldots, x_n \in E_n\}. \]
The product space is the set of $n$-tuples with the $i$'th coordinate belonging to $E_i$ for $i = 1, \ldots, n$. Each of the variables can represent the outcome of an experiment, and we want to consider all the experiments simultaneously. To do so, we need to define the distribution of the bundled variable $X$ that takes values in the product space. Thus we need to define a probability measure on the product space. In this case we talk about the simultaneous distribution of the random variables $X_1, \ldots, X_n$, and we often call the distribution on the product space a multivariate distribution or a multivariate probability measure.

We do not get the simultaneous distribution automatically from the distribution of each of the random variables – something more is needed. We need to capture how the variables interact, and for this we need the simultaneous distribution. If the simultaneous distribution is $P$ and $A$ is an event having the product form
\[ A = A_1 \times \ldots \times A_n = \{(x_1, \ldots, x_n) \mid x_1 \in A_1, \ldots, x_n \in A_n\}, \]
we use the notation
\[ P(A) = \mathbb{P}(X \in A) = \mathbb{P}((X_1, \ldots, X_n) \in A_1 \times \ldots \times A_n) = \mathbb{P}(X_1 \in A_1, \ldots, X_n \in A_n). \]
The right hand side is particularly convenient, for if some set $A_i$ equals the entire sample space $E_i$, it is simply left out.

Example 1.6.7.
Let $X_1$, $X_2$, and $X_3$ be three real valued random variables, that is, $n = 3$ and $E_1 = E_2 = E_3 = \mathbb{R}$; then the bundled variable $X = (X_1, X_2, X_3)$ is a three dimensional vector taking values in $\mathbb{R}^3$. If $A_1 = A_2 = [0, \infty)$ and $A_3 = \mathbb{R}$ then
\[ \mathbb{P}((X_1, X_2, X_3) \in [0, \infty) \times [0, \infty) \times \mathbb{R}) = \mathbb{P}(X_1 \in [0, \infty), X_2 \in [0, \infty)). \]
Moreover, in this case we rewrite the last expression as
\[ \mathbb{P}(X_1 \in [0, \infty), X_2 \in [0, \infty)) = \mathbb{P}(X_1 \geq 0, X_2 \geq 0). \]
If $A_1 = [a, b]$, $A_2 = \mathbb{R}$ and $A_3 = [c, d]$ for $a, b, c, d \in \mathbb{R}$ then
\[ \mathbb{P}((X_1, X_2, X_3) \in [a, b] \times \mathbb{R} \times [c, d]) = \mathbb{P}(X_1 \in [a, b], X_3 \in [c, d]) = \mathbb{P}(a \leq X_1 \leq b, c \leq X_3 \leq d). \]

Definition 1.6.8 (Marginal distribution). If the bundled variable $X = (X_1, \ldots, X_n)$ has distribution $P$, the marginal distribution, $P_i$, of $X_i$ is given by
\[ P_i(A) = \mathbb{P}(X_i \in A) = P(E_1 \times \ldots \times E_{i-1} \times A \times E_{i+1} \times \ldots \times E_n) \]
for $A \subseteq E_i$.

If the sample spaces that enter in a bundling are discrete, so is the product space – the sample space of the bundled variable. The distribution can therefore in principle be defined by point probabilities.

Example 1.6.9. If two DNA-sequences that encode a protein (two genes) are evolutionary related, then typically there is a pairing of each nucleotide from one sequence with an identical nucleotide from the other, with a few exceptions due to mutational events (an alignment). We imagine in this example that the only mutational event occurring is substitution of nucleic acids. That is, one nucleic acid at a given position can mutate into another nucleic acid. The two sequences can therefore be aligned in a letter by letter fashion without gaps, and we are going to consider just a single aligned position in the two DNA-sequences. We want a probabilistic model of the pair of letters occurring at that particular position. The sample space is going to be the product space
\[ E \times E = \{A, C, G, T\} \times \{A, C, G, T\}, \]
and we let $X$ and $Y$ denote the random variables representing the two aligned nucleic acids.
To define the simultaneous distribution of $X$ and $Y$, we have to define point probabilities $p(x, y)$ for $(x, y) \in E \times E$. It is convenient to organise the point probabilities in a matrix (or array) instead of as a vector. Consider for instance the following matrix:

      A     C     G     T
A   0.12  0.03  0.04  0.01
C   0.02  0.27  0.02  0.06
G   0.02  0.01  0.17  0.02
T   0.05  0.03  0.01  0.12

As we can see, the probabilities occurring in the diagonal are (relatively) large and those outside the diagonal are small. If we let $A = \{(x, y) \in E \times E \mid x = y\}$ denote the event that the two nucleic acids are identical, then
\[ \mathbb{P}(X = Y) = P(A) = \sum_{(x,y) \in A} p(x, y) = 0.12 + 0.27 + 0.17 + 0.12 = 0.68. \]
This means that the probability of obtaining a pair of nucleic acids with a mutation is
\[ \mathbb{P}(X \neq Y) = P(A^c) = 1 - P(A) = 0.32. \]

Compared to discrete sample spaces, the situation seems more complicated if we bundle real valued variables. If we bundle $n$ real valued random variables $X_1, \ldots, X_n$, the bundled variable takes values in $\mathbb{R}^n$ – the set of $n$-dimensional real vectors. To define and handle distributions on $\mathbb{R}^n$ easily becomes quite technical. There exists a multivariate analogue of the distribution function, but it is a clumsy object to work with. The best situation arises when the simultaneous distribution is given by a density.

Definition 1.6.10. If $X_1, \ldots, X_n$ are $n$ real valued random variables, we say that the distribution of $X = (X_1, \ldots, X_n)$ has density
\[ f : \mathbb{R}^n \to [0, \infty) \]
if
\[ \mathbb{P}(X_1 \in A_1, \ldots, X_n \in A_n) = \int_{A_1} \cdots \int_{A_n} f(x_1, \ldots, x_n)\,dx_n \ldots dx_1. \]
Note that from the fact that $\mathbb{P}(X \in \mathbb{R}^n) = 1$, a density $f$ must fulfil that
\[ \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, \ldots, x_n)\,dx_n \ldots dx_1 = 1. \]
One of the deep results in probability theory states that if a function $f$, defined on $\mathbb{R}^n$ and with values in $[0, \infty)$, integrates to 1 as above, then it defines a probability measure on $\mathbb{R}^n$.
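The point probabilities of Example 1.6.9 are just sixteen numbers, so the computations above (and the row sums, which give the marginal distribution of Definition 1.6.8) can be verified directly. A Python sketch:

```python
# Point probabilities p(x, y) from Example 1.6.9; rows index X, columns index Y.
p = {
    "A": {"A": 0.12, "C": 0.03, "G": 0.04, "T": 0.01},
    "C": {"A": 0.02, "C": 0.27, "G": 0.02, "T": 0.06},
    "G": {"A": 0.02, "C": 0.01, "G": 0.17, "T": 0.02},
    "T": {"A": 0.05, "C": 0.03, "G": 0.01, "T": 0.12},
}
nucleotides = "ACGT"

# The point probabilities must sum to 1 over the product space E x E.
total = sum(p[x][y] for x in nucleotides for y in nucleotides)

# P(X = Y): sum over the diagonal event A = {(x, y) | x = y}.
p_match = sum(p[x][x] for x in nucleotides)

# Marginal distribution of X (Definition 1.6.8): sum each row over all y.
marginal_X = {x: sum(p[x][y] for y in nucleotides) for x in nucleotides}

print(total, p_match, 1 - p_match)  # ≈ 1.0, 0.68 and 0.32
print(marginal_X)
```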
If you feel uncomfortable with the integration, remember that if $A_i = [a_i, b_i]$ with $a_i, b_i \in \mathbb{R}$ for $i = 1, \ldots, n$, then
\[ \mathbb{P}(a_1 \leq X_1 \leq b_1, \ldots, a_n \leq X_n \leq b_n) = \int_{a_1}^{b_1} \cdots \int_{a_n}^{b_n} f(x_1, \ldots, x_n)\,dx_n \ldots dx_1 \]
can be computed as $n$ successive ordinary integrals. The integrations can be carried out in any order, and this fact can be used to show that:

Theorem 1.6.11. If $f : \mathbb{R}^n \to [0, \infty)$ is the density for the simultaneous distribution of $X_1, \ldots, X_n$, then the marginal distribution of $X_i$ has density
\[ f_i(x_i) = \underbrace{\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty}}_{n-1} f(x_1, \ldots, x_n)\,dx_n \ldots dx_{i+1}\,dx_{i-1} \ldots dx_1. \]

Example 1.6.12 (Bivariate normal distribution). Consider the function
\[ f(x, y) = \frac{\sqrt{1 - \rho^2}}{2\pi} \exp\left(-\frac{x^2 - 2\rho x y + y^2}{2}\right) \]
for $\rho \in (-1, 1)$. The numerator in the exponent in the exponential function can be rewritten as
\[ x^2 - 2\rho x y + y^2 = (x - \rho y)^2 + (1 - \rho^2) y^2, \]

Figure 1.9: Two examples of the density for the bivariate normal distribution as considered in Example 1.6.12 with $\rho = 0$ (left) and $\rho = 0.75$ (right).

hence
\begin{align*}
\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp\left(-\frac{x^2 - 2\rho x y + y^2}{2}\right) dy\,dx
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp\left(-\frac{(x - \rho y)^2}{2} - \frac{(1 - \rho^2) y^2}{2}\right) dy\,dx \\
&= \int_{-\infty}^{\infty} \left( \int_{-\infty}^{\infty} \exp\left(-\frac{(x - \rho y)^2}{2}\right) dx \right) \exp\left(-\frac{(1 - \rho^2) y^2}{2}\right) dy.
\end{align*}
The inner integral can be computed for fixed $y$ using substitution and knowledge about the one-dimensional normal distribution. With $z = x - \rho y$ we have $dz = dx$, hence
\[ \int_{-\infty}^{\infty} \exp\left(-\frac{(x - \rho y)^2}{2}\right) dx = \int_{-\infty}^{\infty} \exp\left(-\frac{z^2}{2}\right) dz = \sqrt{2\pi}, \]
where the last equality follows from the fact that the density for the normal distribution on $\mathbb{R}$ is
\[ \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right), \]
which integrates to 1.
We see that the inner integral does not depend upon $y$ and is constantly equal to $\sqrt{2\pi}$; thus another substitution, $z = y\sqrt{1 - \rho^2}$ (using that $\rho \in (-1, 1)$), such that $dz = \sqrt{1 - \rho^2}\,dy$, gives
\begin{align*}
\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp\left(-\frac{x^2 - 2\rho x y + y^2}{2}\right) dy\,dx
&= \sqrt{2\pi} \int_{-\infty}^{\infty} \exp\left(-\frac{(1 - \rho^2) y^2}{2}\right) dy \\
&= \frac{\sqrt{2\pi}}{\sqrt{1 - \rho^2}} \int_{-\infty}^{\infty} \exp\left(-\frac{z^2}{2}\right) dz = \frac{2\pi}{\sqrt{1 - \rho^2}},
\end{align*}
where we once more use that the last integral equals $\sqrt{2\pi}$. This shows that the (positive) function $f$ integrates to 1, and it is therefore a density for a probability measure on $\mathbb{R}^2$. This probability measure is called a bivariate normal or Gaussian distribution. The example given here, with only one parameter, $\rho$, does not cover all bivariate normal distributions. It is also possible to define $n$-dimensional normal distributions in a similar way.

Figure 1.10: Two examples of the density for the bivariate Dirichlet distribution as considered in Example 1.6.13 with $\lambda_1 = \lambda_2 = \lambda = 4$ (left) and $\lambda_1 = \lambda_2 = 2$ together with $\lambda = 6$ (right).

Example 1.6.13 (Bivariate Dirichlet distribution). Let $\lambda_1, \lambda_2, \lambda > 0$ be given. Define for $(x, y) \in \mathbb{R}^2$ with $x, y > 0$ and $x + y < 1$ the function
\[ f(x, y) = \frac{\Gamma(\lambda_1 + \lambda_2 + \lambda)}{\Gamma(\lambda_1)\Gamma(\lambda_2)\Gamma(\lambda)} x^{\lambda_1 - 1} y^{\lambda_2 - 1} (1 - x - y)^{\lambda - 1}. \]
The function $f$ is a priori defined on the open 2-dimensional unit simplex
\[ U_2 = \{(x, y) \in \mathbb{R}^2 \mid x, y > 0, x + y < 1\}. \]
By defining $f(x, y) = 0$ outside $U_2$, we can show that $f$ integrates to 1. We only need to integrate over the set $U_2$, as $f$ is defined to be 0 outside, and we find that
\begin{align*}
\int_0^1 \int_0^{1-x} x^{\lambda_1 - 1} y^{\lambda_2 - 1} (1 - x - y)^{\lambda - 1}\,dy\,dx
&= \int_0^1 \left( \int_0^{1-x} y^{\lambda_2 - 1} (1 - x - y)^{\lambda - 1}\,dy \right) x^{\lambda_1 - 1}\,dx \\
&= \int_0^1 \left( \int_0^1 (1 - x)^{\lambda_2 - 1} z^{\lambda_2 - 1} (1 - x)^{\lambda - 1} (1 - z)^{\lambda - 1} (1 - x)\,dz \right) x^{\lambda_1 - 1}\,dx \\
&= \int_0^1 z^{\lambda_2 - 1} (1 - z)^{\lambda - 1}\,dz \int_0^1 x^{\lambda_1 - 1} (1 - x)^{\lambda_2 + \lambda - 1}\,dx,
\end{align*}
where we have used the substitution $(1 - x)z = y$ in the first inner integral, for which $(1 - x)\,dz = dy$.
Both of the last two integrals can be recognised as B-integrals, cf. (B.5), and according to (B.4) we find that
\[ \int_0^1 z^{\lambda_2 - 1} (1 - z)^{\lambda - 1}\,dz \int_0^1 x^{\lambda_1 - 1} (1 - x)^{\lambda_2 + \lambda - 1}\,dx = \frac{\Gamma(\lambda_2)\Gamma(\lambda)}{\Gamma(\lambda_2 + \lambda)} \cdot \frac{\Gamma(\lambda_1)\Gamma(\lambda_2 + \lambda)}{\Gamma(\lambda_1 + \lambda_2 + \lambda)} = \frac{\Gamma(\lambda_1)\Gamma(\lambda_2)\Gamma(\lambda)}{\Gamma(\lambda_1 + \lambda_2 + \lambda)}. \]
We see that the integral of the positive function $f$ is 1, and $f$ is the density for the bivariate Dirichlet distribution on $\mathbb{R}^2$. Since $f$ is 0 outside the 2-dimensional unit simplex $U_2$, the distribution is really a distribution living on the unit simplex.

1.7 Conditional distributions and independence

1.7.1 Conditional probabilities and independence

If we know that the event $A$ has occurred, but don't have additional information about the outcome of our experiment, we want to assign a conditional probability to all other events $B \subseteq E$ – conditioning on the event $A$. For a given event $A$ we aim at defining a conditional probability measure $P(\cdot \mid A)$ such that $P(B \mid A)$ is the conditional probability of $B$ given $A$ for any event $B \subseteq E$.

Definition 1.7.1. The conditional probability measure $P(\cdot \mid A)$ for an event $A \subseteq E$ with $P(A) > 0$ is defined by
\[ P(B \mid A) = \frac{P(B \cap A)}{P(A)} \tag{1.13} \]
for any event $B \subseteq E$.

Math Box 1.6.14 (Multivariate Dirichlet distribution). There is an $n$-dimensional Dirichlet distribution living on the $n$-dimensional unit simplex
\[ U_n = \{(x_1, \ldots, x_n) \mid x_1, \ldots, x_n > 0, x_1 + \ldots + x_n < 1\}, \]
with density
\[ f(x_1, \ldots, x_n) = \frac{\Gamma(\lambda_1 + \ldots + \lambda_n + \lambda)}{\Gamma(\lambda_1) \cdots \Gamma(\lambda_n)\Gamma(\lambda)} x_1^{\lambda_1 - 1} x_2^{\lambda_2 - 1} \cdots x_n^{\lambda_n - 1} (1 - x_1 - \ldots - x_n)^{\lambda - 1} \]
for $(x_1, \ldots, x_n) \in U_n$ and with the parameters $\lambda_1, \ldots, \lambda_n, \lambda > 0$. We refrain from showing that the density $f$ really integrates to 1 over the unit simplex $U_n$. We note that if the simultaneous distribution of $X_1, \ldots, X_n$ is an $n$-dimensional Dirichlet distribution, then
\[ (X_1, \ldots, X_n, 1 - X_1 - \ldots - X_n) \]
is an $n+1$-dimensional probability vector. Therefore the Dirichlet distribution is often encountered as a model of random probability vectors (e.g. frequencies).
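Definition 1.7.1 is easy to exercise on the uniform distribution from Example 1.4.7. The following Python sketch uses a fair die and two events of our own choosing (they are illustrative, not taken from the notes):

```python
from fractions import Fraction

# Uniform distribution on the sample space of a fair die: P(C) = |C| / 6.
E = set(range(1, 7))

def P(C):
    return Fraction(len(C & E), len(E))

A = {2, 4, 6}  # hypothetical event: the roll is even
B = {4, 5, 6}  # hypothetical event: the roll is larger than 3

# Definition 1.7.1: P(B | A) = P(B ∩ A) / P(A), requiring P(A) > 0.
p_B_given_A = P(B & A) / P(A)
print(P(A), P(B), p_B_given_A)  # 1/2 1/2 2/3
```

Knowing that the roll is even raises the probability that it exceeds 3 from $1/2$ to $2/3$, since two of the three even outcomes lie in $B$.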
What we claim here is that $P(\cdot \mid A)$ really is a probability measure, but we need to show that. By the definition above we have that
\[ P(E \mid A) = \frac{P(E \cap A)}{P(A)} = \frac{P(A)}{P(A)} = 1, \]
and if $B_1, \ldots, B_n$ are disjoint events,
\[ P(B_1 \cup \ldots \cup B_n \mid A) = \frac{P((B_1 \cup \ldots \cup B_n) \cap A)}{P(A)} \]