
Introduction to Probability and Statistics for Linguistics

Marcus Kracht Department of Linguistics, UCLA 3125 Campbell Hall 405 Hilgard Avenue Los Angeles, CA 90095–1543 [email protected]

Contents

1 Preliminaries and Introduction

I Basic Probability Theory
2 Counting and Numbers
3 Some Background in Calculus
4 Probability Spaces
5 Conditional Probability
6 Random Variables
7 Expected Word Length
8 The …
9 Limit …

II Elements of Statistics
10 Estimators
11 Tests
12 Distributions
13 Estimation
14 Correlation and …
15 Linear Regression I: The Simple Case
16 Linear Regression II
  16.1 General F test
  16.2 Lack of Fit
17 Linear Regression III: Choosing the Estimators
18 Markov Chains

III Probabilistic Linguistics
19 Probabilistic Regular Languages and Hidden Markov Models

1 Preliminaries and Introduction

Some Useful Hints. This text provides a syllabus for the course. It is hopefully needless to state that the manuscript is not claimed to be in its final form; I am constantly revising the text and it will grow as material gets added. Older material is subject to change without warning. One principle that I have adopted is that everything is explained and proved, unless the proof is too tedious or uses higher mathematics. This means that there will be a lot of material that is quite difficult for someone interested only in practical applications. These passages are marked by ~ in the margin, so that you know where it is safe to skip. If you notice any inconsistencies or encounter difficulties in understanding the explanations, please let me know so I can improve the manuscript.

Statistics and probability theory are all about things that are not really certain. In everyday life this is the rule rather than the exception. Probability theory is the attempt to extract knowledge about what event has happened or will happen in the presence of this uncertainty. It tries to quantify as best as possible the risks and benefits involved. Apart from the earliest applications of probability in gambling, numerous others exist: in science, where we make experiments and interpret them, in finance, in insurance, and in weather reports. These are important areas where probabilities play a pivotal role. The present lectures will also give evidence for the fact that probability theory can be useful for linguistics, too. In everyday life we are frequently reminded of the fact that events that are predicted need not happen, even though we typically do not calculate probabilities. But in science this is absolutely necessary in order to obtain reliable results. Quantitative statements of this sort can sometimes be seen, for example in weather reports, where the experts speak of the "probability of rain" and give percentages rather than saying that rain is likely or unlikely, as one would ordinarily do.

Some people believe that statistics requires a new kind of mathematics. But this is not so. Ordinary mathematics is quite enough; in fact it has often been developed for the purpose of applying it to probability. However, as we shall see, probability is actually a difficult topic. Most of the naive intuitions we have on the subject matter are either (mathematically speaking) trivial or false, so we often have to resort to computations of some sort. Moreover, to apply the theory in a correct fashion, often two things are required: extensive motivation and a lot of calculations. I give an example. To say that an event happens with probability 1/6 means that it happens in 1 out of 6 cases. So if we throw a die six times we expect a given number, say 5, to appear once, and only once. This means that in a row of six, every one of the numbers occurs exactly once. But as we all know, this need not happen at all! This does not mean that the probabilities are wrong. In fact probability theory shows us that any six term sequence of numbers between 1 and 6 may occur, and any such sequence is equally likely. However, one can calculate that for 5 to occur not at all is less likely than for it to occur once, and to occur a number n > 1 of times also is less likely. Thus, it is to be expected that the number of occurrences is 1; some of the events are simply more likely than others. But if that is so, throwing the die 60 times will not guarantee either that 5 occurs exactly 10 times. Again it may occur less often or more.
How come then that we can at all be sure that the probabilities we have assigned to the outcomes are correct? The answer lies in the so called law of large numbers. It says that as we repeat the experiment more and more often, the chance of the frequency of the number 5 deviating from its assigned probability gets smaller and smaller; in the limit it is zero. Thus, the probabilities are assumed exactly in the limit. Of course, since we cannot actually perform the experiment an infinite number of times, there is no way we shall actually find out whether a given die is unbiased, but at least we know that we can remove doubts to any desirable degree of certainty. This is why statisticians express themselves in such a funny way, saying that something is certain (!) to occur with such and such probability or is likely to be the case with such and such degree of confidence. Finite experiments require this type of caution.

At this point it is actually useful to say something about the difference between probability theory and statistics. First, both of them are founded on the same model of reality. This means that they do not contradict each other; they just exploit that model for different purposes. The model is this: there is a certain set of events that occur more or less freely. These can be events that happen without us doing anything, like "the sun is shining" or "there is a squirrel in the trashcan". Or they can be brought about by us, like "the coin shows tails" after we tossed it into the air. And, finally, it can be the result of a measurement, like "the voice onset time is 64 ms". The model consists in a set of such events plus a so-called probability. We may picture this as an oracle that answers our questions with "yes" or "no" each time we ask it. The questions we ask are predetermined, and the probabilities are the likelihood that is associated with a "yes" answer. This is a number between 0 and 1 which tells us how frequent that event is. Data is obtained by making an experiment. An experiment is in this scenario a question put to the oracle. An array of experiments yields data. Probability theory tells us how likely a particular datum or set of data is. In real life we do not have the probabilities, we have the data. And so we want tools that allow us to estimate the probabilities given the data that we have. This is what statistics is about. The difference is therefore merely what is known and what is not. In the case of an unbiased die we already have the probabilities; and so we can make predictions about a particular experiment or series thereof. In science it is the data we have, and we want to know about the probabilities. If we study a particular construction, say tag questions, we want to know what the probability is that a speaker will use a tag question (as opposed to some other type of construction). Typically, the kind of result we want to find is even more complex. If, for example, we study the voice onset time of a particular sound, then we are interested to find a number or a range thereof. Statistics will help in the latter case, too, and we shall see how. Thus, statistics is the art of guessing the model and its parameters. It is based on probability theory. Probability theory shows us why the particular formula by means of which we guess the model is good. For example, throw a die 100 times and notice how many times it shows 5. Let that number be 17. Then statistics tells you that you should guess the probability of 5 at 17/100 = 0.17.
Probability theory tells you that although that might not be right, it is your best bet. What it will in fact prove is that if you assign any other probability to the 5, then your experiment becomes less likely. This argument can be turned around: probability theory tells you that the most likely probability assignment is 0.17. All this is wrapped up in the formula that the probability equals the frequency. And this is what you get told in statistics.
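This guessing game is easy to simulate. For readers who already have R at hand (the program is discussed starting in Section 2), here is a minimal sketch using the built-in random sampler sample; the counts vary from run to run:

> throws <- sample (1:6, 100, replace = TRUE)     # 100 throws of a fair die
> sum (throws == 5)                               # how often 5 came up; varies per run
> mean (throws == 5)                              # relative frequency: the guess for p(5)
> mean (sample (1:6, 1e6, replace = TRUE) == 5)   # many throws: close to 1/6 = 0.1667

The last line illustrates the law of large numbers from above: the more throws, the closer the relative frequency settles near 1/6.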

Literature. The mathematical background is covered in [5] and in [2]. Both texts are mathematically demanding. As for R, there is a nice textbook by Peter Dalgaard, himself a member of the R team, [1]. This book explains how to use R to do statistical analysis and is as such a somewhat better source than the R online help. In this manuscript I shall give a few hints as to how to use R, but I shall not actually introduce R, nor do I intend to give a comprehensive reference. For that the book by Dalgaard is a good source and is recommended as complementary reading. For linguistic interests one may use [3]. There is a lot of more specialised literature which I shall point out in the sequel.

Part I

Basic Probability Theory



2 Counting and Numbers

We begin with a few very basic facts about counting elements in a set. We write N for the set of natural numbers. This set contains the numbers starting with 0. Thus

(1) N = {0, 1, 2, 3, 4, ...}

The cardinality of a set tells us how large that set is. If the set is finite, the cardinality is a natural number. We write |A| for the cardinality of A. If B is also a set (not necessarily different) then A ∪ B is the set that contains the members of A and B. Since an element can belong to both but is only counted once we have

(2) |A ∪ B| = |A| + |B| − |A ∩ B|

The set A × B contains all pairs ⟨x, y⟩ such that x ∈ A and y ∈ B. The set A^B contains all functions from B to A.

(3) |A × B| = |A| · |B|
(4) |A^B| = |A|^{|B|}

We say that A and B have the same cardinality if there is a one-to-one and onto function f : A → B. Equivalently, it suffices to have functions f : A → B and g : B → A such that for all x ∈ A, g(f(x)) = x and for all y ∈ B, f(g(y)) = y.

Theorem 1 |℘(A)| = 2^{|A|}. In other words, there are as many subsets of A as there are functions from A into a two-element set.

Proof. For convenience, let T = {0, 1}. Clearly, |T| = 2. Let X ⊆ A. Then let q(X) be the following function: q(X)(u) = 1 if u ∈ X and q(X)(u) = 0 otherwise. Then q(X) : A → T. Now let g : A → T be a function. Put p(g) := {u ∈ A : g(u) = 1}. Then p(g) ⊆ A. All we have to do is show that p and q are inverses of each other. (1) Let X ⊆ A. Then p(q(X)) = {u : q(X)(u) = 1} = {u : u ∈ X} = X. (2) Let f : A → T. Then q(p(f)) = q({u : f(u) = 1}). This is a function, and q(p(f))(v) = 1 iff q({u : f(u) = 1})(v) = 1 iff v ∈ {u : f(u) = 1} iff f(v) = 1. And q(p(f))(v) = 0 iff f(v) = 0 follows. Hence f = q(p(f)). □

One of the most important kinds of numbers are the binomial coefficients. We shall give several equivalent characterisations and derive a formula to compute them.

Definition 2 The number of k element subsets of an n element set is denoted by \binom{n}{k} (pronounce: n choose k).

We do not need to require 0 ≤ k ≤ n for this to be well-defined: if k < 0 or k > n, the number is easily seen to be 0.

Theorem 3 The following hold.

➀ \binom{n+1}{k+1} = \binom{n}{k} + \binom{n}{k+1}.

➁ \sum_{k=0}^{n} \binom{n}{k} = 2^n.

Proof. Consider first n = 0: if k = 0, then \binom{1}{1} = 1, and \binom{0}{0} + \binom{0}{1} = 1 + 0 = 1, as promised. Now let A be an n + 1 element set, and let a ∈ A. Then A − {a} is an n element set. Now choose a subset X of A of cardinality k + 1. (Case 1) a ∈ X. Then X − {a} is a k element subset of A − {a}. Conversely, for every H ⊆ A − {a} that has k elements, the set H ∪ {a} has k + 1 elements. (Case 2) a ∉ X. Then X is a k + 1 element subset of A − {a}. Conversely, every k + 1 element subset of A − {a} is a k + 1 element subset of A, and this finishes the proof. The second claim follows from the observation that the numbers \binom{n}{k} are nonzero only when 0 ≤ k ≤ n and from the fact that the number of subsets of an n element set is 2^n. □
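These identities are easy to spot-check numerically, using the R function choose that is discussed at the end of this section (the values n = 7 and k = 3 are arbitrary choices):

> choose (8, 4) == choose (7, 3) + choose (7, 4)   # claim ➀ with n = 7, k = 3
[1] TRUE
> sum (choose (7, 0:7)) == 2^7                     # claim ➁ with n = 7
[1] TRUE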

Theorem 4 For all complex numbers x and y and natural numbers n:

(5) (x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n−k}

Proof. For n = 1, the claim is that

(6) (x + y)^1 = \binom{1}{0} x^0 y^1 + \binom{1}{1} x^1 y^0 = x + y

Now suppose the claim has been established for n. Then

(7)
(x + y)^{n+1} = (x + y)(x + y)^n
= (x + y) \sum_{k=0}^{n} \binom{n}{k} x^k y^{n−k}
= x \sum_{k=0}^{n} \binom{n}{k} x^k y^{n−k} + y \sum_{k=0}^{n} \binom{n}{k} x^k y^{n−k}
= \sum_{k=0}^{n} \binom{n}{k} x^{k+1} y^{(n+1)−(k+1)} + \sum_{k=0}^{n} \binom{n}{k} x^k y^{(n+1)−k}
= \sum_{k=1}^{n+1} ( \binom{n}{k} + \binom{n}{k−1} ) x^k y^{(n+1)−k} + y^{n+1}
= \sum_{k=0}^{n+1} \binom{n+1}{k} x^k y^{(n+1)−k}

□

This is a very important theorem. Notice that we can derive the second part of the previous theorem as follows. Put x := y := 1. Then we get

(8) 2^n = (1 + 1)^n = \sum_{k=0}^{n} \binom{n}{k}

Another formula that we get is this one. Put x := 1 and y := −1. Then

(9) 0 = (1 − 1)^n = \sum_{k=0}^{n} (−1)^k \binom{n}{k}
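The same kind of spot check works for (8) and (9); again the particular n is an arbitrary choice:

> n <- 8
> sum (choose (n, 0:n))                # identity (8): equals 2^8 = 256
[1] 256
> sum ((-1)^(0:n) * choose (n, 0:n))   # identity (9): the alternating sum vanishes
[1] 0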

How to use R to do the calculations. Arithmetical expressions are written using +, * and so on. Help is provided in the R manual on how to do this. I shall only sketch a few useful tricks. R has a function choose which allows one to calculate the binomials. For example, \binom{23}{16} must be entered as choose (23, 16) (and you get the value 245157). If you want to assign that value to a variable, say x, you will have to type

(10) > x <- choose (23, 16)

and hit 'enter'. (Here, > is the prompt; this is not what you type, it is already present on your screen. You type only what comes after the prompt.) Now let us see how we can check the validity of Theorem 3. We want to calculate the sum of all \binom{10}{i} where i runs from 0 to 10. Write as follows:

(11)
> x <- 0
> for (i in 0:10) x <- x + choose (10, i)
> x
[1] 1024

This means the following. The variable x is assigned the value 0. Now, i is made to visit all values from 0 to 10, and each time the result of choose (10, i) is added to the value of x. Finally, when all is done we ask R for the value of x. The expression 0:10 is known as a value range. It denotes the sequence (!) of numbers 0, 1, 2, ..., 10. The start of the sequence is the left hand number, the end is given by the right hand number. The value range 0:10 is therefore distinct from 10:0, which denotes the sequence 10, 9, 8, ..., 0. There is an interesting way to get the same result. First we are going to create a vector of length 11 that contains the entries \binom{10}{0}, \binom{10}{1}, \binom{10}{2} and so on. The way to do this is very easy:

(12) > z <- choose (10, 0:10)

Next, issue

(13)
> sum(z)
[1] 1024

and you get back the prompt. This last example shows the use of vectors and how to generate them with a formula. Generally, if you put a range in place of the variable, it will create a vector containing the values of the formula with each member of the sequence inserted in turn. If you want to see the graphical shape of your vector you may type plot (z). It opens a window and gives you a graphical display of the values, plotted against their position in the vector. To save graphics, here is a sample dialog. Save the following data (as you see it) into a file, say falling.txt:

(14)
distance height
2 10
2.5 8
3.5 7
5 4
7 2

(There is no need to align the numbers). Now issue the following:

(15) > d <- read.table("falling.txt",header=T)

The last argument is important since it declares that the first line contains the header (rather than being part of the data). Now you can do the following:

(16)
> pdf (file = "graph.pdf")
> plot (d)
> dev.off ()

This will cause the graphic to be stored in the file graph.pdf; the file is written to your working directory once dev.off () is called. It is possible to use other device drivers (for example PostScript). Please read the manual for "plot" as well as "pdf" for further options. Also, get help on "read.table" to find out how you can prepare your data to be read by R.

3 Some Background in Calculus

The most important notion in calculus is that of a limit. Consider the following three sequences:

(17) 1, 1/2, 1/3, 1/4, 1/5, ...
(18) 1, −1, 1, −1, 1, −1, ...
(19) 0, 1, 2, 3, 4, ...

What separates the first from the second and third sequence is that the members of the first sequence get closer and closer to 0 as time moves on. From time n on, the distance to 0 is at most 1/n for any of the subsequent members. We call 0 the limit of the sequence. Indeed, it is even the case that the members get closer to each other, because they all zoom in on the same value. This is to say that from a certain time point on, every member is close to every other member of the sequence. In the present case this distance is again at most 1/n. (This is the Cauchy property and is equivalent to having a limit.) The second sequence is a little bit similar: here all the members of the sequence are either 1 or −1. Thus, the members of the sequence zoom in on two values; their distance among each other is sometimes 2, sometimes 0. Finally, the last sequence has no such property at all. The members of the sequence are not within a fixed corridor from each other. Sequences may show any mixture of these behaviours, but these three cases may be enough for our purposes. Let us first look at a function f : N → R. These functions are also called sequences and often written down in the manner shown above. This function is said to be convergent if for every ε > 0 there is an n(ε) such that for all n, n′ ≥ n(ε)

(20) |f(n) − f(n′)| < ε

If f is convergent there is a number a such that the following holds: for all ε > 0 there is an m(ε) such that if n ≥ m(ε) then

(21) | f (n) − a| < ε

This number is called the limit of the function f , and we write

(22) a = \lim_{n→∞} f(n)
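One can watch a sequence settle down numerically. A tiny R illustration with the sequence (17); only plain arithmetic is used:

> n <- c(10, 100, 1000, 10000)
> 1/n     # members of sequence (17) at later and later positions
[1] 1e-01 1e-02 1e-03 1e-04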

Now suppose that f is a function from real numbers to real numbers. We shall use the notion of limit to define continuity of a function. There is a definition which runs as follows: f is continuous in x_0 if, given any sequence (x_n)_{n∈N} with limit x_0, the sequence of values (f(x_n))_{n∈N} also is convergent, with limit f(x_0). Notice that the sequence (x_n)_{n∈N} need not contain x_0 itself, but must get arbitrarily close to it. The previous definition is somewhat cumbersome to use. Here is a better one. We say that f is continuous at x_0 if for every ε > 0 there is a δ(ε) > 0 such that if |x − x_0| < δ(ε) and |y − x_0| < δ(ε) then

(23) |f(x) − f(y)| < ε

The nice aspect of this definition is that f need not be defined at x_0. There is again exactly one value that f can be given at x_0 to make it continuous, and that value is called the limit of f at x_0:

(24) \lim_{x→x_0} f(x)

Now if f is continuous in a point then we can know its value at that point if we study the values at points close to it. To be really exact: if we want to make an error of at most ε, we should study values at distance at most δ(ε). If one wants to suppress explicit mention of the dependency, one writes as follows.

(25) f(x_0 + ∆) ≈ f(x_0)

And this means that the error becomes small if ∆ is small enough. ∆ is a real number, any number, but preferably small. Now imagine a number dx_0 that is so small that it is not zero but smaller than every real number > 0. In physics such numbers are known as virtual numbers; in mathematics we call them infinitesimals. A number greater than zero but less than every real number r > 0 is called infinitesimal. Although you may find this intuitive, at second thought you may find it contradictory. Do not worry: mathematicians used to think that this is impossible, but it has been shown to be consistent! What rebels in you is only the thought that the line that you see has no place for them. But who knows? Now, write ≈ to say that the numbers are different only by an infinitesimal amount.

(26) f(x_0 + dx_0) ≈ f(x_0)

The two are however different, and the difference is very small, in fact smaller than every real number > 0. This number is denoted by (df)(x_0).

(27) (df)(x_0) := f(x_0 + dx_0) − f(x_0)

We shall use this definition to define the derivative. For a given x_0 and ∆ ≠ 0 define

(28) g(∆) := ( f(x_0 + ∆) − f(x_0) ) / ∆

If f is not everywhere defined, we need to make exceptions for ∆, but we assume that f is defined in at least some open interval around x_0. The function g as given is defined on that interval, except at ∆ = 0. We say that f has a derivative in x_0 if g is continuous at ∆ = 0, and we set

(29) f′(x_0) := \lim_{∆→0} g(∆)

To be precise here: the notion of limit applies to real valued functions and yields a real value. Now, the same can be done at any other point at which f is defined, and so we get a function f′ : R → R, called the derivative of f. Another notation for f′ is df/dx. The latter will become very useful. There is a useful intuition about derivatives: f has a derivative at x_0 if it can be written as

(30) f(x_0 + ∆) ≈ f(x_0) + ∆ f′(x_0)

where the error becomes negligible in comparison with ∆. For ≈ we may ignore the error if it is infinitesimally small, and write

(31) f(x_0 + dx_0) = f(x_0) + f′(x_0) dx_0

Sticking this into (27) we get

(32) f′(x_0) = (df)(x_0)/dx_0 =: (df/dx)(x_0)

where now we have inserted a function that yields very, very small values, namely df. However, when divided by another function that also yields very, very small values, the quotient may actually become a real number again! We can do real calculations using that, pretending dx to be a number.

(33) d(x^2)/dx = ((x + dx)^2 − x^2)/dx = (2x dx + (dx)(dx))/dx = 2x + dx

If you want to know the real number that this gives you, just ignore the addition of dx. It's just 2x. To be exact, we may write

(34) d(x^2)/dx ≈ 2x

Indeed, the latter is a real valued function, so it is our candidate for the derivative (which after all is a real valued function again). Hence, the derivative of the function f(x) = x^2 at x_0 is 2x_0. In this vein one can deduce that the derivative of x^n is n x^{n−1}. However, we can use this also for very abstract calculations. Suppose that you have a function f(g(x)): you take x, apply g, and then apply f. Now, let us calculate:

(35) f (g(x + dx)) = f (g(x) + (dg)(x)) = f (g(x)) + (d f )((dg)(x))

It follows that (suppressing the variable)

(36) d( f ◦ g) = (d f ) · (dg)

This is often given the following form:

(37) df/dx = (df/dy) · (dy/dx)

Here, y is an arbitrary variable. In our case, y = g(x), so y depends in value on x. Understood as a fraction of numbers this is just an instance of ordinary expansion of fractions. Here is another useful application.

(38)
d(f + g)(x) = (f + g)(x + dx) − (f + g)(x)
= (f(x + dx) − f(x)) + (g(x + dx) − g(x))
= (df)(x) + (dg)(x)

From this we derive

(39) d(f + g)/dx = df/dx + dg/dx

Similarly for the product:

(40)
d(fg)(x) = (fg)(x + dx) − (fg)(x)
= f(x + dx) g(x + dx) − f(x) g(x)
= (f(x) + (df)(x))(g(x) + (dg)(x)) − f(x) g(x)
= (df)(x) g(x) + f(x)(dg)(x) + (df)(x)(dg)(x)

We get that

(41) d(fg)/dx = (df/dx) g + f (dg/dx) + (df/dx)(dg) = (df/dx) g + f (dg/dx)

Recall that the derivative is a real valued function, so we may eventually ignore the infinitesimally small term (df/dx)(dg). We derive a last consequence. Suppose that f(g(x)) = x; in other words, f is the inverse of g. Then, taking derivatives, we get

(42) (df/dg) · (dg/dx) = 1

We are interested in the derivative of f as a function of a variable, say y, where y is implicitly defined (via g). In fact, dy = dg. Then, rearranging (42):

(43) d(g^{−1})/dy = (dg/dx)^{−1}

For example, the derivative of e^x is e^x. The inverse of this function is ln y. Thus

(44) d(ln y)/dy = ( d(e^x)/dx )^{−1} = 1/e^x = 1/y
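All of these computations can be spot-checked by taking ∆ small but finite in the difference quotient (28). A rough numerical sketch in R (the last digits are at the mercy of machine arithmetic):

> dx <- 1e-6
> x <- 2
> ((x + dx)^2 - x^2)/dx          # compare (34): 2x = 4
[1] 4.000001
> (log (x + dx) - log (x))/dx    # compare (44): 1/x = 0.5
[1] 0.4999999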

Theorem 5 The following laws hold for derivatives:

1. d(x^α)/dx = α x^{α−1}, for α ∈ R.

2. d(f + g)/dx = df/dx + dg/dx.

3. d(fg)/dx = f (dg/dx) + g (df/dx).

4. d(f ◦ g)/dx = (df/dg) · (dg/dx).

5. d(g^{−1})/dy = (dg/dx)^{−1}.

6. d(e^x)/dx = e^x.

7. d(sin x)/dx = cos x.

This is as much as we need in the sequel. We shall look at the last two claims in particular. First, notice that

(45) d(e^x) = e^{x+dx} − e^x = e^x (e^{dx} − 1)

Thus we establish that

(46) d(e^x)/dx = e^x · (e^{dx} − 1)/dx

It is actually the definition of e that

(47) \lim_{∆→0} (e^∆ − 1)/∆ = 1

This settles the claim as follows.

(48) (e^{dx} − 1)/dx ≈ \lim_{∆→0} (e^∆ − 1)/∆ = 1

Now, using complex numbers, notice that e^{ix} = cos x + i sin x, so that sin x = (e^{ix} − e^{−ix})/2i. This gives

(49)
d(sin x)/dx = d((e^{ix} − e^{−ix})/2i)/dx
= (1/2i)( i e^{ix} − (−i) e^{−ix} )
= (1/2i)( i(cos x + i sin x) + i(cos(−x) + i sin(−x)) )
= (1/2i)(2i cos x)
= cos x

We have used cos(−x) = cos x and sin(−x) = −sin x. Now we turn to integration. Integration is a technique to calculate the area beneath the graph of a function. Before we approach the problem of integration we shall first look at the notion of area. A measure is a function µ that assigns to subsets of a space some real number, called the measure of that set. Not every set needs to have a measure. There are three conditions we wish to impose on µ in order for it to qualify as a measure.

1. µ(∅) = 0.

2. If A ∩ B = ∅ then µ(A ∪ B) = µ(A) + µ(B).

3. If A_n, n ∈ N, are pairwise disjoint, then

µ(⋃_{n∈N} A_n) = \sum_{n∈N} µ(A_n)

We can for example define the following measure on sets of real numbers. For an interval [a, b] with a ≤ b the measure is b − a. In particular, if a = b we get µ({a}) = µ([a, a]) = 0. It follows that every finite set has measure 0, and so does every set that can be enumerated with the natural numbers. But if an interval [a, b] is not a single point it actually has more members than can be enumerated! Now, it follows that every finite or countable disjoint union of intervals has a measure. These sets are called the Borel sets of finite measure. Another characterization is as follows.

Definition 6 Let B be the least set of subsets of R which contains the finite inter- vals and is closed under complement and countable unions.

The set (−∞, y] is a Borel set, but its measure is infinite. The rational numbers are also a Borel set, and of measure 0. The same notion of Borel set can be defined for R^n. We do this for n = 2. We start with rectangles of the form [a_1, b_1] × [a_2, b_2], which we declare to be sets of measure (b_1 − a_1)(b_2 − a_2), and then close under countable unions and complement. However, for purposes of integration we define the measure of certain rectangles differently. In the following definition, a ≤ b and c ≤ 0 ≤ d.

(50) µ([a, b] × [0, d]) := d(b − a)
(51) µ([a, b] × [c, 0]) := c(b − a)

Thus, the rectangle [3, 4] × [0, 2] has measure 2, while [3, 4] × [−2, 0] has measure −2. To see why we need this odd looking definition, take a look at integration. Suppose you are integrating the function f(x) = −2 between 3 and 4. Instead of the value 2 (which would give you the area) the result is actually defined to be −2. The area below the line counts negatively for the integration.

Once we have the definition of this measure, we can define:

(52) \int_{x=a}^{b} f(x) dx := µ({⟨x, y⟩ : a ≤ x ≤ b and: 0 ≤ y ≤ f(x) or f(x) ≤ y ≤ 0})

We abbreviate the set on the right by [f(x)]_a^b. This definition has a number of immediate consequences. First,

(53) \int_a^b f(x) dx = \int_a^c f(x) dx + \int_c^b f(x) dx

For a proof, notice that µ([f(x)]_c^c) = f(c) · 0 = 0, by definition. Now, notice that the measure is additive:

(54)
µ([f(x)]_a^b) = µ([f(x)]_a^c ∪ [f(x)]_c^b)
= µ([f(x)]_a^c ∪ ([f(x)]_c^b − [f(x)]_c^c))
= µ([f(x)]_a^c) + µ([f(x)]_c^b − [f(x)]_c^c)
= µ([f(x)]_a^c) + µ([f(x)]_c^b)

Another important consequence is that integration is the inverse of differentiation:

(55)
d( \int_a^x f(y) dy ) = \int_a^{x+dx} f(y) dy − \int_a^x f(y) dy
= \int_x^{x+dx} f(y) dy

Assuming that f is continuous, the values of f in the interval are only infinitesimally apart from f(x), so we may assume that they all lie in an interval [f(x) − c·dx, f(x) + c·dx], for some real number c. Then we have

(56) dx · (f(x) − c·dx) ≤ \int_x^{x+dx} f(y) dy ≤ dx · (f(x) + c·dx)

Now we have

(57) f(x) − c·dx ≤ d( \int_a^x f(y) dy )/dx ≤ f(x) + c·dx

The left and right hand sides are only infinitesimally apart, so in terms of real numbers they are equal. We conclude

(58) d( \int_a^x f(y) dy )/dx = f(x)

This greatly simplifies integration, since differentiation is on the whole easier to do, and once a function f is known to be the derivative of some other function g, the integral of f is known to be g (plus a constant) as well. We derive a particular consequence, the rule of partial integration. Consider having to calculate the integral \int_a^b f g dx. Suppose we know how to integrate f but not g. Then we can proceed as follows. Suppose that \int_a^x f(y) dy = h(x). Then

(59) \int_a^b f(x) g(x) dx = h(x) g(x)|_a^b − \int_a^b h(x) g′(x) dx

For a proof, just take derivatives on both sides:

(60) f(x) g(x) = (d/dx) \int_a^x f(y) g(y) dy

(61)
(d/dx)( h(y) g(y)|_a^x − \int_a^x h(y) g′(y) dy ) = (d/dx)(h(x) g(x)) − h(x) g′(x)
= f(x) g(x) + h(x) g′(x) − h(x) g′(x)
= f(x) g(x)
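R offers a numerical integrator, the built-in function integrate, which makes for a quick check of such manipulations. A sketch; the second integral is \int_0^1 x e^x dx, which partial integration (with f = e^x, hence h(x) = e^x − 1, and g = x) evaluates to exactly 1:

> integrate (function (x) x^2, 0, 1)           # prints a value close to 1/3
> integrate (function (x) x * exp (x), 0, 1)   # prints a value close to 1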

4 Probability Spaces

The definition of probability spaces is somewhat involved. Before we can give it in its full generality, let us notice some special cases. Intuitively, certain events happen with some probability. If I toss a coin then it will either show heads or tails. Moreover, we assume that the probability with which heads comes up is 1/2. This means that we expect that half of the time we get heads and half of the time we get tails. Similarly, throwing a die we have six different outcomes and we expect each of them to occur as often as the others; throwing two dice, a green and a red one, each outcome where the green die shows some number i and the red die a number j is equally likely. There are 36 such outcomes. Each one therefore has probability 1/36. To give yet another example, suppose that we throw two dice, but that our outcomes are now the sum of the points we have thrown. These are the numbers 2, 3, ..., 12, but the probabilities are now

(62)
p(2) = 1/36    p(3) = 2/36    p(4) = 3/36
p(5) = 4/36    p(6) = 5/36    p(7) = 6/36
p(8) = 5/36    p(9) = 4/36    p(10) = 3/36
p(11) = 2/36   p(12) = 1/36

You may check that these probabilities sum to 1. This time we have actually not assumed that all outcomes are equally likely. Why is that so? For an explanation, we point to the fact that the events we look at are actually derived from the previous example. For notice that there are five possibilities to throw a sum of 6. Namely, the sum is six if the first die shows 1 and the second 5, the first shows 2 and the second 4, and so on. Each possibility occurs with probability 1/36, and the total is therefore 5/36. This example will occur later on again. Let us return to a single die. There are six outcomes, denoted by the numbers from 1 to 6. In addition to them there are what we call events. These are certain sets of outcomes. For example, there is the event of throwing an even number. The latter comprises three outcomes: 2, 4 and 6. We expect that its probability is exactly one half, since the probability of each of the outcomes is exactly 1/6. In the language of probability theory, events are sets of outcomes. The probability of an event is the sum of the probabilities of the outcomes that it contains. Thus, in the finite case we get a set Ω of outcomes, and a function p : Ω → [0, 1] such that

(63) \sum_{ω∈Ω} p(ω) = 1
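Incidentally, the distribution (62) need not be computed by hand; R can tabulate it. A small sketch using the built-in functions outer and table:

> sums <- outer (1:6, 1:6, "+")   # all 36 equally likely pairs, summed
> table (sums)/36                 # prints the distribution (62): 1/36, 2/36, ..., 6/36, ..., 1/36
> sum (table (sums)/36)           # the probabilities sum to 1
[1] 1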

For a subset A ⊆ Ω we put

(64) P(A) := \sum_{ω∈A} p(ω)

Notice that we have used a different letter here, namely P. We say that P is the probability function and that p is its density or distribution. The latter applies only to individual outcomes, while P applies to sets of outcomes. If x is an outcome, then P({x}) = p(x). Since there is no risk of confusion, one also writes P(x) in place of P({x}). From this definition we can derive a few laws.

(65) P(∅) = 0
(66) P(Ω) = 1
(67) P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

The first is clear: if we sum over an empty set we get 0. The second is also clear; it follows from (63). The third needs proof. Perhaps it is easier to start with a different observation. Suppose that A ∩ B = ∅. Then

(68) P(A ∪ B) = \sum_{ω∈A∪B} p(ω) = \sum_{ω∈A} p(ω) + \sum_{ω∈B} p(ω) = P(A) + P(B)

Now if A ∩ B ≠ ∅, notice that A ∪ B = A ∪ (B − A), and the two sets are disjoint. So P(A ∪ B) = P(A) + P(B − A). Also, B = (B − A) ∪ (B ∩ A), with the sets disjoint, and this gives P(B) = P(B − A) + P(A ∩ B). Together this yields the formula P(A ∪ B) = P(A) + P(B) − P(A ∩ B). Finally, we note that if A_i, 1 ≤ i ≤ n, are pairwise disjoint sets then

(69) P( ⋃_{i=1}^{n} A_i ) = P(A_1) + P(A_2) + ··· + P(A_n)

This is in fact all that needs to be said if Ω is finite. However, when Ω is infinite we need to look harder. Suppose, namely, that Ω is the set of natural numbers. Suppose further that each of the numbers is equally probable. Then the probability of each number is actually 0. However, this means that the probability of every subset of the numbers is 0 as well. So, the approach of assigning probabilities to outcomes fails; instead, in probability theory one does not assign probabilities to outcomes but rather to events. For example, if every number has the same probability, the set of even numbers has

the probability 1/2, and likewise the set of odd numbers. This means that the probability that a randomly chosen natural number is even is 1/2, while the probability that it is equal to 377 is 0. In order to make the approach work out correctly we need to restrict the domain of the function that assigns the probabilities. We shall require that the sets that receive probabilities form a boolean algebra. This is a bit more general than typically assumed, where additionally it is required that they are closed under intersection and union over countably many sets (such algebras are called σ-algebras). To give a nontrivial example, let U(k, n) be the set of numbers that leave the remainder k when divided by n. Give probability 1/n to these sets. Let A be the set of all finite unions and intersections of the sets U(k, n) for all n ∈ N and k < n. This is a boolean algebra. It is closed under the required operations.

Definition 7 A probability space is a triple ⟨Ω, A, P⟩, where Ω is a set, the set of outcomes, A ⊆ ℘(Ω) a boolean algebra, the algebra of events, and P : A → [0, 1] a function satisfying the following.

1. P(∅) = 0

2. P(Ω) = 1,

3. If A_i, i ∈ I, are pairwise disjoint sets and |I| ≤ ω then

P( ⋃_{i∈I} A_i ) = \sum_{i∈I} P(A_i)

A note on notation. A shall always be an algebra of sets over Ω. Thus, we shall not write 0A but rather ∅. The operations on A are union (∪), intersection (∩) and relative complement (−). In particular, notice that −A = Ω − A. We give some examples.

The Laplace space. Let Ω be a finite set containing n elements, and A := ℘(Ω). Finally, put P(A) := |A|/n. In this space, every outcome has the same probability, namely 1/n. The above examples (tossing an unbiased coin, throwing a die) are of this form.

The Bernoulli space. Let Ω = {0, 1}, A = ℘(Ω). With p := p(1), we have q := p(0) = 1 − p. Here 1 represents the event of "success", while 0 represents failure. For example, betting schemes work as follows. Team A plays against Team B in a match. Team A is expected to win with probability 75%, or 0.75. Hence p = 0.75 and q = 0.25. The Bernoulli space is the smallest of all probability spaces, but its usefulness in probability theory can hardly be overestimated. One says that these are the probabilities of tossing a "biased coin", where the bias is q/p against you. If the coin is actually unbiased, then p = q = 1/2, so that the bias is 1. In the above example the bias is 0.75/0.25 = 3. Indeed, if the betting office is convinced that the odds are 3:1 that A wins and you are betting ten dollars that B wins instead, it will offer to pay (at most) 40 dollars to you if B wins, while cashing your money when B loses. If the betting office is not trying to make money then it will pay exactly 40 dollars. In general, if the odds are r : 1, then for every dollar you bet against the event you get r + 1 dollars if you win and nothing otherwise. To see that this is fair, notice that on average you win 1 out of r + 1 (!) times; when you win, you get back the sum you placed and additionally win r dollars for every dollar, while otherwise you lose the dollars you placed on the bet. This scheme will in the long run make no one richer than he was originally (on average). The proof for this will have to be given later. But intuitively it is clear that this is the case. Notice that betting offices do aim at making money, so they will make sure that you lose in the long run. To give an example, in French roulette there are 37 numbers (from 0 to 36), and 0 plays a special role. If you bet 10 dollars on a number different from 0, say 15, then you get paid 360 dollars if 15 shows, while you get nothing when it doesn't. This is slightly less than the actual odds would warrant (they are 36:1, which means that you should get 370 dollars), to make sure that the casino is on average making money from its customers.

Discrete spaces. A space is discrete if A = ℘(Ω). So, every conceivable set is an event, and therefore has a probability assigned to it. In particular, if x is an outcome, then {x} is an event (these two things are often confused). Therefore, we may put p(x) := P({x}). Then we get

(70) P(A) = \sum_{ω∈A} p(ω)

This however is only defined if Ω is either finite or countably infinite. However, we shall not encounter discrete spaces that are uncountable, so this is as general as we need it.

We shall present some abstract constructions for probability spaces. Let f : X → Y be a function, and U ⊆ X and V ⊆ Y. Then put

(71) f[U] := {f(x) : x ∈ U}
(72) f^{−1}[V] := {x ∈ X : f(x) ∈ V}

f[U] is called the direct image of U under f, and f^{−1}[V] the preimage of V under f. Now, if B ⊆ ℘(Y) is a boolean algebra, so is {f^{−1}[B] : B ∈ B}. Here is a proof. (1) f^{−1}[∅] = ∅. (2) f^{−1}[V ∪ W] = f^{−1}[V] ∪ f^{−1}[W]. For x ∈ f^{−1}[V ∪ W] iff f(x) ∈ V ∪ W iff f(x) ∈ V or f(x) ∈ W iff x ∈ f^{−1}[V] or x ∈ f^{−1}[W]. (3) f^{−1}[Y − V] = X − f^{−1}[V]. For x ∈ f^{−1}[Y − V] iff f(x) ∈ Y − V iff f(x) ∈ Y and f(x) ∉ V iff x ∈ X and not x ∈ f^{−1}[V] iff x ∈ X − f^{−1}[V]. (4) f^{−1}[V ∩ W] = f^{−1}[V] ∩ f^{−1}[W]. Follows from (2) and (3), but can be proved directly as well. Now suppose we have a probability function P : A → [0, 1]. Then we can only assign a probability to a set in B if its full preimage is in A. Thus, we call f : Ω → Ω′ compatible with A if f^{−1}[B] ∈ A for all B ∈ B. In this case every set in B can be assigned a probability by

(73) P′(B) := P(f^{−1}[B])

This is a probability function: (1) P′(Ω′) = P(f^{−1}[Ω′]) = P(Ω) = 1. (2) P′(∅) = P(f^{−1}[∅]) = P(∅) = 0. (3) If A and B are disjoint, so are f^{−1}[A] and f^{−1}[B], and then P′(A ∪ B) = P(f^{−1}[A ∪ B]) = P(f^{−1}[A] ∪ f^{−1}[B]) = P(f^{−1}[A]) + P(f^{−1}[B]) = P′(A) + P′(B).

Proposition 8 Let ⟨Ω, A, P⟩ be a finite probability space, B a boolean algebra over Ω′ and f : Ω → Ω′ a surjective function compatible with A. Put P′(B) := P(f^{−1}[B]). Then ⟨Ω′, B, P′⟩ is a probability space.

We shall put this to use as follows. Suppose A is finite and has atoms A_1, ..., A_n. Then let Ω′ = {1, ..., n} and define f by f(x) := i, where x ∈ A_i. This is well defined: every element is contained in one and only one atom of the algebra. This function is compatible with A. For let S ⊆ Ω′. Then

(74) f^{−1}[S] = ⋃_{i∈S} A_i ∈ A

Here is an example. Suppose that Ω = {1, 2, 3, 4, 5, 6}, and that

(75) A = {∅, {1, 2}, {3, 4, 5, 6}, Ω}.

This is a boolean algebra, and it has two atoms, {1, 2} and {3, 4, 5, 6}. Now define f(1) := f(2) := α and f(3) := f(4) := f(5) := f(6) := β. Then f[∅] = ∅, f[{1, 2}] = {α}, f[{3, 4, 5, 6}] = {β}, and f[Ω] = {α, β}. Finally, assume the following probabilities: P({1, 2}) = 1/3, P({3, 4, 5, 6}) = 2/3. Then we put P′({α}) := 1/3 and P′({β}) := 2/3. Notice that the original space was drawn from a simple Laplace experiment: throwing a die, where each outcome has equal probability. However, we considered only four events, with the appropriate probabilities given. The resulting space can be mapped onto a Bernoulli space with p = 1/3. There is an immediate corollary of this. Say that ⟨Ω, A, P⟩ is reducible to ⟨Ω′, A′, P′⟩ if there is a function f : Ω → Ω′ such that A = {f^{−1}[B] : B ∈ A′} and P′(B) = P(f^{−1}[B]) for all B ∈ A′. Thus the second space has perhaps fewer outcomes, but it has (up to isomorphism) the same event structure and the same probability assignment.

Proposition 9 Every finite probability space is reducible to a discrete probability space.

This means that in the finite setting it does not make much sense to consider anything but discrete probability spaces. But the abstract theory is nevertheless to be preferred for the flexibility that it gives.

Next we look at another frequent situation. Let Ω_1 and Ω_2 be sets of outcomes of experiments E_1 and E_2, respectively. Then Ω_1 × Ω_2 is the set of outcomes of the experiment where both E_1 and E_2 are conducted. For example, suppose we toss a coin and throw a die. Then the outcomes are pairs ⟨ℓ, m⟩ where ℓ ∈ {H, T} and m ∈ {1, 2, 3, 4, 5, 6}. Now, what sort of events do we have to consider? If A_1 ∈ A_1 and A_2 ∈ A_2 we would like to have the event A_1 × A_2. However, one can show that the set of these events is not a boolean algebra, since it is in general not closed under complement and union. To take an easy example, let Ω_1 = Ω_2 = {0, 1} and A_1 = A_2 = ℘({0, 1}). The set {⟨0, 1⟩, ⟨1, 0⟩} is the union of the two sets {0} × {1} = {⟨0, 1⟩} and {1} × {0} = {⟨1, 0⟩}. But it is not of the form A × B for any A, B. Instead we simply take all finite unions of such sets:

(76) A_1 ⊗ A_2 := { ⋃_{i=1}^{p} A_i × B_i : for all i: A_i ∈ A_1, B_i ∈ A_2 }

The probabilities are assigned as follows.

(77) (P_1 × P_2)(A × B) := P_1(A) · P_2(B)

It is a bit tricky to show that this defines probabilities for all sets of the new algebra. The reason is that in order to extend this to unions of these sets we need to make sure that we can always use disjoint unions. We perform this in the case of a union of two sets. Notice that (A × B) ∩ (A′ × B′) = (A ∩ A′) × (B ∩ B′). Using this and (67) we have

(78) P((A × B) ∪ (A′ × B′)) = P(A × B) + P(A′ × B′) − P((A ∩ A′) × (B ∩ B′))

This is reminiscent of the fact that the intersection of two rectangles is a rectangle. So, if we take the sum of the probabilities we are counting the probability of the intersection twice. The latter is a rectangle again. The probabilities on the right hand side are defined; so is therefore the one on the left.

Definition 10 Let P_1 = ⟨Ω_1, A_1, P_1⟩ and P_2 = ⟨Ω_2, A_2, P_2⟩ be probability spaces. Then P_1 ⊗ P_2 := ⟨Ω_1 × Ω_2, A_1 ⊗ A_2, P_1 × P_2⟩ is a probability space, the so-called product space.

We give an application. Take a Bernoulli experiment with p = 0.6. This defines the space P. Suppose we do this experiment twice. This we can alternatively view as a single experiment, where the probability space is now P ⊗ P. It is not hard to verify that the algebra of events is the powerset ℘({0, 1} × {0, 1}). Also, we have

(79) p(⟨0, 0⟩) = 0.16, p(⟨0, 1⟩) = p(⟨1, 0⟩) = 0.24, p(⟨1, 1⟩) = 0.36

The probabilities sum to 1, as is readily checked.
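In R the product probabilities can be tabulated with the built-in function outer; a sketch for the experiment just described (recall p(1) = 0.6):

> p <- c("0" = 0.4, "1" = 0.6)   # the Bernoulli density: p(0) and p(1)
> outer (p, p)                   # the four probabilities of (79)
     0    1
0 0.16 0.24
1 0.24 0.36
> sum (outer (p, p))             # and they sum to 1
[1] 1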

5 Conditional Probability

Suppose that some person A has three children, and suppose that the probability of a child being a boy is simply 1/2. The probability of having exactly one boy and therefore two girls is 3/8. Now suppose you know that A has at least one girl; what is the probability that he has exactly one boy? The probability cannot be the same again. To see this, let us note that the probability of having three sons is zero under this condition. If we didn't know there was at least one girl, that probability would have been 1/8. So, some probabilities have obviously changed. Let us do the computation then. The probability of having at least one girl is 7/8. The probability of having exactly one boy is 3/8. If there is exactly one boy there also is at least one girl, so the probabilities compare to each other as 3:7. Thus we expect that the probability of having exactly one boy on condition of having at least one girl is 3/7. How exactly did we get there? Let us consider an event A and ask what its probability is on condition that B. There are in total four cases to consider: A may or may not be the case, and B may or may not be the case. However, as we have excluded that B fails to be the case, we have effectively reduced the space of possibilities to those in which B holds. Here the odds are P(A ∩ B) : P((−A) ∩ B). Thus the probability that A holds on condition that B, denoted by P(A|B), is now

(80) P(A|B) = P(A ∩ B)/(P(A ∩ B) + P((−A) ∩ B)) = P(A ∩ B)/P(B)
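Before casting this into a definition, the 3/7 can be double-checked by brute enumeration in R; the built-in expand.grid lists the 8 equally likely families (B stands for boy, G for girl):

> kids <- expand.grid (c("B", "G"), c("B", "G"), c("B", "G"))
> boys <- rowSums (kids == "B")       # number of boys in each of the 8 families
> girl <- boys < 3                    # families with at least one girl
> sum (boys == 1 & girl)/sum (girl)   # P(exactly one boy | at least one girl)
[1] 0.4285714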

Definition 11 The conditional probability of A on condition that B is denoted by P(A|B) and is computed as

(81) P(A|B) = P(A ∩ B)/P(B)

This is known as Bayes' law of conditional probabilities. We can derive a series of important conclusions. First, it allows us to compute P(A ∩ B) by

(82) P(A ∩ B) = P(A|B)P(B)

Furthermore, as A = (A ∩ B) ∪ (A ∩ (−B)), the sets being disjoint, we have

(83) P(A) = P(A|B)P(B) + P(A| − B)P(−B)

This means that the probability of an event can be computed on the basis of the conditional probabilities for a family of sets B_i, i ∈ I, provided that the latter form a partition of Ω. (In words: the B_i must be pairwise disjoint and nonempty, and ⋃_{i∈I} B_i = Ω.) Furthermore, reversing the roles of A and B in (81), notice that

(84) P(B|A) = P(B ∩ A)/P(A) = (P(A ∩ B)/P(B)) · (P(B)/P(A)) = P(A|B) · P(B)/P(A)

Thus, as long as the individual probabilities are known (or the odds of A against B), we can compute the probability of B on condition that A if only we know the probability of A on condition that B. This formula is very important. To see its significance, suppose we have a biased coin, with p = 0.4 the probability of getting H. Now, toss the coin ten times. Suppose you get the sequence

(85) H, T, T, H, H, T, H, H, T, T

Thus, rather than getting H the expected 4 times we get it 5 times. We can calculate the probability of this happening: it is

(86) \binom{10}{5} · 0.4^5 · 0.6^5 = 0.201

Suppose however that the coin is not biased. Then the probability is

(87) \binom{10}{5} · 0.5^5 · 0.5^5 = 0.246

Thus, the event of getting H 5 times is more likely when we assume that our coin is in fact not biased. Now we want to actually answer a different question: what is the probability of it being biased with p = 0.4 as opposed to not being biased if the result is 5 times H? To answer this, let B be the event that the coin is biased with p = 0.4. Let F be the event that we get H 5 times. Let N be the event that the coin is not biased. We assume (somewhat unrealistically) that either B or N is the case. So, P(B) + P(N) = 1. Put α := P(B). We have

(88) P(F|B) = 0.201, P(F|N) = 0.246

We want to have P(B|F). This is

(89) P(B|F) = P(F|B) · P(B)/P(F) = 0.201 · α/P(F)

So, we need to know the probability P(F). Now, P(F) = P(F ∩ B) + P(F ∩ N) = P(F|B)P(B) + P(F|N)P(N) = 0.201α + 0.246(1 − α) = 0.246 − 0.045α. Thus we get

(90) P(B|F) = 0.201 · α/(0.246 − 0.045α)

If both B and N are equally likely, we have α = 1/2 and so

(91) P(B|F) = 0.201 · 0.5/(0.246 − 0.0225) = 0.1005/0.2235 = 0.4497.

Thus, the probability that the coin is biased is 0.4497 and that it is unbiased 0.5503, on the assumption that it is either biased with p = 0.4 or unbiased (p = 0.5), with equal likelihood for both hypotheses.
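R performs this computation in one stroke if one uses the built-in binomial density dbinom; dbinom(k, n, p) is \binom{n}{k} p^k (1 − p)^{n−k}. The result differs very slightly from (91) because nothing is rounded along the way:

> pFB <- dbinom (5, 10, 0.4)   # P(F|B), about 0.2007
> pFN <- dbinom (5, 10, 0.5)   # P(F|N), about 0.2461
> alpha <- 0.5
> pFB * alpha/(pFB * alpha + pFN * (1 - alpha))   # P(B|F), about 0.4491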

The latter kind of reasoning is very frequent. One has several hypotheses H_1, H_2, ..., H_n, with "a priori" probabilities P(H_i), i = 1, 2, ..., n, and computes the probabilities of the outcome B of the experiment. These are the probabilities P(B|H_i). Then one conducts the experiment and gets the result B. Now one asks: what is the probability of hypothesis H_i now that B actually happened? Thus one wants to establish P(H_i|B). These are the "a posteriori" probabilities of the H_i. Abstractly, this can be done as follows. We have

(92) P(H_i|B) = P(B|H_i) · P(H_i)/P(B)

The only thing we need to know is P(B). Here we do the same as before. We have assumed that the hypotheses H_i obtain with probabilities P(H_i), and that these probabilities add up to 1, so that one of the hypotheses obtains. Thus, Ω = ⋃_{i=1}^{n} H_i, the sets pairwise disjoint. We therefore have B = ⋃_{i=1}^{n} (B ∩ H_i). Now we get

(93) P(B) = \sum_{i=1}^{n} P(B|H_i) P(H_i)

Entering this into (92) we get

(94) P(H_i|B) = P(B|H_i) P(H_i) / \sum_{j=1}^{n} P(B|H_j) P(H_j)

If A does not depend on B we expect that its conditional probability P(A|B) equals P(A). This means that P(A ∩ B) = P(A|B)P(B) = P(A)P(B). This leads to the following definition.

Definition 12 Let A and B be events of a probability space P = ⟨Ω, A, P⟩. We call A and B independent if P(A ∩ B) = P(A) · P(B). Furthermore, let B_1 and B_2 be two subalgebras of A. B_1 and B_2 are independent if for all B_1 ∈ B_1 and B_2 ∈ B_2: P(B_1 ∩ B_2) = P(B_1) · P(B_2).

We present an example that we shall be using later on. Consider the space P ⊗ Q, where P = ⟨Ω, A, P⟩ and Q = ⟨Ω′, A′, P′⟩. The sets A × B have been assigned the probabilities P_2(A × B) := P(A)P′(B). This means that

(95)
P_2(A × Ω′) = P(A) · P′(Ω′) = P(A)
P_2(Ω × B) = P(Ω) · P′(B) = P′(B)

Now

(96)
P_2((A × Ω′) ∩ (Ω × B)) = P_2((A ∩ Ω) × (Ω′ ∩ B))
= P_2(A × B)
= P(A) · P′(B)
= P_2(A × Ω′) · P_2(Ω × B)

Proposition 13 The sets A × Ω′ and Ω × B are independent for all A ∈ A and B ∈ A′ in the space P ⊗ Q. □

Moreover, let B_1 be the algebra of sets of the form A × Ω′ and B_2 the algebra of sets of the form Ω × B. First we shall show that they actually are subalgebras.

Proposition 14 Let A and B be nontrivial boolean algebras. The map i_1 : A ↦ A × 1_B is an embedding of A into A ⊗ B. Similarly, the map i_2 : B ↦ 1_A × B is an embedding of B into A ⊗ B.

Proof. First, the map i_1 is injective: for let A, C be sets such that A × 1_B = C × 1_B. Since 1_B ≠ ∅ there is a b ∈ 1_B. For every a ∈ A, ⟨a, b⟩ ∈ A × 1_B, hence ⟨a, b⟩ ∈ C × 1_B, so a ∈ C. Similarly, for every c ∈ C, ⟨c, b⟩ ∈ C × 1_B = A × 1_B, whence c ∈ A. Therefore, A = C. Next, i_1(A ∪ C) = (A ∪ C) × 1_B = (A × 1_B) ∪ (C × 1_B) = i_1(A) ∪ i_1(C). Also, since −1_B = ∅, we have −(A × 1_B) = ((−A) × 1_B) ∪ (A × (−1_B)) ∪ ((−A) × (−1_B)) = (−A) × 1_B, so i_1(−A) = (−A) × 1_B = −i_1(A). Similarly for the second claim. □

The algebras i_1(A) and i_2(B) are independent, as we have just shown. They represent the algebra of events of performing the experiment for the first time

(B_1) and for the second time (B_2). The reassuring fact is that by grouping the two executions of the experiment into a single experiment we have preserved the probabilities, and moreover have shown that the two experiments are independent of each other in the formal sense.

Theorem 15 Let P = ⟨Ω, A, P⟩ and Q = ⟨Ω′, A′, P′⟩ be probability spaces. Then the algebras i_{Ω′}[A] = {A × Ω′ : A ∈ A} and j_Ω[A′] = {Ω × B : B ∈ A′} are independent subalgebras of A ⊗ A′.

Postscript

The case discussed above about adjusting the probabilities raises issues that are worth addressing. I choose again the case where what we know is that the coin is either unbiased or biased with p = 0.4. The probability P(p = 0.4) is denoted by α. Now, let us perform n experiments in a row, ending in k times H. (It is possible to perform the argument, with identical numerical result, using a particular experimental outcome rather than the event "k times H". This is because we are lumping together individual outcomes that always receive the same probability. Still, in general one should be aware of a potential confusion here.) Let us write β(n, k) for the probability that this happens in case the coin is biased, and ν(n, k) for the probability that this happens if the coin is unbiased. We have

(97) ν(n, k) = \binom{n}{k} · 1/2^n
(98) β(n, k) = \binom{n}{k} · 0.4^k 0.6^{n−k}

The unconditional a priori probability of ‘k times H’ is now

(99) ν(n, k) P(p = 0.5) + β(n, k) P(p = 0.4) = \binom{n}{k} ( 0.5^n (1 − α) + 0.4^k 0.6^{n−k} α )

We are interested in the a posteriori probability that the coin is biased with p = 0.4, based on the outcome "k times H". It is given by

(100)
P(p = 0.4 | k times H)
= P(k times H | p = 0.4) · P(p = 0.4)/P(k times H)
= \binom{n}{k} 0.4^k 0.6^{n−k} · α / ( \binom{n}{k} (0.5^n (1 − α) + 0.4^k 0.6^{n−k} α) )
= 0.4^k 0.6^{n−k} · α / ( 0.5^n (1 − α) + 0.4^k 0.6^{n−k} α )
= α / ( (0.5/0.4)^k (0.5/0.6)^{n−k} (1 − α) + α )

For a real number ρ, write

(101) f_ρ(α) := α/(ρ(1 − α) + α)

Our particular case above was n = 10 and k = 5. Here the a posteriori probability is

(102) f(α) = 0.201 · α/(0.246 − 0.045α)

So, in this case ρ = 0.246/0.201. This is an update on our prior probabilities. Several questions arise. First, does it matter which prior probabilities we had? The answer is easy: it does! To see this, just insert a different value into the function. For example, put in α = 0.25. Then you get

(103)
f(0.25) = 0.201 · 0.25/(0.246 − 0.045 · 0.25)
= 0.05025/0.23475
= 0.2141

So, if you already had the opinion that the coin was unbiased with probability 0.75, now you will believe that with probability 0.7859. Second question: is there a prior probability that would not be changed by this experiment? Intuitively, we would regard this as the ideal assumption, the one that is absolutely confirmed by the data. So, we ask if there are α such that

(104) f (α) = α

Or

(105) 0.201 · α/(0.246 − 0.045α) = α

The first solution is α = 0. Excluding this we can divide by α and get

(106)
0.201/(0.246 − 0.045α) = 1
0.201 = 0.246 − 0.045α
0.045α = 0.045
α = 1

Thus, there are two (!) a priori probabilities that are not affected by the data: α = 0 and α = 1. They represent the unshakeable knowledge that the coin is unbiased (α = 0) and the unshakeable knowledge that it is biased (α = 1). Especially the last one seems suspicious. How can the fact that we get 5 times H not affect this prior probability if it is the less likely outcome? The answer is this: it is not excluded that we get this outcome, and if we get it and simply know for sure that our coin is biased, what should ever make us revise our opinion? No finite evidence will be enough. We might say that it is unwise to maintain such an extreme position, but probability is not a normative science. Here we are just concerned with the revision suggested by the experiment given our prior probabilities. There is a temptation to use the data as follows. Surely, the outcome suggests that the posterior probability of bias is f(α) rather than α. So, knowing this we may ask: had we started with the prior probability f(α) we would now have reached f²(α) = f(f(α)), and had we started with the latter, we would now have reached f(f²(α)) = f³(α), and so on. Repeating this we are presumably converging to a limit, and this is the hypothesis that we should eventually adopt. It turns out that our function has the following behaviour. If α = 1 then f(α) = 1, so we already have an equilibrium. If α < 1 then f(α) < α, and by iteration we have a descending sequence

(107) α > f(α) > f²(α) > f³(α) > ...

The limit of this sequence is 0. So, the equilibrium solution α = 1 is unstable, while the solution α = 0 is stable, and it represents the firm belief that the coin is unbiased. So we might think that this is what we should adopt simpliciter. Seeing it this way, though, is to make a big mistake. What we are effectively doing is using the same data several times over to revise our probabilities. This is tantamount to assuming that we performed the experiment several times over with identical result. In fact we did the experiment once and therefore can use it only once to update our probabilities. To jump to the conclusion that we have to adopt the firm belief (in fact knowledge) that the coin is unbiased is unwarranted and dangerous. In probability, knowledge is time dependent; probabilities are time dependent. Data advises us to change our probabilities in a certain way. Once we have done that, the data has become worthless for the same purpose. (It can be used for a different purpose, such as revising other people's probabilities.) The importance of this cannot be overestimated. For example, suppose you read somewhere that spinach is good for you, presumably because it contains a lot of iron, and that this has been proved by experiment. The more often you read the same claim, the more you are inclined to believe that it is true. However, underlying this tendency is the assumption that all this is based on different experiments. Suppose, namely, you are told that everybody who writes this bases himself or herself, directly or indirectly, on the same study conducted some decades ago. In that case reading the claim for the second time should not (and hopefully will not) make you think it becomes more plausible: the experiment has been conducted once, and that's it. If however a new study supports the claim, then that is indeed a reason to believe more strongly in it. We can also prove formally that this is what we should expect. Notice the following equation:

(108)
f_ρ(f_ρ(α)) = f_ρ(α) / ( ρ(1 − f_ρ(α)) + f_ρ(α) )
= α / ( ρ(ρ(1 − α) + α − α) + α )
= α / ( ρ²(1 − α) + α )

      = f_{ρ²}(α)

Let us look at the number ρ for the experiment that we win k out of n times. It is

(109) π_{k,n} = (0.5/0.4)^k (0.5/0.6)^{n−k}

It is observed that

(110) π_{2k,2n} = π²_{k,n}

So we conclude that f²(α) = f_{π_{10,20}}(α), corroborating our claim that adjusting the probabilities twice yields the same result as if we had performed the experiment twice with identical outcome.
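This can be checked numerically. Here is a small R sketch (the function name fr and the sample values are ours, not from the text); it uses the updating map f_ρ(α) = α/(ρ(1 − α) + α) as read off from (108):

    # the posterior updating map from (108)
    fr <- function(rho, alpha) alpha / (rho * (1 - alpha) + alpha)
    rho   <- (0.5/0.4)^5 * (0.5/0.6)^5   # pi_{5,10}: 5 heads out of 10
    alpha <- 0.3                         # an arbitrary prior probability of bias
    fr(rho, fr(rho, alpha))              # updating twice with the same data ...
    fr(rho^2, alpha)                     # ... equals one update under rho squared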

6 Random Variables

Let P = ⟨Ω, A, P⟩ be a probability space and X : Ω → R. We call X a random variable if for every a ∈ R and every interval I = [a, b] we have X⁻¹({a}) ∈ A and X⁻¹[I] ∈ A. The reason this condition has been brought in is that we want X⁻¹[A] ∈ A whenever A is a certain set of reals (in general a so-called Borel set, but in fact it is enough to require that the preimage of a singleton and of a closed interval is an event). If P is discrete then any function into the real numbers is a random variable. We give an example. Suppose that a doctor has two kinds of patients, one with insurance A and one with insurance B. If a patient has insurance A the doctor gets 40 dollars per visit; if the patient is from insurance B he gets 55. The function X : {A, B} → R defined by X(A) := 40 and X(B) := 55 is a random variable over the space ⟨{A, B}, ℘({A, B}), P⟩ where P(A) = p and P(B) = 1 − p. Suppose that p = 1/3; how much money does the doctor get on average from every patient? The answer is

(111) (1/3) · 40 + (2/3) · 55 = 50

This value is known as the mean or expectation of X.

Definition 16 The expectation of a random variable X is defined by

(112) E(X) := Σ_{x∈R} x · P(X = x)

where P(X = x) = P(X⁻¹({x})).

We see that for this to be well-defined, X⁻¹({x}) must be an event in the probability space. We shall consider a special case where P has a density function p. Then the formula can be rendered as follows.

(113) E(X) := Σ_{ω∈Ω} X(ω) · p(ω)

There are many applications of this definition. For example, the words of English have a certain probability, and the function X that assigns to each word its length is a random variable for the discrete space over all words of English. Then E(X) is the expected length of a word. This has to be kept distinct from the notion of an average length, which is

(114) Σ_{~x∈L} X(~x) / |L|

where L is the set of English words. The average is taken over the words regardless of their probabilities. It says this: suppose you take out of L an arbitrary word; what length will it have? The expectation is different. It says: suppose you hit somewhere on a word (say you open a book and point randomly at a word in it). What length will it have? To answer the latter question we need to know what the likelihood is of hitting a specific word. Let us continue the example of the doctor. The average payment of the insurances is 47.50 dollars, but the doctor gets more, since the higher insured patients are more likely to show up. Another doctor might have a different ratio and his expected payment per patient may differ accordingly. Suppose we have a random variable X on a space P. From the definition it is clear that X is compatible with the algebra of events and therefore this defines a new probability space on the direct image X[Ω], as given in Section 4. For example, instead of talking about insurances as outcomes for the doctor we may simply talk about payments directly. The space is now {40, 55} and the probabilities are P′(40) = 1/3 and P′(55) = 2/3. This is a different way of looking at the same matter, and there is no obvious advantage of either viewpoint over the other. But this construction helps to explain why there are spaces different from the Laplace space. Suppose namely that there is a space Ω of outcomes; if we know nothing else, the best assumption we can make is that they are all equally likely to occur. This is the null hypothesis. For example, when we are shown a die we expect that it is fair, so that every number between 1 and 6 is equally likely. Now suppose we throw two dice simultaneously and report only the sum of the values. Then we get a different space, one that has outcomes 2, 3, ⋯, 12, with quite different probabilities. They arise as follows. We define a random variable X(⟨x, y⟩) := x + y on the set of outcomes. The probability of a number z is simply P(X⁻¹({z})). The new space is no longer a Laplace space, but it arose from a Laplace space by transforming it via a random variable. The Laplace space in turn arises through the null hypothesis that both dice are unbiased. Suppose that A ⊆ Ω is a set. Then let

(115) I(A)(ω) := 1 if ω ∈ A, and 0 else.
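For instance, the probabilities of the two-dice sums can be tabulated with a few lines of R (our sketch, not part of the text):

    # Laplace space for two fair dice: 36 equiprobable ordered pairs
    dice <- expand.grid(x = 1:6, y = 1:6)
    X <- dice$x + dice$y          # the random variable X(<x, y>) := x + y
    table(X) / nrow(dice)         # P(X^{-1}({z})) for z = 2, ..., 12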

The function I(A) is known as the indicator function of A. It is obviously a random variable, so its expectation can be computed. It is

(116) E I(A) = Σ_{ω∈Ω} p(ω) I(A)(ω) = Σ_{ω∈A} p(ω) = P(A)

Proposition 17 E I(A) = P(A).

If X and Y are random variables, we can define X + Y, αX and X · Y as follows.

(117) (X + Y)(ω) := X(ω) + Y(ω)
(118) (αX)(ω) := α · X(ω)
(119) (X · Y)(ω) := X(ω) · Y(ω)

These functions are not necessarily random variables themselves.

Proposition 18 Suppose that X and Y are random variables. If X + Y and αX are random variables, then

(120) E(X + Y) = E(X) + E(Y)
(121) E(αX) = α E(X)

Proof. Direct verification. We do the case where P has a density.

(122) E(X + Y) = Σ_{ω∈Ω} (X + Y)(ω) p(ω)
             = Σ_{ω∈Ω} (X(ω) + Y(ω)) p(ω)
             = Σ_{ω∈Ω} (X(ω) p(ω) + Y(ω) p(ω))
             = Σ_{ω∈Ω} X(ω) p(ω) + Σ_{ω∈Ω} Y(ω) p(ω)
             = E(X) + E(Y)

Also,

(123) E(αX) = Σ_{ω∈Ω} (αX)(ω) p(ω)
           = Σ_{ω∈Ω} α X(ω) p(ω)
           = α Σ_{ω∈Ω} X(ω) p(ω)
           = α E(X)        ⊣

These formulae are very important. We give an immediate application. Suppose we have a Bernoulli space P and a random variable X. Its expectation is E(X). The expectation per trial is the same no matter how often we repeat the experiment. In the case of the doctor, the average payment he gets from two patients is 50 per patient, therefore 100 in total. Indeed, let us perform a Bernoulli experiment twice. Then we may also view this as a single space P ⊗ P. Now we have two random variables, X₁(⟨x, y⟩) = X(x) and X₂(⟨x, y⟩) = X(y). The first variable returns the value for the first experiment, and the second the value for the second experiment. The variable (1/2)(X₁ + X₂) takes the mean or average of the two. We have

(124) E((1/2)(X₁ + X₂)) = E(X)

Equivalently, the expectation of X₁ + X₂ is 2 E(X). This is easily generalised to n-fold iterations of the experiment. The expectation need not be a value that the variable ever attains; an example has been given above. Moreover, one is often interested in knowing how far the actual values of the variable differ from the expected one. This is measured as follows.

(125) V(X) := E((X − E(X))²)

This is known as the variance of X. By way of example, let us calculate the variance of the identity function I on a Bernoulli experiment. First we have to establish its expectation (recall that p = p(1) and q = p(0) = 1 − p).

(126) E I = p · 1 + q · 0 = p

The experiment has two outcomes, 0 and 1, and the identity assigns 0 to 0 and 1 to 1. So

(127) V I = p · (1 − E I)² + q · (0 − E I)²
         = p(1 − p)² + q(−p)²
         = pq² + qp²
         = pq(q + p)
         = pq

This will be useful later. Also, the standard deviation of X, σ(X), is defined by

(128) σ(X) := √V(X)

Notice that X − E(X) is a random variable returning for each ω the difference X(ω) − E(X). This difference is squared and summed over all ω, where the sum is again weighted with the probabilities. In the case of the doctor, we have (X − E(X))(A) = −10 and (X − E(X))(B) = 5. Hence

(129) V(X) = (1/3) · (−10)² + (2/3) · 5² = 150/3 = 50

(That this equals the expected value is a coincidence.) Hence σ(X) = √50 ≈ 7.071. Thus, the payment that the doctor gets deviates from the expected value 50 by 7.071 on average. The deviation measures the extent to which the actual payments he receives differ from his expected payment. If the deviation is 0 then no payment is different: all payments equal the expected payment; the larger the deviation, the larger the differences between individual payments are expected to be. The formula for the variance can be simplified as follows.

(130) V(X) = E(X²) − (E(X))²

For a proof notice that

(131) E(X − E X)² = E((X − E X)(X − E X))
                 = E(X² − 2X · E X + (E X)²)
                 = E(X²) − 2 E((E X) · X) + (E X)²
                 = E(X²) − 2(E X)(E X) + (E X)²
                 = E(X²) − (E X)²

We shall use the notation X = x for the set of all outcomes ω such that X(ω) = x.
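In R, the doctor's expectation, variance and standard deviation come out as follows (a sketch of ours; the variable names are arbitrary):

    x <- c(40, 55)               # payments for insurances A and B
    p <- c(1/3, 2/3)             # their probabilities
    EX <- sum(x * p)             # expectation: 50
    VX <- sum((x - EX)^2 * p)    # variance: 50
    sqrt(VX)                     # standard deviation: about 7.071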

Definition 19 Let X and Y be random variables on a space P. X and Y are said to be independent if for all x, y ∈ R: P(X = x ∩ Y = y) = P(X = x) · P(Y = y).

Theorem 20 Let X and Y be independent random variables. Then E(X · Y) = E X · E Y and V(X + Y) = V X + V Y.

Proof. For the first claim, assume that X assumes the values {xᵢ : i ∈ I} and that Y assumes the values {y_j : j ∈ J}, and put Aᵢ := (X = xᵢ) and B_j := (Y = y_j). Then

(132) E(X · Y) = E((Σ_{i∈I} xᵢ I(Aᵢ))(Σ_{j∈J} y_j I(B_j)))
             = E(Σ_{i∈I, j∈J} xᵢ y_j I(Aᵢ ∩ B_j))
             = Σ_{i∈I, j∈J} xᵢ y_j E I(Aᵢ ∩ B_j)
             = Σ_{i∈I, j∈J} xᵢ y_j P(Aᵢ ∩ B_j)
             = Σ_{i∈I, j∈J} xᵢ y_j P(Aᵢ) P(B_j)
             = (Σ_{i∈I} xᵢ P(Aᵢ))(Σ_{j∈J} y_j P(B_j))
             = (E X) · (E Y)

Notice that the independence assumption entered in the form of the equation P(Aᵢ ∩ B_j) = P(Aᵢ) · P(B_j). Also, the expectation distributes over infinite sums just in case the sum is absolutely convergent. For the second claim notice first that if X and Y are independent, so are X − α and Y − β for any real numbers α and β. In particular, X − E X and Y − E Y are independent, so E((X − E X)(Y − E Y)) = E(X − E X) · E(Y − E Y) = 0. From this we deduce the second claim as follows.

(133) V(X + Y) = E((X − E X) + (Y − E Y))²
             = E(X − E X)² + 2 E((X − E X)(Y − E Y)) + E(Y − E Y)²
             = V X + V Y

since the middle term vanishes by independence.

⊣ This is but a special version of a more general result, which is similar in spirit to Theorem 15.

Theorem 21 Let P and Q be probability spaces, and X and Y random variables over P and Q respectively. Then define the following random variables X¹ and Y²:

(134) X¹(⟨ω₁, ω₂⟩) := X(ω₁)
      Y²(⟨ω₁, ω₂⟩) := Y(ω₂)

Then X¹ and Y² are independent random variables over P ⊗ Q.

The proof is relatively straightforward. (X¹ = ω₁) = (X = ω₁) × Ω′ and (Y² = ω₂) = Ω × (Y = ω₂), and so by definition

(135) P(X¹ = ω₁ ∩ Y² = ω₂) = P(X¹ = ω₁) · P(Y² = ω₂)

7 Expected Word Length

We shall present some applications of the previous definitions. The first concerns the optimal encoding of texts. The letters of the alphabet are stored in a computer in the so-called ASCII code. Here each letter receives an 8-bit sequence. In connection with coding we say that a letter code C is a map which assigns to each member a of an alphabet A a member of B∗, the so-called code word of a. A string x₁x₂⋯xₙ is then coded by C(x₁)C(x₂)⋯C(xₙ). There obviously are quite diverse kinds of codes, but we shall only consider letter codes. The Morse code is a letter code, but it is different from the ASCII in one respect. While some letters get assigned rather long sequences, some get quite short ones (e gets just one symbol, the dot). One difference between these codes is that the expected length of the coding is different. The Morse code is more efficient in that it uses fewer symbols on average. This is so since more frequent letters get assigned shorter sequences than less frequent ones. In statistical terms we say that the expected word length is smaller for the Morse code than for the ASCII code. This can be computed once the probabilities of the letters are known. Rather than doing that, we shall describe here a general method that allows one to generate an optimal letter coding (in the sense that it minimises the expected word length) on the basis of the probabilities for the letters. We are looking for a so-called prefix free code; this is a code where no prefix of a code word is a code word. This ensures unique readability. The Morse code is prefix free; this is so since we assign to each letter the Morse sequence plus the following silence. (You have to realize that the Morse code uses in total three letters: ·, − and silence.) The compression factor is the inverse of the expectation of the code length. Let X be the random variable which assigns to each a ∈ A the length of C(a). Then

(136) E(X) = Σ_{a∈A} X(a) p(a)

Since the length of the original symbol is 1 for each a ∈ A, a symbol is replaced on average by E X many symbols. This is why we call its inverse the compression factor. As the probabilities of the letters may be different, different codes may result in different compression factors. Obviously, it is best to assign the shortest words to those letters that are most frequent. We shall present an algorithm that produces a letter code into {0, 1}∗ that minimises the expected word length (and therefore maximises the compression factor). This is the so-called Huffman code. (As we shall see, there typically are several such codes for any given alphabet. But they result in the same compression factor.) We give an example. Let the alphabet be {a, b, c, d} with the following frequencies:

(137) p(a) = 0.1, p(b) = 0.2, p(c) = 0.3, p(d) = 0.4

To start we look for a pair of letters whose combined frequency is minimal. For example, a and b together occur 3 out of 10 times, and no other pair of letters occurs that rarely. We introduce (temporarily) a new letter, x, which replaces both a and b. The new alphabet is {x, c, d}. The frequencies are

(138) p(x) = 0.3, p(c) = 0.3, p(d) = 0.4

We repeat the step: we look for a pair of letters whose combined frequency is minimal. There is only one possibility, namely x and c. We add a new letter, y, which represents both x and c. This gives the alphabet {y, d}, with frequencies

(139) p(y) = 0.6, p(d) = 0.4

We repeat the same step again. This time there is no choice, the pair is bound to be y and d. Let z be a new letter, with frequency 1. Now we start to assign code words. We assign the empty word to z. Now z actually represents a pair of letters, y and d. Therefore we expand the word for z as follows: we append 0 to get the word for y and we append 1 to get the word for d. We repeat this for y, which represents x and c. We append 0 for x and 1 for c. As x represents both a and b, we append 0 to get the code for a and 1 to get the code for b. These are now the code words:

(140) C(a) = 000, C(b) = 001, C(c) = 01, C(d) = 1
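The construction just carried out by hand is easy to mechanise. The following R sketch (function name and implementation ours, not from the text) repeatedly merges the two least frequent letters, prepending 0 and 1 to the code words of the two merged groups; since ties may be broken differently, it may return a different, but equally optimal, code:

    huffman <- function(p) {
      # p: named vector of letter frequencies
      codes <- setNames(rep("", length(p)), names(p))
      groups <- as.list(names(p))       # groups[[i]]: letters behind item i
      while (length(p) > 1) {
        o <- order(p)[1:2]              # the two least frequent items
        for (s in groups[[o[1]]]) codes[s] <- paste0("0", codes[s])
        for (s in groups[[o[2]]]) codes[s] <- paste0("1", codes[s])
        groups[[o[1]]] <- c(groups[[o[1]]], groups[[o[2]]])
        p[o[1]] <- p[o[1]] + p[o[2]]    # merge into a new temporary 'letter'
        groups <- groups[-o[2]]
        p <- p[-o[2]]
      }
      codes
    }
    p <- c(a = 0.1, b = 0.2, c = 0.3, d = 0.4)
    C <- huffman(p)       # code word lengths 3, 3, 2, 1 (one optimal variant)
    sum(nchar(C) * p)     # expected word length: 1.9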

The expected word length is

(141) L(C) = 0.1 × 3 + 0.2 × 3 + 0.3 × 2 + 0.4 × 1 = 1.9

The compression is therefore 1/1.9 = 0.5263. Suppose that instead we had used the following block code, which is a code that assigns words of equal length to each letter.

(142) C′(a) = 00, C′(b) = 01, C′(c) = 10, C′(d) = 11

Then the expected word length is

(143) L(C′) = 0.1 × 2 + 0.2 × 2 + 0.3 × 2 + 0.4 × 2 = 2

with compression rate 0.5. Thus, the Huffman code is better than the block code. Notice that we claimed above that there are several optimal codes. Indeed, there are several places where we made a choice. For example, we chose to append 0 to represent y and 1 to represent d. Obviously we could have chosen 1 to represent y and 0 to represent d. This would have given the code

(144) C(a) = 100, C(b) = 101, C(c) = 11, C(d) = 0

Every time we choose a pair of letters we face a similar choice for the code. Also, it may occur that there are two or more pairs of letters that have the same minimal frequency. Again several possibilities arise, but they all end up giving a code with the same expected word length and compression. We shall briefly comment on the possibility of improving the compression. One may calculate that a somewhat better compression can be reached if the text is cut into chunks of length 2, and if each 2-letter sequence is coded using the Huffman code. Thus the alphabet is A × A, which contains 16 'letters'. A still better coding is reached if we divide the text into blocks of length 3, and so on. The question is whether there is an optimal bound for codes. It is represented by the entropy. Suppose that A = {aᵢ : 1 ≤ i ≤ n}, and that pᵢ is the frequency of letter aᵢ.

(145) H(p) := −Σ_{i=1}^{n} pᵢ log₂ pᵢ

For our original alphabet this is

(146) −(0.1 log₂ 0.1 + 0.2 log₂ 0.2 + 0.3 log₂ 0.3 + 0.4 log₂ 0.4) = 1.8464

The optimal compression that can be reached is therefore 1/1.8464 = 0.5416. These results hold only if the probabilities are independent of their relative position, that is to say, if some letter occurs at position i, the conditional probability of another letter occurring at position i + 1 (or at any other given position) is just its probability. One says that the source has no memory. Suppose that we have an alphabet of 3 letters: a, b and blank (□). The probabilities are as follows.

(147) p(a) = 1/2, p(b) = 1/3, p(□) = 1/6

A text is a string over this alphabet. We assume (somewhat unrealistically) that the probability of each letter in the text is independent of the other letters. So, a occurs after a with the same probability as it occurs after b or blank, namely 1/2. Later we shall turn to Markov models, where it is possible to implement context restrictions on the frequencies. Words in a text are maximal subsequences which do not contain □. We shall establish first the probability that a certain word occurs in a text. For example, the word a occurs with probability 1/10, while b occurs with probability 1/15. To see this, let x be an occurrence of blank (□) in the text which precedes a letter. One would ordinarily assume that the probability that the next symbol is a is exactly 1/2. But this is not so. For we have placed the blank such that it precedes a letter, in other words such that the next symbol is not a blank. This therefore calls for the following conditional probability:

(148) P({a}|{a, b}) = (1/2)/(5/6) = 3/5

Now, it is not enough that the next letter is a; we also must have that the letter following it is blank. Therefore, we only have an occurrence of the word a if the next letter is □, which adds a factor of 1/6. We write p_w(~x) for the probability that the word ~x occurs. Hence we get p_w(a) = 1/10. Likewise, the probability of the word b is (2/5) · (1/6) = 1/15. Here are now the probabilities of words of length 2:

(149) p_w(aa) = 1/20    p_w(ab) = 1/30
      p_w(ba) = 1/30    p_w(bb) = 1/45

The general formula is this:

(150) p_w(x₁x₂⋯xₙ) = (1/5) ∏_{i=1}^{n} p(xᵢ)

The additional factor 1/5 derives, as explained above, from two facts: first, that we have chosen a position which is followed by a nonblank (a factor 6/5), and second, that the sequence we are considering must be followed by a blank (a factor 1/6).
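These word probabilities can be checked by simulation. In the following R sketch (ours), the blank is written as "_", and we estimate the probability that the word beginning after a given blank is exactly the word a:

    set.seed(1)
    n <- 200000
    text <- sample(c("a", "b", "_"), n, replace = TRUE, prob = c(1/2, 1/3, 1/6))
    # blanks that precede a nonblank, i.e. positions where a word starts
    starts <- which(text[1:(n - 2)] == "_" & text[2:(n - 1)] != "_")
    mean(text[starts + 1] == "a" & text[starts + 2] == "_")   # close to 1/10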

Now let X be a random variable assigning each word its length. Thus X(x₁x₂⋯xₙ) = n. We want to know its expectation. This will give us an estimate of the length of a randomly chosen word. We shall first derive a formula for the probability that a word has length n. For n = 1 it is 1/10 + 1/15 = 5/30 = 1/6. For n = 2 it is

(151) 1/20 + 1/30 + 1/30 + 1/45 = (9 + 6 + 6 + 4)/180 = 25/180 = 5/36

This suggests 5^{n−1}/6^n as the general probability. Indeed, there is an easy way to see this. The probability that the ith letter is either a or b is 5/6. We have n letters, therefore the probability is (5/6)^n. Next, the following letter must be the blank, so we have to multiply by 1/6; and since, as before, the starting position was chosen so as to precede a nonblank, we multiply by 6/5. This gives (6/5) · (5/6)^n · (1/6) = 5^{n−1}/6^n. It is checked that

(152) Σ_{n=1}^{∞} 5^{n−1}/6^n = (1/6) Σ_{n=0}^{∞} (5/6)^n = (1/6) · 1/(1 − 5/6) = (1/6) · 6 = 1

To see this, notice that

Lemma 22 Let p ≠ 1 be a real number. Then

(153) Σ_{n=0}^{m} pⁿ = (1 − p^{m+1})/(1 − p)

Proof. By induction on m. For m = 0, the sum extends over p⁰ = 1 only. The term on the right is (1 − p¹)/(1 − p) = 1. Now assume the formula holds for m. Then we have

(154) Σ_{n=0}^{m+1} pⁿ = (1 − p^{m+1})/(1 − p) + p^{m+1}
                     = (1 − p^{m+1} + p^{m+1}(1 − p))/(1 − p)
                     = (1 − p^{m+1} + p^{m+1} − p^{m+2})/(1 − p)
                     = (1 − p^{m+2})/(1 − p)

⊣ Notice that the formula holds for all reals except p = 1, for which the sum is simply m + 1. To finish the proof, notice that if |p| < 1 then the value of (1 − p^{m+1})/(1 − p) approaches 1/(1 − p) for growing m. So the infinite sum actually equals 1/(1 − p).

So indeed this is a probability function. We are now ready for the expected length:

(155) E(X) = Σ_{n=1}^{∞} n · 5^{n−1}/6ⁿ = (1/6) Σ_{n=0}^{∞} (n+1)(5/6)ⁿ

Now we need to evaluate a sum over a series of the form Σ (n+1)pⁿ.

Proposition 23 Let p ≠ 1 be a real number. Then

(156) Σ_{n=0}^{m} (n+1)pⁿ = ((m+1)/(p−1) − 1/(p−1)²)p^{m+1} + 1/(p−1)²

Proof. Let m = 0. Then the left hand side equals 1. The right hand side equals

(157) (1/(p−1) − 1/(p−1)²) · p + 1/(p−1)² = (p(p−1) − p + 1)/(p−1)² = (p² − 2p + 1)/(p−1)² = 1

Now we proceed to the inductive step. We must show that the right hand side grows by (m+2)p^{m+1} when we pass from m to m + 1:

(158) ((m+2)/(p−1) − 1/(p−1)²)p^{m+2} − ((m+1)/(p−1) − 1/(p−1)²)p^{m+1}
    = p^{m+1}((p(m+2) − (m+1))/(p−1) − (p − 1)/(p−1)²)
    = p^{m+1}(((m+1)(p−1) + p)/(p−1) − 1/(p−1))
    = p^{m+1} · (m+2)(p−1)/(p−1)
    = (m+2)p^{m+1}

This is as promised. ⊣ As m grows large, (156) approaches

(159) 1/(1 − p)²

With p = 5/6 this becomes 6² = 36. We insert this into (155) and get

(160) E(X) = 36/6 = 6

A final remark. The expected word length does not depend on the individual frequencies of the letters. All that it depends on is the probability of the blank. If we convert this to a probability space, we get ⟨N, ℘(N), P⟩ with p(n) = pⁿ(1 − p) and P(A) := Σ_{k∈A} p(k). As we have seen, the probabilities sum to 1. The exponential decline in probability occurs in many other circumstances. There is an observation due to Zipf that the frequency of words decreases exponentially with their length. This cannot be deduced from our results above, for the letter probabilities are not independent. Let us investigate somewhat more closely the frequencies of words. We have just seen that if the letters are randomly distributed, the probability of a word decreases exponentially with its length. On the other hand, there are exponentially many words of length n. Let us define a bijective function f from the natural numbers onto the set of words such that the frequency of f(n+1) is less than or equal to the frequency of f(n). In other words, f orders the words according to their frequency, and it is sometimes called the rank function; f(1) is the most frequent word, followed by f(2) with equal or less probability, followed by f(3), and so on. It is not necessarily the case that a more frequent word is also shorter. To continue our example,

(161) p_w(aaaa) = (1/5) · (1/2)⁴ = 1/80 > p_w(bbb) = (1/5) · (1/3)³ = 1/135

On a larger scale, however, it is true that the less frequent an item is, the longer it is. To make the computations simple, let us assume that all symbols different from the blank have the same frequency α. Furthermore, let there be d many nonblank letters. This means that the probability that a given word has length n is exactly αⁿ(1 − α), and so the probability of any given word of length n is αⁿ(1 − α)/dⁿ, since there are dⁿ words of length n. Now, if k < ℓ then we have p_w(f(k)) ≥ p_w(f(ℓ)), since probability is monotonically decreasing in length. (If letters are unequally distributed this would not be strictly so, as we have seen above.) The item number kₙ := (d^{n+1} − 1)/(d − 1) is the first to have length n. Solving this for n we get

(162) n = log_d(kₙ(d − 1) + 1) − 1

For a real number r, let ⌊r⌋ denote the largest integer ≤ r. For the length ℓ(k) of the kth item in the probability ranking we then derive:

(163) ⌊log_d(k(d − 1) + 1) − 1⌋ ≤ ℓ(k) < ⌊log_d(k(d − 1) + 1) − 1⌋ + 1

(164) log_d(k(d − 1) + 1) − 1 ≈ log_d(k(d − 1))

Noticing that log_d x = log₂ x / log₂ d and log_d(k(d − 1)) = log_d(d − 1) + log_d k, we can say that

(165) ℓ(k) ≈ β + α log₂ k

for some α and β. Notice that this is an asymptotic formula; where the original function took only positive integer values, this one is a real valued function which moreover is strictly monotone increasing. Nevertheless, we have shown that there is some reason to believe that the length of a word increases logarithmically as a function of its rank on the probability scale. Finally, let us insert this into the formula for probabilities. Before we do this, however, note that the original formula was based on the length being a discrete parameter, with values being positive integers. Now we are actually feeding in real numbers. This has to be accounted for by changing some numerical constants. So we are now operating with the assumption that the probability of a word depends only on its length x and equals

(166) γ · 2^{θx}

for some γ and θ to be determined. Inserting the formula for the length based on the rank:

(167) p(k) := γ 2^{θ(β + α log₂ k)} = γ 2^{θβ} 2^{θα log₂ k} = γ 2^{θβ} k^{θα} = ξ k^{θα}

where ξ := γ 2^{θβ}. Actually, the value of ξ can be expressed as follows. By definition of the Riemann zeta function,

(168) ζ(x) := Σ_{k=1}^{∞} 1/k^x

we immediately get

(169) ξ = ζ(−θα)⁻¹

This is done to ensure that the probabilities sum to exactly 1. Thus, the higher an item is on the rank scale, the less probable it is, and the longer it is. The length increases logarithmically and the probability decreases exponentially with the rank. This is known as Zipf's Law. Again, notice that we have not, strictly speaking, derived it. Our assumptions were drastic: the probabilities of letters are all the same and they do not depend on the position the letters occur in. Moreover, while this ensures that probabilities can be equal among words of adjacent rank, we have fitted a curve there that decreases strictly from one rank to the next. Thus, on this model no two words have the same probability.
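The rank/length relation (163) is easy to visualise in R (our sketch, using the text's convention for kₙ, here with d = 2 nonblank letters):

    d <- 2
    k <- 1:1000
    len <- floor(log(k * (d - 1) + 1, base = d) - 1)   # the bound from (163)
    plot(k, len, type = "s", log = "x",
         xlab = "rank k", ylab = "word length")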

8 The Law of Large Numbers

The probabilities that occur in the definition of a probability space are simply numbers. But these numbers have a concrete meaning. They say that an event A occurs with chance P(A) and does not occur with chance 1 − P(A). If P(A) = 0.7 we expect that in 7 out of 10 cases A is the case. If we perform the experiment, however, it can only do one thing: occur or not occur. The probability becomes certainty. Thus it is absolutely useless to assign probabilities to an experiment that can only be conducted once. However, if we can arbitrarily repeat the experiment, we can actually make sense of the probabilities as follows. If P(A) = 0.7 we expect that in 7 out of 10 experiments A obtains. Now, we have seen earlier that this does not mean that when we perform the experiment 10 times, A must hold exactly 7 times. To see this, let us calculate the probabilities in detail. The experiment is a Bernoulli experiment with p = 0.7. The chance that A obtains in all 10 trials, for example, is 0.7¹⁰. Let αᵢ be the event that A occurs exactly i times:

(170) P(α₀) = 0.3¹⁰ = 0.00000590
      P(α₁) = (10 choose 1) · 0.3⁹ · 0.7 = 0.00013778
      P(α₂) = (10 choose 2) · 0.3⁸ · 0.7² = 0.00144670
      P(α₃) = (10 choose 3) · 0.3⁷ · 0.7³ = 0.00900169
      P(α₄) = (10 choose 4) · 0.3⁶ · 0.7⁴ = 0.03675691
      P(α₅) = (10 choose 5) · 0.3⁵ · 0.7⁵ = 0.10291935
      P(α₆) = (10 choose 6) · 0.3⁴ · 0.7⁶ = 0.20012095
      P(α₇) = (10 choose 7) · 0.3³ · 0.7⁷ = 0.26682793
      P(α₈) = (10 choose 8) · 0.3² · 0.7⁸ = 0.23347444
      P(α₉) = (10 choose 9) · 0.3 · 0.7⁹ = 0.12106082
      P(α₁₀) = 0.7¹⁰ = 0.02824752
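These values can be reproduced in R with the built-in binomial density (our one-liner, not part of the text):

    dbinom(0:10, size = 10, prob = 0.7)   # P(alpha_0), ..., P(alpha_10)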

We can see two things: none of the outcomes is impossible, but the outcome α₇ is more likely than the others. The events α₆ ∪ α₇ ∪ α₈ together have probability 0.7, roughly. If we deviate by 1 from the expected outcome, the probability is 0.7; if we deviate by up to 2 from the expected result the probability is even larger: it exceeds 0.9. Now suppose we repeat the experiment 100 times; what do we get? Rather than do the calculations (which involve quite large numbers) we give the answer right away: the likelihood that the mean deviates from the expected value, 0.7, by at most 0.1 becomes larger. This means that the mean is less likely to deviate from the expected value as the number of iterations gets larger. This is known as the law of large numbers. We shall prove it rigorously. We begin with an easy observation.

Lemma 24 (Chebyshev) Let X be a positive random variable and ε > 0. Then P(X ≥ ε) ≤ E(X)/ε.

Proof. Let I(A) be the function such that I(A)(x) = 1 iff x ∈ A and 0 else. Then

(171) X ≥ X · I(X ≥ ε) ≥ εI(X ≥ ε)

This is seen as follows. Suppose X(ω) < ε. Then X(ω) ≥ X(ω) · I(X ≥ ε)(ω) = εI(X ≥ ε)(ω) = 0, because I(X ≥ ε)(ω) = 0 and X is positive. Now assume that X(ω) ≥ ε. Then I(X ≥ ε)(ω) = 1 and so X(ω) ≥ X(ω) · I(X ≥ ε)(ω) ≥ εI(X ≥ ε)(ω). Now we obtain

(172) E X ≥ E(εI(X ≥ ε)) = εP(X ≥ ε)

This holds because in general if X ≥ Y, that is, if for all ω: X(ω) ≥ Y(ω), then E X ≥ E Y. And E I(A) = P(A) for every A. For

(173) E I(A) = Σ_{ω∈Ω} I(A)(ω) · p(ω) = Σ_{ω∈A} p(ω) = P(A)

⊣ The following are immediate consequences.

Corollary 25 Let X be a random variable. Then

➊ P(|X| ≥ ε) ≤ E(|X|)/ε.

➋ P(|X| ≥ ε) = P(X² ≥ ε²) ≤ (E X²)/ε².

➌ P(|X − E X| ≥ ε) ≤ (V X)/ε².

Proof. The first follows from Lemma 24 by noting that |X| is a random variable taking only positive values. The second follows since X² also is a positive random variable. Finally, notice that the variance of X is the expectation of (X − E X)², so the third follows from the second if we substitute X − E X for X in it. ⊣

Let Xₙ be the following random variable on {0, 1}ⁿ:

(174) Xₙ(⟨x₁, x₂, ⋯, xₙ⟩) := Σ_{i=1}^{n} xᵢ

The expectation is pn, as we have noted earlier. We ask what the probability is that Xₙ deviates more than εn from this value. This is equivalent to asking whether the mean Xₙ/n deviates more than ε from p when the experiment is repeated n times. We calculate

(175) P(|Xₙ/n − p| ≥ ε) ≤ V(Xₙ/n)/ε²

Let us therefore calculate the variance of Xₙ/n. To obtain it, recall that if Y and Z are independent, V(Y + Z) = V Y + V Z. Define variables Xᵏ : {0, 1}ⁿ → {0, 1} by

(176) Xᵏ(⟨x₁, x₂, ⋯, xₙ⟩) := x_k

These variables are independent. To see this, let A be the set of all ω = ⟨x₁, x₂, ⋯, xₙ⟩ such that x_j = a. The probability of this set is exactly p if a = 1 and q otherwise. For the set has the form

(177) {0, 1}^{j−1} × {a} × {0, 1}^{n−j}

and so by Theorem 21, Pⁿ(A) = P({a}). Similarly, let B be the set of all ω such that x_k = b, where k ≠ j. Then its probability is P({b}). Finally, the probability of A ∩ B is P({a}) · P({b}), by the same reasoning.

Now that we have established the independence of the Xᵏ, notice that Xₙ = X¹ + X² + ⋯ + Xⁿ. For each of the Xᵏ we know that they have the same expectation and variance as the identity on the Bernoulli experiment, which by (127) has variance pq. Therefore,

(178) V Xₙ = Σ_{ω∈{0,1}ⁿ} pₙ(ω)(Xₙ(ω) − pn)² = npq

Inserting this back into (175) we get

(179) P(|Xₙ/n − p| ≥ ε) ≤ pq/(nε²)

Observe that pq ≤ 1/4, so that we can make the estimate independent of p and q:

(180) P(|Xₙ/n − p| ≥ ε) ≤ 1/(4nε²)

Now choose ε as small as you like. Furthermore, choose the probability δ of deviation by more than ε as small as you like. We can choose an n, independent of p and q, such that performing the experiment at least n times will guarantee with probability 1 − δ that the mean of the variable X deviates from E X by at most ε. Indeed, just choose

(181) n ≥ 1/(4ε²δ)

and then

(182) P(|Xₙ/n − p| ≥ ε) ≤ 1/(4nε²) ≤ 1/(4ε² · (4ε²δ)⁻¹) = δ

In mathematics, the fact that for large n the probability approaches a certain value is expressed as follows.

Definition 26 Let f(n) be a function from natural numbers to real numbers. We write lim_{n→∞} f(n) = b iff the following holds: for every ε > 0 there is an n(ε) such that for all n ≥ n(ε) we have

(183) |f(n) − b| < ε

This says in plain words that for any error ε we choose there is a point from which on the values of the sequence are found within the error margin ε of the value b. This entails that the values of the sequence get closer and closer to b. Such statements are often found in statistics. We first name an interval (the confidence interval) within which values are claimed to fall, and then we issue a probability with which they will actually fall there. The probability is often either a number close to 1, or it is a small number, in which case one actually gives the probability that the values will not fall into the named interval. Given this definition, we may now state:

Theorem 27 Let P be a Bernoulli space, and let X be a random variable. Define as above the random variable Xₙ := Σ_{i=1}^{n} Xⁱ on the n-fold product of P with itself. Then

(184) lim_{n→∞} P(|Xₙ/n − E X| ≥ ε) = 0

In plain language this means that for large n the probability that in an n-fold repetition of the experiment the mean of the results deviates from the expectation by any given margin is as small as we desire.
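The theorem is easy to visualise by simulation. The following R sketch (ours, with arbitrary sample size) plots the running mean of repeated Bernoulli trials with p = 0.7; it settles near 0.7:

    set.seed(42)
    p <- 0.7
    x <- rbinom(10000, size = 1, prob = p)   # 10000 Bernoulli trials
    plot(cumsum(x) / seq_along(x), type = "l",
         xlab = "n", ylab = "mean of first n trials")
    abline(h = p, lty = 2)                   # the expectation p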

9 Limit Theorems

There is another way to establish bounds for n, and it involves a different technique of approximating the values of the binomials. Recall that the probability of getting k out of n times the result 1 is

(185) Pₙ(k) = (n choose k) · p^k (1 − p)^{n−k}

We shall show here that for large n the value of Pₙ(k) can be approximated by a continuous function. As we shall see, there are several advantages to such theorems. One is that we have to know the values of only one function, namely e^{−x²/2}, in order to compute these probabilities. However, the latter is quite a difficult function which cannot be calculated easily. This is why it used to be tabulated before there were any computers. However, one might think that when computers evaluate Pₙ(k) they could do this without the help of the exponential function. This is not quite correct. The trouble is that the numbers occurring in the expression are very large and exceed the memory of a computer. For example, in P₁₀₀₀(712) we have to evaluate 1000!/(712! · 288!). The numbers are astronomical! (These large numbers can be avoided through sophisticated methods, but the problem remains essentially the same: doing a lot of multiplications means accumulating errors.) Using the exponential function we can avoid all this. Moreover, the error we are making is typically quite small.

Theorem 28 (Local Limit Theorem) Let 0 < p < 1. Then

(186) Pₙ(k) ≈ (1/√(2πnpq)) e^{−(k−np)²/(2npq)}

uniformly for all k such that |k − np| = O(√(npq)).

We are not giving a proof of the theorem here. However, we shall explain the phrase that the formula holds uniformly for all k such that |k − np| = O(√(npq)). The claim is that the continuous function on the right approximates Pₙ(k) for large n. This in turn means that for any error ε > 0, as small as we like, and any k there is an n(ε) such that for all n ≥ n(ε) we have

(187) |Pₙ(k) − (1/√(2πnpq)) e^{−(k−np)²/(2npq)}| < ε

Moreover, it is claimed that n(ε) can be chosen independently of k as long as in the limit |k − np| ≤ C√(npq). (But evidently, n(ε) does depend on ε, because the smaller the error the larger we have to choose the value n(ε).) In practice this means that the approximation is good for values not too far from np; additionally, the closer p is to 1/2 the better the approximation. It should be borne in mind that we only have an approximation here. Often enough one will use these formulae for finite experiments. The error that is incurred must be kept low. To do this, one has to monitor the difference k − np as well as the bias p! Now let us get back to the formula (186). Put

(188) x := (k − np)/√(npq)

Then (186) becomes

(189) Pₙ(np + x√(npq)) ≈ (1/√(2πnpq)) e^{−x²/2}

Notice that while k is a discrete parameter (it can only take integer values), so is np + x√(npq) on the left hand side of the equation. On the right hand side, however, we have a continuous function. This has an important consequence. Suppose we want to compute the sum over certain k within an interval. We shall replace this sum by a corresponding integral of the function on the right. In general, this is done as follows. There is a theorem which says that if f is a continuous function then for any two values x₀ and x₁ there is a ξ such that x₀ ≤ ξ ≤ x₁ and

(190) f(ξ) = (1/(x₁ − x₀)) ∫_{x₀}^{x₁} f(x) dx

(This is known as the mean value theorem.) It is of particular interest because if the difference x₁ − x₀ gets small, f(ξ) − f(x₀) also becomes small, so that we can write

(191) f(x₀) ≈ (1/(x₁ − x₀)) ∫_{x₀}^{x₁} f(x) dx

We have used ≈ here to abbreviate the statement: for any ε > 0 there is a δ such that if x₁ − x₀ < δ then

(192) |f(x₀) − (1/(x₁ − x₀)) ∫_{x₀}^{x₁} f(x) dx| < ε

Equivalently, this is phrased as

(193) f(x₀) = lim_{x₁→x₀} (1/(x₁ − x₀)) ∫_{x₀}^{x₁} f(x) dx

We now insert (191) into (189). The difference x_{k+1} − x_k is exactly ∆ = 1/√(npq), and so

(194) Pₙ(np + x_k√(npq)) ≈ ∆ · (1/√(2π)) e^{−x_k²/2}
                        ≈ (√(npq)/√(2πnpq)) ∫_{x_k}^{x_k+∆} e^{−y²/2} dy

                        = (1/√(2π)) ∫_{x_k}^{x_k+∆} e^{−y²/2} dy

As said above, this is justified by the fact that if n is large, the function does not change much in value within the interval [x_k, x_k + ∆], so we may assume that by exchanging it for the constant function we do not incur a big error. (In fact, by the limit theorems we can make the error as small as we need it to be.) This leads to the following. Put

(195) Qₙ(a, b] := Σ_{a<x_k≤b} Pₙ(np + x_k√(npq))

Theorem 29 (De Moivre–Laplace)

(196) lim_{n→∞} |Qₙ(a, b] − (1/√(2π)) ∫_a^b e^{−x²/2} dx| = 0

Thus, whatever the values of p (and q) are, all we have to do is calculate a certain integral of the function (1/√(2π)) e^{−x²/2}. In particular, putting

(197) Φ(x) := (1/√(2π)) ∫_{−∞}^{x} e^{−y²/2} dy

we get

(198) Qₙ(a, b] ≈ Φ(b) − Φ(a)

Unfortunately, the function cannot be given an analytical expression, so one cannot calculate it directly. Instead, one has to use either tables or the computer for the values of the function. There are results that give estimates on the deviation of Φ(x) from the actual probabilities. Define first the following function:

(199) Fₙ(x) := Pₙ(−∞, x] = P((Xₙ − np)/√(npq) ≤ x)

This is a function from real numbers to real numbers, but it assumes only discrete values. For if x_k ≤ x < x_{k+1} then Fₙ(x) = Fₙ(x_k), because the variable inside can only assume the values x_m, 0 ≤ m ≤ n. As can be seen, this function can be written as a sum:

(200) Fₙ(x) = Σ_{x_k≤x} Pₙ(k)

Notice that while Pₙ is a function from natural numbers to reals, Fₙ is a function from reals to reals, and uses the rescaled values.

Theorem 30 (Berry–Esseen)

(201) sup_{−∞<x<∞} |Fₙ(x) − Φ(x)| ≤ (p² + q²)/√(npq)

In plain words this means that the greatest difference between the sum Fₙ(x) (which is the actual probability) and the integral of the normal density is less than or equal to (p² + q²)/√(npq). This says more than the limit theorem, since we additionally require that the value of |Fₙ(x) − Φ(x)| never exceeds (p² + q²)/√(npq). Thus, the error can be made uniformly small if n is chosen large. However, notice again that for small values of either p or q the estimates are rather poor, which in turn means that large n have to be considered. For example, let p = 0.1. Then (p² + q²)/√(pq) = 0.82/√0.09 = 0.82/0.3 ≈ 2.73, while for p = q = 0.5 we have (p² + q²)/√(pq) = 2p²/p = 2p = 1. The best bound is therefore 1/√n, which means that to get an error of at most 0.1 you need n ≥ 100; to get an error of at most 0.01 you need n ≥ 10000. It should be stressed that one normally computes Pₙ(a, b], which is Fₙ(b) − Fₙ(a).

Again this number can be approximated by Φ(b) − Φ(a). For the error the upper bound is twice the value given above:

(202) |(Fₙ(b) − Fₙ(a)) − (Φ(b) − Φ(a))| ≤ |Fₙ(a) − Φ(a)| + |Fₙ(b) − Φ(b)|
                                       ≤ 2(p² + q²)/√(npq)

Notice, however, that the function Φ(x) has been obtained by transforming the values of k using (188). It is customary to express this transformation using expectation and standard deviation. Notice that np is the expectation of Xₙ, and that √(npq) is the standard deviation of Xₙ. Let us drop the index n here. Then we may say the following. Let µ denote the expectation of X and σ its standard deviation (which we may either obtain directly or by doing 'enough' experiments). The distribution of the variable X (for large n) is then approximately given by the density (1/(σ√(2π))) e^{−((x−µ)/σ)²/2}. Let us return to our experiment of Section 8. We shall use the exponential function to calculate the probabilities. To do this, notice that from (188) we calculate as follows.

(203) x₀ = (0 − 10 · 0.7)/√(10 · 0.7 · 0.3) = −4.8305
      x₁ = (1 − 10 · 0.7)/√(10 · 0.7 · 0.3) = −4.1404
      x₂ = (2 − 10 · 0.7)/√(10 · 0.7 · 0.3) = −3.4503
      x₃ = (3 − 10 · 0.7)/√(10 · 0.7 · 0.3) = −2.7603
      x₄ = (4 − 10 · 0.7)/√(10 · 0.7 · 0.3) = −2.0702
      x₅ = (5 − 10 · 0.7)/√(10 · 0.7 · 0.3) = −1.3801
      x₆ = (6 − 10 · 0.7)/√(10 · 0.7 · 0.3) = −0.6901
      x₇ = (7 − 10 · 0.7)/√(10 · 0.7 · 0.3) = 0
      x₈ = (8 − 10 · 0.7)/√(10 · 0.7 · 0.3) = 0.6901
      x₉ = (9 − 10 · 0.7)/√(10 · 0.7 · 0.3) = 1.3801
      x₁₀ = (10 − 10 · 0.7)/√(10 · 0.7 · 0.3) = 2.0702

(204) k    F₁₀(k)   Φ(x_k)
      0    0.000    0.000
      1    0.000    0.000
      2    0.002    0.000
      3    0.011    0.003
      4    0.047    0.019
      5    0.150    0.084
      6    0.350    0.245
      7    0.617    0.500
      8    0.851    0.755
      9    0.972    0.916

From our earlier estimates we expect the error to be less than

(205) (0.09 + 0.49)/√(10 · 0.7 · 0.3) = 0.58/√2.1 = 0.40

In effect, the precision is much better. To view the effect via R, we define the following functions:

(206) Hx <- function (n, p) ((0:n) - n * p) / ((n * p * (1 - p)) ** 0.5)
      Hy <- function (n, p) choose(n, 0:n) * (p ** (0:n)) *
              ((1 - p) ** (n - (0:n))) * ((2 * pi * n * p * (1 - p)) ** 0.5)

This defines, for given n and p, the vectors consisting of the points ⟨(i − np)/√(npq), (n choose i) pⁱ q^{n−i} √(2πnpq)⟩. If you call the first coordinate x, the second coordinate will actually be approximately e^{−x²/2}. To view the approximation, plot the functions for increasing n (but identical p).
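For instance, one may call the functions as follows (our example values):

    n <- 50; p <- 0.7
    plot(Hx(n, p), Hy(n, p))            # rescaled binomial probabilities
    curve(exp(-x^2 / 2), add = TRUE)    # the limiting curve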

Part II

Elements of Statistics



10 Estimators

Suppose you are tossing a coin but you are not sure whether it is biased. How can you find out what the bias is? Evidently, there is no way to get a definitive answer. We have calculated the probabilities before and noticed that no sequence is impossible unless p = 0 or p = 1. Still we might want to know with at least a great degree of confidence what the bias is. The way to do this is as follows. Let θ be the bias, so it has values in Θ = [0, 1]. Let ω = ⟨ω₁, ω₂, ⋯, ωₙ⟩, and put a(ω) := |{i : ωᵢ = 1}| and b(ω) := |{i : ωᵢ = 0}| = n − a(ω). Given θ we define the probability of an outcome as

(207) P_θ(ω) = θ^{a(ω)}(1 − θ)^{b(ω)}

The probability is a function of θ. As a point of notation and terminology: we may construe the situation in two ways. The first is that we have a family {P_θ : θ ∈ Θ} of probability functions on the space Ωⁿ. The other is that the space is Ωⁿ × H, where H is the following space: H = ⟨Θ, B(Θ), µ⟩, where B(Θ) is the set of Borel sets over Θ and µ is the measure defined as follows. To simplify matters, B(Θ) contains all finite unions of intervals of the form (a, b] plus all the sets {a}, 0 ≤ a, b ≤ 1. Moreover, for an interval (a, b] we put µ((a, b]) := b − a and µ({a}) := 0. µ(X) is known as the Lebesgue measure of the set X.

In the latter case the outcomes are of the form ζ = ⟨ω, θ⟩. Write π₁(ζ) = ω (the first projection) and π₂(ζ) = θ. Then a statement of the form 'ω ∈ S' now is short for 'π₁(ζ) ∈ S'. Additionally, however, we may introduce the notation H_θ for π₂(ζ) = θ. Thus, H_θ is the statement 'the bias is θ'. In order to make the notation perspicuous, we shall continue to use ω for the first part and θ for the second. The following are alternative notations; they are identical by the way things have been set up:

(208) Pθ(A) = P(A|Hθ)

It is quite frequent in textbooks to make a silent transition between families of probabilities over Ωⁿ and the space Ωⁿ × H by writing statements such as P(A|H_θ). The latter are meaningless in the original space, because H_θ is not an event of that space. On the other hand, if we perform an experiment we can only get at the value of ω, so one likes to think that one is dealing with the space Pⁿ and that θ remains implicit in the definition of the probability. Since the numbers are the same, both viewpoints can be used simultaneously.

It is often the case that one does not need P(ω|H_θ) but rather P(ω). How does one obtain this probability? Intuitively, P(ω) is the 'sum' of all P(⟨ω, θ⟩) where θ ∈ Θ. Also, in general,

(209) P(⟨ω, θ⟩) = P(ω|H_θ)P(H_θ)

If we have only finitely many θ, say θ₁, ⋯, θₙ, then we can simply write

(210) P(ω) = Σ_{i=1}^{n} P(⟨ω, θᵢ⟩) = Σ_{i=1}^{n} P(ω|H_{θᵢ})P(H_{θᵢ})

In the present case the set of values is the unit interval, or more generally, some set Θ of real numbers. In general one assumes that P(H_θ) does not depend on θ (we have a Laplace space). Then we may write P(H_θ) = dθ. The sum turns into an integral:

(211) P(ω) = ∫_Θ P(ω|H_θ) dθ

Let us say that an estimator is a function Tₙ on Ωⁿ with values in Θ. (Notice that Θ can be any set of reals, but for the present example we obviously have Θ = [0, 1].) The following is an estimator:

(212) Tₙ(ω) := a(ω)/n

This is known as the maximum likelihood estimator. To see why, notice the following. We claim that a(ω)/n is actually the number θ that maximizes P(H_θ|ω). To see this, notice that

(213) P(ω|H_θ) = P(H_θ|ω) · P(ω)/P(H_θ)

Suppose that we have no prior knowledge about the probabilities P(H_θ). Thus we assume that all H_θ are equally likely. Then the left hand side is maximal iff P(H_θ|ω) is. That is, we have maximized the probability of H_θ under the hypothesis ω precisely when we have maximized the probability of ω under the condition H_θ. The latter probability is given by (207). Thus we aim to find the θ such that (207) is maximal. There are two ways of doing this. One is to calculate the derivative of the function:

(214) d/dθ (θ^{a(ω)}(1 − θ)^{b(ω)}) = (d/dθ θ^{a(ω)}) · (1 − θ)^{b(ω)} + θ^{a(ω)} (d/dθ (1 − θ)^{b(ω)})
    = a(ω)θ^{a(ω)−1}(1 − θ)^{b(ω)} − b(ω)θ^{a(ω)}(1 − θ)^{b(ω)−1}
    = θ^{a(ω)−1}(1 − θ)^{b(ω)−1}(a(ω)(1 − θ) − b(ω)θ)

(Recall that d/dx (fg) = (d/dx f)g + f(d/dx g).) The maximum is attained at points where the derivative is 0. Apart from θ = 0 and θ = 1 this is only the case when

(215) a(ω)(1 − θ) − b(ω)θ = 0

Equivalently,

(216) a(ω) = (a(ω) + b(ω))θ

Since a(ω) + b(ω) = n this becomes

(217) θ = a(ω)/n

So, the maximum of P(ω|H_θ) is attained when a(ω) equals nθ. To be exact we would have to look at the cases θ = 0 and θ = 1. They are however completely straightforward: if θ = 0 then the probability of ω is 0 except if a(ω) = 0; if θ = 1 then the probability of ω is 0 except if a(ω) = n. Both validate the law that a(ω) = nθ. A last point is to be mentioned, though: nθ need not be an integer. However, we are interested in obtaining θ from a(ω) and not conversely, and a real number is fine. Another route is this. Instead of dealing with the event ω, we consider the probability P(H_θ|Xₙ = a(ω)). Again this term is maximal iff P(Xₙ = a(ω)|H_θ) is maximal. It is

(218) P_θ(Xₙ = k) = (n choose k) θ^k(1 − θ)^{n−k}

The proof would go the same way as before. However, we can also make use of the limit theorems and the function e^{−x²/2}. We have established that with the transformation x = (k − nθ)/√(nθ(1 − θ)) it suffices to find the maximum of

(219) f(x) = e^{−x²/2}

This function is symmetric, that is, f(−x) = f(x). Moreover, if 0 < x < y then f(x) > f(y), so without further calculations we can see that the maximum is attained at x = 0. Translating this back we find that 0 = (k − nθ)/√(nθ(1 − θ)), or k = nθ. This means, given θ and n, k must be equal to nθ. Or, as k was given through a(ω), and we in fact wanted to estimate θ such that H_θ becomes most probable, we must put θ := a(ω)/n to achieve this.
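Numerically, one can also maximise the likelihood (207) directly, for example with R's optimize (our sketch, with invented sample numbers):

    a <- 7; n <- 10                      # say, 7 ones in 10 trials
    lik <- function(theta) theta^a * (1 - theta)^(n - a)
    optimize(lik, interval = c(0, 1), maximum = TRUE)$maximum   # about 0.7 = a/n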

The estimator can also be phrased differently. Seen as a function from Ωⁿ to [0, 1] it is actually identical to Xₙ/n. In general, an estimator is a series of random variables over Ωⁿ, one for every n.

We say that Tn is consistent if for every ε > 0

(220) lim_{n→∞} P_θ(|Tₙ − θ| > ε) = 0

This means the following. For every different outcome we get a possibly different estimate of θ. However, when we take θ as the true bias, the estimator shall not diverge from it in the limit. Of course, we do not know the bias θ, but we want to make sure that with respect to the real bias we get there eventually, with any degree of certainty that we want. Since we know that Xₙ/n approaches θ in the limit, we also know that the result is unambiguous: our estimator will converge in the long run, and it will converge to the real bias.

Now call Tn unbiased if for all θ ∈ Θ:

(221) Eθ Tn = θ

Here, Eθ is the expectation according to the probability Pθ. Finally, we call Tn efficient among the class U of unbiased estimators if for all θ ∈ Θ

(222) V_θ Tₙ = inf_{Uₙ∈U} V_θ Uₙ

This simply says that the estimator produces the least possible divergence from the sought value. It is obviously desirable to have an efficient estimator, because it gives us the value of θ with more precision than any other. The following summarizes the properties of the maximum likelihood estimator. It says that it is the best possible choice in the Bernoulli experiment (and an easy one to calculate, too).

Theorem 31 Xn/n is consistent, unbiased and efficient.

We have seen already that it is consistent and unbiased. Now,

(223) V_θ(Xₙ/n) = θ(1 − θ)/n

It has to be shown that the variance of any other unbiased estimator is ≥ θ(1 − θ)/n. The proof is not easy, and we shall therefore omit it.

Let us return to (213). The probability of H_θ is actually infinitesimally small. This is because there are infinitely many values in any close vicinity of θ which also are good candidates for the bias (though their probability may be slightly less than that of H_θ, a statement that can be made sense of even though all the numbers involved are infinitesimally small!). Thus, we cannot claim with any degree of certainty that the bias is θ. If we give ourselves a probability p of error, then all we can do is say that the bias is inside an interval [a, b] with probability 1 − p. Evidently, the values of a and b depend on p.

Definition 32 An interval [a∗(ω), b∗(ω)], with a∗ and b∗ functions from Ω to Θ, is called a confidence interval of reliability 1 − δ or significance level δ if for all θ ∈ Θ:

(224) P_θ(a∗(ω) ≤ θ ≤ b∗(ω)) ≥ 1 − δ

So, the confidence level and the significance level are inversely correlated. If the confidence is 0.995 then the significance is 0.005. Very often in the literature one finds that people give the significance level, and this may cause confusion. The lower this number the better: if the significance level is 0.001 then the probability of the result being in the interval is 99.9 percent, or 0.999! We want to construct a confidence interval for a given confidence level. For a set A ⊆ [0, 1] let us write H_A for the statement that θ ∈ A. We are interested in P(H_A|Xₙ = a(ω)). Using the law of inverted probabilities we can calculate this by

(225) P(H_A|Xₙ = a(ω)) = P(Xₙ = a(ω)|H_A) · P(H_A)/P(Xₙ = a(ω))

This is less straightforward than it seems, for we now have to determine P(H_A) and P(Xₙ = a(ω)|H_A). (Notice that P(Xₙ = a(ω)) = P(Xₙ = a(ω)|H_Θ), so that value is determined as well.) To make matters simple, we assume that A = [a, b]. Then P(H_A) = b − a, as the H_θ are all equiprobable. Now, using (218) we get

(226) P(Xₙ = k|H_A) = ∫_a^b (n choose k) θ^k(1 − θ)^{n−k} dθ

This integral can be solved. However, there is a much simpler approach (though the numbers might be less optimal). By Chebyshev's Inequality, for Tₙ := Xₙ/n,

(227) P_θ(|θ − Tₙ| > δ) ≤ V_θ Tₙ/δ² = θ(1 − θ)/(nδ²)

Now, put

(228) λ := δ √(n/(θ(1 − θ)))

Then

(229) P_θ(|θ − Tₙ| > λ√(θ(1 − θ)/n)) < 1/λ²

or, equivalently,

(230) P_θ(|θ − Tₙ| ≤ λ√(θ(1 − θ)/n)) ≥ 1 − 1/λ²

To make this independent of θ, notice that

(231) θ(1 − θ) ≤ 1/4

so the equation can be reduced to

(232) P_θ(|θ − Tₙ| ≤ λ/(2√n)) ≥ 1 − 1/λ²

Theorem 33 Let P be a Bernoulli experiment with unknown bias θ. Based on an n-fold repetition of the experiment with result ω, the value a(ω)/n falls into the interval [θ − λ/(2√n), θ + λ/(2√n)] with probability at least 1 − 1/λ² (that is, with significance level 1/λ²).

Or, put differently, θ is found in the interval [a(ω)/n − λ/(2√n), a(ω)/n + λ/(2√n)] with probability at least 1 − 1/λ². Thus, there is a trade-off between certainty and precision. If we want to know the value of θ very accurately, we can only do so with low probability. If we want certainty, the accuracy has to be lowered.
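For illustration (our numbers): with n = 100 tosses, a(ω) = 61, and λ = 3, so significance level 1/λ² = 1/9, the interval comes out in R as

    n <- 100; a <- 61; lambda <- 3
    a/n + c(-1, 1) * lambda / (2 * sqrt(n))   # the interval [0.46, 0.76]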

11 Tests

Consider the following situation: you have an initial hypothesis H₀ and you consider another hypothesis H₁. Now you run an experiment. What does the outcome of the experiment tell you about the validity of the hypotheses? In particular, will the experiment suggest changing to H₁ or will it suggest remaining with the initial hypothesis? The solution comes in the form of a so-called test. A test is a method that gives a recommendation whether we should adopt the new hypothesis or whether we should stay with the old one. We shall first analyse the situation of the preceding section. We have repeated an experiment n times and received an element ω. The element ω is also referred to as a sample, and Ωⁿ is the space of sample points. We shall agree on the following definition.

Definition 34 A test is a function d from the space of sample points to the set {H₀, H₁}. (d is also called a decision rule.) The set d⁻¹(H₁) is called the critical region of d. d is Bayesian if for all ω, ω′ such that P(H₀|ω) = P(H₀|ω′), d(ω) = d(ω′).

It is clear that a test is uniquely determined by its critical region. Very often one is not interested in the sample points as such but in some other derived value that they determine. For example, when estimating the bias we are really only interested in the value a(ω)/n, because it gives us an approximation of the bias, as we have shown. If we had just taken ω the result would have been no better. The number a(ω)/n contains all we need. We shall generalize this now as follows. A statistic is a function on Ωⁿ. A random variable on Ωⁿ is a statistic, since it is a function from that set into the real numbers. But statistics can go into any set one likes. Let us give a few examples of statistics. Before we can do so, a few definitions. Suppose ω is a vector of numbers. Then let ω(i) be the ith element of ω according to the ascending order. For example, if ω = ⟨0, 9, 7, 0, 2, 1, 7⟩ then ω(1) = 0, ω(2) = 0, ω(3) = 1, ω(4) = 2, ω(5) = 7, ω(6) = 7 and ω(7) = 9. We can write ω in ascending order as follows.

(233) ⟨ω(1), ω(2), ⋯, ω(n)⟩

where n is the length of ω. In R, sorting is done using the function sort; the function order returns the permutation that sorts the vector.

➀ The sample mean: ω̄ := (1/n) Σ_{i=1}^{n} ωᵢ;

➁ The sample sum: Σ_{i=1}^{n} ωᵢ;

➂ The sample variance: s(ω) := (1/(n−1)) Σ_{i=1}^{n} (ωᵢ − ω̄)²;

➃ The sample deviation: d(ω) := √s(ω);

➄ The order statistic: o(ω) := ⟨ω(1), ω(2), ⋯, ω(n)⟩;

➅ The sample median: m(ω) := ω((n+1)/2) if n is odd, and (1/2)(ω(n/2) + ω(1+n/2)) if n is even.

➆ The sample range: ω(n) − ω(1).

Now, we have a space Θ of values which we want to estimate. The estimator has been construed as a function on the sample space. But we can reduce the space to the image (Xₙ/n)[Ωⁿ]. This means that the estimator does not distinguish between different ω as long as they receive the same value under Xₙ/n. We shall generalize this to the notion of a sufficient statistic. This is a statistic that contains as much information as is needed to get the value of the estimator. Given a statistic T, let T = t denote the set {ω : T(ω) = t}.

Definition 35 Let P = {P_θ : θ ∈ Θ} be a family of probabilities on Ωⁿ and T a statistic. We say that T is sufficient for P if for all θ, θ′ we have P_θ(ω|T = t) = P_{θ′}(ω|T = t).

Trivially, the identity statistic is sufficient. A less trivial example is provided by the statistic Tₙ(ω) := a(ω). To see that it is sufficient, note that

(234) P_θ(ω|Tₙ = k) = P_θ(ω ∩ (Tₙ = k))/P_θ(Tₙ = k)
                   = P_θ(ω)/P_θ(Tₙ = k)
                   = θ^k(1 − θ)^{n−k} / ((n choose k) θ^k(1 − θ)^{n−k})
                   = (n choose k)⁻¹

This does not depend on θ. Notice in general that P(T = t|ω) = 1 if T(ω) = t and 0 otherwise. Likewise, if T(ω) = t then P(ω ∩ (T = t)) = P(ω), and 0 otherwise. The latter has been used in the derivation above. We shall prove now that a sufficient statistic allows one to estimate parameters (or perform tests) with the same precision as the original sample. So, we assume that P(ω|(T = t) ∩ H_θ) = P(ω|(T = t) ∩ H_{θ′}). From this we can deduce the following. Put

(235) ξ(ω) := P_θ(ω|T = t), where t := T(ω)

By assumption this does not depend on the choice of θ. Now,

(236) P(ω) = ∫_Θ P(ω|H_θ) dθ
          = ∫_Θ P(ω|(T = t) ∩ H_θ) P(T = t|H_θ) dθ
          = ∫_Θ ξ(ω) P(T = t|H_θ) dθ
          = ξ(ω) ∫_Θ P(T = t|H_θ) dθ
          = ξ(ω) P(T = t)

Thus, the probability of ω (independently of θ) is the same fraction of the probability of T = t. This allows us to deduce the desired conclusion.

(237) P(H_θ|T = t) = P(T = t|H_θ) · P(H_θ)/P(T = t)
                  = (ξ(ω)⁻¹ P(ω|H_θ) P(H_θ))/(ξ(ω)⁻¹ P(ω))
                  = P(ω|H_θ) P(H_θ)/P(ω)

= P(Hθ|ω)

Thus the probabilities do not depend on ω as long as the statistic is the same. We derive from this discussion the following characterisation of sufficiency. 80 Tests

Theorem 36 A statistic T : Ωⁿ → A is sufficient iff there are functions f : A × Θ → R and g : Ωⁿ → R such that P(ω|H_θ) = f(T(ω), θ)g(ω).

Proof. We have seen above that if T is sufficient, P(ω | H_θ) = ξ(ω) P(T = t | H_θ). Hence, put g(ω) := ξ(ω) and f(T(ω), θ) := P(T = t | H_θ), where t := T(ω). Conversely, suppose the functions f and g can be found as required. Then let ω and t be such that T(ω) = t.

(238) $P(T = t \mid H_\theta) = \sum_{T(\omega') = t} P(\omega' \mid H_\theta) = \sum_{\omega' \in T^{-1}(t)} f(T(\omega'), \theta)\, g(\omega') = f(T(\omega), \theta) \cdot \sum_{\omega' \in T^{-1}(t)} g(\omega')$
Notice namely that f(T(ω′), θ) = f(T(ω), θ) if T(ω′) = T(ω). Now,

(239) $P(\omega \mid (T = t) \cap H_\theta) = \frac{P(\omega \mid H_\theta)}{P(T = t \mid H_\theta)} = \frac{f(T(\omega), \theta)\, g(\omega)}{f(T(\omega), \theta) \sum_{\omega' \in T^{-1}(t)} g(\omega')} = \frac{g(\omega)}{\sum_{\omega' \in T^{-1}(t)} g(\omega')}$
This expression is independent of θ. □
Let us note that if T is a sufficient statistic and d a test, we may consider using the test with critical region T[C] := {T(ω) : ω ∈ C}. Since it is not guaranteed that C = T^{-1}(T[C]), this may result in a different test. However, if the test is actually based on probabilities then it cannot differentiate between members of the partition. This is because P(H_θ | T = t) = P(H_θ | ω) implies that if T(ω) = T(ω′) then also P(H_θ | ω) = P(H_θ | ω′). Now, by definition, if d is Bayesian, d(ω) = d(ω′). So, tests can be applied to any sufficient statistic. We shall see below that there is also a minimal such statistic, and therefore decisions can be based on just that statistic. With the tools developed so far we can fully analyse the situation. We need to assume that H_0 is the case with probability p, so that H_1 is true with probability q = 1 − p. Let ω be a single outcome. We have

(240) $P(\omega) = P(\omega \mid H_0)P(H_0) + P(\omega \mid H_1)P(H_1) = p\,P(\omega \mid H_0) + (1-p)\,P(\omega \mid H_1)$
Therefore we have
(241) $P(H_0 \mid \omega) = P(\omega \mid H_0)\cdot\frac{P(H_0)}{P(\omega)} = \frac{p\,P(\omega \mid H_0)}{p\,P(\omega \mid H_0) + (1-p)\,P(\omega \mid H_1)} = \frac{1}{1 + \frac{1-p}{p}\cdot\frac{P(\omega \mid H_1)}{P(\omega \mid H_0)}}$
and similarly for P(H_1) we have
(242) $P(H_1 \mid \omega) = P(\omega \mid H_1)\cdot\frac{P(H_1)}{P(\omega)} = \frac{(1-p)\,P(\omega \mid H_1)}{p\,P(\omega \mid H_0) + (1-p)\,P(\omega \mid H_1)} = \frac{1}{1 + \frac{p}{1-p}\cdot\frac{P(\omega \mid H_0)}{P(\omega \mid H_1)}}$
Put $r := \frac{P(\omega \mid H_0)}{P(\omega \mid H_1)}$ and $c := \frac{p}{1-p}$. Then
(243) $P(H_0 \mid \omega) = \frac{1}{1 + (cr)^{-1}}, \qquad P(H_1 \mid \omega) = \frac{1}{1 + cr}$
Thus, we can effectively determine what probability the hypotheses have given the experiment. But the crucial question is now what we can or have to deduce from that. This is where the notion of a decision method comes in. Call a threshold test a test that is based on a single number t, called the threshold, and which works as follows. If R(ω) := P(ω|H_0)/P(ω|H_1) < t then choose hypothesis H_1; otherwise choose H_0. (The case P(ω|H_1) = 0 has to be excluded here; in that case, however, H_0 must be adopted at all cost.) Based on the parameter t we can calculate the relevant probabilities.

Here is an example. Suppose the hypotheses are H_0: the bias is 0.7, and H_1: the bias is 0.5. We perform the experiment 5 times. Let ω = ⟨0, 1, 0, 1, 1⟩. Then
(244) $R(\omega) = \frac{P(\omega \mid H_0)}{P(\omega \mid H_1)} = \frac{0.7^3 \cdot 0.3^2}{0.5^5}$

If we were to base our judgment on the statistic that counts the number of 1s, we get

(245) $\frac{\binom{5}{3}\, 0.7^3\, 0.3^2}{\binom{5}{3}\, 0.5^5}$
which is exactly the same number. There are then only 6 numbers to compute. For given ω, we have
(246) $R(\omega) = \left(\frac{0.7}{0.5}\right)^{a(\omega)} \left(\frac{0.3}{0.5}\right)^{b(\omega)} = 1.4^{a(\omega)}\, 0.6^{n - a(\omega)}$

(247)
    Representative     Likelihood ratio   Value
    ⟨0, 0, 0, 0, 0⟩    0.6^5              0.0778
    ⟨0, 0, 0, 0, 1⟩    1.4 · 0.6^4        0.1814
    ⟨0, 0, 0, 1, 1⟩    1.4^2 · 0.6^3      0.4234
    ⟨0, 0, 1, 1, 1⟩    1.4^3 · 0.6^2      0.9878
    ⟨0, 1, 1, 1, 1⟩    1.4^4 · 0.6        2.3050
    ⟨1, 1, 1, 1, 1⟩    1.4^5              5.3782

If we set the threshold to 1 then we adopt the hypothesis that the coin is biased with 0.7 exactly when the number of 1s exceeds 3. If it does not, we assume that the bias is 0.5. Notice that if we get no 1s then the assumption that the coin is unbiased is also not likely in general, but more likely than that it is biased with 0.7. However, we have restricted ourselves to considering only the alternative between these two hypotheses. Of course, ideally we would like to have a decision method that recommends changing to H_1 only if it actually is the case, and that suggests remaining with H_0 only if it actually is the case. It is clear however that we cannot expect this. All we can expect is a method that suggests the correct decision with some degree of certainty.

Definition 37 A test is said to make a type I error when it suggests adopting H_1 when H_0 is actually the case. A test is said to make a type II error when it suggests adopting H_0 when H_1 is actually the case.

It is generally preferred to have a low probability for a type I error to occur. (This embodies a notion of conservativity: one would rather remain with the old hypothesis than change for no good reason.) We have actually calculated above the probabilities for a type I and a type II error to occur. For a test with critical region B they are
(248) probability of a type I error: $\frac{1}{1 + \frac{1-p}{p} \cdot \frac{P(B \mid H_1)}{P(B \mid H_0)}}$
(249) probability of a type II error: $\frac{1}{1 + \frac{p}{1-p} \cdot \frac{P(B \mid H_0)}{P(B \mid H_1)}}$

Definition 38 The probability of a type I error is called the significance of the test, the probability of the nonoccurrence of a type II error the power of the test. A test T is most significant if for all tests T′ whose significance exceeds that of T the power of T′ is strictly less than that of T; T is most powerful if for all tests T′ whose power exceeds that of T the significance of T′ is strictly less than the significance of T.

In statistical experiments the significance is the most commonly reported number. It describes the probability with which the test falsely recommends changing to the new hypothesis. But the power is obviously equally important. Now let us define
(250) $R(\omega) := \frac{P(X = \omega \mid H_0)}{P(X = \omega \mid H_1)}$
This number is the ratio of the likelihood that ω is the outcome under the condition H_0 to the likelihood that ω is the outcome under the condition H_1. The function R is called the likelihood ratio statistic. We shall show that it is sufficient, and moreover that it is minimally sufficient.

Definition 39 A statistic T is minimally sufficient if for all sufficient statistics S there is a function h such that T(ω) = h(S(ω)).

Thus, further compression of the data is not possible if we have applied a minimally sufficient statistic.

Theorem 40 The likelihood ratio statistic is minimally sufficient.

Proof. First we show that it is sufficient. We use Theorem 36 for that. Define

(251) $f(R(\omega), \theta) := \begin{cases} R(\omega)^{1/2} & \text{if } \theta = \theta_0 \\ R(\omega)^{-1/2} & \text{if } \theta = \theta_1 \end{cases}$

Further, put

(252) $g(\omega) := (P(\omega \mid \theta_0)\, P(\omega \mid \theta_1))^{1/2}$
Then

(253) $f(R(\omega), \theta_0)\, g(\omega) = \left(\frac{P(\omega \mid \theta_0)}{P(\omega \mid \theta_1)}\right)^{1/2} (P(\omega \mid \theta_0)\, P(\omega \mid \theta_1))^{1/2} = P(\omega \mid \theta_0)$
and

(254) $f(R(\omega), \theta_1)\, g(\omega) = \left(\frac{P(\omega \mid \theta_1)}{P(\omega \mid \theta_0)}\right)^{1/2} (P(\omega \mid \theta_0)\, P(\omega \mid \theta_1))^{1/2} = P(\omega \mid \theta_1)$
So the statistic is sufficient. Now take another sufficient statistic T. Take a sample ω and put t := T(ω). Since T is sufficient,

(255) $P(\omega \mid (T = t) \cap H_0) = P(\omega \mid (T = t) \cap H_1)$
Therefore,
(256) $R(\omega) = \frac{P(\omega \mid \theta_0)}{P(\omega \mid \theta_1)} = \frac{P(\omega \cap (T = t) \cap H_0)/P(H_0)}{P(\omega \cap (T = t) \cap H_1)/P(H_1)} = \frac{P(\omega \mid (T = t) \cap H_0) \cdot P((T = t) \cap H_0)/P(H_0)}{P(\omega \mid (T = t) \cap H_1) \cdot P((T = t) \cap H_1)/P(H_1)} = \frac{P(T = t \mid H_0)}{P(T = t \mid H_1)}$
Now, if T(ω) = T(ω′) then it follows from this equation that also R(ω) = R(ω′), and so the map h : T(ω) ↦ R(ω) is well-defined and we have R(ω) = h(T(ω)). □

A threshold test with threshold t is a test that recommends H_0 if R(ω) ≥ t and H_1 otherwise. We notice that the power and significance of the test depend on the value of t. The higher t, the more likely the test is to recommend H_1, and thus the greater its power. The lower t, the more conservative the test, and therefore the more significant it is. Again we see that there is an antagonism between significance and power. However, no matter which t is chosen, the test is optimally significant within the set of tests of the same (or less) power, and optimally powerful within the set of tests of the same (or less) significance.

Theorem 41 (Neyman & Pearson) The threshold test is most significant and most powerful.

Let us continue the previous example. We have calculated the values of R(ω). Now we wish to know, for a given threshold t, what the significance and power of the test are. Here is a useful table of probabilities.

(257)
    k    P(T = k | H1)    P(T = k | H0)
    0    0.03125          0.0025
    1    0.15625          0.02835
    2    0.3125           0.1323
    3    0.3125           0.3087
    4    0.15625          0.36015
    5    0.03125          0.16807

(Recall that H_0 is the hypothesis that the bias is 0.7 and H_1 that it is 0.5; the middle column is thus binomial with p = 0.5 and the right column binomial with p = 0.7.)

From this we draw the following values, assuming P(H0) = P(H1) = 0.5:

(258)
                 H0                      H1
    t        R < t       R ≥ t       R < t       R ≥ t
    0        0           0.5         0           0.5
    0.1      0.00125     0.49875     0.015625    0.484375
    0.4      0.015425    0.484575    0.09375     0.40625
    0.9      0.0816      0.4184      0.25        0.25
    2        0.235925    0.264075    0.40625     0.09375
    5        0.416       0.084       0.484375    0.015625
    6        0.5         0           0.5         0

The significance and power can be read off the table: the entry under H_0, R < t is the probability that H_0 obtains but the test advises against it (the significance); the entry under H_1, R ≥ t is the probability that H_1 obtains and the test advises against it (so take 1 minus that value to obtain the power). Notice a few extreme cases. If t = 0 then the rule never advises to adopt H_1. So, a type I error cannot occur. The significance is optimal. The power, on the other hand, is 0.5, because that is the probability that H_1 obtains, and we are bound to be wrong then. Now assume that t = 9. This value is never reached; we always adopt H_1. So the significance is 0.5, the worst possible value, because if H_0 obtains we get an error. On the other hand, a type II error can then never occur, so the power is maximal.

12 Distributions

We have discussed probability spaces in the first sections. In this section we shall summarize some results on probability distributions and introduce some more, which have proved essential in probability theory and statistics. We begin with the discrete spaces.

Uniform Distribution This distribution has only one parameter, the size N of the space. The probability of each individual outcome is the same, and it is 1/N.

Binomial Distribution This distribution has two parameters, p and n. The outcomes are the numbers from 0 to n, and
(259) $P(k) = \binom{n}{k} p^k (1-p)^{n-k}$

This distribution arises from the n-fold product of a Bernoulli space by introducing the random variable $X(\langle \omega_1, \omega_2, \dots, \omega_n \rangle) := \sum_{i=1}^n \omega_i$ and then turning the value range into a probability space.

Geometric Distribution The underlying space is the set of natural numbers. This distribution has one parameter, p.

(260) $P(k) = (1 - p)\, p^{k}$

Polynomial Distribution The underlying space is the set of positive natural numbers. There is one parameter, α, and the probabilities are
(261) $P(k) = \zeta(\alpha)^{-1} \frac{1}{k^\alpha}$
where by definition

(262) $\zeta(\alpha) := \sum_{k=1}^{\infty} \frac{1}{k^\alpha}$

Now we turn to continuous distributions. In the continuous case the distribution is defined with the help of its density f(x), and the probability P((a, b]) is defined by
(263) $P((a, b]) = \int_a^b f(x)\,dx$

Uniform Distribution The space is a closed interval [a, b]. The density is
(264) $f(x) = \frac{1}{b - a}$

Normal Distribution The space is ℝ, and the distribution has two parameters, µ and σ, where σ > 0. The density is
(265) $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2/2\sigma^2}$
If µ = 0 and σ = 1 the distribution is called standard normal.

Exponential Distribution The space is ℝ⁺, the space of positive real numbers, and there is one parameter, λ, which must be strictly positive (i.e. λ > 0).

(266) f (x) = λe−λx

Gamma Distribution The space is ℝ⁺ and the distribution has two real parameters, α and λ, which must both be strictly positive. The density is
(267) $f(x) = \frac{\lambda^\alpha x^{\alpha-1} e^{-\lambda x}}{\Gamma(\alpha)}$
This uses the Γ-function, which is defined as follows (for all t > 0):
(268) $\Gamma(t) := \int_0^\infty x^{t-1} e^{-x}\,dx$

The gamma distribution arises as follows. Suppose that X_i, i = 1, 2, ..., n, are independent random variables which are distributed exponentially with parameter λ. Then the sum $Y = \sum_{i=1}^n X_i$ is a gamma-distributed random variable with parameters α = n and λ. It follows that the exponential distribution is a special case of the gamma-distribution, obtained by putting n = 1. We collect some useful facts about the Γ-function. By the method of partial integration,

(269) $\Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\,dx = \left[\frac{x^t}{t}\, e^{-x}\right]_0^\infty + \frac{1}{t}\int_0^\infty x^t e^{-x}\,dx = \frac{\Gamma(t+1)}{t}$
In other words,
(270) $\Gamma(t + 1) = t\,\Gamma(t)$

It is useful to note also that Γ(1) = 1, because $\int_0^\infty e^{-x}\,dx = [-e^{-x}]_0^\infty = -0 + 1 = 1$. From this it follows that Γ(n + 1) = n!, so this function generalises the factorial. Interesting for purposes of statistics is

(271) $\Gamma(1/2) = \int_0^\infty x^{-1/2} e^{-x}\,dx = \sqrt{2}\int_0^\infty e^{-y^2/2}\,dy = \frac{1}{\sqrt{2}}\int_{-\infty}^\infty e^{-y^2/2}\,dy = \sqrt{\pi}$
To see this, notice that we have put y := (2x)^{1/2}, so that x = y²/2. Further, dy/dx = (2x)^{−1/2}, or dx = √(2x) dy; together with the factor x^{−1/2} this leaves √2, which explains the step from the first to the second integral. Now, $\frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty e^{-y^2/2}\,dy = 1$, and so the remaining equations easily follow. We can derive from these the following result.

Theorem 42 The Gamma function assumes the following values.

➀ $\Gamma(n + 1) = n!$

➁ $\Gamma(n + 1/2) = \frac{(2n)!\,\sqrt{\pi}}{2^{2n}\, n!}$

Chi-squared distribution This is a special case of the Γ-distribution. The space is ℝ⁺. There is a single parameter, n, which assumes values in ℕ. This number is also called the degrees of freedom. The density is that of the Γ-distribution with λ = 1/2 and α = n/2:
(272) $\chi^2_n(x) = \frac{x^{n/2 - 1}}{\sqrt{2^n e^x}\;\Gamma(n/2)}$

This distribution arises as follows. Let X_i, i = 1, 2, ..., n, be independent random variables which are all standard normally distributed. Then $Y = \sum_{i=1}^n X_i^2$ is a random variable whose distribution is χ² with n degrees of freedom. We shall explain why this is so. First, take the simplest example, Y = X². We want to know the distribution of Y. Notice that Y can have only positive values, and that if Y = a then X may be either +√a or −√a. So, we get the following probability:
(273) $P(Y \le a) = 2\, P(0 < X \le \sqrt{a})$
Call the density of Y F. Then we deduce
(274) $\int_0^a F(y)\,dy = \frac{2}{\sqrt{2\pi}} \int_0^{\sqrt{a}} e^{-x^2/2}\,dx$

Now, e^{−x²/2} = e^{−y/2}, but what about dx? Here we use some magic (which nevertheless is rigorous mathematics!):
(275) $dy = \frac{dy}{dx}\,dx = \frac{dx^2}{dx}\,dx = 2x\,dx$
From this we get dx = dy/2x = dy/2√y, and so we have

(276) $\int_a^b F(y)\,dy = \frac{2}{\sqrt{2\pi}} \int_{x=\sqrt{a}}^{x=\sqrt{b}} e^{-x^2/2}\,dx = \frac{2}{\sqrt{2\pi}} \int_{y=a}^{y=b} e^{-y/2}\,\frac{dy}{2\sqrt{y}} = \frac{1}{\sqrt{2\pi}} \int_{y=a}^{y=b} \frac{e^{-y/2}}{\sqrt{y}}\,dy$

From this we deduce now that
(277) $F(y) = \frac{e^{-y/2}}{\sqrt{2\pi}\,\sqrt{y}}$

This is precisely the value given above. To see this, we only need to recall that $e^{-y/2} = 1/\sqrt{e^y}$ and that Γ(1/2) = √π.

Now take the next case, $Y = X_1^2 + X_2^2$. Here, matters get somewhat more involved. We want to get the probability that Y ≤ a. This means that $|X_1| \le \sqrt{a}$ and that $|X_2| \le \sqrt{a - X_1^2}$. For both X_1 and X_2 we can either choose a positive value or a negative one.
(278) $\int_0^a F_2(y)\,dy = \frac{4}{2\pi} \int_0^{\sqrt{a}} \int_0^{\sqrt{a - x_1^2}} e^{-x_1^2/2}\, e^{-x_2^2/2}\,dx_2\,dx_1$

This is a heavy weight to lift. However, notice first that $e^{-x_1^2/2}\, e^{-x_2^2/2} = e^{-(x_1^2 + x_2^2)/2} = e^{-y/2}$. So, we are integrating e^{−y/2} over all values x_1 and x_2 such that $x_1^2 + x_2^2 = y$. If we fix y then the values ⟨x_1, x_2⟩ lie on a quarter circle, starting at ⟨0, √y⟩ and ending at ⟨√y, 0⟩. The length of the arc along which we are integrating is exactly 2π√y/4, because the radius of the circle is √y. So, we switch to the new coordinates y and φ, where φ is the angle from the x-axis to the vector pointing from the origin to ⟨x_1, x_2⟩ (these are called polar coordinates). Now,
(279) $x_1 = \sqrt{y}\cos\varphi, \qquad x_2 = \sqrt{y}\sin\varphi$
From this we deduce that
(280) $\frac{\partial x_1}{\partial y} = \frac{1}{2\sqrt{y}}\cos\varphi, \qquad \frac{\partial x_2}{\partial\varphi} = \sqrt{y}\cos\varphi$
So we have
(281) $\int_0^a F_2(y)\,dy = \frac{2}{\pi}\int_{y=0}^{y=a}\int_{\varphi=0}^{\varphi=\pi/2} e^{-y/2}\,dx_1\,dx_2 = \frac{2}{\pi}\int_{y=0}^{y=a}\int_{\varphi=0}^{\varphi=\pi/2} e^{-y/2}\cos^2\varphi\,d\varphi\,dy = \frac{2}{\pi}\int_0^a e^{-y/2}\,dy \int_0^{\pi/2}\cos^2\varphi\,d\varphi$

Now we have to solve $\int_0^{\pi/2}\cos^2\varphi\,d\varphi$. Partial integration yields
(282) $\int_0^{\pi/2}\cos^2\varphi\,d\varphi = [\cos\varphi\sin\varphi]_0^{\pi/2} + \int_0^{\pi/2}\sin^2\varphi\,d\varphi = \int_0^{\pi/2}(1 - \cos^2\varphi)\,d\varphi = \frac{\pi}{2} - \int_0^{\pi/2}\cos^2\varphi\,d\varphi$
And so it follows that
(283) $\int_0^{\pi/2}\cos^2\varphi\,d\varphi = \frac{\pi}{4}$
We insert this into (281) and continue:
(284) $\int_0^a F_2(y)\,dy = \frac{2}{\pi}\cdot\frac{\pi}{4}\int_0^a e^{-y/2}\,dy = \frac{1}{2}\left[-2e^{-y/2}\right]_0^a = 1 - e^{-a/2}$
The formula offered above gives (with n = 2)
(285) $\chi^2_2(y) = \frac{y^0}{\sqrt{2^2 e^y}\,\Gamma(2/2)} = \frac{e^{-y/2}}{2}$
For higher dimensions the proof is a bit more involved but quite similar. We change to polar coordinates and integrate along the points of equal distance to the center. Instead of dx_i, i = 1, 2, ..., n, we integrate over y and φ_j, j = 2, 3, ..., n.

F-distribution There are two parameters, n_1 and n_2. If X is χ² with n_1 degrees of freedom and Y is χ² with n_2 degrees of freedom, then the variable Z := (X/n_1)/(Y/n_2) = (n_2 X)/(n_1 Y) has a distribution called the F-distribution with degrees of freedom (n_1, n_2). Its density function is
(286) $f(x) = \frac{\Gamma((n_1+n_2)/2)}{\Gamma(n_1/2)\,\Gamma(n_2/2)} \left(\frac{n_1}{n_2}\right)^{n_1/2} x^{(n_1/2)-1} \left(1 + \frac{n_1 x}{n_2}\right)^{-(n_1+n_2)/2}$

The mean of this distribution is $\frac{n_2}{n_2 - 2}$ (independent of n_1!), and the variance (which exists for n_2 > 4) is
(287) $\frac{2 n_2^2 (n_1 + n_2 - 2)}{n_1 (n_2 - 2)^2 (n_2 - 4)}$

Student (or t-)distribution Like the χ²-distribution, this one has a natural number n as its sole parameter. Suppose that X² has an F-distribution with degrees of freedom (1, n) and that X is distributed symmetrically around zero (we need this condition because the root may be positive or negative). Then X is said to have the t-distribution with n degrees of freedom.
(288) $t_n(x) := \frac{\Gamma(\frac{1}{2}(n+1))}{\sqrt{n\pi}\,\Gamma(n/2)} \left(1 + \frac{x^2}{n}\right)^{-(n+1)/2}$

The mean of this distribution is 0 (as it is symmetric), and the variance is $\frac{n}{n-2}$.

13 Parameter Estimation

In this section we shall generalize the problem and illustrate the general pattern of reasoning in statistics. The problem is as follows. Suppose you have a certain space of outcomes and events, and what you need are the probabilities. This task is, generally speaking, quite impossible to solve. A similar problem is this. You have a random variable X on a space and you wish to infer its distribution. That is to say, your space may be considered the space of real numbers (or a suitable part thereof), and you want to establish the probability density, that is, the function f(x) such that the probability that X is found in the interval [a, b] is
(289) $\int_a^b f(x)\,dx$
The way to do this is to run an experiment. The experiment gives you data, and on the basis of this data one establishes the distribution. There are plenty of examples. For example, we measure the voice onset time of the English sound [b]. Most likely, this is not a fixed number, but it is distributed around a certain number t, and the probability is expected to decrease with increasing distance to t. We may measure grammaticality judgments of given sentences on a scale and find a similar pattern. We may estimate the probability distribution of words according to their rank by taking a text and counting their frequency. And so on. It is to be stressed that in principle there is no way to know which probability density is the best. We always have to start with a simple hypothesis and take matters from there. For example, suppose we measure the voice onset times of [b] with many speakers, and many samples for each speaker. Then we get the data, but which density function are we to fit? Typically, one makes a guess and says that the voice onset time is normally distributed around a certain value, so the probability density is

(290) $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2/2\sigma^2}$
We shall discuss this example, as it is the most widespread. What we need to establish is just two numbers: µ and σ. This radically simplifies the problem. We shall see that there are ways to estimate both numbers in a very simple way. But we shall also see that the proofs that this is so are quite difficult (and we shall certainly not expose them in full here).

First let us look at µ. If the normal distribution is just the limit of the binomial distribution then we know what µ is: it is the mean. Now, suppose that we perform the experiment n times with result $\vec{s} = \langle s_1, s_2, \dots, s_n \rangle$. This is our sample. Now we define the number
(291) $m(\vec{s}) := \frac{1}{n}\sum_{i=1}^n s_i$
This is called the sample mean. The sample mean, as we stressed before, does not have to be the same as the number µ. However, what we claim is that the sample mean is an unbiased and efficient estimator of the mean µ. This means in words the following:

➀ under the assumption that the mean is µ, the sample mean converges towards µ as n grows large.

➁ among all possible estimators, the sample mean has the least variance.

We shall prove the first claim only. We introduce n independent random variables X_1, X_2, ..., X_n, representing the values of experiments 1, 2, ..., n. The claim says that if the mean is µ, and we draw a sample of size n, we should expect the sample mean to be µ. The sample mean is $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$.

(292) $E(\bar{X}) = \frac{1}{n}\sum_{i=1}^n E(X_i) = \frac{1}{n}\cdot n\cdot\mu = \mu$
This is because each random variable is distributed according to the same distribution, and the mean is µ. If the variance of the distribution is known, then we can actually compute the variance of $\bar{X}$ as well:

(293) $V(\bar{X}) = V\left(\sum_{i=1}^n \frac{X_i}{n}\right) = \frac{n}{n^2}\,V(X) = \frac{\sigma^2}{n}$
It is another matter to show that the sample variance also approaches that value, so that the estimator is shown to be efficient. We shall skip that part. Instead we shall offer a way to quantify the certainty with which the sample allows us to estimate µ. We base this for the moment on the assumption that we know the value of σ.

The sample mean X̄ is also normally distributed. The same holds for $Y := \sqrt{n}(\bar{X} - \mu)/\sigma$. This is called the adjusted sample mean. What it does is the following: we adjust the distribution to a standard normal distribution by subtracting the mean and dividing by the standard deviation. The distribution of Y is then standardized: the mean is 0 and the variance is 1. Now, suppose we want to announce our result with certainty p. That is to say, we want to announce numbers a and b such that µ falls within the interval [a, b] with probability p. Based on Y, we must ask for numbers a* and b* such that P(Y ∈ [a*, b*]) ≥ p. Once we have a* and b* we get a and b as

(294) $a = \frac{a^*\sigma}{\sqrt{n}} + \bar{X}, \qquad b = \frac{b^*\sigma}{\sqrt{n}} + \bar{X}$

Since the expected value of Y is 0, the interval can be taken of the form [−a*, a*], so that
(295) $a = \bar{X} - \frac{a^*\sigma}{\sqrt{n}}, \qquad b = \bar{X} + \frac{a^*\sigma}{\sqrt{n}}$

Now, to find a∗, we need to use our tables. We want to solve

(296) $\frac{1}{\sqrt{2\pi}}\int_{-a^*}^{a^*} e^{-x^2/2}\,dx = \Phi(a^*) - \Phi(-a^*)$

There is a simpler solution. Observe that the function e^{−x²/2} is symmetric around the origin. Hence
(297) $\frac{1}{\sqrt{2\pi}}\int_{-a^*}^{a^*} e^{-x^2/2}\,dx = \frac{2}{\sqrt{2\pi}}\int_0^{a^*} e^{-x^2/2}\,dx = 2(\Phi(a^*) - \Phi(0)) = 2\Phi(a^*) - 1$

(Notice that Φ(0) = 1/2.) Recall that we have specified p in advance. Now we have

(298) $p = 2\Phi(a^*) - 1$

From this we get

(299) $\Phi(a^*) = \frac{p + 1}{2}$

So, the procedure is this. Given p, we establish a* by lookup (or with the help of the computer), using formula (299). This we enter into formula (295) and so establish the actual confidence interval for µ.

Similar considerations reveal that if µ is known, then the sample variance is an efficient unbiased estimator of the variance σ². However, this is in practice not a frequently occurring situation. Very often we need to estimate both µ and σ. In this case, however, something interesting happens. The sample variance based on µ is this:
(300) $\frac{1}{n}\sum_{i=1}^n (s_i - \mu)^2$

But now that we also have to estimate µ, things become more complex. First, it again turns out that µ is approximated by the sample mean. Consider the random variable for the deviation:
(301) $s_X := \sqrt{\frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n}}$

Let us calculate the expected value of its square:
(302) $E(s_X^2) = E\left(\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2\right)$
      $= \frac{1}{n}\,E\left(\sum_{i=1}^n ((X_i - \mu) - (\bar{X} - \mu))^2\right)$
      $= \frac{1}{n}\left(\sum_{i=1}^n E((X_i - \mu)^2) + n\,E((\bar{X} - \mu)^2) - 2\,E\left((\bar{X} - \mu)\sum_{i=1}^n (X_i - \mu)\right)\right)$
      $= \frac{1}{n}\left(n\sigma^2 + n\cdot\frac{\sigma^2}{n} - 2\,E((\bar{X} - \mu)(n\bar{X} - n\mu))\right)$
      $= \frac{1}{n}\left(n\sigma^2 + \sigma^2 - 2n\,E((\bar{X} - \mu)^2)\right)$
      $= \frac{1}{n}\left(n\sigma^2 + \sigma^2 - 2n\cdot\frac{\sigma^2}{n}\right)$
      $= \frac{(n-1)\sigma^2}{n}$
It turns out that the sample variance is not an unbiased estimator of the variance! If it were, the result would have been σ². This suggests the following

Definition 43 The unbiased sample variance for a sample of size n is
(303) $\hat{s}_X^2 := \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$

Of course, calling this estimator unbiased calls for a proof. However, notice that $\hat{s}_X^2 = \frac{n}{n-1}\, s_X^2$, so that $E(\hat{s}_X^2) = \frac{n}{n-1}\cdot\frac{n-1}{n}\,\sigma^2 = \sigma^2$; so this is easily established. We skip the efficiency part. We notice only that the calculations show that $\hat{s}_X^2$ is only asymptotically efficient; this means that we cannot be guaranteed for any n that the variance of the sample variance is the least possible value, but for growing n it approaches the least value within any given error. It is perhaps good to reflect on the problem of the discrepancy between variance and sample variance. As seen, the factor is $\frac{n}{n-1}$. Where does it come from?

Statisticians like to express this in terms of degrees of freedom. Suppose you draw a sample of size n. Independently of the variance of the distribution, the mean can be estimated by the sample mean. Now the variance, based on the mean, is calculated straightforwardly. However, now that we have used the data once to extract the mean, we have eliminated one degree of freedom from it. We have fixed the mean to be the sample mean, and we have used the data to do this. One says that we have removed one degree of freedom. To see this in clearer terms, notice that we may also present our data as follows:

(304) $s^* := \langle \bar{s}, s_1, s_2, \dots, s_n \rangle$
This looks like n + 1 data points, but the truth is that the sum of the last n terms is actually n·s̄. Thus, one of the terms completely depends on the others, and can therefore be dropped. We decide to drop s_1:
(305) $s^* := \langle \bar{s}, s_2, \dots, s_n \rangle$

Finally, when we calculate the variance, we have to add the squares of all terms with the mean subtracted. But the contribution of the first term is guaranteed to be 0. So, our new data contains only n − 1 terms rather than n. If this argumentation sounds fishy, here is another one. In terms of probabilities, the situation can also be described as follows. We must rescale our expectations of the variance by the knowledge that the sample mean is what it is. In other words, once the sample mean is calculated, the expectation of the variance is tilted just because the sample isn't drawn freely any more: it has to have that mean. We have to calculate the expectation of the variance on condition that µ = X̄. This ought to slightly diminish the variance, and it does so by the factor (n−1)/n. Finally, let us return to our estimation problem. We have to address the question of how we can extract a confidence interval for given p. This is not easy since we must now estimate both µ and σ simultaneously. The solution is to consider the value
(306) $T := \sqrt{\frac{n(\bar{X} - \mu)^2}{\hat{s}_X^2}}$

This number is known to be distributed according to the t-distribution with n − 1 degrees of freedom. This is because T² is the quotient of $n(\bar{X} - \mu)^2/\sigma^2$ (from the adjusted sample mean) and $\hat{s}_X^2/\sigma^2$ (from the adjusted sample variance), and $n(\bar{X} - \mu)^2/\sigma^2$ and $(n-1)\hat{s}_X^2/\sigma^2$ are independent χ²-random variables with 1 and n − 1 degrees of freedom, respectively. So T² follows the F-distribution with degrees of freedom (1, n − 1), and T follows the t-distribution. Hence we can do the following. Suppose we wish to have a confidence interval of level p. Since t_{n−1}(x) = t_{n−1}(−x), the 100p% confidence interval for the mean is
(307) $\left[\bar{X} - t_{n-1}((1+p)/2)\sqrt{\frac{\hat{s}_X^2}{n}},\;\; \bar{X} + t_{n-1}((1+p)/2)\sqrt{\frac{\hat{s}_X^2}{n}}\right]$

For the variance σ² we get the following interval:
(308) $\left[\frac{(n-1)\hat{s}_X^2}{\chi^2_{n-1}((1+p)/2)},\;\; \frac{(n-1)\hat{s}_X^2}{\chi^2_{n-1}((1-p)/2)}\right]$

14 Correlation and Covariance

We begin with a general question. Let us have two random variables X and Y, one of which, say X, we consider observable, while the other one, Y, is hidden. How much information can we obtain about Y from observing X? We have already seen plenty of such situations. For example, we may consider the space to consist of pairs ⟨x⃗, y⟩, where x⃗ is the result of an n-fold iteration of a Bernoulli experiment with p = y. The variable x⃗ is observable, while y is hidden. We have ways to establish the probability of ⟨x⃗, y⟩, and they allow us to gain knowledge about y from x⃗. A special case of this is when we have two random variables X and Y over the same space. There is an important number, called the covariance, which measures the extent to which X reveals something about Y. It is defined by

(309) $\mathrm{cov}(X, Y) := E((X - E X)(Y - E Y))$

We notice that cov(X, Y) is symmetric and linear in both arguments:

(310) cov(X, Y) = cov(Y, X)
      cov(X, Y₁ + Y₂) = cov(X, Y₁) + cov(X, Y₂)
      cov(X, aY) = a cov(X, Y)
      cov(X₁ + X₂, Y) = cov(X₁, Y) + cov(X₂, Y)
      cov(aX, Y) = a cov(X, Y)

If cov(X, Y) = 0 then X and Y are said to be uncorrelated. Notice that random variables may be uncorrelated but nevertheless not independent. An example is X := sin α and Y := cos α, where α ranges over Ω = {0, π/2, π} with equal probabilities. We find X(0) = X(π) = 0 and X(π/2) = 1; Y(0) = 1, Y(π/2) = 0 and Y(π) = −1. Now, E X = (1/3)(0 + 1 + 0) = 1/3 and E Y = (1/3)(1 + 0 − 1) = 0. Finally, Z := (X − E X)(Y − E Y) takes the following values:

(311) Z(0) = (0 − 1/3)(1 − 0) = −1/3
      Z(π/2) = (1 − 1/3)(0 − 0) = 0
      Z(π) = (0 − 1/3)(−1 − 0) = 1/3

The expectation of Z is

(312) E Z = (1/3)(−1/3 + 0 + 1/3) = 0

So, X and Y are uncorrelated. But they are not independent. For example,

(313) $P(X = 1 \cap Y = 0) = 1/3 \ne P(X = 1)\,P(Y = 0) = 1/9$

The following connects the variance and the covariance:

(314) $V(X + Y) = V X + V Y + 2\,\mathrm{cov}(X, Y)$

For a proof note that

(315) $V X + V Y + 2\,\mathrm{cov}(X, Y) = E X^2 - (E X)^2 + E Y^2 - (E Y)^2 + 2\,E((X - E X)(Y - E Y))$
      $= E X^2 + E Y^2 + 2\,E(XY) - (E X)^2 - (E Y)^2 - 2(E X)(E Y)$
      $= E(X^2 + 2XY + Y^2) - ((E X)^2 + 2(E X)(E Y) + (E Y)^2)$
      $= E(X + Y)^2 - (E(X + Y))^2$
      $= V(X + Y)$

To see the second step, observe that E((X − E X)(Y − E Y)) = E(XY) − (E X)(E Y), since E(X · E Y) = (E X)(E Y). Additionally, the correlation coefficient ρ(X, Y) is defined as

(316) $\rho(X, Y) := \frac{E((X - E X)(Y - E Y))}{\sqrt{V X \cdot V Y}}$

We find that ρ(X, X) = cov(X, X)/ V X. However, cov(X, X) = E(X − E X)2 = V X, so that ρ(X, X) = 1. Furthermore, it is easily seen that ρ(X, −X) = −1. These are the most extreme cases.

Proposition 44 −1 ≤ ρ(X, Y) ≤ 1.

Proof. I shall give the argument in the discrete case. In that case, we have to show that
(317) $|E((X - E X)(Y - E Y))| \le \sqrt{V X \cdot V Y}$

Now,
(318) $E(X - E X) = \sum_\omega (X(\omega) - E X)\,p(\omega)$
(319) $V(X) = \sum_\omega (X(\omega) - E X)^2\,p(\omega)$

Let ℝ^Ω be the real vector space of functions on Ω. Introduce the vectors X⃗∗ and Y⃗∗ by $\vec{X}_*(\omega) := (X(\omega) - E X)\sqrt{p(\omega)}$ and $\vec{Y}_*(\omega) := (Y(\omega) - E Y)\sqrt{p(\omega)}$. Then (317) becomes
(320) $\left|\sum_\omega \vec{X}_*(\omega)\,\vec{Y}_*(\omega)\right| \le \sqrt{\sum_\omega \vec{X}_*(\omega)^2 \cdot \sum_\omega \vec{Y}_*(\omega)^2}$

(320) expresses the following:
(321) $|\vec{X}_* \cdot \vec{Y}_*| \le \sqrt{(\vec{X}_* \cdot \vec{X}_*)(\vec{Y}_* \cdot \vec{Y}_*)} = |\vec{X}_*|\,|\vec{Y}_*|$

In other words, this is the well-known Cauchy–Schwarz inequality: the scalar product x⃗ · y⃗ is the length of x⃗ times the length of y⃗ times the cosine of the angle between them, and that cosine is at most 1 in absolute value. So the correlation coefficient is the cosine of the angle between the random variables viewed as vectors. □ We see from the proof also that ρ(X, Y) is 1 or −1 just in case Y is a linear multiple of X. Thus the correlation coefficient effectively measures to what degree X and Y are linearly dependent. If Y is not a linear multiple of X but some other function, say Y = X², then the correlation is some number other than 1 or −1. (We have seen above that the correlation coefficient can also be 0.) Also, from a geometrical viewpoint it becomes clear that if the correlation coefficient is 0, the vectors need not be independent: all this says is that X⃗∗ is orthogonal to Y⃗∗. A pair of variables is called Gaussian if its distribution is as follows:

(322) $P(X = x \cap Y = y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{-\frac{1}{2(1-\rho^2)}\left[\frac{(x-m_1)^2}{\sigma_1^2} - 2\rho\,\frac{(x-m_1)(y-m_2)}{\sigma_1\sigma_2} + \frac{(y-m_2)^2}{\sigma_2^2}\right]\right\}$

This definition can be extended to vectors of random variables of arbitrary length, but this makes the picture no clearer. It turns out that ρ = ρ(X, Y). If ρ = 0 then the function above reduces to
(323) $P(X = x \cap Y = y) = \frac{1}{2\pi\sigma_1\sigma_2}\exp\left\{-\frac{1}{2}\left[\frac{(x-m_1)^2}{\sigma_1^2} + \frac{(y-m_2)^2}{\sigma_2^2}\right]\right\}$
We also find that

(324) $P(X = x) = \frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-(x-m_1)^2/2\sigma_1^2}$
(325) $P(Y = y) = \frac{1}{\sqrt{2\pi}\,\sigma_2}\, e^{-(y-m_2)^2/2\sigma_2^2}$
And so we obtain that

(326) P(X = x ∩ Y = y) = P(X = x) · P(Y = y)

Theorem 45 Let X and Y be Gaussian random variables. If X and Y are uncorrelated, they are independent.

15 Linear Regression I: The Simple Case

The following sections draw heavily on [6], which is a rather well-written book on regression. It covers much more than can be done here, so if further clarification is needed, the reader is advised to consult it. Consider an experiment where data has been collected, for example, where one measures fuel consumption. At the same time, one has some information available, such as taxation, the state where one lives, age, income, and so on. What one finds is that the null hypothesis, that the consumption is simply independent of all these variables, seems quite unlikely. One is convinced that any of the variables might be a factor contributing to the effect that we are measuring. The question is: what is the precise effect of any of the factors involved? We consider first a simplified example: we assume that all factors enter linearly. So, we are assuming a law of the following form. The variables that we are given are called X_1, X_2, and so on. They all come from different populations. The variable that we wish to explain is called Y. The variables whose values we consider given are called predictors; the variable we want to explain in terms of the predictors is called the response. Ideally we would like to have the correspondence

(327) $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$
In practice, such an exact correspondence is never found. Instead, we allow for an additional error ε, so that the equation now becomes

(328) $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$
The condition on ε is that it is normally distributed and has mean 0. That is, the density of the distribution of ε is $\frac{1}{\sqrt{2\pi}\,\sigma}e^{-x^2/2\sigma^2}$. This last condition may appear arbitrary, but it is a way to ensure that we do not use ε as a garbage can for effects that we are just unable to account for. When we have made our measurements, it is our first task to find the numbers β_i. This is done as follows. Look at Table 1 for definitions of terms. In that table, n is the number of repetitions of the experiment. The running index for the sums is i, and it runs from 1 to n. We assume we have only two variables, X and Y, from which we get the sample points $\vec{x} = \langle x_1, x_2, \dots, x_n \rangle$ and $\vec{y} = \langle y_1, y_2, \dots, y_n \rangle$. These give rise to the statistics shown in Table 1. They are easily generalised to the case where we have more than one variable.

Table 1: Definitions of Sample Statistics
    Symbol   Statistic                  Description
    x̄        Σ xᵢ/n                     Sample average
    SXX      Σ (xᵢ − x̄)²                Sample sum of squares
    SD²ₓ     SXX/(n − 1)                Sample variance
    SDₓ      √(SXX/(n − 1))             Sample standard deviation
    SXY      Σ (xᵢ − x̄)(yᵢ − ȳ)         Sum of cross-products
    s_xy     SXY/(n − 1)                Sample covariance
    r_xy     s_xy/(SDₓ SD_y)            Sample correlation

We discuss first the case of a single explanatory variable. In this case, we shall have to establish three numbers: β₀, β₁ and σ. It is customary to put a hat on a number that is computed from a sample. Thus, while we assume that there is a number β₀ that we have to establish, based on a sample we give an estimate of this number and call it β̂₀. For example:

(329) $\hat\beta_1 := \frac{SXY}{SXX} = r_{xy}\left(\frac{SYY}{SXX}\right)^{1/2}, \qquad \hat\beta_0 := \bar{y} - \hat\beta_1 \bar{x}$

These numbers estimate our dependency of Y on X1. We estimate the error of fit as follows:

(330) $RSS(\gamma_0, \gamma_1) := \sum_{i=1}^n (y_i - (\gamma_0 + \gamma_1 x_i))^2$
This is called the residual sum of squares. Notice that even with the true numbers β₀ and β₁ in place of γ₀ and γ₁ we get some residual. Now that we have estimated these numbers we compute the residual sum of squares and thus estimate the variance of the error. However, notice that we have taken away two degrees of freedom, as we have established two numbers already. Thus the following is an unbiased estimator for σ²:
(331) $\hat\sigma^2 := \frac{RSS(\hat\beta_0, \hat\beta_1)}{n - 2}$
It is immediately clear that these are just estimates and therefore we must provide confidence intervals for them. It can be shown that the estimators are unbiased.

Moreover, β̂₀ and β̂₁ can be shown to be normally distributed, since they are linear functions of the yᵢ, which depend on the one hand on the xᵢ, and on the other hand are normally distributed for given xᵢ, by the assumption on the error. Be aware, though, that we have not claimed anything for the real parameters, β₀ and β₁. To give confidence intervals, however, we can only judge the matter on the basis of the data, and then the true values are assumed fixed. Thus, the yᵢ do depend linearly on the xᵢ with a given normal error. Thus, to establish the confidence intervals, we only need to estimate the variance:
(332) $V(\hat\beta_1) = \sigma^2\,\frac{1}{SXX}, \qquad V(\hat\beta_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{SXX}\right)$
These are the true values. To get estimates, we replace the true σ by its estimate:
(333) $\hat{V}(\hat\beta_1) = \hat\sigma^2\,\frac{1}{SXX}, \qquad \hat{V}(\hat\beta_0) = \hat\sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{SXX}\right)$
Notice that rather than the true variance, we now have the estimated variance, so we must also put a hat on the variance. Knowing that the parameters are normally distributed, we can now give confidence intervals in the standard way, based on a distribution with mean β̂ᵢ and variance V̂(β̂ᵢ). What we would like to know now is how good this actually is in terms of explaining the data. There are several tests that describe the significance of the data. We shall discuss them now, including a new one, the R² value.

The first test is to establish whether or not adding the variable X₁ as an explanatory variable actually was necessitated by the data. The null hypothesis is that X₁ is not an explanatory variable; formally, this is the hypothesis that β₁ = 0. Thus the test we are applying is to decide between the following alternatives, with (NH) referring to the null hypothesis and (AH) to the alternative hypothesis:
(334) (NH) E(Y | X = x) = β₀
      (AH) E(Y | X = x) = β₀ + β₁x for some β₁ ≠ 0
To assess this, we use the F statistic:
(335) $F := \frac{(SYY - RSS)/1}{\hat\sigma^2}$

This number is distributed according to the F-distribution with (1, n − 2) degrees of freedom. It is thus possible to compute the F value, but also the t value (the square root of F) as well as the p value. The latter is, as observed earlier, the probability of obtaining data whose t value is as extreme as the one we get from our data. Finally, there is an additional number which is of great interest. It is R², where

(336) $R^2 := 1 - \frac{RSS}{SYY}$

The number R² is called the coefficient of determination. It measures the strength of prediction of X₁ for the value Y. Observe that
(337) $R^2 = \frac{(SXY)^2}{SXX \cdot SYY} = r_{xy}^2$
Thus, R is nothing but the correlation of X and Y. Since the numbers are calculated with correction by the mean, the correlation would be 1 or −1 if σ = 0. This is hardly ever the case, though. A final check of the goodness of the approximation is a look at the residuals. By definition, these are

(338) $e_i := y_i - (\beta_0 + \beta_1 x_i)$
Again, as we are now dealing with approximations, the only things we can actually compute are
(339) $\hat{e}_i := y_i - (\hat\beta_0 + \hat\beta_1 x_i)$

We have required that these values are normally distributed and that the mean is 0. Both assumptions must be checked. The mean is easily computed. However, whether or not the residuals are normally distributed is not easy to assess. Various ways of doing that can be used, the easiest of which is a plot of the residuals over fitted values (they should look random), or a Q-Q plot. We provide an example. If you load the package "alr3" you will find a data set called forbes.txt, which contains data correlating the boiling temperature of water with the pressure. It is a simple text file, containing three columns, labeled "Temp", "Pressure" and "Lpres". The data is loaded into R and plotted by

(340) For <- read.table("/usr/lib/R/library/alr3/data/forbes.txt",
          header=TRUE)
      plot(For$Pressure, For$Temp, xlab="Pressure",
          ylab="Temperature")

The result is shown in Figure 1. As a next step we compute the regression line. One way to do this is to use lm as follows.

(341) attach(For)
      forl <- lm(Temp ~ Pressure)
      summary(forl)

This summary will give us a host of additional information, but for the moment we are happy just with the values of the constants. The intercept, β₀, is estimated to be 155.296 and the factor β₁ to be 1.902. Let us do another plot, this time inserting the regression line. For example, with pressure equal to 22.4 we get the temperature 155.296 + 1.902 · 22.4 = 197.9008 (against the measured 197.9), and with pressure 30.06 we get 155.296 + 1.902 · 30.06 = 212.47 against the measured 212.2. The fit appears to be good.

(342) x <- c(22.4, 30.06)
      y <- 155.296 + 1.902 * x
      pdf(file = "forbes-wl.pdf")
      plot(Pressure, Temp, xlab="Pressure", ylab="Temperature")
      lines(xy.coords(x, y), col="red")
      dev.off()

This produces the graphics in Figure 2. To complete the analysis, we now plot the error, shown in Figure 3. A close look at this plot reveals that there is one point which is actually "odd". It should probably be removed from the set because it seems to be erroneous. Apart from this point, however, the error is still not normally distributed: it starts below the line, increases, and then decreases again. Thus, the misfit, though at first slight, is systematic. We shall return to this problem.

Figure 1: The Temperature over Pressure. [Scatter plot; x-axis: Pressure (22–30), y-axis: Temperature (195–210).]

Figure 2: The Temperature over Pressure with Regression Line. [Scatter plot with fitted line; x-axis: Pressure (22–30), y-axis: Temperature (195–210).]

Figure 3: The Error over Pressure. [Residual plot; x-axis: Pressure (22–30), y-axis: Error (−1.0 to 0.5).]

16 Linear Regression II

In this chapter we shall look at the general equation

(343) Y = β0 + β1X1 + β2X2 + ··· + βnXn + ε

As before, the condition on ε is that it is normally distributed and has mean 0; that is, the density of the distribution of ε is $\frac{1}{\sqrt{2\pi}\,\sigma}e^{-x^2/2\sigma^2}$. This allows us to give the equation another form, namely

(344) $E(Y \mid \vec{X}) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$

This means that the conditional expectation of Y given the predictors is described by the law above. Taking expectations makes the random term disappear.

In fact, even though it is called linear regression, it is not necessary that Xᵢ equals an observable. Rather, it is allowed to transform the explanatory variables in any way. Here is a very simple example. The area A of a rectangle is given by the formula
(345) A = ℓ · h
where ℓ is the length and h the height. Taking logarithms on both sides we get
(346) log A = log ℓ + log h

This is a linear law, but it involves the logarithms of length and height. In general, any equation that involves a product of numbers (and in physics there are very many such laws) can be turned into a linear law by taking logarithms. This leads to the distinction between predictors and terms. The predictors are the variables that enter the equation on the right-hand side. The terms are the Xᵢ. Terms might, among others, be the following:

➀ The intercept. Rewrite (343) as
(347) $Y = \beta_0 X_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$
where X₀ is a term that always equals 1.

➁ The predictors themselves.

➂ Transformations of predictors. If X is a predictor, a transformation would be any function thereof, for example log X, X p for some number p, eX, sin X, and so on.

➃ Polynomials of predictors. If X is a predictor, then we have a term of the form $a_0 + a_1 X + a_2 X^2 + \cdots$.

➄ Interactions of predictors. As seen above, we may have terms that involve several predictors. A pure multiplicative law, however, can be rendered linear by taking the log on both sides.

➅ Dummy variables and factors. If the result depends on a factor rather than a number, we can artificially create room for the effect of that factor by introducing a variable that assumes only two values, normally 0 and 1, depending on the presence or absence of the factor. The factor enters into the equation in the form of that variable.

Additionally, we can also transform the response, as we have done above. This raises three separate questions: first, which terms enter the equation; second, what the coefficients are, together with their mean and variance; and third, which predictors should be used. The answer to the last question is deferred to the next section. Here we shall first briefly address the question about the coefficients and then talk about transforms. The determination of the coefficients is basically done through least squares. Suppose for simplicity that the terms are the predictors. Next assume that we have n + 1 data points, $\vec{x}_i = \langle x_{i1}, x_{i2}, \dots, x_{in} \rangle$, as well as the yᵢ; then we end up with a system of equations of the following form:

(348) $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_n x_{in} + \varepsilon$

This requires linear algebra to solve for the βᵢ. Additionally, it is possible not only to estimate the βᵢ but also to give estimates of the variance. This allows for the estimation of the error. The details of this go beyond the scope of these lectures, however. Luckily, this can be left to the software, in our case R. It is however important to understand what sort of computations R performs. Significance: the test involves the following hypotheses: the null hypothesis that there is only an intercept, and the alternative that there is a law of the form (347).
(349) (NH) E(Y | X⃗) = β₀
      (AH) E(Y | X⃗) = β₀ + β₁X₁ + ··· + βₙXₙ

R also gives significance values for each individual variable. This is the column Pr(>|t|). Here the alternative is the following.

(350) (NH) E(Y | X⃗) = β₀ + β₁X₁ + ··· + βₙXₙ with βᵢ = 0
      (AH) E(Y | X⃗) = β₀ + β₁X₁ + ··· + βₙXₙ

This can lead to the following problem: if the response can be predicted equally well from two predictors, then neither of them is significant, since either one can be dropped if the other is kept. (A trivial example would be measuring the temperature both in Celsius and in Fahrenheit. This gives two numbers in each case, but each one is as informative as the other.) The tests would declare both predictors insignificant, since either of them can be dropped from the ensemble. But the test only reveals what happens if one of them is dropped. After it has been dropped, the other predictor can rise dramatically in significance. Instead, one can also look at the cumulative significance. Here the alternative is the following:

(351) (NH) E(Y | X⃗) = β₀ + β₁X₁ + ··· + βᵢ₋₁Xᵢ₋₁
      (AH) E(Y | X⃗) = β₀ + β₁X₁ + ··· + βᵢXᵢ

These numbers are returned in the table produced when anova is called in R. The significance is given in the column Pr(>F). This measures whether adding the term to the terms up to number i − 1 makes any difference. Obviously, the order in which the terms are presented makes a difference. If Xᵢ is as predictive as Xⱼ, j > i, then the cumulative significance of Xⱼ is zero, because all the information has been given already. If one were to interchange them, Xᵢ would be judged insignificant instead.

16.1 General F test

The general theory behind the significance tests for predictors is as follows. We assume we have two sets of predictors, X₁ through X_{p−q} and X_{p−q+1} through X_p. We ask whether, after having added the first set of predictors, the second set is still needed. We formulate a hypothesis:

(352) (NH) Y = β₀ + β₁X₁ + ··· + β_{p−q}X_{p−q} + ε
      (AH) Y = β₀ + β₁X₁ + ··· + β_pX_p + ε

Doing this, we compute two regressions and get RSS_NH with degrees of freedom df_NH for the null hypothesis, and RSS_AH with degrees of freedom df_AH for the alternative hypothesis. Clearly, RSS_NH − RSS_AH ≥ 0. If they are actually equal, the answer is clear: the additional variables are not needed. So, we may assume that the difference is not zero. We now compute the value
(353) $F = \frac{(RSS_{NH} - RSS_{AH})/(df_{NH} - df_{AH})}{RSS_{AH}/df_{AH}}$

This number is F(dfNH − dfAH, dfAH) distributed. Hence we can apply an F-test to determine whether or not adding the predictors is needed, and we can do so for an arbitrary group of predictors.

16.2 Lack of Fit

Lack of fit can be diagnosed through the residuals. Let us look at a particular case, when the conditional variance of Y is not constant. If the Xᵢ assume certain fixed values, we should under (347) assume that Y has the fixed variance σ². Thus,

(354) $V(Y \mid \vec{X} = \vec{x}) = \sigma^2$

Assume that V(Y | X⃗) is however a function of E(Y | X⃗ = x⃗), that is, assume that

(355) $V(Y \mid \vec{X} = \vec{x}) = \sigma^2\, g(E(Y \mid \vec{X} = \vec{x}))$

In that case the misfit can often be corrected by transforming the response. There are a number of heuristics that one can apply.

➀ If g = ax, use √Y as the response in place of Y.

➁ If g = ax², use log Y in place of Y.

➂ If g = ax⁴, use 1/Y in place of Y. This is appropriate if the response is often close to 0, but occasional large values occur.

➃ If Y is a proportion between 0 and 1, you may use sin⁻¹(√Y) in place of Y.

Notice that such a transformation makes the mean function nonlinear!

17 Linear Regression III: Choosing the Estimators

We have previously talked about information in connection with word length. Now we shall engage in a more general discussion of information. The notion of information occurs in connection with two problem areas. The first is the question of how much information a sample contains, and the second is the cross-entropy of probability distributions, which is needed to estimate the goodness of fit of approximations.
The Kullback–Leibler Divergence. Let p and q be two probability functions on a space Ω. Then
(356) $KL(p, q) := E_p\left(\log_2\frac{p}{q}\right) = \sum_{\omega\in\Omega} p(\omega)\,\log_2\frac{p(\omega)}{q(\omega)}$
In the continuous case:
(357) $KL(f, g) := \int_{-\infty}^{\infty} f(x)\,\log_2\frac{f(x)}{g(x)}\,dx$
The entropy of a distribution is
(358) $H(p) := -\sum_{\omega\in\Omega} p(\omega)\,\log_2 p(\omega)$

The cross-entropy H(p, q) is defined by
(359) $H(p, q) := -\sum_{\omega\in\Omega} p(\omega)\,\log_2 q(\omega)$

Notice that H(p) = H(p, p). The equation (356) now becomes

(360) KL(p, q) = H(p, q) − H(p)

The Kullback–Leibler divergence behaves much like a distance function. It has the following properties.

Proposition 46

(361) KL(p, q) ≥ 0
      KL(p, q) = 0 ⇔ p = q

Proof. We do only the discrete case. First, observe that ln x ≤ x − 1, and that equality only holds when x = 1. Now, applying this with x = q(ω)/p(ω),
(362) $KL(p, q) = -\sum_\omega p(\omega)\,\log_2\frac{q(\omega)}{p(\omega)} \ge -\frac{1}{\ln 2}\sum_\omega p(\omega)\left(\frac{q(\omega)}{p(\omega)} - 1\right) = -\frac{1}{\ln 2}\left(\sum_\omega q(\omega) - \sum_\omega p(\omega)\right) = 0$
with equality iff q(ω) = p(ω) for all ω. □
It is not symmetric, though, and therefore the original definition of [4] actually works with the symmetrized version KL(p, q) + KL(q, p). The latter has the properties above and is additionally also symmetric. Neither of them is a metric, since they do not satisfy the triangle inequality. Consider now the following question. We have a dependent variable Y and some explanatory variables Xᵢ, 1 ≤ i ≤ m, and we wish to find the best possible linear model that predicts Y from the Xᵢ. Goodness of fit is measured in terms of the residual sum of squares. If f is the estimator, and we have taken n measurements, each leading to samples ω⃗ᵢ, i ≤ n, for the variables Xᵢ and to values ηᵢ, i ≤ n, for the variable Y, then

(363) $RSS(f) := \sum_{i=1}^n (\eta_i - f(\vec\omega_i))^2$
This function measures the distance of the approximating function f from the actual, observed data. We want to find the best possible function f in the sense that RSS(f) is minimal. Unless an explanatory variable is really useless, it is always better to add it, because it may contribute to improving the residual sum of squares. On the other hand, if we have m measurements and m variables, we can always fit a polynomial of degree m − 1 through the data points, and so the residual sum of squares will be 0. Although we are dealing here with linear models, the question that the example raises is a valid one: simply minimising RSS will result in the inclusion of all variables, however small their contribution to the function f is. Even variables that make no contribution at all in the real distribution will be included, because statistically it is likely that they appear to be predictive. To remove this deficiency, it has been proposed to measure not only the RSS but to introduce a punishment for having too many explanatory variables. This leads to the Akaike Information

Criterion:
(364) $AIC = \frac{2k}{n} + \ln\left(\frac{RSS}{n}\right)$

In addition to the mean RSS, what enters into it is the number n of measurements and the number k of explanatory variables. Since we want to minimise the AIC, having more variables will make matters worse unless the RSS shrinks accordingly. The Schwarz Information Criterion for Gaussian models is:
(365) $SIC = \frac{k}{n}\ln(n) + \ln\left(\frac{RSS}{n}\right)$

18 Markov Chains

So far we have been interested in series of experiments that are independent of each other, for example the repeated tossing of a coin. Very often, however, the result of the next experiment depends on previous outcomes. To give an example, the probability of finding a given word in a text does depend on the words that are found around it. The probability of finding an adjective is higher after a determiner, say the word the, than before it. The probability that a given letter is an a is higher after r than it is after i. A casual count will reveal this. These restrictions have their reasons and have been intensely studied; here we will confine ourselves to the abstract study of a particular type of sequential experiment, called a Markov chain. Let the space of simple outcomes be Ω = {1, ..., n} and suppose that we are looking at the space Ω^n. The probability of ω = ⟨ω₁, ω₂, ..., ωₙ⟩ in the case of a product of possibly different probability spaces is simply

(366) $p(\omega) = p_1(\omega_1)\, p_2(\omega_2) \cdots p_n(\omega_n)$
Now let us suppose that the spaces are not independent, but that the probabilities of the ith component depend on the outcome of the (i−1)st component, so that we get the following formula:
(367) $p(\omega) = p_1(\omega_1)\, p_2(\omega_2, \omega_1)\, p_3(\omega_3, \omega_2) \cdots p_n(\omega_n, \omega_{n-1})$
where the pᵢ(x, y) are certain coefficients. Now, we introduce random variables Xⁱ by Xⁱ(ω) = ωᵢ. Then, if (367) holds, we call the sequence ⟨X¹, X², ..., Xⁿ⟩ a Markov chain. In what follows, we shall focus on the case where the outcomes are all drawn from the set {0, 1} and the pᵢ(x, y) do not depend on i. Thus we can rewrite (367) to

(368) $p(\omega) = p_1(\omega_1)\, p(\omega_2, \omega_1)\, p(\omega_3, \omega_2) \cdots p(\omega_n, \omega_{n-1})$
The random variables Xⁱ can help to express the dependency of the probabilities of the next state on the previous one. Notice namely that

(369) $p(\omega) = P(X^1 = \omega_1 \cap X^2 = \omega_2 \cap \dots \cap X^n = \omega_n)$
and moreover,

(370) $P(X^{i+1} = \omega_{i+1} \mid X^i = \omega_i) = p(\omega_{i+1}, \omega_i)$

Denote by p_i(j) the probability P(Xⁱ = j), and write m_{ij} in place of p(i, j). Then we derive
(371) $p_{\mu+1}(k) = \sum_{j=1}^n m_{kj}\, p_\mu(j)$
(In accordance with normal practice in linear algebra we shall write the vector to the right and the matrix to the left. This does not make a difference for the theoretical results, except that rows and columns are systematically interchanged.) In other words, the probability of k happening at stage µ + 1 is a linear combination of the probabilities at stage µ. In general, in a Markov process the probabilities can depend on any prior probabilities down to stage 1, but we shall look at this simpler model where only the previous stage matters. It is by far the most common one (and often enough quite sufficient).

Now, there are two conditions that must be placed on the coefficients m_{kj} (or equivalently, on the pᵢ(x, y) in the general case). One is that all the coefficients are nonnegative. The other is the following:
(372) $\sum_{k=1}^n m_{kj} = 1$

It follows that m_{kj} ≤ 1 for all j, k ≤ n. The reason for this restriction is the following. Suppose that p_µ(i) = 0 if i ≠ j, and 1 otherwise. Then
(373) $p_{\mu+1}(k) = m_{kj}$
Now we must have $\sum_{k=1}^n p_{\mu+1}(k) = 1$ and therefore $\sum_{k=1}^n m_{kj} = 1$. Now, with the help of linear algebra we can massage (371) into a nicer form. We define a matrix M := (m_{ij})_{1≤i,j≤n} and a vector p⃗_µ := (p_µ(j))_{1≤j≤n}. Then the equation becomes:

(374) $\vec{p}_{\mu+1} = M \cdot \vec{p}_\mu$

We can derive the following corollary:

(375) $\vec{p}_\mu = M^\mu \cdot \vec{p}_0$

So, the probabilities are determined completely by the initial probabilities and the transition matrix M. 122 Markov Chains

Definition 47 An n × n–matrix M is called stochastic if for all i, j ≤ n, m_{ij} ≥ 0 and $\sum_{k=1}^n m_{kj} = 1$.

Let us see an example. We have a source that emits two letters, a and b. The probability of a in the next round is 0.4 if the previous letter was a, and 0.7 if the previous letter was b. The probability of b in the next round is 0.6 if the previous letter was a, and 0.3 if the previous letter was b. The matrix that we get is this one:
(376) $M = \begin{pmatrix} 0.4 & 0.7 \\ 0.6 & 0.3 \end{pmatrix}$

Suppose now that the initial probabilities are 0.2 for a and 0.8 for b. Then here are the probabilities after one step:
(377) $\begin{pmatrix} 0.4 & 0.7 \\ 0.6 & 0.3 \end{pmatrix}\cdot\begin{pmatrix} 0.2 \\ 0.8 \end{pmatrix} = \begin{pmatrix} 0.4\cdot 0.2 + 0.7\cdot 0.8 \\ 0.6\cdot 0.2 + 0.3\cdot 0.8 \end{pmatrix} = \begin{pmatrix} 0.64 \\ 0.36 \end{pmatrix}$

So the probabilities have changed to 0.64 for a and 0.36 for b. Let us look at the next stage:

(378) $\begin{pmatrix} 0.4 & 0.7 \\ 0.6 & 0.3 \end{pmatrix} \cdot \begin{pmatrix} 0.64 \\ 0.36 \end{pmatrix} = \begin{pmatrix} 0.4 \cdot 0.64 + 0.7 \cdot 0.36 \\ 0.6 \cdot 0.64 + 0.3 \cdot 0.36 \end{pmatrix} = \begin{pmatrix} 0.508 \\ 0.492 \end{pmatrix}$

The next probabilities are (to four digits of precision):

(379) $\begin{pmatrix} 0.5476 \\ 0.4524 \end{pmatrix}, \begin{pmatrix} 0.5357 \\ 0.4643 \end{pmatrix}, \begin{pmatrix} 0.5393 \\ 0.4607 \end{pmatrix}, \begin{pmatrix} 0.5382 \\ 0.4618 \end{pmatrix}, \begin{pmatrix} 0.5385 \\ 0.4615 \end{pmatrix}, \ldots$

As one can see, the successive probability vectors come closer and closer together and change very little after several iterations. This is not always so. Consider the following matrix.

(380) $M = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$

Here the sequence is this:

(381) $\begin{pmatrix} 0.2 \\ 0.8 \end{pmatrix}, \begin{pmatrix} 0.8 \\ 0.2 \end{pmatrix}, \begin{pmatrix} 0.2 \\ 0.8 \end{pmatrix}, \begin{pmatrix} 0.8 \\ 0.2 \end{pmatrix}, \ldots$
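Both behaviours are easy to reproduce numerically. The following sketch (Python with NumPy; my own illustration, not part of the text) iterates $\vec p_{\mu+1} = M \cdot \vec p_\mu$: with the matrix from (376) the vectors settle near $(0.5385, 0.4615)$, while the matrix from (380) makes them oscillate forever.

    import numpy as np

    def iterate(M, p0, steps):
        """Iterate p_{mu+1} = M . p_mu and return the successive vectors."""
        M, p = np.asarray(M, dtype=float), np.asarray(p0, dtype=float)
        out = [p]
        for _ in range(steps):
            p = M @ p
            out.append(p)
        return out

    p0 = [0.2, 0.8]
    for p in iterate([[0.4, 0.7], [0.6, 0.3]], p0, 8):
        print(np.round(p, 4))   # converges towards (0.5385, 0.4615)
    for p in iterate([[0.0, 1.0], [1.0, 0.0]], p0, 4):
        print(p)                # flips between (0.2, 0.8) and (0.8, 0.2)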

Let $m^{(k)}_{ij}$ be the $(i, j)$-entry of the matrix $M^k$. Two notions will be useful in the analysis of Markov chains. Write $i \to j$ if $m_{ji} > 0$. Let $\to^+$ be the transitive closure of $\to$. Then $i \to^+ j$ iff there is $k > 0$ such that $m^{(k)}_{ji} > 0$. $M$ is said to be irreducible if for all $i, j$ both $i \to^+ j$ and $j \to^+ i$. Intuitively this means that one can, with some nonzero probability, go from $i$ to $j$ and from $j$ to $i$. If only $i \to^+ j$ but not $j \to^+ i$, then all probability mass will slowly drain away from $i$ (and move, for example, into $j$, though not necessarily only into there). If neither $i \to^+ j$ nor $j \to^+ i$, then one can simply never reach one from the other, and the two belong to disconnected components of the network of states.

Now, write $d_i$ for the least number $k$ such that the probability of being at $i$ at step $k$ is nonzero, when the process is initialized so that it is at $i$ with probability 1. In matrix terms, $d_i := \min\{k > 0 : m^{(k)}_{ii} > 0\}$. (If the set is empty, put $d_i := \infty$.) This is called the period of $i$. It means that the process cannot return to $i$ in fewer than $d_i$ steps, but can return in $d_i$ steps (though not necessarily with probability 1). If the greatest common divisor of the $d_i$, $i \le n$, is 1, the Markov process is called aperiodic.

Lemma 48 Let $M$ be irreducible and aperiodic. Then there is a number $n$ such that $M^n$ has only nonzero entries.
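One can probe the lemma numerically. The sketch below (my own illustration, not part of the text) searches for the least $n$ such that $M^n$ has only positive entries; for the matrix in (376) it returns 1, while for the periodic flip matrix in (380) no such power exists, so a cut-off is needed.

    import numpy as np

    def first_positive_power(M, max_power=100):
        """Return the least n with all entries of M^n > 0, or None within the cut-off."""
        M = np.asarray(M, dtype=float)
        P = M.copy()
        for n in range(1, max_power + 1):
            if (P > 0).all():
                return n
            P = P @ M
        return None

    print(first_positive_power([[0.4, 0.7], [0.6, 0.3]]))  # 1
    print(first_positive_power([[0.0, 1.0], [1.0, 0.0]]))  # None: periodic matrix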

Let us state some abstract properties of Markov chains. First, probability distributions over the states are vectors $\vec x$ of nonnegative real numbers such that $\sum_{i=1}^{n} x_i = 1$. The set of these vectors is also called an $n$-dimensional simplex and is denoted by $\Delta_n$. The map $\vec x \mapsto M \cdot \vec x$ maps $\Delta_n$ into itself. A vector $\vec x$ is called an eigenvector of $M$ for the eigenvalue $\lambda$ if $\vec x \ne \vec 0$ and

(382) $M \cdot \vec x = \lambda \cdot \vec x$

If there is an eigenvector with nonnegative real entries for the eigenvalue $\lambda$, then there is one from the simplex $\Delta_n$. For a proof, let $a := \sum_{i=1}^{n} x_i$ and put $\vec y := (x_i/a)_{1 \le i \le n}$. Then $\sum_{i=1}^{n} y_i = 1$ and $\vec y$ also is an eigenvector for the same eigenvalue:

(383) $M \cdot \vec y = M \cdot a^{-1} \cdot \vec x = a^{-1}(\lambda \cdot \vec x) = \lambda \cdot \vec y$

Now, as the matrices of a Markov chain have the additional property of mapping $\Delta_n$ into $\Delta_n$, we know that $M \cdot \vec y \in \Delta_n$, which means that $\lambda \vec y \in \Delta_n$. This in turn means $\lambda = 1$.

Theorem 49 A stochastic matrix has eigenvectors with nonnegative real entries only for the eigenvalue 1.

Moreover, as we shall see shortly, there can be only one eigenvector in $\Delta_n$, and no matter where we start, we shall eventually run into that vector. One says that the process is ergodic if in the long run (that is, for large $n$) the probabilities $P(X^n = \omega_i)$ approach a value $y_i$ which is independent of the initial vector. This can be phrased as follows. The initial probabilities form a vector $\vec x$. The probabilities $P(X^n = \omega_i)$ for a given $n$ form a vector, namely $M^n \vec x$. We have seen above an example of a series $M^n \vec x$ which approaches a certain value; this value, we claim, is independent of $\vec x$. We have also seen an example where this is not so. The following theorem shows why it works in the first case and not in the second. As a first step we shall show that in the limit the powers of the matrix consist of identical columns:

Theorem 50 (Ergodic Theorem) Assume that $M$ is a finite stochastic $\rho \times \rho$ matrix which is both irreducible and aperiodic. Then there exist numbers $\pi_i$, $1 \le i \le \rho$, such that $\sum_i \pi_i = 1$ and for all $i, j$:

(384) $\lim_{n \to \infty} m^{(n)}_{ij} = \pi_i$

Proof. By Lemma 48 there is an $n_0$ such that $m^{(n_0)}_{ij} > 0$ for all $i, j \le \rho$. Put $p^{(n)}_i := \min_j m^{(n)}_{ij}$ and $P^{(n)}_i := \max_j m^{(n)}_{ij}$. Now

(385) $p^{(n+1)}_i = \min_j m^{(n+1)}_{ij} = \min_j \sum_\alpha m^{(n)}_{i\alpha} m_{\alpha j} \ge \bigl(\min_\alpha m^{(n)}_{i\alpha}\bigr) \cdot \min_j \sum_\alpha m_{\alpha j} = p^{(n)}_i$

Notice namely that $\sum_\alpha m_{\alpha j} = 1$, since $M$ is a stochastic matrix. Similarly $P^{(n+1)}_i \le P^{(n)}_i$ for all $i \le \rho$ and all $n$. So we only need to show that the difference $P^{(n)}_i - p^{(n)}_i$ approaches 0. Put $\varepsilon := \min_{i,j} m^{(n_0)}_{ij} > 0$. Then

(386) $m^{(n_0+n)}_{ij} = \sum_\alpha m^{(n)}_{i\alpha}\, m^{(n_0)}_{\alpha j} = \sum_\alpha m^{(n)}_{i\alpha} \bigl(m^{(n_0)}_{\alpha j} - \varepsilon m^{(n)}_{\alpha i}\bigr) + \varepsilon \sum_\alpha m^{(n)}_{i\alpha} m^{(n)}_{\alpha i} = \sum_\alpha m^{(n)}_{i\alpha} \bigl(m^{(n_0)}_{\alpha j} - \varepsilon m^{(n)}_{\alpha i}\bigr) + \varepsilon m^{(2n)}_{ii}$

Now, $m^{(n_0)}_{\alpha j} - \varepsilon m^{(n)}_{\alpha i} \ge 0$. (To see this, notice that $m^{(n)}_{\alpha i} \le 1$, while $\varepsilon \le m^{(n_0)}_{\alpha j}$ by the choice of $\varepsilon$; hence $\varepsilon m^{(n)}_{\alpha i} \le \varepsilon \le m^{(n_0)}_{\alpha j}$.) Hence we get

(387) $m^{(n_0+n)}_{ij} \ge p^{(n)}_i \sum_\alpha \bigl(m^{(n_0)}_{\alpha j} - \varepsilon m^{(n)}_{\alpha i}\bigr) + \varepsilon m^{(2n)}_{ii} = p^{(n)}_i (1 - \varepsilon) + \varepsilon m^{(2n)}_{ii}$

Therefore,

(388) $p^{(n_0+n)}_i \ge p^{(n)}_i (1 - \varepsilon) + \varepsilon m^{(2n)}_{ii}$

(389) $P^{(n_0+n)}_i \le P^{(n)}_i (1 - \varepsilon) + \varepsilon m^{(2n)}_{ii}$

From this we get

(390) $P^{(n_0+n)}_i - p^{(n_0+n)}_i \le \bigl(P^{(n)}_i - p^{(n)}_i\bigr)(1 - \varepsilon)$

This completes the proof. □

Now let us look at the following matrix:

(391) $M^\infty := \lim_{n \to \infty} M^n$

This matrix consists of identical columns, each of which contains the entries $\pi_1, \pi_2, \ldots, \pi_\rho$. The sum of these values is 1. Assume now that $\vec x \in \Delta_n$. Then the $i$th entry of the vector $M^\infty \vec x$ is as follows:

(392) $\sum_\alpha m^{(\infty)}_{i\alpha} x_\alpha = \sum_\alpha \pi_i x_\alpha = \pi_i \sum_\alpha x_\alpha = \pi_i$

In other words, if $\vec\pi := (\pi_i)_i$ then we have

(393) $M^\infty \vec x = \vec\pi$

This immediately shows that there can be only one eigenvector of $M^\infty$ for the eigenvalue 1 in $\Delta_n$. Now notice also that

(394) $M^\infty \vec x = (\lim_{n\to\infty} M^n)\vec x = \lim_{n\to\infty}(M^n \vec x)$

Furthermore, if $\vec x$ is an eigenvector of $M$ for the eigenvalue 1, then it is an eigenvector of $M^n$ for the eigenvalue 1 for every $n$. For notice that $M^{n+1}\vec x = (M^n M)\vec x = M^n(M \vec x) = M^n \vec x$, which establishes the inductive step: if $\vec x$ is an eigenvector of $M^n$, then it is an eigenvector of $M^{n+1}$. Inserting this into (394) we get

(395) $M^\infty \vec x = \lim_{n\to\infty} M^n \vec x = \lim_{n\to\infty} \vec x = \vec x$

Thus $\vec x$ also is an eigenvector of $M^\infty$; whence $\vec x = \vec\pi$.

Corollary 51 Let $M$ be a stochastic $\rho \times \rho$ matrix satisfying the conditions of the Ergodic Theorem. Then there is exactly one eigenvector $\vec\pi$ for the eigenvalue 1 in $\Delta_n$. Moreover, every column of $M^\infty$ equals $\vec\pi$, and for every $\vec y \in \Delta_n$: $M^\infty \vec y = \vec\pi$.

The vector $\vec\pi$ is also called a stationary distribution. Call a Markov chain $\langle X^i : i \le n \rangle$ stationary if for all $i < n$: $\vec p_{i+1} = \vec p_i$. In other words, the probability of $X^{i+1} = \omega_j$ is the same as that of $X^i = \omega_j$ for all $i < n$ and all $j$. A stationary chain does not change with time. As a result of the previous theorem we get that a homogeneous Markov process satisfying the conditions of the Ergodic Theorem allows for exactly one stationary chain, and the probabilities for the elementary outcomes are the ones to which every chain eventually converges.
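Numerically, the stationary distribution can be obtained either by iterating $M^n \vec p_0$ as above, or directly as the eigenvector for the eigenvalue 1. A minimal sketch (Python with NumPy; my own illustration, not part of the text):

    import numpy as np

    def stationary(M):
        """Return the eigenvector of M for eigenvalue 1, rescaled to lie in the simplex."""
        M = np.asarray(M, dtype=float)
        eigvals, eigvecs = np.linalg.eig(M)
        k = np.argmin(np.abs(eigvals - 1.0))   # index of the eigenvalue closest to 1
        v = np.real(eigvecs[:, k])
        return v / v.sum()                     # rescale so the entries sum to 1

    # For the matrix in (376) this yields (7/13, 6/13), i.e. about (0.5385, 0.4615),
    # the value that the iterations above approach.
    print(stationary([[0.4, 0.7], [0.6, 0.3]]))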

Part III

Probabilistic Linguistics



19 Probabilistic Regular Languages and Hidden Markov Models

Hidden Markov models have gained popularity because they are more flexible than ordinary Markov chains. We shall approach Hidden Markov Models through a seemingly different object, namely probabilistic regular languages.

Definition 52 A probabilistic regular grammar is a quintuple $\langle S, N, A, R, P \rangle$, where $N$ and $A$ are disjoint sets, the set of nonterminal and terminal symbols, respectively, $S \in N$ is the start symbol, $R$ is a finite set of rules of the form $X \to \vec\alpha$ with $X \in N$ and $\vec\alpha \in (A \times N) \cup A \cup N \cup \{\varepsilon\}$, and $P : R \to [0, 1]$ is a probability assignment such that for all $X \in N$

(396) $\sum_{X \to \vec\alpha \in R} P(X \to \vec\alpha) = 1$

Although the unary rules $A \to B$ can be dispensed with, we find it useful to have them. A derivation is a sequence $\vec\rho = \langle \rho_i : 1 \le i \le n \rangle$ of rules such that (1) $\rho_1 = A \to \vec\alpha$ for some $\vec\alpha$ and some $A \in N$, and (2) for $i > 1$, if $\rho_{i-1} = A \to aB$ for some $A, B \in N$ and $a \in A$, then $\rho_i = B \to bC$ for some $C \in N$ and $b \in A$, or $\rho_i = B \to b$, or $\rho_i = B \to \varepsilon$. The string derived by $\vec\rho$ is defined by induction as follows.

(397) $\sigma(\langle\rangle) := A$

(398) $\sigma(\langle \rho_1, \ldots, \rho_i \rangle) := \vec\alpha bC$, where $\sigma(\langle \rho_1, \ldots, \rho_{i-1} \rangle) = \vec\alpha B$ and $\rho_i = B \to bC$ (analogously $\vec\alpha b$ for $\rho_i = B \to b$, and $\vec\alpha$ for $\rho_i = B \to \varepsilon$)

The probability assigned to a derivation is

(399) $P(\vec\rho) := \prod_{i=1}^{n} P(\rho_i)$

For a string $\vec\alpha$, put

(400) $P(\vec\alpha) := \sum_{\sigma(\vec\rho) = \vec\alpha} P(\vec\rho)$

This defines a probability distribution on $A^*$. Here is an example.

(401)
S → aA  1/4
S → bB  3/4
A → bB  1/3
A → b   2/3
B → aA  1/2
B → a   1/2

This grammar generates the language (ab)+a? ∪ (ba)+b?. The string ababa has the following derivation:

(402) $\langle S \to aA,\ A \to bB,\ B \to aA,\ A \to bB,\ B \to a \rangle$

The associated probability is $1/4 \cdot 1/3 \cdot 1/2 \cdot 1/3 \cdot 1/2 = 1/144$.
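This computation is easy to mechanize. Below is a minimal sketch (Python; the encoding of the rules is my own convention, not from the text) that computes $P(\vec\alpha)$ for short strings of the example grammar by summing over all derivations, as in (400).

    # Rules of grammar (401): nonterminal -> list of (right-hand side, probability).
    # A right-hand side is (terminal, nonterminal) or (terminal,).
    RULES = {
        'S': [(('a', 'A'), 1/4), (('b', 'B'), 3/4)],
        'A': [(('b', 'B'), 1/3), (('b',), 2/3)],
        'B': [(('a', 'A'), 1/2), (('a',), 1/2)],
    }

    def string_probability(nt, s):
        """Total probability that nonterminal nt derives exactly the string s,
        i.e. formula (400), summed over all derivations."""
        if not s:
            return 0.0          # this grammar has no epsilon rules
        total = 0.0
        for rhs, p in RULES[nt]:
            if rhs[0] != s[0]:
                continue
            if len(rhs) == 2:                  # rule nt -> a B
                total += p * string_probability(rhs[1], s[1:])
            elif len(s) == 1:                  # rule nt -> a, must end here
                total += p
        return total

    print(string_probability('S', 'ababa'))  # 1/144, from the single derivation (402)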

Definition 53 A probabilistic language over A is a discrete probability space over A∗.

We ask first: under which conditions does the probability assignment $P$ define a probabilistic language? Call a nonterminal $A$ reachable if some string $\vec x A$ is derivable. Call a nonterminal $A$ groundable if there is a derivation $A \vdash_G \vec y$ for some terminal string $\vec y$.

Theorem 54 Let $G$ be a reduced probabilistic regular grammar with $N = \{A_i : 1 \le i \le m\}$. Then $G$ defines a probabilistic language over $A$ iff all reachable symbols are also groundable.

Proof. Let $H_n$ be the set of derivations of length $n$ and $C_n \subseteq H_n$ the set of complete derivations of length $n$; let $\gamma_n$ be the sum of the probabilities of the derivations in $C_n$. Then $1 = \sum_{\langle\rho\rangle \in H_1} P(\rho)$, and since every incomplete derivation of length 1 is continued, with total probability 1, by derivations of length 2, we get $1 = \gamma_1 + \gamma_2 + \sum_{\vec\rho \in H_2 - C_2} P(\vec\rho)$. This generalizes: for every $n$,

(403) $1 = \gamma_1 + \gamma_2 + \cdots + \gamma_n + \sum_{\vec\rho \in H_n - C_n} P(\vec\rho)$

The claim is established once we have shown that

(404) $1 = \sum_{i=1}^{\infty} \gamma_i$

Put

(405) $\pi_n := \sum_{i=1}^{n} \gamma_i$

Suppose that every reachable symbol is groundable. We shall show that for some $m$ and some $c < 1$: $(1 - \pi_{n+m}) \le c(1 - \pi_n)$. This will show that $\lim_{n\to\infty} \pi_n = 1$. To this end, let $\vec\rho$ be an incomplete derivation of length $n$, deriving a string $\vec x A$. By assumption $A$ is groundable, so that $A \vdash_G \vec y$ for some terminal $\vec y$; this grounding takes some number $\ell_A$ of steps and has some probability $p_A > 0$. Since there are only finitely many nonterminals, we may put $m := \max\{\ell_A\}$ and $d := \min\{p_A\}$ over the reachable nonterminals, and these do not depend on $n$. Then every incomplete derivation of $H_n$ is completed within $m$ further steps with probability at least $d$, so the incomplete mass shrinks at least by the factor $c := 1 - d < 1$. Hence $(1 - \pi_{n+m}) \le c(1 - \pi_n)$. This shows the claim. Now suppose that there is a reachable symbol $A$ that is not groundable. Then a derivation $\vec\rho$ of a string $\vec x A$ can never be completed. This means that the sum of the probabilities of all complete derivations is at most $1 - P(\vec\rho) < 1$. □
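For the example grammar (401) one can watch the completed mass $\pi_n$ approach 1. A small sketch (Python, reusing the rule encoding from the previous sketch; my own convention):

    RULES = {
        'S': [(('a', 'A'), 1/4), (('b', 'B'), 3/4)],
        'A': [(('b', 'B'), 1/3), (('b',), 2/3)],
        'B': [(('a', 'A'), 1/2), (('a',), 1/2)],
    }

    def completed_mass(max_len):
        """Sum the probabilities of all complete derivations of length <= max_len."""
        frontier = [('S', 1.0)]   # incomplete derivations: (current nonterminal, probability)
        done = 0.0
        for _ in range(max_len):
            new_frontier = []
            for nt, prob in frontier:
                for rhs, p in RULES[nt]:
                    if len(rhs) == 2:
                        new_frontier.append((rhs[1], prob * p))
                    else:
                        done += prob * p      # derivation is complete
            frontier = new_frontier
        return done

    for n in (1, 5, 10, 20):
        print(n, completed_mass(n))   # tends to 1 as n grows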

Definition 55 A probabilistic language is called regular if it is generated by a probabilistic regular grammar.

We shall explore the relation with Markov processes. The states of the Markov process shall correspond to the letters of the alphabet, so that every sequence of states that the process traverses is actually a word. First, notice that while derivations of a regular grammar are supposed to terminate, a Markov process goes on forever. This can be remedied as follows. We add a state $s$ to the Markov process and stipulate that the probability of leaving $s$ is zero (so the transition probability from $s$ to $s$ is 1). Such a state is called a sink. We assume that there is just one sink, and it corresponds to the blank. (Notice that the process is then not irreducible.) The Markov process is translated into rules as follows. Each state corresponds to a letter; the nonterminal corresponding to the letter $a$ is denoted by $N_a$. If $m_{ba} = p$ and $a \ne s$, we add a rule $\rho := N_a \to aN_b$ with $P(\rho) := p$. For $a = s$ we add instead the single rule $\rho := N_s \to \varepsilon$ (with probability 1). Finally, we add a start symbol $S$. Let $P^1$ be the initial probability distribution over the letters. Then we add rules $S \to N_a$ with probability $P^1(a)$. We shall establish a one-to-one correspondence between certain random walks and derivations in the grammar.

Let $\vec x = \langle x_1, x_2, \ldots, x_n \rangle$ be a random walk. Since the $x_i$ are letters, we may consider it a string over $A$. Its probability in the Markov process is

(406) $P^1(x_1) \prod_{i=1}^{n-1} m_{x_{i+1} x_i}$

Since trailing occurrences of $s$ do not change the probabilities, we may simply eliminate them. It is not hard to see that $\vec x$ has a unique derivation in the grammar, which is

(407) $\langle S \to N_{x_1},\ N_{x_1} \to x_1 N_{x_2},\ N_{x_2} \to x_2 N_{x_3},\ \ldots,\ N_{x_{n-1}} \to x_{n-1} N_{x_n},\ N_{x_n} \to x_n N_s,\ N_s \to \varepsilon \rangle$

The associated probability is the same. It follows that the Markov process assigns the same probability to the string as the grammar does. Now we wish to do the converse: given a regular grammar, obtain a Markov process that generates the strings with identical probabilities. This turns out to lead to a generalisation. For notice that in general a nonterminal symbol does not develop into a fixed terminal symbol.

Definition 56 A Hidden Markov Model (HMM) is a quadruple $\langle S, O, M, E \rangle$ such that $S$ is a set, the set of states, $O$ is the set of observables, $M = (m_{b,a})_{a,b \in S}$ is a stochastic matrix, and $E$ is a function from $S$ to $O$.
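For concreteness, here is a minimal sketch of such an HMM (Python; my own illustration, not part of the text): a random walk over the states is generated from the column-stochastic matrix $M$, but only the images under $E$ are emitted.

    import random

    def sample_hmm(M, E, p1, steps, rng=random):
        """Sample a walk from a column-stochastic M (M[(b, a)] = prob of b after a)
        and return the observable sequence E(s1), E(s2), ..."""
        states = list(p1)
        s = rng.choices(states, weights=[p1[x] for x in states])[0]
        observed = [E[s]]
        for _ in range(steps - 1):
            s = rng.choices(states, weights=[M[(b, s)] for b in states])[0]
            observed.append(E[s])
        return observed

    M = {('1', '1'): 0.4, ('1', '2'): 0.7,
         ('2', '1'): 0.6, ('2', '2'): 0.3}
    E = {'1': 'a', '2': 'b'}    # E happens to be injective here, but it need not be
    print(''.join(sample_hmm(M, E, {'1': 0.2, '2': 0.8}, 10)))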

We shall as usual assume that $S = \{1, 2, \ldots, n\}$. The addition over the plain Markov process is the map $E$, which translates states into observables. Random walks are walks in the set of states; however, one thinks of the random walk as something that is hidden. One can only observe the sequences $\langle o_1, o_2, \ldots \rangle$ obtained by applying the map $E$. It is not in general possible to retrieve the random walk from its image under $E$, since $E$ may conflate several states. We translate a probabilistic regular grammar into a Hidden Markov Model in two steps. First, we transform our grammar into a slightly different grammar $H$: $N_H := (N \times A) \cup N$, $A_H := A$. The start symbol is $S$. The rules are: all rules of the form $A \to \langle A, a \rangle$ and $\langle A, a \rangle \to aB$, where $A \to aB$ is a rule of the original grammar; all rules $A \to B$, where $A \to B$ is a rule of the original grammar; and $B \to \varepsilon$, where $B \to \varepsilon$ is a rule of the original grammar. The probabilities are assigned as follows.

(408) $P_H(A \to \langle A, a \rangle) := \sum_{B \in N} P(A \to aB)$

(409) $P_H(\langle A, a \rangle \to aB) := P(A \to aB)/P_H(A \to \langle A, a \rangle)$

(410) $P_H(A \to B) := P(A \to B)$

(411) $P_H(A \to \varepsilon) := P(A \to \varepsilon)$

It is easy to see that (complete) derivations in the new grammar are in one-to-one correspondence with derivations in the old grammar. We translate $H$ into a Hidden Markov Model. Put $S := N_H$, $O := A \cup \{|\}$, where $| \notin A$. $E(A) := |$ and $E(\langle A, a \rangle) := a$. Finally, the transition probabilities are $m_{\langle A, a \rangle, A} := P_H(A \to \langle A, a \rangle)$ and $m_{B, \langle A, a \rangle} := P_H(\langle A, a \rangle \to aB)$. As before, we add a sink $s$, and for the rule $A \to \varepsilon$ we put $m_{s,A} := P_H(A \to \varepsilon)$. The Markov process generates random walks of the form $\langle S, \langle S, x_1 \rangle, X_1, \langle X_1, x_2 \rangle, X_2, \ldots \rangle$, which by applying $E$ are mapped to sequences of

the form $x_1|x_2|x_3| \cdots$. These correspond to strings $x_1 x_2 x_3 \cdots$ in the familiar sense. It is easily established that the strings are generated by the Markov process with the appropriate probability. (First one shows this for derivations; since derivations are in one-to-one correspondence, the same result then follows for strings.)

Theorem 57 A probabilistic language is regular iff it can be generated by a Hidden Markov Model with some initial probability.

In the literature, it is common to assume a broader definition of a Hidden Markov Model. The most common definition is the following. Instead of a function $E : S \to O$ one assumes a probability distribution $I : S \times O \to [0, 1]$ such that $\sum_{o \in O} I(\langle s, o \rangle) = 1$ for every $s \in S$. This means the following: given a state $s$, the output symbol is no longer unique; instead, the symbol $o$ appears with probability $I(\langle s, o \rangle)$. An even more general definition makes the output symbol contingent on the transition: there is a probability assignment $J$ on pairs of states and an output symbol such that $\sum_{o \in O} J(\langle s, s', o \rangle) = 1$ for all $s, s' \in S$. We shall show below that these definitions are in fact not more general: anything that these models can achieve can be done with a Hidden Markov Model in the sense defined above. We shall show how to convert a Markov model of the first kind into an HMM; the reader will surely be able to carry out a similar reduction for the second. Let $\langle S, O, M, I \rangle$ be such a model. Put $S' := S \times O$, $E(\langle s, o \rangle) := o$. Further,

(412) $m'_{\langle s', o' \rangle, \langle s, o \rangle} := m_{s',s} \cdot I(\langle s', o' \rangle)$

We claim that $H = \langle S', O, M', E \rangle$ is an HMM and that it assigns identical probabilities to all sequences of observables. To that end, we shall do the following. Let $U_s := \{\langle s, o \rangle : o \in O\}$. We claim that

(413) $P^k_H(U_s) = P^k_M(s)$

(414) $P^k_H(\langle s, o \rangle) = I(\langle s, o \rangle)\, P^k_H(U_s)$

This follows by induction. The first equation is seen as follows.

(415) $P^{k+1}_H(U_s) = \sum_{o, o' \in O,\, s' \in S} P^k_H(\langle s', o' \rangle)\, m'_{\langle s, o \rangle, \langle s', o' \rangle}$

(416) $\phantom{P^{k+1}_H(U_s)} = \sum_{o, o' \in O,\, s' \in S} P^k_H(\langle s', o' \rangle)\, m_{s,s'}\, I(\langle s, o \rangle)$

(417) $\phantom{P^{k+1}_H(U_s)} = \sum_{o \in O,\, s' \in S} P^k_H(U_{s'})\, m_{s,s'}\, I(\langle s, o \rangle)$

(418) $\phantom{P^{k+1}_H(U_s)} = \sum_{s' \in S} P^k_H(U_{s'})\, m_{s,s'}$

(419) $\phantom{P^{k+1}_H(U_s)} = \sum_{s' \in S} P^k_M(s')\, m_{s,s'}$

(420) $\phantom{P^{k+1}_H(U_s)} = P^{k+1}_M(s)$

The second equation is shown analogously. Now, the probability that $o$ is emitted in state $s$ is $I(\langle s, o \rangle)$ in the original model; in $H$ it is $P^k_H(\langle s, o \rangle \mid U_s) = I(\langle s, o \rangle)$, and the claim is proved. Now we shall state and prove a much more general result, from which the previous one can easily be derived as a corollary. The idea of this theorem is as follows. Suppose we ask about the distribution of a letter at position $k + 1$. In an ordinary Markov model this probability depends only on the probabilities of the letters at position $k$. We generalize this to allow for any kind of dependency on previous histories, with the only condition that the histories are classified using regular languages. That is to say, the probability of $a$ at position $k + 1$ with history $\vec c$ (a word of length $k$) depends only on which member of a fixed finite collection of regular languages $\vec c$ belongs to. This makes it possible to state dependencies that reach arbitrarily far back, for example vowel harmony. In what follows we write $P_{k+1}(a \mid \vec c)$ for the probability that the letter at position $k + 1$ is $a$ given that its immediate history is $\vec c$; differently put, if $\vec c$ has length $k$, then

(421) $P_{k+1}(a \mid \vec c) = P(\vec c a)/P(\vec c)$

where $P(\vec c)$ is the probability that the process generates $\vec c$ when initialized.

Theorem 58 Let $\mathcal S$ be a finite partition of $A^*$ into regular languages. Let $H$ be a function from $\mathcal S \times A$ to $[0, 1]$ such that $\sum_{a \in A} H(L, a) = 1$ for every $L \in \mathcal S$. Define probability distributions over $A$ as follows: if $\vec c$ is a word of length $k$ and $\vec c \in L \in \mathcal S$, then $P_{k+1}(a \mid \vec c) := H(L, a)$. Then the sequence of probability distributions $\langle P_i : i \in \mathbb N \rangle$ can be generated by an HMM.

Proof. We notice that this definition also fixes the initial distribution: as there is a unique language $L_0 \in \mathcal S$ containing $\varepsilon$, the initial distribution is defined as $P_1(a) := H(L_0, a)$.

Let $\mathcal S = \{L_i : 1 \le i \le p\}$. Assume that $A_i = \langle Q_i, A, i_i, F_i, \delta_i \rangle$ is a deterministic finite state automaton recognizing $L_i$. Define $S := Q_1 \times Q_2 \times \cdots \times Q_p \times A$.

Put $O := A$ and $E(\langle q^{(1)}, \ldots, q^{(p)}, a \rangle) := a$. Finally, the transition matrix is as follows. Let $o := \langle q^{(1)}, \ldots, q^{(p)}, a \rangle$ and $o' := \langle r^{(1)}, \ldots, r^{(p)}, b \rangle$. If $r^{(i)} = \delta_i(q^{(i)}, b)$ for all $i$, then $m_{o'o} := H(L_j, b)$, where $j$ is the unique index such that $q^{(j)} \in F_j$. (That it is unique follows from the fact that the languages are disjoint.) If $r^{(i)} \ne \delta_i(q^{(i)}, b)$ for some $i$, then $m_{o'o} := 0$. We show that this defines a stochastic matrix. Let $o'' = \langle s^{(1)}, \ldots, s^{(p)}, c \rangle$. If $m_{o''o} \ne 0$, then in fact $s^{(i)} = \delta_i(q^{(i)}, c)$ for all $i \le p$; so for each letter $c$ there is exactly one $o''$ with $m_{o''o} \ne 0$. Thus

(422) $\sum_{o' \in S} m_{o'o} = \sum_{b \in A} H(L_j, b) = 1$

Thus we have an HMM. Next we show that $P_{k+1}(a \mid \vec c) = H(L, a)$, where $\vec c$ has length $k$ and $\vec c \in L$. To this end notice that if $\vec c$ is the history of $a$, we can recover the state as follows: let $q^{(i)}$ be the unique state such that there is a run from $i_i$ to $q^{(i)}$ over the word $\vec c$, and let $r^{(i)}$ be the unique state such that there is a run from $i_i$ to $r^{(i)}$ over the word $\vec c a$. Then the transition is from $\langle q^{(1)}, \ldots, q^{(p)}, c_k \rangle$ to $\langle r^{(1)}, \ldots, r^{(p)}, a \rangle$, with probability $H(L_j, a)$ where $L_j \ni \vec c$, by construction. This is as it should be. □
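The construction can be illustrated in miniature. The sketch below (Python; the automaton and the table $H$ are my own toy example, not from the text) partitions histories by a single DFA property, namely whether the number of a's seen so far is even, and lets the next-letter distribution depend only on that class, in the spirit of Theorem 58.

    import random

    # Toy instance: the partition of histories is induced by one DFA that
    # tracks the parity of the number of a's read so far (two classes).
    H = {
        'even': {'a': 0.9, 'b': 0.1},   # next-letter distribution for histories in L_even
        'odd':  {'a': 0.1, 'b': 0.9},   # ... and for histories in L_odd
    }

    def generate(length, rng=random):
        """Generate a word letter by letter; P_{k+1}(.|c) depends only on the
        class of the history c, as in Theorem 58."""
        state, word = 'even', []        # epsilon contains zero a's, hence class 'even'
        for _ in range(length):
            dist = H[state]
            letter = rng.choices(list(dist), weights=list(dist.values()))[0]
            word.append(letter)
            if letter == 'a':           # DFA transition: parity flips on a
                state = 'odd' if state == 'even' else 'even'
        return ''.join(word)

    print(generate(20))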

