Statistics 231 Course Notes
Total Page:16
File Type:pdf, Size:1020Kb
STATISTICS 231 COURSE NOTES Original notes by Jerry Lawless Winter 2013 Edition Contents INTRODUCTION TO STATISTICAL SCIENCE 2 1.1StatisticalScience..................................... 2 1.2CollectionofData..................................... 3 1.3DataSummaries...................................... 7 1.4ProbabilityDistributionsandStatisticalModels..................... 12 1.5DataAnalysisandStatisticalInference.......................... 16 1.6StatisticalSoftware.................................... 19 1.7 A More Detailed Example: Colour Classification by Robots . 25 1.8Appendix.TheRLanguageandSoftware........................ 30 1.9Problems......................................... 35 MODEL FITTING, MAXIMUM LIKELIHOOD ESTIMATION, AND MODEL CHECK- ING 43 2.1StatisticalModelsandProbabilityDistributions..................... 43 2.2EstimationofParameters(ModelFitting)........................ 47 2.3 Likelihood Functions From Multinomial Models . 54 2.4CheckingModels..................................... 56 2.5Problems......................................... 60 PLANNING AND CONDUCTING EMPIRICAL STUDIES 66 3.1EmpiricalStudies..................................... 66 3.2PlanningaStudy..................................... 70 3.3DataCollection...................................... 72 3.4Problems......................................... 73 STATISTICAL INFERENCE: ESTIMATION 75 4.1 Introduction . 75 4.2SomeDistributionTheory................................ 79 ii 1 4.3 ConfidenceIntervalsforaParameter........................... 88 4.4Problems......................................... 98 STATISTICAL INFERENCE: TESTING HYPOTHESES 106 5.1 Introduction . 106 5.2 Testing Parametric Hypotheses with Likelihood Ratio Statistics . 110 5.3 Hypothesis Testing and Interval Estimation . 115 5.4Problems......................................... 116 GAUSSIAN RESPONSE MODELS 121 6.1 Introduction . 121 6.2InferenceforasinglesamplefromaGaussianDistribution............... 126 6.3GeneralGaussianResponseModels........................... 131 6.4InferenceforPairedData................................. 136 6.5LinearRegressionModels................................ 142 6.6ModelChecking...................................... 149 6.7Problems......................................... 151 TESTS AND INFERENCE PROBLEMS BASED ON MULTINOMIAL MODELS 161 7.1 Introduction . 161 7.2 Goodness of Fit Tests . 162 7.3 Two-Way Tables and Testing for Independence of Two Variables . 164 7.4Problems......................................... 169 CAUSE AND EFFECT 173 8.1 Introduction . 173 8.2ExperimentalStudies................................... 175 8.3ObservationalStudies................................... 177 8.4Problems......................................... 178 References and Supplementary Resources 182 Statistical Tables 183 APPENDIX. ANSWERS TO SELECTED PROBLEMS 184 A Short Review of Probability 188 INTRODUCTION TO STATISTICAL SCIENCE 1.1 Statistical Science Statistical science, or statistics, is the discipline that deals with the collection, analysis and interpreta- tion of data, and with the study and treatment of variability and of uncertainty. If you think about it, you soon realize that almost everything we do or know depends on data of some kind, and that there is usually some degree of uncertainty present. For example, in deciding whether to take an umbrella on a long walk we may utilize information from weather forecasts along with our own direct impression of the weather, but even so there is usually some degree of uncertainty as to whether it will actually rain or not. In areas such as insurance or finance, decisions must be made about what rates to charge for an insurance policy, or whether to buy or sell a stock, on the basis of certain types of data. The uncertainty as to whether a policy holder will have a claim over the next year, or whether a stock’s price will rise or fall, is the basis of financial risk for the insurer and the investor. In order to increase our knowledge about some area or to make better decisions, we must collect and analyze data about the area in question. To discuss general ways of doing this, it is useful to have terms that refer to the objects we are studying. The words “population”, “phenomenon” and “process” are frequently used below; they are simply catch-all terms that represent groups of objects and events that someone might wish to study. Variability and uncertainty are present in most processes and phenomena in the real world. Uncer- tainty or lack of knowledge is the main reason why someone chooses to study a phenomenon in the first place. For example, a medical study to assess the effect of a new drug for controlling hyperten- sion (high blood pressure) may be conducted by a drug company because they do not know how the drug will perform on different types of people, what its side effects will be, and so on. Variability is ever-present; people have varying degrees of hypertension, they react differently to drugs, they have different physical characteristics. One might similarly want to study variations in currency or stock val- ues, variation in sales for a company over time, or variation in hits and response times for a commercial web site. Statistical science deals both with the study of variability in processes and phenomena, and with good (i.e. informative, cost-effective) ways to collect and analyze data about such processes. 2 3 There are various possible objectives when one collects and analyzes data on a population, phenom- enon, or process. In addition to pure “learning” or furthering knowledge, these include decision-making and the improvement of processes or systems. Many problems involve a combination of objectives. For example, government scientists collect data on fish stocks in order to further their scientific knowledge, but also to provide information to legislators or groups who must set quotas or limits on commercial fishing. Statistical data analysis occurs in a huge number of areas. For example, statistical algorithms are the basis for software involved in the automated recognition of handwritten or spoken text; statis- tical methods are commonly used in law cases, for example in DNA profiling or in determining costs; statistical process control is used to increase the quality and productivity of manufacturing processes; individuals are selected for direct mail marketing campaigns through statistical analysis of their char- acteristics. With modern information technology, massive amounts of data are routinely collected and stored. But data does not equal information, and it is the purpose of statistical science to provide and analyze data so that the maximum amount of information or knowledge may be obtained. Poor or improperly analyzed data may be useless or misleading. Mathematical models are used to represent many phenomena, populations, or processes and to deal with problems that involve variability we utilize probability models. These have been introduced and studied in your first probability course, and you have seen how to describe variability and solve certain types of problems using them. This course will focus more on the collection, analysis and interpreta- tion of data, but the probability models studied earlier will be heavily used. The most important part of probability for this course is the material dealing with random variables and their probability distribu- tions, including distributions such as the binomial, hypergeometric, Poisson, multinomial, normal and exponential. You should review your previous notes on this material. Statistical science is a large discipline, and this course is only an introduction. Our broad objectives are to discuss the collection, analysis and interpretation of data, and to show why this is necessary. By way of further introduction we will outline important statistical topics, first data collection, and then probability models, data analysis, and statistical inference. We should bear in mind that study of a process or phenomenon involves iteration between model building, data collection, data analysis, and interpretation. We must also remember that data are collected and models are constructed for a specific reason. In any given application we should keep the big picture in mind (e.g. why are we studying this? what else do we know about it?) even when considering one specific aspect of a problem. 1.2 Collection of Data The objects of study in this course are usually referred to as either populations or processes. In essence a population is just some collection of units (which can be either real or imagined), for example, the 4 collection of persons under the age of 18 in Canada as of September 1, 2012 or the collection of car insurance policies issued by a company over a one year period. A process is a mechanism by which output of some kind is produced; units can often be associated with the output. For example, hits on a website constitute a process (the “units" are the distinct hits), as do the sequence of claims generated by car insurance policy holders (the “units" are the individual claims). A key feature of processes is that they usually occur over time, whereas populations are often static (defined at one moment in time). Populations or processes are studied by defining variates or variables which represent character- istics of units. These are usually numerical-valued and are represented by letters such as .For example, we might define a variable as the number of car insurance claims from an individual policy holder in a given year, or as the number of hits on a website