Introduction to Applied Statistics

STP 420 SUMMER 2005
STP 420 INTRODUCTION TO APPLIED STATISTICS
NOTES
PART 1 - DATA
CHAPTER 3
PRODUCING DATA

Introduction

Exploratory data analysis – covered in Ch. 1 and Ch. 2
- Uses graphs and numbers to uncover the nature of a data set.
- Not good enough to provide convincing evidence for its conclusions.

Formal statistical inference – answers specific questions with a known degree of confidence.
- It uses the descriptive tools given in the previous chapters along with new kinds of reasoning (numerical rather than graphical).

3.1 First Steps

Major questions when trying to produce data:
1. What individuals shall you study?
2. What variables shall you measure?

Designs – arrangements or patterns used to collect data from many individuals

Some questions addressed by designs:
1. How many individuals shall you collect data from?
2. How shall you select the individuals to be studied?
3. How shall you form groups where relevant?

Without a sound design, you may be misled by haphazard or incomplete data or by confounding (the mixed effects of several variables on the same response, with no way to determine which variable has which effect on the response).

Where to find data: library and internet

Anecdotal evidence – based on haphazardly selected individual cases that come to our attention but may not be representative of any larger group of cases.

There are many places to find data, including:
- The annual Statistical Abstract of the United States
- US Census Bureau
- Website: http://www.fedstats.gov

Available data – data produced in the past for some other purpose that may help answer a present question

Producing new data is expensive, so available data are used whenever possible.

Statistical designs used for producing data rely on sampling or experiments.

Sample – group of individuals/subjects from which data are gathered and which is meant to represent a larger body or population

Census – when information is gathered from the whole population
- Time consuming and expensive, which is why samples are used instead.

Observational study – observes individuals and measures variables of interest but does not influence the responses

Experiment – deliberately imposes a treatment on individuals in order to observe their responses

3.2 Design of Experiments

Experimental units – individuals/subjects on which the experiment is done

Treatment – specific experimental condition applied to the units

Factors – explanatory variables in an experiment

Level of a factor – specific value of a factor/variable

Placebo – dummy treatment

In principle, experiments can give good evidence for causation.

Comparative experiments

A simple design with only a single treatment is:
Treatment → Observe response

Control group – group that is given the dummy treatment/placebo

It is better to run an experiment with at least two groups: one given the treatment and the other given the dummy treatment.

The design of a study is biased if it systematically favors certain outcomes.

Randomization – use of chance to divide experimental units into groups

Principles of Experimental Design

The basic principles of statistical design of experiments are:
1. Control of the effects of lurking variables on the response, most simply by comparing several treatments.
2. Randomization, the use of impersonal chance to assign experimental units to treatments.
3. Replication of the experiment on many units to reduce chance variation in the results.

Statistically significant – observed effect so large that it would rarely occur by chance.

How to randomize
- Drawing names/numbers from a hat
- Table of random digits
- Computer software

Random digits

Table of random digits – list of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 that has the following properties:
1. The digit in any position in the list has the same chance of being any one of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
2. The digits in different positions are independent in the sense that the value of one has no influence on the value of any other.

Completely randomized design – experimental design in which all experimental units are allocated at random among all treatments

Cautions about experimentation

Double-blind – neither the subjects nor the researcher knows which subject got which treatment.

Lack of realism – the subjects, treatments, or setting of an experiment may not realistically duplicate the conditions we really want to study.

Matched pairs designs
- Can produce more precise results than simple randomization (a completely randomized design)
- Use the principles of comparison of treatments, randomization, and replication on several experimental units
- Are an example of a block design

Block designs

Block – group of experimental units or subjects that are known before the experiment to be similar in some way that is expected to affect the response to the treatments.

In a block design, the random assignment of units to treatments is carried out separately within each block.

3.3 Sampling Design

Population – entire group of individuals that we want information about

Sample – a part of the population that we actually examine in order to gather information

Sample design – method used to choose the sample from the population

Voluntary response sample – consists of people who choose themselves by responding to a general appeal. It is biased because people with strong opinions, especially negative ones, are the most likely to respond.

Simple random samples

Simple random sample (SRS) – consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected.

Stratified samples

Probability sample – gives each member of the population a known chance (> 0) to be selected

Stratified random sample – first divide the population into groups of similar individuals called strata. Then choose a separate SRS in each stratum and combine these SRSs to form the full sample.

Multistage samples

Ex:
1. Select a sample from the 3000 counties in the US.
2. Select a sample of townships within each of the counties chosen.
3. Select a sample of city blocks or other small areas within each chosen township.
4. Take a sample of households within each block chosen.

Sampling in stages helps spread the sample across the whole country.

Cautions about sample surveys

Undercoverage – occurs when some groups in the population are left out of the process of choosing the sample.

Nonresponse – occurs when an individual chosen for the sample can't be contacted or does not cooperate.

Response bias – caused by the behavior of the respondent/subject or the interviewer

Wording of questions – most important influence on the answers given to a sample survey. Confusing or leading questions can introduce strong bias and may even change the outcome of a survey.

3.4 Toward Statistical Inference

Statistical inference – using data from a sample to draw conclusions about the wider population

Parameter – a number that describes the population. It is fixed but unknown.

Statistic – a number that describes a sample. It is known once we take the sample but may vary from sample to sample. We use a sample statistic to estimate a population parameter.

Sampling variability – the value of a statistic varies from sample to sample; for example, the mean of one sample differs from the mean of another sample.

Simulation – using random digits from a table or computer software to imitate chance behavior, since taking many actual samples may be expensive and time consuming

Sampling distribution of a statistic – the distribution of values taken by the statistic in all possible samples of the same size from the same population. It is often approximately normal.

The bias of a statistic

Unbiased estimator – a statistic whose sampling distribution has a mean equal to the true value of the parameter being estimated

The variability of a statistic – described by the spread of its sampling distribution. The spread is determined by the sampling design and the sample size n. Larger samples have smaller spreads.

As long as the population is much larger than the sample (at least 10 times as large), the spread of the sampling distribution for a sample of fixed size n is approximately the same for any population size.

Bias and variability
- High bias, low variability: off target but close together
- Low bias, high variability: on target but spread out
- High bias, high variability: off target and spread out
- Low bias, low variability: on target and close together

Why randomize
- It guarantees that the results of analyzing our data are subject to the laws of probability.
- It eliminates bias in how individuals are selected or assigned.
- The shape of the sampling distribution is usually approximately normal, and its center lies at the true value of the parameter.
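
Example (simulation sketch)

The ideas in 3.4 (simulation, sampling distribution, bias, and variability) can be tried out with a few lines of Python. This is only a sketch under assumptions that are not part of the notes: the population is made up (10,000 right-skewed values with a mean near 50), and the names population and simulate_sampling_distribution, the seed, the sample sizes 10 and 100, and the 1000 repetitions are all hypothetical choices. It draws many simple random samples, records the sample mean of each, and then reports the center and spread of those means.

```python
import random
import statistics


def simulate_sampling_distribution(population, n, num_samples, rng):
    """Draw num_samples SRSs of size n and return the list of sample means."""
    means = []
    for _ in range(num_samples):
        # rng.sample draws without replacement, so every set of n
        # individuals is equally likely to be chosen (an SRS).
        sample = rng.sample(population, n)
        means.append(statistics.mean(sample))
    return means


# Fixed seed so the run is reproducible (the seed value is arbitrary).
rng = random.Random(420)

# Hypothetical population, assumed for illustration only.
population = [rng.expovariate(1 / 50) for _ in range(10_000)]
mu = statistics.mean(population)  # the parameter: fixed, normally unknown

results = {}
for n in (10, 100):
    means = simulate_sampling_distribution(population, n, 1000, rng)
    center = statistics.mean(means)   # compare with mu to judge bias
    spread = statistics.stdev(means)  # variability of the statistic
    results[n] = (center, spread)
    print(f"n = {n:3d}: center = {center:.2f}, spread = {spread:.2f}, mu = {mu:.2f}")
```

For both sample sizes the center of the simulated means should land close to mu, which is what an unbiased estimator looks like, and the spread for n = 100 should be clearly smaller than for n = 10, in line with the statement that larger samples have smaller spreads.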
