Analyze Phase 331

[Figure content: boxplots with salary (0 to 60,000) on the y-axis and employment category on the x-axis; group sizes N = 227, 136, 27, 41, 32, 5, ...]

FIGURE 10.7 Boxplots of salary by job category.

Boxplots are particularly useful for comparing the distribution of values in several groups. Figure 10.7 shows boxplots of the salaries for several different job titles. The boxplot makes it easy to see the different properties of the distributions: the location, variability, and shape of each distribution are obvious at a glance. This ease of interpretation is something that summary statistics alone cannot provide.
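The five-number summary that each box encodes can be computed directly. The following is a minimal Python sketch using only the standard library; the salary figures are invented for illustration and are not the data behind Fig. 10.7:

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) -- the values a boxplot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # default exclusive method
    return (min(values), q1, median, q3, max(values))

# Invented salary data for two hypothetical job categories.
salaries = {
    "Clerical": [21000, 24000, 27000, 30000, 33000],
    "Manager": [40000, 48000, 55000, 63000, 78000],
}

for job, values in salaries.items():
    print(job, five_number_summary(values))
```

Comparing these tuples across groups conveys the same location and spread information that the boxplots show at a glance.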

Statistical Inference
This section discusses the basic concept of statistical inference. The reader should also consult the glossary in the Appendix for additional information. Inferential statistics belong to the enumerative class of statistical methods. All statements made in this section are valid only for stable processes, that is, processes in statistical control. Although most applications of Six Sigma are analytic, there are times when enumerative statistics prove useful. The term inference is defined as (1) the act or process of deriving logical conclusions from premises known or assumed to be true, or (2) the act of reasoning from factual knowledge or evidence. Inferential statistics provide information that is used in the process of inference. As can be seen from the definitions, inference involves two domains: the premises and the evidence or factual knowledge. Additionally, there are two conceptual frameworks for addressing premises questions in inference: the design-based approach and the model-based approach. As discussed by Koch and Gillings (1983), a statistical analysis whose only assumptions are random selection of units or random allocation of units to experimental conditions results in design-based inferences; or, equivalently, randomization-based inferences. The objective is to structure the study such that the sampled population has the same

characteristics as the target population. If this is accomplished, then inferences from the study are said to have internal validity. A limitation on design-based inferences for experimental studies is that formal conclusions are restricted to the finite population of subjects that actually received treatment; that is, they lack external validity. However, if sites and subjects are selected at random from larger eligible sets, then models with random effects provide one possible way of addressing both internal and external validity considerations. One important consideration for external validity is that the sample coverage includes all relevant subpopulations; another is that treatment differences be homogeneous across subpopulations. A common application of design-based inference is the survey. Alternatively, if assumptions external to the study design are required to extend inferences to the target population, then statistical analyses based on postulated probability distributional forms (e.g., binomial, normal, etc.) or other stochastic processes yield model-based inferences. A focus of distinction between design-based and model-based studies is the population to which the results are generalized rather than the nature of the statistical methods applied. When using a model-based approach, external validity requires substantive justification for the model's assumptions, as well as statistical evaluation of the assumptions. Statistical inference is used to provide probabilistic statements regarding a scientific inference. Science attempts to provide answers to basic questions, such as: Can this machine meet our requirements? Is the quality of this lot within the terms of our contract? Does the new method of processing produce better results than the old? These questions are answered by conducting an experiment, which produces data. If the data vary, then statistical inference is necessary to interpret the answers to the questions posed.
A statistical model is developed to describe the probabilistic structure relating the observed data to the quantity of interest (the parameters); that is, a scientific hypothesis is formulated. Rules are applied to the data and the scientific hypothesis is either rejected or not. In formal tests of a hypothesis, there are usually two mutually exclusive and exhaustive hypotheses formulated: a null hypothesis and an alternate hypothesis.

Chi-Square, Student's T, and F Distributions
In addition to the distributions presented earlier in the Measure phase, these three distributions are used in Six Sigma to test hypotheses, construct confidence intervals, and compute control limits.

Chi-Square
Many characteristics encountered in Six Sigma have normal or approximately normal distributions. It can be shown that in these instances the distribution of sample variances has the form (except for a constant) of a chi-square distribution, symbolized χ². Tables have been constructed giving abscissa values for selected ordinates of the cumulative χ² distribution. One such table is given in Appendix 4. The χ² distribution varies with the quantity ν, which for our purposes is equal to the sample size minus 1. For each value of ν there is a different χ² distribution. Equation (10.3) gives the pdf for χ².

f(χ²) = (χ²)^(ν/2 − 1) e^(−χ²/2) / [2^(ν/2) Γ(ν/2)]    (10.3)
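The relationship between sample variances and the χ² distribution can be checked by simulation: when sampling from a normal population, (n − 1)s²/σ² follows a χ² distribution with ν = n − 1 degrees of freedom, which has mean ν. A sketch using only the Python standard library (the sample size, σ, and seed are arbitrary choices):

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible
n, sigma = 10, 2.0

# Draw many samples from a normal population and scale each sample variance
# by (n - 1) / sigma^2, which should follow a chi-square with nu = n - 1 df.
scaled = [
    (n - 1) * statistics.variance([random.gauss(0.0, sigma) for _ in range(n)]) / sigma**2
    for _ in range(5000)
]

# A chi-square variable with nu degrees of freedom has mean nu (here, 9).
print(round(statistics.mean(scaled), 1))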

[Figure content: two F distribution density curves, including F(2,2), plotted for F from 0 to 10.]

FIGURE 10.12 F distributions.

denominator. Appendixes 5 and 6 provide values for the 1 and 5% percentage points of the F distribution. The percentages refer to the areas to the right of the values given in the tables. Figure 10.12 illustrates two F distributions.

Point and Interval Estimation
So far, we have introduced a number of important sample statistics, including the sample mean and the sample standard deviation. These sample statistics are called point estimates because they are single values used to represent population parameters. It is also possible to construct an interval about the statistic that has a predetermined probability of including the true population parameter. This interval is called a confidence interval. Interval estimation is an alternative to point estimation that gives us a better idea of the magnitude of the sampling error. Confidence intervals can be either one-sided or two-sided. A one-sided confidence interval places an upper or lower bound on the value of a parameter with a specified level of confidence. A two-sided confidence interval places both upper and lower bounds. In almost all practical applications of enumerative statistics, including Six Sigma applications, we make inferences about populations based on data from samples. In this chapter, we have talked about sample averages and standard deviations; we have even used these numbers to make statements about future performance, such as long-term yields or potential failures. A problem arises that is of considerable practical importance: any estimate that is based on a sample has some amount of sampling error. This is true even though the sample estimates are the "best estimates" in the sense that they are (usually) unbiased estimators of the population parameters.

Estimates of the Mean
For random samples with replacement, the sampling distribution of X̄ has a mean μ and a standard deviation equal to σ/√n. For large samples the sampling distribution of X̄ is approximately normal and normal tables can be used to find the probability that a sample mean will be within a given distance of μ. For example, in 95% of the samples we will observe a mean within ±1.96σ/√n of μ. In other words, in 95% of the samples the interval from X̄ − 1.96σ/√n to X̄ + 1.96σ/√n will include μ. This interval is called a "95% confidence interval for estimating μ." It is usually shown using inequality symbols:

X̄ − 1.96σ/√n < μ < X̄ + 1.96σ/√n

The factor 1.96 is the Z value obtained from the normal table in Appendix 2. It corresponds to the Z value beyond which 2.5% of the population lies. Since the normal distribution is symmetric, 2.5% of the distribution lies above Z and 2.5% below −Z. The notation commonly used to denote Z values for confidence interval construction or hypothesis testing is Zα/2, where 100(1 − α) is the desired confidence level in percent. For example, if we want 95% confidence, α = 0.05, 100(1 − α) = 95%, and Z0.025 = 1.96. In hypothesis testing the value of α is known as the significance level.

Example: Estimating μ When σ Is Known
Suppose that σ is known to be 2.8. Assume that we collect a sample of n = 16 and compute X̄ = 15.7. Using the equation from the previous section, we find the 95% confidence interval for μ as follows:

X̄ − 1.96σ/√n < μ < X̄ + 1.96σ/√n

15.7 − 1.96(2.8/√16) < μ < 15.7 + 1.96(2.8/√16)

14.33 < μ < 17.07

There is a 95% level of confidence associated with this interval. The numbers 14.33 and 17.07 are sometimes referred to as the confidence limits. Note that this is a two-sided confidence interval. There is a 2.5% probability that 17.07 is lower than μ and a 2.5% probability that 14.33 is greater than μ. If we were only interested in, say, the probability that μ were greater than 14.33, then the one-sided confidence interval would be μ > 14.33 and the one-sided confidence level would be 97.5%.
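The interval above can be reproduced in a few lines of Python; this sketch pulls the critical value from the standard normal distribution via the standard library instead of a table:

```python
from statistics import NormalDist

xbar, sigma, n, alpha = 15.7, 2.8, 16, 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96

half_width = z * sigma / n**0.5
lower, upper = xbar - half_width, xbar + half_width
print(round(lower, 2), round(upper, 2))  # 14.33 17.07
```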

Example of Using Microsoft Excel to Calculate the Confidence Interval for the Mean When Sigma Is Known
Microsoft Excel has a built-in capability to calculate confidence intervals for the mean. The dialog box in Fig. 10.13 shows the input. The formula result near the bottom of

[Figure content: the Excel CONFIDENCE function dialog, with Alpha = 0.05, Standard_dev = 2.8, and Size = 16 (the sample size); the formula result is 1.371972758.]

FIGURE 10.13 Example of finding the confidence interval when sigma is known using Microsoft Excel.

the screen gives the interval width as 1.371972758. To find the lower confidence limit subtract the width from the mean. To find the upper confidence limit add the width to the mean.

Example: Estimating μ When σ Is Unknown
When σ is not known and we wish to replace σ with s in calculating confidence intervals for μ, we must replace Zα/2 with tα/2 and obtain the value from tables for Student's t distribution instead of the normal tables. Let's revisit the example above and assume that instead of knowing σ, it was estimated from the sample; that is, based on the sample of n = 16, we computed s = 2.8 and X̄ = 15.7. Then the 95% confidence interval becomes:

X̄ − 2.131s/√n < μ < X̄ + 2.131s/√n

15.7 − 2.131(2.8/√16) < μ < 15.7 + 2.131(2.8/√16)

14.21 < μ < 17.19

It can be seen that this interval is wider than the one obtained for known σ. The tα/2 value found for 15 df is 2.131 (see Table 3 in the Appendix), which is greater than Zα/2 = 1.96 above.
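The same t-based interval can be sketched in Python. The standard library has no inverse t function, so the 2.131 critical value is taken from the t table, as above:

```python
t_crit = 2.131  # t value for 15 df at 95% confidence, from the t table
xbar, s, n = 15.7, 2.8, 16

half_width = t_crit * s / n**0.5  # 2.131 * 2.8 / 4
lower, upper = xbar - half_width, xbar + half_width
print(round(lower, 2), round(upper, 2))  # 14.21 17.19
```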

Example of Using Microsoft Excel to Calculate the Confidence Interval for the Mean When Sigma Is Unknown
Microsoft Excel has no built-in capability to calculate confidence intervals for the mean when sigma is not known. However, it does have the ability to calculate t-values when given probabilities and degrees of freedom. This information can be entered into an equation and used to find the desired confidence limits. Figure 10.14 illustrates the approach. The formula bar shows the formula for the 95% upper confidence limit for the mean in cell B7.

[Figure content: cell B7 contains the formula =$B$1+TINV($B$4,$B$3-1)*$B$2/SQRT($B$3); the worksheet holds Mean = 15.7, sigma = 2.8, n = 16, Alpha = 0.05, Lower Confidence Limit = 14.21, Upper Confidence Limit = 17.19.]

FIGURE 10.14 Example of finding the confidence interval when sigma is unknown using Microsoft Excel.

Hypothesis Testing
Statistical inference generally involves four steps:

1. Formulating a hypothesis about the population or "state of nature"
2. Collecting a sample of observations from the population
3. Calculating statistics based on the sample
4. Either accepting or rejecting the hypothesis based on a predetermined acceptance criterion

There are two types of error associated with statistical inference:

Type I error (α error)-The probability that a hypothesis that is actually true will be rejected. The value of α is known as the significance level of the test.

Type II error (β error)-The probability that a hypothesis that is actually false will be accepted. Type II errors are often plotted in what is known as an operating characteristic curve.

Confidence intervals are usually constructed as part of a statistical test of hypotheses. The hypothesis test is designed to help us make an inference about the true population value at a desired level of confidence. We will look at a few examples of how hypothesis testing can be used in Six Sigma applications.

Example: Hypothesis Test of Sample Mean
Experiment: The nominal specification for filling a bottle with a test chemical is 30 cc. The plan is to draw a sample of n = 25 units from a stable process and, using the sample mean and standard deviation, construct a two-sided confidence interval (an interval that extends on either side of the sample average) that has a 95% probability of including the true population mean. If the interval includes 30, conclude that the lot mean is 30; otherwise conclude that the lot mean is not 30.

Result: A sample of 25 bottles was measured and the following statistics computed

X̄ = 28 cc, s = 6 cc

The appropriate test is t, given by the formula

t = (X̄ − μ)/(s/√n) = (28 − 30)/(6/√25) = −1.67

Table 3 in the Appendix gives values for the t statistic at various degrees of freedom. There are n − 1 degrees of freedom (df). For our example we need the t.975 column and the row for 24 df. This gives a t value of 2.064. Since the absolute value of our test statistic, 1.67, is less than this critical value, we fail to reject the hypothesis that the lot mean is 30 cc. Using statistical notation this is shown as:

H0: μ = 30 cc (the null hypothesis)

H1: μ ≠ 30 cc (the alternate hypothesis)
α = 0.05 (Type I error or level of significance)
Critical region: −2.064 ≤ t0 ≤ +2.064
Test statistic: t = −1.67

Since t lies inside the critical region, we fail to reject H0, and accept the hypothesis that the lot mean is 30 cc for the data at hand.
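The whole test can be sketched in a few lines of Python; the critical value 2.064 is taken from the t table rather than computed:

```python
n, xbar, mu0, s = 25, 28.0, 30.0, 6.0

t = (xbar - mu0) / (s / n**0.5)  # (28 - 30) / (6/5) = -1.67
t_crit = 2.064                   # t_.975 for 24 df, from the t table

reject = abs(t) > t_crit
print(round(t, 2), "reject H0" if reject else "fail to reject H0")
```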

Example: Hypothesis Test of Two Sample Variances
The variance of machine X's output, based on a sample of n = 25 taken from a stable process, is 100. Machine Y's variance, based on a sample of 10, is 50. The manufacturing representative from the supplier of machine X contends that the result is a mere "statistical fluke." Assuming that a "statistical fluke" is something that has less than 1 chance in 100, test the hypothesis that both variances are actually equal. The test statistic used to test for equality of two sample variances is the F statistic, which, for this example, is given by the equation

F = s₁²/s₂² = 100/50 = 2, numerator df = 24, denominator df = 9

Using Table 5 in the Appendix for F.99, we find that for 24 df in the numerator and 9 df in the denominator, F = 4.73. Since 2 < 4.73, we conclude that the manufacturer of machine X could be right; the result could be a statistical fluke. This example demonstrates the volatile nature of the sampling error of sample variances and standard deviations.
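A sketch of the same F comparison in Python; the 4.73 critical value comes from the F table rather than being computed:

```python
var_x, n_x = 100.0, 25  # machine X: sample variance and sample size
var_y, n_y = 50.0, 10   # machine Y

F = var_x / var_y                  # 2.0
df_num, df_den = n_x - 1, n_y - 1  # 24 and 9
F_crit = 4.73                      # F_.99(24, 9), from the F table

fluke_plausible = F < F_crit
print(F, fluke_plausible)  # 2.0 True -> cannot rule out equal variances
```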

Example: Hypothesis Test of a Standard Deviation Compared to a Standard Value
A machine is supposed to produce parts in the range of 0.500 inch plus or minus 0.006 inch. Based on this, the absolute worst standard deviation tolerable is computed to be 0.002 inch. In looking over your capability charts you find that the best machine in the shop has a standard deviation of 0.0022, based on a sample of 25 units.