GOODNESS OF FIT

INTRODUCTION

Goodness of fit tests are used to determine how well the shape of a sample of data obtained from a population matches a conjectured or hypothesized distribution shape for that population. The idea behind a goodness-of-fit test is to see if the sample comes from a population with the claimed distribution. Another way of looking at that is to ask if the distribution fits a specific pattern, or, even more to the point, how do the actual observed frequencies in each class interval of a frequency distribution compare to the frequencies that theoretically would be expected to occur if the data exactly followed the hypothesized distribution. This is relevant to cost risk analysis because we often want to apply a distribution to an element of cost based on observed sample data. A goodness of fit test is a statistical hypothesis test: set up the null and alternative hypotheses; determine alpha; calculate a test statistic; look up a critical value; draw a conclusion.

In this course, we will discuss three different methods or tests that are commonly used to perform Goodness-of-Fit analyses: the Chi-Square (χ2) test, the Kolmogorov-Smirnov One Sample Test, and the Anderson-Darling test. The Kolmogorov-Smirnov and Anderson-Darling tests are restricted to continuous distributions while the χ2 test can be applied to both discrete and continuous distributions.

GOODNESS OF FIT TESTS

CHI SQUARE TEST

The Chi-Square test is used to test if a sample of data came from a population with a specific distribution. An attractive feature of the chi-square goodness-of-fit test is that it can be applied to any univariate distribution for which you can calculate the cumulative distribution function. The chi-square goodness-of-fit test is applied to binned data (i.e., data put into classes). This is actually not a restriction, since for non-binned data you can simply calculate a histogram or frequency table before performing the chi-square test. However, the value of the chi-square test statistic depends on how the data is binned. Another characteristic of the chi-square test is that it requires a sufficient sample size in order for the chi-square test statistic to be valid. The chi-square statistic measures how well the expected frequency of the fitted distribution compares with the frequency of a histogram of the observed data. It compares the histogram of the data to the shape of the candidate density (continuous data) or mass (discrete data) function.

Definition

The chi-square test is defined for the hypothesis:

H0: The data follow a specified distribution.

H1: The data do not follow the specified distribution.

Test Statistic

For the chi-square goodness-of-fit computation, the data are divided into k bins and the test statistic is defined as:

χ2 = Σ (Oi – Ei)2 / Ei, summed over i = 1 to k

Where Oi is the observed frequency for bin i and Ei is the expected frequency for bin i.

Computation of the expected frequency (Ei) will be shown by example. For the chi-square approximation to be valid, the expected frequency in each bin should be at least 5. The test is less sensitive when the sample size is small, and if some of the theoretical bin counts are less than five, you may need to combine adjacent bins to ensure that there are at least 5 theoretical observations in each bin.

Significance Level: α

Critical Region: The test statistic follows, approximately, a Chi-Square distribution with (k – 1 – number of population parameters estimated) degrees of freedom, where k is the number of non-empty bins. If sample statistics need to be computed in order to develop the binning, then the degrees of freedom are reduced by the number of statistics that were computed. The hypothesis that the data are from a population with the specified distribution is rejected if the computed χ2 is greater than the critical value. Note that the information needed to determine critical values from the χ2 distribution is the level of significance (α) and the degrees of freedom (df). If the sum of the squared deviations in χ2 = Σ (Oi – Ei)2 / Ei is small, the observed frequencies are close to the expected frequencies and there is no reason to reject the claim that the data came from the specified distribution. Only when the sum is large is there reason to question the distribution. Therefore, the chi-square goodness-of-fit test is always a right-tail test.
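To make the mechanics concrete, here is a minimal Python sketch of the computation (using numpy and scipy, and assuming the counts have already been binned; the observed and expected values below are placeholders rather than data from this text):

    import numpy as np
    from scipy.stats import chi2

    observed = np.array([12.0, 8.0, 5.0])    # placeholder binned counts Oi
    expected = np.array([10.0, 9.0, 6.0])    # placeholder expected counts Ei under H0

    chi2_stat = np.sum((observed - expected) ** 2 / expected)
    df = len(observed) - 1                   # reduce further by each estimated parameter
    critical = chi2.ppf(1 - 0.05, df)        # right-tail critical value at alpha = 0.05
    print(chi2_stat, critical, chi2_stat > critical)   # reject H0 if statistic > critical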

KOLMOGOROV-SMIRNOV TEST

The Kolmogorov-Smirnov One Sample Test, also referred to as the KS test, is an alternative to the χ2 test and is called a distribution-free test because it does not require any assumptions about the underlying distribution of the test statistic. The KS test compares the cumulative relative frequency distribution derived from sample data with the theoretical cumulative relative frequency distribution that is described by the Null Hypothesis. In essence, the KS test is based on the maximum distance between these two cumulative relative frequency curves. The test statistic, D, is the absolute value of the maximum deviation between the observed cumulative relative frequencies and the expected (theoretical) cumulative relative frequencies. Depending on the probability that such a deviation would occur if the sample data really came from the distribution specified in the Null Hypothesis, the Null Hypothesis is either rejected or not rejected. Note that in the KS test we are working with relative frequencies, which are percentages rather than actual frequencies. The KS test is restricted to continuous distributions only.

Definition

The Kolmogorov-Smirnov test is defined as:

H0: The data follow a specified distribution

H1: The data do not follow the specified distribution

Test Statistic: The Kolmogorov-Smirnov test statistic is defined as:

D = Maximum|Fo – Fe|

where: Fo = observed cumulative relative frequency

Fe = theoretical (expected) cumulative relative frequency

Significance Level


Critical Values: The hypothesis regarding the distributional form is rejected if the test statistic, D, is greater than the critical value obtained from a table. There are several variations of these tables in the literature that use somewhat different scalings for the KS test statistic and critical regions. These alternative formulations should be equivalent, but it is necessary to ensure that the test statistic is calculated in a way that is consistent with how the critical values were tabulated.
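As an illustration of how the test can be run in software, scipy.stats.kstest computes D and a p-value from raw (unbinned) data against a fully specified distribution. The tiny sample below is illustrative only; note that scipy builds the empirical CDF from the individual observations, so its D can differ from a hand calculation based on binned relative frequencies, and its scaling must match whatever critical-value table you use.

    from scipy import stats

    sample = [93.2, 98.7, 100.6, 104.2, 107.3]            # illustrative data only
    result = stats.kstest(sample, 'norm', args=(100, 5))  # H0: sample ~ N(100, 5)
    print(result.statistic, result.pvalue)                # D and its p-value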

ANDERSON-DARLING TEST

The Anderson-Darling test is used to test if a sample of data came from a population with a specific distribution. It is a modification of the Kolmogorov-Smirnov (KS) test and gives more weight to the tails than does the KS test. The KS test is distribution free in the sense that the critical values do not depend on the specific distribution being tested. The Anderson-Darling test makes use of the specific distribution in calculating critical values. This has the advantage of allowing a more sensitive test and the disadvantage that critical values must be calculated for each distribution.

Definition

The Anderson-Darling test is defined as:

H0: The data follow a specified distribution.

H1: The data do not follow the specified distribution

Test Statistic: The Anderson-Darling test statistic is defined as:

A2 = (-Sum/n) - n

Where Sum is the sum of the (2i-1)*(ln(Pi) + ln(1-Pn+1-i)) column and n is the sample size. The adjusted test statistic, designated as A*, is computed as follows: A* = A2 (1.0 + 0.75/n + 2.25/n2). This is the value that is compared against the critical value.

Significance Level: α

Critical Region: The critical values for the Anderson-Darling test are dependent on the specific distribution that is being tested. Tabulated values and formulas have been published (Stephens, 1974, 1976, 1977, 1979) for a few specific distributions (normal, lognormal, exponential, Weibull, logistic, extreme value type 1). The test is a one-sided test, and the hypothesis that the distribution is of a specific form is rejected if the adjusted test statistic, A*, is greater than the critical value.


Note that for a given distribution, the Anderson-Darling statistic may be multiplied by a constant (which usually depends on the sample size, n). These constants are given in the various papers by Stephens. In the example below, this is the "adjusted Anderson-Darling" statistic, and it is what should be compared against the critical values. Also, be aware that different constants (and therefore different critical values) have been published, so you need to know which constant was used for a given set of critical values (the needed constant is typically given with the critical values).
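For reference, scipy.stats.anderson implements the test for a handful of distributions; it estimates the distribution's parameters from the data and returns the statistic with a matching set of tabulated critical values. A minimal sketch follows (scipy's published constants differ slightly from some other tables, including Table 6 later in this text):

    from scipy import stats

    sample = [93.2, 98.7, 100.6, 104.2, 107.3]   # illustrative data only
    result = stats.anderson(sample, dist='norm') # test the sample for normality
    print(result.statistic)                      # Anderson-Darling statistic
    print(result.significance_level)             # [15, 10, 5, 2.5, 1] percent levels
    print(result.critical_values)                # critical values for those levels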

EXAMPLES

CHI-SQUARE TEST EXAMPLE

You have been presented with a set of 25 data points that represent the weights in pounds of missile warheads that have been installed on a number of different kinds of aircraft. The government is interested in determining if the distribution of these weights can be considered to be normally distributed with a mean of 100 lbs. and a standard deviation of 5 lbs. Table 1 provides the raw data with the values ranked from low to high.

Table 1: Sample Data

WEIGHTS (lbs.)
79.5   93.6   98.7    102.6   107.3
85.1   94.8   99.4    103.4   108.2
88.4   95.8   100.0   104.2   108.4
89.8   96.4   100.6   104.2   111.8
93.2   98.7   101.9   105.6   113.9

In order to perform the Chi-Square test, the data must be tabulated into "bins" to form the histogram. The question is: how many bins should I use? There is no optimal choice for the bin width, since the optimal width depends on the distribution. Most reasonable choices should produce similar, but not identical, results. A commonly used algorithm called Sturges' Rule is sometimes used to determine a reasonable number of bins for a given sample size. The formula for Sturges' Rule is: k = 1 + 3.3*log10(n), where k is the number of bins and n is the sample size. Once k is determined, the range (discussed earlier) of the data can be divided by k to get an approximate bin width. For this problem, Sturges' Rule yields the following: k = 1 + 3.3*log10(25) = 5.61, which rounds up to 6 bins or cells.

The range for this data set is computed to be: R = Max value – Min value = 113.9 – 79.5 = 34.4. Dividing R by 6 yields a cell width of approximately 6 lbs. Table 2 shows the data in tabular form, and Figure 1 provides the histogram.

Table 2: Tabular or "Binned" Data

LOWER BOUND   UPPER BOUND   FREQ (f)
77.0          82.9          1
83            88.9          2
89            94.9          4
95            100.9         7
101           106.9         6
107           112.9         4
113           118.9         1
TOTAL                       25


Figure 1: Data Histogram
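The binning in Table 2 can be reproduced in a few lines of Python using the weights from Table 1. A sketch follows; note that numpy.histogram spreads six equal-width bins across exactly [79.5, 113.9], so its edges and counts will differ slightly from the hand-chosen bounds above.

    import numpy as np

    weights = np.array([79.5, 85.1, 88.4, 89.8, 93.2, 93.6, 94.8, 95.8, 96.4,
                        98.7, 98.7, 99.4, 100.0, 100.6, 101.9, 102.6, 103.4,
                        104.2, 104.2, 105.6, 107.3, 108.2, 108.4, 111.8, 113.9])

    k = int(np.ceil(1 + 3.3 * np.log10(len(weights))))  # Sturges' Rule: 5.61 -> 6 bins
    counts, edges = np.histogram(weights, bins=k)        # bin the data into k cells
    print(k, counts, edges)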

Your job is to perform a statistical hypothesis test on this data to determine if it fits the stipulated distribution. You are directed to use the Chi-Square Goodness of Fit test.

1. Establish the Null Hypothesis and Alternative Hypothesis (what you are trying to prove or disprove).

Ho = N(100, 5). This is a Normal distribution with a mean of 100 lbs and a standard deviation of 5 lbs.

The Alternative Hypothesis, designated H1, is your fallback position in the case that you cannot disprove Ho.

H1 = Not N(100, 5)

2. Set the level of significance. For this test we will set α = 0.05.

3. Perform the calculations. For this test we will use the Chi-Square distribution. The test statistic is given by:

χ2 = Σ (Oi – Ei)2 / Ei, summed over i = 1 to k, where:

Oi = Observed frequency

Ei = Theoretical expected frequency

So, as you can see, it will be necessary to compute the Ei. Let's use the spreadsheet (Table 3) below to walk through the steps.

Table 3: Chi Square Example Calculation Table

LOWER   UPPER   FREQ                                    CELL               TO GET TO
BOUND   BOUND   Oi     LL     UL      Z       AREA      AREA      E        5 IN CELL   (O-E)^2/E
77      82.9    1      77             -4.600  0.5000    0.0003    0.008
83      88.9    2      83             -3.400  0.4997    0.0136    0.339
89      94.9    4      89             -2.200  0.4861    0.1448    3.619
95      100.9   7      95             -1.000  0.3413    0.4128    10.319   14.286      0.0057
                              100.9   0.180   0.0714
101     106.9   6             106.9   1.380   0.4162    0.3448    8.620    10.712      0.0077
107     112.9   4             112.9   2.580   0.4951    0.0789    1.971
113     118.9   1             118.9   3.780   0.4999    0.0049    0.122
                                                                  25                   0.0134

The columns labeled LOWER BOUND and UPPER BOUND are taken straight from the "binned" data source presented to you. The column labeled FREQ contains the observed frequencies that were also given to you. These numbers represent the Oi. The columns labeled LL and UL contain the values in the LOWER BOUND and UPPER BOUND columns. Note that until the "bin" or cell which contains the hypothesized mean (100) is reached, only LL values are entered into the column. Entries after the "mean" cell is reached are entered in the UL column. Z represents the Standard Normal deviate and is computed as follows: Z = (LL – 100)/5 for the rows that contain LL values and Z = (UL – 100)/5 for the rows that contain UL values. For example, for the first LL value of 77 the computation is: Z = (77 – 100)/5 = -4.60, as shown. The other values are computed likewise. The column labeled AREA represents the area under the Standard Normal distribution curve between the center of the distribution and the point at which Z is plotted. Figure 2 depicts the AREA for the Z value -2.20 (AREA = 0.4861).

Figure 2: Normal Distribution Curve (area = 0.4861 between Z = -2.20 and the center of the distribution)
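The Z and AREA columns can be checked directly against the standard normal CDF; a minimal sketch using scipy:

    from scipy.stats import norm

    z = (89 - 100) / 5             # Z for the LL value of 89: -2.20
    area = abs(norm.cdf(z) - 0.5)  # area between Z and the center: ~0.4861
    print(z, round(area, 4))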

The column labeled CELL AREA represents the area in each “cell” of the distribution for each Z value. For example, the CELL AREA 0.1448 associated with Z = -2.20 represents the difference in area from the center of the distribution to Z = -2.20 (0.4861) and from the center of the distribution to Z = -1.00 (0.3413). This is depicted in Figure 3 below.


Figure 3: Distribution of Areas (cell area = 0.1448, the difference between the area of 0.4861 out to Z = -2.20 and the area of 0.3413 out to Z = -1.00)

The other values are computed in the same manner.

The values in the column labeled E (the expected frequencies) are derived by multiplying the CELL AREA values by the total sample size of 25 for each row. For example, the value of 0.008 results from the product of 0.0003 times 25. All the other values are computed in the same manner. These are the Ei values.

The column labeled (Oi – Ei)2/Ei contains values computed exactly as this formula states.

However, recall that each Ei must be at least 5 in each cell for the Chi-Square test. Since the first three cells contain expected frequencies that are less than 5, and their sum is less than 5, the first four cells must be combined, which results in the total of 14.286. Likewise, the last two cells in the Table contain Eis which are less than 5, and their sum is less than 5, so the last three cells in the Table must also be combined, resulting in the value of 10.712. Once the requirement of 5 in each cell is satisfied by combining adjacent cells, the (Oi – Ei)2/Ei values can be computed. The value of 0.0057 is computed as follows: (14 – 14.286)2/14.286 = 0.0057

And the value 0.0077 is computed as follows: (11 – 10.712)2/10.712 = 0.0077
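Combining cells until every expected count reaches 5 can also be automated. The helper below is written for this text (it is not a standard library routine) and uses a simple merge-smallest-into-neighbor rule, which happens to reproduce the groupings above; merging adjacent end cells by hand, as done in the example, is equally valid.

    import numpy as np

    def combine_bins(obs, exp, min_exp=5.0):
        """Merge the bin with the smallest expected count into its neighbor
        until every expected count is at least min_exp."""
        o, e = list(obs), list(exp)
        while len(e) > 1 and min(e) < min_exp:
            i = e.index(min(e))                       # bin with smallest expected count
            j = i + 1 if i + 1 < len(e) else i - 1    # its neighbor
            o[j] += o[i]; e[j] += e[i]                # pool observed and expected counts
            del o[i], e[i]
        return np.array(o), np.array(e)

    obs = [1, 2, 4, 7, 6, 4, 1]
    exp = [0.008, 0.339, 3.619, 10.319, 8.620, 1.971, 0.122]
    print(combine_bins(obs, exp))    # -> [14, 11] and approximately [14.286, 10.712]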

Note that in both of these computations, the Oi corresponding to the combined Ei cells also needed to be combined. The estimated Chi-Square statistic is computed to be χ2 = Σ((Oi – Ei)2/Ei) = 0.0134.

4. Evaluate the results. Based on a comparison of the computed result with the Critical Value, make a conclusion about the test: either reject Ho and accept H1, or fail to reject Ho. As previously mentioned, the Chi-Square distribution has associated with it a parameter called the "degrees of freedom," denoted df. For this kind of Goodness of Fit test, the df are computed by counting up the number of cells actually used in the statistic, subtracting one from that total, and then subtracting the number of population parameters that needed to be estimated from the data. For this problem, df = 1 because although the data was put into seven cells, when the cells were combined to satisfy the 5-in-each-cell rule, only two cells remained, and no population parameters were estimated. The df and α are the two values you need to look up the critical value for a Chi-Square Goodness of Fit test. For this problem we chose α = 0.05 and we have 1 df, so the critical value is 3.841, which we looked up in a table of critical values for the Chi-Square distribution. These tables are contained in most standard statistical textbooks.

5. Make a decision. Since our computed Chi-Square value of 0.0134 is less than the critical value of 3.841, there is not enough evidence from the sample data to refute the assertion (hypothesis) that the data came from a N(100, 5). Therefore, fail to reject Ho. Figure 4 depicts this result.

Figure 4: Critical Area for the Chi-Square Test (the computed statistic of 0.013 lies far below the critical value of 3.841; the rejection region is the 5% area in the right tail)
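The entire worked example can be verified in a few lines of Python. This is a sketch under the same simplification the hand calculation effectively makes: the two combined cells split the real line at 100.9, with all probability below that bound assigned to the first cell and the remainder to the second.

    import numpy as np
    from scipy.stats import norm, chi2

    observed = np.array([14, 11])                    # combined bin counts from Table 3
    p_low = norm.cdf((100.9 - 100) / 5)              # P(X <= 100.9) under N(100, 5): ~0.5714
    expected = 25 * np.array([p_low, 1 - p_low])     # ~[14.29, 10.71]

    chi2_stat = np.sum((observed - expected) ** 2 / expected)   # ~0.013
    critical = chi2.ppf(0.95, df=1)                  # 3.841 for alpha = 0.05, 1 df
    print(chi2_stat, critical, chi2_stat > critical) # fail to reject H0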

KOLMOGOROV-SMIRNOV TEST EXAMPLE

Table 4 below contains the same data that was used for the Chi-Square example. Note that this table does not include any Chi-Square calculations, but it does include the computation of relative frequencies and their differences.

Table 4: K-S Test Computational Spreadsheet

LOWER   UPPER   FREQ   RELATIVE    EXPECTED    RELATIVE
BOUND   BOUND   (Oi)   FREQUENCY   FREQUENCY   FREQUENCY
                       (Oi)        (Ei)        (Ei)         FO       FE       |D|
77      82.9    1      .04         .008        .0003        .0400    .0003    .0397
83      88.9    2      .08         .339        .0136        .1200    .0139    .1061
89      94.9    4      .16         3.619       .1448        .2800    .1587    .1213
95      100.9   7      .28         10.319      .4128        .5600    .5715    .0115
101     106.9   6      .24         8.620       .3448        .8000    .9163    .1163
107     112.9   4      .16         1.971       .0789        .9600    .9951    .0351
113     118.9   1      .04         .122        .0049        1.0000   1.0000   .0000
                25                 25

Note that the largest value in the |D| column (0.1213) is the maximum absolute difference between the observed cumulative relative frequencies and the expected cumulative relative frequencies.

Staying consistent with the Chi-Square example, we will use significance level 0.05 for this test also. Critical values for D are found in most statistics texts. For a sample size of 25, the critical value with significance level 0.05 is 0.23768.

The maximum |D| from the table above is 0.1213. Since 0.1213 is less than 0.23768, we fail to reject the hypothesis that this data came from a N(100, 5) distribution. This conclusion is consistent with the results under the Chi-Square test.
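A sketch reproducing the table's cumulative comparison in Python follows. Because it evaluates the theoretical CDF exactly at the upper bin bounds, while Table 4 accumulates the cell areas of Table 3 (which leave small gaps between bins, e.g., 82.9 to 83), the intermediate numbers differ slightly (maximum |D| of about 0.126 rather than 0.1213), but the conclusion is the same.

    import numpy as np
    from scipy.stats import norm

    upper = np.array([82.9, 88.9, 94.9, 100.9, 106.9, 112.9, 118.9])  # bin upper bounds
    obs = np.array([1, 2, 4, 7, 6, 4, 1])                             # observed frequencies

    fo = np.cumsum(obs) / obs.sum()              # observed cumulative relative frequency
    fe = norm.cdf(upper, loc=100, scale=5)       # theoretical cumulative relative frequency
    d = np.abs(fo - fe).max()                    # KS statistic D
    print(d, d < 0.23768)                        # below the critical value: fail to reject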

ANDERSON-DARLING TEST EXAMPLE

Table 5 below contains the same data that was used for the Chi-Square and the K-S examples. Unlike the Chi-Square test, the Anderson-Darling Test does not need the data to be binned, so the example which follows shows how to do the Anderson-Darling Test on raw data. The Table below summarizes the computational results. The calculations associated with each column are presented in subsequent paragraphs. Recall that the Null Hypothesis is N(100, 5).


Table 5: Anderson-Darling Test Computational Spreadsheet

i    WEIGHTS (lbs.)   NORMAL   P        1-P      LN P      LN(1-P)   (2i-1)*(ln(Pi)+ln(1-Pn+1-i))
1    79.5             -2.410   0.0080   0.9920   -4.8314   -0.0080   -8.0603
2    85.1             -1.734   0.0414   0.9586   -3.1839   -0.0423   -17.6734
3    88.4             -1.341   0.0900   0.9100   -2.4077   -0.0943   -21.9410
4    89.8             -1.170   0.1210   0.8790   -2.1118   -0.1290   -28.3430
5    93.2             -0.755   0.2252   0.7748   -1.4907   -0.2552   -29.3216
6    93.6             -0.707   0.2397   0.7603   -1.4282   -0.2741   -32.0734
7    94.8             -0.559   0.2882   0.7118   -1.2440   -0.3400   -32.7082
8    95.8             -0.440   0.3298   0.6702   -1.1092   -0.4002   -35.6838
9    96.4             -0.369   0.3559   0.6441   -1.0330   -0.4400   -37.1431
10   98.7             -0.084   0.4666   0.5334   -0.7622   -0.6285   -34.4787
11   98.7             -0.084   0.4666   0.5334   -0.7622   -0.6285   -36.2837
12   99.4             -0.006   0.4975   0.5025   -0.6982   -0.6881   -34.8382
13   100.0            0.070    0.5278   0.4722   -0.6390   -0.7504   -34.7350
14   100.6            0.146    0.5580   0.4420   -0.5834   -0.8165   -34.3296
15   101.9            0.303    0.6192   0.3808   -0.4793   -0.9656   -32.1264
16   102.6            0.388    0.6509   0.3491   -0.4294   -1.0524   -32.7952
17   103.4            0.479    0.6840   0.3160   -0.3799   -1.1519   -27.0539
18   104.2            0.580    0.7191   0.2809   -0.3298   -1.2697   -25.5493
19   104.2            0.582    0.7197   0.2803   -0.3289   -1.2721   -24.7483
20   105.6            0.752    0.7741   0.2259   -0.2561   -1.4876   -20.6769
21   107.3            0.951    0.8292   0.1708   -0.1873   -1.7673   -18.1411
22   108.2            1.062    0.8559   0.1441   -0.1556   -1.9372   -12.2380
23   108.4            1.089    0.8620   0.1380   -0.1485   -1.9805   -10.9274
24   111.8            1.501    0.9333   0.0667   -0.0690   -2.7073   -5.2337
25   113.9            1.755    0.9604   0.0396   -0.0404   -3.2290   -2.3722
                                                           SUM       -629.48

The column labeled i simply contains the rank order of each data point. The column labeled WEIGHTS (lbs.) contains the raw data. In order to perform the Anderson-Darling test, it is necessary to compute the average and standard deviation of the raw data; these are shown below. (Because the mean and standard deviation are estimated from the sample rather than taken from the hypothesized N(100, 5), this version of the test is effectively checking normality with estimated parameters; the A* adjustment and the critical values in Table 6 are the published values for that case.)

Average = 99.424

Standard Deviation = 8.247

The column labeled NORMAL is the Standard Normal Deviate (Z) computed as follows:

Z = (Data Point – Average)/Standard Deviation

So the value of -2.410 in the first row results from:

Z = (79.5 – 99.424)/8.247 = -2.41

All the other NORMAL values are computed in the same manner.

The column labeled P is the cumulative area in the Standard Normal Distribution that is associated with the Z value.

The column labeled 1-P is self-explanatory.

The columns labeled LN P and LN (1-P) represent the Natural Logarithms of the P and the 1-P values.

The column labeled (2i-1)*(ln(Pi)+ln(1-Pn+1-i)) represents the computation as indicated. Note that the second logarithm in each row is taken from the opposite end of the ordered sample: for row i it uses the 1-P value of observation n+1-i. For example, the value of -8.0603 is derived as follows: (2(1)-1)*(-4.8314 + (-3.2290)) = -8.0603. Likewise, the value of -17.6734 is derived as follows: (2(2)-1)*(-3.1839 + (-2.7073)) = -17.6734. All subsequent values are derived in the same manner.

Once all of the (2i-1)*(ln(Pi)+ln(1-Pn+1-i)) values are computed, that column is summed, resulting in the -629.48 shown in the Table.

The next step is to compute a value designated as A2 as follows: A2 = (-Sum/n)-n

Where Sum is the sum of the (2i-1)*(ln(Pi)+ln(1-Pn+1-i)) column and n is the sample size.

For this example, A2 is computed as follows: A2 = (-(-629.48)/25) – 25 = 0.1792. Finally, the adjusted test statistic, designated as A*, is computed as follows: A* = A2 (1.0 + 0.75/n + 2.25/n2). For this example: A* = 0.1792(1.0 + 0.75/25 + 2.25/625) = 0.185. This is the value compared against the critical value. The critical value for the Anderson-Darling Test depends on the distribution being tested under the Null Hypothesis. For testing the Normal Distribution, the critical values for a range of significance levels are shown in Table 6.


Table 6: Critical Values for the A-D Test When Testing for a Normal Distribution

Significance Level   .005    .01     .025    .05     .10
Critical Value       1.159   1.035   0.873   0.752   0.631
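The entire Table 5 calculation condenses to a few lines of Python; a sketch, standardizing with the sample mean and standard deviation just as the spreadsheet does:

    import numpy as np
    from scipy.stats import norm

    w = np.sort(np.array([79.5, 85.1, 88.4, 89.8, 93.2, 93.6, 94.8, 95.8, 96.4,
                          98.7, 98.7, 99.4, 100.0, 100.6, 101.9, 102.6, 103.4,
                          104.2, 104.2, 105.6, 107.3, 108.2, 108.4, 111.8, 113.9]))
    n = len(w)
    p = norm.cdf((w - w.mean()) / w.std(ddof=1))   # the P column (Z from sample stats)
    i = np.arange(1, n + 1)
    s = np.sum((2 * i - 1) * (np.log(p) + np.log(1 - p[::-1])))  # summed column: ~-629.5
    a2 = -s / n - n                                # A2: ~0.179
    a_star = a2 * (1 + 0.75 / n + 2.25 / n ** 2)   # adjusted statistic A*: ~0.185
    print(a2, a_star, a_star < 0.752)              # below the critical value: fail to reject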

Since the adjusted test statistic of 0.185 is less than the tabled critical value of 0.752, there is not enough evidence to reject the Null Hypothesis of N(100, 5). This conclusion is consistent with the findings under the Chi-Square Test and the K-S Test.

SUMMARY

• Goodness of Fit (GOF) tests provide guidance for evaluating the suitability of a potential model.
• There is no single correct distribution choice from GOF testing – don't be locked into the numbers of the test results.
• GOF tests do not provide a true probability measure for the data actually coming from the fitted distribution – they provide a probability that random data from the fitted distribution would have produced a GOF statistic value as low as that calculated for the observed data.
• The most intuitive measure is a visual comparison of the probability distributions.
• Chi-Square:
  o Used for continuous and discrete data.
  o The test is sensitive to the choice of bins.
  o There is no optimal choice for bin width (it is distribution-dependent).
  o Not valid for small sample sizes (one rule of thumb states that N > 50 is needed).
  o For bins with expected frequencies less than 5, bins may need to be combined.
  o It is sensitive to large errors (it uses a sum of squared errors).
  o Most commonly used GOF test.
• Kolmogorov-Smirnov (K-S):
  o Used with continuous distributions.
  o Tends to be more sensitive near the center of the distribution than at the tails.
  o Avoids the problem of determining bins – some consider it more useful than Chi-Square.
  o Its value is determined solely by the largest distance between the observed and fitted distributions, so it does not take into account lack of fit across the rest of the distribution.
• Anderson-Darling (A-D):
  o A sophisticated version of the K-S test.
  o Used with continuous distributions.
  o Critical values have been computed for only a few specific distributions.
  o Gives more weight to the tails than the K-S test.
  o Vertical distances are integrated over all values of X to make maximum use of the observed data.
  o Generally more useful than K-S, especially when equal emphasis on body and tails is desired.
