
Simple Hypothesis Testing; Student t-test and Analysis of Variance; Chi-Squared Test; Contingency Tables

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 5: Hypothesis Testing

M. Vidyasagar

Cecil & Ida Green Chair The University of Texas at Dallas Email: [email protected]

October 11, 2014

M. Vidyasagar, Modeling Dependencies

Outline

1 Simple Hypothesis Testing: Accepting or Rejecting a Null Hypothesis; Choosing Among Multiple Hypotheses

2 Student t-test and Analysis of Variance: Student t-Test; Analysis of Variance (ANOVA)

3 Chi-Squared Test

4 Contingency Tables: Sensitivity, Specificity, Accuracy; Chi-Squared Approximations; Fisher’s Exact Test

Overview

One of the major applications of statistical methods is to test various hypotheses, and choose the most likely one. This lecture is devoted to several such applications. Hypotheses can involve either categorical variables or numerical variables. Example of categorical variables: The percentage of left-handed men is roughly the same as that of left-handed women. Here both sex and handedness are categorical variables. Example of a numerical variable: On average men are taller than women. Here sex is categorical, but height (the quantity being compared) is numerical.


The Notion of a Null Hypothesis

In hypothesis testing, one begins with a “null” hypothesis, which is defined as what one believes in the absence of any information. For instance, given a coin with two faces, we believe the coin is fair. Given a six-sided die, we believe that all six outcomes are equally likely. And so on. As evidence accumulates, we may be able to reject the null hypothesis. Or else we may be unable to reject it, and thus are forced to accept it.

Coin-Tossing Example

We are given a two-faced coin, and the null hypothesis is that it is fair. 100 tosses of the coin produce 60 heads. Question: Do we accept or reject the null hypothesis? Answer: We compute the likelihood, under the null hypothesis, of the observed outcome or a more extreme one. The matlab command T = binopdf(k,n,p) with n = 100, k = 60, p = 0.5 gives the likelihood of getting exactly 60 heads in 100 fair coin tosses. But it is more reasonable to compute T = 1 - binocdf(k-1,n,p), which is the likelihood of getting at least 60 heads in 100 fair coin tosses. This returns the answer T = 0.0284. So we are 97.16% sure that the null hypothesis is false.
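The same one-sided binomial tail that 1 - binocdf(k-1,n,p) computes can be reproduced without MATLAB; the following is a small pure-Python sketch using only the standard library.

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more heads."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Likelihood of at least 60 heads in 100 tosses of a fair coin.
p_value = binom_tail(60, 100, 0.5)
print(round(p_value, 4))  # 0.0284
```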

Coin-Tossing Example (Cont’d)

Usually we set a threshold below which we will reject the null hypothesis. The value of 0.05 is widely used. So in this case we would say that “the null hypothesis is rejected at a 95% level,” or in plain English that “we are 95% sure that the null hypothesis is false.” This statement says only that the likelihood of the actual outcome (or a more extreme one) is less than 0.05 – it doesn’t say what the likelihood is. To eliminate the ambiguity, the phrase “P-value” (or p-value) is used. For this outcome the P-value is 0.0284, which is lower than the threshold we set of 0.05, so the null hypothesis is rejected at a 95% confidence level.

Coin-Tossing Example (Cont’d)

Low P-values lead to the null hypothesis being rejected; but what about high P-values? Suppose the null hypothesis is that the coin is fair, and that 100 tosses result in 47 heads. This time we compute the likelihood of getting 47 or fewer heads. The command binocdf(k,n,p) with n = 100, k = 47, p = 0.5 returns the value 0.3086. So we are about 30% sure that the null hypothesis is not consistent with the data. But we cannot reject the null hypothesis because this value is above the usual threshold of 0.05.


Null vs. Alternate Hypotheses

A major change in hypothesis testing took place with the introduction of “alternate hypotheses.” Instead of having just one hypothesis, namely the null, we have a variety of hypotheses from which to choose. Approach: Compute the likelihood of the data under each of the competing hypotheses, and choose the one with the highest (log) likelihood.

Coin-Tossing Example (Cont’d)

Suppose we have two competing hypotheses:

H0: The coin is fair (the null hypothesis).

H1: The probability of heads is 0.7. An experiment of 100 tosses produces 60 heads. Which hypothesis do we choose?

Coin-Tossing Example (Cont’d)

The likelihood of the data under H0 is computed as 1 - binocdf(k-1,n,p) with n = 100, k = 60, p = 0.5. This gives L0 = 0.0284.

The likelihood of the data under H1 is computed as binocdf(k,n,p) with n = 100, k = 60, p = 0.7. This gives L1 = 0.0210.

So we reject H1 and accept H0. Note however that each hypothesis would have been rejected, had it been the only (null) hypothesis!

Maximum Likelihood Estimation Revisited

The previous example shows the potential pitfalls of being “forced” to choose the “least unlikely” hypothesis from a set of hypotheses, all of them unlikely by themselves. Maximum likelihood estimation gets around this difficulty by finding the best possible fit to the data from the given family of models. If n = 100 coin tosses result in k = 60 heads, then the maximum likelihood estimate of the probability of heads is p̂ = 60/100 = 0.6. To put this in perspective, this choice maximizes the likelihood binopdf(k,n,p) (not binocdf(k,n,p)) with respect to p, given that n = 100, k = 60.
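As a quick numerical sanity check (a sketch, not part of the lecture), one can evaluate the binomial likelihood binopdf(k,n,p) over a grid of candidate values of p and confirm that it peaks at p̂ = k/n:

```python
from math import comb

n, k = 100, 60

def likelihood(p):
    """binopdf(k, n, p): probability of exactly k heads in n tosses."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Evaluate over a grid of candidate head-probabilities and take the argmax.
grid = [i / 100 for i in range(1, 100)]
p_hat = max(grid, key=likelihood)
print(p_hat)  # 0.6
```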


Student t-test

The “Student” t-test refers to a hypothesis-testing procedure first proposed by William Sealy Gosset under the pseudonym “Student.” It tests whether or not there is a statistically significant difference between the means of two samples.

Motivating Question

Suppose that we order parts from two suppliers, call them A and B. The average time taken by Supplier A is 14 weeks, while the average time taken by Supplier B is 13 weeks. How confidently can we assert that Supplier B takes less time than Supplier A?

Motivating Question (Cont’d)

Obviously, the confidence depends on three quantities:

The difference between the two means: the larger it is, the more sure we are that the difference is significant.
The size of the two samples: if the two averages are computed over 10 orders each, we would be less sure than if the average for A were computed over 50 orders and that for B over 40 orders.
The “within sample” variations of the supply times for A and B: if these variations are themselves large, then the difference between the means is not so significant. The next slide illustrates this.

Influence of In-Sample Variance on Significance

[Figure: two panels comparing the probability density functions of delivery times for Suppliers A and B – one with large in-sample variances, one with small in-sample variances; x-axis: time to delivery in weeks (10 to 17).]

If the variances are large, then the two distributions overlap a lot, and the difference is less significant than when the variances are small.

Theory of the t-Test

Suppose we have two sets of samples x_1, …, x_n of a random variable X, and y_1, …, y_m of another random variable Y. We wish to know whether there is a statistically significant difference between the values of the two sample sets. Define the two sample means:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{m}\sum_{i=1}^{m} y_i.$$

As we know, these are unbiased estimates of the true but unknown mean values of X and of Y respectively.

Theory of the t-Test (Cont’d)

Next, define the sample variance estimates

$$S_X^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad S_Y^2 = \frac{1}{m-1}\sum_{i=1}^{m}(y_i - \bar{y})^2.$$

Again, these are unbiased estimates of the true but unknown variances of X and Y respectively. The next step is to “pool” the two variance estimates to arrive at an estimate for the overall set of samples:

$$S_P^2 = \frac{n-1}{n+m-2}\,S_X^2 + \frac{m-1}{n+m-2}\,S_Y^2.$$

So $S_P^2$ is just a weighted average of the two individual unbiased estimates.

Theory of the t-Test (Cont’d)

The so-called t-statistic is defined by

$$d_t = \frac{\bar{x} - \bar{y}}{S_P\sqrt{(1/n) + (1/m)}}.$$

This quantity satisfies the t-distribution with dof = n + m − 2 degrees of freedom. There are complicated but closed-form formulas for the t-distribution. But the matlab command tcdf(d_t,dof) allows us to compute the P-value Pr{M ≤ d_t} if the t-statistic d_t is negative, while 1 - tcdf(d_t,dof) is used if d_t is positive.
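The formulas above translate directly into code. The following pure-Python sketch computes the pooled t-statistic d_t and its degrees of freedom from two raw sample sets (the tiny data sets here are made up purely for illustration):

```python
from math import sqrt

def t_statistic(xs, ys):
    """Pooled two-sample t-statistic d_t and degrees of freedom n + m - 2."""
    n, m = len(xs), len(ys)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / m
    # Unbiased sample variances S_X^2 and S_Y^2.
    s2_x = sum((x - x_bar)**2 for x in xs) / (n - 1)
    s2_y = sum((y - y_bar)**2 for y in ys) / (m - 1)
    # Pooled variance: weighted average of the two unbiased estimates.
    s2_p = ((n - 1) * s2_x + (m - 1) * s2_y) / (n + m - 2)
    d_t = (x_bar - y_bar) / (sqrt(s2_p) * sqrt(1/n + 1/m))
    return d_t, n + m - 2

d_t, dof = t_statistic([1, 2, 3], [4, 5, 6])
print(round(d_t, 4), dof)  # -3.6742 4
```

The P-value would then come from the t-distribution CDF with 4 degrees of freedom (e.g. tcdf in MATLAB).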

Theory of the t-Test (Cont’d)

The t-test is used when the number of samples is very small. If the number of samples is large, then the t-distribution looks like a normal distribution. This is shown in the next slides. In practice, once the number of samples exceeds 50 or so, the normal distribution can be used.

Probability Density of the t-Statistic

[Figure: probability density function of the t-statistic for dof = 5, 10, 20, 50, together with the normal density; as dof increases, the t-density approaches the normal density.]

Example of Applying the t-Test

U.S. government data shows that the average income of Hispanics between 35 and 39 years of age is $54,651, with a standard deviation of $1,714, based on 1,842 samples. The corresponding figures for black persons are an average of $55,900 and a standard deviation of $5,888, based on 1,104 samples. The objective is to determine whether this difference is statistically significant, using the t-test. Source: http://www.census.gov/hhes/www/cpstables/032012/hhinc/hinc02_000.htm

Example of Applying the t-Test (Cont’d)

With the information provided, the t-statistic turns out to be dt = 8.5224. Because the number of samples is so large, the t-distribution can be taken to be the normal distribution. So the P -value is the probability of a normally distributed random variable exceeding 8.5 times its standard deviation, which is essentially zero. Therefore, though the difference in means is very small (just $1,249), it is clinchingly significant because of the large number of samples.
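The slide’s t-statistic can be reproduced from the published summary statistics alone (means, standard deviations, and sample sizes), using the pooled-variance formula from earlier in the lecture; this is a sketch, not the lecture’s own code:

```python
from math import sqrt

def t_from_summary(x_bar, s_x, n, y_bar, s_y, m):
    """Pooled t-statistic computed from summary statistics rather than raw data."""
    s2_p = ((n - 1) * s_x**2 + (m - 1) * s_y**2) / (n + m - 2)
    return (x_bar - y_bar) / (sqrt(s2_p) * sqrt(1/n + 1/m))

# Hispanic vs. black average incomes, ages 35-39 (figures from the slide).
d_t = t_from_summary(54651, 1714, 1842, 55900, 5888, 1104)
print(round(abs(d_t), 4))  # 8.5224
```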

Example of Applying the t-Test (Cont’d)

For illustrative purposes only, suppose that the number of samples was fifty times smaller. To be specific, suppose that the averages and standard deviations are the same, but based on 37 Hispanic samples and 22 black samples.

The resulting t-statistic is dt = 1.2130. Because the number of degrees of freedom is larger than 50, we can again use the normal approximation. If X is a normally distributed random variable, then

Pr{X > 1.2130} ≈ 0.1126.

So the P -value is about 11%, which is higher than the threshold. So we cannot accept that the difference in average incomes is statistically significant.

Example of Applying the t-Test (Cont’d)

U.S. government data shows that the average income for Hispanic persons between the ages of 25 to 29 years is $47,274, with a sample standard deviation of $1,509. These figures are determined using 1,621 samples. The corresponding figures for black persons are an average of $40,791 with a standard deviation of $2,101, based on 866 samples. The objective is to determine whether this difference is statistically significant, using the t-test. Given that the difference between the means is larger, while the numbers of samples and the standard deviations are comparable, the P-value will be much smaller, meaning that the difference in means is even more statistically significant.

Welch’s t-Test

The “Student” t-test is based on the assumption that the two sets of samples have the same variance. In the two examples above this assumption is not satisfied. For the general case where the two sets of samples have unequal variances, it is better to use the Welch t-test. It is not covered here.


ANOVA: Motivation

The t-test is used to test the null hypothesis that the means of two sets of samples are the same. Analysis of Variance (ANOVA) is used if there are k ≥ 3 sets of samples, and the null hypothesis is that all k means are equal to each other. This is preferable to testing all k(k − 1)/2 pairs of means to see if they are equal to each other.

ANOVA: Theory

Given k sets of samples (not necessarily the same number of samples within each set), say

$$x_{ij}, \quad j = 1, \ldots, n_i, \quad i = 1, \ldots, k.$$

We can compute the k different sample means

$$\bar{x}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij}, \quad i = 1, \ldots, k.$$

Define the overall sample mean (where n = n_1 + · · · + n_k)

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{k} n_i \bar{x}_i = \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}.$$

The null hypothesis is that

$$\bar{x}_1 = \bar{x}_2 = \cdots = \bar{x}_k.$$

ANOVA: Theory (Cont’d)

To test this hypothesis, two sets of variances are computed. The first is called the between-groups variance, denoted by F_B, and defined as

$$F_B = \frac{1}{k-1}\sum_{i=1}^{k}(\bar{x}_i - \bar{y})^2,$$

which corresponds to treating the k means as k “samples.” The second is called the within-groups variance, denoted by F_W, and defined as

$$F_W = \frac{1}{n-k}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2.$$
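A direct transcription of the two formulas above, as a pure-Python sketch with made-up data. (Note that the between-groups formula here follows the slide; many textbooks additionally weight each term (x̄_i − ȳ)² by the group size n_i.)

```python
def anova_statistic(groups):
    """Between-groups variance F_B, within-groups variance F_W,
    and the ratio d_F = F_B / F_W, as defined on the slide."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    y_bar = sum(sum(g) for g in groups) / n
    f_b = sum((m - y_bar)**2 for m in means) / (k - 1)
    f_w = sum(sum((x - m)**2 for x in g)
              for g, m in zip(groups, means)) / (n - k)
    return f_b, f_w, f_b / f_w

# Three hypothetical groups of three samples each.
f_b, f_w, d_f = anova_statistic([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(f_b, f_w, d_f)  # 1.0 1.0 1.0
```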

ANOVA: Application

The ANOVA test statistic is denoted by d_F and is defined by

$$d_F = \frac{F_B}{F_W}.$$

This test statistic satisfies the F-distribution with degrees of freedom k − 1 and n − k.

Once the ANOVA test statistic d_F is computed, the significance is determined via the matlab command t_F = 1 - fcdf(d_F,k-1,n-k). If this value is sufficiently small, then the null hypothesis can be rejected.

ANOVA: Example

Suppose four models of cars are test-driven by five drivers each. We would like to test the hypothesis that, even though different drivers might get different mileage on each car, the average mileage of each car is the same. Suppose the data is as given next, with the rows denoting the drivers, and the columns the cars. So k = 4 (four sets of samples, one for each model of car), and ni = 5 for all i (five samples for each model of car).

ANOVA: Example (Cont’d)

Suppose the matrix of sample mileages is

 27.4000 24.8000 26.6000 30.2000   22.5000 25.6000 26.1000 27.2000    A =  25.2000 24.8000 26.8000 27.2000  .    28.1000 26.5000 27.3000 26.7000  24.1000 24.7000 26.1000 25.9000

The average mileage of each car, across the five drivers, is

$$\bar{x} = [\,25.46 \quad 25.28 \quad 26.58 \quad 27.44\,].$$

ANOVA: Example (Cont’d)

The between-groups variance is just the variance of the vector x¯, while the within-groups variance requires greater computation.

$$F_B = 1.0252, \qquad F_W = 0.1482, \qquad d_F = \frac{F_B}{F_W} = 6.9165.$$

Using the command F_test = 1 - fcdf(F_stat,3,16) gives a P-value of 0.0034. So we can reject the null hypothesis (namely, that all cars have the same average mileage) with greater than 99.5% confidence.


Degrees of Freedom

One can think of the chi-squared test (also written as χ²-test) as a likelihood test when the number of outcomes is more than two. Recall: When there are only two outcomes, say heads and tails, if p is the probability of heads, then 1 − p is the probability of tails. So there is only one degree of freedom (one number that can be adjusted).

Suppose we have a six-sided die, with probabilities p_1 through p_6. Then in reality there are only five degrees of freedom, because specifying any five out of the six probabilities specifies the sixth.

Motivating Example

Suppose we are given a six-sided die. The null hypothesis is that the die is fair, i.e., p_i = 1/6 for all i. Suppose we roll the die 150 times. We would “expect” to see each outcome 25 times. But we would tolerate some deviation from these numbers due to the random nature of the experiment. Question: When can we say that the deviation from the expected numbers cannot be attributed to chance, and therefore the null hypothesis is false? In particular, suppose we get the following outcomes

$$M = \begin{bmatrix} 20 & 29 & 32 & 22 & 21 & 26 \\ 25 & 25 & 25 & 25 & 25 & 25 \end{bmatrix}.$$

How sure are we that the die is fair (or not)?

Theory of the Chi-Squared Test

Suppose there are k possible outcomes. Let E_i denote the expected number of instances of outcome i, and A_i the number actually observed. Define the χ²-test statistic as

$$d_{\chi^2} = \sum_{i=1}^{k} \frac{(A_i - E_i)^2}{E_i}.$$

This test statistic satisfies the χ²-distribution for k − 1 degrees of freedom. The complementary cdf (or P-value) of the χ²-distribution for various degrees of freedom is shown in the next slide.

Complementary CDF of the χ²-Distribution

[Figure: complementary CDF (P-value) of the χ²-distribution for dof = 1, 2, 3, 4, 5, 10, plotted over 0 ≤ x ≤ 8.]

Motivating Example (Cont’d)

Recall that

$$M = \begin{bmatrix} 20 & 29 & 32 & 22 & 21 & 26 \\ 25 & 25 & 25 & 25 & 25 & 25 \end{bmatrix}.$$

The first row consists of the actual outcomes A_i, while the second row consists of the expected outcomes E_i. The utility chi_cont_nby2.m can be used to perform these computations automatically. Moreover, the expected numbers need not all be the same.

Motivating Example (Cont’d)

The test statistic is computed as

$$d_{\chi^2} = \sum_{i=1}^{6} \frac{(A_i - E_i)^2}{E_i} = \frac{116}{25} = 4.64.$$

The matlab command P_chi_sq = 1 - chi2cdf(d_chi_sq,5) results in a P-value of 0.4614. This means that there is a 46% chance that these outcomes were produced by a fair die, and 54% that they were not.
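The χ²-statistic itself is a one-liner to compute. This pure-Python sketch reproduces the value for the die-rolling data (the P-value step still needs the χ²-CDF, e.g. MATLAB’s chi2cdf or SciPy’s chi2.cdf):

```python
def chi_sq_statistic(actual, expected):
    """Sum of (A_i - E_i)^2 / E_i over all outcomes."""
    return sum((a - e)**2 / e for a, e in zip(actual, expected))

actual = [20, 29, 32, 22, 21, 26]   # observed counts in 150 rolls
expected = [25] * 6                 # expected counts under a fair die
d = chi_sq_statistic(actual, expected)
print(round(d, 2))  # 4.64
```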

Motivating Example (Cont’d)

Now suppose the actual outcomes are as shown below:

$$M = \begin{bmatrix} 11 & 34 & 39 & 13 & 17 & 36 \\ 25 & 25 & 25 & 25 & 25 & 25 \end{bmatrix}.$$

The test statistic is computed as

$$d_{\chi^2} = \sum_{i=1}^{6} \frac{(A_i - E_i)^2}{E_i} = \frac{802}{25} = 32.08.$$

The matlab command P_chi_sq = 1 - chi2cdf(d_chi_sq,5) returns a P-value of 5.7284 × 10⁻⁶. This means that the probability that the outcomes were generated by a fair die is less than 0.00001. So we can reject the null hypothesis (that the die is fair) at a 99.999% confidence level.

Another Example

Suppose that an experiment has eight possible outcomes, and that the actual and expected outcomes are as shown below.

$$M = \begin{bmatrix} 14 & 11 & 18 & 12 & 16 & 10 & 22 & 21 \\ 20 & 17 & 15 & 17 & 10 & 14 & 18 & 13 \end{bmatrix}.$$

The command [t, dof, stat] = chi_cont_nby2(M) returns the values t = 0.0206, dof = 7, stat = 16.5431. Therefore the χ²-statistic is 16.5431, and the P-value is 0.0206. Therefore we are roughly 98% sure that the actual outcomes were not produced by a random variable having the distribution

$$\phi = \frac{1}{124}\,[\,20 \quad 17 \quad 15 \quad 17 \quad 10 \quad 14 \quad 18 \quad 13\,].$$


Contingency Tables: Background

The book The Lady Tasting Tea by David Salsburg describes an interesting experiment whereby a lady at a party claimed that she could tell whether tea was added to milk, or milk was added to tea. Prof. Ronald Fisher, one of the pioneers of modern statistics, performed an experiment whereby he created four cups with the milk first and four cups with the tea first. The lady in question classified all eight cups correctly. The question is: What is the likelihood of getting this outcome purely by chance?
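For this particular design the chance answer can be computed directly (a sketch anticipating Fisher's exact test, discussed later). Since the lady knows four cups are of each kind, a pure guesser picks which 4 of the 8 cups are "milk first" uniformly at random, and only one of the C(8,4) = 70 possible choices is fully correct:

```python
from math import comb

# Number of ways to choose which 4 of the 8 cups had milk added first.
total_choices = comb(8, 4)
# Exactly one of those choices matches the true assignment.
p_all_correct = 1 / total_choices
print(total_choices, round(p_all_correct, 4))  # 70 0.0143
```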

Contingency Tables: Background (Cont’d)

This is an example of a two-class classification problem. We have also seen such problems in connection with Bayes’ rule. Other examples are security screening at an airport, detecting faulty parts, etc. In each of these cases, there are two actual classes (milk added to tea or not, has HIV or not, carrying a weapon or not), and two sets of labels.

Contingency Tables: Background (Cont’d)

But we can also think of problems where there are three or more possible classes, and/or three or more labels. Note also that the number of labels need not be the same as the number of classes. For instance, patients in a clinical trial can be given one of four drugs plus a placebo, and their response can be quantized into three classes. The objective is to compute the likelihood that the classification was arrived at purely by chance. For two-class problems, more elaborate analysis is possible.

Contingency Tables vs. Hypothesis Testing

It is important to understand the difference between hypothesis testing and contingency tables. In the tea-tasting experiment, Prof. Fisher generated four samples each of tea into milk, and milk into tea. The lady in question predicted in the same ratio. Hypothesis testing addresses the question: Did Prof. Fisher and the lady use the same probability distribution? Contingency tables address the question: How many of the lady’s labels match those of Prof. Fisher, and what is the likelihood that the matches happened purely by chance?


2 × 2 Classification Problems

In 2 × 2 classification problems, both the classes and the labels are usually called “Positive (P)” and “Negative (N)”. The table below shows the usual convention for the classification performance.

Class/Label   Positive              Negative
Positive      TP (True Positive)    FN (False Negative)
Negative      FP (False Positive)   TN (True Negative)

Sensitivity, Specificity, Accuracy

The sensitivity of the classifier is defined as

Se = TP / (TP + FN).

The specificity of the classifier is defined as

Sp = TN / (FP + TN).

The accuracy of the classifier is defined as

Ac = (TP + TN) / (TP + TN + FP + FN).

Sensitivity, Specificity, Accuracy (Cont’d)

All three quantities lie in the interval [0, 1]. Moreover, accuracy is a convex combination of sensitivity and specificity. In particular,

Ac = Se · |C_P| / (|C_P| + |C_N|) + Sp · |C_N| / (|C_P| + |C_N|),

where |C_P|, |C_N| denote the number of elements in the positive class and in the negative class, respectively. Therefore

min{Se, Sp} ≤ Ac ≤ max{Se, Sp}.
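These relations are easy to verify numerically. Below is a minimal Python sketch; the counts are illustrative assumptions, not data from the slides.

```python
# Illustrative (assumed) counts for a 2x2 classification outcome
TP, FN, FP, TN = 40, 10, 5, 45
CP = TP + FN            # size of the positive class |C_P|
CN = FP + TN            # size of the negative class |C_N|

Se = TP / (TP + FN)     # sensitivity
Sp = TN / (FP + TN)     # specificity
Ac = (TP + TN) / (TP + TN + FP + FN)   # accuracy

# Accuracy is the class-size-weighted (convex) combination of Se and Sp,
# so it is squeezed between min and max of the two:
assert abs(Ac - (Se * CP + Sp * CN) / (CP + CN)) < 1e-12
assert min(Se, Sp) <= Ac <= max(Se, Sp)
```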

False Discovery Rate

Suppose we are given the table of predictions

Class/Label   Positive              Negative
Positive      TP (True Positive)    FN (False Negative)
Negative      FP (False Positive)   TN (True Negative)

The false discovery rate (FDR) is defined as

FDR = FP / (TP + FP).

It is possible for a classifier to have very high accuracy and yet a very high FDR.

HIV Example Revisited

Earlier we had the following table for an HIV diagnostic test (with all entries multiplied by 10000 to get integers)

Class/Label   P     N
P             98    2
N             99    9801

Then

Se = 0.98, Sp = 0.99, Ac = 0.9899, FDR = 0.5025.

The classifier has fantastic sensitivity, specificity, and accuracy, but a high FDR – greater than 50%!
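The four figures follow directly from the table; a quick Python check:

```python
# Entries of the HIV table above: TP = 98, FN = 2, FP = 99, TN = 9801
TP, FN, FP, TN = 98, 2, 99, 9801

Se = TP / (TP + FN)                     # sensitivity: 0.98
Sp = TN / (FP + TN)                     # specificity: 0.99
Ac = (TP + TN) / (TP + TN + FP + FN)    # accuracy: 0.9899
FDR = FP / (TP + FP)                    # about 0.5025: over half the positives are false
```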


Contingency Tables

Suppose we are given a total of N samples belonging to n different classes, and we are asked to assign one of m labels to each sample. The result is called an n × m contingency table, where the rows denote classes and the columns denote labels (or bins to which the samples are assigned).

Class/Label   Label 1   Label 2   ...   Label m   Class Total
Class 1       a11       a12       ...   a1m       C1
...           ...       ...       ...   ...       ...
Class n       an1       an2       ...   anm       Cn
Label Total   L1        L2        ...   Lm        N

Contingency Tables (Cont’d)

Note: The number of classes need not equal the number of labels. However, since each sample has both a class as well as a label, we must have

Σ_{i=1}^{n} C_i = Σ_{j=1}^{m} L_j = N,

where N is the total number of samples. Question: If we were to take the N samples and assign them labels at random, how likely are we to get this result?

“The Lady Tasting Tea”

Let us use M (milk first) and T (tea first) to denote both the class (what Prof. Fisher actually did) and the label (what the lady predicted). Then the resulting 2 × 2 contingency table is

Class/Label   M   T   Total
M             4   0   4
T             0   4   4
Total         4   4   8

Question: What is the likelihood of the lady getting these results purely by guessing?

Chi-Squared Approximation: Theory

The null hypothesis for this situation is that samples are assigned at random to the cells. So if we define the marginal probabilities

φ_i = (1/N) Σ_{j=1}^{m} a_ij,  i = 1, . . . , n,      ψ_j = (1/N) Σ_{i=1}^{n} a_ij,  j = 1, . . . , m,

then the joint probability distribution would be just the product θ_ij = φ_i ψ_j for all i, j. So, under the null hypothesis, the (i, j)-th cell would contain N φ_i ψ_j elements, i.e., all columns (or rows) would be proportional to each other. The χ²-approximation measures how closely the actual cell values, given by the matrix A, correspond to these expected numbers. Note that the expected numbers need not be integers.

Chi-Squared Approximation: Theory (Cont’d)

The test statistic in this case is determined as follows:

d_C = Σ_{i=1}^{n} Σ_{j=1}^{m} (a_ij − N φ_i ψ_j)² / (N φ_i ψ_j).

This test statistic follows the χ²-distribution with (n − 1)(m − 1) degrees of freedom. The MATLAB command chi_cont (not to be confused with chi_cont_nby2) returns the significance value measuring how far the matrix A is from having all of its columns proportional.

Chi-Squared Approximation: Theory (Cont’d)

Specifically the command

[t, dof, d_C] = chi_cont(A)

returns the significance value t, the number of degrees of freedom (n − 1)(m − 1), and the test statistic d_C. If t is smaller than some desired threshold, such as 0.05, the null hypothesis can be rejected.

Chi-Squared Approximation: Example

Suppose our data has four classes and three labels, as follows:

Class/Label   Label 1   Label 2   Label 3   Total
Class 1       17        7         6         30
Class 2       7         9         4         20
Class 3       5         12        3         20
Class 4       7         8         15        30
Total         36        36        28        100

Chi-Squared Approximation: Example (Cont’d)

Call this 4 × 3 matrix A. The command t = chi_cont(A) returns the value t = 0.0045. So we can be 99.5% certain that the cells were not filled up at random.
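The slides use a MATLAB command; the same computation can be sketched in Python. The function name below is our own, and the closed-form p-value it uses is valid only when the number of degrees of freedom is even, as it is here.

```python
import math

def chi_cont_sketch(A):
    """Chi-squared test of independence for contingency table A.
    Returns (t, dof, d_C). The p-value uses the closed-form chi-squared
    survival function, valid only when dof = (n-1)(m-1) is even."""
    n, m = len(A), len(A[0])
    N = sum(map(sum, A))
    row = [sum(r) for r in A]                                  # class totals C_i
    col = [sum(A[i][j] for i in range(n)) for j in range(m)]   # label totals L_j
    d_C = 0.0
    for i in range(n):
        for j in range(m):
            e = row[i] * col[j] / N       # expected count N * phi_i * psi_j
            d_C += (A[i][j] - e) ** 2 / e
    dof = (n - 1) * (m - 1)
    assert dof % 2 == 0, "closed-form p-value below requires even dof"
    h = d_C / 2
    t = math.exp(-h) * sum(h ** k / math.factorial(k) for k in range(dof // 2))
    return t, dof, d_C

A = [[17, 7, 6], [7, 9, 4], [5, 12, 3], [7, 8, 15]]
t, dof, d_C = chi_cont_sketch(A)
# t ≈ 0.0045 with dof = 6, matching the value on this slide
```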


Fisher’s Exact Test

When the number of samples is rather small (as with the “lady tasting tea”), the χ²-approximation gives overly flattering results; that is, the significance level t is smaller than it ought to be. The Fisher exact test, invoked by the command t = Fisher_2by2(A), returns a more realistic significance value. For the “lady tasting tea,” the significance level is 2^(−8), or 1/256, or about 0.004. So we can be 99.6% sure that she wasn’t guessing!

Fisher’s Exact Test: Example

Suppose the outcome of the experiment had been the following:

Class/Label   M   T   Total
M             6   2   8
T             1   5   6
Total         7   7   14

Then the significance of this contingency table is obtained via the command t = Fisher_2by2(A), where A is the above matrix. The returned value is 0.1026, so we can be roughly 90% sure that the above results were not obtained by chance. In contrast, the command tc = chi_cont(A) returns the value 0.0308, which is too low.
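The two-sided Fisher exact test can likewise be sketched in pure Python. The helper below is hypothetical (the slides use the MATLAB command Fisher_2by2); it sums the hypergeometric probabilities of every table with the observed margins that is no more likely than the observed table.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]."""
    N = a + b + c + d
    r1 = a + b                      # first row total
    c1 = a + c                      # first column total
    def p_table(x):                 # P(top-left cell = x), margins fixed
        return comb(r1, x) * comb(N - r1, c1 - x) / comb(N, c1)
    lo = max(0, c1 - (N - r1))      # feasible range of the top-left cell
    hi = min(r1, c1)
    p_obs = p_table(a)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-9))

# The outcome considered on this slide:
p = fisher_exact_2x2(6, 2, 1, 5)
# p ≈ 0.1026, agreeing with the value quoted above
```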
