HST 190: Introduction to Biostatistics

Lecture 6: Methods for binary data

Binary data
• So far, we have focused on settings where the outcome is continuous.
• Now, we consider the setting where our outcome of interest is binary, meaning it takes values 1 or 0.
§ In particular, we consider the 2x2 contingency table tabulating pairs of binary observations (x₁, y₁), …, (x_n, y_n).
• Consider two populations:
§ IV drug users who report sharing needles
§ IV drug users who do not report sharing needles
• Is the rate of positive tuberculin skin test equal in both populations?
§ To address this question, we sample 40 patients who report sharing and 60 who do not, and compare rates of positive tuberculin tests.
§ The data are cross-classified according to these two binary variables:

2x2 table              Positive   Negative   Total
Report sharing             12         28        40
Don't report sharing       11         49        60
Total                      23         77       100

Chi-square test for contingency tables
• The chi-square test is a test of association between two categorical variables.
• In general, its null and alternative hypotheses are
§ H₀: the relative proportions of individuals in each category of variable #1 are the same across all categories of variable #2; that is, the variables are not associated (i.e., statistically independent).
§ H₁: the variables are associated.
o Notice the alternative is always two-sided.
• In our example, this means
§ H₀: reported needle sharing is not associated with a positive PPD (tuberculin skin) test.
• The chi-square test compares the observed counts in the table to the counts expected if there were no association (i.e., under H₀).
§ Expected counts are obtained using the marginal totals of the table.
• Recall the independence rule P(A ∩ B) = P(A)P(B). So, assuming independence, among 100 people we expect
P(share ∩ positive) = P(share)P(positive) = (40/100)(23/100) = 0.092,
§ i.e., we'd expect 0.092(100) = 9.2 positive sharers, instead of the observed 12.
• Similarly, there will likely be some discrepancy between observed and expected counts for the other three cells of the table.
§ The chi-square test assesses: are these differences too large to be the result of sampling variability?
• Steps of the chi-square test:
1) Complete the observed-data table.
2) Compute the table of expected counts.
3) Calculate the X² statistic.
4) Get the p-value from the chi-square table.
• This method is valid only if all expected counts are ≥ 5.
§ The test relies on an approximation that does not hold in small samples.

In detail:
1) Complete the observed-data table of counts O_ij, with row totals O_i., column totals O_.j, and grand total n.
2) Complete the table of expected counts using
E_ij = (O_i. × O_.j)/n = (O_i1 + O_i2)(O_1j + O_2j)/n
Schematically:

O₁₁   O₁₂   O₁.          E₁₁   E₁₂   E₁.
O₂₁   O₂₂   O₂.          E₂₁   E₂₂   E₂.
O.₁   O.₂   n            E.₁   E.₂   n

3) Calculate the chi-square test statistic
X² = Σ (observed − expected)²/expected
   = (O₁₁ − E₁₁)²/E₁₁ + (O₁₂ − E₁₂)²/E₁₂ + (O₂₁ − E₂₁)²/E₂₁ + (O₂₂ − E₂₂)²/E₂₂
§ Swap (O_ij − E_ij) for (|O_ij − E_ij| − 0.5) for the Yates continuity correction.
4) Get the p-value from the chi-square distribution.
§ Under the null hypothesis H₀ of no association between the two factors, the X² statistic follows a chi-square distribution with 1 degree of freedom, often written X² ~ χ²₁.
o The chi-square distribution is continuous and positive-valued, defined by one parameter, its degrees of freedom (df).
§ The p-value comes from the right tail, but is inherently 'two-sided'.
o MATLAB: 1-chi2cdf(x,1)
§ At α = 0.05 the critical value is χ²_{1,0.95} = 3.84, which cuts off an upper-tail area of 0.05.
• Thus, at the α level, H₀ is rejected if X² > χ²_{1,1−α}.
• Labeling the cells of the 2x2 contingency table a, b (top row) and c, d (bottom row) as below, an alternate formula for the Yates-corrected test statistic is
X² = n(|ad − bc| − n/2)² / [(a+b)(c+d)(a+c)(b+d)]

2x2 table              Positive     Negative     Total
Report sharing         a = 12       b = 28       a+b = 40
Don't report sharing   c = 11       d = 49       c+d = 60
Total                  a+c = 23     b+d = 77     n = 100

• For our data,
X² = 100(|12(49) − 28(11)| − 50)² / [(40)(60)(23)(77)] = 1.24 < 3.84 = χ²_{1,0.95}
• ⇒ Fail to reject H₀.
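A minimal MATLAB sketch of steps 2)–4) for the needle-sharing table, assuming the Statistics Toolbox is available for chi2cdf (the function the slides use):

    % Yates-corrected chi-square test for the 2x2 needle-sharing table
    O  = [12 28; 11 49];                          % observed counts
    n  = sum(O(:));                               % grand total, 100
    E  = sum(O,2) * sum(O,1) / n;                 % expected counts: (row total)(column total)/n
    X2 = sum(sum((abs(O - E) - 0.5).^2 ./ E));    % Yates-corrected statistic, = 1.24
    p  = 1 - chi2cdf(X2, 1);                      % right-tail p-value on 1 df, approx. 0.26

Since 1.24 < 3.84 (equivalently, p > 0.05), this reproduces the fail-to-reject conclusion above.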
Fisher's exact test
• What happens if the expected counts are not all ≥ 5? Instead of the chi-square test, use Fisher's exact test (see Rosner 10.3).
• Like the chi-square test, Fisher's exact test examines the significance of the association (contingency) between the two kinds of classification, rows and columns.
• Both the row and column totals (a+c, b+d, a+b, c+d) are assumed to be fixed, not random.
• We then consider all possible tables that could give the observed row and column totals, and the corresponding probability of each configuration (it helps to realize that the first count, a, has a hypergeometric distribution under the null).
• Finally, the p-value is computed by adding up the probabilities of the tables as extreme as, or more extreme than, the observed one.

Chi-square test for contingency tables, RxC
• What if we are interested in a variable that has more than two categories?
• Example: test for association between eye color and the presence or absence of a mutant allele at some genetic locus.
§ Eye color categories: blue, green, brown, hazel, gray
§ Genetic categories: 0 copies of the mutant allele, ≥ 1 copy of the mutant allele
• The chi-square test can be used for variables with more than two categories, with the data presented in an RxC table, a generalization of the 2x2 table:

                         blue   green   brown   hazel   gray   Total
Mutant allele absent       3      7      21      15      15      61
Mutant allele present      6     10      18      14      17      65
Total                      9     17      39      29      32     126

• R = # rows, C = # columns (it doesn't matter which variable is which).
• The chi-square test for an RxC table is the same as for a 2x2 table, except:
§ The method can only be used if no more than 1/5 of the cells have expected count < 5 AND no cell has expected count < 1.
§ The statistic sums over all R×C cells:
X² = (O₁₁ − E₁₁)²/E₁₁ + (O₁₂ − E₁₂)²/E₁₂ + … + (O_RC − E_RC)²/E_RC
§ Under H₀, the X² test statistic follows a chi-square distribution on (R−1)(C−1) degrees of freedom: X² ~ χ²_{(R−1)(C−1)}.
• Again, we have to obtain the marginal totals to determine the expected count for each cell. For example:

                         blue   green   brown   hazel   gray   Total
Mutant allele absent     4.36   8.23    18.88   14.04   15.49     61
Mutant allele present    4.64   8.77    20.12   14.96   16.51     65
Total                      9     17      39      29      32     126

• The expected counts are calculated as
E₁₁ = (61)(9)/126 = 4.36, …, E_RC = (65)(32)/126 = 16.51
• Then
X² = (3 − 4.36)²/4.36 + (7 − 8.23)²/8.23 + … + (17 − 16.51)²/16.51 = 1.80
• Under H₀, X² ~ χ²₄.
• MATLAB: 1-chi2cdf(1.8,4) gives p-value = 0.77.
• Conclusion: no evidence for association between eye color and the mutant allele.
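The same few lines of MATLAB generalize to the RxC case; a minimal sketch for the eye-color table above (again assuming chi2cdf from the Statistics Toolbox):

    % Chi-square test for the 2x5 eye-color table
    O  = [3  7 21 15 15;                      % mutant allele absent
          6 10 18 14 17];                     % mutant allele present
    n  = sum(O(:));                           % grand total, 126
    E  = sum(O,2) * sum(O,1) / n;             % expected counts, e.g. E(1,1) = (61)(9)/126 = 4.36
    X2 = sum(sum((O - E).^2 ./ E));           % = 1.80
    df = (size(O,1) - 1) * (size(O,2) - 1);   % (R-1)(C-1) = 4
    p  = 1 - chi2cdf(X2, df);                 % = 0.77: no evidence of association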
Two-sample comparison of proportions
• What if we are interested in estimating, and quantifying uncertainty about, the difference in proportions between two groups?
§ e.g., we want an estimate and CI for the difference in proportions of positive tuberculosis skin tests between needle sharers and non-sharers.
• The approach is similar to two-sample estimation for continuous data, with subtle differences!
• Whereas we previously considered the difference in means of continuous two-sample data, we now compare two populations' unknown proportions p₁ and p₂.
• Suppose we want to know whether two communities have the same obesity rate.
§ You draw random samples from both; in the first city, 20 out of 100 people are obese, while in the second, 24 out of 150 are obese.
• Goals:
§ estimate and compute the 95% C.I. for the difference in proportions
§ conduct a significance test at level α = 0.05 for a difference
• Before, we saw that if a random experiment has two possible outcomes, "success" and "failure", and we do n independent repetitions with identical success probability p, then the number of successes is X ~ Bin(n, p).
§ Now, we observe X₁ ~ Bin(n₁, p₁) and X₂ ~ Bin(n₂, p₂) and make inference about p₁ − p₂.
• Estimation is identical to the two-sample continuous case: use the difference of sample proportions, p̂₁ − p̂₂.
• If n₁p̂₁(1 − p̂₁) ≥ 5 and n₂p̂₂(1 − p̂₂) ≥ 5, the associated 100(1 − α)% CI is given by
(p̂₁ − p̂₂) ± z_{1−α/2} √[ p̂₁(1 − p̂₁)/n₁ + p̂₂(1 − p̂₂)/n₂ ]
• For example, consider two samples:
§ n₁ = 100, x₁ = 20, p̂₁ = 20/100 = 0.20, n₁p̂₁(1 − p̂₁) = 16 ≥ 5
§ n₂ = 150, x₂ = 24, p̂₂ = 24/150 = 0.16, n₂p̂₂(1 − p̂₂) = 20.16 ≥ 5
• Then the 95% CI for the difference is
(0.20 − 0.16) ± 1.96 √[ 0.2(0.8)/100 + 0.16(0.84)/150 ] = 0.04 ± 1.96(0.050) = 0.04 ± 0.10 = (−0.06, 0.14)

Hypothesis testing for difference of proportions
• Now, consider H₀: p₁ = p₂ versus H₁: p₁ ≠ p₂.
§ Under H₀, we can pool the two samples to calculate the standard error, letting p̂ = (n₁p̂₁ + n₂p̂₂)/(n₁ + n₂).
• Then, if n₁p̂₁(1 − p̂₁) ≥ 5 and n₂p̂₂(1 − p̂₂) ≥ 5, under H₀ we form the z-test statistic
z = (p̂₁ − p̂₂) / √[ p̂(1 − p̂)(1/n₁ + 1/n₂) ]
• It has an approximate N(0,1) distribution when the null is true.
• Continuing the same example,
§ n₁ = 100, x₁ = 20, p̂₁ = 0.20, n₁p̂₁(1 − p̂₁) = 16 ≥ 5
§ n₂ = 150, x₂ = 24, p̂₂ = 0.16, n₂p̂₂(1 − p̂₂) = 20.16 ≥ 5
§ p̂ = (x₁ + x₂)/(n₁ + n₂) = (20 + 24)/(100 + 150) = 0.176
• The test statistic is then
z = (0.20 − 0.16) / √[ 0.176(0.824)(1/100 + 1/150) ] = 0.81
• From a table or MATLAB, P(Z > 0.81) = 0.21, so the p-value is 2(0.21) = 0.42 > 0.05 ⇒ do not reject H₀.
§ (These computations are collected in the MATLAB sketch at the end of this section.)

Odds ratio and relative risk
• Chi-square tests for contingency tables allow us to test for association between two categorical variables.
• But how do we estimate the magnitude of the association between two categorical variables?
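For reference, the MATLAB sketch promised above, collecting the interval and test computations for the obesity example (norminv and normcdf are Statistics Toolbox functions, assumed available like chi2cdf):

    % Two-sample comparison of proportions: obesity example
    x1 = 20; n1 = 100;  x2 = 24; n2 = 150;
    p1 = x1/n1;  p2 = x2/n2;                          % 0.20 and 0.16
    se = sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2);           % unpooled SE for the CI, approx. 0.050
    ci = (p1 - p2) + [-1 1] * norminv(0.975) * se;    % 95% CI, approx. (-0.06, 0.14)
    ph = (x1 + x2)/(n1 + n2);                         % pooled proportion under H0, 0.176
    z  = (p1 - p2) / sqrt(ph*(1-ph)*(1/n1 + 1/n2));   % z = 0.81
    pv = 2*(1 - normcdf(abs(z)));                     % two-sided p-value, approx. 0.42

Note that the CI uses the unpooled standard error while the test pools under H₀, one of the subtle differences from the continuous case flagged above.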