Exploratory Methods for Categorical/Nominal Variables - Bivariate Relationships

Analysis of Categorical Data ~ Goodness-of-Fit Tests, Exact Tests for Comparing Two Proportions (p1 vs. p2), Tests of Homogeneity, and Tests of Independence

This assignment focuses on tests used to analyze categorical data. The problems are similar to the examples shown in class. For help using JMP on this assignment, refer to the Chi-Square Analysis of Categorical Data tutorial, which is available in the Tutorials section of the course webpage.

1. Income and Job Satisfaction

Data File: JobSatis-Income in the Categorical JMP folder
Key Words: Contingency table, mosaic plot, Chi-square test of independence
Topics: Social science

These data come from the General Social Survey of the U.S. National Data Program.

The variables in this data set are:
• Income ($) – Income level (< 6000, 6000-15000, 15000-25000, > 25000) (Don’t Use)
• Income Level – coded as 1, 2, 3, 4 (1 = < 6000, ..., 4 = > 25000)
• Job Satisfaction – Levels are: Very dissatisfied, Little dissatisfied, Moderately satisfied, Very satisfied.
• Freq – Number of respondents in each Income Level/Job Satisfaction category.

The main questions of interest are:
• Is there a relationship between income and job satisfaction?
• What is the nature of the relationship between these variables?

Questions and Tasks

a) Use the Chi-square test to test for a relationship between income and job satisfaction. Summarize your findings. (3 pts.)

b) Now look at the mosaic plot. Does there seem to be a pattern in the direction one might expect? (2 pts.)

Note: The Chi-square test of independence does not look for a specific trend in the data. There are statistical tests designed to look for trends of this type if it is believed in advance that such a trend might exist.
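The arithmetic behind a Chi-square test of independence can be sketched in a few lines of Python, as a rough cross-check on the JMP output. This is only an illustration: the 2 x 3 table below is hypothetical (it is not the JobSatis-Income data), and the function names `chi2_sf` and `chi2_independence` are made up for this sketch.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    y = x / 2.0
    if df % 2 == 0:
        # Even df: finite sum from the regularized upper incomplete gamma function.
        m = df // 2
        return math.exp(-y) * sum(y**j / math.factorial(j) for j in range(m))
    # Odd df: start from the df = 1 tail, erfc(sqrt(y)), and add correction terms.
    m = (df - 1) // 2
    tail = sum(y**(j - 0.5) / math.gamma(j + 0.5) for j in range(1, m + 1))
    return math.erfc(math.sqrt(y)) + math.exp(-y) * tail

def chi2_independence(table):
    """Return (statistic, df, p-value, expected counts) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: (row total * column total) / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, chi2_sf(stat, df), expected

# HYPOTHETICAL counts for illustration only (rows: two income groups,
# columns: dissatisfied / neutral / satisfied). Not the course data.
hypothetical_table = [[20, 30, 50],
                      [10, 30, 60]]
stat, df, p_value, expected = chi2_independence(hypothetical_table)
print(f"chi-square = {stat:.3f}, df = {df}, p-value = {p_value:.4f}")
```

As the note above says, this statistic treats the categories as unordered, so it cannot by itself detect a trend across income levels.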

2. Marijuana Use and Ability to Do Well in School

Data File: Marijuana-School Performance in the Categorical JMP folder
Key Words: Contingency table, mosaic plot, Chi-square test of homogeneity
Topics: Social science, education

The data come from a telephone survey of 1000 Americans aged between 12 and 17. They have been split into two age groups, 12-14 and 15-17, each of size 500. This enables us to see how some patterns change with age. Data are also included from surveys of 825 teachers and 822 principals.

One of the questions asked was: Is it possible for students to use marijuana every weekend and still do well in school? This data set contains the responses to this question.

The variables in this data set are:
• Group – Teachers, Principals, 12-14 year olds, 15-17 year olds.
• Response – Yes, No, Don’t know/refused to answer
• Freq – Number of respondents in each Response/Group category.

a) Plot the percentages for each of the four groups. What do you see? (2 pts.)

b) Test whether there are any differences between teachers and principals with respect to beliefs about the ability of students to smoke marijuana every weekend and still do well at school. Why is this a test of homogeneity? State the hypotheses being tested and give your conclusion based on the results of the test. (4 pts.)

Note: To eliminate the students from the analysis, highlight the rows in the table corresponding to these groups and select both Exclude/Unexclude... and Hide/Unhide... from the Rows menu. When finished, select Clear Row States from the same menu to restore all cases.

c) Compare the responses of students aged 12-14 with those of students aged 15-17 using a test of homogeneity. Summarize your findings. (3 pts.)

3. Sexual Discrimination in Graduate Admissions at the University of California at Berkeley

Data Files: Berkeley Sex Discrim 1975 and Berkeley Sex Discrim 1975 by Program in the Categorical JMP folder
Key Words: Contingency tables, mosaic plots, Fisher’s Exact Test, Chi-square tests, Confounding & Simpson’s Paradox
Topics: Social science, education, law

Bickel, Hammel, and O’Connell (1975) investigated whether there was any evidence of gender bias in graduate admissions at the University of California at Berkeley. The data file Berkeley Sex Discrim 1975 contains the results of a cross-classification of all 12,763 applications to graduate programs in 1973 by Sex (M or F) and Admission (whether or not the applicant was admitted to the program to which they applied).

a) What percent of male applicants were admitted? What percent of female applicants were admitted? What percent of all applicants were admitted? (3 pts.)

b) Conduct a test to determine if the proportion of male applicants accepted is greater than the proportion of female applicants accepted. This would clearly be the research hypothesis for a sex discrimination complaint. What do you conclude? (3 pts.)

c) When we consider the six largest graduate programs (labeled A, B, C, D, E, F) and again cross-classify by sex of the applicant and whether or not they were accepted, we obtain the data in the file Berkeley Sex Discrim 1975 by Program.

Admission for All Programs (A – F)

Now consider looking at admission rates for men and women conditional on program. To do this in JMP select Analyze > Fit Y by X and place Admission in the Y, Response box, Sex in the X, Factor box, and Program in the By box. Examine the result of the Fisher’s Exact Test for each program. Also be sure to look at the percentage of female and male applicants accepted. Summarize your findings from this analysis. (6 pts.)

d) This is an example of what is known as Simpson’s Paradox. While the overall acceptance rate appears to be biased against female applicants, we find either no evidence of bias or evidence to the contrary when we consider admissions on a program-by-program basis. One way to take a potential “confounding factor” like program into account when doing this type of analysis is to use the Cochran-Mantel-Haenszel (CMH) Test. To do this in JMP, first perform a standard test for a relationship between Y and X, then select the CMH test option from the Contingency Analysis pull-down menu. Then select Program as the Grouping Column, which specifies graduate program as the potential confounding factor.

What does the p-value for the CMH test tell you about the presence of potential sexual discrimination in terms of admission to these graduate programs? (2 pts.)
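To make the idea behind Simpson’s Paradox and the CMH test concrete, here is a minimal Python sketch. The counts are hypothetical and chosen only to show the reversal; they are not the Berkeley data. The function name `cmh_test` is made up for this illustration, and the statistic is computed without a continuity correction, which may differ from what a given software package reports.

```python
import math

def cmh_test(strata):
    """Cochran-Mantel-Haenszel statistic (no continuity correction) and p-value.

    strata: list of 2x2 tables [[a, b], [c, d]], one per stratum, where
    rows are the two groups and columns are the two outcomes.
    """
    sum_a = sum_expected = sum_variance = 0.0
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        sum_a += a
        # Mean and variance of the top-left cell under the hypergeometric null.
        sum_expected += (a + b) * (a + c) / n
        sum_variance += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    stat = (sum_a - sum_expected) ** 2 / sum_variance
    # Chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# HYPOTHETICAL counts for illustration only (not the Berkeley data).
# Each table is [[men admitted, men rejected], [women admitted, women rejected]].
hypothetical_programs = {
    "Program 1": [[80, 20], [18, 2]],   # most men apply here; admission is easy
    "Program 2": [[4, 16], [30, 70]],   # most women apply here; admission is hard
}

def admit_rate(admitted, rejected):
    return admitted / (admitted + rejected)

for name, ((ma, mr), (wa, wr)) in hypothetical_programs.items():
    print(f"{name}: men {admit_rate(ma, mr):.0%}, women {admit_rate(wa, wr):.0%}")

# Pooling over programs ignores the confounder and reverses the comparison.
tables = list(hypothetical_programs.values())
pooled_men = admit_rate(sum(t[0][0] for t in tables), sum(t[0][1] for t in tables))
pooled_women = admit_rate(sum(t[1][0] for t in tables), sum(t[1][1] for t in tables))
print(f"Pooled: men {pooled_men:.0%}, women {pooled_women:.0%}")

cmh_stat, cmh_p = cmh_test(tables)
print(f"CMH statistic = {cmh_stat:.3f}, p-value = {cmh_p:.3f}")
```

In this made-up example women have the higher admission rate within each program, yet the pooled rate favors men, because women apply mostly to the program that admits few applicants. The CMH test compares the groups within programs, so it is not misled by the pooling.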

4. Voter Turnout and Exit Polls

Data File: Voter-Turnout in the Categorical JMP folder
Key Words: Bar Graph, Frequency Distribution Table, Goodness-of-Fit Tests
Topics: Political science

In elections, one characteristic of interest is the voter turnout among certain age groups. It is known from 1998 U.S. Census figures that the percentage of U.S. adults eligible to vote who were aged 18-29 was 22%, aged 30-44 was 32%, aged 45-59 was 24%, and aged 60 or older was 22%.

Results of exit polls of 10,000 voters were widely reported after the 1998 U.S. midterm elections and the voter turnout among the age groups identified above was as follows:

Age Group    18 – 29    30 – 44    45 – 59    60 +     Total
Frequency    1303       2897       3014       2786     n = 10,000

a) Construct a bar graph of Age Group. Do there appear to be any differences from the percentages we would expect to see in each age group given the U.S. Census figures? Which age group is most underrepresented? Which is most overrepresented? (3 pts.)

b) The question that arises given the results from (a) is whether these discrepancies are real or could be due to sampling variation. To answer this question, conduct a Chi-square Goodness-of-Fit test using the U.S. Census figures to define the null hypothesis. Also construct a 95% CI for the percentage of voters in each age group. Summarize your findings. (5 pts.)
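As a cross-check on the JMP results, the goodness-of-fit calculation can be sketched in Python using the exit poll counts and Census percentages given above. Two choices here are assumptions of this sketch rather than part of the assignment: the test is carried out by comparing the statistic with the standard 5% critical value for 3 degrees of freedom (7.815), and the intervals use the simple normal-approximation (Wald) formula, which may differ slightly from the intervals JMP reports.

```python
import math

age_groups = ["18-29", "30-44", "45-59", "60+"]
census_share = [0.22, 0.32, 0.24, 0.22]   # null-hypothesis proportions
observed = [1303, 2897, 3014, 2786]       # exit poll counts

n = sum(observed)
expected = [n * p for p in census_share]

# Goodness-of-fit statistic: sum of (observed - expected)^2 / expected.
chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# Standard table value: upper 5% point of the chi-square distribution with 3 df.
CRITICAL_05_DF3 = 7.815
reject_null = chi2_stat > CRITICAL_05_DF3
print(f"chi-square = {chi2_stat:.2f} on {df} df; reject H0 at 5%: {reject_null}")

# Wald 95% confidence interval for each age group's share (an assumed method).
Z_95 = 1.96
intervals = {}
for group, count, p0 in zip(age_groups, observed, census_share):
    p_hat = count / n
    half_width = Z_95 * math.sqrt(p_hat * (1 - p_hat) / n)
    intervals[group] = (p_hat - half_width, p_hat + half_width)
    low, high = intervals[group]
    print(f"{group}: {p_hat:.1%} (95% CI {low:.1%} to {high:.1%}), Census {p0:.0%}")
```

Comparing each interval with the corresponding Census percentage shows which age groups drive the overall test result.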
