Math 4220 Dr. Zeng

Student Activity 2: Basic Statistical Analysis with SAS

1. Visualizing Your Data

Example 1: The Toluca Company manufactures refrigeration equipment as well as many replacement parts. In the past, one of the replacement parts has been produced periodically in lots of varying sizes. When a cost improvement program was undertaken, company officials wished to determine the optimum lot size for producing this part. This data set can be found on http://www.csub.edu/~bzeng/4220/activities.shtml

• Creating Histograms and Density Curves

To show the distribution of continuous data, you can use a histogram. In a histogram, the data are divided into discrete intervals called bins. To create a histogram, use a histogram statement with this general form:
proc sgplot; histogram variable-name / options;

You can also plot density curves for your data. The basic form of a density statement is:
proc sgplot; density variable-name / options;

SAS CODE 1:
proc sgplot data=Toluca;
    histogram Hours;
    density Hours;                 /* default distribution type is normal */
    density Hours / type=kernel;   /* requests a kernel density estimate instead */
run;
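The bin layout can also be set through the histogram options. The sketch below assumes the Toluca data set is already loaded; the value nbins=8 is an arbitrary illustration, not a recommended setting.

```sas
/* Sketch: choose the number of bins and show counts on the vertical axis */
proc sgplot data=Toluca;
    histogram Hours / nbins=8 scale=count;   /* nbins=8 is an illustrative value */
run;
```

Changing the number of bins can noticeably change the apparent shape of the distribution, so it is worth trying more than one value.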

Remark: Bar charts show the distribution of categorical data. To draw a bar chart, replace the histogram statement with a vbar statement and omit the density statements, which apply only to continuous data. The rest of the PROC SGPLOT step is the same.
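As a minimal sketch of the remark, the step below draws a bar chart of the categorical variable Drug. It assumes the bloodpressure data set described in Example 2 has been loaded.

```sas
/* Sketch: one bar per level of Drug, bar height = number of subjects */
title "Bar chart of Drug";
proc sgplot data=bloodpressure;
    vbar Drug;
run;
```

Use an hbar statement in place of vbar for horizontal bars.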

• Creating Box Plots

Like histograms, box plots show the distribution of continuous data. A box plot displays the following information: minimum, first quartile, median, mean, third quartile, maximum, and outliers. To create a vertical box plot, use a vbox statement like this:
proc sgplot; vbox variable-name / options; /* use an hbox statement for a horizontal box plot */

Example 2: This example contains the variables Subj (ID values for each subject), Drug (with values of Placebo, Drug A, or Drug B), SBP (systolic blood pressure), DBP (diastolic blood pressure), and Gender (with M or F). This data set can be found on http://www.csub.edu/~bzeng/4220/activities.shtml

SAS CODE 2:
title "creating box plots";
proc sgplot data=bloodpressure;
    hbox SBP;                  /* horizontal box plot */
run;

title "creating boxplots for SBP by different type of drugs";
proc sgplot data=bloodpressure;
    vbox SBP / category=Drug;  /* one vertical box per level of Drug */
run;

• Creating Scatter Plots

Here, we will learn some graphical methods for displaying the relationship between two continuous variables. First, you will create simple scatter plots with two different procedures. You can also create multiple plots on a single page using PROC SGSCATTER.

SAS CODE 3(a):
title "Creating a Scatter Plot Using PROC GPLOT: Toluca Company Example";
symbol value=dot;
proc gplot data=Toluca;
    plot Hours*Size;   /* vertical*horizontal: Hours on the y-axis, Size on the x-axis */
run;

Alternatively, if you use PROC SGPLOT, then the keyword SCATTER may be used.

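SAS CODE 3(b):
title "Creating a Scatter Plot Using PROC SGPLOT: Toluca Company Example";
proc sgplot data=Toluca;
    /* SYMBOL statements do not apply to PROC SGPLOT; set the marker with markerattrs= */
    scatter x=Size y=Hours / markerattrs=(symbol=CircleFilled);
run;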
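For several plots on one page, a possible PROC SGSCATTER step is sketched below. It assumes the bloodpressure data set from Example 2. The choice of a MATRIX statement with histograms on the diagonal is only one illustration; PLOT and COMPARE statements are alternatives.

```sas
/* Sketch: scatter plot matrix of SBP and DBP with histograms on the diagonal */
title "Scatter plot matrix using PROC SGSCATTER";
proc sgscatter data=bloodpressure;
    matrix SBP DBP / diagonal=(histogram);
run;
```

With more than two continuous variables, listing them all in the MATRIX statement produces every pairwise scatter plot in a single panel.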

• Creating Probability Plots

PROC UNIVARIATE is another SAS procedure whose output is similar to the output from PROC MEANS. However, PROC UNIVARIATE provides additional statements that produce histograms and probability plots. Returning to Example 2, the following program demonstrates these features of PROC UNIVARIATE. Note that the normal option in the PROC UNIVARIATE statement requests test statistics for checking normality. If the sample size is over 2000, the Kolmogorov-Smirnov test should be used. Otherwise, the Shapiro-Wilk test is better. The null hypothesis of a normality test is that there is no significant departure from normality. When the p-value is greater than 0.05, we fail to reject the null hypothesis and treat the normality assumption as reasonable; a p-value below 0.05 is evidence against normality.

SAS CODE 4:
title "demonstrating PROC UNIVARIATE";
proc univariate data=bloodpressure normal;
    var SBP DBP;
    histogram;                                  /* histograms of the VAR variables */
    probplot / normal(mu=est sigma=est);        /* normal probability plots with estimated parameters */
run;

Tests for Normality

Test                  Statistic            p Value
Shapiro-Wilk          W       0.512239     Pr < W       <0.0001
Kolmogorov-Smirnov    D       0.31514      Pr > D       <0.0100
Cramer-von Mises      W-Sq    0.656015     Pr > W-Sq    <0.0050
Anderson-Darling      A-Sq    3.636438     Pr > A-Sq    <0.0050

2. Generating Descriptive Statistics with PROC MEANS

One of the first steps in any statistical analysis is to calculate some basic descriptive statistics on the variables of interest. One way to compute means and standard deviations is to use PROC MEANS. The following list describes some of the more useful options:

Option     Description
N          Number of nonmissing observations
NMISS      Number of observations with missing values
MEAN       Arithmetic mean
STD        Standard deviation
STDERR     Standard error
MIN        Minimum value
MAX        Maximum value
MEDIAN     Median
MAXDEC=    Maximum number of decimal places to display
CLM        95% confidence limits on the mean
CV         Coefficient of variation

Example 2 Revisit: Use the above blood pressure data set, and compute the descriptive statistics for SBP and DBP, first for all subjects and then for each level of Drug (Placebo, Drug A, Drug B).

SAS CODE 5:
title "Descriptive Statistics for SBP and DBP";
proc means data=bloodpressure n nmiss mean std median maxdec=3;
    var SBP DBP;
run;

SAS CODE 6:
title "Descriptive Statistics for SBP and DBP: broken down by Drug";
proc means data=bloodpressure n nmiss mean std median maxdec=3;
    class Drug;    /* one set of statistics per level of Drug */
    var SBP DBP;
run;

3. Computing a 95% Confidence Interval

A 95% confidence interval for the mean helps you judge how well your sample mean estimates the mean of the population from which you took your sample. The standard error is useful for the same reason.
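For reference, the standard t-based interval that the CLM option of PROC MEANS reports can be written as follows. The symbols are notation introduced here, not taken from the handout: $\bar{x}$ is the sample mean, $s$ the sample standard deviation, $n$ the number of nonmissing observations, and $t_{0.975,\,n-1}$ the 97.5th percentile of the t distribution with $n-1$ degrees of freedom.

```latex
\[
  \mathrm{SE}(\bar{x}) = \frac{s}{\sqrt{n}},
  \qquad
  \text{95\% CI for } \mu:\quad
  \bar{x} \;\pm\; t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}}
\]
```

A smaller standard error gives a narrower interval, which is why the two quantities answer the same question.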

Example 2 Revisit: Use the above blood pressure data set, and compute the 95% confidence interval and the standard error for SBP and DBP.

SAS CODE 7:
title "Computing a 95% Confidence Interval and the Standard Error";
proc means data=bloodpressure n nmiss mean std median clm stderr maxdec=3;
    var SBP DBP;
run;

SAS CODE 8:
title "Computing a 95% Confidence Interval and the Standard Error: broken down by drug";
proc means data=bloodpressure n nmiss mean std median clm stderr maxdec=3;
    class Drug;
    var SBP DBP;
run;

4. Conducting a One-Sample t-test

Hypothesis testing is a very important part of inferential statistics, and the one-sample t-test is one of the most common hypothesis tests. Here, you will learn how to conduct a one-sample t-test using SAS.

Without ods graphics turned on

Example 1 Revisit: Use the Toluca Company example above to test whether the mean work hours is 300. That is, test $H_0: \mu = 300$ vs. $H_1: \mu \neq 300$, where $\mu$ is the mean of the work hours. What is the p-value, and what is your conclusion?

SAS CODE 9:
title "Conducting a One-Sample t-test Using PROC TTEST";
proc ttest data=Toluca h0=300 sides=2 alpha=0.05;
    var Hours;
run;
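As a reminder of what the procedure computes, the usual one-sample t statistic and its two-sided p-value are sketched below. The notation is introduced here for illustration: $\bar{x}$ and $s$ are the sample mean and standard deviation of Hours, $n$ is the sample size, and $T_{n-1}$ is a t random variable with $n-1$ degrees of freedom.

```latex
\[
  t = \frac{\bar{x} - 300}{s/\sqrt{n}},
  \qquad
  p\text{-value} = 2\,P\!\left(T_{n-1} \ge |t|\right)
\]
```

Reject $H_0$ at the 0.05 level when the p-value is below 0.05.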

With ods graphics turned on

Example 1 Revisit: Use the Toluca Company example as shown above, but add the statement ods graphics on; at the top of your SAS program. What changes in the output, and what can you conclude from it?

SAS CODE 10:
ods graphics on;
title "Conducting a One-Sample t-test Using PROC TTEST";
proc ttest data=Toluca h0=300 sides=2 alpha=0.05;
    var Hours;
run;
ods graphics off;

Remarks:
(1) h0= specifies the value of the mean under the null hypothesis.
(2) sides=2 requests a two-sided test; sides=U an upper-tailed test; sides=L a lower-tailed test.
(3) alpha= specifies the significance level.
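To illustrate remark (2), the sketch below reruns the Toluca test as an upper-tailed test. The one-sided alternative $H_1: \mu > 300$ is chosen here only as an example and is not part of the original exercise.

```sas
/* Sketch: upper-tailed one-sample t-test, H0: mu = 300 vs. H1: mu > 300 */
title "One-Sample t-test, upper-tailed alternative";
proc ttest data=Toluca h0=300 sides=U alpha=0.05;
    var Hours;
run;
```

With sides=U or sides=L, the reported confidence limit for the mean is one-sided as well.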

5. Conducting a Two-Sample t-test

• Conducting the Unpaired t-test

Example 2 Revisit: Compare SBP and DBP between the males and females in the bloodpressure data set. Is there any difference between the two genders in SBP or DBP? Comment on the equality of variances.

SAS CODE 11:
title "conducting an unpaired two-sample t-test";
proc ttest data=bloodpressure;
    class gender;    /* grouping variable with two levels */
    var SBP DBP;
run;

• Conducting a Paired t-test

If you have a design in which each subject received two treatments and you want to conduct a t-test, you need a PAIRED statement with PROC TTEST.

Example 3: A study was conducted to determine whether a reading program could improve the reading speed of eight subjects. The before and after reading speeds, along with a subject number, are stored in a SAS data set called reading. What is your conclusion?

SAS CODE 12:
proc ttest data=reading;
    paired After*Before;   /* tests the mean of After - Before */
run;

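A paired t-test is equivalent to a one-sample t-test on the within-subject differences. The sketch below shows this using the After and Before variables from the reading data set. The names reading_diff and Diff are hypothetical, introduced here for illustration.

```sas
/* Sketch: the paired t-test written as a one-sample t-test on the differences */
data reading_diff;               /* hypothetical data set name */
    set reading;
    Diff = After - Before;       /* change in reading speed for each subject */
run;

proc ttest data=reading_diff h0=0;
    var Diff;                    /* H0: mean difference = 0 */
run;
```

The t statistic, degrees of freedom, and p-value should match those from the PAIRED statement.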
6. SAS Distribution Functions

The following program shows how you can compute p-values for the normal, t, chi-square, and F distributions using SAS cumulative distribution functions. For example, for an upper-tailed Z test, the probability of observing a Z statistic greater than 1.96 is 1 - probnorm(1.96) = 1 - 0.975 = 0.025. For a two-sided test, the p-value is 2*(1 - probnorm(1.96)) = 2*(1 - 0.975) = 2*0.025 = 0.05.

SAS CODE 13:
data pvalues;
    zstat = 1.96;                              /* z score */
    zpvalue = 1 - probnorm(zstat);             /* upper-tailed z test */
    zpvalue2 = 2*(1 - probnorm(zstat));        /* two-sided z test */
    tstat = 2.32;                              /* t score */
    tdf = 10;                                  /* degrees of freedom */
    tpvalue = 1 - probt(tstat, tdf);           /* upper-tailed t test */
    fstat = 31.83;
    ndf = 2;                                   /* numerator DF */
    ddf = 681;                                 /* denominator DF */
    fpvalue = 1 - probf(fstat, ndf, ddf);      /* p-value from F test */
    chistat = 2.32**2;                         /* chi-square statistic */
    cdf = 1;                                   /* chi-square degrees of freedom */
    cpvalue = 1 - probchi(chistat, cdf);       /* p-value from chi-square test */
run;
proc print data=pvalues;
run;

Obs  zstat  zpvalue   zpvalue2  tstat  tdf  tpvalue   fstat  ndf  ddf  fpvalue     chistat  cdf  cpvalue
  1   1.96  0.024998  0.049996   2.32   10  0.021386  31.83    2  681  6.0951E-14   5.3824    1  0.020341

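The same functions give lower-tailed and two-sided p-values for a t statistic. The sketch below reuses tstat = 2.32 and tdf = 10 from the program above. The data set name tpvalues and the new variable names are hypothetical.

```sas
/* Sketch: lower-tailed and two-sided p-values for a t statistic */
data tpvalues;                                        /* hypothetical data set name */
    tstat = 2.32;
    tdf = 10;
    tpvalue_lower = probt(tstat, tdf);                /* P(T <= tstat) */
    tpvalue_two = 2*(1 - probt(abs(tstat), tdf));     /* two-sided p-value */
run;
proc print data=tpvalues;
run;
```

Because the t distribution is symmetric, the two-sided p-value is twice the upper-tailed value reported above, about 0.0428.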
Assignment:

For all the assignments, you need to include the SAS code, the most relevant SAS output, and the interpretation of the output.

1. Practice SAS by running all of the SAS code above.

2. The director of admissions of a small college selected 60 students at random from the new freshman class for a study. He wants to know whether a training program helps to improve students' math grades. The scores before (Score1) and after (Score2) the program were collected for each student. Each student's ACT test score is also included in the data set. The data can be found on http://www.csub.edu/~bzeng/4220/activities.shtml

(a) Create histograms of Score1 and Score2 for each gender.
(b) Add density curves to the histograms using the kernel method.
(c) Create vertical box plots of Score1 and Score2 for all students.
(d) Draw a scatter plot of ACT versus the score before training.
(e) Create a probability plot for ACT.
(f) Compute the descriptive statistics for Score1 and ACT. Include the mean, median, standard deviation, and standard error in your output, displayed with at most 3 decimal places.
(g) Compute the 95% confidence interval for both Score1 and ACT.
(h) Test $H_0$: $\mu_1$ [?] 3.0 vs. $H_1$: $\mu_1$ [?] 3.0, where $\mu_1$ is the population mean of Score1. (Please write down the five steps of conducting a hypothesis test.)
(i) Test $H_0$: $\mu_1$ [?] 3.5 vs. $H_1$: $\mu_1$ [?] 3.5, where $\mu_1$ is the population mean of Score1. (Please write down the five steps of conducting a hypothesis test.)
(j) Does the training program improve the math grades of these students? (Please write down the five steps of conducting a hypothesis test.)