Student Activity 2: Basic Statistical Analysis with SAS
Math 4220 Dr. Zeng

1. Visualizing Your Data

Example 1: The Toluca Company manufactures refrigeration equipment as well as many replacement parts. In the past, one of the replacement parts has been produced periodically in lots of varying sizes. When a cost improvement program was undertaken, company officials wished to determine the optimum lot size for producing this part.
This data set can be found at http://www.csub.edu/~bzeng/4220/activities.shtml

Creating Histograms and Density Curves

To show the distribution of continuous data, you can use a histogram. In a histogram, the data are divided into discrete intervals called bins. To create a histogram, use a HISTOGRAM statement with this general form:

    proc sgplot;
      histogram variable-name / options;

You can also plot density curves for your data. The basic form of a DENSITY statement is:

    proc sgplot;
      density variable-name / options;

SAS CODE 1:
    proc sgplot data=Toluca;
      histogram Hours;
      density Hours;                /* default distribution type is normal */
      density Hours / type=kernel;  /* kernel density estimate */
    run;

Remark: Bar charts show the distribution of categorical data. To draw a bar chart, replace the HISTOGRAM statement with a VBAR statement (and omit the DENSITY statements, which apply only to continuous data). The rest of the step stays the same.

Creating Box Plots

Like histograms, box plots show the distribution of continuous data. A box plot displays the minimum, first quartile, median, mean, third quartile, maximum, and any outliers. To create a vertical box plot, use a VBOX statement like this:

    proc sgplot;
      vbox variable-name / options;  /* use an HBOX statement for a horizontal box plot */

Example 2: This data set contains the variables Subj (ID value for each subject), Drug (with values Placebo, Drug A, or Drug B), SBP (systolic blood pressure), DBP (diastolic blood pressure), and Gender (M or F).
This data set can be found at http://www.csub.edu/~bzeng/4220/activities.shtml

SAS CODE 2:
    title "creating box plots";
    proc sgplot data=bloodpressure;
      hbox SBP;                   /* horizontal box plot */
    run;

    title "creating boxplots for SBP by different type of drugs";
    proc sgplot data=bloodpressure;
      vbox SBP / category=Drug;   /* one vertical box per level of Drug */
    run;

Creating Scatter Plots

Here we look at graphical methods for displaying the relationship between two continuous variables. We start with simple scatter plots, which can be produced with several different procedures. Later, you will also learn how to create multiple plots on a single page using PROC SGSCATTER.

SAS CODE 3(a):
    title "Creating a Scatter Plot Using PROC GPLOT: Toluca Company Example";
    symbol value=dot;
    proc gplot data=Toluca;
      plot Hours*Size;   /* PLOT y*x: Hours on the vertical axis, Size on the horizontal axis */
    run;

Alternatively, if you use PROC SGPLOT, use the SCATTER statement.

SAS CODE 3(b):
    title "Creating a Scatter Plot Using PROC SGPLOT: Toluca Company Example";
    proc sgplot data=Toluca;
      /* SYMBOL statements do not apply to PROC SGPLOT; set the marker with MARKERATTRS= */
      scatter x=Size y=Hours / markerattrs=(symbol=CircleFilled);
    run;

Creating Probability Plots

PROC UNIVARIATE is another SAS procedure whose output is similar to that of PROC MEANS. In addition, PROC UNIVARIATE provides statements that produce histograms and probability plots. Returning to Example 2, the following program demonstrates these features. Note that the NORMAL option on the PROC UNIVARIATE statement requests test statistics for checking normality. If the sample size is over 2000, use the Kolmogorov-Smirnov test; otherwise, the Shapiro-Wilk test is preferred. The null hypothesis of a normality test is that the data come from a normal distribution. When the p-value is greater than 0.05, we fail to reject the null hypothesis, so the data give no significant evidence against the normality assumption.
SAS CODE 4:
    title "demonstrating PROC UNIVARIATE";
    proc univariate data=bloodpressure normal;
      var SBP DBP;
      histogram;
      probplot / normal(mu=est sigma=est);  /* reference line from the estimated mean and std. dev. */
    run;

Tests for Normality
    Test                  Statistic            p Value
    Shapiro-Wilk          W       0.512239     Pr < W       <0.0001
    Kolmogorov-Smirnov    D       0.31514      Pr > D       <0.0100
    Cramer-von Mises      W-Sq    0.656015     Pr > W-Sq    <0.0050
    Anderson-Darling      A-Sq    3.636438     Pr > A-Sq    <0.0050

2. Generating Descriptive Statistics with PROC MEANS

One of the first steps in any statistical analysis is to calculate some basic descriptive statistics for the variables of interest. One way to compute means and standard deviations is to use PROC MEANS. The following list describes some of the more useful options:

    Option    Description
    N         Number of nonmissing observations
    NMISS     Number of observations with missing values
    MEAN      Arithmetic mean
    STD       Standard deviation
    STDERR    Standard error
    MIN       Minimum value
    MAX       Maximum value
    MEDIAN    Median
    MAXDEC    Maximum number of decimal places to display
    CLM       95% confidence limits for the mean
    CV        Coefficient of variation

Example 2 Revisit: Use the blood pressure data set above to compute descriptive statistics for SBP and DBP, first overall and then for each level of Drug (Placebo, Drug A, Drug B).

SAS CODE 5:
    title "Descriptive Statistics for SBP and DBP";
    proc means data=bloodpressure n nmiss mean std median maxdec=3;
      var SBP DBP;
    run;

SAS CODE 6:
    title "Descriptive Statistics for SBP and DBP: broken down by Drug";
    proc means data=bloodpressure n nmiss mean std median maxdec=3;
      class Drug;
      var SBP DBP;
    run;

3. Computing a 95% Confidence Interval

A 95% confidence interval for the mean helps you judge how well your sample mean estimates the mean of the population from which the sample was taken. The standard error is useful for the same reason.

Example 2 Revisit: Use the blood pressure data set above to compute the 95% confidence interval and the standard error for SBP and DBP.
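As a reminder of what these two statistics measure, the block below writes out the standard error and the two-sided confidence limits, assuming the usual t-based interval for a mean with sample mean \bar{x}, sample standard deviation s, and sample size n. The notation is introduced here for illustration and does not appear in the handout.

```latex
\[
\mathrm{SE}(\bar{x}) = \frac{s}{\sqrt{n}},
\qquad
\bar{x} \;\pm\; t_{1-\alpha/2,\; n-1}\,\frac{s}{\sqrt{n}},
\qquad \alpha = 0.05 .
\]
```

A smaller standard error gives a narrower interval, which is why both statistics describe how precisely the sample mean estimates the population mean.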
SAS CODE 7:
    title "Computing a 95% Confidence Interval and the Standard Error";
    proc means data=bloodpressure n nmiss mean std median clm stderr maxdec=3;
      var SBP DBP;
    run;

SAS CODE 8:
    title "Computing a 95% Confidence Interval and the Standard Error: broken down by drug";
    proc means data=bloodpressure n nmiss mean std median clm stderr maxdec=3;
      class Drug;
      var SBP DBP;
    run;

4. Conducting a One-Sample t-test

Hypothesis testing is a very important part of inferential statistics, and the one-sample t-test is one of its most common tests. Here, you will learn how to conduct a one-sample t-test using SAS.

Without ods graphics turned on

Example 1 Revisit: Use the Toluca Company example above to test whether the mean work hours is 300. That is, test H0: μ = 300 vs. H1: μ ≠ 300, where μ is the mean of the work hours. What is the p-value, and what is your conclusion?

SAS CODE 9:
    title "Conducting a One-Sample t-test Using PROC TTEST";
    proc ttest data=Toluca h0=300 sides=2 alpha=0.05;
      var Hours;
    run;

With ods graphics turned on

Example 1 Revisit: Use the Toluca Company example as shown above, but add "ods graphics on;" at the top of your SAS program and see what happens. What can you conclude from the output?

SAS CODE 10:
    ods graphics on;
    title "Conducting a One-Sample t-test Using PROC TTEST";
    proc ttest data=Toluca h0=300 sides=2 alpha=0.05;
      var Hours;
    run;
    ods graphics off;

Remarks:
(1) h0= specifies the value of the mean under the null hypothesis.
(2) sides=2 requests a two-sided test; sides=U an upper-tailed test; sides=L a lower-tailed test.
(3) alpha= specifies the significance level.

5. Conducting a Two-Sample t-test

Conducting the Unpaired t-test

Example 2 Revisit: Compare SBP and DBP between males and females in the bloodpressure data set. Is there any difference between the two genders in SBP and DBP? Comment on the equality of variances.
SAS CODE 11:
    title "conducting an unpaired two-sample t-test";
    proc ttest data=bloodpressure;
      class gender;
      var SBP DBP;
    run;

Conducting a Paired t-test

If you have a design in which each subject received two treatments and you want to conduct a t-test, you need a PAIRED statement with PROC TTEST.

Example 3: A study was conducted to determine whether a reading program could improve the reading speed of eight subjects. The before and after reading speeds, along with a subject number, are stored in a SAS data set called reading. What is your conclusion?

SAS CODE 12:
    proc ttest data=reading;
      paired After*Before;   /* tests the mean of After - Before */
    run;

6. SAS Distribution Functions

The following program shows how you can compute p-values for the normal, t, chi-square, and F distributions using SAS cumulative distribution functions. For example, for an upper-tailed Z test, the probability of observing a Z statistic greater than Z = 1.96 is 1 - probnorm(1.96) = 1 - 0.975 = 0.025. For a two-sided p-value, use 2*(1 - probnorm(1.96)) = 2*(1 - 0.975) = 2*0.025 = 0.05.

SAS CODE 13:
    data pvalues;
      zstat = 1.96;                          /* z score */
      zpvalue = 1 - probnorm(zstat);         /* upper-tailed z test */
      zpvalue2 = 2*(1 - probnorm(zstat));    /* two-sided z test */
      tstat = 2.32;                          /* t score */
      tdf = 10;                              /* degrees of freedom */
      tpvalue = 1 - probt(tstat, tdf);       /* upper-tailed t test */
      fstat = 31.83;                         /* F statistic */
      ndf = 2;                               /* numerator DF */
      ddf = 681;                             /* denominator DF */
      fpvalue = 1 - probf(fstat, ndf, ddf);  /* p-value from F test */
      chistat = 2.32**2;                     /* chi-square statistic */
      cdf = 1;                               /* chi-square degrees of freedom */
      cpvalue = 1 - probchi(chistat, cdf);   /* p-value from chi-square test */
    run;

    proc print data=pvalues;
    run;

    Obs  zstat  zpvalue   zpvalue2  tstat  tdf  tpvalue   fstat  ndf  ddf  fpvalue     chistat  cdf  cpvalue
    1    1.96   0.024998  0.049996  2.32   10   0.021386  31.83  2    681  6.0951E-14  5.3824   1    0.020341
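The same idea extends to a two-sided t-test, and the tail areas can also be written with the more general CDF function. The step below is a minimal sketch, not part of the original handout: the data set name pvalues2 and the new variable names are hypothetical, and the statistics reuse the values from the program above.

```sas
data pvalues2;                                  /* hypothetical data set name */
  tstat = 2.32;                                 /* t score from the program above */
  tdf   = 10;                                   /* degrees of freedom */
  /* two-sided t test: double the upper-tail area beyond |t| */
  tpvalue2 = 2*(1 - probt(abs(tstat), tdf));
  /* the same upper-tail areas written with the CDF function */
  zupper = 1 - cdf('NORMAL', 1.96);             /* standard normal */
  tupper = 1 - cdf('T', tstat, tdf);            /* t with tdf degrees of freedom */
  fupper = 1 - cdf('F', 31.83, 2, 681);         /* F with 2 and 681 DF */
  cupper = 1 - cdf('CHISQUARE', 2.32**2, 1);    /* chi-square with 1 DF */
run;

proc print data=pvalues2;
run;
```

Because the t distribution is symmetric, tpvalue2 is simply twice the upper-tailed tpvalue from the program above. Taking abs(tstat) keeps the formula valid when the observed t statistic is negative.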