Introduction to Statistical Data Analysis Lecture 9: Miscellaneous Topics

James V. Lambers
Department of Mathematics
The University of Southern Mississippi

Outline: Analysis of Variance (One-Way ANOVA, Pairwise Comparisons), Time Series, Biostatistics, Big Data

Analysis of Variance

Previously, we learned how to test a single mean and how to compare two means. Analysis of variance, also known as ANOVA, is useful for comparing three or more population means.

The Hypotheses

Suppose that we have $m$ samples, the $i$th of size $n_i$, for $i = 1, 2, \ldots, m$, drawn from $m$ populations with means $\mu_1, \mu_2, \ldots, \mu_m$. Let the $i$th sample consist of observations $x_{ij}$, $j = 1, 2, \ldots, n_i$, with sample mean $\bar{x}_i$.

For one-way ANOVA, the null hypothesis is $H_0: \mu_1 = \mu_2 = \cdots = \mu_m$. That is, all of the population means are equal.

The alternative hypothesis $H_1$ is that at least two of the population means differ.

Stuff We Need

To perform one-way ANOVA, we need to compute the following:

\[ \bar{\bar{x}} = \frac{\sum_{i=1}^{m} n_i \bar{x}_i}{\sum_{i=1}^{m} n_i} \]

is the grand mean, which is the mean of all observations from all samples.

\[ \mathrm{SSW} = \sum_{i=1}^{m} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 \]

is the "sum of squares within groups".

\[ \mathrm{SSB} = \sum_{i=1}^{m} n_i (\bar{x}_i - \bar{\bar{x}})^2 \]

is the "sum of squares between groups".

More Stuff We Need

Let $n = n_1 + n_2 + \cdots + n_m$ be the total number of observations from all samples.

\[ \mathrm{MSW} = \frac{\mathrm{SSW}}{n - m} \]

is the "within-sample variance". It is an estimate of the common population variance $\sigma^2$ whether $H_0$ is true or not.
\[ \mathrm{MSB} = \frac{\mathrm{SSB}}{m - 1} \]

is the "between-sample variance", also called the mean square between groups. (The term "mean square error", MSE, conventionally refers to MSW rather than MSB.) MSB is an estimate of the variance $\sigma^2$ only if $H_0$ is true. Otherwise, it tends to be large compared to MSW.

The Test

The test statistic is

\[ F^* = \frac{\mathrm{MSB}}{\mathrm{MSW}}. \]

This is compared to the critical value $F_{\alpha, m-1, n-m}$ from the $F$-distribution with $m - 1$ and $n - m$ degrees of freedom.

If $F^* \le F_{\alpha, m-1, n-m}$, then we do not reject $H_0$: the data are consistent with equal population means. If $F^* > F_{\alpha, m-1, n-m}$, we reject $H_0$.

Alternatively, we could compute the p-value $P(F > F^*)$ and compare it to our level of significance $\alpha$, rejecting $H_0$ when the p-value is less than $\alpha$.

The F-distribution

The $F$-distribution
1. is not symmetric; it is skewed to the right, with most of its area near zero.
2. becomes more symmetric as its degrees of freedom increase.
3. has a total area under the curve of 1.
4. has a mean that is approximately 1 (exactly $d_2/(d_2 - 2)$ when the denominator degrees of freedom $d_2$ exceed 2).

[Figure: the F-distribution]

Pairwise Comparisons

Suppose that $H_0$ from ANOVA is rejected, so that we know that at least two of the means are statistically different.

To find out which means are different, we can use the Scheffé test to compare each pair of sample means.

For each pair $(i, j)$ with $1 \le i < j \le m$, the test statistic is

\[ F_S^* = \frac{(\bar{x}_i - \bar{x}_j)^2}{\dfrac{\mathrm{SSW}}{n - m} \left( \dfrac{1}{n_i} + \dfrac{1}{n_j} \right)}. \]

This is compared to $F_{SC} = (m - 1) F_{\alpha, m-1, n-m}$. If $F_S^* > F_{SC}$, we conclude that means $i$ and $j$ are different.
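As a rough sketch of how the test and the Scheffé comparisons fit together, the R code below computes $F^*$, the critical value, the p-value, and $F_S^*$ for every pair of groups. It rests on assumptions not in the lecture: R's built-in `PlantGrowth` data set (three groups of plant weights) stands in for real data, and `scheffe_pairs` is a hypothetical helper name, not a base R function. The final line cross-checks the hand computation against R's built-in `aov`.

```r
# Sketch only: one-way ANOVA and Scheffe pairwise comparisons from the formulas.
# Assumption: the built-in PlantGrowth data set is used purely as sample data.
alpha  <- 0.05
groups <- split(PlantGrowth$weight, PlantGrowth$group)   # list of m samples

m    <- length(groups)
ni   <- unname(sapply(groups, length))   # sample sizes n_i
xbar <- unname(sapply(groups, mean))     # sample means
n    <- sum(ni)
xbb  <- sum(ni * xbar) / n               # grand mean

SSW <- sum(sapply(groups, function(g) sum((g - mean(g))^2)))
SSB <- sum(ni * (xbar - xbb)^2)
MSW <- SSW / (n - m)
MSB <- SSB / (m - 1)

F_star  <- MSB / MSW
F_crit  <- qf(1 - alpha, m - 1, n - m)                    # critical value
p_value <- pf(F_star, m - 1, n - m, lower.tail = FALSE)   # P(F > F*)

# Hypothetical helper: Scheffe statistic for every pair i < j.
scheffe_pairs <- function(xbar, ni, MSW, F_crit, m) {
  pairs <- t(combn(seq_along(xbar), 2))
  FS <- apply(pairs, 1, function(p) {
    (xbar[p[1]] - xbar[p[2]])^2 / (MSW * (1 / ni[p[1]] + 1 / ni[p[2]]))
  })
  data.frame(i = pairs[, 1], j = pairs[, 2], FS = FS,
             different = FS > (m - 1) * F_crit)
}

result <- scheffe_pairs(xbar, ni, MSW, F_crit, m)
print(c(F_star = F_star, F_crit = F_crit, p_value = p_value))
print(result)

# Cross-check against R's built-in one-way ANOVA table.
fit <- summary(aov(weight ~ group, data = PlantGrowth))[[1]]
```

Computing the p-value with `pf(..., lower.tail = FALSE)` gives the same decision as comparing $F^*$ with the critical value from `qf`.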
Example

A consumer group is testing the gas mileage of three different models of cars. Each car was driven 500 miles and the mileage recorded as follows:

Model 1   Model 2   Model 3
 22.5      18.7      17.2
 20.8      19.8      18.0
 22.0      20.4      21.1
 23.6      18.0      19.8
 21.3      21.4      18.6
 22.5      19.7

Example: One-way ANOVA

We have $n_1 = n_2 = 6$, $n_3 = 5$, $n = n_1 + n_2 + n_3 = 17$, and $m = 3$, so $n - m = 14$.

Our means are $\bar{x}_1 = 22.1$, $\bar{x}_2 = 19.7$, $\bar{x}_3 = 18.9$, and

\[ \bar{\bar{x}} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + n_3 \bar{x}_3}{n_1 + n_2 + n_3} = 20.3. \]

We then compute $\mathrm{SSW} = 21.61$ and $\mathrm{SSB} = 31.45$, followed by

\[ \mathrm{MSW} = \frac{\mathrm{SSW}}{n - m} = \frac{21.61}{14} = 1.54, \qquad \mathrm{MSB} = \frac{\mathrm{SSB}}{m - 1} = \frac{31.45}{2} = 15.73, \]

which yields $F^* = \mathrm{MSB}/\mathrm{MSW} = 10.19$.

This is compared to $F_{0.05, 2, 14} = 3.74$. Since $F^* > F_{0.05, 2, 14}$, we reject $H_0$ and conclude that the means are not all equal.

Example: Pairwise Comparisons

To see which means are not equal, we obtain

\[ F_{SC} = (m - 1) F_{\alpha, m-1, n-m} = 2 F_{0.05, 2, 14} = 7.48. \]

This is compared against $F_S^*$ for all 3 pairs of means:

\[ F^*_{S,1,2} = \frac{(\bar{x}_1 - \bar{x}_2)^2}{\dfrac{\mathrm{SSW}}{n - m} \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)} = 11.66, \]

and similarly, $F^*_{S,1,3} = 17.83$ and $F^*_{S,2,3} = 0.93$.

Since $F^*_{S,1,2}$ and $F^*_{S,1,3}$ are both greater than $F_{SC}$, we conclude those pairs of means are different, whereas $F^*_{S,2,3} < F_{SC}$, so we cannot conclude that $\mu_2$ and $\mu_3$ differ.
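The two statistics quoted with "similarly" can be checked by hand. The worked arithmetic below is a sketch that substitutes values recomputed from the data table to more digits than the slides show ($\bar{x}_1 \approx 22.117$, $\bar{x}_2 \approx 19.667$, $\bar{x}_3 = 18.940$, and $\mathrm{SSW}/(n - m) \approx 1.5438$), since the one-decimal means are too coarse to reproduce the quoted results.

```latex
\[
F^*_{S,1,3}
  = \frac{(\bar{x}_1 - \bar{x}_3)^2}
         {\dfrac{\mathrm{SSW}}{n - m}\left(\dfrac{1}{n_1} + \dfrac{1}{n_3}\right)}
  \approx \frac{(22.117 - 18.940)^2}{1.5438\left(\dfrac{1}{6} + \dfrac{1}{5}\right)}
  \approx \frac{10.093}{0.5661}
  \approx 17.83
\]
\[
F^*_{S,2,3}
  = \frac{(\bar{x}_2 - \bar{x}_3)^2}
         {\dfrac{\mathrm{SSW}}{n - m}\left(\dfrac{1}{n_2} + \dfrac{1}{n_3}\right)}
  \approx \frac{(19.667 - 18.940)^2}{1.5438\left(\dfrac{1}{6} + \dfrac{1}{5}\right)}
  \approx \frac{0.5285}{0.5661}
  \approx 0.93
\]
```

Both pairs share the same denominator because $n_1 = n_2 = 6$, so only the squared difference of means changes between them.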
R Statements

The following R statements can be used for the preceding example:

> alpha=0.05                                # level of significance
> x1=c(22.5,20.8,22.0,23.6,21.3,22.5)       # Model 1
> x2=c(18.7,19.8,20.4,18.0,21.4,19.7)       # Model 2
> x3=c(17.2,18.0,21.1,19.8,18.6)            # Model 3
> n1=length(x1)
> n2=length(x2)
> n3=length(x3)
> x1b=mean(x1)
> x2b=mean(x2)
> x3b=mean(x3)
> x=c(x1,x2,x3)
> xbb=mean(x)                               # grand mean

R Statements, cont'd

> SSW=sum((x1-x1b)^2)+sum((x2-x2b)^2)+sum((x3-x3b)^2)
> SSB=n1*(x1b-xbb)^2+n2*(x2b-xbb)^2+n3*(x3b-xbb)^2
> n=n1+n2+n3
> m=3
> MSW=SSW/(n-m)
> MSB=SSB/(m-1)
> F=MSB/MSW                                 # test statistic
> FC=qf(1-alpha,m-1,n-m)                    # critical value
> F
[1] 10.18602
> FC
[1] 3.738892

R Code, Pairwise Comparisons

The following R code performs the pairwise comparisons:

> FSC=(m-1)*qf(1-alpha,m-1,n-m)             # Scheffe critical value
> FC12=(x1b-x2b)^2/(SSW/(n-m)*(1/n1+1/n2))
> FC13=(x1b-x3b)^2/(SSW/(n-m)*(1/n1+1/n3))
> FC23=(x2b-x3b)^2/(SSW/(n-m)*(1/n2+1/n3))
> FSC
[1] 7.477784
> FC12
[1] 11.66415
> FC13
[1] 17.82672
> FC23
[1] 0.9328217

Time Series

A time series is a sequence of data points measured at successive, equally spaced points in time.

Examples: the daily closing value of the Dow Jones Industrial Average, or the annual flow volume of the Mississippi River.

Time series are important in statistics, but also in many other fields, such as signal processing, econometrics, mathematical finance, meteorology, seismology, and more.
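In R, an equally spaced series is commonly stored as a `ts` object, which records the start time and the number of observations per unit of time alongside the values. The sketch below makes two assumptions that are not from the lecture: the built-in `Nile` data set (annual flow of the Nile) stands in for the river-flow example, and the monthly series uses placeholder values `1:24` rather than real measurements.

```r
# Sketch only: representing equally spaced observations as "ts" objects.
# Assumption: the built-in Nile data set stands in for an annual flow series.
flow <- Nile
info <- c(start = start(flow)[1],    # first year of the series
          end   = end(flow)[1],      # last year of the series
          freq  = frequency(flow))   # observations per year
print(info)

# A monthly series built from a plain vector (placeholder values, not data).
monthly <- ts(1:24, start = c(2020, 1), frequency = 12)

# Extract a sub-period by time rather than by index: April to June 2020.
q2 <- window(monthly, start = c(2020, 4), end = c(2020, 6))
print(q2)
```

Storing the time information with the values lets functions such as `window` select by date instead of by position.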
Time Series Analysis

Time series analysis refers to a broad spectrum of methods for extracting meaningful statistics and characteristics from time series.

While regression analysis is used to compare time series to one another, time series analysis itself is concerned with the study of a single series.

Time series analysis differs from other forms of data analysis due to the natural ordering of the observations, which is absent from general data sets.

Time series analysis has many motivations, but for statistics, the main motivation is forecasting.

Types of Time Series Analysis

Methods for time series analysis can be classified in several ways:
- Frequency domain (e.g., Fourier analysis or wavelet analysis) or time domain (auto-correlation, cross-correlation)
- Parametric (regression, moving averages) or non-parametric (estimating covariance or spectrum)
- Linear or non-linear
- Univariate or multivariate

Time Series Patterns

Much of time series analysis is about detecting patterns within the data. Types of patterns include:
- Stationary
- Linear trend
- Cyclic trend
- Random variations

Exploratory Analysis

Exploratory analysis is used to get a feel for the behavior of the observations as a function of time.
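A first exploratory pass might look like the sketch below: plot the series against time, smooth it with a moving average, fit a straight-line trend, and inspect the autocorrelations. The choices here are assumptions for illustration rather than the lecture's own procedure: the built-in `Nile` series is the sample data, and the 5-year window is arbitrary.

```r
# Sketch only: a simple exploratory look at a time series.
# Assumptions: built-in Nile data as the sample series; 5-year smoothing window.
flow <- Nile
yr   <- as.numeric(time(flow))   # observation times (years)
y    <- as.numeric(flow)         # observed values

# Plot the observations as a function of time.
plot(flow, xlab = "Year", ylab = "Annual flow", main = "Exploratory plot")

# Centered 5-year moving average; the first and last two values are NA.
ma5 <- stats::filter(flow, rep(1 / 5, 5), sides = 2)
lines(ma5, lwd = 2)

# Least-squares linear trend, drawn as a dashed line.
trend <- lm(y ~ yr)
abline(trend, lty = 2)

# Sample autocorrelations (time-domain view), without drawing a second plot.
acf_vals <- acf(flow, plot = FALSE)
print(coef(trend))
print(round(drop(acf_vals$acf)[1:6], 3))
```

The moving average and the fitted line both help to judge by eye whether the series looks stationary or shows a trend.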
