
Midterm exam. PLS 206.

Instructions: Answer any combination of questions to complete 120-130 points.

This is a take-home exam. Submit your work by the end of Tuesday, one week after you receive this document. Be concise in your answers, but explain things to a level that would satisfy you if someone else were explaining them to you. You are encouraged to discuss the questions with your classmates and others, but you must produce answers that are as original as possible. Of course, if the answer is a unique number or set of words, not much originality is possible, so try to go through the calculation on your own. If you discuss the questions with anyone, try not to get the final number before you understand how and why to obtain it. Submit an electronic version of the answers, with the questions you chose to answer clearly marked.

1 Basic concepts [30]

1.1 Question [5] Why do we calculate the estimated variance of a sample by dividing by n-1 instead of by n? [5] Because this removes the bias in the estimation of the variance that is introduced by using the sample average as an estimator for the mean. The sample variance calculated with the formula that divides by n instead of n-1 yields, on average, a value that is smaller than the true variance. Out of all of the degrees of freedom, independent observations, or "dimensions" in which the sample vector can vary, we lose one because of the way the average is calculated: the sum of the deviations from the estimated mean is restricted to equal 0, so given n-1 residuals, the last one can be obtained by difference.
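A minimal simulation sketch (not part of the original answer key) that illustrates the bias; it assumes NumPy is available and uses a normal population with known variance 4:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 100_000   # small samples make the bias visible

samples = rng.normal(loc=0.0, scale=2.0, size=(reps, n))  # true variance = 4
var_n  = samples.var(axis=1, ddof=0)   # divide by n   (biased)
var_n1 = samples.var(axis=1, ddof=1)   # divide by n-1 (unbiased)

print("E[s^2] dividing by n  :", var_n.mean())   # ~ 4*(n-1)/n = 3.6
print("E[s^2] dividing by n-1:", var_n1.mean())  # ~ 4.0
```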

1.2 Question [5] What is the distribution of a random variable obtained by the sum of 4 independent standard normal random variables? [5] A normal distribution, with mean 0 and variance 4 (the sum of the individual variances).

1.3 Question [3] What is the equation to standardize a random variable that is not normally distributed? [3] Regardless of the distribution, standardization is achieved by subtracting the mean and dividing by the standard deviation: Z = (X - mu)/sigma. Standardization changes the mean to 0 and the variance to 1, but it does not make the distribution normal.

1.4 Question [7] What are degrees of freedom and how do you usually calculate them? [7] Degrees of freedom is a number associated with a sum of squares (SS). Its value equals the total number of independent observations (more correctly, the number of independent random variables) used to calculate the sum of squares, minus the number of independent parameters estimated and used in calculating the SS. This operational definition is easiest to apply to the SS of the residuals (SSE) for any model. For example, in a completely randomized design with n observations and 4 fixed treatments, the SSE has n terms in the summation and uses 4 estimated parameters, one for the mean of each treatment; therefore dfe = n - 4. When calculating the SS of the model, there are four random variables (the averages for the 4 treatments) and one estimated parameter (the overall mean), so the df of treatments is 3. In regression, where there are no discrete treatments but continuously varying predictors, it is easier to calculate the df for the model as the difference between the df for the total SS (n - 1) and the dfe.

In a more geometric and formal approach, degrees of freedom are the number of dimensions over which a vector can vary. Usually we think of observations in variable space, i.e., of dots in graphs where the axes are the variables. To understand degrees of freedom we need to reverse that and think of variables in observation space. Observation space is a hyperspace with as many dimensions as there are observations or rows in the data set. In that space, each variable is a vector. Any two vectors that are not exactly in the same direction define a 2-dimensional subspace, just as two edges of a table define the plane represented by the tabletop. The cosine of the angle between the two vectors in that plane is the correlation between the variables! For any given variable X and its corresponding vector, you can think of another vector that has the same value in all dimensions, each equal to the average of the observations in X. This vector Xbar is different from X, so there is a difference vector X - Xbar. This difference vector is called e and has squared length e'e. The vector e is perpendicular to Xbar and has no component in the direction of Xbar; thus, e can vary in n - 1 dimensions, i.e., in the n - 1 dimensions that are orthogonal to Xbar. For a full development of an intuitive understanding of linear statistics, cultivate this approach. For example, think of the calculation of e'e and of the other name we have been using for e'e in AGR206.

1.5 Question [10] Describe two experiments from which data were obtained and analyzed by multiple linear regression, one observational and the other manipulative. Indicate what determines whether an experiment is manipulative or observational. [10] An experiment is manipulative if the values of the predictor variables, or treatments, were randomly assigned to the experimental units by the experimenter. An experiment is observational if the values of the independent variables are set by the selection of experimental units. Note that although manipulated independent variables are usually fixed factors, the manipulative-observational dimension is not the same as the fixed-random factor dimension.

2 Simple linear regression [25]

2.1 Question [5] How does the variance of the predicted expected Y change as X increases beyond Xbar? Why? [5] The variance of the predicted expected Y, V{Yhat} = sigma^2*[1/n + (X - Xbar)^2 / sum(Xi - Xbar)^2], increases with (X - Xbar)^2. Therefore, as X increases towards Xbar the variance declines, until it reaches a minimum at X = Xbar; once X > Xbar, the variance increases again as X increases. Intuitively, one can think that variance (uncertainty) increases as we move away from the values where our data or information are centered.
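A minimal sketch (not part of the original answer) that evaluates V{Yhat} over a grid of X values and shows the minimum at Xbar; the data are made up for illustration:

```python
import numpy as np

# Hypothetical data for illustration only
x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 8.3, 8.8])

n = len(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
mse = np.sum((y - (b0 + b1 * x))**2) / (n - 2)   # estimate of sigma^2

x_new = np.linspace(0, 12, 7)
var_yhat = mse * (1/n + (x_new - x.mean())**2 / np.sum((x - x.mean())**2))
for xv, v in zip(x_new, var_yhat):
    print(f"X = {xv:5.1f}  V{{Yhat}} = {v:.4f}")   # smallest near X = Xbar = 4.5
```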

2.2 Question [20] In the example used for clover growth rate, you need to determine if the weight of plants grown in the lowest temperature for 35 (D1) days is still lower than the weight of plants grown at the highest temperature for 30 (D2) days. 1. State Ho and Ha in words. [6] Ho: The weight of plants in the low temperature at day 35 (D1) is less than or equal to the weight of plants in the high temperature at day 30 (D2).

2 058fb72d90eecde64b2daad8646b3237.doc Revised: 5/11/2018

Ha: … greater than …

2. State Ho and Ha with equations, or using statements about parameter values. [6] Subscripts are 1-low, 2-medium, 3-high temperature.

Ho: b01+b11*35 b03+b13*30

or b0+db01+b1*35+db11*35b0+db03+b1*30+db13*30

or db01-db03+b1*(35-30)+db11*35-db13*300

Recall that db03=-db01-db02 and db13=-db11-db12 therefore,

Ho: db01-(-db01-db02)+b1*(35-30)+db11*35-(-db11-db12)*300 =>

=> 2*db01+db02+5*b1+65*db11+30*db120 Ha: …>0

The same statements in general terms, with D1 and D2 as the two day values (subscripts are 1-low, 2-medium, 3-high temperature):

Ho: b01 + b11*D1 ≤ b03 + b13*D2

or b0 + db01 + b1*D1 + db11*D1 ≤ b0 + db03 + b1*D2 + db13*D2

or db01 - db03 + b1*(D1 - D2) + db11*D1 - db13*D2 ≤ 0

Recall that db03 = -db01 - db02 and db13 = -db11 - db12; therefore,

Ho: db01 - (-db01 - db02) + b1*(D1 - D2) + db11*D1 - (-db11 - db12)*D2 ≤ 0

⇒ 2*db01 + db02 + (D1 - D2)*b1 + (D1 + D2)*db11 + D2*db12 ≤ 0, and Ha: … > 0

3. Build an L vector to test Ho in a Custom Test using JMP. [8] Based on the last form of the Ho above, the vector is immediately identified.

Assuming that the effects are ordered as b0, b1, db01, db02, db11, db12, the vector is L' = {0, 5, 2, 1, 65, 30} = {0, (D1 - D2), 2, 1, (D1 + D2), D2}.
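The original answer uses JMP's Custom Test; as an illustration only, here is a hedged sketch of the equivalent general linear hypothesis t-test with NumPy, using a hypothetical design matrix whose columns follow the effect order assumed above (b0, b1, db01, db02, db11, db12):

```python
import numpy as np
from scipy import stats

def custom_test(X, y, L):
    """One-sided test of Ho: L'b <= 0 vs Ha: L'b > 0 in y = X b + e."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                       # OLS estimates
    resid = y - X @ b
    mse = resid @ resid / (n - p)
    est = L @ b                                 # L'b
    se = np.sqrt(mse * (L @ XtX_inv @ L))       # sqrt(MSE * L'(X'X)^-1 L)
    t = est / se
    p_one_sided = 1 - stats.t.cdf(t, df=n - p)  # upper tail for Ha: L'b > 0
    return est, se, t, p_one_sided

# L vector from the answer above (effect order: b0, b1, db01, db02, db11, db12)
L = np.array([0., 5., 2., 1., 65., 30.])
# X and y would come from the clover data set, which is not reproduced here.
```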

3 Data screening [60]

3.1 Question [10] Briefly explain the Bonferroni correction for detection of outliers. Focus on showing that you understand why the test is done using α/(2n), even if you check only one point that seems to be away from the rest. Two to three sentences should suffice. [10] See pages 154-157 of Neter et al. (the KNN textbook).
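The answer defers to the textbook; as a hedged numerical sketch only, the Bonferroni critical value for studentized deleted residuals can be computed as t(1 - α/(2n), n - p - 1), which is how α/(2n) enters even when a single suspicious point is tested (that point was selected after looking at all n residuals):

```python
from scipy import stats

alpha, n, p = 0.05, 45, 2          # hypothetical: 45 observations, 2 parameters
t_crit = stats.t.ppf(1 - alpha / (2 * n), df=n - p - 1)
print(f"Bonferroni critical |t| = {t_crit:.3f}")
# any studentized deleted residual exceeding this value is flagged as an outlier
```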

3.2 Question [12] What assumption of the usual linear regression model is violated by spatially or temporally correlated errors? [4] Independence of the residuals (errors). What graphical method can you use to detect autocorrelated errors? [4]


A plot of the residuals vs. time or spatial position. What is the main effect of errors with positive autocorrelation on tests, if the correlation is not corrected? [4] Correlation of the residuals leads to an overestimation of the true number of degrees of freedom. Therefore, the estimated variance is biased downwards and the tests become more "liberal," in the sense that over many samples we will make a Type I error with a frequency greater than alpha.

3.3 Question [9] Draw a scatter plot of X2 vs. X1 showing the following points: A: an outlier in X1 but not in X2. B: an outlier in neither X1 nor X2 individually, but clearly an outlier in the bivariate distribution. C: a point with high leverage.

3.4 Question [26] Eleven numbers have been erased from the following report. Use the rest of the information to complete all blanks. [11]

R2a = 1 - MSE/(SSTotal/(n-1)) = 0.962
Root MSE = sqrt(MSE) = sqrt(0.03400) = 0.1844
SSE = dfe*MSE = 1.224
df Total Error = 36, df Pure Error = 36 - 9 = 27, SS Total Error = 1.224
SS Pure Error = 1.224 - 1.029 = 0.195
Max RSq = 1 - SSPureError/C.Total SS = 1 - (0.195/36.351) = 0.9946
t Ratios: divide Estimate by Std Error: 31.3, -8.83, 1.88
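A small arithmetic sketch (illustration only) verifying the filled-in numbers from the quantities that remained in the report:

```python
import math

mse, dfe = 0.03400, 36
ss_total = 36.351                     # C. Total SS from the report

root_mse = math.sqrt(mse)             # ~0.1844
sse = dfe * mse                       # ~1.224
ss_lack_of_fit = 1.029                # given in the report
ss_pure_error = sse - ss_lack_of_fit  # ~0.195
max_rsq = 1 - ss_pure_error / ss_total
print(root_mse, sse, ss_pure_error, round(max_rsq, 4))   # ~0.1844 1.224 0.195 0.9946
```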

Write the null hypotheses for the test of lack of fit and interpret the result. [10]

H0: μj = b0 + b1*Xj for all j;  HA: μj ≠ b0 + b1*Xj for at least one j,

where μj = E{Y | X = Xj}.


There was a significant lack of fit, which indicates that the linear model is not a good description of the data. A model that went through the averages for each level of X would have an R2 = 0.9946.

Write the complete model for the filaree example, but with 4 instead of 3 temperature groups. [5]

Y = b0 + db01*G1 + db02*G2 + db03*G3 + db04*G4 + (b1 + db11*G1 + db12*G2 + db13*G3 + db14*G4)*X + e, where Gi = 1 if group = i and 0 otherwise.

3.5 Question [10] Explain the concept of Jackknifing and indicate how it is used in the testing of multivariate outliers. [10]


Multivariate outliers can be detected by calculating the Mahalanobis distance D for each observation. This distance is a measure of the "statistical" distance between each observation and the centroid for the group being considered. Suppose you are performing a MANOVA where the Y's are seed number and weight per seed of a species of interest (Figure 5), and the X variable is level of soil fertility and water availability. In this example, X is a categorical or "class" variable that, say, takes three values: low, medium, and high. In order to detect outliers in the Y dimensions, a centroid or vector of average values for each Y is calculated for each group. The centroids are the best estimates of the expected value for the vector of random variables, and serve the same function as Yhat in the univariate case, where we considered deviations about a straight line. After calculating the centroids, the deviations of each observation from its group's centroid are calculated. Analogously to the deleted residuals, a robust or "jackknifed" squared Mahalanobis distance is calculated for each observation while holding that observation out of the sample. This prevents potential outliers from distorting the very detection of outliers.

Figure 4-6 shows the simulated data for the group of medium fertility and water availability. The centroid for this group is the point (972, 1173). The Euclidean or geometric distances from each of two potential outliers, observations a and b, are represented by the lines from each point to the centroid. Clearly, point b is farther from the centroid than a. The jackknifed squared Mahalanobis distances for points a and b are 31 and 10, respectively, which is counter to the Euclidean distances. The difference is due to the fact that the two variables, seed weight and seed number, exhibit a strong positive covariance within this group. As is intuitively clear from the scatterplot, given the dispersion and correlation between the variables, point a is a lot less likely than point b. This is reflected in the Mahalanobis distance.

Assuming multivariate normality for the random vector of Y variables (in the example the vector is {seed number, seed weight}), the squared Mahalanobis distance and its jackknifed version should follow a χ² distribution with 2 degrees of freedom (df = number of variables). This distribution can be used in the same way the t distribution was used for the univariate situation. A critical value of χ² is determined either by a set probability α = 0.001 or by using the Bonferroni correction with α = 0.10. Outliers are identified by testing the following hypothesis for each observation:

Ho: Yi follows the same multivariate normal distribution as the rest of the sample. Ha: Yi does not follow the same multivariate normal distribution as the rest of the sample.

Let Y1 and Y2 be random variables that have a bivariate normal distribution within a group.

Y i  Y 1 i , Y 2 i  i s a r a n d o m o b s e r v a t i o n , a n d i = 1 , , n T h e j a c k n i f e d s q u a r e d M a h a l a n o b i s d i s t a n c e i s d e f i n e d a s d 2  Y  Y  S 1 Y  Y   2 ( i )  i ( i )   i ( i )  ( d f  2 ) The simulated seed weight example has 250 observations. Observations a and b were not considered in the sample to simplify the calculations in xmpl_MVoutl.xls, but a strict calculation should have included point b when calculating d2 for a, and vice versa. Using the Bonferroni approach, the critical level for the Mahalanobis distance is 2 with 2 degrees of freedom for =0.10/250=0.0002. The test is one-tailed, because the squared distance can only be positive, and values close to zero indicate that the observations are very much within the range expected under the assumptions and Ho. The critical value is 15.65, indicating that observation a is an outlier but observation b is not. Through the multivariate platform in JMP one can obtain the Mahalanobis distance (D) and the jackknifed distance. The plot also shows a dotted line that represents the critical D value for =0.05. This distance is calculated as F*number of variables. The F value is the table value for the desired


probability level and with df in the numerator = number of variables (nvars) and df in the denominator = n - nvars - 1.

A question that is relevant at this point is: why did we use the Bonferroni correction with n = 250 when in fact we only tested 2 observations? This is because the 2 observations were picked after looking at the scatterplot. The only time that the n for the Bonferroni correction equals the actual number of tests performed is when the observations to be tested are identified a priori, before obtaining or looking at the results. In such a case, SAS may print out all distances anyway, and one can catch a glimpse of a significant distance that was not in the list prepared a priori; the proper critical value for testing that observation has to be determined with n = total sample size.
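A hedged sketch (not from the original document) of how the jackknifed squared Mahalanobis distances and the Bonferroni χ² cutoff could be computed for a two-variable group; variable names and the simulated data are illustrative only:

```python
import numpy as np
from scipy import stats

def jackknifed_mahalanobis(Y):
    """Squared Mahalanobis distance of each row of Y to the centroid and
    covariance computed with that row excluded (Y is an n x p array)."""
    n, p = Y.shape
    d2 = np.empty(n)
    for i in range(n):
        rest = np.delete(Y, i, axis=0)
        center = rest.mean(axis=0)
        S_inv = np.linalg.inv(np.cov(rest, rowvar=False))
        diff = Y[i] - center
        d2[i] = diff @ S_inv @ diff
    return d2

rng = np.random.default_rng(1)
# simulated bivariate group, loosely like the seed number / seed weight example
Y = rng.multivariate_normal([972, 1173], [[400, 350], [350, 400]], size=250)

d2 = jackknifed_mahalanobis(Y)
alpha, n, p = 0.10, Y.shape[0], Y.shape[1]
crit = stats.chi2.ppf(1 - alpha / n, df=p)   # Bonferroni cutoff (~15.65 for n = 250)
print("flagged outliers:", np.where(d2 > crit)[0])
```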

3.6 Question [12] Compare PRESS and Cook's D. How are they different? [4] What are the similarities? [4] What is each one used for? [4]

The PRESS statistic is the prediction error sum of squares. It is calculated like the SSE except that the fits are jackknifed, so the predictions are less biased towards the sample in question:

PRESS_p = Σ_{i=1}^{n} (Yi - Ŷi(i))²

where p = the number of variables in the model, n = the number of observations, and Ŷi(i) = the predicted value for observation i when observation i is excluded from the model fit.

Cook's D statistic measures the relative influence that the ith observation has on all n fitted values:

D_i = Σ_{j=1}^{n} (Ŷj - Ŷj(i))² / (p·MSE)

where p = the number of variables in the model, n = the number of observations, and Ŷj(i) = the predicted value for observation j when observation i is excluded from the model fit.

While PRESSp uses the difference between the observed value for case i and the predicted value based on a model withholding that case, Cook's Di uses the difference between the predicted value for case j and the predicted value for case j when case i is withheld from fitting the regression model. PRESSp is concerned with the ability of the model to predict an observation, whereas Cook's Di is concerned with how the ith observation influences the overall model. The denominators of these quantities also differ because they have differing objectives: PRESSp is based on fits that withhold the ith case from the data used to make the prediction, while Di standardizes the measure across comparisons by dividing by p·MSE. [Answer by Lauren Garske, Spring 2007]
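A hedged sketch (illustration only, not from the original answer) computing both statistics for an OLS fit via the hat matrix, using made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])   # intercept + 1 predictor
y = 2 + 0.5 * X[:, 1] + rng.normal(0, 1, n)

p = X.shape[1]                                 # number of estimated parameters
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
h = np.diag(H)
e = y - H @ y                                  # ordinary residuals
mse = e @ e / (n - p)

press = np.sum((e / (1 - h)) ** 2)             # PRESS = sum of squared deleted residuals
cooks_d = (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)   # Cook's D for each observation

print("PRESS =", round(press, 3))
print("largest Cook's D =", round(cooks_d.max(), 3))
```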

4 Multiple linear regression

4.1 Question [20] Use the Spartina data to regress bmss on NH4 and pH, build two bar graphs like those in Figure 7.1 of the textbook. Show and label SSTO, SSR(pH), SSR(NH4), SSR(pH | NH4), SSR(NH4 | pH), SSE(pH), SSE(NH4), SSE(pH, NH4), and SSR(pH, NH4). Label each part of the bars with the name and value of the SS. [20]

Use the Spartina data to regress bmss on sal and pH, build two bar graphs like those in Figure 7.1 of the textbook. Show and label SSTO, SSR(pH), SSR(sal), SSR(pH | sal), SSR(sal | pH), SSE(pH), SSE(sal), SSE(pH, sal), and SSR(pH, sal). Label each part of the bars with the name and value of the SS. [20]


[Answer by Iago Lowe. Spring 2007.]
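The bar graphs from the original answer are not reproduced here. As a hedged sketch only, the extra sums of squares named in the question could be obtained from a few nested OLS fits; the column names follow the Spartina data set (bmss, pH, NH4), but the file name and loading code are assumptions:

```python
import numpy as np
import pandas as pd

def sse(y, X):
    """Residual sum of squares from an OLS fit of y on X (intercept added)."""
    X = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return r @ r

d = pd.read_csv("spartina.csv")        # assumed file name
y = d["bmss"].to_numpy()
ssto = np.sum((y - y.mean()) ** 2)

sse_ph     = sse(y, d[["pH"]].to_numpy())
sse_nh4    = sse(y, d[["NH4"]].to_numpy())
sse_ph_nh4 = sse(y, d[["pH", "NH4"]].to_numpy())

ssr_ph        = ssto - sse_ph            # SSR(pH)
ssr_nh4       = ssto - sse_nh4           # SSR(NH4)
ssr_ph_nh4    = ssto - sse_ph_nh4        # SSR(pH, NH4)
ssr_ph_given  = sse_nh4 - sse_ph_nh4     # SSR(pH | NH4)
ssr_nh4_given = sse_ph - sse_ph_nh4      # SSR(NH4 | pH)
```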

4.2 Question [28] What is collinearity? [5] Collinearity is the presence of correlations among the predictor or explanatory variables. What effects does collinearity have in multiple linear regression? [8] Collinearity increases the variance of the estimated partial regression coefficients and makes it hard to unequivocally assign explanatory power to the different predictors. Use the file xmpl_ColinDemo to illustrate each one of the effects. Explain how the effects are illustrated by the calculations. [15] (See notes.)

4.3 Question [26] Based on the following output, perform a test to determine if there are any quadratic effects. Specify the null and alternative hypotheses and write the corresponding models. [12] Express the hypotheses as equations about the values of parameters. [8] Use a simultaneous general linear test for all quadratic terms together. [6]


Full model:

Y = β0 + βCa*Ca + βpH*pH + βMg*Mg + βCu*Cu + βpH2*pH² + βCa2*Ca² + βMg2*Mg² + βCu2*Cu² + ε

Reduced model:

Y   0  C a C a   p H p H   M g M g  C u C u  

Ho:  p H 2  C a 2   M g 2  C u 2  0

Ha: at least one is not zero.

F-test: F = [(38355 + 455839 + 171457 + 684)/4] / 114745 = 1.452; compare to F(4, 36).
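A small sketch (illustration only) carrying out the comparison with SciPy, assuming the four Type III SS for the quadratic terms and the MSE of the full model (114745) come from the output shown above:

```python
from scipy import stats

ss_quadratic = [38355, 455839, 171457, 684]   # Type III SS for pH^2, Ca^2, Mg^2, Cu^2
mse_full, df_num, df_den = 114745, 4, 36

F = (sum(ss_quadratic) / df_num) / mse_full    # ~1.452
p_value = 1 - stats.f.cdf(F, df_num, df_den)
print(f"F = {F:.3f}, p = {p_value:.3f}")
# F is well below the 5% critical value of F(4, 36): no evidence of quadratic effects
```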


4.4 Question [10] In a multiple linear regression of bmss (in a subset of the Spartina data) on Na, pH, and Mg, the VIF for Na is 22.57. What would be the VIF for Na if the same subset of data were used to regress Ca on Na, pH, and Mg? Explain. [10] The VIF would be the same, because it depends only on the regression of Na on the other predictor variables (pH and Mg), not on the response. As long as the set of predictors and the observations are the same, the VIF is unchanged.
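A hedged sketch of the point: the VIF for Na comes from regressing Na on the other predictors, so the response never enters the calculation. Column names follow the Spartina data set; the file name and loading code are assumptions:

```python
import numpy as np
import pandas as pd

def vif(target, others):
    """VIF of `target` given the other predictors: 1 / (1 - R^2)."""
    X = np.column_stack([np.ones(len(target)), others])
    b, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ b
    r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
    return 1 / (1 - r2)

d = pd.read_csv("spartina.csv")                     # assumed file name
vif_na = vif(d["Na"].to_numpy(), d[["pH", "Mg"]].to_numpy())
print(vif_na)   # same value whether the response is bmss or Ca
```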

4.5 Question [10] Explain how a K-fold cross-validation works.
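The document does not include an answer here. As a hedged sketch only, K-fold cross-validation splits the data into K groups, fits the model K times leaving one group out each time, and averages the prediction error on the held-out groups:

```python
import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    """Average held-out mean squared prediction error for an OLS model."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)                  # k roughly equal groups
    errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)             # fit on the other k-1 groups
        Xtr = np.column_stack([np.ones(len(train)), X[train]])
        Xte = np.column_stack([np.ones(len(fold)),  X[fold]])
        b, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        errors.append(np.mean((y[fold] - Xte @ b) ** 2))   # error on held-out group
    return np.mean(errors)
```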

5 Principal components analysis

5.1 Question [5] What is the advantage of performing PCA based on the correlation instead of the covariance matrix? [5] PCA based on the correlation matrix gives equal weight to each of the original variables and removes the units, thus making the analysis independent of the units chosen.

5.2 Question [33] Use the Spartina data set for this question. Obtain the PC's of the 14 predictors. [5] Regress any one of the original variables on all 14 PC's. What is the R-square? Why? [8] Compare the proportion of the SS of the variable explained by each PC with the corresponding loading. What is the relationship between them? Calculate the sum of squared loadings for any original variable across all PC's. What value would you get if you did this for a different original variable? Why? [20]

The R-square of regressing any of the original variables on all PC's is always 1, because all of the variation of the original variables is contained in the PC's. Because the PC's are orthogonal to each other and the loading is the correlation between the original variable and a PC, the loading equals the square root of the proportion of the variance of the variable explained by that PC. The sum of the squared loadings for a variable is always one; this is another reflection of the previous two points.

Answer contributed by Diran Tashjian (S02): When pH was regressed on all 14 principal components, the R2 value was 1. The R2 value is equal to 1 because the total variation in each one of the variables was partitioned into the 14 principal components, although the principal component analysis attempted to group most of the variation among the variables into the first few principal components by generating components that capture as much of the covarying variation in the data set as possible. However, since not all the variables in this data set were orthogonal to each other and there was some degree of collinearity between the variables, different portions of the variation in each variable were divided up among the 14 principal components. Therefore, when a particular variable is regressed on all 14 principal components, all the variation in that variable is explained by the 14 principal components. Since the R2 value measures the amount of variation in the data set accounted for by the model, which in this case includes the 14 principal components, the R2 value was equal to 1 because all of the variance in the pH variable can


be explained by the 14 principal components. If only the first 13 principal components were included in the model, the R2 value would be slightly less than one, because there is a slight amount of variation explained by the 14th principal component. This amount of variation is shown in the effect tests (see printout 1). The Type III sum of squares test shows that the addition of the 14th principal component, when all the other components are already included in the model, explains an additional 0.365464/68.419778 = 0.53% of the variance in the pH variable. It is also interesting to note that the Type III sum of squares for each PC is equivalent to the Type I sum of squares. This occurs because the principal components are calculated in such a way that each PC does not share any variance with the other PC's. Therefore, each PC explains a unique portion of the variance that is not captured by any other PC, so the amount of additional variance explained by a component does not change whether it is added to the model last (Type III) or first (Type I).

When the proportion of the sum of squares of the variable explained by each principal component is compared with the loadings of the variable on each principal component, a positive relationship is found: the larger the variation (sum of squares) in the variable explained by a principal component, the larger (in absolute value) the loading of that variable on that component. For example, PC1 explained 56.14/68.42 = 82.1% of the variance in the pH variable and the loading of pH on PC1 was -0.90584, while PC2 explained 0.191/68.42 = 0.3% of the variance and the loading of pH on PC2 was -0.05281. This relationship occurs because the loadings indicate how strong the relationship is between an original variable and a principal component, and that relationship is based on how much of the variance in the variable is explained by the component. Since the total variation of a standardized variable in a principal component analysis is equal to one, the squared loadings of a variable on the principal components are proportional to how much of the total variation in the variable is explained by each component. For example, the loading of pH on PC1 was -0.90584; squaring this value to obtain the amount of variation in pH explained by PC1 gives (-0.90584)² = 0.8205 = 82.1%, which is exactly the proportion of the sum of squares in the pH variable explained by PC1.

Loadings for pH on the 14 PC's (in the order given in the original table), with their squared values:

Loading:  0.0944  0.1353  0.0353  0.0555  0.0730  -0.9058  -0.0528  -0.3583  0.0762  -0.0182  0.0097  0.0123  -0.0705  0.0045
Squared:  0.0089  0.0183  0.0012  0.0031  0.0053   0.8205   0.0028   0.1284  0.0058   0.0003   0.0001   0.0002   0.0050  0.00002

Sum of squared loadings = 1

Squaring the loadings and summing them gives the squared length of the vector of the original variable in the 14-dimensional space of the principal components. There are 14 dimensions because there were 14 original variables in the data set. Since the squared loadings for a particular variable were shown to be proportional to the amount of variance explained by each PC, the length of the vector along each of the 14 dimensions is proportional to the amount of variance explained by that dimension or PC. Therefore the overall squared length of the vector, which is the sum of its squared components in each of the 14 dimensions, must equal the total amount of variance in the variable. Since the total variance of a standardized variable is equal to one, the value obtained by squaring and summing the loadings of a particular variable will also equal 1. This applies to all of the original variables in the data set, because all of the variables were standardized and therefore each has a variance equal to one.
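A hedged sketch (not part of the original answers) verifying the two facts numerically with a correlation-matrix PCA; the data are simulated stand-ins for the 14 Spartina predictors:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 45, 14
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))     # correlated columns
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)          # standardize

R = np.corrcoef(Z, rowvar=False)
eigval, eigvec = np.linalg.eigh(R)                         # PCA on the correlation matrix
scores = Z @ eigvec                                        # principal component scores

# Loadings = correlations between an original variable (say column 0) and each PC
loadings = np.array([np.corrcoef(Z[:, 0], scores[:, j])[0, 1] for j in range(p)])
print("sum of squared loadings:", round(np.sum(loadings ** 2), 6))   # ~1

# R^2 of regressing the variable on all PC scores is also 1
G = np.column_stack([np.ones(n), scores])
b, *_ = np.linalg.lstsq(G, Z[:, 0], rcond=None)
resid = Z[:, 0] - G @ b
print("R^2 =", round(1 - resid @ resid / np.sum(Z[:, 0] ** 2), 6))    # ~1
```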

5.3 Question [10] List the assumptions for PCA and explain the two main goals of PCA. [10] Assumptions are only necessary to facilitate interpretation. PCA only addresses linear relationships, so linearity is desirable and is sometimes listed as an assumption. Although no distributional assumptions are required for exploratory PCA, if one needs to assess the probability that the correlation matrix is not diagonal, or to perform other tests that require pivotal distributions, normality has to be assumed in order to use the typical formulas and tables.

5.4 Question [16] For each of the following scatter plots, draw and label the principal components. [8] Give numerical estimates of the eigenvalues. [8]

