STATS MODEL ANSWERS

SECTION A

Q1 (B4 for MSc Occ)

Topic: Factor Analysis

(i) [10% of marks] The technique is exploratory factor analysis (answering just "factor analysis" gets only some of the marks). Answering "principal components analysis" is OK as long as there is a justification for why it can be used even though PCA doesn't propose any underlying latent variables. (It is easier to justify FA than PCA: one might use FA if the researcher thinks there are underlying factors, influencing the responses to the items, that are responsible for degree choices. The results will usually be very similar.)

(ii) [30% of marks] In all cases a good answer should consider the specifics of the particular study in relation to the issues below.

- She would want to carry out a thorough screening of the data to ensure that it is broadly normally distributed (frequencies, graphs etc.) with no univariate, bivariate or multivariate outliers (e.g. frequencies, scatter plots etc.). Normality is not required but helps to get clearer solutions.
- She would also check for illegal values (e.g. <1 or >7).
- She might consider whether restriction of range may occur in the data (given students' desire to appear keen on the course that they have embarked upon).
- She should consider collinearity and singularity by reviewing the correlation matrix (a bivariate correlation of .90 is a problem; multicollinearity is indicated by high SMC values for each variable predicted by all the others).
- Factorisability of the correlation matrix: a rule of thumb is that one needs bivariate correlations above 0.3; one should also look for low pairwise partial correlations (partialling out all other variables) and can use the KMO measure of sampling adequacy (Bartlett's sphericity test is not much use for testing factorisability).
- Good responses will also take into account the fact that the FA can only produce what is put in, and so might ask whether the researcher is likely to have used a wide enough set of items; they may also mention checks on the quality of the ratings produced (e.g. the possibility of halo effects, perhaps assessed with the first unrotated factor).
- The answer must mention the requirements of sample size, noting that there is no single opinion on this matter (e.g. Comrey & Lee: a minimum of 300 cases for a good factor analysis; or a ratio of cases to variables: Nunnally 10:1, Guilford 2:1; Barrett & Kline find that 2:1 replicates structure while 3:1 is better). The current data look adequate.
- Should mention the ratio of variables to factors (as above, e.g. Tabachnick & Fidell 5 or 6:1; Kline 3:1; Thurstone 3:1; Kim & Mueller 2:1). The current data look OK as long as there are not lots of factors, or items which don't correlate with other items.
- Should mention whether to use listwise deletion, pairwise deletion (to be avoided) or imputation for missing data: always listwise if numbers allow. A good answer may discuss different forms of imputation (regression, mean). This may alternatively appear in part (iii).

(iii) [40% of marks]

- She should hypothesise about the relationships between the items (i.e. the factors) a priori, on the basis of the items used and the likely factors influencing student degree choice.
- Decide on FA or principal components: give the differences and the reason for the choice (should note that PCA doesn't have an underlying factor theory).
- If FA, then should comment on the EXTRACTION methods available; the bottom line is that it doesn't really matter which is used, and one may suggest trying all and going for the most interpretable solution. (These could include Maximum Likelihood, Unweighted Least Squares, Generalised (or Weighted) Least Squares, Alpha Factoring and Image Factoring as methods of extraction.)
- Have to decide how many factors to retain: should mention at least two approaches from Kaiser's eigenvalue criterion, scree plots, hypothesis testing, interpretability (find the solution that makes most sense) or significance testing (if using ML or LS).
- Should note the trade-off between the number of factors and the variance explained.
- Should explain the use of the communalities to identify the variance explained in each variable, and what to do if it is too low (e.g. remove the variable or increase the number of factors).
- Should explain the need for rotation and the basic choice of orthogonal or oblique rotation. Should refer to the factors and whether we would expect them to be related. Should justify the decision (e.g. orthogonal is easier to interpret; oblique is more appropriate/realistic/usable), but a choice to try both is acceptable, as is starting with an oblique method: if the best rotation has an angle between the factors that is close to orthogonal then this suggests an orthogonal solution will be OK.
- May comment briefly on different orthogonal and oblique methods, explaining the differences and choosing between them.
- Should comment that there is a choice of factor score computation methods, although needn't give details; often just adding up the standardised scores on the high-loading items works well. (A minimal analysis sketch follows after this list.)
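A minimal sketch of the screening and extraction steps above, in Python, using pandas/numpy and the third-party factor_analyzer package (assumed available); the data file name, column layout and choice of 3 factors are hypothetical:

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer, calculate_kmo

df = pd.read_csv("degree_items.csv")  # hypothetical file: one 1-7 rating item per column

# Screening: illegal values, then collinearity and factorisability checks.
assert df.min().min() >= 1 and df.max().max() <= 7, "illegal values outside 1-7"
r = df.corr()
high = (r.abs() > 0.9) & (r.abs() < 1)         # bivariate collinearity flags (r > .90)
print("collinear pairs:\n", r.where(high).stack())
kmo_per_item, kmo_overall = calculate_kmo(df)  # KMO measure of sampling adequacy
print("overall KMO:", round(kmo_overall, 2))

# Scree information: eigenvalues of the correlation matrix, largest first.
print("eigenvalues:", np.round(np.linalg.eigvalsh(r.values)[::-1], 2))

# Extraction with an oblique rotation; the number of factors is a judgement
# call informed by the scree plot and the interpretability of the solution.
fa = FactorAnalyzer(n_factors=3, rotation="oblimin")
fa.fit(df)
print(pd.DataFrame(fa.loadings_, index=df.columns).round(2))  # rotated loadings
print("communalities:", np.round(fa.get_communalities(), 2))
```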

(iv) [10% of marks] A brief discussion of the factor loadings of items and the variance accounted for may be useful, perhaps including factor scores here if not earlier. We are not looking just for "the analysis will reveal the factor structure of this domain"; we are looking for specific suggestions about the factors that might emerge from which kinds of items. Any intelligent discussion of this should be rewarded. Can include things like saying that the results will identify which items are the best indices of particular factors (indeed, the researcher may have included marker items to help to identify factors). May suggest ways of validating the obtained solution (e.g. a replication study with confirmatory factor analysis).

(v) [10% of marks] If the expectation is for a different factor structure in the 3 student groups then 3 separate FAs should be carried out and qualitatively compared (one way to aid such a comparison is sketched below). Note that this may render the sample inadequate on some accounts for FA (if one subsample is much smaller than the others, that subsample should be ignored). Note that a difference in factor structure is not the same as saying that one factor might be more relevant in one group than another. If the factor structure is similar in the groups then pooling increases the sample size and improves the solutions. So if the separate factor solutions look pretty similar, a second stage might be to pool the data and rerun the model. Including degree course as a variable in the analysis is not a legitimate approach (unless one dummy-coded the degree course), and even then it might provide very limited information and would not avoid the distortions resulting from pooling samples. The idea of using a single sample and then testing the mean factor scores across the 3 degree groups is OK but doesn't avoid the pooling-samples issue.
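The model answer asks only for a qualitative comparison of the separate solutions; one common quantitative aid (not required by the question) is Tucker's congruence coefficient between matched factors. A minimal numpy sketch, with invented loading matrices whose factors are already matched by column:

```python
import numpy as np

def tucker_congruence(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Congruence coefficients between the columns (factors) of two
    (items x factors) loading matrices from separate FAs on the same items.
    Absolute values near 1 (conventionally ~0.95+) suggest very similar factors."""
    cross = a.T @ b  # cross-products of loadings for every pair of factors
    norms = np.outer(np.linalg.norm(a, axis=0), np.linalg.norm(b, axis=0))
    return cross / norms

# Hypothetical 2-factor loadings for two of the student groups.
g1 = np.array([[0.80, 0.10], [0.70, 0.20], [0.10, 0.90], [0.00, 0.80]])
g2 = np.array([[0.75, 0.15], [0.72, 0.10], [0.20, 0.85], [0.05, 0.78]])
print(np.round(tucker_congruence(g1, g2), 3))  # near-1 diagonal => similar structure
```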

Q2 (B5 for MSc Occ)

Topic: Logistic Regression

(i) [10% of marks] In stage 1 of the analysis a more complete model is fitted than in stage 2. In stage 2 the model contains only main effects (hence a main-effects model). In stage 1 the model contains a covariate, a factor and a covariate*factor interaction. This is NOT a full factorial model because such models do not contain factor-by-covariate interaction terms.

(ii) [40% of marks]

- The model fitting information table includes information about the fit of two models: one is a model with no effects, just the intercept (the intercept-only model), and the other (the final model) is the model specified for this stage (for stage 1 the model is as described in (i) above). The -2 * log likelihood (-2LL) value for each model is given in the table, with larger values indicating worse-fitting models. The difference between the -2LL values for two models is a likelihood ratio test statistic, and this is distributed approximately as the chi-squared distribution with degrees of freedom (df) equal to the difference in the number of parameters between the two models. This value is shown in the chi-square column (=29.403) and the statistic is highly significant for 6 df, indicating that there is a statistically significant deterioration in fit from the final model to the intercept-only model. This means that some or all of the parameters in the final model are useful in explaining variance in the outcome (i.e., DV category membership). A good answer might also explain why there are 6 df. The final model contains terms for the age covariate effect, which for a 3-category DV requires 2 parameters; the gender factor also requires 2 parameters; and the age*gender interaction also requires 2 parameters. This explains the 6 df (=2+2+2). Note that effects in logistic regression are really effect*DV interactions, explaining why the 3 levels of the DV are relevant to the number of parameters needed.
- The goodness-of-fit table compares the deterioration of fit between a saturated model (i.e., a model with 0 df, which provides the best possible fit to the data) and the final model for that stage of the analysis. The two statistics (Pearson and deviance) calculate the goodness-of-fit statistic in slightly different ways (the deviance is a log likelihood ratio test). This shows that the final model for stage 1 is not a significantly worse fit to the data than a saturated model with 8 more parameters (the 8 extra parameters in the saturated model explain why these test statistics have 8 df). Thus the extra parameters in the saturated model are not particularly useful in fitting the data. A really good answer might explain why 8 more parameters are required for the saturated model than for the final model of stage 1. The saturated model treats the covariate as if it were a factor with 4 levels, thus requiring 3 (=4-1) parameters, compared with 1 parameter for each effect including the covariate. The effects involving the age covariate are the age main effect (n.b. this is really an age*DV effect: the saturated model requires 3*2 parameters whereas the model with age as a covariate requires 1*2 parameters) and the age*gender interaction (again, 3*1*2 parameters in the saturated model c.f. 1*1*2 parameters in the age-as-covariate model). This leads to a total of 4+4=8 more parameters in the saturated model.
- The likelihood ratio tests table provides information on the deterioration in fit from the final model fitted in this stage to reduced models in which particular terms are removed from the model. It is possible to remove the gender main effect and the age*gender interaction, but it is not possible independently to remove the age covariate main effect (as this is nested under the age*gender interaction) and so a zero-df likelihood ratio test is reported. The intercept row gives the -2LL for the final (complete) model fitted in this stage (the same value as in the model fitting information table). The gender row, for example, shows the -2LL value after removing the gender main effect from the model. Note that the -2LL value is higher, indicating a worse-fitting model. The difference in -2LL values is shown in the chi-square column of the table (=6.584); this is the likelihood ratio test statistic for the deterioration in model fit and it has 2 df, which is the number of parameters associated with the gender main effect (n.b. really gender*DV, hence 1*2 parameters). This statistic is significant, and so we would have a significantly worse-fitting model if we were to remove gender from the model fitted in stage 1. Very importantly, the table also shows that we would not have a significantly worse-fitting model if we were to remove the gender*age interaction effect. This means it is appropriate to move on to stage 2 of the analysis, in which we fit a new model which does not include the gender*age interaction term.

(iii) [10% of marks] The key information is in the likelihood ratio tests table: this shows that we cannot afford to drop either the age or the gender main effect terms without suffering a statistically significant worsening in model fit.

(iv) [20% of marks] The information in the parameter estimates table conveys largely the same information as the likelihood ratio tests table in showing that both age and gender are predictive of DV category membership. The difference is that the parameter estimates table provides a test of each effect within a model containing the other terms, while the likelihood ratio tests provide information on the comparison of two models, one with the effect and one without. The parameter estimates are more informative in that the effects are broken down into each single df, whereas they are aggregated together in the likelihood ratio tests. So the rows marked "ID suspect" reflect the odds of identifying a suspect relative to not identifying anyone in the parade (this third DV category is the reference category, as explained in the question). The effect for age is the log odds ratio for the age covariate (given by the B parameter) or the odds ratio (given by Exp(B)). The odds ratio is the change in the odds of identifying the suspect, relative to not identifying anyone, for each unit increase in the age rating scale. The log odds ratio is significantly below zero (as indicated by the Wald test: B²/(std error)², which is distributed as chi-square with 1 df), which means that the odds ratio is significantly below 1. (This is confirmed by the fact that the upper limit of the 95% confidence interval around the estimated value, 0.882, is still below 1.) Thus for each unit increase in the age covariate score the odds of identifying a suspect (relative to not identifying anyone) drop by (100 - 75.5)% = 24.5%. The table also shows that the odds of falsely identifying a volunteer from the ID parade, relative to identifying no one (see the "ID volunteer" rows), do not drop significantly with increasing age covariate score. However, the effects of gender, revealed by the parameter estimates table, are significant both for identifying a suspect and for falsely identifying a volunteer. In both cases the odds of identification are lower for men than for women. The odds ratios for men:women are given in the gendwitn=1 rows, and are 0.644 and 0.471 for identifying a suspect and identifying a volunteer respectively. These odds ratios are both significantly below 1. This pattern is consistent with a bias towards making an identification (whether accurate or not) in women compared to men.

(v) [20% of marks] A serious problem for these analyses is the fact that many ID parades were viewed by more than 1 witness. Factors specific to a particular parade are likely to mean that the responses of different witnesses to the same parade are not independent of one another (for example, these responses are likely to be less independent of one another than the responses of different witnesses viewing different parades relating to separate crimes). Logistic regression requires that all the responses being analysed are independent of one another. A very good answer might suggest solutions to this problem: analyse only 1 witness from each parade (throwing away some witness data, randomly selected, when there is more than 1 witness per parade); or use multilevel logistic regression, a technique designed to deal specifically with this problem.
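A minimal sketch of the two-stage model comparison described above, using Python's statsmodels multinomial logit (the data file, column names and codings are assumptions; this mirrors, rather than reproduces, the SPSS analysis in the question):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

df = pd.read_csv("witnesses.csv")  # hypothetical columns: outcome (0/1/2), age, gender (0/1)

# Stage 1: covariate + factor + covariate*factor interaction.
X1 = pd.DataFrame({"const": 1.0, "age": df["age"], "gender": df["gender"],
                   "age_x_gender": df["age"] * df["gender"]})
m1 = sm.MNLogit(df["outcome"], X1).fit(disp=0)

# Stage 2: main effects only (drop the interaction).
X2 = X1.drop(columns="age_x_gender")
m2 = sm.MNLogit(df["outcome"], X2).fit(disp=0)

# Likelihood ratio test for removing the interaction: difference in -2LL with
# df = (columns dropped) * (DV categories - 1), i.e. 1 * 2 = 2 here.
lr = 2 * (m1.llf - m2.llf)
dof = (X1.shape[1] - X2.shape[1]) * (df["outcome"].nunique() - 1)
print("LR chi-square:", round(lr, 3), "df:", dof,
      "p:", round(stats.chi2.sf(lr, dof), 4))
print(np.exp(m2.params))  # Exp(B): odds ratios for each non-reference DV category
```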

Q3 (B6 for MSc Occ)

Topic: Multiple Regression

(i) [10% of marks for question] RT is significantly and negatively correlated with sports ability ratings (SARs). This means that lower (i.e. faster) RTs are associated with higher (i.e. better) SARs. The correlation between handedness (1=R; 2=L) and SARs is significant and positive, and so, as L-handers have higher hand scores (=2), this means that L-handers are rated as having higher sports ability.

(ii) [20% of marks for question] The technique is called HIERARCHICAL multiple regression (or SEQUENTIAL); no marks for STEPWISE or STATISTICAL or FORCED ENTRY MR. It is suitable because the influence of the "uninteresting" variables (IQ; self-rated sports enjoyment; handedness) is removed ("allowed for") first on step 1, permitting investigation of the additional independent contribution of the variables of main interest (RT task performance; extraversion) on the second step. Credit can be given for intelligent suggestions of EASILY recorded variables which might plausibly influence SARs, and the idea that these should also be included on step 1. Not too many variables should be added, as this may compromise the analysis with a relatively modest sample size (N=100). For example, I would have coded for school and, assuming more than 1 sports teacher per school, for the sports teacher actually doing the rating of each student.

(iii) [20% of marks for question]

- Note the sample size is adequate, neither too large nor too small (according to e.g. Green, 1991), for the overall model, but on the margins (slightly too small) for testing individual predictor contributions.
- The data are broadly adequate for MR: all ordinal/linear (scale) data or dummy variables.
- Will want to screen the data, checking for normality of distribution, univariate and bivariate outliers (frequency plots and scatterplots), multivariate outliers (Mahalanobis distance), and illegal values.
- Collinearity between pairs of IVs can be checked via their bivariate correlations (should be below 0.9), and multicollinearity within the set of IVs used (explain what this is) should be assessed either by calculating tolerances (explain what these are and what are likely to be unsafe values) or by using collinearity diagnostics.
- Multivariate normality can be assessed by scatterplots on selected pairs of variables (checking for linearity, normality and homoscedasticity); variables with very different skews may be useful to plot in this connection. Alternatively, violations of multivariate normality can be revealed by examining plots of residuals against predicted DV values.

(iv) [40% of marks for question] No credit for commenting on the correlations (except to illuminate the regression analysis).

- The model summary table shows that Model 1 (i.e. the model entered on step 1) has a highly significant ability to predict the DV, judged on the F-change statistics (which reflect the F-statistics for that model itself, as the "change" on the first step is the change c.f. nothing; hence these stats are the same as in the ANOVA table for model 1). The 3 IVs explain just under 40% of the variance in the DV. Note that adjusted R² takes account of the number of IVs included and also adjusts for the fact that unadjusted R² overestimates the population value of R². The standard error of the estimate can be compared with the standard error of the raw DV to show how much knowledge of the IVs can be used to reduce error in estimating the true value of the DV.
- The complete model (model 2, after inclusion of the 2 IVs on step 2) has a barely improved R² (adjusted or unadjusted), and the change statistics for step 2 (i.e. the improvement in model fit after adding in the new IVs on step 2) indicate that the 2 IVs of step 2 add very little to the model (they explain an additional 0.2% of the variance; see R² change).
- The ANOVA table has already been alluded to, although it does show that model 2 still has a highly significant ability to predict the DV (this of course derives from the 3 IVs entered on step 1).
- The coefficients table gives the regression coefficients (B) for the constant and each of the IVs in each of the 2 models. Each coefficient is reported with the std error for that coefficient and the t-test statistic which tests whether each coefficient is significantly different from zero. Note that the t value is just the coefficient divided by its std error. A standardised coefficient (beta) is also reported; this is simply the coefficient for the regression equation had both the DV and the IVs been standardised (and allows a better comparison of the relative importance of the separate IVs). The results for model 1 show that each of the entered variables makes a significant independent contribution to the overall regression model (each t-test is significant): these tests assess the influence of each IV with all the other IVs held constant. The sign of the coefficients tells us the direction of the effects (+ve for all means that the left-handed, those with high IQ, and those who enjoy sports have better SARs). The tolerance statistics are derived by treating each IV in the model as a DV and performing a multiple regression using the remaining IVs: tolerance is (1-R²) from such a model. Multicollinearity occurs when an IV is extremely well predicted by a linear combination of the other IVs in the model, thus tolerance (TOL) should not be low (various figures are suggested by different authorities: not below 0.25 or 0.1). VIF (variance inflation factor) is simply 1/TOL. Clearly there are no collinearity problems with model 1 or 2. The t-tests on the regression coefficients for model 2 confirm that there is no significant independent contribution from either of the variables of interest (RT or extraversion). Discussed more in (v) below.

(v) [10% of marks for question] The findings DO confirm the previous findings on sports ability ratings (IQ, liking of sports, and handedness are all associated with SARs). The present data show that they each make a significant independent contribution and so may be associated with different processes which contribute to being judged good or poor at sports. The "new" variables (RT, extraversion) of primary research interest did not explain significant additional SAR variance. Note that both of these variables do have significant simple correlations with the SARs. The regression analysis tells us, however, that their relationships with SAR are not independent of the contribution of IQ, self-reported liking of sports, and handedness. For example, higher-IQ students are shown by the correlations to be both faster at RT tasks and judged to be better at sports. This may derive from a genuine effect conferred by IQ and so explain why RT task performance might not index any variation in SARs that is not already indexed by IQ. Of course, multiple regression does not lead one to particular causal interpretations. Alternative causal models can also fit the regression results: people who are born with fast responses, assuming this is detected by the RT task used, are more likely both to become good at sports and to develop a higher IQ (some alternatives may be more plausible than others).
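A minimal sketch of the two-step hierarchical regression and the R-squared-change F-test described above (Python statsmodels; the data file and variable names are assumptions):

```python
import pandas as pd
import statsmodels.api as sm
from scipy import stats

df = pd.read_csv("sports.csv")  # hypothetical columns: sar, iq, enjoy, hand, rt, extra

# Step 1: the "uninteresting" control variables only.
X1 = sm.add_constant(df[["iq", "enjoy", "hand"]])
m1 = sm.OLS(df["sar"], X1).fit()

# Step 2: add the variables of main interest.
X2 = sm.add_constant(df[["iq", "enjoy", "hand", "rt", "extra"]])
m2 = sm.OLS(df["sar"], X2).fit()

# F-change test for the R-squared increment on step 2.
n = len(df)
k_added = X2.shape[1] - X1.shape[1]  # IVs added on step 2
df_resid = n - X2.shape[1]           # n - predictors - 1
r2_change = m2.rsquared - m1.rsquared
f_change = (r2_change / k_added) / ((1 - m2.rsquared) / df_resid)
p = stats.f.sf(f_change, k_added, df_resid)
print(f"R2 change = {r2_change:.4f}, F({k_added},{df_resid}) = {f_change:.3f}, p = {p:.4f}")
```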

Question 4

Topic: MANOVA and Repeated measures MANOVA

(i) [20% of marks for question] A doubly multivariate analysis includes two different types of multivariate effects: typically, as in the present study, this involves a number of measures (>2) of the same construct, each taken at several time-points (>2 time-points in total). The first multivariate effect involves combining the separate measures into a single DV, and the second multivariate effect involves a multivariate repeated-measures analysis (profile analysis) over the different time-points.

(ii) [30% of marks for question] The predicted pattern of results on the clustering scores would correspond to an interaction between group and delay, such that the decrease in clustering over delay would be significantly greater for the experimental group (group 1), who had an imposed structure on the presented lists, than for the control group (group 2), who created their own structure within the lists. There was also an expectation of a group main effect, reflecting a higher overall level of clustering in group 1. Of course, even if group 1 started out with higher clustering scores, a sufficiently greater drop in clustering over delays in group 1 than in group 2 could prevent the detection of a main effect of group.

Inspection of the doubly multivariate results ("Multivariate Tests") shows that there was a significant delay*group interaction. Inspection of the table of means for each clustering score separately confirms that the interaction was in the appropriate direction (if we presume that a higher clustering score indicates more clustering in the recall responses). In group 1 there was a steady decrease in clustering scores across intervals for each of the cluster measures. In group 2, there was little change in any cluster measure across the delays. The overall group main effect did not reach significance (p=0.126), and so the predicted main effect of group did not occur (higher cluster scores at the shortest delay in group 1 were offset by the greater decline in clustering across delays).

(iii) [30% of marks for the question] The main advantage of carrying out a doubly multivariate analysis is that only a single delay*group interaction test, and a single group main effect test, is needed to test the predictions of the study. With a set of 3 separate singly multivariate analyses, one for each clustering measure, there are 3 tests of the interaction and 3 tests of the main effect. The significance levels of these tests need to be adjusted for the multiple tests. If we consider the 3 delay*group interactions in the singly multivariate analyses then we can see this advantage at work: the p levels for these 3 analyses were 0.039, 0.126 and 0.123. None of these would be significant using an adjusted p-level (e.g. for Bonferroni, the critical p would be 0.05/3). The singly multivariate results for the main effect of group are not presented and so we cannot make the same comparison here. Also, under infrequent circumstances, and separately from the issue of multiple comparisons, doubly multivariate approaches may be able to reveal differences that would not show up in singly multivariate analyses.

The main disadvantage of the doubly multivariate approach is that it is not useful if one is interested in differences between the 3 separate clustering measures (i.e. if your predictions were made for only one clustering measure but not the other 2, then you would want to include clustering scoring method as another repeated-measures factor). Also, if there are several moderately correlated DVs (more than about 3), the power of MANOVA relative to separate ANOVAs is substantially reduced.

(iv) [20% of marks for question] With only one clustering measure, the researcher would then have to decide only how to treat the repeated-measures factor (delay). A univariate analysis might be preferable to a multivariate analysis here. However, the univariate (but not the multivariate) analysis requires that the covariance matrix across time-points possesses a property called sphericity. If this assumption of the univariate analysis is not met then the analysis can be badly biased; moreover, the sphericity assumption is often violated in psychological designs like the present one, because measures that are adjacent in time (e.g. at delays 1 and 2) will typically be more highly correlated than measures that are more separated in time (e.g. at delays 1 and 4). This pattern of covariance over time is not spherical. One can determine whether the data possess sphericity by applying a test of the assumption (the best known is Mauchly's); if the test is significant then the data are not spherical. For moderate departures from sphericity, it is possible to correct the univariate results for the degree of non-sphericity and use corrected (Greenhouse-Geisser or Huynh-Feldt) statistics. (A sketch of the Greenhouse-Geisser correction factor follows below.)
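A minimal sketch of the Greenhouse-Geisser epsilon computed from the sample covariance matrix of the repeated measures (numpy; the data array is a hypothetical subjects-by-delays matrix):

```python
import numpy as np

def gg_epsilon(data: np.ndarray) -> float:
    """Greenhouse-Geisser epsilon for a one-way repeated-measures design.

    data: (subjects x k) scores over k repeated measurements. Epsilon is 1
    under perfect sphericity and falls towards 1/(k-1) as sphericity is
    violated; it is used to shrink the univariate ANOVA df."""
    k = data.shape[1]
    s = np.cov(data, rowvar=False)        # k x k sample covariance matrix
    c = np.eye(k) - np.ones((k, k)) / k   # centering matrix
    sc = c @ s @ c                        # double-centred covariance matrix
    return np.trace(sc) ** 2 / ((k - 1) * np.sum(sc ** 2))

rng = np.random.default_rng(1)
scores = rng.normal(size=(30, 4))   # hypothetical 30 subjects, 4 delays
scores += rng.normal(size=(30, 1))  # shared subject effect (compound symmetry)
print(round(gg_epsilon(scores), 3))  # near 1: little departure from sphericity
```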

Question 5

Topic: ANCOVA

(i) [30% of marks for question]

- The first use is to remove the effects of noise variables in experimental designs. For example, 3 groups of subjects might be compared on psychometric tests after being randomly assigned to a particular intervention (training, diet, etc.). One might know that performance on the tests would be associated with initial IQ score (before randomisation) and so one would want to remove the influence of IQ. The intention of this approach is to increase the power of the statistical tests for the effects of the experimental variables.
- The second use is to try to "correct for" differences in a nuisance variable which differs between the groups in nonexperimental designs. In the above example the 3 groups might be patient and control groups who are being investigated for differences in memory test performance, and the intention is to see whether the memory differences between the groups are significant after adjusting for the IQ differences present between the groups.
- The third use of ANCOVA is in so-called Roy-Bargmann step-down analyses carried out after finding a significant effect in a MANOVA. The MANOVA might test for group differences on a set of DVs. One has to have an a priori priority ordering of the DVs (based on theory or other considerations). One begins with the highest-priority DV and tests it in a simple ANOVA (adjusting for the total number of comparisons). Then the next-highest-priority DV is tested via an ANCOVA with the higher-priority DV acting as a covariate. The procedure repeats down the priority order, with all the higher-priority DVs acting as covariates at each step. The intention is to understand the relative contribution of the DVs to the MANOVA effect: one is seeing whether there is an effect for a particular DV even after removing the influence of higher-priority DVs (rather like hierarchical or sequential multiple regression).

(ii) [70% of marks] {Answers which just say that ANCOVA's assumptions are often violated, e.g. homogeneity of regression etc., do not get any marks.} The second of these uses of ANCOVA is often inappropriate and is likely to give rise to misleading results. The basic problem is that, by using regression methods (in the ANCOVA) to equate all the groups on the covariate(s), one cannot be sure that one is not removing part of the intrinsic difference between the groups, thereby testing an almost meaningless hypothesis about a group variable that doesn't really reflect the group difference that one wants to test. If the between-groups differences on the covariate arose by chance (e.g. sampling error), as they must do in a randomised design, then removing the effect of group differences on the covariate may be OK (as one will not be removing anything intrinsic to the nature of the groups, and this will be a by-product of the first use of ANCOVA above). It may be possible to make a case that the group differences on the covariate arose by chance even for a nonexperimental design (an example is a comparison of smokers and nonsmokers where the two groups differed on age; this is likely to have been sampling error, as there is no clear reason why a sample of smokers and nonsmokers should differ in age). In these cases, too, one might safely remove the influence of the group differences on the covariate.

Examples to illustrate the problem. From Lord: do boys end up with a higher final weight (DV) after following a specific diet than girls (gender=IV), even when including initial weight as a covariate? Part of the intrinsic gender difference is in weight, and so using ANCOVA here would end up comparing the weight gain for relatively light boys with the weight gain for relatively heavy girls. This is not the hypothesis we want to test, and there are issues about regression to the mean as well (if we sampled light boys at the start of an experiment then they would be likely to have gained more weight than heavy boys, or indeed heavy girls, at a later testing point by regression to the mean). From Miller and Chapman: imagine using ANCOVA to answer the question "would six and eight year old boys differ in weight if they did not differ in height?" Once again ANCOVA would create a comparison of short 8-year-olds with tall 6-year-olds. Do we want to ask that question? Other examples: imagine asking if chronic schizophrenics have impaired memory c.f. controls, allowing for differences in IQ (which are known to exist). Low IQ is an intrinsic part of chronic schizophrenia. (A small simulation of this problem follows below.)
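The problem can be made concrete with a small simulation (Python statsmodels; all numbers invented for illustration). The groups differ intrinsically on the covariate, and the DV tracks the covariate, so the groups genuinely differ on the DV; the ANCOVA-style adjustment removes that difference and instead answers the counterfactual question "would the groups differ if they did not differ on the covariate?":

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200

# Hypothetical nonexperimental design: group 1 is intrinsically lower on the
# covariate (think IQ in the chronic schizophrenia example above).
group = np.repeat([0, 1], n)
iq = rng.normal(100, 10, 2 * n) - 10 * group
dv = 0.5 * iq + rng.normal(0, 5, 2 * n)  # DV driven by the covariate alone

df = pd.DataFrame({"dv": dv, "group": group, "iq": iq})
raw = smf.ols("dv ~ group", df).fit()       # unadjusted group comparison
anc = smf.ols("dv ~ group + iq", df).fit()  # "corrected for" the covariate

print("raw group effect:     ", round(raw.params["group"], 2))  # ~ -5: a real difference
print("adjusted group effect:", round(anc.params["group"], 2))  # ~ 0: difference "removed"
```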

Question 6

Topic: Contrasts

(i) [10% of marks] T12 and T22 are the DVs which should show most errors for patients with this syndrome. This is true for T12 (relative to the other session-one tests, T11 and T13) for all groups. This is expected because all groups have been treated identically (placebo) at this point. Groups 1 and 2 receive a drug treatment in session 2 and group 3 continues to receive placebo. I presume that the patients are blind to their drug treatment condition. Note that group 3 continues to show the same pattern of errors over T21, T22 and T23 that they showed over T11, T12 and T13 (i.e. most errors on Tn2). However, the other 2 drug-treated groups do not show an increase in errors on T22 relative to T21 and T23. Thus, it appears that the active drugs are both having the predicted effects on the specific problems of face perception demonstrated by patients with this syndrome. An annotated graph would be a good way to answer this question.

(ii) [10% of marks] It is generally not considered important for an overall interaction term to be significant in order for one to probe contrasts relevant to that interaction. The older (opposite) view was true for certain specific (and now out-of-favour) post hoc contrast methods (Fisher's LSD test). In a complex design one's specific predictions may relate to particular cells which contribute part of an overall interaction term. Thus it is possible for the specific part of the interaction to be significant (as predicted) while the overall interaction might not reach significance. It is better, and more hypothesis-driven, to go directly to the contrasts of interest.

(iii) [10% of marks] The sphericity test for Trial does not approach significance, which means that the covariance matrix relevant to the Trial effect has the property of sphericity, and so uncorrected univariate ANOVA statistics can be reported appropriately ("Sphericity assumed" results in the SPSS printout). The equivalent test for session by trial indicates a very mild (but significant) departure from sphericity (p=0.04), and so uncorrected ANOVA statistics for that effect should not be reported. Corrected ANOVA statistics are indicated in this case, and because of the value of epsilon the Huynh-Feldt correction may be the more appropriate. The epsilon for the session effect is 1 (perfect sphericity), as it must be when there are only 2 levels of a repeated-measures factor.

(iv) [20% of marks] Considering the LMATRIX: the effect of group is partitioned into two single-df components. The first is 0.5 0.5 -1, which contrasts the average of groups 1 and 2 (the drug-treated groups) with group 3 (the placebo-treated group). The numbers are ordered as for the groups, thus it is 0.5*group1 + 0.5*group2 - group3. The other contrast (1 -1 0) compares the two drug-treated groups with one another.

Considering the MMATRIX: the pattern specified by the numbers is the correct pattern for the specific part of the interaction of the within-Ss factors (session*trial) related to the predictions. Under the placebo session (session 1), trial 2 is predicted to be associated with more errors than either trial 1 or 3, so the numbers are designed to compare trial 2 with the average of the other two trials (hence 0.5 -1 0.5); under the session in which some groups received drugs (session 2), a cross-over interaction would mean an inverted pattern of behavioural effects, and this is described by using the same set of coefficients but with reversed signs (i.e. -0.5 1 -0.5). We must specify a cross-over interaction effect as this is the only way to keep the interaction term independent of (i.e. orthogonal to) the main effects of trial and session.

(v) [10% of marks] The descriptive label for the LMATRIX as "orthogonal contrasts" means that the two single-df contrast components are independent of one another. The size of one contrast will not have any effect on the other because the contrast coefficients are uncorrelated with one another. To illustrate this we cross-multiply the contrast coefficients and then add up the results:

(0.5 * 1) + (0.5 * -1) + (-1 * 0) = 0.5 - 0.5 + 0 = 0
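The same check in code: two contrasts are orthogonal exactly when the dot product of their coefficient vectors is zero (numpy, with the coefficients from the LMATRIX above):

```python
import numpy as np

l1 = np.array([0.5, 0.5, -1.0])  # (drug groups averaged) vs placebo group
l2 = np.array([1.0, -1.0, 0.0])  # drug group 1 vs drug group 2
print(np.dot(l1, l2))            # 0.0 => the two contrasts are orthogonal
```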

(vi) [40% of marks] The first contrast of interest is the interaction between the L1 part of the LMATRIX contrast (comparing the two drug-treated groups with the placebo-treated group) and the within-subjects contrast (from the trial by session interaction). The contrast is tested by a t-statistic (or F-statistic) which is based on the contrast estimate minus the hypothesised value, divided by its standard error (for the t-statistic; F = t²). Thus the t-value from the "Contrast Results" table is 0.06738/0.029 = 2.32, which is significant, p=0.022 (this is for a 2-tailed test, and we could argue for a one-tailed test as we correctly predicted the direction of the contrast effect, i.e. the contrast value is positive as predicted). The degrees of freedom (df) for the t-test are the error df (given in the "Test Results" table; i.e. 59; for the F-test the df are 1, 59).

This result means that the two drug-treated groups combined showed a deterioration in performance (increased errors) on T12 (c.f. T11 and T13) when drug-free that was greater than their deterioration on T22 (c.f. T21 and T23) when drug-treated, and this pattern was significantly more marked for them than it was for the control group treated with placebo throughout. The other contrast (L2 * within-Ss contrast) is not significant (t[59]=0.47, p=0.637), which indicates that the two drug-treated groups did not differ significantly in the pattern of errors over T11 to T23.

The “Test results” output combines the between-within interaction contrast based on L1 with the equivalent interaction based on L2. This is not particularly interesting with respect to the hypotheses under test and need not be reported.

Question 7

Topic: Classical Test Theory

For a variable xobs the following expression is the basis of classical test theory (CTT):

xobs = xtrue + x

(i) [10% marks] xobs is the observed (i.e. measured) value of a variable x; xtrue is the true value for the variable; and errx is the error term associated with the measurement of xobs.

(ii) [15% marks] The error term is random, which means that it has zero mean (i.e. no systematic bias) and is also uncorrelated with xtrue. Thirdly, the error term is assumed to be drawn from a normal distribution.

(iii) [15% marks] We define 3 variances:

σ²true is the variance associated with the true score of x; σ²obs is the variance associated with the observed score of x; and σ²error is the error variance.

Reliability = σ²true / (σ²true + σ²error) = σ²true / σ²obs

(iv) [25% marks]

Let p be the true value of the process being measured, with variance = σ²p. We can thus write:

xobs = p + error1 and yobs = p + error2

Because x and y measure p with equal reliability, σ²error1 = σ²error2 = σ²err

Thus the reliability of xobs = σ²p / (σ²p + σ²err)

The average score, Ave = (xobs + yobs)/2 = p + 0.5*error1 + 0.5*error2

The variance of 0.5*error1 is 0.25*σ²err, and so the total error variance associated with Ave is 0.5*σ²err. The reliability of Ave is therefore

σ²p / (σ²p + 0.5*σ²err), which is greater than the reliability of xobs or yobs.

(v) [25% marks]

Let us assume xobs is the value measured at one time-point and yobs is the value of the same variable measured at another time point. The correlation between x and y, rxy, is defined as rxy = Covar(xobs, yobs) / sqrt(Var(xobs)*Var(yobs)) where Covar(a, b) is the covariance between a and b.

From the information given in previous parts of this answer, the expected values of the sample variances of x and y can be written as:

Exp{Var(xobs)} = σ²p + σ²errx and Exp{Var(yobs)} = σ²p + σ²erry

The covariance (shared variance) between x and y is σ²p. It follows from the definition of the correlation between the measures, and the expected variance results, that we can obtain the following result for the expected value of the correlation:

Exp{rxy} = σ²p / sqrt((σ²p + σ²errx) * (σ²p + σ²erry))

If we assume that the measures at the two time-points have equal reliability, then σ²errx = σ²erry = σ²err. From this it then follows that

Exp{rxy} = σ²p / (σ²p + σ²err), i.e. the test-retest correlation will approximate the reliability of the measure.

(vi) [10% marks] Cronbach's alpha measures the internal consistency of a scale comprising multiple items. It is the extent to which all the items measure the same construct. (A really good answer might explain how, for a two-item scale, Cronbach's alpha is identical to the definition of reliability for an average score, given in part iv.)
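A minimal simulation sketch (numpy) checking the two results derived above: averaging two parallel measures raises the reliability to σ²p/(σ²p + 0.5*σ²err), and the test-retest correlation approximates the reliability of a single measure. The variances are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
var_p, var_err = 1.0, 1.0                   # true-score and error variances

p = rng.normal(0, np.sqrt(var_p), n)        # true scores
x = p + rng.normal(0, np.sqrt(var_err), n)  # two parallel measurements of p
y = p + rng.normal(0, np.sqrt(var_err), n)
ave = (x + y) / 2

rel_x = var_p / (var_p + var_err)           # theoretical reliability of x (0.5)
rel_ave = var_p / (var_p + 0.5 * var_err)   # theoretical reliability of Ave (0.667)

print("test-retest r:", round(np.corrcoef(x, y)[0, 1], 3), "vs", rel_x)
print("reliability of Ave:", round(np.corrcoef(p, ave)[0, 1] ** 2, 3),
      "vs", round(rel_ave, 3))  # reliability = squared correlation with true score
```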

Question 8 (no-one did it)

Question 9

Topic: Power

(i) To increase power one could:

- raise the alpha level to 0.05
- increase the total sample size (most efficiently by keeping the group sizes equal)
- measure the brain structures on more than 1 occasion and average, to decrease the amount of measurement error

(ii) Figure 8.1 from Howell should be reproduced and fully labelled. A really good answer should explain what the means (μ0 and μ1) are and what the hypotheses (H0 and H1) are. The means are the values of the mean difference between the groups on the measure of interest under the relevant hypotheses (H0 states that the mean difference is zero; H1 that it is different from zero).

(iii) Figures 8.2 and 8.3 from Howell should be reproduced for a really good answer (i.e. 2 figures are needed). They should be related to the answers given in part (i). Note that for the increase in sample size, or the reduction in error, the diagram should be accompanied by an explanation of why the distributions are narrower than in the earlier diagrams. The reason is that the values are expressed in terms of the standard deviation of the mean difference (this is the standard error of the mean difference), which (for equal sample sizes of n per group) is given by

standard error of mean difference = sqrt(2)*s/sqrt(n) = s*sqrt(2/n),

where s is the sample standard deviation in each group (assuming the group standard deviations are sufficiently similar to be pooled).

So if n increases then the standard error of mean difference decreases and the distributions get narrower.

(iv) The effect size, d, is given by (μ1 - μ0)/s. μ1 is estimated by the observed mean difference between the groups (= 27 - 21.5 = 5.5 mm³) and μ0 = 0. The given value of s = 11 mm³, and so d is 0.5.

(v) The key trick here is that a one-tailed test will be used (which increases power), and so the value of X in the formula should be changed from the z-value below which (100 - α/2)% of a normal distribution lies (with α expressed as a percentage) to the z-value below which (100 - α)% lies. The answer must state this, in some form, as a student could get it "right" by accident by applying the two-tailed X formula incorrectly (i.e. not dividing alpha by two). Give some credit for working even when a wrong answer is arrived at (as long as the error can be seen). So X = 1.64 (1.96 for a two-tailed test) and Y = 0.84. Thus, m = 2*(1.64 + 0.84)²/0.5²,

which is 49.2; i.e. 50 subjects per group are needed.
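The same calculation with exact z-values (Python/scipy; the formula m = 2*(X + Y)²/d² is the one used above; the rounded z-values in the text give 49.2 where exact values give 49.5, and either way 50 per group are needed):

```python
import math
from scipy.stats import norm

d = 0.5                   # effect size from part (iv)
alpha, power = 0.05, 0.80
X = norm.ppf(1 - alpha)   # one-tailed critical z (1.645); use 1 - alpha/2 if two-tailed
Y = norm.ppf(power)       # z below which 80% of the normal distribution lies (0.842)
m = 2 * (X + Y) ** 2 / d ** 2
print(round(m, 1), "->", math.ceil(m), "subjects per group")  # 49.5 -> 50
```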

(vi) More subjects are needed for 90% power (a good answer may actually calculate the number needed, although this isn't necessary). Clearly, the initial study, with only 15 per group, had a much lower power than 80% and so was likely to commit a Type II error (assuming that the measured effect size was an accurate estimate of the true effect size). A really smart student could calculate the approximate power by rearranging the formula to estimate Y from the other quantities (d, m and X). This method gives a value (for a two-tailed test and a 0.05 significance level) of Y = -0.59.

This is the z-value of the power; the estimated power is the proportion of the normal distribution which lies below -0.59. Clearly, this lies between 50% (Y=0) and 16% (Y=-1). Obviously the power would have been even smaller for the 0.01 significance level intended in the initial study.
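A minimal sketch of this back-calculation of power for the initial study (scipy; values as in the text):

```python
import math
from scipy.stats import norm

d, m = 0.5, 15                     # measured effect size; subjects per group initially
X = norm.ppf(1 - 0.05 / 2)         # two-tailed critical z at the 0.05 level (1.96)
Y = math.sqrt(m * d ** 2 / 2) - X  # rearranged from m = 2*(X + Y)**2 / d**2
print(round(Y, 2), "-> power ~", round(norm.cdf(Y), 2))  # -0.59 -> about 0.28
```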
