GOLDSMITHS University of London
PSYCHOLOGY DEPARTMENT
MSc in RESEARCH METHODS IN PSYCHOLOGY 2004
PS71020A STATISTICAL METHODS
SECTION C
7. A researcher investigated learning performance in schizophrenic patients and matched controls. She tested half the patients and half the controls under standard learning conditions and the remaining halves of the two groups under special learning conditions. These were designed to reduce the learning deficit (relative to the controls) that is usually observed in the patients under standard learning conditions. She was therefore predicting an interaction between learning condition and subject group in a 2x2 independent groups ANOVA carried out on the learning scores. Unfortunately, she could not use parametric methods because her dependent variable (the learning measure) was bimodally distributed in all cells of her design.
(i) Explain in detail how the researcher might investigate her key hypothesis using nonparametric (i.e., rank-based) statistics. (15 marks)
(ii) Explain in detail how the researcher might investigate her key hypothesis using resampling methods (i.e., bootstrapping and/or randomisation). (15 marks)
The same researcher was interested in whether individual schizophrenic patients were able to generate random sequences of digits, suspecting that some might not be able to do so because of a tendency to perseverate, thereby generating some long sequences of the same digit. Participants were asked to generate random sequences of 1s and 2s, and sequences of 50 responses were recorded for each participant. The researcher wanted to test whether the sequences generated by an individual were random but knew of no standard way to test this. She decided that she would calculate the longest run of the same digit produced by each participant (let this be N digits long). She then needed to estimate the probability that a genuinely random process could generate a run of N digits or longer within a set of 50 digits.
To do this she used a resampling method: she randomised the 50-item sequence produced by an individual participant 10000 times, and calculated the longest run of the same digit within each of the 10000 randomised sequences. She then used these 10000 “longest run values” to estimate the probability of generating a sequence N digits or longer by chance.
(iii) Explain this method more fully, giving details of the logic behind it, alternative methods of randomisation that might be employed, and how she would use the 10000 “longest run values” to estimate the probability she needed. (10 marks)
(iv) The researcher also knew that some schizophrenic patients might show the opposite tendency: i.e., to generate non-random sequences by avoiding runs of the same digit. How would she need to use her 10000 “longest run values” above to also evaluate whether an individual patient was showing this alternative tendency to a significant extent? (5 marks)
An alternative method would have been to take a computer and program it to generate 10000 sequences randomly (i.e., where each digit in each sequence could, with equal probability, be a 1 or a 2).
(v) Why would this alternative method be likely to generate different findings from the resampling method which she employed? (5 marks)
Answer.
(i).
The first step would be to rank the scores of all of the participants, taken as a complete set. [Examiner: the explanation is not completely clear here; “taken as a complete set” probably means what I want you to know and say, i.e. the participants’ scores are ranked irrespective of participant grouping/condition.]
It does not matter whether the ranking is in increasing or decreasing order of scores, but the convention is to award rank 1 to the lowest score, rank 2 to the next lowest, and so on.
According to Conover and Iman, the best non-parametric approach to the 2 x 2 independent groups design is to perform an ANOVA on the ranks, i.e. treating rank as the DV and the two factors of group membership (control vs schizophrenic) and learning condition (standard vs special) as the IVs. For each participant, therefore, their rank in the ordering outlined above is substituted for their score on the learning performance test.
Carrying out the ANOVA would give both main effects and the interaction, but in this case it is only the latter that the experimenter would be interested in. Essentially, if we assume the four groups to be of equal size, with rank means a, b, c and d – “a” referring to patients under the standard learning condition, “b” to patients under special learning conditions, “c” to controls with standard learning, and “d” to controls under special learning – she would be testing whether the statistic (a – b) – (c – d), measuring the interaction effect, was significantly different from zero. [Examiner: might say why, under the null hypothesis of no interaction effect, the value of int is expected to be zero (it is a little awkward to do in the case where either of the main effects might be significant). Under the complete null hypothesis (both main effects and the interaction absent) it is easy to see: a to d should all be equal.]
Let us call this statistic “int”. [Examiner: this answer is very good and is what I was looking for. I’m probably only going to award 13 or 14 out of 15, for perhaps not making a key point clearly enough (see above).]
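To make the rank-transform idea concrete, here is a minimal Python sketch (not part of the original answer). The data, cell sizes and variable names are hypothetical, and the use of scipy, pandas and statsmodels is simply one convenient way of computing the ranks, the cell mean ranks a to d, the “int” contrast, and a rank-based ANOVA.

```python
import numpy as np
import pandas as pd
from scipy.stats import rankdata
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical learning scores for the four cells of the 2 x 2 design.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "score": rng.normal(size=40),
    "group": np.repeat(["patient", "control"], 20),
    "condition": np.tile(np.repeat(["standard", "special"], 10), 2),
})

# Rank every score as one complete set, ignoring group/condition.
df["rank"] = rankdata(df["score"])

# Mean rank in each of the four cells: a, b, c, d as defined above.
cell_means = df.groupby(["group", "condition"])["rank"].mean()
a = cell_means[("patient", "standard")]
b = cell_means[("patient", "special")]
c = cell_means[("control", "standard")]
d = cell_means[("control", "special")]

# The interaction statistic "int": (a - b) - (c - d).
int_stat = (a - b) - (c - d)
print("int =", int_stat)

# Conover-Iman rank-transform idea: run the ordinary 2 x 2 ANOVA on the ranks.
model = smf.ols("rank ~ C(group) * C(condition)", data=df).fit()
print(anova_lm(model, typ=2))
```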
(ii).
Using randomisation (the question allows bootstrapping and/or randomisation; randomisation is chosen here as it is easier), the researcher would pool the whole dataset of DVs, and from this entire dataset randomly select four groups (without replacement) corresponding in size to the four groups in the 2 x 2 setup. For each such randomised dataset, the value of “int” would be calculated. [Examiner: this is an interesting approach. You might say that any “sensible” statistic (i.e. one addressing your hypothesis of interest) could be chosen, including the usual F statistic for the interaction of the two IVs. If she chose an F statistic, then she needs to use randomisation (or the ranking method in the earlier answer) because she is worried that the calculated F statistic will not follow the F distribution under the null hypothesis. You have proposed a “doubly nonparametric” statistic, where you do the ranking first, then compute a mean rank difference (for the interaction effect), and then use randomisation to get the p-value associated with it. This is fine (but not what I was expecting).]
These values of int, taken over say 10,000 randomisations, would have a distribution, with a two-tailed 95% confidence interval centred (roughly) on zero [Examiner: good]; the lower boundary of the interval would be the value of int at the 2.5th percentile, and the upper one at the 97.5th percentile. She would then calculate the interaction statistic for the actual data: call its value INT. If INT fell outside the 95% confidence limits, she would pronounce the interaction significant; if inside, or on the boundary values, it would not be significant. [Examiner: ok -- I was wondering if you were going to explain how to get an “exact” p-value? A quick skim ahead shows you do not explain this. Basically, you take your calculated value of “int” (it could be positive or negative depending on which way around the interaction went) and turn it into an absolute value; call it abs_int. Then you count how many instances of your randomised distribution of “int” statistics lie above +abs_int (say n) and how many lie below -abs_int (say m). The exact two-tailed probability for your interaction is then (n + m)/10000.]
This amounts to a test of the actual data against the null hypothesis that the 2 x 2 grouping had no effect on the pattern of the results. The actual groupings are only compared at the end of the process, after randomisation has been performed on the pooled dataset. Put crudely, this approach makes the null hypothesis fundamental, and the actual data are brought in as an afterthought. [Examiner: good.]
[Examiner: I’m going to give this part of the answer 13/15 because you didn’t explain how to get a p-value.]
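As an illustration of the randomisation test and the exact p-value described above, the following sketch continues from the hypothetical data frame in the earlier sketch; the function name and group/condition labels are assumptions made for the example, not part of the original answer.

```python
import numpy as np
from scipy.stats import rankdata

def int_statistic(scores, group, condition):
    """Interaction contrast (a - b) - (c - d) computed on the ranks of the scores."""
    ranks = rankdata(scores)
    m = {(g, c): ranks[(group == g) & (condition == c)].mean()
         for g in ("patient", "control") for c in ("standard", "special")}
    return (m[("patient", "standard")] - m[("patient", "special")]) \
         - (m[("control", "standard")] - m[("control", "special")])

scores = df["score"].to_numpy()
group = df["group"].to_numpy()
condition = df["condition"].to_numpy()

INT = int_statistic(scores, group, condition)   # interaction statistic for the real data

# Shuffle the pooled scores over the four cells 10,000 times and recompute "int".
rng = np.random.default_rng(2)
null_int = np.array([int_statistic(rng.permutation(scores), group, condition)
                     for _ in range(10_000)])

# Exact two-tailed p-value: how often does a randomised |int| reach |INT|?
p_exact = np.mean(np.abs(null_int) >= abs(INT))

# Equivalent confidence-limit view: is INT outside the 2.5th-97.5th percentile range?
lo, hi = np.percentile(null_int, [2.5, 97.5])
print(INT, p_exact, lo, hi)
```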
(iii).
She would be taking the actual numbers of “1”s and “2”s in the given sequence, and randomly sampling from this pooled set of digits without replacement, calculating the longest run of a single digit – N – that this process produced on each occasion, and plotting a histogram of the distribution of N. [Examiner: good -- this is restating more explicitly what is in the description in the question. Emphasise that the distribution is for 10,000 randomisations of the sequence produced by a particular participant.]
If there were, for example, 20 ones and 30 twos in her sequence [Examiner: i.e. the sequence produced by a particular schizophrenic patient -- there is no fixed sequence, as each patient generates their own attempt at a random sequence of 50 digits], the randomisation would produce loads of sequences with 20 ones and 30 twos in them, but the positions of the ones and twos in the sequences would be completely random. [Examiner: ok to be this basic -- it shows me that you understand the process.]
Basically what she would be doing is taking her set of ones and twos, and randomly assigning the numbers 1 to 50 to them (one number each), this number giving their position in the sequence. Alternatively, she could get the computer to randomly pick twenty different numbers between 1 and 50, make those the ones, and put twos in the “gaps”. [Examiner: giving the actual mechanics of how you do this is good -- it shows me you could do it yourself, or more likely explain to a computer programmer how to write the software to do it for you.]
On the assumption – and this is the null hypothesis – that the individual was indeed producing digits randomly, this process would provide a histogram giving the number of sequences with any given value of N, which would give, proportionately, the probability of finding precisely this value of N. The logic is that if the null hypothesis is true, this is the distribution of N that would be observed, and a significance test of the null hypothesis can be carried out by comparing actual values of N with the confidence limits of the randomised distribution. [Examiner: exactly -- this is expressed well using precise statistical language; once again, a p-value for the value of N for each participant, under the null hypothesis, could be calculated.] Actually we can do more than just generate confidence limits: probabilities can be calculated as follows. Suppose the researcher is looking at the sequence produced by a particular schizophrenic patient, and suppose N = 8 for this sequence. If there are, say, 40 sequences with a value of N >= 8 among her 10,000 randomised sequences, then the proportion of trials giving a value of N at least as great as 8, under a genuinely random process, would be 40/10,000, or 0.4 percent, and this is the probability sought. [Examiner: excellent -- this is what was missing from the part (ii) answer (it was explicitly asked for in this part of the question, so leaving it out would have cost more marks here!).]
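A minimal sketch of this longest-run procedure for a single participant follows; it is not part of the original answer. The 50-digit sequence, the 20/30 split and the helper function name are hypothetical, chosen only to illustrate the shuffling and the upper-tail probability just described.

```python
import numpy as np

def longest_run(seq):
    """Length of the longest run of identical digits in the sequence."""
    best = current = 1
    for prev, cur in zip(seq, seq[1:]):
        current = current + 1 if cur == prev else 1
        best = max(best, current)
    return best

# Hypothetical sequence of 50 digits from one patient (here 20 ones and 30 twos).
rng = np.random.default_rng(3)
sequence = rng.permutation([1] * 20 + [2] * 30)
N_observed = longest_run(sequence)

# Shuffle the patient's own digits 10,000 times (sampling without replacement)
# and record the longest run each time: the null distribution for this patient.
null_runs = np.array([longest_run(rng.permutation(sequence))
                      for _ in range(10_000)])

# Upper-tail probability: proportion of shuffles with a run at least as long as N.
p_upper = np.mean(null_runs >= N_observed)
print(N_observed, p_upper)
```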
An alternative method would be to take the set of (in our example) 20 ones and 30 twos, and generate a 50-item sequence at random by sampling from this set with replacement – essentially a bootstrapping procedure. The probability of a one or a two appearing at any position in the sequence would then be constant, at the “correct” value, but the actual numbers of ones and twos in any given sequence would tend to vary from the 20/30 split. It would thus differ from the randomised sequences described above, which all have the same numbers of ones and twos in them. In most cases, except where the numbers of ones and twos were very disproportionate, the two approaches would give very similar distributions for N. [Examiner: good -- this bit of the answer covers the part of the question requiring you to give “alternative methods of randomisation that might be employed”.]
A third possible method would be to investigate all possible permutations of a given sequence, and calculate the probability directly. If a given sequence produced by a patient had 20 ones and 30 twos in some order, there would be 50!/(20! x 30!) possible arrangements, each one equally likely under the null hypothesis. These could in principle all be checked on a computer, and the probability found that a value of N or more would appear by chance. This would be the most theoretically precise way of doing the calculation, but it is doubtful whether the extra computer time would be worth it. [Examiner: excellent -- a perfect answer need include only one alternative, but two alternatives, both exactly correct, is very impressive.]
[Examiner: 10/10 for this part.]
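For completeness, the sketch below illustrates the two alternatives just described, reusing longest_run, sequence and N_observed from the previous sketch; it is an illustration under those hypothetical assumptions rather than a prescribed method.

```python
import math
import numpy as np

rng = np.random.default_rng(4)

# Alternative 1: resample each of the 50 positions with replacement from the
# observed digits, so the 20/30 split itself varies from sequence to sequence.
boot_runs = np.array([longest_run(rng.choice(sequence, size=50, replace=True))
                      for _ in range(10_000)])
p_upper_boot = np.mean(boot_runs >= N_observed)

# Alternative 2: exact enumeration. With 20 ones and 30 twos there are
# 50! / (20! * 30!) equally likely orderings -- easy to count, but far too
# many (roughly 4.7e13) to be worth enumerating in practice.
n_orderings = math.comb(50, 20)
print(p_upper_boot, n_orderings)
```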
(iv).
If the researcher wished to investigate the significance of a lower than expected value of N found for a particular patient, she would look at the probability of that value of N, or lower, being obtained by chance. If an individual patient had a value of N, say N = 3, such that – for instance – 4.5 percent of the randomised sample [Examiner: should really say “samples” here, or “distribution”] had N <= 3, then the probability that such a low value could occur by chance would be 0.045. [Examiner: good -- at least 9 out of 10; you don’t have to repeat all the same details, I’m looking for what is different here from part (iii). In fact it gets 10 out of 10 because of Note 1 below.]
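Continuing the earlier sketch (reusing its null_runs and N_observed), the lower-tail probability needed here could be estimated as below; this is an illustrative fragment under the same hypothetical assumptions, not part of the original answer.

```python
import numpy as np

# Lower-tail probability: how often does a genuinely random shuffle of this
# patient's digits give a longest run no longer than the one actually produced?
p_lower = np.mean(null_runs <= N_observed)

# One-tailed: significant at the 5% level if p_lower <= .05, but only if a
# lower-than-expected N was predicted in advance; otherwise compare each tail
# to .025 (see Note 1 below).
print(p_lower)
```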
Notes.
1. This could be interpreted as a result significant at the 5% level, i.e. using a one-tailed test of significance, but only if the researcher had specified in advance that she was looking for a lower than expected value of N: perhaps because the particular patient fell into a known sub-category of schizophrenia where avoidance of long runs of the same digit was likely. If she decided in advance only that she was interested in either unusually high or unusually low values of N, then she would need to use a two-tailed test of significance, and declare a low result significant only if the probability of obtaining such a low value by chance was <= .025, and similarly for a high value of N. [Examiner: indeed; this is quite an important point, but I’m reserving only 1 mark out of 10 for it.]
2. In the preamble after item (ii), the question is careful to discuss the tests of digit runs in terms of individual patients. This avoids problems with Bonferroni corrections etc. But the phrase “… suspecting that some might not be able to do so …” implies that she might be intending to draw conclusions from the fact that some patients showed results significant (under a two-tailed test, admittedly) for long runs of digits. It might be worth mentioning, in a complete answer, that if she wanted to avoid Type I errors the researcher should be cautious about drawing conclusions about the prevalence of digit runs in the population of schizophrenic people as a whole. Even if her patients were performing entirely randomly, she would expect about 1 in 20 of them to show a significant result on this test. If she wished to test the hypothesis that some – in the sense of at least one – of her patients were showing unexpected performance on this test, she would have to apply a strict Bonferroni correction to the probabilities, or risk obtaining a spurious result. If on the other hand her a priori hypothesis was that all her patients would show this trait, then no Bonferroni correction would be necessary, provided that she was prepared to regard her hypothesis as unsupported if just one of the patients failed to perform outside chance levels. There is a paper, which I cannot lay hands on at present, which enables one to calculate a more sophisticated experiment-wise error rate, and which is not so ferocious as Bonferroni provided you have more than one “significant” outcome. But presumably this is not required in an exam answer. [Examiner: no, it wasn’t, but well spotted and entirely correct -- if you do have the time to say things like this, all it can do is put me in a positive disposition towards you, and then I will be more likely to give you the benefit of the doubt on any ambiguous wording elsewhere (because if you are smart enough to go beyond the answer correctly, it is almost certain you mean the correct answer elsewhere, e.g. the “sample” rather than “samples” wording above).]
(v).
There are two reasons why the distribution of N generated by the computer could differ from the one she calculated. One is that the computer chose ones and twos with equal probability, whereas her randomisation was based on an actual sequence, which was likely to have unequal numbers of ones and twos in it. [Examiner: that’s it -- 5/5. This question is not something you were “taught”, but if you understand the logic of the resampling approach completely you would see this point immediately; even if you didn’t, you could still get 90% for the rest of the question.]
Her randomisation would tend – I think – to give higher values of N than the computer-generated sequences: in the limit where the numbers of ones and twos were very unbalanced in the actual sequence, there would clearly be a lower limit on N (to take an extreme example, if there were only one “one” and 49 “twos”, N cannot be less than 25). The computer-generated sequences would be extremely unlikely to give such high values of N. [Examiner: thoughtful, correct, but unnecessary.]
The other reason why the two methods might generate different findings is that the computer-generated sequences would have variable numbers of ones and twos in them, whereas the researcher’s randomised sequences would all have the same numbers of ones and twos. This again might affect the distribution of N. [Examiner: true (assuming the researcher did the resampling without replacement), but not the key point.]
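As an illustrative check of this point, the following sketch (reusing longest_run and the hypothetical 20/30 sequence from the part (iii) sketch) builds both null distributions side by side so the difference can be seen directly; it is an assumption-laden demonstration, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(5)

# Null distribution 1: shuffles of the observed digits (fixed 20/30 split).
shuffled_runs = [longest_run(rng.permutation(sequence)) for _ in range(10_000)]

# Null distribution 2: fully computer-generated sequences, each digit a 1 or a 2
# with probability 0.5, so the split itself varies around 25/25.
generated_runs = [longest_run(rng.integers(1, 3, size=50)) for _ in range(10_000)]

# With an unbalanced observed split, the two distributions of the longest run
# will typically differ (the shuffles tending towards longer runs of the
# majority digit).
print(np.mean(shuffled_runs), np.mean(generated_runs))
```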