Individual Differences in Cognitive Planning on the Tower of Hanoi Task: Neuropsychological Maturity or Measurement Error?

D. V. M. Bishop
University of Oxford, U.K.

G. Aamodt-Leeper, C. Creswell, R. McGurk, and D. H. Skuse
Institute of Child Health, London, U.K.

The Tower of Hanoi (ToH) task was given to 238 children aged from 7 to 15 years, and 20 adults. Individual variation within an age band was substantial. ToH score did not correlate significantly with Verbal IQ, nor with ability to inhibit a prepotent response. We readministered the ToH to 45 children after 30 to 40 days. The test-retest correlation of .5 is low in relation to accepted psychometric standards, though at least as high as reliability of the related Tower of London (ToL) in adults. The reasons for low reliability remain unclear; task novelty did not seem to be involved, as children did not improve on retest. We conclude that it is not safe to use this test to index integrity or maturation of underlying neurological systems in children. We compared our results with three published studies using the ToL with children, and found similar levels of performance on problems involving the same number of moves. Another study using automated ToL obtained much poorer scores, suggesting that computerised presentation may impair children’s performance.

Keywords: Assessment, executive function, psychometrics.

Abbreviations: ToH: Tower of Hanoi; ToL: Tower of London.

Requests for reprints to: Dorothy Bishop, Department of Experimental Psychology, University of Oxford, South Parks Road, Oxford, OX1 3UD, U.K. (E-mail: dorothy.bishop@psy.ox.ac.uk).

The ability to plan ahead to solve a problem is an executive function that is thought to depend on integrity of the frontal lobes. The Tower of Hanoi (ToH, see Fig. 1) has been used to assess this skill, and gives a quantitative index of planning ability because one can easily specify the number of steps that are involved in the solution to a problem. Using a simplified version of this task which he termed the Tower of London (ToL: see Fig. 2), Shallice (1982) demonstrated that certain patients with frontal lesions showed pronounced impairments in planning that could not be accounted for in terms of any more basic perceptual or memory problems. Further studies, some of which have used a computerised version of the ToL or ToH, have both replicated findings of deficits in patients with frontal lesions (Owen, Downes, Sahakian, Polkey, & Robbins, 1990; Goel & Grafman, 1995), and also demonstrated frontal activation on PET imaging when normal volunteers perform this task (Baker et al., 1996; Morris, Ahmed, Syed, & Toone, 1993).

Impaired functioning on Tower tasks has been found in various clinical groups, including people with autism or Asperger syndrome (Ozonoff, Pennington, & Rogers, 1991) and children with intellectual disabilities (Borys, Spitz, & Dorans, 1982). Three papers have reported developmental trends for normal children on different versions of the ToL (Anderson, Anderson, & Lajoie, 1996; Krikorian, Bartok, & Gay, 1994; Luciana & Nelson, 1998). Intriguingly, whereas most cognitive tasks show improvement both with age and with higher IQ, it has been suggested that the ToH is related only to age, and uncorrelated with IQ (Welsh, Pennington, & Grossier, 1991).

However, just as interest is mounting in assessing planning abilities, questions have been raised about the psychometric characteristics of Tower tasks. First, there is the issue of validity. As noted above, these tasks have traditionally been viewed as indexing the ability to plan ahead, with difficulty depending on the memory load, which increases with the number of moves that are required to solve the problem (Shallice, 1982). However, Goel and Grafman (1995) drew attention to another facet of these tasks, the extent to which they require the person to inhibit a prepotent tendency to move a disc or ball immediately to its final peg position. They argued that the ToH task is especially difficult for people with frontal lobe lesions because it requires them to see and resolve a conflict between a long-term goal and an immediate subgoal, whereas the ToL task is much easier in this regard. Burgess (1997), in contrast, suggested that what makes these planning tasks so difficult for people with frontal lesions may be their novelty. On this view, once the person has had experience with the task and worked out an optimal strategy, performance is likely to improve.

Humes, Welsh, Retzlaff, and Cookson (1997) noted that little was known about the reliability of Tower tasks. They found only a weak intercorrelation (r = .37) between the ToH and ToL tasks, which they attributed to the low reliability of the ToL, as assessed by an index of internal consistency (Cronbach α was .25 for ToL vs. .90 for ToH). Burgess (1997) noted the relatively weak intercorrelations between different measures of executive function including the ToL, and Lowe and Rabbitt (1998) showed that test-retest reliability for elderly volunteers completing the computerised ToL ranged from .26 to .60, depending on the performance index adopted, with reliability overall falling below conventionally accepted levels. Gnys and Willis (1991) found higher levels of test-retest reliability (r = .72) for 5-year-old children on a noncomputerised ToH task, but the test-retest interval was only 25 minutes.

The reasons for poor test-retest reliability over longer intervals are unclear. Lowe and Rabbitt (1998) noted that if, as suggested by Burgess (1997), task novelty is a critical factor, then poor reliability is to be expected if a subset of individuals show dramatic improvement as they develop a strategy. If this is so, a general improvement in task performance should be seen on retest. This is exactly what was seen in the study by Gnys and Willis (1991) on 5-year-old children, where the mean score on retest was more than 1 SD higher than on first test. However, as noted above, that study used a very short test-retest interval (and indeed obtained relatively high reliability). Other research that found significant improvement on retesting was a study by Aman, Roberts, and Pennington (1998), who administered the test to children aged from 10 to 14 years on two occasions 1 week apart.

In this study, our main aim was to assess developmental trends and test-retest reliability for a ToH task. We also explored sex differences in performance, and correlations between ToH, verbal ability (which might be regarded as important for formulating a plan of action and keeping this in mind), and another executive function measure testing ability to inhibit a prepotent response. We used a modification of the original ToH task, as described by Borys et al. (1982), as data from Krikorian et al. (1994) and Anderson et al. (1996) suggested that the simpler ToL task gives ceiling effects in older children. We also extended the difficulty level by incorporating problems that involved four discs, and required eight or nine moves for their solution. To simplify administration and scoring, we adopted a self-terminating procedure analogous to that used in the well-known Digit Span test (Wechsler, 1992). The child was presented with two problems at each level of difficulty, and was deemed to have solved a problem if it was completed successfully in the minimum number of moves on two out of three trials. If two consecutive problems were failed, then testing terminated, and the subsequent problems were scored as failed. This procedure allowed each child’s performance to be represented by a single number that corresponded to the highest level of difficulty problem that was solved.

Figure 1. The Tower of Hanoi puzzle, showing apparatus consisting of three vertical posts, and three doughnut-like discs, of different colour and size, that fit on the pegs. The testee is shown the model array and must duplicate the arrangement on a second apparatus in the minimum number of moves, while obeying the following rules: (1) only one disc may be moved at a time; (2) a larger disc must not be placed on top of a smaller one; (3) discs may not be placed on the table. The illustration shows a 5-move problem.

Figure 2. The Tower of London puzzle, devised by Shallice (1982) to involve similar reasoning as in the Tower of Hanoi but in an easier format. Rather than discs of different size, perforated balls of different colours are used, and the pegs are of different lengths. The illustration shows a 4-move problem.

Method

Participants

Children were recruited from three primary schools and one secondary school in London. These schools were selected because their pupils came from a wide range of social backgrounds. The test battery was administered to all children whose parents gave consent for participation, with the exception of children who did not have English as a first language. Thirteen children with estimated short form Verbal IQ below 70 were excluded, giving a total sample of 238 children. Sample characteristics are shown in Table 1. The smaller sample size for the 13–15-year-old age group reflects the fact that it is difficult to schedule research testing to fit in with a school curriculum once children enter secondary school. A group of 20 adults aged from 18 to 26 years, recruited from hospital staff, was also tested to provide an indication of adult levels of performance on this test.

Table 1
Constitution of Sample in Terms of Age, Sex, and Verbal Ability

                         Short form Verbal IQ
Age range       N        Mean       SD
7–8 yr
  Boys          31       99.5       16.76
  Girls         35       103.5      18.46
9–10 yr
  Boys          39       95.1       15.22
  Girls         44       95.1       15.35
11–12 yr
  Boys          26       99.2       8.63
  Girls         30       102.8      14.98
13–15 yr
  Boys          18       96.2       12.89
  Girls         15       90.1       16.70
Adults
  Male          5        113.4      6.66
  Female        15       105.8      9.47

Procedure

Children were tested individually in a quiet room at their school in one or two sessions lasting 30 minutes. Forty-five children aged from 7 to 10 years were retested on the Tower of Hanoi task after an interval of 30 to 40 days.

Core Test Battery

Children were given the Vocabulary and Similarities subtests from the Wechsler Intelligence Scale for Children-3rd UK edition, WISC-III (Wechsler, 1992) to provide an estimate of short form Verbal IQ.

The Same-Opposite World subtest from the Children’s Test of Everyday Attention (TEA-Ch: Manly, Robertson, Anderson, & Nimmo-Smith, 1998) was administered to 147 of the children and all of the adults. This subtest, which has similarities to the Day-Night test of Gerstadt, Hong, and Diamond (1994), is designed to assess the ability to inhibit a prepotent response, which is seen as a critical component of executive function. In this test, the testee sees a trail of squares, each containing the written digit 1 or 2. The test starts with a “Same World” trial, in which the tester points to each square in turn, and the task is to name the digit in the square as quickly as possible. The tester’s finger moves on to the next square as soon as the correct name has been supplied, but remains on the same square if an error is made. A practice run is given before the test proper. The time taken to complete the trail is recorded. On the next trial, the tester explains that this is the “Opposite World”, where the task is to say “one” when the digit is 2, and “two” when the digit is 1. A practice run is first administered, and then the test proper. The test continues with one more “Opposite World” trial, and finishes with another “Same World” trial. Time to complete each trial is converted to a rate measure (squares per second), averaged for the two “Same World” trials and the two “Opposite World” trials, and the difference taken. Scores for “Same World”, “Opposite World”, and the difference score were converted to age-adjusted z-scores. Because the test was still in the process of standardisation at the time we conducted our study, we used our own sample as the basis for deriving norms.

Tower of Hanoi Test

Apparatus. Two sets of a wooden apparatus for the ToH test were constructed according to the specifications given by Borys et al. (1982), except that a fourth size of disc was added so that more complex problems could be administered. Thus the apparatus consisted of a board containing 3 upright rods, 6.5 cm apart, and discs of diameter 2 cm, 3.5 cm, 4.5 cm, and 7 cm, coloured yellow, green, blue, and red respectively. The thickness of the discs was 1.7 cm, and each contained a central hole 1 cm in diameter, so that the discs could be fitted over the rods.

Test procedure. The child was given problems of increasing complexity, starting with 3-move problems, and increasing up to 9-move problems, until two consecutive problems were failed. The arrangements for each problem are shown in the Appendix.

At the start of the test, the tester sat facing the child, with one ToH apparatus in front of her containing a two-disc tower on the rightmost peg. The other identical apparatus was in front of the child, with one disc on each of the three pegs.

The tester said: “In this game, I will show you an arrangement of these discs on pegs, and what you have to do is to make an arrangement that looks just the same. That is not as easy as it sounds, because there are certain rules you have to follow. First of all, you are not allowed to put any of the discs on the table. When you let go of a disc, it must be on one of the pegs. Second, you can only move one disc at a time. And third, you can never put a bigger disc on top of a smaller disc, like this (demonstrating WRONG move). That is NOT allowed. You can only make moves like this” (demonstrating RIGHT move).

The tester then demonstrated right and wrong moves, asking the child “would I be allowed to do this?”, including putting discs of right or wrong size on pegs, putting a disc on the table and moving two discs at once, until it was established that the child understood the rules.

She then said: “Now we are ready to begin.” She presented the first three-move (two disc) problem, by arranging her own apparatus in the start position, and the child’s apparatus in the end position (see Appendix), and asked the child to move the discs on his or her apparatus to give the same arrangement of discs.

As can be seen in the Appendix, there were two problems, A and B, at each number of moves. To be credited as passing a given problem, the child had to solve it twice in the minimum number of moves. The child was given up to three trials per problem. After the first successful trial on a problem, the tester said “Good! Show me again”, to indicate that the child should repeat the successful performance, rather than adopting a different solution. The test stopped when this criterion of 2/3 attempts correct was not reached on two successive problems.

When moving on to problems involving three discs, the tester said: “These are getting harder now. It is important to think out how you are going to do it before you start”. For problems where the end state involved placing all discs on the rightmost peg (i.e., the first problem at each move length), the examiner was careful to indicate on her apparatus the rod that the child’s discs had to go on, to ensure that there was no doubt as to which way round the tower should be constructed. In each case, the child’s tower was to be a direct match of the adult’s tower, with both child and adult towers having the discs on the rod that was rightmost from the child’s point of view.

The tester counted the number of moves made by the child, and noted it on the record form under the relevant trial number (see Appendix). If the child moved a disc to another peg and then had a change of mind and moved it back without letting go of it, this was counted as two moves. If the child violated one of the rules, the tester gave a reminder of the rule and restarted the trial, counting this as a failed trial. Rule violations (disc placed on table; two discs moved at once; larger disc put on smaller) were recorded.

The child’s final score was the highest level of task successfully completed (in terms of number of moves), with an additional half point being added if both tasks at this level of moves were passed. Thus a child who passed both 3-move problems, one 4-move problem, and both 5-move problems, and failed both 6-move problems, would achieve a score of 5.5 (highest level passed = 5, plus .5 for passing both problems at this level). A child who passed both 3-move problems, passed only the second 4-move problem, and then failed both 5-move problems would be given a score of 4.0.

Results

Mean scores on Tower of Hanoi are shown in Table 2.

Table 2
Mean (SD) Scores on ToH Task by Age and Sex (Possible Range 3 to 9.5)

                 Male             Female           Total
Age range     Mean    SD       Mean    SD       Mean    SD
7–8 yr        5.6     1.44     6.0     1.33     5.8     1.39
9–10 yr       6.9     1.75     5.9     1.60     6.3     1.74
11–12 yr      7.9     1.61     6.7     1.74     7.2     1.77
13–15 yr      6.8     1.90     6.7     1.36     6.8     1.65
Adult         9.0     0.61     8.1     1.65     8.3     1.50

A regression analysis was carried out to investigate age effects on performance. Analysis of age trends from 7 years to adulthood is complicated by the fact that we would not expect a linear relationship. For most cognitive functions, growth is more rapid in early childhood, and then slows down, with stability being achieved by young adulthood. There is no reason to expect growth in adulthood, and within the adult group, the correlation between age and ToH was nonsignificant (r = .137, df = 18, n.s.). Therefore all adults were arbitrarily assigned the age of 18 years. The correlation between log age and ToH score (r = .370, df = 255) was only marginally higher than the correlation with raw age (r = .366, df = 255), but the former was preferred for use in a regression equation as providing a more plausible developmental model.

The regression equation relating age to ToH score was: y = 2.72·x − 6.589, where y is the predicted score, and x is the natural log of age in months. By subtracting the predicted from obtained ToH score, and dividing by the RMS residual (= 1.646), we can convert ToH scores to age-adjusted z-scores. We considered whether to derive a separate regression equation for males and females, but although there was a statistically significant effect of gender on ToH z-score, males: mean = 0.15, SD = 1.033; females: mean = −0.13, SD = 0.953; F(1, 255) = 4.95, p = .027, the effect size was so small (η² = .019) that it did not seem justified to use separate norms for boys and girls.

Rule-breaking errors occurred on only a small minority of trials: over the whole test, the average number of such errors was 0.358 in the 7–8-year-olds, falling to 0.177 for the 13–15-year-olds. Only nine children in the whole study made more than one rule-breaking error.

The ToH z-score did not correlate significantly with either short form Verbal IQ (r = .089, df = 255) or the z-scores derived from the Same-Opposite World subtest from TEA-Ch (Same rate: r = .139; Opposite rate: r = .136; difference score: r = −.014, df = 163).

Test-retest reliability for the 45 children tested on two occasions was r = .528 for raw scores, and r = .508 for age-adjusted z-scores. Although these values are statistically significant (p < .001), they are low in relation to conventional psychometric criteria. As noted above, one reason for low reliability would be if task novelty were a factor, with some children developing a strategy for performing the task as they became more experienced. Although it was the impression of the testers that some children suddenly got the point of the test, and then showed a dramatic improvement, overall, scores for the two test sessions did not differ significantly: mean for the first session was 5.92, SD = 1.56, and for the second session, 6.13, SD = 1.83; F(1, 44) < 1. There was a suggestion that older children might be more susceptible to learning effects: for children aged 7–8 years, mean score was 5.72 (SD = 1.38) in session 1 and 5.78 (SD = 1.94) in session 2, an improvement of only 0.06 points, whereas for those aged 9 to 10 years, the mean score was 6.14 (SD = 1.72) for session 1 and 6.50 (SD = 1.67) for session 2, a gain of 0.36 points. However, this difference was not statistically significant, with the interaction between age and session having a corresponding F-ratio of less than 1. The test-retest reliabilities did not differ significantly for the older and younger children; under 9 years: r = .457, df = 21, 9 years and over: r = .594, df = 20; test for difference in correlations, z = 0.47 (Guilford & Fruchter, 1973).

Comparison with Developmental Studies Using ToL Task

The question arises as to how typical our findings are. Studies by Krikorian et al. (1994) and Anderson et al. (1996) presented normative data on the ToL task (noncomputerised version) for children aged from 7 to 13 years. Direct comparison of results is impossible because different numbers of problems and different scoring methods were used. Also, both the ToL studies used only 3-, 4- and 5-move problems. However, we can consider sensitivity to age in each study in terms of effect size, d, which is a unit-free measurement (Cohen, 1977). For each adjacent year band from 7 to 13 years, d was computed as the difference between mean scores divided by the SD of the younger group. The mean value of d, which gives an index of the average increase in score for each year of age, was .10 for the current study, which is similar to the value of .16 from the Krikorian et al. study. The mean for the Anderson et al. study was twice as large at .31. The probable reason for this difference is that Anderson et al. used a scoring method that took into account time to complete each problem, whereas the current study and that of Krikorian et al. considered only accuracy. Neither ToL study found the drop in score that we observed for the 13–15-year-olds relative to younger children; it is likely that this simply reflects sampling error on a measure where age effects are small. Overall, it is noteworthy that the effect sizes for age are relatively small, compared with those for the other measures in our study: vocabulary (.47), similarities (.35), TEA-Ch same world (.55), TEA-Ch opposite world (.48). Thus on the ToH, variation within an age group is substantial in relation to score increases from one year band to the next. Coupled with our finding of low reliability, this suggests that age effects may be masked by other random factors that affect children’s performance from day to day.

Luciana and Nelson (1998) used a computerised version of the ToL, which included 3-, 4- and 5-move problems, with children aged from 4 to 8 years. The computer screen is divided into a top and bottom half, with the goal position being shown on the top half of the screen, and the initial position on the bottom half. The child moves the coloured “balls” on the bottom half of the display using a touchscreen, so as to make the bottom display look the same as the top display. The displays are presented in such a way that they can be perceived as stacks of coloured balls, held in stockings and suspended from a beam. Luciana and Nelson reported their data separately for each level of problem difficulty, and direct comparison with the current study is difficult. However, there is a suggestion in their data that the computerised version of ToL may be substantially harder for children than versions using a more conventional apparatus. Luciana and Nelson reported that around 20% of 8-year-olds, and somewhat fewer 7-year-olds, achieved minimum-move solutions on 4- and 5-move problems. In the current study, 70% of 7- to 8-year-olds succeeded on both problems at the 4-move level, and 47% succeeded on both problems at the 5-move level. (The percentages of children of this age succeeding on at least one of the two problems at a given level were even higher: 97% for 4-move problems and 74% for 5-move problems.) The apparatus-based ToL studies do not report data separately for different levels of task difficulty, but a consideration of mean scores suggests their findings were more comparable to those of the current study than to those of Luciana and Nelson. In the study by Krikorian et al. (1994) the mean percentage score for 7- and 8-year-olds was around 80% on a task that consisted predominantly of 4- and 5-move problems. Anderson et al. (1996) used only 4- and 5-move problems, and reported scores of over 75% correct for 7- to 8-year-olds.

Discussion

Overall, these results with normally developing children do not offer much encouragement for those wishing to use the ToH as a clinical index of executive function. Although the test-retest reliability may be adequate for demonstrating group differences in experimental studies, as when a clinical group is compared to a control group, it is too low for confident assessment of individual cases. We cannot attribute poor test-retest correlations to restriction of range, as there is large variation between children even within a single year band. We considered whether low reliability might arise from the task losing its [...]

[...] slowly? It may be that a better scoring system can be devised, but it is unlikely that our system was simply too insensitive to pick up age effects: the problem was not that children did not vary, but rather that the variance within each age group was substantial in relation to the age effect size.

We found no support for the notion that ability to inhibit a prepotent response is a major determinant of ToH performance in normally developing children: the correlations with the Same-Opposite World task, which is designed to assess this aspect of executive function, were close to zero. In future work, it would be interesting to consider alternative explanations for individual differences in ToH performance; for instance, Pennington, Bennetto, McAleer, and Roberts (1996) have emphasised the importance of working memory for executive tasks. This is particularly the case for the ToH, where the need to hold a sequential plan in memory increases with task difficulty. It is possible, also, that more complex inhibitory processes are implicated than are tapped by the Same-Opposite World task. In the Opposite World task, the participant must simply maintain a response set that involves doing the opposite of what is customary. In the ToH, it is necessary to shift continuously between subgoals in order to arrive at a final goal, and this can mean inhibiting a subgoal (e.g., get the red disc on the rightmost peg) that was previously active.

The low reliability of the ToH task is disappointing for those hoping to use this task for individual assessment. Luciana and Nelson (1998) argued that because one cannot readily apply brain imaging studies to normal children, one can adopt the empirical strategy of selecting behavioural measures with reliable neural correlates (in terms of lesion studies, or imaging of healthy adults), and then “attribute children’s successful performance on these measures to the functional maturation of the brain regions with which they have been experimentally correlated” (p. 273).

Appendix

Record form for Tower of Hanoi task, showing the configuration for each problem. The three dashes denote the three rods, with the letters indicating the discs. The number of moves taken by the child on 1st, 2nd, and 3rd trials of each problem is recorded in the box against that problem. At each level, problem A is given first, and then problem B. For each problem, the child is administered up to three trials, until two trials have been completed in the minimum number of moves. The test stops when this criterion of 2/3 attempts correct is not reached on two successive problems, and the child is awarded a score corresponding to the highest level (number of moves) passed, with .5 added if both problems at that level were passed. Two children who failed both 3-move problems were given a score of 2.0. Thus the total score could range from 2.0 to 9.5.
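The stopping and scoring rules summarised in this Appendix, together with the age adjustment described in the Results, can be illustrated with a short sketch. This is a hedged illustration rather than the authors' own procedure as implemented: the function names (problem_passed, toh_score, age_adjusted_z) and the dictionary layout are assumptions introduced for the example, and unreached problems are assumed to count as failed. The regression coefficients (2.72, 6.589) and the RMS residual (1.646) are the values reported in the text.

```python
import math
from typing import Dict, Optional, Sequence, Tuple

# Problems are given in a fixed order: A then B at each level, 3 to 9 moves.
PROBLEM_ORDER = [(level, label) for level in range(3, 10) for label in ("A", "B")]


def problem_passed(trial_moves: Sequence[Optional[int]], min_moves: int) -> bool:
    """Pass = at least two of up to three trials solved in exactly min_moves.

    None stands for a trial restarted after a rule violation, which the
    paper counts as a failed trial.
    """
    return sum(1 for moves in trial_moves[:3] if moves == min_moves) >= 2


def toh_score(passed: Dict[Tuple[int, str], bool]) -> float:
    """Highest level passed, plus .5 if both problems at that level were passed.

    Testing stops after two successive failed problems; problems absent
    from `passed` (assumed never reached) count as failed.
    """
    passes_at_level: Dict[int, int] = {}
    consecutive_failures = 0
    for level, label in PROBLEM_ORDER:
        if passed.get((level, label), False):
            passes_at_level[level] = passes_at_level.get(level, 0) + 1
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures == 2:
                break  # self-terminating rule
    if not passes_at_level:
        return 2.0  # both 3-move problems failed
    top = max(passes_at_level)
    return top + (0.5 if passes_at_level[top] == 2 else 0.0)


def age_adjusted_z(score: float, age_months: float) -> float:
    """z = (obtained - predicted) / RMS residual, using the reported regression."""
    predicted = 2.72 * math.log(age_months) - 6.589
    return (score - predicted) / 1.646
```

Applied to the two worked cases in the Method, this sketch returns 5.5 for the child who passed both 3-move problems, one 4-move problem, and both 5-move problems before failing both 6-move problems, and 4.0 for the child who passed only the second 4-move problem and then failed both 5-move problems. For adults, the text's convention of assigning an age of 18 years would correspond to age_months = 216.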

