Assessing children’s understanding of complex syntax: a comparison of two methods

Pauline Frizelle 1,2, Paul Thompson 1, Mihaela Duta 1 & Dorothy V. M. Bishop 1

1 Department of Experimental Psychology, University of Oxford, Oxford, Oxon, UK.

2 Department of Speech and Hearing Sciences, University College Cork, Republic of Ireland

RUNNING HEAD: Assessing children’s understanding of grammar

Abstract

We examined the effect of two methods of assessment, multiple-choice sentence-picture matching and an animated truth-value judgement task, on typically developing children’s understanding of relative clauses. Children between the ages of 3;06 and 4;11 took part in the study (n = 103). Results indicated that (i) children performed better on the animation task than on the multiple-choice task, independently of age, (ii) each testing method revealed a different hierarchy of constructions, and (iii) the testing method had a greater impact on children’s performance with some constructions than with others. Our results suggest that, when assessed in a manner more reflective of how we process language in natural discourse, young children have a greater understanding of complex sentences than previously reported.

KEYWORDS: complex syntax, relative clause, language, development

During the process of child language development, children often demonstrate language knowledge that is sufficient to support comprehension but insufficient for production (Fraser, Bellugi & Brown, 1963). This is not surprising given the extra demands of utterance planning and sentence formulation involved in language production. However, in the case of complex syntax the opposite appears to hold, with children’s production superior to their comprehension (Håkansson & Hansson, 2000). This causes us to question how best to assess receptive syntactic knowledge.

There are a number of paradigms typically used to assess children’s comprehension of syntax, most of which share commonalities in relation to assessment principles. These include the set-up, which is usually purposely structured to remove context or situational cues in order to evaluate understanding of the specific aspect of language of interest, such as a grammatical structure, a morphological marker or a combination thereof. In a typical test (administered clinically), the items (e.g., sentences) are sequenced so that they make increasing demands on processing skills, in order to identify the upper threshold of a child’s comprehension ability. Despite these commonalities, there are also many differences in the assessment methods and procedures used to investigate children’s comprehension of syntax, such as preferential looking, act-out tasks, story comprehension, grammaticality judgements, on-line methods, picture selection tasks and truth-value judgement tasks. Of interest in the current paper is the effect that different methodologies can have on the language knowledge being measured. Given the variation in assessment procedures used, it is likely that they will have a varied effect on the knowledge being tested and consequently on the outcome measure. While each assessment measure comes with its own biases, one of the main characteristics of a good test is that the methodology should not compromise the knowledge being measured. It would be misleading for both the research and clinical communities to use assessments that yield results that are more a reflection of the testing method than of knowledge of the item being measured.

In a recent paper, Frizelle, O’Neill and Bishop (2017) compared children’s performance on a sentence recall task and a multiple-choice comprehension task of relative clauses. The authors found that although the two assessments revealed a similar order of difficulty of constructions, there was little agreement between them when evaluating individual differences. In addition, children showed the ability to repeat sentences that they did not understand. Given that sentence repetition has been reported to be a reliable measure of children’s syntactic knowledge (Gallimore & Tharpe, 1981; Kidd, Brandt, Lieven & Tomasello, 2007; Polišenská, Chiat, & Roy, 2015), the authors attributed the discrepancy between the two measures to the additional processing load resulting from the design of multiple-choice picture selection comprehension tasks.

Here we consider two ‘pure’ methods of receptive language assessment: the multiple-choice picture selection task and a newly devised truth-value judgement animation task. The former is one of the most commonly used assessment paradigms to evaluate children’s understanding of syntax within a formal, standardized assessment framework (it is used, for example, in The Clinical Evaluation of Language Fundamentals – 4th Edition (CELF-4) (Semel, Wiig & Secord, 2006), The Test for the Reception of Grammar – 2nd Edition (TROG-2) (Bishop, 2003) and the Assessment of Comprehension and Expression 6–11 (ACE) (Adams, Coke, Crutchley, Hesketh & Reeves, 2001)). In a picture selection paradigm the child is presented with a sentence and is asked to select the picture (usually from a choice of four) that best corresponds to the stimulus presented. Within a four-picture layout, one picture represents the target structure and the other three are considered distractors. On the condition that each distractor is equally plausible, this reduces the probability of choosing the correct item by chance to 25%. However, it may not always be possible to design three equally plausible distractors (Gerken and Shady, 1994), in which case deciding what constitutes chance performance becomes difficult. Not only is the semantic plausibility of the distractors

influential but also their syntactic framework and the relationship between the two. In parsing a sentence for syntactic comprehension, a child will typically form a semantic representation of that sentence, relying on the syntactic structure to assign the thematic roles to the appropriate words in the sentence. However, there are additional linguistic factors that influence a child’s performance and these are not always controlled for. For example, greater comprehension difficulty has been noted when the thematic roles assigned by the verb could be applied to either noun (Gennari & MacDonald, 2009). It is also the case that, on hearing the target stimulus, children will use typical thematic role-to-verb argument mapping in trying to process the given structure. The presence of three distractors requires the child to rule out three competing alternative mappings, increasing the processing load considerably and creating a level of ambiguity in how the roles should be assigned. If we consider the test sentence She pushed the man that was reading the book (taken from the multiple-choice task used in the current experiment), shown in Figure 1, the distractor images represent the sentences The woman that was reading the book pushed the man, The man pushed the woman that was reading the book and The man that was reading the book pushed the woman. Using the same lexical set, the four pictures represent four alternative ways to assign the thematic roles. With regard to relative clauses, this type of distractor design is used in commercially available standardized assessments such as the TROG-2 (Bishop, 2003) and the ACE (Adams et al., 2001) (see Table 1). Interestingly, in the example taken from the CELF the object (the banana) is inanimate and therefore the roles of the subject and object cannot be reversed, necessitating a different set of distractors. However, we note that the distractors in the CELF example serve to test children’s understanding of prepositions and, on condition that the child understands the concept of under, they could choose the correct item. In addition, in the illustrations depicting the distractors there is no alternative to the head noun to which the relative clause was referring (no other boy), making the use of a relative clause pragmatically inappropriate.

Table 1. Examples of distractors used in commercially available standardized assessments.

Test: Assessment of Comprehension and Expression (ACE)
Target sentence: The cat that scratched the fox is fat.
Distractors:
  The cat that the fox scratched is fat.
  The fox that scratched the cat is fat.
  The fox that the cat scratched is fat.

Test: Test for the Reception of Grammar (TROG-2)
Target sentence: The girl chases the dog that is jumping.
Distractors:
  The dog chases the girl that is jumping.
  The girl that is jumping chases the dog.
  The dog that is jumping chases the girl.

Test: The Clinical Evaluation of Language Fundamentals (CELF-4)
Target sentence: The boy who is sitting under the big tree is eating a banana.
Distractors:
  The boy who is standing beside the small tree is eating a banana.
  The boy who is sitting beside a small tree is eating a banana.
  The boy who is sitting in a big tree is eating a banana.

Figure 1. Example of an illustration for the test sentence: She pushed the man that was reading the book

The advantage of using this role-reversal distractor approach is that we can devise test items that can only be correctly identified by children with a deep understanding of the syntactic structure being tested. It can also be argued that through careful manipulation of the distractor images we can complete a systematic error analysis, which will potentially shed light on children’s developing representations. However, at the same time it is important to question the ecological validity of this method, as it results in the assessment of an understanding that is far removed from that used in natural discourse. Would the error patterns revealed emerge at all if the distractors were less salient? In addition, while it is important to ascertain ‘true’ or deep levels of syntactic knowledge, we need to be aware that this format is likely to lead to children failing for reasons other than a lack of linguistic knowledge (Frizelle et al., 2017).

For these reasons, in this experiment we are considering the multiple-choice picture selection task to be an example of an assessment procedure in which the method is likely to have a strong effect on the knowledge being measured.

An alternative assessment measure is the truth-value judgement task, described by Gordon (1996, p. 211) as ‘one of the most illuminating methods of assessing children’s linguistic competence’. This task requires the child to make a binary judgement about whether a statement accurately describes a situation presented, in a given context. Some studies have used a yes/no version within the context of a short story, i.e. the child is presented with a story followed by several yes/no questions (see Gordon & Chafetz, 1986).

Other researchers, such as Crain and McKee (1985), have used what is termed a reward/punishment version. In this design, Crain and McKee (1985) asked children to watch a situation ‘acted out’ while it was simultaneously described. Following a statement made by Kermit the frog about an event, children were asked to reward him if it was true and to punish him (by feeding him a rag) if it was false.

In the current study we used a truth-value judgement task that could be considered somewhat less demanding than those previously described. In this task, the child listens to a single sentence while simultaneously being shown an animation and must simply decide if the sentence is truly represented in the animation shown. Half of the sentences are accurately represented by the animation and half are not. By listening to the stimulus sentence at the same time as it is acted out ‘in real time’, the child can take advantage of a reduction in memory demands. However, one disadvantage of using this method is that a larger number of stimuli are required to determine if performance is reliably above chance, which may result in fatigue, particularly for young children. In addition, Crain, Thornton, Boster, and Conway (1996) raised some methodological design concerns in relation to single-sentence truth-value judgement tasks and the pragmatic context in which they are represented. These concerns were prompted by studies by Philip (1991; 1992) and Roeper and de Villiers (1991), showing that when children were interpreting sentences containing universal quantifiers (such as every), they were influenced by the extra elements in the picture. Children showed a pragmatic bias, surmising that the extra element was there for a reason, and therefore responded negatively and incorrectly to the question asked. Crain et al. followed up with further research in which they created a story context where the child could plausibly reject the extra elements depicted.

They argued that with the appropriate contextual support children would not override their grammatical knowledge in favour of any pragmatic bias. However, they also acknowledge that children’s response to the number of items in each illustration could be as much to do with the fore-grounding/back-grounding of the items as with the pragmatic context. Although previously discussed in relation to universal quantifiers, this is relevant to any truth-value judgement task where the child is presented with a single image or animation. Even when the image is an accurate representation of the sentence, the extra elements within that image are in effect a form of distractor. On the other hand, if the image does not represent the sentence accurately, the entire image is considered a distractor, and how that is structured affects the processing load for the child. If we consider a distractor animation for the relative clause She pushed the boy that was jumping, the child would be shown a sequence in which there are two potential referents, a boy that is jumping and a boy that is not, and the animation would show a girl pushing the boy that is not jumping. If, however, the girl is also jumping, this would increase the salience of an alternative interpretation of the sentence, The girl that was jumping pushed the boy, which in turn increases the processing load for the child and the difficulty level of the item. As assessors, our aim is to gain the most accurate picture of children’s receptive syntactic knowledge while at the same time gaining some insight into the difficulties they experience. However, we do not want to distract or provoke children into making an invalid response. Therefore, how each structure is represented is central to the assessment results we are likely to obtain.

The advantage of using a truth-value judgement task where the child is shown a single animation at a time is that they can evaluate the truth of each sentence (as it is happening) directly against the real-world situation, without having to store in memory the arguments associated with the verbs. In this regard, the task has no greater memory load than that required for processing everyday discourse. The child is required to make a semantic evaluation of a sentence and to map the thematic roles to the appropriate verb argument structures without having to actively rule out three competing alternative mappings. In addition, using animation reduces the dependence on children’s ability to interpret adult-created pictures and reduces the amount of inference-making required by the child. We might therefore expect this assessment framework to yield more accurate test results than those obtained from the multiple-choice picture selection task, i.e. we expect this methodology to have less bearing on the knowledge being measured.

In the current study, to examine to what extent the assessment method affects the assessment outcome, we used the two methods of assessment previously described to measure the same language knowledge (i.e. children’s understanding of relative clauses). We then investigated whether the use of these two different methodologies significantly affected children’s measured understanding of relative clauses. The following research questions were addressed:

• Do children show a greater understanding of relative clauses when assessed using a truth-value judgement animation task or a multiple-choice comprehension task? Based on findings reported in Frizelle et al. (2017) showing that the multiple-choice comprehension task is assessing factors other than those that are linguistic, we predict that children will achieve higher scores on the animation task than on the multiple-choice task.

• Is the hierarchy of structures similar across the tests? Given that the exact same sentences were used in both assessments (with additional sentences added to the animation task), we might predict that a similar hierarchy would be shown by both assessment measures.

• What is the relationship between children’s performance on either method of syntactic assessment and (i) their language skills, as measured by a sentence recall task, and (ii) their cognitive skills, as measured by the Goodenough-Harris draw-a-person test (1963)? Given the additional cognitive load of the multiple-choice task (the fact that it is assessing skills beyond the linguistic), we might expect it to correlate more highly with the draw-a-person test. On the other hand, if our animation task is a less compromised measure of syntactic comprehension, and sentence recall is a reliable measure of syntactic knowledge, then we would expect the animation task to be more closely associated with the sentence recall measure.

Method

Participants

One hundred and ten typically developing children, between the ages of 3;06 and 5;0 years, were recruited into the study. Seven children were subsequently excluded as a result of speech and language delay (n = 3), attention difficulties (n = 2), failure on the hearing screening tests and an unwillingness to participate (n = 1). This resulted in 103 children participating in the study. To take account of the rapid changes in language development in this age range, the children were divided into six age bands, with between 16 and 18 children in each band (3;06–3;08, 3;09–3;11, 4;0–4;02, 4;03–4;05, 4;06–4;08 and 4;09–4;11).

The children were recruited through nurseries and primary schools in the Oxford area. To minimise volunteer bias, children were recruited using an ‘opt out’ protocol which was approved by the Central University Research Ethics Committee, University of Oxford. A decile of deprivation was obtained from postcode data for each participant; 32% had a deprivation index of 7 or below. The study was explained to the children recruited; younger children made a verbal choice regarding their participation and older children were asked to sign an assent form. Children were included in the sample on the basis that they had typical language abilities, had never been referred for speech and language therapy, spoke English as their first language and the language of the home, and had no known intellectual, neurological or hearing difficulties. Children completed the Goodenough-Harris draw-a-person test (DAP) (1963) to ensure cognitive ability within the normal range. We wanted a simple nonverbal measure of developmental level that could be used with children as young as 3 years; because the maturity of children’s drawings shows a clear developmental trend between 3 and 5 years, we used the DAP as a quick play-based assessment, applying its scoring methods to quantify the maturity of the drawings. Using a DSP Pure Tone Audiometer (Micro Audiometric Corporation), hearing ability was screened on the first day of assessment at three frequencies (1000 Hz, 2000 Hz and 4000 Hz) at a 25 dB level. Owing to difficulty in locating an adequately quiet space, the results of this initial screen were inconclusive for 7 children. These children were further assessed using the Ling six sound test, which was deemed to be an appropriate measure of the ability to hear speech at conversational loudness level.

Comprehension Tasks

Animation task

The newly devised animation task was presented on a Microsoft Surface Pro. Children were shown fifty animations, each representing one of five types of relative clause: subject (both intransitive and transitive), object, indirect object and oblique. All fifty relative clauses were attached to the direct object of a transitive clause and are therefore categorized as full bi-clausal relatives. Example test sentences are given in Table 2.

Table 2. Example Relative Clause Test Sentences

Relative clause type    Example test sentence
Subject intransitive    He found the girl that was hiding
Subject transitive      He pushed the girl that scored the goal
Object                  The boy picked up the cup that she broke
Oblique                 The man opened the gate she jumped over
Indirect object         She kissed the boy she poured the juice for

There were ten animations for each relative clause structure, five of which matched the structure and five non-match items. The test sentences were chosen on the basis of pilot work carried out by the first author, work completed by Diessel and Tomasello (2000; 2005) and research by Frizelle and Fletcher (2014). Object relatives were adapted to include only those considered to be more discourse relevant (Kidd et al., 2007), i.e. all had an inanimate head noun and a pronominal subject. We also included pronominal subjects in the oblique and indirect object clause structures, as they are considered to be more reflective of natural discourse. In order to increase ecological validity, within each animation there was an alternative to the head noun to which the relative clause was referring, i.e. a referent from which another can be distinguished. For example, the representation of the sentence He laughed at the girl he threw the ball to included another girl who was holding a ball. Examples have been deposited on the Open Science Framework and can be accessed at the following links (https://youtu.be/OM27lMM4zPs; https://youtu.be/Cd-EBpCtzZw; https://youtu.be/d3dz_m8zTvc). Animations were on average six seconds in length. As each animation began, children were presented with a pre-recorded sentence orally and asked if the animation matched the sentence they heard. Children were given the opportunity to hear the sentence and see the animation more than once if needed. They were asked to respond by touching either the smiley or sad face on the Surface Pro touch screen.

Multiple-Choice task

The multiple-choice task was a sentence-picture matching task designed to assess the same relative clause structures as those in the animation task described. The sentences were identical to a subset of those used in the animation task. The multiple-choice items were also part of a larger set reported on in Frizelle et al. (2017). Sentences were again pre-recorded. Children were given each sentence orally and were asked to choose the picture (from a choice of four) that corresponded to that sentence. The other three images were distractors, which included role reversal of the main clause (the relative clause is understood), role reversal of the relative clause (the main clause is understood) and role reversal of both main and relative clause (see Figure 1).

In comparison to the animation task, where the child chooses between two options (correct vs. incorrect), choosing from four pictures reduces the possibility of the child answering correctly by chance. Therefore, in the multiple-choice task fewer exemplars of each construction were required. In contrast to the ten examples of each construction in the animation task, the multiple-choice task included four of each construction, i.e. twenty test items in total. The pictures were presented in two formats, governed by the type of verbs within the relative clause. For example, in the test sentence She pushed the man that was reading the book, the verbs reading and pushing can occur simultaneously and can therefore be represented by four single images (the correct response and three distractors) (see Figure 1). In contrast, for the sentence The girl washed the teddy that he played with, both verbs must be shown consecutively (the boy needs to play with the teddy before the girl washes it) and therefore require two images within each set of four (see Figure 2).

Figure 2.

Procedure

Children were assessed individually in a quiet room in their respective schools. The assessments were administered in two sessions between three and twenty days apart. The hearing screen was carried out on the first occasion each child was seen. The cognitive assessment, sentence recall task and multiple-choice comprehension task were administered in one session and the animation task was administered in the other. The sentence recall task was audio recorded and sentences from 5% of the participants were re-transcribed to check transcription reliability. Inter-rater agreement was 98%. The sequence of test sentences was randomised for both experimental tasks, so that there were two orders of presentation for each task. The order in which the assessments were administered was also randomised. For the animation task children simultaneously watched an animation and listened to a sentence. They were required to indicate whether the animation represented the sentence they heard by touching the smiley face/sad face icons on the tablet screen. For the multiple-choice task children listened to a sentence and were asked to point to the picture that corresponded to that sentence. Verbal positive feedback was given every four to five trials. As the animation task was significantly longer than the multiple-choice task, children also received positive visual feedback (star animations) every 10 trials, regardless of their performance on the task.

Scoring for the animation task was automated and the results were collated and stored within the application. The researcher scored the multiple-choice task in real time as the children completed it. The initial scoring for both tasks was binary, 1 for a correct response and 0 if the response was incorrect.

Results

To allow both assessments to be compared on similar scales, a coding system was developed that took into account the probability of obtaining a correct response by chance. The binomial theorem was used to compute the probability of obtaining a given number of items correct by chance guessing, where chance was .25 for the multiple-choice test and .5 for the animation task. A two-tiered coding system was applied: the child was given 2 if they answered 4 out of 4 correctly on the multiple-choice task (p = .004) or 9 or 10 out of 10 correctly on the animation task (p = .01). In the more lenient coding, the child was given 1 if they answered 3 out of 4 correctly on the multiple-choice task or 8 out of 10 on the animation task. With this coding, the binomial probability of a score of 1 or 2 on the multiple-choice test was .051, and the probability of a score of 1 or 2 on the animation task was .055. Thus in both tasks, a score of 2 could be regarded as evidence that the child completely understood the construction, and a score of 1 that they were close to mastery. A score less than or equal to 2 out of 4 on the multiple-choice test, or less than or equal to 7 out of 10 on the animation task, was coded as 0. This coding allowed us to compare children’s performance across different relative clause types in both tests on a similar metric. Figure 3 shows stacked bar charts of the relevant data.
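The chance-level cut-offs described above follow directly from the binomial distribution. As a quick check (a sketch using only Python's standard library, not part of the original analysis), the tail probabilities for each scoring criterion can be recomputed and agree with the reported values to rounding:

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): the probability of getting
    at least k items correct by guessing alone."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Multiple-choice task: 4 items per construction, chance = .25
print(round(binom_tail(4, 4, 0.25), 3))   # strict criterion (4/4): 0.004
print(round(binom_tail(4, 3, 0.25), 3))   # lenient criterion (>= 3/4): 0.051

# Animation task: 10 items per construction, chance = .5
print(round(binom_tail(10, 9, 0.5), 3))   # strict criterion (>= 9/10): 0.011
print(round(binom_tail(10, 8, 0.5), 3))   # lenient criterion (>= 8/10): 0.055
```

This also makes explicit why the binary animation task needs more items per construction than the four-alternative multiple-choice task: with chance at .5, at least nine or ten trials are required before a run of correct answers becomes improbable under guessing.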

Figure 3. Children’s performance, categorised as 2 (mastery), 1 (above chance) or 0 (not mastered), on each construction on the animation and multiple-choice tasks

SI = subject intransitive, ST = subject transitive, Obj = object, Obl = oblique, IO = indirect object.

An examination of the stacked charts shows considerable growth in children’s comprehension of these constructions from 3;06 to 4;11 years. This is particularly evident on the animation task where (with the exception of the subject transitive relatives) we see more evidence of full mastery on all constructions. If we look at the three younger age bands (where children are aged between 3;06 and 4;02 years), the differences in full mastery between the two tasks are particularly evident. For example, younger children show very limited mastery of object and indirect object relatives on the multiple-choice task, yet a considerable percentage of children show full mastery on the animation task.

To subject the data to statistical analysis, we grouped together categories 1 and 2 for contrast with category 0 before fitting a mixed model logistic regression to the data. Odds ratios were calculated as measures of effect size. An odds ratio represents the odds of scoring 1 vs. 0 given a particular condition, compared to the odds of scoring 1 vs. 0 in the absence of that condition; an odds ratio of 1 indicates no difference between conditions. As shown in Table 3, our results showed an effect of test, such that children performed better on the animation task than the multiple-choice task: the odds of achieving higher scores on the animation task were 3.6 times those on the multiple-choice task. To evaluate the effect of the specific constructions, the intransitive relative clause score was the reference against which the other relative clause types were measured. Relative to the intransitive relatives, the odds of achieving a higher score were roughly a quarter for the oblique relatives (0.27), a tenth for the object relatives (0.1) and a twentieth for the indirect object relatives (0.05). The model also revealed an effect of age, showing that older children were more likely to achieve higher scores on the tasks.

Table 3. Estimates of fixed effects for mixed effects logistic regression.

                              Odds ratio   95% CI          p
(Intercept)                   4.44         2.39 to 8.25    <.0001
Test type (Animation)         3.6          1.49 to 8.67    .0043
Age (centred)                 2.72         1.84 to 4.01    <.0001
Transitive                    1.93         0.87 to 4.31    .1074
Indirect object               0.05         0.02 to 0.11    <.0001
Object                        0.12         0.05 to 0.24    <.0001
Oblique                       0.27         0.13 to 0.56    .0004
Test type × transitive        0.10         0.03 to 0.34    .0002
Test type × indirect object   4.14         1.31 to 13.2    .016
Test type × object            1.02         0.33 to 3.13    .9773
Test type × oblique           0.53         0.17 to 1.63    .269
Test type × Age               1.26         0.87 to 1.8     .2235

Note: Intransitive clause is treated as reference category.
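To make the reported odds ratios concrete, they can be converted back into predicted probabilities. The sketch below is a back-of-envelope illustration using only the published point estimates (it ignores the confidence intervals and the random effects of the fitted model, so the figures are indicative rather than model predictions):

```python
def odds_to_prob(odds):
    """Convert odds to a probability: p = odds / (1 + odds)."""
    return odds / (1 + odds)

baseline_odds = 4.44   # intercept: intransitive relatives on the multiple-choice task
animation_or = 3.6     # odds ratio for the animation task

# Probability of above-chance performance in the reference condition
print(round(odds_to_prob(baseline_odds), 3))                 # roughly 0.816

# Multiplying the odds by the animation odds ratio gives the
# corresponding probability on the animation task
print(round(odds_to_prob(baseline_odds * animation_or), 3))  # roughly 0.941
```

This illustrates why odds ratios can look dramatic while probability differences remain modest near ceiling: tripling the odds moves the probability from about .82 to about .94.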

Although we see considerable differences between the two assessment methods at the level of full mastery, particularly with the younger children, when we analysed the data according to above-chance performance on the tests (2 or 1 vs. 0), there was no interaction between the method of testing and age overall. We also considered the pattern of results for full mastery (2 vs. 1 or 0) and the pattern did not change. However, the significant odds ratios for the interaction terms with transitive subject and indirect object relatives indicate that children’s performance on these constructions differed depending on which method of assessment was used, whereas this was not seen for the other constructions (see Table 3 for the fixed effects). This suggests that some constructions are more vulnerable to testing method than others.

Our second question was whether the order of difficulty of constructions was similar on the two types of test. Based on the lenient coding system (where children showed an understanding of constructions at a level above chance), a rank ordering was assigned to each of the constructions for total items correct in the animation and multiple-choice tasks. The rank orderings were different, suggesting that the tests are not sensitive to the same aspects of clause complexity. For the multiple-choice test, the rank ordering (and mean items correct) was Subject transitive (3.28) > Subject intransitive (3.04) > Oblique (2.46) > Object (2.27) > Indirect object (1.93). For the animation task, the order was Subject intransitive (9.03) > Indirect object (8.24) > Subject transitive (8.14) > Object (7.91) > Oblique (7.87).

We next considered whether the rank orderings were meaningful, i.e. whether the differences in difficulty were reliable enough to merit interpretation. To do this, we compared scores for adjacently ranked constructions; e.g., for the animation task, we compared Subject intransitive vs. Indirect object, then Indirect object vs. Subject transitive, and so on. Because the data were not normally distributed, paired Wilcoxon tests were used. The alpha level was corrected for each set of comparisons within each test (alpha = .05/4 = .0125, as there were four comparisons per test). For the animation task, there was a significant difference between intransitive subject and indirect object relatives (Z = 369.5, p = 4.62 × 10^-7), but no significant differences between any of the other sequential pairs, showing a two-tiered hierarchy in children's understanding of the different relative clause types on this assessment. In contrast, the multiple-choice task showed three levels of difficulty: children performed similarly on the two subject relative types (Z = 724, p = .025); there was a significant difference between subject intransitive and oblique relatives (Z = 480, p = 1.24 × 10^-5); no difference between oblique and object relatives (Z = 755, p = .109); and a significant difference between object and indirect object relatives (Z = 963.5, p = .012), which were the most difficult construction on the multiple-choice measure.
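The decision rule applied to each set of adjacent comparisons can be sketched as follows, using the multiple-choice p-values reported above; the Wilcoxon tests themselves are assumed to have been run separately (e.g. with a standard statistics package). Note that p = .025 fails the corrected threshold while p = .012 just passes it.

```python
# Bonferroni-corrected alpha: four adjacent comparisons per test.
ALPHA = 0.05 / 4  # = .0125

# Adjacent-pair p-values reported in the text for the multiple-choice task.
mc_pairs = [
    ("Subj transitive vs Subj intransitive", 0.025),
    ("Subj intransitive vs Oblique", 1.24e-5),
    ("Oblique vs Object", 0.109),
    ("Object vs Indirect object", 0.012),
]

# Judge each comparison against the corrected alpha.
results = {label: p < ALPHA for label, p in mc_pairs}
for label, sig in results.items():
    print(f"{label}: {'significant' if sig else 'ns'}")
```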

Our third question asked whether children's performance on either measure of syntactic comprehension would correlate with their performance on the sentence recall task from the CELF-P2 (Semel et al., 2006) or with their mental age as measured by the Goodenough-Harris Draw a Person test (Harris, 1963). Table 4 provides a summary of the correlations between measures. Note that the lack of correlation between sentence recall and age arises because these were age-scaled scores. Draw a Person scores were missing for three children, who were therefore excluded from the analysis. Correlations between both assessments of syntax and sentence recall scores were substantial, using Davis' (1971) criteria for interpreting the magnitude of correlation coefficients. Although not as strong, associations between both measures of syntax and DAP scores were also highly significant and are classified as moderate.
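The Davis (1971) descriptors used above can be expressed as a banded lookup on the absolute correlation. The band boundaries below are an assumption based on common summaries of Davis' conventions, not quoted from the original source, and should be checked against it.

```python
def davis_label(r):
    """Map a Pearson correlation to a Davis (1971)-style magnitude label.
    Band boundaries are assumed, not taken from the original source."""
    r = abs(r)
    if r >= 0.70:
        return "very strong"
    if r >= 0.50:
        return "substantial"
    if r >= 0.30:
        return "moderate"
    if r >= 0.10:
        return "low"
    return "negligible"

print(davis_label(0.56))  # correlation between the two syntax tasks
print(davis_label(0.45))  # animation task vs Draw a Person
```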

Table 4. Pearson correlations between variables

Measure           Animation   Multiple Choice   Age      Sent Recall
Multiple Choice   .56***      –
Age               .50***      .48***            –
Sent Recall¹      .50***      .40***            .03      –
Draw a Person     .45***      .46***            .51***   .16

* p < .05, ** p < .01, *** p < .001
¹ Age-scaled scores

To investigate the independent contributions of age, sentence recall and cognitive ability to the scores on both assessments of syntactic comprehension, likelihood ratio tests were used to perform hierarchical linear regression analyses. Results are shown in Table 5. In the first analysis, performance on the animation task was the dependent variable. Age was entered first in order to control for its effect on children's performance on the task, followed by sentence recall and finally the DAP scores. The final model accounted for a large proportion of the variance in animation scores (51%), with age explaining 25% of that variance (p < .001). An additional 23% of the variance was explained by the inclusion of sentence recall, demonstrating the significant contribution of this language measure. In contrast, when age was accounted for, the addition of DAP scores explained only a further 2% of the variance. In the second analysis, performance on the multiple-choice task was the dependent variable. Age was again entered first and accounted for 23% of the variance in multiple-choice scores. In contrast to the animation task, when age was accounted for, sentence recall explained just under 15% of additional variance in the multiple-choice scores and DAP accounted for 3.5%. These analyses show that, as predicted, sentence recall contributed a larger proportion of the variance in animation scores than in multiple-choice scores. However, contrary to what we expected, the contribution of DAP scores was fairly similar across the two tasks.
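The variance decomposition reported above can be reproduced from the R² values alone; the sketch below shows the arithmetic (the regressions themselves are assumed to have been fit elsewhere, e.g. with a standard statistics package).

```python
# Cumulative R² values as each predictor block is added:
# [Age] → [Age + Sentence recall] → [Age + Sentence recall + DAP].
models = {
    "Animation": [0.2519, 0.4859, 0.5076],
    "Multiple Choice": [0.2309, 0.3787, 0.4136],
}

# First element: variance explained by age alone; subsequent elements:
# increment in R² contributed by each added predictor block.
increments = {
    task: [r2[0]] + [round(b - a, 4) for a, b in zip(r2, r2[1:])]
    for task, r2 in models.items()
}
for task, steps in increments.items():
    print(task, steps)
```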

Table 5. Hierarchical linear regression: predicting total score on Animation and Multiple Choice tasks from age, sentence recall and Draw a Person test

Model   Outcome variable   Predictor variables             R²       R² increase   Likelihood ratio test p
Null    Animation          Age                             0.2519   –
1       Animation          Age + Sentence recall           0.4859   0.2340        <.001***
2       Animation          Age + Sentence recall + DAP     0.5076   0.0217        .038*
Null    Multiple Choice    Age                             0.2309   –
1       Multiple Choice    Age + Sentence recall           0.3787   0.1478        <.001***
2       Multiple Choice    Age + Sentence recall + DAP     0.4136   0.0349        .016*

* p < .05, ** p < .01, *** p < .001
Note: Likelihood ratio tests compare each model with the preceding one.

Discussion

The current study aimed to examine to what extent the assessment method affects the assessment outcome in relation to children's understanding of complex sentences. To do this we compared two methods of assessment: a multiple-choice picture-matching sentence comprehension task and a newly devised truth-value judgement animation task. As we predicted, the results showed that children performed better on the truth-value animation task than on the multiple-choice task, and this held across the full age range. In addition, we aimed to explore whether the two assessment methods would reveal a similar order of difficulty in relation to understanding specific syntactic constructions. We predicted the same order across assessment methods, but our results showed a different hierarchy depending on the method used. Finally, we were interested to know whether results from either test would be predicted by children's performance on an independent measure of language (sentence recall ability) or on a measure of cognitive ability. As expected, our results showed that sentence recall ability predicted children's performance on the animation assessment more than on the multiple-choice test, validating the animation task as a measure of syntactic knowledge.

However, cognitive ability, as measured by the Draw a Person test, made a small but significant contribution to children’s performance fairly equally across both tasks.

As previously discussed, a recent paper by Frizelle et al. (2017) highlighted that the multiple-choice picture-matching assessment method may be problematic when assessing comprehension, in that it tests skills beyond linguistic competence. Frizelle and colleagues compared this method of assessing comprehension with a sentence recall task, which (once beyond span) is considered to tap into both comprehension and production of language. They found that children could repeat sentences they did not understand and that there was little agreement between the two measures in relation to individual differences. Although this lack of agreement is unsettling, sentence recall and multiple-choice methods do not purport to measure identical linguistic skills. The two methods used in the current study, however, were designed specifically to assess sentence comprehension. If these assessment methods are effective, they should not compromise access to the knowledge being tested and thereby depress the scores of the children completing them. Our results show that this is not the case. Children across the full age range performed better on the animation task than on the multiple-choice task, causing us to question what is influencing children's performance on the two measures.

We suggest that the differences may be driven by task design. As previously outlined, in attempting to process a sentence for syntactic comprehension, children typically form a semantic representation of the sentence while using the syntactic structure to help them assign thematic roles to the appropriate words. In the animation task, children hear the target sentence and see the animation at the same time. We expect that children use typical thematic-role-to-verb-argument mapping to process the given sentence in real time, as they would in natural discourse; the added requirement is that they must decide whether the linguistic structure accurately describes what happened in the animation. In contrast, the multiple-choice task is presented using still images, which require children to infer temporal relationships and the movement of characters, often depicted by more subtle means. Along with the correct answer, the multiple-choice task also presents three distractors, demanding that the child rule out three competing alternative mappings. This creates ambiguity in how the thematic roles should be assigned and considerably increases the processing load of the task. The presence of three distractors also requires the child to hold in memory the arguments associated with each verb in order to rule them out. This memory load is heightened further by the fact that, although the structure and thematic roles are assigned differently, the nouns in each of the distractors are the same as those in the target sentence.

Even when comparing children's test performance based on above-chance responding (rather than full mastery), the differences between the two methods are evident. Our results showed that some constructions were more vulnerable to testing method than others. Specifically, the testing method had a large impact on children's understanding of indirect object relatives and subject transitive relatives. Indirect object relatives are those in which the relativized element 'who' is the indirect object of the relative clause, and they have what is called a stranded preposition.

They are considered to be one of the more complex relative clause constructions in relation to children's production (Diessel & Tomasello, 2005; Frizelle & Fletcher, 2014). Unlike object or oblique relatives, the head noun of the indirect object relative is always animate, which means that the thematic role assigned by the verb could be applied to either noun. Previous literature has suggested that when this is the case, comprehension is more difficult (Gennari & MacDonald, 2009). To make the task pragmatically appropriate, indirect object relatives in the multiple-choice task were depicted with at least four referents in each image. If we take the example sentence He followed the girl he gave the present to, the correct image shows a boy giving a girl a present, succeeded by the boy following the same girl. Two aspects of the multiple-choice task make the illustration particularly complex: first, the temporal aspect (the boy must be seen to give the girl the present before following her; the two actions cannot happen simultaneously) and second, the need to provide distractors showing alternative semantic-syntactic mappings regarding 'who did what to whom'. In contrast, a still image from the corresponding animation shows the three referents included in the animation: a boy, a girl, and an alternative girl who is also holding a present. Because the animation happens in real time, the verbs can be shown consecutively as they are enacted, and the design does not depend on three alternative distractors (see Figure 4 below).

Figure 4. Multiple-choice stimulus picture and still image from the animation task

The reason why testing method had a particular impact on subject transitive relative clauses potentially lies in the type of verbs (in both the main and relative clause) required in these constructions. The transitive nature of the relativized verb means that the subject of the relative clause is again usually animate (the girl that scored the goal), and as a result the main clause verbs are limited to those we enact with a person or animal, such as push, pull, chase or hit (He pushed the girl that scored the goal). We suggest that these types of verbs are less ambiguously depicted in an animation than in a still image, where the child is required to engage in a higher level of inference making. In contrast (in line with discourse regularities), the head noun in our object relatives and in most of our oblique relative test items was inanimate (The dog ate the banana she dropped or The girl painted the wall he pointed to), facilitating the use of a broader range of verbs.

Both test methods showed growth in children's understanding of complex syntax across the age range. However, our findings regarding a hierarchy of constructions did not show consistency between the two assessments; that is, the assessments were not sensitive to the same aspects of clause complexity. The multiple-choice task showed three levels of complexity: children showed the highest level of understanding of subject relatives (both types), followed by object and oblique relatives, and lastly indirect object relatives. In contrast, the animation task showed that children performed best on intransitive subject relatives, with no significant differences between any of the other relative clause types. The finding that intransitive subject relatives are the least difficult bi-clausal relative clause type to process is in keeping with previous literature (Diessel, 2004; Diessel & Tomasello, 2005; Frizelle & Fletcher, 2014). However, the lack of any difference between the other relative clause types has not been previously reported and may suggest that we have been underestimating children's understanding of complex sentences. We suggest that when children's understanding of structures is assessed in a manner that is linguistically focussed, pragmatically scaffolded, and more closely related to natural discourse, we see a different picture of what children are capable of understanding.

Our final research question addressed the associations between the two methods of assessing syntactic comprehension and measures of language and cognitive ability. We chose sentence recall as the language measure because there is a strong link between sentence repetition and syntactic competence (Frizelle & Fletcher, 2014; Gallimore & Tharp, 1981; Geers & Moog, 1978; Kidd et al., 2007; McDade, Simpson & Lamb, 1982; Polišenská et al., 2015). Current thinking is that sentence repetition is not merely a task of repeating a heard series of words supported by phonological short-term memory, but that it is also supported by conceptual, lexical and syntactic representations in long-term memory (Brown & Hulme, 1995; Hulme, Maughan & Brown, 1991; Klem, Melby-Lervåg, Hagtvet, Lyster, Gustafsson & Hulme, 2015; Potter & Lombardi, 1990, 1998; Schweickert, 1993). Our finding that sentence recall was more predictive of children's performance on the animation than on the multiple-choice assessment serves to validate the animation task as a measure of syntactic comprehension.

Regarding cognitive ability, we found only marginal differences in how DAP scores predicted children's performance on the two syntactic comprehension measures (DAP accounted for slightly more variance on the multiple-choice than on the animation task). Given the additional cognitive load of the multiple-choice task, we expected a closer association between DAP scores and performance on this task than on the animation task. However, we note that DAP scores do not correlate well with some other measures of non-verbal reasoning (Imuta, Scarf, Pharo & Hayne, 2013). Despite this, they have been shown in 4-year-old children to be a predictor of later IQ (Arden, Trzaskowski, Garfield & Plomin, 2014). In any case, the DAP measures only some aspects of children's nonverbal ability, and it may be that the cognitive demands of the multiple-choice assessment are better reflected in other cognitive measures, such as attentional control, working memory or inhibition, than in a general measure of intelligence.

In conclusion, we have compared two comprehension assessment methodologies and found that varying the assessment context alters the picture we obtain of children's understanding of complex sentences. Our results suggest that young children have a greater understanding of complex sentences than previously reported when they are assessed in a manner more reflective of how we process language in natural discourse. Multiple-choice tests may be useful when we want to see how robust comprehension is when the child is forced to focus only on grammatical structure and must ignore competing or distracting information, but interpretation needs to take into account the role of other factors, such as attention, memory and inhibition. Thus, in addition to testing complex sentences, we would need other items that pose similar task demands but use simpler sentences. In addition, in a well-designed multiple-choice test, analysis of error patterns can help distinguish genuine grammatical difficulties from more general cognitive impairments (Bishop, 2003). We suggest that those designing and administering language assessments need to critically appraise the tools they use and reflect on whether a test actually measures what it claims to. The current technological era opens the way to developing assessment tools that allow testing of children's understanding 'in real time'. Our results have implications for clinicians as well as for researchers investigating children's language comprehension ability.

Acknowledgments

This research was supported by funding to the first author from the charity RESPECT and the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013), under REA grant agreement no. PCOFUND-GA-2013-608728. We are grateful to the parents, children, schools and nurseries that participated in this study.

References

Adams, C., Cooke, R., Crutchley, A., Hesketh, A. & Reeves, D. (2001). Assessment of Comprehension and Expression 6-11. London: NFER-Nelson.

Arden, R., Trzaskowski, M., Garfield, V. & Plomin, R. (2014). Genes influence young children's human figure drawings and their association with intelligence a decade later. Psychological Science, 25, 1843-50. doi: 10.1177/0956797614540686

Bishop, D. V. M. (2003). The Test for Reception of Grammar: TROG-2. London: Psychological Corporation.

Brown, G. D. A. & Hulme, C. (1995). Modeling item length effects in memory span: No rehearsal needed? Journal of Memory and Language, 34, 594-621.

Crain, S. & McKee, C. (1985). The acquisition of structural restrictions on anaphora. In S. Berman, J. W. Choe & J. McDonagh (eds), Proceedings of NELS 16. Amherst, MA: GLSA, University of Massachusetts.

Crain, S., Thornton, R., Boster, C. & Conway, L. (1996). Quantification without qualification. Language Acquisition, 5(2), 83-153. doi: 10.1207/s15327817la0502_2

Davis, J. A. (1971). Elementary survey analysis. Englewood Cliffs, NJ: Prentice-Hall.

Diessel, H. (2004). The acquisition of complex sentences. Cambridge, United Kingdom: Cambridge University Press.

Diessel, H. & Tomasello, M. (2000). The development of relative clauses in spontaneous child speech. Cognitive Linguistics, 11(1/2), 131-51.

Diessel, H. & Tomasello, M. (2005). A new look at the acquisition of relative clauses. Language, 81(4), 1-25.

Fraser, C., Bellugi, U. & Brown, R. (1963). Control of grammar in imitation, comprehension, and production. Journal of Verbal Learning and Verbal Behavior, 2(2), 121-35.

Frizelle, P. & Fletcher, P. (2014). Relative clause constructions in children with specific language impairment. International Journal of Language & Communication Disorders, 49, 255-64.

Frizelle, P., O'Neill, C. & Bishop, D. V. M. (2017). Assessing understanding of relative clauses: a comparison of multiple-choice comprehension versus sentence repetition. Journal of Child Language, 1-23. doi: 10.1017/S0305000916000635

Gallimore, R. & Tharp, R. G. (1981). The interpretation of elicited sentence imitation in a standardized context. Language Learning, 31(2), 369-92.

Geers, A. E. & Moog, J. S. (1978). Syntactic maturity of spontaneous speech and elicited imitations of hearing-impaired children. Journal of Speech and Hearing Disorders, 43, 380-91.

Gennari, S. P. & MacDonald, M. C. (2009). Linking production and comprehension processes: The case of relative clauses. Cognition, 111, 1-23. doi: 10.1016/j.cognition.2008.12.006

Gerken, L. & Shady, M. E. (1996). The picture selection task. In D. McDaniel, C. McKee & H. Smith Cairns (eds), Methods for Assessing Children's Syntax (pp. 55-76). Cambridge, MA: MIT Press.

Goodenough, F. L. (1926). Measurement of intelligence by drawings. New York: World Books.

Gordon, P. (1996). The truth-value judgment task. In D. McDaniel, C. McKee & H. Smith Cairns (eds), Methods for Assessing Children's Syntax (pp. 211-31). Cambridge, MA: MIT Press.

Gordon, P. & Chafetz, J. (1986). Lexical learning and generalization in the passive acquisition. Paper presented at the 11th Annual Boston University Conference on Language Development, Boston, October.

Håkansson, G. & Hansson, K. (2000). Comprehension and production of relative clauses: a comparison between Swedish impaired and unimpaired children. Journal of Child Language, 27, 313-33.

Harris, D. B. (1963). Children's drawings as measures of intellectual maturity. New York: Harcourt, Brace, Jovanovich.

Hulme, C., Maughan, S. & Brown, G. D. A. (1991). Memory for familiar and unfamiliar words: evidence for a long-term memory contribution to short-term memory span. Journal of Memory and Language, 30(6), 685-701.

Imuta, K., Scarf, D., Pharo, H. & Hayne, H. (2013). Drawing a close to the use of human figure drawings as a projective measure of intelligence. PLoS ONE, 8(3), e58991. doi: 10.1371/journal.pone.0058991

Kidd, E., Brandt, S., Lieven, E. & Tomasello, M. (2007). Object relatives made easy: a cross-linguistic comparison of the constraints influencing young children's processing of relative clauses. Language and Cognitive Processes, 22(6), 860-97.

Klem, M., Melby-Lervåg, M., Hagtvet, B., Lyster, S.-A. H., Gustafsson, J. E. & Hulme, C. (2015). Sentence repetition is a measure of children's language skills rather than working memory limitations. Developmental Science, 18(1), 146-54.

McDade, H. L., Simpson, M. A. & Lamb, D. E. (1982). The use of elicited imitation as a measure of expressive grammar: a question of validity. Journal of Speech and Hearing Disorders, 47(1), 19-24.

McDaniel, D. & McKee, C. (1998). Methods for Assessing Children's Syntax. Cambridge, MA: MIT Press.

Philip, W. (1991). Spreading in the acquisition of universal quantifiers. In West Coast Conference on Formal Linguistics, 10, 359-73.

Philip, W. (1992). Distributivity and logical form in the emergence of universal quantification. In Semantics and Linguistic Theory, 2, 327-46.

Polišenská, K., Chiat, S. & Roy, P. (2015). Sentence repetition: What does the task measure? International Journal of Language & Communication Disorders, 50(1), 106-18.

Potter, M. C. & Lombardi, L. (1990). Regeneration in the short-term recall of sentences. Journal of Memory and Language, 29, 633-54.

Potter, M. C. & Lombardi, L. (1998). Syntactic priming in immediate recall of sentences. Journal of Memory and Language, 38, 265-82.

Roeper, T. & De Villiers, J. (1993). The emergence of bound variable structures. In Knowledge and Language (pp. 105-39). Dordrecht: Springer Netherlands.

Schweickert, R. (1993). A multinomial processing tree model for degradation and redintegration in immediate recall. Memory and Cognition, 21, 168-75.

Semel, E., Wiig, E. M. & Secord, W. (2006). Clinical Evaluation of Language Fundamentals - Fourth Edition, UK Standardisation (CELF-4 UK). London: Pearson Assessment.