(Submitted to JRSS, C) Multivariate multilevel analyses of examination results

Min Yang, Harvey Goldstein, William Browne and Geoffrey Woodhouse Institute of Education, University of London

Abstract In the study of examination results much interest centres upon comparisons of curriculum subjects entered and the correlation between these at individual and institution level based on data where not every individual takes all subjects. Such 'missing' data are not missing at random because individuals deliberately select subjects they wish to study according to criteria that will be associated with their performance. In this paper we propose multivariate multilevel models for the analysis of such data. We analyse A/AS level results in different mathematics papers of 52,587 students from 2,592 institutions in England in 1997. Although this paper is concerned largely with methodology, substantive findings emerge on the effects of gender, age, GCSE intakes, examination board and establishment type for A/AS level mathematics.

Keywords:

Examination choice, Institutional differences, Mathematics examinations, Multivariate response model, Missing data, Multilevel model.

Address for correspondence Dr Min Yang, Mathematical Sciences, Institute of Education, 20 Bedford Way, London, WC1H 0AL, U.K.

[email protected]

1 Introduction

Yang and Woodhouse (2000) describe the analysis of a large dataset on pupils in English schools and colleges catering for 16-19 year olds. The dataset contains results of General Certificate of Education (GCE) Advanced Level (A-level) and Advanced Supplementary level (AS-level) examinations of all 696,660 individual candidates in 2,794 institutions over a 4 year period. The AS-level examination is normally, but not always, taken after 1 year of study following GCSE and involves approximately half the amount of study time of the A-level examination, which is taken after two years. Yang and Woodhouse looked at comparisons among institutions after adjustment for performance in the General Certificate of Secondary Education (GCSE) examinations taken by the same candidates two years earlier. Among other findings, they showed that the adjustment reduced apparent differences between various types of institution and between males and females. Their analyses, however, were conducted using the total A/AS-level 'points score' for each pupil, that is, the sum of points for all the examinations taken: each examination grade is assigned a score ranging from 1 to 10.

Much interest centres upon comparisons of curriculum subjects entered and the correlation between these at individual and institution level. Thus, for example, institutions which produce high scores for one science subject may be expected to do so for others, although individual institutions which do not follow such a pattern may also be of interest. In principle we could fit multivariate multilevel models to such data, regarding this as a case where potentially all individuals have responses for all curriculum subjects but where for any one individual most responses are missing (Goldstein, 1995). The multivariate responses are treated as defining the lowest level of a hierarchy, being 'nested' within individuals.
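To make this formulation concrete, the sketch below (Python with pandas, purely for exposition; the column names are hypothetical and this is not the code used for the analyses, which are of the kind fitted in MLwiN, Rasbash et al., 2000) stacks one record per observed subject score, so that responses form the lowest level nested within students and institutions, and unobserved subjects simply contribute no rows.

```python
import pandas as pd

# Hypothetical wide records: one row per student, NaN where a subject was not taken.
wide = pd.DataFrame({
    "institution": [1, 1, 2],
    "student":     [1, 2, 3],
    "main":        [8.0, 6.0, None],
    "further":     [10.0, None, None],
    "pure":        [None, None, 8.0],
    "applied":     [None, None, 6.0],
})

# Stack to long form: one row per observed response, i.e. the lowest level of the
# hierarchy (responses within students within institutions).  'Missing' responses
# are simply absent rows in this representation.
long = wide.melt(id_vars=["institution", "student"],
                 var_name="subject", value_name="score").dropna(subset=["score"])
print(long.sort_values(["institution", "student"]))
```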

One of the difficulties with fitting such a multivariate model to the A/AS level data is that the 'missingness' is non-random: individuals deliberately select subjects they wish to study according to criteria which will be associated with their performance. This is a general problem with many kinds of data where a choice of response variable operates. Fitting other covariates such as GCSE scores may move the model closer to the assumption of missing at random, but may not always be effective. In the present paper we explore a practical method for handling such data, and try to specify models around certain substantive questions of research interest. We have chosen to study only A/AS level mathematics, since this has several options that pupils can choose between. In the next section we introduce the data, and this is followed by various models.

2 Data

A full description of the data set is given by Yang and Woodhouse (2000). Briefly, data for each pupil were matched from A/AS level examinations back to GCSE, incorporating the individual GCSE total point score, the point score on GCSE Mathematics and the point score on GCSE English. Four years' worth of data are available for pupil exam entries in 1993, 1994, 1995 and 1997, and in this analysis we use only the 1997 data. This gives 61,116 entries for 53,798 students from 2,607 institutions and six examination boards. Information on each student's age in months, gender and type of educational establishment is also available.

According to the course code in the database, supplied by the U.K. Department for Education and Employment (DfEE), the data entries covered ten different types of mathematics, as shown in Table 1. Among the students, 46,727 (86.9%) had a single entry, 6,829 (12.7%) had double entries, and 237 (0.44%) had three or four entries. The mean point score in the table is the average over all entries of each particular entry type, shown separately for A and AS level.

The definition of the subject is not consistent across the six examination boards. In particular, the courses labelled D/D, P/A, P/S, P/M and ADD were offered by only a single board. All entries for D/D were for an AS-level course, with 65 single D/D entries and 44 entries in D/D plus AS-level Main mathematics. Additional mathematics had only 67 entries, all from one exam board. Statistics is another subject with 68% of its entries at AS-level. For simplicity, and because the main thrust of this paper is methodological, we have excluded entries for D/D, Additional mathematics and Statistics from the analysis.

For the subjects P/A, P/S and P/M, single entries accounted for 84%, 95% and 78% respectively, and the most common double entries were those paired with F, at 15%, 5% and 22% respectively. This suggests that these three subjects were in fact the equivalent of Main mathematics for the JMB exam board. We therefore recoded them as Main in our analysis. We compared models based on two sets of data, including and excluding P/A, P/S and P/M, and found little difference in the results of the analyses.

Table 1 Mathematics entries and the mean A/AS level point scores

Mathematics entry         A-level entries (%)   A-level mean (SD)   AS-level entries (%)   AS-level mean (SD)   Exam Board*
Main (M)                  44,672 (84.24)        6.00 (3.36)         3,825 (47.32)          1.26 (1.51)          All six
Pure (P)                  640 (1.21)            7.60 (3.19)         667 (8.25)             1.27 (1.46)          All but JMB
Decision/Discrete (D/D)   0 (0.00)                                  153 (1.89)             1.62 (1.55)          AEB
Applied (A)               399 (0.75)            7.26 (3.22)         585 (7.24)             1.73 (1.64)          AEB, LOND, OXCAM
Pure & Applied (P/A)      1,329 (2.51)          5.90 (3.48)         149 (1.84)             1.03 (1.03)          JMB
Pure & Statistics (P/S)   526 (0.99)            4.55 (3.41)         0                                           JMB
Pure & Mechanics (P/M)    529 (1.00)            6.45 (3.48)         0                                           JMB
Further (F)               4,404 (8.30)          7.55 (2.98)         1,608 (19.89)          2.97 (1.76)          All six
Additional (ADD)          28 (0.05)             7.14 (4.05)         39 (0.48)              3.31 (1.88)          OXCAM
Statistics (S)            505 (0.95)            5.30 (3.40)         1,058 (13.09)          1.54 (1.58)          All six*
Total                     53,032 (100.0)        6.14 (3.34)         8,084 (100.0)          1.68 (1.58)

The points allocated for A level (AS level in brackets) are: A grade = 10 (5), B grade = 8 (4), C grade = 6 (3), D grade = 4 (2), E grade = 2 (1), F grade = 0 (0).

The examination boards in 1997 are coded as follows: 1. AEB = Associated Examining Board; 2. CAMB = Cambridge; 3. LOND = London; 4. OXFL = Oxford; 5. JMB = Joint Matriculation Board; 6. OXCAM = Oxford and Cambridge.

* Cambridge and Oxford do not have statistics at AS level.

Due to missing establishment type codes, 78 entries have also been excluded. We thus obtain a data set of 59,256 entries on 52,587 students from 2,592 institutions, covering the A/AS-level subjects Main, Pure, Applied and Further. Since the A and AS scores are on different scales with different distributions, we keep them separate in the following analyses, in contrast to the usual practice of combining them into a single score.

Among the 52,587 students, 6,541 had double mathematics entries and 64 had triple entries. We have calculated the raw correlation coefficients based only on the pairs shown in Table 2. Because the AS level entries are small in number for most combinations, we have confined the calculations to A level entries.
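The pairwise calculation behind Table 2 can be reproduced along the following lines; this is a minimal pandas sketch with hypothetical column names, not the authors' own code.

```python
import pandas as pd

# Hypothetical entry-level records: one row per A level mathematics entry.
entries = pd.DataFrame({
    "student": [1, 1, 2, 2, 3, 4],
    "subject": ["M", "F", "M", "F", "M", "P"],
    "score":   [8, 10, 6, 8, 4, 6],
})

# One column per subject, one row per student; a student contributes to a pair's
# correlation only if both subjects were taken (pairwise deletion).
scores = entries.pivot_table(index="student", columns="subject", values="score")
print(scores.corr(method="pearson", min_periods=2))

# Counts of students contributing to each pair (the off-diagonal n in Table 2).
taken = scores.notna().astype(int)
print(taken.T @ taken)
```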

Table 2 Number of students with single A level mathematics (on the diagonal) and double mathematics entries (off diagonal), with pairwise correlation coefficients of responses in brackets

        M               P              A     F
M       40,800
P       13 (0.83)       273
A       10 (0.25)       278 (0.63)     94
F       4,310 (0.60)    12 (0.99)      0     73

Table 3 shows the sub-group means and standard deviations of the four mathematics scores by combination of subjects. Here the combination C1 is for students who took A level Main maths combined with A level Further maths and possibly other A level mathematics subjects. This group of students had higher than average scores for both Main and Further mathematics at A level. Likewise, the third combination, indicated by C3, is for students who took A-level Main combined with AS-level Further mathematics. This combination also produced, on average, higher scores on A level Main maths. The students taking combination C5, Pure plus Applied and possibly more, also had higher scores on average for both Pure and Applied mathematics. This suggests that these three combinations were common choices for more able students. Students taking A-level Main and an additional AS paper other than Further (indicated by C4 in Table 3) had rather lower scores on average for Main mathematics, suggesting that this combination was a choice for less able students. It is this 'informative' choice of combinations, which implies non-random missingness, that is the main focus of this paper.

Table 3 Mean scores (SD) of students by course combination

A level                                    Students   Mean (SD)
Main
  Alone                                    40,800     5.54 (3.29)
  + A level F + possibly more (C1)         4,314      9.45 (1.39)
  + A level others but no F (C2)           27         7.70 (2.76)
  + AS level F + more (C3)                 1,560      8.68 (2.04)
  + AS others but no F (C4)                305        3.34 (3.82)
  Overall                                  47,006     5.99 (3.37)
Pure
  Alone                                    273        5.96 (3.49)
  + A level A + possibly more (C5)         286        9.23 (1.66)
  + AS level A + more                      52         7.77 (3.11)
  + AS others but no A                     29         6.50 (3.30)
  Overall                                  640        7.60 (3.19)
Applied
  Alone                                    94         3.62 (2.94)
  + A level P + possibly more (C5)         286        8.51 (2.17)
  + AS level P + more                      7          4.86 (3.44)
  + AS others but no P                     12         5.47 (4.44)
  Overall                                  399        7.26 (3.22)
Further
  Alone                                    73         8.66 (2.24)
  + A level M + possibly more (C1)         4,314      7.54 (2.99)
  + A level others but no M                17         8.48 (2.94)
  Overall                                  4,404      7.55 (2.98)

AS level                                   Students   Mean (SD)
Main
  Alone                                    3,906      1.23 (1.48)
  + at least 1 A level (C6)                24         3.38 (1.95)
  + at least 1 AS (no A)                   18         1.18 (1.59)
  Overall                                  3,948      1.23 (1.49)
Pure
  Alone                                    407        1.10 (1.49)
  + at least 1 A                           256        1.60 (1.39)
  + at least 1 AS (no A)                   3          0.96 (1.40)
  Overall                                  666        1.27 (1.46)
Applied
  Alone                                    394        1.44 (1.46)
  + at least 1 A (C7)                      173        2.46 (1.81)
  + at least 1 AS (no A)                   18         1.94 (1.95)
  Overall                                  585        1.73 (1.58)
Further
  Alone                                    35         3.17 (1.72)
  + at least 1 A                           1,572      2.97 (1.76)
  + at least 1 AS (no A)                   1
  Overall                                  1,608      2.97 (1.76)

Adjusting for the GCSE results can be expected to move our analysis nearer to the assumption of missingness conditionally (given GCSE) at random, since GCSE results are correlated with overall ability. We would however expect other factors, such as school exam entry policies, also to be associated with the responses, but we have no information on these.

Our basic general proposal is to fit separate terms for each observed combination. This adjusts the mean values for each response for overall differences in choice of subjects. It is still possible that there are interactions between combinations taken and other predictor variables such as gender and GCSE and we will explore this possibility. Because of possible institutional effects we will also fit models where we allow the combination adjustment effects to vary randomly across institutions.
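One way of constructing such combination indicators is sketched below; the coding rules follow Table 3, but the helper and column names are hypothetical illustrations rather than the authors' own code, and only two of the combinations are shown.

```python
import pandas as pd

def combination_dummies(entries: pd.DataFrame) -> pd.DataFrame:
    """Return one row per student with 0/1 indicators for two of the Table 3
    combinations.  C1: A level Main taken together with A level Further
    (possibly more); C5: A level Pure together with A level Applied (possibly
    more).  The remaining combinations would be coded analogously."""
    taken = (entries.assign(flag=1)
                    .pivot_table(index="student", columns="subject",
                                 values="flag", aggfunc="max", fill_value=0)
                    .reindex(columns=["M", "P", "A", "F"], fill_value=0))
    out = pd.DataFrame(index=taken.index)
    out["C1"] = ((taken["M"] == 1) & (taken["F"] == 1)).astype(int)
    out["C5"] = ((taken["P"] == 1) & (taken["A"] == 1)).astype(int)
    return out

entries = pd.DataFrame({"student": [1, 1, 2, 3, 3],
                        "subject": ["M", "F", "M", "P", "A"]})
print(combination_dummies(entries))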

3 Analytical Strategies

The following substantive questions are of interest:

1. What are the relationships between the four mathematics options at institution level? In other words, do institutions that do well in one mathematics option tend to do well in other options? Do institutions that do well with an A level option also tend to do well in the same option at AS level?

2. For students taking particular combinations of options, does the performance for different options vary across institutions? What is the institution level relationship between the different mathematics subjects for students taking particular combinations of different options?

3. How are the effects of the predictor variables GCSE average score, GCSE maths score, gender, exam board and institution types associated with the average performance of students taking different options or taking particular combinations of options?

First, we fit a single level model with the combinations identified in Table 3, and compare the results with those from a simple multivariate model that assumes the responses are missing at random, for the A and AS level scores separately. For simplicity, no covariates are fitted at this step and the single level model estimates are compared with the raw data.

Then we extend the proposed multivariate multilevel model to eight responses for A and AS level scores jointly, including the main effects for other covariates, the GCSE scores and gender. We also explore the fixed and random effects of the combinations of options in the proposed model, and interactions between the combinations and the covariates. This allows us to compare effects of covariates on any maths subject between A and AS levels, and to estimate the relationship of mathematical outcomes between A and AS levels within institution.

We use the original point scores rather than, for example, a transformation to Normality. This retains a familiar scale for easy interpretation. We have compared the residual distributions for each type of mathematics using the raw point score and the Normally transformed score. Results based on the conditional model suggest that the assumption of Normality for both the A and AS level raw point scores is acceptable.

4 Some basic multivariate models

Let j indicate institution, i indicate student, and h (=1, 2, 3, 4; main, pure, applied, further) the paper chosen. This gives a three-level data structure, papers nested within students nested within institutions.

For the score on the hth paper of the ith student in the jth institution, the simplest multilevel model, without adjusting for any covariate, has a fixed intercept for the paper, an institution random effect v_{h,j} and an individual random effect u_{h,ij}:

\[
y_{h,ij} = \beta_h + v_{h,j} + u_{h,ij} \qquad (1)
\]
\[
(v_{1,j}, v_{2,j}, v_{3,j}, v_{4,j})^T \sim \mathrm{MVN}(0, \Omega_v), \qquad
(u_{1,ij}, u_{2,ij}, u_{3,ij}, u_{4,ij})^T \sim \mathrm{MVN}(0, \Omega_u)
\]
\[
\Omega_v = \begin{pmatrix}
\sigma_{v1}^2 \\
\sigma_{v12} & \sigma_{v2}^2 \\
\sigma_{v13} & \sigma_{v23} & \sigma_{v3}^2 \\
\sigma_{v14} & \sigma_{v24} & \sigma_{v34} & \sigma_{v4}^2
\end{pmatrix}, \qquad
\Omega_u = \begin{pmatrix}
\sigma_{u1}^2 \\
\sigma_{u12} & \sigma_{u2}^2 \\
\sigma_{u13} & \sigma_{u23} & \sigma_{u3}^2 \\
\sigma_{u14} & \sigma_{u24} & \sigma_{u34} & \sigma_{u4}^2
\end{pmatrix}
\]

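To make the data structure implied by model (1) concrete, the short simulation below generates responses with institution level and student level random effects; the numerical values used for the intercepts and the covariance matrices are illustrative only and are not the estimates reported later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inst, n_per_inst = 50, 40                   # institutions and students per institution
beta = np.array([6.0, 7.5, 7.3, 7.6])         # fixed intercepts for M, P, A, F (illustrative)

# Illustrative positive definite covariance matrices, not the fitted values.
omega_v = 0.5 * np.eye(4) + 0.5               # institution level
omega_u = 8.0 * np.eye(4) + 2.0               # student level

v = rng.multivariate_normal(np.zeros(4), omega_v, size=n_inst)
y = []
for j in range(n_inst):
    u = rng.multivariate_normal(np.zeros(4), omega_u, size=n_per_inst)
    y.append(beta + v[j] + u)                 # y_{h,ij} = beta_h + v_{h,j} + u_{h,ij}
y = np.concatenate(y)                         # rows: students; columns: the four papers

# In the real data most entries of y would be unobserved, because each student
# takes only a subset of the four papers.
print(y.shape, y.mean(axis=0).round(2))
```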
Not every element in Ω_u can be estimated because of the amount of missing data in some pairs. We constrain the model to estimate only the covariances for the pairs M and F, and A and P, for the A level results at student level, and some different pairs for the AS level results. For simplicity, we first fit a simpler version of model (1) without the institution random effects, i.e. fitting y_{h,i} = β_h + u_{h,i}, with only the covariances σ_{u14} and σ_{u23} fitted for the A level scores, and σ_{u12}, σ_{u13} and σ_{u23} fitted for the AS level scores. The results are listed in column A of Table 4.

To allow for the differences associated with option choice, we add to the fixed part of the model dummy variables for the combinations of options derived from the exploratory analysis of Table 3, along with separate variances for combinations C1 and C5 of the A level results. We first present separate analyses for the A and AS level scores. The model for the A level scores, ignoring the institution level for simplicity, may be written as

\[
y_{h,i} = \beta_{h,i} \qquad (h = 1, \ldots, 4)
\]
\[
\beta_{1,i} = \beta_1 + (u_{1,i} + u_{c1,i}) + (\alpha_1 C1_i + \alpha_2 C2_i + \alpha_3 C3_i + \alpha_4 C4_i)
\]
\[
\beta_{2,i} = \beta_2 + (u_{2,i} + u_{c5,i}) + \alpha_5 C5_i \qquad (2)
\]
\[
\beta_{3,i} = \beta_3 + u_{3,i}
\]
\[
\beta_{4,i} = \beta_4 + u_{4,i}
\]
\[
(u_{1,i}, u_{2,i}, u_{3,i}, u_{4,i}, u_{c1,i}, u_{c5,i})^T \sim \mathrm{MVN}(0, \Omega_u)
\]

Following the procedure for modelling complex variance (Rasbash et al., 2000), we set σ²_{u,c1} = σ²_{u,c5} = 0. We also assume cov(u_{c1,i}, u_{c5,i}) = 0, since the combinations are measured at the student level and students take only one of the combinations C1 or C5. C1 - C5 are defined as in Table 3. Note that we do not need to fit combination effects for the Applied and Further responses, since model (2) is adequate in terms of agreement with the raw data.

In the variance-covariance structure we assume additional random coefficients only for students who took the combinations C1 and C5. Thus, for example, the variance for Main mathematics for the combination C1 is given by σ²_{u1} + 2σ_{u1,c1}, and that for Pure for the combination C5 is given by σ²_{u2} + 2σ_{u2,c5}. The covariance between the options Main and Further is assumed to be the same for students taking the combination C1 as for other combinations, and the covariance between the options Pure and Applied is assumed to be the same for the combination C5 as for the other combinations. Thus the correlation between Main and Further for those taking C1 is σ_{u14} / √((σ²_{u1} + 2σ_{u1,c1}) σ²_{u4}), etc.
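As a rounding check of this expression, plugging in the column C student level estimates of Table 4 reproduces the Main-Further correlation of about 0.6 for students taking C1 that is reported below; this is a verification of the algebra, not a new result.

```latex
\[
\rho_{MF \mid C1}
  = \frac{\sigma_{u14}}{\sqrt{(\sigma_{u1}^2 + 2\sigma_{u1,c1})\,\sigma_{u4}^2}}
  = \frac{2.18}{\sqrt{(8.73 - 2 \times 3.41) \times 6.95}}
  \approx 0.60 .
\]
```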

The student level covariance matrix, with rows and columns ordered as (u_1, u_2, u_3, u_4, u_{c1}, u_{c5}), is

\[
\Omega_u = \begin{pmatrix}
\sigma_{u1}^2 \\
 & \sigma_{u2}^2 \\
 & \sigma_{u23} & \sigma_{u3}^2 \\
\sigma_{u14} & & & \sigma_{u4}^2 \\
\sigma_{u1,c1} & & & & \\
 & \sigma_{u2,c5} & & & &
\end{pmatrix}
\]

The extension of the model to a multilevel structure, with institutions at the highest level, is straightforward with the standard assumption of the multivariate Normal distribution among higher level units.

Estimates for this model are listed in column B of Table 4. The multilevel version, including the institution random effects v_{h,j}, is presented in column C of the table.

Table 4 Estimates of the A/AS-level maths scores for the basic multivariate models, SE in parentheses. Models: A = no combinations fitted; B = combinations fitted; C = combinations fitted plus random effects at institution level. For each parameter the A level estimate is given first and the AS level estimate second.

Fixed effects
Main β1:     A: 5.99 (0.016), 1.24 (0.024);   B: 5.53 (0.016), 1.22 (0.024);   C: 5.28 (0.032), 1.22 (0.033)
Pure β2:     A: 7.29 (0.120), 1.27 (0.056);   B: 6.28 (0.186), 1.27 (0.056);   C: 6.42 (0.206), 1.24 (0.071)
Applied β3:  A: 6.34 (0.149), 1.73 (0.067);   B: 7.26 (0.161), 1.45 (0.076);   C: 7.24 (0.227), 1.29 (0.108)
Further β4:  A: 3.39 (0.041), 2.91 (0.044);   B: 7.55 (0.045), 2.97 (0.044);   C: 6.74 (0.077), 2.86 (0.057)
A-C1 α1:     B: 3.93 (0.026);   C: 3.70 (0.028)
A-C2 α2:     B: 1.81 (0.375);   C: 2.46 (0.369)
A-C3 α3:     B: 3.15 (0.084);   C: 3.03 (0.079)
A-C4 α4:     B: -1.38 (0.139);  C: -1.18 (0.147)
A-C5 α5:     B: 2.34 (0.216);   C: 2.09 (0.228)
AS-C6 α6:    B: 2.11 (0.32);    C: 2.27 (0.33)
AS-C7 α7:    B: 0.97 (0.14);    C: 1.13 (0.17)

Institution level (model C only; A level, AS level)
σ²_v1: 1.74 (0.07), 0.69 (0.06)
σ²_v2: 2.27 (0.39), 0.85 (0.14)
σ²_v3: 5.91 (0.97), 1.09 (0.18)
σ²_v4: 5.93 (0.31), 1.05 (0.12)
σ_v12: 1.16 (0.20), 0.24 (0.12)
σ_v13: 1.76 (0.32), 0.16 (0.15)
σ_v14: 2.92 (0.13), 0.44 (0.08)
σ_v23: 3.41 (0.57), 0.02 (0.20)
σ_v24: 1.98 (0.43), 0.19 (0.20)
σ_v34: 3.04 (0.69), 0.50 (0.19)

Student level
σ²_u1:  A: 11.37 (0.07), 2.22 (0.05);   B: 10.55 (0.07), 2.19 (0.05);   C: 8.73 (0.06), 1.55 (0.04)
σ²_u2:  A: 10.48 (0.56), 2.13 (0.12);   B: 12.30 (0.92), 2.13 (0.12);   C: 9.56 (0.78), 1.27 (0.10)
σ²_u3:  A: 13.85 (0.83), 2.68 (0.16);   B: 10.35 (0.73), 2.49 (0.15);   C: 5.05 (0.43), 1.52 (0.11)
σ²_u4:  A: 22.50 (0.34), 3.11 (0.11);   B: 8.89 (0.19), 3.11 (0.11);    C: 6.95 (0.16), 2.09 (0.09)
σ_u12 (AS level):  A: 1.42 (0.28);   B: 1.22 (0.34)
σ_u13 (AS level):  A: 0.88 (0.49);   B: 0.71 (0.49)
σ_u14 (A level):   A: 13.79 (0.15);  B: 2.45 (0.07);   C: 2.18 (0.07)
σ_u23:  A: 10.72 (0.62), 1.60 (0.40);   B: 5.12 (0.44), 1.63 (0.34);   C: 2.43 (0.27)
σ_u1,c1 (A level): B: -4.31 (0.04);  C: -3.41 (0.04)
σ_u2,c5 (A level): B: -4.09 (0.49);  C: -3.53 (0.41)

-2 log-likelihood:  A: 273,081.2, 25,346.43;   B: 262,158.1, 25,261.64;   C: 256,748.0, 24,451.89

We can obtain predictions from Table 4 for the mean of Main maths as the weighted sum, using the observed sample sizes, of the estimates of β1 and α1 to α4; that for Pure maths is the weighted sum of β2 and α5. The estimated between-student standard deviations can be calculated correspondingly. Similarly, estimates of the overall means and standard deviations for the AS level options can be obtained. We compare the estimates for the models in columns A and B of Table 4 with the raw data in Table 5.
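For example, using the column B estimates together with the group sizes of Table 3, the implied overall A level Main mean is the following weighted sum (a rounding check against the adjusted estimate in Table 5):

```latex
\[
\hat{\mu}_{M} = \hat{\beta}_1
  + \frac{4314\,\hat{\alpha}_1 + 27\,\hat{\alpha}_2 + 1560\,\hat{\alpha}_3 + 305\,\hat{\alpha}_4}{47006}
  = 5.53 + \frac{4314(3.93) + 27(1.81) + 1560(3.15) + 305(-1.38)}{47006}
  \approx 5.99 .
\]
```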

Table 5 Comparison between models and raw data using results from Table 4. Estimated standard deviations in brackets. Within each of the A level and AS level blocks the three entries are: the raw data mean (SD), the estimate (SD) with no adjustment for combinations, and the estimate (SD) adjusted for combinations.

Means
Main:     A level: 5.99 (3.37), 5.99 (3.37), 5.99 (3.13);   AS level: 1.23 (1.49), 1.24 (1.48), 1.22 (1.48)
Pure:     A level: 7.60 (3.19), 7.29 (3.24), 7.33 (2.93);   AS level: 1.27 (1.46), 1.27 (1.46), 1.28 (1.46)
Applied:  A level: 7.26 (3.20), 6.34 (3.72), 7.26 (3.23);   AS level: 1.73 (1.64), 1.73 (1.64), 1.73 (1.58)
Further:  A level: 7.55 (2.98), 3.39 (4.73), 7.55 (2.98);   AS level: 2.97 (1.76), 2.91 (1.76), 2.98 (1.76)

Correlations
M vs P:   A level: raw 0.83 (n=13);                          AS level: raw 0.64 (n=15), 0.65, 0.57
M vs A:   A level: raw 0.25 (n=10);                          AS level: raw 0.37 (n=17), 0.36, 0.30
M vs F:   A level: raw 0.60 (n=4,310), 0.86, 0.53 (C1);      AS level: N/a
P vs A:   A level: raw 0.63 (n=278), 0.89, 0.78 (C5);        AS level: raw 0.82 (n=7), 0.67, 0.71
P vs F:   A level: raw 0.99 (n=12);                          AS level: raw 0.60 (n=9), N/a, N/a
A vs F:   A level: N/a;                                      AS level: raw 0.83 (n=5), N/a, N/a

For A level we see that it is important to adjust for options chosen in order to represent the actual score pattern. This is the case for both the fixed and random parameters. For AS level, adjusting for the two options chosen is less important, suggesting a less pronounced option choice policy.

The multilevel model which includes institutional effects gives student level correlation coefficients for A level between M and F and between P and A of 0.60 and 0.68, and 0.93 and 0.91 at the institution level. Thus the average performance of institutions on one A level mathematics subject is highly correlated with performance in other subjects.

5 Modelling A and AS level together and adjusting for GCSE

We now fit the A and AS level scores in a single model together with various other predictors, including the GCSE scores. The polynomials fitted for the GCSE variables are similar to those fitted by Yang and Woodhouse (2000). All GCSE scores are centred at their sample means. The estimates for the fixed parameters are given in Table 6 and those for the random parameters in Table 7. Note that the term for combination C2 is omitted because of the small number of students.

Table 6 Parameter estimates for the fixed effects with and without adjusting for GCSE results, S.E. in parentheses. For each term the first value is from the model without adjustment for GCSE and the second from the GCSE-adjusted model; terms involving GCSE appear in the adjusted model only, and a dash marks an effect not separately estimated in that model.

Intercepts: A-M 4.183 (0.089), 5.155 (0.067); A-P 5.949 (0.280), 6.051 (0.209); A-A 5.750 (0.319), 5.875 (0.192); A-F 5.256 (0.233), 4.803 (0.202); AS-M 0.804 (0.162), 1.561 (0.126); AS-P 0.964 (0.100), 1.556 (0.078); AS-A 0.578 (0.215), 1.303 (0.172); AS-F 2.269 (0.319), 1.780 (0.279).
Combinations: A-C1 3.660 (0.047), 2.796 (0.054); A-C3 3.012 (0.076), 2.239 (0.066); A-C4 -1.732 (0.171), -1.897 (0.162); A-C5 1.895 (0.247), 1.486 (0.204); AS-C6 2.338 (0.312), 1.788 (0.250); AS-C7 1.603 (0.204), 0.754 (0.166).
Gender (Female - Male): A-M 0.416 (0.031), -0.412 (0.029); A-P -0.337 (0.252), -0.839 (0.209); A-A -0.195 (0.310), -0.840 (0.232); A-F 0.559 (0.091), -0.639 (0.086); AS 0.085 (0.103), -0.236 (0.086).
Age: A-M -0.017 (0.004), -0.049 (0.003); A-P -0.001 (0.029), -0.020 (0.024); A-A 0.007 (0.035), -0.017 (0.029); A-F -0.032 (0.011), -0.078 (0.010); AS-M -0.006 (0.006), -0.018 (0.005); AS-P -0.033 (0.014), -0.021 (0.012); AS-A 0.017 (0.017), 0.005 (0.014); AS-F -0.040 (0.011), -0.041 (0.010).
Institution type (base = M/C): M sel 0.631 (0.099), -; M modern -0.865 (0.252), -; GM comp -0.010 (0.060), 0.055 (0.045); GM sel 0.644 (0.084), -0.006 (0.066); GM modern -0.556 (0.295), -; Ind. sel 1.230 (0.055), 0.060 (0.048); Ind. ns 0.138 (0.154), 0.208 (0.119); 6th-form coll 0.144 (0.068), 0.121 (0.051); FE college -0.164 (0.051), -0.238 (0.033); Unknown 0.099 (0.024).
Exam board (base = AEB): Camb A-M 1.493 (0.114), 1.136 (0.083); Camb A-P 2.438 (2.401), 0.893 (1.765); Camb A-F 1.962 (0.279), 1.114 (0.238); Camb AS-M 0.560 (0.180), 0.433 (0.136); Camb AS-P 0.563 (0.322), 0.305 (0.248); Camb AS-F 0.630 (0.361), 0.589 (0.314); Lond A-M 0.483 (0.097), 0.295 (0.071); Lond A-P 0.183 (0.304), -0.706 (0.220); Lond A-A 1.678 (0.392), 0.286 (0.236); Lond A-F 0.323 (0.248), -0.082 (0.214); Lond AS-M -0.080 (0.165), -0.159 (0.126); Lond AS-P 0.007 (0.130), -0.156 (0.098); Lond AS-A 0.086 (0.226), -0.019 (0.177); Lond AS-F -0.181 (0.334), -0.108 (0.291); Oxfl A-M 0.648 (0.255), 0.487 (0.185); Oxfl A-P 2.164 (1.021), 1.246 (0.719); Oxfl A-F -2.801 (0.661), -3.003 (0.568); Oxfl AS-M 0.318 (0.288), -0.006 (0.218); Oxfl AS-P 2.944 (0.547), 1.595 (0.400); Oxfl AS-F 0.445 (0.947), -0.365 (0.815); JMB A-M 0.701 (0.100), 0.919 (0.073); JMB A-F 1.330 (0.258), 1.378 (0.224); JMB AS-M 0.153 (0.170), 0.194 (0.130); JMB AS-F 0.516 (0.335), 0.339 (0.292); Oxcam A-M 0.998 (0.108), 0.865 (0.078); Oxcam A-P 0.790 (1.038), 0.635 (0.784); Oxcam A-A 2.867 (1.723), 1.558 (1.313); Oxcam A-F 1.364 (0.270), 0.937 (0.232); Oxcam AS-M 0.524 (0.175), 0.365 (0.134); Oxcam AS-P -0.044 (0.316), 0.294 (0.244); Oxcam AS-A 0.750 (0.242), 0.686 (0.191); Oxcam AS-F 0.459 (0.336), 0.422 (0.293).
GCSE average (adjusted model only): A GA 1.878 (0.065); A GA^2 0.223 (0.021); A GA^3 -0.198 (0.013); A GA^4 -0.028 (0.002); AS GA 0.705 (0.027); AS GA^2 0.217 (0.021); AS GA^3 0.011 (0.005).
GCSE mathematics (adjusted model only): A GM 1.344 (0.020); A GM^2 0.078 (0.021); A GM^3 -0.023 (0.003); AS GM 0.575 (0.029); AS GM^2 0.161 (0.025); AS GM^3 0.010 (0.003).
Institution level GCSE (adjusted model only): aggregated GA 0.148 (0.060); aggregated GM -0.017 (0.034).
Interactions, GCSE × combination (adjusted model only): A C1 × GA -1.036 (0.068); A C1 × GA^2 -0.511 (0.065); A C1 × GA^3 0.656 (0.196); A C1 × GM -; A C1 × GM^2 -0.605 (0.050); A C3 × GM -0.082 (0.027); A C3 × GM^2 0.460 (0.169).
Interactions, GA × GM (adjusted model only): A -0.232 (0.028); AS -0.051 (0.027).
Interactions, GCSE × gender (adjusted model only): Girl × GA 0.063 (0.029); Girl × GA^2 0.117 (0.024).
Interactions, GCSE × subject (adjusted model only): A-M × GA 0.106 (0.061); A-P × GA -0.341 (0.110).
Interaction, institution × subject: Ind sel × A-F 1.038 (0.11), 0.388 (0.107).
-2 log-likelihood: 282,241.1, 255,038.4.

The range of scores for GCSE mathematics (GM) and the GCSE average (GA) is 0 to 8, and they are centred around an origin of 6.25. Institution types: M/C = Maintained comprehensive; M sel = Maintained selective; M modern = Maintained modern; GM comp = Grant maintained comprehensive; GM sel = Grant maintained selective; GM modern = Grant maintained secondary modern; Ind sel = Independent selective; Ind ns = Independent non-selective; 6th-form coll = 6th form college; FE college = Further education college.

Based on the GCSE adjusted model, we can summarise our findings as follows:

Gender: The results show that, at the average GCSE score, females do worse than male students. Among the AS options, gender effects were less pronounced, and a single AS gender effect was estimated. There is a quadratic interaction between gender and the GCSE average score, with girls performing worst at a GCSE average score of about 4 and improving for both lower and higher GCSE scores. Achievement on Main maths for females is less variable than for male students, but the reverse is true for Further maths. For the AS level data, achievements on both Main and Further are more variable for females than for males.

Age: Significant age effects are present only for Main and Further maths, at both A and AS level, with scores decreasing with age. For Pure and Applied maths no significant age effect is found.

Institution type: Before adjusting for the two GCSE scores, institution type showed large effects, with an advantage to the Selective, Independent and Sixth Form institutions. Having adjusted for GCSE results, the effects of Maintained selective and modern schools, as well as Grant maintained modern schools, were very small and these categories have been combined with the base category, Maintained comprehensive. There is just a small advantage for the Sixth Form colleges and other Independent schools, and a small disadvantage for FE colleges. Independent selective schools show a small additional advantage for A level Further mathematics. No differential institutional effects are found for AS level. See also Yang and Woodhouse (2000) for further discussion.

Exam board: The exam boards had different effects on the different mathematics options, although all of the other five boards did better on A-level Main maths than AEB. The Cambridge and JMB boards did significantly better on both Main and Further maths at A-level. For the AS subjects, Cambridge had a significantly positive effect for Main only. Students of the Oxford and Cambridge joint board had higher scores for Main and Further at A-level, and for Main and Applied at AS-level. There is a tendency for boards that had a positive effect on an A level subject also to have a positive effect on the same subject at AS level. In some cases the effects on AS level subjects were not significant because of the small numbers of students taking those subjects.

Table 7 Estimated variances, covariances and correlation coefficients with and without adjusting for GCSE results (S.E. in parentheses), correlation coefficients in the upper triangle; estimates without adjusting for GCSE are in the first line, those with adjustment in the second line.

Institution level (rows and columns in the order A-M, A-P, A-A, A-F, AS-M, AS-P, AS-A, AS-F; variances and covariances in the lower triangle before the vertical bar, correlations from the upper triangle after it):

A-M    1.36 (0.06) | 0.50  0.49  0.90  0.19
       0.64 (0.03) | 0.45  0.49  0.81  0.32
A-P    0.88 (0.20)  2.27 (0.45) | 0.93  0.51  -0.06
       0.29 (0.10)  0.67 (0.21) | 0.73  0.59  0.30
A-A    1.19 (0.25)  2.82 (0.53)  4.04 (0.76) | 0.49
       0.28 (0.11)  0.44 (0.19)  0.54 (0.23) | 0.74
A-F    2.25 (0.11)  1.64 (0.43)  2.10 (0.56)  4.60 (0.26) | 0.19
       1.00 (0.06)  0.75 (0.24)  0.84 (0.26)  2.42 (0.16) | 0.30
AS-M   0.13 (0.03)  0.35 (0.04) | 0.28
       0.10 (0.02)  0.16 (0.02) | 0.23
AS-P   -0.06 (0.16)  0.46 (0.10)
       0.09 (0.11)   0.03 (0.05)
AS-A   0.86 (0.16) | 0.49
       0.39 (0.09) | 0.36
AS-F   0.39 (0.09)  0.16 (0.06)  0.44 (0.16)  0.92 (0.11)
       0.35 (0.08)  0.07 (0.04)  0.17 (0.11)  0.56 (0.09)

Student level:

A-M    8.04 (0.07) | 0.07  0.83
       5.02 (0.04) | 0.17  0.73
A-P    0.45 (1.09)  5.49 (0.40) | 0.82
       0.75 (0.75)  4.09 (0.30) | 0.73
A-A    5.06 (0.43)  6.96 (0.58)
       3.30 (0.30)  4.94 (0.40)
A-F    8.98 (0.13)  14.5 (0.27)
       4.71 (0.09)  8.28 (0.18)
AS-M   1.38 (0.05)
       0.96 (0.03)
AS-P   1.11 (0.10)
       0.90 (0.08)
AS-A   1.48 (0.13)
       1.14 (0.10)
AS-F   1.87 (0.10)
       1.62 (0.08)
Female - male difference in variance (A-M, A-P, A-A, A-F, AS-M, AS-P, AS-A, AS-F):
       -0.12 (0.05)  1.07 (0.42)  0.24 (0.60)  1.24 (0.20)  0.23 (0.04)  0.19 (0.09)  0.04 (0.11)  0.40 (0.10)
       -0.24 (0.03)  0.45 (0.27)  -0.52 (0.33)  0.46 (0.16)  0.11 (0.03)  0.13 (0.06)  -0.01 (0.08)  0.31 (0.08)

The estimates of variances and covariances in Table 7 indicate that adjusting for the main effects of the two GCSE scores reduces the institution level variation by over half for both A and AS level. It also considerably reduces the variation among students. At the institution level for A level there are high correlations between the pairs M & F and A & P. For the AS level subjects the institution level correlations are estimated to be small or non-significant.

We also see from Table 7 that the estimated correlation coefficients for the options Main, Pure and Further between A and AS levels are all around 0.30 among institutions. The numbers of institutions entering students for each of these three subjects at both A and AS level are 1,212, 78 and 405 respectively. The correlation for the option Applied between A and AS level could not be estimated because of the relatively small number of institutions (44) that did both.

The weak relationships suggest that institutions that did well in a maths subject at A level did not necessarily also do well in the same maths subject at AS level.
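As an illustration of how these correlations follow from Table 7, the GCSE-adjusted institution level estimates for Main give (using the reconstructed values above; a rounding check only):

```latex
\[
\rho(\text{A-M}, \text{AS-M}) = \frac{0.10}{\sqrt{0.64 \times 0.16}} \approx 0.31 .
\]
```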

No student in the data had scores for the same type of mathematics at both A and AS level, so the corresponding student level relationships could not be estimated.

6 Modelling random effects for subject choice; A level only

We now allow the coefficients of the subject combination dummy variables to be random across institutions. At the student level the variance is now modelled as a function of these combinations as well as gender. For simplicity we now model only the A level results in these analyses.

The four combination groups do not come from the same set of institutions. The combination C1 consists of 4,314 students from 990 institutions, C3 has 1,560 students from 622 institutions, and combinations C4 and C5 have 305 and 286 students from 162 and 101 institutions respectively. The variances and covariances to be estimated are as follows:

2 σ v 1 σ 2 v2 2 σσv 2,3 v3 2 σσσvv 1,4 3,4 v4 Ω=v 2 σσσvv 1,cc 1 4, 1 vc1 2 σσσσvvv 1,cccc 3 4, 3 1, 3 vc3 2 σσσvv 1,ccc 4 1, 4 vc 4 2 σσvv σ 2,cc 5 3, 5 vc5 Thus, for example, the variance of Main maths effect of the base group of students among institutions is estimated as σ 2 , and that for the coefficient of the combination C1 is v1

22 σ ++2σσv . At the institution level the covariance between the base group score effect vv111,c 1 c

2 and that for the combination C1 on Main maths is σσ+ v . Thus the correlation coefficient v1 1,c 1 between the base group and C1 combination for main maths is

2 22 2 ρσ=+( σvv)/ σσσσ ×+ ( 2 + ). 1,cv 1 1 1,cc 1vv11 1, 1 vc 1

Between students, the first four random effects refer to the subjects (M, P, A, F), the next four to the combinations (C1, C3, C4, C5) and the final one is the gender effect (indexed by the subscript g). The variance-covariance matrix is

\[
\Omega_u = \begin{pmatrix}
\sigma_{u1}^2 \\
 & \sigma_{u2}^2 \\
 & \sigma_{u23} & \sigma_{u3}^2 \\
\sigma_{u14} & & & \sigma_{u4}^2 \\
\sigma_{u1,c1} & & & & \\
\sigma_{u1,c3} & & & & & \\
\sigma_{u1,c4} & & & & & & \\
 & \sigma_{u2,c5} & & & & & & \\
\sigma_{u1,g} & \sigma_{u2,g} & \sigma_{u3,g} & \sigma_{u4,g} & & & & &
\end{pmatrix}
\]

The results from fitting these models are given in Tables 8 and 9 for the fixed and random parameter estimates separately. For the fixed effects, changes occur mainly in the estimates associated with Further maths.

Table 8 Parameter estimates for the fixed effects with and without random coefficients of combinations for A level papers, S.E. in parentheses

For each term the first value is from the model with no random coefficients fitted for the combinations and the second from the model with random coefficients fitted; both models include the GCSE terms.

Intercepts: M 5.208 (0.064), 5.261 (0.061); P 5.937 (0.215), 5.909 (0.227); A 5.925 (0.179), 5.961 (0.180); F 4.804 (0.198), 5.108 (0.191).
Combinations: C1 2.855 (0.037), 2.897 (0.042); C3 2.287 (0.051), 2.331 (0.061); C4 -1.948 (0.194), -1.963 (0.226); C5 1.630 (0.193), 1.699 (0.220).
Gender (Female - Male): M -0.412 (0.028), -0.397 (0.028); P -0.669 (0.172), -0.688 (0.169); A -0.747 (0.226), -0.751 (0.225); F -0.621 (0.086), -0.614 (0.085).
Age: M -0.043 (0.003), -0.042 (0.003); P -0.004 (0.020), -0.007 (0.020); A -0.013 (0.028), -0.016 (0.028); F -0.072 (0.010), -0.072 (0.010).
Institution type (base = M/C): GM comp 0.177 (0.059), 0.164 (0.054); Ind. sel 0.157 (0.052), 0.123 (0.044); Ind. ns -0.623 (0.039), -0.608 (0.039); 6th-Form college 0.174 (0.072), 0.100 (0.056); FE college -0.346 (0.072), -0.251 (0.066).
Exam board (base = AEB): Camb M 1.005 (0.078), 0.892 (0.070); Camb P 1.250 (1.944), 1.354 (2.031); Camb F 1.090 (0.235), 0.670 (0.225); Lond M 0.250 (0.067), 0.263 (0.061); Lond P -0.566 (0.195), -0.596 (0.196); Lond A 0.298 (0.223), 0.277 (0.225); Lond F -0.036 (0.212), -0.174 (0.205); Oxfl M 0.452 (0.176), 0.400 (0.164); Oxfl P 1.806 (0.842), 1.858 (0.943); Oxfl F -2.983 (0.558), -3.008 (0.529); JMB M 0.780 (0.069), 0.637 (0.064); JMB F 1.354 (0.222), 0.903 (0.214); Oxcam M 0.778 (0.074), 0.686 (0.067); Oxcam P 0.467 (0.800), 0.511 (0.817); Oxcam A 1.145 (1.301), 1.367 (1.308); Oxcam F 0.918 (0.229), 0.563 (0.220).
GCSE average: GA 1.829 (0.059), 1.812 (0.059); GA^2 0.233 (0.021), 0.229 (0.020); GA^3 -0.196 (0.012), -0.193 (0.012); GA^4 -0.028 (0.002), -0.027 (0.002).
GCSE mathematics: GM 1.339 (0.021), 1.343 (0.021); GM^2 0.070 (0.021), 0.063 (0.021); GM^3 -0.025 (0.003), -0.026 (0.003).
Interactions, GCSE × combination: C1 × GA -1.115 (0.051), -1.124 (0.050); C1 × GA^2 -0.239 (0.027), -0.245 (0.026); C1 × GA^3 0.122 (0.019), 0.119 (0.018); C1 × GM 0.393 (0.163), 0.401 (0.156); C1 × GM^2 -0.080 (0.021), -0.085 (0.020); C3 × GM 0.516 (0.205), -0.047 (0.197); C3 × GM^2 0.643 (0.177), 0.458 (0.161).
Interaction, GA × GM: -0.250 (0.027), -0.242 (0.027).
Interactions, GCSE × gender: Girl × GA 0.083 (0.027), 0.089 (0.026); Girl × GA^2 0.126 (0.023), 0.122 (0.022).
Interactions, GCSE × subject: M × GA 0.150 (0.055), 0.157 (0.055); P × GA -0.446 (0.102), -0.440 (0.102).
Interaction, institution × subject: Ind sel × F 0.387 (0.109), 0.370 (0.106).
-2 log-likelihood: 231,455.0, 230,654.6.

Table 9 Estimated variances and covariances for the models in Table 8 (S.E. in parentheses). Estimates without random coefficients of combinations are in the first line of each row, those with random coefficients in the second line.

Institution level (rows and columns in the order M, P, A, F, C1, C3, C4, C5; the last value in each line is the variance):

M     0.51 (0.02)
      0.76 (0.03)
P     0.45 (0.14)
      1.07 (0.46)
A     0.28 (0.13)  0.36 (0.18)
      0.08 (0.35)  0.41 (0.19)
F     0.87 (0.05)  0.45 (0.16)  2.21 (0.15)
      0.55 (0.05)  0.46 (0.16)  1.22 (0.12)
C1    -  -  -
      -0.66 (0.04)  -0.37 (0.06)  0.70 (0.05)
C3    -  -  -  -
      -0.42 (0.06)  -0.26 (0.11)  0.43 (0.07)  0.92 (0.13)
C4    -  -  -
      -0.43 (0.12)  0.30 (0.66)  3.66 (0.80)
C5    -  -  -
      -0.99 (0.54)  0.25 (0.38)  1.42 (0.77)

Implied institution level variances for those choosing each combination (from the model with random coefficients):
Main: var(C1) = 0.76 - 2(0.66) + 0.70 = 0.14; var(C3) = 0.76 - 2(0.42) + 0.92 = 0.84; var(C4) = 0.76 - 2(0.43) + 3.66 = 3.56
Further: var(C1) = 1.22 - 2(0.37) + 0.70 = 1.18; var(C3) = 1.22 - 2(0.26) + 0.92 = 1.62
Pure: var(C5) = 1.07 - 2(0.99) + 1.42 = 0.51
Applied: var(C5) = 0.41 + 2(0.25) + 1.42 = 2.33
Note: var(Cx) denotes the variance for those choosing the combination Cx.

Student level (rows in the order M, P, A, F, C1, C3, C4, C5, gender):

M       5.47 (0.04)
        5.36 (0.04)
P       7.25 (0.57)
        6.77 (0.58)
A       1.70 (0.21)  3.99 (0.34)
        1.66 (0.21)  3.97 (0.34)
F       1.59 (0.06)  5.37 (0.13)
        1.50 (0.06)  5.31 (0.13)
C1      -1.88 (0.03)
        -1.90 (0.03)
C3      -1.27 (0.05)
        -1.46 (0.03)
C4      1.08 (0.29)
        -0.27 (0.24)
C5      -2.66 (0.28)
        -2.45 (0.29)
Gender  -0.22 (0.03)  -0.36 (0.16)  -0.56 (0.25)  -0.05 (0.09)
        -0.22 (0.02)  -0.41 (0.15)  -0.60 (0.24)  -0.01 (0.09)

Implied student level variances for those choosing each combination (with random coefficients):
Main: var(C1) = 5.36 - 2(1.90) = 1.56; var(C3) = 5.36 - 2(1.46) = 2.44; var(C4) = 5.36 - 2(0.27) = 4.82
Pure: var(C5) = 6.77 - 2(2.45) = 1.87
Estimated reduction in variance for females for each response: Main 2(0.22) = 0.44; Pure 2(0.41) = 0.82; Applied 2(0.60) = 1.20; Further 2(0.01) = 0.02

We shall not discuss these results in detail, but it is worth noting that for Main maths the performance of institutions for students taking combination C1 is much less variable, with a variance of 0.14, compared with 3.56 for those taking C4 and 0.76 for the base group students. As can be seen from Table 3, the mean for C1 is close to the maximum (10), so that we would expect less variation among both students and institutions. Note also that the variance for those taking C4 is based on relatively small numbers and is very poorly estimated, having a standard error of 0.57 compared with standard errors of 0.02 and 0.11 for the variances associated with combinations C1 and C3 respectively. Also of interest is that female students show less variability than male students for Main, Pure and Applied maths, with no difference for Further maths.

We can also compute the institution level correlations, for each subject, between those who take just that subject and those who take a particular combination. Thus, for example, the institution level correlation between those taking just Main mathematics and those taking combination C1 is (0.76 − 0.66)/√(0.76 × (0.76 − 2 × 0.66 + 0.70)) = 0.31. For combination C3 the correlation is 0.42 and for combination C4 it is 0.20. The institution level correlation between those taking just Further mathematics and those taking combinations C1 and C3 is estimated as 1.0 and 0.68 respectively, which is unsurprising given the definitions of these combinations.
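The same calculation can be written as a small helper; this is a hypothetical sketch using the rounded Table 9 estimates quoted above, not part of the original analysis.

```python
import math

def corr_base_vs_combination(var_base: float, cov_base_comb: float,
                             var_comb_effect: float) -> float:
    """Institution level correlation between the base-group effect and the effect
    for a combination whose group variance is var_base + 2*cov + var_comb_effect."""
    var_comb_group = var_base + 2.0 * cov_base_comb + var_comb_effect
    return (var_base + cov_base_comb) / math.sqrt(var_base * var_comb_group)

# Main mathematics, combination C1, using the Table 9 institution level estimates.
print(round(corr_base_vs_combination(0.76, -0.66, 0.70), 2))   # about 0.31
```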

At the student level the correlations between Main and Further maths for C1 and between Pure and Applied maths for C5 remain strong, at 0.52 and 0.47 respectively. For other students the correlations are 0.28 and 0.32.

7 Discussion

In this study we have shown that subject choice is strongly associated with performance. In multivariate response models, therefore, where not all responses are present as a result of deliberate choice of response combination, we cannot assume missingness at random. We demonstrate, via a series of models of increasing complexity, that the inclusion of terms based upon chosen subject combinations can provide insights into the data structure. In particular we fit models that allow the effects of choice to vary at institution level and where the student level variation is allowed to be a function of the chosen combinations. In effect, adding terms for choice combinations makes allowance for different 'abilities', insofar as these are not adjusted for using the GCSE prior achievement measure. This approach requires identifying the patterns of missing responses correctly and re-parameterising the random part of the model at each level. We have shown that the combinations of the Main (A level) and Further options (A or AS level), and also the Pure and Applied options (A level), are common choices for more able students. Our results suggest that for the AS results the assumption of missing at random is acceptable.

In the present context subject choices are generally made at the start of the course of study. In other situations, for example when considering the choice of questions within an examination, our approach is more problematic. In particular this will be so if the number of combinations is very large, with some combinations chosen by few students. In such cases, pooling of combinations may be acceptable.

We have shown that the A and AS level scores cannot be modelled together simply by applying the traditional point scoring system. The AS score can be shown to have a different distribution from the A level score, and the relationships among the AS level combinations are different from those among the A level combinations. Furthermore, mixing A and AS level results in the basic multivariate model caused convergence problems in our case. Modelling the AS level scores as separate responses enabled us to study the institution level relationships between A and AS level.

The effects of gender, age and institution type are similar in this study to those in the previous study which used the total A/AS point score (Yang & Woodhouse, 2000). We note, however, that whereas Yang and Woodhouse found that girls' disadvantage increased with increasing GCSE average score, for mathematics this is not the case, with girls' performance improving with increasing average GCSE score above a value of about 4. Further research with other A level subjects would be useful in this respect.

Acknowledgements

This study is part of the project 'Application of Advanced Multilevel Modelling Methods for the Analysis of Examination Data', supported by the Economic and Social Research Council of the U.K. under Grant award R000237394. The Department for Education and Employment (DfEE) kindly provided the exam data. We are also grateful to Tony Fielding, Trevor Knight, Sally Thomas and John Gray for their helpful advice.

References

Goldstein, H. (1995). Multilevel Statistical Models (2nd edition). London: Edward Arnold.

Goldstein, H. and Thomas, S. (1996). Using examination results as indicators of school and college performance. Journal of the Royal Statistical Society. A, 159, 149-163.

Gray, J., Jesson, D., Goldstein, H., Hedger, K. and Rasbash, J. (1995). A multi-level analysis of school improvement: changes in schools’ performance over time. School Effectiveness and School Improvement, Vol. 6 No. 2, 97-114.

Gray, J., Goldstein, H. and Jesson, D. (1996). Changes and improvements in schools’ effectiveness: trends over five years, Research Papers in Education, 11(1), 35-51.

O’Donoghue, C., Thomas, S., Goldstein, H. and Knight, T. (1997). 1996 DfEE study of value added for 16-18 year olds in England, DfEE research series, March. London DfEE.

Rasbash, J., Browne, W., Goldstein, H., Yang, M., Plewis, I., Healy, M., Woodhouse, G., Draper, D., Langford, I. and Lewis, T. (2000), A user’s guide to MLwiN, V2.1, Multilevel Models Project, Institute of Education, University of London.

Yang, M. and Woodhouse, G. (2000). Progress from GCSE to A and AS level: institutional and gender differences, and trends over time. British Educational Research Journal, (forthcoming).

Multilevel ordinal models for examination grades

Min Yang Institute of Education, University of London, UK

Antony Fielding* University of Birmingham, UK

Harvey Goldstein, Institute of Education, University of London, UK

Abstract:

In multilevel situations graded category responses are often converted to points scores and linear models for continuous normal responses fitted. This is particularly prevalent in educational research. Generalised multilevel ordinal models for response categories are developed and contrasted in some respects with these normal models. Attention is given to the analysis of a large database of the General Certificate of Education Advanced Level examinations in England and Wales. Ordinal models appear to have advantages in facilitating the examination of institutional differences in more detail. Of particular importance is the flexibility offered by non-proportionally changing odds logit models. Examples are given of the richer contrasts of institutional and sub-group differences that may be evaluated. Appropriate widely available software for this approach is also discussed.

Keywords:

Educational grades, GCE Advanced Level, Logit, MLwiN, MULTICAT, Multilevel models, Non-proportional odds, Ordinal responses

* Contact Author e-mail: [email protected]

(Submission to Journal of Educational and Behavioral Statistics; January 4th 2001)

1. Introduction

In the England and Wales public examination systems the reporting of pass results is by grades: A*-E for the General Certificate of Secondary Education (GCSE) and grades A-E for the Advanced (A)-level General Certificate of Education. Principally for purposes of selection to higher education, the A-level grades are converted to the traditional Universities and Colleges Admissions Service (UCAS) tariff of points scores (A=10, B=8, C=6, D=4, E=2, F=0), and these scores are summed to provide a total point score for each candidate over the examinations taken.

Most research on A-level examinations to date has used such summative scores in studies of school effectiveness, and they are used by government in the production of 'performance indicators' or 'league tables' (O'Donoghue et al., 1996). One of the drawbacks of the use of scores is that information about the actual distribution over grades in particular subjects is lost when inferences are made at the level of the school. A school mean score in a particular subject could be the product of quite different individual student grade distributions. Thus an average score could be produced by most students performing very close to the median grade, or by some performing very well and some very badly; the distinction between two such schools is potentially important.

The present paper develops models for the actual grades and compares these with the standard point scoring system. The aim is to gain insights from using the former as opposed to the latter models. This is done using multilevel models which recognise the essentially hierarchical nature of examination data, with students nested within schools. For the point scoring system standard Normal theory models are applied, and for the grades, less well known ordered categorical response models are used. A technical advantage of the latter lies in the fact that they do not require strong scaling assumptions, merely the existence of an ordering. They are also not subject to estimation problems arising from grouped observations of an assumed continuous response scale. Further, they do not require a basic normality assumption over the scale, although this is not problematic in the point scores used in our examples. It is also possible that the use of the ordered categorical models will result in models with fewer parameters needed to fit the data. Many of these comparative criteria are reviewed in Fielding (1999, 2000).

Whilst the main thrust of this paper is methodological, some substantive results emerge, especially in terms of gender differences. These are highlighted by the use of the ordered category response models.

2. Data and source

In this paper we utilise information provided by the U.K. Department for Education and Employment (DfEE) from its database of linked A/AS-level and GCSE examination results (O'Donoghue et al., 1996). Each individual’s outcomes for each subject and qualification are recorded. Additionally a limited range of information is available on certain background factors. We have student’s gender, date of birth and previous educational achievements, as well as the type of their educational establishment, local education authority, region and examination board.

In this study we concentrate on A level outcomes in 1997 for two subjects: Chemistry and Geography. There are a number of reasons for these particular choices of subject and for the exclusion of the intermediate 1-year Advanced Supplementary (AS) level entries. The principal reason is that these two subjects are popular, giving a reasonable number of entries which could be matched to previous General Certificate of Secondary Education (GCSE) results. The latter are the normal secondary school qualifications taken by most students prior to any further advanced study and at the end of compulsory education. Total A and AS entries are 30,910 for Chemistry and 33,276 for Geography, of which AS-level entries are only 3.5% and 3.8% respectively. Given this, modelling essentially different outcomes simultaneously would add to model complexity for present purposes, and AS entries are not included in the analyses. In both subjects only 1.8% of students had multiple entries, and in these cases all entries except the final one scored zero. Thus the single best entry, indicative of achievement in that subject, was entered into our analyses without loss of substantive meaning. Another reason for the choice of these two subjects is that they have distinct distributions of grades and thus exemplify the differences between the statistical models in this respect. As Table 1 shows, for students included in the study the grade distribution for Chemistry was skewed towards grades A and B, whilst that for Geography was more nearly symmetrical. They therefore provide good case studies for comparing models for point scores with those for grades with respect to sensitivity to distributional shape.

(Table 1 about here)

It was also decided to omit 0.36% of Chemistry students and 0.37% of Geography students who had extremely low average GCSE scores (3 or less) relative to the overall average GCSE distribution and who represent an extreme group.

In many analyses of educational data, and in particular in school effectiveness studies where aggregate points across several A level subject entries have been used, problems in modelling often arise due to the presence of ceilings and floors in the range of possible scores. Transforming the aggregate scores has been a practical way to overcome these and also to ensure that model assumptions of normal errors are more nearly met (Goldstein, 1995). In this case, for the single subject A-level score with a very limited range of discrete values, experimentation with normalising transformations did not noticeably improve the error distribution of the data compared with using the raw point score. As outlined in the introduction, problems are more likely to arise due to the grouping; these would be present even if transformed scores were used. Further, effects in models of the raw point score are more easily interpretable and can also be linked directly to the raw grades. This makes the comparison and interpretation between models for scores and models for grades more direct.

In the modelling, a student's GCSE average score (GA), centred around its mean, was used as the prior attainment control variable. Among the available individual level covariates we also used the gender of the student (Females = 1; Males = 0) and centred age. The cohort is aged between 18 and 19 years, with a mean of 18.5 years. The establishment level variables were establishment type, the aggregate establishment mean of the students' centred GCSE scores (Sch-GA), and the standard deviation of GA within the establishment (Sch-SD). The establishments or institutions were categorised into 11 groups according to their admission policy and institution type. Local Authority Maintained schools are either Comprehensive, Selective or Modern (M/C, M/S, M/M). There are the same three types of directly funded Grant Maintained institution (GM/C, GM/S, GM/M). There are also Independent schools, which are either Selective or Non-Selective (IND/S, IND/NS), Sixth Form Colleges (SFC), Further Education Colleges (FE) and a small miscellaneous range of other types (Other). In the modelling M/C was the base category and dummy variables were formed for the other establishment types. The exam boards involved in the study were the Associated Examining Board (AEB), Cambridge (Camb), London, Oxford, the Joint Matriculation Board (JMB), the Oxford-Cambridge joint delegacy (OXCAM), and the Welsh board, WJEC. The WJEC board had only a few entries and did not show an obvious difference from the entries of AEB in our data exploration. We therefore combined these two exam boards and used the combined group as the base reference category for the board dummy variable parameters in our models.

3. Statistical models for point scores

We denote the UCAS point score response for the ith student from the jth school as y_ij. As a basis for evaluating further model developments we may formulate a variance components multilevel model of this data structure, with schools at level 2 and students at level 1:

\[
y_{ij} = \beta_0 + u_{0j} + e_{ij} \qquad (1)
\]

Here u_{0j} is the jth education institution effect. These are random effects in the models considered and are assumed distributed as N(0, σ²_{u0}). The term e_{ij} ~ N(0, σ²_{e0}) is the ith student random residual within the jth institution. We note that these imply normality assumptions for the response and hence that it is taken as continuous. For our grade points scored data this is strictly untenable, but it may sometimes be assumed to hold approximately for a scale on which the points are located.
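For readers without specialist multilevel software, a variance components model of this form can be sketched with statsmodels; the data, column names and parameter values below are hypothetical, and this is offered only as an accessible illustration of model (1), not the software or data used in the paper.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data with a school-level random intercept, for illustration only.
rng = np.random.default_rng(1)
schools = np.repeat(np.arange(100), 30)
u = rng.normal(0.0, 1.5, size=100)[schools]      # institution effects u_0j
e = rng.normal(0.0, 3.0, size=schools.size)      # student residuals e_ij
data = pd.DataFrame({"school": schools, "score": 6.0 + u + e})

# Variance components model: score_ij = beta_0 + u_0j + e_ij.
model = smf.mixedlm("score ~ 1", data, groups=data["school"])
result = model.fit()
print(result.summary())
```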

We can now add to the model other factors such as students' gender, GCSE average score, school type or context, and examination board. We can allow covariates which are polynomial terms in continuous variables, interactions between main factors and so on by writing

\[
y_{ij} = \beta_0 + X_{ij}\beta + u_{0j} + e_{ij} \qquad (2)
\]

where β is a vector of fixed effect coefficients associated with such factors and covariates X_{ij}. Goldstein (1995) gives details of fitting such types of model.

As O'Donoghue et al (1996) and Goldstein and Spiegelhalter (1996) showed, we need in general to fit random coefficients models to adequately describe institution level variation. Extending by these means we have a model of the form

\[
y_{ij} = X_{ij}\beta + Z_{ij} u_j + e_{ij}
\]
\[
X_{ij} = \{1, x_{1ij}, \ldots\}, \qquad Z_{ij} = \{1, z_{1ij}, \ldots\} \qquad (3)
\]
\[
\beta \text{ is now } \{\beta_0, \beta_1, \ldots\}^T, \qquad u_j = \{u_{0j}, u_{1j}, \ldots\}^T
\]

Usually, but not always, most of the Z variables are a subset of the X variables. The u_j are random variables at the institution level and are assumed dependent multivariate normal.
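A random coefficients variant along the lines of (3) can be sketched in the same framework by letting, for example, the GCSE slope vary across institutions; again this is a hypothetical illustration on simulated data, not the models actually fitted in the paper.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data with institution-specific intercepts and GCSE slopes (illustration only).
rng = np.random.default_rng(2)
school = np.repeat(np.arange(80), 25)
gcse = rng.normal(0.0, 1.0, size=school.size)            # centred GCSE average score
u0 = rng.normal(0.0, 1.0, size=80)[school]               # random intercepts u_0j
u1 = rng.normal(0.0, 0.3, size=80)[school]               # random GCSE slopes u_1j
score = 5.5 + 2.0 * gcse + u0 + u1 * gcse + rng.normal(0.0, 2.5, size=school.size)
data = pd.DataFrame({"school": school, "gcse": gcse, "score": score})

# Random coefficients model of the form (3), with Z_ij = (1, gcse_ij).
model = smf.mixedlm("score ~ gcse", data, groups=data["school"], re_formula="~gcse")
result = model.fit()
print(result.cov_re)     # institution level covariance matrix of (u_0j, u_1j)
```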

In the following analyses variants of types (1), (2), (3) are fitted and developed separately to model the A level Chemistry and Geography point scores.

4. Multilevel models for ordered categories

We now examine models with similar effect patterns to those for the points scales above, but which model the distributions over ordered categories directly, without reference to an explicit scale. We distinguish the six categories A-F by using the set of integer labels s = 1, 2, 3, 4, 5, 6. For the ith student from the jth school the probabilities over the grades are denoted by π_ij^(s). Since the grades are ordered, the distribution is equivalently characterised by the cumulative probability distribution, which is usually more convenient to use. The cumulative probability that the grade is higher than or equal to that represented by s is

\[
\gamma_{ij}^{(s)} = \sum_{\lambda = 1}^{s} \pi_{ij}^{(\lambda)}, \qquad
0 < \gamma_{ij}^{(1)} < \gamma_{ij}^{(2)} < \cdots < \gamma_{ij}^{(5)} < \gamma_{ij}^{(6)} = 1 .
\]

In single level models these cumulative probabilities are taken as responses in generalised linear models for ordered category outcomes (McCullagh and Nelder, 1989).
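A single level cumulative logit (proportional odds) model of this kind can be sketched with statsmodels' OrderedModel, as below. The data are simulated and the column names hypothetical; the sketch ignores the institution random effects developed later in the paper, so it is only a single level starting point, not the multilevel models used here.

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Illustrative data: ordered grades E < D < C < B < A and a centred GCSE score.
rng = np.random.default_rng(3)
n = 2000
gcse = rng.normal(0.0, 1.0, size=n)
latent = 1.2 * gcse + rng.logistic(size=n)
grades = pd.cut(latent, bins=[-np.inf, -2, -1, 0, 1, np.inf],
                labels=["E", "D", "C", "B", "A"], ordered=True)
data = pd.DataFrame({"grade": grades, "gcse": gcse})

# Cumulative logit model over the ordered grades with a single GCSE covariate.
model = OrderedModel(data["grade"], data[["gcse"]], distr="logit")
result = model.fit(method="bfgs", disp=False)
print(result.params)     # threshold (cut-point) parameters and the gcse coefficient
```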

In general, any set of cumulative probabilities for the ordered grades may be conceived as a scale for the ordered grade responses in that they are inversely monotonically related to the set of grades: the increasing cumulative probabilities corresponding to decreasing difficulty of achieving at least a certain grade level. It is the changing nature of this probability scale across individuals in response to fixed and random explanatory effects that we now wish to model. In continuous (normal) response models by contrast we model conditional expectations given the set of these effects. An individual response is then assumed governed by a drawing from this distribution given by the appropriate individual level variance. For ordered categories an individual’s response is a drawing from the conditional distribution formed from the specification of that individual’s cumulative distribution over categories.

By analogy with generalised linear models it is desirable to allow effects to operate in a linear and additive fashion. A monotonic transformation, or 'link', from the probability scale to the real line achieves this. In general this transformation can be any inverse distribution function of a continuous variable; in particular the logit, complementary log-log and probit link functions are frequently used. Such a transformation turns the

cumulative probability scale on $[0,1]$ to the interval $(-\infty, +\infty)$. Conceptually, a set of thresholds or cut-points on this scale for each individual is determined by the individual's probability distribution over the grades (Bock, 1975). Thus the $\gamma_{ij}^{(s)}$ ($s = 1, 2, \ldots, 6$) in our case correspond under the transformation to the intervals
$$(-\infty, \alpha_{ij}^{(1)}],\ (\alpha_{ij}^{(1)}, \alpha_{ij}^{(2)}],\ (\alpha_{ij}^{(2)}, \alpha_{ij}^{(3)}],\ (\alpha_{ij}^{(3)}, \alpha_{ij}^{(4)}],\ (\alpha_{ij}^{(4)}, \alpha_{ij}^{(5)}],\ (\alpha_{ij}^{(5)}, +\infty),$$
with the $\alpha_{ij}^{(s)}$ constituting the thresholds. Predictor variables and random effects on the transformed scale change the thresholds, and hence the cumulative distributions over the ordered grades. The nature of these changes in response to effects is what we wish to model. A slightly different but related conception of ordinal models, used by many researchers (e.g. McCullagh, 1980; Hedeker & Gibbons, 1994; Fielding, 1999), is through the notion of an unmeasured latent outcome variable underlying the ordered grades and varying continuously along the real line. The ordered categories represent contiguous intervals on this scale with unknown but fixed thresholds. This variable is assumed to be governed by a linear model, in our case multilevel. Modelling the observed categorical data then involves transforming this latent variable model into one involving probabilities over the ordered grades. In this transformation, since the scale and distribution of the latent variable may be arbitrary, they are fixed a priori by setting the distribution and variance of the level 1 disturbance. This is required in any case for identification of the latent variable model parameters from the ordered category responses. Under most circumstances this essentially leads to an equivalent modelling framework. Models with a standard logistic distribution for the latent variable disturbance yield a logit link for the probabilities; a standard normal distribution yields the probit link. There are some advantages in these ideas for the interpretation of results directly on the scale of the latent variable. However, we shall not be directly concerned with such an interpretation here, since our principal aim is to compare the different kinds of inferences arising from the Normal points and ordinal models. See, however, Fielding and Yang (1999) for a further discussion. Further, when we allow more complexity in varying thresholds, as we do later in the paper, it is not clear that the models can be adapted to such latent variable interpretations.

Goldstein (1995) discusses the formulation of these models in a multilevel context. We start with the most basic of logit link models. This two level model for the cumulative grade probabilities is conceptually comparable to the variance components model (1). We have
$$\operatorname{logit}(\gamma_{ij}^{(s)}) = \log\!\left(\frac{\gamma_{ij}^{(s)}}{1-\gamma_{ij}^{(s)}}\right) = \alpha_{ij}^{(s)} = \alpha^{(s)} + u_{0j} \qquad (4)$$
A fit to model (4) estimates only the series of marginal cut-points, with a single random intercept effect for the distribution for the $j$th educational establishment. Thus the cumulative probabilities do not depend on any individual level characteristics or institutional ones other than the single random effect $u_{0j}$. The latter is assumed normal with zero mean and variance $\sigma_{u0}^2$. An individual's response follows a multinomial distribution determined by the cumulative probabilities $\gamma_{ij}^{(s)}$, although estimation procedures can allow for extra-multinomial variation (Goldstein, 1995).
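Although not stated explicitly above, the individual grade probabilities implied by such a model are recovered as successive differences of the cumulative probabilities, a step relied on when predicting grade distributions in Section 7:
$$\pi_{ij}^{(s)} = \gamma_{ij}^{(s)} - \gamma_{ij}^{(s-1)}, \qquad s = 1, \ldots, 6, \quad \text{with } \gamma_{ij}^{(0)} \equiv 0 \text{ and } \gamma_{ij}^{(6)} \equiv 1.$$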

Analogously to model (2) we extend the basic model to estimate fixed effects of student background, GCSE average scores and other factors by adding to the model appropriate covariates. This model is then of the form

$$\operatorname{logit}(\gamma_{ij}^{(s)}) = \log\!\left(\frac{\gamma_{ij}^{(s)}}{1-\gamma_{ij}^{(s)}}\right) = \alpha_{ij}^{(s)} = \alpha^{(s)} + X_{ij}\beta + u_{0j} \qquad (5)$$
In this model we assume no interactions between the cut-points and the other explanatory variables. In other words, it assumes identical ratios of cumulative odds across categories whatever the values of the other effects: the proportional odds property (McCullagh, 1980; Hedeker, 1998). The odds ratios associated with unit changes in the vector of explanatory variables $X_{ij}$ are given by the vector $\exp(\beta)$. A more detailed explanation and illustration of the parameter estimates is given by Yang (2001).
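As a small worked illustration of this property (using, for instance, the female coefficient of 0.383 reported for Geography in Table 3), model (5) implies that the cumulative odds for a female student are
$$\frac{\gamma^{(s)}/(1-\gamma^{(s)})\big|_{\text{female}}}{\gamma^{(s)}/(1-\gamma^{(s)})\big|_{\text{male}}} = \exp(\hat\beta_{\text{female}}) = \exp(0.383) \approx 1.47$$
times those for an otherwise identical male, and that this ratio is the same at every grade threshold $s$.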

Further, by analogy with the random coefficients model (3) we have

$$\operatorname{logit}(\gamma_{ij}^{(s)}) = \log\!\left(\frac{\gamma_{ij}^{(s)}}{1-\gamma_{ij}^{(s)}}\right) = \alpha_{ij}^{(s)} = \alpha^{(s)} + X_{ij}\beta + Z_{ij}u_j \qquad (6)$$
with the same normality assumptions about the vector of random components $u_j$.

5. Comparison of results between the Normal point score and the ordinal models

To fit the normal models we use the iterative generalised least squares (IGLS) procedures of MLwiN (Rasbash et al., 1999) and for the ordinal models the quasi-likelihood procedures of the MLwiN macros MULTICAT (Yang et al., 1999). Yang (1997) discusses the validity and statistical properties of these estimators.

In this section we compare parameter estimates and individual institution residual estimates for the two types of model: normal and ordinal. Results on fitting the models are given in Tables 2, 3 and 4. Table 2 provides the comparative results for base models (1) and (4). Tables 3 and 4 give results for two variants of models (2) and (5). Firstly we adjust for a range of background characteristics of student, institution, and exam board but exclude the main intake ability characteristic, the student GCSE average. Table 4 then adds variables derived from the latter in various ways. In particular, polynomial terms in GA are introduced (GA^2, GA^3, GA^4), interactions with gender (GA-F, GA^2-F, GA^3-F), and the aggregated institutional level intake score. This is a useful extension since it draws a distinction between modelling extraneous factors affecting raw performance and models for progress controlling for initial intake achievement.

As a precision measure in the Tables we take the standardised t-ratio of each fixed parameter estimate to its estimated standard error for comparisons between the model approaches, since the parameters associated with the same variable have different scales

under each of the two approaches. The broad pattern of effects is the same for the Normal and ordinal models. In a similar context Ezzett and Whitehead (1991) have commented that major effects will emerge and are relatively insensitive to model formulation, though we may expect their exact estimates to differ. This is borne out by the fact that precision measures of regression parameter estimates are very similar between model types. Formal statistical tests on covariate parameters would yield the same inferential conclusions. In general the impressions from either model type are close and conceptually no real difference in impact on broad substantive interpretation emerges.

(Tables 2 and 3 about here)

The precision of the school level variance from the ordinal models is slightly better in all cases in Tables 2, 3 and 4, but the improvement is barely discernible. Both models for both subjects in Table 3, which do not adjust for intake achievement, exhibit much the same pattern of covariate effects on performance. In general, better performances come from females and younger students. Compared to the base M/C schools, selective schools of all types have significantly higher achievement. Modern schools have lower performance but this is not statistically discernible for M/M in Chemistry. Sixth form colleges and others (mainly sixth form centres in schools) also perform better, but in neither case are the results statistically significant for Geography. FE colleges have much lower achievement generally. GM/C and IND/NS schools are not significantly different from their maintained comprehensive (M/C) counterparts. The estimates of dummy parameters for boards relative to AEB/WJEC reveal significantly higher average grade performances for OXCAM, JMB and CAMB Chemistry examinations. Oxford has significantly worse performance in Geography. Other board effects are not distinguishable from the base.

The results in Table 4 additionally control for covariates based on prior achievement and individual and institutional GCSE averages. Effect estimates in this table thus relate to 'adjusted performance' and properly relate to progress over the course of A level study. As before, models (2b) and (5b) are comparable. In both subjects younger students make

more progress in addition to having higher general achievement. However, boys now make more progress than girls despite girls being higher achievers, as noted by O'Donoghue et al. (1997). M/S and GM/S selective schools now lose their significant comparative advantage when progress rather than raw performance is the criterion, although the effect of IND/S is statistically significant in both subjects. M/M, GM/C and IND/NS schools make no significantly different progress from M/C in either subject. GM/M schools seem to do worse, but the effect is significant only for Geography. Sixth form colleges achieve higher progress in Chemistry than the base M/C type but no longer have the advantage in Geography. FE colleges show significantly lower progress in Geography but this does not carry over to Chemistry. The pattern of board effects for adjusted performance in Chemistry is similar to those on raw performance. Now, however, CAMB joins Oxford in having significantly worse adjusted performance in Geography. Other board effects are again not distinguishable. All these general substantive findings are similar to those of Yang & Woodhouse (2001) based on aggregate A/AS scores for the whole database.

(Table 4 about here)

6. The nature of GCSE effects

Prior achievement as measured by GCSE results forms an essential input into models for A level performance. The way it operates on the outcome in combination with other factors is similar to the economic concept of an educational production function. This paper and others (Goldstein & Thomas, 1996; O'Donoghue et al., 1997; Yang & Woodhouse, 2001) show that this production function should include many non-linear terms in the covariates. Thus, as noted above, polynomial terms up to fourth order in individual prior ability variables may in some circumstances be required, as in Table 4. These polynomial terms allow a necessary graduation of the response to this variable, particularly at the extremes. Institution context factors such as Sch-GA and Sch-SD will also often have a discernible influence. The institution standard deviation may be useful in examining what, if any, is the impact of homogeneity of school intake on progress.

The gender differentiation of GCSE effects is represented by terms for the interaction with the female gender dummy variable.
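A minimal sketch of how such covariates might be constructed is given below; the dataframe and its column names (gcse_avg, female, school) are hypothetical, and the centring of GA follows the convention noted in Section 8.

```python
# Sketch of building the GCSE-based terms described above (GA polynomials,
# gender interactions, and institution-level context variables). The dataframe
# and its column names are hypothetical.
import pandas as pd

df = pd.read_csv("alevel_chemistry.csv")

# Centre the individual GCSE average (GA) and form polynomial terms
df["GA"] = df["gcse_avg"] - df["gcse_avg"].mean()
for p in (2, 3, 4):
    df[f"GA^{p}"] = df["GA"] ** p

# Interactions of the prior-attainment terms with the female dummy variable
for name in ("GA", "GA^2", "GA^3"):
    df[f"{name}-F"] = df[name] * df["female"]

# Institution-level intake context: mean and standard deviation of GA
df["Sch-GA"] = df.groupby("school")["GA"].transform("mean")
df["Sch-SD"] = df.groupby("school")["GA"].transform("std")
```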

As shown in Table 4, the normal model evidenced significant polynomial terms in GCSE average up to the fourth order. The ordinal logit model required only a quadratic function. As a result of polynomial effects, the nature of the interaction of gender with prior ability cannot be simply expressed as an additive element. However, the normal model exhibited interactions for both subjects but the ordinal model only for Chemistry. This may be due to the skewed nature of its response distribution over grades or could be related to differential impacts of ceilings and floors. In the context of primary school progress, Fielding (1999) notes that ordered category models seem to be more parsimonious in many circumstances and require fewer complex terms. It is further noted there that this requirement may be conditioned by the response distributions. The results here seem to confirm these impressions. The normal model seems much more sensitive to the actual values of GCSE scores than the logit model.

In both types of model the effects of school intake contexts appear relatively small but with Sch-GA having a marginally significant effect in Chemistry and school heterogeneity having a positive effect in Geography.

7. The use of ordinal models in predicting grade distributions

In A level studies, teachers are expected to predict A level performances for university entrance purposes. Rather than giving point predictions a more useful approach might be to evaluate the 'chances' that a certain student will achieve certain grade thresholds. Ordinal models have a very useful role in this area by estimating directly the probability that students with given background characteristics and initial ability will achieve certain grades. Normal models can only do this indirectly from the conditional means and variances and assumptions that grade boundaries are placed appropriately along the points scale (5.0 to 7.0, for instance for grade C). Using estimates from Model 5(b) (Table 4), we illustrate the ‘chances’ of achievement estimated for two female students

having the same set of covariate values but different GCSE average scores in Figure 1. Student 1, with a high GCSE score of 7.5, has a very high chance of achieving an A level Chemistry grade no lower than B; while Student 2, with a GCSE score of 5, is most likely to achieve a grade no higher than D. Using estimates from Model 2(b) (Table 4), we obtain the point estimates of A level point scores for the two as 9.8 and 1.9 respectively. Although these are roughly equivalent to grades A and E, they represent conditional expectations only. Using the conditional means and individual level variances from the model estimates, a predicted grade distribution could also be calculated here, using appropriate areas under the normal curve (e.g. the probability of grade C would be found from the area under the assumed normal between 5 and 7). However, the assumed normal distribution of a continuous variable on the underlying arbitrary raw points scale is crucial when applied in this context. Model estimates may be reasonably robust to departures from the normality assumption; however, in computing probabilities in this direct way it has a much more obvious impact. Even relatively small departures from normality in the true distribution over the raw points scale might yield a different set of predictions. This is a further aspect of sensitivity to the scale of the arbitrary scores used and the assumption of normality on that scale. No such fixing of a scale is required for ordinal models. (Figure 1 about here)
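The two prediction routes just described can be sketched as follows. All numerical values in the sketch are placeholders rather than the estimates of Table 4, and the grade boundaries for the normal route follow the points scale (A = 10, B = 8, ..., E = 2), so that grade C, for example, corresponds to the interval from 5 to 7.

```python
# Sketch: predicting one student's grade distribution (i) directly from an
# ordinal model via its cumulative logits, and (ii) indirectly from a normal
# points-score model via areas under the fitted normal curve. All numbers below
# are placeholders, not the estimates reported in Table 4.
import numpy as np
from scipy.special import expit
from scipy.stats import norm

# (i) Ordinal route: gamma^(s) = expit(alpha^(s) + X*beta + u), s = 1..5
alpha = np.array([-3.0, -1.5, 0.0, 1.5, 3.0])        # cut-points (placeholders)
xb_u = 1.2                                            # X*beta + u for this student
gamma = np.append(expit(alpha + xb_u), 1.0)           # cumulative probs, gamma^(6) = 1
p_ordinal = np.diff(np.concatenate(([0.0], gamma)))   # P(A), P(B), ..., P(F)

# (ii) Normal route: areas between grade boundaries on the points scale
mu, sigma = 6.3, 2.2                                  # conditional mean, level-1 SD
bounds = [np.inf, 9, 7, 5, 3, 1, -np.inf]             # A >= 9, B in (7,9], ..., F < 1
p_normal = [norm.cdf(hi, mu, sigma) - norm.cdf(lo, mu, sigma)
            for hi, lo in zip(bounds[:-1], bounds[1:])]

for g, po, pn in zip("ABCDEF", p_ordinal, p_normal):
    print(f"Grade {g}: ordinal {po:.3f}   normal {pn:.3f}")
```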

8. Value added estimates using school residuals

We have extended the models in Table 4 by incorporating random coefficients as in models (3) and (6). The covariates which were found to have coefficients random at the institutional level in both models were the female gender and the first order GA term. For these models we thus have three residuals for each institution for each model type. In this section we focus on the intercept residuals only. Since GA was centred, these represent school effects or 'value added' for males with an average GCSE score. For model checking, the distributions of the standardised institution residuals for the two models on both A-level subjects are very close to normality, as seen from the normal plots in Figures 2 and 3. Residuals from the two model approaches are closely related. The correlation

coefficients and rank correlations between the institution residuals from each pair of models are:

                                        Chemistry   Geography
Establishment residuals                   0.982       0.968
Ranks of establishment residuals          0.983       0.970

Note that these correlations are somewhat inflated because of 'shrinkage'. Nevertheless, they and the scatter plots suggest a strong agreement between the school value added estimates from the two models. Given the fairly full and complex modelling of covariates and effects we would usually not expect otherwise. However, the agreement is not perfect and there is scope for some movement in the positioning of individual schools, to which the correlations remain relatively insensitive. Even mild sensitivity to model formulation is potentially of interest. We might ask, for instance, whether the identification of extreme institutions depends on the model.

[Figures 2 and 3 about here]

Thus we now examine in detail some selected institutions, choosing four for each subject. Two institutions are taken from the middle of the distribution of institution residual estimates for the normal model for Chemistry (Table 4); these are also examined for Geography. Further, for each subject separately, two extreme institutions are selected. Some details on the selected institutions are listed in Table 5. Table 6 shows the ranks of the residuals of the selected institutions from each model, the residual estimates and their standard errors. Also shown in Table 6 are 95% overlap intervals (Goldstein and Healy, 1995) converted into equivalent ranking intervals for the residuals.
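A rough sketch of one way such intervals can be computed is given below; residual estimates and standard errors are assumed already to hand. The multiplier of about 1.39 is the value suggested by Goldstein and Healy (1995) for pairwise comparisons at an average 5% level when standard errors are roughly equal (in practice it is calculated from the full set of standard errors), and the conversion to rank intervals here is a simple counting heuristic rather than necessarily the exact procedure used for Table 6.

```python
# Rough sketch: 95% 'overlap' intervals for institution residuals in the spirit of
# Goldstein & Healy (1995), and a simple counting heuristic for converting them to
# rank intervals. The multiplier 1.39 assumes roughly equal standard errors.
import numpy as np

def overlap_rank_intervals(resid, se, k=1.39):
    resid, se = np.asarray(resid, float), np.asarray(se, float)
    lower, upper = resid - k * se, resid + k * se
    ranks = (-resid).argsort().argsort() + 1      # rank 1 = largest residual
    # Best attainable rank: one more than the number of institutions whose interval
    # lies wholly above this one; worst rank: n minus the number lying wholly below.
    best = np.array([(lower > u).sum() + 1 for u in upper])
    worst = np.array([len(resid) - (upper < l).sum() for l in lower])
    return ranks, best, worst

# Example with made-up residuals and standard errors:
ranks, best, worst = overlap_rank_intervals([0.4, -0.1, 1.2], [0.30, 0.25, 0.40])
```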

[Tables 5 and 6 about here]

The results show that at the extremities there is little difference in the models. Extreme schools are detected with either model. In the middle of the distribution, however, there

are often considerable differences in rankings. There is great sensitivity of 'league table' position to the chosen adjustment model, even when both include the same covariates. The intervals for institutions 1 and 2 on Chemistry, whose ranks are only 66 apart (1317 – 1251), overlap considerably for the normal model. However, even in the ordinal model for this case, where the rankings are about 1000 apart (1706 – 702) and much more clearly separated, there is still considerable overlap of the intervals. For Geography, institutions 1 and 2, being less concentrated in the middle of the rank range, have separable intervals under both models, but less clearly so for the ordinal model. However, it requires rankings differing by 1502 and 1472 respectively to achieve these separations. Although the extreme institutions have a much more consistent ranking, there are one or two other features worthy of note. For Geography, institutions 2 and 5 (the highest ranked on the ordinal model) have relatively short rank differences of 111 and 300 for the normal and ordinal models respectively. However, at the top end of the range their intervals do not overlap for the ordinal model and only just for the normal. Rank intervals for extreme cases are very short. There appears to be a clearer separation between pairs as we move away from the middle of the distribution. The same phenomenon occurs at the lower end. Although the residual values are on different scales, both Normal and ordinal models give similar results, indicating that the difference in scale does not greatly affect the ranking.

9. Extensions of the ordinal model

So far the ordinal logit models have used estimable fixed cut-point threshold parameters that do not vary across observations on different individuals. Indeed it is this feature which accounts for the proportional odds properties of such models. Changes in covariate values and random effects shift log-odds additively and hence odds proportionally. However, more flexibility can be introduced by allowing interactions with covariates or fitting random effects on cut-point thresholds. For example, as it stands, the fitted ordinal model adjusted for GCSE average score in Table 4 suggests that the additive main effect of gender on the cumulative log-odds is the same across grades. Since the cut points are not gender differentiated, gender changes shift only the location of the entire grade distribution keeping the relative odds proportional. Interactions of covariates with

cut-points allow a relaxation of this feature of covariate effects. For instance, by fitting interactions of gender with cut-points, different cut-points for males and females can be estimated. There is some evidence that a consistent proportional odds gender effect on the distributions may not be entirely tenable. In Table 1, for instance, for Geography more female students achieved the top two grades than males and vice versa for the bottom two grades. The distributional differences between genders may be more complex than a constant shift in cumulative log-odds.

9.1. A Non-proportional changing odds model

Model 7 below extends the fixed part of Model 5 in these directions. For clarity we have reverted to the model with a single variance at level 2. We relax the assumption of identical cut-points so that the odds are now allowed to change across grades. This type of model has been called a multilevel thresholds of change model (MTCM) by Hedeker & Mermelstein (1998).

Letting $t_{ij}$ be 1 if the $i$th person from institution $j$ is female, we write

$$\operatorname{logit}(\gamma_{ij}^{(s)}) = \alpha_{ij}^{(s)} = \alpha^{(s)} + \omega^{(s)} t_{ij} + X_{ij}\beta + u_{0j} \qquad (7)$$

Estimates of the cut-points $\alpha^{(s)}$ determine the cumulative grade distribution of males, conditional on other explanatory variables, and estimates of $(\alpha^{(s)} + \omega^{(s)})$ those of females. Similar terms may be introduced for other explanatory covariates, and higher order interactions are also possible.
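It follows directly from (7) that the cumulative odds ratio comparing females with males may now differ across thresholds:
$$\frac{\gamma_{ij}^{(s)}/(1-\gamma_{ij}^{(s)})\big|_{t_{ij}=1}}{\gamma_{ij}^{(s)}/(1-\gamma_{ij}^{(s)})\big|_{t_{ij}=0}} = \exp(\omega^{(s)}), \qquad s = 1, \ldots, 5,$$
and it is these five ratios, $\exp(\omega^{(s)})$, that are contrasted with the single proportional odds ratio implied by model (5) in Figure 4.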

(Table 7 about here)

The results of fitting model 7 are displayed in Table 7. They suggest that there is an interaction with gender and hence a non-proportional effect of gender on the cumulative odds.

(Figure 4 about here)

The ratios of the cumulative odds for females relative to males at each threshold are displayed in Figure 4 and contrasted with the constant proportional odds ratio of Model 5. It is clearly seen that the overall negative effect for females on Geography estimated by model 5 arises mainly because female students achieve relatively few high grades, having adjusted for their GCSE average score. For Chemistry there is a different pattern, with relatively more females in the middle grades and slightly more failures than a proportionality assumption would warrant. Differences in the skewness of the distributions of grades in the two subjects may play a role, and the importance of these is underlined by this type of ordinal model.

9.2. Random institution effects on cut-points for the distribution over grades

In the same way that we consider odds changing non-proportionally for different values of covariates we can allow the set of cumulative odds ratios for pairs of institutions to differ across categories by allowing the cut points to vary randomly across institutions. In model 7 we fitted one random effect for each institution. We now generalise (7) to model

(8) by allowing a set of specific cut-points ($\alpha^{(s)} + u_j^{(s)}$, $s = 1, 2, 3, 4, 5$) for each institution. The model is

$$\operatorname{logit}(\gamma_{ij}^{(s)}) = \alpha^{(s)} + \omega^{(s)} t_{ij} + X_{ij}\beta + u_j^{(s)} \qquad (8)$$

$$u_j^{(s)} \sim \mathrm{MVN}(0,\ \Omega_u)$$

where $\Omega_u$ is the (5 x 5) covariance matrix of the random institution effects for each separate cut-point. For simplicity we also assume that the gender coefficients are fixed, that is, there are no differential institutional effects by gender.

The estimates for model 8 are given in Table 7. The institution random effect parameters are shown separately in the lower part of Table 8. The fixed part results show main effects similar to those estimated by model 7 for both Chemistry and Geography, but there are some changes to the base and female cut-points. From the random parameter estimates we see that there is relatively more institutional variation at the grade A and F thresholds in Geography. For Chemistry the F threshold parameter exhibits the most variation. Such sources of institutional variation at crucial thresholds may potentially be of more substantive interest than overall average levels of adjusted performance. (Table 8 about here)

Normal plots for the estimated standardised residuals from model (8) showed good agreement with the normality assumption for both subjects.

For the same institutions that we previously investigated in detail using the proportional odds ordinal models, we now estimate a full set of cut-point residuals. Based on model (8), with the model estimates in Tables 7 and 8, the estimated institutional residuals are illustrated in Figures 5.1 and 5.2. The figures show a pattern of effects on the logistic scale over the grades in Chemistry for institutions 2, 3 and 4 that is not much different from the results observed in Table 6. These institutions could be interpreted as having a relatively uniform effect on students across all levels of ability. Institution 1 has a below average conditional expectation of achieving at least either of the top two thresholds, about the same as average for above grade C, and above average for the proportions not failing or achieving above grade D. It would be interesting to examine the practice at this institution, which seems to have a better than expected pass rate but a relatively lower than expected achievement at the top end of the grading. This pattern is further displayed in detail for males in Figure 6.1, which contrasts the predicted A level grades in Chemistry for typical males in Institution 1 with similar males of the base group. It will be noted that compared to the typical base group there appear relatively fewer in the top and bottom two categories but much higher proportions in Grades C and D. For Geography the effects of Institutions 1 and 2 are approximately proportional across grade thresholds. Institutions 5 and 6, occupying the highest and lowest positions in Table 6,

have profiles of threshold effects that are parallel, with a similar relative pattern. Given their general levels, the size of their effects declines relative to all institutions as we move through the grade thresholds. These institutions are relatively rather better in the targeted top grades than they are at getting students above low thresholds and passing. Institution 6, for instance, is not too far from average in its effect on top grade chances (and better than Institution 1) but its low overall position is due to 'deficiencies' at lower thresholds. Figure 6.2 presents the predicted distributions of A level Geography grades for males of mean age with mean GCSE in Institution 5, at the top of the scale, and Institution 6, at the bottom. Although, as expected, Institution 5 has a much higher proportion of Grade A's than Institution 6, there is little difference between Institution 5 and the base group in this respect. The impact of failures on the overall position of Institution 6 is obvious from this diagram. There is a further important general caveat for predictions for particular institutions. These use residual estimates which have uncertainty and require some caution, as stressed by Goldstein & Spiegelhalter (1996). Often the residuals are based on relatively small numbers of students, so that the standard errors of estimates can be quite large. Thus, for example, in the present comparisons Institution 1 has a standard error for the grade B cut-point of 0.299. Conditional on the fixed point estimates, a 95% confidence interval for the logit can be constructed. Converting to the cumulative probabilities gives an approximate interval of 28.1% to 56.1%. In principle, for more detailed analysis, intervals can be constructed for overall predictions of full grade distributions for each institution. [Figures 5 and 6 about here]
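As a sketch of the arithmetic behind the quoted interval, write $\hat{\ell}$ for the estimated grade B cumulative logit of this institution (not reported separately above). The interval is formed on the logit scale and then back-transformed:
$$\hat{\ell} \pm 1.96 \times 0.299 \;\;\longrightarrow\;\; \left[\frac{e^{\hat{\ell} - 0.586}}{1 + e^{\hat{\ell} - 0.586}},\; \frac{e^{\hat{\ell} + 0.586}}{1 + e^{\hat{\ell} + 0.586}}\right],$$
which approximately reproduces the quoted (28.1%, 56.1%) when $\hat{\ell} \approx -0.35$.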

10 Discussion

In this paper we have demonstrated a flexible range of models for educational grades as ordered categories. The operational definition of the outcome variable is at no higher a level of measurement than this. From a statistical viewpoint the assumptions for multilevel models for continuous responses may thus be inappropriate for scores applied to a few wide categories for several reasons. These range from objections about the scaling implied by sets of arbitrary scores through to continuous distribution properties

applied to discrete measurements and to bias in estimation due to grouping. A review of some literature on this is given by Fielding (1999) and a recent general contribution is Kampen and Swyngedouw (2000).

From a substantive view normal linear models are certainly useful and have the virtue of familiarity and accessible methodology and software. Certainly, this paper shows that the general scale of fixed effects is relatively insensitive to model formulation. The estimated precision of the fixed effect estimates is also comparable, although little is known about how the estimated precision is affected by the discrete and arbitrarily scored nature of the observed data. However, for institution specific details there are considerable differences between the models. In this respect it may be argued that ordinal models make fewer restrictive assumptions about the response distribution and provide conclusions which are less open to substantive query. Although the intercept residuals from models 3 and 6 are highly correlated, the location of specific institutions in their range can vary greatly. For instance, Institution 2 for Chemistry is at the 29th percentile of the ranked residuals of 2,409 schools for the ordinal model and at the 55th percentile for the normal model (see Table 6). It is true that these positions are both subject to uncertainty, but the question arises as to the appropriateness of the modelling if we want to draw substantive 'value added' conclusions.

From a practical point of view in educational research the ordinal models also offer as much information as do normal models. The 'newness' of ordinal models and (until recently) the lack of suitable software may have acted as a practical deterrent, but this is being remedied. The ability to convey predictive information through probability distributions, which cannot be done so easily using standard models, is a particular advantage. Since grades and levels are standard modes of assessment, it is obviously useful to relate the interpretations of results to these. A predicted point score for an individual, even when contextualised in terms of the conditional mean of a continuous distribution, has a less ready interpretation when grades and levels are the medium of converse. The implications of the use of ordinal models in such practical areas as target setting within schools may be clear.


Ordinal models seem to be capable of extension in useful ways. Their characterisation in terms of grade probabilities permits flexible parameterisation of the grade distribution for a variety of conditions. We have seen, for instance, that by allowing cut-points to interact with gender we can study gender differences in distributions in greater detail. As discussed, these differences go further than simply differences in average performance. By allowing random cut-points, school differences can also be exhibited in more complex ways. Schools can be compared at meaningful thresholds rather than through simple mean levels of adjusted achievement. Thus Institution 1 in Chemistry has lower grade A achievements than expected, but it also has fewer failures. Institution 6 in Geography has a considerable failure rate but does quite well in achieving high grades compared to other typical institutions. Differences between schools in such respects might well engage the interest of effectiveness researchers and policy makers as much as, if not more than, differences in 'average' achievement.

An aspect of ordinal models we have not discussed in any detail is the nature of the link function used, since we have focused on the familiar logit. However, we have also carried out some investigations using a probit link, which is available in the latest issue of MULTICAT. Since a probit link can sometimes be interpreted in normally distributed latent variable terms it might seem to fit more easily into comparisons with normal linear models. There is a conventional wisdom in the generalised linear modelling literature (e.g. McCullagh and Nelder, 1989; Greene, 2000) that important results are relatively insensitive to this choice of link. This is often attributed to the similarity of the logistic and normal distributions except at the tails. Preliminary results show some differences but none are startling. However, methodological work comparing logit, probit and other links in the multilevel context is under way and needs further advancing.

If, as we claim, ordinal models are worthy of more extensive application they need also to be developed further in a number of important directions. Ordinal models with cross-classified random effects at higher levels have been considered by Fielding & Yang (1999). Multivariate response multilevel models for continuous variables are developed

and quite widely used in education (Goldstein, 1995; Goldstein and Sammons, 1997). In our investigations with Model 7 we compared the two sets of cut-point residuals for institutions which had Geography and Chemistry in common. A general impression conveyed was that there were two major groups of institutions. One group was those schools whose performances in the two subjects were similar relative to all schools. However, another major group had relatively high performances in one subject together with low achievement in the other. There are some interesting practical questions here about the differential 'effectiveness' of schools in different A level subjects and the relationships between subject grades at both student and institutional level. In ordinal variable modelling, multivariate developments are required which are analogous to the continuous variable case. A promising line of inquiry which we have started to investigate is a log-linear characterisation of the multivariate distribution over sets of ordered categories at the lowest level. Other developments we envisage are multivariate models for mixed continuous and ordered category responses and, to parallel the longitudinal binary response models of Yang et al. (2000), variants for ordered categorisations. The latter situation has received some attention in the generalised estimating equation (GEE) literature (Lipsitz et al., 1994), but, being population averaging methods concentrating mainly on ways of obtaining efficient fixed effects estimates, these cannot be used to investigate the detailed structure of multilevel effects.

Acknowledgements

This study is part of the project Application of Advanced Multilevel Modelling Methods for the Analysis of Examination Data, supported by the Economic and Social Research Council (ESRC) of the U.K. under grant award R000237394. The Department for Education and Employment (DfEE) kindly provided the raw examination data. The ESRC has also given support to Antony Fielding's Visiting Research Fellowship at the Multilevel Models Project under award H51944500497 of the Analysis of Large and Complex Datasets programme. Valuable comments were received from James Carpenter and reviewers of an earlier version of the paper.

References

Bock, R.D. (1975). Multivariate statistical methods in behavioral research. New York: McGraw-Hill.

DfEE (1995). The development of a national framework for estimating value added at GCSE A/AS level, technical annex to GCSE to GCE A/AS value added: Briefing for schools and colleges. London: DfEE.

Ezzett, F., & Whitehead, J. (1991). A random effects model for ordinal responses from a crossover trial. Statistics in Medicine, 10, 901-907.

Fielding, A. (1999). Why use arbitrary points scores?: ordered categories in models of educational progress. Journal of the Royal Statistical Society. A, 162, 303-330.

Fielding, A. (2000). Ordered category responses and random effects in multilevel and other complex structures: scored and generalised linear models, in Multilevel Modelling: Methodological Advances, Issues and Applications, S. Reise & N. Duan ( eds.) , New Jersey, Erlbaum, forthcoming.

Fielding, A. & Yang, M. (1999) Random effects models for ordered category responses and complex structures in educational progress. University of Birmingham Department of Economics Discussion Paper 99-20, submitted for publication.

Goldstein, H. (1995). Multilevel Statistical Models (2nd edition). London: Edward Arnold.

Goldstein, H. and Spiegelhalter, D.J. (1996). League tables and their limitations: statistical issues in comparisons of institutional performance (with discussion). Journal of the Royal Statistical Society. A, 159, 385-443.

Goldstein, H. and Healy, M. J. R. (1995). The graphical presentation of a collection of means. Journal of the Royal Statistical Society. A, 158, 505-513.

Goldstein, H. and Sammons, P. (1997). The influence of secondary and junior schools on sixteen year examination performance: a cross-classified multilevel analysis. School Effectiveness and School Improvement, 8, 219-230.

Goldstein, H. and Thomas, S. (1996). Using examination results as indicators of school and college performance. Journal of the Royal Statistical Society. A, 159, 149-163.

Gray, J., Jesson, D., Goldstein, H., Hedger, K. and Rasbash, J. (1995). A multi-level analysis of school improvement: changes in schools’ performance over time. School Effectiveness and School Improvement, Vol. 6 No. 2, 97-114.

Greene, W. H. (2000). Econometric analysis (4th Edition), Upper Saddle Valley, New Jersey, Prentice-Hall.

Gray, J., Goldstein, H. and Jesson, D. (1996). Changes and improvements in schools’ effectiveness: trends over five years, Research Papers in Education, 11(1), 35-51.

Hedeker, D. & Gibbons, R.D. (1994). A random-effects ordinal regression model for multilevel analysis. Biometrics, 50, 933-944.

Hedeker, D. & Mermelstein. (1998) A multilevel thresholds of change model for analysis of stages of change data, Multivariate Behavioral Research, 33 (4), 427-455.

Kampen, J., & Swyngedouw, M. (2000). The ordinal controversy revisited. Quality & Quantity, 34, 87-102.

Lipsitz, S. R., Kim, K. & Zhao, L. (1994). Analysis of repeated categorical data using generalised estimating equations. Statistics in Medicine, 13, 1149-1163.

McCullagh, P. (1980). Regression models for ordinal data (with discussion). Journal of the Royal Statistical Society. B, 42, 109-142.

McCullagh, P. & Nelder, J.A. (1989). Generalised Linear Models ( 2nd Edition). London, Chapman and Hall.

O’Donoghue, C., Thomas, S., Goldstein, H. and Knight, T. (1997). 1996 DfEE study of value added for 16-18 year olds in England, DfEE research series, March. London DfEE.

Rasbash, J., Browne, W., Goldstein, H., Yang, M., Plewis, I., Healy, M., Woodhouse, G., & Draper, D. (1999). A user’s guide to MLwiN. Multilevel Models Project, Institute of Education, University of London.

Yang, M. (1997). Multilevel models for multiple category responses by MLn simulation. Multilevel Modelling Newsletter, Vol. 9, no. 1, 9-15. Institute of Education, University of London.

Yang, M, Goldstein, H, & Heath, A. (2000). Multilevel models for repeated binary outcomes: Attitudes and voting over the electoral cycle, Journal of the Royal Statistical Society, A, 163, 49-62.

Yang, M., Rasbash, J., & Goldstein, H. (1998). MLwiN macros for advanced multilevel modelling. Multilevel Models Project, Institute of Education, University of London.

Yang, M. and Woodhouse, G. (2001). Progress from GCSE to A and AS level: institutional and gender differences, and trends over time, British Journal of Education Research, Vol 27, 2.

Yang, M. (2001). Multilevel multinomial models in medical research. In A. Leyland & H. Goldstein (eds.), Multilevel Models in Medicine. Wiley (to be published in February).

Tables

Table 1. Frequency distributions of A-level Chemistry and Geography grades (1997) for cases with matching students’ GCSE results

                 Chemistry (2,409 institutions)            Geography (2,317 institutions)
Grade    Entries   Overall %   Male %   Female %   Entries   Overall %   Male %   Female %
A          6,680     21.6       21.8      21.4       4,170     12.5       10.8      14.7
B          6,666     21.6       21.2      22.1       7,407     22.3       20.4      24.5
C          5,732     18.5       18.0      19.2       7,885     23.7       23.7      23.7
D          4,611     14.9       14.8      15.0       6,297     18.9       20.3      17.2
E          3,606     11.7       11.7      11.6       4,271     12.8       14.2      11.1
F          3,615     11.7       12.5      10.7       3,246      9.8       10.6       8.7
Total     30,910    100.0      100.0     100.0      33,276    100.0      100.0      99.9

Chemistry: average A-level point score 5.83 (Males 5.78; Females 5.89); average GCSE point score 6.30 (Males 6.16; Females 6.47).
Geography: average A-level point score 5.47 (Males 5.23; Females 5.76); average GCSE point score 5.85 (Males 5.70; Females 6.04).

Table 2. Model estimates for variance components model (1) and basic ordinal model (4)

Model (1)
Param.         Chemistry: Estimate (Precision)   Geography: Estimate (Precision)
$\beta_0$        5.349                              5.250
$\sigma_u^2$     2.829 (24.90)                      2.017 (24.41)
$\sigma_e^2$     8.507                              7.228

Model (4)
Param.                        Chemistry: Estimate (Precision)   Geography: Estimate (Precision)
$\alpha^{(1)}$                 -1.881                             -2.405
$\alpha^{(2)}$                 -0.668                             -0.913
$\alpha^{(3)}$                  0.248                              0.230
$\alpha^{(4)}$                  1.106                              1.274
$\alpha^{(5)}$                  2.089                              2.406
$\sigma_u^2$                    1.190 (25.50)                      0.995 (25.38)
Extra-multinomial variation     0.945                              0.959

-27- Table 3. Model estimates and precision of estimates for models (2) and (5) without adjusting for GCSE average score Model 2(a) Model 5(a) Parameter Chemistry Geography Parameter Chemistry Geography Estimate Precision Estimate Precision Estimate Precision Estimate Precision 4.318 4.571 ()1 -2.511 -2.883 β 0 α α ()2 -1.297 -1.382 α ()3 -0.383 -0.233 α ()4 0.470 0.814 α ()5 1.446 1.946 Female 0.169 4.40 0.553 16.7 Female 0.099 4.30 0.383 17.6 Age -0.004 -0.82 -0.003 -0.64 Age -0.003 -1.07 -0.002 -0.63 M/S 1.239 6.49 1.291 7.51 M/S 0.755 6.22 0.869 7.27 M/M -0.481 -0.86 -0.948 -2.78 M/M -0.300 -0.87 -0.624 -2.69 GM/C 0.011 0.10 0.086 0.90 GM/C 0.001 0.02 0.073 1.11 GM/S 1.513 9.70 1.288 9.20 GM/S 0.933 9.39 0.885 9.05 GM/M -2.119 -2.84 -2.321 -5.28 GM/M -1.296 -2.78 -1.728 -5.59 IND/S 2.187 23.3 1.694 18.7 IND/S 1.407 23.5 1.185 18.9 IND/NS 0.076 0.27 0.214 0.81 IND/NS 0.041 0.24 0.167 0.93 SFC 0.569 3.95 0.109 0.86 SFC 0.343 3.73 0.069 0.78 FE -0.993 -7.22 -1.012 -8.25 FE -0.653 -7.55 -0.704 -8.32 Other 1.083 3.30 0.195 0.59 Other 0.603 2.93 0.150 0.66 Camb. 0.700 5.11 -0.193 -1.87 Camb. 0.441 5.19 -0.122 -1.71 London 0.169 1.31 0.230 2.71 London 0.097 1.20 0.149 2.55 Oxford -0.136 -0.59 -1.267 -6.54 Oxford -0.092 -0.64 -0.835 -6.28 JMB 0.725 5.50 0.231 2.17 JMB 0.458 5.55 0.149 2.04 OXCAM 1.085 7.48 0.228 1.61 OXCAM 0.654 7.27 0.123 1.26 2 1.662 21.71 1.236 21.50 2 0.698 22.83 0.623 22.57 σ u σ u 2 8.521 7.173 Extra- 0.951 0.959 σ e multinomi al variation

-28- Table 4 Model estimates and precision of estimates for models 2 and 5 adjusting for GCSE average Model 2(b) Model 5(b) Chemistry Geography Var. Chemistry Geography Estimate Precision Estimate Precision Estimate Precision Estimate Precision 5.272 5.224 ()1 -2.950 -3.614 β 0 α α ()2 -1.104 -1.404 α ()3 0.247 0.276 α ()4 1.436 1.718 α ()5 2.704 3.147 Female -0.900 -24.7 -0.348 -11.3 Female -0.720 -23.4 -0.304 -11.1 Age -0.039 -10.4 -0.026 -7.94 Age -0.035 -11.7 -0.025 -8.61 M/S -0.033 -0.23 0.060 0.44 M/S -0.065 -0.50 0.069 0.53 M/M 0.521 1.25 -0.127 -0.49 M/M 0.395 1.10 -0.057 -0.23 GM/C -0.019 -0.22 0.085 1.17 GM/C -0.017 -0.23 0.099 1.43 GM/S 0.114 0.95 0.023 0.21 GM/S 0.069 0.65 0.038 0.36 GM/M -1.366 -2.45 -1.088 -3.23 GM/M -1.251 -2.57 -1.174 -3.58 IND/S 0.266 3.24 0.239 3.05 IND/S 0.242 3.08 0.247 3.34 IND/NS -0.161 -0.77 -0.040 -0.20 IND/NS -0.147 -0.82 -0.051 -0.27 SFC 0.477 4.44 -0.010 -0.10 SFC 0.402 4.38 -0.025 -0.27 FE -0.159 -1.53 -0.445 -4.69 FE -0.179 -1.96 -0.414 -4.60 Other 0.699 2.86 -0.329 -1.31 Other 0.577 2.70 -0.361 -1.51 Camb. 0.628 6.16 -0.400 -5.06 Camb. 0.579 6.51 -0.358 -4.77 London -0.005 -0.05 0.091 1.40 London 0.005 0.06 0.084 1.35 Oxford -0.303 -1.76 -1.397 -9.44 Oxford -0.212 -1.41 -1.281 -9.22 JMB 0.476 4.83 -0.036 -0.44 JMB 0.460 5.34 -0.046 -0.60 OxCam 1.207 11.2 -0.014 -0.13 OxCam 1.054 11.2 -0.044 -0.42 GA 3.309 79.2 2.808 93.9 GA 2.600 65.2 2.437 77.6 GA^2 0.255 7.31 0.236 7.61 GA^2 0.484 15.1 0.377 12.6 GA^3 -0.404 -17.0 -0.219 -17.5 GA^3 -0.035 -1.32 -0.047 -3.36 GA^4 -0.099 -9.61 -0.042 -5.45 GA^4 -0.013 -1.30 -0.007 -0.92 GA-F -0.048 -0.86 0.088 2.87 GA-F -0.201 -4.24 0.049 1.69 GA^2-F 0.275 7.45 0.066 2.48 GA^2-F 0.114 2.93 0.028 1.08 GA^3-F 0.087 3.38 N/A N/A GA^3-F 0.076 2.97 N/A N/A Sch-GA 0.175 2.82 0.108 1.86 Sch-GA 0.151 2.76 0.094 1.71 Sch-SD 0.210 1.71 0.329 2.79 Sch-SD 0.146 1.38 0.333 3.03 2 0.910 21.52 0.733 21.77 2 0.754 22.85 0.705 23.11 σ u σ u

2 4.820 4.112 Extra- 0.928 0.945 σ e multinom ial variance

Table 5. Selected institutions with institutional characteristics

Instn.   A level entries (male : female)   Exam     Instn.     Mean GCSE point score      Mean A-level point score
         Chemistry     Geography           Board    Type       Chemistry   Geography      Chemistry   Geography
1        19 : 28       29 : 29             CAMB     6th Form   5.84        5.74           5.19        4.93
2         0 : 14        0 : 30             OXCAM    IND/S      7.46        6.78           9.71        7.93
3         6 : 0                            AMB      IND/S      6.79                       3.00
4        14 : 12                           CAMB     G/C        5.28                       6.54
5                      45 : 27             CAMB     6th Form               5.70                       6.75
6                      17 : 21             London   6th Form               5.33                       1.42

Note: mean GCSE and A-level point scores are those of individual students in the institution.

Table 6. School value added estimates for the Normal and Ordinal models

            Rank of residuals       Residual estimate (S.E.)          Ranks corresponding to 95% overlap intervals
School      Model 3    Model 6      Model 3         Model 6           Model 3          Model 6
Chemistry
1            1251       1706        -0.02 (0.38)    -0.35 (0.26)      1209 ~ 2020      1209 ~ 2068
2            1317        702        -0.07 (0.58)     0.37 (0.69)        71 ~ 1940        50 ~ 1971
3            2406       2405        -2.11 (0.65)    -2.08 (0.56)      2324 ~ 2408      2347 ~ 2409
4               1          2         2.75 (0.43)     2.31 (0.46)         1 ~ 23            1 ~ 11

Geography
1            1615       1773        -0.30 (0.30)    -0.46 (0.30)       985 ~ 2045      1226 ~ 2133
2             113        301         1.04 (0.50)     0.69 (0.51)         6 ~ 663         46 ~ 1198
5               2          1         2.09 (0.27)     2.69 (0.31)         1 ~ 10            1 ~ 4
6            2316       2317        -2.28 (0.36)    -2.94 (0.39)      2308 ~ 2317      2316 ~ 2317

-31- Table 7 Parameter estimates for Models 7 and 8 by subject Model 7 Model 8 Var. Chemistry Geography Chemistry Geography Estimate SE Estimate SE Estimate SE Estimate SE α ()1 -2.899 0.111 -3.551 0.102 -2.775 0.086 -3.600 0.103 α ()2 -1.084 0.110 -1.401 0.097 -0.955 0.084 -1.344 0.097 α ()3 0.228 0.110 0.261 0.097 0.334 0.084 0.318 0.097 α ()4 1.384 0.110 1.710 0.098 1.490 0.085 1.789 0.098 α ()5 2.598 0.112 3.158 0.101 2.764 0.088 3.340 0.102 ω()1 -0.856 0.048 -0.434 0.054 -0.841 0.047 -0.408 0.054 ω()2 -0.786 0.036 -0.312 0.035 -0.775 0.034 -0.304 0.033 ω(3) -0.697 0.036 -0.271 0.032 -0.695 0.034 -0.269 0.031 ω()4 -0.621 0.041 -0.289 0.038 -0.623 0.039 -0.278 0.037 ω(5) -0.488 0.052 -0.333 0.051 -0.476 0.051 -0.310 0.051 Age -0.036 0.003 -0.025 0.003 -0.036 0.003 -0.026 0.003 M/S -0.064 0.129 0.069 0.130 -0.078 0.126 0.089 0.130 M/M 0.395 0.359 -0.057 0.243 0.350 0.358 -0.077 0.246 GM/C -0.017 0.073 0.099 0.069 -0.012 0.072 0.099 0.069 GM/S 0.070 0.107 0.039 0.107 0.078 0.104 0.022 0.107 GM/M -1.258 0.490 -1.176 0.329 -1.319 0.502 -1.125 0.344 IND/S 0.241 0.072 0.248 0.074 0.244 0.071 0.251 0.074 IND/NS -0.149 0.180 -0.051 0.189 -0.149 0.178 -0.044 0.187 6th form 0.403 0.096 -0.025 0.094 0.434 0.093 -0.012 0.095 FE -0.182 0.092 -0.413 0.090 -0.115 0.092 -0.368 0.091 Other 0.583 0.215 -0.361 0.240 0.631 0.211 -0.345 0.241 Camb. 0.581 0.089 -0.359 0.075 0.547 0.088 -0.444 0.075 London 0.006 0.084 0.084 0.062 -0.019 0.084 0.033 0.062 Oxford -0.215 0.151 -1.283 0.139 -0.223 0.150 -1.429 0.141 JMB 0.460 0.086 -0.047 0.077 0.454 0.086 -0.121 0.077 OXCAM 1.054 0.094 -0.045 0.104 1.026 0.093 -0.077 0.104 GA 2.545 0.041 2.417 0.033 2.497 0.028 2.410 0.029 GA^2 0.487 0.032 0.367 0.030 0.459 0.017 0.371 0.020 GA^3 -0.029 0.026 -0.047 0.014 n/a n/a -0.023 0.010 GA-F -0.070 0.053 0.089 0.037 -0.052 0.048 0.090 0.035 GA^2-F 0.111 0.040 0.055 0.028 0.104 0.032 0.043 0.027 GA^3-F 0.064 0.026 n/a n/a 0.055 0.019 n/a n/a Sch-GA 0.150 0.055 0.094 0.055 0.119 0.055 0.105 0.056 Sch-SD 0.144 0.106 0.333 0.110 n/a n/a 0.323 0.110 2 0.759 0.033 0.706 0.706 σ u School level variance-covariance, see table 8

Extra- 0.928 0.003 0.946 0.946 0.871 0.003 0.853 0.003 multinom ial variation

-32- Table 8. Variance-covariance estimates of the cut-points for Chemistry (the first line) and Geography (the second line) at school level, SE in parentheses, correlation coefficients in the upper triangle of the table A Above B Above C Above D Above E A 0.806 (0.047) 0.92 0.83 0.78 0.48 1.157 (0.068) 0.88 0.70 0.56 0.61 Above B 0.699 (0.037) 0.714 (0.037) 0.94 0.88 0.63 0.803 (0.043) 0.725 (0.037) 0.92 0.80 0.81 Above C 0.659 (0.036) 0.701 (0.035) 0.785 (0.039) 0.97 0.80 0.658 (0.040) 0.683 (0.033) 0.755 (0.036) 0.95 0.96 Above D 0.658 (0.038) 0.700 (0.035) 0.807 (0.039) 0.883 (0.045) 0.92 0.549 (0.043) 0.630 (0.033) 0.770 (0.036) 0.863 (0.043) 0.99 Above E 0.479 (0.055) 0.594 (0.041) 0.791 (0.043) 0.965 (0.050) 1.236 (0.070) 0.716 (0.045) 0.758 (0.041) 0.908 (0.044) 1.002 (0.050) 1.194 (0.066)

Figures

[Figure 1: bar chart of predicted probability (%) by grade (A–F) for Students 1 and 2.]

Figure 1 Predicted distribution of A level grades for Chemistry for Ordinal model 5(b) for two students. (Student 1: female of 18.5 years old from an independent selective school with exam board Oxford-Cambridge with overall GCSE score 7.5. Student 2: same as Student 1 but with a lower GCSE average score of 5.)

Figure 2 Normal score (y axis) by standardised residual (x axis) for the Normal model 3 and Ordinal model 6, for Chemistry

Figure 3 Normal score (y axis) by standardised residual (x axis) for the Normal model 3 and Ordinal model 6, for Geography.

[Figure 4: line plot of cumulative odds ratios (y axis, 0.4–0.8) against grade threshold (A, A–B, A–C, A–D, A–E) for Models 5 and 7, Chemistry and Geography.]

Figure 4 Gender effects on grade threshold probabilities: ratios of cumulative odds between females and males estimated by models 5 and 7

[Figure 5.1: plot of institution residuals (logit scale, −3 to 3) by grade threshold (A, A–B, A–C, A–D, A–E) for Institutions 1–4, A-level Chemistry.]

Figure 5.1 Plots of residuals of 4 institutions for A-level Chemistry, logit scale

[Figure 5.2: plot of institution residuals (logit scale, −5 to 5) by grade threshold (A, A–B, A–C, A–D, A–E) for Institutions 1, 2, 5 and 6, A-level Geography.]

Figure 5.2 Plots of residuals of 4 institutions for A-level Geography, logit scale

[Figure 6.1: bar chart of frequency (%) by grade (A–F), A-level Chemistry, for the male base group and Institution 1.]

Figure 6.1 Predicted distributions of A-level Chemistry grades for males of mean age with mean GCSE score: overall base group of maintained comprehensive schools with AEB board compared with Institution 1 (a sixth form college with Cambridge Board)

[Figure 6.2: bar chart of frequency (%) by grade (A–F), A-level Geography, for the male base group, Institution 5 and Institution 6.]

Figure 6.2 Predicted distributions of A-level Geography grades for males of mean age with mean GCSE score: overall base group of maintained comprehensive schools with AEB board compared with Institution 5 (a sixth form college with Cambridge Board) and Institution 6 (a sixth form college with London Board)


Progress from GCSE to A and AS level: institutional and gender differences, and trends over time

Min Yang and Geoffrey Woodhouse†

Institute of Education University of London 20 Bedford Way London, WC1A 0AL

† Author to whom correspondence should be sent

Abstract

We study the relationship between results obtained in examinations for the General Certificate of Education at Advanced and Advanced Supplementary (A/AS) level and those obtained by the same students two years earlier in examinations for the General Certificate of Secondary Education (GCSE). Using comprehensive data on four cohorts examined between 1994 and 1997, we build a multilevel, longitudinal model of student progress. We find that progress differs between men and women, and between students of different ages, and that the average GCSE performance of the students in an establishment is a significant predictor of individual progress. Once establishments are matched on this measure, and students are matched on their own GCSE performance, the effects of most establishment types are substantially reduced: in particular, the average progress of students in maintained grammar schools does not differ significantly from that of students in maintained comprehensive schools. We find less stability over time in the usual residual estimates of the relative effectiveness of institutions than has been found in earlier studies.

Key words multilevel modelling; examination results; trends over time; compositional effects; school effectiveness; value added

Acknowledgements

This work has been supported by the Economic and Social Research Council through grant number R000237394 (Applications of Advanced Multilevel Modelling Methods for the Analysis of Examination Data). We are grateful to the Department for Education and Employment for access to the data.

1. Introduction

Our purpose in this paper is to build and interpret models of the relationship between students’ performance in examinations for the General Certificate of Education (GCE) at Advanced and Advanced Supplementary (A/AS) level and their earlier performance in examinations for the General Certificate of Secondary Education (GCSE). Such models are commonly used to describe educational progress, and how this differs for different kinds of student and in different establishments. The present study extends the work of the Department for Education (DFE, 1995), Goldstein and Thomas (1996), and O’Donoghue et al. (1997) by using data on four complete cohorts of students: these allow us to model the relationship, in particular with time, in greater detail than before. We use a multilevel framework for our investigation, with four levels of potential variation corresponding to the hierarchy in the population: students are modelled as varying within cohorts, which vary within establishments, which vary within local education authorities (LEAs). The study is part of the project ‘Applications of Advanced Multilevel Modelling Methods for the Analysis of Examination Data’, funded by the Economic and Social Research Council.

The data relate to the results obtained in examinations at A/AS level taken in each of the years 1994, 1995, 1996, and 1997 by students attaining the age of 18 in the academic year of the examination and attending an educational establishment in England. For each student we have also the grade results obtained in GCSE examinations taken at age 16 years or earlier. Other data available are the age and gender of the student, and the establishment attended at the time of the A/AS-level examination. The outcome variable, or response, is a composite score derived from the individual subject grades obtained at A/AS level. For GCSE performance we use aggregate measures based on a commonly used scoring system. These apparently simple measures are the ‘common currency’ for certifying students and comparing institutions.

In the next section we describe the data and how we derived the variables used in our models. We then describe the initial model used to adjust for individual performance at GCSE. After these preliminaries we develop the model in stages. The first stage

(described in Section 4) is to build a model of the ‘main effects’ on the response variable, in this case the effects of age, gender, establishment type, region, time, and GCSE performance. In Section 5 we extend the model by including interaction effects and compositional effects. For example, we examine whether the effect of gender depends on the GCSE performance of the candidate, and whether the average prior attainment of the students in an establishment has an effect on individual outcomes. In Section 6 we introduce random coefficients at establishment level in order to test whether the effect of gender, for example, varies from institution to institution. Finally, we summarise our findings and discuss their interpretation.

The data and derived variables

The data comprise records on 722,903 students, spread over the four years 1994–7. These are, for each year of examination, the full cohort of students who attained the age of 18 years during the year, attended a recognised educational establishment in England, and entered at least one examination at A/AS level. We excluded from the analysis entries and results in General Studies, both at GCSE and at A/AS level. In addition, students were excluded if grades or records were missing for either GCSE or A/AS level examinations. These exclusions reduced the number of students by 18,760, some 2.6% of the total. We also excluded records in which the establishment was not identified: this accounted for another 7,483 students, or approximately 1% of the initial total. In sum, we used data on 696,660 students attending between them 2,794 establishments. Of these establishments, 2,534 had examination cohorts in all four years, 121 in three of the four, 79 in two, and 60 in one only. Inspection of the data on the establishments not represented in all four years revealed no dependence on establishment type, LEA, or cohort size.

We use as the response variable a measure of overall achievement at A/AS level based on the points score used currently by many British universities as a criterion for entry. This scale assigns 10 points to each A grade obtained at A level, 8 to each B, 6 to each C, 4 to each D, and 2 to each E. For grades at AS level 5 points are assigned to each A, 4 to each B, etc. The points (excluding grades obtained for General Studies and for multiple entries in the same subject) are then accumulated to form a total A/AS score for each student. The total examination score at A/AS level,
calculated as above, is important for the future careers of students, and achievements on this measure, aggregated to the level of establishments, have been published nationally since 1992.

Table I about here

The mean scores obtained in each year are shown in Table I, where it can be seen that they increased slightly over the period 1994–7. For our analysis we transformed the total score using normal scoring applied to the combined data. This transformation produces ‘normal equivalent deviates’, and its chief purpose in this case is to ensure that the distributions of the residuals from our models are close to the normal, so that inferences about model parameters are valid. Figure 1 shows how the raw A/AS score is related to the normalised A/AS score. Approximately 6% of the students scored 0 at A/AS level. This score is transformed to the value –1.86, which is the third centile of the standard normal distribution and corresponds to the mean rank of these students. At the other end of the distribution, a few students obtained very high scores. Because they have so few peers, such students have high normalised scores. Normalised scores in the range above 1.96 are obtained by the top 2.5% of the students and in this part of the range the relationship between the scales becomes irregular, reflecting irregularity in the distribution of raw scores there. For the rest of the range the relationship is approximately linear, and a difference of 1 point on the normalised scale corresponds to a difference of about 10 raw-score points. Thus, to convert from normalised score to raw score an approximate rule of thumb for most of the range is to multiply by 10. We shall use the term ‘score point’ for the result of applying this rule of thumb.
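For readers who wish to reproduce this transformation, the sketch below (in Python) computes rank-based normal scores; the exact plotting positions used are not stated in the text, so the mean-rank convention (rank - 0.5)/n, with tied scores given their average rank, is an assumption, chosen because it reproduces the reported value of about -1.86 for the block of students scoring zero.

    import numpy as np
    from scipy.stats import rankdata, norm

    def normal_scores(raw_scores):
        """Rank-based normal scores ('normal equivalent deviates').

        Tied raw scores share their average rank, so a block of identical
        scores is mapped to the normal deviate of its mean rank position.
        """
        x = np.asarray(raw_scores, dtype=float)
        ranks = rankdata(x, method="average")      # average rank for ties
        p = (ranks - 0.5) / len(x)                 # assumed plotting positions
        return norm.ppf(p)                         # inverse standard normal

    # If roughly 6% of a large cohort score 0, their common normal score is
    # close to norm.ppf(0.03), i.e. about -1.88, in line with the -1.86 reported.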

Figure 1 about here

To adjust for the prior achievement of each student, we use a combination of measures relating to overall performance at GCSE, as described in detail in the next section. The total GCSE score for each student is computed using grade results obtained two or more years earlier than the relevant A/AS level examination year, that is, at the age of 16 years or earlier, excluding grades in General Studies. Where a student had more than one result in a given subject, the best grade is used. Each A or A* grade is assigned 7 points, each B grade 6 points, each C grade 5 points, etc.

Standard practice at present is to assign 8 points to an A* grade, but this option is not open to us: A* grades were not introduced to the GCSE until 1994 and two of the four cohorts in our analysis took their GCSE examinations before that year. The resulting total GCSE scores are summarised in Table I, as are the numbers of subjects taken, and the mean score obtained per subject. All of these increased over the period. Table I also shows the mean age of the students and the percentage of women. These were effectively the same for each cohort.
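The two scoring rules are simple look-up tables. The sketch below is purely illustrative (the function names are ours, and the GCSE scale below grade C extends the 'etc.' above in the conventional way down to grade G); it shows how the total and mean scores used later might be computed from lists of grades.

    # Points per grade: A/AS level as described above, GCSE with A* scored as 7
    # because two of the four cohorts took GCSE before the A* grade existed.
    A_LEVEL_POINTS = {"A": 10, "B": 8, "C": 6, "D": 4, "E": 2}
    AS_LEVEL_POINTS = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}
    GCSE_POINTS = {"A*": 7, "A": 7, "B": 6, "C": 5, "D": 4, "E": 3, "F": 2, "G": 1}

    def aas_total(a_grades, as_grades):
        """Total A/AS points score (General Studies assumed excluded upstream)."""
        return (sum(A_LEVEL_POINTS[g] for g in a_grades)
                + sum(AS_LEVEL_POINTS[g] for g in as_grades))

    def gcse_summary(grades):
        """Return (total score, number taken, mean score) for one student's GCSEs."""
        points = [GCSE_POINTS[g] for g in grades]
        return sum(points), len(points), sum(points) / len(points)

    # Example: grades A, A, B, B at GCSE give a mean of (7 + 7 + 6 + 6) / 4 = 6.5.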

Table II about here

Table II shows the numbers and percentages of establishments included in the analysis for each year, according to type. For the schools, the categories beginning ‘Maintained’ include county schools, voluntary aided schools, and voluntary controlled schools, but not grant-maintained (GM) schools. We have further categorised all schools according to their admission policy, as coded by the DfEE. Maintained and GM schools that are not selective or comprehensive are officially designated ‘other’. Most of these are Modern schools. The admission policies of some schools changed during the period. In these cases, as with changes from LEA- maintained to grant-maintained status, we have assigned a school to its earliest known category, on the principle that A/AS-level candidates were not substantially influenced by the later policies. Similarly, where an establishment’s LEA changed during the period we have assigned it to its earliest known LEA. For a small number of schools (less than 1% of the total, as shown in Table II) the admission policy was listed throughout as unknown. These were retained in the analysis as a separate category. It is important to note that the formally listed admission policy of a school cannot describe precisely how that policy operates in practice. For example, a ‘non- selective’ school in one area may in practice operate an admission policy similar to that of a ‘selective’ school in another. One of our purposes is to estimate the effects of such differences in selectivity upon students’ progress.

3. The GCSE performance function

We have available three variables derived from the subject grade results at GCSE: the GCSE total score, the GCSE number taken, and the GCSE mean score. In order to adjust for GCSE performance we require a function of these variables that adequately predicts later performance at A/AS level. The simplest such functions are polynomials, and these have been used in previous research (see, for example, DFE, 1995, Goldstein and Thomas, 1996, and O’Donoghue et al., 1997).

Of the three variables, GCSE total score, GCSE mean score, and GCSE number taken, any one can be calculated from the other two. The total score by itself is problematic as a predictor. For example, a score of 63 can be obtained by entering 9 GCSEs and gaining grade A or A* in all of them: a score of 64 requires at least 10 GCSEs and implies that a lower grade than A was obtained in at least one and possibly several. In fact, a GCSE total score of 64 was found to be generally associated with a lower A/AS score than one of 63. Figure 2 shows how the relationship between A/AS score and GCSE total score depends also on the number of GCSEs taken. Turning to the GCSE mean score, we found a strong curvilinear relationship between this and the normalised A/AS score, for students taking between 4 and 12 GCSEs and obtaining mean scores above about 3.5. Figure 3 shows the relationship as it was in 1997 for numbers of GCSEs taken between 7 and 10. The relationships for the other years had a similar form. Figures 2 and 3 about here

Table III about here

Table III shows the distributions of GCSE mean score and number taken over the whole data set. Very few A/AS-level candidates took fewer than 4 GCSEs or more than 12, and the total number of candidates with GCSE mean scores less than 3 was only 1700. This explains why no clear relationship is seen between GCSE mean score and mean A/AS score when the GCSE mean score falls below about 3.5. It might be argued that students in these categories belong to different populations and should be excluded from the analysis. We have chosen instead to develop the model including all students and investigate later the effects, if any, of removing these categories.

We carried out a series of cross-sectional analyses of the data for each year, using a variety of polynomial functions of the GCSE mean score and number taken. The function that best fitted the data in all four years was of the form:

\[
\beta_0 + \beta_1 GA + \beta_2 GA^2 + \beta_3 GA^3 + \beta_4 GA^4 + \beta_5 GN + \beta_6 GN^2 + \beta_7 GN^3 + \beta_8 GA \times GN + \beta_9 GA \times GN^2 + \beta_{10} GA^2 \times GN + \beta_{11} GA^2 \times GN^2 ,
\]
where GA = GCSE mean score (centred at 5.0), GN = GCSE number taken (centred at 9).

Figure 4 about here

We illustrate this functional relationship in Figure 4. For ease of understanding, predicted A/AS scores have been converted to score points. The shaded areas in the surface correspond to 5-point intervals in the predicted A/AS score, while the gridlines in the surface correspond to particular values of the GCSE mean score and number taken. A candidate who took 4 GCSEs and averaged 6.5 points (obtaining, for example, two As and two Bs) is predicted to score 12 or 13 points at A/AS level. The predicted score increases as the GCSE mean score increases, for a given number of GCSEs taken. It does not necessarily increase when more GCSEs are taken, if the GCSE mean score is below 5 (an average grade of C). For a GCSE mean score above this threshold, the predicted score increases most sharply as the number of GCSEs taken rises from about 5 to about 10. The predicted score also increases sharply as the mean GCSE score increases beyond 5, for students taking at least 4 GCSEs. The function captures the main features noted in Figure 3, and we call it the GCSE performance function.
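As a computational convenience, the sketch below builds the eleven polynomial and product terms of this function for given values of the GCSE mean score and number taken; combined with a fitted coefficient vector (for example the GCSE block of Model 3 in Table IV), it gives the GCSE contribution to the predicted normalised A/AS score. The function name and layout are ours.

    import numpy as np

    def gcse_performance_terms(gcse_mean, gcse_number):
        """Terms of the GCSE performance function (intercept excluded).

        gcse_mean   -- GCSE mean score on the raw scale, centred here at 5.0
        gcse_number -- number of GCSEs taken, centred here at 9
        Order: GA, GA^2, GA^3, GA^4, GN, GN^2, GN^3,
               GA*GN, GA*GN^2, GA^2*GN, GA^2*GN^2.
        """
        ga = gcse_mean - 5.0
        gn = gcse_number - 9
        return np.array([ga, ga**2, ga**3, ga**4,
                         gn, gn**2, gn**3,
                         ga * gn, ga * gn**2, ga**2 * gn, ga**2 * gn**2])

    # With a coefficient vector beta = (beta_0, beta_1, ..., beta_11) the GCSE
    # contribution is beta[0] + gcse_performance_terms(ga_raw, gn_raw) @ beta[1:].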

4. Variance components models of the main effects

Each of the models in this section has simple components of variance at 4 levels, that of student (level 1), cohort (level 2), establishment (level 3), and LEA (level 4). The general algebraic form is given in the appendix. Table IV shows the estimates for three such models, two that do not include the GCSE performance function and one that does.
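The models in this paper were fitted with purpose-built multilevel software (MLwiN; Goldstein et al., 1998). For readers who wish to experiment in Python, the sketch below shows one broadly comparable way of fitting a nested variance components model with statsmodels; the data frame and column names are hypothetical, and for brevity the cohort level is omitted, leaving students within establishments within LEAs.

    import statsmodels.formula.api as smf

    # df is assumed to hold one row per student with columns:
    #   norm_aas -- normalised A/AS score (the response)
    #   estab    -- establishment identifier
    #   lea      -- Local Education Authority identifier
    def fit_variance_components(df):
        """Three-level variance components model: students within establishments within LEAs."""
        model = smf.mixedlm(
            "norm_aas ~ 1",                        # intercept-only fixed part, cf. Model 1
            data=df,
            groups="lea",                          # random intercept for each LEA
            vc_formula={"estab": "0 + C(estab)"},  # establishment-within-LEA component
        )
        return model.fit()

    # result = fit_variance_components(df)
    # print(result.summary())   # level-specific variance estimates in the random part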

Table IV about here

Model 1 fits just the ‘intercept’, or constant term, and is intended to show how the variance in the response is shared between the four levels. The total variance is about 1.01, reflecting the scaling of the response variable. Most of this variance is at student level (76%), with very little at cohort level. Establishments account for 22% of the variance. A noticeable feature is the small amount of variance, less than 2%, at LEA level.
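These percentages are variance partition coefficients: each level's estimated variance divided by the total. A minimal check using the Model 1 estimates in Table IV is sketched below.

    # Model 1 variance estimates from Table IV.
    components = {"LEA": 0.0196, "establishment": 0.2208,
                  "cohort": 0.0123, "student": 0.7612}
    total = sum(components.values())              # about 1.01

    for level, variance in components.items():
        print(f"{level:>13s}: {100 * variance / total:5.1f}% of total variance")
    # Students account for roughly three-quarters of the variance, establishments
    # for just over a fifth, and LEAs for under 2%.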

Model 2 estimates the main effects of gender, age, establishment type, region, and time, but ignores prior performance at GCSE. Women are estimated to score better than men of the same age in establishments of the same type, by about 0.08 on the transformed scale or about four-fifths of a score point. Age is measured in months, centred at 18 years and 6 months. Its effect is estimated to be quadratic, with students of mean age predicted to have the lowest scores. This symmetrical effect disappears when the GCSE performance function is included. The main effects of establishment type are shown as contrasts with maintained comprehensive schools. Several of these effects are substantial. For example, students in maintained selective schools score on average 0.63 more than students of the same age and gender in comprehensives, on the transformed A/AS scale – equivalent to more than 6 score points or 3 A-level grades. There are no statistically significant regional effects. The main effect of time is approximately linear, with a yearly increase of about two-fifths of a score point.

This model provides a simple summary of average attainment at A/AS level. To address the issue of the progress made in different types of establishment and by different types of student, it is necessary to adjust for the prior performance of the students. Model 3 includes the GCSE performance function together with the main effects in Model 2. We find marked changes from Model 2 in the estimates of the main effects of gender, age, establishment type, and region, these being now conditional on the GCSE performance of the student. Women are predicted to obtain about two-thirds of a score point less than men with equivalent GCSE performance. The main effect of age (within cohort) is now negative: the oldest students in the cohort are predicted to obtain about 1 score point less than the youngest students, when they are matched according to the other variables in the model. The effects of most types of establishment are much reduced.

The regional effects are shown as contrasts with the Northern region. The contrasts for the North Western region and for Yorkshire & Humberside are statistically non-significant at the 5% level. Students in East Anglia, the East Midlands, and Greater London, however, are predicted to obtain about 1 score point more than similar students in the Northern region. Once the other main effects are adjusted for, there remains a slight positive trend with time.

When we speak of the ‘effect’ of establishment type, gender, etc., we mean the difference in the predicted scores of candidates in a given category, for example maintained selective schools, when compared with candidates in the base category (in this case, maintained comprehensive schools), once the candidates have been matched according to the other explanatory variables in the model. Such effects are ‘conditional on’ or ‘adjusted for’ the other explanatory variables. We use the terms ‘advantage’ and ‘disadvantage’ in this conditional sense also. Comparison of models 2 and 3 shows that the effects attributed to establishment type and gender alter markedly once adjustment is made for the prior performance of each candidate at GCSE. Model 3 also reveals regional differences. We shall not in this paper attempt to interpret these, but it does appear that in the North of England students are doing less well at A/AS level, in relation to their measured performance at GCSE, than students elsewhere in England.

The total residual variance has been reduced from 1.01 in Model 1 to 0.44 in Model 3. In other words, the explanatory variables in Model 3 have accounted for about 57% of the variance. This reduction is not evenly distributed between the four levels: the LEA component has been reduced to less than one-sixth of 1% of the total residual variance.

5. Interactions and compositional effects

Model 3 assumes that the effects of gender, age, and establishment type are constant across the attainment range of students, and over time. To explore whether this was in fact the case, we need to extend the model by incorporating interaction terms and compositional effects. A further extension is to allow some of the coefficients in the model to vary randomly. Table V summarises the results of the first extension. We describe the second in the next section.

Table V about here

We denote an interaction by means of descriptors separated by multiplication signs. We are concerned particularly with the interactions Gender×GCSE, Age×GCSE, Gender×Age, Estab×GCSE, and Estab×Gender. Before we study these, however,
there are two other interactions that it is important to consider. The GCSE performance function we have chosen fitted the data well in all four years. Over this period, however, many establishments reviewed their policies at GCSE, partly in response to the publication of results in league tables. This will have led to patterns of change over time in entry rates, subjects entered, and grades obtained, and hence in the coefficients of the GCSE performance function. The first interaction, Year×GCSE, allows for these changes. The second, Estab×Year, shows the mean changes in the effect of establishment type over time. These are statistically significant for the independent and FE sectors, and it is important to include them since they may reflect systematic changes that we have not measured, for example in levels of resources.

Figure 5 about here

The Gender×GCSE interaction shows that the effect of gender depends upon the GCSE performance of the candidate, and in a complex way. Figure 5 illustrates this dependence, for candidates taking 9 GCSEs. It shows that the mean disadvantage to women increases with increasing GCSE mean score, so that women obtaining grade B on average at GCSE fall behind men with a similar GCSE performance by approximately 0.1 on the normalised A/AS scale. Women obtaining grades A or A* for all of their GCSEs fall behind men by about 0.22, or more than a full grade at A level. The effect of the interaction terms involving GCSE number taken (not shown in the diagram) is to increase the disadvantage for women students taking more than 9 GCSEs. Whatever the underlying reasons for these findings, it is clear from Table VI that women with average or above-average GCSE mean scores tend overall to enter for fewer A/AS levels, and to obtain lower total scores, than men with equivalent GCSE scores. This tendency increases as the GCSE mean score increases above about 5.5, as does the ratio of women to men in each mean-score group.

Table VI about here

The Age×GCSE interaction in Table V shows that the effect of age also depends on prior performance at GCSE. The younger students in a cohort tend to progress further between GCSE and A/AS level than the older students, and this tendency becomes more marked at the higher end of the performance range. The effect is slightly less for women than for men with the same GCSE performance and age difference, as is shown by the interaction Gender×Age. There are also interactions between age and GCSE number taken but these are relatively small for nearly all students.

We turn now to the effects of establishment type. In Model 3 (Table IV), it appeared that selective schools (LEA-maintained, grant maintained, and independent) had positive effects on progress when compared with maintained comprehensive schools, and modern schools had negative effects. We note, however, that establishments differ in the average GCSE performance of their students, and in how much their student GCSE scores vary. Students in selective schools, on the whole, have higher GCSE mean scores and vary less on this measure than students in comprehensive schools. Comprehensive schools differ amongst themselves in these respects, and their distribution of student mean scores overlaps that of selective schools at one end of the performance range and modern schools at the other (see Table VII). These characteristics of the overall performance of the students in an establishment provide a measure of the academic composition of that establishment, and this has been found (DFE, 1995, and O’Donoghue et al, 1997) to have an effect on individual progress. When considering the possible effects of different establishment types, therefore, it is necessary to allow also for these compositional effects.

Table VII about here

In Model 4 (Table V), we have introduced as additional predictors

• the mean for each establishment of its students’ GCSE mean scores,

• the mean for each establishment of its students’ GCSE numbers taken, and

• the standard deviation for each establishment of its students’ GCSE mean scores.

We call these the establishment mean GCSE score, the establishment mean GCSE number, and the establishment GCSE standard deviation, respectively. The effects of the first and the third are statistically significant and positive. We have explored interactions between these effects and establishment type and found two, between FE college type and the establishment mean GCSE score and between Independent selective type and the establishment mean GCSE number. These are included in the model. There is also a small positive interaction between the student’s individual GCSE mean score and the establishment mean GCSE score. A student with a high
GCSE mean score is predicted to benefit slightly more from a high establishment mean than a student with a low mean score.
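Deriving these compositional predictors is a matter of aggregating the student-level GCSE measures within each establishment and attaching the results to every student record. A brief sketch, with hypothetical column names, follows.

    import pandas as pd

    def add_compositional_measures(df):
        """Attach establishment-level GCSE aggregates to each student record.

        df is assumed to have columns 'estab', 'gcse_mean' (student GCSE mean
        score) and 'gcse_n' (number of GCSEs taken).
        """
        by_estab = (df.groupby("estab")
                      .agg(estab_mean_ga=("gcse_mean", "mean"),  # establishment mean GCSE score
                           estab_mean_gn=("gcse_n", "mean"),     # establishment mean GCSE number
                           estab_sd_ga=("gcse_mean", "std"))     # establishment GCSE standard deviation
                      .reset_index())
        return df.merge(by_estab, on="estab", how="left")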

Figure 6 about here

Although the effect of the establishment GCSE standard deviation is statistically significantly different from zero, its estimated value is small and in our illustrations we shall ignore it. The effect of the establishment mean GCSE score is more substantial, and is shown in Figure 6. To illustrate the effect, consider two students at the median of the overall distribution of student GCSE mean scores, in different maintained comprehensive schools A and B, and suppose that school A has a GCSE mean score of 5.0 and school B a mean score of 5.8. As Table VII shows, A is at about the tenth centile for comprehensive schools and B is at about the 85th: clearly different kinds of school, but neither of them extreme for comprehensives. The schools differ by 0.8 points in their mean GCSE scores, and from Figure 6 we see that the students’ outcome scores will differ on average by approximately 0.15, or about three-quarters of a grade at A level.

Now consider two similar students in FE colleges C and D, where C has a mean score of 4.6 (at about the 10th centile for FE colleges) and D a mean score of 5.4 (at about the 80th). The colleges differ by 0.8 points, and Figure 6 shows that the students’ outcome scores will differ on average by about 0.33 points: more than 1½ grades at A level. These are comparatively large effects: it is clear that the academic composition of an establishment, as estimated by the mean GCSE score of its A/AS candidates, is an important predictor of a student’s outcome score.
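These illustrative differences can be recovered, approximately, from the fixed coefficients in Table V: the establishment mean GCSE score term (0.1678), its interaction with FE college type (0.2175), and its interaction with the student's own GCSE mean score (0.0295), evaluated for a student at the overall median GCSE mean score of 5.75 (GA = 0.75 on the centred scale). The short calculation below is ours, not the authors'.

    # Effect, on the normalised A/AS scale, of a 0.8-point difference in
    # establishment mean GCSE score, for a student with GA = 0.75.
    b_estab_mean = 0.1678     # establishment mean GA (Table V)
    b_fe_x_mean = 0.2175      # FE college x establishment mean GA
    b_ga_x_mean = 0.0295      # student GA x establishment mean GA
    ga_student = 0.75         # median GCSE mean score 5.75, centred at 5

    gap = 0.8
    slope_comprehensive = b_estab_mean + b_ga_x_mean * ga_student
    slope_fe_college = slope_comprehensive + b_fe_x_mean

    print(round(gap * slope_comprehensive, 2))   # about 0.15 (schools A and B)
    print(round(gap * slope_fe_college, 2))      # about 0.33 (colleges C and D)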

We are now in a position to compare the effects of establishment type, after allowing for the compositional effect in each establishment. Table VIII shows these, for men at the median of GCSE mean score and number taken, as contrasts with maintained comprehensive schools.

Table VIII about here

The relative effect of FE colleges depends on the mean GCSE score in the college: 5.31 is the upper quartile of the FE college distribution. As their mean GCSE score increases, the FE colleges move closer to the corresponding comprehensive schools. Once adjustment is made for academic composition, the only establishment types that have sizeable positive effects over the comprehensive schools are the independent selective schools and the sixth-form colleges. The effects of these are similar.

Figure 7 about here

Figure 7 summarises the effects of each establishment type for students of differing prior performance, predicted for men in 1994. Students in independent selective schools on average score more highly at A/AS level than similar students in maintained comprehensives, but by an amount that shrinks as the student’s GCSE mean score increases. Students in FE colleges or in maintained modern schools score on average less highly, by an amount that increases as the GCSE mean score increases. The effects of independent non-selective, maintained selective, GM selective, and GM modern schools are statistically non-significant throughout the range of prior performance.

Comparisons for women differ for four of the establishment types: GM selectives, independent selectives, sixth-form colleges, and FE colleges (see Table V). These interactions did not change significantly over time, and did not depend further on age or GCSE performance. Thus, for example, in comparing FE colleges and maintained comprehensives, the graph for women is parallel to that for men, but raised by an amount 0.036. As we mentioned earlier, we detected some changes over time in the average effects (for both men and women) of independent schools and FE colleges: in particular, the relative performance at A/AS level of FE colleges, as a group, declined over the period by slightly less than half a grade at A level.

The FE colleges and the modern schools include establishments that concentrate their efforts on vocational qualifications rather than A/AS levels. Vocational qualifications are not included in our analysis, and for those establishments that concentrate on them it is not appropriate to interpret the differences in Table VIII and Figure 7 as differences in relative effectiveness. In fact it is possible that comprehensive schools whose establishment mean GCSE score is low also concentrate on other qualifications than A/AS level. Without further information it is not possible to compare the effectiveness of these establishments.

6. Random coefficients

The effects of gender, GCSE performance, and time have so far been modelled by means of fixed coefficients. These effects may vary from institution to institution within the same establishment type. To explore this possibility we introduce random coefficients at establishment level. We also improve the model for the variance at LEA and student levels by making it more specific. The full model is given in the appendix.

In Model 3 (Table IV) the variance at LEA level was estimated to be about one-sixth of one per cent of the total residual variance. Since this model of the variance includes all establishments, whether LEA-maintained or not, the resulting estimate may mask true variation between LEAs in the performance of the establishments maintained by them. When the variance at this level is modelled for LEA-maintained schools only it is estimated to be 0.0016 (s.e. 0.0005), or about one-third of one per cent of the total residual variance. We found no dependence on GCSE performance or gender, and no time trend.

We also tested the variance at student level for dependence on GCSE performance and time, and found none. In line with previous research, we found that men varied more about their predictions than women throughout the range of GCSE mean scores. Between-student within-establishment variance for men was 0.43 (s.e. 0.001) and for women 0.37 (s.e. 0.001).

The effects of gender, GCSE performance, and time varied significantly between establishments, after taking account of the fixed effects summarised in Table V. To capture this, we define a ‘residual function’ whose coefficients are the establishment residuals. There are several possible functions and we describe in the appendix what we believe is the most straightforward one, given the data. The resulting residual covariance matrix at establishment level is shown in Table IX.

Table IX about here

Note that in this model there is a specific term for each year (after the first). This is consistent with regarding the cohorts as fixed rather than random, and accordingly we omit the cohort level of variance entirely. Using this model we may readily estimate year-on-year correlations between the establishment residuals. For women at the median of the overall distribution of GCSE mean scores these are shown in Table X. The figures for men were similar.
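The entries of Table X can be recovered from the covariance matrix in Table IX: for a woman at the median GCSE mean score of 5.75 (GA = 0.75 on the centred scale) the establishment residual in each year is a fixed linear combination of the seven random coefficients, so the year-on-year correlations follow from the usual quadratic forms. The sketch below is an illustration of that calculation using the published point estimates.

    import numpy as np

    # Table IX covariance matrix (symmetric), in the order
    # Intercept, Female, GA, GA^2, Year95, Year96, Year97.
    S = np.array([
        [ 0.0351, -0.0030,  0.0017, -0.0023, -0.0080, -0.0158, -0.0180],
        [-0.0030,  0.0071, -0.0018, -0.0008,  0.0006,  0.0016,  0.0012],
        [ 0.0017, -0.0018,  0.0045, -0.0001, -0.0006, -0.0013, -0.0021],
        [-0.0023, -0.0008, -0.0001,  0.0012, -0.0004, -0.0004, -0.0003],
        [-0.0080,  0.0006, -0.0006, -0.0004,  0.0146,  0.0111,  0.0107],
        [-0.0158,  0.0016, -0.0013, -0.0004,  0.0111,  0.0275,  0.0218],
        [-0.0180,  0.0012, -0.0021, -0.0003,  0.0107,  0.0218,  0.0302],
    ])

    ga = 0.75                                  # woman at the median GCSE mean score
    base = [1.0, 1.0, ga, ga**2]               # intercept, Female, GA, GA^2
    years = ([0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1])   # 1994 is the base year
    X = np.array([base + list(y) for y in years])

    V = X @ S @ X.T                            # covariances of residuals across years
    corr = V / np.sqrt(np.outer(np.diag(V), np.diag(V)))
    print(np.round(corr, 2))   # off-diagonal entries agree with Table X to within rounding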

Table X about here

We can also obtain estimates of the residuals at each level. Suitably transformed, these can be used either to check model fit or to compare units. As far as model fit is concerned, we found the residuals at all levels to be distributed satisfactorily. When comparing units, for example establishments, the value of the residual function for a given unit is frequently used as a measure of its ‘effect’. On this interpretation, we have confirmed that establishments are differentially effective according to the gender and prior attainment of the student. Correlations between establishment residuals across two or more years are quite modest – of the order of 0.6 for students of median prior performance. O’Donoghue et al. (1997), using data for 1993–5 and a simpler model, estimated the overall correlation between establishment residuals in 1993 and 1995 to be 0.75. These correlations can be used to judge the stability of establishment effects, and such judgements are sensitive to the model used.

7. Summary and discussion

What we have presented is an analysis based on overall scores. In practice, at both GCSE and A/AS level, candidates enter varying numbers and combinations of examinations. The usual scoring systems, which we have used with a minor alteration, are an attempt to extract simple measures from this complication. Such measures are only approximate indicators of educational attainment and, given the choice of subjects that students have, the same scores may mean different things for different students. Nevertheless, as we pointed out in Section 2, the total score at A/AS level is important for both students and establishments.

Our aim, broadly speaking, has been to model the progress made by students in the two years from age 16 to age 18. Since the available data relate to students who entered at least one A/AS level at age 18, we have nothing to say about students who are not in this category. Nor do we have information on the other achievements, for example General National Vocational Qualifications (GNVQ), of the students who are represented in the data. Thus, the response variable we use is incomplete as a
measure of attainment for some students, who may be concentrated in particular establishments. We do not know the establishments in which the students obtained their GCSEs, nor do we have data on the socio-economic characteristics of the students, their mobility, or the social environment of the establishments. All of these are known to have effects on examination outcomes (see, for example, Goldstein and Sammons, 1997, and van den Eeden et al., 1990).

Within these limitations our findings, based on the estimates in Tables V and IX, may be summarised as follows:

1. Women with grades at GCSE averaging C or above tended to enter for fewer A/AS levels than similarly qualified men, and to obtain lower overall scores. This tendency increased as GCSE scores increased. It appeared to be stable over time. See also Table VI and Figure 5.

2. Older students tended to make less progress from GCSE to A/AS level than younger students. This tendency also increased as GCSE scores increased and was stable over time.

3. Students in establishments where the average attainment at GCSE was high tended to make better overall progress to A/AS level than those in other establishments of the same type. See Figure 6.

4. Once establishments were matched according to the average grades at GCSE of all their A/AS-level candidates, there was no statistically significant difference between LEA-maintained grammar schools, GM grammar schools, and LEA-maintained comprehensive schools in the average progress of their students. See Table VIII and Figure 7.

5. Students in the North of England, specifically in the Northern region, the North Western region, and Yorkshire and Humberside, tended to score less highly at A/AS level than students in the rest of England whose performance at GCSE was similar.

6. Establishments were differentially effective for students of different gender and prior performance.

7. Establishment effects (residuals) varied from year to year, with only modest correlation two or more years apart.

8. Once all the above effects were allowed for, there was little remaining variation between Local Education Authorities.

Findings 1 and 2, the individual gender and age effects, confirm the findings of O’Donoghue et al. (1997), who used data from 1993–5 and a somewhat different model. In addition, we have found that the relative disadvantage of women increased with the number of GCSEs that they entered, and the age effect for women was somewhat less than that for men throughout the range of GCSE performance. These effects were effectively stable over time. There were some interactions between gender and establishment type, but these were small and did not depend further upon GCSE performance. O’Donoghue et al. (1997), with additional information on whether schools were single-sex or co-educational, found this made no difference to progress once the selectivity of the establishment was taken into account. Thus, the overall shape of the relationship between gender, GCSE performance, and progress illustrated in Figure 5 appears to persist across time, age of student, and type of establishment. Further work is needed, in particular to explore individual subject choices and performance in different subjects.

The next finding, that the academic composition of the establishment, as measured by the average GCSE performance of all its A/AS-level candidates, has a substantial effect upon the progress of students, confirms that of O’Donoghue et al. (1997) and DFE (1995). In addition, we have found that abler students benefited more from a favourable composition than less able students (as measured by their average GCSE scores). Once these effects are taken into account, many of the differences in student progress that appear to be associated with establishment type cease to be detectable.

This finding is the most important of the present study. Table VIII and Figure 7 provide more detail about the relationships. The establishment measures that we used were derived from data relating only to A/AS-level candidates and it may be argued that they are incomplete as measures of academic composition. But for those schools and colleges whose main ‘mission’ is to prepare candidates for A/AS levels (the term
used by DFE, 1995, paragraph 51), including all selective schools, comprehensive schools with comparable intakes, and sixth-form colleges, this objection has less force. In practice, the majority of the students in these establishments would have entered at least one A/AS level, and they would be less likely to have taken a vocational qualification in addition. We believe, therefore, that within the limits of a study of overall effects both the response variable and the intake measures used in our model justify our main conclusions for the above types of establishment. These may be briefly stated as follows. Students in maintained comprehensive schools made similar progress to those in state selective schools, both LEA and grant maintained, when these were matched according to the average performance at GCSE of all their candidates. Those in sixth-form colleges, similarly matched for intake, made better progress, amounting to about three-quarters of a grade at A level for men and a full grade for women. Students of average and above-average GCSE performance made similar progress in independent selective schools and sixth-form colleges, with some variation over time in the independent schools. Students with lower performance at GCSE made better progress in the independent schools.

At present we have no explanation that we can offer with any confidence for our finding number 5, an apparent regional effect, which persists through all of our models that include GCSE performance. It may be related to the examining boards used for A/AS level or GCSE (we do not have information on the latter), and it may affect some subjects more than others. This is an area for further study.

Finding number 6 confirms earlier findings of the ‘differential effectiveness’ of individual establishments according to the gender and prior performance of students. Different models can lead to different estimates of individual establishment effects, and this is the subject of continuing research. In particular, our model leads to lower correlations between these effects over time than were found by O’Donoghue et al. (1997). Such model sensitivity has implications for the use of residuals to study institutional ‘improvement’. The residual variance at LEA level remains small in all models.

We explored the effect of removing cases with GCSE mean scores of 3 or less, or GCSE number taken less than 4 or more than 12 (5,492 cases altogether). This allowed a slight simplification of the GCSE performance function (the removal of one term), and some simplifications to the interactions with it. The remaining parameters of the model were only marginally affected if they were already statistically significant. The main findings listed above are not materially affected by the removal of such untypical cases. This does not mean, however, that individual establishment effects (residuals) would not be affected.

We have shown that apparently simple measures of intake and outcome have a complex relationship, which is sensitive to the precise model used in both the fixed and the random parts. The findings of our study are relevant to any attempt to rank individual establishments using value added procedures.

References

DFE (1995). The development of a national framework for estimating value added at GCE A/AS level, Technical annex to GCSE to GCE A/AS value added: Briefing for schools and colleges. London: DfEE.
GOLDSTEIN, H. (1995). Multilevel statistical models (2nd ed.). London: Edward Arnold.
GOLDSTEIN, H. (1997). Methods in school effectiveness research, School Effectiveness and School Improvement, 8(4), pp. 369–95.
GOLDSTEIN, H. AND HEALY, M. (1995). The graphical representation of a collection of means, Journal of the Royal Statistical Society, Series A, 158, pp. 175–7.
GOLDSTEIN, H., RASBASH, J., PLEWIS, I., DRAPER, D., BROWNE, W., YANG, M., WOODHOUSE, G., AND HEALY, M. (1998). A user’s guide to MLwiN. London: Institute of Education.
GOLDSTEIN, H. AND SAMMONS, P. (1997). The influence of secondary and junior schools on sixteen year examination performance: A cross-classified multilevel analysis, School Effectiveness and School Improvement, 8, pp. 219–30.
GOLDSTEIN, H. AND THOMAS, S. (1996). Using examination results as indicators of school and college performance, Journal of the Royal Statistical Society, Series A, 159, pp. 149–63.
GRAY, J., GOLDSTEIN, H., AND JESSON, D. (1996). Changes and improvements in schools’ effectiveness: Trends over five years, Research Papers in Education, 11(1), pp. 35–51.
GRAY, J., JESSON, D., GOLDSTEIN, H., HEDGER, K., AND RASBASH, J. (1995). A multilevel analysis of school improvement: Changes in schools’ performance over time, School Effectiveness and School Improvement, 6(2), pp. 97–114.
O’DONOGHUE, C., THOMAS, S., GOLDSTEIN, H., AND KNIGHT, T. (1997). 1996 DfEE study of value added for 16–18 year olds in England, DfEE research series, March. London: DfEE.
VAN DEN EEDEN, P., DE JONG, U., KOOPMAN, P., AND ROELEVELD, J. (1990). The effect of social status on educational attainment in Amsterdam: school differences and stability, International Journal of Adolescence and Youth, 2, pp. 101–18.

Appendix

Each of the models in Section 4 (see Table IV) may be written in the form

\[
y_{ijkl} = \beta_0 + \sum_{r=1}^{p} \beta_r x_{r,ijkl} + w_l + v_{kl} + u_{jkl} + e_{ijkl},  \qquad (A.1)
\]

where i indexes students (level 1), j indexes cohorts (level 2), k indexes establishments (level 3), and l indexes LEAs (level 4). In this equation \(\beta_0\) is the constant term or ‘intercept’, \(\sum_{r=1}^{p} \beta_r x_{r,ijkl}\) denotes the rest of the fixed part of the model, where the x's are predictor variables and the \(\beta\)'s are coefficients to be estimated, and \(w_l\), \(v_{kl}\), \(u_{jkl}\), \(e_{ijkl}\) are independent residuals at levels 4, 3, 2, 1, respectively, identically distributed for each level, whose variances are to be estimated.

The cohort level of variance has been removed from Model 4 (Tables V and IX). Omitting subscript j and keeping the rest of the subscript scheme in (A.1), we have:

\[
\begin{aligned}
y_{ikl} = \beta_0 + \sum_{r=1}^{p} \beta_r x_{r,ikl} &+ w_l (LEA\_Estab)_{kl} \\
&+ v_{0,kl} + v_{1,kl} (Female)_{ikl} + v_{2,kl} (GA)_{ikl} + v_{3,kl} (GA^2)_{ikl} \\
&+ v_{4,kl} (Year95)_{ikl} + v_{5,kl} (Year96)_{ikl} + v_{6,kl} (Year97)_{ikl} \\
&+ e_{0,ikl} + e_{1,ikl} (Female)_{ikl},  \qquad (A.2)
\end{aligned}
\]

where \((LEA\_Estab)_{kl} = 1\) if the klth establishment is LEA-maintained, and Female, GA, Year95, etc. have the same meanings as in the fixed part of the model. Residuals at different levels are independent, and we estimate the full covariance matrix of the establishment residuals \(v_{r,kl}\), now at level 2. The random part at level 1 is in the form commonly used to estimate separate variances for men and women (see Goldstein et al., 1998, for further details).

TABLES

Table I. Some descriptive statistics of the data for each examination year (standard deviations in parentheses)

                                        1994          1995          1996          1997
No of students included in analysis     164,180       167,583       175,172       189,725
Percentage of women students            53.1%         53.2%         53.1%         53.6%
Mean age of students (years)            18.5 (0.3)    18.5 (0.3)    18.5 (0.3)    18.5 (0.3)
Mean GCSE exams taken per student       9.1 (0.9)     9.2 (1.1)     9.2 (1.0)     9.3 (1.0)
Mean GCSE total score per student       51.2 (9.6)    52.6 (10.6)   52.9 (10.1)   53.7 (10.3)
Mean GCSE mean score per student        5.63 (0.86)   5.67 (0.83)   5.74 (0.79)   5.73 (0.80)
Mean A/AS score per student             13.9 (9.5)    14.4 (9.6)    14.8 (9.7)    14.9 (9.6)

Table II. Numbers and percentages of establishments according to type

                              1994            1995            1996            1997
Number of schools             2,618           2,687           2,700           2,712
Maintained comprehensive      1,149 (43.9%)   1,163 (43.3%)   1,164 (43.1%)   1,171 (43.2%)
Maintained selective          69 (2.6%)       69 (2.6%)       68 (2.5%)       69 (2.5%)
Maintained modern             38 (1.5%)       40 (1.5%)       39 (1.4%)       45 (1.7%)
GM comprehensive              282 (10.8%)     308 (11.2%)     313 (11.6%)     326 (12.0%)
GM selective                  84 (3.2%)       86 (3.2%)       89 (3.3%)       90 (3.3%)
GM modern                     18 (0.69%)      20 (0.74%)      21 (0.78%)      22 (0.81%)
Independent selective         517 (19.7%)     526 (19.6%)     521 (19.3%)     508 (18.7%)
Independent non-selective     55 (2.1%)       63 (2.3%)       69 (2.6%)       65 (2.4%)
6th-form college              110 (4.2%)      110 (4.1%)      111 (4.1%)      109 (4.0%)
FE college                    276 (10.5%)     282 (10.5%)     287 (10.6%)     289 (10.7%)
Maintained, unknown           14 (0.53%)      15 (0.56%)      16 (0.59%)      16 (0.59%)
Independent, unknown          6 (0.23%)       5 (0.19%)       2 (0.07%)       2 (0.07%)

Table III. Overall distribution of GCSE number taken and GCSE mean score (percentages in parentheses)

GCSE number Number of GCSE mean score Number of taken students (%) (lower limit) students (%) 1 1,362 (0.20) 0.0 117 (0.02) 2 457 (0.07) 1.0 233 (0.03) 3 321 (0.05) 2.0 1,350 (0.19) 4 569 (0.08) 3.0 1,041 (0.15) 5 1,188 (0.17) 3.2 1,341 (0.19) 6 3,509 (0.50) 3.4 2,040 (0.29) 7 15,688 (2.25) 3.6 3,680 (0.53) 8 87,336 (12.54) 3.8 3,729 (0.54) 9 343,150 (49.26) 4.0 10,356 (1.49) 10 194,181 (27.87) 4.2 16,574 (2.38) 11 42,462 (6.10) 4.4 23,967 (3.44) 12 5,522 (0.79) 4.6 36,371 (5.22) 13 776 (0.11) 4.8 28,777 (4.13) 14 113 (0.02) 5.0 55,085 (7.91) 15 24 (0.003) 5.2 59,011 (8.47) 16 2 (0.0003) 5.4 59,136 (8.49) Total 696,660 5.6 64,806 (9.30) 5.8 43,245 (6.21) 6.0 64,592 (9.27) 6.2 58,066 (8.33) 6.4 51,953 (7.46) 6.6 49,836 (7.15) 6.8 33,908 (4.87) 7.0 27,446 (3.94) Total 696,660

1st quartile: 5.11 Median: 5.75 3rd quartile: 6.33

Table IV. Estimates for three variance components models, standard errors in parentheses

Variable Model 1 Model 2 Model 3 Fixed coefficients Intercept -0.1686 (0.0172) -0.3287 (0.0448) -0.7479 (0.0163) Year Year95 0.0558 (0.0044) 0.0017 (0.0040) Year96 0.0997 (0.0045) 0.0060 (0.0040) Year97 0.1253 (0.0047) 0.0123 (0.0041) GCSE GA 0.5882 (0.0019) GA2 0.1480 (0.0017) GA3 0.0167 (0.0007) GA4 0.0043 (0.0002) GN 0.0488 (0.0013) GN2 0.0026 (0.0005) GN3 -0.0006 (0.0001) GA.GN 0.0268 (0.0012) GA.GN2 -0.0051 (0.0002) GA2.GN 0.0079 (0.0008) GA2.GN2 -0.0018 (0.0001) Gender Female 0.0800 (0.0023) -0.0703 (0.0017) Age Age (months) 0.0003 (0.0004) -0.0080 (0.0003) Age2 0.0002 (0.0001) 0.0003 (0.0001) Estab Maint. Sel 0.6336 (0.0392) 0.0849 (0.0187) Maint. Mod -0.4519 (0.0525) -0.1116 (0.0268) GM comp 0.0490 (0.0213) 0.0236 (0.0103) GM sel 0.6444 (0.0364) 0.0812 (0.0173) GM mod -0.4915 (0.0761) -0.1626 (0.0393) Ind. Sel 0.6119 (0.0167) 0.1751 (0.0083) Ind. Ns -0.0143 (0.0434) 0.0569 (0.0233) 6th-form college 0.1466 (0.0298) 0.1114 (0.0143) FE college -0.5285 (0.0202) -0.2160 (0.0101) Maint. Unknown 0.1352 (0.0750) 0.0556 (0.0373) Ind. Unknown -0.2913 (0.1785) -0.0356 (0.1030) Region Yorks&Humberside -0.0464 (0.0597) 0.0287 (0.0206) North Western -0.0520 (0.0570) -0.0129 (0.0203) East Midlands -0.0161 (0.0671) 0.1066 (0.0220) West Midlands -0.0980 (0.0587) 0.0516 (0.0202) East Anglia 0.0285 (0.0816) 0.1171 (0.0268) Greater London -0.0695 (0.0517) 0.1034 (0.0186) Other South-East -0.0091 (0.0555) 0.0935 (0.0188) South Western 0.0372 (0.0623) 0.0922 (0.0209)

Residual variation Between LEAs 0.0196 (0.0042) 0.0111 (0.0023) 0.0007 (0.0002) Between estab. 0.2208 (0.0058) 0.0995 (0.0028) 0.0197 (0.0007) Between cohorts 0.0123 (0.0004) 0.0096 (0.0004) 0.0108 (0.0003) Between students 0.7612 (0.0013) 0.7600 (0.0013) 0.4060 (0.0007)

Table V. Fixed parameter estimates for Model 4, standard errors in parentheses

Variable Estimate Variable Estimate Main effects Gender x GCSE Intercept -0.8454 (0.0312) Female x GA -0.0467 (0.0037) Year Female x GA2 -0.0174 (0.0020) Year95 0.0001 (0.0044) Female x GA3 -0.0034 (0.0010) Year96 -0.0180 (0.0051) Female x GN -0.0128 (0.0020) Year97 0.0172 (0.0054) Female x GN2 0.0008 (0.0005) GCSE Female x GA.GN -0.0065 (0.0016) GA 0.5836 (0.0049) Age x GCSE 2 GA 0.1489 (0.0025) Age x GA -0.0042 (0.0005) 3 GA 0.0181 (0.0010) Age x GA2 0.0007 (0.0002) 4 GA 0.0051 (0.0003) Age x GA3 0.0006 (0.0001) GN 0.0546 (0.0026) Age x GN -0.0009 (0.0002) 2 GN 0.0042 (0.0006) Age x GN2 0.0004 (0.0001) 3 GN -0.0005 (0.0001) Gender x Age GA.GN 0.0294 (0.0016) Female x Age 0.0020 (0.0004) 2 GA.GN -0.0043 (0.0002) Estab x GCSE 2 GA .GN 0.0062 (0.0009) Maint. mod x GA -0.0384 (0.0192) 2 2 GA .GN -0.0020 (0.0001) GM sel x GN 0.0167 (0.0055) Gender Ind. sel x GA -0.0626 (0.0063) Female -0.0263 (0.0037) Ind. sel x GN 0.0089 (0.0026) Age Ind. ns x GA -0.0898 (0.0169) Age -0.0081 (0.0004) Ind. ns x GN 0.0198 (0.0061) Age2 0.0003 (0.0001) FE x GA -0.0541 (0.0062) Estab FE x GA2 -0.0105 (0.0035) Maint. sel -0.0228 (0.0223) FE x GN -0.0059 (0.0024) Maint. mod -0.0420 (0.0301) FE x GN2 -0.0021 (0.0006) GM comp 0.0217 (0.0107) Estab x Gender GM sel -0.0124 (0.0206) GM sel x Female -0.0361 (0.0177) GM mod -0.0519 (0.0388) Ind. sel x Female 0.0227 (0.0088) Ind. sel 0.1466 (0.0133) 6th-form x Female 0.0205 (0.0089) Ind. ns 0.0894 (0.0272) FE x Female 0.0362 (0.0082) 6th-form college 0.0752 (0.0154) Estab x Year FE college -0.1358 (0.0166) Ind. sel x Year95 -0.0244 (0.0080) Unknown 0.0383 (0.0347) Ind. sel x Year97 -0.0571 (0.0086) Region Ind. ns x Year96 0.1098 (0.0302) North -0.0983 (0.0081) FE x Year95 -0.0320 (0.0114) Composition FE x Year96 -0.0861 (0.0138) Estab mean GA 0.1678 (0.0122) FE x Year97 -0.0884 (0.0142) Estab mean GN 0.0045 (0.0066) Estab x Composition Estab SD of GA 0.1162 (0.0355) Ind. sel x Estab mean GN 0.0438 (0.0115) Interactions FE x Estab mean GA 0.2175 (0.0273) Year x GCSE GCSE x Composition Year95 x GA 0.0217 (0.0030) GA x Estab mean GA 0.0295 (0.0057) Year95 x GN -0.0146 (0.0025) Year96 x GA 0.0207 (0.0048) Year96 x GA2 0.0188 (0.0023) Year96 x GA3 0.0028 (0.0012) Year96 x GN 0.0063 (0.0028) Year96 x GN2 -0.0013 (0.0006) Year97 x GA 0.0119 (0.0046) Year97 x GA2 0.0055 (0.0022) Year97 x GA3 0.0048 (0.0011) Year97 x GN 0.0059 (0.0027) Year97 x GN2 -0.0012 (0.0006)

Table VI. Gender differences (Women–Men) in the mean A/AS total score and mean A/AS number taken according to students' GCSE mean score, with the percentage of women in each group

GCSE mean score   Difference in A/AS total score   Difference in A/AS number taken   % Women
0.0-0.499          1.49                             0.20                             52.6
0.5-1.499          1.80                             0.01                             32.6
1.5-2.499          1.62                             0.17                             43.8
2.5-3.499          0.30                            -0.04                             43.5
3.5-4.499         -0.09                            -0.09                             45.1
4.5-5.499         -0.38                            -0.04                             49.4
5.5-6.499         -1.20                            -0.09                             55.3
6.5-7.0           -2.20                            -0.14                             58.2

Table VII. Distributions of establishment mean GCSE score by establishment type (column percentages in parentheses)

Score Maint.comp Maint.sel Maint.mod GM comp GM sel GM mod Ind.sel Ind.ns 6th-form FE Unknown 2.2- 1 (0.08) 2.4- 0 2.6- 0 2.8- 0 3.0- 0 3.2- 0 1 (1.32) 3.4- 0 1 (0.19) 1 (1.32) 2 (0.66) 3.6- 2 (0.17) 1 (4.35) 0 0 1 (0.33) 3.8- 0 0 0 0 3 (1.00) 4.0- 2 (0.17) 2 (4.08) 1 (4.35) 0 2 (2.64) 6 (1.99) 1 (4.00) 4.2- 4 (0.34) 3 (6.12) 1 (0.31) 2 (8.70) 2 (0.38) 1 (1.32) 6 (1.99) 0 4.4- 11 (0.93) 5 (10.2) 4 (1.22) 3 (13.0) 2 (0.38) 3 (3.95) 12 (3.99) 1 (4.00) 4.6- 30 (2.53) 7 (14.3) 10 (3.06) 3 (13.0) 4 (0.74) 5 (6.58) 25 (8.31) 2 (8.00) 4.8- 66 (5.57) 8 (16.3) 13 (3.98) 5 (21.7) 4 (0.74) 3 (3.95) 2 (1.80) 57 (18.9) 2 (8.00) 5.0- 115 (9.70) 10 (20.4) 33 (10.0) 2 (8.70) 7 (1.30) 11 (14.5) 5 (4.50) 82 (27.2) 1 (4.00) 5.2- 208 (17.6) 8 (16.3) 63 (19.3) 4 (17.4) 26 (4.83) 9 (11.8) 10 (9.01) 48 (15.9) 2 (8.00) 5.4- 297 (25.1) 6 (12.2) 80 (24.5) 3 (3.33) 1 (4.35) 35 (6.51) 9 (11.8) 39 (35.1) 38 (12.6) 6 (24.0) 5.6- 276 (23.3) 2 (2.90) 79 (24.2) 10 (11.1) 1 (4.35) 62 (11.5) 16 (21.1) 40 (36.0) 14 (4.65) 6 (24.0) 5.8- 146 (12.3) 30 (43.5) 36 (11.0) 24 (26.7) 99 (18.4) 7 (9.21) 11 (9.91) 7 (2.33) 2 (8.00) 6.0- 24 (2.03) 17 (24.6) 8 (2.45) 24 (26.7) 117 (21.7) 3 (3.95) 3 (2.70) 1 (4.00) 6.2- 3 (0.25) 16 (23.2) 19 (21.1) 94 (17.5) 3 (3.95) 1 (0.90) 1 (4.00) 6.4- 4 (5.80) 10 (11.1) 64 (11.9) 1 (1.32) 6.6- 18 (3.35) 0 6.8- 2 (0.38) 1 (1.32) 7 1 (0.19) Total 1,185 69 49 327 90 23 538 76 111 301 25

Table VIII. Estimated effects of establishment type, adjusted for academic composition. Effects are for men at the median of GCSE mean score and number taken. Units are point differences on the normalised A/AS scale, compared to maintained comprehensive schools.

Establishment type                Estimated effect (s.e.)   95% confidence interval
Maintained selective              -0.02 (0.02)              (-0.07, +0.03)
Maintained modern                 -0.07 (0.03)              (-0.14, -0.00)
GM comprehensive                  +0.02 (0.01)              (+0.00, +0.05)
GM selective                      -0.01 (0.02)              (-0.06, +0.03)
GM modern                         -0.05 (0.04)              (-0.13, +0.03)
Independent selective             +0.10 (0.01)              (+0.07, +0.13)
Independent non-selective         +0.02 (0.03)              (-0.03, +0.08)
6th-form college                  +0.08 (0.02)              (+0.04, +0.11)
FE college (mean GCSE = 5.31)     -0.11 (0.02)              (-0.15, -0.08)

Table IX. Estimated residual covariance matrix at establishment level for Model 5, standard errors in parentheses

            Intercept          Female             GA                 GA2                Year95             Year96             Year97
Intercept    0.0351 (0.0014)
Female      -0.0030 (0.0006)    0.0071 (0.0005)
GA           0.0017 (0.0005)   -0.0018 (0.0003)    0.0045 (0.0003)
GA2         -0.0023 (0.0003)   -0.0008 (0.0002)   -0.0001 (0.0001)    0.0012 (0.0001)
Year95      -0.0080 (0.0008)    0.0006 (0.0004)   -0.0006 (0.0004)   -0.0004 (0.0002)    0.0146 (0.0008)
Year96      -0.0158 (0.0010)    0.0016 (0.0006)   -0.0013 (0.0004)   -0.0004 (0.0003)    0.0111 (0.0008)    0.0275 (0.0012)
Year97      -0.0180 (0.0011)    0.0012 (0.0006)   -0.0021 (0.0005)   -0.0003 (0.0003)    0.0107 (0.0008)    0.0218 (0.0011)    0.0302 (0.0013)

Table X. Correlation coefficients of establishment residuals between years, for women at the median GCSE mean score

        1994   1995   1996
1995    0.79
1996    0.59   0.70
1997    0.53   0.63   0.77

FIGURES

[Figures 1–7 about here. Key to establishment types: 1. Maintained selective; 2. Maintained modern; 3. GM comprehensive; 4. GM selective; 5. GM modern; 6. Independent selective; 7. Independent non-selective; 8. Sixth-form college; 9. FE college.]

Description of ‘A-level-total’

The data relate to the results of examinations at A/AS level taken in the years 1994, 1995, 1996 and 1997 by students attaining the age of 18 in the academic year of the examination and attending an educational establishment in England. The dataset comprises 696,660 students from 2,794 establishments in 120 Local Education Authorities (LEAs).

There are 15 fields in the dataset, which is in ASCII format; each field is separated by a blank space. The fields are described in detail below.

The data contain no missing values.

Format Variable Coding Description
1 – 6 STUDENT 12 to 196043 Student identification

8 GIRL 0 or 1 0 for males, 1 for females

10 REGION Category 1 ~ 9 1 = Northern, 2 = Yorkshire, 3 = North Western, 4 = East Midlands, 5 = West Midlands, 6 =East Anglia, 7 = Greater London, 8 = Other South East, 9 = South Western

12 – 13 GTOT 0 ~ 99 continuous Total score on all GCSE subjects taken by the student

15 – 16 GN -8 ~ 7 continuous Total Number of GCSE subjects taken by individual student, centred at 9

18 – 19 ATOT 0 ~ 58 continuous Total score on all A/AS level subjects taken by the student

21 – 22 AGE -7 ~ 5 continuous Age of student in months, centred at 222 months (18.5 years)

24 – 25 SCHTYPE Category 1 ~ 11 Establishment type:

1 = LEA Maintained Comprehensive, 2 = Maintained Selective, 3 = Maintained Modern, 4 = Grant-maintained (GM) Comprehensive, 5 = GM Selective, 6 = GM Modern, 7 = Independent selective, 8 = Independent non-selective, 9 = Sixth Form College, 10 = Further Education College, 11 = Others

27 COHORT Category 1 ~ 4 Cohort of students:

1 = 1994, 2 = 1995, 3 = 1996, 4 = 1997

29 CONS 1 Vector of ones specifying the constant (intercept) term in models

31 – 32 GA -5 ~ 2 continuous GCSE average score of the student, centred at 5

33 – 38 NORATOT -1.86 ~ 4.82 continuous Normal score of A/AS level total score of student

40 – 45 NORGTOT -3.86 ~ 4.68 continuous Normal score of GCSE total score of student

47 – 50 SCHOOL 1 ~ 2794 School identification

52 – 54 LEA 1 ~ 739 Local Education Authority identification
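Because the fields are blank-separated, the file can be read directly; the sketch below assumes a file name 'A-level-total.dat' (the actual file name is not given here) and the field order listed above.

    import pandas as pd

    COLUMNS = ["STUDENT", "GIRL", "REGION", "GTOT", "GN", "ATOT", "AGE",
               "SCHTYPE", "COHORT", "CONS", "GA", "NORATOT", "NORGTOT",
               "SCHOOL", "LEA"]

    # Plain ASCII, blank-separated, with no header row and no missing values.
    alevel_total = pd.read_csv("A-level-total.dat", sep=r"\s+",
                               header=None, names=COLUMNS)

    # For example, students per establishment and cohort:
    # alevel_total.groupby(["SCHOOL", "COHORT"]).size()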

References

Yang, M. and Woodhouse, G. (2001). Progress from GCSE to A and AS level: institutional and gender differences, and trends over time. British Journal of Educational Research, Vol. 27.

O’Donoghue, C., Thomas, S., Goldstein, H., and Knight, T. (1997). 1996 DfEE study of value added for 16–18 year olds in England, DfEE research series, March. London: DfEE.

Goldstein, H. and Thomas, S. (1996). Using examination results as indicators of school and college performance, Journal of the Royal Statistical Society, Series A, Vol. 159, 149 - 63.

DFE (1995). The development of a national framework for estimating value added at GCE A/AS level, Technical annex to ‘GCSE to GCE A/AS value added: briefing for schools and colleges’. London: DfEE.

Description of ‘A-level-chemistry’

The data contain the results of the A-level Chemistry examination for 31,022 students from over 2,000 institutions in England in 1997. There are 10 fields in the dataset, which is in ASCII format; each field is separated by a blank space. The fields are described in detail below.

The data contain no missing values.

Format Variable Coding Description
1 BOARD 1,2,3,5,6,7,8 Exam board: 1=Associate, 2=Cambridge, 3=London, 5=Oxford, 6=Joint Matriculation, 7=Oxford-Cambridge, 8=WJB

3 – 4 A-SCORE 0, 2, 4, 6, 8, 10 Point score for A-level chemistry: 0=fail, 2=grade E, 4=grade D, 6=grade C, 8=grade B, 10=grade A

6 – 8 GTOT 0 ~ 103 continuous Total point score of all GCSE subjects

10 – 11 GNUM 1 ~14 continuous Total number of GCSE taken

13 GENDER 0,1 0=Male, 1=Female

15 – 16 AGE continuous Age of student in months, centred at 222 months (18.5 years)

18 – 19 INSTTYPE Category 1 ~ 11 Establishment type:

1 = LEA Maintained Comprehensive, 2 = Maintained Selective, 3 = Maintained Modern, 4 = Grant-maintained (GM) Comprehensive, 5 = GM Selective, 6 = GM Modern, 7 = Independent selective, 8 = Independent non-selective, 9 = Sixth Form College, 10 = Further Education College, 11 = Others

21 – 23 LEA 1 ~ 131 Local Education Authority identification

25 – 27 INSTITUTE 1 ~ 100 Institution identification within LEA

29 - 34 STUDENT 23 ~ 196035 Student identification

Reference

Yang, M., Fielding, A. and Goldstein, H. Multilevel ordinal models for examination grades, submitted to Journal of Educational Behavioral Study (December 2000).

Description of ‘A-level-math’

The data contain the results of the A-level Mathematics examination for 61,116 students from over 2,000 institutions in England in 1997. There are 22 fields in the dataset, which is in ASCII format; each field is separated by a blank space. The fields are described in detail below.

Format Variable Coding Description

1 – 3 LEA 1 ~ 131 Local Education Authority identification

5 - 10 STUDENT 23 ~ 196051 Student identification

12 GENDER 0,1 category 0=Male, 1=Female

14 – 15 SADMPOL -9, 1 – 5 category Institution admission policy

-9 missing code, 1 = comprehensive, 2 = selective, 3 = modern, 4 = non-selective, 5 = N/A

17 – 18 SGENDER -9, 1 – 3 category Institution gender

-9 missing code, 1 = boys school, 2 = girls school, 3 = mixed

20 201 0 – 7 category Institution type

0=others, 1=comprehensive, 2=grammar, 3=other maintained, 4=grant maintained, 5=independent, 6=six form college, 7=further education college

22 REGION 1 – 9 category 1 = Northern, 2 = Yorkshire, 3 = North Western, 4 = East Midlands, 5 = West Midlands, 6 =East Anglia, 7 = Greater London, 8 = Other South East, 9 = South Western

24 – 26 GTOT 0 ~ 103 continuous Total point score of all GCSE subjects

28 – 29 GNUM 2 ~13 continuous Total number of GCSE taken

31 GCSE-math-max 0,2,3,4,5,6,7,8 Maximum point score for GCSE math: 0=fail, 2=grade F, 3=grade E, 4=grade D, 5=grade C, 6=grade B, 7=grade A, 8=grade A*

33 GCSE-math-n 1,2,3,4 Total number of GCSE math subjects taken

35 GCSE-eng 0,2,3,4,5,6,7,8 Point score for GCSE English: 0=fail, 2=grade F, 3=grade E, 4=grade D, 5=grade C, 6=grade B, 7=grade A, 8=grade A*

37 SUBLEV 1 or 2 Subject level indicator: 1=A-level, 2 = AS-level

39 BOARD 1 – 7 category Exam board: 1=Associate and WJB, 2=Cambridge, 3=London, 5=Oxford, 6=Joint Matriculation, 7=Oxford-Cambridge

41 A-SCORE 0, 2, 4, 6, 8, 10 Point score for A-level Mathematics: 0=fail, 2=grade E, 4=grade D, 6=grade C, 8=grade B, 10=grade A

43 – 46 LEAPSUB Category Code on math subject course: 2210=Main, 2230=Pure, 2240=Decision/Discrete, 2250=Applied, 2270=Pure & Applied, 2290=Pure & Statistics, 2310=Pure & mechanics, 2330=Further, 2340=Additional, 2510=Statistics

48 – 51 INST-GA-MN continuous Institution average of GCSE score

53 – 56 INST-GA-SD continuous Institution standard deviation of GCSE score

58 – 61 INST-GMA-MN continuous Institution average of GCSE math score

63 – 66 INST-GMA-SD continuous Institution standard deviation of GCSE math score

68 - 70 AGE continuous Age of student in month

72 – 74 INSTITUTE 1 – 107 Institution identification within LEA

Reference

Yang, M., Goldstein, H., Browne, W. and Woodhouse, G. Multivariate multilevel analyses of examination results, submitted to Journal of the Royal Statistical Society, Series A (November 2000); presented at the 5th International Conference for Methodology in Social Sciences, Cologne (2–6 October 2000).

Description of ‘A-level-geography’

The data contain the results of the A-level Geography examination for 33,276 students from over 2,000 institutions in England in 1997. There are 15 fields in the dataset, which is in ASCII format; each field is separated by a blank space. The fields are described in detail below.

Format Variable Coding Description
1 – 2 A-SCORE 0, 2, 4, 6, 8, 10 Point score for A-level geography: 0=fail, 2=grade E, 4=grade D, 6=grade C, 8=grade B, 10=grade A

4 BOARD 1 – 7 category Exam board: 1=Associate and WJB, 2=Cambridge, 3=London, 5=Oxford, 6=Joint Matriculation, 7=Oxford-Cambridge

6 GCSE-G-SCORE 0,2,3,4,5,6,7,8 Point score for GCSE geography: 0=fail, 2=grade F, 3=grade E, 4=grade D, 5=grade C, 6=grade B, 7=grade A, 8=grade A*

8 GENDER 0 or 1 0=Male, 1=Female

10 – 11 GTOT 19 ~ 95 continuous Total point score of all GCSE subjects

13 – 14 GNUM 4 ~13 continuous Total number of GCSE taken

16 GCSE-MA-MAX 0 – 8 category Maximum point score for GCSE math: 0=fail, 2=grade F, 3=grade E, 4=grade D, 5=grade C, 6=grade B, 7=grade A, 8=grade A*

18 GCSE-math-n 1,2,3,4 Total number of GCSE math subjects taken

20 – 21 AGE continuous Age of student in months, centred at 222 months (18.5 years)

23 – 27 INST-GA-MN continuous Institution average of GCSE score, centred at its mean

29 – 32 INST-GA-SD continuous Institution standard deviation of GCSE score

34 – 35 INSTTYPE Category 1 ~ 11 Establishment type:

1 = LEA Maintained Comprehensive, 2 = Maintained Selective, 3 = Maintained Modern, 4 = Grant-maintained (GM) Comprehensive, 5 = GM Selective, 6 = GM Modern, 7 = Independent selective, 8 = Independent non-selective, 9 = Sixth Form College, 10 = Further Education College, 11 = Others

37 – 39 LEA 1 ~ 131 Local Education Authority identification

41 – 42 INSTITUTE 1 ~ 98 Institution identification within LEA

44 – 49 STUDENT 25 ~ 196053 Student identification

Reference

Yang, M., Fielding, A. and Goldstein, H. Multilevel ordinal models for examination grades, submitted to Journal of Educational Behavioral Study (December 2000).