Literature Review of E-assessment Jim Ridgway, Sean McCusker, Daniel Pead

To cite this version:

Jim Ridgway, Sean McCusker, Daniel Pead. Literature Review of E-assessment. 2004. hal-00190440

HAL Id: hal-00190440 https://telearn.archives-ouvertes.fr/hal-00190440 Submitted on 23 Nov 2007

FUTURELAB SERIES

REPORT 10: Literature Review of E-assessment

Jim Ridgway and Sean McCusker, School of Education, University of Durham
Daniel Pead, School of Education, University of Nottingham

CONTENTS:

EXECUTIVE SUMMARY
PURPOSE
SECTION 1 ASSESSMENT DRIVES EDUCATION
SECTION 2 HOW AND WHERE MIGHT ASSESSMENT BE DRIVEN?
SECTION 3 CURRENT DEVELOPMENTS IN E-ASSESSMENT
SECTION 4 OPPORTUNITIES AND CHALLENGES FOR E-ASSESSMENT
GLOSSARY
BIBLIOGRAPHY
APPENDIX: FUNDAMENTALS OF ASSESSMENT

FOREWORD

I have to admit to being someone who for many years has avoided thinking about assessment – it somehow always seemed distant from my interests, divorced from my concerns about how children learn with technologies and, to be honest, just a little less interesting than other things I was working on… In recent years, however, working in the field of education and technology, it has become clear that anyone with an interest in how we create equitable, engaging and relevant education systems needs to think long and hard about assessment. Futurelab’s conference ‘Beyond the Exam’ in November 2003 further highlighted this point, as committed and engaged educators, software and media developers came together to raise a rallying cry for a rethink of our current assessment practices.

What I and many others working in this area have come to realise is that we can’t just ignore assessment, or simply see it as ‘someone else’s job’. Assessment practices shape, possibly more than any other factor, what is taught and how it is taught in schools. At the same time, these assessment practices serve as the focus (perhaps the only focus in this day and age) for a shared societal debate about what we, as a society, think are the core purposes and values of education. If we wish to create an education system that reflects and contributes to the development of our changing world, then we need to ask how we might change assessment practices to achieve this.

The authors of this review provide a compelling argument for the central role of assessment in shaping educational practice. They outline the challenges and opportunities posed by the changing global world around us, and the potential role of technologies in our assessment practices. Both optimistic and practical, the review summarises existing research and emergent practice, and provides a blueprint for thinking about the risks and potential that awaits us in this area.

We look forward to hearing your response to this review.

Keri Facer, Director of Learning Research
Futurelab
[email protected]

EXECUTIVE SUMMARY

“E-assessment must not simply invent new technologies which recycle our current ineffective practices.”
Martin Ripley, QCA, 2004

Assessment is central to educational practice. High-stakes assessments exemplify curriculum ambitions, define what is worth knowing, and drive classroom practices. It is essential to develop systems for assessment which reflect our core educational goals, and which reward students for developing skills and attributes which will be of long-term benefit to them and to society. There is good research evidence to show that well designed assessment systems lead to improved student performance. In contrast, the USA provides some spectacular examples of systems where narrowly focused high-stakes assessment systems produce illusory student gains; this ‘friendly fire’ results at best in lost opportunities, and at worst in damaged students, teachers and communities.

ICT provides a link between learning, teaching and assessment. In school, ICT is used to support learning. Currently, we have bizarre assessment practices where students use ICT tools such as word processors and graphics calculators as an integral part of learning, and are then restricted to paper and pencil when their ‘knowledge’ is assessed.

Assessment systems drive education, but are themselves driven by a number of factors, which sometimes are in conflict. To understand likely developments in assessment, we need to examine some of these drivers of change. Implications of technology, globalisation, the EU, multinational companies, and the need to defend democracy are discussed. All of these influences are drivers for increased uses of ICT in assessment. Many of the developments require the assessment of higher-order thinking. However, there is a constant danger that assessment systems are driven in undesirable ways, where things that are easy to measure are valued more highly than things that are more important to learn (but harder to assess). In order to satisfy educational goals, we need to develop ways to make important things easier to measure - and ICT can help.

All is not well with education. The Tomlinson Report (2004) identifies major problems with current educational provision at ages 14-19 years: there is a plethora of qualifications; too few students engage with education; the drop-out rate is scandalously high; and the most able students are not stretched by their studies. Young people are not being equipped with the generic skills, knowledge and personal attributes they will need in the future. A radical approach to qualifications is suggested which (in our view) can only be introduced if there is a widespread adoption of e-assessment.

The UK government is committed to a bold e-assessment strategy. Components include: ICT support for current paper-based assessment systems; some online, on-demand testing; and the development of radical, ICT-set and assessed tests of ICT capability. Some good progress has been made with these developments.

E-assessment can be justified in a number of ways. It can help avoid the meltdown of current paper-based systems; it can assess valuable life skills; it can be better for users – for example by providing on-demand tests with immediate feedback, and perhaps diagnostic feedback, and more accurate results via adaptive testing; it can help improve the technical quality of tests by improving the reliability of scoring.

E-assessment can support current educational goals. Paper and pencil tests can be made more authentic by allowing students to word process essays, or to use spreadsheets, calculators or computer algebra systems in paper-based examinations. It can support current UK examination processes by using Electronic Data Exchange to smooth communications between schools and examinations authorities; current processes of training markers and recording scores can be improved. Systems where student work is scanned then distributed have advantages over conventional systems in terms of logistics (posting and tracking large volumes of paper, for example), and continuous monitoring can ensure high marker reliability. Current work is pushing boundaries in areas such as text comprehension, and automated analysis of student processes and strategies.

E-assessment can be used to assess ‘new’ educational goals. Interactive displays which show changes in variables over time, microworlds and simulations, interfaces that present complex data in ways that are easy to control, all facilitate the assessment of problem-solving and process skills such as understanding and representing problems, controlling variables, generating and testing hypotheses, and finding rules and relationships. ICT facilitates new representations, which can be powerful aids to learning. Little is known about the cognitive implications of these representations; however, it seems likely that complex ideas (notably in reasoning from evidence of various sorts) will be acquired better and earlier than they are at present, and that the standards of performance demanded of students will rise dramatically. Here, we also explore ways to assess important but ill-defined goals such as the development of metacognitive skills, creativity, communication skills, and the ability to work productively in groups.

A major problem with education policy and practice in England is the separation of ‘academic’ and ‘practical’ subjects. In the worst case, to be able to invent and create something of value is taken to be a sure sign of feeble-mindedness; whereas to opine on the work of others shows towering intellectual power. A diet of academic subjects with no opportunities to act upon the world fails to equip students with ways to deal with their environments; a diet of practical subjects which do not engage higher-order thinking throughout the creative process equips students only to become workers for others. Both streams produce one-handed people, and polarised societies. E-portfolios can provide working environments and assessment frameworks which support project-based work across the curriculum, and can offer an escape from one of the most pernicious historical legacies in education. E-portfolios solve problems of storing student work, and make the activity of documenting the process of creation and reflection relatively easy. Reliable teacher assessment is enabled. There is likely to be extensive use of teacher assessment of those aspects of performance best judged by humans (including extended pieces of work assembled into portfolios), and more extensive use made of on-demand tests of those aspects of performance which can be done easily by computer, or which are done best by computer.

The issue for e-assessment is not if it will happen, but rather, what, when and how it will happen. E-assessment is a stimulus for rethinking the whole curriculum, as well as all current assessment systems. New educational goals continue to emerge, and the process of critical reflection on what is important to learn, and how this might be assessed authentically, needs to be institutionalised into curriculum planning.

E-assessment is certain to play a major role in defining and implementing curriculum change in the UK. There is a strong government commitment to high quality e-assessment, and good initial progress has been made; nevertheless, there is a need to be vigilant that the design of assessment systems is not driven by considerations of cost.

Major challenges of ‘going to scale’ have yet to be faced. A good deal of innovative work is needed, coupled with a grounded approach to system-wide implementation.

PURPOSE

The purpose of this report is:

• to assert the centrality of assessment in education systems
• to identify ‘drivers’ of assessment, and their likely impact on assessment, and thence on education systems
• to describe current, radical plans for increased use of high-stakes e-assessment in the UK
• to describe and exemplify current uses of ICT in assessment
• to explore the potential of new technologies for enhancing current assessment (and pedagogic) practices
• to identify opportunities and to suggest ways forward
• to ‘drip feed’ criteria for good assessment throughout (set out explicitly in an appendix).

This report has been designed to: present key findings on research in assessment; describe current UK government plans, and likely future developments; provide links to interesting examples of e-assessment; offer speculations on possible future developments; and to stimulate a debate on the role of e-assessment in assessment, teaching, and learning.

The key findings and implications of the report are presented within the Executive Summary.

SECTION 1 ASSESSMENT DRIVES EDUCATION

1 ASSESSMENT DRIVES EDUCATION

Assessment is an integral part of being. We all make myriads of assessments in the course of everyday life. Is Jane a good friend? Which Rachel Whiteread do I like best? Does my bum look big in this? The questions we ask, and the referents, give an insight into the way we see ourselves and the world (eg Groucho Marx’s “Please accept my resignation. I don’t want to belong to any club that will accept me as a member”). For aspects of our lives that are goal-directed (getting promoted, going shopping), assessment is essential to progress. To be effective, it is necessary to know something of the intended goal; in well-defined situations, this will be relatively easy, and goals will be specified clearly. In ill-defined situations, such as creative acts, and research, the goals themselves might not be well specified, but the criteria for assessing products and processes may well be.

1.1 ASSESSMENT AND EDUCATION

Assessment is central to the practice of education. For students, good performance on ‘high-stakes’ assessment gives access to further educational opportunities and employment. For teachers and schools, it provides evidence of success as individuals and organisations. Cultures of accountability drive everyone to be ‘instrumental’ – how do I demonstrate success (without compromising my deep values)? Assessment systems provide the ways to measure individual and organisational success, and so can have a profound driving influence on systems they were designed to serve.

There is an intimate association between teaching, learning and assessment, illustrated in Fig 1. Robitaille et al (1993) distinguish three components of the curriculum: the intended curriculum (set out in policy statements), the implemented curriculum (which can only be known by studying classroom practices) and the attained curriculum (which is what students can do at the end of a course of study). The links between these three aspects of the curriculum are not straightforward. The ‘top down’ ambitions of some policy makers are hostages to a number of other factors. The assessment system – tests and scoring guides - provides a far clearer definition of what is to be learned than does any verbal description (and perhaps provides the only clear definition), and so is a far better basis for curriculum planning at classroom level than are grand statements of educational ambitions. Teachers’ values and competences also mediate policy and attainment; however, the assessment system is the most potent driver of classroom practice.

Fig 1: Learning, Assessment, Pedagogy. Adapted from Pellegrino, Chudowski and Glaser (2001)

In the UK, there is a long-standing belief (eg Cockcroft 1982) that assessment systems have a direct effect on curriculum and on classroom practices. In Australia, Barnes, Clarke and Stevens (2000) traced the effects of changing a high-stakes assessment on classroom practice, and claimed evidence for a direct causal link. Mathews (1985) traced the distorting effects on the whole school curriculum of formal examinations for university entrance (now A-levels), introduced when the university sector expanded beyond Cambridge, Durham and Oxford – to accommodate as much as 5% of the population. There was a perceived need for entrance tests to pre-university courses (O-levels) – designed for about 20% of the population - followed by a perceived need to align all certification in the education system (notably O-levels and CSE). This linkage between assessment for university admission and the assessment of low-attaining students had a direct and often damaging impact on courses of study for lower attaining students (Cockcroft 1982).

Ill-conceived assessment can damage educational systems. Klein, Hamilton, McCaffrey and Stecher (2000) present evidence on the ‘Texas Miracle’. Here, scores on a rather narrow test designed by the State of Texas showed very large gains over a period of just four years. This test is used to determine the funding received by individual schools. Unfortunately, scores on a national test which supposedly measured the same sort of student attainment were largely unchanged in the same time interval. So scores on narrow tests can rise, even when underlying student attainment does not. The ‘Texas Miracle’ was used in the election campaign of President Bush, as evidence of his effectiveness as a governor in raising educational standards.

Linn (2000) points to an underhand method sometimes used by incoming superintendents of school districts to show the effectiveness of their leadership. Most commercially available multiple choice tests of educational attainment have a number of ‘parallel forms’, designed to measure the same knowledge and skills in the same way, but with slightly different formats (so ‘12 men take six days, how long will six men take?’ becomes ‘12 men take six days, how long will four men take?’). These tests are designed in such a way that student scores on two parallel forms would be the same (plus or minus measurement error). Test designers do this so that school districts can change the test form every year, in order that tests measure the underlying knowledge and skills, not the ability to memorise the answers to specific questions. Linn (2000) gives an example where an incoming Superintendent decides to use a new test form and also chooses to use this same test form in successive years. The result is a steady increase in student scores simply because of poor test security – students are taught to memorise answers. It appears that the superintendent has worked miracles with student attainment, because scores have gone up so much. However, when students are tested on a new parallel form, and have to work out the answers and not rely on memory, then scores plummet. So the high reputation for increasing student performance is built upon deliberate deceit. This is bad for teachers and students, and bad for public morality.

High-stakes assessment systems define what is rewarded by a culture, and
therefore the knowledge that is valuable. It is unsurprising that high-stakes assessment has a profound effect on both learning and teaching. Decisions about assessment systems are not made in a vacuum; the educational community in the UK (but not universally) is involved in the design of assessment systems, and these decisions are usually grounded in discussions on what is worth knowing, and in the practicalities of teaching different concepts and techniques to students of different ages.

1.2 THE IMPACT OF ASSESSMENT ON ATTAINMENT

An extensive literature review by Black and Wiliam (2002) showed that well designed formative assessment is associated with major gains in student attainment on a wide range of conventional measures of attainment. This result was found across all ages and all subject disciplines. Topping (1998) reviewed the impact of peer assessment between students in higher education on writing, and found large positive effects. A major literature review commissioned by the EPPI Centre (2002) showed that regular summative assessment had a large negative effect on the attainment of low-attaining students, but did little harm to high-attaining students. These studies provide strong evidence that good assessment practices produce large performance gains. These gains are amongst the largest gains found in any educational ‘treatments’. Similarly, poor assessment systems have negative – not neutral – effects on the performance of weak students. It follows that when we consider the introduction of e-assessment, we should be aware that we are working with a very sharp sword.

1.3 ICT AND ASSESSMENT

ICT perturbs the links between learning, teaching and assessment in a number of distinct ways:

1 ICT has changed the ways that research is conducted in most disciplines. Linguists analyse large corpuses of text; geographers use GIS systems; scientists and engineers use modelling packages. Everyone uses word processors, databases and spreadsheets. Students should use contemporary research methods; if they do not, school-based learning will become increasingly irrelevant to understanding developments in knowledge. Assessment should reinforce good curriculum practice. We are approaching a bizarre situation where students use powerful and appropriate tools to support learning and solve problems in class, but are then denied access to these tools when their ‘knowledge’ is assessed.

2 ICT can support educational goals that have been judged to be desirable for a long time, but hard to achieve via conventional teaching methods. In particular, ICT can support the development of higher-order thinking skills such as critiquing, reflection on cognitive processes, and ‘learning to learn’, and can facilitate group work, and engagement with extended projects; ICT competence is itself a (moving) target for assessment.

3 New technologies raise an important set of questions about what is worth learning in an ICT-rich environment; what can be taught, given new pedagogic tools; and how assessment
systems can be designed which put pressure on educational systems to help students achieve these new goals. If we ignore these important questions, we run the risk that e-assessment will be designed on the basis of convenience, with disastrous consequences for educational practice.

1.4 ON THE NATURE OF SUMMATIVE AND FORMATIVE ASSESSMENT

We should distinguish between summative and formative assessment, which are different in conception and function. In principle, it is easy to distinguish between them. Summative assessment takes place at the end of some course of study, and is designed to summarise performance and attainment at the time of testing; high-stakes, end of schooling assessment such as GCSE provides a good example. Formative assessment takes place in mid-course, and is intended to enhance students’ final performance; comments on the first draft of an essay provide an example.

Summative and formative assessments differ on a number of dimensions. These include:

Consequences: summative assessment is often highly significant for the student and teacher, whereas formative assessments need not be.

Exchange value: summative assessments often have a value outside the classroom - for certification, access to further courses, and careers; formative assessment usually has no currency outside a small group.

Audience: summative assessments often have a large audience; the student and teacher, parent, school, employer and educational system. Formative assessments can have a small audience; perhaps just the student and teacher (and parent in younger years).

Mendacity quotient: in summative assessment, students are advised to focus on things they do best and hide areas of ignorance; in formative assessment, it is more sensible for students to focus on things they understand least well.

Agency: summative assessment is often done to students, perhaps without their willing participation. Formative assessment is often actively sought out by the student; good formative feedback depends on student engagement in the process of revision.

Validation methods: summative assessment is often judged in terms of predictive validity - are students who got A grades more likely to get top grades in college (but see Messick 1995)? Formative assessment might be judged in terms of its usefulness in undoing predictive validity – what feedback can we give to students with C grades, so that they perform as well in college as anyone else?

Quality of the assessment: for summative assessment, the assessment method should achieve appropriately high standards of reliability and validity; for formative assessment, ‘reliability and validity’ are negotiable between teacher and student.

Resources required: the nature of summative assessment can be influenced by considerations of cost and time. In
terms of cost, the estimation of the cost of testing is often done very badly, especially in the USA. There, it is common for ‘cost’ to be equated with the money paid for the test and its scoring, not the real cost, which is the opportunity cost, measured in terms of the reduction in time spent learning which has been diverted to useless ‘test prep’. Formative evaluation should be an integral part of the work of teaching, so estimation of cost focuses naturally on opportunity costs – just what is an effective allocation of teaching and learning time to formative evaluation? In terms of time, for summative assessment time is easy to measure (so long as useless ‘test prep’ is counted in); again, formative assessment is an integral part of teaching.

Knowledge and the knowledge community: summative assessment is explicit about what is being assessed, and ideas about the nature of knowledge are shared within a wide community; with formative evaluation, ideas about the nature of knowledge might be negotiated by just two people.

Status of the assessment: in summative assessment, the assessment can be ignored by the student; formative assessment simply isn’t formative assessment unless the student does something with it to improve performance.

Focal domain: it is useful to distinguish between cognitive, social and emotional aspects of performance. Summative assessment commonly focuses on cognitive performance; formative assessment can run wild in the social and affective domains.

Theory dependence: summative assessment rarely rests on theory; formative assessment is likely to be ‘theory-genic’ as participants discuss progress, what is known, how to learn and remember things, and how best to use evidence.

Tool types: summative assessment commonly uses timed written assessments where the structure is specified in advance, and which are scored using a common set of rules. Tests are often designed to discriminate between students, and to put them into a rank order in terms of performance. Formative assessment commonly uses a variety of methods such as portfolios of work, student draft work, student annotations of their work, concept mapping tools, diagnostic interviews and diagnostic tests. Each student is their own referent – comparison with other students may not be useful, and is often harmful to learning.

1.4.1 Reflecting on summative and formative assessment

Despite the differences highlighted here, the two sorts of assessment have many areas of overlap:

• a student can change their study methods on the basis of an end-of-year examination result (summative assessment used for formative purposes)
• summative evaluation of students can provide formative evaluation for teachers, schools and educational systems
• formative assessment always rests on some sort of summative assessment – feedback and discussion must rest
on some assessment of the current state of knowledge
• some summative assessment should include the ability to benefit from formative assessment – learning to learn is an important educational goal, and should be assessed, formally
• summative assessment (eg of student teachers) should include the ability to provide formative assessment.

1.5 SUMMARY OF SECTION 1

Assessment lies at the heart of education. Assessment systems exemplify the goals and values of education systems. High-stakes assessment systems have a direct influence on classroom practices. Any discussion of assessment raises important questions about what is worth knowing, the extent to which such knowledge can be taught, and the best ways to support knowledge acquisition.

Well designed assessment systems are associated with large increases in student performances; frequent testing and reporting of scores damages weaker students. Badly designed high-stakes assessment systems can have strong negative consequences for students, communities and societies.

In this section, we distinguish between summative assessment (assessment of learning) and formative assessment (assessment for learning), and compare their characteristics.

ICT has changed the ways that academic work is done; this should be reflected in the tools used in education for both learning and assessment. Bizarre current practices where ICT is an integral part of learning, but where students are denied access to technology during assessment, must be reformed as a matter of urgency. Skills in ICT are essential for much modern living, and so should be a target for assessment.

SECTION 2 HOW AND WHERE MIGHT ASSESSMENT BE DRIVEN?

2 HOW AND WHERE MIGHT ASSESSMENT BE DRIVEN?

There is a comforting belief that decisions about education and education systems are made within those systems, and that outside agents – notably foreign outside agents – have little or no influence on internal affairs. This has been true in the UK for a long time, but has not been true in countries which (for example) make use of UK examinations to certify students. If we are to explore plausible scenarios about the future impact of ICT on assessment, it is necessary to take account of ‘drivers of change’. Here, we consider technology, globalisation, the rise of mass education, problems of political stability, current government plans, and likely government plans, as drivers of educational change and, in parallel, of likely changes in assessment systems.

2.1 TECHNOLOGY AS DRIVER OF SOCIAL CHANGE

Technology is a key driver of social change. Technology has transformed the ways we work, our leisure activities, and the ways we interact with each other. The use of the web is growing at an extraordinary rate, and people increasingly have access to rich sources of information. Metcalfe’s law states that the value of a network rises dramatically as more people join in – its value doesn’t just increase steadily. The capability of computer hardware and software continues to improve, and features are being added (such as high quality video) which make computer use increasingly attractive, and well suited to supporting human-human interactions. The web is an increasingly valuable resource which is becoming progressively easier to use, and is attracting users at an increasing rate. Technology is ubiquitous: as well as computers in the form of desktops and laptops, there has been an explosion of distributed computer power in the form of mobile phones which are also fully functioning personal digital assistants (PDAs), containing features such as a spreadsheet, database and word processor. It has been estimated that there are over three billion mobile phones worldwide (Bennett 2002); as before, this number is growing very fast, and new phones are manufactured with an increasing range of features.

Technology as a driver has a number of likely effects on assessment. New skills (and so new assessments) are needed for work and social functioning, which require fluent use of ICT; technology has had a profound effect on many labour intensive work practices, many of which resemble educational assessment. The use of ICT for assessment has hardly begun, and some new technologies such as mobile phones offer great promise not only because of their ubiquity (which might solve a current problem of access which has restricted widespread use of ICT in assessment in the past), but also because new technologies have become a natural form of communication for very many young people.

2.2 GLOBALISATION

Globalisation is probably the most obvious driver of change. Significant features for the current discussion are: the mobility of capital, employment opportunities (jobs), and people. Cooperation between countries (eg in the European Union), and the pervasive influence of multinational companies also have profound social effects.


The mobility of capital and jobs has changed the profile of the job market, with new kinds of jobs being created (eg in ICT) and old ones disappearing (eg in manufacturing industries). It is very easy to export jobs and capital from the developed world to the developing world (eg by relocating telephone call centres, or by establishing factories in countries with low wage costs). For people (and economies) to be successful, they must continue to learn new skills, and to adapt to change. Retraining will often require re-certification of competence, with the obvious consequence of further assessment, and the need to design assessment systems appropriate to the new needs of employment. These are pressures for more, and more effective, systems of competence-based assessment.

Migration for work and education raises similar issues. The developed world has a need to import highly skilled workers; universities worldwide seek international students. In both cases, there is a need to certify the competence of applicants, and to reject those least likely to be effective workers, or to complete courses successfully (because of a lack of fluency in the language of instruction, for example). Financial considerations make it impractical for testing to take place in the target country, and so a good deal of testing takes place in the country supplying workers or students. Again, it is common to use competence tests which are externally mandated and designed. Language testing provides a good example; a computer-based version of the Test of English as a Foreign Language (TOEFL) has been developed which adjusts the difficulty level of the questions in the light of the performance of the candidate on the test (see www.ets.org/toefl).

For developed economies to maintain their global dominance, their economies must be geared to ‘adding value’ to raw materials (or to creating value from nothing, as in the entertainment and finance industries). This requires changes in the education system which encourage creative activities, and good problem-solving ability. Employment in a post-industrial society is likely to depend on higher-order thinking skills, such as ‘learning to learn’. This requires that these thinking skills be exemplified and assessed, if they are to receive appropriate attention in school.

Cooperation between countries in Europe will have an effect on assessment systems. Currently, there is a problem that qualifications in different member states (‘architect’, ‘engineer’) are gained after rather different amounts of training, and equip people for quite different levels of professional responsibility. This makes job mobility very difficult. The Bologna Accord is an agreement between EU member states that all universities will adopt the same pattern of professional training (typically a three-year undergraduate degree followed by a two-year professional qualification) in order to make qualifications in different member states more comparable. Convergence of course structure is likely to lead to a convergence of assessment systems, in line with the desire to increase mobility (see www.engc.org.uk/international/bologna.asp for an analysis of the impact of the Bologna, Washington and Sydney Accords on engineering).

Globalisation is having a profound effect on educational systems worldwide. In higher education, Slaughter and Leslie (1997) describe the response of universities in several countries to ‘academic capitalism’ – a global trend to view knowledge as a ‘product’ to be created and controlled, and to see universities as organisations which produce knowledge and more knowledgeable people as efficiently as possible. They document the changes in university structures and functioning which have been a response to such pressures; these include greater collaboration on teaching between universities, and mutual accreditation of courses. Again, the need for comparability of course difficulty and student attainment will lead to a careful re-examination of assessment systems, and some homogenisation.

Multinational companies also drive changes in assessment practices. These companies are successful in part because of their emphasis on uniform standards; one is unlikely to get a badly cooked hamburger in McDonald’s, or a copy of Excel that functions worse than other copies. This emphasis on quality control extends to job qualifications, and to standards required of workers. In fast changing markets such as technology provision, retraining workers and checking their competence to use, install or repair new equipment or software requires appropriate assessment of competence. The needs of employers for large numbers of staff who are able to use ICT effectively as part of their job have led to trans-national qualifications such as the European Computer Driving Licence (www.ecdl.co.uk). Such examples are interesting because they are set by international organisations, or commercial organisations, and in some cases (eg the Microsoft Academy programme - www.microsoft.com/education/msitacademy/ITAPApplyOnline.aspx), state-funded educational organisations must submit themselves for examination by a commercial company before they are allowed to certify student competence.

The scale on which such examinations are taken is impressive. Bennett (2002) describes the National Computer Rank Examination, China, which is a proficiency exam to assess knowledge of computer science and the ability to use it; two million examinations were taken in 2002. Tests for the European Computer Driving Licence have been taken by more than a million people.

2.3 MASS EDUCATION

Mass education has developed rapidly and recently. In the last 30 years, the percentage of the UK population being educated at university has risen from about 5% to about 40%. This puts pressures on academic systems to develop efficient assessment systems.

There is now a great deal of distance education. China plans to have five million students in 50-100 online colleges by 2005. At least 35 US states have virtual universities (Bennett 2002). (The recent failure of the E-university in the UK - www.parliament.uk/post/pn200.pdf - and of the US Open University, shows that such ventures are not always successful!) A great deal of curriculum material is delivered via a variety of technologies (the Massachusetts Institute of Technology is in the process of putting all its course material online, for example – see http://ocw.mit.edu/index.html). Over 3,000 textbooks are freely available online at the National Academy Press (www.nap.edu). The use of technology in the assessment process is a logical consequence of these developments.


2.4 DEFENDING DEMOCRACY

Problems of potential political instability provide another driver of change. The rise of fundamentalism (both Christian and Moslem) can be seen as a loss for rationalism. Electoral apathy is a threat to the democratic process. One problem for politicians is to explain complex policies to citizens. This is made difficult if citizens understand little about modelling (such as ideas of multiple causality, feedback in systems, lead and lag times of effects etc). Informed citizens need to understand something about ways to describe and model complex systems, in order that they do not give up on democracy simply because they do not understand the policy arguments being made. Understanding arguments about causality and some experience of modelling systems via ICT should be major educational goals. These goals will need to be exemplified and valued by high-stakes assessment systems, if they are to become part of students’ educational experiences.

Education for citizenship has received increasing emphasis in the UK. Some of the educational goals – such as understanding different perspectives, increased empathy, and community engagement - seem intangible. However, ICT can play a role in posing authentic questions (for example via video) and could play a role in formative assessment, and perhaps in summative assessment (using portfolios).

2.5 GOVERNMENT-LED REFORMS IN CURRICULUM AND ASSESSMENT

Governments are responsive to global pressures, and analyses of the limitations of current national systems. Two current UK initiatives are likely to lead to radical changes in assessment practices, notably to increase the use of e-assessment. One is the DfES E-assessment Strategy (www.dfes.gov.uk/elearningstrategy/default.stm) which maps out a tight timeline for change in current examination systems; the other is the Tomlinson (2004) Report 14-19 Curriculum And Qualifications Reform, which proposes radical changes in educational provision itself (with direct consequences for e-assessment).

The Tomlinson Report (2002) into A-level standards argued that the examinations system is operating at, or perhaps beyond, capacity. According to Tomlinson (2002), in 2001, 24 million examination scripts and coursework assignments were produced at GCSE, AS and A-level. In terms of the number of students being assessed, in 2002 there were around six million GCSE entries and nearly two million children sat Key Stage tests. More students are engaging in post-compulsory education; the introduction of modular A-levels, and the popularity of AS courses have resulted in an increase in the number of examinations taken (Tomlinson reports a growth of 158% over a 20-year period). There is an associated problem concerning the supply of examiners, in terms of both recruitment and training. Roan (2003) estimated that about 50,000 examiners were involved in the assessment of GCSEs, GNVQs and A-levels. Continued expansion of the current examination system without some changes does not seem a viable option. ICT support for current activities, described later, might well be of benefit.

ICT-based assessment is now part of UK government policy, and will be introduced progressively, but on a tight timescale. The DfES E-learning Strategy will be accompanied by radical changes to the assessment process, for which the Qualifications and Curriculum Authority are responsible (www.qca.org.uk/adultlearning/workforce/6877.html). Over the next five years, the following activities are planned:

“All new qualifications should include assessment on-screen
Awarding bodies set up to accept and assess e-portfolios
Most examinations should be available optionally on-screen, where appropriate
National curriculum tests available on-screen for those schools that want to use them
The first on-demand GCSE examinations are starting to be introduced
10 new qualifications specifically designed for electronic delivery and assessment”
QCA Blueprint (2004)

The timescale for these changes is short. For example, in 2005, 75% of basic and key skills tests will be delivered on-screen; in 2006, each major examination board will offer live GCSE examinations in two subjects, and will pilot at least one qualification, specifically designed for electronic delivery and assessment; in 2007, 10% of GCSE examinations will be administered on-screen; in 2008, there will be on-demand testing for GCSEs in at least two subjects.

Good progress has been made with these developments. For example, Edexcel is carrying out a pilot scheme for online GCSEs in chemistry, biology, physics and geography with 200 schools and colleges across the West Midlands and the west of England. AQA conducted a live trial in March 2004 on 20,000 scripts (Adams and Hudson 2004); in Summer 2004, about 500,000 marks (5% of the total) will be collected; by 2007, 100% of marks will be captured electronically.

The Tomlinson Report (2004, in prep) will offer a more radical challenge to assessment practices. The Interim Report (Tomlinson 2004) identified a number of problems with the existing system. These include concerns about:

• excellence – the current system does not stretch the most able young people (in 2003, over 20% of A-level entries resulted in grade A)
• vocational training – there is an historic failure to provide high-quality vocational courses that stretch young people and prepare them for work
• vocational learning is often assessed by external written examinations, not practical and continuous assessment
• assessment - the burden on students and teachers is too high
• disaffection - our high drop-out rates are scandalous
• the plethora of qualifications – currently around 4,000
• curricula - are often narrow, overfull, and limit in-depth learning
• too few students develop high levels of competence in mathematical skills, communication, working with others, or problem-solving
• failure to equip young people with the generic skills, knowledge and personal attributes they will need in the future.


The Report proposes a single qualifications framework, based on diplomas set at four levels (Entry, Foundation, Intermediate and Advanced). Students are expected to progress at a pace appropriate to their attainment, rather than their age. Each diploma shares some common features. These require students to demonstrate evidence of:

• mathematical skills, communication and ICT skills
• successful completion of an extended project
• participation in activities based on personal interest, contribution to the community as active citizens, and experience of employment
• personal planning, review and making informed choices
• engagement in ‘main learning’ - the major part of the diploma – chosen by the student in order to open access to further opportunities (eg in employment or education).

These recommendations are exciting and very ambitious, but deeply problematic, unless there are radical changes to current assessment systems – notably in the large-scale adoption of e-assessment. We consider ways these recommendations might be met, in Section 3.

2.6 SUMMARY OF SECTION 2

A number of ‘drivers’ are shaping both assessment and ICT; these need to be taken into account in any discussion of future developments. These drivers provide conflicting pressures. The drivers considered here include the increasing power and ubiquity of ICT, and the explosion of its usefulness and use in everyday life. These provide pressures for more relevant skills to be assessed, and also provide an assessment medium which is largely unexplored. Demands for lifelong learning, for people who can innovate and create new ideas, and the needs for informed citizenship are all pressures for education (and associated assessment systems) that rewards higher-order thinking, and personal development. Conversely, drivers such as the need to retrain and recertify staff, to ensure common standards across organisations in different countries, and to allow access to well-qualified migrants for jobs and education, emphasise assessments which transcend national boundaries and which are based on well-defined competencies (and where assessment design is sometimes based on perceived commercial imperatives). These drivers require different approaches to assessment, and all require new sorts of assessments and assessment systems to be developed.

In the UK, there are a number of problems with current assessment systems. First, they serve students very badly; second, they might soon collapse under their own weight. There is now the political will to develop pervasive, high quality e-assessment on a tight timescale, aligned with current and emerging educational goals. There is also an urgent need to invent and apply new sorts of e-assessment on a large scale.


3 CURRENT DEVELOPMENTS IN E-ASSESSMENT

The UK government has embarked on a very ambitious project to extend the use of e-assessment. The issue for education is not if e-assessment will play a major role, but when, what, and how. E-assessment can take a number of forms, including automating administrative procedures; digitising paper-based systems, and online testing - which extends from banal multiple choice tests to interactive assessments of problem-solving skills. In this section, we focus on current developments in e-assessment for summative purposes that can be used across the educational system. In Section 4 we address important but less well-defined targets for e-assessment.

Before we begin this section exploring different aspects of e-assessment, we should remember some of the virtues of paper-based tests, in order that we do not become so enamoured of new technologies that we lose sight of the benefits of current assessment systems. With paper:

• all stakeholders are familiar with all aspects of the medium
• paper is robust – it can be dropped, and it still functions
• there are rarely problems of legibility
• high resolution displays are readily available
• students can take questions in any order
• users can input cursive script, diagrams, graphs, tables
• a number of equity issues have been solved – it is easy to create large fonts and to solve other access problems
• paper-based testing systems are well established - it is relatively easy to prevent candidates from copying from each other, for example
• paper is easy to distribute, and can be used in most locations
• in extreme circumstances, it is possible to copy an examination paper, and find another desk
• human judgements are brought to bear throughout the process, so the scope of questions is unconstrained.

3.1 SOME MOTIVES FOR COMPUTER-BASED TESTING

A number of justifications have been put forward for computer-based testing, and are set out below. Not all justifications apply to every use of computers in assessment.

Avoiding meltdown: it may well be impossible to maintain existing paper-based assessment systems in the face of the current growth in the number of students being tested. Scanning technologies can help.

Valuable life skills: much of everyday life (including professional life) requires people to use computers. Not using computers for assessment seems perverse.

Alignment of curriculum and assessment: there is a danger of an emerging gap between classroom practices and the assessment system. It is very common for students (and almost all professionals) to use word processors when they write; in mathematics and science, the use of graphics calculators, spreadsheets, computer algebra systems (CAS) and modelling software is commonplace (and universal in professional practice). Assessment systems that do not allow access to these tools are requiring students to work in unfamiliar and maladaptive ways. Non-ICT-based assessment can be a drag on curriculum reform, rather than a useful driver (see Section 1.2).

On-demand testing: in many situations (for example, students engaged in part-time study; students taking courses designed to develop competencies; students on short courses) it is appropriate to test students whenever they are judged (or judge themselves) to be ready. City and Guilds tests provide an illustration; 75,000 online tests have been taken, and candidates book a test time that suits them. Saturday is the third most popular day for assessment (Ripley 2004).

Students progress at different rates: currently, the UK examination system acts as a force against differentiation in the curriculum. Summative end-of-year tests make it attractive to schools to teach year groups together and to enter them in a common set of examinations. On-demand testing would enable students to take tests such as GCSEs when they are ready, and to progress through different academic subjects at different rates. In the USA, the Advanced Placement system allows students to take university-level courses in school, be tested, and to have success rewarded by college credits – so a student might enter the second year university course, for example. The Tomlinson Report (2004) argues for a more differentiated curriculum.

Adaptive testing: in some circumstances, the group to be tested is heterogeneous, as in the case of language testing, and selection tests for employment. Systems of assessment that change the tasks taken in the light of progress so far can be useful in such circumstances. The principle is straightforward: candidates are presented with tasks of intermediate difficulty; if they are successful, the difficulty level increases; if they are unsuccessful, it decreases. This allows a more accurate estimate of the level of attainment. Adaptive tests can work well when there is a single scale of difficulty – for example in number skill, or vocabulary. They require careful development when a number of different factors affect performance (such as technical as well as problem-solving skills), and are unlikely to be useful where extended responses are required, because the adaptive system has too little to work on. Examples in the school system can be found in Victoria, Australia (AIM Online 2003), where adaptive tests of English and mathematics are used.

Better immediate feedback: candidates can often be given information immediately about success, as is the case in the tests that all trainee teachers are required to take in English, mathematics and ICT (Teacher Training Agency 2003). (This is not necessarily an advantage, if this testing method encourages an ‘instrumental’ approach, where students learn in order to pass tests rather than to learn things. It could also force assessment design to focus on objective knowledge rather than the development of process skills, if immediate feedback became a requirement for all testing.) In principle, candidates could also be given diagnostic information about those aspects of performance most in need of improvement.
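The adaptive testing principle described in this section (start at intermediate difficulty, step up after a success, step down after a failure) can be illustrated with a minimal sketch. It is not the algorithm used by TOEFL, AIM Online or any other named system: the ten-level difficulty scale, the one-step rule, the function names and the way the final estimate is taken are all assumptions made for illustration.

```python
def run_adaptive_test(answer_fn, levels=10, n_items=12):
    """Simple 'staircase' adaptive test (illustrative assumption, not a real system).

    answer_fn(level) returns True if the candidate answers an item of that
    difficulty correctly. Difficulty runs from 1 (easiest) to `levels`.
    """
    level = (levels + 1) // 2          # begin with an item of intermediate difficulty
    history = []
    for _ in range(n_items):
        correct = bool(answer_fn(level))
        history.append((level, correct))
        if correct:
            level = min(levels, level + 1)   # success: next item is harder
        else:
            level = max(1, level - 1)        # failure: next item is easier
    # Estimate attainment as the mean difficulty over the second half of the
    # test, once the staircase has had time to settle near the candidate's level.
    tail = history[len(history) // 2:]
    estimate = sum(lv for lv, _ in tail) / len(tail)
    return estimate, history


def make_candidate(ability):
    """Hypothetical deterministic candidate: succeeds on items at or below `ability`."""
    return lambda level: level <= ability


estimate, history = run_adaptive_test(make_candidate(7))
```

With a single scale of difficulty the presented items quickly oscillate around the candidate's level, which is why few items are needed. The sketch also shows why the approach struggles when several factors affect performance: a single `level` variable cannot represent them.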

Motivational gains: there are claims (Richardson, Baird, Ridgway, Ripley, Shorrocks-Taylor and Swan 2002; Ripley 2004) that students prefer e-assessment to paper-based assessment, because the users feel more in control; interfaces are judged to be friendly; and because some tests use games and simulations, which resemble both learning environments and recreational activities.

Better exemplification for students and teachers: posting examples of work which meets certain standards can be beneficial. In South Australia, excellent student work in technology is displayed on the web (see www.ssabsa.sa.edu.au/tech/2004techsho/index.htm).

Better ‘system’ feedback: having full sets of response data from students available at the time of Examiners’ Reports can improve the quality of feedback. Details of questions, and parts of questions, that proved relatively difficult and easy should improve the quality of Examiners’ Reports (which are based currently on examiners’ experiences of a sample of scripts, and rarely on candidate success on questions and part-questions). This information will be useful both for improving the quality of questions, and for providing information to teachers about topics that have not been learned well.

Faster information for higher education: universities need assessment results in a timely fashion. UK universities receive A-level results quite late in the academic year, and engage in a frenetic process to fill places with appropriately qualified applicants when students do and do not achieve the grades that were a condition of entry. These pressures would be eased if results were delivered earlier.

Better task design: it is easier for test constructors to change tasks on the basis of information during testing and pre-testing, because of the immediacy of data collection. This can range from the rejection of items that do not function well (for example items where students who score well overall are likely to fail a particular item) to improved test design (for example, ensuring that there are a lot of items set around critical cut-off points – especially the pass/fail boundary – so that the test is most reliable there).

Cost: it is common to claim that e-assessment can save money – it is clear that online multiple choice tests can be cheap to administer and score. However, if we are to exploit the potential of ICT to improve assessment – for example by presenting simulations or video as an integral part of a test – then the costs of testing are likely to increase.

3.2 USES OF E-ASSESSMENT TO SUPPORT CURRENT EDUCATIONAL GOALS

3.2.1 Using ICT to support Multiple Choice Tests

This is a well-established technology, particularly well suited to assessing declarative knowledge (‘knowing that’) in well-defined domains. Developing tasks to identify student misconceptions is also possible. It is harder to assess procedural knowledge (‘knowing how’). Multiple choice tests (MCT) are unsuited to eliciting student explanations, or other open responses. MCT have the great advantage that they can be very cheap to create and use. Some of this cheapness is illusory, because the costs of designing good items can be high. Over-use of MCT can be very expensive, if it leads to a distortion of the curriculum in favour of atomised declarative knowledge, divorced from conceptual structures that students can use to work on the world effectively. MCT are used extensively in the USA for high-stakes assessment, and are presented increasingly via the web. For example, web-based high-stakes State tests are available in Dakota and Georgia; the Graduate Record Examination (GRE), used by many US colleges to determine access to Graduate School, is available online.

3.2.2 Creating more authentic paper and pencil tests

It makes sense to allow students access to the tools they use in class, such as word processors, and that professionals use at work, such as graphing tools and modelling packages, during testing. It makes no sense at all to always forbid students to use ‘tools of the trade’ when being assessed. E-learning changes the nature of the skills required. E-assessment allows examiners to focus more on conceptual understanding of what needs to be done to solve problems, and less on telling students what to do, then assessing them on their competence in using the manual techniques required to get the answer. In Australia, the State of Victoria (www.vcaa.vic.edu.au/prep10) has a system for essay marking where students key in their responses to questions, which are then distributed electronically and marked by human markers. Computer Algebra Systems (CAS) can be used in the Baccalauréat Général Mathématiques examination in France; the International Baccalaureate Organisation (IBO) is running a CAS pilot for its Higher Level Mathematics Diploma from September 2004. In the USA, CAS can be used when taking the College Board’s Advanced Placement Calculus test.

3.2.3 Using ICT to support current UK examination processes

A number of ways in which ICT can improve current examination practices are set out below.

Better school-examination board communication: Tomlinson (2002) points to existing extensive use of ICT by awarding bodies in the examination process, and argues for more use of Electronic Data Interchange (EDI) systems, which enable schools and colleges to submit examination entries and information about candidates online and to receive results automatically.

Supporting the current marking and moderation process: a challenge faced by large-scale tests that require human markers is to ensure the comparability of standards across markers, and over time for all markers during the grading process. Chief examiners create scoring rubrics to guide other markers, and there is usually a process of standardisation where markers use the scoring rubrics to score a sample of scripts, and attend a standardising meeting where standards are compared, discrepancies are discussed, and the rubric is tuned. Once markers have reached an appropriate level of marking accuracy, they mark examinations independently. Systems vary in terms of the extent of the moderation used. In some systems, scripts are sampled by chief examiners, and serious deviation from the rubric can lead to the remarking of all the scripts sent to a particular examiner. ICT can be used to support this process. Sample scripts typical of different categories of student work can be put online, for easy reference by markers. Entry of marks can be done via templates that ensure that markers complete every section, and the tedious process of aggregating marks from different parts of the script is done automatically and without error. Data is collected in a way that facilitates rapid and detailed analysis, at the level of responses to different parts of questions, whole questions, and the distribution of test scores.

Replacing paper: in the USA (and increasingly in the UK), there is widespread use of systems where students take paper-based examinations, and the scripts are scanned electronically (this is analogous to Optical Mark Recognition for multiple choice tests that has been available for many years). Once in this format, the documents can be sent electronically to markers, who can be working almost anywhere. These systems have a number of advantages over paper-based systems. First, with paper there are considerable problems in tracking the distribution and return of large volumes of paper to and from markers; there are security issues in sending examination papers by post, and scripts can get lost. Second, moderation of the quality of scoring can be done easily. Pre-scored ‘anchor’ papers can be sent to markers during the course of their marking, to ensure they are maintaining standards; markers who do not perform adequately can be told to take a break, or can be removed from the pool of markers. The whole process can be monitored in terms of the rate at which scripts are being marked. There is flexibility in the ways that scoring is done. Markers can be asked to score whole scripts, or individual questions. So a newly appointed marker might be sent questions judged to be easy to mark, and more experienced markers might be sent questions which require deeper subject knowledge. The reliability of scoring can be increased. Scripts judged to be around key borderlines on first marking can be sent to other markers; scripts judged to be well away from boundaries need be scored only once. Online support can be provided; markers can ask for help with specific student responses. Data is captured in a form suitable for a number of subsequent analyses.

An interesting variant of this approach that obviates the need for scanning would be to require candidates to use ‘intelligent pens’. These pens have two distinct functions. The first is to write like a conventional pen. The second is to record the pen’s movements (exactly) on the page. This is done by using specially prepared stationery. Imagine you could see a small square area of a banknote. The pattern across the whole surface is never repeated, so that, given sufficient time, you could find exactly where the square is located on the note. The pen works in a similar way, to record its position on the page over the course of the examination. The pen is then connected to a computer, and all the data is downloaded. The whole student response can then be reconstructed. Clearly, this approach would have to be subjected to extensive trialling before any widespread adoption.
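The banknote analogy can be made concrete with a toy sketch. It is an illustration of the idea only and makes no claim about how any commercial pen encodes position: the random dot grid, the 8 x 8 patch size and all function names are assumptions. The page carries a pattern in which no small patch repeats, so each patch seen by the pen identifies one position, and a sequence of patches reconstructs the stroke.

```python
import random


def make_pattern(size, seed=0):
    """Random square grid of 0/1 dots standing in for the printed stationery."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(size)] for _ in range(size)]


def window(pattern, row, col, k):
    """The k x k patch whose top-left corner is (row, col), as a hashable tuple."""
    return tuple(tuple(pattern[r][col:col + k]) for r in range(row, row + k))


def build_index(pattern, k):
    """Map every k x k patch to its position; refuse patterns where a patch repeats."""
    index = {}
    limit = len(pattern) - k + 1
    for row in range(limit):
        for col in range(limit):
            patch = window(pattern, row, col, k)
            if patch in index:
                raise ValueError("pattern repeats, so position would be ambiguous")
            index[patch] = (row, col)
    return index


# Hypothetical usage: the 'pen' sees only small patches; the index turns them
# back into page coordinates, from which the written stroke is reconstructed.
pattern = make_pattern(64)
index = build_index(pattern, 8)
stroke = [(3, 4), (3, 5), (4, 6), (5, 6)]                  # positions the pen passed over
seen = [window(pattern, r, c, 8) for r, c in stroke]       # what the pen's camera records
recovered = [index[patch] for patch in seen]               # positions looked up afterwards
```

The uniqueness check in `build_index` is the essential design point: the scheme works only if the pattern never repeats at the scale of the patch the pen can see.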


3.2.4 Online assessment: turning a GCSE paper into ‘computer-only’ e-assessment

An interesting challenge is to devise ways to replace paper-based tests with ICT-based tests, and to score them automatically. Some virtues of paper-based tests are unlikely to be replicated for a number of reasons, so setting tests on-screen is likely to bring about changes in the nature of what is assessed. Here, we consider one specimen GCSE mathematics paper to illustrate the problems.

Measuring and drawing: about 10% of the marks in the paper-based assessment required the use of actual ‘instruments’ (ruler, protractor, compasses). One approach for translation onto screen would be to simulate the physical instruments, eg to provide a virtual protractor that can be dragged around the screen and rotated. Another is to provide CAD or interactive geometry packages. The latter would require a substantial change to the syllabus, but could provide real benefits in terms of student learning.

Mathematical expressions: about 20% of the marks required the student to write down answers that could not be keyed in, using a standard keyboard. These included fractions, division expressions, and powers.

Rough work and partial credit: almost every question in the paper format included space for rough work, and about 30% of the total marks potentially could be awarded based on this work, in the form of partial credit awarded where the final answer is incorrect (these marks are usually awarded in full if the final answer is correct). There are two distinct problems in translating this to a digital format – first capturing the rough work, and second, allocating partial credit. Computer capture is very difficult, given current interfaces; the rules for allocating partial credit would have to be specified in very fine detail for them to be used as part of an automatic scoring routine.

3.2.5 Scoring of open responses

GCSE questions often require students to answer questions in their own way, and to explain things – scoring these responses automatically is inherently difficult. Automated scoring of open student responses is the focus of a good deal of ongoing work. A number of approaches have been taken to the problem of automatic scoring. One is based on the analysis of the surface features of the response (Cohen, Ben-Simon and Hovav 2003), such as the number of characters entered, the number of sentences, sentence length, the number of low-frequency words used, and the like. The success of such methods can be judged by comparing the correlation between computer and human judges, and the correlation between scores given by two sets of human judges. Cohen, Ben-Simon and Hovav (2003) looked at the scoring of a range of essay types by humans and computer, and report that the correlation between the number of characters keyed by the student and the scores given by human judges is as high as the correlation between scores given by human judges. Nevertheless, these scoring systems do not provide a panacea. In the USA, double marking is used to ensure reliability (this is rarely done in the UK). ICT can be used to moderate human markers (and save money) – if the computer and the human disagree, the

paper is re-marked by a human. Machine-only scoring is unlikely to be useful in UK contexts, for two reasons. First is that the UK culture requires that scoring schemes be described in ways that are useful to teachers and students. Second is that the consequential validity of such scoring systems would be dire – the advice to students would be to improve their scores simply by using more keystrokes. A second approach which could improve the quality of scoring and reduce costs is being used to assess student responses on tasks in contexts where the range of acceptable responses can be well defined, such as in short answer science tasks (eg Sukkarieh, Pulman and Raikes 2003). Here, appropriate (‘the Earth rotates around the sun’) and inappropriate (‘the sun rotates around the Earth’) responses are defined. Lists of synonyms are generated for nouns (‘our globe’) and verbs (‘circles’), and alternative grammatical forms are defined, based on analyses of large numbers of student responses. Student responses are parsed using techniques borrowed from Natural Language Processing, and are compared with stored appropriate and inappropriate responses, using a variety of Information Extraction techniques (see Cowie and Lehnert 1996). Mitchell, Aldridge, Williamson and Broomhead (2003) describe work at The Dundee Medical School. Here, all students take the same examination at the end of every year. Academics are presented with all the responses to the same question, with the computer’s judgement on the correctness or otherwise of the answer, and an estimate of the confidence of the judgement. Human scoring time is dramatically reduced, and staff report positive benefits in terms of the quality of the questions they ask, both in terms of rewriting ambiguous questions (which produce student responses that are difficult to score) and in terms of writing questions which highlight student misconceptions. This approach requires a good deal of work prior to live testing, so is well suited to situations where tasks will be used repeatedly.

In the USA, the Graduate Management Admission Test (GMAT) - used to determine access to business schools - uses automated scoring of text. Here again, the test is scored by both human and machine, to offer some sort of reliability check for the human marker.

3.3 ICT SUPPORT FOR CURRENT ‘NEW’ EDUCATIONAL GOALS

There is an emerging consensus worldwide on ‘new’ educational goals, focused on problem solving using mathematics and science, supported by an increased use of information technology (compare, for example, UK developments with those in New Zealand www.minedu.govt.nz; and Singapore www1.moe.edu.sg/iteducation). These new goals involve the development of higher-order thinking, and a range of social skills such as communication, and working in groups. There is an honourable tradition of assessing problem solving via the use of extended tasks, such as those developed by the APU (eg Archenhold, Bell, Donnelly, Johnson and Welford 1988). However, the computer offers some unique features in terms of representation, interaction, and its support for modelling. Here, we describe some recent developments which make use of these unique features.
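Looking back at the short-answer scoring approach described in section 3.2.5, the idea of comparing a response with stored appropriate and inappropriate answers via synonym lists can be illustrated with a deliberately simplified sketch. This is not the method of the cited systems, which rely on parsing and Information Extraction techniques; it is plain phrase matching, and the synonym lists, concept names and the `score` function are hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical synonym lists; real lists would be derived from large numbers
# of student responses.
SYNONYMS = {
    "earth": ["the earth", "earth", "our globe", "our planet"],
    "sun": ["the sun", "sun"],
    "orbits": ["rotates around", "revolves around", "goes around", "circles", "orbits"],
}

# (phrase, concept) pairs, longest phrase first so "the earth" is tried before "earth".
PHRASES = sorted(
    ((phrase, concept) for concept, phrases in SYNONYMS.items() for phrase in phrases),
    key=lambda pair: len(pair[0]),
    reverse=True,
)

# Stored model answers, expressed as ordered concept sequences.
APPROPRIATE = [["earth", "orbits", "sun"]]
INAPPROPRIATE = [["sun", "orbits", "earth"]]


def normalise(response: str) -> List[str]:
    """Reduce a free-text response to its sequence of known concepts, in order."""
    text = " ".join(re.findall(r"[a-z]+", response.lower()))
    found: List[Tuple[int, str]] = []
    for phrase, concept in PHRASES:
        def record(match, concept=concept):
            found.append((match.start(), concept))
            # Blank out the match so shorter synonyms cannot match it again.
            return " " * len(match.group())
        text = re.sub(rf"\b{re.escape(phrase)}\b", record, text)
    return [concept for _, concept in sorted(found)]


def score(response: str) -> str:
    """Classify a response, referring anything unrecognised to a human marker."""
    concepts = normalise(response)
    if concepts in APPROPRIATE:
        return "appropriate"
    if concepts in INAPPROPRIATE:
        return "inappropriate"
    return "refer to human"


print(score("Our globe circles the sun."))        # appropriate
print(score("The sun rotates around the Earth"))  # inappropriate
print(score("The Earth is round"))                # refer to human
```

Because only word order is checked, passive constructions and other alternative grammatical forms would be referred to a human marker here. Handling those forms is what the parsing step in the systems described above is for.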


3.3.1 The development of World Class Tests

Tests were designed to identify high-attaining students in problem solving in mathematics, science and technology at ages 9 and 13 years, as part of the work on the World Class Arena (www.worldclassarena.org). Computers make it easy to present new sorts of tasks, for example tasks where dynamic displays show changes in several variables over time, or which present video of a situation which students must model. A wide variety of representations can be supported, and students can be asked to switch between them. The interactive properties of computers make them well suited to the assessment of process skills.

Using computers to give students control over how data is presented allows them to work with complex data sets of a sort that would be very difficult to work with on paper. Tasks can be set in realistic contexts, using realistic data to address problems of considerable complexity, using resources and methods that are familiar to professionals working in the relevant field. Two examples are presented here: Oxygen and Bean Lab.

Further examples of tasks can be found in Ridgway and McCusker (2003). Skills assessed include:

Understanding and representing problems: traditional educational goals such as the ability to interpret tables and graphs, and to translate information coded in one representation into information coded in another representation continue to be vital skills for mathematical and scientific . Computers allow fast and reversible transformations of information from one representation to another, and students can be asked to explain the relationships between them.

Assessing process skills in science and mathematics: the desire to assess process skills is not new. Traditionally, students would be presented with tasks in laboratories, or would be required to keep logs and portfolios of their laboratory work. However, the laboratory setting can introduce elements which reduce the reliability of the assessment, such as instruments which fail to function properly, or materials whose properties are less than ideal. Students are required to physically manipulate apparatus – chance differences between students in terms of


their previous exposure to particular equipment can both reduce reliability, and add an extra cognitive load to the intellectual task being performed. In some situations, issues of health and safety arise. Some education systems are unwilling to accept teacher ratings of students for the purposes of high-stakes testing, with the result that process skills in science are not assessed at all. Computer-based assessment permits the assessment of these valuable aspects of learning science, at modest cost. A range of different process skills can be identified, which include:

• working systematically (for example, choosing tests systematically, controlling variables and recording results systematically)
• generating and testing hypotheses
• finding rules and relationships
• handling complex data
• testing solutions
• seeking completeness and rigour (in many real-world situations, exemplified by diagnosis and remediation in spheres such as medicine and industrial process control, it is important to find all of the faults in a system).

Five sets of live tests have been administered in the UK and elsewhere, each of which was preceded by extensive pre-testing. A notable result was the ease with which students interacted with computers. The affective response from students was very strong – they really enjoy working on these tasks. This might be related to the sustained challenge the tasks present, which is similar to the reported reasons why they like computer-based games (Kirriemuir and McFarlane 2004).

Students performed better on some tasks than one might expect – notably tasks that require them to reason from complex data sets (eg data with two independent variables and one dependent variable at age 9 years). We take this as a very positive sign that computers can play a leading role in the development of the skills which constitute the new educational agenda. In many aspects, student performance was poor - work characterised by guessing, too little use of systematic methods, poor hypothesis generation, and poor generalisation. On many tasks, students were able to show evidence of good reasoning skills; however, explanations were often weak. Given the earlier discussion of the impact of assessment on the curriculum, it is to be hoped that the use of e-assessment of process skills will lead to better student performance on a range of important activities.

World Class Tests focused on summative assessment in science, mathematics and technology, and used a variety of contexts, including geography and economics, as well as biology, physics, and engineering. The ideas are generic, and can be applied to many curriculum areas. On the basis of analyses of student performance on WCT, teaching modules for whole class use have been developed, targeted on weak process skills. These teaching modules provide a good deal of formative assessment, and require students to engage in reflective activities such as critiquing student work, and explaining their own solution strategies.

We discuss ‘new’ educational goals that are less amenable to summative assessment – such as the ability to work in groups, to communicate, to learn to learn – in Section 4.


3.3.2 Assessing ICT at Key Stage 3

Ongoing work funded by QCA sets out to assess student attainment in ICT at age 13 years. A key principle for the design of these tests is that students should be tested on their performance on extended tasks (‘create a web page about topic X for audience Y, using a particular set of resources - a database, ‘clients’ accessible via e-mail, spreadsheets for planning, web page creation tools’) not on a series of sub-tasks (‘use a spreadsheet to add up these numbers’). An extraordinarily ambitious goal is to present tasks and score performance entirely by computer. This is a laudable aim, and shows a government commitment to high quality e-assessment (including £20m for the project).

3.3.3 Digital portfolios

An historical legacy which bedevils the current education system in the UK is the distinction between ‘academic’ and ‘practical’ subjects. This was enshrined in the 1944 Education Act, which created grammar, technical and secondary modern schools (Tattersall 2003). Abstract thinking is important; appropriate action in context that rests on practical competence is important. Neither is much use on its own, and students should be taught to both abstract and apply. For this to become a classroom reality, assessment systems must require students to show the full spectrum of competencies in a number of school subjects. If high-stakes assessment systems fail to reward such behaviours, they are unlikely to be the focus of much work in school. E-portfolios offer a way forward.

There are three distinct uses for portfolios. The first is to provide a repository for student work; the second is to provide a stimulus for reflective activity – which might involve reflection by the student, and critical and creative input from peers and tutors; the third is as a showcase, which might be selected by the student to represent their ‘best work’ (as in an artist’s portfolio) or to show that the student has satisfied some externally defined criteria, as in some teacher accreditation systems (eg Schulman 1998). These uses are not mutually exclusive. Students may well wish to archive all their work; reflective activities and feedback from others will be based on a subset of this work; the final ‘presentation portfolio’ will be selected from this corpus.

These different uses of portfolios reflect different, but not always incompatible, theories of learning. A behaviourist approach will focus on defining ‘core competencies’ that are impossible to assess in timed examinations, and the need for fast and efficient feedback on student products. A social constructivist view will focus on the importance of reflection and sense making by a group (including the tutor) which will include the negotiation of educational goals.

ICT provides an opportunity to introduce manageable, high quality coursework as part of the summative assessment process. Student portfolios have been advocated for a long time, and have been used on a limited basis. From the viewpoint of assessment, the rationale for portfolios is clear: there are a number of valuable activities and attainments that cannot be assessed using the format of timed tests. The ability to create, design, reflect, modify and persevere are all

important goals of education. It is entirely appropriate to assess these processes by collecting evidence on the ability to engage in an extended piece of work, and to bring it to a successful conclusion by the creation of some product – lab report, video, installation etc. Part of the portfolio can (should) provide evidence of the range of personal skills demonstrated, perhaps under the headings suggested in the Tomlinson Report (2004): student self-awareness – of themselves and the ways they learn and what they know; how students appear to, and interact with, others; thinking about possible futures and making informed decisions. A section of the portfolio in the form of a viva, or simply annotations of products where students show their attainments in these three aspects of performance is appropriate.

A number of problems are associated with portfolios and other sorts of coursework. One is the problem of storage – especially in design projects and in art. ICT can solve the problem by holding images of artefacts created. A second problem is student misbehaviour; this can have a number of forms. One is simply that work is plagiarised; another is that students create some artefact, then ‘back-fill’ by inventing the development process (which is often assessed as part of the final mark) post hoc. ICT can help with both of these problems by requiring the submission of images of intermediate products, with time stamps. On a more positive note, the ability to store and work with images (photographs, video) is likely to make teaching of the design process more effective. Devices such as mobile phones with in-built cameras and facilities for audio recording make it easy to document the evolution of ideas and artefacts. This facility serves a number of functions. First, it simplifies the documentation of the development of work – reducing the ‘busy work’ students might otherwise have had to engage in. The process of documentation via a portfolio of work supports student reflections on processes – on decisions made deliberately, those forced by circumstances, and those that just sort of happened. Digital images are easy to manipulate and present. Student presentations of work on the development of artefacts are easy, once images are captured digitally.

In some subjects, such as design and technology, and art, extended projects are at the heart of the discipline. The use of e-portfolios maps directly onto current conceptions of the domain, and offers practical solutions to some common problems (eg Kimbell 2003). This work is important, and is likely to be applicable on a large scale in the near future. A very large number of institutions have made use of portfolio systems; the American Association for Higher Education (AAHE) Portfolio Clearinghouse (www.aahe.org/teaching/portfolio_db.htm) provides an online searchable database of profiles of electronic portfolio projects and resources in higher education, and is a valuable source of ideas.

3.4 SUMMARY OF SECTION 3

There are a number of exciting developments in the use of e-assessment for both summative and formative purposes, and several UK developments are at the leading edge, worldwide. In the UK, the government has decided that extensive use will be made of e-assessment. Some of these developments are a response to current problems


associated with increases in the volume of assessment; some reflect a desire to improve the technical quality of assessment (such as increased scoring reliability), and to make the assessment process more convenient and more useful to users (by the introduction of on-demand testing, and fast reporting of results, for example). E-assessment also makes it possible to assess aspects of performance that have been seen as desirable for a long time – such as the assessment of process skills, and the efficient handling of student portfolios. Using E-assessment to test student ICT capability represents an extremely ambitious goal of presenting holistic tasks to assess performance, rather than a collection of short tasks which are symptoms, rather than exemplars, of ICT capability. Nevertheless, some major challenges face these new developments. Paper tests have a number of advantages in terms of the quality of the image presented, and the variety of ways in which students can respond; automatic scoring of responses will be very difficult, and in some cases impossible to achieve via computer.

A complete reliance on paper-based assessment has a number of drawbacks; first is that such assessments are increasingly ‘inauthentic’ as classroom and professional practices embrace ICT. Second is that such assessments constrain progress, and have a negative effect on students who have to learn (just for the exam) how to do things on paper that are done far more effectively with ICT. A third major constraint is that current innovative suggestions for curriculum reform, which rely on student portfolios for their implementation, will be impossible to manage on a large scale without extensive use of ICT.

E-assessment is a stimulus for rethinking the whole curriculum, as well as all current assessment systems. E-assessment provides a cost-effective way to integrate high quality portfolio assessment with externally set and marked tests, in any combination. This makes it likely that there will be significant changes in the structure of summative assessments, because of the range of student attainments that can now be assessed reliably. There is likely to be extensive use of teacher assessment of those aspects of performance best judged by humans (including extended pieces of work assembled into portfolios), and more extensive use made of on-demand tests of those aspects of performance which can be done easily by computer, or which are done best by computer.

SECTION 4 OPPORTUNITIES AND CHALLENGES FOR E-ASSESSMENT

4 OPPORTUNITIES AND CHALLENGES FOR E-ASSESSMENT

Here, we consider some issues which need to be addressed as a matter of urgency. First are some speculations on how we might assess process skills - essential but often ill-defined educational goals. It will be important to establish the value of such assessments as part of large-scale summative assessment, in contrast to their roles as potentially useful components of formative assessment. It will also be important to establish the appropriate scale of such assessments, and their locus in the curriculum, in terms of educational gains and manageability. Second, we consider the problems of ‘going to scale’. Large scale innovation – especially where computers are involved – does not always run smoothly.

4.1 ASSESSING PROCESS SKILLS

4.1.1 Assessing metacognition

As we move towards a knowledge-based society, the development of metacognitive skills increases in importance, and they become educational goals in themselves. Currently, these goals are ill-defined in that there is not yet a consensus in the educational community about their exact nature or how they can be assessed. Goals can be described, and recognised when they are achieved, but exemplification needs further work, and a general sharing of ideas. Ridgway, Swan and Burkhardt (2001) exemplify this process as part of ‘Assessing Mathematical Thinking’ in materials developed for the US National Institute for Science Education (www.wcer.wisc.edu/nise/cl1).

Here, examples of metacognition are given under four headings: knowing how to use knowledge; analysing and improving cognitive processes; supporting reflection and critical skills; and assessing competence with different thinking styles.

Knowing how to use knowledge: the web offers great opportunities and pitfalls for assessment. Most obviously, the existence of the web means that successful use of it should be an educational target. Expertise in navigation, such as learning how to bookmark useful sources, and how to refine searches are useful skills, but are subsidiary to a set of meta-knowledge skills about the nature of knowledge – how it is constructed, presented, and used by different people for different purposes. There is a need for students to develop sophisticated theories-in-action about knowledge. These theories should include accounts of the nature of knowledge – its generation, and the various functions it serves (including its use as just another rhetorical device!). Students also need to know about their own knowing – what they do and do not know, how they acquire, lose and change their own knowledge – and how they control their cognitive processes when solving problems.

We address the first goal elsewhere in the discussion on assessing competence in ICT. The latter goal is illustrated by Lord Armstrong’s remark “power is knowing how to use knowledge”. The common corruption to “knowledge is power” misses Armstrong’s point almost entirely. Our educational ambitions should be to encourage students to become sophisticated users and creators of knowledge. Good formative assessment should contribute to students’ development; web-based sources can


be part of both formative and summative assessment of these key elements of student performance.

Key aspects of performance relate to the exploration of the origins of the source, analysis of its qualities as a source, and its relation to a wider set of information. Successful formative assessment helps students to internalise questions and question styles. For summative assessment, we expect students to ask questions about the nature of the information source. The originator can be important – dietary advice from Kellogg’s should be treated more cautiously than advice from the British Medical Association. Who created it? For what purpose? From what perspective was this written? The poor quality of much of the information on the web can be a virtue, pedagogically, because students see the sense in challenging the authority of any source, and can do so easily by considering alternative sources (eg Downes and Zammit 2000).

Skills in analysing documents in terms of their style and their use of particular rhetorical devices, and in creating documents for different audiences and in different writing genres, are being developed and used in English (and sociology and philosophy at university level). Again, the ubiquitous use of web sources provides both a rationale for the value of these analytic and creative activities, and a rich source of resources for assessment purposes.

The web makes it easy to compare and contrast different interpretations of ‘the same’ events by different ‘news’ providers, and by the same provider over time. In terms of assessment, students can be asked to compare and contrast different presentations, and to describe the evolution of a news event over time. This requires analysis of the way that evidence is selected, and the ways that ‘events’ are reconstructed over time.

A further key aspect of knowledge use is the ability to relate a particular source to a larger body of knowledge. It will always be important for learners to develop rich schemas of knowledge – facts, skills, and procedures and their interconnections – as the basis for judging the value or otherwise of putative new information, or a theoretical account. In science, a simple example is a digital image of a mammal with horns and claws. Students are expected to say it is most unlikely, because horns are associated with herbivores, and claws with carnivores. At a higher level of abstraction, students might be asked to resolve famous conflicts in scientific ideas, in terms of what was known at the time. For example, Lord Kelvin – probably the most distinguished scientist of his day – argued against the theory of evolution, on the grounds that the timescale was impossible. The core of the Earth is largely molten, but if the Earth were really the millions of years old needed for evolutionary processes to work, it would have cooled down long ago. What didn’t he know (or is his criticism valid)? The web is a source of information that challenges current knowledge – students can be asked to relate ‘breaking’ research to a wider set of knowledge. The recent scare over the MMR vaccine (and the damage that will be done to children by an under-analysed and over-publicised piece of research) provides an example.

A vivid example of summative evaluation which requires both a deep knowledge

schema and powerful skills in knowledge deconstruction and reconstruction is provided by a final undergraduate examination at Goldsmith’s University on the art history course, where students are presented with two pictures, side by side, which they are to compare and contrast. They are required to name the artist, deconstruct the iconography, and interpret each work in its historical context. This could be presented via ICT, and could be extended to film, and to other contexts.

Another approach to supporting reflection about knowledge acquisition and creation is to incorporate assignments that require a reflective account of the process of creating some artefact (object or written). Students can be asked process questions about sources of information – ways to find good sources (perhaps in the form of ‘advice to someone with a similar job to do’), and about the sources themselves. They can be asked about problems faced, and the ways they were solved, in these ‘meta-learning’ essays.

‘Open-web’ examinations offer a parallel to open-book examinations. One virtue of such examinations is that they are more ‘authentic’ than conventional examinations, in that, outside educational contexts, one rarely has to answer a substantive question without any resources. They allow the examiner to set a broader range of questions, because students are not expected to retain all the relevant information in memory. An adaptive strategy for success on such examinations is to develop meta-knowledge of the whole area, and to index sources very carefully. A large information bank with no index is of little use. Compare the preparation necessary for this sort of examination with the ‘cramming’ strategy that can be effective when preparing for conventional examinations. There, the danger is that students hold information in a relatively temporary state for the purpose of the examination, then forget the information once the examination is over. Open-web examinations are likely to have desirable ‘consequential validity’ – that is to say, are likely to lead to desirable learning (and learning strategies). The unpopularity of open-book examinations (which probably arises because they require serious thought about the subject matter) is likely to apply equally to open-web examinations. The potential for fraudulent behaviour by students (such as e-mailing for advice in situations where the purpose of testing is to assess the ability to search the web, or searching the web when the purpose of the assessment is to assess ‘networking’ skills) means that student activities will need to be constrained in appropriate ways. Nevertheless, open-web assessment should be explored further.

Analysing and improving cognitive processes: interactive whiteboards can provide the facility to work as a whole class on a problem or simulation, then to replay and critique the sequence of actions. This provides the opportunity to discuss seemingly abstract concepts such as ‘strategy’ and exemplify them with concrete examples. Analogies with the analysis of games (eg tennis) can make the activity seem natural in class (of course, analysis of on-screen video of ongoing games is a specific example of the sorts of analyses being described here). The long-term intention is to help students develop metacognitive skills that will be applicable in a wide variety of situations. By looking at different solution attempts, students can be asked high-level questions such as


‘how do you solve problems of this sort?’ – which can be assessed more formally by tasks such as ‘write some guidance for someone else, that will help them to solve problems like this one’. A requirement for summative e-portfolios could be that sample reflective analyses of processes be included.

These techniques have great potential when the focus is on the social and emotional education of students. Topics raised in personal and social education such as approaches to bullying can be approached by presenting students with video vignettes, and asking them to describe situations, the interactions that take place, and the feelings of participants. Parallel information channels (provided by the participants) can provide students with feedback on the correctness or otherwise of their insights. At a lower level, assessing children’s ability to identify the emotions being expressed in different faces can give insights into their developmental state (or, in more extreme cases, into pathological states such as autism). If summative information is appropriate, it can be based on the analysis of such vignettes.

Supporting reflection and critical skills: an important higher-order skill is the ability to review and improve work. This can be done via paper and pencil (for example by writing on every third line, and changing pen colour at every revision cycle), but is made very easy by the use of ICT, with facilities such as ‘track changes’ in MS-Word. Students can be asked to provide examples of their ability to improve work on the basis of others’ and their own suggestions, and of their ability to critique the work of others. Another way to assess critical thinking is to require students to annotate work to show where they meet the assessment criteria.

Courtenay (personal communication, 2004) described an activity designed to support creative writing in English in a night class comprised of 30 non-native speakers at an early stage of learning English. Courtenay focuses on creation and critique, and seeks to spend as much time as possible interacting with his students. Each student writes online, and when they are satisfied with their composition, it is posted to a shared server. Every student is required to offer constructive comments on five compositions, and to revise their own writing in the light of five sets of comments. The teacher is able to tour and coach individuals as they write. With little effort, this approach could be extended to providing summative assessment. Students could be required to submit their comments on others’ writing to be evaluated, and could provide evidence of their ability to use comments on their own work. An assessment system like this would reinforce rather than distort the educational ambitions of the teacher.

Peer assessment is attractive for a number of reasons. (Topping’s 1998 review demonstrated that it is associated with gains on conventional performance measures, in higher education.) Students can be asked to create far more pieces of work than could be marked by a single tutor. It can avoid the problem that as a class size gets bigger, the load on the tutor increases directly, along with the time taken to provide feedback to students. Students must understand criteria for assessment, and must acquire a range of higher-order skills, such as abstracting ideas, detecting errors and misconceptions, critiquing and suggesting improvements, if

they are to engage in peer assessment. Peer assessment is a fact of life outside education, so peer assessment is far more ‘authentic’ than some forms of assessment such as multiple choice tests. Possible disadvantages relate to the possibility of an enhanced workload on students, unreliable feedback, and biased feedback.

A number of commercially available systems have been designed to support peer assessment. Calibrated Peer Review™ (Chapman and Fiore 2001) was designed to support the peer assessment of essays in molecular science, but has been applied in a variety of subjects, and with students across the education system. Students write short essays, and are asked questions designed to foster their critical thinking. Students are presented with three ‘calibration’ essays to grade, and must demonstrate their competence before they progress. Two of the essays contain errors and misconceptions which students must identify and correct. Students are also asked questions on style and grammar. The scores they give to the assignments are compared with ‘official’ scores, and a calibration report is created for the student and the tutor. If performance is inadequate, more instruction is provided, and the student must repeat the activity. Once they have shown that they can assess essays effectively and reliably, they are asked to grade three essays by peers, and finally are asked to grade their own essay. The student and the instructor receive comments and scores.

CPR is not restricted to essays in science; the idea is generic, and can be applied to literary criticism, commentaries on a piece of art, or laboratory reports, for example. The tutor must select the focus of the assignment, write an exemplar answer for calibration, and select two pieces of student work which contain interesting errors or omissions. Each of these has to be graded by the tutor, and relevant comments have to be written. The tutor also writes key questions on content and style. CPR is designed to overcome the potential weakness of peer assessment in terms of unreliable assessment (via training and moderation) and bias (via anonymity). The authors claim considerable gains in students’ ability to ‘learn to learn’ because their attention is focused on abstracting ideas and arguments, describing, analysing and assessing the quality of material, and in review. CPR also increases the amount of writing that students do.

Doiron and Isaac (2002) have developed a novel form of online peer review designed to complement the American College of Surgeons Advanced Trauma Life Support Course for fourth year medical students. Their system involves self-assessment, peer evaluation, feedback and debate. There is an inherent problem giving large numbers of students direct experience of Emergency Room procedures. Here, students are presented with a realistic case study, and must prevent the patient from dying, conduct clinical tests, then request appropriate lab work followed by diagnosis and recommendation of a treatment. Students reflect on, and self-assess, their knowledge. They submit a diagnosis and proposed treatment plan to the whole group. For peer review, they are presented with two other diagnoses and treatments – one from the tutor, prepared to contain errors, for critique. If the student fails to detect the errors, they get individual feedback from the tutor. Students then review ‘live’ reports from


two of their peers (so three reviews are considered together). Where there are disagreements, the two views are presented to a larger group (four to ten students) who must all offer their own view, and debate the issue. Similar work is being conducted on a health psychology course, and in engineering.

Assessing competence with different thinking styles: mobile phone technology might provide a means of assessing thinking styles via simulated group work. Here, each student works in a simulated environment, where responses from other ‘group members’ are pre-specified, and some responses to the actions of the student are pre-defined. This environment is artificial for a number of obvious reasons – contact is via phone (or e-mail) rather than face-to-face and the range of interactions is constrained. However, these constraints mean that students can be assessed in relatively standardised conditions, and sequences can be replayed for analysis and reflection as part of formative assessment.

Analysing the ability to engage in De Bono’s (2000) ‘Thinking Hats’ activity provides a concrete example. De Bono has identified a number of thinking styles, all of which are useful when solving problems. None is effective on its own. He argues that people differ in their preferences for these different thinking styles, and often stick with a particular style of thinking. In terms of group dynamics, individuals can become ego-involved with a particular style of thinking, with negative consequences for the productivity of the group. De Bono argues that these different thinking styles should be made explicit, and that every group member should engage with every thinking style in the course of group work. He suggests a formal mechanism for this, where thinking styles are associated with hats of different colours, and group members are invited to take particular roles – sometimes as individuals, and sometimes as a whole group. Thinking styles include asking about what is known or what is needed (the White Hat); saying why an idea won’t work (the Black Hat); generating ideas and alternatives (the Green Hat); describing feelings, hunches and intuitions (the Red Hat); managing group processes (the Blue Hat); and the optimistic advocacy of ideas (the Yellow Hat).

Given some specific suggestions for actions via mobile phone or e-mail, students can be asked to work in Red, Yellow and Black Hat styles; or given a stream of (simulated) input to a dynamic conference, students can be asked to work in Blue Hat mode. Their responses provide information on their strengths and weaknesses working in different thinking styles. This idea is not restricted to de Bono’s framework, but is a generic idea for assessing individual skills in group settings.

4.1.2 Assessing group projects

A valuable skill is the ability to work productively in groups. This requires good communication skills, understanding the criteria for effective group work, understanding different roles, the ability to assess one’s own work and the work of others, and the ability to respond positively to formative and summative feedback. The assessment of group work is problematic for a number of reasons: problems can be caused by ‘social loafing’ and the allocation of equal marks for unequal

contributions; undesirable effects of students rating peers; and time-hungry procedures for gathering accurate evidence on student performance.

SPARK (Self and Peer Assessment Resource Kit - www.educ.dab.uts.edu.au/darrall/sparksite) is an academic open source project designed to support the effective evaluation of group work, that has been used in a variety of contexts in higher education. It requires a clear specification of the tasks to be performed by the group and the assessment criteria. Students reflect on group processes during the performance of the task, and rate all the group members, and themselves against the criteria provided. The tutor monitors the work of the group, grades the product of the group work, uses SPARK to convert group marks into individual marks, and provides individual summative and formative feedback (eg that a student rates their own contribution to the group far higher than other group members do). Evaluations of SPARK by its authors in a variety of higher education contexts have been positive (eg Freeman and McKenzie 2002).

4.1.3 Assessing creativity

‘Creativity’ involves the production of a new idea or artefact that is judged by some community to be of value. Many writers have made a distinction between analytic and creative thinking. Analytic thinking has been characterised as: linear, rational, logical, conscious and deliberate. Creative thinking has been described as: parallel, unconstrained, illogical, unconscious, and chaotic. Creativity became a bandwagon for education in the 1960s, in part as a healthy corrective to an over-emphasis on ‘Intelligence’. A problem with some of these early proponents of ‘creativity’ (eg Getzels and Jackson 1962) was that they accepted many of the philosophical assumptions of the Intelligence movement, and many of their methods, but were incompetent in their use. The result was a movement that was based on some good ideas, but which was poorly theorised, and supported by flawed evidence. Just as there are many styles of analytic thinking, that are coloured and improved by knowledge in particular domains, and different ways to represent information, so too are there many styles of creative thinking, again, influenced by knowledge and experience in a variety of domains. Creativity (as defined above) requires an intimate interplay of creative and analytic thinking. It is important to develop creativity, and to evaluate the products of creative thinking. Creativity should be evaluated by an analysis of product, and by an analysis of student processes, using methods described earlier (notably, tracking the design process, and reflective accounts on this process).

It can be difficult to obtain good paper-based accounts of student processes and results after engaging with an extended piece of work. This can be a desirable activity for a number of reasons. First, it requires students to translate knowledge from one form to another, and to consider the needs of a different audience – notably from a static written form whose primary audience is the teacher, to a visual and dynamic form for some predefined audience, who will have a range of understandings about the topic in hand. Second, it is inherently valuable as a skill. Digital cameras and whiteboards make it easy for students to show their work (which might be on paper, in the form of


manipulatives, or some artefact that has been created) and to explain what they have done, justify their answer, and describe the design decisions they took.

4.1.4 Assessing communication skills

Mobile phones could be used more extensively for assessment. A simple example would be to use mobile phones for the aural comprehension aspect of language learning. Current practices of using an analogue tape recorder at the front of a classroom are inherently unfair. The quality of the sound will differ as a function of the tape machine used; the sound intensity at the front of the room will be dramatically higher than at the back of the room. Using conventional computer technology, South Australia uses MP3 files to test language comprehension (see www.ssabsa.sa.edu.au) – clearly, good practice.

The eVIVA project (www.qca.org.uk/adultlearning/downloads/eviva_project.pdf, www.eviva.tv) uses phones as the medium for oral testing with portfolio-based Key Stage 3 ICT assessment. Students can book a test session, and so can have (almost) on-demand testing. The phones are also used for recording ‘voice postcards’ of learning milestones, and posting these to a central website. The ‘voice postcards’ can be used by a student to support the piece of portfolio evidence which they are presenting.

As speech recognition technologies continue to improve, one can envisage a situation where questions are posed orally by telephone, and student responses are scored automatically. In the case of language learning, this could be applied to elementary aspects of learning such as pronunciation, to vocabulary, and to correcting sentence structure ‘mistakes’ presented to students. Given test technologies that support ‘tailored testing’, the phone system could be used to provide on-demand testing of some aspects of language use. Such systems are unlikely to be useable (in the short term at least) for high-stakes testing, because of problems of impersonation. These problems may be removed if effective person recognition systems are developed and introduced on a large scale.

4.2 NATIONAL CURRICULA, NATIONAL ASSESSMENT

The Tomlinson Report (2004) addresses fundamental questions about curriculum design and assessment, and describes a number of serious problems with current systems. Assessment exemplifies educational goals, and has a major effect on educational practice. Unless assessment systems are aligned with educational goals, they will distort curriculum ambitions. There is a general desire for more school-based assessment, and more process-based assessment, and an insistence that current high standards of equity and probity in the examination process are maintained. E-assessment (eg via e-portfolios) can provide the means to empower teachers and schools, while ensuring that high standards of assessment are met. ICT can support the whole process of teacher preparation, and the establishment of procedures to ensure comparability of standards across schools. School-based judgements could be moderated by external computer-based tests. E-assessment can extend the range of reliable assessments that can be

conducted, and so can widen the debate on curriculum and assessment design. On-demand testing will have considerable implications for curriculum planning. Students could take summative tests at different times, and could progress through the curriculum at different rates.

E-assessment could reduce the damage caused by current tests. At present, new SAT papers are created each year, and all students answer the same questions. If the purpose of testing is to establish the performance of some system (such as a school or an LEA), better methods could be employed. If there were a large bank of tasks available in electronic form, and different students received a different set of tasks, then coverage of the curriculum could be better, and there would be no need to report individual student scores. This would have the advantage that a larger variety of task types could be used, and would avoid the current distortions caused by teachers ‘teaching to the SAT’.

4.3 EVOLUTION AND REVOLUTION

Even where there is a shared vision on future curricula, there can be considerable problems in implementation. Ridgway (1998) draws analogies between ecological restoration and educational change, and describes the sorts of research needed for successful change. This style is close to research in fast-changing fields such as electronics, where discoveries and inventions drive practice and theory, in contrast to well-established fields where theory can lead practice. It is important to be aware that some goals are easy to achieve from most starting points, whilst others need a good deal of capacity building before they can be reached. It will be important to phase the introduction of e-assessment in such a way that the load on students, teachers, schools and systems is lower than the current assessment load. Some barriers are discussed below.

Establishing the credibility of e-assessment: in some areas such as competency-based assessment, the case for e-assessment is self-evident. In other areas, reasonable sceptics will have to be convinced of its value. They will have concerns about the construct validity of new tests (exactly what do they measure?); the reliability of new tests in comparison with existing tests; and the educational standards required – both in relation to current tests, and across tests such as those given ‘on-demand’ in different places and at different times. Each of these questions will need to be addressed for each family of e-assessments, usually by means of an empirical study.

Building system capacity: there is an urgent need to build capacity for e-assessment, ranging from test design, through test delivery and processing, to expertise in school. Each of these is problematic.

Task and test design: very few people have expertise in creating e-assessments, in comparison to the large numbers of people competent to create conventional tests. There is an urgent need to create new task types and to explore their reliability and validity. If we do not continue to explore, students will be faced with a set of tasks which recently were innovative, but which are now hackneyed.


Establishing technical standards: currently, there are three sets of technical standards. We need a consensus document. The needs of students with special needs must be addressed. Standards for monitoring the quality of the assessments given in schools (actually a rather hostile environment for ICT, because of the plethora of machines and operating systems), and the procedures put in place by examination authorities need to be written, and validated in practical settings.

ICT infrastructure: good broadband systems are needed – in particular, very high specification systems are needed for big schools. Currently, about 40% of primary schools, and about 100% of secondary schools have broadband access, but not necessarily at the levels needed for online assessment (Rt Hon Charles Clarke MP 2004). The proposals set out in the Tomlinson Report are only feasible if a national database of student achievement is established. At school level, extensive investment in ICT will be needed, and costs will recur.

The examination process: dealing with e-assessment poses serious challenges to paper-based examination authorities. They need to develop a robust technology infrastructure, and (at least as important) the competencies of staff to make these systems function effectively. A good start has been made here, for example in the work on the assessment of basic and key skills. However, there are salutary messages from the QCA Report on implementation (QCA 2004). AQA report (Adams and Hudson 2004) that their surveys show considerable satisfaction from examiners. Examiners report that the software is easy to use; they like the increased accuracy and validation at input, and the auto-totalling of marks by the computer, and the electronic management of reporting and discrepancies.

On examiners and examining: High quality training is an essential aspect of reliable assessment. Tomlinson recommends (paras 134–136) “a thorough professionalisation of the role of markers and examiners, including coursework markers”, and the Report makes a number of specific recommendations on how this might be institutionalised via schemes for professional development, accreditation, and appropriate professional reward systems. The Secondary Heads Associations have argued for the establishment of ‘Chartered Examiners’ in schools and colleges, who would give their organisations the right to take more control over examination assessment.

School and test-centre expertise: this presents a massive challenge for professional development. Schools need to develop systems which are robust.

Plagiarism: this poses a major threat to all assessment systems (eg Ridgway and Smith 2004). The threats include downloading work directly from the internet, commissioning work, and impersonation. Assessment systems will need to be resistant to such attacks.

Equity issues: it is important that e-assessment does not create a ‘digital divide’ which privileges some students over others on the basis of opportunities of access.
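The task-bank proposal in section 4.2 (a large bank of electronic tasks, with different students receiving different sets, and no individual scores reported) can be made concrete with a short sketch. The Python below is illustrative only: the function names, the task identifiers and the 0 to 1 mark scale are assumptions made for this example, and are not drawn from any existing testing system.

```python
import random
from collections import defaultdict


def assign_task_sets(student_ids, task_bank, tasks_per_student, seed=0):
    """Give each student a random subset of the bank (hypothetical helper).

    No task is repeated within one student's set; a fixed seed keeps the
    allocation reproducible for audit.
    """
    rng = random.Random(seed)
    return {s: rng.sample(task_bank, tasks_per_student) for s in student_ids}


def system_level_report(assignments, scores):
    """Aggregate marks per task, never per student.

    `scores` maps (student_id, task_id) -> mark between 0 and 1 (assumed scale).
    Returns {task_id: (mean_mark, number_of_attempts)}.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for student, tasks in assignments.items():
        for task in tasks:
            if (student, task) in scores:
                totals[task][0] += scores[(student, task)]
                totals[task][1] += 1
    return {task: (total / n, n) for task, (total, n) in totals.items()}


if __name__ == "__main__":
    bank = [f"T{i:02d}" for i in range(1, 13)]      # 12 tasks in the bank
    students = [f"s{i}" for i in range(30)]         # 30 students in a school
    sets = assign_task_sets(students, bank, tasks_per_student=4)
    marks = {(s, t): 0.5 for s, ts in sets.items() for t in ts}  # dummy marks
    for task, (mean_mark, attempts) in sorted(system_level_report(sets, marks).items()):
        print(task, round(mean_mark, 2), attempts)
```

Each student sees only four tasks, yet the school as a whole is sampled across the bank, which is the sense in which coverage of the curriculum could improve without any individual score being reported.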

4.4 RELIABLE TEACHER ASSESSMENT VIA E-PORTFOLIOS

A key decision for educational systems is to decide exactly how much of the students’ time should be devoted to working on extended projects, and how much should be based on shorter activities. A related decision is the balance to be struck between portfolio systems assessed in school, and timed external assessments. A key issue is to establish robust and reliable systems of school-based assessment. It is worth highlighting the extreme positions that different systems use. In some systems, all assessment is done externally. In some systems – for example Queensland, Australia – all assessment is school-based. Queensland provides extensive systems for training teachers, and for moderating their judgements. ICT can facilitate this process. All student submissions can be put onto the web, and systems of cross-moderation can be established. Externally defined tests can be used to guide the moderation process.

4.5 DUMBING-DOWN ASSESSMENT

There is a danger that considerations of cost and ease of assessment will lead to the introduction of ‘cheap’ assessment systems which prove to be very expensive in terms of the damage they do to students’ educational experiences. At the time of writing, this seems most unlikely in the UK. QCA have funded some innovative e-assessment developments at investment levels beyond the reach of most companies, and have a large group focused on developing and sharing expertise in e-assessment (www.qca.org.uk).

4.6 SUMMARY OF SECTION 4

New educational goals continue to emerge, and the process of critical reflection on what is important to learn, and how this might be assessed authentically needs to be institutionalised into curriculum planning. In this section, we explore ways to assess metacognition, group projects, creativity and communication skills. E-assessment is certain to play a major role in defining and implementing curriculum change in the UK. There is a strong government commitment to e-assessment, and good initial progress has been made. Major challenges of ‘going to scale’ have yet to be faced. A good deal of innovative work is needed, coupled with a grounded approach to system-wide implementation.
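Section 4.4 notes that externally defined tests can be used to guide the moderation of school-based judgements. One simple way this might work is sketched below in Python. It is a hypothetical illustration, not a description of the Queensland system or of any examination authority's procedure: the function name, the common 0 to 100 mark scale and the tolerance of five marks are all assumptions.

```python
from statistics import mean


def moderation_flags(teacher_marks, external_marks, tolerance=5.0):
    """Flag schools whose teacher marks drift from an external test.

    Both arguments map school -> {student: mark}, with marks assumed to be
    on the same 0-100 scale. For each school, the mean difference
    (teacher mark minus external mark) is computed over the students who
    appear in both sets; the school is flagged when the absolute mean
    difference exceeds `tolerance` (an assumed threshold).
    Returns {school: (mean_difference, flagged)}.
    """
    flags = {}
    for school, marks in teacher_marks.items():
        external = external_marks.get(school, {})
        common = [s for s in marks if s in external]
        if not common:
            continue  # nothing to compare for this school
        gap = mean(marks[s] - external[s] for s in common)
        flags[school] = (round(gap, 2), abs(gap) > tolerance)
    return flags


if __name__ == "__main__":
    teacher = {"School A": {"s1": 70, "s2": 80}, "School B": {"s1": 60, "s2": 62}}
    external = {"School A": {"s1": 60, "s2": 68}, "School B": {"s1": 61, "s2": 60}}
    for school, (gap, flagged) in moderation_flags(teacher, external).items():
        print(school, gap, "review" if flagged else "ok")
```

A flag here would only prompt a closer look at a school's portfolios by moderators; the external test guides the process rather than replacing teacher judgement.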

ACKNOWLEDGEMENTS

We wish to thank a number of people who have commented constructively on this document, in particular Keri Facer, Annika Small, Jeremy Tafler, and Kathleen Tattersall. We are grateful to them for their input. All the faults and errors of omission are our own.

GLOSSARY

Adaptive testing: a sequential form of individual testing in which successive items in the test are chosen based primarily on the psychometric properties and content of the items, and the participant’s response to previous items
A-level (AS/A2): General Certificate of Education (GCE) Advanced Level. Study usually consists of a two-year academic course and students will usually select two or three subjects from subjects studied at AS-levels to continue to A-level (called A2)
Anchor(s): a sample of student work that exemplifies a specific level of performance. Markers use anchors to score student work, usually comparing the student performance to the anchor
AQA: an awarding body, the Assessment and Qualifications Alliance, formed from the merger of Associated Examining Board (AEB) and the Northern Examinations and Assessment Board (NEAB) in 2000
AS-levels: General Certificate of Advanced Supplementary Level, considered to be the equivalent of half an A-level. Young people are now expected to study four AS-levels during Year 12 at school or college
Assessment: any systematic method of obtaining evidence from tests, examinations, questionnaires, surveys and collateral sources used to draw inferences about characteristics of people, objects or programs for a specific purpose
Basic skills: the ability to read, write and speak in English and use mathematics at a level necessary to function and progress at work and society in general
CAS: Computer Algebra System. Software package used for the manipulation of mathematical formulae. Automates tedious and sometimes difficult algebraic manipulation tasks. Systems vary and may include facilities for graphing equations or provide a programming language for the user to define their own procedures
City and Guilds: major awarding body for vocational qualifications in the UK
Competency-based assessment: assessment process based on the collection of evidence on which judgments are made concerning progress towards satisfaction of standard performance criteria
Concept map: the arrangement of ideas into a visual layout highlighting connections between associated ideas, revealing the structural pattern in the information
Criterion referenced assessment: assessment linked to predefined standards (eg ‘Can swim 25 metres in a swimming pool’)
CSE: Certificate of Secondary Education. Former system of British examinations taken in a range of subjects, usually at the age of 16
Diagnostic testing: testing used to identify the conceptions and misconceptions with a view to providing appropriate remedial experiences

Discrimination: the ability to distinguish between and among different levels of work or achievement
E-assessment: processes involving the implementation of ICT for the recording, transmission, presentation and processing of assessment material
Edexcel: UK examining and awarding body providing a range of qualifications including at higher education level
EiC: Excellence in Cities. Government initiative aimed at raising the educational aspirations and attainment of children in inner cities
European Computer Driving Licence: European-wide qualification allowing candidates to demonstrate competence in computer skills, covering the areas of basic concepts of IT, using the computer and managing files, word processing, spreadsheets, database, presentation and information, and communication
Formative assessment: often called assessment for learning. Assessment used to support teaching and learning, which identifies strengths and weaknesses of the student
GCE: General Certificate of Education
GCSE: General Certificate of Secondary Education (GCSE). The main secondary school examinations usually at 16, which replaced previous system GCE O-levels and CSEs
GIS: Geographic Information System. System of software used for the storage, retrieval, mapping and analysis of spatial data, such as mortality by different regions
GNVQ: General National Vocational Qualification. Vocational qualification, often taken as an alternative to GCSE or A-levels, usually after compulsory schooling. Available at three levels: Foundation, Intermediate, and Advanced
High-stakes assessment: assessment that has important consequences or implications for students, staff or schools
ICT: Information and Communications Technology
Key skills: a group of skills valued by employers as being central to all work and learning, including communication, information technology, application of numbers, working with others, and improving own learning and performance
Key Stages: the four stages of the National Curriculum (KS1 for pupils aged 5-7; KS2 for 7-11; KS3 for 11-14; KS4 for 14-16)
NVQ: National Vocational Qualifications. Work-based vocational qualifications. They are portfolio-based qualifications which show skills, knowledge and ability in specific work areas. Can be taken at five levels, depending on level of expertise and responsibility of the job
O-level: also GCE Ordinary level. Former system of British examinations taken in a range of subjects, usually at the age of 16. Ran in parallel with but at a higher level than CSE. Both systems now replaced by current GCSE
Parallel forms: tests that are created to measure the same constructs, and to produce the same scores, if they were given to individuals on different occasions
PDA: Personal Digital Assistant; a small hand-held computer. Depending on level of sophistication may allow e-mail, word processing, music playback, internet access, digital photography or GPS reception


Pedagogy: philosophy of approach to schooling, learning, and teaching including what is taught, how teaching occurs, and how learning occurs
Portfolio: a representative collection of a candidate’s work, which is used to demonstrate or exemplify either that a range of criteria has been met, or to showcase the very best that a candidate is capable of
Portfolio assessment: assessment based on judgment made about the work shown as evidence within a portfolio
Predictive validity: the extent to which scores on a test predict some future performance. For example, a student’s GCSE grade can be used to predict their likely A-level grade – in some subjects, the prediction is better than in other subjects
QCA: UK public body, sponsored by the Department for Education and Skills (DfES). Roles include the maintenance and development of the national curriculum and associated assessments, tests and examinations
Reliability: reliability in measurement and testing is a measure of the accuracy of the score achieved, with respect to the likelihood that the score would be constant if the test were re-taken or the same performance were re-scored by another marker, or if another test from a test bank of ostensibly equivalent items is used
Summative assessment: assessment used to measure performance, usually at the end of a course of study
TIMSS: Trends in International Mathematics and Science Study, formerly Third International Mathematics and Science Study. Comprehensive study offering data on students’ mathematics and science achievement from an international perspective. Data from 1995, 1999, and 2003
UCLES: University of Cambridge Local Examinations Syndicate, comprising three business units: Cambridge ESOL (English for Speakers of Other Languages), providing examinations in English as a foreign language and qualifications for language teachers; CIE (University of Cambridge International Examinations), providing international school examinations and international vocational awards; and OCR (Oxford, Cambridge and RSA Examinations), providing general and vocational qualification
Validity: the appropriateness of the interpretation and use of the results for any assessment procedure
Value added: the increase in learning that occurs during a course of education. Based either on the gains of an individual or a group of students. Requires a baseline measurement for comparison

BIBLIOGRAPHY

Adams, C and Hudson, G (2004). AQA and DRS electronic mark capture, presented at the QCA E-assessment Summit, 24 April
Aim Online P–10 Supplement (2003). Supplement to the VCAA Bulletin No 6 September 2003. AIM Online: www.aimonline.vic.edu.au
Archenhold, WF, Bell, J, Donnelly, J, Johnson, S and Welford, G (1988). Science at Age 15: a Review of APU Findings 1980-1984. London: HMSO
Barnes, M, Clarke, D and Stephens, M (2000). Assessment: the engine of systemic curriculum reform? Journal of Curriculum Studies, 32(5) 623-650
Bennett, RE (2002). Inexorable and inevitable: the continuing story of technology and assessment. Journal of Technology, Learning, and Assessment, 1(1). Available from www.jtla.org
Black, P and Wiliam, D (2002). Assessment for Learning: Beyond the Black Box (2002). www.assessment-reform-group.org.uk/publications.html
Chapman, OL and Fiore, MA (2001). Calibrated peer review: a writing and critical thinking instructional tool. The White Paper: a Description of CPR. http://cpr.molsci.ucla.edu/
Cockcroft, WH (1982). Mathematics Counts. London: HMSO
Cohen, Y, Ben-Simon, A and Hovav, M (2003). The effect of specific language features on the complexity of systems for automated essay scoring. Paper presented to the 29th Annual Conference of the International Association for Educational Assessment. www.aqa.org.uk/support/iaea/papers/ben-cohen-hovav.pdf
Cowie, J and Lehnert W (1996). Information extraction. Communications of the ACM vol 39 (1), pp80-91
De Bono, E (2000). Six Thinking Hats. London: Penguin Books
Doiron, G and Isaac JR (2002). Designing an ER online role play for medical students. 2nd Symposium on Teaching and Learning in Higher Education Paradigm Shift in Higher Education, National University of Singapore, 4-6 September 2002
Downes, T and Zammit, K (2000). New for connected learning in global classrooms, in: H Taylor and P Hogenbirk (Eds) Information and Communication Technologies: the School of the Future. London: Kluwer Academic Publishers
EPPI Centre (2002). A Systematic Review of the Impact of Summative Assessment and Tests on Students’ Motivation for Learning. http://eppi.ioe.ac.uk
Frederikson, JR and Collins, A (1989). A system approach to educational testing. Educational Researcher, 18(9), 27-32
Freeman, MA and McKenzie, J (2002). Implementing and evaluating SPARK, a confidential web-based template for self and peer assessment of student teamwork: benefits of evaluating across different subjects. British Journal of , 33 (5), pp553-572. Cited at www.educ.dab.uts.edu.au/darrall/sparksite
Getzels, JW and Jackson, PW (1962). Creativity and Intelligence: Explorations with Gifted Students. New York: John Wiley
Kimbell, R (2003). Performance assessment: assessing the inaccessible. Paper presented at Futurelab’s Beyond the Exam conference, 19-20 November 2003, Bristol

43 BIBLIOGRAPHY

Kirriemuir, J and McFarlane, A (2003). of computer-based World Class Tests of Literature Review in Games and Learning problem solving. Computers and Human (2004). Bristol: Futurelab. Retrieved Behaviour, 18 (6), 633-649 05/09/2004 from www.futurelab.org.uk/ Ridgway, J and Passey, D (1993). An research/lit_reviews.htm international view of mathematics Klein SP, Hamilton, LS, McCaffrey, DF assessment - through a class, darkly, in: and Stecher, BM (2000). What do test Niss, M (Ed) Investigations into scores in Texas tell us? RAND Issues Paper. Assessment in . www.rand.org/publications/IP/IP202 Kluwer Academic Publishers, pp57-72 Koretz and Barron (1998). The Validity Ridgway, J (1998). The Modelling of of Gains in Scores on the Kentucky Systems and Macro-Systemic Change - Instructional Results Information System Lessons for Evaluation from Epidemiology (KIRIS). www.rand.org/publications/MR/ and Ecology. National Institute for Science MR1014/MR1014.pref.pdf Education Monograph 8, University of Wisconsin-Madison Linn, RL (2000). Assessments and accountability. ER Online, 29(2). Ridgway, J and Smith, H (2004). Against www.aera.net/pubs/er/arts/29-02/ plagiarism: strategies for defending the linn01.htm validity of assessment systems. EARLI Assessment SIG, Bergen, Norway Mathews, JC (1985). Examinations: a Commentary. London: George Allen Ridgway, J, Swan, M and Burkhardt, H and Unwin (2001). Assessing mathematical thinking via FLAG, in: D Holton and M Niss (Eds) Messick, S (1995). Validity of psychological Teaching and Learning Mathematics at assessment. American Psychologist vol 50, University Level - An ICMI Study. no 9, pp741-749 Dordrecht: Kluwer Academic Publishers, Mitchell, T, Aldridge, N, Williamson, W pp 423-430. Field-Tested Learning and Broomhead, P (2003). Computer Assessment Guide (FLAG). based testing of medical knowledge. www.wcer.wisc.edu/nise/cl1 Proceedings of the 7th International Ridgway J and McCusker, S (2003). 
Computer Assisted Assessment Using computers to assess new Conference, Loughborough, pp249-267 educational goals. Assessment in Pellegrino, JW, Chudowski, N, Glaser, R Education: Principles, Policy and Practice, (Eds) (2001). Knowing What Students vol 10, no 3, pp309-328(20) Know. Washington DC: National Academy Ripley, M (2004). E-assessment question of Sciences 2004 – QCA keynote speech e-assessment: QCA (2004). The Basic and Key Skills (BKS) an overview. Presentation given by Martin E-assessment Experience Report. Ripley at Delivering E-assessment - a Fair www.qca.org.uk/adultlearning/downloads/ Deal for Learners, a summit held by QCA bks_e-assessment_experience.pdf on 20 April 2004 Richardson, M, Baird, J, Ridgway, J, Ripley, Roan, M (2003). Computerised M, Shorrocks-Taylor, D and Swan, M (2002). assessment: changes in marking UK Challenging minds? Students’ perceptions examinations – are we ready yet? Paper

44 presented to the 29th Annual Conference Tomlinson, M (2004). 14-19 Curriculum of the International Association for and Qualifications Reform: Interim Report Educational Assessment. www.aqa.org.uk/ Of The Working Group On 14-19 Reform. support/iaea/papers/roan.pdf London: DfES. www.14-19reform.gov.uk Robitaille, DF, Schmidt, WH, Raizen, S, Topping, KJ (1998). Peer assessment McKnight, C, Britton, E and Nicol, C between students in college and university. (1993). Curriculum frameworks for Review of . 68 (3), mathematics and science. TIMSS 249-276 Monograph No 1. Vancouver: Pacific Educational Press Rt Hon Charles Clarke MP, Secretary of State for Education and Skills. Keynote speech at Delivering E-assessment - a Fair Deal for Learners, a summit held by QCA on 20 April 2004 Schulman, L (1998). Teacher portfolios: a theoretical activity, in: N Lyons (Ed) With Portfolio in Hand: Validating the New Teacher Professionalism (pp23-37). NY: Teachers College Press Slaughter, S and Leslie, LL (1997). Academic Capitalism: Politics, Policies and the Entrepreneurial University. Baltimore: The Johns Hopkins University Press Sukkarieh, JZ, Pulman, SG and Raikes, N (2003). Auto-marking: using computational linguistics to score short, free text responses. Paper presented to the 29th Annual Conference of the International Association for Educational Assessment. www.aqa.org.uk/support/iaea/papers/ sukkarieh-pulman-raikes.pdf Tattersall, K (2003). Ringing the changes: educational and assessment policies, 1900 to the present, in: Setting the Standard. AQA: Manchester, pp7-27 Teacher Training Agency (2003). Qualifying to Teach: Professional Standards for Qualified Teacher Status and Requirements for Initial Teacher Training Tomlinson, M (2002). Inquiry into A Level Standards. London: DfES


APPENDIX: FUNDAMENTALS OF ASSESSMENT

How shall they be judged?

Here we consider some of the criteria against which tests and testing systems can be judged.

Validity and reliability are often written about as if they were separate things. Actually, they are intimately entwined, but it is worth starting with two simple definitions: validity is concerned with the nature of what is being measured, while reliability is concerned with the quality of the measurement instrument.

A loose set of criteria can be set out under the heading of educational validity (Frederikson and Collins (1989) use the term ‘systemic validity’). Educational validity encompasses a number of aspects which are set out below.

Consequential validity: refers to the effects that assessment has on the educational system (Ridgway and Passey (1993) use ‘generative validity’). Messick (1995) argues that consequential validity is probably the most important criterion on which to judge an assessment system. For example, high-stakes testing regimes which focus exclusively on timed multiple choice items in a narrow domain can produce severe distortions of the educational process, including rewarding both students and teachers for cheating. Klein, Hamilton, McCaffrey and Stecher (2000), and Koretz and Barron (1998) provide examples where scores on high-stakes State tests rise dramatically over a four-year period, while national tests taken by the same students, which measure the same constructs, show little change.

Construct validity: refers to the extent to which a test measures what it purports to measure. There is a need for a clear description of the whole topic area (the domain definition) covered by the test. There is a need for a clear statement of the design of the test (the test blueprint), with examples in the form of tasks and sample tests. Construct validity requires supporting evidence on the match between the domain definition and the test. Construct validity can be approached in a number of ways. It is important to check on:

• content validity: are items fully representative of the topic being measured?
• convergent validity: given the domain definition, are constructs which should be related to each other actually observed to be related to each other?
• discriminant validity: given the domain definition, are constructs which should not be related to each other actually observed to be unrelated?
• concurrent validity: does the test correlate highly with other tests which supposedly measure the same things?

The essential idea about reliability is that test scores should be a lot better than random numbers. Test situations have lots of reliabilities. The over-arching question concerning reliability is: if we could test identical students on different occasions using the same tests, would we get the same results?

Take the measurement of student height as an example. The concept is easy to define; we have good reason to believe that ‘height’ can be measured on a single dimension (contrast this with ‘athletic ability’, or ‘creativity’ where a number of different components need to be considered). However, the accurate measurement of height needs care. Height is affected by the circumstances of measurement – students should take off their shoes and hats, and should not slump when they are measured. The measuring instrument is important – a yard stick will provide a crude estimate, good for identifying students who are exceptionally short or exceptionally tall, but not capable of fine discriminations between students; using a tape measure is likely to lead to more measurement error than using a fixed vertical ruler with a bar which rests on each student’s head. Time of day should be considered (people are taller in the morning); so should the time between measurements. If we assess the reliability of measurement by comparing measurements on successive occasions, we will under-estimate reliability if the measures are taken too far apart, and students grow different amounts in the intervening period.

Exploration of reliability raises a set of finer-grained questions. Here are some examples:

• is the phenomenon being measured relatively stable? What inherent variation do we expect? (mood is likely to be less stable than vocabulary size)
• to what extent do different markers assign the same marks as each other to a set of student responses?
• do students of equal ability get the same marks no matter which version of the test they take?

Fitness for purpose: the quality of any design can be judged in terms of its ‘fitness for purpose’. Tests are designed for a variety of purposes, and so the criteria for judging a particular test will shift as a function of its intended purpose; the same test may be well suited to one purpose and ill suited to another.

Usability: people using an assessment system – notably students and teachers – need to understand and be sympathetic to its purposes.

Practicality: few designers work in arenas where cost is irrelevant. In educational settings, a major restriction on design is the total cost of the assessment system. The key principle here is that test administration and scoring must be manageable within existing financial resources, and should be cost-effective in the context of the education of students.

Equity: equity issues must be addressed - inequitable tests are (by definition) unfair, illegal, and can have negative social consequences.
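Two of the reliability questions raised in this appendix (do repeated measurements agree, and do different markers assign the same marks?) can be illustrated with a small calculation. The sketch below is not taken from the report: it assumes that agreement between two occasions is summarised by a Pearson correlation and that agreement between two markers awarding categorical grades is summarised by Cohen's kappa, which are common choices but not the only ones. The function names and all data values are hypothetical.

```python
from collections import Counter
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(var_x * var_y)


def cohen_kappa(marks_a, marks_b):
    """Chance-corrected agreement between two markers on the same responses."""
    if len(marks_a) != len(marks_b) or not marks_a:
        raise ValueError("need two equal-length, non-empty lists of marks")
    n = len(marks_a)
    # Proportion of responses on which the two markers gave the same mark.
    observed = sum(a == b for a, b in zip(marks_a, marks_b)) / n
    # Agreement expected by chance, from each marker's own mark frequencies.
    count_a = Counter(marks_a)
    count_b = Counter(marks_b)
    expected = sum(count_a[m] * count_b[m] for m in count_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when both markers use a single mark")
    return (observed - expected) / (1 - expected)


# Hypothetical data: heights (cm) of five students measured on two occasions.
occasion_1 = [150.0, 155.5, 160.0, 162.5, 171.0]
occasion_2 = [150.5, 155.0, 160.5, 163.0, 170.5]
test_retest = pearson_r(occasion_1, occasion_2)

# Hypothetical data: two markers grading the same eight student responses.
marker_a = ["A", "B", "B", "C", "A", "C", "B", "A"]
marker_b = ["A", "B", "C", "C", "A", "C", "B", "B"]
agreement = cohen_kappa(marker_a, marker_b)

print(f"test-retest correlation = {test_retest:.3f}")
print(f"marker agreement (kappa) = {agreement:.3f}")
```

A correlation close to 1 suggests the two occasions rank students in nearly the same way, though it would not reveal a systematic shift such as everyone being measured slightly taller in the morning. Kappa is used here rather than raw percentage agreement because two markers who both award mostly the same grade will often agree by chance alone.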

About Futurelab

Futurelab is passionate about transforming the way people learn. Tapping into the huge potential offered by digital and other technologies, we are developing innovative learning resources and practices that support new approaches to education for the 21st century.

Working in partnership with industry, policy and practice, Futurelab:

• incubates new ideas, taking them from the lab to the classroom
• offers hard evidence and practical advice to support the design and use of innovative learning tools
• communicates the latest thinking and practice in educational ICT
• provides the space for experimentation and the exchange of ideas between the creative, technology and education sectors.

A not-for-profit organisation, Futurelab is committed to sharing the lessons learnt from our research and development in order to inform positive change to educational policy and practice.

Futurelab
1 Canons Road
Harbourside
Bristol BS1 5UH
United Kingdom
tel +44 (0)117 915 8200
fax +44 (0)117 915 8201
www.futurelab.org.uk

Registered charity 1113051

This publication is available to download from the Futurelab website – www.futurelab.org.uk/research/lit_reviews.htm

Also from Futurelab:

Literature Reviews and Research Reports: written by leading academics, these publications provide comprehensive surveys of research and practice in a range of different fields.

Handbooks: drawing on Futurelab's in-house R&D programme as well as projects from around the world, these handbooks offer practical advice and guidance to support the design and development of new approaches to education.

Opening Education Series: focusing on emergent ideas in education and technology, this series of publications opens up new areas for debate and discussion.

We encourage the use and circulation of the text content of these publications, which are available to download from the Futurelab website – www.futurelab.org.uk/research. For full details of our open access policy, go to www.futurelab.org.uk/open_access.htm.

Creative Commons

© Futurelab 2006. All rights reserved; Futurelab has an open access policy which encourages circulation of our work, including this report, under certain copyright conditions - however, please ensure that Futurelab is acknowledged. For full details of our Creative Commons licence, go to www.futurelab.org.uk/open_access.htm

Disclaimer

These reviews have been published to present useful and timely information and to stimulate thinking and debate. It should be recognised that the opinions expressed in this document are personal to the author and should not be taken to reflect the views of Futurelab. Futurelab does not guarantee the accuracy of the information or opinion contained within the review.

FUTURELAB SERIES

REPORT 10

ISBN: 0-9544695-8-5 Futurelab © 2004