Statistical Analyses for Language Testers

Rita Green

© Rita Green 2013
Foreword © J. Charles Alderson 2013
Softcover reprint of the hardcover 1st edition 2013 978-1-137-01827-4

All rights reserved. No reproduction, copy or transmission of this publication may be made without written permission.

No portion of this publication may be reproduced, copied or transmitted save with written permission or in accordance with the provisions of the Copyright, Designs and Patents Act 1988, or under the terms of any licence permitting limited copying issued by the Copyright Licensing Agency, Saffron House, 6–10 Kirby Street, EC1N 8TS.

Any person who does any unauthorized act in relation to this publication may be liable to criminal prosecution and civil claims for damages.

The author has asserted her right to be identified as the author of this work in accordance with the Copyright, Designs and Patents Act 1988.

First published 2013 by PALGRAVE MACMILLAN

Palgrave Macmillan in the UK is an imprint of Macmillan Publishers Limited, registered in England, company number 785998, of Houndmills, Basingstoke, Hampshire RG21 6XS.

Palgrave Macmillan in the US is a division of St Martin’s Press LLC, 175 Fifth Avenue, New York, NY 10010.

Palgrave Macmillan is the global academic imprint of the above companies and has companies and representatives throughout the world.

Palgrave® and Macmillan® are registered trademarks in the United States, the United Kingdom, Europe and other countries.

ISBN 978-1-137-01828-1
ISBN 978-1-137-01829-8 (eBook)
DOI 10.1057/9781137018298

This book is printed on paper suitable for recycling and made from fully managed and sustained forest sources. Logging, pulping and manufacturing processes are expected to conform to the environmental regulations of the country of origin.

A catalogue record for this book is available from the British Library.

A catalog record for this book is available from the Library of Congress.

Contents

Foreword by J. Charles Alderson
Introduction and Overview
Classical Test Theory versus Modern Test Theory
Acknowledgements
Symbols and Acronyms

1 Data Entry
2 Checking and Correcting Data Files
3 Item Analysis
4 Descriptive Statistics
5 Analysing Test Taker Feedback
6 Comparing Performance I: Means, Scatterplots and Correlations
7 Comparing Performance II: Parametric and Non-Parametric Analyses
8 Comparing Performance III: ANOVA
9 Factor Analysis
10 Creating a Control File and Convergence Table
11 Analysing the Convergence Table and Creating a Variable Map
12 Item and Person Statistics
13 Distracter Analysis
14 Creating and Running a Specifications File
15 Analysing the Iteration Report and Vertical Ruler
16 Rater and Item Measurement Reports

Appendix 1 Data Files
Appendix 2 Data Spreadsheet
Appendix 3 Item Analysis
Appendix 4 Descriptive Statistics
Appendix 5 Comparing Performance I: Means, Scatterplots and Correlations
Appendix 6 Comparing Performance II: Parametric and Non-Parametric Analyses
Appendix 7 Comparing Performance III: ANOVA
Appendix 8 Factor Analysis
Appendix 9 Creating a Control File, Convergence Table and Variable Map
Appendix 10 Item and Person Statistics
Appendix 11 Distracter Analysis
Appendix 12 Creating a Specifications File, Iteration Report and Vertical Ruler
Appendix 13 Rater and Assessment Criteria Measurement Reports

References and Further Reading
Index

Foreword

Language testing is both an art and a science, and language test developers need a range of skills and interests. Obviously an interest in and knowledge of language and languages, how languages ‘work’, how they are learnt and taught and how the various aspects of language use – the four skills of reading, writing, listening and speaking – can be described and developed, all these are essential for anybody with a professional involvement in language. Language learning and teaching usually appeal to those who have studied the humanities or the social sciences, rather than the ‘hard’ sciences, and therefore many language teachers approach the necessary task of assessing their learners either with trepidation or reluctance.

However, even such teachers will benefit from reading this book. It will provide them with an eye-opening experience which those who have to develop tests for their learners, for their institution or for their careers will surely also benefit from.

The title of this volume, Statistical Analyses for Language Testers, may appear as dry as dust and unattractive to those who hated maths at school. However, those who open the pages of this book out of curiosity or from a sense of professional responsibility will certainly benefit. They will rapidly be drawn into the fascinating world of using figures to understand much better how good their tests are. They will learn how to improve their tests on the basis of statistical analyses, and they will explore how ‘numbers’ can throw light on the art of test design, development and administration.

I have often thought that good test developers and insightful, creative item writers are probably born, rather than trained. However, this book shows very clearly how one can become a better test developer, with a more professional attitude to and understanding of what contributes to the quality, validity and reliability of a language test.

With the help of earlier drafts of this book, participants on the summer course at Lancaster have experienced how useful it can be to examine the results of a test. They have explored what makes a test item difficult or easy, why a learner might unexpectedly get an item wrong, and how to improve the reliability and meaningfulness of test results – through simple statistical analyses.

I used to tell my incoming Masters students at Lancaster that if they can do simple arithmetical operations like add, subtract, multiply and divide then they can easily use, understand and explore statistical procedures that can reveal even the deepest secrets of the tests they have constructed or used. When the author of the textbook they are using or the teacher of the course they are taking is as experienced a teacher and as clear an explainer of even the most complex concepts as Dr Rita Green is, then they are certainly in for a treat.

People of all ages, both students and teachers, from the UK, Europe, Asia and elsewhere – indeed from the four corners of the globe – have had the good luck to have attended Rita’s courses in language testing and especially in the use of statistics for language test development. Such learners have discovered how statistics can reveal all sorts of interesting things about test items, test tasks and test scores. And those who have not had or will not have the opportunity of being taught by Rita in person can now experience her clear expositions, her amusing exemplifications and her sheer good pedagogic sensitivity to the needs of her audience by working through this wonderfully clear and readable, practical, sensible and thoroughly enjoyable approach to statistical analyses for the language test developer.

Enjoy!

J. Charles Alderson
Professor of Linguistics and English

Introduction and Overview

Who is this book for?

I have dabbled with data analysis since the early 1980s when I first became interested in the field of language testing and since that time have used the Statistical Package for Social Sciences (SPSS) to help me investigate how items and tasks are performing. Since the 1990s I have taught SPSS to numerous students from countries around the world, many of them working on national and international projects including a number of high-stakes tests; others engaged in MA or PhD studies. Over the last ten years, I have added Winsteps and Facets to the programmes I teach my students.

It is these language testers for whom I have written this book and I have done this for a number of reasons. First, because I want to encourage the ‘everyday’ language test developer or item writer to embrace the insights that data analysis can offer them in their work. Second, although there are many books available which deal with both classical test theory (CTT) and modern test theory (MTT), few of them focus on readers from the field of language testing. Third, many of the current books are, I suspect, somewhat intimidating to the type of reader I have in mind. I feel there is a stepping stone which is missing as far as test developers are concerned; I see this book as filling that gap, providing a ‘taster’ of what is out there – something to work through and then decide whether you want to delve further into the mysteries of statistical analyses. And, of course, I hope you do.

You may already be thinking ... hmm, statistics ... not for me; or perhaps you have visions of school maths already swimming in front of your eyes. Before you take these thoughts further, read on. This book is not about mathematics; it is not about theoretical statistics per se – although of course I do discuss some of the concepts which are the foundation of the applied statistics on which this book is based.

The word ‘applied’ is crucial; it is the application of these analyses to the field of language testing which makes this book hopefully more accessible than others which have been written with sociologists, psychologists, economists and other types of scientists in mind. To this end the analyses carried out in this book are based on data which come from real tests, developed for real purposes, and the data are real data. The tests and questionnaires from which the data come are not perfect – this is not their purpose; they were chosen as vehicles to show you, the reader, how to apply and interpret the relevant statistical methods which will in turn provide insights into your own test development work.

The statistical packages

In the first part of the book, I use SPSS, and in the second part I use Winsteps and Facets, both created by Dr John Michael Linacre. Why these particular choices? First, a practical reason – as mentioned above, I have been working with SPSS for over 25 years from the days when it was referred to as SPSSx – a few of you may ‘fondly’ remember typing in commands at the DOS prompt; others will definitely not. Second, I believe SPSS is not only user-friendly but also well respected in the field of language testing and is one of the most commonly used programmes by non-psychometricians. This means it is likely to be a programme which is better known than others, more accessible than others and by extension more acceptable when making presentations on your findings. Third, the programme Winsteps is not only more accessible than many item response theory (IRT) based packages, it also works well with SPSS files – a crucial factor for me when I started using the programme in the late nineties. Fourth, choosing to work with Facets, which comes from the same source as Winsteps, made absolute sense when it came to analysing data sets with more than two facets such as those which occur in writing and speaking tests where you might want to take the rater or examiner into account as well as the performance and the test taker.

The organisation of the book

The first part of the book focuses on CTT; the second on MTT. The constraints of exploring what both types of test theory have to offer the language test developer within one book inevitably mean that only a limited range of the more important statistical analyses can be investigated; SPSS, Winsteps and Facets offer much, much more.

Chapters 1 and 2 provide a brief introduction to SPSS data entry and data correction. Chapters 3 to 9 then take the reader through a range of analyses providing useful insights for the language test developer. Chapter 10 shows the reader how to set up a control file in Winsteps while Chapters 11 to 13 concentrate on a few of the analyses that programme has to offer. Chapter 14 repeats this process by explaining how to prepare the data set for Facets and how to create a specifications file while Chapters 15 and 16 allow the reader to explore a few of the analyses this programme provides to those working with multi-faceted data.

Appendix 1 presents a full list of the data files used in the book; Appendix 2 contains a practice data set while Appendices 3 to 13 provide further opportunities to practise many of the analyses introduced in the book using additional data sets and including questions for you to answer as well as keys.

Each chapter focuses on one particular type of analysis, for example analysing how items are performing or investigating the relationship between two variables. Each chapter begins with an introduction as to why this particular analysis is important for the test developer and then provides some explanations about the terms and concepts which the reader will meet in the chapter. The method for carrying out each analysis is then described in a step-by-step manner; the main aspects of the output files are investigated and explained. I should add here that I do not see it as the purpose of the book to explain every single aspect of the SPSS, Winsteps or Facets output tables; I have deliberately chosen to discuss and explain what I feel is important, taking into account the targeted readership of this book.

I hope that this book will encourage you as language test developers to use statistics as a tool to help you understand your tests, to enable you to produce better tests and to empower you to explain to others the importance of carrying out such procedures as field trials and statistical analyses, given the high-stakes nature of many of the tests language testers are involved with. Above all, don’t leave the statistical analyses to others who have not been involved in the test development cycle; you will lose immeasurably by doing so in terms of test development and subsequent decision-making. Conversely, you will gain so much more by taking on the challenge statistical analyses offer you.

Good luck with the book and I do hope that you come to enjoy this aspect of language test development as much as I have.

Classical Test Theory versus Modern Test Theory

There are two broad ways of analysing test data: one uses what is referred to as the classical test theory (CTT) approach, and the other the modern test theory (MTT, also referred to as IRT) approach. Both have their advantages and disadvantages; my own preference is to use both wherever possible. However, sample size, available time, understanding of and access to the programmes also have to be factored into this decision.

In the field of language testing, CTT involves analysing test data in order to investigate such aspects as item difficulty, levels of discrimination, the contribution each item or part of a test makes to the test’s internal reliability, the relationship between various parts of a test or tests, the relationship between test taker characteristics and their performance on a test, to name but a few. Many of these analyses depend strongly on the correlation coefficient.

The data used to explore these analyses often come from a particular test population based on a particular set of test items administered at a particular point in time. The results of the statistical analyses can only be reliably interpreted in light of these factors, although generalisations are often made towards a larger test population if the population and circumstances are felt to be sufficiently representative. However, it is clear that there is a degree of dependency between item difficulty and test taker ability, particularly when small sample sizes are used. The item statistics may vary if given to another test population, and the scores achieved by the test takers may well be different if they were given another set of test items. With larger populations (n = 200+) such generalisations are easier to uphold and this is one reason why the sample size and representativeness of the test population are crucial factors in field trials.

As Hambleton et al. (1991: 2) succinctly put it: ‘examinee characteristics and test characteristics cannot be separated: each can be interpreted only in the context of the other’. This leaves the test developer in rather a difficult position; as Wright and Stone (1979: xi) note ‘... how do you interpret this measure beyond the confines of that set of items and those groups of children?’

IRT is based on probability theory: the chance of a person answering an item correctly is a function of his / her ability and the item’s difficulty (Henning 1987). In other words, a test taker with more ability has a better chance of answering an item correctly; similarly, an easy item is likely to be answered correctly by more people than a difficult one.
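
The relationship between ability, difficulty and the chance of success can be made concrete with a small sketch. It assumes the one-parameter (Rasch) form of IRT, in which ability and difficulty sit on the same log-odds scale; the function name and all the values below are illustrative assumptions and do not come from the data sets used in this book.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Chance of a correct answer under the Rasch model (assumed form).

    Both arguments are in log-odds units (logits). The function name is
    hypothetical and used for illustration only.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Illustrative values only: three test takers and two items, in logits.
abilities = {"low": -1.0, "middle": 0.0, "high": 1.0}
items = {"easy item": -1.0, "hard item": 1.0}

for person, ability in abilities.items():
    for item, difficulty in items.items():
        p = rasch_probability(ability, difficulty)
        print(f"{person:>6} ability, {item}: P(correct) = {p:.2f}")
```

Under this assumed model, a test taker whose ability equals the item's difficulty has an even chance of answering correctly; the chance rises as ability increases or as the item becomes easier.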

IRT provides a range of insights into the performance of an item or person. For example, it provides estimates of item difficulty and person ability, person and item reliabilities, and information concerning the amount of error associated with each item and person. By taking into account the ability estimate for each person, the difficulty measure for each item and the associated standard error, it is possible for the test developer to be 95 per cent confident of the range within which a person’s true ability or an item’s true difficulty lies. In other words, IRT makes it possible to estimate sample-free item difficulty and item-free person ability.

Obviously, where the degree of error associated with an item or person is high, for example when the items are too easy for a test taker and an accurate picture of their true ability cannot be obtained, the level of confidence the test developer can have in that particular person’s measure will not be high. Similarly, if items are trialled on an inappropriate test population, the information we obtain will not be accurate either. The appropriateness of item to person, just as in CTT, must be observed.

Having said this, it is clear that IRT has a great deal to offer and this being the case you might wonder why test developers continue to use CTT. The answer is largely a practical one: IRT comes with constraints. First of all, IRT programmes need relatively large numbers; a minimum of 200 cases and preferably 300 or more is needed and this is often simply not possible for the ‘everyday’ test developer. Second, until relatively recently the software and output were rather user-unfriendly and even now, for some users, still pale in comparison with the familiar environment in which packages like SPSS operate. Third, IRT programmes are based on the log-odds unit (logit) scale, which is not very familiar to most language test developers or stakeholders.
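
As a rough sketch of how a measure and its standard error combine, the example below builds an approximate 95 per cent interval as the measure plus or minus 1.96 standard errors, which assumes roughly normally distributed error. It also converts a logit to a probability to show what the unfamiliar scale means. The function names and the two hypothetical test takers are assumptions made for illustration; they are not output from any of the programmes discussed here.

```python
import math
from typing import Tuple


def confidence_interval_95(measure: float, standard_error: float) -> Tuple[float, float]:
    """Approximate 95% interval: measure +/- 1.96 standard errors.

    Assumes roughly normal error; values are in logits.
    """
    margin = 1.96 * standard_error
    return (measure - margin, measure + margin)


def logit_to_probability(logit: float) -> float:
    """Convert a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical measures (logits) and standard errors, for illustration only.
persons = [
    ("well-targeted test taker", 0.50, 0.25),
    ("poorly targeted test taker", 3.20, 0.90),
]

for label, measure, se in persons:
    low, high = confidence_interval_95(measure, se)
    print(f"{label}: measure {measure:.2f}, 95% interval {low:.2f} to {high:.2f} logits")

# A difference of 0 logits between person and item is an even chance.
print(f"0 logits corresponds to a probability of {logit_to_probability(0.0):.2f}")
```

The larger standard error of the second, hypothetical test taker produces a much wider interval, which is the situation described above in which items are too easy for a person and the measure can be trusted less.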
Despite this, IRT is still very much worth considering as an alternative or additional option for analysing test data and is particularly useful in dealing with multi-faceted data sets such as those resulting from writing and speaking tests, where such facets as the relationships between test taker, task, rater / interlocutor / examiner and rating scale can all be analysed on the same equal interval scale.

Which approach you choose to use will very much depend on your own needs, the data you have and access to the necessary software, but the purpose of this book is to introduce you to both so that you can then make an informed decision.

Acknowledgements

I would like to start by thanking my colleagues and students from around the world for their feedback on previous versions of these materials. The list is too long to include you all but you know who you are and my heartfelt thanks go to you. A particular mention goes to Charles Alderson, , Eszter Benke, Miguel Fernandez, Irene Thelen-Schaefer, Caroline Shackleton and Nathan Turner. A very special thanks also goes to Charles Alderson who not only introduced me to the world of data analysis in the mid-1980s when he asked me to analyse 1000 data cases as part of my MA thesis, but who has also been a driving force behind turning these teaching materials into a book.

My grateful thanks also go to Mike Linacre for his unstinting support and for always being there to answer questions at whatever time of the day they arrived.

The book is data driven and without the support of the teams I have worked with the book could not be what it is today. I would particularly like to express my gratitude to the following: Carol Spoettl of the Projekt Standardisierte Reifeprüfung, University of Innsbruck, Austria; the Bundes Institut, Zentrum für Innovation und Qualitätsentwicklung (Bifie), Austria; Graham Hyatt of the Länderverbundprojekt VerA 6, Germany and Yelena Yerznkyan and Karine Poghosyan of the Testing Unit, Department of Assessment at the National Institute of Education in Armenia.

Reprint of SPSS screen images courtesy of International Business Machines Corporation, © SPSS, Inc., an IBM Company. Reproduction of screen shots and output tables from Winsteps and Facets, including material from the Winsteps and Facets Manuals and Help files, granted by Dr John Michael Linacre.

Symbols and Acronyms

Symbols

Indicates the level of difficulty: the more boots, the more difficult!

Helps to show the way by explaining any new terminology coming up in the next chapter

Provides an explanation about a table or a figure

Questions

Key to the exercises

Acronyms

CAID Cronbach’s Alpha if item deleted
CEFR Common European Framework of Reference
CITC corrected item-total correlation
CTT classical test theory
DVW Data View Window
IRT item response theory
MNSQ mean square
MTT modern test theory
SPSS Statistical Package for Social Sciences
VVW Variable View Window
ZSTD standardised fit statistic