
A COMPILATION MATERIAL OF ENGLISH TESTING

Compiled By: BERTARIA SOHNATA HUTAURUK

English Education Study Program (Prodi Pendidikan Bahasa Inggris), Faculty of Teacher Training and Education, Universitas HKBP Nommensen Pematangsiantar, 2015

INTRODUCTION

This book is a compilation of material for English Language Testing. It presents a general outline of introductory material on English Language Testing compiled for students in the S1 (undergraduate) degree program. The collection covers Testing, Assessment, Measurement, Evaluation, Kinds of Testing, Validity and Reliability of Tests, and Interpreting Test Scores. Hopefully, this compilation will be useful for students; it is not yet perfect, so any criticism is welcome.

Compiled by: Bertaria Sohnata Hutauruk

CONTENTS

1. What is the difference between assessment and evaluation?…………. 1

2. Testing, Assessment, Measurement and Evaluation…………………... 4

3. Informal vs. Formal Assessments: Tests are not the only end-all-be-all

of how we assess.………………………………………………………….. 6

4. Norm-referenced test and Criterion-referenced test…………………... 11

5. Discrete Point Testing and Integrative Testing………………………… 19

6. Communicative Language Testing……………………………………… 22

7. Testing Communicative Competence…………………………………… 24

8. Testing Reading and Writing……………………………………………. 30

9. Performance-Based Assessment……………………………………….... 32

10. Validity and Reliability…………………………………………………... 42

11. Constructing Tests……………………………………………………….. 61

12. Types of Listening Testing………………………………………………. 75

13. Testing Grammar………………………………………………………… 97

14. Interpreting Test Score…………………………………………………... 103

1

What is the difference between assessment and evaluation?

There is a lot of confusion over these two terms as well as other terms associated with assessment, testing, and evaluation. The big difference can be summarized as this: assessment is information gathered by the teacher and student to drive instruction, while evaluation is when a teacher uses some instrument (such as the CMT or an end-of-unit test) to rate a student so that this information can be used to compare or sort students. Assessment is for the student and the teacher in the act of learning while evaluation is usually for others.

“If mathematics teachers were to focus their efforts on classroom assessment that is primarily formative in nature, students’ learning gains would be impressive. These efforts would include gathering data through classroom questioning and discourse, using a variety of assessment tasks, and attending primarily to what students know and understand” (Wilson & Kenney, page 55).

Assessment is far more important because it is integral to instruction. Unfortunately, it is being hampered by the demands of evaluation. The biggest demand for evaluation is grading, or report cards. There should not be a problem with that, except that historically evaluation (grades) was determined exclusively by computing a student's numeric average on paper-and-pencil assessments called quizzes or tests.

“Most experienced teachers will say that they know a great deal about their students in terms of what the students know, how they perform in different situations, their attitudes and beliefs, and their various levels of skill attainment. Unfortunately, when it comes to grades, they often ignore this rich storehouse of information and rely on test scores and rigid averages that tell only a small fraction of the story.

The myth of grading by statistical number crunching is so firmly ingrained in schooling at all levels that you may find it hard to abandon. But it is unfair to students, to parents, and to you as the teacher to ignore all of the information you get almost daily from a problem-based approach in favor of a handful of numbers based on tests that usually focus on low-level skills” (Van de Walle and Lovin, page 35).

The reason this is a problem is that students learn what is valued, and they strive to do well on those things. If the end-of-unit tests are what determine the grade, guess what kids want to do well on: the end-of-unit test! You can do all the great activities you want, but if the bottom line is the test, then that is what is going to be valued most by everyone: teachers, students, and parents alike.

What we need to get better at is valuing the day-to-day activities we do and learn how to use them for both assessment and evaluation.

This will not be an easy task.

It is very different from what we are used to doing. We are used to teaching and then assessing. In reality, the line between teaching and assessment should be blurred (NCTM, 2000). “Interestingly, in some languages, learning and teaching are the same word” (Fosnot and Dolk, page 1). We need to assess on a daily basis to get the information we need to make choices about what to teach the next day. If we just teach the whole unit and wait until the end-of-unit test to find out what the kids know, we may be very unhappily surprised. On the other hand, if we are assessing on a daily basis throughout the unit, we do not need to average all those assessments to come up with a final evaluation. Instead, we could just use the most recent assessments to make that evaluation. In this way, we do not penalize the student who did not know much at the beginning of the unit and worked really hard to learn what we felt were the big ideas. Instead, we rate students on where they are when the unit is finished. This gives a more accurate report or evaluation of where they are performing when the evaluation is made.
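The contrast between averaging every score and evaluating from the most recent evidence of learning can be made concrete with a short sketch. The fragment below is only an illustration of the idea, not a procedure prescribed in this compilation; the scores, the function names, and the three-assessment window are all invented for the example.

```python
# Illustrative sketch: two ways of turning a unit's worth of assessment scores
# into a final evaluation. The scores and the three-assessment window are
# hypothetical.

def average_of_all(scores):
    """Traditional grading: every score counts equally, so a weak start
    permanently lowers the grade."""
    return sum(scores) / len(scores)

def most_recent_evidence(scores, window=3):
    """Evaluation based on where the student finished: only the last few
    assessments of the unit are considered."""
    recent = scores[-window:]
    return sum(recent) / len(recent)

# A student who started weak but had mastered the big ideas by the end of the unit.
unit_scores = [45, 55, 60, 78, 88, 92]

print(round(average_of_all(unit_scores), 1))        # 69.7 - penalizes the slow start
print(round(most_recent_evidence(unit_scores), 1))  # 86.0 - reflects end-of-unit performance
```

The point of the sketch is simply that the two rules can produce very different evaluations for the same student, which is the argument made in the paragraph above.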

2

Testing, Assessment, Measurement and Evaluation

The definition of each is as follows:

Test: a method to determine a student’s ability to complete certain tasks or demonstrate mastery of a skill or knowledge of content. Some types would be multiple-choice tests or a weekly spelling test. While it is commonly used interchangeably with assessment or even evaluation, it can be distinguished by the fact that a test is one form of an assessment.

Assessment: the process of gathering information to monitor progress and make educational decisions if necessary. As noted, an assessment may include a test, but it also includes methods such as observations, interviews, behavior monitoring, etc.

Evaluation: procedures used to determine whether the subject (i.e., the student) meets preset criteria, such as qualifying for special education services. Evaluation uses assessment (remember that an assessment may be a test) to make a determination of qualification in accordance with predetermined criteria.

Measurement, beyond its general definition, refers to the set of procedures, and the principles for how to use those procedures, in educational evaluation. Examples would be raw scores, percentile ranks, derived scores, standard scores, etc.
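A short sketch can make these derived scores concrete. The fragment below is an illustration only; the class scores are invented, and the percentile-rank convention used (the percentage of scores at or below the raw score) is one of several conventions in use.

```python
# Illustrative sketch: converting a raw score into a standard score (z-score)
# and a percentile rank. The class scores are hypothetical.
from statistics import mean, pstdev

def z_score(raw, scores):
    """Standard score: how many standard deviations the raw score lies
    above or below the group mean."""
    return (raw - mean(scores)) / pstdev(scores)

def percentile_rank(raw, scores):
    """Percentile rank: the percentage of scores in the group that fall
    at or below the raw score."""
    at_or_below = sum(1 for s in scores if s <= raw)
    return 100 * at_or_below / len(scores)

class_scores = [52, 61, 64, 70, 73, 75, 78, 81, 85, 91]  # raw scores for a class of ten

print(round(z_score(81, class_scores), 2))  # about 0.72: well above the class mean
print(percentile_rank(81, class_scores))    # 80.0: at or above 80% of the class
```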

3

Informal vs. Formal Assessments: Tests are not the only end-all-be-all of how we assess.

Formal assessment

Formal assessment uses formal tests or structured continuous assessment to evaluate a learner's level of language. It can be compared to informal assessment, which involves observing the learners' performance as they learn and evaluating them from the data gathered.

Example: At the end of the course, the learners have a final exam to see if they pass to the next course or not. Alternatively, the results of a structured continuous assessment process are used to make the same decision.

In the classroom: Informal and formal assessments are both useful for making valid and useful assessments of learners' knowledge and performance. Many teachers combine the two, for example by evaluating one skill using informal assessment such as observing group work, and another using formal tools, for example a discrete item grammar test.

Formative assessment

Formative assessment is the use of assessment to give the learner and the teacher information about how well something has been learnt so that they can decide what to do next. It normally occurs during a course. Formative assessment can be compared

with summative assessment, which evaluates how well something has been learnt in order to give a learner a grade.

Example: The learners have just finished a project on animals, which had as a language aim better understanding of the use of the present simple to describe habits. The learners now prepare gap-fill exercises for each other based on some of their texts. They analyse the results and give each other feedback.

In the classroom: One of the advantages of formative feedback is that peers can give it. Learners can test each other on language they have been learning, with the additional aim of revising the language themselves. It has once been said that “Everybody is a genius. But if you judge a fish by its ability to climb a tree, it will live its whole life believing that it is stupid.” Our students must be assessed relative to what their skills are. This can be done through formal assessments, informal assessments, or a combination of both.

I realized that beyond giving formal assessments (i.e., summative assessments: quizzes, long tests, periodical exams, etc.), our main role as teachers is determined by

how we recognize our students’ progress or stagnation through informal assessments (i.e., formative assessments: portfolios, role play, record tracking, etc.). These methods allow the teacher to easily steer where and how his/her instruction is going.

The result of a formal test (e.g., a long test) alone does not necessarily reflect the entire academic ability of our students. It does not mean that when a student fails a formal test (e.g., a periodical test), we can conclude that his or her entire learning capability for that subject has failed as well.

Assessing students is not monopolized by just doing it formally (e.g. giving out tests, quizzes, summative exams, etc.), but rather depends on the other informal assessments (e.g. coaching sessions, reflective logs, fly-by-question and answers, etc.) that reinforce formal ones.

There are many reasons why a student could fail a test (e.g., lack of sleep, emotional and family distress, etc.), but there would be only a few reasons why he/she would not be able to provide a reflective insight on the lesson. But how do we separate formal assessments from informal ones?


When are informal assessments useful (versus formal assessments)?

The most applicable time to use informal assessments is when:

1. We want to gauge the students’ cognitive, affective and manipulative skills in the simplest way possible. We ask students to recite or write down essays to easily determine if they understood a specific lesson well or poorly, if they are enthusiastic or bored with the lesson, if they are already familiar or completely unfamiliar with the topic, etc.

2. We deem that the results of the formal examinations are not enough to give a concluding mark for the students’ performance. If a specific student performs excellently in class activities but suddenly fails a summative test, it could tell us that there is a deviation between our formal and our informal assessments, or that other factors might have been involved in such an event (e.g., student factors: did not review, physically/emotionally troubled, etc.).

How valuable are informal assessments? Can informal assessments be good replacements for formal assessments?

Although informal assessments provide teachers with a solid basis for how the students are performing, this does not imply that they can replace formal assessments. They should work hand-in-hand and interdependently. One should complement the other. For instance, if we opt to use role plays and recitals in assessing students’ communication skills informally, we should also align our formal exams with the activities our students previously engaged in. In this way, we could ensure the validity and fairness of our assessments. Moreover, we could find that these methods relieve our burden of analyzing, comparing, and understanding our students’ “true” abilities.

We cannot just give (formal) tests or quizzes, in the same manner that we cannot just consume course time giving out (informal) class activities. Arriving at valid and reliable grades for our students is a matter of maximizing both formal and informal assessments.

To summarize: informal assessment is systematic observation, which means knowing what, when, and where we are going to assess, plus establishing criteria for how students will be assessed.

4

Norm-referenced test and Criterion-referenced test

A norm-referenced test (NRT) is a type of test, assessment, or evaluation which yields an estimate of the position of the tested individual in a predefined population, with respect to the trait being measured. The estimate is derived from the analysis of test scores and possibly other relevant data from a sample drawn from the population. That is, this type of test identifies whether the test taker performed better or worse than other test takers, not whether the test taker knows either more or less material than is necessary for a given purpose. The term normative assessment refers to the process of comparing one test-taker to his or her peers. Norm-referenced assessment can be contrasted with criterion-referenced assessment and ipsative assessment. In a criterion-referenced assessment, the score shows whether or not test takers performed well or poorly on a given task, not how that compares to other test takers; in an ipsative assessment, test takers are compared to their previous performance. As an alternative to normative testing, tests can be ipsative, in which an individual's assessment is compared to his or her own performance through time.

By contrast, a test is criterion-referenced when provision is made for translating the test score into a statement about the behavior to be expected of a person with that score. The same test can be used in both ways. Robert Glaser originally coined the terms norm-referenced test and criterion-referenced test.

Standards-based education reform is based on the belief that public education should establish what every student should know and be able to do. Students should be tested against a fixed yardstick, rather than against each other or sorted into a mathematical bell curve.

By requiring that every student pass these new, higher standards, education officials believe that all students will achieve a diploma that prepares them for success in the 21st century. Most state achievement tests are criterion-referenced. In other words,

a predetermined level of acceptable performance is developed and students pass or fail in achieving or not achieving this level. Tests that set goals for students based on the average student's performance are norm-referenced tests. Tests that set goals for students based on a set standard (e.g., 80 words spelled correctly) are criterion-referenced tests.

Many college entrance exams and nationally used school tests use norm-referenced tests. The SAT, Graduate Record Examination (GRE), and Wechsler Intelligence Scale for Children (WISC) compare individual student performance to the performance of a normative sample. Test takers cannot "fail" a norm-referenced test, as each test taker receives a score that compares the individual to others that have taken the test, usually given as a percentile. This is useful when there is a wide range of acceptable scores that is different for each college.

By contrast, nearly two-thirds of US high school students will be required to pass a criterion-referenced high school graduation examination. One high fixed score is set at a level adequate for university admission whether the high school graduate is college bound or not. Each state gives its own test and sets its own passing level, with states like Massachusetts showing very high pass rates, while in Washington State, even average students are failing, as well as 80 percent of some minority groups. This practice is opposed by many in the education community such as Alfie Kohn as unfair to groups and individuals who score lower than others.

Advantages and limitations

An obvious disadvantage of norm-referenced tests is that they cannot measure progress of the population as a whole, only where individuals fall within the whole. Thus, only measurement against a fixed goal can be used to gauge the success of an educational reform program that seeks to raise the achievement of all students against new standards that seek to assess skills beyond choosing among multiple choices. However, while this is attractive in theory, in practice the bar has often been moved in the face of excessive failure rates, and improvement sometimes occurs simply because of familiarity with and teaching to the same test.

With a norm-referenced test, grade level was traditionally set at the level achieved by the middle 50 percent of scores. By contrast, the National Children's Reading Foundation believes that it is essential to assure that virtually all children read at or above grade level by third grade, a goal which cannot be achieved with a norm-referenced definition of grade level.

Advantages to this type of assessment include that students and teachers know what to expect from the test and just how the test will be conducted and graded. Likewise, all schools will conduct the exam in the same manner, reducing such inaccuracies as time differences or environmental differences that may cause distractions to the students. This also makes these assessments fairly accurate as far as results are concerned, a major advantage for a test.

Critics of criterion-referenced tests point out that judges set bookmarks around items of varying difficulty without considering whether the items actually are compliant with grade-level content standards or are developmentally appropriate. Thus, the original 1997 sample problems published for the WASL 4th grade mathematics contained items that were difficult for college-educated adults, or easily solved with 10th-grade-level methods such as similar triangles. The difficulty level of the items themselves and the cut-scores used to determine passing levels are also changed from year to year. Pass rates also vary greatly from the 4th to the 7th and 10th grade graduation tests in some states.

One of the limitations of No Child Left Behind is that each state can choose or construct its own test, which cannot be compared to any other state's. A Rand study of Kentucky results found indications of artificial inflation of pass rates which were not reflected in increasing scores on other tests, such as the NAEP or SAT, given to the same student populations over the same time. Graduation test standards are typically set at a level consistent with native-born four-year university applicants. An unusual side effect is that while colleges often admit immigrants with very strong math skills who may be deficient in English, there is no such leeway in high school graduation tests, which usually require passing all sections, including language. Thus, it is not unusual for institutions like the University of Washington to admit strong Asian American or Latino students who did

not pass the writing portion of the state WASL test, but such students would not even receive a diploma once the testing requirement is in place.

Although the tests such as the WASL are intended as a minimal bar for high school, 27 percent of 10th graders applying for Running Start in Washington State failed the math portion of the WASL. These students applied to take college level courses in high school, and achieve at a much higher level than average students. The same study concluded the level of difficulty was comparable to, or greater than that of tests intended to place students already admitted to the college.

A norm-referenced test has none of these problems because it does not seek to enforce any expectation of what all students should know or be able to do other than what actual students demonstrate. Present levels of performance and inequity are taken as fact, not as defects to be removed by a redesigned system. Goals of student performance are not raised every year until all are proficient. Scores are not required to show continuous improvement through Total Quality Management systems. Disadvantages include that such assessments measure the level students are currently at by comparing them with where their peers currently are, rather than against the level that students should reach.

A rank-based system produces only data that tell which students perform at an average level, which students do better, and which students do worse. This contradicts the fundamental belief, whether optimistic or simply unfounded, that all will perform at one uniformly high level in a standards-based system if enough incentives and punishments are put into place. This difference in beliefs underlies the most significant differences between a traditional and a standards-based education system.

Examples

1. IQ tests are norm-referenced tests, because their goal is to see which test taker is more intelligent than the other test takers.

2. Theater auditions and job interviews are norm-referenced tests, because their goal is to identify the best candidate compared to the other candidates, not to determine how many of the candidates meet a fixed list of standards.

A criterion-referenced test is one that provides for translating test scores into a statement about the behavior to be expected of a person with that score or their relationship to a specified subject matter. Most tests and quizzes that are written by school teachers can be considered criterion-referenced tests. The objective is simply to see whether the student has learned the material. Criterion-referenced assessment can be contrasted with norm-referenced assessment and ipsative assessment.

A common misunderstanding regarding the term is the meaning of criterion. Many, if not most, criterion-referenced tests involve a cutscore, where the examinee passes if their score exceeds the cutscore and fails if it does not (often called a mastery test). The criterion is not the cutscore; the criterion is the domain of subject matter that the test is designed to assess. For example, the criterion may be "Students should be able to correctly add two single-digit numbers," and the cutscore may be that students should correctly answer a minimum of 80% of the questions to pass.
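The distinction between the criterion (the subject-matter domain) and the cutscore (the passing threshold) can be illustrated with a small sketch. The code below is only an assumed example built on the single-digit addition scenario above; the item counts are invented.

```python
# Illustrative sketch of a mastery (criterion-referenced) decision: the
# criterion is the domain (adding two single-digit numbers); the cutscore is
# the 80% threshold. Item counts are hypothetical.

def mastery_decision(num_correct, num_items, cutscore=0.80):
    """Pass or fail against a fixed cutscore, regardless of how anyone
    else in the group performed."""
    proportion = num_correct / num_items
    return "pass" if proportion >= cutscore else "fail"

# Two examinees on a hypothetical 20-item single-digit addition test.
print(mastery_decision(17, 20))  # 85% correct -> "pass"
print(mastery_decision(15, 20))  # 75% correct -> "fail"
```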

The criterion-referenced interpretation of a test score identifies its relationship to the subject matter. In the case of a mastery test, this does mean identifying whether the examinee has "mastered" a specified level of the subject matter by comparing their score to the cutscore. However, not all criterion-referenced tests have a cutscore, and the score can simply refer to a person's standing on the subject domain. The ACT is an example of this; there is no cutscore, it simply is an assessment of the student's knowledge of high-school level subject matter. Because of this common misunderstanding, criterion-referenced tests have also been called standards-based assessments by some education agencies, as students are assessed with regard to standards that define what they "should" know, as defined by the state.

Comparison of criterion-referenced and norm-referenced tests

Both terms, criterion-referenced and norm-referenced, were originally coined by Robert Glaser. Unlike a criterion-referenced test, a norm-referenced test indicates whether the test-taker did better or worse than other people who took the test. For example, if the criterion is "Students should be able to correctly add two single-digit numbers," then reasonable test questions would simply ask the student to add two single-digit numbers. A criterion-referenced test would report the student's performance strictly according to whether the individual student correctly answered these questions. A norm-referenced test would report primarily whether this student correctly answered more questions compared to other students in the group. Even when testing similar topics, a test which is designed to accurately assess mastery may use different questions than one which is intended to show relative ranking. This is because some questions are better at reflecting actual achievement of students, and some test questions are better at differentiating between the best students and the worst students. (Many questions will do both.) A criterion-referenced test will use questions which were correctly answered by students who know the specific material. A norm-referenced test will use questions which were correctly answered by the "best" students and not correctly answered by the "worst" students (e.g., Cambridge University's pre-entry 'S' paper). Some tests can provide useful information about both actual achievement and relative ranking. The ACT provides both a ranking and an indication of what level is considered necessary for likely success in college. Some argue that the term "criterion-referenced test" is a misnomer, since it can refer to the interpretation of the score as well as the test itself. In the previous example, the same score on the ACT can be interpreted in a norm-referenced or criterion-referenced manner.

Sample scoring for the history question: What caused World War II?

Student #1's answer: WWII was caused by Hitler and Germany invading Poland.
Criterion-referenced assessment: This answer is correct.
Norm-referenced assessment: This answer is worse than Student #2's answer, but better than Student #3's answer.

Student #2's answer: WWII was caused by multiple factors, including the Great Depression and the general economic situation, the rise of nationalism, fascism, and imperialist expansionism, and unresolved resentments related to WWI. The war in Europe began with the German invasion of Poland.
Criterion-referenced assessment: This answer is correct.
Norm-referenced assessment: This answer is better than Student #1's and Student #3's answers.

Student #3's answer: WWII was caused by the assassination of Archduke Ferdinand.
Criterion-referenced assessment: This answer is wrong.
Norm-referenced assessment: This answer is worse than Student #1's and Student #2's answers.

Relationship to high-stakes testing

Many high-profile criterion-referenced tests are also high-stakes tests, where the results of the test have important implications for the individual examinee. Examples of this include high school graduation examinations and licensure testing where the test must be passed to work in a profession, such as to become a physician or attorney. However, being a high-stakes test is not specifically a feature of a criterion-referenced test. It is instead a feature of how an educational or government agency chooses to use the results of the test.

Examples

1. Driving tests are criterion-referenced tests, because their goal is to see whether the test taker is skilled enough to be granted a driver's license, not to see whether one test taker is more skilled than another test taker.

2. Citizenship tests are usually criterion-referenced tests, because their goal is to see whether the test taker is sufficiently familiar with the new country's history and government, not to see whether one test taker is more knowledgeable than another test taker.

5

Discrete Point Testing and Integrative Testing

Electronic quiz tools usually involve a discrete point approach to testing as opposed to an integrated or authentic approach, such as papers and projects. Discrete point tests are made up of test questions each of which is meant to measure one content point. Discrete point testing is associated with multiple choice and true/false formats, which have been criticized for testing only recognition knowledge and facilitating guessing and cheating. However, if they are used for an appropriate PURPOSE and if the test questions are well constructed, discrete point tests can be used for effective teaching and learning.

Should language be tested by discrete points or by integrative testing? Traditionally, language tests have been constructed on the assumption that language can be broken down into its component parts and that those component parts can be duly tested. What is discrete point testing? Language is segmented into many small linguistic points and the four language skills of listening, speaking, reading and writing. Test questions are designed to test these skills and linguistic points. A discrete point test consists of many questions on a large number of linguistic points, but each question tests only one linguistic point. Examples of discrete point tests are:

1. Phoneme recognition
2. Yes/No, True/False answers
3. Spelling
4. Word completion
5. Grammar items
6. Multiple choice tests

Such tests have a downside in that they take language out of context and usually bear no relationship to the concept or use of whole language. Discrete point testing met with some criticism, particularly in view of more recent trends toward viewing the units of language in terms of their communicative nature and purpose rather than viewing language as the arithmetic sum of all its parts. That is why John Oller (1976) introduced "integrative testing".

According to him, “language competence is a unified set of interacting abilities which cannot be separated apart and tested adequately” (Oller, 1979:37). “Whereas discrete items attempt to test knowledge of language one bit at a time, integrative tests attempt to assess a learner's capacity to use many bits all at the same time, and possibly while exercising several presumed components of a grammatical system, and perhaps more than one of the traditional skills or aspects of skills.” Therefore, communicative competence is so global and requires such “integration” for its “pragmatic” use in the real world that it cannot be captured in additive tests of grammar or reading or vocabulary and other discrete points of language. This emphasizes the simultaneous testing of the testee's multiple linguistic competences from various perspectives. Examples of integrative tests are:

1. Cloze tests
2. Dictation
3. Translation
4. Essays and other coherent writing tasks
5. Oral interviews and conversation
6. Reading, or other extended samples of real text

Oller (1979:38) has refined the integrative concept further by proposing what he calls the pragmatic test. A pragmatic test is “...any procedure or task that causes the learner to process sequences of elements in a language that conform to the normal contextual constraints of that language, and which requires the learner to relate sequences of elements via pragmatic mappings to extralinguistic contexts.” A step in a positive direction would be to concentrate on tests of communicative competence. Although the recent direction of linguistic study has been toward viewing language as an integrated and pragmatic skill, we cannot be certain that a test like a cloze test meets the criterion of predicting or assessing a unified and integrated underlying linguistic competence, so we must be cautious in selecting and constructing tests of language. There is nothing wrong with using traditional tests of discrete points of language, especially in achievement and other classroom-oriented testing in which certain discrete points are very important.

6

Communicative Language Testing

The notion of communicative competence is broad and needs to be fully understood before being considered as a basis for a research testing regime. As previously indicated, assessment can be viewed in terms of two distinct paradigms. 1) The Psychometric-Structuralist era: testing is based on discrete linguistic points related to the four language skill areas of reading, writing, speaking and listening. 2) The Psycholinguistic-Sociolinguistic era: integrative tests were conceived in response to the language proficiency limitations associated with discrete point testing. According to Oller (in Weir, 1988), integrative testing could measure the ability to integrate disparate language skills in ways that more closely resembled the actual process of language use. The communicative paradigm is founded on the notion of competence. According to Morrow (in Weir, 1988, p. 8), communicative language testing should be concerned with: 1) what the learner knows about the form of the language and how to use it appropriately in context (competence); and 2) the extent to which the learner is able to demonstrate this knowledge in a meaningful situation (performance), i.e., what he can do with the language. Performance testing should therefore be representative of a real-life situation where an integration of communicative skills is required. The performance test criteria should relate closely to the effective communication of ideas in that context. Weir emphasises the importance of context and related tasks as an important dimension in communicative (performance) testing (ibid., p. 11). In conclusion, a variety of different tests are required for a range of different purposes, and the associated instruments are no longer uniform in content or method.

In recognising the broad definitions of communication, Carroll (Testing Communicative Performance, 1980) adopts a rationalist approach to test requirement definition. The

basis of the methodology therefore is a detailed analysis, including the identification of events and activities (communication functions) that drive the communicative need. Having identified the test requirements, they are divided between the principal communicative domains of speaking, listening, writing and reading.

This approach is no doubt reminiscent of the requirements definition related to English for Specific Purposes (ESP), i.e., functional language appropriate for tourists, students, lawyers, etc. However, this strategy (and its associated methodology) would seem inappropriate in the given research context for the following salient reasons:

1. It is not practical to undertake a meaningful needs analysis for all participants.
2. The entire process is far too complex and labour intensive.
3. ESP is not aimed at marginalised communities or children.

Sabria and Samer (other students) have pointed me in the direction of Cambridge ToEFL exams (conformant with the Common European Framework of Reference for Languages) as a potential basis for communicative testing. The tests are divided into the 4 principal language dimensions (Speaking, Listening, Writing and Reading) and provide tests and marking criteria at all levels of competency including that for the research context (Young Learners English – YLE starters).

7

Testing Communicative Competence

Testing language has traditionally taken the form of testing knowledge about language, usually the testing of knowledge of vocabulary and grammar. However, there is much more to being able to use language than knowledge about it. Dell Hymes proposed the concept of communicative competence. He argued that a speaker can be able to produce grammatical sentences that are completely inappropriate. In communicative competence, he included not only the ability to form correct sentences but to use them at appropriate times. Since Hymes proposed the idea in the early 1970s, it has been expanded considerably, and various types of competencies have been proposed. However, the basic idea of communicative competence remains the ability to use language appropriately, both receptively and productively, in real situations.

The Communicative Approach to Testing

What Communicative Language Tests Measure

Communicative language tests are intended to be a measure of how the testees are able to use language in real life situations. In testing productive skills, emphasis is placed on appropriateness rather than on ability to form grammatically correct sentences. In testing receptive skills, emphasis is placed on understanding the communicative intent of the speaker or writer rather than on picking out specific details. And, in fact, the two are often combined in communicative testing, so that the testee must both comprehend

and respond in real time. In real life, the different skills are not often used entirely in isolation. Students in a class may listen to a lecture, but they later need to use information from the lecture in a paper. In taking part in a group discussion, they need to use both listening and speaking skills. Even reading a book for pleasure may be followed by recommending it to a friend and telling the friend why you liked it.

The "communicativeness" of a test might be seen as being on a continuum. Few tests are completely communicative; many tests have some element of communicativeness. For example, a test in which testees listen to an utterance on a tape and then choose from among three choices the most appropriate response is more communicative than one in which the testees answer a question about the meaning of the utterance. However, it is less communicative than one in which the testees are face- to-face with the interlocutor (rather than listening to a tape) and are required to produce an appropriate response.

Tasks

Communicative tests are often very context-specific. A test for testees who are going to British universities as students would be very different from one for testees who are going to their company's branch office in the United States. If at all possible, a communicative language test should be based on a description of the language that the testees need to use. Though communicative testing is not limited to English for Specific Purposes situations, the test should reflect the communicative situation in which the testees are likely to find themselves. In cases where the testees do not have a specific purpose, the language that they are tested on can be directed toward general social situations where they might be in a position to use English.

This basic assumption influences the tasks chosen to test language in communicative situations. A communicative test of listening, then, would test not whether the testee can understand what the utterance "Would you mind putting the groceries away before you leave?" means, but would place it in a context and see if the testee can respond appropriately to it.

If students are going to be tested over communicative tasks in an achievement test situation, it is necessary that they be prepared for that kind of test, that is, that the course material cover the sorts of tasks they are being asked to perform. For example, you cannot expect testees to perform such functions as requests and apologies appropriately and evaluate them on it if they have been studying from a structural syllabus. Similarly, if they have not been studying writing business letters, you cannot expect them to write a business letter for a test.

Tests intended to test communicative language are judged, then, on the extent to which they simulate real life communicative situations rather than on how reliable the results are. In fact, there is an almost inevitable loss of reliability as a result of the loss of control in a communicative testing situation. If, for example, a test is intended to test the ability to participate in a group discussion for students who are going to a British university, it is impossible to control what the other participants in the discussion will say, so not every testee will be observed in the same situation, which would be ideal for test reliability. However, according to the basic assumptions of communicative language testing, this is compensated for by the realism of the situation.

Evaluation

There is necessarily a subjective element to the evaluation of communicative tests. Real life situations don't always have objectively right or wrong answers, and so band scales need to be developed to evaluate the results. Each band has a description of the quality (and sometimes quantity) of the receptive or productive performance of the testee.

Examples of Communicative Test Tasks

Speaking/Listening

Information gap. An information gap activity is one in which two or more testees work together, though it is possible for a confederate of the examiner rather than a testee to

take one of the parts. Each testee is given certain information but also lacks some necessary information. The task requires the testees to ask for and give information. The task should provide a context in which it is logical for the testees to be sharing information.

The following is an example of an information gap activity.

Student A

You are planning to buy a tape recorder. You don't want to spend more than about 80 pounds, but you think that a tape recorder that costs less than 50 pounds is probably not of good quality. You definitely want a tape recorder with auto reverse, and one with a radio built in would be nice. You have investigated three models of tape recorder and your friend has investigated three models. Get the information from him/her and share your information. You should start the conversation and make the final decision, but you must get his/her opinion, too.

(information about three kinds of tape recorders)

Student B

Your friend is planning to buy a tape recorder, and each of you investigated three types of tape recorder. You think it is best to get a small, light tape recorder. Share your information with your friend, and find out about the three tape recorders that your friend investigated. Let him/her begin the conversation and make the final decision, but don't hesitate to express your opinion.

(information about three kinds of tape recorders)

This kind of task would be evaluated using a system of band scales. The band scales would emphasize the testee's ability to give and receive information, express and elicit opinions, etc. If its intention were communicative, it would probably not emphasize pronunciation, grammatical correctness, etc., except to the extent that these might interfere with communication. The examiner should be an observer and not take part in

the activity, since it is difficult to both take part in the activity and evaluate it. Also, the activity should be tape recorded, if possible, so that it could be evaluated later and it does not have to be evaluated in real time.

Role Play. In a role play, the testee is given a situation to play out with another person. The testee is given in advance information about what his/her role is, what specific functions he/she needs to carry out, etc. A role play task would be similar to the above information gap activity, except that it would not involve an information gap. Usually the examiner or a confederate takes one part of the role play.

The following is an example of a role play activity.

Student

You missed class yesterday. Go to the teacher's office and apologize for having missed the class. Ask for the handout from the class. Find out what the homework was.

Examiner

You are a teacher. A student who missed your class yesterday comes to your office. Accept her/his apology, but emphasize the importance of attending classes. You do not have any extra handouts from the class, so suggest that she/he copy one from a friend. Tell her/him what the homework was.

Again, if the intention of this test were to test communicative language, the testee would be assessed on his/her ability to carry out the functions (apologizing, requesting, asking for information, responding to a suggestion, etc.) required by the role.

8

Testing Reading and Writing

Some tests combine reading and writing in communicative situations. Testees can be given a task in which they are presented with instructions to write a letter, memo, summary, etc., answering certain questions, based on information that they are given.

Letter writing. In many situations, testees might have to write business letters, letters asking for information, etc. The following is an example of such a task.

Your boss has received a letter from a customer complaining about problems with a coffee maker that he bought six months ago. Your boss has instructed you to check the company policy on returns and repairs and reply to the letter. Read the letter from the customer and the statement of the company policy about returns and repairs below and write a formal business letter to the customer.

(the customer's complaint letter; the company policy)

The letter would be evaluated using a band scale, based on compliance with formal letter writing layout, the content of the letter, inclusion of correct and relevant information, etc.

Summarizing. Testees might be given a long passage--for example, 400 words--and be asked to summarize the main points in less than 100 words. To make this task communicative, the testees should be given realistic reasons for doing such a task. For example, the longer text might be an article that their boss would like to have summarized so that he/she can incorporate the main points into a talk. The summary would be evaluated based on the inclusion of the main points of the longer text.

Testing Listening and Writing/Note Taking

Listening and writing may also be tested in combination. In this case, testees are given a listening text and they are instructed to write down certain information from the text. Again, although this is not interactive, it should somehow simulate a situation where information would be written down from a spoken text.

9

Performance-Based Assessment

Performance-based assessment is an alternative form of assessment that moves away from traditional paper and pencil tests. Performance-based assessment involves having the students produce a project, whether it is oral, written or a group performance. The students are engaged in creating a final project that exhibits their understanding of a concept they have learned.

A unique quality of performance-based assessment is that it allows the students to be assessed based on a process. The teacher is able to see firsthand how the students produce language in real-world situations. In addition, performance-based assessments tend to have higher content validity because a process is being measured. The focus remains on the process, rather than the product, in performance-based assessment.

There are two parts to performance-based assessments. The first part is a clearly defined task for the students to complete. This is called the product descriptor. The assessments are either product-related, specific to certain content, or specific to a given task. The second part is a list of explicit criteria that are used to assess the students. Generally this comes in the form of a rubric. The rubric can either be analytic, meaning it assesses the final product in parts, or holistic, meaning it assesses the final product as a whole.
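The difference between an analytic and a holistic rubric can be sketched as data. The component names, point values, and band descriptors below are invented for illustration and are not taken from any of the sources compiled here.

```python
# Illustrative sketch: an analytic rubric scores the product in parts, while a
# holistic rubric assigns one overall band. All names and values are hypothetical.

# Analytic rubric: each component of the task carries its own point value.
ANALYTIC_MAXIMUMS = {
    "task completion": 4,
    "target grammar structures": 4,
    "required vocabulary": 4,
    "organization": 4,
    "mechanics": 4,
}

def analytic_score(points_earned):
    """Sum the points earned on each component and report the rubric maximum."""
    return sum(points_earned.values()), sum(ANALYTIC_MAXIMUMS.values())

# Holistic rubric: one band descriptor is chosen for the product as a whole.
HOLISTIC_BANDS = {
    4: "fully accomplishes the task; meaning is clear throughout",
    3: "accomplishes the task; occasional lapses do not obscure meaning",
    2: "partially accomplishes the task; meaning is sometimes unclear",
    1: "does not accomplish the task; meaning is largely unclear",
}

earned = {"task completion": 4, "target grammar structures": 3,
          "required vocabulary": 3, "organization": 4, "mechanics": 2}

print(analytic_score(earned))  # (16, 20): component-by-component total
print(HOLISTIC_BANDS[3])       # single overall judgement chosen by the rater
```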

Performance-based assessment tasks are generally not as formally structured. There is room for creativity and student design in performance-based tasks. Generally, these tasks measure the students when they are actually performing the given task. Due to the nature of these tasks, performance-based assessment is highly interactive. Students are interacting with each other in order to complete real-world examples of language tasks.

Also, performance-based assessment tends to integrate many different skills. For example, reading and writing can be involved in one task or speaking and listening can be involved in the same task.

As previously mentioned, there are many types of performance-based assessments. Each type of assessment brings with it different strengths and deficiencies relative to credible and dependable information. Because it is virtually impossible for a single assessment tool to adequately assess all aspects of student performance, the real challenge comes in selecting or developing performance-based assessments that complement both each other and more traditional assessments to equitably assess students in physical education and human performance.

The goal for assessment is to accurately determine whether students have learned the materials or information taught and reveal whether they have complete mastery of the content with no misunderstandings. Just as researchers use multiple data sources to determine the truthfulness of the results, teachers can use multiple types of assessment to evaluate the level of student learning. Because assessments involve the gathering of data or information, some type of product, performance, or recording sheet must be generated. The following are some examples of various types of performance-based assessments used in physical education.

Performance-based assessment is an opportunity to allow students to produce language in real-world contexts while being assessed. This type of assessment is unique because it is not a traditional test format. Some examples of performance-based assessment tasks are as follows:

Types of Performance-Based Assessment:

1. Journals

Students will write regularly in a journal about anything relevant to their life, school or thoughts. Their writing will be in the target language. The teacher will collect the journals periodically and provide feedback to the students. This can serve as a communication log between the teacher and students. Journals can be used to record student feelings, thoughts, perceptions, or reflections about actual events or results. The entries in journals often report social or psychological perspectives, both positive and negative, and may be used to document the personal meaning associated with one’s participation (NASPE Standard 6). Journal entries would not be an appropriate summative assessment by themselves, but might be included as an artifact in a portfolio. Journal entries are excellent ways for teachers to “take the pulse” of a class and determine whether students are valuing the content of the class. Teachers must be careful not to assess affective domain journal entries for the actual content, because doing so may cause students to write what teachers want to hear (or give credit for) instead of true and genuine feelings. Teachers could hold students accountable for completing journal entries. Some teachers use journals as a way to log participation over time.

2. Letters The students will create original language compositions through producing a letter. They will be asked to write about something relevant to their own life using the target language. The letter assignment will be accompanied by a rubric for assessment purposes.

3. Oral Reports The students will need to do research in groups about a given topic. After they have completed their research, the students will prepare an oral presentation to present to the class explaining their research. The main component of this project will be the oral production of the target language.

4. Original Stories The students will write an original fictional story. The students will be asked to include several specified grammatical structures and vocabulary words. This assignment will be assessed analytically; each component will have a point value.

5. Oral Interview An oral interview will take place between two students. One student will ask the questions and listen to the responses of the other student. From the given responses, more questions can be asked. Each student will be responsible for listening and speaking.

6. Skit The students will work in groups in order to create a skit about a real-world situation. They will use the target language. The vocabulary used should be specific to the situation. The students will be assessed holistically, based on the overall presentation of the skit.

7. Poetry Recitations After studying poetry, the students will select a poem in the target language of their choice to recite to the class. The students will be assessed based on their pronunciation, rhythm and speed. The students will also have an opportunity to share with the class what they think the poem means.

8. Portfolios Portfolios allow students to compile their work over a period of time. The students will have a checklist and rubric along with the assignment description. The students will assemble their best work, including their drafts, so that the teacher can assess the process.

9. Puppet Show The students can work in groups or individually to create a short puppet show. The puppet show can have several characters that are involved in a conversation of real-world context. These would most likely be assessed holistically.

10. Art Work/Designs/Drawings This is a creative way to assess students. They can choose a short story or piece of writing, read it and interpret it. Their interpretation can be represented through artistic

expression. The students will present their art work to the class, explaining what they did and why.

Using Observation in the Assessment Process

Human performance provides many opportunities for students to exhibit behaviors that may be directly observed by others, a unique advantage of working in the psychomotor domain. Wiggins (1998) uses physical activity when providing examples to illustrate complex assessment concepts, as they are easier to visualize than would be the case with a cognitive example. The nature of performing a motor skill makes assessment through observational analysis a logical choice for many physical education teachers. In fact, investigations of measurement practices of physical educators have consistently shown a reliance on observation and related assessment methods (Hensley and East 1989; Matanin and Tannehill 1994; Mintah 2003).

Observation is a skill used with several performance-based assessments. It is often used to provide students with feedback to improve performance. However, without some way to record results, observation alone is not an assessment. Going back to the definition of assessment provided earlier in the chapter, assessment is the gathering of information, analyzing the data, and then using the information to make an evaluation. Therefore, some type of written product must be produced if the task is considered an assessment.

Teachers and peers can assess others using observation. They might use a checklist or some type of event recording scheme to tally the number of times a behavior occurred. Keeping game play statistics is an example of recording data using event recording techniques. Students can self-analyze their own performance and record their performances using criteria provided on a checklist or a game play rubric. Table 14.1 is an example of a recording form that could be used for peer assessment. When using peer assessment, it is best to have the assessor do only the assessment. When the person recording assessment results is also expected to take part in the assessment (e.g., tossing the ball to the person being assessed), he or she cannot both toss and do an accurate observation. In the case of large classes, teachers might even use groups of four, in

which one person is being evaluated, a second person is feeding the ball, the third person is doing the observation, and a fourth person is recording the results.
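As an illustration of the event-recording idea mentioned above, the short sketch below tallies how many times each observed behavior occurs. It is an assumed example, not a form taken from the chapter; the behavior labels and the observation sequence are invented.

```python
# Illustrative sketch of event recording: tallying each time a target
# behavior is observed during a practice task. Labels and data are hypothetical.
from collections import Counter

# Each entry is one observed event recorded by the peer assessor.
observed_events = [
    "successful pass", "successful pass", "missed pass",
    "successful pass", "missed pass", "successful pass",
]

tally = Counter(observed_events)
print(tally["successful pass"])  # 4 occurrences recorded
print(tally["missed pass"])      # 2 occurrences recorded
```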

Individual or Group Projects

Projects have long been used in education to assess a student’s understanding of a subject or a particular topic. Projects typically require students to apply their knowledge and skills while completing the prescribed task, which often calls for creativity, critical thinking, analysis, and synthesis. Examples of student projects used in physical education and human performance include the following: demonstrating knowledge of invasion game strategies by designing a new game; demonstrating knowledge of how to become an active participant in the community by doing research on obesity and then developing a brochure for people in the community that presents ideas for developing a physically active lifestyle; demonstrating knowledge of fitness components and how to stay fit by designing one’s own fitness program using personal fitness test results; demonstrating knowledge of how to create a dance by video recording a dance that members of the group choreographed; and doing research on childhood games and teaching children from a local elementary school how to play them. Criteria for evaluating the projects are developed and the results of the project are recorded.

Group projects involve a number of students working together on a complex problem that requires planning, research, internal discussion, and presentation. Group projects should include a component that each student completes individually to avoid having a student receive credit for work that he or she did not do. Another way to avoid this issue is to have members of the group award paychecks to the various members of the group (e.g., split a $10,000 check) and provide justifications about the amount given to each person. To encourage reflections on the contributions of others, students are not allowed to give an equal amount to everyone. These “checks” are confidential and submitted directly to the teacher in an envelope that others in the group are not allowed to see.

The following example of a project designed for middle school or high school students involves a research component, analysis and synthesis of information, problem solving, and effective communication.

Portfolios

Portfolios are systematic, purposeful, and meaningful collections of an individual’s work designed to document learning over time. Since a portfolio provides documentation of student learning, the knowledge and skills that the teacher desires to have students document guide the structure of the portfolio. The type of portfolio, its format, and the general contents are usually prescribed by the teacher. Portfolio collections may also include input provided by teachers, parents, peers, administrators, or others. The guidelines used to format a portfolio will be based on the type of learning that the portfolio is used to document. The following are two basic types of portfolios:

Working portfolio—A repository of portfolio documents that the student accumulates over a certain period of time. Other types of process information may also be included, such as drafts of student work or records of student achievement or progress over time.

Showcase or model portfolio—A portfolio consisting of work samples selected by the student that document the student’s best work. The student has consciously evaluated his or her work and selected only those products that best represent the type of learning identified for this assessment. Each artifact selected is accompanied by a reflection, in which the student explains the significance of the item and the type of learning it represents.

It’s a good idea to limit the portfolio to a certain number of pieces of work to prevent the portfolio from becoming a scrapbook that has little meaning to the student and to avoid giving teachers a monumental evaluation task. This also requires students to exercise some judgment about which artifacts best fulfill the requirements of the portfolio task and document their level of achievement. The portfolio itself is usually a file or folder that contains the student’s collected work. The contents could include items such as a training log, student journal or diary, written reports, photographs or sketches, letters, charts or graphs, maps, copies of certificates, computer disks or computer-generated products, completed rating scales, fitness test results, game statistics, training plans, reports of dietary analyses, and even video or audio recordings. Collectively, the artifacts selected will document student growth and learning over time as well as current levels of achievement. The potential items that could become portfolio artifacts are almost limitless. Kirk (1997) suggests the following list of possible portfolio artifacts that may be useful for physical activity settings. A teacher would never require that a portfolio contain all of these items. The list is offered as a way to generate ideas for possible artifacts.

A rubric (scoring tool) should be used to evaluate portfolios in much the same manner as any other product or performance. Providing a rubric to students in advance allows them to self-assess their work and thus be more likely to produce a portfolio of high quality. Portfolios, since they are designed to show growth and improvement in student learning, are evaluated holistically. The reflections that describe the artifact and why the artifact was selected for inclusion in the portfolio provide insights into levels of student learning and achievement. Teachers should remember that format is less important than content and that the rubric should be weighted to reflect this. Table 14.2 illustrates a qualitative analytic rubric for judging a portfolio along three dimensions.

For additional information about portfolio assessments, Lund and Kirk (2010) have a chapter on developing portfolio assessments. An article published as part of a JOPERD feature presents a suggested scoring scale for a portfolio (Kirk 1997). Melograno’s Assessment Series publication (2000) on portfolios also contains helpful information.

Performances

Student performances can be used as culminating assessments at the completion of an instructional unit. Teachers might organize a gymnastics or track and field meet at the conclusion of one of those units to allow students to demonstrate the skills and knowledge that they gained during instruction. Game play during a tournament is also considered a student performance. Rubrics for game play can be written so that students are evaluated on all three learning domains (psychomotor, cognitive, and affective).

Students might demonstrate their skills and learning in one of the following ways:

Performing an aerobics routine for a school assembly

Organizing and performing a jump rope show at the half-time of a basketball game

Performing in a folk dance festival at the county fair

Demonstrating wushu (a Chinese martial art) at the local shopping mall

Training for and participating in a local road race or cycling competition

Although performances do not produce a written product, there are several ways to gather data to use for assessment purposes. A score sheet can be used to record student performance using the criteria from a game play rubric. Game play statistics are another example of a way to document performance. Performances can also be video recorded to provide evidence of learning. In some cases teachers might want to shorten the time used to gather evidence of learning from a performance. Event tasks are performances that are completed in a single class period. Students might demonstrate their knowledge of net or wall game strategies by playing a scripted game that is video recorded during a single class. The ability to create movement sequences or a dance that uses different levels, effort, or relationships could be demonstrated during a single class period with an event task. Many adventure education activities that demonstrate affective domain attributes can be assessed using event tasks.

Student Logs

Documenting student participation in physical activity (NASPE Standard 3) is often difficult. Teachers can assess participation in an activity or skill practice trials completed outside of class using logs. Practice trials during class that demonstrate student effort can also be documented with logs. A log records behaviors over a period of time (see figure 14.1). Often the information recorded shows changes in behavior, trends in performance, results of participation, progress, or the regularity of physical activity. A student log is an excellent artifact for use in a portfolio. Because logs are usually self-recorded documents, they are not used for summative assessments except as an artifact in a portfolio or for a project. If teachers want to increase the importance placed on a log, they should add a method of verification by an adult or someone in authority.

10 VALIDITY AND RELIABILITY

For the statistical consultant working with social science researchers, the estimation of reliability and validity is a frequently encountered task. Measurement issues differ in the social sciences in that they are related to the quantification of abstract, intangible and unobservable constructs. In many instances, then, the meaning of quantities is only inferred.

Let us begin with a general description of the paradigm that we are dealing with. Most concepts in the behavioral sciences have meaning within the context of the theory that they are a part of. Each concept, thus, has an operational definition which is governed by the overarching theory. If a concept is involved in the testing of hypotheses to support the theory, it has to be measured. So the first decision that the researcher is faced with is “how shall the concept be measured?” That is, what type of measure will be used? At a very broad level the type of measure can be observational, self-report, interview, etc. These types ultimately take the shape of a more specific form, such as observation of ongoing activity or of video-taped events; self-report measures like questionnaires that can be open-ended or closed-ended, or Likert-type scales; and interviews that are structured, semi-structured or unstructured, and open-ended or closed-ended. Needless to say, each type of measure has specific issues that need to be addressed to make the measurement meaningful, accurate, and efficient. Another important feature is the population for which the measure is intended. This decision is not entirely dependent on the theoretical paradigm but relates more to the immediate research question at hand.

A third point that needs mentioning is the purpose of the scale or measure. What is it that the researcher wants to do with the measure? Is it developed for a specific study, or is it developed with the anticipation of extensive use with similar populations? Once some of these decisions are made and a measure is developed, which is a careful and tedious process, the relevant questions to raise are “how do we know that we are indeed measuring what we want to measure?”, since the construct that we are measuring is abstract, and “can we be sure that if we repeated the measurement we would get the same result?”. The first question is related to validity and the second to reliability. Validity and reliability are two important characteristics of a behavioral measure and are referred to as psychometric properties.

It is important to bear in mind that validity and reliability are not an all-or-none issue but a matter of degree.

Measurement

All measurements may contain some element of error; validity and reliability concern the amount and type of error that typically occurs, and they also show how we can estimate the amount of error in a measurement. There are three chief sources of error:

1. the thing being measured (my weight may fluctuate, so it's difficult to get an accurate picture of it);
2. the observer (on Mondays I may knock a pound off my weight if I binged on my mother's cooking at the weekend; obviously the binging doesn't reflect my true weight!);
3. the recording device (our clinic weigh scale has been acting up; we really should get it recalibrated).

And there are two types of error. Random errors are not attributable to a specific cause. If sufficiently large numbers of observations are made, random errors average to zero, because some readings over-estimate and some under-estimate. Systematic errors tend to fall in a particular direction and are likely due to a specific cause. Because systematic errors fall in one direction (e.g., I always exaggerate my athletic abilities), they bias a measurement. Random errors are considered part of the reliability of a measurement; systematic errors are considered part of the validity of a measurement.
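To make the distinction concrete, here is a minimal simulation sketch (not part of the original compilation; the weight, error sizes, and variable names are invented for illustration). It shows random errors averaging out over many readings, while a systematic error biases every reading the same way.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
true_weight = 70.0  # hypothetical true value of the thing being measured (kg)

# Random error: zero-mean noise, e.g., an imprecise but unbiased scale
random_only = true_weight + rng.normal(0.0, 0.5, size=10_000)

# Systematic error: the same noise plus a constant 1 kg under-reading
systematic = true_weight - 1.0 + rng.normal(0.0, 0.5, size=10_000)

print(round(random_only.mean(), 2))  # ~70.0: random errors cancel out over many readings
print(round(systematic.mean(), 2))   # ~69.0: the bias remains no matter how many readings are taken
```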

Reliability and validity

The reliability of an assessment tool is the extent to which it measures learning consistently. The validity of an assessment tool is the extent to which it measures what it was designed to measure.

Reliability

The reliability of an assessment tool is the extent to which it consistently and accurately measures learning. When the results of an assessment are reliable, we can be confident that repeated or equivalent assessments will provide consistent results. This puts us in a better position to make generalised statements about a student’s level of achievement, which is especially important when we are using the results of an assessment to make decisions about teaching and learning, or when we are reporting back to students and their parents or caregivers. No results, however, can be completely reliable. There is always some random variation that may affect the assessment, so educators should always be prepared to question results.

Factors which can affect reliability:

The length of the assessment – a longer assessment generally produces more reliable results.
The suitability of the questions or tasks for the students being assessed.
The phrasing and terminology of the questions.
The consistency in test administration – for example, the length of time given for the assessment and the instructions given to students before the test.
The design of the marking schedule and moderation of marking procedures.
The readiness of students for the assessment – for example, a hot afternoon or straight after physical activity might not be the best time for students to be assessed.

How to be sure that a formal assessment tool is reliable

Check the user manual for evidence of the reliability coefficient. Reliability coefficients range between zero and 1; a coefficient of 0.9 or more indicates a high degree of reliability.

Assessment tool manuals contain comprehensive administration guidelines. It is essential to read the manual thoroughly before conducting the assessment.

Validity

Educational assessment should always have a clear purpose. Nothing will be gained from assessment unless the assessment has some validity for the purpose. For that reason, validity is the most important single attribute of a good test. The validity of an assessment tool is the extent to which it measures what it was designed to measure, without contamination from other characteristics. For example, a test of reading comprehension should not require mathematical ability.

There are several different types of validity:

Face validity: do the assessment items appear to be appropriate?
Content validity: does the assessment content cover what you want to assess?
Criterion-related validity: how well does the test measure what you want it to?
Construct validity: are you measuring what you think you're measuring?

It is fairly obvious that a valid assessment should have a good coverage of the criteria (concepts, skills and knowledge) relevant to the purpose of the examination. The important notion here is the purpose. For example:

The PROBE test is a form of reading running record which measures reading behaviours and includes some comprehension questions. It allows teachers to see the reading strategies that students are using, and potential problems with decoding. The test would not, however, provide in-depth information about a student’s comprehension strategies across a range of texts.

STAR (Supplementary Test of Achievement in Reading) is not designed as a comprehensive test of reading ability. It focuses on assessing students’ vocabulary understanding, basic sentence comprehension and paragraph comprehension. It is most appropriately used for students who don’t score well on more general testing (such as PAT or e-asTTle) as it provides a more fine-grained analysis of basic comprehension strategies.

There is an important relationship between reliability and validity. An assessment that has very low reliability will also have low validity; clearly a measurement with very poor accuracy or consistency is unlikely to be fit for its purpose. But, by the same token, the things required to achieve a very high degree of reliability can impact negatively on validity. For example, consistency in assessment conditions leads to greater reliability because it reduces 'noise' (variability) in the results. On the other hand, one of the things that can improve validity is flexibility in assessment tasks and conditions. Such flexibility allows assessment to be set appropriate to the learning context and to be made relevant to particular groups of students. Insisting on highly consistent assessment conditions to attain high reliability will result in little flexibility, and might therefore limit validity.

Validity: Very simply, validity is the extent to which a test measures what it is supposed to measure. The question of validity is raised in the context of the three points made above: the form of the test, the purpose of the test, and the population for whom it is intended. Therefore, we cannot ask the general question "Is this a valid test?". The question to ask is "how valid is this test for the decision that I need to make?" or "how valid is the interpretation I propose for the test?" We can divide the types of validity into logical and empirical.

VALIDITY refers to what conclusions we can draw from the results of a measurement. Introductory-level definitions are "Does the test measure what we are intending to measure?", or "How closely do the results of a measurement correspond to the true state of the phenomenon being measured?"

Nerd's Corner: These ideas of validity fit under a more general conception in terms of "How can we interpret the test results?" or "What does this measurement actually mean?" This approach is useful because sometimes information collected for one purpose can also tell us about something quite different. So, the World Bank records the gross national product of each country for economic monitoring, but this also gives us a pretty good idea of how countries will rank in terms of child health.

Nerd's Corner: Putting these ideas together, we get a table showing how validity and reliability may be assessed:

Source of error in the thing being measured – random error is assessed by test-retest reliability; systematic error by recording diurnal (etc.) variation (e.g., BP higher on Mondays).

Source of error in the observer – random error is assessed by the correlation between observers; systematic error by agreement between observers (e.g., nurses or patients).

Source of error in the recording device (e.g., a screening test) – random error is assessed by a calibration trial (variation with a standard object); systematic error by construct and criterion validity, and by sensitivity and specificity.

Validity of a screening test. This can be used to illustrate the way validity is assessed. Here, it is commonly reported in terms of sensitivity and specificity. Sensitivity refers to what fraction of all the actual cases of disease a test detects. If the test is not very good, it may miss cases it should detect. Its sensitivity is low and it generates "false negatives" (i.e., people score negatively on the test when they should have scored positive). This can be extremely serious if early treatment would have saved the person's life.

Mnemonics to help you: The word 'sensitivity' is intuitive: a sensitive test is one that can identify the disease. SeNsitivity is inversely associated with the false Negative rate of a test (high sensitivity = few false negatives).

Specificity refers to whether the test identifies only those with the disease, or does it mistakenly classify some healthy people as being sick? Errors of this type are called "false positives." This can lead to worry and expensive further investigations.
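As a compact summary (standard definitions, with an invented worked example rather than one from the text), sensitivity and specificity can be written in terms of true positives (TP), false negatives (FN), true negatives (TN), and false positives (FP):

```latex
\text{Sensitivity} = \frac{TP}{TP + FN}
\qquad
\text{Specificity} = \frac{TN}{TN + FP}
```

For instance, a screening test that detects 90 of 100 people who truly have the disease has a sensitivity of 0.90; if it also wrongly flags 20 of 200 healthy people, its specificity is 180/200 = 0.90.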

Types of Validity

1. Content Validity: When we want to find out if the entire content of the behavior/construct/area is represented in the test, we compare the test tasks with the content of the behavior. This is a logical method, not an empirical one. For example, if we want to test knowledge of American geography, it is not fair to have most questions limited to the geography of New England.

2. Face Validity: Basically, face validity refers to the degree to which a test appears to measure what it purports to measure. Face validity ascertains that the measure appears to be assessing the intended construct under study. The stakeholders can easily assess face validity. Although this is not a very "scientific" type of validity, it may be an essential component in enlisting the motivation of stakeholders. If the stakeholders do not believe the measure is an accurate assessment of the ability, they may become disengaged from the task. Example: If a measure of art appreciation is created, all of the items should be related to the different components and types of art. If the questions are regarding historical time periods, with no reference to any artistic movement, stakeholders may not be motivated to give their best effort or invest in this measure because they do not believe it is a true assessment of art appreciation.

3. Criterion-Oriented or Predictive Validity:

Criterion-Related Validity is used to predict future or current performance - it correlates test results with another criterion of interest. Example: A physics program might design a measure to assess cumulative student learning throughout the major. The new measure could be correlated with a standardized measure of ability in this discipline, such as an ETS field test or the GRE subject test. The higher the correlation between the established measure and the new measure, the more faith stakeholders can have in the new assessment tool. When you are expecting a future performance based on the scores obtained currently by the measure, correlate the scores obtained with the performance. The later performance is called the criterion and the current score is the predictor. This is an empirical check on the value of the test – a criterion-oriented or predictive validation.
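A minimal sketch of how such a validity coefficient might be computed, assuming hypothetical scores for ten students (the numbers, the variable names, and the use of Python/scipy are illustrative assumptions, not part of the original text):

```python
from scipy.stats import pearsonr

# Hypothetical scores on the newly developed measure and on an
# established criterion measure (e.g., a standardized subject test)
new_measure = [55, 62, 70, 48, 81, 77, 66, 59, 90, 73]
criterion   = [52, 60, 75, 50, 85, 70, 68, 55, 88, 71]

r, p_value = pearsonr(new_measure, criterion)
print(f"criterion-related validity coefficient r = {r:.2f}")
```

The closer r is to 1, the stronger the criterion-related evidence for the new measure.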

4. Concurrent Validity: Concurrent validity is the degree to which the scores on a test are related to the scores on another, already established, test administered at the same time, or to some other valid criterion available at the same time. For example, if a new, simpler test is to be used in place of an old, cumbersome one which is considered useful, measurements are obtained on both at the same time. Logically, predictive and concurrent validation are the same; the term concurrent validation is used to indicate that no time elapsed between measures.

5. Construct Validity: Construct validity is used to ensure that the measure is actually measuring what it is intended to measure (i.e., the construct), and not other variables. Using a panel of "experts" familiar with the construct is a way in which this type of validity can be assessed. The experts can examine the items and decide what each specific item is intended to measure. Students can be involved in this process to obtain their feedback. Example: A women's studies program may design a cumulative assessment of learning throughout the major. If the questions are written with complicated wording and phrasing, the test can inadvertently become a test of reading comprehension rather than a test of women's studies. It is important that the measure is actually assessing the intended construct, rather than an extraneous factor.

Construct validity is the degree to which a test measures an intended hypothetical construct. Many times psychologists assess or measure abstract attributes or constructs. The process of validating the interpretations about that construct, as indicated by the test score, is construct validation. This can be done experimentally; for example, if we want to validate a measure of anxiety and we hypothesize that anxiety increases when subjects are under the threat of an electric shock, then the threat of an electric shock should increase anxiety scores (note: not all construct validation is this dramatic!).

A correlation coefficient is a statistical summary of the relation between two variables. It is the most common way of reporting the answer to such questions as the following: Does this test predict performance on the job? Do these two tests measure the same thing? Do the ranks of these people today agree with their ranks a year ago? (rank correlation and product-moment correlation). According to Cronbach, to the question "what is a good validity coefficient?" the only sensible answer is "the best you can get", and it is unusual for a validity coefficient to rise above 0.60, though that is far from perfect prediction. All in all, we need to always keep in mind the contextual questions: What is the test going to be used for? How expensive is it in terms of time, energy and money? What implications are we intending to draw from test scores?

Formative Validity, when applied to outcomes assessment, is used to assess how well a measure is able to provide information to help improve the program under study. Example: When designing a rubric for history, one could assess students' knowledge across the discipline. If the measure can provide information that students are lacking knowledge in a certain area, for instance the Civil Rights Movement, then that assessment tool is providing meaningful information that can be used to improve the course or program requirements.

Sampling Validity (similar to content validity) ensures that the measure covers the broad range of areas within the concept under study. Not everything can be covered, so items need to be sampled from all of the domains. This may need to be completed using a panel of "experts" to ensure that the content area is adequately sampled. Additionally, a panel can help limit "expert" bias (i.e., a test reflecting what an individual personally feels are the most important or relevant areas). Example: When designing an assessment of learning in the theatre department, it would not be sufficient to only cover issues related to acting. Other areas of theatre, such as lighting, sound, and the functions of stage managers, should all be included. The assessment should reflect the content area in its entirety.
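For reference, the product-moment correlation coefficient mentioned above is computed with the standard Pearson formula for paired scores x_i and y_i (a textbook formula, not quoted from the original):

```latex
r_{xy} = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i}(x_i - \bar{x})^{2}}\;\sqrt{\sum_{i}(y_i - \bar{y})^{2}}}
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate little or none.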

What are some ways to improve validity?

1. Make sure your goals and objectives are clearly defined and operationalized. Expectations of students should be written down.
2. Match your assessment measure to your goals and objectives. Additionally, have the test reviewed by faculty at other schools to obtain feedback from an outside party who is less invested in the instrument.
3. Get students involved; have the students look over the assessment for troublesome wording or other difficulties.
4. If possible, compare your measure with other measures, or data that may be available.

Reliability: Research requires dependable measurement (Nunnally). Measurements are reliable to the extent that they are repeatable and that any random influence which tends to make measurements different from occasion to occasion or circumstance to circumstance is a source of measurement error (Gay). Reliability is the degree to which a test consistently measures whatever it measures. Errors of measurement that affect reliability are random errors, and errors of measurement that affect validity are systematic or constant errors. Test-retest, equivalent-forms and split-half reliability are all determined through correlation.

RELIABILITY refers to consistency or dependability. Your patient Jim is unpredictable; sometimes he comes to his appointment on time, sometimes he's late, and once or twice he was early.

One way to estimate reliability of a measurement is to record its stability: do you get the same blood pressure reading if you repeat the measurement? This is sometimes called "test-retest stability" or "intra-rater reliability" and focuses on the observer and the instrument as potential sources of error. (Note that we must assume that no actual change in BP occurred between the measurements: there is no error in the thing being measured). You can also estimate reliability by comparing the agreement between different people making a rating (e.g., if several nurses measure a patient's blood pressure, do they get the same reading?). This can be called "inter-rater reliability" or "inter-rater agreement."

Nerd's Corner: This is a simplification. Sometimes it's difficult to figure out if an error is random or systematic: the disagreement between the nurses could really be random, or it could arise because one of them tends to under-record the BP. Further testing would be needed to trace the origin of the inaccuracy.

Types of Reliability

1. Test-retest Reliability: Test-retest reliability is the degree to which scores are consistent over time. It indicates score variation that occurs from testing session to testing session as a result of errors of measurement. Problems: Memory, Maturation, Learning. Test-retest reliability is a measure of reliability obtained by administering the same test twice over a period of time to a group of individuals. The scores from Time 1 and Time 2 can then be correlated in order to evaluate the test for stability over time. Example: A test designed to assess student learning in psychology could be given to a group of students twice, with the second administration perhaps coming a week after the first. The obtained correlation coefficient would indicate the stability of the scores.

2. Equivalent-Forms or Alternate-Forms Reliability: Parallel-forms reliability is a measure of reliability obtained by administering different versions of an assessment tool (both versions must contain items that probe the same construct, skill, knowledge base, etc.) to the same group of individuals. The scores from the two versions can then be correlated in order to evaluate the consistency of results across alternate versions. Example: If you wanted to evaluate the reliability of a critical thinking assessment, you might create a large set of items that all pertain to critical thinking and then randomly split the questions up into two sets, which would represent the parallel forms. Equivalent-forms or alternate-forms reliability uses two tests that are identical in every way except for the actual items included. It is used when it is likely that test takers will recall responses made during the first session and when alternate forms are available. Correlate the two scores; the obtained coefficient is called the coefficient of stability or coefficient of equivalence. Problem: the difficulty of constructing two forms that are essentially equivalent. Both of the above require two administrations.

3. Inter-rater Reliability: Inter-rater reliability is a measure of reliability used to assess the degree to which different judges or raters agree in their assessment decisions. Inter-rater reliability is useful because human observers will not necessarily interpret answers the same way; raters may disagree as to how well certain responses or material demonstrate knowledge of the construct or skill being assessed.

Example: Inter-rater reliability might be employed when different judges are evaluating the degree to which art portfolios meet certain standards. Inter-rater reliability is especially useful when judgments can be considered relatively subjective. Thus, the use of this type of reliability would probably be more likely when evaluating artwork as opposed to math problems.
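One common way to quantify this kind of agreement is simple percent agreement, optionally corrected for chance using Cohen's kappa. The sketch below is illustrative only: kappa is not mentioned in the original text, and the portfolio ratings are invented.

```python
from collections import Counter

# Hypothetical pass/fail ratings of ten portfolios by two judges
rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "pass"]

n = len(rater_a)
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # percent agreement

# Cohen's kappa corrects percent agreement for agreement expected by chance
count_a, count_b = Counter(rater_a), Counter(rater_b)
expected = sum((count_a[c] / n) * (count_b[c] / n) for c in set(rater_a) | set(rater_b))
kappa = (observed - expected) / (1 - expected)

print(f"percent agreement = {observed:.2f}, Cohen's kappa = {kappa:.2f}")
```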

4. Internal Consistency Reliability: Internal consistency reliability is a measure of reliability used to evaluate the degree to which different test items that probe the same construct produce similar results. A. Average inter-item correlation is a subtype of internal consistency reliability. It is obtained by taking all of the items on a test that probe the same construct (e.g., reading comprehension), determining the correlation coefficient for each pair of items, and finally taking the average of all of these correlation coefficients. This final step yields the average inter-item correlation. B. Split-half reliability is another subtype of internal consistency reliability. The process of obtaining split-half reliability is begun by “splitting in half” all items of a test that are intended to probe the same area of knowledge (e.g., World War II) in order to form two “sets” of items. The entire test is administered to a group of individuals, the total score for each “set” is computed, and finally the split-half reliability is obtained by determining the correlation between the two total “set” scores.
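A small sketch of the average inter-item correlation computation just described, using invented scores for six students on four items assumed to probe the same construct (the data and the use of Python/numpy are illustrative assumptions):

```python
import numpy as np

# Rows are students, columns are items that probe the same construct
scores = np.array([
    [3, 4, 3, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [4, 3, 4, 4],
    [3, 3, 2, 3],
])

item_corr = np.corrcoef(scores, rowvar=False)              # item-by-item correlation matrix
pairwise = item_corr[np.triu_indices_from(item_corr, 1)]   # one value per distinct item pair
print(f"average inter-item correlation = {pairwise.mean():.2f}")
```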

Split-Half Reliability: Requires only one administration. It is especially appropriate when the test is very long. The most commonly used method to split the test into two is the odd-even strategy. Since longer tests tend to be more reliable, and since split-half reliability represents the reliability of a test only half as long as the actual test, a correction formula must be applied to the coefficient; this correction is the Spearman-Brown prophecy formula. Split-half reliability is a form of internal consistency reliability. Internal Consistency Reliability: Determining how all items on the test relate to all other items. The Kuder-Richardson formula gives an estimate of reliability that is essentially equivalent to the average of the split-half reliabilities computed for all possible halves.
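The Spearman-Brown prophecy formula referred to above estimates the reliability of the full-length test from the correlation between the two halves:

```latex
r_{\text{full}} = \frac{2\,r_{\text{half}}}{1 + r_{\text{half}}}
```

For example, if the two half-test scores correlate at 0.60, the estimated full-test reliability is (2 x 0.60) / (1 + 0.60) = 0.75.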

Rationale Equivalence Reliability: Rationale equivalence reliability is not established through correlation but rather estimates internal consistency by determining how all items on a test relate to all other items and to the total test.

Standard Error of Measurement: Reliability can also be expressed in terms of the standard error of measurement. It is an estimate of how often you can expect errors of a given size.
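The usual formula (standard, though not stated in the text) expresses the standard error of measurement in terms of the standard deviation of the observed scores, s_x, and the reliability coefficient, r_xx:

```latex
\mathrm{SEM} = s_x \sqrt{1 - r_{xx}}
```

For example, with a standard deviation of 10 and a reliability of 0.91, SEM = 10 x sqrt(0.09) = 3, so roughly two-thirds of observed scores can be expected to fall within 3 points of the corresponding true scores.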

Principles of Language Testing

1. What are the principles of language testing?
2. How can we define them?
3. What factors can influence them?
4. How can we measure them?
5. How do they interrelate?

Three Important Characteristics of Tests:

1. Reliability: consistency and freedom from extraneous sources of error
2. Validity: how well a test measures what it is supposed to measure

Validity refers to measuring what we intend to measure: how well a test measures what it is supposed to measure. For example, if math and vocabulary truly represent intelligence, then a math and vocabulary test might be said to have high validity when used as a measure of intelligence.

Estimating the Validity of a Measure:

1. A good measure must not only be reliable, but also valid.
2. A valid measure measures what it is intended to measure.
3. Validity is not a property of a measure, but an indication of the extent to which an assessment measures a particular construct in a particular context – thus a measure may be valid for one purpose but not another.

4. A measure cannot be valid unless it is reliable, but a reliable measure may not be valid.

Content Validity:

1. Does the test contain items from the desired “content domain”?
2. Based on assessment by experts in that content domain.
3. Is especially important when a test is designed to have low face validity.
4. Is generally simpler for “other tests” than for “psychological constructs”.

For example, it is easier for math experts to agree on an item for an algebra test than it is for psychology experts to agree whether or not an item should be placed in an EI or a personality measure.

5. Content Validity is not “tested for”. Rather it is assured by experts in the domain.

Basic Procedure for Assessing Content Validity:

1. Describe the content domain.
2. Determine the areas of the content domain that are measured by each test item.
3. Compare the structure of the test with the structure of the content domain.

For Example:

In developing a nursing licensure exam, experts on the field of nursing would identify the information and issues required to be an effective nurse and then choose (or rate) items that represent those areas of information and skills.

If a test is to measure foreign students’ mastery of English sentence structure, an analysis must first be made of the language itself and decisions made on which matters need to be tested and in what proportions.

Face Validity

1. Face validity refers to the extent to which a measure ‘appears’ to measure what it is supposed to measure.
2. It is not statistical; it involves the judgment of the researcher (and the participants).
3. A measure has face validity ‘if people think it does’.
4. Just because a measure has face validity does not ensure that it is a valid measure (and measures lacking face validity can be valid).

Relationship Between Reliability & Validity

Reliability and validity together determine the usefulness of a test. Though different, they work together. It would not be beneficial to design a test with good reliability that did not measure what it was intended to measure. The inverse, accurately measuring what you desire to measure with a test that is so flawed that results are not reproducible, is impossible. Reliability is a necessary requirement for validity. This means that you have to have good reliability in order to have validity. Reliability actually puts a cap or limit on validity, and if a test is not reliable, it cannot be valid. Establishing good reliability is only the first part of establishing validity; validity has to be established separately. Having good reliability does not mean you have good validity, it just means you are measuring something consistently. Now you must establish what it is that you are measuring consistently. The main point here is that reliability is necessary but not sufficient for validity. Tests that are reliable are not necessarily valid or predictive. If the reliability of a psychological measure increases, the validity of the measure is also expected to increase.

FACTORS THAT INFLUENCE VALIDITY:

1. Inadequate sample
2. Items that do not function as intended
3. Improper arrangement/unclear directions
4. Too few items for interpretation
5. Improper test administration
6. Scoring that is subjective

Reliability is influenced by:

1. The longer the test, the more reliable it is likely to be (though there is a point of no extra return).
2. Items which discriminate will add to reliability; therefore, if the items are too easy or too difficult, reliability is likely to be lower.
3. If there is a wide range of abilities amongst the test takers, the test is likely to have higher reliability.
4. The more homogeneous the items are, the higher the reliability is likely to be.

Practicality:

The ease with which the test:

1. items can be replicated in terms of resources needed, e.g. time, materials, people
2. can be administered
3. can be graded
4. results can be interpreted

Factors which can influence reliability, validity and practicality:

From the TEST:

1. quality of items
2. number of items
3. difficulty level of items
4. level of item discrimination
5. type of test methods
6. number of test methods
7. time allowed
8. clarity of instructions
9. use of the test
10. selection of content
11. sampling of content
12. invalid constructs

From the TEST TAKERS:

1. familiarity with test method
2. attitude towards the test, i.e. interest, motivation, emotional/mental state
3. degree of guessing employed
4. level of ability

From the Test Administration

1. consistency of administration procedure
2. degree of interaction between invigilators and test takers
3. time of day the test is administered
4. clarity of instructions
5. test environment – light / heat / noise / space / layout of room
6. quality of equipment used, e.g. for listening tests

From the Scoring

1. accuracy of the key, e.g. does it include all possible alternatives?
2. inter-rater reliability, e.g. in writing, speaking
3. intra-rater reliability, e.g. in writing, speaking
4. machine vs. human scoring

How can we measure reliability?

Test-retest: the same test administered to the same test takers following an interval of no more than two weeks.

Inter-rater reliability: two or more independent estimates on a test, e.g. written scripts marked by two raters independently and the results compared.

3. Practicality

11

CONSTRUCTING TESTS

Writing items requires decisions about the nature of the item or question to which we ask students to respond (that is, whether it is discrete or integrative), how we will score the item (for example, objectively or subjectively), the skill we purport to test, and so on. We also consider the characteristics of the test takers and the test-taking strategies respondents will need to use. What follows is a short description of these considerations for constructing items.

Test Items

A test item is a specific task that test takers are asked to perform. Test items can assess one or more points or objectives, and the actual item itself may take on a different constellation depending on the context. For example, an item may test one point (understanding of a given vocabulary word) or several points (the ability to obtain facts from a passage and then make inferences based on the facts). Likewise, a given objective may be tested by a series of items. For example, there could be five items all testing one grammatical point (e.g., tag questions). Items of a similar kind may also be grouped together to form subtests within a given test.

Classifying Items

Discrete – A completely discrete-point item would test simply one point or objective, such as testing for the meaning of a word in isolation. For example:

Choose the correct meaning of the word paralysis.

(A) inability to move

(B) state of unconsciousness

(C) state of shock

(D) being in pain

Integrative – An integrative item would test more than one point or objective at a time (e.g., comprehension of words and the ability to use them correctly in context). For example:

Demonstrate your comprehension of the following words by using them together in a written paragraph: “paralysis,” “accident,” and “skiing.”

Sometimes an integrative item is really more a procedure than an item, as in the case of a free composition, which could test a number of objectives; for example, use of appropriate vocabulary, use of sentence level discourse, organization, statement of thesis and supporting evidence. For example:

Write a one-page essay describing three sports and the relative likelihood of being injured while playing them competitively.

Objective – A multiple-choice item, for example, is objective in that there is only one right answer.

Subjective – A free composition may be more subjective in nature if the scorer is not looking for any one right answer, but rather for a series of factors (creativity, style, cohesion and coherence, grammar, and mechanics).

The Skill Tested

The language skills that we test include the more receptive skills on a continuum – listening and reading, and the more productive skills – speaking and writing. There are, of course, other language skills that cross-cut these four skills, such as vocabulary. Assessing vocabulary will most likely vary to a certain extent across the four skills, with assessment of vocabulary in listening and reading – perhaps covering a broader range than assessment of vocabulary in speaking and writing. We can also assess nonverbal skills, such as gesturing, and this can be both receptive (interpreting someone else’s gestures) and productive (making one’s own gestures).

The Intellectual Operation Required

Items may require test takers to employ different levels of intellectual operation in order to produce a response (Valette, 1969, after Bloom et al., 1956). The following levels of intellectual operation have been identified: knowledge (bringing to mind the appropriate material); comprehension (understanding the basic meaning of the material); application (applying the knowledge of the elements of language and comprehension to how they interrelate in the production of a correct oral or written message);

analysis (breaking down a message into its constituent parts in order to make explicit the relationships between ideas, including tasks like recognizing the connotative meanings of words, correctly processing a dictation, and making inferences); synthesis (arranging parts so as to produce a pattern not clearly there before, such as in effectively organizing ideas in a written composition); and evaluation (making quantitative and qualitative judgments about material). It has been popularly held that these levels demand increasingly greater cognitive control as one moves from knowledge to evaluation – that, for example, effective operation at more advanced levels, such as synthesis and evaluation, would call for more advanced control of the second language. Yet this has not necessarily been borne out by research (see Alderson & Lukmani, 1989). The truth is that what makes items difficult sometimes defies the intuitions of the test constructors.

The Tested Response Behavior

Items can also assess different types of response behavior. Respondents may be tested for accuracy in pronunciation or grammar. Likewise, they could be assessed for fluency, for example, without concern for grammatical correctness. Aside from accuracy and fluency, respondents could also be assessed for speed – namely, how quickly they can produce a response, to determine how effectively the respondent replies under time pressure. In recent years, there has also been an increased concern for developing measures of performance – that is, measures of the ability to perform real-world tasks, with criteria for successful performance based on a needs analysis for the given task (Brown, 1998; Norris, Brown, Hudson, & Yoshioka, 1998).

Performance tasks might include “comparing credit card offers and arguing for the best choice” or “maximizing the benefits from a given dating service.” At the same time that there is a call for tasks that are more reflective of the real world, there is a commensurate concern for more authentic language assessment. At least one study, however, notes that the differences between authentic and pedagogic written and spoken texts may not be readily apparent, even to an audience specifically listening for differences (Lewkowicz, 1997). In addition, test takers may not necessarily concern themselves with task authenticity in a test situation. Test familiarity may be the overriding factor affecting performance.

Characteristics of Respondents

Items can be designed to be appropriate for groups of test-takers with differing characteristics. Bachman and Palmer (1996: 64-78) classify these characteristics into four categories: the personal characteristics of the respondents – for example, their age, gender, and native language; the knowledge of the topic that they bring to the language testing situation; their affective schemata (that is, their prior likes and dislikes with regard to assessment); and their language ability.

Research into the impact of these characteristics continues. For example, with regard to the age variable, researchers have suggested that educators revisit this issue and perhaps conceive of new ways to consider the impact of the age variable in assessing language ability (Marinova-Todd, Marshall, & Snow, 2000). With regard to performance on language measures, it would appear that age interacts with other variables such as attitudes, motivation, the length of exposure to the target language, as well as the nature and quality of language instruction (see García Mayo & García Lecumberri, 2003).

With regard to language ability, both Bachman and Palmer (1996) and Alderson (2000) detail the many types of knowledge that respondents may need to draw on to perform well on a given item or task: world knowledge and culturally-specific knowledge, knowledge of how the specific grammar works, knowledge of different oral and written text types, knowledge of the subject matter or topic, and knowledge of how to perform well on the given task.

Item-Elicitation Format

The format for item elicitation has to be determined for any given item. An item can have a spoken, written, or visual stimulus, as well as any combination of the three. Thus, while an item or task may ostensibly assess one modality, it may also be testing some other as well. So, for example, a subtest referred to as “listening” which has respondents answer oral questions by means of written multiple-choice responses is testing reading as well as listening. It would be possible to avoid introducing this reading element by having the multiple-choice alternatives presented orally as well. But then the tester would be introducing yet another factor, namely, short-term memory ability, since the respondents would have to remember all the alternatives long enough to make an informed choice.

Item-Response Format

The item-response format can be fixed, structured, or open-ended. Item responses with a fixed format include true/false, multiple-choice, and matching items. Item responses which call for a structured format include ordering (where respondents are requested to arrange words to make a sentence, and several orders are possible), duplication – both written (such as dictation) and oral (for example, recitation, repetition, mimicry), identification (explaining the part of speech of a form), and completion. Those item responses calling for an open-ended format include composition – both written (for example, creative fiction, expository essays) and oral (such as a speech) – as well as other activities, such as free oral response in role-playing situations.

Grammatical competence

According to Canale and Swain (1980, p. 29), grammatical competence includes morphology, phonology, syntax, knowledge of lexical items, and semantics, as well as matters of mechanics (spelling, punctuation, capitalization, and handwriting). It would seem that this definition is perhaps too broad for practical purposes. A truly perplexing issue is determining what constitutes a grammatical error, as well as determining the severity of this error. In other words, will the use of the error stigmatize the speaker? Let us say that we are using a grammatical scale which deals with how acceptably words, phrases, and sentences are formed and pronounced in the respondents' utterances. Let us assume that the focus is on both of the following: clear cases of errors in form, such as the use of the present perfect for an action completed in the past (e.g., "We have had a great time at your house last night."), and matters of style, such as the use of a passive verb form in a context where a native would use the active form (e.g., Question - "What happened to the CD I lent you, Jorge?" Reply - "The CD was lost." vs. "I lost your CD.").

Major grammatical errors might be considered those that either interfere with intelligibility or stigmatize the speaker. Minor errors would be those that neither get in the way of the listener's comprehension nor annoy the listener to any extent. Thus, getting the tense wrong in the above example, "We have had a great time at your house last night," could be viewed as a minor error, whereas in another case, producing "I don't have what to say" ("I really have no excuse," translated directly from the corresponding Hebrew expression) could be considered a major error since it is not only ungrammatical but also could stigmatize the speaker as rude and unconcerned, rather than apologetic.

Rationale for Tests:

Measures of student performance (testing) may have as many as five purposes:

Student Placement, Diagnosis of Difficulties, Checking Student Progress, Reports to Student and Superiors, Evaluation of Instruction.

Unfortunately, the most common perception is that tests are designed to statistically rank all students according to a sampling of their knowledge of a subject and to report that ranking to superiors or anyone else interested in using that information to adversely influence the student's feeling of self-worth. It is even more unfortunate that this perception matches reality in the majority of testing situations. Consequently, tests are highly stressful, anxiety-producing events for most persons.

All too often, tests are constructed to determine how much a student knows rather than determining what he/she must learn. Frequently tests are designed to "trap" the student, and in still other situations tests are designed to ensure a "bell curve" distribution of results. Most of the other numerous testing designs and strategies fail to help the student in his learning process and in many cases are quite detrimental to that process.

In a Mastery Based system of instruction the two main reasons for testing are to determine mastery and to diagnose difficulties. When tests are constructed for these purposes, the other four purposes will also be satisfied. For example, consider a test which requires the student to demonstrate mastery and at the same time rigorously diagnoses learning difficulties. If no difficulties are indicated, it may be safely assumed that the learner has mastered the concept. That information may then be used to record student progress and to make reports to the student and superiors. Examining student performance collectively for a group of students provides information about the quality of instruction. Examining a single student's performance collectively for a group of learning objectives may be used to determine proper placement within that group of learning objectives.

It is therefore important that the instructional developer construct each question so that a correct response indicates mastery of the learning objective and any incorrect response provides information about the nature of the student's lack of mastery. Furthermore, each student should have ample opportunity to "inform" the instructor of any form of lack of mastery. Unfortunately the mere presence of a test question influences the student's response to the question. The developer should minimize that influence by constructing questions which permit the student to make any error he would make in the absence of such influence. For example, a multiple choice question should have all the wrong answers the student might want to select and should also have as many correct answers as the student might want to provide.

True/False Questions:

True/false questions should be written without ambiguity. That is, the statement of the question should be clear and the decision whether the statement is true or false should not depend on an obscure interpretation of the statement. A true/false question may easily be used, and most commonly is used, to determine if the student recalls facts. However, a true/false question may also be used to determine if the learner has mastered the learning objective well enough to correctly analyze a statement.

It is important to be aware that only two choices are available to the student and therefore the nature of the question gives the student a 50% chance of being correct. A single true/false question therefore is helpful only if the student answers the question incorrectly and the incorrect response indicates a specific misunderstanding of the learning objective. A collection of true/false questions about a single learning objective, all answered correctly by a student, is a much stronger indication of mastery. It is therefore important that the instructional developer construct a "test bank" containing a large number of true/false questions. It is also important to include numerous true/false questions on any test which utilizes true/false questions. Ideally a true/false question should be constructed so that an incorrect response indicates something about the student's misunderstanding of the learning objective. This may be a difficult task, especially when constructing a true statement. The instructional developer should try to accomplish the ideal, but should recognize that in some instances he/she will not reach that goal.
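The value of asking several true/false questions per objective follows from the arithmetic of guessing: the chance of getting k independent true/false questions all correct purely by guessing is

```latex
P(\text{all } k \text{ correct by guessing}) = 0.5^{k}
```

so one question gives a 50% chance of a lucky correct answer, while five questions on the same objective reduce that chance to about 3% (0.5^5 ≈ 0.031), making a fully correct set a much stronger indication of mastery.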

Multiple Choice Questions:

Multiple choice questions should be written without ambiguity. That is, the stem of the question should be clear and should leave no doubt about how to select choices. Additionally, each choice should be written without ambiguity and should contain all the information required to decide whether or not to select it. The decision to select or not select a choice should not depend on an obscure interpretation of either the stem or the choice. A multiple choice question may easily be used to determine if the student recalls facts. However, a multiple choice question may also be used to determine if the student has mastered the learning objective well enough to correctly analyze a statement.

The instructional developer should not construct multiple choice questions with a uniform number of choices, a uniform number of valid choices, or any other recognizable pattern for construction of choices. Instead the instructional developer should include as many valid and invalid choices as is required to determine the student's deficiencies with respect to the learning objective. Moreover, each choice should appear to be a valid choice to some student.

Multiple choice questions may therefore contain any number of choices, with one or more valid choices. The student is of course required to select all valid choices, and failure to select any one of the valid choices provides information about the student's misunderstanding of the learning objective in the same way that selection of an invalid choice reveals the nature of his/her misunderstanding. The choices provided in a multiple choice question may be of two types: those which require merely recall of facts and those which additionally require activity such as synthesis, analysis, computation, comparison, or diagramming. The instructional developer who is seriously concerned with the student's success will use both types extensively.
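
Because such items may have several valid choices, marking can also record which valid choices were missed and which invalid choices were selected. The sketch below is a hypothetical illustration of that idea (the choice labels are invented):

# Hypothetical sketch: diagnosing a multiple-response item by comparing the
# student's selections with the set of valid choices.
def diagnose_item(valid, selected):
    return {
        "mastered": selected == valid,        # exactly the valid choices were chosen
        "missed_valid": valid - selected,     # valid choices the student failed to select
        "chosen_invalid": selected - valid,   # distractors the student selected
    }

print(diagnose_item(valid={"A", "C"}, selected={"A", "D"}))
# {'mastered': False, 'missed_valid': {'C'}, 'chosen_invalid': {'D'}}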

Fill-in-the-Blank Questions:

The temptation, when constructing fill-in-the-blank questions, is to construct traps for the student. The instructional developer should avoid this problem. Ensure that there is only one acceptable word for the student to provide and that the word (or words) is significant. Avoid asking the student to supply "minor" words. Avoid fill-in-the-blank questions with so many blanks that the student is unable to determine what is to be completed.

Sometime/Always/Never Questions:

Sometime/Always/Never (SAN) questions are built from statements which are sometimes true, always true, or never true. The statements used in these questions must be worded carefully and should contain enough information to permit the student to decide whether the statement is true sometimes, always, or never.

SAN questions (especially the sometimes statements) are the most difficult to construct but can be the most significant part of a test. SAN questions should be constructed to force the student to engage in some critical thinking about the learning objective. When used properly, SAN questions force the student to consider important details about the learning objective. Careful use of this type of question and careful analysis of student's response will provide detailed information about some of the student's deficiencies.

SAN questions are especially appropriate, and easy to construct, for learning objectives addressing concepts which are "black" or "white" except in a few cases. The true statements in a collection of true/false questions are of course always-true statements, while the set of false statements may be further subdivided into those which are sometimes true and those which are never true.

Test Construction

Closed-Answer or “Objective” Tests

Although by definition no test can be truly "objective" (existing as an object of fact, independent of the mind), this handbook refers to tests made up of multiple choice, matching, true/false, or fill-in-the-blank items as objective tests. Objective tests have the advantages of allowing an instructor to assess a large and potentially representative sample of course material and of allowing reliable and efficient scoring. The disadvantages of objective tests include a tendency to emphasize only "recognition" skills, the ease with which correct answers can be guessed on many item types, and the inability to measure students' organization and synthesis of material (Adapted with permission from Yonge, 1977).

Since the practical arguments for giving objective exams are compelling, we offer a few suggestions for writing multiple-choice items. The first is to find and adapt existing test items. Teachers’ manuals containing collections of items accompany many textbooks. (AIs: Your course supervisor or former teachers of the same course may be willing to share items with you.) However, the general rule is adapt rather than adopt. Existing items will rarely fit your specific needs; you should tailor them to more adequately reflect your objectives.

Second, design multiple choice items so that students who know the subject or material adequately are more likely to choose the correct alternative and students with less adequate knowledge are more likely to choose a wrong alternative. That sounds simple enough, but you want to avoid writing items that lead students to choose the right answer for the wrong reasons. For instance, avoid making the correct alternative the longest or most qualified one, or the only one that is grammatically appropriate to the stem. Even a careless shift in tense or verb-subject agreement can often suggest the correct answer.

Finally, it is very easy to disregard the above advice and slip into writing items which require only rote recall but are nonetheless difficult because they are taken from obscure passages (footnotes, for instance). Some items requiring only recall might be appropriate, but try to design most of the items to tap the students’ understanding of the subject (Adapted with permission from Farris, 1985). One way to write multiple choice questions that require more than recall is to develop questions that resemble miniature “cases” or situations. Provide a small collection of data, such as a description of a situation, a series of graphs, quotes, a paragraph, or any cluster of the kinds of raw

information that might be appropriate material for the activities of your discipline. Then develop a series of questions based on that material. These questions might require students to apply learned concepts to the case, to combine data, to make a prediction on the outcome of a process, to analyze a relationship between pieces of the information, or to synthesize pieces of information into a new concept.

Here are a few additional guidelines to keep in mind when writing multiple-choice tests (Adapted with permission from Yonge, 1977):

The item-stem (the lead-in to the choices) should clearly formulate a problem.

As much of the question as possible should be included in the stem.

Randomize occurrence of the correct response (e.g., you don’t always want “C” to be the right answer).

Make sure there is only one clearly correct answer (unless you are instructing students to select more than one).

Make the wording in the response choices consistent with the item stem.

Don’t load down the stem with irrelevant material.

Beware of using answers such as “none of these” or “all of the above.”

Use negatives sparingly in the question or stem; do not use double negatives.

Beware of using sets of opposite answers unless more than one pair is presented (e.g., go to work, not go to work).

Beware of providing irrelevant grammatical cues.

Grading of multiple choice exams can be done by hand or through the use of computer scannable answer sheets available from your departmental office. Take completed answer sheets to IUB Evaluation Services and Testing (BEST) located in Franklin Hall

M014. If you have your test scored by BEST, they will provide statistics on difficulty and reliability, which will help you to improve your tests.
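
For readers who want to see what such statistics involve, item difficulty is usually the proportion of students answering an item correctly, and a simple discrimination index compares the top-scoring and bottom-scoring groups on each item. The sketch below is a generic, hypothetical illustration of those two calculations (the response data are invented; this is not BEST's actual procedure).

# Hypothetical sketch: item difficulty (proportion correct) and a simple discrimination index.
# rows = students, columns = items; 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]

n_students = len(responses)
totals = [sum(row) for row in responses]
order = sorted(range(n_students), key=lambda i: totals[i])   # students from lowest to highest total
k = n_students // 3                                          # size of the "low" and "high" groups
low, high = order[:k], order[-k:]

for item in range(len(responses[0])):
    difficulty = sum(row[item] for row in responses) / n_students
    discrimination = (sum(responses[i][item] for i in high) -
                      sum(responses[i][item] for i in low)) / k
    print("item", item + 1, "difficulty", round(difficulty, 2),
          "discrimination", round(discrimination, 2))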

If you choose the computer-grading route, you must be sure students have number 2 pencils to mark answers on their sheets. These are often available from your department's main office. At the time of the exam it is helpful to write on the chalkboard all pertinent information required on the answer sheet (course name, course number, section number, instructor's name, etc.). Also, remind students to fill in their university identification numbers carefully so that you can have a roster showing the ID number and grade for each student. If you would like to consult with someone about developing test items, call the Center for Innovative Teaching and Learning at 855-9023.

If you would like to consult with someone about how to interpret your test results, call BEST at 855-1595.

Essay Tests

Conventional wisdom accurately portrays short-answer and essay examinations as the easiest to write and the most difficult to grade, particularly if they are graded well. You should give students an exam question for each crucial concept that they must understand.

If you want students to study in both depth and breadth, don't give them a choice among topics; a choice allows them to avoid answering questions about the things they didn't study. Instructors generally expect a great deal from students, but remember that their mastery of a subject depends as much on prior preparation and experience as it does on diligence and intelligence; even at the end of the semester some students will be struggling to understand the material. Design your questions so that all students can answer at their own levels.

The following are some suggestions that may enhance the quality of the essay tests that you produce (Adapted with permission from Ronkowski, 1986):

1. Have in mind the processes that you want measured (e.g., analysis, synthesis).

2. Start questions with words such as "compare," "contrast," "explain why." Don't use "what," "when," or "list." (These latter types are better measured with objective-type items.) Writing Tutorial Services, Ballantine Hall 207, 855-6738, has a handout for students which defines these terms and explains how to study for and respond to essay questions.
3. Write items that define the parameters of expected answers as clearly as possible.
4. Make sure that the essay question is specific enough to invite the level of detail you expect in the answer. A question such as "Discuss the causes of the American Civil War" might get a wide range of answers, and therefore be impossible to grade reliably. A more controlled question would be, "Explain how the differing economic systems of the North and South contributed to the conflicts that led to the Civil War."
5. Don't have too many questions for the time available.

12 TYPES OF LISTENING TESTING

1. DISCRIMINATIVE LISTENING

Discriminative listening is an awareness of changes in the pitch and loudness of sounds; it is determining whether sounds are the same or different. These activities are designed to enhance this listening skill:
1) Same or different? - Call out two words and have the children determine if they are the same or different. For example, say bat/bat, bat/bet.
2) Rhyming words - Practice rhyming discriminative listening skills by calling out a few rhyming words, such as "hat, bat, rat, cat," and so on. Have the children take turns calling out a word that rhymes with "at" as well as other rhyming words you want to use.
3) What's the problem? - After reading a storybook to children (one that's very familiar to them), have them tell you what the problem is. As you read the story, change things around so the story is different somehow, to see if they catch the changes and can tell you what the problem is.
4) Musical moods - Play music, but change it up some by changing the pace; make it fast, slow, loud, soft, high and low. Have the children tell you when a sound change is made and what the change is.
5) Clap it out - After talking about syllables of words, clap out the syllables of some words you call out, starting with a two-syllable word, then three, and so on. Repeat a word at least twice (or more if needed) so the concept is fully grasped.

More generally, discriminative listening has to do with the identification of different variations in sounds and words in order to understand different messages. It is the most fundamental form of listening and it underpins all the other forms. It involves being sensitive to pitch, volume, emphasis and rate of speech in order to detect

the messages that may be hidden. This form of listening usually requires one to be efficient in two factors: have a good hearing ability and the knowledge of sound structure (Kline, 2010).

Hearing ability

The ability to hear helps in sound differentiation; therefore, if one can hear well, there is a high likelihood that they will get the message well (Lengel, 1998).

Knowledge of sound structure

The knowledge of sound structure enables an individual to differentiate different sounds and to tell what is being said. For example, hearing the difference between "I would rank it first" and "I drank it first" requires this kind of ability in order to get the message clearly. In conclusion, there are various forms of listening, including listening for the sake of making critical evaluations, building relationships, making discriminations, and obtaining information or gaining appreciation, and each of these needs calls for a different form of listening. These forms of listening depend on basic factors such as concentration, attention, memory, perception, experience, presentation style and the determination of ethos, pathos and logos. Without these, there may be no real communication going on.

An example of a discriminative listening exercise, in which different sounds are identified:
1) "I would rank it first" and "I drank it first"
2) bat/bat, bat/bet
3) safe/save
4) made/mate
5) age/h

2. COMPREHENSION LISTENING

The next step beyond discriminating between different sounds and sights is to make sense of them. Comprehending the meaning requires first having a lexicon of words at our fingertips and also the rules of grammar and syntax by which we can understand what others are saying. The same is true, of course, for the visual components of communication, and an understanding of body language helps us understand what the other person really means. In communication, some words are more important and some less so, and comprehension often benefits from the extraction of key facts and items from a long spiel. Comprehension listening is also known as content listening, informative listening and full listening.

Listening Comprehension Sample Questions

Transcript

Sample Item A

On the recording, you will hear:

(Narrator): Listen to a high school principal talking to the school's students.

(Man): I have a very special announcement to make. This year, not just one, but three of our students will be receiving national awards for their academic achievements. Krista Conner, Martin Chan, and Shriya Patel have all been chosen for their hard work and consistently high marks. It is very unusual for one school to have so many students receive this award in a single year.

(Narrator): What is the subject of the announcement?

In your test book, you will read:

1. What is the subject of the announcement?
A. The school will be adding new classes.
B. Three new teachers will be working at the school.
C. Some students have received an award.
D. The school is getting its own newspaper.

Sample Item B

On the recording, you will hear:

(Narrator): Listen to a teacher making an announcement at the end of the day.

(Man): Remember that a team of painters is coming in tomorrow to paint the walls. In this box on my desk are sheets of plastic that I want you to slip over your desks. Make sure you cover your desks completely so that no paint gets on them. Everything will be finished and the plastic will be removed by the time we return on Monday.

(Narrator): What does the teacher want the students to do?

In your test book, you will read:

2. What does the teacher want the students to do?
A. Take everything out of their desks
B. Put the painting supplies in plastic bags
C. Bring paints with them to school on Monday
D. Put covers on their desks to keep the paint off

Sample Set A

On the recording, you will hear:

(Narrator): Listen to a conversation between two friends at school.

(Boy): Hi, Lisa.

(Girl): Hi, Jeff. Hey, have you been to the art room today?

(Boy): No, why?

(Girl): Well, Mr. Jennings hung up a notice about a big project that's going on downtown. You know how the city's been doing a lot of work to fix up Main Street—you know, to make it look nicer? Well, they're going to create a mural.

(Boy): You mean, like, make a painting on the entire wall of a building?

(Girl): It's that big wall on the side of the public library. And students from this school are going to do the whole thing ... create a design, and paint it, and everything. I wish I could be a part of it, but I'm too busy.

(Boy): [excitedly] Cool! I'd love to help design a mural. Imagine everyone in town walking past that wall and seeing my artwork, every day.

(Girl): I thought you'd be interested. They want the mural to be about nature, so I guess all the design ideas students come up with should have a nature theme.

(Boy): That makes sense—they've been planting so many trees and plants along the streets and in the park.

(Girl): If you're interested you should talk with Mr. Jennings.

(Boy): [half listening, daydreaming] This could be so much fun. Maybe I'll try to visit the zoo this weekend ... you know, to see the wild animals and get some ideas, something to inspire me!

(Girl): [with humor] Well maybe you should go to the art room first to get more information from Mr. Jennings.

(Boy): [slightly sheepishly] Oh yeah. Good idea. Thanks for letting me know, Lisa! I'll go there right away.

(Narrator): Now answer the questions.

In your test book, you will read:

3. What are the speakers mainly discussing?
A. A new art project in the city
B. An assignment for their art class
C. An art display inside the public library
D. A painting that the girl saw downtown

4. Why is the boy excited?
A. A famous artist is going to visit his class.
B. His artwork might be seen by many people.
C. His class might visit an art museum.
D. He is getting a good grade in his art class.

5. Where does the boy say he may go this weekend?
A. To the zoo
B. To an art store
C. To Main Street
D. To the public library

6. Why does the girl suggest that the boy go to the art room?
A. So that he can hand in his homework
B. So that he can sign up for a class trip
C. So that he can see a new painting
D. So that he can talk to the teacher

Sample Set B

On the recording, you will hear:

Script Text:

(Narrator): Listen to a teacher talking in a biology class.

(Woman): We've talked before about how ants live and work together in huge communities. Well, one particular kind of ant community also grows its own food. So you could say these ants are like people like farmers. And what do these ants grow? They grow fungi [FUN-guy]. Fungi are kind of like plants—mushrooms are a kind of fungi. These ants have gardens, you could say, in their underground nests. This is where the fungi are grown.

Now, this particular kind of ant is called a leafcutter ant. Because of their name, people often think that leafcutter ants eat leaves. If they cut up leaves they must eat them, right? Well, they don't! They actually use the leaves as a kind of fertilizer. Leafcutter ants go out of their nests looking for leaves from plants or trees. They cut the leaves off and carry them underground . . . and then feed the leaves to the fungi—the fungi are able to absorb nutrients from the leaves. What the ants eat are the fungi that they grow. In that way, they are like farmers! The amazing thing about these ants is that the leaves they get are often larger and heavier than the ants themselves. If a leaf is too large, leafcutter ants will often cut it up into smaller pieces—but not all the time. Some ants carry whole leaves back into the nest. In fact, some experiments have been done to measure the heaviest leaf a leafcutter ant can lift without cutting it. It turns out, it depends on the individual ant. Some are stronger than others. The experiments showed that some "super ants" can lift leaves about 100 times the weight of their body!

(Narrator): Now answer the questions.

In your test book, you will read:

7. What is the main topic of the talk?
A. A newly discovered type of ant
B. A type of ant with unusual skills
C. An increase in the population of one type of ant
D. A type of ant that could be dangerous to humans

8. According to the teacher, what is one activity that both leafcutter ants and people do?
A. Clean their food
B. Grow their own food
C. Eat several times a day
D. Feed their young special food

9. What does the teacher say many people think must be true about leafcutter ants?
A. They eat leaves.
B. They live in plants.
C. They have sharp teeth.
D. They are especially large.

10. What did the experiments show about leafcutter ants?
A. How fast they grow
B. Which plants they eat
C. Where they look for leaves
D. How much weight they can carry

Answer Key for Listening Comprehension

1. C 2. D 3. A 4. B 5. A 6. D 7. B 8. B 9. A 10. D
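
Once an answer key like the one above is stored alongside a student's responses, marking and item review can be automated. The snippet below is a hypothetical illustration (the student's responses are invented):

# Hypothetical sketch: score one student's responses against the listening answer key above.
answer_key = {1: "C", 2: "D", 3: "A", 4: "B", 5: "A",
              6: "D", 7: "B", 8: "B", 9: "A", 10: "D"}
student = {1: "C", 2: "D", 3: "B", 4: "B", 5: "A",
           6: "D", 7: "B", 8: "A", 9: "A", 10: "D"}

wrong = [q for q, key in answer_key.items() if student.get(q) != key]
score = len(answer_key) - len(wrong)
print("score:", score, "/", len(answer_key), "- items to review:", wrong)
# score: 8 / 10 - items to review: [3, 8]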

3. CRITICAL LISTENING

Critical listening is listening in order to evaluate and judge, forming opinion about what is being said. Judgment includes assessing strengths and weaknesses, agreement and approval. This form of listening requires significant real-time cognitive effort as the listener analyzes what is being said, relating it to existing knowledge and rules, whilst simultaneously listening to the ongoing words from the speaker.

4. BIASED LISTENING

Biased listening happens when the person hears only what they want to hear, typically misinterpreting what the other person says based on the stereotypes and other biases that they have. Such biased listening is often very evaluative in nature.

5. EVALUATIVE LISTENING

In evaluative listening, or critical listening, we make judgments about what the other person is saying. We seek to assess the truth of what is being said. We also judge what they say against our values, assessing them as good or bad, worthy or unworthy. Evaluative listening is particularly pertinent when the other person is trying to persuade us, perhaps to change our behavior and maybe even to change our beliefs. Within this, we also discriminate between subtleties of language and comprehend the inner meaning of what is said. Typically we also weigh up the pros and cons of an argument, determining whether it makes sense logically as well as whether it is helpful to us. Evaluative listening is also called critical, judgmental or interpretive listening.

6. APPRECIATIVE LISTENING

In appreciative listening, we seek out certain information that we will appreciate, for example information that helps meet our needs and goals. We use appreciative listening when we are listening to good music, poetry or maybe even the stirring words of a great leader. Students use appreciative listening when they listen to such material and seek out the information they will appreciate.

Adventure Quotient (AQ) Test (77 questions, 30 min)

How adventurous are you? Thrill-seeking can come in different forms, whether it's doing a swan dive bungee jump off the Auckland Harbour Bridge in New Zealand, or trying that new exotic restaurant around the corner from work. The type of adventure you enjoy (or avoid) depends a great deal on your personality. Are you more of a planner or spontaneous? Courageous or careful? Do you have the energy level of a bee or a sloth? Find out more about your adventure personality with this test!

Examine the following statements and choose the answer option that best applies to you. There may be some questions describing situations that may not be relevant to you. In such cases, select the answer you would most likely choose if you ever found yourself in that type of situation. In order to receive the most accurate results, please answer as truthfully as possible. After finishing the test, you will receive a Snapshot Report with an introduction, a graph and a personalized interpretation for one of your test scores. You will then have the option to purchase the full results.

Adventure Quotient (AQ) Test (50 questions, 30 min)
1. I _____ repetitive tasks. (enjoy / don't mind / can't stand)
2. I take pride in my appearance and upkeep. (Agree / Somewhat agree / Disagree)
3. I have already been or would consider any of the following: skydiving, bungee jumping, hang gliding, or free climbing. (Definitely / Maybe / No way)
4. I see getting away from it all as a chance to: (Connect with people and places / Connect with myself)

5. I would travel to a developing country and leave the airport/train station: (With pleasure / Only with a friend / Only with a hired guide)
6. I seek new experiences more... (To learn about new places, people, and things / For the way they make me feel)
7. I am more likely to ask myself: ("When is break time?" / "What's next?")
8. I am more likely to get my thrills from: (Doing something physically or emotionally gutsy / Watching someone else do something physically or emotionally gutsy)
9. The lowest comfort I would consider for sleeping is: (Outside on the ground / A tent / An RV or camper / A motel / A bed and breakfast / A furnished apartment or house / A 3 or 4 star hotel)
10. Adrenaline is a chemical that: (I avoid / I enjoy from time to time / I seek / I am addicted to)
11. Having a daily routine is: (Oppressive and stifling / Annoying and limiting / Sometimes a good thing, sometimes not / Helpful and comforting / Totally necessary)
12. Not knowing what the future might hold is: (Terrifying / A little disconcerting, but that's just the way life is / Exhilarating)
13. At a theme park, I'll try: (The highest, scariest ride / Something fast, but no upside-down stuff / The kiddie train or merry-go-round / The park bench)
14. Having nice things and looking good is important to me. (Extremely / Somewhat / Not very)
15. A life without luxury is: (Not worth living / Difficult to imagine / Perfectly acceptable / Expected)
16. Knowing what others think of me is: (Essential / Important / Helpful / Not important)
17. When visiting new places, I am more interested in: (Soaking up the environment / Interacting with people)
18. Others are more likely to wonder... (Where my energy goes / Where my energy comes from)
19. Life's experiences are most rich and interesting when I contemplate them... (With others / In my own mind)

An old friend is in town. Where are you most likely to eat?

A. We'd eat at: (A fast food joint / An ethnic café)
B. We'd eat at: (A themed restaurant or dinner theater / At the kitchen table in my house, warming up something in the microwave)
C. We'd eat at a: (Chain restaurant / Upscale restaurant)

You inherit $100,000 from a distant uncle. What are you more likely to do with it?
D. I'd take my wallet out and: (Go on an epic shopping spree / Donate some, or all, to charity)
E. I'd take my wallet out and: (Go on a casino fling / Put it in the bank)
F. I'd take my wallet out and: (Go on a dream vacation / Throw a gigantic party)

It's time to learn something new. Which class would you be most interested in taking up?
G. I would rather take: (Acting classes / Creative writing classes)
H. I would rather take: (Survival skills classes / Speed reading classes)
I. I would rather take: (Kickboxing classes / Tai Chi classes)

Which of the following would you rather visit or spend some time in?
J. I would rather go to: (An Inuit igloo / A Buddhist monastery)

K. I would rather go to: (An African hut / A European hostel)
L. I would rather go to: (A Japanese pagoda / A California spa)

Pick your preferred pet.
M. I'd rather have a: (Parrot / Hamster)
N. I'd rather have a: (Goldfish / Snake)
O. I'd rather have a: (Tarantula / Horse)

Which is your preferred adrenaline rush?
P. There's nothing like the thrill of: (A looming deadline / A charging rhino)
Q. There's nothing like the thrill of: (Running cross-country / Running with the bulls)
R. There's nothing like the thrill of: (Swimming with dolphins / Swimming with sharks)

Which is your preferred adrenaline rush?
S. There's nothing like the thrill of: (Finding something I really like on sale / Finding an ancient Egyptian artifact in the Valley of the Kings)
T. There's nothing like the thrill of: (Cycling or hiking / Taking a scenic drive)

U. There's nothing like the thrill of: (Getting a tattoo or piercing / Skydiving or hang gliding)

Pick the adjective that best describes you.
V. I am more: (Bold / Timid)
W. I am more: (Impulsive / Deliberate)
X. I am more: (Of an improviser / Of a planner)
Y. What's your most favorite way to get from point A to point B? (First class or Business class / The scenic railroad route / Automobile - the classic "road trip" / Budget airline - who needs legroom? / Tour bus - sit back and relax / An all-terrain vehicle. No road? No problem / My bike - and the wind in my hair)
Z. What's your comfort zone when it comes to heights? (Top shelf of the bookcase / The 3-meter diving board / A bungee jump / A skydive / A spacewalk)
AA. What is the one form of footwear you could never live without? (Skis / Cycling shoes / Stiletto heels / Cross-trainers / Walking shoes / Flip-flops / Hiking boots / Dress shoes)
BB. Which voice mail message are you most likely to leave on a friend's phone? ("How about a movie and some take out?" / "Got an extra ticket to a show, let's go!" / "Party of the century! Pick you up at 9." / "Meet me at the airport with a suitcase and your passport.")
CC. Which phrase do you agree with more? ("Better safe than sorry." / "Nothing ventured, nothing gained.")
DD. How much of Mother Nature's wrath will you endure for adventure? (Monsoon, tornado, ice storm - bring it on! / Thundershowers, extreme hot and cold / Some wind, clouds, and drizzle / If it's not blue skies, forget it)
EE. How often do you pick up new fashions? (Daily to weekly / Monthly to yearly / Every decade or so)

7. SYMPATHETIC LISTENING

In sympathetic listening we care about the other person and show this concern in the way we pay close attention and express our sorrow for their ills and happiness at their joys.

8. EMPATHETIC LISTENING

When we listen empathetically, we go beyond sympathy to seek a truer understanding of how others are feeling. This requires excellent discrimination and close attention to the nuances of emotional signals. When we are being truly empathetic, we actually feel what they are feeling. In order to get others to expose these deep parts of themselves to us, we also need to demonstrate our empathy in our demeanor towards them, asking sensitively and in a way that encourages self-disclosure.

9. THERAPEUTIC LISTENING

In therapeutic listening, the listener's purpose is not only to empathize with the speaker but also to use this deep connection to help the speaker understand, change or develop in some way. This happens not only when you go to see a therapist but also in many social situations, where friends and family seek both to diagnose problems from listening and to help the speaker cure themselves, perhaps by some cathartic process. It also happens in work situations, where managers, HR people, trainers and coaches seek to help employees learn and develop.

10. DIALOGIC LISTENING

The word 'dialogue' stems from the Greek words 'dia', meaning 'through', and 'logos', meaning 'words'. Thus dialogic listening means learning through conversation and an engaged interchange of ideas and information in which we actively seek to learn more about the person and how they think. Dialogic listening is sometimes known as 'relational listening'.

The example of dialogic listening A : I was working as a training director for a national homelessness foundation. I was traveling around the country doing a lot of teaching and consulting. I was mostly the only white male wherever I went. So I was doing big urban shelters and city governments in Detroit and places like that. I was always coming up against race, class, and gender issues between myself and the participants. Q : Because they weren't white males? A : Right, they were mostly females of color, and I could always deal with it, but it was by the seat of my pants. So I came to PCP for consultation initially and then I was accepted into their first workshop back in 1994. I found it to be such a revolutionary approach to difference, one that I had never experienced before in all my training in diversity and all that other stuff. I found out after my first class that I had to do my training in Louisville, Kentucky for the homelessness network there. The issue there was that the staff of the homeless shelters were mostly women of color, and the volunteers were mostly affluent white women from the suburbs and they differed in many ways and had different ideas about each other as well. So I started doing this training. One of the goals of this group was that they wanted the people to work more effectively together. About half way through the first day, an African American woman stood up and she was very angry. She said, "You don't know shit about my life, you're a white man with privilege." I had some choices to make there. But because I had been to this one PCP class, I decided that I was going to deal with this differently than I would have dealt with this prior. I said, "You're absolutely right. I am white. I'm a guy. I have certain level of power. I wear a tie. I live in suburbs. I drive a nice car. And I imagine that your story has a lot to do with why you're here. I imagine that English Language Testing 95 a lot of other people's stories have a lot to do with why they're here. I'm wondering if we can make a choice together as a group to hear your story, and what it is that you want people to understand about you. Would you be willing to hear the stories of others?" She said, "Yeah." So I had everyone go around and tell the group how their personal story connected to why they were there. Everybody went around the room. Women told these incredible stories. I remember there was one white woman who told how she had been homeless for the last two years. That she had been beaten by her husband, but because they were wealthy and lived in the suburbs, he was basically able to buy off the police, and she was basically in prison because of her wealth. Finally, when he started beating the children, she took them. She was cut off completely from his wealth and lived on the streets for two years. She had just gotten out of shelter. This tremendous bonding happened among these women. We were all brought to tears by it. That affected me deeply. I came home and a couple days later my kids were fighting. I was always the type, and I still give into this temptation, of getting involved in the middle and trying to referee, thinking I know what's going on. In this instance, I tried taking what is called a not-knowing attitude. I suggested that each kid take five minutes to explain what's going on. I was using the "what's-at- the-heart-of-the-matter-for-you approach," but in a way that was easier for them to understand because they were younger. So each kid had five minutes. 
Once they spoke, I realized that I certainly didn't have a clue about what their concerns were. I had a completely different idea about what they were concerned about, and they had completely different ideas about each other. They were then able to say, "Oh, so that's all you want," and then move along. Now it doesn't always happen like that, but it made a really deep impression on me. The biggest thing for me is being a father, it's the most important thing in my life, and the fact that I can do it well is my biggest accomplishment. To think that I was doing it so well, yet I was doing it so ineffectively that I could not know my own kids. I could be with them ten hours a day but still not know them because I wasn't listening to them deeply. It blew my mind. I just thought that this is the best thing since sliced bread. So those two things really catapulted me into the whole PCP mindset. English Language Testing 96 Q : So you were really struck by the real power of letting the parties speak for themselves, without being the convoy, without being the person who summarizes and says, "This is what's going on." A : Right. Exactly. Yeah, because I could have said, "This is what I hear." I try to relate it to my own experience in some way, but basically I don't know. Being asked, "Can we use your wisdom and tap the rest of the wisdom in the room and make it work for us here?" Then leaving it in their hands afterward was big. It just was not my style to do that before.

Listen carefully to the dialog between Nick and Jimmy, then complete the conversation.
Nick : I heard (1)...... as a computer programmer
Jimmy : Yes, and I had already (2)......
Nick : Really? I'm happy (3)...
Jimmy : Thank you.
Nick : Your parents must be (4)......
Jimmy : They want me to run their business. They're (5)......
Nick : That's a pity! Did you explain your reasons?
Jimmy : I did and I hope they'll accept my decision.

Dialog II
Margaret : Look at you! You look so great now. What have you been doing?
Joe : Really? (1)...... I've been in Canada for two weeks. By the way, how about your job?
Margaret : (2)...... It's in a big new hospital. My working conditions are much better than in the last place.
Tony : Attention, please. Today, we have a surprise. We've been offered a trip from our boss.
Joe : Really? (3)......?
Tony : Bandung.
Joe : (4)...... But where is it located?
Tony : Aren't you pleased?
Joe : Yes, of course. (5)...... But tell me where it is.
Margaret : It's in Indonesia.
Joe : Oh, I see. That's not so good.
Tony : Don't worry, Joe. My friend, Lisa, who lives there, wrote to me about the conditions in Indonesia. Indonesia is safe now, especially in that town. There is no riot. It's just a rumour.

Key Answer
1) I think it's usual
2) That's great
3) Where to
4) Marvellous
5) I'm delighted to hear that

11. RELATIONSHIP LISTENING

Sometimes the most important factor in listening is the need to develop or sustain a relationship. This is why lovers talk for hours and attend closely to what each other has to say, when the same words from someone else would seem rather boring. Relationship listening is also important in areas such as negotiation and sales, where it is helpful if the other person likes you and trusts you.

13 Testing Grammar

English is a very important language in the world. It plays a very big role in communication and education. Much of what is delivered through technology is related to English. By and by, English will be the global language in every part of the world. Since English is an international language, people all over the world try to learn as much as possible about English. To develop our skill in English we always encounter grammar, and we practice our English through grammar testing, so that we know how far we understand the language.

A. Definition of grammar

Grammar is the structural foundation of our ability to express ourselves. The more we are aware of how it works, the more we can monitor the meaning and effectiveness of the way we and others use language. It can help foster precision, detect ambiguity, and

exploit the richness of expression available in English. And it can help everyone--not only teachers of English, but teachers of anything, for all teaching is ultimately a matter of getting to grips with meaning.

1. Descriptive grammar refers to the structure of a language as it is actually used by speakers and writers.
2. Prescriptive grammar refers to the structure of a language as certain people think it should be used.

Both kinds of grammar are concerned with rules--but in different ways. Specialists in descriptive grammar (called linguists) study the rules or patterns that underlie our use of words, phrases, clauses, and sentences. On the other hand, prescriptive grammarians (such as most editors and teachers) lay out rules about what they believe to be the “correct” or “incorrect” use of language.

B. Types of test

Before writing a test it is vital to think about what it is you want to test and what its purpose is. We must make a distinction here between proficiency tests, achievement tests, diagnostic tests and prognostic tests.

1. A proficiency test is one that measures a candidate's overall ability in a language; it isn't related to a specific course.
2. An achievement test, on the other hand, tests the students' knowledge of the material that has been taught on a course.
3. A diagnostic test highlights the strong and weak points that a learner may have in a particular area.
4. A prognostic test attempts to predict how a student will perform on a course.

There are of course many other types of tests. It is important to choose elicitation techniques carefully when you prepare one of the aforementioned tests.There are many elicitation techniques that can be used when writing a test. Below are some widely used types with some guidance on their strengths and weaknesses. Using the right kind of

question at the right time can be enormously important in giving us a clear understanding of our students' abilities, but we must also be aware of the limitations of each of these task or question types so that we use each one appropriately.

1. Multiple choice

Choose the correct word to complete the sentence.
Cook is ______ today for being one of Britain's most famous explorers.
a) recommended  b) reminded  c) recognized  d) remembered

In this question type there is a stem and various options to choose from. The advantages of this question type are that it is easy to mark and minimizes guesswork by having multiple distracters. The disadvantage is that it can be very time-consuming to create; effective multiple choice items are surprisingly difficult to write. Also it takes time for the candidate to process the information, which leads to problems with the validity of the exam. If a low-level candidate has to read through lots of complicated information before they can answer the question, you may find you are testing their reading skills more than their lexical knowledge.

Multiple choice can be used to test most things such as grammar, vocabulary, reading, listening etc. but you must remember that it is still possible for students to just 'guess' without knowing the correct answer.

2. Transformation

Complete the second sentence so that it has the same meaning as the first.
'Do you know what the time is, John?' asked Dave.
Dave asked John ______ (what) ______ it was.

This time a candidate has to rewrite a sentence based on an instruction or a key word given. This type of task is fairly easy to mark, but the problem is that it doesn't test understanding. A candidate may simply be able to rewrite sentences to a formula. The fact that a candidate has to paraphrase the whole meaning of the sentence in the example above, however, minimizes this drawback.

English Language Testing 101 Transformations are particularly effective for testing grammar and understanding of form. This wouldn't be an appropriate question type if you wanted to test skills such as reading or listening.

3. Gap-filling

Complete the sentence.

Check the exchange ______ to see how much your money is worth.

The candidate fills the gap to complete the sentence. A hint may sometimes be included such as a root verb that needs to be changed, or the first letter of the word etc. This usually tests grammar or vocabulary. Again this type of task is easy to mark and relatively easy to write. The teacher must bear in mind though that in some cases there may be many possible correct answers.

 Gap-fills can be used to test a variety of areas such as vocabulary, grammar and are very effective at testing listening for specific words

4. True / False

Decide if the statement is true or false.

England won the World Cup in 1966. T/F

Here the candidate must decide if a statement is true or false. Again this type is easy to mark but guessing can result in many correct answers. The best way to counteract this effect is to have a lot of items.

 This question type is mostly used to test listening and reading comprehension

5. Open questions

Answer the questions.

Why did John steal the money?

Here the candidate must answer simple questions after a reading or listening, or as part of an oral interview. It can be used to test anything. If the answer is open-ended it will be more difficult and time-consuming to mark, and there may also be an element of subjectivity involved in judging how 'complete' the answer is, but it may also be a more accurate test.

 These question types are very useful for testing any of the four skills, but less useful for testing grammar or vocabulary.

6. Error Correction

Find the mistakes in the sentence and correct them.

Ipswich Town was the more better team on the night.

Errors must be found and corrected in a sentence or passage. It could be an extra word, mistakes with verb forms, words missed etc. One problem with this question type is that some errors can be corrected in more than one way.

 Error correction is useful for testing grammar and vocabulary as well as readings and listening.

7. Other Techniques

There are of course many other elicitation techniques such as translation, essays, dictations, ordering words/phrases into a sequence and sentence construction (He/go/school/yesterday).

It is important to ask yourself what exactly you are trying to test, which techniques suit this purpose best and to bear in mind the drawbacks of each technique. Awareness of this will help you to minimize the problems and produce a more effective test.

C. The Value of Studying Grammar

The study of grammar all by itself will not necessarily make you a better writer. But by gaining a clearer understanding of how our language works, you should also gain greater control over the way you shape words into sentences and sentences into paragraphs. In short, studying grammar may help you become a more effective writer. Descriptive grammarians generally advise us not to be overly concerned with matters of correctness: language, they say, isn't good or bad; it simply is. As the history of the glamorous word grammar demonstrates, the English language is a living system of communication, a continually evolving affair. Within a generation or two, words and phrases come into fashion and fall out again. Over centuries, word endings and entire sentence structures can change or disappear.

Prescriptive grammarians prefer giving practical advice about using language: straightforward rules to help us avoid making errors. The rules may be over-simplified at times, but they are meant to keep us out of trouble--the kind of trouble that may distract or even confuse our readers.

14 INTERPRETING TEST SCORE

Introduction

What does interpret mean? To interpret is to decide what the intended meaning of something is (Cambridge Advanced Learner's Dictionary). To interpret is to conceive the significance of; to construe (thefreedictionary.com). Thus, to interpret is to understand the meaning and the significance of something. Interpreting test scores is understanding the meaning and the significance of test scores, which can then be used to plan the next action - to fix or to retain. There are many ways to do it, but the three most common are frequency distribution, measures of central tendency, and measures of dispersion. Frequency distribution concerns the distribution of scores and the frequency of each category. Measures of central tendency refer to measures of the "middle" value, and are calculated using the mode, median, and mean. Last but not least are the measures of dispersion, which relate to the range or spread of scores. All three can help teachers interpret the meaning behind test scores.

II. Content

A. Frequency Distribution

Frequency distribution deals with the distribution of scores and the frequency of the distribution. Each entry in the table contains the frequency or count of the occurrences of scores within a particular group, and in this way the table summarizes the distribution of scores. The example case here is: a teacher administers a test of 40 questions to 26 students. Marks are awarded by counting the number of correct answers on the test scripts. These are known as raw marks. Here are the steps to create a table of frequency distribution:

1. Create Table 1 and put the raw mark of every student in it.

TABLE 1
Testee  Mark
A       20
B       25
C       33
D       35
E       29
F       25
G       30
H       26
I       19
J       27
K       26
L       32
M       34
N       27
O       27
P       29
Q       25
R       23
S       30
T       26
U       22
V       23
W       33
X       26
Y       24
Z       26

2. Create Table 2. Sort the marks from the highest to the lowest score. This is called descending sorting. It is easier and faster to use tool like Microsoft Excel to do the sorting.

TABLE 2
Testee  Mark
D       35
M       34
C       33
W       33
L       32
G       30
S       30
E       29
P       29
J       27
N       27
O       27
H       26
K       26
T       26
X       26
Z       26
B       25
F       25
Q       25
Y       24
R       23
V       23
U       22
A       20
I       19

Now, we determine the rank. We start from rank 1 up to rank 26, for there are 26 students. The problem comes when two or more students have the same mark. Here we highlight the same mark to make it easier to distinguish. Then, we write an imaginary rank to the right of the Rank column, from 1 to 26. The imaginary ranks of the same mark are then added and divided by the number of people who got that mark. For example, students C and W have the same mark, 33. Their imaginary ranks are 3 and 4. To get the actual rank, we add 3 and 4 (3 + 4 = 7). The result, 7, is then divided by the number of people with the same score, which is 2 here. The final result is 3.5. Thus, the rank of both of them is 3.5.

TABLE 2
Testee  Mark  Rank  Imaginary rank
D       35    ?     1
M       34    ?     2
C       33    ?     3     (3+4)/2 = 3.5
W       33    ?     4
L       32    ?     5
G       30    ?     6     (6+7)/2 = 6.5
S       30    ?     7
E       29    ?     8     (8+9)/2 = 8.5
P       29    ?     9
J       27    ?     10    (10+11+12)/3 = 11
N       27    ?     11
O       27    ?     12
H       26    ?     13    (13+14+15+16+17)/5 = 15
K       26    ?     14
T       26    ?     15
X       26    ?     16
Z       26    ?     17
B       25    ?     18    (18+19+20)/3 = 19
F       25    ?     19
Q       25    ?     20
Y       24    ?     21
R       23    ?     22    (22+23)/2 = 22.5
V       23    ?     23
U       22    ?     24
A       20    ?     25
I       19    ?     26

The result will be like this. Table 2 shows the students' scores in order of merit and their rank as well.

TABLE 2
Testee  Mark  Rank
D       35    1
M       34    2
C       33    3.5
W       33    3.5
L       32    5
G       30    6.5
S       30    6.5
E       29    8.5
P       29    8.5
J       27    11
N       27    11
O       27    11
H       26    15
K       26    15
T       26    15
X       26    15
Z       26    15
B       25    19
F       25    19
Q       25    19
Y       24    21
R       23    22.5
V       23    22.5
U       22    24
A       20    25
I       19    26
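
The same tie-averaging rule can be expressed in a few lines of code. The sketch below is a hypothetical illustration using the 26 marks from Table 1: marks are sorted in descending order, each mark collects the positions ("imaginary ranks") it occupies, and tied marks receive the average of those positions.

# Hypothetical sketch of the rank-averaging rule described above.
from collections import defaultdict

marks = {"D": 35, "M": 34, "C": 33, "W": 33, "L": 32, "G": 30, "S": 30,
         "E": 29, "P": 29, "J": 27, "N": 27, "O": 27, "H": 26, "K": 26,
         "T": 26, "X": 26, "Z": 26, "B": 25, "F": 25, "Q": 25, "Y": 24,
         "R": 23, "V": 23, "U": 22, "A": 20, "I": 19}

ordered = sorted(marks.items(), key=lambda kv: kv[1], reverse=True)
positions = defaultdict(list)                       # mark -> the positions it occupies
for position, (testee, mark) in enumerate(ordered, start=1):
    positions[mark].append(position)

ranks = {testee: sum(positions[mark]) / len(positions[mark])
         for testee, mark in ordered}
print(ranks["C"], ranks["H"], ranks["I"])           # 3.5, 15.0, 26.0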

3. Create Table 3, which consists of a Mark column, a Tally column, and a Frequency column. In the Mark column, we can expand the range from 40 down to 15, even though the highest score is 35 and the lowest score is 19. We usually do this to give more space and to enhance readability. The tally is the stroke count of how many students get a certain score; it is simply a method of counting the frequency of scores. The Frequency column lists the number of students obtaining each score, which is easy to count thanks to the tallies.

Table 3 is the table of frequency distribution.

TABLE 3
Mark  Tally   Frequency
40
39
38
37
36
35    /       1
34    /       1
33    //      2
32    /       1
31
30    //      2
29    //      2
28
27    ///     3
26    /////   5
25    ///     3
24    /       1
23    //      2
22    /       1
21
20    /       1
19    /       1
18
17
16
15
TOTAL         26
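
A tally like Table 3 can be checked quickly by counting the raw marks with a short script. The snippet below is a hypothetical illustration using Python's collections.Counter on the 26 raw marks from Table 1.

# Hypothetical check of Table 3: count how often each mark occurs.
from collections import Counter

marks = [20, 25, 33, 35, 29, 25, 30, 26, 19, 27, 26, 32, 34,
         27, 27, 29, 25, 23, 30, 26, 22, 23, 33, 26, 24, 26]

frequency = Counter(marks)
for mark in range(40, 14, -1):                      # the same 40-to-15 range as Table 3
    if frequency[mark]:
        print(mark, "/" * frequency[mark], frequency[mark])
print("TOTAL", sum(frequency.values()))             # 26 testees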

B. Measures of Central Tendency

A measure of central tendency is a measure that tells us where the "middle" of a bunch of data lies. The three most common measures of central tendency are the mode, the median, and the mean.

B.1. Mode

Mode refers to the score which most candidates obtained. We can easily spot it from Table 3. The most frequent score in Table 3 is 26, as five testees have scored this mark. Thus, the mode is 26.

B.2. Median

Median refers to the score gained by the middle candidate after the data is put in order. We use Table 2, which has been sorted in descending order, to find the median. In the case of 26 students, there can obviously be no single middle student, so the score halfway between the lowest score in the top half and the highest score in the bottom half is taken as the median. The median score in this case is 26.

TABLE 2
Testee  Mark  Rank
D       35    1
M       34    2
C       33    3.5
W       33    3.5
L       32    5
G       30    6.5
S       30    6.5        TOP HALF
E       29    8.5
P       29    8.5
J       27    11
N       27    11
O       27    11
H       26    15    <- lowest score of the top half: 26

Median = (26 + 26) / 2 = 26

K       26    15    <- highest score of the bottom half: 26
T       26    15
X       26    15
Z       26    15
B       25    19
F       25    19         BOTTOM HALF
Q       25    19
Y       24    21
R       23    22.5
V       23    22.5
U       22    24
A       20    25
I       19    26

B.3. Mean

Mean or average score is the sum of the scores divided by the total number of testees. The mean is the most efficient measure of central tendency, but it is not always appropriate. Now, we are going to create Table 4 to calculate the mean. Note that the symbol x is used to denote the score, N the number of testees, and m the mean. The symbol f denotes the frequency with which a score occurs. The symbol ∑ means 'the sum of'. First, we gather the data from the previous Table 3: we take the scores and their frequencies from it. Each score (x) is then multiplied by its frequency (f), and the result is put in the fx column. After that, the fx column is totalled to give ∑fx.

TABLE 4
x     ×  f    fx
35    ×  1    35
34    ×  1    34
33    ×  2    66
32    ×  1    32
30    ×  2    60
29    ×  2    58
27    ×  3    81
26    ×  5    130
25    ×  3    75
24    ×  1    24
23    ×  2    46
22    ×  1    22
20    ×  1    20
19    ×  1    19
TOTAL (∑fx)   702

To get the mean, we use formula m = ∑fx / N.

m = ∑fx / N = 702 / 26 = 27. Thus, the mean is 27.
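
All three measures of central tendency can be verified with Python's statistics module, as a hypothetical cross-check on the hand calculation (statistics.median sorts the data and averages the two middle scores, which is exactly the halving procedure used above for the median).

# Hypothetical check of the mode, median and mean for the 26 marks above.
import statistics

marks = [20, 25, 33, 35, 29, 25, 30, 26, 19, 27, 26, 32, 34,
         27, 27, 29, 25, 23, 30, 26, 22, 23, 33, 26, 24, 26]

print("mode:", statistics.mode(marks))              # 26, the most frequent score
print("median:", statistics.median(marks))          # 26, the average of the two middle scores
print("mean:", statistics.mean(marks))              # 27, i.e. 702 / 26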

C. Measures of Dispersion

Measures of dispersion are important for describing the spread of the scores, or their variation around a central value. There are various methods that can be used to measure the dispersion of a dataset, but the most common ones are the range and the standard deviation.

C.1. Range

A simple way of measuring the spread of marks is based on the difference between the highest and the lowest scores. It is called the range. From the previous Table 2, we can see the highest score is 35 and the lowest score is 19. The range is 16.

Range = Xmax – Xmin = 35 – 19 = 16

C.2. Standard Deviation

The standard deviation (s.d.) is another way of showing the spread of scores. It shows how all the scores are spread out and gives a fuller description of test scores than the range. One simple method of calculating the s.d. is shown below:

s.d. = √(Σd² / N)

where N is the number of scores and d is the deviation of each score from the mean. From the previous calculation, the mean is 27. The steps to calculate the s.d. are as follows:

1. Step 1: Find out the amount by which each score deviates from the mean (d).

Score   Deviation from the mean (d = Score - 27)
35       8
34       7
33       6
33       6
32       5
30       3
30       3
29       2
29       2
27       0
27       0
27       0
26      -1
26      -1
26      -1
26      -1
26      -1
25      -2
25      -2
25      -2
24      -3
23      -4
23      -4
22      -5
20      -7
19      -8

2. Step 2: Square each result (d²).

Score   d (Score - 27)   d²
35       8               64
34       7               49
33       6               36
33       6               36
32       5               25
30       3                9
30       3                9
29       2                4
29       2                4
27       0                0
27       0                0
27       0                0
26      -1                1
26      -1                1
26      -1                1
26      -1                1
26      -1                1
25      -2                4
25      -2                4
25      -2                4
24      -3                9
23      -4               16
23      -4               16
22      -5               25
20      -7               49
19      -8               64

3. Step 3: Total all the results: Σd² = 432.

4. Step 4: Divide the total by the number of testees: Σd² / N.

Σd² / N = 432 / 26 = 16.62

5. Step 5: Find the square root of the result: √(Σd² / N).

√16.62 = 4.077 ≈ 4.08

Thus, the standard deviation (s.d.) is 4.08. That means that, on average, the scores are about 4 points away from the mean.
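
The whole dispersion calculation can be reproduced in a few lines as a cross-check. The sketch below is a hypothetical illustration; statistics.pstdev is the population standard deviation (it divides by N), which matches the five steps above.

# Hypothetical check of the range and the standard deviation for the 26 marks.
import statistics

marks = [20, 25, 33, 35, 29, 25, 30, 26, 19, 27, 26, 32, 34,
         27, 27, 29, 25, 23, 30, 26, 22, 23, 33, 26, 24, 26]

mean = statistics.mean(marks)                                   # 27
score_range = max(marks) - min(marks)                           # 35 - 19 = 16
variance = sum((x - mean) ** 2 for x in marks) / len(marks)     # 432 / 26 = 16.62 (approx.)
sd = variance ** 0.5                                            # about 4.08

print("range:", score_range)
print("s.d. by hand:", round(sd, 2))
print("s.d. via statistics.pstdev:", round(statistics.pstdev(marks), 2))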

References:

Teaching Student-Centered Mathematics (K-3). John van de Walle and LouAnn Lovin, Pearson Publishing, 2006.

Classroom and Large- Scale Assessment. Wilson and Kenney. This article appeared in A Research Companion to Principles and Standards for School Mathematics (NCTM), 2003, (pages 53-67).

Principles and Standards for School Mathematics. National Council of Teachers of Mathematics (NCTM), 2000.

Young Mathematicians at Work, Constructing Number Sense, Addition, and Subtraction. By Catherine TwomeyFosnot and Maarten Dolk, Heinemann

Assessing Learners with Special Needs: 6TH ED. By Terry Overton

Weaver, B. Formal versus Informal Assessments. http://www.scholastic.com/teachers/article/formal-versus-informal-assessments Morrison, G. Informal Methods of Assessment. http://www.education.com/reference/article/informal-methods-assessment/ Forlizzi, L. Informal assessment: The Basics. http://aded.tiu11.org/disted/FamLitAdminSite/fn04assessinformal.pdf Navarete, C., et al. Informal Assessment In Educational Evaluation. Mind map retrieved february 20, 2013 from the URL: http://www.mindmeister.com/122645400/formal-vs-informal-assessments

Testing: Basic Concepts: Basic Terminology. By Anthony Bynom, Ph.D., December 2001.

A Statistical Analysis of Different Instruments to Measure Short-term In an L2 Immersion Program. By Kyle Perkins, Southern Illinois University at Carbondale, U.S.A. TESL Journal, Vol. II, No. 5, May 1996. http://iteslj.org/

Berk, R., 1979. Generalizability of Behavioral Observations: A Clarification of Interobserver Agreement and Interobserver Reliability. American Journal of Mental Deficiency, Vol. 83, No. 5, pp. 460-472.

Cronbach, L., 1990. Essentials of psychological testing. Harper & Row, New York.

Carmines, E., and Zeller, R., 1979. Reliability and Validity Assessment. Sage Publications, Beverly Hills, California.

Gay, L., 1987. Educational research: competencies for analysis and application. Merrill Pub. Co., Columbus.

Guilford, J., 1954. Psychometric Methods. McGraw-Hill, New York.

Nunnally, J., 1978. Psychometric Theory. McGraw-Hill, New York.

Winer, B., Brown, D., and Michels, K., 1991. Statistical Principles in Experimental Design, Third Edition. McGraw-Hill, New York.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: Authors.

Cozby, P.C. (2001). Measurement Concepts. Methods in Behavioral Research (7th ed.). California: Mayfield Publishing Company.

Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational Measurement (2nd ed.). Washington, D.C.: American Council on Education.

Moskal, B.M., & Leydens, J.A. (2000). Scoring rubric development: Validity and reliability. Practical Assessment, Research & Evaluation, 7(10). [Available online: http://pareonline.net/getvn.asp?v=7&n=10].

The Center for the Enhancement of Teaching. How to improve test reliability and validity: Implications for grading. [Available online: http://oct.sfsu.edu/assessment/evaluating/htmls/improve_rel_val.html].

http://spiritize.blogspot.com/2007/10/active-listening.html

http://wiki.answers.com/Q/Examples_of_poetry#ixzz1xYYnZXmU

Alderson, J. C. 2002. Conceptions of validity and validation. Paper presented at a conference in Bucharest, June 2002.

Angoff, 1988. Validity: An evolving concept. In H. Wainer & H. Braun [Eds.], Test validity [pp. 19-32]. Hillsdale, NJ: Erlbaum.

Bachman, L. F. 1990. Fundamental considerations in language testing. Oxford: O.U.P.

Cumming, A. & Berwick, R. [Eds.] 1996. Validation in Language Testing. Multilingual Matters.

Hatch, E. & Lazaraton, A. 1991. The Research Manual: Design & Statistics for Applied Linguistics. Newbury House.

Henning, G. 1987. A guide to language testing: Development, evaluation and research. Cambridge, Mass: Newbury House.

Hubley, A. M. & Zumbo, B. D. 1996. A dialectic on validity: where we have been and where we are going. The Journal of General Psychology, 123[3], 207-215.

Messick, S. 1988. The once and future issues of validity: Assessing the meaning and consequences of measurement. In H. Wainer & H. Braun [Eds.], Test validity [pp. 33-45]. Hillsdale, NJ: Erlbaum.

Messick, S. 1989. Validity. In R. L. Linn [Ed.], Educational measurement [3rd ed., pp. 13-103]. New York: Macmillan.
