
新潟国際情報大学 情報文化学部 紀要

RELIABILITY AND VALIDITY OF A LISTENING TEST AND ITS
PROCEDURE CONDUCTED AT A JAPANESE HIGH SCHOOL

Paul Bela Nadasdy

Abstract

This paper analyses the validity and reliability of an English listening test being used at a private Japanese high school. Through an analysis of the test results, an attempt was made to make salient the qualities and deficiencies of the test and its procedure.

The test's reliability was analysed using a split-half method measuring the coefficient of internal consistency. The split-test's coefficient results suggested that there was a certain amount of unreliability between the two halves of the test. Although the reliability was below an acceptable level, calculations using the Spearman-Brown formula suggested the possibility of a higher coefficient. Regarding construct, content, criterion-related, and face validity, the test appeared valid.

Key words: TESTING, VALIDITY, RELIABILITY, HIGH SCHOOL

1. Introduction

Though some have questioned whether testing is actually necessary at all, it is generally agreed that it is the most practical way to monitor and systematically rank students. And as tests remain the most popular way to grade students fairly, the quality of their production would seem vital.

For test efficiency, validity and reliability need to be present. As these two conditions are important for the effectiveness of testing, it is generally accepted that a precise evaluation of our students can be achieved when both are consistently present. Unsurprisingly, however, the variables involved in measuring both reliability and validity in tests at times produce a range of results.

This paper starts with an analysis of testing in general and of how the examination of validity and reliability is used as a means of quality control in test production. This is followed by an analysis of a listening test that is being used in a high school in Japan. Quantitative and qualitative results are analysed to ascertain whether the test is reliable and valid, and this is followed by an evaluation of its overall effectiveness.

2. Review of literature

Analysts who have made important contributions within the realm of testing include


Oller (1979), Hughes (1989), Bachman (1990), Spolsky (1985), Messick (1996), Fulcher (1997), Cohen et al. (2000), and Chapelle (1999, 2003). In defining testing and its usefulness, Bachman states that "language tests are indirect indicators of the underlying traits in which we are interested" (1990:33). Davies (1990), Hughes (1989), and Baker (1989) refer to tests in the way that they help us to acquire information, act as a procedure for problem solving, and act as a decision-making procedure (Owen 1997:2). Owen also offers an endorsement of testing in that instructors need to monitor student progress independently, which counters the possibility of inaccurate and biased self-assessment (1997:5).

Owen further defines possible motivations for tests in language learning, explaining that they assist in ranking students, assist in gauging whether students are able to cope with certain language forms, help us to observe whether learning has been achieved, give useful information relating to forecasting future developments in student performance, and help us to refine what we are teaching and testing. Furthermore, testing can also contribute to establishing whether certain entities, such as teachers, schools, and teaching methods, are effective by comparing them against one another. Among these positive endorsements, Owen also suggests that tests act as a means of control and motivation for our students. However, some commentators draw our attention to the negative reputation that tests have within the teaching community. For example, Hughes (2003) refers to the "mistrust" educators have of tests and testing in general.

3. Validity in testing

Two areas should be considered when discussing validity in testing:

1. Consider how closely the test performance resembles the performance we expect outside the test.
2. Consider to what extent evidence of knowledge about the language can be taken as evidence of proficiency. (Owen 1997:13)

Referring to the importance of validity in tests, Cohen et al. (2000) state that effective research is impossible or even "worthless" without the presence of validity (2000:105), though they recommend against aiming for absolute validity. Instead they define the search for validity as being one of minimizing invalidity and maximizing validity, therefore treating measurement of validity as a matter of degree rather than a pursuit of perfection (2000:105). Owen (1997), citing Baker (1989), also considers the accuracy and proficiency of testing and how we evaluate individuals:

It is quite useful for understanding tendencies in testing, but...it seems less easy actually to allocate particular tests to one cell rather than another, and...it is not easy to separate knowledge of system as a counterpoint to performance from knowledge of a system as indirect evidence of proficiency. (Owen, 1997:17)

3.1 Construct, content, criterion-based, and face validity

Several categories exist for validity. The following four categories are described by Hughes (1989) and Bachman (1990), these being construct validity, content validity (included within this are internal and external validity), criterion-based validity, and face validity.

3.1.1 Construct validity

Construct validity is concerned with the level of accuracy a construct within a test is believed to measure (Brown 1994:256; Bachman & Palmer 1996) and, particularly in ethnographic research, "must demonstrate that the categories that the researchers are using are meaningful to the participants themselves" (Cohen et al. 2000:110).

3.1.2 Content validity

Content validity is concerned with the degree to which the components of a test relate to the real-life situation they are attempting to replicate (Hughes 1989:22; Bachman 1990:306), and with the degree to which they proportionately represent that situation. Within the domain of content validity are internal validity and external validity. These refer to relationships between independent and dependent variables when experiments are conducted. External validity occurs when our findings can be generalized to the wider population, whereas internal validity is related to the elimination of confounding variables within studies.

3.1.3 Criterion-related validity

Criterion-related validity "(relates) the results of one particular instrument to another external criterion" (Cohen et al. 2000:111). It contains two primary forms, these being predictive and concurrent validity. Concerning predictive validity, if two separate but related experiments or tests produce similar results, the original examination is said to have strong predictive validity. Concurrent validity is similar, but it need not be measured over a span of time and can be "demonstrated simultaneously with another instrument" (2000:112).

3.1.4 Face validity

This term relates to the degree to which a test is perceived to be doing what it is supposed to do. In general, face validity in testing describes the look of the test as opposed to whether the test is proven to work.

3.2 Messick’s framework of unitary validity

Messick's (1989) framework of unitary validity differs from the previous view, which identifies content validity, face validity, construct validity, and criterion-related validity as its main elements. Messick considers these elements alone to be inadequate and stresses the need for further consideration of complementary facets of validity, in particular the examination of scores and construct validity assessment as its key features. Six aspects of validation included in Messick's paradigm provide "an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores" (Messick 1989:13 cited in Bachman 1990:236).

These elements are:

1. Judgmental/logical analysis, which is concerned with content relevance.
2. Correlation analyses, which use quantitative analyses of test scores to gather evidence in support of their interpretation.
3. Analyses of process, which involve investigating test taking.
4. Analyses of group difference and change over time, which examine to what extent score properties generalize across population groups.
5. Manipulation of tests and test conditions, which is concerned with gathering knowledge about how test intervention affects test scores.
6. Test consequences, which examines elements that affect testing, including washback, consequences of score interpretation, and bias in scoring (Bachman 1990; Messick 1996).

3.3 Testing outcomes

Considering the above framework defining validity in testing, we need to consider what is appropriate for our students and teaching situations as well as on a larger scale. The importance of analysis in low-stakes testing could be significant if one considers how data can be collected at the source and used productively. Chapelle's (2003) reference to Shepard (1993), in that the primary focus is on testing outcomes and that "a test's use should serve as a guide to validation" (2003:412), suggests we need a point from which to start our validation analysis. As Chapelle also notes that "a validation argument is 'an argument' rather than a 'thumbs up/thumbs down' verdict" (Cronbach cited in Chapelle 2003), we start to focus on something that we can generally agree is an important outcome: the result.

4.Reliability in testing

Reliability relates to the generalisability, consistency, and stability of a test. Following on from test validity, Hughes points out that "if a test is not reliable, it cannot be valid" (2003:34). Hughes continues that "to be valid a test must provide consistently accurate measurements" (2003:50). Therefore it would seem that the greater the similarity there is between tests, the more reliable they would appear to be (Hughes 1989). However, Bachman (1990) argues that although the similarity case is relevant, other factors concerning what we are measuring will affect test reliability. Factors including test participants' personal characteristics, i.e. age and gender, and factors regarding the test environment and the condition of the participants can contribute to whether or not a test is effectively reliable (1990:164).

Investigating reliability can also be approached by analysing a test candidate's Classical True Score (CTS). According to Bachman (1990:167), if it were possible for a test candidate to take the same test in an unaffected environment several times, the eventual mean score would provide a total that would closely equate to the participant's true score. Using CTS one can calculate reliability, and especially reliability coefficients, in three areas: internal consistency, test score stability over a period of time, and the comparison of parallel forms of tests (1990:172-182). What is ascertained from the CTS is no doubt important. However, the results remain theoretical and may not take into account variables that could be established via empirical investigations.
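In standard notation (a common presentation of the CTS model rather than a quotation from Bachman), the observed score is decomposed into a true score and an error term, and the reliability coefficient is the proportion of observed-score variance attributable to true scores:

```latex
% Classical True Score model: an observed score X is the sum of a
% true score T and a measurement error E.
X = T + E
% Reliability: the proportion of observed-score variance that is
% true-score variance (equivalently, one minus the error proportion).
\rho_{XX'} = \frac{\sigma_{T}^{2}}{\sigma_{X}^{2}}
           = 1 - \frac{\sigma_{E}^{2}}{\sigma_{X}^{2}}
```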

Even in strict testing conditions conducted at different times, human changeability is unavoidable, and the same test conducted twice in similar conditions will produce conflicting results. With regard to this, one may wonder how possible it is to test reliability at all. However, taking into consideration the 'reliability coefficient', which helps to compare the reliability of test scores, we may start to get closer to determining test reliability. One can aim for similar scores that fall within an acceptable range and observe a mean average that signifies reliability (the reliability coefficient).

Terms relating to reliability can be defined in the following ways. Inter-rater reliability is concerned with how scores from various sources are balanced and, importantly, to what degree markers' scores show equality (Nunan 1992:14-15). Test-retest reliability gives an indication as to how consistently a test measures the individual performances of students who are tested on various occasions (Underhill 1987:9). A further simplified definition, offered by Nunan and by Weir and Roberts, states that inter-rater reliability is the degree to which the scores from two or more markers agree (Nunan 1992:14-15; Weir and Roberts 1994:172). Examples of methods for estimating reliability include test-retest reliability, internal consistency reliability, and parallel-test reliability. These methods each have their own ways of examining the source of error in testing.

5.Procedures to ensure validity and reliability

5.1 Ensuring validity

Hughes states that the concept of test validity can seem uncomplicated but on closer inspection can appear highly complex (2003:34). Some experts say that "one might suppose that ultimately there is no means of knowing whether a test is valid or not" (Owen 1997:13). One certainty is that it is possible to describe and assess test validity in various ways. Initially, one could attest that the most important description is based around test effectiveness. Hughes (2003) points out the basis for a simple criterion of test quality and offers evidence showing the relevance of certain descriptions that may help to rectify difficulties in language testing. Firstly, he states specifically that a test should simply "...(measure) accurately what it is intended to measure" (2003:26) to assure us of its validity.

Though this may appear relatively simple in terms of straightforward testing, the several definitions of what we expect our students to achieve can overcomplicate what we are attempting to measure. To assist in simplifying ambiguous "theoretical constructs" such as fluency in speaking, reading ability, etc., certain descriptions of validity can be considered, including construct validity, content validity, and criterion-related validity. The following considers these variants. With content validity, Hughes points out that if the test has positive content validity it is more likely to accurately test what is required, and this in turn leads to construct validity. He states that "the greater a test's content validity, the more likely it is to be an accurate measure of what it is supposed to measure" (2003:27). Importantly, when creating tests, specifications have to be established at an early stage regarding what is required from the test's participants. These specifications should be areas that are considered to be of maximum benefit when defining that which is to be measured and achieved through the testing. Hughes warns, though, that "too often the content of tests is determined by what is easy to test rather than what is important to test" (2003:27). Therefore it is important to be clear about what is required. Criterion-related validity provides assessment from different perspectives and presents an opportunity to compare quantitative score analysis against qualitative independent judgments of test participants' abilities. Hughes states that all of these "have a part to play in the development of a test" (2003:30).

Hughes also draws our attention to how scoring is important when judging the validity of tests and how testers and test designers must "make sure that the scoring of responses relates directly to what is being tested" (2003:34). Accurate scoring of responses would seem imperative if correct measurement is to be assured. By being clear as to what is required in a response, e.g. ensuring that clear pronunciation on speaking tests is not confused with hesitation or intonation issues, validity may then be more achievable and measurements more accurate and relevant.

5.2 Ensuring reliability

According to Hughes there are several ways to ensure reliability. These include:

1. Gathering more information about the test candidate by adding extra and more detailed questions, tasks, and examples to tests.
2. Balancing the difficulty of questions so they do not "discriminate between weaker and stronger students".
3. Focusing and restricting questions that may allow for too much elaboration.
4. Avoiding ambiguous questions and items.
5. Being clear with instructions for tasks.
6. Presenting tests clearly to avoid confusion.
7. Practicing the test format with students so that they are familiar with and prepared for the actual test.
8. Encouraging consistency across administrations in large-scale testing.
9. Using items that allow objective scoring, i.e. providing part of an answer for a test taker to complete rather than eliciting an entire sentence as an answer.
10. Restricting the freedom afforded to candidates in terms of the comparisons made between them.
11. Providing clear and detailed scoring keys.
12. Helping testers and scorers by training them at an early stage, and conferring with test designers and testers about how responses are to be scored before scoring commences.
13. Having students represented by numbers rather than personal details to restrict any possible bias.
14. Using, if possible, independent scorers to evaluate objectively and eliminate discrepancies. (1989:44-50)

Though the variables of human error in testing, between testers and candidates, are significant, these measures seem at the very least to work towards creating better reliability. It would certainly seem of benefit to have practical experience of teaching and testing, enabling researchers to gain firsthand experience of what may be required throughout the entire process of test organization.

6.Method

6.1 Listening Test

The test selected for this analysis is designed to test the listening ability of first-grade students in their second term at a senior high school in Japan. Preparation for the test is conducted over a period of three weeks prior to the actual test, which is given in the fourth week of each month.

The test is one of several listening tests conducted each term and is administered over a period of two weeks to approximately five hundred first-grade students. Ten native-speaking English teachers design, administer, and mark the test. Totals are added to the students' final yearly grade and are important for graduation.

The test conditions require students to listen to a 20-minute recording of monologues and dialogues relating to a syllabus item designated for that particular month. The test chosen for this study consists of four sections relating to 'favourites', 'possessives', 'numbers', 'jobs', and 'personal information'.

6.2 Split-half analysis

With a view to narrowing down the variables that might affect consistency in measuring reliability within the research, a singly administered split-half method (Hughes 1989:40), in which the "coefficient of internal consistency" (1989:40) can purportedly be measured, was utilized. The test was designed so it could be separated into relatively equal parts in order to collect two separate scores following a single session. One class of thirty upper-intermediate test participants was selected for the analysis.
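A minimal sketch of this procedure is given below, assuming the coefficient of internal consistency is computed as a Pearson correlation between the two half-scores; the helper function and the sample scores are illustrative only and are not taken from the study.

```python
# Minimal sketch of the split-half procedure described in 6.2:
# each student's items are divided into two halves, and the two
# half-scores are correlated to give the internal consistency coefficient.
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical half-scores (each out of 25) for five students; in the
# study, the two score columns in Appendix 1 would be used instead.
first_half = [24, 20, 20, 25, 24]
second_half = [25, 22, 24, 25, 25]

print(round(pearson_r(first_half, second_half), 2))
```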

7.Results

Lado (1961), cited in Hughes (1989:39), suggests that a good listening test should fall in the range of a 0.80-0.89 reliability coefficient. The split-test's coefficient results (see Table 1 (1)) suggest that there is a certain amount of unreliability between the two halves of the test. With a coefficient score of 0.36 there are obvious underlying problems.

However, though the test is scored identically across parts 1/2 and 3/4, the contents of the two halves vary in minor ways, which could have caused discrepancies in the consistency between the two sections (see Appendix 5). In order to establish whether reliability was affected by task order and/or task groupings, the test was analysed in different ways. The reliability coefficient was analysed after collecting odd and even scores, after calculating various test task groups together, and by calculating split tasks which were connected to each equivalent on the opposite part of the test (see Table 2).

The reliability coefficient results were as follows:

Table 1

Calculation type                         Coefficient
(1) Questions 1-25/26-50                 0.36
(2) Every other question                 0.70
(3) Tasks 1/4 & 2/3                      0.77
(4) First/second halves of tasks         0.73

Though the original measurement of reliability was relatively low, it can be observed that by varying the way in which the coefficient is calculated, higher coefficients can be achieved. This suggests that there may be a certain amount of reliability in the test. Dividing the total of the coefficients by the four types of analysis gives the following average:

Coefficient total (2.56) ÷ Calculations (4) = 0.64

Applying the Spearman-Brown formula (Reliability = 2r ÷ (1 + r)), the possibility of a higher coefficient was investigated. The results were as follows:

Table 2

Calculation type                         Coefficient   Spearman-Brown
Questions 1-25/26-50                     0.36          0.53
Every other question                     0.70          0.82
Tasks 1/4 & 2/3                          0.77          0.87
First/second halves of tasks             0.73          0.84


The averaged coefficient of 0.64 was then calculated using the Spearman-Brown model, giving the final internal consistency score:

(2 × 0.64) ÷ (1 + 0.64) = 1.28 ÷ 1.64 = 0.78

Considering Lado's (1961) estimate of 0.80-0.89, this final score falls just below a satisfactory level of reliability.
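The calculations above can be reproduced in a few lines of code. The sketch below is illustrative only (the helper name spearman_brown is an addition, not part of the original analysis); it applies the formula 2r ÷ (1 + r) to the four coefficients reported in Table 2 and to their average.

```python
# Spearman-Brown prophecy formula: full-length reliability estimated
# from a half-test correlation r is R = 2r / (1 + r).
def spearman_brown(r):
    return 2 * r / (1 + r)

coefficients = {
    "Questions 1-25/26-50": 0.36,
    "Every other question": 0.70,
    "Tasks 1/4 & 2/3": 0.77,
    "First/second halves of tasks": 0.73,
}

for name, r in coefficients.items():
    print(f"{name}: r = {r:.2f}, Spearman-Brown = {spearman_brown(r):.2f}")

# Averaged coefficient and its adjusted value, as in section 7.
r_bar = sum(coefficients.values()) / len(coefficients)   # 2.56 / 4 = 0.64
print(f"Average r = {r_bar:.2f}, adjusted = {spearman_brown(r_bar):.2f}")  # 0.78
```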

8.Test evaluation

In design, the four sections of the test mirrored each other to accommodate the subsequent split-half analysis. Each task was evaluated using the model described by Hughes (1989), covering construct validity, criterion-related validity, content validity, and face validity. Following the analysis of the test's validity, further investigations were made to establish its level of reliability.

In terms of construct validity, it is important for the test items to measure what they are supposed to and to be meaningful to the test participants. Part 1 (see Appendix 5) consists of four questions relating to favourite people and items. Taking into consideration the cultural differences between Japanese and western students, it would be difficult to guarantee that the items are completely meaningful. However, as the test does contain popular figures and well-known items, these will at least have some appeal at a basic level and, in a small-scale test without independent evaluation, this seems adequate. Parts 2 and 3 (see Appendix 6/7) are both related to numbers. How useful this construct will be to students may be ascertained when it is applied practically.

As well as evaluating students' listening abilities and contributing to their final grade, the test items are designed to prepare students for a post-course homestay in the UK. The recordings are all in British English and contain natural speed and rhythm. If the preparation is effective and if the students go on to recognize or use these items, this would appear to validate the construct. Part 3 (see Appendix 7) consists of information relating to nationality, profession, and city of residence. As this test is part of an ongoing course dedicated to helping students retain language items, this repeated strategy seems adequate in its inclusion.

Regarding content, criterion-related, and face validity, the items in the test are recordings extracted from the students' coursebook and are clear recordings of speech, mainly in British accents. Internally the contents appear valid in that they attempt to replicate real-life situations. However, the unnatural delivery suited to second-language students challenges whether the test is completely content valid. In terms of criterion-related validity, there is only one instrument of measurement in this study, so it would be difficult to make comparisons with other examples. Comparing the two halves of the split-test analysis, there seem to be discrepancies (see Table 1). The variation in scores suggests, in some respects, that there may be some weakness in the use of some of the contents. In terms of face validity, the test has the appearance that it will work well as a listening test, with ample examples that are easy enough to follow, simple legible instructions, and coverage of a sufficient range of language items.

In terms of reliability, the conditions were quite varied during the test's design, administration, and scoring. As there were several teachers and designers involved in the process, it became apparent that it would be difficult to prove exactly how reliable and consistent the test was. It can be observed from the tables (see Appendix 1-4), however, that the participating students scored very well on the test. Comparing these marks to those of other classes, they generally scored higher in most cases. As the class consisted of upper-intermediate level students, the scores achieved were close to what was expected, the scores being similar to the students' regular grades. As the testing conditions and marking were handled by one individual, interference by outside influences was reduced. Though students were required to write their details on the front page (see Appendix 5), the scoring was unbiased and consistent. In conclusion, the test seemed to achieve what it was meant to. It tested items that were meaningful to the students, covered the school syllabus, met expectations relating to scores, reinforced language items and tested students' recognition of language in context, and worked well in general as an auditory test.

9.Conclusion

As mentioned in the introduction to this paper, high-stakes tests are placing further demands on test designers to create tests that accurately measure what they are supposed to. As Hughes states, designers of tests must try to "make their tests as valid as possible" (1989:34). Details regarding the validity and reliability of tests should be made available so that there can be careful observation of how and what tests are measuring. If the general consensus about a test is good, it can be considered a benchmark for designers to work from. Though, as mentioned in the review of literature, the pursuit of perfection is perhaps ultimately unproductive, we can instead strive to encourage communication across administrators, designers, and teachers to improve what we are ideally working towards: more validity and reliability in tests and less invalidity and unreliability (Cohen et al. 2000).

Referring again to what tests intend to measure, we can strive towards creating test items that truly elicit meaningful, appropriate, and measurable language forms from learners in order to evaluate ability. It would seem the problem lies in defining what exactly to look for in proficiency. In terms of establishing this, it could be said that the closer one is to the source, e.g. the classroom, students, and test design, the better the chance of achieving this.


Appendix 1

11.1 Split-half test: questions 1-25/26-50

Student   Score 1 (/25)   Score 2 (/25)   Total /50 (%)   Variability

1 24 25 49 (98%) 1 (2%)

2 20 22 42 (84%) 2 (4%)

3 20 24 44 (88%) 4 (8%)

4 25 25 50 (100%) 0 (0%)

5 24 25 49 (98%) 1 (2%)

6 20 24 44 (88%) 4 (8%)

7 20 25 45 (90%) 5 (10%)

8 24 25 49 (98%) 1 (2%)

9 25 20 45 (90%) 5 (10%)

10 20 22 42 (84%) 2 (4%)

11 24 25 49 (98%) 1 (2%)

12 24 25 49 (98%) 1 (2%)

13 25 25 50 (100%) 0 (0%)

14 18 20 38 (76%) 2 (4%)

15 15 20 35 (70%) 5 (10%)

16 20 25 45 (90%) 5 (10%)

17 24 24 48 (96%) 0 (0%)

18 25 25 50 (100%) 0 (0%)

19 24 20 44 (88%) 4 (8%)

20 20 20 40 (80%) 0 (0%)

21 24 24 48 (96%) 0 (0%)

22 24 25 49 (98%) 1 (2%)

23 25 25 50 (100%) 0 (0%)

24 24 25 49 (98%) 1 (2%)

25 24 20 44 (88%) 4 (8%)

26 24 22 46 (92%) 2 (4%)

27 20 18 38 (76%) 2 (4%)

28 24 20 44 (88%) 4 (8%)

29 20 24 44 (88%) 4 (8%)

30 20 20 40 (80%) 0 (0%)

Coefficient: 0.361618; 100% equality = 8 times (26.6%)

Appendix 2

11.2 Split-half test: every other question

Student   Score 1 (/25)   Score 2 (/25)   Total /50 (%)   Variability

1 24 25 49 (98%) 1 (2%)

2 21 21 42 (84%) 0 (0%)

3 22 22 44 (88%) 0 (0%)

4 25 25 50 (100%) 0 (0%)

5 24 25 49 (98%) 1 (2%)

6 22 22 44 (88%) 0 (0%)

7 21 24 45 (90%) 3 (6%)

8 25 24 49 (98%) 1 (2%)

9 21 24 45 (90%) 3 (6%)

10 22 20 42 (84%) 2 (4%)

11 25 24 49 (98%) 1 (2%)

12 24 25 49 (98%) 1 (2%)

13 25 25 50 (100%) 0 (0%)

14 19 19 38 (76%) 0 (0%)

15 16 21 35 (70%) 5 (10%)

16 22 23 45 (90%) 1 (2%)

17 24 24 48 (96%) 0 (0%)

18 25 25 50 (100%) 0 (0%)

19 22 22 44 (88%) 0 (0%)

20 20 20 40 (80%) 0 (0%)

21 24 24 48 (96%) 0 (0%)

22 25 24 49 (98%) 1 (2%)

23 25 25 50 (100%) 0 (0%)

24 24 25 49 (98%) 1 (2%)

25 23 21 44 (88%) 2 (4%)

26 24 22 46 (92%) 2 (4%)

27 20 18 38 (76%) 2 (4%)

28 22 22 44 (88%) 0 (0%)

29 24 20 44 (88%) 4 (8%)

30 20 20 40 (80%) 0 (0%)

Coefficient: 0.696771; 100% equality = 14 times (46.6%)


Appendix 3

11.3 Split-half test: part 1/4 and part 2/3

Student   Score 1 (/25)   Score 2 (/25)   Total /50 (%)   Variability

1 24 25 49 (98%) 1 (1%)

2 20 22 42 (84%) 2 (4%)

3 21 23 44 (88%) 2 (4%)

4 25 25 50 (100%) 0 (0%)

5 24 25 49 (98%) 1 (2%)

6 21 23 44 (88%) 2 (4%)

7 24 21 45 (90%) 3 (6%)

8 25 24 49 (98%) 1 (2%)

9 22 23 45 (90%) 1 (2%)

10 21 21 42 (84%) 0 (0%)

11 24 25 49 (98%) 1 (2%)

12 24 25 49 (98%) 1 (2%)

13 25 25 50 (100%) 0 (0%)

14 20 18 38 (76%) 2 (4%)

15 17 18 35 (70%) 2 (4%)

16 21 24 45 (90%) 3 (6%)

17 23 25 48 (96%) 2 (4%)

18 25 25 50 (100%) 0 (0%)

19 22 22 44 (88%) 0 (0%)

20 18 22 40 (80%) 2 (4%)

21 23 25 48 (96%) 2 (4%)

22 24 25 49 (98%) 1 (2%)

23 25 25 50 (100%) 0 (0%)

24 24 25 49 (98%) 1 (2%)

25 22 22 44 (88%) 0 (0%)

26 22 24 46 (92%) 2 (4%)

27 19 19 38 (76%) 0 (0%)

28 22 22 44 (88%) 0 (0%)

29 22 22 44 (88%) 0 (0%)

30 18 22 40 (80%) 2 (4%)

Coefficient: 0.768211; 100% equality = 10 times (33.3%)

Appendix 4

11.4 Split-half test: first halves of tasks/second halves of tasks

Student   Score 1 (/25)   Score 2 (/25)   Total /50 (%)   Variability

1 25 24 49 (98%) 1 (2%)

2 22 20 42 (84%) 2 (4%)

3 22 22 44 (88%) 0 (0%)

4 25 25 50 (100%) 0 (0%)

5 25 24 49 (98%) 1 (2%)

6 21 23 44 (88%) 2 (4%)

7 21 24 45 (90%) 3 (6%)

8 24 25 49 (98%) 1 (2%)

9 22 23 45 (90%) 1 (2%)

10 20 22 42 (84%) 2 (4%)

11 24 25 49 (98%) 1 (2%)

12 24 25 49 (98%) 1 (2%)

13 25 25 50 (100%) 0 (0%)

14 19 19 38 (76%) 0 (0%)

15 16 19 35 (70%) 3 (6%)

16 22 23 45 (90%) 1 (2%)

17 24 24 48 (96%) 0 (0%)

18 25 25 50 (100%) 0 (0%)

19 24 20 44 (88%) 4 (8%)

20 20 20 40 (80%) 0 (0%)

21 24 24 48 (96%) 0 (0%)

22 25 24 49 (98%) 1 (2%)

23 25 25 50 (100%) 0 (0%)

24 24 25 49 (98%) 1 (2%)

25 22 22 44 (88%) 0 (0%)

26 24 22 46 (92%) 2 (4%)

27 20 18 38 (76%) 2 (4%)

28 22 22 44 (88%) 0 (0%)

29 21 23 44 (88%) 2 (4%)

30 20 20 40 (80%) 0 (0%)

Coefficient: 0.728156; 100% equality = 12 times (40%)


Appendix 5

Appendix 6


Appendix 7

References

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford University Press.

Bachman, L. & Palmer, A. (1996). Language Testing in Practice. Oxford University Press.

Baker, D. (1989). Language testing: a critical survey and practical guide. Edward Arnold.

Brown, H. D. (1994). Principles of Language Learning and Teaching (3rd ed.). Prentice Hall.

Chapelle, C. A. (1999). Validation in language assessment. Annual Review of Applied Linguistics, 19, 254-272.

Chapelle, C. A., Jamieson, J. & Hegelheimer, V. (2003). Validation of a web-based ESL test. Language Testing, 20(4), 409-439.

Cohen, L., Manion, L. & Morrison, K. (2000). Research Methods in Education. Routledge/Falmer.

Davies, A. (1990). Principles of language testing. Oxford, UK; Cambridge, Mass., USA: B. Blackwell.

Fulcher, G. (1997). "Assessing Writing." In Fulcher, G. (ed.), Writing in the English Language Classroom. Hemel Hempstead: Prentice Hall Europe.

Hughes, A. (1989). Testing for language teachers. Cambridge University Press.

Lado, R. (1961). Language Testing. London: Longman.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (pp. 13-103). New York: Macmillan.

Messick, S. (1996). Validity and washback in language testing. Language Testing. ETS, Princeton.

Nunan, D. (1992). Research methods in language learning. Cambridge University Press.

Oller, J. (1979). Language tests at school: A pragmatic approach. London: Longman.

Owen, C. (1997). Testing. Birmingham: The Centre for English Language Studies.

Read, J. & Chapelle, C. A. (2001). A framework for second language vocabulary assessment. Language Testing, 18(1), 1-32.

Shepard, L. A. (1993). "Evaluating Test Validity." In L. Darling-Hammond (Ed.), Review of Research in Education, Vol. 19. Washington, DC: AERA.

Spolsky, B. (1985). Formulating a theory of second language learning. Studies in Second Language Acquisition, 7, 269-288.

Underhill, N. (1987). Testing Spoken Language: A handbook of oral testing techniques. Cambridge University Press.

Weir, C. & Roberts, J. (1994). Evaluation in ELT. Blackwell Publishing.
