Ocelot Mandarin Oral Recognition Test

Roderick Gammon
Kapi’olani Community College
February 16, 2004

Abstract

The Ocelot Mandarin Oral Recognition Test is a diagnostic test that measures a student’s ability to match spoken Chinese syllables to a romanization. Analogous to using a romanized dictionary index, test items are multimedia multiple-choice with an audio prompt and a selection of textual romanizations. The test is organized into subunits focused on different features in a formal linguistic construct of a Mandarin syllable (Yao et al., 1997). The test is computer adaptive and allows for criterion-referenced (Brown and Hudson, 2002) item banking (Brown, 1997). After a literature review, this paper describes the structure and piloting of the test.

Introduction

The Ocelot Mandarin Oral Recognition Test (MORT) measures recognition of Mandarin Chinese oral syllables by interpreting subject performance relative to a linguistic feature system. Implemented as a Flash (2004 MX) applet, the test is computer adaptive and algorithmically generates items at runtime. The test was created as a member of a set of computer-assisted language-learning (CALL) materials modeled around a particular course of study using the Integrated Chinese (Yao et al., 1997) textbooks.

This introduction reviews literature regarding Mandarin romanization, language testing, and CALL design. Following that review, the introduction recasts the MORT and its pilot study in terms of research questions raised by the literature review. The MORT instrument and its piloting are then described. The paper concludes with a consideration of further avenues of study.

Mandarin

Mandarin Chinese has, with increasing regularity since the mid-20th century, been transcribed using the Pinyin romanization method (Ramsey, 1987). This method is used indigenously in both the People’s Republic of China and the Republic of China (Taiwan) (e.g.
in government romanizations), and is widely taught in modern Mandarin foreign language courses in the United States. The vast majority of Chinese-English bilingual reference dictionaries order entries alphabetically by romanization, from the Mathews (1943) dictionary of classical Chinese to the Alphabetic Computerized Dictionary (2003), which itself reflects a particularly aggressive romanization project.

There are two divergent strains of Pinyin. Hanyu Pinyin is the original form and is nearly universally used (Tsai, 2002). However, there is also Tongyong Pinyin, which was developed in 1998 in the Republic of China and which diverges from Hanyu Pinyin by roughly 10% (Tsai, 2002). Divergence in Chinese orthographies is, however, not unusual. For example, the alternate “traditional” and “simplified” character forms exist at the tangible core of the Chinese language and vary from each other on about two thirds of the character forms, and some conflicting usages have political motivation (Ramsey, 1987, chapter 8). Additionally, romanized surface forms, at least in the modern period, have fluctuated without changes in the phonetic interpretation of Mandarin syllables. For example, standardized conversion tables are common for translation between the Pinyin and Wade-Giles romanization systems (Library of Congress, 1999), and each system uses the same syllabic construct of initial, final, and tone.

Yao et al. (1997), using Hanyu Pinyin, provides an overview of the generally accepted Mandarin syllabic construct (Chao, 1968), which divides syllables into initials, finals, and tones. Initials are further described by point of articulation and aspiration. Finals are described as concatenated vowels and nasals, and vowels are subdivided into simple and complex. Yao et al.
(1997) also serves as the first-year Mandarin (101-102) basal textbook at the institutions serving as pilot sites, within which there are XXX syllables that occur as the first syllable of a vocabulary item.

In Pinyin a syllable is produced by concatenating one value for each of an initial, final, and tone, although initials and tones may be null values. As the name indicates, romanization systems use the alphabet to reproduce the initials and finals. Tones are represented by one of two methods, either by a diacritic mark over a vowel or by a numeral suffixed to a syllable. Generally the former method is preferred because it more clearly reflects that a tone is a constant portion of a syllable, not a temporally second event.

In a review of CALL materials for Chinese, Bourgerie (2003) notes only three testing packages for Chinese. Two are for listening. One predates the majority of Internet technologies (Asay & Bourgerie, 1988) and the other focuses on placement using the ACTFL scale (Ke and Zhang, 2002). There are other, scattered instruments available via the Internet (for better-quality examples see “Mini-projects...” and University of Iowa in references); however, few attribute authorship, let alone validity and reliability studies. There is no test known to this author that is a CALL diagnostic listening test for Mandarin and closely coupled to a linguistic construct of sub-syllabic phonetic features.

Language Testing

Brown and Hudson (2002, p. 4) define a criterion-referenced test (CRT) as one “...designed to describe the performances of examinees in terms of the amount that they know of a specific domain of knowledge or set of objectives.” Scores resulting from such tests are termed absolute decisions because they map examinee ability to the tested material, as opposed to performance in comparison to other examinees.
The latter, population-based scoring method is referred to as norm-referenced testing (Brown, 1996). Finally, Brown and Hudson (2002, p. 3) define a CRT subtype called domain-referenced, for CRTs that are “...based on a well-established domain that has been clearly defined with items selected from that domain.”

Language tests commonly provide evaluation for one of four purposes: proficiency, placement, diagnostics, or achievement. Diagnostic tests, in a CRT context, most often occur at the beginning of a term and are used to gauge a student’s proficiency relative to a course’s objectives (Brown and Hudson, 2002, p. 31). Although discrete item formats such as multiple choice (Brown, 1996) are not always perfect fits for CRT, Brown and Hudson (2002, pp. 68-69) note that “receptive skills such as ...listening... and phonemic discrimination can be efficiently tested using the multiple-choice format”. However, those authors emphasize that critical focus is needed when using the multiple-choice format on CRT.

Regarding item construction, Brown and Hudson (2002, pp. 62-63) and Wainer et al. (2000, pp. 243-4) voice the fundamental testing concern of item independence, particularly noting that items should be avoided where one item’s prompt contains the response to another item. One statistical method for identifying item interdependence is Yen’s (1993, named in Brown and Hudson, 2002, pp. 206-207) Q3 statistic; a related remedy is the use of testlets as a unit of analysis, which allows one to factor out dependencies.

Wainer et al. (2000, pp. 245-254) discuss testlets in the domain of computer adaptive testing (CAT). Within that discussion a testlet is a fixed set of questions, such that any subject exposed to a testlet will take all items in that testlet.
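The Q3 statistic named above can be illustrated with a minimal sketch. For a pair of items, it correlates the residuals (observed score minus model-expected score) across examinees. The sketch assumes that ability estimates and item difficulties are already available, and it uses a one-parameter (Rasch) model for the expected scores; that model choice is made here for brevity and is not prescribed by the sources cited above. The function names `rasch_p`, `pearson`, and `q3` are hypothetical.

```python
import math


def rasch_p(theta: float, b: float) -> float:
    """Expected probability of a correct answer (Rasch model, an assumption here)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def pearson(x, y) -> float:
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def q3(responses, thetas, difficulties, i: int, j: int) -> float:
    """Correlate the residuals of items i and j across all examinees.

    responses[k][i] is 1 if examinee k answered item i correctly, else 0.
    thetas[k] is the (assumed known) ability estimate of examinee k.
    difficulties[i] is the (assumed known) difficulty of item i.
    """
    res_i = [row[i] - rasch_p(t, difficulties[i]) for row, t in zip(responses, thetas)]
    res_j = [row[j] - rasch_p(t, difficulties[j]) for row, t in zip(responses, thetas)]
    return pearson(res_i, res_j)


# Illustrative data only: four examinees, three items.
responses = [[0, 0, 1], [1, 1, 0], [0, 0, 1], [1, 1, 1]]
thetas = [-1.0, 0.0, 1.0, 2.0]
difficulties = [0.0, 0.0, 0.5]
print(q3(responses, thetas, difficulties, 0, 1))
```

Under these assumptions, item pairs whose Q3 value lies far from zero would be candidates for closer inspection or for grouping into a single testlet.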
When testlets are designed for each trait, any dependency effects, or context effects, that exist will at least be evenly distributed among all subjects taking a particular testlet, which in Wainer et al.’s (2000, p. 246) view is the most important concern.

A traditional feature of CAT is that student test lengths and individual item difficulties vary in reaction to subject performance (Brown, 1997). Item selection algorithms are therefore central to the usefulness of any CAT. Discussing CAT, Flanagan (2000) underscores that the quality of the test bank data is fundamental to proper functioning of any item selection algorithm. It should also be noted that Flanagan and others (Brown, 1997) refer to item banks as collections of pre-authored items. Wainer et al. (2000, p. 246) define the traditional notion of item ordering as a power test, or from “easy” to “hard” questions. However, one must note that this definition is based on norm-referenced principles rather than on reference to the tested construct. Wainer et al. (2000, pp. 238-9) discuss multi-dimensionality in a test, where each skill required to perform a test is one dimension. For example, a mathematical word problem would include at least the two dimensions of reading and computation.

RELIABILITY

Knowledge of a student’s background is essential for proper test calibration (XXXX), including the features of gender, native language, and educational status. In addition, Kondo-Brown (forthcoming) has developed a background questionnaire for students of Japanese that has proved particularly useful when the population includes heritage learners.

CALL Design

Measurement errors are those of incorrect score recording and reporting, including respondent mistakes as well as instrument errors (Lepkowski et al., 1998, pp. 367-368). Measurement error is always an important concern, and online tests provoke particular concerns of this kind.
Lynch and Horton (1999) note that 800 pixels wide by 600 pixels high is a reasonable size to assume for target screens, within which all display elements must appear; this is an important concern if the test is to be embedded into a web page or other framing component.

Regarding CALL design for Mandarin, a well-known problem is the issue of character encoding systems. Character encoding is a blanket term for technologies that control how orthography is represented by a computer, and these technologies have traditionally been inadequate for representing Mandarin’s large set of characters. Even though technological solutions exist, they are not so widely disseminated that they can be factored out of designs targeting lowest common denominator (LCD) systems. This problem extends to romanization, where diacritic marks over vowels cannot be reliably produced by LCD platforms.
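One way to sidestep the diacritic problem is to fall back on the second tone notation described earlier, a numeral suffixed to the syllable, which needs only plain ASCII. The following is a minimal sketch of that conversion and is not drawn from the MORT implementation. The function name `to_numbered` is hypothetical, and spelling "ü" as "v" is a common keyboard convention adopted here as an assumption.

```python
import unicodedata

# Combining diacritics that mark the four tones in Hanyu Pinyin.
TONE_MARKS = {
    "\u0304": 1,  # macron: first tone
    "\u0301": 2,  # acute accent: second tone
    "\u030c": 3,  # caron: third tone
    "\u0300": 4,  # grave accent: fourth tone
}


def to_numbered(syllable: str) -> str:
    """Rewrite one tone-marked syllable with a numeral suffix instead."""
    tone = 0  # stays 0 when no tone mark is found; then no numeral is added
    letters = []
    # NFD decomposition splits each accented vowel into base letter + marks.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            letters.append(ch)
    # Recompose so 'u' + diaeresis becomes a single character again,
    # then spell it as ASCII 'v' (an assumed convention, see lead-in).
    base = unicodedata.normalize("NFC", "".join(letters)).replace("\u00fc", "v")
    return base + (str(tone) if tone else "")


# Illustrative calls; inputs are written with escapes to stay ASCII-safe.
for marked in ["m\u01ce", "zh\u014dng", "n\u01da", "ma"]:
    print(to_numbered(marked))
```

A design along these lines keeps the displayed response options within the ASCII range, at the cost of the less preferred numeral notation.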