Ocelot Mandarin Oral Recognition Test Roderick Gammon Kapi’olani Community College February 16, 2004

Abstract

The Ocelot Mandarin Oral Recognition Test is a diagnostic test that measures a student’s ability to match spoken Chinese syllables to a romanized transcription. Analogous to using a Romanized dictionary index, test items are multimedia multiple-choice with an audio prompt and a selection of textual response options.

The test is organized into subunits focused on different features in a formal linguistic construct of a Mandarin syllable (Yao et al., 1997). The test is computer adaptive and allows for criterion-referenced (Brown and Hudson, 2002) item banking (Brown, 1997). After a literature review, this paper describes the structure and piloting of the test.

Introduction

The Ocelot Mandarin Oral Recognition Test (MORT) measures recognition of oral syllables by interpreting subject performance relative to a linguistic feature system. Implemented as a Flash MX 2004 applet, the test is computer adaptive and algorithmically generates items at runtime.

The test was created as a member of a set of computer assisted language-learning (CALL) materials modeled around a particular course of study using the Integrated Chinese (Yao et al., 1997) textbooks.

This introduction reviews literature regarding Mandarin romanization, language testing, and CALL design. Following that review, the introduction recasts the MORT and its pilot study in terms of research questions raised by the literature review. The MORT instrument and its piloting are then described. The paper concludes with a consideration of further avenues of study.

Mandarin

Mandarin Chinese has, with increasing regularity since the mid-20th century, been transcribed using the Pinyin romanization system (Ramsey, 1987). This method is used indigenously in both the People’s Republic of China and the Republic of China (Taiwan) (e.g., in government romanizations), and is widely taught in modern Mandarin foreign language courses in the United States. The vast majority of Chinese-English bilingual reference dictionaries order entries alphabetically by romanization, from the Mathews (1943) dictionary to the Alphabetically Based Computerized dictionary (2003), which itself reflects a particularly aggressive romanization project.

There are two divergent strains of Pinyin. Hanyu Pinyin is the original form and is nearly universally used (Tsai, 2002). However, there is also Tongyong Pinyin, developed in 1998 in the Republic of China, which diverges from Hanyu Pinyin by roughly 10% (Tsai, 2002). Divergence in Chinese orthographies is, however, not unusual. For example, the alternate “traditional” and “simplified” character forms exist at the tangible core of the script, vary from each other on about two thirds of the character forms, and some conflicting usages have political motivation (Ramsey, 1987, chapter 8). Additionally, romanized surface forms, at least in the modern period, have fluctuated without changes in the phonetic interpretation of Mandarin syllables. For example, standardized conversion tables are common for translation between the Pinyin and Wade-Giles romanization systems (Library of Congress, 1999), and each system uses the same syllabic construct of initial, final, and tone.

Yao et al. (1997), using Hanyu Pinyin, provides an overview of the Mandarin syllabic construct, generally accepted since Chao (1968), that divides syllables into initials, finals, and tones. Initials are further described by point of articulation and aspiration. Finals are described as concatenated vowels and nasals, and vowels are subdivided into simple and complex. Yao et al. (1997) also serves as the first-year Mandarin (101-102) basal textbook at the institutions serving as pilot sites, within which there are XXX syllables that occur as the first syllable of a vocabulary item.

In Pinyin a syllable is produced by concatenating one value for each of an initial, final, and tone, although initials and tones may be null values. As the name indicates, romanization systems use the alphabet to reproduce the initials and finals. Tones are represented in one of two methods, either by a diacritic mark over a vowel or by a numeral suffixed to a syllable. Generally the former method is preferred because it more clearly reflects that a tone is a constant portion of a syllable, not a temporally second event.
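The concatenative process above can be sketched in a few lines of Python. This is an illustration only: the function name and the treatment of the null initial as an empty string are assumptions of the sketch, not part of MORT itself.

```python
from typing import Optional

def compose_syllable(initial: str, final: str, tone: Optional[int]) -> str:
    """Concatenate initial + final, then suffix the tone numeral.

    An empty initial stands for the null initial; a None tone stands
    for the neutral (unmarked) tone, which the numeric method leaves
    bare.
    """
    syllable = initial + final
    if tone is not None:
        syllable += str(tone)
    return syllable

print(compose_syllable("j", "ian", 3))  # jian3
print(compose_syllable("b", "ing", 1))  # bing1
print(compose_syllable("", "e", None))  # e
```

The numeric-suffix method used here is the second tone representation discussed above; the diacritic method would instead place the mark over a vowel of the final.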

In a review of CALL materials for Chinese, Bourgerie (2003) notes only three testing packages for Chinese. Two are for listening. One predates the majority of Internet technologies (Asay & Bourgerie, 1988) and the other focuses on placement using the ACTFL scale (Ke and Zhang, 2002). There are other, scattered instruments available via the Internet (for better-quality examples see “Mini-projects...” and University of Iowa in references); however, few attribute authorship, let alone validity and reliability studies.

No test known to this author is a CALL diagnostic listening test for Mandarin that is closely coupled to a linguistic construct of sub-syllabic phonetic features.

Language Testing

Brown and Hudson (2002, p. 4) define a criterion-referenced test (CRT) as one “...designed to describe the performances of examinees in terms of the amount that they know of a specific domain of knowledge or set of objectives.” Scores resulting from such tests are termed absolute decisions because they map examinee ability to the tested material, as opposed to performance in comparison to other examinees. The latter, population-based, scoring method is referred to as norm-referenced testing (Brown, 1996). Finally, Brown and Hudson (2002, p. 3) define a CRT subtype, the domain-referenced test, for CRTs that are “...based on a well-established domain that has been clearly defined with items selected from that domain.”

Language tests commonly provide evaluation for one of four purposes: proficiency, placement, diagnostics, or achievement. Diagnostic tests, in a CRT context, most often occur at the beginning of a term and are used to gauge a student’s proficiency relevant to a course’s objectives (Brown and Hudson, 2002, p. 31).

Although discrete item formats such as multiple choice (Brown, 1996) are not always perfect fits for CRT, Brown and Hudson (2002, pp. 68-69) note that “receptive skills such as ...listening... and phonemic discrimination can be efficiently tested using the multiple-choice format”. However those authors emphasize that critical focus is needed when using the multiple-choice format on CRT.

Regarding item construction, Brown and Hudson (2002, pp. 62-63) and Wainer et al. (2000, pp. 243-244) voice the fundamental testing concern of item independence, particularly noting that items where one item’s prompt contains the response to another item should be avoided. One statistical method for identifying item interdependence is Yen’s (1993, named in Brown and Hudson, 2002, pp. 206-207) Q3 statistic for detecting item dependencies; the use of testlets as a unit of analysis allows one to factor out such dependencies.


Wainer et al. (2000, pp. 245-254) discuss testlets in the domain of computer adaptive testing (CAT). Within that discussion a testlet is a fixed set of questions, such that any subject exposed to a testlet will take all items in that testlet. When testlets are designed for each trait, then any dependency effects, or context effects, that exist will at least be evenly distributed among all subjects taking a particular testlet, which in Wainer et al.’s (2000, p. 246) view is the most important concern.

A traditional feature of CAT is that student test lengths and individual item difficulties vary in reaction to subject performance (Brown, 1997). Item selection algorithms are therefore central to the usefulness of any CAT. Discussing CAT, Flanagan (2000) underscores that the quality of the test bank data is fundamental to proper functioning of any item selection algorithm. It should also be noted that Flanagan and others (Brown, 1997) refer to item banks as collections of pre-authored items.

Wainer et al. (2000, p. 246) define the traditional notion of item ordering as a power test, or ordering from “easy” to “hard” questions. However, one must note that this definition is based on norm-referenced principles rather than being made in reference to the tested construct.

Wainer et al. (2000, pp. 238-239) discuss multi-dimensionality in a test, where each skill required to perform a test is one dimension. For example, a mathematical word problem would include at least the two dimensions of reading and computation.

Reliability

Knowledge of a student’s background is essential for proper test calibration (XXXX), including the features of gender, native language, and educational status. In addition, Kondo-Brown (forthcoming) has developed a background questionnaire for students of Japanese that has proved particularly useful when the population includes heritage learners.

CALL Design

Measurement errors are those of incorrect score recording and reporting, including respondent mistakes as well as instrument errors (Lepkowski et al., 1998, pp. 367-368). Always an important concern, online tests provoke particular measurement error concerns. Lynch and Horton (1999) note that 800 pixels wide by 600 pixels high is a reasonable size to assume for target screens, within which all display elements must appear; this is an important concern if the test is to be embedded into a web page or other framing component.

Regarding CALL design for Mandarin, a well-known problem is the issue of character encoding systems. Character encoding is a blanket term for technologies that control how orthography is represented by a computer; these technologies have traditionally been inadequate for representing Mandarin’s large set of characters. Even though technological solutions exist, they are not so widely disseminated that they can be factored out of designs targeting lowest common denominator (LCD) systems. This problem extends to romanization, where diacritic marks over vowels cannot be reliably produced by LCD platforms. However, numerals can be reliably produced, recommending the, perhaps less-preferred, numeric method of tone representation discussed above.
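The numeric fallback can be sketched as a small conversion routine. The sketch assumes the standard Hanyu Pinyin tone-marked vowels; mapping “ü” to “v” is a common ASCII convention and is an assumption of this sketch, not something specified in the text.

```python
# Map each tone-marked vowel to (base letter, tone number).
TONE_MARKS = {
    "ā": ("a", 1), "á": ("a", 2), "ǎ": ("a", 3), "à": ("a", 4),
    "ē": ("e", 1), "é": ("e", 2), "ě": ("e", 3), "è": ("e", 4),
    "ī": ("i", 1), "í": ("i", 2), "ǐ": ("i", 3), "ì": ("i", 4),
    "ō": ("o", 1), "ó": ("o", 2), "ǒ": ("o", 3), "ò": ("o", 4),
    "ū": ("u", 1), "ú": ("u", 2), "ǔ": ("u", 3), "ù": ("u", 4),
    "ǖ": ("v", 1), "ǘ": ("v", 2), "ǚ": ("v", 3), "ǜ": ("v", 4),
}

def to_numeric(syllable: str) -> str:
    """Replace a tone diacritic with a trailing numeral; syllables
    with no mark (neutral tone) pass through unchanged."""
    tone = None
    letters = []
    for ch in syllable:
        if ch in TONE_MARKS:
            base, tone = TONE_MARKS[ch]
            letters.append(base)
        else:
            letters.append("v" if ch == "ü" else ch)
    result = "".join(letters)
    return result + str(tone) if tone else result

print(to_numeric("jiǎn"))  # jian3
print(to_numeric("bīng"))  # bing1
print(to_numeric("ma"))    # ma
```

The output uses only ASCII characters and so renders reliably on the LCD platforms discussed above.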

In a discussion of software improvement techniques referred to as refactoring, Fowler et al. (1999) consistently promote the concept of decoupled data system architectures. In brief, this concept stipulates that software longevity is promoted by separating data values from the algorithms that manipulate them, because that separation will minimize unintended, cascading failures during future changes to both the data and the algorithms. Given that Mandarin romanization systems historically fluctuate, and that tests and test items undergo revision over time for many reasons (Brown and Hudson, 2002; Wainer et al., 2000), decoupling is an important design tool for the CALL test creator. Regarding data decoupling, the extensible markup language (XML) encoding format is also extremely useful, as it combines the ease of use given by a markup language like the better-known HTML with an easily parsed tree structure (Flynn, 2003).
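A minimal sketch of this decoupling follows, with the construct data held in XML and the software only reading it. The element and attribute names here are assumptions for illustration, not MORT’s actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical construct file: feature values live in data, not code.
CONSTRUCT_XML = """
<construct>
  <initials>
    <initial value="j" articulation="palatal" aspiration="unaspirated"/>
    <initial value="q" articulation="palatal" aspiration="aspirated"/>
  </initials>
  <finals>
    <final value="ian" vowel="ia" nasal="n"/>
    <final value="ing" vowel="i" nasal="ng"/>
  </finals>
</construct>
"""

def load_construct(xml_text):
    """Index initials and finals by surface value, keeping their
    subfeature attributes; no feature value is hard-coded."""
    root = ET.fromstring(xml_text)
    initials = {e.get("value"): e.attrib for e in root.iter("initial")}
    finals = {e.get("value"): e.attrib for e in root.iter("final")}
    return initials, finals

initials, finals = load_construct(CONSTRUCT_XML)
print(initials["q"]["aspiration"])  # aspirated
print(finals["ing"]["nasal"])       # ng
```

Swapping romanization systems then amounts to supplying a different XML file, so long as the feature tree structure is preserved.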

Purpose

The MORT was designed in an institutional context, further described in a subsequent section on the pilot population, that uses Yao et al.’s (1997) textbook. Additionally, MORT was designed as a component in the Ocelot (Gammon, 2003 & forthcoming) set of CALL materials for 100-level Mandarin courses using that textbook. Therefore an initial purpose of the test was to provide criterion-referenced diagnostic testing of the Mandarin syllables encountered in the 100-level courses at the pilot institution. The expense of producing a new test was supported by the literature review, because there simply are not enough properly piloted Chinese second-language tests available.

The literature review additionally suggested test development relevant to three research questions. First, is it proper to interpret Mandarin syllabic transcription as a domain-referenced activity? Second, because syllabic romanization is essentially a concatenative process, can the test item bank be recast from a collection of pre-authored items to a set of materials needed to generate items? Finally, what can the experience of developing and piloting this test offer to the still-emerging field of CALL?

Instrument

This section describes the syllabic construct, unit structure, and item generation algorithms used by MORT. Following this section the test’s piloting is described.

Construct

MORT is derived from a seven feature Hanyu Pinyin construct adapted from Yao et al. (1997). All feature values are categorical. The features are nested in a tree such that a parent node value can be constructed by concatenating one value from each of the child nodes. That tree is described in figure 1.

Table 1 describes the universe of whole values and subfeature values for initials, and table 2 does the same for finals. There are only five tonal values: the four traditional Chinese tones plus a neutral or unmarked tone. Further, the tones have no subfeature designation, and so no special table listing is given. Tones are represented as suffixed numerals, except for the neutral tone, which has a null value. This approach was adopted to mitigate display issues on target systems. Table 3 lists the whole syllables used in the pilot program. The universe of whole syllables could ultimately equal the product of initials times finals times tones. However, standard Mandarin does not utilize that full syllabic set. Additionally, because the set of whole syllables of interest may vary (for example by curriculum), and because Chinese romanization systems have historically been mutable, MORT loads the construct data from external files. This allows one to change the romanization system being tested and vary the whole syllables forming the initial item bank, without rewriting the software, so long as the romanization system adheres to the construct. In other words, construct categorical values may change in any way that does not entail changing the feature tree structure in figure 1.

Testing method

Before describing item selection and construction in individual units, it is useful to review the testing method from the subject’s perspective. The test begins by presenting the tested construct to the subject. Each major feature is described and examples are given. Finally, the multimedia multiple-choice format of the test is explained.

During the test one item appears per screen. An item is an audio prompt plus five Pinyin (textual) response options. When a screen is entered the audio prompt is played. A button is provided to replay the clip. Responses are ordered alphabetically to improve reading and as an attempt to stabilize washback effect from response ordering. After answering a question the subject presses a “next” button to continue to the next item. Backwards navigation is not allowed. The screen footer indicates which unit one is in, and what item out of how many is displayed. Figure 2 displays a subtest screen. A detail of the testing screen showing the item format is given in figure 3.

Item Bank

The item bank is composed of audio clips, a construct definition, and item authoring algorithms. There are four item authoring algorithms, one each for whole syllables, initials, finals, and tones. The construct definition is implemented as an XML file with sections for syllables, initials, and finals. Because the tones are considered atomic and regular in application, no special data store is required to produce them. Syllables, initials, and finals are, however, marked with their subfeature values. The algorithms and their use of subfeature values are described in the following sections.

Although the collection of initials and finals is expected to be static, defining all compositional data in an external file offers advantages. In particular it allows MORT to be used as a criterion-referenced test, an example of which occurs in the pilot program. MORT items are two-dimensional, listening and reading, because they have audio prompts and text responses. The audio clips are a collection of MP3 audio files stored in a directory accessible to the test software. Following the principle of decoupling, links between audio clips and the phonetic construct are defined in the XML file.

Item selection is computer adaptive in two dimensions. First, items in the initial and final sections are selected on the basis of subject performance in the first unit; item-construct correspondence and unit length therefore both vary. Additionally, all units are adaptive in that all items are generated at runtime; the number of possible items for a unit is a function of the number of available prompts and the categorical values for subfeatures used for responses. Each of the latter sections may also be considered a testlet, or contained selection of items prompted by performance.

Each subunit randomly orders items to reduce testwiseness. Responses are alphabetized, using that as an ordering from outside the construct. Alphabetization is also intended to aid student reading of responses and reduce accidental error responses.

To enhance item independence, prompts are removed from the item bank when used. The effect is that any one Mandarin syllable as audio prompt appears only once per each administration of the entire test. For example if one was unable to identify the prompt “bing1”, then items testing the “b” initial would be provided later, but the “bing1” prompt itself would not be a candidate prompt within the initials unit.
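The prompt-depletion rule can be sketched as drawing without replacement; the class and method names below are illustrative.

```python
import random

class PromptBank:
    """Audio-prompt pool; a drawn prompt never repeats within one
    administration of the test."""

    def __init__(self, syllables):
        self._pool = list(syllables)

    def draw(self):
        prompt = random.choice(self._pool)
        self._pool.remove(prompt)  # deplete: used prompts leave the bank
        return prompt

bank = PromptBank(["bing1", "jian3", "cong1"])
draws = [bank.draw() for _ in range(3)]
print(sorted(draws))  # ['bing1', 'cong1', 'jian3']
```

In the “bing1” example above, the initials unit would still draw other “b” syllables from the pool, but “bing1” itself would no longer be a candidate.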

Unit 1: Whole Syllable

Prompts are selected randomly from the syllable set, in an amount matching the total number of items to be given in the unit. The total number of items is implemented as a variable. The particular value used in the pilot, 30, is discussed in the section on piloting. The random selection order is used as the test item order to confound washback from presenting items in alphabetical order.

The five responses are in the form of a whole Romanized syllable. The response choices will include the actual answer, a distractor that matches by at least initial, a distractor that matches by at least final, a distractor that matches by at least tone, and a randomly selected distractor. In the sample item, note that the responses are ordered alphabetically. The upper row gives the feature value a response was selected by:

                  RESPONSES
PROMPT   ANSWER   FINAL    INITIAL   TONE    RANDOM
jian3    jian3    jian4    jie3      suo3    ti
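The whole-syllable authoring described above can be sketched as follows, assuming a small hypothetical bank of syllables decomposed into (initial, final, tone) triples; MORT’s actual data and selection code are not reproduced here.

```python
import random

# Hypothetical bank: syllable -> (initial, final, tone).
BANK = {
    "jian3": ("j", "ian", "3"), "jian4": ("j", "ian", "4"),
    "qian2": ("q", "ian", "2"), "jie3":  ("j", "ie",  "3"),
    "suo3":  ("s", "uo",  "3"), "ti":    ("t", "i",   ""),
    "bing1": ("b", "ing", "1"),
}

def pick(candidates, exclude):
    """Randomly pick one candidate not already chosen."""
    return random.choice([s for s in candidates if s not in exclude])

def make_item(answer):
    ini, fin, tone = BANK[answer]
    chosen = {answer}
    # One distractor matching the initial, one the final, one the tone.
    for index, value in ((0, ini), (1, fin), (2, tone)):
        same = [s for s, feats in BANK.items() if feats[index] == value]
        chosen.add(pick(same, chosen))
    chosen.add(pick(BANK, chosen))  # one random distractor
    return answer, sorted(chosen)   # responses displayed alphabetized

answer, responses = make_item("jian3")
print(len(responses))  # 5
```

Sorting the five responses reproduces the alphabetized display order described earlier, so the selecting feature of each distractor is not apparent from its position.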

Unit 2: Initial

The construct subdivides an initial into the two features of articulation point and aspiration. Input data includes a list of each initial type, tagged with its subfeature types and a list of example syllables.

The subtest provides items testing initials used in prompts on items missed in the first subtest. A random selection of initials will be generated, to match the length of the unit. From within those initial groups, one example syllable will be selected as a prompt.

Five responses are supplied, each being only an initial. One is the answer, one distractor matches the articulatory point, one distractor matches the aspiration type, and two are randomly selected.

A sample item follows:

                  RESPONSES
PROMPT   BREATH   ANSWER   ARTICULATION   RANDOM   RANDOM
jia1     b        j        q              sh       y

Unit 3: Final

The construct subdivides a final into a vowel and a nasal ending. The final subtest contains items testing finals used in prompts on items missed in the first subtest. Five responses are given, each being only a final. One is the correct answer, one matches the vowel, one matches the nasal, and two are randomly selected. A sample item follows:

                  RESPONSES
PROMPT   NASAL   RANDOM   VOWEL   ANSWER   RANDOM
bing1    eng     iang     in      ing      uan

Unit 4: Tone

The construct declares five tonal values: null and one through four. Each of these values is categorical; ordinal comparisons have no linguistic relevance. Input data is a set containing each of the five tones with a list of syllabic examples. The tonal feature is atomic in this construct. Ten items are selected such that two items correspond to each tone. Because the items are selected in groups based on tone, test items are shuffled before being presented to the user. There are five responses, one for each possible tone.

                  RESPONSES
PROMPT   DISTRACTOR   ANSWER   DISTRACTOR   DISTRACTOR   DISTRACTOR
cong1    NEUTRAL      1        2            3            4
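The tone-unit selection can be sketched as drawing two example syllables per tone and shuffling the result; the per-tone example lists below are illustrative assumptions.

```python
import random
from collections import Counter

# Hypothetical per-tone example syllables ("" is the neutral/null tone).
EXAMPLES = {
    "1": ["cong1", "bing1", "jia1"],
    "2": ["qian2", "ren2", "ming2"],
    "3": ["jian3", "suo3", "jie3"],
    "4": ["jian4", "shi4", "kan4"],
    "":  ["ma", "de", "le"],
}

def build_tone_unit(per_tone=2):
    items = []
    for tone, syllables in EXAMPLES.items():
        for prompt in random.sample(syllables, per_tone):
            items.append((prompt, tone))  # (audio prompt, keyed tone)
    random.shuffle(items)                 # break up the per-tone blocks
    return items

unit = build_tone_unit()
print(len(unit))  # 10
counts = Counter(tone for _, tone in unit)
print(all(n == 2 for n in counts.values()))  # True
```

The final shuffle corresponds to the remark above that items selected in tone groups must be reordered before presentation.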

Item Scoring and Result Data

Because MORT is a domain-referenced test, its scores are not useful for comparing students directly: two students could have equal correct/error scores and yet be proficient in different parts of the construct (for example, one may have trouble with labial initials while another finds nasalized finals difficult). However, raw scores, or correct/incorrect rates, will still be given because they are expected by test consumers (Brown, personal correspondence).

Still, the test results will focus on providing a map of construct points identified as outside a student’s knowledge. This will include a list of syllables, initials, finals, and tones missed by the student, the particular items and answers, and a multimedia set of examples for each construct point. The results will be rendered in HTML.
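The construct-point map can be sketched as grouping missed items by the feature they tested; the record format below is an assumption for illustration.

```python
from collections import defaultdict

# Hypothetical response log: (unit, construct point tested, correct?).
responses = [
    ("syllable", "jian3", False),
    ("initial",  "j",     True),
    ("final",    "ian",   False),
    ("tone",     "3",     False),
    ("syllable", "bing1", True),
]

def miss_map(records):
    """Group missed items by the construct point they tested, yielding
    the per-feature miss lists the results screen reports."""
    missed = defaultdict(list)
    for unit, point, correct in records:
        if not correct:
            missed[unit].append(point)
    return dict(missed)

print(miss_map(responses))
# {'syllable': ['jian3'], 'final': ['ian'], 'tone': ['3']}
```

Rendering such a map to HTML, with multimedia examples attached to each missed construct point, would then be a presentation-layer step.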

Pilot Method

Population

The population is composed of 100-level Mandarin students and post-100-level Mandarin students, who will be interpreted as non-masters and masters of the content, respectively.

Item Bank

The item bank is the whole Mandarin construct delimited to those syllables appearing as first syllables in the Yao et al. (1997) index. This set forms the initial syllables of all vocabulary to which a 100-level Mandarin student would be exposed.

Discussion

Although he spoke in a context involving pre-authored items, Flanagan (2000) indicates the need for precise criteria in the item bank specification. For example, reliability measures required X items during the pilot; the shortest possible test was therefore that long.

An important question is whether item difficulty can be expressed only in reference to the construct, or if student performance is a necessary ingredient.

Conclusion

Further development could include syllables in combination, vowel composition, and dialectal variance. Whole syllable items provide a “brief but comprehensive review of construct potential”.

An open research question is how feature-based distractors compare to random distractors. On the one hand, feature-based distractors are more informative, since selection of one may indicate partial recognition of the syllable (say, the initial). However, the correct answer is also the one response that shares a feature with three other responses, and that could lead to washback.

The tone system could also be overhauled; its current algorithmic handling reduces variability. Particular variables of interest are the number of tones and whether to encompass tone sandhi.

References

ABC Chinese-English Comprehensive Dictionary: Alphabetically Based Computerized. (2003). Ed. J. DeFrancis. Honolulu, HI: University of Hawaii Press.

Asay, D. & Bourgerie, D.S. (1988). Chinese Pronunciation and Romanization Diagnostics. V.1.0. Columbus, OH: The Ohio State University Center for Teaching Excellence.

Bourgerie, D.S. (May 2003). Computer assisted language learning for Chinese: A survey and annotated bibliography. Journal of the Chinese Language Teachers Association, 38(2), 17-48.


Brown, J.D. (July 1997). Computers in language testing: Present research and some future directions. Language Learning & Technology, 1(1), 44-59.

Chao, Y.R. (1968). A Grammar of Spoken Chinese. University of California Press.

Flynn, P. (Jan 14, 2003). The XML FAQ. World Wide Web Consortium XML Special Interest Group. Retrieved February 16, 2004 from http://www.ucc.ie:8080/cocoon/xmlfaq

Fowler, M. et al. (1999). Refactoring: Improving the Design of Existing Code. San Francisco: Addison-Wesley.

Ke, C. & Zhang, Z. (2002). Chinese Computerized Adaptive Listening Comprehension Test (CCALCT). Columbus, OH: National East Asian Language Resource Center at the Ohio State University. http://flc.ohio-state.edu

Lepkowski, J.M., Sadowski, S.A., & Weiss, P.S. (1998). Mode, behavior, and data recording error. Chapter 19 in Computer Assisted Survey Information Collection (pp. 367-388). New York: Wiley & Sons.

Library of Congress (May 28, 1999). New Chinese Romanization Guidelines. Retrieved February 15, 2004 from http://lcweb.loc.gov/catdir/pinyin/romcover.html

Lynch, P. & Horton, S. (1999). Web Style Guide: Basic Design Principles for Creating Web Sites. New Haven: Yale University Press.

Mathews’ Chinese-English Dictionary. (1943). American Edition. Harvard University Press.

Mini-projects with Mandarin: Chinese Character/Pinyin Test. Accessed February 3, 2004 from http://stat-www.berkeley.edu/users/deegan/test1.htm

Ramsey, S.R. (1987). The Languages of China. Princeton: Princeton University Press.

Tsai, Chih-Hao. (July 17, 2002). Similarities Between Tongyong Pinyin and Hanyu Pinyin: Comparisons at the Syllable and Word Levels. Accessed November 21, 2003 from http://www.geocities.com/hao520/research/papers/pinyin-comparison.htm

University of Iowa. Practice Your Pinyin. Accessed February 3, 2004 from http://www.uiowa.edu/~chinese/pinyin/index.html

Yao, T.C., Liu, Y.H., et al. (1997). Integrated Chinese: Traditional Character Edition. Level 1, Parts 1 and 2. Boston: Cheng & Tsui Company.


Figures

Figure 1: Construct

Figure 2: Subtest screen

Figure 3: Item Format Detail

