A Survey of Media and Data Processing Development for Written Taiwanese

Iûⁿ Ún-giân, Lecturer, Dahan Institute of Technology ([email protected])
Henry H. Tân-Tēⁿ, Montclair State University ([email protected])

Abstract

Many proposals have been advanced for writing Taiwanese; none has yet been

designated official. Of all the proposals, Pe̍h-ōe-jī (POJ) has the longest history and

the largest corpus of didactic materials, dictionaries, and literary works. This paper

surveys developments in Taiwanese media and data processing as they pertain to POJ.

In the former category we include the print media of books, periodicals, and

newspapers, as well as the broadcast media and the Internet. In the latter we discuss

the evolution of Taiwanese textual processing, segmentation, machine translation,

and Unicode support. Finally we propose a preliminary program for applied

computational linguistics in Taiwanese, for the purpose of revitalizing and advancing

written Taiwanese.

Keywords: Ph-ōe-jī (POJ), written Taiwanese, vernacular literature, media, textual

processing, computational linguistics

1. Introduction

Broadly speaking, “Taiwanese” refers to the languages of the Taiwanese

people, including Holo1, Hakka, and the Austronesian languages. Of all the groups

the Holo are the most numerous, accounting for over 70% of the population (Huang

1995). Therefore, as early as the Japanese colonial period a century ago, Holo was

known as Taiwanese, as well. In this article we use the terms interchangeably.

The Holo written language and literature, or Tâi-gú-bûn (TGB), can ultimately

be traced to the Southern Min dramas dated 1566 (Gô͘ 1995), prior to the era of mass

migration from China to Taiwan. At that time Han characters were primarily employed in the classical language, not in service of the written vernacular. The earliest orthography tailored to the language (and related dialects) is undoubtedly Pe̍h-ōe-jī

(POJ, “vernacular writing”), traceable to the 1832 A Dictionary of the Hok-këèn

Dialect of the Chinese Language by the missionary Walter Henry Medhurst (Heylen

2001; Klöter 2002). Since then the scheme has undergone a number of minor changes and is currently stable. It has, moreover, been adapted to provide for the minority Hakka language.

POJ initially served the interests of the Protestant missionaries and local

Taiwanese converts and their descendants, particularly the illiterate and semi-literate.

Until the 1980s much of the POJ literature, therefore, revolved around Christian themes followed distantly by education of a more general nature. Whereas both the

Japanese and Chinese Nationalist regimes on the island had a history of suppressing romanization, the post-martial law era (1987-) of democratic reforms saw the emergence of small, competing organizations seeking to promote written Taiwanese of one form or another. In this lively if sometimes fractious environment, POJ found renewed usage in a newly secularized context even as its functions in the religious domain appeared to continue to decline. Concomitant with this development was the switch from monoscript to mixed scripts within running texts; that is, among those familiar with POJ the mainstream preference today is for mixing Han (Chinese) characters and romanization (a practice known as hàn-lô), particularly in formal publication2 (Tiuⁿ 1998: 230).

Many new phonetic systems and orthographies – at least 64 in one study – emerged during this period (Iûⁿ and Tiuⁿ 1999). By virtue of its long history and recent revival, POJ is presently the system with the most numerous and varied publications,

including didactic materials, dictionaries, and literary works. The status derived

thereof is nevertheless insecure, as a number of alternative systems have sought

legitimacy via endorsement by the political and academic establishments, at the same

time engaging in publishing efforts of their own3.

In the sections to follow we first survey the use of POJ in various types of

media for the past century and beyond, in its capacity as an orthography or phonetic

scheme for Holo. Other orthographic or annotative choices – for example, Han

characters – are not considered. We then summarize recent efforts in the area of POJ

Taiwanese computing, including text manipulation tools, pedagogic tools, and

research in applied computational linguistics.

2. Development of written Taiwanese in the media

This section is divided into three portions, namely the print media (books,

periodicals, newspapers), broadcast media, and the newly emerged Internet.

2.1 Print media: books, periodicals, newspapers

We have collected a substantial bibliography of publications utilizing POJ as

an orthography or phonetic annotation system for Han characters. We have

additionally consulted an as-yet unpublished bibliography by Lī Heng-chhiong.

2.1.1 Books

Printed books are by far the most common sources of POJ. In Table 1 several

categories employing POJ are identified. In principle literary works of an overtly

religious nature (e.g. The Psalms) are classified as “religious works.”

Table 1. POJ Books by Year and Type

Type         Year N/A Pre-1900  1900~09  1910~19  1920~29  1930~39  1940~49  1950~59  1960~69  1970~79  1980~89  1990~99  2000~02  Total (%)
Religious          40       51       11       27       38       56       30      119       80       11       11       26        3  503 (60)
Didactic            2        4        -        -        9        5        2       20        9        3        9       58       15  136 (16)
Literary            -        -        -        1        4        -        1        9        4        -        1       71       17  108 (13)
Reference           -       15        1        1        5        3        2        -        1        4        2       13        2   49 (6)
Specialized         -       11        2        3        5        2        2        -        3        1        4        8        3   44 (5)
Total              42       81       14       32       61       66       37      148       97       19       27      176       40  840 (100)
(%)               (5)     (10)      (2)      (4)      (7)      (8)      (4)     (18)     (12)      (2)      (3)     (21)      (5)

1. Religious works: These account for the single largest category, over 500 titles

(60%) in our catalog, all Christian. Publishers no longer reprint most of them.

2. Language teaching materials: Over 130 titles are known, including some

bilingual textbooks in English or Japanese.

3. Literary works: Including fiction, prose, poetry, drama, translated works, and

folk literature. At least 100 titles are known, the earliest indigenous fictional

work being Mother’s Tears (1925) by Lōa Jîn-seng (g 2000).

4. References: Over 40 volumes, including dictionaries featuring Mandarin,

English, Japanese, and Spanish. Other references concern geography,

proverbs, and botany.

5. Specialized texts: Covering subjects other than language pedagogy, such as

mathematics, astronomy, medicine, botany, and social commentary. In

addition, in recent years a few academic papers and monographs in Taiwanese

have appeared in spite of the general reluctance of institutions to accept them.

2.1.2 Newspapers

As of this writing no newspaper or similar publication, either religious or

secular in orientation, exists in Taiwanese. Historically Taiwan Church News

(originally Taiwan Prefectural City Church News) published in the Holo language

using POJ. Founded in 1885, it was the first newspaper of any language to publish on

the island. As the longest-running paper as well, it is uniquely positioned in

having documented Taiwanese society through a century of Manchu, Japanese,

and Chinese rule, and in having done so in a major language of the masses. According

to the Committee on History of the Presbyterian Church of Formosa, in 1928 and

1932 it merged with three regional church newspapers also publishing in POJ (2000:

190). In 1942 the Japanese colonial authorities forced it to cease publishing.

Resuming publication after 1945, it eventually succumbed to pressure from the

Chinese Nationalist regime in 1970 and abandoned the traditional language in favor of

the official Mandarin (see section 2.1.4). This policy was partly reversed in the 1990s

with the inclusion of a “special column” consciously devoted to the mother tongue,

and then mostly in hàn-lô. The space afforded it was quite limited and increasingly so;

eventually this column also came to include Mandarin contents.

Other historical newspapers include one regional Presbyterian weekly

publishing in the early 1940s and a Catholic newspaper in the mid-1930s. We believe

additional sources have yet to be unearthed.

In the 1990s several mass-circulation (Mandarin) dailies featured occasional

columns on language-related topics by well-known teachers or activists. Neither the

front-page news nor the popular entertainment sections, however, have offered

Taiwanese alternatives, even as headlines incorporating Holo catchwords have

become more fashionable4. Indeed the literary pages have been devoted to works in

the standard Mandarin language (the rare poem notwithstanding).

As is the case with periodicals (section 2.1.3), immigrant newspapers in the

United States have been more willing to experiment with new forms. The Pacific

Times and particularly Taiwan Tribune have featured original and reprinted articles in

Taiwanese. The historically more radical Taiwan Tribune pioneered Taiwanese-language editorials in the 1990s; its literature section often carried poetry and essays in the language. Unlike the Taiwan-based papers, both of these arrange the text horizontally, a format aesthetically compatible with hàn-lô.

In 2002 the trilingual Taiwanese Children’s Newspaper began soliciting subscribers. Apparently responding to the market potential partly driven by the

English and local languages components of the Nine-year Comprehensive Curriculum for the Elementary and Junior High Education, it targeted elementary schoolchildren and their parents by offering contents in Mandarin, Taiwanese, and English. Even though the Taiwanese portion accounts for only a quarter of the space, the concept was unprecedented.

2.1.3 Periodicals

As with newspapers, all known POJ periodicals (primarily monthlies) prior to the 1970s were Church-published. Of these, the most recent new periodical appeared in 1960; the last to cease publication did so in 1969. The longest surviving periodical published for fifteen years (ah-miā ê bí-niû, 1954-68).

The first known periodical of a secular nature was the Taiwanese Language and Literature Bi-monthly (1977-1979). As the domestic political milieu of the time prohibited such a publication, it was produced and published in the United States (and distributed outside Taiwan). The Bi-monthly is also believed to be the first publication to have systematically adopted hàn-lô in writing Taiwanese5.

Not until 1989 did a new periodical appear on the island. Known as Hong-Hiòng

“Wind Direction”, this purely POJ bi-monthly was published until 1992. According to its publisher and editor, the intended readership was elderly Christians literate in POJ

but not Han characters. As it turned out, a disproportionate percentage of readers did

not fit that profile. The only other Christian-oriented periodical to appear since then

was Taiwanese Lily Forum (1995-6), whose publisher opted for hàn-lô instead.

The Taiwanese language periodicals of the 1990s sought to rejuvenate and

advance the written language by cultivating a more “mainstream” readership. Table 2

lists their publication dates and orthographic choices. Seven of the eight periodicals

targeted the general readership and thus generally did not feature religious themes.

As of this writing, five continue to publish. Of particular note is the fact that only one

periodical, Tâi-ôan-jī, favors monoscript after the pre-1970s Church tradition, yet

more than 90% of its staff are not members of the Church.

Table 2: POJ Periodicals Appearing in the 1990s6

Year            Periodical               Publisher                                          Location   Orthography  Publishing Cycle
1991.7~         Taiwanese Writing Forum  Chhong-Bí Memorial Fund                            USA        HL           M
1992.6~1994.6   Taiwanese Wind           Association to Promote Taiwanese Writing           Taipei     HL           B (12 issues)
1992.7~1994.6   The Taiwanese Student    Students Taiwanese Promotion Association (STAPA)   Taiwan     HL           S (4 trial issues), q10 (issues 1-18), Q (issues 19-22). Campus chapters edited in turn.
1995.7~1996.11  Taiwanese Lily Forum     Chhong-Bí Memorial Fund                            USA        HL           B (9 issues)
1996.10~        Tâi-bûn Bóng Pò          TBBP Publishing                                    Taipei     HL           M
1999.1~         Liân-chiau-hoe Magazine  LCH Publishing                                     Taichung   HL           Q
1999.11~        TGB Communiqué           STAPA                                              Taipei     HL           M
2000.5~         Tâi-ôan-jī               Ko-hiông City Taiwanese Romanization Association   Kaohsiung  CL           B

Legend: HL = hàn-lô, CL = chôan-lô (fully romanized). q10 = every ten days, S = semi-monthly,

M = monthly, B = bimonthly, Q = quarterly

2.1.4 POJ: Toward secularization and mixed scripts

As shown in Table 1, publishing output per decade increased steadily between 1900 and 1939, dipped to a low in the 1940s, and rebounded to a peak in the 1950s, only to decline from there until the 1980s. The output in the 1970s was nearly as low as that of the 1900s. The 1950s and ’60s accounted for

30 percent of the POJ titles catalogued, more than the combined output of the preceding half century. Religious works accounted for some 80% of the titles in these two decades, far more than their overall share across eras (60%). Indeed, religious works outstripped all other categories until the 1990s. Since then output in all categories has increased significantly. Whereas literary works were meager in the pre-1990 decades, they account for 40% of the POJ books published since. The share of language teaching materials has likewise more than doubled, from a previous peak of 15% (in the 1950s) to 34%. The inclusion of local languages as a subject in the Nine-year

Comprehensive Curriculum (in force after September 1, 2001) no doubt spurred publication in this area.
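The shares quoted in this paragraph can be recomputed from the counts in Table 1. The following is a minimal sketch in Python; the counts are transcribed from the table, while the helper names (`count`, `total`, `share`) and the shortened period labels are introduced here for illustration only.

```python
# Counts transcribed from Table 1 (rows: type; columns: period); "-" is read as 0.
PERIODS = ["N/A", "Pre-1900", "1900s", "1910s", "1920s", "1930s", "1940s",
           "1950s", "1960s", "1970s", "1980s", "1990s", "2000-02"]
TABLE1 = {
    "Religious":   [40, 51, 11, 27, 38, 56, 30, 119, 80, 11, 11, 26, 3],
    "Didactic":    [2, 4, 0, 0, 9, 5, 2, 20, 9, 3, 9, 58, 15],
    "Literary":    [0, 0, 0, 1, 4, 0, 1, 9, 4, 0, 1, 71, 17],
    "Reference":   [0, 15, 1, 1, 5, 3, 2, 0, 1, 4, 2, 13, 2],
    "Specialized": [0, 11, 2, 3, 5, 2, 2, 0, 3, 1, 4, 8, 3],
}

def count(kind, periods):
    """Number of titles of one type published in the given periods."""
    row = TABLE1[kind]
    return sum(row[PERIODS.index(p)] for p in periods)

def total(periods):
    """Number of titles of all types published in the given periods."""
    return sum(count(kind, periods) for kind in TABLE1)

def share(kind, periods):
    """Percentage of the titles in the given periods that belong to one type."""
    return 100 * count(kind, periods) / total(periods)

postwar = ["1950s", "1960s"]
recent = ["1990s", "2000-02"]

print(total(PERIODS))                                    # 840 titles overall
print(round(100 * total(postwar) / total(PERIODS), 1))   # 29.2 (the "30 percent")
print(round(share("Religious", postwar), 1))             # 81.2 (the "some 80%")
print(round(share("Literary", recent), 1))               # 40.7 (the "40%")
print(round(share("Didactic", recent), 1))               # 33.8 (the "34%")
```

The figures in the text are these values rounded to the nearest whole percent.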

The trend of secularization is therefore one major characteristic of the revitalization of POJ in the 1990s. In advocating the traditional name Pe̍h-ōe-jī over the common “Church Romanization,” Tiuⁿ (2001: 13) notes that nowadays non-Christians

not only account for more POJ users than Christians but are more

“fundamentalist” in their attitude toward the script. The publication data are consistent with this observation.

The apparent decline in book publication in the 1940s is most likely related to wartime turmoil towards the end of Japanese colonial rule, as well as the political and economic turbulence during the earliest years of Chinese Nationalist rule7. The decline in the 1970s is similarly and at least partly attributable to

government policy discouraging or prohibiting the cultivation of local languages in

the public sphere, especially where romanization was involved. In reporting on the

banning of a new Taiwanese-English dictionary, The New York Times quoted in 1974

a government official as saying, “We have no objection to the dictionary being used

by foreigners. They could use it in mimeographed form. But we don't want it

published as a book and sold publicly because of the Romanization it contains.

Chinese should not be learning Chinese through Romanization.” In an unprecedented

move the Presbyterian Church in Taiwan (PCT) issued a statement entitled “Our

Appeal,” in which it urges “that the freedom to continue to publish and distribute the

Bible in any language be guaranteed.” Similar official objections to Taiwanese

romanization continued well into the 1980s (Huang 1995: 54) and even the early

1990s. As mentioned, by 1970 the PCT had ceased publishing its organ newspaper in

Taiwanese via POJ and instead turned to Mandarin (via Han characters).

One might also speculate on the impact of two decades of compulsory

Mandarin/Han character education on Taiwanese/POJ proficiency. The fact that for a

century the Church was the only institution with the experience and motivation to

carry out POJ education certainly hampered its spread throughout the larger society8.

Indeed, POJ came to be associated with the Church (and thus the “foreign religion” it

represented) in the minds of the non-Christian majority, to this day. Although

Church-run private schools have existed, they have had to conform to strict

government regulations on crucial aspects of education, including language.

Furthermore, as public education via Mandarin became more prevalent and

expectation of achieving higher education greater, POJ’s utility as a literacy

instrument declined and its status within the Church evolved to that of an institutional

symbol. The 1996 publication of a Han-character transcription of the 1933 Bible was

emblematic. We thus believe it likely that increasingly fewer learners of POJ relied on it to acquire information. This proposed process of Han characters replacing the functions of the traditional POJ is analogous to what Rohsenow (2001: 133) has described for China’s two-year-long Hanyu “immersion program”: “The fact is…that after several years most of these students lose their ability to read and write fluently in Hanyu Pinyin alphabet through later disuse and a lack of any advanced materials written in Pinyin that would give them an opportunity to maintain their early skills, again a situation brought about by the Chinese government’s ongoing policy of monographia with .” As it is, the sociolinguistic aspect of the acquisition and practice of POJ within the Church in the early postwar decades requires further investigation.

As Table 2 suggests, recent revitalization of POJ largely took the form of hàn-lô, or mixed scripts. In the era of democratization, segments of the Taiwanese

Nativist Movement, both outside of and within the Church, were seeking to revitalize the mother tongues. In this atmosphere, POJ was seen as an indigenous cultural asset as well as a literacy tool that enables the writer to sidestep the difficulties of

Taiwanese-via-Han characters (Tiuⁿ 1998: 227). The addition of Han characters reflects both sentimental attachment to the script and the practical advantages of cross-language learning of cognates. As such it represents a synthesis of the old

Southern Min Han-character tradition and the later romanization monoscript culture of the Church.

2.2 Dynamic media: the audiovisual

Dynamic media such as television may be regarded as combining written and audiovisual materials. The audiovisual is not directly related to the written language and can be regarded as an auxiliary tool for the text. Some literacy advocates regard the

spoken word as problematic: On the one hand, sound may facilitate reading; on the

other, since most of those fluent in Taiwanese lack literacy, some advocates believe

that audio may adversely affect the willingness of the audience to acquire or utilize

Taiwanese literacy. Literacy, then, is deemed more of a priority than the audiovisual.

The inclusion of Taiwanese texts in these media is almost unheard of. In 2000

and 2001 Formosa Television hosted conferences on Taiwanese and, following the

suggestion of participants, later broadcast conference footage with hàn-lô subtitles.

Even Holo news programs, presumably catering to elderly monolingual speakers,

universally display Mandarin headlines.

Regardless, the broadcast media remain a direction worth cultivating – not

only television programming but also repackaged programming on storage media,

namely VCDs and DVDs.

2.3 New media: the Internet

2.3.1 E-mail groups & e-newsletters

For marginalized languages the Internet presents a new, low-cost channel for

disseminating texts. By far the most relevant technology is electronic mail, through

which a virtual community of like-minded individuals may engage in discourse in

their beloved language(s) outside of the traditional, territorially defined home-

neighborhood-community domains. All of the several currently available lists are

language-oriented, the subscribers being language teachers, learners, writers, activists,

and enthusiasts. The largest such list has about 135 subscribers and the norm is to use

chôan-lô.

Until recently certain diacritics special to POJ glyphs could not be rendered

electronically, and it became common practice to substitute numerals for them or to discard

tone representation altogether. Since 2001 the development of new fonts has allowed

direct and authentic textual representation, in accord with the standard orthography

(see 3.1.1).

Currently a single e-newsletter is active. It transmits select articles from literary periodicals, in addition to original learning materials and occasional announcements. Unlike the e-mail discussion groups, subscribers are more likely to come from outside of the activist circle. Hàn-lô is the dominant form and its use is consistent with the e-newsletter’s more formal contents compared to e-mail.

2.3.2 World-Wide Web

The advent of the Web since 1989 has significantly altered the lifestyle of computer users in the span of a few years. The amount of on-line information accumulates ever more rapidly.

As with e-mail, the availability of POJ fonts has allowed the authentic reproduction of the standard orthography on the Web without resorting to graphical formats. Consequently, not only has the number of web sites grown rapidly, but newer applications have emerged, including Taiwanese-Mandarin dictionaries, on-line translation, learning tools, message boards, and word games such as POJ Hangman.

Unicode support remains a goal, however.

3. Development of Taiwanese data processing

With the advent of the information age, the problems of data processing are increasingly important. Advances in Taiwanese data processing relate to the modernization of written Taiwanese. In viewing this aspect of development, one needs to bear in mind that the resources for it are but a tiny percentage of those available to Mandarin.

3.1 Input methods and word processing

Input methods and word processing are fundamental to textual data processing.

Insofar as hàn-lô is concerned, input methods translate keystrokes from the

“standard” English keyboard9 to POJ or Han characters on-screen. A word bank maps

combinations of keystrokes to a word string of Han characters, POJ letters, or both.

Because most syllables in POJ running texts carry diacritics, efficiency is an

important design criterion. In Taiwan personal computer users primarily use IBM

PC-compatible machines. Therefore software development has been focused on the

PC platform.

3.1.1 PC platform: TW301, HOTSYS, TP

TW301 was developed in 1990 under the auspices of Robert Cheng (University

of Hawai’i). It operates in the DOS environment in concert with the Eten [Yi-T’ian]

Chinese System. It utilizes modified (Western) fonts and Taiwanese-unique Han

characters, as well as a user-modifiable word bank, to constitute the input method.

This method seeks to enhance the key-in rate by defining multi-syllabic words and

phrases (to sidestep the ambiguity associated with mono-syllabic homonyms), “direct

input” shortcuts (to avoid selecting among high-frequency mono-syllabic homonyms

by assigning them unique key strokes), and abbreviations (consisting of each syllable-

initial letter in a multi-syllabic word).
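The word-bank idea described above can be illustrated with a minimal Python sketch. It is not TW301's actual data format or code: the entries, the tone-number keystroke codes, and the function names are hypothetical, and the "direct input" shortcuts are not modeled. It shows only how multi-syllabic entries can be reached either by their full code or by an abbreviation built from syllable-initial letters.

```python
# Hypothetical word bank: keystroke code -> candidate written forms (Han, POJ).
WORD_BANK = {
    "tai5-bun5": ["台文", "Tâi-bûn"],
    "han3-lo5": ["漢羅", "hàn-lô"],
    "bun5": ["文", "bûn"],
}

def abbreviation(code):
    """Join the first letter of each syllable, e.g. 'tai5-bun5' -> 'tb'."""
    return "".join(syllable[0] for syllable in code.split("-"))

def build_index(word_bank):
    """Index every entry by its full code; multi-syllabic ones also by abbreviation."""
    index = {}
    for code, forms in word_bank.items():
        keys = [code]
        if "-" in code:
            keys.append(abbreviation(code))
        for key in keys:
            index.setdefault(key, []).extend(forms)
    return index

def candidates(index, keystrokes):
    """Return the written forms offered for a keystroke sequence (empty if none)."""
    return index.get(keystrokes, [])

index = build_index(WORD_BANK)
print(candidates(index, "tb"))        # forms stored under 'tai5-bun5'
print(candidates(index, "han3-lo5"))  # forms stored under the full code
```

In a real word bank several entries would share an abbreviation, which is why the index maps each key to a list from which the user selects.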

TW301 contributed much to the written Taiwanese movement of the early

1990s. Most publishers of the time used this application in production. The

command-line user interface of DOS, however, was soon eclipsed by the graphical

user interface of Windows.

HOTSYS was developed in 1994 by So͘ Chi-bêng. It runs in the Windows

environment and requires Microsoft Word. Although its technical aspects are similar

to TW301, its input method and word bank are independent of the operating system’s

built-in bank. (Unlike TW301 this approach sought to prevent the casual user from modifying the word bank at will, and thus can be regarded as an attempt at corpus planning.) Selections are of two kinds: selecting a Han character or POJ syllable from homonymous choices, and choosing a suitable form from a list of non-standardized variants, which in hàn-lô may be any combination of Han characters or

POJ. The most recently chosen alternatives are ranked near the top of the selection menu. Furthermore it provides for several “macros” to convert between tone diacritics and tone numbers, thus facilitating file exchange.
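A conversion of the kind performed by such macros, from tone numbers to tone diacritics, might look as follows in Python. This is a sketch under stated assumptions, not HOTSYS's implementation: it relies on Unicode combining marks, and the vowel-priority rule for placing the mark (`PLACEMENT_ORDER`) is a simplification introduced here. Conventions for diphthongs such as oa vary between sources, so the output will not match every house style.

```python
import re
import unicodedata

# Combining marks for the POJ tones that carry a diacritic (tones 1 and 4 are unmarked).
TONE_MARKS = {
    "2": "\u0301",  # acute
    "3": "\u0300",  # grave
    "5": "\u0302",  # circumflex
    "7": "\u0304",  # macron
    "8": "\u030d",  # vertical line above
}

# Simplified placement rule (an assumption): mark the first of these letters
# that occurs in the syllable; n and m cover syllabic nasals.
PLACEMENT_ORDER = "aoeuinm"

SYLLABLE = re.compile(r"([A-Za-z]+)([1-8])")

def mark_syllable(match):
    """Replace one 'letters + tone number' syllable with its diacritic form."""
    letters, tone = match.group(1), match.group(2)
    mark = TONE_MARKS.get(tone)
    if mark is None:
        return letters  # unmarked tone: just drop the numeral
    lowered = letters.lower()
    for target in PLACEMENT_ORDER:
        pos = lowered.find(target)
        if pos != -1:
            marked = letters[:pos + 1] + mark + letters[pos + 1:]
            # NFC yields a precomposed letter where Unicode has one.
            return unicodedata.normalize("NFC", marked)
    return letters  # nothing to attach the mark to; leave unchanged

def numbers_to_diacritics(text):
    """Convert every numbered syllable in a running text."""
    return SYLLABLE.sub(mark_syllable, text)

print(numbers_to_diacritics("han3-lo5"))
print(numbers_to_diacritics("peh8-oe7-ji7"))
```

The reverse direction (diacritics to numbers) can be built on the same table by decomposing the text with NFD and mapping each combining mark back to its numeral.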

Neither TW301 nor HOTSYS is compatible with the Web’s HTML documents, however. For this reason another application, Taiwanese Package (TP), was designed and released in 2001 by Lâu Kit-gk. In addition to providing Web-compatible fonts, its input method differs from previous packages in that letters with diacritics are entered like their Western European counterparts: the diacritic first, followed by the letter proper; the combined glyph is then displayed. TP does not handle Han characters (and thus lacks word banks), though plans are underway to add that functionality (2002, personal communication). Importantly it is not bound to one specific application and can work with software such as Excel, FrontPage, and

Outlook. Most significantly it has allowed the expansion of written Taiwanese into the realm of the Internet (Iûⁿ and Lâu 2002).
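The diacritic-first entry order described for TP corresponds to what is commonly called a dead-key scheme. The Python sketch below illustrates the general technique only: TP's actual key assignments and encoding are not given in the text, so the keys in `DEAD_KEYS` are hypothetical, and Unicode combining marks are assumed for the output.

```python
import unicodedata

# Hypothetical dead keys mapped to Unicode combining marks.
DEAD_KEYS = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    "=": "\u0304",  # macron
    "|": "\u030d",  # vertical line above
}

def compose(keystrokes):
    """Attach each pending diacritic to the letter typed immediately after it."""
    output = []
    pending = None
    for key in keystrokes:
        if pending is None and key in DEAD_KEYS:
            pending = DEAD_KEYS[key]          # remember the mark, display nothing yet
        elif pending is not None and key.isalpha():
            output.append(unicodedata.normalize("NFC", key + pending))
            pending = None
        else:
            # Ordinary key, or a non-letter after a dead key (the pending mark is dropped).
            pending = None
            output.append(key)
    return "".join(output)

print(compose("h`an-l^o"))
```

Because the combined glyph is produced from a base letter plus a mark, letters without a precomposed code point (such as e with a vertical line above) remain as two-character sequences and depend on the font to render them well.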

3.1.2 Macintosh platform

Although most computer users in Taiwan use the DOS/Windows operating systems, Taiwan’s publishing industry is dominated by Apple’s Macintosh platform.

Thus often the process of converting Taiwanese manuscripts from PC to Macintosh formats requires time-consuming re-editing. Not until 2001 was a POJ Macintosh font available. In 2002 Jason Cox and Tân It-kùi developed input methods for both

Han characters and POJ (Fig. 1). The POJ component makes use of Unicode-compliant

fonts to construct some of the tone-marked letters. These fonts are not yet

widely available in the publishing industry and only a subset of them gives

aesthetically pleasing results.

3.1.3 Linux platform: TeX

The Linux operating system has the advantage of being open source. TeX

is a typesetting system that has also been used under DOS/Windows to produce

Taiwanese documents. TeX does not provide any input method. However, the ability

to freely combine diacritics and letters allows one to select from a large number of

readily available fonts. Figure 2 is an example by Phoaⁿ Kho-gôan using TaiTeX, a

version based on LaTeX and used in concert with Perl scripts, CJK macros (for

annotation of Han characters), and TIPA macros (for IPA symbols and POJ diacritics).

As of this writing TaiTeX does not have any input method, though plans are underway

(2002, personal communication).

3.2 Machine translation

In 1997 Lîm Chhoan-kiat experimented with a Mandarin-Taiwanese translator

using Robert Cheng’s Taiwanese-Mandarin lexical file. It included a voice module.

Apparently this system is not rule-based, as evidenced by inconsistent results. The reported

recall rate is 48.12% and the precision rate 42.9% (Lîm 1997). Approaches to evaluation

vary, but there is clearly much room for improvement.
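For readers unfamiliar with these measures, the Python sketch below shows one common way of computing precision and recall, by comparing the words of a system translation with those of a reference translation. How Lîm (1997) defined and measured the two rates is not stated here, so the bag-of-words comparison and the function names are assumptions made for illustration.

```python
from collections import Counter

def precision_recall(system_words, reference_words):
    """Bag-of-words precision and recall of a system output against a reference."""
    system, reference = Counter(system_words), Counter(reference_words)
    overlap = sum((system & reference).values())   # words found in both, with multiplicity
    precision = overlap / sum(system.values()) if system else 0.0
    recall = overlap / sum(reference.values()) if reference else 0.0
    return precision, recall

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Placeholder tokens standing in for translated words.
print(precision_recall(["a", "b", "c"], ["a", "b", "d", "e"]))
print(round(f1(0.429, 0.4812), 3))
```

Under these standard definitions, the two reported rates would combine to a harmonic mean of roughly 45%.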

In addition, Iûⁿ Ún-giân in 2000 implemented a web-based Taiwanese-

Mandarin dictionary, also drawing its lexical data from Cheng’s file. It uses a

database for search via a web interface. Although simple in design, it has proven

useful.

3.3 Others

The linguist Robert Cheng has pioneered the development of research tools for

Taiwanese data processing, including databases in the areas of comparative lexicography, low-level Taiwanese-Mandarin machine translation, and inter-script transcription. On this basis various software packages have been developed for research and commercial purposes, including the development of a hand-held electronic dictionary.

The Taiwanese-Mandarin Language Aid Project10 (TMLAP), developed by

Robert Cheng and Roderick Gammon in the late 1990s, features a Taiwanese segmentation system, Han character-romanization transcription system, bidirectional

Taiwanese-Mandarin lexical translation, part-of-speech tagging, and word frequency analysis (Li 2000). In addition, Tân Sìn-hông has worked on a text-to-speech synthesis system and plans to establish a database of Taiwanese speech samples from 100 individuals (2002, personal communication).
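As an illustration of what a segmentation system must do for unspaced Han-character text, the Python sketch below implements forward maximum matching, a textbook baseline. It is not a description of TMLAP's method, which is not detailed here; the lexicon entries and the `max_len` parameter are hypothetical.

```python
def segment(text, lexicon, max_len=4):
    """Greedy longest-match segmentation; unknown characters become one-character tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a lexicon word is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical lexicon for illustration.
LEXICON = {"台灣", "台灣人", "文學"}
print(segment("台灣人文學", LEXICON))
```

Greedy matching prefers the longer entry (台灣人 over 台灣) and can therefore mis-segment ambiguous strings, which is one reason frequency data from a corpus is useful for refining such systems.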

3.4 Unicode inclusion of POJ-unique characters

Unicode has as its goal the provision of a standard encoding scheme to facilitate the efficient representation of multilingual data. The third edition (2000) includes 49,194 characters and later editions continue to see an increase in this number. Unicode 1.1 and later conform to the International Organization for

Standardization’s ISO 10646.

As early as 1996 Tè Khái-sū applied for Unicode inclusion of several POJ-unique characters (e.g. ȯ, , , , ȱ). The ISO turned to Taiwan’s Institute for

Information Industry for consultation, which in turn consulted the Mandarin Promotion

Council (MPC) under the Ministry of Education. At that time members of the MPC favored a different phonetic scheme. The response given was that POJ was no longer used in Taiwan. The proposal was dropped (Iûⁿ 2002; Tiuⁿ 2002).

4. Future directions: a preliminary agenda for Taiwanese applied computational

linguistics

Here we suggest some concrete steps that may be taken to develop the field of

Taiwanese applied linguistics via computational research. Given the fair amount of

language data accumulated in the past decade or so, systematic corpus analysis may

contribute to an understanding of the way the written language has been used in

different contexts, and facilitate the definition and refinement of those practices, as

well as contribute to language pedagogy.

Robert Cheng has also outlined a program with at least the following

components: machine translation (Taiwanese/Mandarin, Taiwanese/Hakka,

Taiwanese/Japanese, etc.), speech-to-text conversion, spelling checker. The proposal

here is more limited and short-term.

Computational linguistics is an interdisciplinary field involving linguistics and

computer science.

Currently available language data include the following:

1. Taiwanese Net: A POJ e-mail list operating since 1991. For technical and

habitual reasons, most participants do not follow the orthographic convention,

particularly where tone diacritics are concerned. These data will require

processing.

2. Published works: Many electronic files are available from a decade of

publishing.

3. POJ Literary Data Compilation Project: Conducted by Lī Heng-chhiong for

the National Taiwan Literature Museum (in planning), this ambitious project

focuses on collecting rare or hitherto unknown literary works in the language.

Conversion into electronic formats is expected. As most of the works

unearthed thus far were published before 1960, they may well constitute a

corpus complementing the sources mentioned.

These sources altogether are expected to make up approximately 40 megabytes of textual data, an amount sufficient to establish a balanced corpus.

Some workable tasks are as follows:

• Improving segmentation techniques, perhaps using TMLAP as a basis;

• Designing a program to pre-process non-standard POJ writing by, for

example, inserting tone marks and hyphens between syllables;

• Script interpreter to convert POJ running texts to hàn-lô, and vice versa

(may be used in publishing and pre-processing);

• Concordance with applications in language learning;

• Lexicography: Using word frequency data to select words to include

in a dictionary, with contextualized examples from the corpus.
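Two of the tasks listed above, word frequency counts and a concordance, can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the text is fully romanized, words are separated by spaces, and syllables within a word are joined by hyphens. The sample sentence and all function names are introduced here and do not come from any of the corpora mentioned.

```python
import string
import unicodedata
from collections import Counter

# Punctuation to trim from token edges; the hyphen is kept because it joins syllables.
STRIP = string.punctuation.replace("-", "") + "“”‘’"

def tokenize(text):
    """Split on whitespace, trim punctuation, lowercase; hyphenated POJ words stay whole."""
    text = unicodedata.normalize("NFC", text)
    tokens = (raw.strip(STRIP).lower() for raw in text.split())
    return [t for t in tokens if t]

def word_frequencies(tokens):
    """Count how often each word form occurs."""
    return Counter(tokens)

def concordance(tokens, keyword, width=2):
    """Keyword-in-context rows: up to `width` tokens on either side of each hit."""
    rows = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows

# Illustrative sample text.
sample = "Góa sī Tâi-oân-lâng. Góa ài Tâi-gí."
tokens = tokenize(sample)
print(word_frequencies(tokens).most_common(3))
for row in concordance(tokens, tokens[0]):
    print(row)
```

For hàn-lô texts a segmentation step would be needed before counting, since Han-character stretches carry no word boundaries.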

Inevitably much remains to be done. Progress in Mandarin computational

linguistics is likely to be valuable to this research.

References

Committee on History of the Presbyterian Church of Formosa, ed. (2000). A

Centennial History of the Presbyterian Church of Formosa 1865-1965. 3rd ed.

Tai-nan, Taiwan: Presbyterian Church of Formosa Centenary Publication

Committee.

G, Siú-Lé (1995). Nan-kuan lyrics as preserved in early Southern Min drama texts.

In Anthology of Min/Taiwan Dialects Research. S.-L. Gou5 (ed.). Taipei: Nan-

t’ian.

Guide to dialect barred in Taiwan: dictionary tried to render local Chinese sounds.

New York Times. 15 September 1974. Sec. 1, 15.

Heylen, Ann (2001). Missionary linguistics on Taiwan: romanizing Taiwanese:

codification and standardization of dictionaries in Southern Min (1837-1923). In

Authentic Chinese Christianity: Preludes to Its Development (Nineteenth and

Twentieth Centuries). W. Ku & K. de Ridder (eds.), 135-174. Leuven, Belgium:

Leuven University Press.

Huang, Shuan-fan. (1995). Language, Society and Ethnicity. 2nd ed. Taipei: Crane.

Iûⁿ, Ún-giân (2002). Taiwanese symbols in competition: POJ versus TLPA as a case

study. In Proceedings of the 2002 Conference on the Teaching and Research of

Taiwanese Romanization. (no ed.), E3. Tai-tung, Taiwan: National Taitung

Teachers College.

Iûⁿ, Ún-giân; and Lâu, Kia̍t-ga̍k (2002). Study of Pe̍h-ōe-jī computer word-processing.

In Proceedings of the Fourth Annual Taiwanese Languages and Teaching

Conference. Sun Yat-sen University Department of Chinese (ed.), 341-349.

Kaohsiung: National Sun Yat-sen University.

Iûⁿ, Ún-giân; and Tiuⁿ, Ha̍k-khiam (1999). Review and analysis of Taiwanese Holo's

non-Han-character phonetic symbols. In Proceedings of the First Annual

Taiwanese Mother Tongue Cultural Revival and Reconstruction Conference. Tai-

nan City Cultural Foundation (ed.), 62-76. Tai-nan, Taiwan: City

Cultural Foundation.

Klöter, Hanning (2002). The history of Peh-oe-ji. In Proceedings of the 2002

Conference on the Teaching and Research of Taiwanese Romanization. (no ed.).

Tai-tung, Taiwan: National Taitung Teachers College.

Lai, Tse-han; Myers, Ramon H.; and Wou, Wei (1991). A Tragic Beginning: the

Taiwan Uprising of February 28, 1947. Stanford: Stanford University Press.

Language (2001). Republic of China Yearbook—Taiwan 2001. Taipei: Government

Information Office.

Li, Chin-an (2000). Lexical Change and Variation in Taiwanese Literary Texts, 1916-

1988 -- A Computer-assisted Corpus Analysis. Ph.D. thesis, University of

Hawai‘i.

Lîm, Chhoan-kiat (1997). The Study of a Mandarin-Taiwanese Machine Translation

System. MS thesis, National Taiwan University.

Nilsen, Kenneth (1997). Irish in nineteenth century New York. In The Multilingual

Apple: Languages in New York City. O. García and J. Fishman (eds.), 53-69.

Berlin: Mouton de Gruyter.

N̂g, Ka-hūi (2000). Research on Taiwanese [Language] Literature in Peh-oe-ji

Sources. MA thesis, National Tainan Teachers College, Graduate School of

Indigenous Culture.

Presbyterian Church in Taiwan (1975). Our appeal concerning the Bible, the church,

and the nation. 20 April 2002.

.

Rohsenow, John (2001). The present status of digraphia in China. IJSL 150, 125-140.

Tiuⁿ, Hk-khiam (1998). Writing in two scripts. Written Language and Literacy 1,

225-247.

Tiuⁿ, Hk-khiam, comp. (2002). Proposal to add Latin characters required by

Latinized Taiwanese Holo language to ISO/IEC 10646. 10 August 2002.

.

Tiuⁿ, Jū-hông (2001). Principles of POJ or the Taiwanese Orthography: An

Introduction to Its Sound-Symbol Correspondences and Related Issues. Taipei:

Crane.


Figure 1. A third-party POJ input method for the Macintosh.

Figure 2. Hàn-lô text with POJ annotation of select Han characters.


Notes

1 As a language label, Holo may be regarded as synonymous with Hoklo, Hokkian, Amoy, and

Southern Min, but as written reflects indigenous pronunciation.

2 Partly due to the practical hurdles of inputting Han characters in general, POJ is more often written

without Han characters in electronic mail than in formal publication.

3 In 1998 Taiwan’s Ministry of Education promulgated a phonetic system based on POJ. In place of

tone-marking diacritics, numbers are used. As noted in the language section of the ROC Yearbook

(2001), use of the Taiwan Language Phonetic Alphabet (TLPA) is not required. More recently a yet

newer system derived from Hanyu Pinyin, known as Tongyong Pinyin, has also gained momentum.

4 In contrast, it is not uncommon to find sections of Hong Kong’s entertainment pages written in the

style of colloquial Cantonese.

5 As a parallel, in the nineteenth century an Irish-American newspaper also preceded its counterparts in

Ireland in carrying a Gaelic column, according to Nilsen (1997:60).

6 At least four other periodicals feature written Taiwanese in part. These are not included in Table 2.

7 Prior to the influx of some two million refugees and Nationalist bureaucrats and soldiers in 1949, the

most significant postwar event of that decade was the 1947 state-sponsored massacre of nearly thirty

thousand Taiwanese civilians (see Lai, Myers, and Wou 1991).

8 Although the secular Taiwanese Cultural Association passed a resolution supporting the use of POJ in

1922, little evidence exists of concrete activities in that direction (Huang 1995: 92).

9 The standard national keyboard in Taiwan is configured for Mandarin via the official Mandarin

Phonetic Symbols.

10 Previously known as PC Taiwanese-Mandarin Database (PCTMD).
