A Survey of Media and Data Processing Development for Written Taiwanese
Iûⁿ Ún-giân Henry H. Tân-Tēⁿ Lecturer of Dahan Inst. Tech. Montclair State University [email protected] [email protected]
Abstract
Many proposals have been advanced for writing Taiwanese; none has yet been
designated official. Of all the proposals, Pe̍h-ōe-jī (POJ) has the longest history and
the largest corpus of didactic materials, dictionaries, and literary works. This paper
surveys developments in Taiwanese media and data processing as they pertain to POJ.
In the former category we include the print media of books, periodicals, and
newspapers, as well as the broadcast media and the Internet. In the latter we discuss
the evolution of Taiwanese textual processing, segmentation, machine translation,
and Unicode support. Finally we propose a preliminary program for applied
computational linguistics in Taiwanese, for the purpose of revitalizing and advancing
written Taiwanese.
Keywords: Pe̍h-ōe-jī (POJ), written Taiwanese, vernacular literature, media, textual
processing, computational linguistics
1. Introduction
Broadly speaking, “Taiwanese” refers to the languages of the Taiwanese
people, including Holo1, Hakka, and the Austronesian languages. Of all the groups
the Holo are the most numerous, accounting for over 70% of the population (Huang
1995). Therefore, as early as the Japanese colonial period a century ago, Holo was
known as Taiwanese, as well. In this article we use the terms interchangeably.
The Holo written language and literature, or Tâi-gú-bûn (TGB), can ultimately
be traced to the Southern Min dramas dated 1566 (Gô͘ 1995), prior to the era of mass
migration from China to Taiwan. At that time Han characters were primarily employed in the classical language, not in service of the written vernacular. The earliest orthography tailored to the language (and related dialects) is undoubtedly Pe̍h-ōe-jī
(POJ, “vernacular writing”), traceable to the 1832 A Dictionary of the Hok-këèn
Dialect of the Chinese Language by the missionary Walter Henry Medhurst (Heylen
2001; Klöter 2002). Since then the romanization scheme has undergone a number of minor changes and is currently stable. It has, moreover, been adapted to provide for the minority Hakka language.
POJ initially served the interests of the Protestant missionaries and local
Taiwanese converts and their descendants, particularly the illiterate and semi-literate.
Until the 1980s much of the POJ literature, therefore, revolved around Christian themes followed distantly by education of a more general nature. Whereas both the
Japanese and Chinese Nationalist regimes on the island had a history of suppressing romanization, the post-martial law era (1987-) of democratic reforms saw the emergence of small, competing organizations seeking to promote written Taiwanese of one form or another. In this lively if sometimes fractious environment, POJ found renewed usage in a newly secularized context even as its functions in the religious domain appeared to continue to decline. Concomitant with this development was the switch from monoscript to mixed scripts within running texts; that is, among those familiar with POJ the mainstream preference today is for mixing Han (Chinese) characters and romanization (a practice known as hàn-lô), particularly in formal publication2 (Tiuⁿ 1998: 230).
Many new phonetic systems and orthographies – at least 64 in one study – emerged during this period (Iûⁿ and Tiuⁿ 1999). By virtue of its long history and recent revival, POJ is presently the system with the most numerous and varied publications,
including didactic materials, dictionaries, and literary works. The status derived
thereof is nevertheless insecure, as a number of alternative systems have sought
legitimacy via endorsement by the political and academic establishments, at the same
time engaging in publishing efforts of their own3.
In the sections to follow we first survey the use of POJ in various types of
media for the past century and beyond, in its capacity as an orthography or phonetic
scheme for Holo. Other orthographic or annotative choices – for example, Han
characters – are not considered. We then summarize recent efforts in the area of POJ
Taiwanese computing, including text manipulation tools, pedagogic tools, and
research in applied computational linguistics.
2. Development of written Taiwanese in the media
This section is divided into three portions, namely the print media (books,
periodicals, newspapers), broadcast media, and the newly emerged Internet.
2.1 Print media: books, periodicals, newspapers
We have collected a substantial bibliography of publications utilizing POJ as
an orthography or phonetic annotation system for Han characters. We have
additionally consulted an as-yet unpublished bibliography by Lī Heng-chhiong.
2.1.1 Books
Printed books are by far the most common sources of POJ. In Table 1 several
categories employing POJ are identified. In principle literary works of an overtly
religious nature (e.g. The Psalms) are classified as “religious works.”
Table 1. POJ Books By Year and Type

Type          N/A Pre-1900  1900~09  1910~19  1920~29  1930~39  1940~49  1950~59  1960~69  1970~79  1980~89  1990~99  2000~02  Total (%)
Religious      40       51       11       27       38       56       30      119       80       11       11       26        3   503 (60)
Didactic        2        4        -        -        9        5        2       20        9        3        9       58       15   136 (16)
Literary        -        -        -        1        4        -        1        9        4        -        1       71       17   108 (13)
Reference       -       15        1        1        5        3        2        -        1        4        2       13        2     49 (6)
Specialized     -       11        2        3        5        2        2        -        3        1        4        8        3     44 (5)
Total          42       81       14       32       61       66       37      148       97       19       27      176       40  840 (100)
(%)           (5)     (10)      (2)      (4)      (7)      (8)      (4)     (18)     (12)      (2)      (3)     (21)      (5)
1. Religious works: These account for the single largest category, over 500 titles
(60%) in our catalog, all Christian. Publishers no longer reprint most of them.
2. Language teaching materials: Over 130 titles are known, including some
bilingual textbooks in English or Japanese.
3. Literary works: Including fiction, prose, poetry, drama, translated works, and
folk literature. At least 100 titles are known, the earliest indigenous fictional
work being Mother’s Tears (1925) by Lōa Jîn-seng (N̂g 2000).
4. References: Over 40 volumes, including dictionaries featuring Mandarin,
English, Japanese, and Spanish. Other references concern geography,
proverbs, and botany.
5. Specialized texts: Covering subjects other than language pedagogy, such as
mathematics, astronomy, medicine, botany, and social commentary. In
addition, in recent years a few academic papers and monographs in Taiwanese
have appeared in spite of the general reluctance of institutions to accept them.
2.1.2 Newspapers
As of this writing no newspaper or similar publication, either religious or
secular in orientation, exists in Taiwanese. Historically Taiwan Church News
(originally Taiwan Prefectural City Church News) published in the Holo language
using POJ. Founded in 1885, it was the first newspaper of any language to publish on
the island. As the longest-running paper as well, it is uniquely positioned in
having documented Taiwanese society through a century of Manchu, Japanese,
and Chinese rule, and in having done so in a major language of the masses. According
to the Committee on History of the Presbyterian Church of Formosa, in 1928 and
1932 it merged with three regional church newspapers also publishing in POJ (2000:
190). In 1942 the Japanese colonial authorities forced it to cease publishing.
Resuming publication after 1945, it eventually succumbed to pressure from the
Chinese Nationalist regime in 1970 and abandoned the traditional language in favor of
the official Mandarin (see section 2.1.4). This policy was partly reversed in the 1990s
with the inclusion of a “special column” consciously devoted to the mother tongue,
and then mostly in hàn-lô. The space afforded it was quite limited and increasingly so;
eventually this column also came to include Mandarin contents.
Other historical newspapers include one regional Presbyterian weekly
publishing in the early 1940s and a Catholic newspaper in the mid-1930s. We believe
additional sources have yet to be unearthed.
In the 1990s several mass-circulation (Mandarin) dailies featured occasional
columns on language-related topics by well-known teachers or activists. Neither the
front-page news nor the popular entertainment sections, however, have offered
Taiwanese alternatives, even as headlines incorporating Holo catchwords have
become more fashionable4. Indeed the literary pages have been devoted to works in
the standard Mandarin language (the rare poem notwithstanding).
As is the case with periodicals (section 2.1.3), immigrant newspapers in the
United States have been more willing to experiment with new forms. The Pacific
Times and particularly Taiwan Tribune have featured original and reprinted articles in
Taiwanese. The historically more radical Taiwan Tribune pioneered Taiwanese-language editorials in the 1990s; its literature section often carried poetry and essays in the language. Unlike the Taiwan-based papers, both of these arrange the text horizontally, a format aesthetically compatible with hàn-lô.
In 2002 the trilingual Taiwanese Children’s Newspaper began soliciting subscribers. Apparently responding to the market potential partly driven by the
English and local-language components of the Nine-year Comprehensive Curriculum for Elementary and Junior High Education, it targeted elementary schoolchildren and their parents by offering content in Mandarin, Taiwanese, and English. Even though the Taiwanese portion accounts for only a quarter of the space, the concept was unprecedented.
2.1.3 Periodicals
As with newspapers, all known POJ periodicals (primarily monthlies) prior to the 1970s were Church-published. Of these, the most recent new periodical appeared in 1960; the last to cease publication did so in 1969. The longest-surviving periodical ran for fifteen years (O̍ah-miā ê bí-niû, 1954-68).
The first known periodical of a secular nature was the Taiwanese Language and Literature Bi-monthly (1977-1979). As the domestic political milieu of the time prohibited such a publication, it was produced and published in the United States (and distributed outside Taiwan). The Bi-monthly is also believed to be the first publication to have systematically adopted hàn-lô in writing Taiwanese5.
Not until 1989 did a new periodical appear on the island. Known as Hong-Hiòng
“Wind Direction”, this purely POJ bi-monthly published until 1992. According to its publisher and editor, the intended readership was elderly Christians literate in POJ
but not Han characters. As it turned out, a disproportionate percentage of readers did
not fit that profile. The only other Christian-oriented periodical to appear since then
was Taiwanese Lily Forum (1995-6), whose publisher opted for hàn-lô instead.
The Taiwanese language periodicals of the 1990s sought to rejuvenate and
advance the written language by cultivating a more “mainstream” readership. Table 2
lists their publication dates and orthographic choices. Seven of the eight periodicals
targeted the general readership and thus generally did not feature religious themes.
As of this writing, five continue to publish. Of particular note is the fact that only one
periodical, Tâi-ôan-jī, favors monoscript after the pre-1970s Church tradition, yet
more than 90% of its staff are not members of the Church.
Table 2: POJ Periodicals Appearing in the 1990s6

Year            Periodical               Publisher                                         Location   Orthography  Publishing Cycle
1991.7~         Taiwanese Writing Forum  Chhong-Bí Memorial Fund                           USA        HL           M
1992.6~1994.6   Taiwanese Wind           Association to Promote Taiwanese Writing          Taipei     HL           B (12 issues)
1992.7~1994.6   The Taiwanese Student    Students Taiwanese Promotion Association (STAPA)  Taiwan     HL           S (4 trial issues), q10 (issues 1-18), Q (issues 19-22); campus chapters edited in turn
1995.7~1996.11  Taiwanese Lily Forum     Chhong-Bí Memorial Fund                           USA        HL           B (9 issues)
1996.10~        Tâi-bûn Bóng Pò          TBBP Publishing                                   Taipei     HL           M
1999.1~         Liân-chiau-hoe Magazine  LCH Publishing                                    Taichung   HL           Q
1999.11~        TGB Communiqué           STAPA                                             Taipei     HL           M
2000.5~         Tâi-ôan-jī               Ko-hiông City Taiwanese Romanization Association  Kaohsiung  CL           B

Legend: HL = hàn-lô, CL = chôan-lô (fully romanized). q10 = every ten days, S = semi-monthly,
M = monthly, B = bimonthly, Q = quarterly
2.1.4 POJ: Toward secularization and mixed scripts
As shown in Table 1, the publishing output per decade increased steadily between 1900 and 1939, dipped to a low in the 1940s, rebounded to a peak in the 1950s, and then declined until the 1980s. The output in the 1970s was nearly as low as that of the 1900s. The 1950s and ’60s accounted for 30 percent of the POJ titles catalogued, more than the combined output of the preceding half century; religious works accounted for some 80% of the new titles in those two decades, far more than their overall share across eras (60%). Indeed, religious works outstripped all other categories until the 1990s. Since then output in all categories has increased significantly. Whereas literary works were meager in the pre-1990 decades, they account for 40% of POJ books published since. The share of language teaching materials also more than doubled, from a previous peak of 15% (in the 1950s) to 34%. The inclusion of local languages as a subject in the Nine-year Comprehensive Curriculum (in force after September 1, 2001) no doubt spurred publication in this area.
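The shares quoted in the paragraph above can be re-derived from the counts in Table 1. The following Python sketch is only an illustration: the counts are transcribed from Table 1 (with "-" read as zero), while the names `PERIODS`, `TABLE_1`, `titles`, and `share` are hypothetical and introduced here for convenience.

```python
# Titles per period, transcribed from Table 1 ("-" entries are treated as 0).
PERIODS = ["N/A", "pre-1900", "1900s", "1910s", "1920s", "1930s", "1940s",
           "1950s", "1960s", "1970s", "1980s", "1990s", "2000-02"]
TABLE_1 = {
    "Religious":   [40, 51, 11, 27, 38, 56, 30, 119, 80, 11, 11, 26, 3],
    "Didactic":    [2, 4, 0, 0, 9, 5, 2, 20, 9, 3, 9, 58, 15],
    "Literary":    [0, 0, 0, 1, 4, 0, 1, 9, 4, 0, 1, 71, 17],
    "Reference":   [0, 15, 1, 1, 5, 3, 2, 0, 1, 4, 2, 13, 2],
    "Specialized": [0, 11, 2, 3, 5, 2, 2, 0, 3, 1, 4, 8, 3],
}


def titles(periods, category=None):
    """Sum the titles in the given periods, for one category or for all."""
    columns = [PERIODS.index(p) for p in periods]
    rows = [TABLE_1[category]] if category else list(TABLE_1.values())
    return sum(row[c] for row in rows for c in columns)


def share(part, whole):
    """Percentage of `whole` made up by `part`, rounded to an integer."""
    return round(100 * part / whole)


grand_total = titles(PERIODS)                       # 840
postwar = ["1950s", "1960s"]
recent = ["1990s", "2000-02"]

print(share(titles(postwar), grand_total))                        # 29
print(share(titles(postwar, "Religious"), titles(postwar)))       # 81
print(share(titles(recent, "Literary"), titles(recent)))          # 41
print(share(titles(recent, "Didactic"), titles(recent)))          # 34
```

The rounded results (29, 81, 41, and 34 percent) agree with the approximate figures cited in the text.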
The trend of secularization is therefore one major characteristic of the revitalization of POJ in the 1990s. In advocating the traditional name Pe̍h-ōe-jī over the common “Church Romanization,” Tiuⁿ (2001: 13) notes that nowadays non-Christians not only account for more POJ users than Christians but are more “fundamentalist” in their attitude toward the script. The publication data are consistent with this observation.
The apparent decline in book publication in the 1940s is most likely related to wartime turmoil toward the end of Japanese colonial rule, as well as the political and economic turbulence during the earliest years of Chinese Nationalist rule7. The decline in the 1970s is similarly, at least in part, attributable to
government policy discouraging or prohibiting the cultivation of local languages in
the public sphere, especially where romanization was involved. In reporting on the
banning of a new Taiwanese-English dictionary, The New York Times quoted in 1974
a government official as saying, “We have no objection to the dictionary being used
by foreigners. They could use it in mimeographed form. But we don't want it
published as a book and sold publicly because of the Romanization it contains.
Chinese should not be learning Chinese through Romanization.” In an unprecedented
move the Presbyterian Church in Taiwan (PCT) issued a statement entitled “Our
Appeal,” in which it urges “that the freedom to continue to publish and distribute the
Bible in any language be guaranteed.” Similar official objections to Taiwanese
romanization continued well into the 1980s (Huang 1995: 54) and even the early
1990s. As mentioned, by 1970 the PCT had ceased publishing its organ newspaper in
Taiwanese via POJ and instead turned to Mandarin (via Han characters).
One might also speculate on the impact of two decades of compulsory
Mandarin/Han character education on Taiwanese/POJ proficiency. The fact that for a
century the Church was the only institution with the experience and motivation to
carry out POJ education certainly hampered its spread throughout the larger society8.
Indeed, POJ came to be associated with the Church (and thus the “foreign religion” it
represented) in the minds of the non-Christian majority, to this day. Although
Church-run private schools have existed, they have had to conform to strict
government regulations on crucial aspects of education, including language.
Furthermore, as public education via Mandarin became more prevalent and
expectation of achieving higher education greater, POJ’s utility as a literacy
instrument declined and its status within the Church evolved to that of an institutional
symbol. The 1996 publication of a Han-character transcription of the 1933 Bible was
emblematic. We thus believe it likely that increasingly fewer learners of POJ relied on it to acquire information. This proposed process of Han characters replacing the functions of the traditional POJ is analogous to what Rohsenow (2001: 133) has described for China’s two-year-long Hanyu Pinyin “immersion program”: “The fact is…that after several years most of these students lose their ability to read and write fluently in Hanyu Pinyin alphabet through later disuse and a lack of any advanced materials written in Pinyin that would give them an opportunity to maintain their early skills, again a situation brought about by the Chinese government’s ongoing policy of monographia with Chinese characters.” As it is, the sociolinguistic aspect of the acquisition and practice of POJ within the Church in the early postwar decades requires further investigation.
As Table 2 suggests, recent revitalization of POJ largely took the form of hàn-lô, or mixed scripts. In the era of democratization, segments of the Taiwanese
Nativist Movement, both outside of and within the Church, were seeking to revitalize the mother tongues. In this atmosphere, POJ was seen as an indigenous cultural asset as well as a literacy tool that enables the writer to sidestep the difficulties of
Taiwanese-via-Han characters (Tiuⁿ 1998: 227). The addition of Han characters reflects both sentimental attachment to the script and the practical advantages of cross-language learning of cognates. As such it represents a synthesis of the old
Southern Min Han-character tradition and the later romanization monoscript culture of the Church.
2.2 Dynamic media: the audiovisual
Dynamic media such as television may be regarded as combining written and audiovisual materials. The audiovisual is not directly related to the written language and can be regarded as an auxiliary tool for the text. Some literacy advocates regard the
spoken word as problematic: On the one hand, sound may facilitate reading; on the
other, since most of those fluent in Taiwanese lack literacy, some advocates believe
that audio may adversely affect the willingness of the audience to acquire or utilize
Taiwanese literacy. Literacy, then, is deemed more of a priority than the audiovisual.
The inclusion of Taiwanese texts in these media is almost unheard of. In 2000
and 2001 Formosa Television hosted conferences on Taiwanese and, following the
suggestion of participants, later broadcast conference footage with hàn-lô subtitles.
Even Holo news programs, presumably catering to elderly monolingual speakers,
universally display Mandarin headlines.
Regardless, the broadcast media remain a direction worth cultivating – not
only television programming but also repackaged programming on storage media,
namely VCDs and DVDs.
2.3 New media: the Internet
2.3.1 E-mail groups & e-newsletters
For marginalized languages the Internet presents a new, low-cost channel for
disseminating texts. By far the most relevant technology is electronic mail, through
which a virtual community of like-minded individuals may engage in discourse in
their beloved language(s) outside of the traditional, territorially defined home-
neighborhood-community domains. All of the several currently available lists are
language-oriented, the subscribers being language teachers, learners, writers, activists,
and enthusiasts. The largest such list has about 135 subscribers and the norm is to use
chôan-lô.
Until recently certain diacritics special to POJ glyphs could not be rendered
electronically, and it became common practice to use numerals as substitutes or to discard
tone representation altogether. Since 2001 the development of new fonts has allowed
direct and authentic textual representation, in accord with the standard orthography
(see 3.1.1).
Currently a single e-newsletter is active. It transmits select articles from literary periodicals, in addition to original learning materials and occasional announcements. Unlike the e-mail discussion groups, subscribers are more likely to come from outside of the activist circle. Hàn-lô is the dominant form and its use is consistent with the e-newsletter’s more formal contents compared to e-mail.
2.3.2 World-Wide Web
The advent of the Web since 1989 has significantly altered the lifestyle of computer users in the span of a few years. The amount of on-line information accumulates ever more rapidly.
As with e-mail, the availability of POJ fonts has allowed the authentic reproduction of the standard orthography on the Web without resorting to graphical formats. Consequently, not only has the number of web sites grown rapidly, but newer applications have also emerged, including a Taiwanese-Mandarin dictionary, on-line translation, learning tools, message boards, and POJ Hangman and other word games.
Unicode support remains a goal, however.
3. Development of Taiwanese data processing
With the advent of the information age, the problems of data processing are increasingly important. Advances in Taiwanese data processing relate directly to the modernization of written Taiwanese. In viewing this aspect of development, one needs to bear in mind that the resources for it are but a tiny percentage of those available to Mandarin.
3.1 Input methods and word processing
Input methods and word processing are fundamental to textual data processing.
Insofar as hàn-lô is concerned, input methods translate keystrokes from the
“standard” English keyboard9 to POJ or Han characters on-screen. A word bank maps
combinations of keystrokes to a word string of Han characters, POJ letters, or both.
Because most syllables in POJ running texts carry diacritics, efficiency is an
important design criterion. In Taiwan personal computer users primarily use IBM
PC-compatible machines. Therefore software development has been focused on the
PC platform.
3.1.1 PC platform: TW301, HOTSYS, TP
TW301 was developed in 1990 under the auspices of Robert Cheng (University
of Hawai’i). It operates in the DOS environment in concert with the Eten [Yi-T’ian]
Chinese System. It utilizes modified (Western) fonts and Taiwanese-unique Han
characters, as well as a user-modifiable word bank, to constitute the input method.
This method seeks to enhance the key-in rate by defining multi-syllabic words and
phrases (to sidestep the ambiguity associated with mono-syllabic homonyms), “direct
input” shortcuts (to avoid selecting among high-frequency mono-syllabic homonyms
by assigning them unique key strokes), and abbreviations (consisting of each syllable-
initial letter in a multi-syllabic word).
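The word-bank and abbreviation mechanisms described above can be pictured with a minimal Python sketch. It is not TW301's actual implementation, whose internal data structures are not documented here; the entries in `WORD_BANK`, the unaccented keystroke convention, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical word bank: unaccented keystrokes -> candidate written forms.
WORD_BANK = {
    "tai-gu-bun": ["Tâi-gú-bûn"],
    "han-lo": ["hàn-lô"],
    "tai-oan-ji": ["Tâi-ôan-jī"],
}


def abbreviation(key):
    """Join the initial letter of every syllable, e.g. 'tai-gu-bun' -> 'tgb'."""
    return "".join(syllable[0] for syllable in key.split("-") if syllable)


def build_abbreviation_index(bank):
    """Index multi-syllabic entries by their syllable-initial abbreviation."""
    index = defaultdict(list)
    for key, forms in bank.items():
        if "-" in key:  # only multi-syllabic words receive an abbreviation
            index[abbreviation(key)].extend(forms)
    return index


ABBREVIATIONS = build_abbreviation_index(WORD_BANK)


def lookup(keystrokes):
    """Return candidates for a full key first, then for an abbreviation."""
    if keystrokes in WORD_BANK:
        return WORD_BANK[keystrokes]
    return ABBREVIATIONS.get(keystrokes, [])


print(lookup("han-lo"))  # ['hàn-lô']
print(lookup("tgb"))     # ['Tâi-gú-bûn']
```

Because an abbreviation usually matches far fewer entries than a single toneless syllable does, it reduces the number of candidates the user must choose among.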
TW301 contributed much to the written Taiwanese movement of the early
1990s. Most publishers of the time used this application in production. The
command-line user interface of DOS, however, was soon eclipsed by the graphical
user interface of Windows.
HOTSYS was developed in 1994 by Sȯ Chi-bêng. It runs in the Windows
environment and requires Microsoft Word. Although its technical aspects are similar
to TW301, its input method and word bank are independent of the operating system’s
built-in bank. (Unlike TW301, this approach sought to prevent the casual user from modifying the word bank at will, and thus can be regarded as an attempt at corpus planning.) Selections are of two kinds: selecting a Han character or POJ syllable from homonymous choices, and choosing a suitable form from a list of non-standardized variants, which in hàn-lô may be any combination of Han characters or
POJ. The most recently chosen alternatives are ranked near the top of the selection menu. Furthermore, it provides several “macros” to convert between tone diacritics and tone numbers, thus facilitating file exchange.
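A conversion from tone diacritics to tone numbers can be sketched in Python as follows. This is an illustration under stated assumptions rather than a description of the HOTSYS macros: it assumes Unicode text in which each tone mark can be separated from its base letter by canonical decomposition, and it treats every run of Latin letters as a POJ syllable. The names `TONE_MARKS` and `diacritics_to_numbers` are hypothetical.

```python
import re
import unicodedata

# Combining marks used for POJ tones, mapped to the customary tone numbers.
TONE_MARKS = {
    "\u0301": "2",  # acute accent
    "\u0300": "3",  # grave accent
    "\u0302": "5",  # circumflex
    "\u0304": "7",  # macron
    "\u030d": "8",  # vertical line above
}

# After decomposition a syllable is a run of ASCII letters, combining marks,
# and the superscript n that marks nasalization.
SYLLABLE = re.compile(r"[A-Za-z\u0300-\u036f\u207f]+")


def _syllable_to_number(match):
    tone = None
    kept = []
    for ch in match.group(0):
        if ch in TONE_MARKS and tone is None:
            tone = TONE_MARKS[ch]   # remember the tone, drop the mark
        else:
            kept.append(ch)         # letters and non-tonal marks are kept
    base = "".join(kept)
    if tone is None:
        # Unmarked syllables are tone 4 if they end in a stop, otherwise tone 1.
        stem = base.rstrip("\u207f").lower()
        tone = "4" if stem.endswith(("p", "t", "k", "h")) else "1"
    return base + tone


def diacritics_to_numbers(text):
    """Rewrite tone-marked POJ syllables with trailing tone numbers."""
    decomposed = unicodedata.normalize("NFD", text)
    converted = SYLLABLE.sub(_syllable_to_number, decomposed)
    return unicodedata.normalize("NFC", converted)


print(diacritics_to_numbers("hàn-lô"))      # han3-lo5
print(diacritics_to_numbers("Tâi-ôan-jī"))  # Tai5-oan5-ji7
```

The reverse direction is harder, since it requires the orthographic rules for choosing which vowel carries the mark, and this sketch would also number any non-Taiwanese words in a mixed text.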
Neither TW301 nor HOTSYS is compatible with the Web’s HTML documents, however. For this reason another application, Taiwanese Package (TP), was designed and released in 2001 by Lâu Kit-ga̍k. In addition to providing Web-compatible fonts, its input method differs from previous packages in that letters with diacritics are entered like their Western European counterparts: the diacritic first, followed by the letter proper; the combined glyph is then displayed. TP does not handle Han characters (and thus lacks word banks), though plans are underway to add that functionality (2002, personal communication). Importantly, it is not bound to one specific application and can work with software such as Excel, FrontPage, and
Outlook. Most significantly, it has allowed the expansion of written Taiwanese into the realm of the Internet (Iûⁿ and Lâu 2002).
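The "diacritic first, letter second" style of input is commonly known as dead-key entry. The Python sketch below illustrates the idea only. TP's actual key assignments and font encoding are not given here, so the `DEAD_KEYS` mapping is hypothetical, and the use of Unicode combining characters is an assumption made for the sake of a runnable example.

```python
import unicodedata

# Hypothetical dead-key assignments (not TP's actual layout).
DEAD_KEYS = {
    "'": "\u0301",  # acute accent, tone 2
    "`": "\u0300",  # grave accent, tone 3
    "^": "\u0302",  # circumflex, tone 5
    "=": "\u0304",  # macron, tone 7
    "|": "\u030d",  # vertical line above, tone 8
}


def compose(keystrokes):
    """Turn a keystroke sequence into text, applying dead keys to the next letter."""
    out = []
    pending = None  # dead key still waiting for its base letter
    for key in keystrokes:
        if pending is None:
            if key in DEAD_KEYS:
                pending = key
            else:
                out.append(key)
        elif key.isalpha():
            # Attach the mark to the letter; NFC yields a precomposed form if one exists.
            out.append(unicodedata.normalize("NFC", key + DEAD_KEYS[pending]))
            pending = None
        else:
            # A dead key followed by a non-letter is emitted literally.
            out.append(pending + key)
            pending = None
    if pending is not None:
        out.append(pending)
    return "".join(out)


print(compose("h`an-l^o"))  # hàn-lô
print(compose("j=i"))       # jī
```

One consequence of this design is that every tone-marked letter costs exactly two keystrokes, regardless of which application receives the text.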
3.1.2 Macintosh platform
Although most computer users in Taiwan use the DOS/Windows operating systems, Taiwan’s publishing industry is dominated by Apple’s Macintosh platform.
Thus often the process of converting Taiwanese manuscripts from PC to Macintosh formats requires time-consuming re-editing. Not until 2001 was a POJ Macintosh font available. In 2002 Jason Cox and Tân It-kùi developed input methods for both
Han characters and POJ (Fig. 1). The POJ component makes use of Unicode-compliant
fonts to construct some of the tone-marked letters. These fonts are not yet
widely available in the publishing industry and only a subset of them gives
aesthetically pleasing results.
3.1.3 Linux platform: TeX
The Linux operating system has the advantage of being open source. TeX
is a typesetting system that has also been used under DOS/Windows to produce
Taiwanese documents. TeX does not provide any input method. However, the ability
to freely combine diacritics and letters allows one to select from a large number of
readily available fonts. Figure 2 is an example by Phoaⁿ Kho-gôan using TaiTeX, a
version based on LaTeX in concert with Perl scripts, CJK macros (for
annotation of Han characters) and TIPA macros (for IPA symbols and POJ diacritics).
As of this writing TaiTeX does not have any input method, though plans are underway
(2002, personal communication).
3.2 Machine translation
In 1997 Lîm Chhoan-kiat experimented with a Mandarin-Taiwanese translator
using Robert Cheng’s Taiwanese-Mandarin lexical file. It included a voice module.
Apparently this system is not rule-based, as evidenced by inconsistent results. Recall
rate is 48.12%, precision rate 42.9% (Lîm 1997). Various approaches to evaluation
exist. Again, there is much room for improvement.
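For readers unfamiliar with the two figures, one common word-level definition of precision and recall is sketched below in Python. This is an assumption for illustration only: the evaluation procedure in Lîm (1997) may differ, and the function name and sample tokens are hypothetical.

```python
from collections import Counter


def precision_recall(system_words, reference_words):
    """Word-level precision and recall of a system output against a reference.

    precision = matched words / words produced by the system
    recall    = matched words / words in the reference
    """
    system_counts = Counter(system_words)
    reference_counts = Counter(reference_words)
    # Multiset intersection: each reference word can be matched only once.
    matched = sum((system_counts & reference_counts).values())
    precision = matched / len(system_words) if system_words else 0.0
    recall = matched / len(reference_words) if reference_words else 0.0
    return precision, recall


# Hypothetical token lists standing in for translated and reference words.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
print(p)  # 0.5
```

Under this definition a system that produces more words than the reference contains can score higher on recall than on precision, which is the pattern in the figures reported above.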
In addition, Iûⁿ Ún-giân in 2000 implemented a web-based Taiwanese-
Mandarin dictionary, also drawing its lexical data from Cheng’s file. It uses a
database for search via a web interface. Although simple in design, it has proven
useful.
3.3 Others
The linguist Robert Cheng has pioneered the development of research tools for
Taiwanese data processing, including databases in the areas of comparative lexicography, low-level Taiwanese-Mandarin machine translation, and inter-script transcription. On this basis various software packages have been developed for research and commercial purposes, including the development of a hand-held electronic dictionary.
The Taiwanese-Mandarin Language Aid Project10 (TMLAP), developed by
Robert Cheng and Roderick Gammon in the late 1990s, features a Taiwanese segmentation system, Han character-romanization transcription system, bidirectional
Taiwanese-Mandarin lexical translation, part-of-speech tagging, and word frequency analysis (Li 2000). In addition, Tân Sìn-hông has worked on a text-to-speech synthesis system and plans to establish a database of Taiwanese speech samples from 100 individuals (2002, personal communication).
3.4 Unicode inclusion of POJ-unique characters
Unicode has as its goal the provision of a standard encoding scheme to facilitate the efficient representation of multilingual data. The third edition (2000) includes 49,194 characters and later editions continue to see an increase in this number. Unicode 1.1 and later conform to the International Organization for
Standardization’s ISO 10646.
As early as 1996 Tè Khái-sū applied for Unicode inclusion of several POJ-unique characters (e.g. ȯ, ȱ). The ISO turned to Taiwan’s Institute for
Information Industry for consultation, which in turn consulted the Mandarin Promotion
Council (MPC) under the Ministry of Education. At that time members of the MPC favored a different phonetic scheme. The response given was that POJ was no longer used in Taiwan. The proposal was dropped (Iûⁿ 2002; Tiuⁿ 2002).
4. Future directions: a preliminary agenda for Taiwanese applied computational
linguistics
Here we suggest some concrete steps that may be taken to develop the field of
Taiwanese applied linguistics via computational research. Given the fair amount of
language data accumulated in the past decade or so, systematic corpus analysis may
contribute to an understanding of the way the written language has been used in
different contexts, and facilitate the definition and refinement of those practices, as
well as contribute to language pedagogy.
Robert Cheng has also outlined a program with at least the following
components: machine translation (Taiwanese/Mandarin, Taiwanese/Hakka,
Taiwanese/Japanese, etc.), speech-to-text conversion, spelling checker. The proposal
here is more limited and short-term.
Computational linguistics is an interdisciplinary field involving linguistics and
computer science.
Currently available language data include the following:
1. Taiwanese Net: A POJ e-mail list operating since 1991. For technical and
habitual reasons, most participants do not follow the orthographic convention,
particularly where tone diacritics are concerned. These data will require
processing.
2. Published works: Many electronic files are available from a decade of
publishing.
3. POJ Literary Data Compilation Project: Conducted by Lī Heng-chhiong for
the National Taiwan Literature Museum (in planning), this ambitious project
focuses on collecting rare or hitherto unknown literary works in the language.
Conversion into electronic formats is expected. As most of the works
unearthed thus far were published before 1960, they may well constitute a
corpus complementing the sources mentioned.
These sources altogether are expected to make up approximately 40 megabytes of textual data, an amount sufficient to establish a balanced corpus.
Some workable tasks are as follows:
• Improving segmentation techniques, perhaps using TMLAP as a basis;
• Designing a program to pre-process non-standard POJ writing by, for
example, inserting tone marks and hyphens between syllables;
• Script interpreter to convert POJ running texts to hàn-lô, and vice versa
(may be used in publishing and pre-processing);
• Concordance with applications in language learning;
• Lexicography: Using word frequency data to select words to include
in a dictionary, with contextualized examples from the corpus.
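Two of the tasks listed above, word frequency counts and a concordance, can be sketched in a few lines of Python. The sketch assumes fully romanized (chôan-lô) text in which whitespace separates words and hyphens join the syllables of a word; the sample sentence is made up for illustration, and all function names are hypothetical.

```python
import unicodedata
from collections import Counter

PUNCTUATION = ".,;:!?\"'()[]"


def tokenize(text):
    """Split romanized text on whitespace and strip surrounding punctuation."""
    text = unicodedata.normalize("NFC", text)
    tokens = (raw.strip(PUNCTUATION) for raw in text.split())
    return [token for token in tokens if token]


def word_frequencies(tokens):
    """Count words case-insensitively."""
    return Counter(token.lower() for token in tokens)


def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence."""
    keyword = unicodedata.normalize("NFC", keyword).lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Made-up sample text, not drawn from any of the corpora listed above.
SAMPLE = "Góa ài tha̍k Tâi-gí. Góa ài siá Tâi-gí."
tokens = tokenize(SAMPLE)
print(word_frequencies(tokens).most_common(3))
for left, word, right in concordance(tokens, "Tâi-gí"):
    print(f"{left:>12} [{word}] {right}")
```

For hàn-lô texts, where Han characters run together without spaces, a segmentation step such as the one in the first task would have to precede this tokenization.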
Inevitably much remains to be done. Progress in Mandarin computational
linguistics is likely to be valuable to this research.
References
Committee on History of the Presbyterian Church of Formosa, ed. (2000). A
Centennial History of the Presbyterian Church of Formosa 1865-1965. 3rd ed.
Tai-nan, Taiwan: Presbyterian Church of Formosa Centenary Publication
Committee.
Gô͘, Siú-Lé (1995). Nan-kuan lyrics as preserved in early Southern Min drama texts.
In Anthology of Min/Taiwan Dialects Research. S.-L. Gô͘ (ed.). Taipei: Nan-t’ian.
Guide to dialect barred in Taiwan: dictionary tried to render local Chinese sounds.
New York Times. 15 September 1974. Sec. 1, 15.
Heylen, Ann (2001). Missionary linguistics on Taiwan: romanizing Taiwanese:
codification and standardization of dictionaries in Southern Min (1837-1923). In
Authentic Chinese Christianity: Preludes to Its Development (Nineteenth and
Twentieth Centuries). W. Ku & K. de Ridder (eds.), 135-174. Leuven, Belgium:
Leuven University Press.
Huang, Shuan-fan. (1995). Language, Society and Ethnicity. 2nd ed. Taipei: Crane.
Iûⁿ, Ún-giân (2002). Taiwanese symbols in competition: POJ versus TLPA as a case
study. In Proceedings of the 2002 Conference on the Teaching and Research of
Taiwanese Romanization. (no ed.), E3. Tai-tung, Taiwan: National Taitung
Teachers College.
Iûⁿ, Ún-giân; and Lâu, Kia̍t-ga̍k (2002). Study of Pe̍h-ōe-jī computer word-processing.
In Proceedings of the Fourth Annual Taiwanese Languages and Teaching
Conference. Sun Yat-sen University Department of Chinese (ed.), 341-349.
Kaohsiung: National Sun Yat-sen University.
Iûⁿ, Ún-giân; and Tiuⁿ, Ha̍k-khiam (1999). Review and analysis of Taiwanese Holo's
non-Han-character phonetic symbols. In Proceedings of the First Annual
Taiwanese Mother Tongue Cultural Revival and Reconstruction Conference. Tai-
nan City Cultural Foundation (ed.), 62-76. Tai-nan, Taiwan: Tainan City
Cultural Foundation.
Klöter, Hanning (2002). The history of Peh-oe-ji. In Proceedings of the 2002
Conference on the Teaching and Research of Taiwanese Romanization. (no ed.).
Tai-tung, Taiwan: National Taitung Teachers College.
Lai, Tse-han; Myers, Ramon H.; and Wou, Wei (1991). A Tragic Beginning: the
Taiwan Uprising of February 28, 1947. Stanford: Stanford University Press.
Language (2001). Republic of China Yearbook—Taiwan 2001. Taipei: Government
Information Office.
Li, Chin-an (2000). Lexical Change and Variation in Taiwanese Literary Texts, 1916-
1988 -- A Computer-assisted Corpus Analysis. Ph.D. thesis, University of
Hawai’i.
Lîm, Chhoan-kiat (1997). The Study of a Mandarin-Taiwanese Machine Translation
System. MS thesis, National Taiwan University.
Nilsen, Kenneth (1997). Irish in nineteenth century New York. In The Multilingual
Apple: Languages in New York City. O. García and J. Fishman (eds.), 53-69.
Berlin: Mouton de Gruyter.
N̂g, Ka-hūi (2000). Research on Taiwanese [Language] Literature in Peh-oe-ji
Sources. MA thesis, National Tainan Teachers College, Graduate School of
Indigenous Culture.
Presbyterian Church in Taiwan (1975). Our appeal concerning the Bible, the church,
and the nation. 20 April 2002.
Rohsenow, John (2001). The present status of digraphia in China. IJSL 150, 125-140.
Tiuⁿ, Ha̍k-khiam (1998). Writing in two scripts. Written Language and Literacy 1,
225-247.
Tiuⁿ, Ha̍k-khiam, comp. (2002). Proposal to add Latin characters required by
Latinized Taiwanese Holo language to ISO/IEC 10646. 10 August 2002.
Tiuⁿ, Jū-hông (2001). Principles of POJ or the Taiwanese Orthography: An
Introduction to Its Sound-Symbol Correspondences and Related Issues. Taipei:
Crane.
Figure 1. A third-party POJ input method for the Macintosh.
Figure 2. Hàn-lô text with POJ annotation of select Han characters.
Notes
1 As a language label, Holo may be regarded as synonymous with Hoklo, Hokkian, Amoy, and
Southern Min, but as written it reflects the indigenous pronunciation.
2 Partly due to the practical hurdles of inputting Han characters in general, POJ is more often written
without Han characters in electronic mail than in formal publication.
3 In 1998 Taiwan’s Ministry of Education promulgated a phonetic system based on POJ. In place of
tone-marking diacritics, numbers are used. As noted in the language section of the ROC Yearbook
(2001), use of the Taiwan Language Phonetic Alphabet (TLPA) is not required. More recently, a yet
newer system derived from Hanyu Pinyin, known as Tongyong Pinyin, has also gained momentum.
4 In contrast, it is not uncommon to find sections of Hong Kong’s entertainment pages written in the
style of colloquial Cantonese.
5 As a parallel, in the nineteenth century an Irish-American newspaper also preceded its counterparts in
Ireland in carrying a Gaelic column, according to Nilsen (1997:60).
6 At least four other periodicals feature written Taiwanese in part. These are not included in Table 2.
7 Prior to the influx of some two million refugees and Nationalist bureaucrats and soldiers in 1949, the
most significant postwar event of that decade was the 1947 state-sponsored massacre of nearly thirty
thousand Taiwanese civilians (see Lai, Myers, and Wou 1991).
8 Although the secular Taiwanese Cultural Association passed a resolution supporting the use of POJ in
1922, little evidence exists of concrete activities in that direction (Huang 1995: 92).
9 The standard national keyboard in Taiwan is configured for Mandarin via the official Mandarin
Phonetic Symbols.
10 Previously known as PC Taiwanese-Mandarin Database (PCTMD).