
[3B2-6] man2009010046.3d 12/2/09 13:47 Page 46

Computers and the Thai Language

Hugh Thaweesak Koanantakool National Science and Technology Development Agency

Theppitak Karoonboonyanan Thai Linux Working Group

Chai Wutiwiwatchai National Electronics and Computer Technology Center

This article explains the history of Thai language development for computers, examining such factors as the language, script, and writing system, among others. The article also analyzes characteristics of Thai characters and I/O methods, and addresses key issues involved in Thai text processing. Finally, the article reports on language processing research and provides detailed information on Thai language resources.

Thai is the official language of Thailand. The Thai writing system has been used for Thai, Pali, and Sanskrit in Buddhist texts all over the country. Standard Thai is used in all schools in Thailand, and most dialects of Thai use the same script. Thai is the language of 65 million people, and has a number of regional dialects, such as Northeastern Thai (or Isan; 15 million people), Northern Thai (or Kam Meuang or Lanna; 6 million people), Southern Thai (5 million people), Khorat Thai (400,000 people), and many more variations (http://en.wikipedia.org/wiki/Thai_language). The Thai language is considered a member of the Tai language family, whose languages are spoken in many parts of the Indochina subregion including India, southern China, northern Myanmar, Laos, Thailand, Cambodia, and North Vietnam.

The Thai script of today has a history going back about 700 years, with gradual changes in the script's shape and writing system over the years. The script was originally derived from the Khmer script in the sixth century. It is generally thought that the Khmer script developed from the Pallava script of India.1

Thai is written left-to-right, without spaces between words. Each character has only one form; that is, there is no notion of uppercase and lowercase characters. Some vowels are written before and after the main consonant. Certain vowels, all tone marks, and diacritics are written above and below the main character.

Pronunciation of Thai words does not change with their usage, as each word has a fixed tone. Changing the tone of a syllable may lead to a totally different meaning. Thai verbs do not change their forms with tense, gender, and singular or plural form, as is the case in European languages. Instead, additional words help convey tense, gender, and singular or plural. Basic Thai words are typically monosyllabic. Contemporary Thai makes extensive use of adapted Pali, Sanskrit, English, and Chinese words embedded in day-to-day vocabulary. Some words have been in use long enough that people have forgotten that they originated from other languages. At present, most new words are created from English.

A Thai word is typically formed by the combination of one or more consonants, one vowel, one tone mark, and one or more final consonants to make one syllable. Certain words may be polysyllabic and therefore may consist of many characters in combination. Because of the limited number of characters (see Figure 1), the Thai script can be implemented on typewriters by mapping each character directly to each key. A typing sequence is exactly the same as a writing sequence.

46 IEEE Annals of the History of Computing Published by the IEEE Computer Society 1058-6180/09/$25.00 c 2009 IEEE

Authorized licensed use limited to: National Science & Technology Development Agency. Downloaded on April 20, 2009 at 22:32 from IEEE Xplore. Restrictions apply.

Consonants      44
Vowels          18
Tone marks       4
Diacritics       5
Numerals        10
Other symbols    6
Total           87
Figure 1. Thai characters.

The computerization of Thai script for the Thai language preserves the input sequence, and assigns one character in storage to correspond to one input keystroke. This concept is simple to understand, and all language processing algorithms are designed to suit this arrangement, despite the fact that some Thai vowels consist of up to three characters (from the vowel group) in a word.

Early IT standards in Thailand defined the code points for computer handling, keyboard layout, and I/O method. Subsequent research projects created many useful functions such as word break, text-to-speech, optical character recognition (OCR), speech recognition, machine translation, and search algorithms, which we describe elsewhere in this article.

Thai language on typewriters and computers

Printing of the Thai language was first accomplished at Serampore in 1819: according to professor Michael Winship,2 a Catechism of Religion had been prepared by the American Baptist missionary Ann Hasseltine Judson for distribution to a small community of Thai prisoners of war in Burma. By 1836, Dan Beach Bradley, a missionary physician, started printing in Thailand using a press donated by the American Board of Commissioners for Foreign Missions. At the time, publications were intended primarily for spreading Christianity into Thailand.3

Thai-language typewriter localization was first shown to the Thai public in 1891, when Edwin McFarland, in cooperation with Smith Premier, a US typewriter company, introduced the first Thai language typewriter. In 1898, son George McFarland opened a typewriter shop in Bangkok. Those first typewriters had a moving carriage without any shift mechanism.4 Today's typewriter keyboard has five rows, with standard shift keys.

The use of computers with the Thai language started in the late 1960s when IBM introduced card-punch machines and line printers with Thai character capability. In doing so, the 8-bit EBCDIC code assignments5 were made for Thai characters. With the printer's mechanical limitations, each line of Thai text required the equivalent of four-pass printing to complete the multilevel nature of the Thai writing system. In the 1970s, Univac, Control Data, and Wang introduced their computers using the Thai language. In 1979, some companies offered CRT terminals, with a 12-lines-per-screen capability, which could display Thai characters. Interactive computing came to Thailand in 1983 when a Thai inventor modified the Hercules Graphic Card (HGC) circuitry and made it capable of displaying 25 lines per screen in text mode. The full display capability of the Thai language has been adopted widely since then.

The computer industry in Thailand in the 1980s was very much vertically integrated: that is, hardware manufacturers always supported their version of localized operating systems and applications. Every company used different Thai character codes as a result of competition and corporate strategy to retain their customers. Application developers, however, suffered from the codes' incompatibility and from differences in the system interfaces. By 1984, there were at least 20 different character coding conventions used in Thailand.6 This situation prompted the Thai Industrial Standards Institute (TISI) to establish a standards committee to draft the national standard code for computers. The work, completed in 1986, is known as TIS 620-2529, The Standard for Thai Character Codes for Computers B.E. [Buddhist Era] 2529.7 Basically, two designs of the code points were adopted in the standard: the IBM-EBCDIC and the extended ASCII (8-bit). In 1990, the TIS 620 standard was enhanced with a clearer explanation but without any change to the code points, and this standard became TIS 620-2533. The technical committees of TISI subsequently issued a number of IT standards for Thailand (see Table 1).

January–March 2009 47


Computers and the Thai Language

Table 1. Thai industrial standards on information technology. The four digits at the end of the standard number (e.g., 2529) are the year of issue in the Buddhist Era.

Standard number        Standard title
TIS 620-2529 (1986)    Standard for Thai Character Codes for Computers
TIS 820-2531 (1988)    Layout of Thai Character Keys on Computer Keyboards
TIS 620-2533 (1990)    Standard for Thai Character Codes for Computers (Enhanced)
TIS 988-2533 (1990)    Recommendation for Thai Combined Character Codes and Symbols for Line Graphics for Dot-Matrix Printers
TIS 1074-2535 (1992)   Standard for 6-Bit Teletype Codes
TIS 1075-2535 (1992)   Standard for Conversion Between Computer Codes and 6-Bit Teletype Codes
TIS 1099-2535 (1992)   Standard for Province Identification Codes for Data Interchange
TIS 1111-2535 (1992)   Standard for Representation of Dates and Times
TIS 820-2538 (1995)    Layout of Thai Character Keys on Computer Keyboards (Enhanced)
TIS 1566-2541 (1998)   Thai Input/Output Methods for Computers

Source: http://www.nectec.or.th/it-standards/.
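The relationship between a standard number and its year of issue can be checked with a little arithmetic. The sketch below assumes the usual conversion, in which a Buddhist Era year minus 543 gives the Common Era year; the function name `tis_year_ce` and the accepted designation format are illustrative choices, not part of any standard.

```python
import re

BE_OFFSET = 543  # assumed conversion: Buddhist Era year - 543 = Common Era year


def tis_year_ce(designation: str) -> int:
    """Return the CE year of issue encoded in a designation such as 'TIS 620-2529'.

    Hypothetical helper for illustration only.
    """
    match = re.fullmatch(r"TIS (\d+)-(\d{4})", designation.strip())
    if match is None:
        raise ValueError(f"not a TIS designation: {designation!r}")
    return int(match.group(2)) - BE_OFFSET


if __name__ == "__main__":
    for name in ("TIS 620-2529", "TIS 620-2533", "TIS 1566-2541"):
        print(name, "->", tis_year_ce(name))
```

Applied to the rows of Table 1, this reproduces the Common Era years given in parentheses.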

It took only one year after the standard's announcement for every computer company to be fully compliant with the TIS 620 standard code. Everyone's data then became interchangeable.

In 1990, the Thai API Consortium, a group of Thailand software developers led by Thaweesak Koanantakool, drafted a common specification for computer handling of the Thai I/O method. The work was funded by the National Electronics and Computer Technology Center (NECTEC) and was published in 1991 as the WTT 2.0 specification.8,9 The specification was widely used by industry, including IBM, Digital Equipment, Hewlett-Packard, and, later, Microsoft. WTT 2.0 was proposed to TISI as a draft national standard and, seven years later, became TIS 1566-2541 (1998). At the same time, many more interesting natural language processing (NLP) projects with useful results and solutions found their way into industrial use. In the past two decades, Thailand has achieved several milestones for advanced computer processing with the Thai language; we report on most of them in this article.

Thai character encoding

Encoding of Thai characters into an octet has been made simple, since we need only 87 characters to represent the language. In the code space between 0x80 and 0xFF, TIS 620 occupies only the code space between 0xA1 and 0xFB. The ASCII-compatible version of TIS 620 was interoperable with any ASCII-based computer with true 8-bit storage per character. The actual placement on the code table was made in the "most acceptable" collating sequence. The TIS 620 sequence was subsequently registered with the Universal Character Set defined by the international standard ISO/IEC 10646, Universal Multiple-Octet Coded Character Set, in the region U+0E00 ... U+0E7F, and also as the ISO/IEC 8859-11 standard for 8-bit (single-byte) coded graphic character sets, Latin/Thai alphabet. Figure 2 shows the code assignment for Thai characters and their names in TIS 620 and Unicode.

Consonants

The Thai consonants can be classified as plosives (stops), non-plosives, sibilants, and voiced "h." In the table in Figure 3, each column is also grouped by voiced/unvoiced properties. In the last group, อ is classified as a zero consonant; it is used as the base for writing vowels, which cannot stand alone. Figure 3 shows the consonants, classifications, and the associated International Phonetic Alphabet (created by the International Phonetic Association, IPA) and the International Alphabet of Sanskrit Transliteration (IAST) representations.

Vowels

The 18 symbols for vowels, together with three consonants (ย, ว, and อ), are used in combination to create 32 vowels for Thai. The 18 symbols are listed in Figure 4, with the associated Unicode values. These 18 vowel symbols and 3 consonants (21 in total) combine to form 32 vowels, as Figure 5 shows.
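Because the TIS 620 sequence was carried over into the Unicode Thai block, converting between the two encodings amounts to adding a fixed offset. The following is a minimal sketch of that relationship, not a replacement for a real codec. The function names are hypothetical, and the treatment of bytes 0xDB to 0xDE as unassigned is an assumption that matches the gap between U+0E3A and U+0E3F in the Unicode Thai block.

```python
THAI_BLOCK_BASE = 0x0E00        # start of the Thai block in Unicode
TIS620_BASE = 0xA0              # TIS 620 byte 0xA1 corresponds to U+0E01
UNASSIGNED = range(0xDB, 0xDF)  # assumption: 0xDB..0xDE carry no character


def tis620_byte_to_unicode(byte: int) -> str:
    """Map one TIS 620 byte to a Unicode character (illustrative helper)."""
    if byte < 0x80:
        return chr(byte)  # the lower half is plain ASCII
    if 0xA1 <= byte <= 0xFB and byte not in UNASSIGNED:
        return chr(THAI_BLOCK_BASE + (byte - TIS620_BASE))
    raise ValueError(f"byte 0x{byte:02X} is not assigned in this sketch")


def decode_tis620(data: bytes) -> str:
    """Decode a TIS 620 byte string by applying the offset byte by byte."""
    return "".join(tis620_byte_to_unicode(b) for b in data)


if __name__ == "__main__":
    sample = b"\xca\xc7\xd1\xca\xb4\xd5"  # a Thai greeting encoded in TIS 620
    print(decode_tis620(sample))
    # Python ships a codec for the same encoding, useful as a cross-check.
    print(sample.decode("tis_620"))
```

In practice the built-in `tis_620` codec is the better choice; the hand-written mapping only makes the fixed offset between the two code tables visible.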



TIS      0xAx             0xBx             0xCx             0xDx             0xEx             0xFx
Unicode  U+0E0x           U+0E1x           U+0E2x           U+0E3x           U+0E4x           U+0E5x
0        -                tho than         pho samphao      sara a           sara e           thai zero
1        ko kai           tho nangmontho   mo ma            mai han-akat     sara ae          thai one
2        kho khai         tho phuthao      yo yak           sara aa          sara o           thai two
3        kho khuat        no nen           ro rua           sara am          sara ai maimuan  thai three
4        kho khwai        do dek           ru               sara i           sara ai maimalai thai four
5        kho khon         to tao           lo ling          sara ii          lakkhangyao      thai five
6        kho rakhang      tho thung        lu               sara ue          maiyamok         thai six
7        ngo ngu          tho thahan       wo waen          sara uee         maitaikhu        thai seven
8        cho chan         tho thong        so sala          sara u           mai ek           thai eight
9        cho ching        no nu            so rusi          sara uu          mai tho          thai nine
A        cho chang        bo baimai        so sua           pinthu           mai tri          angkhankhu
B        so so            po pla           ho hip           -                mai chattawa     khomut
C        cho choe         pho phung        lo chula         -                thanthakhat      -
D        yo ying          fo fa            o ang            -                nikhahit         -
E        do chada         pho phan         ho nokhuk        -                yamakkan         -
F        to patak         fo fan           paiyannoi        baht             fongman          -
(Columns U+0E6x and U+0E7x contain no characters.)

Figure 2. Standard codes for Thai characters—TIS 620 and Unicode, with their names as defined by the WTT 2.0 specification.

Figure 3. Classification of Thai consonants. Each cell consists of the character, International Phonetic Alphabet, and the International Alphabet of Sanskrit Transliteration (IAST) representations in square brackets. [Table not reproduced. Plosives are arranged by place of articulation (velar, palatal, retroflex, dental, labial) against the columns unvoiced unaspirated, unvoiced aspirated, voiced unaspirated, voiced aspirated, and nasal, with tone classes middle, high, low, low, and low. The remaining groups are the non-plosives (semi-vowels; palatal, retroflex, dental, labial), sibilants, voiced "h," and the zero consonant (never a vowel).]

Thai input method and keyboards

Thai is written with the combining marks stacked above or below the base consonant, like diacritics in European languages. However, although the concepts are quite similar, the implementations are significantly different.

Thai input method requirements

First, there are too many possible combinations of base consonants and combining marks in Thai to be enumerated like Latin accents in the ISO/IEC 8859 series standard for 8-bit character encoding. Therefore, base characters and combining characters are encoded separately, rather than precombined.

Second, Thai combining marks are classified into upper or lower vowels, tone marks, and other diacritics. Moreover, a Thai base consonant can be combined with up to two combining marks, that is, zero or one upper or lower vowel and zero or one tone mark or diacritic.




The upper/lower vowel, if present, is always attached to the consonant before the tone/diacritic. These conditions must be governed by some rules to limit the number and order of the combining characters to be put after the base consonant.

As a third difference, Thai users are familiar with typing the combining characters after the base consonant, in contrast to many European input methods in which users type the diacritic before the base letter to compose an accent. The typing sequence and the internal storage of a Thai string, therefore, are totally consistent. Moreover, a typing sequence is almost exactly the same as a handwriting sequence, with the exception of only one character, sara am, a composite vowel that consists of two symbols sharing one character.

Name               Type   Position relative to the main consonant
sara a             FV1    follow
mai han-akat       AV2    above
sara aa            FV1    follow
sara am            FV1    follow
sara i             AV1    above
sara ii            AV3    above
sara ue            AV2    above
sara uee           AV3    above
sara u             BV1    below
sara uu            BV2    below
sara e             LV     leading vowel
sara ae            LV     leading vowel
sara o             LV     leading vowel
sara ai maimuan    LV     leading vowel
sara ai maimalai   LV     leading vowel
ru                 FV3    follow
lu                 FV3    follow
lakkhangyao        FV2    follow
Figure 4. Eighteen vowel symbols in Thai.

Figure 5. Thirty-two vowels in Thai: 18 monophthongs, 9 diphthongs, and 5 semi-vowels. [Chart not reproduced. Panel (a) arranges the 18 monophthongs by height (close, close-mid, open-mid, open) against front unrounded, back unrounded, and back rounded, each short and long; panels (b) and (c) list the 9 diphthongs and 5 semi-vowels with their IPA values.]

Typewriter keyboards

The keyboard of the first Thai typewriter made by Smith Premier in 1891 consisted of 12 keys in seven rows (Figure 6a shows the layout). When the typewriter industry started using the shift key, the Thai layout was also adapted to the typical QWERTY arrangement. The characters that normally appear above and below the base characters are assigned to individual keys, and they appear above or below the previous position without moving the carriage. This concept is called dead keys.

The two main differences between Thai and English typewriters are that there are two different characters on the same key, and that there are eight dead keys. All the dead keys are grouped in the center of the Ketmanee keyboard layout (see Figure 6b), the classic layout named after its inventor.

In 1966, Sarit Pattajoti,11 an engineer at the Royal Irrigation Department, redesigned the layout to improve the mechanical coordination of human fingers. Statistically, the new layout would eliminate the imbalance of the Ketmanee layout (30% distributed to the left hand and 70% to the right hand), and distribute more of the workload to the index and middle fingers. The Pattajoti layout (see Figure 6c) became the official standard. Manufacturers were encouraged to make them in quantity for government offices. However, this "official" standard was abandoned in 1971 because its marginal speed improvement was not significant enough for people to migrate from the de facto Ketmanee.

In 1988, TISI issued the standard layout for the computer keyboard, TIS 820-2531 (1988) (see Figure 6d). The TISI technical committee adopted the Ketmanee layout with only one change. In 1993, the standard was enhanced to utilize the additional three keys available on computer keyboards. Figure 6e shows the present standard layout as defined by TIS 820-2536 (1993). Four rarely used characters defined in the TIS 620 standard, however, are not on the keyboard; they would be entered in the specific program that uses them, typically through a GUI or, in the pre-Microsoft Windows era, via direct code in DOS.



WTT 2.0 canonical order

The Thai I/O method specifications described in WTT 2.0 are based on the TIS 620 standard code.8 The same concept was later extended to the Thai block in Unicode. TIS 620 defines a code point for each of the Thai symbols, irrespective of their positions when rendered. In the same manner as Unicode normalization, the WTT 2.0 specification defines the canonical order of Thai character strings. It differs, however, in the implementation details and in terms of syntactic strictness. WTT 2.0 compliance requires that certain input-sequence rules must be met, and most (if not all) input syntactic errors are eliminated at the time of data entry (see Figure 7).

First, WTT 2.0 requires strings to be stored and transmitted in canonical order, not to be normalized at processing time. Therefore, in WTT 2.0, canonical and noncanonical strings mean different things and can yield different results in rendering, sorting, searching, and so on. This helps simplify the string handling routines to some degree, leaving the task of ensuring the order at a single point, the input method.

Second, WTT 2.0 concerns the syntactic correctness of the input string. It prohibits multiple tone marks from combining in one cell, whereas Unicode imposes no such restriction. Moreover, WTT 2.0 defines three levels of syntactic strictness for the input method. Level 0 (Passthrough) does not filter at all. Level 1 (BasicCheck) just ensures that the input sequence complies with canonical order and can be displayed gracefully in the output method. Level 2 (Strict) is stricter, filtering out most problematic sequences.

Therefore, the WTT 2.0 input conditioning helps make the storage order of Thai text cleaner than would the raw Unicode concept. The WTT 2.0 input method produces strings that form a proper subset of those allowed by the Unicode specification, and thus it does not negatively affect Unicode-compliant programs. WTT 2.0 eliminates problematic strings that might otherwise fail string matching or that might confuse the output method.
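To make the idea of input-time filtering concrete, here is a deliberately simplified sketch. It is not the WTT 2.0 state table: it only encodes the rule quoted earlier in the article, that a base consonant takes zero or one upper or lower vowel followed by zero or one tone mark or diacritic. The character sets, the function names `accepts` and `filter_keys`, and the decision to let every non-combining character through are assumptions made for illustration.

```python
# Consonants U+0E01..U+0E2E, excluding ru (U+0E24) and lu (U+0E26),
# which the article classifies as vowels.
CONSONANTS = {chr(c) for c in range(0x0E01, 0x0E2F)} - {"\u0e24", "\u0e26"}
# Upper and lower vowels (assumed grouping for this sketch).
VOWEL_MARKS = set("\u0e31\u0e34\u0e35\u0e36\u0e37\u0e38\u0e39")
# Tone marks and other combining diacritics (assumed grouping for this sketch).
TONE_OR_DIACRITIC = set("\u0e3a\u0e47\u0e48\u0e49\u0e4a\u0e4b\u0e4c\u0e4d\u0e4e")
COMBINING = VOWEL_MARKS | TONE_OR_DIACRITIC


def accepts(context: str, key: str) -> bool:
    """Return True if `key` may be appended to the text already in `context`."""
    if key not in COMBINING:
        return True  # spacing characters always start a new cell here
    # Walk back over the combining marks of the current cell.
    start = len(context)
    while start > 0 and context[start - 1] in COMBINING:
        start -= 1
    if start == 0 or context[start - 1] not in CONSONANTS:
        return False  # a combining mark needs a base consonant
    marks = context[start:]
    if key in VOWEL_MARKS:
        return marks == ""  # the vowel must come first and only once
    return not any(m in TONE_OR_DIACRITIC for m in marks)


def filter_keys(keys: str) -> str:
    """Feed keystrokes one by one, dropping those the check rejects."""
    text = ""
    for key in keys:
        if accepts(text, key):
            text += key
    return text
```

The check looks only at characters already in the buffer, which mirrors the requirement that the input method be able to retrieve context from the application rather than keep its own history.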

Technical characteristics of Thai input method

Technical requirements of the Thai TIS 1566-2541 (WTT 2.0) input method differ from input methods of other languages. The input method

- does not need pre-edit-and-commit stages (as used in some CJK [Chinese, Japanese, and Korean] input methods),
- does not use the composing method (as used in Europe), and
- is not a straight key-to-character mapping (as used in ordinary English input method).

Rather, the input method requires the following:

- one-to-one key-to-character mapping (irrespective of the width and position of the character),
- ability to retrieve context character from application input buffer (to verify validity), and
- (optional) write access to application input buffer.

The key-to-character mapping is straightforward. Thailand has an industrial standard for a keyboard map, totally compatible with the traditional (i.e., mechanical) typewriter. The ability to retrieve context characters from the application input buffer is intended to validate the key events, according to the state machine. Note that other solutions are not adequate, such as these:

- using pre-edit string and committing validated characters as a chunk,
- using composing method to commit strings cell-by-cell, and
- remembering previously typed characters in the input context memory.

These solutions can serve only for the inputting of new text but cannot handle the editing of existing text, especially when the first key to be input is a combining character. The last requirement of write access to the application input buffer is for input sequence correction enhancement.

Figure 6. Evolution of Thai typewriter keyboard layout: from top, (a) Smith Premier layout, ca. 1890, (b) Ketmanee layout (no date), and computer keyboard layouts: (c) Pattajoti layout (1966), (d) TIS 820-2531 (1988), and (e) TIS 820-2536 (1993). (Source: Koanantakool10)

Figure 7. WTT 2.0 Level 1 I/O state machine. [Diagram not reproduced; its legend follows.]
NON = non-Thai printable
CONS = Thai consonants
LV = leading vowels
FV1 = ordinary following vowels
FV2 = dependent following vowel (LAKKHANGYAO)
FV3 = special following vowels (RU, LU)
BV1 = short below vowel (SARA U)
BV2 = long below vowel (SARA UU)
BD = below diacritic (PHINTHU)
TONE = tone marks
AD1 = above diacritic 1 (THANTHAKHAT, NIKHAHIT)
AD2 = above diacritic 2 (MAITAIKHU)
AD3 = above diacritic 3 (YAMAKKAN)
AV1 = above vowel 1 (SARA I)
AV2 = above vowel 2 (MAI HAN-AKAT, SARA UE)
AV3 = above vowel 3 (SARA II, SARA UEE)

Implementations

Known implementations of WTT-based Thai input methods include the following:

- Thai Language Environment for Solaris
- Microsoft DOS 6.0 and Windows (all versions)
- X Window: XIM (embedded in Xlib)
- GTK+:
  - Thai-Lao IM module (stock GTK+ IM module)
  - gtk-im-libthai (third party, based on libthai)
- SCIM:
  - scim-thai (third party, based on libthai)
  - scim-m17n (third party, by ETL, Japan)

Making TIS 620 a part of international standards

The early 1990s witnessed different opinions concerning the encoding of the Thai block in the making of the draft Basic Multilingual Plane (BMP), the code space between 0x0000 and 0xFFFF in Unicode. Trin Tantsetthi, a Thai computer architecture and software expert, developed proposals to the relevant standards bodies that were responsible for the internationalization of Thai character codes.12 In order to have a Thai language character set for the ISO 8859 series, the TIS 620 standard was registered under ISO 2375 by the ECMA (European Computer Manufacturers Association) as ISO-IR-166. The process encountered many obstacles. According to Tantsetthi, the first proposal was to eliminate nonspacing characters, which were nonoptional atomic parts of the script, by precomposing characters into a rectangular bounding box of glyphs. This proposal was later dropped because there was an insufficient number of code positions to accommodate all precomposed glyphs. Once the industry had learned about this limitation, the long-awaited ISO/IEC 8859-11 Latin/Thai standard was approved. TIS 620 was formally registered by the Internet Assigned Numbers



Authority (IANA) as a legitimate encoding for the Internet in 1998.12

Thailand also encountered another problem in defining Thai in the ISO 10646 standard, as some nonnative linguists had strongly proposed the use of phonetic order for Thai, following Hindi-based scripts. Although this scheme allows words to be normalized into a "proper form," making word parsing a bit easier, the proposal could not address the following problems:

- it could not handle language exceptions vigorously;
- compounded vowels were not addressed, neither in code-point assignments nor in the input method; and
- no backward compatibility was provided.

We opted for the traditional input method (i.e., visual, instead of phonetic, order) because, since Thai localization had begun in 1968, by 1990 the Thai computer industry had equipped itself with libraries and tools to use visual order without a problem. That is essentially how the Thai block in BMP/Unicode developed. In 1998, IANA (Internet Assigned Numbers Authority) adopted TIS 620 as a legitimate character set on the Internet. The visual order was formally announced as a national standard in TIS 1566-2541 (1998).

Computer display, printing, and Thai word processors

Like most languages, modern Thai text display on computers has evolved from traditional printing technologies. Mechanical typewriter design has allocated four vertical zones for placing Thai characters: one for base characters, one for combining characters below the baseline, and the other two for combining characters above the base character. Characters are strictly classified on the basis of their assigned positions. This has propagated to the text-mode computer display grid. The WTT 2.0 specification was designed to accommodate exactly this.

Thai typography has a more refined scheme than that of typewriters. The topmost combining characters can be shifted down slightly when the combining character below them is missing, to eliminate the empty gap. In addition, the combining characters can be slightly biased to the left or right of base characters that have long ascenders (stems), to deliver aesthetic print results.

The bias technique was necessary for hot-metal typesetting because all the combined symbols had to be molded individually as one piece, and there were about 28 combinations to be implemented. Another set of 28 types precombined with ascenders was also required. In phototypesetting or computer typesetting, these precombination techniques were no longer a concern, but the same concept is useful to improve the speed of dot matrix printing. To print a line of Thai text on an impact printer normally requires four passes. By precombining the top characters (two levels) before printing, the speed improves by 25% because only three passes are required, rather than the four that mechanical typesetting required, one for each of the four vertical zones. To make all dot matrix printers interoperable, TISI issued TIS 988-2533, the standard code for combined characters in dot matrix printers, in 1990.

WTT 2.0-based Thai rendering engines for text-mode display usually handle Thai text in two steps. The text string is first clustered into cells, which are then shaped by proper placement of the combining characters. Printing with an impact printer, on the other hand, requires a conversion of one line of text into three or four strings whose lengths are equal to the line width, with each string corresponding to one pass of printing. The fastest dot matrix printer manages to print a Thai line with only two passes; typical printing requires three passes.

A number of DOS-based Thai-language I/O subsystems were developed at Kasetsart University, Chulalongkorn University, and Thammasat University. Yuen Poovorawan and his team at Kasetsart University created the first Thai word processor, called Thai Easy Writer. He also developed the Thai Kernel System as a resident program for DOS to handle the Thai language.13 The most popular DOS-based Thai word processor was released in 1989. According to professor Wanchai Rivepiboon of Chulalongkorn, CU Writer version 1.1 was first released to the public in April 1989. It was a tremendous success because of the demand for quality printing of Thai documents. The Faculty of Engineering of Chulalongkorn University continued to support and improve the program for about five years until it was superseded in 1994 by commercial word processing software running on the Windows platform.14
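The conversion of one line of text into per-pass strings, as described for impact printers above, can be sketched as follows. This is an illustration under stated assumptions rather than any printer driver's actual logic: the assignment of characters to the three combining zones, the treatment of sara am as an ordinary base-zone character, and the names `split_into_passes` and `passes_needed` are all choices made for this example.

```python
# Assumed zone assignment for a typewriter-style four-zone grid.
BELOW = set("\u0e38\u0e39\u0e3a")                           # lower vowels, phinthu
ABOVE = set("\u0e31\u0e34\u0e35\u0e36\u0e37\u0e47\u0e4d\u0e4e")  # upper vowels and low diacritics
TOP = set("\u0e48\u0e49\u0e4a\u0e4b\u0e4c")                 # tone marks, thanthakhat


def split_into_passes(line: str) -> dict:
    """Split a line into four equal-length strings, one per vertical zone."""
    zones = {"top": [], "above": [], "base": [], "below": []}

    def new_cell(base_char: str) -> None:
        for name, cells in zones.items():
            cells.append(base_char if name == "base" else " ")

    for ch in line:
        name = "top" if ch in TOP else "above" if ch in ABOVE else "below" if ch in BELOW else None
        if name is None:
            new_cell(ch)       # spacing character: opens a new column
            continue
        if not zones["base"]:
            new_cell(" ")      # stray mark at line start gets a blank base
        zones[name][-1] = ch   # a second mark in the same zone would overwrite
    return {name: "".join(cells) for name, cells in zones.items()}


def passes_needed(line: str) -> int:
    """Count the zones that contain anything to print."""
    return sum(1 for s in split_into_passes(line).values() if s.strip())
```

Every returned string has one position per display cell, so each can be sent to the printer as a separate pass over the same line; zones that come back blank can simply be skipped.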




Figure 8. WTT 2.0 clustering. The dotted circles indicate a rejected sequence.

Cell clustering

Cell clustering shares the same state table with the input method. Composable sequences are tokenized into cells. Rejected sequences are separated, with an optional dotted circle as the base character that can be easily caught by human eyesight, as Figure 8 illustrates.

Shaping

For typewriter-style displays, the tokenized cells are rendered in a perfect rectangular bounding box, and all combining marks are placed at fixed vertical positions. But for modern desktop publishing in the Thai language, beginning in 1989, combining marks have been typographically adjusted for the following issues:

- The topmost combining mark without a mark below it is shifted down to eliminate the gap between it and the base character.
- Combining marks that are combined with a base character with a long ascender (stem) are biased slightly to the left to avoid overlapping.
- Consonants with removable descenders (yo ying and tho than) have the descender removed when combined with a lower vowel or diacritic.

To accommodate these issues, typesetters have used Private Use Area glyphs. PUA glyphs are still widely used in most fonts today. OpenType may eventually make their use obsolete with Thai, although not in the near term.

PUA glyphs solutions

Operating system vendors assigned separate, predefined PUA code points to shifted glyphs. Font creators prepared the required glyphs, and the rendering engine knew how to use them. Unfortunately, Microsoft and Apple defined their own PUA code points differently, which meant fonts could not be used across Windows and Mac platforms. To make matters worse, Adobe did not support any PUA glyphs for Thai. This situation caused developers to create some common hacks with automatic filters to encode text with PUA glyphs so that such applications would render properly. This situation created another mess: when users of such filters transmitted messages in PUA, they were unreadable by anyone using another platform.

Thai users have long suffered from this lack of interoperability. The availability of PUA glyphs, however, is important enough in Thai typography to live on.

Pango, a free software rendering engine, supports both sets of PUA glyphs, Windows and Apple, by detecting them before use. Open standards and open source software have significantly helped the local Thai industry to find useful typographical solutions.

OpenType solutions

With OpenType technology, all shaping details can be moved into fonts. All that the rendering engines need do is call the appropriate features for the language.

Glyph processing features in OpenType fonts are represented in two forms: glyph substitution (GSUB) and glyph positioning (GPOS). GSUB contains substitution rules, while GPOS describes anchor points for attaching combining marks to base glyphs or combining marks to other combining marks. According to Microsoft's specification for Thai OpenType fonts,15 several features are required, as explained next.

Character Composition/Decomposition ("ccmp"). This must contain GSUB rules to

1. select the lower variation for topmost marks in the absence of an upper vowel (see Figure 9, left),
2. remove the descender for the consonants yo ying and tho than when combined with below-base characters (see Figure 9, middle), and
3. decompose the composite vowel sara am into the nikhahit sign and sara aa vowel, plus rearrange them with the combined tone mark if one is present (see Figure 9, right).

Figure 9. Glyph substitutions for shaping. [Image not reproduced; panels, from left: upper/lower variation selection, descender removal, SARA AM decomposition.]

54 IEEE Annals of the History of Computing

Authorized licensed use limited to: National Science & Technology Development Agency. Downloaded on April 20, 2009 at 22:32 from IEEE Xplore. Restrictions apply. [3B2-6] man2009010046.3d 12/2/09 13:47 Page 55

Figure 10. Mark positioning with GPOS.

Mark to Base Positioning ("mark"). This is composed of GPOS base anchors in the base characters and mark anchors in the combining marks. This feature places marks above or below the base characters (see Figure 10).

Mark to Mark Positioning ("mkmk"). This is composed of GPOS base anchors in the base marks and mark anchors in the combining marks. This feature places topmost marks on upper vowels.
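Anchor-based attachment reduces to simple arithmetic: the mark glyph is shifted so that its mark anchor coincides with the anchor on the glyph it attaches to. The following is a minimal sketch of that arithmetic, not a rendering engine. All coordinates are hypothetical values in font design units, and the names Point and attach are introduced here only for illustration.

```python
from typing import NamedTuple


class Point(NamedTuple):
    """A position in font design units (all values below are hypothetical)."""
    x: int
    y: int


def attach(base_anchor: Point, mark_anchor: Point) -> Point:
    """Offset that moves a mark so its anchor lands on the base anchor."""
    return Point(base_anchor.x - mark_anchor.x, base_anchor.y - mark_anchor.y)


# "mark": attach an upper vowel to a consonant.
consonant_top = Point(480, 1100)   # base anchor on the consonant glyph
vowel_attach = Point(-100, 1100)   # mark anchor on the upper-vowel glyph
vowel_offset = attach(consonant_top, vowel_attach)

# "mkmk": attach a tone mark to that vowel; the vowel now plays the base role.
vowel_top = Point(-100, 1450)      # base anchor on the upper-vowel glyph
tone_attach = Point(-100, 1100)    # mark anchor on the tone-mark glyph
tone_from_vowel = attach(vowel_top, tone_attach)

# Tone-mark position relative to the consonant's origin.
tone_offset = Point(vowel_offset.x + tone_from_vowel.x,
                    vowel_offset.y + tone_from_vowel.y)

print(vowel_offset, tone_offset)
```

Because the anchors live in the font, the same arithmetic handles consonants with long ascenders: the font designer simply places that consonant's base anchor further to the left.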

Thai cultural conventions

The most basic kind of cultural convention is the POSIX locale for the standard C library. According to POSIX, a number of C functions are defined to be locale-dependent, such as date and time format, string collation, and so on. Many other programming languages also rely on these functions.

The POSIX locale specification has been further extended in ISO/IEC TR 14652, Specification Method for Cultural Conventions, by adding more categories, which are now supported by more recent software.

Theppitak Karoonboonyanan16 has gathered information on Thai cultural conventions and documented his Thai locale creation for the GNU C library. Most POSIX categories were defined and later extended when ISO/IEC TR 14652 was supported.

Handling of Thai words

According to the Royal Institute of Thailand's Royal Institute Dictionary (http://en.wikipedia.org/wiki/The_Royal_Institute_of_Thailand#The_Royal_Institute_Dictionary), Thai strings can be collated by comparison from left to right, with two exceptions:

Leading vowels are less significant than the initial consonant of the syllable and must be compared after the consonant.
Tone marks, thanthakhat and maitaikhu, are ignored unless all other parts are equal.

No syllable structure or word boundary analysis is required. Thai words are sorted alphabetically, not phonetically. Figure 11 shows an example of a sorted list of Thai words.

Figure 11. Example of a sorted list of Thai words.

String collation solutions

The first known solution for Thai string collation was proposed in 1969 by D. Londe and Udom Warotamasikkhadit,17 and later referenced by Vichit Lorchirachoonkul18 in 1979. Londe and Warotamasikkhadit proposed an algorithm for converting Thai strings into left-to-right comparable form.

In 1992, Samphan Khamthaidee (a pen name used by Samphan Raruenrom)19 published an article in a computer hobbyist journal describing an algorithm for comparing two Thai strings on the fly. The algorithm was well crafted so that the strings are scanned from left to right in a single pass, and the difference is detected as early as possible. In 1997, Karoonboonyanan20 extended Khamthaidee's algorithm to cover punctuation marks by generalizing it to accommodate extra levels of character weights. Subsequently, details about the order of Thai marks have been summarized in the literature.21,22

International standards

String collation is based primarily on the POSIX standard library functions strcoll() and strxfrm(). Both are based on the LC_COLLATE locale setting. The POSIX locale was later extended as ISO/IEC TR 14652, Specification Method for Cultural Conventions. For LC_COLLATE in particular, a Common Template Table for collating strings encoded in ISO/IEC

January–March 2009 55


10646, Universal Multiple-Octet Coded Character Set, in general has been defined as ISO/IEC 14651, International String Ordering and Comparison. Every locale can customize the table to fit local needs. Unicode also defines UTS #10 (the Unicode Collation Algorithm) in parallel with the ISO/IEC 14651 standard.

Both standards already supported multiple-level character weights in the first place, which was important for handling Thai tone marks and diacritics. So, all the specification needed at that time was the proper weights definition. The reordering of the leading vowels, however, had been controversial for a while, as we will explain.

Defending the non-Indic style of Thai Unicode encoding had given the Unicode committee sufficient information to address Thai in UTS #10 since the beginning. The reordering was defined as a separate preprocessing stage. It took time, however, to convince the ISO/IEC 14651 committee to include this in the Common Template Table (CTT) as a set of predefined contraction forms, as there was no preprocessing concept in it. Rather, the reordering requirement was first described in Annex C of DIS 14651:2000 without realization in the CTT,23 before the actual amendment was implemented in 2003.24

Leading-vowel rearrangement issue

One common misinterpretation of the leading-vowel rearrangement by nonnative linguists, frequently encountered at conferences and meetings, was that Thai string collation required linguistic analysis, as a consequence of the Thai visual encoding scheme, as opposed to the phonetic encoding scheme of other Indic scripts. This was simply wrong.

A major ambiguity problem would result if one tried to do linguistic analysis. For example, the Thai string เพลา can mean two different words. One is a two-syllable word phe-la (meaning "time"), and the other is a one-syllable word phlao (meaning "axle" or "abate"). In trying to preprocess the string by rearranging the leading vowel เ with the syllable initial sound, one will find two possible rearrangements: พเลา for the former case, and พลเา for the latter. This makes it impossible to determine a single weight for the string. This is another reason why Thailand completely rejected phonetic encoding in its standards.

Thai string collation is in fact as simple as its encoding scheme. The leading vowel only needs to be swapped with its immediate succeeding character, nothing more than that.

Word and syllable boundary

Typical Thai writing has no space between words and syllables. We use space only to separate sentences. The lack of a word or syllable boundary marker was a major problem for early Thai language processing applications and has been researched since the 1980s. Early works focused on syllable segmentation due to its potential for being modeled in rule-based fashion.25,26 Word segmentation, on the other hand, requires a dictionary. Longest or maximal matching incorporated with a well-designed dictionary was the first practical approach.27 Over time, researchers have proposed advanced algorithms. A major contribution is the first open-source software for word segmentation, called Smart Word Analyzer for THai (SWATH),28 which NECTEC developed. It is based on a statistical part-of-speech n-gram. In 2002, Wirote Aroonmanakun constructed a word segmentation engine using two-pass processing, syllable segmentation, and syllable merging based on collocations.29 These fundamental tools have been applied in various language processing tasks such as line breaking and word boundary detection in word processing and publishing software.

Soundex

Soundex is a phonetic algorithm for indexing words by sound. It has been widely used in Internet search engines for flexible searching by the same or a similar sound without regard to the written form. An original algorithm, proposed by Vichit Lorchirachoonkul,30 was based on a set of rules used to segment Thai character strings into syllables, simplify the syllables, and encode them in a five-character code. Another general procedure is to find pronunciations of a given keyword and match them to indexed words in the search database. Standard sound representations are defined by the International Phonetic Association (IPA) and the Speech Assessment Methods Phonetic Alphabet (SAMPA).31,32 The Thai Royal Institute has defined a standard sound-based romanization of Thai words. Figure 12 demonstrates these standard sound representations.

Figure 12. Examples of Thai sound representations. Column headings: Sample text, IPA, SAMPA.
sawatdi: sa_2 wat_2 di:_1
khopkhun: khOp_2 khun_1
lakon: la:_1 kOn_2


Automatic letter-to-sound conversion is an important tool that activates the automatic soundex search. One of the first complete tools for letter-to-sound conversion was based on a probabilistic, generalized left-to-right parser trained by a pronunciation dictionary.33

Advanced human-computer I/O

Several forms of human-computer I/O (OCR, handwriting recognition, text-to-speech and speech synthesis, and automatic speech recognition) present different challenges with respect to the Thai language. This section reviews major contributions to these advanced technologies.

OCR and handwriting recognition

Thai OCR is not trivial, as a bounding box may contain up to three symbols stacked together, and the written (or printed) sentences do not have spaces within them. In addition, modern Thai texts are mixed with English words. The challenges to OCR developers have been tremendous.

The history of Thai OCR spans well over 20 years. Pipat Hiranvanichakorn et al. and Chom Kimpan et al. were two of the original groups of researchers in this area.34,35 Their subsequent works primarily concerned recognition of individual (isolated) printed characters, that is, without connections among the characters. A number of commercial products have been launched since the 1990s. For example, ArnThai, developed by NECTEC, was one of the first complete OCR software applications that helped users convert scanned pictures of documents to editable documents, with over 90% conversion accuracy under restricted scanning conditions.36

The technology trend has been tailored to the use of machine learning such as an artificial neural network. Although OCR technology has reached a commercial level, its mass utilization has not yet been achieved. Problems include connected characters, which often appear in dirty documents; poorly scanned documents; and documents with rarely used fonts that cause recognition errors. Research on Thai OCR is ongoing: for example, the introduction of postprocessing to recover from recognition errors.

Thai handwriting recognition was explored about a decade after OCR. Academic research also started with isolated-character recognition.37 Current applicable systems still mostly focus on isolated characters with a neural network as a basic classifier, like those of OCR. Some systems have been adopted commercially in mobile devices.38 On this simple but accurate task, characters are grouped into Thai letters, English letters, and digits. Character shapes are modified to enlarge the difference between some characters or to simplify the writing.

Text-to-speech and speech synthesis

Research on Thai speech synthesis, chiefly for academic purposes, began in the 1980s. Early work was reported by Sudaporn Luksaneeyanawin.39 Text-to-speech synthesis (TTS) has resulted in several practical products produced around 1990: for example, CUTalk by Luksaneeyanawin and Vaja by NECTEC.40 Major efforts required for Thai TTS are text processing, where a given text is segmented, normalized, and converted to sound representations; prosody prediction, including estimating unit durations, filled pauses, and fundamental frequencies; and synthesis, including the speech database and synthesizing algorithms.

The technology for speech synthesis has been incorporated by adopting available TTS engines in applications such as the IBM homepage reader,41 screen readers, and telephone voice response services. Limitations on speech quality and intelligibility inhibit widespread use. The desire to provide TTS applications on portable devices is increasing, and therefore researchers will have to optimize the size of the speech database and text processing dictionary. Thai TTS has been used widely, however, for persons with visual disabilities in Thailand.42

Automatic speech recognition

Development of Thai automatic speech recognition (ASR) began with isolated word recognition (in which speakers utter only a short command, not a sentence) in the mid-1990s, primarily at Chulalongkorn University.43 A few years ago, the research expanded to applications of continuous speech in limited tasks: for example, email access and telebanking. Several attempts were made on large vocabulary continuous speech recognition (LVCSR) for Thai; Table 2 summarizes the results. The lack of a very large speech database has slowed the advance of Thai speech recognition; however, NECTEC is one of the organizations that has continuously developed Thai speech recognition resources available for research, such as the LOTUS corpus.44


Table 2. State-of-the-art Thai large vocabulary continuous speech recognition (LVCSR) performance. The organization that provided the research results is listed after each "Task."

Task | Vocabulary size (words) | Perplexity | Word error rate (%)
Newspaper reading (Carnegie Mellon University)45 | ~7,400 | 140 | 14.0
Newspaper reading (Tokyo Institute of Technology)46 | 5,000 | 140 | 11.6
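Table 2 does not define its two metrics. Assuming the cited papers follow the standard definitions (an assumption, since the table itself does not say), word error rate counts the substitutions S, deletions D, and insertions I needed to turn the recognizer output into a reference transcript of N words, and perplexity is the inverse geometric mean of the probability the language model assigns to each test word:

```latex
\mathrm{WER} = \frac{S + D + I}{N} \times 100\%,
\qquad
\mathrm{PP} = P(w_1, w_2, \ldots, w_N)^{-1/N}
```

Under these definitions, a word error rate of 14.0% corresponds to roughly one word-level error in every seven reference words, and a perplexity of 140 means the language model is on average about as uncertain as a uniform choice among 140 words.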

Search and Semantic Web

Increasingly, as more companies and organizations in Thailand, both public and private, start adopting digital solutions in place of their original paper-based workflow, the need for search technology becomes inevitable. Most of the available software tools for developing information retrieval (IR) systems have not been specifically designed to work with Thai. Consequently, there are two possible approaches for indexing Thai texts: a suffix array and an inverted file index. The former approach treats text as a string of characters and records character positions; the latter approach adopts the conventional word-based paradigm applied for Latin-based languages. Word segmentation is therefore necessary in the latter approach. The stemming process is, however, inapplicable since Thai words are considered noninflectional.

An early publication describing full-text search for Thai was published in 1999 by Pradit Mitrapiyanuruk et al.47 In 2000, Surapan Meknavin, a co-author of that publication, focused his research on founding an Internet service business called SiamGuru (http://www.siamguru.com), which could be recognized as the first Thai-specific search engine. Today, the research trend in Thai-language search engines is moving toward natural-language and semantic searches for implementing question-answering (QA) systems.

The Semantic Web research initiatives in Thailand can be classified into four major categories:

information extraction (IE) and ontology learning,
metadata/ontology standards and tools,
knowledge representation (KR) and inferences, and
applications and services.

A wide spectrum of techniques has been researched and developed since the early 2000s. For example, a Thai translation of the Dublin Core metadata element sets48 was completed by Science and Technology Knowledge Services (STKS), which also promotes its use in Thailand, and the XML Declarative Description (XDD)49 language was proposed by the Asian Institute of Technology to extend the capabilities of XML and RDF to support rule-based reasoning.

Language resources and industrial standard lists

One of the most important language resources is a dictionary. So Sethaputra is considered the first Thai-English dictionary made into a commercial electronic dictionary

Table 3. Available Thai dictionaries.

Dictionary | Type | Size (words) | Availability | Source
Royal Institute Dictionary | Monolingual with pronunciation | 33,582 | Web application | http://rirs3.royin.go.th/dictionary.asp
NECTEC LEXiTRON | Bilingual (En-Th) with pronunciation | 53,000 English, 35,000 Thai | Publicly available | http://lexitron.nectec.or.th/
KDictThai | Bilingual (En-Th) | 37,018 | Publicly available | http://kdictthai.sourceforge.net/
Saikam Dictionary | Bilingual (Jp-Th) | 133,524 | Web application | http://saikam.nii.ac.jp/
Longdo Dictionary | Multilingual (En-Th, Th-En, De-Th, Th-De, Jp-Th, Th-Jp, Fr-Th, Th-Fr) | >600,000 | Web, gadget, widget | http://dict.longdo.com/


Table 4. Thai text and speech resources.

Corpus (Organization) | Type | Purpose
Orchid (NECTEC),51 http://www.hlt.nectec.or.th/orchid/ | Text | Annotated 568,316 words of Thai junior encyclopedias and NECTEC technical papers
LOTUS (NECTEC)44 | Speech | Well-designed speech utterances for 5,000-word dictation systems

product. LEXiTRON is another electronic dictionary with a long history.50 The first version of LEXiTRON was launched by NECTEC in both standalone and Web-based systems in 1995. This first version contained approximately 13,000 Thai words and 9,000 English words. Through user collaboration, the number of entries has more than doubled in the current version, LEXiTRON 2.2. Table 3 lists other available dictionaries. Another language resource necessary for language processing is text and speech corpora. Orchid and LOTUS, shown in Table 4, are considered two of the first official language corpora available publicly for Thai language research.

Concluding remarks

The history of the Thai language on computers dates back to about 1965, but local research involvement began around 1975, when low-cost microcomputers became available. Since the first standard Thai code for computers was developed in 1986, computer processing speed has improved from 8-bit, 8-MHz to 32-bit, 3.6-GHz (1,800 times), and the RAM size for an entry-level computer has increased from 256 Kbytes to 1 Gbyte (3,900 times). NLP in real time is becoming a reality. Research opportunities are open to any students and researchers who can afford these inexpensive computers. Thailand's I18N (internationalization) and L10N (localization) processes owe much to open standards and many open source projects. It is anticipated that openness in the international IT community will enable the Thai language to be more usable on any computer in the future.

With more IT users, improvement in human-computer interaction for the Thai language will become more desirable. Because computers are also becoming indispensable for the elderly, persons with disabilities, and those who are illiterate, many new developments will be possible. Speech I/O, handwriting input, sign languages, mobile-phone text entry, a voice-command system, speech-to-speech translation, voice search, and so on are the most likely efforts to be developed and commercialized.

Acknowledgments

The authors would like to thank Trin Tantsetthi and Yuen Poovorawan for their personal communications, and Choochart Haruechaiyasak and Marut Buranarach from NECTEC for their initial survey on search engines and Semantic Web technology.

References and Notes

1. P. Soonthornpoct, From Freedom to Hell: A History of Foreign Interventions in Cambodian Politics and Wars, Vantage Press, 2006.
2. M. Winship, "The Printing Press as an Agent of Change? Early missionary printing in Thailand"; http://www.historycooperative.org/journals/cp/vol-08/no-02/tales/.
3. NECTEC National Font Committee, The Time Line of Thai Printing, Thai National Font Book (in Thai), National Electronics and Computer Technology Center (NECTEC), Bangkok, 2001.
4. Antique Phonograph and Gramophone Thai Society, "The Origins of Thai Typewriter"; http://www.talkingmachine.org/smithpremierorigin.html.
5. K. Hensch, "IBM History of Far Eastern Languages in Computing, Part 1: Requirements and Initial Phonetic Product Solutions in the 1960s," IEEE Annals of the History of Computing, vol. 27, no. 1, 2005, pp. 17-26.
6. T. Koanantakool, "Compilation of all Thai Character Codes for Computers," Microcomputer Magazine, vol. 1, no. 9, Aug. 1984.
7. National Electronics and Computer Technology Center, Thai Information Technology Standards; http://www.nectec.or.th/it-standards/.
8. T. Koanantakool and the Thai API Consortium, Computer Gub Pasa Thai [Computers and the Thai Language], NECTEC, Oct. 1991 (in Thai).
9. The WTT specification was published in the book Computers and the Thai Language in 1991. WTT, a project code name, is a Thai acronym of "Wing Took Thee," meaning "I run everywhere."
10. T. Koanantakool, "The Keyboard Layouts and Input Method of the Thai Language," Proc. Symp. Natural Language Processing, Chulalongkorn Univ., 1993.
11. S. Pattajoti, The Evolution of the Typewriter, National Research Council, The Office of the Prime Minister, 4 Nov. 1966 (in Thai).


12. T. Tantsetthi, "Thai Software Alert #1: Legitimate Thai Charset on the Internet"; http://software.thai.net/alerts/tis-620/index.html.
13. P. Suriyawong, Y. Poovarawan, and C. Wongchaisuwat, "A Development of Thai Kernel System," Proc. Regional Workshop on Computer Processing of Asian Languages (CPAL), 26-28 Sept. 1989.
14. T. Koanantakool, "A Brief History of ICT in Thailand"; http://www.bangkokpost.com/20th_database/07Feb2007_data00.php.
15. Microsoft, "Thai OpenType Specification: Creating and Supporting OpenType fonts for the Thai Script"; http://www.microsoft.com/typography/otfntdev/thaiot/default.htm.
16. T. Karoonboonyanan, "Thai Locale"; http://linux.thai.net/~thep/th-locale/.
17. D. Londe and U. Warotamasikkhadit, "Computerized Alphabetization of Thai," tech. memo TM--1000/000/01, System Development Corp., 1969.
18. V. Lorchirachoonkul, Thai Sorting Algorithm, Thai Algorithm: Design, Analysis and Implementation, tech. report, NIDA and Electrical Generation Authority of Thailand, Bangkok, 1979.
19. S. Khamthaidee, "Thai Style Algorithm: Sorting Thai," Computer Rev., vol. 91, Mar. 1992.
20. T. Karoonboonyanan, "Thai Sorting Algorithms"; http://linux.thai.net/~thep/tsort.html.
21. T. Karoonboonyanan, S. Raruenrom, and P. Boonma, "Thai-English Bilingual Sorting"; http://linux.thai.net/~thep/blsort_utf.html.
22. T. Karoonboonyanan, S. Raruenrom, and P. Boonma, "Sorting Thai Words with Punctuation Marks," NECTEC J., vol. 14, Jan.-Feb. 1997, NECTEC, 1997.
23. A. LaBonté, ed., ISO/IEC DIS 14651: International String Ordering and Comparison - Method for Comparing Character Strings and Description of the Common Template Tailorable Ordering, ISO/IEC JTC1/SC22/WG20 N731, 2000.
24. K. Karlsson, Ordering Rules for Thai and Lao, ISO/IEC JTC1/SC22/WG20 N1077, 2003.
25. Y. Thairatananond, "Towards the Design of a Thai Text Syllable Analyzer," master's thesis, Asian Inst. of Technology, Pathumthani, 1981.
26. S. Charnyapornpong, "A Thai Syllable Separation Algorithm," master's thesis, Asian Inst. of Technology, Pathumthani, 1983.
27. R. Varakulsiripunth et al., "Word Segmentation in Thai Sentence by Longest Word Mapping," Papers on Natural Language Processing: Multilingual Machine Translation and Related Topics (1987-1994), NECTEC, 1989.
28. S. Meknavin, P. Charoenpornsawat, and B. Kijsirikul, "Feature-Based Thai Word Segmentation," Proc. Natural Language Processing Pacific Rim Symp. (NLPRS 97), 1997, pp. 41-48.
29. W. Aroonmanakun, "Collocation and Thai Word Segmentation," Proc. Joint Int'l Conf. Symp. Natural Language Processing (SNLP) and Oriental Chapter of the Int'l Committee for the Coordination and Standardization of Speech Databases and Assessment Techniques (O-COCOSDA), 2002, pp. 68-75.
30. V. Lorchirachoonkul, "A Thai Soundex System," Information Processing and Management, vol. 18, no. 5, 1982, pp. 243-255.
31. K. Tingsabadh and A.S. Abramson, Illustrations of the IPA: Thai, Handbook of the International Phonetic Association, Cambridge Univ. Press, 1999.
32. J.C. Wells, "SAMPA for Thai"; http://www.phon.ucl.ac.uk/home/sampa/thai.htm.
33. P. Tarsaku, V. Sornlertlamvanich, and R. Thongprasirt, "Thai Grapheme-to-Phoneme Using Probabilistic GLR Parser," Proc. European Conf. Speech Communication and Technology (EuroSpeech), ISCA, 2001, pp. 1057-1060.
34. P. Hiranvanichakorn, T. Agui, and M. Nkajima, "A Recognition of Thai Characters," Trans. IECE Japan, vol. E 54, no. 12, 1982.
35. C. Kimpan et al., "Thai Character Pattern Recognition Preliminary," Proc. Conf. L-05, College of Science and Technology, Nihon Univ., 1980.
36. NECTEC, "Open TLE"; http://www.opentle.org/.
37. V. Tanrudee, "Thai Character Handwriting Recognition System," master's thesis, Chulalongkorn Univ., 1989.
38. G Softbiz Co., Ltd., "Thai-G"; http://www.thai-g.com/.
39. S. Luksaneeyanawin, "A Thai Text-to-Speech System," Proc. Regional Workshop Computer Processing of Asian Languages (CPAL), 1989, pp. 305-315.
40. P. Mittrapiyanurak et al., "Issues in Thai Text-to-Speech Synthesis: The NECTEC Approach," Proc. NECTEC Ann. Conf., NECTEC, 2000, pp. 483-495.
41. IBM, "IBM Launches IBM Home Page Reader V3.02 Thai, A Thai-speaking Web Browser - The 12th Language of the World," press release, 12 Feb. 2003.
42. Thailand Association of the Blind, "Software for the Blind"; http://www.tab.or.th/.
43. T. Pathumthan, "Thai Speech Recognition Using Syllable Units," master's thesis, Chulalongkorn Univ., 1987 (in Thai).
44. S. Kasuriya et al., "Thai Speech Corpus for Speech Recognition," Proc. Int'l Conf. Int'l Committee for the Coordination and Standardization of Speech Databases and Assessments (O-COCOSDA), 2003, pp. 105-111.
45. S. Suebvisai et al., "Thai Automatic Speech Recognition," Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing (ICASSP), IEEE Press, 2005, pp. 857-860.
46. I. Thienlikit, C. Wutiwiwatchai, and S. Furui, "Language Model Construction for Thai LVCSR," Reports of the Meeting of the Acoustic Soc. of Japan, ASJ, 2004, pp. 131-132.


47. P. Mitrapiyanuruk et al., "A Development of Full-Text Search Engine for Large Scale Thai Text Database," Proc. Nat'l Science and Technology Development Agency (NSTDA), annual meeting, 1999, pp. 246-257 (in Thai).
48. Science and Technology Knowledge Service (STKS), "Dublin Core Metadata Element Set (Thai translation)"; http://dublin.stks.or.th/.
49. V. Wuwongse et al., "XML Declarative Description: A Language for the Semantic Web," IEEE Intelligent Systems, vol. 16, no. 3, 2001, pp. 54-65.
50. NECTEC, "LEXiTRON Online Dictionary"; http://lexitron.nectec.or.th/.
51. V. Sornlertlamvanich, T. Charoenporn, and H. Isahara, ORCHID: Thai part-of-speech tagged corpus, tech. report TR-NECTEC-1997-001, ISBN 9747576-98-8, NECTEC, 1997.

Hugh Thaweesak Koanantakool is a vice president of the National Science and Technology Development Agency. He received a PhD in digital communications from the Imperial College of Science and Technology, London University. His major works are in IT standards of Thailand, the establishment of the Internet in Thailand, SchoolNet Thailand, and e-Government. Koanantakool served on Thailand's National IT Committee, the parliamentary commissions on the Electronic Transactions Act of 2001 and the Computer-related Offences Act in 2007. Contact him at [email protected].

Chai Wutiwiwatchai received a PhD from Tokyo Institute of Technology in 2004. Currently he is the director of the Human Language Technology Laboratory in NECTEC, Thailand. His research interests include speech processing, natural language processing, and human-machine interaction. Contact him at chai.wutiwi- [email protected].

Theppitak Karoonboonyanan received a B.Eng. (honors) in computer engineering from Chulalongkorn University in 1993. His interests are software engineering, and free and open source software philosophy. He has contributed to the GNOME, Debian, and other projects for Thai supporting infrastructure, and has maintained upstream projects for Thai resources. Contact him at [email protected].

For further information on this or any other computing topic, please visit our Digital Library at http://computer.org/csdl.
