Computers and the Thai Language

[3B2-6] man2009010046.3d 12/2/09 13:47 Page 46 Computers and the Thai Language Hugh Thaweesak Koanantakool National Science and Technology Development Agency Theppitak Karoonboonyanan Thai Linux Working Group Chai Wutiwiwatchai National Electronics and Computer Technology Center This article explains the history of Thai language development for computers, examining such factors as the language, script, and writing system, among others. The article also analyzes characteristics of Thai characters and I/O methods, and addresses key issues involved in Thai text processing. Finally, the article reports on language processing research and provides detailed information on Thai language resources. Thai is the official language of Thailand. Certain vowels, all tone marks, and diacritics The Thai script system has been used are written above and below the main for Thai, Pali, and Sanskrit languages in Bud- character. dhist texts all over the country. Standard Pronunciation of Thai words does not Thai is used in all schools in Thailand, and change with their usage, as each word has a most dialects of Thai use the same script. fixed tone. Changing the tone of a syllable Thai is the language of 65 million people, may lead to a totally different meaning. and has a number of regional dialects, such Thai verbs do not change their forms as as Northeastern Thai (or Isan; 15 million with tense, gender, and singular or plural people), Northern Thai (or Kam Meuang or form,asisthecaseinEuropeanlanguages.In- Lanna; 6 million people), Southern Thai stead, there are other additional words to help (5 million people), Khorat Thai (400,000 with the meaning for tense, gender, and sin- people), and many more variations (http:// gular or plural. Basic Thai words are typically en.wikipedia.org/wiki/Thai_language). Thai monosyllabic. Contemporary Thai makes ex- language is considered a member of the fam- tensive use of adapted Pali, Sanskrit, English, ily of Tai languages, the language used and Chinese words embedded in day-to-day in many parts of the Indochina subre- vocabulary. Some words have been in use gion including India, southern China, long enough that people have forgotten that northern Myanmar, Laos, Thai, Cambodia, they originated from other languages. At pres- and North Vietnam. ent, most new words are created from The Thai script of today has a history English. going back about 700 years, with gradual AThaiwordistypicallyformedbythe changes in the script’s shape and writing sys- combination of one or more consonants, tem evolving over the years. The script was one vowel, one tone mark, and one or more originally derived from the Khmer script in final consonants to make one syllable. Cer- the sixth century. It is generally thought tain words may be polysyllabic and therefore that the Khmer script developed from the they may consist of many characters in com- Pallava script of India.1 bination. Because of a limited number of Thai is written left-to-right, without spaces characters (see Figure 1), the writing system between words. Each character has only one can be implemented on typewriters by map- form, that is, no notion of uppercase and ping each symbol directly to each key. A typ- lowercase characters. Some vowels are writing sequence is exactly the same as a writing ten before and after the main consonant. sequence. 46 IEEE Annals of the History of Computing Published by the IEEE Computer Society 1058-6180/09/$25.00 c 2009 IEEE Authorized licensed use limited to: National Science & Technology Development Agency. Downloaded on April 20, 2009 at 22:32 from IEEE Xplore. Restrictions apply. [3B2-6] man2009010046.3d 12/2/09 13:47 Page 47 Consonants 44 company, introduced the first Thai language typewriter. In 1898, son George McFarland opened a typewriter shop in Bangkok. Those first typewriters had a moving carriage without any shift mechanism.4 Today’s typewriter keyboard has five rows, with standard shift keys. The use of computers with the Thai lan- guagestartedinthelate1960swhenIBM Vowels 18 introduced card-punch machines and line printers with Thai character capability. In Tone marks 4 doing so, the 8-bit EBCDIC code assignments 5 Diacritics 5 were made for Thai characters. With the printer’s mechanical limitations, for each Numerals 10 line of Thai text an equivalent of four-pass Other symbols 6 printing was required to complete the multi- Total 87 level nature of the Thai writing system. In the 1970s, Univac, Control Data, and Wang Figure 1. Thai characters. introduced their computers using the Thai language. In 1979, some companies offered The computerization of Thai script for the CRT terminals, with a 12-lines-per-screen ca- Thai language preserves the input sequence, pability, which could display Thai characters. and assigns one character in storage to corres- Interactive computing came to Thailand in pond to one input keystroke. This concept is 1983 when a Thai inventor modified the Her- simple to understand, and all language pro- cules Graphic Card (HGC) circuitry and made cessing algorithms are designed to suit this it capable of displaying 25 lines per screen in arrangement, despite the fact that some text mode. The full display capability of the Thai vowels consist of up to three characters Thai language has been adopted widely (from the vowel group) in that word. since then. Early IT standards in Thailand defined the The computer industry in Thailand in the code points for computer handling, keyboard 1980s was very much vertically integrated: layout, and I/O method. Subsequent research that is, hardware manufacturers always sup- projects created many useful functions such ported their version of localized operating as word break, text-to-speech, optical charac- systems and applications. Every company ter recognition (OCR), voice recognition, ma- used different Thai character codes as a result chine translation, and search algorithms, of competition and corporate strategy to re- which we describe elsewhere in this article. tain their customers. Application developers, however, suffered from the codes’ incompat- ibility and from differences in the system Thai language on typewriters interfaces. By 1984, there were at least 20 dif- and computers ferent character coding conventions used in Printing of the Thai language was first ac- Thailand.6 This situation prompted the Thai complished at Serampore in 1819: according Industrial Standard Institute (TISI) to estab- to professor Michael Winship,2 a Catechism of lish a standards committee to draft the na- Religion had been prepared by the American tional standard code for computers. The Baptist missionary Ann Hasseltine Judson work, completed in 1986, is known as TIS for distribution to a small community of 620-2529—The Standard for Thai Character Thai prisoners of war in Burma. By 1836, Codes for Computers B.E. [Buddhist Era] Dan Beach Bradley, a missionary physician, 2529.7 Basically, two designs of the code started printing in Thailand using a donated points were adopted in the standard: the press by the American Board of Commis- IBM-EBCDIC and the extended ASCII (8-bit sioners for Foreign Mission. At the time, pub- plane). In 1990, the TIS 620 standard was lications were intended primarily for enhanced with a clearer explanation but spreading Christianity into Thailand.3 without any change to the code points, and Thai-language typewriter localization this standard became TIS 620-2533. The tech- was first shown to the Thai public in nical committees of TISI subsequently issued 1891, when Edwin McFarland, in coopera- anumberofITstandardsforThailand tion with Smith Premier, a US typewriter (see Table 1). January–March 2009 47 Authorized licensed use limited to: National Science & Technology Development Agency. Downloaded on April 20, 2009 at 22:32 from IEEE Xplore. Restrictions apply. [3B2-6] man2009010046.3d 12/2/09 13:47 Page 48 Computers and the Thai Language Table 1. Thai industrial standards on information technology. The four digits at the end of the standard number (e.g., 2529) is the year of issue, according to Buddhist Era. Standard number Standard title TIS 620-2529 (1986) Standard for Thai Character Codes for Computers TIS 820-2531 (1988) Layout of Thai Character Keys on Computer Keyboards TIS 620-2533 (1990) Standard for Thai Character Codes for Computers (Enhanced) TIS 988-2533 (1990) Recommendation for Thai Combined Character Codes and Symbols for Line Graphics for Dot-Matrix Printers TIS 1074-2535 (1992) Standard for 6-Bit Teletype Codes TIS 1075-2535 (1992) Standard for Conversion Between Computer Codes and 6-Bit Teletype Codes TIS 1099-2535 (1992) Standard for Province Identification Codes for Data Interchange TIS 1111-2535 (1992) Standard for Representation of Dates and Times TIS 820-2538 (1995) Layout of Thai Character Keys on Computer Keyboards (Enhanced) TIS 1566-2541 (1988) Thai Input/Output Methods for Computers Source: http://www.nectec.or.th/it-standards/. It took only one year after the standard’s code table was placed in the ‘‘most accept- announcement for every computer company able’’ collating sequence. The TIS 620 se- to be fully compliant with the TIS 620 stan- quence was subsequently registered with the dard code. Everyone’s data then became Universal Character Set defined by the inter- interchangeable. national standard ISO/IEC 10646, Universal In 1990, the Thai API Consortium, a group Multiple-Octet Coded Character Set, in the re- of Thailand software developers led by Tha- gion Uþ0E00 ...Uþ0E7F, and also as the weesak Koanantakool, drafted a common ISO/IEC 8859-11 standard for 8-bit (single specification for computer handling of the byte) coded graphic character sets, Latin/ Thai I/O method. The work was funded by Thai alphabet. Figure 2 shows the code assign- the National Electronics and Computer Tech- ment for Thai characters and their names in nology Center (NECTEC)andwaspublished TIS 620 and Unicode.

Computers and the Thai Language

Introduction to the Ipaddress Module

Technical Reference Manual for the Standardization of Geographical Names United Nations Group of Experts on Geographical Names

418338 1 En Bookbackmatter 205..225

The King Never Smiles

How Many Bits Are in a Byte in Computer Terms

Intelligence System for Tamil Vattezhuttuoptical

The Application of File Identification, Validation, and Characterization Tools in Digital Curation

1 Powers of Two

A History of Siamese Music Reconstructed from Western Documents 1505-1932

Development of the Sinhalese Script from 8Th Century A.D. to 15Th Century A.D~

Principles of Romanization for Thai Script by Transcription Method

On the History of Thai Scripts Hans Penth