Good Spelling of Vietnamese Texts, one aspect of computational linguistics in Vietnam

PHAN Huy Khanh
Department of Information Technology, DaNang University
17, Le Duan Street, DaNang City, Vietnam
[email protected]

Abstract

There are many challenging problems for Vietnamese language processing. It will be a long time before these challenges are met. Even some apparently simple problems such as spelling correction are quite difficult and have not been approached systematically yet. In this paper, we discuss one aspect of this type of work: designing the so-called Vietools to detect and correct spelling errors in Vietnamese texts by using a spelling database based on TELEX code. Vietools is also extended to serve many other purposes in Vietnamese language processing.

Introduction

For the past two decades computational linguistics (CL) has progressed substantially in Vietnam, mainly in these basic aspects: data acquisition from the keyboard, encoding, and restitution through an output device for Vietnamese characters, updates on the fonts in Microsoft DOS/Windows, standardization for Vietnamese (James Do, Ngo Thanh Nhan), automatic translation of English documents into Vietnamese and vice versa (Phan Thi Tuoi, Dinh Dien), recognition of handwriting (Hoang Kiem, Nguyen Van Khuong), speech processing (Nguyen Thanh Phuc, Quach Tuan Ngoc), building bilingual dictionaries such as English-Vietnamese and V-E, French-Vietnamese and V-F dictionaries (Lac Viet), archives of old Sino-Vietnamese documents (Ngo Trung Viet, Cong Tam), etc. Some of these works have been presented in Informatics and IT workshops organized in Vietnam. These efforts are modest and do not yet show our full potential. There are many reasons for this weakness. The major reasons are that the different efforts are quite isolated and that there is not enough coordination. Some coordinated workshops held from time to time would be very helpful.

At the IT Dept., DaNang University, we are building a lexical database based on TELEX code for accomplishing the following tasks:
- Converting Vietnamese texts from any font to any other font.
- Putting texts in alphabetical order independently of the font in use.
- Looking up words in the monolingual and/or multilingual dictionary.
- Building specialized monolingual dictionaries.

At present, we are taking part, with GETA, CLIPS, IMAG, France, in the FEV project for a multilingual dictionary: French-Vietnamese via English.

In fact, inputting Vietnamese texts still encounters many problems that have not yet been solved properly. The most common mistakes to be detected and corrected are:
- wrong tone or misspelling,
- not following spelling conventions, not using syllables consistently within the same text, etc.

Winword, a commercial text processor, is not able to detect and correct spelling mistakes in Vietnamese. The program designed by Ngo Thanh Nhan (without an associated spelling dictionary) and other packages for Vietnamese still do not offer adequate solutions.

We propose here a general solution for building the so-called Vietools for detecting and correcting spelling errors. Vietools is designed for office applications such as Winword, Excel, Access, PowerPoint, etc. Vietools has also been extended for converting and rearranging Vietnamese words in dictionaries and for consulting Vietnamese dictionaries, including multilingual dictionaries.

1 Building spelling database

In the spelling dictionary by Hoang Phe (1995), there are 6760 syllables in the writing system (6616 syllables in the phonology system) to compose single words or complex words. Each syllable has two parts: an initial consonant (optional) and a rhyme pattern (including rhyme and tone). Altogether, there are 27 initial consonants and 1160 rhyme patterns (including 6 tones).

Based on the Vietnamese syllable structure, the spelling database is built in tabular form. Each element of the table helps to check the correctness of a syllable based on the column position of the initial consonant and the row position of the rhyme pattern. For example, the syllable lamf (work), in TELEX form, is composed of the initial consonant l and the rhyme pattern am with the low falling tone (or grave accent) f. Each element of the table can be understood as one of the following:
- syllables used in Vietnamese.
- elements that differ in tone sign position (on o: oja, or on a: oaj), in pronunciation or dialect spelling (z is equivalent to d or gi, y is equivalent to i...), and borrowings such as karaoke, photocopy, fax...
- Sino-Vietnamese words: coongj (addition) → congj, quoocs (country) → nuwowcs...
- combinations unable to form syllables: quts, quoon, coan, cuee...

Techniques have been developed to recognize compound words of two syllables, such as baor damr or damr baor (guarantee), chung chung (vague), etc., of three syllables, such as howpj tacs xax (cooperative), etc., and of four syllables, such as coong awn vieecj lamf (work, job), etc.

2 Designing Vietools

The error-detecting program reads one syllable at a time from the text. The syllable is divided into an initial consonant and a rhyme pattern, paying attention to the special initial consonants: gi contains the vowel i; the consonant qu contains the vowel u, but it is easy to separate from the syllable because the consonant q never occurs on its own; the other combined initial consonants have a length of 2 or 3. The error-correcting unit checks the conformity of the initial consonant (if present) and the rhyme pattern.

3 Code converting

At present, there are many Vietnamese fonts built on different codes (different in the number of bytes used: 1 byte or 2 bytes, the order of tones, letter arrangements, etc.). Because there has not been a unified code for Vietnamese text, we selected TELEX code as a pivot code. There are many codes to convert from, such as IBM-CP01129, Microsoft-CP1258, VISCII, VietKey, VietWare, VNI, TCVN3, etc. Vietools works on syllables converted to TELEX. Vietools analyses syllables to detect the initial consonant and the rhyme pattern in TELEX code.

Conclusion

The main advantage of our method is that the tool operates independently of the Vietnamese font used. The design of Vietools is open: one can add new functions such as text or data conversion. The structure of the spelling database helps in building multi-functional dictionaries, which are essential for natural language processing.

Acknowledgements

My thanks go to my students for the realization of Vietools and to my colleagues for their opinions. In particular, I thank Professor Aravind Joshi, University of Pennsylvania, Philadelphia, USA, for his helpful suggestions. I am grateful to Christian Boitet, Professor, Joseph Fourier University, GETA, CLIPS, IMAG, France, for his comments on this paper.

References

1. Hoang Phe (1995) Dictionary of Orthography. Center of Lexicography, DaNang Publishing House, 509 p.
2. Hoang Phe (1997) Vietnamese Dictionary. Center of Lexicography, DaNang Publishing House, 1130 p.
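
As a rough illustration of the syllable analysis described in Section 2 and the table lookup described in Section 1, the following Python sketch splits a TELEX syllable into an initial consonant and a rhyme pattern and then checks the pair against a table. It is not the Vietools implementation. The names INITIALS, SPELLING_TABLE, split_syllable and is_valid are hypothetical, the list of initials is the standard orthographic inventory rather than data taken from the paper, the table is a tiny hand-made excerpt, and the treatment of gi is a simplification.

```python
# Standard initial consonants of Vietnamese orthography, in TELEX spelling
# ("dd" stands for the letter d with a stroke). Assumed list, not taken
# from the paper's database.
INITIALS = ("ngh", "ch", "dd", "gh", "gi", "kh", "ng", "nh", "ph", "qu",
            "th", "tr", "b", "c", "d", "g", "h", "k", "l", "m", "n", "p",
            "r", "s", "t", "v", "x")
VOWELS = set("aeiouy")

# Hypothetical excerpt of the spelling table: each rhyme pattern (row,
# tone key included) maps to the initial consonants (columns) it accepts.
# "" means the syllable has no initial consonant.
SPELLING_TABLE = {
    "amf": {"l", "h"},
    "oan": {"", "t", "h", "l", "kh"},
    "ieecj": {"v", "t"},
    "owpj": {"h", "l"},
    "acs": {"t", "c", "b"},
    "ax": {"x", "d"},
    "ung": {"ch", "d", "c"},
}


def split_syllable(syllable):
    """Split a TELEX syllable into (initial consonant, rhyme pattern)."""
    s = syllable.lower()
    # Try the longest initials first so that "ngh" wins over "ng" and "n".
    for initial in sorted(INITIALS, key=len, reverse=True):
        if s.startswith(initial):
            rhyme = s[len(initial):]
            # Simplification: when nothing but "gi" carries a vowel,
            # the i is shared with the rhyme pattern.
            if initial == "gi" and not (VOWELS & set(rhyme)):
                rhyme = "i" + rhyme
            return initial, rhyme
    return "", s  # no initial consonant


def is_valid(syllable):
    """Check that the rhyme pattern accepts the initial consonant."""
    initial, rhyme = split_syllable(syllable)
    return initial in SPELLING_TABLE.get(rhyme, set())


if __name__ == "__main__":
    for word in ("lamf", "vieecj", "chung", "quts", "quoon", "coan", "cuee"):
        print(word, split_syllable(word), is_valid(word))
```

With this excerpt, lamf, vieecj and chung are accepted, while quts, quoon, coan and cuee (the ill-formed examples listed in Section 1) are rejected. Storing the rhyme pattern together with its tone key mirrors the description of a rhyme pattern as including rhyme and tone; a full table would hold all 1160 rhyme patterns.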