Majalah Jurusan Teknik Informatika

The Balinese Unicode Text Processing Imam Habibi Informatics Engineering Department Bandung Institute of Technology Ganesha 10 Street Bandung 40132 E-mail : [email protected], [email protected] Abstraction In principal, the computer only recognizes numbers as the representation of a character. Therefore, there are many encoding systems to allocate these numbers although not all characters are covered. In Europe, every single language even needs more than one encoding system. Hence, a new encoding system known as Unicode has been established to overcome this problem. Unicode provides unique id for each different characters which does not depend on platform, program, and language. Unicode standard has been applied in a number of industries, such as Apple, HP, IBM, JustSystem, Microsoft, Oracle, SAP, Sun, Sybase, and Unisys. In addition, language standards and modern information exchanges such as XML, Java, ECMA Script (JavaScript), LDAP, CORBA 3.0, and WML make use of Unicode as an official tool for implementing ISO/IEC 10646. There are four things to do according to Balinese script: the algorithm of transliteration, searching, sorting, and word boundary analysis (spell checking). To verify the truth of algorithm, some applications are made. These applications can run on Linux/Windows OS platform using J2SDK 1.5 and J2ME WTK2 library. The input and output of the algorithm/application are character sequence that is obtained from keyboard punch and external file. This research produces a module or a library which is able to process the Balinese text based on Unicode standard. The output of this research is the ability, skill, and mastering of: 1. Unicode standard (21-bit) as a substitution to ASCII (7-bit) and ISO8859-1 (8-bit) as the former default character set in many applications. 2. The Balinese Unicode text processing algorithm. 3. An experience of working with and learning from an international team that consists of the foremost experts in the area: Michael Everson (Ireland), Peter Constable (Microsoft-US), I Made Suatjana, and Ida Bagus Adi Sudewa. Keywords: Unicode, transliteration, searching, sorting, word boundary analysis, canonical combining class, normalization, and Unicode Collation Element. Balinese is rarely used and has less scope 1. Introduction of usage. Efforts to preserve it have been Language and local script are the most attempted but met an obstacle, i.e. the lack precious cultural assets that have to be of application to accommodate opinions preserved for generations to come. Balinese using Balinese script. Basically, the more script which can be used to writing Balinese sophisticated the tool is, the more language is threatened to extinction because guaranteed the education and culture in the 2 The Balinese Unicode Text Processing future are. This tool refers to the computer should be ignored when comparing which is capable to build software engineering consonants, but vowels should be easily in such a manner to produce and process factored in only when the Balinese script quickly and properly [YBG05]. consonants are equal. Furthermore, The endeavor to computerize Balinese there are two different sorting script is being conducted by Bali Galang schemes exist: the traditional Foundation. The first step is done by including Balinese HANACARAKA ordering the Balinese script in standard of Unicode and the Sanskrit ordering. character. The Unicode Consortium1 and the 3. The Balinese text does not use ISO/IEC JTC1/SC2/WG22 committee have spaces as word separators. A spell- agreed in principle to include the Balinese checking algorithm should be able script as per defined in the formal proposal to perform a dictionary-based written by Michael Everson and I Made lookup to determine word Suatjana in the standards that they boundaries and to validate the maintain[EVE05]. This proposal, numbered spelling of the text. N2908, was presented to the committee submitted to the WG2 46th meeting in Xiamen, 2. Text Processing in Computer China, in January 2005. The information given Generally, information which flows in the proposal is by all means complete but from and into computer is in the form of will only be finalized during the next WG2 text document, picture, audio, video, and meeting in Sophia-Antipolis, France, in combination among them. Text is used to September 2005. submit information in language and written The conventional methods of processing a with understandable scripts by human being string of Latin text are not applicable to the as either the subject or object of the Balinese text because there are at least three information. In order to being processed in different areas for the Balinese Unicode text computer, these scripts need to be decoded processing [EVE05], i.e.: in number since computer can only 1. Searching algorithm should work on recognize in number. This number consists both pre-composed and decomposed of binary numbers, i.e. 0 and 1, known as strings. Searching for U+1B12 bit. BALINESE LETTER OKARA In fact, bit processing is processed on octet (from Latin word, octo which means TEDUNG should be equivalent with eight), a bit combination of eight digits also searching for U+1B11 BALINESE called byte. Some methods of convention are made to look for solution of how to LETTER OKARA and U+1B35 interpret octet and other series of octet on BALINESE VOWEL SIGN TEDUNG the way to represent data. For example, . four series of octet are used to interpret real 2. Sorting algorithm should not be based numbers. In this final project, octet series is purely on character code points. Vowels used to interpret string. The simplest way which is still used widely to interpret 1 http://www.unicode.org character is by mapping one octet with one 2 http://anubis.dkuug.dk/JTC1/SC2/WG2 The Balinese Unicode Text Processing 3 character according to the mapping table. In was issued in 1991, ASCII and ISO-8859 doing so, we can interpret 256 (2^8=256) had become the most well known standard. characters. This number of characters exceeds The development of the Unicode those in character set3 used by Latin script, a character model follows 10 basic rules script which has widely been used to write stated below [UNI03]. However, not all are many languages around the world such as actually fulfilled. Consistency can be English and Indonesian. This technique is also sacrificed in order to keep simplicity, used by ASCII character standard (American efficiency, and compatibility with the Standard Character for Information precious standard. The basic rules are: Interchange) which is developed at 1960’s and 1. Universality has been being used up to now. 2. Efficiency In general, text processing in computer 3. Character, not glyphs works when user types with keyboard, and the 4. Semantics keyboard sends its scan codes to the keyboard 5. Plain text driver. Then, the driver transforms the scan 6. Logical order codes into meaningful character sequence. In 7. Unification the case of non-roman input mode is on, the 8. Dynamic composition driver also checks the input sequences and 9. Equivalent sequences rejects invalid sequences. After that, the text 10. Convertibility processor manipulates the characters. It may do searching, copy-pasting, sorting, word 4. Balinese Script counting, line breaking, transliteration, etc. The Balinese script is used for writing These characters are also stored in memory or the Balinese language, the native language other storage devices. In order to show of the people of Bali. It is a descendent of character sequences, the rendering engine picks the ancient Brahmic script from India; the glyph that represents the character. Then, therefore it has some notable similarities the display such as monitor and printer with modern scripts of South Asia and displays the rendered glyphs. Southeast Asia that also are descendent of the Brahmic script. The Balinese script is 3. Unicode as a Character Coding Standard also used for writing Kawi, or Old Unicode is a character coding standard for Javanese, which had a heavy influence to representing a written language in computer. Balinese language in the 11th century. Some Unicode was actually not the first coding Balinese words are also borrowed from standard, because it came as the answer to the Sanskrit, thus Balinese script is also used to problems arising from the previous coding write words from Sanskrit. standard for years [UNI03]. Therefore, The basic elements of the alphabet are Unicode is close to the previous existing syllables. Each syllable has inherent sound coding standard. When Unicode version 1.0 of /a/ or /ĕ/ depending of the position of the syllable within a word. 3 The script character collection. It is not yet related The text direction of the Balinese script to its code representation. For example, Indonesian is from left to right, with vowel signs alphabet with its punctuation mark. attached to either before, after, below or 4 The Balinese Unicode Text Processing above the syllable. Some vowel signs are split vowels, meaning that they appear at more than one position to the syllable. Writing system of Balinese script is more complex than Latin script. The alphabet consists of syllables. Every syllable ends up with vowel sound /a/. Consonant cluster is a consonant group of syllable appearing without any vowels. In Balinese script, consonant naturally obtains the suffix of vowel sound /a/. In general, there are Picture 4-1. Writing position of Balinese script two ways to omit original vowel sound: 1. Utilizing consonant in the form of 5. Reordering and Split Vowel gantungan or gempelang attached to the next consonant. This gantungan or Dependent vowel in Balinese script gempelan consonant is applied to omit modifies base consonant syllable with the vowel on its left side, not the vowel several forms. A consonant or a cluster of on itself. For example, word consonants may have a dependant vowel to ‘bakta’ (bring).

Load more