
The Balinese Unicode Text Processing

Imam Habibi

Informatics Engineering Department Bandung Institute of Technology Ganesha 10 Street Bandung 40132

E-mail : [email protected], [email protected]

Abstract

In principle, a computer recognizes only numbers as the representation of characters. Many encoding systems have therefore been created to assign these numbers, although none of them covers all characters. In Europe, even a single language may need more than one encoding system. A new encoding system known as Unicode was established to overcome this problem. Unicode provides a unique ID for each character that does not depend on platform, program, or language. The Unicode standard has been adopted by a number of industry vendors, such as Apple, HP, IBM, JustSystem, Microsoft, Oracle, SAP, Sun, Sybase, and Unisys. In addition, modern language and information-exchange standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, and WML use Unicode as the official way of implementing ISO/IEC 10646. This research addresses four tasks: the algorithms for transliteration, searching, sorting, and word boundary analysis (spell checking). To verify the correctness of the algorithms, several applications were built. These applications run on Linux and Windows using J2SDK 1.5 and the J2ME WTK2 library. The input and output of the algorithms and applications are character sequences obtained from keyboard input and external files. This research produces a module, or library, that is able to process Balinese text based on the Unicode standard. The outcomes of this research are competence in and mastery of:
1. The Unicode standard (21-bit) as a replacement for ASCII (7-bit) and ISO 8859-1 (8-bit), the former default character sets in many applications.
2. The Balinese Unicode text processing algorithms.
3. The experience of working with and learning from an international team that consists of the foremost experts in the area: Michael Everson (Ireland), Peter Constable (Microsoft, US), Made Suatjana, and Ida Bagus Adi Sudewa.

Keywords: Unicode, transliteration, searching, sorting, word boundary analysis, canonical combining class, normalization, and Unicode Collation Element.

1. Introduction

Language and local script are among the most precious cultural assets and must be preserved for generations to come. The Balinese script, which is used to write the Balinese language, is threatened with extinction because Balinese is rarely used and has a shrinking scope of usage. Efforts to preserve it have met an obstacle, i.e. the lack of applications that let people express themselves in Balinese script. Basically, the more sophisticated the tool is, the better education and culture are safeguarded in the future. The tool referred to here is the computer, for which software can be engineered to produce and process Balinese script quickly and properly [YBG05].

The endeavor to computerize the Balinese script is being conducted by the Galang Foundation. The first step is to include the Balinese script in the Unicode character standard. The Unicode Consortium [1] and the ISO/IEC JTC1/SC2/WG2 [2] committee have agreed in principle to include the Balinese script, as defined in the formal proposal written by Michael Everson and I Made Suatjana, in the standards that they maintain [EVE05]. This proposal, numbered N2908, was submitted to the 46th WG2 meeting in Xiamen, China, in January 2005. The information given in the proposal is complete but will only be finalized during the next WG2 meeting in Sophia-Antipolis, France, in September 2005.

[1] http://www.unicode.org
[2] http://anubis.dkuug.dk/JTC1/SC2/WG2

The conventional methods of processing a string of Latin text are not applicable to Balinese text, because Balinese Unicode text processing differs in at least three areas [EVE05], i.e.:
1. The searching algorithm should work on both pre-composed and decomposed strings. Searching for U+1B12 BALINESE LETTER OKARA TEDUNG should be equivalent to searching for U+1B11 BALINESE LETTER OKARA followed by U+1B35 BALINESE VOWEL SIGN TEDUNG.
2. The sorting algorithm should not be based purely on character code points. Vowels should be ignored when comparing consonants and should be factored in only when the consonants are equal. Furthermore, two different sorting schemes exist: the traditional Balinese HANACARAKA ordering and the Sanskrit ordering.
3. Balinese text does not use spaces as word separators. A spell-checking algorithm should be able to perform a dictionary-based lookup to determine word boundaries and to validate the spelling of the text.

2. Text Processing in Computer

Generally, information that flows into and out of a computer takes the form of text documents, pictures, audio, video, or a combination of these. Text conveys information in a language, written in a script that humans can understand, with the human being as either the subject or the object of the information. To be processed by a computer, these scripts need to be encoded as numbers, since a computer can only recognize numbers. These numbers consist of binary digits, i.e. 0 and 1, known as bits.

In fact, bits are processed in octets (from the Latin word octo, meaning eight), combinations of eight bits also called bytes. Several conventions define how an octet, or a series of octets, is interpreted to represent data. For example, a series of four octets is used to represent real numbers. In this final project, octet series are used to represent strings. The simplest way to represent characters, still widely used, is to map one octet to one character according to a mapping table. In doing so, we can represent 256 (2^8=256) characters. This number exceeds the size of the character set [3] used by Latin, a script that is widely used to write many languages around the world, such as English and Indonesian. This technique is also used by the ASCII character standard (American Standard Code for Information Interchange), which was developed in the 1960s and is still in use today.

[3] The collection of characters of a script, not yet related to its code representation. For example, the Indonesian alphabet with its marks.

In general, text processing in a computer starts when the user types on the keyboard and the keyboard sends its scan codes to the keyboard driver. The driver then transforms the scan codes into a meaningful character sequence. If a non-roman input mode is on, the driver also checks the input sequences and rejects invalid ones. After that, the text processor manipulates the characters. It may perform searching, copy-pasting, sorting, word counting, line breaking, transliteration, etc. The characters are also stored in memory or on other storage devices. To show character sequences, the rendering engine picks the glyph that represents each character. Finally, an output device such as a monitor or printer displays the rendered glyphs.

3. Unicode as a Character Coding Standard

Unicode is a character coding standard for representing written language in a computer. Unicode was not the first coding standard; it came as an answer to the problems that had arisen from the previous coding standards over the years [UNI03]. Unicode is therefore close to the previously existing coding standards. When Unicode version 1.0 was issued in 1991, ASCII and ISO-8859 had become the most well-known standards.

The development of the Unicode character model follows the 10 basic rules stated below [UNI03]. However, not all of them are always fulfilled. Consistency can be sacrificed in order to keep simplicity, efficiency, and compatibility with the previous standards. The basic rules are:
1. Universality
2. Efficiency
3. Characters, not glyphs
4. Semantics
5. Plain text
6. Logical order
7. Unification
8. Dynamic composition
9. Equivalent sequences
10. Convertibility

4. Balinese Script

The Balinese script is used for writing the Balinese language, the native language of the people of Bali. It is a descendant of the ancient Brahmic script from India; it therefore has some notable similarities with the modern scripts of South Asia and Southeast Asia that are also descendants of the Brahmic script. The Balinese script is also used for writing Kawi, or Old Javanese, which had a heavy influence on the Balinese language in the 11th century. Some Balinese words are also borrowed from Sanskrit, so the Balinese script is also used to write words from Sanskrit.

The basic elements of the Balinese script are syllables. Each syllable has an inherent sound of /a/ or /ĕ/, depending on the position of the syllable within a word.

The text direction of the Balinese script is from left to right, with vowel signs attached before, after, below, or above the syllable. Some vowel signs are split vowels, meaning that they appear at more than one position relative to the syllable.

The Balinese script is more complex than the Latin script. The alphabet consists of syllables, and every syllable ends with the vowel sound /a/. A consonant cluster is a group of consonants within a syllable that appear without any vowels between them. In Balinese script, a consonant naturally carries the vowel sound /a/ as a suffix. In general, there are two ways to omit this original vowel sound:
1. Using a consonant in the form of gantungan or gempelan attached to the next consonant. The gantungan or gempelan consonant omits the vowel of the consonant on its left side, not its own vowel. For example, the word 'bakta' (bring).
2. Using adeg-adeg, U+1B44 BALINESE ADEG-ADEG. For example, the word 'kadep' (sold).

Positioning in Balinese script writing is divided into several different areas (see picture 4-1), i.e.:
1. Baseline area: the writing base of the Balinese script. Consonants are written in this area.
2. The areas to the left of the baseline, for pre-base marks (prem), and to the right of the baseline, for post-base marks (pstm): used to write dependent vowels and gempelan.
3. The areas above the baseline, for above-base marks (abvm), and below the baseline, for below-1 base marks (blw1m) and below-2 base marks (blw2m): used to write gantungan and pengangge suara.

Picture 4-1. Writing position of Balinese script

5. Reordering and Split Vowel

A dependent vowel in Balinese script modifies the base consonant syllable and takes several forms. A consonant or a cluster of consonants may take a dependent vowel that changes the last vowel sound attached to it. The Balinese script has various forms of dependent vowel, both spacing and non-spacing, written before, after, above, or below the base character. Combinations of these positions are also possible.

The Unicode standard determines that a dependent vowel is coded after its base character. Therefore, when a character sequence contains dependent vowels, reordering is necessary in computer memory just before the sequence is displayed on the screen. The reordering changes the glyph order so that the glyph components of the dependent vowel are displayed properly (see picture 5-1).

Picture 5-1. Reordering

A split vowel is a vowel whose components appear on two different sides of its consonant. The components may appear either above and to the right of the base consonant, or to its left and right.

The glyphs of vowels drawn below the base character need special treatment, because glyph selection depends on the context of the preceding consonant or consonant cluster. These vowels are 1B38 BALINESE VOWEL SIGN SUKU (u) and 1B39 BALINESE VOWEL SIGN SUKU ILUT (uu). Both of these vowels have two different glyphs, i.e. one attached to base consonants and one attached to their conjunct forms (Pengangge Aksara).

6. Ligatures

A glyph representing more than one character is called a ligature, i.e. a form that is handwritten on paper with no more than one stroke. Several Balinese characters appearing adjacent to one another form a ligature, so that on the screen they look as if they were a single glyph. For example, U+1B35 BALINESE VOWEL SIGN TEDUNG (aa) forms a ligature when attached to a syllable.

7. Line Breaking

Although Balinese script is written without any spaces between successive words, line breaking cannot be done at arbitrary places. There are two common rules of line breaking, i.e.:
1. A line may not be broken between a syllable and any following combining characters.
2. A line may not be broken just before any punctuation.

8. Characteristics of Balinese Script

Like any other Unicode script, the Balinese script has some unique characteristics (see table 8-1). They are published in the approved proposal L2/05-008. However, the decomposition mapping property should be added to the proposal. According to the Unicode standard, there should be ten characters requiring a decomposition mapping (see algorithm 8-1).

Table 8-1. Characteristics of Balinese Script

1B00;BALINESE SIGN ULU RICEM;Mn;230;NSM;;;;;N;;ardhacandra;;;
1B01;BALINESE SIGN ULU CANDRA;Mn;230;NSM;;;;;N;;candrabindu;;;
1B02;BALINESE SIGN CECEK;Mn;230;NSM;;;;;N;;;;;
1B03;BALINESE SIGN SURANG;Mn;230;L;;;;;N;;repha;;;
1B04;BALINESE SIGN BISAH;Mc;226;L;;;;;N;;;;;
1B05;BALINESE LETTER AKARA;Lo;0;L;;;;;N;;a;;;
1B06;BALINESE LETTER AKARA TEDUNG;Lo;0;L;1B05 1B35;;;;N;;aa;;;
1B07;BALINESE LETTER IKARA;Lo;0;L;;;;;N;;i;;;
1B08;BALINESE LETTER IKARA TEDUNG;Lo;0;L;1B07 1B35;;;;N;;ii;;;
1B09;BALINESE LETTER UKARA;Lo;0;L;;;;;N;;u;;;
1B0A;BALINESE LETTER UKARA TEDUNG;Lo;0;L;1B09 1B35;;;;N;;uu;;;
1B0B;BALINESE LETTER RA REPA;Lo;0;L;;;;;N;;vokalic r;;;
1B0C;BALINESE LETTER RA REPA TEDUNG;Lo;0;L;1B0B 1B35;;;;N;;vokalic rr;;;
1B0D;BALINESE LETTER LA LENGA;Lo;0;L;;;;;N;;vokalic l;;;
1B0E;BALINESE LETTER LA LENGA TEDUNG;Lo;0;L;;;;;N;;vokalic ll;;;
1B0F;BALINESE LETTER EKARA;Lo;0;L;;;;;N;;e;;;
1B10;BALINESE LETTER AIKARA;Lo;0;L;;;;;N;;;;;
1B11;BALINESE LETTER OKARA;Lo;0;L;;;;;N;;;;;
1B12;BALINESE LETTER OKARA TEDUNG;Lo;0;L;1B11 1B35;;;;N;;;;;
1B13;BALINESE LETTER KA;Lo;0;L;;;;;N;;;;;
1B14;BALINESE LETTER KA MAHAPRANA;Lo;0;L;;;;;N;;;;;
1B15;BALINESE LETTER GA;Lo;0;L;;;;;N;;;;;
1B16;BALINESE LETTER GA GORA;Lo;0;L;;;;;N;;;;;
1B17;BALINESE LETTER NGA;Lo;0;L;;;;;N;;;;;
1B18;BALINESE LETTER CA;Lo;0;L;;;;;N;;;;;
1B19;BALINESE LETTER CA LACA;Lo;0;L;;;;;N;;;;;
1B1A;BALINESE LETTER JA;Lo;0;L;;;;;N;;;;;
1B1B;BALINESE LETTER JA JERA;Lo;0;L;;;;;N;;;;;
1B1C;BALINESE LETTER NYA;Lo;0;L;;;;;N;;;;;
1B1D;BALINESE LETTER TA LATIK;Lo;0;L;;;;;N;;tta;;;
1B1E;BALINESE LETTER TA MURDA MAHAPRANA;Lo;0;L;;;;;N;;ttha;;;
1B1F;BALINESE LETTER DA MURDA ALPAPRANA;Lo;0;L;;;;;N;;dda;;;
1B20;BALINESE LETTER DA MURDA MAHAPRANA;Lo;0;L;;;;;N;;ddha;;;
1B21;BALINESE LETTER NA RAMBAT;Lo;0;L;;;;;N;;nna;;;
1B22;BALINESE LETTER TA;Lo;0;L;;;;;N;;;;;
1B23;BALINESE LETTER TA TAWA;Lo;0;L;;;;;N;;tha;;;
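The rows of Table 8-1 use the semicolon-separated field layout of the Unicode Character Database file UnicodeData.txt, in which field 0 is the code point, field 3 the canonical combining class, and field 5 the decomposition mapping. As a minimal sketch, such rows could be turned into a decomposition table as follows. The class and method names are hypothetical and are not taken from the paper's library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for reading rows in the Table 8-1 layout.
class UnicodeDataRows {

    /** Code point from field 0 (hexadecimal). */
    static int codePoint(String row) {
        return Integer.parseInt(row.split(";", -1)[0].trim(), 16);
    }

    /** Canonical combining class from field 3 (decimal). */
    static int combiningClass(String row) {
        return Integer.parseInt(row.split(";", -1)[3].trim());
    }

    /** Decomposition mapping from field 5; empty when the field is blank. */
    static List<Integer> decomposition(String row) {
        List<Integer> result = new ArrayList<Integer>();
        String field = row.split(";", -1)[5].trim();
        if (!field.isEmpty()) {
            for (String hex : field.split("\\s+")) {
                result.add(Integer.parseInt(hex, 16));
            }
        }
        return result;
    }

    /** Keeps only the rows that carry a decomposition mapping. */
    static Map<Integer, List<Integer>> decompositionTable(String[] rows) {
        Map<Integer, List<Integer>> table = new LinkedHashMap<Integer, List<Integer>>();
        for (String row : rows) {
            List<Integer> mapping = decomposition(row);
            if (!mapping.isEmpty()) {
                table.put(codePoint(row), mapping);
            }
        }
        return table;
    }
}
```

Passing the row for 1B06 to `decompositionTable` would yield an entry that maps 0x1B06 to the pair 0x1B05, 0x1B35, while rows with an empty field 5 are skipped.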


U+1B06 → [U+1B05, U+1B35]
U+1B08 → [U+1B07, U+1B35]
U+1B0A → [U+1B09, U+1B35]
U+1B0C → [U+1B0B, U+1B35]
U+1B12 → [U+1B11, U+1B35]
U+1B3B → [U+1B3A, U+1B35]
U+1B3D → [U+1B3C, U+1B35]
U+1B40 → [U+1B3E, U+1B35]
U+1B41 → [U+1B3F, U+1B35]
U+1B43 → [U+1B42, U+1B35]

The symbol → shows that the element on the left side is equivalent to the element on the right side.

Algorithm 8-1. Decomposition mapping of Balinese script [CON05]

9. Canonical Combining Class of Balinese Script

The purpose of canonical combining classes is to establish appropriate equivalence classes under Unicode normalization for character sequences that involve combining marks. Specifically:
1. Given a pair of combining marks that interact typographically (i.e., that nominally occupy the same position relative to the base), different encoded orders correspond to visually distinct relative positions of the marks and hence are semantically distinct. By assigning these marks to the same canonical combining class (zero or non-zero), the non-equivalence of differently-ordered sequences is established under normalization.
2. Given a pair of combining marks that do not interact typographically (i.e., that occupy distinct positions relative to the base), different encoded orders are visually identical and hence not semantically distinct. By assigning these marks to different, non-zero canonical combining classes, the equivalence of differently-ordered sequences is established under normalization.

Class 0 has special behavior in the Unicode normalization algorithms: if a sequence contains a combining mark in class 0 and a mark in a non-zero class n, equivalence classes are defined as though the class-0 mark belonged to class n; i.e., that sequence is not equivalent to the sequence containing those marks in the opposite order.

Using the canonical combining classes proposed in L2/05-008, there is only one pair of combining marks for which distinct orders would be considered canonically equivalent (see table 9-1).

Table 9-1. Canonical combining class of Balinese script proposed in L2/05-008

< 1B34 SIGN REREKAN (ccc=7), 1B44 ADEG-ADEG (ccc=9) > ≡ < 1B44 ADEG-ADEG (ccc=9), 1B34 SIGN REREKAN (ccc=7) >

In proposal L2/05-008, combinations of syllable-modifier signs (1B00-1B03), REREKAN, and vowel signs, at least, are linguistically valid. Because all of these except REREKAN are assigned to class 0, differently-ordered sequences of these marks, which would be visually distinct, are not canonically equivalent. Thus, the use of class 0 provides appropriate results in these cases.

In this case, < 1B35 BALINESE VOWEL SIGN TEDUNG, 1B04 BALINESE SIGN BISAH > is a linguistically plausible combination. Assuming its normal use as a vowel killer, ADEG-ADEG should not co-occur with either of the other two marks. Again, though, different encoded orders of a combination of these marks are possible in principle and would be visually distinct, so the use of class 0 provides appropriate results in these cases.

In the cases described above, the use of class 0 is sufficient to cause differently-ordered combinations of marks that do interact typographically (having different visual results) to be considered not canonically equivalent. However, assigning marks to class 0 breaks down in that it fails to cause differently-ordered combinations of marks that do not interact typographically to be considered canonically equivalent. Following the Unicode standard, it is suggested to assign classes 220, 224, 226, and 230 to characters whose position relative to the base is bottom, left, right, or top, respectively [CON05]. However, the characters U+1B34 BALINESE SIGN REREKAN and U+1B44 BALINESE ADEG-ADEG are assigned to the fixed-position classes 7 and 9.

10. Normalization of Balinese Script

There are four Unicode normalization forms, i.e.:
1. Normalization Form D: canonical decomposition.
2. Normalization Form C: canonical decomposition, followed by canonical composition.
3. Normalization Form KD: compatibility decomposition.
4. Normalization Form KC: compatibility decomposition, followed by canonical composition.

Considering the efficiency and performance of the Unicode normalization algorithm, Balinese script is processed in Normalization Form D. Besides, Normalization Form C is basically obtained through the same steps as Normalization Form D.

Normalization Form D consists of two phases. First, the decomposition mapping is applied to every Balinese word. Second, canonical reordering is done according to the canonical combining class of each character. The decomposition mapping is recursive, so that the correct character sequences are obtained (see algorithm 10-1). For example, U+1B40 turns into [U+1B3E, U+1B35].

If: X → [Y, Z] (defined)
If: Y → [Y1, Y2] (defined)
Then: X → [Y1, Y2, Z] (conclusion)

The symbol → shows that the element on the left side is equivalent to the element on the right side.

Algorithm 10-1. Decomposition mapping algorithm

11. Unicode Collation Algorithm (UCA)

Basically, UCA is a process that creates sorting keys with particular priority values. All strings are first compared on the primary level. If they are equal, the comparison continues on the secondary level and later on the tertiary level when necessary. In general, the sorting keys of UCA are:
1. Alphabet/base character
2. Mark/accent
3. Uppercase and lowercase

12. Comparison Algorithm of Balinese Script

The comparison algorithm of Balinese script consists of four phases, i.e.:
1. Every Balinese script input is first processed into Normalization Form D.
2. The result from step (1) is then passed to the UCA according to the chosen sorting method, either HANACARAKA or SANSKRIT.
3. The next step is separating the ...
4. The last step is comparing the values obtained from step (3) using a binary comparison algorithm.

13. Transliteration of Balinese Script

Transliteration is a mapping from one writing system to another, e.g. from Balinese script to Latin and vice versa, that takes both accent and grammar into account [WIK05]. The main criterion is that no information is lost, so that the user is able to transform the information back to its original form. Transliteration therefore differs from transcription, which only focuses on mapping from one language to another. Transliteration helps people who cannot read the Balinese script. For example, [U+1B13 KA, U+1B2E LA] becomes 'kala' (time).

Transliteration is performed by building a table that maps characters from Balinese script to Latin and vice versa. A complex conversion is needed to handle the changes of shape and the special cases of letters in the source script. The input for transliteration is given as pairs of hexadecimal digits from the keyboard (0,1,2,…,A,B,C,D,E,F), padded with a 0 if the number of digits is odd. The transliteration algorithm uses the inversion list data structure (including the basic functions invert, union, intersection, set difference, adding, and deleting) in order to save memory. These operations are also fast, because every element can be accessed randomly.
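An inversion list stores a set of code points as a sorted array of boundaries at which membership switches on and off, so a whole range costs only two integers. The following is a minimal sketch of the idea with only `contains` and `invert`; the class name and API are assumptions for illustration and do not reproduce the paper's implementation.

```java
import java.util.Arrays;

// Hypothetical inversion list: ranges are [bounds[0], bounds[1]), [bounds[2], bounds[3]), ...
// An odd number of boundaries means the last range is open-ended.
class InversionList {
    private final int[] bounds; // must be sorted in ascending order

    InversionList(int... bounds) {
        this.bounds = bounds.clone();
    }

    /** Membership test by binary search over the boundaries. */
    boolean contains(int codePoint) {
        int pos = Arrays.binarySearch(bounds, codePoint);
        if (pos >= 0) {
            // Exact hit: an even index starts a range, an odd index ends one.
            return pos % 2 == 0;
        }
        // Miss: binarySearch returns -(insertionPoint) - 1.
        int insertionPoint = -pos - 1;
        return insertionPoint % 2 == 1;
    }

    /** Complement of the set: toggle the presence of boundary 0 at the front. */
    InversionList invert() {
        if (bounds.length > 0 && bounds[0] == 0) {
            return new InversionList(Arrays.copyOfRange(bounds, 1, bounds.length));
        }
        int[] inverted = new int[bounds.length + 1];
        System.arraycopy(bounds, 0, inverted, 1, bounds.length); // inverted[0] stays 0
        return new InversionList(inverted);
    }
}
```

For example, `new InversionList(0x1B00, 0x1B80)` would represent the Balinese block U+1B00 to U+1B7F with two integers, and `invert()` would represent everything outside that block with three.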

Character U+1B33 BALINESE LETTER HA serves as a neutral carrier for vowels. Therefore, when a vowel is written at the beginning of a word, either an independent vowel or the character U+1B33 followed by the appropriate dependent vowel may be used.

Character U+1B05 BALINESE LETTER AKARA serves as a neutral carrier when followed by an appropriate character or sign, so the character may be transliterated to 'e', 'i', 'o', 'u', 'ī', 'ū', 'ĕ', and 'ö'.

Character nania (U+1B2C) is used in consonant clusters between words in the appended form 'ya', so this character may be transliterated to 'ia', e.g. 'siap' and 'tabia'.

Character suku kembung (U+1B2F) is used in consonant clusters between words in the appended form 'wa', so this character may be transliterated to 'ua'.

The transliteration algorithm calls a translate-string function that receives both the input and the output string (see algorithms 13-1 and 13-2). Given an input string s with n = length(s), an iteration over n characters is performed.

In the transliteration algorithm, the most dominant function is searching for a character in the I/O file. The comparison is performed over the Balinese script block (U+1B00-U+1B7F). In the worst case, the comparison is performed 128 times. Given that m is the average number of comparisons in the lookup table, the transliteration algorithm theoretically has a complexity of O(n*m).

Input: character sequence of Balinese script
Output: character sequence of Latin script
1. Perform a block validation of Balinese script in Unicode.
2. Iterate from the first character to the last one.
3. If the character is a base letter, then go to step 4. Otherwise, go to step 7.
4. If the character displays its glyph in the appended form, it means that the character virama is in use.
5. Verify whether the character has a special case as mentioned before.
6. If the character is the character rerekan, then the previous character should be validated. Besides, the previous output should be changed as well.
7. If the character is a dependent vowel, then the previous vowel of the base character should be turned into the appropriate character.
8. Finally, perform character matching using the mapping table in the appropriate form. The provided forms are normal form, appended form, case 1...8 forms, and UNKNOWN_TRANSLATE.

Algorithm 13-1. Character transliteration of Balinese script
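The table-driven core of such a transliteration can be sketched as below, under strong simplifying assumptions: only the characters named in the text are mapped (KA, LA, VOWEL SIGN SUKU, ADEG-ADEG), appended forms, rerekan, and the special cases are ignored, and the value "?" chosen for UNKNOWN_TRANSLATE is a placeholder. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, heavily reduced transliteration table (Balinese to Latin).
class TransliterationSketch {
    static final String UNKNOWN_TRANSLATE = "?"; // placeholder value, an assumption
    private static final char ADEG_ADEG = '\u1B44';
    private static final Map<Character, String> CONSONANTS = new HashMap<Character, String>();
    private static final Map<Character, String> VOWEL_SIGNS = new HashMap<Character, String>();

    static {
        CONSONANTS.put('\u1B13', "k");  // BALINESE LETTER KA
        CONSONANTS.put('\u1B2E', "l");  // BALINESE LETTER LA
        VOWEL_SIGNS.put('\u1B38', "u"); // BALINESE VOWEL SIGN SUKU
    }

    static String toLatin(String balinese) {
        StringBuilder out = new StringBuilder();
        // True while the last consonant still owes its inherent vowel /a/.
        boolean pendingInherentA = false;
        for (int i = 0; i < balinese.length(); i++) {
            char c = balinese.charAt(i);
            if (CONSONANTS.containsKey(c)) {
                if (pendingInherentA) out.append('a');
                out.append(CONSONANTS.get(c));
                pendingInherentA = true;
            } else if (VOWEL_SIGNS.containsKey(c) && pendingInherentA) {
                // A dependent vowel replaces the inherent vowel.
                out.append(VOWEL_SIGNS.get(c));
                pendingInherentA = false;
            } else if (c == ADEG_ADEG && pendingInherentA) {
                // Adeg-adeg removes the inherent vowel.
                pendingInherentA = false;
            } else {
                if (pendingInherentA) out.append('a');
                out.append(UNKNOWN_TRANSLATE);
                pendingInherentA = false;
            }
        }
        if (pendingInherentA) out.append('a');
        return out.toString();
    }
}
```

With this table, the sequence [U+1B13, U+1B2E] yields 'kala'. The sketch uses hash lookups rather than the linear scan over the block that the complexity estimate above assumes, which is a design substitution rather than the paper's method.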


Input: character sequence of Latin script
Output: character sequence of Balinese script
1. Perform a block validation of Latin script in Unicode.
2. If there is a separator character, then the split characters are combined.
3. Iterate from the first character to the last one.
4. Verify whether the character has a special case as mentioned before, e.g. the vowel.
5. Verify whether the input is a balanced word (symmetric word) in an appropriate position.
6. Validate whether the output contains the characters virama and rerekan.
7. Then, perform character matching using the mapping table in the appropriate form. The provided forms are normal form, appended form, case 1...8 forms, and UNKNOWN_TRANSLATE.
8. Finally, verify whether the output belongs to the consonants of Balinese script using the inversion list data structure. If the output is a consonant, then the character virama is appended to the output.

Algorithm 13-2. Inverse Transliteration of Balinese script

14. Searching of Balinese Script

Searching Balinese script poses some challenges: some characters are made of other characters, and such characters may even be combined into a single character. Therefore, searching should work on both pre-composed and decomposed strings.

Searching is performed by validating equivalent forms of Balinese script (see table 14-1), e.g. 'daar' (eat) and 'daara' (eaten), 'baang' (give) and 'baanga' (given).

In the searching algorithm, the most dominant functions are normalization, sorting key creation according to UCA, and binary comparison (algorithm 14-1). The normalization function receives a string input and an array of integers. Given a string input s with n = length(s), it iterates over n characters. This function consists of two processes:
1. Canonical decomposition, i.e. a recursive function that transforms Balinese script into its atomic form. In the best case, the input character is already in atomic form. Otherwise, in the worst case, the input character is processed recursively until it reaches its atomic form. For Balinese script, the recursion depth is one.
2. Canonical reordering, i.e. a function that reorders, by canonical combining class, the Balinese characters already processed by canonical decomposition. In the worst case, iteration is performed until the recursion depth minus one.

Given that p is the average recursion depth, the normalization theoretically has a complexity of O(p).

On average, the collation element function iterates n times, so that an array of integers of size n is produced.

Therefore, this function has a complexity of O(n).

The binary comparison function compares the collation element values. In the worst case, the comparison is performed n times.

Theoretically, the searching algorithm has the following complexity:
T(n) = O(n*O(p) + n*O(n) + n) = O(n*p + n^2 + n)
The largest of the three terms is then taken. Because the recursion depth p is one for Balinese script, the n^2 term dominates.

Table 14-1. The Equivalent Form of Balinese Script

1. U+1B03 BALINESE SIGN SURANG ≡ U+1B2D BALINESE LETTER RA
2. U+1B02 BALINESE SIGN CECEK ≡ U+1B17 BALINESE LETTER NGA ≡ U+1B01 BALINESE SIGN ULU CANDRA
3. U+1B04 BALINESE SIGN BISAH ≡ U+1B33 BALINESE LETTER HA
4. U+1B00 BALINESE SIGN ULU RICEM ≡ U+1B2B BALINESE LETTER MA

Input: character sequence of Balinese script
Output: list of character sequences of Balinese script
1. Perform a block validation of Balinese script in Unicode.
2. Transform the input into the equivalent form of Balinese script according to table 14-1.
3. Normalize the Balinese script characters.
4. Build the sorting key according to UCA.
5. Use the data structure ListSorting and select the appropriate sorting method, either HANACARAKA or SANSKRIT.
6. Perform a binary comparison of the sorting keys.
7. If the values are equal, put the values into the output list.

Algorithm 14-1. Searching of Balinese Script

15. Sorting of Balinese Script

Sorting of Balinese script is performed in one of two ways, depending on which sorting method is used: Sanskrit or HANACARAKA. Balinese script is processed in the following order:
1. Punctuation marks, consisting of seven characters, i.e.: 'carik', 'carik pareren', 'panten', 'pasalinan', 'pamada', 'carik agung', and 'carik pamungkah' (U+1B5A – U+1B60).
2. Numbers, consisting of ten characters (U+1B50 – U+1B59).
3. Letters ordered by the chosen sorting method, i.e. HANACARAKA (table 15-1) or SANSKRIT (table 15-2).
4. Other symbols (U+1B61 – U+1B7C).

Therefore, the sorting algorithm of Balinese script does not depend on the character code point. Besides, the vowels must be ignored when comparing the consonants. The vowels are taken into account only when the consonants are identical. For example, 'krama' (member) appears before 'cakra' (disc) in the Sanskrit sorting method. On the other hand, 'cakra' appears before 'krama' in the HANACARAKA sorting method.

Table 15-1. HANACARAKA Sorting

A > Ā > Ë > Ö > I > Ī > U > Ū > E > AI > O > AU >
ha (bisah) > ha-rerekan > na > nna > ca > cha > ra ([ra-repa = rë], surang) > ka > ka-rerekan > kaf-sasak > khot-sasak > kha >
da > da-rerekan > dha > dda > ddha > ta > tzir-sasak > tha > tta > ttha >
sa > zal-sasak > asyura-sasak > sha > ssa > wa > wa-rerekan > ve-sasak > la ([la-lenga = lë]) >
ma (ulu ricem) > ga > ga-rerekan > gha > ba > bha > nga (cecek, ulu candra) > nga-rerekan >
pa > pa-rerekan > ef-sasak > pha > ja > ja-rerekan > jha > ya > nya

Table 15-2. SANSKRIT Sorting

A > Ā > Ë > Ö > I > Ī > U > Ū > RË > RÖ > LË > LÖ > E > AI > O > AU >
ka > ka-rerekan > kaf-sasak > khot-sasak > kha > ga > ga-rerekan > gha > nga (cecek, ulu candra) > nga-rerekan >
ca > cha > ja > ja-rerekan > jha > nya >
tta > ttha > dda > ddha > nna >
ta > tzir-sasak > tha > da > da-rerekan > dha > na >
pa > pa-rerekan > ef-sasak > pha > ba > bha > ma (ulu ricem) >
ya > ra (surang) > la > wa > wa-rerekan > ve-sasak >
sha > ssa > sa > zal-sasak > asyura-sasak > ha (bisah) > ha-rerekan

In the sorting algorithm, the most dominant functions are normalization, sorting key creation according to UCA, and binary comparison (algorithm 15-1).

Input: pair of character sequences of Balinese script
Output: character sequence order (-1=smaller, 0=equal, 1=bigger)
1. Perform a block validation of Balinese script in Unicode.
2. Normalize the Balinese script characters.
3. Build the sorting key according to UCA.
4. Use the data structure ListSorting and select the appropriate sorting method, either HANACARAKA (see table 15-1) or SANSKRIT (see table 15-2).
5. Perform a binary comparison of the sorting keys.

Algorithm 15-1. Sorting of Balinese Script

The normalization function receives a string input and an array of integers. Given a string input s with n = length(s), it iterates over n characters. This function consists of two processes:
1. Canonical decomposition, i.e. a recursive function that transforms Balinese script into its atomic form. In the best case, the input character is already in atomic form. Otherwise, in the worst case, the input character is processed recursively until it reaches its atomic form. For Balinese script, the recursion depth is one.
2. Canonical reordering, i.e. a function that reorders, by canonical combining class, the Balinese characters already processed by canonical decomposition. In the worst case, iteration is performed until the recursion depth minus one.

Given that p is the average recursion depth, the normalization theoretically has a complexity of O(p).

On average, the collation element function iterates n times, so that an array of integers of size n is produced. Therefore, this function has a complexity of O(n).

The binary comparison function compares the collation element values. In the worst case, the comparison is performed n times.

Theoretically, the sorting algorithm has the following complexity:
T(n) = O(n*O(p) + n*O(n) + n) = O(n*p + n^2 + n)
The largest of the three terms is then taken. Because the recursion depth p is one for Balinese script, the n^2 term dominates.

16. Word Boundary of Balinese Script

... consonant and consonant cluster). This algorithm uses a stack data structure to retrieve a valid character cluster and then process it. The validity of a piece of the character sequence is inferred by comparing the sequence with the data stored in the dictionary. There are three possible outputs of the word boundary algorithm, i.e.:
1. The character sequences are separated completely and correctly (the best case).
2. The character sequences are separated, but only partially, because the input may be invalid or the data in the dictionary may be incomplete. However, the algorithm returns the best word boundary obtained according to the statistical principle (the attempt that produces the maximum number of Balinese script ...
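The dictionary lookup with a stack described above can be illustrated by a small backtracking segmenter. This is only a sketch under stated assumptions: it works on plain strings (for instance Latin transliterations), it returns a result only for the best case of a complete segmentation, and it does not attempt the statistical choice of a partial result. The class name, method name, and dictionary contents are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical dictionary-based word boundary search with an explicit stack.
class WordBoundarySketch {

    /** Returns the words of one complete segmentation, or an empty list if none exists. */
    static List<String> segment(String text, Set<String> dictionary) {
        int n = text.length();
        Deque<int[]> stack = new ArrayDeque<int[]>(); // frames: {wordStart, wordEnd}
        boolean[] deadEnd = new boolean[n + 1];       // positions known not to reach the end
        int start = 0;
        int maxEnd = n; // longest candidate end still untried for the current start

        while (start < n) {
            int found = -1;
            // Prefer the longest dictionary word beginning at 'start'.
            for (int end = maxEnd; end > start; end--) {
                if (!deadEnd[end] && dictionary.contains(text.substring(start, end))) {
                    found = end;
                    break;
                }
            }
            if (found != -1) {
                stack.push(new int[] {start, found});
                start = found;
                maxEnd = n;
            } else {
                // No word fits here: remember that and backtrack to a shorter previous word.
                deadEnd[start] = true;
                if (stack.isEmpty()) {
                    return new ArrayList<String>();
                }
                int[] previous = stack.pop();
                start = previous[0];
                maxEnd = previous[1] - 1;
            }
        }

        List<String> words = new ArrayList<String>();
        Iterator<int[]> frames = stack.descendingIterator(); // oldest frame first
        while (frames.hasNext()) {
            int[] frame = frames.next();
            words.add(text.substring(frame[0], frame[1]));
        }
        return words;
    }
}
```

With a dictionary containing 'bakta' and 'kala', the input "baktakala" would be split into those two words, while an input containing an unknown remainder would yield an empty list.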

outputs as explained above.

Algorithm 16-1. Balinese Script Word Boundary

17. Software Specification

The applications built for text processing based on the Unicode standard are B-Linguist, B-Unicode, and a desktop application. These applications require a JVM (Java Virtual Machine) with the following specification: B-Linguist and B-Unicode are developed on J2ME WTK2, whereas the desktop application is developed on J2SDK 1.5.
B-Linguist is a mobile dictionary application that translates from Balinese to Indonesian and from Balinese to English. B-Unicode is a mobile text processing application for Balinese script according to the Unicode standard. It covers four main functions: transliteration, searching, sorting, and word boundary. The desktop application serves as the required dictionary and transliteration generator.
In addition, the functions of the B-Unicode application are also implemented in the desktop application through some simple drivers in order to test and verify the correctness of the designed algorithms and library. Those drivers run in TUI (Text User Interface) mode.
In conclusion, picture 17-1 shows that this software involves a mobile device network. The hardware used in this research is:
1. A Sony Ericsson K750i mobile phone with an active SIM card. The phone runs B-Linguist and B-Unicode. It also sends feedback to the developer via SMS.
2. A PC that runs the desktop application to generate the files required by both B-Linguist and B-Unicode.
3. A data cable or Bluetooth connection that connects the phone and the computer.
4. A Pocket PC that receives SMS messages from the mobile application user.

18. Testing

Testing is performed on a PC with the following specifications:
1. Processor : AMD Athlon XP 1600+
2. Memory : 1 GB
3. Hard disk : 20 GB

Testing is divided into two parts, i.e. software testing (see table 18-1) and input keyboard testing (see table 18-2).

Table 18-1. Software Testing

No.  Name                                           Description
     B-Linguist
1    Balinese to English                            Successfully tested
2    Balinese to Indonesian                         Successfully tested
3    Balinese Index to English                      Successfully tested
4    Balinese Index to Indonesian                   Successfully tested
5    Change Language                                Successfully tested
6    Send SMS                                       Successfully tested
     B-Unicode
7    Balinese Transliteration to Latin              Successfully tested
8    Latin Transliteration to Balinese              Successfully tested
9    Searching                                      Successfully tested
10   HANACARAKA Sorting                             Successfully tested
11   SANSKRIT Sorting                               Successfully tested
12   Word Boundary                                  Successfully tested
13   Change Language                                Successfully tested
14   Send SMS                                       Successfully tested
     Desktop Application
15   Balinese Transliteration Generator to Latin    Successfully tested
16   Latin Transliteration Generator to Balinese    Successfully tested
17   Unicode Code Generator                         Successfully tested
18   HANACARAKA Generator                           Successfully tested
19   SANSKRIT Generator                             Successfully tested

Table 18-2. Input Keyboard Testing

No.  Input keyboard                                 Description
1    NULL                                           Successfully tested
2    Only consonants                                Successfully tested
3    Both consonants and vowels                     Successfully tested
4    Both consonant clusters and vowels             Successfully tested
5    All special cases (e.g., in searching testing) Successfully tested

Transliteration Algorithm, t = O(n*m). Execution time (milliseconds) versus input (n):

Input (n)            4    6    8    10   12   14   20
Big O(n) with m=5    20   30   40   50   60   70   100
Test result          15   31   47   47   47   78   281

Picture 18-1. Big O(n) Testing of Transliteration Algorithm
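The O(n*m) curve for transliteration follows if each of the n input units is resolved by scanning a lookup table at an average cost of m comparisons. The following is a minimal Java sketch of that idea, not the B-Unicode implementation: the class name TransliterationSketch, the method toBalinese, the five-entry table, and the assumption that the input is already split into Latin syllables are all illustrative.

```java
final class TransliterationSketch {
    // Hypothetical lookup table: Latin syllable -> Balinese letter.
    // A real table would cover the whole script; these five entries are illustrative.
    private static final String[] LATIN = { "ha", "na", "ca", "ra", "ka" };
    private static final char[] BALINESE = { '\u1B33', '\u1B26', '\u1B18', '\u1B2D', '\u1B13' };

    // Linear scan of the table: at most m = LATIN.length comparisons per syllable.
    private static int lookup(String syllable) {
        for (int i = 0; i < LATIN.length; i++) {
            if (LATIN[i].equals(syllable)) {
                return i;
            }
        }
        return -1;
    }

    // n syllables, each resolved with up to m comparisons: O(n*m) overall.
    static String toBalinese(String[] syllables) {
        StringBuilder out = new StringBuilder();
        for (String s : syllables) {
            int i = lookup(s);
            if (i < 0) {
                out.append(s);            // unknown syllables are copied unchanged
            } else {
                out.append(BALINESE[i]);
            }
        }
        return out.toString();
    }
}
```

Calling TransliterationSketch.toBalinese(new String[] { "ha", "na" }) returns the two-character string U+1B33 U+1B26. Replacing the linear scan with a hash table would remove the factor m, at the cost of more memory on a handheld device.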

Inverse Transliteration Algorithm, t = O(n*m). Execution time (milliseconds) versus input (n):

Input (n)            2    3    4    5    6    8    10
Big O(n) with m=5    10   15   20   25   30   40   50
Test result          15   15   15   15   15   15   15

Picture 18-2. Big O(n) Testing of Inverse Transliteration Algorithm

Sorting Algorithm, t = O(n^2). Execution time (milliseconds) versus input (n):

Input (n)            4    6    10   14
Big O(n)             16   36   100  196
Test result          47   78   156  203

Picture 18-5. Big O(n) Testing of Sorting Algorithm
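The sorting pipeline measured here (normalization, element collation into an array of integers, then binary comparison of the keys) can be sketched in Java as follows. This is a sketch under stated assumptions rather than the paper's library: java.text.Normalizer (available from Java 6, so not in J2SDK 1.5) stands in for the paper's own normalization function, the HANACARAKA string lists only the eighteen base consonants in the traditional ha-na-ca-ra-ka order, and the names BalineseSortSketch, sortKey, compareKeys, and sort are hypothetical.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class BalineseSortSketch {
    // Assumed, simplified ordering table: ha na ca ra ka da ta sa wa la ma ga ba nga pa ja ya nya.
    static final String HANACARAKA =
        "\u1B33\u1B26\u1B18\u1B2D\u1B13\u1B24\u1B22\u1B32\u1B2F"
        + "\u1B2E\u1B2B\u1B15\u1B29\u1B17\u1B27\u1B1A\u1B2C\u1B1C";

    // Normalization followed by element collation: one integer per character.
    static int[] sortKey(String s, String order) {
        // NFD performs canonical decomposition and canonical reordering.
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        int[] key = new int[nfd.length()];
        for (int i = 0; i < nfd.length(); i++) {
            char c = nfd.charAt(i);
            int rank = order.indexOf(c);
            // Characters missing from the table sort after all listed letters, by code unit.
            key[i] = rank >= 0 ? rank : order.length() + c;
        }
        return key;
    }

    // Binary comparison of two sort keys, element by element.
    static int compareKeys(int[] a, int[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return a[i] < b[i] ? -1 : 1;
            }
        }
        return a.length - b.length;   // the shorter key sorts first on a shared prefix
    }

    // Returns a sorted copy of the word list under the given ordering table.
    static List<String> sort(List<String> words, final String order) {
        List<String> copy = new ArrayList<String>(words);
        Collections.sort(copy, new Comparator<String>() {
            public int compare(String x, String y) {
                return compareKeys(sortKey(x, order), sortKey(y, order));
            }
        });
        return copy;
    }
}
```

Sorting the letters ka (U+1B13), na (U+1B26), and ha (U+1B33) with the HANACARAKA table yields ha, na, ka, whereas plain code point order would yield ka, na, ha. Switching to a SANSKRIT ordering only requires passing a different table string. The sketch recomputes keys inside the comparator for brevity; caching one key per word would avoid the repeated work.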

Searching Algorithm (comparison), t = O(n^2). Execution time (milliseconds) versus input (n):

Input (n)            4    6    10   14
Big O(n)             16   36   100  196
Test result          47   78   156  203

Picture 18-3. Big O(n) Testing of Searching Algorithm (comparison)

Word Boundary Algorithm, t = O(m*n^2). Execution time (milliseconds) versus input (n):

Input (n)            6     12    20     40     116     234
Big O(n) with m=20   720   2880  8000   32000  269120  1E+06
Test result          1109  2829  10000  21797  126000  193875

Picture 18-6. Big O(n) Testing of Word Boundary Algorithm
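The word boundary algorithm tested here is described as using a stack to retrieve candidate clusters, a dictionary to validate them, and a best-effort result when the input cannot be separated completely. A minimal Java sketch of that behaviour is given below. It is an illustration under assumptions, not the paper's Algorithm 16-1: the class WordBoundarySketch and method segment are hypothetical, Latin placeholder strings are used instead of Balinese text for readability, and "best" is taken to mean the segmentation that covers the longest prefix of the input.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;

final class WordBoundarySketch {
    // Returns the dictionary words of the segmentation covering the longest input prefix.
    static List<String> segment(String text, Set<String> dictionary) {
        // Each stack entry is a partial segmentation, stored as its list of cut positions.
        Deque<List<Integer>> stack = new ArrayDeque<List<Integer>>();
        List<Integer> start = new ArrayList<Integer>();
        start.add(0);
        stack.push(start);
        List<Integer> best = start;

        while (!stack.isEmpty()) {
            List<Integer> cuts = stack.pop();
            int pos = cuts.get(cuts.size() - 1);
            if (pos > best.get(best.size() - 1)) {
                best = cuts;              // remember the furthest-reaching attempt
            }
            if (pos == text.length()) {
                break;                    // complete separation found (the best case)
            }
            // Push every dictionary match starting at pos; the longest is pushed last,
            // so it is popped first and shorter matches serve as backtracking points.
            for (int end = pos + 1; end <= text.length(); end++) {
                if (dictionary.contains(text.substring(pos, end))) {
                    List<Integer> next = new ArrayList<Integer>(cuts);
                    next.add(end);
                    stack.push(next);
                }
            }
        }

        List<String> words = new ArrayList<String>();
        for (int i = 1; i < best.size(); i++) {
            words.add(text.substring(best.get(i - 1), best.get(i)));
        }
        return words;
    }
}
```

With the dictionary {ak, aksara, ba, bali}, segment("aksarabali", dictionary) returns [aksara, bali], and segment("balix", dictionary) returns the partial result [bali] because the trailing character has no dictionary entry. The backtracking search can revisit positions, so a production version would likely memoize positions already shown to fail.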

Searching Algorithm, t = O(n^2). Execution time (milliseconds) versus input (n):

Input (n)            4    8    10    14
Big O(n) with m=10   160  640  1000  1960
Test result          516  734  1000  1015

Picture 18-4. Big O(n) Testing of Searching Algorithm

19. Implementation of Application

The result of the GUI design implementation can be seen in pictures 19-1, 19-2, and 19-3.

Picture 19-1. User Interface of B-Linguist

Picture 19-2. User Interface of B-Unicode

Picture 19-3. User Interface of Desktop

20. Conclusion and Suggestion

Based on the results of the analysis and research conducted, several conclusions can be drawn:
1. The system has successfully performed Balinese Unicode text processing covering the transliteration, searching, sorting, and word boundary functions. Moreover, a general, multiplatform library has been successfully built. This library runs well on both computers and handheld devices as long as a JVM (Java Virtual Machine) is installed on them.
2. The algorithmic complexities of transliteration, searching, sorting, and word boundary are O(n*m), O(n^2), O(n^2), and O(m*n^2), respectively, where n = length(string input) and m = the average number of comparisons in the lookup table.
3. Unicode has been shown to be supported by many operating systems and browsers. The current trend in global software technology involves the use of the Unicode standard and the availability of supporting devices.
4. Building software capable of processing words containing Balinese characters requires good cooperation between two fields of study, i.e. computer science and Balinese culture and language science. Without mastery of both fields, the software would not handle all possible cases.
5. When analyzing software requirements, communication with the user must take place frequently so that the software meets the user's demands.

Besides the conclusions, there are several suggestions that can be considered:
1. An expert in Balinese script is required to help develop the software, specifically for:
   a. Grouping letters.
   b. Recognizing Balinese character properties.
   c. Analyzing letter combinations.
   d. Evaluating the correctness of the execution flow of the software.
2. Communication with the user can be conducted through an interview or a questionnaire to learn specifically about the following:
   a. Keyboard layout compatibility for Balinese letters.
   b. The means of communication with the software, including the language used when interacting.
3. The SIM card used for updating the dictionary should be a prepaid one, so that bill control is not required.

21. Bibliography

[CON05] Constable, Peter (2005). Microsoft. Comments on Balinese Proposal, L2/05-008.
[EVE05] Everson, Michael and I Made Suatjana (2005). Proposal for Encoding Balinese Script in the UCS.
[GIL03] Gillam, Richard (2003). Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard. Addison-Wesley Professional.
[MAR91] Martha, IBM Jaya (1991). Perangkat Lunak Teks Editor Berhuruf Bali.
[MED96] Medra, I Nengah, et al. (1996). Pembinaan Bahasa, Aksara, dan Sastra Bali: Pedoman Penulisan Papan Nama dengan Aksara Bali. Dinas Kebudayaan Prop. Dati I Bali.
[NIK05] Nikanaya, I Nyoman (2005). Surat Rekomendasi Nomor 042/84/DISBUD tentang Temu Wicara Aksara Bali (http://anubis.dkuug.dk/JTC1/SC2/WG2/Bali_Government_n2916.pdf).
[SUT03] Sutjaja, I Gusti Made (2003). Kamus Sinonim Bahasa Bali.
[UNI03] The Unicode Consortium (2003). The Unicode Standard, Version 4.0. Addison-Wesley Professional.
[YBG05] Galang, Yayasan Bali (2005). Komputerisasi Aksara Bali. March 2005 (http://www.babadbali.com/aksarabali).
[WIK05] Wikipedia (2005). The Free Encyclopedia. December 2005 (http://www.en.wikipedia.org).