A Revised Unicode Based Sorting Algorithm for Bengali Texts

International Journal of Computer Applications (0975 – 8887) Volume 147 – No.14, August 2016

Md. Mahfuzur Rahaman
Dept. of Computer Science and Engineering
Shahjalal University of Science and Technology
Sylhet – 3114, Bangladesh

ABSTRACT
This paper describes a sorting algorithm for Bengali texts, which is one of the most vital tasks for Bengali Natural Language Processing. As Unicode is much preferable to ASCII encoding, we need to use this representation for the Bengali language. But due to some distinct properties of the Bengali language, Bengali texts cannot be sorted directly using the order of the Unicode character scheme. A few works have been done on this topic: some of them are for ASCII encoding, while others are for Unicode. They still have some drawbacks, and there is still no standard for sorting Bengali texts. In this paper, we discuss the previous approaches and propose a revised and easier procedure to sort Unicode Bengali texts. We use a mapping to simplify the sorting process. The efficiency of the method depends on the efficiency of the underlying sorting algorithm. This method is able to sort any Unicode Bengali text. It will also work for Unicode text of any language if we just change the mapping part. So the process is both keyboard and language independent.

General Terms
Theoretical Informatics

Keywords
Bengali Word Sorting, Bengali Text Sorting, Unicode Bengali Text Sorting, Bengali Linguistic Sort, Bengali Dictionary Sort, Bangla Academy Dictionary Based Sort.

1. INTRODUCTION
Bengali or Bangla is an Indo-Aryan language spoken predominantly in Bangladesh and in the Indian states of West Bengal and Tripura [1]. With about 250 million native and about 300 million total speakers worldwide, it is the second most spoken language in the Indian subcontinent, the seventh most spoken language in the world by total number of native speakers and the tenth most spoken language by total number of speakers [1][2]. This language is derived from Sanskrit and hence appears to be similar to Hindi [3]. It is written left-to-right, top-to-bottom of the page. The vocabulary of the Bengali language is similar to Sanskrit, and there are to some extent similarities with Latin. As it is one of the most spoken languages and it has some complexities in its structure, it becomes a fundamental necessity to have some standardization, such as a Bengali keyboard layout, Bengali character recognition, and voice processing like speech to text or text to speech. Bengali text sorting is the first issue that needs to be standardized. There are some papers on this topic, but none of them could set a standard for Bengali text sorting. In this paper, we show some analysis, drawbacks and limitations of the previous works. We also propose a revised procedure that can be used as a standard procedure to sort Bengali texts. This procedure is easy to comprehend and easy to implement in any programming language. It sorts Bengali texts in Unicode representation according to the Bangla Academy [4] standard. As Bangla Academy is Bangladesh's national language authority [5] and the national academy for promoting the Bengali language in Bangladesh, we need to follow Bangla Academy to set the standard for Bengali linguistic works.

2. BENGALI LANGUAGE
The Bengali language is written using the Bengali alphabet, which is the 6th most widely used writing system in the world. The script is shared by Assamese with minor variants and is the basis for other writing systems like Meithei and Bishnupriya Manipuri [6]. The script has also been used to write Sanskrit in the region of Bengal.

2.1 Base Letters
There are 11 vowels and 39 consonants in the written form of the Bengali alphabet. When we use these letters in full form, we call them base letters.

Independent Vowels (স্বরবর্ণ)
অ আ ই ঈ উ ঊ ঋ এ ঐ ও ঔ

Consonants (ব্যঞ্জনবর্ণ)
ক খ গ ঘ ঙ চ ছ জ ঝ ঞ ট ঠ ড ঢ ণ ত থ দ ধ ন প ফ ব ভ ম য র ল শ ষ স হ ড় ঢ় য় ৎ ং ঃ ঁ

2.2 Modifiers
There are two types of modifiers in the Bengali alphabet: vowel modifiers and consonant modifiers.

Dependent Vowels or Vowel Modifiers (-কার)
10 of the 11 vowels are used as modifiers to consonants. They are called vowel modifiers and are generally known as -কার. They can never be used independently. Following is the list of vowel modifiers with examples:

Table 1. List of Vowel Modifiers with Examples
Vowel    Vowel Modifier    Example
আ        ◌া                কা
ই        ◌ি                কি
ঈ        ◌ী                কী
উ        ◌ু                কু
ঊ        ◌ূ                কূ
ঋ        ◌ৃ                কৃ
এ        ◌ে                কে
ঐ        ◌ৈ                কৈ
ও        ◌ো                কো
ঔ        ◌ৌ                কৌ

Consonant Modifiers (-ফলা)
Like the vowel modifiers, some consonants have short forms when they are used with another consonant. They are called consonant modifiers and are generally known as -ফলা. Some of them are listed below with examples:

Table 2. List of Consonant Modifiers with Examples
Consonant    Consonant Modifier    Example
ন            ন-ফলা                 যত্ন
ম            ম-ফলা                 আত্মা
য            য-ফলা                 জন্য
র            র-ফলা                 প্রতি
ল            ল-ফলা                 [?]ক্ল
ব            ব-ফলা                 জ্বর

2.3 Compound Characters
When two or more consonant characters are used together, they are called compound characters. There are about 285 compound characters in Bengali [7]. Some examples of compound characters are listed below:

Table 3. Some Compound Characters with usage
Word        Compound Character    Decomposed Form          No. of Alphabets Used
উজ্জ্বল       জ্জ্ব                   জ + ◌্ + জ + ◌্ + ব       3
উচ্ছ্বাস      চ্ছ্ব                   চ + ◌্ + ছ + ◌্ + ব       3
দ্বন্দ্ব       দ্ব                    দ + ◌্ + ব                2
দ্বন্দ্ব       ন্দ্ব                   ন + ◌্ + দ + ◌্ + ব       3
বৃষ্টি        ষ্ট                    ষ + ◌্ + ট                2
মুক্তি        ক্ত                    ক + ◌্ + ত                2

2.4 Alphabetical order of Bangla Academy
Generally, we use the following alphabetical order everywhere:

অ আ ই ঈ উ ঊ ঋ এ ঐ ও ঔ
ক খ গ ঘ ঙ চ ছ জ ঝ ঞ ট ঠ ড ঢ ণ ত থ দ ধ ন প ফ ব ভ ম য র ল শ ষ স হ ড় ঢ় য় ৎ ং ঃ ঁ

But ং, ঃ and ঁ are used like modifiers, and they cannot be used without another letter. Though many compound characters are made up with consonant modifiers, they can also be written with the conjunct character (◌্) between two consonants. To simplify these kinds of complexities, Bangla Academy uses the following order for Bengali words in its dictionary:

অ আ ই ঈ উ ঊ ঋ এ ঐ ও ঔ ং ঃ ঁ
ক খ গ ঘ ঙ চ ছ জ ঝ ঞ ট ঠ ড ড় ঢ ঢ় ণ ত ৎ থ দ ধ ন প ফ ব ভ ম য য় র ল শ ষ স হ

We followed this alphabetic order to sort Bengali texts in our approach.

3. DIFFICULTIES TO SORT BENGALI TEXTS
The problems associated with sorting Bengali texts are as follows:

• Bengali words should be sorted according to the Bangla Academy [4] standard. But the Unicode representation of the Bengali alphabet is not in Bangla Academy dictionary order. So, a mapping is required to sort texts.
• Compound characters with a consonant modifier or the conjunct character make Bengali sorting more complex.
• Vowel modifiers can precede or follow the base letter in Bengali text, but the modifier should be considered after the base letter in computation for proper sorting.
• The Unicode characters র, য়, ড়, ঢ় can be written in two ways: as a single character or as a compound character with the ◌় character.
• The two vowel modifiers ◌ো and ◌ৌ can each be written as a single Unicode character or as two modifiers, one preceding and one following.
• Ambiguity between র্য and র‌্য adds a bit more complexity to sorting Bengali texts. In both cases we get র + ◌্ + য, but they are not the same (র‌্য = র + ZWNJ + ◌্ + য).

4. PREVIOUS WORKS
Md. Ruhul Amin et al. [8] proposed an efficient Unicode based sorting algorithm for Bengali words. They used a null modifier (৹), which is not mandatory. This approach cannot sort texts in the following situation:

Table 4. Situation that cannot be solved by [8]
Word      Decomposed Form         Representation with mapped value
বসতি      ব ৹ স ৹ ত ◌ি            520161014503
বস্‌তি     ব ৹ স ◌্ ত ◌ি           520161124503
বস্তি      ব ৹ স ◌্ ত ◌ি           520161124503

We actually get বস্‌তি = ব + ৹ + স + ◌্ + ZWNJ + ত + ◌ি, where ZWNJ is not mentioned in their process. So their algorithm will treat both বস্‌তি and বস্তি as the same word.

Aamira Shabnam et al. [9] described an easily comprehensible Unicode based sorting algorithm for Bangla words. They did not use any null modifier and used single digit mapping.

Table 5. Situation not handled by [9]
Word      Decomposed Form         Representation with mapped value
কলম       ক + ল + ম               255652
কলাম      ক + ল + ◌া + ম          2556052

If the mapped strings are sorted in lexicographical order, we will get কলাম before কলম, which is not correct.

Aamira Shabnam et al. [10] also described a faster approach to sort Unicode represented Bengali words. This paper has the same drawbacks as the previous one. In addition, the order mentioned in its discussion is different from the Bangla Academy standard: they used just the regular sequence of the Bengali alphabet.

Partha Sarathi Kar et al. [11] proposed an improved Unicode based sorting algorithm for Bengali words.

We assume that র, য়, ড়, ঢ় are each made up of a single character, not a conjunct with the ◌় character. ◌ো and ◌ৌ are also each assumed to be a single modifier.

5.2 Mapping
Our proposed mapping scheme is listed below. We are proposing at least two digits for each letter or modifier.

Table 6. Mapping for our proposed method
Unicode Value    Character    Mapped Value
200C             ZWNJ         00
200D             ZWJ          01
0985             অ            02
0986             আ            03
0987             ই            04
0988             ঈ            05
0989             উ            06
098A             ঊ            07
098B             ঋ            08
098F             এ            09
0990             ঐ            10
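
As a rough illustration of the two-digit mapping idea in Table 6, the following Python sketch builds a sort key by concatenating one fixed-width code per code point. It covers only the eleven rows of Table 6 visible in this excerpt. The names `TABLE6_CODES`, `sort_key` and `sort_words` are hypothetical, and raising an error on unmapped characters is an assumption made here, not something the paper specifies.

```python
# Minimal sketch (not the author's implementation): fixed-width sort keys
# built from the rows of Table 6 that appear in this excerpt.
TABLE6_CODES = {
    "\u200C": "00",  # ZWNJ
    "\u200D": "01",  # ZWJ
    "\u0985": "02",  # অ
    "\u0986": "03",  # আ
    "\u0987": "04",  # ই
    "\u0988": "05",  # ঈ
    "\u0989": "06",  # উ
    "\u098A": "07",  # ঊ
    "\u098B": "08",  # ঋ
    "\u098F": "09",  # এ
    "\u0990": "10",  # ঐ
}


def sort_key(word: str) -> str:
    """Concatenate the two-digit code of every code point in `word`.

    Assumption: characters missing from the partial table raise KeyError.
    """
    return "".join(TABLE6_CODES[ch] for ch in word)


def sort_words(words: list[str]) -> list[str]:
    """Sort words by their mapped keys using Python's built-in sort."""
    return sorted(words, key=sort_key)


if __name__ == "__main__":
    plain = "\u0986\u0987"             # আ + ই
    with_zwnj = "\u0986\u200C\u0987"   # আ + ZWNJ + ই
    for w in sort_words([plain, with_zwnj]):
        print(sort_key(w), [f"U+{ord(c):04X}" for c in w])
```

Because every code has the same width, comparing the concatenated keys as strings is equivalent to comparing the words code by code. Under this partial table ZWNJ (code 00) sorts before every letter, whereas raw code point order would place U+200C after the whole Bengali block.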
