Overview of Unicode and Indian Scripts

CHAPTER 2: Overview of Unicode and Indian Scripts

Introduction
History and Development of Human Languages
History and Development of Scripts
Character Representation in Computers
Brief History of Character Representation
ASCII (American Standard Code for Information Interchange)
Unicode
Principles of Unicode Standard
Observations of Mapping Table
Transliteration Algorithm
Conclusion
References

2.1 INTRODUCTION

The Internet is being populated with resources in many of the world's languages. Technology has enabled even the smallest community groups to work in their own language and literature. The Internet can deliver information about any domain, which library professionals can collect, organize and provide to users. Librarians should aim to make information easily accessible to the user community. Providing organized information services over the network, together with the application of IT, has brought about the concept of the digital library. With information available in different languages and scripts, the digital library demands multilingual access to information. Representing information in more than 7150 languages (1) in their respective scripts is a real challenge for the IT industry as well as for the library community. The system should have the intelligence to recognize different scripts and display them accordingly. One of the major problems here is the diversity of human languages and scripts.

To some extent transliteration comes to the rescue, at least in helping the user understand what a document is about. In India in particular, many people know at least two or three languages. This is especially common in South India, where people often know two or three South Indian languages but not their scripts. In such situations transliteration can cut across the script barrier to some extent. A user can also read fields such as author name, publisher and place, which are pronounced the same way in any language.
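How transliteration can cut across the script barrier may be illustrated with a minimal sketch. It is not the transliteration algorithm of the present work; it only assumes the well-known layout of the Unicode standard, in which the Devanagari, Bengali, Telugu and Kannada blocks each span 128 code points and place corresponding letters at the same offset within the block. The function name transliterate and the table BLOCK_START are hypothetical names introduced for illustration.

```python
# Start code points of four Indic blocks in Unicode (each block is 0x80 wide).
# BLOCK_START and transliterate are illustrative names, not part of any library.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
}


def transliterate(text, source, target):
    """Move each character of the source block to the same offset in the target block."""
    src = BLOCK_START[source]
    dst = BLOCK_START[target]
    result = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + 0x80:
            # Same position inside the block, different script.
            result.append(chr(cp - src + dst))
        else:
            # Spaces, digits, Latin letters and punctuation pass through unchanged.
            result.append(ch)
    return "".join(result)


# A name field written in Kannada script: RA, vowel sign AA, MA ("Rama").
kannada_name = "\u0cb0\u0cbe\u0cae"
print(transliterate(kannada_name, "kannada", "telugu"))
print(transliterate(kannada_name, "kannada", "devanagari"))
```

A plain offset shift is only a first approximation. Some positions are assigned in one block and unassigned in another, so a practical system would need a mapping table that handles such exceptions.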
Libraries can hold documents in many languages and scripts, and should ensure that the user community can access these documents irrespective of the language in which they are published. A translation service is one approach to the problem. But before that, the user must be able to find out that a document of interest exists, so that he or she can ask for a translation. To solve this problem the library would also have to generate a catalogue entry for the document in all languages, which is a difficult task. The only ray of hope is transliteration algorithms and codification of data for controlled translation of catalogue entries to the extent possible. Transliteration and data codification of the catalogue may give the user an idea of the content of the document.

Character representation in computers has mostly been done with ASCII (American Standard Code for Information Interchange). ASCII is a 7-bit character encoding system. With the development of the Internet, multilingual communication has grown to a great extent, and this has shown that ASCII is not adequate for handling multilingual data. When these problems were realized, a consortium of Xerox, IBM and others was formed to develop a character representation code that could handle all the scripts of the world. The outcome of these efforts was the Unicode standard, which promises to handle all the scripts of the world (2).

There are two parts to multilingual communication. The first is the spoken part, which we understand as language. In general, language is understood to be the gesticulation or phonetic representation of one's feelings. The second part is the recording of such representation for future use or communication, which is done by making permanent or temporary marks on a surface; this is known as script. Thus scripts are signs or symbols which carry some meaning. It is evident that language was born first and scripts came later.
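The 7-bit limit of ASCII noted in this section can be checked directly. The short sketch below is only an illustration: it tests whether every character of a string has a code point below 128, and lists the Unicode code points of the characters. The helper names fits_in_ascii and code_points are hypothetical, and the Kannada sample is given as code-point escapes so that the example does not depend on the font or editor in use.

```python
def fits_in_ascii(text):
    """Return True if every character has a code point below 128 (fits in 7 bits)."""
    return all(ord(ch) < 128 for ch in text)


def code_points(text):
    """List the Unicode code point of each character in U+XXXX notation."""
    return ["U+%04X" % ord(ch) for ch in text]


english = "Library"
kannada = "\u0cb0\u0cbe\u0cae"  # Kannada letters RA, vowel sign AA, MA

for sample in (english, kannada):
    print(fits_in_ascii(sample), code_points(sample))

# Encoding the Kannada sample as ASCII fails, while a Unicode encoding such as UTF-8 succeeds.
try:
    kannada.encode("ascii")
except UnicodeEncodeError:
    print("not representable in ASCII")
print(len(kannada.encode("utf-8")), "bytes in UTF-8")
```

The English sample stays within the 128 positions of ASCII, whereas the Kannada sample needs code points well beyond that range, which is the gap that Unicode was designed to close.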
One of the objectives of the present work is to transliterate data across Indian languages. For practical purposes and for demonstration, Hindi, Bengali, Telugu and Kannada are taken into consideration. The present work does not attempt to transliterate data from English to Indian language scripts or vice versa.

2.2 HISTORY AND DEVELOPMENT OF HUMAN LANGUAGES

Language began with the genesis of life. The very act of expressing joy and sorrow is itself a form of language that nature has given to humans. Animals use different forms of language, such as body language, contact language or gesture language, to express their feelings (3). According to Webster's Dictionary, a language is (4) "a systematic means of communicating ideas or feelings by the use of conventionalized signs, sounds, gestures, or marks having understood meanings". That is, a language is not necessarily a vocal phenomenon; it also includes gestures, visual signs and so on. A very good definition of language is given on the Indiana University website, which describes the meaning of language in different senses (5):

> For a person, language exists in his brain.
> For a pair or a number of persons, language can only be described in terms of the interaction between a speaker (writer, signer) and a hearer (reader, sign observer).
> For a community, language is a system that is shared by the members of a community, evolving over time as the community evolves.

A number of research projects trace the cultural evolution and divergence of humans through linguistic development. It is an established fact that mankind originated in Africa, which implies that the first language, whatever it was, would also have been born in Africa. The question then arises: 'Why are there so many languages in the world?'
The mythological story of the 'Tower of Babel' tells how God felt envious of Man's collaborative endeavour and 'blessed' him with different languages, so that people could not communicate among themselves and could not carry out further collaborative work that might one day make them equal to God (6). Though the story does not hold true, it points to geographical separation as the cause of lingual divergence. This separation happened because of Man's migration in search of food and shelter.

Besides geographical diversity as the prime cause of language divergence, small variations are found in language usage among people of the same community. Each of us commands a definite sphere of words, which keeps growing throughout life, but in actual usage we do not use all the words in our lingual sphere. This gives rise to a personal style, often seen in renowned authors. Not only are we often identified by our style of writing; a person also tends to commit the same kinds of mistakes while talking, whether out of ignorance or an inability to pronounce certain sounds. In India there is a caste called AHIR (a name used for North Indian Yadavas), but the original pronunciation is ABHIR, meaning a fearless man. It is evident that ABHIR became AHIR over time through erroneous pronunciation and gradual adoption by the community. Such influence of one person on others within a region produces a regional dialect. Similarly, over time words are softened and new idioms and pronunciations come into use, and thus the language of a community goes through a kind of internal divergence.

The impact of other communities also plays a part in lingual divergence, which is known as external divergence. External divergence takes place when two or more communities come into contact.
This contact could happen because of cross-community trade, marriages, and so on. A typical example can be found in the Dravidian language family, which has borrowed many words from Sanskrit (also known as Samskrit), a member of the Indo-Aryan language family. The emergence of different forms of Prakrits such as Hindi, Bengali, Oriya, and so on, could have taken place because of internal divergence, but the effect of external divergence cannot be ruled out.

According to one study, a total of 7150 spoken languages exist, of which 110 form broad categories (1). The following table gives an idea of the 10 most spoken languages in the world.

Rank. Language (population in millions): Countries

1. Chinese, Mandarin (885): Brunei, Cambodia, China, Indonesia, Malaysia, Mongolia, Philippines, Singapore, S. Africa, Taiwan, Thailand
2. Spanish (332): Algeria, Andorra, Argentina, Belize, Benin, Bolivia, Cambodia, Chad, Chile, Colombia, Costa Rica, Cuba, Dominican Rep., Ecuador, El Salvador, Eq. Guinea, Guatemala, Honduras, Ivory Coast, Laos, Madagascar, Mali, Mexico, Morocco, Nicaragua, Niger, Panama, Paraguay, Peru, Spain, Togo, Tunisia, Uruguay, U.S., Venezuela, Vietnam
3. English (322): Australia, Botswana, Brunei, Cameroon, Canada, Eritrea, Ethiopia, Fiji, The Gambia, Guyana, India, Ireland, Israel, Lesotho, Liberia, Malaysia, Micronesia, Namibia, Nauru, New Zealand, Palau, Papua New Guinea, Samoa, Seychelles, Sierra Leone, Singapore, Solomon Islands, Somalia, S. Africa, Suriname, Swaziland, Tonga, U.K., U.S., Vanuatu, Zimbabwe, many Caribbean states
4. Arabic (235): Egypt, Sudan, Algeria, Morocco, Tunisia, Libya, Saudi Arabia, Syria, Jordan, Yemen, UAE, Oman, Iraq, Lebanon
5. Bengali (189): Bangladesh, India, Singapore
6. Hindi: India,
