CHAPTER: 2

Overview of Unicode and Indian Scripts

Introduction
History and Development of Human Languages
History and Development of Scripts
Character Representation in Computers
Brief History of Character Representation
ASCII (American Standard Code for Information Interchange)
Unicode
Principles of Unicode Standard
Observations of Mapping Table
Transliteration Algorithm
Conclusion
References

Chapter 2: Overview of Unicode and Indian Scripts

2.1 INTRODUCTION

The Internet is being populated with resources in many of the world's languages. Technology has enabled even the smallest community to work in its own language and literature. The Internet can deliver information about any domain, which library professionals can collect, organize and provide to users. Librarians should look to make information easily accessible to the user community. Providing organized information services over the network, together with the application of IT, has brought about the concept of the digital library. With information available in different languages and scripts, the digital library demands multilingual access to information. Representing information in more than 7150 languages (1) in their respective scripts is a real challenge for the IT industry as well as for the library community. A system should be able to recognize different scripts and display them accordingly. One of the major problems here is the diversity of human languages and scripts. To some extent transliteration comes to the rescue, at least in helping the user understand what a document is about. In India many people know at least two or three languages. This is especially common in South India, where people often know two or three South Indian languages but not their scripts. In such situations transliteration can cut across the script barrier to some extent. One can also read fields such as author name, publisher and place, which are pronounced in the same way in any language.

Libraries can hold documents in many languages and scripts, and should ensure that the user community can access these documents irrespective of the language in which they are published. A translation service is one approach to the problem. But before that, the user must be able to discover that a document of interest exists, so that he/she can ask for a translation. To solve this, the library would have to generate a catalogue entry for each document in all the languages, which is a difficult task. The only ray of hope is transliteration algorithms and codification of data for controlled translation of catalogue entries to the extent possible. Transliteration and data codification of the catalogue may give the user an idea of the content of the document.

In computers, character representation has mostly been done with ASCII (American Standard Code for Information Interchange), which is a 7-bit system. With the development of the Internet, multilingual communication has grown to a great extent, and this has demonstrated that ASCII is not adequate for handling multilingual data. When these problems were realized, a consortium of Xerox, IBM and others was formed to develop a character representation code that could handle all the scripts of the world. The outcome of these efforts was the Unicode standard, which promises to handle all the scripts of the world (2).

There are two parts to multilingual communication. The first is the spoken part, which we understand as language. In general, language is understood to be the gestural or phonetic representation of one's feelings. The second part is the recording of such representation for future use or communication, which is done by making permanent or temporary marks on a surface; this is known as script. Scripts are thus signs or symbols which carry some meaning. It is evident that language was born first and scripts came later.

One of the objectives of the present work is to transliterate data across Indian languages. For practical purposes and for demonstration, Hindi, , and are taken into consideration. The present work does not attempt to transliterate data from English to Indian language scripts or vice versa.

2.2 HISTORY AND DEVELOPMENT OF HUMAN LANGUAGES

Language began with the genesis of life. The very act of expressing joy and sorrow is itself a form of language that nature has given to humans. Animals use different forms of language, such as body language, contact language or gesture language, to express their feelings (3).

According to Webster's Dictionary, a language is (4) "a systematic means of communicating ideas or feelings by the use of conventionalized signs, sounds, gestures, or marks having understood meanings".

That means a language is not necessarily a vocal phenomenon; it also includes gestures, visual signs and so on.

A very good definition of language is given on the Indiana University website. It describes language in different senses (5):
> For a person, language exists in his brain.
> For a pair or a number of persons, language can only be described in terms of the interaction between a speaker (writer, signer) and a hearer (reader, sign observer).
> For a community, language is a system that is shared by the members of the community, evolving over time as the community evolves.

A number of research projects trace linguistic development in order to understand the cultural evolution and divergence of humans. It is an established fact that mankind originated in Africa, which implies that the first language, whatever it may have been, was born in Africa. The question arises: 'Why are there so many languages in the world?'

The mythological story of the 'Tower of Babel' tells how God felt envious of Man's collaborative endeavor and 'blessed' men with different languages so that they couldn't communicate among themselves and could not carry out further collaborative work which would one day make them equal to God (6). The story does not hold true; rather, it was geographical separation which brought about linguistic divergence. This separation happened because of Man's migration in search of food and shelter.

Besides geographical diversity, which is the prime cause of language divergence, small variations are found in language usage among people of the same community. Each of us commands a definite sphere of words, which keeps growing throughout life, but in actual usage we do not use every word in that sphere. This gives rise to a style, as is often seen in renowned authors. Just as we are often identified by our style of writing, a person often commits the same kind of mistake while talking, whether out of ignorance or an inability to pronounce a word. In India there is a caste called AHIR (used for North Indian Yadavas), but the original pronunciation is ABHIR, meaning a fearless man. It is evident that ABHIR became AHIR in due course of time because of erroneous pronunciation and gradual adoption by the community. Such influence of one person on others within a region brings about a regional dialect. Similarly, in due course of time words are softened and new idioms and pronunciations come into use, and thus the language of the community goes through a kind of internal divergence. The impact of other communities also plays a part in linguistic divergence; this is known as external divergence. External divergence basically takes place when two or more communities come into contact, for example through cross-communal trade or marriage. A typical example is the Dravidian language family, which has borrowed a lot of words from Sanskrit (also known as Samskrit), which belongs to the Indo-Aryan language family. The emergence of the different forms of Prakrit, like Hindi, Bengali, and so on, could have taken place because of internal divergence, but the effect of external divergence can't be ruled out.

According to one study, a total of 7150 spoken languages exist, of which 110 form broad categories (1). The following table lists the 10 most spoken languages in the world.

Rank | Language | Countries | Population (in millions)
1 | Chinese, Mandarin | Brunei, Cambodia, China, Indonesia, Malaysia, Mongolia, Philippines, Singapore, S. Africa, Taiwan, Thailand | 885
2 | Spanish | Algeria, Andorra, Argentina, Belize, Benin, Bolivia, Cambodia, Chad, Chile, Colombia, Costa Rica, Cuba, Dominican Rep., Ecuador, El Salvador, Eq. Guinea, Guatemala, Honduras, Ivory Coast, Laos, Madagascar, Mali, Mexico, Morocco, Nicaragua, Niger, Panama, Paraguay, Peru, Spain, Togo, Tunisia, Uruguay, U.S., Venezuela, Vietnam | 332
3 | English | Australia, Botswana, Brunei, Cameroon, Canada, Eritrea, Ethiopia, Fiji, The Gambia, Guyana, India, Ireland, Israel, Lesotho, Liberia, Malaysia, Micronesia, Namibia, Nauru, New Zealand, Palau, Papua New Guinea, Samoa, Seychelles, Sierra Leone, Singapore, Solomon Islands, Somalia, S. Africa, Suriname, Swaziland, Tonga, U.K., U.S., Vanuatu, Zimbabwe, many Caribbean states | 322
4 | Arabic | Egypt, Sudan, Algeria, Morocco, Tunisia, Libya, Saudi Arabia, Syria, Jordan, Yemen, UAE, Oman, Iraq, Lebanon | 235
5 | Bengali | Bangladesh, India, Singapore | 189
6 | Hindi | India, Nepal, Singapore, S. Africa, Uganda | 182
7 | Portuguese | Angola, Brazil, Cape Verde, France, Guinea-Bissau, Mozambique, Portugal, São Tomé and Príncipe | 170
8 | Russian | China, Israel, Mongolia, Russia, U.S. | 170
9 | Japanese | Japan, Singapore, Taiwan | 125
10 | German, Standard | Austria, Belgium, Bolivia, Czech Rep., Denmark, Germany, Hungary, Italy, Kazakhstan, Liechtenstein, Luxembourg, Paraguay, Poland, Romania, Slovakia, Switzerland | 98

Table 2.1: Ten Most Spoken Languages of the World (7)

The major language families that exist today, their major languages, number of speakers and number of languages are listed below:

Major Language Family | Major Languages | Speakers (Numbers) | Languages (Numbers)
Afro-Asiatic | Arabic and Hebrew | 250 Million | 372
Altaic | Turkish, Uzbek, Mongolian, Korean and Japanese | 250 Million | 65
Austro-Asiatic | Vietnamese and Khmer | 60 Million | 168
Caucasian | Georgian and Chechen | 5 Million | 34
Dravidian | Telugu and Kannada | 150 Million | 75
Indo-European | English, Spanish, Portuguese, French, Italian, Russian, Greek, Hindi, Bengali, Latin, Sanskrit, and Persian | 3 Billion | 443
Malayo-Polynesian | Malay, Indonesian, Maori and Hawaiian | 250 Million | 1262
Niger-Congo | Swahili, Shona, Xhosa and Zulu | Nearly 200 Million | 1489
Sino-Tibetan | Mandarin | More than 1 Billion | 365
Uralic | Hungarian, Finnish and Siberian languages (Mordvin) | 20 Million | 38

Table 2.2: Major Language Families and their Properties [This data was collected by consulting different websites from 2001 to 2004 (8) (9) (10)]

The Indo-European family extends across America, Norway, Britain, Germany, Yugoslavia, Iran, Afghanistan and India. There are nine sub-families under this family (Table 2.3). One of them is the Indo-Iranian sub-family, which consists of 296 languages and is further sub-divided into Indo-Aryan and Iranian. Indo-Aryan chiefly consists of Sanskrit, Hindi, Urdu, Bengali, Oriya, Bihari, Gujarati, Panjabi, Kashmiri, Torwali and so on. These are basically the North Indian languages.

Sub-Families | Number of Languages
Albanian | 4
Armenian | 2
Baltic | 3
Celtic | 7
Germanic | 58
Greek | 7
Indo-Iranian | 296
Italic | 48
Slavic | 18

Table 2.3: Sub-Families of Indo-European

A very good account of the historical development of the Indo-European family is given by Dr. Tariq Rahman (11). According to him, there was a Proto-Aryan language which in due course developed into the Indo-Aryan languages. Gopal Haldar calls this Proto-Aryan language Old Indo-Aryan (12). There was no single language, but there were efforts towards polishing the language, notably by Panini, the world's most celebrated grammarian, sometime around 400 B.C. The polished language was called Samskrit, which means polished work or 'refined'. The unpolished one was called Prakrit, which means old work or 'unrefined'. It is generally understood that Indo-Aryan was brought to India by the Aryans. In due course of time they spread all over India and many dialects of Prakrit came into existence. Initially these were regional variants such as Udicya (Northern), Madhyadesa (Midland) and Pracya (Eastern). Slowly, within each region, new languages came up due to internal and external divergence. It is for this reason that similarity is found among the North Indian languages (13).

But there is a marked difference between the North Indian and the South Indian languages, because they originated from two different families. The North Indian languages originated from the Indo-European family, whereas the South Indian languages belong to the Dravidian family. Many historians believe this was the language of the native Indians (11). The Dravidian culture spread over the Indus valley civilization sometime around 2300-2000 B.C. It is said that this civilization ended due to the invasion of the Aryans or for some other reason (possibly plague or famine), and that the inhabitants probably migrated towards the South Indian peninsula. A small pocket is still found in Afghanistan which speaks Brahui, a Dravidian language; it is said that these people are the original inhabitants of Baluchistan. Gond is also an offshoot of the Dravidian language; the people of this community are scattered mostly over Central India, Andhra Pradesh and Tamilnadu. There are also a few tribes in Bihar and Orissa that speak Dravidian languages, and they moved as far as Assam. The major languages in the Dravidian family are Tamil, Telugu, Kannada and Malayalam. Of these, Tamil is the oldest and Malayalam is the latest, dating to sometime in the 14th century C.E. (14).

This picture puts India on the world map as a country of unity in diversity. Surprisingly, the various scripts used throughout India are derived from one parent, i.e. Brahmi, which makes Indian languages more phonetic: whatever the language, the sounds are almost the same.

2.3 HISTORY AND DEVELOPMENT OF SCRIPTS

Writing is perhaps one of the most significant cognitive developments of mankind. It gave a visible form to spoken language and helped the development of cognitive processes and the knowledge base. Writing would have begun when man felt the need to commit memories to a record for future aid. The first form of writing came as rock drawings. It would have been very handy to remember the face of an animal which he could kill for food, or perhaps fruits. Slowly drawing came to be used to remember the face of a person, place or thing.

The Blackwell Encyclopedia of Writing Systems defines a writing system as "a set of visible or tactile signs used to represent units of language in a systematic way" (15).

A writing system is a means of expressing thought through written symbols. Some writing systems do this with symbols that represent some aspect of spoken language such as sounds, syllables, or words. If a writing system uses symbols for sounds, or letters, it is called ALPHABETIC.

Different civilizations have different gods who were believed to have devised writing. In Hindu mythology writing is attributed to Saraswati, the goddess of knowledge and learning. In the old Egyptian civilization Thoth was the god who devised the scripts and collected the history (16). In the Mayan civilization the rabbit is associated with writing: on a funerary vase a rabbit is found with a pen in its hand, recording a human sacrifice (17).

Historically, writing developed in different places and in different time frames. The following types of writing exist (18):
> Proto-writing
> Logographic
> Syllabic
> Alphabetic

2.3.1 Proto-writing

This is the earliest form of writing, occurring in 7000-4000 B.C. It is the most rudimentary form of writing, where signs do not represent the full language but are used mostly as mnemonic devices.

2.3.2 Logographic

This kind of script is made up of signs where each sign carries some meaning i.e. a morpheme. A morpheme is the smallest unit that carries some meaning. Chinese is the typical example of this category.

2.3.3 Syllabic

These consist of consonants and vowels. Each consonant carries an inherent vowel, which can be changed by adding another vowel sign. Indic scripts fall in this category.

2.3.4 Alphabetic

There are two categories:

Abjad: Only consonants, or consonants plus some vowels, are represented. The vowel can be further modified by adding some other vowel. Basically, this kind of script is written from right to left.

Alphabet: Phonemic alphabets represent consonants and vowels. The typical example of it is Roman script.

Family | Location | Time period
Cuneiform | West Asia | ~3300 BC to 500 CE
Egyptian | Egypt, North Africa | 3100 BC to 400 CE
Proto-Sinaitic and descendants | West Asia | 1900 BC to 1100 BC
Eteo-Cretan or "Aegean" | Crete, Greece and Cyprus | 2000 BC to 300 BC
Sinitic | China, Japan, Korean Peninsula, East Asia | 1500 BC to date
Mesoamerican | Mesoamerica | 500 BC to 1697
Brahmi and related scripts | South Asia | 5th century BC to date

Table 2.4: Script Families

If a writing system existed in the Indus valley civilization, it would have been a very primitive system for representing the language. But there is a wide gap of 1600 years for which there is no recorded instance of any kind of script in any form (19). Even Megasthenes, in the Chandragupta period (3rd century BC), denies the existence of any kind of writing system in India. Kharosthi and Brahmi are said to be the parent scripts of the Indian scripts, but it is widely believed that these were developed in the 3rd or 4th century BC. According to Harry Falk (as quoted by Richard Salomon), Kharosthi is derived from the Aramaic script, which was used by the Achaemenid kings. Kharosthi was used in the north-western part of India, i.e. present-day Afghanistan. Even Ashoka has Aramaic inscriptions in north-western India, which shows the dominance of the Aramaic script; only later was Kharosthi born out of Aramaic. Interestingly, Brahmi was devised for writing inscriptions (20). It was a deliberate invention, which is why Brahmi appears to be more scientific. Brahmi has diphthongs while Kharosthi does not. Likewise, there are many features that Brahmi possesses and Kharosthi lacks (21) (22).

It is out of Brahmi that the different Indian scripts have come. That is why the Indian languages are phonetic (23).

The phonetic nature of the Indian languages could be utilized to standardize character representation in computers.

2.4 CHARACTER REPRESENTATION IN COMPUTERS

Each character is assigned a single number in the computer. If the key labeled 'A' is pressed, the computer understands it as the binary number '01000001' (which is equivalent to the decimal value 65) and prints 'A' on the screen. The actual process is more complex. As a computer is an electronic device, it represents voltage ON and OFF as '1' and '0' respectively. This system of notation is known as the 'binary system'. The mapping of codes to characters using such a system is known as character encoding.

Character on the screen | Binary value used to process it | Character on the screen | Binary value used to process it
1 | 0110001 | A | 1000001
2 | 0110010 | B | 1000010
3 | 0110011 | C | 1000011
4 | 0110100 | D | 1000100
5 | 0110101 | E | 1000101

Table 2.5: Character Representation
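The mapping between characters, decimal values and binary patterns described above can be reproduced with a few lines of Python. This is only an illustrative sketch; the helper name `to_binary` is an assumption made for this example and does not come from the original text.

```python
def to_binary(ch, width=7):
    """Return the code of a single character as a zero-padded binary string."""
    return format(ord(ch), "0{}b".format(width))

# Print the same characters that appear in Table 2.5.
for ch in "12345ABCDE":
    print(ch, ord(ch), to_binary(ch))

# The 8-bit form of 'A' quoted in the text, and the reverse mapping.
print(to_binary("A", 8))
print(chr(int("1000100", 2)))
```

Here `ord` gives the number assigned to a character and `chr` performs the reverse mapping from a number back to a character.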

2.5 BRIEF HISTORY OF CHARACTER REPRESENTATION

Morse invented the code which he used to send his historic message in 1838. The code is based on a dot or a dash, similar to the current binary system. The principle was to give the most frequently used letters the shortest possible patterns, so as to reduce the length of a message. For example, 'E', which appears very frequently in text, is represented by just a dot; similarly, 'T' is represented by just a dash. Basically, this all started with telegraphic technology.

The next great leap in telegraph technology was a primitive printing telegraph, or "teleprinter", patented by Jean-Maurice-Emile Baudot (1845-1903) in France in 1874. Like Morse, Baudot developed a code of his own, the 5-bit Baudot code, which was also the world's first binary character code for processing textual data, for use with his teleprinter. Initially, an operator of this system needed to know the code in order to print, entering it on a special five-key keypad; automatic generation of the code came with later typewriter-style keyboards. Baudot's system was fairly successful.

The problem with the Baudot code was that it could accommodate only 32 characters, i.e. 2^5. To accommodate more than 32, the code used switches (shift codes). Thus it could accommodate upper case alphabets and numerals (24).
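The arithmetic behind this limit can be sketched as follows. The function `shifted_capacity` models a deliberately simplified shift scheme in which one code per shift state is reserved as a switch; it is an assumption made for illustration and not a description of the actual Baudot code tables.

```python
def code_space(bits):
    """Number of distinct patterns available in a code of the given width."""
    return 2 ** bits

def shifted_capacity(bits, shift_states):
    """Characters reachable when one code per shift state is reserved as a
    switch (a simplified, hypothetical model of shift codes)."""
    usable = code_space(bits) - shift_states
    return usable * shift_states

print(code_space(5))           # 5-bit code
print(code_space(7))           # 7-bit code
print(shifted_capacity(5, 2))  # 5-bit code with two shift states
```

Under this simplified model a 5-bit code with two shift states reaches 60 characters, which is enough for upper case letters and numerals but not for much more.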

Herman Hollerith (1860-1929) developed the famous punch card system, whose encoding was also known as the Hollerith code. Charles Babbage had tried to store data on paper strips and failed. To solve the problem Hollerith devised punched card technology, and to process the cards he devised a machine called the "tabulating machine" (25).

A punch card had 12 rows and 80 columns. Each character is registered across the 12 rows of one column of the card, which makes it a 12-bit character code.

2.6 ASCII (AMERICAN STANDARD CODE FOR INFORMATION INTERCHANGE)

There was no character code which could handle all the English characters. The American National Standards Institute (ANSI) came out with the idea of a 7-bit code, and in 1963 it announced the American Standard Code for Information Interchange (ASCII). The first version of ASCII left out lower case letters and the escape character. In the 1960s major computer manufacturers in the US adopted this code, and so ASCII became the de facto standard. Only IBM kept itself out of the standard (26).

IBM brought out its own 8-bit standard called EBCDIC, which stands for 'Extended Binary Coded Decimal Interchange Code'. IBM used it in its mainframe computers (27).

In 1967 another version of ASCII came out. This version included lowercase Roman letters and a few additional characters which were not in the 1963 version, such as {, }, | and ~.

ASCII is not an 8-bit code; it uses 7 bits, which means it has only 128 character positions. ASCII 1967 is still very popular, but the demand for a multilingual code gave a thrust to the development of Unicode.

2.6.1 Limitation of ASCII and Rise of Unicode

With 7-bit ASCII one can represent a maximum of 128 characters; even its 8-bit extensions can represent a maximum of 256 characters at a time. That means one can represent Roman alphanumerics, control characters and at most one or two other languages. ASCII therefore fails to represent all multilingual characters, which became crucial in the web environment, where a fairly good number of multilingual documents are available. The need for multilingual representation was first felt by software companies. To change the language of a product, the source code of the software had to be changed, because each language had its own specific code. With Unicode the source need not be changed; the same software can be presented in other languages (28).
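The limitation can be observed directly: text outside the ASCII repertoire simply cannot be encoded as ASCII. The following is a minimal sketch; the helper name `fits_in_ascii` and the sample strings are assumptions chosen for this example.

```python
def fits_in_ascii(text):
    """Return True if every character of text has an ASCII code."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

samples = ["Library", "\u092A\u0941\u0938\u094D\u0924\u0915"]  # English, Devanagari
for s in samples:
    print(s, fits_in_ascii(s), len(s.encode("utf-8")))
```

The Devanagari sample is rejected by the ASCII encoder but is encoded without difficulty as UTF-8.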

The need for a standard which could represent a multilingual character set was soon realized. It was first felt with Chinese, Japanese and Korean, as they have many characters in common, and so the thought of merging their codes emerged. The idea took birth at Xerox sometime in 1987 and slowly took hold at other companies. By the end of 1988 there was an organized, programmed effort towards developing Unicode.

A consortium emerged consisting mainly of software companies such as IBM, Xerox, Microsoft and Apple, called the Unicode Consortium. The name Unicode was suggested in view of the universal, unique and uniform features of the code. In 1991 the first version of Unicode came into existence. At the same time Unicode was united with the ISO 10646 character encoding system. Though ISO 10646 is a four-byte character code whereas Unicode is a 2-byte character code (29), (30), it was stated that Unicode would provide the necessary characters for ISO 10646 when unified. A 2-byte code is not by itself sufficient to accommodate all the characters of the world, as Chinese alone claims more than 40,000 characters. That is why the Unicode Consortium had to come out with the idea of surrogate characters. Whatever the intention, Unicode is serving the purpose of character representation in computers (31). Currently, Unicode is in version 4.0.

2.7 UNICODE

The basic principle of Unicode is "Start from zero and add characters". That means the code is designed from scratch. Since it conforms to ASCII, existing programs need not be changed.

The primary scripts currently supported by Unicode 4.0 include (32):

Arabic, Armenian, Bengali, Buhid, Canadian Syllabics, Cherokee, Cypriot, Cyrillic, Deseret, Ethiopic, Georgian, Gothic, Greek, Gujarati, Han, Hanunoo, Hebrew, Kannada, Khmer, Lao, Latin, Limbu, Malayalam, Mongolian, Myanmar, Old Italic (Etruscan), Oriya, Osmanya, Runic, Shavian, Sinhala, Syriac, Tagalog, Tagbanwa, Tai Le, Tamil, Telugu, Thai, Tibetan, Ugaritic, Yi

In addition to the above scripts, a number of other collections of symbols are also encoded by Unicode. These collections consist of the following:
> Numbers
> General Diacritics
> General Punctuation
> General Symbols
> Mathematical Symbols
> Musical Symbols (Western and Byzantine)
> Technical Symbols
> Dingbats
> Arrows, Blocks, Box Drawing Forms, and Geometric Shapes
> Miscellaneous Symbols
> Presentation Forms
> Braille Patterns
> Kangxi Radicals

2.7.1 Designing Principles of Unicode

There were two serious problems when the Unicode project was undertaken (33):
> Mapping of different sets of fonts to the same set of codes. For example, in ISCII (Indian Script Code for Information Interchange) the same code is assigned to different script fonts, and switches are used to select different language scripts.
> Harmonization of multiple inconsistent character codes to avoid conflict at the national and industrial level.

The following are the principles on which Unicode is built (34):
> Universal repertoire
> Efficiency
> Characters, not glyphs
> Semantics
> Plain Text
> Logical order
> Unification
> Dynamic composition
> Equivalent Sequence
> Convertibility

2.7.1.1 Universal Repertoire

It acts as a repertoire for all the written languages of the world. Unicode is a single code which handles all the world's scripts. It caters not only to living scripts but also to dead ones. Many of the dead Indic scripts like Brahmi and Kharosthi are covered under Unicode.

It is basically a 16-bit character code, which allows 65,536 characters to be represented. For the Han ideographs (Chinese, Japanese and Korean) this may not be sufficient, but using surrogates, i.e. pairs of 16-bit code values occupying 4 bytes, the range is extended to 1,114,112 code positions. (A full 4-byte code space, as in ISO 10646, could in principle hold 4,294,967,296 values.)
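How a surrogate pair extends the 16-bit range can be sketched as follows, following the UTF-16 rule that a scalar value above 65,535 is split into a high and a low 16-bit code value. The function name `to_surrogates` and the sample value U+10400 are assumptions chosen for illustration.

```python
def to_surrogates(scalar):
    """Split a scalar value above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("scalar value does not need a surrogate pair")
    offset = scalar - 0x10000            # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low

high, low = to_surrogates(0x10400)
print(hex(high), hex(low))

# Cross-check against Python's own UTF-16 encoder (big-endian, no byte order mark).
print(chr(0x10400).encode("utf-16-be").hex())

# Total number of code positions reachable: 16-bit range plus surrogate pairs.
print(0x10000 + 1024 * 1024)
```

Each pair occupies 4 bytes, but because only 1,024 high and 1,024 low surrogate values are set aside, the pairs add 1,048,576 positions rather than the full range of a 4-byte number.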

To preserve compatibility with ASCII, Unicode uses UTF-8 (Unicode Transformation Format-8). Here a sequence of one to three bytes is used to represent a 16-bit code value. It is a standard method of transforming Unicode values into 8-bit code. This is required when data is transferred in 8-bit form, particularly when handling protocols, or when there is existing data in ASCII.

According to Unicode Standard Version 3.0 (35) "A Unicode Transformation format (UTF) transforms each Unicode scalar value into a unique sequence of code values. A UTF may specify a byte order for the serialization of the code values into bytes. A UTF may specify the use of a ."

UTF-8 represents each Unicode scalar value as a unique sequence of 8-bit code values. Any code which is transferred over a network is broken into these 8-bit units and sent. Each Unicode scalar value is broken into a byte sequence, and the bits are distributed as follows:

Scalar Value | UTF-16 | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte
000000000xxxxxxx | 000000000xxxxxxx | 0xxxxxxx | | |
00000yyyyyxxxxxx | 00000yyyyyxxxxxx | 110yyyyy | 10xxxxxx | |
zzzzyyyyyyxxxxxx | zzzzyyyyyyxxxxxx | 1110zzzz | 10yyyyyy | 10xxxxxx |
uuuuuzzzzyyyyyyxxxxxx | 110110wwwwzzzzyy + 110111yyyyxxxxxx | 11110uuu (a) | 10uuzzzz | 10yyyyyy | 10xxxxxx
a. Where uuuuu = wwww + 1

Table 2.6: UTF-8 bit Distribution (36)

The scalar value 2305 is thus broken into 224, 164 and 129 according to the table, and the sequence transferred is <11100000 10100100 10000001>. A scalar value transferred as such a sequence is decoded, or reverse mapped, at the other end. But there is a possibility that a given sequence cannot be decoded at the other end; such a sequence is called an illegal code value sequence.

Any UTF code value sequence which cannot be produced from any Unicode scalar value is called an ill-formed code value sequence.
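The bit distribution of Table 2.6 and the worked value 2305 can be checked with a small sketch. The function names `utf8_bytes` and `is_well_formed` are assumptions made for this example; the encoder follows the table literally and, for brevity, does not reject the surrogate range.

```python
def utf8_bytes(scalar):
    """Encode one scalar value as a list of UTF-8 byte values (per Table 2.6)."""
    if scalar < 0x80:                     # 0xxxxxxx
        return [scalar]
    if scalar < 0x800:                    # 110yyyyy 10xxxxxx
        return [0xC0 | (scalar >> 6), 0x80 | (scalar & 0x3F)]
    if scalar < 0x10000:                  # 1110zzzz 10yyyyyy 10xxxxxx
        return [0xE0 | (scalar >> 12),
                0x80 | ((scalar >> 6) & 0x3F),
                0x80 | (scalar & 0x3F)]
    return [0xF0 | (scalar >> 18),        # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
            0x80 | ((scalar >> 12) & 0x3F),
            0x80 | ((scalar >> 6) & 0x3F),
            0x80 | (scalar & 0x3F)]

def is_well_formed(byte_values):
    """Return True if the byte sequence decodes as UTF-8."""
    try:
        bytes(byte_values).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

encoded = utf8_bytes(2305)
print(encoded, [format(b, "08b") for b in encoded])
print(is_well_formed(encoded), is_well_formed([0xC0, 0x80]))
```

The sequence <C0 80> is an example of a sequence that no scalar value produces under the table, so a conforming decoder refuses it.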

2.7.1.2 Efficiency

Unicode encodes each character only once. Punctuation marks are given in only one place and are referred to whenever they are required. For example, while writing Telugu one needs to use a full stop to end a sentence. The full stop is borrowed from character position 46 of Unicode, whereas Telugu falls between positions 3072 and 3184. Repetition is thus avoided: no separate full stop is added for ending a Telugu sentence. Further efficiency comes from language blocks, so that scripts of the same kind are found in one place. For example, all Indic scripts are kept together and each has been given its own block: Devanagari falls between 2304 and 2431, immediately after which Bengali runs from 2433 to 2554, and so on. The efficiency of Unicode is demonstrated when we write transliteration algorithms. Because of the block structure of Unicode, it is easy to write transliteration algorithms if the languages are of a phonetic nature.
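The idea that the block structure simplifies transliteration can be illustrated with a minimal sketch that shifts Devanagari code positions into the Bengali block. This is not the transliteration algorithm of the present work; it assumes that corresponding letters sit at the same offset within each block, and the function name `devanagari_to_bengali` is hypothetical. Positions that have no assigned counterpart in the target block are left unchanged.

```python
import unicodedata

DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F   # 2304 to 2431
BENGALI_START = 0x0980                              # 2432

def devanagari_to_bengali(text):
    """Shift Devanagari characters to the parallel position in the Bengali block."""
    offset = BENGALI_START - DEVANAGARI_START
    result = []
    for ch in text:
        code = ord(ch)
        if DEVANAGARI_START <= code <= DEVANAGARI_END:
            candidate = chr(code + offset)
            # Keep the original character if the target position is unassigned.
            if unicodedata.name(candidate, None) is not None:
                ch = candidate
        result.append(ch)
    return "".join(result)

source = "\u0930\u093E\u092E"   # "Rama" in Devanagari
print(source, devanagari_to_bengali(source))
```

Characters outside the Devanagari block, such as the full stop at position 46, pass through untouched, which matches the point that shared punctuation is encoded only once.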

2.7.1.3 Characters, Not Glyphs

A glyph is the shape of a character, the way it appears when it is rendered or displayed. The Unicode standard is concerned with the code of the text, not with the actual rendering of the text on the screen. This code is stored in memory, and the semantics of the text are stored in coded form. In the Indic scripts some glyphs result from a sequence of more than one code. For example, Tra (त्र) is the combination of three characters. Thus some type of relationship does exist between a character code sequence and the resulting glyph.
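The three characters behind a single conjunct glyph can be listed with the standard `unicodedata` module. The sketch below assumes the Devanagari form of Tra, written as Ta + Virama + Ra; the variable names are chosen for illustration.

```python
import unicodedata

tra = "\u0924\u094D\u0930"   # one visible conjunct, three stored characters

for ch in tra:
    print("U+{:04X}".format(ord(ch)), ord(ch), unicodedata.name(ch))

print(len(tra))
```

What is stored is the three-code sequence; combining it into one shape is left to the font and the rendering engine.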

2.7.1.4 Semantics

In Unicode, characters have well-defined meanings. Each Unicode character carries certain properties, which are of two kinds (37):
> Normative properties
> Informative properties

2.7.1.4.1 Normative properties

These are properties which are mandatory for any Unicode implementation. Unicode specifies case, directionality, combining behaviour, numeric value, etc. as normative properties.

2.7.1.4.2 Informative properties

These are properties which are not mandatory for any Unicode implementation. An example is case mapping, which is the mapping of uppercase to lowercase.

Besides, there are characters which are used for defining element boundaries. For example, in English the full stop (.), i.e. a period, defines the sentence boundary. Space can be used to define the boundary of a word, and so on. Sometimes other characters are used for the same purpose; for example, in Devanagari the Devanagari Danda (।) is used as a phrase separator or sentence boundary. Similarly, the question mark (?) is used to identify that the sentence is a question.
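Several of the character properties mentioned in this section can be inspected through Python's standard `unicodedata` module, which exposes the Unicode Character Database shipped with the interpreter. The helper `describe` and the choice of sample characters are assumptions made for this sketch.

```python
import unicodedata

def describe(ch):
    """Collect a few Unicode properties of a single character."""
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),    # e.g. Lu = uppercase letter
        "bidi": unicodedata.bidirectional(ch),   # directionality class
        "combining": unicodedata.combining(ch),  # combining class, 0 if none
        "numeric": unicodedata.numeric(ch, None),
        "lower": ch.lower(),                     # case mapping
    }

for ch in ["A", "\u05D0", "\u0967", "\u094D", "\u0964"]:
    print(describe(ch))
```

The samples are Latin capital A, Hebrew Alef (a right-to-left letter), Devanagari digit one, the Devanagari virama (a combining sign) and the Devanagari Danda used as a sentence boundary.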

2.7.1.5 Plain Text

The Unicode standard characterises plain text as follows: (38) "Plain text must contain enough information to permit the text to be rendered legibly, nothing more."

Plain text is pure character code; it does not contain any formatting. Applications such as MS-Word and TeX produce formatted text, unlike the text produced by the Edit program of DOS or Notepad of Windows. In formatted text, meta-characters are used to mark up the formatting.

A clear distinction between plain and formatted text can be seen in HTML or XML documents. Applications that process them read plain text and apply formatting to turn it into fancy text or rich text. An XML processor, in particular, reads UTF-8 or UTF-16 characters, and formatting is applied using XSL (Extensible Stylesheet Language).

2.7.1.6 Logical Ordering

Directionality is a basic property of Unicode characters. Urdu, Hebrew, etc. are written and displayed from right to left. When a character is typed, it is stored in memory in logical order, but the order in which the characters are rendered on display may differ. For example, (38)

[Figure: a sentence beginning "Rama said," followed by right-to-left text, shown in display order (direction →) and in the logical order in which its characters are stored.]

Sometimes bi-directionality is seen within the same text. In Devanagari, the short vowel sign I is rendered to the left of its consonant. For example, logically Ka (क) is written first and then the short vowel I (ि) is added, but on display the rendering engine draws the short vowel I (ि) first and then Ka (क).
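The logical order can be made visible by listing the stored characters of the syllable. This is a small Python sketch, assuming the example syllable is Devanagari "ki" (KA followed by VOWEL SIGN I); the helper name `logical_order` is hypothetical.

```python
import unicodedata

# Assumption: the syllable under discussion is KA + VOWEL SIGN I.
ki = "\u0915\u093F"

def logical_order(text):
    """Names of the characters in the order in which they are stored."""
    return [unicodedata.name(ch) for ch in text]

print(logical_order(ki))
```

The consonant comes first in memory even though the vowel sign is drawn to its left; the reordering is left entirely to the rendering engine.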

2.7.1.7 Unification

Duplication of characters is avoided in Unicode; that is, each character is encoded only once. Punctuation marks, for example, are encoded only once. The same treatment is given to CJK (Chinese/Japanese/Korean) ideographs: all three languages use largely the same ideographic script, so it was preferred to merge them into a single unified set, known as Han. (39)

Similarly, Telugu uses the period (.) for separating sentences; it has scalar value 46, and the same period is used in English. There are, however, characters which would not have been encoded (except for compatibility) because they are in some sense variants of characters that have already been coded; such characters are known as compatibility characters. For example, the colon (:) {character position 58} has two close variants, one at character position 8758, i.e. 'Ratio' (∶), and the other at 1417, i.e. 'Armenian full stop' (։). There appears to be only a formatting variance between the colon, the ratio and the Armenian full stop: in the ratio the distance between the points is greater, while in the Armenian full stop the dots are closer together and bolder compared to the colon.

Most of the compatibility characters reside in the Compatibility area, as one can see from the index of character positions in the Unicode manual. The position of a compatibility character differs from that of the actual character, but it is linked to it.

For example, the Latin capital letter 'A' (character position 65) has a variant called Fullwidth Latin capital letter 'A' (character position 65313), which is given a position towards the end of the Unicode table.
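The link between a compatibility character and the actual character can be followed programmatically. The sketch below uses Python's `unicodedata` module and the fullwidth 'A' at position 65313; the helper name `fold_compatibility` is an assumption made for illustration.

```python
import unicodedata

FULLWIDTH_A = chr(65313)   # U+FF21

def fold_compatibility(text):
    """Replace compatibility characters by the characters they are linked to."""
    return unicodedata.normalize("NFKC", text)

# The decomposition field records the link and its kind ("<wide>").
print(unicodedata.decomposition(FULLWIDTH_A))
print(ord(fold_compatibility(FULLWIDTH_A)))
```

Folding of this kind is useful before searching or comparing catalogue data, since both forms then match.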

2.7.1.8 Dynamic composition

Building a character form out of a combination of characters is known as dynamic composition. It is used to generate different character forms and is done using combining characters. A combining character is a non-spacing character, represented in the Unicode character-set tables on a dotted base. The blocks dealing with combinations in Unicode are called 'Combining characters Areas'.

To generate NEE in Telugu (నీ), one should combine the consonant NA (న) with the vowel sign II (ీ). This can be achieved only if the font contains the required glyph. In Indic scripts, the dependent vowel signs are nothing but combining characters. These combining characters are also called 'non-spacing characters'.

2.7.1.9 Equivalent sequence

These are precomposed character codes available in the Unicode character set. The same characters can, however, also be generated using dynamic composition. For example,

À → A + ◌̀
192 → 65 + 768

In this case 192 is equivalent to the sequence 65 followed by 768; that is, the sequence 65, 768 is canonically equivalent to 192. It is the task of software to convert the values to one form, i.e. either 192 or 65 followed by 768. This conversion is known as 'Normalization'.
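The equivalence of 192 and the sequence 65, 768 can be checked with the normalization forms provided by Python's `unicodedata` module. This is an illustrative sketch; the helper names `compose` and `decompose` are hypothetical.

```python
import unicodedata

precomposed = chr(192)           # Latin capital letter A with grave
sequence = chr(65) + chr(768)    # A followed by combining grave accent

def compose(text):
    """Convert to the composed form (NFC)."""
    return unicodedata.normalize("NFC", text)

def decompose(text):
    """Convert to the decomposed form (NFD)."""
    return unicodedata.normalize("NFD", text)

print([ord(ch) for ch in compose(sequence)])
print([ord(ch) for ch in decompose(precomposed)])
```

Two strings that differ only in this respect compare as unequal byte for byte, so text is usually brought to one form before it is compared or indexed.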

2.7.1.10 Convertibility

Unicode promises convertibility with other standard codes, because each character is given a distinct number. Conversion to or from another code is normally done through a character mapping table.

2.7.2 Indian Languages in Unicode

For the representation of Indian characters, the Bureau of Indian Standards (BIS) has defined a standard called ISCII (Indian Script Code for Information Interchange), which is an 8-bit code. It covers 10 Indic scripts derived from Brahmi. ISCII extends ASCII and uses the last 128 character positions for the characters of the Indic scripts. The arrangement of characters in ISCII is phonetic. (40)

ISCII is no doubt a solution for the Indian scenario, but when it comes to global representation neither ASCII nor ISCII can represent the characters of all scripts at the same time, as both are single-byte codes.

ISCII caters to the following 10 Indian scripts: Devanagari, Gujarati, Punjabi, Bengali, Assamese, Oriya, Telugu, Tamil, Malayalam and Kannada. The ISCII code table is a superset of all the characters required for these scripts. The same table has been adopted by the Unicode Consortium for the Unicode Standard.

The very idea behind Unicode was to represent all the scripts of the world. The mapping of characters, or of combinations of characters, could be used for transliteration. While working with Indic scripts we have found that an almost one-to-one relationship exists between the characters of the Indian language scripts. There are a few exceptions, caused by characters that are present in one script but absent from another. For the present study four Indian languages, i.e. Hindi (Devanagari script), Bengali, Telugu and Kannada, have been selected. The former two belong to the Indo-Aryan language family and the latter two to the Dravidian family, but the scripts of all four languages are derived from Brahmi.

A study has been conducted on these four languages for character mapping. The mapping of the characters can be seen in Appendix 2.7. The difference between the code values of corresponding characters in two scripts is fixed, with a few exceptions. For example, the difference between Devanagari and Bangla is 128, and this difference holds more or less throughout the Devanagari and Bangla blocks.
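The fixed difference of 128 can be tried out directly on code points. The following Python sketch is an illustration only: the function name `devanagari_to_bengali` is hypothetical, and the exceptions recorded in Appendix 2.7 are deliberately ignored.

```python
DEVANAGARI_FIRST, DEVANAGARI_LAST = 2304, 2431
OFFSET = 128   # distance between the Devanagari and Bengali blocks

def devanagari_to_bengali(text):
    """Shift every Devanagari character by the fixed block offset.

    Sketch only: characters without a Bengali counterpart are not handled.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if DEVANAGARI_FIRST <= code <= DEVANAGARI_LAST:
            code += OFFSET
        out.append(chr(code))
    return "".join(out)

print(devanagari_to_bengali("\u0915\u092E\u0932"))   # KA, MA, LA
```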

2.7.2.1 Transliteration with Unicode

UTF-8 has been used while writing the transliteration algorithm. That means each character is broken into a sequence of bytes. For Indic scripts, the Unicode scalar value is broken into three bytes. For all four languages the first byte is always 224, the second byte represents the language (script) and the third byte represents the character. For example, for Devanagari characters the first byte is 224, the second byte is 164 or 164 + 1, i.e. 165, and the third byte lies between 128 and 191. The UTF-8 sequence can be calculated manually for any Unicode scalar value. For example, to obtain the UTF-8 byte sequence for the value 2325, i.e. Devanagari Ka (क), the following method should be followed.

1. Convert the decimal value to 16-bit binary code. The binary value of 2325 is 0000100100010101.

2. Use the formula in the UTF-8 bit conversion table to redistribute the 16 bits over a sequence of 8-bit bytes. 00001001 00010101 is redistributed as 11100000 10100100 10010101.

3. Convert the byte sequence to decimal.
11100000 = 224 {indicates that it is a three-byte code}
10100100 = 164 {represents the value of the language}
10010101 = 149 {represents the value of the character}

4. The byte values in sequence represent the Unicode scalar value: 224 164 149, in that order, represent Devanagari Ka (क).
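The four manual steps can be sketched in Python as follows. The function name `utf8_three_bytes` is hypothetical, and the sketch covers only the three-byte form (scalar values 2048 to 65535), which is the form used by all the Indic blocks discussed here.

```python
def utf8_three_bytes(scalar):
    """Split a scalar value into the three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx."""
    if not 0x0800 <= scalar <= 0xFFFF:
        raise ValueError("not a three-byte UTF-8 value")
    first = 0b11100000 | (scalar >> 12)              # top 4 bits
    second = 0b10000000 | ((scalar >> 6) & 0b111111) # middle 6 bits
    third = 0b10000000 | (scalar & 0b111111)         # low 6 bits
    return [first, second, third]

print(utf8_three_bytes(2325))
print(list(chr(2325).encode("utf-8")))   # cross-check with the built-in encoder
```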

2.8 OBSERVATIONS OF MAPPING TABLE FOR INDIAN LANGUAGES

1. There are 105 characters listed in Devanagari, Bangla has 90 characters, Telugu has 80 characters and Kannada has 82 characters. Unicode version 4.0, which was published in May 2003, introduced the following characters in the Character Database.
> Devanagari Letter Short A (glyph is not available with font)
> Bengali Sign Avagraha (glyph is not available with font)
> Kannada Sign Nukta (glyph is not available with font)
> Kannada Sign Avagraha (glyph is not available with font)

2.8.1 Devanagari Vs Bengali

2. Letter Short A has been added to the Devanagari block; it has no equivalent in Bengali.

3. There are no equivalent characters for Devanagari ^ and ^ in Bengali, so these are mapped to Bengali >«. On reverse transliteration the same character is not recovered; one gets TJ instead.

4. There are no equivalent characters for Devanagari 3TT and 3ft in Bengali, so these are mapped to Bengali «. On reverse transliteration the same character is not recovered; one gets art instead.

5. There is no equivalent character for Devanagari FT in Bengali, so this is mapped to Bengali ^. On reverse transliteration the same character is not recovered; one gets ?r instead.

6. There is no equivalent character for Devanagari in Bengali so this is mapped to Bengali ^. On reverse transliteration the same character is not recovered; one gets T instead.

7. There are no equivalent characters for Devanagari c5 and qS in Bengali, so these two are mapped to Bengali ^. On reverse transliteration the same character is not recovered; one gets cT instead.

8. There is no equivalent character for Devanagari cT, since Bengali does not have this sound, so this character is mapped to Bengali ^. On reverse transliteration one gets Devanagari sr.

9. There are no equivalent characters for Devanagari o and 5, so these are mapped to Bengali oc. On reverse transliteration one gets Devanagari d.

10. There are no equivalent characters for Devanagari o\ and ::>T, so these are mapped to Bengali cr. On reverse transliteration one gets Devanagari (:>t.

11. Devanagari 3^ should also be used in Bengali, as no separate character is defined for it.

12. There are no equivalent characters for Devanagari o, o, o and 6 in Bengali.

13. There is no equivalent character for Bengali d1 in Devanagari.

14. There are no equivalent characters for Devanagari ?P, is, -T, 3r and !P. These characters are called Additional consonants. They are mapped to the corresponding Bengali consonants. On reverse transliteration one gets the Devanagari Pure consonants.

15. Devanagari ।, ॥ and . should also be used in Bengali, as no separate characters are defined for them.

16. Bengali and Assamese share the same script. The characters ^ and I are used in Assamese. Assamese 'RA' is represented by ^ compared to Bengali RA ^ (41). There is no equivalent character in Bangla for Assamese (^). So for Bengali to Assamese transliteration VA (1) should be mapped to Bengali (^). Apart from that, Assamese is the same as Bangla.

17. Bengali ^ (Assamese) is mapped to Devanagari T.

18. Bengali ^ (Assamese) is mapped to Devanagari cl.

19. There is no equivalent character for Bengali ^, , ^, ^,., i, ^, o and^.

2.8.2 Devanagari Vs Telugu

20. Letter Short A has been added to the Devanagari block; it has no equivalent in Telugu.

21. There is no equivalent character for Devanagari ^ in Telugu, so this is mapped to Telugu .3. On reverse transliteration the same character is not recovered; one gets ^ instead.

22. There is no equivalent character for Devanagari 3TT in Telugu so this is mapped to Telugu . On reverse transliteration the same character is not recovered; one gets 3TT instead.

23. There is no equivalent character for Devanagari fT in Telugu, so this is mapped to Telugu .i). On reverse transliteration the same character is not recovered; one gets ?r instead.

24. There is no equivalent character for Devanagari p5 in Telugu, so this is mapped to Telugu v. On reverse transliteration the same character is not recovered; one gets 3 instead.

25. There is no equivalent character for Devanagari c, s in Telugu.

26. There is no equivalent character for Devanagari o in Telugu, so this is mapped to Telugu ". On reverse transliteration the same character is not recovered; one gets 5 instead.

27. There is no equivalent character for Devanagari or in Telugu, so this is mapped to Telugu ". On reverse transliteration the same character is not recovered; one gets <:>! instead.

28. Devanagari ^ should also be used in Telugu, as no separate character is defined for it.

29. There are no equivalent characters for Devanagari o, o, 6 and o in Telugu.

30. There are no equivalent characters for Telugu d and o in Devanagari.

31. There are no equivalent characters for Devanagari ff>, IS, ?T, 3r, ^, ", PF and ?T. These characters are called Additional consonants. They are mapped to the corresponding Telugu consonants. On reverse transliteration one gets the Devanagari Pure consonants.

32. There are no equivalent characters for Devanagari ^ and i^ in Telugu.

33. There is no character for Devanagari । in Telugu, but the same function is performed by the Full Stop (.).

34. There is no equivalent character for Devanagari ॥ in Telugu.

35. Devanagari . should also be used in Telugu, as no separate character is defined for it.

2.8.3 Devanagari Vs Kannada

36. Letter Short A has been added to the Devanagari block; it has no equivalent in Kannada.

37. There is no equivalent character for Devanagari ^ in Kannada, so this is mapped to Kannada o. On reverse transliteration the same character is not recovered; one gets ^ instead.

38. There is no equivalent character for Devanagari afr in Kannada, so this is mapped to Kannada t>. On reverse transliteration the same character is not recovered; one gets 3ft instead.

39. There is no equivalent character for Devanagari FT in Kannada, so this is mapped to Kannada ^. On reverse transliteration the same character is not recovered; one gets ?T instead.

40. There is no equivalent character for Devanagari ^ in Kannada, so this is mapped to Kannada \^. On reverse transliteration the same character is not recovered; one gets c5 instead.

41. There is no equivalent character for Devanagari o in Kannada, so this is mapped to Kannada :'. On reverse transliteration the same character is not recovered; one gets 5 instead.

42. There is no equivalent character for Devanagari or in Kannada, so this is mapped to Kannada -'fj^. On reverse transliteration the same character is not recovered; one gets 'Si instead.

43. Devanagari :P^ should also be used in Kannada, as no separate character is defined for it.

44. There are no equivalent characters for Devanagari o, o, c^ and 6 in Kannada.

45. There are no equivalent characters for Kannada oe and o^^ in Devanagari.

46. There are no equivalent characters for Devanagari ?', IS, TT, oT, ?, iff and ?T. These characters are called Additional consonants. They are mapped to the corresponding Kannada consonants. On reverse transliteration one gets the Devanagari Pure consonants.

47. There are no equivalent characters for Devanagari Q and q in Kannada.

48. There is no character for Devanagari । in Kannada, but the same function is performed by the Full Stop (.).

49. There is no equivalent character for Devanagari ॥ in Kannada.

50. Devanagari . should also be used in Kannada, as no separate character is defined for it.

2.8.4 Bengali Vs Telugu

51. There is no equivalent character for Telugu in Bengali so this is mapped to Bengali 'ii. On reverse transliteration the same character is not recovered; one gets S) instead.

52. There is no equivalent character for Telugu ij in Bengali, so this is mapped to Bengali "S. On reverse transliteration the same character is not recovered; one gets i, instead.

53. There is no equivalent character for Telugu eo in Bengali, so this is mapped to Bengali ^. On reverse transliteration the same character is not recovered; one gets 6 instead.

54. There is no equivalent character for Telugu sf in Bengali, so this is mapped to Bengali '^. On reverse transliteration the same character is not recovered; one gets v instead.

55. Telugu i> is mapped to Bengali

56. There is no equivalent in Telugu for the Bengali characters o. and Avagraha.

57. There is no equivalent character for Telugu o in Bengali, so this is mapped to Bengali i. On reverse transliteration the same character is not recovered; one gets * instead.

58. There is no equivalent character for Telugu " in Bengali, so this is mapped to Bengali ci. On reverse transliteration the same character is not recovered; one gets ^ instead.

59. There are no equivalent characters for Telugu o and o in Bengali.

60. There is no equivalent character for Bengali c5t in Telugu.

61. There are no equivalent characters for Bengali ?, 7 and Ti. These characters are called Additional consonants. They are mapped to the corresponding Telugu consonants. On reverse transliteration one gets the Telugu Pure consonants.

62. There are no corresponding characters for Bengali h and Q in Telugu.

63. Bengali ? (Assamese) is mapped to Telugu 6.

64. Bengali ^ (Assamese) is mapped to Telugu £1.

65. There is no equivalent character for Bengali >, li, -, -, ^, \, i, •> and ^ in Telugu.

2.8.5 Bengali Vs Kannada

66. There is no equivalent character for Kannada i;^ in Bengali, so this is mapped to Bengali >ii. On reverse transliteration the same character is not recovered; one gets u) instead.

67. There is no equivalent character for Kannada ej in Bengali, so this is mapped to Bengali »3. On reverse transliteration the same character is not recovered.

68. There is no equivalent character for Kannada o in Bengali, so this is mapped to Bengali <). On reverse transliteration the same character is not recovered; one gets d instead.

69. There is no equivalent character for Kannada v' in Bengali, so this is mapped to Bengali ^. On reverse transliteration the same character is not recovered; one gets o instead.

70. Kannada d is mapped to Bengali ^.

71. There is no equivalent character for Kannada o* in Bengali, so this is mapped to Bengali i. On reverse transliteration the same character is not recovered; one gets o'e instead.

72. There is no equivalent character for Kannada oVa in Bengali, so this is mapped to Bengali s. On reverse transliteration the same character is not recovered; one gets o'jae instead.

73. There are no equivalent characters for Kannada ot and c^ in Bengali.

74. There is no equivalent character for Bengali oT in Kannada.

75. There are no equivalent characters for Bengali ^, 7 and ". These characters are called Additional consonants. They are mapped to the corresponding Kannada consonants. On reverse transliteration one gets the Kannada Pure consonants.

76. There is no equivalent character for Kannada . This is mapped to Bengali ^. On reverse transliteration one gets the Kannada Pure consonant T$.

77. There are no corresponding characters for Bengali Q and c^ in Kannada.

78. Bengali ^ (Assamese) is mapped to Kannada d.

79. Bengali ^ (Assamese) is mapped to Kannada d.

80. There is no equivalent character for Bengali >, \, --, -, -, 1, ^,» and ^ in Kannada.

2.8.6 Kannada Vs Telugu

81. There is no equivalent character for Kannada <^ in Telugu.

82. There is no equivalent character for Telugu •sr in Kannada.

2.9 TRANSLITERATION ALGORITHM

Any Unicode character can be represented in UTF-8 format. UTF-8 is an 8-bit character coding sequence for any Unicode scalar value. Any Unicode scalar value can be broken into a sequence of 8-bit code units using Table 2.6: UTF-8 bit Distribution. The program uses the UTF-8 coding sequence for character representation. All the Indic script characters are three-byte sequences. The first byte, 224, is common to all the Indic blocks. The second byte represents the code for the language (script). Each script has two consecutive numbers exclusive to it; for example, all Devanagari characters fall within the two numbers 164 and 165. The other scripts can be represented similarly:

Scripts                  Middle Byte
Devanagari               164 and 165
Bengali                  166 and 167
Punjabi or Gurmukhi      168 and 169
Gujarati                 170 and 171
Oriya                    172 and 173
Tamil                    174 and 175
Telugu                   176 and 177
Kannada                  178 and 179
Malayalam                180 and 181

Table 2.7: Scripts and byte value

The last byte of the UTF-8 code sequence represents the syllable. The syllables are the same for all the Indic scripts (see Appendix 2.7), and thus their values are also the same.

Changing the value of the second byte changes the script. For example,

224 164 133 → Devanagari Letter A (अ)

224 166 133 → Bengali Letter A (অ)

224 178 133 → Kannada Letter A (ಅ)

224 176 133 → Telugu Letter A (అ)
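The effect of changing only the middle byte can be verified with a few lines of Python. This is a minimal sketch; the helper name `with_middle_byte` is hypothetical.

```python
import unicodedata

def with_middle_byte(sequence, middle):
    """Return the character obtained after replacing the second UTF-8 byte."""
    first, _, third = sequence
    return bytes([first, middle, third]).decode("utf-8")

devanagari_a = (224, 164, 133)
for middle in (164, 166, 176, 178):
    ch = with_middle_byte(devanagari_a, middle)
    print(middle, unicodedata.name(ch))
```

The first and third bytes stay untouched in every case, which is what makes the byte-level transliteration so compact.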

The transliteration program reads characters from an input file and obtains the UTF-8 value of each character, i.e. its three-byte code sequence. One needs to supply only the target language, whereas the source language is detected from the second byte of the UTF-8 sequence. The source language is the language of the document and the target language is the language one wants to transliterate to. Each byte is read from the input file with fgetc() and its numeric value is returned by the ord() function:

    $c = fgetc($fp);
    $num = ord($c);

There is a major difference between North Indian and South Indian scripts. In all Indian language scripts every consonant carries the inherent vowel अ in pronunciation. But in Hindi and Bengali, if a word ends with a consonant, the vowel अ is dropped while pronouncing that final consonant. For example, क is pronounced as KA, which is made up of two sounds, क् and अ.

While transliterating from Telugu or Kannada to Hindi or Bangla, for any word that ends with a Viram the Viram is stripped off. In the opposite direction, from Hindi or Bangla to Telugu or Kannada, if the word does not end with a vowel a Viram should be added. The following example demonstrates this with the word

कमल = क + म + ल + ्

But when it is transliterated character by character to Telugu or Kannada it becomes కమల in Telugu and ಕಮಲ in Kannada, which is pronounced as Kamala in both Telugu and Kannada.

The Devanagari consonants क, म and ल are transliterated to క, మ and ల in Telugu and to ಕ, ಮ and ಲ in Kannada. The last character, the Viram (्), is left for processing.

Opposite to this when g^ioti in Telugu and ddl^t) in Kannada is transliterated to Hindi Viram (o) is added at the end of '' Telugu Kannada.
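The word-final Viram rule can be sketched in Python as below. All names here are hypothetical, and the sketch relies on two assumptions that are not spelled out in the text: that consonants occupy offsets 21 to 57 within each 128-character Indic block, and that the Viram sits at offset 77 of the same block.

```python
INDIC_FIRST, INDIC_LAST = 0x0900, 0x0D7F   # Devanagari .. Malayalam
CONSONANT_OFFSETS = range(0x15, 0x3A)      # assumed offsets of KA .. HA
VIRAMA_OFFSET = 0x4D                       # assumed offset of the Viram

def offset_in_block(ch):
    """Offset of an Indic character inside its block, or None otherwise."""
    code = ord(ch)
    if INDIC_FIRST <= code <= INDIC_LAST:
        return code & 0x7F
    return None

def add_final_virama(word):
    """Append the Viram when a word ends in a bare consonant (North to South)."""
    if word and offset_in_block(word[-1]) in CONSONANT_OFFSETS:
        block_start = ord(word[-1]) & ~0x7F
        return word + chr(block_start + VIRAMA_OFFSET)
    return word

def strip_final_virama(word):
    """Drop a word-final Viram (South to North)."""
    if word and offset_in_block(word[-1]) == VIRAMA_OFFSET:
        return word[:-1]
    return word

telugu_kamala = "\u0C15\u0C2E\u0C32"
print(add_final_virama(telugu_kamala))
```

Working with block offsets keeps the two helpers independent of the script, so the same code serves Telugu and Kannada.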

So at the time of transliteration, words should be captured and analysed. The word boundary is decided by the presence of a space or a return character, and each character of the word is stored in an array. The following statement depicts how words are captured and characters are stored in the array.

    for ($i = 0; $num != 32 && $num != 13; $c = fgetc($fp), $num = ord($c), $i++)
    {
        $sl[$i] = $c;   // store the current byte of the word
    }

The algorithm further ensures that only the language byte, i.e. the second byte, is altered and nothing else. It uses the value of the first byte to locate the second byte: the first byte is 224 for all Indic languages, and the byte immediately after it is the language identifier.

    if ($num == $source) $num = $target;

Each script uses two consecutive numbers, so the second byte may also have the value $source + 1. This condition is checked and the value is accordingly converted to $target + 1.

    if ($num == $source + 1) $num = $target + 1;

There are a few exceptions which are taken care of in the algorithm. These exceptions are listed in Appendix 2.7.
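The byte-level procedure described in this section can be put together as a short Python sketch. It is not the program of this study: the names `SCRIPT_BYTE`, `SHARED` and `transliterate` are hypothetical, only one kind of exception is shown (the Danda and double Danda, which are left in place because other scripts borrow them from Devanagari), and the character-level exceptions of Appendix 2.7 and the Viram rule are omitted.

```python
SCRIPT_BYTE = {"devanagari": 164, "bengali": 166, "telugu": 176, "kannada": 178}
SHARED = {(165, 164), (165, 165)}   # Danda and double Danda stay as they are

def transliterate(text, target):
    """Swap the script byte of every Indic three-byte sequence."""
    target_byte = SCRIPT_BYTE[target]
    data = bytearray(text.encode("utf-8"))
    i = 0
    while i < len(data):
        lead = data[i]
        # Length of the UTF-8 sequence that starts at this lead byte.
        if lead < 0x80:
            length = 1
        elif lead < 0xE0:
            length = 2
        elif lead < 0xF0:
            length = 3
        else:
            length = 4
        # 224 marks the Indic blocks; 164..181 are the script bytes of Table 2.7.
        if lead == 224 and 164 <= data[i + 1] <= 181:
            if (data[i + 1], data[i + 2]) not in SHARED:
                # Each script owns two consecutive values: keep the +0 / +1 part.
                data[i + 1] = target_byte + (data[i + 1] & 1)
        i += length
    return data.decode("utf-8")

print(transliterate("\u0915\u092E\u0932", "telugu"))
```

The source script never has to be named, because it is read from the second byte of each sequence, which mirrors the automatic source detection described above.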

2.10 CONCLUSION

There are many characters in one script which are not present in another script; for example, Kannada eo has no equivalent in Bengali. Such characters are mapped to another character taking the pronunciation into account; for example, Kannada t> is mapped to Bengali «.

The algorithm developed in this section of the research transliterates data from one language to another. Its other key features are:
> The algorithm automatically identifies the source language.
> If the source and target languages of a record are the same, the algorithm leaves that record as it is, whereas retrieved records in other languages are transliterated.
> The output is generated in XML format, which can be used for any further processing if required.

The code of the transliteration algorithm is embedded in the main program for multilingual MARC.

2.11 REFERENCES

1. LIST of languages. http://home.ccil.org/~cowan/lanstree.txt Browsed on 10th July, 2004.

2. TRIPATHI (Aditya). Unifying character set codes. In Workshop on Multimedia and Internet Technologies, 26th-28th February, 2001, DRTC, Bangalore, Paper: CE.

3. STEEL (Allan K.). The development of languages is nothing like biological evolution. TJ, 14(2), pp. 31-40, 2000. Also available at: http://www.answersingenesis.org/home/area/magazines/tj/docs/v14n2_languages.asp Browsed on 10th July, 2004.

4. WEBSTER'S Third New International Dictionary, Unabridged, Merriam-Webster, Springfield, Massachusetts.

5. WHAT is language? http://www.indiana.edu/~sasser/L503/language.html Browsed on 10th July, 2004.

6. BIBLE. Genesis 11:1-9.

7. The 50 most widely spoken languages. http://www.photius.com/rankings/languages2.html Browsed on 10th July, 2004.

8. BOEREE (C. George). The language families of the world. http://www.ship.edu/~psvch/languagefamilies.html Browsed on 10th July, 2004.

9. LANGUAGE families. http://www.krysstal.com/langfams.html Browsed on 10th July, 2004.

10. ETHNOLOGUE. http://www.ethnologue.com Browsed on 10th July, 2004.

11. RAHMAN (Tariq). Peoples and languages in pre-Islamic Indus valley. http://asnic.utexas.edu/asnic/subject/peoplesandlanguages.html Browsed on 10th July, 2004.

12. HALDAR (Gopal). Languages of India. Translated by Tista Bagchi. National Book Trust: New Delhi, p. 65.

13. HALDAR (Gopal). Languages of India. Translated by Tista Bagchi. National Book Trust: New Delhi, pp. 58-89.

14. HALDAR (Gopal). Languages of India. Translated by Tista Bagchi. National Book Trust: New Delhi, pp. 38-57.

15. COULMAS (Florian). The Blackwell Encyclopedia of Writing Systems. Blackwell: Oxford, 1999.

16. ORIGIN of writing. http://www.ancientscripts.com/ws_origins.html Browsed on 10th July, 2004.

17. THE Rabbit scribe. http://www.ancientscripts.com/rabbit.html Browsed on 10th July, 2004.

18. TYPES of writing system. http://www.ancientscripts.com/ws_types.html Browsed on 10th July, 2004.

19. SMITH (Vincent A.). The Oxford history of India. Ed. Percival Spear. 4th ed. 2001, p. 32.

20. SALOMON (Richard). On the origin of the early Indian scripts: a review article. Journal of the American Oriental Society 115(2), 1995, pp. 271-279. Also available at: http://www.ucl.ac.uk/~ucsadkw/position/salomon.html Browsed on 10th July, 2004.

21. KHAROSTHI. http://www.ancientscripts.com/kharosthi.html Browsed on 10th July, 2004.

22. BRAHMI. http://www.ancientscripts.com/brahmi.html Browsed on 10th July, 2004.

23. KAK (Subhash). How Old Is Indian Writing? Sulekha, December, 2003. http://www.sulekha.com/expressions/column.asp?cid=305897 Browsed on 10th July, 2004.

24. MARTIN (Nicolas). On the origins of serial communications and data encoding. http://www.staubassociates.com/dbase/bu07sh.htm Browsed on 10th July, 2004.

25. BASANDRA (Suresh K.). Computer today. Galgotia: New Delhi, 1997, p. 40.

26. ASCII chart. http://www.jimprice.com/jim-asc.htm Browsed on 10th July, 2004.

27. TREMBLAY (Jean-Paul) and BUNT (Richard B.). Introduction to computer science: an algorithmic approach. 2nd ed. McGraw-Hill: New York, pp. 292-295.

28. CHRONOLOGY of Unicode Version 1.0. http://www.unicode.org/history/ Browsed on 10th July, 2004.

29. JARNEFORS (Olle). A short overview of ISO/IEC 10646 and Unicode. http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html Browsed on 10th July, 2004.

30. SEARLE (Steven J.). A Brief History of Character Codes in North America, Europe, and East Asia. http://tronweb.super-nova.co.jp/characcodehist.html Browsed on 10th July, 2004.

31. SEARLE (Steven J.). Unicode Revisited. http://tronweb.super-nova.co.jp/unicoderevisited.html Browsed on 10th July, 2004.

32. SUPPORTED scripts. http://www.unicode.org/standard/supported.html Browsed on 10th July, 2004.

33. THE Unicode standard version 3.0. The Unicode Consortium, Addison-Wesley: Massachusetts, 2000, p. 3.

34. THE Unicode standard version 3.0. The Unicode Consortium, Addison-Wesley: Massachusetts, 2000, pp. 12-18.

35. THE Unicode standard version 3.0. The Unicode Consortium, Addison-Wesley: Massachusetts, 2000, p. 46.

36. THE Unicode standard version 3.0. The Unicode Consortium, Addison-Wesley: Massachusetts, 2000, p. 47.

37. THE Unicode standard version 3.0. The Unicode Consortium, Addison-Wesley: Massachusetts, 2000, p. 73.

38. THE Unicode standard version 3.0. The Unicode Consortium, Addison-Wesley: Massachusetts, 2000, p. 16.

39. THE Unicode standard version 3.0. The Unicode Consortium, Addison-Wesley: Massachusetts, 2000, p. 17.

40. LANGUAGE and scripts of India. http://www. . colostate. edu/~malaiva/scripts.html Browsed on 10th July, 2004.

41. HALDAR (Gopal). Languages of India. Translated by Tista Bagchi. National Book Trust: New Delhi, p. 101.