Informatics Studies. ISSN 2320 – 530X. Vol. 3, Issue 4, Fourth Quarterly Issue. October – December, 2016. P 31-44

Computing in Indian Languages for Knowledge Management: Technology Perspectives and Linguistic Issues

Ram Awathar Ojha

Abstract The paper is the first among a series of documents on computer applications in Indian languages to be published in the journal. Traditionally, computer applications were based on English as the medium of interaction with the system. The technology is not yet viable to Indian languages. Majority of Indian population does not speak or write in or understand English and this weakness of technology hinders the attempts to use computers for education and literacy, meant for majority of the population who should get the benefit of Information Technology. In many parts of the world computer applications have been developed in different regional languages, appropriate to the user communities. This introductory paper gives a perspective of research and development on computational linguistics, the phonetic aspect of Indian languages, transliteration standards, lack of uniformity, data entry methods, keyboards, phonetic mapping of the vowels and consonants, web pages supporting display of Indian language text etc in the context of computing in Indian languages. Keywords: Computational Linguistics, Indian Languages, Telugu, Kannada, Phonetics, Alphabets, vowels, consonants, syllables, transliteration standards, data entry methods, keyboards, , web pages

This paper covers Writing Systems Codes for Systems (MLS) in the context of this the Aksharas, The Phonetic Aspect of Indian discussion. Typically, an MLS will permit Languages, The importance of users to interact with computers in their own Transliteration, Lack of Uniformity Between mother tongue. Such a system will have far Different Schemes. Data Entry Methods, reaching consequences in a country like India Keyboards, Phonetic Mapping of the Vowels where English is not spoken or understood and Consonants and Web Pages Supporting by the majority of the people living in areas Display of Text. away from urban environments. Brief Introduction Over the years, the bulk of software development in India has been carried out The computer programs, which permit user primarily through the English language. interaction with the computer in one or Knowledge of English is essential for the more languages, where the language can be development of Information Systems since selected dynamically, either at the time of virtually all development packages rely on invocation of the program or subsequently English specific input. Current software during its execution is termed as Multilingual development tools can also be used to

Informatics Studies 3(4), October – December, 2016 31 Computing in Indian Languages for Knowledge Management develop an application that incorporates little is associated with some data, typically a text or no English in its user interface. Hence string, a number, an image and so on. One MLSs are not only feasible at the level of an writes computer programs to manipulate application program but can also present a data. Computing in Indian languages may truly localized environment for a user broadly relate to computer programs which desiring to interact with computers in process Indian Language text strings, which regional languages. This approach has the may be input through a keyboard and added advantage of providing uniformity displayed on a conventional screen display. across computer systems where, regardless A primitive approach to computing in of the machines like PC, MAC, Workstation Indian languages may be through computer etc. the user will see the same interface. programs, which do string processing on MLS catering only to the display of texts of Indian languages much the same information are easier to design and build. way it is done for the Roman (ASCII) The phenomenal growth of the internet has strings. This way, many existing applications created the need for internationalization of may be adapted to work with text strings in the web where, using the right software tools, Indian languages. It turns out that this is one can present information in the form in not a simple task on account of the large set which it would be received best. However, of aksharas (letters) that have to be handled. technical challenges faced in implementing While the older techniques seem to be well such MLS and the lack of standards have suited for displaying the Indian Scripts, data retarded the development, especially in entry becomes formidable. Hence, new respect of the . approaches to handling Indian language text In the past, MLSs in Indian languages have have become essential. basically concentrated on data preparation Often people ask questions such as ‘why and printing (DTP). The increasing demand not have an run in Tamil for using computers in the vernacular has or Bengali?’ just as Arabic or Japanese forced developers to look at the user interface Windows. There is no satisfactory answer to with seriousness. The Systems Development this question. In the first place, building Laboratory, Department of Computer support for Indian languages within Science and Engineering at the Indian Operating systems is not an easy job, Institute of Technology, Madras (IITM) has especially when one looks for uniformity in done unique research and development to use across all the Indian languages. provide a quality MLS for the country that is truly Indian in concept, design and There is a general feeling among the implementation. They have developed professionals that it is best to deal with numerous software packages to permit user Multilingual information at the level of the interaction with computers in different user application. That is to say, keep the Indian languages during the late eighties and language aspects outside the Operating they continue their work. The series of System. This way, the Operating Systems are papers including the present one are rid of the problem of having to deal with intended to present before the readers the varied character sets across many different experience shared by concerned scientists of languages. While some persons point to the IITM in dealing with this interesting but success of Unicode implementation in complex problem. Windows, it must be clearly understood that the system kept away from Computing in Indian Languages Indian languages for an important reason. Computing is a general term which refers to Unicode is just not right for linguistic information processing, where information processing with Indian Languages and is too

32 Informatics Studies 3(4), October – December, 2016 Ram Avathar Ojha complex to handle even for one Indian script, The phonetic base and the concept of the much less for all the scripts in a uniform Akshara is unique to the languages of India. way. For all practical purposes Unicode has It is best to deal with the problem of retained only the eight bit coding structure computing with Indian languages by first (actually only 128 codes) for all our scripts, understanding how aksharas have been used which is really the bottleneck in handling the in our ancient as well as modern texts. This aksharas. For efficient string processing, it is will give us an insight into the linguistic base necessary to work with the basic linguistic of our languages, allowing us to come up quantum of our languages, which happens with a universal approach to dealing with all to be not a letter of the alphabet but an the languages of the country. Akshara, which is actually a syllable. General Introduction to Indian The problems associated to the current Languages implementation of Unicode for Indian languages, or for that matter the ISCII code Numerous languages are spoken in India. itself, the basic standard that led to the Most of them relate to one of the officially Unicode representation for our languages will recognized languages and there are about be discussed later. eighteen languages identified for regular use in the country. All these languages have a Any meaningful approach to computing in phonetic base, though their writing systems Indian languages must provide for unique vary. Some of the languages have a common identification of the full set of aksharas, script and some have scripts of their own. through fixed length codes and also provide There are nine basic scripts besides the scripts a uniform approach to data entry of the for Urdu and Sindhi. These nine constitute aksharas across all the languages. Any other the basic scripts of India. The eighteen approach will suffer from incompatibilities languages mentioned above, have been between the user interfaces across different given the status of official languages by the Indian languages. While solutions specific Government. Though the use of a language to a language may indeed be feasible, one is may appear to be confined to a region within looking for solutions, which can be used all the country, the mother tongue of many over the country. persons living in that region may be quite a For the present, and at least for some years different one, traceable to early migrations to come, it is best to handle the problem by of families. writing applications which handle All the recognized languages; mostly referred Multilingual information directly, that is to to as the regional languages have a phonetic say that the Operating System’s support base. It is seen that there is a substantial set should not be sought as there is no of words common to many of these consensus among the professionals on what languages and the roots of these words may this support should be. The total be traced to specific languages such as arbitrariness with which data entry in Indian Sanskrit or Tamil, both of which are languages has been handled, not to speak considered very ancient languages. of the lack of uniform representation across the languages, makes it necessary for us to Linguistic aspects of Indian languages have take a serious look at standardization. The always attracted scholars from different parts problem is further compounded by the of the world on account of the hoary past individual approaches taken by vendors who of the languages as well as their unique seem to think that the concept of the language phonetic base. Another interesting aspect of kit with an Operating System will solve the Indian languages is the fact that language problem. was a means not merely for communication

Informatics Studies 3(4), October – December, 2016 33 Computing in Indian Languages for Knowledge Management in daily life but also for expressing religious, combine with a vowel to make nearly 13000 philosophical, scientific and professional or so individual sounds, each with its own concepts in amazingly compact ways. unique representation in the script. Computers and Indian Languages Interestingly, the writing systems employ just about 200 or so symbols to form the unique It is not surprising therefore, to find renewed shapes representing the conjuncts by interest today, in studying and understanding combining shapes, somewhat in the manner many of India’s ancient literary works. With of adding ligatures. For each language, well- the possibility of using computers for defined rules exist for writing most of the linguistic studies and with the increasing conjuncts and their combinations with the demand to disseminate information in the vowels. The term Akshara is normally used vernacular, computing in Indian languages to refer to a consonant or a vowel or a simple has gained significance. Though applications combination of a consonant and a vowel. such as word processing, Desk Top The term Samyuktakshara is used to refer to Publishing etc., have been successfully conjuncts. implemented for Indian languages, the solutions remain substantially language Handling Indian languages on the computer specific. One is happy that many of these is complicated by the requirement that each applications are really useful in practice, in and every one of these aksharas or spite of the effort needed in handling data samyuktaksharas be individually recognized. entry. Yet, very little seems to have been done Though only a few hundred primitive shapes in respect of electronically processing the may be employed in practice to form the information. Viewed nationally, there is an combinations, the large number of aksharas urgent need to provide a uniform and must necessarily be identified individually for meaningful software solution to computing linguistic or text processing purposes. in Indian languages. Children in India are taught to identify thousands of aksharas and once they have The phonetic nature of the languages leads mastered reading the script, they find learning to a , which represents sounds other languages, including European through unique symbols. Each language has languages, relatively easy. its own representation for the sounds and thus its own script, though it was mentioned The methods which work well for a limited earlier that some languages may use a set of twenty six different letters in the common script. In practice, there are small Roman alphabet, obviously fail or become variations in the scripts that probably matter inadequate when applied to Indian when linguistic aspects are brought in. languages, not only for the reason there are thousands of aksharas but also that there is Writing Systems more than one accepted way of writing many The writing systems for most Indian of the combinations. Though there are clear Languages employ symbols for about rules for writing the combinations, existing sixteen vowels and as many as thirty-five practices permit multiple representations for consonants. Syllables that are formed from the same conjunct, even within a language, these basic sounds are also given unique not to speak of variations across the representations. The term conjunct is used languages. to refer to a syllable formed from one or Codes for the Aksharas more consonants and a vowel. Though one can theoretically think of thousands of Thus there is need to look at the problem conjuncts, only about 800 of them are known of representing (coding) the large set of to be in regular use and each of these can aksharas so as to arrive at a standard that can

34 Informatics Studies 3(4), October – December, 2016 Ram Avathar Ojha apply uniformly across all the Indian In all Indian languages, an akshara is languages. Electronic text processing can then pronounced the same way regardless of its be attempted using these codes. position within a word, unlike in English where the pronunciation varies widely, The pioneering work, which resulted in the depending not only on the word but also GIST technology at the Center for on the location of the letter within the word. Development of Advanced Computing, must be regarded as the earliest of the Also, in Indian languages, the vowels number attempts towards some standardization. between thirteen and eighteen while the This development permitted DOS based consonants vary from eighteen in Tamil to applications to handle Indian language text. as many as thirty-eight in Telugu and The text was electronically represented using Malayalam. All the aksharas are therefore built the ISCII code and was largely language from about fifty basic letters. independent, thus permitting a uniform It is indeed possible to use just the vowels approach to dealing with the languages. Over and the consonants for writing any of the the years, this hardware dependent approach languages. This is probably how children are has been replaced by quality word processing taught a script to begin with. In practice and data preparation software but the however, the scripts abound in what are called essential eight bit coding of the characters ‘Samyuktakshars ‘, which are the equivalent has been retained. As will be explained later of syllables and represent sounds built up while dealing with Character encoding for from combinations of consonants and a Indian languages, eight bit codes are not really vowel. The writing system for a language suitable for efficient string processing. often permits more than one representation All the official languages of the country are (shape) for the samyuktakshar. written using scripts specific to each language. Samyuktakshars are often referred to as Scripts denote the writing systems employed conjunct characters. Clearly, when one sees by the languages to represent the sounds, an akshara in print, its sound is fixed. which form the phonetic base of the However, there may be more than one languages. Currently, the following language representation for a given conjunct and this specific scripts are considered essential. depends on the writing practices followed in a region. , , Bengali, Gujarati, Oriya, Telugu, Tamil, Kannada and All the ideas expressed here may well be Malayalam. The scripts for Urdu and Sindhi grasped by studying the Devanagari script in should also be included in the above, though which Sanskrit and a couple of other Indian Devanagari is often used for writing in Languages are written. An extensive Sindhi. discussion on this can be found in IITM site on Learning Sanskrit (IITM, 2010). The Phonetic Aspect of Indian Languages It turns out that when dealing with Indian languages on a computer, one needs a The languages of India have a common representation for the aksharas in general and phonetic base. One does not use the term not merely the vowels and consonants. The ‘alphabet ‘ to refer to the set of letters with akshara is the basic unit or quantum from a which the script is written. Instead, the set is linguistic point of view and computer called ‘Aksharas’. Very Simply, an akshara programs processing text in Indian languages refers to a sound. Sounds heard in spoken should be able to efficiently deal with this words are built up from the basic set of quantum, built up from two or more basic sounds represented by the vowels and sounds. This poses a real challenge as there consonants of the language.

Informatics Studies 3(4), October – December, 2016 35 Computing in Indian Languages for Knowledge Management are more than 13000 individual aksharas that languages. The National Library at Calcutta have to be reckoned and many more which has recommended an efficient scheme for might come into use, if the need arises. Roman transliteration (GOV, NL). The importance of Transliteration Transliteration Principles The common phonetic base across all the Transliteration refers to the process by which Indian Languages is helpful in situations one reads and pronounces the words and where language independent information sentences of one language using the letters such as statistical data, addresses, schedules and special symbols of another language. of meetings etc., have to be disseminated in Thus transliteration is meant to preserve the different languages simultaneously. People sounds of the syllables in words. who can speak a language but do not know Transliteration is helpful in situations where its script may still be able to read information one does not know the script of a language in that language by merely reading it in a script but knows to speak and understand the familiar to them. language nevertheless. Traditionally, books written in English that For several decades now, Roman deal with text in Indian languages such as transliteration has been used to represent commentaries on ancient scriptures used texts of Indian languages, especially Sanskrit. Roman transliteration to help read the text. In many printed books, a key to In many instances, diacritical marks were transliteration would be printed at the added to the Roman letters to establish a beginning in the form of a table. Since it is closeness to the aksharas of the language, difficult to represent the aksharas of Sanskrit which would be difficult to achieve with just using just the twenty-six letters of the the twenty-six letters. Roman alphabet, scholars used varying schemes to accommodate sounds that could Transliteration between Indian languages is not be correctly indicated using appropriate very desirable to help people learn one Roman letters. language through another. The common phonetic base makes this easy. Yet, Here are some examples of transliteration transliteration between the languages will as per the schemes, which were in general have to be handled with care, for there are use in the past. The schemes are somewhat quite a few aksharas which are specific to some arbitrary in the choice of the Roman letters. languages but not seen or used in others. For instance, Tamil does not have the aspirated consonants of Telugu or Sanskrit and reading Sanskrit through Tamil, which is very desirable, is often rendered difficult. Situations such as these are usually handled by introducing new symbols in the script of a language to represent via transliteration, Sometimes phonetics symbols are used in characters found in other languages. place of the normal Roman letters. Phonetic symbols are basically the letters of the This paper stresses the need to establish a Roman alphabet with special marks known single coding scheme to cover all the different as diacritic marks. Here are some examples aksharas across all the languages of India in of transliteration using symbols from the order to allow correct transliteration. In this phonetic alphabet. In the second set of connection, the use of Roman letters with aksharas shown below, one sees the use of diacritic marks does result in a script useful special symbols from the ascii character set in for reading text prepared in any of the Indian place of diacritics.

36 Informatics Studies 3(4), October – December, 2016 Ram Avathar Ojha

displayable ASCII letters and symbols to transliterate the text. These schemes allow multiple representations for certain syllables and long vowels but the processing program handles this well. Lack of Uniformity Between Different Roman transliteration which makes use of Schemes. diacritic marks works better for Indian While transliteration based data input is very languages and in the last few decades some useful, one must remember that the schemes standardization has been effected based on themselves vary, even for a given language. the recommendation from the National The consequence of this is that the data entry Library in Calcutta (GOI, NL). Roman letter procedures will change depending on the assignments in this scheme are phonetically scheme and worse still, a given transliterated equivalent to the aksharas of Sanskrit or other string will produce different outputs for Indian languages. As indicated earlier, the different languages/scripts. Take for instance phonetic alphabet with diacritic marks is very the word ‘yoga’. The transliterated data input helpful for representing text in Indian for this string using the ‘ITRANS ‘ scheme languages. Such letters are also easily typeset, is ‘yogA ‘. However, when you use this string for typefaces are available specifically for this to get an output in Tamil, using other purpose. Typesetting was however schemes, you will get as opposed attempted manually for nearly a century until to which is the correct special word processing and typesetting transliteration. The fact that the short forms applications were developed using of the vowels ‘o’ and ‘e’ are present only in computers. These programs make use of the Southern languages is the real issue here. high quality fonts to produce good printouts and displays. However most of them rely Transliteration schemes have to face the on some indirect data entry methods to problem of letters present in one language generate the phonetic symbols. and not in the other. Thus, unless a superset of letters from all the Indian Languages is The primary difficulty in data entry of the formed, uniform transliteration is ruled out. phonetic symbols is that there is no Even if such a superset were identified, it provision to input the symbols directly using turns out that unique Roman letter the standard ASCII keyboard. Desktop combinations are not easily identified for publishing and word processing programs complex Aksharas. Moreover, the large provide means by which the glyph code of number of vowels in Indian scripts also add the symbol is input using the numeric to the complexity in transliteration. keypad. While this is acceptable, it does not provide a natural approach. Transliteration String Processing Using Transliterated methods, which use only the displayable Text. ASCII symbols, do not run into this One useful feature of transliterated problem since the ASCII letters can be typed representation of Indian Language strings in directly. A special computer program is that conventional string processing would however be required to interpret the programs may be used to process the text. input string to produce the Indian languages However, applications such as sorting will display or printout. This is precisely what produce erroneous results as the sorting the currently popular transliteration schemes order of the Aksharas and Roman letters are attempt. Schemes such as ITRANS, RIT, quite different. Many string processing ADHAWIN etc., use only the standard applications such as processing a sentence

Informatics Studies 3(4), October – December, 2016 37 Computing in Indian Languages for Knowledge Management may however work properly, so long as the obvious NO but a qualified YES. The ‘no’ input strings do not contain special part of the answer has to do with the fact characters, which are needed for transliteration that the limited number of keys on the but can cause confusion if they happen to keyboard will certainly not be able to cater to be delimiters fixed for parsing routines. the thousands of aksharas, which occur in our texts. The qualified ‘yes’ is based on the With transliterated input, the representation observation that the keys may be used to for syllables is always multibyte with varying represent only the vowels and consonants number of bytes for different syllables. For and thus provide for inputting a series of example, if we were to examine the aksharas consonants and vowels from which the in the second row of the letters seen in the required aksharas may be formed using image above, we will see that the last two suitable computer programs. words contain two aksharas (samyuktaksharas are treated as aksharas since The question of data entry in Indian scripts they constitute one syllable) each. However, had attracted the attention of scholars and the word ‘Arya’ has four ascii letters but the computer experts since long and today, we word ‘dhR^shTvA’ has nine. So linguistically can see numerous computer programs, speaking, transliteration using Roman letters which permit document preparation in may not be the best choice for text processing different languages and scripts. These at the level of a syllable. programs, some of them very good in many respects, tend to differ significantly in their It would be helpful to have a representation, approaches to data entry. The variety seen in which uses a fixed number of bytes for each their approaches merits discussion so that syllable. Such a representation would be we may better understand the problems ideally suited for studying the metrical involved. structure of poems or slokas. The programs permitting data entry in many Transliteration Features in the IITM Indian languages/scripts may be classified Software. based on the specific approach taken to The Multilingual software from IITM has forming the aksharas from the keystrokes. incorporated features to help deal with These are listed below. transliterated text. The multilingual editor  Language/script specific data entry , has a data entry method that directly allows which relies on a specific font. transliterated text to be typed in and the text  Transliteration based data entry. viewed in local scripts. A .llf file is also  Data entry conforming to Manual automatically created by the editor. Those Typewriter keyboards, specific to each familiar with ITRANS based input will find language. this feature helpful. IITM also has some  Data Entry based on the INSCRIPT utilities for viewing and converting layout transliterated text (IITM).  Data entry methods specific to generating Data Entry Methods Suited for Indian HTML pages supporting our scripts Languages (web page creation)  Standard QWERTY keyboard seen with Data entry based on uniform mapping most computers is generally used for of the keys for all the languages/scripts. preparing texts and documents in various Data Entry Methods Which are Based on Indian languages and scripts. Fonts. The answer to the question ‘can we do it as The font based data entry methods utilize simply as one does it for English?’ is an the feature supported by conventional word

38 Informatics Studies 3(4), October – December, 2016 Ram Avathar Ojha processors where the font to be used for specialized symbols, diacritic marks etc., and displaying the text may be dynamically may be required mostly in printed text and selected/changed during data entry. Today, special applications. Some word processors the font rendering capabilities built into the do support data entry for these glyphs, which operating systems are quite sophisticated in are typically located in the upper ASCII range that the required shape of a character may be (160-255) by allowing the numeric value of built from several primitive shapes which the glyph location to be input with the ALT are called glyphs. Each font may consists of key kept down as the numeric value is typed about 200 different glyphs, where each glyph in. may directly represent a letter of the alphabet, Fonts for Indian languages (Except Tamil) a special character or a symbol. are required to have many more than 96 In fonts for Indian languages, the glyphs glyphs and so, data entry based on this will invariably include shapes for the matras, method of inputting the numeric glyph code special samyuktakshars and special ligatures values and displaying the character, will besides the basic shapes for the vowels and become necessarily cumbersome. Worse still, the consonants themselves. When a font is the input sequences are font specific and will selected, the word processor will display the vary from font to font even for a given glyphs corresponding to the keys entered. script. Fonts for Indian languages had till For the English language (Roman alphabet) recently evolved arbitrarily and do not follow each letter corresponds to only one glyph in any standard. Consequently one sees wide the font and data entry is smooth. In the variations in the glyphs themselves as well case of Indian scripts, we will have to know as the encoding for the font which used to what keys will have to be entered to display locate the glyphs at specified locations in the the sequence of glyphs, which will make up table of 256 locations, there are no standards one character. In the case of Roman, the set for glyph locations for Indian scripts and it of displayable glyphs correspond to the set considered that such standards will be of ASCII codes that are generated when keys difficult. are pressed on the keyboard. This is a set of A discussion of fonts and the issues to be 96 characters and anything more than this considered in designing Indian language number will require special data entry, as fonts will occur in later sections. keyboard has only a limited set of keys. A point to keep in mind is that the internal Conventional word processors are designed representation of the text prepared according for languages where a letter (or a character) to to this method is in the form of eight bit be displayed is associated with one glyph glyph codes. This has serious consequences only. Also, for most of the western if one were to attempt any sort of string languages, the character set itself is limited processing of the text because the glyph codes and so the set of displayable characters is bear no relationship whatsoever to the well within the 96 mentioned above. Even linguistic nature of the aksharas in terms of though a font for western languages may lexical ordering, sorting or indexing etc. Yet, need to accommodate only the displayable this font based data entry method is popular set, many glyphs in the font may be present with DTP packages, where one is interested that are not displayed when the keys are more in printing text as opposed to linguistic pressed during regular data entry. That is, analysis. there may be glyphs in a font, which are displayable but not necessarily shown when There is however a bright side to this keys are entered. These glyphs usually approach. Though keyboard entry is correspond to characters with accent marks, cumbersome, one might effectively use the

Informatics Studies 3(4), October – December, 2016 39 Computing in Indian Languages for Knowledge Management cut and paste facilities supported in the word advantage that nearly every glyph in the font, processors to perform some editing of the which may have as many as 250 glyphs, may entered text. In some word processors, one be used in printing. In contrast, fonts for also sees an image of the keyboard with other systems such as X-windows, aksharas and matras assigned to the keys and MSWindows, PostScript etc., are restricted the user may simply click on the keys to select to just about 200 glyphs. This is not a design the glyph to be displayed. Also if the user limitation of the font but a problem arising were to keep a standard file containing the out of the inability of application programs glyphs, then individual glyphs may be cut and font rendering routines to look at specific and pasted even for entering short sentences glyph locations. As a consequence of the rich of text. Some Urdu word processors have set of glyphs, the Dvng package could print this feature. a rich set of conjunct characters in Devanagari. The data entry on the basis of fonts and After Dvng, Charles Wikner enhanced the glyph codes cannot really provide a natural fonts to accommodate Vedic symbols and interface, even if supported through also gave a new processing package. As of sophisticated macro facilities found in some today, the Devanagari output obtained using word processors. When we have to input a this package is of remarkably high quality text like the following multilingual text using and Wikner’s choice or design of the glyphs our favorite word processor or DTP program has allowed nearly a thousand different there is a need for some easy ways to do this. conjunct formations to be derived from the basic set of about 250 glyphs. Both the packages mentioned here had arrived at some guidelines for standardization in the selection of the Roman letters for the aksharas of Sanskrit. In many instances, special symbols from the ASCII set were Transliteration Based Data Entry required to be used to distinguish similar Methods. sounding aksharas. Printout using these Transliteration has been a popular approach packages were restricted to Devanagari but to preparing printed documents in different Roman could be part of the text as well, Indian scripts. The idea behind the method permitting bilingual outputs. Subsequently is to use Roman letters to represent the TeX based systems were introduced aksharas of the languages and process the for Tamil, Telugu, Malayalam, Gurmukhi, resulting string (ASCII text) using special Gujarati and Bengali. computer programs, to produce printed Following the success of the TeX based output. The output is obtained using packages, Avinash Chopde developed a appropriate fonts. special transliteration package that allowed One of the early computer programs to other scripts to be handled as well, via successfully implement this idea is the Dvng language specific fonts. His ITRANS processor for Devanagari using TeX. This package is well known on the web. program produced a TeX file, which could Subsequently he enhanced the package to be typeset using the TeX program. Franz work with normal fonts under Windows- Velthuis who devised this package, had also 95 and X-windows and was able to generate included a special Devanagari font for use html documents for display on the web. The with the package. The Dvng package ran on most recent version of ITRANS supports Unix systems and TeX fonts have the quite a few languages.

40 Informatics Studies 3(4), October – December, 2016 Ram Avathar Ojha

It must be remembered that all transliteration Universal Transliteration Scheme for based data entry methods, require a computer Indic Scripts. program to generate as well as format the Dr. Anthony P. Stone has recommended a output and hence they cannot be applied or special transliteration scheme to handle all used for interactive data preparation, where the Indian scripts. This interesting proposal the display in Indian scripts immediately uses eight bit character codes to represent follows the key strokes. the vowels and consonants and hence maps The ITRANS package was followed a fairly large superset of vowels and by JTRANS, a Javascript based program by consonants from all the scripts of interest. Sandip Sibal who allowed quick generation This is a meaningful proposal but has only of html documents from transliterated one likely limitation. Existing data entry inputs. This package introduced Xdvng, a facilities do not permit easy typing of quality font for Devanagari, which could be characters in the upper ascii range (160-255) used for viewing web pages with Devanagari and so data entry using this scheme was not text both under MSWindows and feasible. However, it is quite easy to display XWindows. Sibal’s package is however all aksharas using this scheme. Therefore restricted to Devanagari. printouts of Indian language texts in transliterated form, can be easily generated. The Itranslator package from Onkarnath Standardization of transliteration will help Ashram in Rishikesh allows data entry in considerably in dealing with Indian ASCII using the ITRANS scheme but allows languages in a uniform manner. the string to be converted to Devanagari and displayed on the screen itself. The font used Summary of Transliteration Based Data by this package is probably the finest of the Entry. freely available fonts for Devanagari and is 1. This method allows text in Indian known as Sanskrit_1.2. Unfortunately, this languages to be input using Roman letters. font is suited for the Windows platform A special computer program is used to alone and has glyphs in locations that create process this text in Roman to produce problems on other platforms. A later version printouts or displays using appropriate fonts of this font (Sanskrit-98) seems to avoid for the scripts. There are several the above problem. There are many later transliteration schemes in use. Most of the additions to the Itranslator package, processing programs run under Unix. including new fonts. 2. Transliteration schemes are often specific Transliteration Schemes for Tamil and to one Indian language/script. There is no Telugu. single scheme yet that correctly handles all There have been a few popular packages for the Indian languages. Tamil and Telugu, which use the 3. Phonetically close Roman letters may not transliteration, based data entry method. The be found for all aksharas. So some Adhami package was written for use under compromise is required in selecting the DOS and subsequently enhanced to work Roman letters. Also multiple representations under MSWindows and produced displays for the same akshara seem to be allowed, and printouts in Tamil. Other transliteration making the processing somewhat complex. schemes such as Mylai and Cologne were also popular with Tamil. For Telugu, the RIT 4. It is possible to confuse most of the package developed by Rama Rao Kanneganti, processing programs by inputting arbitrary formations of conjunct aksharas. used TeX for typesetting the output. Details of some of the transliteration schemes may Transliteration based data entry is a workable be found elsewhere in this study. solution for Indian scripts, since in principle,

Informatics Studies 3(4), October – December, 2016 41 Computing in Indian Languages for Knowledge Management it allows for a uniform data entry mechanism the keys provided on a standard QWERTY for all the languages. The transliteration keyboard and is hence implemented easily scheme should be comprehensive enough on personal computers. It may be observed to handle all the aksharas across all the that a number of keys normally used for languages/scripts. punctuation or special symbols are also mapped to the ISCII characters. It will Will it be meaningful to have a system where, therefore be difficult to perform data entry as one types in the transliterated text, the of text along with a full complement of actual characters of the Indian script appear punctuation marks, which have come to into on the screen? This is what was being use with almost all the scripts. Microsoft attempted by some of the applications, applications also use the INSCRIPT layout which work under for Unicode data entry and hence suffer from systems. While this was an interesting this problem. The Microsoft Hindi keyboard development, the transliteration schemes has apparently provided for many used are often language specific and have not punctuation marks but one has to effect always permitted the formation of many multiple keystrokes to enter them. Shown complex conjuncts (Samyuktakshars). below is the INSCRIPT layout on a Manual Typewriter Keyboard Based Data QWERTY keyboard. Keys corresponding to Entry. the ISCII characters are common across all Manual typewriters for different Indian the scripts. languages have been available for quite some Special Programs for Web Page Creation. time and their use in Educational institutions During the past several years, display of and Government offices is substantial. Indian language text on the Internet Manual typewriters provide for a minimal Newspapers and Magazines has become set of aksharas consisting of the basic vowels popular. Web pages in Indian scripts are and consonants together with the matras so feasible on account of the fact that web that text can be prepared conforming to the browsers may be asked to display a given writing system for the language. The location text in a specified font. This will be further of the keys for the vowels and consonants discussed later. on a regional language typewriter is specific to the language. Many are adept at using such The html standard provides for an interesting typewriters and when they have to move over way of specifying the glyphs to be displayed to using word processors, they would rather either through the numeric code assigned to see the same keyboard mappings. Some the glyph or the universal name assigned to word processors do provide for data entry that glyph location consistent with the font based on the typewriter based key mappings. encoding that has now become standard. The resulting text may not include a number This way, the html language also functions of conjuncts but will be entirely adequate as a macro language, where a text string for normal modern day correspondence. describing the glyphs to be shown may be just typed in using standard ascii. While one Data Entry Based on the INSCRIPT may not need to worry about this for glyphs Keyboard. in the displayable ascii range, the approach is The INSCRIPT keyboard allows more or very useful for glyphs in the upper ascii range. less uniform data entry of text across the In lighter vein, some people on the net refer different scripts. The mapping provides for to this as the method for the ‘ASCII the data entry of vowels, consonants and impaired’! The advantage of this approach matras consistent with the specifications in need not be emphasized, for virtually any ISCII. The INSCRIPT layout utilizes only text editor capable of data entry for the upper

42 Informatics Studies 3(4), October – December, 2016 Ram Avathar Ojha

ASCII characters can be used to produce web In a sense, generating display through html pages in Indian languages, provided one has is similar to the macro-based approach taken patience! by TeX, the typesetting program developed As an example, the html document shown by Dr. Knuth. While TeX has the advantage below will produce the display given in the of using most of the 256 glyphs in a font, image that follows. < and « html displays are constrained to using only represent two glyphs that are specified about 200, thus loosing the ability to display through their name entities. some conjunct letters. Web Pages Supporting Display of Unicode Text.

View the source of this document to see how name entities have Unicode has been accepted as a meaningful been used in preparing the Devanagari standard for handling multilingual text. string seen below
Most browsers introduced after 2002, include support for this. With Unicode text, the method indicated above does not apply, for s<Sk«tm! the encoding standard automatically
identifies the font to be used. Unfortunately, rendering Unicode text in Indian languages is beset with multitudes of problems and it is unlikely that correct rendering of text will be realized. Unicode text in Indian scripts will have to be created using appropriate programs such as Microsoft Word and related applications. Even in 2005, several browsers cannot correctly display Indian language text represented through Unicode. The difficulties encountered in dealing with Unicode for Indian languages will be explained in greater detail in later papers of this series. Phonetic Mapping of the Vowels and Consonants. The user preparing the html document must One way of looking at data entry in Indian necessarily know the location of the glyphs. languages is to view the text as consisting of This, as we know is font specific, even if the aksharas that can always be decomposed into font is meant for a specific script. vowels and consonants and perhaps a few

Informatics Studies 3(4), October – December, 2016 43 Computing in Indian Languages for Knowledge Management symbols. In this phonetic approach to data formed. This feature is helpful in situations entry, just one keystroke is associated with where consonants and vowels not present each vowel and consonant and a computer in a language are attempted to be input. The program typically an input module in an system will not accept such inputs thus application, keeps track of the keystrokes and providing a safeguard that only valid forms the aksharas. In many ways, this combinations may be input. approach is similar to the transliteration In the phonetic approach, the key mappings based data entry except that we are not do not relate to the generic consonants i.e., a constrained to mapping the vowels and consonant without any vowel. The mapping consonants to any specific keys. Also, in the relates to the form of the consonant where transliterated input case, more than one the first vowel ‘ah ‘ is assumed to be present. keystroke may be required to form a vowel This is often the way the consonants are or a consonant (e.g., an aspirated consonant taught for children. This way, only one or a diphthong). keystroke will be required to enter the most The Inscript ; the frequently required form of each consonant, recommended standard for ISCII based as opposed to the case with transliteration systems follows this approach though it based data entry where two keystrokes will includes keystrokes for the matras as well. be needed. Since the addition of a matra to form a consonant vowel combination is not References uniformly applicable to all cases (in Tamil Dash, N S (2005). Corpus linguistics and and Malayalam, the combination with the language technology: With reference to Indian vowel ‘u ‘ changes the shape of the languages. Mittal Publications. consonant), the Inscript keyboard does not Pal, U., & Chaudhuri, B B (2004). Indian correctly indicate or reflect what would script character recognition: a survey. Pattern happen when a combination is input. Recognition, 37(9), 1887-1899. However it may be assumed that the key for Shneiderman, B (2010). Designing the user a matra does not always result in a matra but interface: strategies for effective human- may change the shape of the consonant. The computer interaction. Pearson Education inscript keyboard basically confirms that a India. phonetic approach to data entry is feasible. Singh, A. K. (2006, October). A True, the basic requirement here is that the computational phonetic model for Indian input module must process each keystroke language scripts. In Constraints on Spelling taking into consideration the previously Changes: Fifth International Workshop on entered keys and also check if the conjunct is Writing Systems. valid or meaningful. But this is a module Sinha, R M K (1984). Computer Processing that can be written once and incorporated of Indian Languages and Scripts— into an application program, to work Potentialities & Problems. IETE Journal of uniformly across all the Indian languages. Research, 30(6), 133-149. The data entry scheme recommended for the Sinha, R. M. K. (2009). A journey from IITM software essentially follows this Indian scripts processing to Indian language approach with one additional facility. The processing. IEEE Annals of the History of CTRL key or an equivalent is used to indicate Computing, 31(1), 8-31. that a combination is required to be effected Sinha, R. M. K. (2009). A journey from with the previously formed akshara and the Indian scripts processing to Indian language current input. Thus the user explicitly processing. IEEE Annals of the History of indicates that a conjunct will have to be Computing, 31(1), 8-31.

44 Informatics Studies 3(4), October – December, 2016