Computing in Indian Languages for Knowledge Management: Technology Perspectives and Linguistic Issues
Total Page:16
File Type:pdf, Size:1020Kb
Informatics Studies. ISSN 2320 – 530X. Vol. 3, Issue 4, Fourth Quarterly Issue. October – December, 2016. P 31-44 Computing in Indian Languages for Knowledge Management: Technology Perspectives and Linguistic Issues Ram Awathar Ojha Abstract The paper is the first among a series of documents on computer applications in Indian languages to be published in the journal. Traditionally, computer applications were based on English as the medium of interaction with the system. The technology is not yet viable to Indian languages. Majority of Indian population does not speak or write in or understand English and this weakness of technology hinders the attempts to use computers for education and literacy, meant for majority of the population who should get the benefit of Information Technology. In many parts of the world computer applications have been developed in different regional languages, appropriate to the user communities. This introductory paper gives a perspective of research and development on computational linguistics, the phonetic aspect of Indian languages, transliteration standards, lack of uniformity, data entry methods, keyboards, phonetic mapping of the vowels and consonants, web pages supporting display of Indian language text etc in the context of computing in Indian languages. Keywords: Computational Linguistics, Indian Languages, Telugu, Kannada, Phonetics, Alphabets, vowels, consonants, syllables, transliteration standards, data entry methods, keyboards, , web pages This paper covers Writing Systems Codes for Systems (MLS) in the context of this the Aksharas, The Phonetic Aspect of Indian discussion. Typically, an MLS will permit Languages, The importance of users to interact with computers in their own Transliteration, Lack of Uniformity Between mother tongue. Such a system will have far Different Schemes. Data Entry Methods, reaching consequences in a country like India Keyboards, Phonetic Mapping of the Vowels where English is not spoken or understood and Consonants and Web Pages Supporting by the majority of the people living in areas Display of Unicode Text. away from urban environments. Brief Introduction Over the years, the bulk of software development in India has been carried out The computer programs, which permit user primarily through the English language. interaction with the computer in one or Knowledge of English is essential for the more languages, where the language can be development of Information Systems since selected dynamically, either at the time of virtually all development packages rely on invocation of the program or subsequently English specific input. Current software during its execution is termed as Multilingual development tools can also be used to Informatics Studies 3(4), October – December, 2016 31 Computing in Indian Languages for Knowledge Management develop an application that incorporates little is associated with some data, typically a text or no English in its user interface. Hence string, a number, an image and so on. One MLSs are not only feasible at the level of an writes computer programs to manipulate application program but can also present a data. Computing in Indian languages may truly localized environment for a user broadly relate to computer programs which desiring to interact with computers in process Indian Language text strings, which regional languages. This approach has the may be input through a keyboard and added advantage of providing uniformity displayed on a conventional screen display. across computer systems where, regardless A primitive approach to computing in of the machines like PC, MAC, Workstation Indian languages may be through computer etc. the user will see the same interface. programs, which do string processing on MLS catering only to the display of texts of Indian languages much the same information are easier to design and build. way it is done for the Roman (ASCII) The phenomenal growth of the internet has strings. This way, many existing applications created the need for internationalization of may be adapted to work with text strings in the web where, using the right software tools, Indian languages. It turns out that this is one can present information in the form in not a simple task on account of the large set which it would be received best. However, of aksharas (letters) that have to be handled. technical challenges faced in implementing While the older techniques seem to be well such MLS and the lack of standards have suited for displaying the Indian Scripts, data retarded the development, especially in entry becomes formidable. Hence, new respect of the languages of India. approaches to handling Indian language text In the past, MLSs in Indian languages have have become essential. basically concentrated on data preparation Often people ask questions such as ‘why and printing (DTP). The increasing demand not have an Operating System run in Tamil for using computers in the vernacular has or Bengali?’ just as Arabic or Japanese forced developers to look at the user interface Windows. There is no satisfactory answer to with seriousness. The Systems Development this question. In the first place, building Laboratory, Department of Computer support for Indian languages within Science and Engineering at the Indian Operating systems is not an easy job, Institute of Technology, Madras (IITM) has especially when one looks for uniformity in done unique research and development to use across all the Indian languages. provide a quality MLS for the country that is truly Indian in concept, design and There is a general feeling among the implementation. They have developed professionals that it is best to deal with numerous software packages to permit user Multilingual information at the level of the interaction with computers in different user application. That is to say, keep the Indian languages during the late eighties and language aspects outside the Operating they continue their work. The series of System. This way, the Operating Systems are papers including the present one are rid of the problem of having to deal with intended to present before the readers the varied character sets across many different experience shared by concerned scientists of languages. While some persons point to the IITM in dealing with this interesting but success of Unicode implementation in complex problem. Microsoft Windows, it must be clearly understood that the system kept away from Computing in Indian Languages Indian languages for an important reason. Computing is a general term which refers to Unicode is just not right for linguistic information processing, where information processing with Indian Languages and is too 32 Informatics Studies 3(4), October – December, 2016 Ram Avathar Ojha complex to handle even for one Indian script, The phonetic base and the concept of the much less for all the scripts in a uniform Akshara is unique to the languages of India. way. For all practical purposes Unicode has It is best to deal with the problem of retained only the eight bit coding structure computing with Indian languages by first (actually only 128 codes) for all our scripts, understanding how aksharas have been used which is really the bottleneck in handling the in our ancient as well as modern texts. This aksharas. For efficient string processing, it is will give us an insight into the linguistic base necessary to work with the basic linguistic of our languages, allowing us to come up quantum of our languages, which happens with a universal approach to dealing with all to be not a letter of the alphabet but an the languages of the country. Akshara, which is actually a syllable. General Introduction to Indian The problems associated to the current Languages implementation of Unicode for Indian languages, or for that matter the ISCII code Numerous languages are spoken in India. itself, the basic standard that led to the Most of them relate to one of the officially Unicode representation for our languages will recognized languages and there are about be discussed later. eighteen languages identified for regular use in the country. All these languages have a Any meaningful approach to computing in phonetic base, though their writing systems Indian languages must provide for unique vary. Some of the languages have a common identification of the full set of aksharas, script and some have scripts of their own. through fixed length codes and also provide There are nine basic scripts besides the scripts a uniform approach to data entry of the for Urdu and Sindhi. These nine constitute aksharas across all the languages. Any other the basic scripts of India. The eighteen approach will suffer from incompatibilities languages mentioned above, have been between the user interfaces across different given the status of official languages by the Indian languages. While solutions specific Government. Though the use of a language to a language may indeed be feasible, one is may appear to be confined to a region within looking for solutions, which can be used all the country, the mother tongue of many over the country. persons living in that region may be quite a For the present, and at least for some years different one, traceable to early migrations to come, it is best to handle the problem by of families. writing applications which handle All the recognized languages; mostly referred Multilingual information directly, that is to to as the regional languages have a phonetic say that the Operating System’s support base. It is seen that there is a substantial set should not be sought as there is no of words common to many of these consensus among the professionals on what languages and the roots of these words may this support should be. The total be traced to specific languages such as arbitrariness with which data entry in Indian Sanskrit or Tamil, both of which are languages has been handled, not to speak considered very ancient languages. of the lack of uniform representation across the languages, makes it necessary for us to Linguistic aspects of Indian languages have take a serious look at standardization.