
A Journey from Indian Scripts Processing to Indian Language Processing

R. Mahesh K. Sinha Indian Institute of Technology, Kanpur

This overview examines the historical development of mechanizing Indian scripts and the computer processing of Indian languages. While examining possible solutions, the author describes the challenges involved in their design and in exploiting their structural similarity that lead to a unified solution. The focus is on the script and Hindi language, and on the technological solutions for processing them.

India is a highly multilingual country with 22 constitutionally recognized languages. Besides these, hundreds of other languages are used in India, each one with a number of dialects. The officially recognized languages are Hindi, Bengali, Punjabi, Marathi, Gujarati, Oriya, Sindhi, Assamese, Nepali, Urdu, Sanskrit, Tamil, Telugu, Kannada, Malayalam, Kashmiri, Manipuri, Konkani, Maithali, Santhali, Bodo, and Dogari. Hindi written in the Devanagari script is India's official national language and has the most speakers, estimated to be more than 500 million.

Indian languages belong to the Indo-European family of languages [1-4]. Languages of the north and western part of India belong to the Indo-Aryan family (spoken by about 74% of India's speakers) while the languages of the south belong to the Dravidian family (about 24% of India's speakers). The Sino-Tibetan, Austric, and some other groups form the other prominent language families. The Sino-Tibetan family is spoken mainly in the northeastern parts of India, and the Austric-Asiatic group of languages is spoken mainly by the tribal people of India's northern belt. The languages within each family exhibit much structural similarity. In addition, India's languages have undergone significant mixing and cross-fertilization. Interestingly, the English language brought to this subcontinent with British rule is understood by less than 3% of the country's population, although it continues to be the major language to link federal and state communications and is used in the country's higher-education institutions. Moreover, English is mandated as the authoritative text for federal laws and Supreme Court judgments [5,6].

In this article, I present an overview of the historical development of the modern Indic scripts' [...], their mechanization and adaptation to computing, and I examine how this facilitated development of Indian language processing. I concentrate primarily on the Devanagari script and the Hindi language as these are the most popular on the subcontinent. I do not delve into the history of how modern Indic scripts and languages have evolved; instead, I discuss only those features found in current language usage, and explain how the unifying characteristics of the scripts and languages have been exploited to provide solutions applicable to almost all Indic scripts and languages.

Indian scripts: Background

Ten major modern scripts are currently used in India: Devanagari, Bengali, Oriya, Gujarati, Gurumukhi, Tamil, Telugu, Kannada, Malayalam, and Urdu. Of these, Urdu is derived from the Persian script and is written from right to left. The other nine scripts, written from left to right, originated from the early Brahmi script (300 BC) [7,8] and are also referred to as Indic scripts. The early Brahmi script split into two major branches, one consisting of the north Indian scripts (Devanagari, Bengali, Oriya, Gujarati, and Gurumukhi) and the other south Indian or Dravidian scripts (Tamil, Telugu, Kannada, and Malayalam).
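As a rough illustration of this grouping, the ten scripts can be tabulated with their writing direction and branch. The sketch below is not taken from the article: it assumes the modern Unicode block ranges as a stand-in for each script (Urdu is approximated by the Arabic block), and the names SCRIPTS and scripts_in are hypothetical.

```python
# Illustrative table of the ten scripts named above.
# Assumption: each script is identified by its Unicode block range.
SCRIPTS = {
    # name: (first code point, last code point, direction, branch)
    "Devanagari": (0x0900, 0x097F, "ltr", "north Brahmi"),
    "Bengali":    (0x0980, 0x09FF, "ltr", "north Brahmi"),
    "Gurumukhi":  (0x0A00, 0x0A7F, "ltr", "north Brahmi"),
    "Gujarati":   (0x0A80, 0x0AFF, "ltr", "north Brahmi"),
    "Oriya":      (0x0B00, 0x0B7F, "ltr", "north Brahmi"),
    "Tamil":      (0x0B80, 0x0BFF, "ltr", "south Brahmi"),
    "Telugu":     (0x0C00, 0x0C7F, "ltr", "south Brahmi"),
    "Kannada":    (0x0C80, 0x0CFF, "ltr", "south Brahmi"),
    "Malayalam":  (0x0D00, 0x0D7F, "ltr", "south Brahmi"),
    "Urdu":       (0x0600, 0x06FF, "rtl", "Persian-derived"),
}


def scripts_in(text):
    """Return the set of script names whose block contains a character of text."""
    found = set()
    for ch in text:
        cp = ord(ch)
        for name, (first, last, _direction, _branch) in SCRIPTS.items():
            if first <= cp <= last:
                found.add(name)
    return found
```

A block test of this kind only says which script a character belongs to; it says nothing about the language, since one script (Devanagari, for instance) serves several languages.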

8 IEEE Annals of the History of Computing Published by the IEEE Computer Society 1058-6180/09/$25.00 c 2009 IEEE


Devanagari script is used for writing a majority of the Indian languages such as Hindi, Marathi, Sindhi, Nepali, Sanskrit, Konkani, and Maithali. Bengali script is used for writing Bengali and Assamese. Gurumukhi is the script for writing Punjabi. Some of these languages have their own script, and some differ by having a few additional symbols to represent the purity of their sound. Several other scripts are in use but are gradually vanishing, primarily from the lack of political and technological support. A detailed description of the Devanagari script will be provided in a later section.

The aforementioned nine scripts besides Urdu are commonly used throughout India. It is estimated that all the literate people in India belonging to the different linguistic zones use their regional script in communication. Most of the urban population is also familiar with the roman alphabet and frequently uses and mixes Indian languages written in romanized form. Such use is more prominent in advertisements, cinema posters, and text messaging. However, romanized text reading is mostly contextual, and only native speakers can read these correctly because no phonetic marker symbols are featured in these writings.

According to the 2001 Indian census, India's literacy rate was 65.38% and the urban population stood at 27.8%. So, approximately 65.38% of people use Indic scripts. Although exact figures are not available, the literacy rate in urban India is estimated to be higher than the national average. Thus, we can say that about 18% to 25% of people use both Indic scripts and romanized text. A large population of about 25 million Indians living abroad, however, knows the Indian language but not necessarily the Indic script; these people use the romanized Indian language text. (I found it interesting that when I Web-enabled the Indian Institute of Technology Kanpur's English-to-Hindi translation system, Indians living abroad overwhelmingly requested Hindi translation in romanized form.)

Social transformation

When we examine the pattern of usage of Indian scripts on computers and other devices, we find a chicken-and-egg situation. The language divide significantly contributes to the digital divide [9,10]. The benefits of advancements in information technology (IT) have yet to percolate down to the grassroots level; in fact, IT has contributed to the widening of the social divide [9]. Although Internet usage has grown tremendously in India (28 million users), it accounts for a meager 2.72% of the Indian population. India, which constitutes 15% of the world population, accounts for only 2% of global Internet searches (http://www.comscore.com/press/release.asp?press=2400). In addition to economic factors, this disparity has resulted largely from a lack of Indian language content on the Web and corresponding tools. The increasing availability of these tools, however, corresponds to an increase in computer usage, especially among mid-level businessmen. In a random survey, I found that more than 90% of such businessmen use their local language written in local script. Computerized land records, driving licenses, voter IDs, and so on are some of the other major applications where local script usage is bringing about a social transformation.

India is also witnessing a tremendous growth in mobile phone usage. Nonvoice applications via mobile phones, such as text messaging, cash transfers, and online purchases, have emerged as a major alternative to computer-based e-mail in the lower-middle-income economies [11], which has helped drive significant demand outside metro areas for mobile phones that handle native languages [12]. It's clear that the linguistic interfaces to computers and other devices play an important role in providing economic growth to the rural masses and in bridging the social divide.

Although Indian languages and Indic scripts are several centuries old and symbolize humankind's early evolution, their mechanization and computerization has received little attention, for historical and political reasons, compared to the languages of the Far and Middle East such as Chinese, Japanese, and Korean [13-17]. A major reason behind this neglect has been that the "elite" portion (less than 3%) of the Indian population with whom the international community conducted business knew English because of longtime British rule. This English-speaking Indian community has led India's economic, industrial, professional, political, and social life [6].

It is only within the past decade or so, as a result of globalization and emerging markets in India, that IT companies have begun investing in Indian language localization. Researchers in India, however, began working on localization in the early 1970s and came up with elegant solutions unifying

January–March 2009 9


characteristics of the Indic scripts that formed the basis for India's present technological development on localization, as I will explain.

Script differences and similarities

Indic scripts exhibit a lot of similarity in their features and are all phonetic in the sense that they are written the way they are spoken: there is no rigid concept of "spelling" as with western writing systems. However, the same language spoken in different geographical regions can differ in accent, which can lead to variations in spelling.

Indian scripts are a logical composition of individual script symbols and follow a common logical structure we can refer to as the "script composition grammar," which has no counterpart in any other set of scripts in the world. Indic scripts are written syllabically and are usually visually composed in three tiers where constituent symbols in each tier play specific roles in the interpretation of that syllable. In one method of mechanizing Indic scripts [18], the set of these syllables, which number several thousand, has been used like those applicable to Chinese or Korean languages. Such solutions do work but are cumbersome and unnecessarily burden a computer system because they do not exploit the logical structure of the Indic scripts.

Most Southeast Asian scripts such as Thai, Burmese, Lao, Khmer (Cambodia), and Bali are similar to Indic scripts [19]. Although the work on mechanizing these scripts was started in the 1960s by IBM [19-21], the scripts' unifying characteristics have not been exploited in finding solutions in terms of devising the computers' internal codes and uniformly rendering script output. Yet no other group of scripts in the world presents such unifying characteristics as found in Indic (South Asian) scripts and Southeast Asian scripts.

Features of Indian scripts

A look at the major features of the Devanagari script [7,22,23] will help illustrate the complex nature of mechanizing Indian languages; examples are included from other Indian scripts wherever there are variations.

The Indic scripts have a number of consonants, each of which represents a distinctive sound. These are arranged in different classes based on the articulatory mechanism used to produce the corresponding sound. At a broad level, these classes (called varga) are velar, palatal, retroflex, dental, labial, and a few others. The consonants in each varga are further arranged in the order of the voiceless and voiced plosives followed by the corresponding nasal sound. Each voiceless and voiced plosive category is further divided into two parts, unaspirated and aspirated. In the "others" category, we have the fricatives, sibilants, and some other forms.

Figure 1 shows the Devanagari consonants and depicts their individual positions. The top row in each varga holds what are referred to as the "full" consonants. Each full consonant has an inherent vowel sound of "a" attached to it. In the second row of each varga, the corresponding "pure" consonant form (usually referred to as the "half letter" form) is shown. The half-letter form represents the absence (muting) of the inherent vowel sound. Visually, we derive the pure consonant form in the Devanagari script from the full consonants by deleting the vertical line at the end (end-bar) or by putting a halant sign (see Figure 2) at the bottom of the letter [24]. In the case of middle-bar characters, it is shown by straightening of the half loop at the end.

Figure 1. Ordered list of consonants in full and pure consonant forms.

Figure 3 shows the Devanagari vowels. These are also arranged according to the


articulation of sounds and their short and long duration. For each vowel, other than the first denoting an "a" sound, there is a corresponding modifier symbol called a matra. A matra can be attached to a full consonant or a consonant cluster (also known as a conjunct), imparting its sound to the consonant/conjunct. Only one matra can be attached to a consonant/conjunct. Figure 4 shows some of the diacritical marks used in Devanagari script.

Figure 2. Halant symbol.

Figure 3. Devanagari vowels with corresponding matra symbols (dotted circle denotes a consonant/conjunct).

Figure 4. Devanagari diacritical marks (dotted circle denotes a consonant/vowel).

A pure consonant, or a sequence of the pure consonants, followed by a full consonant forms a consonant cluster, or conjunct. Conjuncts are formed in one of two ways. One is by explicitly using the halant symbol (or an equivalent symbol in other scripts), and the other is to graphically combine the two shapes to generate a new glyph. Figure 5 depicts some of the conjuncts along with their constituents. Note that the visual shapes of the conjuncts can be completely different from their constituents. Often, the second consonant glyph is reduced and attached to the first consonant vertically or horizontally. The total number of conjuncts can be as many as 3,000. In early handwriting and typesetting, a large number of conjuncts were frequently used; today, however, people commonly use a much smaller set of conjuncts, usually only 20 to 25.

Figure 5. Some example conjuncts in Devanagari are shown with their constituent symbols.

Conjuncts, regardless of how formed, are all equivalent, and the user can decide which form to use, depending on how elaborate the text is that the user is composing. Even the individual consonant symbols can have different, but equivalent, shapes. Some of the consonants with the nukta diacritic behave as an independent consonant with a slightly different sound (see Figure 6 for examples).

Figure 6. Some examples of Devanagari characters with the nukta diacritic attached.

Further, the conjuncts formed with the consonant corresponding to the ra sound yield special symbols attached to the associated consonant. When a pure consonant (half letter) is followed by a ra consonant, a symbol called ra-kar is attached to the corresponding full consonant. This ra-kar symbol is a small left-leaning diagonal line attached to the bottom vertical stem of the consonant. When there is no vertical stem at the bottom of the character, as in the case of the retroflex class, a small inverted "v" shape is attached at the bottom of the character. On the other hand, if the pure consonant ra is followed by a full consonant, a symbol called reph (a small c-shaped curve) is attached to the top of the full consonant. Figure 7 gives examples.

Figure 7. Some example conjuncts in Devanagari formed with the ra consonant.

The anuswar and nasalization symbols in the Devanagari script need special mention. When an anuswar is used on top of another symbol, the nasalization of the varga to which the following consonant belongs comes into effect while speaking. Where a following consonant is absent, the corresponding associated vowel sound on the consonant to which the anuswar is attached is nasalized. Thus, there are two forms of conjuncts with nasalization, one with the anuswar symbol and the other that explicitly uses the nasalization character. Both of these forms are equivalent and are frequently used. Unfortunately, many Hindi writers today do not follow this rule that comes


Figure 8. Some examples showing the use of the anuswar symbol in Devanagari and its equivalent conjunct form.

from the restrictions imposed by the articulatory mechanism of the sound. Figure 8 shows examples.

From the description thus far, it is clear that the Devanagari script is a logical composition of its constituent symbols. From a more technical viewpoint, it is possible to define a script composition grammar for the script [25]. This also holds true for all other Indic scripts, with minor variations. Figure 9, which is my own formulation, shows this grammar in Backus-Naur Form notation; note that it gives the script composition grammar formulation only at the logical, not visual, level. The visual-level formalism is available elsewhere in a finite state machine I designed for Devanagari OCR work [26].

    vowel               := {list of vowels};
    matra               := {list of 'matra' symbols};
    diacritic           := {list of diacritic marks};
    full_consonant      := {list of full_consonants};
    pure_consonant      := {list of pure_consonants};
    conjunct            := pure_consonant+ full_consonant
    composite_character := vowel diacritic* |
                           full_consonant diacritic* |
                           full_consonant matra diacritic* |
                           conjunct diacritic* |
                           conjunct matra diacritic*
    word                := composite_character+

Figure 9. Indic script composition grammar. (There may be restrictions on the use of certain diacritic marks on symbols that this formulation has not considered.)

Now, let us examine how the Indic scripts are visually composed. Indic scripts are written from left to right and juxtapose the composite characters as defined in Figure 9; typically, the characters appear to be hanging from a horizontal baseline. With the Devanagari, Bengali, and Gurumukhi scripts, this horizontal line (called a shirorekha) is physically drawn and visible; in other scripts this line is virtual.

Figure 10. Examples of Devanagari script composition: (a) example word ("chairs") showing three-tier composition; (b) example of ra-kar on a retroflex character with lower matra (this is a rare combination, however); (c) lower matra attached to ra consonant; and (d) examples of variations in positioning of matra symbols.

As I have mentioned, Indic scripts are usually written in three tiers. Figure 10a shows an example word. The middle (core) tier is just below the shirorekha; it holds all the main characters (vowels, consonants, and conjuncts) and the aa-kar matra symbol. The lower tier is exclusively for the lower matra symbols, and the halant, dot diacritic, or ra-kar sign used with retroflex characters for the Devanagari script. For retroflex characters with the ra-kar, the lower matra symbol can go in a tier just below the lower tier, making it a four-tier composition (see Figure 10b), but such combinations are rare and typically people adjust the height to accommodate the fourth tier into the third. In one exception, the lower matra symbol gets attached to the ra consonant in the core tier itself with a change in shape (see Figure 10c). The upper tier, above the shirorekha line, is used for the


upper matra symbols, diacritical marks, and the reph sign for Devanagari script. There are four matra symbols (i-kar, ii-kar, o-kar, and au-kar) that occupy the core tier and extend to the upper tier. Figure 10d shows examples. These examples clearly show that the matra symbols get attached to the left, right, top, or bottom of the character. For some scripts, the matra symbol may be split into two parts: one may get attached to the left, the other to the right. In some Indic scripts, the shape of the base character or matra symbol, or both, changes after the composition.

Early mechanization of Indian scripts

Printing technology arrived with Christian missionaries who came to India in 1556 and who wanted to print the Bible in the Indian languages (http://www.orientalthane.com/history/news_2007_04_4.htm). Printing did not become popular, however, until the 18th century [27]. The earliest type-based Devanagari printing was in 1796 in Kolkata (Calcutta) [28]. The first publication produced in Devanagari type was developed by Charles Wilkins, an English typographer and noted orientalist who first translated the Bhagavad-Gita into English [29]. He was also closely involved in the design of the first type for printing Bengali.

The technology for printing the Indian scripts was adapted from Western technology. For type-based printing, a large set of precast conjuncts (the individual characters and symbols running into the thousands, of varying sizes and shapes) was used for manual composing on a three-tier block. An entire page was composed manually with these juxtaposed blocks, but the rest of the process was the same as that for roman-alphabet printing. Printing quality depended on the quality of the typecast used and on the manual layout of the words and the page, as well as on the printing mechanism used.

The first Devanagari typewriter was introduced around 1930 [30]. Designed by V.M. Atre in Germany and named Nagari Lekhan Yantra, the typewriter was built by Remington. In 1964, the government of India's Department of Official Language approved a keyboard layout for Devanagari to which further modifications were recommended in 1969 [31]. The Indian typewriter company Godrej developed the Devanagari typewriter in October 1968 in collaboration with Optima, a German company. L.S. Wakankar designed both the layout and the typefaces for this [31].

The Devanagari typewriter, an adaptation of the English typewriter, had to accommodate Devanagari symbols in place of the 26 upper- and lowercase roman letters on the keyboard. The typewriting printing mechanism was also modified to allow the multitier composition of the Devanagari script. In summary, the basic mechanisms used for this adaptation are as follows:

- All the Devanagari characters that ended in a vertical line were used with the vertical line removed on the keyboard tops for layout. Recall that this set corresponds to the half-letters (pure consonant forms).
- The [...], halant symbol, diacritical marks, nonvertical bar characters, and some of the half-characters all had a place on the key tops.
- Among the vowels and the matra symbols, only basic shapes were placed on the key tops; the other shapes were composed using a combination of keys.
- If spare unallocated key tops were available, the frequently used vertical bar characters were given a place on them.
- The concept of the "dead" key (overstrike) and the "half-backspace" (move backward by half a character width) were introduced, making it possible to position the lower and some upper matra symbols.
- Symbols could be vertically composed by appropriate positioning of the typeface slugs by the typing-striking-hand associated with the key tops.
- The keyboarding method relied on the visual, rather than the logical, order of characters. The typist learned how to generate the script graphics by using the key top symbols; the order of creating the script followed the order as seen on paper. The process had no correlation to the composite-character composition logic discussed earlier.

Figure 11 shows a mechanical Hindi typewriter and a sample of typed text. These machines found extensive use for producing low-volume documents in the Hindi language. Such typewriters are still widely used, especially in places where electricity is not easily available. It is obvious, however, that the quality of the typewritten Hindi text is poor, with broken lines, broken characters, and bad alignment. The poor quality worsens because of mechanical wear and tear, resulting in inaccuracies in the half-back-spacing and


the dead-key mechanisms. In the 1960s, however, there were few, if any, alternatives.

Figure 11. (Top) Mechanical Hindi typewriter and (bottom) a sample of typewritten text.

The advent of microprocessors in the 1970s made electronic typewriters possible (http://en.wikipedia.org/wiki/Printing). The keyboard layout and the keyboarding scheme for Hindi remained the same on these, but output quality improved significantly. The characters and the symbols were stored in ROM, and the words were composed in RAM in bitmap form. These bitmaps were then printed using a dot matrix printer. The 5×7 or 7×9 dot matrix used for roman script was inadequate for representing the complex curves of Indic scripts, so one solution was to print row by row, but this made printing slow. The other solution was to print in the tiers of the 5×7 matrices.

Then came the 24-pin printer, which was a great relief. In addition to the better print quality, referred to as near-letter quality, some of the electronic typewriters now also provided a small display where users could view the composition before printing and make corrections if needed. Next came the IBM Selectric ball and daisy wheel typewriters. These generated characters by impact printing, and the typewriters' design was similar to mechanical typewriters except that the mechanisms were more rugged and had electronic motor control. Now it was possible to achieve boldface letters by "repeat" printing or by a slightly deviated printing to make the character appear broader. Moreover, it was possible to support different fonts by changing the ball or wheel. The quality obtained was called letter quality. These devices, however, were slower than the matrix printers.

In all these adaptations for Indic scripts, vendors tried to support good font quality and to handle ligatures and more-frequent conjuncts. Obviously, it was not possible to cover the set of conjuncts once available with the letterpress machines. Separate conjunct and ligature wheels were provided with the 1970s adaptations, however, and the printer could prompt for a change of the wheel, a cumbersome, slow, and tedious process. In all these developments, few attempts were made to optimize the keyboard layout and the keyboarding process: typists simply learned to adjust to the highly inefficient, somewhat irrational keyboard layout and associated keyboarding scheme.

Before proceeding with the technical details for processing scripts on computers, it will help provide context to take a look, in the next section, at early investigative efforts and IIT Kanpur's role.

Computers, scripts, and early efforts

Although researchers had made several investigations into computer processing of Indian languages using a romanized version of the text, it was only in the 1970s that computer issues specifically involving Indic scripts were first investigated. In 1970, I and other researchers at the Indian Institute of Technology (IIT) Kanpur undertook the task of first analyzing the logical basis of Indic scripts in preparation for mechanizing them [32,33].

IIT Kanpur

IIT Kanpur is a leading educational technical institution in India and in the world. It acquired its first computer in 1963 under the Kanpur Indo-American program (KIAP)


and was the first educational computer system established in the northern part of India. Very soon, the institution became a focal point for computer training and awareness. The institute ran a number of short-term programs on computer programming in Fortran for teachers from other engineering and science institutions. Demand was great for acquiring computing skills, and the computing resources were scarce. Soon IIT Kanpur upgraded its computing infrastructure, from an IBM 1620 to an IBM 7044.

In 1969, I joined IIT Kanpur in its PhD program after I had obtained a master's degree at IIT Kharagpur in electronics and communications, with a specialization in industrial electronics. At that time, computer science was not a separate discipline; it was offered only as a specialization in the Department of Electrical Engineering. For my PhD, I started working on fault tolerance in digital circuits. In 1970, one of my professors, H.N. Mahabala, had returned from a visit to the Massachusetts Institute of Technology and described an OCR project at MIT to build a reading machine for the blind. Intrigued, I was motivated to switch from investigating fault tolerance to designing an OCR system for Devanagari script; it was a new topic in uncharted territory and much more challenging than working on OCR for a roman alphabet.

That project was the beginning of formal exploration on mechanizing Indian scripts. Some of my colleagues expressed ridicule as well as surprise that I should choose to work on Indian languages at a time when it was almost inconceivable that Indian languages could be used on expensive computer systems, which remained within reach of only a very few in India. I had a strong conviction, however, that the benefits of computing technology could truly reach people only through their own language, and therefore that we Indians had to make a beginning in this direction.

While I pursued the design of an OCR system for Devanagari script, Putcha Narasimham, a Telugu-speaking colleague at IIT Kanpur, was working on his master's degree. We regularly had discussions examining features of the scripts of northern and southern India, coming as we did from those two different areas. We soon discovered the unifying patterns of Indian scripts that became the basis for enabling computers to work with Indian scripts. Putcha, who was taking a systems engineering course at IIT Kanpur, needed a term paper topic, and found this problem of designing a keyboarding scheme for Indian languages to be highly suitable; the results were soon published [32]. I myself presented an alternative schema for the same topic that differed in the manner in which the pure consonant forms were derived [33]. These investigations resulted in the later development of a universal keyboarding scheme and a unified internal code for information exchange that was applicable to all Indic scripts.

After completing a PhD in 1973 on Devanagari OCR [34] and serving at Banaras Hindu University for a couple of years as a Reader, I joined IIT Kanpur as a faculty member (assistant professor) in 1975. It was a good opportunity for developing and continuing research in Indian language technology. Motivating students to work on a problem related to Indian-language technology, however, was difficult at a time when almost all the students were aspiring to go to the US for higher studies. Nonetheless, I encouraged them to tackle the language problem, explaining the challenges and the fact that the problem's solutions must come from us within India and not from others. Moreover, I persuaded them that R&D in Indian-language technology was a necessity for a highly multilingual, multiple-script country like ours. Consequently, I succeeded in forming a core group with some students and research engineers, and in 1983 this finally led to the breakthrough development of the Integrated Devanagari Computer (IDC) terminal and the Graphics and Indian Script Technology (GIST) [35]. This technology incorporated several desirable features that made it user friendly, such as applicability to all Indian scripts, a natural keyboarding scheme, an internal representation well suited for information interchange and transliteration, and flexibility in script composition. We publicly demonstrated this system at the Third World Hindi Conference (Tritiya Vishwa Hindi Sammelan) in New Delhi in October 1983.

After having achieved breakthroughs at the script level [25,33,35-43], I turned my attention in 1984 to solving natural-language processing (NLP) problems for Indian languages. I have always felt that the digital divide within the society cannot be bridged without bridging the language divide [10]. Over time, I developed a methodology for machine-aided translation among English and Indian languages [44-49], work that is still ongoing.

Key events and contributors

Around the time that I focused on NLP, several faculty and research colleagues also began work in this area, many fanning out in different parts of the country, which triggered activities in other Indian languages and scripts. Two events proved particularly noteworthy: in 1988, the Centre for Development of Advanced Computing (C-DAC) acquired the IDC and GIST technology (GIST was now modified to stand for “Graphics and Intelligence-based Script Technology”). C-DAC (http://www.cdac.in/) is a scientific society of the Indian government's Department of Information Technology. Mohan Tambe, who had been working on IDC and GIST with me at IIT Kanpur, joined C-DAC and became instrumental in forming a group devoted to enhancing and commercializing the technology.50 Subsequently, C-DAC released a number of commercial products offering printing solutions, word processing, desktop publishing, and font design, spanning most of the Indian languages and southeast Asian languages.50

The second noteworthy event occurred in the 1990s. In 1995, while still at IIT Kanpur, I was instrumental in initiating and mentoring NLP activities at a newly established scientific society of the Government of India's Department of Information Technology (DIT): the Electronic Research and Development Centre of India (ER&DCI) Lucknow. The DIT's program on Technology Development for Indian Languages (TDIL: http://www.tdil.mit.gov.in) sponsored the project on machine-aided translation (MAT) from English to Hindi based on AnglaBharati technology51 that I developed, and ER&DCI Lucknow was associated with us in this project for productizing the prototype developed. AnglaBharati's underlying methodology45,51 used a pseudo-interlingual approach exploiting the structural commonality of a group of Indian languages. A number of ER&DCI Lucknow's engineers (by then ER&DCI had moved to Noida and become ER&DCI Noida) underwent training with us at IIT Kanpur, which helped them in establishing an NLP center of their own. Subsequently, they acquired the AnglaBharati technology from IIT Kanpur. Under a government reorganization program, ER&DCI Noida eventually became C-DAC Noida. The AnglaBharati technology was also acquired by C-DAC Kolkata and C-DAC Thiruvananthapuram. At these centers, I mentored the machine translation R&D work; IIT Kanpur, therefore, was directly instrumental in establishing Indian-language technology activities at all these centers.

Meanwhile, Putcha Narasimham, who had been the first to develop a universal keyboarding scheme at IIT Kanpur,32 joined the Computer Maintenance Corporation at Secunderabad and developed an Indian-language terminal;52 he also worked on Telugu (personal communication, Putcha Narasimham, Aug. 2008).

Other IIT Kanpur researchers who did not participate actively in our R&D on Indian-language technology but were influenced by our work include Om Vikas, who joined the government's Department of Electronics after completing a PhD at IIT Kanpur. He persuaded the department to support and fund government-level activities, the most notable of which was a national symposium organized on the “Linguistic Implications of Computer Based Information Systems.”53 This symposium, a landmark in the history of Indian language computing, triggered numerous related research projects in India.

Rajeev Sangal, who joined IIT Kanpur's faculty after completing a PhD in the US, became motivated to pursue research in Indian-language NLP. Vineet Chaitanya, whose PhD at IIT Kanpur was in control systems, joined the Birla Institute of Technology and Science at Pilani and taught Sanskrit at IIT Kanpur in the early 1980s. In those days, we had received a number of Acorn Computers' BBC microcomputer boards for teaching and training purposes. Chaitanya, who used those boards to teach Sanskrit, worked with Sangal in NLP and developed the Anusaraka project for machine translation.54 Later, Sangal moved to IIIT Hyderabad and established research programs in Indian-language technology. T.V. Prabhakar, another researcher, developed Indian-language content and created the Gita supersite (http://www.gitasupersite.iitk.ac.in). Three other individuals, who are products of IIT Kanpur, deserve mention: Pushpak Bhattacharya joined IIT Mumbai and continues to work in NLP; B.B. Chaudhuri joined ISI Kolkata and started working on OCR for Devanagari and Bangala; and Harish Karnick works on Indian language speech and data mining.

Scripts: Basic design methodology

The Integrated Devanagari Computer (IDC), as I will explain, was developed on the concepts highlighted in this section. I spearheaded the IDC team effort in the

mid-1970s; we developed the standards for it in cooperation with the government of India's Department of Electronics (DOE; now the Department of Information Technology [DIT]). By 1978, the IDC proof-of-concept was ready.38,55,56 In 1983, the Indian government sponsored a project for us to develop a Devanagari computer based on these concepts. This was completed in a record time of only eight months.35 We presented most of the major research results at the 1978 Linguistic Implications of Computer-based Information Systems symposium and later published the developments carried out through mid-1984.57

While seeking solutions in the early 1980s to the problem of enabling computers to work with Indic scripts, we concentrated on developing the technology indigenously. All of us at IIT Kanpur firmly believed that adapting western equipment and devices designed to deal with roman script would lead to inferior solutions: the Indic scripts formed an entirely separate class and were unique compared to their roman counterparts. Following were our major design considerations:25

• The methodology should be adaptable to almost all Indian scripts and languages; that is, with minor modifications it should be possible to switch to other scripts and languages. This means that the methodology should base itself on the common properties of the scripts and languages.
• The design methodology should assimilate requirements from different application areas and present a unified approach such that, as far as possible, no major modification would be required while switching from one application to another.
• The system should be modularized to the maximum possible extent. It should be possible to configure the system modules appropriately to suit different applications. For software modules, the language-dependent and language-independent parts should be separately modularized; similarly, the device-dependent parts should be kept in a separate module. Portability is also desirable for software modules.

However, meeting these considerations was not easy. Several constraints influenced our design, including these:

• Developments in technology (new microprocessors, new LSI and VLSI chips) continued to flow from abroad. Therefore, any design exploiting the latest technology had to follow those standards and constraints. This was also true for all imported systems software.
• English continued to be the effective link language in the country. Therefore, any Indian-language machine had to also provide facilities for roman script.
• All existing machines were designed with I/O capability only in roman script, for which large investments had been made. An Indian-language machine could best be introduced by their adaptation or through add-on modules.

Some of the major characteristics of the Indic scripts our design considered that led to a unified indigenous approach were these:

• All Indic scripts have similar concepts of the full and the pure consonants, and of the vowels and the vowel modifier symbols (matra). Their order and categorization are based on the same articulatory mechanism. They differ in number of consonants and number of vowels, some providing finer-grained articulation and some remaining at a coarser level. This observation led us to define a superset of all Indic script symbols. This was referred to as the “enhanced Devanagari script” (Parivardhit Devanagari Lipi).
• In all Indic scripts, each consonant has a corresponding pure consonant. Similarly, each vowel has a corresponding modifier (matra). Thus it was possible to reduce the entire set of symbols by taking this correspondence into consideration.
• For all Indic scripts, writers use a similar logical order of symbols, which is what children are taught while learning how to write. This order can differ from the visual order (which is graphic oriented) in that the final script composition may not show the symbols in the same order. This led us to develop a uniform keyboarding method for inputting.
• To facilitate the process of editing the individual symbols and the word processing, the script data must be stored in a linearized form, not in the font codes or composed form codes. This observation led us to design the Indian Script Standard Code for Information Interchange (ISSCII) code.
• The manner in which the individual symbols are joined together to form a word

differs from one Indic script to another. This led us to delineate the composition process of the script for the purposes of display and printing from the rest.

These observations led us to split the design process for enabling computers to handle Indic scripts into three basic stages:

• the keyboard layout and keyboarding stage;
• representation of the text for internal storage and text editing; and
• the stage for rendering the script on the output device.

This is diagrammatically shown in Figure 12.

Figure 12. Three basic stages for enabling Indic scripts on computers. [Flowchart: Keyboarding (linearize the two-dimensional Indic script into symbols at the key tops), then Internal representation (convert the linearized symbols to code points for information interchange, storage, and text editing/processing), then Composition processor (compose the script and render it to the output device).]

Scripts: Keyboard considerations

Usually a syllable in an Indic script is a two-dimensional composition of the constituent symbols. Therefore, an unambiguous way must be devised to convert it into a linear string of the symbols. This is what we call the “keyboarding problem.”

Keyboard layout design involves the problem of optimally placing all the script's symbols on the key tops. The placement is done to minimize the number of keystrokes, and to balance the load on the user's fingers. The two issues (minimum number of keystrokes and finger load-balancing) are related. As mentioned earlier, all the pure consonants can be derived from their corresponding full consonants, and each vowel has a corresponding matra symbol. Thus, our symbol list could have only the full consonants, the vowels, and the diacritical marks; we could derive all other symbols from this set. There are other alternatives to deciding the symbol list for keyboarding as well.25,58-62

The frequency of occurrence of various symbols plays a dominant role in deciding the set of symbols for keyboarding. For Hindi, the frequency63 of occurrence of the vowels is about 4.11%; for the matra symbols, it is about 35.22%. For the standalone consonants (i.e., the consonant without an attached matra), the frequency of occurrence is about 23.87%; for the consonants with the matra, it is about 31.84%; for the pure consonants, it is only about 4.94%. From an optimality viewpoint, then, it's obvious that the pure consonants should not be included on the key tops but should be derived from the full consonants. Note that two keystrokes are needed for this derivation.

Now with the exclusion of the pure consonants (half characters) from the list of symbols, it was possible for us to accommodate all other symbols on the standard QWERTY keyboard layout. For the actual physical layout, we debated several proposals for a considerable amount of time. The major debate was whether the layout should be consistent with the logical grouping of the characters, or if instead it should be based on the finger load-balancing determined by the frequency of various symbols' occurrence. Ultimately, we favored placement according to the logical grouping of symbols, primarily because with electronic touch typing, finger load-balancing had lost its significance. Moreover, the logical grouping would be easy to remember since that is how the script is introduced to learners.

Because the aspirates occur less frequently than the non-aspirates, we kept these with the shift key. Similarly, we kept the matra symbols in the normal position and the corresponding vowel in the shift position. Finally, the project team agreed on a universal layout applicable to all the Indic scripts with the symbols of the enhanced Devanagari script (see Figure 13). We named this the InScript keyboard, and it was standardized by the Bureau of Indian Standards (IS 13194:1991). Because space was available to add more symbols, some of the frequent conjuncts were also assigned a place for efficiency; the assignment can differ from one script to another.

The decision on the keyboarding method was more vexing. The major debate was whether it should be graphic-oriented (i.e., in visual order, with symbols entered in the same order as they appear on the final output) or in phonetic order (determined by how the word being entered is pronounced). I proposed a third variation in phonetic order, Machine Oriented Devanagari Script,

Figure 13. The InScript keyboard layout with Devanagari symbols. Note that the InScript keyboard layout is an overlay over the QWERTY layout, which lets one easily switch from roman to Indic script and vice versa.

where only the consonants and the vowels were assumed. In MODS, a link operator (O–) denotes the composition. Figure 14 shows a few examples to illustrate the difference in the three keyboarding schemes.

Hindi typists were accustomed to using the visual order, so there was strong resistance to the phonetic order on a keyboard. The visual order of script symbols, however, has several drawbacks. First, it is script-dependent: the keyboarding sequence differs for different scripts, which effectively loses the universality of the keyboarding scheme we had been seeking. The more problematic situation results when the visual order sequence does not find the corresponding anticipated symbols on the key top (such as [glyph] or [glyph] in Devanagari). Such graphic symbols must be mapped onto a sequence of symbols on the key tops to obtain the required [glyph]. Whereas these symbols representing a grapheme are available on the typewriter key top, inserting such symbols on the InScript key tops was another step toward losing a universal solution.

Conversely, however, the phonetic order is the order in which words are spoken, and it does not depend on the script. More important, children learn a script by the phonetic order; further, the phonetic order provides an easy way for editing and making corrections on a keyboard. Phonetic order makes it easier to implement the script composition grammar and inhibit illegal/nonsensical inputs such as putting two matra symbols on a character.

The MODS scheme was a variation on phonetic order and called for the consonants and vowels to be used without matra symbols. Because the keyboard layout design had both the vowels and the matra symbols, however, we did not pursue this approach.

Ultimately, we decided to use the phonetic keyboarding order as the standard keyboarding scheme. There is still wide resistance to its acceptance, however. Some users, influenced by the roman juxtaposition order, cannot accept that a matra symbol like [glyph], which actually appears before the character on output, should be typed after the character on a keyboarding scheme designed in the phonetic order. These users fail to understand that phonetically the vowel sound associated with a consonant always appears after the consonant sound. As a consequence, many commercial software products, such as C-DAC's multilingual word processing product i-Leap, give users the option to use the visual order of character entry: through firmware, the input is converted to the phonetic order for further processing.
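The conversion from visual to phonetic order mentioned above can be sketched in a few lines. This is a minimal illustration only, not C-DAC's firmware logic: it assumes Unicode Devanagari code points rather than ISCII, it handles only one left-side matra (the short-i sign, U+093F, which is drawn before its consonant), and the function name `visual_to_phonetic` is hypothetical.

```python
# Hypothetical sketch: turn visual-order keystrokes into phonetic (logical) order.
# Assumption: Unicode Devanagari code points; only the short-i matra is treated
# as a left-side (pre-base) sign.

PREBASE_MATRAS = {"\u093F"}   # DEVANAGARI VOWEL SIGN I, drawn left of its consonant
HALANT = "\u094D"             # DEVANAGARI SIGN VIRAMA, joins consonants into a conjunct


def is_consonant(ch: str) -> bool:
    """True for the basic consonants KA..HA of the Unicode Devanagari block."""
    return "\u0915" <= ch <= "\u0939"


def visual_to_phonetic(typed: str) -> str:
    """Move a matra typed before its consonant (visual order) to after it."""
    out = []
    pending = ""  # left-side matra waiting for the end of its consonant cluster
    for i, ch in enumerate(typed):
        if ch in PREBASE_MATRAS:
            pending = ch
            continue
        out.append(ch)
        nxt = typed[i + 1] if i + 1 < len(typed) else ""
        # Emit the matra after the last consonant of the cluster, i.e. when the
        # consonant is not followed by a halant that continues a conjunct.
        if pending and is_consonant(ch) and nxt != HALANT:
            out.append(pending)
            pending = ""
    return "".join(out) + pending


# Visual order: short-i sign, then KA. Phonetic order: KA, then short-i sign.
print(visual_to_phonetic("\u093F\u0915") == "\u0915\u093F")
```

The sketch expects visual-order input; text already in phonetic order should bypass it, since the function cannot tell the two apart.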

Figure 14. Examples illustrating keyboarding schemes.

Coding considerations

The coding scheme we developed had to address the needs of information interchange, storage, and processing.

Coding in terms of conjuncts

In converting Indic script symbols to code points, the simplest coding method is to use the set of conjuncts, or the composite characters (which could number in the thousands),18 as the atomic code points. Another

method would be to use a font-based coding. Font-based coding was used by almost all vernacular newspapers in the early days of electronic typesetting. The readers of these newspapers have to download their specific fonts for reading the e-paper. Such a situation, however, is good only for the output environment and is of no use for the tasks of text editing and word processing because the logical information of the conjunct or composite character compositions is lost.

Phonetic encoding

Three possibilities for phonetic encoding exist.

Full consonants and vowels. The set of the full consonants, vowels, diacritical marks, and a link symbol operator form the vocabulary for internal representation. The operands of the link symbol operator are converted to their corresponding pure consonant or matra symbol forms. For Hindi, the storage requirement is roughly 140.16 bytes per 100 basic symbols.

Pure consonants and vowels. The set of the pure consonants, the vowels, and the diacritical marks form the vocabulary for internal representation. There is no link operator. The full consonants are derived from the corresponding pure consonants by attaching the vowel A. Recall that the pure consonant represents muting of the inherent A sound. If a pure consonant is followed by a vowel, the corresponding matra symbol is attached. If it is followed by another pure consonant, a conjunct is formed. For Hindi, the storage requirement for this scheme is roughly 123.87 bytes per 100 basic symbols.

Full consonants, vowels, and matra. The set of all the full consonants, the vowels as well as their corresponding matra symbols, the diacritical marks, and the halant sign form the vocabulary for internal representation. The halant sign converts the preceding full consonant to the corresponding pure consonant. As the matra symbols occur more frequently, their redundancy helps in reducing the storage requirement. For Hindi, the storage requirement under this scheme is roughly 104.94 bytes per 100 basic symbols.

Coding using roman characters

Roman characters, with the international phonetic symbols like those dictionaries use to denote pronunciation, have been extensively used by linguists and literary scholars for writing Indian-language texts. In 1984, a roman two-character code with the most common interpretation (based on frequency) was developed for Hindi.40 Later, ITRANS (short for Indian language transliteration; http://en.wikipedia.org/wiki/ITRANS)64 and INSROT (short for Indian Script Roman Transliteration) have been standardized along similar lines. These use lowercase characters only, which facilitates searching using conventional search engines.

Yet another roman character coding scheme known as IITK-Roman was devised in the mid-1980s that uses both upper- and lowercase roman characters. In this essentially pure-consonant-based coding method, a single roman character code is assigned to each of the vowel and consonant symbols. Figure 15 shows the IITK-Roman assignment table. If a consonant character is followed by a vowel character, the corresponding matra symbol is attached to it. If, however, it is followed by another consonant, it forms the corresponding conjunct. The IITK-Roman code provides a convenient way of inputting Hindi using a conventional roman keyboard; text editing and word processing tasks can be easily done with this code. The major disadvantage is that a conventional search engine cannot be used because of the uppercase letters; nonetheless, this coding scheme is still very popular.

Figure 15. Devanagari to IITK-Roman code.

Code standardization

Soon after the 1978 symposium, India's Department of Electronics constituted a standardization committee, of which I was a member, for designing codes for the Indic scripts similar to ASCII. After much deliberation with the experts of different Indic scripts, in 1982 we came up with the first version of a 7-bit code, called ISSCII-7 (Indian Scripts Standard Code for Information Interchange).25 In 1983 the first version of the 8-bit code (ISSCII-8) was released.65 It was

difficult to incorporate everything that different Indic script users demanded, and it took us quite some time to make users appreciate the concept of universality and the need for delineating the script composition phase from that of internal coding.

Another major difference of opinion, between users and the standardization committee, was in the collating order. Therefore, several revisions were done and in 1988 the Department of Electronics published the first official version.66 By this time, one S in ISSCII had been dropped and the acronym became ISCII. A further modification was made in 1991, and the Bureau of Indian Standards accepted ISCII-8 as the national standard (IS 13194:1991). The design of ISCII-8 was totally an indigenous effort, addressing India's needs with multiple scripts: in that sense, there was no correlation to what was then being designed by the International Organization for Standardization (ISO) and the newly formed Unicode consortium.67

At the international level, ISO came up with a draft framework for a Universal Coded Character Set (ISO/IEC FIS 10646) in 1990. At the same time, the major multinational IT companies formed a consortium for devising character codes to represent all the world's scripts. In particular, the consortium was concerned, for business penetration reasons, to be able to handle the scripts of Asian countries where English was not used for internal communication. The consortium developed a 16-bit code called Unicode (http://unicode.org/) wherein distinct code points were assigned to each character with direct mapping to its rendering on the output device. For the Indic scripts, the Unicode consortium adopted the 1988 ISCII-8 standard version as its base for the pages related to the Indic scripts (for an example, see http://www.unicode.org/charts/PDF/U0980.pdf onward). As a result of philosophical differences between the ISCII-8 and Unicode designs, several errors crept into Unicode; none of the Indian companies or research institutions was a member of the Unicode consortium at that time to address our concerns. Today, India's Department of Information Technology is a consortium member and has made suggestions for making the appropriate changes.

ISSCII-7, ISSCII-8, and Unicode

To help illustrate the three different coding standards' approaches in representing Devanagari, let us examine the salient features of each.

7-bit internal representation

The 7-bit code has 128 code positions available. The first two columns are reserved for the control characters. If we consider all the special characters and the numerals, we are left with only 64 code positions for assigning Devanagari symbols. In the ISSCII-7 design of 1982, we decided to include the full consonants and the matra symbols. The vowels were obtained by attaching the corresponding matra symbols to one vowel, A, which was given a code point. The pure consonants were derived using the halant symbol. Figure 16 shows the code point outlay for ISSCII-7. The code does give the right collation order and is applicable to all the Indic scripts. It worked in all environments where 7-bit ASCII was being used, so the standard 7-bit communication interface could be directly used. However, the major disadvantage was that it did not provide mixing with the roman script code.

8-bit internal representation

In designing the 8-bit ISSCII code table in 1983, we made the first half of the table the same as for the 7-bit ASCII and used only the latter half of the code space for Devanagari symbol assignment. For the code points of numerals, punctuation marks, and special symbols, we used the code points of the corresponding ASCII code. We left the first two columns of the Devanagari portion intact for the control characters, so that the remaining 96 code positions were available for placement of the Devanagari symbols. The code used all the full consonants, the vowels, and the matra symbols. The “link” symbol O– (equivalent to the halant) denoted formation of the conjuncts. The halant was a printable symbol whereas O– was a nonprintable operator symbol. Figure 17 shows the 1983 ISSCII-8 code assignment table. As Figure 17 shows, a special Devanagari space symbol has been provided to aid in the right sorting order. The spare available code points have been used to place some of the common conjuncts (user-defined codes) to reduce the text storage requirement. Thus the 8-bit code provided all the desirable features and was universal, provided that the codes for the conjuncts were not used. However, the ISSCII-8 code was not suitable for those environments where the eighth bit of the byte was being used for some other process-specific

Figure 16. ISSCII-7 (1982) code assignment table for enhanced Devanagari. Here the matra symbols are indicated by writing the corresponding vowel within angular brackets. ‘‘SP2’’ is the Devanagari space, which was introduced to maintain the right collation order.

applications (assuming that ASCII has no use for this bit).

This basic layout was later modified in 1988 and again in 1991; the Bureau of Indian Standards' 1991 ISCII-8 layout can be viewed at http://tdil.mit.gov.in/isciiapril03pdf. The major modification to this was the deletion of the additional Devanagari-space code point. Further, the code points for numerals were added onto the Devanagari portion.
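The code-space arithmetic behind the 7-bit and 8-bit layouts can be checked with a small sketch. It assumes that a "column" means 16 code positions, as in conventional ASCII code charts, and, for the 7-bit case, that special characters and numerals occupy two further columns as they do in ASCII. The actual symbol assignments are those of Figures 16 and 17 and are not reproduced here; the function name `region` is hypothetical.

```python
# Illustrative sketch of the code-space split described in the text.
# Assumption: one "column" of the code chart holds 16 code positions.
COLUMN = 16


def region(byte: int) -> str:
    """Name the region of the 8-bit table that a byte value falls in."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("not an 8-bit value")
    if byte < 0x80:
        return "ascii"      # first half: same as 7-bit ASCII
    if byte < 0x80 + 2 * COLUMN:
        return "control"    # first two columns of the upper half, left for control characters
    return "indic"          # remaining positions, available for script symbols


# 8-bit case: upper half minus two control columns.
indic_positions_8bit = sum(region(b) == "indic" for b in range(256))

# 7-bit case (assumption: specials and numerals fill two columns, as in ASCII).
indic_positions_7bit = 128 - 2 * COLUMN - 2 * COLUMN

print(indic_positions_8bit, indic_positions_7bit)  # 96 64
```

Both counts agree with the 96 and 64 positions quoted in the text, which suggests the column-of-16 reading is the intended one.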

Figure 17. ISSCII-8 (1983) code assignment table for enhanced Devanagari.
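The storage figures quoted earlier for the three phonetic encodings (140.16, 123.87, and 104.94 bytes per 100 basic symbols) are consistent with a simple count based on the Hindi symbol frequencies given in the keyboarding discussion. The accounting below is an inference from those numbers, not something the text states: it assumes one byte per code and charges one extra byte for each symbol type that needs a second code under a given scheme.

```python
# Worked arithmetic, assuming one byte per code point.
# Frequencies (percent of basic symbols) are the Hindi figures quoted in the text.
FREQ = {
    "matra": 35.22,
    "standalone_consonant": 23.87,
    "pure_consonant": 4.94,
}

# Symbol types that need a second code under each scheme (an inferred accounting):
#  - link scheme: every matra and every pure consonant needs the link operator
#  - pure-consonant scheme: every standalone full consonant needs the vowel A
#  - halant scheme: every pure consonant needs the halant sign
EXTRA_CODES = {
    "full consonants, vowels, link": ["matra", "pure_consonant"],
    "pure consonants, vowels": ["standalone_consonant"],
    "full consonants, vowels, matra, halant": ["pure_consonant"],
}


def bytes_per_100_symbols(scheme: str) -> float:
    """100 bytes for the symbols themselves plus one byte per extra code."""
    return round(100 + sum(FREQ[kind] for kind in EXTRA_CODES[scheme]), 2)


for name in EXTRA_CODES:
    print(name, bytes_per_100_symbols(name))
```

Under this reading, the third scheme is cheapest because the only symbols that cost two codes are the pure consonants, which are also the rarest category.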

The deletion of the Devanagari space symbol did affect the collation order for the words with some of the diacritical marks. Some of the standardization committee members argued that the universal acceptability of the sorting order was not possible across all Indic scripts. One additional pass on the word-processing part was required to ensure the right sorting order with this change. No place was provided for the code points corresponding to the frequent conjuncts, to ensure applicability to all Indian scripts.

In the layout, it should be noted that the nukta symbol is not a matra but a diacritic mark. When a nukta is attached to a consonant, it yields a derived consonant or another consonant. To preserve the sorting order, it was kept following the matra symbols and not with the other diacritic marks.

ISCII versus Unicode

The Unicode consortium adopted the 1988 version of ISCII-8 as the base for the 16-bit Unicode for allocating codes to different Indic scripts. Although the consortium tried to preserve the basic characteristics of ISCII coding, ISCII differed significantly from Unicode. The ISCII design exploited commonality of the Indic scripts and allocated code points for the superset of the enhanced Devanagari symbols. The graphical or the compositional aspect of individual characters and the script is not a consideration in its design. Therefore, ISCII applies to all Indic scripts, which makes transliteration among Indic scripts a straightforward task. Unicode, however, is more oriented toward facilitating script composition. It does not reflect in any way what could be common features of a group of scripts that could be dealt with uniformly for text processing. Unicode assigned a separate page for each one of the scripts. Thus, as one perceives more compositional features in the scripts, the demand for including more and more symbols continues. In ISCII, however, the symbols relate to the articulatory aspect of the associated speech, and it remains constant as long as all the articulatory aspects have been considered.

Rendering of Indic scripts

Because Indic scripts vary significantly in terms of how they are composed, the IIT Kanpur project team envisioned a separate composition processor38 for every Indic script. This processor, when fed with an ISCII string, would yield the sequence of composite characters as desired in the output text. We envisioned this rendering to be dynamic; that is, as the input string is read from left to right, the composition processor must start rendering and modify the earlier rendered part if needed. In other words, the composition processor should not wait for the entire input string before rendering. It was up to the composition processor to choose the appropriate fonts, their features, and the conjuncts, and to provide a variety of users' choices based on the nature of the output device. Separating the rendering stage from the rest of the composition process was a well-regarded decision.

Output was to a dot matrix plotter of varying resolution. A basic resolution of 50 to 70 dots per inch had a matrix size of approximately 15×8 (for Devanagari). Minimum readability required 8 dots for the height of the main character, 3 dots for the lower symbol, and 4 dots for the upper symbol. Medium-to-high quality script could be generated using a dot resolution of 100 to 200 dots per inch with a matrix size of 24×12 or higher.

IDC and GIST: Evolution

The concepts and methodology explained thus far for developing linguistic interfaces were simulated at IIT Kanpur, where we built prototypes during the years 1976 and 1980.38,55,56 In 1983, India's Department of Electronics sponsored IIT Kanpur to design and develop the Integrated Devanagari Computer terminal, a project for which I served as chief investigator.35 We developed the IDC using the Intel 8086 processor, with multitasking firmware. The Devanagari keyboard was designed in hardware that directly generated ISCII code. The Devanagari character fonts were stored in ROM along with their relative positioning information in the composition frame. To speed up the composition process, the information was stored in multiple partitions. Some of the frequently occurring composite characters were precomposed and stored in ROM. We programmed the composition processor engine to interpret the input ISCII-8 code dynamically and provide the display with the composed sequence of the composite character.

The CRT display dynamically displayed character changes as the input progressed. Display flicker resulted from the script composition time, which affected the refresh time. We reduced the ROM fetch time by logically partitioning the ROM space, by


Related Work and Developments

In 1988, the Graphics and Indian Script Terminal (GIST) terminal evolved into a GIST card that was pluggable into an IBM PC. This allowed all the existing character-oriented software packages to be used with all the Indic scripts. In 1990, the Centre for Development of Advanced Computing (C-DAC) designed an 84-pin PLCC ASIC for GIST called the GIST-9000. It provided an interface for Motorola’s 68008 microprocessor with 256 Kbytes of DRAM and an I/O-mapped interface for the IBM PC bus. In 1991, C-DAC designed a GIST print spooler that could offload the time-consuming printing task for Indic scripts from the host processor. In 1998, C-DAC developed a GIST-II card and, in 2001, designed a PCI GIST card. During 1990–1992, C-DAC also developed keyboard standards for all the Perso-Arabic scripts and phonetic standards for Thai, Sinhalese, Bhutanese, and Tibetan scripts. The Indian script font code (ISFOC) standards were also developed for all Brahmi-based Indic scripts. During the 1997–2002 period, C-DAC commercially released multilingual word processing software, called LEAP, catering to all Indic scripts.

During the years 1981–1985, the CMC company in Secunderabad, under Putcha Narasimham’s leadership, prepared a design document on Telugu1 and designed LIPI, a multilingual computer system featuring word processing with proportional spacing, and high-quality printing for a large number of Indic scripts (personal communication, Putcha Narasimham, Aug. 2008). This machine was made commercially available. Although LIPI’s design was based on a universal coding method, it did not dynamically display composition.

During the years 1978–1980, NCST Mumbai, under the leadership of S.P. Mudur, developed a design document for Devanagari.2 This was based on an analysis of graphic strokes and used a visual order for keyboarding.

Between 1980 and 1983, the Birla Institute of Technology and Science in Pilani, under the leadership of Praveen Dhyani and Aditya Mathur, developed a multilingual computer system3 under a project sponsored by the government of India’s Department of Electronics. It could display text in Devanagari and print text in several other Indic scripts. The computer that Dhyani and Mathur used was a Spectrum/3 from DCM Data Products, connected to an ADM 3A CRT upgraded with a graphics card. This system was called Siddhartha. At the same time, DCM Data Products, a company manufacturing computer systems in India, also named its computer catering to Hindi word processing Siddhartha, but the two machines had no correlation (personal communication, Aditya Mathur, Sept. 2008) except that both were based on the Spectrum/3.

Indian Institute of Technology (IIT) Chennai (earlier, IIT Madras), under the leadership of Kalyana Krishnan, developed a method for character generation using cubic splines in 1983 (http://acharya.iitm.ac.in/history.php). In 1988, the first attempt at computing with Indian scripts was made by designing and implementing an interpreter for a BASIC-like language written in Tamil or Telugu. The characters were not displayed through fonts but drawn on the screen using curves. The 16-bit character representation made it possible to quickly identify the strokes needed to generate the character. In 1998, the first version of the fonts-based editor was developed for Microsoft Windows 95, and in 1999, IIT Chennai demonstrated a text-to-speech system and output from Indian language documents.

These works are only a few of the many projects that have been undertaken. Numerous others have involved type and composition design,4-15 font design,16,17 transliteration schemes for Indic scripts,18-20 and speech processing.21-23 Besides these, some other early works on Urdu,24 Farsi,25 and Sinhala26 might be of interest.

References and notes

1. P.V.H.M.L. Narasimham et al., Design Information Report on Text Composition in Telugu, Computer Maintenance Corp., Secunderabad, 1981.
2. S.P. Mudur et al., Design Information Report on Text Composition in Devanagari, Nat’l Centre for Software Development and Computing Technology, Tata Inst. of Fundamental Research, Bombay, 1980.
3. A. Mathur and P. Dhyani, eds., Design and Development of a Devanagari Based Computer System, tech. report, Project Report III, Birla Inst. of Technology and Science, Pilani, Apr. 1983. (Contributors: S. Anand, R. Bagai, V. Dev, P. Dhyani, D. Kumar, and A. Mathur.)
4. A.V. Sagar and S. Chadda, “Composite Character Formation in Indian Scripts with a Small Set of Working Patterns – A PostScript Implementation,” Proc. Workshop Computer Processing of Asian Languages, Asian Inst. of Technology, Bangkok, Thailand (hereafter, AIT), 1989, pp. 160-167.
5. F.A.V. Donani, “Constructions and Graphic Display of Gujarati Text,” master’s thesis, Dept. of Electrical Eng. and Computer Science, Massachusetts Inst. of Technology, 1977.
6. H. Ganesh et al., Design Information Report on Text Composition in Malayalam, Research Inst. for Newspaper Development (RIND), Madras, 1981.
7. J.B. Millar and W.W. Glover, “Synthesis of the Devanagari Orthography,” Int’l J. Man-Machine Studies, vol. 14, 1981, pp. 423-435.
8. J.G. Krishnayya, Stroke Analysis of Devanagari Characters, quarterly progress report no. 69, Massachusetts Inst. of Technology, 1963, pp. 232-237.


9. M. Vannan, “Structured Approach for the Display of Indian Scripts,” Proc. Workshop Computer Processing of Asian Languages, AIT, 1989, pp. 147-153.
10. P.K. Ghosh, An Approach to Type Design and Text Composition in Indian Scripts, report no. STAN-CS-83-965, Stanford Univ., 1983.
11. S.P. Mudur et al., “An Integrated Software Environment for Localization,” Int’l Conf. Computer and Communication (ICCC 02), Int’l Council for Computer Communication, 2002, pp. 828-842.
12. S.P. Mudur et al., “An Architecture for the Shaping of Indic Texts,” Computers & Graphics, vol. 23, no. 1, 1999, pp. 7-24.
13. T.K. Bhatia, “The Problems of Programming Devanagari Script on PLATO IV and a Proposal for a Revised Hindi Typewriter,” Language, Literature and Society: Occasional Papers, no. 1, Center for Southeast Asian Studies, Northern Illinois Univ., 1974, pp. 52-64.
14. T. Mukherjee, “The Design of Indian Printing Types: Some Considerations for the Future,” Inside Outside, 1978, vol. 2,3, pp. 90-93.
15. T.N.V. Reddy, Design Information Report on Text Composition in Kannada, RIND, Madras, 1981.
16. A.V. Sagar and G. Muralidhar, “CFONTS – A Font Design System,” Proc. Workshop Computer Processing of Asian Languages, AIT, 1989, pp. 137-146.
17. C. Muthuvel, N. Alwar, and S. Raman, “A Font Generator for Indian Languages,” Proc. Workshop on Computer Processing of Asian Languages, AIT, 1989, pp. 154-159.
18. C. Chandrasekaran and S. Chadda, “Transliteration of Persons’ Names from English to Hindi,” Proc. Workshop on Computer Processing of Asian Languages, AIT, 1989, pp. 261-267.
19. E.V. Krishnamurthy, “Automatic Phonetic Transcription of Tamil in Roman Script,” Proc. Indian Academy of Sciences, Indian Academy of Sciences, 1977, vol. 86, pp. 503-512.
20. S. Goel et al., “LIPYANTARAN – A Computer Aided Transliteration System for English to Devanagari,” Computer Processing of Asian Languages: CPAL-2 Proc., R.M.K. Sinha, ed., Tata McGraw Hill, 1992, pp. 319-320.
21. B. Yegnanarayana et al., “A Continuous Speech Recognition System for Indian Languages,” Proc. Workshop on Computer Processing of Asian Languages, AIT, 1989, pp. 347-356.
22. D.D. Majumder, A.K. Dutta, and N.R. Ganguli, “Some Studies on the Acoustic Features of Human Speech in Relation to Hindi Speech Sounds,” Indian J. Physics, vol. 47, no. 10, 1973, pp. 598-613.
23. P.V.S. Rao and N. Bondale, “Syntax and Semantics of Speech Processing,” Computer Processing of Asian Languages: CPAL-2 Proc., R.M.K. Sinha, ed., Tata McGraw Hill, 1992, pp. 219-232.
24. K.S. Mustafa, “On Computerization of Urdu: Problems and Proposals,” Computer Processing of Asian Languages: CPAL-2 Proc., R.M.K. Sinha, ed., Tata McGraw Hill, 1992, pp. 299-302.
25. B. Parhami and F. Mavaddat, “Computers and the Farsi Language,” Proc. IFIP Congress, Toronto/North Holland, 1977, pp. 673-676.
26. A.K. Kumarsena, “Progress in Sinhala Computing,” Computer Processing of Asian Languages: CPAL-2 Proc., R.M.K. Sinha, ed., Tata McGraw Hill, 1992, pp. 59-73.

pipelining through the buffered output registers, and by skipping over the dark spots (0’s representing nonilluminated points). Because the visual screen was a window of the physical page, we provided facilities for panning and zooming. We incorporated built-in intelligence to prevent illegal composition such as attaching two matra symbols to the same character.

The backspace and other text editing operations worked on the internal ISCII-8 code and were dynamically reflected on the screen. Automatic transliteration from one Indian script to another was also made possible because the text was stored internally in the ISCII-8 format.

We later extended the IDC project to work on Motorola’s 32-bit 68000 microprocessor. The project’s new name was the Graphics and Indian Script Terminal (GIST). A number of companies bought this technology for manufacturing multilingual computer terminals. India’s Centre for Development of Advanced Computing adapted the GIST technology for the design of an ASIC chip.50 Most of the current commercially available software catering to Indic scripts has borrowed ideas from the GIST technology.

The IDC and GIST technology, as I have explained, represented a major breakthrough in solving the complex problem of designing the man-machine linguistic interface for Indian languages. There have been a number of other noteworthy efforts to develop Indian-language computers; see the “Related Work and Developments” sidebar.
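The idea that a common internal code makes script-to-script transliteration nearly mechanical can be illustrated with a minimal Python sketch. It is not the IDC or GIST implementation: it works on Unicode text rather than ISCII-8 bytes, and it assumes that the Unicode Indic blocks, whose layout was derived from ISCII, place corresponding letters at the same offset within each block. The names `BLOCK_BASE`, `SHARED_PUNCTUATION`, and `transliterate` are hypothetical, and a real system would need further exception rules for letters that a target script lacks.

```python
# Sketch only: Unicode block offsets stand in for the shared ISCII-8 code.
# Assumption: corresponding letters sit at the same offset in each block.
BLOCK_BASE = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gujarati": 0x0A80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
}
BLOCK_SIZE = 0x80

# Exception rule: the danda and double danda live in the Devanagari block
# but are shared by the other scripts, so they are never shifted.
SHARED_PUNCTUATION = {0x0964, 0x0965}


def transliterate(text: str, source: str, target: str) -> str:
    """Map each character of `source` script to the same offset in `target`."""
    src, dst = BLOCK_BASE[source], BLOCK_BASE[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE and cp not in SHARED_PUNCTUATION:
            out.append(chr(cp - src + dst))  # same phonetic slot, other script
        else:
            out.append(ch)  # spaces, digits, roman text, shared punctuation
    return "".join(out)


if __name__ == "__main__":
    word = "\u092d\u093e\u0930\u0924"  # "Bharat" written in Devanagari
    for script in ("bengali", "gujarati", "telugu", "kannada"):
        print(script, transliterate(word, "devanagari", script))
```

Because the mapping is a pure offset shift, converting back with the source and target swapped recovers the original string for every character that has a counterpart in both scripts.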


Word processing; transliteration

Text editing and word processing of Indic scripts is more complex than for their roman counterparts. Even a function as basic as the backspace is not simple: to use it requires, first, identification of the unit to be erased. There are a few alternative ways to implement this, which offer different levels of user friendliness. The unit to be erased could be a composite character (which is easiest to implement), in which case a minor error in a matra symbol requires the user to reenter the whole composite character. Alternatively, a constituent symbol can be erased in the order in which it was entered, in which case the level of user friendliness depends on the keyboarding method used. Yet another alternative is to erase the symbol in the order it was displayed on the screen, in which case the order is dictated by the script composition process and is not easily anticipated by the user. The ISCII-based keyboarding enforced the same definite canonical order for both entry and storage; thus, the backspace simply deleted the preceding symbols in the order of the code entered. In this case, the composite character composition is dynamically modified, reflecting the deletions.

The tasks of searching and replacing, sorting, and finding the word prefixes and suffixes were easily performed on the ISCII code. The routines for morphological analysis49 of the words, and the routines for generating the word aliases as needed in spell checking and correction,41 required prefix and suffix identification; otherwise, these operations would have been difficult to implement.

Another important deviation from the roman alphabet that affected edit routines was the unequal width of the Indian characters and their compositions. Any deletion, addition, or replacement of a word caused a change in the line width that was not related to the number of symbols involved but was related to their nature. Any editing software had to store the width information of the composite characters. In a string model, such as ours, the lines of a page form a single string, and an edit could cause the text lines to be readjusted. This was achieved by introducing a soft carriage return for the end of every logical line and a hard carriage return for an actual change of line. The right justification was achieved by our introducing the Indian Language Space (a separate code in ISCII-8) rather than the typical roman space. The Indian Language Space had a width equaling the greatest common divisor of all possible widths of composite characters. This served the dual purpose of yielding the right sorting order and aligning the text.

In screen-oriented editing, the variable width caused another problem. The cursor movements had to vary according to the width of the composite character or word, and the cursor width itself had to vary to indicate the boundary of the word or composite character. The cursor movement in the vertical direction had to similarly adjust to the word boundaries.

The one-to-one correspondence of each Indic script to the symbols of the enhanced Devanagari script with ISCII-8 phonetic encoding provided a natural and easy transliteration method among the Indic scripts, requiring only a few exception rules and a switching of the script composition processor. Figure 18 shows a block schematic of the transliteration schema with a few examples.

OCR for Indic scripts

Researchers have investigated OCR for a number of Indian scripts: Devanagari, Tamil, Telugu, Bengali, and Kannada.34,68-71 However, most of this research has been confined to the identification of isolated characters rather than the script. Some systems used a statistical method; others were syntactic and/or heuristic-based. Unlike simple juxtaposition in roman script, the Indic scripts are a composition of the constituent symbols in two dimensions. This meant that researchers first segmented an Indian script word into its composite characters. Each composite character was then decomposed into the constituent symbols or the strokes that were finally recognized.

Further, the Indic scripts posed difficulties to the researchers because of the natural breaks that occurred in a character. There were also natural joins or fusions of the constituent symbols. Additionally, the matra symbols often were not attached at precisely the right position and users had to resolve them using context. These difficulties were more often encountered in the handwritten script.

In 1973, I elaborated the techniques and strategies for segmentation, decomposition, and recognition of Devanagari script for the first time.34 Follow-up investigations later focused on segmentation of touching or fused characters, and on contextual


processing for error corrections.71 The contextual postprocessing stage used script composition information in addition to the individual character confusions as in case of roman OCR.26,72
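The Indian Language Space described in the word-processing discussion, a space whose width is the greatest common divisor of all composite-character widths, lends itself to a small worked example. The Python sketch below is illustrative only: the widths are hypothetical values in dots, the function names are invented here, and it shows only the alignment use of the unit space, not its role in sorting. It assumes the line width is itself a multiple of the unit.

```python
from functools import reduce
from math import gcd


def unit_space_width(char_widths):
    """Width of the unit space: the GCD of all composite-character widths."""
    return reduce(gcd, char_widths)


def justify_gaps(line_width, word_widths, unit):
    """Return how many unit spaces to place in each gap between words.

    Every word width is a multiple of `unit`, so the slack on a line whose
    width is also a multiple of `unit` can be filled exactly.
    """
    slack = line_width - sum(word_widths)
    if slack < 0 or slack % unit:
        raise ValueError("line cannot be filled exactly with unit spaces")
    gaps = len(word_widths) - 1
    if gaps <= 0:
        return []
    base, extra = divmod(slack // unit, gaps)
    # Leftover unit spaces go to the leftmost gaps, one each.
    return [base + (1 if i < extra else 0) for i in range(gaps)]


if __name__ == "__main__":
    widths = [6, 8, 10, 12]           # hypothetical composite-character widths
    unit = unit_space_width(widths)   # 2 dots
    words = [18, 14, 18]              # word widths, each a sum of the above
    print(unit, justify_gaps(60, words, unit))
```

With these hypothetical numbers the slack is 10 dots, or five unit spaces, which the sketch distributes as three in the first gap and two in the second.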

Figure 18. Transliteration using ISCII-8 phonetic encoding and script composition processors.

Indian language processing

The need for providing technological support for NLP tasks has been enumerated elsewhere.73,74 Although computers have long been used for studying characteristics of Indian languages, the text form used was only a romanized version. The work on building natural-language systems in Indian languages began only in the late 1970s with some preliminary works on designing a domain-specific question-answering system.75 The user-friendly solution to Indic script interfacing with the computers in the 1980s triggered a number of applications in typesetting, printing, publishing, hyphenation, document preparation, spell checkers, and so on.25,40,41 Our emphasis then moved to computers as a tool for breaking the linguistic barriers and for language learning.

Work on computer-assisted translation among Indian languages started in the early 1980s.25,76 In 1984, I outlined an interlingual approach using Sanskrit as the intermediate base language (see Figure 19).25 A related project was started in 1985 at IIT Kanpur by Rajeev Sangal, but was not pursued due to the system’s anticipated complexity. The group instead used an easier, more direct approach by substituting the word-groups of the source language with the corresponding word-groups of the target language. As the Indian languages are structurally similar, this approach did yield an output that could be called a working translation for the language pair. The correctness of the translation very much depended on the closeness of the two languages under consideration. This approach, known as Anusarak,54 was described as a language accessor system.

Figure 19. Machine translation among Indian languages using interlingual approach (1984).25 Diagram labels: source languages L1 to L16 with pre-editing, translators A and B, base language, target languages L1 to L16 with post-editing, root-word dictionaries, word decomposition analyzer, prefix-suffix analyzer.

In 1989, the first regional workshop on Computer Processing of Asian Languages77 was organized at the Asian Institute of Technology, Bangkok, followed by the second workshop at IIT Kanpur in 1992.78 Both of these workshops, supported by UNESCO, provided good forums for exchanging results and made recommendations to UNESCO for future work pertaining to Asian languages.79

Between 1990 and 1992, I designed a machine-aided translation (MAT) system, AnglaBharati, for translation from English to Indian languages.45,51 In the AnglaBharati technology, the input English sentence was transformed into a pseudo-interlingual structure called PLIL (Pseudo Lingua for Indian Languages) using a CFG-like pattern-directed rule base. The PLIL structure had a word order that was


applicable to a group of Indian languages and carried all the syntactic and semantic information needed to synthesize the target Indian language text belonging to that group.

I based the methodology for developing the target language text generators on the Paninian framework44 that was applicable to all the Indian languages. Thus, instead of developing 22 different language translators from English to each one of the official Indian languages, only one translator from English to PLIL was developed. The 22 text generators transforming the PLIL into each one of the target Indian languages, however, still had to be developed. It was estimated that developing a text generator from the PLIL required only about 30% of the total effort of developing a full translation system. In 2004, the AnglaBharati technology was transferred to a number of organizations: IIT Mumbai, IIT Guwahati, CDAC Noida, CDAC Kolkata, CDAC Thiruvananthapuram, CDAC Pune, Jawaharlal Nehru University New Delhi, Utkal University Bhubaneswar, and TIET Patiala. The development work on the different text generators is currently in progress. The AnglaBharati methodology uses both the interlingual and the rule-based machine translation (RBMT) approaches. This has been further hybridized with the example-based approach.46

During 2001 and 2002, IIT Mumbai under the leadership of Pushpak Bhattacharyya developed a similar interlingual approach using a universal networking language (UNL) for machine translation from English to Hindi80 and for information extraction.81 They have also developed a Hindi wordnet82 similar to Princeton University’s English WordNet.

Almost at the same time, C-DAC Pune under Hemant Darbari’s leadership developed a machine translation system called MANTRA (http://www.cdac.in/html/aai/mantra.asp), specifically tuned to the domain of translation of official documents. MANTRA was an RBMT system and used tree-to-tree transformation from the source language to the target language.

At IIIT Hyderabad, under the leadership of Rajeev Sangal, a machine translation system called Shakti for English-to-Hindi translation is being developed (http://shakti.iiit.ac.in).

In 1995–1996, I also designed and developed a hybrid approach to machine-aided translation that was essentially an example-based approach hybridized with a rule-base.49

With the availability of moderate-sized unilingual and bilingual corpora in Indian languages in the post-2000 period, different corpus-based approaches are currently under investigation. Since 2006, six consortium-mode, mission-oriented projects have been sponsored by India’s Department of Information Technology and are currently in progress. These projects include two on printed text and handwritten text recognition for different Indic scripts, one project on machine translation among Indian languages, two projects on English-to-Indian languages, and one project on cross-lingual information retrieval.

And the journey continues. There is a long way to go, however, before we can truly say that we have overcome the linguistic barrier, with unconstrained speech-to-speech translation being the ultimate goal.

References and notes

1. J. Beames, A Comparative Grammar of the Modern Aryan Languages of India: Hindi, Panjabi, Sindhi, Gujarati, Marathi, Oriya, and Bangali, 3 vols., Trübner, 1872–1879.
2. M.B. Emeneau, Language and Linguistic Area, Stanford Univ. Press, 1980.
3. S.R. Hill and P.G. Harrison, Dhatu-Patha: The Roots of Language, Munshiram Manoharlal Publishers Pvt. Ltd., 1997.
4. S.K. Chatterji, The Origin And Development of Bengali Language, Rupa & Co., 2002.
5. B. Kachru, The Alchemy of English: The Spread, Functions and Models of Non-Native Englishes, Pergamon Press, 1986.
6. J. Baldridge, “Linguistic and Social Characteristics of Indian English,” Language In India, vol. 2, no. 4, June-July 2002.
7. Devanagari through the Ages, pub. no. 8/67, Central Hindi Directorate, New Delhi, 1967.
8. H. Scharfe, “Kharoṣṭī and Brāhmī,” J. Am. Oriental Soc., vol. 122, no. 2, 2002, pp. 391-393.
9. A.S. Mahmud, “Crisis and Need: Information and Communication Technology in Development Initiatives Runs through a Paradox,” ITU Document WSIS/PC-2/CONTR/17-E, World Summit on Information Society, Int’l Telecommunication Union (ITU), Geneva, 2003.
10. R.M.K. Sinha, “Multilinguality and Global Digital Divide,” presented at the Joint IAMCR/ICA Int’l Symp.: The Digital Divide, 2001.
11. “ITU’s Asia-Pacific Telecommunication and ICT Indicators Report Focuses on Broadband Connectivity: Too Much or Too Little?”; 1 Sept. 2008; http://www.itu.int/newsroom/press_releases/2008/25.html.


12. M. Schwartz, “Fastap Hindi Language Platform Slated to Revolutionise India Mobile Market,” 7 Mar. 2008; http://www.developingtelecoms.com/content/view/1165/26/.
13. C.A. Arnaldo, “A Holistic Approach to the Computerization of Asian Scripts,” Computer Processing of Asian Languages: CPAL-2 Proc., R.M.K. Sinha, ed., Tata McGraw Hill, 1992, pp. 1-24.
14. Hanzix Work Group, “Open Systems Environment for Hanzi Input Methods,” Computer Processing of Asian Languages: CPAL-2 Proc., R.M.K. Sinha, ed., Tata McGraw Hill, 1992, pp. 49-58.
15. P. Lofting et al., “Handwriting: From Bamboo to Laser,” Computer Processing of Asian Languages: CPAL-2 Proc., R.M.K. Sinha, ed., Tata McGraw Hill, 1992, pp. 93-112.
16. T.C. Chen, “Hanzi Characters and Their Computerizations,” Computer Processing of Asian Languages: CPAL-2 Proc., R.M.K. Sinha, ed., Tata McGraw Hill, 1992, pp. 34-48.
17. Y.S. Moon, “Digital Fonts for Oriental Ideographical Languages,” Proc. Workshop Computer Processing of Asian Languages, Asian Inst. of Technology, Bangkok, Thailand (AIT), 1989, pp. 168-174.
18. P.H. Noncarrow, “48,000 Characters in Search of a System,” presented at Symp. Linguistic Implications of Computer-based Information Systems, New Delhi, 1978.
19. K. Hensch, “IBM History of Far Eastern Languages in Computing, Part 1: Requirements and Initial Phonetic Product Solutions in the 1960s,” IEEE Annals of the History of Computing, vol. 27, no. 1, 2005, pp. 17-26.
20. K. Hensch, “IBM History of Far Eastern Languages in Computing, Part 2: Initial efforts for Full Solutions, 1970s,” IEEE Annals of the History of Computing, vol. 27, no. 1, 2005, pp. 27-37.
21. K. Hensch, “IBM: History of Far Eastern Languages in Computing, Part 3: IBM Japan Taking the Lead, Accomplishments through the 1990s,” IEEE Annals of the History of Computing, vol. 27, no. 1, 2005, pp. 38-55.
22. J. Stevens, Sacred Calligraphy of the East, 3rd ed., Shambhala, 1996.
23. L.S. Wakankar, Ganesh Vidya: The Traditional Indian Approach to Phonetic Writing, Tata Press, 1968.
24. In Unicode this halant symbol has been incorrectly called a viram. Viram actually represents a full stop, and Devanagari uses a vertical line (a danda) for this.
25. R.M.K. Sinha, “Computer Processing of Indian Languages and Scripts – Potentialities and Problems,” J. Institution of Electronics and Telecommunication Engineers, vol. 30, no. 6, 1984, pp. 133-149.
26. R.M.K. Sinha, “Rule Based Contextual Post-Processing for Devanagari Text Recognition,” Pattern Recognition, vol. 20, no. 5, 1987, pp. 475-485.
27. S.K. Das, A History of Indian Literature: 1800–1910, Sahitya Akademy, New Delhi, 1991.
28. J. Gilchrist, Grammar of the Hindoostanee Language, or Part Third of Volume First, of a System of Hindoostanee Philology, Chronicle Press, Calcutta, 1796.
29. W. Franklin, Introduction to The Bhăgvăt-Gēētā; The Hĕĕtōpădēs of Vĕĕshnŏŏ-Sărmā, C. Wilkins, trans., Ganesha, 2001, pp. xxiv-xxv.
30. B.S. Naik, Typography of Devanagari, Bombay, Directorate of Languages, government of Maharashtra, 1971, vol. 2, pp. 636-639; http://listserv.linguistlist.org/cgi-bin/wa?A2=ind0001&L=indology&D=1&P=20160. While researching the history of the Devanagari typewriter, I found information on the history of Rudraprayag (currently part of the state of Uttarakhand) noting that the king of Rudraprayag, Kirti Shah, invented a typewriter for Hindi around 1892 and gave the copyright to an unnamed company (http://rudraprayag.nic.in/history.htm); further information is not traceable thus far.
31. Vrunda (archivist), Godrej Archives; http://www.archives.godrej.com, personal communication, Sept. 2008.
32. P.V.H.M.L. Narasimham, B. Prasad, and V. Rajaraman, “Code Based Keyboard for Indian Languages,” J. Computer Soc. of India, vol. 2, 1971, pp. 33-37.
33. R.M.K. Sinha and H.N. Mahabala, “Machine Oriented Devanagari Script,” J. Institution of Electronics and Telecommunication Engineers, vol. 19, 1973, pp. 623-628.
34. R.M.K. Sinha, “Syntactic Pattern Analysis and its application to Devanagari Script Recognition,” PhD dissertation, Electrical Eng. Dept., IIT Kanpur, 1973.
35. R.M.K. Sinha, principal investigator, Integrated Devanagari Computer (IDC), tech. report IDC-84-1, Dept. of Electrical Eng., IIT Kanpur, 1984.
36. R.M.K. Sinha, “Teaching Script on a Digital Computer,” J. Institution of Electronics and Telecommunication Engineers, vol. 22, 1976, pp. 720-722.
37. R.M.K. Sinha, “Computer Processing of Indian Languages,” presented at 4th Int’l Conf. Computing in Humanities, 1979.
38. R.M.K. Sinha and A. Raman, “A Modular Indian Language Data Terminal,” Computer Graphics (ACM SIGGRAPH), vol. 14, ACM Press, 1980, pp. 39-72.


39. R.M.K. Sinha, “Computers for Indian Languages,” Proc. Ann. Convention of Computer Soc. of India, Computer Soc. of India, Madras, 1982, pp. 163-174.
40. R.M.K. Sinha and B. Srinivasan, “Machine Transliteration from Roman to Devanagari and Devanagari to Roman,” J. Institution of Electronics and Telecommunication Engineers, vol. 30, no. 6, 1984, pp. 243-245.
41. R.M.K. Sinha and K.S. Singh, “A Program for Correction of Single Errors in Hindi Words,” J. Institution of Electronics and Telecommunication Engineers, vol. 30, no. 6, 1984, pp. 249-251.
42. R.M.K. Sinha, Data Representations for Indian Language Databases, tech. report TRCS-84-22, Dept. of Computer Science, IIT Kanpur, 1984.
43. R.M.K. Sinha, “Non-Latin Information Systems: Some Basic Issues,” Proc. Conf. Information Processing, H. Kugler, ed., Elsevier Science, 1986.
44. R.M.K. Sinha, “A Sanskrit Based Word-expert Model for Machine Translation among Indian Languages,” Proc. Workshop Computer Processing of Asian Languages, AIT, 1989, pp. 82-91.
45. R.M.K. Sinha et al., “AnglaBharti: A Multilingual Machine Aided Translation Project on Translation from English to Hindi,” IEEE Int’l Conf. Systems, Man and Cybernetics, IEEE Press, 1995, pp. 1609-1614.
46. R.M.K. Sinha, “Hybridizing Rule-Based and Example-Based Approaches in Machine Aided Translation System,” 2000 Int’l Conf. Artificial Intelligence (IC- 2000), CSREA Press, 2000, pp. 1247-1252.
47. R.M.K. Sinha, “An Engineering Perspective of Machine Translation: AnglaBharti-II and AnuBharti-II Architectures,” Proc. Int’l Symp. Machine Translation, NLP and Translation Support System (iSTRANS-2004), Tata McGraw Hill, 2004, pp. 10-17.
48. R.M.K. Sinha and A. Thakur, “Machine Translation of Bi-lingual Hindi-English (Hinglish) Text,” MT Summit X, Proc.: The Tenth Machine Translation Summit, Phuket, Thailand, 2005, pp. 149-156.
49. R.M.K. Sinha, “A Hybridized EBMT System for Hindi to English Transaction,” CSI J., vol. 37, no. 4, 2007, pp. 3-9.
50. M. Kulkarni, personal communication; http://www.cdac.in/html/gist/about.asp.
51. K. Sivaraman, “AnglaBharati: A Machine Aided Translation System from English to Indian Languages—English to Tamil Version,” M.Tech. thesis, Dept. of Computer Science & Eng., IIT Kanpur, 1993.
52. P.V.H.M.L. Narasimham et al., Design Information Report on Text Composition in Telugu, Computer Maintenance Corp., Secunderabad, 1981.
53. O. Vikas, “Summary Report of the Symposium on Linguistic Implications of Computer Based Information Systems,” Electronics Information & Planning, New Delhi, government of India, 1978, pp. 801-804.
54. A. Bharati et al., Anusaaraka: Machine Translation in Stages, report no. IIIT/TR/1997/1, IIIT Hyderabad, 1997.
55. A.K. Pathak, “An Input/Output Terminal for Indian Languages,” M.Tech. thesis, Dept. of Electrical Eng., IIT Kanpur, 1978.
56. M.P. Sastri, “A Universal Script Generator for Indian Languages,” M.Tech. thesis, Dept. of Electrical Eng., IIT Kanpur, 1978.
57. J. Institution of Electronics and Telecommunication Engineers, vol. 30, no. 6, 1984, special issue on computer processing of Indian languages and scripts, R.M.K. Sinha, guest ed.
58. A. Mathur and F. Fowler, “Design of a Dynamically Reconfigurable Keyboard,” Proc. Int’l Conf. Chinese and Oriental Language Computing, IEEE CS Press, 1987, pp. 20-23.
59. B. Nag, “Information Technology for Indian Scripts: Problems and Prospects,” Proc. Workshop Computer Processing of Asian Languages, AIT, 1989, pp. ks-2-15.
60. K.P.S. Menon, “High Speed, Visual Direct Indian Language Data Entry,” Indian Linguistics, vol. 35, 1974, pp. 97-111.
61. N. Mate, “Keyboard Overview—An Accommodative Approach for Devanagari Keyboard,” Computer Processing of Asian Languages: CPAL-2 Proc., R.M.K. Sinha, ed., Tata McGraw Hill, 1992, pp. 291-298.
62. S.P. Mudur, An Alphabetization Procedure for Devanagari Words, tech. report, Nat’l Centre for Software Development and Computing Technology, 1978.
63. J.N. Tripathi, “Statistical Studies of Printed Devanagari Text (Hindi),” J. Institution of Telecommunication Engineers, 1971.
64. O. Vikas, “Standardizing Representation of Indian Languages for Information Processing,” Proc. Int’l Symp. Machine Translation, NLP and Translation Support System (iSTRANS-2004), Tata McGraw Hill, 2004, pp. 313-314.
65. Standardisation of Indian Script Code for Information Interchange and Keyboard Layout, Dept. of Electronics, government of India, 1983.
66. “Report of the Committee on Standardization for Indian Scripts and Keyboard Layout,” IPAG J., Ministry of Communication and Information Technology, New Delhi, Oct. 1986.
67. R.M.K. Sinha, “Standardizing Linguistic Information—An Overview,” Computer Processing of Asian Languages: CPAL-2 Proc., R.M.K. Sinha, ed., Tata McGraw Hill, 1992, pp. 272-290.
68. M. Chandrasekaran, “Machine Recognition of the Modern ,” PhD dissertation, Univ. of Madras, India, 1982.



69. R. Chandrasekaran, “Computer Recognition of Certain Ancient and Modern Indian Scripts,” PhD dissertation, Univ. of Madras, India, 1982.
70. U. Pal and B.B. Chaudhuri, “Indian Script Character Recognition: A Survey,” Pattern Recognition, vol. 37, 2004, pp. 243-245.
71. V. Bansal, “Role of Knowledge in Document Recognition—A Case Study for Devanagari Script,” PhD dissertation, Dept. of Computer Science and Eng., IIT Kanpur, 1999.
72. R.M.K. Sinha, “Methodology for Computer Recognition of Devanagari Scripts,” IEEE-SMC Int’l Conf., IEEE Press, 1984, pp. 1220-1224.
73. H. Nomura, “Role of AI in Natural Language Processing for Asian Languages,” Computer Processing of Asian Languages: CPAL-2 Proc., R.M.K. Sinha, ed., Tata McGraw Hill, 1992, pp. 147-152.
74. R. Narasimhan, “Technology Support for Asian Language Studies and Applications,” Computer Processing of Asian Languages: CPAL-2 Proc., R.M.K. Sinha, ed., Tata McGraw Hill, 1992, pp. 25-33.
75. R.M.K. Sinha and G.C. Pathak, “A Heuristic Based Question Answering System in Natural Hindi,” IEEE-SMC Int’l Conf., IEEE Press, 1984, pp. 1009-1013.
76. P.C. Ganeshsundaram, “The P-Structure C-Structure Grammar (PCG) for the Contrastive Study of Two or More Languages,” J. Indian Inst. of Science, 1978, pp. 167-191.
77. Proc. Workshop on Computer Processing of Asian Languages, AIT, 1989.
78. R.M.K. Sinha, ed., Second Regional Workshop on Computer Processing of Asian Languages: CPAL-2 Proc., Tata McGraw Hill, 1992.
79. Report and Recommendations of Second Regional Workshop on Computer Processing of Asian Languages: CPAL-2, IIT Kanpur, India, pp. 19-21.
80. S. Dave, J. Parikh, and P. Bhattacharyya, “Interlingua Based English Hindi Machine Translation and Language Divergence,” J. Machine Translation (JMT), vol. 17, Sept. 2002, pp. 251-304.
81. P. Bhattacharyya, “Knowledge Extraction into Universal Networking Language Expressions,” Proc. Universal Networking Language Workshop, 2001.
82. D. Narayan et al., “An Experience in Building the Indo WordNet—A WordNet for Hindi,” Proc. Int’l Conf. Global WordNet (GWC 02), 2002.

R. Mahesh K. Sinha is a professor of computer science and engineering (CSE) and electrical engineering (EE) and has been on the faculty of CSE and EE at IIT Kanpur since 1975. He is the originator of the well-known multilingual GIST technology, AnglaBharati and AnuBharati machine translation technology, and ISCII, among others. He has been a visiting faculty member at Michigan State University, Wayne State University, the University of Quebec (INRS), and AIT, Bangkok. Sinha obtained a PhD from IIT Kanpur in 1973. Contact him at [email protected].


