Barriers to Localisation
a case study of Nepal

Pat Hall

brief background to Nepal

populations – 2009 estimated, millions (Wikipedia, 1 July 2009)
• China 1,333.2
• Afghanistan 28.2
• Iran 74.2
• Pakistan 167.5
• Nepal 29.3
• Bhutan 0.7
• India 1,169.5
• Bangladesh 162.2
• Myanmar 50.0
• Maldives 0.3
• Sri Lanka 20.2
major migrant communities worldwide

28th Sept 2009 LRC 2009

language diversity in South Asia

[Map of South Asia labelled with language families and languages]
• families: Indo-European, Tibeto-Burmese, Austro-Asiatic (Munda), Dravidian
• languages shown: Pushtu, Kashmiri, Farsi, Urdu, Punjabi, Nepali, Dzongka, Hindi, Sindhi, Assamese, Bangla, Gujurati, Oriya, Marathi, Telugu, Kannada, Malayalam, Tamil, Sinhala
• scholarship: Sanskrit, Tibetan
• INDIA
  – Over 500 languages
  – 22 mother-tongues with > 1 million speakers
  – 114 mother-tongues with > 10,000 speakers
  – 22 official in India
• English pervasive
• large diaspora

linguistic history and politics

• Gorkhali (later Nepali) official language since 1768
  – mother tongue of ruling elite from Gorkha
  – 1951-1990 ‘one nation, one culture, one language’
• around 100 other languages
• Nepali now almost universally spoken – 98%
• and in neighbouring regions
  – 5 other languages with mature written traditions
  – 60 completely unwritten
• 1990 enabled other mother-tongue education
  – the irony of English - only 5%
• 2006 interim constitution enabled use of other languages
• 2008 mandating mother-tongue teaching


Writing

[Script samples: Tamil, Kannada, Hindi (Devanagari)]
• all South Asian writing derived from ancient Brahmi
  – abugida (alphabetic with implicit vowel)
  – largely phonetic - but phonology changes with time
  – multiple forms (conjuncts)
  – maybe only 50 languages across South Asia are written
  – around 50% literacy
• new writing systems for unwritten languages
  – by linguists, by missionaries (SIL)

through the computer

an old mechanical typewriter: direct connection between key and type

early computers

based on typewriters
[Diagram: keyboard -> internal codes -> dot matrix char printer, with internal codes also feeding communications/applications]
• keyboard – electrical signals, “codes”
• dot matrix char printer – font table, code -> dotmap
• conceptual linkage – one-to-one correspondence
• internal codes – 8 bit BYTE, more characters
  – EBCDIC: 8 bits, 60-70 Roman letters, 5-10 control
  – ASCII: 7+1 bits, 128 Roman/control
  – ISO 646
• communications/applications

indic “hack font” developers

keyboards, codes, fonts dependent
[Diagram: keyboard -> internal codes -> font table (code -> bitmap), with internal codes also feeding communications/applications]
1. design your keyboard
2. accept whatever internal code arises
3. use tool to create fonts
• internal codes – ASCII 8 bit, ISO646: 128 Indic characters
• communications/applications – disaster, NO communication

encodings

• ISCII
  – IIT Kanpur, then CDAC Pune
  – hardware GIST card, then software (DLL); cost prohibitive
  – letters, not glyphs
    • like Arabic
    • needs renderer
  – based on Brahmi view
  – single code table with language switch to render “same” letter appropriately
  – sequence of characters as spoken
• Unicode
  – based on ISCII but separate tables for each script
    • still controversial
  – new Tamil standard

modern computers

separate issues
[Diagram: keyboard (key mapping: key -> code) -> internal Unicodes -> laser or inkjet printer (font table: code -> bitmap), with internal Unicodes also feeding communications/applications]
• internal Unicodes: the essential letters and characters; writing system, not graphics
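The "letters, not glyphs" and "sequence of characters as spoken" points can be seen directly in Unicode Devanagari. The following is a minimal sketch using only the Python standard library; the helper name `describe` and the sample strings are chosen here for illustration and do not come from the slides.

```python
import unicodedata


def describe(text):
    """Return the Unicode character name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]


# "ki": the vowel sign is DRAWN to the left of the consonant,
# but it is STORED after it, in spoken order.
ki = "\u0915\u093f"

# A conjunct is stored as letter + virama + letter (three code points);
# choosing the single combined glyph is left to the renderer and the font.
ksha = "\u0915\u094d\u0937"

for label, sample in (("ki", ki), ("ksha", ksha)):
    print(label, [f"U+{ord(ch):04X}" for ch in sample], describe(sample))
```

The stored text therefore stays the same whichever font draws it, which is the separation between "writing system" and "graphics" that the modern computers slide refers to.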

Nepali language into Computers

• Desk top publishing with PCs
  – hack fonts to give some representation of the writing
    • 128 places not enough for everything
    • implicitly defined an internal code
  – fonts copied widely and then changed slightly
  – result – inability to exchange data
  – 1997 attempts to standardise in context of emerging Unicode
    • impressed by ISCII, but not international
  – some fonts also developed for other languages
• today, accept Unicode Devanagari for Nepali
  – 1997 IDRC and later UNDP UNESCO funded keyboard drivers and fonts
    • can use Indian open type fonts, though not liked
  – still no agreed keyboard layout for Nepal

two key issues in technology projects

Research versus application
• doing ICT4D projects requires new understanding
• what understanding?
  – ACM paper claims must deliver computer science research
Technology transfer versus knowledge transfer
• do we do it for our beneficiaries?
• or teach them how to do it for themselves?
Could academic objectives cloud development objectives?
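Text typed with a hack font is really a string of Roman-range codes that only looks like Devanagari under that one font, so moving it to Unicode needs a lookup table plus some reordering. The sketch below assumes an imaginary font: the table `HACK_TO_UNICODE` is illustrative only and is not a verified mapping for any real font, and the names `PREFIX_SIGNS` and `convert` are likewise invented for this example.

```python
# HYPOTHETICAL code table for an imaginary hack font.
# Every real legacy font needs its own table, and slightly
# modified copies of a font need slightly different tables.
HACK_TO_UNICODE = {
    "g": "\u0928",  # DEVANAGARI LETTER NA
    "]": "\u0947",  # DEVANAGARI VOWEL SIGN E
    "k": "\u092a",  # DEVANAGARI LETTER PA
    "f": "\u093e",  # DEVANAGARI VOWEL SIGN AA
    "n": "\u0932",  # DEVANAGARI LETTER LA
    "l": "\u093f",  # DEVANAGARI VOWEL SIGN I
}

# Signs that typewriter-style fonts place BEFORE the consonant (visual order)
# but that Unicode stores AFTER it (spoken order).
PREFIX_SIGNS = {"\u093f"}


def convert(text):
    """Convert hack-font text to Unicode, moving prefix signs after the next character."""
    out = []
    pending = None
    for ch in text:
        mapped = HACK_TO_UNICODE.get(ch, ch)  # unknown characters pass through
        if mapped in PREFIX_SIGNS:
            pending = mapped
            continue
        out.append(mapped)
        if pending is not None:
            out.append(pending)
            pending = None
    if pending is not None:  # a trailing prefix sign with nothing to attach to
        out.append(pending)
    return "".join(out)
```

Under these assumptions `convert("g]kfn")` yields the five code points NA, E, PA, AA, LA, and `convert("ln")` yields LA followed by the vowel sign I. A production converter would also have to handle conjuncts, half forms and the reph, which this sketch ignores.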

what is involved in localisation?

• translation of all text in screens, menus, and help
  – available in separate ‘resource’ files
  – agree terminology in national committees
  – store past translations in ‘translation memory’
• develop spell checker
  – need standard spellings
  – morphology
• other capabilities that might be needed
  – code converters = hack fonts to unicode
  – develop fonts
  – text-to-speech
  – OCR
  – is a grammar checker needed?

Nepalinux at MPP funded by IDRC

• 1997, 2000 produced Nepali Unicode-driver add on to Windows
• 2004 joined PAN localisation project
• Linux with GNOME desktop, KDE desktop, OpenOffice, ...
  – release 1 December 2005
  – release 1.1 October 2006
• FOSS critically important here
• then PDAs, mobile phones, OCR
• need to research into languages, but not into technology
• in parallel Microsoft produced LIP for Nepali
  – launched November 2005
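A translation memory, as mentioned in the localisation checklist, can be pictured as a store of past source/target pairs that is consulted before a string is translated again. The sketch below is a minimal illustration using Python's standard `difflib`; the class name `TranslationMemory`, the 0.8 similarity cutoff and the sample Nepali string are assumptions made here, not details from the project.

```python
import difflib


class TranslationMemory:
    """Store past translations and offer exact or near matches for new strings."""

    def __init__(self):
        self._store = {}

    def add(self, source, target):
        self._store[source] = target

    def lookup(self, source, cutoff=0.8):
        """Return (translation, similarity); (None, 0.0) when nothing is close enough."""
        if source in self._store:
            return self._store[source], 1.0
        close = difflib.get_close_matches(source, list(self._store), n=1, cutoff=cutoff)
        if close:
            score = difflib.SequenceMatcher(None, source, close[0]).ratio()
            return self._store[close[0]], score
        return None, 0.0


tm = TranslationMemory()
tm.add("Open file", "फाइल खोल्नुहोस्")  # illustrative translation only

exact = tm.lookup("Open file")
fuzzy = tm.lookup("Open files")  # near match: offered to the translator for review
missing = tm.lookup("Print")
```

A near match is only a suggestion: the translator still has to confirm or adjust it, which is where the agreed terminology comes in.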


NeLRaLEC – funded by the EU

OpenU, LancasterU, GöteborgU, ELRA, MPP, TribhuvanU
for Nepali, even though not (yet) endangered
renamed Bhasha Sanchar
• Nepali National Corpus
  – text (5 M words, 100K parallel) and speech (4hrs+130K words)
• Nepali dictionary – aiming at 100,000 words
  – word list and entries for most frequent
• linguistic tools
• fonts for Nepali
  – and Maithili
• speech generation
• trials in schools and universities
  – localised schools management software
• computational and corpus linguistics course in University
ensures sustainability

Nepali National Corpus – what people actually write

advice from Lancaster University
• wanted material from 1991/2 to match corpora in US and UK
  – not much material then, just after end of repressive Panchayat era
• wanted in fixed genres with strong western bias
  – very little science and technology, 1 science fiction, no westerns (or Kung Fu)
• CORE – aimed at 1 million words
  – collected 500 documents, had them typed
  – some got lost in civil disturbances and lack of records
• General – opportunistic, at least 4 million words
  – already digitised but in hack fonts, needed conversion
  – when to stop?
• English dictionaries used 200 million words or more
  – eventually stopped at 13 million words, good enough will do

Nepali Corpus-based Dictionary

• first ever in South Asia
• needed software to store entries as produced
  – multiple lexicographers
  – could not find suitable existing system
• developed own system to meet needs of Nepali and linguists
  – 2 software engineers in exploratory development
• developed entries using Oxford’s Xaira concordance system
  – linguists still learning, each did things differently
  – at around 20,000 entries realised quality problem
• started again, reusing earlier entries or producing new ones
  – only reached 8,000, okay for an on-line dictionary, but not in print
• expected linguists to continue after project
  – but didn’t because not paid

Nepali Font Development

• hiring font developers
  – original person went to UK to do masters
  – recruited two graphic designers, arranged training
• short training from Reading in Kathmandu, open to public
  – some hazards from civil disturbances
• font development process
  – draw many examples of characters needed, including ligatures
  – scan into computer, add rules of combination
  – test and improve constantly
• eventually agreed needed further training in UK
  – two months in Reading
• outputs
  – font for Nepali in two versions
  – font for Maithili Mithilaaksha style of writing.

Nepali TTS speech generation

based on Festival/Festvox concatenative synthesis
• expected help from UK, Roger Tucker and Ksenia Shalapova
  – Ksenia refused to travel to Nepal to give training and guidance
• had to help our 2 software engineers in other ways
  – spent 1 month in Hyderabad with Kishore and Rajeev Sangal
• recorded voices
  – selected words containing all 1764 diphones, chose 1,200 sentences
  – had to research speech – eg “Schwa deletion”
• developed TTS through several versions, adding prosody
  – reasonable quality, judged reasonably “natural”
• wanted to make screen reader for visually disabled and illiterates
  – latest Linux supported this, so in Nepalinux
  – Festival not on Windows, could not find generation engine

evaluation of use of Nepali software

Nepalinux and MS LIP launched late 2005
• 1000 CDs distributed, 500 attended demos
  – heard not being used, interesting novelty
• Conducted surveys to find out
  – what was really going on, and why?
  – grounded theory, analysed with NVIVO
  – done by Ganesh Ghimire and Maria Newton
what we found out: ITID journal, Vol 5, Issue 1 - SPRING 2009
  – normal social processes were at work
• and should we move to other 100 languages of Nepal?
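The recording step for the TTS voice, choosing words that together contain every diphone, is a coverage problem. The slides do not say how the 1764 figure was reached; it is consistent with 42 phones taken in ordered pairs (42 × 42 = 1764), and that phone count is an inference made here. The sketch below uses a simple greedy selection over a toy inventory; the function names and the toy data are hypothetical and this is not the project's actual procedure.

```python
from itertools import product


def all_diphones(phones):
    """Every ordered pair of phones, written as 'a-b'."""
    return {f"{a}-{b}" for a, b in product(phones, repeat=2)}


def diphones_in(phone_seq):
    """Diphones that occur in one word, given as a list of phones."""
    return {f"{a}-{b}" for a, b in zip(phone_seq, phone_seq[1:])}


def greedy_select(candidates, target):
    """Repeatedly pick the word that covers the most still-missing diphones.

    candidates: dict mapping a word label to its phone sequence.
    Returns (chosen labels, diphones that no candidate covers).
    """
    remaining = set(target)
    chosen = []
    while remaining:
        best = max(candidates, key=lambda w: len(diphones_in(candidates[w]) & remaining))
        gain = diphones_in(candidates[best]) & remaining
        if not gain:  # nothing left that any candidate can cover
            break
        chosen.append(best)
        remaining -= gain
    return chosen, remaining


# Assumed inventory size: 42 phones gives 42 * 42 ordered pairs.
inventory_size = len(all_diphones([f"p{i}" for i in range(42)]))

# Toy example with made-up words and phones.
toy_words = {"w1": ["k", "a", "m", "a"], "w2": ["a", "k"], "w3": ["m", "a"]}
toy_target = {"k-a", "a-m", "m-a", "a-k"}
chosen, uncovered = greedy_select(toy_words, toy_target)
```

Greedy selection does not guarantee the smallest possible word list, but it keeps the recording script short enough to be practical.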


1. first impressions of computers

Problem: hardware not localised
  – cabinets labeled in English
  – keyboards not marked in local script
Example: delivered computers to schools
  – could only give keyboard layout charts
    • not well produced
  – techies claimed phonetic keyboards easy
    • but nobody spoke or typed English!
    • ironical – claim systems for non-speakers of English.
Solution: get keyboards marked for Nepali
  – keyboards in Thailand are marked in Thai!

2. translation quality

Problem: technical terms not understood
  – Nepali terms agreed by committee
Example:
  – “I have used all its function because I am a writer. Some of the words like “radditokari” and “anuprayog” seem to be the unusual ones. They look like they have been directly borrowed from Sanskrit and that make Nepali even more difficult than English” – banking officer
  – “It’s good for people who are trained in Nepali that don’t have exposure to English. For people like us who have already started to use one system it becomes difficult to switch over” – linguist from Kathmandu
Solution:
  – don’t translate, English terminology often bizarre, maybe transliterate
  – listen to users, standardise terminology
  – will they get used to it anyway?


3. teacher/trainer inertia

• Problem: trainer not familiar with the localised software
• Example: at Sankhu telecentre
  – language of instruction - Nepali
  – started teaching Nepali interfaces
  – switched back to English interfaces
• Solution:
  – train the trainers
  – restrict opportunity to use English
    • in software as well as hardware
  – don’t give option of English interface

4. critical mass of users

Problem: want to get help from others
Examples:
  – Nepal’s national library
    • one user, only used to catalogue Nepali books
  – government officer taught himself
Solution: need to create critical mass,
  – train complete organisation
    • example – telephone, video tape formats
Sociology – “social interaction” Brock and Durlauf – “social embeddedness”

5. language shift and social mobility

Problem: people change to dominant language
  – see economic advantage in English or Nepali or ...
Example:
  – “Yeah I liked, but when we used Nepali windows that time I feel we are going to forget English language” – telecentre social mobilizer
  – “I have daughter and I will not ask my daughter to use the Nepali interface because I want my daughter to be good in English” – computer academic
Solution: accept that for some languages
  – not worth localising the OS, but enable content
  – supporting the language may reduce the shift
Sociology – “sanskritisation” caste mobility - Srinivas

6. software eco-system

Problem: software works with other software
  – cannot use Unicode for information exchange!
  – other software is not Unicode compliant, use hack fonts
  – but can do Desk Top Publishing
Example: journalists
  – “But it has font problem. In publication houses mainly they use preeti, kantipur so it’s not worthy for nepali computing” – journalist
Solution: raise an Open Source project to produce compliant publishing software.


7. support all communities

Problem: language communities very small
  – commercial development not viable
Example: Lohrung Rai in Nepal
  • subject of socio-linguist study by Jens Allwood, Yogendra Yadava, and Bhim Regmi
  – 1,207 ‘mother-tongue’ speakers
  – language not yet written, want Roman system.
Solution:
  – localise for local lingua franca
    • create writing close to that of lingua franca
  – only enable content in local language

8. cost of entry for new language

Problem: Translation cost can be significant
Example: Gnome interface for Linux
  – 40,000 ‘strings’ = 500,000 words approx
  – grows by 5 to 10% a distribution
  – at 1,500 words per day, this takes 300 days
    • 30 days for a new release
Solution: avoid manual translation
  – machine translation?
  – be smart technically
    • language generation from model of software
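The cost-of-entry figures for the Gnome interface can be checked with a few lines of arithmetic. The sketch below uses only the numbers given on the slide (40,000 strings, about 500,000 words, 1,500 words per day, 5 to 10% growth per distribution); the function and variable names are invented here. The exact results come out a little above the slide's rounded "300 days" and bracket its "30 days".

```python
TOTAL_STRINGS = 40_000
TOTAL_WORDS = 500_000
WORDS_PER_DAY = 1_500


def translation_days(words, words_per_day=WORDS_PER_DAY):
    """Working days one translator needs for the given number of words."""
    return words / words_per_day


def release_words(total_words, growth_percent):
    """New words to translate when the interface grows by growth_percent."""
    return total_words * growth_percent / 100


words_per_string = TOTAL_WORDS / TOTAL_STRINGS                  # average string length
initial_days = translation_days(TOTAL_WORDS)                    # first full translation
release_days_low = translation_days(release_words(TOTAL_WORDS, 5))
release_days_high = translation_days(release_words(TOTAL_WORDS, 10))

print(f"{words_per_string:.1f} words per string")
print(f"initial translation: about {initial_days:.0f} days")
print(f"each new release: about {release_days_low:.0f} to {release_days_high:.0f} days")
```

For a community of around a thousand speakers, roughly a person-year of translation followed by several weeks per release is the barrier that the "avoid manual translation" solution is aimed at.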


Conclusions

localization must be TOTAL
  – all hardware including keyboards
  – all software in use by a community
  – translations sensitive to language politics
    • otherwise it will be rejected
entry cost must be minimal
  • cheap and easy to localize for a new language
  – base interaction by language generation
  – part of s/w development process
  – sometimes only enable content
next step – harmonised writing and encoding
  • then machine (assisted) translation