Barriers to Localisation: a case study of Nepal
Pat Hall
28th Sept 2009 LRC 2009

brief background to Nepal
populations – 2009 estimated millions (Wikipedia, 2009 July 1)
• China 1,333.2
• India 1,169.5
• Pakistan 167.5
• Bangladesh 162.2
• Iran 74.2
• Myanmar 50.0
• Nepal 29.3
• Afghanistan 28.2
• Sri Lanka 20.2
• Bhutan 0.7
• Maldives 0.3
major migrant communities worldwide
language diversity in South Asia
[Map of South Asian languages, labelled: Pushtu, Kashmiri, Farsi, Tibetan, Urdu, Punjabi, Nepali, Dzongka, Hindi, Sindhi, Assamese, Bangla, Gujurati, Oriya, Marathi, Telugu, Kannada, Malayalam, Tamil, Sinhala]
• language families: Indo-European, Tibeto-Burmese, Austro-Asiatic (Munda), Dravidian
• scholarship: Sanskrit
• INDIA
  – Over 500 languages
  – 22 mother-tongues with > 1 million speakers
  – 114 mother-tongues with > 10,000 speakers
  – 22 official in India
• English pervasive
• large diaspora

linguistic history and politics
• Gorkhali (later Nepali) official language since 1768
  – mother tongue of ruling elite from Gorkha
  – 1951-1990 'one nation, one culture, one language'
• around 100 other languages
• Nepali now almost universally spoken – 98%
• and in neighbouring regions
  – 5 other languages with mature written traditions
  – 60 completely unwritten
• 1990 enabled other mother-tongue education
  – the irony of English - only 5%
• 2006 interim constitution enabled use of other languages
• 2008 mandating mother-tongue teaching
Writing
[Script samples: Tamil, Kannada, Hindi (Devanagari)]
• all South Asian writing derived from ancient Brahmi
  – abugida (alphabetic with implicit vowel)
  – largely phonetic - but phonology changes with time
  – multiple forms (conjuncts)
  – maybe only 50 languages across South Asia are written
  – around 50% literacy
• new writing systems for unwritten languages
  – by linguists, by missionaries (SIL)

through the computer
an old mechanical typewriter
• direct connection between key and type
early computers
based on typewriters
• keyboard – electrical signals
• "codes"
  – ASCII: 128 Roman/control
  – EBCDIC: 8 bits
  – ISO 646: 7+1 bits
  – 60-70 Roman letters, 5-10 control, more characters
• 8 bit BYTE – internal codes
• dot matrix printer – font table: code -> dotmap
• communications/applications

indic "hack font" developers
keyboards, codes, fonts dependent
1. design your keyboard
2. accept whatever internal code arises
3. use tool to create fonts
• conceptual linkage – one-to-one correspondence between keyboard and font table (code -> bitmap)
• 128 Indic characters
• internal codes: ASCII, ISO646
• communications/applications: disaster – NO communication
encodings
• ISCII
  – IIT Kanpur, then CDAC Pune
  – letters, not glyphs
    • like Arabic
    • needs renderer
  – hardware GIST card, then software (DLL); cost prohibitive
  – based on Brahmi view
  – single code table with language switch
  – sequence of characters as spoken
• Unicode
  – based on ISCII but separate tables for each script
    • still controversial
  – new Tamil standard

modern computers
separate issues
• keyboard – key mapping: key -> code
• internal Unicodes – the essential letters and characters
  – writing system not graphics
• font table – code -> bitmap
  – switch to render "same" letter appropriately
• laser or inkjet printer
• communications/applications
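
The "letters, not glyphs" and "sequence of characters as spoken" points on the encodings slide can be illustrated with a short Python sketch. It only lists what Unicode stores for a few Devanagari strings; the helper name `describe` and the sample strings are illustrative choices, not taken from the talk.

```python
import unicodedata

def describe(text: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode character name) for each stored character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# The word "Nepali" in Devanagari: six stored letters and vowel signs.
NEPALI = "\u0928\u0947\u092A\u093E\u0932\u0940"

# KA followed by VOWEL SIGN I. The sign is drawn to the LEFT of the consonant,
# but it is stored after it, in spoken order; the renderer does the reordering.
KI = "\u0915\u093F"

# KA + VIRAMA + SSA: three stored letters that a renderer normally draws as a
# single conjunct glyph. The text holds letters; the font decides the shape.
KSSA = "\u0915\u094D\u0937"

for label, sample in (("Nepali", NEPALI), ("ki", KI), ("kssa", KSSA)):
    print(label)
    for code, name in describe(sample):
        print("  ", code, name)
```

Because the stored text holds only letters, the same string can be exchanged between applications and drawn by any compliant font, which is the separation of keyboard, code and font that the "modern computers" slide describes.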
Nepali language into Computers
• Desk top publishing with PCs
  – hack fonts to give some representation of the writing
    • 128 places not enough for everything
    • implicitly defined an internal code
  – fonts copied widely and then changed slightly
  – result – inability to exchange data
  – 1997 attempts to standardise in context of emerging Unicode
    • impressed by ISCII, but not international
  – some fonts also developed for other languages
• today, accept Unicode Devanagari for Nepali
  – 1997 IDRC and later UNDP UNESCO funded keyboard drivers and fonts
    • can use Indian open type fonts, though not liked
  – still no agreed keyboard layout for Nepal

two key issues in technology projects
Research versus application
• doing ICT4D projects requires new understanding
• what understanding?
  – ACM paper claims must deliver computer science research
Technology transfer versus knowledge transfer
• do we do it for our beneficiaries?
• or teach them how to do it for themselves?
Could academic objectives cloud development objectives?
what is involved in localisation?
• translation of all text in screens, menus, and help
  – available in separate 'resource' files
  – agree terminology in national committees
  – store past translations in 'translation memory'
• develop spell checker
  – need standard spellings
  – morphology
• other capabilities that might be needed
  – code converters = hack fonts to unicode
  – develop fonts
  – text-to-speech
  – OCR
  – is a grammar checker needed?

Nepalinux at MPP funded by IDRC
• 1997, 2000 produced Nepali Unicode-driver add on to Windows
• 2004 joined PAN localisation project
  • Debian Linux with GNOME desktop, KDE desktop, OpenOffice, ...
  – release 1 December 2005
  – release 1.1 October 2006
• FOSS critically important here
• then PDAs, mobile phones, OCR
• in parallel Microsoft produced LIP for Nepali
  – launched November 2005
need to research into languages, but not into technology
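
The "code converters = hack fonts to unicode" item can be sketched in Python as follows. This is a minimal illustration only: the mapping table `HACK_TO_UNICODE` is invented for an imaginary hack font and does not reproduce any real font's layout, and the sketch assumes the hack font stores the short-i vowel sign in visual order, before its consonant.

```python
# Hypothetical table for an imaginary hack font: the ASCII character typed
# in the hack font -> the Unicode Devanagari character it stands for.
HACK_TO_UNICODE = {
    "k": "\u0915",  # DEVANAGARI LETTER KA
    "n": "\u0928",  # DEVANAGARI LETTER NA
    "p": "\u092A",  # DEVANAGARI LETTER PA
    "a": "\u093E",  # DEVANAGARI VOWEL SIGN AA
    "i": "\u093F",  # DEVANAGARI VOWEL SIGN I (drawn left of its consonant)
}

# Signs assumed to be stored BEFORE the consonant in the hack font (visual
# order), which Unicode stores AFTER the consonant (spoken order).
PRE_BASE_SIGNS = {"\u093F"}

def hack_to_unicode(text: str) -> str:
    """Map each hack-font character and move pre-base signs after the next character."""
    out = []
    pending = None  # a pre-base sign waiting for the character it belongs to
    for ch in text:
        mapped = HACK_TO_UNICODE.get(ch, ch)  # unknown characters pass through
        if mapped in PRE_BASE_SIGNS:
            if pending is not None:
                out.append(pending)  # flush an earlier sign that found no base
            pending = mapped
            continue
        out.append(mapped)
        if pending is not None:
            out.append(pending)  # emit the sign after the character just written
            pending = None
    if pending is not None:
        out.append(pending)  # trailing sign with nothing after it
    return "".join(out)

print(hack_to_unicode("kin"))
```

A real converter needs one such table per hack font, plus rules for conjuncts and half forms, which is why fonts that were "copied widely and then changed slightly" make conversion laborious.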
NeLRaLEC – funded by the EU
OpenU, LancasterU, GoterborgU, ELRA, MPP, TribhuvanU
renamed Bhasha Sanchar
for Nepali, even though not (yet) endangered
• Nepali National Corpus
  – text (5 M words, 100K parallel) and speech (4hrs+130K words)
• Nepali dictionary – aiming at 100,000 words
  – word list and entries for most frequent
• linguistic tools
• fonts for Nepali
  – and Maithili
• speech generation
• trials in schools and universities
  – localised schools management software
• computational and corpus linguistics course in University
  – ensures sustainability

Nepali National Corpus – what people actually write
advice from Lancaster University
• wanted material from 1991/2 to match corpora in US and UK
  – not much material then, just after end of repressive Panchayat era
• wanted in fixed genres with strong western bias
  – very little science and technology, 1 science fiction, no westerns (or Kung Fu)
• CORE – aimed at 1 million words
  – collected 500 documents, had them typed
  – some got lost in civil disturbances and lack of records
• General – opportunistic, at least 4 million words
  – already digitised but in hack fonts, needed conversion
• when to stop?
  – English dictionaries used 200 million words or more
  – eventually stopped at 13 million words, good enough will do
Nepali Corpus-based Dictionary
first ever in South Asia
• needed software to store entries as produced
  – multiple lexicographers
  – could not find suitable existing system
• developed own system to meet needs of Nepali and linguists
  – 2 software engineers in exploratory development
• developed entries using Oxford's Xaira concordance system
  – linguists still learning, each did things differently
  – at around 20,000 entries realised quality problem
• started again, reusing earlier entries or producing new ones
  – only reached 8,000, okay for an on-line dictionary, but not in print
• expected linguists to continue after project
  – but didn't because not paid

Nepali Font Development
• hiring font developers
  – original person went to UK to do masters
  – recruited two graphic designers, arranged training
    • short training from Reading in Kathmandu, open to public
  – some hazards from civil disturbances
• font development process
  – draw many examples of characters needed, including ligatures
  – scan into computer, add rules of combination
  – test and improve constantly
• eventually agreed needed further training in UK
  – two months in Reading
• outputs
  – font for Nepali in two versions
  – font for Maithili, Mithilaaksha style of writing

Nepali TTS speech generation
based on Festival/Festvox concatenative synthesis
• expected help from UK, Roger Tucker and Ksenia Shalapova
  – Ksenia refused to travel to Nepal to give training and guidance
• had to help our 2 software engineers in other ways
  – spent 1 month in Hyderabad with Kishore and Rajeev Sangal
• recorded voices
  – selected words containing all 1764 diphones, chose 1,200 sentences
  – had to research speech – eg "Schwa deletion"
• developed TTS through several versions, adding prosody
  – reasonable quality, judged reasonably "natural"
• wanted to make screen reader for visually disabled and illiterates
  – latest Linux supported this, so in Nepalinux and Ubuntu
  – Festival not on Windows, could not find generation engine
• should we move to other 100 languages of Nepal?

evaluation of use of Nepali software
Nepalinux and MS LIP launched late 2005
• 1000 CDs distributed, 500 attended demos
  – heard not being used, interesting novelty
• Conducted surveys to find out
  – what was really going on, and why?
  – grounded theory, analysed with NVIVO
  – done by Ganesh Ghimire and Maria Newton
what we found out (ITID journal, Vol 5, Issue 1 - SPRING 2009)
  – normal social processes were at work
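
The TTS step "selected words containing all 1764 diphones" amounts to choosing recording material that covers every adjacent phone pair. One common way to do this is a greedy coverage pass, sketched below in Python. This is an illustration, not the project's documented procedure: the candidate words, their romanised phone sequences and the function names are all hypothetical. As an aside, 1764 = 42 × 42, which would match a 42-phone inventory if every ordered pair is counted; that reading is an assumption.

```python
def diphones(phones: list[str]) -> set[tuple[str, str]]:
    """Return the set of adjacent phone pairs in one word."""
    return set(zip(phones, phones[1:]))

def greedy_cover(candidates: dict[str, list[str]]) -> list[str]:
    """Pick words one at a time, each time taking the word that adds the
    most diphones not yet covered, until every diphone is covered."""
    remaining = set().union(*(diphones(p) for p in candidates.values()))
    chosen = []
    while remaining:
        # On ties, max() keeps the first candidate in dictionary order.
        best = max(candidates, key=lambda w: len(diphones(candidates[w]) & remaining))
        chosen.append(best)
        remaining -= diphones(candidates[best])
    return chosen

# Hypothetical romanised phone sequences, for illustration only.
CANDIDATES = {
    "nepali": ["n", "e", "p", "a", "l", "i"],
    "pani": ["p", "a", "n", "i"],
    "nani": ["n", "a", "n", "i"],
    "pal": ["p", "a", "l"],
}

print(greedy_cover(CANDIDATES))
```

Greedy selection does not guarantee the smallest possible word list, but it keeps the recording script short without an exhaustive search.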
1. first impressions of computers
Problem: hardware not localised
  – cabinets labeled in English
  – keyboards not marked in local script
Example: delivered computers to schools
  – could only give keyboard layout charts
    • not well produced
  – techies claimed phonetic keyboards easy
    • but nobody spoke or typed English!
    • ironical – claim systems for non-speakers of English.
Solution: get keyboards marked for Nepali
  – keyboards in Thailand are marked in Thai!

2. translation quality
Problem: technical terms not understood
  – Nepali terms agreed by committee
Example:
  – "I have used all its function because I am a writer. Some of the words like "radditokari" and "anuprayog" seem to be the unusual ones. They look like they have been directly borrowed from Sanskrit and that make Nepali even more difficult than English" – banking officer
  – "It's good for people who are trained in Nepali that don't have exposure to English. For people like us who have already started to use one system it becomes difficult to switch over" – linguist from Kathmandu
Solution:
  – don't translate, English terminology often bizarre, maybe transliterate
  – listen to users, standardise terminology
  – will they get used to it anyway?
3. teacher/trainer inertia
• Problem: trainer not familiar with the localised software
• Example: at Sankhu telecentre
  – language of instruction - Nepali
  – started teaching Nepali interfaces
  – switched back to English interfaces
• Solution:
  – train the trainers
  – restrict opportunity to use English
    • in software as well as hardware
  – don't give option of English interface

4. critical mass of users
Problem: want to get help from others
Examples:
  – Nepal's national library
    • one user, only used to catalogue Nepali books
  – government officer taught himself
Solution: need to create critical mass
  – train complete organisation
    • example – telephone, video tape formats
Sociology – "social interaction" Brock and Durlauf – "social embeddedness"
5. language shift and social mobility
Problem: people change to dominant language
  – see economic advantage in English or Nepali or ...
Example:
  – "Yeah I liked, but when we used Nepali windows that time I feel we are going to forget English language" – telecentre social mobilizer
  – "I have daughter and I will not ask my daughter to use the Nepali interface because I want my daughter to be good in English" – computer academic
Solution: accept that for some languages
  – not worth localising the OS, but enable content
  – supporting the language may reduce the shift
Sociology – "sanskritisation" caste mobility - Srinivas

6. software eco-system
Problem: software works with other software
  – cannot use Unicode for information exchange!
  – other software is not Unicode compliant, use hack fonts
  – but can do Desk Top Publishing
Example: journalists
  – "But it has font problem. In publication houses mainly they use preeti, kantipur so it's not worthy for nepali computing" – journalist
Solution: raise an Open Source project to produce compliant publishing software.
7. support all communities
Problem: language communities very small
  – commercial development not viable
Example: Lohrung Rai in Nepal
  • subject of socio-linguist study by Jens Allwood, Yogendra Yadava, and Bhim Regmi
  – 1,207 'mother-tongue' speakers
  – language not yet written, want Roman system.
Solution:
  – localise for local lingua franca
    • create writing close to that of lingua franca
  – only enable content in local language

8. cost of entry for new language
Problem: Translation cost can be significant
Example: Gnome interface for Linux
  – 40,000 'strings' = 500,000 words approx
  – grows by 5 to 10% per distribution
  – at 1,500 words per day, this takes 300 days
    • 30 days for a new release
Solution: avoid manual translation
  – machine translation?
  – be smart technically
  – language generation from model of software
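
The translation-cost figures on slide 8 can be checked with a few lines of Python. The inputs are the slide's own numbers; the function and constant names are illustrative. The slide rounds the results to about 300 days for the first translation and about 30 days per release.

```python
TOTAL_WORDS = 500_000      # approx. words in the 40,000 GNOME strings
WORDS_PER_DAY = 1_500      # one translator's daily output, as on the slide
GROWTH_PERCENT = (5, 10)   # growth per distribution, low and high estimate

def days_to_translate(words: int, words_per_day: int = WORDS_PER_DAY) -> float:
    """Translator-days needed for a given number of words."""
    return words / words_per_day

initial_days = days_to_translate(TOTAL_WORDS)

# New words per release at 5% and at 10% growth, then the days they cost.
release_days = [
    days_to_translate(TOTAL_WORDS * pct // 100) for pct in GROWTH_PERCENT
]

print(f"initial translation: about {initial_days:.0f} days")
print(f"each new release: about {release_days[0]:.0f} to {release_days[1]:.0f} days")
```

Even at the low estimate, each release costs a small language community several weeks of skilled translator time, which is the motivation for the slide's "avoid manual translation".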
Conclusions
localization must be TOTAL
  – all hardware including keyboards
  – all software in use by a community
  – translations sensitive to language politics
    • otherwise it will be rejected
entry cost must be minimal
  • cheap and easy to localize for a new language
  – base interaction by language generation
  – part of s/w development process
  – sometimes only enable content
next step – harmonised writing and encoding
  • then machine (assisted) translation