Pat Barriers to Localisation V1

Barriers to Localisation
a case study of Nepal
Pat Hall
28th Sept 2009, LRC 2009

brief background to Nepal
[map of Nepal and its neighbours]
populations – 2009 estimated, millions (Wikipedia, 2009 July 1)
  Afghanistan 28.2
  China 1,333.2
  Pakistan 167.5
  Iran 74.2
  Nepal 29.3
  Bhutan 0.7
  India 1,169.5
  Bangladesh 162.2
  Myanmar 50.0
  Maldives 0.3
  Sri Lanka 20.2
major migrant communities worldwide

language diversity in South Asia
[map of South Asian languages]
languages shown: Pushtu, Kashmiri, Urdu, Punjabi, Nepali, Dzongkha, Hindi, Sindhi, Assamese, Bangla, Gujarati, Oriya, Marathi, Telugu, Kannada, Malayalam, Tamil, Sinhala
families shown: Indo-European, Tibeto-Burmese, Austro-Asiatic (Munda), Dravidian
scholarship: Sanskrit, Farsi, Tibetan
INDIA
  – Over 500 languages
  – 22 mother-tongues with > 1 million speakers
  – 114 mother-tongues with > 10,000 speakers
  – 22 official in India
English pervasive (only 5%)
large diaspora

linguistic history and politics
• Gorkhali (later Nepali) official language since 1768
  – mother tongue of ruling elite from Gorkha
  – 1951-1990 'one nation, one culture, one language'
• around 100 other languages
  – Nepali now almost universally spoken – 98%
    • and in neighbouring regions
  – 5 other languages with mature written traditions
  – 60 completely unwritten
• 1990 enabled other mother-tongue education
  – the irony of English
• 2006 interim constitution enabled use of other languages
• 2008 mandating mother-tongue teaching

Writing through the computer
[writing samples: Tamil, Kannada, Hindi (Devanagari)]
• all South Asian writing derived from ancient Brahmi
  – abugida (alphabetic with implicit vowel)
  – largely phonetic, but phonology changes with time
  – multiple forms (conjuncts)
  – maybe only 50 languages across South Asia are written
  – around 50% literacy
• new writing systems for unwritten languages
  – by linguists, by missionaries (SIL)

an old mechanical typewriter
direct connection between key and type

early computers
based on typewriters
[diagram labels]
• keyboard: electrical signals, "codes"
• dot matrix / char printer: font table, code -> dotmap
• codes: ASCII (128 Roman/control), EBCDIC, ISO 646
• 60-70 Roman letters, 5-10 control
• 8 bits: 7+1 bits; 8 bit BYTE: more characters
• internal codes
• communications/applications

indic "hack font" developers
keyboards, codes, fonts dependent
1. design your keyboard
2. accept whatever internal code arises
3. use tool to create fonts
[diagram labels]
• keyboard: conceptual linkage, one-to-one correspondence
• font table: code -> bitmap, 128 Indic characters
• internal codes
• communications/applications: disaster, NO communication

encodings
• ISCII
  – IIT Kanpur, then CDAC Pune
  – GIST card, then software (DLL); cost prohibitive
  – letters, not glyphs
    • like Arabic
    • needs renderer
  – based on Brahmi view
  – single code table with language switch
  – sequence of characters as spoken
• Unicode
  – based on ISCII but separate tables for each script
    • still controversial
  – new Tamil standard

modern computers
separate issues
[diagram labels]
• keyboard hardware
• key mapping: key -> code
• internal Unicodes: the essential letters and characters; writing system, not graphics
• font table: code -> bitmap, to render "same" letter appropriately
• laser or inkjet printer
• communications/applications

Nepali language into Computers
• Desk top publishing with PCs
  – hack fonts to give some representation of the writing
    • 128 places not enough for everything
    • implicitly defined an internal code
  – fonts copied widely and then changed slightly
  – result – inability to exchange data
  – 1997 attempts to standardise in context of emerging Unicode
    • impressed by ISCII, but not international
  – some fonts also developed for other languages
• today, accept Unicode Devanagari for Nepali
  – 1997 IDRC and later UNDP, UNESCO funded keyboard drivers and fonts
    • can use Indian open type fonts, though not liked
  – still no agreed keyboard layout for Nepal

two key issues in technology projects
Research versus application
• doing ICT4D projects requires new understanding
• what understanding?
  – ACM paper claims must deliver computer science research
Technology transfer versus knowledge transfer
• do we do it for our beneficiaries?
• or teach them how to do it for themselves?
Could academic objectives cloud development objectives?

what is involved in localisation?
• translation of all text in screens, menus, and help
  – available in separate 'resource' files
  – agree terminology in national committees
  – store past translations in 'translation memory'
• develop spell checker
  – need standard spellings
  – morphology
• other capabilities that might be needed
  – code converters = hack fonts to Unicode
  – develop fonts
  – text-to-speech
  – OCR
  – is a grammar checker needed?
need to research into languages, but not into technology

Nepalinux at MPP funded by IDRC
• 1997, 2000 produced Nepali Unicode-driver add on to Windows
• 2004 joined PAN localisation project
• Debian Linux with GNOME desktop, KDE desktop, OpenOffice, ...
  – release 1 December 2005
  – release 1.1 October 2006
• FOSS critically important here
• then PDAs, mobile phones, OCR
• in parallel Microsoft produced LIP for Nepali
  – launched November 2005
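
The "code converters = hack fonts to Unicode" item can be illustrated with a minimal sketch. It assumes a hack font that stores text in visual (typewriter) order, so the short-i vowel sign is typed before its consonant, whereas Unicode stores it after the consonant cluster. The mapping table and the function name legacy_to_unicode are hypothetical, made up for illustration; a real converter needs one table per legacy font and many more rules.

```python
import re

# Hypothetical hack-font table: legacy key -> Unicode Devanagari character.
# These assignments are invented for illustration and do not describe any
# real font; every legacy font needs its own table.
LEGACY_TO_UNICODE = {
    "n": "\u0928",  # NA
    "p": "\u092A",  # PA
    "l": "\u0932",  # LA
    "k": "\u0915",  # KA
    "t": "\u0924",  # TA
    "b": "\u092C",  # BA
    "s": "\u0938",  # SA
    "T": "\u0925",  # THA
    "a": "\u093E",  # vowel sign AA
    "e": "\u0947",  # vowel sign E
    "I": "\u0940",  # vowel sign II
    "i": "\u093F",  # vowel sign I (drawn to the left of its consonant)
    "x": "\u094D",  # virama, joins consonants into a conjunct
}

CONSONANT = "[\u0915-\u0939]"  # Devanagari KA..HA
VIRAMA = "\u094D"
SHORT_I = "\u093F"

# A short-i sign followed by a consonant cluster (zero or more
# consonant+virama pairs, then a final consonant).
PREPOSED_I = re.compile(f"{SHORT_I}((?:{CONSONANT}{VIRAMA})*{CONSONANT})")


def legacy_to_unicode(text: str) -> str:
    """Map legacy codes to Unicode, then move each short-i sign from
    before its consonant cluster (visual order) to after it (logical order)."""
    mapped = "".join(LEGACY_TO_UNICODE.get(ch, ch) for ch in text)
    return PREPOSED_I.sub(lambda m: m.group(1) + SHORT_I, mapped)


if __name__ == "__main__":
    for sample in ("nepalI", "iktab", "isxTit"):
        print(sample, "->", legacy_to_unicode(sample))
```

Characters missing from the table pass through unchanged, so digits and Roman text survive conversion. The reordering step is the reason a plain one-to-one table is not enough: the legacy text records what was drawn, while Unicode records the letters of the writing system.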

NeLRaLEC – funded by the EU
renamed Bhasha Sanchar
Open U, Lancaster U, Göteborg U, ELRA, MPP, Tribhuvan U
for Nepali, even though not (yet) endangered
• Nepali National Corpus
  – text (5 M words, 100K parallel) and speech (4hrs+130K words)
• Nepali dictionary – aiming at 100,000 words
  – word list and entries for most frequent
• linguistic tools
• fonts for Nepali
  – and Maithili
• speech generation
• trials in schools and universities
  – localised schools management software
• computational and corpus linguistics course in University
ensures sustainability

Nepali National Corpus – what people actually write
advice from Lancaster University
• wanted material from 1991/2 to match corpora in US and UK
  – not much material then, just after end of repressive Panchayat era
• wanted in fixed genres with strong western bias
  – very little science and technology, 1 science fiction, no westerns (or Kung Fu)
• CORE – aimed at 1 million words
  – collected 500 documents, had them typed
  – some got lost in civil disturbances and lack of records
• General – opportunistic, at least 4 million words
  – already digitised but in hack fonts, needed conversion
  – when to stop?
    • English dictionaries used 200 million words or more
  – eventually stopped at 13 million words, good enough will do

Nepali Corpus-based Dictionary
• needed software to store entries as produced
  – multiple lexicographers
  – could not find suitable existing system
• developed own system to meet needs of Nepali and linguists
  – 2 software engineers in exploratory development
• developed entries using Oxford's Xaira concordance system
  – linguists still learning, each did things differently
  – at around 20,000 entries realised quality problem
• started again, reusing earlier entries or producing new ones
  – only reached 8,000, okay for an on-line dictionary, but not in print
• expected linguists to continue after project
  – but didn't because not paid

Nepali Font Development
• hiring font developers, first ever in South Asia
  – original person went to UK to do masters
  – recruited two graphic designers, arranged training
• short training from Reading in Kathmandu, open to public
  – some hazards from civil disturbances
• font development process
  – draw many examples of characters needed, including ligatures
  – scan into computer, add rules of combination
  – test and improve constantly
• eventually agreed needed further training in UK
  – two months in Reading
• outputs
  – font for Nepali in two versions
  – font for Maithili, Mithilaaksha style of writing
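
The "word list and entries for most frequent" step of the dictionary work rests on counting words in the corpus. The following is a minimal sketch of such a count, assuming the text has already been converted to Unicode and that words are separated by whitespace. The function name word_frequencies and the sample sentence are hypothetical; real corpus tools handle far more punctuation and markup.

```python
from collections import Counter

# Punctuation stripped from token edges. The first two entries are the
# Devanagari danda and double danda, which end sentences in Nepali.
PUNCTUATION = "\u0964\u0965.,;:!?\"'()[]"


def word_frequencies(text: str) -> Counter:
    """Count whitespace-separated words after trimming edge punctuation."""
    tokens = (token.strip(PUNCTUATION) for token in text.split())
    return Counter(token for token in tokens if token)


if __name__ == "__main__":
    # Invented sample text, not taken from the Nepali National Corpus.
    sample = "नेपाल सुन्दर देश हो। नेपाल मेरो देश हो।"
    frequencies = word_frequencies(sample)
    print("running words:", sum(frequencies.values()))
    for word, count in frequencies.most_common(3):
        print(word, count)
```

Splitting on whitespace keeps each Devanagari word whole, including its vowel signs and conjuncts. The total of the counts is the running-word figure of the kind quoted for the corpus, and the most common entries are the candidates for the first dictionary entries.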

Nepali TTS speech generation
based on Festival/Festvox concatenative synthesis
• expected help from UK, Roger Tucker and Ksenia Shalapova
  – Ksenia refused to travel to Nepal to give training and guidance
• had to help our 2 software engineers in other ways
  – spent 1 month in Hyderabad with Kishore and Rajeev Sangal
• recorded voices
  – selected words containing all 1764 diphones, chose 1,200 sentences
  – had to research speech – eg "Schwa deletion"
• developed TTS through several versions, adding prosody
  – reasonable quality, judged reasonably "natural"
• wanted to make screen reader for visually disabled and illiterates
  – latest Linux supported this, so in Nepalinux and Ubuntu
  – Festival not on Windows, could not find generation engine

evaluation of use of Nepali software
Nepalinux and MS LIP launched late 2005
• 1000 CDs distributed, 500 attended demos
  – heard not being used, interesting novelty
• Conducted surveys to find out
  – what was really going on, and why?
  – grounded theory, analysed with NVivo
  – done by Ganesh Ghimire and Maria Newton
what we found out: ITID journal, Vol 5, Issue 1 - SPRING 2009
  – normal social processes were at work
• should we move to other 100 languages of Nepal?

1. first impressions of computers
Problem: hardware not localised
  – cabinets labeled in English
  – keyboards not marked in local script
Example: delivered computers to schools
  – could only give keyboard layout charts
    • not well produced

2. translation quality
Problem: technical terms not understood
  – Nepali terms agreed by committee
Example:
  – "I have used all its function because I am a writer. Some of the words like "radditokari" and "anuprayog" seem to be the unusual ones. They look like they have been directly borrowed from Sanskrit and that make Nepali even more difficult than English" – banking officer
  – "It's good for people who are trained in Nepali that don't
