The Indigenous languages technology project at NRC: an empowerment approach

Roland Kuhn
COLING, December 2020

NRC’S INDIGENOUS LANGUAGES TECHNOLOGY PROJECT

Federal budget, March 2017: $6 million for the National Research Council to assist language communities in stabilizing, revitalizing and reclaiming their languages by providing expertise on digital language technologies (funding over 3 years, ending March 31, 2020).

Social challenges
• Academic researchers have a record of exploitation (e.g., refusing communities access to data that they had volunteered to provide!)
• History of attempted cultural genocide by the federal government
• Outsiders who “want to help” often have a white saviour fantasy
• So, how to build trust with communities?

Empowerment Approach
• We recruited an all-Indigenous advisory committee.
• We asked community language activists what software tools would help their languages.
• The goal is to make communities autonomous – software from the project is open-source.
• Core policy: never claim ownership of Indigenous language data collected with project funding!
• Downside: different languages have different needs. E.g., Inuktut (36,000+ speakers & a government with many employees working in the language) vs. SENĆOŦEN (5 elderly speakers). So we ended up working on a bunch of unrelated technologies.
• Upside: we’ve been working with AMAZING people inside communities!

NRC’S INDIGENOUS LANGUAGES TECHNOLOGY PROJECT

Linguistic challenges
• Polysynthesis: most of these languages are polysynthetic, with complex morphology.
• Dialect: most of these languages form a dialect continuum, so there is a risk of NRC being seen to favour one dialect.
• Demography: in some communities, the only fluent speakers are elderly people who may be uncomfortable with technology.

Software challenges
• Develop scalable technology that can be deployed across several communities/languages.
• Develop software that amplifies community efforts instead of becoming an end in itself.
• Develop open-source software under community guidance – avoid proprietary software & proprietary data formats.

INDIGENOUS LANGUAGES IN CANADA DURING FIRST CONTACT WITH EUROPEANS

Number of speakers by language (2016 census)

RECENT HISTORY OF INDIGENOUS LANGUAGES IN CANADA
• From 1883 to 1996, Christian churches & Canadian governments made a strong attempt to assimilate Indigenous people into Euro-Canadian culture. Their main tool: forcibly removing children from their families and placing them in residential schools.
• Minister of Public Works Hector Langevin (House of Commons, 1883): “In order to educate the children properly we must separate them from their families. Some may say that this is hard but if we want to civilize them we must do that.”
• Other Indigenous children were forcibly adopted by non-Indigenous families: the “Sixties Scoop”.
• Yet others were sent to day schools with the same assimilationist philosophy.
• The recent Truth and Reconciliation Commission found that:
  - the residential school era was marked by physical & sexual abuse
  - the system was “created for the purpose of separating Aboriginal children from their families, in order to minimize and weaken family ties and cultural linkages”.

Nevertheless, Indigenous languages survived! We are seeing a new era in which savvy language activists from within communities are revitalizing these languages. There is evidence that language revitalization is associated with social gains (e.g., reduced rates of teen suicide).

STRUCTURE OF INDIGENOUS LANGUAGES TECHNOLOGY (ILT) PROJECT

• ILT had an “inner ring” of subprojects at NRC & an “outer ring” of subprojects run by external – mostly Indigenous – organizations. I don’t have time to talk about “outer ring” subprojects.
• “Inner ring” subprojects by level of effort:
- Readalong Studio for audio books (“medium” level of effort) – a surprise hit!
- Predictive text for mobile devices (“light” effort)
- WordWeaver tool for verb conjugation for polysynthetic languages (“heavy”)
- Machine translation & office tools for the Inuktut language (“heavy”)

- Speech technology for Indigenous languages developed by CRIM (Centre de recherche informatique de Montréal) (“very heavy”)

READALONG STUDIO – A SURPRISE HIT!
• Many Indigenous communities have transcribed recordings in their languages (e.g., traditional stories).
• A Carleton University team & their Indigenous collaborators had shown the educational potential of manually aligning spoken & written words in audio books. As a word is spoken, it is highlighted in the text. Learners can pause playback, click on a word & slow down the audio so they can master its pronunciation.
• Working with David Daines of Nuance, we automated alignment, making production of “Readalong” books much faster.
• Teachers & students of Indigenous languages love Readalong books. So far, we’ve produced them in the Algonquin, Atikamekw, Southern East Cree, Northern East Cree, Gitksan, Inuktitut, Kwak’wala, Kanyen’kéha (Mohawk), Seneca, & SENĆOŦEN languages. There’s a queue for more!

PREDICTIVE TEXT
• We heard from several communities that their young people are frustrated that mobile devices don’t support text completion in their Indigenous languages.
• Capability we aim to support:
  - Make a spreadsheet of the words of your language
  - Add counts to make a smarter dictionary
  - Community chooses with whom to share the dictionary to support text completion
• The need to coordinate with other software providers has slowed this effort down.
• However, we’ve rolled it out for SENĆOŦEN – educators in that community are happy with it – & we’re on the verge of tackling other languages.
• Advantage: in some languages, it makes spelling dramatically easier.
• Disadvantage: it may end up favouring one dialect over another.

WORDWEAVER: VERB CONJUGATORS FOR POLYSYNTHETIC LANGUAGES
• Owennatékha (Brian Maracle) runs a famous adult immersion school on the Six Nations Reserve that teaches Kanyen’kéha (the Mohawk language). He told us that the school’s toughest challenge was teaching verb conjugations.
• A textbook covering the conjugations of the 600 most common Kanyen’kéha verb stems would have about 200 million entries. Impossible to print!
• Owennatékha’s request: could the NRC team implement a verb conjugator – for teaching purposes – in software?

WORDWEAVER: VERB CONJUGATORS FOR POLYSYNTHETIC LANGUAGES
• We created a framework, WordWeaver, for building verb conjugators for any language. Then, in close collaboration with instructors at Onkwawenna Kentyohkwa (the Six Nations immersion school), we built Kawennón:nis, a conjugator for the Western dialect of Kanyen’kéha.
• Kawennón:nis is now in constant use at Onkwawenna Kentyohkwa – for teaching inside the classroom, for home study, & for making quizzes.
• Available entirely offline
• ‘Serverless’ implementation means communities can create zero-cost deployments
• We have transferred ownership of the Kawennón:nis code to Onkwawenna Kentyohkwa.

USING WORDWEAVER TO BUILD VERB CONJUGATORS FOR OTHER POLYSYNTHETIC DIALECTS/LANGUAGES

• With the help of Akwiratékha’ Martin, we have built a version of Kawennón:nis for the Eastern dialect of Kanyen’kéha – spoken in the large Kahnawà:ke community in Quebec.
• Akwiratékha’ has been promoting this tool on his popular podcast for learners of Kanyen’kéha. After his first podcast on the topic, the software was downloaded hundreds of times. It is being integrated into the curriculum at the Kanien'keháka Onkwawén:na Raotitióhkwa school in Kahnawà:ke.
• We are currently working with Heather Souter – a language activist for the Métis people – to build a verb conjugator for Michif, their traditional language, in the WordWeaver framework.
• Michif is an Algonquian language (with strong French influence) entirely unrelated to Kanyen’kéha. But it’s also polysynthetic & thus faces the same challenge of teaching verbs.

TOOLS FOR INUKTUT

• Inuktut is in a different position from other Indigenous languages. It is an official language of a vast territory, Nunavut, inside Canada.
• The Government of Nunavut & the Nunavut Assembly generate & require production of a large amount of data – much of it bilingual English-Inuktut. We’ve been building tools to meet these bureaucratic needs.
• Early in 2020, we released an English-Inuktut corpus based on proceedings of the Nunavut Assembly: Nunavut Hansard 3.0, with 1.3 million aligned sentence pairs. As far as we know, this is the largest parallel corpus for an Indigenous language of the Americas (or ) ever released.
• Training data for the WMT 2020 Inuktut-English task was supplied by us (we also funded human evaluations for English→Inuktut).
• We’re not sure yet whether MT will be a useful tool for Nunavut translators. But we are also working on tools we know will be useful for Nunavut … (user testing about to start)

SEARCH ENGINE, WEBINUK (CONCORDANCER), AND MORPHEME EXAMPLE SEARCH

NATIONAL RESEARCH COUNCIL CANADA

GISTER

Clicking on a word displays its gist

Morphemes with English

SPELL CHECK

Clicking an underlined word displays a list of suggestions
Words that may be misspelled are underlined
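
The slides do not say how the spell checker works internally. Purely to illustrate the behaviour described above (flag suspect words, then offer suggestions), here is a minimal Python sketch. It assumes a plain word list and string-similarity ranking from the standard library; `find_misspellings` and `TOY_LEXICON` are hypothetical names, and the toy lexicon uses English stand-in words rather than real Inuktut data. A word list alone is unlikely to be enough for a polysynthetic language, where a single stem can yield a very large number of word forms.

```python
import difflib

# Hypothetical toy lexicon (English stand-ins, NOT real Inuktut data).
TOY_LEXICON = {"language", "community", "teacher", "story"}


def find_misspellings(text, lexicon, max_suggestions=3):
    """Map each word of `text` that is absent from `lexicon` to a list of suggestions.

    Assumption: "possibly misspelled" simply means "not in the word list",
    and suggestions are the most similar lexicon entries by string similarity.
    """
    candidates = sorted(lexicon)  # fixed order gives reproducible results
    flagged = {}
    for word in text.split():
        if word not in lexicon:
            # These words would be underlined; the list is what a click would show.
            flagged[word] = difflib.get_close_matches(
                word, candidates, n=max_suggestions
            )
    return flagged


if __name__ == "__main__":
    print(find_misspellings("community langauge story", TOY_LEXICON))
```

A real tool for Inuktut would more plausibly decide whether a word is valid by analysing its morphemes, but the flag-then-suggest flow would be the same.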

CRIM SPEECH TECHNOLOGY TOOLS

– At the start of the ILT project, we gave a $1 million contract to CRIM (Centre de recherche informatique de Montréal) to alleviate “transcription” & “indexation” bottlenecks for Indigenous speech data. What was achieved?
• CRIM created free services that pre-segment speech data to separate speech from non-speech, speech in English or French from speech in Indigenous languages, & speech by a particular speaker from other speakers. Chris Cox & Olivia Sammons of Carleton U. report major productivity gains for processing multi-speaker data (for Tsuut’ina & Michif).
• Speech recognition experiments on Inuktitut & East Cree (preliminary ones on Innu & Tsuut’ina). Speaker-independent recognition doesn’t yield good enough results to speed up transcription. But speaker-dependent results are good enough for this: transcribe 1 hour of speech from an Elder, then run the recognizer on further speech from the same Elder.

CRIM will continue this work with its own funding.

SUMMARY & FUTURE WORK

We’ve looked at these ILT subprojects:
- Readalong Studio for audio books
- Predictive text for mobile devices
- WordWeaver/Kawennón:nis: verb conjugation for polysynthetic languages
- Office tools for Inuktut↔English
- Speech technology for Indigenous languages at CRIM

Priorities for the Future
- Making our software more user-friendly, so communities don’t depend on our expertise.
- Speech synthesis, which several communities have asked for (as a teaching tool).

Conclusion

Reviewer #2: “This paper shows remarkable achievement for minority languages as a result of a $6 million grant. This is a crucial scientific finding: money works!”