
The Indigenous Languages Technology project at NRC Canada: An empowerment-oriented approach to developing language software

Roland Kuhn, Fineen Davis, Alain Désilets, Eric Joanis, Anna Kazantseva, Rebecca Knowles, Patrick Littell, Delaney Lothian, Aidan Pine, Caroline Running Wolf, Eddie Santos, Darlene Stewart
National Research Council Canada (NRC)
[email protected] (e.g., [email protected])

Gilles Boulianne, Vishwa Gupta Centre de recherche informatique de Montréal (CRIM) [email protected], [email protected]

Owennatekha Brian Maracle Onkwawenna Kentyohkwa (Our Language Society) owennatekha@.com

Akwiratékha’ Martin Kahnawà:ke Kanien’kehá:ka Territory [email protected]

Christopher Cox, Marie-Odile Junker, Olivia Sammons, Delasie Torkornoo Carleton University [email protected] (e.g., [email protected])

Nathan Thanyehténhas Brinklow Queen’s University and Tsi Tyonnheht Onkwawen:na Language and Cultural Centre (TTO) [email protected]

Sara Child Sanyakola Foundation [email protected]

Benoît Farley Pirurvik Centre [email protected]

David Huggins-Daines Nuance Communications [email protected]

Daisy Rosenblum University of British Columbia [email protected]

Heather Souter Prairies to Woodlands Indigenous Language Revitalization Circle [email protected]

Contents

Abstract

1 Introduction
2 Sociolinguistic Background
3 Text-based subprojects
  3.1 Polysynthetic Verb Conjugation
    3.1.1 Background
    3.1.2 WordWeaver
    3.1.3 WordWeaver Instances
    3.1.4 WordWeaver UI
    3.1.5 Work on Michif
    3.1.6 Discussion
  3.2 Corpus and Tools for Inuktut (including machine translation)
  3.3 Predictive Text
4 Speech-based subprojects
  4.1 Work at CRIM on Audio Segmentation and Speech Recognition
    4.1.1 Data Collection and Transcription
    4.1.2 Audio Segmentation
    4.1.3 Automatic Speech Recognition (ASR)
  4.2 Read-along audiobooks
  4.3 Limited-domain Text-to-Speech (TTS)
5 Online courses and games
  5.1 Learning Platforms at Carleton University
  5.2 7000 Languages
  5.3 Computer-assisted Language Learning at the University of …
  5.4 FirstVoices
  5.5 On the Path of the Elders
6 Capacity-building: … and Indigitization Subprojects
7 Single-language subprojects focused on data collection
8 Discussion
9 Future Work
  9.1 Visual Programming for Building Rule-based Grammars
  9.2 Other Priorities
Acknowledgements

References

Abstract

This paper describes a three-year project at the National Research Council of Canada aimed at developing software to assist Indigenous communities in their efforts to preserve their languages and extend their use. The project aimed to work within the empowerment paradigm, where the linguistic goals of communities have at least equal weight with those of the researchers, and where collaboration with communities is central. Because many of the technological directions we took were in response to community needs, the project ended up as a collection of diverse subprojects, including: the creation of a sophisticated framework for building verb conjugators for highly inflectional polysynthetic languages (a verb conjugator for Kanyen’kéha, in the Iroquoian language family, was built in the framework); release of what is probably the largest available corpus of sentences in a polysynthetic language (Inuktut) aligned with English sentences, together with experiments with machine translation (MT) systems trained on this corpus; free online services based on automatic speech recognition (ASR) for easing the transcription bottleneck for recordings of speech in Indigenous languages (and other languages); limited-domain text-to-speech (TTS) for some Indigenous languages; and several other subprojects.

1 Introduction

This paper describes the Indigenous Languages Technology (ILT) project at the National Research Council of Canada (henceforth NRC). This project came into existence thanks to funding of $6 million over three years, granted by the Canadian government’s Treasury Board Secretariat according to the provisions of the March 2017 budget. The project’s primary goal was to serve Indigenous communities by producing software that would enhance their efforts to preserve and revitalize their languages. The ILT project did not primarily aim at carrying out academic research, though interesting academic research resulted as a by-product. Some ILT activities were carried out by employees of NRC; others were financially supported by NRC but carried out by other organizations. A second phase of the project began in April 2020.

The ILT project should be viewed in the context of a much larger set of efforts being carried out by a variety of organizations to deploy technologies that promote the revitalization and documentation of Indigenous languages spoken in Canada. Littell et al. (2018a) surveys these efforts. The current paper depicts a tiny corner of a big canvas.

The activities carried out within the ILT project were extremely heterogeneous. Different communities have very different linguistic needs, so, given our primary goal of serving communities, ILT was made up of a diverse set of subprojects. Another aspect of the project was that there is so little textual or speech data for Indigenous languages in Canada (with the partial exception of Inuktut) that they could be characterized, in the fields of natural language processing and speech recognition, as extremely low-resource languages. Googling the term “low-resource language” brings up Sindhi as an example. Sindhi is spoken by about 23 million people. Inuktut is spoken by about 40,000 people, and that is far more speakers than most Indigenous languages in Canada have. Thus, most of the technologies developed within the ILT project have been rule-based, rather than relying on data-driven machine learning.

Drawing on Cameron et al. (1992) and Czaykowska-Higgins (2009), one can list three approaches to linguistic research involving Indigenous languages:
1. the “linguist-focused” or “ethical research” model, in which members of the speech community are the passive subjects of research. They are “informants” from whom the linguist in control collects data.

2. the “advocacy research” model, where the linguist is still in a position of power, but the research is carried out not only by and for the linguist, but also partly for the benefit of the speech community.

3. the “empowerment” model, in which research is carried out with equal emphasis on the agendas of the linguist and of the community. The focus is on collaboration and dialogue.

Most of the work carried out within the ILT project fits into the empowerment model. The most ambitious subproject described in this paper was suggested to us by an Indigenous educator: the creation of a verb conjugator for an Iroquoian language, Kanyen’kéha (Mohawk). Later, when a prototype for the verb conjugator was shown to students of Kanyen’kéha, the first question they asked was whether it would have audio. This led us to research text-to-speech (TTS) capabilities, in order to create a version of the system that can speak hundreds of thousands of unique verb forms out loud. Similarly, the “readalong” subproject for automating word-speech alignment for audio books was in response to strong interest from several communities after a research team at Carleton University had aligned several audio books manually. None of these research themes would have occurred to members of the team at NRC; they were driven by demand from communities. In other cases, members of the NRC team suggested several possible technologies to communities, then focused on the ones that members of the communities thought would be helpful.

Several subprojects were almost purely community-driven, with little direct involvement by members of the NRC team, such as those focusing on speech data collection for the Cree, Kwak’wala, Michif, Nsyilxcn, SENĆOŦEN, and Tŝilhqot’in languages. NRC’s role in these subprojects was mainly to provide funding, with little micromanagement (though the team followed their progress carefully).

To ensure that the project did not stray too far from the empowerment model, or inadvertently engage in unethical behaviour, the NRC team enlisted a group of volunteers to guide the project: an Advisory Committee, made up of Indigenous people with expertise in language revitalization. Their counsel has been invaluable. For a list of the members of this committee, scroll down to the end of the project web page.1

An important part of our approach was that at no stage did NRC claim ownership of Indigenous language data collected with the project’s funding. Furthermore, all software generated by the ILT project has been or will be released as open source. This refusal by the NRC to retain intellectual property in software or data generated by the project was one of the project’s guiding principles from the beginning. We were determined to break with the unfortunate history of academics and government departments often refusing to return linguistic data to the Indigenous communities from which it was collected.

After outlining the current state of Indigenous languages in Canada, we will discuss the five types of ILT subprojects: text-based subprojects, speech-based subprojects, creation of online courses and educational games, capacity-building subprojects, and subprojects focused on collection of new data.

2 Sociolinguistic Background

There are approximately 70 Indigenous languages from 10 distinct language families (Rice, 2008) currently spoken in Canada. Most of these languages have extremely complex morphology; they are polysynthetic or agglutinative. Commonly, a single word carries the meaning of what would be an entire clause in Indo-European languages like English and French.

All Indigenous languages in Canada were targeted by deliberate government policies that sought to eradicate their use in the course of the late 19th century to the end of the 20th century. These policies focused heavily on creating fear around speaking Indigenous languages and ending the generational transmission of such languages. They were implemented through government legislation such as the Indian Act,2 which discouraged, and often made illegal, gathering for cultural practices and speaking ancestral languages.

1 https://nrc.canada.ca/en/research-development/research-collaboration/programs/canadian-indigenous-languages-technology-project
2 See https://www.thecanadianencyclopedia.ca/en/article/indian-act, https://indigenousfoundations.arts.ubc.ca/the_indian_act/.

Many Indigenous children were forcibly removed from their communities and placed in compulsory boarding schools (known as Residential Schools) or adopted by non-Indigenous families (known as the Sixties Scoop (Fachinger, 2019)). The Truth and Reconciliation Commission of Canada recently led an inquiry into the atrocities, which included physical and sexual abuse, committed during the residential school era in Canada from 1883 to 1996. In 2015, the Commission released a report confirming a reality of these schools that Indigenous people have known all along: that the residential school system was “created for the purpose of separating Aboriginal children from their families, in order to minimize and weaken family ties and cultural linkages” (Truth and Reconciliation Commission of Canada, 2015, preface). Some children were sent to day schools with the same assimilationist philosophy as the residential schools; these were also damaging.

Although the negative impacts of the effort to eliminate Indigenous languages in Canada are deep and widespread, this effort did not succeed. The resilience of Indigenous language communities can be seen in the myriad of ways that they have resisted assimilation and continued to teach, learn, and speak their languages (Pine and Turin, 2017). The benefits associated with the use of these languages are wide-ranging (Whalen et al., 2016; Reyhner, 2010; Oster et al., 2014; Marmion et al., 2014). Research in psychology has shown a correlation between Indigenous language use and a decrease in youth suicide rates on reserves in British Columbia (Chandler and Lalonde, 1998; Hallett et al., 2007).

Figure 1: Number of speakers of Indigenous languages spoken in Canada, according to the 2016 census (from Statistics Canada (2017))

Figure 1 (obtained from the Statistics Canada website) shows the number of speakers per language in the second decade of the 21st century (Statistics Canada, 2017). The languages underlined in red are those with which the ILT project has interacted in some way. These census figures are controversial among linguists. For one thing, they imply that each language is a discrete entity, while many of them are more accurately viewed as a collection of dialects that blend into each other as one moves geographically from one community to another (e.g., Cree has been subdivided in different ways by different linguists).

Some of the details about specific languages may be inaccurate, too.3 This figure is included to give the non-expert reader a general idea of the number of speakers of each language; for detailed demographic information, one should consult experts on each individual language.

Speakers of the “larger” languages in the figure tend to have different linguistic priorities from those of “smaller” languages. Speakers of Inuit language varieties such as Inuktitut and Inuinnaqtun who live in the territory of Nunavut constitute a numerical majority there, and are in the unique position of having these varieties, grouped together under the name Inuktut, designated as official in the territory (along with English and French). Thus, some of their concerns focus on ensuring that the Nunavut bureaucracy is capable of providing services in Inuktut, and that it is taught well by the local school system. (The Northwest Territories recognize three Inuit language varieties (Inuktitut, Inuinnaqtun, and Inuvialuktun) among the eleven languages recognized there, but outside the court system, there is no guarantee that government services will be provided in these languages.)

A common linguistic phenomenon in many Indigenous communities is a generational split where the majority of fluent speakers, especially those taught as children, largely belong to the eldest generations of grandparents and great-grandparents. Today, many communities face decreasing numbers of first language (mother tongue) speakers, due to declining language transmission rates to the younger generations (Norris, 2018), likely a result of the assimilation efforts described above. That is why much Indigenous language revitalization work in Canada focuses on preservation of language through recording and transcribing the speech of Elders before their language knowledge is lost with them. Many of the language learning technologies being created, for which these recordings and transcriptions can be a vital resource, are meant to support the learning of younger generations.

3 For instance, Dr. Marie-Odile Junker writes: “the speakers numbering 18,000 are partly bundled under Cree. There was a forest fire in Northern Ontario that year and most people were moved South and did not reply to the Census (affecting Oji-Cree, Moose Cree and Swampy Cree).”

3 Text-based subprojects

In this section, we discuss subprojects based around text. They include work on polysynthetic verb conjugation for language teaching, work on tools for the Inuktut language, and work on text prediction for mobile devices.

3.1 Polysynthetic Verb Conjugation

3.1.1 Background

Early in the ILT project, Owennatékha Brian Maracle, the director of Onkwawenna Kentyohkwa (Our Language Society), asked whether NRC would be able to create a software-based verb conjugator. Onkwawenna Kentyohkwa is a community-based adult immersion school in the Six Nations Grand River Territory in Ontario that takes students through 2000 hours of Kanyen’kéha immersion over two years. Kanyen’kéha is an Iroquoian language, commonly known as “Mohawk”, that is spoken in territory spanning present-day Ontario, Quebec, and New York State. All Iroquoian languages are highly polysynthetic; that is, their words are composed of many morphemes. Thus, single verbs in Kanyen’kéha are routinely translated as entire sentences in English.

The proposed verb conjugator was a tool that could provide a helpful reference for learners and teachers alike, as learning and teaching verbs in Kanyen’kéha is a formidable task. The equivalent for the French language is the pocket Bescherelle, except that polysynthetic languages may have millions of conjugations for even the most common verbs, far too many to print out. It is therefore only possible to create a verb conjugator with reasonable coverage for Kanyen’kéha in software, not on paper. Verb roots in the language are bound morphemes, meaning that they do not, on their own, constitute words. A pronominal prefix, a verb root and an aspectual ending are always present (for commands the aspectual ending is null).

A verb can contain pre- and post-pronominal prefixes and pre-aspectual suffixes. Kanyen’kéha has 14 stand-alone or ‘free’ pronouns, and 72 bound pronouns, meaning the combinatorial inflectional possibilities are significantly larger than in English, French, or other European languages. This complexity adds to the difficulty of teaching the language, as learners cannot be expected to memorize paradigms arbitrarily, and even learning how to properly conjugate a modest number of verbs requires a significant amount of work. For detailed information on Kanyen’kéha syntax, see (Kanatawakhon, 2002).

3.1.2 WordWeaver

Given that many Indigenous languages are polysynthetic and have rich inflectional morphology, the NRC team wanted to take Owennatékha’s request and build a tool that could be extended to other languages as well. WordWeaver is a tool designed to do exactly that. The WordWeaver ecosystem consists of two main parts: a front-end interface (WordWeaver UI) implemented in Angular, and a back-end database and API implemented in Python (FastAPI & CouchDB). Initially, WordWeaver was built in a way that was tightly coupled to the instance’s language model, specifically Foma (Hulden, 2009). However, the language model was later decoupled from the WordWeaver architecture in favour of storing all data in a database. The improvements are several-fold: WordWeaver UI can now run offline using PouchDB; it is typically faster (depending on the queries); and the WordWeaver code base has been significantly simplified, without the requirement of encoding and serializing HTTP arguments to validate upper-side tags of a finite-state transducer (FST) and then decoding and de-serializing them in the response. Additionally, this allows verb conjugators to be made in a wide variety of ways without necessarily requiring specialist knowledge of building FSTs (see Sections 3.1.6 and 9.1 for additional discussion).

3.1.3 WordWeaver Instances

The first instance of WordWeaver, called Kawennón:nis, models the Western dialect of Kanyen’kéha that is taught at Onkwawenna Kentyohkwa. Kawennón:nis means “It Makes Words” in the language. We have since also created an instance for the Eastern dialect, spoken in the Kahnawà:ke community in Quebec. Additional sample instances have been made for languages whose data is publicly available, including French.

To design the rules encoded in the first Kawennón:nis FST, we relied on a textbook that describes the Western dialect (Maracle, 2017), along with writings on other dialects of Kanyen’kéha and the closely related language Oneida. Even more important, we benefited from a close, respectful relationship with the staff of Onkwawenna Kentyohkwa, who added hundreds of new verbs to the system through a collaborative development process that allows new verb stems to be declared in a spreadsheet and then compiled into a valid FST lexicon formalism (lexc). Quality control for this Western version of Kawennón:nis was done both by teachers at Onkwawenna Kentyohkwa and by NRC researchers.
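To make the spreadsheet-to-lexc workflow concrete, here is a minimal sketch of how rows of verb stems might be compiled into lexc LEXICON entries. The column names ('gloss', 'stem'), tag inventory, and continuation class are hypothetical illustrations of the idea, not the actual Kawennón:nis build scripts.

    # Minimal sketch: compile verb stems declared in a spreadsheet (CSV) into
    # lexc LEXICON entries. Column names, the tag inventory (+V, +Hab, +Punc),
    # and the continuation class 'Aspect' are hypothetical, for illustration.
    import csv

    HEADER = "Multichar_Symbols +V +Hab +Punc\n\nLEXICON VerbRoots"

    def compile_stems(csv_path: str) -> str:
        lines = [HEADER]
        with open(csv_path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                # Upper side: gloss plus a +V tag; lower side: the stem itself.
                lines.append(f"{row['gloss']}+V:{row['stem']} Aspect ;")
        return "\n".join(lines)

    if __name__ == "__main__":
        print(compile_stems("verb_stems.csv"))  # hypothetical input file

A generated file of this kind would then be compiled, together with the hand-written morphophonological rules, into the instance's FST.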
Members of NRC made several in-person visits to Onkwawenna Kentyohkwa to demonstrate Kawennón:nis to the students and staff, and to participate in multi-day collaborative sessions to design, evaluate, and improve the user interface. Creation of the Eastern (Kahnawà:ke) version relied heavily on the expertise of a single gifted individual, Akwiratékha’ Martin. The current Western version of Kawennón:nis contains over 250 verb stems, while the Eastern version has approximately 600. Both versions contain all bound pronouns and 12 tense/aspect combinations (command, habitual forms, perfective forms and ‘punctual’ forms including the definite past, conditional and future forms), and both are capable of generating over 100,000 conjugated forms.

3.1.4 WordWeaver UI

The user interface is of prime importance if Kawennón:nis is to be useful to students. The process for prototyping, designing, and evaluating the WordWeaver UI was extensive, involving multiple in-person visits to define the requirements of the UI, the hiring of in-community designers, and extensive user interface and user experience (UI/UX) review.

Figure 2: WordWeaver UI application settings: in English using the ‘light theme’

Figure 3: WordWeaver UI application settings: in Kanyen’kéha using the ‘dark theme’

The resulting UI is interactive, highly theme-able, entirely translated into English, French and Kanyen’kéha (see Figures 2 and 3), available on the web and on mobile as a progressive web application, and usable offline. There are two main views within the application, the ‘Wordmaker’ and the ‘Tableviewer’. The Wordmaker is the simpler view; it guides the user linearly through three questions to create a single conjugation: what the action is, who is doing it, and when it is happening. This reflects three basic categories in WordWeaver generally. It is assumed that each conjugation will minimally require a root and some sort of pronominal inflection. The third category is the most open and could be used for other ‘options’ beyond the temporal options suggested by the demonstration version of WordWeaver. The Tableviewer is the more advanced view; it allows users to create paradigms of conjugations instead of single output forms. Here users can non-linearly select multiple options from the three categories to create a query for many conjugated forms. The user can then interact with the conjugations in a tabular grid format (see Figure 4) or in a ‘tree’ format (see Figure 5). Users can also download the conjugations as a Microsoft Word document, CSV file, or a formatted LaTeX file. This functionality allows users to create and print out tables for making their own flashcards or study tools.

Figure 4: Tabular grid view of the Kawennón:nis Tablemaker using the ‘dark purple theme’

Figure 5: ‘Tree’ view of the Kawennón:nis Tablemaker using the ‘dark purple theme’

3.1.5 Work on Michif

The work on a verb conjugator for Michif began after the NRC team was asked by an influential language revitalization activist, Heather Souter of the Prairies to Woodlands Indigenous Language Revitalization Circle (P2WILRC), whether NRC could implement a system like Kawennón:nis for Michif. Kanyen’kéha and Michif are not related, but they are similar in that they are both highly polysynthetic languages. Michif faces the same problem as other Indigenous languages: there is a shortage of data and formal documentation for the language, both of which are needed to create teaching tools.

Michif is a mixed language which arose during the 19th century from intermarriage between French fur traders and Cree and Ojibwe women in the Métis homeland, stretching from present-day north-western Ontario across the Prairies to the north-east of British Columbia, and from the southern Northwest Territories to North Dakota and Montana in the south (Rosen and Souter, 2009). Their descendants are the Métis people, whose official language is Michif. As a mixed language, Michif takes most of its nominal patterns from French, and its verbal patterns from Cree. There is also a high degree of regional variation (Sammons, 2019).

With Ms. Souter’s help, the NRC team has been building MichifVerb, a finite-state transducer (FST) which models the verbal morphology of Michif using the lexc formalism described in Beesley and Karttunen (2003). The current implementation allows for the conjugation of 22 verb stems, which generates 6791 possible verb forms.4 Though MichifVerb relies on FSTs, just as Kawennón:nis does, it is not yet integrated into the WordWeaver code base; we have started working on this integration.

The MichifVerb FST serves as the back-end to an app that will allow users to conjugate verbs in Michif without having any previous training relating to FSTs or linguistics. The app contains a simple interface that walks a user through building a verb conjugation in Michif. This interface will be available for both Android and iOS, and as a web application. The application will be available for use offline (after initial download) and deliberately avoids the use of over-technical linguistic terminology. We hope these features will promote transmission by giving learners the opportunity to learn about the language directly from speakers in informal Métis community settings instead of limiting learning to classrooms.

3.1.6 Discussion

Work on Kawennón:nis and MichifVerb illustrates two of the themes of the ILT project: the use of rule-based approaches, and close collaboration with members of an Indigenous community in pursuit of a goal set by that community rather than by researchers. The goal in both cases has been to develop an assistive reference tool that does not replace learners’ experience of acquiring verbal morphology at community schools, but complements it.

As mentioned, the reference tools described here rely on finite-state transducers, which are rule-based. In modern-day computational language modelling, rule-based approaches may seem outdated in contrast to more widely used statistical methods. However, as with most polysynthetic languages, existing corpora are not large enough to produce statistical models that are sufficiently accurate. That being said, rule-based methods come with their own set of disadvantages. By definition, they must be built from a standard model. This means imposing a standard onto languages that were never previously standardized and continue to vary from community to community. Through close collaboration with our Indigenous community partners, working with them as the experts on the language, we have tried to mitigate these limitations.

Several educators who teach Indigenous languages other than Kanyen’kéha and Michif have expressed a desire for a verb conjugator teaching tool. To facilitate this, WordWeaver has been designed to be as language-independent as possible. We discuss the challenge this poses in terms of human resources in Section 8 and discuss work on tools to lower the barrier for building verb conjugators in Section 9.1.

3.2 Corpus and Tools for Inuktut (including machine translation)

The Inuktut language, a member of the Inuit-Yupik-Unangan language family, is a polysynthetic language spoken across Arctic Canada. The Government of Nunavut uses the term Inuktut to represent all of the Inuit language varieties spoken in Nunavut, including Inuktitut and Inuinnaqtun. Inuktut is an official language of the Territory of Nunavut; Inuvialuktun, Inuinnaqtun, and Inuktitut are official languages of the Northwest Territories,5 and Inuit languages have recognition in additional regions.

As part of the ILT project, and with support from the Government of Nunavut, the NRC team (Joanis et al., 2020) released the sentence-aligned Inuktut–English “Nunavut Hansard” corpus, based on the proceedings of the Legislative Assembly of Nunavut and covering sessions from April 1999 to June 2017. To ensure a high quality of alignment, we relied on the expertise of speakers of Inuktut from the Pirurvik Centre. With approximately 1.3 million aligned sentence pairs, this is, to our knowledge, the largest parallel corpus of a polysynthetic language or an Indigenous language of the Americas released to date.

4 See version 1.0.0-alpha at https://github.com/finguist/MichifVerb.
5 https://www.ece.gov.nt.ca/en/services/le-secretariat-de-leducation-et-des-langues-autochtones/languages-overview

The corpus is available at the NRC Digital Repository6 under the CC-BY-4.0 license. Because of the size of the Nunavut Hansard corpus, this is an ILT subproject in which we were able to apply machine learning techniques to data: Joanis et al. (2020) describe our preliminary experiments on statistical and neural machine translation between Inuktut and English, in both directions. A pre-release version of the same corpus was also used for Inuktut–English and Yupik machine translation experiments during the 2019 JSALT Workshop on Neural Polysynthetic Language Modelling (Schwartz et al., 2020). The existence of the Nunavut Hansard corpus has made possible a shared task for the Inuktut–English language pair in the Workshop on Machine Translation,7 with the evaluation taking place in May/June 2020. We hope to make tools based on machine translation available to translators during the second phase of the project.

Several years ago, researchers at NRC built a search engine for English-to-Inuktut translators, WeBInuk (Désilets et al., 2008). Given an English word or phrase, it would return Inuktut–English sentence pairs from a parallel corpus (in practice, the portion of the Nunavut Hansard that was available at the time). Despite its unidirectionality, translators found it extremely useful. Unfortunately, funding for the WeBInuk online service lapsed for several years. One of the goals of the ILT project was to revive WeBInuk and to make it bidirectional, i.e., to allow users to enter Inuktut search terms as well as English ones. In addition, the project aimed at providing tools that are currently unavailable for Inuktut: a comprehensive dictionary, a gister, a spell checker, and so on. A member of the NRC team and experts from the Pirurvik Centre began building versions of some of those tools.

They faced a number of challenges. Inuktut words resemble short phrases in English. They are composed by stringing together a root (about 2000 possibilities, versus about 500,000 in English) and a fairly long sequence of morphemes (typically 4-5 morphemes, but potentially up to 9) taken from a set of roughly 450 affixes (verbs, adjectives, etc.), 1300 verb endings, and 320 noun endings. Unlike Kanyen’kéha, where the surface forms of morphemes tend to be invariant, with minor exceptions, many Inuktut morphemes have surface forms that change in different contexts. Furthermore, the same surface form may correspond to different morphemes. The first phenomenon is illustrated by the word “umiarjualiuqtiujumajunga” (I want to be a ship builder). It is made of “umiar”: from umiaq, boat, with final Q voiced to R by the following morpheme; “jua”: from juaq, big, with final Q deleted by the following morpheme; “liuq”: from liuq, to build; “ti”: form of ji after q, one whose occupation is to ...; “u”: to be; “juma”: form after a vowel, to want; “junga”: verb ending after a vowel, 1st person non-specific with no indication of the object, in the declarative mood.

Since many, perhaps most, Inuktut words in a given text will not have occurred before, building a word-based dictionary with good coverage is unrealistic. The only realistic method for increasing coverage is to decompose the Inuktut word into a sequence of morphemes, then to look up the meanings of the morphemes.
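As a toy illustration of these surface alternations, the sketch below concatenates the morphemes of the example word while applying three rules generalized from this single example. It is our own illustration and emphatically not a real grammar of Inuktut; a real analyzer handles many more contexts.

    # Toy sketch of Inuktut-style surface alternations when concatenating
    # morphemes. The three rules below are generalized from the single example
    # "umiarjualiuqtiujumajunga" and are NOT a real grammar of Inuktut.
    def join(morphs):
        word = ""
        for m in morphs:
            if word.endswith("q") and m.startswith("j") and m != "ji":
                word = word[:-1] + "r"   # final q voiced to r (umiaq + juaq)
            elif word.endswith("q") and m.startswith("l"):
                word = word[:-1]         # final q deleted (juaq + liuq)
            elif word.endswith("q") and m == "ji":
                m = "ti"                 # ji surfaces as ti after q
            word += m
        return word

    assert join(["umiaq", "juaq", "liuq", "ji", "u", "juma", "junga"]) \
        == "umiarjualiuqtiujumajunga"    # "I want to be a ship builder"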
Similarly, one could easily build a version of WeBInuk that would allow users to look up whole Inuktut words, but it would not have high coverage; we would expect that users would frequently enter words for which exact matches cannot be found in the parallel corpus. Instead, one might prefer to look up Inuktut words in the parallel corpus that share the root of the user-entered word (and perhaps some other key morphemes).

6 https://doi.org/10.4224/40001819
7 http://www.statmt.org/wmt20/

Changes in surface forms make decomposing words far from trivial. Nevertheless, by building on a morphological analyzer created earlier by one of them (Benoît Farley), the NRC-Pirurvik collaborators were able to create five prototype apps:

1. a “Morpheme Example Search” app that, given a morpheme, shows examples of its use;

2. a gister that, given Inuktut text, returns not only the meanings of the morphemes in the words composing it, but also sentence pairs that may have related meanings;

3. a revised version of WeBInuk;

4. an Inuktut spell checker that, if it cannot find a word, returns the most similar words known to be correctly spelled, as well as the longest correctly spelled head and tail of the word; and

5. a search engine that, given an Inuktut word, searches for its five most frequent morphological variants.

All of these need to be sped up. Aside from that, the first three are almost ready to be released. The last two, the spell checker and the search engine, require testing by speakers of the language to see how useful they are in practice.
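As an illustration of the spell-checker behaviour described in item 4 above, the sketch below returns close matches plus the longest well-formed head and tail of an unrecognized word. It is a toy version of our own devising: the real app consults the morphological analyzer, for which the fixed word list and the 'looks_valid' placeholder stand in here.

    # Toy sketch of the spell checker in item 4 (not the NRC-Pirurvik code).
    # 'looks_valid' stands in for a call to the real morphological analyzer.
    import difflib

    KNOWN = ["umiaq", "umiarjuaq", "umiarjualiuqti", "nunavut"]

    def looks_valid(s: str) -> bool:
        return s in KNOWN  # placeholder for morphological analysis

    def check(word: str):
        if looks_valid(word):
            return {"ok": True}
        head = next((word[:i] for i in range(len(word), 0, -1)
                     if looks_valid(word[:i])), "")
        tail = next((word[-i:] for i in range(len(word), 0, -1)
                     if looks_valid(word[-i:])), "")
        return {"ok": False,
                "similar": difflib.get_close_matches(word, KNOWN, n=3),
                "longest_head": head,
                "longest_tail": tail}

    print(check("umiaqtuq"))  # head 'umiaq' is the longest analyzable prefix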

3.3 Predictive Text

Writers in majority languages have benefited from predictive text suggestions on their smartphone keyboards. Predictive text is the bar on top of smartphone keyboards that suggests words as you type and makes corrections to mistyped words (Figure 6). The impetus for creating predictive text was to make typing on tiny touchscreen keyboards more accessible and less error-prone. Predictive text requires data, typically mined from large text corpora, to generate an n-gram language model (van Esch et al., 2019); however, many Indigenous language communities in Canada either do not have robust or extensive text corpora, or the data is too sensitive to share outside of their community.

Thus, we have collaborated with Keyman (keyman.com), an open-source keyboard creation platform, to add a predictive text platform to their suite of smartphone keyboards and their keyboard development tool. We have “decentralized” the creation of predictive text models by making it possible for language activists, using a spreadsheet of words in their language, to provide predictive text suggestions for their community. We view this as an enabling tool: the intended users are language activists, who are in charge of their own data and are entitled to choose how they share their predictive text keyboard. This is in contrast to efforts such as van Esch et al. (2019), where a centralized entity mines text corpora and creates a language-specific model, which is then shared back to the relevant language community. Since we expect data to be sparse, the generated language models are word-level unigram models; anecdotally, this has been enough to make typing in tricky orthographies, such as SENĆOŦEN (Figure 6), dramatically easier.

Figure 6: Predictive text for SENĆOŦEN. The bar above the keyboard shows suggestions in SENĆOŦEN for the text entered so far.
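A word-level unigram model of this kind is simple enough to sketch in a few lines. The wordlist below is a toy English stand-in for the spreadsheet a language activist would supply; this is our own illustration, and Keyman's actual model format differs.

    # Minimal word-level unigram predictive-text sketch. A real deployment
    # loads the activist-supplied spreadsheet; Keyman's model format differs.
    from collections import Counter

    lexicon = Counter({"language": 40, "languages": 22, "learner": 9, "lesson": 7})

    def suggest(prefix: str, k: int = 3):
        """Return up to k lexicon words starting with prefix, most frequent first."""
        matches = [w for w in lexicon if w.startswith(prefix)]
        return sorted(matches, key=lambda w: -lexicon[w])[:k]

    print(suggest("la"))  # ['language', 'languages']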

4 Speech-based subprojects Traditionally, Indigenous languages in Canada were spoken,8 not written. It therefore makes sense that several ILT subprojects focused on applications of speech technology, as described below.

4.1 Work at CRIM on Audio Segmentation and Speech Recognition

Transcription and further annotation of speech recordings (often including translation) are the biggest part of the workload in most language documentation and conservation projects. The pace at which speech is being recorded in Indigenous languages in Canada, and indeed in minority languages across the world, greatly outstrips the pace at which field linguists and Indigenous language activists can transcribe these recordings. This is the “transcription bottleneck” (Cox et al., 2019). NRC provided funding to the Centre de recherche informatique de Montréal (CRIM) to develop tools based on automatic speech recognition (ASR) to relieve both this bottleneck and a related one that could be termed the “indexation bottleneck”. Some communities have thousands of hours of recordings of speech in their languages made years earlier, with no means of searching through them for relevant words or phrases. Even if it is impractical to transcribe all that speech now, could one apply ASR-based audio indexation techniques to carry out such searches?

Originally conceived as a long-term project, the CRIM effort yielded practically useful tools well before the first phase of the ILT project had ended. This was in large part because of collaboration between CRIM’s ASR team and Christopher Cox of Carleton University, a field linguist.

The CRIM subproject faced two main challenges. First, very little data for training ASR models was available before the start of the project, even for the two most frequently spoken Indigenous languages in Canada, Inuktut and Cree, with only a few hours of transcribed recordings for each. Second, the speech recognition field was developed over decades with only a small subset of the world’s languages. Even the US IARPA BABEL program, which focused on speech recognition for low-resource languages, covered mostly languages with low morphological complexity.9 Hungarian and Turkish represented the most complex cases in this program, yet these two are relatively mid-range among languages. Most Indigenous languages in Canada are in the high range; even among morphologically complex languages, Inuktut is known to be particularly morphologically complex (as measured by mean distance to novel type (Schwartz et al., 2020)).

4.1.1 Data Collection and Transcription

Four transcription activities were carried out as core parts of the CRIM project, as shown in Table 1. Inuktut transcriptions were provided by the Pirurvik Centre. East Cree data was provided courtesy of CBC/Radio-Canada (it is drawn from CBC broadcasts in the language), with transcriptions for the data being provided by a company called WinTranslation.

In addition, CRIM has partnered with several community-university teams to assist in assessing the usefulness of the new ASR tools in supporting Indigenous language documentation projects, with each partner team providing feedback to CRIM on these services and their integration into common documentation workflows. In one of these partnerships, Elder Bruce Starlight (Language Commissioner of the Tsuut’ina Nation) and Christopher Cox (Carleton University) focused on producing high-quality audio recordings of existing Tsuut’ina language materials (e.g., written resources for which audio was not previously available), applying CRIM’s ASR tools to assist in the transcription process, and developing a selection of the resulting time-aligned transcripts into a collection that could be more readily shared within the community.

8 Or signed, though our current work does not touch on Indigenous sign languages.
9 See https://www.superlectures.com/asru2013/the-babel-program-and-low-resource-speech-technology

Revised versions of several of the stories recorded and annotated in this process are now being prepared for publication, accompanied by illustrations produced by graphic artist Emil Starlight. In another partnership, Michif speaker and Elder Verna DeMontigny (Prairies to Woodlands Language Revitalization Circle) and Olivia Sammons (Carleton University) recorded and transcribed speech in Michif, with an emphasis on lexical resources. As of March 2020, approximately 60 hours of audio recordings have been produced; they will be integrated into the online Michif dictionary discussed in Section 7.

Language     Target (hours)   Current (hours)   Planned completion
Inuktut      100              100               Done at end of March 2020
East Cree    100              100               Done at end of March 2020
Innu         25               3.4               Plan: end of May 2020
Denesuline   10               1.7               Plan: end of August 2020

Table 1: Amount of transcription done or planned for core CRIM research

4.1.2 Audio Segmentation

This theme yielded a set of tools to make the early stages of processing recorded speech easier, prior to transcription. These were packaged as Web services on CRIM’s VESTA platform,10 and are available for researchers and communities directly on the platform or through an ELAN extension.11 The tools include:

DNN-VAD: Deep neural net (DNN) voice activity detection; extracts segments containing speech, as opposed to silence, noise, music, etc. (a simplified sketch of this task appears after this list);

Diarization: Creates annotations that identify who spoke when in a recording;

Language Retrieval: Finds segments which are spoken in a particular language (among 32 possible languages, including Cree and Inuktut);

Speaker Retrieval: Finds segments spoken by a particular speaker, given a short sample of the speaker’s voice;

Multichannel Voice Activity Detection: Detects segments containing speech separately for each track in a multichannel recording with multiple microphones;

Language Independent Text-to-Audio Alignment: Works with any grapheme-to-IPA phoneme table.
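To make the voice activity detection task above concrete, here is a deliberately naive energy-threshold sketch of our own. CRIM's DNN-VAD service is a trained neural model and is far more robust than this heuristic; nothing below is CRIM's code.

    # Naive energy-threshold voice activity detection (illustration only;
    # CRIM's DNN-VAD is a trained neural model, not this heuristic).
    import numpy as np

    def simple_vad(samples, rate, frame_ms=30, thresh=0.02):
        """Return (start_sec, end_sec) spans whose frame RMS energy exceeds thresh."""
        frame = int(rate * frame_ms / 1000)
        n = len(samples) // frame
        rms = np.sqrt((samples[: n * frame].reshape(n, frame) ** 2).mean(axis=1))
        spans, start = [], None
        for i, voiced in enumerate(rms > thresh):
            if voiced and start is None:
                start = i
            elif not voiced and start is not None:
                spans.append((start * frame / rate, i * frame / rate))
                start = None
        if start is not None:
            spans.append((start * frame / rate, n * frame / rate))
        return spans

    # Half a second of silence followed by one second of a 440 Hz tone:
    rate = 16000
    tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(rate) / rate)
    audio = np.concatenate([np.zeros(rate // 2), tone])
    print(simple_vad(audio, rate))  # approximately [(0.48, 1.5)]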

Until recently, teams developing speech resources for Indigenous languages had access to a relatively small number of ASR-related services when using popular multimedia annotation tools such as ELAN. The addition of the ASR services developed by CRIM not only expands this list significantly, but also makes recent advancements in speech technology (e.g., deep neural network-based techniques, which are found in a number of CRIM’s ASR tools) much more accessible to current documentation development projects.

10 http://vesta.crim.ca
11 A widely-used annotation tool available at https://archive.mpi.nl/tla/elan/; see (Wittenburg et al., 2006) for an overview.

In the CRIM partnerships mentioned above, teams noted marked improvements in the efficiency of their annotation workflows when using these services. In many cases, applying the DNN-based voice activity detection services to automatically segment ‘raw’ recordings into utterance-level units significantly reduced or even eliminated the need for a separate stage of manual segmentation that would otherwise be required before transcription could be undertaken effectively, leaving linguists more time to focus on the actual contents of these recordings. As well, a speech-to-text service able to recognize English, French, and Spanish utterances, coupled with the language retrieval service, has proven valuable in providing first-pass automatic transcriptions of annotations that have been flagged as being in a non-Indigenous language (e.g., translations of Indigenous-language utterances that are offered in English). While these services are still relatively new, they have received a warm welcome from the language documentation research community, and are starting to be deployed in the field.

4.1.3 Automatic Speech Recognition (ASR)

This theme was inspired by the possibility of audio indexation for the large backlog of Indigenous speech recordings that have never been transcribed, to be carried out through ASR on the audio files. Even error-prone ASR on a speech file could make it possible to search through the recordings for keywords and phrases of interest. Keyword search looks for phoneme sequences (as opposed to word sequences), with some leeway for phoneme errors. If the ASR error rate is high, one will miss some of the segments one is looking for and bring up some words or phrases one was not looking for, but this is much better than having no way at all of searching through hundreds or thousands of hours of speech. To train an ASR system, one requires transcribed speech and, ideally, additional monolingual text data to train the so-called language model (LM).

CRIM focused on building ASR systems for Inuktut and East Cree. This work highlighted major differences between polysynthetic languages and those more commonly studied in ASR research. Inuktut, as was mentioned earlier, is a highly polysynthetic language, for which a word dictionary was ineffective in modeling the language. Even with a dictionary containing 300,000 words (from 6 million words of the Nunavut Hansard corpus), 60% of the words in an unseen story text were out of vocabulary (OOV) (Gupta and Boulianne, 2020a). So CRIM resorted to sub-word units in order to reduce the out-of-vocabulary rate. They tried both syllables and morphemes as sub-word units and found that syllables gave the lowest word error rate (WER). The WER for the reconstituted words obtained from different sub-word units is shown in Table 2. For both morphemes and syllables, word boundaries can be marked by special syllables (and morphemes) with B_ and _E markers; CRIM also experimented with boundaries chosen by deep neural nets (DNNs). The syllable or morpheme LM then predicts the start or end of words through these markers, so that syllable (or morpheme) sequences can be converted back to word sequences. These results are obtained with acoustic models trained on 40 hours of transcribed Inuktut recordings. When training data is increased from 40 hours to 80 hours, the speaker independent (SI) WER goes down from 74.3% to 72.3%.

Though East Cree is also polysynthetic, in this case a word-based LM with a 30,000 word dictionary obtains a much lower OOV rate (Gupta and Boulianne, 2020b). Training a word-based LM for Cree from much less text data than for Inuktut (260,000 words of text from reports from Cree organisations and scripture texts) still yields only a 25% OOV rate on video stories and a 9% OOV rate on scriptures. Video stories are very different from the LM text, and so they have a higher OOV rate than the scriptures development text. Decoding Cree speech using this LM yields a 69.0% WER on video stories (speaker independent (SI) WER) and 24.6% on scriptures (speaker dependent (SD) WER). Note that SI WER is lower for Cree than for Inuktut, even when Inuktut benefits from twice as much acoustic training data (80 hours instead of 40 hours) and far more LM training data: 69.0% WER for Cree versus 72.3% for Inuktut.

Units                        N. of units   OOV rate   WER
Words                        129 k         62.6%      108.7%
Unsupervised morphemes       35.1 k        0.8%       80.7%
Semisupervised morphemes     23.2 k        0.4%       79.4%
Syllables + B_, E_           3.2 k         0.1%       74.3%
Syllables + DNN boundaries   3.2 k         0.1%       75.6%

Table 2: Summary of weighted OOV rate and SI WER on the Inuktut development set, for various sub-word units. The acoustic model is trained on 40 hours of audio.

These experiments are outlined in the publications for East Cree and Inuktut at the LREC 2020 conference and workshops (Gupta and Boulianne, 2020a; Gupta and Boulianne, 2020b). An important detail: training texts written in Inuktut syllabics allowed easy creation of a syllable dictionary, but because the Cree training data were transcribed with Roman characters, these data could be syllabified only after training a relatively complex syllabification model, using other texts in syllabics for supervision. The syllable unit approach is difficult to apply to non-syllabic writing. Byte pair encoding (BPE) of subword units is a more general alternative that can be trained without supervision, for any language. However, for Inuktut, the speaker dependent phoneme error rate (PER) increases to 26.4% when using BPE units, compared to the 18.4% PER obtained with syllables.

The SD ASR result for East Cree above is noteworthy: with a few hours of transcribed audio from a single speaker, the WER on new data from the same speaker was 24.6%. The phoneme error rate (PER) in this SD condition was 8.7%. This is good news: it is well below the 30% PER considered good enough to significantly speed up the manual transcription process (if transcription is done by a non-native speaker, as is often the case in field linguistics) (Adams et al., 2018). It took only 3 hours of transcribed audio to achieve this result. A similar SD experiment was run with Inuktut, with 3 hours of training from one speaker and the same syllable LM as for the SI case. This resulted in 67.3% WER and 18.4% PER. So even in the SD case, Inuktut has higher WER and PER than East Cree. Experiments on other Indigenous polysynthetic languages are needed to see where they fall along the East Cree to Inuktut spectrum, in terms of SI and SD ASR difficulty.

ASR research generally does not focus on SD systems, but field linguists often record many hours of speech from each of a very small number of fluent speakers: the Elders of an Indigenous community. If it turns out that good SD ASR is possible for several Indigenous languages (as with East Cree in the above experiments), one could imagine a common mode of work in which an SD system is trained on the first few hours of speech from an Elder, then used to produce a first-draft transcription of the remaining hours. This could be a partial solution to the transcription bottleneck. In the next phase of the project, the CRIM team will look more deeply into this possibility, and also build on work done during the first phase to research audio indexation.
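The OOV computation behind the comparisons above is straightforward; the sketch below shows the word-level version on toy data (the actual experiments used the corpora and subword units described above).

    # Word-level OOV rate: the fraction of running tokens in held-out text
    # that never occur in the training vocabulary. Toy data for illustration.
    def oov_rate(train_tokens, test_tokens):
        vocab = set(train_tokens)
        return sum(t not in vocab for t in test_tokens) / len(test_tokens)

    train = "umiaq umiarjuaq takujunga".split()
    test = "umiarjualiuqtiujumajunga takujunga".split()
    print(f"{oov_rate(train, test):.0%}")  # 50%: one of two test tokens is unseen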

4.2 Read-along audiobooks

An ongoing concern of the project has been “What technologies are feasible for the least-resourced languages?” While medium-scale resources for, say, machine translation or ASR can be collected for a few of the 70-some Indigenous languages spoken in Canada, the great majority are unlikely to pass that threshold in the near future. Littell et al. (2018a) surveyed a variety of text and speech technologies with an eye towards identifying those that could feasibly be rolled out to any Indigenous language in the near term.

High on this list were automatically-aligned text/speech “read-along/sing-along” activities, inspired by the East Cree online activities at eastcree.org. Interactive read-along/sing-along activities, which highlight words as they are spoken and allow students to click on words to hear them aloud, are well-liked by both students and teachers, but they are highly labour-intensive to make and require the maker to be skilled in specialist software like ELAN or Audacity. A collaboration between NRC, Carleton University, and David Huggins-Daines (Nuance Communications Montreal) seeks to make the creation of such activities quick and easy, without requiring the creator to manually align each word.

Fortunately, text/audio alignment (often called “forced alignment”) is feasible to perform in a “zero-shot” scenario (that is, where there is no data available in the language in question). This can be done by constructing an approximate mapping between target-language phonemes and phonemes in a high-resource “donor” language like English, converting the target document into the donor language’s phones, and using an acoustic model trained on the donor language to recognize when each word is spoken. This is widely done on an ad-hoc basis (it is, for example, the assumed default when working with a new language in the Festival toolkit (Black et al., 1998)), but we are seeking to simplify and generalize this process to the point where a non-specialist could use it. Our ReadAlong Studio12 combines a custom G2P engine, the PanPhon (Mortensen et al., 2016) phonetic distance library (for automatically determining cross-linguistic phoneme mappings), and the lightweight PocketSphinx (Huggins-Daines et al., 2006) speech recognition library to allow a user to easily create a text/speech alignment system for a new language without being a specialist in speech technology.

ReadAlong Studio currently supports about 22 languages; among Indigenous languages spoken in Canada it supports Anishinaabemowin (Ojibway), Dakelh, Gitxsan, Heiltsuk, Inuktut, Kanyen’kéha, Kwak’wala, SENĆOŦEN, Seneca, Tagish, Tŝilhqot’in, and Tsuut’ina. Adding a new language is the work of only a few hours, depending on the complexity of the language’s orthography. For best results, someone with knowledge of the International Phonetic Alphabet (e.g., a linguist) is necessary, but for relatively transparent orthographies an automatic fallback using the text-unidecode Python library can suffice.

ReadAlong Studio can already create high-quality interactive webpages (see Figure 7), MP4 movies, and EPUB documents, and can also export to ELAN, TextGrid (PRAAT), and subtitle formats. A user-friendly interface, with the goal of making a read-along as easy to create as a PowerPoint presentation, is currently in development.
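To illustrate the phoneme-mapping step, here is a small sketch of our own that uses PanPhon's feature distance to pick, for each target-language phoneme, the closest phoneme in a toy donor inventory. The inventories are placeholders, and ReadAlong Studio's actual G2P and mapping machinery is more involved.

    # Sketch of cross-linguistic phoneme mapping with PanPhon feature
    # distances (requires `pip install panphon`). The donor inventory is a
    # toy placeholder; ReadAlong Studio's real mappings are built differently.
    import panphon.distance

    dist = panphon.distance.Distance()
    DONOR = ["p", "t", "k", "s", "h", "m", "n", "l", "w", "j",
             "a", "e", "i", "o", "u"]

    def closest_donor_phone(phone: str) -> str:
        # Pick the donor phone with the smallest articulatory-feature distance.
        return min(DONOR, key=lambda d: dist.feature_edit_distance(phone, d))

    for p in ["q", "ʔ", "ɬ", "ə"]:
        print(p, "->", closest_donor_phone(p))  # e.g. q maps to a nearby stop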

4.3 Limited-domain Text-to-Speech (TTS)

Having discussed speech-to-text technologies, we now consider text-to-speech. One of the main suggestions in user evaluations of Kawennón:nis (see Section 3.1.2) was to provide audio for the conjugated verb forms. However, this is unfortunately not a straightforward task, as the Kawennón:nis language model (LM) has hundreds of thousands of unique output forms, which would be entirely infeasible to record by hand. Without any pre-existing data, the first task was to figure out the smallest number of recordings required in order to generate the rest.

Typically, text-to-speech (TTS) requires a lot of data. In many use cases, TTS can compromise on the naturalness of the speech so long as it is intelligible (e.g., for speech-generating devices for augmentative and alternative communication). It is also a common requirement that speech synthesizers function in a general or open-domain setting. Our specific use case with Kawennón:nis is slightly different. The domain we need to synthesize is limited to the LM, which is made up entirely of single prosodic words (however, due to the polysynthetic nature of Kanyen’kéha, these are often quite meaningful words).

12 https://github.com/ReadAlongs/Studio

Figure 7: The ReadAlong Web Component displaying Atikamekw story ‘Nikikw’

Additionally, the main goal is not mere intelligibility but pedagogy, so we have to be fairly uncompromising when it comes to the naturalness of the speech.

Following Black and Lenzo (2001), we implemented a database selection algorithm that analyzes a phonemic representation of the total output of the LM for Kawennón:nis and runs a greedy search to select the fewest candidates required to cover every unique phoneme trigram in the data. Converting the graphemes of the LM to phonemes for analysis was luckily straightforward, as Kanyen’kéha’s orthography is extremely faithful to the surface representation and also marks important prosodic features like stress. Given the relatively small phonemic inventory of Kanyen’kéha, this process yields a very high ratio of possible generated forms to required recordings. We took a sample from the Kawennón:nis LM with 122,966 conjugations and found that we could record at least one sample of each unique phoneme trigram by recording 852 of the forms.

We have since built an evaluation tool to randomly sample the data and ask a speaker-evaluator to rate each sample’s quality in terms of both intelligibility and naturalness. Unfortunately, while some samples score nearly perfectly, others are not at all intelligible. Most of these issues can be resolved either by manually fixing errors in the forced alignment or by fixing incorrectly labelled recordings (where the phonemic representation is transcribed incorrectly). This process, however, is time-consuming, and we have opted to put this part of the project on hold until either the method or the tools can be improved. Using the same method with less ambitious datasets has proven very effective, however, and we have successfully built synthetic ‘talking clocks’ in Kanyen’kéha, using 24 recorded sentences to generate 1440 times of day. The unit of concatenation in this instance is the word, not the phoneme or diphone, and it seems to be less error-prone.
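The greedy selection step can be sketched compactly. Below, each candidate form is a phoneme sequence, and we repeatedly pick the form covering the most still-uncovered trigrams. The forms shown are toy placeholders, not real Kawennón:nis output, and this is our own sketch of the idea in Black and Lenzo (2001), not their code.

    # Greedy set-cover selection in the spirit of Black and Lenzo (2001):
    # pick a small set of forms whose phoneme trigrams cover all trigrams in
    # the corpus. The phoneme strings below are toy placeholders.
    def trigrams(phones):
        return {tuple(phones[i:i + 3]) for i in range(len(phones) - 2)}

    def greedy_cover(forms):
        uncovered = set().union(*(trigrams(p) for p in forms.values()))
        chosen = []
        while uncovered:
            # Pick the form covering the most still-uncovered trigrams.
            best = max(forms, key=lambda f: len(trigrams(forms[f]) & uncovered))
            gained = trigrams(forms[best]) & uncovered
            if not gained:
                break
            chosen.append(best)
            uncovered -= gained
        return chosen

    forms = {
        "form1": list("katkahthos"),
        "form2": list("senkahthos"),
        "form3": list("katkahtho"),   # its trigrams are all inside form1's
    }
    print(greedy_cover(forms))  # ['form1', 'form2'] -- form3 adds nothing new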

5 Online courses and games

Several external subprojects funded by ILT involved online courses and games designed to help people acquire or enhance Indigenous language skills. As will be seen below, efforts of this nature sometimes run into what might be called the “game development trap”: to appeal to learners (especially younger ones), these courses are often highly interactive and incorporate complex graphics. To enable this, they are built on top of software that will quickly become outdated. Small communities with limited funding may come to regret the need to constantly update their language learning software to ensure that it continues to work.

5.1 Learning Platforms at Carleton University

Prof. Marie-Odile Junker and her Carleton University team have been collaborating for years with Indigenous partners to develop interactive online language lessons that support literacy in two languages, East Cree and Innu (not to be confused with the unrelated language of the Inuit, Inuktut). See https://www.marieodilejunker.ca/publications-by-topic/ for a list of the team’s publications on this and related topics. The web-based platform underlying these lessons allows for the creation of multimedia interactive online lessons with auto-generated and author-configured exercises and games. Users are able to listen to a word or phrase in several dialects. They can play computer-generated interactive activities that test and enhance their vocabulary, orthography, and grammar acquisition, and also engage in more advanced grammatical and textual activities. Teachers can go online to develop customized lesson plans and track students’ progress. Language experts can access an administrative interface to develop new content.

In this case, the NRC had the privilege of contributing funding to assist valuable technologies developed long before the ILT project existed. When ILT funding first began to flow to Carleton in the summer of 2018, the rapid pace of change in the software industry had partially stranded these educational language tools: many of their functionalities no longer worked as intended.

In the first phase of this subproject (up to March 2019), the platform was updated to align with current technology. Specifically, loading time and overall server load were reduced, the database was optimized, the UI was modernized (this included simplifying navigation and coping with different screen sizes), dependence on Flash and other outdated technologies was removed, third-party authentication was integrated, and system administration was simplified.

In the second phase (April 2019–March 2020), the platform—which was originally developed with first-language (L1) speakers in mind—was expanded to help second-language (L2) learners. This expansion involved further software improvements, such as an improved ability to import data from the Carleton group’s other linguistic resources (dictionaries and databases of inflected verbs) and modification of the Cree modules to accommodate both syllabics and Roman orthography (they had previously been syllabics-only). Above all, this second phase involved the incorporation of new pedagogical material, such as new audio recordings (both scripted and spontaneous) and the creation of new L2 activities for both languages (e.g., conversations, narratives, structural exercises, verb conjugation exercises, and learning of thematic vocabulary).

Throughout both phases of this subproject, two language experts—one for East Cree, one for Innu—made content corrections and enhancements. User testing was carried out in collaboration with partner organizations (Cree School Board, Institut Tshakapesh). The East Cree and Innu modules have been used in courses at the Université de Montréal, at the Native Montreal organization, and in teacher training courses in Indigenous communities. The enhanced bilingual (English and French) platform with its new L2 content is now live at the following URLs: East Cree: https://lessons.eastcree.org/; Innu: https://lessons.innu-aimun.ca/.

5.2 7000 Languages

Several organizations provide online courses for many different world languages. Could one of these organizations apply its expertise to develop courses for Indigenous languages in Canada? There was a stumbling block: several companies that create good online language courses charge large amounts for doing so, or are interested only in languages spoken by large numbers of people. We decided to provide modest amounts of funding to 7000 Languages (www.7000.org), a non-profit that creates free language-learning software in partnership with Indigenous communities around the world, to create courses with interested communities (who would also be funded).

Four communities ended up working with 7000 Languages on courses for their languages: Kwak’wala, Michif, Naskapi, and Mi’kmaq. In discussions with 7000 Languages, each community worked out a slightly different plan for data collection (especially of Elder speech recordings and of key vocabulary items) and for incorporation of these materials, along with grammar instruction, into the lessons. The Kwak’wala and Michif courses are almost ready to go online. The Naskapi course requires a few more months to be completed, and work on the Mi’kmaq course has been suspended indefinitely due to unforeseen circumstances. The ILT project was also able to support the Mi’kmaq language in a minor way by providing a different organization, the Mi’kmaq Language Lab at Cape Breton University, with funding to buy videoconferencing equipment that will enable the lab to reach many more learners.

5.3 Computer-assisted Language Learning at the University of Alberta

The University of Alberta subproject has focused on supporting the Y-dialect of Cree (Plains Cree). The U. Alberta team has been creating an adaptive computer-assisted language learning (CALL) system to support learners’ ability to recognize and understand oral language. The system will use a combination of games, stories, and other audio materials to teach listening skills. A Cree-language instructor at the university has been developing this system in consultation with members of nearby Cree-speaking communities.

Despite delays caused by contract and hiring difficulties, the project has recently been making good progress. It has been developing Cree-language content: 13 personal stories from 8 Elders have been recorded and are currently being transcribed. Several key educational activities have been implemented in software: two phonemic awareness activities (a game that helps learners of Cree map spoken phonemes onto characters, and a “shadowing” activity in which the learner reads along with a fluent speaker); two sets of morphological awareness activities (the first aimed at enhancing recognition of Cree verb forms, the second asking learners to conjugate Cree verbs themselves by assembling morphemes in the correct order); a vocabulary knowledge activity in which learners build and discuss their family trees; and a local history and culture augmented-reality “walking tour”. These diverse software modules are currently being integrated into the overall system; user testing will follow. Until the COVID-19 outbreak intervened, the system was on track to go live at the end of April (it was to have been shown at the International Conference on Artificial Intelligence in Education in early July).

5.4 FirstVoices

The First Peoples’ Cultural Council (FPCC) of British Columbia has a remarkable record of providing state-of-the-art technologies, training, and technical support to Indigenous language champions (mainly but not exclusively within British Columbia) through its FirstVoices program.13 The Language Tutor was developed as part of FirstVoices; this software allowed communities to build intuitive language lessons that mimic the way a child learns a language. Users were able to listen to a word or phrase, record themselves speaking, and then compare the result with a recording of a fluent speaker. They could also match images, video, and audio clips from the FirstVoices library with words and phrases. Unfortunately, as with the original version of the Carleton University lessons described above, changes in the software industry rendered key functionalities of Language Tutor inoperable.

FPCC decided to embark on a radical rethinking of their whole approach. With ILT funding, they conducted two focus groups to assess what features communities wanted in an online language learning tool, and commissioned a study of the technical literature on online learning of Indigenous languages. Based on the findings of these exercises, they produced a detailed plan for new language learning functionalities.

FPCC has already implemented two useful functionalities: an “immersion portal” that allows communities to input text and audio translations of all user interface controls into FirstVoices.com, thus ensuring learners are fully immersed in the language being studied, and a “flashcard view” that converts any list (e.g., of words) into a printable flashcard format for self-study. Other functionalities are currently being implemented (with funding from sources other than NRC).

13 http://www.fpcc.ca/about-us/

5.5 On the Path of the Elders

“On the Path of the Elders” is a role-playing educational game designed to acquaint players with historical and cultural topics related to Treaty No. 9, signed in 1905 by the Government of Canada and several Indigenous communities in the James Bay region. When this free online game was released in 2007, it was widely praised for its innovative use of historical resources. Though the primary purpose of the game is historical and cultural education, the site has substantial linguistic resources—for instance, audio-enhanced syllabics charts, textbooks, and, most important, recordings of almost sixty oral stories recounted by Elders and split among four different dialects of Cree: Swampy Cree, Swampy Cree with N dialect, Moose Cree, and Kashechewan Cree.

In the years since the original launch, technology has changed and users have offered suggestions for improvement. The ILT project provided funds to the education branch of the Mushkegowuk Council to have the game converted from Flash to HTML5 format, to make it accessible on mobile devices, to add new content, and to make other improvements. While some of these improvements (especially to the teacher’s guide part of the site) still need to be made, the website is now fully mobile-friendly and compatible with the Chrome, Firefox, and Internet Explorer browsers. It can be viewed at www.pathoftheelders.com.

6 Capacity-building: Yukon and Indigitization Subprojects

Two subprojects funded by NRC focused on training and support for people who document Indigenous languages. In an initiative run by the Yukon Native Language Centre (YNLC) that began in October 2019, 12 speakers of 8 languages (Gwich’in, Hän, Kaska, Northern Tutchone, Southern Tutchone, Tagish, Tlingit, and Upper Tanana) from 14 communities attended three workshops in Whitehorse. Each was five days long, with training starting at 9 am and finishing at 4:30 pm. In these hands-on YNLC workshops, after being issued the appropriate equipment, the trainees learned how to make high-quality video and audio recordings of their languages, and how to transcribe, annotate, and translate the speech in the recordings.

In workshop 1, held in October, trainees practiced using the video equipment and carrying out interviews with fluent speakers. They were also trained in file management. In workshop 2, held in November, attendees were trained to use the ELAN and SayMore software suites for transcribing, annotating, and translating speech recordings. Trainees worked with fluent speakers of their languages to practice these skills. A professional video editor from Toronto who happened to be in Whitehorse kindly volunteered his time to teach editing skills. Workshop 3, held in February, began with review and practice of the skills learned earlier, then focused on sharing and repurposing videos (on DVD, on websites, through social media). On the last day, trainees brainstormed on how best to use ELAN-based materials as a resource for language learning and teaching. In the afternoon there was a discussion of the new skills attendees had acquired, and a showing of finished videos.

YNLC’s original plan had been for this work to culminate in an Open House on National Aboriginal Language Day (March 31, 2020), where selected videos from the project would have been showcased to the public and to the Grand Chief and other dignitaries. Each trainee would have had the opportunity to show about 10 minutes (approximately two videos) of the videos they had made. The trainees each produced 10 to 20 short videos, every video at least 3 to 5 minutes long and in the trainee’s own language. Unfortunately, because of the outbreak of COVID-19, the Open House has been postponed to the second phase of the ILT project. This postponement does not diminish the overall accomplishments of this ILT subproject: in the course of 10 months, 12 Indigenous trainees created 548 minutes of documentation and mastered complex skills that will enable them to record Elders in their communities on a continuing basis. It is likely that the YNLC workshops will end up having a strongly positive impact on most languages in the Yukon.

The Indigitization subproject, a partnership between the University of British Columbia (UBC), the Musqueam Archives, and the Heiltsuk Cultural Education Centre, focuses on the digitization of audio and video language data in non-digital formats. Thus, while the YNLC subproject trains members of Indigenous communities to collect new data, Indigitization involves training that will render old data in obsolete formats far more accessible. For instance, some communities possess many potentially valuable hours of speech by fluent speakers recorded years ago on cassette tapes. While digitization can be performed by outside organizations, Indigitization will make Indigenous communities across the country autonomous by developing resources and offering workshops that train their members in the necessary skills. For communities that already have digitized content, or that move quickly past the digitization stage, the project will also offer training in the next steps of mastering their own data: transcription of speech into text in the relevant Indigenous language and, where desired, translation of that text into another language (e.g., English or French). Manuals and training resources for digitization and transcription/translation methodology are being developed. Training workshops in several different parts of the country are part of the plan. This subproject began in January 2020 and has been hit hard by the COVID-19 crisis. Several key aspects, such as the regional workshops, have been postponed until later in 2020.

7 Single-language subprojects focused on data collection

This section provides brief descriptions of data collection efforts conducted within Indigenous communities. These efforts were carried out by members of those communities and funded by ILT.

• Cree: Blue Quills, known as Nuhelot’įne thaiyots’į nistameyimâkanaks in Dene and Cree, was the first university in Canada to be fully owned and operated by Indigenous people. The subproject at Blue Quills focused on digitizing and indexing the largest known collection of Cree text in syllabics: approximately 40,000 pages from monthly newsletters called kihcitwāw miteh (Sacred Heart) produced by the Catholic church between 1906 and 1978. The resulting corpus will cover local and international news, legends, jokes, and so on—a wealth of diverse material that will be of enormous value to students and teachers of Cree, and to scholars. This subproject has a significant training component: students have been carrying out much of the work.
Fortunately, the pandemic has caused only a slight delay. At the time of writing, cataloguing and indexation of the corpus are complete; the corpus is now being formatted, and a historical introduction is being written. It is likely to be available on the Blue Quills website in late 2020.

• Kanyen’kéha (Mohawk): This effort was made up of two subprojects: the first took place over the summer of 2019; the second started in fall 2019. Both were led by the Tsi Tyónnheht Onkwawén:na Language and Cultural Centre (TTO) on the Tyendinaga Mohawk Territory, near Kingston, Ontario; a key figure in both was Nathan Thanyehténhas Brinklow, a language teacher at Queen’s University. The long-term goal inspiring them was the creation of a text and audio corpus to support research on automatic speech recognition (ASR). The medium-term goal is to turn hundreds of hours of unannotated and untranscribed Mohawk audio into a useful resource for language learning and research.

The summer 2019 subproject succeeded in collecting a total of 112,420 Kanyen’kéha written words and 26 hours of Kanyen’kéha speech, covering a mix of genres: scripts and audio from movies and TV shows, translated books of the Bible, and recordings of Elders. More Kanyen’kéha material has been collected since then, mostly text from the Internet, including a translation of the United Nations Declaration on the Rights of Indigenous Peoples, the report of the standardisation committee, blog entries, and an Ontario government document on COVID-19. Nancy Bonvillain, a linguist who worked with the Akwesáhsne Kanyen’kéha-speaking community in the 1970s, is contributing copies of text and audio resources in her collection to TTO. A “Donate Your Voice” website is in development: it is currently running as a test site, with Mohawk translations and text being added, but has not yet gone live. The pandemic has forced the suspension of in-person collection from community members. This subproject will be extended so that it can meet its original targets. As the “Donate Your Voice” effort is an online activity, social media outreach could enable this portion of the subproject to continue despite physical-distancing measures.

• Kwak’wala: Like the Kanyen’kéha subproject just described, this effort aims at an initial corpus collection effort followed by work on ASR. The Kwak’wala language (Wakashan) is spoken by 18 Kwakwaka’wakw Nations whose traditional territory is on northern Vancouver Island, nearby smaller islands, and the adjacent mainland. NRC ILT funding supported a partnership among the Gwa’sala-’Nakwaxda’xw Language Revitalization Program, the Sanyakola Foundation, and the University of British Columbia (UBC). The partnership includes two community-based teams from three different Kwakwaka’wakw communities, a technical team, and three university-based linguists (two of whom are Kwakwaka’wakw).

In assembling a corpus of machine-readable, time-aligned transcriptions of recorded Kwak’wala audio for the purpose of automating speech-to-text, the subproject prioritized capacity-building among community-based researchers. The project trained 20 community members from three North Island communities in skills such as audio recording, data management, ear training, Kwak’wala orthographies and transcription, and annotation in the ELAN transcription software. The team assessed available resources of analog, digitized, and born-digital audio data in Kwak’wala, created by community members and linguistic researchers and held in personal, local, museum, and provincial archives. The team then developed a data pipeline to track data at various stages of progress toward completion, from cassette recordings requiring digitization, to audio that is transcribed but not time-aligned, with machine-readable time-aligned ELAN transcriptions using IPA as the desired end product.

So far, the project has compiled 25 hours and 3 minutes of machine-readable Kwak’wala audio data, consisting of basic daily conversational speech, pedagogical materials, and Elders’ storytelling. The project identified over a hundred hours of audio recordings in various locations, at various stages of readiness for machine processing, and has submitted funding requests for community-based efforts to make these recordings more accessible through digitization and transcription. Community-based teams have also prioritized working with fluent Elders to create audio records of text resources to serve language reclamation goals. An additional technical challenge relates to language modelling: while there is a rare wealth of written textual data in the Kwak’wala language, both published and unpublished, there is a significant need to develop optical character recognition (OCR) for legacy orthographies which are not currently machine-readable. The corpus creation effort will continue for several more months.

In addition to contributing to the development of ASR tools to lessen the transcription bottleneck and support Indigenous language revitalization, a key outcome of the subproject will be to share these recordings with community members through a graded-access website using the locally preferred U’mista orthography.

• Michif: The Michif language is of great interest to linguists as a mixed language that incorporates complex elements of its “parent” languages (mainly Plains Cree and French). It is spoken by fewer than 200 individuals, primarily in small communities across western Canada and the northern United States. Intergenerational transmission of the language has ceased, and the majority of speakers are well over sixty years of age. Few print and digital resources for second-language acquisition are available to support Métis communities. Thus, Michif is both critically endangered and significantly under-resourced.

This subproject aims to enhance a particular resource, the Michif dictionary of Laverdure et al. (1983), which is now out of print and largely inaccessible to learners of Michif unless purchased at a high price. This inaccessibility is a significant barrier to Michif language revitalization. After an unanticipated delay, formal written permission was granted on January 17, 2020 by the dictionary’s copyright holder (Turtle Mountain Community College) to the Prairies to Woodlands Indigenous Language Revitalization Circle (P2WILRC) to create a digital version of the dictionary for online, offline, and mobile use. This digital version will retain the original content, and will allow for the inclusion of additional orthographies, grammatical information for lexical entries, and audio recordings of headwords and example sentences.

Meanwhile, the P2WILRC team has fully digitized the 350 pages of Laverdure et al. (1983) using OCR and is in the process of correcting OCR errors. Approximately 218 pages of the dictionary have been recorded in spoken form; this was accomplished through collaboration between the P2WILRC team and the two-person team of Elder Verna DeMontigny and Olivia Sammons mentioned in Section 4.1.1. This subproject will be extended into the second phase of the ILT project.

• Nsyilxcn: The Nsyilxcn language is part of the Interior Salish family. Traditionally spoken in the Okanagan Valley in present-day British Columbia, it is now considered critically endangered, with only a dozen highly fluent Elders remaining. This subproject began in December 2019 and was carried out by Syilx Language House (SLH), a community-based organization. Its goal was to carry out advanced-level Elder recordings to complement the existing Nsyilxcn curriculum. Syilx community members recorded and archived fluent Elders while being trained in language skills.

By the end of this subproject, seven hours of Elder stories in fluent Nsyilxcn had been recorded and archived. The stories are personal narratives, history, and traditional knowledge. They are transcribed in Nsyilxcn, with a glossary provided in English, at an intermediate level. These recordings are shared publicly on www.thelanguagehouse.ca, along with earlier recordings. Furthermore, during this period (Dec. 2019–March 2020) SLH continued to train fourteen learners, delivering sixty hours of Nsyilxcn via live-streamed immersion lessons. Six hours of these lessons can be viewed at http://www.thelanguagehouse.ca/nsyilcxn-1.html.

• SENĆOŦEN: SENĆOŦEN belongs to the Coast Salish language family and was traditionally spoken just north of present-day Victoria, British Columbia. It is the most severely endangered language the ILT project became involved with: there are only five fluent speakers left. On the other hand, the community is engaged in a vigorous language revitalization effort, led by the W̱SÁNEĆ School Board. From 1981 to 1991, the linguist Dr. Timothy Montler recorded Elsie Claxton, the last monolingual SENĆOŦEN speaker, telling anecdotes, historical narratives, and traditional tales of her people (Dr. Montler’s contributions to the study of this language include (Montler, 2018)). The NRC was able to assist the School Board’s efforts by paying for two Elders to work with Dr. Montler to transcribe the Claxton recordings in SENĆOŦEN and translate them into English. The Elders worked for a total of 560 hours in 2019 and 2020. This team—Dr. Montler and the two Elders—transcribed and translated all of the Claxton recordings. All this material, together with a list of all the words in the transcriptions, has been entered into a pre-existing database of SENĆOŦEN information. This new material will be an invaluable resource for the School Board’s teaching of the language.

• Tŝilhqot’in: The traditional territory of the Tŝilhqot’in Nation is in the southern interior of present-day British Columbia. The Tŝilhqot’in language is the southernmost member of the Dene (Athabaskan) family in British Columbia. Funding from the ILT project enabled the Tŝilhqot’in National Government (TNG) to build on an already impressive language revitalization effort. When the NRC-funded part of this effort began, TNG had already recorded 35,000 audio clips, developed an associated linguistic database, published a set of third-person paradigms in a “Digital Verb Book”, and created a website with a diverse set of learning tools: www.tsilhqotinlanguage.ca.

The ILT funding enabled TNG to record over a dozen hours of speech from around 20 speakers. This was somewhat fewer hours than originally targeted, because of the difficulty of interviewing fluent speakers unfamiliar with the process. On the other hand, because the technical workers in the community proved proficient at processing existing audio and text data, far more recorded speech than foreseen in the original plan was transcribed, aligned with its transcriptions, and digitized (20 hours of speech). The linguistic database is now much more representative of the broad range of Tŝilhqot’in dialects. Labelling of around 46,000 audio clips has been completed. Furthermore, a list of 26,200 English words and phrases to be translated has been drawn up; many of these have already been translated into Tŝilhqot’in. In the course of this work, four community members were trained in technical roles involving general IT, front-end development, audio processing, and business software. This provides a solid basis for the rest of this subproject, which will continue to be funded over the next few months.

8 Discussion

So far, the reception of much of the work described here by parts of the Natural Language Processing (NLP) research community has been lukewarm. With the field’s current focus on machine learning techniques, rule-based approaches like those deployed in many of the subprojects described in this paper tend to attract less attention than the application of novel deep neural architectures to tasks with larger data sets. New is not always better. We have endeavoured to choose the approaches most appropriate to the data we have available, attempting to mitigate some of the pitfalls of old-fashioned rule-based techniques through our community collaborations. That being said, we have sometimes been able to import models trained on high-resource language data to tasks involving low-resource languages, as when acoustic models trained on English data turned out to be suitable for aligning speech in several Indigenous languages with the corresponding text to create read-along books.

By contrast, field linguists have been enthusiastic about the audio segmentation tools released by CRIM as part of the ELAN suite of recording and transcription tools (Cox et al., 2019). By incorporating state-of-the-art ASR techniques into the early stages of speech processing, these tools offer a substantial reduction in the time linguists must spend on the most tedious part of their task. Though the CRIM tools were developed for use with Indigenous languages in Canada, there is little that is language-specific about them: they are likely to be useful for field linguists on every continent.

Indigenous educators have welcomed the read-along capability for audio books, which we have already rolled out for several books in many different languages (as described above). However, making educators dependent on NRC experts to create new read-along books is antithetical to the goals of the ILT project, which seeks to empower communities to meet their own linguistic needs. In the next phase of the project, making the read-along software easier to use will be one of the top priorities.

The same is true for text prediction on mobile devices, a capability sought after by speakers of several Indigenous languages (especially younger speakers). In this case, however, we have implemented it for only one language, SENĆOŦEN. That language community is very happy with the result, but one language is not nearly enough. We have open-sourced the software for creating text prediction capabilities for a new language, but work is needed to make it much more user-friendly.

The trickiest case is WordWeaver, because its Kanyen’kéha offspring—Kawennón:nis—has created so much enthusiasm, not merely among teachers of Kanyen’kéha but also among teachers of other, unrelated polysynthetic languages, such as Inuktut and Cree. Teachers of several Indigenous languages in Canada have told us they would like to build similar verb conjugators for their languages. The problem is that Kawennón:nis is a labour-intensive software creation, the result of several person-years of hard work by two members of the NRC team and several instructors at Onkwawenna Kentyohkwa (Western dialect) and Kahnawà:ke (Eastern dialect). The NRC does not have the human resources to repeat its part of that effort for over sixty languages, yet specialised expertise in building FSTs (which is rare even among computational linguists) was an essential component of the effort.
Part of the solution may involve producing less ambitious verb conjugators for other languages, covering fewer stems and forms; this might suffice for some pedagogical purposes. Members of the NRC team have been doing this for Michif, in collaboration with two experts (Heather Souter and Olivia Sammons). Even this approach—building Model T Ford conjugators rather than Rolls-Royce versions—is unsatisfactory, since programming even small FSTs is currently unintuitive. We would like communities to be able to build verb conjugators for themselves, instead of having to rely on outside experts. We have begun to explore a path to that goal, as described in the next section.

Three research outcomes of the project have not yet had a practical impact on Indigenous language revitalization. The first is the new tools for Inuktut, including an improved version of the previously available aid for translators, WeBInuk; these apps are due to be rolled out in the coming months. The second is the creation of an up-to-date Inuktut–English version of the Nunavut Hansard corpus, consisting of 1.3 million sentence pairs, and the associated work on machine translation for this language pair (including work by other researchers that will be stimulated by the WMT 2020 shared task for this language pair). The third is CRIM’s work on speech recognition for six Indigenous languages. We believe, however, that all three research directions will soon benefit language revitalization.

The long-term prospects for Indigenous language revitalization in Canada are excellent. All over the country, Indigenous communities are taking their linguistic future into their own hands by collecting speech and text data and enrolling record numbers of students in language courses. Even some of the poorest, most remote communities have activists working to help the ancestral language regain its rightful place in the community. We hope this project will have placed some useful software tools in the hands of these activists and communities.

9 Future Work

NRC has decided to fund a second phase of the project, though it is likely that funding for externally managed projects will be more limited than in the first phase.

9.1 Visual Programming for Building Rule-based Grammars

Since many Indigenous language technologies depend on a morphological FST or other rule-based system (Arppe et al., 2016), their development is in practice limited to a small number of specialist linguist-programmers. Some of our community collaborators have raised concerns about this: if only a linguist-programmer, typically not a member of the community, can practically write and understand the FST source, then updating the system depends on that person’s availability. This is a barrier to full Indigenous sovereignty over language technology, and broadening the accessibility of rule-based technology development is therefore one of our priorities for the future development of Kawennón:nis and similar systems.

Following up on a line of research into making linguistic program development more rapid and intuitive (Hutton and Meijer, 1988; Frost and Launchbury, 1989; Littell et al., 2018c; Littell et al., 2018b), and inspired by visual and tabular programming environments like Blockly (Fraser, 2015; Pasternak et al., 2017), LAPPS/Galaxy (Ide et al., 2016), and Tabular (Gordon et al., 2014), we are currently experimenting with visual linguistic programming intended for language teachers, students, and linguists. Ideally, these interfaces could make it easier for Indigenous language experts to build verb conjugators and other tools for their own languages, without prior experience in technology development and with comparatively little assistance from outsiders.
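As a hint of what such an environment might produce behind the scenes, the toy sketch below encodes a conjugation pattern as plain data tables, the kind of tabular structure a visual editor could expose directly to a language teacher, rather than as hand-written FST source compiled with tools like foma (Hulden, 2009). All morphemes here are invented placeholders, not a description of any real language.

```python
# Toy table-driven conjugator of the kind a visual programming
# environment might generate. All morphemes are invented placeholders;
# a production system would compile such tables into an FST
# (Beesley and Karttunen, 2003; Hulden, 2009).
PRONOMINAL_PREFIXES = {  # (person, number) -> prefix
    ("1", "sg"): "ke-",
    ("2", "sg"): "se-",
    ("3", "sg"): "ra-",
}
VERB_STEMS = {"see": "-kahtho-", "speak": "-wenn-"}
ASPECT_SUFFIXES = {"habitual": "-s", "punctual": "-en"}

def conjugate(person, number, stem, aspect):
    """Assemble prefix + stem + suffix, dropping the joining hyphens."""
    parts = [
        PRONOMINAL_PREFIXES[(person, number)],
        VERB_STEMS[stem],
        ASPECT_SUFFIXES[aspect],
    ]
    return "".join(part.strip("-") for part in parts)

print(conjugate("1", "sg", "see", "habitual"))  # -> kekahthos
```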

9.2 Other Priorities

• We plan to continue the NRC–Pirurvik Centre collaboration, which will soon result in the release of a new version of WeBInuk and other tools for Inuktut learners, writers, and translators. In addition, we hope to create a new bilingual Inuktut–English corpus that will extend the parallel resources available to researchers beyond the limited-genre Nunavut Hansard.

• Where appropriate, we will continue to roll out read-along audio books, text prediction for specific languages, and limited-domain TTS.

• However, in accordance with the empowerment goal of the project, our focus will be on making the technologies listed in the previous bullet point easier for communities to deploy themselves, without outside expertise. We hope this will have a multiplier effect.

• CRIM will continue to publicize its open-source tools for audio segmentation among field linguists and Indigenous activists involved in recording speech. The CRIM team will also explore the application of speaker-dependent speech recognition to widening the transcription bottleneck, and will work on audio indexation.

• The NRC team will emphasize the transfer of expertise to Indigenous communities. For instance, we plan to hire interns from those communities.

Acknowledgements

We would like to acknowledge the wise counsel of the hard-working volunteers on the ILT project’s Indigenous Advisory Committee (IAC): Nathan Thanyehténhas Brinklow, Tessa Erickson, Amanda Evic-Kuluguqtuq, Blaire Gould, Gerry Lawson, Delaney Lothian, Megan Lukaniec, Onowa McIvor, Glenn Karonhiio Morrison, Marilyn Shirt, Tina Jules Skayda.û, and Heather Souter. Furthermore, we would like to thank CBC/Radio-Canada for providing the East Cree broadcast recordings described in Section 4.1. We are also grateful to several other people who are neither authors of this report nor members of the IAC but helped the project in other ways: these include Stéphane Cloutier, Ryan DeCaire, Krista Dempster, Sarah Dupont, Carrie Demmans Epp, Riel Gallant, Lucy Hemphill, Bill Jancewicz, Michele Johnson, Alexa Little, Jodi Lynn Maracle, Jeffrey Micher, Yvette Mollen, Timothy Montler, Mimie Neacappo, Christina Neilsen, Gavin Nesbitt, Aliana Parker, Aaron Plahn, Laura Prosper, Rohahí:yo (Jordan Brant), Ronkwe’tiyóhstha (Josiah Maracle), Michael Running Wolf, Lorne Shapiro, Tye Swallow, and Shawn Tsosie.

References

Oliver Adams, Trevor Cohn, Graham Neubig, Steven Bird, and Alexis Michaud. 2018. Evaluating Phonemic Transcription of Low-Resource Tonal Languages for Language Documentation. In Proc. LREC, pages 3356–3365.

Antti Arppe, Jordan Lachler, Lene Antonsen, Trond Trosterud, and Sjur N. Moshagen. 2016. Basic language resource kits for endangered languages: A case study of Plains Cree. In Proceedings of the 2016 CCURL Workshop. Collaboration and Computing for Under-Resourced Languages: Towards an Alliance for Digital Language Diversity, LREC 2016, May 23, 2016, pages 1–9.

Kenneth R. Beesley and Lauri Karttunen. 2003. Finite State Morphology. CSLI Publications.

Alan Black and Kevin Lenzo. 2001. Optimal data selection for unit selection synthesis. In 4th ISCA Speech Synthesis Workshop, pages 63–67. ISCA.

Alan W. Black, Paul Taylor, and Richard Caley. 1998. The Festival speech synthesis system.

Deborah Cameron, Elizabeth Frazer, Penelope Harvey, M. B. H. Rampton, and Kay Richardson. 1992. Researching language: Issues of power and method. Routledge.

Michael J. Chandler and Christopher Lalonde. 1998. Cultural continuity as a hedge against suicide in Canada’s First Nations. Transcultural Psychiatry, 35(4):191–219.

Christopher Cox, Gilles Boulianne, and Jahangir Alam. 2019. Taking aim at the “transcription bottleneck”: Integrating speech technology into language documentation and conservation. http://hdl.handle.net/10125/44841. Presentation at the 6th International Conference on Language Documentation and Conservation (ICLDC), University of Hawai’i at Mānoa, Honolulu, HI.

Ewa Czaykowska-Higgins. 2009. Research models, community engagement, and linguistic fieldwork: Reflections on working within Canadian Indigenous communities. Language documentation & conservation, 3(1):182–215.

Alain Désilets, Benoît Farley, Geneviève Patenaude, and Marta Stojanovic. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. In Proceedings of Translating and the Computer, volume 30. International Association for Advancement in Language Technology.

Petra Fachinger. 2019. Colonial violence in sixties scoop narratives: from In Search of April Raintree to A Matter of Conscience. Studies in American Indian Literatures, 31(1-2):115.

Neil Fraser. 2015. Ten things we’ve learned from Blockly. In 2015 IEEE Blocks and Beyond Workshop, pages 49–50.

R. Frost and J. Launchbury. 1989. Constructing natural language interpreters in a lazy functional language. The Computer Journal, 32:108–121.

Andy Gordon, Thore Graepel, Nicolas Rolland, Claudio Russo, Johannes Borgström, and John Guiver. 2014. Tabular: A schema-driven probabilistic programming language. In POPL ’14 Proceedings of the 41st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 321–334. ACM Press, January.

Government of Canada. 2015. Final report of the Truth and Reconciliation Commission. http://nctr.ca/reports.php.

Vishwa Gupta and Gilles Boulianne. 2020a. Automatic Transcription Challenges for Inuktitut, a Low-Resource Polysynthetic Language. In Proc. LREC 2020.

Vishwa Gupta and Gilles Boulianne. 2020b. Speech Transcription Challenges for Resource Constrained Indigenous Language Cree. In Proc. 1st Joint SLTU and CCURL Workshop.

Darcy Hallett, Michael J. Chandler, and Christopher E. Lalonde. 2007. Aboriginal language knowledge and youth suicide. Cognitive Development, 22(3):392–399.

David Huggins-Daines, Mohit Kumar, Arthur Chan, Alan W. Black, Mosur Ravishankar, and Alexander I. Rudnicky. 2006. PocketSphinx: A free, real-time continuous speech recognition system for hand-held devices. In 2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, volume 1, pages I–I. IEEE.

Mans Hulden. 2009. Foma: A finite-state compiler and library. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 29–32. Association for Computational Linguistics.

Graham Hutton and Erik Meijer. 1988. Monadic parser combinators. Journal of Functional Programming, 8:437–444.

Nancy Ide, James Pustejovsky, Christopher Cieri, Eric Nyberg, Denise DiPersio, Chunqi Shi, Keith Suderman, Marc Verhagen, Di Wang, and Jonathan Wright. 2016. The language application grid. In Yohei Murakami and Donghui Lin, editors, Worldwide Language Service Infrastructure, pages 51–70, Cham. Springer International Publishing.

Eric Joanis, Rebecca Knowles, Roland Kuhn, Samuel Larkin, Patrick Littell, Chi-kiu Lo, Darlene Stewart, and Jeffrey Micher. 2020. The Nunavut Hansard Inuktitut–English parallel corpus 3.0 with preliminary machine translation results. In Proceedings of LREC 2020.

David Kanatawakhon. 2002. Yonteweyenhstahkwa Kanyen’kéha: A Mohawk Teaching Dictionary. Centre for Research and Teaching of Canadian Native Languages, University of Western Ontario.

Patline Laverdure, Ida Rose Allard, and John C. Crawford (ed.). 1983. The Michif Dictionary: Turtle Mountain Chippewa Cree. Pemmican Publications. https://search.library.utoronto.ca/details?4541064.

Patrick Littell, Anna Kazantseva, Roland Kuhn, Aidan Pine, Antti Arppe, Christopher Cox, and Marie-Odile Junker. 2018a. Indigenous language technologies in Canada: Assessment, challenges, and successes. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2620–2632.

Patrick Littell, Tom McCoy, Na-Rae Han, Shruti Rijhwani, Zaid Sheikh, David Mortensen, Teruko Mitamura, and Lori Levin. 2018b. Parser combinators for Tigrinya and Oromo morphology. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May. European Language Resources Association (ELRA).

Patrick Littell, Tian Tian, Ruochen Xu, Zaid Sheikh, David Mortensen, Lori Levin, Francis Tyers, Hiroaki Hayashi, Graham Horwood, Steve Sloto, Emily Tagtow, Alan Black, Yiming Yang, Teruko Mitamura, and Eduard Hovy. 2018c. The ARIEL-CMU situation frame detection pipeline for LoReHLT16: A model translation approach. Machine Translation, 32(1–2):105–126, June.

Brian Maracle. 2017. Anonymous 1st Year Adult Immersion Program 2017–18. Onkwawenna Kentyohkwa, Ohsweken, ON, Canada. The book was co-written by several other staff members over the years; Brian Maracle is the author of the latest, 2017 edition.

Doug Marmion, Kazuko Obata, and Jakelin Troy. 2014. Community, Identity, Wellbeing: The Report of the Second National Indigenous Languages Survey. Australian Institute of Aboriginal and Torres Strait Islander Studies, Canberra.

Timothy Montler. 2018. SENĆOŦEN: A Dictionary of the Saanich Language. University of Washington Press. ISBN 9780295743851.

David R. Mortensen, Patrick Littell, Akash Bharadwaj, Kartik Goyal, Chris Dyer, and Lori Levin. 2016. PanPhon: A resource for mapping IPA segments to articulatory feature vectors. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3475–3484.

Mary Jane Norris. 2018. The state of Indigenous languages in Canada: Trends and prospects in language retention, revitalization and revival. Canadian Diversity, 15(1):22–31.

Richard T. Oster, Angela Grier, Rick Lightning, Maria J. Mayan, and Ellen L. Toth. 2014. Cultural continuity, traditional Indigenous language, and diabetes in Alberta First Nations: A mixed methods study. International Journal for Equity in Health, 13(1):92.

Erik Pasternak, Rachel Fenichel, and Andrew N. Marshall. 2017. Tips for creating a block language with Blockly. In 2017 IEEE Blocks and Beyond Workshop (B&B), pages 21–24. IEEE.

Aidan Pine and Mark Turin. 2017. Language revitalization. In Oxford Research Encyclopedia of Linguistics. Oxford University Press.

Jon Reyhner. 2010. Indigenous language immersion schools for strong Indigenous identities. Heritage Language Journal, 7(2):138–152.

Keren Rice. 2008. Indigenous languages in Canada. In The Canadian Encyclopedia. Historica Canada. https://www.thecanadianencyclopedia.ca/en/article/aboriginal-people-languages (accessed May 6, 2020).

Nicole Rosen and Heather Souter. 2009. Language revitalization in a multilingual community: The case of Michif. In 1st International Conference on Language Documentation and Conservation (ICLDC), Honolulu.

Olivia N. Sammons. 2019. Nominal classification in Michif. Ph.D. thesis, University of Alberta, DOI: 10.7939/r3- b8sq-xz05.

Lane Schwartz, Francis Tyers, Lori Levin, Christo Kirov, Patrick Littell, Chi-kiu Lo, Emily Prud’hommeaux, Hyunji Hayley Park, Kenneth Steimel, Rebecca Knowles, Jeffrey Micher, Lonny Strunk, Han Liu, Coleman Haley, Katherine J. Zhang, Robbie Jimmerson, Vasilisa Andriyanets, Aldrian Obaja Muis, Naoki Otani, Jong Hyuk Park, and Zhisong Zhang. 2020. Neural polysynthetic language modelling.

Statistics Canada. 2017. Proportion of mother tongue responses for various regions in Canada, 2016 Census. https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/dv-vd/lang/index-eng.cfm.

Daan van Esch, Elnaz Sarbar, Tamar Lucassen, Jeremy O’Brien, Theresa Breiner, Manasa Prasad, Evan Crew, Chieu Nguyen, and Françoise Beaufays. 2019. Writing across the world’s languages: Deep internationalization for Gboard, the keyboard.

Douglas H. Whalen, Margaret Moss, and Daryl Baldwin. 2016. Healing through language: Positive physical health effects of Indigenous language use. F1000Research, 5.

Peter Wittenburg, Hennie Brugman, Albert Russel, Alex Klassmann, and Han Sloetjes. 2006. ELAN: A professional framework for multimodality research. In LREC. European Language Resources Association (ELRA).