Chren: Cherokee-English Machine Translation for Endangered Language Revitalization
Total Page:16
File Type:pdf, Size:1020Kb
ChrEn: Cherokee-English Machine Translation for Endangered Language Revitalization Shiyue Zhang Benjamin Frey Mohit Bansal UNC Chapel Hill {shiyue, mbansal}@cs.unc.edu; [email protected] Abstract Src. ᎥᏝ ᎡᎶᎯ ᎠᏁᎯ ᏱᎩ, ᎾᏍᎩᏯ ᎠᏴ ᎡᎶᎯ ᎨᎢ ᏂᎨᏒᎾ ᏥᎩ. Ref. They are not of the world, even as I am not of the Cherokee is a highly endangered Native Amer- world. ican language spoken by the Cherokee peo- SMT It was not the things upon the earth, even as I am ple. The Cherokee culture is deeply embed- not of the world. ded in its language. However, there are ap- NMT I am not the world, even as I am not of the world. proximately only 2,000 fluent first language Table 1: An example from the development set of Cherokee speakers remaining in the world, and ChrEn. NMT denotes our RNN-NMT model. the number is declining every year. To help save this endangered language, we introduce people: the Eastern Band of Cherokee Indians ChrEn, a Cherokee-English parallel dataset, to (EBCI), the United Keetoowah Band of Cherokee facilitate machine translation research between Indians (UKB), and the Cherokee Nation (CN). Cherokee and English. Compared to some pop- ular machine translation language pairs, ChrEn The Cherokee language, the language spoken by is extremely low-resource, only containing 14k the Cherokee people, contributed to the survival sentence pairs in total. We split our paral- of the Cherokee people and was historically the ba- lel data in ways that facilitate both in-domain sic medium of transmission of arts, literature, tradi- and out-of-domain evaluation. We also col- tions, and values (Nation, 2001; Peake Raymond, lect 5k Cherokee monolingual data to en- 2008). However, according to the Tri-Council Res. able semi-supervised learning. Besides these No. 02-2019, there are only 2,000 fluent first lan- datasets, we propose several Cherokee-English and English-Cherokee machine translation sys- guage Cherokee speakers left, and each Cherokee tems. We compare SMT (phrase-based) ver- tribe is losing fluent speakers at faster rates than sus NMT (RNN-based and Transformer-based) new speakers are developed. UNESCO has identi- systems; supervised versus semi-supervised fied the dialect of Cherokee in Oklahoma is “defi- (via language model, back-translation, and nitely endangered”, and the one in North Carolina BERT/Multilingual-BERT) methods; as well is “severely endangered”. Language loss is the as transfer learning versus multilingual joint loss of culture. CN started a 10-year language re- training with 4 other languages. Our best re- vitalization plan (Nation, 2001) in 2008, and the sults are 15.8/12.7 BLEU for in-domain and 6.5/5.0 BLEU for out-of-domain Chr-En/En- Tri-Council of Cherokee tribes declared a state of Chr translations, respectively, and we hope emergency in 2019 to save this dying language. that our dataset and systems will encourage fu- To revitalize Cherokee, language immersion ture work by the community for Cherokee lan- programs are provided in elementary schools, and 1 guage revitalization. second language programs are offered in universi- 1 Introduction ties. However, students have difficulty finding ex- posure to this language beyond school hours (Al- The Cherokee people are one of the indigenous bee, 2017). This motivates us to build up English peoples of the United States. Before the 1600s, (En) to Cherokee (Chr) machine translation sys- they lived in what is now the southeastern United tems so that we could automatically translate or States (Peake Raymond, 2008). Today, there are aid human translators to translate English materi- three federally recognized nations of Cherokee als to Cherokee. Chr-to-En is also highly meaning- 1Our data, code, and demo will be publicly available at ful in helping spread Cherokee history and culture. https://github.com/ZhangShiyue/ChrEn. Therefore, in this paper, we contribute our effort Germanic Ingvaeonic Anderson, 2008), while English generally follows Indo- West European Germanic English the Subject-Verb-Object (SVO) word order. Plus, verbs comprise 75% of Cherokee, which is only Southern Cherokee Iroquoian Wendat 25% for English (Feeling, 1975, 1994). Iroquoian Lake Iroquoian Wyandot Hence, to develop translation systems for this Northern Iroquoian Five Nations low-resource and distant language pair, we in- Tuscarora Iroquois vestigate various machine translation paradigms Figure 1: Language family trees. and propose phrase-based (Koehn et al., 2003) Statistical Machine Translation (SMT) and RNN- to Cherokee revitalization by constructing a clean based (Luong et al., 2015) or Transformer-based Cherokee-English parallel dataset, ChrEn, which (Vaswani et al., 2017) Neural Machine Transla- results in 14,151 pairs of sentences with around tion (NMT) systems for both Chr-En and En- 313K English tokens and 206K Cherokee tokens. Chr translations, as important starting points for We also collect 5,210 Cherokee monolingual sen- future works. We apply three semi-supervised tences with 93K Cherokee tokens. Both datasets methods: using additional monolingual data are derived from bilingual or monolingual materi- to train the language model for SMT (Koehn als that are translated or written by first-language and Knowles, 2017); incorporating BERT (or Cherokee speakers, then we manually aligned and Multilingual-BERT) (Devlin et al., 2019) represen- cleaned the raw data.2 Our datasets contain texts tations for NMT (Zhu et al., 2020), where we in- of two Cherokee dialects (Oklahoma and North troduce four different ways to use BERT; and the Carolina), and diverse text types (e.g., sacred text, back-translation method for both SMT and NMT news). To facilitate the development of machine (Bertoldi and Federico, 2009; Lambert et al., 2011; translation systems, we split our parallel data into Sennrich et al., 2016b). Moreover, we explore the five subsets: Train/Dev/Test/Out-dev/Out-test, in use of existing X-En parallel datasets of 4 other lan- which Dev/Test and Out-dev/Out-test are for in- guages (X = Czech/German/Russian/Chinese) to domain and out-of-domain evaluation respectively. improve Chr-En/En-Chr performance via transfer See an example from ChrEn in Table 1 and the de- learning (Kocmi and Bojar, 2018) or multilingual tailed dataset description in Section 3. joint training (Johnson et al., 2017). The translation between Cherokee and English is not easy because the two languages are genealog- Empirically, NMT is better than SMT for ically disparate. As shown in Figure 1, Cherokee in-domain evaluation, while SMT is signifi- is the sole member of the southern branch of the cantly better under the out-of-domain condi- Iroquoian language family and is unintelligible to tion. RNN-NMT consistently performs better other Iroquoian languages, while English is from than Transformer-NMT. Semi-supervised learn- the West Germanic branch of the Indo-European ing improves supervised baselines in some cases language family. Cherokee uses a unique 85- (e.g., back-translation improves out-of-domain character syllabary invented by Sequoyah in the Chr-En NMT by 0.9 BLEU). Even though Chero- early 1820s, which is highly different from En- kee is not related to any of the 4 languages glish’s alphabetic writing system. Cherokee is a (Czech/German/Russian/Chinese) in terms of their polysynthetic language, meaning that words are language family trees, surprisingly, we find that composed of many morphemes that each have in- both transfer learning and multilingual joint train- dependent meanings. A single Cherokee word can ing can improve Chr-En/En-Chr performance in express the meaning of several English words, e.g., most cases. Especially, transferring from Chinese- ᏫᏓᏥᏁᎩᏏ (widatsinegisi), or I am going off at a English achieves the best in-domain Chr-En perfor- distance to get a liquid object. Since the seman- mance, and joint learning with English-German ob- tics are often conveyed by the rich morphology, tains the best in-domain En-Chr performance. The the word orders of Cherokee sentences are vari- best results are 15.8/12.7 BLEU for in-domain Chr- able. There is no “basic word order” in Cherokee, En/En-Chr translations; and 6.5/5.0 BLEU for out- and most word orders are possible (Montgomery- of-domain Chr-En/En-Chr translations. Finally, we conduct a 50-example human (expert) evalua- 2Our co-author, Prof. Benjamin Frey, is a proficient second-language Cherokee speaker and a citizen of the East- tion; however, the human judgment does not cor- ern Band of Cherokee Indians. relate with BLEU for the En-Chr translation, indi- cating that BLEU is possibly not very suitable for attention from the NLP community in helping to Cherokee evaluation. Overall, we hope that our save and revitalize this endangered language. An datasets and strong initial baselines will encourage initial version of our data and its implications was future works to contribute to the revitalization of introduced in (Frey, 2020). Note that we are not this endangered language. the first to propose a Cherokee-English parallel dataset. There is Chr-En parallel data available on 2 Related Works OPUS (Tiedemann, 2012).5 The main difference is that our parallel data contains 99% of their data Cherokee Language Revitalization. In 2008, and has 6K more examples from diverse domains. the Cherokee Nation launched the 10-year lan- guage preservation plan (Nation, 2001), which Low-Resource Machine Translation. Even aims to have 80% or more of the Cherokee peo- though machine translation has been studied for ple be fluent in this language in 50 years. Af- several decades, the majority of the initial research ter that, a lot of revitalization works were pro- effort was on high-resource translation pairs, e.g., posed. Cherokee Nation and the EBCI have es- French-English, that have large-scale parallel tablished language immersion programs and k-12 datasets available.