ChrEn: Cherokee-English Machine Translation for Endangered Language Revitalization

Shiyue Zhang, Benjamin Frey, Mohit Bansal
UNC Chapel Hill
{shiyue, mbansal}@cs.unc.edu;
[email protected]

Abstract

Cherokee is a highly endangered Native American language spoken by the Cherokee people. The Cherokee culture is deeply embedded in its language. However, there are approximately only 2,000 fluent first language Cherokee speakers remaining in the world, and the number is declining every year. To help save this endangered language, we introduce ChrEn, a Cherokee-English parallel dataset, to facilitate machine translation research between Cherokee and English. Compared to some popular machine translation language pairs, ChrEn is extremely low-resource, only containing 14k sentence pairs in total. We split our parallel data in ways that facilitate both in-domain and out-of-domain evaluation.

Src. ᎥᏝ ᎡᎶᎯ ᎠᏁᎯ ᏱᎩ, ᎾᏍᎩᏯ ᎠᏴ ᎡᎶᎯ ᎨᎢ ᏂᎨᏒᎾ ᏥᎩ.
Ref. They are not of the world, even as I am not of the world.
SMT It was not the things upon the earth, even as I am not of the world.
NMT I am not the world, even as I am not of the world.

Table 1: An example from the development set of ChrEn. NMT denotes our RNN-NMT model.

people: the Eastern Band of Cherokee Indians (EBCI), the United Keetoowah Band of Cherokee Indians (UKB), and the Cherokee Nation (CN). The Cherokee language, the language spoken by the Cherokee people, contributed to the survival of the Cherokee people and was historically the basic medium of transmission of arts, literature, tradi-