Speech Technology and Argentinean Welsh
Elise Bell
The University of Texas at El Paso

Abstract

This paper argues for increased efforts to source Welsh language data from the population of Welsh speakers in Argentina. The dialect of Argentinean Welsh is under-resourced even in comparison to other Celtic languages, which are already considered less-resourced languages (LRLs). Argentinean Welsh has been shown to diverge from other dialects of Welsh in the realization of acoustic contrasts such as voice-onset time and vowel duration. These differences potentially obscure phonemic contrasts in the language, creating homophony absent in other dialects. The inclusion of Argentinean Welsh data in training sets for future Welsh speech technology development will increase the applicability of such technology to other Welsh speaker communities whose Welsh speech may not align with that currently in use for model training, including second-language and non-fluent speakers.

© 2019 The authors. This article is licensed under a Creative Commons 4.0 licence, no derivative works, attribution, CC-BY-ND.
Proceedings of the Celtic Language Technology Workshop 2019 Dublin, 19–23 Aug., 2019 | p. 16

1 Introduction

The development of speech language technology such as automatic speech recognition (ASR) depends on the availability and accessibility of large-scale language data sets, both spoken and written. The information in these data sets is used to create statistical generalizations that form the basis for speech technologies including speech recognition, text-to-speech systems, and grammatical parsing. Large resources of this type are less available for under-resourced languages, including Welsh and other Celtic languages, making creation of speech technologies for these languages more challenging. As we undertake that challenge, it is vital that we consider the source of the data on which our technology is based. As less-resourced language speech technology becomes more broadly accessible, speakers who deviate from the norms explicitly or implicitly assumed by the technology will begin to come in contact with it. Depending on the variety inherent in the data underlying the system, those marginalized speakers may or may not be able to successfully take advantage of speech technology. The aim of this paper is to highlight the particular areas of speech technology development that may create obstacles or pose problems for users, and to propose the addition of a particular source of acoustic data that lies outside the norm for Welsh language technology. The main speaker group of concern here is speakers of Welsh in Argentina, but the arguments that follow apply to second language (L2) or non-fluent speakers of Welsh as well.

Compared to dialects of Welsh spoken in Wales, Argentinean (or Patagonian) Welsh is extremely under-resourced and under-researched. Documentation efforts amount to a handful of citations (Jones, 1984; Jones, 1998; Sleeper, 2015; Bell, 2017), and to my knowledge, only one speech corpus. Little is known about how the dialect of Welsh spoken in Argentina differs from other dialects of Welsh, although there are several reasons to expect dialectal variation. The effects of bilingualism on speech production are well documented (Flege et al., 1997; Flege et al., 2003; Escudero, 2009), and all adult speakers of Argentinean Welsh are bilingual with Spanish (if not trilingual with English or another language). Dialect differences may also arise from the effects of second language (L2) acquisition of Welsh. Differences in speech production due to these effects may include the merging of phonemic categories, or the use of different acoustic cues in contrast production. Because these effects are fairly inextricably tied up with the effect of Spanish bilingualism on Argentinean Welsh in general, they will not be treated separately here.

This paper presents a brief overview of the state of Welsh language speech technology and resources, followed by a short discussion of the history and modern context of the Welsh language in Argentina. Subsequently, I present evidence that experimentally observed differences between Welsh dialects support the inclusion of Argentinean Welsh data in future speech technology development efforts.

1.1 Welsh speech technology

Speech recognition and speech synthesis technologies rely on (relatively) large amounts of acoustic data, which must be transcribed orthographically (in the case of a grapheme-based speech recognition system) or phonetically (in the case of a phoneme-based system). The collection, analysis, and processing of this data requires resources including people-hours, funding, and often, participation of community members in data crowd-sourcing efforts (Prys and Jones, 2018). Currently, available Welsh speech technology is fairly limited (compared to larger-resourced languages like English). Much of what is available has been produced by the Welsh Language Technologies Unit,¹ based at Bangor University. Tools produced by the Language Technologies Unit range from front-end resources accessible to the public (a vocabulary website plugin (Jones et al., 2016), a Welsh language spelling and grammar checker (Prys et al., 2016)) to back-end tools such as a part-of-speech tagger (Prys and Jones, 2015) that are open source and accessible to researchers outside of the unit itself. The unit has also developed speech recognition and synthetic speech technologies that are of particular relevance to this paper. These include the development of Macsen,² a Welsh-language personal digital assistant based on data collected by the Language Technologies Unit using the Paldaruo app³ and website to crowd-source the collection of Welsh utterances (Prys and Jones, 2018). Utterances were elicited with a set of target words and sentences designed to collect a representative phoneme set. The project is currently available through the Mozilla CommonVoice project,⁴ where users can contribute and evaluate recordings, and where a portion of the vetted data is available for download.

There are several other text and speech corpora available for the Welsh language. The Language Technologies Unit has created multiple text corpora, including one of social media posts⁵ as well as a million-word corpus consisting of various registers of Welsh writing.⁶ Researchers at Bangor University's ESRC Centre for Research on Bilingualism in Theory & Practice have also produced two publicly available corpora of Welsh bilingual speech.⁷ One of these, the Patagonia corpus, is to my knowledge the only publicly available collection of Argentinean Welsh speech. While such conversational corpora are invaluable for the study of syntactic and morphological phenomena (Carter et al., 2010; Webb-Davies, 2016), the acoustic data they contain is not always of high enough quality, nor is the corpus large enough, to stand alone as the sole source for development of speech technology. A brief history of Welsh in Argentina is presented below, followed by a discussion of the benefits that Argentinean Welsh data may have on future development of Welsh speech technology.

¹ www.bangor.ac.uk/canolfanbedwyr/technolegau_iaith.php.en
² http://techiaith.cymru/2016/05/introducing-macsen
³ https://apps.apple.com/bs/app/paldaruo/id840185808
⁴ https://voice.mozilla.org
⁵ http://techiaith.cymru/data/corpora/twitter
⁶ http://corpws.cymru/ceg/
⁷ http://bangortalk.org.uk/

1.2 Argentinean Welsh

The presence of Welsh in Argentina is due to the mid-19th century efforts of a group of Welsh speakers, led by Michael D. Jones, who sought to establish a Welsh colony away from the influence of the English language and British government (Williams, 1975). In 1865, following an agreement with the government of Argentina, a Welsh colony was established in the Patagonia region of the country. Today, descendants of the original 200 colonists (and of the several thousand who followed in subsequent years) still maintain the Welsh language and culture in Argentina. Modern Welsh speakers are clustered in two areas of Chubut Province, in Dyffryn Camwy on the Atlantic coast, and Cwm Hyfryd to the west.

Although inter-generational transmission of the language waned during the 20th century, revitalization efforts were spurred in 1965, the centennial of the original colony's establishment. The centennial celebration renewed interest in Welsh culture and language, and by the 1990s several language initiatives were established which still exist today. These include Welsh language medium primary schools, annual Eisteddfodau (traditional poetry and song competitions), and an ongoing teacher exchange program with Wales through the Welsh Language Project.⁸

[...] developed. Specific aspects of Argentinean Welsh variation, which may be due to synchronic effects from first language Spanish, the effect of lifelong Spanish bilingualism, or diachronic dialect divergence, are discussed below.

Today, all adult speakers of Welsh are at least bilingual, either with English (in Wales) or with Spanish (in Argentina). This situation complicates what might otherwise be a straightforward dialect comparison between differing varieties of Welsh. Cross-linguistic influence from competing languages Spanish and English is entangled with other linguistic pressures, including effects of first language (L1) on second language