Speech technology and Argentinean Welsh

Elise Bell The University of Texas at El Paso [email protected]

Abstract under-resourced , including Welsh and other Celtic languages, making creation of speech This paper argues for increased efforts to technologies for these languages more challeng- source Welsh data from the pop- ing. As we undertake that challenge, it is vital that ulation of Welsh speakers in Argentina. we consider the source of the data on which our The dialect of Argentinean Welsh is under- technology is based. As less-resourced language resourced even in comparison to other speech technology becomes more broadly accessi- Celtic languages, which are already con- ble, speakers who deviate from the norms explic- sidered less-resourced languages (LRLs). itly or implicitly assumed by the technology will Argentinean Welsh has been shown to begin to come in contact with it. Depending on diverge from other dialects of Welsh the variety inherent in the data underlying the sys- in the realization of acoustic contrasts tem, those marginalized speakers may or may not such as -onset time and vowel dura- be able to successfully take advantage of speech tion. These differences potentially obscure technology. The aim of this paper is to highlight phonemic contrasts in the language, cre- the particular areas of speech technology develop- ating homophony absent in other dialects. ment that may create obstacles or pose problems The inclusion of Argentinean Welsh data for users, and to propose the addition of a partic- in training sets for future Welsh speech ular source of acoustic data that lies outside the technology development will increase the norm for technology. The main applicability of such technology to other speaker group of concern here is speakers of Welsh speaker communities whose Welsh speech in Argentina, but the arguments that follow apply may not align with that currently in use for to second language (L2) or non-fluent speakers of model training, including second-language Welsh as well. and non-fluent speakers. Compared to dialects of Welsh spoken in Wales, Argentinean (or Patagonian) Welsh is extremely 1 Introduction under-resourced and under-researched. Documen- The development of speech language technology tation efforts amount to a handful of citations such as automatic speech recognition (ASR) de- (Jones, 1984; Jones, 1998; Sleeper, 2015; Bell, pends on the availability and accessibility of large- 2017), and to my knowledge, only one speech cor- scale language data sets, both spoken and written. pus. Little is known about how the dialect of Welsh The information in these data sets is used to create spoken in Argentina differs from other dialects of statistical generalizations that form the basis for Welsh, although there are several reasons to expect speech technologies including speech recognition, dialectal variation. The effects of bilingualism on text-to-speech systems, and grammatical parsing. speech production are well documented (Flege et Large resources of this type are less available for al., 1997; Flege et al., 2003; Escudero, 2009), and all adult speakers of Argentinean Welsh are bilin- c 2019 The authors. This article is licensed under a Creative Commons 4.0 licence, no derivative works, attribution, CC- gual with Spanish (if not trilingual with English or BY-ND. another language). Dialect differences may also

Proceedings of the Celtic Language Technology Workshop 2019 Dublin, 19–23 Aug., 2019 | p. 16 arise from the effects of second language (L2) ac- using the Paldaruo app3 and website to crowd- quisition of Welsh. Differences in speech produc- source the collection of Welsh utterances (Prys tion due to these effects may include the merg- and Jones, 2018). Utterances were elicited with a ing of phonemic categories, or the use of differ- set of target words and sentences designed to col- ent acoustic cues in contrast production. Because lect a representative set. The project is these effects are fairly inextricably tied up with currently available through the Mozilla Common- the effect of Spanish bilingualism on Argentinean Voice project4 where users can contribute and eval- Welsh in general, they will not be treated sepa- uate recordings, and where a portion of the vetted rately here. This paper presents a brief overview data is available for download. of the state of Welsh language speech technology There are several other text and speech corpora and resources, followed by a short discussion of available for the Welsh language. The Language the history and modern context of the Welsh lan- Technologies Unit has created multiple text cor- guage in Argentina. Subsequently, I present evi- pora, including one of social media posts5 as well dence that experimentally observed differences be- as a million-word corpus consisting of various reg- tween Welsh dialects support the inclusion of Ar- isters of Welsh writing.6 Researchers at Bangor gentinean Welsh data in future speech technology University’s ESRC Centre for Research on Bilin- development efforts. gualism in Theory & Practice have also produced two publicly available corpora of Welsh bilingual 1.1 Welsh speech technology speech.7 One of these, the Patagonia corpus, is to my knowledge the only publicly available col- Speech recognition and speech synthesis technolo- lection of Argentinean Welsh speech. While such gies rely on (relatively) large amounts of acoustic conversational corpora are invaluable for the study data, which must be transcribed orthographically of syntactic and morphological phenomena (Carter (in the case of a grapheme-based speech recog- et al., 2010; Webb-Davies, 2016), the acoustic data nition system) or phonetically (in the case of a they contain is not always of high enough quality, phoneme-based system). The collection, analy- nor is the corpus large enough, to stand alone as the sis, and processing of this data requires resources sole source for development of speech technology. including people-hours, funding, and often, par- A brief history of Welsh in Argentina is presented ticipation of community members in data crowd- below, followed by a discussion of the benefits that sourcing efforts (Prys and Jones, 2018). Currently, Argentinean Welsh data may have on future devel- available Welsh speech technology is fairly lim- opment of Welsh speech technology. ited (compared to larger-resourced languages like English). Much of what is available has been 1.2 Argentinean Welsh produced by the Welsh Language Technologies The presence of Welsh in Argentina is due to 1 Unit, based at Bangor University. Tools pro- the mid-19th century efforts of a group of Welsh duced by the Language Technologies Unit range speakers, led by Michael D. Jones, who sought to from front-end resources accessible to the public establish a Welsh colony away from the influence (a vocabulary website plugin (Jones et al., 2016), of the and British government a Welsh language spelling and grammar checker (Williams, 1975). In 1865, following an agree- (Prys et al., 2016)) to back-end tools such as a ment with the government of Argentina, a Welsh part-of-speech tagger (Prys and Jones, 2015) that colony was established in the Patagonia region of are open source and accessible to researchers out- of the country. Today, descendants of the origi- side of the unit itself. The unit has also devel- nal 200 colonists (and of the several thousand who oped speech recognition and synthetic speech tech- followed in subsequent years) still maintain the nologies that are of particular relevance to this pa- Welsh language and culture in Argentina. Mod- 2 per. These include the development of Macsen, a ern Welsh speakers are clustered in two areas of Welsh-language personal digital assistant based on 3https://apps.apple.com/bs/app/paldaruo/ data collected by the Languages Technologies Unit id840185808 4https://voice.mozilla.org 1www.bangor.ac.uk/canolfanbedwyr/ 5http://techiaith.cymru/data/corpora/ technolegau_iaith.php.en twitter 2http://techiaith.cymru/2016/05/ 6http://corpws.cymru/ceg/ introducing-macsen 7http://bangortalk.org.uk/

Proceedings of the Celtic Language Technology Workshop 2019 Dublin, 19–23 Aug., 2019 | p. 17 Chubut Province, in Dyffryn Camwy on the At- developed. Specific aspects of Argentinean Welsh lantic coast, and Cwm Hyfryd to the west. variation, which may be due to synchronic effects Although inter-generational transmission of the from first language Spanish, the effect of lifelong language waned during the 20th century, revital- Spanish bilingualism, or diachronic dialect diver- ization efforts were spurred in 1965, the centennial gence, are discussed below. of the original colony’s establishment. The cen- Today, all adult speakers of Welsh are at least tennial celebration renewed interest in Welsh cul- bilingual, either with English (in Wales) or with ture and language, and by the 1990s several lan- Spanish (in Argentina). This situation compli- guage initiatives were established which still ex- cates what might otherwise be a straightforward ist today. These include Welsh language medium dialect comparison between differing varieties of primary schools, annual Eisteddfodau (traditional Welsh. Cross-linguistic influence from competing poetry and song competitions), and an ongoing languages Spanish and English is entangled with teacher exchange program with Wales through the other linguistic pressures, including effects of first 8 Welsh Language Project. language (L1) on second language (L2) speech, and historical language change as a result of con- 2 Discussion tact. Teasing apart these intertwined factors is far beyond the scope of this paper, and it is sufficient Argentinean Welsh is spoken by a population that for our purposes to acknowledge that multiple fac- is separated from Wales by more than a century tors exist, and that they likely influence the Welsh of sparse contact as well as a language barrier language in both regions. Recent work has used (bilingualism with Spanish, rather than English). experimental methods and corpus analyses to in- These factors have almost certainly contributed to vestigate the realization of sound contrasts in Ar- linguistic divergence in many aspects of Argen- gentinean Welsh that are hypothesized to be sus- tinean Welsh. The most salient of these aspects for ceptible to influence from Spanish contact. the purpose of this paper is divergence in the lan- guage’s sound system, in the acoustic realization Sleeper (2015) investigated the realization of and phonological representation of speech sounds. voice onset time (VOT) in the Welsh voiceless Previous research on speech recognition of dialect stop series /p t k/. It was hypothesized that while and accent differences has shown that, given a contact with the English system reinforces the re- large enough data set, systems trained on multi- tention of the Welsh voiceless aspirated-voiceless ple dialects perform better than those trained on unaspirated VOT contrast, contact with Spanish in a single dialect (Rao and Sak, 2017; Li et al., Argentina may have resulted in a shift to a more 2018; Yang et al., 2018). Other work has found Spanish-like voiced-voiceless unaspirated system. that including accent classification when train- Sleeper extracted VOT values from word-initial ing a multi-accent speech recognition system im- instances of /p t k/ produced in conversational proved later classification of both accent-classified speech by Welsh bilinguals in Argentina and in and accent-unclassified datasets (Jain et al., 2018). Wales, recorded in the Patagonia and Siarad cor- Before addressing specific evidence for phonetic pora (Deuchar et al., 2014). Results confirmed and phonological differences between Argentinean his hypothesis, with Argentinean Welsh-Spanish Welsh and other dialects of Welsh, the next sec- bilingual speakers producing shorter Spanish-like tion discusses the reasoning for including dialectal VOT in voiceless-stop initial Welsh words, com- variation in speech technology models. pared to the English-like VOT produced by the The results of previous research indicate that the Welsh-English bilingual group. inclusion of dialectal acoustic variation can pro- Bell (2018) collected productions of Welsh vow- vide a more variable and more useful data set for els from Welsh-Spanish bilinguals in Argentina the future development of Welsh language technol- and Welsh-English bilinguals in Wales in order to ogy. I propose that Argentinean Welsh provides a investigate differences in the acoustic realization unique opportunity to broaden the language data of allophonic and phonemic vowel length. Be- base from which Welsh speech technologies are cause Spanish does not contrast vowels on the ba- sis of length, nor does duration vary allophonically 8https://wales.britishcouncil.org/en/ programmes/education/welsh-language- to the extent that it does in Welsh or English, it was project hypothesized that Welsh-Spanish bilinguals were

Proceedings of the Celtic Language Technology Workshop 2019 Dublin, 19–23 Aug., 2019 | p. 18 likely to exhibit differences in their production of western Argentinean Welsh communities. Addi- long and short Welsh vowels. Results showed tionally, the Welsh Language Project and Menter that Welsh-Spanish bilinguals produced phonemic Patagonia program involve networks of Welsh lan- vowel length contrasts in much the same way as guage educators in the region who may be inter- Welsh-English bilinguals (relying on both vowel ested in integrating speech technology participa- duration and spectral quality), but were less simi- tion into their classrooms of speakers at all levels. lar in production of allophonic duration differences As the resources for Welsh speech technology conditioned by following voicing. continue to grow, the opportunity to include data Differences in the acoustic realization of the fac- from speakers of Welsh in Argentina increases. tors mentioned above are likely to prove challeng- Through projects like the speech-collecting Pal- ing for an automatic speech recognition system daruo app and Mozilla CommonVoice (Prys and trained only on Welsh produced by fluent speak- Jones, 2018), it is increasingly possible to target ers in Wales. The differences observed in Argen- and recruit participants who are speakers of Ar- tinean Welsh generally appear to reduce acoustic gentinean Welsh (and of course, other minority contrast between Welsh (the voiceless dialects of the language) through crowdsourcing /p t k/ and voiced /b d g/ stop series, or the vowel methods. Encouraging participation in the Com- length contrast separating minimal pairs such as monVoice initiative, which is online and requires /mor/ mor ‘so’ and /mo:r/ morˆ ‘sea’). The collapse little time commitment is a simple first step toward of contrasting acoustic cues to these (and poten- crowdsourcing Argentinean Welsh data. tially other) phonemic differences in Argentinean Welsh is likely to prove challenging for an auto- 4 Conclusion matic speech recognition system trained on other This paper has argued that the dialect of Welsh dialects of the language. spoken in Argentina presents a valuable resource One solution to this problem, as often seems for the continuing development of Welsh speech to be the case in the domain of speech tech- technology. Technologies such as speech recogni- nology, is more data. Natural and lab-produced tion benefit from the inclusion of variation (both speech datasets collected from speakers of Ar- individual and dialectal) in the data from which gentinean Welsh will serve to diversify the infor- they are developed. I have shown that the dialect of mation set from which statistical generalizations Argentinean Welsh is different from other dialects about acoustic-phonetic realizations of Welsh are of Welsh in the way speakers acoustically realize drawn. Knowledge-based approaches to speech underlying phonemic contrasts. This variation, if recognition that incorporate linguistic generaliza- included in speech technology training data, will tions such as phonological rules into the system serve to develop technologies that will be accessi- should also be considered, as they may be well- ble to more speakers, including those who are less suited to ASR development from small datasets fluent, or who due to language background and L2 (Besacier et al., 2006; Gaikwad et al., 2010). cross-linguistic influence are not able to fully take advantage of current Welsh speech technology. 3 Suggestions for future work Furthermore, inclusion of the Argentinean Welsh community in the development of Welsh While advocating for the inclusion of Argentinean speech technology will strengthen ties between Welsh data in future Welsh speech technology speakers in Wales and Argentina. The goals of projects is well and good, it must also be acknowl- Welsh language revitalization programs in both edged that there are challenges to doing so. The countries include supporting new speakers of the Welsh-speaking population of Argentina is sparse language, and making speech technology acces- compared to that of Wales, with speaker numbers sible and responsive to those new speakers will in the low thousands, spread throughout the re- further progress toward that goal. The Welsh- gion (ON´ eill,´ 2005). This problem may be over- speaking population of Argentina is a valuable re- come by making use of existing community net- source, and outreach and integration efforts to and works and organizations. Data collection, partic- with community members can only stand to bene- ipant recruitment, and outreach could all poten- fit future efforts in the development of speech tech- tially be integrated with community events such nology for all Welsh speakers. as the annual Eisteddfod in both the eastern and

Proceedings of the Celtic Language Technology Workshop 2019 Dublin, 19–23 Aug., 2019 | p. 19 References Jones, Robert Owen. 1998. The Welsh language in Patagonia. In Jenkins, Geraint H., editor, A social Bell, Elise Adrienne. 2017. Perception of Welsh vowel history of the Welsh language: Language and com- contrasts by Welsh-Spanish bilinguals in Argentina. munity in the nineteenth century, pages 287–316. In Linguistic Society of America Annual Meeting. University of Wales Press, Cardiff. Linguistic Society of America Annual Meeting. Li, Bo, Tara N Sainath, Khe Chai Sim, Michiel Bac- Bell, Elise Adrienne. 2018. Perception and Production chiani, Eugene Weinstein, Patrick Nguyen, Zhifeng of Welsh Vowels by Welsh-Spanish Bilinguals. Ph.D. Chen, Yanghui Wu, and Kanishka Rao. 2018. Multi- thesis, The University of Arizona. dialect speech recognition with a single sequence-to- sequence model. In 2018 IEEE International Con- Besacier, Laurent, V-B Le, Christian Boitet, and Vin- ference on Acoustics, Speech and Signal Processing cent Berment. 2006. ASR and translation for under- (ICASSP), pages 4749–4753. IEEE. resourced languages. In 2006 IEEE International Conference on Acoustics Speech and Signal Process- ON´ eill,´ Diarmui. 2005. Rebuilding the Celtic lan- ing Proceedings, volume 5, pages V–V. IEEE. guages: reversing language shift in the Celtic coun- tries. Y Lolfa. Carter, Diana, Peredur Davies, M. Carmen Parafita Prys, Delyth and Dewi Bryn Jones. 2015. National lan- Couto, and Margaret Deuchar. 2010. A corpus- guage technologies portals for lrls: A case study. In based analysis of codeswitching patterns in bilingual Language and Technology Conference, pages 420– communities. In Proceedings: XXIX Simposio In- 429. Springer. ternacional de la Sociedad Espanola˜ de Lingu¨´ıstica, volume 1. Prys, Delyth and Dewi Bryn Jones. 2018. Gathering data for speech technology in the welsh language: A Deuchar, Margaret P., Peredur Davies, Jon Russell Her- case study. Sustaining Knowledge Diversity in the ring, M. Carmen Parafita Couto, and D Carter. 2014. Digital Age, page 56. Building bilingual corpora. In Thomas, E M and I Mennen, editors, Advances in the Study of Bilin- Prys, Delyth, Gruffudd Prys, and Dewi Bryn Jones. gualism. Multilingual Matters, Bristol. 2016. Cysill ar-lein: A corpus of written contem- porary welsh compiled from an on-line spelling and Escudero, P. 2009. Linguistic perception of ‘similar’ grammar checker. In LREC. L2 sounds. In Boersma, Paul and S. Hamann, edi- tors, in perception, pages 151–190. Mou- Rao, Kanishka and Has¸im Sak. 2017. Multi-accent ton de Gruyter, Berlin. speech recognition with hierarchical grapheme based models. In 2017 IEEE international conference on Flege, James E., Ocke-Schwen Bohn, and Sunyoung acoustics, speech and signal processing (ICASSP), Jang. 1997. Effects of experience on non-native pages 4815–4819. IEEE. speakers’ production and perception of English vow- Sleeper, Morgan. 2015. Contact effects on voice-onset els. Journal of , 25:437–470. time (vot) in patagonian welsh. Journal of the Inter- national Phonetic Association, pages 1–15. Flege, James E., Carlo Schirru, and Ian R.A. MacKay. 2003. Interaction between the native and second Webb-Davies, Peredur. 2016. Does the old language language phonetic subsystems. Speech Communica- endure? Age variation and change in contemporary tion, 40:467–491. . In Welsh Linguistics Seminar.

Gaikwad, Santosh K, Bharti W Gawali, and Pravin Yan- Williams, Glyn. 1975. The desert and the dream: A nawar. 2010. A review on speech recognition tech- study of Welsh colonization in Chubut 1865 – 1915. nique. International Journal of Computer Applica- University of Wales Press, Cardiff. tions, 10(3):16–24. Yang, Xuesong, Kartik Audhkhasi, Andrew Rosenberg, Jain, Abhinav, Minali Upreti, and Preethi Jyothi. 2018. Samuel Thomas, Bhuvana Ramabhadran, and Mark Improved accented speech recognition using accent Hasegawa-Johnson. 2018. Joint modeling of ac- embeddings and multitask learning. In Proc. INTER- cents and acoustics for multi-accent speech recog- SPEECH. ISCA. nition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE. Jones, Dewi Bryn, Gruffudd Prys, and Delyth Prys. 2016. Vocab: a dictionary plugin for web sites. PARIS Inalco du 4 au 8 juillet 2016, page 93.

Jones, Robert Owen. 1984. Change and variation in the Welsh of Gaiman, Chubut. In Ball, Martin J. and Glyn E. Jones, editors, Welsh phonology, pages 237– 261. University of Wales Press, Cardiff.

Proceedings of the Celtic Language Technology Workshop 2019 Dublin, 19–23 Aug., 2019 | p. 20