Exploring Automated Formant Analysis for Comparative Variationist Study of Heritage Cantonese and English

Exploring automated formant analysis for comparative variationist study of Heritage Cantonese and English

Naomi Cui1, Minyi Zhu1, Vina Law1, Holman Tse2 & Naomi Nagy1
1University of Toronto, 2University of Pittsburgh

HERITAGE LANGUAGE VARIATION AND CHANGE IN TORONTO
HTTP://PROJECTS.CHASS.UTORONTO.CA/NGN/HLVC

What is the HLVC Project?
• Large-scale project investigating language use and change in heritage (non-official) languages spoken in Toronto.
• Goals
  – To document and describe heritage languages spoken by immigrants and 2 generations of their descendants
  – To create a corpus available for research on language change
  – To push variationist research beyond its monolingually-oriented core by focusing on heritage language use among multilingual speakers
  – To develop a framework for research on heritage languages and contact

A Sample of Previous HLVC Work
Languages: Cantonese, Faetar, Italian, Korean, Russian, Ukrainian
VOT: ✓ ✓ ✓ ✓ ✓
Ø-subject: ✓ ✓ ✓ ✓
Borrowing: ✓
Vowels: *
* This presentation

Vowels
• Very well researched in sociolinguistics, but very little work on vowel variation and change in languages other than English.
• Large body of research has made possible the development of new technologies/techniques to make vowel analysis easier
  – Example: FAVE (Rosenfelder et al 2011)
Image from Wikipedia

Goals of Current Project
• To determine the extent to which the vowel systems of Cantonese and English may be mutually influencing each other in Toronto
• To extend the use of automated forced alignment and formant extraction as tools for the sociolinguistic study of contact-induced change in Heritage Cantonese.
  – Prosodylab-Aligner (Gorman et al 2011) to be adopted

Methodological Problems
• Large amount of data in HLVC Corpus (~40 speakers/language)
  – Manual formant measurements take a lot of time.
• FAVE designed to work only on English
• Could Prosodylab-Aligner be a viable alternative?

30 Dec. 2007: Italian, Chinese, Cantonese,
Punjabi, Portuguese, Spanish, Tagalog, Urdu, Tamil, Polish

Contrasting demographics
Language    MT speakers      Ethnic Origin    Est.     Speakers come from
            (2011 Census)    (2006 Census)    in TO
Cantonese   170,000+         537,000          1951     Hong Kong
Italian     166,000          466,000          1908     Calabria
Russian     78,000           58,505           1916     St. Petersburg, Moscow
Korean      51,000           55,000           1967     Seoul
Ukrainian   26,000           122,000          1913     Lviv
Faetar      <100?            300?             1950     Faeto, Celle di St. Vito (Apulia, Italy)
Sources: www40.statcan.ca/l01/cst01/demo12c-eng.htm; www12.statcan.gc.ca/census-recensement/2011/dp-pd/prof/index.cfm?Lang=E

Lviv, Ukraine, 1913; Western Poland, 1911; Budapest, Hungary, 1885; Faeto, Italy, 1950

Cantonese vs. English Vowel Space
Images from Wikipedia
• Allophonic lowering of /i/ before velars (Yue-Hashimoto 1972)
• Similar Canadian English vowels
  see, /si/       si1, /si˦/, 詩, ‘poem’
  sick, /sɪk/     sik1, [sɪk˦], 識, ‘to know’
???

Expected outcome
[Figure labels: 1st, 2nd; Heritage Language / Culture; English/Canadian]

Data
• Two sets of hour-long sociolinguistic interviews from 2 generations of speakers identified as Hong Kong Chinese and who claim Cantonese as a heritage language
  – Not from the same speakers, however.
• Interviews in English from the Contact in the City Corpus (CinC) (Hoffman and Walker 2010)
  “My parents came to Toronto in 1972.”
• Interviews in Cantonese from the HLVC Corpus (Nagy 2009, 2011)
  “Ngo5 fu6 mou5 yat1 gau2 cat1 yi6 lin4 lei4 dou3 do1 leon1 do1.”

Speaker Sample
Generation        Sex      CANTONESE                 ENGLISH
1 (Ages: 42-82)   Male     C1M62A, C1M59A            TO.035, TO.038
                  Female   C1F78A, C1F54A, C1F82A    TO.030, TO.037, TO.039
2 (Ages: 16-44)   Male     C2M44A                    TO.029
                  Female   C2F16A, C2F21B            TO.031, TO.056
Total                      N=8                       N=8

Methods - English Data
1. Sentence-level time alignment (manual) using ELAN
2.
Word- and phoneme-level time alignment (automated) with FAVE
• http://fave.ling.upenn.edu

Prosodylab-Aligner (Gorman et al 2011)
• A Python script used to perform text-to-audio speech alignment
• Supports training on arbitrary data
  – → With any input from language X, can be trained to deal with acoustic data from language X
• Requirements
  – At least a total of one hour of audio (.wav file in chunks OK)
  – Matching .lab files (.txt files readable by Prosodylab-Aligner) for each .wav file
  – A customized dictionary

Methods – Cantonese Data
1. Interviews transcribed by native speakers of Cantonese using Jyutping Romanization in ELAN
  – Manual sentence-level alignment
2. To create input readable by Prosodylab-Aligner, PRAAT script used to create smaller .wav files with matching .txt files for each annotation.

PRAAT Script
C1F54A_IV_2074.wav
Translation: “Because at that time, China was at war.”
C1F54A_IV_2105.wav
Translation: “And then the Communist Party came, and then ...”

Training and Evaluation
• .wav files and matching .lab files put in a Training directory
• Prosodylab-Aligner uses Training directory and dictionary to build a training model
  – Custom dictionary in the format of The CMU Pronouncing Dictionary
• Prosodylab-Aligner uses training model to evaluate the same files in the same directory

Textgrid Output of Prosodylab-Aligner

Another PRAAT script: formant extraction
• Formant information extracted from Prosodylab-Aligner generated Textgrids and matching .wav files using PRAAT script
• Output: Tab-delimited .txt file

Vowel Normalization
• http://ncslaap.lib.ncsu.edu/tools/norm/norm1.php
• Labov ANAE (Vowel Extrinsic) method used

Prepping for R-Brul
• Tab-delimited .txt file generated by NORM with normalized values for vowel formants
• New columns added for variables
• Ready for statistical analysis with R-Brul (Johnson)

Variables of Interest
• External Factors
  – Generation
  – Gender
  – Age
• Internal Factors
  – Following Segment
  – Tone

Cantonese Vowel Charts
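The F1’ and F2’ values on vowel charts of this kind are normalized formants. As a rough illustration of speaker-extrinsic log-mean scaling in the spirit of the Labov ANAE method named under Vowel Normalization, here is a minimal Python sketch. It is not the NORM tool's code and its exact procedure may differ. The token fields ("speaker", "vowel", "F1", "F2"), the speaker labels, and the sample values are hypothetical, and the grand mean is computed from whatever speakers are supplied, which is an assumption.

```python
import math
from collections import defaultdict


def log_mean_scale_factors(tokens):
    """Return one scaling factor per speaker.

    tokens: list of dicts with hypothetical keys "speaker", "F1", "F2" (Hz).
    """
    logs_by_speaker = defaultdict(list)
    for tok in tokens:
        logs_by_speaker[tok["speaker"]].append(math.log(tok["F1"]))
        logs_by_speaker[tok["speaker"]].append(math.log(tok["F2"]))

    # Grand mean of log formants, pooled over every token from every speaker.
    pooled = [v for logs in logs_by_speaker.values() for v in logs]
    grand_mean = sum(pooled) / len(pooled)

    # A speaker's factor is exp(grand mean - that speaker's own log mean).
    return {
        spk: math.exp(grand_mean - sum(logs) / len(logs))
        for spk, logs in logs_by_speaker.items()
    }


def normalize_tokens(tokens):
    """Return copies of the tokens with F1 and F2 multiplied by the speaker's factor."""
    tokens = list(tokens)
    factors = log_mean_scale_factors(tokens)
    return [
        {
            **tok,
            "F1": tok["F1"] * factors[tok["speaker"]],
            "F2": tok["F2"] * factors[tok["speaker"]],
        }
        for tok in tokens
    ]


if __name__ == "__main__":
    # Made-up tokens: speaker B's formants are 1.21 times speaker A's.
    sample = [
        {"speaker": "spkA", "vowel": "i", "F1": 400.0, "F2": 2000.0},
        {"speaker": "spkB", "vowel": "i", "F1": 484.0, "F2": 2420.0},
    ]
    for row in normalize_tokens(sample):
        print(row["speaker"], round(row["F1"], 1), round(row["F2"], 1))
```

In this sketch, two speakers whose formants differ only by a constant ratio end up with identical normalized values, which is the property that makes pooled per-vowel means comparable across speakers.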
[Figure: vowel charts for Hong Kong Homeland CAN (image from Wikipedia) and Toronto CAN (8 speakers), Labov ANAE (speaker extrinsic). Axes: F1’ and F2’. Plotted vowel classes include YU, I, ING, IK, U, E, O, A, AA.]

Toronto Anglo ENG vs CAN ENG
[Figure: vowel charts for Toronto Anglo English (based on means from Roeder 2012, Boberg 2008, Roeder & Jarmasz 2010) and Toronto CAN Heritage English (based on means of 7 speakers). Axes: F1 and F2. Plotted vowel classes include UW, IY, IH, OW, EH, AH, AE, AA.]

F1 and F2 Means for /i/ in open syllables
[Figure labels: 1st, 2nd; Heritage Language / Culture; English/Canadian]
Variety                       Gen   F1    F2     Tokens
Cantonese (8 speakers)*       1     439   2044   3207
                              2     423   2106   857
                              All   435   2057   4064
CAN English (11 speakers)**   1     454   2096   1545
                              2     434   2324   2370
                              All   441   2234   3925
Toronto Anglo English               474   2011
* F1 and F2: p < 0.05
** F1 and F2: p < 0.01
• Cantonese: Gen 2 has higher and more fronted /i/
• CAN English: Gen 2 has higher and more fronted /i/
• Anglo English has the lowest /i/.

Discussion of Results
• Evidence of generational change is clear, with the same general developmental trend in both languages.
  – Raising and fronting of /i/ for Gen 2 in both CAN and CAN ENG
• Relative positions of /i/ and /ɪ/ are different in CAN and ENG.
• Lack of /u/ fronting in CAN observed, but some fronting in CAN ENG
• How these changes result from contact with English (if that is the case) appears to be quite complex
  – further research required to better understand how.
• Note
  – Tone not considered as a factor
  – Variation and change in other vowels not considered
  – No homeland data available

Discussion of Methodology
• Without human intervention, automatically extracted data creates reasonable vowel plots
• A promising avenue for future research on vowel variation and change in heritage languages
• But need to check and compare results with manual formant extraction

Future Work
• Assessing accuracy of automated alignment and formant extraction by attempting to replicate results using manual methods
• Expanding to more vowels and more speakers
  – 8 speakers for this analysis, ~40 CAN speakers in Corpus
  – Comparing homeland data
• Expanding to other heritage languages
  – Italian, Faetar, Russian, Ukrainian, Korean

감사합니다 дякую Grazie molto Спасибо 多謝! gratsiə namuor:ə

HLVC RAs: Cameron Abma, Vanessa Bertone, Ulyana Bila, Rosanna Calla, Minji Cha, Karen Chan, Joanna Chociej, Sheila Chung, Tiffany Chung, Courtney Clinton, Radu Craioveanu, Marco Covi, Derek Denis, Tonia Djogovic, Joyce Fok, Paolo Frasca, Matt Gardner, Rick Grimm, Dongkeun Han, Natalia Harhaj, Taisa Hewka, Melania Hrycyna, Michael Iannozzi, Diana Kim, Janyce Kim, Iryna Kulyk, Mariana Kuzela, Ann Kwon, Alex La Gamba, Carmela La Rosa, Natalia Lapinskaya, Kris Lee, Nikki Lee, Olga Levitski, Arash Loi, Paulina Lyskawa, Rosa Mastri, Timea Molnár, Jamie Oh, Maria Parascandolo, Rita Pang, Andrew Peters, Tiina Rebane, Hoyeon Rim, Will Sawkiw, Maksym Shkvorets, Vera Riche~ Smith, Anna Shalaginova, Konstantin Shapoval, Yi Qing Sim, Mario So Gao, Awet Tekeste, Josephine Tong, Sarah Truong, Dylan Uscher, Ka-man Wong, Olivia Yu, Minyi Zhu
Collaborators: Yoonjung Kang, Alexei Kochetov, James Walker
Funding: SSHRC, University of Toronto, Shevchenko Foundation

References
• Boberg, Charles. (2008). “Regional phonetic differentiation in Standard Canadian English.” Journal of English Linguistics 36/2: 129-154.
• Gorman, Kyle, Jonathan Howell & Michael Wagner. (2011).
Prosodylab-Aligner: A tool for forced alignment of laboratory speech. Proceedings of Acoustics Week in Canada, Quebec City.
• Hoffman, M. F., & Walker, J. A. (2010). Ethnolects and the city: Ethnic orientation and linguistic variation in Toronto English. Language Variation and Change, 22, 37-67.
• Lobanov
• Nagy, Naomi. (2009). Heritage Language Variation and Change in Toronto. http://projects.chass.utoronto.ca/ngn/HLVC.
• Roeder, Rebecca. (2012). “The Canadian Shift in Two Ontario Cities.” Special Issue of World Englishes: Autonomy and Homogeneity in Canadian English 31,4: 478-492. Guest editors Stefan Dollinger and Sandra Clarke.
• Roeder, Rebecca and Lidia-Gabriela Jarmasz. (2010). “The Canadian Shift in Toronto.” Revue canadienne de linguistique/Canadian Journal of Linguistics 55,3: 387-404.
