
Assessment of an Index for Measuring Pronunciation Difficulty

Katsunori Kotani, Kansai Gaidai University, [email protected]
Takehiko Yoshimi, Ryukoku University, [email protected]

Abstract

This study assesses an index for measuring the pronunciation difficulty of sentences (henceforth, pronounceability) based on the normalized edit distance from a reference sentence to a transcription of learners' pronunciation. Pronounceability should be examined when language teachers use computer-assisted language learning for pronunciation learning in order to maintain learners' motivation. However, unlike the evaluation of learners' pronunciation performance, previous research has not focused on pronounceability, either for English or for Asian languages. This study found that the normalized edit distance was reliable but not valid. The lack of validity appeared to be due to the English test used for determining learners' proficiency.

1 Introduction

Research on computer-assisted language learning (CALL) has been carried out for learning the pronunciation of European languages as a foreign language, such as English (Witt & Young 2000, Mak et al. 2004, Ai & Xu 2015, Liu & Hung 2016) and Swedish (Koniaris 2014). CALL research on Asian languages has considered Japanese as a foreign language (Hirata 2004) and Chinese as a foreign language (Zhao et al. 2012). The primary goal of CALL systems for learning foreign-language pronunciation is to resolve interference from the learners' first language. For instance, a CALL system can analyze the speech in which a learner reads English sentences aloud and present the pronunciation errors that the learner must read aloud again in order to reduce them.

Even though methods of evaluating learners' pronunciation performance have received considerable attention in previous research, the pronunciation difficulty of sentences (henceforth, pronounceability) has not been examined extensively. Given that readability and listening difficulty influence learners' motivation and outcomes (Hwang 2005, Lai 2015, Yoon et al. 2016), we consider that CALL for pronunciation learning should take pronounceability into account when evaluating learners' pronunciation.

Pronounceability can be represented as the phonetic edit distance from the reference pronunciation to the pronunciation expected of a learner at a given proficiency level. Phonetic edit distance can be measured using a modified version of the Levenshtein edit distance (Wieling et al. 2014) or a deep-neural-network-based classifier (Li et al. 2016).

This study measured normalized edit distance (NED) using the orthographic transcription of learners' pronunciation of reference sentences. An advantage of NED based on orthographic transcription is the availability of data: language teachers can obtain orthographic transcriptions without being trained in phonetic transcription.

This study measures pronounceability using multiple regression analysis, with orthographic NED as the dependent variable and the features of a sentence and a learner as the independent variables. First, a corpus for the multiple regression analysis is developed. This corpus includes the NED data and proficiency data on the score-based scale of the Test of English for International Communication (TOEIC). TOEIC is a widely used English test in Asian countries, and its score ranges from 10 to 990. In previous research (Graham 2015, Delais-Roussarie et al. 2015, Gósy et al. 2015), proficiency was represented on a point scale such as the Common European Framework of Reference for Languages (six levels from A1 to C2).

This study assessed our phonetic learner corpus data by answering the following research questions:

1. How stable is NED as a pronounceability index?
2. To what extent does NED classify learners depending on their proficiency?
3. How strongly does NED correlate with a learner's proficiency?
4. How accurately can NED be measured from linguistic and learner features for pronounceability measurement?

2 Compilation of Phonetic Learner Corpus

2.1 Collection of Pronunciation Data

Our phonetic learner corpus was compiled by recording pronunciation data for English texts that learners read aloud sentence by sentence. In addition, after reading a sentence aloud, learners subjectively rated the pronounceability of the sentence on a five-point Likert scale (1: easy; 2: somewhat easy; 3: average; 4: somewhat difficult; 5: difficult) (henceforth, SBJ).

The texts for reading aloud (Text I is "The North Wind and the Sun" and Text II is "The Boy Who Cried Wolf") were selected from the texts distributed by the International Phonetic Association (International Phonetic Association 1999). Even though these texts contain only 15 sentences, they cover the basic sounds of English (International Phonetic Association 1999, Deterding 2006). This enables us to analyze which types of English sounds influence learners' pronunciation. Deterding (2006) reported that Text I fails to cover certain sounds, such as initial and medial /z/ and syllable-initial /θ/, and therefore developed material covering these sounds by rewriting a well-known fable by Aesop (Text II).

The corpus data were compiled from 50 university learners of English as a foreign language (28 males, 22 females; mean age: 20.8 years; standard deviation, SD, 1.3). The learners were compensated for their participation. In our sample, the mean TOEIC score was 607.7 (SD, 186.2), with minimum and maximum scores of 295 and 900, respectively.

2.2 Annotation of Pronunciation Data

Our phonetic learner corpus includes NED, the linguistic features of sentences, and learner features.

NED was derived as the Levenshtein edit distance normalized by sentence length. It reflects the differences between the reference sentences and the transcription of learners' pronunciation due to the substitution, deletion, or insertion of letters. Before the edit distance was measured, symbols such as commas and periods were deleted and expressions were uncapitalized in both the transcription and the reference data.

The pronunciation was manually transcribed by a transcriber who was a native speaker of English and trained to transcribe interviews and meetings, but who was unaccustomed to the English spoken by learners. The transcriber examined the texts before starting the transcription task. The transcriber was required to replicate learners' pronunciation without adding, deleting, or substituting any expressions to improve grammaticality and/or acceptability (except for the addition of symbols such as commas and periods).

Linguistic features were automatically derived from a sentence as follows. Sentence length was derived as the number of words in a sentence. Word length was derived as the number of syllables in a word. The number of multiple-syllable words in a sentence was derived as $\sum_{i=1}^{n}(S_i - 1)$, where $n$ is the number of words in the sentence and $S_i$ is the number of syllables in the $i$-th word (Fang 1966); this derivation discounts single-syllable words. Word difficulty was derived as the rate of words not listed in a basic vocabulary list (Kiyokawa 1990) relative to the total number of words in a sentence. Table 1 summarizes the linguistic features of the texts that learners read aloud, i.e., text length and the mean (standard deviation, SD) values of sentence length, word length, multiple-syllable words, and word difficulty.

                                     Text I        Text II
Text length (sentences)              5             10
Text length (words)                  113           216
Sentence length (words)              22.6 (8.3)    21.6 (7.6)
Word length (syllables)              1.3 (0.1)     1.2 (0.1)
Multiple-syllable words (syllables)  6.4 (2.8)     5.7 (3.0)
Word difficulty                      0.3 (0.1)     0.2 (0.1)

Table 1: Linguistic features of the texts that learners read aloud.
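For concreteness, the following Python sketch illustrates one way to compute NED and the linguistic features described in this section. It is not the authors' implementation: normalizing the edit distance by the character length of the reference sentence, counting syllables with a vowel-group heuristic, and representing the Kiyokawa (1990) list as a plain set named basic_vocabulary are simplifying assumptions made here for illustration.

# Minimal sketch (not the authors' code): character-level Levenshtein distance
# normalized by reference length, plus the linguistic features of Section 2.2.
import re
from typing import Set

def normalize(text: str) -> str:
    """Lowercase and strip punctuation, as done before measuring edit distance."""
    return re.sub(r"[^\w\s]", "", text.lower()).strip()

def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (substitution, deletion, insertion)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def ned(reference: str, transcription: str) -> float:
    """Edit distance normalized by the cleaned reference sentence length."""
    ref, hyp = normalize(reference), normalize(transcription)
    return levenshtein(ref, hyp) / max(len(ref), 1)

def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowel letters."""
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)

def linguistic_features(sentence: str, basic_vocabulary: Set[str]) -> dict:
    """Sentence length, word length, multiple-syllable count, and word difficulty."""
    words = normalize(sentence).split()
    n = max(len(words), 1)
    syllables = [count_syllables(w) for w in words]
    return {
        "sentence_length": len(words),                        # words per sentence
        "word_length": sum(syllables) / n,                    # syllables per word
        "multi_syllable": sum(s - 1 for s in syllables),      # sum of (S_i - 1)
        "word_difficulty": sum(w not in basic_vocabulary
                               for w in words) / n,           # out-of-list rate
    }

The character-level distance mirrors the description above, where differences arise from the substitution, deletion, or insertion of letters; the paper does not state whether normalization uses characters or words, so the reference's character length is assumed here.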

Learner features were determined using the learners' TOEIC scores for the current or previous year. Even though TOEIC consists of listening and reading tests, it is strongly correlated with the oral proficiency interview developed by the Foreign Service Institute of the U.S. Department of State, a well-established direct assessment of oral language proficiency (Chauncey Group International 1998).

3 Properties of Phonetic Learner Corpus

Our phonetic learner corpus was compiled using the method described in Section 2 and includes 750 instances (15 sentences read aloud by 50 learners). Table 2 shows the descriptive statistics for NED and SBJ in the phonetic learner corpus.

           NED     SBJ
Minimum    0.01    1
Maximum    0.78    5
Mean       0.15    3.03
SD         0.22    0.91
n          750     750

Table 2: Descriptive statistics of NED and SBJ.

The relative frequency distributions of NED and SBJ, in which NED was classified into five levels based on SBJ, are shown in Figure 1. The distributions are dissimilar: the peak of NED appears at pronounceability level 2 ("somewhat easy"), whereas that of SBJ appears at level 3 ("average"). If NED appropriately accounts for pronounceability, learners appear to overvalue pronounceability; on the contrary, if NED fails to explain pronounceability, learners appear to undervalue it. This observation suggests a direction for the improvement of NED.

[Figure 1: Distribution of NED and SBJ. Relative frequency (y-axis) of each pronounceability level (x-axis) for NED and SBJ.]

4 Assessment of NED as a Pronounceability Index

In Sections 4.1, 4.2, and 4.3, research questions 1-3 are addressed using classical test theory (Brown 1996). The fourth question is answered in Section 4.4.

4.1 Reliability of NED

The reliability of NED was examined through internal consistency in terms of Cronbach's α (Cronbach 1970). Internal consistency refers to whether NED yields similar results for sentences with similar pronounceability. Cronbach's α is a reliability coefficient defined as

$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} S_i^2}{S_T^2}\right)$,

where $k$ is the number of items (sentences in this study), $S_i^2$ is the variance associated with item $i$, and $S_T^2$ is the variance associated with the sum of all $k$ item values. The coefficient ranges from 0 (absence of reliability) to 1 (absolute reliability), and empirical satisfaction is achieved with values above 0.8.

As reliability depends on the number of items, the reliability coefficients were derived individually for each text (Text I containing 5 sentences and Text II containing 10 sentences) and jointly for both texts. The reliability coefficients of NED and SBJ are shown in Table 3.

               NED     SBJ
Text I         0.72    0.80
Text II        0.82    0.91
Texts I & II   0.86    0.92

Table 3: Cronbach's α coefficients of NED and SBJ.

The reliability coefficient of NED exceeded the value required for empirical satisfaction (α = 0.8) for Text II and for Texts I & II. Hence, NED is partially reliable as a pronounceability index. However, NED demonstrated lower reliability than SBJ, which suggests that NED should be improved through modification.
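As a concrete illustration of the formula above, the sketch below computes Cronbach's α from a learners-by-sentences matrix of NED (or SBJ) values. The simulated matrix is only a placeholder for the corpus data, so the printed value will not match Table 3.

# Minimal sketch of the Cronbach's alpha computation in Section 4.1, assuming
# `scores` is a (learners x sentences) array of NED or SBJ values.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: shape (n_learners, k_items); returns the alpha coefficient."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)       # S_i^2 for each sentence
    total_variance = scores.sum(axis=1).var(ddof=1)   # S_T^2 of the summed scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Example with 50 learners x 15 sentences of independent random values; real NED
# data yielded 0.86 (Table 3), whereas random items give a value near zero.
rng = np.random.default_rng(0)
ned_matrix = rng.uniform(0.0, 0.8, size=(50, 15))
print(round(cronbach_alpha(ned_matrix), 2))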


4.2 Construct Validity of NED

Construct validity was examined from the viewpoint of distinctiveness. If NED appropriately reflects learners' proficiency, it should show a statistically significant difference (p < 0.01) among learners at different proficiency levels. Our phonetic learner corpus data were classified into three levels based on TOEIC scores: below 490 (beginner level, n = 240), from 490 to below 730 (intermediate level, n = 240), and 730 or above (advanced level, n = 270).

Table 4 shows the mean (SD) values of NED and SBJ for the three levels. The distinctiveness of NED was investigated using ANOVA. ANOVA showed statistically significant differences between the three levels of learners for SBJ (F(2, 747) = 10.13, p < 0.01) but not for NED (F(2, 747) = 0.55, p > 0.01). Thus, NED failed to demonstrate construct validity with respect to TOEIC-based proficiency.

       Beginner level   Intermediate level   Advanced level
NED    0.13 (0.21)      0.12 (0.22)          0.11 (0.21)
SBJ    3.15 (0.95)      3.13 (0.92)          2.83 (0.83)

Table 4: Descriptive statistics of NED and SBJ by proficiency level.

4.3 Criterion-related Validity of NED

Criterion-related validity was examined from the viewpoint of the correlation with learners' proficiency in terms of TOEIC scores. NED should reflect learners' proficiency because pronounceability should depend on proficiency. Accordingly, the correlations between NED and TOEIC scores and between SBJ and TOEIC scores were examined.

NED exhibited a weaker correlation with TOEIC scores (r = -0.04) than SBJ (r = -0.20). Consequently, NED failed to demonstrate criterion-related validity with respect to TOEIC-based proficiency.

4.4 Pronounceability Measurement

Pronounceability was measured through multiple regression analysis. NED was the dependent variable, and the linguistic and learner features described in Section 2 were the independent variables. However, the multiple-syllable word feature was excluded owing to its high variance inflation factor (VIF = 12.3) (Kutner et al. 2002). A significant regression equation was found (F(4, 745) = 124.15, p < 0.01) with an adjusted squared correlation coefficient (R^2) of 0.40, which indicates that the equation explained approximately 40% of the variance in pronounceability.

The contribution of the linguistic and learner features can be observed in the standardized partial regression coefficients; the contribution increases with the absolute value of the coefficient. The standardized partial regression coefficients are summarized in Table 5. A significant contribution is observed for word difficulty but not for the other features. This result contradicts previous research, which reported significant contributions of sentence length and word length in other modes such as readability (Crossley et al. 2017) and listening difficulty (Messerklinger 2006).

Variable          Coefficient
Sentence length   -0.07
Word length        0.06
Word difficulty    0.61*
TOEIC score       -0.04

Table 5: Standardized partial regression coefficients (*p < 0.01).

The pronounceability measurement method was examined n times (n = 750) using a leave-one-out cross-validation test, with one instance as test data and the remaining n - 1 instances as training data. The measured NED exhibited a moderate correlation with the observed NED (r = 0.63). Thus, NED demonstrated a low coefficient of determination and low predictability.
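The sketch below illustrates the regression and the leave-one-out test described in this subsection. The data are simulated placeholders with the same shape as the corpus (750 instances, four predictors after the VIF-based exclusion); the feature columns and values are assumptions for illustration, not the authors' code, so the outputs will not reproduce Table 5.

# Minimal sketch of the Section 4.4 analysis: standardized regression
# coefficients and leave-one-out cross-validation of NED measurement.
import numpy as np
from scipy.stats import pearsonr, zscore
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
n = 750  # 50 learners x 15 sentences

# Simulated stand-ins for the corpus features (the multiple-syllable feature is
# omitted, mirroring the VIF-based exclusion in the paper).
X = np.column_stack([
    rng.normal(22, 8, n),       # sentence length (words)
    rng.normal(1.25, 0.1, n),   # word length (syllables)
    rng.uniform(0, 0.5, n),     # word difficulty (out-of-list rate)
    rng.integers(295, 901, n),  # TOEIC score
])
y = rng.uniform(0.0, 0.8, n)    # observed NED

# Standardized partial regression coefficients: z-score X and y, then fit OLS.
model = LinearRegression().fit(zscore(X, axis=0), zscore(y))
print("standardized coefficients:", model.coef_.round(2))

# Leave-one-out cross-validation: predict each instance from the other n - 1.
predicted = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())
r, _ = pearsonr(predicted, y)
print("correlation between measured and observed NED:", round(r, 2))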
5 Conclusion

This study assessed whether NED appropriately represents pronounceability for learning the pronunciation of English as a foreign language. The assessment suggests that NED is reliable (Section 4.1) but not valid (Sections 4.2 and 4.3). The results of the pronounceability measurement (Section 4.4) suggest that NED was explained mainly by word difficulty.

In future work, we will improve pronounceability measurement in English based on NED and investigate pronounceability measurement for Asian languages as foreign languages.

Acknowledgments

This work was supported by JSPS KAKENHI Grant Numbers 22300299, 15H02940, and 17K18679.

References

Renlong Ai and Feiyu Xu. 2015. A system demonstration of a framework for computer assisted pronunciation training. In Proceedings of the ACL-IJCNLP 2015 System Demonstrations. Association for Computational Linguistics and the Asian Federation of Natural Language Processing, pages 1–6. https://doi.org/10.3115/v1/P15-4001.

James Dean Brown. 1996. Testing in Language Programs. Prentice-Hall, Englewood Cliffs, NJ.

Chauncey Group International. 1998. TOEIC Technical Manual. Chauncey Group International, Princeton, NJ.

Lee Joseph Cronbach. 1970. Essentials of Psychological Testing, 3rd edition. Harper & Row, New York.

Scott A. Crossley, Stephen Skalicky, Mihai Dascalu, Danielle S. McNamara, and Kristopher Kyle. 2017. Predicting text comprehension, processing, and familiarity in adult readers: New approaches to readability formulas. Discourse Processes 54(5–6), pages 340–359. https://doi.org/10.1080/0163853X.2017.1296264.

Elisabeth Delais-Roussarie, Fabián Santiago, and Hi-Yon Yoo. 2015. The extended COREIL corpus: First outcomes and methodological issues. In Proceedings of the Workshop on Phonetic Learner Corpora. Individualized Feedback for Computer-Assisted Spoken Language Learning Project, pages 57–59.

Irving E. Fang. 1966. The "Easy listening formula." Journal of Broadcasting 11(1), pages 63–68. https://doi.org/10.1080/08838156609363529.

Mária Gósy, Dorottya Gyarmathy, and András Beke. 2015. The development of a Hungarian-English learner speech database and a related analysis of filled pauses. In Proceedings of the Workshop on Phonetic Learner Corpora. Individualized Feedback for Computer-Assisted Spoken Language Learning Project, pages 61–63.

Calbert Graham. 2015. Phonetic and prosodic features in automated spoken language assessment. In Proceedings of the Workshop on Phonetic Learner Corpora. Individualized Feedback for Computer-Assisted Spoken Language Learning Project, pages 37–40.

Yukari Hirata. 2004. Computer assisted pronunciation training for native English speakers learning Japanese pitch and durational contrasts. Computer Assisted Language Learning 17(3–4), pages 357–376. https://doi.org/10.1080/0958822042000319629.

Myung-Hee Hwang. 2005. How strategies are used to solve listening difficulties: Listening proficiency and text level effect. English Teaching 60(1), pages 207–226.

International Phonetic Association. 1999. Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet. Cambridge University Press, Cambridge, UK. https://doi.org/10.1017/S0952675700003894.

Hideo Kiyokawa. 1990. A formula for predicting listenability: The listenability of English language materials 2. Wayo Women's University Language and Literature 24, pages 57–74.

Christos Koniaris. 2014. An approach to measure pronunciation similarity in second language learning using Radial Basis Function kernel. In Proceedings of the Third Workshop on NLP for Computer-assisted Language Learning. The Fifth Swedish Language Technology Conference, pages 74–86.

Michael H. Kutner, Christopher J. Nachtsheim, John Neter, and William Li. 2002. Applied Linear Statistical Models, 5th edition. McGraw-Hill/Irwin, New York.

Degang Lai. 2015. A study on the influencing factors of online learners' learning motivation. Higher Education of Social Science 9(4), pages 26–30. https://doi.org/10.3968/7538.

Wei Li, Sabato Marco Siniscalchi, Nancy F. Chen, and Chin-Hui Lee. 2016. Improving non-native mispronunciation detection and enriching diagnostic feedback with DNN-based speech attribute modeling. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. The Institute of Electrical and Electronics Engineers, pages 6135–6139. https://doi.org/10.1109/ICASSP.2016.7472856.

Sze-Chu Liu and Po-Yi Hung. 2016. Teaching pronunciation with computer assisted pronunciation instruction in a technological university. Universal Journal of Educational Research 4(9), pages 1939–1943.

Brian Mak, Manhung Siu, Mimi Ng, Yik-Cheung Tam, Yu-Chung Chan, Kin-Wah Chan, Ka-Yee Leung, Simon Ho, Fong-Ho Chong, Jimmy Wong, and Jacqueline Lo. 2004. PLASER: Pronunciation learning via automatic speech recognition. In Proceedings of the HLT-NAACL Workshop on Building Educational Applications Using Natural Language Processing. Association for Computational Linguistics, pages 1–8. https://doi.org/10.3115/1118894.1118898.

Josef Messerklinger. 2006. Listenability. Center for English Journal 14, pages 56–70.

Martijn Wieling, Jelke Bloem, Kaitlin Mignella, Mona Timmermeister, and John Nerbonne. 2014. Measuring foreign accent strength in English: Validating Levenshtein distance as a measure. Language Dynamics and Change 4, pages 253–269. https://doi.org/10.1163/22105832-00402001.

Silke Maren Witt and Steve Young. 2000. Phone-level pronunciation scoring and assessment for interactive language learning. Speech Communication 30(2–3), pages 95–108. https://doi.org/10.1016/S0167-6393(99)00044-8.

Su-Youn Yoon, Yeonsuk Cho, and Diane Napolitano. 2016. Spoken text difficulty estimation using linguistic features. In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics and the Asian Federation of Natural Language Processing, pages 1–6. https://doi.org/10.18653/v1/W16-0531.

Tongmu Zhao, Akemi Hoshino, Masayuki Suzuki, Nobuaki Minematsu, and Keikichi Hirose. 2012. Automatic Chinese pronunciation error detection using SVM trained with structural features. In Proceedings of the Spoken Language Technology Workshop. The Institute of Electrical and Electronics Engineers, pages 473–478. https://doi.org/10.1109/SLT.2012.6424270.
