INTERSPEECH 2011

Perceptual Quality Dimensions of Text-to-Speech Systems

Florian Hinterleitner (1), Sebastian Möller (1), Christoph Norrenbrock (2), Ulrich Heute (2)

(1) Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany
(2) Digital Signal Processing and System Theory, CAU Kiel, Germany
{florian.hinterleitner, sebastian.moeller}@telekom.de, {cno, uh}@tf.uni-kiel.de

DOI: 10.21437/Interspeech.2011-570
Copyright © 2011 ISCA. 28-31 August 2011, Florence, Italy

Abstract

The aim of this paper is to analyze the perceptual quality dimensions of state-of-the-art text-to-speech systems (TTS). Therefore, several pretests were conducted to determine a suitable set of attribute scales. The resulting 16 scales were used in a semantic differential on a diverse database containing 16 different TTS systems. A subsequent multidimensional analysis (Principal Axis Factor analysis with Promax rotation) resulted in three underlying quality dimensions. They were labeled naturalness, disturbances, and temporal distortions. A mapping of these factors onto the perceived overall quality revealed that naturalness contributes the most to the quality of TTS signals.

Index Terms: , quality dimensions, multidimensional analysis

1. Introduction

Naturalness has always been the major weakness of TTS systems. However, improvements over the past years have led to a notable increase in quality, which allows TTS systems to be used for a number of applications, e.g. short message services, information systems, or smart-home assistants. Still, modern TTS systems suffer from diverse quality constraints, ranging from concatenation artefacts to difficulties in word and sentence intonation. With the rise of new applications further improvements will be necessary. Thus, methods to efficiently assess different quality dimensions are an important tool.

Depending on the quality aspect to be assessed, different kinds of listening tests can be carried out: articulation and intelligibility tests assess whether the synthetic speech signal is able to carry information on a segmental or supra-segmental level [1]; comprehension tests measure whether human listeners can comprehend the content provided via the presented TTS signals [2]; and overall quality tests, as recommended in ITU-T P.85 [3], capture different quality aspects of the signal, e.g. naturalness, listening effort and overall impression. Though doubts have been cast on the test protocol [4] [5], the method described in ITU-T Rec. P.85 is still the most common way to evaluate TTS systems.

However, to evaluate the entire perceptual space of test subjects, a multidimensional analysis has to be performed. Different studies have been carried out to determine the underlying quality dimensions. In [6] a pilot study with multidimensional scaling (MDS) on TTS data generated by the Festival synthesizer led to a three-dimensional space. Since only stimuli of one unit-selection synthesizer were presented in that test, the results cannot be generalized. Kraft and Portele [7] evaluated five German TTS systems in a series of tests and came up with two dimensions representing prosodic and segmental attributes. Given that their study was carried out in 1995, distortions from modern TTS systems, e.g. unit-selection and HMM-based synthesizers, could not be evaluated.

The aim of our research is to assess the inherent quality dimensions of several state-of-the-art TTS systems. This will provide a deeper insight into how test subjects perceive modern TTS quality. Our study follows the approach presented in [8], which analyzed perceptual quality dimensions of modern telephone connections via different multidimensional analysis techniques. The pros and cons of these methods are discussed in Section 2. Section 3 presents the TTS database and the series of tests that were conducted. An evaluation via factor analysis of the collected data is performed in Section 4. The resulting quality dimensions are discussed in Section 5. Finally, Section 6 summarizes the main results and gives a perspective on future work.

2. Multidimensional analysis

To reveal a mapping of the perceptual space of human listeners, different analysis methods can be used. MDS [9] uses paired comparison tests to create a stimulus space which is then reduced in dimensionality. The drawback of this approach is that it is restricted to a small set of stimuli. Moreover, no hints are given for the interpretation of the resulting perceptual space.

Therefore we opted for a semantic differential (SD). It uses predefined attribute scales to measure the auditory impression of the listeners. This guarantees a direct relation between the attribute scales used and the derived quality dimensions, and thus an easier interpretation. On the downside, due to the given set of scales, this approach cannot guarantee that all relevant perceptual dimensions are actually solicited from the test participants.

To reduce the influence of the test designers to a minimum, a suitable set of scales has to be developed through several pretests. In pretest 1, attributes describing the auditory impression of the listeners are collected. These terms are converted into scales and presented in a second pretest. An analysis of the second pretest data leads to a final selection of scales which are presented in the final SD experiment. On the basis of these attribute ratings, orthogonal factors can be derived with the help of a factor analysis. The realisation of this test is described in the following section, and the results of the factor analysis are discussed in Section 4.

3. Experimental setup

This section gives an overview of the database of speech synthesizers collected for the listening tests. Moreover, it describes the approach used to obtain a relevant set of attribute scales that describe the perceptual space of TTS systems in a more-or-less complete way.

3.1. Test database

10 German sentences from the EUROM.1 corpus [10] were chosen as source material. Since place names, proper names and words from foreign languages often use special pronunciation rules and thus cause trouble for speech synthesizers, the selected sentences did not contain any of these. To avoid user fatigue but still guarantee a valid impression of the occurring distortions, the sentences were shortened to a length of about 10 s each.

To capture a broad variety of distortions we generated synthetic speech files from 14/15 different TTS systems for female/male speakers, for some of them with up to 6 different voices. Thus, data from 35/28 different configurations (female/male) could be produced. Besides the synthetic speech files the database also contains stimuli from 4/4 amateur (female/male) and 4/4 professional (female/male) natural speakers. All speech files were downsampled to 16 kHz and level normalized to -26 dBov using the speech-level meter [11].

The database contains speech material synthesized by the following systems: Acapela Infovox3, AT&T Natural Voice, atip Proser, BOSS, Cepstral Voices, Cereproc CereVoice, DRESS, Loquendo, MARY bits, MARY hmm-bits, MARY MBROLA, NextUp Talker, NextUp TextAloud3, Nuance RealSpeak, SVOX, and SyRUB.

3.2. Pretest 1

The objective of pretest 1 was to collect a broad basis of attributes describing auditory features of synthetic speech. Therefore audio files from 12/13 different TTS systems with female/male voices plus 2 different natural speakers per gender were presented. 12 (4 female, 8 male) expert listeners from Deutsche Telekom Laboratories in Berlin took part in the test.

The stimuli were presented in a quiet conference room environment via headphones (AKG K601) in randomized order. Two sessions were conducted, one with female and one with male voices, with a break of 5 min in between. Every TTS system was covered with 2 stimuli. The listeners were instructed to write down nouns, adjectives and antonym pairs describing their auditory impression. Furthermore they were asked to give an intensity rating for each attribute on a scale ranging from 1 to 10.

The listening test resulted in 2179 collected terms, out of which 296 unique descriptions were found. These attributes were condensed into 44 scales. Attribute scales that mainly rate features concerning individual voice character and accent and those that rate the same perceptual features were omitted. The remaining scales were weighted by frequency of occurrence, and the 28 most frequently named ones were chosen for pretest 2.

3.3. Pretest 2

To narrow down the set of attribute scales from pretest 1 to a manageable number a second pretest was conducted. Here, audio files from 19/20 different configurations of TTS systems with female/male voices were presented. 9 expert listeners (3 female, 6 male) from the Deutsche Telekom Laboratories in Berlin and 13 naïve listeners (8 female, 5 male, mainly students from the TU Berlin) took part in the test. All naïve listeners were paid for their participation.

The purpose of this test was to find the set of quality scales that were most suitable for the final SD experiment and thus describe the quality stimulus space most precisely. Therefore, the subjects were instructed to use only the scales that were most relevant for their auditory impression. The stimuli were presented in randomized order in two sessions, one with female voices and one with male voices, with a 5 min break in between. The stimuli were presented via headphones (AKG K601) in a quiet environment.

To narrow down the number of attribute scales, we omitted the scale unnatural melody vs. natural melody, which correlated highly (R > 0.60) with the other scales that rate naturalness and thus measures similar features. Moreover, scales that were used rather rarely were dropped.

In order to gain a first impression of the perceptual space a Principal Component Analysis (PCA) with Varimax rotation was performed on the remaining scales (= items) and 3 factors were extracted. Subsequently all items with high loadings on multiple factors and items with communalities < 0.45 were discarded. This led to the following 16 attribute scales: artificial vs. natural, bumpy vs. not bumpy, clinking vs. not clinking, distorted vs. undistorted, fast vs. slow, hissing vs. not hissing, interrupted vs. continuous, noisy vs. not noisy, raspy vs. not raspy, several voices vs. one voice, tense vs. calm, undisturbed vs. disturbed, unintelligible vs. intelligible, unnatural accentuation vs. natural accentuation, unnatural rhythm vs. natural rhythm, unpleasant vs. pleasant (translations from German wordings).

Abbr.  Provider    Synthesizer                Female  Male
BOS    RFW Bonn    BOSS                       1       -
DRE    TU Dresden  DRESS                      1       1
BIT                MARY bits                  1       1
HMM                MARY hmm-bits              1       1
MBR                MARY MBROLA                1       1
SYR    RU Bochum   SyRUB                      -       1
CS1                Commercial synthesizer 1   1       1
CS2                Commercial synthesizer 2   2       1
CS3                Commercial synthesizer 3   1       1
CS4                Commercial synthesizer 4   1       1
CS5                Commercial synthesizer 5   1       1
CS6                Commercial synthesizer 6   1       1
CS7                Commercial synthesizer 7   1       1
CS8                Commercial synthesizer 8   1       1
CS9                Commercial synthesizer 9   1       1
CS10               Commercial synthesizer 10  -       1

Table 1: Synthesizers and voice configurations used in the main test.

3.4. Main test

For the main test a set of 15 different synthesizer configurations per gender was chosen (see Table 1). The commercial systems had to be anonymized and were labeled commercial synthesizer 1 to 10. For each system 2 different stimuli were presented. The test was split into 3 parts: since the subjects were not familiar with the quality and the degradations of TTS signals, a training with 3 different stimuli covering the whole quality range of the TTS signal database was conducted first. In the second and third part female and male stimuli were presented, or vice versa, with a 5 min break in between.

30 naïve subjects (15 female, 15 male, mean age: 27.9 years) took part in the test. Most of them were students from the local university. None of them had any known hearing disabilities. All subjects were paid for their participation. The stimuli were presented via headphones (AKG K601) in a quiet environment.

After listening to each stimulus the participants had to rate the overall impression of the signal on a continuous scale ranging from bad to excellent (resulting in a Mean Opinion Score, MOS). Subsequently a quality estimate for each of the attribute scales determined in pretest 2 had to be adjusted via a slider presented on the test GUI.

After the test a boxplot containing the ratings of all participants on all attribute scales was generated for every stimulus. The outliers per subject were counted and participants with more than 5% outlier ratings were excluded.

4. Analysis of the SD data

The following section describes the factor analysis that was used to come up with an interpretable perceptual space. Moreover, the resulting quality dimensions are analyzed.

Item                                              Loadings on factors 1 to 3, in column order
unnatural accentuation vs. natural accentuation   0.926
artificial vs. natural                            0.901
unnatural rhythm vs. natural rhythm               0.891
unpleasant vs. pleasant                           0.772
tense vs. calm                                    0.658
bumpy vs. not bumpy                               0.583
distorted vs. undistorted                         0.447  0.315
hissing vs. not hissing                           0.752
noisy vs. not noisy                               0.651
raspy vs. not raspy                               0.591
undisturbed vs. disturbed                         0.486  0.265
clinking vs. not clinking                         0.267  0.381
several voices vs. one voice                      0.792
unintelligible vs. intelligible                   0.214  0.574
interrupted vs. continuous                        0.290  0.495

Table 2: Factor pattern matrix.

4.1. Factor analysis

A first analysis of the data collected in the main test showed that the item fast vs. slow almost always loaded exclusively on a single factor. Thus all other items created the remaining dimensions. To split these dimensions up and enable a more detailed view of the perceptual space, the item fast vs. slow was discarded from further analysis.

A Principal Axis Factor analysis (PAF) of the remaining 15 attributes revealed 3 factors. Separate PAFs for female and male stimuli showed a similar factor structure, thus one analysis over the whole dataset seemed sufficient. The 3 factors account for 61.47% of the total variance. This value could not be increased significantly by extracting more than 3 factors. Residuals were computed between the observed and reproduced correlations: 5 (4%) were nonredundant with absolute values greater than 0.05.

Since we assumed correlated quality dimensions we subsequently opted for an oblique rotation method (Promax rotation with κ = 4). The value of the accounted variance after rotation will not be analyzed because of a massive overestimation due to correlated factors.

The factor analysis resulted in the factor pattern matrix shown in Table 2. For clarity, values below 0.2 were suppressed. Due to the oblique Promax rotation the resulting factors are not orthogonal. Factors 1 and 2 reach a correlation of 0.447, factors 1 and 3 correlate with 0.734, and factors 2 and 3 with 0.522.

4.2. Resulting quality dimensions

In order to obtain a meaningful interpretation of the quality dimensions, items with high cross-loadings, meaning similar loadings on multiple factors, have to be excluded before interpretation (|loading on factor A - loading on factor B| < 0.2). In our case this applies to the items distorted vs. undistorted and clinking vs. not clinking.

Factor 1 is highly correlated with the items unnatural accentuation vs. natural accentuation, artificial vs. natural and unnatural rhythm vs. natural rhythm, thus it represents the naturalness of the TTS signal. The items with high loadings on factor 2 (hiss, noise, rasping sound) are all related to disturbances in the signal. Factor 3 seems to reflect temporal distortions, e.g. concatenation artefacts which occur in unit-selection synthesis. The effect of the item polyphony (several voices vs. one voice), which contributes the most to this dimension, can be witnessed when two units with slightly different speeds get connected. This creates the impression of two different voices speaking at the same time.

Figure 1 shows a graphical representation of Table 2. A TTS signal with high values on all 3 dimensions is perceived as very natural, not disturbed and not temporally distorted. The two items with high cross-loadings (distortions, clink) in the factor pattern matrix also stand out here. Both only reach very low values on all three dimensions, thus they do not contribute much to any of them. Furthermore the item bumpiness is not only correlated with naturalness but also with the dimension temporal distortions (with a Pearson correlation coefficient of 0.594). This is hardly surprising since temporal distortions can be perceived as bumps in the speech signal.

Moreover, as an effect of the oblique rotation, all factors are correlated. Especially factor 1 and factor 3 show a very high correlation. This means that a very natural sounding TTS system will most likely be bound to the impression of a single speaker.

[Figure 1: Mapping of the quality scales in three dimensional perceptual space. Axes: dimension 1, dimension 2, dimension 3.]

5. Results and discussion

The mapping of the stimuli in the perceptual space can be seen in Figure 2. Figure 2(a) displays the values of the different systems for the dimensions 1 (naturalness) and 2 (disturbances), in which the subscripted character (f/m) represents the speaker gender. Independent of the speaker's gender most synthesizers build clusters, e.g. BIT, CS7, CS9. It is striking however that some systems obviously do not stick together. CS1f and CS1m for instance are clearly separated, with differing values for all three dimensions. This indicates that the speech material used for the female CS1 system features disturbances that could not be found in the corresponding male data. Moreover, some systems reach high values on one axis but only low ratings on the other, e.g. HMM is perceived as natural but also disturbed. Besides, systems with equal values in the disturbances dimension, for instance BOSf, CS4m and CS5f, can be perceived as natural sounding as well as artificial.

Figure 2(b) shows the perceptual space of dimension 3 (temporal distortions) and 2 (disturbances). Again the clustering effect can be observed. However, both figures reveal that undistorted stimuli always get rated as natural as well as not temporally distorted. In contrast, natural and not temporally distorted synthesizers can also be perceived as disturbed (e.g. HMM).

An analysis of the correlations of the extracted dimensions with the rating on the overall impression scale yields correlations of 0.806 for dimension 1, 0.469 for dimension 2, and 0.560 for dimension 3. Therefore the naturalness factor accounts the most for the overall impression rating.

[Figure 2: Mapping of the stimuli in the perceptual space. (a) Mean factor values per synthesizer for dimension 1 (naturalness) and 2 (disturbances). (b) Mean factor values per synthesizer for dimension 3 (temporal distortions) and 2 (disturbances).]

6. Conclusions and future work

An auditory experiment with 16 different TTS systems was carried out. Two pretests determined 16 attribute scales that were used in a semantic differential. A subsequent multidimensional analysis yielded three underlying quality dimensions. Those were labeled naturalness, disturbances and temporal distortions. The naturalness dimension appears to be the most important one when it comes to perceived overall impression. This factor covers different aspects of quality. A subsequent study including auditory experiments is planned and could break up this dimension into several subdimensions.

Moreover, previous research towards instrumental quality prediction of synthetic speech [12] analyzed the performance of different prediction algorithms. The aim of future work will be to improve these approaches and to adjust them in order to predict the identified quality dimensions.

7. Acknowledgements

The present study was carried out at Deutsche Telekom Laboratories, Berlin. It was supported by the Deutsche Forschungsgemeinschaft (DFG), grants MO 1038/11-1 and HE 4465/4-1. The authors would like to thank Steffen Werner from Daimler AG, Jan de Moortel from Nuance, Donata Moers from University of Bonn and Guntram Strecha from University of Dresden for their support.

8. References

[1] R. Van Bezooijen and V. van Heuven, “Assessment of speech output systems,” in Handbook of Standards and Resources for Spoken Language Systems, D. Gibbon, R. Moore, and R. Winski, Eds. Berlin: Mouton de Gruyter, 1997, pp. 481–563.
[2] C. Delogu, S. Conte, and C. Sementina, “Cognitive factors in the evaluation of synthetic speech,” Speech Communication, 1998, pp. 153–168.
[3] ITU-T Rec. P.85, A Method for Subjective Performance Assessment of the Quality of Speech Voice Output Devices, International Telecommunication Union, Geneva, 1994.
[4] M. Viswanathan and M. Viswanathan, “Measuring speech quality for text-to-speech systems: Development and assessment of a modified Mean Opinion Score (MOS) scale,” Computer Speech and Language, vol. 19, pp. 55–83, 2005.
[5] D. Sityaev, K. Knill, and T. Burrows, “Comparison of the ITU-T P.85 standard to other methods for the evaluation of text-to-speech systems,” Proc. 9th International Conference on Spoken Language Processing (Interspeech 2006 - ICSLP), pp. 1077–1080, 2006.
[6] C. Mayo, R. A. J. Clark, and S. King, “Multidimensional scaling of listener responses to synthetic speech,” Proceedings of the 6th Annual Conference of the ISCA (Interspeech 2005). International Speech Communication Association (ISCA), pp. 1725–1728, 2005.
[7] V. Kraft and T. Portele, “Quality evaluation of five German speech synthesis systems,” Acta Acustica 3, 1995.
[8] M. Wältermann, K. Scholz, A. Raake, U. Heute, and S. Möller, “Underlying quality dimensions of modern telephone connections,” Proceedings of the 7th Annual Conference of the ISCA (Interspeech 2006). International Speech Communication Association (ISCA), 2006.
[9] J. Kruskal and M. Wish, “Multidimensional scaling,” Sage University Paper series on Quantitative Application in the Social Sciences, vol. 07-11, 1978.
[10] D. Chan, A. Fourcin, D. Gibbon, B. Grandstrom, M. Huckvale, G. Kokkonakis, K. Kvale, L. Lamel, B. Lindberg, A. Moreno, J. Mouropoulos, F. Senia, I. Trancoso, C. Veld, and J. Zeiliger, “EUROM- A Spoken Language Resource for the EU,” Proceedings of the 4th European Conference on Speech Communication and Technology (EUROSPEECH 1995), pp. 867–870, 1995.
[11] ITU-T Rec. P.56, Objective Measurement of Active Speech Level, International Telecommunication Union, Geneva, 1993.
[12] S. Möller, F. Hinterleitner, T. Falk, and T. Polzehl, “Comparison of approaches for instrumentally predicting the quality of text-to-speech systems,” Proceedings of the 11th Annual Conference of the ISCA (Interspeech 2010). International Speech Communication Association (ISCA), pp. 1325–1328, 2010.
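
The cross-loading criterion of Section 4.2 (an item is excluded when its loadings on two factors differ by less than 0.2) can be checked against the values printed in Table 2. The following is a minimal Python sketch under stated assumptions, not the authors' analysis script: the dictionary layout and the function names `has_cross_loading` and `excluded_items` are illustrative, and only loadings of at least 0.2 are available because smaller values were suppressed in the table. Which factor each value belongs to is not needed for this check.

```python
from itertools import combinations

# Loadings as printed in Table 2 (values below 0.2 were suppressed in the paper).
# Hypothetical layout: item name -> list of reported loadings, in column order.
TABLE2_LOADINGS = {
    "unnatural accentuation vs. natural accentuation": [0.926],
    "artificial vs. natural": [0.901],
    "unnatural rhythm vs. natural rhythm": [0.891],
    "unpleasant vs. pleasant": [0.772],
    "tense vs. calm": [0.658],
    "bumpy vs. not bumpy": [0.583],
    "distorted vs. undistorted": [0.447, 0.315],
    "hissing vs. not hissing": [0.752],
    "noisy vs. not noisy": [0.651],
    "raspy vs. not raspy": [0.591],
    "undisturbed vs. disturbed": [0.486, 0.265],
    "clinking vs. not clinking": [0.267, 0.381],
    "several voices vs. one voice": [0.792],
    "unintelligible vs. intelligible": [0.214, 0.574],
    "interrupted vs. continuous": [0.290, 0.495],
}


def has_cross_loading(loadings, threshold=0.2):
    """Return True if any two reported loadings differ by less than `threshold`."""
    return any(abs(a - b) < threshold for a, b in combinations(loadings, 2))


def excluded_items(table, threshold=0.2):
    """List the items that the cross-loading criterion would exclude."""
    return [item for item, loadings in table.items()
            if has_cross_loading(loadings, threshold)]


if __name__ == "__main__":
    for item in excluded_items(TABLE2_LOADINGS):
        print(item)
```

Applied to the printed loadings, the check flags the same two items that Section 4.2 names as excluded. An item such as undisturbed vs. disturbed is kept because its two loadings differ by 0.221.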

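
The participant screening described at the end of Section 3.4 (boxplot outliers counted per subject, subjects with more than 5% outlier ratings excluded) could be sketched as follows. The paper does not state how boxplot outliers were defined, so this sketch assumes the common convention of values lying more than 1.5 interquartile ranges outside the quartiles, and it assumes that the 5% refers to the share of a participant's own ratings. The data layout and the function names `tukey_fences`, `outlier_share` and `excluded_participants` are hypothetical.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Lower and upper boxplot fences (assumed convention: k * IQR beyond the quartiles)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outlier_share(ratings, k=1.5):
    """Share of each participant's ratings that fall outside the fences.

    `ratings` maps (stimulus, scale) -> {participant: rating}; one boxplot
    is formed per (stimulus, scale) cell over all participants.
    """
    outliers, totals = {}, {}
    for by_participant in ratings.values():
        low, high = tukey_fences(list(by_participant.values()), k)
        for participant, value in by_participant.items():
            totals[participant] = totals.get(participant, 0) + 1
            if value < low or value > high:
                outliers[participant] = outliers.get(participant, 0) + 1
    return {p: outliers.get(p, 0) / totals[p] for p in totals}


def excluded_participants(ratings, max_share=0.05, k=1.5):
    """Participants whose outlier share exceeds `max_share` (5% in Section 3.4)."""
    shares = outlier_share(ratings, k)
    return sorted(p for p, share in shares.items() if share > max_share)
```

The threshold and the fence factor are exposed as parameters so that a different outlier convention can be substituted without changing the counting logic.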