Journal of Phonetics 77 (2019) 100915


Special Issue: Integrating Phonetics and Phonology, eds. Cangemi & Baumann

Exemplar-theoretic integration of phonetics and phonology: Detecting prominence categories in phonetic space

Antje Schweitzer

Institute for Natural Language Processing, University of Stuttgart, Germany

Article history: Received 30 March 2018; Received in revised form 26 July 2019; Accepted 26 July 2019

Keywords: Prominence; Intonation; Pitch accents; Exemplar theory; Clustering; Prosodic categories

Abstract: This article explores an exemplar-theoretic approach to the integration of phonetics and phonology in the prosodic domain. In an exemplar-theoretic perspective, prominence categories, here specifically pitch-accented syllables and unaccented syllables, are assumed to correspond to accumulations of similar exemplars in an appropriate perceptual space. It should then be possible, as suggested for instance by Pierrehumbert (2003), to infer the (phonological) prominence categories by clustering speech data in this (phonetic) space, thus modeling acquisition of prominence categories according to an exemplar-theoretic account. The present article explores this approach on one American English and two German databases. The experiments extend an earlier study (Schweitzer, 2011) by assuming more acoustic-prosodic dimensions, by excluding higher-linguistic or phonological dimensions, and by suggesting a procedure that adjusts the space for clustering by modeling the perceptual relevance of these dimensions relative to each other. The procedure employs linear weights derived from a linear regression model trained to predict categorical distances between prominence categories from phonetic distances, using prosodically labeled speech data. It is shown that clusterings obtained after adjusting the perceptual space in this way exhibit a better cluster-to-category correspondence that is comparable to the one found for vowels, and that both the detection of vowel categories and the detection of prominence categories benefit from the perceptual adjustment.

© 2019 Elsevier Ltd. All rights reserved.

1. Introduction sometimes led to different linguistic interpretations depending on their phonetic implementation, i.e. meaning depended on Within the autosegmental-metrical (AM) framework, recent continuous phonetic parameters and not only on phonological work has emphasized that intonation research should consider categories. Similarly, Cangemi and Grice (2016) observed dif- phonological as well as continuous aspects (Cangemi & Grice, ferent amounts of variation in phonetic implementation 2016; Grice, Ritter, Niemann, & Roettger, 2017). It is not new to depending on linguistic meaning. They argue for a distribu- investigate continuous parameters in the AM tradition (see, tional approach which views phonological categories as clus- e.g., Arvaniti, Ladd, & Mennen, 1998; Barnes, Veilleux, ters in multidimensional phonetic space. Brugos, & Shattuck-Hufnagel, 2012; Kügler & Gollrad, 2015; This idea is consistent with exemplar theory (e.g. Johnson, Liberman & Pierrehumbert, 1984; Peters, Hanssen, & 1997; Pierrehumbert, 2003, 2016): Exemplar-theoretic work in Gussenhoven, 2015); however the investigation of phonetic the area of speech assumes that all instances of speech that implementation in terms of continuous parameters was usually listeners perceive are stored and retained in . Percep- focused on the role of these parameters in motivating and tually similar exemplars are stored close together, and thus implementing the categories, i.e. on systematic variation phonological categories are expected to form clusters in multi- between categories. In contrast, Grice et al. (2017) found dimensional perceptual space because their exemplars should within-category variation of these parameters, and showed that all be perceptually very similar. Phonological knowledge then this variation could be related to different linguistic functions. In consists of the unconsciously acquired “implicit knowledge” their experiment, pitch accents of the same category (Pierrehumbert, 2016, p.34) encoded in the stored exemplars of each category. Phonological knowledge thus integrates pho- netic perceptual knowledge in a natural way, in the form of E-mail address: [email protected] https://doi.org/10.1016/j.wocn.2019.100915 0095-4470/Ó 2019 Elsevier Ltd. All rights reserved. 2 A. Schweitzer / Journal of Phonetics 77 (2019) 100915 each category’s probability distribution over perceptual space vant dimensions is “buried” under irrelevant variation in many (Pierrehumbert, 2003). other dimensions. The present study therefore limits itself to Exploring the idea that phonological categories form clus- those dimensions that are confirmed to be most relevant per- ters in perceptual space, a previous study (Schweitzer, 2011) ceptually. Second, most similarity measures give equal impor- had conducted experiments on a prosodically annotated Ger- tance to all dimensions. However, this may not be perceptually man speech corpus. If exemplars of pitch accents form clus- adequate—listeners may be more susceptible to small differ- ters of similar exemplars for each category, then the ences in one parameter than in the other. Indeed, the percep- acquisition of pitch accent categories could be “simulated” by tual relevance of dimensions such as F0 rise amplitude, F0 applying clustering algorithms to speech data. In the experi- peak alignment, or F0 rise or fall steepness relative to each ments however there was by far no one-to-one correspon- other has not been thoroughly investigated. 
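The cluster-to-category correspondence referred to above can be illustrated with a small sketch in R (the language used for the analyses later in this article): given a vector of cluster assignments and the corresponding manual prosodic labels, the size-weighted share of the majority category per cluster quantifies how well clusters line up with categories. This is only an illustration of the general idea, not the exact accuracy-based measure of Schweitzer (2011); the data and object names below are invented.

```r
# Sketch: cluster-to-category correspondence as majority-category share.
# 'clusters' and 'labels' are illustrative stand-ins for the cluster
# assignment of each syllable and its manual prosodic label.
cluster_purity <- function(clusters, labels) {
  per_cluster <- tapply(labels, clusters, function(l) max(table(l)) / length(l))
  sizes <- table(clusters)                # weight each cluster by its size
  sum(per_cluster * sizes / sum(sizes))
}

# Toy usage with made-up data: 3 clusters, two of them fairly pure.
clusters <- c(1, 1, 1, 2, 2, 2, 3, 3)
labels   <- c("H*L", "H*L", "L*H", "NONE", "NONE", "NONE", "L*H", "H*L")
cluster_purity(clusters, labels)          # 0.75 for this toy example
```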
The present study dence between automatically derived clusters and addresses this issue by exploring a new way of scaling the categories. The best correspondence between categories acoustic-prosodic space in a way that it becomes perceptually and clusters was achieved when allowing for 1000–2000 clus- more adequate and then comparing clustering results on the ters. Then, more than 80% of the accented (or unaccented) original vs. the perceptually adjusted data. This is of particular syllables within a cluster would correspond to the same proso- importance here since the acoustic-prosodic space will include dic category. However, assuming more than 1000 clusters for a comparably high number of perceptual dimensions, including just a few pitch accent categories seemed inappropriate: with aspects beyond F0, since it is well known that factors five to ten pitch accent categories, this would amount to 200 other than F0 are also related to pitch accents (e.g., Bolinger, clusters on average per category. Of course, pitch accent cat- 1957; Campbell & Beckman, 1997; Kochanski, Grabe, egories may exhibit variation depending on their positional Coleman, & Rosner, 2005; Niebuhr & Pfitzinger, 2010; Okobi, context, analogous to the variation evidenced in allophones 2006; Turk & White, 1999). Nevertheless, the relative in the segmental domain. Indeed, it is well known that context importance of these dimensions is not known beforehand. influences pitch accent implementation: Pitch accent shape The present article thus contributes to the overarching topic depends on segmental structure, syllable structure, and/or of this Special Issue by exploring the perceptual basis of position in phrase for instance for English (e.g., Silverman & prominence in terms of pitch-accentedness from an Pierrehumbert, 1990; van Santen & Hirschberg, 1994) and exemplar-theoretic perspective. This allows for a natural inte- Spanish (e.g., Prieto, van Santen, & Hirschberg, 1995). These gration of phonological form, i.e. the pitch accent categories, effects are not universal: Work by Grabe (1998) for instance to phonetic substance, i.e. the phonetic implementation in suggests that the same factors can condition such variation terms of perceptually motivated phonetic dimensions that con- in two languages, but with different results: both in German stitute the space for storing memory traces of perceived pitch- and British English, upcoming phrase boundaries affect the accented and unaccented syllables. phonetic implementation of falls and rises, but in English this Regarding the definition of prominence, I assume that results in compressed (i.e. steeper) rises and falls, whereas prominence is a property of linguistic units which makes them in German, rises are compressed, but falls are truncated. For salient in perception relative to other units at the same level (cf. German, vowel height has also been shown to influence peak Terken & Hermes, 2000). Similarly, the editors of this Special height in H*L accents (Jilka and Möbius, 2007), and the earlier Issue in their Call for Papers stated that “prominence is a rela- study had identified position in word as affecting peak align- tional property that refers to any unit of speech that somehow ment in L*H accents (Schweitzer, 2011). Thus numerous seg- ‘stands out’”. 
In the present article, I am looking at prominence mental and prosodic factors can govern the implementation of in terms of pitch accent categories, and thus at the relative pitch accents, and so different clusters may simply represent prominence of syllables bearing different types of pitch accents context-dependent implementation variants of categories, like over other syllables at sentence level. Thus the definition of positional allophones in the segmental domain. However, I prominence assumed here is the presence of a pitch accent. would claim that 200 phonetically distinct positional variants This is consistent not only with a view from which all pitch- of each category still constitute an unexpectedly high number, accented syllables are prominent, while all unaccented sylla- given that in the segmental domain for instance, there are usu- bles are non-prominent, but also with the idea that different ally no more than just a few allophones assumed per phoneme pitch accent categories can lead to different degrees of promi- (such as, say, an aspirated, an unaspirated, and a glottalized nence (e.g. Baumann & Röhr, 2015; Cole et al., 2019, this Spe- version of an underlyingly unvoiced stop). cial Issue). The present study extends the earlier experiments, this time Please note that while it is certainly uncontroversial that using data from two German databases of read speech pitch accents at least in Germanic languages lend prominence (Barbisch, Dogil, Möbius, Säuberlich, & Schweitzer, 2007), at sentence level (e.g., Bolinger, 1958; Gussenhoven & as well as American English data from the Boston Radio News Rietveld, 1988; Rietveld & Gussenhoven, 1985), pitch- Corpus (Ostendorf, Price, & Hufnagel, 1995). The difficulty in accentedness is not exactly equivalent to prominence in those finding reasonable cluster-to-category correspondences may languages. Pitch-accentedness is taken to be a categorical have been due to several problems, and the present study phenomenon by many scholars (e.g., Bruce, 1977; Ladd, suggests solutions to these problems: First of all, clustering 1996; Pierrehumbert, 1980), while prominence is sometimes algorithms quantify similarity by some distance measure, usu- assumed to be more gradient (see for instance the discussion ally by Euclidian distance across all dimensions. However, of prominence scales by Wagner, Ćwiek, & Samlowski (2019)). considering irrelevant dimensions in clustering introduces However, the approach taken in the present study requires noise: In the worst case, the meaningful variation in few rele- the investigation of prominence categories rather than that of A. Schweitzer / Journal of Phonetics 77 (2019) 100915 3 more gradient levels of prominence since only the former can 2016) models, assume further that the exemplars are labeled per definition be expected to be separable in perceptual space. with category information. This assumption constitutes the link In support of this categorical approach, the strong relation between the stored phonetic detail (the “substance”) and the between possibly gradient prominence at sentence level and phonological category (the phonological “form”), and allows pitch accent in German is corroborated by findings in for the natural integration of the two alluded to above. 
Baumann and Winter (2018), who investigate phonetic and It is not entirely clear yet what the units are when storing phonological factors in the perception of prominence by naïve exemplars—most models are not very explicit about this.1 listeners and find that while a number of acoustic and discrete However, the models that assume category labels do state what linguistic factors are related to prominence as perceived by the these categories are. For instance, Lacerda (1995) models listeners, pitch accent type and pitch accent position are most vowel classification and assumes at least vowel categories as predictive of listeners’ judgments; similarly Cole et al. (2019, labels. Johnson (1997) states that “the set of category labels this Special Issue) confirm a strong relationship between pitch includes any classification that may be important to the per- accents and prominence for English, Spanish, and French, in ceiver, and which was available at the time the exemplar was that pitch-accentedness is clearly reflected in prominence rat- stored” (p. 147). ings of untrained listeners for these languages. Thus while it He mentions name of the speaker and gender as possible may be a simplification to equate prominence to pitch accent, category labels, in addition to linguistic categories. and non-prominence to absence of pitch accent here, it is not Pierrehumbert (2003) exemplifies her model using vowel cate- an ad hoc one: the relation between prominence and pitch- gories as labels for illustration, whereas her later model accentedness is well established. (Pierrehumbert, 2016) assumes word categories. Walsh et al. To conclude this introduction, the aim of this contribution is (2010) assume category labels at least at segment, syllable, to investigate whether categories can be detected based on and word level. their distribution in perceptual space. To this end, I investigate Fig. 1 is a rough sketch of what exemplar clouds could look the substance, i.e. those dimensions that are expected to be like, illustrated in an only two-dimensional space, along with relevant in pitch accent implementation. The same approach some category labels—in this case, with word category labels. could be taken to investigate prominence in terms of word I used words as the units for this sketch, to be consistent with stress, i.e. prominence of one syllable relative to other sylla- Pierrehumbert’s more recent publication (Pierrehumbert, bles at the word level. Then, presumably, other dimensions 2016), but please note that I have chosen one-syllable words would constitute the phonetic substance, viz. those that have here since in all the experiments below, we will be dealing with been found to be relevant for the perception of word stress. exemplars of syllables,2 so one could also imagine that the Before presenting the method and results of the present labels are actually syllable category labels. For the sake of tidi- study, I will first elaborate in more detail how exemplar theory ness, only few exemplars in Fig. 1 have been labeled (a few readily integrates form and substance in Section 1.1, then in exemplars of “ball” and “bell”), but it can be assumed that all Section 1.2 present more details on the Schweitzer (2011) exemplars should at least be labeled with their word category. study on detecting clusters of syllables corresponding to pro- The dots indicate further word category labels. 
In the figure it sodic categories, which the present study takes as a starting is not specified what the perceptual dimensions are; but the point. Based on these preliminaries, Section 2 will state the dimensions assumed by the exemplar-theoretic approaches research question addressed in this article. Method and data mentioned above have used locations of the first formant (F1) will be at issue in Section 3. Section 4 will then introduce the in open vs. closed vowels (Lacerda, 1995), of the second for- new procedure for scaling the perceptual dimensions. Cluster- mant (F2) in /I/ vs. /e/(Pierrehumbert, 2001), or of the third for- ing results using this scaling procedure will be discussed in mant (F3) in /ɹ/ vs. /ɾ/(Pierrehumbert, 2016) for demonstrating Section 5, followed by an overall discussion in Section 6. exemplar-theoretic classification in the one-dimensional case. Johnson (1997) employed several dimensions, fundamental fre- quency (F0), F1, F2, F3, and the duration for classifying vowels. 1.1. Exemplar theory and categories Similarly, Pierrehumbert (2003, p. 183) illustrates exemplar The central idea in exemplar theory as applied to speech clouds using an F1/F2 plot of American English vowels. (e.g. Goldinger, 1996, 1997, 1998; Johnson, 1997; Lacerda, Exemplar-theoretic work has also addressed the prosodic 1995; Pierrehumbert, 2001, 2003, 2016; Wade, Dogil, domain. For instance Calhoun and Schweitzer (2012) have fi Schütze, Walsh, & Möbius, 2010; Walsh, Möbius, Wade, & argued that short phrases with speci c discourse functions Schütze, 2010) is that the mental representations of speech are stored along with their intonation contours. In that study, categories are not only abstract, symbolic entities, but that they clustering was used to identify similar intonation contours, are instead accumulations of concrete, previously perceived and the same parameters to describe the contours as in the instances of speech that have been stored in memory by lis- teners, and that these memory traces include considerable 1 Two exceptions are Walsh et al. (2010), who assume that the size of stored units is phonetic detail in addition to abstract information. Details that variable and depends on the frequency of the respective unit, and Wade et al. (2010), who are assumed to be stored comprise frequency scale informa- assume that perceived speech is not stored in units at all, but stored as a whole, i.e. in complete longer utterances, with category labels at various levels. tion such as formant location or aspiration noise, but also 2 This is not because I necessarily want to claim that the syllable is the unit for exemplar details beyond speaker identity. The latter is supported by storage, but because at least in the AM tradition there is consensus that pitch accents are memory effects of identical and of perceptually similar voices linked to specific syllables, prosodic labels always refer to specific syllables, and consequently I will use acoustic-prosodic properties of single syllables, or of single in memory tests (Goldinger, 1996). Some models, for instance syllables relative to neighboring syllables, and the corresponding prosodic label for the Lacerda’s (1995), Johnson’s (1997), or Pierrehumbert’s (2003, clustering experiments below. 4 A. Schweitzer / Journal of Phonetics 77 (2019) 100915
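To make the sketch in Fig. 1 slightly more concrete, the following toy example (in R, with invented dimensions and data) stores exemplars as rows of a table with two perceptual dimensions and a word-category label, and categorizes a new instance by the majority label among its nearest stored neighbors, in the spirit of exemplar-based classification. It is not meant as a faithful implementation of any of the models cited above.

```r
# Toy exemplar memory: each row is one stored exemplar with two
# (arbitrary) perceptual dimensions and a word category label.
set.seed(1)
memory <- data.frame(
  dim1  = c(rnorm(50, -1), rnorm(50, 1)),
  dim2  = c(rnorm(50, 0.5), rnorm(50, -0.5)),
  label = rep(c("ball", "bell"), each = 50)
)

# Classify a new instance by the majority label among its k nearest
# stored exemplars (Euclidean distance in perceptual space).
classify <- function(new, memory, k = 7) {
  d <- sqrt((memory$dim1 - new[1])^2 + (memory$dim2 - new[2])^2)
  nearest <- order(d)[1:k]
  names(which.max(table(memory$label[nearest])))
}

classify(c(-0.8, 0.3), memory)   # most likely "ball" with these toy clouds
```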

starting to label these exemplars with meaning. As more and more exemplars are stored, implicit phonological knowledge begins to build up when exemplars associated with the same abstract meaning categories exhibit similar perceptual fea- tures, i.e. when they are located in similar regions in perceptual space. In segmental acquisition for instance, exemplars refer- ring to ball objects would end up in the same region in percep- tual space, and very close to exemplars referring to bell objects. This implicitly encodes the phonological identity of the “ball” exemplars as well as the phonological proximity of the “ball” and the “bell” exemplars. In illustrating this view, Pierrehumbert (2003) mentions results obtained by Kornai (1998), who showed that unsuper- vised clustering of F1/F2 data for vowels yields clusters that are “extremely close to the mean values for the 10 vowels of Fig. 1. Rough sketch of exemplars in two-dimensional perceptual space. Exemplars are American English” (Pierrehumbert, 2003, p. 187). She inter- labeled at least with word category information, but possibly with further categories that prets these results as supporting evidence that detection of the listener had access to. Exemplars of the same linguistic category are expected to phonetic categories in human speech acquisition may be form clusters in perceptual space, indicated by clouds of differently colored exemplars here. guided by identifying regions in perceptual space which corre- spond to peaks in population density. She also cites experi- present study. Similarly, Schweitzer et al. (2015) provide evi- ments by Maye and Gerken (2000) and Maye, Werker, and dence of exemplar storage of intonation. They describe fre- Gerken (2002) in which participants interpreted stimuli in a quency effects on the phonetic implementation of pitch continuum as belonging to two distinct categories if the stimuli accent contours, which can be explained assuming an exhibited a bimodal distribution over the continuum exemplar-theoretic account of intonation instead of the tradi- (Pierrehumbert, 2003, p. 187). In general, she assumes that tional post-lexical view. “well-defined clusters or peaks in phonetic space support In any case, linguistic categorical knowledge then arises stable categories, and poor peaks do not” (Pierrehumbert, from abstracting over the stored exemplars (e.g. 2003, p. 210). Pierrehumbert, 2001, 2003), or, in other words, phonological Following this assumption, in an earlier study (Schweitzer, form arises from abstracting over phonetic substance. Goldin- 2011) the idea of clustering as a means to simulate the acqui- fi ger exempli es this kind of abstraction by a metaphor that he sition of speech categories in the prosodic domain was attributes to Richard Semon, stating that the blending of many explored. If clusters of F1/F2 vowel data can be shown to cor- photographs results in a generic image of a face (Goldinger, respond to vowel categories, then clustering intonation data in fi 1998, p. 251). Similarly, the mass of speci c exemplars of a several prosodic dimensions could possibly give insight into “ ” category blends into an abstract idea of the properties of that the reality of intonation categories. In those experiments, alto- category. Or, coming back to Fig. 
1, the many points of a word gether 29 linguistic, phonetic, and phonological (segmental fi category, which all have individual speci c values, together and prosodic) features were extracted for each syllable from fi fi de ne a less speci ed region in perceptual space that corre- a database of read speech which had been manually anno- sponds to that word category. This categorical knowledge tated for and contained six different pitch accent can then be used both in and in production: categories. in perception, categorizing new instances is achieved by com- The clustering results were evaluated by comparing the paring them to the stored exemplars and their categories obtained clusters to the manual prosodic labels. It turned out (Johnson, 1997; Lacerda, 1995; Pierrehumbert, 2001, 2003); that even though the probably most widely used clustering in , production targets are derived from them algorithm, k-means, yielded satisfactory results in terms of (Pierrehumbert, 2001, 2003), for instance by random selection cluster-to-category correspondence as quantified by an from the cloud corresponding to the intended word category. I accuracy-based evaluation measure proposed in Schweitzer will not go into the details here of exactly how exemplar- (2011), it did so only when allowing for an extremely high num- theoretic perception and production work, as this would be ber of clusters—in case of k-means clustering for instance, beyond the scope of this article. Here, it is only of interest that best results were obtained for around 1600 clusters. Clearly, exemplar theory does not separate phonetic detail, or sub- the relation between the number of clusters, 1600, to the num- stance, on the one hand, and phonological form, or categories, ber of pitch accent categories in the data, six, is very imbal- — on the other hand the two are closely related, or maybe anced. Thus it was suggested among other things (i) that indeed, as Grice et al. (2017, p. 105) put it, two sides of the future work should investigate which dimensions are relevant same coin. in the perception of prosodic categories, and limit the dimen- sions in clustering to these relevant dimensions, and (ii) that these dimensions might not contribute with equal weight, i.e. 1.2. Clustering intonation categories that dimensions might have to be scaled differently in order If phonological knowledge arises by abstracting over clouds to model their individual importance to human perception more of exemplars stored in memory, then speech acquisition is ini- closely. The present article suggests a procedure that tiated by accumulating those exemplars in memory and by addresses both these problems at the same time. A. Schweitzer / Journal of Phonetics 77 (2019) 100915 5
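A minimal sketch of the kind of clustering experiment described in this section: z-score the acoustic dimensions, run k-means with a large number of clusters, and compare the resulting clusters to the manual prosodic labels with a purity-style score as sketched earlier. Column names and settings are placeholders and do not reproduce the original study's configuration.

```r
# Sketch: clustering syllable exemplars and comparing clusters to labels.
# 'syllables' is a placeholder data frame with acoustic columns and a
# manual prosodic label column 'accent'.
cluster_and_evaluate <- function(syllables, acoustic_cols, k = 1600) {
  feats <- scale(syllables[, acoustic_cols])        # z-score each dimension
  km <- kmeans(feats, centers = k, nstart = 5, iter.max = 50)
  # Majority-category share per cluster, weighted by cluster size.
  purity <- tapply(syllables$accent, km$cluster,
                   function(l) max(table(l)) / length(l))
  sizes <- table(km$cluster)
  list(clustering = km,
       purity = sum(purity * sizes / sum(sizes)))
}
```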

2. Research question Database (BRN, Ostendorf et al., 1995) for which prosodic labels were available (approx. 1 h, 5 speakers). The aim of this article is to explore an exemplar-theoretic The SWMS and SWRK databases had been recorded for perspective on the detection of what I call “prominence cate- unit selection speech synthesis in the SmartWeb project gories” in the following, using prosodically annotated data from (Barbisch et al., 2007; Wahlster, 2004), hence the “SW” in their German and American English databases of read speech. The names. The speakers are professional speakers of Standard novelty in the experiments presented here lies in the fact that I German. The utterances represent typical sentences from five employ perceptually motivated weights for modeling the rela- different text genres, and they usually consist of one or at most tive importance of a number of potentially relevant acoustic- two short sentences, corresponding to just a few prosodic prosodic dimensions. phrases. The utterances were annotated on the segment, syl- The experiments are intended to “simulate” in a very simple lable, and word level, and prosodically labeled according to way the acquisition of prominence categories by detecting GToBI(S) (Mayer, 1995). Prosodic labeling for each utterance clusters of similar syllable exemplars in this perceptually more was carried out by one of three human labelers, all supervised adequate space. Since I do not want to assume that this detec- and instructed by myself, without having the Schweitzer (2011) tion relies on any phonological knowledge, the clustering has study, or the present experiments, in mind. The SWMS data- to take into account every single syllable, irrespective of base amounts to 28,000 syllables, and 14,000 words. The whether it is stressed or not, simply because that knowledge SWRK data contain 34,000 syllables, and 17,000 words. would not be available in acquisition. Instead, it would probably The BRN database contains recordings of professional be learned in the same way, possibly even jointly with the American English speakers, partially as recorded during radio acquisition of the prominence category. Therefore the term broadcasts, and partially re-recorded in the lab. Prosodic anno- “prominence category” in the remainder of this article will com- tation followed the ToBI guidelines for American English prise the pitch accent categories assumed for the two lan- (Beckman & Ayers, 1994). Only a portion of the database guages, plus a “NONE” category for unaccented syllables: was labeled prosodically. This part was used in the experi- the listener would have to figure out which syllables are ments presented below. It amounts to 23,000 syllables and accented and by which accent type, and which syllables are 14,000 words, from 5 different speakers (3 female, two male). unaccented, and they would have to start detecting these cat- egories without prior knowledge of the location of pitch 3.2. Pitch accent inventories accents. In fact, listeners in prosody acquisition would not even know The BRN database provides pitch accent labels according what pitch accents are and what it means for a syllable to be to ToBI (Beckman & Ayers, 1994). Within the database, H* is unaccented; instead they would simply (unconsciously) notice most frequent, followed by L+H*, downstepped !H*, L+!H*, that the perceived syllable instances fall into groups based on and H+!H*. 
L*+H and its downstepped version L*+!H are very their properties in the acoustic-prosodic dimensions, thus initi- infrequent in BRN. A number of syllables have labels that indi- ating the prosodic categories. Only then would learners start to cate labeler uncertainty; these were excluded from the notice that these groups share meaning aspects and label analyses. them accordingly, for instance maybe relating accumulations As for the prosodic labels in the SWMS and SWRK data- of short syllables with flat F0 contours to non-prominence. bases, these are to some extent similar to the ToBI labels in The present article aims to simulate the instantiation of the BRN, as GToBI(S) is a German labeling system based on prominence categories in this way. The clusters detected will the original ToBI (Beckman & Ayers, 1994) used for BRN. then be compared to established pitch accent categories. I GToBI distinguishes 5 basic types of pitch accents, L*H, H*L, do not assume that speakers have access to those categories L*HL, HH*L, and H*M, which are claimed to serve different in acquisition; the “correct” category labels are used only for functions in the domain of discourse interpretation (Mayer, evaluating the plausibility of the detected categories. Similarly, 1995). They can be described as rise, fall, rise-fall, early peak, I do not want to assume any phonological knowledge that may and stylized contour, respectively. The stylized contour, H*M, is not have been acquired at the stage when the prosodic cate- extremely infrequent—it is mostly used in calling out and thus gories start to form, thus I will limit the dimensions for clustering does usually not occur in read speech. In addition to the basic to acoustic dimensions only. However I assume that phoneme types GToBI assumes “allotonic” variants of L*H and H*L in categories are established at this stage already, which allows pre-nuclear contexts, viz. L* and H*: in these, only the starred for normalizing some of the acoustic parameters by phoneme is realized on the pitch-accented syllable, while the trail category. tone (annotated as ..H in L*H, and as ..L in H*L) is realized on the syllable immediately preceding the next pitch- accented syllable, or even omitted completely. Even though 3. Data preparation for clustering pitch accents the linked trail tones are annotated in GToBI(S), they do not have accent status because they do not lend prominence to 3.1. Databases the syllable that they occur on (therefore no * symbol in their name), and consequently they are not treated as pitch accents The clustering experiments were carried out on three data- here. Thus in the following we will be dealing with L*H, H*L, bases, viz. the German SWMS (2 h, male) and SWRK (3 h, L*HL, HH*L, as well as the variants L* and H*. Just as in female) databases, and on a part of the Boston Radio News BRN, labeler uncertainties in the SWMS and SWRK databases 6 A. Schweitzer / Journal of Phonetics 77 (2019) 100915 were indicated by “?” as a diacritic to accent labels; these were The model uses six linguistically motivated parameters to discarded for the analyses here. describe the shape of the F0 contour in and around accented Similar to ToBI for American English (Beckman & Ayers, syllables. 1994), GToBI provides a diacritic “!” to indicate downsteps Mathematically PaIntE employs a function of time, with f ðxÞ (i.e., a H* target which is realized significantly lower than a pre- giving the F0 value at time x.Itisdefined as follows: ceding H* target in the same phrase). 
Mayer (1995) notes, however, that although it is recommended to label downsteps, c1 c2 f ðxÞ¼d c c ð1Þ it is not clear whether the downstepped pitch accents differ in 1 þ ea1ðbxÞþ 1 þ ea2ðxbÞþ discourse meaning from their non-downstepped counterparts This function yields a peak shape (Fig. 2), where the first (Mayer, 1995, p.8). Accordingly, in labeling the two German term, the d constant, can be interpreted as peak height param- ’ databases that I will be using for clustering, labelers attention eter. The amplitude of the rise towards the peak is determined had been less focused on consequently and consistently label- by c1, termed rise amplitude in the following, and the peak itself ing downstep. Indeed, downsteps were only labeled in 19 and is reached at (syllable-normalized) time b; in other words, b 27 cases in the SWMS and SWRK databases, respectively, can be interpreted as the peak alignment parameter. The which is why I will not distinguish downstepped and non- amplitude of the fall after the peak is given by parameter c2 (fall downstepped accents as belonging to different categories amplitude). Finally, the steepness of both movements is cap- here. tured by parameters a1 (steepness of rise) and a2 (steepness Downsteps were consistently labeled in the English data. of fall). With !H* and L+!H*, downstep constitutes a special case: For determining the PaIntE values of a given syllable, the although I do not want to deny that downstepped accents at F0 contour is estimated using ESPS’s get_f0 from the Entro- least in English, and possibly also in German (cf. Grice, pic waves+ software package, then smoothed using a median Baumann, & Jagdfeld, 2009), have different pragmatic func- smoother from the Edinburgh Speech Tools (Taylor, Caley, tions, I would like to claim that categories involving downstep Black, & King, 1999), interpolating across unvoiced regions have to be acquired later than other prosodic categories: in but not across silences. Then the PaIntE model utilizes an opti- order to perceive that a category is implemented with a down- mization procedure that finds those parameter values for which stepped high target, relative to a high target in a preceding cat- the function contour in a three-syllable window around the syl- egory in the same phrase, at least this preceding category as lable of interest is optimally close to the smoothed F0 contour. well as possibly intervening phrase boundaries have to be per- Since the aim of the present experiments is to focus on ceived with adult competence. However in the approach taken parameters that can be employed in perception to distinguish here, when clustering the syllable data to detect possible cat- between pitch-accented and unaccented syllables, and “ ” egories, of course we do not yet have access to the true pro- between different pitch accents, I extracted the PaIntE param- sodic categories, nor to those of the context syllables: it is not eters for every syllable in the three databases, irrespective of fi yet known which preceding syllable quali es as the preceding whether they had been manually labeled as pitch-accented category with the high target. In contrast, H+!H* as an accent in or not. It is of course expected for instance that pitch- which the downstep is relative to the immediately preceding accented syllables should exhibit higher rise amplitude and fall syllable, is unproblematic. 
In any case, as a consequence of amplitude than unaccented syllables, that peak alignment in these considerations, I will even for English treat downstepped early peak accents is earlier than in other accents, or that !H* as H* and (the very infrequent) L+!H* as L+H* in the the peak height of downstepped accents should be lower than following. that for non-downstepped accents. In an exemplar-theoretic account of speech acquisition, such generalizations would be 3.3. Features for clustering implicitly learned by abstracting over clouds of pitch accents of different categories. In contrast to the earlier study, I focus on acoustic-prosodic parameters for clustering here, without making use of linguistic or phonological parameters such as which syllables are stressed, or which syllables occurred in function or content words, in order to more realistically model acquisition, as explained above in Section 2. However I consider acoustic parameters beyond F0 shape and duration here; other recent work on prosody also emphasizes that dimensions beyond F0 need to be taken into account when discovering meaningful elements of prosody (e.g. Niebuhr, 2013; Niebuhr & Ward, 2018; Ward & Gallardo, 2017).
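The label preparation described in the preceding sections can be summarized in a short sketch: unaccented syllables are assigned a NONE category, labels marked as uncertain are discarded, and downstepped accents are mapped onto their non-downstepped counterparts. The data frame and column names are placeholders; the accent label strings are those discussed in the text.

```r
# Sketch of the accent label clean-up described above. 'df$accent' is a
# placeholder column holding the raw label (NA for unaccented syllables).
prepare_labels <- function(df) {
  df$accent[is.na(df$accent)] <- "NONE"        # unaccented syllables
  df <- df[!grepl("\\?", df$accent), ]         # drop labels marked as uncertain
  # Map downstepped accents onto their non-downstepped counterparts
  # (H+!H* is kept as its own category, as discussed above).
  df$accent[df$accent == "!H*"]   <- "H*"
  df$accent[df$accent == "L+!H*"] <- "L+H*"
  df
}
```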

3.3.1. PaIntE The present paper as well as the study it extends employ the PaIntE model (Möhler, 2001; Möhler & Conkie, 1998)to quantify the F0 contour around pitch-accented syllables. “ ” PaIntE is short for Parameterized Intonation Events and Fig. 2. Example PaIntE contour in a window of three syllables around a pitch-accented was originally developed for F0 modeling in speech synthesis. syllable (r*). See text for more details. A. Schweitzer / Journal of Phonetics 77 (2019) 100915 7
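As a rough illustration of the PaIntE approximation, the sketch below implements a sum-of-two-sigmoids peak with the parameters described in the text (peak height d, rise and fall amplitudes c1 and c2, steepnesses a1 and a2, and peak alignment b) and fits it to a contour by least squares with optim. The exact parameterization and optimization procedure of the published PaIntE implementation may differ; this is an illustrative stand-in, and the contour data are invented.

```r
# Sketch of a PaIntE-style peak function: d minus a rising and a falling
# sigmoid. Time x is in syllable-normalized units within the
# three-syllable approximation window.
painte <- function(x, d, c1, c2, a1, a2, b) {
  d - c1 / (1 + exp(-a1 * (b - x))) - c2 / (1 + exp(-a2 * (x - b)))
}

# Least-squares fit of the sketch function to a smoothed F0 contour
# sampled at syllable-normalized times.
fit_painte <- function(times, f0) {
  sse <- function(p) sum((painte(times, p[1], p[2], p[3], p[4], p[5], p[6]) - f0)^2)
  optim(c(d = max(f0), c1 = 50, c2 = 50, a1 = 2, a2 = 2, b = 1.5), sse)
}

set.seed(1)
times <- seq(0, 3, by = 0.05)                        # three-syllable window
f0    <- painte(times, 220, 60, 40, 3, 4, 1.4) + rnorm(length(times), sd = 2)
fit_painte(times, f0)$par                            # roughly recovers the true values
```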

3.3.2. Duration features correlated with pitch accent directly, or indirectly through word Categorization experiments in Schweitzer (2011) had stress. In any case, if there are systematic differences between shown that normalized nucleus durations were helpful in distin- unaccented and pitch accented syllables in terms of spectral guishing accented from unaccented syllables. Specifically, that tilt, then this parameter should also play a role in exemplar- study had used phoneme-specific z-scores of the nucleus theoretic acquisition of prominence categories. Thus I include duration for predicting whether syllables were accented or two measures of spectral tilt: once the spectral balance as not, and those of word-final phones for predicting the location operationalized by Sluijter and van Heuven (1997), specifically of phrase boundaries. Converting absolute scores to z-scores the difference in energy in the frequency band between 0 and is a common statistical transformation: absolute values are 500 Hz and that in the frequency band between 500 and replaced by their deviation from the overall mean, and divided 1000 Hz, and, second, spectral tilt in terms of the regression by the overall standard deviation. To get phoneme-specific z- line in a long-term average spectrum. Both measures had scores, mean and standard deviations are calculated for each been investigated by Aronov and Schweitzer (2016), interest- phoneme class separately, and in transforming a particular ingly with opposite findings, in a database of German conver- exemplar to its z-score, mean and standard deviation of the sational speech: the spectral balance approach confirmed the respective phoneme class are used. claim that the stressed syllables exhibit a flatter spectrum than Using z-scores, it is possible to model the lengthening or unstressed syllables, i.e. a more even spectral balance for shortening related to prosodic context, because phoneme- stressed syllables; while the spectral tilt approach employing specific mean and standard deviation are eliminated. Also, in the regression line gave higher tilt values for stressed an exemplar-theoretic account of prosody perception, it is plau- syllables. sible that listeners have access to phoneme-specific duration For the present experiments, both values were extracted z-scores—after all, these can in principle be interpreted as syllable by a Praat script. Specifically I first calculated long- the location of an exemplar in terms of duration relative to other term average spectra for each syllable using the pitch range exemplars of the same category: in the duration dimension, an parameters estimated as explained above and Praat’s default exemplar of a specific phoneme with a high duration z-score values for all other parameters; then extracted the slope of the would be located “to the right” of most other exemplars of that regression line for the range of 100–5000 Hz using Praat’s type, whereas an exemplar with a z-score of 0 would be function Report spectral tilt with the default parameters located in the middle of all exemplars of that type. Since pro- to calculate spectral tilt, as well as the Praat function to retrieve sody is acquired later than segmental aspects, listeners should the mean energy values in the two relevant frequency bands already have accumulated enough phoneme exemplars to for calculating the spectral balance. make such generalizations. A second feature related to duration that I employ for the 3.3.4. 
Intensity current study had not been used in the earlier experiments: In addition, I included overall intensity values. Intensity has here I include the number of voicing periods within a syllable. long been recognized as a parameter related to word stress The rationale for including it is that in order to realize a pitch and pitch accent—already Bolinger (1957), p. 176 mentions accent on a syllable, the speaker has to produce voicing for that intensity is a frequent correlate of pitch accent, and both a reasonably long time span. This feature was calculated using Fry (1955) and Lieberman (1960) found that it is one correlate fi a Praat script (Boersma & Weenink, 2017). To this end, I rst of perceived word stress in American English. More recent estimated the pitch range of the respective speaker by extract- work found that intensity is a correlate of word stress only in ing F0 values from all speech data of that speaker, using the pitch-accented syllables in American English (Okobi, 2006). ESPS program get_f0 from the Entropic waves+ software The correlation of intensity and pitch accent has also been fi package, as this does not require the speci cation of confirmed for other languages, for instance for British and Irish speaker-dependent parameters for the expected minimum English (Kochanski et al., 2005), or for German, where Niebuhr and maximum pitch. Then, I used the 5th percentile from these and Pfitzinger (2010) found that two types of pitch accent differ data as minimum pitch for the speaker, and the 99th percentile with respect to their intensity patterns. as the maximum pitch within the Praat script. Given that the For the present experiments, intensities were again deter- speech signal corresponding to each syllable was long mined by a Praat script, using the speaker-specific minimum enough, the script then calculated a Praat point process using pitch parameter mentioned above. I included the intensity these minimum and maximum pitch parameters and then within the vowel (vowel intensity) as well as the intensity retrieved the number of periods in it. across the whole syllable (syllable intensity) as raw values. For the experiments below I will however use the deltas 3.3.3. Spectral parameters between subsequent syllables rather than the absolute values. It is well known that syllables that carry word stress differ This is described in SubSection 3.5 below. from unstressed syllables in their spectral balance, or spectral tilt, in many languages (e.g., Aronov & Schweitzer, 2016; 3.4. Outlier removal Crosswhite, 2003; Sluijter & van Heuven, 1997; Okobi, 2006). It should be noted however that Campbell and All of the acoustic parameters described above were Beckman (1997) have challenged this result at least for Amer- obtained by automatic procedures. In addition, the segmenta- ican English, relating the effect of spectral tilt in a speech cor- tion of the databases, albeit manually checked in most cases, pus to pitch accentedness rather than word stress. For the was originally based on forced alignment for all three data- experiments here, it is less important whether spectral tilt is bases. In case of the BRN database, I derived syllable label 8 A. Schweitzer / Journal of Phonetics 77 (2019) 100915

files from the phone label files using an automatic syllabifica- Table 1 tion procedure with the segment labels as input. Thus all three Approximate numbers of data points left after each step in the outlier removal process, by database. databases, even though they are in general very clean, may occasionally contain erroneous phone labels. These together Database BRN SWMS SWRK with the syllable labels however are the basis for deriving the Full data set 23,000 41,000 34,000 Long enough vowels 21,500 35,000 30,000 acoustic parameters by Praat scripts. Some further noise is Fall/rise completed 17,000 28,500 24,500 introduced in calculating the acoustic parameters by scripts No outliers 16,500 26,500 22,500 even for cases with perfect labels. I tried to reduce this kind No labeler uncertainties 16,000 26,500 22,500 of noise by a thorough procedure for outlier removal. Outlier removal, as well as normalization and all following analyses described in the remainder of this article were carried above (cf. Section 3.3.2). The peak height parameter was first out using R (R Core Team, 2017). In a first step, I removed normalized by speaker, by subtracting the speaker’s individual cases where the Praat script had not yielded spectral or inten- mean, then z-scored across all speakers. Finally, instead of sity values due to shortness or lack of voicing, or simply using the raw intensity values I calculated the difference in because the syllable did not contain a vowel. This concerned intensity between each syllable and its preceding syllable, as a considerable amount of data (approx. 9–15% depending well as the difference between the syllable and the following on the database). I also removed cases where the PaIntE val- syllable, once based on the syllable intensities, and once ues indicated rises or falls that reached beyond the approxima- based on the vowel intensities. This yielded four intensity tion window and where the rise or fall amplitude inside the parameters: vowel intensity delta (the difference in vowel inten- window did not reach a value within 5 Hz of the full amplitude sity between a syllable and its preceding syllable), next vowel as expressed by the PaIntE parameter.3 This was the case for intensity delta (the difference in vowel intensity between the another roughly 20% of data points. In general outliers were next syllable and the current syllable), and syllable intensity removed for each parameter individually by removing all data delta and next syllable intensity delta (analogously, but using points were the observed value was more than 1.5 times the intensities across whole syllables instead of intensities within interquartile range below the first or above the third quartile. the vowels). For the peak height parameter this was done separately for each speaker, and for spectral balance, spectral tilt and the intensity 4. Procedure for finding perceptual weights measures, separately for each vowel. Altogether this reduced the number of data points by roughly 5%. Finally I removed syl- Two potential problems when clustering data for detecting lables that contained infrequent vowels (defined as cases where perceptually relevant categories were addressed in the intro- there were less than 100 instances of that vowel in the data- duction above: firstly, keeping irrelevant dimensions in cluster- base), labeler uncertainties, and syllables with infrequent pitch ing introduces noise. 
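The vowel-specific z-scoring and the interquartile-range outlier criterion described in the text translate into a few lines of R. The sketch below z-scores a parameter within each vowel class and flags values lying more than 1.5 times the interquartile range below the first or above the third quartile; the data frame, its columns, and the toy values are invented.

```r
# Sketch: vowel-specific z-scores and 1.5 * IQR outlier flagging for one
# acoustic parameter.
zscore_by_vowel <- function(value, vowel) {
  ave(value, vowel, FUN = function(v) (v - mean(v)) / sd(v))
}

iqr_outlier <- function(value) {
  q <- quantile(value, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  value < q[1] - 1.5 * iqr | value > q[2] + 1.5 * iqr
}

# Toy data: nucleus durations (s) for two vowel classes.
df <- data.frame(vowel = rep(c("a", "i"), each = 6),
                 value = c(0.11, 0.12, 0.10, 0.13, 0.30, 0.12,
                           0.07, 0.08, 0.06, 0.09, 0.08, 0.21))

df$value_z <- zscore_by_vowel(df$value, df$vowel)
df_clean   <- df[!iqr_outlier(df$value_z), ]        # drops the two long outliers
```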
Clusters are characterized by small fi accent types (de ned as cases where the accent occurred less distances among their members. Distances are usually quan- 4 fi than 200 times in the database ). Outlier removal signi cantly tified by a distance metric such as the Euclidean distance: if reduced the number of data points, however it made sure that x ¼ðx1; x2; ...; xnÞ and y ¼ðy ; y ; ...; y Þ are two points in only parameters are used in the following analyses for which 1 2 n n-dimensional space, then their Euclidean distance is calcu- we are very confident that they are correct. Table 1 gives an lated by summing the squared distances in all individual overview of the data points available for each database after dimensions and taking the square root: each step, the last line indicates the final numbers. sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi Xn ; 2 distðx yÞ¼ ðxi yi Þ 3.5. Normalization i¼1 So assuming that, say, the second dimension is not relevant All parameters were z-scored to obtain means of 0 and 2 standard deviations of 1. In case of the vowel dependent for perception, then the term ðx2 y2Þ , the squared distance parameters syllable intensity, vowel intensity, spectral balance between the two points in dimension 2, will add perceptually and spectral tilt, this was done on a by-vowel basis, as moti- irrelevant noise to the overall distance. vated and described for the vowel-specific duration z-scores Secondly, all dimensions contribute to the Euclidean dis- tance to the same extent. However, distances in some dimen-

3 sions may be perceptually more relevant than distances in These cases constitute “degenerate” approximation results, treated as outliers here: inside the approximation window, the optimization procedure guarantees that the PaIntE others, so we might want to factor this into the overall distance. function is optimally close to the smoothed F0 contour. Outside the approximation window, This could be obtained for instance by introducing a weight wi the two contours may of course be quite different from each other. So if the PaIntE contour reaches the maximum far outside the approximation window, this does not guarantee that for the distance in each dimension as in the following equation. the smoothed contour follows the same course and reaches the same maximum inside the sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi approximation window. Therefore the amplitude that PaIntE assumes may overestimate Xn ; 2 the true amplitude. The origin of this problem is discussed in more detail in an article distw ðx yÞ¼ wi ðxi yi Þ ð2Þ submitted elsewhere (Schweitzer, Möhler, Dogil, & Möbius, in preparation), and future work i¼1 on PaIntE will address it further. For the time being, cases that might be problematic can be identified and ignored as suggested here. It should be noted that the problem occurs This is however equivalent to similarly often for accented and unaccented syllables, thus removing such data points does fi sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi not signi cantly affect the overall distribution of accented/unaccented syllables. Xn 4 Given that there are fewer accent categories than vowel categories it seemed 0 0 2 distw ðx; yÞ¼ ðx y Þ ð3Þ appropriate to assume a higher threshold for accents; also for the adjustment procedure i i i¼1 described below I needed at least 200 instances of each accent category. A. Schweitzer / Journal of Phonetics 77 (2019) 100915 9 pffiffiffiffiffi 0 0 0 0 0 fi where xi ¼ wi xi , and yi ¼ wi yi , and wi ¼ wi . In general add- solves the rst problem stated above, that we want to separate ing dimension specific weights in calculating the distance as relevant from irrelevant dimensions before clustering. The next in (2) is conceptually the same as scaling all values in each section will describe how the proposed procedure was applied dimension with an appropriate weight and then taking the usual to the three databases before clustering. unweighted Euclidean distance. If the weights used for scaling the dimensions before taking the distance are the square roots of the weights used in the weighted distance, as in Eqs. (2) and 5. Clustering experiments (3) above, then the two approaches even yield identical distances. 5.1. Estimating the weights The question is then, which would be perceptually appropri- For each database I first randomly sampled 100 data points ate weights for scaling the dimensions? To address this prob- from each category. These data were used to estimate the lem, we would have to assess for each dimension how much weights for scaling the dimensions as described in Section 4. specific differences along this dimension are reflected in per- Sampling the same amount of data for each category ensured ception. 
Unfortunately, we do not have perceptual data avail- that even relatively infrequent categories would be represented able in the form of gradient perceptual ratings of similarity in these data.6 Thus parameters that might be helpful in distin- between instances of the prominence categories. However, guishing relatively infrequent categories would not be neglected. we do have prosodic annotations from human annotators, so I then built a new data set containing the pairwise distances we do have categorical perceptual ratings of similarity: we between all sampled data points, i.e. a set which contained for can be fairly sure that two syllables that have been labeled each pair of data points their categorical distance (0 if same cat- as belonging to different categories are perceptually more dif- egory, 1 if different), and their distances in all the dimensions ferent than two syllables which have been labeled as belong- corresponding to the parameters introduced in Section 3.3 ing to the same category. above. Thus if n was the number of categories considered, the So re-phrasing the above problem of how much specific dif- resulting data set consisted of 1=2 100n ð100n 1Þ data ferences along a dimension are reflected in perception, we can points. After mapping downstepped accents to their non- ask: How big does the difference have to be to cause a change downstepped counterparts, the number of categories for all in perceived category as labeled by the human annotators? three databases was 5, and thus 124,750 distances contributed The solution that I am proposing then is to fit a linear regres- to estimating the weights in each database. sion model that predicts whether two syllables will be per- For estimating the weights, I used the lm function in R (R ceived (i.e. labeled by the annotator) as belonging to the Core Team, 2017)tofit a linear regression model that predicted same category or not, given the individual distances in all the categorical distance of each pair using the distances in all potentially relevant dimensions: parameters as predictors. I then selected only those parame- Xn ters for which R indicated significance at p < 0:05, then fita ; distcat ðx yÞa0 þ bi ðxi yi Þð4Þ second model using only these significant predictors. If any i¼1 of the parameters were not significant anymore in the simpler fi fi where distcat ðx; yÞ is the categorical distance between x and y second model, I tted a third one, again keeping only signi - (i.e. distcat ¼ 1ifx and y belong to different categories, and 0 cant predictors. There was no case in which that third model fi if they are from the same category), a0 the intercept of the still contained insigni cant predictors. Table 2 shows the result- 5 fi fi model, and bi the coefficients of the model. It is easy to see ing coef cients of the nal models for all three databases. that the coefficients obtained from this model can be interpreted Interestingly, the two speakers in the two German data- as weights for scaling the dimensions: the model approximates bases differ considerably in the parameters they use to encode the difference in categories as a weighted sum of differences in category differences. The SWMS speaker uses almost all the individual dimensions with the coefficients as weights. 
These parameters considered in these experiments, while the SWRK weights thus reflect the importance of each dimension: a large speaker uses only three of the six PaIntE parameters, and coefficient shows that this dimension can be particularly impor- spectral balance and the number of voicing periods. It is espe- tant in distinguishing categories. cially noteworthy here that the largest part of these databases The second benefit of the linear regression model approach is identical in terms of text, so speaking style or content cannot for estimating perceptually motivated weights for clustering is explain the difference. The five speakers in the BRN database, that the model also yields significance values for each dimen- similar to the SWMS speaker, also used more parameters for sion. Thus in addition to the desired weights for scaling the encoding the categories—in fact they used almost all. In order dimensions, we get an assessment of how likely each dimen- to check whether the range of parameters can be attributed to sion is to play a role in distinguishing categories. That is, it also the fact that the data came from several speakers, I ran the same procedure on the female speaker for which I had most 5 Please note that I am using a simple linear regression model, not a generalized one. In data (speaker f2b, with approx. 11,500 syllables before outlier an experiment where one would truly be interested in a model that can predict whether two removal). Of the parameters in Table 2, the speaker used all syllable instances belong to the same category or not, one would use a generalized model, i.e. a model that does not directly predict the distance, but the odds of the distance being 1. except steepness of rise and next vowel intensity, so the diver- This would avoid predicting non-integer values which could even lie outside the interval sity of parameters used in the BRN case can only weakly, if at [0,1], and which would have to be mapped to 0 or 1 then. However for the present purpose, all, be related to the fact that several speakers contributed. where we are only interested in the relative contribution of each dimension when predicting the distance, but not in the predicted value, I prefer the immediate relationship in the simple linear regression model between the overall (categorical) distance and the distances in 6 The extremely infrequent categories already excluded above were not represented each dimension. anymore at this stage. 10 A. Schweitzer / Journal of Phonetics 77 (2019) 100915

Table 2 Coefficients of the linear regression models for the three databases: Estimated values (“Est.”), p-values, and the level at which the coefficients were significant (“sig”, * p < .05, ** p < .01, *** p < .001). Intercepts are reported for the sake of completeness; only the remaining coefficients will be used as weights in the following. Empty cells indicate that the coefficient was not in the model for that database.

Parameter            SWMS: Est.   p      sig    SWRK: Est.   p      sig    BRN: Est.    p      sig
Intercept            0.645        0.000  ***    0.701        0.000  ***    0.652        0.000  ***
Steepness of rise    0.008        0.000  ***    –            –      –      0.004        0.001  **
Steepness of fall    0.005        0.000  ***    –            –      –      0.008        0.000  ***
Peak alignment       0.034        0.000  ***    0.031        0.000  ***    0.03         0.000  ***
Rise amplitude       0.03         0.000  ***    0.017        0.000  ***    0.024        0.000  ***
Fall amplitude       0.033        0.000  ***    0.021        0.000  ***    0.008        0.000  ***
Peak height          0.005        0.006  **     –            –      –      0.018        0.000  ***
syl int delta        –            –      –      –            –      –      0.011        0.000  ***
vwl int delta        -0.004       0.003  **     –            –      –      0.005        0.001  **
nxt vwl int delta    0.009        0.000  ***    –            –      –      0.006        0.000  ***
Spectral balance     -0.002       0.047  *      0.003        0.018  *      0.005        0.000  ***
Spectral tilt        –            –      –      –            –      –      0.004        0.001  **
Nucleus duration     0.007        0.000  ***    –            –      –      -0.005       0.000  ***
Voicing periods      0.02         0.000  ***    0.022        0.000  ***    0.017        0.000  ***
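To make the use of these coefficients explicit, the following minimal R sketch rescales the retained dimensions by the SWMS weights from Table 2 before clustering. The data frame `feats` and the short column names are invented for illustration only; note that for Euclidean distances the sign of a weight is irrelevant, since mirroring an axis does not change distances.

```r
## Illustrative only: SWMS coefficients from Table 2 as a named weight vector
## (column names are hypothetical shorthands for the parameters above).
w_swms <- c(steep_rise = 0.008, steep_fall = 0.005, peak_align = 0.034,
            rise_amp = 0.03, fall_amp = 0.033, peak_height = 0.005,
            vwl_int_delta = -0.004, nxt_vwl_int_delta = 0.009,
            spec_balance = -0.002, nucleus_dur = 0.007, voicing_periods = 0.02)

## Multiply each normalized dimension by its weight before clustering.
feats_adj <- sweep(as.matrix(feats[, names(w_swms)]), 2, w_swms, `*`)
```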

Instead we can conclude that speakers seem to employ a variety of parameters, in fact nearly all parameters investigated here, and that additionally, speakers may differ in which parameters they use for systematically encoding category differences.⁷

⁷ Interestingly, such individual differences are even expected from an exemplar-theoretic perspective, since the collection of exemplars stored in a speaker's memory, which is assumed to be the basis for production, is unique for each speaker.

Some of the parameters used above are of course expected to be correlated—deltas derived from syllable intensity and vowel intensity, for instance. Similarly, it could be suspected that spectral balance, in terms of Sluijter and van Heuven's (1997) definition, and spectral tilt, in terms of the regression line in a long-term average spectrum, are correlated (although the fact that they yielded opposite results in an earlier study (Aronov & Schweitzer, 2016) indicates otherwise). Thus, in order to make sure that multicollinearity did not constitute a problem for fitting the models, I used the usdm R package (Naimi, Hamm, Groen, Skidmore, & Toxopeus, 2014) to calculate the variance inflation factor (VIF) for all coefficients. All VIFs were below 2 in all cases, indicating no problem with multicollinearity. It is possible that the low VIFs are due to the fact that the potentially problematic intensity predictors are not raw values, but instead delta values, and that these deltas are again not used directly but that I employ distances between these delta values. Indeed, in the sample data used from the SWMS database, the correlation between vowel intensity delta and syllable intensity delta is 0.65, while that of the pairwise distances between the two is considerably lower (0.42). Similarly, spectral balance and spectral tilt are negatively correlated at −0.38, while the distances for the two parameters are positively correlated at a lower value (0.22). To be conservative, I fitted two other models for the BRN data, as this was the only data set where both intensity parameters were part of the final model. In these models I retained only one of the two factors vowel intensity delta and syllable intensity delta, and compared these to the full model and to each other using ANOVAs. The results indicate that none of the three models provides a significantly worse fit than the others. I thus kept both factors in the model to adhere to the procedure as described above.

In any case, the linear model has identified parameters where differences in that parameter may be related to differences in category. Thus we would expect to see different distributions of these parameters for each accent category. Figs. 3 through 5 show density plots for each of the significant parameters from above by category. These plots indicate the likelihood of specific parameter values: peaks in these plots occur at values that are more likely than the surrounding values. The plots thus show the underlying parameter distribution of each prominence category. It can be seen that for all parameters we find visible differences in the distributions for at least one pair of categories. The differences correspond well to expectations based on knowledge about these categories. For instance, for the SWMS data in Fig. 3, unaccented syllables (category NONE, dotted line) are most likely to exhibit no rise; thus their distributions for steepness of rise (top left) and rise amplitude (second row, right panel) have peaks at the lowest values for these two parameters. Rising L*H accents (the dot-dashed line with the longer dashes), in contrast, tend to have a steeper rise (the broader peak between −1 and 0 in the top left panel indicates that these values are most likely), greater rise amplitudes (the very broad plateau ranging from 0 to 2 in the right panel in the second row), and their peak alignment relatively late or even very late (the two peaks in the left panel in row 2). Similarly, the property of H* (solid line) being a less pronounced peak with moderate steepness and moderate amplitudes aligned in the middle of the syllable is borne out (first five panels). Beyond these F0-related parameters, there are very subtle differences in terms of vowel intensity delta (left panel in row 4) that seem to indicate that the distribution for H*L accents (dashed line) might be shifted to the right compared to the other categories, i.e. they seem to exhibit subtly greater deltas in intensity than the other categories including NONE. This unexpected finding is more pronounced in the delta to the next syllable (right panel in row 4). The remaining panels mostly indicate differences between the accent categories compared to unaccented syllables, indicating that while unaccented syllables are mostly characterized by very neutral spectral balance values of almost exactly 0, there is much more variation, with much more extreme values in both directions, for the accent categories.
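The multicollinearity check can be reproduced without the usdm package; the following base-R sketch computes the variance inflation factors directly from the pairwise-distance data and is equivalent in spirit to the usdm::vif call used here. The object names `dists` and `params_kept` are hypothetical (the pairwise-distance data and the predictors retained in the final model).

```r
## Hand-rolled VIF check: regress each retained predictor on all others and
## convert the resulting R^2 into a variance inflation factor.
vif_manual <- sapply(params_kept, function(p) {
  others <- setdiff(params_kept, p)
  r2 <- summary(lm(reformulate(others, response = p), data = dists))$r.squared
  1 / (1 - r2)
})
vif_manual   # values below 2 suggest no multicollinearity problem
```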

Fig. 3. Density plots showing the distributions of the parameters identified as important for each prominence category, for the SWMS database. See text for further details.

The last two panels show that nucleus duration is shorter in unaccented syllables (right panel in row 5), and this is also reflected in the fact that unaccented syllables either have no detectable voicing periods (the pronounced left peak in their distribution in the bottom panel) or only few periods (the right peak in that distribution).

For the parameters that were identified for the SWRK data, similar observations can be made. In her case, even for unaccented syllables voicing periods could usually be detected, as evident from the very small peak in the dotted line in the bottom panel. One could speculate that this is the reason why she made no use of intensity and duration—her more consistent voicing may have allowed her to encode more differences via F0-related parameters.

The BRN data finally look similar to the SWMS data: Unaccented syllables (dotted lines) have the lowest values for steepness of rise and steepness of fall (top panels), and low rise amplitudes (right panel in second row) and fall amplitudes (left panel in row 3). H+!H* accents have the earliest peak alignment (the peak in the dashed line in the left panel in row 2).

Fig. 4. Density plots showing the distributions of the parameters identified as important for each prominence category, for the SWRK database. See text for further details.

Again, there are only subtle differences in the intensity deltas, and for the spectral parameters we find a clear prevalence of 0 values in the case of unaccented syllables, for both spectral balance (right panel in row 5) and spectral tilt (left panel in row 6). Interestingly, in the BRN database many unaccented syllables have no detectable voicing (the sharp left peak in the dotted line in the bottom panel).

All in all, from Figs. 3–5 it should be clear that in an acoustic space where the dimensions correspond to the parameters presented above, there are indeed differences between the prominence categories in terms of where they are located in that space. However, within each individual dimension, there is a strong overlap, despite the differences discussed above, so it is not clear whether the category differences are sufficiently pronounced to identify clusters in that space that would correspond to the categories. This question will be addressed in the following section.

5.2. Finding clusters

For clustering the data, I next randomly selected another 100 data points for each category, creating a second subset with equal proportions of accent categories for each database. Since I had excluded accent categories for which there were less than 200 instances in the database, it is guaranteed that there are enough data points left of each category to have entirely new data points in this subset which have not been used for finding the weights above. I then clustered these data twice: once keeping the original values, and once adjusting the values by multiplying them with the dimension-specific perceptual weights determined above. In both cases, I used only those dimensions that had been identified as perceptually relevant by the procedure described in Section 4 above. Thus the clustering space was 11-dimensional in the case of SWMS, 5-dimensional for SWRK, and 13-dimensional in case of BRN.

As a first experiment, I clustered the data using k-means clustering as implemented in R (R Core Team, 2017) with k = 5, as we would optimally want to find clusters for 5 different categories of accents in each case. Please note that the assumption of a fixed number of clusters exclusively originates from the fact that k-means clustering does not decide on the appropriate number of clusters; instead it will only look for a given number of clusters. The assumption of a specific number of clusters thus is a necessary technicality in which the simulation differs from perceptual reality: Humans would of course not look for a given number of clusters but detect the clusters by perceiving particularly dense regions in perceptual space.

As a starting point for illustrating the idea of clustering as a means to detect categories, I will assume five clusters for now, since we know that the number of prominence categories in the data is five. In this respect, the clustering has an advantage over human detection, since humans would not even know beforehand what the "correct" number of clusters should be. It will be argued below that the number of clusters that should be expected is actually higher than that.

Fig. 6 shows visual representations of the results for clustering with and without adjusting the dimensions by the weights found above. For illustrating the effects of the weights, I used the same 11 dimensions that were found to be significant in the analysis above for clustering, i.e. here the only difference in clustering was in whether the weights were used for adjustment or not. Both representations were generated by mapping the 11-dimensional space to the first two discriminant dimensions using the plotcluster function provided by Hennig (2015). Each number in the plots corresponds to one data point. The number indicates the number of the cluster that the data point belongs to, while the color indicates its prominence category.
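The clustering comparison just described can be sketched in a few lines of R. The object names (`feats` for the 100-per-category sample in the retained dimensions, `weights` for the coefficients from Section 4, `labels` for the annotated categories) are hypothetical, and the plotcluster call only approximates Fig. 6, which additionally color-codes the prominence categories.

```r
## Sketch of k-means clustering with and without perceptual adjustment.
set.seed(1)                                   # k-means is initialization-sensitive
X_orig <- as.matrix(feats[, names(weights)])
X_adj  <- sweep(X_orig, 2, weights, `*`)      # perceptually adjusted space

km_orig <- kmeans(X_orig, centers = 5)
km_adj  <- kmeans(X_adj,  centers = 5)

## Cross-tabulate clusters against the labeled prominence categories (cf. Table 3).
table(cluster = km_adj$cluster, category = labels)

## 2-d discriminant projection of the clustering, as in Fig. 6 (fpc package).
library(fpc)
plotcluster(X_adj, km_adj$cluster)
```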

Fig. 5. Density plots showing the distributions of the parameters identified as important for each prominence category, for the BRN database. See text for further details.

In both plots, the clusters occupy specific regions in the space spanned by the two dimensions, i.e. numbers tend to appear in similar regions. For instance, in the upper plot, data points from cluster 4 tend to be in the lower left region, while those from cluster 1 are more towards the upper left, those from cluster 2 in the upper right region, etc. This is of course expected, since the clusters arise from grouping points together that were close in the original (upper panel), or in the adjusted (lower panel), 11-dimensional space, and they are still close when projected on the two discriminant dimensions. However, we hope to get a better separation of the clusters after the adjustment, which is confirmed here, since the data points before adjustment seem to be grouped around just one denser region slightly above and to the left of the center of the plot in the upper panel, while there are several regions with higher density after the adjustment. Second, the goal of the present paper is to show that prominence categories will form clusters in the appropriate perceptual space, so optimally, data points of the same prominence category should belong to the same cluster, or at least they should end up in similar regions. It can be observed in Fig. 6 that indeed, while each category (each color) appears in various areas in the upper panel, the colors are much better separated in the lower panel, confirming that the prominence categories are better separated after adjustment. They also tend to belong to similar clusters in the lower panel, i.e. data points of the same color tend to have the same number: most red data points for instance belong to cluster 4, most green data points appear to belong to cluster 3, cyan data points to clusters 1 and 4, etc. There is no perfect correspondence between cluster number and color after adjustment, but a better one than in the upper panel before the adjustment.

Fig. 6. 2-dimensional projection of the data points in clustering space using the SWMS data before (upper panel) and after (lower panel) adjusting the dimensions by the perceptual weights from Section 4. Numbers indicate the cluster number of the data point, while colors indicate its accent category. Optimally, data points with same colors should be crowded together, and have identical number labels. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

To illustrate the cluster-to-category correspondence after adjusting the weights more objectively, Table 3 indicates which prominence categories were found in which cluster after adjustment of the weights. Obviously, cluster 3 is dominated by L*H cases; they are more than twice as frequent as the other categories in that cluster. In cluster 4, H*L is most frequent, while cluster 5 is dominated by L*HL. However, there are always considerable numbers of other categories in each cluster. Also, the majority of cases in clusters 1 and 2 belong to two categories each: H* and NONE in case of cluster 1, and H* and L*HL in case of cluster 2. Analyzing the results from a category perspective, most H* cases belong to cluster 1. The correspondence is not perfect, as a considerable number of them occurs in cluster 2. However, only a few of them are grouped into any of the other clusters. Similarly, H*L accents are mostly in cluster 4, and also often in cluster 1, but rarely in other clusters. L*H is usually in cluster 3, L*HL in cluster 2, and the NONE cases mostly in cluster 1.

Table 3
Category-to-cluster correspondence for the SWMS data after adjusting the weights.

Cluster:    1     2     3     4     5
H*         43    35    13     2     7
H*L        33    17     2    45     3
L*H        21    10    56     2    11
L*HL       15    38    22     0    25
NONE       50    19     6    23     2

To illustrate how well the clusters found for these relatively few data, with only 100 cases of each category, are compatible with new data, I applied the obtained clustering to more data, taking up to another 100 cases of each category, if available, and assigning each of them to the closest of the cluster centers found above. Table 4 shows the result. Obviously, the distribution of categories across clusters is very similar even for new data, confirming that the regions derived from the clusters on one set of data generalize well to new data points.

Table 4
Category-to-cluster correspondence for the SWMS data when classifying further data by assigning them to the cluster that they are closest to.

Cluster:    1     2     3     4     5
H*         48    30    13     2     7
H*L        30    16     1    46     7
L*H        30     9    53     0     8
L*HL        0     7     1     0     2
NONE       48     8    14    18    12

In order to verify whether the impression gained from the visual inspection of the plots—that the adjustment led to better separation—is valid, I used the silhouette index (Rousseeuw, 1987) as a well-established measure of cluster separability relative to cluster cohesion, as implemented in the cluster package in R (Maechler, Rousseeuw, Struyf, Hubert, & Hornik, 2017). The silhouette index ranges from −1 to 1, with 1 indicating appropriate clustering. Table 5 gives an overview of the results. It can be seen that the silhouette index improves in all three cases when the dimensions are adjusted, though it is far from ideal even then.

Table 5
Average silhouette widths for original vs. adjusted dimensions, for clustering 100 new data points from each category with k-means clustering and k = 5. The higher values after adjustment indicate a better separation-to-cohesion ratio for all three databases after adjusting the dimensions using the perceptual weights.

Database    SWMS    SWRK    BRN
Original    0.110   0.165   0.102
Adjusted    0.225   0.225   0.179

Regarding the additional claim that the clusters are not only better separated but also a better match to the categories after adjusting the dimensions, I evaluated the clusterings in terms of another well-established measure, the Corrected Rand index (Gordon, 1999), using the implementation by Hennig (2015). Given two different groupings of data points, in our case the grouping found by k-means clustering vs. the grouping into the manually labeled prominence categories, the Rand index in general indicates how many pairs of elements are in the same group in both groupings, relative to the overall number of pairs, and thus is a measure to capture the match between the detected clusters and categories. Values for the Corrected Rand index can range from 0 to 1, with 1 indicating perfect match. Table 6 lists Corrected Rand indices for prominence categories and clusterings obtained on the original data vs. those obtained on the adjusted data, for each of the three databases. The indices are not close to 1, indicating far from perfect fit in all cases; however, they are consistently higher in all three databases after adjusting the dimensions using the perceptual weights.

Table 6
Corrected Rand indices for original vs. adjusted dimensions, for clustering 100 new data points from each category with k-means clustering and k = 5. The values indicate a slightly better cluster-to-category correspondence in all three databases after adjusting the dimensions using the perceptual weights.

Database    SWMS    SWRK    BRN
Original    0.095   0.051   0.065
Adjusted    0.116   0.061   0.067

The indices calculated above do confirm that the adjusted dimensions seem to make it easier to find reasonable clusters that correspond well to the prominence categories. However, how well is "well"? To give a second measure for the goodness of fit between clusters and categories, and one that I find easier to interpret intuitively, an accuracy-based measure was used in Schweitzer (2011): in exemplar-theoretic categorization, a listener would compare an incoming exemplar to the stored exemplars, and categorize the exemplar as belonging to the category that corresponds to the dominant label among the most similar stored exemplars. In other words, they would (unconsciously) find the cluster that the new exemplar belongs to, and assign the new exemplar the same label as the other exemplars from that cluster. In the same way, we can categorize each exemplar in the data as belonging to the majority class within its cluster. I suggest using the accuracy of this categorization as an easier-to-interpret measure of the goodness of fit between clusters and categories.

The formal definition of the classification accuracy is as follows. Let K = \{K_1, K_2, \ldots, K_N\} be the clusters and C = \{C_1, C_2, \ldots, C_M\} the prominence categories. To calculate a corresponding contingency table a, we determine for each pair of cluster and category how many instances are both members of cluster i and of category j, i.e., the cells a_{ij} of the contingency table are calculated using

a_{ij} = |\{\, x \mid x \in K_i \wedge x \in C_j \,\}|, \quad 1 \le i \le N,\ 1 \le j \le M

Then, the classification accuracy can be calculated from the contingency table:

\mathrm{class\_acc}(a) = \frac{\sum_{i=1}^{N} \max_j a_{ij}}{\sum_{i=1}^{N} \sum_{j=1}^{M} a_{ij}} \qquad (5)

For the SWMS data and k-means clustering with k = 5, for instance, the accuracy according to this definition is 36.0% for the original data, and 39.6% for the adjusted data. While this is clearly better than the chance baseline of 20%, the accuracies are disappointingly low: only around 40% of the data points in a cluster belong to one category; the remaining accents belong to other categories.

However, it is probably naïve to expect a perfect 1-to-1 correspondence where each cluster represents exactly one category. Indeed, Pierrehumbert (2003) herself concedes that, while the distributions for phoneme categories may be quite distinct for phonemes in the same contexts (and thus we could hope to find a perfect match between clusters and categories in these contexts), there may be overlap between distributions for different phonemes in different contexts. She therefore suggests that "positional allophones appear to be a more viable level of abstraction for the phonetic encoding system than phonemes in the classic sense" (Pierrehumbert, 2003, p. 211). This means that while the underlying distributions for phoneme categories are expected to overlap in phonetic space, the underlying distributions of positional allophones should be more clearly distinct.

Taking these considerations from the segmental domain to the prosodic domain, specifically to prominence, we would expect each prominence category to correspond to several clusters—where each cluster corresponds to a prominence category in some specific type of context, an "allotone", if you will. This would explain the only moderate Corrected Rand indices and silhouette widths found above when allowing only 5 clusters. Thus I varied the numbers of clusters k in a series of experiments, with k > 5. However, when requiring data where all prominence categories are represented by equal proportions, the problem is that for the relatively infrequent categories, we have only 100 new data points left. Consequently, the amount of data points altogether is 500 when keeping equal proportions. Thus with, say, 10 clusters, we would end up with 50 data points on average in each cluster, and it does not seem reasonable to aim for more, and therefore less populated, clusters. However, ultimately the clustering experiments presented here aim at simulating exemplar-theoretic acquisition, where we would be dealing with much more data than what we are currently looking at. Also, even though it was important to have equal proportions of each category for finding the weights, there is no reason to require equal proportions for the clustering. After all, in exemplar-theoretic acquisition, listeners would certainly be exposed to very unbalanced proportions of categories—this imbalance is in fact pertinent to any linguistic category. So in the next section I give up the requirement that the categories should be represented by equal proportions, allowing greater imbalance.
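For reference, the two evaluation measures used above—the majority-class accuracy of Eq. (5) and the average silhouette width—can be computed as in the following R sketch, with the hypothetical object names `km_adj`, `X_adj`, and `labels` carried over from the earlier sketch.

```r
## Majority-class accuracy of Eq. (5), computed from the contingency table.
class_acc <- function(clusters, categories) {
  a <- table(clusters, categories)            # contingency table a_ij
  sum(apply(a, 1, max)) / sum(a)              # sum_i max_j a_ij / sum_ij a_ij
}
class_acc(km_adj$cluster, labels)

## Average silhouette width (cf. Table 5), cluster package.
library(cluster)
sil <- silhouette(km_adj$cluster, dist(X_adj))
mean(sil[, "sil_width"])
```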

5.3. Clustering more data

In order to experiment with higher numbers of clusters and with more, and more imbalanced, data, I next ran a series of experiments for each database where I took at least 100, and, if available, up to 2000 instances of each category for clustering. Again, these data points had not been used for finding the weights. What is new compared to the previous section is that I will make use of both outcomes of the procedure for finding weights: the identification of relevant dimensions, plus the weights. Thus I will compare results when clustering the original data using all dimensions to results when clustering data that have been adjusted by the weights, excluding dimensions that have been found to be irrelevant above.

In these experiments I varied the number of clusters from 5 to the number of data points divided by 50, i.e. allowing on average 50 instances per cluster. For each cluster, I determined the majority category in that cluster, classified each data point in that cluster as belonging to that category, and computed the accuracy rate for this classification. Figs. 7–9 show the results on the three databases. It can be seen that, depending on the database, accuracies of between 60% and 65% can be obtained, compared to chance baselines of between 32% and 38% when classifying each data point as belonging to the overall majority class. For all three databases, there is a clear benefit of adjusting the dimensions by the perceptual weights, with the most pronounced advantage obtained on the BRN data, followed by the SWRK data. This overall benefit of the perceptual adjustment confirms the effectiveness of the proposed method.

As noted above, the increase in accuracy for the perceptual adjustment is highest for the BRN data. Recall that in the case of BRN, almost all dimensions proposed were found to be relevant for perception, and thus the clustering space had the highest dimensionality for BRN. I suggest that this is why the adjustment, which reflects modeling the relative relevance of the dimensions, is most helpful in case of BRN.

In all three cases, curves are slightly steeper for moderate numbers of clusters and flatter for higher numbers—in case of the SWMS and SWRK databases, the accuracy rates start to level off at around 30–40 clusters already, as can be seen from the elbows around that point. In case of BRN, there is no pronounced elbow, but again there is little increase beyond, say, 50 clusters. Given that we expect clusters that correspond to "positional allophones" (Pierrehumbert, 2003, p. 211) rather than clusters where each cluster would contain all instances of a category irrespective of context, these numbers may well be appropriate. With 5 prominence categories, at 50 clusters we would have an average of 10 positional allophones (or rather, "allotones") per category. These could possibly correspond to different implementations depending on position in the phrase or on voicing properties of syllable onset and coda, for instance—all aspects that are well known to affect accent shape.

Fig. 7. Accuracy rates on the SWMS data when treating all data points in a cluster as belonging to the majority category in that cluster. Light gray circles indicate accuracies on the original data; black diamonds indicate those on adjusted data.

Fig. 8. Accuracy rates on the SWRK data when treating all data points in a cluster as belonging to the majority category in that cluster. Light gray circles indicate accuracies on the original data; black diamonds indicate those on adjusted data.
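The accuracy-by-number-of-clusters curves in Figs. 7–9 can be approximated with a simple sweep over k, sketched below under the same assumptions as before: `X_adj` and `labels` here hypothetically denote the larger, possibly imbalanced weighted data set and its annotated categories, and `class_acc()` is the function defined in the evaluation sketch above.

```r
## Sweep over numbers of clusters, from 5 up to ~50 data points per cluster.
ks  <- seq(5, nrow(X_adj) %/% 50)
acc <- sapply(ks, function(k) {
  km <- kmeans(X_adj, centers = k, nstart = 5)   # nstart added for stability
  class_acc(km$cluster, labels)
})
plot(ks, acc, type = "b",
     xlab = "number of clusters", ylab = "majority-class accuracy")
```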

(Boersma & Weenink, 2017), then determined perceptual weights for clustering as above. In the case of BRN, all six parameters were retained in that procedure, with the largest coefficient and thus the greatest importance of vowel duration, followed by F1 and F2. For removing outliers and normalization I proceeded analogously to the procedure described in Sec- tion 3. The number of vowel categories in the resulting data used for clustering was nine. Fig. 10 shows a traditional F1/F2 plot for the vowels used for clustering below, along with their category labels. Axes were flipped in a way that the plot matches the usual vowel diagram. It can be seen that while

Fig. 9. Accuracy rates on the BRN data when treating all data points in a cluster as belonging to the majority category in that cluster. Light gray circles indicate accuracies on the original data; black diamonds indicate those on adjusted data.

Fig. 10. F1/F2 plot of the vowel data used for clustering. Axes were flipped in a way that the plot matches the usual vowel diagram. fashion, and she had stated that cluster centers obtained on F1/F2 data by Kornai (1998) are “extremely close” to mean vowel formants of the categories. Thus the category-to- cluster correspondence for vowels in the above data should be a good indicator of what can optimally be expected for prominence categories. I assume that the correspondence observed on vowel data should be an upper bound rather than a lower bound for what to expect for prominence categories, since vowel categories have long been accepted as valid cat- egories, while this is probably more controversial for pitch accents. Also, prosodic categories are notorious for lower labeling consistency, while vowel categories seem to be much less problematic in that respect. Fig. 11. 2-dimensional projection of the data points in clustering space using the BRN For clustering the vowels, I extracted F1, F2 and F3 as well vowel data before (upper panel) and after (lower panel) adjusting the dimensions by the as durations, spectral balance, and spectral tilt for all full perceptual weights from Section 4. Numbers indicate the cluster number of the data point, while colors indicate its accent category. Same-colored data points are crowded monophthongs in the three databases using a Praat script together much better after the adjustment. 18 A. Schweitzer / Journal of Phonetics 77 (2019) 100915

clusters. In the earlier study, the optimal number of clusters of around 1600 clusters had been found using a different eval- uation method: it was obtained on independent test data using 10-fold cross validation, with 90% of the data used clustering in each fold, at the expense of having much more unbalanced data in terms of categories for the clustering. Thus we cannot easily compare the numbers of clusters in the two studies.8 The more imbalanced data in the earlier study also affected the accuracy results: the majority baseline (the frequency of the NONE category) in that study was at almost 78%. Not sur- prisingly given this strong baseline, the accuracy rates of around 85% were higher than those of around 65% in the pre- sent study, where the majority baselines were between 32% and 38%. However the considerable differences between majority baseline and the obtained accuracies in the present study demonstrate that the approach taken here is the more promising one. Also, the present study relies solely on acoustic

Fig. 12. Accuracy rates for vowel classification on the BRN data when treating all data parameters, whereas the earlier study also used higher- points in a cluster as belonging to the majority category in that cluster. Light gray circles linguistic parameters such as the location of word stress or indicate accuracies on the original data; black diamonds indicate those on adjusted data. part-of-speech information. Thus it included some dimensions that are known to be highly predictive of prominence—for the vowel categories end up in the expected regions, there is instance unstressed syllables always belong to the promi- also considerable overlap between vowel categories in the nence category NONE, and nouns and adjectives are much — F1/F2 dimensions. more likely to be pitch-accented than function words but Fig. 11 shows 2-dimensional representations of clusterings these dimensions on the other hand encode categories that using k-means with k ¼ 9 and only 100 data points per vowel in an exemplar-theoretic account of speech acquisition would category, plotted using the first two discriminant dimensions, have to be learned from the data in the same way as promi- as above. The plots indicate results before (upper panel) and nence categories are learned. Probably, the two would be after (lower panel) adjusting the dimensions. It can be seen learned jointly, and thus the former would not be available that vowels of the same categories end up in more similar when establishing the prominence categories. Thus the chal- regions of the space after the adjustment, and that vowels of lenge in the present study is considerably higher than in the the same categories seem to end up in the same cluster more earlier study, and also a more realistic approximation of real often. I then clustered the vowel data using up to 2000 human prominence acquisition. instances of each vowel, once in the original space, and once The present study simulates in a very simple way the acqui- in the adjusted space, varying the numbers of clusters as sition of prominence categories. It does so using read data of above. The accuracies obtained for vowel categorization the kind that adults would be exposed to rather than sponta- based on these clusterings are given in Fig. 12 for BRN as neous data of the kind that children would be exposed to. How- an example; the graphs for SWRK and SWMS look very simi- ever I believe that the distribution of prominence categories lar. In general the similarity of each of the graphs to the graphs was shifted towards what children would hear: By selecting obtained for clustering prominence categories is striking. The portions of the data that contained as many of the infrequent majority baseline in the case of vowels is lower than that for prominence categories as possible, these data were less fi the above experiments, but we obtain similar rates of slightly imbalanced than the full data set, speci cally, they favored above 60%. Again, the results are better when allowing for pitch-accented categories over the most frequent category more clusters than categories–around 40 to 50 seems to be NONE. Given that child-directed speech has been shown to a good choice in this case. be prosodically more exaggerated than adult-directed speech Comparing the results for vowel category detection and (e.g. 
Fernald et al., 1989; Vosoughi & Roy, 2012) this should prominence category detection, it is found that we can do sim- match the distributions that children are exposed to slightly ilarly well in both cases, just slightly better on prominence cat- better. egories in terms of absolute accuracies. When taking into account the higher number of categories in the case of vowels and the consequently lower majority baseline however, then it 8 In order to make sure that the lower numbers of clusters in the present study are not can be said that the clustering adds slightly more information due to the fact that there was no such external evaluation on independent data, I used for vowels. another up to 2000 independent data points per remaining category for evaluating the clusters again. The accuracies when assigning new data points to the nearest cluster center were consistently higher than those for the data originally used for clustering. This is 6. Discussion not too surprising, since the independent data are necessarily much more imbalanced: after the last instances of the less frequent prominence categories have been used for detecting the clusters, only very frequent categories, i.e. almost exclusively unaccented Compared to the Schweitzer (2011) study, the best accu- syllables, are left for the evaluation. However the fact that the accuracies for independent racy rates in the present study are reached at far fewer num- test data are higher than those obtained on the clustering data themselves clearly indicates — that the results on the clustering data do not suffer from overfitting. Also, the results on the bers of clusters in case of the SWMS and SWRK independent data do not indicate that higher numbers of clusters are better, instead, the databases, at around 30–40, in case of BRN, at around 50 accuracies are largely independent of the number of clusters. A. Schweitzer / Journal of Phonetics 77 (2019) 100915 19

In any case, I would like to emphasize that this study is to be interpreted as an exploration of the claim put forth by Pierrehumbert (2003) that categories can be established, or at least initiated, by detecting clusters in perceptual space. It is not intended to provide a fully-fledged simulation of speech acquisition. The latter would require a substantial amount of real child-directed, and prosodically labeled, data, recorded at the age of acquisition of prosodic categories, and such data is currently, at least to my knowledge, not available for any language. A future full account would also require modeling exemplar production. As discussed in Section 1.1, production would rely on the category labels that are stored with the exemplars. For instance, when producing the word "ball", exemplars which are labeled as ball exemplars will unconsciously be activated and then contribute to establishing the production target. In extending exemplar-theoretic production to the prosodic domain, one would have to decide which abstract labels can be assumed to be stored. The prominence categories investigated here are maybe too abstract to be accessible to speakers as potential labels. Given the communicative function of prominence categories, I would expect that labels such as "new information" or "corrective information" serve as a proxy to the prosodic categories, at least at the beginning. Such "easy" cognitive concepts have been argued to be accessible to infants already before they are implemented with adult prosody (Höhle, Berger, & Sauermann, 2016). Furthermore, modeling production would require modeling activation of individual exemplars, as well as potential consequences such as activation competition and resonance. Future work could thus (i) try to run similar cluster experiments using data for which such easy concepts have been annotated and (ii) model production including more complex aspects such as considering activation of exemplars.

There is still much to be learned from the present study. First of all, it suggests a simple approach to establishing and scaling the perceptual importance of the acoustic dimensions and shows that this considerably increases the quality of the resulting clusters. The fact that the scaling is done using one uniform weight for each dimension, rather than a more complex non-linear warping of these dimensions, again owes to exemplar-theoretic considerations: while perceptual warping of the phonetic space as evidenced in the well-known perceptual magnet effect (Kuhl, 1991, PME) might be taken to suggest that different areas along each dimension have to be adjusted in different ways, Lacerda (1995) has already shown that the non-linearity in the PME can be modeled as a consequence of different densities of exemplars along the dimensions, without assuming any non-linear adjustment.⁹ Second, the experiments show that clusters corresponding to prominence categories can indeed be detected, as predicted by exemplar theory, and that they can be detected with similar accuracy as clusters corresponding to vowel categories. As suspected by Pierrehumbert (2003), there is not a perfect 1-to-1 correspondence between clusters and categories, but both in the case of vowels and in the case of prominence categories we can identify just a few clusters on average for each category, and these clusters can be taken to correspond to allophones of vowel categories in the segmental domain, or to "allotones" of prominence categories in the prosodic domain.

7. Conclusion

I have illustrated the exemplar-theoretic integration of phonological form and phonetic substance in the domain of prominence. According to exemplar theory, phonological categories arise from abstracting over clusters of phonetically similar exemplars that are associated with the same meanings. To model the perceptual importance of potentially relevant dimensions, I have suggested a simple procedure to derive perceptual weights for scaling the dimensions and shown that it considerably facilitates the detection of clusters corresponding to positional variants of prominence categories. The procedure not only yields weights, it also makes it possible to identify which dimensions are relevant for perception. Thus, as a by-product, the adjustment procedure can in general be used to confirm or reject hypotheses about which parameters play a role in distinguishing perceptual categories. The number of acoustic-prosodic dimensions that were found to be relevant in each database here demonstrates that speakers encode prominence jointly by a variety of parameters. The difference between SWRK and SWMS in that respect shows that the use of these dimensions can also be speaker-specific.

In contrast to an earlier study, the present study assumes only low-level phonetic features that should be available early in speech acquisition, and a more even distribution in terms of prominence categories, at the expense of altogether lower accuracy rates when evaluating the clusters. However, the increase in accuracy over the overall majority baseline (i.e. the baseline corresponding to an "educated guess") is much greater in the current study than in the earlier study. This attests to a greater effect of the dimensions used in the present study. In addition, the number of detected categories is reasonably lower in the current study, confirming the plausibility of an exemplar-theoretic approach to category detection in general and to prominence category detection in particular.

References

Aronov, G., & Schweitzer, A. (2016). In C. Draxler & F. Kleber (Eds.), Tagungsband der 12. Tagung Phonetik und Phonologie im deutschsprachigen Raum (pp. 12–15).
Arvaniti, A., Ladd, D. R., & Mennen, I. (1998). Stability of tonal alignment: The case of Greek prenuclear accents. Journal of Phonetics, 26, 3–25.
Barbisch, M., Dogil, G., Möbius, B., Säuberlich, B., & Schweitzer, A. (2007). Unit selection synthesis in the SmartWeb project. In Proceedings of the 6th ISCA Workshop on Speech Synthesis (SSW-6, Bonn) (pp. 304–309).
Barnes, J., Veilleux, N., Brugos, A., & Shattuck-Hufnagel, S. (2012). Tonal Center of Gravity: A global approach to tonal implementation in a level-based intonational phonology. Laboratory Phonology, 3, 337–383.
Baumann, S., & Röhr, C. (2015). The perceptual prominence of pitch accent types in German. In Proceedings of the 18th International Congress of Phonetic Sciences, Glasgow.
Baumann, S., & Winter, B. (2018). What makes a word prominent? Predicting untrained German listeners' perceptual judgments. Journal of Phonetics, 70, 20–38.
Beckman, M. E., & Ayers, G. M. (1994). Guidelines for ToBI labelling, version 2.0.
Boersma, P., & Weenink, D. (2017). Praat, a system for doing phonetics by computer [computer program]. http://www.praat.org/. Version 6.0.36, retrieved 20 Nov. 2017.
Bolinger, D. L. (1957). On intensity as a qualitative improvement of pitch accent. Lingua, 7, 175–182.
Bolinger, D. L. (1958). A theory of pitch accent in English. WORD, 14, 109–149.
Bruce, G. (1977). Swedish word accents in sentence perspective. Gleerup, Lund: Travaux de l'Institut de Phonétique XII.

Calhoun, S., & Schweitzer, A. (2012). In G. Elordieta Alcibar & P. Prieto (Eds.), Prosody Möhler, G. (2001). Improvements of the PaIntE model for F0 parametrization. and Meaning (Trends in Linguistics) (pp. 271–327). Mouton DeGruyter. Manuscript. URL:http://www.ims.uni-stuttgart.de/institut/mitarbeiter/moehler/papers/ Campbell, N., & Beckman, M. E. (1997). Stress, prominence, and spectral tilt. In A. gm_aims01.ps.gz.. Botinis, G. Kouroupetroglou, & G. Carayiannis (Eds.), Intonation: Theory, models Möhler, G., & Conkie, A. (1998). Parametric modeling of intonation using vector and applications (proceedings of an ESCA workshop, September 18–20, 1997, quantization. In Proceedings of the third international workshop on speech synthesis Athens, Greece) (pp. 67–70). ESCA and University of Athens Department of (Jenolan Caves Australia) (pp. 311–316). Informatics. Naimi, B., Hamm, N. A. S., Groen, T. A., Skidmore, A. K., & Toxopeus, A. G. (2014). Cangemi, F., & Grice, M. (2016). The importance of a distributional approach to Where is positional uncertainty a problem for species distribution modelling. categoriality in autosegmental-metrical accounts of intonation. Laboratory Ecography, 37, 191–203. https://doi.org/10.1111/j.1600-0587.2013.00205.x. Phonology: Journal of the Association for Laboratory Phonology, 7,1–20. Niebuhr, O. (2013). The acoustic complexity of intonation. In E. L. Asu & P. Lippus (Eds.), Cole, J., Hualde, J. I., Smith, C. I., Eager, C., Mahrt, T., & de Souza, R. N. (2019). Sound, Nordic prosody XI (pp. 15–29). Frankfurt: Peter Lang. structure and meaning: The bases of prominence ratings in English, French and Niebuhr, O., & Pfitzinger, H. R. (2010). On pitch-accent identification – The role of Spanish. Journal of Phonetics, 75,113–147. https://doi.org/10.1016/ syllable duration and intensity. In Speech prosody 2010. , pp. 100773:1–4. j.wocn.2019.05.002. Niebuhr, O., & Ward, N. G. (2018). Challenges in studying prosody and its pragmatic Crosswhite, K. (2003). Spectral tilt as a cue to word stress in Polish, Macedonian, and functions: Introduction to JIPA special issue. Journal of the International Phonetic Bulgarian. In Proceedings of ICPhS 2003 (Barcelona, Spain) (pp. 767–770). Association, 48,1–8. Fernald, A., Taeschner, T., Dunn, J., Papousek, M., de Boysson-Bardies, B., & Fukui, I. Okobi, A. O. (2006). Acoustic correlates of word stress in American English (Ph.D. (1989). A cross-language study of prosodic modifications in mothers’ and fathers’ thesis). Massachusetts Institute of Technology. speech to preverbal infants. Journal of Child Language, 16, 477–501. https://doi.org/ Ostendorf, M., Price, P. J., & Hufnagel, S. S. (1995). The Boston University radio news 10.1017/S0305000900010679. corpus (Technical Report). Linguistic Data Consortium. Technical Report. Fry, D. B. (1955). Duration and intensity as physical correlates of linguistic stress. Peters, J., Hanssen, J., & Gussenhoven, C. (2015). The timing of nuclear falls: Evidence Journal of the Acoustical Society of America, 27, 765–768. from Dutch, West Frisian, Dutch Low Saxon, German Low Saxon, and High Goldinger, S. D. (1996). Words and voices: Episodic traces in spoken word identification German. Laboratory Phonology, 6,1–52. and recognition memory. Journal of Experimental : Learning, Memory, Pierrehumbert, J. (1980). The phonology and phonetics of English intonation (Ph.D. and , 22, 1166–1183. thesis). Cambridge, MA: MIT. Goldinger, S. D. (1997). 
Words and voices—perception and production in an episodic Pierrehumbert, J. (2001). Exemplar dynamics: Word frequency, lenition and contrast. In lexicon. In K. Johnson & J. W. Mullennix (Eds.), Talker variability in speech J. Bybee & P. Hopper (Eds.), Frequency and the emergence of linguistic structure processing (pp. 33–66). San Diego: Academic Press. (pp. 137–157). Amsterdam: Benjamins. Goldinger, S. D. (1998). Echoes of echoes? An episodic theory of lexical access. Pierrehumbert, J. (2003). In R. Bod, J. Hay, & S. Jannedy (Eds.), Probability theory in Psychological Review, 105, 251–279. linguistics (pp. 177–228). The MIT Press. Gordon, A. D. (1999). Classification (2nd ed.). Chapman and Hall. Pierrehumbert, J. B. (2016). Phonological representation: Beyond abstract versus Grabe, E. (1998). Pitch accent realization in English and German. Journal of Phonetics, episodic. Annual Review of Linguistics, 2,33–52. https://doi.org/10.1146/annurev- 26, 129–143. linguistics-030514-125050. Grice, M., Baumann, S., & Jagdfeld, N. (2009). Tonal association and derived nuclear Prieto, P., van Santen, J., & Hirschberg, J. (1995). Tonal alignment patterns in Spanish. accents—The case of downstepping contours in German. Lingua, 119, 881–905. Journal of Phonetics, 23, 429–451. https://doi.org/10.1006/jpho.1995.0032. Grice, M., Ritter, S., Niemann, H., & Roettger, T. B. (2017). Integrating the discreteness R Core Team (2017). R: A language and environment for statistical computing. Vienna, and continuity of intonational categories. Journal of Phonetics, 64,90–107. Austria: R Foundation for Statistical Computing. URL:https://www.R-project.org/. Gussenhoven, C., & Rietveld, A. (1988). Fundamental frequency declination in Dutch: Rietveld, A., & Gussenhoven, C. (1985). On the relationship between pitch excursion Testing three hypotheses. Journal of Phonetics, 16, 355–369. size and prominence. Journal of Phonetics, 13, 299–308. Hennig, C. (2015). fpc: Flexible Procedures for Clustering. R package version 2.1-10. Rousseeuw, P. (1987). Silhouettes: A graphical aid to the interpretation and validation of Höhle, B., Berger, F., & Sauermann, A. (2016). Information structure in first language cluster analysis. Journal of Computational and Applied Mathematics, 20,53–65. acquisition. In C. Féry & S. Ishihara (Eds.), The oxford handbook of information van Santen, J., & Hirschberg, J. (1994). Segmental effects on timing and height of pitch structure (pp. 562–580). Oxford University Press. contours. In Proceedings of the 3rd International Conference on Spoken Language Jilka, M., & Möbius, B. (2007). The influence of vowel quality features on peak alignment. Processing (ICSLP 94) (pp. 719–722). Yokohama, Japan. In Proceedings of Interspeech 2007 (Antwerpen) (pp. 2621–2624). Schweitzer, A. (2011). Production and perception of prosodic events—Evidence from Johnson, K. (1997). Speech perception without speaker normalization: An exemplar corpus-based experiments (Ph.D. thesis). Universität Stuttgart. model. In K. Johnson & J. W. Mullennix (Eds.), Talker variability in speech Schweitzer, A., Möhler, G., Dogil, G., Möbius, B., in preparation. The PaIntE model of processing (pp. 145–165). San Diego: Academic Press. intonation, in: Barnes, J., Shattuck-Hufnagel, S. (Eds.), Prosodic Theory and Kochanski, G., Grabe, E., Coleman, J., & Rosner, B. S. (2005). Loudness predicts Practice. MIT Press. prominence: Fundamental frequency lends little. 
The Journal of the Acoustical Schweitzer, K., Walsh, M., Calhoun, S., Schütze, H., Möbius, B., Schweitzer, A., & Dogil, Society of America, 118, 1038–1054. G. (2015). Exploring the relationship between intonation and the lexicon: Evidence Kornai, A. (1998). Analytic models in phonology. In J. Durand & B. Laks (Eds.), The for lexicalised storage of intonation. Speech Communication, 6,65–81. organization of phonology: Constraints, levels and representations (pp. 395–418). Silverman, K., & Pierrehumbert, J. (1990). The timing of prenuclear high accents in Oxford, U.K.: Oxford University Press. English. In J. Kingston & M. E. Beckman (Eds.), Papers in laboratory phonology. Kügler, F., & Gollrad, A. (2015). Production and perception of contrast: The case of the Volume I of papers in laboratory phonology (pp. 72–106). Cambridge University fall-rise contour in German. Frontiers in Psychology, 6, 1254. https://doi.org/10.3389/ Press. fpsyg.2015.01254. Sluijter, A. M., & van Heuven, V. J. (1997). Spectral balance as an acoustic correlate of Kuhl, P. K. (1991). Human adults and human infants show a ‘perceptual magnet effect’ linguistic stress. Journal of the Acoustical Society of America, 2471–2486. for the prototypes of speech categories, monkeys do not. Perception and Taylor, P., Caley, R., Black, A.W., & King, S. (1999). Edinburgh speech tools library [ Psychophysics, 50,93–107. http://festvox.org/docs/speech_tools-1.2.0/]. System Documentation Edition 1.2, for Lacerda, F. (1995). The perceptual-magnet effect: An emergent consequence of 1.2.0 15th June 1999.. exemplar-based phonetic memory. In Proceedings of the 13th international Terken, J., & Hermes, D. (2000). The perception of prosodic prominence. In M. Horne congress of phonetic sciences (Stockholm) (pp. 140–147). (Ed.), Prosody: Theory and experiment (pp. 89–127). Kluwer Academic Publishers. Ladd, D. R. (1996). Intonational phonology. Number 79 in Cambridge studies in Turk, A. E., & White, L. (1999). Structural influences on accentual lengthening in English. linguistics. Cambridge, UK: Cambridge University Press. Journal of Phonetics, 27, 171–206. Liberman, M., & Pierrehumbert, J. (1984). Intonational invariance under changes in pitch Vosoughi, S., & Roy, D. (2012). A longitudinal study of prosodic exaggeration in child- range and length. In M. Aronoff & R. T. Oehrle (Eds.), Language sound structure directed speech. In SP-2012 (pp. 194–197). (pp. 157–230). Cambridge: MIT Press. Wade, T., Dogil, G., Schütze, H., Walsh, M., & Möbius, B. (2010). Syllable frequency Lieberman, P. (1960). Some acoustic correlates of word stress in American English. The effects in a context-sensitive segment production model. Journal of Phonetics, 38, Journal of the Acoustical Society of America, 32, 451–454. https://doi.org/10.1121/ 227–239. 1.1908095. Wagner, P., Ćwiek, A., & Samlowski, B. (2019). Exploiting the speech-gesture link to Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., & Hornik, K. (2017). Cluster: Cluster capture fine-grained prosodic prominence impressions and listening strategies. analysis basics and extensions. R package version 2.0.6. Journal of Phonetics (in press). Maye, J., & Gerken, L. (2000). Learning phonemes without minimal pairs. In Wahlster, W. (2004). Smartweb: Mobile applications of the semantic web. In S. Biundo, Proceedings of the 24th annual Boston University conference on language T. Frühwirth, & G. Palm (Eds.), KI 2004: Advances in artificial intelligence development (pp. 522–533). Somerville, Mass: Cascadilla Press. (pp. 
50–51). Berlin/Heidelberg: Springer. Maye, J., Werker, J. F., & Gerken, L. (2002). Infant sensitivity to distributional information Walsh, M., Möbius, B., Wade, T., & Schütze, H. (2010). Multilevel exemplar theory. can affect phonetic discrimination. Cognition, 82, B101–B111. Cognitive Science, 34, 537–582. Mayer, J. (1995). Transcription of German intonation—The Stuttgart system (Technical Ward, N. G., & Gallardo, P. (2017). Non-native differences in prosodic-construction use. Report). Institute of Natural Language Processing, University of Stuttgart. Dialogue & Discourse, 8,1–30.