Linköpings universitet/Linköping University | Department of Computer and Information Science Bachelor thesis, 18 hp | Cognitive Sciences Spring term 2021 | LIU-IDA/KOGVET-G--21/028--SE

The effects of emotional prosody on perceived clarity in degraded speech

Rasmus Lindqvist

Supervisor: Carine Signoret Examiner: Michaela Socher

Copyright
The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: https://ep.liu.se/.

© 2021 Rasmus Lindqvist


Abstract

The ability to hear is important to communicate with other people. People suffering from hearing loss are more likely to also suffer from loneliness and depression (Mener et al., 2013; Mo et al., 2005). To understand how degraded speech is recognized, the pop-out effect has been studied. The pop-out effect is the moment when a listener recognizes the meaning of degraded speech. Previous research on the pop-out effect in the perception of speech has predominantly focused on top-down processes, such as form-based priming and semantic coherence in sentences. The purpose of this study was to research the relationship between emotional prosody and the perception of speech at varying levels of degradation. The participants were presented with sentences with angry, neutral or happy prosody at varying levels of noise vocoding. The participants were then asked to rate the perceived amount of noise for each sentence, and whether the prosody was perceived as positive, neutral or negative for each sentence. The results suggest that the participants' ability to perceive positive prosody in the sentences decreased more rapidly than their ability to perceive negative prosody as the amount of noise increased. The results did not show any statistically significant evidence that emotional prosody had an effect on the perceived amount of noise. Future research should investigate emotional prosody together with emotional semantics, as emotionally coherent spoken sentences, and their influence on speech perception in adverse listening conditions, in order to further investigate the factors contributing to the pop-out effect.

Keywords: Emotional prosody, Speech perception, Degraded speech, Noise vocoding, Pop-out effect


Acknowledgement

First and foremost, I want to express my gratitude to Carine Signoret for taking on the role as my supervisor. Thank you for your insightful ideas, feedback, and for guiding me back on track all those times when I felt lost and confused. I would also like to say a special thank you to Mattias Ekberg for providing me with the recorded audio material used in this study, and for sharing valuable and relevant articles in the field. Lastly, I would like to thank my friends and family for supporting me and helping me throughout the whole process.

Linköping in June 2021

Rasmus Lindqvist


Table of Contents

Copyright
1. Introduction
2. Theory
   2.1 Pop-out effect
   2.2 Emotions and emotional categories
   2.3 Emotional prosody
   2.4 Noise vocoding
   2.5 Emotions and perception of speech
   2.6 About this study
3. Method
   3.1 Participants
   3.2 Material
      3.2.1 HINT
      3.2.2 Emotional Prosody material
      3.2.3 Test
      3.2.4 Task
      3.2.5 Procedure
   3.3 Data analysis
4. Result
   4.1 Ability to perceive emotional prosody
   4.2 Perceived amount of noise
5. Discussion
   5.1 Result discussion
      5.1.1 Ability to perceive emotional prosody
      5.1.2 Perceived amount of noise
   5.2 Implications
   5.3 Method discussion
   5.4 Ethics
   5.5 Conclusion and Future Research
References
Appendix
   A.1 List of the Swedish version of HINT sentences
   A.2 Consent form
   A.3 Test
   A.4 Descriptive data for Ability to perceive emotional prosody
   A.5 Durbin-Conover for Ability to perceive emotional prosody
   A.6 Descriptive data for Amount of perceived noise


List of Abbreviations

Abbreviation   Meaning
ERP            Event-related potential
NV             Noise vocoding
STG            Superior temporal gyrus
STS            Superior temporal sulcus
RMS            Root mean square


1. Introduction

Our ability to hear is one of our most important senses. Auditory perception allows us to relate to the world for a variety of significant purposes and has played an integral role in the survival of our species (Heffner & Heffner, 1992). The ability to perceive sounds and to locate them has enabled us to approach or evade other animals, and to direct the attention of our other senses towards other valued sources of sound. However, one of the most important aspects of hearing is that it enables us to communicate and connect with other people in a way that the other senses cannot. As the American author and activist Helen Keller said, "Blindness cuts us off from things, but deafness cuts us off from people".

Our ability to communicate with others is highly reliant on the ability to perceive speech. Speech is among the most complex sounds that we have to perceive, and our ability to perceive it is therefore fragile to hearing loss, lack of contextual awareness and background noise. Many of the patients seen by hearing healthcare providers seem to suffer from anxiety in some way (Carmen & Uram, 2002). Hearing loss also seems to be strongly associated with both loneliness and depression in older adults (Mener et al., 2013). Additionally, adults with sufficient damage to the cochlea to receive cochlear implants reported a decreased ability to communicate, feeling more isolated, feeling more of a burden, and having worse relations with friends and family before receiving a cochlear implant (Mo et al., 2005).

To help decrease the sense of exclusion that people with hearing loss suffer from, a substantial amount of research has gone into understanding the factors that contribute to and improve the perception of speech. Among this body of research is the pop-out effect, the moment when listeners experience a "pop-out" of the meaning of words in degraded speech (Davis et al., 2005). Previous research on the pop-out effect in speech perception has predominantly focused on top-down processes, such as form-based priming and semantic coherence in sentences (Signoret & Rudner, 2017; Signoret et al., 2019). Such processes are closely related to working memory and higher-order cognitive abilities. However, less effort has been put into mapping emotional processing and its effects on the bottom-up processes that influence the perception of speech. Research suggests that emotional prosody, speech qualities such as frequency, energy and articulation rate, could increase attentional awareness of speech (Grandjean et al., 2005), and that emotionally valenced visual stimuli can affect lexical access speed, hinting that emotional prosody could be another contributing factor when perceiving speech.

The purpose of this study is to research how emotional prosody affects speech perception at different levels of degraded speech. More specifically, how the emotional prosody of anger, happiness and neutral speech affects the subjective rating of perceived noise in noise vocoded sentences.

The research questions for this study are:

• How does noise degradation affect the recognition of emotional prosody?

• How does emotional prosody affect the perceived clarity of degraded speech?


2. Theory

2.1 Pop-out effect

The pop-out effect refers to the moment when the meaning of a sentence presented in degraded speech suddenly becomes understandable to the listener. The amount of self-reported perceived noise in sentences with background noise decreases drastically if the listeners are familiar with the material. Similar effects can be found in multiple sensory modalities: when presented with a degraded version of a previously shown clear image, viewers experience a form of "Eureka" moment (Ahissar & Hochstein, 2004), similar to the moment when listeners recognize the words in a degraded sentence, i.e. the pop-out effect (Davis et al., 2005).

One way of studying how people perceive speech over time has been through recordings of event-related potentials (ERPs), which have revealed a lot about the neural basis of semantic processing (Lau et al., 2008). Several ERP responses have been investigated in speech and language studies. The ERP most associated with the pop-out effect is the negative peak at 400 ms (N400). The N400 response is most widely described as an increase in negativity in response to semantic violations like 'I like my coffee with cream and socks'. The experimental designs of studies focusing on the N400 response are of broad interest because they tap into key aspects of language comprehension (Lau et al., 2008).

According to prediction error minimization accounts of perception, the brain tries to minimize the mismatch between incoming sensory input and top-down expectations (Friston, 2010; Rao & Ballard, 1999). A prediction error minimization account claims that in order to understand speech, the listener must create a collection of expectations at multiple levels of representation that attempt to explain the auditory data as precisely as possible (Paczynski & Kuperberg, 2012). The amplitude of the N400 in response to the final word of a sentence increases with how surprising that word is, given the context of the sentence, which is consistent with the role of expectations in speech comprehension (Kutas & Federmeier, 2011; Kutas & Hillyard, 1984). As a result, the N400 can be thought of as a measure of the amount of mismatch between a semantic prediction and the incoming sensory input, something along the lines of a semantic prediction error (Bornkessel-Schlesewsky & Schlesewsky, 2019; Paczynski & Kuperberg, 2012).

Signoret and Rudner (2017) and Signoret et al. (2019) have previously shown the enhancing effects that both the semantic coherence of spoken language and being presented with the written form of what is about to be said have on the perceived clarity of speech. Both semantic coherence, whereby the syntactic structure of the sentence can be understood as something meaningful, and form-based priming, the written form of the sentence, enable a more precise prediction of speech. Both semantic coherence and form-based predictions contribute to a decrease of the N400 during speech recognition (Signoret et al., 2020), that is, they reduce the amount of mismatch between the top-down expectations and the presented auditory stimulus.


2.2 Emotions and emotional categories

Emotions are usually referred to as distinct categories, such as anger, sadness, fear, and so on. Panksepp (1998) describes these distinct categories as biologically inherited circuits, similar to circuits found in homologous mammalian species, with each emotional state, or circuit, contributing to behaviors and physiological patterns in humans. However, for the emotional categories not described by Panksepp (1998, Chapter 2), like anger, sadness, fear and other commonly used emotional states, there is no clear biological or behavioral marker (Barrett, 2006). In the sense that people feel angry and see anger in other people's actions, these types of emotions occur in experience (Barrett, 2006), although there is some disagreement about whether anger is a scientific category with causal status. While people can easily and instinctively recognize anger, sadness, fear, and other emotions in themselves and others, there are no behavioral or physiological patterns that differentiate these emotions from one another, implying that these categories occur within the perceiver rather than in nature. Barrett (1998) found that although some people are able to distinguish between distinct emotional states, all of the participants could distinguish pleasurable from displeasurable emotional states. The categorisation of emotional states as either pleasurable or displeasurable can be referred to as emotional valence (Feldman Barrett & Russell, 1998). Kuchinke et al. (2005) found in an fMRI study that the processing of words can be positively influenced by the emotional valence of their contents. During a visual lexical decision task, the participants of the study had to determine whether the presented word was a noun or a nonword. Kuchinke et al. (2005) found that response times and accuracy were significantly better for the words with emotional valence compared to the emotionally neutral ones. These findings were further investigated by Citron et al. (2014), who found that the combined effects of both emotional valence and arousal (intensity) of words could enhance perceptual processing. Such findings stress the impact of emotional content on the processing of language.

2.3 Emotional prosody

Prosody encompasses a variety of speech features that have historically been regarded as "suprasegmental", i.e. separate from the distinct units, such as consonants and vowels, of segmental phonology. This includes intonation, rhythm, and the distribution of pauses (Wennerstrom, 2001). Prosody has both universal and language-specific features. Prosodic features can express emotional priorities universally; for example, the prosodic features of warning cries are more likely to have higher volume and pitch than intimate conversations, regardless of the language background of the producer (Frick, 1985). However, different languages have their own intonation systems, rhythm and pause distribution (Wennerstrom, 2001).

Different emotional prosodies can be distinguished by listeners based on a number of acoustic variations of speech. Among these variations in acoustic qualities associated with different emotional prosodies are pitch, intensity or energy, and rate of articulation (Banse & Scherer, 1996). The pitch, or the frequency of the speech which is perceived as pitch, is often measured by the fundamental frequency (F0) of the voice. The F0 refers to the approximate frequency of the vocal folds' oscillation. Emotional prosodies such as anger and happiness are often characterized by an increase of the mean F0. The perceived vocal intensity or energy is often dependent on the amplitude of the oscillations of the vocal folds, where the emotional prosody of anger includes an increase in high-frequency energy, and happiness an increase in mean energy. As for the rate of articulation, emotional prosodies characterize themselves by temporal aspects, such as the tempo of the speech and the distribution of pauses (Banse & Scherer, 1996). Both anger and happiness have been found to speed up the rate of articulation.

Grandjean et al. (2005) previously demonstrated that angry prosody may increase activity in the associative auditory cortex, an effect that occurs even when voice prosody and position are unrelated to the listener's goal, and that is independent of or additive to any concomitant modulation by the spatial distribution of auditory attention in the right STS. Fearful faces have shown a similar enhancement by emotion, additive to spatial attention, in the face-sensitive fusiform region compared to neutral faces. This means that the right STS, like the right fusiform in the visual domain, may have an auditory feature finely tuned to extract socially and affectively salient signals from conspecifics.
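The acoustic cues discussed above can be illustrated with a minimal sketch (not part of the original study) of how mean energy and F0 might be estimated from a mono speech recording using NumPy and SciPy; the file name and the simple autocorrelation-based pitch estimate are illustrative assumptions, not the measurement procedure of Banse and Scherer (1996).

import numpy as np
from scipy.io import wavfile

def rms_energy(signal):
    # Root-mean-square energy of the waveform, related to perceived vocal intensity
    return np.sqrt(np.mean(signal.astype(float) ** 2))

def estimate_f0(frame, fs, fmin=75.0, fmax=400.0):
    # Crude F0 estimate for a single frame, taken from the autocorrelation peak
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)    # search over plausible vocal-fold periods
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag

fs, speech = wavfile.read("sentence.wav")      # hypothetical mono recording
frame = speech[:int(0.04 * fs)]                # one 40 ms analysis frame
print(rms_energy(speech), estimate_f0(frame, fs))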

2.4 Noise vocoding

Noise vocoding is a form of manipulation of speech that distorts the audio to simulate the perceived hearing of cochlear implant patients. Cochlear implants are the most successful sensory prostheses and allow people who are deaf or severely hearing impaired to regain some semblance of speech and hearing. They are surgically inserted into the cochlea and are designed to stimulate the auditory nerve (Oxenham, 2019). The recognition of speech has been suggested to rely on spectral cues (Shannon et al., 1995). These cues are believed to provide information about the frequency of the speech sound and its resonant properties. However, in the development of synthesized electrical stimulation of the auditory system by cochlear implants, more attention has shifted towards temporal and amplitude cues (Shannon et al., 1995). In order to gather empirical evidence of the importance of both temporal and amplitude cues, techniques for simulating cochlear implants were designed to limit the spectral cues of speech. One such technique is noise vocoding (NV), developed by Shannon et al. (1995).

The process of vocoding speech occurs in four different steps. First, the speech signal is divided into a given number of logarithmically spaced frequency bands, where four or fewer bands are extremely difficult to comprehend, and ten or more bands can be comprehended fairly easily. Second, the amplitude envelope of each frequency band is extracted. Third, the envelope is used to modulate noise in the same band. As a last step, the frequency bands are recombined to produce a noise vocoded sentence (Davis, 2003).

Even though the perception of speech is robust to distortion of the spectral resolution of speech (Shannon et al., 1995), adults with cochlear implants have been found to struggle with recognizing vocal emotions (Jiam et al., 2017). Since natural speech signals with varying degrees of muffling by band-pass filtering and/or spectral smearing do provide a close representation of the sound quality of cochlear implants (Dorman et al., 2017), the same difficulty in perceiving emotional prosody that cochlear implant patients experience can be reproduced with normal-hearing adults and low-band NV-levels.
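To make the vocoding steps above concrete, the following is a minimal sketch, assuming NumPy and SciPy, of a noise-vocoding procedure in the spirit of Shannon et al. (1995) and Davis (2003); the filter orders, cut-off frequencies and envelope smoothing are illustrative choices, not the exact parameters used to create the stimuli in this study.

import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, filtfilt, hilbert

def noise_vocode(speech, fs, n_bands, f_lo=100.0, f_hi=8000.0):
    speech = speech.astype(float)
    edges = np.logspace(np.log10(f_lo), np.log10(f_hi), n_bands + 1)   # logarithmically spaced band edges
    noise = np.random.randn(len(speech))                               # white-noise carrier
    out = np.zeros_like(speech)
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(3, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        band = filtfilt(b, a, speech)                                  # step 1: speech in this band
        env = np.abs(hilbert(band))                                    # step 2: amplitude envelope
        b_env, a_env = butter(2, 30.0 / (fs / 2))
        env = filtfilt(b_env, a_env, env)                              # smooth the envelope below ~30 Hz
        carrier = filtfilt(b, a, noise)                                # band-limited noise carrier
        out += env * carrier                                           # steps 3-4: modulate noise, recombine bands
    return out / np.max(np.abs(out)) * np.max(np.abs(speech))          # rescale to the original peak level

fs, speech = wavfile.read("sentence.wav")                              # hypothetical mono sentence
nv12 = noise_vocode(speech, fs, n_bands=12)                            # e.g. the NV12 condition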

2.5 Emotions and perception of speech

Domínguez-Borràs et al. (2009) have provided evidence indicating that the superior temporal gyrus (STG), which contains the auditory cortex responsible for processing sounds, is affected by emotional context. In their study, Domínguez-Borràs et al. (2009) presented the participants with emotionally salient (in terms of emotional valence and arousal) pictures of faces, either negative or neutral, while simultaneously presenting a novel sound. The task for the participants was to identify whether the colour of the frame matched the colour of the face. Using fMRI scans, Domínguez-Borràs et al. (2009) found that the emotionally negative faces, even when the participants tried to ignore them, induced activation in the amygdala and the fusiform gyrus, the area associated with processing faces (Kawasaki et al., 2012). Moreover, a stronger response in the bilateral STG could be observed when the participants were presented with the negative faces, compared to the neutral faces (Domínguez-Borràs et al., 2009). These data support the idea that novelty processing areas may be enhanced by emotional context.

Humans have specific auditory regions specialized in processing the speech of other humans. Similar to how the anterior temporal regions of non-human primates are sensitive to the voices of primates within the same species (Petkov et al., 2008), humans too have regions in the central STS that respond highly selectively to human voices compared to other auditory stimuli (Belin et al., 2000). The findings of Belin et al. (2000) bear a striking resemblance to visual cortex architecture, where face-selective regions have been studied using similar experimental paradigms.

Emotional prosody has also been found to increase lexical access speed in prosodically and semantically congruent situations. Wurm et al. (2010) have reported evidence of emotional prosody increasing the speed with which participants could identify words in a lexical decision task when the words were semantically congruent with the emotional prosody. The participants were presented with sentences with varying emotional intonations and were instructed to perform a speeded lexical decision task on the last item of the sentence. Wurm et al. (2010) found that the reaction time decreased when words matched the emotional prosody, and increased for mismatched words. Such findings build upon the findings of Nygaard and Queen (2008), which suggest that words with affective meaning are responded to more quickly when they are matched with a corresponding emotional tone of voice. Additionally, Ethofer et al. (2006) demonstrated an increase of blood flow in the STG for both happy and angry intonation. Such findings may suggest that negative and positive valence in prosody have similar effects on the initial processing of speech.

2.6 About this study

The purpose of this study is to research the relationship between emotional prosody and the perception of speech in varying levels of degraded speech.


To summarize the theoretical background, the superior temporal gyrus (STG), the region containing the auditory cortex and associated with the processing of sounds, has shown increased cortical excitability when faced with emotionally negative visual stimuli (Domínguez-Borràs et al., 2009). Such findings suggest that the STG is sensitive to emotional context. The cortical regions responsible for processing the speech of other humans bear a striking resemblance to the visual cortex architecture (Belin et al., 2000), leading to the assumption that auditory emotional valence could have similar enhancing effects on auditory processing as emotionally valenced visual stimuli. The hypothesis is that emotional prosody, contrary to emotionally neutral prosody, will have an enhancing effect on the perception of speech, therefore leading to a lower perceived amount of noise in spoken sentences at different levels of noise vocoding. It is also hypothesized that the ability to differentiate between the emotional prosodies will decrease drastically as the number of bands in the noise vocoding decreases.

3. Method

3.1 Participants

Fourteen volunteers (4 female) participated in the study. The ages of the participants ranged from 20 to 54 (M = 28.6 years, SD = 12.6). All of the participants had Swedish as their native language and did not have any record of diagnosed hearing loss or previous damage to the ear. Participants were contacted and recruited via social media. All participants were briefly informed about the structure of the test, that all data was gathered anonymously, and that they could abort the test at any time without giving any reason for doing so. Every participant gave their consent to take part in the study before proceeding to the experiment.

3.2 Material

3.2.1 HINT
The original sentence material of the Hearing In Noise Test (HINT) included English everyday sentences whose neutrality was validated by native American English speakers (Nilsson et al., 1994). The sentences are organized into 25 ten-sentence phonemically balanced lists. The purpose of the sentences was to research the threshold for speech recognition of sentences. The Swedish version of HINT was created with speech material consisting of everyday sentences intended to be perceived as natural by native Swedish speakers, despite differences in dialect, education and background (Hällgren et al., 2006). The sentences in the Swedish HINT were composed of five to nine syllables, compared to the original HINT where the sentences had a maximum of seven syllables. The Swedish HINT has been shown to be comparable with the original American English HINT in most aspects (Hällgren et al., 2006).


3.2.2 Emotional Prosody material
Fourteen sentences with semantically emotionally neutral content from the Swedish version of HINT were selected and recorded by four actors: an older woman (69 years old), an older man (73 years old), a young woman (19 years old) and a young man (29 years old). The sentences were recorded with emotional prosody, varying in emotion and in low or high intensity. The stimuli were recorded with the help of a sound technician in a studio at Linköping University Hospital's Audiology clinic. Audacity™ was used to make the recordings, which were made with high-quality equipment, 24-bit resolution, and a 44.1 kHz sampling rate. In a pilot study, the clearest and cleanest recordings from each actor, out of two or more repetitions for each sentence and non-verbal vocalization, were used for validation by Avdelningen för handikappvetenskap (AVH) at Linköping University. The sentences are about 2-3 seconds long. The emotions included in the recordings are anger, happiness, sadness, fear, and interest.

For the present study the recordings of the young male actor were used, as they were deemed the clearest and the most emotionally accurate. The emotional prosodies included were happiness and anger with high intensity, and neutral prosody with no specific intensity. For each emotion, ten out of the fourteen sentences were selected, based on clipping in the sound file, disfluency by the reader, mispronunciation, and whether the volume of every word was sufficient for the listener to be able to distinguish each individual word. The judgement of the quality and the selection of the sentences used was made by the author of the current thesis. The set of ten sentences for each of the three emotions was then noise vocoded into NV1, NV3, NV12, NV24 and Clear, creating a set of 150 sentences: ten different sentences, with three different emotional prosodies, at five levels of noise vocoding. For each sentence waveform, the root mean square (RMS) value was calculated, and the sentences were then rescaled to the same RMS level.
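As a minimal sketch of the RMS equalization mentioned above (assuming NumPy/SciPy; the file names and the target RMS value are hypothetical illustrations, not the level used for the actual stimuli):

import numpy as np
from scipy.io import wavfile

def rescale_to_rms(signal, target_rms):
    # Scale the waveform so that its root-mean-square value equals target_rms
    signal = signal.astype(float)
    rms = np.sqrt(np.mean(signal ** 2))
    return signal * (target_rms / rms)

fs, wav = wavfile.read("NV12_happy_03.wav")                  # hypothetical stimulus file
equalized = rescale_to_rms(wav, target_rms=0.05).astype(np.float32)
wavfile.write("NV12_happy_03_eq.wav", fs, equalized)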

3.2.3 Test
In order to gather data to answer the research questions of the study, an emotional prosody and speech perception test was developed. The test was created using PsychoPy 3.0 and the PsychoPy Builder v. 1.75. The test starts with an initial questionnaire pop-up box asking about the participant's age, gender, native language, previous records of hearing loss, whether they are wearing a headset and whether their surroundings are quiet. Next was a welcome page, once again reminding the participants to wear a headset. To navigate to the next page of the test, the participants press the spacebar on their keyboard. After the welcome page was a page with instructions for the experiment itself.

The next part of the test was the trial loop, consisting of an exposure, a 7-point scale with an accompanying text for rating the amount of noise in the sentence, and a rating of the valence of the sentence. There were 150 loops, one for each audio file in the material, where the audio files were presented in a random order using the random loop type function in PsychoPy Builder. On the exposure page was a text instructing the participant to press the spacebar when they were ready to listen to the next sentence, and when the spacebar was pressed the audio file of the next sentence was played. On the page with the 7-point scale was a text instructing the participant to rate the amount of noise in the sentence they just heard on a scale ranging from 0 to 6. To rate the amount of noise in a sentence, a slider with 7 ticks was presented, and to choose a rating the participants clicked with the mouse on the ticks of the slider. On the last page of the trial loop was a text instructing the participant to rate in which way the sentence was said (Positive, Neutral, Negative); to answer, the participants pressed 1 on their keyboard for Positive, 2 for Neutral and 3 for Negative. At the end of the test the participants were presented with a text thanking them for their participation.
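The trial loop was built in the PsychoPy Builder, but its structure can be sketched in code. The following is a minimal, hypothetical re-creation using the PsychoPy coder interface (file names, window settings and on-screen texts are assumptions, not the actual Builder experiment):

from psychopy import visual, sound, event, core
import random

win = visual.Window(size=(1024, 768), color="grey")
prompt = visual.TextStim(win, text="Press the spacebar to hear the next sentence")
noise_q = visual.TextStim(win, text="How much noise did you hear? (0 = none, 6 = unintelligible)", pos=(0, 0.3))
slider = visual.Slider(win, ticks=(0, 1, 2, 3, 4, 5, 6), labels=[str(i) for i in range(7)], granularity=1)
valence_q = visual.TextStim(win, text="How was the sentence said?\n1 = Positive, 2 = Neutral, 3 = Negative")

stimuli = ["NV12_happy_03.wav"]              # hypothetical list of the 150 stimulus files
random.shuffle(stimuli)                      # random presentation order
results = []

for wav_file in stimuli:
    prompt.draw(); win.flip()
    event.waitKeys(keyList=["space"])        # exposure page: wait for spacebar, then play
    snd = sound.Sound(wav_file)
    snd.play()
    core.wait(snd.getDuration())

    slider.reset()                           # 7-point noise rating, answered with the mouse
    while slider.getRating() is None:
        noise_q.draw(); slider.draw(); win.flip()
    noise_rating = slider.getRating()

    valence_q.draw(); win.flip()             # valence rating, answered with keys 1/2/3
    valence_key = event.waitKeys(keyList=["1", "2", "3"])[0]
    results.append((wav_file, noise_rating, valence_key))

win.close()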

3.2.4 Task
The task for the participants was to listen to each of the 150 selected sentences, varying in amount of noise vocoding (NV1, NV3, NV12, NV24, Clear) and in emotional valence (Positive, Neutral, Negative). For each sentence the participant first evaluated the amount of noise perceived in the sentence on a scale from 0 (no perceived noise) to 6 (totally unintelligible), which has previously been found to be a valid method for measuring the perceived clarity of speech (Signoret & Rudner, 2017; Signoret et al., 2019). Subsequently, the participant also evaluated whether the sentence was spoken in a positive, neutral or negative way.

3.2.5 Procedure
The experiment was performed over the videotelephony software Zoom. The test itself was uploaded to Pavlovia.org, to which the participants were given a link along with the username and password of the account hosting the experiment. The participants were verbally guided to the experiment and instructed on how to begin. Due to the financial constraints of the study, the experiment had to be run in pilot mode, therefore requiring the participants to log in to the creator account to run the experiment on their own computer. The participant was then given instructions for the task and also had the possibility to ask questions regarding the task. Before beginning the experiment the participant got to listen to four example sentences, one Clear-Happiness, one NV1-Anger, one Clear-Anger and one NV1-Happiness, as a reference point. When the experiment was finished, a file with the results was automatically downloaded by Pavlovia to the participant's computer, which the participants were instructed to upload to the Zoom chat.

3.3 Data analysis

All of the statistical analyses in this study were performed using Jamovi 1.1.9.0.

In order to address the first research question, how degradation of speech and different emotional prosodies affect one's ability to detect emotional prosody, a Friedman test was performed. Initially a two-way analysis of variance was planned for the analysis of the first research question; however, due to violations of the assumptions for such a statistical analysis, a non-parametric analysis had to be performed. The two independent variables of the Friedman test were Emotion, the emotion portrayed in the prosody of the sentences in the material (Happiness, Neutral, Anger), and Noise vocoding, the level of noise vocoding in each sentence (NV1, NV3, NV12, NV24, Clear). The dependent variable in the Friedman test was Accuracy. The Accuracy variable was the average of correct matches between Emotion and the perceived valence (where the answer Positive to a sentence with Happiness prosody would be a correct match, while the answer Neutral or Negative to a sentence with Happiness prosody would be an incorrect match), for each group in the Emotion x Noise vocoding matrix. Q-Q plots and a Shapiro-Wilk test showed that the assumption of normal distribution was violated for NV3-Positive (p < .001), NV3-Neutral (p = .007), NV12-Neutral (p = .006), and NV24-Neutral (p = .043). Normal distribution could be assumed for NV12-Positive (p = .629), NV24-Positive (p = .382), NV3-Negative (p = .454), NV12-Negative (p = .369), and NV24-Negative (p = .148). So instead of the intended two-way repeated measures ANOVA, a non-parametric Friedman test was performed to analyze the participants' ability to assess the valence of the sentences, dependent on the amount of noise vocoding in the sentences and the intended emotional prosody of the sentences.

To answer the second research question, how the emotional valence of prosody in speech affects the perceived amount of noise in degraded speech, a two-way repeated measures ANOVA was performed. The independent variables of the two-way repeated measures ANOVA were Noise vocoding, the level of noise vocoding in each sentence (NV12, NV24), and Valence, the valence that the participant perceived for each sentence (Positive, Neutral, Negative). The dependent variable was the average amount of noise perceived (on a scale from 0 to 6) in each group of the Noise vocoding x Valence matrix. An alpha level of 0.05 was used as the significance level for all statistical tests performed in this study. To check the assumptions for a repeated measures ANOVA, a Shapiro-Wilk test along with Q-Q plots was first performed to control for normal distribution. The Shapiro-Wilk test and the Q-Q plots demonstrated that the assumption of normal distribution was violated for the NV-levels Clear, NV3 and NV1, with a few exceptions (see Table 4.1). After the Friedman test of the participants' ability to perceive emotional prosody, it was clear that the ability to perceive emotional prosody in NV1 and NV3 was at or below chance level (see Appendix A.4), and these levels were therefore excluded from the analysis. To avoid a ceiling effect of the Clear condition, the NV-level Clear was also excluded from the analysis, leading to only NV12 and NV24 being included in the independent variable amount of noise vocoding in the two-way repeated measures ANOVA. A significant Mauchly's sphericity test, W = .372, p = .007, suggested that the assumption of sphericity was violated for the variable Noise, which was accounted for with a Greenhouse-Geisser correction.
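Although the analyses were run in Jamovi, the same steps can be sketched in Python. The following is a minimal sketch, assuming the per-participant averages are stored in a hypothetical long-format CSV file with columns participant, nv, emotion, valence, accuracy and noise_rating; the Friedman test uses SciPy and the two-way repeated measures ANOVA uses Pingouin.

import pandas as pd
import pingouin as pg
from scipy import stats

df = pd.read_csv("aggregated_results.csv")                 # hypothetical aggregated data
df = df.sort_values("participant")                         # keep participants aligned across conditions

# Normality checks per Emotion x Noise-vocoding cell (Shapiro-Wilk), as a complement to Q-Q plots
for (nv, emotion), cell in df.groupby(["nv", "emotion"]):
    print(nv, emotion, stats.shapiro(cell["accuracy"]).pvalue)

# Research question 1: non-parametric Friedman test on Accuracy across the
# Emotion x Noise-vocoding conditions (one array of participant values per condition)
conditions = [g["accuracy"].to_numpy() for _, g in df.groupby(["nv", "emotion"])]
print(stats.friedmanchisquare(*conditions))

# Research question 2: two-way repeated measures ANOVA on perceived noise,
# restricted to the NV12 and NV24 levels as described above
sub = df[df["nv"].isin(["NV12", "NV24"])]
aov = pg.rm_anova(data=sub, dv="noise_rating", within=["nv", "valence"],
                  subject="participant", detailed=True)
print(aov)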


4. Result

4.1 Ability to perceive emotional prosody

The Friedman test showed a significant change in the participants' ability to assess the valence of the sentences, χ²(14) = 145, p < .001, W = .070. A Durbin-Conover pairwise comparisons test demonstrated significant differences in accuracy between each group except for: all three emotions without noise vocoding, Anger and Neutral in NV24, Anger and Neutral in NV12, and Happiness in NV12 and Anger in NV3 (see Appendix A.5). This can be further observed in Figure 4.1, indicating that the participants' ability to assess the emotional valence of the sentences declined as the quality of the sentences declined. With 2 out of 14 participants never classifying any of the sentences as Positive at the NV12 level, and 8 out of 14 participants never classifying any of the sentences as Positive at the NV3 level (compared to 0 out of 14 for Negative at the respective NV-levels), the results suggest that Happiness was more difficult to perceive in degraded speech than Anger or Neutral prosody.

Figure 4.1. Accuracy on assessing valence of the sentences, by Noise vocoding and Emotion.

4.2 Perceived amount of noise

A two-way repeated measures ANOVA was performed in order to analyze how the participants evaluated the amount of noise in the sentences, dependent on the amount of noise vocoding in the sentences, and the perceived emotional valence of the sentences.


The results of the two-way repeated measures ANOVA did not show a significant interaction effect, F(1.72, 18.94) = 0.084, p = .895, ηp² = .008, nor a significant main effect of Valence, F(1.23, 13.52) = 3.36, p = .083, ηp² = .234. However, a significant main effect of Noise could be observed, F(1, 11) = 22.25, p < .001, ηp² = .669. A post-hoc analysis of the main effect of Noise was performed using Tukey's test, showing a significant difference between NV24 and NV12 (t = -4.72, p < .001, d = 0.27). Even though the main effect of Valence was not significant, as it was relevant to the research question it was considered interesting to further investigate the differences between the groups. Therefore, a post-hoc analysis using Tukey's test was performed. The results showed no significant differences between Positive (M = 2.99), Neutral (M = 3.03) and Negative (M = 3.23). For descriptive data see Table 4.1 and Appendix A.6.

Table 4.1. Descriptive data for Valence and Noise vocoding on the estimated amount of noise.

                       NV24                                NV12
Valence      Mean    SE    95% CI [lower, upper]    Mean    SE    95% CI [lower, upper]
Positive     2.87    0.28  [2.27, 3.47]             3.12    0.28  [2.52, 3.72]
Neutral      2.90    0.28  [2.30, 3.50]             3.16    0.28  [2.56, 3.76]
Negative     3.07    0.28  [2.48, 3.67]             3.38    0.28  [2.78, 3.98]


5. Discussion

5.1 Result discussion

Based on previous findings, this study aimed to investigate the relation between emotional prosody, speech in noise, perceived clarity of speech, and the perception of emotional prosody. More specifically, how one's ability to detect the emotional valence of prosody is affected by different emotional prosodies and different levels of degraded speech, and how the perceived amount of noise in sentences depends on the emotional valence of the prosody in the sentences and different levels of degraded speech. A theoretical review of the field led to the formulation of two research questions: "How does emotional prosody affect the perceived clarity of degraded speech?" and "How does noise degradation affect the recognition of emotional prosody?".

5.1.1 Ability to perceive emotional prosody
To examine the first research question, a non-parametric Friedman test was performed, due to the data violating the assumptions of a two-way repeated measures ANOVA. The Friedman test showed a significant effect of emotional prosody and NV-level on the participants' ability to detect emotional prosody. Unsurprisingly, the decline of the participants' ability to distinguish between prosodies at the lower levels of NV is congruent with the literature (Jiam et al., 2017; Dorman et al., 2017). The lack of spectral information in NV1 and NV3 was not sufficient for the participants to perceive the acoustic cues associated with each emotional prosody. Interestingly, however, it seems that the participants were more likely to label a sentence's prosody as emotionally negatively valenced than as emotionally positively valenced when the sentences were noise vocoded. This would suggest that participants, when faced with a noise vocoded sentence with happy prosody, were more prone to label it as either neutral or negative, explaining the more rapid decline in the participants' ability to label sentences with happy prosody as positive, compared to the sentences with neutral or angry prosody, which were still labeled with the intended emotional valence at all NV-levels above NV3.

It is important to note that the sentences used in this study are not validated, and even in perfectly clear conditions not every sentence was labeled with its intended valence. Still, the results do indicate that the perception of emotionally positive prosody might be more fragile to noise than the perception of emotionally negative prosody. Anger and happiness share similar acoustic cue characteristics, such as an increase in mean F0, an increase in energy, and a more rapid rate of articulation (Banse & Scherer, 1996). However, an inspection of the results suggests that some of the distinguishing acoustic cues of anger may be preserved even in severe forms of degradation, given the participants' above-chance ability to perceive negative emotional prosody in NV3. Given that noise vocoding distorts the spectral information of audio while preserving the temporal information, the observed differences between the participants' ability to perceive emotional prosody of negative valence and their ability to perceive emotional prosody of positive valence are more likely to be attributed to the temporal aspects of the emotional prosodies, such as tempo and distribution of pauses. Even though no such acoustic patterns were analysed in this study, one of the acoustic quality differences between the emotional prosodies could be the difference in energy, where anger has an increase in high-frequency energy, while happiness has an increase in mean energy (Banse & Scherer, 1996).

5.1.2 Perceived amount of noise
Based on former results, such as Domínguez-Borràs et al. (2009), Belin et al. (2000) and Wurm et al. (2010), it would be within reason to assume that emotional prosody could have enhancing effects on the processing of speech. The second research question was therefore aimed at confirming or rejecting the existence of such an enhancing effect of emotional prosody on speech processing. The analysis of the two-way ANOVA could not establish any clear evidence of emotionally valenced prosodies' potential to improve the perceived clarity of speech. There was a significant effect of NV-level on the participants' perceived amount of noise in the sentences, which is congruent with the findings of Signoret et al. (2018, 2019) and with the purpose of the design of the experiment. However, there was no significant effect of emotional valence on the perceived amount of noise in the sentences, and additionally there was no evidence of any interaction effect between emotional valence and NV-level. The biggest difference between the groups of emotional valence was the difference between positive and negative valence, with sentences with negatively valenced prosody being rated as having 0.24 more noise than positively valenced sentences on average, an effect just above the accepted level of significance (p = .061). Still, neither of the emotionally valenced prosodies was significantly different from the neutral prosody, contrary to what was hypothesized. It therefore seems likely that the emotional valence of prosody in and of itself does not amount to an enhanced perception of speech.

The results are in line with previous research on the pop-out effect and ERP research. Lau et al. (2008) emphasize the importance of top-down processes when processing speech and retrieving memory representations of words and morphemes. With semantically neutral sentences, the emotional prosody does not add any additional context to the words; therefore, it is reasonable to assume that the emotional prosody did not contribute to reducing a semantic prediction error (Bornkessel-Schlesewsky & Schlesewsky, 2019; Paczynski & Kuperberg, 2012). The findings of Signoret et al. (2018, 2019) also seem in line with the results, suggesting that the overall predictability of the semantic contents of the sentences did not increase with emotional prosody, and neither did the intelligibility. Furthermore, Wurm et al. (2010) and Nygaard and Queen (2008) did find an increase in lexical access speed for emotional prosody, but only when the emotion of the prosody was congruent with the semantic emotion of the sentence. Consequently, based on previous ERP research (Friston, 2010; Rao & Ballard, 1999), research on perceived clarity in degraded speech (Signoret et al., 2018, 2019), and research on emotionally valenced visual and auditory stimuli and word detection (Domínguez-Borràs et al., 2009; Belin et al., 2000; Wurm et al., 2010), it seems more plausible that sentences with semantically and prosodically congruent emotional valences would be a feasible way to achieve the hypothesized increase in the perceived clarity of degraded speech.


5.2 Implications

Even though the results were not significant, the findings of the present study might still be of value to the scientific fields of emotional prosody perception and perception of degraded speech. The results concerning the participants' ability to perceive prosody in degraded speech suggest that the participants found it more difficult to perceive positively valenced prosody than negatively valenced prosody. Such findings could indicate a prioritization and prominence of processing negatively valenced speech compared to positively valenced speech. However, the noise vocoding might simply modulate the speech towards a less preferred or less pleasant sounding voice. Nonetheless, if happy prosody, or prosody of positive valence, is more fragile to distortion, it could have implications for cochlear implant patients. If the ability to perceive positive emotional prosody is worse than for negative emotional prosody, cochlear implant patients may perceive a disproportionate amount of their social interactions as negative or unpleasant, furthering their sense of exclusion (Mo et al., 2005). Agrawal et al. (2013) have previously shown that different cochlear implant speech-processing strategies perform differently when it comes to the perception of happy prosody, which could suggest that different qualities of cochlear implants perform differently at reproducing happy prosody.

The findings of Grandjean et al. (2005), along with Paulmann et al. (2013), seem to suggest that emotional prosody increases attentional awareness of auditory stimuli. Assuming that such an attentional increase was present when the participants were exposed to the sentences with emotional prosody, the results of this study would suggest that the increase in attention did not affect the perceived clarity of the sentences, and would therefore serve as an indication that the processing of emotional prosody and the processing of speech may co-occur, but do not interact in the initial processing of speech. The findings of the present study also further elaborate on the findings of Ethofer et al. (2006), who showed a hemodynamic response in the bilateral STS when participants were presented with auditory stimuli with both happy and angry intonation. Because there was no effect of emotional prosody on the perception of speech, one cannot conclude that the reactions to the positively and negatively valenced auditory stimuli were the same; however, the results do follow along the lines of the findings of Ethofer et al. (2006).

For the broader scientific field of listening in adverse conditions, this study further emphasizes the importance of context when trying to understand speech in difficult listening conditions. This study also suggests a disconnect between the right-lateralized emotional prosodic processing (Kotz et al., 2003) and the left hemisphere-associated auditory-to-meaning, semantic processing. While this study did not measure or look into the actual neural activity during the experiment, it does serve as an initial indication that the processing of emotional prosody does not have an effect on the pop-out effect.

5.3 Method discussion


One fundamental aspect of the method of this study, which makes it susceptible to critique of its validity and reliability, is the fact that the experiment was performed online. Previous studies researching intelligibility and perceived clarity of speech in challenging listening conditions have done so under more controlled circumstances. Previous studies could control for the room in which the experiment was performed, controlling for both background noise and the acoustic properties of the room itself, and they could make sure that the sound quality was constant for each participant by using one and the same headset for each trial. This study could not benefit from such controls, as performing the experiment online with limited resources allows for far less control. Some of the aspects of performing the experiment via Zoom that could harm the validity and reliability of the study are, firstly, that the participants wore different headphones and consequently performed the experiment with different sound quality, allowing for the possibility that the same NV-level was not experienced with the same amount of noise across participants. Secondly, distractions could not be controlled to the same extent as in previous studies. The participants were located at home when performing the experiment, and only the participant could be observed during the experiment, making it impossible to account for any distractions that may have occurred. Even though no such distractions could be observed through the participants' webcams or heard from the participants' microphones, the risk of such distractions cannot be excluded and could therefore potentially damage the study's reliability and validity.

Another aspect of the method which could have an influence on the validity and reliability is the material used for the auditory stimuli. Neither the sentences in the Swedish version of the HINT material, containing semantically neutral sentences, nor the emotional prosodies of the actors reading the material have been validated to actually induce and be perceived as what they are intended to be. An indication of this was that only about half of the participants identified all 10 sentences of each emotional prosody as the intended valence, even though the rest of the participants in each group averaged only one mistake. However, what is researched in this study is not the emotional prosody material, but how and whether the participants perceive the emotional valence of the prosody in the sentences. There is no universal emotional valence for any auditory stimulus (Barrett, 2006); therefore, the emotional valence of the sentences that the participants perceived cannot be misidentified, because the emotional valence they perceive is what is being researched. The participants were not supposed to guess the intended emotional valence of the material, but instead report the valence they experienced.

The participants included in the study also pose a threat to the generalisability of the study, with a clearly uneven distribution in both the genders and the ages of the participants. Schirmer et al. (2002) found that women show an earlier effect of prosodic priming, both behavioural and electrophysiological, on word processing compared to men. The men in that study tended to need a longer interval between the emotional prime and the target for the same effect to be observed, indicating that women process emotional prosody earlier than men during word processing. With only 4 female participants, such effects could not be discovered, but it is an effect to be cognisant of, as a more representative sample might have resulted in the detection of such sex differences.


The uneven distribution of ages among the participants might also not be representative of a general population. Gordon-Salant (2005) showed a decline in the spectrum of perceivable frequencies with age, especially after the age of 50. With 11 out of 14 participants being in their early twenties (20-25), and the remaining 3 being around fifty (49-54), the natural decline of hearing with age is not captured. The results of this study may therefore not be generalisable to those aged under, between, or above the ages of the participants.

5.4 Ethics

Ethical considerations were made, such as performing the experiment online instead of in person to avoid exposing the participants, or society as a whole, to any additional risk of spreading Covid-19. The study was also conducted in accordance with the Helsinki declaration. Still, by being exposed to negatively valenced stimuli, the participants were bound to feel varying degrees of emotional distress. Depending on each individual's past experiences and personality traits, being unexpectedly presented with angry voices could have induced a feeling of unease that the participants were not prepared for prior to the experiment.

5.5 Conclusion and Future Research

The results of this study show a clear relationship between the level of NV in sentences and the ability to perceive emotional prosody, and also suggest that the ability to perceive positively valenced prosody is more vulnerable to noise degradation than the ability to perceive negatively valenced prosody. Furthermore, the results did not show any clear evidence that the emotional valence of the prosody of the sentences had any effect on the amount of noise the participants perceived. Future studies should further research emotionally coherent sentences in challenging listening conditions and their effects on speech perception by matching the emotional prosody and the semantics of speech in an emotionally congruent way. Additionally, future research should further investigate the specific acoustic features required for a listener to perceive and differentiate between emotional prosodies.


References

Agrawal, D., Thorne, J. D., Viola, F. C., Timm, L., Debener, S., Büchner, A., Dengler, R., &

Wittfoth, M. (2013). Electrophysiological responses to emotional prosody perception in

cochlear implant users. NeuroImage: Clinical, 2, 229–238.

https://doi.org/10.1016/j.nicl.2013.01.001

Ahissar, M., & Hochstein, S. (2004). The reverse hierarchy theory of visual perceptual learning.

Trends in Cognitive Sciences, 8(10), 457–464. https://doi.org/10.1016/j.tics.2004.08.011

Barrett, L. F. (1998). Discrete Emotions or Dimensions? The Role of Valence Focus and

Arousal Focus. Cognition and Emotion, 12(4), 579–599.

https://doi.org/10.1080/026999398379574

Barrett, L. F. (2006). Valence is a basic building block of emotional life. Journal of Research

in Personality, 40(1), 35–55. https://doi.org/10.1016/j.jrp.2005.08.006

Belin, P., Zatorre, R. J., Lafaille, P., Ahad, P., & Pike, B. (2000). Voice-selective areas in human

auditory cortex. Nature, 403(6767), 309–312. https://doi.org/10.1038/35002078

Bornkessel-Schlesewsky, I., & Schlesewsky, M. (2019). Toward a Neurobiologically Plausible

Model of Language-Related, Negative Event-Related Potentials. Frontiers in Psychology,

10. https://doi.org/10.3389/fpsyg.2019.00298

Carmen, R., & Uram, S. (2002). Hearing loss and anxiety in adults. The Hearing Journal, 55(4),

48. https://doi.org/10.1097/01.HJ.0000293358.79452.49

Citron, F. M. M., Gray, M. A., Critchley, H. D., Weekes, B. S., & Ferstl, E. C. (2014). Emotional

valence and arousal affect reading in an interactive way: Neuroimaging evidence for an

approach-withdrawal framework. Neuropsychologia, 56, 79–89.

18

https://doi.org/10.1016/j.neuropsychologia.2014.01.002

Clark, A. (2013). Whatever next? Predictive , situated agents, and the future of cognitive

science. Behavioral and Brain Sciences, 36(3), 181–204.

https://doi.org/10.1017/S0140525X12000477

Davis, M. H., Johnsrude, I. S., Hervais-Adelman, A., Taylor, K., & McGettigan, C. (2005).

Lexical Information Drives Perceptual Learning of Distorted Speech: Evidence From the

Comprehension of Noise-Vocoded Sentences. Journal of Experimental Psychology:

General, 134(2), 222–241. https://doi.org/10.1037/0096-3445.134.2.222

Domínguez-Borràs, J., Trautmann, S.-A., Erhard, P., Fehr, T., Herrmann, M., & Escera, C.

(2009). Emotional Context Enhances Auditory Novelty Processing in Superior Temporal

Gyrus. Cerebral Cortex, 19(7), 1521–1529. https://doi.org/10.1093/cercor/bhn188

Ethofer, T., Anders, S., Wiethoff, S., Erb, M., Herbert, C., Saur, R., Grodd, W., & Wildgruber,

D. (2006). Effects of prosodic emotional intensity on activation of associative auditory

cortex. NeuroReport, 17(3), 249–253. https://doi.org/10.1097/01.wnr.0000199466.32036.5d

Feldman Barrett, L., & Russell, J. A. (1998). Independence and bipolarity in the structure of

current affect. Journal of Personality and Social Psychology, 74(4), 967–984.

https://doi.org/10.1037/0022-3514.74.4.967

Frick, R. W. (1985). Communicating emotion: The role of prosodic features. Psychological

Bulletin, 97(3), 412–429. https://doi.org/10.1037/0033-2909.97.3.412

Friston, K. (2010). The free-energy principle: A unified brain theory? Nature Reviews

Neuroscience, 11(2), 127–138. https://doi.org/10.1038/nrn2787

Heffner, R., & Heffner, H. (1992). The Evolutionary Biology of Hearing (pp. 691–715).

https://doi.org/10.1007/978-1-4612-2784-7_43

Hällgren, M., Larsby, B., & Arlinger, S. (2006). A Swedish version of the Hearing In Noise

19

Test (HINT) for measurement of speech recognition. International Journal of Audiology,

45(4), 227–237. https://doi.org/10.1080/14992020500429583

Kawasaki, H., Tsuchiya, N., Kovach, C. K., Nourski, K. V., Oya, H., Howard, M. A., &

Adolphs, R. (2012). Processing of Facial Emotion in the Human Fusiform Gyrus. Journal of

Cognitive Neuroscience, 24(6), 1358–1370. https://doi.org/10.1162/jocn_a_00175

Kotz, S. A., Meyer, M., Alter, K., Besson, M., von Cramon, D. Y., & Friederici, A. D. (2003). On the lateralization of emotional prosody: An event-related functional MR investigation. Brain and Language, 86(3), 366–376. https://doi.org/10.1016/S0093-934X(02)00532-1

Kuchinke, L., Jacobs, A. M., Grubich, C., Võ, M. L.-H., Conrad, M., & Herrmann, M. (2005). Incidental effects of emotional valence in single word processing: An fMRI study. NeuroImage, 28(4), 1022–1032. https://doi.org/10.1016/j.neuroimage.2005.06.050

Kutas, M., & Federmeier, K. D. (2011). Thirty Years and Counting: Finding Meaning in the N400 Component of the Event-Related Brain Potential (ERP). Annual Review of Psychology, 62(1), 621–647. https://doi.org/10.1146/annurev.psych.093008.131123

Kutas, M., & Hillyard, S. A. (1984). Event-Related Brain Potentials (ERPs) Elicited by Novel Stimuli during Sentence Processing. Annals of the New York Academy of Sciences, 425(1), 236–241. https://doi.org/10.1111/j.1749-6632.1984.tb23540.x

Lau, E. F., Phillips, C., & Poeppel, D. (2008). A cortical network for semantics: (De)constructing the N400. Nature Reviews Neuroscience, 9(12), 920–933. https://doi.org/10.1038/nrn2532

Mener, D. J., Betz, J., Genther, D. J., Chen, D., & Lin, F. R. (2013). Hearing Loss and Depression in Older Adults. Journal of the American Geriatrics Society, 61(9), 1627–1629. https://doi.org/10.1111/jgs.12429

Mo, B., Lindbaek, M., & Harris, S. (2005). Cochlear Implants and Quality of Life: A Prospective Study. Ear & Hearing, 26(2), 186–194.

Nilsson, M., Soli, S. D., & Sullivan, J. A. (1994). Development of the Hearing In Noise Test for the measurement of speech reception thresholds in quiet and in noise. The Journal of the Acoustical Society of America, 95, 1085–1099.

Nygaard, L. C., & Queen, J. S. (2008). Communicating emotion: Linking affective prosody and word meaning. Journal of Experimental Psychology: Human Perception and Performance, 34(4), 1017–1030. https://doi.org/10.1037/0096-1523.34.4.1017

Oxenham, A. J. (2019). How We Hear: The Perception and Neural Coding of Sound. 29.

Paczynski, M., & Kuperberg, G. R. (2012). Multiple influences of semantic memory on sentence processing: Distinct effects of semantic relatedness on violations of real-world event/state knowledge and animacy selection restrictions. Journal of Memory and Language, 67(4), 426–448. https://doi.org/10.1016/j.jml.2012.07.003

Panksepp, J. (1998). Affective neuroscience: The foundations of human and animal emotions. New York: Oxford University Press.

Paulmann, S., Bleichner, M., & Kotz, S. A. (2013). Valence, arousal, and task effects in emotional prosody processing. Frontiers in Psychology, 4. https://doi.org/10.3389/fpsyg.2013.00345

Petkov, C. I., Kayser, C., Steudel, T., Whittingstall, K., Augath, M., & Logothetis, N. K. (2008). A voice region in the monkey brain. Nature Neuroscience, 11(3), 367–374. https://doi.org/10.1038/nn2043

Rao, R. P. N., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1), 79–87. https://doi.org/10.1038/4580

Schirmer, A., Kotz, S. A., & Friederici, A. D. (2002). Sex differentiates the role of emotional prosody during word processing. Cognitive Brain Research, 14(2), 228–233. https://doi.org/10.1016/S0926-6410(02)00108-8

Shannon, R. V., Zeng, F.-G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech recognition with primarily temporal cues. Science, 270(5234), 303.

Signoret, C., Andersen, L. M., Dahlström, Ö., Blomberg, R., Lundqvist, D., Rudner, M., & Rönnberg, J. (2020). The Influence of Form- and Meaning-Based Predictions on Cortical Speech Processing Under Challenging Listening Conditions: A MEG Study. Frontiers in Neuroscience, 14. https://doi.org/10.3389/fnins.2020.573254

Signoret, C., Holmer, E., & Rudner, M. (2019). Semantic coherence and speech production in adverse listening conditions. Universitätsbibliothek der RWTH Aachen.

Signoret, C., Johnsrude, I., Classon, E., & Rudner, M. (2018). Combined effects of form- and meaning-based predictability on perceived clarity of speech. Journal of Experimental Psychology: Human Perception and Performance, 44(2), 277–285. https://doi.org/10.1037/xhp0000442

Wennerstrom, A. (2001). The Music of Everyday Speech: Prosody and Discourse Analysis. Oxford University Press.


Appendix

A.1 List of the Swedish version of HINT sentences

1. Anden simmade i dammen (The duck swam in the pond)
2. Bollen studsade ut på vägen (The ball bounced out onto the road)
3. Ungdomarna köper varsin glass (The youths each buy an ice cream)
4. Flickan har kort rött hår (The girl has short red hair)
5. Farfar lagar mat åt barnen (Grandfather cooks food for the children)
6. Äggen ska kokas sju minuter (The eggs should be boiled for seven minutes)
7. Flickan handlade ost och korv (The girl bought cheese and sausage)
8. Morfar provade för stora skor (Grandfather tried on shoes that were too big)
9. Två svarta skjortor hängde på tork (Two black shirts were hanging to dry)

A.2 Consent form

Before taking part in the study, all participants were required to read the consent form and give their consent in a Google Form.


A.3 Test

The test consisted of a short form for participant information, followed by a welcome page and a page of instructions. Before the audio file of each sentence was played, a page asked the participants to confirm that they were ready to listen to the next sentence. For each sentence, the participants rated the amount of perceived noise and indicated whether they perceived the actor to say the sentence in a positive, neutral or negative way.
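For illustration only, the sketch below shows how the per-sentence trial flow just described (ready page, audio playback, noise rating, valence rating) could be expressed in Python. Everything in it is hypothetical: the test itself was delivered as a series of web pages rather than with this code, the 0–10 rating scale is a placeholder, and the file names and functions are invented.

```python
# Hypothetical sketch of the per-sentence trial flow described above.
# The real test was delivered as web pages; this is not the actual implementation.

from dataclasses import dataclass

@dataclass
class Trial:
    audio_file: str   # noise-vocoded recording of one HINT sentence (invented name)
    nv_level: str     # e.g. "NV1", "NV3", "NV12", "NV24", "Clear"
    prosody: str      # intended prosody: "happy", "neutral" or "angry"

def run_trial(trial: Trial) -> dict:
    """Run one trial: ready page -> audio -> noise rating -> valence rating."""
    input("Press Enter when you are ready to hear the next sentence...")
    print(f"[playing {trial.audio_file}]")  # placeholder for audio playback
    noise = int(input("How much noise did you perceive (0-10)? "))
    valence = input("Was the sentence said in a positive, neutral or negative way? ")
    return {"nv_level": trial.nv_level,
            "intended_prosody": trial.prosody,
            "perceived_noise": noise,
            "perceived_valence": valence}

if __name__ == "__main__":
    trials = [Trial("sentence_01_nv3_happy.wav", "NV3", "happy")]
    results = [run_trial(t) for t in trials]
    print(results)
```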


A.4 Descriptive data for Ability to perceive emotional prosody

Table 1. Descriptive data including Shapiro-Wilk test for Emotional prosody and Noise vocoding on ability to assess emotional valence.

              Happiness                        Neutral                          Anger
NV level      Mean (SD)        Shapiro-Wilk p  Mean (SD)        Shapiro-Wilk p  Mean (SD)        Shapiro-Wilk p
NV1           0.0143 (0.0363)  <.001           0.971 (0.0611)   <.001           0.0786 (0.131)   <.001
NV3           0.0571 (0.0938)  <.001           0.750 (0.290)    .007            0.443 (0.221)    .454
NV12          0.400 (0.269)    .629            0.779 (0.255)    .006            0.800 (0.152)    .369
NV24          0.614 (0.268)    .382            0.779 (0.229)    .043            0.793 (0.159)    .148
Clear         0.929 (0.0825)   .002            0.900 (0.104)    .007            0.857 (0.183)    .005
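As a rough illustration of how descriptive values like those in Table 1 can be obtained, the sketch below computes per-condition means, standard deviations and Shapiro-Wilk p-values with pandas and SciPy. The column names and the small example data set are invented; this is not the data or the software used for the thesis analyses.

```python
# Hypothetical sketch: per-condition descriptives and Shapiro-Wilk tests
# like those in Table 1. Column names and data are made up for illustration.

import pandas as pd
from scipy.stats import shapiro

# One row per participant; each column is the proportion of correctly
# assessed valence in one Prosody x Noise-vocoding condition.
df = pd.DataFrame({
    "happy_NV1":   [0.0, 0.0, 0.1, 0.0],
    "happy_Clear": [1.0, 0.9, 0.9, 0.8],
    "angry_NV1":   [0.1, 0.0, 0.2, 0.1],
})

for condition in df.columns:
    scores = df[condition]
    w, p = shapiro(scores)  # Shapiro-Wilk normality test
    print(f"{condition}: mean={scores.mean():.3f} "
          f"(SD={scores.std(ddof=1):.3f}), Shapiro-Wilk p={p:.3f}")
```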


A.5 Durbin-Conover for Ability to perceive emotional prosody

Table of pairwise comparisons of how accurately the participants assessed the intended valence for each emotional prosody at each NV level. Acc = Accuracy, NV = noise vocoding level, 1 = Positive, 2 = Neutral, 3 = Negative.

Pairwise Comparisons (Durbin-Conover)

Comparison                   Statistic   p
Acc_NV0_1 - Acc_NV1_1        11.8027     < .001
Acc_NV0_1 - Acc_NV3_1        11.3943     < .001
Acc_NV0_1 - Acc_NV12_1        8.4947     < .001
Acc_NV0_1 - Acc_NV24_1        5.4725     < .001
Acc_NV0_1 - Acc_NV0_2         1.0210     0.309
Acc_NV0_1 - Acc_NV1_2         1.2252     0.222
Acc_NV0_1 - Acc_NV3_2         3.0221     0.003
Acc_NV0_1 - Acc_NV12_2        2.1645     0.032
Acc_NV0_1 - Acc_NV24_2        2.3687     0.019
Acc_NV0_1 - Acc_NV0_3         1.3069     0.193
Acc_NV0_1 - Acc_NV1_3        11.0676     < .001
Acc_NV0_1 - Acc_NV3_3         7.4737     < .001
Acc_NV0_1 - Acc_NV12_3        2.9405     0.004
Acc_NV0_1 - Acc_NV24_3        3.1447     0.002
Acc_NV1_1 - Acc_NV3_1         0.4084     0.683
Acc_NV1_1 - Acc_NV12_1        3.3080     0.001
Acc_NV1_1 - Acc_NV24_1        6.3302     < .001
Acc_NV1_1 - Acc_NV0_2        10.7817     < .001
Acc_NV1_1 - Acc_NV1_2        13.0279     < .001
Acc_NV1_1 - Acc_NV3_2         8.7806     < .001
Acc_NV1_1 - Acc_NV12_2        9.6382     < .001
Acc_NV1_1 - Acc_NV24_2        9.4340     < .001
Acc_NV1_1 - Acc_NV0_3        10.4958     < .001
Acc_NV1_1 - Acc_NV1_3         0.7351     0.463
Acc_NV1_1 - Acc_NV3_3         4.3290     < .001
Acc_NV1_1 - Acc_NV12_3        8.8622     < .001
Acc_NV1_1 - Acc_NV24_3        8.6580     < .001
Acc_NV3_1 - Acc_NV12_1        2.8996     0.004
Acc_NV3_1 - Acc_NV24_1        5.9218     < .001
Acc_NV3_1 - Acc_NV0_2        10.3733     < .001
Acc_NV3_1 - Acc_NV1_2        12.6195     < .001
Acc_NV3_1 - Acc_NV3_2         8.3722     < .001
Acc_NV3_1 - Acc_NV12_2        9.2298     < .001
Acc_NV3_1 - Acc_NV24_2        9.0256     < .001
Acc_NV3_1 - Acc_NV0_3        10.0874     < .001
Acc_NV3_1 - Acc_NV1_3         0.3267     0.744
Acc_NV3_1 - Acc_NV3_3         3.9206     < .001
Acc_NV3_1 - Acc_NV12_3        8.4538     < .001
Acc_NV3_1 - Acc_NV24_3        8.2496     < .001
Acc_NV12_1 - Acc_NV24_1       3.0221     0.003
Acc_NV12_1 - Acc_NV0_2        7.4737     < .001
Acc_NV12_1 - Acc_NV1_2        9.7199     < .001
Acc_NV12_1 - Acc_NV3_2        5.4725     < .001
Acc_NV12_1 - Acc_NV12_2       6.3302     < .001
Acc_NV12_1 - Acc_NV24_2       6.1260     < .001
Acc_NV12_1 - Acc_NV0_3        7.1878     < .001
Acc_NV12_1 - Acc_NV1_3        2.5729     0.011
Acc_NV12_1 - Acc_NV3_3        1.0210     0.309
Acc_NV12_1 - Acc_NV12_3       5.5542     < .001
Acc_NV12_1 - Acc_NV24_3       5.3500     < .001
Acc_NV24_1 - Acc_NV0_2        4.4515     < .001
Acc_NV24_1 - Acc_NV1_2        6.6977     < .001
Acc_NV24_1 - Acc_NV3_2        2.4504     0.015
Acc_NV24_1 - Acc_NV12_2       3.3080     0.001
Acc_NV24_1 - Acc_NV24_2       3.1038     0.002
Acc_NV24_1 - Acc_NV0_3        4.1657     < .001
Acc_NV24_1 - Acc_NV1_3        5.5951     < .001
Acc_NV24_1 - Acc_NV3_3        2.0011     0.047
Acc_NV24_1 - Acc_NV12_3       2.5321     0.012
Acc_NV24_1 - Acc_NV24_3       2.3279     0.021
Acc_NV0_2 - Acc_NV1_2         2.2462     0.026
Acc_NV0_2 - Acc_NV3_2         2.0011     0.047
Acc_NV0_2 - Acc_NV12_2        1.1435     0.254
Acc_NV0_2 - Acc_NV24_2        1.3477     0.179
Acc_NV0_2 - Acc_NV0_3         0.2859     0.775
Acc_NV0_2 - Acc_NV1_3        10.0466     < .001
Acc_NV0_2 - Acc_NV3_3         6.4527     < .001
Acc_NV0_2 - Acc_NV12_3        1.9195     0.056
Acc_NV0_2 - Acc_NV24_3        2.1237     0.035
Acc_NV1_2 - Acc_NV3_2         4.2473     < .001
Acc_NV1_2 - Acc_NV12_2        3.3897     < .001
Acc_NV1_2 - Acc_NV24_2        3.5939     < .001
Acc_NV1_2 - Acc_NV0_3         2.5321     0.012
Acc_NV1_2 - Acc_NV1_3        12.2928     < .001
Acc_NV1_2 - Acc_NV3_3         8.6989     < .001
Acc_NV1_2 - Acc_NV12_3        4.1657     < .001
Acc_NV1_2 - Acc_NV24_3        4.3699     < .001
Acc_NV3_2 - Acc_NV12_2        0.8576     0.392
Acc_NV3_2 - Acc_NV24_2        0.6534     0.514
Acc_NV3_2 - Acc_NV0_3         1.7153     0.088
Acc_NV3_2 - Acc_NV1_3         8.0454     < .001
Acc_NV3_2 - Acc_NV3_3         4.4515     < .001
Acc_NV3_2 - Acc_NV12_3        0.0817     0.935
Acc_NV3_2 - Acc_NV24_3        0.1225     0.903
Acc_NV12_2 - Acc_NV24_2       0.2042     0.838
Acc_NV12_2 - Acc_NV0_3        0.8576     0.392
Acc_NV12_2 - Acc_NV1_3        8.9031     < .001
Acc_NV12_2 - Acc_NV3_3        5.3092     < .001
Acc_NV12_2 - Acc_NV12_3       0.7760     0.439
Acc_NV12_2 - Acc_NV24_3       0.9802     0.328
Acc_NV24_2 - Acc_NV0_3        1.0618     0.290
Acc_NV24_2 - Acc_NV1_3        8.6989     < .001
Acc_NV24_2 - Acc_NV3_3        5.1050     < .001
Acc_NV24_2 - Acc_NV12_3       0.5718     0.568
Acc_NV24_2 - Acc_NV24_3       0.7760     0.439
Acc_NV0_3 - Acc_NV1_3         9.7607     < .001
Acc_NV0_3 - Acc_NV3_3         6.1668     < .001
Acc_NV0_3 - Acc_NV12_3        1.6336     0.104
Acc_NV0_3 - Acc_NV24_3        1.8378     0.068
Acc_NV1_3 - Acc_NV3_3         3.5939     < .001
Acc_NV1_3 - Acc_NV12_3        8.1271     < .001
Acc_NV1_3 - Acc_NV24_3        7.9229     < .001
Acc_NV3_3 - Acc_NV12_3        4.5332     < .001
Acc_NV3_3 - Acc_NV24_3        4.3290     < .001
Acc_NV12_3 - Acc_NV24_3       0.2042     0.838
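For readers who want to run this kind of analysis themselves, the sketch below shows one possible route in Python: an omnibus Friedman test followed by Conover pairwise comparisons, assuming the third-party scikit-posthocs package and its posthoc_conover_friedman function. The data are invented, and the tables above were not produced with this code.

```python
# Hypothetical sketch: Friedman test plus Conover (Durbin-Conover-style)
# pairwise comparisons, assuming the scikit-posthocs package is installed.
# The data below are invented; the thesis tables were produced elsewhere.

import pandas as pd
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# One row per participant, one column per condition (accuracy scores).
acc = pd.DataFrame({
    "Acc_NV0_1":  [1.0, 0.9, 1.0, 0.8, 0.9],
    "Acc_NV1_1":  [0.0, 0.1, 0.0, 0.0, 0.1],
    "Acc_NV24_1": [0.6, 0.7, 0.5, 0.6, 0.8],
})

# Omnibus Friedman test across the repeated-measures conditions.
stat, p = friedmanchisquare(*[acc[c] for c in acc.columns])
print(f"Friedman chi-square = {stat:.3f}, p = {p:.4f}")

# Pairwise Conover comparisons (analogous to the Durbin-Conover table above).
pairwise_p = sp.posthoc_conover_friedman(acc)
print(pairwise_p)
```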


A.6 Descriptive data for Amount of perceived noise

Table of the mean amount of perceived noise in each condition.

Estimated Marginal Means - Noise ✻ Valence

Valence    Noise   Mean        SE      95% CI Lower   95% CI Upper
Positive   NV1      6.0000     0.344    4.81           7.19
Positive   NV3      5.7500     0.344    4.56           6.94
Positive   NV12     4.2500     0.344    3.06           5.44
Positive   NV24     3.9500     0.344    2.76           5.14
Positive   Clear   -1.20e−15   0.344   -1.19           1.19
Neutral    NV1      5.9500     0.344    4.76           7.14
Neutral    NV3      5.1500     0.344    3.96           6.34
Neutral    NV12     3.9500     0.344    2.76           5.14
Neutral    NV24     3.8500     0.344    2.66           5.04
Neutral    Clear    0.1000     0.344   -1.09           1.29
Negative   NV1      6.0000     0.344    4.81           7.19
Negative   NV3      5.0500     0.344    3.86           6.24
Negative   NV12     4.2500     0.344    3.06           5.44
Negative   NV24     4.1000     0.344    2.91           5.29
Negative   Clear   -4.70e−15   0.344   -1.19           1.19
