<<

Proc. of the 10th International Symposium on Computer Music Multidisciplinary Research, Marseille, , October 15-18, 2013

How Much Does Audio Quality Influence Ratings of Overall Listening Experience?

Michael Schoeffler, Bernd Edler and J¨urgenHerre

International Audio Laboratories Erlangen [email protected]

Abstract. Researchers, developers and manufacturers put a lot of ef- fort into improving the audio quality of their products but little is known about the overall resulting effect on the individual customer. An experi- ment with 34 participants was conducted with the purpose to investigate the relationship between audio quality and overall listening experience in more detail. When rating the overall listening experience, participants are allowed to take every aspect into account which is important to them, including their taste in music. The participants rated the overall listen- ing experience of short music excerpts in different levels of bandwidth degradation. The result analysis revealed that listening to a song which is liked was slightly more important than the provided audio quality when participants rated the music excerpts. Nevertheless, audio quality was an important factor for less-liked songs to get higher ratings than more-liked songs with low or medium audio quality.

Keywords: Overall Listening Experience, Quality of Experience, Audio Quality

1 Introduction

The purpose of the experiment presented in this paper is to find out how impor- tant audio quality is for enjoying music. The term “overall listening experience” is used for describing the listener’s enjoyment when listening to a music excerpt. When the overall listening experience is rated, listeners can take every aspect into account which seems important to them. Which aspects are important might differ from listener to listener but possible aspects could be mood, preferences in music genre, performing artist, expectations, lyrics and audio quality. The term overall listening experience is related to Quality of Experience (QoE). One can find slightly different definitions for Quality of Experience. Nevertheless, various authors share the opinion that it is an experience from the user’s point of view. In Le Callet et al. [1] Quality of Experience is described as: “Quality of Experience (QoE) is the degree of delight or annoyance of the user of an application or service. It results from the fulfillment of his or her expectations with respect to the utility and/or enjoyment of the application or service in the light of the users personality and current state.”

CMMR2013 - 678 Proc. of the 10th International Symposium on Computer Music Multidisciplinary Research, Marseille, France, October 15-18, 2013 2 Michael Schoeffler et al.

E. g. developers of audio products have an interest in answering the question how overall listening experience (or Quality of Experience) is related to audio quality. Although a lot of effort is put into improving the audio quality of products, but little is known about the effect on the individual customer. Therefore, the influence of audio quality on overall listening experience is investigated in this paper. There exist many possibilities to degrade audio quality, e. g. distortion in time or frequency domain, channel downmixing or bandwidth limitation. In this work the limitation of bandwidth is used for representing a degradation in audio quality. In methodologies for subjective evaluations of audio quality, like MUSHRA [2] or BS.1116-1 [3], listeners rate test items according to a given reference. In au- dio quality evaluation such a reference is the item having the optimal quality. In context of overall listening experience the definition of the optimal experience is listener-dependent. Every listener has a different understanding of the best overall listening experience and it is dependent on the listener’s personal expec- tations, experiences etc. Such user-dependent variables are included in various Quality of Experience models [4] [5] [6]. When evaluating Quality of Experience by an experimental approach, it must be carefully dealt with the listeners’s per- sonal reference of optimal overall listening experience. In psychological research humans auditory memory is distinct in different types of storages depending on the decay times. Cowan separated them into short auditory storage and long auditory storage [7]. The short auditory storage refers to a literal storage that decays within a fraction of a second following a stimulus. On the other hand, the long auditory storage lasts for several seconds. While attending an experiment and listening to several stimuli in sequence, especially the short-term-related storages are probably altering by listening to each stimulus. Assuming the lis- tener’s definition of overall listening experience is somehow dependent on his or her auditory memories, the stimuli sequence might have an effect on the out- come of the experiment. Since the influence of auditory memories on rating the overall listening experience is not fully known, the experiment was designed to affect listeners’ expectations and experiences (which are stored in the auditory memory) as little as possible.

2 Related Work

Especially in telecommunications, Quality of Experience is becoming a popular topic. An introduction into Quality of Experience and an overview of the current state of the art of Quality of Experience research is given by Schatz et al. [8]. An overview of telecommunications-related QoE models including a proposed model was published by Laghari et al. [6]. M¨oller et al. developed a taxonomy of the most relevant Quality of Service and Quality of Experience aspects which result from multimodal human-machine interactions [9]. In context of Auditory Virtual Environments, Silzle developed a taxonomy based on point of views of end-users and service providers [10].

CMMR2013 - 679 Proc. of the 10th International Symposium on Computer Music Multidisciplinary Research, Marseille, France, October 15-18, 2013 Influence of Audio Quality on Overall Listening Experience 3

From a more acoustical point of view, Blauert and Jekosch proposed a four- layer model for the field of sound quality evaluation [11]. The most abstract layer of their proposed model is the Aural-communication Quality including the so-called product-sound quality which is the quality from an user’s point of view. Schoeffler and Herre conducted an experiment to investigate the influ- ence of audio quality on the overall listening experience [12]. The results of their experiment showed that overall listening experience is strongly influenced by bandwidth when the listeners is presented a reference of the music in maximum available bandwidth. However, the experiment took not into account the theory of auditory storages. Quality of Experience was rated in one session which famil- iarized the participants with audio quality and might have led to a bias towards increased audio-quality awareness. In evaluation of audio quality, listeners are often divided into expert listen- ers and na¨ıve listeners. Differences between these two types of listeners were investigated in recent publications [13] [14] [15] [16] [17].

3 Method

3.1 Stimuli Selection Nine music excerpts were selected from a music database and were used as basic items. The decision for selecting these nine basic items was mainly based on their spectral centroids and their levels of awareness. The spectral centroid indicates where the “center of mass” of the spectrum is located and is defined as:

PN−1 f(n) X(n) spectral centroid = n=0 , (1) PN−1 n=0 X(n) where X(n) is the magnitude of bin number n, f(n) is the center frequency of bin number n and N is the total number of bins. Since the basic items were limited in their signal bandwidth in a subsequent pro- cessing step, the spectral centroid was used as a feature for roughly representing the perceived effect of limiting the signal bandwidth. E. g. when comparing a music excerpt with a very high spectral centroid and a music excerpt with a very low spectral centroid, limiting the bandwidth would more affect the mu- sic excerpt with the higher spectral centroid. The reason for this is that this music excerpt has more signal energy in the higher frequencies. Limiting the bandwidth would reduce the signal energy in the higher frequencies. This be- comes true, especially when the cut-off frequency for limiting the bandwidth is below the spectral centroid of the music excerpt with more signal energy in the higher frequencies. If the cut-off frequency is much higher than the spectral centroid of the music excerpt with less signal energy in the higher frequencies, the music excerpt might even not be affected. To overcome such situations a cluster of 20 music excerpts was computed from the music database. The cluster contained only music excerpts which have similar spectral centroids. The music

CMMR2013 - 680 Proc. of the 10th International Symposium on Computer Music Multidisciplinary Research, Marseille, France, October 15-18, 2013 4 Michael Schoeffler et al.

database consisted mainly of songs which were ranked in known record charts. Being ranked in a record chart was expected to represent the level of awareness. The music excerpts covered a recognizable part of the song (e. g. refrain) and had a duration of ten seconds. In a second step it was investigated which of these 20 music excerpts would lead to well distributed ratings according to listeners’ music preference. There- fore, an informal pilot test with four participants was conducted. The partici- pants rated the excerpts by a five-star scale according to how much they like each music excerpts. Based on the results of this informal pilot test, a group of three audio professionals decided the final selection of nine basic items. The selected nine basic items including their spectral centroids are depicted in Table 1.

Song Interpret Centroid[Hz] American Pie Madonna 3620 Hold On Korn 3421 Hold on me Marlon Roudette 3636 Hotel California Eagles 3514 Love Today Mika 3887 No Light, No Light Florence and the Machine 3863 Nobody Knows Pink 3822 Respect Yourself Joe Cocker 3595 The Real Slim Shady Eminem 3819 Table 1: Selected music excerpts and their spectral centroid.

The selected music excerpts were converted to to a sample rate of 48 kHz and downmixed to mono.

Bandwidth Limitation For having a degradation in audio quality, the nine selected basic items were limited in their bandwidth by a 5th-order Butterworth low-pass filter [18]. Six cut-off frequencies were selected for the filter: 1080 Hz, 2320 Hz, 4400 Hz, 7700 Hz, 12000 Hz and 15500 Hz. These cut-off frequencies correspond to the cut-off frequencies of the 9th, 14th, 18th, 21st and 24th critical band of the Bark scale [19]. More frequencies from the upper critical bands were selected since they were found to be more important when listening to music. In total 54 items (9 basic items ∗ 6 cut-off frequencies) were generated in the bandwidth limitation processing step.

Loudness Normalization When limiting the bandwidth, the overall signal energy of the signal was reduced. A reduction of the signal energy is perceived as a decreased loudness by humans. Without normalizing the loudness, strongly bandwidth-limited signals would sound less loud than weakly bandwidth-limited signals. One possibility to normalize the loudness is to use a standard for loud- ness normalization like EBU R128 [20] [21]. One goal of EBU R128 is to provide a standard which is easy to implement but also models the human perception

CMMR2013 - 681 Proc. of the 10th International Symposium on Computer Music Multidisciplinary Research, Marseille, France, October 15-18, 2013 Influence of Audio Quality on Overall Listening Experience 5

of loudness as closely as possible. For measuring the loudness according to EBU R128, the so-called K-Weighting curve (specified in ITU-R BS. 1770 [22]) is ap- plied to all channels. The K-Weighting curve is a second-order high-pass filter which forms (according to the ITU) the basis for matching an inherently subjec- tive impression with an objective measurement. After applying the K-Weighting curve, the total mean square level is calculated and the result is displayed as LKFS (Loudness, K-Weighting, referenced to digital Full Scale) or LUFS (Loud- ness Units, referenced to digital Full Scale). Loudness Units (LU) are used for relative measurements where 1 LU corresponds to 1 dB. Regarding bandwidth- limited signals, the K-Weighting curve has one major disadvantage. Especially in the lower frequencies it does not correlate well with other established weighting- curves which model thresholds of human hearing more accurately. In EBU R128 100 Hz and 1000 Hz have almost the same weighting. ISO 226:2003 [23] specifies combinations of sound pressure levels and frequencies of pure continuous tones which are perceived as equally loud by human listeners. In this established spec- ification the difference of the Sound Pressure Level between 100 Hz and 1000 Hz is about 20 dB. Figure 1 depicts the differences between ISO 226:2003 and EBU R128. ISO 226:2003 is known for modeling the human threshold very well. The high deviations in lower frequencies between actual thresholds and EBU R128 result in the circumstance that the loudness of bandwidth-limited signals is cal- culated too high by EBU R128. When comparing a bandwidth-limited signal and the same signal with full bandwidth, both normalized according to EBU R128, the full bandwidth signal might be perceived significantly louder. In previously conducted pilot tests the items were normalized by using only EBU R128 and many listener reported that especially the highly bandwidth-limited items were not perceived loud enough compared to the other items. To overcome this problem, a combination of EBU R128 and the Zwicker’s loudness model [24] was applied for normalizing the loudness of all items. For describing the algorithm, we define the basic items as basic itemi, where i is the index of the excerpt. The items are defined as itemi,b, where i is the index of the song and b is the cut-off frequency of the bandwidth limitation. The function for normalizing the loudness according to EBU R128 is defined as:

normR128(in), in ∈ {basic itemi, itemi,b}. (2) A normalized signal according to EBU R128 is normalized to -23 dB referenced to full scale. The function for calculating the loudness by Zwicker’s loudness model is defined as:

N = loudnessZwicker(in), in ∈ {basic itemi, itemi,b}. (3) A reference loudness of a normalized basic item was calculated as follows:

NiReference = loudnessZwicker(normR128(basic itemi)). (4) By an iterative algorithm the variable gain was calculated so that

|NiReference − loudnessZwicker(itemi,b ∗ gain)| < . (5)

CMMR2013 - 682 Proc. of the 10th International Symposium on Computer Music Multidisciplinary Research, Marseille, France, October 15-18, 2013 6 Michael Schoeffler et al.

0

−20

−40

Relative Level(dB) −60 ISO 226:2003 ITU-R BS. 1770 −80 101 102 103 104 Frequency(Hz)

Fig. 1: Frequency weighting of ISO 226:2003 and ITU-R BS. 1770 (EBU R128) relative to 1000 Hz.

The maximum allowed difference between the loudness of an item and its corre- sponding basic item () was set to 0.01 Sone. The algorithm for solving Equation 5 mainly consisted of a binary search for the variable gain. The index limits (that 23 progressively narrow the search range) were initially set to 0 and 10 20 (which is an increase of 23 dB). The upper limit is a heuristic value which should avoid clipping. Since the basic item was normalized to -23 dB and has a full band- 23 width, a bandwidth-limited item multiplied by 10 20 should not clip. The step size of gain was set to 0.0001. When a solution was found by the binary search algorithm, each item was multiplied by its determined gain. This resulted in that all items had almost the same perceived loudness. The functionality of this loudness normalizing algorithm was validated by an informal listening test with two participants. Besides Zwicker’s loudness model, other popular loudness mod- els exist, e. g. Glasberg and Moore [25]. The Zwicker’s loudness model was used since it was found to deliver sufficient results and a fast implementation [26] was available. Summarizing, the loudness normalization was the final part of the stimuli generation. The set of stimuli consists of nine basic items and 54 (bandwidth- limited) items. The basic items were loudness-normalized according to EBU R128. The items were loudness-normalized by a combination of EBU R128 and Zwicker’s loudness model.

3.2 Participants The experiment was attended by 34 participants including students and re- searchers from the International Audio Laboratories Erlangen. Participants were

CMMR2013 - 683 Proc. of the 10th International Symposium on Computer Music Multidisciplinary Research, Marseille, France, October 15-18, 2013 Influence of Audio Quality on Overall Listening Experience 7

asked whether they are professionals in audio and to which age group they be- long. Table 2 depicts detailed information about the participants.

Total Age group Professional 34 25 [20-29] 16 [no] 9 [yes] 5 [30-39] 2 [no] 3 [yes] 4 [40-59] 3 [no] 1 [yes] Table 2: Information about the participants.

3.3 Materials and Apparatus

The user interface of the experiment was written in HTML5 and JavaScript. The website was tested on all major web browsers and optimized for mobile devices and desktop computers. The default file format for the stimuli was MP3. All MP3 files were encoded with 256 kbit/s CBR (constant bit rate). As alternative file format WAV was used for browsers which did not support MP3.

3.4 Procedure

The experiment was conducted as a web-based experiment. Each participant attended by using his own computer or mobile device. Since all participants were known by the experimenters, them were trusted to use an appropriate audio setup and to take care of a quiet environment when doing the listening test. The experiment was divided into one registration session and 12 listening sessions. In the registration session the participants registered to the experiment by their personal email address. Furthermore, the participants were asked by a questionnaire whether they are professionals in audio and to which age group they belong. After filling out the questionnaire, participants had to test their audio setup and adjust their volume. Afterwards, the participants rated all nine basic items according to how much they like each music item (“How much do you like this item?”). It was emphasized on the instructions that participants should take everything into account what they would do in a real world scenario (e. g. including their taste in music). Participants were allowed to play back a basic item as often as they wanted. The participants rated the basic items by using a five-star Likert scale. The stars were labeled with “Very Bad”, “Bad”, “Average”, “Good” and “Very Good”. Figure 2 shows a screenshot of the user interface for the basic item rating. When the rating of the basic items was finished, an email was sent to the participant with a website link to his or her first listening session.

CMMR2013 - 684 Proc. of the 10th International Symposium on Computer Music Multidisciplinary Research, Marseille, France, October 15-18, 2013 8 Michael Schoeffler et al.

Fig. 2: Screenshot of the Basic Item Rating.

The items – participants would listen to in the listening sessions – were indi- vidually selected depending on the responses of the participant in the registration session. Four groups of basic items were defined: all basic items with a rating of two stars, three stars, four stars and five stars. All basic items were assigned to a group according to the rating of the participant. Basic items which were rated with one star were not assigned to a group since it was expected that a one-star basic item would not be rated higher in average when its bandwidth is degraded. Two basic items were randomly selected from each of the four groups. If a group contained less than two basic items, the remaining number of basic items were selected. If a group contained more than two basic items, only two basic items were randomly selected. A maximum of eight basic items was selected for the listening sessions which was the case when every group contained at least two items. The purpose of the listening session was to rate the corresponding items of the selected basic items. Since six bandwidth-limited items were generated from each selected basic item, listeners would have to rate a maximum of 48 (8 basic items ∗ 6 items) items. Since two responses were taken for each item,

CMMR2013 - 685 Proc. of the 10th International Symposium on Computer Music Multidisciplinary Research, Marseille, France, October 15-18, 2013 Influence of Audio Quality on Overall Listening Experience 9

the maximum number of ratings a participants had to give increases to 96. In a listening test we conducted previously, participants reported that listening to the same song again and again annoyed them and influenced their rating [12]. In addition, we assume that humans store references of what the songs sound in different quality levels in the auditory memory. Since our intention was that participants should not rely on their auditory memory, the next part of the experiment was divided into multiple listening sessions. By dividing the listening test into multiple sessions, time is passing between these sessions which results in an auditory memory decay. In each listening session the participants would listen to a song only once. As two responses were taken from each of the six items that belong to one basic item, in total 12 listening sessions are needed to fulfill the requirement that each song appears in a session only once. In each of these 12 listening sessions participants listened to a maximum of eight songs. The listening sessions were similar to the registration session and started with testing the audio setup and adjusting the volume. As in the registration session, the listeners were instructed to rate each item according to how they like it and they should take everything into account what they would do in real world scenario. The stimuli were rated separately from each other by using the user interface depicted in Figure 3.

Fig. 3: User interface of the listening sessions.

Between rating the stimuli, participants had to listen to a so-called pause sound. The pause sound was four seconds of white noise which was low-pass- filtered with a time-decreasing cut-off frequency. The spectrogram of the pause sound is depicted in Figure 4. The reason for listening to the pause sound was to increase the interstimulus interval. It was expected that interstimulus interval exceeded the decay time of the (long-term) auditory storage. We assume that increasing the interstimulus interval decreases the influence of the previous stim- ulus on the response of current stimulus. Especially when the previous stimulus is an item with a very high cut-off frequency and the current stimulus is an item with a very low cut-off frequency, the high drop of bandwidth might influence the participant’s response more in comparison to a very low drop of bandwidth.

CMMR2013 - 686 Proc. of the 10th International Symposium on Computer Music Multidisciplinary Research, Marseille, France, October 15-18, 2013 10 Michael Schoeffler et al.

·104 20 4 0 −20 −40 2 −60 −80

Frequency (Hz) −100 −120 0

0 1 2 3 4 Power Spectral Density (dB) Time (s)

Fig. 4: Spectrogram of the pause sound (1024 samples Hann windows size and 512 samples overlap size).

After listening to all selected items, listeners had to fill out a questionnaire in each session. It was asked how important audio quality, song and their mood were for their ratings. The answers were given by a Likert scale with the values “Strongly Agree”, “Agree”,“Neutral”, “Disagree” and “Strongly Disagree”. Ad- ditionally, participants indicated whether they were using headphones or loud- speakers for the session and could fill out a text field for feedback. When a session was finished the participant had to wait at least five hours before the next session could be started. Participants automatically received an email when the 5-hour-break expired and the next session was ready to begin. Since the experiment started on 10th May 2013 and lasted until 14th June 2013, participants had the opportunity to let more time pass between two sessions. In total each participant took part in one registration session where a max- imum of nine basic items were rated. In the 12 listening sessions participants rated each of the maximal 48 selected items two times (maximal 96 responses).

4 Results

For the statistical analysis the basic items ratings of the registration session are used as independent variables for the item ratings. Table 3 shows the ratings of the participants for all items. Figure 5 depicts the mean of item ratings for all cut-off frequencies and grouped by the corresponding basic item ratings of the items. For all groups of basic item ratings, a log-linear relationship between cut-off frequencies and the item ratings can be seen. The mean of item ratings for their corresponding basic item ratings and grouped by the cut-off frequencies is shown in Figure 6. A combination of Figure 5 and Figure 6 as color map is depicted in Figure 7. The relationship between the two independent variables (basic item rating and cut-off frequency) and the item rating is confirmed by the Kendall rank correlation coefficient (0 = no correlation, 1 = perfect positive correlation).

CMMR2013 - 687 Proc. of the 10th International Symposium on Computer Music Multidisciplinary Research, Marseille, France, October 15-18, 2013 Influence of Audio Quality on Overall Listening Experience 11

Cut-off Frequency Basic Item P Rating 1080 2320 4400 7700 12000 15500 1 Star = 82 1 Star = 41 1 Star = 20 1 Star = 2 1 Star = 4 1 Star = 4 2 Stars = 18 2 Stars = 57 2 Stars = 58 2 Stars = 56 2 Stars = 50 2 Stars = 44 2 Stars 3 Stars = 5 3 Stars = 8 3 Stars = 21 3 Stars = 35 3 Stars = 35 3 Stars = 34 636 4 Stars = 1 4 Stars = 0 4 Stars = 6 4 Stars = 10 4 Stars = 14 4 Stars = 21 5 Stars = 0 5 Stars = 0 5 Stars = 1 5 Stars = 3 5 Stars = 3 5 Stars = 3 1 Star = 69 1 Star = 24 1 Star = 11 1 Star = 2 1 Star = 2 1 Star = 3 2 Stars = 40 2 Stars = 59 2 Stars = 35 2 Stars = 16 2 Stars = 14 2 Stars = 12 3 Stars 3 Stars = 9 3 Stars = 26 3 Stars = 55 3 Stars = 59 3 Stars = 55 3 Stars = 63 708 4 Stars = 0 4 Stars = 8 4 Stars = 15 4 Stars = 35 4 Stars = 42 4 Stars = 34 5 Stars = 0 5 Stars = 1 5 Stars = 2 5 Stars = 6 5 Stars = 5 5 Stars = 6 1 Star = 60 1 Star = 21 1 Star = 3 1 Star = 0 1 Star = 0 1 Star = 0 2 Stars = 34 2 Stars = 44 2 Stars = 28 2 Stars = 9 2 Stars = 2 2 Stars = 5 4 Stars 3 Stars = 18 3 Stars = 35 3 Stars = 43 3 Stars = 32 3 Stars = 24 3 Stars = 29 720 4 Stars = 8 4 Stars = 19 4 Stars = 43 4 Stars = 70 4 Stars = 74 4 Stars = 68 5 Stars = 0 5 Stars = 1 5 Stars = 3 5 Stars = 9 5 Stars = 20 5 Stars = 18 1 Star = 11 1 Star = 0 1 Star = 0 1 Star = 0 1 Star = 0 1 Star = 0 2 Stars = 9 2 Stars = 13 2 Stars = 0 2 Stars = 0 2 Stars = 0 2 Stars = 0 5 Stars 3 Stars = 19 3 Stars = 14 3 Stars = 16 3 Stars = 6 3 Stars = 1 3 Stars = 0 276 4 Stars = 5 4 Stars = 14 4 Stars = 13 4 Stars = 18 4 Stars = 12 4 Stars = 11 5 Stars = 2 5 Stars = 5 5 Stars = 17 5 Stars = 22 5 Stars = 33 5 Stars = 35 2340 responses ( 34 participants) Table 3: The number of different item ratings given by the participants for all basic item ratings and cut-off frequencies.

Kendall’s τ between basic item rating and item rating is 0.378 (z-value: 22.03, p-value: 0). Kendall’s τ calculated from the correlation between cut-off frequency and item rating is 0.4484 (z-value: 27.25, p-value: 0). For investigating the relationships between the independent variables and item rating in more detail, a cumulative link model was applied with the following expression:

item rating ∼ basic item rating + cut-off frequency + professional. (6)

For the model the item ratings and basic item rating are used as ordered factors. The model is parameterized with orthogonal contrasts for cut-off frequency and basic item rating. The variable professional indicates whether the participant is a professional in audio. The results are described in Table 4. All predictor variables of the cumulative linear model are significant. As the model coefficients indicate, the basic item rating and the cut-off frequency have a strong influence on the item rating. Being an audio professional has a rather low effect on the item ratings. For investigating the different ratings of the participants in more detail, Kendall rank correlation coefficients were calculated for each participant. Fig- ure 8 depicts the correlation coefficients between the dependent variable item rating and the independent variables basic item rating and cut-off frequency. The mean of the inter-individual correlation between item rating and basic item rating was 0.349 and the standard deviation was 0.242. The inter-individual cor-

CMMR2013 - 688 Proc. of the 10th International Symposium on Computer Music Multidisciplinary Research, Marseille, France, October 15-18, 2013 12 Michael Schoeffler et al.

5 5 4 4 3 3 2 2 1 1 Mean Item Rating Mean Item Rating 0 0 2 3 4 5 103 104 Basic Item Rating Cut-off Frequency (Hz) Cut-off Frequency: Basic Item Rating: 1080 Hz 2320 Hz 2 Stars 3 Stars 4400 Hz 7700 Hz 4 Stars 5 Stars 12000 Hz 15500 Hz

Fig. 5: Mean of the item ratings for Fig. 6: Mean of the item ratings for all cut-off frequencies grouped by ba- all basic item ratings grouped by cut- sic item ratings. off frequencies.

relation between item rating and cut-off frequency had a mean of 0.505 and a standard deviation of 0.212.

5 Discussion

The analysis of the result indicates that the item rating was strongly influenced by cut-off frequency and the basic item rating. A strong correlation between cut-off frequencies and ratings was reported from a similar experiment we con- ducted before [12]. As mentioned before, the previous experiment did not take the theory of auditory memory storages into account. As a result we were told by participants that they got annoyed from listening to the same songs again and again. This was also reflected by the results which showed only a weak re- lationship between reference ratings (can be compared to ratings of basic item) and item ratings. In the experiment presented in this paper, the basic item rat- ing has almost the same correlation as the cut-off frequency. This time, none of the participants reported that he or she got annoyed by listening to same songs more than once. Even if both experiments can hardly be compared, the high correlation coefficient of basic item ratings indicates that a session-based exper- iment that takes auditory storages into account is well suited for investigating the overall listening experience. When comparing the means of the item ratings (Figure 5 and Figure 6), it becomes apparent how important the song is. For example a song rated with 5 stars in the registration session and bandwidth-limited by a cut-off frequency of 4400 Hz has an average rating of 4.02. A song rated with 4 stars in the registration

CMMR2013 - 689 Proc. of the 10th International Symposium on Computer Music Multidisciplinary Research, Marseille, France, October 15-18, 2013 Influence of Audio Quality on Overall Listening Experience 13

5 5 4.5 4 4 3.5 3 3 2.5 2

2 1.5 Item Rating (Stars)

Basic Item Rating (Stars) 1 1080 2320 4400 7700 12000 15500 Cut-off Frequency (Hz)

Fig. 7: Color map of the average item ratings grouped by basic item ratings and cut-off frequencies.

session and played back in highest quality has a lower average rating (3.83). This is especially interesting since such a low bandwidth is normally considered as bad audio quality. In addition, the average item ratings are “saturated” when the cut-off frequency is 12000 Hz or higher. This is also confirmed by the cumulative link model, cut-off frequencies of 12000 Hz and 15500 Hz have almost the same effect size (Table 4). Such a saturation effect can not be seen for the relationship between the basic item ratings and the items ratings, e. g. the effect size of a 4-star basic item rating and a 5-star basic item rating is very different. According to many publications, expert listeners were considered as more reliable listeners when audio quality has to be evaluated(e. g. see [13]). The experiment was participated in by 13 audio professionals who are familiar with audio quality degradations. The requirements for being an expert listener differ from experiment to experiment. Nevertheless, audio professionals can be seen as a good estimation of how expert listeners in bandwidth degradation would rate the overall listening experience. As the results of the cumulative link model (Table 4) show, the ratings of audio professionals differed only slightly from the other participants. This has also been confirmed by discussion with participants. Even if many audio professionals were very aware of the bandwidth degradation, they did not take it that much into account when rating their individual overall listening experience. An analysis of the individual correlation coefficients among the participants revealed that participants had very unique preferences for rating the overall listening experience. The item ratings of some participants had a very high correlation with the cut-off frequencies and only a very weak correlation with the basic item ratings. On the other hand, the song was very important for some other participants and their ratings had almost no correlation with the cut-off frequency. Also the high standard deviation of the individual correlations with

CMMR2013 - 690 Proc. of the 10th International Symposium on Computer Music Multidisciplinary Research, Marseille, France, October 15-18, 2013 14 Michael Schoeffler et al.

Coefficient Estimate Std. Error z-value p-value basic item rating = 3 1.2271 0.1074 11.426 < 2e-16 basic item rating = 4 2.2151 0.1135 19.518 < 2e-16 basic item rating = 5 4.2169 0.1641 25.704 < 2e-16 cut-off frequency = 2320 1.7289 0.1508 11.466 < 2e-16 cut-off frequency = 4400 3.0269 0.1579 19.173 < 2e-16 cut-off frequency = 7700 4.0605 0.1640 24.753 < 2e-16 cut-off frequency = 12000 4.4314 0.1672 26.506 < 2e-16 cut-off frequency = 15500 4.4133 0.1672 26.392 < 2e-16 professional = true -0.2583 0.0821 -3.146 0.00166

Threshold coefficients: Estimate Std. Error z-value 1 Star |2 Stars 1.5816 0.1378 11.48 2 Stars|3 Stars 3.8809 0.1607 24.16 3 Stars|4 Stars 5.7642 0.1813 31.79 4 Stars|5 Stars 8.1292 0.2111 38.50

AIC: 5410.13 Maximum likelihood pseudo R2: 0.5388 Cragg and Uhler’s pseudo R2: 0.5650 McFadden’s pseudo R2: 0.2517 Table 4: Logit cumulative link model of the basic item ratings.

item rating (SD basic item rating: 0.242, SD cut-off frequency: 0.212) indicates that developing a model for predicting the overall listening experience without taking into account the preferences of an individual listener seems to be very nearly impossible. This confirms also preliminary results of an experiment we conducted before where we showed that listeners preferences in song or audio quality varies a lot [12].

6 Conclusion

In the presented experiment, participants had to rate bandwidth-degraded music excerpts according to their individual overall listening experience. The analysis of the results confirmed a previous conducted experiment that bandwidth is strongly correlated to the overall listening experience. Furthermore, the average ratings of overall listening experience came into a saturation effect when band- width was limited by filtering with a cut-off frequency of 12000 Hz or more. How much a song is liked by participants was measured by a rating of the music excerpt in best quality. These ratings were also strongly correlated to the overall listening experience of the bandwidth-degraded music excerpts. By interpreting the results, the choice of the song that is played back can be seen as a very im- portant aspect for the overall listening experience. Especially for achieving high ratings, the song seems to be more important than the provided audio quality. An analysis of the differences between the participants revealed that each one had very individual criterion for rating the overall listening experience. Further research will look into investigating the effect of auditory storages on the results. In addition, other kinds of audio-quality degradations will be examined.

CMMR2013 - 691 Proc. of the 10th International Symposium on Computer Music Multidisciplinary Research, Marseille, France, October 15-18, 2013 Influence of Audio Quality on Overall Listening Experience 15

1

0.8

0.6

0.4 Basic Item Rating τ 0.2

0 Kendall’s 0 0.2 0.4 0.6 0.8 1 Kendall’s τ Cut-off Frequency

Significant (α = 0.05): Cut-off Frequency Basic Item Rating Both None

Fig. 8: Correlation with basic item ratings and cut-off frequencies for each partic- ipant. Correlations which are significant for basic item ratings, cut-off frequencies or both are marked differently.

References

1. Le Callet, P., M¨oller,S., Perkis, A.: Qualinet White Paper on Definitions of Quality of Experience (Version 1.1) (2012) 2. International Telecommunications Union: ITU-R BS . 1534-1 (Method for the subjective assessment of intermediate quality level of coding systems) (2003) 3. International Telecommunications Union: ITU-R BS.1116-1 (Methods for the sub- jective assessment of small impairments in audio systems including multichannel sound systems) (1997) 4. M¨oller,S.: How can we model a quality event? Some considerations on describing quality judgment and prediction processes. In: Acoustics 08, Paris (2008) 3195– 3200 5. Perkis, A., Munkeby, S., Hillestad, O.: A model for measuring Quality of Experi- ence. Proceedings of the 7th Nordic Signal Processing Symposium - NORSIG 2006 (June 2006) 198–201 6. Laghari, K.u.R., Connelly, K.: Toward total quality of experience: A QoE model in a communication ecosystem. IEEE Communications Magazine 50(4) (2012) 58–65 7. Cowan, N.: On short and long auditory stores. Psychological Bulletin 96(2) (1984) 341–370 8. Schatz, R., Hoßfeld, T., Janowski, L., Egger, S.: From Packets to People : Qual- ity of Experience as a New Measurement Challenge. In Biersack, E., Callegari, C., Matijasevic, M., eds.: Data Traffic Monitoring and Analysis. Springer Berlin Heidelberg (2013) 219–263 9. M¨oller,S., Engelbrecht, K.P., K¨uhnel,C., Wechsung, I., Weiss, B.: A Taxonomy of Quality of Service and Quality of Experience of Multimodal Human-Machine Interaction. In: QoMEX, San Diego, California, USA (2009) 7–12

CMMR2013 - 692 Proc. of the 10th International Symposium on Computer Music Multidisciplinary Research, Marseille, France, October 15-18, 2013 16 Michael Schoeffler et al.

10. Silzle, A.: Quality Taxonomies for Auditory Virtual Environments. In: Audio Engineering Society Convention 122, Vienna, (2007) 11. Blauert, J., Jekosch, U.: A Layer Model of Sound Quality. Journal of the Audio Engineering Society 60(1/2) (2012) 4–12 12. Schoeffler, M., Herre, J.: About the Impact of Audio Quality on Overall Listening Experience. In: Proceedings of Sound and Music Computing Conference, Stock- holm (2013) 13. Schinkel-Bielefeld, N., Lotze, N., Nagel, F.: Audio quality evaluation by expe- rienced and inexperienced listeners. In: Proceedings of Meetings on Acoustics. Volume 19., Montreal, Canada, Acoustical Society of America (June 2013) 060016– 060016 14. Bech, S.: Selection and Training of Subjects for Listening Tests on Sound- Reproducing Equipment. J. Audio Eng. Soc 40(7/8) (1992) 590–610 15. Rumsey, F., Zielinski, S., Kassier, R., Bech, S.: Relationships between Experienced Listener Ratings of Multichannel Audio Quality and Na¨ıve Listener Preferences. The Journal of the Acoustical Society of America 117(6) (2005) 3832–3840 16. Barbour, J.L.: Subjective Consumer Evaluation of Multi-Channel Audio Codecs. In: Audio Engineering Society Convention 119, New York (2005) 17. Olive, S.E.: Differences in Performance and Preference of Trained versus Untrained Listeners in Loudspeaker Tests: A Case Study. J. Audio Eng. Soc 51(9) (2003) 806–825 18. Butterworth, S.: On the Theory of Filter Amplifiers. Experimental Wireless and the Wireless Engineer 7 (1930) 536–541 19. Zwicker, E.: Subdivision of the Audible Frequency Range into Critical Bands (Frequenzgruppen). J. Acoust. Soc. Am. 33(2) (1961) 248–248 20. EBU: Loudness normalisation and permitted maximum level of audio signals (EBU Recommendation R 128) (2011) 21. EBU: Practical guidelines for Production and Implementation in accordance with EBU R 128 (Version 2.0) (2011) 22. International Telecommunications Union: ITU-R BS.1770 (Algorithms to measure audio programme loudness and true-peak audio level) (2006) 23. for Standardization, I.O.: ISO 226:2003 - Acoustics – Normal equal-loudness-level contours (2003) 24. Zwicker, E., Fastl, H.: Psychoacoustics: Facts and Models. 2nd editio edn. Springer (1999) 25. Glasberg, B.R., Moore, B.C.J.: A Model of Loudness Applicable to Time-Varying Sounds. Journal of the Audio Engineering Society 50(5) (May 2002) 331–342 26. GENESIS S.A.: Loudness Toolbox (Version 1.2) (2012)

CMMR2013 - 693