Predicting Perceived Dissonance of Piano Chords Using a Chord-Class Invariant CNN and Deep Layered Learning

Juliette Dubois (ENSTA ParisTech, [email protected]), Anders Elowsson (KTH Royal Institute of Technology, [email protected]), Anders Friberg (KTH Royal Institute of Technology, [email protected])

Copyright: © 2019 Juliette Dubois et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

ABSTRACT

This paper presents a convolutional neural network (CNN) able to predict the perceived dissonance of piano chords. Ratings of dissonance for short audio excerpts were combined from two different datasets and groups of listeners. The CNN uses two branches in a directed acyclic graph (DAG). The first branch receives input from a pitch estimation algorithm, restructured into a pitch chroma. The second branch analyses interactions between close partials, known to affect our perception of dissonance and roughness. The analysis is pitch invariant in both branches, facilitated by convolution across log-frequency and octave-wide max-pooling. Ensemble learning was used to improve the accuracy of the predictions. The coefficient of determination (R^2) between ratings and predictions is close to 0.7 in a cross-validation test of the combined dataset. The system significantly outperforms recent computational models. An ablation study tested the impact of the pitch chroma and partial analysis branches separately, concluding that the deep layered learning approach with a pitch chroma was driving the high performance.

1. INTRODUCTION

The concept of dissonance has a long history. However, the experimental study of dissonance dates back only to the middle of the 20th century. Both its definition and causes are still subject to discussion. In early studies [1], consonance is defined as a synonym for beautiful or euphonious. Sethares [2] proposed a more general definition of dissonant intervals: they sound rough, unpleasant, tense and unresolved.

According to Terhardt [3], musical consonance consists of both sensory consonance and harmony. Similar to this explanation, Parncutt [4] considers the existence of two forms of consonance, one resulting from psychoacoustical factors (sensory consonance) and the other relating to musical experience and the cultural environment (musical or cultural consonance). Dissonance can thus be divided into sensory dissonance and musical dissonance. Musical dissonance depends on our expectations and musical culture, which makes it difficult to evaluate [5]. In [6], several acoustic factors are isolated and the influence of each one is evaluated separately.

Dissonance is closely related to another sensory concept, roughness. Roughness applies to both music and natural sounds, and occurs when two frequencies played together produce a beating. According to Vassilakis and Kendall (2010) [7], sensory dissonance is highly correlated with roughness.

Various sensory dissonance models have been proposed since the middle of the 20th century. Based on Helmholtz' theory, Plomp and Levelt [1] conducted an experiment using pure sine wave intervals. The result of this study is a set of dissonance curves for a range of frequencies and for different frequency differences. Sethares [8] gives a parametrization of these curves, resulting in a computational model of dissonance in several cases: for two or several sine waves of different amplitudes, for one complex tone, and for two notes of the same timbre. Vassilakis developed a more precise model, in which the contribution of a pair of partials depends on their amplitudes and the amplitude fluctuation [9]. This computational model, including a specific signal processing method for extracting the partials, is available online [10].

These models are closely connected, and revolve around the same core concept of bandwidth and proximity of partial frequencies. Other approaches have also been suggested: see Zwicker and Fastl [11], or Kameoka and Kuriyagawa [12].
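To make the shape of these curve-based models concrete, here is a minimal sketch of a pairwise dissonance measure in the spirit of Sethares [8], summed over all pairs of partials. The constants and the min-amplitude weighting are the values commonly quoted for his parametrization of the Plomp-Levelt curves; treat it as an illustration under those assumptions, not as the exact implementation of any of the cited models.

```python
import itertools
import math

# Commonly quoted constants for Sethares' parametrization of the
# Plomp-Levelt dissonance curves (assumed values; see Sethares [8]).
B1, B2 = 3.5, 5.75       # decay rates of the two exponentials
D_STAR = 0.24            # scaled frequency difference of maximal dissonance
S1, S2 = 0.0207, 18.96   # make the curve width depend on register

def pair_dissonance(f1, a1, f2, a2):
    """Dissonance contributed by one pair of pure tones (partials)."""
    s = D_STAR / (S1 * min(f1, f2) + S2)   # curves widen in low registers
    d = abs(f2 - f1)
    return min(a1, a2) * (math.exp(-B1 * s * d) - math.exp(-B2 * s * d))

def total_dissonance(partials):
    """Additive dissonance over pairs of (frequency_hz, amplitude) tuples."""
    return sum(pair_dissonance(f1, a1, f2, a2)
               for (f1, a1), (f2, a2) in itertools.combinations(partials, 2))

# Example: two harmonic tones (six partials each) a semitone apart.
tone = lambda f0: [(k * f0, 0.88 ** k) for k in range(1, 7)]
print(total_dissonance(tone(261.63) + tone(277.18)))  # C4 + C#4, a rough dyad
```

Summing a tuned curve over all partial pairs is precisely the "bandwidth and proximity of partial frequencies" idea identified above as the common core of these models.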
Schön [13] conducted an experiment in which listeners rated the dissonance of a range of chords played on a piano. Dyads (chords with two notes) and triads (chords with three notes) were played, and listeners were asked to rate the dissonance of each chord. The experiment showed that dissonance is easy to rate with rather high agreement, and thus could be considered a relevant feature for describing music from a perceptual point of view [14]. The rating data from [13], together with the data from [15], were used in the current model.

Perceptual features of music, such as perceived dissonance, speed, pulse clarity, and performed dynamics, have received increasing interest in recent years. They have been studied both as a group [14, 16, 17] and through dedicated models [18–20]. The trend has been towards data-driven architectures, foregoing the feature extraction step. Another trend is that of "deep layered learning" models in MIR, as defined by Elowsson [21]. Such models use an intermediate target to extract a representation accounting for the inherent organization of music. The strategy has been applied for, e.g., beat tracking [22] and instrument recognition [23] in the past. In this study, we show how pitch estimates from a previous machine learning model can be reshaped and fed as input for predicting dissonance.

1.1 Overview of the article

In Section 2, we describe the two datasets and the process for merging them. Section 3 focuses on the input representation, detailing how it was extracted for each of the two branches of the CNN. In Section 4, the design of the CNN is described, as well as the methodology used for ensemble learning and the parameter search. Section 5 presents the evaluation methodology and the results. It also includes a small ablation study and a comparison with a previous model of dissonance. Section 6 offers conclusions and a discussion of the results.

2. DATASETS

A total of 390 chords were gathered, coming from two different experiments [13, 15]. In these two experiments, the listeners were asked to rate the dissonance of recorded piano chords. A definition of dissonance was given to the listeners. In [15], consonance was defined as "the musical pleasantness or attractiveness of a sound"; in particular, "if a sound is relatively unpleasant or unattractive, it is referred to as dissonant". In [13], the following definition was given: "dissonant intervals are often described as rough, unpleasant, tense and unresolved". Both definitions refer to the unpleasantness of a sound. However, the description from [13] was more precise and already included the fact that only intervals were studied. The definitions were close enough to give reason to believe that the same feature was evaluated.

2.1 First dataset

The first dataset (D1) comes from the experiment conducted in [13]. It contains 92 samples of 0.5 seconds each, created with a sampled piano in Ableton Live. Two kinds of chords were played, consisting of either two notes (dyads) or three notes (triads). The dyads were either centered around middle C or one fifth above, ranging from unison to a major tenth. The same process was used for the triads. In total, the dataset contained 34 dyads and 58 triads.

Thirty-two listeners were asked to evaluate each sound from "not dissonant at all" to "completely dissonant", using a web interface. The listeners' musical background varied but was mostly at an amateur level, with an average practice time of 4 hours per week.

2.2 Second dataset

The second dataset (D2) comes from the experiment conducted in [15] and contains the remaining 298 chords. The pitches follow a just intonation ratio, differing from the standard equal-tempered tuning used in D1. Thirty musically trained and untrained listeners from Vienna and Singapore rated all the examples. The inter-rater reliability, as estimated by the average intraclass correlation coefficient (ICC), ranged from 0.96 to 0.99. The ratings were averaged across all listeners.

2.3 Merged dataset

We used the average listener rating of dissonance of each chord as the target for our experiment. In D1, the dissonance ratings ranged from 0 to 40, 40 being the most dissonant. In D2, they ranged from 1 to 4, 1 being the most dissonant. Therefore, the ratings of the latter dataset were first inverted by multiplication with -1. The listener ratings (A) were then normalized according to

\[
A_{\mathrm{normalized}} = \frac{A - \min(A)}{\max(A) - \min(A)} . \qquad (1)
\]

After normalization, the most consonant chord had a rating of 0 and the most dissonant chord had a rating of 1. The input data were also normalized (see Section 3).
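As a concrete illustration of this merging step, the sketch below inverts the D2 scale and applies the min-max normalization of Eq. (1) with NumPy. The array names and example values are ours, and normalizing each dataset separately before concatenation is our reading of the procedure described above, since the two rating scales differ.

```python
import numpy as np

def normalize(a: np.ndarray) -> np.ndarray:
    """Eq. (1): min-max normalization; 0 = most consonant, 1 = most dissonant."""
    return (a - a.min()) / (a.max() - a.min())

# Hypothetical averaged listener ratings on the original scales.
d1_ratings = np.array([3.0, 17.5, 40.0])  # D1: 0 to 40, 40 most dissonant
d2_ratings = np.array([3.9, 2.2, 1.0])    # D2: 1 to 4, 1 most dissonant

# D2 runs in the opposite direction, so invert it before normalizing.
targets = np.concatenate([normalize(d1_ratings), normalize(-d2_ratings)])
```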
3. NETWORK INPUT

Two input representations were extracted from the audio files: the constant-Q transform (CQT) spectrum and a pitch chroma. These representations capture different aspects of the audio file. The representation from the CQT can capture spectral aspects of the audio, such as the distance between partials, whereas the pitch chroma represents the audio at a higher level, corresponding (ideally) to the actual notes that were played.

3.1 Pitch chroma

A pitch chroma was extracted from a Pitchogram representation, as illustrated in Fig. 1. To extract the pitch chroma, the implementation from [24] was applied to the audio files to first extract a Pitchogram representation. This representation has a resolution of 1 cent/bin across pitch and a frame length of 5.8 ms. The Pitchogram was thresholded at 2 to remove lower, noisy traces of pitch. Then, a Hanning window of width 141 was used to filter across pitch. In our initial model, we then extracted activations across pitch at semitone-spaced intervals (12 bins/octave). This gave unsatisfying results, presumably due to the out-of-sync spacing with the just-intonation chords in D2.
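The front end of this initial model can be sketched as follows, assuming a Pitchogram matrix (one pitch bin per cent, one column per 5.8 ms frame) is already available from the model in [24]. The hard-threshold semantics and the unnormalized smoothing window are our assumptions, and the function name is hypothetical.

```python
import numpy as np
from scipy.ndimage import convolve1d
from scipy.signal.windows import hann

def semitone_activations(pitchogram: np.ndarray) -> np.ndarray:
    """Threshold, smooth across pitch, then sample at semitone spacing.

    pitchogram: shape (n_pitch_bins, n_frames), 1 cent per pitch bin.
    """
    # Threshold at 2: remove lower, noisy traces of pitch (assumed to
    # mean that values below the threshold are set to zero).
    z = np.where(pitchogram >= 2.0, pitchogram, 0.0)

    # Filter across the pitch axis with a Hanning window of width 141
    # bins, i.e. roughly +/- 70 cents around each bin.
    z = convolve1d(z, hann(141), axis=0, mode="constant")

    # Semitone-spaced activations: every 100 cents, 12 bins per octave.
    return z[::100, :]
```

Note that the slicing step assumes bin 0 is aligned with a semitone; in practice an offset would be needed to align the sampling grid with the actual note positions, and, as the text notes, this fixed equal-tempered grid is exactly what clashed with the just-intonation chords of D2.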