The Music Emerges: A Computational Approach to the Speech-to-Song Illusion

University of Amsterdam Master's Thesis

The Music Emerges: A Computational Approach to the Speech-to-Song Illusion

Author: Arran Lyon
Supervisor: Dr. Makiko Sadakata

A thesis submitted in partial fulfilment of the requirements for the degree of Master of Science in Computational Science, in the Computational Science Lab, Informatics Institute. August 2020.

Declaration of Authorship

I, Arran Lyon, declare that this thesis, entitled 'The Music Emerges: A Computational Approach to the Speech-to-Song Illusion', and the work presented in it are my own. I confirm that:

  • This work was done wholly or mainly while in candidature for a research degree at the University of Amsterdam.
  • Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.
  • Where I have consulted the published work of others, this is always clearly attributed.
  • Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
  • I have acknowledged all main sources of help.
  • Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:
Date: 10 August 2020

UNIVERSITY OF AMSTERDAM

Abstract

Faculty of Science, Informatics Institute
Master of Science in Computational Science

The Music Emerges: A Computational Approach to the Speech-to-Song Illusion

by Arran Lyon

The Speech-to-Song Illusion is an auditory effect in which a perceptual transformation occurs during exact repetition of a short, naturally spoken phrase: the listener begins to hear the speaker as if they were singing, when initially the phrase sounded exactly like speech. Understanding the cognitive mechanisms underlying the illusion could reveal the inner workings of speech, song and music perception.
We use a collection of 300 audio stimuli related to this illusion, along with a parallel set of perceptual rating data obtained from a previous study. We describe an algorithm that automatically transcribes the melody within the natural voice as a sequence of tones, improving upon the original formulation. We define a set of 33 melodic, rhythmic, audio and dissonance-related features measured from the raw audio and the extracted tone sequence, including features previously connected to the occurrence of the illusion. New features include a metric that measures the distance of the melody to typically composed melodies (according to a Bayesian model), along with a novel set of features that captures the dynamic change of tension and release in the musical phrase based on a measure of sensory dissonance. We then propose a suitable method to assign a binary label (transforming or non-transforming) to each stimulus based on the continuous range of scores provided by participants of an experiment. After undergoing a feature selection procedure, several classifiers (linear and non-linear support vector machines and a logistic model, along with two ensembles) that predict from the features whether a stimulus will transform all obtain balanced accuracy scores between 0.66 and 0.71, with each model using between 7 and 10 features. We confirm that stability plays an important part in the illusion, along with the closeness of the melody to typical melodies and the consonance of the melody's ending, suggesting that transforming stimuli are those with characteristics of ordinary musical phrases. Finally, 55 participants took part in a follow-up validation experiment with 98 new stimuli to test whether the models generalised well; one model in particular (a linear support vector machine) maintained a score above baseline on the new vocal stimuli (balanced accuracy 0.65).
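The evaluation described above can be illustrated with a minimal sketch. This is not the thesis's actual pipeline: the data here are synthetic stand-ins for the stimulus features and labels, and scikit-learn is assumed as the modelling library. It shows only the general shape of fitting a linear support vector machine and scoring it with balanced accuracy:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-stimulus feature vectors (e.g. stability,
# melody typicality) and binary transforming/non-transforming labels.
X = rng.normal(size=(300, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.8, size=300) > 0).astype(int)

# Stratified split keeps the class balance similar in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# A linear-kernel SVM, one of the classifier families compared above.
model = SVC(kernel="linear", C=1.0)
model.fit(X_train, y_train)

# Balanced accuracy averages recall over both classes, so a classifier
# cannot score well by favouring the majority class.
score = balanced_accuracy_score(y_test, model.predict(X_test))
print(f"balanced accuracy: {score:.2f}")
```

Balanced accuracy is a natural choice here because a trivial majority-class predictor scores 0.5 regardless of how imbalanced the transforming and non-transforming groups are, which makes the 0.66 to 0.71 range directly interpretable against a fixed baseline.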
Acknowledgements

First and foremost, I would like to thank my supervisor Dr. Makiko Sadakata for introducing me to this curious illusion, and for all her continuous and encouraging support throughout this research. Next, to Bas Cornelissen for his valuable and foundational contribution, on which much of this work is based and without which it would not have been possible, and also to Gerben Groenveld for his feedback and audio materials from his own research into the illusion. A huge thanks to Prof. Henkjan Honing, who kindly contributed his time to evaluating the work, and for his valued early feedback in the process. I would like to offer my sincerest gratitude to my friends Valentin Vogelmann, for his helpful advice on theoretical concepts, and Mattia Capelletti, for his cultural perspective. A warm thank you goes to my good friend Tanja, for her continued support over many long walks. Finally, my gratitude extends to friends Charlie, Charlotte, Efi, Gerben, David and Alice for their advice, feedback and inspiration, to Black Gold Amsterdam for the music, and to all my friends and family across the continent.

Contents

Declaration of Authorship  i
Abstract  ii
Acknowledgements  iii
Contents  iv
List of Figures  vi
List of Tables  vii

1 Introduction  1
  1.1 The Mystery of Repetition  1
  1.2 The Transformation of Sound  3
    1.2.1 Auditory Illusions  3
    1.2.2 Repetition Based Illusions  4
    1.2.3 The Speech-to-Song Illusion  5
    1.2.4 Beyond Speech and Song  7
  1.3 Thesis Overview  7

2 Melody Extraction  9
  2.1 Audio Analysis Methods  9
  2.2 The Melody Extraction Algorithm  10
    2.2.1 Problem Statement and Definitions  11
    2.2.2 Note Segmentation  11
    2.2.3 Note Extraction  14
    2.2.4 Parameterisation and Evaluation  16
  2.3 Bayesian Melody Search  19
    2.3.1 Method Outline  20
    2.3.2 Distribution of Melodies  20
    2.3.3 Note and Key Calculation  22
    2.3.4 Search Space  24

3 Feature Engineering  25
  3.1 Audio Features  25
  3.2 Melodic Features  27
  3.3 Rhythmic Features  29
  3.4 Dissonance Features  32
    3.4.1 Quantifying Dissonance  33
    3.4.2 The Self-Similarity Matrix  34
    3.4.3 Feature Measurements  36

4 Data Analysis  37
  4.1 Audio Stimuli  37
    4.1.1 Materials  38
    4.1.2 Human Ratings  38
    4.1.3 Manipulated Stimuli  40
  4.2 Data Preparation  41
    4.2.1 Aggregation Schemes  41
    4.2.2 Data Filtering  42
  4.3 Feature Distributions and Correlations  44

5 Classification Models  47
  5.1 Statistical Modelling  47
    5.1.1 The Logistic Model  48
    5.1.2 Support Vector Machines  49
    5.1.3 Ensemble Methods  50
  5.2 Feature Selection  50
    5.2.1 Evaluation  51
    5.2.2 Search Procedure  52
  5.3 Results  54

6 Validation Experiment  56
  6.1 Setup  56
    6.1.1 Material  57
    6.1.2 Procedure  59
  6.2 Results  61
    6.2.1 Rating Distributions  61
    6.2.2 Feature Analysis  63
    6.2.3 Model Predictions  64

7 Discussions and Conclusion  66
  7.1 Feature Methods  66
  7.2 Data Analysis  70
  7.3 The Models  72
  7.4 Experiment Results  76
  7.5 Final Thoughts  78

A Feature Summary  80
B Feature Correlations  82
C Model Features  84

References  85

List of Figures

2.1 Stages of the note segmentation procedure  12
2.2 Comparison between the transcription and the stable mean pitch note extraction method  15
3.1 Krumhansl-Kessler key profiles for each pitch class  27
3.2 Example of a normalised onset envelope, along with the note onset times  31
3.3 The dissonance measure curve for two notes over a range of ratios of their fundamental frequencies  34
3.4 Self-similarity plots using the dissonance measure of a six-note piano melody  35
4.1 Distribution of final scores in mcg  39
4.2 Features measured in manipulated stimuli  40
4.3 Distribution of data points of at from the ratings obtained from the mcg experiment  42
4.4 Distribution of mcg data subsets, and the effects of filtering the data  43
4.5 Distributions of stability scores for transforming and non-transforming stimuli  44
4.6 Distribution of stimuli for three features against their ratings  46
6.1 Screenshots of the experiment box at the four main stages of a trial  60
6.2 Distributions of scores for all stimuli in the experiment, along with categorical breakdown
