Data-Driven Approaches for Tempo and Key Estimation of Music Recordings

Data-Driven Approaches for Tempo and Key Estimation of Music Recordings Datengetriebene Verfahren fur¨ Tempo- und Tonart-Schätzung von Musikaufnahmen Dissertation Der Technischen Fakultät der Friedrich-Alexander-Universität Erlangen-Nurnberg¨ zur Erlangung des Doktorgrades Doktor der Ingenieurwissenschaften (Dr.-Ing.) vorgelegt von Hendrik Schreiber aus Soest Als Dissertation genehmigt von der Technischen Fakultät der Friedrich-Alexander-Universität Erlangen-Nurnberg¨ Tag der mundlichen¨ Prufung:¨ 1. Juli 2020 Vorsitzender des Promotionsorgans: Prof. Dr.-Ing. Andreas Paul Fröba 1. Gutachter: Prof. Dr. rer. nat. Meinard Muller¨ 2. Gutachter: Prof. Dr.-Ing. Sebastian Stober Abstract Abstract In recent years, we have witnessed the creation of large digital music collections, accessible, for example, via streaming services. Efficient retrieval from such collections, which goes beyond simple text searches, requires automated music analysis methods. Creating such methods is a central part of the research area Music Information Retrieval (MIR). In this thesis, we propose, explore, and analyze novel data-driven approaches for the two MIR analysis tasks tempo and key estimation for music recordings. Tempo estimation is often defined as determining the number of times a person would \tap" per time interval when listening to music. Key estimation labels music recordings with a chord name describing its tonal center, e.g., C major. Both tasks are well established in MIR research. To improve tempo estimation, we focus mainly on shortcomings of existing approaches, particularly estimates on the wrong metrical level, known as octave errors. We first propose novel methods using digital signal processing and traditional feature engineering. We then re-formulate the signal-processing pipeline as a deep computational graph with trainable weights. This allows us to take a purely data-driven approach using supervised machine learning (ML) with convolutional neural networks (CNN). We find that the same kinds of networks can also be used for key estimation by changing the orientation of directional filters. To improve our understanding of these systems, we systematically explore network architectures for both global and local estimation, with varying depths and filter shapes, as well as different ways of splitting datasets for training, validation, and testing. In particular, we investigate the effects of learning on different splits of cross-version datasets, i.e., datasets that contain multiple recordings of the same pieces. For training and evaluation the proposed data-driven approaches rely on curated datasets covering certain key and tempo ranges as well as genres. Datasets are therefore another focus of this work. Additionally to creating or deriving new datasets for both tasks, we evaluate the quality and suitability of popular tempo datasets and metrics, and conclude that there is ample room for improvement. To promote better, transparent evaluation, we propose new metrics and establish a large open and public repository containing evaluation code, reference annotations, and estimates. i ii Zusammenfassung Zusammenfassung In den vergangenen Jahren sind große digitale Musiksammlungen entstanden, die { beispielsweise { uber¨ Streaming-Dienste einfach zugänglich sind. Ein effizientes Retrieval aus solchen Sammlungen, das uber¨ die simple Textsuche hinausgeht, erfordert automatisierte Musikanalysemethoden. Das Erforschen solcher Methoden ist ein zentraler Bestandteil des Forschungsgebiets Music Information Retrieval (MIR). In dieser Arbeit stellen wir neue datengetriebene Ansätze fur¨ die beiden MIR- Analyseaufgaben Tempo- und Tonart-Schätzung fur¨ Musikaufnahmen vor und analysieren sie. Dabei wird Tempo-Schätzung oft definiert als das Zählen der Male, die eine Person beim Hören von Musik pro Zeitintervall \klopfen" wurde.¨ Tonart-Schätzung weist Musikaufnahmen einen Akkordnamen zu, der den Klangmittelpunkt beschreibt, z.B. C-Dur. Beide Aufgaben sind in der MIR-Forschung fest verankert. Um die Tempo-Schätzung zu verbessern, konzentrieren wir uns hauptsächlich auf Defizite beste- hender Ansätze, insbesondere Schätzungen auf der falschen metrischen Ebene, den sogenannten Oktavfehlern. Dazu schlagen wir zunächst neue Methoden vor, die sich der digitalen Signalver- arbeitung und des traditionellen Feature-Engineerings bedienen. Anschließend formulieren wir die Signalverarbeitungspipeline in eine tiefe, graphenartige Rechenstruktur mit trainierbaren Parametern um. Dies ermöglicht uns einen rein datengetriebenen Ansatz unter Verwendung von uberwachtem¨ maschinellem Lernen (ML) mit neuronalen Netzen { insbesondere Convo- lutional Neural Networks (CNN). Wir stellen fest, dass durch das Andern¨ der Orientierung von gerichteten Filtern, die gleichen Arten von Netzwerken auch fur¨ die Tonart-Schätzung verwendet werden können. Um unser Verständnis dieser Systeme zu vertiefen, untersuchen wir systematisch Netzwerkarchitekturen fur¨ die globale und lokale Schätzung mit unterschiedlichen Tiefen und Filterformen sowie verschiedenen Datensatz-Splits fur¨ Training, Validierung und Test. Insbesondere betrachten wir, welche Auswirkungen das Lernen auf verschiedenen Splits von Cross-Version-Datensätzen hat. Dies sind Datensätze, die mehrere Aufnahmen derselben Stucke¨ enthalten. Fur¨ Training und Evaluation stutzen¨ sich die vorgeschlagenen datengetriebenen Ansätze auf kuratierte Datensätze, die bestimmte Tonart- und Tempobereiche sowie Genres abdecken. Ein weiterer Schwerpunkt dieser Arbeit liegt daher auf den Datensätzen selbst. Zusätzlich zum Erstellen oder Ableiten neuer Datensätze fur¨ beide o.g. Aufgaben evaluieren wir die Qualität und Eignung gängiger Tempo-Datensätze und -Metriken und kommen zu dem Schluss, dass es iii Zusammenfassung Raum fur¨ Verbesserungen gibt. Um eine bessere, transparentere Evaluation zu fördern, schlagen wir daher neue Metriken vor und etablieren ein großes, offenes und öffentliches Repository mit Evaluationscode, Referenzannotationen und Schätzungen. iv Contents Contents Abstract i Zusammenfassung iii 1 Introduction5 1.1 Structure of this Thesis . 7 1.2 Contributions . 10 1.3 Main Publications . 11 1.4 Additional Publications . 12 1.5 Acknowledgments . 12 2 Fundamentals 15 2.1 Audio Signal . 15 2.2 Discrete Fourier Transform . 18 2.3 Basic Tempo Estimation . 21 2.3.1 Representation . 23 2.3.2 Onset Signal Strength . 24 2.3.3 Dominant Pulse . 24 2.4 Metrics . 26 2.5 Datasets . 27 3 Tempo Octave Correction 29 3.1 Octave Correction with Global Features and Linear Regression . 29 3.1.1 Feature Selection . 30 3.1.2 Algorithm . 32 3.1.3 Evaluation . 34 3.1.4 Conclusions . 35 3.2 Error Classification and Correction with Random Forests . 35 3.2.1 Tempo Estimation . 35 3.2.2 Tempo Error Correction . 39 3.2.3 Evaluation . 42 3.2.4 Conclusions . 45 1 Contents 4 CNN-Based Tempo Estimation 47 4.1 Training Datasets . 48 4.1.1 LMD Tempo . 49 4.1.2 MTG Tempo . 49 4.1.3 Extended Ballroom . 50 4.1.4 Combined Training Dataset . 50 4.2 Method . 51 4.2.1 Signal Representation . 51 4.2.2 Network Architecture . 53 4.2.3 Network Training . 54 4.2.4 Global Tempo Estimation . 55 4.3 Evaluation . 56 4.3.1 Global Tempo Benchmarking . 56 4.3.2 Local Tempo Visualization . 57 4.4 Conclusions . 58 5 Electronic Dance Music: A Crowdsourced Experiment 59 5.1 Experiment . 60 5.2 Data Analysis . 62 5.2.1 Submission Quality . 62 5.2.2 Tempo Distribution Metrics . 63 5.2.3 Segment Annotator Agreement . 65 5.2.4 Track Annotator Agreement . 66 5.2.5 Ambiguity by Genre . 69 5.3 Evaluation . 69 5.4 Discussion and Conclusions . 70 6 Reflections on Global Tempo Estimation and Evaluation 73 6.1 Applications . 74 6.1.1 Research Justifications . 75 6.1.2 Presumed Applications . 75 6.1.3 Actual Applications . 76 6.2 Metrics . 77 6.2.1 Accuracy 1 and 2 . 77 6.2.2 P-Score . 79 6.2.3 Tempo Estimation as Classification . 79 6.3 Datasets . 80 6.3.1 Dataset Size . 80 6.3.2 Dataset Quality . 83 2 Contents 6.3.3 Modeling Global Tempo . 84 6.3.4 Dataset Suitability . 85 6.4 Research Cycles . 88 6.5 Survey . 89 6.5.1 Participants . 90 6.5.2 Relevance as MIR Task . 90 6.5.3 Application . 91 6.5.4 Genres . 91 6.5.5 Slow, Fast, and Ambiguous . 91 6.5.6 Accuracy Tolerances . 92 6.6 Proposal . 94 6.6.1 Reference Annotations . 95 6.6.2 Estimated Annotations . 95 6.6.3 Formal Octave Error . 95 6.6.4 Evaluation . 97 6.7 Conclusions . 100 7 Tailored CNN-Architectures for Tempo and Key Estimation 101 7.1 Experiments . 102 7.1.1 Key Estimation . 103 7.1.2 Tempo Estimation . 103 7.1.3 Network Architectures . ..

Data-Driven Approaches for Tempo and Key Estimation of Music Recordings

The Influence of Musical Rhythm on Cardiovascular, Respiratory, And

EDMTCC 2014 – the EDM Guide

Music Interventions in Health Care White Paper H C Are in Healt I Nterventions Music 1

Editorial Metadata in Electronic Music Distribution Systems: Between Universalism and Isolationism

Chapter 7 the Neuroscience of Emotion in Music Jaak Panksepp and Colwyn Trevarthen

Text-Based Description of Music for Indexing, Retrieval, and Browsing

MUSIK MACHT GEISTIG FIT.Pdf

31295019800852.Pdf (1.481Mb)

The Motivational Qualities and Effects of Music

Musiquence – Design, Implementation and Validation of a Customizable Music and Reminiscence Cognitive Stimulation Platform for People with Dementia

A Meta-Analytic Review and Field Studies of Ultra-Distance Athletes

Employing Subjective Tests and Deep Learning for Discovering the Relationship Between Personality Types and Preferred Music Genres