Towards an Embodied Musical Mind: Generative Algorithms for Robotic Musicians
TOWARDS AN EMBODIED MUSICAL MIND: GENERATIVE ALGORITHMS FOR ROBOTIC MUSICIANS

A Thesis Presented to The Academic Faculty

by

Mason Bretan

In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the School of Music

Georgia Institute of Technology
August 2017

Copyright © 2017 by Mason Bretan

Approved by:

Gil Weinberg, Ph.D., Advisor
School of Music
Georgia Institute of Technology

Jason Freeman, D.M.A.
School of Music
Georgia Institute of Technology

Alexander Lerch, Ph.D.
School of Music
Georgia Institute of Technology

Guy Hoffman, Ph.D.
Department of Mechanical Engineering
Cornell University

Larry Heck, Ph.D.
Google Research, Google, Inc.

Date Approved: April 6, 2017

ACKNOWLEDGEMENTS

First, I would like to thank my advisor, Gil Weinberg, for introducing me to the field of robotic musicianship, providing the necessary guidance for me to develop this thesis, and fostering my passion for research through the continuous encouragement to explore the scientific and artistic opportunities our field has to offer. I am proud of everything we accomplished together over the years, and I look forward to collaborations in the future.

My committee members have been incredibly helpful. Their feedback was integral to shaping this thesis and helping me to effectively and coherently present my ideas. Jason Freeman has been a supportive and important resource during this period. As a committee member and educator in general, he has gone above and beyond the call of duty with his diligence and mindful insights. Alexander Lerch's judicious comments, technical expertise, and rational thinking were essential for this research. Conversations about embodied cognition and robotics with Guy Hoffman were inspiring and helped to organize my thoughts about embodied musicianship.
Larry Heck's supervision was invaluable; without his advice and guidance on machine learning, this thesis would likely have turned out very differently.

I would like to thank Frank Clark, Chris Moore, Parag Chordia, Leslie Bennet, Joshua Smith, and Corissa James for providing support and advice during my time at Georgia Tech. I would also like to recognize Sageev Oore and Doug Eck, who participated in helpful and stimulating discussions during my time at Google.

I'd like to express my profound gratitude to my family, who have been a never-ending source of support, and to my girlfriend, Laura Hack, whose hard work and dedication to her own studies served as a source of inspiration for me to minimize my coding window, put down my guitar, stop looking for every excuse to procrastinate, and finally write this thesis.

Finally, it is an honor to be the first to graduate with a Ph.D. in Music Technology from Georgia Tech. I thoroughly enjoyed my time as a graduate student in the School of Music and as a member of the Robotic Musicianship Group. Having the opportunity to conduct research in a fascinating field and perform in venues all over the world has provided me with rewarding and memorable experiences that I am sure I will cherish for the rest of my life. I look forward to seeing the Center for Music Technology and its degree programs continue to grow and flourish.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
SUMMARY

I INTRODUCTION
  1.1 Motivation
    1.1.1 Unidirectional Processing
    1.1.2 Embodied Processing
    1.1.3 Musical Path Planning
  1.2 Research Questions
    1.2.1 Autonomous Music Generation
    1.2.2 Representation
    1.2.3 Planning
    1.2.4 Learning
  1.3 Contributions
II EMBODIED MUSICAL COGNITION AND RELATED WORK
  2.1 Embodied Musical Cognition
    2.1.1 Theory and Evidence
    2.1.2 Associative Learning and Embodied Processing in Musicians
  2.2 Robotics
    2.2.1 Embodied Intelligence in Robotics
    2.2.2 Robotic Musicianship
    2.2.3 Robotic Musicianship – Design
    2.2.4 Robotic Musicianship – Physical Influence and Embodied Intelligence
    2.2.5 The Utility of Coherent Sound Accompanying Movements in Music Performance
III SIMULATED ACTION AS A PLANNING TASK
  3.1 Introduction
  3.2 Musical Path Planning
    3.2.1 Musical C-Space
    3.2.2 Planning
    3.2.3 Evaluation
    3.2.4 Discussion
  3.3 Embodied Generative Music
    3.3.1 Parametrized Representations of Higher-Level Musical Semantics
    3.3.2 Joint Optimization
    3.3.3 Musical Results
    3.3.4 Discussion
  3.4 Conclusion
IV LEARNING MUSICAL SEMANTICS
  4.1 Introduction
  4.2 Related Work
  4.3 Reconstruction Using Unit Selection
    4.3.1 Design of a Deep Musical Autoencoder
    4.3.2 Music Reconstruction through Selection
  4.4 Generation Using Unit Selection
    4.4.1 Semantic Relevance
    4.4.2 Concatenation Cost
    4.4.3 Ranking Units
    4.4.4 Evaluating the Model
    4.4.5 Discussion
  4.5 Subjective Evaluation
    4.5.1 Results
    4.5.2 Discussion
  4.6 An Embodied Unit Selection Process
    4.6.1 Refining the Selection Process
  4.7 Conclusion
V EMBODIED MUSICAL LEARNING
  5.1 Introduction
  5.2 Embodied Learning
    5.2.1 Convolutional Autoencoder
    5.2.2 Incorporating the Physical Constraints
  5.3 Evaluation
    5.3.1 Experiment
    5.3.2 Results
    5.3.3 Discussion
  5.4 Conclusion
VI CONCLUSION
  6.1 Contributions
    6.1.1 Path Planning
    6.1.2 Learning Musical Semantics and Unit Selection
    6.1.3 Learning the Physical Parameters
  6.2 Future Work
    6.2.1 Integrating Social Cues
    6.2.2 Alternative Models for Embodied Learning
    6.2.3 Perception
  6.3 Final Remarks
APPENDIX A – GENERATED MUSIC
APPENDIX B – SHIMON AND FRIENDS CONCERT SERIES
APPENDIX C – SURVEY
REFERENCES

LIST OF TABLES

1. Robots that play percussive instruments.
2. Robots that play string instruments.
3. Robots that play wind instruments.
4. Augmented robotic instruments.
5. Distribution of data in the test set.
6. Greedy vs. planned: experiment 1 results.
7. Greedy vs. planned: experiment 2 results.
8. Frequently used chord sequences with manually labeled transition weights.
9. Autoencoder ranking and collision results.
10. Unit ranking.
11. Subjective ranking.
12. Results comparing the average semantic relevance of the original input to an autoencoder output and a physically constrained network output. The average cosine similarity in the embedding space is reported.

LIST OF FIGURES

1. The typical signal flow of a robotic musician's decision processes mimics the unidirectional classical view of cognition and the processing pipelines of many interactive music systems [169]. Given a stimulus, the system maps the input to a representation that can be understood and analyzed (musical feature extraction and perception). The system then proceeds to make a decision about what note or notes to generate. Once this decision is made, the system either makes further decisions about how best to physically actuate the notes given a set of physical constraints or sends the notes directly to physical actuators, completely bypassing a simulated action stage.
2. In the proposed approach to robotic musicianship, simulations of actions are performed and used to determine a globally optimal musical output.
Optimal in music may mean playing all the notes at the proper time or achieving some higher-level musical semantic, such as using pitches from a certain scale over a specific chord. Optimal physicality may mean solutions that reduce spurious movements, are energy efficient, and avoid collisions. This approach seeks to find a solution that jointly optimizes the musical and physical parameters.
3. The original message is converted into a vector in the semantic space (a.k.a. the embedding). While this vector still represents the original message, it now does so in terms of meaningful concepts instead of low-level ASCII characters. Semantic embeddings allow us to describe the text or ask questions that have relevance to higher levels in the hierarchy of language. Instead of asking "do both of these sentences contain the letter c?", which we can do in a character space, we can ask, "are both of these sentences about food?".
4. In the above figure the original message is music. Music, like language, has hierarchical characteristics and meaningful concepts describing higher-level features of a note sequence. It is necessary to learn what the appropriate meaningful concepts are in the first place, and also how to map any given message into a vector relative to those concepts.
5. Shimon has four arms, which can be configured in many possible ways. A single state in the C-Space describes the possible physical configurations of the arms capable of playing a specific note (or notes). In this figure, two states in which Shimon can play the highlighted 'Eb' are shown.
6. General premise for finding a path through states in the configuration space.
7. State representation and emission score.
8. The transition between two states.
9. A beamed Viterbi search is used to traverse the state space and find good movement sequences as defined by the heuristics.
10. The chords (annotated above the measure) are tagged with their corresponding tonal center (annotated below the measure) using automatic sequence tagging based on a trigram model. Viterbi decoding is used to find the optimal sequence.