Exploiting Regularities in Natural Acoustical Scenes for Monaural Audio Signal Estimation, Decomposition, Restoration and Modification
Exploiting regularities in natural acoustical scenes for monaural audio signal estimation, decomposition, restoration and modification

音環境に内在する規則性に基づくモノラル音響信号の推定・分解・復元・加工に関する研究

Exploitation de régularités dans les scènes acoustiques naturelles pour l'estimation, la décomposition, la restauration et la modification de signaux audio monocanal

Jonathan Le Roux
ルルー ジョナトン

THE UNIVERSITY OF TOKYO
GRADUATE SCHOOL OF INFORMATION SCIENCE AND TECHNOLOGY
DEPARTMENT OF INFORMATION PHYSICS AND COMPUTING
東京大学 大学院情報理工学系研究科 システム情報学専攻

Ph.D. Thesis
博士論文

submitted by Jonathan LE ROUX in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Information Science and Technology

Exploiting regularities in natural acoustical scenes for monaural audio signal estimation, decomposition, restoration and modification
(音環境に内在する規則性に基づくモノラル音響信号の推定・分解・復元・加工に関する研究)

defended on January 29, 2009 in front of the committee composed of

Shigeki SAGAYAMA, University of Tokyo, Thesis Supervisor
Alain de CHEVEIGNÉ, École Normale Supérieure, Co-supervisor
Shigeru ANDÔ, University of Tokyo, Examiner
Keikichi HIROSE, University of Tokyo, Examiner
Nobutaka ONO, University of Tokyo, Examiner
Susumu TACHI, University of Tokyo, Examiner

UNIVERSITÉ PARIS VI - PIERRE ET MARIE CURIE
ÉCOLE DOCTORALE EDITE

DOCTORAL THESIS
specialty: Computer Science, Telecommunications and Electronics

presented by Jonathan LE ROUX to obtain the degree of Doctor of the UNIVERSITÉ PARIS VI - PIERRE ET MARIE CURIE

Exploitation de régularités dans les scènes acoustiques naturelles pour l'estimation, la décomposition, la restauration et la modification de signaux audio monocanal
(Exploiting regularities in natural acoustical scenes for monaural audio signal estimation, decomposition, restoration and modification)

defended on March 12, 2009 before the committee composed of

Alain de CHEVEIGNÉ, École Normale Supérieure, Thesis Supervisor
Shigeki SAGAYAMA, University of Tokyo, Co-supervisor
Dan P. W. ELLIS, Columbia University, Reviewer (rapporteur)
Mark D. PLUMBLEY, Queen Mary University, Reviewer (rapporteur)
Laurent DAUDET, Université Paris VI, Examiner
Gaël RICHARD, TELECOM ParisTech, Examiner
Emmanuel VINCENT, INRIA - IRISA, Examiner
Jean-Luc ZARADER, Université Paris VI, Examiner

Sagayama/Ono Laboratory
Graduate School of Information Science and Technology
The University of Tokyo
7-3-1, Hongo, Bunkyo-ku
113-8656 Tokyo (Japan)

and

Laboratoire de Psychologie de la Perception (UMR 8158)
Équipe Audition
École Normale Supérieure
29, rue d'Ulm
75005 Paris (France)

Acknowledgements

The long and winding road that leads to the doors of the Ph.D. sometimes disappeared, and I am extremely grateful both to the many people who showed me the way and to the no less numerous ones who often helped me temporarily forget about it altogether.

I would first like to thank my thesis advisors, Professor Shigeki Sagayama at the University of Tokyo and Dr. Alain de Cheveigné at the École Normale Supérieure. Professor Sagayama has taught me many things, from technical topics to the way to conduct and present one's research, and continuously encouraged and helped me to participate in national and international conferences, introducing me to many researchers and enabling me to widen my research horizon. Most importantly, he has created and constantly fueled with ideas and energy the wonderful research environment that constitutes the Sagayama/Ono Laboratory, in which I spent most of the past four years. Alain de Cheveigné has been extremely present and dedicated from my very first steps in audio engineering five years ago, and helped and advised me on a daily basis on each and every aspect of the Ph.D. process, which is all the more remarkable knowing that we spent most of that time about 10,000 km away and with a 7 to 9 hour time difference from each other. His careful supervision was definitely the key element which made me keep sight of what had to be done and when it had to be done.
I would also like to sincerely thank Professor Nobutaka Ono, whom I truly consider as my (unofficial) third thesis advisor. Professor Ono has both an extremely deep understanding of all the technical details involved in my work and a very clear view of the broader picture. The countless discussions we had were the roots of many of the ideas presented in this thesis. Watching him explain utterly abstract concepts very simply by drawing figures in the air with his hands certainly counts among the most scientifically enjoyable and exciting moments of my Ph.D.

I am very grateful to the members of my thesis committees (yes, with an "s", as there were two) for agreeing to take some of their very precious time to review my work and attend the defenses (yes again, as there were three): Professor Shigeru Andô, Professor Keikichi Hirose and Professor Susumu Tachi, of the University of Tokyo, for the defenses in Tokyo, Professor Dan P. W. Ellis, of Columbia University, Professor Mark D. Plumbley, of Queen Mary University of London, Dr. Laurent Daudet, of Université Paris 6, Professor Gaël Richard, of Telecom ParisTech, Dr. Emmanuel Vincent, of INRIA-IRISA, and Professor Jean-Luc Zarader, of Université Paris 6, for the defense in Paris. It is an immense honor for me to have such a gathering of renowned scientists as my committee(s). I am especially thankful to Dan and Mark for accepting the most important and time-demanding roles of thesis Readers, and I am happy to be able to present my work under their auspices. I would also like to especially thank Emmanuel for the many comments he gave me on early versions of some key parts of the thesis during his stay in our lab at the University of Tokyo. His presence in the committee has a particular significance, as Emmanuel and I were actually classmates ten years ago in our first year in the math department of the École Normale Supérieure. Obviously he was much faster than me in both starting and finishing his Ph.D.
I would not be anywhere close to finishing this work or even getting it started if it were not for Dr. Hirokazu Kameoka, my former colleague at the Sagayama/Ono Lab, two years ahead of me in the doctoral program. I have been blessed with thesis advisors all along, and he can definitely be counted as one of them. He is an incredibly brilliant young researcher coming up with groundbreaking ideas on a daily basis, and I am convinced that he will become one of the most prominent faces worldwide in the field of audio signal processing in the near future. My work on HTC in Chapters 3 and 4 is a direct collaboration with him, and we spent days and nights discussing most of the other problems investigated in this thesis. He is now one of my best friends in Japan (and the other half of our Manzai duo), and I am looking forward to being his colleague once again at NTT CS Labs during my post-doc.

I would also like to sincerely thank Professor Lucas C. Parra, of City College of New York, who completes the gallery of researchers with whom I have had the chance to collaborate directly during the course of this Ph.D. Working literally day and night for a week with him and Alain de Cheveigné across three time zones (New York, Paris, Tokyo) during the preparation of our NIPS paper was one of the most exciting and fun parts of my Ph.D. I have learnt a lot from his way of approaching and solving problems, and really enjoyed the many memorable Skype sessions we had.

My first encounter with audio signal processing and machine learning took place during an internship at NTT CS Labs near Kyoto in Winter 2004, under the supervision of Dr. Erik McDermott. I would like to thank him for introducing me to the field and teaching me most of the basics, and for all the great discussions we have had since then (many of them in the nice environment of an izakaya).
I now consider Erik as my "older brother" in Japan, and knowing that I would be meeting him is a significant part of the motivation to attend a conference. I would also like to thank the members and former members of NTT's Signal Processing Research Group and Media Recognition Research Group with whom I have had the pleasure to interact during the internship and on many occasions since then: Dr. Shinji Watanabe, Dr. Tomohiro Nakatani, Dr. Atsushi Nakamura, Keisuke Kinoshita, Takanobu Ôba, Takuya Yoshioka, Dr. Kunio Kashino, Dr. Shôko Araki, Dr. Hiroshi Sawada, Dr. Shôji Makino, Dr. Yasuhiro Minami, Dr. Takaaki Hori, Dr. Kentarô Ishizuka, Dr. Masakiyo Fujimoto, Dr. Michael Schuster (now with Google) and Dr. Parham Zolfaghari (now with ClientKnowledge).

I would like to express my gratitude to Dr. Frank K. Soong for inviting me to Microsoft Research Asia in Beijing for an internship in 2006, and for managing to find some time in his very busy schedule to supervise my work there. I would also like to thank my fellow interns in the Speech Group for all the pleasant discussions we had and the nice restaurants we went to. I am looking forward to meeting them again at future conferences (though my Chinese skills definitely need a recovery period).

I am also indebted to Dr. Philippe de Reffye, of INRIA, who supervised my internship at the Sino-French joint laboratory LIAMA in Beijing in 2001, for revealing to me, among many other things, the real good reason to do a Ph.D.: enjoy student life for a few more years.

I am very grateful to Professor Hiroshi Matano, of the University of Tokyo, for taking care of me during my first year in Japan and for introducing me to Japanese culture, art and language.