Automatic General Audio Signal Classification
TECHNISCHE UNIVERSITÄT MÜNCHEN
Lehrstuhl für Mensch-Maschine-Kommunikation

Automatic General Audio Signal Classification

Kun Qian

Complete reprint of the dissertation approved by the Faculty of Electrical Engineering and Information Technology of the Technische Universität München for the award of the academic degree of Doktor-Ingenieur (Dr.-Ing.).

Chair: Prof. Dr.-Ing. Wolfgang Utschick
Examiners of the dissertation:
1. Prof. Dr.-Ing. habil. Björn W. Schuller
2. Prof. Dr.-Ing. Werner Hemmert

The dissertation was submitted to the Technische Universität München on 04.06.2018 and accepted by the Faculty of Electrical Engineering and Information Technology on 08.11.2018.

Acknowledgement

First of all, I am sincerely grateful to Prof. Björn Schuller, my supervisor, who invited me to work with his group four years ago. Prof. Schuller always encourages me to do research on topics I am genuinely interested in. Without his valuable suggestions and excellent supervision, I could never have achieved the work described in this thesis. Working in Prof. Schuller's group in Munich, Passau, and Augsburg has been a very happy and memorable experience for me.

Secondly, I would like to thank Prof. Gerhard Rigoll for allowing me to work at the Chair of Human-Machine Communication and for supporting both my research and my life in Munich. I am also grateful to Prof. TBA for his careful review and examination of this thesis, and to Prof. TBA for chairing the examination board.

Thirdly, I am grateful to Prof. Satoshi Matsuoka from Tokyo Institute of Technology, Japan, and Assoc. Prof. Florian Metze from Carnegie Mellon University, USA, for their kind and helpful collaborative contributions to parts of this thesis. Many thanks are also given to Prof. Aimin Wang, Dr. Zixing Zhang, Dr. Jun Deng, Dr. Florian Eyben, Dr. Hesam Sagha, Dr. Anton Batliner, Dr. Martin Wöllmer, Dr. Xinzhou Xu, Mr. Vedhas Pandit, Mr. Maximilian Schmitt, Mr. Erik Marchi, Mr. Zijiang Yang, Ms.
Jing Han, Ms. Zhao Ren, Ms. Yue Zhang, Mr. Peter Brand, Ms. Martina Römpp, Ms. Andrea Tonk, and Ms. Nadine Witek, for their technical and non-technical discussions during my very happy time in Germany. Specifically, I would like to thank Mr. Jian Guo, Mr. Tao Chen, Mr. Xiang Zha, Mr. Christoph Janott, Mr. Yang Zhang, Ms. Li Zhang, Ms. Hui Fang, and Mr. Zengjie Zhang, for their genuine help in both research and life during my doctoral studies. I also wish to thank Ms. Alice Baird for her extensive help in proofreading my previous publications and this thesis.

Finally, and most of all, this thesis is dedicated to my parents, Mr. Jinglei Qian and Ms. Jingping Zhang, who always give me strength and support in my life.

Munich, May 2018
Kun Qian

Abstract

Automatic General Audio Signal Classification (AGASC), defined as the machine-listening-based recognition of everyday audio signals rather than speech or music, is a young field that has developed in recent years. Benefiting from the state of the art in speech and music recognition, some standard methodologies (e. g., feature sets, classifiers, learning strategies) have been successfully applied to AGASC. Nevertheless, more specific and more robust models are still needed to advance this area.

This thesis addresses three typical tasks in AGASC, i. e., snore sound classification, bird sound classification, and acoustic scene classification, which represent possible applications in healthcare, ecological monitoring, and public/home security surveillance, respectively.

The aim of this thesis is to advance the state of the art in AGASC as follows: First, a comprehensive investigation of standard methodologies is given. To make the studies reproducible, the three databases used in this thesis are all publicly accessible. Second, some specifically designed features, e.
g., wavelet features, are presented and demonstrated to be very efficient for the recognition of snore sounds and acoustic scenes. Furthermore, to reduce the human annotation effort for bird sound data, active learning is used. More precisely, a kernel-based extreme learning machine is found to be superior to conventional support vector machines in the task of bird sound classification. Finally, a late fusion of multiple systems for acoustic scene classification in noisy environments is evaluated. All the experiments demonstrate the effectiveness of the methodologies proposed in this thesis.

Zusammenfassung

Automatic classification of audio signals (AKAS), which concentrates on the machine recognition of audio signals from everyday life rather than speech or music signals, is a very young research field that has emerged only in recent years. Some standard methods from the state of the art in speech and music recognition (e. g., acoustic feature extraction, classifiers, learning strategies) have been successfully applied in the field of AKAS. However, more specific and more robust models are still needed to advance this area. This thesis investigates three typical AKAS tasks, namely the classification of snore sounds, the classification of bird song, and the classification of acoustic scenes, which represent possible applications in healthcare, ecology, and public security, respectively.

The aim of this thesis is to advance the state of the art in AKAS as follows: First, a comprehensive investigation of established machine learning methods is carried out. To make the studies reproducible, the three databases used in this thesis are all publicly accessible. Next, some specifically designed audio features, e. g., wavelet features, are presented, and it is shown that they are very efficient for the recognition of snore sounds and acoustic scenes. Furthermore, active learning is used to reduce the number of human annotations required for the bird song data. It has been found that a kernel-based extreme learning machine is more efficient than a conventional support vector machine for the classification of bird song. Finally, a late fusion of multiple systems for acoustic scene classification in noisy environments is evaluated. The experiments described demonstrate the effectiveness of the methods proposed in this thesis.

Relevant Publications

Refereed Journal Papers

2018 Kun Qian, Christoph Janott, Zixing Zhang, Jun Deng, Clemens Heiser, Winfried Hohenhorst, Michael Herzog, Werner Hemmert and Björn Schuller, "Teaching Machines on Snoring: A Benchmark on Computer Audition for Snore Sound Excitation Localisation", Archives of Acoustics, vol. 43, no. 3, pp. 465-475, 2018.

2018 Zhao Ren, Kun Qian, Zixing Zhang, Vedhas Pandit, Alice Baird and Björn Schuller, "Deep Scalogram Representations for Acoustic Scene Classification", IEEE/CAA Journal of Automatica Sinica, vol. 5, no. 3, pp. 662-669, 2018.

2018 Christoph Janott, Maximilian Schmitt, Yue Zhang, Kun Qian, Vedhas Pandit, Zixing Zhang, Clemens Heiser, Winfried Hohenhorst, Michael Herzog, Werner Hemmert and Björn Schuller, "Snoring Classified: The Munich-Passau Snore Sound Corpus", Computers in Biology and Medicine, vol. 94, pp. 106-118, 2018.

2017 Kun Qian, Zixing Zhang, Alice Baird and Björn Schuller, "Active Learning for Bird Sound Classification via a Kernel-based Extreme Learning Machine", The Journal of the Acoustical Society of America, vol. 142, no. 4, pp. 1796-1804, 2017.
2017 Kun Qian, Christoph Janott, Vedhas Pandit, Zixing Zhang, Clemens Heiser, Winfried Hohenhorst, Michael Herzog, Werner Hemmert and Björn Schuller, "Classification of the Excitation Location of Snore Sounds in the Upper Airway by Acoustic Multi-Feature Analysis", IEEE Transactions on Biomedical Engineering, vol. 64, no. 8, pp. 1731-1741, 2017.

2017 Kun Qian, Zixing Zhang, Alice Baird and Björn Schuller, "Active Learning for Bird Sounds Classification", Acta Acustica united with Acustica, vol. 103, no. 3, pp. 361-364, 2017.

2017 Jian Guo, Kun Qian, Gongxuan Zhang, Huijie Xu and Björn Schuller, "Accelerating Biomedical Signal Processing Using GPU: A Case Study of Snore Sounds Feature Extraction", Interdisciplinary Sciences: Computational Life Sciences, vol. 9, no. 4, pp. 550-555, 2017.

Refereed Conference Papers

2018 Zhao Ren, Nicholas Cummins, Vedhas Pandit, Jing Han, Kun Qian and Björn Schuller, "Learning Image-based Representations for Heart Sound Classification", in Proceedings of the 8th ACM Digital Health (DH), Lyon, France, ACM, pp. 143-147, 2018.

2017 Kun Qian, Zhao Ren, Vedhas Pandit, Zijiang Yang, Zixing Zhang and Björn Schuller, "Wavelets Revisited for the Classification of Acoustic Scenes", in Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Munich, Germany, Tampere University of Technology, pp. 108-112, 2017.

2017 Kun Qian, Christoph Janott, Jun Deng, Clemens Heiser, Winfried Hohenhorst, Michael Herzog, Nicholas Cummins and Björn Schuller, "Snore Sound Recognition: On Wavelets and Classifiers from Deep Nets to Kernels", in Proceedings of the 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Jeju Island, Korea, IEEE, pp. 3737-3740, 2017.

2017 Tao Chen, Kun Qian, Antti Mutanen, Björn Schuller, Pertti Järventausta and Wencong Su, "Classification of Electricity Customer Groups Towards Individualized Price Scheme Design", in Proceedings