Acoustic Scene Classification by Ensembling Gradient Boosting Machine and Convolutional Neural Networks

Acoustic Scene Classification by Ensembling Gradient Boosting Machine and Convolutional Neural Networks

Acoustic Scene Classification by Ensembling Gradient Boosting Machine and Convolutional Neural Networks DCASE 2017 Eduardo Fonseca, Rong Gong, Dmitry Bogdanov, Olga Slizovskaia, Emilia Gomez and Xavier Serra Outline ● Introduction ● Proposed System & Results ● Summary 2 2 Introduction ● Acoustic Scene Classification (ASC) ⇀ 15 acoustic scenes system recording environment 3 Introduction ● Traditionally: feature engineering ⇀ feature extraction ⇀ classifier 4 Introduction ● Traditionally: feature engineering ● Nowadays: data-driven ⇀ feature extraction ⇀ learning representations ⇀ classifier 5 Introduction ● Traditionally: feature engineering ● Nowadays: data-driven ⇀ feature extraction ⇀ learning representations ⇀ classifier How about combining both approaches for ASC ? 6 Proposed System Freesound score splitting GBM Extractor aggregation late acoustic scene 10s segment fusion pre- score splitting CNN processing aggregation mel-spectrogram 7 Gradient Boosting Machine audio feature snippets vectors acoustic n Freesound n n score splitting GBM scene Extractor aggregation ● Freesound Extractor by ● http://essentia.upf.edu/documentation/freesound_extractor.html 8 Gradient Boosting Machine audio feature snippets vectors acoustic n Freesound n n score splitting GBM scene Extractor aggregation ● Gradient Boosting Machine: ⇀ effective in Kaggle ⇀ multiple weak learners (decision trees) 9 Gradient Boosting Machine audio feature snippets vectors acoustic n Freesound n n score splitting GBM scene Extractor aggregation ● Gradient Boosting Machine: ⇀ effective in Kaggle ⇀ multiple weak learners (decision trees) ⇀ added iteratively ● Implementation: ⇀ LigthGBM https://github.com/Microsoft/LightGBM 10 Gradient Boosting Machine audio feature snippets vectors acoustic n Freesound n n score splitting GBM scene Extractor aggregation ● Score aggregation: ⇀ averaging scores across snippets ⇀ argmax ● Results: ⇀ development set ⇀ 4-fold cross-validation provided ⇀ Accuracy: 80.8% 11 Convolutional Neural Network log-scaled T-F mel-spectrogram patches acoustic pre- n n score splitting CNN scene processing aggregation ● log-scaled mel-spectrogram ⇀ 128 bands 12 Convolutional Neural Network log-scaled T-F mel-spectrogram patches acoustic pre- n n score splitting CNN scene processing aggregation ● log-scaled mel-spectrogram ⇀ 128 bands ● Time splitting: ⇀ T-F patches 1.5s 13 Convolutional Neural Network log-scaled T-F mel-spectrogram patches pre- score acoustic splitting n CNN n processing aggregation scene 14 Convolutional Neural Network log-scaled T-F mel-spectrogram patches pre- score acoustic splitting n CNN n processing aggregation scene 15 Convolutional Neural Network log-scaled T-F mel-spectrogram patches acoustic pre- n n score splitting CNN scene processing aggregation ● Global time-domain pooling (Valenti, 2016) 16 Convolutional Neural Network ● Design of convolutional filters: ⇀ spectro-temporal patterns for ASC? ⇀ different rectangular filters (Pons, 2017) (Phan, 2016) 17 Convolutional Neural Network ● Design of convolutional filters: ⇀ spectro-temporal patterns for ASC? ⇀ different rectangular filters (Pons, 2017) (Phan, 2016) ⇀ multiple vertical filter shapes ( Q = 1, 2, 3, 4, 5 ) Q = 1 18 Convolutional Neural Network ● Design of convolutional filters: ⇀ spectro-temporal patterns for ASC? ⇀ different rectangular filters (Pons, 2017) (Phan, 2016) ⇀ multiple vertical filter shapes ( Q = 1, 2, 3, 4, 5 ) Q = 4 19 Recap ● Feature engineering: ⇀ Freesound Extractor ⇀ GBM ● Accuracy 80.8% 20 Recap ● Feature engineering: ● Data-driven ⇀ Freesound Extractor ⇀ log-scaled mel-spectrogram ⇀ GBM ⇀ CNN ● Accuracy 80.8% ● Accuracy: 79.9% 21 Recap ● Feature engineering: ● Data-driven: ⇀ Freesound Extractor ⇀ log-scaled mel-spectrogram ⇀ GBM ⇀ CNN ● Accuracy 80.8% ● Accuracy: 79.9% How different do they behave? 22 Models’ Comparison ● (Confusion matrix by GBM - Confusion matrix by CNN) 23 Models’ Comparison ● (Confusion matrix by GBM - Confusion matrix by CNN) GBM performs better CNN performs better 24 Models’ Comparison ● (Confusion matrix by GBM - Confusion matrix by CNN) GBM performs better CNN performs better 25 Models’ Comparison ● (Confusion matrix by GBM - Confusion matrix by CNN) GBM performs better CNN performs better 26 Models’ Comparison ● (Confusion matrix by GBM - Confusion matrix by CNN) GBM performs better CNN performs better 27 Models’ Comparison ● (Confusion matrix by GBM - Confusion matrix by CNN) GBM performs better CNN performs better 28 Models’ Comparison ● (Confusion matrix by GBM - Confusion matrix by CNN) 29 Late Fusion ● GBM: ⇀ prediction probabilities ● CNN: ⇀ softmax activation values 30 Late Fusion ● GBM: ⇀ prediction probabilities ● CNN: ⇀ softmax activation values ● Late fusion approach: ⇀ arithmetic mean + argmax ● System accuracy on development set: ⇀ 83.0 % 31 Results ● residential area vs park 32 Results ● residential area vs park ● tram vs train 33 Results ● residential area vs park ● tram vs train ● grocery store vs cafe/resto 34 Challenge Ranking ● accuracy drop ● outperforming baseline by absolute 6.3 % 35 Summary ● Ensemble of two models ● Simplicity of models: ⇀ GBM + out-of-box feature extractor ⇀ CNN using domain knowledge ⇀ providing complementary information ● Simple late fusion method ● Reasonable results although room for improvement ⇀ individual models ⇀ fusion approach 36 Thank you! 37 References ● H. Phan, L. Hertel, M. Maass, and A. Mertins, “Robust audio event recognition with 1-max pooling convolutional neural networks”, arXiv preprint arXiv:1604.06338, 2016. ● J. Pons, O. Slizovskaia, R. Gong, E. Gómez, and X. Serra, “Timbre Analysis of Music Audio Signals with Convolutional Neural Networks”, in 25th European Signal Processing Conference (EUSIPCO2017). ● M. Valenti, A. Diment, G. Parascandolo, S. Squartini, and T. Virtanen, “DCASE 2016 acoustic scene classification using convolutional neural networks,” in Proc. Workshop Detection Classif. Acoust. Scenes Events, 2016. 38.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    38 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us