Recommending Music with Waveform-Based Architectures at Scale
Total Page:16
File Type:pdf, Size:1020Kb
Recommending Music with Waveform-based Architectures at Scale ORIOL NIETO SEMINAR SERIES IN DATA SCIENCE UNIVERSITY OF SAN FRANCISCO FEB 1, 2019 Pandora Confidential OUTLINE Motivation: The Long Tail Background: Collaborative Filtering Music Recommendation Demo OUTLINE Motivation: The Long Tail Background: Collaborative Filtering Music Recommendation Demo The Long Tail Most Popular Tracks 0.01 % Pandora Confidential The Long Tail Most Popular Tracks 1 % Pandora Confidential The Long Tail Most Popular Tracks 0 spins last week 100 % Pandora Confidential OUTLINE Motivation: The Long Tail Background: Collaborative Filtering Music Recommendation Demo Collaborative Filtering RECOMMENDING POPULAR MUSIC ? ? [ ? ? ? ? Items (Tracks) ? ? [ ? ? Users Collaborative Filtering PROBLEM OVERVIEW [ [ k ? ? [ ? ? ? ? k Items (Tracks) ? ? ⇡ Items (Tracks) ? ? [ [ [ Users Users Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems. Computer, 42(8), 42–49. Collaborative Filtering PROBLEM FORMULATION [ [ k ? ? [ Given Item i and User u: pu ? ? ? qi r?iu k Rating: riu Items (Tracks) ? ? ⇡ Items (Tracks) ? ? [ k[ [ Users Item Latent Factor: qi R 2 Users k User Latent Factor: pu R 2 T Rating Approximation: rˆiu = qi pu T 2 2 2 argminq ,p (rui qi pu) +λ( qi + pu ) ⇤ ⇤ − u,i || || || || Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems. Computer, 42(8), 42–49.X2S Collaborative Filtering LATENT FACTORS Complex Harmony Calm Aggressive Simple Harmony Collaborative Filtering THE GOOD AND THE BAD Rich preference-driven similarity space Latent space is generally not interpretable Can only recommend items that Powerful at matching the right song have already been rated with the right listener (what about long tail content?) OUTLINE Motivation: The Long Tail Background: Collaborative Filtering Music Recommendation Demo Music Recommendation WITH COLLABORATIVE SONG FACTORS [ [ k ? ? [ ? ? ? ? k Items (Tracks) ? ? ⇡ Items (Tracks) ? ? [ [ [ Users Users Collaborative Filtering EXAMPLE ArUst Title Query Track Journey Don’t Stop Believing Collaborative Filtering EXAMPLE ArUst Title Query Track Journey Don’t Stop Believing Ranked 1 The Ouield Your Love Ranked 2 Eagles Hotel California Ranked 3 Survivor Eye Of The Tiger Ranked 4 Queen We Will Rock You The Music Genome Project LARGE-SCALE HUMAN ANNOTATED DATASET Attribute Examples Breathy Voice Nasal Voice Odd Meter Has Banjo Joyful Lyrics … Up to ~400 attributes per track Music Genome Project EXAMPLE ArUst Title Query Track Journey Don’t Stop Believing Ranked 1 Journey Stone In Love Ranked 2 Jefferson Starship Find Your Way Back Ranked 3 David Bowie Teenage Wildlife Ranked 4 Thriving Ivory On Your Side Estimating Latent Factors [ [ k ? ? [ ? ? ? ? k Items (Tracks) ? ? ⇡ Items (Tracks) ? ? [ [ [ Users Users Estimating Latent Factors FROM THE MUSIC GENOME PROJECT k y<latexit sha1_base64="(null)">(null)</latexit> <latexit sha1_base64="(null)">(null)</latexit> x http://blogs-images.forbes.com/kevinmurnane/files/2016/03/google-deepmind-artificial-intelligence-2-970x0-970x646.jpg f(x; ✓) y <latexit sha1_base64="(null)">(null)</latexit> ⇡ Estimating Latent Factors FROM THE MUSIC GENOME PROJECT N 2048 2048 k Dense Layers Estimating Latent Factors DATA AND OPTIMIZATION • Data set : X, Y { } • ~900k tracks (from “head”) M<latexit sha1_base64="(null)">(null)</latexit> • Loss : (✓) L<latexit sha1_base64="(null)">(null)</latexit> • Cosine Distance • Dropout 10% in Dense Layers 1 f(x; ✓)T y • Batch Normalization in all layers (✓)=1 L − M f(x; ✓) 2 y 2 <latexit sha1_base64="(null)">(null)</latexit> x ,y • Adam optimizer 2XX 2Y || || || || Estimating Latent Factors RESULTS Input Cos Distance # Epochs Time / Epoch MGP 0.30 15 ~4m Latent Factor Estimations WITH THE MUSIC GENOME PROJECT ArUst Title Query Track Journey Don’t Stop Believing Ranked 1 Journey Stone In Love Ranked 2 D Drive A Linle Bina Sunshine Ranked 3 John Parr Naughty, Naughty Ranked 4 Kiss Turn On The Night Machine Listening ESTIMATING THE MUSIC GENOME PROJECT Pons, J., et al. ISMIR 2018 Estimating Latent Factors FROM THE MUSIC GENOME PROJECT ESTIMATIONS k http://blogs-images.forbes.com/kevinmurnane/files/2016/03/google-deepmind-artificial-intelligence-2-970x0-970x646.jpg Estimating Latent Factors FROM THE MUSIC GENOME PROJECT ESTIMATIONS N 2048 2048 k Dense Layers Estimating Latent Factors RESULTS Input Cos Distance # Epochs Time / Epoch MGP 0.30 15 ~4m MGP Esomaoons 0.44 21 ~4m Latent Factor Estimations WITH MACHINE LISTENING ATTRIBUTES ArUst Title Query Track Journey Don’t Stop Believing Ranked 1 Dean Friedman Don’t You Ever Dare Ranked 2 James Taylor Stand And Fight Ranked 3 The Dingoes Starong Today Ranked 4 Chuck Girad The Days Are Young Estimating Latent Factors FROM AUDIO k http://blogs-images.forbes.com/kevinmurnane/files/2016/03/google-deepmind-artificial-intelligence-2-970x0-970x646.jpg Oord, A. Van Den, Dieleman, S., & Schrauwen, B. (2013). Deep Content-based Music Recommendation. Advances in Neural Information Processing Systems, 2643–2651. Estimating Latent Factors FROM RAW WAVEFORMS k y<latexit sha1_base64="(null)">(null)</latexit> <latexit sha1_base64="(null)">(null)</latexit> x http://blogs-images.forbes.com/kevinmurnane/files/2016/03/google-deepmind-artificial-intelligence-2-970x0-970x646.jpg f(x; ✓) y <latexit sha1_base64="(null)">(null)</latexit> ⇡ Estimating Latent Factors FROM RAW WAVEFORMS Conv1D Conv1D MP Conv1D MP Conv1D MP 64 x 3 64 x 3 3 64 x 3 3 128 x 3 3 <latexit sha1_base64="(null)">(null)</latexit> x k Conv1D MP Conv1D MP Conv1D Conv1D 1024 1024 1024 128 x 3 3 128 x 3 3 128 x 3 128 x 3 y<latexit sha1_base64="(null)">(null)</latexit> Conv1D Conv1D Conv1D Conv1D 256 x 3 512 x 7 512 x 7 512 x 7 Dense Layers Auto Pooling https://github.com/jordipons/music-audio-tagging-at-scale-models Lee, J., et al., 2018 McFee, B., et al., 2018 Estimating Latent Factors DATA AND OPTIMIZATION • Data set : X, Y • ~900k tracks (from “head”) { } • 16kHz - 16 bit waveforms M<latexit sha1_base64="(null)">(null)</latexit> • 3 patches of 15 seconds per track (~2.7M patches) • Loss : (✓) L<latexit sha1_base64="(null)">(null)</latexit> • Cosine Distance • Dropout 10% in Dense Layers 1 f(x; ✓)T y • Batch Normalization in all layers (✓)=1 L − M f(x; ✓) 2 y 2 <latexit sha1_base64="(null)">(null)</latexit> x ,y • Adam optimizer 2XX 2Y || || || || Estimating Latent Factors RESULTS Input Cos Distance # Epochs Time / Epoch MGP 0.30 15 ~4m MGP Esomaoons 0.44 21 ~4m Spectrogram (35s patches) 0.37 22 ~2h Estimating Latent Factors RESULTS Input Cos Distance # Epochs Time / Epoch MGP 0.30 15 ~4m MGP Esomaoons 0.44 21 ~4m Spectrogram (35s patches) 0.37 22 ~2h Waveform (15s patches) 0.34 9 ~5h Latent Factor Estimations WITH WAVEFORMS ArUst Title Query Track Journey Don’t Stop Believing Ranked 1 The Ganes Band The Final Countdown Ranked 2 Survivor Backstreet Love Affair Ranked 3 Toto Angel Don’t Cry Ranked 4 Orion The Hunter Dark And Stormy Estimating Latent Factors COMBINING MODELS k http://blogs-images.forbes.com/kevinmurnane/files/2016/03/google-deepmind-artificial-intelligence-2-970x0-970x646.jpg Estimating Latent Factors LATE-FUSION ARCHITECTURE 2 Co Co M Co M Co M 1 1 1 k Co M Co M Co Co Co Co Co Co Pons, J., McFee, k N 4 4 k http://blogs-images.forbes.com/kevinmurnane/files/2016/03/google-deepmind- Oramas, S., et al. 2018 Estimating Latent Factors LATE-FUSION ARCHITECTURE 2048 2048 k 2048 + 1024 Dense Layers Oramas, S., et al. 2018 Estimating Latent Factors RESULTS Input Cos Distance # Epochs Time / Epoch MGP 0.30 15 ~4m MGP Esomaoons 0.44 21 ~4m Spectrogram (35s patches) 0.37 22 ~2h Waveform (15s patches) 0.34 9 ~5h Waveform + MGP Esmaons 0.32 20 ~4m Latent Factor Estimations WITH AUDIO + MACHINE LISTENING ATTRIBUTES ArUst Title Query Track Journey Don’t Stop Believing Ranked 1 Patrick Simmons Knocking at your Door Ranked 2 Night Ranger When You Close Your Eyes Ranked 3 Prism Young And Restless Ranked 4 The Front The Truth Hurts CONCLUSIONS The severe length of the Long Tail in music catalogs is real Collaborative filtering is powerful at recommending music from the “head” Waveform-based architectures are effective at recommending undiscovered/new music Pandora Confidential DEMO TIME Pandora Confidential References McFee, B., Salamon, J., Bello, J. P., Adaptive pooling operators for weakly labeled sound event detection, IEEE Transactions on Audio, Speech and Language Processing, 2018. Koren, Y., Bell, R., & Volinsky, C., Matrix Factorization Techniques for Recommender Systems. Computer, 42(8), 42–49, 2009. Lee, J., Park, J, Kim, K. L., & Nam, J. SampleCNN: End-to-end Deep Convolutional neural networks using very small filters for music classification.Applied Sciences, 8(1):150, 2018. Oord, A. Van Den, Dieleman, S., & Schrauwen, B., Deep Content-based Music Recommendation. Advances in Neural Information Processing Systems, 2643–2651, 2013. Oramas, S., Barbieri, F., Nieto, O., Serra, X., Multimodal Deep Learning for Music Genre Classification. Transactions of the International Society for Music Information Retrieval (TISMIR), 2018. Pons, J., Nieto, O., Prockup, M., Schmidt, E., Ehmann, A., Serra, X., End-to-End Learning for Music Audio Tagging at Scale. Proc. of the 19th International Society for Music Information Retrieval Conference (ISMIR). Paris, France, 2018. Pandora Confidential.