Machine Learning for Speech Recognition by Alice Coucke, Head of Machine Learning Research

Machine Learning for Speech Recognition by Alice Coucke, Head of Machine Learning Research @alicecoucke [email protected] Outline: 1. Recent advances in machine learning 2. From physics to machine learning IA en général Speech & NLP Startup 3. Working at Snips (now Sonos) Snips A lot of applications in the real world, quite a difference from theoretical work Un des rares domaines scientifiques où la théorie et la pratique sont très emmêlées Recent Advances in Applied Machine Learning A lot of applications in the real world, quite a difference from theoretical work Reinforcement learning Learning goal-oriented behavior within simulated environments Go Starcraft II Dota 2 AlphaGo (Deepmind, 2016) AlphaStar (Deepmind) OpenAI Five (OpenAI) Play-driven learning for robots Sim-to-real dexterity learning (Google Brain) Project BLUE (UC Berkeley) Machine Learning for Life Sciences Deep learning applied to biology and medicine Protein folding & structure Cardiac arrhythmia prediction Eye disease diagnosis prediction (NHS, UCL, Deepmind) from ECGs AlphaFold (Deepmind) (Stanford) Reconstruct speech from neural activity Limb control restoration (UCSF) (Batelle, Ohio State Univ) Computer vision High-level understanding of digital images or videos GANs for image generation (Heriot Watt Univ, DeepMind) « Common sense » understanding of actions in videos (TwentyBn, DeepMind, MIT, IBM…) GANs for artificial video dubbing GAN for full body synthesis (Synthesia) (DataGrid) From physics to machine learning and back A surge of interest from the physics community NeurIPS 2019: workshop on « machine learning and the physical sciences » Speech and language Understand and analyze human speech { intent: FindWeather, entities: { datetime: 11/28/2019, location: Paris } } Speech transcription Spoken language Text & speech generation Human Parity (Microsoft) understanding GPT-2 (Open AI) (Super)GLUE benchmarks (Google, Bert (Google) Facebook, IBM, Stanford …) XLNet (CMU) … Speech and language Understand and analyze human speech { intent: FindWeather, entities: { datetime: 11/28/2019, location: Paris } } Speech transcription Spoken language Text & speech generation Human Parity (Microsoft) understanding GPT-2 (Open AI) (Super)GLUE benchmarks (Google, Bert (Google) Facebook, IBM, Stanford …) XLNet (CMU) … � � Neural machine translation Voice activity detection Speaker identification Sentiment analysis Detect emotions in text and Unsupervised MT (Facebook) Detect speech from audio Recognize unique speakers speech Fairness, ethics, and explainability We, as scientists, have a say in the future of AI From physics to machine learning Décrire mon parcours My background Parler de la thèse From physics to machine learning Parler du post doc Pourquoi les physiciens / chercheurs sont une bonne formation - appréhender des théories complexes et les appliquer à des domaines variés - Programmation skills - Formation par la recherche : communiquer des résultats, lire un état de l’art, écrire des articles - Experimental method Décrire mon parcours My background Parler de la thèse From physics to machine learning Parler du post doc Pourquoi les physiciens / chercheurs 2012: M2 ICFP Theoretical physics sont une bonne formation - appréhender des théories complexes et les appliquer à des 2013-2016: PhD in statistical physics @LPTENS domaines variés - Programmation skills - Formation par la recherche : communiquer des résultats, lire un état de l’art, écrire des articles - Experimental method Décrire mon parcours My background Parler de la thèse From physics to machine learning Parler du post doc Pourquoi les physiciens / chercheurs 2012: M2 ICFP Theoretical physics sont une bonne formation - appréhender des théories complexes et les appliquer à des 2013-2016: PhD in statistical physics @LPTENS domaines variés - Programmation skills - Formation par la recherche : communiquer des résultats, lire un état de l’art, écrire des articles - Experimental method Décrire mon parcours My background Parler de la thèse From physics to machine learning Parler du post doc Pourquoi les physiciens / chercheurs 2012: M2 ICFP Theoretical physics sont une bonne formation - appréhender des théories complexes et les appliquer à des 2013-2016: PhD in statistical physics @LPTENS domaines variés - Programmation skills - Formation par la recherche : communiquer des résultats, lire un état de l’art, écrire des articles - Experimental method Feb 2017: senior ML scientist @ Snips 2019: director of ML research @ Snips Today: head of ML research @ Sonos, Inc. A few takeaways (please go ask other people too) A few takeaways (please go ask other people too) • PhD? Postdoc? • Working at a startup company • Physicists and machine learning A few takeaways (please go ask other people too) • PhD? Postdoc? • Working at a startup company • Physicists and machine learning Physicists @ Snips A. C. Raffaele Tavarone Francesco Caltagirone Alaa Saade Stéphane d’Ascoli Sr ML Scientist Sr ML Scientist Sr ML Scientist ML research intern Acoustics team Tech Lead Now: DeepMind Now: PhD ENS & FAIR Language team A few takeaways (please go ask other people too) • PhD? Postdoc? • Working at a startup company • Physicists and machine learning Physicists @ Snips A. C. Raffaele Tavarone Francesco Caltagirone Alaa Saade Stéphane d’Ascoli Sr ML Scientist Sr ML Scientist Sr ML Scientist ML research intern Acoustics team Tech Lead Now: DeepMind Now: PhD ENS & FAIR Language team Machine Learning at Snips (now Sonos) ML à Snips comme on peut faire du ML et de la science en dehors de l’académique Speech & language recognition - Privacy - pas cloud données en local —> contraintes en + techniques - resources computaitonelles - Peu de données pour entraîner des données - Chaque bloc est optimisé adapté Spoken Language Understanding From speech to meaning for voice assistants Automatic Speech Recognition Engine Language modeling Natural Intent: t ɜ r n ɑ n ð ə Acoustic Language Turn on the Language SwitchLightOn l a ɪ t s ɪ n ð ə model model lights in the Understanding ˈl ɪ v ɪ ŋ r u m living room Engine Slots: room: living room Wake Word ACR ICF Dialogue Custom Wake Automatic Command Intent Classification Multiturn Dialogue Word Recognition and Filling and Response Audio Frontend Action Code Spoken Language Understanding From speech to meaning for voice assistants Automatic Speech Recognition Engine Language modeling Natural Intent: t ɜ r n ɑ n ð ə Acoustic Language Turn on the Language SwitchLightOn l a ɪ t s ɪ n ð ə model model lights in the Understanding ˈl ɪ v ɪ ŋ r u m living room Engine Slots: room: living room Spoken Language Understanding From speech to meaning for voice assistants Automatic Speech Recognition Engine Language modeling Natural Intent: t ɜ r n ɑ n ð ə Acoustic Language Turn on the Language SwitchLightOn l a ɪ t s ɪ n ð ə model model lights in the Understanding ˈl ɪ v ɪ ŋ r u m living room Engine Slots: room: living room Deep neural network Proba over phones /a/ /b/ /c/ /d/ /e/ time Spoken Language Understanding From speech to meaning for voice assistants Automatic Speech Recognition Engine Language modeling Natural Intent: t ɜ r n ɑ n ð ə Acoustic Language Turn on the Language SwitchLightOn l a ɪ t s ɪ n ð ə Language model model lights in the Understanding ˈl ɪ v ɪ ŋ r u m model living room Engine Slots: room: living room Proba over phones Decoding graph Turn on the lights in the living room 0.85 Turn off the lights in the living room 0.75 Set the lights in the living room 0.60 /a/ /b/ /c/ /d/ /e/ time Spoken Language Understanding From speech to meaning for voice assistants Automatic Speech Recognition Engine Language modeling Natural Intent: t ɜ r n ɑ n ð ə Acoustic Language Turn on the Language SwitchLightOn l a ɪ t s ɪ n ð ə model model lights in the Understanding ˈl ɪ v ɪ ŋ r u m living room Engine Slots: room: living room Logistic regression Conditional Random Field Turn on the lights Intent: Slots: in the living room SwitchLightOn room: living room Spoken Language Understanding Offline & on device Automatic Speech Recognition Engine Language modeling Natural Intent: t ɜ r n ɑ n ð ə Acoustic Language Turn on the Language SwitchLightOn l a ɪ t s ɪ n ð ə model model lights in the Understanding ˈl ɪ v ɪ ŋ r u m living room Engine Slots: room: living room ❌ ❌ ✅ Wake Word ACR ICF Dialogue Custom Wake Automatic Command Intent Classification Multiturn Dialogue Word Recognition and Filling and Response Audio Frontend Action Code Our approach Our own voice in a vast ecosystem Privacy by design Resource constrained ML: small data & hardware ✓ A new popular trend in the ML community (low resource ML, transfer learning, miniaturization, etc) ✓ Numerous conferences and workshops on the topic ✓ Towards a safer, greener and more private conversational AI Research activity Publishing research in industry Snips voice platform: Efficient keyword spotting an embedded spoken language using dilated convolutions Federated learning for Spoken language understanding system for and gating keyword spotting understanding on the edge private-by-design voice interfaces arXiv arXiv arXiv arXiv There is a lot to do in research in the industry, with us joining Sonos we expect to grow the team and doing even more exploratory stuff. ICML 2018 ICASSP 2019 ICASSP 2019 NeurIPS 2019 workshop PiMLAI Main track Main track Workshop EMC2 Cited 35 times to date ✓ Publish open & reproducible benchmarks: ‣ ~200 access granted to researchers to our open speech datasets ‣ Snips dataset for NLU is the new academic standard Thank you for your attention Questions? @alicecoucke [email protected].

Machine Learning for Speech Recognition by Alice Coucke, Head of Machine Learning Research

Aviation Speech Recognition System Using Artificial Intelligence

Artificial Intelligence: with Great Power Comes Great Responsibility

A Comparison of Online Automatic Speech Recognition Systems and the Nonverbal Responses to Unintelligible Speech

Learned in Speech Recognition: Contextual Acoustic Word Embeddings

Synthesis and Recognition of Speech Creating and Listening to Speech

Natural Language Processing in Speech Understanding Systems

Arxiv:2007.00183V2 [Eess.AS] 24 Nov 2020

Improving Named Entity Recognition Using Deep Learning with Human in the Loop

AI for Health Examples from Industry, Government, and Academia Table of Contents

Pre-Training of Deep Bidirectional Protein Sequence Representations with Structural Information

Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences

Towards Incremental Agent Enhancement for Evolving Games