Dialog systems

Professor Joakim Gustafson

CV for Joakim Gustafson

1987-1992 Electrical Engineering program at KTH
1992-1993 Linguistics at Stockholm University

1993-2000 PhD studies at KTH

2000-2007 Senior researcher at Telia Research

2007- Future faculty position at KTH

2013 – Professor

2015 - Head of department

Background

Today's topic: dialogue systems

What is Dialogue?

•A sequence of isolated utterances, produced by at least two speakers, that together form a discourse.
•Dialogue = a connected sequence of information units (with a goal):
- provides coherence over the utterances,
- provides a context for interpreting utterances,
- multiple participants exchange information.

General characteristics of dialogues

•At least two participants

•No external control over the other participant's initiative

•A structure develops with the dialogue

•Some conventions and protocols exist

•Dialogues are robust - we seek to understand the other participant

•Various features are problematic.

Different types of dialogue

•Conversations
– Informal (spoken) interaction between two individuals
– Main goal: development and maintenance of social relationships
•Task-oriented Dialogues
– Possibly formal multimodal interaction
– Main goal: perform a given task
•Natural Dialogues:
– Occur between humans
•Artificial Dialogues:
– At least one of the participants is a computer

Dialogue research at KTH

Our research dialogue systems

Waxholm: the first Swedish spoken dialogue system (1995)

August: a public chatbot agent (1998)

• Swedish spoken dialogue system for public use at Kulturhuset
• Animated agent named after August Strindberg
• Speech technology components developed at KTH
• Designed to be easily expanded and reconfigured
• Used to collect spontaneous speech data

Video of August

NICE - a speech-enabled game (2005)

• NICE was an EU project 2002-2005
• Five partners – Nislab (Den), Telia (Swe), Liquid Media (Swe), Philips (Ger), Limsi (Fra)
• Two systems were developed
– a Swedish fairytale game
– an English HC Andersen story-teller

Video demo

Interactive Urban Robot (2012)

Research issues - How can a robot:

- Build a model of a city using sensory input?

- Obtain information by talking with humans?

- Talk with several humans at the same time?

FurHat – the social robot

Commercial dialogue apps

Speech used to be a future promise

Bill Gates quotes

• 1997: “The PC five years from now — you won’t recognize it, because speech will have come into the interface.”

• 1998: “We’re looking at speech is an add-on today, but we’re looking in this two to three year timeframe, whether we could put it in.”

• 1999: “[Speech] will be part of the interface. [..], probably we are four or five years away from that being the typical interface.”

• 2000: “..great speech recognition, [..] all of those things undoubtedly will be solved in the next decade.”

• 2001: “speech recognition are becoming mainstream capabilities [..] virtually all PCs should have that right out of the box.”

• 2002: “Likewise, in areas like speech recognition [..] common sense for every computer over the next five years.”

• 2003: “speech recognition [..] Some people might say it’s three years, some people might say it’s 10 years to solve those things”

• 2004: “We believe that speech over the next several years will be ready for primetime.”

• 2005: “We totally believe speech recognition will go mainstream somewhere over the next decade.”

• 2006: Vista speech recognition released: “Dear aunt, let's set so double the killer delete select all.”

https://www.youtube.com/watch?v=2Y_Jp6PxsSQ

Over the last decade, the spoken dialogue scene has changed dramatically

Timeline 2009-2018:
• Assistants: Siri, Google Now, Cortana, Alexa, Hound, Bixby, PersonalAI
• TTS: Merlin, DeepVoice, Tacotron, Tacotron2, WaveNet, WaveRNN, LPCNet
• Vision: ImageNet, AlexNet, 300-W, MPII, COCO, OpenFace, OpenPose
• Cloud ASR: ASR API, NDEV ASR, Bluemix, Azure, Amazon Transcribe
• Open source ASR: KALDI, DeepSpeech, wav2letter

Deep nets solved recognition of images and speech

[Plot: Switchboard conversational speech recognition error rate over time, compared with human performance]

2016: Machines beat humans at transcribing conversations

Now spoken dialogue is everywhere

Smart phones

Wearables

Hearables

XR headsets

Smart speakers

In-Car-systems

Smart homes

Social robots

Back to dialogue research

EACare: a social robot for the elderly

• Supporting Early Diagnoses of Dementia
• Supporting Therapeutic Interventions for Mild Dementia
• Innovative and Accessible Human-Computer Interfaces
Learning by demonstration

Work package leaders: Professor Hedvig Kjellström (KTH RPL), Professor Joakim Gustafson (KTH TMH), Professor Jonas Beskow (KTH TMH), Professor Miia Kivipelto (KI)

Our goal is to make use of new input modalities

Current multimodal setup

FACT: A collaborative robot for assembly tasks

Human-Robot interaction using augmented reality
Multimodal grounding using eye gaze and gestures
Dual arm robot mounted on a mobile base

Work package leaders: Professor Danica Kragic (RPL), Professor Joakim Gustafson (TMH), Professor Patric Jensfelt (RPL)

Human-Robot collaboration using augmented reality

Real-time attention tracking

Developing Spoken Dialogue Systems

Dialogue System Development Lifecycle

•Requirement Analysis

•Requirement Specification

•Design

•Implementation

•Evaluation

[McTear, 2004, Chap. 6]

Requirements Analysis

•Use case analysis
– Role and functions of the system
– User profiles
– Usage patterns
•Spoken language requirements:
– Vocabulary
– Grammars
– Interaction patterns: prompts, verification sub-dialogues, repair sub-dialogues
•Elicitation of requirements:
– Analysis of human-human dialogues (corpora)
– Simulations (e.g. Wizard-of-Oz experiments)

Sources for dialogue corpora

•Human-human interactions

•Human-computer interactions

•Simulated human-computer interactions

Simulation (Wizard-of-Oz)

The point of WoZ simulation

•Data for speech recognition

•Typical utterance patterns

•Typical dialogue patterns

•Click-and-talk behaviour

•Hints at important research problems

The demands on the Wizard

•How much does the wizard (WOZ) take care of?
– The complete system
– Parts of the system: recognition, synthesis, dialog handling, knowledge base
•Which demands are placed on the WOZ?
– How to handle errors
– Should information be added
– What is it allowed to say
•Which support does the WOZ have?

A robot coach for autistic children

Partners

MagPuzzle - joint attention robot

Hardware

• Laptop

• Monitor for wizard gaze tracking

• Realsense SR300 for user head/gaze tracking

• Top camera for fiducial detection

• External Microphone

• Furhat Robot

• Mag-puzzle pieces

Task

• Visualize a 3D cube structure and reconstruct it in 2-D

• Place magPuzzle pieces on the board to complete the task

• Form the cube into 3-D when finished

• 3 tasks with different restrictions of 4, 3 and 2 pieces “in a line”

Semi-autonomous restricted and augmented wizard interface

• Built in Unity3D
• Communicates with IrisTK
• Robust real-time object tracking using ARToolkit
• Cloud ASR integration
• Gaze estimation integration with GazeSense
• Wizard gaze control and logging capabilities with a Tobii eye tracker
• Enables creation of buttons to select easily authored dialog acts

https://github.com/Andre-pereira/LimitedProactiveWizard

• Choose a dialog act directly by pressing a button on the interface

• More contextually relevant options can be found by clicking at a point of interest in the interface. When doing so, more options expand.

Why restricted?

• The interface shows only what the robot “sees”

- No audio or video feed is available

- This allows ML models to have the necessary information to decide

- More “realistic” behavior given the state of the art in robotics

- Detect perception problems early on in the design process

The task’s contextual state appears in a central location and is updated in real-time

ASR results and hypotheses from a free cloud recognizer are displayed in real-time

Subtitles appear while the selected dialog act is being spoken by the robot

The wizard is made aware of user and robot gaze targets by generated gaze lines

Why augmented?

• The interface should also augment the wizard’s decision making and option selection with the advantages provided by fast computation

- The wizard’s cognitive load is a problem, so only the most relevant options should appear at any given time

- The wizard should not have to control all of the robot’s behavior, especially if only one wizard is involved. Some parts should be automated.

- A good response time is important to achieve believable behavior and to train ML models

• Gaze behavior is autonomous - The robot decides where to gaze autonomously based on the sensor data and dialog act selection

• Gaze tracking on the wizard’s side - The mouse cursor moves to where the user is looking for faster selection, and the dialog act options pop up automatically

• A.I. helps the wizard’s decisions - Automatically shows correct or incorrect states if any rule is broken, and the correct place for hints if available

• Only relevant buttons are shown - Once buttons are no longer necessary they are hidden from the interface

• Audio cues - After the ASR detects the end of an utterance, the result is played back via text-to-speech for the wizard in case s/he was not paying attention to the text

Excerpt from an experiment

Requirements Specification

•AKA: Functional Specification
– Formal documents on what the system should do (features and constraints).
•Volere template [Robertson et al. 1999]:
– Project drivers: purpose and stakeholders
– Project constraints: e.g. integration with other systems, hardware-software environments, budget, time-scale
– Functional requirements: scope of the system, services that the system should provide, data requirements
– Non-functional requirements: e.g. “look & feel”, performance and usability constraints, portability, security
– Project issues: conditions under which the project will be done, e.g. work environment, teams, funding

Design

•Architecture that implements the selected Dialogue Model
•Task Ontology
•Dialogues and Protocols
– Linked to the task’s entities, prompts, input recognition grammars, system help
•Dialogue Control Strategies:
– Initiative, confirmation, error handling
•Overall Consistency:
– Personality, timing, consistent dialogue structure, avoid surprising the user

Implementation

•Choose the appropriate programming tools:
- Rapid dialogue prototypers
•Web-based platforms:
- VoiceXML (W3C)
- EMMA (W3C, multimodal)
- SALT (.NET Speech SDK)
- X+V (Opera + IBM)

Testing

•Integration testing:
– Test interfaces between modules of the system.
– Check if the dialogue components correctly communicate with the task components.
•Validation testing:
– Check whether the system meets the functional and non-functional requirements.
• E.g. all functionalities are (correctly) implemented
•System testing:
– Check performance, reliability and robustness
– Performance measures: transaction time, success, and overall reliability

Evaluation

•Different from testing:
– Testing: identify differences between expected and actual live performance
– Evaluation: examine how the system performs when used by actual users
•Observational studies:
– The users interact with the system according to a set of “scenarios”.
– Scenarios provide either precise instructions for each stage or can be more open-ended
•Analysis of user acceptance:
– Qualitative measure of how the user likes the system
– Based on a questionnaire with pre-defined statements to be scored on a 1 to 5 scale

Evaluation of System Performance

• System’s components
- e.g. speech recognizer, parser, concept recognition
- Metrics: Word Accuracy, Sentence Accuracy, Concept Accuracy, Sentence Understanding (a word-accuracy sketch follows below)
• Spoken Dialogue System performance
- Transaction Success
- Number of Turns, Transaction Time
- Correction Rate
- Contextual Appropriateness
• Effectiveness of Dialogue Strategies
- Compare different ways of recovering from errors using implicit or explicit recovery.
- Metrics: Correction Rate
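To make the component metrics concrete, here is a minimal sketch (my own illustration, not material from the lecture) of computing word accuracy as 1 − WER, using Levenshtein alignment between a reference transcription and an ASR hypothesis:

```python
def edit_distance_words(reference: str, hypothesis: str):
    """Minimal number of word substitutions, insertions and deletions."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)], len(ref)

def word_accuracy(reference: str, hypothesis: str) -> float:
    """Word accuracy = 1 - WER, with WER = word-level edit distance / reference length."""
    errors, n_ref = edit_distance_words(reference, hypothesis)
    return 1.0 - errors / n_ref

# One deletion ("to") and one substitution -> word accuracy 4/6
print(word_accuracy("i want to go to waxholm", "i want go to stockholm"))
```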

General Models of Usability with PARADISE

PARAdigm for DIalogue System Evaluation

•Goal: maximize user’s satisfaction.

•Sub-goals:
- Maximize Task Success
- Minimize Costs
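As a rough sketch of how PARADISE combines these sub-goals (this assumes the standard PARADISE formulation and uses invented numbers, not results from the lecture), performance is a weighted sum of normalized task success minus normalized costs, with the weights fitted by regressing user-satisfaction scores:

```python
import numpy as np

def paradise_performance(task_success, costs, alpha, weights):
    """PARADISE-style performance: alpha * N(task success) - sum_i w_i * N(cost_i).

    task_success: per-dialogue task-success measure (e.g. the kappa statistic)
    costs:        per-dialogue cost measures (e.g. number of turns, elapsed time)
    alpha, weights: coefficients, in practice estimated by linear regression
                    against user-satisfaction scores
    """
    def normalize(x):                       # z-score normalization
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / x.std()

    costs = np.asarray(costs, dtype=float)
    norm_costs = np.column_stack([normalize(costs[:, i]) for i in range(costs.shape[1])])
    return alpha * normalize(task_success) - norm_costs @ np.asarray(weights)

# Illustrative numbers only: four dialogues, costs = [number of turns, time on task (s)]
print(paradise_performance(task_success=[0.9, 0.7, 1.0, 0.5],
                           costs=[[12, 90], [20, 160], [10, 70], [25, 200]],
                           alpha=0.5, weights=[0.3, 0.2]))
```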

Paradise User Satisfaction

•I found the system easy to understand in this conversation. (TTS Performance)
•In this conversation, I knew what I could say or do at each point of the dialogue. (User Expertise)
•The system worked the way I expected it to in this conversation. (Expected Behaviour)
•Based on my experience in this conversation using this system to get travel information, I would like to use this system regularly. (Future Use)

Evaluation metrics

•Dialog Efficiency Metrics:
– Total elapsed time, time on task, system turns, user turns, turns on task, time per turn for each system module
•Dialog Quality Metrics:
– Word Accuracy, Sentence Accuracy, Mean Response Latency
•Task Success Metrics:
– Perceived Task Completion, Exact Scenario Completion, Scenario Completion
•User Satisfaction:
– Sum of TTS performance, Task ease, User expertise, Expected behaviour

Dialogue Management Design Issues

Spoken dialogue processing

Automating Dialogue

•To automate a dialogue, we must model:
User’s Intentions
• Recognized from input media forms (multi-modality)
• Disambiguated by the context
System Actions
• Based on the task model and on output medium forms (multi-media)
Dialogue Flow
• Content selection
• Information packaging
• Turn-taking

Design Issues in Dialogue Management

•Choice of dialogue initiative:
- System-directed, user-directed, mixed-initiative
•Choice of dialogue control strategies:
- Finite-state, frame-based, agent-based, …
•Design of system prompts
•Choice of grounding strategies:
- Integration with external knowledge sources and systems
•Persona design
- Choice of voice talent or speech synthesizer
- Influences prompt wording, dialogue strategies
- Choice of interface or human metaphor

Dialogue Initiative

•System Directed (easiest, a “finite-state automaton”; see the sketch after this example)
- S: Welcome to the information service
- S: Please state point of departure
- U: Stockholm
- S: Please state your destination
- U: Waxholm
- S: When do you want to depart
- …
•User Directed (harder, “frame filling”)
- S: What can I do for you?
- U: I would like to buy a ticket from Stockholm to Waxholm, departure at noon, one way
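A minimal sketch (my own illustration; the prompts and state names are invented) of how such a system-directed dialogue can be implemented as a finite-state automaton, where each state has one fixed prompt and one outgoing transition:

```python
# Each state: (prompt, slot to fill, next state)
STATES = {
    "ask_departure":   ("Please state point of departure", "departure", "ask_destination"),
    "ask_destination": ("Please state your destination", "destination", "ask_time"),
    "ask_time":        ("When do you want to depart", "time", "done"),
}

def run_system_directed_dialogue(get_user_input=input):
    """Walk through the state graph, filling one slot per state."""
    print("S: Welcome to the information service")
    slots, state = {}, "ask_departure"
    while state != "done":
        prompt, slot, next_state = STATES[state]
        print(f"S: {prompt}")
        slots[slot] = get_user_input("U: ").strip()
        state = next_state
    return slots   # e.g. {'departure': 'Stockholm', 'destination': 'Waxholm', 'time': 'noon'}

if __name__ == "__main__":
    print(run_system_directed_dialogue())
```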

Mixed Initiative

• The system is supposed to control the dialogue, but the user can take control at times by providing more information:
- S: Welcome to the information service
- U: Can you tell me when is the next boat to Waxholm
- S: Are you leaving from Stockholm?
- U: I want to go from Grinda today.
- S: There is a boat every hour at 17 minutes past the hour

• The user can also change the task or perform two or more tasks at the same time:
- S: Welcome to the SmartHome. What would you like to do?
- U: Switch on the radio.
- S: Which channel?
- U: Well, I changed my mind, now turn off the lights and select CNN on TV.

Dialogue Control Strategies

•Finite-state based systems
- dialog and states explicitly specified
•Frame-based systems (see the slot-filling sketch below)
- dialog separated from information states
•Agent-based systems
- model of intentions, goals, beliefs
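A minimal frame-based sketch (again my own illustration, not the lecture's code): the dialogue state is a frame of slots, the user may fill any subset of them in one turn, and the system only asks for what is still missing, which also gives simple mixed-initiative behaviour:

```python
import re

# Toy patterns standing in for a real language-understanding component.
SLOTS = {
    "departure":   re.compile(r"from ([A-Z]\w+)"),
    "destination": re.compile(r"to ([A-Z]\w+)"),
    "time":        re.compile(r"\b(noon|today|tomorrow|\d{1,2}[:.]\d{2})\b"),
}
PROMPTS = {
    "departure":   "Where are you leaving from?",
    "destination": "Where do you want to go?",
    "time":        "When do you want to travel?",
}

def update_frame(frame, utterance):
    """Fill every slot whose pattern matches the utterance (order-independent)."""
    for slot, pattern in SLOTS.items():
        match = pattern.search(utterance)
        if match:
            frame[slot] = match.group(1)
    return frame

def next_prompt(frame):
    """Ask only for the slots that are still missing."""
    for slot in SLOTS:
        if slot not in frame:
            return PROMPTS[slot]
    return (f"Booking a trip from {frame['departure']} "
            f"to {frame['destination']} at {frame['time']}.")

frame = update_frame({}, "I want to go to Waxholm from Grinda")
print(next_prompt(frame))              # -> "When do you want to travel?"
frame = update_frame(frame, "at noon")
print(next_prompt(frame))              # -> summary of the completed frame
```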

Initiative Dependent Dialogue Strategies

• System-directed dialogues can be implemented as a directed graph between dialogue states
- Connections between states are predefined
- The user is guided through the graph by the machine
- Directed dialogues have been successfully deployed commercially
• Mixed-initiative dialogues are possible when state transitions are determined dynamically
- The user has the flexibility to specify constraints in any order
- The system can “back off” to a directed dialogue under certain circumstances
- Mixed-initiative dialogue systems are mainly research prototypes

The tasks of the dialogue manager

•update the dialog context

•provide a context for sentence interpretation

•coordinate other modules

•decide what information to convey to the user (and when)

Knowledge Sources for dialogue managers

•Context:
– Dialogue history
– Current task, slot, state
•Ontology:
– World knowledge model
– Domain model
•Language:
– Models of conversational competence: turn-taking strategies, discourse obligations
– Linguistic models: grammars, dictionaries, corpora
•User Model:
– Age, gender, spoken language, location, preferences, etc.
– Dynamic information: the current mental state
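As an illustration only (the field names are mine, not from the lecture), these knowledge sources are often bundled into a single dialogue-context object that the manager updates after every turn:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class UserModel:
    age: Optional[int] = None
    gender: Optional[str] = None
    language: str = "sv-SE"
    preferences: dict = field(default_factory=dict)

@dataclass
class DialogueContext:
    history: list = field(default_factory=list)    # (speaker, utterance, dialogue act) tuples
    current_task: Optional[str] = None
    slots: dict = field(default_factory=dict)      # the current frame of filled slots
    user: UserModel = field(default_factory=UserModel)

    def update(self, speaker: str, utterance: str, act: str, new_slots: dict):
        """Record the turn and merge any newly filled slots into the frame."""
        self.history.append((speaker, utterance, act))
        self.slots.update(new_slots)

ctx = DialogueContext(current_task="boat_timetable")
ctx.update("user", "I want to go to Waxholm", "inform", {"destination": "Waxholm"})
print(ctx.slots)
```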

Decisions where data-driven methods would help

•task-independent behaviors (e.g., error correction and confirmation behavior)

•task-specific behaviors (e.g., logic associated with certain customer-care practices)

•task-interface behaviors (e.g., prompt selection)

Machine learning algorithms that have been used for DM

•Markov Decision Processes (MDP)

•Partially Observable MDPs (POMDPs)

•Reinforcement learning

•HMMs

•Stochastic Finite-State Transducers

•Bayesian networks
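To make the reinforcement-learning idea concrete, here is a toy sketch (entirely my own, not a system from the lecture): tabular Q-learning of when to ask, confirm or book in a one-slot task, trained against a trivial simulated user behind a noisy “ASR” (simulated users are discussed on the next slide):

```python
import random
from collections import defaultdict

ACTIONS = ["ask", "confirm", "book"]
ASR_ERROR_RATE = 0.2   # assumed probability that the recognized slot value is wrong

def run_episode(Q, epsilon=0.1, alpha=0.1, gamma=0.95):
    """One dialogue against the simulated user, with tabular Q-learning updates."""
    state, correct, total = "empty", False, 0.0
    for _ in range(20):                                   # cap the dialogue length
        if random.random() < epsilon:
            action = random.choice(ACTIONS)               # explore
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])   # exploit
        # --- simulated environment: the user answers through a noisy ASR ---
        if action == "ask":
            correct = random.random() > ASR_ERROR_RATE
            next_state, reward = "filled", -1
        elif action == "confirm":
            if state == "empty":
                next_state, reward = "empty", -1          # nothing to confirm
            elif correct:
                next_state, reward = "confirmed", -1      # user answers "yes"
            else:
                next_state, reward = "empty", -1          # user answers "no": slot cleared
        else:                                             # "book"
            next_state = "terminal"
            reward = 10 if (state != "empty" and correct) else -10
        # --- Q-learning update ---
        best_next = 0.0 if next_state == "terminal" else max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        total += reward
        if next_state == "terminal":
            break
        state = next_state
    return total

Q = defaultdict(float)
for _ in range(5000):
    run_episode(Q)
for s in ["empty", "filled", "confirmed"]:
    print(s, {a: round(Q[(s, a)], 2) for a in ACTIONS})
```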

Problems with machine learning

•Hard to find enough data

•Solved by simulated users

•Not an optimal solution…

Chatbot building frameworks

•Microsoft Bot Framework

•IBM Cloud

•Amazon Lex

•Google’s Dialogflow (formerly api.ai)

•Facebook’s BotEngine (formerly wit.ai)

Utterance Generation methods

•Predefined utterances

•Frames with slots

•Generation based on grammar and underlying semantics
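A minimal sketch of the first two generation methods (purely illustrative; the templates are invented here): predefined utterances are looked up per dialogue act, while frame-based generation fills slot values into a template:

```python
# Predefined utterances: one fixed string per dialogue act.
CANNED = {
    "greeting": "Welcome to the information service.",
    "goodbye":  "Thank you for calling. Goodbye!",
}

# Frames with slots: templates whose placeholders are filled from the dialogue frame.
TEMPLATES = {
    "confirm_trip": "So you want to travel from {departure} to {destination} at {time}?",
    "ask_slot":     "Please state your {slot}.",
}

def generate(act: str, **slots) -> str:
    """Return a system utterance for a dialogue act, filling in slot values if needed."""
    if act in CANNED:
        return CANNED[act]
    return TEMPLATES[act].format(**slots)

print(generate("greeting"))
print(generate("confirm_trip", departure="Grinda", destination="Waxholm", time="noon"))
print(generate("ask_slot", slot="destination"))
```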

System Utterances

• The output should reflect the system’s vocabulary and linguistic capability

• Prompts should not be too long (short-term memory: 7±2 items)

• Good error messages

Grounding

•How to give feedback to the user?
- Find the right trade-off between being too explicit and too implicit.
•Why?
- The system cannot guarantee that the user’s input has been accurately understood
- The user might request information or ask to perform actions that are not available in the system

Clarification sub-dialogues

• When?
- Recognition failures
- Inconsistent states: conflicting information from the user’s input
• How?
- Recognition failures: repeat the request, give more help
- Inconsistent states: start again from scratch, or try to re-establish consistency with precise questions

Confirmation

Ensures that what is communicated has been mutually understood

•Explicit confirmation

•Implicit confirmation Explicit confirmation

•A yes/no question repeating what has been understood by the system
- “So you want to go to Waxholm?”
•It might become boring if done all the time!
•Increases transaction time
•Makes the system appear unsure
•Solution: delay the confirmation
- “So you want a ticket from Grinda to Waxholm for tomorrow at 15h34?”

Implicit confirmation

•Embed the confirmation in ordinary requests
- “When do you want to travel from Grinda to Waxholm?”
•Problems
- The user may not pay attention to the confirmation and only focus on the request
- Starting a repair dialogue is more difficult
- “I did not want to go to Waxholm at seven thirty”?!?
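One common way to choose between these strategies (a generic sketch with invented thresholds, not something prescribed in the lecture) is to let the recognition confidence of a slot value decide whether to confirm explicitly, confirm implicitly, or simply accept it:

```python
def choose_confirmation(slot: str, value: str, confidence: float,
                        explicit_threshold: float = 0.5,
                        implicit_threshold: float = 0.8) -> str:
    """Pick a grounding strategy from the recognition confidence of one slot value."""
    if confidence < explicit_threshold:
        # Low confidence: ask an explicit yes/no confirmation question.
        return f"Did you say {value} as your {slot}?"
    if confidence < implicit_threshold:
        # Medium confidence: confirm implicitly inside the next request (toy template).
        return f"When do you want to travel to {value}?"
    # High confidence: accept silently and move on.
    return "OK."

print(choose_confirmation("destination", "Waxholm", 0.42))   # explicit confirmation
print(choose_confirmation("destination", "Waxholm", 0.65))   # implicit confirmation
print(choose_confirmation("destination", "Waxholm", 0.93))   # accept
```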

Error handling

•How does a system know that it doesn’t know?

•What should it do to resolve the problem?

•Examples from an ongoing EU project (SpeDial)

Problem Detection: Feature Types

• Dialogue management
- Turn number, system act, confidence score (Bohus)
• Spoken language understanding
- User act, semantic hypothesis, number of repeated slots, isDisconfirmation, etc.
• Automatic speech recognition
- Grammars used, utterance duration, ASR confidence score, number of words, etc.
• Distance-based
- Phonetic, acoustic, and cosine similarity distances between user turns
• Goal related
- Contribution to dialogue goals
• Manually created
- Transcriptions; annotations for age, gender and affect; WER
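To illustrate how such features might feed a problem detector (a generic sketch under my own assumptions; the feature values and model choice are not taken from SpeDial), one can assemble a per-turn feature vector and train a standard classifier to flag problematic turns:

```python
# Requires scikit-learn; in practice the labels come from manual annotation of dialogue logs.
from sklearn.linear_model import LogisticRegression

FEATURE_NAMES = ["turn_number", "asr_confidence", "utterance_duration_s",
                 "num_repeated_slots", "is_disconfirmation"]

# Invented toy data: one row of features per user turn, label 1 = problematic turn.
X = [
    [1, 0.92, 1.8, 0, 0],
    [2, 0.35, 3.1, 1, 1],
    [3, 0.88, 1.2, 0, 0],
    [4, 0.41, 2.7, 2, 0],
    [5, 0.15, 4.0, 2, 1],
    [6, 0.95, 1.0, 0, 0],
]
y = [0, 1, 0, 1, 1, 0]

detector = LogisticRegression().fit(X, y)

new_turn = [[7, 0.30, 3.5, 1, 1]]
print("problem probability:", detector.predict_proba(new_turn)[0][1])
```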

What would make a speech interface intuitive?

•The user should not have to learn:
- What to say
- How to speak
- When to speak
•The system should support the user by:
- Clearly showing when it will speak
- Saying things that reflect what it understands

Speech synthesis for a social robot

• Can produce both fast and slow speech

• Generates accompanying prosody and facial gestures

• Can generate turn-regulation behaviors and attention feedback

• Incremental sound generation that can stop/pause talking in a human-like way

• Is fast enough to facilitate a responsive dialogue system

• Is tightly coupled with the speech recognizer!

An uncertain system should be able to make use of filled pauses to keep the turn…

Resynthesis of customer speech with Mbrola diphone synthesis

… flat lengthening also keeps the turn, while rises elicit feedback…

…it might even use breathing to regulate turns…

Conversational speech synthesizer

• Corpus: Spontaneous speech in dialogue, manually annotated

• Synthesis engine: Merlin (DNN synthesiser)

• Annotated speaking styles
- vocal effort (soft, modal, loud)
- voice quality (creaky, not creaky)
- annotation of fillers and feedback tokens

Example sentence: “I think I hear the sound of running water.” (vocal effort: soft / modal / loud)

Hesitant speech

“Yeah, I’ll have thee eh… chi.. eh.. chicken sandwich with ey.. side of ffffrench fries.”

Éva Székely, Joseph Mendelson, and Joakim Gustafson, “Synthesising uncertainty: the interplay of vocal effort and hesitation disfluencies,” Proc. Interspeech 2017, pp. 804-808, 2017.
Zhizheng Wu, Oliver Watts, and Simon King, “Merlin: An open source neural network speech synthesis system,” Proc. SSW, Sunnyvale, USA, 2016.

Tacotron – end-to-end text to speech

Wang, Y., Skerry-Ryan, R.J., Stanton, D., Wu, Y., Weiss, R.J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S. and Le, Q., 2017. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135.

Transfer learning to build a hesitant Tacotron voice

•20 hours of base training data (the LJ Speech Dataset)
- Female, British accent, read speech, fluent
•2 hours of annotated dialogue
- Male, Dublin accent, conversational speech, disfluent

Example (base training vs. adapted training): “Tell me about your mother’s eh working life.”

Social robots need to be able to handle the dialogue flow

• Conversation has two channels
- Exchange of information
- Control of the exchange of information
• Control
- Turn-taking (who speaks when)
- Feedback (listener’s attention, attitude, understanding…)

Some functions of listening feedback

• continuer: keeps the floor open for the current speaker to continue speaking
• acknowledgment: shows one has heard the current speaker
• assessment: evaluates the talk of the current speaker
• display of understanding/non-understanding: shows whether one has comprehended the speaker
• agreement/disagreement: shows support/non-support of what the speaker has said
• interest or attentive signal: displays interest and engagement in what the speaker has said

Prosody gives feedback its meaning

• Prosodic features
- Loudness
- Height and slope of pitch
- Duration
- Syllabification
- Abruptness of the ending
• Examples of meanings
- A falling pitch signals completion or low involvement
- A bisyllabic “mhm” often functions as a continuer
- A fall-rise pitch on a monosyllabic “mm” functions as a continuer
- A rise is associated with heightened involvement and interest

The yeah synthesizer

Synthesis engine: Merlin DNN TTS

Method: Corpus-based method using added linguistic features based on measured parameters of 266 feedback utterances in the corpus

Input features:
• length (11 levels)
• mean f0 (low, medium, high)
• f0 slope (rising, flat, falling)
• vocal effort (soft, modal, loud)
• voice quality (creaky, not creaky)

At the time of synthesis, each of these features can be specified, and also interpolated between.
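As a sketch of what that control space could look like in code (purely illustrative; the dictionary format and values are my own, not the actual synthesizer interface), each feedback token is a point in the feature space and new tokens can be generated by interpolating between two specifications:

```python
# Each "yeah" is a point in the control space described above.
# Categorical features are coded on a 0-1 scale so they can be interpolated.
neutral_yeah = {"length": 5, "mean_f0": 0.5, "f0_slope": 0.5,
                "vocal_effort": 0.5, "creaky": 0.0}
enthusiastic_yeah = {"length": 8, "mean_f0": 0.9, "f0_slope": 0.9,
                     "vocal_effort": 0.9, "creaky": 0.0}

def interpolate(a: dict, b: dict, t: float) -> dict:
    """Linear interpolation between two feedback-token specifications (0 <= t <= 1)."""
    return {key: (1 - t) * a[key] + t * b[key] for key in a}

# Halfway between a neutral and an enthusiastic "yeah":
print(interpolate(neutral_yeah, enthusiastic_yeah, 0.5))
```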

Éva Székely, Petra Wagner and Joakim Gustafson (2018), “The wrylie-board: mapping acoustic space of expressive feedback to attitude markers,” demo paper at SLT 2018, Athens.

Visualisation of yeahs

Adaptation of a method applied previously for clustering and visualising bird sounds*, using a combination of t-SNE** and the Hungarian algorithm on images of spectrograms.

Pipeline: spectrograms of 576 synthesised versions of the discourse marker “yeah” → t-SNE dimensionality reduction → cloud-to-grid algorithm → 2D map of synthetic yeah-s

* https://aiexperiments.withgoogle.com/bird-sounds

**Laurens van der Maaten and Geoffrey Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, 2008.
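A rough sketch of that visualisation pipeline (my own illustration under assumed file names and parameters, not the code behind the slide): compute a spectrogram per audio clip, reduce the flattened spectrograms to 2D with t-SNE, and snap the point cloud to a regular grid by solving an assignment problem with the Hungarian algorithm:

```python
# Requires librosa, numpy, scipy and scikit-learn.
import glob
import numpy as np
import librosa
from sklearn.manifold import TSNE
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# 1. One mel spectrogram per synthesised "yeah" (file names are assumed).
features = []
for path in sorted(glob.glob("yeah_*.wav")):
    audio, sr = librosa.load(path, sr=16000)
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=64))
    mel = librosa.util.fix_length(mel, size=100, axis=1)   # pad/trim to equal length
    features.append(mel.flatten())
features = np.array(features)

# 2. t-SNE reduction of the spectrogram images to a 2D point cloud, scaled to [0, 1].
cloud = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
cloud = (cloud - cloud.min(axis=0)) / (cloud.max(axis=0) - cloud.min(axis=0))

# 3. Cloud-to-grid: assign each point to a cell of a square grid by minimising
#    total point-to-cell distance (Hungarian algorithm / linear assignment).
n = len(cloud)
side = int(np.ceil(np.sqrt(n)))
grid = np.array([(x, y) for x in range(side) for y in range(side)], dtype=float)[:n]
grid /= max(side - 1, 1)                                   # normalise grid to [0, 1]
_, assigned_cells = linear_sum_assignment(cdist(cloud, grid))
grid_positions = grid[assigned_cells]                      # grid cell per "yeah", in input order
print(grid_positions[:5])
```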

Context-embedding for personalisation

• We collected 88 dialogues containing attitude labels from movies to provide users with a variety of plausible contexts to say yeah. These can be used to personalise the Wrylie-board, and shed light on the context-dependent aspects of tone of voice.

Example movie excerpt:
KATH: Great. Now we're going to get into a fight.
BRUNCH DUDE: We talked about what you said... you're right. You guys were here before us. Sorry we didn't pipe up sooner. We cool, brother?
CARL (surprised): Yeah. We cool.

A personalised Wrylie-board

Multi-party dialogue: Problems

Example user utterance: “Can you help me to find the way to Marienplatz?”

•Generate and understand turn-taking cues
- When to speak
- Acoustic and visual cues
•Who is attending whom?
- Head pose / gaze
•Who is speaking to whom?
- Head pose / gaze
- Lip movements
- Speech detection

Multimodal dialogue system modules

Speech recognition: Sphinx (open source), Windows, Nuance 9, Nuance Dragon
Speech synthesis: Mary TTS (open source), Cereproc, Windows
Embodied agents: Furhat, IURO, EMBR
Vision: Microsoft Kinect, Fraunhofer SHORE, Tobii gaze tracker

Multi-party 2D pose estimation from video

Left image: OpenPose https://github.com/CMU-Perceptual-Computing-Lab/openpose

Visual input

FurHat at Tekniska museet

User interaction from Tekniska museet

Do you want to learn more about multimodal systems?

DT2140 Multimodal Interaction and Interfaces
The course is focused on the interaction between humans and computers, using interfaces that employ several modalities and human senses, such as speech, gestures and touch. The course will give the students theoretical and practical introductions to different types of HCI interfaces for
• user input, such as speech recognition, motor sensors or eye and gesture tracking, and
• computer output, such as augmented reality representations, speech synthesis, sounding objects and haptic devices.
In particular, the effects of combining different modalities are addressed.

Do you want to learn more about speech recognition?

DT2118, Speech and Speaker Recognition

This course gives insights into the signal processing and statistical methods employed in ASR and in Speaker identification.
