Voice Interfaces
Total Page:16
File Type:pdf, Size:1020Kb
VIEW POINT VOICE INTERFACES Abstract A voice-user interface (VUI) makes human interaction with computers possible through a voice/speech platform in order to initiate an automated service or process. This Point of View explores the reasons behind the rise of voice interface, key challenges enterprises face in voice interface adoption and the solution to these. Are We Ready for Voice Interfaces? Let’s get talking! IO showed the new promise of voice offering integrations with their Voice interfaces. Assistants. Since Apple integration with Siri, voice interfaces has significantly Almost all the big players (Google, Apple, As per industry forecasts, over the next progressed. Echo and Google Home Microsoft) have ‘office productivity’ decade, 8 out of every 10 people in the have demonstrated that we do not need applications that are being adopted by world will own a device (a smartphone or a user interface to talk to computers businesses (Microsoft and their Office some kind of assistant) which will support and have opened-up a new channel for Suite already have a big advantage here, voice based conversations in native communication. Recent demos of voice but things like Google Docs and Keynote language. Just imagine the impact! based personal assistance at Google are sneaking in), they have also started Voice Assistant Market USD~7.8 Billion CAGR ~39% Market Size (USD Billion) 2016 2017 2018 2019 2020 2021 2022 2023 The Sudden Interest in Voice Interfaces Although voice technology/assistants Voice Recognition Accuracy Convenience – speaking vs typing have been around in some shape or form Voice Recognition accuracy continues to Humans can speak 150 words per minute for many years, the relentless growth of improve as we now have the capability to vs the typing speed of 40 words per low-cost computational power—and train the models using neural networks minute. This makes voice interfaces an breakthroughs in machine learning and and large amount of relevant user data. efficient way to interact with technology. deep learning—mean bots can now be Google has been able to achieve 95% With natural language support and ability built with self-learning capabilities. machine learning word accuracy which to respond intelligently to user queries is the same as human accuracy. It is also based upon user’s context, behavior expected to receive 50% of all searches in patterns and AI based learnings, voice Big breakthroughs in AI/ML voice by 2020 which will further improve based systems are increasingly becoming space in the recent years have recognition across languages as it will give an exciting option for users as is evident improved the voice recognition access to variety of new data for training. from the ever increasing sales numbers accuracy, adoption has also A lot of recent advancements in the NLP of voice assistants (Amazon Echo, Google increased as it is convenient (Natural Language Processing), NLG Home etc.). to use voice interfaces, as a (Natural Language Generation) and Text result a lot of interest and to Speech(TTS) areas have also given a big investments are attracted. push to voice based applications. External Document © 2019 Infosys Limited External Document © 2019 Infosys Limited Industry investments Google Machine Learning Word Accuracy Tech giants have entered the business of Voice Interfaces through either direct 100% investments or acquisitions. For example, Google recently acquired Mobvoi (founded 95% 95% by Li Zhifei, a former Google Research 90% Scientist) while Apple acquired Siri. Rate Amazon’s Bezos has openly said that they cy have more than 1,000 people working on Alexa and the Echo ecosystem for the past rd Accura 80% many years. Bezos, through his investment Wo fund Bezos Expeditions, and the Alexa Fund have invested in a $2.5 million seed round for Pulse Labs led by the Seattle- 70% 2013 2014 2015 2016 2017 based venture firm Madrona Venture Group. Google Threshold for Human Accuracy Key Technologies Enabling Voice Interfaces Currently there are four key enablers for voice interfaces: Automatic Speech Natural Language Natural Language Text to Speech (TTS) Recognition (ASR) Understanding Generation Each of these technologies have seen big options. Enterprises can also embrace business/use case requirements breakthroughs in recent years providing either online (cloud services) or offline enterprises with a multitude of solutions/ (on-premise) option depending on their Natural Language Understanding Intent Entity Recognition Extraction Speech Recognition Conversation Session and Management Context Management Response Generation Fullment Streaming Speech Natural Output Text to Speech Language Generation External Document © 2019 Infosys Limited External Document © 2019 Infosys Limited Key players and options for enterprises The vendor landscape for Voice Interfaces on-premise, the issue of routing users • While open-source on-premise options has become quite crowded with the IT and business data via third-party/cloud for speech recognition are evolving/ giants of the world having the majority based services is taken care of improving fast, there are commercial share of the pie. Here is a quick look. options available like Nuance, NICE etc. • Licensing cost, cost of maintaining/ On Premise Deployment scaling the infrastructure will vary depending upon the nature of use case/ • When speech recognition is brought volume of transactions On-premise Deployment Corporate Boundary Congure and Train Language Model Bot Conguration Designer/Developer Refer Conguration and Trained Model API Creates and Deploys Process Intent Bot r ye Enterprise Application La tegration Systems Enterprise In Recognize Intent and Extract Entities User Speaks Enterprise Speech Apps Recognition User Cloud Deployment • There are advantages of pay as you powered Google Speech API for Speech use, ease of on demand scale, no to text conversion that is available for • Cloud service providers offer a promising infrastructure/platform maintenance short or long-form audio with high option by allowing an enterprise to overhead, inbuilt analytics etc. accuracy recognition, Amazon has configure and deploy voice bots on Transcribe and Microsoft comes with • Leading online speech service providers cloud (server-less as lambda/cloud Microsoft Cognitive Services and Bing are the giants like Google, Amazon and functions or server infrastructure) which Speech API. can integrate with the enterprise systems Microsoft. While Google provides ML Cloud Deployment Corporate Boundary Cloud Congure and Train Language Model Designer/Developer Bot Conguration API User Speaks ices Channels Process rv Intent Se r User NLP/NLU ye Enterprise La tegration Speech Recognition Systems Enterprise Recognize Intent and In Extract Entities Channels/Cloud Enterprise Apps Service Providers User Speaks External Document © 2019 Infosys Limited External Document © 2019 Infosys Limited The following diagram shows the various players in on-premise, cloud and data set services Speech Recognition: A view of Online, Open Source and Dataset options Google Cloud Speech Mozilla Deep Speech API RNN Based Model Amazon Transcribe Facebook Wave2letter CNN Based Model vices Microsoft Bing Speech Open Source API Kaldi DNN Based Model Cloud Ser IBM Watson Speech API Nuance Datasets LibriSpeech SwitchBoard Google Command Fisher Enterprise Readiness for Voice as Their Strategic Priority Redesigning/enhancing digital experience organizations while conversational AI with key for businesses to enhance customers is by far the top strategic priority for most voice recognition capability is becoming experience and attract new customers. Challenges Speech Recognition owing to its that separates their world-class speech no cost but most importantly they are complexity has taken years of effort to recognition system from any other speech getting real life users’ audio data for further evolve to this level of maturity. There recognition system created elsewhere. training to improve their experience. This are multiple challenges with users’ real Alexa, Google Home, Apple Home Pod, Siri, drives enterprises to depend on such cloud life audio streams including low quality Google Now are providing convenience players in developing voice interfaces for microphones, background noise, reverb to the customer at minimal one-time or enterprise applications. and echo, accent variations etc. Training data should contain samples with all Speech Recognition these problems in order to make sure the neural network can deal with them. Other Neural Network challenges that enterprises face include: Data set for training Model Training Accuracy of machine learning solutions Input Data is completely driven by the amount of available data sets for training. As for the voice recognition accuracy, it is difficult for enterprises to create their own approach and solution because of their inability to User says gather variety and volume of data required “Hello” to build the right models. Recognition IT giants like Google, Amazon or Baidu Text Response do possess huge datasets with hours and hours of audio recordings in real-life “Hello” situations. It is the single biggest factor External Document © 2019 Infosys Limited External Document © 2019 Infosys Limited Lack of mature open source have the potential to reach a good level users’ data and other critical business options of maturity. Mozilla has launched Project information) being routed via cloud Common Voice, an initiative to help make providers for voice recognition. Some While