VIEW POINT

VOICE INTERFACES

Abstract

A voice user interface (VUI) makes human interaction with computers possible through a voice/speech platform in order to initiate an automated service or process. This Point of View explores the reasons behind the rise of voice interfaces, the key challenges enterprises face in adopting them, and the solutions to these challenges.

Are We Ready for Voice Interfaces?

Let's get talking! Voice interfaces have progressed significantly since Apple's integration with Siri. Amazon Echo and Google Home have demonstrated that we do not need a user interface to talk to computers and have opened up a new channel for communication. Recent demos of voice-based personal assistance at Google IO showed the new promise of voice interfaces.

Almost all the big players (Google, Apple, Microsoft) have 'office productivity' applications that are being adopted by businesses (Microsoft and its Office Suite already have a big advantage here, but things like Google Docs and Keynote are sneaking in). They have also started offering integrations with their voice assistants.

As per industry forecasts, over the next decade, 8 out of every 10 people in the world will own a device that supports voice-based conversations in their native language. Just imagine the impact!

[Figure: Voice Assistant Market. Market size (USD billion) by year, 2016 to 2023; approximately USD 7.8 billion, CAGR approximately 39%.]

The Sudden Interest in Voice Interfaces

Although voice technology/assistants have been around in some shape or form for many years, the relentless growth of low-cost computational power, together with breakthroughs in AI and deep learning, means bots can now be built with self-learning capabilities.

Big breakthroughs in the AI/ML voice space in recent years have improved voice recognition accuracy. Adoption has also increased because voice interfaces are convenient to use. As a result, the space is attracting a lot of interest and investment.

Voice Recognition Accuracy

Voice recognition accuracy continues to improve as we now have the capability to train models using neural networks and large amounts of relevant user data. Google has been able to achieve 95% word accuracy with machine learning, which is the same as human accuracy. Google is also expected to receive 50% of all searches by voice by 2020, which will further improve recognition across languages as it gives access to a variety of new data for training. Recent advancements in NLP (Natural Language Processing), NLG (Natural Language Generation) and Text to Speech (TTS) have also given a big push to voice-based applications.

Convenience: speaking vs typing

Humans can speak 150 words per minute, against a typing speed of 40 words per minute. This makes voice interfaces an efficient way to interact with technology. With natural language support and the ability to respond intelligently to user queries based on the user's context, behavior patterns and AI-based learning, voice-based systems are increasingly becoming an exciting option for users, as is evident from the ever-increasing sales numbers of voice assistants (Amazon Echo, Google Home etc.).

External Document © 2019 Infosys Limited

Industry investments

Tech giants have entered the business of voice interfaces through either direct investments or acquisitions. For example, Google has invested in Mobvoi (founded by Li Zhifei, a former Google Research Scientist), while Apple acquired Siri. Amazon's Jeff Bezos has openly said that the company has had more than 1,000 people working on Alexa and the Echo ecosystem for many years. Bezos, through his investment fund Bezos Expeditions, and the Alexa Fund have invested in a $2.5 million seed round for Pulse Labs, led by the Seattle-based venture firm Madrona Venture Group.

[Figure: Google Machine Learning Word Accuracy. Word accuracy rate (70% to 100%) by year, 2013 to 2017, with Google reaching 95%, the threshold for human accuracy.]
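The source does not say how the 95% word accuracy figure was measured. A common convention, assumed here, is to compute the word error rate (WER) as the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words, and to report word accuracy as 1 minus WER. The function names below are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[len(hyp)] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Word accuracy under the assumed convention: 1 - WER."""
    return 1.0 - word_error_rate(reference, hypothesis)
```

Under this convention, one wrong word in a 20-word utterance corresponds to 95% word accuracy. Because insertions also count as errors, the value can drop below zero for very poor transcripts.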

Key Technologies Enabling Voice Interfaces

Currently there are four key enablers for voice interfaces:

• Automatic Speech Recognition (ASR)
• Natural Language Understanding
• Natural Language Generation
• Text to Speech (TTS)

Each of these technologies has seen big breakthroughs in recent years, providing enterprises with a multitude of solutions/options. Enterprises can also embrace either the online (cloud services) or the offline (on-premise) option, depending on their business/use case requirements.

[Figure: Voice interface processing flow. Components shown: Speech (streaming input), Speech Recognition, Natural Language Understanding (Intent Recognition, Entity Extraction), Conversation Management, Session and Context Management, Fulfilment, Response Generation, Natural Language Generation, Text to Speech, Output.]
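How the four enablers chain together can be illustrated with a minimal sketch. Everything in it is a stand-in chosen for illustration: the "audio" is plain UTF-8 text, the NLU step uses keyword rules instead of a trained model, and the intent name `check_balance` and all function names are hypothetical. A real deployment would replace each function with a call to an actual engine.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class NluResult:
    intent: str
    entities: Dict[str, str] = field(default_factory=dict)


def recognize_speech(audio: bytes) -> str:
    """ASR stand-in: a real system would call a speech recognition engine."""
    # Assumption for this sketch: the "audio" is simply UTF-8 encoded text.
    return audio.decode("utf-8")


def understand(text: str) -> NluResult:
    """NLU stand-in: keyword rules in place of a trained language model."""
    words = text.lower().rstrip("?.!").split()
    if "balance" in words:
        account = "savings" if "savings" in words else "current"
        return NluResult("check_balance", {"account": account})
    return NluResult("fallback")


def generate_response(result: NluResult) -> str:
    """NLG stand-in: template-based response generation."""
    if result.intent == "check_balance":
        return f"Fetching the balance of your {result.entities['account']} account."
    return "Sorry, I did not understand that."


def synthesize(text: str) -> bytes:
    """TTS stand-in: a real system would return synthesized audio."""
    return text.encode("utf-8")


def handle_utterance(audio: bytes) -> bytes:
    """Run one utterance through ASR, NLU, NLG and TTS in sequence."""
    text = recognize_speech(audio)
    result = understand(text)
    reply = generate_response(result)
    return synthesize(reply)
```

Keeping each stage behind its own function is one way to let an enterprise swap a cloud engine for an on-premise one without touching the rest of the flow.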

Key players and options for enterprises

The vendor landscape for voice interfaces has become quite crowded, with the IT giants of the world holding the majority share of the pie. Here is a quick look.

On-premise Deployment

• When speech recognition is brought on-premise, the issue of routing user and business data via third-party/cloud-based services is taken care of
• While open-source on-premise options are evolving and improving fast, there are commercial options available like Nuance, NICE etc.
• Licensing cost and the cost of maintaining/scaling the infrastructure will vary depending upon the nature of the use case and the volume of transactions

[Figure: On-premise Deployment. Labels shown: Corporate Boundary; Designer/Developer; Configure and Train Language Model; Bot Configuration API; Creates and Deploys; Bot; Refer Configuration and Trained Model; User Speaks; Enterprise Apps; Speech Recognition; Recognize Intent and Extract Entities; Process Intent; Enterprise Integration Layer; Enterprise Systems.]

Cloud Deployment

• Cloud service providers offer a promising option by allowing an enterprise to configure and deploy voice bots on the cloud (serverless as lambda/cloud functions, or on server infrastructure), which can integrate with the enterprise systems
• The advantages include pay-as-you-use pricing, ease of on-demand scaling, no infrastructure/platform maintenance overhead, inbuilt analytics etc.
• The leading online speech service providers are giants like Google, Amazon and Microsoft. Google provides the ML-powered Google Speech API for speech-to-text conversion, which is available for short or long-form audio with high-accuracy recognition; Amazon has Transcribe; and Microsoft offers Microsoft Cognitive Services and the Bing Speech API.

[Figure: Cloud Deployment. Labels shown: Corporate Boundary; Cloud; Designer/Developer; Configure and Train Language Model; Bot Configuration API; User Speaks; Channels; Channels/Cloud Service Providers; Speech Recognition; NLP/NLU Services; Recognize Intent and Extract Entities; Process Intent; Enterprise Apps; Enterprise Integration Layer; Enterprise Systems.]
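The choice between the two deployment options can be kept out of application code by hiding speech recognition behind a single function signature. The sketch below is only an illustration under stated assumptions: the `UseCase` fields, the backend names and the placeholder transcribers are all hypothetical, and the selection rule reduces the trade-offs described above to one flag (whether audio may leave the corporate boundary).

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A transcriber takes raw audio and returns text.
Transcriber = Callable[[bytes], str]


@dataclass(frozen=True)
class UseCase:
    name: str
    data_must_stay_in_house: bool  # hypothetical flag for privacy-sensitive workloads


def cloud_transcribe(audio: bytes) -> str:
    # Placeholder: a real implementation would call a hosted speech service.
    return f"[cloud transcript of {len(audio)} bytes]"


def on_premise_transcribe(audio: bytes) -> str:
    # Placeholder: a real implementation would call a locally hosted engine.
    return f"[on-premise transcript of {len(audio)} bytes]"


BACKENDS: Dict[str, Transcriber] = {
    "cloud": cloud_transcribe,
    "on_premise": on_premise_transcribe,
}


def choose_backend(use_case: UseCase) -> str:
    """Pick on-premise when audio must not leave the corporate boundary."""
    return "on_premise" if use_case.data_must_stay_in_house else "cloud"


def transcribe(use_case: UseCase, audio: bytes) -> str:
    """Route the audio to the backend selected for this use case."""
    return BACKENDS[choose_backend(use_case)](audio)
```

In practice the rule would also weigh licensing cost, transaction volume and scaling needs; the single flag keeps the sketch short.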

The following diagram shows the various players in on-premise, cloud and data set services.

Speech Recognition: A view of Online, Open Source and Dataset options

Cloud Services
• Google Cloud Speech API
• Amazon Transcribe
• Microsoft Bing Speech API
• IBM Speech API
• Nuance

Open Source
• Mozilla Deep Speech (RNN-based model)
• Wav2letter (CNN-based model)
• Kaldi (DNN-based model)

Datasets
• LibriSpeech
• SwitchBoard
• Google Command
• Fisher

Enterprise Readiness for Voice as Their Strategic Priority

Redesigning and enhancing the digital experience with customers is by far the top strategic priority for most organizations, while conversational AI with voice recognition capability is becoming key for businesses to enhance customer experience and attract new customers.

Challenges

Speech recognition, owing to its complexity, has taken years of effort to evolve to its current level of maturity. There are multiple challenges with users' real-life audio streams, including low-quality microphones, background noise, reverb and echo, accent variations etc. Training data should contain samples with all these problems to make sure the neural network can deal with them. Other challenges that enterprises face include:

Data set for training

The accuracy of machine learning solutions is driven by the amount of data available for training. For voice recognition, it is difficult for enterprises to create their own approach and solution because they cannot gather the variety and volume of data required to build the right models.

IT giants like Google, Amazon or Microsoft possess huge datasets with hours and hours of audio recordings from real-life situations. This is the single biggest factor that separates their world-class speech recognition systems from any speech recognition system created elsewhere. Alexa, Google Home, Apple HomePod and Siri provide convenience to the customer at minimal one-time or no cost but, most importantly, they collect real-life users' audio data for further training to improve the experience. This drives enterprises to depend on such cloud players when developing voice interfaces for enterprise applications.

[Figure: Speech Recognition. Input data feeds model training of a neural network; at recognition time the user says "Hello" and the text response is "Hello".]
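One common way to obtain training samples that contain background noise is to mix recorded noise into clean speech at a chosen signal-to-noise ratio (SNR). The sketch below illustrates that technique only; it is not drawn from the source. It assumes audio is held as equal-length lists of float samples, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def power(signal: Sequence[float]) -> float:
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in signal) / len(signal)


def mix_with_noise(clean: Sequence[float], noise: Sequence[float],
                   snr_db: float) -> List[float]:
    """Add noise to a clean signal, scaled to reach the requested SNR in dB."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    # SNR_dB = 10 * log10(P_clean / P_noise); solve for the target noise power.
    target_noise_power = power(clean) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [c + gain * n for c, n in zip(clean, noise)]


if __name__ == "__main__":
    rng = random.Random(0)
    # A 440 Hz tone sampled at 16 kHz stands in for clean speech.
    clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
    noise = [rng.uniform(-1.0, 1.0) for _ in range(1600)]
    noisy = mix_with_noise(clean, noise, snr_db=10.0)
    print(len(noisy))
```

Generating the same utterance at several SNR levels is one way to expose a model to both quiet and noisy conditions. Reverb, echo and accent variation need other techniques or genuinely diverse recordings.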

Lack of mature open source options

While there is a dependency on hosted players like Google, Amazon etc. for building voice interfaces, an open source stack for voice processing is an alternative. There are some promising beginnings in the open source world which have the potential to reach a good level of maturity. Mozilla has launched Project Common Voice, an initiative to help make voice recognition open to everyone.

Data privacy concerns with cloud platforms

Businesses have their own inhibitions about users' interactions (involving users' data and other critical business information) being routed via cloud providers for voice recognition. Some industries, especially banking and finance, prefer equivalent, mature open source and on-premise options in order to protect their customers' data within their own datacenters.

Enterprise Readiness for Voice as Their Strategic Priority

Because voice interfaces are critical for our customers, we lay special emphasis on a focused approach to overcoming the various challenges they face. Enabling voice support is one of the key features in the Infosys roadmap. As a systems integrator, we engage with our customers in the early stages, help them with design thinking and guide them to the best solution in their context. We believe that, when it comes to voice interfaces, our customers need a comprehensive approach with solid small steps instead of jumping on to some visibly exciting option.

[Figure: Infosys NIA Chatbot Platform. Components shown: Channel Adapters (Default Text Adapter, Amazon Alexa Adapter, Voice Google Dialogflow Adapter, Facebook Messenger Adapter, Text Adapter); Natural Language Understanding (Intent Recognition, Intent Hierarchy, Entity Extraction, FAQ Extraction); Natural Language Generation; Small Talk; Conversation Management (Session Management, Context Management); Live Agent Escalation; Enterprise Integration Adapter; Enterprise Systems (ERP, ServiceNow, Salesforce, Knowledge/RPA).]

We have made early inroads in this capability. Infosys Chatbot is architected to leverage best-of-breed technologies to create conversational channels like Skype, Slack etc., while voice is one of the channels that can be realized through popular last-mile solutions like Google Home, Siri etc. The Chatbot Platform architecture is focused on enterprise integration, conversation management and addressing the non-functional requirements related to voice support, while being modularized to support the preferred voice channel. The platform provides customized integrations with leading commercial speech recognition options like NICE and Nuance Dragon, which claims to be the world's bestselling speech recognition software.
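The idea of modular channel support can be illustrated with a generic adapter pattern: each channel adapter converts its own payload into one common message that the conversation layer consumes. This is a sketch of the general pattern, not the platform's actual design. The payload fields (`sender`, `body`, `device_user`, `transcript`) and all class names are hypothetical and do not reflect any real channel's message format.

```python
from dataclasses import dataclass
from typing import Any, Dict, Protocol


@dataclass(frozen=True)
class Message:
    """Channel-independent message handed to conversation management."""
    channel: str
    user_id: str
    text: str


class ChannelAdapter(Protocol):
    def to_message(self, payload: Dict[str, Any]) -> Message: ...


class TextChannelAdapter:
    """Adapter for a hypothetical text chat payload."""

    def to_message(self, payload: Dict[str, Any]) -> Message:
        return Message("text", payload["sender"], payload["body"].strip())


class VoiceChannelAdapter:
    """Adapter for a hypothetical voice payload that already carries a transcript."""

    def to_message(self, payload: Dict[str, Any]) -> Message:
        return Message("voice", payload["device_user"], payload["transcript"].strip())


ADAPTERS: Dict[str, ChannelAdapter] = {
    "text": TextChannelAdapter(),
    "voice": VoiceChannelAdapter(),
}


def normalize(channel: str, payload: Dict[str, Any]) -> Message:
    """Convert a channel-specific payload into the common Message type."""
    return ADAPTERS[channel].to_message(payload)
```

With this shape, adding a new voice channel means registering one more adapter; intent recognition and conversation management stay unchanged.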

Summary

Conversation as an interface is the most natural way for humans to interact with technology.

Technology is evolving rapidly in order to allow people to interact with technology as naturally as they interact with each other. These technology advances will definitely bring more and more meaningful improvements in people’s experience in interactions with voice interfaces. Businesses are seeking to leverage this and rapidly extend reach across segments and improve convenience of use.

Reference:

• https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html

• https://www.nasdaq.com/article/voice-control-platforms-a-key-technology-trend-in-2017-cm745944

• https://www.kleinerperkins.com/perspectives/internet-trends-report-2018

• https://www.recode.net/2018/2/1/16958432/amazon-alexa-pulse-labs-jeff-bezos-madrona-skills-voice-apps-google

• https://github.com/mozilla/DeepSpeech

• https://cloud.google.com/speech-to-text/

About the Author

Vishal Manchanda is a Principal Technology Architect with Infosys Center for Emerging Technology Solutions and has over 19 years of experience in the IT Industry. Vishal is involved in developing, architecting and incubating various IT solutions in the area of personalized intelligent interfaces involving hyper-contextual personalized videos and engineering of the platform for conversational interfaces. Vishal works towards understanding how enterprises can adopt voice interfaces given a plethora of new voice channels and big cloud players in the fray.

For more information, contact [email protected]

© 2019 Infosys Limited, Bengaluru, India. All Rights Reserved. Infosys believes the information in this document is accurate as of its publication date; such information is subject to change without notice. Infosys acknowledges the proprietary rights of other companies to the trademarks, product names and such other intellectual property rights mentioned in this document. Except as expressly permitted, neither this documentation nor any part of it may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, printing, photocopying, recording or otherwise, without the prior permission of Infosys Limited and/or any named intellectual property rights holders under this document.

Infosys.com | NYSE: INFY