Recognizing Voice Commands with Microsoft Speech API

Speech Walkthrough: C#

About this Walkthrough

In the Kinect™ for Windows® Software Development Kit (SDK) Beta, Speech is a C# console application that demonstrates how to use the microphone array in the Kinect for Xbox 360® sensor with the Microsoft Speech API (SAPI) to recognize voice commands. This document is a walkthrough of the beta SDK Speech sample application.

Resources

For a complete list of documentation for the Kinect for Windows SDK Beta, plus related reference material and links to the online forums, see the beta SDK website at: http://kinectforwindows.org

License: The Kinect for Windows SDK Beta is licensed for non-commercial use only. By installing, copying, or otherwise using the beta SDK, you agree to be bound by the terms of its license. Read the license.

Disclaimer: This document is provided “as-is”. Information and views expressed in this document, including URL and other Internet Web site references, may change without notice. You bear the risk of using it. This document does not provide you with any legal rights to any intellectual property in any Microsoft product. You may copy and use this document for your internal, reference purposes.

© 2011 Microsoft Corporation. All rights reserved. Microsoft, DirectX, Kinect, MSDN, and Windows are trademarks of the Microsoft group of companies. All other trademarks are property of their respective owners.

a. Introduction

The audio component of the Kinect™ for Xbox 360® sensor is a four-element microphone array. An array provides significant advantages over a single microphone, including more sophisticated acoustic echo cancellation and noise suppression, and the ability to use beamforming algorithms, which allow the array to function as a steerable directional microphone.

One key aspect of a natural user interface (NUI) is speech recognition. The Kinect sensor’s microphone array is an excellent input device for speech recognition-based applications: it provides better sound quality than a comparable single microphone and is much more convenient to use than a headset.

The Speech sample shows how to use the Kinect sensor’s microphone array with the Microsoft.Speech API to recognize voice commands. For an example of how to implement a managed application that captures an audio stream from the Kinect sensor’s microphone array, see “RecordAudio Walkthrough” on the website for the Kinect for Windows® Software Development Kit (SDK) Beta. For examples of how to implement a C++ application that captures an audio stream from the microphone array, see “MicArrayEchoCancellation Walkthrough,” “AudioCaptureRaw Walkthrough,” and “MFAudioFilter Walkthrough” on the beta SDK website.

Before attempting to compile the Speech application, you must first install the following:

- Microsoft Speech Platform – Software Development Kit (SDK), version 10.2 (x86 edition)
- Microsoft Speech Platform – Server Runtime, version 10.2 (x86 edition). The beta SDK runtime is x86-only, so you must download the x86 version of the speech runtime.
- Kinect for Windows Runtime Language Pack, version 0.9 (acoustic model from the Microsoft Speech Platform for the beta SDK)

Note: The online documentation for the Microsoft.Speech API on the Microsoft® Developer Network (MSDN®) is limited. You should instead refer to the HTML Help file (CHM) that is included with the Microsoft Speech Platform SDK. It is located at Program Files\Microsoft Speech Platform SDK\Docs.

b. Program Basics

Speech is installed with the Kinect for Windows Software Development Kit (SDK) Beta samples in %KINECTSDK_DIR%\Samples\KinectSDKSamples.zip. Speech is a C# console application that is implemented in a single file, Program.cs.

Important: Speech targets the x86 platform.

The basic program flow is as follows:

1. Create an object to represent the Kinect sensor’s microphone array.
2. Create a speech recognition object and specify a grammar.
3. Respond to commands.

To use Speech:

1. Build the application.
2. Press Ctrl+F5 to run the application.

3. Face the Kinect sensor and say “red,” “green,” or “blue.”

For each spoken command, the speech recognition engine prints notifications that include:

- Which member of the command set best fits the spoken command.
- A confidence value for that estimate, which is the engine’s estimate of the probability that the word was correctly recognized.
- Whether the command was recognized or rejected as not part of the command set.

An example is shown in the following sample output, where the spoken words were “red,” “blue,” and “yellow.”

Using: Microsoft Server Speech Recognition Language - Kinect (en-US)
Recognizing. Say: 'red', 'green' or 'blue'. Press ENTER to stop
Speech Hypothesized: red
Speech Recognized: red
Speech Hypothesized: blue
Speech Recognized: blue
Speech Hypothesized: green
Speech Rejected

Writing file: RetainedAudio_4.wav
Stopping recognizer ...

The remainder of this document walks you through the application.

Note: This document includes code examples, most of which have been edited for brevity and readability. In particular, most routine error-handling code has been removed. For the complete code, see the Speech sample. Hyperlinks in this walkthrough display reference content on the MSDN website.

c. Create and Configure an Audio Source Object

The KinectAudioSource object represents the Kinect sensor’s microphone array. Behind the scenes, it uses the MSRKinectAudio Microsoft DirectX® Media Object (DMO), as described in detail in “MicArrayEchoCancellation Walkthrough” on the beta SDK website.

Most of the sample is implemented in Main. The first step is to create and configure KinectAudioSource, as follows:

static void Main(string[] args)
{
    using (var source = new KinectAudioSource())
    {
        source.FeatureMode = true;
        source.AutomaticGainControl = false;
        source.SystemMode = SystemMode.OptibeamArrayOnly;
        ...
    }

    ...
}

You configure KinectAudioSource by setting various properties, which map directly to the MSRKinectAudio DMO’s property keys. For details, see the reference documentation. The Speech application configures KinectAudioSource as follows:

- Feature mode is enabled.
- Automatic gain control (AGC) is disabled. AGC must be disabled for speech recognition.
- The system mode is set to an adaptive beam without acoustic echo cancellation (AEC). In this mode, the microphone array functions as a single directional microphone that is pointed within a few degrees of the audio source.

d. Create a Speech Recognition Engine

Speech creates a speech recognition engine, as follows:

static void Main(string[] args)
{
    using (var source = new KinectAudioSource())
    {
        ...
        RecognizerInfo ri = SpeechRecognitionEngine.InstalledRecognizers()
                                .Where(r => r.Id == RecognizerId)
                                .FirstOrDefault();
        using (var sre = new SpeechRecognitionEngine(ri.Id))
        {
            ...
        }
    }
    ...
}

SpeechRecognitionEngine.InstalledRecognizers is a static method that returns a list of the speech recognition engines installed on the system. Speech uses a Language-Integrated Query (LINQ) expression to find the recognizer whose ID matches the sample’s RecognizerId constant and returns the result as a RecognizerInfo object. Speech then uses RecognizerInfo.Id to create a SpeechRecognitionEngine object.

e. Specify the Commands

Speech uses command recognition to recognize three voice commands: “red,” “green,” and “blue.” You specify these commands by creating and loading a grammar that contains the words to be recognized, as follows:

static void Main(string[] args)
{
    using (var source = new KinectAudioSource())
    {
        ...
        using (var sre = new SpeechRecognitionEngine(ri.Id))
        {
            var colors = new Choices();

            colors.Add("red");
            colors.Add("green");
            colors.Add("blue");

            var gb = new GrammarBuilder();
            gb.Culture = ri.Culture;
            gb.Append(colors);

            var g = new Grammar(gb);
            sre.LoadGrammar(g);

            sre.SpeechRecognized += SreSpeechRecognized;
            sre.SpeechHypothesized += SreSpeechHypothesized;
            sre.SpeechRecognitionRejected += SreSpeechRecognitionRejected;
            ...
        }
    }
}

The Choices object represents the list of words to be recognized. To add words to the list, call Choices.Add. After completing the list, create a new GrammarBuilder object, which provides a simple way to construct a grammar, and set its culture to match that of the recognizer. Then pass the Choices object to GrammarBuilder.Append to define the grammar elements. Finally, load the grammar into the speech recognition engine by calling SpeechRecognitionEngine.LoadGrammar.

Each time you speak a word, the speech recognition engine compares your speech with the templates for the words in the grammar to determine whether it is one of the recognized commands. However, speech recognition is an inherently uncertain process, so each attempt at recognition is accompanied by a confidence value. The speech recognition engine raises the following three events:

- The SpeechRecognitionEngine.SpeechHypothesized event occurs for each attempted command. It passes the event handler a SpeechHypothesizedEventArgs object that contains the best-fitting word from the command set and a measure of the estimate’s confidence. Note: The Kinect for Windows Runtime Language Pack for this beta SDK does not have a reliable confidence model, so the result’s Confidence property is not used.
- The SpeechRecognitionEngine.SpeechRecognized event occurs when an attempted command is recognized as a member of the command set. It passes the event handler a SpeechRecognizedEventArgs object that contains the recognized command.
- The SpeechRecognitionEngine.SpeechRecognitionRejected event occurs when an attempted command is rejected because it does not match any member of the command set with sufficient confidence. It passes the event handler a SpeechRecognitionRejectedEventArgs object.
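Speech’s own handlers only print what was heard; an application that actually responds to commands (step 3 of the basic program flow) would typically map the recognized text to an action inside its SpeechRecognized handler. The following standalone sketch illustrates the idea; the Commands table and Dispatch helper are hypothetical and are not part of the sample:

```csharp
using System;
using System.Collections.Generic;

class CommandDispatch
{
    // Hypothetical command table: maps each grammar word to an action target.
    static readonly Dictionary<string, ConsoleColor> Commands =
        new Dictionary<string, ConsoleColor>(StringComparer.OrdinalIgnoreCase)
        {
            { "red",   ConsoleColor.Red },
            { "green", ConsoleColor.Green },
            { "blue",  ConsoleColor.Blue },
        };

    // Shaped like the body of a SpeechRecognized handler: the recognized
    // text (e.Result.Text in the real handler) selects the action.
    public static string Dispatch(string recognizedText)
    {
        if (Commands.TryGetValue(recognizedText, out var color))
        {
            Console.ForegroundColor = color;
            return "Switched console color to " + color;
        }
        return "Not a known command";
    }

    static void Main()
    {
        Console.WriteLine(Dispatch("red"));     // a grammar word
        Console.WriteLine(Dispatch("yellow"));  // outside the grammar
        Console.ResetColor();
    }
}
```

In the real application a word outside the grammar never reaches SpeechRecognized, because the engine raises SpeechRecognitionRejected instead; the fallback branch here simply keeps the sketch self-contained.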
Speech subscribes to all three events and implements the handlers, as follows:

static void SreSpeechHypothesized(object sender, SpeechHypothesizedEventArgs e)

{
    Console.Write("\rSpeech Hypothesized: \t{0}", e.Result.Text);
}

static void SreSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
    Console.WriteLine("\nSpeech Recognized: \t{0}", e.Result.Text);
}

static void SreSpeechRecognitionRejected(object sender, SpeechRecognitionRejectedEventArgs e)
{
    Console.WriteLine("\nSpeech Rejected");
    if (e.Result != null)
        DumpRecordedAudio(e.Result.Audio);
}

The first two handlers simply print the key data from the event object. The SreSpeechRecognitionRejected handler also calls a private DumpRecordedAudio method to write the audio for the rejected word to a WAV file. For details, see the sample.

f. Recognize Commands

After the speech recognition engine has been configured, all that Speech needs to do is start the process. The engine automatically attempts to recognize the words in the grammar and raises events as appropriate, as shown in the following code example:

static void Main(string[] args)
{
    using (var source = new KinectAudioSource())
    {
        ...

        using (var sre = new SpeechRecognitionEngine(ri.Id))
        {
            ...
            using (Stream s = source.StartCapture(3))
            {
                sre.SetInputToAudioStream(s,
                    new SpeechAudioFormatInfo(EncodingFormat.Pcm, 16000, 16, 1, 32000, 2, null));

                sre.RecognizeAsync(RecognizeMode.Multiple);
                Console.ReadLine();
                Console.WriteLine("Stopping recognizer ...");

                sre.RecognizeAsyncStop();
            }
        }
    }

}

Speech starts capturing audio from the Kinect sensor’s microphone array by calling KinectAudioSource.StartCapture. Then Speech does the following:

1. Calls SpeechRecognitionEngine.SetInputToAudioStream to specify the audio source and its characteristics.
2. Calls SpeechRecognitionEngine.RecognizeAsync with RecognizeMode.Multiple to start asynchronous recognition. The engine runs on a background thread until the user stops the process by pressing ENTER.
3. Calls SpeechRecognitionEngine.RecognizeAsyncStop to stop the recognition process and terminate the engine.
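The numeric arguments passed to the SpeechAudioFormatInfo constructor describe 16-bit, 16-kHz, single-channel PCM audio. The average-bytes-per-second and block-alignment values (32000 and 2) are derived from the first three numbers; the following standalone check (not part of the sample) shows the arithmetic:

```csharp
using System;

class FormatArithmetic
{
    static void Main()
    {
        // PCM format as passed to SetInputToAudioStream.
        int samplesPerSecond = 16000;
        int bitsPerSample = 16;
        int channelCount = 1;

        // Block alignment: bytes in one sample frame across all channels.
        int blockAlign = channelCount * (bitsPerSample / 8);

        // Average bytes per second for uncompressed PCM.
        int averageBytesPerSecond = samplesPerSecond * blockAlign;

        // Matches the 2 and 32000 passed to the SpeechAudioFormatInfo constructor.
        Console.WriteLine("{0} {1}", blockAlign, averageBytesPerSecond);  // prints "2 32000"
    }
}
```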

For More Information

For more information about implementing audio and related samples, see the Programming Guide page on the Kinect for Windows SDK Beta website at: http://kinectforwindows.org
