DeepSpeech @ WebSummerCamp


Alexandre Lissy · 2019-08-28

• Welcome, and thanks for attending!
• I'm Alexandre, working on the DeepSpeech team in the Paris Mozilla office.
• The purpose of this workshop is an introduction to leveraging speech recognition for the Web.
• I want this to be interactive and as hands-on as possible.


Outline

1 Why is Mozilla working on speech?
  What is DeepSpeech?
  DeepSpeech status
2 Tooling and description
  Virtual machine
3 NodeJS DeepSpeech service
  Basic NodeJS CLI
  A DeepSpeech REST API
  Capturing audio from a Webpage
  Using WebSocket and Streaming API
4 Producing a custom language model
  DeepSpeech models
  Command-specific language model

• Our workshop will follow this outline.

Next: Why is Mozilla working on speech?

• You might wonder why Mozilla is working on speech.
• Let's quickly have a look at the Common Voice talk.

Mozilla DeepSpeech

Definition
• Mozilla implementation of the DeepSpeech v1 Baidu paper
• FLOSS, end-to-end, production-grade speech recognition
• Originally based 100% on Baidu's paper, now with variations to allow streaming usage
• One-shot and streaming inference API in C, exposed to bindings (Python, NodeJS, Rust, Go, ...)
• Model training, dataset import, model export (protocol buffer, TFLite)

• Mozilla DeepSpeech aims at providing an end-to-end, machine-learning-based speech recognition engine, available under MPL 2.0.
• The first implementation was 100% Baidu's, with some limitations. We removed the bidirectional recurrent component to allow a more streaming-oriented usage.
• We want to make sure people can reproduce our models and build on top of them, so the pre-trained model and checkpoints are available under an appropriate license.
• We ship a ready-to-use English model (for now), as well as an API exposed in C with bindings in many languages.

Objectives for today

Workshop goals
• Running an English model from a NodeJS-based server app
  - First, very basic with HTTP
  - Second, using WebSocket and the Streaming API
• Producing and integrating a new, small language model for a voice-driven webapp
• W3C SpeechRecognition polyfill

• Goals for this workshop: the idea is to show you how one can rely on DeepSpeech to provide speech support.
• Being in a web context, I assumed JS and Node would be the easiest fit.
• The first two items should help you get familiar with the API.
• The third item shows how one can reuse the existing English model and produce a specialized language model for a subset of commands.
• The last item might be ambitious, but it's a way to play with the standard SpeechRecognition API, which is for now not available in Firefox.

Next: Tooling and description

Credentials

login: wsc
pass: websummer

Virtual machine
• TensorFlow in ~/DeepSpeech/tensorflow ; binaries under bazel-bin/native_client/
• TensorFlow virtualenv in ~/DeepSpeech/tf-venv/
• DeepSpeech in ~/DeepSpeech/DeepSpeech ; binaries under native_client/
• KenLM in ~/DeepSpeech/kenlm ; binaries under build/bin
• English model and checkpoints in ~/DeepSpeech/models and ~/DeepSpeech/checkpoints
• English audio samples in ~/DeepSpeech/audio
• libdeepspeech.so (LD_LIBRARY_PATH=$HOME/DeepSpeech/tensorflow/bazel-bin/native_client)
• Firefox Nightly and NodeJS v10.x

• Quick list of what is already set up in the VM.
• You should have everything needed, and in place, to rebuild the language model.
• It should even cover re-training or fine-tuning an existing model.
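For reference, the LD_LIBRARY_PATH hint from the list above, as an explicit shell line to run before launching anything that loads libdeepspeech.so:

export LD_LIBRARY_PATH=$HOME/DeepSpeech/tensorflow/bazel-bin/native_client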

Next: NodeJS DeepSpeech service

Getting familiar

Basics
• deepspeech --version
• deepspeech --help
• Run inference on one of the sample audio files
• DeepSpeech NodeJS bindings are already installed. If you need: npm install deepspeech@0.5.1

• Ensuring that everyone is able to run an inference from the NodeJS binary.
• Discovering the command-line arguments and versions (an example invocation is sketched below).

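As a sketch of the first exercise, an invocation along the following lines should work in the VM. The flag names are those of the 0.5.x deepspeech CLI (check deepspeech --help), and the file names under ~/DeepSpeech/models are assumed to be the standard release ones:

deepspeech --model $HOME/DeepSpeech/models/output_graph.pbmm \
           --alphabet $HOME/DeepSpeech/models/alphabet.txt \
           --lm $HOME/DeepSpeech/models/lm.binary \
           --trie $HOME/DeepSpeech/models/trie \
           --audio $HOME/DeepSpeech/audio/4507-16021-0012.wav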

DeepSpeech API and NodeJS

The C-level API
• The public-facing API is in native_client/deepspeech.h
• We try to keep it as stable as possible

NodeJS API
• Bindings generated using SWIG, with support for NodeJS v4.x to v12.x
• Defined in native_client/javascript/deepspeech.i
• Built with node-gyp and node-pre-gyp, bundling a pre-built libdeepspeech.so
• A combination of binding.gyp, index.js and a Makefile that copies the shared object
• Also supports ElectronJS from 1.6 to 5.0

• Now that you have run some inference and gotten a bit familiar, let's have a look at the C API.
• From this C API, you can see how the NodeJS one is derived: it's the same, except for some object-oriented wrapping.
• The bindings are generated using a patched version of SWIG, to support NodeJS from v4.x to v12.x; the patches are shared upstream, but nobody has had time to properly finish and merge them.
• The build steps are a bit tedious, because rebuilding our library is non-trivial and involves many long steps, so we can't expect people to do that on their side.
• Hence the node-pre-gyp usage, to load a pre-built libdeepspeech.so.

Quick cheat sheet

function Model() {}

Model.prototype.enableDecoderWithLM = function() {}
Model.prototype.stt = function() {}
Model.prototype.sttWithMetadata = function() {}
Model.prototype.setupStream = function() {}
Model.prototype.feedAudioContent = function() {}
Model.prototype.intermediateDecode = function() {}
Model.prototype.finishStream = function() {}
Model.prototype.finishStreamWithMetadata = function() {}

function DestroyModel(model) {}

• This is a quick reminder of how the C API is exposed in NodeJS.
• Basically, we have a deepspeech module that exports a Model and a few other functions from the API.
• Those other functions include printVersions, DestroyModel and FreeMetadata.

Using the API

How the NodeJS CLI binary is written
• Have a look inside native_client/javascript/index.js
• Instantiate a new Model object, loading a model file
• Get raw audio data (the format depends on the model file), e.g. WAVE PCM mono 16kHz 16 bits
• Feed the audio to the model and get back the transcription
• You can play with samples from ~/DeepSpeech/audio/

• Quick outline of the usage of the module: load a model, get audio, run inference.
• In this code, getting the raw audio relies on the sox tool, and files are read directly from disk.
• You can try and hack around, starting from the sketch below.

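A minimal one-shot inference sketch, modeled on client.js from the 0.5.x bindings; the constructor constants (26 MFCC features, context window of 9, beam width 500) are the defaults used there, and the model file names are assumptions based on that release:

const Ds = require('deepspeech');
const Fs = require('fs');

const MODELS = process.env.HOME + '/DeepSpeech/models';

// Arguments: model path, MFCC features, context window, alphabet, beam width
const model = new Ds.Model(MODELS + '/output_graph.pbmm', 26, 9,
                           MODELS + '/alphabet.txt', 500);

// The VM samples are already WAVE PCM mono 16 kHz 16 bits;
// a canonical WAV header is 44 bytes, the rest is raw PCM.
const wav = Fs.readFileSync(process.env.HOME + '/DeepSpeech/audio/4507-16021-0012.wav');
const pcm = wav.slice(44);

// Check the bundled client.js for the exact buffer handling: some 0.x
// versions slice the buffer to half its byte length (char* vs short* sizes).
console.log(model.stt(pcm, 16000));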

Exposing DeepSpeech as HTTP/REST: setup

• Start a new project, and set it up: npm install express body-parser (should already be okay in the VM)
• Run with node app.js
• Test with curl

var express = require("express");
var app = express();

app.listen(3000, () => {
  console.log("Server running on port 3000");
});

app.get("/test", (req, res, next) => {
  res.json(["Hello world"]);
});

• Make sure you start this project in a new directory.
• The express and body-parser packages should already be installed.
• This first step ensures that you get a working HTTP service.
• This should just be copy-pasting and running.

Exposing DeepSpeech as HTTP/REST: inference

• Before starting the express server, load the deepspeech module and the model
• Obviously, get inspired by client.js
• Start with /version, which prints the version to the process standard output
• Define a POST endpoint /wav, receiving raw WAVE PCM signed 16 kHz (no need for sox)
• Upon data received, call stt and return a JSON object with the transcription

$ curl -v -H \
  "Content-Type: application/octet-stream" \
  --data-binary @"$HOME/DeepSpeech/audio/4507-16021-0012.wav" \
  http://127.0.0.1:3000/wav

• It is now time to integrate the code from client.js into server-based code.
• You need to load the model before running the server.
• As a first test, you can add a /version endpoint that just calls DeepSpeech's printVersions.
• We will for now expect the client to send us properly formatted data, so you don't have to worry about transforming with sox.
• Expose an endpoint named /wav and, within this endpoint, read the raw payload and send it to the stt method (see the sketch below).
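A sketch of the two endpoints, assuming the Ds module and model object are loaded as in the one-shot sketch earlier, and using body-parser's raw mode for the application/octet-stream payload. The 44-byte skip assumes a canonical WAV header:

const bodyParser = require('body-parser');

// Accept the raw WAVE payload sent by the curl command above
app.use(bodyParser.raw({ type: 'application/octet-stream', limit: '50mb' }));

// Prints versions to the process standard output, as on the slide
app.get('/version', (req, res) => {
  Ds.printVersions();
  res.sendStatus(200);
});

app.post('/wav', (req, res) => {
  const pcm = req.body.slice(44);  // skip the canonical 44-byte WAV header
  res.json({ transcription: model.stt(pcm, 16000) });
});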

Getting audio

• Write a webpage that requests audio using the getUserMedia() API, with only audio, mono, 16kHz
  https://developer.mozilla.org/docs/Web/API/MediaDevices/getUserMedia
• Using the MediaRecorder API, produce WAV, or use sox server-side
  https://developer.mozilla.org/docs/Web/API/MediaRecorder_API
  Input: raw, floating-point, 32 bits, 44.1kHz
  Output: wavpcm, signed-integer, 16 bits, 16kHz
• Firefox about:config prefs for local testing:
  media.getusermedia.insecure.enabled=true
  media.getusermedia.audiocapture.enabled=true
  media.getusermedia.browser.enabled=true
• Feed this WAV audio to your NodeJS server, on the HTTP endpoint
• Or capture the stream from the browser (should be Opus) and process it server-side with sox

• Start by requesting access to the microphone, to be able to record audio from the user.
• You will likely have a permission prompt to accept.
• Make sure that you request only an audio signal, in mono and at 16kHz, as that is what our model is trained on.
• You may want to explore the MediaRecorder API as well.
• You can also rely on server-side sox to convert the browser's raw audio to WAV.
• Once you have some audio, you can feed it to the HTTP/REST API and get back a transcription (a browser-side sketch follows).
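A browser-side sketch under these constraints. Note that browsers may ignore the sampleRate hint and capture at 44.1/48 kHz, which is why the slide also offers the server-side sox route; the MediaRecorder output is typically Opus in a WebM/Ogg container, not WAV:

navigator.mediaDevices.getUserMedia({ audio: { channelCount: 1, sampleRate: 16000 } })
  .then((stream) => {
    const recorder = new MediaRecorder(stream);
    const chunks = [];
    recorder.ondataavailable = (e) => chunks.push(e.data);
    recorder.onstop = () => {
      // Ship the (likely Opus) blob to the server, which can convert it
      // to WAVE PCM mono 16 kHz with sox before calling stt()
      fetch('/wav', {
        method: 'POST',
        headers: { 'Content-Type': 'application/octet-stream' },
        body: new Blob(chunks, { type: recorder.mimeType }),
      });
    };
    recorder.start();
    setTimeout(() => recorder.stop(), 3000); // record ~3 seconds for testing
  });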

Streaming API

• Augment the NodeJS server with a WebSocket at /stream: npm install express-ws
• Set up streaming with model.setupStream()
• When there is data on the WebSocket, feed it to the model with model.feedAudioContent()
• Every 2 seconds, send a JSON payload with the model.intermediateDecode() content

• Now we are going to add a WebSocket to handle streaming. The express-ws package should already be installed.
• Create a new endpoint /stream that you will use.
• After the model is loaded, set up a stream with the appropriate model method.
• Now plug into the WebSocket and ensure that, as data is made available, you feed it to the model with the appropriate call.
• The model will run step by step as you feed data. You can access intermediate decoding values with the intermediate decode call.
• And every two seconds, you can send back data on the WebSocket (see the server sketch below).
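A server-side sketch with express-ws. The setupStream(150, 16000) arguments (pre-allocated frames, sample rate) and the buffer slicing follow the 0.5.x client.js and should be checked against the bundled bindings:

require('express-ws')(app);  // must run before app.ws() routes are declared

app.ws('/stream', (ws, req) => {
  const sctx = model.setupStream(150, 16000);
  const timer = setInterval(() => {
    // Every 2 seconds, push the intermediate transcription
    ws.send(JSON.stringify({ partial: model.intermediateDecode(sctx) }));
  }, 2000);
  ws.on('message', (msg) => {
    // msg: Buffer of 16-bit PCM samples at 16 kHz
    model.feedAudioContent(sctx, msg.slice(0, msg.length / 2));
  });
  ws.on('close', () => {
    clearInterval(timer);
    console.log('final:', model.finishStream(sctx));
  });
});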

Streaming from webapp

• Open a WebSocket to /stream
• Pull data from the browser
• Send audio data continuously
• Print the decoded string on a regular basis

• Let's now plug what you produced earlier into the current webapp.
• You will obviously have to open a connection to the WebSocket endpoint.
• Then you can send the raw audio stream from the browser to the server.
• And upon regular reception of the JSON payload with the decoded string, you can print it (a client-side sketch follows).
• At that point, you are running a DeepSpeech-powered server with your app.
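A client-side sketch: print partial transcriptions as they arrive, and convert the captured Float32 samples to Int16 before sending. ScriptProcessorNode was the practical option at the time; down-sampling from the context rate (usually 44.1/48 kHz) to 16 kHz is left out, so treat this as a skeleton:

const ws = new WebSocket('ws://localhost:3000/stream');
ws.onmessage = (e) => console.log(JSON.parse(e.data).partial);

const audioCtx = new AudioContext();
navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
  const source = audioCtx.createMediaStreamSource(stream);
  const proc = audioCtx.createScriptProcessor(4096, 1, 1);
  source.connect(proc);
  proc.connect(audioCtx.destination);
  proc.onaudioprocess = (e) => {
    const f32 = e.inputBuffer.getChannelData(0);  // Float32 in [-1, 1]
    const i16 = new Int16Array(f32.length);
    for (let i = 0; i < f32.length; i++) {
      i16[i] = Math.max(-1, Math.min(1, f32[i])) * 0x7FFF;
    }
    if (ws.readyState === WebSocket.OPEN) ws.send(i16.buffer);
  };
});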

Next: Producing a custom language model

The DeepSpeech models

Acoustic model
• Core of the speech recognition engine
• Implemented with TensorFlow
• Produces characters from audio waves

Language model
• Required to improve the quality of the decoding
• Implemented with KenLM
• Rectifies the acoustic decoding based on language rules

• A quick explanation of the speech recognition process.
• Two models are required to properly process human voice.
• First, an acoustic model is in charge of decoding audio waves into characters as well as possible, and thus producing words.
• Second, a language model, trained on a separate, grammatically correct corpus, learns about the language itself.
• This helps fix ambiguities that can be solved by grammatical or word context (e.g. preferring "I read a book" over "I red a book"), as well as perform some orthographic correction.
• The idea in this section is to use this capability to coerce the acoustic decoding onto a subset of well-known commands.

Definition and production

Set of commands
• One or more words define a command
• One command per line
• Plain text file, using only the English (DeepSpeech) alphabet.txt

Production
• Produce an ARPA file, then a binary tree file (KenLM)
• Produce an order-2 ARPA file
• Produce a TRIE file (for DeepSpeech, using the 0.5.1 generate_trie)
• Documented in data/lm/README.md

• Producing a new language model is pretty simple.
• First you need to define it in a text file, with one sentence or command per line.
• Make sure you use only characters that appear in the alphabet.txt file.
• Then follow the few commands documented in data/lm/README.md to produce the files (sketched below).
• At the end, you should have a binary tree file along with a trie file. That's it!
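A sketch of the production steps using the VM's KenLM binaries; the exact flags are the ones data/lm/README.md documents for the release, so double-check there. commands.txt and the output names are hypothetical, and --discount_fallback is usually required for very small corpora:

cd ~/DeepSpeech/kenlm/build/bin
./lmplz --order 2 --discount_fallback --text commands.txt --arpa commands.arpa
./build_binary -a 255 -q 8 trie commands.arpa commands.binary
# generate_trie ships with the DeepSpeech 0.5.1 native client
~/DeepSpeech/DeepSpeech/native_client/generate_trie alphabet.txt commands.binary commands.trie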

Integration into server / webapp

model.enableDecoderWithLM(...)

• Add or update the call to this method, passing the proper files you generated
• That's all: decoding is done with the language model at both the intermediate and final decoding steps
• Integrate into your app by reacting to commands

• Using the new LM is trivial: you just have to change the files targeted in the enableDecoderWithLM call (see the sketch below).
• If you have properly followed the previous steps, there should be no error.
• When you request decoding from the model, whether intermediate or final, the language model will be applied.
• It is now time to make use of the command that has been spoken and do something in your webapp.
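In the server, the call can look like this sketch. The signature (alphabet, LM binary, trie, alpha, beta) and the 0.75/1.85 defaults are taken from the 0.5.x client.js, and the file names are the hypothetical ones from the production sketch above:

const LM_ALPHA = 0.75;  // language model weight
const LM_BETA = 1.85;   // word insertion weight

model.enableDecoderWithLM(process.env.HOME + '/DeepSpeech/models/alphabet.txt',
                          'commands.binary', 'commands.trie',
                          LM_ALPHA, LM_BETA);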
