
Implementing WaveNet Using Intel® Stratix® 10 NX FPGA for Real-Time Speech Synthesis



Authors
Jonny Shipton, Software Lead, Myrtle.ai
Jon Fowler, FPGA Lead, Myrtle.ai
Chris Chalmers, Senior Developer, Myrtle.ai
Sam Davis, Head of Machine Learning, Myrtle.ai
Sam Gooch, Machine Learning Engineer, Myrtle.ai
Giuseppe Coccia, Machine Learning Engineer, Myrtle.ai

Table of Contents
Abstract
I. Introduction
II. Model Description
III. Model Compression
  BFP16 Quantization
IV. Hardware Implementation
  FPGA Processing Architecture
  Using Intel Stratix 10 NX FPGA AI Tensor Blocks
  Implementing Dilated Convolutions on the FPGA
  Programming Model
V. Results
  Methodology
  Model Quality
  Model Performance
VI. Conclusion
References

Abstract
Real-time text-to-speech models are widely deployed in interactive voice services such as voice assistants and smart speakers. However, deploying these models is often challenging due to the strict latency requirements that they face. We present an implementation of WaveNet, a state-of-the-art vocoder, that can generate 256 16 kHz audio streams at near-human-level quality in real time: 8 times higher throughput than a hand-optimized GPU solution. Our implementation is enabled by the Intel® Stratix® 10 NX FPGA, demonstrating the capability of FPGAs to deliver high-throughput, low-latency inference for real-world applications powered by neural networks.

I. Introduction
The increasing use of voice assistants has been powered, in large part, by advances in the field of Text-to-Speech (TTS). State-of-the-art speech synthesis systems consist of two neural networks that are run sequentially to generate audio. The first model takes text as input and generates acoustic features, such as spectrograms, while the second model, referred to as a vocoder, takes these intermediate features and produces speech. Tacotron 2 is often used as the first model. In this paper, we focus on the second model in the speech synthesis system. WaveNet [1] is a state-of-the-art vocoder that is capable of producing speech with near-human-level naturalness [2]. The key to the model's quality is its autoregressive loop, but this property also makes the network exceptionally challenging to implement for real-time applications.

As a result, TTS research has focused on finding alternative vocoder architectures such as Parallel-WaveNet [3], WaveRNN [4], ClariNet [5] and WaveGlow [6] that are better suited to efficient inference on existing hardware. There is a degree of ambiguity as to the highest quality vocoder, as audio quality evaluation is subjective, but all authors agree the original WaveNet architecture produces at least as good, if not higher, quality audio than more recent approaches [3], [6]–[9].

There have been some efforts made to accelerate the WaveNet model directly rather than using an alternative architecture, including some other FPGA solutions [10]. However, these are not known to achieve real-time audio synthesis. The implementation with the highest performance that we have been able to identify is the NVIDIA* nv-wavenet¹, which can achieve real-time audio synthesis using highly-optimised hand-written CUDA kernels.

In this paper, we implement a WaveNet model with approximately the same number of parameters in the repeated part of the network as the largest configuration of nv-wavenet available in the example repository. The repeated part of the network is the most latency-sensitive section. By using Block Floating Point (BFP16) quantization and the Intel Stratix 10 NX FPGA, we are able to deploy this model and produce 256 16 kHz streams in real time. We show that the FPGA implementation remains efficient at higher frequencies, demonstrating 160 24 kHz streams and 128 32 kHz streams in real time.

The rest of this paper is structured as follows: the WaveNet model architecture is presented in more detail in Section II. The concepts of model compression by the use of quantization and the BFP16 data format are discussed in Section III. Section IV describes the implementation of the WaveNet model on the Intel Stratix 10 NX FPGA, including the use of AI Tensor Blocks, BFP16 computation and high-bandwidth memory (HBM). In Section V we present our results, including the performance of the implementation as well as the quality of the generated audio. Finally, in Section VI, we summarize our results and interpret their importance for future deployments of neural networks in the data center with FPGAs.

¹ https://github.com/NVIDIA/nv-wavenet

                               TYPE               # PARAMETERS               GOP/SECOND AUDIO
                                                  Per-layer    Total         Per-layer   Total
Pre-processing Layers
  Embedding                    Embedding          -            30,720        -           -
  Feature Upsample             ConvTranspose1d    -            5,120,080     -           0.82
Repeated Layers
  Dilation                     Dilated Conv1d     57,840       925,440       1.84        29.49
  Conditional                  Conv1d             19,440       311,040       0.61        9.83
  Residual                     Conv1d             14,520       217,800       0.46        6.91
  Skip                         Conv1d             29,040       464,640       0.92        14.75
Post-processing Layers
  Out                          Conv1d             -            61,440        -           1.96
  End                          Conv1d             -            65,536        -           2.09
Total                                             -            7,196,696     -           65.85

Table 1. A Breakdown of the Model's Parameters and Giga-Operations Required per Second of Synthesized Audio Output.
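The figures in Table 1 follow directly from the layer shapes and the 16,000 forward passes needed per second of audio. As a sanity check, the short Python sketch below (not from the paper; the 2-operations-per-multiply-accumulate convention and the inference that the residual convolution is absent from the final repeated layer are assumptions made to match the table) reproduces the repeated-layer rows from the configuration given in Section II.

```python
# A cross-check of the repeated-layer rows of Table 1, using the configuration
# from Section II (r = 120, s = 240, L = 16) and 16,000 forward passes per
# second of 16 kHz audio. Layer shapes are read off Figure 1; counting 2 ops
# per multiply-accumulate reproduces the table.
r, s, L = 120, 240, 16
steps_per_second = 16_000

def conv1d(c_in, c_out, kernel=1):
    """Return (weights, weights + biases) for a 1-D convolution."""
    weights = kernel * c_in * c_out
    return weights, weights + c_out

layers = {
    # name: ((weights, params per layer), number of layers containing this conv)
    "Dilation    (2x1 conv, r -> 2r)":  (conv1d(r, 2 * r, kernel=2), L),
    "Conditional (1x1 conv, 80 -> 2r)": (conv1d(80, 2 * r),          L),
    # The totals in Table 1 imply the residual convolution is present in only
    # L - 1 = 15 layers (no residual output is needed from the final layer).
    "Residual    (1x1 conv, r -> r)":   (conv1d(r, r),               L - 1),
    "Skip        (1x1 conv, r -> s)":   (conv1d(r, s),               L),
}

for name, ((weights, params), n) in layers.items():
    gops = 2 * weights * steps_per_second / 1e9   # multiply + add per weight
    print(f"{name}: {params:,} params/layer, {params * n:,} total, "
          f"{gops:.2f} GOP/s per layer, {gops * n:.2f} GOP/s total")
```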

II. Model Description
The task of audio generation is that of producing a sequence of values [x_1, ..., x_n]. Just like in a regular audio file, each value x_i is an integer representing a discrete sample from a waveform. There are two critical parameters of any audio stream: the sample rate, which is the number of these samples that make up a single second of audio, measured in Hertz; and the bit depth, which is the range of discrete values that each x_i can take. Increasing these parameters improves the quality of the audio stream but requires the production of more samples per second or an increase in the number of computations required for each sample. In this paper, we use a sample rate of 16 kHz and a bit depth of 16, giving each sample a range of 65,536 possible values, as we find these to achieve both near-human-level quality and high performance.

WaveNet is a convolutional neural network that models the conditional probability of producing a sample given all previous samples: p(x_t | x_1, ..., x_{t-1}). The input to the first layer of the model at a given step is an embedding of the model output from the previous time step. The core of the network consists of a series of repeated layers that each contain four convolutions; see Figure 1. The first convolution in each repeated layer is a dilated convolution whose input is the concatenation of the output of the previous layer at the current step, t, and at a past step. The past step for the kth repeated layer depends on the dilation cycle parameter D and is set to step t - 2^(k mod D).

Each repeated layer also produces a skip output. The skip outputs are summed and used to produce the final model output after post-processing with two more convolutional layers. In addition to the outputs from the previous layer, each repeated layer also receives a conditional input that allows control over what audio is generated. The conditional inputs are generated by applying an upsampling transposed convolution to conditional features. In early versions of WaveNet these conditional features were simple linguistic features; however, current approaches typically use spectrograms generated by a separate model such as Tacotron 2 [2] or Deep Voice 3 [11].

After the two post-processing convolutional layers are applied, the model samples from the probability distribution p(x_t | x_1, ..., x_{t-1}); we find sampling to produce higher fidelity speech than other methods such as selecting the value with highest probability. The model produces audio output that has a bit depth of 16. However, to reduce the computational overhead of producing 2^16 = 65,536 values, the model instead generates 8-bit µ-law encoded values that are then decoded to produce the final 16-bit audio sample.

At inference time, calculating the output for one step requires the value of the previous step, so it is necessary to calculate each step individually and sequentially. For 16 kHz audio it is necessary to run the model 16,000 times per second to achieve real-time audio generation. This means that a single forward pass through the model must take at most 62.5 µs.

WaveNet is parameterized by the number of channels in its convolutions: the number of skip, residual and audio channels, as well as by the number of repeated layers and the dilation cycle parameter D. In this paper we use a model with s = 240 skip channels, r = 120 residual channels, a = 256 audio channels, L = 16 repeated layers and D = 8. This model has 7.2 million parameters, as shown in Table 1, and requires approximately 65.85 GFLOPs of computation to generate one second of audio. We map the ConvTranspose1d layer to the CPU and all other layers to the FPGA. This significantly reduces the model parameters that must be held on the FPGA, requiring approximately 2.1 million parameters to be stored on the FPGA overall. The layers running on the FPGA still account for the majority of the computation, requiring 98.8% of the total operations of the model.
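To make the audio-format discussion above concrete, the sketch below shows the standard µ-law companding used by WaveNet [1]: the network emits one of a = 256 classes per step, and each class is expanded back to a 16-bit sample. This is the textbook transform rather than code from the implementation described in this paper.

```python
# Sketch of the 8-bit mu-law companding described above: the network emits one
# of a = 256 classes per step, which is then expanded back to a 16-bit sample.
# Illustrative only; not taken from the paper's implementation.
import numpy as np

MU = 255  # a - 1 quantization levels

def mu_law_encode(x):
    """Map waveform values in [-1, 1] to integer classes in [0, 255]."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return ((y + 1) / 2 * MU + 0.5).astype(np.int64)

def mu_law_decode(idx):
    """Map integer classes back to [-1, 1], then to 16-bit PCM."""
    y = 2 * (idx.astype(np.float64) / MU) - 1
    x = np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU
    return (x * 32767).astype(np.int16)

samples = np.linspace(-1, 1, 5)
codes = mu_law_encode(samples)
print(codes, mu_law_decode(codes))
# Real-time constraint: 16,000 decodes per second means each full forward pass
# of the network must finish within 1 / 16,000 s = 62.5 microseconds.
```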
III. Model Compression
Quantization is one technique for model compression. Quantizing a model consists of using a different numerical format for the parameters and/or activations. This lowers the power consumption and latency, as both the memory requirements are reduced and the operations can be executed more efficiently in hardware.


[Figure 1 shows the network as a block diagram. On the CPU, an 800x1, 200-strided transposed convolution upsamples the 80-feature mel spectrogram input. On the FPGA, the 16 kHz µ-law audio input (features in [0, a)) is embedded to r channels and passed through L repeated layers, each containing a 2x1 dilated convolution and a 1x1 conditional convolution (both with 2r outputs) feeding a tanh/sigmoid gated activation, followed by 1x1 residual (r outputs) and skip (s outputs) convolutions with a channel split. The summed skip outputs pass through two ReLU + 1x1 convolution stages (a outputs), the next µ-law value is sampled, and µ-law decoding produces the 16-bit, 16 kHz output stream.]

Figure 1. The WaveNet architecture, where the number of skip, residual and audio channels is parameterised by s, r and a, and the number of layers by L.

Converting parameters and activations to a different number format may require not only a reduction in the number of bits used to represent each value (e.g., 16 bits instead of 32 bits), but also a change to the arithmetic used when executing operations with these values (e.g., integer arithmetic instead of floating point).

There are two approaches that can be followed to quantize a model. The first approach is post-training quantization (PTQ), which converts the parameters to the desired numerical format after the model has been trained. This approach is simple but, in some cases, can negatively impact the quality or accuracy of the model. The second approach, Quantization Aware Training (QAT) [12], mitigates the quality degradation by simulating the target numerical precision for the weights during training. In this way, the model is better able to remove the noise generated by the quantization process.

BFP16 Quantization
The Intel Stratix 10 NX FPGA incorporates AI Tensor Blocks that compute with Block Floating Point 16 (BFP16). BFP16 is a number format that aims to reduce the quantization error compared to the IEEE standard 16-bit floating point format (FP16) by keeping the same number of bits but changing the process by which the values are quantized. It is a number format that is used especially in FPGA and ASIC processors because it allows the use of floating point arithmetic on their fixed-point arithmetic units, which are much more efficient than floating point units.

To quantize a cluster of numbers to BFP16, it is necessary to define the desired block size N. The target cluster is divided into several blocks with a size determined by this parameter, so that each block X becomes

    X = (x_1, ..., x_i, ..., x_N)

where x_i is the ith value in the cluster. The following step is to find the maximum exponent ε_X for each block:

    ε_X = max_{i ∈ [1, N]} e_i

where e_i is the exponent of the ith number in the cluster. This exponent is the one shared by all numbers inside the target block after quantization, whereas the mantissa values are shifted so that they match the maximum exponent:

    m'_i = m_i >> (ε_X - e_i)

where m'_i is the mantissa of the ith number after quantization, m_i is the mantissa of the ith number before quantization, and >> represents the bitwise right-shift operation.

The block floating point representation saves both memory and interconnect bandwidth because, if L_e is the number of bits for the exponent, L_m is the number of bits for the mantissa and 1 bit is used to represent the sign, then the average length of n numbers is 1 + L_m + L_e/n bits. Conversely, with a floating point representation the average length of n numbers is 1 + L_m + L_e bits, because all numbers have different exponent values.

BFP16 uses 7 bits for the mantissa, 8 bits for the exponent and 1 bit for the sign. The block size is an important parameter because it controls the trade-off between the error generated during the quantization process and the memory/bandwidth required for each operation. The bigger the block size, the bigger the quantization error, but the less space and bandwidth are required for the representation.
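The two equations above translate directly into a small quantization routine. The following Python sketch is illustrative only: a block size of 10 is assumed to match the AI Tensor Block configuration described in Section IV, and the mantissa handling is simplified; it is not the FPGA implementation.

```python
# A minimal sketch of the block quantization described above, assuming a block
# size of N = 10 and a 7-bit mantissa: take the maximum exponent in the block,
# then shift each value's mantissa onto that shared exponent.
import numpy as np

MANTISSA_BITS = 7   # BFP16: 1 sign bit, 7 mantissa bits, 8 shared exponent bits

def bfp_quantize_block(block):
    """Quantize one block of floats to a shared-exponent representation."""
    exponents = np.frexp(block)[1]                 # e_i for each value
    shared_exp = exponents.max()                   # eps_X = max_i e_i
    scale = 2.0 ** (shared_exp - MANTISSA_BITS)
    # Right-shifting each mantissa by (eps_X - e_i) bits is equivalent to
    # rounding the value onto the grid defined by the shared exponent.
    mantissas = np.clip(np.round(block / scale),
                        -(2 ** MANTISSA_BITS - 1),
                        2 ** MANTISSA_BITS - 1).astype(np.int32)
    return mantissas, shared_exp, mantissas * scale   # dequantized view

block = np.array([0.031, -0.52, 0.004, 0.25, -0.11, 0.07, 0.9, -0.33, 0.0, 0.2])
m, e, approx = bfp_quantize_block(block)
print(e, m)
print(np.abs(block - approx).max())   # quantization error grows with block size
```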


[Figure 2 shows the MAU Core as a block diagram: bfloat16 activations arrive from the host or the inter-core network and pass through an input circuit (floating point conversion and input select) into the activation store; a BFP16 weight store and the activation store feed the AI Tensor Blocks of the matrix-vector unit, with a long-term activation store held in HBM; the post-compute stage performs accumulate/bias and post operations before the bfloat16 outputs are split back to the host and the network.]

Figure 2. The MAU Core Architecture on Intel Stratix 10 NX FPGA.

IV. Hardware Implementation
We implement the WaveNet model on an Intel Stratix 10 NX FPGA. All layers are processed on the FPGA with the exception of the ConvTranspose1d layer at the input. This preprocessing layer is not dependent on any previously generated values and so it can be performed outside of the core autoregressive WaveNet loop. It is also processed less frequently than the autoregressive loop, once every 12.5 ms, and without the latency requirement of the main autoregressive loop. We implement this layer on the CPU so that the FPGA processing can be dedicated to the latency-critical autoregressive part of the network. The remainder of the network runs on the FPGA, including the 1x1 conv layer that follows the conditional upsampling layer. While this layer is not strictly part of the core autoregressive loop, we place it on the FPGA to make use of the computation capability provided, as it contributes a significant number of operations to the overall computation in the network.

FPGA Processing Architecture
To realise the implementation of the network on the FPGA, we use the Myrtle.ai programmable MAU* Core architecture. The MAU Core is a programmable processing engine for deep neural networks that overlays the FPGA fabric to provide a flexible and run-time configurable inference engine. We place 4 MAU Cores, optimized for the Intel Stratix 10 NX FPGA architecture, onto the FPGA. We floorplan the MAU Cores in the Intel® Quartus® Prime software to ensure that a high level of logic utilization can be achieved whilst retaining a high clock frequency. The FPGA uses a core processing frequency of 240 MHz in this design.

Each MAU Core connects to the host CPU via PCIe, to a dedicated high-bandwidth memory (HBM) interface and to the other MAU Cores via a routing network. The routing network between each pair of neighbouring MAU Cores carries 120 bfloat16 values on each clock cycle, resulting in an internal bus bandwidth of 60 GBps.

The architecture of the MAU Core optimized for the Intel Stratix 10 NX FPGA is shown in Figure 2. This has all the functionality required to implement the convolutions within each layer. In addition to convolutional layers, the MAU Core also has functionality for tanh, sigmoid, ReLU, gated activations, softmax sampling and one-hot embedding operations, which form the post compute unit of the MAU Core.

Using Intel Stratix 10 NX FPGA AI Tensor Blocks
The Intel Stratix 10 NX FPGA features an innovative AI tensor block that enables 15 times greater INT8 Tera Operations Per Second (TOPS) compared to other FPGAs in the Intel Stratix 10 FPGA range. To achieve an efficient implementation of the 1D convolutional layers using the Intel Stratix 10 NX FPGA AI Tensor Blocks, we configure the blocks to operate in Tensor Mode. Computation is performed in BFP16, using a shared exponent for each block of 10 matrix elements. Each Intel Stratix 10 NX FPGA AI Tensor Block can compute three dot products of size 10 per clock cycle. Four AI Tensor Blocks are cascaded to form a dot product width of 120. The architecture of the Intel Stratix 10 NX FPGA AI Tensor Block is shown in Figure 3.
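The decomposition described above can be pictured as splitting each 120-wide dot product into twelve size-10 partial products, three per tensor block across a cascade of four blocks. The Python sketch below illustrates the arithmetic only; plain floating point stands in for the BFP16 maths, and the 480-blocks-per-core figure falls out of evaluating 120 such rows in parallel.

```python
# Sketch of how a 120-wide dot product decomposes onto the AI Tensor Blocks:
# each block produces three size-10 dot products per clock, and four cascaded
# blocks cover 3 x 10 x 4 = 120 elements. Evaluating 120 rows in parallel for a
# 120 x 120 matrix-vector product needs 120 x 4 = 480 tensor blocks, matching
# the per-core figure given in the text. Plain floats stand in for BFP16.
import numpy as np

CHUNK = 10            # elements sharing one BFP16 exponent
CHUNKS_PER_BLOCK = 3  # dot products computed by one tensor block per clock
CASCADE = 4           # tensor blocks chained to reach width 120

def row_dot(weights_row, activations):
    """One output element: accumulate size-10 partial dot products."""
    width = CHUNK * CHUNKS_PER_BLOCK * CASCADE
    assert weights_row.size == activations.size == width
    acc = 0.0
    for start in range(0, width, CHUNK):            # 12 partial products
        acc += weights_row[start:start + CHUNK] @ activations[start:start + CHUNK]
    return acc

rng = np.random.default_rng(0)
W = rng.standard_normal((120, 120))
x = rng.standard_normal(120)
y = np.array([row_dot(W[i], x) for i in range(120)])
print(np.allclose(y, W @ x))                         # True
print("tensor blocks per core:", 120 * CASCADE)      # 480
```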


[Figure 3 shows the AI Tensor Block datapath: activations enter on the data input and weights are preloaded via the side input into ping-pong registers; dot products are accumulated using a shared exponent register, converted from fixed point to FP32, and combined with the cascade data input to produce the result on the cascade data output.]

Figure 3. Intel Stratix 10 NX FPGA AI Tensor Block.²

² Credit: Graham McKenzie, Intel Programmable Solutions Group.

The model weights are loaded by the side input. The AI Tensor Block requires 18 clock cycles to load new weights, and this update occurs in parallel with computation through the use of a ping-pong register. We process different channels in a batch of 32 so that the weights can be updated every batch, utilising the Tensor Mode operation at maximum efficiency. We use 480 AI Tensor Blocks in each MAU Core. As a result, each MAU Core can execute a single 120 x 120 matrix-vector multiplication per clock cycle. This gives the accelerator design a total capacity of 27.6 TOPS across all four cores.

Model parameters are held locally in on-chip RAM. WaveNet is a relatively small model, so there is sufficient on-chip memory available to hold all network parameters locally to the AI Tensor Blocks. The design allocates 4.5 MB of local storage to network parameters with 1.3 TBps of bandwidth.

Implementing Dilated Convolutions on the FPGA
During inference, we make use of the Fast WaveNet Generation Algorithm [13], which replaces the dilated convolutions in the model architecture with regular convolutions and a buffer that stores previously computed values. The correct values to simulate the dilated convolution can then be selected from the buffer. This avoids having to recompute values from previous samples when generating each new value, as we can simply look up the value in the buffer, greatly improving the model performance.

Dilated convolutions in the network require activation data from up to 128 previous timesteps. This places a high-throughput and low-latency requirement on the storage of these intermediate activations. Internally to the WaveNet network, the bandwidth requirement to retrieve activations to feed the dilated convolutions is 3.1 GBps per MAU Core. With a need to store up to 128 activations from previous timesteps, each core requires a total storage of 10.3 MB. Since this data store is too large to be accommodated locally inside the core, we use the HBM memory to store this information. The use of HBM is preferred over the external DDR4 memory on the platform, as the bandwidth available from the HBM is 256 GBps versus approximately 21 GBps from external DDR4 memory, providing more headroom for data movement within the system.
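The buffering scheme described above can be sketched with a per-layer queue whose depth equals the layer's dilation. The Python fragment below is a schematic of the Fast WaveNet idea [13], not the FPGA implementation: the gated convolution is replaced by placeholder arithmetic and only the buffer management is intended to be representative.

```python
# Schematic of the Fast WaveNet buffering scheme: each repeated layer keeps a
# queue of its most recent inputs so the dilated convolution can read the value
# from 2**(k % D) steps ago without recomputation.
from collections import deque
import numpy as np

L, D, r = 16, 8, 120                     # layers, dilation cycle, residual channels

class LayerBuffer:
    def __init__(self, k):
        self.dilation = 2 ** (k % D)     # up to 128 previous timesteps
        # Queue pre-filled with zeros; the oldest entry is the value at t - dilation.
        self.queue = deque([np.zeros(r) for _ in range(self.dilation)],
                           maxlen=self.dilation)

    def step(self, current):
        past = self.queue[0]             # activation from t - dilation
        self.queue.append(current)       # store the current value for future steps
        return past

buffers = [LayerBuffer(k) for k in range(L)]

def repeated_stack_step(x):
    """One autoregressive step through the repeated layers (placeholder maths)."""
    for buf in buffers:
        past = buf.step(x)
        x = np.tanh(x + past)            # stand-in for the real gated convolution
    return x

out = [repeated_stack_step(np.full(r, 0.01)).mean() for _ in range(5)]
print(np.round(out, 4))
```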
Programming Model
The MAU Cores are programmed at run time for the WaveNet model. This enables the FPGA to be retargeted to different model architectures without requiring a recompile of the FPGA implementation.

We map WaveNet to the four MAU Cores by running multiple WaveNet layers on each core and time-sharing the core logic. The MAU Core controller handles the execution of the core logic. Each core operates on a batch of 32 voice channels, applying the same instruction to each channel of the batch on consecutive clock cycles. Once all layer operations have completed for the batch, the activations advance to the next MAU Core for processing by later layers. The four cores are pipelined such that a batch of 32 voice channels is processed in each of the four cores simultaneously, giving 128 voice channels in flight in the FPGA at any one time.

To process more than 128 concurrent voice channels in real time, the voice channels are processed as chunks, where a chunk is an integer number of Mel Spectrogram inputs. The FPGA is able to generate samples significantly faster than real time, enabling the FPGA acceleration to be time-shared. The FPGA is designed to enable chunks to be processed in a non-continuous mode of operation, with intermediate state for all active concurrent voice channels held within the HBM memory.
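As a rough feasibility check on the time-sharing scheme described above (assuming the figures given in this paper: 4 cores, batches of 32, a 16 kHz target), serving more channels than are resident in the pipeline simply requires the pipeline to run correspondingly faster than real time. The sketch below works through that arithmetic; the per-sample figure it prints is an implication of the published channel counts, not a measured value.

```python
# Rough feasibility check for chunked, time-shared operation: with 128 voice
# channels in flight (4 pipelined cores x batches of 32), serving 256
# concurrent channels requires generating samples at least 2x faster than
# real time. Illustrative arithmetic only.
in_flight = 4 * 32            # channels resident in the pipeline at once
target_channels = 256
sample_rate = 16_000          # Hz
real_time_budget = 1.0 / sample_rate

speedup_needed = target_channels / in_flight
max_time_per_sample = real_time_budget / speedup_needed
print(f"speed-up needed: {speedup_needed:.1f}x, so each in-flight sample must "
      f"be produced in under {max_time_per_sample * 1e6:.1f} us")   # ~31.2 us
```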


For WaveNet we map the network layers to MAU Cores as shown in Table 2.

  CORE   LAYER MAPPING
  1      Input Embedding & WaveNet Layers 1-3
  2      WaveNet Layers 4-8
  3      WaveNet Layers 9-13
  4      WaveNet Layers 14-16 & Output Convolutions

Table 2. Mapping WaveNet to MAU Cores.

V. Results

Methodology
We use the LJSpeech dataset subsampled to 16 kHz [14]. Prior to starting this project, 100 samples were randomly selected from the dataset for use as a validation set and 100 samples were randomly selected from the dataset for use as a test set. All remaining samples are used as a training set. No data augmentation is applied.

The WaveNet architecture parameters are as defined in Section II. All models are trained for 140,000 steps with a batch size of 16 distributed across 1–8 GPUs using mixed precision training. One element in the batch consists of a randomly selected 1-second segment from an audio clip in the training set, padded with zeros where necessary. We use the Adam optimizer [15] with a fixed learning rate of 10⁻³, β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸. We use PyTorch [16] to implement WaveNet and QPyTorch [17] to simulate BFP16 when using QAT.

We report both the teacher-forced cross-entropy validation and test loss as well as the Mean Opinion Score (MOS) for each model. For each model, we repeat the training process 3 times and select the model with median validation loss to compute and report the test loss and test MOS score. The MOS is computed by asking 30 independent Amazon Mechanical Turk workers to rate each of the generated samples on how natural the samples sound on a five-point scale; see Table 3. The reported MOS is the mean of these scores, and a 95% confidence interval is computed using the t-distribution.

  RATING   LABEL       DESCRIPTION
  1        Bad         Completely unnatural speech
  2        Poor        Mostly unnatural speech
  3        Fair        Equally natural and unnatural speech
  4        Good        Mostly natural speech
  5        Excellent   Completely natural speech

Table 3. Mean Opinion Score Scale.
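The MOS aggregation described in the Methodology can be reproduced in a few lines of Python; the ratings below are invented for illustration, and scipy's t-distribution provides the 95% confidence interval.

```python
# Sketch of the MOS aggregation described above: the reported score is the mean
# of the raters' 1-5 judgements with a 95% confidence interval from the
# t-distribution. The ratings array is hypothetical.
import numpy as np
from scipy import stats

ratings = np.array([4, 5, 4, 3, 4, 4, 5, 4, 3, 4, 4, 5])   # hypothetical worker scores

mean = ratings.mean()
sem = stats.sem(ratings)                                    # standard error of the mean
half_width = sem * stats.t.ppf(0.975, df=len(ratings) - 1)
print(f"MOS = {mean:.3f} +/- {half_width:.3f} (95% CI, t-distribution)")
```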

Model Quality
The loss on the validation and test sets as well as the MOS for the baseline FP32 model and the BFP16 models using PTQ and QAT are presented in Table 4. We find that both the BFP16 (PTQ) model and the BFP16 (QAT) model are capable of synthesizing high-fidelity audio that is near the baseline FP32 model.

  MODEL         VALIDATION LOSS   TEST LOSS   TEST MOS
  Human         -                 -           4.056 ± 0.034
  FP32          2.189             2.182       3.976 ± 0.027
  BFP16 (PTQ)   2.203             2.197       3.711 ± 0.029
  BFP16 (QAT)   2.195             2.188       3.823 ± 0.028

Table 4. Quality Results for the FP32 and BFP16 WaveNet Models.

Model Performance
We present key parameters and processing performance in Table 5 for the WaveNet model running on an Intel Stratix 10 NX FPGA. Comparisons are made against nv-wavenet running on an NVIDIA V100 Graphics Processing Unit (GPU). At the time of writing, this is the fastest previous implementation of WaveNet that we have been able to reference.

We make power comparisons based on the Thermal Design Power (TDP) of the two platforms. This is an overestimate of true power consumption for the application on both platforms. TDP for the Intel Stratix 10 NX FPGA is based on design analysis provided by Intel for a PCIe-based accelerator card deployment.

The number of concurrent voice channels is the maximum number of channels that can be generated in real time to meet a 16 kHz sample rate. The FPGA implementation uses slightly smaller layer sizes than the GPU implementation, but has a larger number of model parameters and operations, due to the inclusion of the conditional layer in the accelerated part of the model, which is not part of the nv-wavenet implementation.

The Intel Stratix 10 NX FPGA implementation demonstrates an 8 times improvement in the number of concurrent voice streams when compared to nv-wavenet. The FPGA is able to achieve an 8.6 times improvement in achieved TOPS for the WaveNet model.

The FPGA provides a significantly more power efficient solution, providing a 9.3 times improvement in voice channels per watt and a 10 times improvement in GOPS/W.

We present the number of concurrent voice channels for audio frequencies at 16 kHz and above in Table 6. This shows that the FPGA solution remains efficient, even as the latency requirement to produce a single time step is reduced.


                                                MYRTLE.AI WAVENET                  NV-WAVENET
  Platform                                      Intel® Stratix® 10 NX FPGA         NVIDIA* V100 GPU
  Frequency (MHz)                               240                                1530
  Numerical Precision                           BFP16 / bfloat16                   fp16
  WaveNet Configuration                         r=120, s=240, L=16, a=256, D=8     r=128, s=256, L=16, a=256, D=8
  Operations per 1 second audio (GOPS/second)   65.03                              60.36
  Model Parameters (Millions)                   2.08                               1.99
  Concurrent Voice Channels                     256                                32
  Application TOPS                              16.6                               1.93
  TDP Power (W)                                 215                                250
  Performance per Watt (GOPS/W)                 77.4                               7.7
  Voice Channels per Watt (1/W)                 1.19                               0.128

Table 5. Performance Results for WaveNet Implementation at 16 kHz.
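The headline ratios quoted in the Model Performance discussion follow directly from the entries in Table 5; the short check below recomputes them.

```python
# Recompute the headline ratios from the Table 5 entries.
fpga = {"channels": 256, "tops": 16.6, "tdp_w": 215, "gops_per_w": 77.4}
gpu  = {"channels": 32,  "tops": 1.93, "tdp_w": 250, "gops_per_w": 7.7}

print("voice channels:    %.1fx" % (fpga["channels"] / gpu["channels"]))        # 8.0x
print("application TOPS:  %.1fx" % (fpga["tops"] / gpu["tops"]))                # ~8.6x
print("channels per watt: %.1fx" % ((fpga["channels"] / fpga["tdp_w"]) /
                                    (gpu["channels"] / gpu["tdp_w"])))          # ~9.3x
print("GOPS per watt:     %.1fx" % (fpga["gops_per_w"] / gpu["gops_per_w"]))    # ~10x
```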

                      CONCURRENT VOICE CHANNELS
  AUDIO FREQUENCY     MYRTLE.AI WAVENET    NV-WAVENET
  16 kHz              256                  32
  24 kHz              160                  16
  32 kHz              128                  8

Table 6. Performance Results for WaveNet Implementation for Different Audio Frequencies.

We implement the ConvTranspose1d on a dual-socket, 16-core Intel® Xeon® processor running at 2.8 GHz. Each input step corresponds to 200 output steps (the stride) and hence one input step must be processed every 12.5 ms to generate output steps for 16 kHz audio. We implement our own ConvTranspose1d variant, which runs an order of magnitude faster than the native PyTorch implementation. We use a batch size of 64 running independently on each socket. We measure the 99.999th-percentile latency at 3.85 ms for each batch, sufficiently fast to run 3 batches of 64 on each socket within the 12.5 ms processing requirement. This configuration enables the CPU to compute the ConvTranspose1d for up to 384 concurrent voice channels, sufficient to feed the FPGA implementation, demonstrating a full system implementation of WaveNet capable of generating 256 concurrent voice channels.

VI. Conclusion
This paper has shown that real-time performance can be achieved for 256 concurrent streams of the state-of-the-art WaveNet model, achieving near-human levels of synthesized speech, by using a dedicated FPGA-based accelerator. This is an improvement of 8 times over the best GPU solution currently available for this model.

We show that in higher frequency deployments the FPGA advantage increases further, enabling a 10 times advantage at 24 kHz and a 16 times advantage at 32 kHz compared to the best GPU solution currently available. This enables a cost-effective platform for deployment of WaveNet at higher audio frequencies, enabling 128 concurrent streams of 32 kHz audio on one accelerator.

We show that the FPGA is able to provide a significant energy reduction in the implementation of WaveNet, a 10 times improvement compared to a GPU implementation, enabling significant energy savings to be made by those who deploy real-time speech synthesis at scale.

We demonstrate that the BFP16 format can be applied post-training to enable a simple quantization flow from FP32 with minimal loss of accuracy. This provides a simple and effective quantization flow from machine learning frameworks, whilst enabling the benefit of a more efficient hardware implementation.


References
[1] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "Wavenet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[2] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.
[3] A. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. Driessche, E. Lockhart, L. Cobo, F. Stimberg et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," in International Conference on Machine Learning, 2018, pp. 3918–3926.
[4] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. v. d. Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis," arXiv preprint arXiv:1802.08435, 2018.
[5] W. Ping, K. Peng, and J. Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," arXiv preprint arXiv:1807.07281, 2018.
[6] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3617–3621.
[7] S. Kim, S.-g. Lee, J. Song, J. Kim, and S. Yoon, "FloWaveNet: A generative flow for raw audio," arXiv preprint arXiv:1811.02155, 2018.
[8] Q. Tian, Z. Zhang, H. Lu, L.-H. Chen, and S. Liu, "FeatherWave: An efficient high-fidelity neural vocoder with multi-band linear prediction," arXiv preprint arXiv:2005.05551, 2020.
[9] P.-c. Hsu and H.-y. Lee, "WG-WaveNet: Real-time high-fidelity speech synthesis without GPU," arXiv preprint arXiv:2005.07412, 2020.
[10] S. Hussain, M. Javaheripi, P. Neekhara, R. Kastner, and F. Koushanfar, "FastWave: Accelerating autoregressive convolutional neural networks on FPGA," arXiv preprint arXiv:2002.04971, 2020.
[11] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, "Deep Voice 3: Scaling text-to-speech with convolutional sequence learning," arXiv preprint arXiv:1710.07654, 2017.
[12] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, "Quantization and training of neural networks for efficient integer-arithmetic-only inference," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2704–2713.
[13] T. L. Paine, P. Khorrami, S. Chang, Y. Zhang, P. Ramachandran, M. A. Hasegawa-Johnson, and T. S. Huang, "Fast WaveNet generation algorithm," arXiv preprint arXiv:1611.09482, 2016.
[14] K. Ito and L. Johnson, "The LJ Speech dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.
[15] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[16] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019, pp. 8024–8035. [Online]. Available: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
[17] T. Zhang, Z. Lin, G. Yang, and C. De Sa, "QPyTorch: A low-precision arithmetic simulation framework," arXiv preprint arXiv:1910.04540, 2019.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

Tests measure performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. Results have been estimated or simulated.

Intel technologies may require enabled hardware, software or service activation. No product or component can be absolutely secure. Your costs and results may vary.

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. *Other names and brands may be claimed as the property of others.

Please Recycle WP-01304-1.0
