
Empirical Evaluation of Model Compression Techniques on the WaveNet Vocoder

Sam Davis, Giuseppe Coccia, Sam Gooch, Julian Mack
Myrtle.ai, Cambridge, UK
[email protected], [email protected], [email protected], [email protected]

Abstract

WaveNet is a state-of-the-art text-to-speech vocoder that remains challenging to deploy due to its autoregressive loop. In this work we focus on ways to accelerate the original WaveNet architecture directly, as opposed to modifying the architecture, such that the model can be deployed as part of a scalable text-to-speech system. We survey a wide variety of model compression techniques that are amenable to deployment on a range of hardware platforms. In particular, we compare different model sparsity methods and levels, and seven widely used precisions as targets for quantization; and are able to achieve models with a compression ratio of up to 13.84 without loss in audio fidelity compared to a dense, single-precision floating-point baseline. All techniques are implemented using existing open source deep learning frameworks and libraries to encourage their wider adoption.

1 Introduction

The widespread adoption of personal assistants has been powered, in large part, by advances in the field of Text-to-Speech (TTS). Many state-of-the-art TTS systems contain a model, referred to as a vocoder, that takes as input audio features derived from a piece of text and outputs synthesized speech. WaveNet (Oord et al. 2016) is a state-of-the-art vocoder that is capable of producing synthesized speech with near-human-level quality (Shen et al. 2018). The key to the model's quality is its autoregressive loop, but this property makes the model exceptionally challenging to deploy in applications that require real-time output or need to efficiently scale to millions of users, since naïve implementations may take tens of minutes to generate ten seconds of speech.

As a result, TTS research has focused on finding alternative vocoder architectures such as Parallel-WaveNet (Oord et al. 2018), WaveRNN (Kalchbrenner et al. 2018), ClariNet (Ping, Peng, and Chen 2018) and WaveGlow (Prenger, Valle, and Catanzaro 2019) that achieve higher performance when deployed on existing hardware. There is a degree of ambiguity as to the highest quality vocoder, as audio quality evaluation is subjective, but all authors agree that WaveNet produces audio at least as good as, if not better than, the more recent approaches (Kim et al. 2018; Oord et al. 2018; Prenger, Valle, and Catanzaro 2019; Tian et al. 2020; Hsu and Lee 2020).

In this paper, rather than changing the WaveNet architecture to improve performance, we instead keep it fixed and explore a range of model compression techniques that can yield greater inference performance. Crucially, the techniques we explore are all available in existing deep learning frameworks and are deployable to a wide range of current and future CPUs and neural network accelerators. Finally, motivated by the desire to maintain WaveNet's quality, we evaluate the impact that these compression techniques have on the perceived fidelity of the synthesized speech. We examine two main categories of model compression—sparsity and quantization—and explore both their independent and combined impact on model quality. For sparsity we consider iterative and one-shot magnitude-based neural network pruning; and for quantization we explore the INT8, bfloat16, half-precision floating-point with both 16-bit and 32-bit accumulation (FP16.16, FP16.32), 16-bit block floating-point (BFP16), TensorFloat32 (TF32) and single-precision floating-point (FP32) formats. While there has been some work in vocoder pruning of WaveRNN and its variants (Kalchbrenner et al. 2018; Valin and Skoglund 2019; Tian et al. 2020), to our knowledge, this is the first paper in which WaveNet pruning results are provided. Additionally, to our knowledge, no other authors compare as wide a range of precisions, nor look at the interactions between pruning and quantization for the WaveNet model. We include samples generated by the models presented in this work and will release our code¹.

¹ myrtlesoftware.github.io/-paper/

2 Related Work

Numerous attempts have been made to improve vocoder inference performance. The WaveRNN (Kalchbrenner et al. 2018) authors note that the time for vocoder inference, T(u), for a target audio sequence u can be decomposed into computation time c_i and kernel launch overhead d_i for each of the N operations (layers) of the model:

    T(u) = |u| \sum_{i=1}^{N} (c_i + d_i)    (1)

Attempts to optimize a vocoder for deployment aim to reduce at least one of {|u|, N, c_i, d_i}. Many approaches including Parallel-WaveNet, ClariNet, WaveGlow and WaveFlow (Ping et al. 2019) remove the autoregressive component and hence reduce |u| so that many or all of the samples can be generated in parallel. Others, including WaveRNN and its variants LPCNet (Valin and Skoglund 2019) and FeatherWave (Tian et al. 2020), keep the autoregressive component but make alterations to the architecture to decrease the product of N and c_i. There have also been efforts that focus on reducing d_i by exploiting techniques such as persistent kernels that only launch the kernel once per sequence (Pharris 2018). In this work we explore reducing c_i without altering the WaveNet architecture, by utilising model compression to give greater possibilities in the types of models that can be deployed.
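To make Equation 1 concrete, the sketch below evaluates the decomposition for a naive sample-by-sample implementation. All per-operation timings here are hypothetical placeholders chosen only to illustrate how launch overheads of tens of microseconds per operation push the generation time for ten seconds of audio into the minutes range; they are not measurements from this paper.

```python
# Illustration of Equation 1: T(u) = |u| * sum_i (c_i + d_i).
# The values of N, c_i and d_i below are hypothetical, not measured.

def inference_time(num_samples, compute_times, launch_overheads):
    per_sample = sum(c + d for c, d in zip(compute_times, launch_overheads))
    return num_samples * per_sample

N = 50                    # hypothetical number of operations (layers)
c = [5e-6] * N            # hypothetical compute time per operation (seconds)
d = [70e-6] * N           # hypothetical kernel launch overhead per operation
samples = 10 * 16_000     # ten seconds of 16 kHz audio

print(f"{inference_time(samples, c, d) / 60:.1f} minutes")  # 10.0 minutes
```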

2.1 Sparsity

Sparse models offer two potential benefits over their dense counterparts:

1. The amount of computation can be reduced since, for example, multiplications by zero need not be performed.
2. Memory bandwidth requirements can be reduced as it is possible to achieve higher compression ratios with sparse matrices.

The first of these reduces c_i but, in order to realise either of these benefits, hardware support is usually required. Depending on the type of support, different hardware platforms are amenable to different types of sparsity. At one end of the spectrum, some authors use channel pruning, in which entire convolutional channels are set to zero (He, Zhang, and Sun 2017). It is comparatively easy to realise the inference-time performance benefits of channel sparsity, but this approach produces a significant degradation in audio quality for WaveNet (Hussain et al. 2020). Channel sparsity is a special case of block sparsity where, for a 2D weight matrix, blocks of size n × m are enforced to be either all-dense or all-sparse (Narang, Undersander, and Diamos 2017). At the other end of the spectrum, sparsity can also be unstructured, meaning there are no constraints on the sparsity pattern; this typically results in the smallest quality degradation but is also the most challenging sparsity pattern to deploy efficiently. A hybrid approach is to employ balanced sparsity (Yao et al. 2019; Cao et al. 2019) where each block is independently pruned to the target sparsity percentage but within a block the sparsity is unstructured.

The other principal axes on which neural network sparsity approaches can differ relate to the method of obtaining the sparse network. One way to do this is by utilizing magnitude-based pruning, an approach in which the parameters closest to zero are pruned (Han et al. 2015). Magnitude-based pruning can be broadly classified as either one-shot, in which a trained network is pruned to the desired sparsity in a single step (Han et al. 2015), or iterative, where the level of sparsity increases more gradually over the training process (Zhu and Gupta 2017). However, the division between the two is not always clear-cut. For example, the FeatherWave authors propose a two-stage sparse pruning schedule (TSSP) which combines a one-shot jump to 50% sparsity followed by an iterative pruning stage.

The WaveRNN, LPCNet and FeatherWave models utilise high levels of sparsity to reduce inference time. The WaveRNN authors use iterative pruning to achieve sparsity as high as 96%. They investigate a range of block sparsity patterns including 1x1 (unstructured), 4x4 and 16x1 and find that the latter is the most performant as it more closely mirrors the layout in physical memory. The LPCNet and FeatherWave authors both use 90% sparse networks with a 16x1 pattern, although the latter uses the TSSP approach discussed above instead of the purely iterative approach.
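As a rough illustration of the granularities discussed in this section, the sketch below applies one-shot magnitude-based pruning to a single weight matrix at two granularities: layer-wise unstructured and balanced 2:4. This is a generic sketch using plain PyTorch tensor operations, not the implementation used in this paper.

```python
import torch

def prune_unstructured(w, sparsity):
    """Zero the smallest-magnitude entries of w (layer-wise, unstructured)."""
    k = int(sparsity * w.numel())
    if k == 0:
        return w.clone()
    threshold = w.abs().flatten().kthvalue(k).values
    return torch.where(w.abs() > threshold, w, torch.zeros_like(w))

def prune_balanced(w, block=4, keep=2):
    """Balanced sparsity: keep the `keep` largest-magnitude values in every
    block of `block` consecutive values (keep=2, block=4 gives the 2:4 pattern)."""
    flat = w.reshape(-1, block)
    idx = flat.abs().topk(keep, dim=1).indices
    mask = torch.zeros_like(flat).scatter_(1, idx, 1.0)
    return (flat * mask).reshape(w.shape)

w = torch.randn(120, 240)                                   # a 2-D weight matrix
print((prune_unstructured(w, 0.75) == 0).float().mean())    # ~0.75 of entries zeroed
print((prune_balanced(w) == 0).float().mean())              # exactly 0.50 zeroed
```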
2.2 Quantization

Quantized models also reduce c_i as the operations are now performed at a numerical format in which the operations are less computationally expensive. This approach has been applied to a wide variety of models including BERT (Wu et al. 2020), ResNet and GNMT (Zafrir et al. 2019), and sees adoption in widely recognised benchmarks (Reddi et al. 2019).

The quantization process includes one or both of:

1. Reducing the number of bits of the datatype, e.g. use 8 bits instead of 32 bits.
2. Using a less expensive format, e.g. use integer instead of floating-point.

A simple scheme is to perform all multiplications in the FP16 data format as this is already widely supported on a variety of hardware devices. The results are accumulated in either FP16 or FP32; this distinction matters for the range of representable values and for what precision any activation functions are later performed in. We write these choices as FP16.16 and FP16.32, denoting the use of FP16 for multiplies and either FP16 or FP32 for accumulations respectively.

Quantizing to integers is another popular choice. When quantizing from floating point to integer, it is necessary to use a quantization scheme in which there is no quantization error for the value 0, as these parameters will have an outsized impact on model performance (Jacob et al. 2018), especially when quantizing sparse matrices. Running inference at INTX (most often INT8) is widely used for deployment, including for models from the domains of Machine Translation (Wu et al. 2016), Automatic Speech Recognition (He et al. 2019), Computer Vision (Wu et al. 2018) and NLP embeddings (Zafrir et al. 2019).

However, formats other than INTX and the basic FP16 are becoming more widely used as hardware vendors accommodate them. For example, BrainFloat16 is supported by TPUs² and Intel engineers have demonstrated a 1.5× speedup using this format for both Parallel-WaveNet and FeatherWave on Xeon CPUs³. BlockFloat16 is supported on Intel's Stratix 10 NX FPGA⁴ and NVIDIA's Ampere series of GPUs will support the TensorFloat32 format. In the FPGA space, there is work using bespoke formats; for example, Hussain et al. (2020) achieved a 2× speedup for WaveNet inference by quantizing from floating point to fixed point while keeping the range and precision of their 27-bit data type constant.

² cloud.google.com/tpu/docs/bfloat16
³ intel.com/content/www/us/en/artificial-intelligence/posts/intel-xeon-text-to-speech.html
⁴ intel.com/content/www/us/en/products/programmable/fpga/stratix-10/nx.html

In many cases, it is possible to perform post-training quantization (PTQ), in which full-precision weights are quantized to the desired precision after training is completed, with minimal loss in model quality. However, when the quality degradation is large, it becomes necessary to perform Quantization Aware Training (QAT) (Jacob et al. 2018). In QAT, the quantization operation for the target precision is simulated during training so that the trained FP32 weights learn to remove the noise injected by the quantization.
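The requirement that the value 0 be representable exactly is straightforward to satisfy with a symmetric scheme whose zero-point is 0. The sketch below shows one such post-training INT8 quantizer applied to an artificially pruned weight tensor; it is a minimal illustration of the idea, not the quantization scheme used later in this paper.

```python
import torch

def quantize_int8_symmetric(w):
    """Post-training symmetric INT8 quantization of a weight tensor.
    With a zero-point of 0, the value 0.0 is represented exactly, so pruned
    (zero) weights stay exactly zero after quantization."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

w = torch.randn(240, 256) * 0.1
w[w.abs() < 0.05] = 0.0                      # pretend the layer was pruned
q, scale = quantize_int8_symmetric(w)
w_hat = dequantize(q, scale)
print(torch.equal((w == 0), (w_hat == 0)))   # True: pruned zeros survive exactly
print((w - w_hat).abs().max() / scale)       # error is at most half a quantization step
```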
3 Experiments

3.1 Setup and Evaluation

Model  We use the WaveNet architecture as detailed in Figure 1. This model uses mel spectrograms with 80 filter banks for the conditional features and upsamples them using a 1D transposed convolution with a kernel size of 800 and a stride of 200. The remainder of the architecture is parameterized by the number of channels in its convolutions: the number of skip, residual and audio channels; as well as by the number of repeated layers and the dilation cycle parameter D. For all experiments we use a model with s = 240 skip channels, r = 120 residual channels, a = 256 audio channels, L = 16 repeated layers and D = 8. This model has 7.2 million parameters; see Table 1. Whilst the model produces audio with 256 channels, the final audio output has a bit depth of 16 as a µ-law companding transform is applied (Oord et al. 2016).

Figure 1: WaveNet model architecture. ACT denotes the activation format before or after an operation. For the convolution layers, the input activations and parameters are both converted to the INP format prior to use. INT refers to a 64-bit integer data type. The diamond splits the input channels in half to make two outputs.

Sparsity  The weights of the dilation, conditional, residual and skip convolutions in the repeated layers and the Out post-processing convolution weights are pruned as, combined, these layers contain 95.6% of the total operations; see Table 1. The feature upsample pre-processing weights are also pruned as this layer contains 71.1% of the total number of parameters despite only requiring 1.25% of the total operations. Biases for all pruned layers remain dense. Finally, the embedding pre-processing and End post-processing layers remain dense as, although they have comparatively few parameters and operations, pruning them has a disproportionate effect on the final quality.

We focus on two sparsity granularities as these are supported by the major deep learning frameworks including TensorFlow and PyTorch. The first granularity is layer-wise unstructured sparsity. We use iterative magnitude-based pruning where each layer is pruned independently up to a target compression ratio, defined as per Equation 2.

    compression ratio = original size / compressed size    (2)

Each layer is pruned every 500 steps following the cubic pruning schedule found in TensorFlow (Zhu and Gupta 2017). The second granularity is balanced sparsity where 2 out of every block of 4 values is pruned; denoted 2:4. For this technique we use NVIDIA's automatic sparsity library and follow their recommended approach: train a dense model, perform one-shot pruning to the target sparsity, and then repeat the training process starting from the pruned parameters.
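For reference, a minimal sketch of the cubic sparsity schedule of Zhu and Gupta (2017) that the iterative pruning above follows is given below. Only the 500-step pruning frequency is taken from the text; the begin step, end step and final sparsity are illustrative placeholders rather than the training configuration used in this paper.

```python
def cubic_sparsity(step, s_init=0.0, s_final=0.75, begin_step=0,
                   end_step=200_000, prune_every=500):
    """Cubic sparsity schedule (Zhu & Gupta 2017): sparsity ramps from s_init
    to s_final between begin_step and end_step, updated every `prune_every` steps."""
    if step < begin_step:
        return s_init
    if step >= end_step:
        return s_final
    # Hold the level constant between pruning events.
    step = begin_step + ((step - begin_step) // prune_every) * prune_every
    progress = (step - begin_step) / (end_step - begin_step)
    return s_final + (s_init - s_final) * (1.0 - progress) ** 3

# s_final = 0.75 corresponds to a sparse layer compression ratio of 4
# under Equation 2, ignoring any index overhead.
for step in (0, 50_000, 100_000, 200_000):
    print(step, round(cubic_sparsity(step), 3))   # 0.0, 0.434, 0.656, 0.75
```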

| Layer | Type | # Parameters, per layer | # Parameters, total | GOP/s audio, per layer | GOP/s audio, total |
|---|---|---|---|---|---|
| Pre-processing layers | | | | | |
| Embedding | Embedding | - | 30,720 | - | - |
| Feature Upsample | ConvTranspose1d | - | 5,120,080 | - | 0.82 |
| Repeated layers | | | | | |
| Dilation | Dilated Conv1d | 57,840 | 925,440 | 1.84 | 29.49 |
| Conditional | Conv1d | 19,440 | 311,040 | 0.61 | 9.83 |
| Residual | Conv1d | 14,520 | 217,800 | 0.46 | 6.91 |
| Skip | Conv1d | 29,040 | 464,640 | 0.92 | 14.75 |
| Post-processing layers | | | | | |
| Out | Conv1d | - | 61,440 | - | 1.96 |
| End | Conv1d | - | 65,536 | - | 2.09 |
| Total | | | 7,196,696 | | 65.85 |

Table 1: A breakdown of the model's parameters and giga-operations required per second of synthesized audio output.
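As a cross-check on Table 1, the sketch below reconstructs the per-layer parameter counts from the stated hyperparameters (s = 240, r = 120, a = 256, L = 16) and the kernel sizes in Figure 1. The bias/no-bias choices and the use of L − 1 rather than L residual convolutions are inferred from the table entries themselves rather than stated explicitly in the text.

```python
s, r, a, L = 240, 120, 256, 16            # skip, residual and audio channels; repeated layers

params = {
    "embedding":        a * r,                           # 256-entry embedding, r outputs
    "feature upsample": 80 * 80 * 800 + 80,              # ConvTranspose1d, kernel 800, with bias
    "dilation":         L * (r * (2 * r) * 2 + 2 * r),   # 2x1 dilated conv, r -> 2r
    "conditional":      L * (80 * (2 * r) + 2 * r),      # 1x1 conv, 80 -> 2r
    "residual":         (L - 1) * (r * r + r),           # 1x1 conv, r -> r, in L-1 layers
    "skip":             L * (r * s + s),                 # 1x1 conv, r -> s
    "out":              s * a,                           # 1x1 conv, s -> a, no bias
    "end":              a * a,                           # 1x1 conv, a -> a, no bias
}

for name, count in params.items():
    print(f"{name:>16}: {count:>9,}")
print(f"{'total':>16}: {sum(params.values()):>9,}")      # 7,196,696, matching Table 1
```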

Quantization  We use the 7 formats listed in Table 2. For our BFP16 implementation, we use a block size of 10 in the channel dimension. These formats are chosen as they are common amongst deep learning frameworks as well as current and future CPUs and neural network accelerators. PyTorch (Paszke et al. 2019) is used for INT8 and FP32. QPyTorch (Zhang et al. 2019) is used to simulate bfloat16 (Intel 2020), FP16.16, FP16.32, BFP16 (Song, Liu, and Wang 2017), and TF32 (NVIDIA 2020). Simulation is achieved by converting the input activations and parameters of the convolution layers to the target formats. The remaining operations may use a different format as detailed in Table 2.

| Format | Abbrev. (INP) | E | M | ACT |
|---|---|---|---|---|
| binary32 | FP32 | 8 | 23 | FP32 |
| TensorFloat32 | TF32 | 8 | 10 | FP32 |
| BrainFloat16 | bfloat16 | 8 | 7 | FP32 |
| BlockFloat16 | BFP16 | 8 | 7 | FP32 |
| binary16 | FP16.16 | 5 | 10 | FP16 |
| binary16 | FP16.32 | 5 | 10 | FP32 |
| Signed 8-bit int | INT8 | 0 | 8 | INT8 |

Table 2: A summary of the numerical formats evaluated in this paper. The binary32 and binary16 formats are as defined in the IEEE 754-2008 standard. The E and M columns refer to the number of bits used for the exponent and mantissa respectively. The INP and ACT columns relate to the values used in Figure 1.

We use post-training quantization for all formats. We experimented with using QAT for some formats but found that it did not have a significant impact on the final model quality compared to PTQ, despite being much harder to integrate into a training pipeline and increasing the barriers to deployment. Hence, we leave more detailed investigations regarding QAT to future work.

Training  We use the LJSpeech (Ito and Johnson 2017) dataset subsampled to 16kHz. Prior to starting this research, 100 samples were randomly selected from the dataset for use as a validation set and 100 samples were randomly selected from the dataset for use as a test set. All remaining samples are used as a training set. No data augmentation is applied. All experiments use a batch size of 16 distributed across 1–4 GPUs as well as mixed precision training (Micikevicius et al. 2017). One element in the batch consists of a randomly selected 1 second segment from an audio clip in the training set, padding with zeros where necessary. We use the Adam optimizer (Kingma and Ba 2014) with a fixed learning rate of 10⁻³, β1 = 0.9, β2 = 0.999, ε = 10⁻⁸.

Evaluation  We report both the teacher-forced cross-entropy validation and test loss, the Mean Opinion Score (MOS) and the compression ratio for all experiments. For the sparsity experiments we also report theoretical speedup. Each experiment is repeated 3 times with 3 different random seeds and the model with median validation loss is selected to compute and report all of these metrics.

For each generated sample in the test set the MOS is computed by asking 30 independent workers to rate the sample's naturalness on a five point scale. The reported MOS is the mean of these scores and a 95% confidence interval is computed using the t-distribution.

The compression ratio is defined as per Equation 2. When referring to the compression ratio or model compression ratio, the size is defined to be the total size of the model in bits. We also refer to the sparse layer compression ratio where the size is defined to be the size in bits of only the layers that are being pruned. For sparse models, some layers remain dense so the model compression ratio is lower than the sparse layer compression ratio. The theoretical speedup is defined as the number of multiply-adds in the dense model divided by the number in the sparse model (Blalock et al. 2020).

Note that both the compression ratio and theoretical speedup only provide an upper bound on what is achievable in deployment as, in practice, there will be extra overheads. For example, additional memory and hardware is required to store and use sparse parameters.
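A minimal sketch of the per-sample confidence interval computation described above (mean of 30 ratings, 95% interval from the t-distribution) is shown below. The ratings are fabricated placeholders, and the paper's exact aggregation across test samples may differ from this single-sample reading.

```python
import numpy as np
from scipy import stats

ratings = np.array([4, 4, 5, 3, 4, 4, 5, 4, 3, 4, 4, 5, 4, 4, 3,
                    4, 5, 4, 4, 4, 3, 4, 4, 5, 4, 4, 4, 3, 5, 4])  # placeholder scores

n = len(ratings)                        # 30 raters per sample
mean = ratings.mean()
sem = ratings.std(ddof=1) / np.sqrt(n)  # standard error of the mean
half_width = stats.t.ppf(0.975, df=n - 1) * sem

print(f"MOS = {mean:.3f} +/- {half_width:.3f} (95% CI, t-distribution)")
```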
3.2 Results

Sparsity  The results for the sparsity experiments that use the iterative unstructured and one-shot 2:4 magnitude-based pruning techniques are presented in Table 3.

| Granularity | Structure | Compression ratio, sparse layer | Compression ratio, model | Val loss | Test loss | MOS | Theoretical speedup |
|---|---|---|---|---|---|---|---|
| Ground Truth (Human) | - | - | - | - | - | 4.072 ± 0.028 | - |
| Baseline (Dense) | - | - | 1.00 | 2.189 | 2.182 | 3.976 ± 0.027 | 1.00 |
| Iterative | Unstructured | 2.00 | 1.97 | 2.179 | 2.173 | 3.966 ± 0.029 | 1.91 |
| Iterative | Unstructured | 4.00 | 3.83 | 2.200 | 2.194 | 3.922 ± 0.030 | 3.51 |
| Iterative | Unstructured | 8.00 | 7.23 | 2.236 | 2.231 | 3.877 ± 0.032 | 6.03 |
| Iterative | Unstructured | 16.00 | 13.02 | 2.281 | 2.276 | 3.757 ± 0.033 | 9.41 |
| Iterative | Unstructured | 32.00 | 21.73 | 2.357 | 2.402 | 3.616 ± 0.038 | 12.95 |
| One-shot | 2:4 | 2.00 | 1.97 | 2.195 | 2.191 | 3.881 ± 0.027 | 1.91 |

Table 3: Impact of varying sparsity techniques and compression ratios on model quality and the theoretical speedups.

Focusing on the iterative unstructured experiments, we see that the difference between the baseline model and the models that use a sparse layer compression ratio of 2 and 4 respectively is not statistically significant. This means a model compression ratio of up to 3.83 and a theoretical speedup of up to 3.51 can be achieved without a reduction in fidelity of the synthesized audio. The models with a sparse layer compression ratio greater than 4 do exhibit a significant degradation in audio fidelity. However, this reduction in fidelity does lead to a large increase in theoretical speedup. For example, the model that uses a compression ratio of 32 in its sparse layers—4.6% of the total number of parameters and 7.7% of the total number of operations—has a potential theoretical speedup of 12.95 over the dense baseline model whilst achieving only a 9.1% lower MOS. However, note that MOS is a subjective metric and therefore relative comparisons are difficult to interpret. Figure 2 demonstrates this trade-off between model quality and the theoretical speedup that sparsity offers.
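The model compression ratios in Table 3 can be reproduced from the sparse layer compression ratios using the parameter counts in Table 1, keeping the embedding and End layers and all biases dense as described in Section 3.1. The weight/bias split used below is inferred from the Table 1 entries rather than quoted from the paper.

```python
total = 7_196_696                       # total parameters (Table 1)

# Weights in the pruned layers; biases stay dense and are excluded here.
pruned_weights = (
    5_120_000                           # feature upsample ConvTranspose1d weights
    + 16 * (57_600 + 19_200 + 28_800)   # dilation, conditional and skip convs
    + 15 * 14_400                       # residual convs
    + 61_440                            # Out 1x1 conv (no bias)
)
dense = total - pruned_weights          # embedding, End layer and all biases

for sparse_layer_cr in (2, 4, 8, 16, 32):
    model_cr = total / (pruned_weights / sparse_layer_cr + dense)
    print(f"sparse layer CR {sparse_layer_cr:>2}: model CR {model_cr:.2f}")
# Prints 1.97, 3.83, 7.23, 13.02 and 21.73, matching the Model column of Table 3.
```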

Looking at the one-shot 2:4 magnitude-based pruning result, we see that it has a significantly lower MOS than both the baseline and the iterative unstructured model with a sparse layer compression ratio of 2. Further, the MOS score is comparable to that of the iterative unstructured models that use a sparse layer compression ratio of 4 and 8. This suggests that the iterative unstructured magnitude-based pruning technique produces higher quality WaveNet models for a fixed compression ratio, or models of greater theoretical speedup for a fixed quality.

Figure 2: Trade-off between model quality (change in MOS from baseline) and the theoretical speedup. Error bars display 95% confidence intervals using the t-distribution.

Quantization  We present the quantization results in Table 4. We are able to obtain an audio fidelity that matches that of the baseline FP32 model by using the TF32 precision. The other formats are all equivalent to each other but they are all significantly worse than TF32 and FP32. This could be due to the higher compression ratio with these formats compared to TF32 and FP32 causing a higher information loss in the model arithmetic that leads to a quality degradation.

| Format | Compression ratio | Val loss | Test loss | MOS |
|---|---|---|---|---|
| Ground Truth (Human) | - | - | - | 4.072 ± 0.028 |
| FP32 | 1.00 | 2.189 | 2.182 | 3.976 ± 0.027 |
| TF32 | 1.68 | 2.189 | 2.182 | 3.972 ± 0.029 |
| bfloat16 | 2.00 | 2.189 | 2.183 | 3.904 ± 0.030 |
| BFP16 | 3.61 | 2.189 | 2.183 | 3.877 ± 0.032 |
| FP16.16 | 2.00 | 2.189 | 2.182 | 3.911 ± 0.029 |
| FP16.32 | 2.00 | 2.189 | 2.182 | 3.903 ± 0.031 |
| INT8 | 4.00 | 2.442 | 2.435 | 3.877 ± 0.032 |

Table 4: Impact on model fidelity of post-training quantization.
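The compression ratios in Table 4 follow directly from the storage cost per parameter of each format. The sketch below uses the bit widths from Table 2 plus a sign bit, and approximates BFP16 as 8 mantissa bits plus an 8-bit exponent shared across a block of 10 values; the small gap to the reported 3.61 presumably comes from padding channels that are not a multiple of the block size, which is an assumption rather than something stated in the paper.

```python
# Approximate bits stored per parameter for the formats in Tables 2 and 4.
bits_per_value = {
    "FP32": 1 + 8 + 23,          # sign + exponent + mantissa
    "TF32": 1 + 8 + 10,          # 19 bits stored per value
    "bfloat16": 1 + 8 + 7,
    "BFP16": (1 + 7) + 8 / 10,   # 8-bit exponent shared across a block of 10
    "FP16": 1 + 5 + 10,
    "INT8": 8,
}

for fmt, bits in bits_per_value.items():
    print(f"{fmt:>8}: {32 / bits:.2f}x")   # 1.00, 1.68, 2.00, 3.64, 2.00, 4.00
```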

Sparsity and Quantization Combined  Finally, we consider applying both quantization and sparsity to our WaveNet model at the same time. Due to time constraints and the large number of runs required if running every combination of sparsity and precision previously considered, we choose to only investigate two different sparsity levels. We use a per-layer compression ratio of 4 since this produced a model with a high compression ratio that maintains a similar quality to the dense baseline. We also investigate the 2:4 sparsity pattern since we suspect that this will become a popular choice for sparsity, with tools supported by NVIDIA for this sparsity level.

The results for these experiments are presented in Table 5. When looking at the models using the sparse layer compression ratio of 4, we find that quantization does not cause a significant degradation in quality compared to our FP32 baseline in all cases besides INT8. Our best model with this sparsity pattern is the TF32 model, which achieves a MOS comparable with the baseline FP32 dense model whilst having a compression ratio of over 6.

When looking at the models that use the 2:4 sparsity pattern, we find results similar to the ones obtained using the sparse layer compression ratio of 4. The quantized models all have comparable audio fidelity to the FP32 model with no significant change in MOS, despite having a compression ratio of up to 7.88 in the case of INT8.

| Format | Val loss (ratio 4) | Test loss (ratio 4) | CR (ratio 4) | MOS (ratio 4) | Val loss (2:4) | Test loss (2:4) | CR (2:4) | MOS (2:4) |
|---|---|---|---|---|---|---|---|---|
| FP32 | 2.200 | 2.194 | 3.83 | 3.922 ± 0.030 | 2.195 | 2.191 | 1.97 | 3.881 ± 0.027 |
| TF32 | 2.200 | 2.194 | 6.44 | 3.951 ± 0.031 | 2.197 | 2.191 | 3.32 | 3.926 ± 0.028 |
| bfloat16 | 2.201 | 2.195 | 7.65 | 3.894 ± 0.030 | 2.197 | 2.191 | 3.94 | 3.835 ± 0.026 |
| BFP16 | 2.201 | 2.195 | 13.84 | 3.934 ± 0.028 | 2.197 | 2.191 | 7.13 | 3.862 ± 0.029 |
| FP16.16 | 2.200 | 2.194 | 7.65 | 3.895 ± 0.029 | 2.197 | 2.191 | 3.94 | 3.830 ± 0.031 |
| FP16.32 | 2.200 | 2.194 | 7.65 | 3.919 ± 0.030 | 2.197 | 2.191 | 3.94 | 3.889 ± 0.029 |
| INT8 | 3.591 | 3.534 | 15.30 | 3.527 ± 0.040 | 2.635 | 2.629 | 7.88 | 3.836 ± 0.030 |

Table 5: Effect on model quality of post-training quantization of sparse models.
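A useful sanity check on Table 5: to within rounding, each model compression ratio is the product of the sparsity model compression ratio (Table 3) and the quantization compression ratio (Table 4). The sketch below uses the unrounded sparsity ratios from the earlier reconstruction (3.825 and 1.970, reported as 3.83 and 1.97 in Table 3); BFP16 is omitted because its exact per-value storage cost is not stated.

```python
sparsity_model_cr = {"sparse layer CR 4": 3.825, "2:4": 1.970}   # unrounded values
quantization_cr = {"TF32": 32 / 19, "bfloat16": 2.0, "FP16.32": 2.0, "INT8": 4.0}

for pattern, s_cr in sparsity_model_cr.items():
    for fmt, q_cr in quantization_cr.items():
        print(f"{pattern:>17} + {fmt:<8}: {s_cr * q_cr:5.2f}")
# Reproduces 6.44, 7.65, 7.65, 15.30 and 3.32, 3.94, 3.94, 7.88 from Table 5.
```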

4 Conclusions and Future Work

In this work we have shown that model compression can be utilised to create a WaveNet vocoder with a high compression ratio whilst still being capable of near-human quality speech synthesis. Our BFP16 model with a sparse layer compression ratio of 4 synthesizes audio with a quality that is on par with a dense FP32 baseline whilst achieving a compression ratio of 13.84.

We hypothesise that higher compression ratios may be achieved by exploiting more extreme forms of quantization, such as INT4, INT2 and even binary models, although we suspect that the use of QAT will be vital to maintain acceptable audio fidelity in these cases.

We hope that our results encourage the use of model quantization and sparsity to realise the theoretical speedup afforded by them on next generation hardware accelerators to produce high quality text-to-speech systems, although we leave the specifics of such deployments to future work.

References

Blalock, D.; Ortiz, J. J. G.; Frankle, J.; and Guttag, J. 2020. What is the state of neural network pruning? arXiv preprint arXiv:2003.03033.

Cao, S.; Zhang, C.; Yao, Z.; Xiao, W.; Nie, L.; Zhan, D.; Liu, Y.; Wu, M.; and Zhang, L. 2019. Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 63–72.

Han, S.; Pool, J.; Tran, J.; and Dally, W. 2015. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, 1135–1143.

He, Y.; Sainath, T. N.; Prabhavalkar, R.; McGraw, I.; Alvarez, R.; Zhao, D.; Rybach, D.; Kannan, A.; Wu, Y.; Pang, R.; et al. 2019. Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6381–6385. IEEE.

He, Y.; Zhang, X.; and Sun, J. 2017. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, 1389–1397.

Hsu, P.-c.; and Lee, H.-y. 2020. WG-WaveNet: Real-Time High-Fidelity Speech Synthesis without GPU. arXiv preprint arXiv:2005.07412.

Hussain, S.; Javaheripi, M.; Neekhara, P.; Kastner, R.; and Koushanfar, F. 2020. FastWave: Accelerating autoregressive convolutional neural networks on FPGA. arXiv preprint arXiv:2002.04971.

Intel. 2020. BFLOAT16 - Hardware Numerics Definition. Technical report, Intel Corporation. URL https://software.intel.com/sites/default/files/managed/40/8b/bf16-hardware-numerics-definition-white-paper.pdf.

Ito, K.; and Johnson, L. 2017. The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/.

Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; and Kalenichenko, D. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2704–2713.

Kalchbrenner, N.; Elsen, E.; Simonyan, K.; Noury, S.; Casagrande, N.; Lockhart, E.; Stimberg, F.; Oord, A. v. d.; Dieleman, S.; and Kavukcuoglu, K. 2018. Efficient neural audio synthesis. arXiv preprint arXiv:1802.08435.

Kim, S.; Lee, S.-g.; Song, J.; Kim, J.; and Yoon, S. 2018. FloWaveNet: A generative flow for raw audio. arXiv preprint arXiv:1811.02155.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; Venkatesh, G.; et al. 2017. Mixed precision training. arXiv preprint arXiv:1710.03740.

Narang, S.; Undersander, E.; and Diamos, G. 2017. Block-sparse recurrent neural networks. arXiv preprint arXiv:1711.02782.

NVIDIA. 2020. NVIDIA A100 Tensor Core GPU Architecture. Technical report, NVIDIA Corporation. URL https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf.

Oord, A.; Li, Y.; Babuschkin, I.; Simonyan, K.; Vinyals, O.; Kavukcuoglu, K.; Driessche, G.; Lockhart, E.; Cobo, L.; Stimberg, F.; et al. 2018. Parallel WaveNet: Fast high-fidelity speech synthesis. In International Conference on Machine Learning, 3918–3926.

Oord, A. v. d.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; and Kavukcuoglu, K. 2016. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, 8024–8035. Curran Associates, Inc. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.

Pharris, B. 2018. Nv-Wavenet: Better Speech Synthesis Using GPU-Enabled WaveNet Inference. URL https://developer.nvidia.com/blog/nv-wavenet-gpu-speech-synthesis/.

Ping, W.; Peng, K.; and Chen, J. 2018. ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281.

Ping, W.; Peng, K.; Zhao, K.; and Song, Z. 2019. WaveFlow: A compact flow-based model for raw audio. arXiv preprint arXiv:1912.01219.

Prenger, R.; Valle, R.; and Catanzaro, B. 2019. WaveGlow: A flow-based generative network for speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3617–3621. IEEE.

Reddi, V. J.; Cheng, C.; Kanter, D.; Mattson, P.; Schmuelling, G.; Wu, C.-J.; Anderson, B.; Breughe, M.; Charlebois, M.; Chou, W.; et al. 2019. MLPerf inference benchmark. arXiv preprint arXiv:1911.02549.

Shen, J.; Pang, R.; Weiss, R. J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerrv-Ryan, R.; et al. 2018. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4779–4783. IEEE.

Song, Z.; Liu, Z.; and Wang, D. 2017. Computation error analysis of block floating point arithmetic oriented convolution neural network accelerator design. arXiv preprint arXiv:1709.07776.

Tian, Q.; Zhang, Z.; Lu, H.; Chen, L.-H.; and Liu, S. 2020. FeatherWave: An efficient high-fidelity neural vocoder with multi-band linear prediction. arXiv preprint arXiv:2005.05551.

Valin, J.-M.; and Skoglund, J. 2019. LPCNet: Improving neural speech synthesis through linear prediction. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5891–5895. IEEE.

Wu, H.; Judd, P.; Zhang, X.; Isaev, M.; and Micikevicius, P. 2020. Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation. arXiv preprint arXiv:2004.09602.

Wu, S.; Li, G.; Chen, F.; and Shi, L. 2018. Training and inference with integers in deep neural networks. arXiv preprint arXiv:1802.04680.

Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Yao, Z.; Cao, S.; Xiao, W.; Zhang, C.; and Nie, L. 2019. Balanced sparsity for efficient DNN inference on GPU. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 5676–5683.

Zafrir, O.; Boudoukh, G.; Izsak, P.; and Wasserblat, M. 2019. Q8BERT: Quantized 8bit BERT. arXiv preprint arXiv:1910.06188.

Zhang, T.; Lin, Z.; Yang, G.; and De Sa, C. 2019. QPyTorch: A Low-Precision Arithmetic Simulation Framework. arXiv preprint arXiv:1910.04540.

Zhu, M.; and Gupta, S. 2017. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878.