
Empirical Evaluation of Model Compression Techniques on the WaveNet Vocoder

Sam Davis, Giuseppe Coccia, Sam Gooch, Julian Mack
Myrtle.ai, Cambridge, UK
[email protected], [email protected], [email protected], [email protected]

Abstract

WaveNet is a state-of-the-art text-to-speech vocoder that remains challenging to deploy due to its autoregressive loop. In this work we focus on ways to accelerate the original WaveNet architecture directly, as opposed to modifying the architecture, such that the model can be deployed as part of a scalable text-to-speech system. We survey a wide variety of model compression techniques that are amenable to deployment on a range of hardware platforms. In particular, we compare different model sparsity methods and levels, and seven widely used precisions as targets for quantization; and are able to achieve models with a compression ratio of up to 13.84 without loss in audio fidelity compared to a dense, single-precision floating-point baseline. All techniques are implemented using existing open source deep learning frameworks and libraries to encourage their wider adoption.

1 Introduction

The widespread adoption of personal assistants has been powered, in large part, by advances in the field of Text-to-Speech (TTS). Many state-of-the-art TTS systems contain a model, referred to as a vocoder, that takes as input audio features derived from a piece of text and outputs synthesized speech. WaveNet (Oord et al. 2016) is a state-of-the-art vocoder that is capable of producing synthesized speech with near-human-level quality (Shen et al. 2018). The key to the model's quality is its autoregressive loop, but this property makes the model exceptionally challenging to deploy in applications that require real-time output or need to efficiently scale to millions of users, since naïve implementations may take tens of minutes to generate ten seconds of speech.

As a result, TTS research has focused on finding alternative vocoder architectures such as Parallel-WaveNet (Oord et al. 2018), WaveRNN (Kalchbrenner et al. 2018), ClariNet (Ping, Peng, and Chen 2018) and WaveGlow (Prenger, Valle, and Catanzaro 2019) that achieve higher performance when deployed on existing hardware. There is a degree of ambiguity as to the highest quality vocoder, as audio quality evaluation is subjective, but all authors agree that WaveNet produces audio at least as good as, if not better than, the more recent approaches (Kim et al. 2018; Oord et al. 2018; Prenger, Valle, and Catanzaro 2019; Tian et al. 2020; Hsu and Lee 2020).

In this paper, rather than changing the WaveNet architecture to improve performance, we instead keep it fixed and explore a range of model compression techniques that can yield greater inference performance. Crucially, the techniques we explore are all available in existing deep learning frameworks and are deployable to a wide range of current and future CPUs and neural network accelerators. Finally, motivated by the desire to maintain WaveNet's quality, we evaluate the impact that these compression techniques have on the perceived fidelity of the synthesized speech. We examine two main categories of model compression—sparsity and quantization—and explore both their independent and combined impact on model quality. For sparsity we consider iterative and one-shot magnitude-based neural network pruning; and for quantization we explore the INT8, bfloat16, half-precision floating-point with both 16-bit and 32-bit accumulation (FP16.16, FP16.32), 16-bit block floating-point (BFP16), TensorFloat32 (TF32) and single-precision floating-point (FP32) formats. While there has been some work in vocoder pruning of WaveRNN and its variants (Kalchbrenner et al. 2018; Valin and Skoglund 2019; Tian et al. 2020), to our knowledge, this is the first paper in which WaveNet pruning results are provided. Additionally, to our knowledge, no other authors compare as wide a range of precisions, nor look at the interactions between pruning and quantization for the WaveNet model. We include samples generated by the models presented in this work and will release our code¹.

¹ myrtlesoftware.github.io/-paper/

2 Related Work

Numerous attempts have been made to improve vocoder inference performance. The WaveRNN (Kalchbrenner et al. 2018) authors note that the time for vocoder inference, T(u), for a target audio sequence u can be decomposed into computation time c_i and kernel launch overhead d_i for each of the N operations (layers) of the model:

    T(u) = |u| \sum_{i=1}^{N} (c_i + d_i)    (1)

Attempts to optimize a vocoder for deployment aim to reduce at least one of {|u|, N, c_i, d_i}. Many approaches including Parallel-WaveNet, ClariNet, WaveGlow and WaveFlow (Ping et al. 2019) remove the autoregressive component and hence reduce |u| so that many or all of the samples can be generated in parallel. Others, including WaveRNN and its variants LPCNet (Valin and Skoglund 2019) and FeatherWave (Tian et al. 2020), keep the autoregressive component but make alterations to the architecture to decrease the product of N and c_i. There have also been efforts that focus on reducing d_i by exploiting techniques such as persistent kernels that only launch the kernel once per sequence (Pharris 2018). In this work we explore reducing c_i without altering the WaveNet architecture, by utilising model compression to give greater possibilities in the types of models that can be deployed.
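To make Equation 1 concrete, the sketch below evaluates the decomposition for a naive sample-by-sample implementation. All per-operation timings here are hypothetical placeholders chosen only to illustrate how launch overheads of tens of microseconds per operation push the generation time for ten seconds of audio into the minutes range; they are not measurements from this paper.

```python
# Illustration of Equation 1: T(u) = |u| * sum_i (c_i + d_i).
# The values of N, c_i and d_i below are hypothetical, not measured.

def inference_time(num_samples, compute_times, launch_overheads):
    per_sample = sum(c + d for c, d in zip(compute_times, launch_overheads))
    return num_samples * per_sample

N = 50                    # hypothetical number of operations (layers)
c = [5e-6] * N            # hypothetical compute time per operation (seconds)
d = [70e-6] * N           # hypothetical kernel launch overhead per operation
samples = 10 * 16_000     # ten seconds of 16 kHz audio

print(f"{inference_time(samples, c, d) / 60:.1f} minutes")  # 10.0 minutes
```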

2.1 Sparsity

Sparse models offer two potential benefits over their dense counterparts:

1. The amount of computation can be reduced since, for example, multiplications by zero need not be performed.
2. Memory bandwidth requirements can be reduced as it is possible to achieve higher compression ratios with sparse matrices.

The first of these reduces c_i but, in order to realise either of these benefits, hardware support is usually required. Depending on the type of support, different hardware platforms are amenable to different types of sparsity. At one end of the spectrum, some authors use channel pruning, in which entire convolutional channels are set to zero (He, Zhang, and Sun 2017). It is comparatively easy to realise the inference-time performance benefits of channel sparsity, but this approach produces a significant degradation in audio quality for WaveNet (Hussain et al. 2020). Channel sparsity is a special case of block sparsity where, for a 2D weight matrix, blocks of size n × m are enforced to be either all-dense or all-sparse (Narang, Undersander, and Diamos 2017). At the other end of the spectrum, sparsity can also be unstructured, meaning there are no constraints on the sparsity pattern; this typically results in the smallest quality degradation but is also the most challenging sparsity pattern to deploy efficiently. A hybrid approach is to employ balanced sparsity (Yao et al. 2019; Cao et al. 2019) where each block is independently pruned to the target sparsity percentage but within a block the sparsity is unstructured.

The other principal axes on which neural network sparsity approaches can differ relate to the method of obtaining the sparse network. One way to do this is by utilizing magnitude-based pruning, an approach in which the parameters closest to zero are pruned (Han et al. 2015). Magnitude-based pruning can be broadly classified as either one-shot, in which a trained network is pruned to the desired sparsity in a single step (Han et al. 2015), or iterative, where the level of sparsity increases more gradually over the training process (Zhu and Gupta 2017). However, the division between the two is not always clear-cut. For example, the FeatherWave authors propose a two-stage sparse pruning schedule (TSSP) which combines a one-shot jump to 50% sparsity followed by an iterative pruning stage.

The WaveRNN, LPCNet and FeatherWave models utilise high levels of sparsity to reduce inference time. The WaveRNN authors use iterative pruning to achieve sparsity as high as 96%. They investigate a range of block sparsity patterns including 1x1 (unstructured), 4x4 and 16x1 and find that the latter is the most performant as it more closely mirrors the layout in physical memory. The LPCNet and FeatherWave authors both use 90% sparse networks with a 16x1 pattern, although the latter uses the TSSP approach discussed above instead of the purely iterative approach.
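As a rough illustration of the granularities discussed in this section, the sketch below applies one-shot magnitude-based pruning to a single weight matrix at two granularities: layer-wise unstructured and balanced 2:4. This is a generic sketch using plain PyTorch tensor operations, not the implementation used in this paper.

```python
import torch

def prune_unstructured(w, sparsity):
    """Zero the smallest-magnitude entries of w (layer-wise, unstructured)."""
    k = int(sparsity * w.numel())
    if k == 0:
        return w.clone()
    threshold = w.abs().flatten().kthvalue(k).values
    return torch.where(w.abs() > threshold, w, torch.zeros_like(w))

def prune_balanced(w, block=4, keep=2):
    """Balanced sparsity: keep the `keep` largest-magnitude values in every
    block of `block` consecutive values (keep=2, block=4 gives the 2:4 pattern)."""
    flat = w.reshape(-1, block)
    idx = flat.abs().topk(keep, dim=1).indices
    mask = torch.zeros_like(flat).scatter_(1, idx, 1.0)
    return (flat * mask).reshape(w.shape)

w = torch.randn(120, 240)                                   # a 2-D weight matrix
print((prune_unstructured(w, 0.75) == 0).float().mean())    # ~0.75 of entries zeroed
print((prune_balanced(w) == 0).float().mean())              # exactly 0.50 zeroed
```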
2.2 Quantization

Quantized models also reduce c_i as the operations are now performed at a numerical format in which the operations are less computationally expensive. This approach has been applied to a wide variety of models including BERT (Wu et al. 2020), ResNet and GNMT (Zafrir et al. 2019), and sees adoption in widely recognised benchmarks (Reddi et al. 2019).

The quantization process includes one or both of:

1. Reducing the number of bits of the datatype, e.g. use 8 bits instead of 32 bits.
2. Using a less expensive format, e.g. use integer instead of floating-point.

A simple scheme is to perform all multiplications in the FP16 data format as this is already widely supported on a variety of hardware devices. The results are accumulated in either FP16 or FP32; this distinction matters for the range of representable values and for what precision any activation functions are later performed in. We write these choices as FP16.16 and FP16.32, denoting the use of FP16 for multiplies and either FP16 or FP32 for accumulations respectively.

Quantizing to integers is another popular choice. When quantizing from floating point to integer, it is necessary to use a quantization scheme in which there is no quantization error for the value 0, as these parameters will have an outsized impact on model performance (Jacob et al. 2018), especially when quantizing sparse matrices. Running inference at INTX (most often INT8) is widely used for deployment, including for models from the domains of Machine Translation (Wu et al. 2016), Automatic Speech Recognition (He et al. 2019), Computer Vision (Wu et al. 2018) and NLP embeddings (Zafrir et al. 2019).

However, formats other than INTX and the basic FP16 are becoming more widely used as hardware vendors accommodate them. For example, BrainFloat16 is supported by TPUs² and Intel engineers have demonstrated a 1.5× speedup using this format for both Parallel-WaveNet and FeatherWave on Xeon CPUs³. BlockFloat16 is supported on Intel's Stratix 10 NX FPGA⁴ and NVIDIA's Ampere series of GPUs will support the TensorFloat32 format. In the FPGA space, there is work using bespoke formats; for example, Hussain et al. (2020) achieved a 2× speedup for WaveNet inference by quantizing from floating point to fixed point while keeping the range and precision of their 27-bit data type constant.

² cloud.google.com/tpu/docs/bfloat16
³ intel.com/content/www/us/en/artificial-intelligence/posts/intel-xeon-text-to-speech.html
⁴ intel.com/content/www/us/en/products/programmable/fpga/stratix-10/nx.html

In many cases, it is possible to perform post-training quantization (PTQ), in which full-precision weights are quantized to the desired precision after training is completed, with minimal loss in model quality. However, when the quality degradation is large, it becomes necessary to perform Quantization Aware Training (QAT) (Jacob et al. 2018). In QAT, the quantization operation for the target precision is simulated during training so that the trained FP32 weights learn to remove the noise injected by the quantization.
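The requirement that the value 0 be representable exactly is straightforward to satisfy with a symmetric scheme whose zero-point is 0. The sketch below shows one such post-training INT8 quantizer applied to an artificially pruned weight tensor; it is a minimal illustration of the idea, not the quantization scheme used later in this paper.

```python
import torch

def quantize_int8_symmetric(w):
    """Post-training symmetric INT8 quantization of a weight tensor.
    With a zero-point of 0, the value 0.0 is represented exactly, so pruned
    (zero) weights stay exactly zero after quantization."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

w = torch.randn(240, 256) * 0.1
w[w.abs() < 0.05] = 0.0                      # pretend the layer was pruned
q, scale = quantize_int8_symmetric(w)
w_hat = dequantize(q, scale)
print(torch.equal((w == 0), (w_hat == 0)))   # True: pruned zeros survive exactly
print((w - w_hat).abs().max() / scale)       # error is at most half a quantization step
```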
3 Experiments

3.1 Setup and Evaluation

Model  We use the WaveNet architecture as detailed in Figure 1. This model uses mel spectrograms with 80 filter banks for the conditional features and upsamples them using a 1D transposed convolution with a kernel size of 800 and a stride of 200. The remainder of the architecture is parameterized by the number of channels in its convolutions: the number of skip, residual and audio channels; as well as by the number of repeated layers and the dilation cycle parameter D. For all experiments we use a model with s = 240 skip channels, r = 120 residual channels, a = 256 audio channels, L = 16 repeated layers and D = 8. This model has 7.2 million parameters; see Table 1. Whilst the model produces audio with 256 channels, the final audio output has a bit depth of 16 as a µ-law companding transform is applied (Oord et al. 2016).

Figure 1: WaveNet model architecture. ACT denotes the activation format before or after an operation. For the convolution layers, the input activations and parameters are both converted to the INP format prior to use. INT refers to a 64-bit integer data type. The diamond splits the input channels in half to make two outputs.

Sparsity  The weights of the dilation, conditional, residual and skip convolutions in the repeated layers and the Out post-processing convolution weights are pruned as, combined, these layers contain 95.6% of the total operations; see Table 1. The feature upsample pre-processing weights are also pruned as this layer contains 71.1% of the total number of parameters despite only requiring 1.25% of the total operations. Biases for all pruned layers remain dense. Finally, the embedding pre-processing and End post-processing layers remain dense as, although they have comparatively few parameters and operations, pruning them has a disproportionate effect on the final quality.

We focus on two sparsity granularities as these are supported by the major deep learning frameworks including TensorFlow and PyTorch. The first granularity is layer-wise unstructured sparsity. We use iterative magnitude-based pruning where each layer is pruned independently up to a target compression ratio, defined as per Equation 2.

    compression ratio = original size / compressed size    (2)

Each layer is pruned every 500 steps following the cubic pruning schedule found in TensorFlow (Zhu and Gupta 2017). The second granularity is balanced sparsity where 2 out of every block of 4 values is pruned; denoted 2:4. For this technique we use NVIDIA's automatic sparsity library and follow their recommended approach: train a dense model, perform one-shot pruning to the target sparsity, and then repeat the training process starting from the pruned parameters.
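For reference, a minimal sketch of the cubic sparsity schedule of Zhu and Gupta (2017) that the iterative pruning above follows is given below. Only the 500-step pruning frequency is taken from the text; the begin step, end step and final sparsity are illustrative placeholders rather than the training configuration used in this paper.

```python
def cubic_sparsity(step, s_init=0.0, s_final=0.75, begin_step=0,
                   end_step=200_000, prune_every=500):
    """Cubic sparsity schedule (Zhu & Gupta 2017): sparsity ramps from s_init
    to s_final between begin_step and end_step, updated every `prune_every` steps."""
    if step < begin_step:
        return s_init
    if step >= end_step:
        return s_final
    # Hold the level constant between pruning events.
    step = begin_step + ((step - begin_step) // prune_every) * prune_every
    progress = (step - begin_step) / (end_step - begin_step)
    return s_final + (s_init - s_final) * (1.0 - progress) ** 3

# s_final = 0.75 corresponds to a sparse layer compression ratio of 4
# under Equation 2, ignoring any index overhead.
for step in (0, 50_000, 100_000, 200_000):
    print(step, round(cubic_sparsity(step), 3))   # 0.0, 0.434, 0.656, 0.75
```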

| Layer | Type | # Parameters, per layer | # Parameters, total | GOP/s audio, per layer | GOP/s audio, total |
|---|---|---|---|---|---|
| Pre-processing layers | | | | | |
| Embedding | Embedding | - | 30,720 | - | - |
| Feature Upsample | ConvTranspose1d | - | 5,120,080 | - | 0.82 |
| Repeated layers | | | | | |
| Dilation | Dilated Conv1d | 57,840 | 925,440 | 1.84 | 29.49 |
| Conditional | Conv1d | 19,440 | 311,040 | 0.61 | 9.83 |
| Residual | Conv1d | 14,520 | 217,800 | 0.46 | 6.91 |
| Skip | Conv1d | 29,040 | 464,640 | 0.92 | 14.75 |
| Post-processing layers | | | | | |
| Out | Conv1d | - | 61,440 | - | 1.96 |
| End | Conv1d | - | 65,536 | - | 2.09 |
| Total | | | 7,196,696 | | 65.85 |

Table 1: A breakdown of the model's parameters and giga-operations required per second of synthesized audio output.
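As a cross-check on Table 1, the sketch below reconstructs the per-layer parameter counts from the stated hyperparameters (s = 240, r = 120, a = 256, L = 16) and the kernel sizes in Figure 1. The bias/no-bias choices and the use of L − 1 rather than L residual convolutions are inferred from the table entries themselves rather than stated explicitly in the text.

```python
s, r, a, L = 240, 120, 256, 16            # skip, residual and audio channels; repeated layers

params = {
    "embedding":        a * r,                           # 256-entry embedding, r outputs
    "feature upsample": 80 * 80 * 800 + 80,              # ConvTranspose1d, kernel 800, with bias
    "dilation":         L * (r * (2 * r) * 2 + 2 * r),   # 2x1 dilated conv, r -> 2r
    "conditional":      L * (80 * (2 * r) + 2 * r),      # 1x1 conv, 80 -> 2r
    "residual":         (L - 1) * (r * r + r),           # 1x1 conv, r -> r, in L-1 layers
    "skip":             L * (r * s + s),                 # 1x1 conv, r -> s
    "out":              s * a,                           # 1x1 conv, s -> a, no bias
    "end":              a * a,                           # 1x1 conv, a -> a, no bias
}

for name, count in params.items():
    print(f"{name:>16}: {count:>9,}")
print(f"{'total':>16}: {sum(params.values()):>9,}")      # 7,196,696, matching Table 1
```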

Quantization  We use the 7 formats listed in Table 2. For our BFP16 implementation, we use a block size of 10 in the channel dimension. These formats are chosen as they are common amongst deep learning frameworks as well as current and future CPUs and neural network accelerators. PyTorch (Paszke et al. 2019) is used for INT8 and FP32. QPyTorch (Zhang et al. 2019) is used to simulate bfloat16 (Intel 2020), FP16.16, FP16.32, BFP16 (Song, Liu, and Wang 2017), and TF32 (NVIDIA 2020). Simulation is achieved by converting the input activations and parameters of the convolution layers to the target formats. The remaining operations may use a different format as detailed in Table 2.

| Format | Abbrev. (INP) | E | M | ACT |
|---|---|---|---|---|
| binary32 | FP32 | 8 | 23 | FP32 |
| TensorFloat32 | TF32 | 8 | 10 | FP32 |
| BrainFloat16 | bfloat16 | 8 | 7 | FP32 |
| BlockFloat16 | BFP16 | 8 | 7 | FP32 |
| binary16 | FP16.16 | 5 | 10 | FP16 |
| binary16 | FP16.32 | 5 | 10 | FP32 |
| Signed 8-bit int | INT8 | 0 | 8 | INT8 |

Table 2: A summary of the numerical formats evaluated in this paper. The binary32 and binary16 formats are as defined in the IEEE 754-2008 standard. The E and M columns refer to the number of bits used for the exponent and mantissa respectively. The INP and ACT columns relate to the values used in Figure 1.

We use post-training quantization for all formats. We experimented with using QAT for some formats but found that it did not have a significant impact on the final model quality compared to PTQ, despite being much harder to integrate into a training pipeline and increasing the barriers to deployment. Hence, we leave more detailed investigations regarding QAT to future work.

Training  We use the LJSpeech (Ito and Johnson 2017) dataset subsampled to 16kHz. Prior to starting this research, 100 samples were randomly selected from the dataset for use as a validation set and 100 samples were randomly selected from the dataset for use as a test set. All remaining samples are used as a training set. No data augmentation is applied. All experiments use a batch size of 16 distributed across 1–4 GPUs as well as mixed precision training (Micikevicius et al. 2017). One element in the batch consists of a randomly selected 1 second segment from an audio clip in the training set, padding with zeros where necessary. We use the Adam optimizer (Kingma and Ba 2014) with a fixed learning rate of 10⁻³, β1 = 0.9, β2 = 0.999, ε = 10⁻⁸.

Evaluation  We report both the teacher-forced cross-entropy validation and test loss, the Mean Opinion Score (MOS) and the compression ratio for all experiments. For the sparsity experiments we also report theoretical speedup. Each experiment is repeated 3 times with 3 different random seeds and the model with median validation loss is selected to compute and report all of these metrics.

For each generated sample in the test set the MOS is computed by asking 30 independent workers to rate the sample's naturalness on a five point scale. The reported MOS is the mean of these scores and a 95% confidence interval is computed using the t-distribution.

The compression ratio is defined as per Equation 2. When referring to the compression ratio or model compression ratio, the size is defined to be the total size of the model in bits. We also refer to the sparse layer compression ratio where the size is defined to be the size in bits of only the layers that are being pruned. For sparse models, some layers remain dense so the model compression ratio is lower than the sparse layer compression ratio. The theoretical speedup is defined as the number of multiply-adds in the dense model divided by the number in the sparse model (Blalock et al. 2020).

Note that both the compression ratio and theoretical speedup only provide an upper bound on what is achievable in deployment as, in practice, there will be extra overheads. For example, additional memory and hardware is required to store and use sparse parameters.
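A minimal sketch of the per-sample confidence interval computation described above (mean of 30 ratings, 95% interval from the t-distribution) is shown below. The ratings are fabricated placeholders, and the paper's exact aggregation across test samples may differ from this single-sample reading.

```python
import numpy as np
from scipy import stats

ratings = np.array([4, 4, 5, 3, 4, 4, 5, 4, 3, 4, 4, 5, 4, 4, 3,
                    4, 5, 4, 4, 4, 3, 4, 4, 5, 4, 4, 4, 3, 5, 4])  # placeholder scores

n = len(ratings)                        # 30 raters per sample
mean = ratings.mean()
sem = ratings.std(ddof=1) / np.sqrt(n)  # standard error of the mean
half_width = stats.t.ppf(0.975, df=n - 1) * sem

print(f"MOS = {mean:.3f} +/- {half_width:.3f} (95% CI, t-distribution)")
```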
3.2 Results

Sparsity  The results for the sparsity experiments that use the iterative unstructured and one-shot 2:4 magnitude-based pruning techniques are presented in Table 3.

| Granularity | Structure | Compression ratio, sparse layer | Compression ratio, model | Val loss | Test loss | MOS | Theoretical speedup |
|---|---|---|---|---|---|---|---|
| Ground Truth (Human) | - | - | - | - | - | 4.072 ± 0.028 | - |
| Baseline (Dense) | - | - | 1.00 | 2.189 | 2.182 | 3.976 ± 0.027 | 1.00 |
| Iterative | Unstructured | 2.00 | 1.97 | 2.179 | 2.173 | 3.966 ± 0.029 | 1.91 |
| Iterative | Unstructured | 4.00 | 3.83 | 2.200 | 2.194 | 3.922 ± 0.030 | 3.51 |
| Iterative | Unstructured | 8.00 | 7.23 | 2.236 | 2.231 | 3.877 ± 0.032 | 6.03 |
| Iterative | Unstructured | 16.00 | 13.02 | 2.281 | 2.276 | 3.757 ± 0.033 | 9.41 |
| Iterative | Unstructured | 32.00 | 21.73 | 2.357 | 2.402 | 3.616 ± 0.038 | 12.95 |
| One-shot | 2:4 | 2.00 | 1.97 | 2.195 | 2.191 | 3.881 ± 0.027 | 1.91 |

Table 3: Impact of varying sparsity techniques and compression ratios on model quality and the theoretical speedups.

Focusing on the iterative unstructured experiments, we see that the difference between the baseline model and the models that use a sparse layer compression ratio of 2 and 4 respectively is not statistically significant. This means a model compression ratio of up to 3.83 and a theoretical speedup of up to 3.51 can be achieved without a reduction in fidelity of the synthesized audio. The models with a sparse layer compression ratio greater than 4 do exhibit a significant degradation in audio fidelity. However, this reduction in fidelity does lead to a large increase in theoretical speedup. For example, the model that uses a compression ratio of 32 in its sparse layers—4.6% of the total number of parameters and 7.7% of the total number of operations—has a potential theoretical speedup of 12.95 over the dense baseline model whilst achieving only a 9.1% lower MOS. However, note that MOS is a subjective metric and therefore relative comparisons are difficult to interpret. Figure 2 demonstrates this trade-off between model quality and the theoretical speedup that sparsity offers.
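The model compression ratios in Table 3 can be reproduced from the sparse layer compression ratios using the parameter counts in Table 1, keeping the embedding and End layers and all biases dense as described in Section 3.1. The weight/bias split used below is inferred from the Table 1 entries rather than quoted from the paper.

```python
total = 7_196_696                       # total parameters (Table 1)

# Weights in the pruned layers; biases stay dense and are excluded here.
pruned_weights = (
    5_120_000                           # feature upsample ConvTranspose1d weights
    + 16 * (57_600 + 19_200 + 28_800)   # dilation, conditional and skip convs
    + 15 * 14_400                       # residual convs
    + 61_440                            # Out 1x1 conv (no bias)
)
dense = total - pruned_weights          # embedding, End layer and all biases

for sparse_layer_cr in (2, 4, 8, 16, 32):
    model_cr = total / (pruned_weights / sparse_layer_cr + dense)
    print(f"sparse layer CR {sparse_layer_cr:>2}: model CR {model_cr:.2f}")
# Prints 1.97, 3.83, 7.23, 13.02 and 21.73, matching the Model column of Table 3.
```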

Looking at the one-shot 2:4 magnitude-based pruning result, we see that it has a significantly lower MOS than both the baseline and the iterative unstructured model with a sparse layer compression ratio of 2. Further, the MOS score is comparable to that of the iterative unstructured models that use a sparse layer compression ratio of 4 and 8. This suggests that the iterative unstructured magnitude-based pruning technique produces higher quality WaveNet models for a fixed compression ratio, or models of greater theoretical speedup for a fixed quality.

Figure 2: Trade-off between model quality (change in MOS from baseline) and the theoretical speedup. Error bars display 95% confidence intervals using the t-distribution.

Quantization  We present the quantization results in Table 4. We are able to obtain an audio fidelity that matches that of the baseline FP32 model by using the TF32 precision. The other formats are all equivalent to each other but they are all significantly worse than TF32 and FP32. This could be due to the higher compression ratio with these formats compared to TF32 and FP32 causing a higher information loss in the model arithmetic that leads to a quality degradation.

| Format | Compression ratio | Val loss | Test loss | MOS |
|---|---|---|---|---|
| Ground Truth (Human) | - | - | - | 4.072 ± 0.028 |
| FP32 | 1.00 | 2.189 | 2.182 | 3.976 ± 0.027 |
| TF32 | 1.68 | 2.189 | 2.182 | 3.972 ± 0.029 |
| bfloat16 | 2.00 | 2.189 | 2.183 | 3.904 ± 0.030 |
| BFP16 | 3.61 | 2.189 | 2.183 | 3.877 ± 0.032 |
| FP16.16 | 2.00 | 2.189 | 2.182 | 3.911 ± 0.029 |
| FP16.32 | 2.00 | 2.189 | 2.182 | 3.903 ± 0.031 |
| INT8 | 4.00 | 2.442 | 2.435 | 3.877 ± 0.032 |

Table 4: Impact on model fidelity of post-training quantization.
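The compression ratios in Table 4 follow directly from the storage cost per parameter of each format. The sketch below uses the bit widths from Table 2 plus a sign bit, and approximates BFP16 as 8 mantissa bits plus an 8-bit exponent shared across a block of 10 values; the small gap to the reported 3.61 presumably comes from padding channels that are not a multiple of the block size, which is an assumption rather than something stated in the paper.

```python
# Approximate bits stored per parameter for the formats in Tables 2 and 4.
bits_per_value = {
    "FP32": 1 + 8 + 23,          # sign + exponent + mantissa
    "TF32": 1 + 8 + 10,          # 19 bits stored per value
    "bfloat16": 1 + 8 + 7,
    "BFP16": (1 + 7) + 8 / 10,   # 8-bit exponent shared across a block of 10
    "FP16": 1 + 5 + 10,
    "INT8": 8,
}

for fmt, bits in bits_per_value.items():
    print(f"{fmt:>8}: {32 / bits:.2f}x")   # 1.00, 1.68, 2.00, 3.64, 2.00, 4.00
```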

Sparsity and Quantization Combined  Finally, we consider applying both quantization and sparsity to our WaveNet model at the same time. Due to time constraints and the large number of runs required if running every combination of sparsity and precision previously considered, we choose to only investigate two different sparsity levels. We use a per-layer compression ratio of 4 since this produced a model with a high compression ratio that maintains a similar quality to the dense baseline. We also investigate the 2:4 sparsity pattern since we suspect that this will become a popular choice for sparsity, with tools supported by NVIDIA for this sparsity level.

The results for these experiments are presented in Table 5. When looking at the models using the sparse layer compression ratio of 4, we find that quantization does not cause a significant degradation in quality compared to our FP32 baseline in all cases besides INT8. Our best model with this sparsity pattern is the TF32 model, which achieves a MOS comparable with the baseline FP32 dense model whilst having a compression ratio of over 6.

When looking at the models that use the 2:4 sparsity pattern, we find results similar to the ones obtained using the sparse layer compression ratio of 4. The quantized models all have comparable audio fidelity to the FP32 model with no significant change in MOS, despite having a compression ratio of up to 7.88 in the case of INT8.

| Format | Val loss (ratio 4) | Test loss (ratio 4) | CR (ratio 4) | MOS (ratio 4) | Val loss (2:4) | Test loss (2:4) | CR (2:4) | MOS (2:4) |
|---|---|---|---|---|---|---|---|---|
| FP32 | 2.200 | 2.194 | 3.83 | 3.922 ± 0.030 | 2.195 | 2.191 | 1.97 | 3.881 ± 0.027 |
| TF32 | 2.200 | 2.194 | 6.44 | 3.951 ± 0.031 | 2.197 | 2.191 | 3.32 | 3.926 ± 0.028 |
| bfloat16 | 2.201 | 2.195 | 7.65 | 3.894 ± 0.030 | 2.197 | 2.191 | 3.94 | 3.835 ± 0.026 |
| BFP16 | 2.201 | 2.195 | 13.84 | 3.934 ± 0.028 | 2.197 | 2.191 | 7.13 | 3.862 ± 0.029 |
| FP16.16 | 2.200 | 2.194 | 7.65 | 3.895 ± 0.029 | 2.197 | 2.191 | 3.94 | 3.830 ± 0.031 |
| FP16.32 | 2.200 | 2.194 | 7.65 | 3.919 ± 0.030 | 2.197 | 2.191 | 3.94 | 3.889 ± 0.029 |
| INT8 | 3.591 | 3.534 | 15.30 | 3.527 ± 0.040 | 2.635 | 2.629 | 7.88 | 3.836 ± 0.030 |

Table 5: Effect on model quality of post-training quantization of sparse models.
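A useful sanity check on Table 5: to within rounding, each model compression ratio is the product of the sparsity model compression ratio (Table 3) and the quantization compression ratio (Table 4). The sketch below uses the unrounded sparsity ratios from the earlier reconstruction (3.825 and 1.970, reported as 3.83 and 1.97 in Table 3); BFP16 is omitted because its exact per-value storage cost is not stated.

```python
sparsity_model_cr = {"sparse layer CR 4": 3.825, "2:4": 1.970}   # unrounded values
quantization_cr = {"TF32": 32 / 19, "bfloat16": 2.0, "FP16.32": 2.0, "INT8": 4.0}

for pattern, s_cr in sparsity_model_cr.items():
    for fmt, q_cr in quantization_cr.items():
        print(f"{pattern:>17} + {fmt:<8}: {s_cr * q_cr:5.2f}")
# Reproduces 6.44, 7.65, 7.65, 15.30 and 3.32, 3.94, 3.94, 7.88 from Table 5.
```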

4 Conclusions and Future Work

In this work we have shown that model compression can be utilised to create a WaveNet vocoder with a high compression ratio whilst still being capable of near-human quality speech synthesis. Our BFP16 model with a sparse layer compression ratio of 4 synthesizes audio with a quality that is on par with a dense FP32 baseline whilst achieving a compression ratio of 13.84.

We hypothesise that higher compression ratios may be achieved by exploiting more extreme forms of quantization, such as INT4, INT2 and even binary models, although we suspect that the use of QAT will be vital to maintain acceptable audio fidelity in these cases.

We hope that our results encourage the use of model quantization and sparsity to realise the theoretical speedup afforded by them on next generation hardware accelerators to produce high quality text-to-speech systems, although we leave the specifics of such deployments to future work.

References

Blalock, D.; Ortiz, J. J. G.; Frankle, J.; and Guttag, J. 2020. What is the state of neural network pruning? arXiv preprint arXiv:2003.03033.

Cao, S.; Zhang, C.; Yao, Z.; Xiao, W.; Nie, L.; Zhan, D.; Liu, Y.; Wu, M.; and Zhang, L. 2019. Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 63–72.

Han, S.; Pool, J.; Tran, J.; and Dally, W. 2015. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, 1135–1143.

He, Y.; Sainath, T. N.; Prabhavalkar, R.; McGraw, I.; Alvarez, R.; Zhao, D.; Rybach, D.; Kannan, A.; Wu, Y.; Pang, R.; et al. 2019. Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6381–6385. IEEE.

He, Y.; Zhang, X.; and Sun, J. 2017. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, 1389–1397.

Hsu, P.-c.; and Lee, H.-y. 2020. WG-WaveNet: Real-Time High-Fidelity Speech Synthesis without GPU. arXiv preprint arXiv:2005.07412.

Hussain, S.; Javaheripi, M.; Neekhara, P.; Kastner, R.; and Koushanfar, F. 2020. FastWave: Accelerating autoregressive convolutional neural networks on FPGA. arXiv preprint arXiv:2002.04971.

Intel. 2020. BFLOAT16 - Hardware Numerics Definition. Technical report, Intel Corporation. URL https://software.intel.com/sites/default/files/managed/40/8b/bf16-hardware-numerics-definition-white-paper.pdf.

Ito, K.; and Johnson, L. 2017. The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/.

Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; and Kalenichenko, D. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2704–2713.

Kalchbrenner, N.; Elsen, E.; Simonyan, K.; Noury, S.; Casagrande, N.; Lockhart, E.; Stimberg, F.; Oord, A. v. d.; Dieleman, S.; and Kavukcuoglu, K. 2018. Efficient neural audio synthesis. arXiv preprint arXiv:1802.08435.

Kim, S.; Lee, S.-g.; Song, J.; Kim, J.; and Yoon, S. 2018. FloWaveNet: A generative flow for raw audio. arXiv preprint arXiv:1811.02155.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; Venkatesh, G.; et al. 2017. Mixed precision training. arXiv preprint arXiv:1710.03740.

Narang, S.; Undersander, E.; and Diamos, G. 2017. Block-sparse recurrent neural networks. arXiv preprint arXiv:1711.02782.

NVIDIA. 2020. NVIDIA A100 Tensor Core GPU Architecture. Technical report, NVIDIA Corporation. URL https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf.

Oord, A.; Li, Y.; Babuschkin, I.; Simonyan, K.; Vinyals, O.; Kavukcuoglu, K.; Driessche, G.; Lockhart, E.; Cobo, L.; Stimberg, F.; et al. 2018. Parallel WaveNet: Fast high-fidelity speech synthesis. In International Conference on Machine Learning, 3918–3926.

Oord, A. v. d.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; and Kavukcuoglu, K. 2016. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, 8024–8035. Curran Associates, Inc. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.

Pharris, B. 2018. Nv-Wavenet: Better Speech Synthesis Using GPU-Enabled WaveNet Inference. URL https://developer.nvidia.com/blog/nv-wavenet-gpu-speech-synthesis/.

Ping, W.; Peng, K.; and Chen, J. 2018. ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281.

Ping, W.; Peng, K.; Zhao, K.; and Song, Z. 2019. WaveFlow: A compact flow-based model for raw audio. arXiv preprint arXiv:1912.01219.

Prenger, R.; Valle, R.; and Catanzaro, B. 2019. WaveGlow: A flow-based generative network for speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3617–3621. IEEE.

Reddi, V. J.; Cheng, C.; Kanter, D.; Mattson, P.; Schmuelling, G.; Wu, C.-J.; Anderson, B.; Breughe, M.; Charlebois, M.; Chou, W.; et al. 2019. MLPerf inference benchmark. arXiv preprint arXiv:1911.02549.

Shen, J.; Pang, R.; Weiss, R. J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerrv-Ryan, R.; et al. 2018. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4779–4783. IEEE.

Song, Z.; Liu, Z.; and Wang, D. 2017. Computation error analysis of block floating point arithmetic oriented convolution neural network accelerator design. arXiv preprint arXiv:1709.07776.

Tian, Q.; Zhang, Z.; Lu, H.; Chen, L.-H.; and Liu, S. 2020. FeatherWave: An efficient high-fidelity neural vocoder with multi-band linear prediction. arXiv preprint arXiv:2005.05551.

Valin, J.-M.; and Skoglund, J. 2019. LPCNet: Improving neural speech synthesis through linear prediction. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5891–5895. IEEE.

Wu, H.; Judd, P.; Zhang, X.; Isaev, M.; and Micikevicius, P. 2020. Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation. arXiv preprint arXiv:2004.09602.

Wu, S.; Li, G.; Chen, F.; and Shi, L. 2018. Training and inference with integers in deep neural networks. arXiv preprint arXiv:1802.04680.

Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Yao, Z.; Cao, S.; Xiao, W.; Zhang, C.; and Nie, L. 2019. Balanced sparsity for efficient DNN inference on GPU. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 5676–5683.

Zafrir, O.; Boudoukh, G.; Izsak, P.; and Wasserblat, M. 2019. Q8BERT: Quantized 8bit BERT. arXiv preprint arXiv:1910.06188.

Zhang, T.; Lin, Z.; Yang, G.; and De Sa, C. 2019. QPyTorch: A Low-Precision Arithmetic Simulation Framework. arXiv preprint arXiv:1910.04540.

Zhu, M.; and Gupta, S. 2017. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878.