Empirical Evaluation of Deep Learning Model Compression Techniques on the Wavenet Vocoder

Total Page:16

File Type:pdf, Size:1020Kb

Empirical Evaluation of Deep Learning Model Compression Techniques on the Wavenet Vocoder Empirical Evaluation of Deep Learning Model Compression Techniques on the WaveNet Vocoder Sam Davis, Giuseppe Coccia, Sam Gooch, Julian Mack Myrtle.ai Cambridge, UK [email protected], [email protected], [email protected], [email protected] Abstract In this paper, rather than changing the WaveNet architec- ture to improve inference performance we instead keep it WaveNet is a state-of-the-art text-to-speech vocoder that re- fixed and explore a range of model compression techniques mains challenging to deploy due to its autoregressive loop. that can yield greater inference performance. Crucially, the In this work we focus on ways to accelerate the original WaveNet architecture directly, as opposed to modifying the techniques we explore are all available in existing deep architecture, such that the model can be deployed as part of learning frameworks and are deployable to a wide range of a scalable text-to-speech system. We survey a wide variety current and future CPUs and neural network accelerators. Fi- of model compression techniques that are amenable to de- nally, motivated by the desire to maintain WaveNet’s quality, ployment on a range of hardware platforms. In particular, we evaluate the impact that these compression techniques we compare different model sparsity methods and levels, and have on the perceived fidelity of the synthesized speech. seven widely used precisions as targets for quantization; and We examine two main categories of model compression— are able to achieve models with a compression ratio of up sparsity and quantization—and explore both their indepen- to 13.84 without loss in audio fidelity compared to a dense, dent and combined impact on model quality. For spar- single-precision floating-point baseline. All techniques are sity we consider iterative and one-shot magnitude-based implemented using existing open source deep learning frame- works and libraries to encourage their wider adoption. neural network pruning; and for quantization we explore the INT8, bfloat16, half-precision floating-point with both 16-bit and 32-bit accumulation (FP16.16, FP16.32), 16-bit 1 Introduction block floating-point (BFP16), TensorFloat32 (TF32) and The widespread adoption of personal assistants has been single-precision floating-point (FP32) formats. While there powered, in large part, by advances in the field of Text-to- has been some work in vocoder pruning of WaveRNN and Speech (TTS). Many state-of-the-art TTS systems contain a its variants (Kalchbrenner et al. 2018; Valin and Skoglund model, referred to as a vocoder, that takes as input audio 2019; Tian et al. 2020), to our knowledge, this is the first features derived from a piece of text and outputs synthe- paper in which WaveNet pruning results are provided. Ad- sized speech. WaveNet (Oord et al. 2016) is a state-of-the ditionally, to our knowledge, no other authors compare as art vocoder that is capable of producing synthesized speech wide a range of precisions, nor look at the interactions be- with near-human-level quality (Shen et al. 2018). The key to tween pruning and quantization for the WaveNet model. We include samples generated by the models presented in the model’s quality is its autoregressive loop but this prop- 1 erty makes the model exceptionally challenging to deploy this work and will release our code . in applications that require real-time output or need to ef- ficiently scale to millions of users since na¨ıve implementa- 2 Related Work arXiv:2011.10469v1 [cs.LG] 20 Nov 2020 tions may take tens of minutes to generate ten seconds of Numerous attempts have been made to improve vocoder in- speech. ference performance. The WaveRNN (Kalchbrenner et al. As a result, TTS research has focused on finding alterna- 2018) authors note that the time for vocoder inference, tive vocoder architectures such as Parallel-WaveNet (Oord T (u), for a target audio sequence u can be decomposed into et al. 2018), WaveRNN (Kalchbrenner et al. 2018), Clar- computation time ci and kernel launch overhead di for each iNet (Ping, Peng, and Chen 2018) and WaveGlow (Prenger, of the N operations (layers) of the model: Valle, and Catanzaro 2019) that achieve higher performance N when deployed on existing hardware. There is a degree of X ambiguity as to the highest quality vocoder as audio quality T (u) = juj (ci + di) (1) evaluation is subjective but all authors agree that WaveNet i=1 produces at least as good if not higher quality audio than the Attempts to optimize a vocoder for deployment aim to more recent approaches (Kim et al. 2018; Oord et al. 2018; reduce at least one of fjuj; N; ci; dig. Many approaches in- Prenger, Valle, and Catanzaro 2019; Tian et al. 2020; Hsu and Lee 2020). 1myrtlesoftware.github.io/wavenet-paper/ cluding Parallel-WaveNet, ClariNet, WaveGlow and Wave- The WaveRNN, LPCNet and FeatherWave models utilise Flow (Ping et al. 2019) remove the autoregressive compo- high levels of sparsity to reduce inference time. The Wav- nent and hence reduce juj so that many or all of the sam- eRNN authors use iterative pruning to achieve sparsity as ples can be generated in parallel. Others, including Wav- high as 96%. They investigate a range of block sparsity pat- eRNN and its variants LPCNet (Valin and Skoglund 2019) terns including 1x1 (unstructured), 4x4 and 16x1 and find and FeatherWave (Tian et al. 2020), keep the autoregressive that the latter is the most performant as it more closely mir- component but make alterations to the architecture to de- rors the layout in physical memory. The LPCNet and Feath- crease the product of N and ci. There have also been efforts erWave authors both use 90% sparse networks with a 16x1 that focus on reducing di by exploiting techniques such as pattern although the latter uses TSSP as discussed above in- persistent kernels that only launch the kernel once per se- stead of the purely iterative approach. quence (Pharris 2018). In this work we explore reducing ci without altering the WaveNet architecture by utilising model 2.2 Quantization compression to give greater possibilities in the types of mod- Quantized models also reduce ci as the operations are now els that can be deployed. performed at a numerical format in which the operations are less computationally expensive. This approach has been 2.1 Sparsity applied to a wide variety of models including BERT (Wu Sparse models offer two potential benefits over their dense et al. 2020), ResNet and GNMT (Zafrir et al. 2019), and counterparts: sees adoption in widely recognised machine learning bench- 1. The amount of computation can be reduced since, for ex- marks (Reddi et al. 2019). ample, multiplications by zero need not be performed. The quantization process includes one or both of: 2. Memory bandwidth requirements can be reduced as it is 1. Reducing the number of bits of the datatype. e.g. use 8 possible to achieve higher compression ratios with sparse bits instead of 32 bits. matrices. 2. Using a less expensive format. e.g. use integer instead of The first of these reduces ci but in order to realise either floating-point. of these benefits, hardware support is usually required. De- A simple scheme is to perform all multiplications in the pending on the type of support, different hardware platforms FP16 data format as this is already widely supported on a are amenable to different types of sparsity. At one end of the variety of hardware devices. The results are accumulated in spectrum, some authors use channel pruning, in which en- either FP16 or FP32; this distinction matters for the range of tire convolutional channels are set to zero (He, Zhang, and representable values and for what precision any activation Sun 2017). It is comparatively easy to realise the inference- functions are later performed in. We represent these choices time performance benefits of channel sparsity but this ap- as FP16.16 and FP16.32 to represent using FP16 for mul- proach produces a significant degradation in audio quality tiplies and either FP16 or FP32 for accumulations respec- for WaveNet (Hussain et al. 2020). Channel sparsity is a spe- tively. cial case of block sparsity where for a 2D matrix, blocks of Quantizing to integers is another popular choice for quan- size n × m are enforced to be either all-dense or all-sparse tization. When quantizing from floating point to integer, it (Narang, Undersander, and Diamos 2017). At the other end is necessary to use a quantization scheme in which there of the spectrum, sparsity can also be unstructured meaning is no quantization error for the value 0 as these parame- there are no constraints on the sparsity pattern; this typi- ters will have an outsized impact on model performance (Ja- cally results in the smallest quality degradation but is also cob et al. 2018) especially when quantizing sparse matri- the most challenging sparsity pattern to deploy efficiently. A ces. Running inference at INTX (most often INT8) is widely hybrid approach is to employ balanced sparsity (Yao et al. used for deployment including for models from the domains 2019; Cao et al. 2019) where each block is independently of Machine Translation (Wu et al. 2016), Automatic Speech pruned to the target sparsity percentage but within a block Recognition (He et al. 2019), Computer Vision (Wu et al. the sparsity is unstructured. 2018) and NLP embeddings (Zafrir et al. 2019). The other principal axes on which neural network spar- However, formats other than INTX and the basic FP16 sity approaches can differ relate to the method of obtain- are becoming more widely used as hardware vendors ac- ing the sparse network. One way to do this is by utilizing commodate them.
Recommended publications
  • Backpropagation and Deep Learning in the Brain
    Backpropagation and Deep Learning in the Brain Simons Institute -- Computational Theories of the Brain 2018 Timothy Lillicrap DeepMind, UCL With: Sergey Bartunov, Adam Santoro, Jordan Guerguiev, Blake Richards, Luke Marris, Daniel Cownden, Colin Akerman, Douglas Tweed, Geoffrey Hinton The “credit assignment” problem The solution in artificial networks: backprop Credit assignment by backprop works well in practice and shows up in virtually all of the state-of-the-art supervised, unsupervised, and reinforcement learning algorithms. Why Isn’t Backprop “Biologically Plausible”? Why Isn’t Backprop “Biologically Plausible”? Neuroscience Evidence for Backprop in the Brain? A spectrum of credit assignment algorithms: A spectrum of credit assignment algorithms: A spectrum of credit assignment algorithms: How to convince a neuroscientist that the cortex is learning via [something like] backprop - To convince a machine learning researcher, an appeal to variance in gradient estimates might be enough. - But this is rarely enough to convince a neuroscientist. - So what lines of argument help? How to convince a neuroscientist that the cortex is learning via [something like] backprop - What do I mean by “something like backprop”?: - That learning is achieved across multiple layers by sending information from neurons closer to the output back to “earlier” layers to help compute their synaptic updates. How to convince a neuroscientist that the cortex is learning via [something like] backprop 1. Feedback connections in cortex are ubiquitous and modify the
    [Show full text]
  • Self-Training Wavenet for TTS in Low-Data Regimes
    StrawNet: Self-Training WaveNet for TTS in Low-Data Regimes Manish Sharma, Tom Kenter, Rob Clark Google UK fskmanish, tomkenter, [email protected] Abstract is increased. However, it can be seen from their results that the quality degrades when the number of recordings is further Recently, WaveNet has become a popular choice of neural net- decreased. work to synthesize speech audio. Autoregressive WaveNet is To reduce the voice artefacts observed in WaveNet stu- capable of producing high-fidelity audio, but is too slow for dent models trained under a low-data regime, we aim to lever- real-time synthesis. As a remedy, Parallel WaveNet was pro- age both the high-fidelity audio produced by an autoregressive posed, which can produce audio faster than real time through WaveNet, and the faster-than-real-time synthesis capability of distillation of an autoregressive teacher into a feedforward stu- a Parallel WaveNet. We propose a training paradigm, called dent network. A shortcoming of this approach, however, is that StrawNet, which stands for “Self-Training WaveNet”. The key a large amount of recorded speech data is required to produce contribution lies in using high-fidelity speech samples produced high-quality student models, and this data is not always avail- by an autoregressive WaveNet to self-train first a new autore- able. In this paper, we propose StrawNet: a self-training ap- gressive WaveNet and then a Parallel WaveNet model. We refer proach to train a Parallel WaveNet. Self-training is performed to models distilled this way as StrawNet student models. using the synthetic examples generated by the autoregressive We evaluate StrawNet by comparing it to a baseline WaveNet teacher.
    [Show full text]
  • Deep Learning Architectures for Sequence Processing
    Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright © 2021. All rights reserved. Draft of September 21, 2021. CHAPTER Deep Learning Architectures 9 for Sequence Processing Time will explain. Jane Austen, Persuasion Language is an inherently temporal phenomenon. Spoken language is a sequence of acoustic events over time, and we comprehend and produce both spoken and written language as a continuous input stream. The temporal nature of language is reflected in the metaphors we use; we talk of the flow of conversations, news feeds, and twitter streams, all of which emphasize that language is a sequence that unfolds in time. This temporal nature is reflected in some of the algorithms we use to process lan- guage. For example, the Viterbi algorithm applied to HMM part-of-speech tagging, proceeds through the input a word at a time, carrying forward information gleaned along the way. Yet other machine learning approaches, like those we’ve studied for sentiment analysis or other text classification tasks don’t have this temporal nature – they assume simultaneous access to all aspects of their input. The feedforward networks of Chapter 7 also assumed simultaneous access, al- though they also had a simple model for time. Recall that we applied feedforward networks to language modeling by having them look only at a fixed-size window of words, and then sliding this window over the input, making independent predictions along the way. Fig. 9.1, reproduced from Chapter 7, shows a neural language model with window size 3 predicting what word follows the input for all the. Subsequent words are predicted by sliding the window forward a word at a time.
    [Show full text]
  • Unsupervised Speech Representation Learning Using Wavenet Autoencoders Jan Chorowski, Ron J
    1 Unsupervised speech representation learning using WaveNet autoencoders Jan Chorowski, Ron J. Weiss, Samy Bengio, Aaron¨ van den Oord Abstract—We consider the task of unsupervised extraction speaker gender and identity, from phonetic content, properties of meaningful latent representations of speech by applying which are consistent with internal representations learned autoencoding neural networks to speech waveforms. The goal by speech recognizers [13], [14]. Such representations are is to learn a representation able to capture high level semantic content from the signal, e.g. phoneme identities, while being desired in several tasks, such as low resource automatic speech invariant to confounding low level details in the signal such as recognition (ASR), where only a small amount of labeled the underlying pitch contour or background noise. Since the training data is available. In such scenario, limited amounts learned representation is tuned to contain only phonetic content, of data may be sufficient to learn an acoustic model on the we resort to using a high capacity WaveNet decoder to infer representation discovered without supervision, but insufficient information discarded by the encoder from previous samples. Moreover, the behavior of autoencoder models depends on the to learn the acoustic model and a data representation in a fully kind of constraint that is applied to the latent representation. supervised manner [15], [16]. We compare three variants: a simple dimensionality reduction We focus on representations learned with autoencoders bottleneck, a Gaussian Variational Autoencoder (VAE), and a applied to raw waveforms and spectrogram features and discrete Vector Quantized VAE (VQ-VAE). We analyze the quality investigate the quality of learned representations on LibriSpeech of learned representations in terms of speaker independence, the ability to predict phonetic content, and the ability to accurately re- [17].
    [Show full text]
  • Unsupervised Speech Representation Learning Using Wavenet Autoencoders
    Unsupervised speech representation learning using WaveNet autoencoders https://arxiv.org/abs/1901.08810 Jan Chorowski University of Wrocław 06.06.2019 Deep Model = Hierarchy of Concepts Cat Dog … Moon Banana M. Zieler, “Visualizing and Understanding Convolutional Networks” Deep Learning history: 2006 2006: Stacked RBMs Hinton, Salakhutdinov, “Reducing the Dimensionality of Data with Neural Networks” Deep Learning history: 2012 2012: Alexnet SOTA on Imagenet Fully supervised training Deep Learning Recipe 1. Get a massive, labeled dataset 퐷 = {(푥, 푦)}: – Comp. vision: Imagenet, 1M images – Machine translation: EuroParlamanet data, CommonCrawl, several million sent. pairs – Speech recognition: 1000h (LibriSpeech), 12000h (Google Voice Search) – Question answering: SQuAD, 150k questions with human answers – … 2. Train model to maximize log 푝(푦|푥) Value of Labeled Data • Labeled data is crucial for deep learning • But labels carry little information: – Example: An ImageNet model has 30M weights, but ImageNet is about 1M images from 1000 classes Labels: 1M * 10bit = 10Mbits Raw data: (128 x 128 images): ca 500 Gbits! Value of Unlabeled Data “The brain has about 1014 synapses and we only live for about 109 seconds. So we have a lot more parameters than data. This motivates the idea that we must do a lot of unsupervised learning since the perceptual input (including proprioception) is the only place we can get 105 dimensions of constraint per second.” Geoff Hinton https://www.reddit.com/r/MachineLearning/comments/2lmo0l/ama_geoffrey_hinton/ Unsupervised learning recipe 1. Get a massive labeled dataset 퐷 = 푥 Easy, unlabeled data is nearly free 2. Train model to…??? What is the task? What is the loss function? Unsupervised learning by modeling data distribution Train the model to minimize − log 푝(푥) E.g.
    [Show full text]
  • Real-Time Black-Box Modelling with Recurrent Neural Networks
    Proceedings of the 22nd International Conference on Digital Audio Effects (DAFx-19), Birmingham, UK, September 2–6, 2019 REAL-TIME BLACK-BOX MODELLING WITH RECURRENT NEURAL NETWORKS Alec Wright, Eero-Pekka Damskägg, and Vesa Välimäki∗ Acoustics Lab, Department of Signal Processing and Acoustics Aalto University Espoo, Finland [email protected] ABSTRACT tube amplifiers and distortion pedals. In [14] it was shown that the WaveNet model of several distortion effects was capable of This paper proposes to use a recurrent neural network for black- running in real time. The resulting deep neural network model, box modelling of nonlinear audio systems, such as tube amplifiers however, was still fairly computationally expensive to run. and distortion pedals. As a recurrent unit structure, we test both Long Short-Term Memory and a Gated Recurrent Unit. We com- In this paper, we propose an alternative black-box model based pare the proposed neural network with a WaveNet-style deep neu- on an RNN. We demonstrate that the trained RNN model is capa- ral network, which has been suggested previously for tube ampli- ble of achieving the accuracy of the WaveNet model, whilst re- fier modelling. The neural networks are trained with several min- quiring considerably less processing power to run. The proposed utes of guitar and bass recordings, which have been passed through neural network, which consists of a single recurrent layer and a the devices to be modelled. A real-time audio plugin implement- fully connected layer, is suitable for real-time emulation of tube ing the proposed networks has been developed in the JUCE frame- amplifiers and distortion pedals.
    [Show full text]
  • Machine Learning V/S Deep Learning
    International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 02 | Feb 2019 www.irjet.net p-ISSN: 2395-0072 Machine Learning v/s Deep Learning Sachin Krishan Khanna Department of Computer Science Engineering, Students of Computer Science Engineering, Chandigarh Group of Colleges, PinCode: 140307, Mohali, India ---------------------------------------------------------------------------***--------------------------------------------------------------------------- ABSTRACT - This is research paper on a brief comparison and summary about the machine learning and deep learning. This comparison on these two learning techniques was done as there was lot of confusion about these two learning techniques. Nowadays these techniques are used widely in IT industry to make some projects or to solve problems or to maintain large amount of data. This paper includes the comparison of two techniques and will also tell about the future aspects of the learning techniques. Keywords: ML, DL, AI, Neural Networks, Supervised & Unsupervised learning, Algorithms. INTRODUCTION As the technology is getting advanced day by day, now we are trying to make a machine to work like a human so that we don’t have to make any effort to solve any problem or to do any heavy stuff. To make a machine work like a human, the machine need to learn how to do work, for this machine learning technique is used and deep learning is used to help a machine to solve a real-time problem. They both have algorithms which work on these issues. With the rapid growth of this IT sector, this industry needs speed, accuracy to meet their targets. With these learning algorithms industry can meet their requirements and these new techniques will provide industry a different way to solve problems.
    [Show full text]
  • Linear Prediction-Based Wavenet Speech Synthesis
    LP-WaveNet: Linear Prediction-based WaveNet Speech Synthesis Min-Jae Hwang Frank Soong Eunwoo Song Search Solution Microsoft Naver Corporation Seongnam, South Korea Beijing, China Seongnam, South Korea [email protected] [email protected] [email protected] Xi Wang Hyeonjoo Kang Hong-Goo Kang Microsoft Yonsei University Yonsei University Beijing, China Seoul, South Korea Seoul, South Korea [email protected] [email protected] [email protected] Abstract—We propose a linear prediction (LP)-based wave- than the speech signal, the training and generation processes form generation method via WaveNet vocoding framework. A become more efficient. WaveNet-based neural vocoder has significantly improved the However, the synthesized speech is likely to be unnatural quality of parametric text-to-speech (TTS) systems. However, it is challenging to effectively train the neural vocoder when the target when the prediction errors in estimating the excitation are database contains massive amount of acoustical information propagated through the LP synthesis process. As the effect such as prosody, style or expressiveness. As a solution, the of LP synthesis is not considered in the training process, the approaches that only generate the vocal source component by synthesis output is vulnerable to the variation of LP synthesis a neural vocoder have been proposed. However, they tend to filter. generate synthetic noise because the vocal source component is independently handled without considering the entire speech To alleviate this problem, we propose an LP-WaveNet, production process; where it is inevitable to come up with a which enables to jointly train the complicated interactions mismatch between vocal source and vocal tract filter.
    [Show full text]
  • Comparative Analysis of Recurrent Neural Network Architectures for Reservoir Inflow Forecasting
    water Article Comparative Analysis of Recurrent Neural Network Architectures for Reservoir Inflow Forecasting Halit Apaydin 1 , Hajar Feizi 2 , Mohammad Taghi Sattari 1,2,* , Muslume Sevba Colak 1 , Shahaboddin Shamshirband 3,4,* and Kwok-Wing Chau 5 1 Department of Agricultural Engineering, Faculty of Agriculture, Ankara University, Ankara 06110, Turkey; [email protected] (H.A.); [email protected] (M.S.C.) 2 Department of Water Engineering, Agriculture Faculty, University of Tabriz, Tabriz 51666, Iran; [email protected] 3 Department for Management of Science and Technology Development, Ton Duc Thang University, Ho Chi Minh City, Vietnam 4 Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam 5 Department of Civil and Environmental Engineering, Hong Kong Polytechnic University, Hong Kong, China; [email protected] * Correspondence: [email protected] or [email protected] (M.T.S.); [email protected] (S.S.) Received: 1 April 2020; Accepted: 21 May 2020; Published: 24 May 2020 Abstract: Due to the stochastic nature and complexity of flow, as well as the existence of hydrological uncertainties, predicting streamflow in dam reservoirs, especially in semi-arid and arid areas, is essential for the optimal and timely use of surface water resources. In this research, daily streamflow to the Ermenek hydroelectric dam reservoir located in Turkey is simulated using deep recurrent neural network (RNN) architectures, including bidirectional long short-term memory (Bi-LSTM), gated recurrent unit (GRU), long short-term memory (LSTM), and simple recurrent neural networks (simple RNN). For this purpose, daily observational flow data are used during the period 2012–2018, and all models are coded in Python software programming language.
    [Show full text]
  • A Primer on Machine Learning
    A Primer on Machine Learning By instructor Amit Manghani Question: What is Machine Learning? Simply put, Machine Learning is a form of data analysis. Using algorithms that “ continuously learn from data, Machine Learning allows computers to recognize The core of hidden patterns without actually being programmed to do so. The key aspect of Machine Learning Machine Learning is that as models are exposed to new data sets, they adapt to produce reliable and consistent output. revolves around a computer system Question: consuming data What is driving the resurgence of Machine Learning? and learning from There are four interrelated phenomena that are behind the growing prominence the data. of Machine Learning: 1) the ever-increasing volume, variety and velocity of data, 2) the decrease in bandwidth and storage costs and 3) the exponential improve- ments in computational processing. In a nutshell, the ability to perform complex ” mathematical computations on big data is driving the resurgence in Machine Learning. 1 Question: What are some of the commonly used methods of Machine Learning? Reinforce- ment Machine Learning Supervised Machine Learning Semi- supervised Machine Unsupervised Learning Machine Learning Supervised Machine Learning In Supervised Learning, algorithms are trained using labeled examples i.e. the desired output for an input is known. For example, a piece of mail could be labeled either as relevant or junk. The algorithm receives a set of inputs along with the corresponding correct outputs to foster learning. Once the algorithm is trained on a set of labeled data; the algorithm is run against the same labeled data and its actual output is compared against the correct output to detect errors.
    [Show full text]
  • Deep Learning and Neural Networks Module 4
    Deep Learning and Neural Networks Module 4 Table of Contents Learning Outcomes ......................................................................................................................... 5 Review of AI Concepts ................................................................................................................... 6 Artificial Intelligence ............................................................................................................................ 6 Supervised and Unsupervised Learning ................................................................................................ 6 Artificial Neural Networks .................................................................................................................... 8 The Human Brain and the Neural Network ........................................................................................... 9 Machine Learning Vs. Neural Network ............................................................................................... 11 Machine Learning vs. Neural Network Comparison Table (Educba, 2019) .............................................. 12 Conclusion – Machine Learning vs. Neural Network ........................................................................... 13 Real-time Speech Translation ............................................................................................................. 14 Uses of a Translator App ...................................................................................................................
    [Show full text]
  • Introduction Q-Learning and Variants
    Introduction In this assignment, you will implement Q-learning using deep learning function approxima- tors in OpenAI Gym. You will work with the Atari game environments, and learn how to train a Q-network directly from pixel inputs. The goal is to understand and implement some of the techniques that were found to be important in practice to stabilize training and achieve better performance. As a side effect, we also expect you to get comfortable using Tensorflow or Keras to experiment with different architectures and different hyperparameter configurations and gain insight into the importance of these choices on final performance and learning behavior. Before starting your implementation, make sure that you have Tensorflow and the Atari environments correctly installed. For this assignment you may choose either Enduro (Enduro-v0) or Space Invaders (SpaceInvaders-v0). In spite of this, the observations from other Atari games have the same size so you can try to train your models in the other games too. Note that while the size of the observations is the same across games, the number of available actions may be different from game to game. Q-learning and variants Due to the state space complexity of Atari environments, we represent Q-functions using a p class of parametrized function approximators Q = fQw j w 2 R g, where p is the number of parameters. Remember that in the tabular setting, given a 4-tuple of sampled experience (s; a; r; s0), the vanilla Q-learning update is Q(s; a) := Q(s; a) + α r + γ max Q(s0; a0) − Q(s; a) ; (1) a02A where α 2 R is the learning rate.
    [Show full text]