Masters Thesis: Predicting Periodic and Chaotic Signals Using Wavenets

Total Page:16

File Type:pdf, Size:1020Kb

Predicting periodic and chaotic signals using Wavenets

Master of Science Thesis

For the degree of Master of Science in Applied Mathematics with the specialization Financial Engineering at Delft University of Technology

D.C.F. van den Assem (4336100)
August 18, 2017

Supervisor: Prof. dr. ir. C. W. Oosterlee, TU Delft
Thesis committee: Dr. S. M. Bohte, CWI Amsterdam; Dr. ir. R. J. Fokkink, TU Delft

Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS) · Delft University of Technology

Copyright © Delft Institute of Applied Mathematics (DIAM). All rights reserved.

Abstract

This thesis discusses forecasting periodic time series using Wavenets, with an application to financial time series. Conventional neural networks used for forecasting, such as the LSTM and the full convolutional network (FCN), are computationally expensive. The Wavenet uses dilated convolutions, which significantly reduce the computational cost compared to an FCN with the same number of inputs. Forecasts made on the sine wave show that the network can successfully make a full forecast of a sine wave. Forecasts made on the Mackey Glass time series show that the network can outperform the LSTM and other methods. Furthermore, forecasts made on the Lorenz system show that the network is able to outperform the LSTM. By conditioning the network on the other relevant coordinate, the prediction becomes more accurate and the network is able to make full forecasts. In a financial application, the network shows less predictive accuracy than multivariate dynamic kernel support vector machines.

Table of Contents

Acknowledgements
1 Introduction
  1.1 Outline
2 Machine Learning
  2.1 Terminology in Machine Learning
  2.2 Classification in the Iris data set
  2.3 The Single Layer Perceptron Model
    2.3.1 Implementation and Training
    2.3.2 Example 2: The other data set
  2.4 Logistic regression
  2.5 Introduction to Neural Networks
  2.6 Summary
3 Neural Networks
  3.1 Network Architectures
    3.1.1 Activation Functions
    3.1.2 Convolutional Neural Networks
    3.1.3 Recurrent Neural Networks
  3.2 Supervised Learning of the Neural Network
    3.2.1 Backpropagation Algorithm
    3.2.2 Cost Function
    3.2.3 Stochastic Gradient Descent, Batch Gradient Descent and Mini-Batch Gradient Descent
    3.2.4 Initializers
  3.3 Regularization methods
    3.3.1 Bootstrap aggregating
    3.3.2 Dropout
  3.4 Wavenet
  3.5 Augmented Wavenet (AWN)
  3.6 Summary
4 Methodology
  4.1 Evaluation of the network
  4.2 Error Measures
    4.2.1 Statistical Testing
    4.2.2 Benchmarks
  4.3 Artificial Time Series
    4.3.1 The sine wave
    4.3.2 The Lorenz System
    4.3.3 Mackey Glass Equation
  4.4 Real world time series
    4.4.1 Data pre processing
5 Results
  5.1 Implementation comparison
  5.2 The Sine Wave
  5.3 The Mackey Glass Time Series
  5.4 The Lorenz System
  5.5 Results on financial time series
  5.6 Summary
6 Conclusion
  6.1 Summary and conclusion
  6.2 Future research
A Separation hyperplanes
B Glorot derivation
C Code of the model
Bibliography

List of Figures

2.1 Scatter plot of Iris data, Setosa (blue •), Versicolor (red ×), Virginica (green ♦). In the subplot on the first row and second column, the sepal width is plotted against the sepal length.
2.2 Scatter plot of Iris data, Setosa (blue •), Versicolor (red ×), Virginica (green ♦).
2.3 Schematic representation of the single layer perceptron.
2.4 Illustrations of the multi-class logit and the softmax implementation.
2.5 A multi layer network for solving the XOR-problem.
3.1 Graph of a layered network with E = E1 ∪ E2 ∪ E3, N = N1 ∪ N2.
3.2 The three different activation functions.
3.3 Illustration of the Receptive Field.
3.4 Illustration of the replications, shared weights and feature map.
3.5 Illustration of the Padding.
3.6 Illustration of the Strides.
3.7 Illustration of the causal convolutions.
3.8 Illustration of the causal convolutions with larger inputs and outputs.
3.9 Illustration of the dilated convolutions.
3.10 The RNN on the left and the unfolded RNN network on the right.
3.11 The LSTM block. The × and + are point-wise operators; σ and tanh are activation functions. Two joining arcs make a concatenate operation. Two splitting arcs make a copy operation.
3.12 Figures of paraboloids, with a = 1 and b = 2.
3.13 Behaviour of the training error and validation error during training.
3.14 Overview of the residual block and the entire architecture, retrieved from [1].
3.15 Illustration of the stacked dilated convolutions.
3.16 Overview of the architecture used in AugmentedWavenet, retrieved from [2].
5.1 The full forecast of the sine wave using different implementations.
5.2 Overview of the AWN I(4).
5.3 The full forecast of a noisy sine wave using AWN I(4).
5.4 The full forecast of the Mackey Glass time series using 8 layers on AWN I(4).
5.5 The full forecast of the Mackey Glass time series using 8 layers on AWN I(4) trained on one-ahead noisy data (σ = 0.1).
5.6 The one ahead forecast and the full forecast of the Lorenz system using 4 layers on AWN I(4).
5.7 Convergence behavior of the training of networks with different γ parameter, for 4 and 8 layers.
5.8 The full conditioned forecast of the Lorenz system using 4 layers on AWN I(4C).
5.9 Comparison of the one step ahead forecast (using months) of the AWN I(4) without and with tuning of the parameters.
A.1 Petal Length vs Petal Width.

List of Tables

2.1 XOR-problem using SLP on the left and Multi-Layer XOR-problem on the right.
3.1 Number of weights for networks with a ‘visual field’ of 512.
4.1 Difference in response to errors between MAE and RMSE.
5.1 The standard parameters used in the Wavenet.
5.2 MAE and MSE based on 1000 samples of full forecast.
5.3 - means that the forecast diverged, therefore the number is not useful.
5.4 Results for I(2) with a variation, I(3) and I(4).
5.5 The one-ahead and full forecast performance with different values for SNR. (SNR = √ 2 σ2 )
5.6 The results for the Mackey Glass t steps ahead forecast using 4 layers. Values are RMSE × 10⁻³. The ± value is the standard deviation of the 10 runs.
5.7 Results for noisy Mackey Glass time series on two configurations. Configuration 1 uses 4 layers and configuration 2 uses 8 layers. Values are RMSE × 10⁻².
5.8 Results of the modified network for different γ; RMSE of the benchmark is 4.78 × 10⁻³.
5.9 Comparison of the noisy conditioned Lorenz system. RMSE × 10³.
5.10 Results from AWN I(4) and AWN I(4C) on the S&P500 data, conditioned with the CBOE data.
5.11 The standard parameters used in the AWN I(4) and AWN I(4C).

Acknowledgements

This thesis has been submitted
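The abstract's claim that dilated convolutions reduce the cost for a given number of inputs can be illustrated with a small count. The sketch below is not taken from the thesis: it assumes causal 1-D convolutions with kernel size 2, dilations that double per layer (1, 2, 4, ...), a single channel and no biases, and all function names are hypothetical. The target of 512 inputs matches the 'visual field' named in the caption of Table 3.1, but the numbers are not claimed to reproduce that table.

```python
def receptive_field(kernel_size, dilations):
    """Number of inputs seen by one output of a stack of causal 1-D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def doubling_dilations(kernel_size, target):
    """Assumed schedule: dilations 1, 2, 4, ... until the target field is reached."""
    dilations, d = [], 1
    while receptive_field(kernel_size, dilations) < target:
        dilations.append(d)
        d *= 2
    return dilations


def plain_layers_needed(kernel_size, target):
    """Layers needed without dilation: each layer adds (kernel_size - 1) inputs."""
    return -(-(target - 1) // (kernel_size - 1))  # ceiling division


def weight_count(kernel_size, num_layers):
    """Weights for one channel and no biases (an assumption of this sketch)."""
    return kernel_size * num_layers


kernel, target = 2, 512
dilated = doubling_dilations(kernel, target)
plain = plain_layers_needed(kernel, target)

print("dilated stack:", len(dilated), "layers,", weight_count(kernel, len(dilated)), "weights")
print("undilated stack:", plain, "layers,", weight_count(kernel, plain), "weights")
```

Under these assumptions the dilated stack covers 512 inputs with 9 layers and 18 weights, whereas an undilated stack with the same kernel size needs 511 layers and 1022 weights.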
Recommended publications
  • Malware Classification with BERT
    San Jose State University, SJSU ScholarWorks. Master's Projects, Master's Theses and Graduate Research, Spring 5-25-2021. Malware Classification with BERT. Joel Lawrence Alvares. Follow this and additional works at: https://scholarworks.sjsu.edu/etd_projects Part of the Artificial Intelligence and Robotics Commons, and the Information Security Commons. Malware Classification with Word Embeddings Generated by BERT and Word2Vec. Malware Classification with BERT. Presented to Department of Computer Science, San José State University, In Partial Fulfillment of the Requirements for the Degree. By Joel Alvares, May 2021. The Designated Project Committee Approves the Project Titled Malware Classification with BERT by Joel Lawrence Alvares. APPROVED FOR THE DEPARTMENT OF COMPUTER SCIENCE, San Jose State University, May 2021: Prof. Fabio Di Troia, Prof. William Andreopoulos, Prof. Katerina Potika (Department of Computer Science). ABSTRACT. Malware classification is used to distinguish unique types of malware from each other. This project aims to carry out malware classification using word embeddings, which are used in Natural Language Processing (NLP) to identify and evaluate the relationship between words of a sentence. Word embeddings are generated by BERT and Word2Vec for malware samples to carry out multi-class classification. BERT is a transformer-based pre-trained natural language processing (NLP) model which can be used for a wide range of tasks such as question answering, paraphrase generation and next sentence prediction. However, the attention mechanism of a pre-trained BERT model can also be used in malware classification by capturing information about the relation between each opcode and every other opcode belonging to a malware family.
  • Backpropagation and Deep Learning in the Brain
    Backpropagation and Deep Learning in the Brain. Simons Institute -- Computational Theories of the Brain 2018. Timothy Lillicrap, DeepMind, UCL. With: Sergey Bartunov, Adam Santoro, Jordan Guerguiev, Blake Richards, Luke Marris, Daniel Cownden, Colin Akerman, Douglas Tweed, Geoffrey Hinton. The “credit assignment” problem. The solution in artificial networks: backprop. Credit assignment by backprop works well in practice and shows up in virtually all of the state-of-the-art supervised, unsupervised, and reinforcement learning algorithms. Why Isn’t Backprop “Biologically Plausible”? Neuroscience Evidence for Backprop in the Brain? A spectrum of credit assignment algorithms. How to convince a neuroscientist that the cortex is learning via [something like] backprop: - To convince a machine learning researcher, an appeal to variance in gradient estimates might be enough. - But this is rarely enough to convince a neuroscientist. - So what lines of argument help? - What do I mean by “something like backprop”?: That learning is achieved across multiple layers by sending information from neurons closer to the output back to “earlier” layers to help compute their synaptic updates. 1. Feedback connections in cortex are ubiquitous and modify the
  • Ranking and Automatic Selection of Machine Learning Models Abstract Sandro Feuz
    Technical Disclosure Commons, Defensive Publications Series, December 13, 2017. Ranking and automatic selection of machine learning models Abstract. Sandro Feuz, Victor Carbune. Follow this and additional works at: http://www.tdcommons.org/dpubs_series Recommended Citation: Feuz, Sandro and Carbune, Victor, "Ranking and automatic selection of machine learning models Abstract", Technical Disclosure Commons, (December 13, 2017) http://www.tdcommons.org/dpubs_series/982 This work is licensed under a Creative Commons Attribution 4.0 License. This Article is brought to you for free and open access by Technical Disclosure Commons. It has been accepted for inclusion in Defensive Publications Series by an authorized administrator of Technical Disclosure Commons. Abstract: Generally, the present disclosure is directed to an API for ranking and automatic selection from competing machine learning models that can perform a particular task. In particular, in some implementations, the systems and methods of the present disclosure can include or otherwise leverage one or more machine-learned models to provide to a software application one or more machine learning models from different providers. The trained models are suited to a task or data type specified by the developer. The one or more models are selected from a registry of machine learning models, their task specialties, cost, and performance, such that the application-specified cost and performance requirements are met. An application programming interface (API) maintains a registry of various machine learning models, their task specialties, costs and/or performances. A third-party developer can make a call to the API to select one or more machine learning models.
  • Training Autoencoders by Alternating Minimization
    Under review as a conference paper at ICLR 2018. TRAINING AUTOENCODERS BY ALTERNATING MINIMIZATION. Anonymous authors. Paper under double-blind review. ABSTRACT: We present DANTE, a novel method for training neural networks, in particular autoencoders, using the alternating minimization principle. DANTE provides a distinct perspective in lieu of traditional gradient-based backpropagation techniques commonly used to train deep networks. It utilizes an adaptation of quasi-convex optimization techniques to cast autoencoder training as a bi-quasi-convex optimization problem. We show that for autoencoder configurations with both differentiable (e.g. sigmoid) and non-differentiable (e.g. ReLU) activation functions, we can perform the alternations very effectively. DANTE effortlessly extends to networks with multiple hidden layers and varying network configurations. In experiments on standard datasets, autoencoders trained using the proposed method were found to be very promising and competitive to traditional backpropagation techniques, both in terms of quality of solution, as well as training speed. 1 INTRODUCTION: For much of the recent march of deep learning, gradient-based backpropagation methods, e.g. Stochastic Gradient Descent (SGD) and its variants, have been the mainstay of practitioners. The use of these methods, especially on vast amounts of data, has led to unprecedented progress in several areas of artificial intelligence. On one hand, the intense focus on these techniques has led to an intimate understanding of hardware requirements and code optimizations needed to execute these routines on large datasets in a scalable manner. Today, myriad off-the-shelf and highly optimized packages exist that can churn reasonably large datasets on GPU architectures with relatively mild human involvement and little bootstrap effort.
  • PredRNN: Recurrent Neural Networks for Predictive Learning Using Spatiotemporal LSTMs
    PredRNN: Recurrent Neural Networks for Predictive Learning using Spatiotemporal LSTMs. Yunbo Wang, Mingsheng Long∗, Jianmin Wang, Zhifeng Gao, Philip S. Yu. School of Software, Tsinghua University. Abstract: The predictive learning of spatiotemporal sequences aims to generate future images by learning from the historical frames, where spatial appearances and temporal variations are two crucial structures. This paper models these structures by presenting a predictive recurrent neural network (PredRNN). This architecture is enlightened by the idea that spatiotemporal predictive learning should memorize both spatial appearances and temporal variations in a unified memory pool. Concretely, memory states are no longer constrained inside each LSTM unit. Instead, they are allowed to zigzag in two directions: across stacked RNN layers vertically and through all RNN states horizontally. The core of this network is a new Spatiotemporal LSTM (ST-LSTM) unit that extracts and memorizes spatial and temporal representations simultaneously. PredRNN achieves the state-of-the-art prediction performance on three video prediction datasets and is a more general framework that can be easily extended to other predictive learning tasks by integrating with other architectures. 1 Introduction
  • Lecture 4 Feedforward Neural Networks, Backpropagation
    CS7015 (Deep Learning): Lecture 4. Feedforward Neural Networks, Backpropagation. Mitesh M. Khapra, Department of Computer Science and Engineering, Indian Institute of Technology Madras. References/Acknowledgments: See the excellent videos by Hugo Larochelle on Backpropagation. Module 4.1: Feedforward Neural Networks (a.k.a. multilayered network of neurons). [Network diagram: inputs $x_1, x_2, \dots, x_n$; layers $a_1, h_1, a_2, h_2, a_3$ with weights $W_1, W_2, W_3$ and biases $b_1, b_2, b_3$; output $h_L = \hat{y} = f(x)$.] The input to the network is an n-dimensional vector. The network contains $L - 1$ hidden layers (2, in this case) having n neurons each. Finally, there is one output layer containing k neurons (say, corresponding to k classes). Each neuron in the hidden layer and output layer can be split into two parts: pre-activation and activation ($a_i$ and $h_i$ are vectors). The input layer can be called the 0-th layer and the output layer can be called the (L)-th layer. $W_i \in \mathbb{R}^{n \times n}$ and $b_i \in \mathbb{R}^{n}$ are the weight and bias between layers $i - 1$ and $i$ ($0 < i < L$). $W_L \in \mathbb{R}^{n \times k}$ and $b_L \in \mathbb{R}^{k}$ are the weight and bias between the last hidden layer and the output layer ($L = 3$ in this case). The pre-activation at layer i is given by $a_i(x) = b_i + W_i h_{i-1}(x)$. The activation at layer i is given by $h_i(x) = g(a_i(x))$, where g is called the activation function (for example, logistic, tanh, linear, etc.). The activation at the output layer is given by $f(x) = h_L(x) = O(a_L(x))$, where O is the output activation function (for example, softmax, linear, etc.). To simplify notation we will refer to $a_i(x)$ as $a_i$ and $h_i(x)$ as $h_i$.
  • Fun with Hyperplanes: Perceptrons, Svms, and Friends
    Perceptrons, SVMs, and Friends: Some Discriminative Models for Classification. Parallel to AIMA 18.1, 18.2, 18.6.3, 18.9. CIS 391 - Intro to AI. The Automatic Classification Problem: Assign object/event or sequence of objects/events to one of a given finite set of categories. • Fraud detection for credit card transactions, telephone calls, etc. • Worm detection in network packets • Spam filtering in email • Recommending articles, books, movies, music • Medical diagnosis • Speech recognition • OCR of handwritten letters • Recognition of specific astronomical images • Recognition of specific DNA sequences • Financial investment. Machine Learning methods provide one set of approaches to this problem. Universal Machine Learning Diagram: Things to be classified → Feature Vector Representation → Magic Classifier Box → Classification Decision. Example: handwritten digit recognition. Machine learning algorithms that: automatically cluster these images; use a training set of labeled images to learn to classify new images; discover how to account for variability in writing style. A machine learning algorithm development pipeline: minimization. Problem statement: Given training vectors x1,…,xN and targets t1,…,tN, find… Mathematical description of a cost function. Mathematical description of how to minimize/maximize the cost function. Implementation: r(i,k) = s(i,k) – max_j{s(i,j)+a(i,j)} … Universal Machine Learning Diagram. Today: Perceptron, SVM and Friends
  • Batch Normalization
    Deep Learning, Srihari. Batch Normalization. Sargur N. Srihari. Topics in Optimization for Deep Models: • Importance of Optimization in machine learning • How learning differs from optimization • Challenges in neural network optimization • Basic Optimization Algorithms • Parameter initialization strategies • Algorithms with adaptive learning rates • Approximate second-order methods • Optimization strategies and meta-algorithms. Topics in Optimization Strategies and Meta-Algorithms: 1. Batch Normalization 2. Coordinate Descent 3. Polyak Averaging 4. Supervised Pretraining 5. Designing Models to Aid Optimization 6. Continuation Methods and Curriculum Learning. Overview of Optimization Strategies: • Many optimization techniques are general templates that can be specialized to yield algorithms • They can be incorporated into different algorithms. Topics in Batch Normalization: • Batch normalization: exciting recent innovation • Motivation is difficulty of choosing learning rate ε in deep networks • Method is to replace activations with zero-mean, unit-variance activations. Adding normalization between layers: • Motivated by difficulty of training deep models • Method adds an additional step between layers, in which the output of the earlier layer is normalized – By standardizing the mean and standard deviation of each individual unit • It is a method of adaptive re-parameterization – It is not an optimization
  • Q-Learning in Continuous State and Action Spaces
    Q-Learning in Continuous State and Action Spaces. Chris Gaskett, David Wettergreen, and Alexander Zelinsky. Robotic Systems Laboratory, Department of Systems Engineering, Research School of Information Sciences and Engineering, The Australian National University, Canberra, ACT 0200 Australia. [cg|dsw|alex]@syseng.anu.edu.au Abstract. Q-learning can be used to learn a control policy that maximises a scalar reward through interaction with the environment. Q-learning is commonly applied to problems with discrete states and actions. We describe a method suitable for control tasks which require continuous actions, in response to continuous states. The system consists of a neural network coupled with a novel interpolator. Simulation results are presented for a non-holonomic control task. Advantage Learning, a variation of Q-learning, is shown to enhance learning speed and reliability for this task. 1 Introduction. Reinforcement learning systems learn by trial-and-error which actions are most valuable in which situations (states) [1]. Feedback is provided in the form of a scalar reward signal which may be delayed. The reward signal is defined in relation to the task to be achieved; reward is given when the system is successfully achieving the task. The value is updated incrementally with experience and is defined as a discounted sum of expected future reward. The learning system's choice of actions in response to states is called its policy. Reinforcement learning lies between the extremes of supervised learning, where the policy is taught by an expert, and unsupervised learning, where no feedback is given and the task is to find structure in data.
  • Revisiting the Softmax Bellman Operator: New Benefits and New Perspective
    Revisiting the Softmax Bellman Operator: New Benefits and New Perspective. Zhao Song 1 *, Ronald E. Parr 1, Lawrence Carin 1. Abstract: The impact of softmax on the value function itself in reinforcement learning (RL) is often viewed as problematic because it leads to sub-optimal value (or Q) functions and interferes with the contraction properties of the Bellman operator. Surprisingly, despite these concerns, and independent of its effect on exploration, the softmax Bellman operator when combined with Deep Q-learning, leads to Q-functions with superior policies in practice, even outperforming its double Q-learning counterpart. To better understand how and why this occurs, we revisit theoretical properties of the softmax Bellman operator, and prove that (i) it converges to the standard Bellman operator exponentially fast in the inverse temperature parameter, and (ii) the distance of its Q function from the ...tivates the use of exploratory and potentially sub-optimal actions during learning, and one commonly-used strategy is to add randomness by replacing the max function with the softmax function, as in Boltzmann exploration (Sutton & Barto, 1998). Furthermore, the softmax function is a differentiable approximation to the max function, and hence can facilitate analysis (Reverdy & Leonard, 2016). The beneficial properties of the softmax Bellman operator are in contrast to its potentially negative effect on the accuracy of the resulting value or Q-functions. For example, it has been demonstrated that the softmax Bellman operator is not a contraction, for certain temperature parameters (Littman, 1996, Page 205). Given this, one might expect that the convenient properties of the softmax Bellman operator would come at the expense of the accuracy of the resulting value or Q-functions, or the quality of the resulting policies.
  • Almost Unsupervised Text to Speech and Automatic Speech Recognition
    Almost Unsupervised Text to Speech and Automatic Speech Recognition. Yi Ren * 1, Xu Tan * 2, Tao Qin 2, Sheng Zhao 3, Zhou Zhao 1, Tie-Yan Liu 2. Abstract: Text to speech (TTS) and automatic speech recognition (ASR) are two dual tasks in speech processing and both achieve impressive performance thanks to the recent advance in deep learning and large amount of aligned speech and text data. However, the lack of aligned data poses a major practical problem for TTS and ASR on low-resource languages. In this paper, by leveraging the dual nature of the two tasks, we propose an almost unsupervised learning method that only leverages few hundreds of paired data and extra unpaired data for TTS and ASR. 1. Introduction: Text to speech (TTS) and automatic speech recognition (ASR) are two popular tasks in speech processing and have attracted a lot of attention in recent years due to advances in deep learning. Nowadays, the state-of-the-art TTS and ASR systems are mostly based on deep neural models and are all data-hungry, which brings challenges on many languages that are scarce of paired speech and text data. Therefore, a variety of techniques for low-resource and zero-resource ASR and TTS have been proposed recently, including unsupervised ASR (Yeh et al., 2019; Chen et al., 2018a; Liu et al., 2018; Chen et al., 2018b), low-resource ASR (Chuangsuwanich, 2016; Dalmia et al., 2018; Zhou et al., 2018), TTS with minimal speaker data (Chen et al., 2019; Jia et al., 2018;
  • On Self Modulation for Generative Adversarial Networks
    Published as a conference paper at ICLR 2019. ON SELF MODULATION FOR GENERATIVE ADVERSARIAL NETWORKS. Ting Chen∗ (University of California, Los Angeles); Mario Lucic, Neil Houlsby, Sylvain Gelly (Google Brain). ABSTRACT: Training Generative Adversarial Networks (GANs) is notoriously challenging. We propose and study an architectural modification, self-modulation, which improves GAN performance across different data sets, architectures, losses, regularizers, and hyperparameter settings. Intuitively, self-modulation allows the intermediate feature maps of a generator to change as a function of the input noise vector. While reminiscent of other conditioning techniques, it requires no labeled data. In a large-scale empirical study we observe a relative decrease of 5% − 35% in FID. Furthermore, all else being equal, adding this modification to the generator leads to improved performance in 124/144 (86%) of the studied settings. Self-modulation is a simple architectural change that requires no additional parameter tuning, which suggests that it can be applied readily to any GAN. 1 INTRODUCTION: Generative Adversarial Networks (GANs) are a powerful class of generative models successfully applied to a variety of tasks such as image generation (Zhang et al., 2017; Miyato et al., 2018; Karras et al., 2017), learned compression (Tschannen et al., 2018), super-resolution (Ledig et al., 2017), inpainting (Pathak et al., 2016), and domain transfer (Isola et al., 2016; Zhu et al., 2017). Training GANs is a notoriously challenging task (Goodfellow et al., 2014; Arjovsky et al., 2017; Lucic et al., 2018) as one is searching in a high-dimensional parameter space for a Nash equilibrium of a non-convex game.