On the Dynamics of Gradient Descent for Autoencoders

Total Page:16

File Type:pdf, Size:1020Kb

On the Dynamics of Gradient Descent for Autoencoders On the Dynamics of Gradient Descent for Autoencoders ? ? Thanh V. Nguyen ,RaymondK.W.Wong†, Chinmay Hegde ? Iowa State University, †Texas A&M University Abstract Atypicalapproachadoptedbythislineofworkisas follows: assume that the data obeys a ground truth We provide a series of results for unsuper- generative model (induced by simple but reasonably ex- vised learning with autoencoders. Specifically, pressive data-generating distributions), and prove that we study shallow two-layer autoencoder ar- the weights learned by the proposed algorithms (either chitectures with shared weights. We focus on exactly or approximately) recover the parameters of the three generative models for data that are com- generative model. Indeed, such distributional assump- mon in statistical machine learning: (i) the tions are necessary to overcome known NP-hardness mixture-of-gaussians model, (ii) the sparse barriers for learning neural networks [8]. Nevertheless, coding model, and (iii) the sparsity model the majority of these approaches have focused on neural with non-negative coefficients. For each of network architectures for supervised learning, barring these models, we prove that under suitable a few exceptions which we detail below. choices of hyperparameters, architectures, and initialization, autoencoders learned by gradi- 1.2 Our contributions ent descent can successfully recover the pa- rameters of the corresponding model. To our In this paper, we complement this line of work by pro- knowledge, this is the first result that rigor- viding new theoretical results for unsupervised learning ously studies the dynamics of gradient descent using neural networks. Our focus here is on shallow two- for weight-sharing autoencoders. Our analy- layer autoencoder architectures with shared weights. sis can be viewed as theoretical evidence that Conceptually, we build upon previous theoretical re- shallow autoencoder modules indeed can be sults on learning autoencoder networks [9, 10, 11], and used as feature learning mechanisms for a va- we elaborate on the novelty of our work in the discus- riety of data models, and may shed insight on sion on prior work below. how to train larger stacked architectures with Our setting is standard: we assume that the training autoencoders as basic building blocks. data consists of i.i.d. samples from a high-dimensional distribution parameterized by a generative model, and 1Introduction we train the weights of the autoencoder using ordinary (batch) gradient descent. We consider three families of generative models that are commonly adopted in 1.1 Motivation machine learning: (i) the Gaussian mixture model with Due to the resurgence of neural networks and deep well-separated centers [12]; (ii) the k-sparse model, learning, there has been growing interest in the commu- specified by sparse linear combination of atoms [13]; nity towards a thorough and principled understanding and (iii) the non-negative k-sparse model [11]. While of training neural networks in both theoretical and these models are traditionally studied separately de- algorithmic aspects. This has led to several important pending on the application, all of these model families breakthroughs recently, including provable algorithms can be expressed via a unified, generic form: for learning shallow (1-hidden layer) networks with nonlinear activations [1, 2, 3, 4], deep networks with y = Ax⇤ + ⌘, (1) linear activations [5], and residual networks [6, 7]. which we (loosely) dub as the generative bilinear model. Proceedings of the 22nd International Conference on Ar- In this form, A is a groundtruth n m-matrix, x⇤ is ⇥ tificial Intelligence and Statistics (AISTATS) 2019, Naha, an m-dimensional latent code vector and ⌘ is an inde- Okinawa, Japan. PMLR: Volume 89. Copyright 2019 by pendent n-dimensional random noise vector. Samples the author(s). y’s are what we observe. Different choices of n and m, On the Dynamics of Gradient Descent for Autoencoders as well as different assumptions on A and x⇤ lead to The code consistency property is crucial for establishing the three aforementioned generative models. the correctness of gradient descent over the reconstruc- tion loss. This turns out to be rather tedious due to Under these three generative models, and with suit- the weight sharing — a complication which requires able choice of hyper-parameters, initial estimates, and asubstantialdeparturefromtheexistingmachinery autoencoder architectures, we rigorously prove that: for analysis of sparse coding algorithms — and indeed forms the bulk of the technical difficulty in our proofs. Two-layer autoencoders, trained with (normal- Nevertheless, we are able to derive explicit linear con- ized) gradient descent over the reconstruction vergence rates for all the generative models listed above. loss, provably learn the parameters of the un- We do not attempt to analyze other training schemes derlying generative bilinear model. (such as stochastic gradient descent or dropout) but anticipate that our analysis may lead to further work To the best of our knowledge, our work is the first to along those directions. analytically characterize the dynamics of gradient de- scent for training two-layer autoencoders. Our analysis can be viewed as theoretical evidence that shallow au- 1.4 Comparison with prior work toencoders can be used as feature learning mechanisms (provided the generative modeling assumptions hold), a Recent advances in algorithmic learning theory has led view that seems to be widely adopted in practice. Our to numerous provably efficient algorithms for learning analysis highlights the following interesting conclusions: Gaussian mixture models, sparse codes, topic models, (i) the activation function of the hidden (encoder) layer and ICA (see [12, 13, 14, 15, 16, 17, 18, 19]andrefer- influences the choice of bias; (ii) the bias of each hid- ences therein). We omit a complete treatment of prior den neuron in the encoder plays an important role in work due to space constraints. achieving the convergence of the gradient descent; and We would like to emphasize that we do not propose a (iii) the gradient dynamics depends on the complexity new algorithm or autoencoder architecture, nor are we of the generative model. Further, we speculate that our the first to highlight the applicability of autoencoders analysis may shed insight on practical considerations with the aforementioned generative models. Indeed, for training deeper networks with stacked autoencoder generative models such as k-sparsity models have served layers as building blocks [9]. as the motivation for the development of deep stacked (denoising) autoencoders dating back to the work of [20]. 1.3 Techniques The paper [9]provesthatstackedweight-sharingau- toencoders can recover the parameters of sparsity-based Our analysis is built upon recent algorithmic devel- generative models, but their analysis succeeds only for opments in the sparse coding literature [14, 15, 16]. certain generative models whose parameters are them- Sparse coding corresponds to the setting where the syn- (i) selves randomly sampled from certain distributions. thesis coefficient vector x⇤ in (1) for each data sample (i) (i) In contrast, our analysis holds for a broader class of y is assumed to be k-sparse, i.e., x⇤ only has at networks; we make no randomness assumptions on the most k m non-zero elements. The exact algorithms ⌧ parameters of the generative models themselves. proposed in these papers are all quite different, but at ahighlevel,allthesemethodsinvolveestablishinga More recently, autoencoders have been shown to learn notion that we dub as “support consistency”. Broadly sparse representations [21]. The recent paper [11] (i) (i) (i) speaking, for a given data sample y = Ax⇤ + ⌘ , demonstrates that under the sparse generative model, the idea is that when the parameter estimates are close the standard squared-error reconstruction loss of ReLU to the ground truth, it is possible to accurately esti- autoencoders exhibits (with asymptotically many sam- (i) mate the true support of the synthesis vector x⇤ for ples) critical points in a neighborhood of the ground each data sample y(i). truth dictionary. However, they do not analyze gradi- ent dynamics, nor do they establish convergence rates. We extend this to a broader family of generative mod- We complete this line of work by proving explicitly that els to form a notion that we call “code consistency”. gradient descent (with column-wise normalization) in We prove that if initialized appropriately, the weights the asymptotic limit exhibits linear convergence up to of the hidden (encoder) layer of the autoencoder pro- aradiusaroundthegroundtruthparameters. vides useful information about the sign pattern of the corresponding synthesis vectors for every data sam- ple. Somewhat surprisingly, the choice of activation 2Preliminaries function of each neuron in the hidden layer plays an m important role in establishing code consistency and Notation Denote by xS the sub-vector of x R 2 affects the possible choices of bias. indexed by the elements of S [m].Similarly,letW ✓ S ? ? Thanh V. Nguyen ,RaymondK.W.Wong†,ChinmayHegde n m be the sub-matrix of W R ⇥ with columns indexed 2 by elements in S.Also,definesupp(x) , i [m]: Input Hidden Output { 2 layer layer layer xi =0 as the support of x, sgn(x) as the element-wise 6 } y1 yˆ sign of x and 1E as the indicator of an event E. 1 We adopt standard asymptotic notations: let f(n)= y2 yˆ2 O(g(n)) (or f(n)=⌦(g(n)))ifthereexistssomecon- stant C>0 such that f(n) C g(n) (respectively, . | | | | . f(n) C g(n) ). Next, f(n)=⇥(g(n)) is equivalent | |≥ | | yn to that f(n)=O(g(n)) and f(n)=⌦(g(n)).Also, yˆn f(n)=!(g(n)) if limn f(n)/g(n) = . In ad- !1 | | 1 dition, g(n)=O⇤(f(n)) indicates g(n) K f(n) | | | | for some small enough constant K. Throughout, we Figure 1: Architecture of a shallow 2-layer autoencoder network. The encoder and the decoder share the weights. use the phrase “with high probability” (abbreviated to w.h.p.) to describe any event with failure probability !(1) at most n− .
Recommended publications
  • Training Autoencoders by Alternating Minimization
    Under review as a conference paper at ICLR 2018 TRAINING AUTOENCODERS BY ALTERNATING MINI- MIZATION Anonymous authors Paper under double-blind review ABSTRACT We present DANTE, a novel method for training neural networks, in particular autoencoders, using the alternating minimization principle. DANTE provides a distinct perspective in lieu of traditional gradient-based backpropagation techniques commonly used to train deep networks. It utilizes an adaptation of quasi-convex optimization techniques to cast autoencoder training as a bi-quasi-convex optimiza- tion problem. We show that for autoencoder configurations with both differentiable (e.g. sigmoid) and non-differentiable (e.g. ReLU) activation functions, we can perform the alternations very effectively. DANTE effortlessly extends to networks with multiple hidden layers and varying network configurations. In experiments on standard datasets, autoencoders trained using the proposed method were found to be very promising and competitive to traditional backpropagation techniques, both in terms of quality of solution, as well as training speed. 1 INTRODUCTION For much of the recent march of deep learning, gradient-based backpropagation methods, e.g. Stochastic Gradient Descent (SGD) and its variants, have been the mainstay of practitioners. The use of these methods, especially on vast amounts of data, has led to unprecedented progress in several areas of artificial intelligence. On one hand, the intense focus on these techniques has led to an intimate understanding of hardware requirements and code optimizations needed to execute these routines on large datasets in a scalable manner. Today, myriad off-the-shelf and highly optimized packages exist that can churn reasonably large datasets on GPU architectures with relatively mild human involvement and little bootstrap effort.
    [Show full text]
  • Learning to Learn by Gradient Descent by Gradient Descent
    Learning to learn by gradient descent by gradient descent Marcin Andrychowicz1, Misha Denil1, Sergio Gómez Colmenarejo1, Matthew W. Hoffman1, David Pfau1, Tom Schaul1, Brendan Shillingford1,2, Nando de Freitas1,2,3 1Google DeepMind 2University of Oxford 3Canadian Institute for Advanced Research [email protected] {mdenil,sergomez,mwhoffman,pfau,schaul}@google.com [email protected], [email protected] Abstract The move from hand-designed features to learned features in machine learning has been wildly successful. In spite of this, optimization algorithms are still designed by hand. In this paper we show how the design of an optimization algorithm can be cast as a learning problem, allowing the algorithm to learn to exploit structure in the problems of interest in an automatic way. Our learned algorithms, implemented by LSTMs, outperform generic, hand-designed competitors on the tasks for which they are trained, and also generalize well to new tasks with similar structure. We demonstrate this on a number of tasks, including simple convex problems, training neural networks, and styling images with neural art. 1 Introduction Frequently, tasks in machine learning can be expressed as the problem of optimizing an objective function f(✓) defined over some domain ✓ ⇥. The goal in this case is to find the minimizer 2 ✓⇤ = arg min✓ ⇥ f(✓). While any method capable of minimizing this objective function can be applied, the standard2 approach for differentiable functions is some form of gradient descent, resulting in a sequence of updates ✓ = ✓ ↵ f(✓ ) . t+1 t − tr t The performance of vanilla gradient descent, however, is hampered by the fact that it only makes use of gradients and ignores second-order information.
    [Show full text]
  • Q-Learning in Continuous State and Action Spaces
    -Learning in Continuous Q State and Action Spaces Chris Gaskett, David Wettergreen, and Alexander Zelinsky Robotic Systems Laboratory Department of Systems Engineering Research School of Information Sciences and Engineering The Australian National University Canberra, ACT 0200 Australia [cg dsw alex]@syseng.anu.edu.au j j Abstract. -learning can be used to learn a control policy that max- imises a scalarQ reward through interaction with the environment. - learning is commonly applied to problems with discrete states and ac-Q tions. We describe a method suitable for control tasks which require con- tinuous actions, in response to continuous states. The system consists of a neural network coupled with a novel interpolator. Simulation results are presented for a non-holonomic control task. Advantage Learning, a variation of -learning, is shown enhance learning speed and reliability for this task.Q 1 Introduction Reinforcement learning systems learn by trial-and-error which actions are most valuable in which situations (states) [1]. Feedback is provided in the form of a scalar reward signal which may be delayed. The reward signal is defined in relation to the task to be achieved; reward is given when the system is successfully achieving the task. The value is updated incrementally with experience and is defined as a discounted sum of expected future reward. The learning systems choice of actions in response to states is called its policy. Reinforcement learning lies between the extremes of supervised learning, where the policy is taught by an expert, and unsupervised learning, where no feedback is given and the task is to find structure in data.
    [Show full text]
  • Training Neural Networks Without Gradients: a Scalable ADMM Approach
    Training Neural Networks Without Gradients: A Scalable ADMM Approach Gavin Taylor1 [email protected] Ryan Burmeister1 Zheng Xu2 [email protected] Bharat Singh2 [email protected] Ankit Patel3 [email protected] Tom Goldstein2 [email protected] 1United States Naval Academy, Annapolis, MD USA 2University of Maryland, College Park, MD USA 3Rice University, Houston, TX USA Abstract many parameters. Because big datasets provide results that With the growing importance of large network (often dramatically) outperform the prior state-of-the-art in models and enormous training datasets, GPUs many machine learning tasks, researchers are willing to have become increasingly necessary to train neu- purchase specialized hardware such as GPUs, and commit ral networks. This is largely because conven- large amounts of time to training models and tuning hyper- tional optimization algorithms rely on stochastic parameters. gradient methods that don’t scale well to large Gradient-based training methods have several properties numbers of cores in a cluster setting. Further- that contribute to this need for specialized hardware. First, more, the convergence of all gradient methods, while large amounts of data can be shared amongst many including batch methods, suffers from common cores, existing optimization methods suffer when paral- problems like saturation effects, poor condition- lelized. Second, training neural nets requires optimizing ing, and saddle points. This paper explores an highly non-convex objectives that exhibit saddle points, unconventional training method that uses alter- poor conditioning, and vanishing gradients, all of which nating direction methods and Bregman iteration slow down gradient-based methods such as stochastic gra- to train networks without gradient descent steps.
    [Show full text]
  • Training Deep Networks with Stochastic Gradient Normalized by Layerwise Adaptive Second Moments
    Training Deep Networks with Stochastic Gradient Normalized by Layerwise Adaptive Second Moments Boris Ginsburg 1 Patrice Castonguay 1 Oleksii Hrinchuk 1 Oleksii Kuchaiev 1 Ryan Leary 1 Vitaly Lavrukhin 1 Jason Li 1 Huyen Nguyen 1 Yang Zhang 1 Jonathan M. Cohen 1 Abstract We started with Adam, and then (1) replaced the element- wise second moment with the layer-wise moment, (2) com- We propose NovoGrad, an adaptive stochastic puted the first moment using gradients normalized by layer- gradient descent method with layer-wise gradient wise second moment, (3) decoupled weight decay (WD) normalization and decoupled weight decay. In from normalized gradients (similar to AdamW). our experiments on neural networks for image classification, speech recognition, machine trans- The resulting algorithm, NovoGrad, combines SGD’s and lation, and language modeling, it performs on par Adam’s strengths. We applied NovoGrad to a variety of or better than well-tuned SGD with momentum, large scale problems — image classification, neural machine Adam, and AdamW. Additionally, NovoGrad (1) translation, language modeling, and speech recognition — is robust to the choice of learning rate and weight and found that in all cases, it performs as well or better than initialization, (2) works well in a large batch set- Adam/AdamW and SGD with momentum. ting, and (3) has half the memory footprint of Adam. 2. Related Work NovoGrad belongs to the family of Stochastic Normalized 1. Introduction Gradient Descent (SNGD) optimizers (Nesterov, 1984; Hazan et al., 2015). SNGD uses only the direction of the The most popular algorithms for training Neural Networks stochastic gradient gt to update the weights wt: (NNs) are Stochastic Gradient Descent (SGD) with mo- gt mentum (Polyak, 1964; Sutskever et al., 2013) and Adam wt+1 = wt − λt · (Kingma & Ba, 2015).
    [Show full text]
  • CSE 152: Computer Vision Manmohan Chandraker
    CSE 152: Computer Vision Manmohan Chandraker Lecture 15: Optimization in CNNs Recap Engineered against learned features Label Convolutional filters are trained in a Dense supervised manner by back-propagating classification error Dense Dense Convolution + pool Label Convolution + pool Classifier Convolution + pool Pooling Convolution + pool Feature extraction Convolution + pool Image Image Jia-Bin Huang and Derek Hoiem, UIUC Two-layer perceptron network Slide credit: Pieter Abeel and Dan Klein Neural networks Non-linearity Activation functions Multi-layer neural network From fully connected to convolutional networks next layer image Convolutional layer Slide: Lazebnik Spatial filtering is convolution Convolutional Neural Networks [Slides credit: Efstratios Gavves] 2D spatial filters Filters over the whole image Weight sharing Insight: Images have similar features at various spatial locations! Key operations in a CNN Feature maps Spatial pooling Non-linearity Convolution (Learned) . Input Image Input Feature Map Source: R. Fergus, Y. LeCun Slide: Lazebnik Convolution as a feature extractor Key operations in a CNN Feature maps Rectified Linear Unit (ReLU) Spatial pooling Non-linearity Convolution (Learned) Input Image Source: R. Fergus, Y. LeCun Slide: Lazebnik Key operations in a CNN Feature maps Spatial pooling Max Non-linearity Convolution (Learned) Input Image Source: R. Fergus, Y. LeCun Slide: Lazebnik Pooling operations • Aggregate multiple values into a single value • Invariance to small transformations • Keep only most important information for next layer • Reduces the size of the next layer • Fewer parameters, faster computations • Observe larger receptive field in next layer • Hierarchically extract more abstract features Key operations in a CNN Feature maps Spatial pooling Non-linearity Convolution (Learned) . Input Image Input Feature Map Source: R.
    [Show full text]
  • Automated Source Code Generation and Auto-Completion Using Deep Learning: Comparing and Discussing Current Language Model-Related Approaches
    Article Automated Source Code Generation and Auto-Completion Using Deep Learning: Comparing and Discussing Current Language Model-Related Approaches Juan Cruz-Benito 1,* , Sanjay Vishwakarma 2,†, Francisco Martin-Fernandez 1 and Ismael Faro 1 1 IBM Quantum, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA; [email protected] (F.M.-F.); [email protected] (I.F.) 2 Electrical and Computer Engineering, Carnegie Mellon University, Mountain View, CA 94035, USA; [email protected] * Correspondence: [email protected] † Intern at IBM Quantum at the time of writing this paper. Abstract: In recent years, the use of deep learning in language models has gained much attention. Some research projects claim that they can generate text that can be interpreted as human writ- ing, enabling new possibilities in many application areas. Among the different areas related to language processing, one of the most notable in applying this type of modeling is programming languages. For years, the machine learning community has been researching this software engi- neering area, pursuing goals like applying different approaches to auto-complete, generate, fix, or evaluate code programmed by humans. Considering the increasing popularity of the deep learning- enabled language models approach, we found a lack of empirical papers that compare different deep learning architectures to create and use language models based on programming code. This paper compares different neural network architectures like Average Stochastic Gradient Descent (ASGD) Weight-Dropped LSTMs (AWD-LSTMs), AWD-Quasi-Recurrent Neural Networks (QRNNs), and Citation: Cruz-Benito, J.; Transformer while using transfer learning and different forms of tokenization to see how they behave Vishwakarma, S.; Martin- Fernandez, F.; Faro, I.
    [Show full text]
  • GEE: a Gradient-Based Explainable Variational Autoencoder for Network Anomaly Detection
    GEE: A Gradient-based Explainable Variational Autoencoder for Network Anomaly Detection Quoc Phong Nguyen Kar Wai Lim Dinil Mon Divakaran National University of Singapore National University of Singapore Trustwave [email protected] [email protected] [email protected] Kian Hsiang Low Mun Choon Chan National University of Singapore National University of Singapore [email protected] [email protected] Abstract—This paper looks into the problem of detecting number of factors, such as end-user behavior, customer busi- network anomalies by analyzing NetFlow records. While many nesses (e.g., banking, retail), applications, location, time of the previous works have used statistical models and machine learning day, and are expected to evolve with time. Such diversity and techniques in a supervised way, such solutions have the limitations that they require large amount of labeled data for training and dynamism limits the utility of rule-based detection systems. are unlikely to detect zero-day attacks. Existing anomaly detection Next, as capturing, storing and processing raw traffic from solutions also do not provide an easy way to explain or identify such high capacity networks is not practical, Internet routers attacks in the anomalous traffic. To address these limitations, today have the capability to extract and export meta data such we develop and present GEE, a framework for detecting and explaining anomalies in network traffic. GEE comprises of two as NetFlow records [3]. With NetFlow, the amount of infor- components: (i) Variational Autoencoder (VAE) — an unsuper- mation captured is brought down by orders of magnitude (in vised deep-learning technique for detecting anomalies, and (ii) a comparison to raw packet capture), not only because a NetFlow gradient-based fingerprinting technique for explaining anomalies.
    [Show full text]
  • Optimization and Gradient Descent INFO-4604, Applied Machine Learning University of Colorado Boulder
    Optimization and Gradient Descent INFO-4604, Applied Machine Learning University of Colorado Boulder September 11, 2018 Prof. Michael Paul Prediction Functions Remember: a prediction function is the function that predicts what the output should be, given the input. Prediction Functions Linear regression: f(x) = wTx + b Linear classification (perceptron): f(x) = 1, wTx + b ≥ 0 -1, wTx + b < 0 Need to learn what w should be! Learning Parameters Goal is to learn to minimize error • Ideally: true error • Instead: training error The loss function gives the training error when using parameters w, denoted L(w). • Also called cost function • More general: objective function (in general objective could be to minimize or maximize; with loss/cost functions, we want to minimize) Learning Parameters Goal is to minimize loss function. How do we minimize a function? Let’s review some math. Rate of Change The slope of a line is also called the rate of change of the line. y = ½x + 1 “rise” “run” Rate of Change For nonlinear functions, the “rise over run” formula gives you the average rate of change between two points Average slope from x=-1 to x=0 is: 2 f(x) = x -1 Rate of Change There is also a concept of rate of change at individual points (rather than two points) Slope at x=-1 is: f(x) = x2 -2 Rate of Change The slope at a point is called the derivative at that point Intuition: f(x) = x2 Measure the slope between two points that are really close together Rate of Change The slope at a point is called the derivative at that point Intuition: Measure the
    [Show full text]
  • Gradient Descent (PDF)
    18.657: Mathematics of Machine Learning Lecturer: Philippe Rigollet Lecture 11 Scribe: Kevin Li Oct. 14, 2015 2. CONVEX OPTIMIZATION FOR MACHINE LEARNING In this lecture, we will cover the basics of convex optimization as it applies to machine learning. There is much more to this topic than will be covered in this class so you may be interested in the following books. Convex Optimization by Boyd and Vandenberghe Lecture notes on Convex Optimization by Nesterov Convex Optimization: Algorithms and Complexity by Bubeck Online Convex Optimization by Hazan The last two are drafts and can be obtained online. 2.1 Convex Problems A convex problem is an optimization problem of the form min f(x) where f andC are x2C convex. First, we will debunk the idea that convex problems are easy by showing that virtually all optimization problems can be written as a convex problem. We can rewrite an optimization problem as follows. min f(x) , min t , min t X 2X t≥f(x);x2X (x;t)2epi(f) where the epigraph of a function is defined by epi(f)=f(x; t) 2X 2 ×IR : t ≥f(x)g Figure 1: An example of an epigraph. Source: https://en.wikipedia.org/wiki/Epigraph_(mathematics) 1 Now we observe that for linear functions, min c>x = min c>x x2D x2conv(D) where the convex hull is defined N N X X conv(D) = fy : 9 N 2 Z+; x1; : : : ; xN 2 D; αi ≥ 0; αi = 1; y = αixig i=1 i=1 To prove this, we know that the left side is a least as big as the right side since D ⊂ conv(D).
    [Show full text]
  • Comparative Analysis of Recurrent Neural Network Architectures for Reservoir Inflow Forecasting
    water Article Comparative Analysis of Recurrent Neural Network Architectures for Reservoir Inflow Forecasting Halit Apaydin 1 , Hajar Feizi 2 , Mohammad Taghi Sattari 1,2,* , Muslume Sevba Colak 1 , Shahaboddin Shamshirband 3,4,* and Kwok-Wing Chau 5 1 Department of Agricultural Engineering, Faculty of Agriculture, Ankara University, Ankara 06110, Turkey; [email protected] (H.A.); [email protected] (M.S.C.) 2 Department of Water Engineering, Agriculture Faculty, University of Tabriz, Tabriz 51666, Iran; [email protected] 3 Department for Management of Science and Technology Development, Ton Duc Thang University, Ho Chi Minh City, Vietnam 4 Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam 5 Department of Civil and Environmental Engineering, Hong Kong Polytechnic University, Hong Kong, China; [email protected] * Correspondence: [email protected] or [email protected] (M.T.S.); [email protected] (S.S.) Received: 1 April 2020; Accepted: 21 May 2020; Published: 24 May 2020 Abstract: Due to the stochastic nature and complexity of flow, as well as the existence of hydrological uncertainties, predicting streamflow in dam reservoirs, especially in semi-arid and arid areas, is essential for the optimal and timely use of surface water resources. In this research, daily streamflow to the Ermenek hydroelectric dam reservoir located in Turkey is simulated using deep recurrent neural network (RNN) architectures, including bidirectional long short-term memory (Bi-LSTM), gated recurrent unit (GRU), long short-term memory (LSTM), and simple recurrent neural networks (simple RNN). For this purpose, daily observational flow data are used during the period 2012–2018, and all models are coded in Python software programming language.
    [Show full text]
  • Training Deep Networks Without Learning Rates Through Coin Betting
    Training Deep Networks without Learning Rates Through Coin Betting Francesco Orabona∗ Tatiana Tommasi∗ Department of Computer Science Department of Computer, Control, and Stony Brook University Management Engineering Stony Brook, NY Sapienza, Rome University, Italy [email protected] [email protected] Abstract Deep learning methods achieve state-of-the-art performance in many application scenarios. Yet, these methods require a significant amount of hyperparameters tuning in order to achieve the best results. In particular, tuning the learning rates in the stochastic optimization process is still one of the main bottlenecks. In this paper, we propose a new stochastic gradient descent procedure for deep networks that does not require any learning rate setting. Contrary to previous methods, we do not adapt the learning rates nor we make use of the assumed curvature of the objective function. Instead, we reduce the optimization process to a game of betting on a coin and propose a learning-rate-free optimal algorithm for this scenario. Theoretical convergence is proven for convex and quasi-convex functions and empirical evidence shows the advantage of our algorithm over popular stochastic gradient algorithms. 1 Introduction In the last years deep learning has demonstrated a great success in a large number of fields and has attracted the attention of various research communities with the consequent development of multiple coding frameworks (e.g., Caffe [Jia et al., 2014], TensorFlow [Abadi et al., 2015]), the diffusion of blogs, online tutorials, books, and dedicated courses. Besides reaching out scientists with different backgrounds, the need of all these supportive tools originates also from the nature of deep learning: it is a methodology that involves many structural details as well as several hyperparameters whose importance has been growing with the recent trend of designing deeper and multi-branches networks.
    [Show full text]