Learning to Rank Using Gradient Descent
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Training Autoencoders by Alternating Minimization
Under review as a conference paper at ICLR 2018 TRAINING AUTOENCODERS BY ALTERNATING MINI- MIZATION Anonymous authors Paper under double-blind review ABSTRACT We present DANTE, a novel method for training neural networks, in particular autoencoders, using the alternating minimization principle. DANTE provides a distinct perspective in lieu of traditional gradient-based backpropagation techniques commonly used to train deep networks. It utilizes an adaptation of quasi-convex optimization techniques to cast autoencoder training as a bi-quasi-convex optimiza- tion problem. We show that for autoencoder configurations with both differentiable (e.g. sigmoid) and non-differentiable (e.g. ReLU) activation functions, we can perform the alternations very effectively. DANTE effortlessly extends to networks with multiple hidden layers and varying network configurations. In experiments on standard datasets, autoencoders trained using the proposed method were found to be very promising and competitive to traditional backpropagation techniques, both in terms of quality of solution, as well as training speed. 1 INTRODUCTION For much of the recent march of deep learning, gradient-based backpropagation methods, e.g. Stochastic Gradient Descent (SGD) and its variants, have been the mainstay of practitioners. The use of these methods, especially on vast amounts of data, has led to unprecedented progress in several areas of artificial intelligence. On one hand, the intense focus on these techniques has led to an intimate understanding of hardware requirements and code optimizations needed to execute these routines on large datasets in a scalable manner. Today, myriad off-the-shelf and highly optimized packages exist that can churn reasonably large datasets on GPU architectures with relatively mild human involvement and little bootstrap effort. -
Learning to Learn by Gradient Descent by Gradient Descent
Learning to learn by gradient descent by gradient descent Marcin Andrychowicz1, Misha Denil1, Sergio Gómez Colmenarejo1, Matthew W. Hoffman1, David Pfau1, Tom Schaul1, Brendan Shillingford1,2, Nando de Freitas1,2,3 1Google DeepMind 2University of Oxford 3Canadian Institute for Advanced Research [email protected] {mdenil,sergomez,mwhoffman,pfau,schaul}@google.com [email protected], [email protected] Abstract The move from hand-designed features to learned features in machine learning has been wildly successful. In spite of this, optimization algorithms are still designed by hand. In this paper we show how the design of an optimization algorithm can be cast as a learning problem, allowing the algorithm to learn to exploit structure in the problems of interest in an automatic way. Our learned algorithms, implemented by LSTMs, outperform generic, hand-designed competitors on the tasks for which they are trained, and also generalize well to new tasks with similar structure. We demonstrate this on a number of tasks, including simple convex problems, training neural networks, and styling images with neural art. 1 Introduction Frequently, tasks in machine learning can be expressed as the problem of optimizing an objective function f(✓) defined over some domain ✓ ⇥. The goal in this case is to find the minimizer 2 ✓⇤ = arg min✓ ⇥ f(✓). While any method capable of minimizing this objective function can be applied, the standard2 approach for differentiable functions is some form of gradient descent, resulting in a sequence of updates ✓ = ✓ ↵ f(✓ ) . t+1 t − tr t The performance of vanilla gradient descent, however, is hampered by the fact that it only makes use of gradients and ignores second-order information. -
The Matrix Calculus You Need for Deep Learning
The Matrix Calculus You Need For Deep Learning Terence Parr and Jeremy Howard July 3, 2018 (We teach in University of San Francisco's MS in Data Science program and have other nefarious projects underway. You might know Terence as the creator of the ANTLR parser generator. For more material, see Jeremy's fast.ai courses and University of San Francisco's Data Institute in- person version of the deep learning course.) HTML version (The PDF and HTML were generated from markup using bookish) Abstract This paper is an attempt to explain all the matrix calculus you need in order to understand the training of deep neural networks. We assume no math knowledge beyond what you learned in calculus 1, and provide links to help you refresh the necessary math where needed. Note that you do not need to understand this material before you start learning to train and use deep learning in practice; rather, this material is for those who are already familiar with the basics of neural networks, and wish to deepen their understanding of the underlying math. Don't worry if you get stuck at some point along the way|just go back and reread the previous section, and try writing down and working through some examples. And if you're still stuck, we're happy to answer your questions in the Theory category at forums.fast.ai. Note: There is a reference section at the end of the paper summarizing all the key matrix calculus rules and terminology discussed here. arXiv:1802.01528v3 [cs.LG] 2 Jul 2018 1 Contents 1 Introduction 3 2 Review: Scalar derivative rules4 3 Introduction to vector calculus and partial derivatives5 4 Matrix calculus 6 4.1 Generalization of the Jacobian . -
Context-Aware Learning to Rank with Self-Attention
Context-Aware Learning to Rank with Self-Attention Przemysław Pobrotyn Tomasz Bartczak Mikołaj Synowiec ML Research at Allegro.pl ML Research at Allegro.pl ML Research at Allegro.pl [email protected] [email protected] [email protected] Radosław Białobrzeski Jarosław Bojar ML Research at Allegro.pl ML Research at Allegro.pl [email protected] [email protected] ABSTRACT engines, recommender systems, question-answering systems, and Learning to rank is a key component of many e-commerce search others. engines. In learning to rank, one is interested in optimising the A typical machine learning solution to the LTR problem involves global ordering of a list of items according to their utility for users. learning a scoring function, which assigns real-valued scores to Popular approaches learn a scoring function that scores items in- each item of a given list, based on a dataset of item features and dividually (i.e. without the context of other items in the list) by human-curated or implicit (e.g. clickthrough logs) relevance labels. optimising a pointwise, pairwise or listwise loss. The list is then Items are then sorted in the descending order of scores [23]. Perfor- sorted in the descending order of the scores. Possible interactions mance of the trained scoring function is usually evaluated using an between items present in the same list are taken into account in the IR metric like Mean Reciprocal Rank (MRR) [33], Normalised Dis- training phase at the loss level. However, during inference, items counted Cumulative Gain (NDCG) [19] or Mean Average Precision are scored individually, and possible interactions between them are (MAP) [4]. -
Training Neural Networks Without Gradients: a Scalable ADMM Approach
Training Neural Networks Without Gradients: A Scalable ADMM Approach Gavin Taylor1 [email protected] Ryan Burmeister1 Zheng Xu2 [email protected] Bharat Singh2 [email protected] Ankit Patel3 [email protected] Tom Goldstein2 [email protected] 1United States Naval Academy, Annapolis, MD USA 2University of Maryland, College Park, MD USA 3Rice University, Houston, TX USA Abstract many parameters. Because big datasets provide results that With the growing importance of large network (often dramatically) outperform the prior state-of-the-art in models and enormous training datasets, GPUs many machine learning tasks, researchers are willing to have become increasingly necessary to train neu- purchase specialized hardware such as GPUs, and commit ral networks. This is largely because conven- large amounts of time to training models and tuning hyper- tional optimization algorithms rely on stochastic parameters. gradient methods that don’t scale well to large Gradient-based training methods have several properties numbers of cores in a cluster setting. Further- that contribute to this need for specialized hardware. First, more, the convergence of all gradient methods, while large amounts of data can be shared amongst many including batch methods, suffers from common cores, existing optimization methods suffer when paral- problems like saturation effects, poor condition- lelized. Second, training neural nets requires optimizing ing, and saddle points. This paper explores an highly non-convex objectives that exhibit saddle points, unconventional training method that uses alter- poor conditioning, and vanishing gradients, all of which nating direction methods and Bregman iteration slow down gradient-based methods such as stochastic gra- to train networks without gradient descent steps. -
Training Deep Networks with Stochastic Gradient Normalized by Layerwise Adaptive Second Moments
Training Deep Networks with Stochastic Gradient Normalized by Layerwise Adaptive Second Moments Boris Ginsburg 1 Patrice Castonguay 1 Oleksii Hrinchuk 1 Oleksii Kuchaiev 1 Ryan Leary 1 Vitaly Lavrukhin 1 Jason Li 1 Huyen Nguyen 1 Yang Zhang 1 Jonathan M. Cohen 1 Abstract We started with Adam, and then (1) replaced the element- wise second moment with the layer-wise moment, (2) com- We propose NovoGrad, an adaptive stochastic puted the first moment using gradients normalized by layer- gradient descent method with layer-wise gradient wise second moment, (3) decoupled weight decay (WD) normalization and decoupled weight decay. In from normalized gradients (similar to AdamW). our experiments on neural networks for image classification, speech recognition, machine trans- The resulting algorithm, NovoGrad, combines SGD’s and lation, and language modeling, it performs on par Adam’s strengths. We applied NovoGrad to a variety of or better than well-tuned SGD with momentum, large scale problems — image classification, neural machine Adam, and AdamW. Additionally, NovoGrad (1) translation, language modeling, and speech recognition — is robust to the choice of learning rate and weight and found that in all cases, it performs as well or better than initialization, (2) works well in a large batch set- Adam/AdamW and SGD with momentum. ting, and (3) has half the memory footprint of Adam. 2. Related Work NovoGrad belongs to the family of Stochastic Normalized 1. Introduction Gradient Descent (SNGD) optimizers (Nesterov, 1984; Hazan et al., 2015). SNGD uses only the direction of the The most popular algorithms for training Neural Networks stochastic gradient gt to update the weights wt: (NNs) are Stochastic Gradient Descent (SGD) with mo- gt mentum (Polyak, 1964; Sutskever et al., 2013) and Adam wt+1 = wt − λt · (Kingma & Ba, 2015). -
A Brief Tour of Vector Calculus
A BRIEF TOUR OF VECTOR CALCULUS A. HAVENS Contents 0 Prelude ii 1 Directional Derivatives, the Gradient and the Del Operator 1 1.1 Conceptual Review: Directional Derivatives and the Gradient........... 1 1.2 The Gradient as a Vector Field............................ 5 1.3 The Gradient Flow and Critical Points ....................... 10 1.4 The Del Operator and the Gradient in Other Coordinates*............ 17 1.5 Problems........................................ 21 2 Vector Fields in Low Dimensions 26 2 3 2.1 General Vector Fields in Domains of R and R . 26 2.2 Flows and Integral Curves .............................. 31 2.3 Conservative Vector Fields and Potentials...................... 32 2.4 Vector Fields from Frames*.............................. 37 2.5 Divergence, Curl, Jacobians, and the Laplacian................... 41 2.6 Parametrized Surfaces and Coordinate Vector Fields*............... 48 2.7 Tangent Vectors, Normal Vectors, and Orientations*................ 52 2.8 Problems........................................ 58 3 Line Integrals 66 3.1 Defining Scalar Line Integrals............................. 66 3.2 Line Integrals in Vector Fields ............................ 75 3.3 Work in a Force Field................................. 78 3.4 The Fundamental Theorem of Line Integrals .................... 79 3.5 Motion in Conservative Force Fields Conserves Energy .............. 81 3.6 Path Independence and Corollaries of the Fundamental Theorem......... 82 3.7 Green's Theorem.................................... 84 3.8 Problems........................................ 89 4 Surface Integrals, Flux, and Fundamental Theorems 93 4.1 Surface Integrals of Scalar Fields........................... 93 4.2 Flux........................................... 96 4.3 The Gradient, Divergence, and Curl Operators Via Limits* . 103 4.4 The Stokes-Kelvin Theorem..............................108 4.5 The Divergence Theorem ...............................112 4.6 Problems........................................114 List of Figures 117 i 11/14/19 Multivariate Calculus: Vector Calculus Havens 0. -
CSE 152: Computer Vision Manmohan Chandraker
CSE 152: Computer Vision Manmohan Chandraker Lecture 15: Optimization in CNNs Recap Engineered against learned features Label Convolutional filters are trained in a Dense supervised manner by back-propagating classification error Dense Dense Convolution + pool Label Convolution + pool Classifier Convolution + pool Pooling Convolution + pool Feature extraction Convolution + pool Image Image Jia-Bin Huang and Derek Hoiem, UIUC Two-layer perceptron network Slide credit: Pieter Abeel and Dan Klein Neural networks Non-linearity Activation functions Multi-layer neural network From fully connected to convolutional networks next layer image Convolutional layer Slide: Lazebnik Spatial filtering is convolution Convolutional Neural Networks [Slides credit: Efstratios Gavves] 2D spatial filters Filters over the whole image Weight sharing Insight: Images have similar features at various spatial locations! Key operations in a CNN Feature maps Spatial pooling Non-linearity Convolution (Learned) . Input Image Input Feature Map Source: R. Fergus, Y. LeCun Slide: Lazebnik Convolution as a feature extractor Key operations in a CNN Feature maps Rectified Linear Unit (ReLU) Spatial pooling Non-linearity Convolution (Learned) Input Image Source: R. Fergus, Y. LeCun Slide: Lazebnik Key operations in a CNN Feature maps Spatial pooling Max Non-linearity Convolution (Learned) Input Image Source: R. Fergus, Y. LeCun Slide: Lazebnik Pooling operations • Aggregate multiple values into a single value • Invariance to small transformations • Keep only most important information for next layer • Reduces the size of the next layer • Fewer parameters, faster computations • Observe larger receptive field in next layer • Hierarchically extract more abstract features Key operations in a CNN Feature maps Spatial pooling Non-linearity Convolution (Learned) . Input Image Input Feature Map Source: R. -
Automated Source Code Generation and Auto-Completion Using Deep Learning: Comparing and Discussing Current Language Model-Related Approaches
Article Automated Source Code Generation and Auto-Completion Using Deep Learning: Comparing and Discussing Current Language Model-Related Approaches Juan Cruz-Benito 1,* , Sanjay Vishwakarma 2,†, Francisco Martin-Fernandez 1 and Ismael Faro 1 1 IBM Quantum, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA; [email protected] (F.M.-F.); [email protected] (I.F.) 2 Electrical and Computer Engineering, Carnegie Mellon University, Mountain View, CA 94035, USA; [email protected] * Correspondence: [email protected] † Intern at IBM Quantum at the time of writing this paper. Abstract: In recent years, the use of deep learning in language models has gained much attention. Some research projects claim that they can generate text that can be interpreted as human writ- ing, enabling new possibilities in many application areas. Among the different areas related to language processing, one of the most notable in applying this type of modeling is programming languages. For years, the machine learning community has been researching this software engi- neering area, pursuing goals like applying different approaches to auto-complete, generate, fix, or evaluate code programmed by humans. Considering the increasing popularity of the deep learning- enabled language models approach, we found a lack of empirical papers that compare different deep learning architectures to create and use language models based on programming code. This paper compares different neural network architectures like Average Stochastic Gradient Descent (ASGD) Weight-Dropped LSTMs (AWD-LSTMs), AWD-Quasi-Recurrent Neural Networks (QRNNs), and Citation: Cruz-Benito, J.; Transformer while using transfer learning and different forms of tokenization to see how they behave Vishwakarma, S.; Martin- Fernandez, F.; Faro, I. -
An Introduction to Neural Information Retrieval
The version of record is available at: http://dx.doi.org/10.1561/1500000061 An Introduction to Neural Information Retrieval Bhaskar Mitra1 and Nick Craswell2 1Microsoft, University College London; [email protected] 2Microsoft; [email protected] ABSTRACT Neural ranking models for information retrieval (IR) use shallow or deep neural networks to rank search results in response to a query. Tradi- tional learning to rank models employ supervised machine learning (ML) techniques—including neural networks—over hand-crafted IR features. By contrast, more recently proposed neural models learn representations of language from raw text that can bridge the gap between query and docu- ment vocabulary. Unlike classical learning to rank models and non-neural approaches to IR, these new ML techniques are data-hungry, requiring large scale training data before they can be deployed. This tutorial intro- duces basic concepts and intuitions behind neural IR models, and places them in the context of classical non-neural approaches to IR. We begin by introducing fundamental concepts of retrieval and different neural and non-neural approaches to unsupervised learning of vector representations of text. We then review IR methods that employ these pre-trained neural vector representations without learning the IR task end-to-end. We intro- duce the Learning to Rank (LTR) framework next, discussing standard loss functions for ranking. We follow that with an overview of deep neural networks (DNNs), including standard architectures and implementations. Finally, we review supervised neural learning to rank models, including re- cent DNN architectures trained end-to-end for ranking tasks. We conclude with a discussion on potential future directions for neural IR. -
GEE: a Gradient-Based Explainable Variational Autoencoder for Network Anomaly Detection
GEE: A Gradient-based Explainable Variational Autoencoder for Network Anomaly Detection Quoc Phong Nguyen Kar Wai Lim Dinil Mon Divakaran National University of Singapore National University of Singapore Trustwave [email protected] [email protected] [email protected] Kian Hsiang Low Mun Choon Chan National University of Singapore National University of Singapore [email protected] [email protected] Abstract—This paper looks into the problem of detecting number of factors, such as end-user behavior, customer busi- network anomalies by analyzing NetFlow records. While many nesses (e.g., banking, retail), applications, location, time of the previous works have used statistical models and machine learning day, and are expected to evolve with time. Such diversity and techniques in a supervised way, such solutions have the limitations that they require large amount of labeled data for training and dynamism limits the utility of rule-based detection systems. are unlikely to detect zero-day attacks. Existing anomaly detection Next, as capturing, storing and processing raw traffic from solutions also do not provide an easy way to explain or identify such high capacity networks is not practical, Internet routers attacks in the anomalous traffic. To address these limitations, today have the capability to extract and export meta data such we develop and present GEE, a framework for detecting and explaining anomalies in network traffic. GEE comprises of two as NetFlow records [3]. With NetFlow, the amount of infor- components: (i) Variational Autoencoder (VAE) — an unsuper- mation captured is brought down by orders of magnitude (in vised deep-learning technique for detecting anomalies, and (ii) a comparison to raw packet capture), not only because a NetFlow gradient-based fingerprinting technique for explaining anomalies. -
Policy Gradient
Lecture 7: Policy Gradient Lecture 7: Policy Gradient David Silver Lecture 7: Policy Gradient Outline 1 Introduction 2 Finite Difference Policy Gradient 3 Monte-Carlo Policy Gradient 4 Actor-Critic Policy Gradient Lecture 7: Policy Gradient Introduction Policy-Based Reinforcement Learning In the last lecture we approximated the value or action-value function using parameters θ, V (s) V π(s) θ ≈ Q (s; a) Qπ(s; a) θ ≈ A policy was generated directly from the value function e.g. using -greedy In this lecture we will directly parametrise the policy πθ(s; a) = P [a s; θ] j We will focus again on model-free reinforcement learning Lecture 7: Policy Gradient Introduction Value-Based and Policy-Based RL Value Based Learnt Value Function Implicit policy Value Function Policy (e.g. -greedy) Policy Based Value-Based Actor Policy-Based No Value Function Critic Learnt Policy Actor-Critic Learnt Value Function Learnt Policy Lecture 7: Policy Gradient Introduction Advantages of Policy-Based RL Advantages: Better convergence properties Effective in high-dimensional or continuous action spaces Can learn stochastic policies Disadvantages: Typically converge to a local rather than global optimum Evaluating a policy is typically inefficient and high variance Lecture 7: Policy Gradient Introduction Rock-Paper-Scissors Example Example: Rock-Paper-Scissors Two-player game of rock-paper-scissors Scissors beats paper Rock beats scissors Paper beats rock Consider policies for iterated rock-paper-scissors A deterministic policy is easily exploited A uniform random policy