Employing a Transformer Language Model for Information Retrieval and Document Classification

DEGREE PROJECT IN THE FIELD OF TECHNOLOGY ENGINEERING PHYSICS AND THE MAIN FIELD OF STUDY COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Using OpenAI's generative pre-trained transformer, GPT-2

ANTON BJÖÖRN
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Machine Learning
Date: July 11, 2020
Supervisor: Håkan Lane
Examiner: Viggo Kann
School of Electrical Engineering and Computer Science
Host company: SSC - Swedish Space Corporation
Company Supervisor: Jacob Ask
Swedish title: Transformermodellers användbarhet inom informationssökning och dokumentklassificering

Abstract

As the information flow on the Internet keeps growing, it becomes increasingly easy to miss important news which does not have mass appeal. Combating this problem calls for increasingly sophisticated information retrieval methods. Pre-trained transformer-based language models have shown great generalization performance on many natural language processing tasks. This work investigates how well such a language model, OpenAI's Generative Pre-trained Transformer 2 (GPT-2), generalizes to information retrieval and classification of online news articles, written in English, with the purpose of comparing this approach with the more traditional method of Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. The aim is to shed light on how useful state-of-the-art transformer-based language models are for the construction of personalized information retrieval systems.
Using transfer learning, the smallest version of GPT-2 is trained to rank and classify news articles, achieving similar results to the purely TF-IDF based approach. While the average Normalized Discounted Cumulative Gain (NDCG) achieved by the GPT-2 based model was about 0.74 percentage points higher, the sample size was too small to give these results high statistical certainty.

Keywords: Deep Learning, Transformer Models, Information Retrieval, Ranking, Generative Pre-training, Document Classification

Sammanfattning (Swedish summary, translated)

The flow of information on the Internet continues to grow, which makes it increasingly easy to miss important news that does not interest a large number of people. Combating this problem requires increasingly sophisticated information retrieval methods. Over the past couple of years, pre-trained transformer models have taken over as the most prominent neural networks for handling text. This work investigates how well such a language model, OpenAI's Generative Pre-trained Transformer 2 (GPT-2), can generalize from generating text to being used for information retrieval and text classification. To evaluate this, a transformer-based model is compared with a more traditional Term Frequency-Inverse Document Frequency (TF-IDF) vectorization model. The aim is to clarify how useful pre-trained transformer models actually are in the creation of specialized information retrieval systems. The smallest version of the GPT-2 language model is adapted and retrained to rank and classify news articles, written in English, and achieves performance similar to the TF-IDF based model. The GPT-2 based model had on average 0.74 percentage points higher Normalized Discounted Cumulative Gain (NDCG), but the sample size was not large enough to give these results high statistical certainty.

Keywords (translated): deep learning, transformer models, information retrieval, ranking, generative pre-training, document classification

Contents

1 Introduction
  1.1 Overview
    1.1.1 The Space Industry
    1.1.2 Swedish Space Corporation
    1.1.3 Problem Formulation
    1.1.4 Approach
  1.2 Research Question
    1.2.1 Hypothesis
    1.2.2 Conditions & Limitations
2 Background
  2.1 Artificial Neural Networks
    2.1.1 The Perceptron
    2.1.2 Deep Neural Networks & Stochastic Gradient Descent
  2.2 Transformer Networks
    2.2.1 Network Architecture
    2.2.2 Generative Pre-Training and GPT-2
    2.2.3 Generalization Capabilities of GPT-2
  2.3 Information Retrieval
    2.3.1 Document Ranking & Relevance Score
    2.3.2 TF-IDF
    2.3.3 Neural Relevance Scoring
    2.3.4 Ensemble Relevance Scoring (TF-IDF + GPT-2)
  2.4 Evaluation
3 Method
  3.1 Data Collection
    3.1.1 Web Scraping
    3.1.2 SSC News Summaries & Fundamental Data Set
    3.1.3 Daily Web Scraping & Main Data Set
    3.1.4 Labeling of Main Data Set
  3.2 TF-IDF Based Model
    3.2.1 Training
    3.2.2 Scoring and Classification
  3.3 GPT-2 Based Model
    3.3.1 Adaptation
    3.3.2 Relevance Classifier
    3.3.3 Zone Classifier
    3.3.4 Data Set
    3.3.5 Training
    3.3.6 Scoring & Classification
  3.4 Ensemble Models
    3.4.1 Multiplicative Model
    3.4.2 Re-ranking Model
  3.5 Evaluation
    3.5.1 N-Fold Cross Validation
    3.5.2 Normalized Discounted Cumulative Gain
    3.5.3 Classification Accuracy
    3.5.4 Resource Demands
  3.6 Statistical Significance & Null Hypothesis
4 Results
  4.1 Relevance Scoring
  4.2 Classification Accuracy
  4.3 NDCG vs Accuracy
  4.4 NDCG@10
  4.5 T-Test
  4.6 Resource Demands
5 Discussion
  5.1 Model Performance
  5.2 Model Robustness
  5.3 Ensemble Models
  5.4 Resource Demands
  5.5 Ethical Considerations
  5.6 Sustainability & Societal Considerations
6 Conclusions
  6.1 Employment of Transformer Language Models
  6.2 Future Work
Bibliography

Chapter 1 Introduction

This chapter introduces the problem and provides some basic information about the host company and the approach that was applied.

1.1 Overview

The amount of information published on the Internet every day is vast. For some companies, the news published within their own industry alone may be more than a single person has time to read, making it hard to stay up to date. Furthermore, there are many forces at work behind what is published and how it is published (click-baiting, biases, paid articles), which can make finding the most relevant and trustworthy news challenging. This flood of information calls for sophisticated information retrieval techniques to aid in navigating the growing information landscape.

1.1.1 The Space Industry

The space industry has seen big changes since the beginning of the century, with launch prices being reduced by about a factor of 20 [1], accompanied by a big shift towards commercialisation [2, 1]. With a boom in space-related startups and commercial interest [1], the information flow within the industry is higher than it has ever been. With future plans ranging from huge satellite constellations providing global Internet services [3] to the colonization of Mars [4] and new moon missions [5], there seems to be no end to the coming growth and change.

1.1.2 Swedish Space Corporation

Swedish Space Corporation (SSC), previously known as Rymdbolaget, is a mainly commercial space company owned by the Swedish government. SSC operates the Esrange spaceport in northern Sweden, where it carries out missions for its customers, including launching experiments aboard sounding rockets and weather balloons. The company also provides science and launch services, satellite ground network services from its many ground stations around the world, and engineering and spacecraft operation services.

1.1.3 Problem Formulation

Swedish Space Corporation (SSC) wants to help their employees stay up to date with industry events in a time-efficient manner by constructing a program that can automatically retrieve and present the most relevant news items for SSC. The system should automatically collect news articles from selected sites and, upon request, produce a summary of the news collected within a given range of dates.

The scientific question being investigated is whether a modern transformer model, more specifically Generative Pre-trained Transformer 2 (GPT-2) [6], can be used to improve the results of such a system by being employed for document ranking and classification. The primary objective of the system will, however, be document ranking; classification is a secondary objective.

1.1.4 Approach

Collection of news articles will be done by crawling and scraping pages from a list of online news pages created with the help of SSC. This web scraper will be used both to collect data to be labeled for training and to collect articles for creating the news summaries in the finished software.

Two different methods for document relevance scoring (ranking) and classification will be implemented. One will be a vector space model based on TF-IDF (term frequency-inverse document frequency) and will serve as a baseline for performance. The other method will perform relevance scoring and classification via transfer learning on the pre-trained transformer model GPT-2 [6]. All web pages scraped will be written in English, since GPT-2 is mainly trained on English text.

1.2 Research Question

"Is employing a transformer language model using transfer learning better than a proven traditional approach for the purpose of information retrieval and document classification?"

This research question is evaluated by comparing the performance of the two different approaches. In this work, "better" for the purpose of information retrieval is defined as retrieving results of higher relevance, and "better" for document classification is defined as achieving a higher classification accuracy. In addition to these performance metrics, the resource demands of both approaches will also be taken into consideration in the form of execution time.

To answer this question, software must be constructed and two different relevance scoring and classification methods implemented in it.
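
The TF-IDF baseline and the relevance-based comparison described above can be illustrated with a small sketch. It is not the implementation used in this work: the tokenizer, the smoothed IDF formula, the toy documents, the query (standing in for a hypothetical interest profile), the graded labels, and the linear-gain form of NDCG are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def fit_idf(docs):
    """Smoothed inverse document frequency per term (assumed formula)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector; terms not seen when fitting the IDF are dropped."""
    tf = Counter(t for t in tokenize(text) if t in idf)
    return {term: count * idf[term] for term, count in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs):
    """Return document indices ordered from most to least similar to the query."""
    idf = fit_idf(docs)
    q = tfidf_vector(query, idf)
    scores = [cosine(q, tfidf_vector(d, idf)) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg(ranked_relevances):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0


# Hypothetical toy data: three headlines with made-up graded relevance labels.
docs = [
    "Sounding rocket launched from Esrange spaceport",
    "New satellite ground station opens for commercial customers",
    "Local football team wins the cup final",
]
labels = [2, 1, 0]
query = "rocket launch Esrange satellite ground station"

order = rank(query, docs)
score = ndcg([labels[i] for i in order])
print(order, round(score, 3))
```

In this toy case the second headline shares more terms with the query than the first, so the ranking disagrees with the labels and the NDCG falls below 1. A metric of this kind is what allows two relevance scoring methods to be compared on the same labeled articles.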