
Outline for today’s presentation

• We will see how RNNs and CNNs compare on a variety of tasks

• Then, we will go through a new approach for Sequence Modelling that has become state of the art

• Finally, we will look at a few augmented RNN models

RNNs vs CNNs

Empirical Evaluation of Generic Networks for Sequence Modelling

Let's say you are given a sequence modelling task such as text classification or music note prediction, and you are asked to develop a simple model.

What would your baseline model be based on – RNNs or CNNs?

Recent Trend in Sequence Modelling

• Sequence modelling is widely considered to be RNNs' "home turf"
• Recent research has shown otherwise:
  • Audio Synthesis – WaveNet uses Dilated Convolutions
  • Char-to-Char Machine Translation – ByteNet uses an Encoder-Decoder architecture and Dilated Convolutions; tested on an English-German dataset
  • Word-to-Word Machine Translation – Hybrid CNN-LSTM on English-Romanian and English-French datasets
  • Character-level Language Modelling – ByteNet on the WikiText dataset
  • Word-level Language Modelling – Gated CNNs on the WikiText dataset

Temporal Convolutional Network (TCN)

• A model that uses best practices in convolutional network design
• The properties of TCN:
  • Causal – there is no information leakage from future to past
  • Memory – it can look very far into the past for prediction/synthesis
  • Input – it can take a sequence of arbitrary length, with proper tuning to the particular task
  • Simple – it uses no gating mechanism or complex stacking mechanism, and each output has the same length as the input
• Components of TCN:
  • 1-D Dilated Convolutions
  • Residual Connections

TCN - Dilated Convolutions

1-D Dilated Convolutions

1-D Convolutions (Source: WaveNet)

TCN - Residual Connections

Residual Block of TCN

Example of Residual Connection in TCN

• Layers learn a modification to the identity mapping rather than the entire transformation
• Residual connections have been shown to be very useful for very deep networks
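To make these components concrete, here is a minimal PyTorch sketch of a TCN-style residual block: a 1-D dilated convolution made causal by trimming the padded "future" positions, wrapped with weight normalization, dropout, and a residual connection. Kernel size, dropout rate, and class names are illustrative choices, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1-D dilated convolution made causal by trimming the right-side padding."""
    def __init__(self, channels_in, channels_out, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # padding needed for causality
        self.conv = nn.utils.weight_norm(
            nn.Conv1d(channels_in, channels_out, kernel_size,
                      padding=self.pad, dilation=dilation))

    def forward(self, x):                                # x: (batch, channels, time)
        out = self.conv(x)
        return out[:, :, :-self.pad] if self.pad else out  # drop the "future" positions

class TCNResidualBlock(nn.Module):
    """Two dilated causal convolutions plus a residual (identity or 1x1) connection."""
    def __init__(self, channels_in, channels_out, kernel_size=3, dilation=1, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(channels_in, channels_out, kernel_size, dilation),
            nn.ReLU(), nn.Dropout(dropout),
            CausalConv1d(channels_out, channels_out, kernel_size, dilation),
            nn.ReLU(), nn.Dropout(dropout))
        # 1x1 convolution so the residual matches the output width when channels differ
        self.downsample = (nn.Conv1d(channels_in, channels_out, 1)
                           if channels_in != channels_out else nn.Identity())

    def forward(self, x):
        return torch.relu(self.net(x) + self.downsample(x))
```

Stacking such blocks with exponentially increasing dilation (1, 2, 4, ...) is what gives the TCN the long effective memory described above.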

TCN - Weight Normalization

• Shortcomings of Batch Normalization:
  • It needs two passes over the input – one to compute the batch statistics and one to normalize
  • It takes a significant amount of time to compute for each batch
  • It is dependent on the batch size – not very useful when the batch size is small
  • It cannot be used when we are training in an online setting
• Weight Normalization instead normalizes the weight vectors themselves, independently of the training examples in the batch
• The main aim is to decouple the magnitude and the direction of the weight vector

o_j = γ_j (W_j · x) / (‖W_j‖₂ + ε) + β_j

• Weight Normalization has been shown to be faster than Batch Norm
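As a small illustration of the decoupling idea, here is a NumPy sketch that evaluates the reconstructed formula above for one layer; gamma, beta, and eps mirror the slide's symbols, and the example values are arbitrary.

```python
import numpy as np

def weight_norm_output(W, x, gamma, beta, eps=1e-5):
    """o_j = gamma_j * (W_j . x) / (||W_j||_2 + eps) + beta_j for each output unit j."""
    norms = np.linalg.norm(W, axis=1)          # ||W_j||_2, one per output unit
    return gamma * (W @ x) / (norms + eps) + beta

# Example: 3 output units, 5-dimensional input
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
x = rng.normal(size=5)
print(weight_norm_output(W, x, gamma=np.ones(3), beta=np.zeros(3)))
```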

TCN Advantages/Disadvantages

[+] Parallelism – each layer of a CNN can be parallelized
[+] Receptive field size – can be easily increased by increasing the filter length, dilation factor, or depth
[+] Stability – uses residual connections and dropout
[+] Storage (train) – memory footprint is smaller than that of RNNs
[+] Sequence length – can be easily adapted to variable input lengths

[-] Storage (test) – during testing, it requires more memory than RNNs

Experimental Setup

• TCN filter size, dilation factor, and number of layers are chosen so that the receptive field covers the entire input
• Vanilla RNN/LSTM/GRU hidden sizes and layer counts are chosen to have roughly the same number of parameters as the TCN
• For both model families, the hyperparameter search covered:
  • Gradient clipping – [0.3, 1]
  • Dropout – [0, 0.5]
  • Optimizers – SGD/RMSProp/AdaGrad/Adam
  • Weight initialization – Gaussian N(0, 0.01)
  • Exponential dilation (for TCN)

Datasets

• Adding Problem:
  • Serves as a stress test for sequence models
  • The input has length n and depth 2: the first dimension contains random values in [0, 1], and the second dimension is 1 at exactly two positions (and 0 elsewhere)
  • The goal is to output the sum of the two marked values (a data-generation sketch follows this list)
• Sequential MNIST and P-MNIST:
  • Tests the ability to remember the distant past
  • Each MNIST digit image is flattened into a 784x1 sequence for digit classification
  • P-MNIST has the pixel values permuted
• Copy Memory:
  • Tests the memory capacity of the model
  • The input has length n + 20: the first 10 digits are randomly selected from 1 to 8, the last 10 digits are 9, and everything else is 0
  • The goal is to copy the first 10 values into the last ten placeholder positions
• Polyphonic Music:
  • Each time step is an 88-dimensional vector of piano keys
  • The goal is to predict the next key(s) in the sequence

• Penn Treebank (PTB):
  • A small language modelling dataset for both word- and character-level modelling
  • Consists of 5,059K characters or 888K words for training

• Wikitext-103:
  • Consists of 28K Wikipedia articles for word-level language modelling
  • Consists of 103M words for training

• LAMBADA:
  • Tests the ability to capture longer and broader contexts
  • Consists of 10K passages extracted from novels; serves as a QnA-style dataset
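As referenced in the Adding Problem bullet above, here is a minimal sketch of how such a batch could be generated; array shapes and function names are illustrative rather than taken from the benchmark's reference code.

```python
import numpy as np

def adding_problem_batch(batch_size, seq_len, rng=np.random.default_rng()):
    """Generate (inputs, targets) for the adding problem.

    inputs:  (batch_size, seq_len, 2) - values in [0, 1] plus a two-hot marker channel
    targets: (batch_size,)            - sum of the two marked values
    """
    values = rng.random((batch_size, seq_len))             # first dimension: random values
    markers = np.zeros((batch_size, seq_len))
    for b in range(batch_size):
        i, j = rng.choice(seq_len, size=2, replace=False)  # the two marked positions
        markers[b, [i, j]] = 1.0
    inputs = np.stack([values, markers], axis=-1)
    targets = (values * markers).sum(axis=1)               # sum of the two marked values
    return inputs, targets

x, y = adding_problem_batch(batch_size=4, seq_len=100)
print(x.shape, y.shape)   # (4, 100, 2) (4,)
```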

Results

Performance Analysis – TCN vs TCN with Gating Mechanism

Inferences

• Inferences are made on the following categories:
1. Memory
  • The copy memory task was designed to check propagation of information
  • TCNs achieve almost 100% accuracy, whereas RNNs fail at higher sequence lengths
  • The LAMBADA dataset was designed to test local and broader contexts
  • TCNs again outperform all of their recurrent counterparts
2. Convergence
  • In almost all of the tasks, TCNs converged faster than RNNs
  • The extent of parallelism possible is one explanation
• Conclusion: given enough research, TCNs can outperform SOTA RNN models

To build a simple Sequence Modelling network, what would you choose – RNNs or CNNs?

Attention is all you need
Md Mehrab Tanjim

http://jalammar.github.io/illustrated-transformer/
https://nlp.stanford.edu/seminar/details/lkaiser.pdf

Recap: Sequence to sequence

Recap: Sequence to sequence w/ attention

But how can we calculate the scores?

Score

Refer to the assignment

Query (Q), Key (K), Value (V)

What functions can be used to calculate the score?

1. Additive attention
   a. Computes the compatibility function using a feed-forward network with a single hidden layer (given in the assignment)

2. Dot-product (multiplicative) attention
   a. Dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code (used in the Transformer, explained here)

Dot-product (multiplicative) attention

Why divide by the square root of dk?

Scaling Factor

Problem: Additive attention outperforms dot-product attention without scaling for larger values of dk.

Cause: The dot products grow large in magnitude, pushing the softmax into regions where it has extremely small gradients.

Solution: To counteract this effect, scale the dot products by 1/√dk.
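A minimal NumPy sketch of scaled dot-product attention, softmax(QKᵀ/√dk)V, with an optional mask argument that will be useful for the decoder case discussed below. Names and shapes are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v). Returns (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # compatibility, scaled by 1/sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # illegal positions get ~ -inf before softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V
```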

Complexity

This is the encoder-decoder attention.

Attention

Attention between the encoder and decoder is crucial in NMT

Why not use (self-)attention for the representations?

Self Attention

Three ways of Attention

Encoder Self Attention

● Each position in the encoder can attend to all positions in the previous layer of the encoder.
● All of the keys, values, and queries come from the same place – in this case, the output of the previous layer in the encoder.

Decoder Self Attention

● Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position.
● To preserve the auto-regressive property, mask out (by setting to −∞) all values in the input of the softmax which correspond to illegal connections.
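To make the masking concrete, here is a small sketch of the causal mask that could be passed to the attention sketch above: position i is allowed to attend only to positions j ≤ i. This is an illustrative helper, not the paper's code.

```python
import numpy as np

def causal_mask(seq_len):
    """True where attention is allowed: query position i may see key positions j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```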

Notice any difference from convolution?

Self-attention

● Convolution: a different linear transformation for each relative position, which allows the model to distinguish what information came from where.
● Self-Attention: reduced effective resolution due to averaging attention-weighted positions.

Multi-head Attention

● Multiple attention layers (heads) in parallel (shown by different colors)

● Each head uses different linear transformations.

● Different heads can learn different relationships.

But first we need to encode the position!

Positional Encoding

A real example of positional encoding for 20 words (rows) with an embedding size of 512 (columns). You can see that it appears split in half down the center. That's because the values of the left half are generated by one function (which uses sine), and the right half is generated by another function (which uses cosine). They're then concatenated to form each of the positional encoding vectors.

Why this function?

● The authors hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos).
● The authors also experimented with using learned positional embeddings instead, and found that the two versions produced nearly identical results.
● The authors chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.
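A minimal sketch of the sinusoidal encoding from the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The paper interleaves the sine and cosine dimensions; the visualization described above instead groups them into a sine half and a cosine half, which contains the same values in a different dimension order.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Returns a (max_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(max_len)[:, None]                  # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)   # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                             # even dims: sine
    pe[:, 1::2] = np.cos(angles)                             # odd dims: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=20, d_model=512)
print(pe.shape)   # (20, 512)
```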

Multi-head Attention

● The authors employed h = 8 parallel attention layers, or heads. For each of these, dk = dv = dmodel/h = 64.
● Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

● Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

● With a single attention head, averaging inhibits this.
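A compact, self-contained sketch of multi-head self-attention: project the input with full-width matrices, split into h heads, run scaled dot-product attention in each head, concatenate, and apply an output projection. The projection matrices here stand in for learned parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (seq_len, d_model); W_q/W_k/W_v/W_o: (d_model, d_model) learned projections."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads      # d_k = d_v = d_model / h

    def split_heads(M):                # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    heads = [softmax(Q[h] @ K[h].T / np.sqrt(d_head)) @ V[h]   # scaled dot-product per head
             for h in range(num_heads)]
    return np.concatenate(heads, axis=-1) @ W_o                # concatenate heads, then project
```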

Position-wise Feed-Forward Networks

● In addition to attention sub-layers, each of the layers in the encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically.
● This consists of two linear transformations with a ReLU activation in between.

● While the linear transformations are the same across different positions, they use different parameters from layer to layer.
● Another way of describing this is as two convolutions with kernel size 1.
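A minimal sketch of the position-wise feed-forward network, FFN(x) = max(0, xW1 + b1)W2 + b2, applied identically at every position; the dimensions d_model = 512 and d_ff = 2048 are the base-model values from the paper.

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """X: (seq_len, d_model). Applies the same two-layer MLP to every position."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # first linear transformation + ReLU
    return hidden @ W2 + b2                 # second linear transformation

# Shapes as in the base model: d_model = 512, d_ff = 2048
d_model, d_ff, seq_len = 512, 2048, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))
out = position_wise_ffn(X,
                        rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
                        rng.normal(size=(d_ff, d_model)), np.zeros(d_model))
print(out.shape)   # (10, 512)
```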

Layer Normalization

Layer Normalization is applied to the output of each sub-layer:

LayerNorm(x + Sublayer(x)), where Sublayer(x) is the self-attention or feed-forward layer.

Layer Normalization

● The key limitation of batch normalization is that it is dependent on the mini-batch.

● In an RNN, the recurrent activations at each time-step will have different statistics. This means that we have to fit a separate batch normalization layer for each time-step. This makes the model more complicated and, more importantly, it forces us to store the statistics for each time-step during training.

Layer Normalization

● In layer normalization, the statistics are computed across each feature and are independent of other examples.

● The independence between inputs means that each input has a different normalization operation, allowing arbitrary mini-batch sizes to be used.

● The experimental results show that layer normalization performs well for recurrent neural networks.

● As there are multiple heads, layer normalization is applied to each head individually.
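A minimal sketch of layer normalization and the LayerNorm(x + Sublayer(x)) wrapper described above; statistics are computed per position across the features, so no batch statistics are needed. The learned gain and bias parameters are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each row (one position's features) by its own mean and standard deviation."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    """LayerNorm(x + Sublayer(x)), as used around every attention/FFN sub-layer."""
    return layer_norm(x + sublayer(x))
```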

Full Stack

Google’s Visualization

Complexity

Will the Transformer work efficiently for large sequences (images, video, audio)?

Complexity

● To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size r in the input sequence centered around the respective output position. This would increase the maximum path length to O(n/r).

● This is similar to a convolutional layer: a single convolutional layer with kernel width k < n does not connect all pairs of input and output positions. Doing so requires a stack of O(n/k) convolutional layers in the case of contiguous kernels.

Complexity

Experiments

Dataset:

● WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. ● WMT 2014 English-French dataset consisting of 36M sentences.

Hardware:

● 8 NVIDIA P100 GPUs.

Optimizer

Adam optimizer with β1 = 0.9, β2 = 0.98 and ε = 10^-9.

The learning rate is varied over the course of training: it increases linearly for the first warmup_steps (4000) training steps, and decreases thereafter proportionally to the inverse square root of the step number.
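A sketch of this schedule using the formula from the paper, lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)).

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup for warmup_steps, then decay with the inverse square root of the step."""
    step = max(step, 1)   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 4000, 40000):
    print(s, transformer_lr(s))
```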

Regularization

● Dropout is applied to the output of each sub-layer, before it is added to the sub-layer input and normalized.

● Additional dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks.

● Label smoothing (of value 0.1) was also employed.

Experimental Results

Variations

While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.

Reducing the attention key size dk (B) hurts model quality. This suggests that determining compatibility is not easy, and that a more sophisticated compatibility function than the dot product may be beneficial.

Seems easy, right?

Don’t play hero

TensorFlow – https://github.com/tensorflow/tensor2tensor
MXNet – https://github.com/awslabs/sockeye
PyTorch – https://github.com/jadore801120/attention-is-all-you-need-pytorch

● ADAM optimizer with learning rate proportional to step^(-0.5)
● Dropout during training at every layer, just before adding the residual
● Label smoothing
● Auto-regressive decoding with beam search and length penalties
● Checkpoint averaging

Resources:
http://nlp.seas.harvard.edu/2018/04/03/attention.html
https://mchromiak.github.io/articles/2017/Sep/12/Transformer-Attention-is-all-you-need/

Can the Transformer model be generalized to other tasks?

English Constituency Parsing

● The Wall Street Journal (WSJ) portion of the Penn Treebank, about 40K training sentences.

● Trained it in a semi-supervised setting, using the larger high-confidence and BerkeleyParser corpora with approximately 17M sentences.

● A vocabulary of 16K tokens for the WSJ only setting and a vocabulary of 32K tokens for the semi-supervised setting.

● A 4-layer Transformer with dmodel = 1024

Result

Outside NMT

Can you think of some tasks where it fails?

Some limitations

1. Images – image tasks still favour CNN models
2. Copying
3. Question Answering
4. Unable to generalize to unseen sequence lengths

Thanks!

Augmented RNNs
Manjot Singh Bilkhu

How can we improve vanilla RNNs?

Three major active areas of research

● Attention
● Adaptive Computation Time
● External Memory

Adaptive Computation Time

● The number of steps of computation an RNN gets for a given problem is typically determined by the data (sequence length) and the hyperparameters (network depth, padding in sequence, etc)

● We would prefer the net to decide how long to ‘ponder’ each input before it outputs an answer

● Clearly useful for algorithmic / planning-type problems with highly variable complexity (e.g. program induction, pathfinding…)

● Can also be more efficient for conventional tasks such as machine translation and language modelling

Slide credits: Alex Graves (NIPS 2016)

ACT in ResNets

Spatially Adaptive Computation Time for Residual Networks, Michael Figurnov et al., CVPR 2017

Slide credits: Alex Graves (NIPS 2016)

Vanilla Transformers

● The self-attentive Transformer model is essentially feed-forward
● It reports SoTA results on tasks like machine translation, constituency parsing, etc.
● However, it fails to generalize on extremely simple tasks

ACT in Universal Transformers

● Uses recurrence to refine self-attention representations
● While an RNN processes a sequence symbol-by-symbol (left to right), the Universal Transformer processes all symbols at the same time (like the Transformer), but then refines its interpretation of every symbol in parallel over a variable number of recurrent processing steps using self-attention
● This parallel-in-time recurrence mechanism is both faster than the serial recurrence used in RNNs, and also makes the Universal Transformer more powerful than the standard feedforward Transformer

GIF credits: Google AI blog
Model source: https://twitter.com/OriolVinyalsML/status/1017523208059260929

Results

bAbI question answering

RNNs with Memory

Neural Turing Machines

Basic Idea: Turn Neural Nets into Differentiable Neural Computers by giving them read/write access to external memory

The network learns:
• What to write to memory
• When to write to memory
• When to stop writing
• Which memory cell to read from
• How to convert the result of a read into the final output

Neural Turing Machines

Reading

● Mt is the N x M matrix of memory contents at time t
● We read from all memory locations to some extent, governed by the attention distribution
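A one-line sketch of the read operation, assuming the standard NTM formulation: the read vector is the attention-weighted sum of the memory rows, r_t = Σ_i w_t(i) M_t(i).

```python
import numpy as np

def ntm_read(memory, weights):
    """memory: (N, M) matrix; weights: (N,) attention distribution summing to 1."""
    return weights @ memory   # r_t = sum_i w_t(i) * M_t(i), an (M,) read vector
```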

Writing

● Erase

● Update
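A minimal NumPy sketch of the write step, assuming the standard NTM update: erase with M̃_t(i) = M_{t-1}(i)(1 − w_t(i) e_t), then add with M_t(i) = M̃_t(i) + w_t(i) a_t. Plugging in the numbers from the worked example on the next few slides reproduces the "Blurry Writes" matrix shown there.

```python
import numpy as np

def ntm_write(memory, weights, erase, add):
    """memory: (N, M); weights: (N,); erase, add: (M,). Returns the updated memory."""
    erased = memory * (1.0 - np.outer(weights, erase))   # erase: scale each location's components
    return erased + np.outer(weights, add)               # add: blurry write of the add vector

# Worked example from the following slides.
# The slides show the memory with locations as columns; transpose to the (N, M) convention.
M0_slide = np.array([[ 5,  7,  9,  2, 12],
                     [11,  6,  3,  1,  2],
                     [ 3,  7,  3, 10,  6],
                     [ 4,  2,  5,  9,  9],
                     [ 3,  5, 12,  8,  4]], dtype=float)
w1 = np.array([0.1, 0.2, 0.5, 0.1, 0.1])
e1 = np.array([1.0, 0.7, 0.2, 0.5, 0.0])
a1 = np.array([3, 4, -2, 0, 2], dtype=float)

M1 = ntm_write(M0_slide.T, w1, e1, a1)
print(M1.T)   # printed in the slide's layout; matches the "Blurry Writes" matrix below
```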

Neural Turing Machines: Erase

Write weights over the N = 5 memory locations (1 x N):
w1(0) = 0.1, w1(1) = 0.2, w1(2) = 0.5, w1(3) = 0.1, w1(4) = 0.1

Erase vector e1 (M x 1): (1.0, 0.7, 0.2, 0.5, 0.0)

Memory M0 (columns correspond to the N locations, rows to the M vector components):
 5      7     9     2     12
11      6     3     1      2
 3      7     3    10      6
 4      2     5     9      9
 3      5    12     8      4

Neural Turing Machines, Graves et al., arXiv:1410.5401

Neural Turing Machines: Erase

w1(0) = 0.1, w1(1) = 0.2, w1(2) = 0.5, w1(3) = 0.1, w1(4) = 0.1

Memory after the erase, M̃1(i) = M0(i) ⊙ (1 − w1(i) e1):
 4.5    5.6   4.5   1.8   10.8
10.23   5.16  1.95  0.93   1.86
 2.94   6.72  2.7   9.8    5.88
 3.8    1.8   3.75  8.55   8.55
 3      5    12     8      4

Neural Turing Machines, Graves et al., arXiv:1410.5401

Neural Turing Machines: Addition

w1(0) = 0.1, w1(1) = 0.2, w1(2) = 0.5, w1(3) = 0.1, w1(4) = 0.1

Add vector a1 (M x 1): (3, 4, -2, 0, 2)

Erased memory (carried over from the previous slide):
 4.5    5.6   4.5   1.8   10.8
10.23   5.16  1.95  0.93   1.86
 2.94   6.72  2.7   9.8    5.88
 3.8    1.8   3.75  8.55   8.55
 3      5    12     8      4

Neural Turing Machines, Graves et al., arXiv:1410.5401

Neural Turing Machines: Blurry Writes

Updated memory M1(i) = M̃1(i) + w1(i) a1:
 4.8    6.2   6     2.1   11.1
10.63   5.96  3.95  1.33   2.26
 2.74   6.32  1.7   9.6    5.68
 3.8    1.8   3.75  8.55   8.55
 3.2    5.4  13     8.2    4.2

Neural Turing Machines, Graves et al., arXiv:1410.5401

Neural Turing Machines

Addressing Attention

Neural Turing Machines: ‘Copy’ Learning Curve

Trained on 8-bit sequences, 1 ≤ sequence length ≤ 20

Neural Turing Machines, Graves et al., arXiv:1410.5401

Neural Turing Machines: ‘Copy’ Performance

LSTMs

NTM with LSTMs

Neural Turing Machines, Graves et al., arXiv:1410.5401

Repeat Copy Performance

Associative Recall

• The third experiment was to determine whether NTMs can learn indirection, i.e. one data item pointing to another.
• The authors fed in a list of items and then queried one of the items in the list, with the expectation that the next item in the list be returned.
• As the authors point out, the fact that the feedforward-controller NTM outperforms the LSTM-controller NTM suggests that the NTM's memory is a better data storage system than the LSTM's internal state.

Priority Sort

Thank You for your ATTENTION!

Neural Turing Machines: Experiments

Task             Network Size             Number of Parameters
                 NTM w/ LSTM*   LSTM      NTM w/ LSTM   LSTM
Copy             3 x 100        3 x 256   67K           1.3M
Repeat Copy      3 x 100        3 x 512   66K           5.3M
Associative      3 x 100        3 x 256   70K           1.3M
N-grams          3 x 100        3 x 128   61K           330K
Priority Sort    2 x 100        3 x 128   269K          385K

Neural Turing Machines, Graves et al., arXiv:1410.5401

Addressing

Focusing by Content

Each head produces a key vector kt of length M

A content-based weight wᶜ is generated based on the similarity between kt and each memory row (cosine similarity in the NTM paper)
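A minimal sketch of content-based addressing, assuming the NTM formulation: cosine similarity between the key and each memory row, scaled by a key-strength parameter β and passed through a softmax over locations.

```python
import numpy as np

def content_addressing(memory, key, beta=1.0):
    """memory: (N, M), key: (M,). Returns an (N,) weighting over memory locations."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    scores = beta * sims                   # key strength sharpens or flattens the focus
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()         # softmax over locations
```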

Interpolation

Each head emits a scalar interpolation gate gt

Convolutional Shifts

Each head emits a distribution over allowable integer shifts st

Sharpening

Each head emits a scalar sharpening parameter γt
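Putting the addressing steps together, a minimal sketch assuming the standard NTM formulation: interpolate the content weighting with the previous weighting using the gate g, apply the circular convolutional shift, then sharpen with γ. Parameter names mirror the slide's symbols.

```python
import numpy as np

def address(w_content, w_prev, g, shift, gamma):
    """w_content, w_prev: (N,) weightings; g: gate in [0,1]; shift: (N,) distribution; gamma >= 1."""
    w_g = g * w_content + (1.0 - g) * w_prev                  # interpolation with the previous focus
    N = len(w_g)
    w_s = np.array([sum(w_g[j] * shift[(i - j) % N] for j in range(N))
                    for i in range(N)])                       # circular convolutional shift
    w_sharp = w_s ** gamma                                    # sharpening
    return w_sharp / w_sharp.sum()
```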

Putting it all together: Controller Design

• Feed-forward: faster, with more transparency & interpretability about the function learnt

• LSTM: more expressive power; doesn't limit the number of computations per time step

Both are end-to-end differentiable!

1. Reading/Writing -> convex sums
2. wt generation -> smooth
3. Controller networks

Neural Turing Machines, Graves et al., arXiv:1410.5401