2 OUTLINE

1 Machine learning with neural networks ...... 3

2 Training neural networks ...... 20

3 Popularity and applications ...... 37

3 SUPERVISED MACHINE LEARNING CLASSIFICATION SETUP

Input features $x^{(i)} \in \mathbb{R}^n$

Outputs $y^{(i)} \in \mathcal{Y}$ (e.g. $\mathbb{R}$, $\{-1, +1\}$, $\{1, \ldots, p\}$)

Model parameters $\theta \in \mathbb{R}^k$

Hypothesis function $h_\theta : \mathbb{R}^n \to \mathbb{R}$

Loss function $\ell : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_+$

Machine learning optimization problem

$$\underset{\theta}{\text{minimize}} \;\; \sum_{i=1}^{m} \ell\big(h_\theta(x^{(i)}), y^{(i)}\big)$$

4 LINEAR CLASSIFIERS

Linear hypothesis class: $h_\theta(x^{(i)}) = \theta^T \phi(x^{(i)})$, where the input can be any set of non-linear features $\phi : \mathbb{R}^n \to \mathbb{R}^k$

The generic function $\phi$ represents some chosen way to generate non-linear features out of the available ones, for instance:

$x^{(i)} = [\text{temperature for day } i]$

$$\phi(x^{(i)}) = \begin{bmatrix} 1 \\ x^{(i)} \\ (x^{(i)})^2 \\ \vdots \end{bmatrix}$$
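As a concrete illustration (not from the slides: the feature map, degree, and names below are illustrative assumptions), a minimal numpy sketch of a linear hypothesis on polynomial features might look like this:

```python
import numpy as np

def phi(x, degree=3):
    """Polynomial feature map: phi(x) = [1, x, x^2, ..., x^degree]."""
    return np.array([x ** d for d in range(degree + 1)])

def h(theta, x):
    """Linear hypothesis on non-linear features: h_theta(x) = theta^T phi(x)."""
    return theta @ phi(x)

# Example: a scalar input (e.g. the temperature for day i) and k = 4 parameters
theta = np.array([0.5, -1.0, 0.2, 0.01])
print(h(theta, 23.0))
```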

5 GENERAL REPRESENTATION OF LINEAR CLASSIFICATION

6 CHALLENGES WITH LINEAR MODELS

Linear models crucially depend on choosing “good” features

Some “standard” choices: polynomial features, radial basis functions, random features (surprisingly effective)

But many specialized domains required highly engineered, special-purpose features

E.g., computer vision tasks used Haar features, SIFT features, etc.; every 10 years or so someone would engineer a new set of features

Key question 1: Should we stick with linear hypothesis functions? What about using non-linear combinations of the inputs? → Feed-forward neural networks

Key question 2: Can we come up with algorithms that will automatically learn the features themselves? → Feed-forward neural networks with multiple (> 2!) hidden layers (Deep Networks)

7 FEATURE LEARNING: USE TWO CLASSIFIERS IN CASCADE

Instead of a simple linear classifier, let's consider a two-stage hypothesis class where one linear function creates the features and another acts as the classifier, taking as input the features created by the first one:

$$h_w(x) = W_2 \phi(x) + b_2 = W_2(W_1 x + b_1) + b_2$$

where $w = \{W_1 \in \mathbb{R}^{k \times n}, b_1 \in \mathbb{R}^k, W_2 \in \mathbb{R}^{1 \times k}, b_2 \in \mathbb{R}\}$. Note that in this notation, we're explicitly separating the parameters on the “constant feature” into the $b$ terms

8 FEATURE LEARNING: USE TWO CLASSIFIERS IN CASCADE

Graphical depiction of the obtained function

[Diagram: inputs $x_1, \ldots, x_n$ → $(W_1, b_1)$ → features $z_1, \ldots, z_k$ → $(W_2, b_2)$ → output $y$]

But there is a problem:

$$h_w(x) = W_2(W_1 x + b_1) + b_2 = \tilde{W} x + \tilde{b} \qquad (1)$$

In other words, we are still just using a normal linear classifier: the apparent added complexity obtained by concatenating multiple linear stages does not give us any additional representational power, and we can still only discriminate linearly separable classes
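A quick numerical check of Eq. (1), under the (illustrative) dimensions chosen below: composing the two affine maps gives back a single affine map with $\tilde{W} = W_2 W_1$ and $\tilde{b} = W_2 b_1 + b_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 3                                   # input and feature dimensions (arbitrary)
W1, b1 = rng.normal(size=(k, n)), rng.normal(size=k)
W2, b2 = rng.normal(size=(1, k)), rng.normal(size=1)
x = rng.normal(size=n)

h_cascade = W2 @ (W1 @ x + b1) + b2           # two linear stages in cascade

W_tilde, b_tilde = W2 @ W1, W2 @ b1 + b2      # the equivalent single linear classifier
h_linear = W_tilde @ x + b_tilde

print(np.allclose(h_cascade, h_linear))       # True: no added representational power
```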

9 ARTIFICIAL NEURAL NETWORKS (ANN)

Neural networks provide a way to obtain complexity by:

1 Using non-linear transformations of the inputs

2 Propagating the information among layers of processing units to realize multi-staged computation

3 In deep networks, the number of stages is relatively large, allowing the network to automatically learn hierarchical representations of the data features

[Diagram: a deep network with layers $z_1 = x$, $z_2$, $z_3$, $z_4$, $z_5 = h_\theta(x)$ connected by weights $(W_1, b_1), \ldots, (W_4, b_4)$]

10 TYPICAL NON-LINEAR ACTIVATION FUNCTIONS

Using non-linear activation functions at each node, the two-layer network of the previous example becomes equivalent to having the following hypothesis function:

$$h_\theta(x) = f_2(W_2 f_1(W_1 x + b_1) + b_2)$$

where f1, f2 : R → R are some non-linear functions (applied elementwise to vectors)

Common choices for $f_i$ are the hyperbolic tangent $\tanh(x) = (e^{2x} - 1)/(e^{2x} + 1)$, the sigmoid/logistic $\sigma(x) = 1/(1 + e^{-x})$, or the rectified linear unit $f(x) = \max\{0, x\}$

[Plots of the tanh, sigmoid, and ReLU activation functions over the interval $[-4, 4]$]
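The three activations above are one-liners in numpy; a minimal sketch (the function names are the obvious ones, not prescribed by the slides):

```python
import numpy as np

def tanh(x):
    return (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)   # equivalent to np.tanh(x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

z = np.linspace(-4, 4, 9)
print(tanh(z), sigmoid(z), relu(z), sep="\n")
```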

11 HIDDEN LAYERS AND LEARNED FEATURES

We draw these the same as before (non-linear functions are virtually always implied in the neural network setting)

[Diagram: inputs $x_1, \ldots, x_n$ → $(W_1, b_1)$ → hidden layer $z_1, \ldots, z_k$ → $(W_2, b_2)$ → output $y$]

The middle layer $z$ is referred to as the hidden layer or the activations. These are the learned features: nothing in the data prescribes what values they should take, so it is left up to the algorithm to decide. To have meaningful feature learning we need multiple hidden layers in cascade (→ Deep Networks)

12 TYPES OF NETWORKS

13 PROPERTIES OF NEURAL NETWORKS

It turns out that a neural network with a single hidden layer (and a suitably large number of hidden units) is a universal function approximator: it can approximate any function over the input arguments (but this is actually not very useful in practice, cf. polynomials fitting any set of points for a high enough degree)

The hypothesis class hθ is not a convex function of the parameters θ = {Wi, bi}

The number of parameters (weights and biases), layers (depth), topology (connectivity), activation functions, all affect the performance and capacity of the network

14 DEEP NETWORKS

“Deep” neural networks refer to networks with multiple hidden layers

[Diagram: a deep network with layers $z_1 = x$, $z_2$, $z_3$, $z_4$, $z_5 = h_\theta(x)$ connected by weights $(W_1, b_1), \ldots, (W_4, b_4)$]

Mathematically, a k-layer network has the hypothesis function

$$z_{i+1} = f_i(W_i z_i + b_i), \quad i = 1, \ldots, k-1, \qquad z_1 = x$$

$h_w(x) = z_k$, where the $z_i$ terms now indicate vectors of output features
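A minimal sketch of this k-layer hypothesis as a forward-pass loop (layer sizes, activation choice, and initialization below are illustrative assumptions, not part of the slides):

```python
import numpy as np

def forward(x, weights, biases, f=np.tanh):
    """z_{i+1} = f_i(W_i z_i + b_i), with z_1 = x; returns h_w(x) = z_k."""
    z = x
    for W, b in zip(weights, biases):
        z = f(W @ z + b)
    return z

# Example: a 4-layer network (3 weight matrices), layer sizes 5 -> 4 -> 3 -> 1
rng = np.random.default_rng(0)
sizes = [5, 4, 3, 1]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]

print(forward(rng.normal(size=5), weights, biases))
```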

15 WHY USE DEEP NETWORKS?

A deep architecture trades space for time (or breadth for depth): more layers (more sequential computation), but less hardware (less parallel computation).

Many functions can be represented more compactly using deep networks than one-hidden-layer networks (e.g. the parity function would require $O(2^n)$ hidden units in a 3-layer network, but only $O(n)$ units in an $O(\log n)$-layer network)

Motivation from neurobiology: brain appears to use multiple levels of interconnected neurons to process information (but careful, neurons in brain are not just non-linear functions)

In practice: works better for many domains

Allow for automatic hierarchical feature extraction from the data

16 HIERARCHICAL FEATURE REPRESENTATION

17 EXAMPLES OF HIERARCHICAL FEATURE REPRESENTATION

18 EFFECT OF INCREASING NUMBER OF HIDDEN LAYERS

Speech recognition task

19 OUTLINE

1 Machine learning with neural networks ...... 3

2 Training neural networks ...... 20

3 Popularity and applications ...... 37

20 OPTIMIZING NEURAL NETWORK PARAMETERS

How do we optimize the parameters for the machine learning loss minimization problem

$$\underset{\theta}{\text{minimize}} \;\; \sum_{i=1}^{m} \ell\big(h_\theta(x^{(i)}), y^{(i)}\big)$$

now that, with a neural network as the hypothesis, the problem is non-convex?

Just do exactly what we did before: initialize with random weights and run stochastic gradient descent

We now have the possibility of local optima, and the function can be harder to optimize, but we won't worry about all that because the resulting models still often perform better than linear models

21 STOCHASTIC GRADIENT DESCENT FOR NEURAL NETWORKS

Recall that stochastic gradient descent computes gradients with respect to loss on each example, updating parameters as it goes

function SGD({(x^(i), y^(i))}, h_θ, ℓ, α):
    Initialize: W_j, b_j ← Random, j = 1, ..., k−1
    Repeat until convergence:
        For i = 1, ..., m:
            Compute ∇_{W_j, b_j} ℓ(h_θ(x^(i)), y^(i)), j = 1, ..., k−1
            Take gradient steps in all directions:
                W_j ← W_j − α ∇_{W_j} ℓ(h_θ(x^(i)), y^(i)), j = 1, ..., k−1
                b_j ← b_j − α ∇_{b_j} ℓ(h_θ(x^(i)), y^(i)), j = 1, ..., k−1
    return {W_j, b_j}

How do we compute the gradients ∇_{W_j, b_j} ℓ(h_θ(x^(i)), y^(i))?
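The SGD loop itself is straightforward; a minimal Python sketch, assuming a gradient routine is available (the name `grad_loss` is hypothetical here, standing for e.g. back-propagation as discussed next):

```python
def sgd(data, weights, biases, grad_loss, alpha=0.01, epochs=10):
    """Plain stochastic gradient descent over examples (x, y).

    grad_loss(x, y, weights, biases) is assumed to return (dWs, dbs):
    per-layer gradients of the loss, e.g. computed by back-propagation.
    """
    for _ in range(epochs):                  # "repeat until convergence"
        for x, y in data:                    # one example at a time
            dWs, dbs = grad_loss(x, y, weights, biases)
            for j in range(len(weights)):    # gradient step on every layer
                weights[j] -= alpha * dWs[j]
                biases[j] -= alpha * dbs[j]
    return weights, biases
```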

22 BACK-PROPAGATION

Back-propagation is a method for computing all the necessary gradients using one forward pass (just computing all the activation values at layers), and one backward pass (computing gradients backwards in the network)

The equations sometimes look complex, but it’s just an application of the chain rule of calculus and the use of Jacobians

23 JACOBIANS AND CHAIN RULE

For a multivariate, vector-valued function $f : \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is an $m \times n$ matrix

$$\frac{\partial f(x)}{\partial x} = \begin{bmatrix} \frac{\partial f_1(x)}{\partial x_1} & \frac{\partial f_1(x)}{\partial x_2} & \cdots & \frac{\partial f_1(x)}{\partial x_n} \\ \frac{\partial f_2(x)}{\partial x_1} & \frac{\partial f_2(x)}{\partial x_2} & \cdots & \frac{\partial f_2(x)}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m(x)}{\partial x_1} & \frac{\partial f_m(x)}{\partial x_2} & \cdots & \frac{\partial f_m(x)}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n}$$

For a scalar-valued function $f : \mathbb{R}^n \to \mathbb{R}$, the Jacobian is the transpose of the gradient: $\frac{\partial f(x)}{\partial x} = \nabla_x f(x)^T$

For a vector-valued function, row $i$ of the Jacobian corresponds to the gradient of the component $f_i$ of the output vector, $i = 1, \ldots, m$. It tells how the variation of each input variable affects the variation of that output component

Column j of the Jacobian is the impact of the variation of the j-th input variable, j = 1, . . . , n, on each one of the m components of the output

Chain rule for the differentiation of a composite function:

$$\frac{\partial f(g(x))}{\partial x} = \frac{\partial f(g(x))}{\partial g(x)} \cdot \frac{\partial g(x)}{\partial x}$$
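A small numerical sanity check of the chain rule via finite-difference Jacobians (the functions f and g below are arbitrary examples, not from the slides):

```python
import numpy as np

def g(x):                                     # g : R^3 -> R^2
    return np.array([x[0] * x[1], np.sin(x[2])])

def f(u):                                     # f : R^2 -> R^2
    return np.array([u[0] + u[1] ** 2, 3.0 * u[0]])

def jacobian(fun, x, eps=1e-6):
    """Finite-difference Jacobian: column j approximates d fun / d x_j."""
    fx = fun(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (fun(x + dx) - fx) / eps
    return J

x = np.array([0.5, -1.2, 0.3])
lhs = jacobian(lambda v: f(g(v)), x)          # d f(g(x)) / d x
rhs = jacobian(f, g(x)) @ jacobian(g, x)      # (d f / d g) (d g / d x)
print(np.allclose(lhs, rhs, atol=1e-4))       # True, up to finite-difference error
```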

24 MULTI-LAYER / MULTI-MODULE FEED-FORWARD ARCHITECTURE

Multi-layered feed-forward architecture (cascade): module/layer $i$ gets as input the feature vector $x_{i-1}$ output from module $i-1$, applies the transformation function $F_i(x_{i-1}, W_i)$, and produces the output $x_i$

Each layer $i$ contains $(N_x)_i$ parallel nodes $(x_i)_j$, $j = 1, \ldots, (N_x)_i$. Each node gets its input from the previous layer's nodes through $(N_x)_{i-1} \equiv (N_W)_i$ connections with weights $(W_i)_j$, $j = 1, \ldots, (N_W)_i$

Each layer learns its own weight vector Wi.

$F_i$ is a vector function. At each node $j$ of layer $i$, the transformation function is $(F_i)_j = f\big((W_i)_j^T \cdot (x_{i-1})_j\big)$, where $f$ is the activation function (e.g., a sigmoid), which can safely be considered to be the same for all nodes.

In the following the notation is made simpler by dropping the second indices and reasoning at the aggregate vector level of each layer.

25 FORWARD PROPAGATION

Forward Propagation: Following the presentation of the training input x0, the output vectors xi resulting from the activation function Fi at all layers i = 1, . . . , n, are computed in sequence, starting from x1, and are stored

The output of the network, the loss $\ell$ (to be minimized), results from the forward propagation at the output layer, computing the deviation of the network output with respect to the target $Y$:

$$\ell(Y, x, W) = C(x_n, Y)$$

26 COMPUTING GRADIENTS

At each iteration of SGD, the gradients with respect to all the parameters of the system need to be computed (i.e., the weights $W_i$, which could be split into weights for the inputs and weights for the bias, but hereafter we just use the general form $W_i$)

After the Forward pass, let’s start setting up the relations for the Backward pass

Let's consider the generic layer $i$: from the Forward propagation, its output value is available and is $x_i = F_i(x_{i-1}, W_i)$

In addition, let’s assume that we already know

$\frac{\partial \ell}{\partial x_i}$, that is, we know for each component of the vector $x_i$ the variation of $\ell$ in relation to a variation of $x_i$. We can assume that we know $\frac{\partial \ell}{\partial x_i}$ since we will proceed backward

27 COMPUTING GRADIENTS

Since we assume $\frac{\partial \ell}{\partial x_i}$ as known and we have computed $x_i = F_i(x_{i-1}, W_i)$, we can use the chain rule to compute $\frac{\partial \ell}{\partial W_i}$, which is the quantity of interest, and which tells us the variation in $\ell$ in response to a variation in the weights $W_i$:

$$\frac{\partial \ell}{\partial W_i} = \frac{\partial \ell}{\partial x_i} \cdot \frac{\partial F_i(x_{i-1}, W_i)}{\partial W_i}$$

where $x_i$ is a substitute for $F_i(x_{i-1}, W_i)$. Dimensionally, the previous equation is as follows:

$$[1 \times N_W] = [1 \times N_x] \cdot [N_x \times N_W]$$

$\frac{\partial F_i(x_{i-1}, W_i)}{\partial W_i}$ is the Jacobian matrix of $F_i$ with respect to $W_i$. The element $(k, l)$ of the Jacobian quantifies the variation in the $k$-th output when a variation is exerted on the $l$-th weight:

$$\left[\frac{\partial F_i(x_{i-1}, W_i)}{\partial W_i}\right]_{kl} = \frac{\partial \big[F_i(x_{i-1}, W_i)\big]_k}{\partial \big[W_i\big]_l}$$

28 COMPUTING GRADIENTS

Let's keep assuming that we know $\frac{\partial \ell}{\partial x_i}$, and let's use it this time to compute $\frac{\partial \ell}{\partial x_{i-1}}$. Applying the chain rule:

$$\frac{\partial \ell}{\partial x_{i-1}} = \frac{\partial \ell}{\partial x_i} \cdot \frac{\partial F_i(x_{i-1}, W_i)}{\partial x_{i-1}}$$

Dimensionally, the previous equation is as follows:

$$[1 \times N_x] = [1 \times N_x] \cdot [N_x \times N_x]$$

$\frac{\partial F_i(x_{i-1}, W_i)}{\partial x_{i-1}}$ is the Jacobian matrix of $F_i$ with respect to $x_{i-1}$. The element $(k, l)$ of the Jacobian quantifies the variation in the $k$-th output when a variation is exerted on the $l$-th input

The equation above is a recurrence equation!

29 BACK-PROPAGATION (BP)

To sequentially compute all the gradients needed by SGD, a backward sweep is applied, called the back-propagation algorithm, which precisely makes use of the recurrence equation for $\frac{\partial \ell}{\partial x_i}$:

1 $\frac{\partial \ell}{\partial x_n} = \frac{\partial C(x_n, Y)}{\partial x_n}$

2 $\frac{\partial \ell}{\partial x_{n-1}} = \frac{\partial \ell}{\partial x_n} \cdot \frac{\partial F_n(x_{n-1}, W_n)}{\partial x_{n-1}}$

3 $\frac{\partial \ell}{\partial W_n} = \frac{\partial \ell}{\partial x_n} \cdot \frac{\partial F_n(x_{n-1}, W_n)}{\partial W_n}$

4 $\frac{\partial \ell}{\partial x_{n-2}} = \frac{\partial \ell}{\partial x_{n-1}} \cdot \frac{\partial F_{n-1}(x_{n-2}, W_{n-1})}{\partial x_{n-2}}$

5 $\frac{\partial \ell}{\partial W_{n-1}} = \frac{\partial \ell}{\partial x_{n-1}} \cdot \frac{\partial F_{n-1}(x_{n-2}, W_{n-1})}{\partial W_{n-1}}$

6 . . . until we reach the first, input layer

7 → all the gradients $\frac{\partial \ell}{\partial W_i}$, $\forall i = 1, \ldots, n$, have been computed!

30 ACTIVATION FUNCTIONS EXAMPLES

Remember that the transfer function at layer $i$ is $F_i(x_{i-1}, W_i) = f\big(W_i^T \cdot x_{i-1}\big)$, and for the $j$-th neuron in layer $i$,

$$(F_i)_j = f\big((W_i)_j^T \cdot (x_{i-1})_j\big) = f\Big(\sum_{l=1}^{(N_x)_{i-1}} (W_i)_{jl} \, (x_{i-1})_l\Big)$$

where $f(\cdot)$ is the activation function

Linear activation function: $f(z) = A z + B$

$(F_i)_j = A \cdot \big((W_i)_j^T \cdot (x_{i-1})_j\big) + B$, used in the Forward pass

$\frac{\partial F_i(x_{i-1}, W_i)}{\partial x_{i-1}} = A W_i$, used in the Backward pass

Hyperbolic tangent activation function: $f(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$

$(F_i)_j = \tanh\big((W_i)_j^T \cdot (x_{i-1})_j\big)$, used in Fw pass

$f'(z) = 1 - \tanh^2(z) \;\rightarrow\; \frac{\partial F_i(x_{i-1}, W_i)}{\partial x_{i-1}} = 1 - \tanh^2\big(W_i^T \cdot x_{i-1}\big)$, used in Bw pass

Logistic / sigmoid activation function: $f(z) = \frac{1}{1 + e^{-z}}$

$(F_i)_j = \frac{1}{1 + e^{-(W_i)_j^T \cdot (x_{i-1})_j}}$, used in Fw pass

$f'(z) = f(z)(1 - f(z)) \;\rightarrow\; \frac{\partial F_i(x_{i-1}, W_i)}{\partial x_{i-1}} = f\big(W_i^T \cdot x_{i-1}\big) \cdot \big(1 - f(W_i^T \cdot x_{i-1})\big)$, used in Bw pass
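Putting the BP recurrences of slides 29–30 together, here is a minimal numpy sketch for a feed-forward network with sigmoid activations and squared-error cost $C(x_n, Y) = \frac{1}{2}\|x_n - Y\|^2$ (the architecture and cost are illustrative assumptions, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x0, Ws, bs):
    """Forward pass: store every layer output x_i = F_i(x_{i-1}, W_i)."""
    xs = [x0]
    for W, b in zip(Ws, bs):
        xs.append(sigmoid(W @ xs[-1] + b))
    return xs

def backprop(x0, Y, Ws, bs):
    """Backward pass: returns dl/dW_i and dl/db_i for every layer."""
    xs = forward(x0, Ws, bs)
    dl_dx = xs[-1] - Y                         # dl/dx_n for C(x_n, Y) = 0.5 ||x_n - Y||^2
    dWs, dbs = [], []
    for i in reversed(range(len(Ws))):
        fprime = xs[i + 1] * (1 - xs[i + 1])   # sigmoid'(z) = f(z)(1 - f(z))
        delta = dl_dx * fprime                 # dl/d(pre-activation) at layer i
        dWs.insert(0, np.outer(delta, xs[i]))  # dl/dW_i
        dbs.insert(0, delta)                   # dl/db_i
        dl_dx = Ws[i].T @ delta                # the recurrence: dl/dx_{i-1}
    return dWs, dbs

# Tiny example: a 3 -> 4 -> 2 network
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
bs = [np.zeros(4), np.zeros(2)]
dWs, dbs = backprop(rng.normal(size=3), np.array([0.0, 1.0]), Ws, bs)
print([dW.shape for dW in dWs])                # [(4, 3), (2, 4)]
```

Comparing these gradients against finite differences (as in the chain-rule sketch earlier) is a standard way to validate such an implementation.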

31 NO ARCHITECTURAL CONSTRAINTS

32 AVAILABLE TOOLS

Gradients can still get somewhat tedious to derive by hand, especially for the more complex models that follow

Fortunately, a lot of this work has already been done for you

Tools like Theano (http://deeplearning.net/software/theano/), Torch (http://torch.ch/), TensorFlow (http://www.tensorflow.org/) all let you specify the network structure and then automatically compute all gradients (and use GPUs to do so)

The Autograd package for Python (https://github.com/HIPS/autograd) lets you compute the derivative of (almost) any function written with numpy operations, using automatic back-propagation
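For instance, with Autograd a gradient such as $\nabla_{W_1} \ell$ is obtained without deriving anything by hand; a minimal sketch (the two-layer model and squared-error loss below are illustrative assumptions; only `grad` and `autograd.numpy` are Autograd's actual API):

```python
import autograd.numpy as np          # thinly wrapped numpy
from autograd import grad

def loss(W1, b1, W2, b2, x, y):
    z = np.tanh(np.dot(W1, x) + b1)  # hidden layer
    pred = np.dot(W2, z) + b2        # output layer
    return np.sum((pred - y) ** 2)   # squared-error loss

# Gradients w.r.t. selected arguments, computed by automatic back-propagation
grad_W1 = grad(loss, 0)              # d loss / d W1
grad_b2 = grad(loss, 3)              # d loss / d b2

W1, b1 = np.ones((3, 2)), np.zeros(3)
W2, b2 = np.ones((1, 3)), np.zeros(1)
x, y = np.array([0.5, -0.5]), np.array([1.0])
print(grad_W1(W1, b1, W2, b2, x, y).shape)   # (3, 2)
```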

33 WHAT’S CHANGED SINCE THE 80S?

Most of these algorithms were developed in the 80s or 90s

So why are these just becoming more popular in the last few years?

More data

Faster computers (GPUs)

(Some) better optimization techniques

34 ISSUES

Vanishing gradients: as we add more and more hidden layers, back-propagation becomes less and less useful in passing information to the lower layers. In effect, as information is passed back, the gradients begin to vanish and become small relative to the weights of the network. Each gradient assigns “credit” to each neuron i for the (mis)classification of the input sample; however, the credit depends (backward) on the average error associated with the neurons that take the output of i, so that going backward the credit tends to vanish

If the activation function has a gradient that is “mostly” null, as in the case of sigmoids, then, again, the gradient corrections become very small

Overfitting!

How many layers/nodes? (which is related to overfitting . . . )

Non-convexity (only local minima can be reached), computation time, . . .

Time complexity of one BP iteration, which provides the input for one iteration of SGD, is $O(|W|^3)$. No general guarantees on convergence time

35 IDEAS FOR OVERCOMING ISSUES

Hidden layers of autoencoders and RBMs act as effective feature detectors; these structures can be stacked to form deep networks. These networks can be trained greedily, one layer at a time, to help overcome the vanishing gradient and overfitting problems.

Unsupervised pre-training (Hinton et al., 2006): “Pre-train” the network by having the hidden layers recreate their input, one layer at a time, in an unsupervised fashion. This paper was partly responsible for re-igniting the interest in deep neural networks, but the general feeling now is that it doesn't help much

Dropout (Hinton et al., 2012): During training and computation of gradients, randomly set about half the hidden units to zero (a different randomly selected set for each stochastic gradient step). Acts like regularization, preventing the parameters from overfitting to particular examples (a minimal sketch follows this list)

Different non-linear functions (Nair and Hinton, 2010): Use non-linearity f(x) = max{0, x} instead of f(x) = tanh(x)
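A minimal sketch of the dropout idea from the bullet above, applied to one layer's activations during training (the drop probability and mask handling are common conventions, not prescribed by the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5):
    """Randomly zero a fraction p_drop of the hidden activations h.

    A fresh mask is drawn at every stochastic gradient step; at test time
    the layer is used without the mask (commonly with rescaled activations).
    """
    mask = rng.random(h.shape) >= p_drop
    return h * mask

hidden = np.tanh(rng.normal(size=10))   # some hidden-layer activations
print(dropout(hidden))                  # about half the entries are zeroed
```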

36 OUTLINE

1 Machine learning with neural networks ...... 3

2 Training neural networks ...... 20

3 Popularity and applications ...... 37

37 #neural network / #machine learning

[Plot, 1980–2015: Google Scholar counts of papers containing “neural network” divided by the count of papers containing “machine learning”]

38 Facebook launches AI research center, Google buys DeepMind

“AlexNet” deep neural network wins ImageNet 2012 contest

Popularization of backprop for training neural networks

Academic papers on unsupervised pre-training for deep networks

A non-exhaustive list of some of the important events that impacted this trend

39 “AlexNet” (Krizhevsky et al., 2012), winning entry of ImageNet 2012 competition with a Top-5 error rate of 15.3% (next best system with highly engineered features based upon SIFT got 26.1% error)

Figure 2: An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom. The GPUs communicate only at certain layers. The network’s input is 150,528-dimensional, and the number of neurons in the network’s remaining layers is given by 253,440–186,624–64,896–64,896–43,264– 4096–4096–1000.

neurons in a kernel map). The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5 × 5 × 48. The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers. The third convolutional layer has 384 kernels of size 3 × 3 × 256 connected to the (normalized, pooled) outputs of the second convolutional layer. The fourth convolutional layer has 384 kernels of size 3 × 3 × 192, and the fifth convolutional layer has 256 kernels of size 3 × 3 × 192. The fully-connected layers have 4096 neurons each.

4 Reducing Overfitting

Our neural network architecture has 60 million parameters. Although the 1000 classes of ILSVRC make each training example impose 10 bits of constraint on the mapping from image to label, this turns out to be insufficient to learn so many parameters without considerable overfitting. Below, we describe the two primary ways in which we combat overfitting.

4.1 Data Augmentation

The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations (e.g., [25, 4, 5]). We employ two distinct forms of data augmentation, both of which allow transformed images to be produced from the original images with very little computation, so the transformed images do not need to be stored on disk. In our implementation, the transformed images are generated in Python code on the CPU while the GPU is training on the previous batch of images. So these data augmentation schemes are, in effect, computationally free. The first form of data augmentation consists of generating image translations and horizontal reflections. We do this by extracting random 224 × 224 patches (and their horizontal reflections) from the 256 × 256 images and training our network on these extracted patches⁴. This increases the size of our training set by a factor of 2048, though the resulting training examples are, of course, highly interdependent. Without this scheme, our network suffers from substantial overfitting, which would have forced us to use much smaller networks. At test time, the network makes a prediction by extracting five 224 × 224 patches (the four corner patches and the center patch) as well as their horizontal reflections (hence ten patches in all), and averaging the predictions made by the network's softmax layer on the ten patches. The second form of data augmentation consists of altering the intensities of the RGB channels in training images. Specifically, we perform PCA on the set of RGB pixel values throughout the ImageNet training set. To each training image, we add multiples of the found principal components,

⁴This is the reason why the input images in Figure 2 are 224 × 224 × 3-dimensional.

41 Some classification results from AlexNet

Figure 4: (Left) Eight ILSVRC-2010 test images and the five labels considered most probable by our model. The correct label is written under each image, and the probability assigned to the correct label is also shown with a red bar (if it happens to be in the top 5). (Right) Five ILSVRC-2010 test images in the first column. The remaining columns show the six training images that produce feature vectors in the last hidden layer with the smallest Euclidean distance from the feature vector for the test image.

In the left panel of Figure 4 we qualitatively assess what the network has learned by computing its top-5 predictions on eight test images. Notice that even off-center objects, such as the mite in the top-left, can be recognized by the net. Most of the top-5 labels appear reasonable. For example, only other types of cat are considered plausible labels for the leopard. In some cases (grille, cherry) there is genuine ambiguity about the intended focus of the photograph. Another way to probe the network's visual knowledge is to consider the feature activations induced by an image at the last, 4096-dimensional hidden layer. If two images produce feature activation vectors with a small Euclidean separation, we can say that the higher levels of the neural network consider them to be similar. Figure 4 shows five images from the test set and the six images from the training set that are most similar to each of them according to this measure. Notice that at the pixel level, the retrieved training images are generally not close in L2 to the query images in the first column. For example, the retrieved dogs and elephants appear in a variety of poses. We present the results for many more test images in the supplementary material. Computing similarity by using Euclidean distance between two 4096-dimensional, real-valued vectors is inefficient, but it could be made efficient by training an auto-encoder to compress these vectors to short binary codes. This should produce a much better image retrieval method than applying auto-encoders to the raw pixels [14], which does not make use of image labels and hence has a tendency to retrieve images with similar patterns of edges, whether or not they are semantically similar.

7 Discussion

Our results show that a large, deep convolutional neural network is capable of achieving record-breaking results on a highly challenging dataset using purely supervised learning. It is notable that our network's performance degrades if a single convolutional layer is removed. For example, removing any of the middle layers results in a loss of about 2% for the top-1 performance of the network. So the depth really is important for achieving our results. To simplify our experiments, we did not use any unsupervised pre-training even though we expect that it will help, especially if we obtain enough computational power to significantly increase the size of the network without obtaining a corresponding increase in the amount of labeled data. Thus far, our results have improved as we have made our network larger and trained it longer but we still have many orders of magnitude to go in order to match the infero-temporal pathway of the human visual system. Ultimately we would like to use very large and deep convolutional nets on video sequences where the temporal structure provides very helpful information that is missing or far less obvious in static images.

42 Google Deep Dream software: adjust input images (by gradient descent) to strengthen the activations that are present in an image

43 “A Neural Conversational Model” (Vinyals and Le, 2015): a sequence-to-sequence LSTM trained to predict the next sentence given the previous one, on a closed-domain IT helpdesk troubleshooting dataset and on the open-domain OpenSubtitles movie-conversation dataset

[Excerpt from the paper: model description (seq2seq framework, single-layer LSTM with 1024 memory cells trained with SGD and gradient clipping) and sample conversations between the model (“Machine”) and a human, e.g. VPN troubleshooting and browser issues]

44 AlphaGo (Silver et al., 2016) beats Lee Sedol in a 5-game competition