Representations for Language: From Word Embeddings to Sentence Meanings

Christopher Manning, Stanford University
@chrmanning ❀ @stanfordnlp
Simons Institute, 2017

What's special about human language?

• The most important distinctive human characteristic
• The only hope for "explainable" intelligence
• Communication was central to human development and dominance
• Language forms come with meanings
• A social system

What's special about human language?

• Constructed to convey the speaker/writer's meaning
• Not just an environmental signal; a deliberate communication
• Using an encoding which little kids learn (amazingly!) quickly
• A discrete/symbolic/categorical signaling system: "rocket" = [picture of a rocket]; "violin" = [picture of a violin]
  • Very minor exceptions for expressive signaling – "I loooove it"
  • Presumably because of greater signaling reliability
• Symbols are not just an invention of logic / classical AI!

What's special about human language?

Language symbols are encoded as a continuous communication signal in several ways:
• Sound
• Gesture
• Writing (images/trajectories)
The symbol is invariant across different encodings!

(Image credits: CC BY 2.0 David Fulmer 2008; National Library of NZ, no known restrictions)

What's special about human language?

• Traditionally, people have extended the symbolic system of language into the brain: "the language of thought"
• But a brain encoding appears to be a continuous pattern of activation, just like the signal used to transmit language
• Deep learning is exploring a continuous encoding of thought
• CogSci question: whether to assume symbolic representations in the brain, or to directly model using a continuous substrate

Talk outline

1. What's special about human language
2. From symbolic to distributed word representations
3. The BiLSTM (with attention) hegemony
4. Choices for multi-word language representations
5. Using tree-structured models: Sentiment detection

2. From symbolic to distributed word representations

The vast majority of (rule-based and statistical) natural language processing and information retrieval (NLP/IR) work regarded words as atomic symbols: hotel, conference

In vector space terms, this is a vector with one 1 and a lot of zeroes:
  [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
Deep learning people call this a "one-hot" representation. It is a localist representation.

From symbolic to distributed word representations

Its problem, e.g., for web search:
• If a user searches for [Dell notebook battery size], we would like to match documents containing "Dell laptop battery capacity"

But
  size     = [0 0 0 0 0 0 0 0 0 1 0 0 0 0]
  capacity = [0 0 0 0 0 0 1 0 0 0 0 0 0 0]
  size^T capacity = 0

Our query and document vectors are orthogonal. There is no natural notion of similarity in a set of one-hot vectors.
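The orthogonality problem is easy to see numerically. Below is a small illustration (my own toy vectors, not from the talk): one-hot vectors for distinct words always have dot product 0, whereas dense vectors can give related words a high cosine similarity.

```python
# Toy illustration: one-hot vs. dense word representations.
import numpy as np

vocab = ["motel", "hotel", "size", "capacity"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

print(one_hot["size"] @ one_hot["capacity"])   # 0.0 -- orthogonal, no similarity signal

# With (made-up) dense vectors, related words can have high cosine similarity.
dense = {"size": np.array([0.8, 0.1, 0.5]),
         "capacity": np.array([0.7, 0.2, 0.6])}
cos = dense["size"] @ dense["capacity"] / (
    np.linalg.norm(dense["size"]) * np.linalg.norm(dense["capacity"]))
print(round(float(cos), 3))                    # close to 1
```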

Capturing similarity

There are many things you can do to capture similarity:
• Query expansion with synonym dictionaries
• Separately learning word similarities from large corpora

But a word representation that encodes similarity wins:
• Fewer parameters to learn (per word, not per pair)
• More sharing of statistics
• More opportunities for multi-task learning

A solution via distributional similarity-based representations

You can get a lot of value by representing a word by means of its neighbors “You shall know a word by the company it keeps”

(J. R. Firth 1957: 11)
One of the most successful ideas of modern NLP

  …government debt problems turning into banking crises as has happened in…
  …saying that Europe needs unified banking regulation to replace the hodgepodge…

← These context words will represent "banking" →

Basic idea of learning neural network word embeddings (Predict!)

We define a model that predicts between a center word w_t and context words in terms of word vectors, e.g.,

p(context | w_t) = …
which has a loss function, e.g.,

J = 1 − p(w_{−t} | w_t)

We look at many positions t in a big language corpus.
We keep adjusting the vector representations of words to minimize this loss.

Skip-gram prediction
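As a concrete picture of what "looking at many positions t" means, here is a minimal sketch (an assumed illustration, not code from the talk) that reads (center, outside) training pairs off a tiny corpus with a fixed window size.

```python
# Enumerate skip-gram training pairs: every position t contributes
# (center word, outside word) pairs within a fixed window.
corpus = "government debt problems turning into banking crises".split()
window = 2

pairs = []
for t, center in enumerate(corpus):
    for j in range(-window, window + 1):
        if j != 0 and 0 <= t + j < len(corpus):
            pairs.append((center, corpus[t + j]))

print(pairs[:5])
# [('government', 'debt'), ('government', 'problems'),
#  ('debt', 'government'), ('debt', 'problems'), ('debt', 'turning')]
```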

Details of Word2Vec training
For p(w_{t+j} | w_t), the basic Skip-gram formulation chooses the softmax function:

  p(w_O | w_I) = exp(v'_{w_O}^T v_{w_I}) / Σ_{w=1}^{W} exp(v'_w^T v_{w_I})    (2)

where o is the outside (or output) word index, c is the center word index, and v_c and u_o are the "center" and "outside" vectors for word indices c and o.
Softmax using word c to obtain the probability of word o.
Co-occurring words are driven to have similar vectors.

From the paper: v_w and v'_w are the "input" and "output" vector representations of w, and W is the number of words in the vocabulary. This formulation is impractical because the cost of computing ∇ log p(w_O | w_I) is proportional to W, which is often large (10^5 – 10^7 terms).

2.1 Hierarchical Softmax
A computationally efficient approximation of the full softmax is the hierarchical softmax, first introduced for neural network language models by Morin and Bengio [12]. The main advantage is that instead of evaluating W output nodes in the neural network to obtain the probability distribution, only about log2(W) nodes need to be evaluated. The hierarchical softmax uses a binary tree representation of the output layer with the W words as its leaves and, for each node, explicitly represents the relative probabilities of its child nodes. These define a random walk that assigns probabilities to words. More precisely, each word w can be reached by an appropriate path from the root of the tree. Let n(w, j) be the j-th node on the path from the root to w, and let L(w) be the length of this path, so n(w, 1) = root and n(w, L(w)) = w. In addition, for any inner node n, let ch(n) be an arbitrary fixed child of n and let [[x]] be 1 if x is true and −1 otherwise. Then the hierarchical softmax defines p(w_O | w_I) as follows:

  p(w | w_I) = Π_{j=1}^{L(w)−1} σ( [[ n(w, j+1) = ch(n(w, j)) ]] · v'_{n(w,j)}^T v_{w_I} )    (3)

where σ(x) = 1 / (1 + exp(−x)). It can be verified that Σ_{w=1}^{W} p(w | w_I) = 1. This implies that the cost of computing log p(w_O | w_I) and ∇ log p(w_O | w_I) is proportional to L(w_O), which on average is no greater than log W. Also, unlike the standard softmax formulation of the Skip-gram, which assigns two representations v_w and v'_w to each word w, the hierarchical softmax formulation has one representation v_w for each word w and one representation v'_n for every inner node n of the binary tree.

The structure of the tree used by the hierarchical softmax has a considerable effect on performance. Mnih and Hinton explored a number of methods for constructing the tree structure and their effect on both training time and resulting model accuracy [10]. In our work we use a binary Huffman tree, as it assigns short codes to the frequent words, which results in fast training. It has been observed before that grouping words together by their frequency works well as a very simple speedup technique for neural-network-based language models [5, 8].

2.2 Negative Sampling

An alternative to the hierarchical softmax is Noise Contrastive Estimation (NCE), which was introduced by Gutmann and Hyvärinen [4] and applied to language modeling by Mnih and Teh [11]. NCE posits that a good model should be able to differentiate data from noise by means of logistic regression. This is similar to the hinge loss used by Collobert and Weston [2], who trained models by ranking the data above noise.

While NCE can be shown to approximately maximize the log probability of the softmax, the Skip-gram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality. We define Negative sampling (NEG) by the objective

  log σ(v'_{w_O}^T v_{w_I}) + Σ_{i=1}^{k} E_{w_i ∼ P_n(w)} [ log σ(−v'_{w_i}^T v_{w_I}) ]    (4)
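To make objective (4) concrete, here is a small numpy sketch (my own, with random vectors and a uniform noise distribution standing in for P_n(w)) of the negative-sampling objective for a single (input, output) pair with k sampled negatives.

```python
# Negative-sampling objective for one (w_I, w_O) pair, following eq. (4).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d, V, k = 50, 1000, 5
V_in = rng.normal(scale=0.1, size=(V, d))    # "input" vectors v_w
V_out = rng.normal(scale=0.1, size=(V, d))   # "output" vectors v'_w

w_I, w_O = 3, 17                             # center word and observed outside word
negatives = rng.integers(0, V, size=k)       # stand-in for draws from P_n(w)

pos = np.log(sigmoid(V_out[w_O] @ V_in[w_I]))
neg = np.sum(np.log(sigmoid(-V_out[negatives] @ V_in[w_I])))
objective = pos + neg                        # maximize this (minimize its negative)
print(float(objective))
```

In word2vec itself the negatives are drawn from a unigram distribution raised to the 3/4 power; the uniform draw above is only a simplification for illustration.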

Word meaning as a vector

The result is a dense vector for each word type, chosen so that it is good at predicting other words appearing in its context
… those other words also being represented by vectors

currency = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]

Latent Semantic Analysis (LSA) vs. "neural" models

Comparisons to older work: LSA Count! models
• Factorize a (maybe weighted, often log-scaled) term-document (Deerwester et al. 1990) or word-context matrix (Schütze 1992) into UΣV^T
• Retain only k singular values, in order to generalize

SVD: Intuition of Dimensionality reduction

[Figure: words projected onto 2 dimensions, axes "PCA dimension 1" and "PCA dimension 2"]

word2vec encodes semantic components as linear vector differences
COALS model (count-modified LSA) [Rohde, Gonnerman & Plaut, ms., 2005]
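For concreteness, here is a minimal sketch of the Count! recipe just described: build a (log-scaled) word-context count matrix, take a truncated SVD, and keep the top-k dimensions as dense word vectors. The words, contexts, and counts below are made up for illustration.

```python
# Count! models in miniature: log-scaled co-occurrence matrix + truncated SVD.
import numpy as np

words = ["cat", "dog", "car"]
contexts = ["pet", "fur", "road", "engine"]
counts = np.array([[20., 15.,  1.,  0.],
                   [18., 12.,  2.,  1.],
                   [ 0.,  1., 22., 19.]])

X = np.log1p(counts)                  # log scaling of raw counts
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
word_vecs = U[:, :k] * S[:k]          # k-dimensional word vectors
for w, v in zip(words, word_vecs):
    print(w, np.round(v, 2))          # "cat" and "dog" end up close, "car" far away
```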

Encoding meaning in vector differences [Pennington, Socher, and Manning, EMNLP 2014]

Crucial insight: Ratios of co-occurrence probabilities can encode meaning components

                            x = solid   x = gas   x = water   x = random
P(x | ice)                  large       small     large       small
P(x | steam)                small       large     large       small
P(x | ice) / P(x | steam)   large       small     ~1          ~1

Encoding meaning in vector differences [Pennington, Socher, and Manning, EMNLP 2014]

Crucial insight: Ratios of co-occurrence probabilities can encode meaning components

                            x = solid     x = gas       x = water     x = fashion
P(x | ice)                  1.9 x 10^-4   6.6 x 10^-5   3.0 x 10^-3   1.7 x 10^-5
P(x | steam)                2.2 x 10^-5   7.8 x 10^-4   2.2 x 10^-3   1.8 x 10^-5
P(x | ice) / P(x | steam)   8.9           8.5 x 10^-2   1.36          0.96

Encoding meaning in vector differences

Q: How can we capture ratios of co-occurrence probabilities as meaning components in a word vector space?

A: Log-bilinear model:           w_i · w_j = log P(i | j)

   with vector differences:      w_x · (w_a − w_b) = log [ P(x | a) / P(x | b) ]
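For reference, the GloVe paper trains this log-bilinear model with a weighted least-squares objective; the sketch below (my own, with made-up co-occurrence counts) spells that objective out in numpy.

```python
# GloVe weighted least-squares objective over nonzero co-occurrence counts X_ij.
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 10
X = rng.integers(1, 50, size=(V, V)).astype(float)   # co-occurrence counts (made up)

W  = rng.normal(scale=0.1, size=(V, d))   # word vectors w_i
Wc = rng.normal(scale=0.1, size=(V, d))   # context vectors w~_j
b, bc = np.zeros(V), np.zeros(V)          # word and context biases

def f(x, x_max=100.0, alpha=0.75):        # GloVe weighting function
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

i, j = np.nonzero(X)
err = np.sum(W[i] * Wc[j], axis=1) + b[i] + bc[j] - np.log(X[i, j])
J = np.sum(f(X[i, j]) * err ** 2)         # minimize J over W, Wc, b, bc (AdaGrad in the paper)
print(float(J))
```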

GloVe word similarities [Pennington et al., EMNLP 2014]

Nearest words to frog:

1. frogs
2. toad
3. litoria
4. leptodactylidae
5. rana
6. lizard
7. eleutherodactylus

http://nlp.stanford.edu/projects/glove/
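Nearest-word lists like the one above are produced by ranking the vocabulary by cosine similarity to the query vector. A toy sketch (random stand-in vectors, not the released GloVe vectors):

```python
# Nearest neighbours by cosine similarity in a word vector space.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["frog", "frogs", "toad", "lizard", "keyboard", "france"]
E = rng.normal(size=(len(vocab), 50))
E[1] = E[0] + 0.1 * rng.normal(size=50)   # make "frogs" close to "frog"
E[2] = E[0] + 0.2 * rng.normal(size=50)   # make "toad" fairly close too

def nearest(word, k=3):
    q = E[vocab.index(word)]
    sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)             # highest cosine similarity first
    return [vocab[i] for i in order if vocab[i] != word][:k]

print(nearest("frog"))   # e.g. ['frogs', 'toad', ...]
```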

GloVe visualizations: Gender pairs
http://nlp.stanford.edu/projects/glove/

GloVe visualizations: Company – CEO

Named Entity Recognition Performance (finding person, organization names in text)

Model             CoNLL '03 dev   CoNLL '03 test
Categorical CRF        91.0            85.4
SVD (log tf)           90.5            84.8
HPCA                   92.6            88.7
C&W                    92.2            87.4
CBOW                   93.1            88.2
GloVe                  93.2            88.3

F1 score of a CRF trained on CoNLL 2003 English with 50-dimensional word vectors

Named Entity Recognition Performance (finding person, organization names in text)

Model             CoNLL '03 dev   CoNLL '03 test   ACE 2   MUC 7
Categorical CRF        91.0            85.4         77.4    73.4
SVD (log tf)           90.5            84.8         73.6    71.5
HPCA                   92.6            88.7         81.7    80.7
C&W                    92.2            87.4         81.7    80.2
CBOW                   93.1            88.2         82.2    81.1
GloVe                  93.2            88.3         82.9    82.2

F1 score of a CRF trained on CoNLL 2003 English with 50-dimensional word vectors

Word embeddings: Conclusion

GloVe shows the connection between Count! work and Predict! work – an appropriate scaling and objective gives Count! models the properties and performance of Predict! models

Lots of other important recent work in this area:
[Levy & Goldberg, 2014]
[Arora, Li, Liang, Ma & Risteski, 2016]
[Hashimoto, Alvarez-Melis & Jaakkola, 2016]

3. The BiLSTM Hegemony

To a first approximation, the de facto consensus in NLP in 2017 is that no matter what the task, you throw a BiLSTM at it, with attention if you need information flow

An RNN encoder-decoder network

[Figure: an encoder RNN reads the source sentence "I am a student"; a decoder RNN then generates the translation "Je suis étudiant". Small boxes show the hidden-state vector at each time step.]

h_t = tanh(W [x_t] + U h_{t−1} + b)

Gated Recurrent Units ≈ "LSTMs"

Equations of the two most widely used gated recurrent units:

Gated Recurrent Unit [Cho et al., EMNLP 2014; Chung, Gulcehre, Cho & Bengio, DLUFL 2014]:
  h_t  = u_t ∘ h̃_t + (1 − u_t) ∘ h_{t−1}
  h̃_t = tanh(W [x_t] + U (r_t ∘ h_{t−1}) + b)
  u_t  = σ(W_u [x_t] + U_u h_{t−1} + b_u)
  r_t  = σ(W_r [x_t] + U_r h_{t−1} + b_r)

Long Short-Term Memory [Hochreiter & Schmidhuber, NC 1997; Gers, Thesis 2001]:
  h_t  = o_t ∘ tanh(c_t)
  c_t  = f_t ∘ c_{t−1} + i_t ∘ c̃_t
  c̃_t = tanh(W_c [x_t] + U_c h_{t−1} + b_c)
  o_t  = σ(W_o [x_t] + U_o h_{t−1} + b_o)
  i_t  = σ(W_i [x_t] + U_i h_{t−1} + b_i)
  f_t  = σ(W_f [x_t] + U_f h_{t−1} + b_f)

• The basic update to the memory cell (GRU h ≈ LSTM c) is via a standard neural net layer
• Bernoulli-variable "gates" control how much history is kept and how much of the input is attended to
• Summing the previous and new candidate hidden states gives direct gradient flow and more effective memory
• Note that the recurrent state mixes control and memory. Good? (Freedom to represent.) Or bad? (Mush.)
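To make the GRU equations concrete, here is a numpy sketch of one GRU step with randomly initialized weights (a real model would of course learn W, U, and the biases):

```python
# One GRU step, following the equations above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
Wu, Wr, W = (rng.normal(size=(d_h, d_in)) for _ in range(3))
Uu, Ur, U = (rng.normal(size=(d_h, d_h)) for _ in range(3))
bu, br, b = (np.zeros(d_h) for _ in range(3))

def gru_step(x_t, h_prev):
    u = sigmoid(Wu @ x_t + Uu @ h_prev + bu)           # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)           # reset gate
    h_tilde = np.tanh(W @ x_t + U @ (r * h_prev) + b)  # candidate hidden state
    return u * h_tilde + (1 - u) * h_prev              # mix new candidate with old state

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):                   # a 5-step input sequence
    h = gru_step(x, h)
print(np.round(h, 3))
```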

An LSTM encoder-decoder network [Sutskever et al. 2014]

Translation generated: "The protests escalated over the weekend"

[Figure: a deep LSTM encoder reads the source sentence and builds up the sentence meaning in its final hidden state; the decoder LSTM then generates the translation, feeding in the last generated word at each step. Small boxes show hidden-state vectors.]

Source: "Die Proteste waren am Wochenende eskaliert" (The protests escalated over the weekend)

Bottleneck: the whole source sentence must be captured in a single encoding vector

A BiLSTM encoder and LSTM-with-attention decoder

[Figure: a bidirectional LSTM encoder reads "I am a student"; an LSTM decoder attends over the encoder states while generating "Je suis étudiant". Small boxes show hidden-state vectors.]
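The attention mechanism in the decoder can be sketched in a few lines. This is a simplified dot-product version (my own illustration, not the exact model in the figure): score each encoder state against the current decoder state, softmax the scores, and take the weighted average as a context vector.

```python
# Dot-product attention over encoder states.
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 6                                   # source length, hidden size
enc_states = rng.normal(size=(T, d))          # BiLSTM encoder states (e.g. for "I am a student")
dec_state = rng.normal(size=d)                # current decoder hidden state

scores = enc_states @ dec_state               # one score per source position
weights = np.exp(scores - scores.max())
weights /= weights.sum()                      # softmax attention weights
context = weights @ enc_states                # weighted sum of encoder states

print(np.round(weights, 3), np.round(context, 2))
# The decoder combines `context` with its own state to predict the next target word.
```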

Progress in Machine Translation
[Edinburgh En-De WMT newstest2013 Cased BLEU; NMT 2015 from U. Montréal]

[Bar chart: BLEU (0–25) for phrase-based SMT, syntax-based SMT, and neural MT over 2013–2016]
From [Sennrich 2016, http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf]

Four big wins of Neural MT

1. End-to-end training: all parameters are simultaneously optimized to minimize a loss function on the network's output
2. Distributed representations share strength: better exploitation of word and phrase similarities
3. Better exploitation of context: NMT can use a much bigger context – both source and partial target text – to translate more accurately
4. More fluent text generation: deep learning text generation is of much higher quality

BiLSTMs (+Attn) not just for neural MT

• Part of speech tagging
• Named entity recognition
• Syntactic parsing (constituency & dependency)
• Reading comprehension
• Question answering
• Text summarization
• …

Reading Comprehension on the DeepMind CNN & Daily Mail datasets [Hermann et al., 2015]

End-to-end Neural Network [Chen, Bolton, & Manning, ACL 2016]

[Figure: the question Q "characters in "@placeholder" movies have gradually become more diverse" and the passage P are encoded with bidirectional RNNs; attention over the passage selects the answer A: entity6]

Lots of complex models; lots of results
Nothing does much better than LSTM+Attn

Model                                  CNN dev   CNN test   Daily Mail dev   Daily Mail test
(Hermann et al., 2015) NIPS'15           61.8      63.8          69.0             68.0
(Hill et al., 2016) ICLR'16              63.4      66.8          N/A              N/A
(Kobayashi et al., 2016) NAACL'16        71.3      72.9          N/A              N/A
(Kadlec et al., 2016) ACL'16             68.6      69.5          75.0             73.9
(Dhingra et al., 2016) 2016/6/5          73.0      73.8          76.7             75.7
(Sordoni et al., 2016) 2016/6/7          72.6      73.3          N/A              N/A
(Trischler et al., 2016) 2016/6/7        73.4      74.0          N/A              N/A
(Weissenborn, 2016) 2016/7/12            N/A       73.6          N/A              77.2
(Cui et al., 2016) 2016/7/15             73.1      74.4          N/A              N/A
Ours: neural net ACL'16                  73.8      73.6          77.6             76.6
Ours: neural net (ensemble) ACL'16       77.2      77.6          80.2             79.2

The Standard Theory of Natural Language Interpretation

Example: "a cat purred" → syntax tree (det, nsubj) → ∃x cat(x) ∧ purr(x)

Language expressions → [map to surface form] → Syntax trees → [semantic interpretation] → Logical formulas → [semantic denotation] → Models described

Model of:
• most linguistic and philosophical work (till the present)
• most computational linguistic work (till 1990)
• modern "semantic parsing" (Liang, Zettlemoyer, etc.)

Semantic interpretation of language – Not just word vectors

How can we minimally know when larger language units are similar in meaning?
• The snowboarder is leaping over a mogul
• A person on a snowboard jumps into the air

People interpret the meaning of larger text units – entities, descriptive terms, facts, arguments, stories – by semantic composition of smaller elements

4. Choices for multi-word language representations

Neural bag-of-words models

• Simply average (or just sum) word vectors:

  ( [0.4, 0.3] + [2.1, 3.3] + [7.0, 7.0] + [4.0, 4.5] + [2.3, 3.6] ) / 5 = [3.0, 3.7]
      the        country      of          my           birth

• Can improve effectiveness by putting the output through one or more fully connected layers (DANs)
• Surprisingly effective for many tasks
• [Iyyer, Manjunatha, Boyd-Graber and Daumé III 2015 – DANs; Wieting, Bansal, Gimpel and Livescu 2016 – Paraphrastic]
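A minimal deep-averaging-network (DAN) style sketch of the idea above, with random embeddings and weights standing in for learned ones: average the word vectors, then pass the result through fully connected layers to a softmax.

```python
# Neural bag-of-words / DAN sketch: average word vectors, then FC layers.
import numpy as np

rng = np.random.default_rng(0)
d, h, n_classes = 50, 32, 2
emb = {w: rng.normal(size=d) for w in "the country of my birth".split()}

W1, b1 = rng.normal(size=(h, d)), np.zeros(h)
W2, b2 = rng.normal(size=(n_classes, h)), np.zeros(n_classes)

def dan(sentence):
    avg = np.mean([emb[w] for w in sentence.split()], axis=0)  # bag-of-vectors average
    hidden = np.tanh(W1 @ avg + b1)                            # fully connected layer
    logits = W2 @ hidden + b2
    return np.exp(logits) / np.exp(logits).sum()               # class probabilities

print(np.round(dan("the country of my birth"), 3))
```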

Recurrent neural networks

• Simple recurrent neural nets do use word order but cannot capture phrases without prefix context
• Gated LSTM/GRU units could in theory, up to a certain depth, but it seems unlikely
• Empirically, representations capture too much of the last words in the final vector – the focus is the LM's next-word prediction

[Figure: an RNN reads "the country of my birth"; each hidden state combines the current word vector with the preceding hidden state]

Convolutional Neural Network

• What if we compute vectors for every h-word phrase, often for several values of h?
• Example: "the country of my birth" computes vectors for: the country, country of, of my, my birth, the country of, country of my, of my birth, the country of my, country of my birth
• Not very linguistic, but you get everything!

[Figure: a convolution over "the country of my birth" (with zero padding) produces a vector for each window of words]

Convolutional Neural Networks for Sentence Classification

Yoon Kim, New York University

Abstract: We report on a series of experiments with convolutional neural networks (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. We show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks. Learning task-specific vectors through fine-tuning offers further gains in performance. We additionally propose a simple modification to the architecture to allow for the use of both task-specific and static vectors. The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification.

The model:
• Word vectors: x_i ∈ R^k, the k-dimensional vector for the i-th word in the sentence
• Concatenation of words in a range: x_{i:i+j}
• Convolutional filter: w ∈ R^{hk}, applied to a window of h words
• CNN layer feature: c_i = f(w · x_{i:i+h−1} + b), where b is a bias term and f a non-linearity such as tanh
• Get feature map: c = [c_1, c_2, …, c_{n−h+1}]
• Max pool (better than ave.): ĉ = max{c}
(Pre-trained vectors: https://code.google.com/p/word2vec/)

1D Convolutional neural network with max pooling and FC layer

Example sentence: "wait for the video and do n't rent it"

[Figure 1 from Kim (2014): n × k representation of the sentence with static and non-static channels → convolutional layer with multiple filter widths and feature maps → max-over-time pooling → fully connected layer with dropout and softmax output]

• For more features, use multiple filter weights and multiple window sizes
• Figure from [Kim 2014, "Convolutional Neural Networks for Sentence Classification"]
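A numpy sketch of the core computation, a single filter width with max-over-time pooling; the embeddings and filter weights below are random placeholders rather than the trained word2vec channels used in the paper.

```python
# Kim-style 1D convolution over a sentence with max-over-time pooling.
import numpy as np

rng = np.random.default_rng(0)
sentence = "wait for the video and do n't rent it".split()
k, h, n_filters = 50, 3, 4                      # embedding dim, window size, number of filters

X = rng.normal(size=(len(sentence), k))         # n x k sentence matrix
W = rng.normal(size=(n_filters, h * k))         # each filter w in R^{hk}
b = np.zeros(n_filters)

# c_i = f(w . x_{i:i+h-1} + b) for every window of h words
windows = np.stack([X[i:i + h].reshape(-1) for i in range(len(sentence) - h + 1)])
feature_maps = np.tanh(windows @ W.T + b)       # shape: (n - h + 1, n_filters)

c_hat = feature_maps.max(axis=0)                # max-over-time pooling: one value per filter
print(np.round(c_hat, 3))
# c_hat then feeds a fully connected softmax layer (with dropout) for classification.
```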

Data-dependent composition

Language understanding requires being able to understand bigger things from knowing about smaller parts

Language structure is recursive

• Recursion is natural for describing language
• [The man from [the company that you spoke with about [the project] yesterday]]
  – a noun phrase containing a noun phrase with a noun phrase
• Phrases correspond to semantic units of language

Relationship between RNNs and CNNs

[Figure: a CNN and an RNN computed over the sentence "people there speak slowly"]

5. Using tree-structured models: Sentiment detection

Is the tone of a piece of text positive, negative, or neutral?

• A common perception is that sentiment is "easy"
• Detection accuracy for longer documents is ~90%

  [… loved … great … impressed … marvelous …]

• BUT

Stanford Sentiment Treebank
• 215,154 phrases labeled in 11,855 sentences
• Can train and test compositions

http://nlp.stanford.edu:8080/sentiment/

Universal Dependencies Syntax
http://universaldependencies.org/

[Dependency tree: "The cat could have chased all the dogs down the street ." with relations root, nsubj, aux, obj, obl, det, case, punct]

• Content words are related by dependency relations
• Function words attach to the content word they modify
• Punctuation attaches to the head of the phrase or clause

[Dependency tree: "The dog was chased by the cat ." with relations root, nsubj, aux, obl, case, det, punct]

[Parallel dependency tree (Swedish): "Hunden jagades av katten ." ("The dog was chased by the cat."), with relations root, nsubj, obl, case, punct; POS tags NOUN VERB ADP NOUN PUNCT; features Definite=Def, Voice=Pass, Definite=Def]

[Dependency fragment: "He is great", with relations nsubj, aux]

Dozat & Manning (ICLR 2017)

• Each word predicts what it is a dependent of, as a kind of head-dependent attention relation
• We then find the best tree (MST algorithm)
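As a toy illustration of the head-selection idea (deliberately simplified, not the full biaffine classifier of the paper), score every candidate head for every word with a bilinear form and pick the highest-scoring head; the real parser decodes with an MST algorithm rather than this greedy choice.

```python
# Toy head-dependent scoring: each word scores every candidate head.
import numpy as np

rng = np.random.default_rng(0)
words = ["ROOT", "the", "dog", "barked"]
d = 8
H_dep = rng.normal(size=(len(words), d))      # "dependent" representation of each word
H_head = rng.normal(size=(len(words), d))     # "head" representation of each word
U = rng.normal(size=(d, d))                   # bilinear scoring matrix

scores = H_dep @ U @ H_head.T                 # scores[i, j]: how good word j is as head of word i
heads = scores[1:].argmax(axis=1)             # greedy head per word (may not form a valid tree)
for w, h in zip(words[1:], heads):
    print(f"{w} <- {words[h]}")
```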

PTB-SD 3.3.0 and CTB 5.1 Results

Type        Model                      PTB-SD UAS   PTB-SD LAS   CTB UAS   CTB LAS
Transition  Chen & Manning (2014)        92.0         89.7         83.9      82.4
            Andor et al. (2016)          94.61        92.79        --        --
            Kuncoro et al. (2016)        95.8         94.6         --        --
Graph       K & G (2016)                 93.9         91.9         87.6      86.1
            Cheng et al. (2016)          94.10        91.49        88.1      86.1
            Hashimoto et al. (2016)      94.67        92.90        --        --
            Ours                         95.74        94.08        89.30     88.23

Tree-Structured Long Short-Term Memory Networks [Tai et al., ACL 2015]

Tree-structured LSTM

Generalizes the sequential LSTM to trees with any branching factor
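A sketch of one Child-Sum Tree-LSTM node update in numpy (biases omitted, weights random): the parent's input, output, and candidate computations use the sum of the children's hidden states, and each child gets its own forget gate.

```python
# Child-Sum Tree-LSTM node update (Tai et al. 2015 style; biases omitted for brevity).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
Wi, Wf, Wo, Wu = (rng.normal(size=(d_h, d_in)) for _ in range(4))
Ui, Uf, Uo, Uu = (rng.normal(size=(d_h, d_h)) for _ in range(4))

def tree_lstm_node(x, child_h, child_c):
    h_sum = np.sum(child_h, axis=0)                    # sum of children's hidden states
    i = sigmoid(Wi @ x + Ui @ h_sum)                   # input gate
    o = sigmoid(Wo @ x + Uo @ h_sum)                   # output gate
    u = np.tanh(Wu @ x + Uu @ h_sum)                   # candidate cell value
    f = sigmoid(Wf @ x + child_h @ Uf.T)               # one forget gate per child
    c = i * u + np.sum(f * child_c, axis=0)            # combine children's cells
    return np.tanh(c) * o, c                           # (h, c) for the parent node

leaf_h = np.zeros((2, d_h))                            # two children with zero history
leaf_c = np.zeros((2, d_h))
h, c = tree_lstm_node(rng.normal(size=d_in), leaf_h, leaf_c)
print(np.round(h, 3))
```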

Better Dataset Helped Even Simple Models

[Bar chart: positive/negative sentence classification accuracy (scale 70–95) for a Uni+Bigram Naïve Bayes model, trained with sentence labels vs. trained with the treebank]

• But hard negation cases are still mostly incorrect
• We also need a more powerful model!

Positive/Negative Results on Treebank
Classifying Sentences: Accuracy improves to 88%

[Bar chart: accuracy (scale 70–95) for Bi NB vs. TreeLSTM, trained with sentence labels vs. trained with the treebank]

Experimental Results on Treebank

• The TreeRNN can capture constructions like "X but Y"
• Biword Naïve Bayes is only 58% on these

Results on Negating Negatives

E.g., sentiment of "not uninteresting"
Goal: positive activation should increase

Envoi

• Deep learning – distributed representations, end-to-end training, and richer modeling of state – has brought great gains to NLP
• At the moment, it seems like we can't win, or we can only barely win, by having more structure than a vector space mush
• However, I deeply believe that we do need more structure and modularity for language, memory, knowledge, and planning; it'll just take some time