Teaching Machines to Understand Natural Language Antoine Bordes

Cap’18 – Rouen – June 19th, 2018

Mission: “Advancing the state of the art of AI through open research for the benefit of all”

Values
• Openness: publish and open-source
• Freedom: researchers have complete control over their agenda
• Collaboration: with internal and external partners
• Excellence: focus on the most impactful projects (“publish or perish”)
• Scale: operate at large scale (computation, data)

Talking to Machines

Language should be the most natural interface between the digital and the physical world

Digital personal assistants:
• Grant accessibility for people with disabilities: “What is in this picture?”
• Break loneliness: “I’m not feeling well…”
• Mediate and monitor our online presence: “Did anybody talk about me or my family lately?”, “Why do I see this content?”

Machines Understanding Language

Current systems are still far from natural interactions.

Intelligent machines should be able to:
1. Build (causal) models of the world as internal knowledge representations;
2. Ground language into them to interpret it;
3. Leverage compositionality to learn quickly and to update them.

Machines are missing the ability to learn high-level abstractions

“Building machines that learn and think like people” Lake et al. Behavioral and Brain Sciences ’17
“Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks” Lake & Baroni. ArXiv’18

Two Paradigms

Neural Networks (convolutional nets, recurrent nets, etc.):
• Scale to very large datasets
• Can be used by non-domain experts
• Robust to noise and ambiguity in data
• Game changers in multiple applications
• Very data hungry (mostly supervised data)
• Cannot easily learn new tasks from old ones
• Not interpretable
• Relatively simple reasoning

Symbolic Systems (ontologies, KBs, Inductive Logic Programming, etc.):
• Small-scale conditions
• Require heavy expert knowledge
• Very brittle with noisy, ambiguous data
• Limited applicative success
• Learn from few samples
• Can rely on high-level abstractions
• Interpretable, provable
• Complex reasoning

Conjecture

Neural networks (and friends) can be adapted to combine the best of both worlds

Architectures at the interplay between symbolic and continuous models:
• encode structure such as symbolic databases
• discover structure in (unstructured) data
• operate on and are trained with symbolic components (even programs)

Training will have to overcome new challenges:
• Optimization mixing differentiable and non-differentiable parts
• Reinforcement learning requires rewards and an environment

This talk: Question Answering

This talk focuses on NNs learning to use encyclopedic knowledge

Task: question answering (Q&A) because of ease of evaluation

1. How can neural networks be adapted to reason over symbols?
2. How can vector spaces encode structured data from knowledge bases?

Neural Networks + Symbolic Memories

bAbI Tasks

• Set of 20 tasks testing basic reasoning capabilities for QA from stories
• Short stories are generated from a simulation
• Easy to interpret results / test a broad range of properties

Task 3: Two supporting facts
John dropped the milk.
John took the milk there.
Sandra went to the bathroom.
John moved to the hallway.
Mary went to the bedroom.
Where is the milk? Hallway

Task 18: Size reasoning
The suitcase is bigger than the chest.
The box is bigger than the chocolate.
The chest is bigger than the chocolate.
The chest fits inside the container.
The chest fits inside the box.
Does the suitcase fit in the chocolate? No

• Useful to foster innovation: cited 300+ times

“Towards AI-complete question answering: A set of prerequisite toy tasks.” Weston et al. ICLR’16 bAbI Tasks
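For concreteness, the tasks ship as plain-text files (available at fb.ai/babi) in which each line is numbered within a story and question lines carry a tab-separated answer and supporting-fact ids. A minimal parser sketch, assuming that file format (the helper name is mine):

```python
# Minimal parser for the bAbI plain-text format: "1 Mary moved to the bathroom."
# for statements, "3 Where is Mary?\tbathroom\t1" for questions.
def parse_babi(path):
    stories, story = [], []
    for line in open(path, encoding="utf-8"):
        if not line.strip():
            continue
        idx, text = line.strip().split(" ", 1)
        if int(idx) == 1:              # numbering resets when a new story starts
            story = []
        if "\t" in text:               # question line: question \t answer \t supports
            question, answer, supports = text.split("\t")
            stories.append({
                "facts": list(story),
                "question": question.strip(),
                "answer": answer,
                "supporting_facts": [int(s) for s in supports.split()],
            })
        else:                          # ordinary statement: add it to the story
            story.append(text)
    return stories
```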


Very challenging for standard neural networks: • LSTMs: 63% accuracy and 4 tasks completed with 10k training examples

“Towards AI-complete question answering: A set of prerequisite toy tasks.” Weston et al. ICLR’16 Memory Networks

Models that combine a symbolic memory with a learning component that can read it.

Memory Networks: Learn to reason with attention over their memory Can be trained end-to-end without supervising how the memory is used using cross-entropy or ranking loss

Multiple related models appeared around the same time and since then: Stack-RNN (Joulin & Mikolov, ‘15), RNNSearch (Bahdanau et al., ‘15), Neural Turing Machines (Graves et al., ‘15), etc.

“Memory Networks” Weston et al. ICLR’15 / “End-to-end Memory Networks” Sukhbaatar et al. NIPS’15

Memory Networks on bAbI

Question + memory

[Diagram: the question “Where is the milk?” is matched against the memories (m1: John dropped the milk. m2: John took the milk there. m3: Sandra went to the bathroom. m4: John moved to the hallway. m5: Mary went back to the bedroom. …); the network attends over the memories and answers “hallway” among candidates such as “office”. The whole model is trained end-to-end through backpropagation.]

Attention during memory lookups
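A minimal sketch of one such memory lookup, in the spirit of end-to-end Memory Networks: the question and memories are embedded, a softmax attention selects the relevant memories, and their weighted sum is combined with the question to score candidate answers. The matrix names (A, B, C, W) follow the paper’s convention, but the bag-of-words inputs and shapes are simplifying assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_hop(q_bow, memory_bows, A, B, C, W):
    """One end-to-end Memory Network hop (sketch). q_bow: bag-of-words vector of
    the question; memory_bows: one bag-of-words vector per stored sentence.
    A and C embed memories (addressing and reading), B embeds the question,
    W scores candidate answers; all matrices are learned by backpropagation."""
    u = B @ q_bow                                    # question embedding
    keys = np.stack([A @ m for m in memory_bows])    # memory embeddings (addressing)
    vals = np.stack([C @ m for m in memory_bows])    # memory embeddings (reading)
    p = softmax(keys @ u)                            # attention over memories
    o = p @ vals                                     # weighted read vector
    return softmax(W @ (u + o))                      # distribution over answers

# Further hops simply repeat the lookup with u + o as the new query.
```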

[Figure (from Sukhbaatar et al.): example predictions on bAbI tasks 1 (one supporting fact), 2 (two supporting facts), 16 (basic induction) and 18 (size reasoning). It shows the labeled supporting facts from the dataset, which MemN2N does not use during training, and the attention probabilities of each hop during inference; MemN2N learns to focus on the correct supporting sentences, e.g. for “Where is the milk?” it attends to “John took the milk there.” and “John moved to the hallway.” and predicts “hallway”.]

• LSTMs: 63% accuracy and 4 tasks completed (with 10k examples)
• Memory Networks (3 hops): 96% accuracy and 17 tasks completed (with 10k examples)
• Recently (2017), EntNets and Query-Reduction Networks solve all 20 tasks

“End-to-end Memory Networks” Sukhbaatar et al. NIPS’15

How about on real language data?

Model    #hidden  #hops  mem. size  PTB valid  PTB test | #hidden  #hops  mem. size  Text8 valid  Text8 test
RNN      300      -      -          133        129      | 500      -      -          -            184
LSTM     100      -      -          120        115      | 500      -      -          122          154
SCRN     100      -      -          120        115      | 500      -      -          -            161
MemN2N   150      2      100        128        121      | 500      2      100        152          187
         150      3      100        129        122      | 500      3      100        142          178
         150      4      100        127        120      | 500      4      100        129          162
         150      5      100        127        118      | 500      5      100        123          154
         150      6      100        122        115      | 500      6      100        124          155
         150      7      100        120        114      | 500      7      100        118          147
         150      6      25         125        118      | 500      6      25         131          163
         150      6      50         121        114      | 500      6      50         132          166
         150      6      75         122        114      | 500      6      75         126          158
         150      6      100        122        115      | 500      6      100        124          155
         150      6      125        120        112      | 500      6      125        125          157
         150      6      150        121        114      | 500      6      150        123          154
         150      7      200        118        111      | -        -      -          -            -

Table 2 (from Sukhbaatar et al.): the perplexity on the test sets of the Penn Treebank and Text8 corpora. Note that increasing the number of memory hops improves performance.

[Figure 3 (from Sukhbaatar et al.): average activation weight of memory positions during 6 memory hops. White indicates where the model is attending during the k-th hop. For clarity, each row is normalized to have a maximum value of 1. The model is trained on (left) Penn Treebank and (right) the Text8 dataset.]

Language Modeling Experiments (excerpt from Sukhbaatar et al.): the goal in language modeling is to predict the next word in a text sequence given the previous words. The model operates at the word level, as opposed to the sentence level: the previous N words in the sequence (including the current one) are embedded into memory separately. Each memory cell holds only a single word, so there is no need for the BoW or linear-mapping representations used in the QA tasks, and the paper’s temporal embedding approach is employed. Since there is no longer any question, the query q is fixed to a constant vector 0.1 (without embedding). The output softmax predicts which word in the vocabulary (of size V) is next in the sequence, and a cross-entropy loss is used to train the model by backpropagating the error through multiple memory hops.
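A rough sketch of that language-modelling setup (random weights, a single hop, no temporal features; purely illustrative, not the paper’s code):

```python
import numpy as np

d, V, N = 150, 10000, 100             # embedding size, vocabulary size, memory size
A = 0.1 * np.random.randn(d, V)       # memory embeddings used for addressing
C = 0.1 * np.random.randn(d, V)       # memory embeddings used for reading
W = 0.1 * np.random.randn(V, d)       # output projection to vocabulary scores

def next_word_scores(prev_word_ids):
    u = np.full(d, 0.1)                       # fixed constant "question" vector
    mem = prev_word_ids[-N:]                  # the last N words, one per memory cell
    keys, vals = A[:, mem].T, C[:, mem].T     # (N, d) embeddings of the past words
    s = keys @ u
    p = np.exp(s - s.max()); p /= p.sum()     # attention over past words
    o = p @ vals                              # read vector
    return W @ (u + o)                        # unnormalised scores for the next word
```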

Open-domain Question Answering

Answer questions on any topic.

Knowledge Base:
[Blade Runner, directed_by, Ridley Scott]
[Blade Runner, written_by, Philip K. Dick, Hampton Fancher]
[Blade Runner, starred_actors, Harrison Ford, Sean Young, …]
[Blade Runner, release_year, 1982]
[Blade Runner, has_tags, dystopian, noir, police, androids, …]
[…]

“What year was the movie Blade Runner released?” → 1982
“Can you describe Blade Runner in a few words?” → A dystopian and noir movie

Memory Networks for QA from KB

[Diagram: the question “What year was the movie Blade Runner released?” first goes through an Information Retrieval step, because the full KB is too large to fit in memory; the retrieved facts become the memories (m1, m2, m3, m4, …), e.g. [Blade Runner, directed_by, Ridley Scott], [Blade Runner, release_year, 1982], [Blade Runner, written_by, Philip K. Dick, Hampton Fancher], [Blade Runner, has_tags, dystopian, noir, police, androids, …], [Steven Spielberg, directed, Jurassic Park, …]; the Memory Network attends over them and answers 1982 among candidates such as Tron, police or Tom Cruise.]
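A sketch of that pre-selection step, under the assumption that matching each triple’s subject string against the question is good enough (the paper’s actual candidate generation is more involved); triples are modelled as (subject, relation, object) tuples:

```python
# The KB is far too large to load into memory, so candidate facts are retrieved first;
# the surviving triples become the memories m1, m2, ... of the Memory Network.
def retrieve_candidate_facts(question, kb_triples, max_facts=1000):
    q = question.lower()
    candidates = [t for t in kb_triples if t[0].lower() in q]
    return candidates[:max_facts]

kb = [("Blade Runner", "release_year", "1982"),
      ("Blade Runner", "directed_by", "Ridley Scott"),
      ("Steven Spielberg", "directed", "Jurassic Park")]
print(retrieve_candidate_facts("What year was the movie Blade Runner released?", kb))
```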

“Large-scale Simple Question Answering with Memory Networks” Bordes et al. ArXiv’15

Memory Networks on MovieQA

[Bar chart: response accuracy (%) on MovieQA for Memory Networks reading different knowledge sources: KB 78.5, IE 63.4, Wikipedia 69.9. Reference lines: a standard QA system on the KB reaches 93.5%; a no-knowledge baseline (embeddings only) reaches 54.4%.]

“Large-scale Simple Question Answering with Memory Networks” Bordes et al. ArXiv’15

Structuring Memories in the Network

• We should use prior knowledge about the task (a strength of symbolic systems)
• Which knowledge source have the memories been extracted from?
• Is there semantics in this knowledge source?

• Structure in the KB symbolic memories can be encoded in the network
• Parts of the memories match questions, while others encode the response
• Different symbols for addressing and for reading

[Blade Runner, directed_by, Ridley Scott] → [Blade Runner, directed_by] | Ridley Scott
[Blade Runner, release_year, 1982] → [Blade Runner, release_year] | 1982

Key-Value Memory Networks on KB

[Diagram: the question “What year was the movie Blade Runner released?” goes through Information Retrieval over the KB; the retrieved facts are stored as key-value memories, e.g. [Blade Runner, directed_by] | Ridley Scott, [Blade Runner, release_year] | 1982, [Blade Runner, written_by] | Philip K. Dick; the network is trained end-to-end through backpropagation and answers 1982 among candidates such as Tron, police or Tom Cruise.]
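A minimal sketch of a key-value lookup: keys are matched against the question (addressing), values are what gets read out. Names and shapes are illustrative assumptions:

```python
import numpy as np

def key_value_lookup(q_emb, key_embs, val_embs):
    """One key-value memory lookup (sketch). q_emb: (d,) question embedding;
    key_embs, val_embs: (N, d) embeddings of the N memory keys and values."""
    s = key_embs @ q_emb
    p = np.exp(s - s.max()); p /= p.sum()    # attention over the keys
    return p @ val_embs                      # weighted sum of the values

# For the KB case above, the memory "[Blade Runner, directed_by] | Ridley Scott"
# gets its key from the (subject, relation) pair and its value from the object.
```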

“Key-Value Memory Networks for Directly Reading Documents” Miller et al. EMNLP’16

Results on MovieQA: Memory Networks vs. Key-Value Memory Networks

[Bar chart: response accuracy (%) on MovieQA. Key-Value Memory Networks vs. Memory Networks: KB 93.9 vs. 78.5, IE 68.3 vs. 63.4, Wikipedia 76.2 vs. 69.9. Reference lines: standard QA system on the KB 93.5%; no-knowledge baseline (embeddings only) 54.4%.]

“Key-Value Memory Networks for Directly Reading Documents” Miller et al. EMNLP’16

Question Answering Directly from Text

Wikipedia Entry: Blade Runner
“Blade Runner is a 1982 American neo-noir dystopian film directed by Ridley Scott and starring Harrison Ford, Rutger Hauer, Sean Young, and Edward James Olmos. The screenplay, written by Hampton Fancher and David Peoples, is a modified film adaptation of the 1968 novel ‘Do Androids Dream of Electric Sheep?’ by Philip K. Dick. The film depicts a dystopian Los Angeles in November 2019 in which genetically engineered replicants, which are visually indistinguishable from adult humans, are manufactured by the powerful Tyrell Corporation as well as by other ‘mega-corporations’ around the world…”

Much more information than in the KB, but QA is harder.

“What year was the movie Blade Runner released?” → 1982
“Can you describe Blade Runner in a few words?” → A dystopian and noir movie
“In Blade Runner, who built the Replicants?” → Tyrell Corporation

Key-Value Memory Networks on Text

[Diagram: for the question “What year was the movie Blade Runner released?”, Information Retrieval selects windows of the Wikipedia article as key-value memories, e.g. “is a 1982 American neo-noir” | 1982, “is a 1982 American neo-noir” | Blade Runner, “directed by R. Scott and starring” | R. Scott, “directed by R. Scott and starring” | Blade Runner, “written by H. Fancher and D. P.” | H. Fancher, “written by H. Fancher and D. P.” | Blade Runner, …; the network answers 1982 among candidates such as Tron, police or Tom Cruise.]
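A sketch of how such key-value memories could be built from raw text, in the spirit of the slide: a window of words around an entity mention serves as the key, and either the centre entity or the article title serves as the value. The window size and entity list are illustrative assumptions, not the paper’s exact preprocessing:

```python
# Turn one sentence of an article into (key, value) memories.
def text_to_key_value_memories(title, sentence, entities, window=5):
    words = sentence.split()
    memories = []
    for i, w in enumerate(words):
        if w in entities:
            key = " ".join(words[max(0, i - window): i + window + 1])
            memories.append((key, w))        # window -> centre entity
            memories.append((key, title))    # window -> article title
    return memories

mems = text_to_key_value_memories(
    "Blade Runner",
    "Blade Runner is a 1982 American neo-noir dystopian film directed by Ridley_Scott",
    entities={"1982", "Ridley_Scott"})
```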

“Key-Value Memory Networks for Directly Reading Documents” Miller et al. EMNLP’16

Results on MovieQA: Memory Networks vs. Key-Value Memory Networks

[Same bar chart as above, highlighting that Key-Value Memory Networks reading Wikipedia (76.2%) still trail the same model on the KB (93.9%) by about 17 points.]

“Key-Value Memory Networks for Directly Reading Documents” Miller et al. EMNLP’16

Extending to any Domain

DrQA: answers from the full English Wikipedia (5M articles), combining a document retriever (DR) with a neural reading component (QA).
30% accuracy on Wikipedia-based questions.

“Reading Wikipedia to Answer Open-domain Questions” Chen et al. ACL’17
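A toy sketch of the retrieve-then-read idea behind DrQA, using scikit-learn’s TF-IDF vectorizer as a stand-in retriever and a hypothetical `reader` placeholder for a trained reading-comprehension model; this is not the authors’ implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(question, articles, k=5):
    # Sparse retrieval: rank articles by TF-IDF similarity to the question.
    vec = TfidfVectorizer(ngram_range=(1, 2)).fit(articles)
    sims = cosine_similarity(vec.transform([question]), vec.transform(articles))[0]
    best = sims.argsort()[::-1][:k]
    return [articles[i] for i in best]          # top-k candidate articles

def answer(question, articles, reader):
    candidates = retrieve(question, articles)
    return reader(question, candidates)         # reader extracts the answer span
```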

Conclusion

We are moving towards neural networks that can be trained to manipulate symbols efficiently:
• Memories to avoid catastrophic forgetting (Lopez-Paz & Ranzato, ‘17)
• Memories to speed up Reinforcement Learning training (Pritzel et al., ‘17)
• Networks to execute programs (Reed & de Freitas, ‘16) (Johnson et al., ‘17)

We still have to figure out how to train them for real-life applications.

“Gradient episodic memory for continual learning” Lopez-Paz & Ranzato NIPS’17
“Neural Episodic Control” Pritzel et al. ICML’17
“Neural programmer-interpreters” Reed & de Freitas ICLR’16
“Inferring and executing programs for visual reasoning” Johnson et al. ICCV’17

Structured Knowledge + Vector Spaces

Embedding Knowledge Bases

A knowledge base (KB) stores complex structured and unstructured information through entities and relations between them.

KBs are hard to use in practice:
• Large dimensions: 10^5–10^8 entities, 10^4–10^6 relation types
• Noisy/incomplete: missing/wrong relations/entities
• Complex connection to language

Idea: Neural networks to learn continuous representations of KBs
• Entity = vector (or embedding)

• Relation = similarity function learnt in this vector space

“Translating Embeddings for Modeling Multi-relational Data” Bordes et al. NIPS’13
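A minimal sketch of the idea behind the TransE reference above: entities and relations are vectors, a relation acts as a translation so that h + r ≈ t for a true triple (h, r, t), and training uses a margin ranking loss against corrupted triples. The example embeddings are random and purely illustrative:

```python
import numpy as np

def transe_score(h, r, t):
    """Plausibility of a triple (h, r, t): high when h + r is close to t."""
    return -np.linalg.norm(h + r - t)

def margin_loss(pos, neg, margin=1.0):
    """Margin ranking loss against a corrupted (negative) triple."""
    return max(0.0, margin - transe_score(*pos) + transe_score(*neg))

# Example: (Blade Runner, release_year, 1982) vs. a corrupted tail entity.
d = 50
blade_runner, release_year, y1982, y2001 = (np.random.randn(d) for _ in range(4))
loss = margin_loss((blade_runner, release_year, y1982),
                   (blade_runner, release_year, y2001))
```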

Poincaré Embeddings
• Embed in hyperbolic space
• Embeddings uncover underlying hierarchies
• Better performance for predicting missing relations than ILP
• NNs can discover and encode structure in the input data

[Figure: the WordNet mammals hierarchy embedded in the Poincaré disk]

[Figure: an air transportation network embedded in the Poincaré disk]
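These hierarchies are recovered by placing entities inside the Poincaré ball and learning positions under the hyperbolic distance below (a sketch of the distance only; the actual training uses Riemannian SGD, not shown):

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Hyperbolic distance between two points inside the unit (Poincaré) ball.
    Points near the origin behave like general concepts, points near the boundary
    like specific ones, which is what lets the embedding uncover hierarchies."""
    duv = np.dot(u - v, u - v)
    denom = (1.0 - np.dot(u, u)) * (1.0 - np.dot(v, v)) + eps
    return np.arccosh(1.0 + 2.0 * duv / denom)
```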

“Poincaré Embeddings for Learning Hierarchical Representations” Nickel & Kiela NIPS’17 Open research

Data is on http://fb.ai/babi

Embeddings Q&A / Dialog