Teaching Machines to Understand Natural Language
Antoine Bordes
Cap'18 – Rouen – June 19th, 2018

Mission
"Advancing the state of the art of AI through open research for the benefit of all"

Values
• Openness: publish and open-source
• Freedom: researchers have complete control over their agenda
• Collaboration: with internal and external partners
• Excellence: focus on the most impactful projects ("publish or perish")
• Scale: operate at large scale (computation, data)

Talking to Machines
Language should be the most natural interface between the digital and the physical world. Digital personal assistants could:
• Grant accessibility for people with disabilities: "What is in this picture?"
• Break loneliness: "I'm not feeling well..."
• Mediate and monitor our online presence: "Did anybody talk about me or my family lately?", "Why do I see this content?"

Machines Understanding Language
Current systems are still far from natural interactions. Intelligent machines should be able to:
1. Build (causal) models of the world as internal knowledge representations;
2. Ground language into them to interpret it;
3. Leverage compositionality to learn quickly and to update them.
Machines are missing the ability to learn high-level abstractions.
"Building machines that learn and think like people." Lake et al. Behavioral and Brain Sciences '17
"Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks." Lake & Baroni. arXiv '18

Two Paradigms
Neural Networks (convolutional nets, recurrent nets, etc.) vs. Symbolic Systems (ontologies, KBs, inductive logic programming, etc.)
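A toy contrast of the two paradigms (an illustrative sketch, not code from the talk): a symbolic system answers by exact lookup over (subject, relation, object) triples, while a neural approach scores candidates by similarity in a vector space. All entity names, triples, and vectors below are made-up stand-ins for a real KB and learned embeddings.

```python
# Symbolic paradigm: exact match over (subject, relation, object) triples.
kb = {("Blade Runner", "directed_by", "Ridley Scott")}

def symbolic_answer(subj, rel):
    # Precise and interpretable, but brittle: any variation in the query
    # (typo, paraphrase, missing triple) returns nothing.
    return [o for (s, r, o) in kb if s == subj and r == rel]

# Continuous paradigm: hand-set stand-ins for learned entity embeddings.
emb = {"Ridley Scott": [1.0, 0.2, -0.5], "Harrison Ford": [-0.3, 0.9, 0.4]}
# An imagined vector encoding of the query ("Blade Runner", directed_by, ?).
query = [0.9, 0.1, -0.4]

def neural_answer(q):
    # Nearest candidate by dot product: robust to noise, but approximate
    # and hard to interpret.
    return max(emb, key=lambda e: sum(a * b for a, b in zip(emb[e], q)))

print(symbolic_answer("Blade Runner", "directed_by"))  # ['Ridley Scott']
print(neural_answer(query))                            # Ridley Scott
```

The conjecture below is that one architecture can combine the exactness of the first lookup with the robustness of the second.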
Neural Networks
Strengths:
• Scale to very large datasets
• Can be used by non-domain experts
• Robust to noise and ambiguity in data
• Game changers in multiple applications
Weaknesses:
• Very data hungry (mostly supervised data)
• Cannot easily learn new tasks from old ones
• Not interpretable
• Relatively simple reasoning

Symbolic Systems
Strengths:
• Learn from few samples
• Can rely on high-level abstractions
• Interpretable, provable
• Complex reasoning
Weaknesses:
• Small-scale conditions
• Require heavy expert knowledge
• Very brittle with noisy, ambiguous data
• Limited applicative success

Conjecture
Neural networks (and friends) can be adapted to be the best of both worlds: architectures at the interplay between symbolic and continuous models that
• encode structure such as symbolic databases;
• discover structure in (unstructured) data;
• operate on, and are trained with, symbolic components (even programs).
Training will have to overcome new challenges:
• Optimization mixing differentiable and non-differentiable parts
• Reinforcement learning requires rewards and an environment

This Talk: Question Answering
This talk focuses on neural networks learning to use encyclopedic knowledge, on the task of question answering (Q&A), chosen for its ease of evaluation.
1. How can neural networks be adapted to reason over symbols?
2. How can vector spaces encode structured data from knowledge bases?

Neural Networks + Symbolic Memories

bAbI Tasks
• Set of 20 tasks testing basic reasoning capabilities for QA from stories
• Short stories are generated from a simulation
• Easy to interpret results / test a broad range of properties
• Useful to foster innovation: cited 300+ times

Task 3: Two supporting facts
John dropped the milk.
John took the milk there.
Sandra went to the bathroom.
John moved to the hallway.
Mary went to the bedroom.
Where is the milk? Hallway

Task 18: Size reasoning
The suitcase is bigger than the chest.
The box is bigger than the chocolate.
The chest is bigger than the chocolate.
The chest fits inside the container.
The chest fits inside the box.
Does the suitcase fit in the chocolate? No

Very challenging for standard neural networks:
• LSTMs: 63% accuracy, only 4 tasks completed (with 10k training examples)
"Towards AI-complete question answering: A set of prerequisite toy tasks." Weston et al. ICLR'16

Memory Networks
Models that combine a symbolic memory with a learning component that can read it:
• Learn to reason with attention over their memory
• Can be trained end-to-end, without supervising how the memory is used, with a cross-entropy or ranking loss
Multiple related models appeared around the same time and since then: Stack-RNN (Joulin & Mikolov, '15), RNNSearch (Bahdanau et al., '15), Neural Turing Machines (Graves et al., '15), etc.
"Memory Networks." Weston et al. ICLR'15 / "End-to-end Memory Networks." Sukhbaatar et al. NIPS'15

Memory Networks on bAbI
[Figure: end-to-end training through backpropagation; the story is stored as memories the model attends over.]
m1: John dropped the milk.
m2: John took the milk there.
m3: Sandra went to the bathroom.
m4: John moved to the hallway.
m5: Mary went back to the bedroom.
Question: Where is the milk? Answer: hallway
Attention During Memory Lookups

Story (1: 1 supporting fact)             Support  Hop 1  Hop 2  Hop 3
Daniel went to the bathroom.                      0.00   0.00   0.03
Mary travelled to the hallway.                    0.00   0.00   0.00
John went to the bedroom.                         0.37   0.02   0.00
John travelled to the bathroom.          yes      0.60   0.98   0.96
Mary went to the office.                          0.01   0.00   0.00
Where is John? Answer: bathroom  Prediction: bathroom

Story (2: 2 supporting facts)            Support  Hop 1  Hop 2  Hop 3
John dropped the milk.                            0.06   0.00   0.00
John took the milk there.                yes      0.88   1.00   0.00
Sandra went back to the bathroom.                 0.00   0.00   0.00
John moved to the hallway.               yes      0.00   0.00   1.00
Mary went back to the bedroom.                    0.00   0.00   0.00
Where is the milk? Answer: hallway  Prediction: hallway

Story (16: basic induction)              Support  Hop 1  Hop 2  Hop 3
Brian is a frog.                         yes      0.00   0.98   0.00
Lily is gray.                                     0.07   0.00   0.00
Brian is yellow.                         yes      0.07   0.00   1.00
Julius is green.                                  0.06   0.00   0.00
Greg is a frog.                          yes      0.76   0.02   0.00
What color is Greg? Answer: yellow  Prediction: yellow

Story (18: size reasoning)               Support  Hop 1  Hop 2  Hop 3
The suitcase is bigger than the chest.   yes      0.00   0.88   0.00
The box is bigger than the chocolate.             0.04   0.05   0.10
The chest is bigger than the chocolate.  yes      0.17   0.07   0.90
The chest fits inside the container.              0.00   0.00   0.00
The chest fits inside the box.                    0.00   0.00   0.00
Does the suitcase fit in the chocolate? Answer: no  Prediction: no

Figure 2: Example predictions on the QA tasks of [21]. We show the labeled supporting facts (support) from the dataset, which MemN2N does not use during training, and the probabilities p of each hop used by the model during inference. MemN2N successfully learns to focus on the correct supporting sentences.

Results on bAbI (with 10k training examples):
• LSTMs: 63% accuracy, 4 tasks completed
• Memory Networks (3 hops): 96% accuracy, 17 tasks completed
• More recently (2017), EntNets and Query-Reduction Networks solve all 20 tasks.
"End-to-end Memory Networks." Sukhbaatar et al. NIPS'15

How About on Real Language Data?

                     Penn Treebank                          Text8
Model        #of     #of   memory  Valid.  Test    #of     #of   memory  Valid.  Test
             hidden  hops  size    perp.   perp.   hidden  hops  size    perp.   perp.
RNN [15]     300     -     -       133     129     500     -     -       -       184
LSTM [15]    100     -     -       120     115     500     -     -       122     154
SCRN [15]    100     -     -       120     115     500     -     -       -       161
MemN2N       150     2     100     128     121     500     2     100     152     187
             150     3     100     129     122     500     3     100     142     178
             150     4     100     127     120     500     4     100     129     162
             150     5     100     127     118     500     5     100     123     154
             150     6     100     122     115     500     6     100     124     155
             150     7     100     120     114     500     7     100     118     147
             150     6     25      125     118     500     6     25      131     163
             150     6     50      121     114     500     6     50      132     166
             150     6     75      122     114     500     6     75      126     158
             150     6     100     122     115     500     6     100     124     155
             150     6     125     120     112     500     6     125     125     157
             150     6     150     121     114     500     6     150     123     154
             150     7     200     118     111     -       -     -       -       -

Table 2: The perplexity on the test sets of the Penn Treebank and Text8 corpora. Note that increasing the number of memory hops improves performance.

Figure 3: Average activation weight of memory positions during 6 memory hops. White indicates where the model is attending during the kth hop. For clarity, each row is normalized to have maximum value 1. A model is trained on (left) Penn Treebank and (right) Text8.

Language Modeling Experiments
The goal in language modeling is to predict the next word in a text sequence given the previous words x. We now explain how our model can easily be applied to this task. We now operate at the word level, as opposed to the sentence level. Thus the previous N words in the sequence (including the current one) are embedded into memory separately. Each memory cell holds only a single word, so there is no need for the BoW or linear mapping representations used in the QA tasks. We employ the temporal embedding approach of Section 4.1. Since there is no longer any question, q in Fig. 1 is fixed to a constant vector 0.1 (without embedding). The output softmax predicts which word in the vocabulary (of size V) is next in the sequence.
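The language-modeling setup described above can be sketched as follows: each memory slot holds one previous word's embedding, and since there is no question, the query is a fixed constant vector of 0.1s. The vocabulary and embeddings below are toy stand-ins; the real model learns them and adds temporal features before the size-V output softmax.

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

N, d = 4, 3  # memory size (number of previous words) and embedding dim
# Toy word embeddings (learned parameters in the real model).
vocab = {"the": [0.2, 0.1, 0.0], "cat": [0.9, 0.0, 0.3],
         "sat": [0.0, 0.8, 0.1], "on": [0.1, 0.1, 0.7]}
context = ["the", "cat", "sat", "on"]
memories = [vocab[w] for w in context[-N:]]  # one word per memory cell
q = [0.1] * d                                # constant query, no embedding

# One hop, exactly as in the QA case: attend, then read out.
p = softmax([sum(qj * mj for qj, mj in zip(q, m)) for m in memories])
o = [sum(pi * m[j] for pi, m in zip(p, memories)) for j in range(d)]
# In the full model, o feeds a softmax over the vocabulary (size V) to
# predict the next word; here we only show the memory read-out.
print([round(x, 3) for x in o])
```

Because every memory cell holds a single word, the attention pattern over positions (Figure 3) directly shows which previous words the model consults at each hop.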