Convolutional Neural Networks for Language
CS 6956: Deep Learning for NLP

Features from text

Example: sentiment classification. The goal: is the sentiment of a sentence positive, negative, or neutral?

"The film is fun and is host to some truly excellent sequences"

Approach: train a multiclass classifier. What features? Some words and ngrams are informative, while others are not. We need to:
1. Identify informative local information
2. Aggregate it into a fixed-size vector representation

Convolutional Neural Networks

Designed to:
1. Identify local predictors in a larger input
2. Pool them together to create a feature representation
3.
And possibly repeat this in a hierarchical fashion

In the NLP context, this helps identify predictive ngrams for a task.

Overview
• Convolutional Neural Networks: A brief history
• The two operations in a CNN
  – Convolution
  – Pooling
• Convolution + Pooling as a building block
• CNNs in NLP
• Recurrent networks vs. convolutional networks

Convolutional Neural Networks: A brief history

CNNs first arose in the context of vision.
• Hubel and Wiesel, 1950s/60s: the mammalian visual cortex contains neurons that respond to small regions and specific patterns in the visual field
• Fukushima 1980, Neocognitron: directly inspired by Hubel and Wiesel
  – Key idea: locality of features in the visual cortex is important; integrate them locally and propagate them to further layers
  – Two operations: a convolutional layer that reacts to specific patterns, and a down-sampling layer that aggregates information
• Le Cun 1989-today, Convolutional Neural Network: a supervised version
  – Related to convolution kernels in computer vision
  – Very successful on handwriting recognition and other computer vision tasks
• Has become better over recent years with more data and computation
  – Krizhevsky et al. 2012: object detection with ImageNet
  – The de facto feature extractor for computer vision

For this work, Hubel and Wiesel shared the 1981 Nobel Prize in Physiology or Medicine: David H.
Hubel and Torsten Wiesel.

Convolutional Neural Networks: A brief history (continued)
• Introduced to NLP by Collobert et al., 2011
  – Used as a feature extraction system for semantic role labeling
• Since then, several other applications, such as sentiment analysis and question classification
  – Kalchbrenner et al. 2014, Kim 2014

CNN terminology

The terminology shows the field's computer vision and signal processing origins.
• Filter
  – A function that transforms an input matrix/vector into a scalar feature
  – A filter is a learned feature
detector (also called a feature map)
• Channel
  – In computer vision, color images have red, green, and blue channels
  – In general, a channel represents a "view of the input" that captures information about the input independently of other channels
    • For example, different kinds of word embeddings could be different channels
    • Channels could themselves be produced by previous convolutional layers
• Receptive field
  – The region of the input that a filter currently focuses on

What is a convolution?

Let's see this using an example with vectors. We will generalize to matrices and beyond, but the general idea remains the same.

A vector x: [2, 3, 1, 3, 2, 1]
A filter u of size k: [1, 2, 1] (here, the filter size is k = 3)

The output is also a vector, with entries

  output_i = Σ_{j=1..k} u_j · x_{i+j−1}

The filter moves across the vector. At each position, the output is the dot product of the filter with the slice of the vector currently under it.

Padding at the beginning prepends a zero, so the vector becomes [0, 2, 3, 1, 3, 2, 1]. The first two outputs are then

  output_1 = 1·0 + 2·2 + 1·3 = 7
  output_2 = 1·2 + 2·3 + 1·1 = 9
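The sliding dot product above can be sketched in a few lines of plain Python. This is a minimal sketch, not library code: the function name conv1d and the pad parameter are illustrative, and padding is applied only at the beginning, as in the example.

```python
def conv1d(x, u, pad=0):
    """Convolve filter u over vector x: each output entry is the
    dot product of u with the window of x starting at that position."""
    x = [0] * pad + list(x)   # zero-padding at the beginning only
    k = len(u)
    return [sum(u[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]

x = [2, 3, 1, 3, 2, 1]
u = [1, 2, 1]

print(conv1d(x, u))          # [9, 8, 9, 8]
print(conv1d(x, u, pad=1))   # [7, 9, 8, 9, 8] -- starts with the 7, 9 from the example
```

Note that without padding the output is shorter than the input, since a filter of size k fits in only n − k + 1 positions; padding adds positions at the edge so that edge elements also appear in filter windows.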