Natural Language Processing with Deep Learning (CS224N/Ling284)

Lecture 8: Recurrent Neural Networks and Language Models

Abigail See

Announcements

• Assignment 1: Grades will be released after class

• Assignment 2: Coding session next week on Monday; details on Piazza

• Midterm logistics: Fill out form on Piazza if you can’t do main midterm, have special requirements, or other special case

Announcements

• Default Final Project (PA4) releases late tonight
  • Read the handout, look at the code, and decide which project you want to do
  • You may not understand all the technical parts yet, but you'll get an overview
  • You don't yet have the Azure resources you need to run the code

• Project proposal due next week (Thurs Feb 8)
  • Details released later today
  • Everyone submits their teams
  • Custom final project teams also describe their project

Call for participation

Overview

Today we will:

• Introduce a new NLP task: Language Modeling

motivates

• Introduce a new family of neural networks: Recurrent Neural Networks (RNNs)

THE most important idea for the rest of the class!

Language Modeling

• Language Modeling is the task of predicting what word comes next:

  the students opened their ______   (books? laptops? exams? minds?)

• More formally: given a sequence of words $x^{(1)}, x^{(2)}, \dots, x^{(t)}$, compute the probability distribution of the next word $x^{(t+1)}$:

  $P(x^{(t+1)} \mid x^{(t)}, \dots, x^{(1)})$

where $x^{(t+1)}$ can be any word in the vocabulary $V = \{w_1, \dots, w_{|V|}\}$

• A system that does this is called a Language Model.

You use Language Models every day!


n-gram Language Models

the students opened their ______

• Question: How to learn a Language Model?
• Answer (pre-Deep Learning): learn an n-gram Language Model!

• Definition: An n-gram is a chunk of n consecutive words.
  • unigrams: "the", "students", "opened", "their"
  • bigrams: "the students", "students opened", "opened their"
  • trigrams: "the students opened", "students opened their"
  • 4-grams: "the students opened their"

• Idea: Collect statistics about how frequent different n-grams are, and use these to predict next word.

n-gram Language Models

• First we make a simplifying assumption: $x^{(t+1)}$ depends only on the preceding $n-1$ words.

$P(x^{(t+1)} \mid x^{(t)}, \dots, x^{(1)}) = P(x^{(t+1)} \mid x^{(t)}, \dots, x^{(t-n+2)})$   (assumption)

$= \dfrac{P(x^{(t+1)}, x^{(t)}, \dots, x^{(t-n+2)})}{P(x^{(t)}, \dots, x^{(t-n+2)})}$   (definition of conditional prob: the prob of an n-gram over the prob of an (n-1)-gram)

• Question: How do we get these n-gram and (n-1)-gram probabilities?
• Answer: By counting them in some large corpus of text!

$\approx \dfrac{\text{count}(x^{(t+1)}, x^{(t)}, \dots, x^{(t-n+2)})}{\text{count}(x^{(t)}, \dots, x^{(t-n+2)})}$   (statistical approximation)

n-gram Language Models: Example

Suppose we are learning a 4-gram Language Model.

  as the proctor started the clock, the students opened their _____

Discard "as the proctor started the clock," and condition only on the preceding 3 words, "students opened their":

  $P(w \mid \text{students opened their}) = \dfrac{\text{count}(\text{students opened their } w)}{\text{count}(\text{students opened their})}$

In the corpus:
• "students opened their" occurred 1000 times
• "students opened their books" occurred 400 times
  → P(books | students opened their) = 0.4
• "students opened their exams" occurred 100 times
  → P(exams | students opened their) = 0.1

(Should we have discarded the "proctor" context?)
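As a concrete illustration of this counting estimate, here is a minimal sketch; the toy corpus, the whitespace tokenization, and the helper names are made up for illustration, not taken from the lecture.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple of words) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy corpus (hypothetical); a real LM would be estimated from millions of words.
corpus = "the students opened their books . the students opened their exams .".split()

four_grams = ngram_counts(corpus, 4)
trigrams   = ngram_counts(corpus, 3)

def prob(word, context):
    """P(word | context) = count(context + word) / count(context)."""
    context = tuple(context)
    denom = trigrams[context]
    return four_grams[context + (word,)] / denom if denom else 0.0

print(prob("books", ["students", "opened", "their"]))   # 0.5 on this toy corpus
print(prob("exams", ["students", "opened", "their"]))   # 0.5
```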

Problems with n-gram Language Models

Sparsity Problem 1
• Problem: What if "students opened their w" never occurred in the data? Then w has probability 0!
• (Partial) Solution: Add a small $\delta$ to the count for every $w \in V$. This is called smoothing.

Sparsity Problem 2
• Problem: What if "students opened their" never occurred in the data? Then we can't calculate the probability for any w!
• (Partial) Solution: Just condition on "opened their" instead. This is called backoff.

Note: Increasing n makes sparsity problems worse. Typically we can't have n bigger than 5.

Problems with n-gram Language Models

Storage: Need to store count for all possible n-grams. So model size is O(exp(n)).

Increasing n makes model size huge!

n-gram Language Models in practice

• You can build a simple trigram Language Model over a 1.7 million word corpus (Reuters) in a few seconds on your laptop*

Business and financial news:   today the ______

Conditioning on "today the", we get a probability distribution over the next word:

  company 0.153
  bank    0.153
  price   0.077
  italian 0.039
  emirate 0.039
  …

Sparsity problem: not much granularity in the probability distribution. Otherwise, it seems reasonable!

* Try for yourself: https://nlpforhackers.io/language-models/

Generating text with a n-gram Language Model

• You can also use a Language Model to generate text.

today the ______

condition on this → get probability distribution → sample

  company 0.153
  bank    0.153
  price   0.077
  italian 0.039
  emirate 0.039
  …
  (sampled word: "price")
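A minimal sketch of this sampling loop, reusing the hypothetical `ngram_counts` helper and toy corpus from the earlier sketch; the function names and defaults are made up for illustration.

```python
import random

def sample_next(counts, context):
    """Sample a next word from the count-based distribution P(w | context)."""
    candidates = {ngram[-1]: c for ngram, c in counts.items()
                  if ngram[:-1] == tuple(context)}
    if not candidates:                       # the sparsity problem again: unseen context
        return None
    words, weights = zip(*candidates.items())
    return random.choices(words, weights=weights)[0]

def generate(counts, seed, num_words=20):
    """Repeatedly sample, sliding the conditioning window along the output."""
    out = list(seed)
    k = len(seed)                            # context length = n - 1
    for _ in range(num_words):
        word = sample_next(counts, out[-k:])
        if word is None:
            break
        out.append(word)
    return " ".join(out)

# e.g. with trigram counts built as in the earlier sketch:
# print(generate(ngram_counts(corpus, 3), ["today", "the"]))
```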

Generating text with a n-gram Language Model

• You can also use a Language Model to generate text.

today the price ______

condition on this → get probability distribution → sample

  of  0.308
  for 0.050
  it  0.046
  to  0.046
  is  0.031
  …
  (sampled word: "of")

Generating text with a n-gram Language Model

• You can also use a Language Model to generate text.

today the price of ______

condition on this → get probability distribution → sample

  the  0.072
  18   0.043
  oil  0.043
  its  0.036
  gold 0.018
  …
  (sampled word: "gold")

Generating text with a n-gram Language Model

• You can also use a Language Model to generate text.

today the price of gold ______

Generating text with a n-gram Language Model

• You can also use a Language Model to generate text.

today the price of gold per ton , while production of shoe lasts and shoe industry , the bank intervened just after it considered and rejected an imf demand to rebuild depleted european stocks , sept 30 end primary 76 cts a share .

Incoherent! We need to consider more than 3 words at a time if we want to generate good text.

But increasing n worsens sparsity problem, and exponentially increases model size…

How to build a neural Language Model?

• Recall the Language Modeling task:
  • Input: sequence of words $x^{(1)}, x^{(2)}, \dots, x^{(t)}$
  • Output: probability distribution of the next word, $P(x^{(t+1)} \mid x^{(t)}, \dots, x^{(1)})$

• How about a window-based neural model? • We saw this applied to Named Entity Recognition in Lecture 4

A fixed-window neural Language Model

as the proctor started the clock the students opened their ______
(discard everything outside the fixed window; condition only on the window "the students opened their")

A fixed-window neural Language Model

[figure: the fixed-window model, bottom to top]
• words / one-hot vectors $x^{(1)}, x^{(2)}, x^{(3)}, x^{(4)}$: the students opened their
• concatenated word embeddings: $e = [e^{(1)}; e^{(2)}; e^{(3)}; e^{(4)}]$
• hidden layer: $h = f(We + b_1)$
• output distribution over the vocabulary (books, laptops, …, a, zoo): $\hat{y} = \text{softmax}(Uh + b_2) \in \mathbb{R}^{|V|}$
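A minimal NumPy sketch of this forward pass, following the layer labels above; the sizes, word indices, random initialization, and the choice of tanh as the nonlinearity f are made-up placeholders, not the lecture's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, window, hidden = 10_000, 100, 4, 200     # hypothetical sizes

E  = rng.normal(0, 0.1, size=(V, d))           # embedding matrix (one row per word)
W  = rng.normal(0, 0.1, size=(hidden, window * d))
b1 = np.zeros(hidden)
U  = rng.normal(0, 0.1, size=(V, hidden))
b2 = np.zeros(V)

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def forward(word_ids):
    """word_ids: indices of the `window` most recent words, e.g. 'the students opened their'."""
    e = np.concatenate([E[i] for i in word_ids])   # concatenated word embeddings
    h = np.tanh(W @ e + b1)                        # hidden layer h = f(We + b1)
    return softmax(U @ h + b2)                     # output distribution over the vocabulary

y_hat = forward([3, 514, 1028, 27])                # hypothetical word indices
print(y_hat.shape, y_hat.sum())                    # (10000,) ~1.0
```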

A fixed-window neural Language Model

Improvements over n-gram LM:
• No sparsity problem
• Model size is O(n), not O(exp(n))

Remaining problems:
• Fixed window is too small
• Enlarging the window enlarges $W$
• The window can never be large enough!
• Each $x^{(i)}$ uses different rows of $W$. We don't share weights across the window.

We need a neural architecture that can process any-length input!

Recurrent Neural Networks (RNN): a family of neural architectures.
Core idea: apply the same weights repeatedly.

[figure: an unrolled RNN; an input sequence of any length at the bottom, a chain of hidden states in the middle (each computed from the previous hidden state and the current input), and optional outputs at each step on top]

A RNN Language Model

[figure: the RNN-LM unrolled over "the students opened their", bottom to top]
• words / one-hot vectors $x^{(t)} \in \mathbb{R}^{|V|}$: the students opened their
• word embeddings: $e^{(t)} = E x^{(t)}$
• hidden states: $h^{(t)} = \sigma(W_h h^{(t-1)} + W_e e^{(t)} + b_1)$, where $h^{(0)}$ is the initial hidden state
• output distribution over the vocabulary (books, laptops, …, a, zoo): $\hat{y}^{(t)} = \text{softmax}(U h^{(t)} + b_2) \in \mathbb{R}^{|V|}$

Note: this input sequence could be much longer, but this slide doesn't have space!
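A minimal NumPy sketch of running these equations over a sequence; the sizes, word indices, and random initialization are made-up placeholders, and the elementwise nonlinearity is taken to be the logistic sigmoid.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, hidden = 10_000, 100, 200                  # hypothetical sizes

E   = rng.normal(0, 0.1, size=(V, d))            # embedding matrix
W_h = rng.normal(0, 0.1, size=(hidden, hidden))  # applied to the previous hidden state
W_e = rng.normal(0, 0.1, size=(hidden, d))       # applied to the current word embedding
b1  = np.zeros(hidden)
U   = rng.normal(0, 0.1, size=(V, hidden))
b2  = np.zeros(V)

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_lm(word_ids):
    """Return the output distribution for every step of the input sequence."""
    h = np.zeros(hidden)                           # h(0): the initial hidden state
    y_hats = []
    for idx in word_ids:
        e = E[idx]                                 # e(t) = E x(t)
        h = sigmoid(W_h @ h + W_e @ e + b1)        # the same W_h, W_e reused at every step
        y_hats.append(softmax(U @ h + b2))         # output distribution at step t
    return y_hats

y_hats = rnn_lm([3, 514, 1028, 27])                # "the students opened their" (made-up ids)
print(len(y_hats), y_hats[-1].shape)               # 4 (10000,)
```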

A RNN Language Model

RNN Advantages:
• Can process any-length input
• Model size doesn't increase for longer input
• Computation for step t can (in theory) use information from many steps back
• Weights are shared across timesteps, so representations are shared

RNN Disadvantages:
• Recurrent computation is slow
• In practice, it is difficult to access information from many steps back
(More on these next week.)

Training a RNN Language Model

• Get a big corpus of text, which is a sequence of words $x^{(1)}, \dots, x^{(T)}$
• Feed it into the RNN-LM; compute the output distribution $\hat{y}^{(t)}$ for every step t
  • i.e. predict the probability distribution of every word, given the words so far

• The loss function on step t is the usual cross-entropy between our predicted probability distribution $\hat{y}^{(t)}$ and the true next word $y^{(t)} = x^{(t+1)}$:

  $J^{(t)}(\theta) = CE(y^{(t)}, \hat{y}^{(t)}) = -\sum_{w \in V} y^{(t)}_w \log \hat{y}^{(t)}_w = -\log \hat{y}^{(t)}_{x^{(t+1)}}$

• Average this to get the overall loss for the entire training set:

  $J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)$
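A minimal sketch of this loss, written against per-step distributions like those returned by the `rnn_lm` sketch above; the word indices in the comment are made up.

```python
import numpy as np

def sequence_loss(y_hats, target_ids):
    """Average cross-entropy over a sequence.
    y_hats[t]     : predicted distribution over the vocabulary at step t
    target_ids[t] : index of the true next word x(t+1)
    """
    step_losses = [-np.log(y_hat[target]) for y_hat, target in zip(y_hats, target_ids)]
    return float(np.mean(step_losses))

# With the rnn_lm sketch above: inputs "the students opened their",
# targets are the same sentence shifted by one, ending in "exams" (all ids made up):
# loss = sequence_loss(rnn_lm([3, 514, 1028, 27]), [514, 1028, 27, 90])
```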

Training a RNN Language Model

Example corpus: "the students opened their exams …"
• Loss on step 1: $J^{(1)}(\theta)$ = negative log prob of "students"
• Loss on step 2: $J^{(2)}(\theta)$ = negative log prob of "opened"
• Loss on step 3: $J^{(3)}(\theta)$ = negative log prob of "their"
• Loss on step 4: $J^{(4)}(\theta)$ = negative log prob of "exams"
• …
• Overall loss: $J(\theta) = \frac{1}{T}\left(J^{(1)}(\theta) + J^{(2)}(\theta) + J^{(3)}(\theta) + J^{(4)}(\theta) + \dots\right)$

Training a RNN Language Model

• However: computing the loss and gradients across the entire corpus is too expensive!
• Recall: Stochastic Gradient Descent allows us to compute the loss and gradients for a small chunk of data, and update.

• In practice, we therefore consider $x^{(1)}, \dots, x^{(T)}$ to be a sentence (or a document)

• Compute loss for a sentence (actually usually a batch of sentences), compute gradients and update weights. Repeat.

Backpropagation for RNNs

[figure: an RNN unrolled over time, with the same weight matrix $W_h$ applied at every step]

Question: What's the derivative of $J^{(t)}(\theta)$ w.r.t. the repeated weight matrix $W_h$?

Answer: $\dfrac{\partial J^{(t)}}{\partial W_h} = \sum_{i=1}^{t} \left.\dfrac{\partial J^{(t)}}{\partial W_h}\right|_{(i)}$

"The gradient w.r.t. a repeated weight is the sum of the gradient w.r.t. each time it appears."

Why?

Multivariable Chain Rule

For a function $f(x, y)$ with $x = x(t)$ and $y = y(t)$:
$\dfrac{d}{dt} f(x(t), y(t)) = \dfrac{\partial f}{\partial x}\dfrac{dx}{dt} + \dfrac{\partial f}{\partial y}\dfrac{dy}{dt}$

Source: https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/differentiating-vector-valued-functions/a/multivariable-chain-rule-simple-version

Backpropagation for RNNs: Proof sketch

In our example, $J^{(t)}$ depends on the shared weight $W_h$ only through its appearance $W_h\big|_{(i)}$ at each timestep i. Apply the multivariable chain rule:

$\dfrac{\partial J^{(t)}}{\partial W_h} = \sum_{i=1}^{t} \left.\dfrac{\partial J^{(t)}}{\partial W_h}\right|_{(i)} \dfrac{\partial W_h\big|_{(i)}}{\partial W_h} = \sum_{i=1}^{t} \left.\dfrac{\partial J^{(t)}}{\partial W_h}\right|_{(i)}$, since $\dfrac{\partial W_h\big|_{(i)}}{\partial W_h} = 1$

Source: https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/differentiating-vector-valued-functions/a/multivariable-chain-rule-simple-version

Backpropagation for RNNs


Question: How do we calculate this?

Answer: Backpropagate over timesteps i = t, …, 0, summing gradients as you go. This algorithm is called "backpropagation through time".
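As a sanity check on that statement, here is a tiny scalar sketch (a made-up toy recurrence, not the lecture's model) where the backpropagation-through-time gradient, accumulated as a sum over the timesteps where the weight appears, is compared against a finite-difference estimate.

```python
import numpy as np

# Tiny scalar "RNN": h(t) = tanh(w * h(t-1) + x(t)), loss J = h(T).
# The same weight w is used at every step, so dJ/dw is the SUM of the
# contributions from each timestep where w appears.

def forward(w, xs):
    hs = [0.0]                                   # h(0) = 0
    for x in xs:
        hs.append(np.tanh(w * hs[-1] + x))
    return hs

def grad_w(w, xs):
    hs = forward(w, xs)
    dh = 1.0                                     # dJ/dh(T), since J = h(T)
    total = 0.0
    for t in range(len(xs), 0, -1):              # backprop through time: t = T, ..., 1
        pre = w * hs[t - 1] + xs[t - 1]          # pre-activation at step t
        dpre = dh * (1 - np.tanh(pre) ** 2)
        total += dpre * hs[t - 1]                # contribution of w at timestep t
        dh = dpre * w                            # pass the gradient back to h(t-1)
    return total

w, xs = 0.7, [0.1, -0.3, 0.5]
eps = 1e-6
numeric = (forward(w + eps, xs)[-1] - forward(w - eps, xs)[-1]) / (2 * eps)
print(grad_w(w, xs), numeric)                    # the two estimates agree
```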

Generating text with a RNN Language Model

Just like an n-gram Language Model, you can use a RNN Language Model to generate text by repeated sampling. The sampled output becomes the next step's input.

Example: starting from "my" and repeatedly sampling, the model generates "my favorite season is spring"; each sampled word ("favorite", "season", "is", "spring") is fed back in as the next input.
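A minimal sketch of that sampling loop, reusing the hypothetical parameters (E, W_h, W_e, b1, U, b2, hidden) and the sigmoid helper from the RNN-LM sketch above; the temperature knob is an extra illustration, not something from the lecture.

```python
import numpy as np

def generate(start_id, num_words, temperature=1.0, seed=0):
    """Generate by repeated sampling: sample the next word, feed it back in as input."""
    rng = np.random.default_rng(seed)
    h = np.zeros(hidden)
    idx, out = start_id, [start_id]
    for _ in range(num_words):
        h = sigmoid(W_h @ h + W_e @ E[idx] + b1)     # advance the hidden state
        logits = (U @ h + b2) / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        idx = int(rng.choice(len(probs), p=probs))   # sample the next word index
        out.append(idx)
    return out                                       # map indices back to words with your vocabulary

# e.g. generate(start_id=42, num_words=10)
```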

Generating text with a RNN Language Model

• Let’s have some fun! • You can train a RNN-LM on any kind of text, then generate text in that style. • RNN-LM trained on Obama speeches:

Source: https://medium.com/@samim/obama-rnn-machine-generated-political-speeches-c8abd18a2ea0

Generating text with a RNN Language Model

• Let’s have some fun! • You can train a RNN-LM on any kind of text, then generate text in that style. • RNN-LM trained on Harry Potter:

Source: https://medium.com/deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6

Generating text with a RNN Language Model

• Let’s have some fun! • You can train a RNN-LM on any kind of text, then generate text in that style. • RNN-LM trained on Seinfeld scripts:

Source: https://www.avclub.com/a-bunch-of-comedy-writers-teamed-up-with-a-computer-to-1818633242

Generating text with a RNN Language Model

• Let’s have some fun! • You can train a RNN-LM on any kind of text, then generate text in that style. • (character-level) RNN-LM trained on paint colors:

Source: http://aiweirdness.com/post/160776374467/new-paint-colors-invented-by-neural-network

Evaluating Language Models

• The traditional evaluation metric for Language Models is perplexity.

$\text{perplexity} = \prod_{t=1}^{T} \left( \frac{1}{P_{\text{LM}}(x^{(t+1)} \mid x^{(t)}, \dots, x^{(1)})} \right)^{1/T}$

This is the inverse probability of the dataset, normalized by the number of words; it equals the exponential of the average cross-entropy loss $J(\theta)$.

• Lower is better! • In Assignment 2 you will show that minimizing perplexity and minimizing the loss function are equivalent.
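A minimal sketch of computing perplexity from per-step predicted distributions; the uniform-model check at the end is just an illustration of the definition.

```python
import numpy as np

def perplexity(y_hats, target_ids):
    """Perplexity = exp(average cross-entropy) = geometric mean of 1 / P(true next word)."""
    nll = np.mean([-np.log(y_hat[t]) for y_hat, t in zip(y_hats, target_ids)])
    return float(np.exp(nll))

# Sanity check: a model that assigns uniform probability over a vocabulary of
# size |V| has perplexity exactly |V|.
V = 10_000
uniform = [np.full(V, 1.0 / V)] * 3
print(perplexity(uniform, [5, 42, 7]))   # 10000.0 (up to floating-point rounding)
```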

RNNs have greatly improved perplexity

[table: perplexity of an n-gram model vs. increasingly complex RNNs; perplexity improves (lower is better) as the models become more sophisticated]

Source: https://research.fb.com/building-an-efficient-neural-language-model-over-a-billion-words/

Why should we care about Language Modeling?

• Language Modeling is a subcomponent of other NLP systems:
  • Speech recognition: use a LM to generate the transcription, conditioned on the audio
  • Machine translation: use a LM to generate the translation, conditioned on the original text
  • Summarization: use a LM to generate the summary, conditioned on the original text
  (These systems are called conditional Language Models.)

• Language Modeling is a benchmark task that helps us measure our progress on understanding language

Recap

• Language Model: A system that predicts the next word

• Recurrent Neural Network: A family of neural networks that:
  • Take sequential input of any length
  • Apply the same weights on each step
  • Can optionally produce output on each step

• Recurrent Neural Network ≠ Language Model

• We’ve shown that RNNs are a great way to build a LM.

• But RNNs are useful for much more!

RNNs can be used for tagging e.g. part-of-speech tagging, named entity recognition

the/DT startled/VBN cat/NN knocked/VBN over/IN the/DT vase/NN

RNNs can be used for sentence classification e.g. sentiment classification

How to compute the sentence encoding?

[figure: an RNN runs over "overall I enjoyed the movie a lot"; a sentence encoding is computed from its hidden states and used to predict the label "positive"]

RNNs can be used for sentence classification e.g. sentiment classification

How to compute the sentence encoding?

Basic way: use the final hidden state as the sentence encoding.
(example: "overall I enjoyed the movie a lot")

RNNs can be used for sentence classification e.g. sentiment classification

How to compute the sentence encoding?

Usually better: take the element-wise max or mean of all the hidden states as the sentence encoding.
(example: "overall I enjoyed the movie a lot")
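A minimal sketch of these pooling choices over a list of hidden-state vectors; the function and mode names are made up for illustration.

```python
import numpy as np

def sentence_encoding(hidden_states, mode="max"):
    """Collapse the RNN hidden states h(1)..h(T) into a single sentence vector."""
    H = np.stack(hidden_states)          # shape (T, hidden_size)
    if mode == "final":
        return H[-1]                     # basic way: just the final hidden state
    if mode == "max":
        return H.max(axis=0)             # usually better: element-wise max over time
    return H.mean(axis=0)                # or the element-wise mean

# e.g. feed the resulting vector into a classifier to predict "positive" / "negative"
```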

RNNs can be used to generate text e.g. speech recognition, machine translation, summarization

[figure: a speech-recognition example; conditioned on the audio input, the RNN generates the transcription "what's the weather" one word at a time, feeding each generated word ("what's", "the", …) back in as the next input]

Remember: these are called “conditional language models”. We’ll see Machine Translation in much more detail later.

RNNs can be used as an encoder module e.g. question answering, machine translation

[figure: a question-answering example]
Context: Ludwig van Beethoven was a German composer and pianist. A crucial figure …
Question: what nationality was Beethoven ?
Question encoding = element-wise max of the hidden states of an RNN run over the Question.
Answer: German

Here the RNN acts as an encoder for the Question (the hidden states represent the Question). The encoder is part of a larger neural system.

A note on terminology

RNN described in this lecture = “vanilla RNN”

Next lecture: You will learn about other RNN flavors like GRU and LSTM and multi-layer RNNs

By the end of the course: You will understand phrases like “stacked bidirectional LSTM with residual connections and self-attention”

Next time

• Problems with RNNs! • Vanishing gradients

motivates

• Fancy RNN variants! • LSTM • GRU • multi-layer • bidirectional
