PROCESSING NATURAL LANGUAGE

DAY 1: INTRODUCTION TO NLP

Mark Granroth-Wilding


COURSE OUTLINE

Lecturers: Mark Granroth-Wilding, Leo Leppänen, Lidia Pivovarova

Day  Topic
Week 1
 1   Introduction to NLP
 2   NLU pipeline and toolkits
 3   Finite state methods; statistical NLP
 4   Syntax and parsing
 5   Evaluation
Week 2
 6   NLG and dialogue
 7   Vector space models and lexical semantics
 8   Information extraction; advanced statistical NLP
 9   Ascension: no lectures
10   Semantics and pragmatics; the future

PROCESSING NATURAL LANGUAGE

Getting information out of language

Example: Who made the first electric guitar?
• Query: string of characters
• How to represent meaning in a suitable form to find the answer?
• How to get from text to that?
• What sort of processes are needed?
• What does the system need to know about language?


DAY OUTLINE

Each day:
 9:15 – 12:00   Lectures (D123)
12:00 – 13:15   Lunch
13:15 – 14:00   Introduction (BK107)
14:00 – ∼16:00  Practical assignments

ASSIGNMENTS

• Afternoons: practical programming assignments
• Daily introduction
• TAs available to help
• Includes:
  • Python programming
  • Use of NLP tools
  • Implementation of algorithms, statistical models, etc. from lectures
  • Analysis of system output/behaviour
  • Consideration of method uses, limitations, ...


TEACHING ASSISTANTS

Assistance with assignments will be provided by:
Mark, Leo, Lidia, Eliel, Khalid, Elaine

ASSIGNMENTS

• Submitted same day: i.e. not homework
• Submit using Moodle
• Important part of learning (not optional)
• Marked in 1–2 days
• No detailed personal feedback
• Collective feedback in lectures: common problems
• Ask specific questions on the Moodle forum

ASSESSMENT

• Requirements to pass the course (speak to me if problematic):
  • Attend all lectures
  • Attend all practicals (at least the start)
  • Attempt assignments
  • Pass 2/3 assignments
• Grading available shortly after the course
• We don't expect state-of-the-art, amazing systems!
• We do expect you to:
  • try everything
  • show understanding of lecture content

COURSE MATERIALS

Course homepage: https://g-w.fi/nlp2019
• Lecture slides
• Further reading recommendations: end of lectures
• Assignment instructions & data
• Moodle


MOODLE FORUM

• Feel free to post
• Discuss assignments / lecture content
• Help other students


MOODLE GLOSSARY

• Lots of new terminology
• Further explanations of material
• More details
• We'll add explanations during the course
• Add new entries (dummy content) to suggest an addition

READING MATERIAL

• Provided at end of each lecture
• Not expected to read everything in the evenings!
• Further reading to delve deeper


READING MATERIAL

Main course textbook: Speech and Language Processing, Jurafsky & Martin, 2nd ed.
• New draft: https://web.stanford.edu/~jurafsky/slp3/
• References to the online draft where possible: J&M3
• Print version, 2nd edition: J&M2

READING MATERIAL

• Foundations of Statistical NLP, Manning & Schütze, 1999. Good reference for statistical topics.
• NLP with Python, 'The NLTK book', Bird, Klein & Loper. https://www.nltk.org/book/
• Natural Language Processing, Eisenstein. https://tinyurl.com/eisenstein-nlp
• Linguistic Fundamentals of NLP, Bender, 2013. http://tinyurl.com/bender-nlp

PRE-COURSE QUESTIONNAIRE

https://presemo.helsinki.fi/nlp2019
• Quick questionnaire: your familiarity with topics
• Not a test! Anonymous
• Same link: ask questions during the lecture
• Fine if all answers are 1! We'll learn about everything

Rating  Meaning
1       Never heard of it
2       Name familiar
3       Basic familiarity (not much detail)
4       Studied/read about before
5       Studied in detail

WHY NLP?

Why do we need computers to understand (or generate) human language?
• People expect interactive agents to communicate in NL, e.g. dialogue systems
• Huge knowledge encoded in language; hard to find: requires NLP
• Automatic processing central to AI: the knowledge acquisition bottleneck
• Information extraction (more later)


WHY NLP?

• Search: corpora, libraries, medical datasets, ...
• Computational models of human processing
• Tools for studying:
  • language (corpus linguistics)
  • sociology
  • history...
• Analysing human behaviour
• And much more!

WHY EVEN SIMPLE NLP IS HARD

Example: What is the forecast mean daytime temperature for Kumpula tomorrow?
• Simple: the answer is in a database!
• No reasoning/computation: just a query

    SELECT day_mean FROM daily_forecast
    WHERE station = 'Helsinki Kumpula'
    AND date = '2019-05-21';
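To make the gap concrete, here is a minimal sketch of what a naive question-to-query mapping might look like. The table and column names come from the slide; the regular expression, the function name and the assumption that every station is in Helsinki are purely illustrative.

    import re
    from datetime import date, timedelta

    # One rigid pattern: this only works while the user phrases the question
    # in exactly this way -- which is precisely the problem.
    PATTERN = re.compile(
        r"What is the forecast mean daytime temperature for (\w+) tomorrow\?")

    def question_to_sql(question):
        match = PATTERN.match(question)
        if match is None:
            raise ValueError("cannot interpret question")
        station = f"Helsinki {match.group(1)}"      # assumes a Helsinki station
        tomorrow = date.today() + timedelta(days=1)
        return (f"SELECT day_mean FROM daily_forecast "
                f"WHERE station = '{station}' AND date = '{tomorrow}';")

    print(question_to_sql(
        "What is the forecast mean daytime temperature for Kumpula tomorrow?"))

Any rephrasing ('How hot will it be...?') falls outside the pattern, and superficially similar questions ('...mean salary...') would be silently mistranslated: the next slides illustrate both failure modes.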


WHY EVEN SIMPLE NLP IS HARD

What is the forecast mean daytime temperature for Kumpula tomorrow?
→ SELECT day_mean FROM daily_forecast
  WHERE station = 'Helsinki Kumpula'
  AND date = '2019-05-21';

What temperature is predicted in Kumpula during the day tomorrow? → ?
How hot will it be in Arabia tomorrow? → ?

• Many ways to say the same thing

WHY EVEN SIMPLE NLP IS HARD

What is the forecast mean daytime temperature for Kumpula tomorrow?
→ SELECT day_mean FROM daily_forecast
  WHERE station = 'Helsinki Kumpula'
  AND date = '2019-05-21';

What is the forecast mean salary for Kumpula tomorrow? → ?
What is the forecast mean salary for CEOs tomorrow? → ?

• Similar utterances mean very different things


WHY EVEN SIMPLE NLP IS HARD

What is the mean temperature in Kumpula?

    SELECT day_mean FROM daily_forecast
    WHERE station = 'Helsinki Kumpula'
    AND date = '2019-05-21';

    SELECT day_mean FROM weekly_forecast
    WHERE station = 'Helsinki Kumpula'
    AND week = 'w22';

    SELECT MEAN(day_temp) FROM weather_history
    WHERE station = 'Helsinki Kumpula'
    AND year = '2019';

• Ambiguity
  • Many forms
  • At every level/step of analysis
  • The big challenge of NLP

EXERCISE
In small groups

• Look at the sentence below
• Assume you:
  • are a computer
  • have a database of logical/factual world knowledge
  • have lots of rules/statistics about English
• What steps are involved in:
  • analysing this textual input?
  • extracting & encoding relevant information?
  • answering the question?
  • ...?

A robotic co-pilot developed under DARPA's programme has already flown a light aircraft.

What agency has created a computer that can pilot a plane?

NATURAL LANGUAGE PROCESSING

    Language (text, speech)  --NLU-->  Knowledge representation
                             <--NLG--

• Natural Language Understanding (NLU)
• Natural Language Generation (NLG)
• Mostly different models/algorithms
  • Some sharing possible
• This course: mostly NLU
  • NLG: day 6

MACHINE TRANSLATION

    Language 1 text  --NLU-->  Knowledge representation  --NLG-->  Language 2 text

• Not the standard approach


MACHINE TRANSLATION

• The MT pyramid: a variety of approaches translate at different levels
  • Bottom: direct translation, text to text
  • Middle: syntactic rules; phrases, syntax, semantics, ...
  • Top: interlingua (NLU all the way up, NLG all the way down)
• Large field: no more detail here. Plenty of courses available!

STEPS OF NLU

Task: given a sentence, get some representation the computer can use, e.g. for question answering.

John loves Mary

1. Divide into words (by spaces)
2. Identify John and Mary as names (potentially tricky: ∼100k in English...)
3. Recognise the main relation: loves
4. Identify John as agent, Mary as patient
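A rough sketch of the first steps using NLTK, the toolkit used in the practicals. The agent/patient heuristic at the end is a deliberately naive stand-in for real syntactic analysis (day 4); it only works for this word order.

    import nltk

    # Assumes the standard NLTK models have been downloaded, e.g.
    # nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

    sentence = "John loves Mary"
    tokens = nltk.word_tokenize(sentence)   # step 1: divide into words
    tagged = nltk.pos_tag(tokens)           # steps 2-3: NNP tags mark the names,
                                            # VBZ marks the relation 'loves'
    print(tagged)   # typically [('John', 'NNP'), ('loves', 'VBZ'), ('Mary', 'NNP')]

    # Step 4 really needs syntax; as a crude heuristic for this word order,
    # take the name before the verb as agent and the one after as patient.
    verb = next(i for i, (_, tag) in enumerate(tagged) if tag.startswith('VB'))
    agent, patient = tokens[verb - 1], tokens[verb + 1]
    print(agent, '->', patient)             # John -> Mary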


STEPS OF NLU

The number of moped-related crimes rose from 827 in 2012 to more than 23,000 last year.

Extra difficulties:
• More difficult to segment words
• Varied vocabulary
• More complex syntax
• More complex meaning structure
• Vagueness/ambiguity in meaning
• Noise

STEPS OF NLU

The list of people unhappy with this decision is decidedly longer and more comprehensive than the list that support the move.

• Reference to earlier context
• Repeated references
• Actual meaning requires inference!

And then... (more later today)
• Ambiguity
• Disfluency
• Multiple languages


LESS NATURAL LANGUAGE?

Why not just write a more natural query language?

Query: Give me the daily mean from the forecast data for the station called 'Helsinki Kumpula' for 21.5.2019.

• Still have to learn a specialized language to interact
• Not good for non-expert users
• Natural interaction requires natural language
• Not just interaction: extraction of information from (existing) text

A BRIEF HISTORY OF NLP

1600s  Discussion of machine translation (MT): theoretical!
1930s  Early proposals for MT using dictionaries
1950   Alan Turing: proposed the 'Turing Test', which depends on NLP
1954   Georgetown-IBM experiment: simple closed-domain MT, some grammatical rules
1957   Noam Chomsky: Syntactic Structures
       formal grammars, NLP becomes computable!

A BRIEF HISTORY OF NLP

1960s–70s  Algorithms for parsing, semantic reasoning
           Formal representations of syntax, semantics, logic
           Hand-written rules
1964   ELIZA: simple dialogue system
1970   SHRDLU: narrow-domain system with NL commands
1970   Augmented Transition Networks: automata for parsing text
1980s  More sophisticated parsing, semantics, reasoning
       Applications!

A BRIEF HISTORY OF NLP

1990s  Statistical methods: statistical models for sub-tasks
1987   Probabilistic n-gram language models
1996   MT: IBM statistical, word-based models
1997   Parsing: probabilistic context-free grammars (PCFGs)
1998   Distributional semantics (≈ word embeddings)
1999   Probabilistic (unsupervised) topic models


A BRIEF HISTORY OF NLP

2010s  More computing power, more data, more statistics
       Deep learning, neural networks, Bayesian models, ...
2013   word2vec: word embeddings from lots of data
2014   RNNs for MT
2015   RNNs for NLG
And so on...

WORD FREQUENCIES

Type          Tokens
,             11 341
the            5 792
I              5 087
and            4 708
...
unhappy            5
resolve            5
murderers          5
...
overwhelm          1
lamented           1
insufficient       1
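Counts like these take only a few lines to produce. A minimal sketch (the corpus file name is hypothetical, and splitting on whitespace is a crude stand-in for real tokenization, covered on day 2):

    from collections import Counter

    # Crude whitespace tokenization; a real tokenizer would also separate
    # punctuation such as the ',' row in the table above.
    with open('corpus.txt', encoding='utf-8') as f:
        tokens = f.read().split()

    freqs = Counter(tokens)
    print(freqs.most_common(4))                        # the high-frequency head
    print(sum(1 for n in freqs.values() if n == 1),    # the long tail:
          'types occur only once')                     # hapax legomena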


WORD FREQUENCIES

[Figure: token frequencies by rank. Most frequent: ','; next: 'the'; frequency drops off very quickly.]

ZIPF'S LAW

[Figure: the same rank-frequency distribution on a log-log scale, where it is roughly a straight line.]

• Inverse power-law distribution of frequencies
• Zipf's law / Zipfian distribution
• A few things are very common
• Many things are very rare: the 'long tail'
• Holds for almost any linguistic phenomenon
• At many levels of linguistic analysis
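A quick way to see this on your own data is a rank-frequency plot on log-log axes, sketched below with matplotlib (the corpus file name is again hypothetical):

    import matplotlib.pyplot as plt
    from collections import Counter

    tokens = open('corpus.txt', encoding='utf-8').read().split()
    freqs = sorted(Counter(tokens).values(), reverse=True)

    # A Zipfian distribution appears roughly as a straight line on log-log axes
    plt.loglog(range(1, len(freqs) + 1), freqs)
    plt.xlabel('Rank')
    plt.ylabel('Frequency')
    plt.title("Zipf's law")
    plt.show()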


ZIPF'S LAW

[Figure: rank-frequency plot again, highlighting the head and the long tail.]

• Difficulty: rare things often contain the most information
  • the, a, for: frequent, little information
  • lamented, insufficient: rare, informative

ERRORS

• No tool is perfect! (Or human)
• Language is ambiguous, variable, noisy: any system makes errors
• Often not a problem!
  • Corpus-wide statistical analysis
  • Some level of error OK, if reasonably randomly distributed
• Some errors may be more problematic, e.g.:
  • Affect meaning in important ways
  • Consistent across many analyses

ERRORS

• Quantitative evaluation important: day 5
• Error analysis: some errors are worse than others
• Understand your tool's weaknesses!
• Zipf's law: common things are easy
  • Easily overrepresented in evaluation
  • Harder, rarer phenomena are more important

NLP PIPELINE

    Web text
      ↓ Low-level processing
    Annotated text
      ↓ ...
      ↓ Abstract processing
    Further annotated text
      ↓
    Structured knowledge base

• Complete NLU is too complex
• Break into manageable subtasks; develop separate tools
• Different applications, different combinations
• Reuse effort on individual tools
• Lots of research effort on subtasks
• Often, tools & data are public


THE NLP PIPELINE

    Web text
      ↓ Tokenization
    Tokenized text
      ↓ ... (e.g. parsing)
      ↓ Semantic role labeling
    SR-labeled text
      ↓
    Structured knowledge base

Advantages:
• Reuse of tools
• Common work on subtasks
• Evaluation of components
• Easy combinations for many applications

Disadvantages of a pipeline:
• Discrete stages: no feedback
• Improvements on sub-tasks might not benefit pipelines

Standard sub-tasks/tools: see day 2.
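As a toy illustration of the pipeline idea, here is a three-stage chain in NLTK. NLTK has no off-the-shelf semantic role labeler, so named-entity chunking stands in for the final stage; each stage consumes the previous stage's output.

    import nltk

    # Assumes nltk.download() has fetched 'punkt', 'averaged_perceptron_tagger',
    # 'maxent_ne_chunker' and 'words'.
    text = ("A robotic co-pilot developed under DARPA's ALIAS programme "
            "has already flown a light aircraft.")

    tokens = nltk.word_tokenize(text)   # raw text -> tokenized text
    tagged = nltk.pos_tag(tokens)       # tokens   -> POS-annotated text
    chunks = nltk.ne_chunk(tagged)      # tagged   -> named-entity chunks
    print(chunks)

Each stage could be swapped for any other tool with the same input/output interface, which is exactly what makes the pipeline design reusable.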


EXERCISE: PIPE DREAMS
In small groups

• Suggest some pipeline components involved in
  • analysing the input
  • linking it to related sources of information (e.g. web pages, encyclopedias, news articles)

A robotic co-pilot developed under DARPA's ALIAS programme has already flown a light aircraft.

DISCUSSION: PIPE DREAMS
In small groups

• Same sentence again
• Suggest some pipeline components involved in
  • analysing the input
  • linking it to related sources of information (e.g. web pages, encyclopedias, news articles)
• Ideas?
• More on NLP pipelines tomorrow...

A robotic co-pilot developed under DARPA's ALIAS programme has already flown a light aircraft.


CORPORA

Corpus (pl. corpora): the body of written or spoken material upon which a linguistic analysis is based.

Why do we need corpora?
• Test linguistic hypotheses (not so much on this course)
• Evaluate tools: annotated/labelled corpus (evaluation: day 5)
• Train statistical models (statistical NLP: day 3)

CORPORA

• Most often: a collection of text
• Other types: speech (audio), video, numeric data, ...
• Often combinations
• Vary in:
  • source: newspapers, websites, legal documents, ...
  • language(s)
  • domain (i.e. subject matter)
  • size
  • quality
  • annotations
  • and more...
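NLTK bundles a number of small corpora, both raw and annotated, which the practicals will use. A brief sketch (assumes the corpora have been fetched with nltk.download()):

    from nltk.corpus import gutenberg, treebank

    # A raw text corpus: token and type counts for one document
    words = gutenberg.words('austen-emma.txt')
    print(len(words), 'tokens;', len(set(w.lower() for w in words)), 'types')

    # An annotated corpus: a sample of the Penn Treebank with POS tags
    print(treebank.tagged_words()[:5])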

STATISTICAL MODELS

Why use statistical models for NLP?
• Help model ambiguity / uncertainty
• Multiple interpretations
• Weights/confidences derived from data
• Express uncertainty in the output

STATISTICAL MODELS

• Older systems were rule-based
• Used long-studied linguistic knowledge, but:
  • Lots of rules
  • Complex interactions
  • Narrow domain
  • Hard to handle varied ("incorrect") and changing language
• Statistics from data can help


STATISTICAL MODELS

• Can we try to learn everything from data?
  • Practical and theoretical difficulties
  • Some success in recent work (advanced statistical NLP: day 8)
• Mostly: focussed statistical modelling of sub-tasks
• Supervised & unsupervised learning
• Collect statistics (learn) from corpora:
  • annotated data (supervised) / raw data (unsupervised)
  • a small example of the latter follows below
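To make 'collecting statistics from raw data' concrete, here is a minimal unsupervised sketch: maximum-likelihood bigram statistics from a toy corpus (the corpus and the function are purely illustrative).

    from collections import Counter

    corpus = "the cat sat on the mat . the cat ate .".split()   # toy data

    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p(word, prev):
        """Maximum-likelihood estimate of P(word | prev)."""
        return bigrams[(prev, word)] / unigrams[prev]

    print(p('cat', 'the'))   # 2/3: 'cat' follows 'the' in two of three cases

Even counts this crude give a model graded preferences between alternative readings, which is what purely rule-based systems lacked.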


MORE CHALLENGES IN NLP

• More things that make NLP difficult...
• Some of these in more detail on day 10: hard stuff, open problems

MORE CHALLENGES IN NLP
Multiple languages

• Until recently, most NLP was in English only
  • Lots of data
  • Many applications (e.g. web data)
  • Also, relatively easy
• Same systems for all languages?
  • Depends on the task
  • Depends on the data
• Multilingual NLP: hard!
  • Much more work in recent years

MORE CHALLENGES IN NLP
Multiple languages

[Figure: numbers of speakers, from Mandarin (935M), Spanish (390M) and English (365M) down to Czech (10M), Finnish (5.5M), Karelian (Finnic, 63k), Ingrian (Finnic, 130) and Kulon-Pazeh (Taiwan, 1).]

• Low- vs. high-resource languages
• Zipfian distribution


MORE CHALLENGES IN NLP
Multiple languages

[Figure: the tail of the speaker distribution: Finnish (5.5M), Karelian (Finnic, 63k), Ingrian (Finnic, 130), Kulon-Pazeh (Taiwan, 1).]

• Low- vs. high-resource languages
• Zipfian distribution

MORE CHALLENGES IN NLP
Multiple languages

High-resource             Low-resource
Few languages             Very many languages
Much linguistic study     Typically less studied
Large corpora             Small or no corpora
Annotated resources       Limited or no annotations
Commercial interest       Less commercial incentive

Increasing research interest in multilingual NLP and NLP for low-resource languages.


MORE CHALLENGES IN NLP
Broad-domain NLP

The Penn Treebank: 1989–1996
• Large linguistic corpus; huge annotation project
• 7M words with part-of-speech (POS) tags
• 3M words with syntactic parse trees
• Syntactic trees for news articles: the Wall Street Journal (WSJ)
• Used for >10 years of parsing research: model training & evaluation

The result: automatic syntactic parsing is now really good on news text.

MORE CHALLENGES IN NLP
Broad-domain NLP

The result: automatic syntactic parsing is now really good on news text.

• But I want to parse forum posts, conversational speech, Estonian, ...
• Answer: annotate a treebank for each of these?
  • The Penn Treebank took years, huge effort and lots of money
• How to produce parsers that work on all these and more?
  • Unsupervised/semi-supervised learning? Domain adaptation?
• Still a major area of research!


MORE CHALLENGES IN NLP
Speech

• This course: processing text
• Speech processing: speech recognition at the start of the pipeline
• Introduces further challenges:
  • More ambiguity
  • More noise
  • Disfluency
  • Wider variety of language (informal, dialectal)

    Speech → "Finally a smalls set all mens loom da head, ya know..." → Text pipeline

MORE CHALLENGES IN NLP
Multimodal input

• Real language does not occur in isolation
• Everything is in context:
  • Dialogue
  • Physical surroundings: visual, aural, ...
  • Gestures
• Can we process multiple modes at once?
  • E.g. sentence + image: identify visual referents
• Major increase in research in recent years


MORE CHALLENGES IN NLP
Pragmatics

Real human language processing is not like most NLP tasks.

News article:
  "Following a three-week trial at Bristol Crown Court, jurors told Judge Martin Picton they could not reach a verdict."
  • Parse in isolation
  • Extract information

Phone call:
  A: Some of them disagree
  A: I mean some of them said one way and some the other
  B: Exactly
  B: but they took you know whatever the majority was
  B: So I didn't know if that was just something for drama or that's truly the way it is
  B: I always thought it had to be unanimous
  A: I think it does have to be unanimous
  B: So I I
  B: But uh rather interesting
  • Individual utterances useless
  • Interpret the structure of the dialogue, speech acts, context, ...

SUMMARY

• Why do NLP?
• Some challenges of NLP
• NLU and NLG
• History
• Zipf's law
• Breaking into subtasks: the pipeline
• Corpora and statistical models

Further challenges:
• Multiple languages
• Broad-domain NLP
• Speech
• Multimodal NLP
• Pragmatics


READING MATERIAL

• J&M2, p. 35–43
• Eisenstein, Introduction (p. 1–10)

NEXT UP

After lunch: practical assignments in BK107

 9:15 – 12:00   Lectures
12:00 – 13:15   Lunch
13:15 – 14:00   Introduction
14:00 – ∼16:00  Practical assignments

• Refresher/intro to Python 3
• Natural Language Toolkit (NLTK)
• Working with linguistic data/tools
