
https://presemo.helsinki.fi/nlp2020

LECTURE 2: NLU PIPELINE AND TOOLS

Mark Granroth-Wilding

More challenges in NLP: more things that make NLP difficult...

Some of this in more detail in lecture 14: hard stuff, open problems

2 MORE CHALLENGES IN NLP Multiple languages

• Until recently, most NLP in English only
  • Lots of data
  • Many applications (e.g. web data)
  • Also, relatively easy
• Same systems for all languages?
  • Depends on task
  • Depends on data
• Multilingual NLP: hard!
• Much more work in recent years

3 MORE CHALLENGES IN NLP Multiple languages

[Figure: number of speakers per language, from Mandarin (935M), Spanish (390M) and English (365M) down through Czech (10M), Finnish (5.5M), Karelian (Finnic, 63k) and Ingrian (Finnic, 130) to Kulon-Pazeh (Taiwan, 1)]

• Low- vs. high-resource languages
• Zipfian distribution

4 MORE CHALLENGES IN NLP Multiple languages

[Figure: the same distribution, zoomed to the low-resource tail: Finnish (5.5M), Karelian (Finnic, 63k), Ingrian (Finnic, 130), Kulon-Pazeh (Taiwan, 1)]

• Low- vs. high-resource languages
• Zipfian distribution

4 MORE CHALLENGES IN NLP Multiple languages

High-resource            Low-resource
Few languages            Very many languages
Much linguistic study    Typically less studied
Large corpora            Small or no corpora
Annotated resources      Limited or no annotations
Commercial interest      Less commercial incentive

Increasing research interest in multilingual NLP and NLP for low-resourced languages

5 MORE CHALLENGES IN NLP Broad-domain NLP

The Penn Treebank: 1989-1996

• Large linguistic corpus
• Huge annotation project
• 7M words with part-of-speech (POS) tags
• 3M words with syntactic parse trees
• Syntactic trees for news articles: Wall Street Journal (WSJ)
• Used for >10 years of research: model training & evaluation

The result: Automatic syntactic parsing is now really good on news text

6 MORE CHALLENGES IN NLP Broad-domain NLP

The result: Automatic syntactic parsing is now really good on news text

• But I want to parse forum posts, conversational speech, Estonian, ...
• Answer: annotate a treebank for each of these?
  • Penn Treebank took years, huge effort and lots of money
• How to produce parsers that work on all these and more?
• Unsupervised/semi-supervised learning? Domain adaptation?

Still a major area of research!

7 MORE CHALLENGES IN NLP Speech

• This course: processing text
• Spoken input: speech recognition at start of pipeline
• Introduces further challenges:
  • More ambiguity
  • More noise
  • Disfluency
  • Wider variety of language (informal, dialectal)

Speech → [speech recognition] → "Finally a smalls set all mens loom da head, ya know..." → Text pipeline

8 MORE CHALLENGES IN NLP Multimodal input

• Real language not in isolation
• Everything in context:
  • Dialogue
  • Physical surroundings: visual, aural, ...
  • Gestures
• Can we process multiple modes at once?
• E.g. sentence + image: identify visual referents

Major increase in research in recent years

9 MORE CHALLENGES IN NLP Pragmatics

• Real human language processing not like most NLP tasks

News article:
"Following a three-week trial at Bristol Crown Court, the jurors told Judge Martin Picton they could not reach a verdict."
→ Parse in isolation; extract information

Phone call:
A: Some of them disagree
A: I mean some of them said one way and some the other
B: Exactly
B: but they took you know whatever the majority was
B: So I didn't know if that was just something for drama or that's truly the way it is
B: I always thought it had to be unanimous
A: I think it does have to be unanimous
B: So I I
B: But uh rather interesting
→ Individual utterances; interpret structure of dialogue, speech acts, context, ...

10 NLP PIPELINE

Web text → Low-level processing → Annotated text → ... → Abstract processing → Further annotated text → Structured knowledge base

• Complete NLU too complex
• Break into manageable subtasks
• Develop separate tools
• Different applications, different combinations
• Reuse effort on individual tools
• Lot of research effort on subtasks
• Often, tools & data public

11 NLP PIPELINE

Web text → Tokenization → Tokenized text → ... → Semantic role labeling → SR-labeled text → Structured knowledge base

• Complete NLU too complex
• Break into manageable subtasks
• Develop separate tools
• Different applications, different combinations
• Reuse effort on individual tools
• Lot of research effort on subtasks
• Often, tools & data public

11 THE NLP PIPELINE

Web text → Low-level processing → Annotated text → ... → Abstract processing → Further annotated text → Structured knowledge base

• Advantages:
  • Reuse of tools
  • Common work on subtasks, e.g. parsing
  • Evaluation of components
  • Easy combinations for many applications
• Disadvantages of pipeline:
  • Discrete stages: no feedback
  • Improvements on sub-tasks might not benefit pipelines
• Standard sub-tasks/tools: coming up today

12 EXAMPLE: REMINDER

• Sentence from previous lecture below
• How would we divide into pipeline components for:
  • analysing input
  • linking to related sources of information (e.g. web pages, encyclopedia, news articles)

The number of moped-related crimes rose from 827 in 2012 to more than 23,000 last year.

• We’ll return to this in more detail later

13 NLG PIPELINE

• Natural Language Generation can also use a pipeline
• Same reasons, same drawbacks
• Not so standardized
• Far fewer tools for sub-tasks

More in lecture 10 about NLG pipeline

14 REUSABLE COMPONENTS

• Defining standard sub-tasks → can reuse models and tools across pipelines
• Improvements on sub-tasks benefit many applications
• Publicly available code/tools/models. E.g.:

Does: tokenization, POS tagging, named-entity recognition, ...
Used by:
• Adam
• Dragonfire: virtual assistant
• EpiTator: infectious disease tracker

15 DATA IN THE PIPELINE

• Data passed between components varies greatly
• Components perform analysis of input
• Output annotations, e.g. at the level of:
  • sentence
  • discourse
  • document
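To make this concrete, here is one possible way a pipeline could pass annotated data between components, as a plain Python record (a hypothetical representation for illustration, not any particular toolkit's format):

```python
# Hypothetical document record passed between pipeline components.
# Each component reads earlier annotations and adds its own layer.
doc = {
    "text": "Alice walks quickly .",
    "tokens": ["Alice", "walks", "quickly", "."],  # added by tokenizer
    "pos": ["PROPN", "VERB", "ADV", "PUNCT"],      # added by POS tagger
    "sentences": [(0, 4)],                         # token-index spans
}

# A downstream component consumes just the annotation layers it needs:
proper_nouns = [t for t, p in zip(doc["tokens"], doc["pos"]) if p == "PROPN"]
print(proper_nouns)  # ['Alice']
```

Because each stage only appends annotations, components stay decoupled: a different tagger can be swapped in as long as it fills the same fields.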

16 LEVELS OF ANALYSIS

• Sub-word
  • Character (grapheme): A l i c e  w a l k s
  • Phoneme, linguistic sound unit: æ l i s  w ɔ k s
  • Morpheme, smallest meaningful unit: alice walk-s
• Word: alice walks quickly
• Phrase, clause: alice walks, then she runs
• Sentence, utterance
• Paragraph, section, discourse turn, ...
• Document
• Document collection, corpus

17 TOOLS AND TOOLKITS

• Pipeline allows component reuse
• Tools for subtasks can be shared
• Many toolkits provide standard components
• Compare:
  • accuracy
  • speed
  • pre-trained models (domain, language)

18 EXAMPLE: STANFORD CORENLP

Languages: en, de, fr, ar, zh
Tools: tokenization, POS tagging, lemmatization, NER, parsing, dependency parsing, coreference, sentiment, ...
Java / command line / API
Open source
(Demo coming up...)

19 EXAMPLE: SPACY

Languages: en, de, fr, es, it
Tools: tokenization, POS tagging, NER, dependency parsing
Python
Fewer tools, different languages
Open source
Very fast
(See assignments)

20 EXAMPLE: GENSIM

• Specialized tool
• Topic modelling (see later in course)
• Language independent
• Late in pipeline: abstract analysis
• Use other tools/toolkits for earlier stages:
  • tokenization
  • lemmatization
  • etc.

21 EXAMPLE: GENSIM

Text corpus (documents)

Sentence split → Sentences → Tokenize → Tokens → Lemmatize → Lemmas → Count document words → Bags of words → Train

Trained model parameters
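The "count document words" stage can be sketched with the standard library (Counter here stands in for Gensim's own Dictionary/doc2bow machinery; the lemma lists are made-up inputs from the earlier stages):

```python
from collections import Counter

# Sketch of the "count document words" stage: lemma sequences → bags of words.
lemmatized_docs = [
    ["alice", "walk", "quickly"],
    ["alice", "run", "alice"],
]

# A bag of words keeps per-document counts and discards word order.
bags_of_words = [Counter(doc) for doc in lemmatized_docs]
print(bags_of_words[1]["alice"])  # 2
```

The trained topic model then sees only these counts, which is why the earlier tokenization and lemmatization stages matter: they decide what gets counted as "the same word".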

22 NLU SUB-TASKS

Language data → Low-level processing → Annotated text → ... → Abstract processing → Further annotated text → Structured knowledge repr.

• Some typical sub-tasks
• Mostly early in pipeline: "low-level"
• Brief intro: more on some later

1. Speech recognition
2. Text recognition
3. Morphology
4. POS tagging
5. Parsing
6. Named-entity recognition
7. Semantic role labelling
8. Pragmatics

(ordered roughly from low-level to more abstract)

23 SPEECH RECOGNITION

• Understanding human speech: audio signal → text
  "Finally a small settlement loomed ahead. It was of the familiar style of toy-building-block architecture affected by the ant-men, and..."
• NL interfaces
• Noisy: challenges for NLP further on
• Components:
  • Acoustic model: audio → text
  • Language model: expectations about text

More later...

24 TEXT RECOGNITION Optical character recognition (OCR)

• Understanding printed/written text

Scanned document → "Finally a small settlement loomed ahead. It was of the familiar style of toy-building-block architecture..." (text)

• E.g. digitizing libraries
• Huge variation in how letters appear
• Earlier: model image → character classification
• Recent methods: take into account context

25 TOKENIZATION

• Many methods use word-based analysis
• What is a word?
• Often split text → word (token) sequence: tokenization

First approximation: split on spaces

Arkilu pursed her lips in thought. → Arkilu / pursed / her / lips / in / thought / .

26 TOKENIZATION

First approximation: split on spaces
Often not good enough:

"Really meaning," Arkilu interposed, ... → " / Really / meaning / , / " / Arkilu / interposed / , ...

Some other tricky cases: black-furred, to-day, N.Y.U., 5,000

Language-specific
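The difference between the first approximation and a slightly smarter tokenizer can be sketched as follows (the regex pattern is a toy, and still fails on the tricky cases above, such as N.Y.U. and 5,000):

```python
import re

text = '"Really meaning," Arkilu interposed'

# First approximation: split on spaces. Punctuation stays glued to words.
print(text.split())
# ['"Really', 'meaning,"', 'Arkilu', 'interposed']

# Toy improvement: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['"', 'Really', 'meaning', ',', '"', 'Arkilu', 'interposed']
```

Note how language-specific such rules are: this pattern would wrongly split hyphenated compounds and does nothing useful for scripts written without spaces.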

27 MORPHOLOGY

• Analysis of internal structure of words: unhelpfulness → un+help+ful+ness
• Splitting words into stems, affixes, compounds: thunderstorm → thunder+storm
• Useful for:
  • categorization using morphological features: -s → plural
  • text normalization: robot, robots, robot's → robot
  • generation: robot+plur → robots, man+plur → men

More on specific methods later
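The "-s → plural" feature above can be sketched as a toy rule (illustrative only: real morphological analysers use lexicons and statistical models, and handle irregular forms like men, which this rule misses):

```python
# Toy suffix-stripping analysis for the "-s marks plural" rule.
def analyse(word):
    # Crude rule: a final -s marks a plural. This over-applies to verbs
    # ("walks") and misses irregular plurals ("men").
    if word.endswith("s") and len(word) > 2:
        return word[:-1] + "+plur"
    return word

print(analyse("robots"))  # robot+plur
print(analyse("men"))     # men  (irregular plural: rule fails silently)
```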

28 POS TAGGING

• Part of speech: ancient form of shallow grammatical analysis
• Token-level categories
• Distinguish syntactic function of words in broad classes:
  "For the present, we are..." (noun) vs. "The present situation..." (adjective)

Other ambiguity remains: "He gave a present" (also a noun)

Common NLP subtask: part-of-speech (POS) tagging

More on POS-tagging methods and statistical models in lecture 5.
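A minimal lexicon-lookup tagger (with a made-up mini-lexicon) illustrates why context matters: forced to store one tag per word, it cannot distinguish the two uses of "present":

```python
# Hypothetical mini-lexicon mapping each word to a single POS tag.
LEXICON = {
    "the": "DET", "for": "ADP", "we": "PRON", "are": "VERB",
    "present": "NOUN",  # must pick one tag for an ambiguous word
    "situation": "NOUN",
}

def tag(tokens):
    # Look each token up; unknown words get a placeholder tag.
    return [(t, LEXICON.get(t.lower(), "UNK")) for t in tokens]

print(tag(["The", "present", "situation"]))
# "present" gets NOUN here too, though in this context it is an adjective.
```

Statistical taggers fix exactly this: they choose the tag based on the surrounding words, not a fixed lookup.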

29 PARSING

• Syntax: models of structures underlying NL sentences
• Captures dependencies between words
• Analysis required to interpret meaning (semantics)

Alice who saw the man, who pushed Bob, who ate the apple, walks quickly

• Parsing: inference of structure underlying sentence

30 PARSING

• Parsing: inference of structure underlying sentence
• Useful for:
  • Modelling human language processing
  • Disambiguation of sentence meaning
  • Structuring statistical models
  • Splitting sentences into meaningful units (chunks)
  • Much more!

More on syntax, parsing and statistical models in lecture 6.
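One common way to represent a parser's output is as head-relation pairs; the analysis below is hand-built for illustration (relation names loosely follow Universal Dependencies conventions), not real parser output:

```python
# Hand-built dependency analysis of "alice walks quickly":
# each token maps to (head token, grammatical relation).
parse = {
    "alice":   ("walks", "nsubj"),   # subject of the verb
    "quickly": ("walks", "advmod"),  # adverbial modifier of the verb
    "walks":   (None, "root"),       # head of the sentence
}

# Such a structure lets later components ask, e.g., who is doing the walking:
subject = [w for w, (head, rel) in parse.items()
           if head == "walks" and rel == "nsubj"]
print(subject)  # ['alice']
```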

31 NAMED ENTITY RECOGNITION

• Named entities: references to people, organisations, products, etc.
• Identification important for extracting structured data
• Can link to known entities in knowledge base
• Many practical uses

Example:
"DARPA hopes the ALIAS programme will produce an automated system that will be cost effective. In addition to the 737 simulator and the Cessna light aircraft, Aurora has also flown a Diamond DA42 light aircraft and a Bell UH-1 helicopter."

NER comes up again in lecture 12.
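A toy gazetteer lookup shows the simplest possible form of entity spotting (real NER uses trained sequence models; a fixed name list cannot handle unseen or ambiguous names):

```python
# Hypothetical gazetteer: known entity names with their types.
GAZETTEER = {
    "DARPA": "ORG",
    "Aurora": "ORG",
    "Diamond DA42": "PRODUCT",
}

def spot_entities(text):
    # Return every gazetteer entry that occurs verbatim in the text.
    return [(name, label) for name, label in GAZETTEER.items() if name in text]

text = "DARPA hopes the ALIAS programme will produce an automated system."
print(spot_entities(text))  # [('DARPA', 'ORG')]
```

Note the limitation: "ALIAS" is missed entirely because it is not in the list, which is why statistical models that use context are needed.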

33 SEMANTIC ROLE LABELLING

• Semantic roles capture key parts of the structure of meaning in a sentence
• Who did what to whom?

Alice saw the man that Bob pushed
  Alice is agent of saw
  man is patient of saw
  man is patient of pushed

• Semantic role labelling: inference of these relationships
• More abstract than syntax
• Less structured than formal semantics
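The role assignments above can be written down as (predicate, role, argument) triples, one plausible output representation for a semantic role labeller:

```python
# Semantic roles for "Alice saw the man that Bob pushed",
# as (predicate, role, argument) triples.
roles = [
    ("saw",    "agent",   "Alice"),
    ("saw",    "patient", "man"),
    ("pushed", "agent",   "Bob"),
    ("pushed", "patient", "man"),
]

# "Who did what to whom?" becomes a simple query over the triples:
who_saw = [arg for pred, role, arg in roles
           if pred == "saw" and role == "agent"]
print(who_saw)  # ['Alice']
```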

34 PRAGMATICS

• Pragmatics concerns meaning in broader context
• Includes questions of e.g.:
  • conversational context
  • speaker's intent
  • metaphorical meaning
  • background knowledge
• Abstract analysis
• Depends heavily on other types of analysis seen here
• Many unsolved problems
• Some tasks tackled in NLP: late in pipelines

Some aspects of pragmatics covered in lecture 14.

35 PIPELINE EXAMPLE REVISITED In small groups

• Assume you:
  • are a ...
  • have a database of logical/factual world knowledge
  • have lots of rules/statistics about English
• What is involved in:
  • extracting & encoding relevant information?
  • answering the question?
• Formulate as a pipeline
• Use component names where possible

Question: What agency has created a computer that can pilot a plane?
Passage: A robotic co-pilot developed under DARPA's ALIAS programme has already flown a light aircraft.

37 EXAMPLE PIPELINE

Raw text → Sentence segmentation → Sentences → Tokenization → Tokenized sentences → Part-of-speech tagging → POS-tagged sentences → Entity detection → Chunked sentences → Relation detection → Relations

Text input: "Facebook chairman Mark Zuckerberg was summoned to testify in front of EU lawmakers."
Relation output: (Mark Zuckerberg, chairman-of, Facebook), ...

More on later stages in lecture 12.
Example from NLTK Book: https://www.nltk.org/book/ch07.html
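The first two stages of this pipeline can be sketched as composable functions (toy standard-library implementations for illustration, not NLTK's):

```python
import re

def sentence_segment(text):
    # Toy rule: split after sentence-final punctuation.
    # Fails on abbreviations like "N.Y.U." -- real segmenters use models.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def tokenize(sentence):
    # Toy rule: word-character runs, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)

# Each stage consumes the previous stage's output, as in the diagram:
raw = "Alice walks. Bob runs."
sentences = sentence_segment(raw)
tokenized = [tokenize(s) for s in sentences]
print(tokenized)
# [['Alice', 'walks', '.'], ['Bob', 'runs', '.']]
```

Later stages (POS tagging, entity detection, relation detection) would plug in the same way, each taking the previous annotation layer as input.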

39 SOME OTHER TOOLKITS

OpenNLP Tokenization, POS tagging, lemmatization, parsing, NER, ... Pretrained models (some components): en, de, es, nl, pt, se. Java / command line

NLTK Tokenization, POS tagging, lemmatization, parsing, NER, language modelling, WSD, ... Some pretrained models, mostly en. Python. Primarily developed for teaching

40 SOME OTHER TOOLKITS

TextBlob Tokenization, POS tagging, lemmatization, simple parsing Models: en, fr, de. Python. Easy to use

Flair Tokenization, POS tagging, NER, ... Pretrained models: mostly en. Some de, fr. Python. Fast. SOTA for many tasks

41 LIVE DEMO Stanford CoreNLP

• Online demo with visualization: http://corenlp.run/
• Many tools can be selected
• Some run whole pipelines: e.g. OpenIE (information extraction), sentiment
• Let's try some examples...

42 SUB-TASKS COMING UP

More on some sub-tasks later in course:
• Morphology: lecture 4
• POS tagging: lecture 5
• Syntactic parsing: lecture 6
• Word-level (lexical) meaning: lecture 8
• Document-level meaning: lecture 9
• NLG sub-tasks / components: lectures 10 and 11
• Named-entity recognition: lecture 12
• Discourse, pragmatics: lecture 14

43 PIPELINE STUDIO In small groups

Moodle
• Come up with a task that relies on NLP
• Devise a pipeline, using components we've seen
• Add non-standard components as necessary

Tokenization Lemmatization POS tagging Morphology Parsing Named-entity recognition Sentence-level semantics Document meaning / analysis Speech recognition Semantic-role labelling

44 COURSE OUTLINE

     Date  Topic
1    13.1  Introduction to NLP
2    17.1  NLU pipeline and toolkits
3    20.1  Evaluation
4    24.1  Meaning and representations; FS methods
5    27.1  FS methods; statistical NLP
6    31.1  Syntax & parsing
7    3.2   Syntax & parsing
8    7.2   Lexical &

45 COURSE OUTLINE

     Date  Topic
9    10.2  Vector-space models
10   14.2  NLG subtasks & pipeline
11   17.2  NLG evaluation; discourse
12   21.2  Information extraction
13   24.2  Advanced statistical NLP; formal semantics
14   28.2  Semantics and pragmatics; the future

46 SUMMARY

• Big challenges in NLP:
  • Multiple languages
  • Broad-domain NLP
  • Speech
  • Multimodal NLP
  • Pragmatics

47 SUMMARY

• NLU typically broken into subtasks
• Pipeline of components
• Complex applications reuse standard tools, models, datasets
• Components annotate linguistic data
• Many levels of analysis
• Comparing ready-made tools for subtasks
• Typical sub-tasks / components
• Example pipelines
• Some available toolkits

48 READING MATERIAL

Introductions to:
• Speech recognition: J&M2 p319-21
• Morphology: J&M3 2.4.4 (J&M2 p79-80)
• POS tagging: J&M3 8.1, 8.3 (J&M2 p157-8, 167-9)
• Syntax & parsing: J&M3 11.1 (J&M2 p419-20, 461)
• NER: J&M3 17.1 (J&M2 p761-6)

Online introduction to Stanford CoreNLP toolkit

49