
https://presemo.helsinki.fi/nlp2020

LECTURE 2: NLU PIPELINE AND TOOLS

Mark Granroth-Wilding

More challenges in NLP: more things that make NLP difficult...

Some of this in more detail in lecture 14: hard stuff, open problems

2 MORE CHALLENGES IN NLP Multiple languages

• Until recently, most NLP in English only
  • Lots of data
  • Many applications (e.g. web data)
  • Also, relatively easy
• Same systems for all languages?
  • Depends on task
  • Depends on data
• Multilingual NLP: hard!
• Much more work in recent years

3 MORE CHALLENGES IN NLP Multiple languages

[Figure: number of speakers per language, from Mandarin (935M), Spanish (390M) and English (365M) down through Czech (10M), Finnish (5.5M), Karelian (Finnic, 63k) and Ingrian (Finnic, 130) to Kulon-Pazeh (Taiwan, 1)]

• Low- vs. high-resource languages
• Zipfian distribution

4 MORE CHALLENGES IN NLP Multiple languages

[Figure: the same distribution, zoomed to the low-resource tail: Finnish (5.5M), Karelian (Finnic, 63k), Ingrian (Finnic, 130), Kulon-Pazeh (Taiwan, 1)]

• Low- vs. high-resource languages
• Zipfian distribution

4 MORE CHALLENGES IN NLP Multiple languages

High-resource            Low-resource
Few languages            Very many languages
Much linguistic study    Typically less studied
Large corpora            Small or no corpora
Annotated resources      Limited or no annotations
Commercial interest      Less commercial incentive

Increasing research interest in multilingual NLP and NLP for low-resourced languages

5 MORE CHALLENGES IN NLP Broad-domain NLP

The Penn Treebank: 1989-1996

• Large linguistic corpus
• Huge annotation project
• 7M words with part-of-speech (POS) tags
• 3M words with syntactic parse trees
• Syntactic trees for news articles: Wall Street Journal (WSJ)
• Used for >10 years of research: model training & evaluation

The result: Automatic syntactic parsing is now really good on news text

6 MORE CHALLENGES IN NLP Broad-domain NLP

The result: Automatic syntactic parsing is now really good on news text

• But I want to parse forum posts, conversational speech, Estonian, ...
• Answer: annotate a treebank for each of these?
  • Penn Treebank took years, huge effort and lots of money
• How to produce parsers that work on all these and more?
• Unsupervised/semi-supervised learning? Domain adaptation?

Still a major area of research!

7 MORE CHALLENGES IN NLP Speech

• This course: processing text
• Spoken input: speech recognition at start of pipeline
• Introduces further challenges:
  • More ambiguity
  • More noise
  • Disfluency
  • Wider variety of language (informal, dialectal)

Speech → [speech recognition] → "Finally a smalls set all mens loom da head, ya know..." → Text pipeline

8 MORE CHALLENGES IN NLP Multimodal input

• Real language not in isolation
• Everything in context:
  • Dialogue
  • Physical surroundings: visual, aural, ...
  • Gestures
• Can we process multiple modes at once?
• E.g. sentence + image: identify visual referents

Major increase in research in recent years

9 MORE CHALLENGES IN NLP Pragmatics

• Real human language processing not like most NLP tasks

News article:
"Following a three-week trial at Bristol Crown Court, the jurors told Judge Martin Picton they could not reach a verdict."
→ Parse in isolation; extract information

Phone call:
A: Some of them disagree
A: I mean some of them said one way and some the other
B: Exactly
B: but they took you know whatever the majority was
B: So I didn't know if that was just something for drama or that's truly the way it is
B: I always thought it had to be unanimous
A: I think it does have to be unanimous
B: So I I
B: But uh rather interesting
→ Individual utterances; interpret structure of dialogue, speech acts, context, ...

10 NLP PIPELINE

Web text → Low-level processing → Annotated text → ... → Abstract processing → Further annotated text → Structured knowledge base

• Complete NLU too complex
• Break into manageable subtasks
• Develop separate tools
• Different applications, different combinations
• Reuse effort on individual tools
• Lot of research effort on subtasks
• Often, tools & data public

11 NLP PIPELINE

Web text → Tokenization → Tokenized text → ... → Semantic role labeling → SR-labeled text → Structured knowledge base

• Complete NLU too complex
• Break into manageable subtasks
• Develop separate tools
• Different applications, different combinations
• Reuse effort on individual tools
• Lot of research effort on subtasks
• Often, tools & data public

11 THE NLP PIPELINE

Web text → Low-level processing → Annotated text → ... → Abstract processing → Further annotated text → Structured knowledge base

• Advantages:
  • Reuse of tools
  • Common work on subtasks, e.g. parsing
  • Evaluation of components
  • Easy combinations for many applications
• Disadvantages of pipeline:
  • Discrete stages: no feedback
  • Improvements on sub-tasks might not benefit pipelines
• Standard sub-tasks/tools: coming up today

12 EXAMPLE: REMINDER

• Sentence from previous lecture below
• How would we divide into pipeline components for:
  • analysing input
  • linking to related sources of information (e.g. web pages, encyclopedia, news articles)

The number of moped-related crimes rose from 827 in 2012 to more than 23,000 last year.

• We’ll return to this in more detail later

13 NLG PIPELINE

• Natural Language Generation can also use a pipeline
• Same reasons, same drawbacks
• Not so standardized
• Far fewer tools for sub-tasks

More in lecture 10 about NLG pipeline

14 REUSABLE COMPONENTS

• Defining standard sub-tasks → can reuse models and tools across pipelines
• Improvements on sub-tasks benefit many applications
• Publicly available code/tools/models. E.g.:

Does: tokenization, POS tagging, named-entity recognition, ...
Used by:
• Adam
• Dragonfire: virtual assistant
• EpiTator: infectious disease tracker

15 DATA IN THE PIPELINE

• Data passed between components varies greatly
• Components perform analysis of input
• Output annotations, e.g. at the level of:
  • sentence
  • discourse
  • document
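To make this concrete, here is one possible way a pipeline could pass annotated data between components, as a plain Python record (a hypothetical representation for illustration, not any particular toolkit's format):

```python
# Hypothetical document record passed between pipeline components.
# Each component reads earlier annotations and adds its own layer.
doc = {
    "text": "Alice walks quickly .",
    "tokens": ["Alice", "walks", "quickly", "."],  # added by tokenizer
    "pos": ["PROPN", "VERB", "ADV", "PUNCT"],      # added by POS tagger
    "sentences": [(0, 4)],                         # token-index spans
}

# A downstream component consumes just the annotation layers it needs:
proper_nouns = [t for t, p in zip(doc["tokens"], doc["pos"]) if p == "PROPN"]
print(proper_nouns)  # ['Alice']
```

Because each stage only appends annotations, components stay decoupled: a different tagger can be swapped in as long as it fills the same fields.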

16 LEVELS OF ANALYSIS

• Sub-word
  • Character (grapheme): A l i c e  w a l k s
  • Phoneme, linguistic sound unit: æ l i s  w ɔ k s
  • Morpheme, smallest meaningful unit: alice walk-s
• Word: alice walks quickly
• Phrase, clause: alice walks, then she runs
• Sentence, utterance
• Paragraph, section, discourse turn, ...
• Document
• Document collection, corpus

17 TOOLS AND TOOLKITS

• Pipeline allows component reuse
• Tools for subtasks can be shared
• Many toolkits provide standard components
• Compare:
  • accuracy
  • speed
  • pre-trained models (domain, language)

18 EXAMPLE: STANFORD CORENLP

Languages: en, de, fr, ar, zh
Tools: tokenization, POS tagging, lemmatization, NER, parsing, dependency parsing, coreference, sentiment, ...
Java / command line / API
Open source
(Demo coming up...)

19 EXAMPLE: SPACY

Languages: en, de, fr, es, it
Tools: tokenization, POS tagging, NER, dependency parsing
Python
Fewer tools, different languages
Open source
Very fast
(See assignments)

20 EXAMPLE: GENSIM

• Specialized tool
• Topic modelling (see later in course)
• Language independent
• Late in pipeline: abstract analysis
• Use other tools/toolkits for earlier stages:
  • tokenization
  • lemmatization
  • etc.

21 EXAMPLE: GENSIM

Text corpus (documents)

Sentence split → Sentences → Tokenize → Tokens → Lemmatize → Lemmas → Count document words → Bags of words → Train

Trained model parameters
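The "count document words" stage can be sketched with the standard library (Counter here stands in for Gensim's own Dictionary/doc2bow machinery; the lemma lists are made-up inputs from the earlier stages):

```python
from collections import Counter

# Sketch of the "count document words" stage: lemma sequences → bags of words.
lemmatized_docs = [
    ["alice", "walk", "quickly"],
    ["alice", "run", "alice"],
]

# A bag of words keeps per-document counts and discards word order.
bags_of_words = [Counter(doc) for doc in lemmatized_docs]
print(bags_of_words[1]["alice"])  # 2
```

The trained topic model then sees only these counts, which is why the earlier tokenization and lemmatization stages matter: they decide what gets counted as "the same word".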

22 NLU SUB-TASKS

Language data → Low-level processing → Annotated text → ... → Abstract processing → Further annotated text → Structured knowledge repr.

• Some typical sub-tasks
• Mostly early in pipeline: "low-level"
• Brief intro: more on some later

1. Speech recognition
2. Text recognition
3. Morphology
4. POS tagging
5. Parsing
6. Named-entity recognition
7. Semantic role labelling
8. Pragmatics

(ordered roughly from low-level to more abstract)

23 SPEECH RECOGNITION

• Understanding human speech: audio signal → text
  "Finally a small settlement loomed ahead. It was of the familiar style of toy-building-block architecture affected by the ant-men, and..."
• NL interfaces
• Noisy: challenges for NLP further on
• Components:
  • Acoustic model: audio → text
  • Language model: expectations about text

More later...

24 TEXT RECOGNITION Optical character recognition (OCR)

• Understanding printed/written text

Scanned document → "Finally a small settlement loomed ahead. It was of the familiar style of toy-building-block architecture..." (text)

• E.g. digitizing libraries
• Huge variation in how letters appear
• Earlier: model image → character classification
• Recent methods: take into account context

25 TOKENIZATION

• Many methods use word-based analysis
• What is a word?
• Often split text → word (token) sequence: tokenization

First approximation: split on spaces

Arkilu pursed her lips in thought. → Arkilu / pursed / her / lips / in / thought / .

26 TOKENIZATION

First approximation: split on spaces
Often not good enough:

"Really meaning," Arkilu interposed, ... → " / Really / meaning / , / " / Arkilu / interposed / , ...

Some other tricky cases: black-furred, to-day, N.Y.U., 5,000

Language-specific
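The difference between the first approximation and a slightly smarter tokenizer can be sketched as follows (the regex pattern is a toy, and still fails on the tricky cases above, such as N.Y.U. and 5,000):

```python
import re

text = '"Really meaning," Arkilu interposed'

# First approximation: split on spaces. Punctuation stays glued to words.
print(text.split())
# ['"Really', 'meaning,"', 'Arkilu', 'interposed']

# Toy improvement: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['"', 'Really', 'meaning', ',', '"', 'Arkilu', 'interposed']
```

Note how language-specific such rules are: this pattern would wrongly split hyphenated compounds and does nothing useful for scripts written without spaces.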

27 MORPHOLOGY

• Analysis of internal structure of words: unhelpfulness → un+help+ful+ness
• Splitting words into stems, affixes, compounds: thunderstorm → thunder+storm
• Useful for:
  • categorization using morphological features: -s → plural
  • text normalization: robot, robots, robot's → robot
  • generation: robot+plur → robots, man+plur → men

More on specific methods later
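The "-s → plural" feature above can be sketched as a toy rule (illustrative only: real morphological analysers use lexicons and statistical models, and handle irregular forms like men, which this rule misses):

```python
# Toy suffix-stripping analysis for the "-s marks plural" rule.
def analyse(word):
    # Crude rule: a final -s marks a plural. This over-applies to verbs
    # ("walks") and misses irregular plurals ("men").
    if word.endswith("s") and len(word) > 2:
        return word[:-1] + "+plur"
    return word

print(analyse("robots"))  # robot+plur
print(analyse("men"))     # men  (irregular plural: rule fails silently)
```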

28 POS TAGGING

• Part of speech: ancient form of shallow grammatical analysis
• Token-level categories
• Distinguish syntactic function of words in broad classes:
  "For the present, we are..." (noun) vs. "The present situation..." (adjective)

Other ambiguity remains: "He gave a present" (also a noun)

Common NLP subtask: part-of-speech (POS) tagging

More on POS-tagging methods and statistical models in lecture 5.
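A minimal lexicon-lookup tagger (with a made-up mini-lexicon) illustrates why context matters: forced to store one tag per word, it cannot distinguish the two uses of "present":

```python
# Hypothetical mini-lexicon mapping each word to a single POS tag.
LEXICON = {
    "the": "DET", "for": "ADP", "we": "PRON", "are": "VERB",
    "present": "NOUN",  # must pick one tag for an ambiguous word
    "situation": "NOUN",
}

def tag(tokens):
    # Look each token up; unknown words get a placeholder tag.
    return [(t, LEXICON.get(t.lower(), "UNK")) for t in tokens]

print(tag(["The", "present", "situation"]))
# "present" gets NOUN here too, though in this context it is an adjective.
```

Statistical taggers fix exactly this: they choose the tag based on the surrounding words, not a fixed lookup.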

29 PARSING

• Syntax: models of structures underlying NL sentences
• Captures dependencies between words
• Analysis required to interpret meaning (semantics)

Alice who saw the man, who pushed Bob, who ate the apple, walks quickly

• Parsing: inference of structure underlying sentence

30 PARSING

• Parsing: inference of structure underlying sentence
• Useful for:
  • Modelling human language processing
  • Disambiguation of sentence meaning
  • Structuring statistical models
  • Splitting sentences into meaningful units (chunks)
  • Much more!

More on syntax, parsing and statistical models in lecture 6.
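One common way to represent a parser's output is as head-relation pairs; the analysis below is hand-built for illustration (relation names loosely follow Universal Dependencies conventions), not real parser output:

```python
# Hand-built dependency analysis of "alice walks quickly":
# each token maps to (head token, grammatical relation).
parse = {
    "alice":   ("walks", "nsubj"),   # subject of the verb
    "quickly": ("walks", "advmod"),  # adverbial modifier of the verb
    "walks":   (None, "root"),       # head of the sentence
}

# Such a structure lets later components ask, e.g., who is doing the walking:
subject = [w for w, (head, rel) in parse.items()
           if head == "walks" and rel == "nsubj"]
print(subject)  # ['alice']
```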

31 NAMED ENTITY RECOGNITION

• Named entities: references to people, organisations, products, etc.
• Identification important for extracting structured data
• Can link to known entities in knowledge base
• Many practical uses

Example:
"DARPA hopes the ALIAS programme will produce an automated system that will be cost effective. In addition to the 737 simulator and the Cessna light aircraft, Aurora has also flown a Diamond DA42 light aircraft and a Bell UH-1 helicopter."

NER comes up again in lecture 12.
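A toy gazetteer lookup shows the simplest possible form of entity spotting (real NER uses trained sequence models; a fixed name list cannot handle unseen or ambiguous names):

```python
# Hypothetical gazetteer: known entity names with their types.
GAZETTEER = {
    "DARPA": "ORG",
    "Aurora": "ORG",
    "Diamond DA42": "PRODUCT",
}

def spot_entities(text):
    # Return every gazetteer entry that occurs verbatim in the text.
    return [(name, label) for name, label in GAZETTEER.items() if name in text]

text = "DARPA hopes the ALIAS programme will produce an automated system."
print(spot_entities(text))  # [('DARPA', 'ORG')]
```

Note the limitation: "ALIAS" is missed entirely because it is not in the list, which is why statistical models that use context are needed.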

33 SEMANTIC ROLE LABELLING

• Semantic roles capture key parts of the structure of meaning in a sentence
• Who did what to whom?

Alice saw the man that Bob pushed
  Alice is agent of saw
  man is patient of saw
  man is patient of pushed

• Semantic role labelling: inference of these relationships
• More abstract than syntax
• Less structured than formal semantics
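The role assignments above can be written down as (predicate, role, argument) triples, one plausible output representation for a semantic role labeller:

```python
# Semantic roles for "Alice saw the man that Bob pushed",
# as (predicate, role, argument) triples.
roles = [
    ("saw",    "agent",   "Alice"),
    ("saw",    "patient", "man"),
    ("pushed", "agent",   "Bob"),
    ("pushed", "patient", "man"),
]

# "Who did what to whom?" becomes a simple query over the triples:
who_saw = [arg for pred, role, arg in roles
           if pred == "saw" and role == "agent"]
print(who_saw)  # ['Alice']
```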

34 PRAGMATICS

• Pragmatics concerns meaning in broader context
• Includes questions of e.g.:
  • conversational context
  • speaker's intent
  • metaphorical meaning
  • background knowledge
• Abstract analysis
• Depends heavily on other types of analysis seen here
• Many unsolved problems
• Some tasks tackled in NLP: late in pipelines

Some aspects of pragmatics covered in lecture 14.

35 PIPELINE EXAMPLE REVISITED In small groups

• Assume you:
  • are a ...
  • have a database of logical/factual world knowledge
  • have lots of rules/statistics about English
• What is involved in:
  • extracting & encoding relevant information?
  • answering the question?
• Formulate as a pipeline
• Use component names where possible

Question: What agency has created a computer that can pilot a plane?
Passage: A robotic co-pilot developed under DARPA's ALIAS programme has already flown a light aircraft.

37 EXAMPLE PIPELINE

Raw text → Sentence segmentation → Sentences → Tokenization → Tokenized sentences → Part-of-speech tagging → POS-tagged sentences → Entity detection → Chunked sentences → Relation detection → Relations

Text input: "Facebook chairman Mark Zuckerberg was summoned to testify in front of EU lawmakers."
Relation output: (Mark Zuckerberg, chairman-of, Facebook), ...

More on later stages in lecture 12.
Example from NLTK Book: https://www.nltk.org/book/ch07.html
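The first two stages of this pipeline can be sketched as composable functions (toy standard-library implementations for illustration, not NLTK's):

```python
import re

def sentence_segment(text):
    # Toy rule: split after sentence-final punctuation.
    # Fails on abbreviations like "N.Y.U." -- real segmenters use models.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def tokenize(sentence):
    # Toy rule: word-character runs, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)

# Each stage consumes the previous stage's output, as in the diagram:
raw = "Alice walks. Bob runs."
sentences = sentence_segment(raw)
tokenized = [tokenize(s) for s in sentences]
print(tokenized)
# [['Alice', 'walks', '.'], ['Bob', 'runs', '.']]
```

Later stages (POS tagging, entity detection, relation detection) would plug in the same way, each taking the previous annotation layer as input.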

39 SOME OTHER TOOLKITS

OpenNLP Tokenization, POS tagging, lemmatization, parsing, NER, ... Pretrained models (some components): en, de, es, nl, pt, se. Java / command line

NLTK Tokenization, POS tagging, lemmatization, parsing, NER, language modelling, WSD, ... Some pretrained models, mostly en. Python. Primarily developed for teaching

40 SOME OTHER TOOLKITS

TextBlob Tokenization, POS tagging, lemmatization, simple parsing Models: en, fr, de. Python. Easy to use

Flair Tokenization, POS tagging, NER, ... Pretrained models: mostly en. Some de, fr. Python. Fast. SOTA for many tasks

41 LIVE DEMO Stanford CoreNLP

• Online demo with visualization: http://corenlp.run/
• Many tools can be selected
• Some run whole pipelines: e.g. OpenIE (information extraction), sentiment
• Let's try some examples...

42 SUB-TASKS COMING UP

More on some sub-tasks later in course:
• Morphology: lecture 4
• POS tagging: lecture 5
• Syntactic parsing: lecture 6
• Word-level (lexical) meaning: lecture 8
• Document-level meaning: lecture 9
• NLG sub-tasks / components: lectures 10 and 11
• Named-entity recognition: lecture 12
• Discourse, pragmatics: lecture 14

43 PIPELINE STUDIO In small groups

Moodle
• Come up with a task that relies on NLP
• Devise a pipeline, using components we've seen
• Add non-standard components as necessary

Tokenization Lemmatization POS tagging Morphology Parsing Named-entity recognition Sentence-level semantics Document meaning / analysis Speech recognition Semantic-role labelling

44 COURSE OUTLINE

     Date  Topic
1    13.1  Introduction to NLP
2    17.1  NLU pipeline and toolkits
3    20.1  Evaluation
4    24.1  Meaning and representations; FS methods
5    27.1  FS methods; statistical NLP
6    31.1  Syntax & parsing
7    3.2   Syntax & parsing
8    7.2   Lexical &

45 COURSE OUTLINE

     Date  Topic
9    10.2  Vector-space models
10   14.2  NLG subtasks & pipeline
11   17.2  NLG evaluation; discourse
12   21.2  Information extraction
13   24.2  Advanced statistical NLP; formal semantics
14   28.2  Semantics and pragmatics; the future

46 SUMMARY

• Big challenges in NLP:
  • Multiple languages
  • Broad-domain NLP
  • Speech
  • Multimodal NLP
  • Pragmatics

47 SUMMARY

• NLU typically broken into subtasks
• Pipeline of components
• Complex applications reuse standard tools, models, datasets
• Components annotate linguistic data
• Many levels of analysis
• Comparing ready-made tools for subtasks
• Typical sub-tasks / components
• Example pipelines
• Some available toolkits

48 READING MATERIAL

Introductions to:
• Speech recognition: J&M2 p319-21
• Morphology: J&M3 2.4.4 (J&M2 p79-80)
• POS tagging: J&M3 8.1, 8.3 (J&M2 p157-8, 167-9)
• Syntax & parsing: J&M3 11.1 (J&M2 p419-20, 461)
• NER: J&M3 17.1 (J&M2 p761-6)

Online introduction to Stanford CoreNLP toolkit

49