<<

11-411 Natural Language Processing Overview

Kemal Oflazer

Carnegie Mellon University in Qatar

∗ Content mostly based on previous offerings of 11-411 by LTI Faculty at CMU-Pittsburgh.

1/31 What is NLP?

I Automating the analysis, generation, and acquisition of human (”natural”) language

I Analysis (or “understanding” or “processing”, . . . )

I Generation

I Acquisition

I Some people use “NLP” to mean all of language technologies.

I Some people use it only to refer to analysis.

2/31 Why NLP?

I Answer questions using the Web
I Translate documents from one language to another
I Do library research; summarize
I Manage messages intelligently
I Help make informed decisions
I Follow directions given by any user
I Fix your spelling or grammar
I Grade exams
I Write poems or novels
I Listen and give advice
I Estimate public opinion
I Read everything and make predictions
I Interactively help people learn
I Help disabled people
I Help refugees/disaster victims
I Document or reinvigorate indigenous languages

3/31 What is NLP? More Detailed Answer

I Automating language analysis, generation, acquisition.

I Analysis (or “understanding” or “processing” ...): input is language, output is some representation that supports useful action

I Generation: input is that representation, output is language

I Acquisition: obtaining the representation and necessary algorithms, from knowledge and data

I Representation?

4/31 Levels of Linguistic Representation

5/31 Why it’s Hard

I The mappings between levels are extremely complex.

I Details and appropriateness of a representation depend on the application.

6/31 Complexity of Linguistic Representations

I Input is likely to be noisy.

I Linguistic representations are theorized constructs; we cannot observe them directly.

I Ambiguity: each linguistic input may have many possible interpretations at every level.

I The correct resolution of the ambiguity will depend on the intended meaning, which is often inferable from context.

I People are good at linguistic ambiguity resolution.

I Computers are not so good at it.

I How do we represent sets of possible alternatives?

I How do we represent context?

7/31 Complexity of Linguistic Representations

I Richness: there are many ways to express the same meaning, and immeasurably many meanings to express. Lots of words/phrases.

I Each level interacts with the others.

I There is tremendous diversity in human languages.

I Languages express the same kind of meaning in different ways

I Some languages express some meanings more readily/often.

I We will study models.

8/31 What is a Model?

I An abstract, theoretical, predictive construct. Includes:

I a (partial) representation of the world

I a method for creating or recognizing worlds,

I a system for reasoning about worlds

I NLP uses many tools for modeling.

I Surprisingly, shallow models work fine for some applications.

9/31 Using NLP Models and Tools

I This course is meant to introduce some formal tools that will help you navigate the field of NLP.

I We focus on formalisms and algorithms.

I This is not a comprehensive overview; it’s a deep introduction to some key topics.

I We’ll focus mainly on analysis and mainly on English text (but will provide examples from other languages whenever meaningful)

I The skills you develop will apply to any subfield of NLP

10/31 Applications / Challenges

I Application tasks evolve and are often hard to define formally.

I Objective evaluations of system performance are always up for debate.

I This holds for NL analysis as well as application tasks.

I Different applications may require different kinds of representations at different levels.

11/31 Expectations from NLP Systems

I Sensitivity to a wide range of the phenomena and constraints in human language

I Generality across different languages, genres, styles, and modalities

I Computational efficiency at construction time and runtime

I Strong formal guarantees (e.g., convergence, statistical efficiency, consistency, etc.)

I High accuracy when judged against expert annotations and/or task-specific performance

12/31 Key Applications (2017)

I Computational linguistics (i.e., modeling the human capacity for language computationally)

I Information extraction (IE), especially “open” IE

I Question answering (e.g., Watson)

I Conversational Agents (e.g., Siri, OK Google)


I Machine reading

I Summarization

I Opinion and sentiment analysis

I Social media analysis

I Fake news detection

I Essay evaluation

I Mining legal, medical, or scholarly literature

13/31 NLP vs Computational Linguistics

I NLP is focused on the technology of processing language.

I Computational Linguistics is focused on using technology to support/implement linguistics.

I The distinction is

I Like “artificial intelligence” vs. “cognitive science”

I Like “ building airplanes” vs. “understanding how birds fly”

14/31 Let’s Look at Some of the Levels

15/31

I Analysis of words into meaningful components – morphemes.

I Spectrum of complexity across languages

I Isolating Languages: mostly one morpheme (e.g., Chinese/Mandarin)

I Inflectional Languages: mostly two morphemes (e.g., English, French; one morpheme may mean many things)

I go+ing; habla+mos “we speak / we spoke” (SP)

16/31 Morphology

I Agglutinative Languages: Mostly many morphemes stacked like “beads-on-a-string” (e.g., Turkish, Finnish, Hungarian, Swahili)

I uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına “(behaving) as if you are among those whom we could not civilize”

I Polysynthetic Languages: A word is a sentence! (e.g., Inuktitut)

I Parismunngaujumaniralauqsimanngittunga Paris+mut+nngau+juma+niraq+lauq+si+ma+nngit+jun “I never said that I wanted to go to Paris”

I Reasonably dynamic:

I unfriend, Obamacare

17/31 Let’s Look at Some of the Levels

18/31 Lexical Processing

I Segmentation

I Normalize and disambiguate words

I Words with multiple meanings: bank, mean

I Extra challenge: domain-specific meanings (e.g., latex)

I Process multi-word expressions

I make . . . decision, take out, make up, kick the . . . bucket

I Part-of-speech tagging

I Assign a syntactic class to each word (verb, noun, adjective, etc.)

I Supersense tagging

I Assign a coarse semantic category to each content word (motion event, instrument, foodstuff, etc.)

I Syntactic “supertagging”

I Assign a possible syntactic neighborhood tag to each word (e.g., subject of a verb)

19/31 Let’s Look at Some of the Levels

20/31

I Transform a sequence of symbols into a hierarchical or compositional structure.

I Some sequences are well-formed, others are not:
✓ I want a flight to Tokyo.
✓ I want to fly to Tokyo.
✓ I found a flight to Tokyo.
✗ I found to fly to Tokyo.
✓ Colorless green ideas sleep furiously.
✗ Sleep colorless green furiously ideas.

I Ambiguities explode combinatorially

I Students hate annoying professors.

I John saw the woman with the telescope.

I John saw the woman with the telescope wrapped in paper.

21/31 Some of the Possible Syntactic Analyses

22/31 Morphology–Syntax

I A ship-shipping ship, shipping shipping-ships.

23/31 Let’s Look at Some of the Levels

24/31 Semantics

I Mapping of natural language sentences into domain representations.

I For example, a robot command language, a database query, or an expression in a formal logic

I Scope ambiguities:

I In this country a woman gives birth every fifteen minutes.

I Every person on this island speaks three languages.

I (TR) Üç doktor her hastaya baktı “Three doctors every patient saw”

⇒ ∃d1, d2, d3 [doctor(d1) & doctor(d2) & doctor(d3) & (∀p, patient(p) → saw(d1, p) & saw(d2, p) & saw(d3, p))]

I (TR) Her hastaya üç doktor baktı “Every patient three doctors saw”

⇒ ∀p, patient(p) → (∃d1, d2, d3, doctor(d1) & doctor(d2) & doctor(d3) & saw(d1, p) & saw(d2, p) & saw(d3, p))

I Going beyond specific domains is a goal of general artificial Intelligence.

25/31 Syntax–Semantics

We saw the woman with the telescope wrapped in paper.

I Who has the telescope?

I Who or what is wrapped in paper?

I Is this an event of perception or an assault?

26/31 Let’s Look at Some of the Levels

27/31 Pragmatics/Discourse

I Pragmatics

I Any non-local meaning phenomena

I “Can you pass the salt?”

I “Is he 21?” “Yes, he’s 25.”

I Discourse

I Structures and effects in related sequences of sentences

I “I said the black shoes.”

I “Oh, black.” (Is that a sentence?)

28/31 Course Logistics/Administrivia

I Web page: piazza.com/qatar.cmu/fall2017/11411/home

I Course Material:

I Book: Speech and Language Processing, Jurafsky and Martin, 2nd ed.

I As needed, copies of papers, etc. will be provided.

I Lectures slides will be provided after each lecture.

I Instructor: Kemal Oflazer

29/31 Your Grade

I Class project, 30%

I In-class midterm (October 11, for the time being), 20%

I Final exam (date TBD), 20%

I Unpredictable in-class quizzes, 15%

I Homework assignments, 15%

30/31 Policies

I Everything you submit must be your own work

I Any outside resources (books, research papers, web sites, etc.) or collaboration (students, professors, etc.) must be explicitly acknowledged.

I Project

I Collaboration is required (team size TBD)

I It’s okay to use existing tools, but you must acknowledge them.

I Grade is mostly shared.

I Programming language is up to you.

I Do people know Python?

31/31 11-411 Natural Language Processing Applications of NLP

Kemal Oflazer

Carnegie Mellon University in Qatar

∗ Content mostly based on previous offerings of 11-411 by LTI Faculty at CMU-Pittsburgh.

1/19 Information Extraction – Bird’s Eye View

I Input: text, empty relational database

I Output: populated relational database

Senator John Edwards is to drop out of the race to become the Democratic party’s presidential candidate after consistently trailing in third place. In the latest primary, held in Florida yesterday, Edwards gained only 14% of the vote, with Hillary Clinton polling 50% and Barack Obama on 33%. A reported 1.5m voters turned out to vote.

⇒

State  Party  Cand.     %
FL     Dem.   Edwards   14
FL     Dem.   Clinton   50
FL     Dem.   Obama     33

2/19 Named-Entity Recognition

I Input: text

I Output: text annotated with named-entities

Senator John Edwards is to drop out of the race to become the Democratic party’s presidential candidate after consistently trailing in third place. In the latest primary, held in Florida yesterday, Edwards gained only 14% of the vote, with Hillary Clinton polling 50% and Barack Obama on 33%. A reported 1.5m voters turned out to vote.

⇒

[PER Senator John Edwards] is to drop out of the race to become the [GPE Democratic party]’s presidential candidate after consistently trailing in third place. In the latest primary, held in [LOC Florida] yesterday, [PER Edwards] gained only 14% of the vote, with [PER Hillary Clinton] polling 50% and [PER Barack Obama] on 33%. A reported 1.5m voters turned out to vote.

3/19 Reference Resolution

I Input: text possibly with annotated named-entities

I Output: text annotated with named-entities and the real-world entities they refer to.

[PER Senator John Edwards] is to drop out of the race to become the [GPE Democratic party]’s presidential candidate after consistently trailing in third place. In the latest primary, held in [LOC Florida] yesterday, [PER Edwards] gained only 14% of the vote, with [PER Hillary Clinton] polling 50% and [PER Barack Obama] on 33%. A reported 1.5m voters turned out to vote.

⇒ [PER Senator John Edwards] refers to … [PER Edwards] refers to … (each mention is linked to its real-world entity)

4/19 Coreference Resolution

I Input: text possibly with annotated named-entities

I Output: text with annotations of coreference chains.

[PER Senator John Edwards] is to drop out of the race to become the [GPE Democratic party]’s presidential candidate after consistently trailing in third place. In the latest primary, held in [LOC Florida] yesterday, [PER Edwards] gained only 14% of the vote, with [PER Hillary Clinton] polling 50% and [PER Barack Obama] on 33%. A reported 1.5m voters turned out to vote. This was a huge setback for the Senator from [LOC North Carolina].

⇒ [PER Senator John Edwards], [PER Edwards], and the Senator from [LOC North Carolina] refer to the same entity.

5/19 Relation Extraction

I Input: text annotated with named-entitites

I Output: populated relational database with relations between entities.

Senator John Edwards is to drop out of the race to become the Democratic party’s presidential candidate after consistently trailing in third place. In the latest primary, held in Florida yesterday, Edwards gained only 14% of the vote, with Hillary Clinton polling 50% and Barack Obama on 33%. A reported 1.5m voters turned out to vote.

⇒

Person           Member-of
John Edwards     Democrat Party
Hillary Clinton  Democrat Party
Barack Obama     Democrat Party

6/19 Encoding for Named-Entity Recognition

I Named-entity recognition is typically formulated as a sequence tagging problem.

I We somehow encode the boundaries and types of the named-entities.

I BIO Encoding

I B-type indicates the beginning token/word of a named-entity (of type type)

I I-type indicates (any) other tokens of a named-entity (length > 1)

I O indicates that a token is not a part of any named-entity.

I BIOLU Encoding

I BIO same as above

I L-type indicates last token of a named-entity (length > 1)

I U-type indicates a single token named-entity (length = 1)

7/19 Encoding for Named-Entity Recognition

With/O that/O ,/O Edwards/B-PER '/O campaign/O will/O end/O the/O way/O
it/O began/O 13/O months/O ago/O –/O with/O the/O candidate/O pitching/O
in/O to/O rebuild/O lives/O in/O a/O city/O still/O ravaged/O by/O
Hurricane/B-NAT Katrina/I-NAT ./O Edwards/B-PER embraced/O New/B-LOC Orleans/I-LOC as/O a/O glaring/O
symbol/O of/O what/O he/O described/O as/O a/O Washington/B-GPE that/O did/O
n't/O hear/O the/O cries/O of/O the/O downtrodden/O ./O

8/19 NER as a Sequence Modeling Problem

9/19 Evaluation of NER Performance

I Recall: What percentage of the actual named-entities did you correctly label?

I Precision: What percentage of the named-entities you labeled were actually correctly labeled?

Correct NEs (C), hypothesized NEs (H), and their overlap C ∩ H:

R = |C ∩ H| / |C|        P = |C ∩ H| / |H|        F1 = 2 · R · P / (R + P)
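As a small illustration of these formulas (not part of the original slides; the span indices below are hypothetical), precision, recall, and F1 can be computed from sets of entity spans:

    def prf(correct, hypothesized):
        """Precision, recall, and F1 over sets of (start, end, type) spans."""
        overlap = len(correct & hypothesized)
        p = overlap / len(hypothesized) if hypothesized else 0.0
        r = overlap / len(correct) if correct else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    # Hypothetical spans for the "Microsoft Corp." example below: 3 gold entities,
    # 4 hypothesized, only one of which ("Microsoft Corp.") is correct.
    gold = {(0, 2, "ORG"), (3, 5, "PER"), (10, 12, "PRODUCT")}
    hyp  = {(0, 2, "ORG"), (2, 3, "PER"), (3, 4, "PER"), (12, 13, "DATE")}
    print(prf(gold, hyp))   # P = 1/4, R = 1/3, F1 ≈ 0.286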

I Actual: [Microsoft Corp.] CEO [Steve Ballmer] announced the release of [Windows 7] today

I Tagged: [Microsoft Corp.] [CEO] [Steve] Ballmer announced the release of Windows 7 [today]

I What is R, P, and F?

10/19 NER System Architecture

11/19 Relation Extraction

Some types of relations:

Relations                 Examples                      Type
Affiliations
  Personal                married to, mother of         PER → PER
  Organizational          spokesman for, president of   PER → ORG
  Artifactual             owns, invented, produces      (PER | ORG) → ART
Geospatial
  Proximity               near, on outskirts of         LOC → LOC
  Directional             southeast of                  LOC → LOC
Part-Of
  Organizational          unit of, parent-of            ORG → ORG
  Political               annexed, acquired             GPE → GPE

12/19 Seeding Tuples

I Provide some examples

I Brad is married to Angelina.

I Bill is married to Hillary.

I Hillary is married to Bill.

I Hillary is the wife of Bill.

I Induce/provide seed patterns.

I X is married to Y

I X is the wife of Y

I Find other examples of X and Y mentioned closely and generate new patterns

I Hillary and Bill wed in 1975 ⇒ X and Y wed
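A minimal sketch of this pattern-induction step (illustrative only; the seed pairs, sentences, and regular expression are invented for the example, and real systems also keep the left/right context around the pair):

    import re

    seed_pairs = {("Brad", "Angelina"), ("Bill", "Hillary")}
    corpus = [
        "Hillary and Bill wed in 1975.",
        "Brad is married to Angelina.",
    ]

    def find_new_patterns(pairs, sentences, window=60):
        """If both members of a known pair occur close together, keep the text
        between them as a candidate pattern with X/Y slots."""
        new_patterns = set()
        for x, y in pairs:
            for a, b in ((x, y), (y, x)):       # the pair may appear in either order
                for s in sentences:
                    m = re.search(re.escape(a) + r"(.{1," + str(window) + r"}?)" + re.escape(b), s)
                    if m:
                        new_patterns.add("{X}" + m.group(1) + "{Y}")
        return new_patterns

    print(find_new_patterns(seed_pairs, corpus))
    # {'{X} is married to {Y}', '{X} and {Y}'}  -- candidate patterns to vet before reuse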

13/19 Bootstrapping Relations

14/19 Information Retrieval – the Vector Space Model

I Each document Di is represented by a |V|-dimensional vector di (V is the vocabulary of words/tokens):

di[j] = count of word ωj in document Di

I A query Q is represented the same way with the vector q:

q[j] = count of word ωj in query Q

I Vector Similarity ⇒ Relevance of Document Di to Query Q

cosine similarity(di, q) = (di · q) / (||di|| × ||q||)

I Twists: tf–idf (term frequency – inverse document frequency):

x[j] = count(ωj) × log(# docs / # docs with ωj)
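A compact illustration of these formulas (a sketch, not the lecture's code; the toy documents and query are invented):

    import math
    from collections import Counter

    docs = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats"]
    query = "cat on a mat"

    def tf_idf_vector(text, all_docs):
        """Raw counts weighted by log(#docs / #docs containing the word)."""
        counts = Counter(text.split())
        n_docs = len(all_docs)
        vec = {}
        for word, count in counts.items():
            df = sum(1 for d in all_docs if word in d.split())
            if df:
                vec[word] = count * math.log(n_docs / df)
        return vec

    def cosine(u, v):
        dot = sum(u[w] * v[w] for w in u if w in v)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    q = tf_idf_vector(query, docs)
    for d in docs:
        print(round(cosine(tf_idf_vector(d, docs), q), 3), d)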

I Recall, Precision, Ranking

15/19 Information Retrieval – Evaluation

I Recall = (Number of Relevant Documents Retrieved) / (Number of Actual Relevant Documents in the Database)

I Precision = (Number of Relevant Documents Retrieved) / (Number of Documents Retrieved)

I Can you fool these?

I Are these useful? (Why did Google win the IR wars?)

I Ranking?

I Is the “best” document close to the top of the list?

16/19 Question Answering

17/19 Question Answering Evaluation

I We typically get a list of answers back.

I Higher ranked correct answers are more valued.

I Mean reciprocal rank

mean reciprocal rank = (1/T) · Σ_{i=1..T} 1 / (rank of the first correct answer to question i)
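For example (an illustrative sketch; the rank data are invented):

    def mean_reciprocal_rank(first_correct_ranks):
        """first_correct_ranks[i] is the rank of the first correct answer to
        question i, or None if no correct answer was returned."""
        total = sum(1.0 / r for r in first_correct_ranks if r)
        return total / len(first_correct_ranks)

    # Three questions: correct answer at rank 1, at rank 3, and never returned.
    print(mean_reciprocal_rank([1, 3, None]))   # (1 + 1/3 + 0) / 3 ≈ 0.444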

18/19 Some General Tools

I Supervised classification

I Feature vector representations

I Bootstrapping

I Evaluation:

I Precision and recall (and their curves)

I Mean reciprocal rank


11-411 Natural Language Processing

Words and Computational Morphology

Kemal Oflazer

Carnegie Mellon University - Qatar

Words, Words, Words n Words in natural languages usually encode many pieces of information. ¨What the word “means” in the real world ¨What categories, if any, the word belongs to ¨What the function of the word in the sentence is

¨Nouns: How many?, Do we already know what they are?, How does it relate to the verb?, …

¨Verbs: When, how, who,…

2


Morphology n Languages differ widely in ¨What information they encode in their words ¨How they encode the information.

3

n I am swim-m+ing. ¨(Presumably) we know what swim “means” ¨The +ing portion tells us that this event is taking place at the time the utterance is taking place. ¨What’s the deal with the extra m?

4


n (G) Die Frau schwimm+t. ¨The schwimm(en) refers to the same meaning ¨The +t portion tells us that this event is taking place now and that a single entity other than the speaker and the hearer is swimming (or plural hearers are swimming)

n Ich schwimme, Du schwimmst, Er/Sie/Es schwimmt, Wir schwimmen, Ihr schwimmt Sie schwimmen

5

n (T) Ben eve git+ti+m. n I to-the-house I-went. n (T) Sen evi gör+dü+n n You the-house you-saw. n What’s the deal with +ti vs +dü? n What’s the deal with +m and +n?

6


Dancing in Andalusia

n A poem by the early 20th century Turkish poet Yahya Kemal Beyatlı.

7

ENDÜLÜSTE RAKS Zil, şal ve gül, bu bahçede raksın bütün hızı Şevk akşamında Endülüs, üç defa kırmızı Aşkın sihirli şarkısı, yüzlerce dildedir İspanya neşesiyle bu akşam bu zildedir

Yelpaze gibi çevrilir birden dönüşleri İşveyle devriliş, saçılış, örtünüşleri Her rengi istemez gözümüz şimdi aldadır İspanya dalga dalga bu akşam bu şaldadır

Alnında halka halkadır âşufte kâkülü Göğsünde yosma Gırnata'nın en güzel gülü Altın kadeh her elde, güneş her gönüldedir İspanya varlığıyla bu akşam bu güldedir

Raks ortasında bir durup oynar, yürür gibi Bir baş çevirmesiyle bakar öldürür gibi Gül tenli, kor dudaklı, kömür gözlü, sürmeli Şeytan diyor ki sarmalı, yüz kerre öpmeli

Gözler kamaştıran şala, meftûn eden güle Her kalbi dolduran zile, her sineden ole!


ENDÜLÜSTE RAKS, annotated words:
zildedir: a verb derived from the locative case of the noun “zil” (castanet), “is at the castanet”
dönüşleri: plural infinitive and possessive form of the verb “dön” (rotate), “their (act of) rotating”
istemez: negative present form of the verb “iste” (want), “it does not want”
varlığıyla: singular possessive instrumental-case of the noun “varlık” (wealth), “with its wealth”
kamaştıran: present participle of the verb “kamaş” (blind), “that which blinds…”

BAILE EN ANDALUCIA Castañuela, mantilla y rosa. El baile veloz llena el jardín... En esta noche de jarana, Andalucíá se ve tres veces carmesí... Cientas de bocas recitan la canción mágica del amor. La alegría española esta noche, está en las castañuelas.

Como el revuelo de un abanico son sus vueltas súbitas, Con súbitos gestos se abren y se cierran las faldas. Ya no veo los demás colores, sólo el carmesí, La mantilla esta noche ondea a españa entera en sí.

Con un encanto travieso, cae su pelo hacia su frente; La mas bonita rosa de Granada en su pecho rebelde. Se para y luego continúa como si caminara, Vuelve la cara y mira como si apuntara y matara.

Labios ardientes, negros ojos y de rosa su tez! Luzbel me susurra: ¡Ánda bésala mil veces!

¡Olé a la rosa que enamora! ¡Olé al mantilla que deslumbra! ¡Olé de todo corazón a la castañuela que al espíritu alumbra!”


BAILE EN ANDALUCIA, annotated words:
castañuelas: either the plural feminine form of the adjective “castañuelo” (Castilian) or the plural of the feminine noun “castañuela” (castanet)

vueltas: either the plural form of the feminine noun “vuelta” (spin?) or the feminine plural past participle of the verb “volver”


veo: first person present indicative of the verb “ver” (see?)

enamora: either the 3rd person singular present indicative or the 2nd person imperative of the verb “enamorar” (woo)


DANCE IN ANDALUSIA Castanets, shawl and rose. Here's the fervour of dance, Andalusia is threefold red in this evening of trance. Hundreds of tongues utter love's magic refrain, In these castanets to-night survives the gay Spain,

Animated turns like a fan's fast flutterings, Fascinating bendings, coverings, uncoverings. We want to see no other color than carnation, Spain does subsist in this shawl in undulation.

Her bewitching locks on her forehead is overlaid, On her chest is the fairest rose of Granada. Golden cup in hand, sun in every mind, Spain this evening in this shawl defined.

'Mid a step a halt, then dances as she loiters, She turns her head round and looks daggers. Rose-complexioned, fiery-lipped, black-eyed, painted, To embracing and kissing her over one's tempted,

To dazzling shawl, to the rose charmingly gay To castanets from every heart soars an "ole!".

DANCE IN ANDALUSIA, annotated words:
castanets: plural noun


bewitching: gerund form of the verb bewitch

evening: either a noun or the present continuous form of the verb “even”


Spanischer Tanz Zimbel, Schal und Rose- Tanz in diesem Garten loht. In der Nacht der Lust ist Andalusien dreifach rot! Und in tausend Zungen Liebeszauberlied erwacht- Spaniens Frohsinn lebt in diesen Zimbeln heute Nacht!

Wie ein Fäscher: unvermutet das Sich-Wenden, Biegen, Ihr kokettes Sich - Verhüllen, Sich - Entfalten, Wiegen - Unser Auge, nichts sonst wünschend - sieht nur Rot voll Pracht: Spanien wogt und wogt in diesem Schal ja heute Nacht!

Auf die Stirn die Ringellocken fallen lose ihr, Auf der Brust erblüht Granadas schönste Rose ihr, Goldpokal in jeder Hand, im Herzen Sonne lacht Spanien lebt und webt in dieser Rose heute Nacht!

Jetzt im Tanz ein spielend Schreiten, jetz ein Steh’n, Zurück Tötend, wenn den Kopf sie wendet, scheint ihr rascher Blick. Rosenleib, geschminkt, rotlippig, schwarzer Augen Strahl Der Verführer lockt: «Umarme, küsse sie hundertmal!»

Für den Schal so blendend, Zaubervoller Rose Lust, Zimbel herzerfüllend, ein Ole aus jeder Brust!

Spanischer Tanz, annotated words:
Zimbeln: plural of the feminine noun “Zimbel”


Liebeszauberlied: compound noun, “magic love song” (?)

Ringellocken: compound noun, “convoluted curls” (?)


herzerfüllend: noun-verb/participle compound, “that which fulfills the heart” (?)

Aligned Verses Zil, şal ve gül, bu bahçede raksın bütün hızı Şevk akşamında Endülüs, üç defa kırmızı Aşkın sihirli şarkısı, yüzlerce dildedir İspanya neş'esiyle bu akşam bu zildedir

Castañuela, mantilla y rosa. El baile veloz llena el jardín... En esta noche de jarana, Andalucíá se ve tres veces carmesí... Cientas de bocas recitan la canción mágica del amor. La alegría española esta noche, está en las castañuelas.

Castanets, shawl and rose. Here's the fervour of dance, Andalusia is threefold red in this evening of trance. Hundreds of tongues utter love's magic refrain, In these castanets to-night survives the gay Spain,

Zimbel, Schal und Rose- Tanz in diesem Garten loht. In der Nacht der Lust ist Andalusien dreifach rot! Und in tausend Zungen Liebeszauberlied erwacht- Spaniens Frohsinn lebt in diesen Zimbeln heute Nacht! 24


Why do we care about words? n Many language processing applications need to extract the information encoded in the words. ¨ Parsers which analyze sentence structure need to know/check agreement between n subjects and verbs n Adjectives and nouns n and nouns, etc. ¨ Information retrieval systems benefit from knowing what the stem of a word is ¨ Machine translation systems need to analyze words into their components and generate words with specific features in the target language

25

Computational Morphology n Computational morphology deals with ¨developing theories and techniques for ¨computational analysis and synthesis of word forms.

26


Computational Morphology -Analysis n Extract any information encoded in a word and bring it out so that later layers of processing can make use of it. n books Þ book+Noun+Plural Þ book+Verb+Pres+3SG. n stopping Þ stop+Verb+Cont n happiest Þ happy+Adj+Superlative n went Þ go+Verb+Past
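As a toy illustration of the analysis direction (a sketch only; the mini-lexicon, suffix table, and orthographic tricks below are invented and far smaller than a real analyzer's data):

    LEXICON = {"book": {"Noun", "Verb"}, "stop": {"Verb"}, "happy": {"Adj"}, "go": {"Verb"}}
    SUFFIXES = [("s", "Noun", "+Noun+Plural"), ("s", "Verb", "+Verb+Pres+3SG"),
                ("ping", "Verb", "+Verb+Cont"), ("iest", "Adj", "+Adj+Superlative")]
    IRREGULAR = {"went": ["go+Verb+Past"]}

    def analyze(word):
        """Return all analyses licensed by the toy lexicon and suffix table."""
        if word in IRREGULAR:
            return IRREGULAR[word]
        analyses = []
        for suffix, pos, feats in SUFFIXES:
            if word.endswith(suffix):
                stem = word[: -len(suffix)]
                # undo the toy orthographic changes: stop+ping -> stop, happ+iest -> happy
                for candidate in (stem, stem + "y"):
                    if pos in LEXICON.get(candidate, set()):
                        analyses.append(candidate + feats)
        return analyses or [word + "+Unknown"]

    for w in ["books", "stopping", "happiest", "went"]:
        print(w, "=>", analyze(w))
    # books => ['book+Noun+Plural', 'book+Verb+Pres+3SG'], stopping => ['stop+Verb+Cont'], ...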

27

Computational Morphology -Generation n In a machine translation application, one may have to generate the word corresponding to a set of features n stop+Past Þ stopped n (T) dur+Past+1Pl Þ durduk +2Pl Þ durdunuz

28


Computational Morphology-Analysis n Input raw text Pre-processing n Segment / Tokenize } n Analyze individual words Morphological Processing n Analyze multi-word constructs } n Disambiguate Morphology n Syntactically analyze Syntactic sentences } Processing n …. 29

Some other applications n Spelling Checking ¨ Check if words in a text are all valid words n Spelling Correction ¨ Find correct words “close” to a misspelled word. n For both these applications, one needs to know what constitutes a valid word in a language. ¨ Rather straightforward for English ¨ Not so for Turkish – n cezalandırılacaklardan (ceza+lan+dır+ıl+acak+lar+dan)
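For spelling correction, “closeness” is usually measured by edit distance; below is a minimal sketch (illustrative only; standard dynamic-programming Levenshtein distance, with an invented toy dictionary):

    def edit_distance(a, b):
        """Minimum number of insertions, deletions, and substitutions turning a into b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution (or match)
            prev = curr
        return prev[-1]

    # Rank dictionary words by closeness to a misspelling.
    dictionary = ["morphology", "phonology", "morpheme", "metrology"]
    word = "morfology"
    print(sorted(dictionary, key=lambda w: edit_distance(word, w))[:2])
    # ['morphology', 'metrology']  (edit distances 2 and 3)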

30


Some other applications n Grammar Checking ¨Checks if a (local) sequence of words violate some basic constraints of language (e.g., agreement) n Text-to-speech ¨Proper stress/prosody may depend on proper identification of morphemes and their properties. n Machine Translation (especially between closely related languages)

¨E.g., Turkmen to Turkish translation 31

Text-to-speech n I read the book. ¨Can’t really decide what the pronunciation is n Yesterday, I read the book. ¨read must be a past tense verb. n He read the book n read must be a past tense verb. ¨ (T) oKU+ma (don’t read) oku+MA (reading) ok+uM+A (to my arrow)

32


Morphology n Morphology is the study of the structure of words. ¨Words are formed by combining smaller units of linguistic information called morphemes, the building blocks of words. ¨Morphemes in turn consist of phonemes and, in abstract analyses, morphophonemes. Often, we will deal with orthographical symbols.

33

Morphemes n Morphemes can be classified into two groups: ¨Free Morpheme: Morphemes which can occur as a word by themselves.

n e.g., go, book,

¨Bound Morphemes: Morphemes which are not words in their own right, but have to be attached in some way to a free morpheme.

n e.g., +ing, +s, +ness

34


Dimensions of Morphology n “Complexity” of Words ¨How many morphemes? n Morphological Processes ¨What functions do morphemes perform? n Morpheme combination ¨How do we put the morphemes together to form words?

35

“Complexity” of Word Structure n The kind and amount information that is conveyed with morphology differs from language to language. ¨Isolating Languages

¨Inflectional Languages

¨Agglutinative Languages

¨Polysynthetic Languages

36


Isolating languages n Isolating languages do not (usually) have any bound morphemes ¨ Mandarin Chinese ¨ Gou bu ai chi qingcai (dog not like eat vegetable) ¨ This can mean one of the following (depending on the context) n The dog doesn’t like to eat vegetables n The dog didn’t like to eat vegetables n The dogs don’t like to eat vegetables n The dogs didn’t like to eat vegetables. n Dogs don’t like to eat vegetables.

37

Inflectional Languages n A single bound morpheme conveys multiple pieces of linguistic information n (R) most+u: Noun, Sing, Dative pros+u: Verb, Present, 1sg n (S) habla+mos: Verb, Perfect, 1pl Verb, Pres.Ind., 1pl

38


Agglutinative Languages n (Usually multiple) Bound morphemes are attached to one (or more) free morphemes, like beads on a string. ¨ Turkish/Turkic, Finnish, Hungarian ¨ Swahili, Aymara n Each morpheme encodes one "piece" of linguistic information. ¨(T) gid+iyor+du+m: continuous, Past, 1sg (I was going)

39

Agglutinative Languages n Turkish n Finlandiyalılaştıramadıklarımızdanmışsınızcasına n (behaving) as if you have been one of those whom we could not convert into a Finn(ish citizen)/someone from Finland n Finlandiya+lı+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına

¨ Finlandiya+Noun+Prop+A3sg+Pnon+Nom n ^DB+Adj+With/From n ^DB+Verb+Become n ^DB+Verb+Caus n ^DB+Verb+Able+Neg n ^DB+Noun+PastPart+A3pl+P1pl+Abl n ^DB+Verb+Zero+Narr+A2pl n ^DB+Adverb+AsIf

40


Agglutinative Languages

n Aymara ¨ ch’uñüwinkaskirïyätwa ¨ ch’uñu +: +wi +na -ka +si -ka -iri +: +ya:t(a) +wa

n I was (one who was) always at the place for making ch’uñu’ ch’uñu N ‘freeze-dried potatoes’ +: N>V be/make … +wi V>N place-of +na in (location) -ka N>V be-in (location) +si continuative -ka imperfect -iri V>N one who +: N>V be +ya:ta 1P recent past +wa affirmative sentencial 41 Example Courtesy of Ken Beesley

Agglutinative Languages

n Finnish Numerals ¨Finnish numerals are written as one word and all components inflect and agree in all aspects

¨ Kahdensienkymmenensienkahdeksansien

two ten eighth (28th) kaksi+Ord+Pl+Gen kymmenen+Ord+Pl+Gen kahdeksan+Ord+Pl+Gen kahde ns i en kymmene ns i en kahdeksa ns i en

42 Example Courtesy of Lauri Karttunen


Agglutinative Languages

n Hungarian ¨szobáikban = szoba[N/room] + ik[PersPl-3- PossPl] + ban[InessiveCase]

n In their rooms ¨faházaikból = fa[N/wooden] + ház[N/house] + aik[PersPl3-PossPl] +ból[ElativeCase]

n From their wooden houses ¨olvashattam = olvas[V/read] + hat[DeriV/is_able] + tam[Sg1-Past]

n I was able to read

Examples courtesy of Gabor Proszeky 43

Agglutinative Languages

n Swahili ¨walichotusomea = wa[Subject Pref]+li[Past]+cho[Rel Prefix]+tu[Obj Prefix 1PL]+som[read/Verb]+e[Prep Form]+a[]

n that (thing) which they read for us ¨tulifika=tu[we]+li[Past]+fik[arrive/Verb]+a[]

n We arrived ¨ninafika=ni[I]+na[Present]+fik[arrive/Verb]+a[]

n I am arriving

44


Polysynthetic Languages

n Use morphology to combine syntactically related components (e.g. verbs and their arguments) of a sentence together ¨ Certain Eskimo languages, e.g., Inuktitut

¨ qaya:liyu:lumi: he was excellent at making kayaks

45

Polysynthetic Languages

n Use morphology to combine syntactically related components (e.g. verbs and their arguments) of a sentence together • Parismunngaujumaniralauqsimanngittunga Paris+mut+nngau+juma+niraq+lauq+si+ma+nn git+jun

¨“I never said that I wanted to go to Paris”

46 Example Courtesy of Ken Beesley


Arabic n Arabic seems to have aspects of ¨Inflecting languages

n wktbt (wa+katab+at “and she wrote …”) ¨Agglutinative languages

n wsyktbunha (wa+sa+ya+ktub+ūn+ha “and will (masc) they write her) ¨Polysynthetic languages

47

Morphological Processes n There are essentially 3 types of morphological processes which determine the functions of morphemes:

¨Inflectional Morphology

¨Derivational Morphology

¨Compounding

48


Inflectional Morphology n Inflectional morphology introduces relevant information to a word so that it can be used in the syntactic context properly. ¨That is, it is often required in particular syntactic contexts. n Inflectional morphology does not change the part-of-speech of a word. n If a language marks a certain piece of inflectional information, then it must mark that on all appropriate words.

49

Inflectional Morphology n Subject-verb agreement, tense, aspect

English: I/you/we/they go, He/She/It goes
German: Ich gehe, Du gehst, Er/Sie/Es geht, Wir gehen, Ihr geht, Sie gehen
Turkish: (Ben) gidiyorum, (Sen) gidiyorsun, (O) gidiyor, (Biz) gidiyoruz, (Siz) gidiyorsunuz, (Onlar) gidiyorlar

n Constituent function (indicated by case marking)
Biz eve gittik – We went to the house.
Biz evi gördük – We saw the house.
Biz evden nefret ettik – We hated the house.
Biz evde kaldık. – We stayed at the house.

50


Inflectional Morphology n Number, case, possession, gender, noun- class for nouns ¨(T) ev+ler+in+den (from your houses) ¨Bantu marks noun class by a prefix. n Humans: m+tu (person) wa+tu (persons) n Thin-objects: m+ti (tree) mi+ti (trees) n Paired things: ji-cho (eye) ma+cho (eyes) n Instrument: ki+tu (thing) vi+tu (things) n Extended body parts: u+limi (tongue) n+dimi (tongues)

51

Inflectional Morphology n Gender and/or case marking may also appear on adjectives in agreement with the nouns they modify (G) ein neuer Wagen eine schöne Stadt ein altes Auto

52


Inflectional Morphology n Case/Gender agreement for determiners n (G) Der Bleistift (the pencil) Den Bleistift (the pencil (object/Acc)) Dem Bleistift (the pencil (Dative)) Des Bleistifts (of the pencil)

Die Frau (the woman) Die Frau (the woman (object/Acc)) Der Frau (the woman (Dative)) Der Frau (of the woman)

53

Inflectional Morphology n (A) Perfect verb subject conjugation (masc form only)

        Singular   Dual        Plural
1st     katabtu    –           katabnā
2nd     katabta    katabtumā   katabtum
3rd     kataba     katabā      katabtū

n (A) Imperfect verb subject conjugation

        Singular   Dual        Plural
1st     aktubu     –           naktubu
2nd     taktubu    taktubān    taktubūn
3rd     yaktubu    yaktubān    yaktubūn

54


Derivational Morphology n Derivational morphology produces a new word with usually a different part-of-speech category. ¨e.g., make a verb from a noun. n The new word is said to be derived from the old word.

55

Derivational Morphology

¨happy (Adj) Þ happi+ness (Noun)

¨(T) elçi (Noun, ambassador) Þ elçi+lik (Noun, embassy) ¨(G) Botschaft (Noun, embassy) Þ Botschaft+er (Noun, ambassador) ¨(T) git (Verb, go) Þ gid+er+ken (Adverb, while going)

56


Derivational Morphology n Productive vs. unproductive derivational morphology

¨Productive: can apply to almost all members of a class of words

¨Unproductive: applies to only a few members of a class or words

n lexicalized derivations (e.g., application as in “application program”)

57

Compounding n Compounding is concatenation of two or more free morphemes (usually nouns) to form a new word (though the boundary between normal words and compounds is not very clear in some languages)

¨firefighter / fire-fighter ¨(G) Lebensversicherungsgesellschaftsangesteller (life insurance company employee) ¨(T) acemborusu ((lit.) Persian pipe – neither Persian nor pipe, but a flower) 58


Combining Morphemes n Morphemes can be combined in a variety of ways to make up words: ¨ Concatenative

¨ Infixation

¨ Circumfixation

¨ Templatic Combination

¨ Reduplication

59

Concatenative Combination n Bound morphemes are attached before or after the free morpheme (or any other intervening morphemes). ¨Prefixation: bound morphemes go before the free morpheme n un+happy ¨Suffixation: bound morphemes go after the free morpheme n happi+ness ¨ Need to be careful about the order [un+happi]+ness (not un +[happi+ness] n el+ler+im+de+ki+ler

60


Concatenative Combination n Such concatenation can trigger spelling (orthographical) and/or phonological changes at the concatenation boundary (or even beyond) ¨happi+ness ¨(T) şarap (wine) Þ şarab+ı ¨(T) burun (nose) Þ burn+a ¨(G) der Mann (man) Þ die Männ+er (men)

61

Infixation n The bound morpheme is inserted into free morpheme stem.

¨Bontoc (due to Sproat) n fikas (Adj, strong) Þ fumikas (Verb, to be strong) ¨Tagalog n pili Þ pumili, pinili

62


Circumfixation n Part of the morpheme goes before the stem, part goes after the stem. n German past participle ¨ tauschen (Verb, to exchange ) Þ getauscht (Participle) ¨ Sagen (Verb, to say) Þ gesagt

63

Templatic Combination n The root is modulated with a template to generate a stem to which other morphemes can be added by concatenation etc. n Semitic Languages (e.g., Arabic) ¨ root ktb (the general concept of writing) ¨ template CVCCVC ¨ vocalism (a,a) ¨ result: k a t t a b (root consonants k t b filled into C V C C V C)

64


Templatic Combination n More examples of templatic combination

TEMPLATE      VOWEL PATTERN
              aa (active)   ui (passive)
CVCVC         katab         kutib          ‘write’
CVCCVC        kattab        kuttib         ‘cause to write’
CVVCVC        ka:tab        ku:tib         ‘correspond’
tVCVVCVC      taka:tab      tuku:tib       ‘write each other’
nCVVCVC       nka:tab       nku:tib        ‘subscribe’
CtVCVC        ktatab        ktutib         ‘write’
stVCCVC       staktab       stuktib        ‘dictate’
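A sketch of the interdigitation operation itself (illustrative only; the function name is mine, and it follows the table above for the simple templates):

    def interdigitate(root, template, vocalism):
        """Fill a CV template with root consonants (C) and vocalism vowels (V).
        Handles only simple templates like CVCVC; geminating templates such as
        CVCCVC (katab -> kattab) would need an extra spreading rule."""
        consonants, vowels = list(root), list(vocalism)
        return "".join(consonants.pop(0) if slot == "C" else vowels.pop(0)
                       for slot in template)

    print(interdigitate("ktb", "CVCVC", "aa"))   # katab  ('write', active)
    print(interdigitate("ktb", "CVCVC", "ui"))   # kutib  ('write', passive)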

65

Reduplication n Some or all of a word is duplicated to mark a morphological process ¨ Indonesian n orang (man) Þ orangorang (men) ¨ Bambara n wulu (dog) Þ wuluowulu (whichever dog) ¨ Turkish n mavi (blue) Þ masmavi (very blue) n kırmızı (red) Þ kıpkırmızı (very red)

n koşa koşa (by running)

66


Zero Morphology n Derivation/inflection takes place without any additional morpheme ¨ English n second (ordinal) Þ (to) second (a motion) n man (noun) Þ (to) man (a place)

67

Subtractive morphology n Part of the stem is removed to mark a morphological feature n Sproat (1992) gives Koasati as a language where part of the word is removed to indicate a plural subject agreement.

¨ obkahitiplin (go backwards, singular subject) ¨ obakhitlin (go backwards, plural subject)

68


(Back to) Computational Morphology n Computational morphology deals with ¨developing theories and techniques for ¨computational analysis and synthesis of word forms.

n Analysis: Separate and identify the constituent morphemes and mark the information they encode

n Synthesis (Generation): Given a set constituent morphemes or information be encoded, produce the corresponding word(s)

69

Computational Morphology n Morphological analysis:

Word (a sequence of characters) → [Morphological Analyzer] → All possible analyses (Lemma/Root + features encoded in the word)

Break down a given word form into its constituents and map them to features.

70


Computational Morphology n Morphological analysis, for example:

stopping → [Morphological Analyzer] → stop+Verb+PresCont

71

Computational Morphology n Ideally we would like to be able to use the same system “in reverse” to generate words from a given sequence or morphemes ¨Take “analyses” as input ¨Produce words.

72


Computational Morphology n Morphological generation:

Analysis → [Morphological Generator] → Word(s)

73

Computational Morphology n Morphological generation, for example:

stop+Verb+PresCont → [Morphological Generator] → stopping

74


Computational Morphology n What is in the box?

Word(s) ↔ [Morphological Analyzer / Generator] ↔ Analyses

75

Computational Morphology n What is in the box?

Word(s) ↔ [Data + Engine] ↔ Analyses

n Data ¨ Language Specific n Engine ¨ Language Independent

76


Issues in Developing a Morphological Analyzer n What kind of data needs to be compiled? n What kinds of ambiguity can occur? n What are possible implementation strategies?

77

Some Terminology n Lexicon is a structured collection of all the morphemes ¨Root words (free morphemes) ¨Morphemes (bound morphemes) n Morphotactics is a model of how and in what order the morphemes combine. n Morphographemics is a model of what/how changes occur when morphemes are combined.

78


Data Needed - Lexicon n A list of root words with parts-of-speech and any other information (e.g. gender, animateness, etc.) that may be needed by morphotactical and morphographemic phenomena. ¨ (G) Kind (noun, neuter), Hund (noun, masculin) n A list of morphemes along with the morphological information/features they encode (using some convention for naming) ¨ +s (plural, PL), +s (present tense, 3rd person singular, A3SG)

79

Data Needed - Morphotactics n How morphemes are combined ¨ Valid morpheme sequences n Usually as “paradigms” based on (broad) classes of root words ¨ (T) In nouns, plural morpheme precedes possessive morpheme which precedes case morpheme ¨ (Fin) In nouns, Plural morpheme precedes case morpheme which precedes possessive morpheme ¨ (F,P) Certain classes of verbs follow certain “conjugation paradigms” ¨ Exceptions n go+ed is not a valid combination of morphemes ¨ go+ing is!

80


Data Needed - Morphotactics n Co-occurence/Long distance Constraints ¨This prefix only goes together with this suffix!

n (G) ge+tausch+t ¨This prefix can not go with this suffix

n (A) The prefix bi+ can only appear with genitive case ¨ bi+l+kaatib+u is not OK ¨ bi+l+kaatib+i is OK n (A) Definite nouns can not have indefinite endings ¨ al+kaatib+an is not OK ¨ al+kaatib+a is OK

n Only subset of suffixes apply to a set of roots 81 ¨ sheep does not have a plural form!

Data Needed - Morphographemics n A (comprehensive) inventory of what happens when morphemes are combined. ¨Need a consistent representation of morphemes n look+ed Þ looked, save+ed Þ saved ¨ Morpheme is +ed but under certain “conditions” the e may be deleted

n look+d Þ looked, save+d Þ saved ¨ Morpheme is +d but under certain “conditions” the e may be inserted

82


Representation n Lexical form: An underlying representation of morphemes w/o any morphographemic changes applied. ¨ easy+est ¨ shop+ed ¨ blemish+es ¨ vase+es n Surface Form: The actual written form ¨ easiest ¨ shopped ¨ blemishes ¨ vases

83

Representation n Lexical form: An underlying representation of morphemes w/o any morphographemic changes applied. ¨ ev+lAr A={a,e} Abstract meta-phonemes ¨ oda+lAr ¨ tarak+sH H={ı, i, u, ü} ¨ kese+sH n Surface Form: The actual written form ¨ evler ¨ odalar ¨ tarağı ¨ kesesi
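A sketch of how such a meta-phoneme might be realized for the Turkish plural +lAr (illustrative only; a real system would state this as two-level or cascade rules over the whole alphabet rather than a one-off function):

    BACK_VOWELS, FRONT_VOWELS = set("aıou"), set("eiöü")

    def realize_plural(lexical):
        """Realize the meta-phoneme A in a lexical form like 'ev+lAr' as a/e,
        depending on the last vowel of the stem (Turkish backness harmony)."""
        stem, suffix = lexical.split("+")
        last_vowel = next(c for c in reversed(stem) if c in BACK_VOWELS | FRONT_VOWELS)
        vowel = "a" if last_vowel in BACK_VOWELS else "e"
        return stem + suffix.replace("A", vowel)

    print(realize_plural("ev+lAr"))    # evler
    print(realize_plural("oda+lAr"))   # odalar
    print(realize_plural("kız+lAr"))   # kızlar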

84


Data Needed – Word List n A (large) list of words compiled from actual text ¨Test coverage ¨See if all ambiguities are produced.

85

Morphological Ambiguity n Morphological structure/interpretation is usually ambiguous ¨Part-of-speech ambiguity

n book (verb), book (noun) ¨Morpheme ambiguity

n +s (plural) +s (present tense, 3rd singular) n (T) +mA (infinitive), +mA (negation) ¨Segmentation ambiguity

n Word can be legitimately divided into morphemes in a number of ways

86


Morphological Ambiguity n The same surface form is interpreted in many possible ways in different syntactic contexts. (F) danse danse+Verb+Subj+3sg (lest s/he dance) danse+Verb+Subj+1sg (lest I dance) danse+Verb+Imp+2sg ((you) dance!) danse+Verb+Ind+3sg ((s/he) dances) danse+Verb+Ind+1sg ((I) dance) danse+Noun+Fem+Sg (dance) (E) read read+Verb+Pres+N3sg (VBP-I/you/we/they read) read+Verb+Past (VBD - read past tense) read+Verb+Participle+(VBN – participle form) read+Verb (VB - infinitive form) read+Noun+Sg (NN – singular noun)

87

Morphological Ambiguity n The same morpheme can be interpreted differently depending on its position in the morpheme order:

¨(T) git+me ((the act of) going), ¨ git+me (don’t go)

88

44 8/28/17

Morphological Ambiguity n The word can be segmented in different ways leading to different interpretations, e.g. (T) koyun: ¨ koyun+Noun+Sg+Pnon+Nom (koyun-sheep) ¨ koy+Noun+Sg+P2sg+Nom (koy+un-your bay) ¨ koy+Noun+Sg+Pnon+Gen (koy+[n]un – of the bay) ¨ koyu+Adj+^DB+Noun+Sg+P2sg+Nom (koyu+[u]n – your dark (thing) ¨ koy+Verb+Imp+2sg (koy+un – put (it) down)

89

Morphological Ambiguity n The word can be segmented in different ways leading to different interpretations, e.g. (Sw) frukosten: frukost + en ‘the breakfast’ frukost+en ‘breakfast juniper’ fru+kost+en ‘wife nutrition juniper’ fru+kost+en ‘the wife nutrition’ fru+ko+sten ‘wife cow stone’

(H) ebth: e+bth ‘that field’ e+b+th ‘that in tea(?)’ ebt+h ‘her sitting’ e+bt+h ‘that her daughter’

90

45 8/28/17

Morphological Ambiguity n Orthography could be ambiguous or underspecified.

16 possible interpretations

91

Morphological Disambiguation n Morphological Disambiguation or Tagging is the process of choosing the "proper" morphological interpretation of a token in a given context.

He can can the can.

92

46 8/28/17

Morphological Disambiguation n He can can the can. ¨Modal ¨Infinitive form ¨Singular Noun ¨Non-third person present tense verb

n We can tomatoes every summer.

93

Morphological Disambiguation n These days standard statistical approaches (e.g., Hidden Markov Models) can solve this problem with quite high accuracy. n The accuracy for languages with complex morphology/ large number of tags is lower.

94

47 8/28/17

Implementation Approaches for Computational Morphology n List all word-forms as a database n Heuristic/Rule-based affix-stripping n Finite State Approaches

95

Listing all Word Forms n List of all word forms and corresponding features in the analyses. ¨Feasible if the word list is “small”

….. harass harass V INF harassed harass V PAST harassed harass V PPART WK harasser harasser N 3sg harasser's harasser N 3sg GEN harassers harasser N 3pl harassers' harasser N 3pl GEN harasses harass V 3sg PRES harassing harass V PROG harassingly harassingly Adv harassment harassment N 3sg harassment's harassment N 3sg GEN harassments harassment N 3pl harassments' harassment N 3pl GEN harbinger harbinger N 3sg harbinger harbinger V INF harbinger's harbinger N 3sg GEN ….. 96

48 8/28/17

Listing all Word Forms n No need for spelling rules n Analysis becomes a simple search procedure ¨get the word form and search for matches, ¨output the features for all matching entries. n However, not very easy to create ¨Labor intensive n Not feasible for large / “infinite” vocabulary languages ¨Turkish, Aymara

97

Heuristic/Rule-based Affix-stripping n Uses ad-hoc language-specific rules ¨to split words into morphemes ¨to “undo” morphographemic changes

¨scarcity

n -ity looks like a noun making suffix, let’s strip it

n scarc is not a known root, so let’s add e and see if we get an adjective

98

49 8/28/17

Heuristic/Rule-based Affix-stripping n Uses ad-hoc language-specific rules ¨to split words into morphemes ¨to “undo” morphographemic changes

¨Lots of language-specific heuristics and rule development ¨Practically impossible to use in reverse as a morphological generator

99

Finite State Approaches n Finite State Morphology ¨Represent

n lexicons and

n morphographemic rules in one unified formalism of finite state transducers. n Two Level Morphology n Cascade Rules

100

50 8/28/17

OVERVIEW n Overview of Morphology n Computational Morphology n Overview of Finite State Machines n Finite State Morphology ¨Two-level Morphology ¨Cascade Rules

101

Why is the Finite State Approach Interesting? n Finite state systems are mathematically well-understood, elegant, flexible. n Finite state systems are computationally efficient. n For typical natural language processing tasks, finite state systems provide compact representations. n Finite state systems are inherently bidirectional

102

51 8/28/17

Finite State Concepts n Alphabet (A): Finite set of symbols A={a,b} n String (w): concatenation of 0 or more symbols. abaab, e(empty string)

103

Finite State Concepts n Alphabet (A): Finite set of symbols n String (w): concatenation of 0 or more symbols. n A*: The set of all possible strings A*={e,a,b,aa,ab,ba,bb,aaa,aab … }

104

52 8/28/17

Finite State Concepts n Alphabet (A): Finite set of symbols n String (w): concatenation of 0 or more symbols. n A*: The set of all possible strings n (formal) Language (L): a subset of A* ¨The set of strings with an even number of a's n e.g. abbaaba but not abbaa ¨The set of strings with equal number of a's and b's n e.g., ababab but not bbbaa

105

Alphabets and Languages

[Diagram: the alphabet A = {a,b} is finite; A* = {e, a, b, aa, ab, ba, bb, aaa, aab, …} is infinite; a language L ⊆ A* (e.g., the set of strings with an even number of a's) can be finite or infinite.]

106

53 8/28/17

Describing (Formal) Languages
n L = {a, aa, aab} – description by enumeration
n L = {a^n b^n : n ≥ 0} = {e, ab, aabb, aaabbb, ….}
n L = {w | w has an even number of a's}
n L = {w | w = w^R} – all strings that are the same as their reverses, e.g., a, abba
n L = {w | w = xx} – all strings that are formed by duplicating a string once, e.g., abab
n L = {w | w is a syntactically correct Java program}

107

Describing (Formal) Languages n L = {w : w is a valid English word} n L = {w: w is a valid Turkish word} n L = {w: w is a valid English sentence}

108

54 8/28/17

Languages n Languages are sets. So we can do “set” things with them ¨Union ¨Intersection ¨Complement with respect to the universe set A*.

109

Describing (Formal) Languages
n How do we describe languages?
¨ L = {a^n b^n : n ≥ 0} is fine but not that terribly useful
n We need finite descriptions (of usually infinite sets)
n We need to be able to use these descriptions in "mechanizable" procedures.
¨ e.g., to check if a given string is in the language or not

110

55 8/28/17

Recognition Problem

n Given a language L and a string w ¨Is w in L?

¨The problem is that interesting languages are infinite!

¨We need finite descriptions of (infinite) languages.

111

Classes of Languages

[Diagram: the alphabet A is finite; A* (the set of all possible strings) is infinite; the set of all languages is the set of all subsets of A*; a class of languages is a set of languages (e.g., L1, L2, L3).]

112

56 8/28/17

Classes of (Formal) Languages n Chomsky Hierarchy ¨Regular Languages (simplest) ¨Context Free Languages ¨Context Sensitive Languages ¨Recursively Enumerable Languages

113

Regular Languages n Regular languages are those that can be recognized by a finite state recognizer.

114

57 8/28/17

Regular Languages n Regular languages are those that can be recognized by a finite state recognizer. n What is a finite state recognizer?

115

Finite State Recognizers n Consider the mechanization of “recognizing” strings with even number of a’s. ¨Input is a string consisting of a’s and b’s n abaabbaaba ¨Count a’s but modulo 2, i.e., when count reaches 2 reset to 0. Ignore the b’s – we are not interested in b’s.

n 0a1b1a0a1b1b1a0a1b1a0 ¨Since we reset to 0 after seeing two a’s, any string that ends with a count 0 matches our condition.

116

58 8/28/17

Finite State Recognizers n 0a1b1a0a1b1b1a0a1b1a0 n The symbols shown in between input symbols encode/remember some “interesting” property of the string seen so far. ¨ e.g., a 0 shows that we have seen an even number of a’s, while a 1, shows an odd number of a’s, so far. n Let’s call this “what we remember from past” as the state. n A “finite state recognizer” can only remember a finite number of distinct pieces of information from the past.

117

Finite State Recognizers n We picture FSRs with state graphs.

[State graph: start state q0 (also the accept state) and state q1; transitions q0 –a→ q1, q1 –a→ q0, q0 –b→ q0, q1 –b→ q1; arcs are labeled with input symbols.]

118

59 8/28/17

Finite State Recognizers b a b

q0 q1

a

Is abab in the language? a b a b ^

119

Finite State Recognizers b a b

q0 q1

a

Is abab in the language? a b a b ^

120

60 8/28/17

Finite State Recognizers b a b

q0 q1

a

Is abab in the language? a b a b ^

121

Finite State Recognizers b a b

q0 q1

a

Is abab in the language? a b a b ^

122

61 8/28/17

Finite State Recognizers b a b

q0 q1

a

Is abab in the language? YES a b a b ^

123

Finite State Recognizers b a b

q0 q1

a

Is abab in the language? YES a b a b ^

The state q0 remembers the fact that we have seen an even number of a’s The state q1 remembers the fact that we have seen an odd number of a’s

124

62 8/28/17

Finite State Recognizers n Abstract machines for regular languages:

M = {Q, A, q0, Next, Final}

125

Finite State Recognizers n Abstract machines for regular languages:

M = {Q, A, q0, Next, Final} ¨Q: Set of states

126

63 8/28/17

Finite State Recognizers n Abstract machines for regular languages:

M = {Q, A, q0, Next, Final} ¨Q: Set of states ¨A: Alphabet

127

Finite State Recognizers n Abstract machines for regular languages:

M = {Q, A, q0, Next, Final} ¨Q: Set of states ¨A: Alphabet

¨q0: Initial state

128

64 8/28/17

Finite State Recognizers n Abstract machines for regular languages:

M = {Q, A, q0, Next, Final} ¨Q: Set of states ¨A: Alphabet

¨q0: Initial state ¨Next: Next state function Q x A ® Q

129

Finite State Recognizers n Abstract machines for regular languages:

M = {Q, A, q0, Next, Final} ¨Q: Set of states ¨A: Alphabet

¨q0: Initial state ¨Next: Next state function Q x A ® Q ¨Final: a subset of the states called the final states

130

65 8/28/17

Finite State Recognizers b a b

q0 q1

a

A = {a, b}
Q = {q0, q1}
Next = {((q0,b),q0), ((q0,a),q1), ((q1,b),q1), ((q1,a),q0)}
    e.g., ((q0,a),q1): if the machine is in state q0 and the input is a, then the next state is q1
Final = {q0}

131

Finite State Recognizers n Abstract machines for regular languages:

M = {Q, A, q0, Next, Final} n M accepts w Î A*, if

¨starting in state q0, M proceeds by looking at each symbol in w, and ¨ends up in one of the final states when the string w is exhausted.

132

66 8/28/17

Finite State Recognizers n Note that the description of the recognizer itself is finite. ¨Finite number of states ¨Finite number of alphabet symbols ¨Finite number of transitions n Number of strings accepted can be infinite.
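As an illustration, here is a small Python sketch of the recognizer M = {Q, A, q0, Next, Final} for strings over {a,b} with an even number of a's (a toy re-implementation of the machine above, not part of the original slides):

# The even-number-of-a's recognizer as explicit data.
Q     = {"q0", "q1"}
A     = {"a", "b"}
q0    = "q0"
Next  = {("q0", "a"): "q1", ("q0", "b"): "q0",
         ("q1", "a"): "q0", ("q1", "b"): "q1"}
Final = {"q0"}

def accepts(w):
    """Run M on w; reject if a symbol has no transition from the current state."""
    state = q0
    for symbol in w:
        if (state, symbol) not in Next:
            return False
        state = Next[(state, symbol)]
    return state in Final

print(accepts("abab"))   # True  (two a's)
print(accepts("abbaa"))  # False (three a's)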

133

Another Example

[FSR accepting the letter sequences s-l-e-e-p and s-l-e-e-p-i-n-g.]

n IMPORTANT CONVENTION: If at some state there is no transition for a symbol, we assume that the FSR rejects the string.

Accepts sleep, sleeping

134

67 8/28/17

Another Example

[FSR extended with branches for the forms sleeps and slept.]

Accepts sleep, sleeping, sleeps, slept

135

Another Example

[FSR extended with the root sweep sharing the suffix branches.]

Accepts sleep, sleeping, sleeps, slept, sweep, ….

136

68 8/28/17

Another Example

[FSR further extended with the root keep.]

Accepts sleep, sleeping, sleeps, slept, sweep, …., keep,…

137

Another Example

[FSR further extended with the root save and the suffix branches -ing, -ed, -es.]

Accepts ….. save, saving, saved, saves

138

69 8/28/17

Regular Languages n A language whose strings are accepted by some finite state recognizer is a regular language.

Regular Languages ↔ Finite State Recognizers

139

Regular Languages n A language whose strings are accepted by some finite state recognizer is a regular language.

Regular Languages ↔ Finite State Recognizers

n However, FSRs can get rather unwieldy; we need higher-level notations for describing regular languages.

140

70 8/28/17


71 8/28/17

Regular Expressions n A regular expression is a compact formula or metalanguage that describes a regular language.

143

Constructing Regular Expressions
Expression            Language
∅                     { }
e                     {e}
a, for all a ∈ A      {a}
R1 | R2               L1 ∪ L2
R1 R2                 {w1 w2 : w1 ∈ L1 and w2 ∈ L2}
[R]                   L

144

72 8/28/17

Constructing Regular Expressions
Expression            Language
∅                     { }
e                     {e}
a, for all a ∈ A      {a}
R1 | R2               L1 ∪ L2
R1 R2                 {w1 w2 : w1 ∈ L1 and w2 ∈ L2}
[R]                   L

a b [a | c] [d | e] 145

Constructing Regular Expressions
Expression            Language
∅                     { }
e                     {e}
a, for all a ∈ A      {a}
R1 | R2               L1 ∪ L2
R1 R2                 {w1 w2 : w1 ∈ L1 and w2 ∈ L2}
[R]                   L

a b [a | c] [d | e]   denotes   { abad, abae, abcd, abce }

146

73 8/28/17

Constructing Regular Expressions n This much is not that interesting; it allows us to describe only finite languages. n The Kleene Star operator describes infinite sets.

R* denotes {e} ∪ L ∪ LL ∪ LLL ∪ ….

147

Kleene Star Operator n a* Þ {e, a, aa, aaa, aaaa, …..}

148

74 8/28/17

Kleene Star Operator n a* Þ {e, a, aa, aaa, aaaa, …..} n [ab]* Þ {e, ab, abab, ababab, … }

149

Kleene Star Operator n a* Þ {e, a, aa, aaa, aaaa, …..} n [ab]* Þ {e, ab, abab, ababab, … } n ab*a Þ {aa, aba, abba, abbba,…}

150

75 8/28/17

Kleene Star Operator n a* Þ {e, a, aa, aaa, aaaa, …..} n [ab]* Þ {e, ab, abab, ababab, … } n ab*a Þ {aa, aba, abba, abbba,…} n [a|b]*[c|d]* Þ {e, a, b, c, d, ac, ad, bc, bd, aac, abc,bac, bbc, …..}

151

Regular Expressions n Regular expression for set of strings with an even number of a's. [b* a b* a]* b*

¨Any number of concatenations of strings of the sort

n Any number of b's followed by an a followed by any number of b's followed by another a ¨Ending with any number of b's
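For comparison, the same language can be checked with Python's re module; note that Python writes grouping as (...) where the slides use [...] (a small illustrative sketch, not from the original slides).

import re

# [b* a b* a]* b* in Python regular-expression syntax.
EVEN_AS = re.compile(r"(b*ab*a)*b*")

for s in ["", "bb", "abba", "bbababbaabaaababbb", "abbaa"]:
    print(repr(s), bool(EVEN_AS.fullmatch(s)))
# Only "abbaa" (three a's) fails to match.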

152

76 8/28/17

Regular Expressions n Regular expression for set of strings with an even number of a's. [b* a b* a]* b* ¨b b a b a b b a a b a a a b a b b b b* a b* a b* a b* a b* a b* a b* a b* a b*

Any number of strings matching b*ab*a concatenated

Ending with any number of b’s

153

Regular Languages n Regular languages are described by regular expressions. n Regular languages are recognized by finite state recognizers. n Regular expressions define finite state recognizers.

154

77 8/28/17

More Examples of Regular Expressions n All strings ending in a ¨[a|b]*a n All strings in which the first and the last symbols are the same ¨[a [a|b]* a] | [b [a|b]* b] | a | b n All strings in which the third symbol from the end is a b ¨[a|b]* b [a|b] [a|b]

155

More Examples of Regular Expressions n Assume an alphabet A={a,b,…,z} n All strings ending in ing ¨A* i n g n All strings that start with an a and end with a ed ¨a A* e d

156

78 8/28/17

Regular Languages to Regular Relations n A regular language is a set of strings, e.g. { “cat”, “fly”, “big” }.

157

Regular Languages to Regular Relations n A regular language is a set of strings, e.g. { “cat”, “fly”, “big” }. n An ordered pair of strings, notated <“upper”, “lower”>, relates two strings, e.g. <“wiggle+ing”, “wiggling”>.

158

79 8/28/17

Regular Languages to Regular Relations n A regular language is a set of strings, e.g. { “cat”, “fly”, “big” }. n An ordered pair of strings, notated <“upper”, “lower”>, relates two strings, e.g. <“wiggle+ing”, “wiggling”>. n A regular relation is a set of ordered pairs of strings, e.g. ¨ { <“cat+N”, “cat”> , <“fly+N”, “fly”> , <“fly+V”, “fly”>, <“big+A”, “big”> }

¨ { <“cat”, “cats”> , <“zebra”, “zebras”> , <“deer”, “deer”>, <“ox”, “oxen”>, <“child”, “children”> }

159

Regular Relations n The set of upper-side strings in a regular relation (upper language) is a regular language. ¨{ cat+N , fly+N, fly+V, big+A} n The set of lower-side strings in a regular relation (lower language) is a regular language. ¨{cat ,fly, big }

160

80 8/28/17

Regular Relations n A regular relation is a "mapping" between two regular languages. Each string in one of the languages is "related" to one or more strings of the other language. n A regular relation can be described by a regular expression and encoded in a Finite-State Transducer (FST).

161

Finite State Transducers n Regular relations can be defined by regular expressions over an alphabet of pairs of symbols u:l (u for upper, l for lower) n In general either u or l may be the empty string symbol. ¨ e:l: a symbol on the lower side maps to “nothing” on the upper side ¨ u: e: a symbol on the upper side maps to “nothing” on the lower side n It is also convenient to view any recognizer as an identity transducer.

162

81 8/28/17

Finite State Transducers

b/a

qi qj

n Now the automaton outputs a symbol at every transition, e.g., if the lower input symbol is a, then b is output as the upper symbol (or vice-versa). n The output is defined iff the input is accepted (on the input side), otherwise there is no output.

163

Relations and Transducers

Regular relation { <ac, ac>, <abc, adc>, <abbc, addc>, ... }

between [a b* c] (the "upper language") and [a d* c] (the "lower language").

Regular expression: a:a [b:d]* c:c
[Finite-state transducer: state 0 –a:a→ state 1, state 1 –b:d→ state 1, state 1 –c:c→ state 2 (final).]

Slide courtesy of Lauri Karttunen 164

82 8/28/17

Relations and Transducers

Regular relation { <ac, ac>, <abc, adc>, <abbc, addc>, ... }

between [a b* c] (the "upper language") and [a d* c] (the "lower language").

Regular expression: a [b:d]* c
[Finite-state transducer: state 0 –a→ state 1, state 1 –b:d→ state 1, state 1 –c→ state 2 (final).]

Convention: when the upper and lower symbols are the same, we write the symbol once (a instead of a:a). Slide courtesy of Lauri Karttunen 165

Finite State Transducers

n If the input string has an even number of a's then flip each symbol.

[FST: states q0 (start, final) and q1; transitions q0 –a:b→ q1, q1 –a:b→ q0, q0 –b:a→ q0, q1 –b:a→ q1.]
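A small Python sketch of this transducer (an illustrative toy, assuming the transition table drawn above): the machine outputs the flipped symbol on every transition, but the output is only returned if the input is accepted.

# (state, input) -> (next state, output symbol)
TRANSITIONS = {("q0", "a"): ("q1", "b"), ("q0", "b"): ("q0", "a"),
               ("q1", "a"): ("q0", "b"), ("q1", "b"): ("q1", "a")}
FINAL = {"q0"}

def transduce(w):
    state, out = "q0", []
    for symbol in w:
        if (state, symbol) not in TRANSITIONS:
            return None                    # input rejected: no output
        state, o = TRANSITIONS[(state, symbol)]
        out.append(o)
    return "".join(out) if state in FINAL else None

print(transduce("abba"))   # 'baab' (two a's: accepted, every symbol flipped)
print(transduce("aba"))    # None   (three a's: no output is defined)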

166

83 8/28/17

A Linguistic Example

Upper:  v o u l o i r +IndP +SG +P3
Lower:  v e u 0 0 0 0   0    0   t

Maps the lower string veut to the upper string vouloir+IndP+SG+P3 (and vice versa)

From now on we will use the symbol 0 (zero) to denote the empty string e

167

Finite State Transducers – Black Box View
n We say T transduces the lower string w1 ∈ L1 into the upper string w2 ∈ L2 in the upward direction (lookup).
n We say T transduces the upper string w2 into the lower string w1 in the downward direction (lookdown).
n A given string may map to zero or more strings.

[Figure: w2 ∈ L2 above the Finite State Transducer T, w1 ∈ L1 below it.]

168

84 8/28/17

Combining Transducers n In algebra we write
¨ y = f(x) to indicate that function f maps x to y
¨ Similarly, in z = g(y), g maps y to z
n We can combine these into
¨ z = g(f(x)) to map directly from x to z, and write this as z = (g ∘ f)(x)
¨ g ∘ f is the composed function
¨ If y = x^2 and z = y^3 then z = x^6

169

Combining Transducers n The same idea can be applied to transducers – though they define relations in general.

170

85 8/28/17

Composing Transducers

[Figure: transducer f maps x ∈ U1 (upper) to y ∈ L1 (lower); transducer g maps y ∈ U2 (upper) to z ∈ L2 (lower). Their composition f ∘ g maps x directly to z. U1' = f⁻¹(L1 ∩ U2); L2' = g(L1 ∩ U2).]

171

Composing Transducers

[Figure: the same diagram as above.]

f ∘ g = {<x, z> : ∃y (<x, y> ∈ f and <y, z> ∈ g)}, where x, y, z are strings.

The y's have to be equal, so there is an implicit intersection. 172

86 8/28/17

Composing Transducers n Composition is an operation that merges two transducers "vertically."
¨ Let X be a transducer that contains the single ordered pair <"dog", "chien">.
¨ Let Y be a transducer that contains the single ordered pair <"chien", "Hund">.
¨ The composition of X over Y, notated X o Y, is the relation that contains the ordered pair <"dog", "Hund">.
n Composition merges any two transducers "vertically." If the shared middle level has a non-empty intersection, then the result will be a non-empty relation.
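The effect of composition is easy to see if we (unrealistically) represent a finite relation as an explicit set of <upper, lower> pairs -- a toy Python sketch; real systems compose the transducers themselves rather than enumerating the relations.

def compose(f, g):
    """f o g = { <x, z> : there is a y with <x, y> in f and <y, z> in g }."""
    return {(x, z) for (x, y1) in f for (y2, z) in g if y1 == y2}

X = {("dog", "chien")}      # <English, French>
Y = {("chien", "Hund")}     # <French, German>
print(compose(X, Y))        # {('dog', 'Hund')}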

173

Composing Transducers n The crucial property is that the two finite state transducers can be composed into a single transducer. ¨Details are hairy and not relevant.

174

87 8/28/17

Composing Transducers - Example

[Figure: English Numeral to Number Transducer; upper string 1273, lower string "One thousand two hundred seventy three".]

n Maps a (finite) set of English numerals to integers.
n Take my word that this can be done with a finite state transducer.

175

Composing Transducers - Example

[Figure: Numbers to Turkish Numerals Transducer; upper string "Bin iki yüz yetmiş üç", lower string 1273.]

n Maps a (finite) set of numbers to Turkish numerals.
n Again, take my word that this can be done with a finite state transducer.

176

88 8/28/17

Composing Transducers - Example

Bin iki yüz yetmiş üç

Number to Turkish Numeral Transducer

1273

English Numeral to Number Transducer

One thousand two hundred seventy three

177

Composing Transducers - Example

Bin iki yüz yetmiş üç Bin iki yüz yetmiş üç

Number to Turkish Numeral Transducer
1273
English Numeral to Number Transducer
    --Compose-->  English Numeral to Turkish Numeral Transducer

One thousand two hundred seventy three One thousand two hundred seventy three

178

89 8/28/17

Composing Transducers - Example

satakaksikymmentäkolme satakaksikymmentäkolme

Number to Finnish Numeral Transducer
123
English Numeral to Number Transducer
    --Compose-->  English Numeral to Finnish Numeral Transducer

One hundred twenty three One hundred twenty three

179

Morphology & Finite State Concepts n Some references: ¨ Hopcroft, Motwani and Ullman, Formal Languages and Automata Theory, Addison Wesley ¨ Beesley and Karttunen, Finite State Morphology, CSLI, 2003 ¨ Kaplan and Kay. “Regular Models of Phonological Rule Systems” in Computational Linguistics 20:3, 1994. Pp. 331-378. ¨ Karttunen et al., Regular Expressions for Language Engineering, Natural Language Engineering, 1996

180

90 8/28/17

End of digression n How does all this tie back to computational morphology?

181

OVERVIEW n Overview of Morphology n Computational Morphology n Overview of Finite State Machines n Finite State Morphology ¨Two-level Morphology ¨Cascade Rules

182

91 8/28/17

Morphological Analysis and Regular Sets n Assumption: The set of words in a (natural) language is a regular (formal) language.

n Finite: Trivially true; though instead of listing all words, one needs to abstract and describe phenomena.

n Infinite: Very accurate approximation. n BUT: Processes like (unbounded) reduplication are out of the finite state realm. n Let’s also assume that we combine morphemes by simple concatenation. ¨ Not a serious limitation

183

Morphological Analysis n Morphological analysis can be seen happy+Adj+Sup as a finite state transduction

Finite State Transducer T

happiest Î English_Words

184

92 8/28/17

Morphological Analysis as FS Transduction n First approximation

n Need to describe ¨Lexicon (of free and bound morphemes)

¨Spelling change rules in a finite state framework.

185

The Lexicon as a Finite State Transducer n Assume words have the form prefix+root+suffix where the prefix and the suffix are optional. So:

Prefix = [P1 | P2 | …| Pk]

Root = [ R1 | R2| … | Rm]

Suffix = [ S1 | S2 | … | Sn] Lexicon = (Prefix) Root (Suffix)

(R) = [R | e], that is, R is optional. 186

93 8/28/17

The Lexicon as a Finite State Transducer n Prefix = [[ u n +] | [ d i s +] | [i n +]] Root = [ [t i e] | [ e m b a r k ] | [ h a p p y ] | [ d e c e n t ] | [ f a s t e n]] Suffix = [[+ s] | [ + i n g ] | [+ e r] | [+ e d] ]

tie, embark, happy, un+tie, dis+embark+ing, in+decent, un+happy   √
un+embark, in+happy+ing, ….   X
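This first-approximation lexicon can also be written directly as a Python regular expression (an illustrative sketch only); as the examples above show, it overgenerates, and the refined lexicon on the next slide fixes this by grouping roots with the affixes they actually take.

import re

PREFIX  = r"(?:un\+|dis\+|in\+)?"
ROOT    = r"(?:tie|embark|happy|decent|fasten)"
SUFFIX  = r"(?:\+s|\+ing|\+er|\+ed)?"
LEXICON = re.compile(PREFIX + ROOT + SUFFIX)

for w in ["un+tie", "dis+embark+ing", "un+embark", "in+happy+ing"]:
    print(w, bool(LEXICON.fullmatch(w)))
# un+embark and in+happy+ing are (wrongly) accepted by this crude lexicon.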

187

The Lexicon as a Finite State Transducer n Lexicon = [ ([ u n +]) [ [t i e] | [ f a s t e n]] ([[+e d] | [ + i n g] | [+ s]]) ] | [ ([d i s +]) [ e m b a r k ] ([[+e d] | [ + i n g] | [+ s]])] | [ ([u n +]) [ h a p p y] ([+ e r])] | [ (i n +) [ d e c e n t] ] Note that some patterns are now emerging tie, fasten, embark are verbs, but differ in prefixes, happy and decent are adjectives, but behave differently

188

94 8/28/17

The Lexicon n The lexicon structure can be refined to a point so that all and only valid forms are accepted and others rejected. n This is very painful to do manually for any (natural) language.

189

Describing Lexicons n Current available systems for morphology provide a simple scheme for describing finite state lexicons. ¨ Xerox Finite State Tools ¨ PC-KIMMO n Roots and affixes are grouped and linked to each other as required by the morphotactics. n A compiler (lexc) converts this description to a finite state transducer

190

95 8/28/17

Describing Lexicons

LEXICON NOUNS
  abacus                NOUN-STEM;   ;; same as abacus:abacus
  car                   NOUN-STEM;
  table                 NOUN-STEM;
  …
  information+Noun+Sg:information    End;
  …
  zymurgy               NOUN-STEM;

LEXICON NOUN-STEM
  +Noun:0               NOUN-SUFFIXES;

Think of these as (roughly) the NEXT function of a transducer: upper:lower next-state (but with strings of symbols).

191

Describing Lexicons

LEXICON NOUN-SUFFIXES
  +Sg:0          End;
  +Pl:+s         End;

LEXICON REG-VERB-STEM
  +Verb:0        REG-VERB-SUFFIXES;

LEXICON REG-VERB-SUFFIXES
  +Pres+3sg:+s   End;
  +Past:+ed      End;
  +Part:+ed      End;
  +Cont:+ing     End;

LEXICON REGULAR-VERBS
  admire         REG-VERB-STEM;
  head           REG-VERB-STEM;
  ..
  zip            REG-VERB-STEM;

LEXICON IRREGULAR-VERBS
  …..

192

96 8/28/17

How Does it Work? n Suppose we have the string abacus+s
¨ The first segment matches abacus in the NOUNS lexicon, "outputs" abacus, and we next visit the NOUN-STEM lexicon.
¨ Here, the only option is to match the empty string, "output" +Noun, and visit the NOUN-SUFFIXES lexicon.
¨ Here we have two options:
   n Match the empty string and "output" +Sg, but there is more input left, so this fails.
   n Match the +s, "output" +Pl, and then finish.
¨ Output is abacus+Noun+Pl
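A rough Python sketch of this continuation-class traversal (toy entries mirroring the abacus example, not the real lexc grammar): each sublexicon pairs a lower-side string with an upper-side string and names the sublexicon to visit next.

# sublexicon name -> list of (lower, upper, next sublexicon) entries
LEXICON = {
    "NOUNS":         [("abacus", "abacus", "NOUN-STEM")],
    "NOUN-STEM":     [("", "+Noun", "NOUN-SUFFIXES")],
    "NOUN-SUFFIXES": [("", "+Sg", "End"), ("+s", "+Pl", "End")],
}

def lookup(lower, state="NOUNS", upper=""):
    """Return every upper-side analysis of the lower-side string."""
    if state == "End":
        return [upper] if lower == "" else []   # fail if input is left over
    analyses = []
    for lo, up, nxt in LEXICON[state]:
        if lower.startswith(lo):
            analyses += lookup(lower[len(lo):], nxt, upper + up)
    return analyses

print(lookup("abacus+s"))   # ['abacus+Noun+Pl']
print(lookup("abacus"))     # ['abacus+Noun+Sg']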

193

Describing Lexicons

LEXICON ADJECTIVES
  …

LEXICON ADVERBS
  …

LEXICON ROOT
  NOUNS;
  REGULAR-VERBS;
  ADJECTIVES;
  ….

[Figure: the compiled Lexicon Transducer maps the lower string happy+est to the upper string happy+Adj+Sup.]

194

97 8/28/17

Lexicon as a FS Transducer

[Figure: a fragment of the lexicon transducer, a trie over upper:lower pairs, e.g., upper h a p p y +Adj +Sup over lower h a p p y + e s t; upper s a v e +Verb +Past over lower s a v e + e d; upper t a b l e +Noun +Pl over lower t a b l e + s; plus a +Verb +Pres +3sg branch over + s.]

A typical lexicon will be represented with 10^5 to 10^6 states. 195

Morphotactics in Arabic

n As we saw earlier, words in Arabic are based on a root and pattern scheme: ¨A root consisting of 3 consonants (radicals) ¨A template and a vocalization. which combine to give a stem. n Further prefixes and suffixes can be attached to the stem in a concatenative fashion.

196

98 8/28/17

Morphotactics in Arabic

Pattern CVCVC Vocalization FormI+Perfect+Active a a Root

d r s learn/study

Prefix Suffix

+at wa+ daras

wa+daras+at

“and she learned/studied” 197

Morphotactics in Arabic

Pattern CVCVC Vocalization FormI+Perfect+Passive u i Root

d r s learn/study

Prefix Suffix

+at wa+ duris

wa+duris+at

“and she was learned/studied” 198

99 8/28/17

Morphotactics in Arabic

Pattern CVCVC Vocalization FormI+Perfect+Active a a Root

k t b write

Prefix Suffix

+at wa+ katab

wa+katab+at

“and she wrote” 199

Morphotactics in Arabic

Pattern CVCCVC Vocalization FormII+Perfect+Active a a Root

d r s learn/study

Prefix Suffix

+at wa+ darras

wa+darras+at

“and *someone* taught/instructed her/it” 200

100 8/28/17

Morphotactics in Arabic

wa+Conj+drs+FormI+Perfect+Passive+3rdPers+Fem+Sing

wa+drs+CVCVC+ui+at

wa+duris+at

201

Morphotactics in Arabic

wa+Conj+drs+FormI+Perfect+Passive+3rdPers+Fem+Sing

wa+drs+CVCVC+ui+at

wa+duris+at

wadurisat

202

101 8/28/17

Morphotactics in Arabic

wa+Conj+drs+FormI+Perfect+Passive+3rdPers+Fem+Sing

wa+drs+CVCVC+ui+at

wa+duris+at

wadurisat

wdrst

203

Morphotactics in Arabic

[Figure: the unvocalized surface form wdrst maps back up to many analyses of the form …+drs+… .]

… 16 possible interpretations

204

102 8/28/17

Arabic Morphology on Web n Play with XRCE's Arabic Analyzer at ¨https://open.xerox.com/Services/arabic-morphology

205

Lexicon as a FS Transducer

[Figure: the same lexicon transducer fragment as on slide 195, illustrating its nondeterminism.]

Nondeterminism 206

103 8/28/17

The Lexicon Transducer n Note that the lexicon transducer solves part of the problem. ¨It maps from a sequence of morphemes to root and features.

¨Where do we get the sequence of morphemes?

207

Morphological Analyzer Structure happy+Adj+Sup

Lexicon Transducer

happy+est

Morphographemic ???????? Transducer

happiest

208

104 8/28/17

Sneak Preview (of things to come)

happy+Adj+Sup                                 happy+Adj+Sup
Lexicon Transducer
happy+est                   --Compose-->      Morphological Analyzer/Generator
Morphographemic Transducer
happiest                                      happiest

209

The Morphographemic Transducer

n The morphographemic transducer generates ¨all possible ways the input word can be segmented and “unmangled” ¨As sanctioned by the alternation rules of the language

n Graphemic conventions n Morphophonological processes (reflected to the orthography)

210

105 8/28/17

The Morphographemic Transducer

n The morphographemic transducer thinks:
¨ There may be a morpheme boundary between i and e, so let me mark that with a +.
¨ There is now an i+e situation, and
¨ there is a rule that says: change the i to a y in this context.
¨ So let me output happy+est.

[Figure: the Morphographemic Transducer maps the surface form happiest up to happy+est.]

211

The Morphographemic Transducer

n However, the morphographemic transducer is oblivious to the lexicon:
¨ it does not really know about words and morphemes,
¨ but rather about what happens when you combine them.

[Figure: the Morphographemic Transducer maps happiest up to many candidate segmentations (…, happy+est, h+ap+py+e+st, happiest, …). Only some of these will actually be sanctioned by the lexicon.]

212

106 8/28/17

The Morphographemic Transducer n This obliviousness is actually a good thing
¨ Languages easily import and generate new words
¨ But not necessarily such rules! (and usually there are a "small" number of rules.)
¨ brag ⇒ bragged, flog ⇒ flogged
   n In a situation like vowel g + vowel, insert another g
¨ And you want these rules to also apply to new words coming into the lexicon
   n blog ⇒ blogged, vlog ⇒ vlogged, etc.

213

The Morphographemic Transducer n Also, especially in languages that allow segmentation ambiguity, there may be multiple legitimate outputs of the transducer that are sanctioned by the lexicon transducer.

[Figure: the Morphographemic Transducer maps the surface form koyun up to several candidates: …, koyun, koy+Hn, koy+nHn, koyu+Hn, ….]

214

107 8/28/17

What kind of changes does the MG Transducer handle? n Agreement ¨consonants agree in certain features under certain contexts

n e.g., Voiced, unvoiced ¨ (T) at+tı (it was a horse, it threw) ¨ (T) ad+dı (it was a name) ¨vowels agree in certain features under certain contexts

n e.g., vowel harmony (may be long distance) ¨ (T) el+ler+im+de (in my hands) ¨ (T) masa+lar+ım+da (on my tables)

215

What kind of changes does the MG Transducer handle? n Insertions ¨brag+ed Þ bragged n Deletions ¨(T) koy+nHn Þ koyun (of the bay) ¨(T) alın+Hm+yA Þ alnıma (to my forehead) n Changes ¨happy+est Þ happiest ¨(T) tarak+sH Þ tarağı (his comb) ¨(G) Mann+er Þ Männer

216

108 8/28/17

Approaches to MG Transducer Architecture

n There are two main approaches to the internal architecture of the morphographemics transducers. ¨The parallel constraints approach

n Two-level Morphology

¨Sequential transduction approach

n Cascade Rules

217

The Parallel Constraints Approach

n Two-level Morphology ¨Koskenniemi ’83, Karttunen et al. 1987, Karttunen et al. 1992

Describe constraints

Lexical form Set of parallel of two-level rules compiled into finite-state fst 1 fst 2 ... fst n automata interpreted as transducers

Surface form

Slide courtesy of Lauri Karttunen 218

109 8/28/17

Sequential Transduction Approach

n Cascaded ordered rewrite rules ¨Johnson ’71 Kaplan & Kay ’81 based on Chomsky & Halle ‘64 Lexical form

fst 1

Intermediate form

Ordered cascade fst 2 of rewrite rules compiled into finite-state transducers ...

fst n

Surface form Slide courtesy of Lauri Karttunen 219

Spoiler

n At the end both approaches are equivalent

Lexical form Lexical form fst 1 fst 2 Intermediate form fst 1 ... fst n fst 2

... Surface form

fst n

Surface form FST Slide courtesy of Lauri Karttunen 220

110 8/28/17

Sequential vs. Parallel

n Two different ways of decomposing the complex relation between lexical and surface forms into a set of simpler relations that can be more easily understood and manipulated. ¨The sequential model is oriented towards string-to-string relations, ¨The parallel model is about symbol-to-symbol relations. n In some cases it may be advantageous to combine the two approaches.

Slide courtesy of Lauri Karttunen 221

Two-Level Morphology

n Basic terminology and concepts n Examples of morphographemic alternations n Two-level rules n Rule examples

222

111 8/28/17

Terminology n Representation
   Surface form/string: happiest
   Lexical form/string: happy+est
n Aligned correspondence:
   happy+est
   happi0est
   (You may think of these as the upper and lower symbols in an FST.)
n 0 will denote the empty string symbol, but we will pretend that it is an actual symbol!!

223

Feasible Pairs n Aligned correspondence: happy+est happi0est n Feasible Pairs: The set of possible lexical and surface symbol correspondences. ¨All possible pairs for insertions, deletions, changes ¨{h:h, a:a, p:p, y:i, +:0, e:e, s:s, t:t, y:y …}

224

112 8/28/17

Aligned Correspondence n Aligned correspondence: happy+est happi0est n The alignments can be seen as ¨Strings in a regular language over the alphabet of feasible pairs, (i.e., symbols that look like “y:i”) or

225

Aligned Correspondence n Aligned correspondence: happy+est happi0est n The alignments can be seen as ¨Strings in a regular language over the alphabet of feasible pairs, (i.e., symbols that look like “y:i”) or ¨Transductions from surface strings to lexical strings (analysis), or

226

113 8/28/17

Aligned Correspondence n Aligned correspondence: happy+est happi0est n The alignments can be seen as ¨Strings in a regular language over the alphabet of feasible pairs, (i.e., symbols that look like “y:i”) or ¨Transductions from surface strings to lexical strings (analysis), or ¨Transductions from lexical strings to surface strings (generation)

227

The Morphographemic Component n So how do we define this regular language (over the alphabet of feasible pairs) or the transductions?
¨ We want to accept the pairs
   stop0+ed   try+ing   try+ed
   stopp0ed   try0ing   tri0ed
¨ but reject
   stop+ed    try+ing   try+ed
   stop0ed    tri0ing   try0ed

228

114 8/28/17

The Morphographemic Component n Basically we need to describe ¨what kind of changes can occur, and ¨under what conditions

n contexts

n optional vs obligatory. n Typically there will be few tens of different kinds of “spelling change rules.”

229

The Morphographemic Component

¨The first step is to create an inventory of possible spelling changes.

n What are the changes? Þ Feasible pairs n How can we describe the contexts it occurs?

n Is the change obligatory or not?

n Consistent representation is important

¨The second step is to describe these changes in a computational formalism.

230

115 8/28/17

Inventory of Spelling Changes n In general there are many ways of looking at a certain phenomena ¨ An insertion at the surface can be viewed as a deletion on the surface ¨ Morpheme boundaries on the lexical side can be matched with 0 or some real surface symbol n For the following examples we will have +s and +ed as the morphemes. n Rules will be devised relative to these representations.

231

Inventory of Spelling Changes n An extra e is needed before +s after the consonants s, x, z, sh, or ch
   box+0s    kiss+0s    blemish+0s    preach+0s
   box0es    kiss0es    blemish0es    preach0es

232

116 8/28/17

Inventory of Spelling Changes n Sometimes a y will be an i on the surface.
   try+es    spot0+y+ness
   tri0es    spott0i0ness
n But not always
   try+ing   country+'s   spy+'s
   try0ing   country0's   spy0's

233

Inventory of Spelling Changes n Sometimes lexical e may be deleted move+ed move+ing persuade+ed mov00ed mov00ing persuad00ed

dye+ed queue+ing tie+ing dy00ed queu00ing ty00ing n but not always trace+able change+able trace0able change0able

234

117 8/28/17

Inventory of Spelling Changes n The last consonant may be duplicated following a stressed syllable. re'fer0+ed 'travel+ed re0ferr0ed 0travel0ed n So if you want to account for this, stress will have to be somehow represented and we will have a feasible pair (':0)

235

Inventory of Spelling Changes n Sometimes the s of the genitive marker (+’s) drops on the surface boy+s+'s dallas's boy0s0'0 dallas'0

236

118 8/28/17

Inventory of Spelling Changes n Sometimes an i will be a y on the surface. tie+ing ty00ing

237

Inventory of Spelling Changes n In Zulu, the n of the basic prefix changes to m before labial consonants b, p, f, v. i+zin+philo i0zim0p0ilo (izimpilo) n Aspiration (h) is removed when n is followed by kh, ph, th, bh i+zin+philo i0zim0p0ilo (izimpilo)

238

119 8/28/17

Inventory of Spelling Changes n In Arabic “hollow” verbs, the middle radical w between two vowels is replaced by vowel lengthening zawar+tu z0U0r0tu (zUrtu – I visited)

qawul+a qaA0l0a (qaAla – he said)

qaAla => qawul+a => qwl+CaCuC+a => qwl+FormI+Perfect+3Pers+Masc+Sing

239

The Computational Formalism n So how do we (formally) describe all such phenomena ¨Representation

n lexical form

n surface form ¨Conditions

n Context

n Optional vs Obligatory Changes

240

120 8/28/17

Parallel Rules

n A well-established formalism for describing morphographemic changes.

[Figure: between the Lexical Form and the Surface Form sit rules R1 R2 R3 R4 . . . Rn. Each rule describes a constraint on legal lexical–surface pairings. All rules operate in parallel!]

241

Parallel Rules

n Each morphographemic constraint is enforced by a finite state recognizer over the alphabet of feasible pairs.

   t i e + i n g
R1 R2 R3 R4 . . . Rn
   t y 0 0 i n g

Assume that the 0's are magically there; the reason and the how are really technical. 242

121 8/28/17

Parallel Rules n A lexical-surface string pair is "accepted" if NONE of the rule recognizers reject it. n Thus, all rules must put a good word in! t i e + i n g

R1 R2 R3 R4 . . . Rn

t y 0 0 i n g

243

Parallel Rules n Each rule independently checks if it has any problems with the pair of strings.

t i e + i n g

R1 R2 R3 R4 . . . Rn

t y 0 0 i n g

244

122 8/28/17

Two-level Morphology n Each recognizer sees that same pair of symbols

t i e + i n g Each Ri sees t:t and makes a state change accordingly

R1 R2 R3 R4 . . . Rn

t y 0 0 i n g

245

Two-level Morphology n Each recognizer sees that same pair of symbols

t i e + i n g

R1 R2 R3 R4 . . . Rn

t y 0 0 i n g

246

123 8/28/17

Two-level Morphology n Each recognizer sees that same pair of symbols

t i e + i n g

R1 R2 R3 R4 . . . Rn

t y 0 0 i n g

247

Two-level Morphology n Each recognizer sees that same pair of symbols

t i e + i n g At this point all Ri should be in an accepting state.

R1 R2 R3 R4 . . . Rn

t y 0 0 i n g

248

124 8/28/17

The kaNpat Example n Language X has a root kaN, including an underspecified nasal morphophoneme N, that can be followed by the suffix pat to produce the well-formed, but still abstract, word kaNpat. We may refer to this as the “underlying” or “lexical” or “morphophonemic” form. n The morphophonology of this language has an alternation: the morphophoneme N becomes m when it is followed by a p. n In addition, a p becomes an m when it is preceded by an underlying or derived m. n The “surface” form or “realization” of this word should therefore be kammat.

249

The kaNpat Example

The morphophonology of this language has an alternation: the morphophoneme N becomes m when it is followed by a p. In addition, a p becomes an m when it is preceded by an underlying or derived m.

250

125 8/28/17

The kaNpat Example

n Ignore everything until you see a N:m pair, and make sure it is followed with some p:@ (some feasible pair with p on the lexical side) n If you see a N:x (x≠m) then make sure the next pair is not p:@.

251

The kaNpat Example: N:m rule

n Ignore everything until you see an N:m pair, and make sure it is followed by some p:@ (some feasible pair with p on the lexical side)
n If you see an N:x (x ≠ m) then make sure the next pair is not p:@.

[Figure: the N:m rule recognizer.] p ⇒ {p:p, p:m}; a denotes everything else. Also remember the FST rejects if no arc is found. 252

126 8/28/17

The kaNpat Example: p:m rule

n Make sure that a p:m pairing is preceded by a feasible pair like @:m n And, no p:x (x ≠ m) follows a @:m

253

The kaNpat Example: p:m rule

n Make sure that a p:m pairing is preceded by a feasible pair like @:m n And, no p:x (x ≠ m) follows a @:m

Also remember that if an unlisted input is encountered, the string is rejected.

M ⇒ {m:m, N:m}; a denotes everything else. 254

127 8/28/17

Rules in parallel

[Figure: the lexical string k a N p a t paired with the surface string k a m m a t; the two rule recognizers read the pairs in parallel, left to right.]

Both FSRs see the k:k pair and stay in state 1 255

Rules in parallel


Both FSRs see the a:a pair and stay in state 1 256

128 8/28/17

Rules in parallel


Both FSRs see the N:m 1st one goes to state 3 2nd one goes to state 2 257

Rules in parallel


Both FSRs see the p:m 1st one back goes to state 1 2nd one stays in state 2 258

129 8/28/17

Rules in parallel


Both FSRs see the a:a 1st one stays in state 1 2nd one goes to state 1 259

Rules in parallel


Both FSRs see the t:t 1st one stays in state 1 2nd one stays in state 1 260

130 8/28/17

Rules in parallel


Both machines are now in accepting states 261
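The same parallel check can be sketched in a few lines of Python. Rather than the compiled recognizers drawn above, each rule is written here as a direct test over the aligned pair sequence; a pairing is accepted only if every rule accepts it (an illustrative toy, not the twolc machinery).

def rule_N_to_m(pairs):
    """N:m only before a lexical p; N:other never before a lexical p."""
    for i, (lex, srf) in enumerate(pairs):
        if lex == "N":
            before_lexical_p = i + 1 < len(pairs) and pairs[i + 1][0] == "p"
            if (srf == "m") != before_lexical_p:
                return False
    return True

def rule_p_to_m(pairs):
    """p:m only after a surface m; after a surface m, lexical p must surface as m."""
    for i, (lex, srf) in enumerate(pairs):
        if lex == "p":
            after_surface_m = i > 0 and pairs[i - 1][1] == "m"
            if (srf == "m") != after_surface_m:
                return False
    return True

RULES = [rule_N_to_m, rule_p_to_m]

def accept(lexical, surface):
    pairs = list(zip(lexical, surface))        # equal-length aligned strings
    return all(rule(pairs) for rule in RULES)

print(accept("kaNpat", "kammat"))   # True
print(accept("kaNpat", "kampat"))   # False (p:p right after a surface m)
print(accept("kaNpat", "kanpat"))   # False (N:n right before a lexical p)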

Rules in parallel

[Figure: the lexical string k a N p a t paired with the surface string k a m p a t.]

Try kaNpat and kampat pairing

262

131 8/28/17

Crucial Points n Rules are implemented by recognizers over strings of pairs of symbols.

263

Crucial Points n Rules are implemented by recognizers over strings of pairs of symbols. n The set of strings accepted by a set of such recognizers is the intersection of the languages accepted by each! ¨ Because, all recognizers have to be in the accepting state– the pairing is rejected if at least one rule rejects.

264

132 8/28/17

Crucial Points n Rules are implemented by recognizers over strings of pairs of symbols. n The set of strings accepted by a set of such recognizers is the intersection of the languages accepted by each! n Such recognizers can be viewed as transducers between lexical and surface string languages.

265

Crucial Points n Rules are implemented by recognizers over strings of pairs of symbols. n The set of strings accepted by a set of such recognizers is the intersection of the languages accepted by each! n Such machines can be viewed as transducers between lexical and surface string languages. n Each transducer defines a regular relation or transduction.

266

133 8/28/17

Some Technical Details n Such machines can be viewed as transducers between lexical and surface string languages. n Each transducer defines a regular relation or transduction. n The intersection of the regular relations over equal length strings is a regular relation (Kaplan and Kay 1994). ¨ In general the intersection of regular relations need not be regular! n So, 0 is treated as a normal symbol internally, not as e. So there are no transitions with real e symbols.

267

Intersecting the Transducers n The transducers for each rule can now be intersected to give a single transducer.

t i e + i n g

R1 ∩ R2 ∩ R3 ∩ R4 ∩ . . . ∩ Rn

t y 0 0 i n g

268

134 8/28/17

The Intersected Transducer n The resulting transducer can be used for analysis

t i e + i n g

Morphographemic Rule Transducer

t y 0 0 i n g

269

The Intersected Transducer n The resulting transducer can be used for analysis, or generation

t i e + i n g

Morphographemic Rule Transducer

t y 0 0 i n g

270

135 8/28/17

Describing Phenomena n Finite state transducers are too low level. n Need high level notational tools for describing morphographemic phenomena n Computers can then compile such rules into transducers; compilation is too much of a hassle for humans!

271

Two-level Rules n Always remember the set of feasible symbols = sets of legal correspondences. n Rules are of the sort: a:b op LC __ RC

Feasible Pair

272

136 8/28/17

Two-level Rules

n Rules are of the sort:

a:b op LC __ RC

Feasible Operator Pair

273

Two-level Rules

n Rules are of the sort:

a:b op LC __ RC

Feasible Operator Left Context Right Context Pair

274

137 8/28/17

Two-level Rules

n Rules are of the sort:

a:b op LC __ RC

Feasible Operator Left Context Right Context Pair Regular expressions that define what comes before and after the pair. 275

The Context Restriction Rule n a:b => LC _ RC;

¨a lexical a MAY be paired with a surface b only in this context ¨correspondence implies the context ¨occurrence of a:b in any other context is not allowed (but you can specify multiple contexts!)

276

138 8/28/17

The Context Restriction Rule

a:b => LC _ RC;

y:i => Consonant (+:0) _ +:0

Left Context: some consonant, possibly followed by an (optional) morpheme boundary
Right Context: a morpheme boundary

   try+0s   spot0+y+ness   day+s
   tri0es   spott0i0ness   dai0s

277

The Surface Coercion Rule n a:b <= LC _ RC

¨A lexical a MUST be paired with a surface b in the given context; no other pairs with a as its lexical symbol can appear in this context. ¨Context implies correspondence ¨ Note that there may be other contexts where an a may be paired with b (optionally)

278

139 8/28/17

The Surface Coercion Rule n The s of the genitive suffix MUST drop after the plural suffix. a:b <= LC __ RC

s:0 <= +:0 (0:e) s:s +:0 ':' _ ;

book+s+'s    blemish+0s+'s    book+s+'s
book0s0'0    blemish0es0'0    book0s0's
(the first two pairings are accepted; the last one, where the s does not drop, is rejected)

279

The Composite Rule n a:b <=> LC _ RC

¨A lexical a must be paired with a surface b only in this context and this correspondence is valid only in this context. ¨Combination of the previous two rules. ¨Correspondence Û Context

280

140 8/28/17

The Composite Rule n i:y is valid only before an e:0 correspondence followed by a morpheme boundary and followed by an i:i correspondence, and n in this context i must be paired with a y. a:b <=> LC _ RC i:y <=> _ e:0 +:0 i

tie+ing    tie+ing    tie+ing
ty00ing    ti00ing    tye0ing
(only the first pairing is accepted)

281

The Exclusion Rule n a:b /<= LC _ RC

¨A lexical a CANNOT be paired with a b in this context ¨Typically used to constrain the other rules.

282

141 8/28/17

The Exclusion Rule n The y:i correspondence formalized earlier can not occur if the morpheme on the right hand side starts with an i or the ' (the genitive marker) a:b /<= LC _ RC y:i /<= Consonant (+:0) _ +:0 [i:i | ’:’]

try+ing    try+ing
try0ing    tri0ing
Note that the previous context restriction rule would otherwise sanction the second (wrong) pairing; this exclusion rule blocks it.

283

Rules to Transducers n All the rule types can be compiled into finite state transducers ¨Rather hairy and not so gratifying (J)

284

142 8/28/17

Rules to Transducers n Let’s think about a:b => LC _ RC n If we see the a:b pair we want to make sure ¨It is preceded by a (sub)string that matches LC, and ¨It is followed by a (sub)string that matches RC n So we reject any input that violates either or both of these constraints 285

Rules to Transducers n Let's think about a:b => LC _ RC n More formally
¨ It is not the case that we have a:b not preceded by LC, or not followed by RC
¨ ~[ [~[?* LC] a:b ?*] | [?* a:b ~[RC ?*]] ]   (~ is the complementation operator)

286

143 8/28/17

Summary of Rules

n <= a:b <= c _ d ¨ a is always realized as b in the context c _ d. n => a:b => c _ d ¨ a is realized as b only in the context c _ d. n <=> a:b <=> c _ d ¨ a is realized as b in c _ d and nowhere else. n /<= a:b /<= c _ d ¨ a is never realized as b in the context c _ d.

287

How does one select a rule?

                     Is a:b allowed       Is a:b only allowed   Must a always correspond
                     in this context?     in this context?      to b in this context?
a:b  =>  LC _ RC     Yes                  Yes                   No
a:b  <=  LC _ RC     Yes                  No                    Yes
a:b  <=> LC _ RC     Yes                  Yes                   Yes
a:b  /<= LC _ RC     No                   NA                    NA

288

144 8/28/17

More Rule Examples - kaNpat n N:m <=> _ p:@; ¨N:n is also given as a feasible pair. n p:m <=> @:m _ ;

289

More Rule Examples n A voiced consonant is devoiced word-finally
¨ b:p <=> _ #;
¨ d:t <=> _ #;
¨ c:ç <=> _ #;
n The XRCE (Xerox) formalism lets you write this as a single rule:
n Cx:Cy <=> _ #: ; where Cx in (b d c) Cy in (p t ç) matched ;

290

145 8/28/17

Two-level Morphology n Beesley and Karttunen, Finite State Morphology, CSLI Publications, 2004 (www.fsmbook.com) n Karttunen and Beesley, Two-Level Rule Compiler, Xerox PARC Tech Report n Sproat, Morphology and Computation, MIT Press n Ritchie et al., Computational Morphology, MIT Press n Two-Level Rule Compiler: http://www.xrce.xerox.com/competencies/content-analysis/fssoft/docs/twolc-92/twolc92.html

291

Engineering a Real Morphological Analyzer

292

146 8/28/17

Turkish n Turkish is an Altaic language with over 60 Million speakers ( > 150 M for Turkic Languages: Azeri, Turkoman, Uzbek, Kirgiz, Tatar, etc.) n Agglutinative Morphology ¨Morphemes glued together like "beads-on-a- string" ¨Morphophonemic processes (e.g.,vowel harmony)

293

Turkish Morphology n Productive inflectional and derivational suffixation. n No prefixation, and no productive compounding. n With minor exceptions, morphotactics, and morphophonemic processes are very "regular."

294

147 8/28/17

Turkish Morphology n Too many word forms per root.
¨ Hankamer (1989), e.g., estimates a few million forms per verbal root (based on the generative capacity of derivations).
¨ Nouns have about 100 different forms w/o any derivations.
¨ Verbs have thousands.

295

Word Structure n A word can be seen as a sequence of inflectional groups (IGs) of the form

Lemma+Infl1^DB+Infl2^DB+…^DB+Infln

¨ evinizdekilerden (from the ones at your house)

296

148 8/28/17

Word Structure n A word can be seen as a sequence of inflectional groups (IGs) of the form

Lemma+Infl1^DB+Infl2^DB+…^DB+Infln

¨ evinizdekilerden (from the ones at your house) ¨ ev+iniz+de+ki+ler+den

297

Word Structure n A word can be seen as a sequence of inflectional groups (IGs) of the form

Lemma+Infl1^DB+Infl2^DB+…^DB+Infln

¨ evinizdekilerden (from the ones at your house) ¨ ev+iniz+de+ki+ler+den ¨ ev+HnHz+DA+ki+lAr+DAn A = {a,e}, H={ı, i, u, ü}, D= {d,t}

298

149 8/28/17

Word Structure n A word can be seen as a sequence of inflectional groups (IGs) of the form

Lemma+Infl1^DB+Infl2^DB+…^DB+Infln

¨ evinizdekilerden (from the ones at your house) ¨ ev+iniz+de+ki+ler+den ¨ ev+HnHz+DA+ki+lAr+DAn A = {a,e}, H={ı, i, u, ü}, D= {d,t}

¨ cf. odanızdakilerden oda+[ı]nız+da+ki+ler+den oda+[H]nHz+DA+ki+lAr+DAn

299

Word Structure n A word can be seen as a sequence of inflectional groups (IGs) of the form

Lemma+Infl1^DB+Infl2^DB+…^DB+Infln

¨ evinizdekilerden (from the ones at your house) ¨ ev+iniz+de+ki+ler+den ¨ ev+HnHz+DA+ki+lAr+DAn ¨ ev+Noun+A3sg+P2pl+Loc ^DB+Adj ^DB+Noun+A3pl+Pnon+Abl

300

150 8/28/17

Word Structure n sağlamlaştırdığımızdaki ( (existing) at the time we caused (something) to become strong. ) n Morphemes ¨ sağlam+lAş+DHr+DHk+HmHz+DA+ki n Features ¨ sağlam(strong)

n +Adj

n ^DB+Verb+Become (+lAş)

n ^DB+Verb+Caus+Pos (+DHr)

n ^DB+Noun+PastPart+P1pl+Loc (+DHk,+HmHz,+DA)
n ^DB+Adj (+ki)

301

Lexicon – Major POS Categories
n +Noun (*) (Common, Proper, Derived)
n +Pronoun (*) (Personal, Demons, Ques, Quant, Derived)
n +Verb (*) (Lexical, Derived)
n +Adjectives (+) (Lexical, Derived)
n +Number (+) (Cardinal, Ordinal, Distributive, Perc, Ratio, Real, Range)
n +Adverbs (Temporal, Spatial)
n +Postpositions (+) (subcat)
n +Conjunctions
n +Determiners/Quant
n +Interjections
n +Question (*)
n +Punctuation
n +Dup (onomatopoeia)

302

151 8/28/17

Morphological Features n Nominals ¨Nouns ¨Pronouns ¨Participles ¨Infinitives inflect for ¨Number, Person (2/6) ¨Possessor (None, 1sg-3pl) ¨Case

n Nom,Loc,Acc,Abl,Dat,Ins,Gen 303

Morphological Features n Nominals ¨Productive Derivations into

n Nouns (Diminutive) ¨ kitap(book), kitapçık (little book)

n Adjectives (With, Without….) ¨ renk (color), renkli (with color), renksiz (without color)

n Verbs (Become, Acquire) ¨ taş (stone), taşlaş (petrify) ¨ araba (car) arabalan (to acquire a car)

304

152 8/28/17

Morphological Features n Verbs have markers for ¨Voice:

n Reflexive/Reciprocal,Causative (0 or more),Passive ¨Polarity (Neg) ¨Tense-Aspect-Mood (2)

n Past, Narr,Future, Aorist,Pres n Progressive (action/state)

n Conditional, Necessitative, Imperative, Desiderative, Optative. ¨Number/Person

305

Morphological Features n öl-dür-ül-ecek-ti ((it) was going to be killed, i.e., caused to die)
¨ öl: die
¨ -dür: causative
¨ -ül: passive
¨ -ecek: future
¨ -ti: past
¨ -0: 3rd Sg person

306

153 8/28/17

Morphological Features n Verbs also have markers for ¨Modality:

n able to verb (can/may)

n verb repeatedly n verb hastily

n have been verb-ing ever since

n almost verb-ed but didn't

n froze while verb-ing

n got on with verb-ing immediately

307

Morphological Features n Productive derivations from Verb ¨ (e.g: Verb Þ Temp/Manner Adverb)

n after having verb-ed,

n since having verb-ed,

n when (s/he) verbs

n by verbing n while (s/he is ) verbing

n as if (s/he is) verbing

n without having verb-ed ¨ (e.g. Verb Þ Nominals)

n Past/Pres./Fut. Participles

n 3 forms of infinitives 308

154 8/28/17

Morphographemic Processes n Vowel Harmony ¨Vowels in suffixes agree in certain phonological features with the preceding vowels.

High Vowels H = {ı, i, u, ü}
Low Vowels = {a, e, o, ö}
Front Vowels = {e, i, ö, ü}
Back Vowels = {a, ı, o, u}
Round Vowels = {o, ö, u, ü}
Nonround Vowels = {a, e, ı, i}
Nonround Low A = {a, e}

Morphemes use A and H as underspecified meta symbols on the lexical side.
+lAr: Plural Marker
+nHn: Genitive Case Marker

309

Vowel Harmony n Some data masa+lAr okul+lAr ev+lAr gül+lAr masa0lar okul0lar ev0ler gül0ler ¨ If the last surface vowel is a back vowel, A is paired with a on the surface, otherwise A is paired with e. (A:a and A:e are feasible pairs)

310

155 8/28/17

Vowel Harmony n Some data
   masa+lAr   okul+lAr   ev+lAr   gül+lAr+yA
   masa0lar   okul0lar   ev0ler   gül0ler00e
¨ If the last surface vowel is a back vowel, A is paired with a on the surface; otherwise A is paired with e. (A:a and A:e are feasible pairs)
n Note that this is a chain effect

A:a <=> @:Vback [@:Cons]* +:0 [@:Cons| Cons:@ | :0]* _;

A:e <=> @:Vfront [@:Cons]* +:0 [@:Cons| Cons:@ | :0]* _; 311

Vowel Harmony n Some data
   masa+nHn   okul+nHn   ev+nHn   gül+Hn+DA
   masa0nın   okul00un   ev00in   gül0ün0de

¨ H is paired with ı if the previous surface vowel is a or ı ¨ H is paired with i if the previous surface vowel is e or i ¨ H is paired with u if the previous surface vowel is o or u ¨ H is paired with ü if the previous surface vowel is ö or ü

H:ü <=> [ @:ü | ö ] [@:Cons]* +:0 [@:Cons| Cons:@ | :0]* _ ;

312

156 8/28/17

Vowel Harmony n Some data
   masa+nHn   okul+nHn   ev+nHn   gül+nHn
   masa0nın   okul00un   ev00in   gül00ün

H:ü <=> [ @:ü | ö ] [@:Cons]* +:0 [@:Cons| Cons:@ | :0]* _ ; ¨ @:ü stands for both H:ü and ü:ü pairs ¨ ö stands for ö:ö ¨ @:Cons stands for any feasible pair where the surface symbol is a consonant (e.g., k:ğ, k:k, 0:k) ¨ Cons:@ stands for any feasible pair where the lexical symbol is a consonant (e.g., n:0, n:n, D:t)
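A toy Python sketch of resolving the A and H metaphonemes by harmony when gluing suffixes onto a stem (surface concatenation only; the deletions, insertions and consonant changes on the surrounding slides are ignored here):

BACK, FRONT = set("aıou"), set("eiöü")
ROUND       = set("oöuü")
VOWELS      = BACK | FRONT

def harmonize(stem, *suffixes):
    out = stem
    for suffix in suffixes:
        for ch in suffix.lstrip("+"):
            last = [v for v in out if v in VOWELS][-1]   # last surface vowel so far
            if ch == "A":                                # low unrounded: a / e
                out += "a" if last in BACK else "e"
            elif ch == "H":                              # high: ı / i / u / ü
                if last in BACK:
                    out += "u" if last in ROUND else "ı"
                else:
                    out += "ü" if last in ROUND else "i"
            else:
                out += ch
    return out

print(harmonize("masa", "+lAr"))        # masalar
print(harmonize("gül", "+lAr"))         # güller
print(harmonize("ev", "+lAr", "+Hm"))   # evlerim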

313

Other interesting phenomena
n Vowel ellipsis
   masa+Hm    av$uc+yH+nA          but   kapa+Hyor
   masa00m    av00c00u0na                kap00ıyor
   masam      avcuna                     kapıyor
n Consonant Devoicing
   kitab+DA   tad+DHk    tad+sH+nA    kitab
   kitap0ta   tat0tık    tad00ı0na    kitap
   kitapta    tattık     tadına
n Gemination
   tıb0+yH    üs0+sH     şık0+yH
   tıbb00ı    üss00ü     şıkk00ı
   tıbbı      üssü       şıkkı

314

157 8/28/17

Other interesting phenomena n Consonant ellipsis ev+nHn kalem+sH ev+yH ev00in kalem00i ev00i n Other Consonant Changes ayak+nHn renk+yH radyolog+sH ayağ00ın reng00i radyoloğ00u

k:ğ <=> @:Vowel _ +:0 ([s:0|n:0|y:0]) @:Vowel

315

Reality Check-1 n Real text contains phenomena that causes nasty problems: ¨Words of foreign origin - 1 alkol+sH kemal0+yA alkol00ü kemal'00e Use different lexical vowel symbols for these ¨Words of foreign origin -2 Carter'a serverlar Bordeaux'yu n This needs to be handled by a separate analyzer using phonological encodings of foreign words, or

n Using Lenient morphology 316

158 8/28/17

Reality Check-2 n Real text contains phenomena that cause nasty problems: ¨Numbers, Numbers: 2'ye, 3'e, 4'ten, %6'dan, 20inci,100üncü 16:15 vs 3:4, 2/3'ü, 2/3'si

Affixation proceeds according to how the number is pronounced, not how it is written!

The number lexicon has to be coded so that a representation of the last bits of its pronunciation is available at the lexical level. 317

Reality Check-3 n Real text contains phenomena that causes nasty problems: ¨ Acronyms PTTye -- No written vowel to harmonize to!

¨ Stash an invisible symbol (E:0) into the lexical representation so that you can force harmonization pttE+yA ptt00ye

318

159 8/28/17

Reality Check-4 n Interjections
¨ Aha!, Ahaaaaaaa!, Oh, Oooooooooh
¨ So the lexicon may have to encode lexical representations as regular expressions
   n ah[a]+, [o]+h
n Emphasis
¨ çok, çooooooook

n ah[a]+, [o]+h n Emphasis ¨çok, çooooooook

319

Reality Check-5 n Lexicons have to be kept in check to prevent overgeneration* ¨Allomorph Selection n Which causative morpheme you use depends on the (phonological structure of the) verb, or the previous causative morpheme ¨ ye+DHr+t oku+t+DHr n Which case morpheme you use depends on the previous morpheme. oda+sH+nA oda+yA oda0sı0na oda0ya to his room to (the) house

*Potentially illegitimate word structures. 320

160 8/28/17

Reality Check-6 n Lexicons have to be kept in check to prevent overgeneration

n The suffix +ki can only attach after +Loc case-marked nouns, or

n after singular nouns in +Nom case denoting temporal entities (such as day, minute, etc.)

321

Taming Overgeneration n All these can be specified as finite state transducers.

[Figure: the Lexicon Transducer is combined with Constraint Transducer 1 and Constraint Transducer 2 to give the Constrained Lexicon Transducer.]

322

161 8/28/17

Turkish Analyzer Architecture

Tif-ef: Transducer to generate symbolic output
TC: Transducers for morphotactic constraints (300)
Tlx-if: Root and morpheme lexicon transducer
Tis-lx: Morphographemic transducer = intersection of the rule transducers TR1 TR2 TR3 TR4 ... TRn
Tes-is: Transducer to normalize case.

323

Turkish Analyzer Architecture

Tif-ef

TC

Tlx-if

Tis-lx = intersection of rule transducers

TR1 TR2 TR3 TR4 ... TRn

Tes-is

kütüğünden, Kütüğünden, KÜTÜĞÜNDEN324

162 8/28/17

Turkish Analyzer Architecture

Tif-ef

TC

Tlx-if

Tis-lx = intersection of rule transducers

TR1 TR2 TR3 TR4 ... TRn

kütüğünden Tes-is

kütüğünden, Kütüğünden, KÜTÜĞÜNDEN325

Turkish Analyzer Architecture

Tif-ef

TC

Tlx-if

kütük+sH+ndAn Tis-lx = intersection of rule transducers

TR1 TR2 TR3 TR4 ... TRn

kütüğünden Tes-is

kütüğünden, Kütüğünden, KÜTÜĞÜNDEN326

163 8/28/17

Turkish Analyzer Architecture

Tif-ef kütük+Noun+A3sg+P3sg+nAbl

TC

Tlx-if

kütük+sH+ndAn Tis-lx = intersection of rule transducers

TR1 TR2 TR3 TR4 ... TRn

kütüğünden

Tes-is

kütüğünden, Kütüğünden, KÜTÜĞÜNDEN327

Turkish Analyzer Architecture kütük+Noun+A3sg+Pnon+Abl

Tif-ef kütük+Noun+A3sg+P3sg+nAbl

TC

Tlx-if

kütük+sH+ndAn Tis-lx = intersection of rule transducers

TR1 TR2 TR3 TR4 ... TRn

kütüğünden Tes-is

kütüğünden, Kütüğünden, KÜTÜĞÜNDEN328

164 8/28/17

Turkish Analyzer Architecturekütük+Noun+A3sg+Pnon+Abl

Turkish Analyzer (After all transducers are intersected and composed) (~1M States, 1.6M Transitions)

kütüğünden, Kütüğünden, KÜTÜĞÜNDEN329

Turkish Analyzer Architecture

22K Nouns, 4K Verbs, 2K Adjectives, 100K Proper Nouns

Turkish Analyzer (After all transducers are intersected and composed) (~1M States, 1.6M Transitions)

330

165 8/28/17

Finite State Transducers Employed

[Figure: the analysis/pronunciation pipeline for the surface form evinde. From the bottom up: a Two-Level Rule Transducer, a (Duplicating) Lexicon and Morphotactic Constraints Transducer, a Modified Inverse Two-Level Rule Transducer plus filters, followed by SAMPA Mapping, Exceptional Phonology, Syllabification and Stress Computation Transducers. Intermediate levels show the surface morpheme sequences (ev+i+nde, ev+in+de), the lexical morpheme sequences (ev+sH+ndA, ev+Hn+DA) and the feature forms (ev+Noun+A3sg+P3sg+Loc, ev+Noun+A3sg+P2sg+Loc); the final output pairs morphology with pronunciation, e.g., (e - v )ev+Noun+A3sg(i )+P3sg(n - "d e )+Loc and (e - v )ev+Noun+A3sg(i n )+P2sg(- "d e )+Loc. Built from a total of 750 xfst regular expressions + 100K root words (mostly proper names) over about 50 files; the resulting FST has 10M states and 16M transitions.]

Surface form evinde 331

Pronunciation Generation n gelebilecekken ¨ (gj e - l )gel+Verb+Pos(e - b i - l ) ^DB+Verb +Able(e - "dZ e c ) +Fut(- c e n )^DB+Adverb+While n okuma ¨ (o - k )ok+Noun+A3sg(u - "m ) +P1sg(a )+Dat ¨ (o - "k u ) oku+Verb(- m a ) +Neg+Imp+A2sg ¨ (o - k u ) oku+Verb+Pos(- "m a ) ^DB+Noun+Inf2+A3sg+Pnon+Nom

332

166 8/28/17

Are we there yet? n How about all these foreign words? ¨serverlar, clientlar, etc.

333

Are we there yet? n How about all these foreign words? ¨serverlar, clientlar, etc. n A solution is to use Lenient Morphology ¨Find the two-level constraints violated ¨Allow violation of certain constraints

334

167 8/28/17

Are we there yet? n How about unknown words? ¨zaplıyordum (I was zapping)

335

Are we there yet? n How about unknown words? ¨zaplıyordum (I was zapping) n Solution ¨Delete all lexicons except noun and verb root lexicons. ¨Replace both lexicons by

n [Alphabet Alphabet*]
¨ Thus the analyzer will parse any prefix string as a root, provided it can parse the rest as a sequence of Turkish suffixes (see the sketch below).
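A rough sketch of the idea in Python (the suffix list below is a made-up stand-in for the real suffix transducer): any non-empty prefix is allowed as a root, and a split is kept only when the remainder is parseable suffix material.

KNOWN_SUFFIX_STRINGS = {"ıyordum", "lar", "larım"}   # hypothetical surface suffix strings

def guess(word):
    """Try every split: unknown root + known suffix material."""
    return [(word[:i], word[i:]) for i in range(1, len(word))
            if word[i:] in KNOWN_SUFFIX_STRINGS]

print(guess("zaplıyordum"))   # [('zapl', 'ıyordum')]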

336

168 8/28/17

Are we there yet? n How about unknown words? ¨zaplıyordum (I was zapping) n Solution ¨zaplıyordum n zapla+Hyor+DHm (zapla+Verb+Pos+Pres1+A1sg) n zapl +Hyor+DHm (zapl+Verb+Pos+Pres1+A1sg)

337

Systems Available n Xerox Finite State Suite (lexc, twolc, xfst)
¨ Commercial (Education/research license available)
¨ Lexicon and rule compilers available
¨ Full power of the finite state calculus (beyond two-level morphology)
¨ Very fast (thousands of words/sec)
n Beesley and Karttunen, Finite State Morphology, CSLI Publications, 2004 (www.fsmbook.com) comes with (a version of) this software
¨ New versions of the software are now available on the web via a click-through license (with bonus C++ and Python APIs) 338

Systems Available
I Schmid's SFST – the Stuttgart Finite State Transducer Tools
  I SFST is a toolbox for the implementation of morphological analysers and other tools which are based on finite state transducer technology.
  I Available at http://www.ims.uni-stuttgart.de/projekte/gramotron/SOFTWARE/SFST.html

339

Systems Available
I AT&T FSM Toolkit
  I Tools to manipulate (weighted) finite state transducers
  I Now an open source version is available as OpenFST
I Carmel Toolkit
  I http://www.isi.edu/licensed-sw/carmel/
I FSA Toolkit
  I http://www-i6.informatik.rwth-aachen.de/~kanthak/fsa.html

340

OVERVIEW
I Overview of Morphology
I Computational Morphology
I Overview of Finite State Machines
I Finite State Morphology
  I Two-level Morphology
  I Cascade Rules

341

Morphological Analyzer Structure

[Figure: the lexical form happy+Adj+Sup is mapped by the Lexicon Transducer to happy+est, which is mapped by the Morphographemic Transducer (marked ???????? – still to be specified) to the surface form happiest.]

342

The Parallel Constraints Approach

I Two-level Morphology
  I Koskenniemi '83, Karttunen et al. 1987, Karttunen et al. 1992

[Figure: the lexical form and the surface form are related directly by a set of parallel two-level rules (fst 1, fst 2, ..., fst n) that describe constraints; the rules are compiled into finite-state automata interpreted as transducers.]

Slide courtesy of Lauri Karttunen

Sequential Transduction Approach

I Cascaded ordered rewrite rules
  I Johnson '71, Kaplan & Kay '81, based on Chomsky & Halle '64

[Figure: the lexical form is mapped to the surface form through an ordered cascade of rewrite rules (fst 1, fst 2, ..., fst n) compiled into finite-state transducers, with intermediate forms in between.]

Slide courtesy of Lauri Karttunen

Sequential vs. Parallel

I In the end, both approaches are equivalent.

[Figure: a cascade fst 1, fst 2, ..., fst n mapping the lexical form through intermediate forms to the surface form, and a parallel bank of transducers fst 1, fst 2, ..., fst n relating the lexical form directly to the surface form, can both be compiled into a single FST.]

Slide courtesy of Lauri Karttunen

Sequential vs. Parallel

I Two different ways of decomposing the complex relation between lexical and surface forms into a set of simpler relations that can be more easily understood and manipulated.
I The sequential model is oriented towards string-to-string relations; the parallel model is about symbol-to-symbol relations.
I In some cases it may be advantageous to combine the two approaches.

Slide courtesy of Lauri Karttunen

Cascaded Ordered Rewrite Rules

I In this architecture, one transforms the lexical form to the surface form (or vice versa) through a series of transformations: fst 1, fst 2, ..., fst n, with intermediate forms in between.
I In a sense, it is a procedural model.
  I What do I do to a lexical string to convert it to a surface string?

347

The kaNpat Example (again)
I A lexical N before a lexical p is realized on the surface as an m.
I A p after an m is realized as m on the surface.

      ...N p...   →   ...m p...   →   ...m m...

The kaNpat Example
I A lexical N before a lexical p is realized on the surface as an m.
I A p after an m is realized as m on the surface.

      kaNpat (lexical)  →  kampat (intermediate)  →  kammat (surface)

I So we obtain the surface form after a sequence of transformations.
I There is a minor problem with this! What happens if N is NOT followed by a p? It has to be realized as an n:

      kaNmat (lexical)  →  kanmat (intermediate)  →  kanmat (surface)

The kaNpat Example
I A lexical N before a lexical p is realized on the surface as an m.
  I Otherwise it has to be replaced by an n.
I A p after an m is realized as m on the surface.

The kaNpat Example

                             kaNpat    kaNtat    kammat
      N->m transformation:   kampat    kaNtat    kammat
      N->n transformation:   kampat    kantat    kammat
      p->m transformation:   kammat    kantat    kammat

The kaNpat Example

                             kaNpat    kampat    kammat
      N->m transformation:   kampat    kampat    kammat
      N->n transformation:   kampat    kampat    kammat
      p->m transformation:   kammat    kammat    kammat

The kaNpat Example

      {kaNpat, kammat, kampat}
          ↓ N->m transformation
          ↓ N->n transformation
          ↓ p->m transformation
      kammat

Finite State Transducers

[Figure: the N->m, N->n, and p->m transformations drawn as finite-state transducers, with arcs labeled by symbol pairs such as N:m, N:n, p:m, and identity pairs (m:m, p:p, n:n, N:N, ?) for everything else.]

Courtesy of Ken Beesley

Finite State Transducers

[Figure: the N->m Transducer, the N->n Transducer, and the p->m Transducer are combined by composition into the Morphographemic Transducer.]

Courtesy of Lauri Karttunen

Finite State Transducers
I Again, describing transformations with transducers is too low level and tedious.
I One needs to use a higher level formalism:
  I the Xerox Finite State Tool regular expression language
  I http://www.stanford.edu/~laurik/fsmbook/online-documentation.html

363

Rewrite Rules
I Originally, rewrite rules were proposed to describe phonological changes.
I u -> l / LC _ RC
  I Change u to l if it is preceded by LC and followed by RC.

364

Rewrite Rules
I These rules of the sort u -> l / LC _ RC look like context-sensitive grammar rules, so in general they can describe much more complex languages.
I Johnson (1972) showed that such rewrite rules are only finite-state in power, if some constraints on how they apply to strings are imposed.

365

Replace Rules
I Replace rules define regular relations between two regular languages:

      A -> B || LC _ RC
      (Replacement: A -> B;   Context: LC _ RC)

  The relation replaces A by B between LC and RC, leaving everything else unchanged.
I In general A, B, LC and RC are regular expressions.

366

Replace Rules
I Let us look at the simplest replace rule: a -> b
I The relation defined by this rule relates each upper-language string to the lower-language string which is exactly the same, except that all the a's are replaced by b's.
  I The related strings are identical if the upper string does not contain any a's.

367

Replace Rules
I Let us look at the simplest replace rule with a context: a -> b || d _ e
I a's are replaced by b if they occur after a d and before an e.
  I An upper string in which a appears in this context is related to the lower string with those a's replaced.
  I Strings that differ in an a outside this context are NOT related.
  I Strings with no a in this context are related to themselves, unchanged (Why?).

368

Replace Rules
I Although replace rules define regular relations, sometimes it may be better to look at them in a procedural way.
  I a -> b || d _ e
I What string do I get when I apply this rule to the upper string bdaeccdaeb?
  I bdaeccdaeb → bdbeccdbeb

369

Replace Rules – More examples
I a -> 0 || b _ b;
  I Replace an a between two b's with an epsilon (delete the a).
  I abbabab ⇒ abbbb
  I ababbb ⇒ abbbb
  I In fact, all of abababab, abababb, ababbab, ababbb, abbabab, abbabb, abbbab, and abbbb map to abbbb!

Replace Rules – More examples
I [..] -> a || b _ b
  I Replace a single epsilon between two b's with an a (insert an a).
  I abbbbe ⇒ ababababe
I 0 -> a || b _ b is a bit tricky.

371

Rules and Contexts

      A -> B    LC _ RC
      (Replacement)   (Context)

I Contexts around the replacement can be specified in 4 ways.

      Upper string:   ... A ...
      Lower string:   ... B ...

372


Rules and Contexts

      A -> B || LC _ RC

I Both LC and RC are checked on the upper string.

      Upper string:   LC  A  RC
      Lower string:       B

373

Rules and Contexts

      A -> B // LC _ RC

I LC is checked on the lower string, RC is checked on the upper string.

      Upper string:        A  RC
      Lower string:   LC   B

374


Rules and Contexts

      A -> B \\ LC _ RC

I LC is checked on the upper string, RC is checked on the lower string.

      Upper string:   LC  A
      Lower string:       B  RC

375

Rules and Contexts

      A -> B \/ LC _ RC

I LC is checked on the lower string, RC is checked on the lower string.

      Upper string:        A
      Lower string:   LC   B  RC

376


Replace Rules – More Variations
I Multiple parallel replacements: A -> B, C -> D
  I A is replaced by B and C is replaced by D.
I Multiple parallel replacements in the same context: A -> B, C -> D || LC _ RC
  I A is replaced by B and C is replaced by D in the same context.

377

Replace Rules – More Variations
I Multiple parallel replacements in multiple contexts: A -> B, C -> D || LC1 _ RC1, LC2 _ RC2, ...
  I A is replaced by B and C is replaced by D in any one of the listed contexts.
I Independent multiple replacements in different contexts: A -> B || LC1 _ RC1 ,, C -> D || LC2 _ RC2

378

Replace Rules for KaNpat
I N -> m || _ p;
  I Replace N with m just before an upper p.
I N -> n;
  I Replace N with n otherwise.
I p -> m || m _;
  I Replace p with an m just after an m (in the upper string).

379

Replace Rules for KaNpat

                         kaNpat    kaNtat    kammat
      N -> m || _ p;     kampat    kaNtat    kammat
      N -> n;            kampat    kantat    kammat
      p -> m || m _      kammat    kantat    kammat

380

Replace Rules for KaNpat (combined)

      N -> m || _ p;      (FST1)
      .o.
      N -> n;             (FST2)
      .o.
      p -> m || m _       (FST3)

I Defines a single kaNpat transducer that "composes" the three transducers.

381
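To see what the composed cascade does procedurally, here is a minimal Python sketch (my own illustration, not xfst; regexes stand in for the compiled transducers) that applies the three rewrites in the same order and reproduces the kaNpat and kaNtat columns of the tables above:

```python
import re

def n_to_m(s):   # N -> m || _ p : N becomes m just before a p
    return re.sub(r"N(?=p)", "m", s)

def n_to_n(s):   # N -> n : any remaining N becomes n
    return s.replace("N", "n")

def p_to_m(s):   # p -> m || m _ : p becomes m just after an m
    return re.sub(r"(?<=m)p", "m", s)

def cascade(lexical):
    # Apply the three rules in the same order they are composed above.
    return p_to_m(n_to_n(n_to_m(lexical)))

for w in ["kaNpat", "kaNtat", "kammat"]:
    print(w, "->", cascade(w))
# kaNpat -> kammat, kaNtat -> kantat, kammat -> kammat
```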

A Cascade Rule Sequence for Turkish

I Remember the conventions:
  I A = {a, e},  H = {ı, i, u, ü}
  I VBack = [a | ı | o | u]      // the back vowels
  I VFront = [e | i | ö | ü]     // the front vowels
  I Vowel = VBack | VFront
  I Consonant = [ ...all consonants... ]

382

Stem Final Vowel Deletion
I A vowel ending a stem is deleted if it is followed by the morpheme +Hyor:
  I ata+Hyor (is assigning)
  I at+Hyor
I Vowel -> 0 || _ "+" H y o r

383

Morpheme Initial Vowel Deletion
I A high vowel starting a morpheme is deleted if it is attached to a segment ending in a vowel:
  I masa+Hm (my table)
  I masa+m
I H -> 0 || Vowel "+" _ ;
I Note that the previous rule handles a more specific case.

384

Vowel Harmony
I A bit tricky:
  I A -> a or A -> e
  I H -> ı, H -> i, H -> u, H -> ü
I These two (groups of) rules are interdependent.
I They have to apply concurrently and each is dependent on the outputs of the others.

385

Vowel Harmony
I So we need:
  I Parallel rules
  I Each checking its left context on the output (lower side)
I A -> a // VBack Cons* "+" Cons* _ ,,
I A -> e // VFront Cons* "+" Cons* _ ,,
I H -> u // [o | u] Cons* "+" Cons* _ ,,
I H -> ü // [ö | ü] Cons* "+" Cons* _ ,,
I H -> ı // [a | ı] Cons* "+" Cons* _ ,,
I H -> i // [e | i] Cons* "+" Cons* _

386
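To make the "check the left context on the lower (already rewritten) side" point concrete, here is a minimal Python sketch (my own illustration, not a two-level or xfst implementation) that resolves A and H left to right, so each resolved surface vowel conditions the next one; it ignores the consonant rules, and the sample words are my own:

```python
BACK, FRONT = set("aıou"), set("eiöü")
ROUNDED = set("oöuü")

def harmonize(lexical):
    """Resolve A -> {a, e} and H -> {ı, i, u, ü} from the preceding surface vowel."""
    out, last_vowel = [], None
    for ch in lexical:
        if ch == "+":                 # morpheme boundary, deleted on the surface
            continue
        if ch == "A":                 # low unrounded: backness harmony only
            ch = "a" if last_vowel in BACK else "e"
        elif ch == "H":               # high vowel: backness + rounding harmony
            if last_vowel in BACK:
                ch = "u" if last_vowel in ROUNDED else "ı"
            else:
                ch = "ü" if last_vowel in ROUNDED else "i"
        if ch in BACK or ch in FRONT:
            last_vowel = ch           # the *resolved* vowel feeds the next decision
        out.append(ch)
    return "".join(out)

print(harmonize("okul+lAr+Hm"))   # okullarım  ("my schools")
print(harmonize("ev+lAr+Hm"))     # evlerim    ("my houses")
```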

Consonant Resolution
I d is realized as t either at the end of a word or after certain consonants.
I b is realized as p either at the end of a word or after certain consonants.
I c is realized as ç either at the end of a word or after certain consonants.
I d -> t, b -> p, c -> ç // [h | ç | ş | k | p | t | f | s] "+" _

387

Consonant Deletion
I Morpheme-initial s, n, y is deleted if it is preceded by a consonant.
I [s|n|y] -> 0 || Consonant "+" _ ;

388

Cascade

[Figure: the cascade Stem Final Vowel Deletion → Morpheme Initial Vowel Deletion → Vowel Harmony → Consonant Devoicing → Consonant Deletion → Boundary Deletion composes into the (Partial) Morphographemic Transducer.]

389

Cascade

[Figure: the Lexicon Transducer composed with the (partial) Morphographemic Transducer (Stem Final Vowel Deletion, Morpheme Initial Vowel Deletion, Vowel Harmony, Consonant Devoicing, Consonant Deletion, Boundary Deletion) gives a (partial) TMA.]

390

Some Observations
I We have not really seen all the nitty-gritty details of both approaches, but rather the fundamental ideas behind them.
I Rule conflicts in two-level morphology:
  I Sometimes the rule compiler detects a conflict: two rules sanction conflicting feasible pairs in a context.
  I Sometimes the compiler can resolve the conflict, but sometimes the developer has to fine-tune the contexts.

391

Some Observations
I We have not really seen all the nitty-gritty details of both approaches, but rather the fundamental ideas behind them.
I Unintended rule interactions in rule cascades:
  I When one has tens of replace rules, one feeding into the other, unintended/unexpected interactions are hard to avoid.
  I Compilers can't do much.
  I Careful incremental development with extensive testing is needed.

392

Some Observations
I For a real morphological analyzer, my experience is that developing an accurate model of the lexicon is as hard as (if not harder than) developing the morphographemic rules:
  I Taming overgeneration
  I Enforcing "semantic" constraints
  I Enforcing long-distance co-occurrence constraints (this suffix cannot occur with that prefix, etc.)
  I Special cases, irregular cases

393

11-411 Natural Language Processing Language Modelling and Smoothing

Kemal Oflazer

Carnegie Mellon University in Qatar

1/46 What is a Language Model?

I A model that estimates how likely it is that a sequence of words belongs to a (natural) language

I Intuition

I p(A tired athlete sleeps comfortably) ≫ p(Colorless green ideas sleep furiously)

I p(Colorless green ideas sleep furiously) ≫ p(Salad word sentence is this)

2/46 Let’s Check How Good Your Language Model is?

I Guess the most likely next word

I The prime of his life . . . {is, was, . . . }

I The prime minister gave an . . . {ultimatum, address, expensive, . . . }

I The prime minister gave a . . . {speech, book, cheap, . . . }

I The prime number after eleven . . . {is, does, could, has, had . . . }

I The prime rib was . . . {delicious, expensive, flavorful, . . . ,} but NOT green

3/46 Where do we use a language model?

I Language models are typically used as components of larger systems.
I We'll study how they are used later, but here's some further motivation.
I Speech transcription:
  I I want to learn how to wreck a nice beach.
  I I want to learn how to recognize speech.
I Handwriting recognition:
  I I have a gub!
  I I have a gun!
I Spelling correction:
  I We're leaving in five minuets.
  I We're leaving in five minutes.

I Ranking machine translation system outputs

4/46 Very Quick Review of Probability

I Event space (e.g., X , Y), usually discrete for the purposes of this class.

I Random variables (e.g., X, Y)

I We say “Random variable X takes value x ∈ X with probability p(X = x)”

I We usually write p(X = x) as p(x).

I Joint probability: p(X = x, Y = y)

I Conditional probability: p(X = x | Y = y) = p(X = x, Y = y) / p(Y = y)

I This always holds: p(X = x, Y = y) = p(X = x | Y = y) · p(Y = y) = p(Y = y | X = x) · p(X = x)

I This sometimes holds: p(X = x, Y = y) = p(X = x) · p(Y = y)

I True and estimated probability distributions.

5/46 Language Models: Definitions

I V is a finite set of discrete symbols (characters, words, emoji symbols, . . . ); V = |V|.
I V+ is the infinite set of finite-length sequences of symbols from V whose final symbol is 2.
I p : V+ → R such that
  I for all x ∈ V+, p(x) ≥ 0
  I p is a proper probability distribution: ∑_{x ∈ V+} p(x) = 1

I Language modeling: Estimate p from the training set examples

x1:n = ⟨x1, x2, . . . , xn⟩

I Notation going forward:
  I x: a single symbol (word, character, etc.) from V
  I x: a sequence of symbols in V+ as defined above; xi is the i-th symbol of x
  I x1:n denotes n sequences, ⟨x1, x2, . . . , xn⟩
  I xi is the i-th sequence in x1:n
  I [xi]j is the j-th symbol of the i-th sequence in x1:n

6/46 Issues

I Why would we want to do this?

I Are the nonnegativity and sum-to-one constraints really necessary?

I Is “finite V” realistic?

7/46 Motivation – Noisy Channel Models

I Noisy channel models are very suitable models for many NLP problems:

Source (Generator) → Y → Channel (Blender) → X

I Y is the plaintext, the true message, the missing information, the output

I X is the ciphertext, the garbled message, the observable evidence, the input

I Decoding: select the best y given X = x.

      ŷ = arg max_y p(y | x)
        = arg max_y p(x | y) · p(y) / p(x)
        = arg max_y p(x | y) · p(y)

  where p(x | y) is the Channel Model and p(y) is the Source Model.

8/46 Noisy Channel Example – Speech Recognition

Source → Y (Seq. of words) → Vocal Tract → X (Acoustic Waves)

I “Mining a year of speech” →

I Source model characterizes p(y), “What are possible sequences of words I can say?”

I Channel model characterizes p(Acoustics | y)

I It is hard to recognize speech

I It is hard to wreck a nice beach

I It is hard wreck an ice beach

I It is hard wreck a nice peach

I It is hard wreck an ice peach

I It is heart to wreck an ice peach

I · ··

9/46 Noisy Channel Example – Machine Translation

Source → Y (Seq. of Turkish words) → Translation → X (Seq. of English Words)

I p(x | y) models the translation process.

I Given an observed sequence of English words, presumably generated by translating from Turkish, what is the most likely source Turkish sentence, that could have given rise to this translation?

10/46 Machine Transliteration

I Phonetic translation across language pairs with very different alphabets and sound system is called transliteration.

I Golfbag in English is to be transliterated to Japanese.

I Japanese has no distinct l and r sounds - these in English collapse to the same sound. Same for English h and f.

I Japanese uses alternating vowel-consonant syllable structure: lfb is impossible to pronounce without any vowels.

I Katakana writing is based on syllabaries: different symbols for ga, gi, gu, etc.

I So Golfbag is transliterated into katakana and pronounced as go-ru-hu-ba-ggu.

I So when you see a transliterated word in Japanese text, how can you find out what the English is?

I nyuuyooko taimuzu → New York Times

I aisukuriimu → ice-cream (and not “I scream”)

I ranpu → lamp or ramp

I masutaazutoonamento → Master’s Tournament

11/46 Noisy Channel Model – Other Applications

I Spelling Correction

I Grammar Correction

I Optical Character Recognition

I Sentence Segmentation

I Part-of-speech Tagging

12/46 Is finite V realistic?

I NO!

I We will never see all possible words in a language, no matter how large the sample we look at is.

13/46 The Language Modeling Problem

I Input: x1:n – the "training data".
I Output: p : V+ → R+

I p should be a “useful” measure of plausibility (not necessarily of grammaticality).

14/46 A Very Simple Language Model

I We are given x1:n as the training data.
I Remember that each xi is a sequence of symbols, that is, a "sentence".

I So we have n sentences, and we count how many times the sentence x appears.

I p(x) is estimated as

      p(x) = |{i : xi = x}| / n = c_{x1:n}(x) / n

I So we only know about n sentences and nothing else!

I What happens when you want to assign a probability to some x that is not in the training set?

I Is there a way out?

15/46 Chain Rule to the Rescue

I We break down p(x) mathematically

      p(X = x) = p(X1 = x1) × p(X2 = x2 | X1 = x1) × p(X3 = x3 | X1:2 = x1:2) × ··· × p(Xℓ = 2 | X1:ℓ−1 = x1:ℓ−1)

               = ∏_{j=1}^{ℓ} p(Xj = xj | X1:j−1 = x1:j−1)

I This is an exact formulation.

I Each word is conditioned on all the words coming before it!

16/46 Approximating the Chain Rule Expansion – The Unigram Model

      p(X = x) = ∏_{j=1}^{ℓ} p(Xj = xj | X1:j−1 = x1:j−1)
               =(unigram assumption) ∏_{j=1}^{ℓ} p_θ(Xj = xj) = ∏_{j=1}^{ℓ} θ_{xj} ≈ ∏_{j=1}^{ℓ} θ̂_{xj}

I The θ̂'s are maximum likelihood estimates:

      ∀v ∈ V,  θ̂_v = |{(i, j) : [xi]_j = v}| / N = c_{x1:n}(v) / N,    where N = ∑_{i=1}^{n} |xi|

I This is also known as "relative frequency estimation".

I The unigram model is also known as the “bag of words” model. Why? 17/46 Unigram Models – The Good and the Bad

Pros:
I Easy to understand
I Cheap
I Not many parameters
I Easy to compute
I Good enough for maybe information retrieval

Cons:
I "Bag of Words" assumption is not linguistically accurate.
  I p(the the the the) ≫ p(I want to run)
I "Out of vocabulary" problem.
  I What happens if you encounter a word you never saw before?

I Generative Process: keep on randomly picking words until you pick 2.

I We really never use unigram models!

18/46 Approximating the Chain Rule Expansion – Markov Models

Markov Models ≡ n-gram Models

      p(X = x) = ∏_{j=1}^{ℓ} p(Xj = xj | X1:j−1 = x1:j−1)
               =(assumption) ∏_{j=1}^{ℓ} p_θ(Xj = xj | Xj−n+1:j−1 = xj−n+1:j−1)     (condition only on the last n − 1 words)

I n-gram models ≡ (n − 1)-th order Markov assumption.

I The unigram model is the model with n = 1.

I Trigram models (n = 3) are widely used.

I 5-gram models (n = 5) are quite common in statistical machine translation.

19/46 Estimating n-gram Models

I unigram:        p_θ(x) = ∏_{j=1}^{ℓ} θ_{xj};              parameters θ_v, ∀v ∈ V;                              MLE: c(v)/N
I bigram:         p_θ(x) = ∏_{j=1}^{ℓ} θ_{xj|xj−1};         parameters θ_{v|v'}, ∀v' ∈ V, ∀v ∈ V ∪ {2};          MLE: c(v'v)/c(v')
I trigram:        p_θ(x) = ∏_{j=1}^{ℓ} θ_{xj|xj−2xj−1};     parameters θ_{v|v''v'}, ∀v', v'' ∈ V, ∀v ∈ V ∪ {2};  MLE: c(v''v'v)/c(v''v')
I general n-gram: p_θ(x) = ∏_{j=1}^{ℓ} θ_{xj|xj−n+1:j−1};   parameters θ_{v|h}, ∀h ∈ V^{n−1}, ∀v ∈ V ∪ {2};      MLE: c(hv)/c(h)
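A minimal Python sketch of the relative-frequency (MLE) estimates above for the unigram and bigram cases; the "</s>" and "<s>" boundary strings are my own stand-ins for the stop symbol 2 and a start-of-sentence context:

```python
from collections import Counter

def mle_ngrams(sentences, stop="</s>"):
    """Relative-frequency estimates for unigram and bigram models."""
    unigram_counts, bigram_counts, context_counts = Counter(), Counter(), Counter()
    for sent in sentences:
        words = sent.split() + [stop]
        unigram_counts.update(words)
        for prev, cur in zip(["<s>"] + words, words):
            bigram_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    N = sum(unigram_counts.values())
    theta_unigram = {v: c / N for v, c in unigram_counts.items()}
    theta_bigram = {vw: c / context_counts[vw[0]] for vw, c in bigram_counts.items()}
    return theta_unigram, theta_bigram

uni, bi = mle_ngrams(["the cat sat", "the cat ran"])
print(uni["the"])            # 2/8 = 0.25
print(bi[("the", "cat")])    # 1.0
```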

20/46 The Problem with MLE

I The curse of dimensionality: the number of parameters grows exponentially in n.

I Data sparseness: most n-grams will never be observed – even when they are linguistically plausible.

I What is the probability of unseen words? (0 ?)

I But that’s not what you want. Test set will usually include words not in training set.

I What is p(Nebuchadnezzur | son of) ?

I A single 0 probability will set the estimate to 0. Not acceptable!

21/46 Engineering Issues – Log Probabilities

I Note that computation of pθ(x) involves multiplication of numbers each of which are between 0 and 1.

I So multiplication hits underflow: computationally the product can not be represented or computed.

I In implementation, probabilities are represented by the logarithms (between −∞ and 0) and multiplication is replaced by addition.
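A tiny sketch of why log space is needed; the probabilities are made-up placeholders:

```python
import math

word_probs = [1e-5] * 80          # 80 words, each with probability 1e-5

product = 1.0
for p in word_probs:
    product *= p
print(product)                    # 0.0 -- underflows in double precision

log_prob = sum(math.log(p) for p in word_probs)
print(log_prob)                   # about -921.03, perfectly representable
```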

22/46 Dealing with Out-of-Vocabulary Words

I Quick and dirty approach

I Decide what is in the vocabulary (e.g., all words with frequency > say 10).

I Add UNK to the vocabulary.

I Replace all unknown words with UNK

I Estimate as usual.

I Build a language model at the character level.

I What are advantages and disadvantages?

23/46 Smoothing Language Models

I We can not have 0 probability n-grams. So we should shave off some probability mass from seen n-grams to give to unseen n-grams.

I The Robin-Hood Approach – steal some probability from the haves to have-nots.

I Simplest method: Laplace Smoothing

I Interpolation

I Stupid backoff.

I Long-standing best method: modified Kneser-Ney smoothing

24/46 Laplace Smoothing

I We add 1 to all counts! So words with 0 counts will be assumed to have count 1.
I Unigram probabilities: p(v) = (c(v) + 1) / (N + V)
I Bigram probabilities: p(v | v') = (c(v'v) + 1) / (c(v') + V)

I One can also use Add-k smoothing for some fractional k (0 < k ≤ 1).

I It turns out this method is very simple but shaves off too much of the probability mass. (See book for an example.)
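A minimal sketch of the add-1 (and add-k) bigram estimate above; the count tables and the toy numbers are placeholders:

```python
def laplace_bigram(v, v_prev, bigram_counts, context_counts, vocab_size, k=1.0):
    """p(v | v_prev) with add-k smoothing: (c(v_prev v) + k) / (c(v_prev) + k*V)."""
    return (bigram_counts.get((v_prev, v), 0) + k) / (context_counts.get(v_prev, 0) + k * vocab_size)

# Toy counts: c(the, cat) = 2, c(the) = 10, V = 1000
print(laplace_bigram("cat", "the", {("the", "cat"): 2}, {"the": 10}, 1000))
# (2 + 1) / (10 + 1000) = 0.00297...
```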

25/46 Interpolation

I We estimate n-gram probabilities by combining count-based estimates from n-grams and lower-order grams:

      p̂(v | v''v') = λ1 p(v | v''v') + λ2 p(v | v') + λ3 p(v),    with ∑_i λi = 1

I The λ's are estimated by maximizing the likelihood of held-out data.

26/46 Stupid Backoff

I Gives up the idea of making the language model a true probability distribution.

I Works quite well with very large training data (e.g. web scale) and large language models

I If a given n-gram has never been observed, just use the next lower gram’s estimate scaled by a fixed weight λ (terminates when you reach the unigram)
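A minimal recursive sketch of stupid backoff; the count-table layout and the fixed weight 0.4 (a commonly used choice) are assumptions of this illustration:

```python
def stupid_backoff(word, context, counts, total_words, lam=0.4):
    """Score (not a probability) of `word` given a tuple `context` of preceding words.

    counts maps n-gram tuples of every order, all taken from the same training data,
    to their frequencies; total_words is the total token count.
    """
    if not context:                                   # unigram base case
        return counts.get((word,), 0) / total_words
    ngram = context + (word,)
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / counts[context]        # relative frequency of the full n-gram
    return lam * stupid_backoff(word, context[1:], counts, total_words, lam)
```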

27/46 Kneser-Ney Smoothing

I Kneser-Ney smoothing and its variants (interpolated Kneser-Ney or modified Kneser-Ney) use absolute discounting.

I The math is a bit more involved. See the book if you are interested.

28/46 Toolkits

I These days people build language models using well-established toolkits:

I SRILM Toolkit (https://www.sri.com/engage/products-solutions/sri-language-modeling-toolkit)

I CMU Statistical Language Modeling Toolkit (http://www.speech.cs.cmu.edu/SLM_info.html)

I KenLM Language Model Toolkit (https://kheafield.com/code/kenlm/)

I Each toolkit provides executables and/or API and options to build, smooth, evaluate and use language models. See their documentation.

29/46 n-gram Models– Assessment

Pros:
I Easy to understand
I Cheap (with modern hardware/memory)
I Good enough for machine translation, speech recognition, contextual spelling correction, etc.

Cons:
I Markov assumption is not linguistically accurate.
  I but not as bad as unigram models
I "Out of vocabulary" problem.

30/46 Evaluation – Language Model Perplexity

I Consider a language model that assigns probabilities to a sequence of digits (in speech recognition).
I Each digit occurs with the same probability p = 0.1.
I Perplexity for a sequence of N digits D = d1 d2 ··· dN is

      PP(D) = p(d1 d2 ··· dN)^(−1/N)
            = (1 / p(d1 d2 ··· dN))^(1/N)
            = (1 / ∏_{i=1}^{N} p(di))^(1/N)
            = (1 / (1/10)^N)^(1/N)
            = 10

I How can we interpret this number? 31/46 Evaluation – Language Model Perplexity

I Intuitively, language models should assign high probability to "real language" they have not seen before.
I Let x1:m be a sequence of m sentences that we have not seen before (held-out or test set).
I Probability of x1:m = ∏_{i=1}^{m} p(xi)  ⇒  log probability of x1:m = ∑_{i=1}^{m} log2 p(xi)
I Average log probability per word of x1:m is

      l = (1/M) ∑_{i=1}^{m} log2 p(xi),    where M = ∑_{i=1}^{m} |xi|

I Perplexity relative to x1:m is defined as 2^(−l).
I Intuitively, perplexity is the average "confusion" after each word. Lower is better!

32/46 Understanding Perplexity

I 2^{−(1/M) ∑_{i=1}^{m} log2 p(xi)} is really a branching factor.

I Assign probability of 1 to the test data ⇒ perplexity = 1. No confusion.
I Assign probability of 1/V to each word ⇒ perplexity = V. Equal confusion after each word!
I Assign probability of 0 to anything ⇒ perplexity = ∞.
I We really should have p(x) > 0 for any x ∈ V+.
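A minimal sketch of the perplexity computation defined above; it assumes a logprob2 function (a placeholder name) that returns log2 p(x) for a sentence x under some model:

```python
import math

def perplexity(sentences, logprob2):
    """Perplexity of a model over held-out sentences.

    sentences: list of token lists; logprob2(x) must return log2 p(x).
    """
    M = sum(len(x) for x in sentences)               # total number of words
    l = sum(logprob2(x) for x in sentences) / M      # average log2 probability per word
    return 2 ** (-l)

# Sanity check: a "model" that gives each of 10 digits probability 0.1
digits = [["1", "2", "3"], ["4", "5"]]
uniform_digit_logprob2 = lambda x: len(x) * math.log2(0.1)
print(perplexity(digits, uniform_digit_logprob2))    # 10.0 (up to rounding)
```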

33/46 Entropy and Cross-entropy

I Suppose that there are eight horses running in an upcoming race.

I Your friend is on the moon.

I It’s really expensive to send a bit to the moon!

I You want to send him the results.

I Clinton 000, Edwards 001, Kucinich 010, Obama 011, Huckabee 100, McCain 101, Paul 110, Romney 111

I Expected number of bits to convey a message is 3 bits.

34/46 Entropy and Cross-entropy

I Suppose the probabilities over the outcome of the race are not at all even.

I Clinton 1/4, Edwards 1/16, Kucinich 1/64, Obama 1/2, Huckabee 1/64, McCain 1/8, Paul 1/64, Romney 1/64

I You can encode the winner using the following coding scheme:
  Clinton 10, Edwards 1110, Kucinich 11110, Obama 0, Huckabee 111101, McCain 110, Paul 111110, Romney 111111

I How did we get these codes?

35/46 Another View

36/46 Bits vs Probabilities

37/46 Entropy

I Entropy of a distribution:

      H(p) = − ∑_{x ∈ X} p(x) log p(x)

I Always ≥ 0, and maximal when p is uniform:

      H(p_uniform) = − ∑_{x ∈ X} (1/|X|) log (1/|X|) = log |X|

38/46 Cross-entropy

I Cross-entropy uses one distribution to tell us something about another distribution:

      H(p; q) = − ∑_{x ∈ X} p(x) log q(x)

I The difference H(p; q) − H(p) tells us how many extra bits (on average) we waste by using q instead of p.

I Extra bits make us sad; we can therefore think of this as a measure of regret.

I We want to choose q so that H(p; q) is small.

I Cross-entropy is an estimate of the average code-length (bits per message) when using q as a proxy for p.
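A small sketch of entropy and cross-entropy as defined above, with toy placeholder distributions:

```python
import math

def entropy(p):
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def cross_entropy(p, q):
    # Average code length when events drawn from p are coded with a code built for q.
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 1/3, "b": 1/3, "c": 1/3}
print(entropy(p))            # 1.5 bits
print(cross_entropy(p, q))   # ~1.585 bits; the gap is the "regret" for using q
```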

39/46 Cross-entropy and Betting

I Before the horse race, place your bet.

I Regret is how sad you feel after you find out who won.

I What’s your average score after you place your bets and test many times?

I Upper bound on regret: uniform betting

I Lower bound on regret: proportional betting on the true distribution for today’s race.

I The better our estimated distribution is, the closer we get to the lower bound (lower regret)!

40/46 How does this Relate to Language Models?

I p_train: training sample (which horses we have seen before)
I p_test: test sample (which horse will win today)
I q: our (estimated) model (or code)
I Real goal when training: make H(p_test; q) small.
I We don't know p_test! The closest we have is p_train.
I So make H(p_train; q) small.
I But that overfits and can lead to infinite regret.
I Smoothing hopefully makes q more like p_test.

41/46 What do n-gram Models Know?

I They (sort of) learn:

I Rare vs. common words and patterns

I Local syntax (an elephant, a rhinoceros)

I Words with related meanings (ate apples)

I Punctuation and spelling

I They have no idea about:

I Sentence structure

I Underlying rules of agreement/spelling/etc.

I Meaning

I The World

42/46 Unigram Model Generation

first, from less the This different 2004), out which goal 19.2 Model their It ˜(i?1), given 0.62 these (x0; match 1 schedule. x 60 1998. under by Notice we of stated CFG 120 be 100 a location accuracy If models note 21.8 each 0 WP that the that Novak. to function; to [0, to different values, model 65 cases. said -24.94 sentences not that 2 In to clustering each K&M 100 Boldface X))] applied; In 104 S. grammar was (Section contrastive thesis, the machines table -5.66 trials: An the textual (family applications.We have for models 40.1 no 156 expected are neighborhood

43/46 Bigram Model Generation

e. (A.33) (A.34) A.5 ModelS are also been completely surpassed in performance on drafts of online algorithms can achieve far more so while substantially improved using CE. 4.4.1 MLEasaCaseofCE 71 26.34 23.1 57.8 K&M 42.4 62.7 40.9 44 43 90.7 100.0 100.0 100.0 15.1 30.9 18.0 21.2 60.1 undirected evaluations directed DEL1 TRANS1 neighborhood. This continues, with supervised init., semisupervised MLE with the METU-Sabanci 195 ADJA ADJD ADV APPR APPRART APPO APZR ART CARD FM ITJ KOUI KOUS KON KOKOM NN NN NN IN JJ NN Their problem is y x. The evaluation offers the hypothesized link grammar with a Gaussian

44/46 Trigram Model Generation

top(xI ,right,B). (A.39) vine0(X, I) rconstit0(I 1, I). (A.40) vine(n). (A.41) These equations were presented in both cases; these scores uinto a probability distribution is even smaller(r =0.05). This is exactly fEM. During DA, is gradually relaxed. This approach could be efficiently used in previous chapters) before training (test) K&MZeroLocalrandom models Figure4.12: Directed accuracy on all six languages. Importantly, these papers achieved state-of-the-art results on their tasks and unlabeled data and the verbs are allowed (for instance) to select the cardinality of discrete structures, like matchings on weighted graphs (McDonald et al., 1993) (35 tag types, 3.39 bits). The Bulgarian,

45/46 The Trade-off

I As we increase n, the stuff the model generates looks better and better, and the model gives better probabilities to the training data.

I But as n gets big, we tend toward the history model, which has a lot of zero counts and therefore isn’t helpful for data we haven’t seen before.

I Generalizing vs. Memorizing

46/46 11-411 Natural Language Processing Classification

Kemal Oflazer

Carnegie Mellon University in Qatar

1/36 Text Classification

I We have a set of documents (news items, emails, product reviews, movie reviews, books, . . . )

I Classify this set of documents into a small set of classes.
I Applications:

I Topic of a news article (classic example: finance, politics, sports, . . . )

I Sentiment of a movie or product review (good, bad, neutral)

I Email into spam or not or into a category (business, personal, bills, . . . )

I Reading level (K-12) of an article or essay

I Author of a document (Shakespeare, James Joyce, . . . )

I Genre of a document (report, editorial, advertisement, blog, . . . )

I Language identification

2/36 Notation and Setting

I We have a set of n documents (texts) xi ∈ V+.
I We assume the texts are segmented already.
I We have a set L of labels ℓi.
I Human experts annotate documents with labels and give us {(x1, ℓ1), (x2, ℓ2), ···, (xn, ℓn)}.
I We learn a classifier classify : V+ → L with this labeled training data.

I Afterwards, we use classify to classify new documents into their classes.

3/36 Evaluation

I Accuracy:

      A(classify) = ∑_{x ∈ V+, ℓ ∈ L : classify(x) = ℓ} p(x, ℓ)

  where p is the true distribution over data. Error is 1 − A.
I This is estimated using a test set {(x1, ℓ1), (x2, ℓ2), ···, (xm, ℓm)}:

      Â(classify) = (1/m) ∑_{i=1}^{m} 1{classify(xi) = ℓi}

4/36 Issues with Using Test Set Accuracy

I Class imbalance: if p(L = not spam) = 0.99, then you can get Aˆ ≈ 0.99 by always guessing “not spam”

I Relative importance of classes or cost of error types.

I Variance due to the test data.

5/36 Evaluation in the Two-class case

I Suppose we have one of the classes t ∈ L as the target class.
I We would like to identify documents with label t in the test data.
I Like information retrieval.
I We get (with A, B, C as in the contingency table below):
  I Precision P̂ = C/B (percentage of the documents that classify labeled as t that are correctly labeled)
  I Recall R̂ = C/A (percentage of the actual t-labeled documents that are correctly labeled as t)
  I F1 = 2 · (P̂ · R̂) / (P̂ + R̂)

6/36 A Different View – Contingency Tables

                        L = t                      L ≠ t
  classify(X) = t       C (true positives)         B \ C (false positives)     (row total: B)
  classify(X) ≠ t       A \ C (false negatives)    (true negatives)
                        (column total: A)
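Using the quantities from the contingency table (C = true positives, B = everything the classifier labeled t, A = everything truly labeled t), a minimal sketch; the label lists are toy placeholders:

```python
def prf1(gold, predicted, target):
    """Precision, recall and F1 for one target class, from parallel label lists."""
    C = sum(1 for g, p in zip(gold, predicted) if g == target and p == target)
    B = sum(1 for p in predicted if p == target)     # labeled t by the classifier
    A = sum(1 for g in gold if g == target)          # actually labeled t
    P = C / B if B else 0.0
    R = C / A if A else 0.0
    F1 = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F1

gold = ["spam", "spam", "ham", "ham", "spam"]
pred = ["spam", "ham", "ham", "spam", "spam"]
print(prf1(gold, pred, "spam"))   # (0.666..., 0.666..., 0.666...)
```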

7/36 Evaluation with > 2 Classes

I Macroaveraged precision and recall: let each class be the target and report the average Pˆ and Rˆ across all classes.

I Microaveraged precision and recall: pool all one-vs.-rest decisions into a single contingency table, calculate Pˆ and Rˆ from that.

8/36 Cross-validation

I Remember that Â, P̂, R̂, and F̂1 are all estimates of the classifier's quality under the true data distribution.

I Estimates are noisy!

I K-fold cross validation:
  I Partition the training data into K nonoverlapping "folds", x^1, x^2, ..., x^K.
  I For i ∈ {1, ..., K}:
    I Train on x1:n \ x^i, using x^i as development data.
    I Estimate quality on the x^i development set as Â^i.
  I Report average accuracy as Â = (1/K) ∑_{i=1}^{K} Â^i, and perhaps also the standard deviation.

9/36 Features in Text Classification

I Running example x = “The spirit is willing but the flesh is weak”

I Feature random variables:
  I For j ∈ {1, ..., d}, Fj is a discrete random variable taking values in F_j.

I Most of the time these can be frequencies of words or n-grams in a text.
  I f_f-spirit = 1, f_f-is = 2, f_f-the-flesh = 1, . . .

I They can be boolean "exists" features.
  I f_e-spirit = 1, f_e-is = 1, f_f-strong = 0, . . .

10/36 Spam Detection

I A training set of email messages (marked Spam or Not-Spam) I A set of features for each message (considered as a bag of words)

I For each word: Number of occurrences

I Whether phrases such as “Nigerian Prince”, “email quota full”, “won ONE HUNDRED MILLION DOLLARS” are in the message

I Whether it is from someone you know

I Whether it is reply to your message

I Whether it is from your domain (e.g., cmu.edu)

11/36 Movie Ratings

I A training set of movie reviews (with star ratings 1 - 5)

I A set of features for each message (considered as a bag of words)

I For each word: Number of occurrences

I Whether phrases such as Excellent, sucks, blockbuster, biggest, Star Wars, Disney, Adam Sandler, . . . are in the review

12/36 Probabilistic Classification

I Documents are preprocessed: each document x is mapped to a d-dimensional feature vector f.

I Classification rule:

      classify(f) = arg max_{ℓ∈L} p(ℓ | f)
                  = arg max_{ℓ∈L} p(ℓ, f) / p(f)
                  = arg max_{ℓ∈L} p(ℓ, f)    (Why?)

13/36 Naive Bayes Classifier

      p(L = ℓ, F1 = f1, ..., Fd = fd) = p(ℓ) ∏_{j=1}^{d} p(Fj = fj | ℓ)
                                      = π_ℓ ∏_{j=1}^{d} θ_{fj | j,ℓ}

I Parameters: π_ℓ is the class or label prior.
  I The probability that a document belongs to class ℓ – without considering any of its features.
  I They can be computed directly from the training data {(x1, ℓ1), (x2, ℓ2), ···, (xn, ℓn)}. These sum to 1.
I For each feature function j and label ℓ, a distribution over values θ_{∗|j,ℓ}.
  I These sum to 1 for every (j, ℓ) pair.

14/36 Generative vs Discriminative Classifier

I Naive Bayes is known as a Generative classifier.

I Generative Classifiers build a model of each class.

I Given an observation (document), they return the class most likely to have generated that observation.

I A discriminative classifier instead learns what features from the input are useful to discriminate between possible classes.

15/36 The Most Basic Naive Bayes Classifier

16/36 The Most Basic Naive Bayes Classifier

I Features are just words xj in x

I Naive Assumption: Word positions do not matter – "bag of words".
I Conditional Independence: Feature probabilities p(xj | ℓ) are independent given the class ℓ:

      p(x | ℓ) = ∏_{j=1}^{|x|} p(xj | ℓ)

I The probability that a word in a sports document is "soccer" is estimated as p(soccer | sports) by counting "soccer" in all sports documents.

I So

      classify(x) = arg max_{ℓ∈L} π_ℓ ∏_{j=1}^{|x|} p(xj | ℓ)

I Smoothing is very important as any new document may have unseen words.

17/36 The Most Basic Naive Bayes Classifier

      classify(x) = arg max_{ℓ∈L} π_ℓ ∏_{j=1}^{|x|} p(xj | ℓ)

      classify(x) = arg max_{ℓ∈L} [ log π_ℓ + ∑_{j=1}^{|x|} log p(xj | ℓ) ]

I All computations are done in log space to avoid underflow and increase speed.

I Class prediction is based on a linear combination of the inputs.

I Hence Naive Bayes is considered a linear classifier.

18/36 An Example

      p("predictable" | −) = (1 + 1)/(14 + 20)        p("predictable" | +) = (0 + 1)/(9 + 20)
      p("no" | −)          = (1 + 1)/(14 + 20)        p("no" | +)          = (0 + 1)/(9 + 20)
      p("fun" | −)         = (0 + 1)/(14 + 20)        p("fun" | +)         = (1 + 1)/(9 + 20)

I |V| = 20, add-1 Laplace smoothing
I π− = p(−) = 3/5,  π+ = p(+) = 2/5,  N− = 14,  N+ = 9

      p(+) p(s | +) = (2/5) × (1 × 1 × 2)/29³ = 3.2×10⁻⁵
      p(−) p(s | −) = (3/5) × (2 × 2 × 1)/34³ = 6.1×10⁻⁵
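A minimal sketch that reproduces the two scores in the example above from the counts shown there (class priors 3/5 and 2/5, class sizes N− = 14 and N+ = 9, |V| = 20, add-1 smoothing); the function name and the per-class count dictionaries are my own packaging of those numbers:

```python
def nb_score(prior, word_counts, class_size, vocab_size, test_words):
    """Unnormalized Naive Bayes score: p(class) * prod_j p(w_j | class), add-1 smoothed."""
    score = prior
    for w in test_words:
        score *= (word_counts.get(w, 0) + 1) / (class_size + vocab_size)
    return score

test_words = ["predictable", "no", "fun"]
neg = nb_score(3/5, {"predictable": 1, "no": 1, "fun": 0}, 14, 20, test_words)
pos = nb_score(2/5, {"predictable": 0, "no": 0, "fun": 1}, 9, 20, test_words)
print(f"neg: {neg:.2e}   pos: {pos:.2e}")
# neg: 6.11e-05   pos: 3.28e-05  (the 6.1e-5 and 3.2e-5 above) -> predict the negative class
```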

19/36 Other Optimizations for Sentiment Analysis

I Ignore unknown words in the test.

I Ignore stop words like the, a, with, etc.

I Remove most frequent 10-100 words from the training and test documents.

I Count vs existence of words: Binarized features.

I Negation Handling didnt like this movie , but I → didnt NOT like NOT this NOT movie , but I

20/36 Formulation of a Discriminative Classifier

I A discriminative model computes p(ℓ | x) to discriminate among different values of ℓ, using combinations of d features of x:

      ℓ̂ = arg max_{ℓ∈L} p(ℓ | x)

I There is no obvious way to map features to probabilities.
I Assuming features are binary-valued and they are both functions of x and class ℓ, we can write

      p(ℓ | x) = (1/Z) exp( ∑_{i=1}^{d} wi fi(ℓ, x) )

  where Z is the normalization factor to make everything a probability and the wi are weights for the features.
I p(ℓ | x) can then be formally defined with normalization as

      p(ℓ | x) = exp( ∑_{i=1}^{d} wi fi(ℓ, x) ) / ∑_{ℓ'∈L} exp( ∑_{i=1}^{d} wi fi(ℓ', x) )

21/36 Some Features

I Remember features are binary-valued and are both functions of x and class `.

I Suppose we are doing sentiment classification. Here are some sample feature functions:

      f1(ℓ, x) = 1 if "great" ∈ x & ℓ = +,        0 otherwise

      f2(ℓ, x) = 1 if "second-rate" ∈ x & ℓ = −,  0 otherwise

      f3(ℓ, x) = 1 if "no" ∈ x & ℓ = +,           0 otherwise

      f4(ℓ, x) = 1 if "enjoy" ∈ x & ℓ = −,        0 otherwise

22/36 Mapping to a Linear Formulation

I If the goal is just classification, the denominator can be ignored:

      ℓ̂ = arg max_{ℓ∈L} p(ℓ | x)
         = arg max_{ℓ∈L} exp( ∑_{i=1}^{d} wi fi(ℓ, x) ) / ∑_{ℓ'∈L} exp( ∑_{i=1}^{d} wi fi(ℓ', x) )
         = arg max_{ℓ∈L} exp( ∑_{i=1}^{d} wi fi(ℓ, x) )
         = arg max_{ℓ∈L} ∑_{i=1}^{d} wi fi(ℓ, x)

I Thus we have a linear combination of features for decision making.

23/36 Two-class Classification with Linear Models

I Big idea: “map” a document x into a d-dimensional (feature) vector Φ(x), and learn a hyperplane defined by vector w = [w1, w2,..., wd]. I Linear decision rule:

I Decide on class 1 if w · Φ(x) > 0

I Decide on class 2 if w · Φ(x) ≤ 0

I Parameters are w ∈ R^d. They determine the separation hyperplane.

24/36 Two-class Classification with Linear Models

I There may be more than one separation hyperplane.

25/36 Two-class Classification with Linear Models

I There may not be a separation hyperplane. The data is not linearly separable!

26/36 Two-class Classification with Linear Models

I Some features may not be actually relevant.

27/36 The Perceptron Learning Algorithm for Two Classes

I A very simple algorithm guaranteed to eventually find a linear separator hyperplane (determine w), if one exists.

I If one doesn’t, the perceptron will oscillate!

I Assume our classifier is

      classify(x) = 1 if w · Φ(x) > 0,  0 if w · Φ(x) ≤ 0

I Start with w = 0
I for t = 1, ..., T
  I i = t mod N
  I w ← w + α (ℓi − classify(xi)) Φ(xi)
I Return w

I α is the learning rate – determined by experimentation.
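A minimal sketch of this two-class perceptron learner; numpy is used for the dot products, and the toy data (with a constant feature acting as a bias) is a placeholder:

```python
import numpy as np

def train_perceptron(X, y, alpha=1.0, T=1000):
    """X: (N, d) feature matrix, y: labels in {0, 1}. Returns the weight vector w."""
    N, d = X.shape
    w = np.zeros(d)
    for t in range(T):
        i = t % N
        pred = 1 if w @ X[i] > 0 else 0
        w = w + alpha * (y[i] - pred) * X[i]      # no change when the prediction is right
    return w

# Toy linearly separable data (last feature is a constant bias term)
X = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, 0, 0])
w = train_perceptron(X, y)
print([(1 if w @ x > 0 else 0) for x in X])       # [1, 1, 0, 0]
```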

28/36 Linear Models for Classification

I Big idea: "map" a document x into a d-dimensional (feature) vector Φ(x, ℓ), and learn a hyperplane defined by vector w = [w1, w2, ..., wd].
I Linear decision rule:

      classify(x) = ℓ̂ = arg max_{ℓ∈L} w · Φ(x, ℓ),    where Φ : V+ × L → R^d

I Parameters are w ∈ R^d.

29/36 A Geometric View of Linear Classifiers

I Suppose we have an instance of w and L = {y1, y2, y3, y4}. I We have two simple binary features φ1, and φ2 I Φ(x, `) are as follows:

30/36 A Geometric View of Linear Classifiers

I Suppose we have an instance of w and L = {y1, y2, y3, y4}.
I We have two simple binary features φ1 and φ2.
I Suppose w is such that w · Φ = w1 φ1 + w2 φ2

31/36 A Geometric View of Linear Classifiers

I Suppose we have an instance of w and L = {y1, y2, y3, y4}.
I We have two simple binary features φ1 and φ2.
I Suppose w is such that w · Φ = w1 φ1 + w2 φ2

      distance(w · Φ, Φ') = |w · Φ'| / ‖w‖2 ∝ |w · Φ'|

I So w · Φ(x, y1) > w · Φ(x, y3) > w · Φ(x, y4) > w · Φ(x, y2) ≥ 0

32/36 A Geometric View of Linear Classifiers

I Suppose we have an instance of w and L = {y1, y2, y3, y4}.
I We have two simple binary features φ1 and φ2.
I Suppose w is such that w · Φ = w1 φ1 + w2 φ2

I So w · Φ(x, y3) > w · Φ(x, y1) > w · Φ(x, y2) > w · Φ(x, y4)

33/36 Where do we get w? The Perceptron Learner

I Start with w = 0

I Go over the training samples and adjust w to minimize the deviation from the correct labels:

      min_w ∑_{i=1}^{n} [ max_{ℓ'∈L} w · Φ(xi, ℓ') − w · Φ(xi, ℓi) ]

I The perceptron learning algorithm is a stochastic subgradient descent algorithm on the above.

I For t ∈ {1, ..., T}
  I Pick it uniformly at random from {1, ..., n}
  I ℓ̂it ← arg max_{ℓ∈L} w · Φ(xit, ℓ)
  I w ← w − α ( Φ(xit, ℓ̂it) − Φ(xit, ℓit) )

I Return w

34/36 Gradient Descent

35/36 More Sophisticated Classification

I Take into account error costs if all mistakes are not equally bad. (false positives vs. false negatives in spam detection)

I Use maximum margin techniques (e.g., Support Vector Machines) that try to find the best separating hyperplane, one that is far from the training examples.

I Use kernel methods to map vectors into much higher-dimensional spaces, almost for free, where they may be linearly separable.

I Use feature selection to find the most important features and throw out the rest.

I Take the machine learning class if you are interested in these.

36/36 11-411 Natural Language Processing Part-of-Speech Tagging

Kemal Oflazer

Carnegie Mellon University in Qatar

1/41 Motivation

I My cat, which lives dangerously, no longer has nine lives.

I The first lives is a present tense verb.

I The second lives is a plural noun.

I They are pronounced differently.

I How we pronounce the word depends on us knowing which is which.

I The two lives above are pronounced differently.

I “The minute issue took one minute to resolve.”

I They can be stressed differently. SUSpect (noun) vs. susPECT (verb)

I He can can the can.

I The first can is a modal.

I The second can is a(n untensed) verb.

I The third can is a singular noun.

I In fact, can has one more possible interpretation as a present tense verb as in "We can tomatoes every summer."

2/41 What are Part-of-Speech Tags?

I A limited number of tags to denote words “classes”. I Words in the same class

I Occur more or less in the same contexts

I Have more or less the same functions

I Morphologically, they (usually) take the same suffixes or prefixes.

I Part-of-Speech tags are not about meaning!

I Part-of-Speech tags are not necessarily about any grammatical function.

3/41 English Nouns

I Can be subjects and objects

I This book is about geography.

I I read a good book.

I Can be plural or singular (books, book)

I Can have determiners (the book)

I Can be modified by adjectives (blue book)

I Can have possessors (my book, John’s book)

4/41 Why have Part-of-Speech Tags?

I It is an “abstraction” mechanism.

I There are too many words.

I You would need a lot of data to train models.

I Your model would be very specific.

I POS Tags allow for generalization and allow for useful reduction in model sizes.

I There are many different tagsets: You want the right one for your task

5/41 How do we know the class?

I Substitution test

I The ADJ cat sat on the mat.

I The blue NOUN sits on the NOUN.

I The blue cat VERB on the mat.

I The blue cat sat PP the mat.

6/41 What are the Classes?

I Nouns, Verbs, Adjectives, . . .

I Lots of different values (open class)

I Determiners

I The, a, this, that, some, . . .

I Prepositions

I By, at, from, as, against, below, . . .

I Conjunctions

I And, or, neither, but, . . .

I Modals

I Will, may, could, can, . . .

I Some classes are well defined and closed, some are open.

7/41 Broad Classes

I Open Classes: nouns, verbs, adjectives, adverbs, numbers

I Closed Classes: prepositions, determiners, pronouns, conjunctions, auxiliary verbs, particles, punctuation

8/41 Finer-grained Classes

I Nouns: Singular, Plural, Proper, Count, Mass

I Verbs: Untensed, Present 3rd Person, Present Non-3rd Person, Past , Past Participle

I Adjectives: Normal, Comparative, Superlative

I Adverbs: Comparative, Superlative, Directional, Temporal, Manner,

I Numbers: Cardinal, Ordinal

9/41 Hard Cases

I I will call up my friend.

I I will call my friend up.

I I will call my friend up in the treehouse.

I Gerunds

I I like walking.

I I like apples.

I His walking daily kept him fit.

I His apples kept him fit.

I Eating apples kept him fit.

10/41 Other Classes

I Interjections (Wow!, Oops, Hey)

I Politeness markers (Your Highness . . . )

I Greetings (Dear . . . )

I Existential there (there is . . . )

I Symbols, Money, Emoticons, URLs, Hashtags

11/41 Penn Treebank Tagset for English

12/41 Others Tagsets for English and for Other Languages

I The International Corpus of English (ICE) Tagset: 205 Tags

I London-Lund Corpus (LLC) Tagset: 211 Tags

I Arabic: Several tens of (composite tags)

I (Buckwalter: wsyktbwnhA “And they will write it”) is tagged as CONJ + FUTURE PARTICLE + IMPERFECT VERB PREFIX + IMPERFECT VERB + IMPERFECT VERB SUFFIX MASCULINE PLURAL 3RD PERSON + OBJECT PRONOUN FEMININE SINGULAR

I Czech: Several hundred (composite) tags

I Vaclav is tagged as k1gMnSc1, indicating it is a noun, gender is male animate, number is singular, and case is nominative

I Turkish: Potentially infinite set of (composite) tags.

I elmasında is tagged as elma+Noun+A3sg+P3sg+Loc indicating root is elma and the word is singular noun belonging to a third singular person in locative case.

13/41 Some Tagged Text from The Penn Treebank Corpus

In/IN an/DT Oct./NNP 19/CD review/NN of/IN ‘‘/‘‘ The/DT Misanthrope/NN ’’/’’ at/IN Chicago/NNP ’s/POS Goodman/NNP Theatre/NNP ‘‘/‘‘ Revitalized/VBN Classics/NNS Take/VBP the/DT Stage/NN in/IN Windy/NNP City/NNP ,/, ’’/’’ Leisure/NN &/CC Arts/NNS ,/, the/DT role/NN of/IN Celimene/NNP ,/, played/VBN by/IN Kim/NNP Cattrall/NNP , , was/VBD mistakenly/RB attributed/VBN to/TO Christina/NNP Haag/NNP ./. Ms./NNP Haag/NNP plays/VBZ Elianti/NNP ./. Rolls-Royce/NNP Motor/NNP Cars/NNPS Inc./NNP said/VBD it/PRP expects/VBZ its/PRP$ U.S./NNP sales/NNS to/TO remain/VB steady/JJ at/IN about/IN 1,200/CD cars/NNS in/IN 1990/CD ./. The/DT luxury/NN auto/NN maker/NN last/JJ year/NN sold/VBD 1,214/CD cars/NNS in/IN the/DT U.S./NNP

14/41 How Bad is Ambiguity?

Number of possible tags per token:
  7: down
  6: that, set, put, open, hurt, cut, bet, back
  5: the, spread, split, say, 's, run, repurchase, read, present, out, many, less, left, vs.

Counts of the tags of "down" in the corpus:
  317 RB/down, 200 RP/down, 138 IN/down, 10 JJ/down, 1 VBP/down, 1 RBR/down, 1 NN/down

15/41 Some Tags for “down”

One/CD hundred/CD and/CC ninety/CD two/CD former/JJ greats/NNS ,/, near/JJ greats/NNS ,/, hardly/RB knowns/NNS and/CC unknowns/NNS begin/VBP a/DT 72-game/JJ ,/, three-month/JJ season/NN in/IN spring-training/NN stadiums/NNS up/RB and/CC down/RB Florida/NNP ... He/PRP will/MD keep/VB the/DT ball/NN down/RP ,/, move/VB it/PRP around/RB ... As/IN the/DT judge/NN marched/VBD down/IN the/DT center/JJ aisle/NN in/IN his/PRP$ flowing/VBG black/JJ robe/NN ,/, he/PRP was/VBD heralded/VBN by/IN a/DT trumpet/NN fanfare/NN ... Other/JJ Senators/NNP want/VBP to/TO lower/VB the/DT down/JJ payments/NNS required/VBN on/IN FHA-insured/JJ loans/NNS ... Texas/NNP Instruments/NNP ,/, which/WDT had/VBD reported/VBN Friday/NNP that/IN third-quarter/JJ earnings/NNS fell/VBD more/RBR than/IN 30/CD %/NN from/IN the/DT year-ago/JJ level/NN ,/, went/VBD down/RBR 2/CD 1/8/CD to/TO 33/CD on/IN 1.1/CD million/CD shares/NNS .... Because/IN hurricanes/NNS can/MD change/VB course/NN rapidly/RB ,/, the/DT company/NN sends/VBZ employees/NNS home/NN and/CC shuts/NNS down/VBP operations/NNS in/IN stages/NNS : /: the/DT closer/RBR a/DT storm/NN gets/VBZ ,/, the/DT more/RBR complete/JJ the/DT shutdown/NN ... Jaguar/NNP ’s/POS American/JJ depositary/NN receipts/NNS were/VBD up/IN 3/8/CS yesterday/NN in/IN a/DT down/NN market/NN ,/, closing/VBG at/IN 10/CD ...

16/41 Some Tags for “Japanese”

Meanwhile/RB ,/, Japanese/JJ bankers/NNS said/VBD they/PRP were/VBD still/RB hesitant/JJ about/IN accepting/VBG Citicorp/NNP ’s/POS latest/JJS proposal/NN ... And/CC the/DT Japanese/NNPS are/VBP likely/JJ to/TO keep/VB close/RB on/IN Conner/NNP ’s/POS heels/NNS ... The/DT issue/NN is/VBZ further/RB complicated/VBN because/IN although/IN the/DT organizations/NNS represent/VBP Korean/JJ residents/NNS ,/, those/DT residents/NNS were/VBD largely/RB born/VBN and/CC raised/VBN in/IN Japan/NNP and/CC many/JJ speak/VBP only/RB Japanese/NNP ... And/CC the/DT Japanese/NNP make/VBP far/RB more/JJR suggestions/NNS :/: 2,472/CS per/IN 100/CD eligible/JJ employees/NNS vs./CC only/RB 13/CD per/IN 100/CD employees/NNS in/IN the/DT ... The/DT Japanese/NNS are/VBP in/IN the/DT early/JJ stage/NN right/RB now/RB ,/, said/VBD Thomas/NNP Kenney/NNP ,/, a/DT onetime/JJ media/NN adviser/NN for/IN First/NNP Boston/NNP Corp./NNP who/WP was/VBD recently/RB appointed/VBN president/NN of/IN Reader/NNP ’s/POS Digest/NNP Association/NNP ’s/POS new/JJ Magazine/NNP Publishing/NNP Group/NNP ... In/IN 1991/CD ,/, the/DT Soviets/NNS will/MD take/VB a/DT Japanese/JJ into/NN space/NN ,/, the/DT first/JJ Japanese/NN to/TO go/VB into/IN orbit/NN ...

17/41 How we do POS Tagging?

I Pick the most frequent tag for each type

I Gives about 92.34% accuracy (on a standard test set)

I Look at the context

I Preceding (and succeeding) words

I Preceding (and succeeding) tags

I the ...

I to ...

I John’s blue . . .

18/41 Markov Models for POS Tagging

I We use an already annotated training data to statistically model POS tagging.

I Again the problem can be cast as a noisy channel problem:

I "I have a sequence of tags of a proper sentence in my mind, t = ⟨t1, t2, ..., tn⟩."
I "By the time the tags are communicated, they are turned into actual words, w = ⟨w1, w2, ..., wn⟩, which are observed."
I "What is the most likely tag sequence t̂ that gives rise to the observation w?"

I The basic equation for tagging is then

      t̂ = arg max_t p(t | w)

  where t̂ is the tag sequence that maximizes the argument of the arg max.

19/41 Basic Equation and Assumptions for POS Tagging

      t̂ = arg max_t p(t | w)
         = arg max_t p(w | t) p(t) / p(w)     (Bayes expansion)
         = arg max_t p(w | t) p(t)            (ignoring the denominator)

  where p(w | t) is the Channel Model and p(t) is the Source Model.

I The independence assumption: the probability of a word appearing depends only on its own tag and is independent of neighboring words and tags:

      p(w | t) = p(w1:n | t1:n) ≈ ∏_{i=1}^{n} p(wi | ti)

I The bigram assumption: the probability of a tag depends only on the previous tag:

      p(t) = p(t1:n) ≈ ∏_{i=1}^{n} p(ti | ti−1)

20/41 Basic Approximation Model for Tagging

      t̂1:n = arg max_{t1:n} p(t1:n | w1:n) ≈ arg max_{t1:n} ∏_{i=1}^{n} p(wi | ti) p(ti | ti−1)
                                                              (emission)   (transition)

21/41 Bird’s Eye View of p(ti | ti−1)

22/41 Bird’s Eye View of p(wi | ti)

23/41 Estimating Probabilities

I We can estimate these probabilities from a tagged training corpus using maximum likelihood estimation.
I Transition probabilities: p(ti | ti−1) = c(ti−1, ti) / c(ti−1)
I Emission probabilities: p(wi | ti) = c(ti, wi) / c(ti)

I It is also possible to use a trigram approximation (with appropriate smoothing):

      p(t1:n) ≈ ∏_{i=1}^{n} p(ti | ti−2 ti−1)

I You need to square the number of states!

24/41 The Setting

I We have n words in w = ⟨w1, w2, ..., wn⟩.
I We have a total of N tags, which are the labels of the Markov Model states (excluding the start (0) and end (F) states).
I qi is the label of the state after i words have been observed.
I We will also denote all the parameters of our HMM by λ = (A, B), the transition (A) and emission (B) probabilities.

I In the next several slides

I i will range over word positions.

I j will range over states/tags

I k will range over states/tags.

25/41 The Forward Algorithm

I An efficient dynamic programming algorithm for finding the total probability of observing w = hw1, w2,..., wni, given the (Hidden) Markov Model λ. I Creates expanded directed acyclic graph that is a specialized version of the model graph to the specific sentence called a trellis.

26/41 The Forward Algorithm

I Computes αi(j) = p(w1, w2, ..., wi, qi = j | λ)
  I The total probability of observing w1, w2, ..., wi and landing in state j after emitting i words.

I Let's define some shortcuts:
  I αi−1(k): the forward probability from the previous stage (word)
  I akj = p(tj | tk)
  I bj(wi) = p(wi | tj)

      αi(j) = ∑_{k=1}^{N} αi−1(k) · akj · bj(wi)

I αn(F) = p(w1, w2, ..., wn, qn = F | λ) is the total probability of observing w1, w2, ..., wn.
I We really do not need the αs. We just wanted to motivate the trellis.

I We are actually interested in the most likely sequence of states (tags) that we go through while "emitting" w1, w2, ..., wi. These would be the most likely tags!

I We are actually interested in the most likely sequence of states (tags) that we go through while “emitting” w1, w2,..., wi These would be the most likely tags!.

27/41 Viterbi Decoding

I Computes

      vi(j) = max_{q0, q1, ..., qi−1} p(q0, q1, ..., qi−1, w1, w2, ..., wi, qi = j | λ)

I vi(j) is the maximum probability of observing w1, w2, ..., wi after emitting i words while going through some sequence of states (tags) q0, q1, ..., qi−1 before landing in state qi = j.
I We can recursively define

      vi(j) = max_{k=1...N} vi−1(k) · akj · bj(wi)

I Let's also define a backtrace pointer as

      bti(j) = arg max_{k=1...N} vi−1(k) · akj · bj(wi)

I These backtrace pointers will give us the tag sequence q0 = START, q1, q2, ..., qn, which is the most likely tag sequence for ⟨w1, w2, ..., wn⟩.

28/41

I Initialization:

      v1(j) = a0j · bj(w1)         1 ≤ j ≤ N
      bt1(j) = 0

I Recursion:

      vi(j) = max_{k=1...N} vi−1(k) · akj · bj(wi)          1 ≤ j ≤ N, 1 < i ≤ n
      bti(j) = arg max_{k=1...N} vi−1(k) · akj · bj(wi)     1 ≤ j ≤ N, 1 < i ≤ n

I Termination:

      p* = vn(qF) = max_{k=1...N} vn(k) · akF            (the best score)
      qn* = btn(qF) = arg max_{k=1...N} vn(k) · akF      (the start of the backtrace)
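A minimal Python sketch of the Viterbi recursion above; the tag set and the dictionaries A (transitions, with '<s>' and '</s>' standing in for the start and end states) and B (emissions) are assumed to be given:

```python
def viterbi(words, tags, A, B):
    """Most likely tag sequence for `words`.

    A[(t_prev, t)]: transition probability, with '<s>' / '</s>' as start / end states.
    B[(t, w)]: emission probability of word w from tag t.
    """
    n = len(words)
    v = [{} for _ in range(n)]      # v[i][t]: best score of any path ending in tag t at word i
    bt = [{} for _ in range(n)]     # backpointers
    for t in tags:                  # initialization
        v[0][t] = A.get(("<s>", t), 0.0) * B.get((t, words[0]), 0.0)
        bt[0][t] = "<s>"
    for i in range(1, n):           # recursion
        for t in tags:
            best_prev = max(tags, key=lambda k: v[i - 1][k] * A.get((k, t), 0.0))
            v[i][t] = v[i - 1][best_prev] * A.get((best_prev, t), 0.0) * B.get((t, words[i]), 0.0)
            bt[i][t] = best_prev
    # termination: best transition into the end state, then follow the backpointers
    last = max(tags, key=lambda k: v[n - 1][k] * A.get((k, "</s>"), 0.0))
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(bt[i][path[-1]])
    return list(reversed(path))
```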

29/41 Viterbi Decoding

30/41 Viterbi Decoding

31/41 Viterbi Decoding

32/41 Viterbi Decoding

33/41 Viterbi Decoding

34/41 Viterbi Decoding

35/41 Viterbi Decoding

I Once you are at i = n, you have to land in the END state (F), then use the backtrace to find the previous state you came from and recursively trace backwards to find ˆt1:n. 36/41 Viterbi Decoding Example

37/41 Viterbi Decoding Example

38/41 Viterbi Decoding Example

39/41 Unknown Words

I They are unlikely to be closed class words.

I They are most likely to be nouns or proper nouns, less likely, verbs.

I Exploit capitalization – most likely proper nouns.

I Exploit any morphological hints: -ed most likely past tense verb, -s, most likely plural noun or present tense verb for 3rd person singular.

I Build separate models of the sort

      p(tj | ln−i+1 ... ln) and p(ln−i+1 ... ln)

  where ln−i+1 ... ln are the last i letters of a word.
I Then

      p(ln−i+1 ... ln | tj) = p(tj | ln−i+1 ... ln) · p(ln−i+1 ... ln) / p(tj)

I Hence this can be used in place of p(wi | ti) in the Viterbi algorithm.
I Only use low-frequency words in these models.

40/41 Closing Remarks 2 I Viterbi decoding takes O(n · N ) work. (Why?) I HMM parameters (transition probabilities A and emissions probabilities B) can actually be estimated from an unannotated corpus.

I Given an unannotated corpus and the state labels, the forward-backward or Welch Welch algorithm, a special case of the Expectation-Maximization (EM) algorithm trains both the transition probabilities A and the emission probabilities B of the HMM.

I EM is an iterative algorithm. It works by computing an initial estimate for the probabilities, then using those estimates to compute a better estimate, and so on, iteratively improving the probabilities that it learns.
I There are many other more recent and usually better performing approaches to POS tagging:

I Maximum Entropy Models (discriminative, uses features, computes t̂ = argmax_t p(t | w))
I Conditional Random Fields (discriminative, uses features, but feature functions can also depend on the previous tag)

I Perceptrons (discriminative, uses features, trained with the perceptron algorithm)
I Accuracy for English is in the 97% to 98% range.

I In every hundred words, you have about 2 errors on average, and you do not know which ones they are! 41/41 11-411 Natural Language Processing Overview of (Mostly English) Syntax

Kemal Oflazer

Carnegie Mellon University in Qatar

1/12 Syntax

I The ordering of words and how they group into phrases

I [ [the old man] [is yawning] ]

I [ [the old] [man the boats] ]

I Syntax vs. Meaning

I “Colorless green ideas sleep furiously.”

I You can tell that the words are in the right order.

I and that “colorless” and “green” modify “ideas”

I and that ideas sleep

I that the sleeping is done furiously

I that it sounds like an English sentence

I if you can’t imagine what it means

I and you know that it is better than “Sleep green furiously ideas colorless”

2/12 Syntax vs. Morphology

I Syntax is not morphology

I Morphology deals with the internal structure of words.

I Syntax deals with combinations of words – phrases and sentences.

I Syntax is mostly made up of general rules that apply across the board, with very few irregularities.

3/12 Syntax vs. Semantics

I Syntax is not semantics.

I Semantics is about meaning; syntax is about structure alone.

I A sentence can be syntactically well-formed but semantically ill-formed. (e.g., “Colorless green ideas sleep furiously.”)

I Some well-known linguistic theories attempt to “read” semantic representations off of syntactic representations in a compositional fashion.

I We’ll talk about these in a later lecture

4/12 Two Approaches to Syntactic Structure

I Constituent Structure or Phrase Structure Grammar

I Syntactic structure is represented by trees generated by a context-free grammar.

I An important construct is the constituent (complete sub-tree).

I Dependency Structure:

I The basic unit of syntactic structure is a binary relation between words called a dependency.

5/12 Constituents

I One way of viewing the structure of a sentence is as a collection of nested constituents:

I Constituent: a group of words that “go together” (or relate more closely to one another than to other words in the sentence)

I Constituents larger than a word are called phrases.

I Phrases can contain other phrases.

6/12 Constituents

I Linguists characterize constituents in a number of ways, including:

I where they occur (e.g., “NPs can occur before verbs”)
I where they can move in variations of a sentence

I On September 17th, I’d like to fly from Atlanta to Denver.
I I’d like to fly on September 17th from Atlanta to Denver.
I I’d like to fly from Atlanta to Denver on September 17th.

I what parts can move and what parts can’t

I *On September I’d like to fly 17th from Atlanta to Denver.

I what they can be conjoined with

I I’d like to fly from Atlanta to Denver on September 17th and in the morning.

7/12 Noun Phrases

I The elephant arrived.

I It arrived.

I Elephants arrived.

I The big ugly elephant arrived.

I The elephant I love to hate arrived.

8/12 Prepositional Phrases

I Every prepositional phrase contains a preposition followed by a noun phrase.

I I arrived on Tuesday.

I I arrived in March.

I I arrived under the leaking roof.

I I arrived with the elephant I love to hate.

9/12 Sentences/Clauses

I John likes Mary.
I John likes the woman he thinks is Mary.

I John likes the woman (whom) he thinks (the woman) is Mary.

I Sometimes, John thinks he is Mary.

I Sometimes, John thinks (that) he/John is Mary.

I It is absolutely false that sometimes John thinks he is Mary.

10/12 Recursion and Constituents

I This is the house.

I This is the house that Jack built.

I This is the cat that lives in the house that Jack built.

I This is the dog that chased the cat that lives in the house that Jack built.

I This is the flea that bit the dog that chased the cat that lives in the house that Jack built.

I This is the virus that infected the flea that bit the dog that chased the cat that lives in the house that Jack built.

I Non-constituents

I If on a Winter’s Night a Traveler

I Nuclear and Radiochemistry

I The Fire Next Time

I A Tad Overweight, but Violet Eyes to Die For

I Sometimes a Great Notion

I [how can we know the] Dancer from the Dance

11/12 Describing Phrase Structure / Constituency Grammars

I Regular expressions were a convenient formalism for describing morphological structure of words.

I Context-free grammars are a convenient formalism for describing context-free languages.

I Context-free languages are a reasonable approximation for natural languages, while regular languages are much less so!

I Although these depend on what the goal is.

I There is some linguistic evidence that natural languages are NOT context-free, but in fact are mildly context-sensitive.

I This has not been a serious impediment.

I Other formalisms have been constructed over the years to deal with natural languages.

I Unification-based grammars

I Tree-adjoining grammars

I Categorial grammars

12/12 11-411 Natural Language Processing Formal Languages and Chomsky Hierarchy

Kemal Oflazer

Carnegie Mellon University in Qatar

1/53 Brief Overview of Formal Language Concepts

2/53 Strings

I An alphabet is any finite set of distinct symbols

I {0, 1}, {0,1,2,. . . ,9}, {a,b,c}

I We denote a generic alphabet by Σ

I A string is any finite-length sequence of elements of Σ.

I e.g., if Σ = {a, b} then a, aba, aaaa, ...., abababbaab are some strings over the alphabet Σ

3/53 Strings

I The set of all possible strings over Σ is denoted by Σ*.
I We define Σ^0 = {ε} and Σ^n = Σ · Σ^{n−1}

I with some abuse of the concatenation notation, which now applies to sets of strings
I So Σ^n = {ω | ω = xy and x ∈ Σ and y ∈ Σ^{n−1}}
I Σ* = Σ^0 ∪ Σ^1 ∪ Σ^2 ∪ ··· ∪ Σ^n ∪ ··· = ∪_{i=0}^{∞} Σ^i
I Alternatively, Σ* = {x1 ... xn | n ≥ 0 and xi ∈ Σ for all i}

I Φ denotes the empty language: Φ = {},
I but Φ* = {ε}

4/53 Sets of Languages

I The power set of Σ*, the set of all its subsets, is denoted as 2^{Σ*}

5/53 Describing Languages

I Interesting languages are infinite

I We need finite descriptions of infinite sets
I L = {a^n b^n : n ≥ 0} is fine but not terribly useful!

I We need to be able to use these descriptions in mechanizable procedures

6/53 Describing Languages

I Regular Expressions/Finite State Recognizers ⇒ Regular Languages

I Context-free Grammars/Push-down Automata ⇒ Context-free Languages

7/53 Identifying Nonregular Languages

I Given a language L, how can we check that it is not a regular language?

I The answer is not obvious.

I Not being able to design a DFA does not constitute a proof!

8/53 The Pigeonhole Principle

I If there are n pigeons and m holes and n > m, then at least one hole has more than one pigeon.

I What do pigeons have to do with regular languages?

9/53 The Pigeonhole Principle

I Consider the DFA

I With strings a, aa or aab, no state is repeated

I With strings aabb, bbaa, abbabb or abbbabbabb, a state is repeated

I In fact, for any ω where |ω| ≥ 4, some state has to repeat. Why?

10/53 The Pigeonhole Principle

I When traversing the DFA with the string ω, if the number of transitions ≥ number of states, some state q has to repeat!

I Transitions are pigeons, states are holes.

11/53 Pumping a String

I Consider a string ω = xyz

I |y| ≥ 1

I |xy| ≤ m (m is the number of states)
I If ω = xyz ∈ L then so are xy^i z for all i ≥ 0

I The substring y can be pumped.

I So if a DFA accepts a sufficiently long string, then it accepts an infinite number of strings!

12/53 There are Nonregular Languages

I Consider the language L = {a^n b^n | n ≥ 0}

I Suppose L is regular and a DFA M with p states accepts L.
I Consider δ*(q0, a^i) for i = 0, 1, 2, ...
I δ*(q, w) is the extended state transition function: what state do I land in, starting in state q and stepping through the symbols of w?

I Since there are infinitely many i’s, but a finite number of states, the Pigeonhole Principle tells us that there is some state q such that
I δ*(q0, a^n) = q and δ*(q0, a^m) = q, but n ≠ m
I Thus if M accepts a^n b^n it must also accept a^m b^n, since in state q it does not “remember” whether there were n or m a’s.

I Thus M cannot exist and L is not regular.

13/53 Is English Regular?

I The cat likes tuna fish.

I The cat the dog chased likes tuna fish.

I The cat the dog the rat bit chased likes tuna fish.

I The cat the dog the rat the elephant admired bit chased likes tuna fish.
I L1 = (the cat | the dog | the mouse | ...)* (chased | bit | ate | ...)* likes tuna fish
I L2 = English
I L1 ∩ L2 = (the cat | the dog | the mouse | ...)^n (chased | bit | ate | ...)^{n−1} likes tuna fish
I Closure fact: If L1 and L2 are regular ⇒ L1 ∩ L2 is regular.
I L1 is regular, L1 ∩ L2 is NOT regular, hence L2 (English) can NOT be regular.

14/53 Grammars

I Grammars provide the generative mechanism to generate all strings in a language.

I A grammar is essentially a collection of substitution rules, called productions

I Each production rule has a left-hand-side and a right-hand-side.

15/53 Grammars - An Example

I Consider once again L = {a^n b^n | n ≥ 0}

I Basis: ε is in the language

I Production: S → ε

I Recursion: If w is in the language, then so is the string awb.

I Production: S → aSb

I S is called a variable or a nonterminal symbol

I a, b etc., are called terminal symbols

I One variable is designated as the start variable or start symbol.

16/53 How does a grammar work?

I Consider the set of rules R = {S → ε, S → aSb}

I Start with the start variable S
I Apply the following until all remaining symbols are terminal.

I Choose a production in R whose left-hand side matches one of the variables.

I Replace the variable with the rule’s right hand side.

I S ⇒ aSb ⇒ aaSbb ⇒ aaaSbbb ⇒ aaaaSbbbb ⇒ aaaabbbb

I The string aaaabbbb is in the language L

I The sequence of rule applications above is called a derivation.
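A tiny sketch that mechanically replays this derivation for the grammar R = {S → ε, S → aSb}:

```python
def derive(n):
    """Return the derivation of a^n b^n as a list of sentential forms."""
    form = "S"
    steps = [form]
    for _ in range(n):                       # apply S -> aSb  n times
        form = form.replace("S", "aSb", 1)
        steps.append(form)
    form = form.replace("S", "", 1)          # finish with S -> epsilon
    steps.append(form)
    return steps

print(" => ".join(derive(4)))   # S => aSb => aaSbb => aaaSbbb => aaaaSbbbb => aaaabbbb
```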

17/53 Types of Grammars

I Regular Grammars describe regular languages.

I Context-free Grammars: describe context-free languages.

I Context-sensitive Grammars: describe context-sensitive languages.

I General Grammars: describe arbitrary Turing-recognizable languages.

18/53 Formal Definition of a Grammar

I A Grammar is a 4-tuple G = (V, Σ, R, S) where

I V is a finite set of variables

I Σ is a finite set of terminals, disjoint from V.

I R is a set of rules of the form X → Y

I S ∈ V is the start variable
I In general X ∈ (V ∪ Σ)^+ and Y ∈ (V ∪ Σ)^*

I The type of a grammar (and hence the class of the languages described) depends on the type of the left- and right-hand sides.

I The right-hand side of the rules can be any combination of variables and terminals, including ε (hence Y ∈ (V ∪ Σ)^*).

19/53 Types of Grammars

I Regular Grammars
I Left-linear: All rules are either like X → Ya or like X → a, with X, Y ∈ V and a ∈ Σ^*
I Right-linear: All rules are either like X → aY or like X → a, with X, Y ∈ V and a ∈ Σ^*

I Context-free Grammars
I All rules are like X → Y, with X ∈ V and Y ∈ (Σ ∪ V)^*

I Context-sensitive Grammars
I All rules are like LXR → LYR (X is rewritten as Y only in the context L _ R), with X ∈ V and L, R, Y ∈ (Σ ∪ V)^*

I General Grammars
I All rules are like X → Y, with X, Y ∈ (Σ ∪ V)^*

20/53 Chomsky Normal Form

I CFGs in certain standard forms are quite useful for some computational problems.
Chomsky Normal Form: A context-free grammar is in Chomsky normal form (CNF) if every rule is either of the form

A → BC or A → a

where a is a terminal and A, B, C are variables – except B and C may not be the start variable. In addition, we allow the rule S → ε if necessary.

I Any CFG can be converted to a CFG in Chomsky Normal Form. The two grammars generate the same language but may assign different tree structures to the same string. (A small binarization sketch follows below.)
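As a small illustration of one piece of the conversion (binarization of long right-hand sides only; handling of terminals, unit rules and ε is omitted), with made-up fresh variable names X1, X2, ...:

```python
def binarize(lhs, rhs):
    """Split lhs -> rhs (len(rhs) > 2) into binary rules using fresh variables."""
    rules = []
    current = lhs
    for k in range(len(rhs) - 2):
        fresh = f"X{k+1}"                 # fresh-variable names are made up here
        rules.append((current, [rhs[k], fresh]))
        current = fresh
    rules.append((current, rhs[-2:]))
    return rules

print(binarize("VP", ["V", "NP", "PP"]))
# [('VP', ['V', 'X1']), ('X1', ['NP', 'PP'])]
```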

21/53 Chomsky Hierarchy

22/53 Parse Trees

[Parse tree figure: nested levels of S → a S b, ending in S → ε; the tree for aaaabbbb]

I Derivations can also be represented with a parse tree.
I The leaves constitute the yield of the tree.
I Terminal symbols can occur only at the leaves.
I Variables can occur only at the internal nodes.

The terminals concatenated from left to right give us the string.

23/53 A Grammar for a Fragment of English

Grammar:
S → NP VP
NP → CN | CN PP
VP → CV | CV PP
PP → P NP
CN → DT N
CV → V | V NP
DT → a | the
N → boy | girl | flower | telescope
V → touches | likes | sees | gives
P → with | to

Nomenclature:
I S: Sentence
I NP: Noun Phrase
I CN: Complex Noun
I PP: Prepositional Phrase
I VP: Verb Phrase
I CV: Complex Verb
I P: Preposition
I DT: Determiner
I N: Noun
I V: Verb

24/53 A Grammar for a Fragment of English

Using the grammar above, a derivation:
S ⇒ NP VP
  ⇒ CN PP VP
  ⇒ DT N PP VP
  ⇒ a N PP VP
  ⇒ ···
  ⇒ a boy with a flower VP
  ⇒ a boy with a flower CV PP
  ⇒ ···
  ⇒ a boy with a flower sees a girl with a telescope

25/53 English Parse Tree

[Parse tree:
[S [NP [CN [DT a] [N boy]] [PP [P with] [NP [CN [DT a] [N flower]]]]]
   [VP [CV [V sees] [NP [CN [DT a] [N girl]]]] [PP [P with] [NP [CN [DT a] [N telescope]]]]]]]

I This structure is for the interpretation where the boy is seeing with the telescope!

26/53 English Parse Tree Alternate Structure

[Parse tree:
[S [NP [CN [DT a] [N boy]] [PP [P with] [NP [CN [DT a] [N flower]]]]]
   [VP [CV [V sees] [NP [CN [DT a] [N girl]] [PP [P with] [NP [CN [DT a] [N telescope]]]]]]]]]

I This is for the interpretation where the girl is carrying a telescope.

27/53 Structural Ambiguity

I A set of rules can assign multiple structures to the same string.
I Which rule one chooses determines the eventual structure.

I VP → CV | CV PP

I CV → V | V NP

I NP → CN | CN PP

I ··· [VP [CV sees [NP a girl]] [PP with a telescope]]
I ··· [VP [CV sees [NP [CN a girl] [PP with a telescope]]]]

I (Not all brackets are shown!)

28/53 Some NLP Considerations - Linguistic Grammaticality

I We need to address a wide range of grammaticality.

I I’ll write the company.

I I’ll write to the company.

I It needs to be washed.

I It needs washed.

I They met Friday to discuss it.

I They met on Friday to discuss it.

29/53 Some NLP Considerations – Getting it Right

I CFGs provide you with a tool set for creating grammars

I Grammars that work well (for a given application)

I Grammars that work poorly (for a given application)

I There is nothing about the theory of CFGs that tells you, a priori, what a “correct” grammar for a given application looks like

I A good grammar is generally one that:

I Doesn’t over-generate very much (high precision)

I A grammar over-generates when it accepts strings not in the language.

I Doesn’t under-generate very much (high recall)

I A grammar under-generates when it does not accept strings in the language.

30/53 Some NLP Considerations – Why are we Building Grammars?

I Consider:

I Oswald shot Kennedy.

I Oswald, who had visited Russia recently, shot Kennedy.

I Oswald assassinated Kennedy

I Who shot Kennedy?

I Consider

I Oswald shot Kennedy.

I Kennedy was shot by Oswald.

I Oswald was shot by Ruby.

I Who shot Oswald?

I Active/Passive

I Oswald shot Kennedy.

I Kennedy was shot by Oswald.

I Relative clauses

I Oswald who shot Kennedy was shot by Ruby.

I Kennedy whom Oswald shot didn’t shoot anybody.

31/53 Language Myths: Subject

I Myth I: the subject is the first noun phrase in a sentence

I Myth II: the subject is the actor in a sentence

I Myth III: the subject is what the sentence is about

I All of these are often true, but none of them is always true, or tells you what a subject really is (or how to use it in NLP).

32/53 Subject and Object

I Syntactic (not semantic)

I The batter hit the ball. [subject is semantic agent]

I The ball was hit by the batter. [subject is semantic patient]

I The ball was given a whack by the batter. [subject is semantic recipient]

I George, the key, the wind opened the door.

I Subject ≠ topic (the most important information in the sentence)

I I just married the most beautiful woman in the world.

I Now beans, I like.

I As for democracy, I think it’s the best form of government.

I English subjects

I agree with the verb

I when pronouns, are in nominative case (I/she/he vs. me/her/him)

I English objects

I when pronouns, in accusative case (me, her, him)

I become subjects in passive sentences

33/53 Looking Forward

I CFGs may not be entirely adequate for capturing the syntax of natural languages

I They are almost adequate.

I They are computationally well-behaved (in that you can build relatively efficient parsers for them, etc.)

I But they are not very convenient as a means for handcrafting a grammar.

I They are not probabilistic. But we will add probabilities to them soon.

34/53 Context-free Languages

I The Cocke-Younger-Kasami (CYK) algorithm:

I Grammar in Chomsky Normal Form (may not necessarily be linguistically meaningful)

I All trees sanctioned by the grammar can be computed.
I For an input of n words, requires O(n^3) work (with a large constant factor dependent on the grammar size), using a bottom-up dynamic programming approach.

I Earley Algorithm:

I Can handle arbitrary Context-free Grammars

I Parsing is top-down.

I Later.

35/53 The Cocke-Younger-Kasami (CYK) algorithm

I The CYK parsing algorithm determines if w ∈ L(G) for a grammar G in Chomsky Normal Form

I with some extensions, it can also determine possible structures.

I Assume w ≠ ε (if w = ε, just check whether the grammar has the rule S → ε)

36/53 The CYK Algorithm

I Consider w = a1 a2 ··· an, ai ∈ Σ
I Suppose we could cut up the string into two parts u = a1 a2 ... ai and v = ai+1 ai+2 ··· an
I Now suppose A ⇒* u and B ⇒* v and that S → AB is a rule.

[Figure: S with two children A and B; A derives u = a1 ... ai and B derives v = ai+1 ... an]

37/53 The CYK Algorithm

[Figure: S with two children A and B; A derives u = a1 ... ai and B derives v = ai+1 ... an]

I Now we apply the same idea to A and B recursively.

[Figure: recursively, A splits into C and D (spanning u1 = a1 ... aj and v1 = aj+1 ... ai), and B splits into E and F (spanning u2 = ai+1 ... ak and v2 = ak+1 ... an)]

38/53 The CYK Algorithm

[Figure: recursively, A splits into C and D (spanning u1 = a1 ... aj and v1 = aj+1 ... ai), and B splits into E and F (spanning u2 = ai+1 ... ak and v2 = ak+1 ... an)]

I What is the problem here?

I We do not know what i, j and k are!
I No problem! We can try all possible i’s, j’s and k’s.

I Dynamic programming to the rescue.

39/53 DIGRESSION - Dynamic Programming

I An algorithmic paradigm

I Essentially like divide-and-conquer but subproblems overlap!

I Results of subproblem solutions are reusable.

I Subproblem results are computed once and then memoized

I Used in solutions to many problems

I Length of longest common subsequence

I Knapsack

I Optimal matrix chain multiplication

I Shortest paths in graphs with negative weights (Bellman-Ford Alg.)

40/53 (Back to) The CYK Algorithm

I Let w = a1 a2 ··· an.
I We define
I wi,j = ai ··· aj (the substring between positions i and j)
I Vi,j = {A ∈ V | A ⇒* wi,j} (j ≥ i) (all variables which derive wi,j)
I w ∈ L(G) iff S ∈ V1,n
I How do we compute Vi,j (j ≥ i)?

41/53 The CYK Algorithm

I How do we compute Vi,j?
I Observe that A ∈ Vi,i if A → ai is a rule.

I So Vi,i can easily be computed for 1 ≤ i ≤ n by inspection of w and the grammar.
I A ⇒* wi,j if

I there is a production A → BC, and
I B ⇒* wi,k and C ⇒* wk+1,j for some k, i ≤ k < j. So

Vi,j = ∪_{i ≤ k < j} {A | A → BC and B ∈ Vi,k and C ∈ Vk+1,j}

42/53 The CYK Algorithm

Vi,j = ∪_{i ≤ k < j} {A | A → BC and B ∈ Vi,k and C ∈ Vk+1,j}

I Compute in the following order (row by row, left to right):
V1,1    V2,2    V3,3    ···    Vn,n
V1,2    V2,3    V3,4    ···    Vn−1,n
V1,3    V2,4    V3,5    ···    Vn−2,n
···
V1,n−1  V2,n
V1,n
I For example, to compute V2,4 one needs V2,2 and V3,4, and then V2,3 and V4,4, all of which are computed earlier!

43/53 The CYK Algorithm

for i = 1 to n do                                   // Initialization
    V[i,i] = {A | A → a is a rule and a = wi}
for len = 2 to n do                                 // len: length of the span
    for i = 1 to n − len + 1 do                     // i: start of the span
        j = i + len − 1                             // j: end of the span
        V[i,j] = {}                                 // set V[i,j] to the empty set
        for k = i to j − 1 do
            V[i,j] = V[i,j] ∪ {A | A → BC is a rule and B ∈ V[i,k] and C ∈ V[k+1,j]}

I This algorithm has 3 nested loops, each bounded by O(n). So the overall time/work is O(n^3).

I The size of the grammar factors in as a constant factor as it is independent of n – the length of the string.

I Certain special CFGs have subcubic recognition algorithms.

44/53 The CYK Algorithm in Action

I Consider the following grammar in CNF:
S → AB
A → BB | a
B → AB | b

I The input string is w = aabbb

I The CYK table (rows grow by span length):
i →          1       2       3       4       5
             a       a       b       b       b
V[i,i]:     {A}     {A}     {B}     {B}     {B}
V[i,i+1]:   {}      {S,B}   {A}     {A}
V[i,i+2]:   {S,B}   {A}     {S,B}
V[i,i+3]:   {A}     {S,B}
V[i,i+4]:   {S,B}

I Since S ∈ V1,5, this string is in L(G).
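A compact, runnable CYK recognizer; the grammar and the input string aabbb are taken from the slide, so the final cell should come out as {S, B}.

```python
from collections import defaultdict

def cyk(word, rules):
    """rules: list of (lhs, rhs) with rhs a 1-tuple (terminal) or 2-tuple (variables)."""
    n = len(word)
    V = defaultdict(set)                        # V[(i, j)] = variables deriving word[i..j]
    for i, a in enumerate(word, start=1):
        V[i, i] = {A for A, rhs in rules if rhs == (a,)}
    for span in range(2, n + 1):                # span length j - i + 1
        for i in range(1, n - span + 2):
            j = i + span - 1
            for k in range(i, j):
                for A, rhs in rules:
                    if len(rhs) == 2 and rhs[0] in V[i, k] and rhs[1] in V[k + 1, j]:
                        V[i, j].add(A)
    return V

rules = [("S", ("A", "B")), ("A", ("B", "B")), ("A", ("a",)),
         ("B", ("A", "B")), ("B", ("b",))]
table = cyk("aabbb", rules)
print(table[1, 5])                              # contains 'S', so aabbb is in L(G)
```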

45/53 The CYK Algorithm in Action

I Consider the following grammar in CNF:
S → AB
A → BB | a
B → AB | b

I Let us see how we compute V2,4

I We need to look at V2,2 and V3,4
I We need to look at V2,3 and V4,4
i →          1       2       3       4       5
             a       a       b       b       b
V[i,i]:     {A}     {A}     {B}     {B}     {B}
V[i,i+1]:   {}      {S,B}   {A}     {A}
V[i,i+2]:   {S,B}   {A}     {S,B}
V[i,i+3]:   {A}     {S,B}
V[i,i+4]:   {S,B}

46/53 A CNF Grammar for a Fragment of English

Original grammar:
S → NP VP
NP → CN | CN PP
VP → CV | CV PP
PP → P NP
CN → DT N
CV → V | V NP
DT → a | the
N → boy | girl | flower | telescope
V → touches | likes | sees | gives
P → with | to

Grammar in Chomsky Normal Form:
S → NP VP
NP → CN PP
NP → DT N
VP → CV PP
VP → V NP
VP → touches | likes | sees | gives
PP → P NP
CN → DT N
CV → V NP
CV → touches | likes | sees | gives
DT → a | the
N → boy | girl | flower | telescope
V → touches | likes | sees | gives
P → with | to

47/53 English Parsing Example with CYK

Grammar (CNF):
S → NP VP
NP → CN PP
NP → DT N
VP → CV PP
VP → V NP
VP → touches | likes | sees | gives
PP → P NP
CN → DT N
CV → V NP
CV → touches | likes | sees | gives
DT → a | the
N → boy | girl | flower | telescope
V → touches | likes | sees | gives
P → with | to

CYK table for “the boy sees a girl”:
i →          1          2      3              4          5
             the        boy    sees           a          girl
            {DT}       {N}    {V, CV, VP}    {DT}       {N}
            {CN, NP}   {}     {}             {CN, NP}
            {S}        {}     {CV, VP}
            {}         {}
            {S}   ← S ∈ V1,5, so the sentence is accepted

48/53 Some Languages are NOT Context-free

I L = {a^n b^n c^n | n ≥ 0} is not a context-free language.

I This can be shown with the Pumping Lemma for Context-free Languages.

I It is, however, a context-sensitive language.
I Cross-serial Dependencies¹

I L = {a^n b^m c^n d^m | n, m ≥ 0} is not a context-free language but is considered mildly context-sensitive.
I So is L = {x a^n y b^m z c^n w d^m u | n, m ≥ 0}

¹ Graphics by Christian Nassif-Haynes from commons.wikimedia.org/w/index.php?curid=28274222
49/53 Are CFGs enough to model natural languages?

I Swiss German has the following construct: dative-NP^p accusative-NP^q dative-taking-V^p accusative-taking-V^q

I Jan säit das mer em Hans es huus hälfed aastriiche.

I Jan says that we Hans the house helped paint.

I “Jan says that we helped Hans paint the house.”

I Jan säit das mer d’chind em Hans es huus haend wele laa hälfe aastriiche.

I Jan says that we the children Hans the house have wanted to let help paint.

I “Jan says that we have wanted to let the children help Hans paint the house.”

50/53 Is Swiss German Context-free?

I L1 = { Jan säit das mer (d’chind)* (em Hans)* es huus haend wele (laa)* (hälfe)* aastriiche. }
I L2 = { Swiss German }
I L1 ∩ L2 = { Jan säit das mer (d’chind)^n (em Hans)^m es huus haend wele (laa)^n (hälfe)^m aastriiche. } ≡ L = {x a^n y b^m z c^n w d^m u | n, m ≥ 0}

51/53 English “Respectively” Construct

I Alice, Bob and Carol will have a juice, a tea and a coffee, respectively.

I Again mildly context-sensitive!

52/53 Closing Remarks

I Natural languages are mildly context sensitive.

I But CFGs might be enough

I But RGs might be enough

I If you have very big grammars and,

I don’t really care about parsing.

53/53 11-411 Natural Language Processing and Probabilistic Parsing

Kemal Oflazer

Carnegie Mellon University in Qatar

1/34 Probabilistic Parsing with CFGs

I The basic CYK Algorithm is not probabilistic: it builds a table from which all (potentially exponentially many) parse trees can be extracted.
I Note that while computing the table needs O(n^3) work, computing all trees could require exponential work!

I Computing all trees is not necessarily useful either. How do you know which one is the correct or best tree?

I We need to incorporate probabilities in some way.

I But where do we get them?

2/34 Probabilistic Context-free Grammars

I A probabilistic CFG (PCFG) is a CFG

I A set of nonterminal symbols V

I A set of terminal symbols Σ
I A set R of rules of the sort X → Y where X ∈ V and Y ∈ (V ∪ Σ)*.
I If you need to use CKY, Chomsky Normal Form is a special case with rules only like

I X → Y Z
I X → a
where X, Y, Z ∈ V and a ∈ Σ, with a probability distribution over the rules:

I For each X ∈ V, there is a probability distribution p(X → Y) over the rules in R that have X as the left-hand side.

I For every X:  Σ_{X→Y ∈ R} p(X → Y) = 1

3/34 PCFG Example

S

Write down the start Symbol S Score:

4/34 PCFG Example

S

Aux NP VP

Choose a rule from the S distribution. Here S → Aux NP VP Score: p(Aux NP VP | S)

5/34 PCFG Example

S

Aux NP VP

does

Choose a rule from the Aux distribution. Here Aux → does Score: p(Aux NP VP | S) · p(does | Aux)

6/34 PCFG Example

S

Aux NP VP

does Det N

Choose a rule from the NP distribution. Here NP → Det Noun Score: p(Aux NP VP | S) · p(does | Aux) · p(Det N | NP)

7/34 PCFG Example

S

Aux NP VP

does Det N

this

Choose a rule from the Det distribution. Here Det → this Score:

p(Aux NP VP | S) · p(does | Aux) · p(Det N | NP) · p(this | Det)

8/34 PCFG Example

S

Aux NP VP

does Det N

this flight

Choose a rule from the N distribution. Here N → flight. Score:

p(Aux NP VP | S) · p(does | Aux) · p(Det N | NP) · p(this | Det)·

p(flight | N)

9/34 PCFG Example

S

Aux NP VP

does Det N Verb NP

this flight

Choose a rule from the VP distribution. Here VP → Verb NP Score:

p(Aux NP VP | S) · p(does | Aux) · p(Det N | NP) · p(this | Det)·

p(flight | N) · p(Verb NP | VP)

10/34 PCFG Example

S

Aux NP VP

does Det N Verb NP

this flight include

Choose a rule from the Verb distribution. Here Verb → include Score:

p(Aux NP VP | S) · p(does | Aux) · p(Det N | NP) · p(this | Det)·

p(flight | N) · p(Verb NP | VP) · p(include | V)

11/34 PCFG Example

S

Aux NP VP

does Det N Verb NP

this flight include Det N

Choose a rule from the NP distribution. Here NP → Det N. Score:

p(Aux NP VP | S) · p(does | Aux) · p(Det N | NP) · p(this | Det)·

p(flight | N) · p(Verb NP | VP) · p(include | V) · p(Det N | NP)

12/34 PCFG Example

S

Aux NP VP

does Det N Verb NP

this flight include Det N

a

Choose a rule from the Det distribution. Here Det → a Score:

p(Aux NP VP | S) · p(does | Aux) · p(Det N | NP) · p(this | Det)·

p(flight | N) · p(Verb NP | VP) · p(include | V) · p(Det N | NP) · p(a | Det)

13/34 PCFG Example

S

Aux NP VP

does Det N Verb NP

this flight include Det N

a meal

Choose a rule from the N distribution. Here N → meal Score:

p(Aux NP VP | S) · p(does | Aux) · p(Det N | NP) · p(this | Det) · p(flight | N)

p(Verb NP | VP) · p(include | V) · p(Det N | NP) · p(a | Det) · p(meal | N)
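The score of the finished tree is simply the product of the probabilities of the rules used in the derivation. A tiny sketch with invented probability values (a real PCFG would supply these):

```python
from math import prod

rule_probs = {                      # hypothetical PCFG probabilities
    ("S", "Aux NP VP"): 0.15, ("Aux", "does"): 0.30,
    ("NP", "Det N"): 0.40, ("Det", "this"): 0.10, ("N", "flight"): 0.02,
    ("VP", "Verb NP"): 0.35, ("Verb", "include"): 0.05,
    ("Det", "a"): 0.45, ("N", "meal"): 0.01,
}

derivation = [("S", "Aux NP VP"), ("Aux", "does"), ("NP", "Det N"),
              ("Det", "this"), ("N", "flight"), ("VP", "Verb NP"),
              ("Verb", "include"), ("NP", "Det N"), ("Det", "a"), ("N", "meal")]

print(prod(rule_probs[r] for r in derivation))   # product of the rule probabilities
```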

14/34 Noisy Channel Model of Parsing

Source → T → Vocal Tract/Typing → X      (e.g., T is a tree, X = “A boy with a flower sees ...”)

I “I have a tree of the sentence I want to utter in my mind; by the time I utter it, only the words come out.”

I The PCFG defines the source model.
I The channel is deterministic: it erases everything except the leaves!
I If I observe a sequence of words comprising a sentence, what is the best tree structure it corresponds to?

I Find the tree t̂ = argmax_{trees t with yield x} p(t | x)

I How do we set the probabilities p(right-hand side | left-hand side)?
I How do we decode/parse? 15/34 Probabilistic CYK

I Input

I a PCFG (V, S, Σ, R, p(∗ | ∗)) in Chomsky Normal Form.

I a sentence x of length n words.

I Output

I t̂ = argmax_{t ∈ Tx} p(t | x) (if x is in the language of the grammar)
I Tx: all trees with yield x.

16/34 Probabilistic CYK

I We define si:j(V) as the maximum probability for deriving the fragment ... xi, ..., xj ... from the nonterminal V ∈ V.
I We use CYK dynamic programming to compute the best score s1:n(S).
I Base case: for i ∈ {1, ..., n} and for each V ∈ V:

si:i(V) = p(xi | V)

I Inductive case: For each i, j, 1 ≤ i < j ≤ n and V ∈ V.

si:j(V) = max_{L,R ∈ V, i ≤ k < j} p(L R | V) · si:k(L) · sk+1:j(R)

I Solution: s1:n(S) = max_{t ∈ Tx} p(t)
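A minimal probabilistic-CYK sketch following the recursion above, with backpointers for recovering the best tree. The rule/lexicon format and the toy grammar probabilities are assumptions for illustration.

```python
def pcky(words, lexical, binary, start="S"):
    # lexical: {(V, word): prob};  binary: {(V, L, R): prob}
    n = len(words)
    s, back = {}, {}
    for i, w in enumerate(words, start=1):                    # base case: s_{i:i}(V) = p(x_i | V)
        s[i, i] = {V: p for (V, word), p in lexical.items() if word == w}
    for span in range(2, n + 1):
        for i in range(1, n - span + 2):
            j = i + span - 1
            s[i, j], back[i, j] = {}, {}
            for k in range(i, j):
                for (V, L, R), p in binary.items():
                    if L in s[i, k] and R in s[k + 1, j]:
                        score = p * s[i, k][L] * s[k + 1, j][R]
                        if score > s[i, j].get(V, 0.0):        # keep the best split and rule
                            s[i, j][V] = score
                            back[i, j][V] = (k, L, R)
    return s[1, n].get(start, 0.0), back

# Toy grammar/lexicon for "we wash cats" (probabilities are made up).
lexical = {("Pro", "we"): 1.0, ("V", "wash"): 1.0, ("N", "cats"): 1.0}
binary = {("S", "Pro", "VP"): 1.0, ("VP", "V", "N"): 1.0}
print(pcky(["we", "wash", "cats"], lexical, binary)[0])        # 1.0
```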

17/34 Parse Chart

i →      1        2        3        4        5
         the      boy      sees     a        girl
         s1:1(∗)  s2:2(∗)  s3:3(∗)  s4:4(∗)  s5:5(∗)
         s1:2(∗)  s2:3(∗)  s3:4(∗)  s4:5(∗)
         s1:3(∗)  s2:4(∗)  s3:5(∗)
         s1:4(∗)  s2:5(∗)
         s1:5(∗)

I Again, each entry is a table, mapping each nonterminal V to si:j(V), the maximum probability for deriving the fragment ... xi,..., xj ... from the nonterminal V.

18/34 Remarks

I Work and space requirements? O(|R| n^3) work, O(|V| n^2) space.
I Recovering the best tree? Use backpointers.

I Note that there may be an exponential number of possible trees, if you want to enumerate some/all trees.

I Probabilistic Earley’s Algorithm does NOT require a Chomsky Normal Form grammar.

19/34 More Refined Models

Starting Point

20/34 More Refined Models Parent Annotation

I Increase the “vertical” Markov Order p(children | parent, grandparent)

21/34 More Refined Models Headedness

I Suggests “horizontal” Markovization: p(children | parent) = p(head | parent) · Π_i p(ith sibling | head, parent)
22/34 More Refined Models Lexicalization

I Each node shares a lexical head with its head child.

23/34 Where do the Probabilities Come from?

I Building a CFG for a natural language by hand is really hard.

I One needs lots of categories to make sure all and only grammatical sentences are included.

I Categories tend to start exploding combinatorially.

I Alternative grammar formalisms are typically used for manual grammar construction; these are often based on constraints and a powerful algorithmic tool called unification.

I The standard approach today is to build a large-scale treebank, a database of manually constructed parse trees of real-world sentences.

I Extract rules from the treebank.

I Estimate probabilities from the treebank.

24/34 Penn Treebank

I Large database of hand-annotated parse trees of English.

I Mostly Wall Street Journal news text.

I About 42,500 sentences: typically about 40,000 used for statistical modeling and training and 2500 for testing

I The WSJ section has ≈1M words and ≈1M non-lexical rule instances, boiling down to about 17,500 distinct rules.

I https://en.wikipedia.org/wiki/Treebank lists tens of treebanks built for many different languages over the last two decades.

I http://universaldependencies.org/ lists tens of treebanks built using the Universal Dependencies framework.

25/34 Example Sentence from Penn Treebank

26/34 Example Sentence Encoding from Penn Treebank

27/34 More PTB Trees

28/34 More PTB Trees

29/34 Treebanks as Grammars

I You can now compute rule probabilities from the counts of these rules e.g.,

I p(VBN PP | VP), or

I p(light | NN).
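A minimal sketch of the estimation step: relative-frequency (maximum-likelihood) rule probabilities from rule counts, p(rhs | lhs) = count(lhs → rhs) / count(lhs). The counts below are invented; real counts would come from the treebank.

```python
from collections import Counter, defaultdict

rule_counts = Counter({("VP", "VBD NP"): 12000, ("VP", "VBN PP"): 3000,
                       ("NP", "DT NN"): 50000, ("NN", "light"): 80})

lhs_totals = defaultdict(int)
for (lhs, rhs), c in rule_counts.items():
    lhs_totals[lhs] += c

def rule_prob(lhs, rhs):
    return rule_counts[(lhs, rhs)] / lhs_totals[lhs]

print(rule_prob("VP", "VBN PP"))   # 3000 / 15000 = 0.2
```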

30/34 Interesting PTB Rules

I VP → VBP PP PP PP PP PP ADVP PP

I This mostly happens because we go from football in the fall to lifting in the winter to football again in the spring.

I NP → DT JJ JJ VBG NN NNP NNP FW NNP

I The state-owned industrial holding company Instituto Nacional de Industria ...

31/34 Some Penn Treebank Rules with Counts

32/34 Parser Evaluation

I Represent a parse tree as a collection of tuples {(ℓ1, i1, j1), (ℓ2, i2, j2), ..., (ℓm, im, jm)} where
I ℓk is the nonterminal labeling the kth phrase.
I ik is the index of the first word in the kth phrase.
I jk is the index of the last word in the kth phrase.

For example, the tree for “does this flight include a meal” maps to
{(S, 1, 6), (NP, 2, 3), (VP, 4, 6), ..., (Aux, 1, 1), ..., (Noun, 6, 6)}

I Convert gold-standard tree and system hypothesized tree into this representation, then estimate precision, recall, and F1. 33/34 Tree Comparison Example

I In both trees: {(NP, 1, 1), (S, 1, 7), (VP, 2, 7), (PP, 5, 7), (NP, 6, 7), (Nominal, 4, 4)}

I In the left (hypothesized) tree: {(NP, 3, 7), (Nominal, 4, 7)}

I In the right (gold) tree: {(VP, 2, 4), (NP, 3, 4)}

I P = 6/8, R = 6/8
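The computation for this example, as a short sketch (the tuples are copied from the comparison above):

```python
common = {("NP", 1, 1), ("S", 1, 7), ("VP", 2, 7), ("PP", 5, 7), ("NP", 6, 7), ("Nominal", 4, 4)}
hypothesis = common | {("NP", 3, 7), ("Nominal", 4, 7)}    # left (hypothesized) tree
gold = common | {("VP", 2, 4), ("NP", 3, 4)}               # right (gold) tree

tp = len(hypothesis & gold)
precision = tp / len(hypothesis)
recall = tp / len(gold)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)    # 0.75 0.75 0.75
```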

34/34 11-411 Natural Language Processing Earley Parsing

Kemal Oflazer

Carnegie Mellon University in Qatar

1/29 Earley Parsing

I Remember that CKY parsing works only for grammars in Chomsky Normal Form (CNF)

I Need to convert grammar to CNF.

I The structure may not necessarily be “natural”.

I CKY is bottom-up – may be doing unnecessary work.

I Earley algorithm allows arbitrary CFGs.

I So no need to convert your grammar.

I Earley algorithm is a top-down algorithm.

2/29 Earley Parsing

I The Earley parser fills a table (sometimes called a chart) in a single sweep over the input.

I For an n word sentence, the table is of size n + 1.

I Table entries represent

I In-progress constituents

I Predicted constituents.

I Completed constituents and their locations in the sentence

3/29 Table Entries

I Table entries are called states and are represented with dotted-rules.

I S → •VP a VP is predicted

NP → Det • Nominal an NP is in progress

VP → V NP• a VP has been found

4/29 States and Locations

S → •VP[0, 0] a VP is predicted at the start of the sentence

NP → Det • Nominal[1, 2] an NP is in progress; Det goes from 1 to 2

VP → V NP • [0, 3] a VP has been found starting at 0 and ending at 3

5/29 The Early Table Layout

Column 0 | Column 1 | ... | Column n
Each column holds the states and locations for that position (column 0 through column n).

I Words are positioned between columns.
I w1 is positioned between columns 0 and 1.
I wn is positioned between columns n − 1 and n.

6/29 Earley – High-level Aspects

I As with most dynamic programming approaches, the answer is found by looking in the table in the right place.

I In this case, there should be an S state in the final column that spans from 0 to n and is complete. That is,

I S → α • [0, n]

I If that is the case, you are done!
I So sweep through the table from 0 to n

I New predicted states are created by starting top-down from S

I New incomplete states are created by advancing existing states as new constituents are discovered.

I New complete states are created in the same way.

7/29 Earley – High-level Aspects

1. Predict all the states you can upfront.
2. Read a word.
   2.1 Extend states based on matches.
   2.2 Generate new predictions.
   2.3 Go to step 2.
3. When you are out of words, look at the chart to see if you have a winner.

8/29 Earley – Main Functions: Predictor

I If you have a state spanning [i, j]

I and is looking for a constituent B,

I then enqueue new states that will search for a B, starting at position j.

9/29 Earley– Prediction

I Given A → α • Bβ [i, j] (for example ROOT → •S [0, 0])

I and the rule B → γ (for example S → VP)

I create B → •γ [j, j] (for example, S → •VP [0, 0])

ROOT → •S [0, 0]
S → •NP VP [0, 0]
S → •VP [0, 0]
...
VP → •V NP [0, 0]
...
NP → •DT N [0, 0]

10/29 Earley – Main Functions: Scanner


I If you have a state spanning [i, j]

I and is looking for a word with Part-of-Speech B,

I and one of the parts-of-speech of the next word (at position j) is B,

I then enqueue a state of the sort B → word[j + 1] • [j, j + 1] in chart position j + 1

11/29 Earley– Scanning

I Given A → α • Bβ [i, j] (for example VP → •V NP [0, 0])
I and the rule B → wj+1 (for example V → book)
I create B → wj+1 • [j, j + 1] (for example, V → book • [0, 1])

Chart[0]: ROOT → •S [0, 0];  S → •NP VP [0, 0];  S → •VP [0, 0];  ...;  VP → •V NP [0, 0];  ...;  NP → •DT N [0, 0]
Chart[1]: V → book • [0, 1]

12/29 Earley – Main Functions: Completer

I If you have a completed state spanning [j, k] with B as the left hand side.

I then, for each state in chart position j (with some span [i, j], that is immediately looking for a B),

I move the dot to after B,

I extend the span to [i, k]

I then enqueue the updated state in chart position k.

13/29 Earley– Completion

I Given A → α • Bβ [i, j] (for example VP → •V NP [0, 0])

I and B → γ • [j, k] (for example V → book • [0, 1])

I create A → α B • β [i, k] (for example, VP → V • NP [0, 1])

Chart[0]: ROOT → •S [0, 0];  S → •NP VP [0, 0];  S → •VP [0, 0];  ...;  VP → •V NP [0, 0];  ...;  NP → •DT N [0, 0]
Chart[1]: V → book • [0, 1];  VP → V • NP [0, 1]

14/29 Earley – Main Functions: Enqueue

I Just enter the given state to the chart-entry if it is not already there.
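Putting PREDICT, SCAN, COMPLETE and ENQUEUE together, here is a compact Earley recognizer sketch. It simplifies the lecture's presentation slightly (preterminals are scanned directly from a small lexicon rather than entered as separate completed states), and the toy grammar for "book that flight" is an assumption.

```python
def earley(words, grammar, lexicon, start="S"):
    n = len(words)
    chart = [[] for _ in range(n + 1)]

    def enqueue(state, k):                          # ENQUEUE: add if not already there
        if state not in chart[k]:
            chart[k].append(state)

    enqueue(("ROOT", (start,), 0, 0), 0)            # dummy start state
    for k in range(n + 1):
        for state in chart[k]:                      # chart[k] grows while we iterate
            lhs, rhs, dot, i = state
            if dot < len(rhs):
                nxt = rhs[dot]
                if nxt in grammar:                  # PREDICT
                    for prod in grammar[nxt]:
                        enqueue((nxt, prod, 0, k), k)
                elif k < n and words[k] in lexicon.get(nxt, ()):   # SCAN (preterminal)
                    enqueue((lhs, rhs, dot + 1, i), k + 1)
            else:                                   # COMPLETE
                for (lhs2, rhs2, dot2, i2) in chart[i]:
                    if dot2 < len(rhs2) and rhs2[dot2] == lhs:
                        enqueue((lhs2, rhs2, dot2 + 1, i2), k)
    return ("ROOT", (start,), 1, 0) in chart[n]     # a complete S spanning [0, n]?

grammar = {"S": [("VP",)], "VP": [("V", "NP")], "NP": [("Det", "N")]}
lexicon = {"V": {"book"}, "Det": {"that"}, "N": {"flight"}}
print(earley(["book", "that", "flight"], grammar, lexicon))   # True
```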

15/29 The Earley Parser

16/29 Extended Earley Example

I 0 Book 1 that 2 flight 3
I We should find a completed state at chart position 3

I with left hand side S and is spanning [0, 3]

17/29 Extended Earley Example Grammar

18/29 Extended Earley Example

19/29 Extended Earley Example

20/29 Extended Earley Example

21/29 Extended Earley Example

22/29 Extended Earley Example

S37 is due to S11 and S33!

23/29 Final Earley Parse

24/29 Comments

I Work is O(n^3) – the analysis is a bit trickier than for CKY.
I Space is O(n^2).

I Big grammar-related constant

I Backpointers can help recover trees.

25/29 Probabilistic Earley Parser

I So far we had no mention of any probabilities or choosing the best parse.

I Rule probabilities can be estimated from treebanks as usual.

I There are algorithms that resemble the Viterbi algorithm for HMMs for computing the best parse incrementally, from left to right, in the chart.

I Stolcke, Andreas, “An efficient probabilistic context-free parsing algorithm that computes prefix probabilities”, Computational Linguistics, Volume 21, No:2, 1995

I Beyond our scope!

26/29 General Chart Parsing

I CKY and Earley each statically determine order of events, in code

I CKY fills out triangle table starting with words on the diagonal

I Earley marches through instances of grammar rules

I Chart parsing puts edges on an agenda, allowing an arbitrary ordering policy, separate from the code.

I Generalizes CKY, Earley, and others

27/29 Implementing Parsing as Search

Agenda = {state0}
while (Agenda not empty):
    s = pop a state from Agenda
    if s is a success-state:
        return s                       // we have a parse
    else if s is not a failure-state:
        generate new states from s
        push new states on Agenda
return nil                             // no parse

I Fundamental Rule of Chart Parsing: if you can combine two contiguous edges to make a bigger one, do it.

I Akin to the Completer function in Earley.

I How you interact with the agenda is called a strategy.

28/29 Is Ambiguity Solved?

I Time flies like an arrow.

I Fruit flies like a banana.

I Time/N flies/V like an arrow.

I Time/V flies/N like (you time) an arrow.

I Time/V flies/N like an arrow (times flies).

I Time/V flies/N (that are) like an arrow.

I [Time/N flies/N] like/V an arrow!

I ...

29/29 11-411 Natural Language Processing Dependency Parsing

Kemal Oflazer

Carnegie Mellon University in Qatar

1/47 Dependencies

Informally, you can think of dependency structures as a transformation of phrase-structures that

I maintains the word-to-word relationships induced by lexicalization,

I adds labels to these relationships, and

I eliminates the phrase categories

There are linguistic theories built on dependencies, as well as treebanks based on dependencies:

I Czech Treebank

I Turkish Treebank

2/47 Dependency Tree: Definition

Let x = [x1,..., xn] be a sentence. We add a special ROOT symbol as “x0”.

A dependency tree consists of a set of tuples [p, c, `] where

I p ∈ {0,..., n} is the index of a parent.

I c ∈ {1,..., n} is the index of a child.

I ` ∈ L is a label.

Different annotation schemes define different label sets L, and different constraints on the set of tuples. Most commonly:

I The tuple is represented as a directed edge from xp to xc with label ℓ.
I The directed edges form a directed tree with x0 as the root (sometimes denoted as ROOT).
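As a minimal illustration of this definition, a dependency tree can be stored as a set of (parent, child, label) tuples; the sentence is "we wash our cats" from the examples that follow, and the particular label names used here (nsubj, obj, poss, root) are an assumption.

```python
sentence = ["ROOT", "we", "wash", "our", "cats"]     # x0 = ROOT
tree = {(2, 1, "nsubj"),    # wash -> we
        (0, 2, "root"),     # ROOT -> wash
        (4, 3, "poss"),     # cats -> our
        (2, 4, "obj")}      # wash -> cats

# every word 1..n has exactly one parent
assert {c for (_, c, _) in tree} == set(range(1, len(sentence)))
```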

3/47 Example

[S [NP [Pronoun we]] [VP [Verb wash] [NP [Determiner our] [Noun cats]]]]

Phrase-structure tree

4/47 Example

[Same tree as above, with the head child of each phrase marked: the Verb heads the VP, the VP heads S, and the Pronoun/Noun head their NPs]

Phrase-structure tree with phrase-heads

5/47 Example

[S:wash [NP:we [Pronoun:we we]] [VP:wash [Verb:wash wash] [NP:cats [Determiner:our our] [Noun:cats cats]]]]

Phrase-structure tree with phrase-heads, lexicalized

6/47 Example

ROOT

we wash our cats

“Bare bones” dependency tree.

7/47 Example

ROOT

we wash our cats who stink

8/47 Example

ROOT

we vigorously wash our cats who stink

9/47 Labels

[Dependency tree for “kids saw birds with fish”: ROOT → saw, with arcs labeled SUBJ (kids), DOBJ (birds), PREP (with), POBJ (fish)]

Key dependency relations captured in the labels include:

I Subject

I Direct Object

I Indirect Object

I Preposition Object

I Adjectival Modifier

I Adverbial Modifier

10/47 Problem: Coordination Structures

ROOT

we vigorously wash our cats and dogs who stink

Most likely the most important problem with dependency syntax.

11/47 Coordination Structures: Proposal 1

ROOT

we vigorously wash our cats and dogs who stink

Make the first conjunct head?

12/47 Coordination Structures: Proposal 2

ROOT

we vigorously wash our cats and dogs who stink

Make the coordinating conjunction the head?

13/47 Coordination Structures: Proposal 3

ROOT

we vigorously wash our cats and dogs who stink

Make the second conjunct the head?

14/47 Dependency Trees ROOT

we vigorously wash our cats and dogs who stink

ROOT

we vigorously wash our cats and dogs who stink

ROOT

we vigorously wash our cats and dogs who stink

What is a common property among these trees? 15/47 Discontinuous Constituents / Crossing Arcs

ROOT

A hearing is scheduled on this issue today

16/47 Dependencies and Grammar

I Context-free grammars can be used to encode dependency structures.

I For every head word and group of its dependent children:

Nhead → Nleftmost−sibling ... Nhead ... Nrightmost−sibling

I And for every word v in the vocabulary: Nv → v and S → Nv

I Such a grammar can produce only projective trees, which are (informally) trees in which the arcs don’t cross.

17/47 Three Approaches to Dependency Parsing

1. Dynamic programming with bilexical dependency grammars
2. Transition-based parsing with a stack
3. The Chu-Liu-Edmonds algorithm for the maximum spanning tree

18/47 Transition-based Parsing

I Process x once, from left to right, making a sequence of greedy parsing decisions.

I Formally, the parser is a state machine (not a finite-state machine) whose state is represented by a stack S and a buffer B.
I Initialize the buffer to contain x and the stack to contain the ROOT symbol.

[Initial configuration: Stack S = [ROOT];  Buffer B = [we, vigorously, wash, our, cats, who, stink]]

I We can take one of three actions:

I SHIFT the word at the front of the buffer B onto the stack S.

I RIGHT-ARC: u = pop(S); v = pop(S); push(S, v → u).

I LEFT-ARC: u = pop(S); v = pop(S); push(S, v ← u). (For labeled parsing, add labels to the LEFT-ARC and RIGHT-ARC transitions.)

I During parsing, apply a classifier to decide which transition to take next, greedily. No backtracking! 19/47 Transition-based Parsing Example

Buffer B

we Stack S vigorously wash ROOT our cats who stink

Actions:

20/47 Transition-based Parsing Example

Buffer B

Stack S vigorously wash we our ROOT cats who stink

Actions: SHIFT

21/47 Transition-based Parsing Example

Buffer B Stack S wash vigorously our we cats ROOT who stink

Actions: SHIFT SHIFT

22/47 Transition-based Parsing Example

Stack S Buffer B

wash our vigorously cats we who ROOT stink

Actions: SHIFT SHIFT SHIFT

23/47 Transition-based Parsing Example

Stack S Buffer B

our vigorously wash cats who we stink ROOT

Actions: SHIFT SHIFT SHIFT LEFT-ARC

24/47 Transition-based Parsing Example

Stack S Buffer B

our cats we vigorously wash who stink ROOT

Actions: SHIFT SHIFT SHIFT LEFT-ARC LEFT-ARC

25/47 Transition-based Parsing Example

Stack S Buffer B our cats who we vigorously wash stink

ROOT

Actions: SHIFT SHIFT SHIFT LEFT-ARC LEFT-ARC SHIFT

26/47 Transition-based Parsing Example

Stack S

cats Buffer B our who stink we vigorously wash

ROOT

Actions: SHIFT SHIFT SHIFT LEFT-ARC LEFT-ARC SHIFT SHIFT

27/47 Transition-based Parsing Example

Stack S

our cats Buffer B

who stink

we vigorously wash

ROOT

Actions: SHIFT SHIFT SHIFT LEFT-ARC LEFT-ARC SHIFT SHIFT LEFT-ARC

28/47 Transition-based Parsing Example

Stack S

who

our cats Buffer B

stink

we vigorously wash

ROOT

Actions: SHIFT SHIFT SHIFT LEFT-ARC LEFT-ARC SHIFT SHIFT LEFT-ARC SHIFT

29/47 Transition-based Parsing Example

Stack S

stink who

our cats Buffer B

we vigorously wash

ROOT

Actions: SHIFT SHIFT SHIFT LEFT-ARC LEFT-ARC SHIFT SHIFT LEFT-ARC SHIFT SHIFT

30/47 Transition-based Parsing Example Stack S

who stink

Buffer B our cats

we vigorously wash

ROOT

Actions: SHIFT SHIFT SHIFT LEFT-ARC LEFT-ARC SHIFT SHIFT LEFT-ARC SHIFT SHIFT RIGHT-ARC
31/47 Transition-based Parsing Example

Stack S

our cats who stink Buffer B

we vigorously wash

ROOT

Actions: SHIFT SHIFT SHIFT LEFT-ARC LEFT-ARC SHIFT SHIFT LEFT-ARC SHIFT SHIFT RIGHT-ARC RIGHT-ARC

32/47 Transition-based Parsing Example

Stack S

Buffer B

we vigorously wash our cats who stink

ROOT

Actions: SHIFT SHIFT SHIFT LEFT-ARC LEFT-ARC SHIFT SHIFT LEFT-ARC SHIFT SHIFT RIGHT-ARC RIGHT-ARC

33/47 Transition-based Parsing Example

Stack S

ROOT Buffer B

we vigorously wash our cats who stink

Actions: SHIFT SHIFT SHIFT LEFT-ARC LEFT-ARC SHIFT SHIFT LEFT-ARC SHIFT SHIFT RIGHT-ARC RIGHT-ARC RIGHT-ARC

34/47 The Core of Transition-based Parsing

I At each iteration, choose among {SHIFT, RIGHT-ARC, LEFT-ARC}.

I Actually, among all L-labeled variants of RIGHT- and LEFT-ARC.

I Features can come from S, B, and the history of past actions – usually there is no decomposition into local structures.

I Training data: Dependency treebank trees converted into “oracle” transition sequences.

I These transition sequences give the right tree,

I 2 · n pairs: ⟨state, correct transition⟩.

I Each word gets SHIFTed once and participates as a child in one ARC.
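A minimal sketch of the transition system itself, replaying an assumed oracle action sequence for "we wash our cats" instead of calling a trained classifier; arc directions follow the RIGHT-ARC/LEFT-ARC definitions above.

```python
def parse(words, actions):
    stack, buffer, arcs = ["ROOT"], list(words), []
    for act in actions:
        if act == "SHIFT":
            stack.append(buffer.pop(0))
        elif act == "RIGHT-ARC":              # top of stack becomes a child of the item below it
            u, v = stack.pop(), stack.pop()
            arcs.append((v, u))               # arc v -> u; head v stays on the stack
            stack.append(v)
        elif act == "LEFT-ARC":               # item below the top becomes a child of the top
            u, v = stack.pop(), stack.pop()
            arcs.append((u, v))               # arc u -> v; head u stays on the stack
            stack.append(u)
    return arcs

words = ["we", "wash", "our", "cats"]
actions = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "SHIFT",
           "LEFT-ARC", "RIGHT-ARC", "RIGHT-ARC"]      # assumed oracle sequence
print(parse(words, actions))
# [('wash', 'we'), ('cats', 'our'), ('wash', 'cats'), ('ROOT', 'wash')]
```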

35/47 Transition-based Parsing: Remarks

I Can also be applied to phrase-structure parsing. Keyword: “shift-reduce” parsing.
I The algorithm for making decisions doesn’t need to be greedy; it can maintain multiple hypotheses.

I e.g., beam search

I Potential flaw: the classifier is typically trained under the assumption that previous classification decisions were all correct. As yet, no principled solution to this problem, but there are approximations based on “dynamic oracles”.

36/47 Dependency Parsing Evaluation

I Unlabeled attachment score: Did you identify the head and the dependent correctly?

I Labeled attachment score: Did you identify the head and the dependent AND the label correctly?
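A small sketch of how the two scores could be computed from (head, child, label) tuples; the toy gold/predicted trees and their labels are invented for illustration.

```python
def attachment_scores(gold, pred):
    gold_heads = {c: (p, l) for p, c, l in gold}
    pred_heads = {c: (p, l) for p, c, l in pred}
    n = len(gold_heads)
    uas = sum(pred_heads[c][0] == gold_heads[c][0] for c in gold_heads) / n   # head only
    las = sum(pred_heads[c] == gold_heads[c] for c in gold_heads) / n         # head + label
    return uas, las

gold = [(2, 1, "nsubj"), (0, 2, "root"), (4, 3, "poss"), (2, 4, "obj")]
pred = [(2, 1, "nsubj"), (0, 2, "root"), (4, 3, "det"), (2, 4, "obj")]
print(attachment_scores(gold, pred))   # (1.0, 0.75)
```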

37/47 Dependency Examples from Other Languages

38/47 Dependency Examples from Other Languages

Subj

Poss Mod Det Mod Mod Loc Adj Mod

Bu okul+da +ki öğrenci+ler+in en akıl +lı +sı şura+da dur +an küçük kız +dır

bu okul öğrenci en akıl şura dur küçük kız  [per-word morphological tags from the figure: +Det +Noun +Adj +Noun +Adv +Noun +Adj +Noun +Noun +Verb +Adj +Adj +Noun +Verb +A3sg +A3pl +A3sg +With +Zero +A3sg +Pos +Prespart +A3sg +Zero +Pnon +Pnon +Pnon +A3sg +Pnon +Pnon +Pres +Loc +Gen +Nom +P3sg +Loc +Nom +Cop +Nom +A3sg]
Gloss: This school-at+that-is student-s-' most intelligence+with+of there stand+ing little girl+is
Translation: The most intelligent of the students in this school is the little girl standing there.

39/47 Dependency Examples from Other Languages
Bu eski bahçedeki gülün böyle büyümesi herkesi çok etkiledi
40/47 Universal Dependencies
I A very recent project that aims to use a small set of “universal” labels and annotation guidelines (universaldependencies.org).

41/47 Universal Dependencies I A very recent project that aims to use a small set of “universal” labels and annotation guidelines (universaldependencies.org).

42/47 Universal Dependencies

I A very recent project that aims to use a small set of “universal” labels and annotation guidelines (universaldependencies.org).

43/47 State-of-the-art Dependency Parsers

I Stanford Parser

I Detailed Information at https://nlp.stanford.edu/software/lex-parser.shtml

I Demo at http://nlp.stanford.edu:8080/parser/

I MaltParser is the original transition-based dependency parser by Nivre.

I “MaltParser is a system for data-driven dependency parsing, which can be used to induce a parsing model from treebank data and to parse new data using an induced model.”

I Available at http://maltparser.org/

44/47 State-of-the-art Dependency Parser Performance CONLL Shared Task Results

45/47 State-of-the-art Dependency Parser Performance CONLL Shared Task Results

46/47 State-of-the-art Dependency Parser Performance CONLL Shared Task Results

47/47 11-411 Natural Language Processing Lexical Semantics

Kemal Oflazer

Carnegie Mellon University in Qatar

1/47 Lexical Semantics

The study of meanings of words:

I Decompositional

I Words have component meanings

I Total meanings are composed of these component meanings

I Ontological

I The meanings of words can be defined in relation to other words.

I Paradigmatic – Thesaurus-based

I Distributional

I The meanings of words can be defined in relation to their contexts among other words.

I Syntagmatic – meaning defined by syntactic context

2/47 Decompositional Lexical Semantics

I Assume that woman has (semantic) components [female], [human], and [adult].

I Man might have the components [male], [human], and [adult].

I Such “semantic features” can be combined to form more complicated meanings.

I Although this looks appealing, there is a bit of a chicken-and-egg situation.

I Scholars and language scientists have not yet developed a consensus about a common set of “semantic primitives.”

I Such a representation probably has to involve more structure than just a flat set of features per word.

3/47 Ontological Semantics

Relations between words/senses

I Synonymy

I Antonymy

I Hyponymy/Hypernymy

I Meronymy/Holonymy
Key resource: WordNet

4/47 Terminology: Lemma and Wordform

I A lemma or citation form

I Common stem, part of speech, rough semantics

I A wordform

I The (inflected) word as it appears in text

Wordform   Lemma
banks      bank
sung       sing
sang       sing
went       go
goes       go

5/47 Lemmas have Senses

I One lemma “bank” can have many meanings:

I Sense 1: “... a bank1 can hold the investments in a custodial account ...”
I Sense 2: “... as agriculture burgeons on the east bank2 the river will shrink even more.”

I Sense (or word sense)

I A discrete representation of an aspect of a word’s meaning.

I The lemma bank here has two senses.

6/47 Homonymy

I Homonyms: words that share a form but have unrelated, distinct meanings:

I bank1: financial institution, bank2: sloping land
I bat1: club for hitting a ball, bat2: nocturnal flying mammal

I Homographs: Same spelling (bank/bank, bat/bat)

I Homophones: Same pronunciation

I write and right

I piece and peace

7/47 Homonymy causes problems for NLP applications

I Information retrieval

I “bat care”

I Machine Translation

I bat: murcielago´ (animal) or bate (for baseball)

I Text-to-Speech

I bass (stringed instrument) vs. bass (fish)

8/47 Polysemy

I The bank1 was constructed in 1875 out of local red brick.
I I withdrew the money from the bank2.
I Are those the same sense?

I bank1: “The building belonging to a financial institution”
I bank2: “A financial institution”

I A polysemous word has multiple related meanings.

I Most non-rare words have multiple meanings

9/47 Metonymy/Systematic Polysemy

I Lots of types of polysemy are systematic

I school, university, hospital

I All can mean the institution or the building.

I A systematic relation

I Building ⇔ Organization

I Other examples of such polysemy

I Author (Jane Austen wrote Emma) ⇔ Works of Author (I love Jane Austen)

I Tree (Plums have beautiful blossoms) ⇔ Fruit (I ate a preserved plum)

10/47 How do we know when a word has multiple senses?

I The “zeugma” test¹: Two senses of serve?

I Which flights serve breakfast?

I Does Qatar Airways serve Philadelphia?

I ?Does Qatar Airways serve breakfast and Washington?

I Since this conjunction sounds weird, we say that these are two different senses of “serve”

I “The farmers in the valley grew potatoes, peanuts, and bored.”

I “He lost his coat and his temper.”

¹ A zeugma is an interesting device that can cause confusion in sentences, while also adding some flavor.
11/47 Synonymy

I Words a and b share an identical sense or have the same meaning in some or all contexts.

I filbert / hazelnut

I couch / sofa

I big / large

I automobile / car

I vomit / throw up
I water / H2O

I Synonyms can be substituted for each other in all situations.
I True synonymy is relatively rare compared to other lexical relations.

I Substitution may not preserve acceptability, based on notions of politeness, slang, register, genre, etc.
I water / H2O
I big / large
I bravery / courage

I Bravery is the ability to confront pain, danger, or attempts of intimidation without any feeling of fear.
I Courage, on the other hand, is the ability to undertake an overwhelming difficulty or pain despite the imminent and unavoidable presence of fear.

12/47 Synonymy

Synonymy is a relation between senses rather than words.

I Consider the words big and large. Are they synonyms?

I How big is that plane?

I Would I be flying on a large or small plane?

I How about here:

I Miss Nelson became a kind of big sister to Benjamin.

I ?Miss Nelson became a kind of large sister to Benjamin.

I Why?

I big has a sense that means being older, or grown up

I large lacks this sense.

13/47 Antonymy

I Lexical items a and b have senses which are “opposite”, with respect to one feature of meaning

I Otherwise they are similar

I dark/light

I short/long

I fast/slow

I rise/fall

I hot/cold

I up/down

I in/out

I More formally: antonyms can

I define a binary opposition or be at opposite ends of a scale (long/short, fast/slow)

I or be reversives (rise/fall, up/down)

I Antonymy is much more common than true synonymy.

I Antonymy is not always well defined, especially for nouns (but for other words as well).

14/47 Hyponymy/Hypernymy

I The “is-a” relations.

I Lexical item a is a hyponym of lexical item b if a is a kind of b (if a sense of b refers to a superset of the referent of a sense of a).

I screwdriver is a hyponym of tool.

I screwdriver is also a hyponym of drink.

I car is a hyponym of vehicle

I mango is a hyponym of fruit

I Hypernymy is the converse of hyponymy.

I tool and drink are hypernyms of screwdriver.

I vehicle is a hypernym of car

15/47 Hyponymy more formally

I Extensional

I The class denoted by the superordinate (e.g., vehicle) extensionally includes the class denoted by the hyponym (e.g. car).

I Entailment

I A sense A is a hyponym of sense B if being an A entails being a B (e.g. if it is car, it is a vehicle)

I Hyponymy is usually transitive

I If A is a hyponym of B and B is a hyponym of C ⇒ A is a hyponym of C.

I Another name is the IS-A hierarchy

I A IS-A B

I B subsumes A

16/47 Hyponyms and Instances

I An instance is an individual, a proper noun that is a unique entity

I Doha/San Francisco/London are instances of city.

I But city is a class

I city is a hyponym of municipality ... location ...

17/47 Meronymy/Holonymy

I The “part-of” relation

I Lexical item a is a meronym of lexical item b if a sense of a is a part of/a member of a sense of b.

I hand is a meronym of body.

I congressperson is a meronym of congress.

I Holonymy (think: whole) is the converse of meronymy.

I body is a holonym of hand.

18/47 A Lexical Mini-ontology

19/47 WordNet

I A hierarchically organized database of (English) word senses.

I George A. Miller (1995). WordNet: A Lexical Database for English. Communications of the ACM Vol. 38, No. 11: 39-41.

I Available at wordnet.princeton.edu I Provides a set of three lexical databases:

I Nouns

I Verbs

I Adjectives and adverbs.

I Relations are between senses, not lexical items (words).

I Application Programming Interfaces (APIs) are available for many languages and toolkits, including a Python interface via NLTK.

I WordNet 3.0
  Category    Unique Strings
  Noun        117,197
  Verb        11,529
  Adjective   22,429
  Adverb      4,481
20/47 Synsets

I Primitive in WordNet: Synsets (roughly: synonym sets)

I Words that can be given the same gloss or definition.

I Does not require absolute synonymy

I For example: {chump, fool, gull, mark, patsy, fall guy, sucker, soft touch, mug}

I Other lexical relations, like antonymy, hyponymy, and meronymy are between synsets, not between senses directly.

21/47 Synsets for dog (n)

I S: (n) dog, domestic dog, Canis familiaris (a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds) “the dog barked all night”

I S: (n) frump, dog (a dull unattractive unpleasant girl or woman) “she got a reputation as a frump”, “she’s a real dog”

I S: (n) dog (informal term for a man) “you lucky dog”

I S: (n) cad, bounder, blackguard, dog, hound, heel (someone who is morally reprehensible) “you dirty dog”

I S: (n) frank, frankfurter, hotdog, hot dog, dog, wiener, wienerwurst, weenie (a smooth-textured sausage of minced beef or pork, usually smoked; often served on a bread roll)

I S: (n) pawl, detent, click, dog (a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward)

I S: (n) andiron, firedog, dog, dog-iron (metal supports for logs in a fireplace) “the andirons were too hot to touch”

22/47 Synsets for bass in WordNet

23/47 Hierarchy for bass3 in WordNet

24/47 The IS-A Hierarchy for fish (n)

I fish (any of various mostly cold-blooded aquatic vertebrates usually having scales and breathing through gills)

I aquatic vertebrate (animal living wholly or chiefly in or on water)

I vertebrate, craniate (animals having a bony or cartilaginous skeleton with a segmented spinal column and a large brain enclosed in a skull or cranium)

I chordate (any animal of the phylum Chordata having a notochord or spinal column)

I animal, animate being, beast, brute, creature, fauna (a living organism characterized by voluntary movement)

I organism, being (a living thing that has (or can develop) the ability to act or function independently)

I living thing, animate thing (a living (or once living) entity)

I whole, unit (an assemblage of parts that is regarded as a single entity)

I object, physical object (a tangible and visible entity; an entity that can cast a shadow)

I entity (that which is perceived or known or inferred to have its own distinct existence (living or nonliving))

25/47 WordNet Noun Relations

26/47 WordNet Verb Relations

27/47 Other WordNet Hierarchy Fragment Examples

28/47 Other WordNet Hierarchy Fragment Examples

29/47 Other WordNet Hierarchy Fragment Examples

30/47 WordNet as as Graph

31/47 Supersenses in WordNet

Super senses are top-level hypernyms in the hierarchy.

32/47 WordNets for Other Languages

I globalwordnet.org/wordnets-in-the-world/ lists WordNets for tens of languages.

I Many of these WordNets are linked through ILI – Interlingual Index numbers.

33/47 Word Similarity

I Synonymy: a binary relation

I Two words are either synonymous or not

I Similarity (or distance): a looser metric

I Two words are more similar if they share more features of meaning

I Similarity is properly a relation between senses

I The word “bank” is not similar to the word “slope”

I bank1 is similar to fund3

I bank2 is similar to slope5

I But we will compute similarity over both words and senses.

34/47 Why Word Similarity?

I A practical component in lots of NLP tasks

I Question answering

I Natural language generation

I Automatic essay grading

I Plagiarism detection

I A theoretical component in many linguistic and cognitive tasks

I Historical semantics

I Models of human word learning

I Morphology and grammar induction

35/47 Similarity and Relatedness

I We often distinguish word similarity from word relatedness

I Similar words: near-synonyms

I Related words: can be related in any way

I car, bicycle: similar

I car, gasoline: related, not similar

36/47 Two Classes of Similarity Algorithms

I WordNet/Thesaurus-based algorithms

I Are words “nearby” in hypernym hierarchy?

I Do words have similar glosses (definitions)?

I Distributional algorithms

I Do words have similar distributional contexts?

I Distributional (Vector) semantics.

37/47 Path-based Similarity

I Two concepts (senses/synsets) are similar if they are near each other in the hierarchy

I They have a short path between them

I Synsets have path 1 to themselves.

38/47 Refinements

I pathlen(c1, c2) = 1+number of edges in the shortest path in the hypernym graph between sense nodes c1 and c2

I simpath(c1, c2) = 1 / pathlen(c1, c2)   (ranges between 0 and 1)

I wordsim(w1, w2) = max over c1 ∈ senses(w1), c2 ∈ senses(w2) of simpath(c1, c2)
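I A minimal Python sketch (not from the course materials) of the wordsim computation above, using NLTK's WordNet interface; it assumes nltk and its wordnet corpus are installed:

    from itertools import product
    from nltk.corpus import wordnet as wn

    def wordsim(w1, w2):
        """Max over sense pairs of simpath(c1, c2) = 1 / pathlen(c1, c2)."""
        pairs = product(wn.synsets(w1), wn.synsets(w2))
        # path_similarity already implements 1 / (1 + shortest-path edges)
        scores = [c1.path_similarity(c2) for c1, c2 in pairs]
        return max((s for s in scores if s is not None), default=0.0)

    print(wordsim("nickel", "coin"))    # high: the senses are close in the hierarchy
    print(wordsim("nickel", "money"))   # lower: connected only through more abstract nodes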

39/47 Example for Path-based Similarity

I simpath(nickel, coin) = 1/2 = .5

I simpath(fund, budget) = 1/2 = .5

I simpath(nickel, currency) = 1/4 = .25

I simpath(nickel, money) = 1/6 = .17

I simpath(coinage, Richter scale) = 1/6 = .17

40/47 Problem with Basic Path-based Similarity

I Assumes each link represents a uniform distance

I But nickel to money seems to us to be closer than nickel to standard

I Nodes high in the hierarchy are very abstract

I We instead want a metric that

I Represents the cost of each edge independently

I Ranks words connected only through abstract nodes as less similar.

41/47 Information Content Similarity Metrics

I Define p(c) as the probability that a randomly selected word in a corpus is an instance of concept c.

I Formally: there is a distinct random variable, ranging over words, associated with each concept in the hierarchy. For a given concept, each observed word/lemma is either

I a member of that concept with probability p(c)

I not a member of that concept with probability 1 − p(c)

I All words are members of the root node (Entity): p(root) = 1

I The lower a node in hierarchy, the lower its probability

42/47 Information Content Similarity Metrics

I Train by counting in a corpus

I Each instance of hill counts toward frequency of natural elevation, geological-formation, entity, etc.

I Let words(c) be the set of all words that are descendants of node c

I words(geological-formation) = {hill, ridge, grotto, coast, cave, shore, natural elevation}

I words(natural elevation) = {hill, ridge}

I For N words in the corpus:

    p(c) = ( Σ over w ∈ words(c) of count(w) ) / N

43/47 Information Content: Definitions

I Information Content: IC(c) = − log p(c)

I Most informative subsumer (lowest common subsumer)

I LCS(c1, c2) = The most informative (lowest) node in the hierarchy subsuming both c1 and c2.

44/47 The Resnik Method

I The similarity between two words is related to their common information

I The more two words have in common, the more similar they are

I Resnik: measure common information as:

simresnik(c1, c2) = − log p(LCS(c1, c2))

I simresnik(hill, coast) = − log p(LCS(hill, coast)) = − log p(geological-formation) = 6.34

45/47 The Dekang Lin Method

I The similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are.

I simlin(c1, c2) = 2 log p(LCS(c1, c2)) / ( log p(c1) + log p(c2) )

I simlin(hill, coast) = 2 log p(geological-formation) / ( log p(hill) + log p(coast) ) = 0.59
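I A sketch of the Resnik and Lin measures via NLTK (assumes the wordnet and wordnet_ic corpora are downloaded; the exact values depend on the information-content counts used, so they need not match the numbers above):

    from nltk.corpus import wordnet as wn, wordnet_ic

    brown_ic = wordnet_ic.ic('ic-brown.dat')           # IC counts estimated from the Brown corpus
    hill, coast = wn.synset('hill.n.01'), wn.synset('coast.n.01')

    print(hill.res_similarity(coast, brown_ic))        # Resnik: -log p(LCS(c1, c2))
    print(hill.lin_similarity(coast, brown_ic))        # Lin: 2 log p(LCS) / (log p(c1) + log p(c2))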

46/47 Evaluating Similarity

I Extrinsic evaluation (Task-based)

I Question answering

I Essay grading

I Intrinsic evaluation: Correlation between algorithm and human word similarity ratings

I Wordsim353 task: 353 noun pairs rated 0-10. sim(plane, car) = 5.77

I Taking TOEFL multiple-choice vocabulary tests

I Levied is closest in meaning to: imposed, believed, requested, correlated.

47/47 11-411 Natural Language Processing Distributional/Vector Semantics

Kemal Oflazer

Carnegie Mellon University in Qatar

1/55 The Distributional Hypothesis

I Want to know the meaning of a word? Find what words occur with it.

I Leonard Bloomfield

I Edward Sapir

I Zellig Harris (first formalization)

I “oculist and eye-doctor . . . occur in almost the same environments”

I “If A and B have almost identical environments we say that they are synonyms.”

I The best known formulation comes from J.R. Firth:

I “You shall know a word by the company it keeps.”

2/55 Contexts for Beef

I This is called a concordance.

3/55 Contexts for Chicken

4/55 Intuition for Distributional Word Similarity

I Consider

I A bottle of pocarisweat is on the table.

I Everybody likes pocarisweat.

I Pocarisweat makes you feel refreshed.

I They make pocarisweat out of ginger.

I From context words humans can guess pocarisweat means a beverage like coke.

I So the intuition is that two words are similar if they have similar word contexts.

5/55 Why Vector Models of Meaning?

I Computing similarity between words:

I fast is similar to rapid

I tall is similar to height

I Application: Question answering:

I Question: “How tall is Mt. Everest?”

I Candidate A: “The official height of Mount Everest is 29029 feet.”

6/55 Word Similarity for Plagiarism Detection

7/55 Vector Models

I Sparse vector representations:

I Mutual-information weighted word co-occurrence matrices.

I Dense vector representations:

I Singular value decomposition (and Latent Semantic Analysis)

I Neural-network-inspired models (skip-grams, CBOW)

I Brown clusters

8/55 Shared Intuition

I Model the meaning of a word by “embedding” in a vector space.

I The meaning of a word is a vector of numbers:

I Vector models are also called embeddings.

I In contrast, word meaning is represented in many (early) NLP applications by a vocabulary index (“word number 545”)

9/55 Term-document Matrix

I Each cell is the count of term t in a document d (tf_t,d).

I Each document is a count vector in N^|V|, a column of the matrix below.

                  As You Like It   Twelfth Night   Julius Caesar   Henry V
    battle              1                1                8           15
    soldier             2                2               12           36
    fool               37               58                1            5
    clown               6              117                0            0

10/55 Term-document Matrix

I Two documents are similar if their vectors (the columns of the term-document matrix above) are similar.

11/55 Term-document Matrix

I Each word is a count vector in N^|D|, a row of the term-document matrix above.

12/55 Term-document Matrix

I Two words are similar if their vectors (the rows of the term-document matrix above) are similar.

13/55 Term-context Matrix for Word Similarity

I Two words are similar if their context vectors are similar.

                  aardvark   computer   data   pinch   result   sugar   ...
    apricot           0          0        0      1        0       1
    pineapple         0          0        0      1        0       1
    digital           0          2        1      0        1       0
    information       0          1        6      0        4       0

14/55 Word–Word or Word–Context Matrix

I Instead of entire documents, use smaller contexts

I Paragraph

I Window of ±4 words

I A word is now defined by a vector over counts of words in context.

I If a word wj occurs in the context of wi, increase countij.

I Assuming we have V words,

I Each vector is now of length V.

I The word-word matrix is V × V.
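I A small sketch (illustrative corpus and window size, not from the slides) of building the word-word co-occurrence counts just described, with a symmetric window of ±4 words:

    from collections import defaultdict

    def cooccurrence_counts(sentences, window=4):
        counts = defaultdict(lambda: defaultdict(int))
        for sent in sentences:
            for i, w in enumerate(sent):
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                for j in range(lo, hi):
                    if j != i:                      # every context word in the window
                        counts[w][sent[j]] += 1
        return counts

    corpus = [["the", "dog", "barked", "all", "night"],
              ["the", "cat", "slept", "all", "day"]]
    counts = cooccurrence_counts(corpus)
    print(counts["all"]["night"])                   # 1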

15/55 Sample Contexts of ±7 Words

(The same term-context matrix as shown above.)

16/55 The Word–Word Matrix

I We showed only a 4 × 6 matrix, but the real matrix is 50,000 × 50,000.

I So it is very sparse: Most values are 0.

I That’s OK, since there are lots of efficient algorithms for sparse matrices.

I The size of windows depends on the goals:

I The smaller the context (±1 − 3) , the more syntactic the representation

I The larger the context (±4 − 10), the more semantic the representation

17/55 Types of Co-occurrence between Two Words

I First-order co-occurrence (syntagmatic association):

I They are typically nearby each other.

I wrote is a first-order associate of book or poem.

I Second-order co-occurrence (paradigmatic association):

I They have similar neighbors.

I wrote is a second-order associate of words like said or remarked.

18/55 Problem with Raw Counts

I Raw word frequency is not a great measure of association between words.

I It is very skewed: “the” and “of” are very frequent, but maybe not the most discriminative.

I We would rather have a measure that asks whether a context word is particularly informative about the target word.

I Positive Pointwise Mutual Information (PPMI)

19/55 Pointwise Mutual Information

I Pointwise Mutual Information: Do events x and y co-occur more than if they were independent?

    PMI(x, y) = log2 [ p(x, y) / ( p(x) p(y) ) ]

I PMI between two words: Do target word w and context word c co-occur more than if they were independent?

    PMI(w, c) = log2 [ p(w, c) / ( p(w) p(c) ) ]

20/55 Positive Pointwise Mutual Information

I PMI ranges from −∞ to +∞

I But the negative values are problematic:

I Things are co-occurring less than we expect by chance

I Unreliable without enormous corpora

I Imagine w1 and w2 whose probability is each 10^-6.

I Hard to be sure p(w1, w2) is significantly different from 10^-12.

I Furthermore it’s not clear people are good at “unrelatedness”.

I So we just replace negative PMI values by 0.

    PPMI(w, c) = max( log2 [ p(w, c) / ( p(w) p(c) ) ], 0 )

21/55 Computing PPMI on a Term-Context Matrix

I We have a matrix F with V rows (words) and C columns (contexts) (in general C = V).

I fij is how many times word wi co-occurs in the context of word cj.

    pij  = fij / Σi Σj fij
    pi*  = ( Σj fij ) / Σi Σj fij
    p*j  = ( Σi fij ) / Σi Σj fij
    pmiij  = log2 ( pij / ( pi* p*j ) )
    ppmiij = max( pmiij, 0 )

22/55 Example

    Count(w, c)     computer   data   pinch   result   sugar   | count(w)
    apricot             0        0      1       0        1     |     2
    pineapple           0        0      1       0        1     |     2
    digital             2        1      0       1        0     |     4
    information         1        6      0       4        0     |    11
    count(c)            3        7      2       5        2     |    19

    p(w = information, c = data) = 6/19 = 0.32
    p(w = information) = 11/19 = 0.58
    p(c = data) = 7/19 = 0.37

    p(w, c)         computer   data   pinch   result   sugar   | p(w)
    apricot           0.00     0.00    0.05    0.00     0.05   | 0.11
    pineapple         0.00     0.00    0.05    0.00     0.05   | 0.11
    digital           0.11     0.05    0.00    0.05     0.00   | 0.21
    information       0.05     0.32    0.00    0.21     0.00   | 0.58
    p(c)              0.16     0.37    0.11    0.26     0.11

23/55 Example

(The p(w, c) and p(c) values are repeated from the previous slide.)

    pmi(information, data) = log2 [ 0.32 / (0.37 · 0.58) ] ≈ 0.58

    PPMI(w, c)      computer   data   pinch   result   sugar
    apricot            -        -      2.25     -       2.25
    pineapple          -        -      2.25     -       2.25
    digital           1.66     0.00     -      0.00      -
    information       0.00     0.58     -      0.47      -
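I A numpy sketch (not from the slides) of the PPMI computation, applied to the count table in this example; the resulting values reproduce the PPMI table above up to rounding:

    import numpy as np

    def ppmi(F):
        total = F.sum()
        p = F / total                        # p_ij
        p_w = p.sum(axis=1, keepdims=True)   # p_i*
        p_c = p.sum(axis=0, keepdims=True)   # p_*j
        with np.errstate(divide="ignore"):   # log2(0) for zero counts is clipped below
            pmi = np.log2(p / (p_w * p_c))
        return np.maximum(pmi, 0.0)

    # Rows: apricot, pineapple, digital, information.
    # Columns: computer, data, pinch, result, sugar.
    F = np.array([[0, 0, 1, 0, 1],
                  [0, 0, 1, 0, 1],
                  [2, 1, 0, 1, 0],
                  [1, 6, 0, 4, 0]], dtype=float)
    print(np.round(ppmi(F), 2))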

24/55 Issues with PPMI

I PMI is biased toward infrequent events.

I Very rare words have very high PMI values.

I Two solutions:

I Give rare words slightly higher probabilities

I Use add-one smoothing (which has a similar effect)

25/55 Issues with PPMI

I Raise the context probabilities to the power α = 0.75:

    PPMIα(w, c) = max( log2 [ p(w, c) / ( p(w) pα(c) ) ], 0 )

    pα(c) = count(c)^α / Σc count(c)^α

I This helps because pα(c) > p(c) for rare c.

I Consider two context words with p(a) = 0.99 and p(b) = 0.01:

    pα(a) = 0.99^0.75 / (0.99^0.75 + 0.01^0.75) = 0.97
    pα(b) = 0.01^0.75 / (0.99^0.75 + 0.01^0.75) = 0.03

26/55 Using Laplace Smoothing

Add-2 Smoothed Count(w, c):

                  computer   data   pinch   result   sugar
    apricot           2        2      3       2        3
    pineapple         2        2      3       2        3
    digital           4        3      2       3        2
    information       3        8      2       6        2

p(w, c), Add-2:

                  computer   data   pinch   result   sugar   | p(w)
    apricot         0.03      0.03   0.05    0.03     0.05   | 0.20
    pineapple       0.03      0.03   0.05    0.03     0.05   | 0.20
    digital         0.07      0.05   0.03    0.05     0.03   | 0.24
    information     0.05      0.14   0.03    0.10     0.03   | 0.36
    p(c)            0.19      0.25   0.17    0.22     0.17

27/55 PPMI vs. add-2 Smoothed PPMI

PPMI(w, c), unsmoothed:

                  computer   data   pinch   result   sugar
    apricot           -        -     2.25      -      2.25
    pineapple         -        -     2.25      -      2.25
    digital         1.66      0.00     -      0.00      -
    information     0.00      0.58     -      0.47      -

PPMI(w, c), add-2 smoothed:

                  computer   data   pinch   result   sugar
    apricot         0.00      0.00   0.56    0.00     0.56
    pineapple       0.00      0.00   0.56    0.00     0.56
    digital         0.62      0.00   0.00    0.00     0.00
    information     0.00      0.58   0.00    0.37     0.00

28/55 Measuring Similarity

I Given two target words represented with vectors v and w.

I The dot product or inner product is usually used as the basis for similarity.

    v · w = Σ over i = 1..N of vi wi = v1 w1 + v2 w2 + ··· + vN wN = |v| |w| cos θ

I v · w is high when two vectors have large values in the same dimensions.

I v · w is low (in fact 0) with zeros in complementary distribution.

I We also do not want the similarity to be sensitive to word-frequency.

I So normalize by vector length and use the cosine as the similarity: cos(v, w) = (v · w) / ( |v| |w| )
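I A minimal cosine-similarity sketch over two count (or PPMI) vectors; the vectors are illustrative, taken from the rows for digital and information above:

    import numpy as np

    def cosine(v, w):
        return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

    digital     = np.array([2.0, 1.0, 0.0, 1.0, 0.0])
    information = np.array([1.0, 6.0, 0.0, 4.0, 0.0])
    print(round(cosine(digital, information), 3))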

29/55 Other Similarity Measures in the Literature

I simcosine(v, w) = (v · w) / ( |v| |w| )

I simJaccard(v, w) = Σi min(vi, wi) / Σi max(vi, wi)

I simDice(v, w) = 2 Σi min(vi, wi) / Σi (vi + wi)

30/55 Using Syntax to Define Context

I “The meaning of entities, and the meaning of grammatical relations among them, is related to the restriction of combinations of these entities relative to other entities.” (Zellig Harris (1968))

I Two words are similar if they appear in similar syntactic contexts.

I duty and responsibility have similar syntactic distribution

I Modified by Adjectives: additional, administrative, assumed, collective, congressional, constitutional,... I Objects of Verbs: assert, assign, assume, attend to, avoid, become, breach,...

31/55 Co-occurence Vectors based on Syntactic Dependencies

I Each context dimension is a context word in one of R grammatical relations.

I Each word vector now has R · V entries.

I Variants have V dimensions with the count being total for R relations.

32/55 Sparse vs. Dense Vectors

I PPMI vectors are

I long (length in 10s of thousands)

I sparse (most elements are 0)

I Alternative: learn vectors which are

I short (length in several hundreds)

I dense (most elements are non-zero)

33/55 Why Dense Vectors?

I Short vectors may be easier to use as features in machine learning (fewer weights to tune).

I Dense vectors may generalize better than storing explicit counts.

I They may do better at capturing synonymy:

I car and automobile are synonyms

I But they are represented as distinct dimensions

I This fails to capture similarity between a word with car as a neighbor and a word with automobile as a neighbor

34/55 Methods for Getting Short Dense Vectors

I Singular Value Decomposition (SVD)

I A special case of this is called LSA - Latent Semantic Analysis

I “Neural Language Model”-inspired predictive models.

I skip-grams and continuous bag-of-words (CBOW)

I Brown clustering

35/55 Dense Vectors via SVD - Intuition

I Approximate an N-dimensional dataset using fewer dimensions

I By rotating the axes into a new space along the dimension with the most variance

I Then repeat with the next dimension, which captures the next most variance, etc.

I Many such (related) methods:

I PCA - principal components analysis

I Factor Analysis

I SVD

36/55 Dimensionality Reduction

37/55 Singular Value Decomposition

I Any square v × v matrix (of rank v) X equals the product of three matrices.

X W S C x11 ... x1v w11 ... w1m σ11 ... 0  c11 ... c1c x21 ... x2v   =  . .. .  ×  . .. .  ×  . .. .   . .. .   . . .   . . .   . . .   . . .  wv1 ... wvv 0 . . . σvv cm1 ... cvv xv1 ... xvv v × v v × v v × v v × v

I v columns in W are orthogonal to each other and are ordered by the amount of variance each new dimension accounts for.

I S is a diagonal matrix of eigenvalues expressing the importance of each dimension.

I C has v rows for the singular values and v columns corresponding to the original contexts.

38/55 Reducing Dimensionality with Truncated SVD

X W S C x11 ... x1c w11 ... w1m σ11 ... 0  c11 ... c1c x21 ... x2c   =  . .. .  ×  . .. .  ×  . .. .   . .. .   . . .   . . .   . . .   . . .  wv1 ... wvv 0 . . . σvv cm1 ... cvv xv1 ... xvv v × v v × v v × v v × v

X W S C x11 ... x1v w11 ... w1k σ11 ... 0  c11 ... c1v x21 ... x2v   ≈  . .. .  ×  . .. .  ×  . .. .   . .. .   . . .   . . .   . . .   . . .  wv1 ... wvk 0 . . . σkk cm1 ... ckv xv1 ... xvv v × c v × k k × k k × v

39/55 Truncated SVD Produces Embeddings

  w11 ... w1k w21 ... w2k  . . .   . .. .  wv1 ... wvk

I Each row of W matrix is a k-dimensional representation of each word w.

I k may range from 50 to 1000
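I A sketch of producing k-dimensional embeddings by truncated SVD; numpy is assumed, and X here is the tiny term-context matrix from the PPMI example, used only for illustration:

    import numpy as np

    X = np.array([[0., 0., 1., 0., 1.],
                  [0., 0., 1., 0., 1.],
                  [2., 1., 0., 1., 0.],
                  [1., 6., 0., 4., 0.]])

    W, S, C = np.linalg.svd(X, full_matrices=False)   # S is the vector of singular values
    k = 2
    embeddings = W[:, :k]          # each row: a k-dimensional word vector
    # A common variant scales by the singular values: W[:, :k] * S[:k]
    print(embeddings.shape)        # (4, 2)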

40/55 Embeddings vs Sparse Vectors

I Dense SVD embeddings sometimes work better than sparse PPMI matrices at tasks like word similarity

I Denoising: low-order dimensions may represent unimportant information

I Truncation may help the models generalize better to unseen data.

I Having a smaller number of dimensions may make it easier for classifiers to properly weight the dimensions for the task.

I Dense models may do better at capturing higher order co-occurrence.

41/55 Embeddings Inspired by Neural Language Models

I Skip-gram and CBOW learn embeddings as part of the process of word prediction.

I Train a neural network to predict neighboring words

I Inspired by neural net language models.

I In so doing, learn dense embeddings for the words in the training corpus.

I Advantages:

I Fast, easy to train (much faster than SVD).

I Available online in the word2vec package.

I Including sets of pretrained embeddings!

42/55 Skip-grams

I From the current word wt, predict other words in a context window of 2C words.

I For example, we are given wt and we are predicting one of the words in

[wt−2, wt−1, wt+1, wt+2]

43/55 Compressing Words

44/55 One-hot Vector Representation

I Each word in the vocabulary is represented with a vector of length |V|.

I 1 for the index target word wt and 0 for other words.

I So if “popsicle” is vocabulary word 5, the one-hot vector is [0, 0, 0, 0, 1, 0, 0, 0, 0,..., 0]

45/55 Neural Network Architecture

46/55 Where are the Word Embeddings?

I The rows of the first matrix actually are the word embeddings.

I Multiplication of the one-hot input vector “selects” the relevant row as the output to hidden layer.

47/55 Output Probabilities

I The output vector is computed by multiplying the hidden-layer vector with a second matrix (the C matrix).

I The value computed for output unit k = ck · wj where wj is the hidden layer vector (for word j).

I Except, the outputs are not probabilities!

I We use the same scaling idea we used earlier and then use softmax .

    p(wk is in the context of wj) = exp(ck · vj) / Σi exp(ci · vj)
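I A toy numpy sketch of the skip-gram forward pass just described: the one-hot input selects a row of W (the embedding), and a softmax over the rows of C gives context-word probabilities. The sizes and random weights are purely illustrative:

    import numpy as np

    V, d = 6, 3                          # vocabulary size, embedding dimension
    rng = np.random.default_rng(0)
    W = rng.normal(size=(V, d))          # input (word) embeddings
    C = rng.normal(size=(V, d))          # output (context) embeddings

    j = 4                                # index of the current word w_j
    v_j = W[j]                           # one-hot @ W just "selects" this row
    scores = C @ v_j                     # c_k . v_j for every candidate context word
    probs = np.exp(scores) / np.exp(scores).sum()    # softmax
    print(probs.round(3), probs.sum())   # a distribution over context words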

48/55 Training for Embeddings

49/55 Training for Embeddings

I You have a huge network (say you have 1M words and embedding dimension of 300).

I You have 300M entries in each of the matrices.

I Running gradient descent (via backpropagation) is very slow.

I Some innovations used:

I Treat common phrases like “New York” as single tokens.

I Reduce vocabulary and training samples by ignoring infrequent words.

I Instead of computing probabilities through the expensive scaling process, use negative sampling to only update a small number of the weights each time.

I It turns out these improve the quality of the vectors, too!

50/55 Properties of Embeddings

I Nearest words to some embeddings in the d-dimensional space.

I Relation meanings

I vector(king) − vector(man) + vector(woman) ≈ vector(queen)

I vector(Paris) − vector(France) + vector(Italy) ≈ vector(Rome)

51/55 Brown Clustering

I An agglomerative clustering algorithm that clusters words based on which words precede or follow them

I These word clusters can be turned into a kind of vector

I We will do a brief outline here.

52/55 Brown Clustering

I Each word is initially assigned to its own cluster.

I We now consider merging each pair of clusters. The highest quality merge is chosen.

I Quality = merges two words that have similar probabilities of preceding and following words.

I More technically quality = smallest decrease in the likelihood of the corpus according to a class-based language model

I Clustering proceeds until all words are in one big cluster.

53/55 Brown Clusters as Vectors

I By tracing the order in which clusters are merged, the model builds a binary tree from bottom to top.

I Each word represented by binary string = path from root to leaf

I Each intermediate node is a cluster

I Chairman represented by 0010, “months” by 01, and verbs by 1.

54/55 Class-based Language Model

I Suppose each word is in some class ci.

p(wi | wi−1) = p(ci | ci−1) p(wi | ci)

55/55 11-411 Natural Language Processing Word Sense Disambiguation

Kemal Oflazer

Carnegie Mellon University in Qatar

1/23 Homonymy and Polysemy

I As we have seen, multiple words can be spelled the same way (homonymy; technically homography)

I The same word can also have different, related senses (polysemy)

I Various NLP tasks require resolving the ambiguities produced by homonymy and polysemy.

I Word sense disambiguation (WSD)

2/23 Versions of the WSD Task

I Lexical sample

I Choose a sample of words.

I Choose a sample of senses for those words.

I Identify the right sense for each word in the sample.

I All-words

I Systems are given the entire text.

I Systems are given a lexicon with senses for every content word in the text.

I Identify the right sense for each content word in the text .

3/23 Supervised WSD

I If we have hand-labelled data, we can do supervised WSD.

I Lexical sample tasks

I Line-hard-serve corpus

I SENSEVAL corpora

I All-word tasks

I Semantic concordance: SemCor – subset of Brown Corpus manually tagged with WordNet senses.

I SENSEVAL-3

I Can be viewed as a classification task

4/23 Sample SemCor Data

    Your invitation to write_about Serge_Prokofieff to honor his 70_th Anniversary for the April issue of Sovietskaya_Muzyka is accepted with pleasure

5/23 What Features Should One Use?

I Warren Weaver commented in 1955 If one examines the words in a book, one at a time as through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of the words. [. . . ] But if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then if N is large enough one can unambiguously decide the meaning of the central word. [. . . ] The practical question is: “What minimum value of N will, at least in a tolerable fraction of cases, lead to the correct choice of meaning for the central word?”

I What information is available in that window of length N that allows us to do WSD?

6/23 What Features Should One Use?

I Collocation features

I Encode information about specific positions located to the left or right of the target word

I For example [wi−2, POSi−2, wi−1, POSi−1, wi+1, POSi+1, wi+2, POSi+2]

I For bass, e.g., [guitar, NN, and, CC, player, NN, stand, VB]

I Bag-of-words features

I Unordered set of words occurring in window

I Relative sequence is ignored

I Words are lemmatized

I Stop/Function words typically ignored.

I Used to capture domain

7/23 Naive Bayes for WSD

I Choose the most probable sense given the feature vector f, which can be formulated as

    ŝ = argmax over s ∈ S of p(s) · Π over j = 1..n of p(fj | s)

I Naive Bayes assumes features in f are independent (often not true)

I But usually Naive Bayes Classifiers perform well in practice.
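I A minimal sketch of naive Bayes WSD with bag-of-words context features, using scikit-learn; the tiny labeled contexts for “bass” are made up for illustration:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    contexts = ["he caught a huge bass in the lake",
                "grilled bass with lemon for dinner",
                "she plays bass guitar in a band",
                "turn up the bass on the speaker"]
    senses = ["fish", "fish", "music", "music"]

    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    clf.fit(contexts, senses)
    print(clf.predict(["the bass player tuned his instrument"]))   # likely ['music']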

8/23 Semisupervised WSD–Decision List Classifiers

I The decisions handed down by naive Bayes classifiers (and other similar ML algorithms) are difficult to interpret.

I It is not always clear why, for example, a particular classification was made.

I For reasons like this, some researchers have looked to decision list classifiers, a highly interpretable approach to WSD .

I We have a list of conditional statements.

I Item being classified falls through the cascade until a statement is true.

I The associated sense is then returned.

I Otherwise, a default sense is returned.

I Where does the list come from?

9/23 Decision List Features for WSD – Collocational Features

I Word immediately to the left or right of target:

I I have my bank1 statement.

I The river bank2 is muddy.

I Pair of words to immediate left or right of target:

I The world’s richest bank1 is here in New York.

I The river bank2 is muddy.

I Words found within k positions to left or right of target, where k is often 10-50 :

I My credit is just horrible because my bank1 has made several mistakes with my account and the balance is very low.

10/23 Learning a Decision List Classifier

I Each individual feature-value is a test.

I Features exists in a small context around the word.

I How discriminative is a feature among the senses?

I For a given ambiguous word compute

    weight(si, fj) = log [ p(si | fj) / p(¬si | fj) ]

where ¬si denotes all other senses of the word except si.

I Order the tests in descending order of weight, ignoring values ≤ 0.

I When testing the first test that matches gives the sense.

11/23 Example

I Given 2,000 instances of “bank”, 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense)

I p(s1) = 1,500/2,000 = .75

I p(s2) = 500/2,000 = .25

I Given “credit” occurs 200 times with bank/1 and 4 times with bank/2.

I p(credit) = 204/2,000 = .102

I p(credit | s1) = 200/1,500 = .133

I p(credit | s2) = 4/500 = .008

I From Bayes Rule

I p(s1 | credit) = .133 ∗ .75/.102 = .978

I p(s2 | credit) = .008 ∗ .25/.102 = .020

I Weights

I weight(s1, credit) = log 49.8 = 3.89

I weight(s2, credit) = log (1/49.8) = −3.89
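I A small sketch of the weight computation above in Python; the counts are those given on this slide, and the result matches the slide's 3.89 up to rounding of the intermediate values:

    import math

    n_s1, n_s2 = 1500, 500                # instances of bank/1 and bank/2
    credit_s1, credit_s2 = 200, 4         # "credit" nearby, per sense

    p_credit = (credit_s1 + credit_s2) / (n_s1 + n_s2)        # .102
    p_s1, p_s2 = n_s1 / (n_s1 + n_s2), n_s2 / (n_s1 + n_s2)   # .75, .25
    p_s1_given_credit = (credit_s1 / n_s1) * p_s1 / p_credit  # ~ .98
    p_s2_given_credit = (credit_s2 / n_s2) * p_s2 / p_credit  # ~ .02

    w = math.log(p_s1_given_credit / p_s2_given_credit)
    print(round(w, 2))                    # ~3.91 (slide: 3.89, via rounding)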

12/23 Using the Decision List

13/23 Evaluation of WSD

I Extrinsic Evaluation

I Also called task-based, end-to-end, and in vivo evaluation.

I Measures the contribution of a WSD (or other) component to a larger pipeline.

I Requires a large investment and is hard to generalize to other tasks.

I Intrinsic Evaluation

I Also called in vitro evaluation

I Measures the performance of the WSD (or other) component in isolation

I Does not necessarily tell you how well the component contributes to a real test – which is in general what you are interested in.

14/23 Baselines

I Most frequent sense

I Senses in WordNet are typically ordered from most to least frequent

I For each word, simply pick the most frequent

I Surprisingly accurate

I Lesk algorithm

I Really, a family of algorithms

I Measures overlap in words between gloss/examples and context

15/23 Simplified Lesk Algorithm

I The bank can guarantee deposits will eventually cover future tuition costs because it invests in adjustable-rate mortgage securities.

I Sense bank1 has two non-stopwords overlapping with the context above.

I Sense bank2 has no overlaps.
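I A sketch of simplified Lesk (not the course's implementation): pick the sense whose gloss and examples share the most non-stopwords with the context. NLTK's WordNet is assumed, and the stopword list here is a small illustrative one:

    from nltk.corpus import wordnet as wn

    STOP = {"the", "a", "an", "of", "in", "it", "is", "to", "and", "will", "be", "because", "that"}

    def simplified_lesk(word, context):
        ctx = {w.lower() for w in context.split() if w.lower() not in STOP}
        best, best_overlap = None, -1
        for sense in wn.synsets(word):
            signature = set(sense.definition().lower().split())
            for ex in sense.examples():
                signature |= set(ex.lower().split())
            overlap = len((signature - STOP) & ctx)
            if overlap > best_overlap:
                best, best_overlap = sense, overlap
        return best

    context = ("The bank can guarantee deposits will eventually cover future "
               "tuition costs because it invests in adjustable-rate mortgage securities")
    print(simplified_lesk("bank", context))   # likely the financial-institution sense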

16/23 Bootstrapping Algorithms

I There are bootstrapping techniques that can be used to obtain reasonable WSD results with minimal amounts of labelled data.

17/23 Bootstrapping Algorithms

I Yarowsky’s Algorithm (1995), builds a classifier for each ambiguous word.

I The algorithm is given a small seed set Λ0 of labeled instances of each sense and a much larger unlabeled corpus V0.

I Train an initial classifier on Λ0 and label V0, along with confidences.

I Add high-confidence labeled examples to the training set, giving Λ1.

I Train a new classifier and label V1, along with confidences.

I Add high-confidence labeled examples to the training set, giving Λ2.

I . . .

I Until no new examples can be added or a sufficiently accurate labeling is reached.

I Assumptions/Observations:

I One sense per collocation: Nearby words provide strong and consistent clues as to the sense of a target word

I One sense per discourse: The sense of a target word is highly consistent within a single document

18/23 Bootstrapping Example

19/23 State-of-the-art Results in WSD (2017)

20/23 State-of-the-art Results in WSD (2017)

21/23 Other Approaches – Ensembles

I Classifier error has two components: Bias and Variance

I The bias is error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).

I The variance is error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).

I Some algorithms (e.g., decision trees) try and build a representation of the training data - Low Bias/High Variance

I Others (e.g., Naive Bayes) assume a parametric form and don’t necessarily represent the training data - High Bias/Low Variance

I Combining classifiers with different bias variance characteristics can lead to improved overall accuracy.

22/23 Unsupervised Methods for Word Sense Discrimination/Induction

I Unsupervised learning identifies patterns in a large sample of data, without the benefit of any manually labeled examples or external knowledge sources.

I These patterns are used to divide the data into clusters, where each member of a cluster has more in common with the other members of its own cluster than any other.

I Important: If you remove manual labels from supervised data and cluster, you may not discover the same classes as in supervised learning:

I Supervised Classification identifies features that trigger a sense tag

I Unsupervised Clustering finds similarity between contexts

I Recent approaches to this use embeddings.

23/23 11-411 Natural Language Processing Semantic Roles Semantic Parsing

Kemal Oflazer

Carnegie Mellon University in Qatar

1/41 Semantics vs Syntax

I Syntactic theories and representations focus on the question of which strings in V⁺ are in the language.

I Semantics is about “understanding” what a string in V⁺ means.

I Sidestepping a lengthy and philosophical discussion of what “meaning” is, we will consider two meaning representations:

I Predicate-argument structures, also known as event frames.

I Truth conditions represented in first-order logic.

2/41 Motivating Example: Who did What to Whom?

I Warren bought the stock.

I They sold the stock to Warren.

I The stock was bought by Warren.

I The purchase of the stock by Warren surprised no one.

I Warren’s stock purchase surprised no one.

3/41 Motivating Example: Who did What to Whom?

I Warren bought the stock.

I They sold the stock to Warren.

I The stock was bought by Warren.

I The purchase of the stock by Warren surprised no one.

I Warren’s stock purchase surprised no one.

4/41 Motivating Example: Who did What to Whom?

I Warren bought the stock.

I They sold the stock to Warren.

I The stock was bought by Warren.

I The purchase of the stock by Warren surprised no one.

I Warren’s stock purchase surprised no one.

5/41 Motivating Example: Who did What to Whom?

I Warren bought the stock.

I They sold the stock to Warren.

I The stock was bought by Warren.

I The purchase of the stock by Warren surprised no one.

I Warren’s stock purchase surprised no one.

6/41 Motivating Example: Who did What to Whom?

I Warren bought the stock.

I They sold the stock to Warren.

I The stock was bought by Warren.

I The purchase of the stock by Warren surprised no one.

I Warren’s stock purchase surprised no one.

I In this buying/purchasing event/situation, Warren played the role of the buyer, and there was some stock that played the role of the thing purchased.

I Also, there was presumably a seller, only mentioned in one example.

I In some examples, a separate “event” involving surprise did not occur.

7/41 Semantic Roles: Breaking

I Ali broke the window.

I The window broke.

I Ali is always breaking things.

I The broken window testified to Ali’s malfeasance.

8/41 Semantic Roles: Breaking

I Ali broke the window.

I The window broke.(?)

I Ali is always breaking things.

I The broken window testified to Ali’s malfeasance.

I A breaking event has a BREAKER and a BREAKEE.

9/41 Semantic Roles: Eating

I Eat!

I We ate dinner.

I We already ate.

I The pies were eaten up quickly.

I Our gluttony was complete.

10/41 Semantic Roles: Eating

I Eat!(you, listener) ?

I We ate dinner.

I We already ate.

I The pies were eaten up quickly.

I Our gluttony was complete.

I An eating event has an EATER and a FOOD, neither of which needs to be mentioned explicitly.

11/41 Abstraction

BREAKER =? EATER

Both are actors that have some causal responsibility for changes in the world around them.

BREAKEE =? FOOD

Both are greatly affected by the event, which “happened to” them.

12/41 Thematic Roles

    AGENT         The waiter spilled the soup.
    EXPERIENCER   Ali has a headache.
    FORCE         The wind blows debris into our garden.
    THEME         Ali broke the window.
    RESULT        The city built a basketball court.
    CONTENT       Omar asked, “You saw Ali playing soccer?”
    INSTRUMENT    He broke the window with a hammer.
    BENEFICIARY   Jane made reservations for me.
    SOURCE        I flew in from New York.
    GOAL          I drove to Boston.

13/41 Verb Alternation Examples: Breaking and Giving

I Breaking:

I AGENT/subject; THEME/object; INSTRUMENT/PPwith

I INSTRUMENT/subject; THEME/object

I THEME/subject

I Giving:

I AGENT/subject; GOAL/object; THEME/second-object

I AGENT/subject; THEME/object; GOAL/PPto

I English verbs have been codified into classes that share patterns (e.g., verbs of throwing: throw/kick/pass)

14/41 Semantic Role Labeling

I Input: a sentence x

I Output: A collection of predicates, each consisting of

I a label sometimes called the frame

I a span

I a set of arguments, each consisting of

I a label usually called the role

I a span

15/41 The Importance of Lexicons

I Like syntax, any annotated dataset is the product of extensive development of conventions.

I Many conventions are specific to particular words, and this information is codified in structured objects called lexicons.

I You should think of every semantically annotated dataset as both the data and the lexicon.

I We consider two examples.

16/41 PropBank

I Frames are verb senses (with some extensions)

I Lexicon maps verb-sense-specific roles onto a small set of abstract roles (e.g., ARG0, ARG1, etc.)

I Annotated on top of the Penn Treebank, so that arguments are always constituents.

17/41 fall.01 (move downward)

I ARG1: logical subject, patient, thing falling

I ARG2: extent, amount fallen

I ARG3: starting point

I ARG4: ending point

I ARGM-LOC: medium

I Sales fell to $251.2 million from $278.8 million.

I The average junk bond fell by 4.2%.

I The meteor fell through the atmosphere, crashing into Palo Alto.

18/41 fall.01 (move downward)

I ARG1: logical subject, patient, thing falling

I ARG2: extent, amount fallen

I ARG3: starting point

I ARG4: ending point

I ARGM-LOC: medium

I Sales fell to $251.2 million from $278.8 million.

I The average junk bond fell by 4.2%.

I The meteor fell through the atmosphere, crashing into Palo Alto.

19/41 fall.01 (move downward)

I ARG1: logical subject, patient, thing falling

I ARG2: extent, amount fallen

I ARG3: starting point

I ARG4: ending point

I ARGM-LOC: medium

I Sales fell to $251.2 million from $278.8 million.

I The average junk bond fell by 4.2%.

I The meteor fell through the atmosphere, crashing into Palo Alto.

20/41 fall.01 (move downward)

I ARG1: logical subject, patient, thing falling

I ARG2: extent, amount fallen

I ARG3: starting point

I ARG4: ending point

I ARGM-LOC: medium

I Sales fell to $251.2 million from $278.8 million.

I The average junk bond fell by 4.2%.

I The meteor fell through the atmosphere, crashing into Palo Alto.

21/41 fall.08 (fall back, rely on in emergency)

I ARG0: thing falling back

I ARG1: thing fallen on

I World Bank president Paul Wolfowitz has fallen back on his last resort.

22/41 fall.08 (fall back, rely on in emergency)

I ARG0: thing falling back

I ARG1: thing fallen on

I World Bank president Paul Wolfowitz has fallen back on his last resort.

23/41 fall.08 (fall back, rely on in emergency)

I ARG0: thing falling back

I ARG1: thing fallen on

I World Bank president Paul Wolfowitz has fallen back on his last resort.

24/41 fall.10 (fall for a trick; be fooled by)

I ARG0: the fool

I ARG1: the trick

I Many people keep falling for the idea that lowering taxes on the rich benefits everyone.

25/41 fall.10 (fall for a trick; be fooled by)

I ARG0: the fool

I ARG1: the trick

I Many people keep falling for the idea that lowering taxes on the rich benefits everyone.

26/41 fall.10 (fall for a trick; be fooled by)

I ARG0: the fool

I ARG1: the trick

I Many people keep falling for the idea that lowering taxes on the rich benefits everyone.

27/41 FrameNet

I Frames can be any content word (verb, noun, adjective, adverb)

I About 1,000 frames, each with its own roles

I Both frames and roles are hierarchically organized

I Annotated without syntax, so that arguments can be anything

I Different philosophy:

I Micro roles defined according to frame

I Verb is in the background and frame is in the foreground.

I When a verb is “in” a frame it is allowed to use the associated roles.

28/41 change position on a scale

I ITEM: entity that has a position on the scale

I ATTRIBUTE: scalar property that the ITEM possesses

I DIFFERENCE: distance by which an ITEM changes its position

I FINAL STATE: the ITEM’s state after the change

I FINAL VALUE: position on the scale where ITEM ends up

I INITIAL STATE: the ITEM’s state before the change

I INITIAL VALUE: position on the scale from which the ITEM moves

I VALUE RANGE: portion of the scale along which values of ATTRIBUTE fluctuate

I DURATION: length of time over which the change occurs

I SPEED: rate of change of the value

29/41 FrameNet Example

    [Attacks on civilians]_ITEM [decreased]_change-position-on-a-scale [over the last four months]_DURATION

I The ATTRIBUTE is left unfilled but is understood from context (e.g., “number” or “frequency”).

30/41 change position on a scale

I Verbs: advance, climb, decline, decrease, diminish, dip, double, drop, dwindle, edge, explode, fall, fluctuate, gain, grow, increase, jump, move, mushroom, plummet, reach, rise, rocket, shift, skyrocket, slide, soar, swell, swing, triple, tumble

I Nouns: decline, decrease, escalation, explosion, fall, fluctuation, gain, growth, hike, increase, rise, shift, tumble

I Adverb: increasingly

I Frame hierarchy: event ⊃ . . . ⊃ change position on a scale ⊃ { change of temperature, proliferating in number }

31/41 The Semantic Role Labeling Task

I Given a syntactic parse, identify the appropriate role for each noun phrase (according to the scheme that you are using, e.g., PropBank, FrameNet or something else)

I Why is this useful?

I Why is it useful for some tasks that you cannot perform with just dependency parsing?

I What kind of semantic representation could you obtain if you had SRL?

I Why is this hard?

I Why is it harder than dependency parsing?

32/41 Semantic Role Labeling Methods

I Boils down to labeling spans with frames and role names.

I It is mostly about features.

I Some features for SRL

I The governing predicate (often the main verb)

I The phrase type of the constituent (NP, NP-SUBJ, etc)

I The headword of the constituent

I The part of speech of the headword

I The path from the constituent to the predicate

I The voice of the clause (active, passive, etc.)

I The binary linear position of the constituent with respect to the predicate (before or after)

I The subcategorization of the predicate

I Others: named entity tags, more complex path features, when particular nodes appear in the path, rightmost and leftmost words in the constituent, etc.

33/41 Example: Path Features

[Constituency parse of “The San Francisco Examiner issued a special edition around noon yesterday”: the subject NP-SBJ dominates “The San Francisco Examiner”, and the VP dominates the verb “issued”, its object NP “a special edition”, and the temporal PP-TMP “around noon yesterday”.]

Path from “The San Francisco Examiner” to “issued”: NP↑S↓VP↓VBD

Path from “a special edition” to “issued”: NP↑VP↓VBD

34/41 Sketch of an SRL Algorithm

35/41 Additional Steps for Efficiency

I Pruning

I Only a small number of constituents should ultimately be labeled

I Use heuristics to eliminate some constituents from consideration

I Preliminary Identification:

I Label each node as ARG or NONE with a binary classifier

I Classification

I Only then, perform 1-of-N classification to label the remaining ARG nodes with roles

36/41 Additional Information

I See framenet.icsi.berkeley.edu/fndrupal/ for additional information about the FrameNet Project.

I Semantic Parsing Demos at

I http://demo.ark.cs.cmu.edu/parse

I http://nlp.cs.lth.se/demonstrations/

37/41 Methods: Beyond Features

I The span-labeling decisions interact a lot!

I Presence of a frame increases the expectation of certain roles

I Roles for the same predicate should not overlap

I Some roles are mutually exclusive or require each other (e.g., “resemble”)

I Using syntax as a scaffold allows efficient prediction; you are essentially labeling the parse tree.

I Other methods include: discrete optimization, greedy and joint syntactic and semantic dependencies.

38/41 Related Problems in “Relational” Semantics

I Coreference resolution: which mentions (within or across texts) refer to the same entity or event?

I Entity linking: ground such mentions in a structured knowledge base (e.g., Wikipedia)

I Relation extraction: characterize the relation among specific mentions

39/41 General Remarks

I Semantic roles are just “syntax++” since they don’t allow much in the way of reasoning (e.g., question answering).

I Lexicon building is slow and requires expensive expertise. Can we do this for every (sub)language?

40/41 Snapshot

I We have had a taste of two branches of semantics:

I Lexical semantics (e.g., supersense tagging, WSD)

I Relational semantics (e.g., semantic role labeling)

I Coming up:

I Compositional Semantics

41/41 11-411 Natural Language Processing Compositional Semantics

Kemal Oflazer

Carnegie Mellon University in Qatar

1/34 Semantics Road Map

I Lexical semantics

I Vector semantics

I Semantic roles, semantic parsing

I Meaning representation languages and Compositional semantics

I Discourse and pragmatics

2/34 Bridging the Gap between Language and the World

I Meaning representation is the interface between the language and the world.

I Answering essay question on an exam.

I Deciding what to order at a restaurant.

I Recognizing a joke.

I Executing a command.

I Responding to a request.

3/34 Desirable Qualities of Meaning Representation Languages (MRL)

I Represent the state of the world, i.e., be a knowledge base

I Query the knowledge base to verify that a statement is true, or answers a question.

I “Does Bukhara serve vegetarian food?”

I Is serves(Bukhara, vegetarian food) in our knowledge base?

I Handle ambiguity, vagueness, and non-canonical forms

I “I want to eat someplace that’s close to the campus.”

I “something not too spicy”

I Support inference and reasoning.

I “Can vegetarians eat at Bukhara?”

4/34 Desirable Qualities of Meaning Representation Languages (MRL)

I Inputs that mean the same thing should have the same meaning representation.

I “Bukhara has vegetarian dishes.”

I “They have vegetarian food at Bukhara.”

I “Vegetarian dishes are served at Bukhara.”

I “Bukhara serves vegetarian fare.”

5/34 Variables and Expressiveness

I “ I would like to find a restaurant where I can get vegetarian food.”

I serves(x, vegetarian food)

I It should be possible to express all the relevant details

I “Qatar Airways flies Boeing 777s and Airbus 350s from Doha to the US”

6/34 Limitation

I We will focus on the basic requirements of meaning representation.

I These requirements do not include correctly interpreting statements like

I “Ford was hemorrhaging money.”

I “I could eat a horse.”

7/34 What do we Represent?

I Objects: people (John, Ali, Omar), cuisines (Thai, Indian), restaurants (Bukhara, Chef’s Garden), . . .

I John, Ali, Omar, Thai, Indian, Chinese, Bukhara, Chefs Garden, . . .

I Properties of Objects: Ali is picky, Bukhara is noisy, Bukhara is cheap, Indian is spicy, John, Ali and Omar are humans, Bukhara has long wait . . .

I picky={Ali}, noisy={Bukhara}, spicy={Indian}, human={Ali, John, Omar}...

I Relations between objects: Bukhara serves Indian, NY Steakhouse serves steak. Omar likes Chinese.

I serves(Bukhara, Indian), serves(NY Steakhouse, steak), likes(Omar, Chinese) ...

I Simple questions are easy:

I Is Bukhara noisy?

I Does Bukhara serve Chinese?

8/34 MRL: First-order Logic – A Quick Tour

I Term: any constant (e.g., Bukhara) or a variable

I Formula: defined inductively . . .

I if R is an n-ary relation and t1, . . . , tn are terms, then R(t1, . . . , tn) is a formula.

I if φ is a formula, then its negation ¬φ is a formula.

I if φ and ψ are formulas, then binary logical connectives can be used to create formulas:

I φ ∧ ψ

I φ ∨ ψ

I φ ⇒ ψ

I φ ⊕ ψ

I If φ is a formula and v is a variable, then quantifiers can be used to create formulas:

I Existential quantifier: ∃v : φ

I Universal quantifier: ∀v : φ

9/34 First-order Logic: Meta Theory

I Well-defined set-theoretic semantics

I Sound: You can’t prove false things.

I Complete: You can prove everything that logically follows from a set of axioms (e.g. with a “resolution theorem prover.”)

I Well-behaved, well-understood.

I But there are issues:

I “Meanings” of sentences are truth values.

I Only first-order (no quantifying over predicates).

I Not very good for “fluents” (time-varying things, real-valued quantities, etc.)

I Brittle: anything follows from any contradiction (!)

I Gödel Incompleteness: “This statement has no proof.”

I Finite axiom sets are incomplete with respect to the real world.

I Most systems use its descriptive apparatus (with extensions) but not its inference mechanisms.

10/34 Translating between First-order Logic and Natural Language

I Bukhara is not loud. ¬noisy(Bukhara)

I Some humans like Chinese. ∃x, human(x) ∧ likes(x, chinese)

I If a person likes Thai, then they are not friends with Ali. ∀x, human(x) ∧ likes(x, Thai) ⇒ ¬friends(x, Ali)

I ∀x, restaurant(x) ⇒ (longwait(x) ∨ ¬likes(Ali, x)) Every restaurant has a long wait or is disliked by Ali.

I ∀x, ∃y, ¬likes(x, y) Everybody has something they don’t like.

I ∃y, ∀x, ¬likes(x, y) There is something that nobody likes.

11/34 Logical Semantics (Montague Semantics)

I The denotation of a natural language sentence is the set of conditions that must hold in the (model) world for the sentence to be true.

I “Every restaurant has a long wait or is disliked by Ali.” is true if and only if

∀x, restaurant(x) ⇒ (longwait(x) ∨ ¬likes(Ali, x))

is true.

I This is sometimes called the logical form of the NL sentence.

12/34 The Principle of Compositionality

I The meaning of a natural language phrase is determined by the meanings of its sub-phrases.

I There are obvious exceptions: e.g., hot dog, New York, etc.

I Semantics is derived from syntax.

I We need a way to express semantics of phrases, and compose them together!

I Little pieces of semantics are introduced by words, from the lexicon.

I Grammar rules include semantic attachments that describe how the semantics of the children are combined to produce the semantics of the parent, bottom-up.

13/34 Lexicon Entries

I In real systems that do detailed semantics, lexicon entries contain

I Semantic attachments

I Morphological info

I Grammatical info (POS, etc.)

I Phonetic info, if speech system

I Additional comments, etc.

14/34 λ-Calculus

I λ-abstraction is a device to give “scope” to variables.

I If φ is a formula and v is a variable, then λv.φ is a λ-term: an unnamed function from values (of v) to formulas (usually involving v)

I application of such functions: if we have λv.φ and ψ, then [λv.φ](ψ) is a formula.

I It can be reduced by substituting ψ for every instance of v in φ

I [λx.likes(x, Bukhara)](Ali) reduces to likes(Ali, Bukhara).

I [λx.λy.friends(x, y)](b) reduces to λy.friends(b, y)

I [[λx.λy.friends(x, y)](b)](a) reduces to [λy.friends(b, y)](a) which reduces to friends(b, a)
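I The reductions above, mimicked with Python lambdas (a sketch for intuition only: application is an ordinary function call, and the FOL formula is built as a string):

    likes_bukhara = lambda x: f"likes({x}, Bukhara)"
    print(likes_bukhara("Ali"))            # likes(Ali, Bukhara)

    friends = lambda x: (lambda y: f"friends({x}, {y})")
    print(friends("b")("a"))               # friends(b, a)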

15/34 Semantic Attachments to CFGs

I NNP → Ali {Ali}

I VBZ → likes {λf .λy.∀x f (x) ⇒ likes(y, x)}

I JJ → expensive {λx.expensive(x)}

I NNS → restaurants {λx.restaurant(x)}

I NP → NNP {NNP.sem}

I NP → JJ NNS {λx.JJ.sem(x)∧ NNS.sem(x)}

I VP → VBZ NP {VBZ.sem(NP.sem)}

I S → NP VP {VP.sem(NP.sem)}

16/34 Example

    Parse tree for “Ali likes expensive restaurants”:

    (S (NP (NNP Ali))
       (VP (VBZ likes)
           (NP (JJ expensive) (NNS restaurants))))

17/34 Example

    The same tree with semantic attachments, before reduction:

    S : VP.sem(NP.sem)
      NP : NNP.sem
        NNP : Ali                                     “Ali”
      VP : VBZ.sem(NP.sem)
        VBZ : λf.λy.∀x f(x) ⇒ likes(y, x)             “likes”
        NP : λv. JJ.sem(v) ∧ NNS.sem(v)
          JJ : λz.expensive(z)                        “expensive”
          NNS : λw.restaurant(w)                      “restaurants”

18/34 Example

    The object NP's semantics reduces:

    [λv. [λz.expensive(z)](v) ∧ [λw.restaurant(w)](v)]      (JJ.sem and NNS.sem applied to v)
      = λv.expensive(v) ∧ restaurant(v)

    so NP : λv.expensive(v) ∧ restaurant(v); the rest of the tree is unchanged.

19/34 Example

    Focus on the VP subtree:

    VP : VBZ.sem(NP.sem)
      VBZ : λf.λy.∀x f(x) ⇒ likes(y, x)            “likes”
      NP : λv.expensive(v) ∧ restaurant(v)          “expensive restaurants”

20/34 Example

    The VP's semantics reduces:

    [λf.λy.∀x f(x) ⇒ likes(y, x)] (λv.expensive(v) ∧ restaurant(v))      (VBZ.sem applied to NP.sem)
      = λy.∀x [λv.expensive(v) ∧ restaurant(v)](x) ⇒ likes(y, x)
      = λy.∀x, expensive(x) ∧ restaurant(x) ⇒ likes(y, x)

21/34 Example

    At the S level:

    S : VP.sem(NP.sem)
      NP : NNP.sem, with NNP : Ali                                        “Ali”
      VP : λy.∀x, expensive(x) ∧ restaurant(x) ⇒ likes(y, x)              “likes expensive restaurants”

22/34 Example

    The final reduction at S:

    [λy.∀x expensive(x) ∧ restaurant(x) ⇒ likes(y, x)] (Ali)      (VP.sem applied to NP.sem)
      = ∀x expensive(x) ∧ restaurant(x) ⇒ likes(Ali, x)

    so S : ∀x expensive(x) ∧ restaurant(x) ⇒ likes(Ali, x)
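I A sketch (not the course's code) of the semantic attachments above as Python lambdas, composed bottom-up for “Ali likes expensive restaurants”; formulas are built as strings for readability:

    ali         = "Ali"
    likes       = lambda f: lambda y: f"∀x {f('x')} ⇒ likes({y}, x)"
    expensive   = lambda x: f"expensive({x})"
    restaurants = lambda x: f"restaurant({x})"

    np_obj = lambda x: f"{expensive(x)} ∧ {restaurants(x)}"   # NP → JJ NNS
    vp     = likes(np_obj)                                    # VP → VBZ NP
    print(vp(ali))   # ∀x expensive(x) ∧ restaurant(x) ⇒ likes(Ali, x)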

23/34 Quantifier Scope Ambiguity

I NNP → Ali {Ali}

I VBZ → likes {λf.λy.∀x f(x) ⇒ likes(y, x)}

I JJ → expensive {λx.expensive(x)}

I NNS → restaurants {λx.restaurant(x)}

I NP → NNP {NNP.sem}

I NP → JJ NNS {λx.JJ.sem(x) ∧ NNS.sem(x)}

I VP → VBZ NP {VBZ.sem(NP.sem)}

I S → NP VP {VP.sem(NP.sem)}

I NP → Det NN {Det.sem(NN.sem)}

I Det → every {λf.λg.∀u f(u) ⇒ g(u)}

I Det → a {λm.λn.∃x m(x) ⇒ n(x)}

I NN → man {λv.man(v)}

I NN → woman {λv.woman(v)}

I VBZ → loves {λf.λy.∀x f(x) ⇒ loves(y, x)}

    Parse tree for “Every man loves a woman”:

    (S (NP (Det Every) (NN man))
       (VP (VBZ loves)
           (NP (Det a) (NN woman))))

    Resulting logical form: ∀u man(u) ⇒ ∃x woman(x) ∧ loves(u, x)

24/34 This is not Quite Right!

I “Every man loves a woman.” really is ambiguous.

I ∀u (man(u) ⇒ ∃x woman(x) ∧ loves(u, x))

I ∃x (woman(x) ∧ ∀u man(u) ⇒ loves(u, x))

I We get only one of the two meanings.

I Extra ambiguity on top of syntactic ambiguity.

I One approach is to delay the quantifier processing until the end, then permit any ordering.

25/34 Other Meaning Representations: Abstract Meaning Representation

I “The boy wants to visit New York City.”

I Designed mainly for annotation-ability and eventual use in machine translation.

26/34 Combinatory Categorial Grammar (CCG)

I CCG is a grammatical formalism that is well-suited for tying together syntax and semantics.

I Formally, it is more powerful than CFG – it can represent some of the context-sensitive languages.

I Instead of the set of non-terminals of CFGs, CCGs can have an infinitely large set of structured categories (called types).

27/34 CCG Types and Combinators

I Primitive types: typically S, NP, N, and maybe more. I Complex types: built with “slashes,” for example:

I S/NP is “an S, except it lacks an NP to the right”

I S\NP is “an S, except it lacks an NP to the left”

I (S\NP)/NP is “an S, except that it lacks an NP to its right and to its left”

I You can think of complex types as functions:

I S/NP maps NPs to Ss.

I CCG Combinators: Instead of the production rules of CFGs, CCGs have a very small set of generic combinators that tell us how we can put types together.

I Convention writes the rule differently from CFG: X Y ⇒ Z means that X and Y combine to form a Z (the “parent” in the tree).

28/34 Application Combinator

I Forward Combination: X/Y Y ⇒ X

I Backward Combination: Y X\Y ⇒ X

    the : NP/N    dog : N    bit : (S\NP)/NP    John : NP

    the dog            ⇒ NP      (forward application)
    bit John           ⇒ S\NP    (forward application)
    the dog bit John   ⇒ S       (backward application)

29/34 Conjunction Combinator

I X and X ⇒ X

    John : NP    ate : (S\NP)/NP    anchovies : NP    and    drank : (S\NP)/NP    Coke : NP

    ate anchovies                       ⇒ S\NP   (forward application)
    drank Coke                          ⇒ S\NP   (forward application)
    ate anchovies and drank Coke        ⇒ S\NP   (conjunction)
    John ate anchovies and drank Coke   ⇒ S      (backward application)

30/34 Composition Combinator

I Forward: X/Y Y/Z ⇒ X/Z

I Backward: Y\Z X\Y ⇒ X\Z

    I : NP    would : (S\NP)/(S\NP)    prefer : (S\NP)/NP    olives : NP

    would prefer             ⇒ (S\NP)/NP   (forward composition)
    would prefer olives      ⇒ S\NP        (forward application)
    I would prefer olives    ⇒ S           (backward application)

31/34 Type-raising Combinator

I Forward (X ⇒ Y/(Y\X))

I Backward (X ⇒ Y\(Y/X))

    I : NP ⇒ S/(S\NP) (type-raised)    love : (S\NP)/NP    and    Karen : NP ⇒ S/(S\NP) (type-raised)    hates : (S\NP)/NP    chocolate : NP

    I love                             ⇒ S/NP   (forward composition)
    Karen hates                        ⇒ S/NP   (forward composition)
    I love and Karen hates             ⇒ S/NP   (conjunction)
    I love and Karen hates chocolate   ⇒ S      (forward application)

32/34 Back to Semantics

I Each combinator also tells us what to do with the semantic attachments.

I Forward application: X/Y : f    Y : g  ⇒  X : f(g)

I Forward composition: X/Y : f    Y/Z : g  ⇒  X/Z : λx.f(g(x))

I Forward type-raising: X : g  ⇒  Y/(Y\X) : λf.f(g)

33/34 CCG Lexicon

I Most of the work is done in the lexicon.

I Syntactic and semantic information is much more formal here.

I Slash categories define where all the syntactic arguments are expected to be

I λ-expressions define how the expected arguments get “used” to build up a FOL expression.

34/34 11-411 Natural Language Processing Discourse and Pragmatics

Kemal Oflazer

Carnegie Mellon University in Qatar

1/61 What is Discourse?

I Discourse is the coherent structure of language above the level of sentences or clauses.

I A discourse is a coherent structured group of sentences.

I What makes a passage coherent?

I A practical answer: It has meaningful connections between its utterances.

2/61 Applications of Computational Discourse

I Automatic essay grading

I Automatic summarization

I Meeting understanding

I Dialogue systems

3/61 Kinds of Discourse Analysis

I Monologue

I Human-human dialogue (conversation)

I Human-computer dialogue (conversational agents)

I “Longer-range” analysis (discourse) vs. “deeper” analysis (real semantics):

I John bought a car from Bill.

I Bill sold a car to John.

I They were both happy with the transaction.

4/61 Discourse in NLP

5/61 Coherence

I Coherence relations: EXPLANATION,CAUSE

I John hid Bill’s car keys. He was drunk.

I John hid Bill’s car keys. He likes spinach.

I Consider:

I John went to the store to buy a piano.

I He had gone to the store for many years.

I He was excited that he could finally afford a piano.

I He arrived just as the store was closing for the day.

I Now consider this:

I John went to the store to buy a piano.

I It was a store he had gone to for many years.

I He was excited that he could finally afford a piano.

I It was closing for the day just as John arrived.

I First is “intuitively” more coherent than the second.

I Entity-based coherence (centering).

6/61 Discourse Segmentation

I Many genres of text have particular conventional structures:

I Academic articles: Abstract, Introduction, Methodology, Results, Conclusion, etc.

I Newspaper stories:

I Spoken patient reports by doctors (SOAP): Subjective, Objective, Assessment, Plan.

7/61 Discourse Segmentation

I Given raw text, separate a document into a linear sequence of subtopics.
  1–3 Intro: The search for life in space
  4–5 The moon’s chemical composition
  6–8 How early earth-moon proximity shaped the moon
  9–12 How the moon helped life evolve on earth
  ⇒ 13 Improbability of the earth-moon system
  14–16 Binary/trinary star systems make life unlikely
  17–18 The low probability of nonbinary/trinary systems
  19–20 Properties of earth’s sun that facilitate life
  21 Summary

8/61 Applications of Discourse Segmentation

I Summarization: Summarize each segment independently.

I News Transcription: Separate a steady stream of transcribed news into separate stories.

I Information Extraction: First identify the relevant segment and then extract.

9/61 Cohesion

I To remind: Coherence refers to the “meaning” relation between two units. A coherence relation explains how the meaning in different textual units can combine into a meaningful discourse.

I On the other hand, cohesion is the use of linguistic devices to link or tie together textual units. A cohesive relation is like a “glue” grouping two units into one.

I Common words used are cues for cohesion.

I Before winter, I built a chimney and shingled the sides of my house.

I I have thus a tight shingled and plastered house.

I Synonymy/hypernymy relations are cues for lexical cohesion.

I Peel, core and slice the pears and the apples.

I Add the fruit to the skillet.

I The use of anaphora is also a cue for cohesion.

I The Woodhouses were first in consequence there.

I All looked up to them.

10/61 Discourse Segmentation

I Intuition: If we can “measure” the cohesion between every neighboring pair of sentences, we may expect a “dip” in cohesion at subtopic boundaries.

I The TextTiling algorithm uses lexical cohesion.

11/61 The TextTiling Algorithm

I Tokenization

I lowercase, remove stop words, morphologically stem inflected words

I stemmed words are (dynamically) grouped into pseudo-sentences of length 20 (equal length and not real sentences!)

I Lexical score determination

I Boundary identification

12/61 TextTiling – Determining Lexical Cohesion Scores

I Remember:

I Count-based similarity vectors

I Cosine similarity: sim_cosine(a, b) = (a · b) / (|a| |b|)

I Consider a gap position i between any two words.

I Consider k = 10 words before the gap (a) and 10 words after the gap (b), and compute their similarity yi.
I So the yi’s are the lexical cohesion scores between the 10 words before the gap and the 10 words after the gap.

13/61 TextTiling – Determining Boundaries

I A gap position i is a valley if yi < yi−1 and yi < yi+1.
I If i is a valley, compute its depth score – the distance from the peaks on both sides = (yi−1 − yi) + (yi+1 − yi).
I Any valley whose depth score is at least s̄ − σs (i.e., no more than one standard deviation below the average valley depth) is selected as a boundary.
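A rough sketch of the scoring and boundary steps above, in Python; the block size of 10 tokens, the lack of stemming/stop-word removal, and the exact cutoff convention are simplifying assumptions on my part.

import math
from collections import Counter

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def texttiling_boundaries(tokens, block=10):
    # Cohesion score y_i at each gap: cosine of the word-count vectors
    # for the `block` tokens before and after the gap.
    y = []
    for i in range(block, len(tokens) - block):
        before = Counter(tokens[i - block:i])
        after  = Counter(tokens[i:i + block])
        y.append(cosine(before, after))
    # Depth score at each valley.
    depths = {}
    for j in range(1, len(y) - 1):
        if y[j] < y[j - 1] and y[j] < y[j + 1]:
            depths[j + block] = (y[j - 1] - y[j]) + (y[j + 1] - y[j])
    if not depths:
        return []
    mean = sum(depths.values()) / len(depths)
    sd = math.sqrt(sum((d - mean) ** 2 for d in depths.values()) / len(depths))
    # Keep gaps whose depth score is at least (mean - sd), following the
    # one-standard-deviation criterion on the slide; returns token positions.
    return sorted(gap for gap, d in depths.items() if d >= mean - sd)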

14/61 TextTiling – Determining Boundaries

15/61 TextTiling – Determining Boundaries

16/61 Supervised Discourse Segmentation

I Spoken news transcription segmentation task.

I Data sets with hand-labeled boundaries exist.

I Paragraph segmentation task of monologues (lectures/speeches).

I Plenty of data on the web with paragraph markers (e.g., HTML <p> tags).

I Treat the task as a binary decision classification problem.

I Use classifiers such as decision trees, Support Vector Machines to classify boundaries.

I Use sequence models such as Hidden Markov Models or Conditional Random Fields to incorporate sequential constraints.

I Additional features that could be used are discourse markers or cue words which are typically domain dependent.

I Good Evening I am PERSON

I Joining us with the details is PERSON

I Coming up

17/61 Evaluating Discourse Segmentation

I We could do precision, recall and F-measure, but . . .

I These will not be sensitive to near misses!

I A commonly-used metric is WindowDiff.

I Slide a window of length k across the (correct) references and the hypothesized segmentation.

I Count the number of segmentation boundaries in each.

I Compute the average difference in the number of boundaries in the sliding window.

I Assigns partial credit.
I Another metric is pk(ref, hyp) – the probability that a randomly chosen pair of words a distance of k words apart is inconsistently classified.
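A minimal sketch of WindowDiff, assuming segmentations are given as 0/1 boundary indicators over the gaps between sentences (that representation and the toy example are my assumptions).

def window_diff(reference, hypothesis, k):
    """reference, hypothesis: lists of 0/1 boundary indicators per gap."""
    assert len(reference) == len(hypothesis)
    n = len(reference)
    errors = 0
    for i in range(n - k):
        # Compare the number of boundaries inside each sliding window.
        if sum(reference[i:i + k]) != sum(hypothesis[i:i + k]):
            errors += 1
    return errors / (n - k)

# A near miss (boundary off by one) is penalized only lightly.
ref = [0, 0, 1, 0, 0, 0, 1, 0, 0]
hyp = [0, 0, 0, 1, 0, 0, 1, 0, 0]
print(window_diff(ref, hyp, k=3))   # ~0.33 rather than a full miss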

18/61 Coherence

I Cohesion does not necessarily imply coherence.

I We need a more detailed definition of coherence.

I We need computational mechanisms for determining coherence.

19/61 Coherence Relations

I Let S0 and S1 represent the “meanings” of two sentences being related.
I Result: Infer that the state or event asserted by S0 causes or could cause the state or event asserted by S1.
  I The Tin Woodman was caught in the rain. His joints rusted.
I Explanation: Infer that the state or event asserted by S1 causes or could cause the state or event asserted by S0.
  I John hid the car’s keys. He was drunk.
I Parallel: Infer p(a1, a2, ...) from the assertion of S0 and p(b1, b2, ...) from the assertion of S1, where ai and bi are similar for all i.
  I The Scarecrow wanted some brains. The Tin Woodman wanted a heart.
I Elaboration: Infer the same proposition from the assertions S0 and S1.
  I Dorothy was from Kansas. She lived in the midst of the great Kansas prairies.

I Occasion:
  I A change of state can be inferred from the assertion of S0 whose final state can be inferred from S1, or
  I A change of state can be inferred from the assertion of S1 whose initial state can be inferred from S0.
  I Dorothy picked up the oil can. She oiled the Tin Woodman’s joints.

20/61 Coherence Relations

I Consider

I S1: John went to the bank to deposit his paycheck.

I S2: He then took a bus to Bill’s car dealership.

I S3: He needed to buy a car.

I S4: The company he works for now isn’t near a bus line.

I S5: He also wanted to talk with Bill about their soccer league.

Occasion(e1, e2)
  S1(e1)
  Explanation(e2)
    S2(e2)
    Parallel(e3, e5)
      Explanation(e3)
        S3(e3)
        S4(e4)
      S5(e5)

21/61 Rhetorical Structure Theory – RST

I Based on 23 rhetorical relations between two spans of text in a discourse.

I a nucleus – central to the writer’s purpose and interpretable independently

I a satellite – less central and generally only interpretable with respect to the nucleus

I Evidence relation: Kevin must be here (nucleus). His car is parked outside (satellite).
I An RST relation is defined by a set of constraints.

Relation name: Evidence
Constraints on Nucleus: Reader might not believe Nucleus to a degree satisfactory to Writer.
Constraints on Satellite: Reader believes Satellite or will find it credible.
Constraints on Nucleus+Satellite: Reader’s comprehending Satellite increases Reader’s belief of Nucleus.
Effects: Reader’s belief in Nucleus is increased.

22/61 RST – Other Common Relations

I Elaboration: The satellite gives more information about the nucleus.

I [N The company wouldn’t elaborate,][S citing competitive reasons.]

I Attribution: The satellite gives the source of attribution for the information in the nucleus.

I [S Analysts estimated,][N that sales at US stores declined in the quarter.]

I Contrast: Two or more nuclei contrast along some dimension.

I [N The man was in a bad temper,][N but his dog was quite happy.]

I List: A series of nuclei are given without contrast or explicit comparison.

I [N John was the goalie;][N Bob, he was the center forward.]

I Background: The satellite gives context for interpreting the nucleus.

I [S T is a pointer to the root of a binary tree.][N Initialize T.]

23/61 RST Coherence Relations

24/61 Automatic Coherence Assignment

I Given a sequence of sentences or clauses, we want to automatically:

I determine coherence relations between them (coherence relation assignment)

I extract a tree or graph representing an entire discourse (discourse parsing)

25/61 Automatic Coherence Assignment

I Very difficult!

I One existing approach is to use cue phrases.

I John hid Bill’s car keys because he was drunk.

I The scarecrow came to ask for a brain. Similarly, the tin man wants a heart.

1. Identify cue phrases in the text.
2. Segment the text into discourse segments.
   I Use cue phrases/discourse markers: although, but, because, yet, with, and, . . .
   I But these are often implicit, as in the car key example.
3. Classify the relationship between each pair of consecutive discourse segments.

26/61 Reference Resolution

I To interpret a sentence in any discourse, we need to know who or what entity is being talked about.

I Victoria Chen, CFO of Megabucks Banking Corp since 2004, saw her pay jump 20%, to $1.3 million, as the 37-year-old also became the Denver-based company’s president. It has been ten years since she came to Megabucks from rival Lotsaloot.

I Coreference chains:

I {Victoria Chen, CFO of Megabucks Banking Corp since 2004, her, the 37-year-old, the Denver-based company’s president, she}

I {Megabucks Banking Corp, the Denver-based company, Megabucks}

I {her pay}

I {Lotsaloot}

27/61 Some Terminology

Victoria Chen, CFO of Megabucks Banking Corp since 2004, saw her pay jump 20%, to $1.3 million, as the 37-year-old also became the Denver-based company’s president. It has been ten years since she came to Megabucks from rival Lotsaloot.

I Referring expression

I Victoria Chen, the 37-year-old and she are referring expressions.

I Referent

I Victoria Chen is the referent.

I Two referring expressions referring to the same entity are said to corefer.

I A referring expression licenses the use of a subsequent expression.

I Victoria Chen allows Victoria Chen to be referred to as she.

I Victoria Chen is the antecedent of she.

I Reference to an earlier introduced entity is called anaphora.

I Such a reference is called anaphoric.

I the 37-year-old, her and she are anaphoric.

28/61 References and Context

I Suppose your friend has a car, a 1961 Ford Falcon.

I You can refer to it in many ways: it, this, that, this car, the car, the Ford, the Falcon, my friend’s car,...

I However you are not free to use any of these in any context!

I For example, you can not refer to it as it, or as the Falcon, if the hearer has no prior knowledge of the car, or it has not been mentioned, etc.

I Coreference chains are part of cohesion.

29/61 Other Kinds of Referents

I You do not always refer to entities. Consider:

I According to Doug, Sue just bought the Ford Falcon.

I But that turned out to be a lie.

I But that was false.

I That struck me as a funny way to describe the situation.

I That caused a financial problem for Sue.

30/61 Types of Referring Expressions

I Indefinite Noun Phrases

I Definite Noun Phrases

I Pronouns

I Demonstratives

I Names

31/61 Indefinite Noun Phrases

I Introduce new entities to the discourse

I Usually with a, an, some, and even this.

I Mrs. Martin was so kind as to send Mrs. Goddard a beautiful goose.

I I brought him some good news.

I I saw this beautiful Ford Falcon today.

I Specific vs. non-specific ambiguity.

I The goose above is specific – it is the one Mrs. Martin sent.

I The goose in “I am going to the butcher to buy a goose.” is non-specific.

32/61 Definite Noun Phrases

I Refer to entities identifiable to the hearer.

I Entities are either previously mentioned:

I It concerns a white stallion which I have sold to another officer. But the pedigree of the white stallion was not fully established.

I Or, they are part of the hearer’s beliefs about the world.

I I read it in the New York Times

33/61 Pronouns

I Pronouns usually refer to entities that were introduced no further than one or two sentences back.

I John went to Bob’s party and parked next to a classic Ford Falcon.

I He went inside and talked to Bob for more than an hour. (He = John)

I Bob told him that he recently got engaged. (him = John, he = Bob)

I He also said he bought it yesterday (He = Bob, it = ???)

I He also said he bought the Falcon yesterday (He = Bob)

I Pronouns can also participate in cataphora.

I Even before she saw it, Dorothy had been thinking about the statue.

I Pronouns also appear in quantified contexts, bound to the quantifier.

I Every dancer brought her left arm forward.

34/61 Demonstratives

I This, that, these, those
I Can be either pronouns or determiners

I That came earlier.

I This car was parked on the left.

I Proximal demonstrative – this

I Distal demonstrative – that

I Note that this NP is ambiguous: it can be either indefinite or definite.

35/61 Names

I Names can be used to refer to new or old entities in the discourse.

I These mostly refer to named-entities: people, organizations, locations, geographical objects, products, nationalities, physical facilities, geopolitical entities, dates, monetary instruments, plants, animals, . . . .

I They are not necessarily unique:

I Do you mean the Ali in the sophomore class or the Ali in the senior class?

36/61 Reference Resolution

I Goal: Determine what entities are referred to by which linguistic expressions.

I The discourse model contains our eligible set of referents.

I Coreference resolution

I Pronominal anaphora resolution

37/61 Pronoun Reference Resolution: Filters

I Number, person, gender agreement constraints.

I it can not refer to books

I she can not refer to John

I Binding theory constraints:

I John bought himself a new Ford. (himself=John)

I John bought him a new Ford. (him ≠ John)

I John said that Bill bought him a new Ford. (him ≠ Bill)

I John said that Bill bought himself a new Ford. (himself = Bill)

I He said that he bought John a new Ford. (both he ≠ John)

38/61 Pronoun Reference Resolution: Preferences

I Recency: preference for most recent referent

I Grammatical Role: subj>obj>others

I Billy went to the bar with Jim. He ordered rum.

I Repeated mention:

I Billy had been drinking for days. He went to the bar again today. Jim went with him. He ordered rum.

I Parallelism:

I John went with Jim to one bar. Bill went with him to another.

I Verb semantics:

I John phoned/criticized Bill. He lost the laptop.

I Selectional restrictions:

I John parked his car in the garage after driving it around for hours.

39/61 Pronoun Reference Resolution: The Hobbs Algorithm

I Algorithm for walking through parses of current and preceding sentences.

I Simple, often used as baseline.

I Requires parser, morphological gender and number

I Uses rules to identify heads of NPs
I Uses WordNet for humanness and gender
  I Is person a hypernym of an NP head?
  I Is female a hypernym of an NP head?

I Implements binding theory, recency, and grammatical role preferences.

40/61 Pronoun Reference Resolution: Centering Theory

I Claim: A single entity is “centered” in each sentence.

I That entity is to be distinguished from all other entities that have been evoked.

I Also used in entity-based coherence.
I Let Un and Un+1 be two adjacent utterances.
I The backward-looking center of Un, denoted Cb(Un), is the entity focused on in the discourse after Un is interpreted.
I The forward-looking centers of Un, denoted Cf(Un), form an ordered list of the entities mentioned in Un, any of which could serve as the Cb of Un+1.
I Cb(Un+1) is the highest-ranked element of Cf(Un) that is also mentioned in Un+1.
I Entities in Cf(Un) are ordered: subject > existential predicate nominal > object > indirect object . . .
I Cp is the preferred center, the first element of Cf(Un).

41/61 Sentence Transitions

I Rule 1: If any element of Cf(Un) is realized as a pronoun in Un+1, then Cb(Un+1) must be realized as a pronoun.

I Rule 2: Transition states are ordered: Continue > Retain > Smooth-shift > Rough-shift

I Algorithm:

I Generate possible Cb– Cf combinations for each possible set of reference assignments.

I Filter by constraints: agreements, selectional restrictions, centering rules and constraints

I Rank by transition orderings

I The most preferred relation defines the pronominal referents.

42/61 Pronoun Reference Resolution: Log-Linear Models

I Supervised: hand-labeled coreference corpus

I Rule-based filtering of non-referential pronouns:

I It was a dark and stormy night.

I It is raining.

I Needs positive and negative examples:

I Positive examples in the corpus.

I Negative examples are created by pairing pronouns with other noun phrases.

I Features are extracted for each training example.

I Classifier learns to predict 1 or 0.

I During testing:

I Classifier extracts all potential antecedents by parsing the current and previous sentences.

I Each NP is considered a potential antecedent for each following pronoun.

I Each pronoun – potential antecedent pair is then presented (through their features) to the classifier.

I Classifier predicts 1 or 0.

43/61 Pronoun Reference Resolution: Log-Linear Models

I Example

I U1: John saw a Ford at the dealership.
I U2: He showed it to Bob.
I U3: He bought it.
I Features for He in U3
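An illustrative sketch of the pairwise classifier described above, using scikit-learn's logistic regression as a stand-in for a log-linear model; the feature names, the toy mentions, and the tiny training set are invented for the example.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(pronoun, antecedent):
    # A few of the kinds of features mentioned on the slides (illustrative only).
    return {
        "gender_match": pronoun["gender"] == antecedent["gender"],
        "number_match": pronoun["number"] == antecedent["number"],
        "sentence_distance": pronoun["sent"] - antecedent["sent"],
        "antecedent_is_subject": antecedent["role"] == "subj",
    }

# Toy training pairs (pronoun, candidate antecedent, label 1 = coreferent).
he   = {"gender": "m", "number": "sg", "sent": 2, "role": "subj"}
john = {"gender": "m", "number": "sg", "sent": 1, "role": "subj"}
ford = {"gender": "n", "number": "sg", "sent": 1, "role": "obj"}
pairs = [(he, john, 1), (he, ford, 0)]

vec = DictVectorizer()
X = vec.fit_transform([features(p, a) for p, a, _ in pairs])
y = [label for _, _, label in pairs]
clf = LogisticRegression().fit(X, y)

# At test time: score every candidate antecedent for the pronoun and pick the best.
scores = clf.predict_proba(vec.transform([features(he, john), features(he, ford)]))[:, 1]
print(scores)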

44/61 General Reference Resolution

I Victoria Chen, CFO of Megabucks Banking Corp since 2004, saw her pay jump 20%, to $1.3 million, as the 37-year-old also became the Denver-based company’s president. It has been ten years since she came to Megabucks from rival Lotsaloot.

I Coreference chains:

I {Victoria Chen, CFO of Megabucks Banking Corp since 2004, her, the 37-year-old, the Denver-based company’s president, she}

I {Megabucks Banking Corp, the Denver-based company, Megabucks}

I {her pay}

I {Lotsaloot}

45/61 High-level Recipe for Coreference Resolution

I Parse the text and identify NPs; then

I For every pair of NPs, carry out binary classification: coreferential or not?

I Collect the results into coreference chains

I What do we need?

I A choice of classifier.

I Lots of labeled data.

I Features

46/61 High-level Recipe for Coreference Resolution

I Word-level edit distance between the two NPs

I Are the two NPs the same NER type?

I Appositive syntax

I “Alan Shepherd, the first American astronaut, . . . ”

I Proper/definite/indefinite/pronoun

I Gender

I Number

I Distance in sentences

I Number of NPs between

I Grammatical roles

I Any other relevant features, e.g., embeddings?
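A sketch of the "collect the results into coreference chains" step: pairwise decisions are merged with union-find. The coreferent() predicate stands in for the trained classifier and is purely hypothetical here.

def build_chains(mentions, coreferent):
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Pairwise binary decisions over every pair of mentions.
    for j in range(len(mentions)):
        for i in range(j):
            if coreferent(mentions[i], mentions[j]):
                union(i, j)

    # Group mentions by their union-find root to obtain the chains.
    chains = {}
    for i, m in enumerate(mentions):
        chains.setdefault(find(i), []).append(m)
    return list(chains.values())

mentions = ["Victoria Chen", "her", "the 37-year-old", "Megabucks", "she"]
same = lambda a, b: {a, b} <= {"Victoria Chen", "her", "the 37-year-old", "she"}
print(build_chains(mentions, same))
# [['Victoria Chen', 'her', 'the 37-year-old', 'she'], ['Megabucks']]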

47/61 Pragmatics

I Pragmatics is a branch of linguistics dealing with language use in context.

I When a diplomat says yes, he means ‘perhaps’;

I When he says perhaps, he means ‘no’;

I When he says no, he is not a diplomat.

I (Variously attributed to Voltaire, H. L. Mencken, and Carl Jung)

48/61 In Context?

I Social context

I Social identities, relationships, and setting

I Physical context

I Where? What objects are present? What actions?

I Linguistic context

I Conversation history

I Other forms of context

I Shared knowledge, etc.

49/61 Language as Action: Speech Acts

I The Mood of a sentence indicates relation between speaker and the concept (proposition) defined by the LF

I There can be operators that represent these direct relations:

I ASSERT: the proposition is proposed as a fact

I YN-QUERY: the truth of the proposition is queried

I COMMAND: the proposition describes a requested action

I WH-QUERY: the proposition describes an object to be identified

I There are also indirect speech acts.

I Can you pass the salt?

I It is warm here.

50/61 “How to do things with words.” J. L. Austin 1

I In addition to just saying things, sentences perform actions.
I When these sentences are uttered, the important thing is not their truth value, but the felicitousness of the action (e.g., do you have the authority to do it):

I I name this ship the Titanic.

I I take this man to be my husband.

I I bequeath this watch to my brother.

I I declare war.

1 http://en.wikipedia.org/wiki/J._L._Austin

51/61 Performative Sentences

I When uttered by the proper authority, such sentences have the effect of changing the state of the world, just as any other action that can change the state of the world.

I These involve verbs like, name, second, declare, etc.

I “I name this ship the Titanic.” also causes the ship to be named Titanic.

I You can tell whether sentences are performative by adding “hereby”:

I I hereby name this ship the Queen Elizabeth.

I Non-performative sentences do not sound good with hereby:

I Birds hereby sing.

I There is hereby fighting in Syria.

52/61 Speech Acts Continued

I Locutionary Act: The utterance of a sentence with a particular meaning.
I Illocutionary Act: The act of asking, answering, promising, etc., in uttering a sentence.

I I promise you that I will fix the problem.

I You can’t do that (protesting)

I By the way, I have a CD of Debussy; would you like to borrow it? (offering)

I Perlocutionary Act: The – often intentional – production of certain effects on the addressee.

I You can’t do that. (stopping or annoying the addressee)

I By the way, I have a CD of Debussy; would you like to borrow it? (impressing the addressee)

53/61 Searle’s Speech Acts

I Assertives = speech acts that commit a speaker to the truth of the expressed proposition

I Directives = speech acts that are to cause the hearer to take a particular action, e.g. requests, commands and advice

I Can you pass the salt?

I Has the form of a question but the effect of a directive

I Commissives = speech acts that commit a speaker to some future action, e.g. promises and oaths

I Expressives = speech acts that express the speaker’s attitudes and emotions towards the proposition, e.g. congratulations, excuses

I Declarations = speech acts that change the reality in accord with the proposition of the declaration, e.g. pronouncing someone guilty or pronouncing someone husband and wife

54/61 Speech Acts in NLP

I Speech acts (inventories) are mainly used in developing (task-oriented) dialog systems.

I Speech acts are used as annotation guidelines for corpus annotation.

I An annotated corpus is then used for machine learning of dialog tasks.

I Such corpora are highly developed and checked for intercoder agreement.

I Annotation still takes a long time to learn.

55/61 Task-oriented Dialogues

I Making travel reservations (flight, hotel room, etc.)

I Scheduling a meeting.

I Finding out when the next bus is.

I Making a payment over the phone.

56/61 Ways of Asking for a Room

I I’d like to make a reservation

I I’m calling to make a reservation

I Do you have a vacancy on . . .

I Can I reserve a room?

I Is it possible to reserve a room?

57/61 Examples of Task-oriented Speech Acts

I Identify self:

I This is David

I My name is David

I I’m David

I David here

I Sound check: Can you hear me?

I Meta dialogue act: There is a problem.

I Greet: Hello.

I Request-information:

I Where are you going.

I Tell me where you are going.

58/61 Examples of Task-oriented Speech Acts

I Backchannel – Sounds you make to indicate that you are still listening

I ok, m-hm

I Apologize/reply to apology

I Thank/reply to thanks

I Request verification/Verify

I So that’s 2:00? Yes. 2:00.

I Resume topic

I Back to the accommodations . . .

I Answer a yes/no question: yes, no.

59/61 Task-oriented Speech Acts in Negotiation

I Suggest

I I recommend this hotel

I Offer

I I can send some brochures.

I How about if I send some brochures.

I Accept

I Sure. That sounds fine.

I Reject

I No. I don’t like that one.

60/61 Negotiation

61/61 (Mostly Statistical) Machine Translation

11-411 Fall 2017 2 The Rosetta Stone

• Decree from Ptolemy V on repealing taxes and erecting some statues (196 BC)
• Written in three languages
  – Hieroglyphic
  – Demotic
  – Classical Greek

3 Overview

• History of Machine Translation
• Early Rule-based Approaches
• Introduction to Statistical Machine Translation (SMT)
• Advanced Topics in SMT
• Evaluation of (S)MT output

4 Machine Translation

• Transform text (speech) in one language (source) to text (speech) in a different language (target) such that
  – The “meaning” in the source language input is (mostly) preserved, and
  – The target language output is grammatical.
• Holy grail application in AI/NLP since the middle of the 20th century.

5 Translation

• Process
  – Read the text in the source language
  – Understand it
  – Write it down in the target language

• These are hard tasks for computers
  – The human process is invisible, intangible

6 Machine Translation

Many possible legitimate translations!

7 Machine Translation

[Side-by-side example: the opening of the German Wikipedia article on the Rolls-Royce Merlin engine (left) and its English translation via Google Translate (right).]

8 Machine Translation

[The same German/English Rolls-Royce Merlin example, repeated.]

9 Machine Translation

[Example: the same German Wikipedia text with its Turkish translation via Google Translate.]

10 Machine Translation

[Example: the same German Wikipedia text with its Arabic translation via Google Translate (2009).]

11 Machine Translation

[Example: the same German Wikipedia text with its Arabic translation via Google Translate (2017), for comparison with the 2009 output.]

12 Machine Translation

• (Real-time speech-to-speech) Translation is a very demanding task
  – Simultaneous translators (in the UN or EU Parliament) last about 30 minutes
  – Time pressure
  – Divergences between languages
    • German: Subject . . . Verb
    • English: Subject Verb . . .
    • Arabic: Verb Subject . . .

13 Brief History

• 1950’s: Intensive research activity in MT
  – Translate Russian into English
• 1960’s: Direct word-for-word replacement
• 1966 (ALPAC): NRC Report on MT
  – Conclusion: MT no longer worthy of serious scientific investigation.
• 1966-1975: ‘Recovery period’
• 1975-1985: Resurgence (Europe, Japan)
• 1985-present: Resurgence (US)
  – Mostly Statistical Machine Translation since 1990s
  – Recently Neural Network/Deep Learning based machine translation

14 Early Rule-based Approaches

• Expert system-like rewrite systems
• Interlingua methods (analyze and generate)
• Information used for translation is compiled by humans
  – Dictionaries
  – Rules

15 Vauquois Triangle

16 Statistical Approaches

• Word-to-word translation
• Phrase-based translation
• Syntax-based translation (tree-to-tree, tree-to-string)
  – Trained on parallel corpora
  – Mostly noisy-channel (at least in spirit)

17 Early Hints on the Noisy Channel Intuition

• “One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’ ”
• Warren Weaver (1955:18, quoting a letter he wrote in 1947)

18 Divergences between Languages

• Languages differ along many dimensions
  – Concept – Lexicon alignment – Lexical Divergence
  – Syntax – Structure Divergence
• Word-order differences
  – English is Subject-Verb-Object
  – Arabic is Verb-Subject-Object
  – Turkish is Subject-Object-Verb
• Phrase order differences
• Structure-Semantics Divergences

19 Lexical Divergences

• English: wall
  – German: Wand for walls inside, Mauer for walls outside
• English: runway
  – Dutch: Landingbaan for when you are landing; startbaan for when you are taking off
• English: aunt
  – Turkish: hala (father’s sister), teyze (mother’s sister)
• Turkish: o
  – English: she, he, it

20 Lexical Divergences How conceptual space is cut up

21 Lexical Gaps

• One language may not have a word for a concept in another language
  – Japanese: oyakoko
    • Best English approximation: “filial piety”
  – Turkish: gurbet
    • Where you are when you are not “home”
  – English: condiments
    • Turkish: ??? (things like mustard, mayo and ketchup)

22 Local Phrasal Structure Divergences

• English: a blue house
  – French: une maison bleue
• German: die ins Haus gehende Frau
  – English: the lady walking into the house

23 Structural Divergences

• English: I have a book.
  – Turkish: Benim kitabım var. (Lit: My book exists)
• French: Je m’appelle Jean (Lit: I call myself Jean)
  – English: My name is Jean.
• English: I like swimming.
  – German: Ich schwimme gerne. (Lit: I swim “likingly”.)

24 Major Rule-based MT Systems/Projects

• Systran
  – Major human effort to construct large translation dictionaries + limited word-reordering rules
• Eurotra
  – Major EU-funded project (1970s-1994) to translate among (then) 12 EC languages.
  – Bold technological framework – Structural Interlingua
  – Management failure
  – Never delivered a working MT system
  – Helped create critical mass of researchers

25 Major Rule-based MT Systems/Projects

• METEO
  – Successful system for French-English translation of Canadian weather reports (1975-1977)
• PANGLOSS
  – Large-scale MT project by CMU/USC-ISI/NMSU
  – Interlingua-based Japanese-Spanish-English translation
  – Manually developed semantic lexicons

26 Rule-based MT

• Manually develop rules to analyze the source language sentence (e.g., a parser)
  – => some source structure representation
• Map the source structure to a target structure
• Generate the target sentence from the transferred structure

27 Rule-based MT

[Figure: syntactic transfer. The source parse tree for “I read scientific books” (Pronoun – Verb – Adj Noun) is mapped onto a target tree for “Je lire livres scientifiques,” with the Adj–Noun order swapped: source language analysis, transfer (swap), target language generation.]

28 Rules

• Rules to analyze the source sentences
  – (Usually) Context-free grammar rules coupled with linguistic features
    • Sentence => Subject-NP Verb-Phrase
    • Verb-Phrase => Verb Object
    • …..

29 Rules

• Lexical transfer rules
  – English: book (N) => French: livre (N, masculine)
  – English: pound (N, monetary sense) => French: livre (N, feminine)
  – English: book (V) => French: réserver (V)
• Quite tricky for

30 Rules

• Structure Transfer Rules
  – English: S => NP VP  ⇒  French: TR(S) => TR(NP) TR(VP)
  – English: NP => Adj Noun  ⇒  French: TR(NP) => TR(Noun) TR(Adj), but there are exceptions for Adj = grand, petit, ….

31 Rules

Much more complex to deal with “real world” sentences.

32 Example-based MT (EBMT)

• Characterized by its use of a bilingual corpus with parallel texts as its main knowledge base, at run-time.
• Essentially translation by analogy; can be viewed as an implementation of the case-based reasoning approach of machine learning.
• Find how (parts of) the input are translated in the examples
  – Cut and paste to generate novel translations

33 Example-based MT (EBMT)

• Translation Memory
  – Store many translations
    • source – target sentence pairs
  – For new sentences, find the closest match
    • use edit distance, POS match, other similarity techniques
  – Do corrections
    • map insertions, deletions, substitutions onto the target sentence
  – Useful only when you expect the same or a similar sentence to show up again, but then high quality

34 Example-based MT (EBMT)

English / Japanese
• How much is that red umbrella? – Ano akai kasa wa ikura desu ka?
• How much is that small camera? – Ano chiisai kamera wa ikura desu ka?

• How much is that X? • Ano X wa ikura desu ka?

35 Hybrid Machine Translation

• Use multiple techniques (rule-based / EBMT / Interlingua)
• Combine the outputs of different systems to improve final translations

36 How do we evaluate MT output?

• Adequacy: Is the meaning of the source sentence conveyed by the target sentence?
• Fluency: Is the sentence grammatical in the target language?
• These are rated on a scale of 1 to 5

37 How do we evaluate MT output?

Je suis fatigué.

Candidate translation     Adequacy   Fluency
Tired is I.               5          2
Cookies taste good!       1          5
I am tired.               5          5

38 How do we evaluate MT output?

• This in general is very labor intensive
  – Read each source sentence
  – Evaluate the target sentence for adequacy and fluency
• Not easy to do if you improve your MT system 10 times a day and need to evaluate!
  – Could this be mechanized?
    • Later

39 MT Strategies (1954-2004)

[Figure: MT strategies from 1954-2004 laid out along two axes – knowledge acquisition strategy (hand-built by experts, hand-built by non-experts, learned from annotated data, learned from unannotated data) and knowledge representation (shallow/simple word-based systems, electronic dictionaries, phrase tables up to deep/complex semantic and interlingua representations). Example-based MT, statistical MT, the original direct approach, constituent-structure transfer, semantic analysis, and classic interlingua systems each occupy a region; “new research goes here” points at deep representations learned from data. Slide by Laurie Gerber.]

40 Statistical Machine Translation

• How do statistics and probabilities come into play?
  – Often statistical and rule-based MT are seen as alternatives, even opposing approaches – wrong!!!

                     No Probabilities          Probabilities
  Flat Structure     EBMT                      SMT
  Deep Structure     Transfer, Interlingua     Holy Grail

  – Goal: structurally rich probabilistic models

41 Rule-based MT vs SMT

[Figure: Expert System vs. Statistical System. The expert system is built from rules manually coded by experts (If « … » then …) and produces a single output for S: “Mais où sont les neiges d’antan?” – T: “But where are the snows of …?”. The statistical system learns statistical rules (e.g., P(but | mais)=0.7, P(however | mais)=0.3, P(where | où)=1.0) from a bilingual parallel corpus by machine learning and produces ranked outputs: T1: “But where are the snows of yesteryear?” (P = 0.41), T2: “However, where are yesterday’s snows?” (P = 0.33), T3: “Hey - where did the old snow go?” (P = 0.18).]

42 Data-Driven Machine Translation

“Hmm, every time he sees ‘banco’, he either types ‘bank’ or ‘bench’ … but if he sees ‘banco de…’, he always types ‘bank’, never ‘bench’…”
“Man, this is so boring.”

Translated documents

(Slide by Kevin Knight)

43 Statistical Machine Translation

• The idea is to use lots of parallel texts to model how translations are done.
  – Observe how words or groups of words are translated
  – Observe how translated words are moved around to make fluent sentences in the target language

44 Parallel Texts

1a. Garcia and associates .  1b. Garcia y asociados .
2a. Carlos Garcia has three associates .  2b. Carlos Garcia tiene tres asociados .
3a. his associates are not strong .  3b. sus asociados no son fuertes .
4a. Garcia has a company also .  4b. Garcia tambien tiene una empresa .
5a. its clients are angry .  5b. sus clientes estan enfadados .
6a. the associates are also angry .  6b. los asociados tambien estan enfadados .
7a. the clients and the associates are enemies .  7b. los clients y los asociados son enemigos .
8a. the company has three groups .  8b. la empresa tiene tres grupos .
9a. its groups are in Europe .  9b. sus grupos estan en Europa .
10a. the modern groups sell strong pharmaceuticals .  10b. los grupos modernos venden medicinas fuertes .
11a. the groups do not sell zenzanine .  11b. los grupos no venden zanzanina .
12a. the small groups are not modern .  12b. los grupos pequenos no son modernos .

45 Parallel Texts

Clients do not sell pharmaceuticals in Europe
Clientes no venden medicinas en Europa

(The same parallel corpus as above is shown again, illustrating how the words of this new sentence pair occur in the examples.)

46 Parallel Texts

1. employment rates are very low , especially for women .  /  1. istihdam oranları , özellikle kadınlar için çok düşüktür .
2. the overall employment rate in 2001 was 46.8% .  /  2. 2001 yılında genel istihdam oranı % 46,8'dir .
3. the system covers insured employees who lose their jobs .  /  3. sistem , işini kaybeden sigortalı işsizleri kapsamaktadır .
4. the resulting loss of income is covered in proportion to the premiums paid .  /  4. ortaya çıkan gelir kaybı , ödenmiş primlerle orantılı olarak karşılanmaktadır .
5. there has been no development in the field of disabled people .  /  5. engelli kişiler konusunda bir gelişme kaydedilmemiştir .
6. overall assessment  /  6. genel değerlendirme
7. no social dialogue exists in most private enterprises .  /  7. özel işletmelerin çoğunda sosyal diyalog yoktur .
8. it should be reviewed together with all the social partners .  /  8. konseyin yapısı , sosyal taraflar ile birlikte yeniden gözden geçirilmelidir .
9. much remains to be done in the field of social protection .  /  9. sosyal koruma alanında yapılması gereken çok şey vardır .

47 Available Parallel Data (2004)

Millions of words (English side)

+ 1m-20m words for many language pairs

(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn). 48 Available Parallel Data (2008)

• Europarl: 30 million words in 11 languages
• Acquis Communitaire: 8-50 million words in 20 EU languages
• Canadian Hansards: 20 million words from Canadian Parliamentary Proceedings
• Chinese/Arabic - English: over 100 million words from LDC
• Lots more French/English, Spanish/French/English from LDC
• Smaller corpora for many other language pairs
  – Usually English – some other language.

49 Available Parallel Data (2017)

+ 1m-20m words for many language pairs

50 Available Parallel Text

• A book has a few hundred thousand words
• An educated person may read 10,000 words a day
  – 3.5 million words a year
  – 300 million words a lifetime
• Soon computers will have access to more translated text than humans read in a lifetime

51 More data is better!

• Language Weaver Arabic to English Translation

v.2.0 – October 2003 v.2.4 – October 2004

52 Sample Learning Curves

[Learning curves: BLEU score vs. number of sentence pairs used in training, for Swedish/English, French/English, German/English, and Finnish/English. Experiments by Philipp Koehn.]

53 Preparing Data

• Sentence Alignment • Tokenization/Segmentation

54 Sentence Alignment

The old man is happy. He has fished many times. His wife talks to him. The fish are jumping. The sharks await.

El viejo está feliz porque ha pescado muchos veces. Su mujer habla con él. Los tiburones esperan.

55 Sentence Alignment

1. The old man is happy.  2. He has fished many times.  3. His wife talks to him.  4. The fish are jumping.  5. The sharks await.

1. El viejo está feliz porque ha pescado muchos veces.  2. Su mujer habla con él.  3. Los tiburones esperan.

56 Sentence Alignment

• 1-1 Alignment
  – 1 sentence on one side aligns to 1 sentence on the other side
• 0-n, n-0 Alignment
  – A sentence on one side aligns to no sentences on the other side
• n-m Alignment (n, m > 0 but typically very small)
  – n sentences on one side align to m sentences on the other side

57 Sentence Alignment

• Sentence alignments are typically done by dynamic programming algorithms
  – Almost always, the alignments are monotonic.
  – The lengths of sentences and their translations (mostly) correlate.
  – Tokens like numbers, dates, proper names, and cognates help anchor sentences.
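A length-based dynamic-programming aligner in the spirit of Gale & Church, as a sketch; the cost function and the allowed bead types below are crude assumptions chosen for illustration (real systems use a statistical length model).

def align(src, tgt):
    def cost(s_chunk, t_chunk):
        ls = sum(len(x) for x in s_chunk)
        lt = sum(len(x) for x in t_chunk)
        # Penalize length mismatch and any bead other than 1-1.
        return abs(ls - lt) + 3 * abs(len(s_chunk) + len(t_chunk) - 2)

    INF = float("inf")
    n, m = len(src), len(tgt)
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    beads = [(1, 1), (1, 2), (2, 1), (1, 0), (0, 1), (2, 2)]
    for i in range(n + 1):
        for j in range(m + 1):
            for di, dj in beads:
                if i >= di and j >= dj and best[i - di][j - dj] < INF:
                    c = best[i - di][j - dj] + cost(src[i - di:i], tgt[j - dj:j])
                    if c < best[i][j]:
                        best[i][j], back[i][j] = c, (di, dj)
    # Trace back the chosen beads.
    i, j, out = n, m, []
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        out.append((src[i - di:i], tgt[j - dj:j]))
        i, j = i - di, j - dj
    return list(reversed(out))

english = ["The old man is happy.", "He has fished many times.",
           "His wife talks to him.", "The sharks await."]
spanish = ["El viejo está feliz porque ha pescado muchos veces.",
           "Su mujer habla con él.", "Los tiburones esperan."]
print(align(english, spanish))   # expected beads: (E1+E2, S1), (E3, S2), (E4, S3)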

58 Sentence Alignment

(The same numbered English and Spanish sentences as above, repeated with their alignment links drawn.)

59 Sentence Alignment

1. The old man is happy. He has fished many times.  –  1. El viejo está feliz porque ha pescado muchos veces.
2. His wife talks to him.  –  2. Su mujer habla con él.
3. The sharks await.  –  3. Los tiburones esperan.

Unaligned sentences are thrown out, and sentences are merged in n-to-m alignments (n, m > 0). 60 Tokenization (or Segmentation)

• English
  – Input (some byte stream): "There," said Bob.
  – Output (7 “tokens” or “words”): " There , " said Bob .
• Chinese

– Input (byte stream): 美国关岛国际机场及其办公室均接获 一名自称沙地阿拉伯富商拉登等发出 的电子邮件。

– Output: 美国 关岛国 际机 场 及其 办公 室均接获 一名 自称 沙地 阿拉 伯 富 商拉登 等发 出 的 电子邮件。 61 The Basic Formulation of SMT

• Given a source language sentence s, what is the target language text t that maximizes p(t | s)?

• So, any target language sentence t is a “potential” translation of the source sentence s – But probabilities differ – We need that t with the highest probability of being a translation.

62 The Basic Formulation of SMT

• Given a source language sentence s, what is the target language text t that maximizes p(t | s)?

• We denote this computation as a search: t* = argmax_t p(t | s)

63 The Basic Formulation of SMT

• We need to compute t* = argmax_t p(t | s)

• Using Bayes’ Rule we can “factorize” this into two separate problems:
  t* = argmax_t p(t | s) = argmax_t [ p(s | t) p(t) / p(s) ] = argmax_t p(s | t) p(t)
  – Search over all possible target sentences t
  – For a given s, p(s) is constant, so no need to consider it in the maximization
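A toy sketch of this argmax, with an invented candidate list and made-up probability tables, just to show the shape of the computation.

# Toy noisy-channel decoder: pick t maximizing p(s|t) * p(t).
# The candidate set and both probability tables are invented for illustration.
candidates = ["Dün Ali'yi gördüm", "Dün Ali'ye gördüm", "Gördüm ben dün Ali'yi"]
p_t   = {"Dün Ali'yi gördüm": 0.5, "Dün Ali'ye gördüm": 0.1, "Gördüm ben dün Ali'yi": 0.2}
p_s_t = {"Dün Ali'yi gördüm": 0.4, "Dün Ali'ye gördüm": 0.4, "Gördüm ben dün Ali'yi": 0.3}

def decode(source, candidates):
    return max(candidates, key=lambda t: p_s_t[t] * p_t[t])

print(decode("I saw Ali yesterday", candidates))   # Dün Ali'yi gördüm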

64 The Noisy Channel Model

[Noisy channel picture: the target sentence T “Dün Ali’yi gördüm.” passes through a noisy channel and comes out as the source sentence S “I saw Ali yesterday”. P(T) models what sentences are likely in the target language; P(S|T) models how the channel could have “corrupted” the target into the source language; decoding recovers the most likely T.]

65 Where do the probabilities come from?

[Diagram: a bilingual source/target text is statistically analyzed to give the translation model P(S|T); a monolingual target text is statistically analyzed to give the target language model P(T); given a source sentence, the decoding algorithm outputs the target sentence argmax_T P(T) * P(S|T).]

66 The Statistical Models

• Translation model p(S|T)
  – Essentially models Adequacy without having to worry about Fluency.
  – P(S|T) is high for sentences S if words in S are in general translations of words in T.
• Target Language Model p(T)
  – Essentially models Fluency without having to worry about Adequacy.
  – P(T) is high if a sentence T is a fluent sentence in the target language.

67 How do the models interact?

• Maximizing p(S | T) P(T)
  – p(T) models “good” target sentences (Target Language Model)
  – p(S|T) models whether words in the source sentence are “good” translations of words in the target sentence (Translation Model)

[Table: for the source “I saw Ali yesterday”, candidate target sentences (Bugün Ali’ye gittim, Okulda kalmışlar, Var gelmek ben, Dün Ali’yi gördüm, Gördüm ben dün Ali’yi, Dün Ali’ye gördüm) are checked for “Good target?” (P(T)), “Good match to source?” (P(S|T)), and the overall product.]

68 Three Problems for Statistical MT

• Language model
  – Given a target sentence T, assigns p(T)
    • good target sentence -> high p(T)
    • word salad -> low p(T)

• Translation model
  – Given a pair of strings <S, T>, assigns p(S | T)
    • look like translations -> high p(S | T)
    • don’t look like translations -> low p(S | T)

• Decoding algorithm
  – Given a language model, a translation model, and a new sentence S … find the translation T maximizing p(T) * p(S|T)

69 The Classic Language Model: Word n-grams • Helps us choose among sentences – He is on the soccer field – He is in the soccer field

– Is table the on cup the – The cup is on the table

– Rice shrine – American shrine – Rice company – American company

70 The Classic Language Model

• What is a “good” target sentence?
• T = t1 t2 t3 … tn; we want P(T) to be “high”
• A good approximation is by short n-grams:
  – P(T) ≈ P(t1|START) · P(t2|START, t1) · P(t3|t1, t2) · … · P(ti|ti−2, ti−1) · … · P(tn|tn−2, tn−1)

  – Estimate from large amounts of text
    • Maximum-likelihood estimation
    • Smoothing for unseen data – you can never see all of language
    • There is no data like more data (e.g., 10^9 words would be nice)

71 The Classic Language Model

• If the target language is English, using 2-grams:
  P(I saw water on the table) ≈ P(I | START) * P(saw | I) * P(water | saw) * P(on | water) * P(the | on) * P(table | the) * P(END | table)

72 The Classic Language Model

• If the target language is English, using 3-grams:
  P(I saw water on the table) ≈ P(I | START, START) * P(saw | START, I) * P(water | I, saw) * P(on | saw, water) * P(the | water, on) * P(table | on, the) * P(END | the, table)
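A minimal trigram language model sketch; the add-one smoothing, the START/END handling, and the tiny corpus are my own simplifications.

from collections import Counter

def train_trigram_lm(sentences):
    tri, bi, vocab = Counter(), Counter(), set()
    for sent in sentences:
        words = ["<s>", "<s>"] + sent + ["</s>"]
        vocab.update(words)
        for i in range(2, len(words)):
            tri[(words[i - 2], words[i - 1], words[i])] += 1
            bi[(words[i - 2], words[i - 1])] += 1
    V = len(vocab)
    def prob(w, u, v):                      # p(w | u, v) with add-one smoothing
        return (tri[(u, v, w)] + 1) / (bi[(u, v)] + V)
    return prob

def sentence_prob(prob, sent):
    words = ["<s>", "<s>"] + sent + ["</s>"]
    p = 1.0
    for i in range(2, len(words)):
        p *= prob(words[i], words[i - 2], words[i - 1])
    return p

prob = train_trigram_lm([["the", "cup", "is", "on", "the", "table"],
                         ["the", "cup", "is", "on", "the", "shelf"]])
# The fluent word order scores higher than the scrambled one.
print(sentence_prob(prob, ["the", "cup", "is", "on", "the", "table"]) >
      sentence_prob(prob, ["is", "table", "the", "on", "cup", "the"]))   # True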

73 Translation Model?

Generative approach: Mary did not slap the green witch

[Figure: a hypothetical “deep” generative story – morphological analysis (source-language morphological structure), parsing (source parse tree), semantic analysis (semantic representation), generation (target structure) – with the question: what are all the possible moves and their associated probability tables?]

Maria no dió una bofetada a la bruja verde

74 The Classic Translation Model Word Substitution/Permutation [IBM Model 3, Brown et al., 1993]

Generative approach:
  Mary did not slap the green witch
  → (predict count of target words)  Mary not slap slap slap the green witch
  → (predict target words from NULL)  Mary not slap slap slap NULL the green witch
  → (translate source to target words)  Maria no dió una bofetada a la verde bruja
  → (reorder target words)  Maria no dió una bofetada a la bruja verde


Selected as the most likely by P(T) 76 Basic Translation Model (IBM M-1)

• Model p(t | s, m)
  – t = ⟨t1, . . . , tm⟩, s = ⟨s1, . . . , sn⟩
• Lexical translation makes the following assumptions:
  – Each word ti in t is generated from exactly one word in s.
  – Thus, we have a latent alignment ai that indicates which source word ti “came from”; specifically, it came from s_ai.
  – Given the alignments a, translation decisions are conditionally independent of each other and depend only on the aligned source word s_ai.

77 Basic Translation Model (IBM M-1)

p(t | s, m) = Σ over alignments a of [ p(a | s, m) × Π_{i=1..m} p(ti | s_ai) ]

  p(a | s, m): p(alignment)      Π p(ti | s_ai): p(translation | alignment)

78 Parameters of the IBM 3 Model

• Fertility: How many words does a source word get translated to?
  – n(k | s): the probability that the source word s gets translated as k target words
  – Fertility depends solely on the source word in question and not on other source words in the sentence, or their fertilities.

• Null Probability: What is the probability of a word magically appearing in the target at some position, without being the translation of any source word?
  – P-Null

79 Parameters of the IBM 3 Model

• Translation: How do source words translate?
  – tr(t|s): the probability that the source word s gets translated as the target word t
  – Once we fix n(k | s), we generate k target words
• Reordering: How do words move around in the target sentence?
  – d(j | i): distortion probability – the probability of the word at position i in the source sentence being translated as the word at position j in the target sentence.
    • Very dubious!!

80 How IBM Model 3 works

1. For each source word si indexed by i = 1, 2, . . . , m, choose a fertility φi with probability n(φi | si).

2. Choose the number φ0 of “spurious” target words to be generated from s0 = NULL.

81 How IBM Model 3 works

3. Let q be the sum of fertilities for all words, including NULL.
4. For each i = 0, 1, 2, . . . , m, and each k = 1, 2, . . . , φi, choose a target word tik with probability tr(tik | si).
5. For each i = 1, 2, . . . , m, and each k = 1, 2, . . . , φi, choose a target position πik with probability d(πik | i, l, m).

82 How IBM Model 3 works

6. For each k = 1, 2, . . . , φ0, choose a position π0k from the remaining vacant positions in 1, 2, . . . , q, for a total probability of 1/φ0.

7. Output the target sentence with words tik in positions πik (0 <= i <= m, 1 <= k <= φi).

83 Example

• Two aligned sentence pairs: (b c d → x y z), with b–x, d–y, d–z and c unaligned; and (b d → x y), with b–x, d–y.
• n-parameters
  – n(0|b)=0, n(1|b)=2/2=1
  – n(0|c)=1/1=1, n(1|c)=0
  – n(0|d)=0, n(1|d)=1/2=0.5, n(2|d)=1/2=0.5

84 Example

• t-parameters (same two aligned pairs as above)
  – t(x|b)=1.0
  – t(y|d)=2/3
  – t(z|d)=1/3

85 Example

• d-parameters (same two aligned pairs as above)
  – d(1|1,3,3)=1.0
  – d(1|1,2,2)=1.0
  – d(2|2,3,3)=0.0
  – d(3|3,3,3)=1.0
  – d(2|2,2,2)=1.0

86 Example

• p1 (same two aligned pairs as above)
  – No target words are generated by NULL, so p1 = 0.0

87 The Classic Translation Model Word Substitution/Permutation [IBM Model 3, Brown et al., 1993]

Generative approach:
  Mary did not slap the green witch
  → n(3 | slap)   Mary not slap slap slap the green witch
  → P-Null        Mary not slap slap slap NULL the green witch
  → tr(la | the)  Maria no dió una bofetada a la verde bruja
  → d(j | i)      Maria no dió una bofetada a la bruja verde

Selected as the most likely by P(T) 88 How do we get these parameters?

• Remember we had aligned parallel sentences

• Now we need to figure out how words align with other words. – Word alignment

89 Word Alignments

• One source word can map to 0 or more target words
  – But not vice versa
    • technical reasons
• Some words in the target can magically be generated from an invisible NULL word
• A target word can only be generated from one source word
  – technical reasons

90 Word Alignments

tr(target | source) = count(target, source) / count(source)
• Count over all aligned sentences
• worked – fonctionné (30), travaillé (20), marché (27), oeuvré (13)
  – tr(oeuvré | worked) = 0.13
• Similarly, n(3 | many) can be computed.

91 How do we get these alignments?

• We only have aligned sentences and the constraints:
  – One source word can map to 0 or more target words
    • But not vice versa
  – Some words in the target can magically be generated from an invisible NULL word
  – A target word can only be generated from one source word
• Expectation-Maximization (EM) Algorithm
  – The mathematics is rather complicated

92 How do we get these alignments?

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

All word alignments equally likely

All p(french-word | english-word) equally likely

93 How do we get these alignments?

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

“la” and “the” observed to co-occur frequently, so p(la | the) is increased.

94 How do we get these alignments?

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

“house” co-occurs with both “la” and “maison”, but p(maison | house) can be raised without limit, to 1.0, while p(la | house) is limited because of “the”

(pigeonhole principle)

95 How do we get these alignments?

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

settling down after another iteration

96 How do we get these alignments?

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

Inherent hidden structure revealed by EM training! For further details, see:

• “A Statistical MT Tutorial Workbook” (Knight, 1999). • “The Mathematics of Statistical Machine Translation” (Brown et al, 1993) • Software: GIZA++
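To make the EM intuition above concrete, here is a minimal sketch of IBM Model 1 EM training on the toy corpus from these slides. It is a simplified Model 1 (no NULL word, uniform initialization), not GIZA++ itself; the variable names are illustrative.

```python
from collections import defaultdict

# Toy parallel corpus from the slides: (French sentence, English sentence).
corpus = [("la maison".split(), "the house".split()),
          ("la maison bleue".split(), "the blue house".split()),
          ("la fleur".split(), "the flower".split())]

# Initialize p(f | e) uniformly: all word alignments equally likely.
f_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))   # t[(f, e)] approximates p(f | e)

for iteration in range(10):
    count = defaultdict(float)   # expected counts c(f, e)
    total = defaultdict(float)   # expected counts c(e)
    # E-step: collect fractional counts over all possible alignments.
    for fs, es in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)     # normalize over candidate English words
            for e in es:
                c = t[(f, e)] / z
                count[(f, e)] += c
                total[e] += c
    # M-step: re-estimate p(f | e) by relative frequency of the expected counts.
    for (f, e) in count:
        t[(f, e)] = count[(f, e)] / total[e]

# p(la | the) rises and p(maison | house) settles toward 1.0, as described above.
print(round(t[("la", "the")], 2), round(t[("maison", "house")], 2))
```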

97 Decoding for “Classic” Models

• Of all conceivable target word strings, find the one maximizing p(t) × p(s | t)

• Decoding is an NP-complete challenge – Reduction to Traveling Salesman problem (Knight, 1999)

• Several search strategies are available

• Each potential target output is called a hypothesis.

98 Dynamic Programming Beam Search

[Figure: beam-search lattice — partial hypotheses are expanded target word by target word (1st, 2nd, 3rd, 4th, ...), from the start node to the end node, until all source words are covered]

Each partial translation hypothesis contains:
- the last target word chosen + the source words covered by it
- the next-to-last target word chosen
- the entire coverage vector (so far) of the source sentence
- the language model and translation model scores (so far)

[Jelinek, 1969; Brown et al., 1996 US Patent; Och, Ueffing, and Ney, 2001]

100 The Classic Results

• la politique de la haine . (Original Source) • politics of hate . (Reference Translation) • the policy of the hatred . (IBM4+N-grams+Stack)

• nous avons signé le protocole . (Original Source) • we did sign the memorandum of agreement . (Reference Translation) • we have signed the protocol . (IBM4+N-grams+Stack)

• où était le plan solide ? (Original Source) • but where was the solid plan ? (Reference Translation) • where was the economic base ? (IBM4+N-grams+Stack)

the Ministry of Foreign Trade and Economic Cooperation, including foreign direct investment 40.007 billion US dollars today provide data include that year to November china actually using foreign 46.959 billion US dollars and 101 Flaws of Word-Based MT

• Multiple source words for one target word
– IBM models can do one-to-many (fertility) but not many-to-one
• Phrasal Translation
– "real estate", "note that", "interest in"
• Syntactic Transformations
– Verb at the beginning in Arabic
– Translation model penalizes any proposed re-ordering
– Language model not strong enough to force the verb to move to the right place

102 Phrase-Based Statistical MT

Morgen fliege ich nach Kanada zur Konferenz

Tomorrow I will fly to the conference in Canada

• Source input is segmented into phrases – a "phrase" is any sequence of words • Each phrase is probabilistically translated into the target – P(to the conference | zur Konferenz) – P(into the meeting | zur Konferenz) • Phrases are probabilistically re-ordered

103 Advantages of Phrase-Based SMT

• Many-to-many mappings can handle non-compositional phrases
• Local context is very useful for disambiguating
– "interest rate" → …
– "interest in" → …
• The more data, the longer the learned phrases
– Sometimes whole sentences

104 How to Learn the Phrase Translation Table?

• One method: “alignment templates” • Start with word alignment, build phrases from that.

[Figure: word-to-word alignment grid between "Maria no dió una bofetada a la bruja verde" and "Mary did not slap the green witch"]

This word-to-word alignment is a by-product of training a translation model like IBM Model 3. This is the best (or "Viterbi") alignment.

105 How to Learn the Phrase Translation Table?

• One method: “alignment templates” (Och et al, 1999) • Start with word alignment, build phrases from that.

[Figure: word-to-word alignment grid between "Maria no dió una bofetada a la bruja verde" and "Mary did not slap the green witch"]

This word-to-word alignment is a by-product of training a translation model like IBM Model 3. This is the best (or "Viterbi") alignment.

106 IBM Models are 1-to-Many

• Run IBM-style aligner both directions, then merge:

T→S best alignment

S→T best alignment

MERGE (union or intersection)

107 How to Learn the Phrase Translation Table?

• Collect all phrase pairs that are consistent with the word alignment

Maria no dió una bofetada a la bruja verde

Mary did not slap the green witch

[One example phrase pair is highlighted in the alignment grid]

108 Word Alignment Consistent Phrases

[Figure: three candidate phrase pairs over the alignment between "Maria no dió" and "Mary did not slap" — the first is consistent, the other two are inconsistent]

A phrase alignment must contain all alignment points for all the words in both phrases!
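A minimal sketch of consistency-based phrase extraction over the alignment used in these slides. The function name is illustrative, and for brevity it does not extend phrases over unaligned boundary words (e.g., "a"), so it finds a subset of the pairs listed on the following slides.

```python
# Source (Spanish) and target (English) sentences, 0-indexed positions.
src = "Maria no dió una bofetada a la bruja verde".split()
tgt = "Mary did not slap the green witch".split()
# Alignment points (src_index, tgt_index) as drawn in the slides.
align = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3), (6, 4), (7, 6), (8, 5)}

def extract_phrases(src, tgt, align, max_len=7):
    """Collect all phrase pairs consistent with the word alignment."""
    pairs = []
    for ts in range(len(tgt)):
        for te in range(ts, min(ts + max_len, len(tgt))):
            # Source positions linked to the target span tgt[ts..te].
            src_pos = {i for (i, j) in align if ts <= j <= te}
            if not src_pos:
                continue
            ss, se = min(src_pos), max(src_pos)
            # Consistency: no source word inside [ss, se] may link outside [ts, te].
            if any(ss <= i <= se and not (ts <= j <= te) for (i, j) in align):
                continue
            pairs.append((" ".join(src[ss:se + 1]), " ".join(tgt[ts:te + 1])))
    return pairs

for s, t in extract_phrases(src, tgt, align):
    print(f"({s}, {t})")
```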

109 Word Alignment Induced Phrases

[Alignment grid: Maria no dió una bofetada a la bruja verde vs. Mary did not slap the green witch; slides 109–115 collect progressively longer consistent phrase pairs]

(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap the) (Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the) (bruja verde, green witch)
(Maria no dió una bofetada, Mary did not slap) (a la bruja verde, the green witch)
(Maria no dió una bofetada a la, Mary did not slap the) (no dió una bofetada a la, did not slap the) (dió una bofetada a la bruja verde, slap the green witch)
(Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch)

115 Phrase Pair Probabilities

• A certain phrase pair (s-s-s, t-t-t) may appear many times across the bilingual corpus.

– We hope so!

• So, now we have a vast list of phrase pairs and their frequencies – how to assign probabilities?

116 Phrase-based SMT

• After doing this to millions of sentences – For each phrase pair (t, s) • Count how many times s occurs • Count how many times s is translated to t • Estimate p(t | s)
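A minimal sketch of this relative-frequency estimate; the toy phrase-pair counts below are invented for illustration.

```python
from collections import Counter, defaultdict

# Invented toy counts of extracted phrase pairs (source_phrase, target_phrase).
pair_counts = Counter({("zur Konferenz", "to the conference"): 80,
                       ("zur Konferenz", "into the meeting"): 20,
                       ("nach Kanada", "to Canada"): 50})

# p(t | s) = count(s, t) / count(s)
src_counts = defaultdict(int)
for (s, t), c in pair_counts.items():
    src_counts[s] += c

phrase_table = {(s, t): c / src_counts[s] for (s, t), c in pair_counts.items()}
print(phrase_table[("zur Konferenz", "to the conference")])   # 0.8
```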

117 Decoding

• During decoding – a sentence is segmented into "phrases" in all possible ways – each such phrase is then "translated" to the target phrases in all possible ways – Translations are also moved around – Resulting target sentences are scored with the target language model • The decoder does not actually enumerate all possible translations or all possible target sentences – Pruning

118 Decoding

119 Basic Model, Revisited

argmax_t P(t | s)
= argmax_t P(t) × P(s | t) / P(s)
= argmax_t P(t) × P(s | t)

120 Basic Model, Revisited

argmax_t P(t | s)
= argmax_t P(t) × P(s | t) / P(s)
= argmax_t P(t)^2.4 × P(s | t)    (seems to work better)

121 Basic Model, Revisited

argmax_t P(t | s)
= argmax_t P(t) × P(s | t) / P(s)
= argmax_t P(t)^2.4 × P(s | t) × length(t)^1.1

Rewards longer hypotheses, since these are unfairly punished by P(t).

122 Basic Model, Revisited

argmax_t P(t)^2.4 × P(s | t) × length(t)^1.1 × KS^3.7 × …

Lots of knowledge sources vote on any given hypothesis.

“Knowledge source” = “feature function” = “score component”.

Feature function simply scores a hypothesis with a real value.

(May be binary, as in "t has a verb").

Problem: How to set the exponent weights?
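A minimal sketch of such a weighted "knowledge source" scorer. The feature functions and weights below are hypothetical placeholders; in practice the weights are tuned automatically, as described on the next slide.

```python
import math

# Hypothetical feature functions: each maps a hypothesis (and its source) to a real value.
def lm_logprob(hyp):            # stand-in for log P(t)
    return -2.0 * len(hyp.split())

def tm_logprob(hyp, src):       # stand-in for log P(s | t)
    return -1.5 * len(src.split())

def length_feature(hyp):
    return math.log(len(hyp.split()))

def has_verb(hyp):              # a binary feature, "t has a verb"
    return 1.0 if "slap" in hyp else 0.0

# The exponent weights of the slide become linear weights in log space:
# P(t)^2.4 * P(s|t) * length(t)^1.1 * ...  <=>  2.4*logP(t) + logP(s|t) + 1.1*log length(t) + ...
weights = {"lm": 2.4, "tm": 1.0, "length": 1.1, "verb": 3.7}

def score(hyp, src):
    feats = {"lm": lm_logprob(hyp), "tm": tm_logprob(hyp, src),
             "length": length_feature(hyp), "verb": has_verb(hyp)}
    return sum(weights[k] * feats[k] for k in feats)

print(score("Mary did not slap the green witch",
            "Maria no dió una bofetada a la bruja verde"))
```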

123 Maximum BLEU Training

[Diagram: a trainable translation system (translation model, target language models #1 and #2, length model, other features) translates the source; an automatic BLEU evaluator scores the MT output against reference translations (sample "right answers"), and a learning algorithm adjusts the weights to directly reduce translation error]

Yields big improvements in quality.

124 Automatic Machine Translation Evaluation

• Objective
• Inspired by the Word Error Rate metric used by ASR research
• Measuring the "closeness" between the MT hypothesis and human reference translations
– Precision: n-gram precision
– Recall: against the best matched reference; approximated by a brevity penalty
• Cheap, fast
• Highly correlated with human evaluations
• MT research has greatly benefited from automatic evaluations
• Typical metrics: BLEU, NIST, F-Score, Meteor, TER

125 BLEU Evaluation

Reference (human) translation: The US island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself Osama Bin Laden and threatening a biological/chemical attack against the airport.

Machine translation: The American [?] International airport and its the office a [?] receives one calls self the sand Arab rich business [?] and so on electronic mail, which sends out; The threat will be able after the maintenance at the airport.

N-gram precision (score between 0 & 1): what % of machine n-grams (sequences of words) can be found in the reference translation?

Brevity penalty: can't just type out the single word "the" (precision 1.0!)

It is extremely hard to trick the system, i.e., to find a way to change the MT output so that the BLEU score increases but quality doesn't.

126 More Reference Translations are Better

Reference translation 1: The US island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself Osama Bin Laden and threatening a biological/chemical attack against the airport.

Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the rich Saudi Arabian businessman Osama Bin Laden and that threatened to launch a biological and chemical attack on the airport.

Machine translation: The American [?] International airport and its the office a [?] receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out; The threat will be able after the maintenance at the airport to start the biochemistry attack.

Reference translation 3: The US International Airport of Guam and its office has received an email from a self-claimed Arabian millionaire named Laden, which threatens to launch a biochemical attack on airport. Guam authority has been on precaution about this matter.

Reference translation 4: US Guam International Airport and its offices received an email from Mr. Bin Laden and other rich businessmen from Saudi Arabia. They said there would be biochemistry air raid to Guam Airport. Guam needs to be in high alert.

127 BLEU in Action

• Reference Translation: The gunman was shot to death by the police .

• The gunman was shot kill . • Wounded police jaya of • The gunman was shot dead by the police . • The gunman arrested by police kill . • The gunmen were killed . • The gunman was shot to death by the police . • The ringer is killed by the police . • Police killed the gunman .

• Green = 4-gram match (good!) Red = unmatched word (bad!)

128 BLEU Formulation

BLEU = min(1, output-length / reference-length) × (precision_1 × precision_2 × precision_3 × precision_4)^(1/4)

precision_i: i-gram precision over the whole corpus
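A minimal sketch of this BLEU computation for a single hypothesis against a single reference. It uses the simplified brevity penalty from the slide (real BLEU scores a whole corpus, uses clipped "modified" precisions over multiple references, and a slightly different penalty); the function names are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        # Clip matches: an n-gram cannot match more often than it occurs in the reference.
        matches = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(1, sum(hyp_ngrams.values()))
        precisions.append(matches / total)
    if min(precisions) == 0:
        return 0.0
    brevity = min(1.0, len(hyp) / len(ref))          # brevity penalty as on the slide
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the gunman was shot to death by the police ."
print(round(bleu("the gunman was shot to death by the police .", ref), 2))  # 1.0
print(round(bleu("the gunman was shot dead by the police .", ref), 2))
```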

129 Correlation with Human Judgment

130 What About Morphology?

• An issue for handling morphologically complex languages like Turkish, Hungarian, Finnish, Arabic, etc.
– A word contains much more information than just the root word
• Arabic: wsyktbunha (wa+sa+ya+ktub+ūn+ha, "and they will write her")
– What are the alignments?
• Turkish: gelebilecekmissin (gel+ebil+ecek+mis+sin, "(I heard) you would be coming")
– What are the alignments?

131 Morphology & SMT

• Finlandiyalılaştıramadıklarımızdanmışsınızcasına

• Finlandiya+lı+laş+tır+ama+dık+lar+ımız+dan+mış+sını z+casına • (behaving) as if you have been one of those whom we could not convert into a Finn(ish citizen)/someone from Finland

132 Morphology & SMT

Most of the time, the morpheme order is the "reverse" of the corresponding English.

• yapabileceksek
– yap+abil+ecek+se+k
– if we will be able to do (something)
• yaptırtabildiğimizde
– yap+tır+t+tığ+ımız+da
– when/at the time we had (someone) have (someone else) do (something)
• görüntülenebilir
– görüntüle+n+ebil+ir
– it can be visualize+d
• sakarlıklarından
– sakar+lık+ları+ndan
– of/from/due-to their clumsi+ness

133 Morphology and Alignment

• Remember, the alignment needs to count co-occurring words
– If one side of the parallel text has little morphology (e.g., English)
– and the other side has lots of morphology
• Lots of words on the English side either don't align or align randomly

134 Morphology & SMT

• If we ignore morphology
– Large vocabulary size on the Turkish side
– Potentially noisy alignments
– The link activity–faaliyet is very "loose"

Word Form        Count  Gloss
faaliyet         3      activity
faaliyete        1      to the activity
faaliyetinde     1      in its activity
faaliyetler      3      activities
faaliyetlere     6      to the activities
faaliyetleri     7      their activities
faaliyetlerin    7      of the activities
faaliyetlerinde  1      in their activities
faaliyetlerine   5      to their activities
faaliyetlerini   1      their activities (accusative)
faaliyetlerinin  2      of their activities
faaliyetleriyle  1      with their activities
faaliyette       2      in (the) activity
faaliyetteki     1      that is in activity
TOTAL            41

135 An Example E – T Translation

we are going to your hotel in Taksim by taxi

we are go+ing to your hotel in Taksim by taxi

136 An Example E – T Translation

we are going to your hotel in Taksim by taxi

we are go+ing to your hotel in Taksim by taxi

Biz siz+in Taksim +de +ki otel +iniz +e taksi +yle gid +iyor +uz


140 Morphology and Parallel Texts

• Use
– Morphological analyzers (HLT Workshop 2)
– Taggers/Disambiguators (HLT Workshop 3)
• to split both sides of the parallel corpus into morphemes

141 Morphology and Parallel Texts

• A typical sentence pair in this corpus looks like the following: • Turkish: – kat +hl +ma ortaklık +sh +nhn uygula +hn +ma +sh , ortaklık anlaşma +sh çerçeve +sh +nda izle +hn +yacak +dhr . • English: – the implementation of the accession partnership will be monitor +ed in the framework of the association agreement

142 Results

• Using morphology in Phrase-based SMT certainly improves results compared to just using words • But – Sentences get much longer and this hurts alignment – We now have an additional problem: getting the morpheme order on each word right

143 Syntax and Morphology Interaction

• A completely different approach
– Instead of dividing up the Turkish side into morphemes
– Collect "stuff" on the English side to make up "words"
– What is the motivation?

144 Syntax and Morphology Interaction

we are going to your hotel in Taksim by taxi

we are go+ing to your hotel in Taksim by taxi

Biz siz+in Taksim +de +ki otel +iniz +e taksi +yle gid +iyor +uz

Suppose we can do some syntactic analysis on the English side

145 Syntax and Morphology Interaction

we are go+ing to your hotel in Taksim by taxi • to your hotel – to is the preposition related to hotel – your is the possessor of hotel • to your hotel => hotel +your+to otel +iniz+e – separate content from local syntax

146 Syntax and Morphology Interaction

we are go+ing to your hotel in Taksim by taxi • we are go+ing – we is the subject of go – are is the auxiliary of go – ing is the present tense marker for go • we are go+ing => go +ing+are+we gid +iyor+uz – separate content from local syntax

147 Syntax and Morphology Interaction

we are go+ing to your hotel in Taksim by taxi

go+ing+are+we hotel +your+to Taksim+in taxi+by

Biz siz+in Taksim+de+ki otel+iniz+e taksi+yle gid+iyor+uz

Now align only based on root words – the syntax alignments just follow that 148 Syntax and Morphology Interaction

149 Syntax and Morphology Interaction

• Transformations on the English side reduce sentence length • This helps alignment – Morphemes and most function words never get involved in alignment • We can use factored phrase-based translation – Phrased-based framework with morphology support

150 Syntax and Morphology Interaction

[Chart: number of English and Turkish tokens (left axis, roughly 800,000–1,300,000) and BLEU scores (right axis, roughly 15–25) for the factored experiments: Baseline, Adv, Verb, Verb+Adv, Noun+Adj, Noun+Adj+Verb, Noun+Adj+Verb+Adv, Noun+Adj+Verb+PostPC, Noun+Adj+Verb+Adv+PostPC]

151 Syntax and Morphology Interaction

• She is reading. – She is the subject of read – is is the auxiliary of read

• She is read+ing => read +ing+is+she taQrAA QrAA +*ta

152 MT Strategies (1954–2004)

[Diagram (slide by Laurie Gerber): MT strategies plotted along two axes — knowledge representation strategy, from shallow/simple (word-based, phrase tables) to deep/complex (constituent structure, semantic analysis, interlingua) — and knowledge acquisition strategy, from all manual (hand-built by experts or non-experts) to fully automated (learned from annotated or un-annotated data). Electronic dictionaries, example-based MT, statistical MT, the original direct approach, typical transfer systems, and classic interlingual systems are placed on this grid; "new research goes here" in the deep, automated corner]

153 Syntax in SMT

• Early approaches relied on high-performance parsers for one or both languages – Good applicability when English is the source language • Tree-to-tree or tree-to-string transductions

• Recent approaches induce synchronous grammars during training – Grammars that describe two languages at the same time

• NP => ADJe1 NPe2 : NPf2 ADJf1

154 Tree-to-String Transformation

[Figure: the English parse tree of "he adores listening to music" is transformed step by step — Reorder the children of each node, Insert Japanese function words (ha, ga, no, desu), Translate the leaf words, then Take Leaves — producing the Japanese sentence "Kare ha ongaku wo kiku no ga daisuki desu"]

Tree-to-String Transformation

• Each step is described by a statistical model – Reorder children on a node probabilistically – R-table – English – Japanese table

Original Order   Reordering    P(reorder | original)
PRP VB1 VB2      PRP VB1 VB2   0.074
                 PRP VB2 VB1   0.723
                 VB1 PRP VB2   0.061
                 VB1 VB2 PRP   0.037
                 VB2 PRP VB1   0.083
                 VB2 VB1 PRP   0.021
VB TO            VB TO         0.107
                 TO VB         0.893
TO NN            TO NN         0.251
                 NN TO         0.749

156

Tree-to-String Transformation

• Each step is described by a statistical model – Insert new sibling to the left or right of a node probabilistically – Translate source nodes probabilistically

157 Hierarchical phrase models

• Combines phrase-based models and tree structures • Extract synchronous grammars from parallel text • Uses a statistical chart-parsing algorithm during decoding – Parse and generate concurrently

158 For more info

• Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation (SSST-3) at NAACL HLT 2009 – http://aclweb.org/anthology-new/W/W09/#2300 • Proceedings of the ACL-08: HLT Second Workshop on Syntax and Structure in Statistical Translation (SSST-2) – http://aclweb.org/anthology-new/W/W08/#0400

159 Acknowledgments

• Some of the tutorial material is based on slides by – Kevin Knight (USC/ISI) – Philipp Koehn (Edinburgh) – Reyyan Yeniterzi (CMU/LTI)

160 Important References

• Statistical Machine Translation (2010) – Philipp Koehn – Cambridge University Press • SMT Workbook (1999) – Kevin Knight – Unpublished manuscript at http://www.isi.edu/~knight/ • http://www.statmt.org • http://aclweb.org/anthology-new/ – Look for “Workshop on Statistical Machine Translation”

161 11-411 Natural Language Processing Neural Networks and Deep Learning in NLP

Kemal Oflazer

Carnegie Mellon University in Qatar

1/60 Big Picture: Natural Language Analyzers

2/60 Big Picture: Natural Language Analyzers

3/60 Big Picture: Natural Language Analyzers

4/60 Linear Models

I y1 = w11x1 + w21x2 + w31x3 + w41x4 + w51x5

5/60 Perceptrons

I Remember Perceptrons?

I A very simple algorithm guaranteed to eventually find a linear separator hyperplane (determine w), if one exists.

I If one doesn’t, the perceptron will oscillate!

I Assume our classifier is classify(x) = 1 if w · Φ(x) > 0, and 0 if w · Φ(x) ≤ 0

I Start with w = 0

I for t = 1,..., T

I i = t mod N
I w ← w + α (ℓi − classify(xi)) Φ(xi)

I Return w

I α is the learning rate – determined by experimentation.
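A minimal sketch of this update rule on a toy, linearly separable dataset (an AND-like function); the data, the bias feature folded into Φ(x), and the learning rate are illustrative.

```python
import numpy as np

# Toy data: Phi(x) includes a constant bias feature as the last component; labels l_i are 0/1.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
labels = np.array([0, 0, 0, 1])

def classify(w, x):
    return 1 if np.dot(w, x) > 0 else 0

w = np.zeros(X.shape[1])    # start with w = 0
alpha = 0.5                 # learning rate, set by experimentation
N = len(X)
for t in range(100):        # T iterations, cycling through the examples
    i = t % N
    w = w + alpha * (labels[i] - classify(w, X[i])) * X[i]

print(w, [classify(w, x) for x in X])   # the learned w reproduces the labels
```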

6/60 Perceptrons

I For classification we are basically computing score(x) = W^T × f(x) = Σ_j wj · fj(x)

I wj are the weights comprising W I fj(x) are the feature functions. I We are then deciding based on the value of score(x) I Such a computation can be viewed as a “network”.

I Feature function values are provided by the nodes on the left. I Edges have the weights wi. Each feature value is multiplied with the respective edge weight. I The node on the right sums up the incoming values and decides. 7/60 Perceptron

I While quite useful, such a model can only classify “linearly separable” classes.

I So it fails for a very simple problem such as the exclusive-or

8/60 Multiple Layers

I We can add an intermediate “hidden” layer.

I each arrow is a weight

I Have we gained anything?

I Not really. We have a linear combination of weights (input to hidden and hidden to output),

I Those two can be combined offline to a single weight matrix.

9/60 Adding Non-linearity

I Instead of computing a linear combination: score(x) = Σ_j wj · fj(x)

I We use a non-linear function F: score(x) = F(Σ_j wj · fj(x))

I Some popular choices for F

10/60 Deep Learning

I More layers ⇒ “deep learning”

I The sigmoid is also called the “logistic function.”

11/60 What Depth Holds

I Each layer is a processing step

I Having multiple processing steps allows complex functions

I Metaphor: NN and computing circuits

I computer = sequence of Boolean gates

I neural computer = sequence of layers

I Deep neural networks can implement more complex functions

12/60 Simple Neural Network

[Network diagram: inputs x0, x1 plus a bias unit feed two hidden units (weights 3.7, 3.7, bias −1.5 into h0; weights 2.9, 2.9, bias −4.6 into h1); the hidden units plus a bias unit feed one output unit (weights 4.5, −5.2, bias −2.0)]

I One innovation: bias units (no input, always value 1)

13/60 Sample Input

[The same network with input values 1.0 and 0.0]

I Try out two input values
I Hidden unit computation:
sigmoid(1.0 × 3.7 + 0.0 × 3.7 + 1 × −1.5) = sigmoid(2.2) = 1 / (1 + e^−2.2) = 0.90
sigmoid(1.0 × 2.9 + 0.0 × 2.9 + 1 × −4.6) = sigmoid(−1.7) = 1 / (1 + e^1.7) = 0.15

14/60 Computed Hidden Layer Values

[The same network with inputs 1.0, 0.0 and computed hidden values 0.90, 0.15]

I Try out two input values
I Hidden unit computation:
sigmoid(1.0 × 3.7 + 0.0 × 3.7 + 1 × −1.5) = sigmoid(2.2) = 1 / (1 + e^−2.2) = 0.90
sigmoid(1.0 × 2.9 + 0.0 × 2.9 + 1 × −4.6) = sigmoid(−1.7) = 1 / (1 + e^1.7) = 0.15

15/60 Computed Output Value

[The same network with inputs 1.0, 0.0, hidden values 0.90, 0.15, and output value 0.78]

I Output unit computation:
sigmoid(0.90 × 4.5 + 0.15 × −5.2 + 1 × −2.0) = sigmoid(1.25) = 1 / (1 + e^−1.25) = 0.78

16/60 Output for All Binary Inputs

Input x0  Input x1  Hidden h0  Hidden h1  Output y0
0         0         0.18       0.01       0.23 → 0
0         1         0.90       0.15       0.78 → 1
1         0         0.90       0.15       0.78 → 1
1         1         0.99       0.77       0.18 → 0

I Network implements the XOR

I hidden node h0 is OR I hidden node h1 is AND I final layer is (essentially) h0 − (h1)
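A minimal numpy sketch of the forward pass through this exact network; running it reproduces the table above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights of the example network; the last column of each matrix is the bias weight.
W_hidden = np.array([[3.7, 3.7, -1.5],    # weights into h0
                     [2.9, 2.9, -4.6]])   # weights into h1
W_output = np.array([4.5, -5.2, -2.0])    # weights into y0 (from h0, h1, bias)

def forward(x0, x1):
    h = sigmoid(W_hidden @ np.array([x0, x1, 1.0]))   # hidden layer (bias unit = 1)
    y = sigmoid(W_output @ np.append(h, 1.0))         # output layer (bias unit = 1)
    return h, y

for x0, x1 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h, y = forward(x0, x1)
    print(x0, x1, np.round(h, 2), round(float(y), 2))  # reproduces the XOR table
```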

17/60 The Brain vs. Artificial Neural Networks

I Similarities

I Neurons, connections between neurons

I Learning = change of connections, not change of neurons

I Massive parallel processing

I But artificial neural networks are much simpler

I computation within neuron vastly simplified

I discrete time steps

I typically some form of supervised learning with massive number of stimuli

18/60 Backpropagation Training

I Lather – take an input and run it forward through the network

I Rinse – compare it to the expected output, and adjust weights if they differ

I Repeat – for the next input, until convergence or time-out

19/60 Backpropagation Training

[The same network with inputs 1.0, 0.0, hidden values 0.90, 0.15, and output value 0.78]

I Computed output is y = 0.78

I Correct output is t(arget) = 1.0

I How do we adjust the weights?

20/60 Key Concepts

I Gradient Descent

I error is a function of the weights

I we want to reduce the error

I gradient descent: move towards the error minimum

I compute gradient → get direction to the error minimum

I adjust weights towards direction of lower error

I Backpropagation

I first adjust last set of weights

I propagate error back to each previous layer

I adjust their weights

21/60 Gradient Descent

22/60 Gradient Descent

23/60 Derivative of the Sigmoid

I Sigmoid: sigmoid(x) = 1 / (1 + e^−x)
I Reminder, the quotient rule: (f(x)/g(x))' = (g(x) f'(x) − f(x) g'(x)) / g(x)^2
I Derivative:
d sigmoid(x)/dx = d/dx [1 / (1 + e^−x)]
= (0 × (1 + e^−x) − (−e^−x)) / (1 + e^−x)^2
= (1 / (1 + e^−x)) × (e^−x / (1 + e^−x))
= (1 / (1 + e^−x)) × (1 − 1 / (1 + e^−x))
= sigmoid(x) (1 − sigmoid(x))

24/60 Final Layer Update (1)

I We have a linear combination of weights and hidden layer values: s = Σ_k wk hk

I Then we have the activation function: y = sigmoid(s)
I We have the error function E = (1/2)(t − y)^2.
I t is the target output.
I Derivative of the error with respect to one weight wk (using the chain rule):
dE/dwk = (dE/dy)(dy/ds)(ds/dwk)

I Error is already defined in terms of y, hence
dE/dy = d/dy [(1/2)(t − y)^2] = −(t − y)

25/60 Final Layer Update (2)

I We have a linear combination of weights and hidden layer values: s = Σ_k wk hk
I Then we have the activation function: y = sigmoid(s)
I We have the error function E = (1/2)(t − y)^2.
I Derivative of the error with respect to one weight wk (using the chain rule):
dE/dwk = (dE/dy)(dy/ds)(ds/dwk)

I y with respect to s:
dy/ds = d sigmoid(s)/ds = sigmoid(s)(1 − sigmoid(s)) = y(1 − y)

26/60 Final Layer Update (3)

I We have a linear combination of weights and hidden layer values: s = Σ_k wk hk
I Then we have the activation function: y = sigmoid(s)
I We have the error function E = (1/2)(t − y)^2.
I Derivative of the error with respect to one weight wk (using the chain rule):
dE/dwk = (dE/dy)(dy/ds)(ds/dwk)

I s is a weighted linear combination of the hidden node values hk:
ds/dwk = d/dwk (Σ_k wk hk) = hk

27/60 Putting it All Together

I Derivative of the error with respect to one weight wk:
dE/dwk = (dE/dy)(dy/ds)(ds/dwk) = −(t − y) y(1 − y) hk

I −(t − y) is the error term; y(1 − y) is the derivative of the sigmoid, written y'.

I We adjust the weight as follows:
∆wk = µ (t − y) y' hk
where µ is a fixed learning rate.

28/60 Multiple Output Nodes

I Our example had one output node.

I Typically neural networks have multiple output nodes.

I Error is computed over all j output nodes:
E = (1/2) Σ_j (tj − yj)^2

I Weight wkj from hidden unit k to output unit j is adjusted according to node j:
∆wkj = µ (tj − yj) yj' hk

I We can also rewrite this as ∆wkj = µ δj hk where δj is the error term for output unit j.

29/60 Hidden Layer Update

I In a hidden layer, we do not have a target output value. I But we can compute how much each hidden node contributes to the downstream error E.

I k refers to a hidden node

I j refers to a node in the next/output layer

I Remember the error term: δj = (tj − yj) yj'

I The error term associated with hidden node k (skipping the multivariate math) is: δk = (Σ_j wkj δj) hk'

I So if uik is the weight between input unit xi and hidden unit k, then

∆uik = µ δk xi

I Compare with ∆wkj = µ δj hk. 30/60 An Example

i k j

[The same network, with input layer i, hidden layer k (units 1, 2 and bias unit 3) and output unit G; inputs 1.0, 0.0, hidden values 0.90, 0.15, output 0.78]

For the output unit G:
I Computed output = y1 = 0.78
I Correct output = t1 = 1.0
I Final layer weight updates (with learning rate µ = 10)
I δ1 = (t1 − y1) y1' = (1 − 0.78) × 0.172 = 0.0378
I ∆w11 = µ δ1 h1 = 10 × 0.0378 × 0.90 = 0.3402
I ∆w21 = µ δ1 h2 = 10 × 0.0378 × 0.15 = 0.0567
I ∆w31 = µ δ1 h3 = 10 × 0.0378 × 1 = 0.378

31/60 An Example

[The same network after the final-layer weight updates: 4.5 + 0.3402 = 4.8402, −5.2 + 0.0567 = −5.1433, −2.0 + 0.378 = −1.622; layers labeled x (inputs), u (input→hidden weights), h (hidden), w (hidden→output weights), y (output)]

For the output unit G:
I Computed output = y1 = 0.78
I Correct output = t1 = 1.0
I Final layer weight updates (with learning rate µ = 10)
I δ1 = (t1 − y1) y1' = (1 − 0.78) × 0.172 = 0.0378
I ∆w11 = µ δ1 h1 = 10 × 0.0378 × 0.90 = 0.3402
I ∆w21 = µ δ1 h2 = 10 × 0.0378 × 0.15 = 0.0567
I ∆w31 = µ δ1 h3 = 10 × 0.0378 × 1 = 0.378

32/60 Hidden Layer Updates

[The same updated network (weights 4.8402, −5.1433, −1.622), with layers labeled x, u, h, w, y]

For hidden unit h1:
I δ1 = (Σ_j w1j δ1^G) h1' = 4.5 × 0.0378 × 0.09 = 0.015
I ∆u11 = µ δ1 x1 = 10 × 0.015 × 1.0 = 0.15
I ∆u21 = µ δ1 x2 = 10 × 0.015 × 0.0 = 0
I ∆u31 = µ δ1 x3 = 10 × 0.015 × 1.0 = 0.15
Repeat for hidden unit h2:
I δ2 = (Σ_j w2j δ1^G) h2' = −5.2 × 0.0378 × 0.1275 = −0.025
I ∆u12 = ...
I ∆u22 = ...
I ∆u32 = ...

33/60 Initialization of Weights

I Random initialization e.g., uniformly in the interval [−0.01, 0.01]

I For shallow networks there are suggestions for [−1/√n, 1/√n]

I For deep networks there are suggestions for [−√6/√(ni + ni+1), √6/√(ni + ni+1)]

where ni and ni+1 are sizes of the previous and next layers.

34/60 Neural Networks for Classification

I Predict class: one output per class
I Training data output is a "one-hot vector", e.g., y^T = [0, 0, 1]
I Prediction:

I predicted class is output node i with the highest value yi

I obtain a posterior probability distribution by softmax: softmax(yi) = e^yi / Σ_j e^yj
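A minimal sketch of this softmax computation; the max-subtraction is a standard numerical-stability detail not shown on the slide, and it does not change the result.

```python
import numpy as np

def softmax(y):
    z = np.exp(y - np.max(y))   # subtract the max for numerical stability
    return z / z.sum()

scores = np.array([2.0, 1.0, 0.1])     # raw output-node values y_i
probs = softmax(scores)
print(probs, probs.argmax())           # posterior distribution and predicted class
```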

35/60 Problems with Gradient Descent Training

36/60 Problems with Gradient Descent Training

37/60 Problems with Gradient Descent Training

38/60 Speed-up: Momentum

I Updates may move a weight slowly in one direction

I We can keep a memory of prior updates

∆wkj(n − 1)

I and add these to any new updates with a decay factor ρ

∆wkj(n) = µ δj hk + ρ∆wkj(n − 1)
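A minimal sketch of this momentum update for a single weight; the gradient values, learning rate, and decay factor below are illustrative.

```python
# Momentum: blend the current update with a decayed memory of the previous one.
mu, rho = 0.1, 0.9          # learning rate and decay factor (illustrative values)
prev_update = 0.0
w = 0.5

def momentum_step(w, grad, prev_update):
    update = mu * grad + rho * prev_update   # delta_w(n) = mu*delta_j*h_k + rho*delta_w(n-1)
    return w + update, update

for grad in [0.2, 0.18, 0.15]:               # successive values of delta_j * h_k
    w, prev_update = momentum_step(w, grad, prev_update)
    print(round(w, 4))
```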

39/60 Dropout

I A general problem of machine learning: overfitting to training data (very good on train, bad on unseen test)

I Solution: regularization, e.g., keeping weights from having extreme values

I Dropout: randomly remove some hidden units during training

I mask: set of hidden units dropped

I randomly generate, say, 10 – 20 masks

I alternate between the masks during training

40/60 Mini Batches

I Each training example yields a set of weight updates ∆wij I Batch up several training examples

I Accumulate their updates

I Apply sum to the model – one big step instead of many small steps

I Mostly done for speed reasons

41/60 Matrix Vector Formulation

I Forward computation s = W h

I Activation computation: y = sigmoid(s)
I Error term: δ = (t − y) · sigmoid'(s)
I Propagation of the error term: δi = W δi+1 · sigmoid'(s)
I Weight updates: ∆W = µ δ h^T
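A minimal numpy sketch of one forward/backward pass in this matrix formulation, using the example network and the learning rate from the worked example (biases are folded in as an extra input of 1). The resulting weights come out close to the slides' values; small differences are due to the slides rounding intermediate results.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

U = np.array([[3.7, 3.7, -1.5],
              [2.9, 2.9, -4.6]])          # input -> hidden weights (u), bias in last column
W = np.array([[4.5, -5.2, -2.0]])         # hidden -> output weights (w), bias in last column
mu = 10.0                                 # learning rate from the worked example

x = np.array([1.0, 0.0, 1.0])             # inputs (1.0, 0.0) plus bias unit
t = np.array([1.0])                       # target output

# Forward computation
h = sigmoid(U @ x)                        # hidden values (~0.90, ~0.15)
h_b = np.append(h, 1.0)                   # add bias unit for the output layer
y = sigmoid(W @ h_b)                      # output value (~0.78)

# Backward computation
delta_out = (t - y) * y * (1 - y)                       # delta = (t - y) * sigmoid'(s)
delta_hidden = (W[:, :2].T @ delta_out) * h * (1 - h)   # propagate error; drop the bias column

# Weight updates: Delta W = mu * delta * h^T
W += mu * np.outer(delta_out, h_b)
U += mu * np.outer(delta_hidden, x)
print(np.round(W, 4))   # close to [[4.8402, -5.1433, -1.622]] from the worked example
```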

42/60 Toolkits

I Theano (Python Library)

I Tensorflow (Python Library, Google)

I PyTorch (Python Library, Facebook)

I MXNet (Python Library, Amazon)

I DyNet (Python Library, A consortium of institutions including CMU)

43/60 Neural Network V1.0: Linear Model

44/60 Neural Network v2.0: Representation Learning I Big idea: induce low-dimensional dense feature representations of high-dimensional objects

45/60 Neural Network v2.1: Representation Learning

I Big idea: induce low-dimensional dense feature representations of high-dimensional objects

I Did this really solve the problem?

46/60 Neural Network v3.0: Complex Functions

I Big idea: define complex functions by adding a hidden layer.

I y = W2 h1 = W2 a1(W1 x1)

47/60 Neural Network v3.0: Complex Functions

I Popular activation/transfer/non-linear functions

48/60 Neural Network v3.5: Deeper Networks

I Add more layers!

I y = W3 h2 = W3 a2(W2 a1(W1 x1))

49/60 Neural Network v3.5: Deeper Networks

50/60 Neural Network v4.0: Recurrent Neural Networks

I Big Idea: Use hidden layers to represent sequential state

51/60 Neural Network v4.0: Recurrent Neural Networks

52/60 Neural Network v4.1: Output Sequences

[Figure: RNN input/output configurations — many to one, many to many, many to many]

53/60 Neural Network v4.1: Output Sequences

I Character-level Language Models

54/60 Neural Network v4.2: Long Short-Term Memory I Regular Recurrent Networks

I LSTMs

55/60 Neural Network v4.2: Long Short-Term Memory

56/60 Neural Network v4.3: Bidirectional RNNs

I Unidirectional RNNs

I Bidirectional RNNs

57/60 Neural Machine Translation

58/60 Neural Part-of-Speech Tagging

I wi is the one-hot representation of the current word. I f(wi) encodes the case of wi: all caps, cap initial, lowercase. 59/60 Neural Parsing

60/60