
Text Mining and Chatbots: where are we?

Anne Vilnat

LIMSI, UPSaclay

September 2020

Anne Vilnat (LIMSI, UPSaclay), Text Mining and Chatbots: where are we?, September 2020

Plan of the course: what we plan... and what we do!

1. Jan 8: planned "What's in a text: an introduction, and (some on) syntax", Anne Vilnat; held instead: "Semantics", Sahar Ghannay
2. Jan 15: planned "Semantics", Sahar Ghannay; held instead: "Text mining in open domain", Pierre Zweigenbaum
3. Jan 22: planned "Text mining in open domain", Pierre Zweigenbaum; held instead: "Introduction and syntax", Anne Vilnat
4. Jan 29: Text mining in medical domain, Aurélie Névéol
5. Feb 5: Chatbots: what can we do?, Sahar Ghannay and Anne Vilnat
6. Feb 12: And now in Sign Language?, Michael Filhol
7. Projects and papers presentation

The problem: how to define what we do...

Natural Language Processing, or NLP
Part of computer science and artificial intelligence; it studies the links between natural languages (used by humans) and computers (which use programming languages), at the frontier between computer science and linguistics.
NLP began at the same time as... computer science itself! The first computers were... "big dictionaries" used to translate or decode messages!

Machine translation and... its tricks!
The spirit is willing but the flesh is weak
L'esprit est fort mais la chair est faible
The vodka is strong but the meat is rotten
La vodka est forte mais la viande est pourrie

The problem: some questions to answer...

When is NLP useful?
language modelling using natural language, with significant industrial stakes
What are the difficult problems?
ambiguity in linguistic units
what is left implicit in natural language text
Which techniques are useful for NLP?
"old school" but classical ones: rules or others (grammars, knowledge ontologies, etc.)
and... deep learning

What is NLP for? NLP is useful for:

machine translation
spelling/grammar correction
information retrieval (text mining)
text simplification
conversational agents (chatbots)
...

What is NLP for? An example to introduce the problem

The president of the United States was eating an apple with a knife.
Which processing steps?
segmentation → lexical units;
recognition of the lexical components, and their properties → lexical processing;
recognition of the higher-level components, and the relations between them → syntactic processing;
building the meaning representation of the statement → semantic analysis;
how may this statement be related to the context in which it is analyzed (text, dialog, ...)? → pragmatics.
But it is not a sequential process!

Segmentation/normalization: what is a word?

Writing systems are not always clearly segmented: Chinese, Thai, ...
Typography is not always fixed:
  . : etc. or limsi.fr or 20.3 or ...
  ' : jusqu'à or aujourd'hui or 3'4 or Floc'h or Sotheby's, ...
  - : Jean-Paul or donne-t-il or 06-05-04-03-02 or 1914-1918 or -10.5%, ...
  the space...
Detect and normalize typographical variants:
  France-Inter, France-inter, France Inter
  United States, United-States, US
Find numbers, dates, durations, amounts, special numbers (phone, Visa card, ...), scores (sport)
"Deal with" unknown words (neologisms, ...), words from another language (anglicisms in French, ...), typos, ...
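
The segmentation and normalization issues above can be sketched with a few rules. This is only a minimal illustration, not the method used in the course: the regular expression, the `VARIANTS` table and the function names `tokenize` and `normalize` are assumptions made for this example, and a real tokenizer would need many more cases.

```python
import re

# Hypothetical token pattern; alternatives are tried from top to bottom.
TOKEN_RE = re.compile(
    r"""
    \d+(?:-\d+)+            # digit groups joined by hyphens: 1914-1918, 06-05-04-03-02
    | -?\d+(?:[.,]\d+)?%?   # signed or decimal numbers and percentages: 20.3, -10.5%
    | \w+(?:['’.\-]\w+)*    # words with inner apostrophe, dot or hyphen: Jean-Paul, limsi.fr
    | \S                    # any other non-space character (punctuation)
    """,
    re.VERBOSE,
)

# Hypothetical table mapping lower-cased variants to one canonical form.
VARIANTS = {
    "france-inter": "France Inter",
    "france inter": "France Inter",
    "united-states": "United States",
    "united states": "United States",
    "us": "United States",
}


def tokenize(text):
    """Split text into tokens; 'donne-t-il' stays one token, which is debatable."""
    return TOKEN_RE.findall(text)


def normalize(name):
    """Return the canonical form of a known variant, or the name unchanged."""
    return VARIANTS.get(name.lower(), name)


print(tokenize("Jean-Paul a vu 1914-1918, soit -10.5% sur limsi.fr aujourd'hui."))
print(normalize("France-inter"))
```

Putting the most specific patterns first is a design choice: it lets "1914-1918" win over the plain number rule, at the price of rules whose order matters.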

Lexical processing

Goal: identify lexical elements, their structure and characteristics; group together the forms that share the same origin.

Lemmatization: find the canonical form of a word, or lemma
ate → to eat
journaux → journal
viendras → venir

Stemming (racinisation)
lived → liv-
journaux, journal → journ-
chantais, chanteras → chant-
venons, venaient → ven-

Lexical processing: the result

Le président des antialcooliques mangeait une pomme avec un couteau
(The president of the teetotallers was eating an apple with a knife)
le: det. masc. sing., /lə/; pers. pron. masc. sing., /lə/
président: verb, 3rd pers. plur., pres. ind./subj., [présid+ent], /pʁezid+ət/; noun masc. sing. ← présider: action of X, /pʁezidɑ̃/
des: det. masc./fem. plur., /dɛ+z/; prep., contraction of "de les"...
antialcooliques: adj. masc./fem. plur., [anti+alcool+ique+s], ← antialcoolique (adj.): to be X, antialcoolique(X), /ɑ̃tialkɔlikə+z/
mangeait: verb, 1st/3rd pers. sing., imperfect ind., [mang+e+ait], /mɑ̃ʒɛ+t/
pomme: noun fem. sing., [pomme], /pɔmə/
...
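
To make the difference between lemmatization and stemming concrete, here is a minimal sketch built only from the examples above. The `LEMMAS` dictionary, the `SUFFIXES` list and the `min_stem` threshold are assumptions for illustration; real lemmatizers rely on large lexicons and real stemmers on carefully ordered rule sets.

```python
# Toy lemma dictionary (hypothetical data; "ate" maps to the bare verb "eat").
LEMMAS = {"ate": "eat", "journaux": "journal", "viendras": "venir"}

# Hypothetical suffix list, sorted longest first so that "-eras" is tried before "-al".
SUFFIXES = sorted(["aient", "eras", "ais", "aux", "al", "ons", "ed"], key=len, reverse=True)


def lemmatize(form):
    """Look the form up in the dictionary; unknown forms are returned unchanged."""
    return LEMMAS.get(form.lower(), form)


def stem(form, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = form.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for form in ["lived", "journaux", "journal", "chantais", "chanteras", "venons", "venaient"]:
    print(form, "->", stem(form) + "-")
print("viendras ->", lemmatize("viendras"))
```

The lemmatizer returns a real word (the lemma) but needs a lexicon; the stemmer needs no lexicon but returns a truncated form that is not necessarily a word.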

Syntactic level: "simple" syntax, tagging and chunking

Goal: disambiguate ambiguous morphological labels (POS tagging); identify group boundaries (not their internal structure, nor the dependency relations): chunking
How: disambiguation rules/patterns; statistical models (HMM, CRF); learning disambiguation rules
Tools: rules, patterns, manually annotated corpus
Difficulties: unknown words...
Result: tags and chunks.
[Le/Admp président/Vpi3p] [des/Prep antialcooliques/Ncmp] [mange/Vpi3s] [une/Aifs pomme/Ncfs]...
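
As an illustration of statistical disambiguation with an HMM, the sketch below runs the Viterbi algorithm on a toy model. All probabilities, the simplified tag names (DET, PRON, NOUN, VERB) and the `UNSEEN` constant are invented for this example; in practice they would be estimated from a manually annotated corpus.

```python
# Hypothetical emission probabilities P(word | tag), restricted to candidate tags.
EMISSION = {
    "le": {"DET": 0.9, "PRON": 0.1},
    "président": {"NOUN": 0.6, "VERB": 0.4},
    "mange": {"VERB": 1.0},
    "une": {"DET": 1.0},
    "pomme": {"NOUN": 1.0},
}
# Hypothetical tag-bigram transition probabilities; "<s>" marks the sentence start.
TRANSITION = {
    ("<s>", "DET"): 0.6, ("<s>", "PRON"): 0.3, ("<s>", "NOUN"): 0.1,
    ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.01,
    ("PRON", "VERB"): 0.8, ("PRON", "NOUN"): 0.05,
    ("NOUN", "VERB"): 0.7, ("NOUN", "DET"): 0.1,
    ("VERB", "DET"): 0.6, ("VERB", "VERB"): 0.05, ("VERB", "NOUN"): 0.2,
}
UNSEEN = 1e-6  # small probability for transitions missing from the table


def viterbi(words):
    """Return the most probable tag sequence under the toy HMM above."""
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {"<s>": (1.0, [])}
    for word in words:
        new_best = {}
        # Unknown words raise KeyError here: the "unknown words" difficulty.
        for tag, emit in EMISSION[word].items():
            score, path = max(
                (prev_score * TRANSITION.get((prev_tag, tag), UNSEEN) * emit, prev_path)
                for prev_tag, (prev_score, prev_path) in best.items()
            )
            new_best[tag] = (score, path + [tag])
        best = new_best
    return max(best.values())[1]


print(viterbi("le président mange une pomme".split()))
```

With these made-up numbers, "président" is tagged as a noun because the determiner-noun transition outweighs the pronoun-verb reading; different probabilities could flip that choice.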

Syntactic level

Goal: identify syntactic components (syntagms), their function, and the relations between them.
How: syntactic parsing, giving a tree or a dependency structure
Tools: syntactic parser, defining the representation and the way to obtain it
Difficulties: compromise between a rich description, processing time, and proliferation of ambiguities; complexity of linguistic phenomena; robustness against "noise" (typos, grammatical errors...).
Result: one (or many) syntactic representations of the sentence.

Syntactic level: ambiguity due to lexical entries

One of the most important problems for syntactic parsing is ambiguity.
Lexical ambiguity:
souris: verb forms of sourire (to smile), feminine singular and plural noun (mouse);
petit: adjective or masculine singular noun;
la: determiner or feminine singular personal pronoun, masculine noun;
mousse: verb forms of mousser (to foam), masculine noun (ship's boy, in the navy), feminine noun (foam);
If the description is more precise, the ambiguity increases: monter (monter un escalier, monter un cheval, monter une pièce, ...).
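
The effect of lexical ambiguity on a parser can be quantified with a small sketch: before any syntactic filtering, the number of candidate category sequences is the product of the number of categories of each word. The `LEXICON` below encodes the entries listed above with simplified, made-up category names; the helper names are assumptions for this example.

```python
from itertools import product
from math import prod

# Categories taken from the lexical ambiguities listed above (simplified labels).
LEXICON = {
    "souris": ["VERB", "NOUN_FEM"],
    "petit": ["ADJ", "NOUN_MASC"],
    "la": ["DET", "PRON", "NOUN_MASC"],
    "mousse": ["VERB", "NOUN_MASC", "NOUN_FEM"],
}


def count_readings(words):
    """Number of category sequences: product of per-word ambiguity."""
    return prod(len(LEXICON[word]) for word in words)


def readings(words):
    """Enumerate every category sequence for the given words."""
    return list(product(*(LEXICON[word] for word in words)))


print(count_readings(["la", "souris"]))
for sequence in readings(["la", "souris"]):
    print(sequence)
```

Splitting an entry into finer senses (as with monter) adds categories to the lexicon and therefore multiplies this count further.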

Syntactic level: ambiguity due to syntax

La petite brise la glace (the little girl breaks the ice / the light breeze chills her);
La troupe monte Molière (the company stages Molière) vs Le jockey monte Belino (the jockey rides Belino);
She eats an apple with a knife vs She eats an apple with the skin;
She sees the beach with a telescope vs She sees the beach with seagulls;
it's the daughter of the cousin who drinks;
he talked about having lunch with Paul;
Disambiguation is not possible at the syntactic level: we need semantics or pragmatics to decide;
the richer the grammar, the more ambiguities you obtain...
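
The prepositional-phrase examples above can be checked mechanically. The sketch below counts parse trees with a CKY-style chart over a tiny grammar in Chomsky normal form; the grammar, the lexicon and the function name `count_parses` are assumptions written for this illustration, not a grammar from the course.

```python
from collections import defaultdict

# Hypothetical toy grammar: (parent, left child, right child).
BINARY = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # the PP attaches to the verb phrase (instrument)
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # the PP attaches to the noun phrase
    ("PP", "P", "NP"),
]
LEXICAL = {
    "she": ["NP"], "eats": ["V"], "an": ["Det"], "a": ["Det"], "the": ["Det"],
    "apple": ["N"], "knife": ["N"], "skin": ["N"], "with": ["P"],
}


def count_parses(words, root="S"):
    """Count the distinct parse trees of the word list under the toy grammar."""
    n = len(words)
    # chart[i][j][X] = number of trees rooted in X that cover words[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for category in LEXICAL[word]:
            chart[i][i + 1][category] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for parent, left, right in BINARY:
                    chart[i][j][parent] += chart[i][k][left] * chart[k][j][right]
    return chart[0][n][root]


print(count_parses("she eats an apple with a knife".split()))
print(count_parses("she eats an apple with the skin".split()))
```

Both sentences receive the same number of trees, since the grammar knows nothing about knives or skins; choosing the intended attachment requires semantic or pragmatic knowledge.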

Semantic level: semantics

Goal: solve referential problems; build a conceptual representation.
See the first lecture...

Pragmatic level

Goal: integrate the statement into the current text or interaction, adding what is implicit; understand the argumentative function of the sentence: what is new, what is it about, is it new information, a question?
How: knowledge about human activities, about human interaction (speech acts, relevance, ...), about rhetorical and discursive structures...
Tools: world knowledge (scripts), interaction "grammar", ...
Difficulties: size of the knowledge to represent, specification of the "grammar" of interactions
Result: formal representation, new knowledge, ... but... complex!

A short history of NLP: the beginnings...

1952: first conference on machine translation at MIT, organised by Bar-Hillel
1954: first machine translation system (Russian → English...)
The spirit is willing but the flesh is weak
the box is in the pen ↔ the pen is in the box.
Bar-Hillel: "translation is not possible"...
ALPAC report (1966): translation too expensive, no results...: no more funding!

A short history of NLP: the beginnings, elsewhere...

Harris: distributional linguistics (1951 to 1954)
Chomsky: syntax, natural language grammars / formal language grammars (1957)
and then AI arrives in 1956 (Dartmouth summer school): McCarthy, Minsky, Newell, Simon → computers with language abilities
first systems: BASEBALL (1961), SIR (1964), STUDENT (1964), ELIZA (1966); see: Emacs, M-x doctor
knowledge representation: Quillian (semantic networks)
→ 1972: SHRDLU (Winograd); first system that "understands"
http://hci.stanford.edu/~winograd/shrdlu/

A short history of NLP: the help of semantics, the progress of syntax...

70's: systems developed by Schank, Wilks; semantics is the most important... BUT how will it be possible to provide ALL the necessary knowledge...
80's: progress in syntax: unification grammars, BUT how will it be possible to provide all the rules...
2000-: deep learning, transformers, and... what are the limits? how to learn? manually annotated data?

A short history of NLP: state of the art...
