Linguistic processing for Spoken Language Understanding

Frédéric Béchet

Aix Marseille Université - Laboratoire d'Informatique Fondamentale - LIF-CNRS

December 18th 2014

Spoken Language Understanding

1 Introduction

2 Parsing Speech

3 Dependency parsing for Open Domain SLU

4 Speech-to-Speech translation

Spoken Language Understanding

Parsing speech for Spoken Language Understanding (SLU) of spontaneous speech

Spoken Language Understanding

[Diagram: Spoken Language Understanding at the intersection of Natural Language Processing, Automatic Speech Recognition and Machine Learning]

Spoken Language Understanding

[Diagram: the SLU processing chain]

Signal Processing → Automatic transcription (transcription, « rich » transcription) → Analysis (classification, syntactic and semantic understanding, extracting entities, relations between entities, ...)

Two applicative frameworks

Human-Machine spoken dialogue

I Call-routing

I Form-filling

I Personal assistant

I Computer assisted task

Speech Analytics / Voice Search

I Broadcast News

I Large speech archives

I Call Centres

Main challenge : processing spontaneous speech

I Human/Machine conversation

I Human/Human conversation

Choice of a Meaning Representation Language

General view

I meaning as the composition of basic constituents

I Definition of constituents and relations independent from a given application

Application specific view

I interpretation = A representation that can be executed by an interpreter in order to change the state of the system (Speech Communication 48 - SLU)

I Goal of SLU from a system point of view → SYSTEM INTERPRETATION

Choice of a Meaning Representation Language

Can be defined by the application framework

I In a call-routing application → Call-type

I In a database query application → SQL query

I In a directory assistance application → entry in the directory

I In a speech analytics / voice search application

F Distillation / Summarization

F Behavioral analysis

Choice of a Meaning Representation Language

Can be a formal language

I Flat representation → concepts

F Named entities, Sequences of keywords, Verbs, ...

I Structured representation

F Logical formulae

F Predicate/argument structure (FrameNet) + dialog act + reference resolution

Which models and algorithms to perform SLU ?

Different points of view

I Understanding is a classification task !

I Understanding is a translation task !

I Understanding is a parsing task !

Understanding is a classification task

Mapping a speech message to a label

I Call-type (How May I Help You ?), dialog act (CALO), user intent (personal assistant) . . .

Direct system interpretation of a message

Classifiers

I Support Vector Machines, Boosting, . . .
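A minimal sketch of this classification view, assuming scikit-learn and a bag-of-words representation; the call types and training utterances are invented for illustration:

```python
# Sketch of SLU as classification (call-type routing), assuming scikit-learn is available.
# Call types and training utterances are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_utterances = [
    "i lost my credit card",
    "my card was stolen yesterday",
    "i want to check my account balance",
    "how much money is on my account",
]
train_call_types = ["ReportLostCard", "ReportLostCard", "AccountBalance", "AccountBalance"]

# Bag-of-words features + linear SVM: one call-type label per utterance.
classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
classifier.fit(train_utterances, train_call_types)

print(classifier.predict(["i think my card is lost"]))  # expected to route to ReportLostCard
```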

Understanding is a translation task

Translation to a formal semantic language

I Related to tagging approaches (e.g. POS tagging)

Main model : Concept decoding

I Mapping a sequence of words to a sequence of attribute/value tokens

Main approaches

I Hidden Markov Models, MaxEnt tagger, Conditional Random Fields
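A minimal sketch of concept decoding seen as sequence tagging, with BIO-style attribute/value labels; the utterance, concept names and feature template are invented, and in practice a CRF or MaxEnt tagger trained on annotated data would predict the labels:

```python
# Sketch of concept decoding as sequence tagging: each word gets a BIO-style
# attribute label, then labelled chunks are turned into attribute/value pairs.
# Utterance, concepts and features are invented for illustration.
words  = ["a", "ticket", "from", "Paris", "to", "Marseille", "tomorrow"]
labels = ["O", "O", "O", "B-departure.city", "O", "B-arrival.city", "B-date"]

def features(i):
    """Per-token feature dict of the kind a CRF/MaxEnt tagger would use."""
    return {
        "word": words[i].lower(),
        "is_capitalized": words[i][0].isupper(),
        "prev_word": words[i - 1].lower() if i > 0 else "<s>",
        "next_word": words[i + 1].lower() if i < len(words) - 1 else "</s>",
    }

def decode(labelled_words):
    """Turn a BIO-labelled word sequence into attribute=value concepts."""
    concepts, current = [], None
    for word, label in labelled_words:
        if label.startswith("B-"):
            current = [label[2:], [word]]
            concepts.append(current)
        elif label.startswith("I-") and current:
            current[1].append(word)
        else:
            current = None
    return {attr: " ".join(toks) for attr, toks in concepts}

print(features(3))              # features for "Paris"
print(decode(zip(words, labels)))
# {'departure.city': 'Paris', 'arrival.city': 'Marseille', 'date': 'tomorrow'}
```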

Understanding is a parsing task : Syntactic/Semantic parsing

Structured representation

I Constituent tree, dependency tree

F System interpretation obtained directly from the parse tree

Examples

I Full parse : mapping syntactic trees to semantic trees

F Deep Understanding (Allen, ACL 2007)

I Robust parsing : TINA (MIT)

I Parsing + Semantic role labelling

F SLU with parsing + SVM classifiers (Moschitti, ASRU 2007)

Complex Understanding : Human-Human Conversation Analysis

[Figure: example of a Human-Human call-centre conversation (a customer calling about an internet connection), segmented into Opening, Information Verification, Problem Resolution, Conflict Situation and Closing phases, with per-segment annotations of relevance, meaning (e.g. procedure[state=completed,time=past], query[obj=intervention,loc=home]) and emotion (conflict); overall the dialog is assessed as problematic.]

Behaviour Analysis

Answer complex questions about conversations :

I Was the operator empathic to the customer's concerns?

I Was the customer satisfied?

I How does the agent manage the argument?

I Did the agent achieve the goal of the call?

I Is the agent polite?

Spoken Language Understanding

1 Introduction

2 Parsing Speech

3 Dependency parsing for Open Domain SLU

4 Speech-to-Speech translation

Why is syntactic parsing needed for SLU ?

Syntactic relations → semantic relations

I From syntactic dependency to semantic dependency

I Ex : The CoNLL-2008 Shared Task on Joint Parsing of Syntactic and Semantic Dependencies

Semantic disambiguation

I By identifying the arguments of a verb and using semantic constraints on the arguments

F Ex : Disambiguation between Organization / Location entities : France has made a statement about .. / I'll visit France next year

Identifying the language register

I Read speech / prepared speech / spontaneous speech

I Speaker role identification

Speech is not text

Text

I Complete object : made to be read !

I Structured object

I Designed according to strong specifications

Speech

I Dynamic object : edits, repairs, repetitions, hesitations

I Structure = mostly at the acoustic level

I Speaking style : read/prepared/spontaneous, dialogs

ASR output is not text !

ASR output

I Partial information : all the acoustic dimensions of speech are missing

I Stream of words, no punctuation

ASR errors

I Insertion/deletion/substitution

I Each can be weighted by a confidence score

Non-native text

Generation of a transcription W from a signal A :

I P(A|W) = acoustic model

I P(W) = language model → text generation model

Ŵ = argmax_W P(A|W) × P(W)
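A minimal sketch of this decoding rule over a toy set of hypotheses, assuming acoustic and language model scores are available as log-probabilities (all values invented):

```python
# Sketch of the decoding rule W_hat = argmax_W P(A|W) x P(W), computed in log space.
# Acoustic and language model scores below are invented log-probabilities.
hypotheses = {
    "recognize speech":   {"log_p_acoustic": -12.3, "log_p_lm": -4.1},
    "wreck a nice beach": {"log_p_acoustic": -11.8, "log_p_lm": -9.7},
}

def decode(hyps, lm_weight=1.0):
    # The product P(A|W) x P(W) becomes a sum of log scores; lm_weight is the usual
    # language-model scaling factor applied by real decoders.
    return max(hyps, key=lambda w: hyps[w]["log_p_acoustic"] + lm_weight * hyps[w]["log_p_lm"])

print(decode(hypotheses))  # "recognize speech" wins once the language model is taken into account
```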

Therefore → closed vocabulary

I All words are included in the generative model P (W )

I Automatic transcriptions contain no unknown words !

Sentence segmentation

I pause, fixed duration, a posteriori process based on prosodic/lexical features

Disfluencies → ASR errors

Multiple hypotheses with confidence scores

I Word lattices, word confusion networks, n-best lists
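A minimal sketch of one of these representations, a word confusion network, as a sequence of slots holding alternative words with posterior probabilities (all values invented):

```python
# Sketch of a word confusion network: a sequence of slots, each holding alternative
# words with posterior probabilities; "*DELETE*" marks an optional empty arc.
# All words and scores are invented for illustration.
confusion_network = [
    [("hello", 0.9), ("hollow", 0.1)],
    [("i", 0.7), ("*DELETE*", 0.3)],
    [("would", 0.95), ("could", 0.05)],
    [("like", 1.0)],
]

def one_best(cn):
    """Keep the highest-posterior word in each slot, skipping empty arcs."""
    best = [max(slot, key=lambda arc: arc[1])[0] for slot in cn]
    return " ".join(word for word in best if word != "*DELETE*")

print(one_best(confusion_network))  # -> "hello i would like"
```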

Transcribing spontaneous speech

euh bonjour donc c' est XX à l' appareil je sais pas si vous savez très bien qui je suis euh donc par rapport au à la niveau de la satisfaction de ma satisfaction personnelle par rapport à votre service euh je dirai que dans l' ensemble je suis euh plutôt satisfait euh vous avez un très bon service clientèle qui sait écouter qui euh non qui j' ai pas grand chose à dire c' est c' est très très bien sinon ben juste par rapport au à ce que vous avez mis en place euh tout de suite justement c' est une très bonne idée justement d' une façon à ce qu' y ait un taux de réponse euh assez important maintenant c' est vrai qu' on est obligé de rappeler plusieurs fois et encore quand on prend le temps de rappeler pour euh pour euh pour euh pour répondre parce que quand on nous dit euh vous allez nous donner vous allez donner euh votre euh vos idées euh vos vos suggestions et ben on n' a rien en tête donc c' est pour ça que j' ai été obligé de raccrocher et de réfléchir à ce que je vais vous dire on ça c' est pas je pense euh que ça c' est le point collectif ou c' est le point négatif et sinon dans l' ensemble je suis très satisfait sinon y a une chose que j' ai à noter euh j' ai deux comptes chez vous euh je trouve ça un peu embêtant de pouvoir euh de pas pouvoir accéder euh aux deux par la même personne quand j' appelle mon service clientèle donc ça je trouve ça un peu dommage que je sois obligé de dépenser en plus parce que faut que je c' est pas le même type euh c' est pas la même personne qui s' occupe de mon dossier donc ce qui aurait été bien c' est quand même regrouper les deux dossiers sous euh euh sous un seul quoi de façon à ce que quand on appelle on puisse accéder aux deux dossiers séparément bien sûr mais les deux dossiers donc voilà euh sinon ben je vous remercie en tous cas pour euh pour votre gentillesse et votre amabilité vos conseillers clientèle sont très très gentils et très à l' écoute et donc je vous en remercie au revoir bonne journée bonne soirée


What kind of syntactic parsing ?

Traditional approaches based on Context-Free Grammars (CFG) are not suitable for processing spontaneous speech

Approaches to parsing based on dependency structures and discriminative machine learning techniques (McDonald et al., 2007) are much easier to adapt to speech :

I they need less training data

I the annotation with syntactic dependencies of speech transcripts is simpler than with syntactic constituents

I partial annotation can be performed when the speech is ungrammatical or the ASR transcripts are erroneous

I The dependency parsing framework also generates parses much closer to meaning, which eases semantic interpretation

Main issues for parsing speech

Dealing with speech disfluencies and conversational speech

Ex : yes so I hum ... hum I I wou- would like

2 strategies

I Adding disfluencies to the parsing models

I Removing disfluencies prior to parsing

Dealing with ASR errors

I Using word confidence scores in the parsing process ?

I Insertion / substitution / deletion

Integration into the ASR process

I word lattice, confusion network, N-best list

Approaches developed at the LIF

Developing specific resources for parsing spontaneous speech

I POS tagger / dependency parser / FrameNet parser

I cross-project : DECODA, SENSEI, ORFEO

Method

I Adapting models on a small amount of annotated speech

I Using existing linguistic resources : Syntactic/semantic lexicons

Integration into the ASR process

I Building NLP tools taking weighted word lattices as input

I Keeping alternative word strings and word/sentence segmentations

I For each analysis produced : possibility to go “back” to the signal

Spoken Language Understanding

1 Introduction

2 Parsing Speech

3 Dependency parsing for Open Domain SLU

4 Speech-to-Speech translation

Context of this study : SENSEI / ORFEO

SENSEI : Making Sense of Human-Human Conversation Data

I Making Sense = analysis + synthesis + evaluation

I FP7 European project (2013-2016)

I University of Trento (IT), University of Aix Marseille (FR), University of Essex (UK), University of Sheffield (UK)

I Teleperformance (IT), Websays (SP)

ORFEO

I Outils et Recherches sur le Français Écrit et Oral (Tools and Research on Written and Spoken French)

I ANR project - Lattice, Modyco, ATILF, LIF, LORIA, CLLE-ERSS, ICAR

Open Domain Spoken Language Understanding

Not limited to call-centers

I Live chat

I Conversations in social media

Common characteristics

I non-canonical language

I noisy messages (disfluencies, repetitions, truncated words, ASR errors for speech ; misspellings, acronyms, multilingualism for social media)

I superfluous information (redundancy and digression)

I “para-linguistic” information needed for full understanding

Can’t develop limited-domain SLU models for each use-case → generic semantic models

Open Domain Situated Conversational Interaction

Related work

Deriving semantic models from large collections of data and semantic web resources

D. Hakkani-Tür, L. Heck, and G. Tür, Using a knowledge graph and query click logs for unsupervised learning of relation detection, ICASSP, 2013

Using linguistically motivated semantic models

I Frame-based models (FrameNet)

I Retraining a new semantic model from annotated data + generic models

B. Coppola, A. Moschitti and G. Riccardi, Shallow semantic parsing for spoken language understanding, HLT-NAACL 2009

Using a generic parser + adaptation on the parser output

Y. Chen, W. Wang and A. Rudnicky, Unsupervised induction and filling of semantic slots for spoken dialogue systems using frame-semantic parsing, IEEE ASRU, 2013

This study

To highlight the need for an adaptation process of parsing models (POS + dependencies) when dealing with non-canonical text, such as speech transcriptions

Spontaneous speech transcriptions are difficult to parse using models developed for written text (agrammaticality, disfluencies)

I → parsing based on dependency structures and discriminative machine learning techniques (no grammar)

I → “Normalizer” module in charge of ’cleaning’ input messages

transcriptions obtained through ASR contain errors → the amount of errors increases with the level of spontaneity in speech.

I → integrating ASR and semantic parsing (reranking, word lattices, confusion networks, . . . )

Method proposed

Adapting syntactic models to the nature of language to process

I POS tagging + dependency parsing adapted to process spontaneous speech (weak supervision)

I Disfluency prediction and removal

Using a generic linguistically motivated semantic model

I French FrameNet (derived from Berkeley FrameNet) - ASFALDA project

Unsupervised frame annotation by means of both dependency structures and generic semantic model

I Training semantic parsers for processing ASR transcripts

The RATP-DECODA corpus

1514 telephone conversations recorded at the call centre of the Paris public transport authority (RATP) over a period of two days (Bechet et al., LREC 2012) → over 74 hours of very spontaneous speech

Available for research at the Ortolang SLDR data repository : http://sldr.org/sldr000847/fr

2 partitions : DEC-TRAIN (93.5K utterances), DEC-TEST (3.6K utterances)

disfluency type | # occ. | % of turns | 10 most frequent forms
discourse markers | 39125 | 28.2% | [euh] [hein] [ah] [ben] [voilà] [bon] [hm] [bah] [hm hm] [écoutez]
repetitions | 9647 | 8% | [oui oui] [non non] [c' est c' est] [le le] [de de] [ouais ouais] [je je] [oui oui oui] [non non non] [ça ça]
false starts | 1913 | 1.1% | [s-] [p-] [l-] [m-] [d-] [v-] [c-] [t-] [b-] [n-]
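A minimal rule-based sketch of flagging the three disfluency types in the table above; the marker list is a small sample taken from the table, and a real system would learn such decisions from the annotated corpus:

```python
# Rule-based sketch flagging the three disfluency types of the table above.
# The discourse-marker list is a small sample; real systems learn these decisions
# from annotated data.
DISCOURSE_MARKERS = {"euh", "hein", "ah", "ben", "voilà", "bon", "hm", "bah", "écoutez"}

def flag_disfluencies(tokens):
    flags = []
    for i, tok in enumerate(tokens):
        if tok.lower() in DISCOURSE_MARKERS:
            flags.append("DISCOURSE_MARKER")
        elif tok.endswith("-"):                                   # truncated word, e.g. "s-"
            flags.append("FALSE_START")
        elif i + 1 < len(tokens) and tokens[i + 1].lower() == tok.lower():
            flags.append("REPETITION")                            # "oui oui", "je je", ...
        else:
            flags.append("OK")
    return list(zip(tokens, flags))

print(flag_disfluencies("euh oui oui je je veux dire s- signaler un problème".split()))
```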

Parsing Speech : Using generic resources

French Tree Bank corpus (FTB)

I newspaper articles (”Le Monde”), 1M words with POS, syntactic trees + dependencies

Parsing tools : MACAON toolbox

I CRF-based POS tagger + first-order graph-based dependency parser

Experiments

corpus | FTB | RATP-DECODA
train | FTB-TRAIN | FTB-TRAIN
test | FTB-TEST | DEC-TEST
POS err. | 2.6 | 18.8
UAS | 87.92 | 65.78
LAS | 85.54 | 58.28

3 metrics : Part Of Speech error rate (POS err.) ; Unlabeled Attachment Score (UAS) ; Labeled Attachment Score (LAS)
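A minimal sketch of how these three metrics are computed, assuming gold and predicted analyses are aligned token by token as (POS, head index, dependency label) triples (toy values):

```python
# Sketch of the three metrics, assuming gold and predicted analyses are aligned
# token by token as (POS, head_index, dependency_label) triples (toy values).
gold = [("CLI", 2, "SUJ"), ("VRB", 0, "ROOT"), ("DET", 4, "DET"), ("NOM", 2, "OBJ")]
pred = [("CLI", 2, "SUJ"), ("VRB", 0, "ROOT"), ("PRE", 4, "DET"), ("NOM", 2, "MOD")]

n = len(gold)
pos_errors  = sum(g[0] != p[0] for g, p in zip(gold, pred))       # wrong POS tag
uas_correct = sum(g[1] == p[1] for g, p in zip(gold, pred))       # correct head
las_correct = sum(g[1:] == p[1:] for g, p in zip(gold, pred))     # correct head + label

print(f"POS err. = {100 * pos_errors / n:.1f}%")    # 25.0%
print(f"UAS      = {100 * uas_correct / n:.1f}%")   # 100.0%
print(f"LAS      = {100 * las_correct / n:.1f}%")   # 75.0%
```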

Parsing Speech

Model adaptation on the RATP-DECODA corpus

ORFEO tagset for POS and dependency labels.

I POS tagset of 17 tags

I Tagging disfluencies with POS

I Dependency label tagset restricted to 12 syntactic labels (Subject, Direct Object, Indirect Object, Modifier, ...) and a specific link (DISFLINK) for attaching disfluencies

Weakly supervised adaptation method

I Iterative active learning process (LREC 2012)

Example : je je je veux (I I I want)

id | word | POS | link | dependency
1 | I | CLI | 4 | SUJ
2 | I | CLI | 1 | DISFLINK(REP)
3 | I | CLI | 2 | DISFLINK(REP)
4 | want | VRB | 0 | ROOT

Iterative training process

[Diagram: iterative training loop - a DECODA GOLD corpus is manually annotated for evaluation; the MACAON tools produce automatic annotations on the TRAIN corpus; these automatic annotations are manually corrected; the MACAON models are retrained on the corrected annotations and the loop is repeated.]

Learning curve for POS tagging

[Plot: POS tagging error rate (4-20%) as a function of the number of dialogs in training (0-100), for two conditions : training=RATP-DECODA and training=FTB+RATP-DECODA]

Handling disfluencies with dependency parsing ?

DEC-TEST NODISF = RATP-DECODA test corpus (manual transcription) with disfluencies manually removed

DEC-TEST DISF = original corpus (manual transcription)

corpus | FTB | RATP-DECODA | RATP-DECODA | RATP-DECODA
train | FTB-TRAIN | FTB-TRAIN | FTB-TRAIN | DEC-TRAIN
test | FTB-TEST | DEC-TEST NODISF | DEC-TEST DISF | DEC-TEST DISF
UAS | 87.92 | 71.01 | 65.78 | 85.90
LAS | 85.54 | 64.28 | 58.28 | 83.86

Frame Selection

Semantic model

I FrameNet model adapted to French through the ASFALDA project (https://sites.google.com/site/anrasfalda)

I current model : 106 frames from 9 domains with Lexical Units (LU) lexicons

Selecting frame-candidates on the RATP-DECODA corpus

I Restriction to the verbal Lexical Units

I Restriction to well-formed syntactic dependencies (SUJ-VERB-OBJ)

I Results : 146,356 frame hypotheses from 78 different frames

Example : Top-10 frame hypotheses in the RATP-DECODA corpus

Domain | Frame | # hyp.
SPACE | Arriving | 8328
COM-LANG | Request | 7174
COG-POS | FR-Awareness-Certainty-Opinion | 4908
CAUSE | FR-Evidence-Explaining-the-facts | 4168
COM-LANG | FR-Statement-manner-noise | 3892
COM-LANG | Text-creation | 3809
SPACE | Path-shape | 3418
COG-POS | Becoming-aware | 2338
SPACE | FR-Motion | 2287
SPACE | FR-Traversing | 2008
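A minimal sketch of this selection step: a verbal Lexical Unit heading a well-formed SUJ-VERB-OBJ dependency structure triggers a frame hypothesis; the tiny LU lexicon and the parsed utterance are invented for illustration:

```python
# Sketch of frame-candidate selection: a verbal Lexical Unit heading a well-formed
# SUJ-VERB-OBJ dependency structure triggers a frame hypothesis. The tiny LU lexicon
# and the parsed utterance are invented for illustration.
LU_LEXICON = {"perdre": "Losing", "arriver": "Arriving", "demander": "Request"}

# One token per tuple: (index, lemma, POS, head_index, dependency_label)
parsed = [
    (1, "je", "CLI", 2, "SUJ"),
    (2, "perdre", "VRB", 0, "ROOT"),
    (3, "mon", "DET", 4, "DET"),
    (4, "carte", "NOM", 2, "OBJ"),
]

def frame_candidates(tokens, lexicon):
    hypotheses = []
    for idx, lemma, pos, _, _ in tokens:
        if pos == "VRB" and lemma in lexicon:
            subj = [t[1] for t in tokens if t[3] == idx and t[4] == "SUJ"]
            obj  = [t[1] for t in tokens if t[3] == idx and t[4] == "OBJ"]
            if subj and obj:   # keep only well-formed SUJ-VERB-OBJ structures
                hypotheses.append((lexicon[lemma], subj[0], lemma, obj[0]))
    return hypotheses

print(frame_candidates(parsed, LU_LEXICON))  # [('Losing', 'je', 'perdre', 'carte')]
```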

Semantic structures obtained

Semantic structure : predicate + dependencies from the RATP-DECODA corpus

Example on the lexical unit “to lose” of the frame “Losing”

SUJ | PREDICATE | OBJ
j' ("I") | ai perdu ("'ve lost") | carte ("card")
fille ("daughter") | a perdu ("had lost") | doudou ("teddy bear")
j' ("I") | ai perdu ("'ve lost") | journée ("day")
j' ("I") | ai perdu ("'ve lost") | travail ("job")

→ Evaluation of the syntactic parsing adaptation process for the retrieval of these syntactic structures

Evaluation on manual transcriptions

Average Precision and Recall in the detection of LUs with dependencies

Comparison of 2 parsing models

I FTB-TRAIN : models trained on written text

I FTB+DEC-TRAIN : + RATP-DECODA train corpus with semi-supervised annotations

parser | precision | recall | f-measure
FTB-TRAIN | 75.9 | 85.5 | 77.3
FTB+DEC-TRAIN | 88.2 | 88.4 | 87.2

Evaluation on ASR transcriptions

ASR performance on the DEC-TEST corpus, according to the kind of speaker in the call-centre

speaker | WER
caller | 49.4
operator | 42.4

Detection performance of semantic dependency structures on the ASR transcriptions

I dep1 compares full predicate+dependency recovery

I dep2 accepts partial match on the dependencies

condition | precision | recall | f-measure
dep1 | 47.4 | 66.4 | 51.4
dep2 | 57.3 | 80.3 | 62.7

Future work

Main results

I FrameNet approach → generic semantic model

I Syntactic dependency parser can be adapted successfully to process very spontaneous spoken conversations

I The adaptation process significantly improves parsing and frame candidate selection, both on manual and noisy ASR transcriptions

Perspectives

I better integration of ASR and parsing

I from spoken data to social media data

I abstractive conversation summarization process based on semantic and structural features (in SENSEI)

Spoken Language Understanding

1 Introduction

2 Parsing Speech

3 Dependency parsing for Open Domain SLU

4 Speech-to-Speech translation

Context of this study

[Figure: clarification example - user : "I like your beret" / system : "You like what?" / user : "Your hat" / the system then outputs the Arabic translation]

Broad Operational Language Translation (BOLT) project, 2011-2015

I THUnderBOLT : Toward Human-like Understanding in Breakthrough Operational Language Technologies

I Theme

F Spoken Language Understanding for clarification strategies in speech to speech translation

I Partners

F SRI International (Stanford), Columbia University (NY), Washington University, Aix Marseille Université (AMU)

Speech-to-Speech translation : best case scenario

speaker 1

I ref. transcription : Bonjour, je voudrais savoir à qui appartiennent ces entrepôts ?

I ASR transcription : Bonjour je voudrais savoir à qui appartiennent cet entrepôt ?

I automatic translation : Hello I would like to know who is the owner of this warehouse ?

speaker 2

I ref. transcription : They belong to XX

I ASR transcription : They belong to XX

I automatic translation : Ils appartiennent à XX

I speech synthesis

speaker 1

I ref. transcription : Merci, savez vous combien de gens travaille ici ?

I ASR transcription : merci savez vous combien de gens travaille ici ?

I automatic translation : Thanks, do you know how many people work here ?

I speech synthesis

speaker 2

I ...

Speech-to-Speech translation : worst case scenario

speaker 1

I ref. transcription : Bonjour, je voudrais savoir à qui appartiennent ces entrepôts ?

I ASR transcription : Bonjour je voudrais savoir à qui appartiennent ces entre peau ?

I automatic translation : Hello I would like to know to who belongs between skin ?

I speech synthesis

speaker 2

I ref. transcription : What ?? which skin ?? what do you mean ??

I ASR transcription : what which skin that do you mean

I automatic translation : quelle apparence tu veux dire

I speech synthesis

speaker 1

I ref. transcription : pardon ?? non, ces entrepôts

I ASR transcription : pardon non c’est en troupeau

I automatic translation : sorry no it’s a flock

I speech synthesis

speaker 2

I ...

Speech-to-Speech translation : an interactive process

speaker 1

I ref. transcription : Bonjour, je voudrais savoir à qui appartiennent ces entrepôts ?

I ASR transcription : Bonjour je voudrais savoir à qui appartiennent ces entre peau ?

system BOLT

I speech synthesis : Excusez-moi, je ne connais pas ce mot, à qui appartiennent quoi ? ("Sorry, I do not know this word, who owns what?")

speaker 1

I ref. transcription : ces hangars de l'autre côté de la rue ("those warehouses on the other side of the street")

I ASR transcription : ces hangars de l'autre côté de la rue

system BOLT

I ref. transcription : OK, je traduis ("OK, I am translating")

I ASR transcription clarified : Bonjour je voudrais savoir à qui appartiennent ces hangars de l'autre côté de la rue ?

I automatic translation : Hello I would like to know who owns these warehouse on the other side of the street ?

I speech synthesis

speaker 2

I ...

General description of the BOLT clarification system

Challenges in ASR error detection in BOLT B/C

Multiple sources of errors

I Language Model ambiguities → I ran/Iran

I Out-Of-Vocabulary words → priest in / pristine

I Non-canonical word pronunciation

I Voice differing from the training database → accent, voice quality (age, pathology)

I sub-optimal search → real-time constraints

Open domain

I no precise modeling of expected semantics

Targeting only certain types of errors

I errors that can be repaired with a clarification strategy

How to detect ASR errors

confidence measures from the ASR decoding

I ASR posterior probabilities

I Language Model perplexity, unseen ngram (backoff rate)

Linguistic features

I syntactic coherence (Part-Of-Speech, Parsing)

I semantic coherence (term co-occurrence, semantic role)

Learning error patterns

I examples of ASR errors from OOV words

I modelling error segments with a sequence tagging approach rather than making purely local decisions
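A minimal sketch of the kind of per-word features such a sequence tagger could combine (confidence scores plus simple linguistic context); the words, scores and feature template below are illustrative, not the actual feature set:

```python
# Sketch of per-word feature vectors an error-segment tagger (e.g. a CRF) could
# combine: ASR confidence plus simple linguistic context. Words, posteriors and
# POS tags are toy values; the feature template is illustrative only.
words      = ["is", "the", "town", "low", "on", "supplies", "of", "i", "see", "them"]
posteriors = [1.0, 1.0, 1.0, 0.9, 1.0, 1.0, 1.0, 0.1, 0.9, 0.9]
pos_tags   = ["VBZ", "DT", "NN", "JJ", "IN", "NNS", "IN", "PRP", "VBP", "PRP"]

def word_features(i):
    return {
        "word": words[i],
        "pos": pos_tags[i],
        "posterior_bin": round(posteriors[i], 1),                    # binned confidence
        "low_confidence": posteriors[i] < 0.5,
        "prev_pos": pos_tags[i - 1] if i > 0 else "<s>",
        "next_pos": pos_tags[i + 1] if i < len(words) - 1 else "</s>",
    }

# One feature dict per word; a trained tagger labels each word ERROR / CORRECT,
# so that decisions are made over segments rather than isolated words.
sequence_features = [word_features(i) for i in range(len(words))]
print(sequence_features[7])   # features around the low-confidence word "i"
```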

How to find the ASR errors that can be recovered ?

Not all ASR errors should be sent to the clarification strategy

Targeting errors that would lead the human-human interaction into a dead end

Difficult problem

Examples

Example 1

ASR 1-best

is the town low on supplies of i see them in a fan

Example 1

ASR 1-best with posteriors

is(1) the(1) town(1) low(0.9) on(1) supplies(1) of(1) i(0.1) see(0.9) them(0.9) in(0.9) a(0.8) fan(0.6)

→ which words are erroneous ? If we look only at posterior probabilities, only the words i and fan appear to be wrong

Example 1

ASR 1-best with posteriors

is(1) the(1) town(1) low(0.9) on(1) supplies(1) of(1) i(0.1) see(0.9) them(0.9) in(0.9) a(0.8) fan(0.6)

→ the erroneous words (shown in red on the slide) are i see them in a fan

Example 1

ASR 1-best with posteriors aligned with the reference transcription

ASR : is(1) the(1) town(1) low(0.9) on(1) supplies(1) of(1) i(0.1) see(0.9) them(0.9) in(0.9) a(0.8) fan(0.6)
REF : is the town low on supplies of acetaminophen

1 error segment : i see them in a fan → the whole error segment should be sent to clarification
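A minimal sketch of how such reference-aligned error segments can be extracted for evaluation: align the ASR 1-best with the reference transcription and group every non-matching hypothesis span into one segment (here the standard library's difflib stands in for the usual edit-distance alignment):

```python
import difflib

# Sketch of extracting reference-aligned error segments: align the ASR 1-best with
# the reference transcription and group each non-matching hypothesis span into one
# error segment.
ref = "is the town low on supplies of acetaminophen".split()
hyp = "is the town low on supplies of i see them in a fan".split()

def error_segments(ref_words, hyp_words):
    segments = []
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":                      # insertion, deletion or substitution
            segments.append({"hyp_span": " ".join(hyp_words[j1:j2]),
                             "ref_span": " ".join(ref_words[i1:i2])})
    return segments

print(error_segments(ref, hyp))
# [{'hyp_span': 'i see them in a fan', 'ref_span': 'acetaminophen'}]
```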

Example 2

ASR 1-best

they travel to cross to desert on camera back

Example 2

ASR 1-best with posteriors

they(0.7) travel(0.8) to(0.8) cross(0.8) to(0.8) desert(0.9) on(0.9) camera(0.7) back(0.8)

→ which words are erroneous ?

Example 2

ASR 1-best with posteriors

they(0.7) travel(0.8) to(0.8) cross(0.8) to(0.8) desert(0.9) on(0.9) camera(0.7) back(0.8)

→ the erroneous words (shown in red on the slide) are travel to cross to and camera back

Example 2

ASR 1-best with posteriors aligned with the reference transcription

ASR : they(0.7) travel(0.8) to(0.8) cross(0.8) to(0.8) desert(0.9) on(0.9) camera(0.7) back(0.8)
REF : they traveled across the desert on camelback

2 error segments : travel to cross to and camera back → only the second error segment (camera back) should be sent to clarification

Approach developed at AMU

Detection

I Development of an error detector module, integrating various confidence measures, combining a Conditional Random Field (CRF) tagger and a Finite-State-Transducer filter to detect ASR error segments.

Characterization

I Error Processor module that adds syntactic and semantic information to the error segments to be clarified

Joint Error Segment detection and Syntactic parsing

N-best error segment hypotheses generated by the tagger

I Example :

F Ref : they traveled across the desert on camelback

F ASR : they travel to cross to desert on camera back

F Error segment : [camera back]

I N-best

F they travel to cross to desert on XXXX

F they travel to cross to XXXX

F they travel to cross XXXX

Syntactic & semantic analysis : for each XXXX

I prediction of a POS category and a syntactic dependency

F restriction : XXXX can be either a noun phrase or a verb phrase

F MACAON NLP tools (developed by AMU) integrated in the AMU pipeline

I prediction of a Named Entity label : NONE, PERSON/ORG, LOCATION

I semantic role within the sentence (when possible)

Decision Module

Choosing among the possible error segments the one that would be the easiest to rephrase by speakers

I joint decision integrating ASR posteriors, tagging and parsing confidences, and linguistic consistency

Prediction of syntactic/semantic info about XXXX

I syntactic/semantic role

I Wh-question tag

I useful to generate more meaningful questions (DM) and improve answer merging

Example :

I they travel to cross to desert on XXXX: MOD(on) → WH=WHAT?

I they travel to cross to XXXX: ARGM-DIR(cross) PMOD(to) → WH=WHERE?

I they travel to cross XXXX: ARG1(cross) OBJ(cross) → WH=WHAT?
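A minimal sketch of mapping the predicted role of an error segment (XXXX) to a Wh-question tag, as in the examples above; the rule table is a simplified illustration, not the actual BOLT rule set:

```python
# Sketch of mapping the predicted syntactic/semantic role of an error segment (XXXX)
# to a Wh-question tag, as in the examples above. The rule table is a simplified
# illustration, not the actual BOLT rule set.
WH_RULES = [
    (lambda r: "ARG0" in r or r.get("ne_tag") == "PERSON", "WHO"),
    (lambda r: "ARGM-DIR" in r or r.get("ne_tag") == "LOCATION", "WHERE"),
    (lambda r: "ARGM-TMP" in r, "WHEN"),
    (lambda r: True, "WHAT"),                      # default question tag
]

def wh_question(roles):
    """roles: dict of predicted labels (semantic roles, dependencies, NE tag)."""
    for condition, wh in WH_RULES:
        if condition(roles):
            return wh

# "they travel to cross to XXXX": ARGM-DIR(cross), PMOD(to)  -> WHERE
print(wh_question({"ARGM-DIR": "cross", "PMOD": "to"}))
# "they travel to cross XXXX":    ARG1(cross), OBJ(cross)    -> WHAT
print(wh_question({"ARG1": "cross", "OBJ": "cross"}))
```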

Example

For example, on the 1-best “i think mister raad fahmi is lying”, with an error segment on the word fahmi (start=4, end=4), the error segment characterization module produces :

ne_tag=HUMAN pos_tag=NNP dep_tag=SUBJ dep_word_index=5 is_asr_oov=true syntactic_category=NP wh_question=WHO semantic_role=ARG0(is lying)

Experiments on BOLT data

DARPA BOLT project (English-Iraqi Arabic speech-to-speech translation task)

I Only the English ASR side is considered

I ASR system used is the SRI Dynaspeak system

Corpora

I train : English utterances collected during the TRANSTAC project (766K utterances and 7.6M words).

I development : 745 utterances (7.4K words) containing at least one error (OOV word, mispronunciation) leading to a high WER (30%)

I test : 887 utterances (6.4K words) similar to development (WER 27.9%)

Transtac corpus (14K words)

Level WORD - error rate = 5.6
WORD | THE | ARE | IN | 'S | YOU | A | IS | DO | AND | CAN
% | 3.8 | 3.8 | 2.8 | 2.6 | 2.5 | 2.3 | 2.2 | 2.2 | 1.8 | 1.6

Level POS - error rate = 4.6
POS | NN | VBP | VB | DT | VBD | IN | PRP | MD | NNP | WRB
% | 12.1 | 11.9 | 8.4 | 7.5 | 6.9 | 6.5 | 6.2 | 5.0 | 4.1 | 3.7

Level DEP - error rate = 10.0
DEP | OBJ | NMOD | SBJ | SUB | VC | PMOD | ADV | PRD | LOC | TMP
% | 18.5 | 16.6 | 12.6 | 9.0 | 6.1 | 4.6 | 4.3 | 3.5 | 3.5 | 3.3

Verbs represent 27.2% of ASR errors (16.2% for nouns)

Evaluation Metrics

Example of word/POS/dep error rate evaluation with and without error processing

Reference | Error rate
word : we need judicious men | 0%
POS : PRP VBP JJ NN | 0%
synt. dep. : SBJ(need) ROOT() NMOD(men) OBJ(need) | 0%

1-best (no error processing) | Error rate
word : we need your dishes that | 75%
POS : PRP VBP PRP NN WDT | 50%
synt. dep. : SBJ(need) ROOT() NMOD(dishes) OBJ(need) NONE() | 50%

1-best with error processing | Error rate
word : we need XX | 50%
POS : PRP VBP NN | 25%
synt. dep. : SBJ(need) ROOT OBJ(need) | 25%

Results

1-best : baseline, no error processing (syntactic parser directly applied to the ASR 1-best)

oracle : “cheating” experiments with reference error segments

err. proc. : error tagger + characterization

condition / error rate | Word | POS | Synt. Dep.
1-best | 27.9 | 21.4 | 44.6
oracle | 21.2 | 14.5 | 34.2
err. proc. | 25.0 | 16.9 | 39.9

Correct detection and prediction on error segments

level | Precision | Recall | F-measure
POS | 79.0 | 53.2 | 63.6
dependency | 68.7 | 46.3 | 55.3

Conclusions and Perspectives

Detecting and characterizing ASR errors is a difficult problem

Linguistic analysis can help

I to detect error segments

I to predict the nature and the role of the missing segment

Interactive strategy to correct error segments

I Evaluated by NIST through the BOLT project

I Good feedback from users

Perspectives

I Generate features leading to more meaningful clarification questions

I Target errors that have an impact on the dialog flow

F From Spoken Language Understanding and points of view

Thank you
