NPFL099 - Statistical dialogue systems

Spoken language understanding II

Filip Jurčíček

Institute of Formal and Applied Linguistics Charles University in Prague Czech Republic

Home page: http://ufal.mff.cuni.cz/~jurcicek

Version: 13/03/2013

NPFL099 2013LS

Outline

● What is SLU?
● Parsers
  ● Semantic tuple classifiers
  ● Hidden Vector State
  ● PCFG

Spoken language understanding

● Definition

● SLU converts recognised speech into meaning

● We are looking for a mapping of

● I am looking for a Chinese restaurant

into

● inform(venue=restaurant)&inform(food=Chinese)

Meaning representation

● Is in the form of dialogue acts
● Each composed of:
  ● a dialogue act type:
    – inform, request, confirm, select, affirm, deny, hello, bye, repeat, help, request_alternatives, etc.
  ● semantic information:
    – venue=restaurant
    – food=Chinese
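Dialogue-act strings in this format can be decomposed mechanically into (act type, slot, value) items. A minimal sketch in Python; the `parse_da` helper is illustrative, not the course's actual da.py code:

```python
import re

def parse_da(da_string):
    """Split e.g. 'inform(venue=restaurant)&inform(food=Chinese)'
    into (act_type, slot, value) tuples."""
    items = []
    for part in da_string.split('&'):
        m = re.match(r'(\w+)\((.*)\)$', part)
        act_type, payload = m.group(1), m.group(2)
        if not payload:                      # e.g. bye()
            items.append((act_type, None, None))
        elif '=' in payload:                 # e.g. inform(food=Chinese)
            slot, value = payload.split('=', 1)
            items.append((act_type, slot, value.strip('"')))
        else:                                # e.g. request(food)
            items.append((act_type, payload, None))
    return items

print(parse_da('inform(venue=restaurant)&inform(food=Chinese)'))
# [('inform', 'venue', 'restaurant'), ('inform', 'food', 'Chinese')]
```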

Semantic tuple classifiers

● Developed for trees
● Based on composition of simple classifiers for
  ● non-terminal and terminal nodes
● No word alignment is considered
  ● classifiers are conditioned on the complete sentence

F. Mairesse et al., “Spoken language understanding from unaligned data using discriminative classification models,” in ICASSP '09: Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 2009, pp. 4749-4752.

Trees and DAs

inform(venue=”restaurant”, food=”Chinese”, near=”railway station”)

CLASSIC notation!

High precision mode

● dialogue act type is the root
  ● C(DAT|W)
● slot name is next
  ● C(NAME|DAT,W)
● slot value is next
  ● C(VALUE|NAME,W)
● High precision because
  ● slot names are conditioned on the DAT

inform(venue=”restaurant”, food=”Chinese”, near=”railway station”)

High recall mode

● dialogue act type is the root
  ● C(DAT|W)
● slot name and value is next
  ● C(NAME,VALUE|W)
● High recall because
  ● slots are conditioned only on W
  ● error in classification of DAT does not propagate

inform(venue=”restaurant”, food=”Chinese”, near=”railway station”)

Choice of classifiers

● Arbitrary classifiers can be used
● Classification
  ● 1 out of N – DAT
  ● 1 out of 2 – presence of a slot/value
● Original paper used SVMs (support vector machines)
  ● a kernel-based technique using training examples as data points
  ● simple scalar product on features works well
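The high-recall decomposition can be sketched with off-the-shelf classifiers: one 1-of-N classifier for the dialogue act type, C(DAT|W), and one binary classifier per slot-value pair, C(NAME,VALUE|W). The toy training data below is invented for illustration, and logistic regression stands in for the SVMs of the original paper:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# invented toy data: sentences, their act types, and one slot-value label
sents = ['i am looking for a chinese restaurant',
         'i want an italian restaurant',
         'goodbye',
         'bye bye thank you']
dats = ['inform', 'inform', 'bye', 'bye']
food_chinese = [1, 0, 0, 0]   # presence of food=Chinese in the DA

vec = CountVectorizer(ngram_range=(1, 2))   # bag of uni- and bigrams
X = vec.fit_transform(sents)

dat_clf = LogisticRegression().fit(X, dats)           # C(DAT|W)
slot_clf = LogisticRegression().fit(X, food_chinese)  # C(food=Chinese|W)

x = vec.transform(['looking for a chinese place'])
print(dat_clf.predict(x)[0], slot_clf.predict(x)[0])
```

In a real system one such binary classifier is trained for every slot-value pair seen in the data, and their positive decisions are combined into one dialogue act.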

Features

● Training classifiers in high recall mode
  ● C(DAT|W)
  ● C(NAME,VALUE|W)
● To condition on W, extract features
  ● bag-of-words trick
    – N-grams
    – N-grams from dependency trees, etc.
● Tuning
  ● features can be optimised for each classifier separately

Alternatives to SVM

● Naive Bayes
  ● 1 out of N
● Decision trees
  ● 1 out of N
  ● CART
  ● C4.5
● Decision forests
  ● combination of many trees
  ● increased robustness

Alternatives to SVM

● Logistic regression
  ● 1 out of 2

  p(C=1∣W) ≈ exp( ∑ᵢ₌₁ᴺ θᵢ Φᵢ(W) )

● Kernelised logistic regression
  ● 1 out of 2

  p(C=1∣W) ≈ exp( ∑ᵢ₌₁ᴺ θᵢ k(Wᵢ, W) )

  k(Wᵢ, W) = Φ(Wᵢ)·Φ(W)   (dot-product kernel)
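A small numeric sketch of the two scoring rules on invented feature vectors and weights. Note that for a proper binary classifier the exp(·) score is normalised, here with a sigmoid; the kernelised version replaces raw features by dot-product kernels against the training examples:

```python
import numpy as np

def sigmoid(z):
    # normalises the exp(...) score into a probability for 1-out-of-2
    return 1.0 / (1.0 + np.exp(-z))

phi_train = np.array([[1.0, 0.0], [0.0, 1.0]])  # Φ(W_i), two training points
theta_lin = np.array([2.0, -1.0])               # weights on raw features
alpha = np.array([1.5, -0.5])                   # weights on kernel values

phi_w = np.array([1.0, 1.0])                    # Φ(W) for a test input

p_linear = sigmoid(theta_lin @ phi_w)           # p(C=1|W) from features
k = phi_train @ phi_w                           # k(W_i, W) = Φ(W_i)·Φ(W)
p_kernel = sigmoid(alpha @ k)                   # p(C=1|W) from kernels

print(round(p_linear, 3), round(p_kernel, 3))   # 0.731 0.731
```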

Summary of the classifiers

Name of classifier     Type         Output
Phoenix                handcrafted  deterministic
SVM                    data driven  deterministic with error margins
Naïve Bayes            data driven  probabilistic
Decision tree          data driven  deterministic; can be used for regression
Decision forests       data driven  probabilistic
Logistic regression    data driven  probabilistic
Kernelised regression  data driven  probabilistic

Other alternatives

● Weka toolkit
  ● KSTAR, IBK, JRIP
● mlpy.sourceforge.net
  ● Linear Discriminant Analysis (LDA), Basic, Elastic Net, Logistic Regression, (Kernel) Support Vector Machines (SVM), Diagonal Linear Discriminant Analysis (DLDA), Golub Classifier, Parzen-based, (Kernel) Fisher Discriminant Classifier, k-Nearest-Neighbor, Iterative RELIEF, Classification Tree, Maximum Likelihood Classifier
● http://scikit-learn.sourceforge.net
  ● Lasso, Elastic Net, Least Angle Regression, LARS Lasso, Orthogonal Matching Pursuit (OMP), Bayesian Regression, Logistic Regression, Support Vector Machines, Stochastic Gradient Descent, Nearest Neighbors, Gaussian Processes, Partial, Naive Bayes, Decision Trees, Ensemble methods, Feature selection

PCFG extension

● Probabilistic version of a CFG
  ● R → INFORM S
  ● R → REQUEST S
  ● S → S S
  ● S → A
  ● S → AV
  ● INFORM → looking
  ● REQUEST → can I get
  ● A → food
  ● AV → food = FV
  ● FV → Chinese
  ● FV → Italian

I am looking for a Chinese restaurant near the railway station.
inform(venue=”restaurant”, food=”Chinese”, near=”railway station”)

PCFG extension

● We need to know the probabilities of:
  ● P(R → INFORM S | W)
  ● P(S → AV | W)
  ● P(AV → food = FV | W)
  ● P(FV → Italian | W)
  ● …

  p(C=c∣W) ≈ exp( ∑ᵢ₌₁ᴺ θᵢ Φ_{c,i}(W) )

  or any other probabilistic classifier
● W can be the whole input
  ● or only the part of the input spanned by the rule
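Given rule probabilities from such classifiers, a derivation can be scored as their product. A toy sketch with made-up probabilities standing in for the classifier outputs:

```python
# stand-ins for P(rule | W) produced by per-rule classifiers
rule_probs = {
    'R -> INFORM S': 0.8,
    'S -> AV': 0.7,
    'AV -> food = FV': 0.9,
    'FV -> Chinese': 0.95,
}

# one derivation of "I am looking for a Chinese restaurant"
derivation = ['R -> INFORM S', 'S -> AV', 'AV -> food = FV', 'FV -> Chinese']

p = 1.0
for rule in derivation:
    p *= rule_probs[rule]   # score = product of rule probabilities

print(round(p, 3))   # 0.479
```

A real parser would search over all derivations and return the highest-scoring one.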

PCFG approach is more general

● It can theoretically handle:
  ● I would like Chinese restaurants which are not cheap or expensive, or expensive hotels which are not near the city centre but near the south of the town.
  ● inform( venue = restaurant && food = Chinese && (price_range != cheap || price_range != expensive) || venue = hotel && price_range = expensive && (near != centre || near = south) )
● However, we do not need this at the moment

Hidden Vector State parser

● The PCFG structure is too complex
  ● arbitrary depth of a tree
● Let's limit the depth of the semantic tree

[Figure: semantic tree with concepts DEPARTURE, TO, STATION and TIME over the Czech utterance “jede nějaký spěšný vlak do Prahy kolem čtvrté odpoledne” (“is there a fast train to Prague around four in the afternoon?”)]

Y. He and S. Young (2005). “Semantic Processing using the Hidden Vector State Model.” Computer Speech and Language 19(1): 85-106.

Pushdown automaton approximation

● Push a new concept for each input word
● Pop 0, 1, 2, or 3 concepts from the current stack


HVS training

● Given an utterance
● Unaligned state sequence
● Train a probabilistic model of
  ● observations – P(w|c1)
  ● push – P(c1|c2,c3,c4)
  ● pop – P(n|c1,c2,c3,c4)
● EM algorithm
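The push/pop mechanics of the vector state can be sketched as plain stack operations: for each word, first pop n ∈ {0,1,2,3} concepts, then push one new concept. The words, concepts, and pop counts below are invented for illustration (an English gloss of the earlier Czech example):

```python
def hvs_decode(words, ops):
    """ops: list of (pop_count, concept_to_push), one per word."""
    stack, states = [], []
    for word, (n_pop, concept) in zip(words, ops):
        if n_pop:
            del stack[len(stack) - n_pop:]   # pop n concepts
        stack.append(concept)                # push one new concept
        states.append((word, tuple(stack)))  # vector state for this word
    return states

states = hvs_decode(
    ['train', 'to', 'Prague', 'four'],
    [(0, 'DEPARTURE'), (0, 'TO'), (0, 'STATION'), (2, 'TIME')])
for word, stack in states:
    print(word, stack)
```

At the last word the decoder pops STATION and TO so that TIME attaches directly under DEPARTURE, mirroring the tree in the figure.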

HVS summary

● Although well received by the community
  ● it is not very good
● The learned model is biased towards unimportant common words
  ● in English: a, the, an
● The reason:
  ● generative structure
  ● the articles are in almost all utterances
  ● P(w = 'a' | c) is the most reliably estimated probability
● So the automatically inferred alignment
  ● aligns some concepts always with articles, etc.

Processing multiple hypotheses

● ASR provides an N-best list
  ● 0.33 – I am looking for a bar
  ● 0.26 – I am looking for the bar
  ● 0.11 – I am looking for a car
  ● 0.09 – I am looking for the car
  ● ...
● How do we get?
  ● 0.59 – inform(task=find, venue=bar)
  ● 0.20 – null()
  ● ...

Processing multiple hypotheses

● Semantic parser: P(d∣w)
● Automatic speech recognition: P(w∣a)
● We want to get: P(d∣a)
● where
  ● d – dialogue act
  ● w – word sequence
  ● a – audio signal

Processing multiple hypotheses

● ASR provides multiple word sequence hypotheses
  ● we have to sum over them

  P(d∣a) = ∑w P(d∣w) P(w∣a)

● Algorithm
  ● compute the semantic interpretation for every word sequence
  ● weight them by the probability of the word sequence
  ● merge the same dialogue acts and sum their probabilities
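The three steps above, as a direct computation on the earlier N-best list. The `parse` function is a hypothetical deterministic stand-in for P(d∣w):

```python
from collections import defaultdict

asr_nbest = [(0.33, 'i am looking for a bar'),
             (0.26, 'i am looking for the bar'),
             (0.11, 'i am looking for a car'),
             (0.09, 'i am looking for the car')]

def parse(w):
    # hypothetical deterministic parser: P(d|w) puts all mass on one act
    if 'bar' in w:
        return 'inform(task=find, venue=bar)'
    return 'null()'

p_da = defaultdict(float)
for p_w, w in asr_nbest:
    p_da[parse(w)] += p_w   # P(d|a) = sum_w P(d|w) P(w|a)

for d, p in sorted(p_da.items(), key=lambda x: -x[1]):
    print(round(p, 2), d)
```

This reproduces the numbers on the earlier slide: 0.59 for inform(task=find, venue=bar) and 0.20 for null().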

Alternative

● ASR provides P(w∣a)
  ● map directly from the probability distribution to dialogue acts

  P(d∣a) = P(d∣P(w∣a))

  P(d∣a) ≈ exp( θ_d^T · Φ(P(w∣a)) )

● P(w|a) can be compactly represented in the form of a confusion network

Confusion network

● N-best list
  ● 0.33 – I am looking for a bar
  ● 0.26 – I am looking for the bar
  ● 0.11 – I am looking for a car
  ● 0.09 – I am looking for the car
● Confusion network

  I – 0.9     am – 0.9   looking – 1.0   for – 1.0   a – 0.6     bar – 0.5
  my – 0.07   ε – 0.1                                the – 0.4   car – 0.4
  hi – 0.02                                                      ε – 0.1
  ε – 0.01
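Expected n-gram counts can be read off such a network by multiplying the posteriors of words in adjacent slots, e.g. (I, am) = 0.9 × 0.9 = 0.81. A minimal sketch; the list-of-lists representation is an assumption, ε is treated as an ordinary token, and length normalisation is omitted:

```python
from itertools import product

# one inner list per slot of the confusion network above: (word, posterior)
cn = [[('I', 0.9), ('my', 0.07), ('hi', 0.02), ('<eps>', 0.01)],
      [('am', 0.9), ('<eps>', 0.1)],
      [('looking', 1.0)],
      [('for', 1.0)],
      [('a', 0.6), ('the', 0.4)],
      [('bar', 0.5), ('car', 0.4), ('<eps>', 0.1)]]

def bigram_features(cn):
    feats = {}
    for slot1, slot2 in zip(cn, cn[1:]):
        for (w1, p1), (w2, p2) in product(slot1, slot2):
            # expected count of the bigram under independent slot posteriors
            feats[(w1, w2)] = feats.get((w1, w2), 0.0) + p1 * p2
    return feats

f = bigram_features(cn)
print(round(f[('I', 'am')], 2), round(f[('a', 'bar')], 2))   # 0.81 0.3
```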

Features from an utterance

● 1-best utterance
  ● I am looking for a bar
● E.g. bigrams
  ● (I, am) = 1
  ● (am, looking) = 1
  ● (looking, for) = 1
  ● (for, a) = 1
  ● (a, bar) = 1

Features from a CN

● CN

  I – 0.9     am – 0.9   looking – 1.0   for – 1.0   a – 0.6     bar – 0.5
  my – 0.07   ε – 0.1                                the – 0.4   car – 0.4
  hi – 0.02                                                      ε – 0.1
  ε – 0.01

● E.g. bigram features Φd(P(w∣a))
  ● (I, am) = 0.81
  ● (my, am) = 0.063
  ● (looking, for) = 1
  ● (a, bar) = 0.3
  ● (a, car) = 0.24
  ● (a, ε) = 0.06
● Normalise for the length of N-grams

Thank you!


SLU exercise

● Build an SLU component using a technique of your choice
● E.g.
  ● Phoenix parser
  ● TBL
  ● SVM
  ● Decision trees
    – Random (Decision) forests
  ● Conditional Random Fields
  ● Template-based matching
    – clustering
  ● CFG based

Dialogue acts in data

● The format is slightly different
● Each slot-value pair has a DAT
  ● perfect thank you goodbye
    – bye()&thankyou()
  ● i'd like an english restaurant that plays folk music in the north part of town
    – inform(area="north")&inform(food="english")&inform(music="folk")&inform(type="restaurant")
  ● and what type of restaurant is the grand
    – inform(name="the grand")&request(food)

Provided data

● All data
  ● do not distribute
  ● use only for NPFL099
● Original CUED data
  ● in SDS/applications/TownInfo/cued_data
● Processed data in the new format
  ● run ./cued-sem2ufal-sem.py in SDS/applications/TownInfo
  ● new data in SDS/applications/TownInfo/data
● I provide the data already in the new format

Provided data

● Data: train, dev, test
● Data: asr, transcribed
● Files:
  ● auto_database.py
  ● database.py
  ● ...
  ● towninfo-train.grp
  ● towninfo-train.grp.dais
  ● towninfo-train.grp.reduced
  ● towninfo-train.grp.reduced.dais
  ● towninfo-train.sem
  ● towninfo-train.trn

Provided software

● SDS in development, structure
  ● applications
    – TownInfo
      ● cued_data
      ● data
      ● train-lr-slu.py
      ● test-lr-slu.py
  ● components
    – slu
  ● corpustools
    – cued-sem2ufal-sem.py
    – semscore.py
  ● utils
    – exception
    – string

Code

● In Python – tested with 2.7
● Libraries
  ● NumPy
  ● SciPy
  ● sklearn (logistic regression)
● Main classes
  ● da.py – DialogueAct, DialogueActItem
  ● dailrclassifier.py – DAILogRegClassifierLearning, DAILogRegClassifier
  ● daiklrclassifier – kernel-based logistic regression
  ● slu.__init__.py – CategoryLabelDatabase, SLUPreprocessing

The code and data

● ufal.mff.cuni.cz/~jurcicek/slu/SDS.tar.gz

Results

● Performance in
  ● precision
  ● recall
  ● F-measure
● of dialogue act items, e.g. semscore.py
  ● bye(), inform(...), request(), etc.
● on the test set
● Report how the performance increases with increasing training data size, e.g. 10%, 20%, ...
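Precision, recall, and F-measure over dialogue act items can be computed as below. This is an independent sketch in the spirit of semscore.py, not the actual script:

```python
def prf(ref_items, hyp_items):
    """Item-level precision, recall, and F-measure over sets of DA items."""
    ref, hyp = set(ref_items), set(hyp_items)
    tp = len(ref & hyp)                             # correctly produced items
    precision = tp / len(hyp) if hyp else 0.0
    recall = tp / len(ref) if ref else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

ref = {'inform(venue=restaurant)', 'inform(food=Chinese)'}
hyp = {'inform(venue=restaurant)', 'inform(food=Italian)'}
print(prf(ref, hyp))   # (0.5, 0.5, 0.5)
```

For a whole test set, the counts of true positives, reference items, and hypothesis items are accumulated over all utterances before the final ratios are taken.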

Data vs. accuracy example

● You will get something like this

My results

Dialogue act          Precision   Recall    F-measure
ack()                 89.66       100.00    94.55
affirm()              95.16       93.65     94.40
bye()                 100.00      100.00    100.00
confirm(area="*")     100.00      100.00    100.00
...
------------------------------------------------------------
Total precision: 95.31
Total recall:    93.58
Total F-measure: 94.44
