
Leveraging Frame and Distributional Semantics for Unsupervised Semantic Slot Induction in Spoken Dialogue Systems (Extended Abstract)

Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky
School of Computer Science, Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213-3891, USA
{yvchen, yww, air}@cs.cmu.edu

Abstract

Although the spoken dialogue system community in speech and the semantic community in natural language processing share many similar tasks and approaches, they have progressed independently over the years with few interactions. This paper connects the two worlds to automatically induce semantic slots for spoken dialogue systems using frame and distributional semantic theories. Given a collection of unlabeled audio, we exploit continuous-valued word embeddings to augment a probabilistic frame-semantic parser that identifies key semantic slots in an unsupervised fashion. Our experiments on a real-world spoken dialogue dataset show that distributional word representations significantly improve adaptation from FrameNet-style parses of recognized utterances to the target semantic space, that compared to a state-of-the-art baseline a 12% relative mean average precision improvement is achieved, and that the proposed technology can be used to reduce the costs of designing task-oriented spoken dialogue systems.

1 Introduction

Frame semantics is a linguistic theory that defines meaning as a coherent structure of related concepts (Fillmore, 1982). Although there have been some successful applications in natural language processing (Hedegaard and Simonsen, 2011; Coyne et al., 2011; Hasan and Ng, 2013), this linguistically principled theory had not been explored in the speech community until recently: Chen et al. (2013b) showed that it is possible to use probabilistic frame-semantic parsing to automatically induce and adapt the semantic ontology for designing spoken dialogue systems (SDS) in an unsupervised fashion. Compared to the traditional approach, where domain experts and developers manually define the semantic ontology for an SDS, the unsupervised approach has the advantages of reducing the costs and avoiding human-induced bias.

On the other hand, the distributional view of semantics hypothesizes that words occurring in the same contexts may have similar meanings (Harris, 1954). With the recent advance of deep learning techniques, the continuous representation of word embeddings has further boosted state-of-the-art results in many applications, such as frame identification (Hermann et al., 2014), sentiment analysis (Socher et al., 2013), language modeling (Mikolov, 2012), and sentence completion (Mikolov et al., 2013a).

In this paper, given a collection of unlabeled raw audio files, we investigate an unsupervised approach for semantic slot induction. To do this, we use a state-of-the-art probabilistic frame-semantic parsing approach (Das et al., 2010; Das et al., 2014) and perform an adaptation process, mapping the generic FrameNet-style (Baker et al., 1998) semantic parses to a target semantic space that is suitable for domain-specific conversation settings. We utilize continuous word embeddings trained on very large external corpora (e.g., Google News and Freebase) for the adaptation process. To evaluate the performance of our approach, we compare the automatically induced semantic slots with the reference slots created by domain experts. Empirical experiments show that the slot creation results generated by our approach align well with those of domain experts.

2 The Proposed Approach

We build our approach on top of the recent success of an unsupervised frame-semantic parsing approach.
Chen et al. (2013b) formulated the semantic mapping and adaptation problem as a ranking problem, and proposed the use of unsupervised clustering methods to differentiate generic semantic concepts from the target semantic space for task-oriented dialogue systems. However, their clustering approach operates only on the small in-domain training data, which may not be robust enough. Therefore, this paper proposes a radical extension of the previous approach: we aim at improving the semantic adaptation process by leveraging distributed word representations.

Figure 1: An example of probabilistic frame-semantic parsing on the ASR output "can i have a cheap restaurant", with the frames "capability" (FT/LU: can; FE filler: i), "expensiveness" (FT/LU: cheap), and "locale by use" (FT/FE LU: restaurant). FT: frame target. FE: frame element. LU: lexical unit.

2.1 Probabilistic Semantic Parsing

FrameNet is a linguistically principled semantic resource (Baker et al., 1998), developed based on the frame semantics theory (Fillmore, 1976). In our approach, we parse all ASR-decoded utterances in our corpus using SEMAFOR, a state-of-the-art semantic parser for frame-semantic parsing (Das et al., 2010; Das et al., 2014), and extract all frames from the semantic parsing results as slot candidates, where the LUs that correspond to the frames are extracted for slot-filling. For example, Figure 1 shows the SEMAFOR parse of an ASR-decoded text output.

Since SEMAFOR was trained on FrameNet annotation, which has a more generic frame-semantic context, not all the frames from the parsing results can be used as actual slots in a domain-specific dialogue system. For instance, in Figure 1, we see that the "expensiveness" and "locale by use" frames are essentially the key slots for the purpose of understanding in the restaurant query domain, whereas the "capability" frame does not convey particularly valuable information for SLU. In order to fix this issue, we compute the prominence of these slot candidates, use a slot ranking model to rerank the most frequent slots, and then generate a list of induced slots for use in domain-specific dialogue systems.
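For concreteness, the minimal sketch below shows one plausible way to collect slot candidates and their fillers from frame-semantic parses. The (frame, lexical unit) pair structure and the toy data are illustrative assumptions, not SEMAFOR's actual output format.

```python
from collections import defaultdict

def collect_slot_candidates(parsed_utterances):
    """Group lexical-unit fillers under each frame (slot candidate).

    `parsed_utterances` is assumed to be a list of parses, each a list of
    (frame, lexical_unit) pairs -- a simplified stand-in for parser output.
    """
    fillers = defaultdict(list)   # frame name -> observed fillers
    counts = defaultdict(int)     # frame name -> frequency in the corpus
    for parse in parsed_utterances:
        for frame, lu in parse:
            fillers[frame].append(lu)
            counts[frame] += 1
    return counts, fillers

# Toy example mirroring Figure 1 ("can i have a cheap restaurant").
parses = [[("capability", "can"), ("expensiveness", "cheap"),
           ("locale_by_use", "restaurant")]]
counts, fillers = collect_slot_candidates(parses)
print(counts["expensiveness"], fillers["locale_by_use"])
```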
2.2 Continuous Space Word Representations

To better adapt the FrameNet-style parses to the target task-oriented SDS domain, we make use of continuous word vectors derived from a recurrent neural network architecture (Mikolov et al., 2010). Recurrent neural network language models use the context history to include long-distance information. Interestingly, the vector-space word representations learned by these language models were shown to capture syntactic and semantic regularities (Mikolov et al., 2013c; Mikolov et al., 2013b). The word relationships are characterized by vector offsets, where, in the embedded space, all pairs of words sharing a particular relation are related by the same constant offset. Considering that this distributional semantic theory may benefit our SLU task, we leverage word representations trained from large external data to differentiate semantic concepts.

2.3 Slot Ranking Model

Our model ranks the slot candidates by integrating two scores (Chen et al., 2013b): (1) the relative frequency of each candidate slot in the corpus, since slots with higher frequency may be more important, and (2) the coherence of the slot-fillers corresponding to the slot. Assuming that domain-specific concepts focus on fewer topics and are similar to each other, the coherence of the corresponding values can help measure the prominence of the slots:

    w(s_i) = (1 - \alpha) \cdot \log f(s_i) + \alpha \cdot \log h(s_i),    (1)

where w(s_i) is the ranking weight for the slot candidate s_i, f(s_i) is the frequency of s_i from semantic parsing, h(s_i) is the coherence measure of s_i, and \alpha is a weighting parameter within the interval [0, 1].

For each slot s_i, we have the set of corresponding slot-fillers, V(s_i), constructed from the utterances including the slot s_i in the parsing results. The coherence measure h(s_i) is computed as the average pair-wise similarity of the slot-fillers, to evaluate whether slot s_i corresponds to centralized or scattered topics:

    h(s_i) = \frac{\sum_{x_a, x_b \in V(s_i), x_a \neq x_b} \mathrm{Sim}(x_a, x_b)}{|V(s_i)|^2},    (2)

where V(s_i) is the set of slot-fillers corresponding to slot s_i, |V(s_i)| is the size of the set, and Sim(x_a, x_b) is the similarity between the pair of fillers x_a and x_b. A slot s_i with higher h(s_i) usually focuses on fewer topics, which is more specific and more likely to be a slot occurring in dialogue systems.
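A minimal sketch of the ranking in Equations (1) and (2) is given below, assuming slot frequencies and a pairwise similarity function are already available; the value of alpha and the toy data are placeholders.

```python
import math
from itertools import combinations

def coherence(fillers, sim):
    """h(s_i): average pairwise similarity over the slot-filler set (Eq. 2)."""
    if len(fillers) < 2:
        return 1e-6  # avoid log(0) for degenerate slots
    total = sum(sim(a, b) for a, b in combinations(fillers, 2))
    # The paper sums over ordered pairs and normalizes by |V|^2,
    # so each unordered pair is counted twice.
    return 2.0 * total / (len(fillers) ** 2)

def ranking_weight(freq, fillers, sim, alpha=0.2):
    """w(s_i) = (1 - alpha) * log f(s_i) + alpha * log h(s_i) (Eq. 1)."""
    return (1 - alpha) * math.log(freq) + alpha * math.log(coherence(fillers, sim))

# Toy usage with a dummy similarity; real similarities come from Section 2.3.1/2.3.2.
dummy_sim = lambda a, b: 1.0 if a == b else 0.5
print(ranking_weight(freq=120, fillers=["cheap", "expensive", "moderately priced"], sim=dummy_sim))
```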

We involve the distributional semantics of the slot-fillers x_a and x_b to derive Sim(x_a, x_b). Here, we propose two similarity measures: the representation-derived similarity and the neighbor-derived similarity, serving as Sim(x_a, x_b) in (2).

2.3.1 Representation-Derived Similarity

Given that distributional semantics can be captured by continuous space word representations (Mikolov et al., 2013c), we transform each token x into its embedding vector x using pre-trained distributed word representations, and then the similarity between a pair of slot-fillers x_a and x_b can be computed as their cosine similarity, called RepSim(x_a, x_b).

We assume that words occurring in similar domains have similar word representations; thus RepSim(x_a, x_b) will be larger when x_a and x_b are semantically related. The representation-derived similarity relies on the quality of the pre-trained word representations, and higher dimensionality of the word embeddings results in more accurate performance but greater complexity.

2.3.2 Neighbor-Derived Similarity

With the embedding vector x corresponding to token x in the continuous space, we build a vector r_x = [r_x(1), ..., r_x(t), ..., r_x(T)] for each x, where T is the vocabulary size and the t-th element of r_x is defined as

    r_x(t) = \begin{cases} \frac{x \cdot y_t}{\|x\|\,\|y_t\|}, & \text{if } y_t \text{ is a word whose embedding vector has one of the top } N \text{ greatest similarities to } x, \\ 0, & \text{otherwise.} \end{cases}    (3)

The t-th element of r_x is the cosine similarity between the embedding vector of slot-filler x and the t-th embedding vector y_t of the pre-trained word representations (the t-th token in the vocabulary of the larger external dataset), and we only include the elements with the top N greatest values to form a sparse vector for space reduction (from T to N). Thus r_x can be viewed as a vector indicating the N nearest neighbors of token x obtained from the continuous word representations. The similarity between a pair of slot-fillers x_a and x_b, Sim(x_a, x_b) in (2), can then be computed as the cosine similarity between r_{x_a} and r_{x_b}, called NeiSim(x_a, x_b).

The idea of using NeiSim(x_a, x_b) is very similar to that of using RepSim(x_a, x_b), where we assume that words with similar concepts should have similar representations and share similar neighbors. Hence, NeiSim(x_a, x_b) is larger when x_a and x_b have more overlapping neighbors in the continuous space.
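As a rough illustration of the two measures, the sketch below computes RepSim as the cosine similarity of embedding vectors and NeiSim as the cosine similarity of sparse top-N neighbor vectors; the in-memory `embeddings` dictionary and random vectors are stand-ins for the pre-trained Google News and Freebase vectors, and N is truncated for the example.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rep_sim(x_a, x_b, embeddings):
    """RepSim: cosine similarity between the two filler embeddings."""
    return cosine(embeddings[x_a], embeddings[x_b])

def neighbor_vector(x, embeddings, vocab, n=100):
    """r_x in Eq. (3): cosine similarities to the top-N nearest vocabulary words, 0 elsewhere."""
    sims = np.array([cosine(embeddings[x], embeddings[w]) for w in vocab])
    r = np.zeros(len(vocab))
    top = np.argsort(-sims)[:n]
    r[top] = sims[top]
    return r

def nei_sim(x_a, x_b, embeddings, vocab, n=100):
    """NeiSim: cosine similarity between the sparse neighbor vectors of the two fillers."""
    return cosine(neighbor_vector(x_a, embeddings, vocab, n),
                  neighbor_vector(x_b, embeddings, vocab, n))

# Toy usage; real vectors would come from the pre-trained word2vec models.
rng = np.random.default_rng(0)
vocab = ["cheap", "expensive", "pricey", "restaurant", "pub"]
embeddings = {w: rng.normal(size=8) for w in vocab}
print(rep_sim("cheap", "expensive", embeddings))
print(nei_sim("cheap", "expensive", embeddings, vocab, n=3))
```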
3 Experiments

We examine the slot induction accuracy by comparing the reranked list of frame-semantic parsing induced slots with the reference slots created by system developers. Furthermore, using the reranked list of induced slots and their associated slot-fillers (values), we compare against the human annotation. For the slot-filling task, we evaluate both on ASR transcripts of the raw audio and on the manual transcripts.

3.1 Experimental Setup

In this experiment, we used the Cambridge University spoken language understanding corpus, previously used in several other SLU tasks (Henderson et al., 2012; Chen et al., 2013a). The domain of the corpus is restaurant recommendation in Cambridge; subjects were asked to interact with multiple spoken dialogue systems in an in-car setting. The corpus contains a total of 2,166 dialogues and 15,453 utterances. The ASR system that was used to transcribe the speech has a word error rate of 37%. There are 10 slots created by domain experts: addr, area, food, name, phone, postcode, price range, signature, task, and type. The parameter \alpha in (1) can be empirically set; we use \alpha = 0.2 and N = 100 for all experiments.

To include distributional semantics information, we use two lists of pre-trained distributed vectors, described below.

• Word/Phrase Vectors from Google News: word vectors are trained on 10⁹ words from Google News, using the continuous bag-of-words architecture, which predicts the current word based on the context. The resulting vectors have dimensionality 300 and a vocabulary size of 3 × 10⁶; the entities contain both words and automatically derived phrases. (Available at https://code.google.com/p/word2vec/.)

• Entity Vectors with Freebase Naming: the entity vectors are trained on 10⁹ words from Google News with naming from Freebase (http://www.freebase.com/). The training was performed using the continuous skip-gram architecture, which predicts surrounding words given the current word. The resulting vectors have dimensionality 1000 and a vocabulary size of 1.4 × 10⁶, and the entities contain the deprecated /en/ naming from Freebase.

The first dataset provides a larger vocabulary and better coverage; the second has more precise vectors, using knowledge from Freebase.

3.2 Evaluation Metrics

To evaluate the induced slots, we measure their quality as the proximity between induced slots and reference slots. Figure 2 shows the many-to-many mappings that indicate semantically related induced slots and reference slots (Chen et al., 2013b). Since we define the adaptation task as a ranking problem, with a ranked list of induced slots we can use the standard mean average precision (MAP) as our metric, where an induced slot is counted as correct when it has a mapping to a reference slot.

Figure 2: The mappings from induced slots (within blocks) to reference slots (right sides of arrows).

To evaluate the slot fillers, for each matched mapping between an induced slot and a reference slot, we compute an F-measure by comparing the list of extracted slot fillers corresponding to the induced slot with the slot fillers in the reference list. Since slot fillers may contain multiple words, we use hard and soft matching to define whether two slot fillers match each other, where "hard" requires that the two slot fillers be exactly the same, and "soft" means that two slot fillers match if they share at least one overlapping word. We weight MAP scores with the corresponding F-measures as MAP-F-H (hard) and MAP-F-S (soft) to evaluate the performance of the slot induction and slot-filling tasks together (Chen et al., 2013b).
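The sketch below illustrates how such a weighted average precision can be computed; the mapping structure, the F-measure weighting, and the toy data are simplified assumptions rather than the paper's exact evaluation script.

```python
def average_precision(ranked_slots, is_correct, weight=lambda s: 1.0):
    """Average precision over a ranked list of induced slots.

    A slot counts as correct if it maps to a reference slot; each hit's
    precision is additionally multiplied by weight(slot) (1.0 gives plain
    MAP, a slot-filler F-measure gives the MAP-F variants). Normalized by
    the number of correct slots found in the list.
    """
    hits, score = 0, 0.0
    for rank, slot in enumerate(ranked_slots, start=1):
        if is_correct(slot):
            hits += 1
            score += (hits / rank) * weight(slot)
    return score / max(hits, 1)

# Toy usage: two of three induced slots map to reference slots.
mapping = {"expensiveness": "price range", "locale_by_use": "type"}
filler_f1 = {"expensiveness": 0.6, "locale_by_use": 0.4}  # assumed slot-filler F-measures
ranked = ["expensiveness", "capability", "locale_by_use"]
print(average_precision(ranked, lambda s: s in mapping))                          # MAP-style
print(average_precision(ranked, lambda s: s in mapping, lambda s: filler_f1[s]))  # MAP-F-style
```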
3.3 Evaluation Results

Table 1 shows the results. Rows (a)-(c) are the baselines without leveraging distributional word representations trained on external data, where row (a) is the baseline using only frequency for ranking, and rows (b) and (c) are the results of the clustering-based ranking models in the prior work (Chen et al., 2013b). Rows (d)-(j) show the performance after leveraging distributional semantics. Rows (d) and (e) are the results using the representation- and neighbor-derived similarities on the Google News data respectively, while rows (f) and (g) are the results on the Freebase naming data. Rows (h)-(j) are the performance of late fusion, where we use voting to combine the results of the two datasets, considering the coverage of the Google data and the precision of the Freebase data. We find that almost all results are improved by including distributed word information.

Table 1: The performance of induced slots and corresponding slot-fillers (%).

                                     ----------- ASR -----------    ---------- Manual ----------
  Approach                            MAP    MAP-F-H   MAP-F-S       MAP    MAP-F-H   MAP-F-S
  Frame Sem
    (a) Frequency (baseline)         67.31    26.96     27.29       59.41    27.29     28.68
    (b) K-Means                      67.38    27.38     27.99       59.48    27.67     28.83
    (c) Spectral Clustering          68.06    30.52     28.40       59.77    30.85     29.22
  Frame Sem + Dist Sem
   Google News
    (d) RepSim                       72.71    31.14     31.44       66.42    32.10     33.06
    (e) NeiSim                       73.35    31.44     31.81       68.87    37.85     38.54
   Freebase
    (f) RepSim                       71.48    29.81     30.37       65.35    34.00     35.04
    (g) NeiSim                       73.02    30.89     30.72       64.87    31.05     31.86
   Fusion
    (h) (d) + (f)                    74.60    29.82     30.31       66.91    34.84     35.90
    (i) (e) + (g)                    74.34    31.01     31.28       68.95    33.73     34.28
    (j) (d) + (e) + (f) + (g)        76.22    30.17     30.53       66.78    32.85     33.44

For ASR results, the performance from Google News and Freebase is similar. Rows (h) and (i) fuse the two systems. For ASR, the MAP scores are slightly improved by integrating the coverage of Google News and the accuracy of Freebase (from about 72% to 74%), but the MAP-F scores do not increase. This may be because some correct slots are ranked higher after combining the two sources of evidence, but their slot-fillers do not perform well enough to increase the MAP-F scores.

Comparing the representation-derived (RepSim) and neighbor-derived (NeiSim) similarities, for both ASR and manual transcripts the neighbor-derived similarity performs better on the Google News data (row (d) vs. row (e)). The reason may be that the neighbor-derived similarity considers more semantically related words when measuring similarity (instead of only two tokens), while the representation-derived similarity is based directly on the trained word vectors, which may degrade with recognition errors. On the Freebase data, rows (f) and (g) do not show a significant difference, probably because the entities in Freebase are more precise and their word representations have higher accuracy; hence, considering additional neighbors in the continuous vector space does not obtain an improvement, and the fusion of results from the two sources (rows (h) and (i)) cannot perform better. However, note that the neighbor-derived similarity requires less space in the computational procedure and sometimes produces results the same as or better than the representation-derived similarity.

Overall, we see that all combinations that leverage distributional semantics outperform using only frame semantics; this demonstrates the effectiveness of applying distributional information to slot induction. The 76% MAP performance indicates that our proposed approach can generate good coverage for the domain-specific slots in a real-world SDS. While we present results in the SLU domain, it should be possible to apply our approach to text-based NLU and slot filling tasks.
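The paper describes the late fusion in rows (h)-(j) only as voting over the two ranked lists; the sketch below shows one plausible voting scheme (rank averaging), purely as an assumption about how such a combination could be implemented.

```python
def fuse_by_voting(*ranked_lists):
    """Combine several ranked slot lists by averaging rank positions (lower is better).
    Slots missing from a list are placed after its last rank."""
    candidates = set().union(*ranked_lists)
    def avg_rank(slot):
        return sum(lst.index(slot) if slot in lst else len(lst) for lst in ranked_lists) / len(ranked_lists)
    return sorted(candidates, key=avg_rank)

# Toy usage: fusing a Google News-based ranking with a Freebase-based ranking.
google_rank = ["expensiveness", "locale_by_use", "food", "capability"]
freebase_rank = ["locale_by_use", "expensiveness", "direction", "food"]
print(fuse_by_voting(google_rank, freebase_rank))
```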
4 Conclusion

We propose the first unsupervised approach unifying frame and distributional semantics for the automatic induction and filling of slots. Our work makes use of a state-of-the-art semantic parser and adapts the generic, linguistically principled FrameNet representation to a semantic space characteristic of a domain-specific SDS. With the incorporation of distributional word representations, we show that our automatically induced semantic slots align well with the reference slots. We show the feasibility of this approach, including slot induction and slot-filling for dialogue tasks.

References

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of COLING, pages 86-90.

Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky. 2013a. An empirical investigation of sparse log-linear models for improved dialogue act classification. In Proceedings of ICASSP, pages 8317-8321.

Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky. 2013b. Unsupervised induction and filling of semantic slots for spoken dialogue systems using frame-semantic parsing. In Proceedings of ASRU, pages 120-125.

Bob Coyne, Daniel Bauer, and Owen Rambow. 2011. VigNet: Grounding language in graphics using frame semantics. In Proceedings of the ACL 2011 Workshop on Relational Models of Semantics, pages 28-36.

Dipanjan Das, Nathan Schneider, Desai Chen, and Noah A. Smith. 2010. Probabilistic frame-semantic parsing. In Proceedings of NAACL-HLT, pages 948-956.

Dipanjan Das, Desai Chen, André F. T. Martins, Nathan Schneider, and Noah A. Smith. 2014. Frame-semantic parsing. Computational Linguistics, 40(1):9-56.

Charles J. Fillmore. 1976. Frame semantics and the nature of language. Annals of the New York Academy of Sciences, 280(1):20-32.

Charles J. Fillmore. 1982. Frame semantics. Linguistics in the Morning Calm, pages 111-137.

Zellig S. Harris. 1954. Distributional structure. Word.

Kazi Saidul Hasan and Vincent Ng. 2013. Frame semantics for stance classification. In Proceedings of CoNLL, page 124.

Steffen Hedegaard and Jakob Grue Simonsen. 2011. Lost in translation: Authorship attribution using frame semantics. In Proceedings of ACL-HLT, pages 65-70.

Matthew Henderson, Milica Gasic, Blaise Thomson, Pirros Tsiakoulis, Kai Yu, and Steve Young. 2012. Discriminative spoken language understanding using word confusion networks. In Proceedings of SLT, pages 176-181.

Karl Moritz Hermann, Dipanjan Das, Jason Weston, and Kuzman Ganchev. 2014. Semantic frame identification with distributed word representations. In Proceedings of ACL.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of INTERSPEECH, pages 1045-1048.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of ICLR.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS, pages 3111-3119.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In Proceedings of NAACL-HLT, pages 746-751.

Tomáš Mikolov. 2012. Statistical language models based on neural networks. Ph.D. thesis, Brno University of Technology.

Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP, pages 1631-1642.