Leveraging Word Embeddings for Spoken Document Summarization
Kuan-Yu Chen*†, Shih-Hung Liu*, Hsin-Min Wang*, Berlin Chen#, Hsin-Hsi Chen†
*Institute of Information Science, Academia Sinica, Taiwan
#National Taiwan Normal University, Taiwan
†National Taiwan University, Taiwan
{kychen, journey, whm}@iis.sinica.edu.tw, [email protected], [email protected]

Abstract

Owing to the rapidly growing multimedia content available on the Internet, extractive spoken document summarization, with the purpose of automatically selecting a set of representative sentences from a spoken document to concisely express the most important theme of the document, has been an active area of research and experimentation. On the other hand, word embedding has emerged as a favorite new research subject because of its excellent performance in many natural language processing (NLP)-related tasks. However, as far as we are aware, there are relatively few studies investigating its use in extractive text or speech summarization. A common thread of leveraging word embeddings in the summarization process is to represent the document (or sentence) by averaging the word embeddings of the words occurring in the document (or sentence). Then, intuitively, the cosine similarity measure can be employed to determine the relevance degree between a pair of representations. Beyond the continued efforts made to improve the representation of words, this paper focuses on building novel and efficient ranking models based on the general word embedding methods for extractive speech summarization. Experimental results demonstrate the effectiveness of our proposed methods compared to existing state-of-the-art methods.

Index Terms: spoken document, summarization, word embedding, ranking model

1. Introduction

Owing to the popularity of various Internet applications, rapidly growing multimedia content, such as music videos, broadcast news programs, and lecture recordings, has been continuously filling our daily life [1-3]. Obviously, speech is one of the most important sources of information in multimedia. By virtue of spoken document summarization (SDS), one can efficiently digest multimedia content by listening to the associated speech summary. Extractive SDS manages to select a set of indicative sentences from a spoken document according to a target summarization ratio and concatenate them together to form a summary [4-7]. The wide spectrum of extractive SDS methods developed so far may be divided into three categories [4, 7]: 1) methods simply based on the sentence position or structure information, 2) methods based on unsupervised sentence ranking, and 3) methods based on supervised sentence classification.

For the first category, the important sentences are selected from some salient parts of a spoken document [8], such as the introductory and/or concluding parts. However, such methods can only be applied to some specific domains with limited document structures. Unsupervised sentence ranking methods attempt to select important sentences based on the statistical features of the sentences or of the words in the sentences, without human annotations involved. Popular methods include the vector space model (VSM) [9], the latent semantic analysis (LSA) method [9], the Markov random walk (MRW) method [10], the maximum marginal relevance (MMR) method [11], the sentence significance score method [12], the unigram language model-based (ULM) method [4], the LexRank method [13], the submodularity-based method [14], and the integer linear programming (ILP) method [15]. Statistical features may include the term (word) frequency, linguistic score, recognition confidence measure, and prosodic information. In contrast, supervised sentence classification methods, such as the Gaussian mixture model (GMM) [9], the Bayesian classifier (BC) [16], the support vector machine (SVM) [17], and conditional random fields (CRFs) [18], usually formulate sentence selection as a binary classification problem, i.e., a sentence can either be included in a summary or not. Interested readers may refer to [4-7] for comprehensive reviews and new insights into the major methods that have been developed and applied with good success to a wide range of text and speech summarization tasks.

Different from the above methods, we explore in this paper various word embedding methods [19-22] for use in extractive SDS; these methods have recently demonstrated excellent performance in many NLP-related tasks, such as relational analogy prediction, sentiment analysis, and sentence completion [23-26]. The central idea of these methods is to learn continuously distributed vector representations of words using neural networks, which can probe latent semantic and/or syntactic cues that can in turn be used to induce similarity measures among words, sentences, and documents. A common thread of applying word embedding methods to NLP-related tasks is to represent the document (or query and sentence) by averaging the word embeddings corresponding to the words occurring in the document (or query and sentence). Then, intuitively, the cosine similarity measure can be applied to determine the relevance degree between a pair of representations. However, such a framework ignores the inter-dimensional correlation between the two vector representations. To mitigate this deficiency, we further propose a novel use of the triplet learning model to enhance the estimation of the similarity degree between a pair of representations. In addition, since most word embedding methods are founded on a probabilistic objective function, a probabilistic similarity measure might be a more natural choice than non-probabilistic ones. Consequently, we also propose a new language model-based framework, which incorporates the word embedding methods with the document likelihood measure. To recapitulate, beyond the continued and tremendous efforts made to improve the representation of words, this paper focuses on building novel and efficient ranking models on top of the general word embedding methods for extractive SDS.
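To make this commonly used baseline concrete, the following Python sketch (not part of the original paper; the pre-trained embedding lookup table, its dimensionality, and the tokenization are assumptions) represents the document and each candidate sentence by the average of their word embeddings and ranks candidate sentences by cosine similarity to the document:

```python
import numpy as np

def average_embedding(tokens, embeddings, dim=100):
    # Represent a sentence or document as the average of its word embeddings.
    # `embeddings` is assumed to be a dict mapping word -> np.ndarray of length `dim`;
    # out-of-vocabulary words are simply skipped.
    vectors = [embeddings[w] for w in tokens if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def cosine_similarity(a, b):
    # Cosine similarity between two vector representations.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def rank_sentences(document_tokens, sentence_token_lists, embeddings, dim=100):
    # Score each candidate sentence against the whole document and return
    # sentence indices sorted from most to least relevant.
    doc_vec = average_embedding(document_tokens, embeddings, dim)
    scores = [cosine_similarity(average_embedding(s, embeddings, dim), doc_vec)
              for s in sentence_token_lists]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

This is exactly the framework whose limitation (ignoring inter-dimensional correlation between the two representations) motivates the ranking models proposed in this paper.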
2. Review of Word Embedding Methods

Perhaps one of the best-known seminal studies on developing word embedding methods was presented in [19]. It estimated a statistical (N-gram) language model, formalized as a feed-forward neural network, for predicting future words in context while inducing word embeddings (or representations) as a by-product. Such an attempt has already motivated many follow-up extensions that develop similar methods for probing latent semantic and syntactic regularities in the representation of words. Representative methods include, but are not limited to, the continuous bag-of-words (CBOW) model [21], the skip-gram (SG) model [21, 27], and the global vector (GloVe) model [22]. As far as we are aware, little work has been done to contextualize these methods for use in speech summarization.

2.1. The Continuous Bag-of-Words (CBOW) Model

Rather than seeking to learn a statistical language model, the CBOW model manages to obtain a dense vector representation (embedding) of each word directly [21]. The structure of CBOW is similar to that of a feed-forward neural network, with the exception that the non-linear hidden layer is removed. By getting around the heavy computational burden incurred by the non-linear hidden layer, the model can be trained on a large corpus efficiently, while still retaining good performance. Formally, given a sequence of words, w^1, w^2, ..., w^T, the objective function of CBOW is to maximize the log-probability

\sum_{t=1}^{T} \log P(w^t \mid w^{t-c}, \ldots, w^{t-1}, w^{t+1}, \ldots, w^{t+c}),   (1)

where c is the window size of context words being considered for the central word w^t, T denotes the length of the training corpus, and

P(w^t \mid w^{t-c}, \ldots, w^{t-1}, w^{t+1}, \ldots, w^{t+c}) = \frac{\exp(v_{w^t} \cdot \hat{v}_{w^t})}{\sum_{i=1}^{V} \exp(v_{w_i} \cdot \hat{v}_{w^t})},   (2)

where v_{w^t} denotes the vector representation of the word at position t, V is the size of the vocabulary, and \hat{v}_{w^t} denotes the (weighted) average of the vector representations of the context words of w^t [21, 26]. The concept of CBOW is motivated by …
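As an illustration only (not the authors' implementation), the NumPy sketch below evaluates the CBOW prediction probability of Eqs. (1)-(2) for a toy vocabulary; the separate input/output embedding matrices and the unweighted context average are assumptions, and practical toolkits typically approximate the full-vocabulary softmax (e.g., with negative sampling, mentioned in Section 2.4) for efficiency.

```python
import numpy as np

def cbow_log_prob(center_idx, context_idxs, W_in, W_out):
    # Log-probability of the central word given its context, following Eqs. (1)-(2):
    # the context is summarized by the (unweighted) average of its word vectors,
    # and the softmax normalization runs over the entire vocabulary of size V.
    v_hat = W_in[context_idxs].mean(axis=0)      # averaged context representation
    logits = W_out @ v_hat                       # one score per vocabulary word
    logits -= logits.max()                       # numerical stability
    return logits[center_idx] - np.log(np.exp(logits).sum())

# Toy example: V = 5 words, 3-dimensional embeddings, two context words.
rng = np.random.default_rng(0)
W_in, W_out = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
print(cbow_log_prob(center_idx=2, context_idxs=[1, 3], W_in=W_in, W_out=W_out))
```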
2.3. The Global Vector (GloVe) Model

The GloVe model suggests that an appropriate starting point for word representation learning should be associated with the ratios of co-occurrence probabilities rather than the prediction probabilities [22]. More precisely, GloVe makes use of weighted least squares regression, which aims at learning word representations by preserving the co-occurrence frequencies between each pair of words:

\sum_{i=1}^{V} \sum_{j=1}^{V} f(X_{w_i w_j}) \left( v_{w_i} \cdot v_{w_j} + b_{w_i} + b_{w_j} - \log X_{w_i w_j} \right)^2,   (5)

where X_{w_i w_j} denotes the number of times words w_i and w_j co-occur in a pre-defined sliding context window, f(·) is a monotonic smoothing function used to modulate the impact of each pair of words involved in the model training, and v_w and b_w denote the word representation and the bias term of word w, respectively.

2.4. Analytic Comparisons

Several analytic comparisons can be made among the above three word embedding methods. First, they have different model structures and learning strategies. CBOW and SG adopt an online learning strategy, i.e., the parameters (word representations) are trained sequentially. Therefore, the order in which the training samples are used may change the resulting models dramatically. In contrast, GloVe uses a batch learning strategy, i.e., it accumulates the statistics over the entire training corpus and updates the model parameters at once. Second, it is worth noting that SG (trained with the negative sampling algorithm) and GloVe have an implicit/explicit relation with the classic weighted matrix factorization approach; the major difference is that SG and GloVe concentrate on modeling the word-by-word co-occurrence matrix, whereas weighted matrix factorization is usually concerned with decomposing the word-by-document matrix [9, 31, 32].

The observations made above on the relation between word embedding methods and matrix factorization bring us to the notion of leveraging the singular value decomposition (SVD) method as an alternative mechanism to derive the word embeddings in this paper. Given a training text corpus, we have a word-by-word co-occurrence matrix A.
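To illustrate this SVD-based alternative, the following sketch (an assumption-laden example, not the paper's exact recipe) builds a word-by-word co-occurrence matrix A from a token stream and derives low-dimensional word embeddings via truncated SVD; the window size, the log(1 + A) damping transform, and the embedding dimensionality are arbitrary choices made for the illustration.

```python
import numpy as np

def cooccurrence_matrix(corpus_tokens, vocab, window=5):
    # Build a symmetric word-by-word co-occurrence matrix A by counting how
    # often two vocabulary words appear within `window` positions of each other.
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(vocab)))
    for t, w in enumerate(corpus_tokens):
        if w not in index:
            continue
        for u in corpus_tokens[max(0, t - window):t]:
            if u in index:
                A[index[w], index[u]] += 1
                A[index[u], index[w]] += 1
    return A

def svd_embeddings(A, dim=50):
    # Derive word embeddings from the co-occurrence matrix via truncated SVD.
    # The log(1 + A) transform dampens very frequent pairs; each returned row
    # is the embedding of the corresponding vocabulary word.
    U, S, _ = np.linalg.svd(np.log1p(A), full_matrices=False)
    return U[:, :dim] * S[:dim]
```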