Segmental Audio Word2vec: Representing Utterances As Sequences of Vectors with Applications in Spoken Term Detection
Yu-Hsuan Wang, Hung-yi Lee, Lin-shan Lee
College of Electrical Engineering and Computer Science, National Taiwan University
r04922167, [email protected], [email protected]

ABSTRACT

While Word2Vec represents words (in text) as vectors carrying semantic information, audio Word2Vec was shown to be able to represent signal segments of spoken words as vectors carrying phonetic structure information. Audio Word2Vec can be trained in an unsupervised way from an unlabeled corpus, except that the word boundaries are needed. In this paper, we extend audio Word2Vec from word-level to utterance-level by proposing a new segmental audio Word2Vec, in which unsupervised spoken word boundary segmentation and audio Word2Vec are jointly learned and mutually enhanced, so an utterance can be directly represented as a sequence of vectors carrying phonetic structure information. This is achieved by a segmental sequence-to-sequence autoencoder (SSAE), in which a segmentation gate trained with reinforcement learning is inserted in the encoder. Experiments on English, Czech, French and German show very good performance in both unsupervised spoken word segmentation and spoken term detection applications (significantly better than frame-based DTW).

Index Terms— recurrent neural network, autoencoder, reinforcement learning, policy gradient

1. INTRODUCTION

In natural language processing, it is well known that Word2Vec, which transforms words (in text) into vectors of fixed dimensionality, is very useful in various applications because those vectors carry semantic information[1][2]. In speech signal processing, it has been shown that audio Word2Vec, which transforms spoken words into vectors of fixed dimensionality[3][4], is likewise useful, for example in spoken term detection or data augmentation[5][6], because those vectors carry the phonetic structure of the spoken words. This audio Word2Vec can be trained in a completely unsupervised way from an unlabeled dataset, except that the spoken word boundaries are needed. The need for spoken word boundaries is a major limitation of audio Word2Vec, because word boundaries are usually not available for given speech utterances or corpora[7][8]. Although it is possible to use some automatic process to estimate word boundaries and then apply audio Word2Vec[9][10][11][12][13], it is highly desirable that the signal segmentation and audio Word2Vec be integrated and jointly learned, because in that way they may enhance each other. This means the machine learns to segment the utterances into sequences of spoken words and to transform these spoken words into sequences of vectors at the same time. This is the segmental audio Word2Vec proposed here: representing each utterance as a sequence of fixed-dimensional vectors, each of which hopefully carries the phonetic structure information of a spoken word. This actually extends audio Word2Vec from the word level up to the utterance level. Such a segmental audio Word2Vec has plenty of potential applications in the future, for example speech information summarization, speech-to-speech translation, or voice conversion[14]. Here we show a very attractive first application in spoken term detection.

The segmental audio Word2Vec proposed in this paper is based on a segmental sequence-to-sequence autoencoder (SSAE), in which a segmentation gate and a sequence-to-sequence autoencoder are learned jointly. The former determines the word boundaries in the utterance, and the latter represents each audio segment with an embedding vector. These two processes can be jointly learned from an unlabeled corpus in a completely unsupervised way. During training, the model learns to convert the utterances into sequences of embeddings, and then reconstructs the utterances from these sequences of embeddings. A guideline for the proper number of vectors (or words) within an utterance of a given length is needed, in order to prevent the machine from segmenting the utterances into more segments (or words) than needed. Since the number of embeddings is a discrete variable and not differentiable, standard back-propagation is not applicable[15][16]; the policy gradient for reinforcement learning[17] is therefore used. How well these generated word vector sequences carry the phonetic structure information of the original utterances was evaluated with the real application task of query-by-example spoken term detection on four languages: English (on TIMIT) and Czech, French and German (on GlobalPhone corpora)[18].
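To make the intended interface concrete, the following is a minimal, hypothetical Python sketch of what a segmental audio Word2Vec provides: an utterance of T acoustic feature frames is mapped to N ≤ T fixed-dimensional vectors, one per detected spoken word. The fixed-length segmentation and mean-pooling "encoder" below are placeholders only; the paper instead learns both components jointly with the SSAE described in Section 2.

    import numpy as np

    def to_embedding_sequence(utterance: np.ndarray, seg_len: int = 20) -> np.ndarray:
        # utterance: (T, d) array of frame-level features such as MFCCs.
        # Returns an (N, d) array of segment embeddings with N <= T.
        boundaries = range(0, len(utterance), seg_len)             # placeholder segmentation
        segments = [utterance[b:b + seg_len] for b in boundaries]  # variable-length segments
        return np.stack([seg.mean(axis=0) for seg in segments])    # placeholder "encoder"

    # Example: a 3-second utterance (300 frames of 39-dim MFCCs) -> 15 segment vectors.
    print(to_embedding_sequence(np.random.randn(300, 39)).shape)   # (15, 39)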
2. PROPOSED APPROACH

2.1. Segmental Sequence-to-Sequence Autoencoder (SSAE)

The proposed structure for the SSAE is depicted in Fig. 1, in which a segmentation gate is inserted into the recurrent autoencoder. For an input utterance X = {x_1, x_2, ..., x_T}, where x_t represents the t-th acoustic feature vector (e.g., MFCC) and T is the length of the utterance, the model learns to determine the word boundaries and to produce the embeddings for the N generated audio segments, Y = {e_1, e_2, ..., e_N}, where e_n is the n-th embedding and N ≤ T.

The SSAE consists of an encoder RNN (ER) and a decoder RNN (DR), just like a conventional autoencoder, but the encoder includes an extra segmentation gate controlled by another RNN (shown as the sequence of blocks S in Fig. 1). The segmentation problem is formulated as a reinforcement learning problem: at each time t, the segmentation gate agent performs an action a_t, "segment" or "pass", according to a given state s_t, and x_t is taken as a word boundary if a_t is "segment".

[Fig. 1. The segmental sequence-to-sequence autoencoder (SSAE). In addition to the encoder RNN (ER) and decoder RNN (DR), a segmentation gate (blocks S) is included in the encoder for estimating the word boundaries. During transitions across segment boundaries, the encoder RNN and decoder RNN are reset (illustrated with a slash in front of an arrow), so there is no information flow across segment boundaries. Each segment (shown in a different color) can be viewed as performing sequence-to-sequence training individually.]

For the segmentation gate, the state at time t, s_t, is defined as the concatenation of the input x_t, the gate activation signal (GAS) g_t extracted from the gates of the GRU in another pre-trained RNN autoencoder [10], and the previous action a_{t−1} taken [19],

    s_t = x_t || g_t || a_{t−1}.   (1)

The output h_t of the segmentation gate RNN layers (blocks S in Fig. 1), followed by a linear transform (W^π, b^π) and a softmax nonlinearity, models the policy π_t at time t,

    h_t = RNN(s_1, s_2, ..., s_t),   (2)

    π_t = softmax(W^π h_t + b^π).   (3)

This π_t gives two probabilities, one for "segment" and one for "pass". An action a_t is then sampled from this distribution during training to encourage exploration; during testing, a_t is the action with the higher probability.
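As an illustration only (not the authors' code), the following PyTorch sketch implements a segmentation gate policy of the form in Eqs. (1)-(3): the state concatenates x_t, the GAS g_t, and the previous action; an RNN produces h_t; and a linear layer plus softmax gives π_t, from which the action is sampled during training or taken greedily at test time. The choice of a GRU for the gate RNN and the dimensions feat_dim, gas_dim and hidden_dim are assumptions.

    import torch
    import torch.nn as nn

    class SegmentationGate(nn.Module):
        # Hypothetical sketch of Eqs. (1)-(3); dimensions are assumptions.
        def __init__(self, feat_dim=39, gas_dim=1, hidden_dim=128):
            super().__init__()
            state_dim = feat_dim + gas_dim + 1             # s_t = x_t || g_t || a_{t-1}, Eq. (1)
            self.rnn = nn.GRU(state_dim, hidden_dim, batch_first=True)
            self.policy = nn.Linear(hidden_dim, 2)          # (W^pi, b^pi) over {"pass", "segment"}

        def forward(self, x, gas, sample=True):
            # x: (T, feat_dim) acoustic features; gas: (T, gas_dim) gate activation signals.
            actions, log_probs, prev_a, h = [], [], 0.0, None
            for t in range(x.size(0)):
                s_t = torch.cat([x[t], gas[t], torch.tensor([prev_a])])    # Eq. (1)
                out, h = self.rnn(s_t.view(1, 1, -1), h)                   # h_t, Eq. (2)
                pi_t = torch.softmax(self.policy(out.view(-1)), dim=-1)    # Eq. (3)
                dist = torch.distributions.Categorical(pi_t)
                a_t = dist.sample() if sample else pi_t.argmax()           # explore / greedy
                actions.append(int(a_t))                                   # 1 = "segment"
                log_probs.append(dist.log_prob(a_t))
                prev_a = float(a_t)
            return actions, torch.stack(log_probs)

    # Example: 300 frames of 39-dim MFCCs with a one-dimensional GAS per frame.
    gate = SegmentationGate()
    actions, log_probs = gate(torch.randn(300, 39), torch.randn(300, 1))

The per-step log-probabilities are returned so that the policy-gradient update of Sec. 2.3 can later be applied to the sampled actions.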
When a_t is "segment", time t is viewed as a word boundary, and the segmentation gate passes the output of the encoder RNN on as an embedding. The state of the encoder RNN is also reset to its initial value, so the embedding e_n is generated based on the acoustic features of the corresponding audio segment only, independent of the previous input in spite of the recurrent structure,

    e_n = Encoder(x_{t_1}, x_{t_1+1}, ..., x_{t_2}),   (4)

where t_1 and t_2 refer to the beginning and ending times of the n-th audio segment.

The input utterance X should be reconstructed from the embedding sequence Y = {e_1, e_2, ..., e_N}. Because the decoder RNN (DR) runs backward in order, as shown in Fig. 1 [20], for the embedding e_n of the input segment from t_1 to t_2 in Eq. (4) above, the reconstructed feature vector is

    x̂_t = Decoder(x̂_{t_2}, x̂_{t_2−1}, ..., x̂_{t+1}, e_n).   (5)

The decoder RNN is also reset when it begins decoding each audio segment, to remove any information flow from the following segment.

2.2. Encoder and Decoder Training

The loss function L for training the encoder and decoder is simply the averaged squared ℓ2 norm of the reconstruction error over all inputs x_t:

    L = Σ_{l=1}^{L} Σ_{t=1}^{T_l} (1/d) ||x̂_t^{(l)} − x_t^{(l)}||²,   (6)

where the superscript (l) indicates the l-th training utterance with length T_l, L is the number of utterances used in training, and d is the dimensionality of x_t^{(l)}.

2.3. Segmentation Gate Training

2.3.1. Policy Gradient

The segmentation gate is trained with the policy gradient for reinforcement learning [17], maximizing the expected reward under the policy π, J(θ) = E_π[r], where θ is the parameter set. The updates of the segmentation gate are simply given by

    ∇_θ J(θ) = E_{a∼π}[ Σ_{t=1}^{T} ∇_θ log π^{(θ)}(a_t) (r − r_b) ],   (7)

where π^{(θ)}(a_t) is the probability of the action a_t taken, as in Eq. (3).

2.3.2. Rewards

The reconstruction error is certainly a good indicator of whether the segmentation boundaries are good, since the embeddings are generated based on the segmentation. We hypothesize that good boundaries, for example those close to true word boundaries, would result in smaller reconstruction errors, because the audio segments corresponding to words appear more frequently in the corpus, so their embeddings would be trained better and give lower reconstruction errors. So the smaller the reconstruction error, the higher the reward:

    r_MSE = − Σ_{t=1}^{T} (1/d) ||x̂_t − x_t||².   (8)

This is very similar to Eq. (6), except that it is computed for a specific utterance.
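A minimal sketch, under assumed shapes, of how Eq. (8) and the update in Eq. (7) fit together: the per-utterance reward is the negative reconstruction error, and the REINFORCE-style surrogate loss below reproduces the single-sample estimate of Eq. (7) when differentiated with respect to the gate parameters (the log-probabilities would come from a gate such as the one sketched in Sec. 2.1). The constant zero used for the baseline r_b is purely a placeholder; this is an illustration, not the authors' implementation.

    import torch

    def reconstruction_reward(x_hat, x):
        # Eq. (8): negative per-utterance reconstruction error.
        # x_hat, x: (T, d) reconstructed and original frame features.
        d = x.size(1)
        return -(((x_hat - x) ** 2).sum(dim=1) / d).sum()

    def policy_gradient_loss(log_probs, r, r_b):
        # Surrogate loss whose negative gradient is the single-sample estimate of Eq. (7):
        # sum_t grad log pi(a_t) * (r - r_b), with (r - r_b) treated as a constant
        # with respect to the gate parameters (hence the detach).
        # log_probs: (T,) log pi(a_t) of the sampled actions; r, r_b: scalar reward and baseline.
        return -(log_probs.sum() * (r - r_b).detach())

    # Example with dummy tensors (300 frames, 39-dim features, zero baseline).
    x, x_hat = torch.randn(300, 39), torch.randn(300, 39)
    reward = reconstruction_reward(x_hat, x)
    loss = policy_gradient_loss(torch.log(torch.rand(300)), reward, torch.tensor(0.0))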