
Towards Unsupervised Automatic Speech Recognition Trained by Unaligned Speech and Text only

Yi-Chen Chen, Chia-Hao Shen, Sung-Feng Huang, Hung-yi Lee

National Taiwan University, Taiwan {r06942069, r04921047, r06942045, hungyilee}@ntu.edu.tw

arXiv:1803.10952v3 [cs.CL] 11 Aug 2018

Abstract

Automatic speech recognition (ASR) has been widely researched with supervised approaches, but many low-resource languages lack audio-text aligned data, so supervised methods cannot be applied to them. In this work, we propose a framework to achieve unsupervised ASR on a read English speech dataset in which audio and text are unaligned. In the first stage, each word-level audio segment in the utterances is represented by a vector extracted by a sequence-to-sequence autoencoder, in which phonetic information and speaker information are disentangled. In the second stage, semantic embeddings of audio segments are trained from these vector representations using a skip-gram model. Finally, an unsupervised method is used to transform the semantic embeddings of audio segments into the text embedding space, and the transformed embeddings are mapped to words. With this framework, we move towards unsupervised ASR trained by unaligned text and speech only.

Index Terms: Audio Word2Vec, Unsupervised Automatic Speech Recognition

1. Introduction

ASR has achieved great success and is widely used in modern society [1, 2, 3]. However, existing algorithms require machines to learn from a large amount of annotated data, which makes it challenging to develop speech technology for a new low-resource language. Annotating audio data for speech recognition is expensive, but unannotated audio data is relatively easy to collect. If a machine can acquire the word patterns behind speech signals from a large collection of unannotated speech data without alignment with text, it would be able to learn a new language from speech in a novel linguistic environment with little supervision. Much research has pursued this goal [4, 5, 6, 7, 8, 9, 10, 11].

Audio segment representation is still an open problem and an active research area [12, 13, 14, 15, 16]. In previous work, a sequence-to-sequence autoencoder (SA) was used to represent variable-length audio segments as fixed-length vectors [17, 18]. In SA, the RNN encoder reads an audio segment represented as an acoustic feature sequence and maps it to a vector representation; the RNN decoder maps the vector back to the input sequence of the encoder. SA needs only audio segments without human annotation, which makes it suitable for low-resource applications. It has been shown that the vector representation contains phonetic information [17, 18, 19, 20].

In text, Word2Vec [21] transforms each word into a fixed-dimensional semantic vector that serves as a basic component of natural language processing applications. Word2Vec is useful because it is learned from a large collection of documents without supervision. In this paper, we propose a similar method to extract semantic representations from audio without supervision. First, phonetic embeddings with little speaker- or environment-dependent information are extracted from audio segments by SA with adversarial training for disentangling information. Then, the phonetic embeddings are used to obtain semantic embeddings by a skip-gram model [21]. Unlike typical Word2Vec, which takes one-hot representations of words as input, the proposed model takes phonetic embeddings from SA as input.

Given a set of word embeddings learned from text, if we can map the audio semantic embeddings to the textual semantic embedding space, the text corresponding to the semantic embeddings of the audio segments becomes available. In this way, unsupervised ASR would be achieved. The idea is inspired by unsupervised machine translation with monolingual corpora only [22, 23]. Because most languages share the same expressive power and are used to describe similar human experiences across cultures, they should share similar statistical properties. For example, one can expect the most frequent words to be shared. Therefore, given two sets of word embeddings of two languages, the representations of these words can be similar up to a linear transformation [23, 24].

In our task, the targets we want to align are not two different languages, but audio and text of the same language. We believe the alignment is feasible because the frequencies and contextual relations of words are close in the audio and text domains for the same language. The mapping method used in this stage is an EM-based method, Mini-Batch Cycle Iterative Closest Point (MBC-ICP) [24], which was originally proposed for unsupervised word translation. Given two sets of embeddings, that is, word semantic embeddings from text and from audio, MBC-ICP iteratively aligns the vectors in the two sets by Principal Component Analysis (PCA) and an affine transformation matrix. After the semantic embeddings from audio are mapped to those learned from text, the text corresponding to the audio segments is directly known.

To the best of our knowledge, this is the first work attempting to achieve word-level ASR without any speech and text alignment.

2. Proposed Method

The proposed framework of unsupervised ASR consists of three stages:

1. Extracting phonetic embeddings from word-level audio segments using SA with discrimination.
2. Training semantic embeddings from phonetic embeddings.
3. Unsupervised transformation from audio semantic embeddings to textual semantic embeddings.

The above three stages are described in Sections 2.1, 2.2 and 2.3 respectively.

Figure 2: Training semantic embeddings from phonetic embeddings.
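The premise behind stage 3 is that two embedding sets describing the same words can be similar up to a linear transformation. The following toy sketch illustrates only that premise on synthetic data. It is not the MBC-ICP procedure: it fits the map by ordinary least squares on a few known pairs, which is closer to the supervised variant. All names (`fit_linear_map`, `nearest_neighbor_indices`, `text_emb`, `audio_emb`) and all sizes are assumptions made for illustration.

```python
import numpy as np


def fit_linear_map(src, tgt):
    """Least-squares matrix W such that src @ W approximates tgt (rows are paired)."""
    w, *_ = np.linalg.lstsq(src, tgt, rcond=None)
    return w


def nearest_neighbor_indices(queries, keys):
    """For every query row, return the index of the closest key row (Euclidean)."""
    sq_dist = ((queries[:, None, :] - keys[None, :, :]) ** 2).sum(axis=-1)
    return sq_dist.argmin(axis=1)


rng = np.random.default_rng(0)
n_words, dim, n_labeled = 200, 16, 50

# Synthetic "textual" embeddings, one row per word (hypothetical data).
text_emb = rng.normal(size=(n_words, dim))

# Synthetic "audio" embeddings: the same words seen through a hidden
# orthogonal map plus a little noise. This is the assumption being illustrated.
hidden_map, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
audio_emb = text_emb @ hidden_map + 0.01 * rng.normal(size=(n_words, dim))

# Fit the audio-to-text map on a small set of labeled pairs only.
learned_map = fit_linear_map(audio_emb[:n_labeled], text_emb[:n_labeled])

# "Recognize" every audio embedding as its nearest textual embedding.
predicted = nearest_neighbor_indices(audio_emb @ learned_map, text_emb)
top1_accuracy = float((predicted == np.arange(n_words)).mean())
print(f"top-1 accuracy on synthetic data: {top1_accuracy:.3f}")
```

On real embeddings the relation is only approximate and the pairs are unknown, which is why the framework needs an unsupervised alignment method rather than a direct least-squares fit.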

Figure 1: Network architecture for disentangling phonetic and speaker information.

2.1. Extracting Phonetic Embeddings from Word-Level Audio Segments Using SA with Discrimination

In the proposed framework, we assume that each utterance in an audio collection is already segmented into word-level segments. Although unsupervised segmentation is still challenging, many approaches are already available [25, 26]. We denote the audio collection as $X = \{x_i\}_{i=1}^{M}$, which consists of $M$ word-level audio segments. Each segment is $x = (x_1, x_2, \ldots, x_T)$, where $x_t$ is the feature vector of the $t$-th time frame and $T$ is the number of time frames in the segment. The goal is to disentangle the phonetic and speaker information in the acoustic features and extract a vector representation containing the phonetic information.

2.1.1. Autoencoder

As shown in Figure 1, we pass a sequence of acoustic features $x$ into a phonetic encoder $E_p$ and a speaker encoder $E_s$ to obtain a phonetic vector $v_p$ and a speaker vector $v_s$. Then we take the phonetic and speaker vectors as inputs of the decoder to reconstruct the acoustic features $x'$. The phonetic vector $v_p$ will be used in the next stage. The two encoders and the decoder are jointly learned by minimizing the reconstruction loss below:

$$L_r = \sum_i \lVert x_i - x'_i \rVert_2. \tag{1}$$

Sections 2.1.2 and 2.1.3 describe how $E_p$ is made to encode phonetic information and $E_s$ to encode speaker information.

2.1.2. Training Criteria for Speaker Encoder

In the following discussion, we also assume the speakers of the segments are known. Suppose the segment $x_i$ is uttered by speaker $s_i$. If speaker information is not available, we can simply assume that segments from the same utterance are uttered by the same speaker, and the approach below can still be applied. $E_s$ is learned to minimize the following loss $L_s$:

$$L_s = \sum_{s_i = s_j} \lVert v_{s_i} - v_{s_j} \rVert_2 + \sum_{s_i \neq s_j} \max\left(\lambda - \lVert v_{s_i} - v_{s_j} \rVert_2,\ 0\right). \tag{2}$$

If $x_i$ and $x_j$ are uttered by the same speaker ($s_i = s_j$), we want their speaker embeddings $v_{s_i}$ and $v_{s_j}$ to be as close as possible. On the other hand, if $s_i \neq s_j$, we want the distance between $v_{s_i}$ and $v_{s_j}$ to be larger than a threshold $\lambda$.

2.1.3. Training Criteria for Phonetic Encoder

As shown in Figure 1, the discriminator $D$ takes two phonetic vectors $v_{p_i}$ and $v_{p_j}$ as inputs and tries to tell whether the two vectors come from the same speaker. The learning target of the phonetic encoder $E_p$ is to "fool" the discriminator, keeping it from discriminating correctly. In this way, only phonetic information is contained in the phonetic vector, and the speaker information in the original acoustic features is encoded in the speaker vector. The discriminator learns to maximize $L_d$ in (3), while the phonetic encoder learns to minimize $L_d$.

$$L_d = \sum_{s_i = s_j} D(v_{p_i}, v_{p_j}) - \sum_{s_i \neq s_j} D(v_{p_i}, v_{p_j}). \tag{3}$$

The whole optimization procedure alternates between two updates: the discriminator is trained to maximize $L_d$ (equivalently, to minimize $-L_d$), and the other parts are trained to minimize $L_r + L_s + L_d$.

2.2. Training Semantic Embeddings from Phonetic Embeddings

Similar to the Word2Vec skip-gram model [21], we use two encoders $E_{sem}$ and $E_{con}$ to train the semantic embeddings from phonetic embeddings (Figure 2). On one hand, given a segment $x_i$, we feed its phonetic vector $v_{p_i}$ obtained from the previous stage into $E_{sem}$, and output the semantic embedding of the segment, $v_{w_i} = E_{sem}(v_{p_i})$. On the other hand, given the context window size $c$, which is a hyperparameter, if a segment $x_j$ is in the context window of $x_i$, then its phonetic vector $v_{p_j}$ is a context vector of $v_{p_i}$. For each context vector $v_{p_j}$ of $v_{p_i}$, we feed it into $E_{con}$ and output its context embedding $v_{c_j} = E_{con}(v_{p_j})$.

Given a pair of phonetic vectors $(v_{p_i}, v_{p_j})$, the training criterion for $E_{sem}$ and $E_{con}$ is to maximize the similarity of $v_{w_i}$ and $v_{c_j}$ if $v_{p_i}$ and $v_{p_j}$ are contextual, and to minimize their similarity otherwise. The basic idea parallels textual Word2Vec. Two different words with similar contexts have similar semantics; thus, if two different phonetic embeddings corresponding to different words have the same context, they will be close to each other after being projected by $E_{sem}$. $E_{sem}$ and $E_{con}$ learn to minimize the semantic loss $L_{sem}$ as follows:

$$L_{sem} = \sum_{(x_i, x_j)\ \text{in context window}} -\log\left(\mathrm{sigmoid}(v_{w_i} \cdot v_{c_j})\right) + \sum_{(x_i, x_k)\ \text{not in context window}} -\log\left(\mathrm{sigmoid}(-v_{w_i} \cdot v_{c_k})\right). \tag{4}$$

The sigmoid of the dot product of $v_w$ and $v_c$ is used to evaluate the similarity. If $x_i$ and $x_j$ are in the same context window, we want $v_{w_i}$ and $v_{c_j}$ to be as similar as possible. We also use the negative sampling technique, in which only some pairs $(x_i, x_k)$ are randomly sampled as negative examples instead of enumerating all possible negative pairs.

2.3. Unsupervised Transformation from Audio to Text

We have a set of audio semantic embeddings $V_W = \{v_{w_1}, \ldots, v_{w_i}, \ldots, v_{w_M}\}$ obtained from the last stage, where $M$ is the number of audio segments in the audio collection. On the other hand, given a text collection, we can obtain textual semantic embeddings $U_W = \{u_{w_1}, \ldots, u_{w_j}, \ldots, u_{w_N}\}$ by typical word embedding models like skip-gram. Here $u_{w_j}$ is the word embedding of the $j$-th word in the text collection, and there are $N$ words in the text database. Although both $v_w$ and $u_w$ contain semantic information, they are not in the same space; that is, the same dimension in $v_w$ and $u_w$ would not correspond to the same semantic meaning. Here we want to learn a transformation that maps an embedding $v_w$ to $\tilde{u}_w$ in the textual semantic space.

MBC-ICP is used here; its procedure is as follows. Given two sets of embeddings, $V_W$ and $U_W$, each is projected onto its top $K$ principal components by PCA. Let the projected vectors of $V_W$ and $U_W$ be $A$ and $B$. The $i$-th column of $A$, $a_i$, is the PCA projection of $v_{w_i}$, while the $j$-th column of $B$, $b_j$, is the PCA projection of $u_{w_j}$. Both $a_i$ and $b_j$ have dimensionality $K$. If $V_W$ can be mapped to the space of $U_W$ by an affine transformation, $A$ and $B$ would be similar after PCA [24]. This PCA mapping technique is commonly used [27, 28].

Then a pair of transformation matrices, $T_{ab}$ and $T_{ba}$, is learned, where $T_{ab}$ transforms an $a$ in $A$ to the space of $B$, that is, $\tilde{b} = T_{ab} a$, while $T_{ba}$ maps $b$ to the space of $A$. $T_{ab}$ and $T_{ba}$ are learned iteratively by the following algorithm. We assume that the two kinds of semantic embeddings are likely to be nearly the same after PCA projection, so we initialize the transformation matrices as identity matrices. Then in each iteration, the following steps are conducted:

1. For each $a_i$, find the nearest $T_{ba} b_j$ from all $j$, denoted as $b_{i^*}$.
2. For each $b_j$, find the nearest $T_{ab} a_i$ from all $i$, denoted as $a_{j^*}$.
3. Optimize $T_{ab}$ and $T_{ba}$ by minimizing:

$$L_{trans} = \sum_i \lVert b_{i^*} - T_{ab} a_i \rVert_2 + \sum_j \lVert a_{j^*} - T_{ba} b_j \rVert_2 + \lambda' \sum_i \lVert a_i - T_{ba} T_{ab} a_i \rVert_2 + \lambda' \sum_j \lVert b_j - T_{ab} T_{ba} b_j \rVert_2 \tag{5}$$

In the first and second terms, we want to transform $a_i$ and $b_j$ to their nearest neighbors in the other space, $b_{i^*}$ and $a_{j^*}$ respectively. We include cycle constraints as the third and fourth terms in (5) to ensure that both $a_i$ and $b_j$ are unchanged after being transformed to the other space and back.

Equation (5) is solved by gradient descent. After $T_{ba}$ is eventually obtained, given $a_i$, we can find the $b_{i^*}$ for which $T_{ba} b_{i^*}$ is nearest to $a_i$ among all the columns of $B$. Then we consider that the $i^*$-th word in the text database corresponds to the $i$-th audio segment, or equivalently that the $i^*$-th word is the recognition result of the $i$-th audio segment.

If some aligned pairs of audio and textual semantic embeddings are available, we can also train the transformation matrix in a supervised or semi-supervised way, in which the first two terms of (5) directly minimize the distance from the true embedding rather than from the nearest embedding.

Table 1: The Spearman's rank correlation scores of four audio sets with textual semantic embeddings with one-hot input, where the abbreviations are SE: semantic embeddings, PE: phonetic embeddings, SAD: sequence-to-sequence autoencoder with disentanglement, SA: sequence-to-sequence autoencoder without disentanglement.

Dataset            SE/SAD   SE/SA    PE/SAD   PE/SA
MEN [29]           0.415    0.149    0.297    0.073
MTurk [30]         0.442    0.251    0.373    0.221
RG65 [31]          0.236    0.217    0.273    -0.073
RW [32]            0.730    0.595    0.684    0.550
SimLex999 [33]     0.309    -0.043   0.139    -0.110
WS353 [34, 35]     0.441    0.203    0.374    0.109
WS353R [34, 35]    0.385    0.164    0.348    0.102
WS353S [34, 35]    0.465    0.224    0.367    0.122

Table 2: The Spearman's rank correlation scores of four audio sets with textual semantic embeddings with phonetic embedding input.

Dataset            SE/SAD   SE/SA    PE/SAD   PE/SA
MEN                0.430    0.303    0.381    0.207
MTurk              0.471    0.492    0.371    0.373
RG65               0.161    0.152    0.117    0.003
RW                 0.712    0.670    0.696    0.585
SimLex999          0.273    0.249    0.221    0.129
WS353              0.520    0.478    0.504    0.393
WS353R             0.525    0.463    0.470    0.406
WS353S             0.502    0.501    0.526    0.381

Table 3: Comparison of top 10 nearest transformation accuracy between SE/SAD, SE/SA, PE/SAD and PE/SA.

Labeled Pairs      SE/SAD   SE/SA    PE/SAD   PE/SA
5000               0.6068   0.3902   0.4736   0.4744

Table 4: Top K nearest transformation accuracies of SE/SAD using semi-supervised methods with different numbers of labeled pairs.

Labeled Pairs      top 1    top 10   top 100
0                  0.000    0.0014   0.0194
1000               0.0400   0.1378   0.4984
2000               0.1080   0.4526   0.7578
5000               0.1846   0.6068   0.8638

3. Experiments

3.1. Experimental Setup

We used LibriSpeech [36] as the audio collection in our experiments. LibriSpeech is a corpus of read English speech suitable for training and evaluating speech recognition systems. It is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz. In our experiments, the dataset was segmented according to the word boundaries obtained by forced alignment with respect to the reference transcriptions. We used the 960 hours of "clean+others" speech sets from LibriSpeech for model training. 39-dimensional MFCCs were used as the acoustic features.

The phonetic encoder, speaker encoder and decoder in the first stage are all 2-layer GRUs with hidden layer sizes 256, 256 and 512, respectively. The discriminator is a fully-connected feedforward network with 2 hidden layers of size 256. The value of λ in the speaker loss term is set to 0.01. The discriminator and the other parts of this stage are trained alternately, as in WGAN [37].

In the second stage, the two encoders are both 2-hidden-layer fully-connected feedforward networks with hidden size 256. The size of the embedding vectors is 128, the context window size is 5, the number of negative samples is 5, the sampling factor is 0.001, and the threshold for minimum count is 5. Although textual Word2Vec has an unsupervised training procedure, it needs subsampling, which is an important step during training. Subsampling needs the frequencies of words, which we cannot obtain during unsupervised training in the audio domain. In this preliminary work, we compromise by using known word labels to do subsampling, but we believe that, with a proper statistical clustering algorithm to classify and group phonetic vectors, completely unsupervised ASR can be achieved by this framework in the future.

We trained two sets of textual semantic embeddings. The first set (denoted as OHW in the following discussion) was trained on the manual transcriptions of LibriSpeech using one-hot representations as input by a typical skip-gram model. The second set of textual semantic embeddings (denoted as PEW) was also trained on LibriSpeech, but using phonetic information. To generate PEW, we represented each word as a sequence of phonemes, and used a sequence-to-sequence autoencoder to encode the phoneme sequence into an embedding of size 256. Then we took the embeddings of the phoneme sequences as input to the skip-gram model. Because the audio semantic embeddings were also learned from phonetic embeddings, we believe PEW would have a distribution more similar to that of the audio semantic embeddings than OHW.
While each word in text has a unique semantic representation, segments corresponding to the same word can have different semantic representations $v_w$, so the distributions of textual and audio semantic embeddings are too different to be mapped together. In this work, we used known word labels to average the audio semantic embeddings $v_w$ corresponding to the same word, so that each word has a unique semantic representation in both audio and text. We realize that this is another unrealistic setup in an unsupervised scenario, and will develop techniques to address this issue in the future. Finally, in the third stage, we applied MBC-ICP [24] to the 5000 most frequent words and projected the embeddings onto the top 100 principal components. Hence, the affine transformation matrix from audio embeddings to text embeddings is 100 × 100, and vice versa. The mini-batch size was set to 200.

The models in the first two stages were trained with the Adam optimizer [38], while the last stage was trained using stochastic gradient descent with an initial learning rate of 0.01, which decayed every 40 iterations with rate 0.90.

3.2. Evaluation of Word Representations

Several benchmarks for evaluating word representations are available [39, 30, 29, 34, 35, 31, 32, 33]. Those benchmark corpora include pairs of words. Here we want to know whether the audio semantic embeddings can capture semantic meanings like textual semantic embeddings. We calculated cosine similarities of the word pairs in the benchmark datasets with the audio semantic embeddings and with the text semantic embeddings, and evaluated the rank correlation of the cosine similarities between audio and textual embeddings.¹ The results are reported as Spearman's rank correlation coefficients.

We measured correlations between the two textual semantic embeddings, OHW and PEW, mentioned in Section 3.1 and four types of audio embeddings. The four types of audio embeddings are: semantic embeddings trained from phonetic vectors with disentanglement (SE/SAD), semantic embeddings trained from vectors extracted by SA without disentanglement (SE/SA), phonetic embeddings extracted by SA with disentanglement (PE/SAD), and embeddings extracted by SA without disentanglement (PE/SA).

The results are shown in Table 1 and Table 2. Table 1 shows the correlations between OHW and the four audio embedding sets. Similarly, Table 2 presents the correlations between PEW and the four audio embedding sets. In Tables 1 and 2, we found that disentanglement improved the correlation scores in most cases, and that audio semantic embeddings outperformed the phonetic embeddings extracted from SA in most cases. The results verify that the first two stages of our proposed method are both helpful for extracting embeddings that include semantic information. It can also be inferred from the two tables that the correlation performance of PEW is better than that of OHW, as expected, because PEW is learned from the phonetic embeddings of text. Since PEW is more similar to the audio embeddings, it makes the transformation in the next stage easier.

¹ The code is released at https://github.com/grtzsohalf/Towards-Unsupervised-ASR

3.3. Transformation from Audio to Text

The results of MBC-ICP are shown in Table 3 and Table 4. In Table 3, we compare the top 10 nearest accuracies of the four audio embedding sets mentioned above using 5000 labeled pairs. SE/SAD achieved the best result. Once again, this shows that the first two stages of our proposed method are both effective. In Table 4, both the unsupervised and semi-supervised results with SE/SAD are further reported. The numbers of labeled pairs are 0, 1000, 2000 and 5000 respectively. The results include top 1, top 10 and top 100 nearest accuracies. We can observe that although unsupervised MBC-ICP may not generate a perfect matching, semi-supervised learning achieved high transformation accuracies. This shows that there exists a good affine transformation matrix between audio and textual semantic embeddings. However, the affine transformation matrix cannot be easily found by the completely unsupervised approach.

4. Conclusions and Future Work

In this work, we propose a three-stage framework towards unsupervised ASR with unaligned speech and text only. Through the experiments, we showed that semantic audio embeddings can be directly extracted from audio. Although we did not obtain satisfactory results with unsupervised learning, via semi-supervised learning we verified that there is an affine matrix transforming semantic embeddings from audio to text. How to find the affine matrix in an unsupervised setup is still under investigation. Although some oracle settings were used in the experiments, we are conducting experiments under more realistic setups. We believe that, with further improvement of this framework, completely unsupervised ASR could be achieved in the near future.

5. References

[1] M. Cooke, P. Green, L. Josifovski, and A. Vizinho, “Robust automatic speech recognition with missing and unreliable acoustic data,” Speech communication, vol. 34, no. 3, pp. 267–285, 2001.
[2] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding, no. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.
[3] F. Sha and L. K. Saul, “Large margin hidden markov models for automatic speech recognition,” in Advances in neural information processing systems, 2007, pp. 1249–1256.
[4] E. Dunbar, X. N. Cao, J. Benjumea, J. Karadayi, M. Bernard, L. Besacier, X. Anguera, and E. Dupoux, “The zero resource speech challenge 2017,” in ASRU, 2017.
[5] A. Garcia and H. Gish, “Keyword spotting of arbitrary words using minimal speech resources,” in ICASSP, 2006.
[6] A. Jansen and K. Church, “Towards unsupervised training of speaker independent acoustic models,” in INTERSPEECH, 2011.
[7] A. Park and J. Glass, “Unsupervised pattern discovery in speech,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 16, no. 1, pp. 186–197, Jan 2008.
[8] V. Stouten, K. Demuynck, and H. Van hamme, “Discovering phone patterns in spoken utterances by non-negative matrix factorization,” Signal Processing Letters, IEEE, vol. 15, pp. 131–134, 2008.
[9] N. Vanhainen and G. Salvi, “Word discovery with beta process factor analysis,” in INTERSPEECH, 2012.
[10] H. Wang, C.-C. Leung, T. Lee, B. Ma, and H. Li, “An acoustic segment modeling approach to query-by-example spoken term detection,” in ICASSP, 2012.
[11] H. Kamper, K. Livescu, and S. Goldwater, “An embedded segmental K-means model for unsupervised segmentation and clustering of speech,” in ASRU, 2017.
[12] K. Levin, A. Jansen, and B. Van Durme, “Segmental acoustic indexing for zero resource keyword search,” in ICASSP, 2015.
[13] S. Bengio and G. Heigold, “Word embeddings for speech recognition,” in INTERSPEECH, 2014.
[14] G. Chen, C. Parada, and T. N. Sainath, “Query-by-example keyword spotting using long short-term memory networks,” in ICASSP, 2015.
[15] S. Settle, K. Levin, H. Kamper, and K. Livescu, “Query-by-example search with discriminative neural acoustic word embeddings,” in INTERSPEECH, 2017.
[16] A. Jansen, M. Plakal, R. Pandya, D. Ellis, S. Hershey, J. Liu, C. Moore, and R. A. Saurous, “Towards learning semantic audio representations from unlabeled data,” in NIPS Workshop on Machine Learning for Audio Signal Processing (ML4Audio), 2017.
[17] Y.-A. Chung, C.-C. Wu, C.-H. Shen, H.-Y. Lee, and L.-S. Lee, “Audio word2vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder,” arXiv preprint arXiv:1603.00982, 2016.
[18] C.-H. Shen, J. Y. Sung, and H.-Y. Lee, “Language transfer of audio word2vec: Learning audio segment representations without target language data,” in arXiv, 2017.
[19] Z. Meng, Z. Chen, V. Mazalov, J. Li, and Y. Gong, “Unsupervised adaptation with domain separation networks for robust speech recognition,” in ASRU, 2017.
[20] W.-N. Hsu, Y. Zhang, and J. Glass, “Unsupervised learning of disentangled and interpretable representations from sequential data,” in Advances in neural information processing systems, 2017, pp. 1876–1887.
[21] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.
[22] G. Lample, L. Denoyer, and M. Ranzato, “Unsupervised machine translation using monolingual corpora only,” arXiv preprint arXiv:1711.00043, 2017.
[23] A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou, “Word translation without parallel data,” arXiv preprint arXiv:1710.04087, 2017.
[24] Y. Hoshen and L. Wolf, “An iterative closest point method for unsupervised word translation,” arXiv preprint arXiv:1801.06126, 2018.
[25] Y.-H. Wang, C.-T. Chung, and H.-y. Lee, “Gate activation signal analysis for gated recurrent neural networks and its correlation with phoneme boundaries,” arXiv preprint arXiv:1703.07588, 2017.
[26] O. Scharenborg, V. Wan, and M. Ernestus, “Unsupervised speech segmentation: An analysis of the hypothesized phone boundaries,” The Journal of the Acoustical Society of America, vol. 127, no. 2, pp. 1084–1095, 2010.
[27] P. Daras, A. Axenopoulos, and G. Litos, “Investigating the effects of multiple factors towards more accurate 3-d object retrieval,” IEEE Transactions on multimedia, vol. 14, no. 2, pp. 374–388, 2012.
[28] F. Li, D. Stoddart, and C. Hitchens, “Method to automatically register scattered point clouds based on principal pose estimation,” Optical Engineering, vol. 56, no. 4, p. 044107, 2017.
[29] E. Bruni, N. Tram, M. Baroni et al., “Multimodal distributional semantics,” The Journal of Artificial Intelligence Research, vol. 49, pp. 1–47, 2014.
[30] K. Radinsky, E. Agichtein, E. Gabrilovich, and S. Markovitch, “A word at a time: computing word relatedness using temporal semantic analysis,” in Proceedings of the 20th international conference on World wide web. ACM, 2011, pp. 337–346.
[31] H. Rubenstein and J. B. Goodenough, “Contextual correlates of synonymy,” Communications of the ACM, vol. 8, no. 10, pp. 627–633, 1965.
[32] T. Luong, R. Socher, and C. Manning, “Better word representations with recursive neural networks for morphology,” in Proceedings of the Seventeenth Conference on Computational Natural Language Learning, 2013, pp. 104–113.
[33] F. Hill, R. Reichart, and A. Korhonen, “Simlex-999: Evaluating semantic models with (genuine) similarity estimation,” Computational Linguistics, vol. 41, no. 4, pp. 665–695, 2015.
[34] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin, “Placing search in context: The concept revisited,” in Proceedings of the 10th international conference on World Wide Web. ACM, 2001, pp. 406–414.
[35] E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Paşca, and A. Soroa, “A study on similarity and relatedness using distributional and WordNet-based approaches,” in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2009, pp. 19–27.
[36] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5206–5210.
[37] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” in Advances in Neural Information Processing Systems, 2017, pp. 5769–5779.
[38] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[39] S. Jastrzebski, D. Leśniak, and W. M. Czarnecki, “How to evaluate word embeddings? on importance of data efficiency and simple supervised tasks,” arXiv preprint arXiv:1702.02170, 2017.