
Evaluation of Acoustic Word Embeddings


Sahar Ghannay, Yannick Estève, Nathalie Camelin, Paul Deléglise
LIUM - University of Le Mans, France
[email protected]

Abstract

Recently, researchers in speech recognition have started to reconsider using whole words as the basic modeling unit, instead of phonetic units. These systems rely on a function that embeds an arbitrary or fixed dimensional speech segment into a vector in a fixed-dimensional space, named an acoustic word embedding. Thus, speech segments of words that sound similar will be projected into a close area in a continuous space. This paper focuses on the evaluation of acoustic word embeddings. We propose two approaches to evaluate the intrinsic performance of acoustic word embeddings in comparison to orthographic representations, in order to evaluate whether they capture discriminative phonetic information. Since the French language is targeted in our experiments, a particular focus is made on homophone words.

1 Introduction

Recent studies have started to reconsider the use of whole words as the basic modeling unit in speech recognition and query applications, instead of phonetic units. These systems are based on the use of acoustic word embeddings, which are projections of arbitrary or fixed dimensional speech segments into a continuous space, in a manner that preserves acoustic similarity between words. Thus, speech segments of words that sound similar will have similar embeddings. Acoustic word embeddings were successfully used in a query-by-example search system (Kamper et al., 2015; Levin et al., 2013) and in an ASR lattice re-scoring system (Bengio and Heigold, 2014).

The authors in (Bengio and Heigold, 2014) proposed an approach to build acoustic word embeddings from an orthographic representation of the word. This paper focuses on the evaluation of these acoustic word embeddings. We propose two approaches to evaluate the intrinsic performance of acoustic word embeddings in comparison to orthographic representations. In particular, we want to evaluate whether they capture discriminative information about word pronunciation, approximated by the phonetic representation. In our experiments, we focus on the French language, whose particularity is to be rich in homophone words. This aspect is also studied in this work.

2 Acoustic word embeddings

2.1 Building acoustic word embeddings

The approach we used to build acoustic word embeddings is inspired by the one proposed in (Bengio and Heigold, 2014). The deep neural architecture depicted in figure 1 is used to train the acoustic word embeddings. It relies on a convolutional neural network (CNN) classifier over words and on a deep neural network (DNN) trained using a triplet ranking loss (Bengio and Heigold, 2014; Wang et al., 2014; Weston et al., 2011). The two architectures are trained on different inputs, the speech signal and the orthographic representation of the word, which are detailed as follows.

The convolutional neural network classifier is trained independently to predict a word given a speech signal as input. It is composed of convolution and pooling layers, followed by fully connected layers which feed the final softmax layer. The embedding layer is the fully connected layer just below the softmax one, named s in figure 1. It contains a compact representation of the acoustic signal and tends to preserve acoustic similarity between words, such that words are close in this space if they sound alike.
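To make the setup concrete, the following sketch shows a CNN word classifier of this kind in PyTorch. The framework, layer counts, kernel widths, and dimensions are illustrative assumptions on our part, not the configuration used in the paper; only the overall shape (convolution and pooling layers, fully connected layers, an embedding layer s just below the softmax) follows the description above.

```python
import torch
import torch.nn as nn

class CNNWordClassifier(nn.Module):
    """Hypothetical CNN word classifier: filterbank frames in, word scores out."""
    def __init__(self, n_features=23, n_frames=100, embed_dim=256, n_words=45000):
        super().__init__()
        # Convolution and max pooling layers over the time axis.
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # Fully connected layers; the last one before the output is the
        # embedding layer that produces the signal embedding s.
        self.fc = nn.Sequential(
            nn.Linear(128 * (n_frames // 4), 1024), nn.ReLU(),
            nn.Linear(1024, embed_dim), nn.ReLU(),
        )
        self.out = nn.Linear(embed_dim, n_words)  # fed to a softmax

    def forward(self, x):
        # x: (batch, n_features, n_frames) log-filterbank segment
        h = self.conv(x).flatten(1)
        s = self.fc(h)            # compact acoustic representation s
        return self.out(s), s     # word logits (softmax inputs) and s
```

Training this with a standard cross-entropy loss over the word classes corresponds to the softmax classifier described above; the embedding s is simply read off the penultimate layer.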


[Figure 1: Deep architecture used to train acoustic word embeddings. The CNN branch (convolution and max pooling layers, fully connected layers, softmax output) computes the signal embedding s from the speech signal; the DNN branch maps the letter n-grams of the matching word and of a wrong word, through a lookup table and shared fully connected layers, to the embeddings w+ and w-, trained with the triplet ranking loss.]

The feedforward neural network (DNN) is used to build an acoustic word embedding w+ for a word not observed in the audio training corpus, based on its orthographic representation. It is trained using the triplet ranking loss in order to project orthographic word representations into the same space as the acoustic embeddings s.

The orthographic word representation consists of a bag of letter n-grams (n <= 3), with additional special symbols [ and ] to mark the start and the end of a word. The size of this bag-of-n-grams vector is reduced using an auto-encoder.
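A minimal sketch of this orthographic representation, where the n-gram inventory ngram_index is an assumed, precomputed mapping from n-grams to vector components, and the auto-encoder that reduces the vector is omitted:

```python
from collections import Counter

def letter_ngrams(word, max_n=3):
    # Bag of letter n-grams (n <= 3), with [ and ] marking word boundaries.
    marked = "[" + word + "]"
    return Counter(marked[i:i + n]
                   for n in range(1, max_n + 1)
                   for i in range(len(marked) - n + 1))

def bag_of_ngrams_vector(word, ngram_index):
    # Sparse n-gram counts mapped onto a fixed-size vector; n-grams
    # outside the inventory are ignored.
    vec = [0.0] * len(ngram_index)
    for ngram, count in letter_ngrams(word).items():
        if ngram in ngram_index:
            vec[ngram_index[ngram]] = float(count)
    return vec
```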

During the training process, this model takes as inputs acoustic embeddings s selected randomly from the training set and, for each signal embedding, the orthographic representation of the matching word, o+, and the orthographic representation of a randomly selected different word, o-. These two orthographic representations feed shared parameters of the DNN.

The resulting DNN model can then be used to build an acoustic word embedding w+ for any word, as long as an orthographic representation can be extracted from it. This acoustic word embedding can be perceived as a canonical acoustic representation of a word, since different pronunciations imply different signal embeddings s.
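A sketch of the triplet ranking loss, assuming cosine similarity as the scoring function and a fixed margin (both choices are our assumptions; the cited papers describe several variants). The signal embedding s acts as the anchor, and the loss pushes the matching orthographic embedding w+ closer to s than the wrong-word embedding w-:

```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(s, w_plus, w_minus, margin=0.5):
    # s: acoustic signal embeddings (batch, dim), used as anchors
    # w_plus: embeddings of the matching words' orthography
    # w_minus: embeddings of randomly selected wrong words
    sim_pos = F.cosine_similarity(s, w_plus)   # should be high
    sim_neg = F.cosine_similarity(s, w_minus)  # should be low
    # Hinge: penalize whenever the wrong word comes within `margin`
    # of the correct one.
    return F.relu(margin - sim_pos + sim_neg).mean()
```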

2.2 Evaluation

In the literature (Kamper et al., 2015; Levin et al., 2013; Carlin et al., 2011), a word discrimination task was used to evaluate the acoustic embeddings s. Given a pair of acoustic segments, this task consists in deciding whether the segments correspond to the same word or not. It can be performed in many ways, for example through the use of dynamic time warping (DTW) to quantify the similarity between two segments when using frame-level embeddings (Thiolliere et al., 2015), or by using the Euclidean distance or the cosine similarity between embeddings representing the segments.

In (Kamper et al., 2015), the evaluation was conducted on two collections of words (train and test) coming from the Switchboard English corpus. After training the model on the training corpus, the cosine similarity is computed between the embeddings of each pair of words in the test set. These pairs are classified as similar or different by applying a threshold on their distance, and a precision-recall curve is obtained by varying the threshold.

In this study, we propose two approaches to evaluate acoustic word embeddings w+. We build different evaluation sets in order to assess the performance of the acoustic word embeddings w+ on orthographic similarity, phonetic similarity, and homophone detection tasks. Recall that the acoustic word embedding w+ is the projection of an orthographic word representation o+ into the space of acoustic signal embeddings s. In our evaluation, we want to measure the loss of orthographic information carried by w+ and the potential gain of acoustic information due to this projection, in comparison to the information carried by o+.

The evaluation sets are built as follows: given a list L of n frequent words (candidate words) in a vocabulary composed of m words, a list of n x m word pairs is created. Then, two alignments are performed for each word pair, based on its orthographic (letter) and phonetic (phoneme) representations, using the sclite tool (http://www.icsi.berkeley.edu/Speech/docs/sctk-1.2/sclite.htm). From these alignments, two edition distances are computed, one from the orthographic alignment and one from the phonetic alignment. The edition distance is computed as follows:

    SER = 100 x (#In + #Sub + #Del) / (#symbols in the reference word)    (1)

where SER stands for Symbol Error Rate, symbols correspond to the letters for orthographic representations and to the phonemes for phonetic ones, and In, Sub, and Del correspond respectively to insertions, substitutions, and deletions.
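As an illustration, the same distance can be computed with a plain Levenshtein alignment in place of sclite (a simplification: sclite has its own alignment conventions), counting insertions, substitutions, and deletions with equal weight:

```python
def symbol_error_rate(ref, hyp):
    # Edit distance between two symbol sequences (letters or phonemes),
    # normalized by the reference length, as in equation (1).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                         # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                         # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# Example: symbol_error_rate("très", "près") == 25.0
# (one substitution over a four-letter reference)
```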

Next, we compute two similarity scores, the orthographic and phonetic similarity scores sim_score attributed to each pair of words, defined as:

    sim_score = 10 - min(10, SER/10)    (2)

where min() caps the edition distance so that the score lies between 0 and 10. Then, for each candidate word in the list L, we extract its 10 orthographically and 10 phonetically nearest words. This results in two lists for the orthographic and phonetic similarity tasks: for each candidate word in the list L, the Orthographic list contains its ten closest words in terms of orthographic similarity scores, and the Phonetic list contains its ten closest words in terms of phonetic similarity scores. Finally, the Homophones list, used for the homophone detection task, contains the homophone words (i.e. words sharing the same phonetic representation). Table 1 shows an example of the content of the three lists.

    List          Examples
    Orthographic  très près 7.5; très ors 5
    Phonetic      très frais 6.67; très traînent 6.67
    Homophone     très traie; très traient

    Table 1: Example of the content of the three lists.

For the orthographic and phonetic similarity tasks, the evaluation of the acoustic embeddings is performed by ranking the pairs according to their cosine similarities and measuring the Spearman's rank correlation coefficient (Spearman's ρ) against the ranking induced by the similarity scores. This approach is used in (Gao et al., 2014; Ji et al., 2015; Levy et al., 2015; Ghannay et al., 2016) to evaluate linguistic word embeddings on similarity tasks, in which the similarity scores are attributed by human annotators.

For the homophone detection task, the evaluation is performed in terms of precision. For each word w in the Homophones list, let LH(w) be the list of k homophones of the word w, LH_neighbour(w) be the list of its k nearest neighbours extracted based on the cosine similarity, and LH_found(w) be the intersection of LH(w) and LH_neighbour(w), which corresponds to the list of homophones of the word w that were found. The precision Pw of the word w is defined as:

    Pw = |LH_found(w)| / |LH(w)|    (3)

where |.| refers to the size of a list. We define the overall homophone detection precision on the Homophones list as the average of the Pw:

    P = (1/N) Σ_{i=1..N} Pwi    (4)

where N is the number of candidate words which have a non-empty Homophones list.
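A sketch of how equations (2), (3), and (4) can be computed, reusing the symbol_error_rate function sketched above. The embeddings argument is an assumed dict from words to vectors (either o+ or w+), and candidates is assumed to be already restricted to words with a non-empty Homophones list:

```python
import numpy as np

def sim_score(ref, hyp):
    # Equation (2): 10 for identical sequences, 0 at or beyond 100% SER.
    return 10.0 - min(10.0, symbol_error_rate(ref, hyp) / 10.0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def homophone_precision(candidates, homophones, embeddings):
    # homophones: dict word -> set of its homophones L_H(w)
    precisions = []
    for w in candidates:
        k = len(homophones[w])
        # k nearest neighbours of w by cosine similarity: L_H_neighbour(w)
        neighbours = sorted((v for v in embeddings if v != w),
                            key=lambda v: cosine(embeddings[w], embeddings[v]),
                            reverse=True)[:k]
        found = homophones[w].intersection(neighbours)  # L_H_found(w)
        precisions.append(len(found) / k)               # P_w, equation (3)
    return sum(precisions) / len(precisions)            # P, equation (4)

# Consistent with Table 1: sim_score("très", "près") == 7.5
```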
3 Experiments on acoustic word embeddings

3.1 Experimental setup

The training set for the CNN consists of 488 hours of French Broadcast News with manual transcriptions. This dataset is composed of data coming from the ESTER1 (Galliano et al., 2005), ESTER2 (Galliano et al., 2009) and EPAC (Estève et al., 2010) corpora. It contains 52k unique words that have been seen at least twice each in the corpus, corresponding to a total of 5.75 million occurrences.

In the French language, many words have the same pronunciation without sharing the same spelling, and they can have different meanings; e.g. the sound [so] corresponds to four homophones: sot (fool), saut (jump), sceau (seal) and seau (bucket), and to twice as many when taking into account their plural forms, which have the same pronunciation: sots, sauts, sceaux, and seaux. When a CNN is trained to predict a word given an acoustic sequence, these frequent homophones introduce a bias into the evaluation of the recognition error. To avoid this, we merged all the homophones existing among the 52k unique words of the training corpus. As a result, we obtained a new reduced dictionary containing 45k words and classes of homophones.

Acoustic features provided to the CNN are log-filterbanks, computed every 10 ms over a 25 ms window, yielding a 23-dimension vector for each frame. A forced alignment between the manual transcriptions and the speech signal was performed on the training set in order to detect word boundaries. The statistics computed from this alignment reveal that 99% of the words are shorter than 1 second. Hence, we decided to represent each word by 100 frames, thus by a vector of 2300 dimensions: shorter words are padded with zeros equally on both ends, while longer words are cut equally on both ends.
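A sketch of this fixed-size input construction, following the equal-ends padding and cutting rule just described (numpy is our choice of tooling):

```python
import numpy as np

def fix_length(frames, target=100):
    # frames: (n_frames, 23) array of log-filterbank vectors for one word
    n = frames.shape[0]
    if n < target:
        # Pad with zeros, split equally between both ends.
        left = (target - n) // 2
        frames = np.pad(frames, ((left, target - n - left), (0, 0)))
    elif n > target:
        # Cut equally on both ends.
        left = (n - target) // 2
        frames = frames[left:left + target]
    return frames.reshape(-1)  # 100 frames x 23 features = 2300 dimensions
```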

The CNN and DNN architectures are trained on 90% of the training set, and the remaining 10% is used for validation.

3.2 Acoustic word embeddings evaluation

The embeddings we evaluate are built from two different vocabularies: the one used to train the neural network models (CNN and DNN), composed of the 52k words present in the manual transcriptions of the 488 hours of audio, and another one composed of 160k words. The words present in the 52k vocabulary are nearly all present in the 160k vocabulary.

The evaluation sets described in section 2.2 are generated from these two vocabularies. In the 52k vocabulary, all the acoustic word embeddings w+ are related to words which have been observed during the training of the CNN; this means that at least two acoustic signal embeddings have been computed from the audio for each of these words. In the 160k vocabulary, about 110k acoustic word embeddings were computed for words never observed in the audio data.

3.2.1 Quantitative Evaluation

The quantitative evaluation of the acoustic word embeddings w+ is performed on the orthographic similarity, phonetic similarity, and homophone detection tasks. Results are summarized in table 2.

                    52K Vocab.        160K Vocab.
    Task            o+      w+        o+      w+
    Orthographic    54.28   49.97     56.95   51.06
    Phonetic        40.40   43.55     41.41   46.88
    Homophone       64.65   72.28     52.87   59.33

    Table 2: Evaluation results of the similarity (ρ x 100) and homophone detection (precision) tasks.

They show that the acoustic word embeddings w+ are more relevant for the phonetic similarity task, while the o+ embeddings are, unsurprisingly, the best ones on the orthographic similarity task. These results show that the projection of the orthographic embeddings o+ into the acoustic embedding space s changes their properties: they capture more information about word pronunciation, while losing information about spelling. So, in addition to making possible a similarity measure between the acoustic signal (represented by s) and a word (represented by w+), acoustic word embeddings are better than orthographic ones at measuring the phonetic proximity between two words.

For the homophone detection task, the Homophones list is computed from the 160k vocabulary, which results in 53,869 homophone pairs in total. The 52k vocabulary contains 13,561 homophone pairs, which are included in the pairs present in the 160k vocabulary. As we can see, the acoustic embeddings w+ outperform the orthographic ones on this task on both data sets. This confirms that acoustic word embeddings capture additional information about word pronunciation beyond what is carried by orthographic word embeddings. For this task we cannot compare the results between the two vocabularies, since the precision measure depends on the number of events. For the Spearman's correlation, a comparison is roughly possible, and the results show that the way w+ is computed generalizes effectively to words not observed in the audio training data.

3.2.2 Qualitative Evaluation

To give more insight into the difference in quality between the orthographic word embeddings o+ and the acoustic ones w+, we propose an empirical comparison by showing the nearest neighbours of a given set of words. Table 3 shows examples of such neighbours. It can be seen that, as expected, the neighbours of any given word share the same spelling with it when they are induced by the orthographic embeddings, and arguably sound like it when they are induced by the acoustic word embeddings.
    Candidate word    o+                        w+
    grecs             i-grec, rec, marec        grec, grecque, grecques
    ail               aile, trail, fail         aille, ailles, aile
    arts              parts, charts, encarts    arte, art, ars
    blocs             bloch, blocher, bloche    bloc, bloque, bloquent

    Table 3: Candidate words and their nearest neighbours.
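Such neighbour lists can be extracted by ranking the whole vocabulary by cosine similarity in each embedding space. A sketch, where orthographic_embeddings and acoustic_embeddings are assumed dicts from words to their o+ and w+ vectors:

```python
import numpy as np

def nearest_neighbours(word, embeddings, k=3):
    # Rank every other word in the vocabulary by cosine similarity to `word`.
    w = embeddings[word]
    def cos(u):
        return float(np.dot(w, u) / (np.linalg.norm(w) * np.linalg.norm(u)))
    scored = [(v, cos(u)) for v, u in embeddings.items() if v != word]
    return [v for v, _ in sorted(scored, key=lambda p: p[1], reverse=True)[:k]]

# e.g. compare nearest_neighbours("blocs", orthographic_embeddings)
#      with    nearest_neighbours("blocs", acoustic_embeddings)
```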

4 Conclusion

In this paper, we have investigated the intrinsic evaluation of acoustic word embeddings. The latter offer the opportunity of an a priori acoustic representation of words that can be compared, in terms of similarity, to an embedded representation of the audio signal. We have proposed two approaches to evaluate the performance of these acoustic word embeddings and compare them to their orthographic counterparts: measuring orthographic and phonetic performance by ranking pairs and computing the Spearman's rank correlation coefficient (Spearman's ρ), and measuring the precision in a homophone detection task.

Experiments show that the acoustic word embeddings are better than orthographic ones at measuring the phonetic proximity between two words. Moreover, they also perform better on the homophone detection task. This confirms that acoustic word embeddings capture additional information about word pronunciation.

Acknowledgments

This work was partially funded by the European Commission through the EUMSSI project, under contract number 611057, in the framework of the FP7-ICT-2013-10 call, by the French National Research Agency (ANR) through the VERA project, under contract number ANR-12-BS02-006-01, and by the Région Pays de la Loire.

References

Samy Bengio and Georg Heigold. 2014. Word embeddings for speech recognition. In INTERSPEECH, pages 1053-1057.

Michael A. Carlin, Samuel Thomas, Aren Jansen, and Hynek Hermansky. 2011. Rapid evaluation of speech representations for spoken term discovery. In INTERSPEECH, pages 821-824.

Yannick Estève, Thierry Bazillon, Jean-Yves Antoine, Frédéric Béchet, and Jérôme Farinas. 2010. The EPAC corpus: Manual and automatic annotations of conversational speech in French broadcast news. In LREC, Malta, 17-23 May 2010.

Sylvain Galliano, Edouard Geoffrois, Djamel Mostefa, Khalid Choukri, Jean-François Bonastre, and Guillaume Gravier. 2005. The ESTER phase II evaluation campaign for the rich transcription of French Broadcast News. In Interspeech, pages 1149-1152.

Sylvain Galliano, Guillaume Gravier, and Laura Chaubard. 2009. The ESTER 2 evaluation campaign for the rich transcription of French radio broadcasts. In Interspeech, volume 9, pages 2583-2586.

Bin Gao, Jiang Bian, and Tie-Yan Liu. 2014. WordRep: A benchmark for research on learning word representations. CoRR, abs/1407.1640.

Sahar Ghannay, Benoit Favre, Yannick Estève, and Nathalie Camelin. 2016. Word embedding evaluation and combination. In 10th edition of the Language Resources and Evaluation Conference (LREC 2016), Portorož (Slovenia), 23-28 May.

Shihao Ji, Hyokun Yun, Pinar Yanardag, Shin Matsushima, and S. V. N. Vishwanathan. 2015. WordRank: Learning word embeddings via robust ranking. CoRR, abs/1506.02761.

Herman Kamper, Weiran Wang, and Karen Livescu. 2015. Deep convolutional acoustic word embeddings using word-pair side information. arXiv preprint arXiv:1510.01032.

Keith Levin, Katharine Henry, Aren Jansen, and Karen Livescu. 2013. Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 410-415. IEEE.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211-225.

Roland Thiolliere, Ewan Dunbar, Gabriel Synnaeve, Maarten Versteegh, and Emmanuel Dupoux. 2015. A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling. In Proc. Interspeech.

Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. 2014. Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1386-1393.

Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. WSABIE: Scaling up to large vocabulary image annotation. In IJCAI, volume 11, pages 2764-2770.
