INTERSPEECH 2011

Data Sampling and Dimensionality Reduction Approaches for Reranking ASR Outputs Using Discriminative Language Models

Erinç Dikici^1, Murat Semerci^2, Murat Saraçlar^1, Ethem Alpaydın^2

^1 Department of Electrical and Electronics Engineering, Boğaziçi University, Istanbul, Turkey
^2 Department of Computer Engineering, Boğaziçi University, Istanbul, Turkey
{erinc.dikici, murat.semerci, murat.saraclar, alpaydin}@boun.edu.tr

Abstract

This paper investigates various approaches to data sampling and dimensionality reduction for discriminative language models (DLMs). Being a feature-based language modeling approach, the aim of a DLM is to rerank the ASR output with discriminatively trained feature parameters. Using a Turkish morphology-based feature set, we examine the use of online Principal Component Analysis (PCA) as a dimensionality reduction method. We exploit the ranking perceptron and the ranking SVM as two alternative discriminative modeling techniques, and apply data sampling to improve their efficiency. We obtain a reduction in word error rate (WER) of 0.4%, significant at p < 0.001, over the baseline perceptron result.

Index Terms: discriminative language modeling, dimensionality reduction, data sampling, ranking perceptron, ranking SVM

DOI: 10.21437/Interspeech.2011-257

1. Introduction

An ASR system outputs multiple transcription hypotheses for a given acoustic speech signal. These are usually ordered with respect to their recognition scores, but the hypothesis with the highest score is not necessarily the most accurate transcription. Classical DLM approaches view the problem as discriminating better transcriptions from worse ones in a binary classification setting, while trying to optimize an objective function that is directly related to the word error rate (WER). A better alternative is to cast the DLM task as a reranking problem, so that by reordering these outputs, the hypothesis with the fewest errors is placed at the top.

The perceptron algorithm and its variants, approaching the problem from both the classification and the reranking points of view, are the most widely used techniques in discriminative training of language models [1, 2].

Another generalized linear classifier which has proven useful in binary classification tasks is the Support Vector Machine (SVM). The SVM is optimal in the sense of finding a maximal margin that separates the two classes. However, SVM-based techniques can be computationally demanding when the number of training examples is large and the feature dimensionality is high. Versions of SVMs adapted for ranking and reranking have also been proposed [3].

Zhou et al. [4] compare the accuracies and training times of four discriminative algorithms, namely the perceptron, boosting, ranking SVMs and minimum sample risk. Because of the long training durations, the authors had to reduce the length of the N-best lists from 100 to 20 for their ranking SVM implementation. Despite this limitation, they show that the ranking SVM is able to obtain a more generalizable model.

Oba et al. [5] investigate different hypothesis selection schemes from an N-best list, and argue that the selection of hypotheses based on WER should be preferred over selection based on the recognition score. Using the algorithm, they achieve an accurate and compact corrective model.

Crammer and Singer [6] propose a perceptron ranking (PRanking) algorithm that divides the space into regions bounded by threshold levels. The study by Shen and Joshi [2] includes a review of reranking and two perceptron-based algorithms to decrease data complexity and training time.

In this study we propose several data sampling and feature selection methods in order to increase the accuracy, robustness and computational efficiency of the discriminative model. We attempt dimensionality reduction to seek an appropriate feature subspace projection. We compare the behavior and performance of the averaged perceptron, ranking perceptron and ranking SVM algorithms. To make algorithmically efficient use of the ranking procedure, we put forward different data sampling and grouping approaches.

This paper is organized as follows: In Section 2, we give brief information on the methods used in the study. The experimental setup and results are given in Section 3. Section 4 concludes the paper with discussions and remarks on future work.

2. Methods

2.1. Discriminative language modeling using linear models

The DLM, which is a complementary method to baseline generative language modeling, is used to estimate language model parameters in a discriminative setting.

The training inputs to a DLM are examples extracted from N-best lists, i.e., the score-wise ordered outputs of an ASR system using a generative language model. In the linear model, each example of the training set is represented by a feature vector of the acoustic input (x) and the candidate hypothesis (y), denoted by Φ(x, y). We would like to learn w, the vector of weights associated with these features. In testing, the highest scoring hypothesis, i.e., the one that maximizes the inner product ⟨w, Φ(x, y)⟩, is chosen. We use three discriminative training algorithms, which are explained in the following sections.

2.1.1. Averaged perceptron

We use a variant of the perceptron algorithm as outlined in [1]. The idea of the perceptron algorithm is to penalize features associated with the current best hypothesis and to reward features associated with the gold-standard hypothesis. We define the gold standard of the i-th utterance (yi) as the hypothesis having the lowest WER, i.e., the oracle, and the current best as

zi = argmax_{z ∈ Hi} ⟨w, Φ(xi, z)⟩

Copyright © 2011 ISCA, 28-31 August 2011, Florence, Italy

where xi is the input of the i-th utterance and Hi is the list of all possible hypotheses for this utterance. For each training example (xi, yi), if zi differs from yi, the j-th feature weight is increased by Φj(xi, yi) and decreased by Φj(xi, zi):

w = w + Φ(xi, yi) − Φ(xi, zi)

The algorithm requires several epochs (passes) over the training set, and it has been shown that three epochs are adequate for practical purposes [1]. Once training is completed, the weights are averaged over the number of utterances (I) and epochs (T) to increase model robustness:

w_avg = (1/IT) Σ_{i,t} w_i^t

2.1.2. Ranking Perceptron

The canonical perceptron and SVM algorithms are designed for binary classification. However, it is more natural to regard the DLM task as reranking of the hypotheses generated by the ASR system, since the WER provides a target ranking for each possible transcription. This problem is similar to ordinal regression in that all examples of the training set Φ(x, y) are given a rank (r) instead of a class label. However, unlike the ordinal regression case, in reranking the ranks are comparable only between examples of the same utterance (N-best list). The uneven margin perceptron [2] separates examples of the same utterance with respect to their pairwise difference. If Φ(xa, ya) and Φ(xb, yb) are in the same N-best list and have ranks ra and rb, the uneven margins are determined by the margin function g(ra, rb), which depends on the ranks of the pair:

g(ra, rb) = 1/ra − 1/rb

The pairwise distance between the examples must be greater than the margin function times a margin multiplier τ, and the margin-rank relations must be preserved:

ra > rb ⟺ ⟨w, Φ(xa, ya) − Φ(xb, yb)⟩ ≥ τ g(ra, rb)
ra > rb > rc ⟺ g(ra, rc) > g(ra, rb) > g(rb, rc)

In this study, we apply the weight updates by introducing a learning rate, η, which is decreased by multiplying with a decay rate γ < 1 at the end of each epoch:

w = w + η g(ra, rb) (Φ(xa, ya) − Φ(xb, yb))

The weights are averaged as in Section 2.1.1.

2.1.3. Ranking SVM

Similar to the ranking perceptron algorithm, the goal here is to estimate a weight vector w such that, for any two hypotheses a and b belonging to the same N-best list Hi,

ra > rb ⟺ ⟨w, Φ(xa, ya)⟩ > ⟨w, Φ(xb, yb)⟩

The ranking SVM algorithm is a modification of the classical SVM setup to handle the reranking problem, defined as

min_{w, ξab ≥ 0}  (1/2) ⟨w, w⟩ + (C/|P|) Σ_{(a,b) ∈ P} ξab
s.t.  ∀(a, b) ∈ P:  ⟨w, Φ(xa, ya)⟩ − ⟨w, Φ(xb, yb)⟩ ≥ 1 − ξab        (1)

Here, C is the trade-off value and P is the set of (a, b) pairs for which ra > rb. The constraint in Equation 1 implies that the ranking optimization can also be viewed as an SVM classification problem on pairwise difference vectors. In that sense, the algorithm tries to find a large-margin linear function which minimizes the number of pairs of training samples that are swapped with respect to the desired ranking [3].

2.2. Data Selection

Since the ranking SVM formulation introduces constraints for pairwise examples having different ranks, the problem becomes more complex as the sample size and the number of unique ranks increase. In order to relax some of these constraints and to decrease the CPU times, we can sample the input data. The most straightforward sampling scheme would be to decrease the size of the N-best list, as in [4]. Sampling randomly from the list also provides a viable reference.

In this study we propose three different data selection approaches. All approaches start by sorting the N-best lists in ascending number of word errors. If two hypotheses have the same number of errors, the one having the higher recognition score gets the lower order (index). In the first approach, we select n = {2, 3, 5, 9} instances at uniform intervals from this ordered list. We call this approach uniform sampling, and denote it by US-n. For instance, in US-5 with a 50-best list, the hypotheses with indices 1, 13, 25, 37 and 50 are selected (the best and the worst hypotheses are always in the shortlist).

In the second approach, we group the hypotheses having the same unique WER and select the one having the highest score as the representative example of that group. We name this approach rank grouping (RG). For the first and second approaches, the rank of an example is the same as its WER.

Finally, in the third approach, rank clustering (RC), we generate artificial rank clusters. We try RC-3×3 and RC-3×10, where we select 3 and 10 examples, respectively, from the top, middle and bottom of the ordered list.

Each uniform sampling scheme with increasing n also contains the instances of the previous ones. In the US scheme, we aim to decrease the number of instances. In RG, we keep one representative instance from each available rank. In RC, we decrease both the number of instances and the number of ranks.

A simplified example of these sampling schemes is presented in Table 1. The first column denotes the hypotheses ordered with respect to their number of word errors (WE), shown in the second column. In the columns that follow, the rank values associated with these hypotheses are given.

Table 1: An example of sampling schemes

Order  WE  US-2  US-3  US-5  RG-1  RC-3×2
  1     0    0     0     0     0     1
  2     1    -     -     -     1     1
  3     2    -     -     2     2     -
  4     2    -     -     -     -     -
  5     2    -     2     2     -     2
  6     3    -     -     -     3     2
  7     3    -     -     3     -     -
  8     4    -     -     -     4     3
  9     4    4     4     4     -     3

2.3. Dimensionality reduction via Online PCA

Reducing the number of features eases the curse of dimensionality and drastically decreases the required resources. It also provides a better generalization of the data by eliminating the effects of noise. In our problem, where the feature vector is very high dimensional, applying a dimensionality reduction technique can provide a better characterization for the linear classification of ASR output hypotheses.

Principal Component Analysis (PCA) is a linear transformation method such that the data is mapped into a space where the mapped data is uncorrelated and the maximal amount of variance is preserved using as small a number of coordinates as possible. It is an offline algorithm where the dataset

is processed in a single batch. A number of online PCA algorithms have also been proposed in the literature [7]. The online PCA network consists of input and output nodes, the weights between the input and output layers, and the lateral weights between the output nodes. The network has d input units and m output units, where m ≤ d. The weight vector for output node k is vk = (v1k, v2k, ..., vdk)^T. The lateral weights ulk are organized such that there is a connection from output unit l to output unit k if l < k. Let φ be a d-dimensional input vector with zero mean. Then the output value (ok) can be evaluated as

ok = vk · φ + Σ_{l<k} ulk ol

3.2. Perceptron baseline

We train the language model using the averaged perceptron algorithm as described in Section 2.1.1. The weight α0 associated with the zero-th feature Φ0(x, y) is fixed during the weight updates. For the optimal choice of α0 and the number of epochs, the error rate on the held-out set is 22.14%, implying a reduction of 0.8% over the generative baseline. We observe that in the final model only 20.1K of the original 45.9K features are kept (have nonzero weights). The same model yields 21.84% error on the test data.

3.3. Dimensionality reduction

In our experiments, we use a modified version of the online PCA algorithm. First, we apply a count-based threshold on the data to eliminate uncommon features. Thus, any possible noise due to these features is ignored, and evaluation time and memory storage are saved. Then we apply the averaged perceptron on this transformed feature set.

With one epoch of averaged perceptron training, the held-out word error rates obtained by changing the threshold (θ ∈ {50, 100}) and setting the output dimensions (m ∈ {1000, 5000}) are all 22.93%, a result comparable to the generative baseline. We also merge online PCA output features

3.5. Ranking SVM

We use the SVMrank tool^4, which provides a fast implementation of the ranking SVM algorithm. Table 3 shows the WERs obtained on the held-out set with varying trade-off (C) values. It suggests that the ranking SVM gives results comparable to the averaged perceptron algorithm.

Table 3: SVMrank held-out WER (%)

C=1    C=100  C=1000  C=10000  C=20000
22.34  22.26  22.13   22.14    22.11

^1 http://www2.research.att.com/~fsmtools/{fsm,dcd}/
^2 http://www.speech.sri.com/projects/srilm/
^3 http://www.cis.hut.fi/projects/morpho/
^4 http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html
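The averaged perceptron update of Section 2.1.1, used by the baseline above, can be sketched as follows. This is a minimal illustrative implementation with a hypothetical toy feature space, not the paper's experimental code:

```python
# Minimal sketch of the averaged perceptron for N-best reranking
# (Section 2.1.1). Function names and toy data are illustrative.

def dot(w, phi):
    return sum(wj * pj for wj, pj in zip(w, phi))

def averaged_perceptron(nbest_lists, dim, epochs=3):
    """nbest_lists: one entry per utterance, each a list of
    (phi, wer) pairs with phi = Phi(x, y) as a plain list."""
    w = [0.0] * dim
    w_sum = [0.0] * dim
    n = 0  # counts utterances * epochs, i.e., I*T
    for _ in range(epochs):  # three epochs suffice in practice [1]
        for hyps in nbest_lists:
            gold = min(hyps, key=lambda h: h[1])[0]          # oracle y_i
            best = max(hyps, key=lambda h: dot(w, h[0]))[0]  # current best z_i
            if best != gold:
                # w = w + Phi(x_i, y_i) - Phi(x_i, z_i)
                w = [wj + g - z for wj, g, z in zip(w, gold, best)]
            w_sum = [s + wj for s, wj in zip(w_sum, w)]
            n += 1
    return [s / n for s in w_sum]  # w_avg = (1/IT) sum_{i,t} w_i^t

# Toy example: one utterance with two hypotheses in a 2-d feature space;
# the erroneous hypothesis (2 word errors) initially ties the oracle.
data = [[([0.0, 1.0], 2), ([1.0, 0.0], 0)]]
w_avg = averaged_perceptron(data, dim=2)
print(w_avg)  # the oracle now scores highest under w_avg
```

Score ties are broken by list order here; the experimental setup additionally fixes the weight α0 of the zero-th (baseline score) feature, which this sketch omits.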

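Before turning to the data selection results, the three sampling schemes of Section 2.2 can be sketched on a sorted N-best list. Helper names and the toy 9-best list below are ours, chosen to mirror Table 1; the placement of the middle cluster in RC is one possible convention:

```python
# Sketch of the data selection schemes of Section 2.2, applied to an
# N-best list already sorted by ascending word errors (ties broken by
# descending recognition score). Illustrative, not the paper's code.

def uniform_sampling(nbest, n):
    """US-n: n hypotheses at uniform intervals; the best and the
    worst entries are always included."""
    last = len(nbest) - 1
    idx = sorted({round(i * last / (n - 1)) for i in range(n)})
    return [nbest[i] for i in idx]

def rank_grouping(nbest):
    """RG: one representative per unique WER; with the sort order
    above, the first member of each group has the highest score."""
    reps, seen = [], set()
    for hyp, wer in nbest:
        if wer not in seen:
            seen.add(wer)
            reps.append((hyp, wer))
    return reps

def rank_clustering(nbest, k):
    """RC-3xk: k examples each from the top, middle and bottom of the
    list, assigned artificial ranks 1, 2, 3."""
    mid = len(nbest) // 2  # one possible middle-cluster position
    clusters = [nbest[:k], nbest[mid:mid + k], nbest[-k:]]
    return [(hyp, rank)
            for rank, cluster in enumerate(clusters, start=1)
            for hyp, _ in cluster]

# Toy 9-best list mirroring Table 1: (hypothesis index, word errors).
nbest = list(zip(range(1, 10), [0, 1, 2, 2, 2, 3, 3, 4, 4]))
print([h for h, _ in uniform_sampling(nbest, 5)])  # [1, 3, 5, 7, 9]
print([h for h, _ in rank_grouping(nbest)])        # [1, 2, 3, 6, 8]
```

On a 50-best list, US-5 with this interval rule selects indices close to the 1, 13, 25, 37, 50 quoted in Section 2.2; the exact indices depend on the rounding convention.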
3.6. Data selection

In Table 4, we present the held-out WERs of the three data sampling approaches for the three algorithms, with an optimal parameter selection that yields the lowest error rates on the same set. Results of the 50-best setup are also given for comparison.

Table 4: Sampling schemes held-out WER (%)

Method    Avg. Per.  SVMrank  PerRank
All (50)    22.14     22.11    21.70
US-2        22.10     22.66    22.07
US-3        22.14     22.21    21.98
US-5        22.22     21.86    21.87
US-9        22.18     21.89    21.79
RG-1        22.26     22.11    21.99
RC-3×3      22.18     22.12    21.82
RC-3×10     22.28     21.91    21.73

The first observation is that for the averaged perceptron, using only two examples (the best and the worst in terms of WER) is as good as using all the examples. This finding corroborates the result presented in [5]. Adding more examples to the training set does not change the error rate. Unlike the perceptron, SVMrank benefits from an increasing number of examples. The best result obtained here is 21.86% with the US-5 sampling scheme. This value is better than using 5-best lists (22.11%) or choosing 5 examples randomly (22.24%). The RG and RC schemes also provide comparable or better results than the baseline. Furthermore, the ranking SVM is able to learn a more generalizable model than the perceptron baseline, as suggested by the fact that it performs better on an unseen test set (21.56% vs. 21.84%), shown in Table 5. This result is significant at p = 0.027 < 0.05 as measured by the NIST MAPSSWE test.

The averaged ranking perceptron outperforms the other ranking algorithms in almost all cases. As with SVMrank, the accuracy improves by around 0.4% as the number of selected hypotheses increases. The test error improvement for 50-best over the averaged perceptron is also significant at p < 0.001.

Table 5: Test set WER (%)

Method    Avg. Per.  SVMrank  PerRank
All (50)    21.84     21.57    21.46
US-5        22.01     21.56    21.78

4. Summary and Discussion

In this paper, we presented several approaches to enhance discriminative language modeling performance for Turkish broadcast news transcription. The results of three training methods, the averaged perceptron, the ranking perceptron and the ranking SVM, have been compared. We applied online PCA as a dimensionality reduction technique on the sparse feature set, and several sample selection strategies to decrease the complexity of the ranking SVM. Overall, we obtained a WER reduction of 0.3-0.4% on the held-out and test sets.

We see that although online PCA does not degrade accuracy, it does not improve it either. In terms of processing time, the system actually becomes slower: although we decrease the dimensionality, we also lose sparsity. We are currently working on (possibly nonlinear) dimensionality reduction methods which will generate sparse representations.

The performance of the ranking SVM model is comparable to, and sometimes better than, that of the averaged perceptron. We also see that, unlike with the perceptron, the sampling approaches do help decrease the error rates.

The ranking perceptron has the lowest error rates, but it needs more hypotheses. However, it takes less time to train the ranking perceptron than to train the ranking SVM. Therefore, it is preferable to the other methods.

Even though the SVMrank toolkit provides a fast and efficient implementation of the algorithm, it is much slower to train than the perceptron, which takes about one minute. Table 6 shows side information about some of the experiments using the SVMrank setup, with a fixed C=1000. We see that, with an intelligent sampling technique on the N-best list, it is possible to decrease the number of training instances and thus the CPU times, while keeping the WERs in a tolerable region.

Table 6: Training CPU times for fixed C=1000

Method   Number of instances  CPU hours  Held-out WER (%)
All           4939368           25:00        22.26
US-5           518234            1:42        22.05
RG-1           466277            1:49        22.01
RC-3×3         896102            0:43        22.18

We are currently investigating the effect of using larger N-best and feature sets. Our future directions include using the perceptron and ranking SVM for and data summarization.

5. Acknowledgements

This research is supported in part by TÜBİTAK under Project number 109E142 and by the Turkish State Planning Organization (DPT) under the TAM Project, number 2007K120610. Murat Saraçlar is supported by the TÜBA GEBİP Award. The authors would like to thank Ebru Arısoy for the baseline DLM setup. The numerical calculations reported in this paper were performed at TÜBİTAK ULAKBİM, High Performance and Grid Computing Center (TR-Grid e-Infrastructure).

6. References

[1] Roark, B., Saraçlar, M. and Collins, M., "Discriminative n-gram language modeling", Computer Speech and Language, 21(2):373-392, 2007.
[2] Shen, L. and Joshi, A.K., "Ranking and Reranking with Perceptron", Machine Learning, 60:73-96, September 2005.
[3] Joachims, T., "Training Linear SVMs in Linear Time", Proc. of the ACM Conference on Knowledge Discovery and Data Mining (KDD), pp. 217-226, 2006.
[4] Zhou, Z., Gao, J., Soong, F.K. and Meng, H., "A Comparative Study of Discriminative Methods for Reranking LVCSR N-Best Hypotheses in Domain Adaptation and Generalization", Proc. ICASSP, vol. 1, pp. 141-144, 2006.
[5] Oba, T., Hori, T. and Nakamura, A., "An approach to efficient generation of high-accuracy and compact error-corrective models for speech recognition", Proc. Interspeech, pp. 1753-1756, 2007.
[6] Crammer, K. and Singer, Y., "Pranking with Ranking", Advances in Neural Information Processing Systems 14, pp. 641-647, MIT Press, 2001.
[7] Mao, J. and Jain, A.K., "Artificial Neural Networks for Feature Extraction and Multivariate Data Projection", IEEE Transactions on Neural Networks, 6(2):296-317, 1995.
[8] Arısoy, E., Can, D., Parlak, S., Sak, H. and Saraçlar, M., "Turkish Broadcast News Transcription and Retrieval", IEEE Transactions on Audio, Speech and Language Processing, 17(5):874-883, 2009.
[9] Arısoy, E., Saraçlar, M., Roark, B. and Shafran, I., "Syntactic and Sub-Lexical Features for Turkish Discriminative Language Models", Proc. ICASSP 2010, pp. 5538-5541, 2010.
