Phonetic Classification Using Controlled Random Walks
Katrin Kirchhoff (Department of Electrical Engineering, University of Washington, Seattle, WA, USA) and Andrei Alexandrescu (Facebook; work was performed while the author was a student at the University of Washington)

Abstract

Recently, semi-supervised learning algorithms for phonetic classifiers have been proposed that have obtained promising results. Often, these algorithms attempt to satisfy learning criteria that are not inherent in the standard generative or discriminative training procedures for phonetic classifiers. Graph-based learners in particular utilize an objective function that not only maximizes the classification accuracy on a labeled set but also the global smoothness of the predicted label assignment. In this paper we investigate a novel graph-based semi-supervised learning framework that implements a controlled random walk, where the different possible moves in the random walk are governed by probabilities that depend on properties of the graph itself. Experimental results on the TIMIT corpus are presented that demonstrate the effectiveness of this procedure.

Index Terms: phonetic modeling, classification, phonetic similarity, graph-based learning

1. Introduction

A persistent problem in automatic speech processing is the need to adapt speech processing systems to varying test conditions for which labeled data is often not available. At the acoustic level, a number of successful adaptation algorithms have been developed (e.g. [1, 2]) that transform model parameters to better fit the test data. Another line of research that is beginning to emerge attempts to include unlabeled data during the training process [3, 5, 6, 7, 8]. Graph-based training methods in particular [5, 8] add a complementary training criterion that guides model parameters towards producing globally smooth outputs, i.e. speech samples that are in close proximity in the feature space are encouraged to receive the same output label. In applying this criterion, similarities between training and test samples are taken into account, as well as similarities between different test samples; in this way, the training algorithm takes into consideration the global underlying structure of the test set and can influence the model parameters to fit this structure.

In this paper we build on previous work on graph-based learning for acoustic classification and test a novel, recently proposed graph-based learning algorithm termed Modified Adsorption (MAD) [9]. Unlike previously proposed graph-based algorithms, which can be considered simple random-walk algorithms over a data graph, MAD introduces different probabilities for the individual random-walk operations such as termination, continuation, and injection. The probabilities for these operations are set depending on properties of the graph itself, notably (as explained below in Section 2.2) the neighbourhood entropy of a given node. When building a graph representing a speech database, the neighbourhood sizes and entropies often exhibit large variance, due to the uneven distribution of phonetic classes (e.g. vowels are more frequent than laterals). MAD offers a unified framework for accommodating such effects without the need for downsampling or upsampling the training data. In the past, MAD has been applied to class-instance extraction from text [10], but to our knowledge it has not been applied to speech classification tasks before. In this paper we contrast MAD with the most widely used graph-based learning algorithm, viz. label propagation.

2. Graph-Based Learning

In graph-based learning, the training and test data are jointly represented as a graph G(V, E, W), where V is a set of vertices representing the data points, E is a set of edges connecting the vertices, and W is a weight function labeling the edges with weights. An edge e = (v_1, v_2) ∈ V × V indicates that similarity exists between the data points represented by v_1 and v_2, and the weight w_12 expresses the degree of similarity between these points. The graph can be densely or sparsely connected, and a wide range of similarity functions can be used for W, though typically non-negative functions are used. A weight of 0 implies the absence of an edge between two vertices. The graph is usually represented as an n × n weight matrix, where n is the number of data points (composed of l labeled and u unlabeled data points; l + u = n). Additionally, we have labels y_1, ..., y_l for the set of labeled points x_1, ..., x_l. Assuming that there are m different labels in total, we can define an n × m matrix Y with entries for the l labeled points and zeros for the u unlabeled points. The goal is to utilize the graph in inferring labels for the unlabeled points. A number of different learning algorithms have been proposed in the past, including the spectral graph transducer [11], label propagation [12], and measure propagation [8].
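To make this setup concrete, the following is a minimal sketch of one common way to build such a weight matrix: a symmetric k-nearest-neighbour graph with Gaussian (RBF) similarities over feature vectors. The feature array X, the neighbourhood size k, and the bandwidth sigma are illustrative assumptions; the paper's own graph construction is the subject of Section 3 and is not reproduced here.

    import numpy as np

    def build_knn_graph(X, k=10, sigma=1.0):
        # Sketch: symmetric k-nearest-neighbour similarity graph over feature
        # vectors X (n x d), with Gaussian weights; W[i, j] = 0 means "no edge".
        n = X.shape[0]
        sq = np.sum(X ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
        np.fill_diagonal(d2, np.inf)                     # exclude self-edges
        W = np.zeros((n, n))
        for i in range(n):
            nbrs = np.argsort(d2[i])[:k]                 # k closest points to i
            W[i, nbrs] = np.exp(-d2[i, nbrs] / (2.0 * sigma ** 2))
        return np.maximum(W, W.T)                        # symmetrise: W_ij = W_ji

With the l labeled points ordered first, the n × m seed matrix Y described above is then obtained by one-hot encoding y_1, ..., y_l and leaving the remaining rows at zero.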
The key advantage of graph-based learning is that similarities between training and test points, as well as similarities between different test points, are taken into account. This is particularly useful when the data to be classified lives on an underlying low-dimensional manifold. It has been shown in the past [13] that the phonetic space exhibits such a manifold structure, and graph-based techniques have been used successfully for phonetic classification in previous work [5, 8].

2.1. Label Propagation

Label propagation (LP), one of the earliest graph-based learning algorithms [12], minimizes the cost function

    S = \sum_{i,j=l+1,\dots,n} W_{ij} (\hat{y}_i - \hat{y}_j)^2        (1)

subject to the constraint that

    y_i = \hat{y}_i \quad \forall i = 1,\dots,l        (2)

where y is the true label and ŷ is the predicted label. Thus, two samples linked by an edge with a large weight should ideally receive the same labels. The cost function has a closed-form solution, but usually an iterative procedure (Algorithm 1) is used. Upon termination, Y_{l+1,...,n} contains the entries for the unlabeled data rows in the label matrix. LP can also be viewed from the perspective of random walks on graphs: after convergence, y_ij contains the probability that a random walk starting at unlabeled vertex i will terminate in a vertex labeled with j.

Algorithm 1: Iterative Label Propagation Algorithm
1. Initialize the row-normalized matrix P as P_ij = W_ij / Σ_j W_ij;
2. Initialize an n × m label matrix Y with one-hot vectors encoding the known labels for the first l rows: Y_i = δ_C(y_i) ∀i ∈ {1, 2, ..., l} (the remaining rows of Y can be zero);
3. Y' ← P × Y;
4. Clamp the already-labeled data rows: Y'_i = δ_C(y_i) ∀i ∈ {1, 2, ..., l};
5. If Y' ≅ Y, stop;
6. Y ← Y';
7. Repeat from step 3.
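As a concrete illustration of Algorithm 1, here is a minimal sketch of the iterative updates, assuming a dense numpy weight matrix with the l labeled points stored first and integer class labels; the tolerance and iteration cap are illustrative choices rather than values from the paper.

    import numpy as np

    def label_propagation(W, labels, l, m, tol=1e-6, max_iter=1000):
        # Sketch of Algorithm 1. W is the (n x n) similarity matrix, the first l
        # points are labeled, labels is a length-l array of class indices in [0, m).
        n = W.shape[0]
        P = W / W.sum(axis=1, keepdims=True)        # step 1: row-normalized matrix
        Y = np.zeros((n, m))
        Y[np.arange(l), labels] = 1.0               # step 2: one-hot seed rows
        for _ in range(max_iter):
            Y_new = P @ Y                           # step 3: propagate
            Y_new[:l] = 0.0
            Y_new[np.arange(l), labels] = 1.0       # step 4: clamp labeled rows
            if np.abs(Y_new - Y).max() < tol:       # step 5: stop if converged
                return Y_new
            Y = Y_new                               # step 6
        return Y

After convergence, each unlabeled row of the returned matrix can be read as the distribution over labels described above, and a hard prediction is its argmax.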
2.2. Modified Adsorption

Modified adsorption (MAD) [9] builds on the Adsorption algorithm [14] in that it introduces three different probabilities for each vertex v in the graph, corresponding to the random-walk operations of (a) continuing to a neighbouring vertex according to the transition probabilities (cont); (b) stopping and returning the prior ("injected") label probabilities (inj); or (c) abandoning the walk and returning a default label vector (abnd), which can be an all-zero vector or a vector giving the highest probability to a default or "dummy" label. Each action is controlled by a corresponding probability that depends on the vertex v, i.e. p_v^cont, p_v^inj, and p_v^abnd, and all three must sum to 1. For unlabeled nodes, p^inj is 0.

In order to determine the values of p_v^cont, p_v^inj, and p_v^abnd, we use the procedure in [15], which defines two quantities, c_v and d_v, for each vertex v. The first of these is defined as

    c_v = \frac{\log \beta}{\log(\beta + e^{H[v]})}        (3)

Setting c_v and d_v in this manner penalizes neighbourhoods with a large number of nodes and high entropy, to the benefit of small, low-entropy neighbourhoods or prior label information. MAD optimizes the following objective function:

    S = \sum_i \Big[ \mu_1\, p_i^{\mathrm{inj}} \sum_m (y_i^m - \hat{y}_i^m)^2
          + \mu_2\, p_i^{\mathrm{cont}} \sum_m \sum_{j \in N(v_i)} W_{ij} (\hat{y}_i^m - \hat{y}_j^m)^2
          + \mu_3\, p_i^{\mathrm{abnd}} \sum_m (\hat{y}_i^m - r_i^m)^2 \Big]        (6)

The optimization criterion thus consists of three parts: (a) correspondence of the predicted labels with the true seed labels; (b) smoothness of the predicted labeling within the graph neighborhood of the current vertex, N(v); and (c) regularization by penalizing divergence from a default value r. Each part is weighted by a coefficient µ which controls its relative importance.

Another difference from the LP algorithm is that seed labels are allowed to change during the training process and need not stay fixed. This may be useful when the labeling of the training data is noisy.

In [15] it was shown that the function in Equation 6 can be described as a linear system that can be optimized by the Jacobi method [4], leading to the training procedure shown in Algorithm 2; a short illustrative sketch of these updates is given below.

Algorithm 2: Iterative MAD Algorithm
Input: G = (V, E, W); Y_{1,...,l}; p^inj, p^cont, p^abnd; µ_1, µ_2, µ_3
Output: Ŷ_{1,...,n}
1. Initialize ŷ_i = y_i ∀i = 1, ..., l;
2. M_ii ← µ_1 × p_i^inj + µ_2 × p_i^cont × Σ_{j∈N(v_i)} W_ij + µ_3;
3. repeat
4.   D_i ← Σ_{j∈N(v_i)} (p_i^cont W_ij + p_j^cont W_ji) Ŷ_j;
5.   for i ∈ 1, ..., n do
6.     Ŷ_i ← (1 / M_ii)(µ_1 × p_i^inj × Y_i + µ_2 × D_i + µ_3 × R_i)
7.   end
8. until convergence

3. MAD for Acoustic-Phonetic Classification

In order to apply MAD to acoustic-phonetic classification, we need to first construct the data graph.
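Returning to Algorithm 2 above, the following is a minimal sketch of its Jacobi-style updates, assuming dense numpy arrays and per-vertex probabilities p^inj and p^cont that have already been computed (e.g. from c_v and d_v). The update follows Algorithm 2 as printed, in which p^abnd does not appear explicitly; variable names, the convergence test, and the iteration cap are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def mad(W, Y, R, p_inj, p_cont, mu1, mu2, mu3, tol=1e-6, max_iter=1000):
        # Sketch of Algorithm 2 (Jacobi-style updates). W: (n x n) weights,
        # Y: (n x m) seed label matrix, R: (n x m) default ("dummy") label rows,
        # p_inj / p_cont: length-n per-vertex injection / continuation probabilities.
        M = mu1 * p_inj + mu2 * p_cont * W.sum(axis=1) + mu3      # line 2: M_ii
        A = p_cont[:, None] * W + (p_cont[:, None] * W).T         # A_ij = p_i^cont W_ij + p_j^cont W_ji
        Y_hat = Y.copy()                                          # line 1: start from the seeds
        for _ in range(max_iter):
            D = A @ Y_hat                                         # line 4: D_i
            # line 6 (some formulations also weight R by p_abnd, as in Eq. (6)):
            Y_new = (mu1 * p_inj[:, None] * Y + mu2 * D + mu3 * R) / M[:, None]
            if np.abs(Y_new - Y_hat).max() < tol:                 # line 8: convergence
                return Y_new
            Y_hat = Y_new
        return Y_hat

Hard predictions for the unlabeled vertices are again obtained as the row-wise argmax of the returned matrix.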