Prediction-Constrained Hidden Markov Models for Semi-Supervised Classification
Gabriel Hope (1), Michael C. Hughes (2), Finale Doshi-Velez (3), Erik B. Sudderth (1)
(1) UC Irvine, (2) Tufts University, (3) Harvard University. Correspondence to: Gabriel Hope <[email protected]>.
Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

We develop a new framework for training hidden Markov models that balances generative and discriminative goals. Our approach requires likelihood-based or Bayesian learning to meet task-specific prediction quality constraints, preventing model misspecification from leading to poor subsequent predictions. When users specify appropriate loss functions to constrain predictions, our approach can enhance semi-supervised learning when labeled sequences are rare and boost accuracy when data has unbalanced labels. Via automatic differentiation we backpropagate gradients through dynamic programming computation of the marginal likelihood, making training feasible without auxiliary bounds or approximations. Our approach is effective for human activity modeling and healthcare intervention forecasting, delivering accuracies competitive with well-tuned neural networks for fully labeled data, and substantially better for partially labeled data. Simultaneously, our learned generative model illuminates the dynamical states driving predictions.

1 Introduction

We develop broadly applicable methods for learning models of data sequences x, some of which are annotated with task-specific labels y. For example, in human activity models [29], x might be body motion data captured by an accelerometer, and y might be activity labels (walking, swimming, etc.). We seek to make good predictions of labels y from data x, while simultaneously learning a good model of the data x itself that is informed by task labels y. Strong generative models are valuable because they allow analysts to visualize recovered temporal patterns and query the model for missing data. Moreover, our approach enables semi-supervised time-series learning from data where only a small subset of data sequences have label annotations.

We consider the broad family of hidden Markov models (HMMs), for which a wide range of training methods have been previously proposed. Unsupervised learning methods for HMMs, such as the EM algorithm [27], do not use task labels y to inform the learned state structure and thus may have poor predictive power. In contrast, discriminative methods for training HMMs [5; 32; 12] have been widely used for problems like speech recognition to learn states informed by labels y. These methods have several shortcomings, including restrictions on the loss function used for label prediction, and a failure to allow users to select a task-specific tradeoff between generative and discriminative performance.

This paper develops a framework for training prediction-constrained hidden Markov models (PC-HMMs) that minimize application-motivated loss functions in the prediction of task labels y, while simultaneously learning good-quality generative models of the raw sequence data x. We demonstrate that PC-HMM training leads to prediction performance competitive with modern recurrent network architectures on fully labeled corpora, and excels over these baselines when given semi-supervised corpora where labels are rare. At the same time, PC-HMM training also learns latent states that enable exploratory interpretation, visualization, and simulation.

2 Hidden Markov Models for Prediction

In this section, we review HMMs. We consider a dataset of $N$ sequences $x_n$, some of which may have labels $y_n$. Each sequence $x_n$ has $T_n$ timesteps: $x_n = [x_{n1}, x_{n2}, \ldots, x_{n,T_n}]$. At each timestep $t$ we observe a vector $x_{nt} \in \mathbb{R}^D$.

2.1 Hidden Markov Models

Standard unsupervised HMMs [27] assume that the $N$ observed sequences are generated by a common model with $K$ hidden, discrete states. The sequences are generated by drawing a sequence of per-timestep state assignments $z_n = [z_{n1}, z_{n2}, \ldots, z_{nT_n}]$ from a Markov process, and then drawing observed data $x_n$ conditioned on these assignments. Specifically, we generate the sequence $z_n$ as
$$z_{n1} \sim \mathrm{Cat}(\pi_0), \qquad z_{nt} \sim \mathrm{Cat}(\pi_{z_{n,t-1}}),$$
where $\pi_0$ is the initial state distribution, and $\pi = \{\pi_k\}_{k=0}^{K}$ denotes the probabilities of transitioning from state $j$ to state $k$: $\pi_{jk} = p(z_t = k \mid z_{t-1} = j)$. Next the observations $x_n$ are sampled such that the distribution of $x_{nt}$ depends only on $z_{nt}$. In this work we consider Gaussian emissions with a state-specific mean and covariance,
$$p(x_{nt} \mid z_{nt} = k, \phi_k = \{\mu_k, \Sigma_k\}) = \mathcal{N}(x_{nt} \mid \mu_k, \Sigma_k),$$
as well as first-order autoregressive Gaussian emissions:
$$p(x_{nt} \mid x_{n,t-1}, z_{nt} = k, \phi_k = \{A_k, \mu_k, \Sigma_k\}) = \mathcal{N}(x_{nt} \mid A_k x_{n,t-1} + \mu_k, \Sigma_k).$$
To simplify notation, we assume that there exists an unmodeled observation $x_{n0}$ at time zero.

Given the observed data $x_n$, we can efficiently compute posterior marginal probabilities, or beliefs, for the latent states $z_n$ via the belief propagation or forward-backward algorithm [27]. We denote these probabilities by $b_{tk}(x_n; \pi, \phi) \triangleq p(z_{nt} = k \mid x_n, \pi, \phi)$, and note that these beliefs are a deterministic function of $x_n$ (which will be important for our end-to-end optimization) with computational cost $O(T_n K^2)$. The probabilities $b_{tk}(x_n; \pi, \phi)$ take into account the full sequence $x_n$, including future timesteps $x_{nt'}$, $t' > t$. In some applications, predictions must be made only on the data up until time $t$. These beliefs $\vec{b}_{tk}(x_n; \pi, \phi) \triangleq p(z_{nt} = k \mid x_{n1}, \ldots, x_{nt}, \pi, \phi)$ are computed by the forward pass of belief propagation.
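To make the computational structure concrete, the sketch below implements the forward-backward recursions in log space for Gaussian emissions, returning both the log marginal likelihood $\log p(x_n \mid \pi, \phi)$ and the beliefs $b_{tk}$. This is a minimal illustration under our own assumptions (function names, array layout, and the use of JAX are ours, not the authors' released code), chosen because automatic differentiation through exactly this dynamic program is what end-to-end training requires.

```python
# Minimal sketch (not the authors' code): log-space forward-backward for an HMM
# with Gaussian emissions, written in JAX so gradients can flow through the
# dynamic program. Names and array shapes are illustrative assumptions.
import jax.numpy as jnp
from jax.scipy.stats import multivariate_normal
from jax.scipy.special import logsumexp

def forward_backward(x, log_pi0, log_pi, means, covs):
    """x: (T, D) observations; log_pi0: (K,) initial log-probs; log_pi: (K, K)
    transition log-probs; means: (K, D); covs: (K, D, D).
    Returns (log marginal likelihood, beliefs of shape (T, K))."""
    T = x.shape[0]
    K = log_pi0.shape[0]
    # Per-timestep, per-state emission log-likelihoods, shape (T, K).
    log_lik = jnp.stack(
        [multivariate_normal.logpdf(x, means[k], covs[k]) for k in range(K)], axis=1)

    # Forward pass: alpha[t, k] = log p(x_1..x_t, z_t = k).
    alphas = [log_pi0 + log_lik[0]]
    for t in range(1, T):
        alphas.append(log_lik[t] + logsumexp(alphas[-1][:, None] + log_pi, axis=0))
    alpha = jnp.stack(alphas)
    log_marginal = logsumexp(alpha[-1])  # log p(x | pi, phi)

    # Backward pass: beta[t, k] = log p(x_{t+1}..x_T | z_t = k).
    betas = [jnp.zeros(K)]
    for t in range(T - 2, -1, -1):
        betas.append(logsumexp(log_pi + (log_lik[t + 1] + betas[-1])[None, :], axis=1))
    beta = jnp.stack(betas[::-1])

    # Beliefs b_{tk} = p(z_t = k | x), as defined in Sec. 2.1.
    beliefs = jnp.exp(alpha + beta - log_marginal)
    return log_marginal, beliefs
```

Because every operation above is differentiable, a gradient of any loss built from `log_marginal` or `beliefs` propagates through the recursions directly, consistent with the paper's claim that training needs no auxiliary bounds or approximations.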
2.2 Predicting Labels from Beliefs

Now we consider the prediction of labels y given data x. Because they capture uncertainty in the hidden states $z_n$, the beliefs $b_{tk}$ are succinct (and computationally efficient) summary statistics for the data. We use beliefs as features for the prediction of labels $y_n$ from data $x_n$ in two scenarios: per-sequence classification, where the entire sequence $n$ has a single binary or categorical label $y_n$, and per-timestep classification, where each timestep has its own event label.

Sequence classification. In the sequence classification scenario, we seek to assign a scalar label $y_n$ to the entire sequence. Below, we provide two classification functions given the beliefs. In the first, we use the average amount of time spent in each state as our feature:
$$\bar{b}(x_n; \pi, \phi) \triangleq \frac{1}{T_n} \sum_{t=1}^{T_n} b_t(x_n; \pi, \phi), \qquad (1)$$
$$\hat{y}_n \triangleq \hat{y}(x_n; \pi, \phi, \eta) = f\big(\eta^T \bar{b}(x_n; \pi, \phi)\big), \qquad (2)$$
where $b_t(x_n; \pi, \phi) = p(z_{nt} \mid x_n, \pi, \phi)$, $\eta$ is a vector of regression coefficients, and $f(\cdot)$ is an appropriate link function (e.g., a logistic $f(w) = 1/(1 + e^{-w})$ for binary labels or a softmax for categorical labels).

For some tasks it may be beneficial to use a more flexible, non-linear model to predict labels from belief states. In these cases, we can replace the linear model based on averaged belief states in Eq. (2) with a general differentiable prediction function applied to the belief sequence, where $\eta$ are the parameters of the differentiable function $f(\cdot; \eta)$. In one of the sequence classification tasks in Sec. 4, we find that a prediction function incorporating a convolutional transformation of the belief sequence, followed by local max-pooling, leads to improved accuracy. The convolutional structure allows predictions to depend on belief patterns that span several timesteps, while localized pooling preserves high-level temporal structure while removing some sensitivity to the temporal alignment of sequences.

Event detection. In other applications, we seek to densely label the events occurring at each timestep of a sequence, such as the prediction of medical events from hourly observations of patients in a hospital. To predict the label $y_{nt}$ at time $t$, we use the beliefs $b_{tk}$ at times $t_{w_{st}} : t_{w_{end}}$ in a window around $t$ as features for a regression or classification model with parameters $\eta$:
$$\hat{y}_{nt} \triangleq \hat{y}_t(x_n; \pi, \phi, \eta) = f\big(b_{t_{w_{st}}:t_{w_{end}}}(x_n; \pi, \phi); \eta\big). \qquad (4)$$
Here $f(\cdot; \eta)$ could either be a generalized linear model based on the average state frequencies in the local time window, or a more complicated non-linear model as discussed for sequence classification.

Finally, we note that many prediction tasks are offline: the prediction is needed for post-hoc analysis and thus can incorporate information from the full data sequence. When a prediction needs to happen online, for example in forecasting applications, we instead use as regression features only the forward-belief representation $\vec{b}_{tk}$.
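Both prediction heads share the same pattern: pool beliefs into features, then apply a (generalized) linear map. The sketch below, in the same illustrative JAX style as the `forward_backward` sketch above, shows one way to realize Eqs. (1)-(2) and the windowed features of Eq. (4); the window handling, names, and shapes are our assumptions rather than the paper's exact implementation.

```python
# Minimal sketch (illustrative, not the authors' code): belief-based prediction
# heads for sequence classification (Eqs. 1-2) and per-timestep event
# detection (Eq. 4). `beliefs` is the (T, K) array returned by forward_backward.
import jax
import jax.numpy as jnp

def sequence_classifier(beliefs, eta_w, eta_b):
    """Eqs. (1)-(2): average beliefs over time, then a linear map + softmax link.
    eta_w: (K, C) weights, eta_b: (C,) bias for C classes."""
    b_bar = beliefs.mean(axis=0)                  # (K,) time-averaged beliefs, Eq. (1)
    return jax.nn.softmax(b_bar @ eta_w + eta_b)  # (C,) class probabilities, Eq. (2)

def event_detector(beliefs, eta_w, eta_b, half_window=3):
    """Eq. (4): for each timestep, average beliefs in a local window
    [t - half_window, t + half_window], then apply a linear map + sigmoid link."""
    T, _ = beliefs.shape
    preds = []
    for t in range(T):
        lo, hi = max(0, t - half_window), min(T, t + half_window + 1)
        window_feat = beliefs[lo:hi].mean(axis=0)  # (K,) local state frequencies
        preds.append(jax.nn.sigmoid(window_feat @ eta_w + eta_b))
    return jnp.stack(preds)                        # per-timestep event probabilities
```

For online forecasting, the forward-only beliefs $\vec{b}_{tk}$ would be passed in place of the smoothed beliefs, and the convolutional variant described above would replace the simple window average with learned filters and max-pooling.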
3 Learning via Prediction Constraints

We now develop an overall optimization objective and algorithm to estimate parameters $\pi, \phi, \eta$ given (possibly partially) labeled training data. Our goal is to jointly learn how to model the sequence data $x_n$ (via estimated transition parameters $\pi$ and emission parameters $\phi$) and how to predict the labels $y_n$ (via estimated regression parameters $\eta$), so that predictions may be high quality even when the model is misspecified.

Suppose we observe a set $\mathcal{L}$ of labeled sequences $\{x_n, y_n\}_{n=1}^{N_L}$ together with a set $\mathcal{U}$ of unlabeled sequences $\{x_n\}_{n=1}^{N_U}$. Our goal is to explain the sequences $x_n$, by achieving a high $p(x \mid \pi, \phi)$, while also making predictions $\hat{y}_n$ that minimize some task-specific loss (e.g., hinge loss or logistic loss). Our prediction-constrained (PC) training objective for both sequence classification scenarios is:
$$\min_{\pi, \phi, \eta} \; \sum_{n \in \mathcal{L} \cup \mathcal{U}} -\log p(x_n \mid \pi, \phi) - \log p(\pi, \phi), \qquad (5)$$
$$\text{subject to: } \sum_{n \in \mathcal{L}} \mathrm{loss}\big(y_n, \hat{y}(x_n; \pi, \phi, \eta)\big) \leq \epsilon.$$
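In practice, a hard constraint like the one in Eq. (5) is commonly handled by moving the aggregate prediction loss into the objective with a fixed multiplier and optimizing all parameters jointly by gradient descent. The sketch below shows that relaxed form under our own assumptions: the multiplier `lam`, the cross-entropy loss, the plain gradient step, and the reuse of the `forward_backward` and `sequence_classifier` sketches above are illustrative choices, not details taken from this excerpt.

```python
# Minimal sketch (our assumptions, not the paper's exact algorithm): a penalized
# relaxation of the PC objective in Eq. (5). The labeled-data loss constraint is
# folded into the objective with a fixed multiplier `lam`, and all parameters
# are updated jointly by gradients through the forward-backward recursions.
import jax
import jax.numpy as jnp

def pc_objective(params, labeled, unlabeled, lam=10.0):
    """params: dict with 'log_pi0', 'log_pi', 'means', 'covs', 'eta_w', 'eta_b'.
    labeled: list of (x, y) pairs; unlabeled: list of x arrays."""
    neg_log_lik = 0.0
    pred_loss = 0.0
    for x in unlabeled + [x for x, _ in labeled]:
        log_marginal, _ = forward_backward(
            x, params['log_pi0'], params['log_pi'], params['means'], params['covs'])
        neg_log_lik -= log_marginal                    # -log p(x_n | pi, phi)
    for x, y in labeled:
        _, beliefs = forward_backward(
            x, params['log_pi0'], params['log_pi'], params['means'], params['covs'])
        probs = sequence_classifier(beliefs, params['eta_w'], params['eta_b'])
        pred_loss -= jnp.log(probs[y])                 # logistic / cross-entropy loss
    # The prior term -log p(pi, phi) is omitted here for brevity; a real
    # implementation would also keep transition rows normalized and
    # covariances positive definite via an unconstrained parameterization.
    return neg_log_lik + lam * pred_loss

def sgd_step(params, labeled, unlabeled, lr=1e-3):
    """One plain gradient step; any first-order optimizer could be substituted."""
    grads = jax.grad(pc_objective)(params, labeled, unlabeled)
    return {k: params[k] - lr * grads[k] for k in params}
```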