Prediction-Constrained Hidden Markov Models for Semi-Supervised Classification
Gabriel Hope¹, Michael C. Hughes², Finale Doshi-Velez³, Erik B. Sudderth¹
¹UC Irvine, ²Tufts University, ³Harvard University. Correspondence to: Gabriel Hope <[email protected]>.
Accepted paper at the Time Series Workshop at ICML 2021. Copyright 2021 by the authors.

Abstract

We develop a new framework for training hidden Markov models that balances generative and discriminative goals. Our approach requires likelihood-based or Bayesian learning to meet task-specific prediction quality constraints, preventing model misspecification from leading to poor subsequent predictions. When users specify appropriate loss functions to constrain predictions, our approach can enhance semi-supervised learning when labeled sequences are rare and boost accuracy when data has unbalanced labels. Via automatic differentiation we backpropagate gradients through the dynamic programming computation of the marginal likelihood, making training feasible without auxiliary bounds or approximations. Our approach is effective for human activity modeling and healthcare intervention forecasting, delivering accuracies competitive with well-tuned neural networks for fully labeled data, and substantially better for partially labeled data. Simultaneously, our learned generative model illuminates the dynamical states driving predictions.

1 Introduction

We develop broadly applicable methods for learning models of data sequences x, some of which are annotated with task-specific labels y. For example, in human activity tasks [29], x might be body motion data captured by an accelerometer, and y might be activity labels (walking, swimming, etc.). We seek good predictions of labels y from data x, while simultaneously learning a good model of x itself that is informed by task labels y. Strong generative models are valuable because they allow analysts to visualize recovered patterns and query for missing data. Moreover, our approach enables semi-supervised time-series learning from data where only a small subset of data sequences have labels.

We consider the broad family of hidden Markov models (HMMs), for which a wide range of training methods have previously been proposed. Unsupervised learning methods for HMMs, such as the EM algorithm [27], do not use task labels y to inform the learned state structure and thus may have poor predictive power. In contrast, discriminative methods for training HMMs [5; 32; 12] have been widely used for problems like speech recognition to learn states informed by labels y. These methods have several shortcomings, including restrictions on the loss function used for label prediction and a failure to let users select a task-specific tradeoff between generative and discriminative performance. See App. B for a detailed discussion of related work.

This paper develops a framework for training prediction-constrained hidden Markov models (PC-HMMs) that minimize application-motivated loss functions in the prediction of task labels y, while simultaneously learning good-quality generative models of the raw sequence data x. We demonstrate that PC-HMM training leads to prediction performance competitive with modern recurrent network architectures on fully labeled corpora, and excels over these baselines on semi-supervised corpora where labels are rare. At the same time, PC-HMM training also learns latent states that enable exploratory interpretation, visualization, and simulation.

2 Hidden Markov Models for Prediction

In this section, we review HMMs. We consider a dataset of $N$ sequences $x_n$, some of which may have labels $y_n$. Each sequence $x_n$ has $T_n$ timesteps: $x_n = [x_{n1}, x_{n2}, \ldots, x_{n,T_n}]$. At each timestep $t$ we observe a vector $x_{nt} \in \mathbb{R}^D$.

2.1 Hidden Markov Models

Standard unsupervised HMMs [27] assume that the $N$ observed sequences are generated by a common model with $K$ hidden, discrete states. The sequences are generated by drawing a sequence of per-timestep state assignments $z_n = [z_{n1}, z_{n2}, \ldots, z_{n,T_n}]$ from a Markov process, and then drawing observed data $x_n$ conditioned on these assignments. Specifically, we generate the sequence $z_n$ as
$$z_{n1} \sim \mathrm{Cat}(\pi_0), \qquad z_{nt} \sim \mathrm{Cat}(\pi_{z_{n,t-1}}),$$
where $\pi_0$ is the initial state distribution, and $\pi = \{\pi_k\}_{k=0}^{K}$ denotes the probabilities of transitioning from state $j$ to state $k$: $\pi_{jk} = p(z_t = k \mid z_{t-1} = j)$. Next, the observations $x_n$ are sampled such that the distribution of $x_{nt}$ depends only on $z_{nt}$. In this work we consider Gaussian emissions with a state-specific mean and covariance,
$$p(x_{nt} \mid z_{nt} = k, \phi_k = \{\mu_k, \Sigma_k\}) = \mathcal{N}(x_{nt} \mid \mu_k, \Sigma_k),$$
as well as first-order autoregressive Gaussian emissions:
$$p(x_{nt} \mid x_{n,t-1}, z_{nt} = k, \phi_k = \{A_k, \mu_k, \Sigma_k\}) = \mathcal{N}(x_{nt} \mid A_k x_{n,t-1} + \mu_k, \Sigma_k).$$
To simplify notation, we assume that there exists an unmodeled observation $x_{n0}$ at time zero.
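As a concrete illustration of this generative process, the following is a minimal NumPy sketch that samples one sequence from a $K$-state HMM with Gaussian emissions; the function and argument names (`sample_hmm_sequence`, `pi0`, `trans`, `means`, `covs`) are illustrative rather than taken from the paper's implementation. The first-order autoregressive variant would simply replace the emission mean with $A_k x_{n,t-1} + \mu_k$.

```python
import numpy as np

def sample_hmm_sequence(T, pi0, trans, means, covs, rng=None):
    """Sample one length-T sequence from a K-state HMM with Gaussian emissions.

    pi0   : (K,)      initial state distribution pi_0
    trans : (K, K)    transition matrix, trans[j, k] = p(z_t = k | z_{t-1} = j)
    means : (K, D)    per-state emission means mu_k
    covs  : (K, D, D) per-state emission covariances Sigma_k
    """
    rng = np.random.default_rng() if rng is None else rng
    K = pi0.shape[0]
    z = np.empty(T, dtype=int)
    x = np.empty((T, means.shape[1]))

    z[0] = rng.choice(K, p=pi0)                       # z_1 ~ Cat(pi_0)
    x[0] = rng.multivariate_normal(means[z[0]], covs[z[0]])
    for t in range(1, T):
        z[t] = rng.choice(K, p=trans[z[t - 1]])       # z_t ~ Cat(pi_{z_{t-1}})
        x[t] = rng.multivariate_normal(means[z[t]], covs[z[t]])
    return z, x
```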
Given the observed data $x_n$, we can efficiently compute posterior marginal probabilities, or beliefs, for the latent states $z_n$ via the belief propagation or forward-backward algorithm [27]. We denote these probabilities by $b_{tk}(x_n, \pi, \phi) \triangleq p(z_{nt} = k \mid x_n, \pi, \phi)$, and note that these beliefs are a deterministic function of $x_n$ (which will be important for our end-to-end optimization) with computational cost $O(T_n K^2)$. The probabilities $b_{tk}(x_n, \pi, \phi)$ take into account the full sequence $x_n$, including future timesteps $x_{nt'}$, $t' > t$. In some applications, predictions must be made using only the data up until time $t$. These beliefs $b^{\rightarrow}_{tk}(x_n, \pi, \phi) \triangleq p(z_{nt} = k \mid x_{n1}, \ldots, x_{nt}, \pi, \phi)$ are computed by the forward pass of belief propagation.

2.2 Predicting Labels from Beliefs

Now we consider the prediction of labels $y$ given data $x$. Because they capture uncertainty in the hidden states $z_n$, the beliefs $b_{tk}$ are succinct (and computationally efficient) summary statistics for the data. We use beliefs as features for the prediction of labels $y_n$ from data $x_n$ in two scenarios: per-sequence classification, where the entire sequence $n$ has a single binary or categorical label $y_n$, and per-timestep classification, where each timestep has its own event label.

Sequence classification. In the sequence classification scenario, we seek to assign a scalar label $y_n$ to the entire sequence. Below, we provide two classification functions given the beliefs. In the first, we use the average amount of time spent in each state as our feature:
$$\bar{b}(x_n; \pi, \phi) \triangleq \frac{1}{T_n} \sum_{t=1}^{T_n} b_t(x_n; \pi, \phi), \qquad (1)$$
$$\hat{y}_n \triangleq \hat{y}(x_n; \pi, \phi, \eta) = f\big(\eta^\top \bar{b}(x_n; \pi, \phi)\big), \qquad (2)$$
where $b_t(x_n; \pi, \phi) = p(z_{nt} \mid x_n; \pi, \phi)$, $\eta$ is a vector of regression coefficients, and $f(\cdot)$ is an appropriate link function (e.g., a logistic $f(w) = 1/(1 + e^{-w})$ for binary labels or a softmax for categorical labels).

For some tasks it may be beneficial to use a more flexible, non-linear model to predict labels from belief states. In these cases, we can replace the linear model based on averaged belief states in Eq. (2) with a general function that takes in the full sequence of beliefs. We find that a prediction function incorporating a convolutional transformation of the belief sequence, followed by local max-pooling, leads to improved accuracy. The convolutional structure allows predictions to depend on belief patterns that span several timesteps, while localized pooling preserves high-level temporal structure while removing some sensitivity to the temporal alignment of sequences.
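To make concrete how the beliefs and the linear classifier of Eqs. (1)–(2) fit together, here is a NumPy/SciPy sketch of a standard scaled forward-backward pass and the averaged-belief logistic classifier for a binary label. This is a generic textbook implementation under the Gaussian-emission assumption, not the authors' code; names such as `belief_states`, `predict_sequence_label`, and `eta` are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def belief_states(x, pi0, trans, means, covs):
    """Scaled forward-backward pass for one sequence x of shape (T, D).

    Returns beliefs[t, k] = b_tk = p(z_t = k | x_{1:T}) and the log marginal
    likelihood log p(x | pi, phi).  Cost is O(T K^2).
    """
    T, K = x.shape[0], pi0.shape[0]
    # Per-timestep emission likelihoods p(x_t | z_t = k).
    lik = np.column_stack(
        [multivariate_normal.pdf(x, means[k], covs[k]) for k in range(K)])

    alpha = np.empty((T, K)); beta = np.empty((T, K)); scale = np.empty(T)
    alpha[0] = pi0 * lik[0]
    scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
    for t in range(1, T):                          # forward recursion
        alpha[t] = (alpha[t - 1] @ trans) * lik[t]
        scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):                 # backward recursion
        beta[t] = (trans @ (lik[t + 1] * beta[t + 1])) / scale[t + 1]

    beliefs = alpha * beta
    beliefs /= beliefs.sum(axis=1, keepdims=True)  # normalize for safety
    return beliefs, np.log(scale).sum()            # log p(x | pi, phi)

def predict_sequence_label(x, pi0, trans, means, covs, eta):
    """Eqs. (1)-(2): logistic regression on time-averaged beliefs."""
    beliefs, _ = belief_states(x, pi0, trans, means, covs)
    b_bar = beliefs.mean(axis=0)                   # Eq. (1)
    return 1.0 / (1.0 + np.exp(-eta @ b_bar))      # Eq. (2), binary label
```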
Event detection. In other applications, we seek to densely label the events occurring at each timestep of a sequence, such as the prediction of medical events from hourly observations of patients in a hospital. To predict the label $y_{nt}$ at time $t$, we use the beliefs $b_{tk}$ at times $t_{w_{st}}, \ldots, t_{w_{end}}$ in a window around $t$ as features for a regression or classification model with parameters $\eta$:
$$\hat{y}_{nt} \triangleq \hat{y}_t(x_n; \pi, \phi, \eta) = f\big(b_{t_{w_{st}}:t_{w_{end}}}(x_n; \pi, \phi); \eta\big). \qquad (4)$$
Here $f(\cdot; \eta)$ could either be a generalized linear model based on the average state frequencies in the local time window, or a more complicated non-linear model as discussed for sequence classification.

Finally, we note that many prediction tasks are offline: the prediction is needed for post-hoc analysis and thus can incorporate information from the full data sequence. When a prediction must happen online, for example in forecasting applications, we instead use as regression features only the forward-belief representation $b^{\rightarrow}_{tk}$.

3 Learning via Prediction Constraints

We now develop an overall optimization objective and algorithm to estimate parameters $\pi, \phi, \eta$ given (possibly partially) labeled training data. Our goal is to jointly learn how to model the sequence data $x_n$ (via estimated transition parameters $\pi$ and emission parameters $\phi$) and how to predict the labels $y_n$ (via estimated regression parameters $\eta$), so that predictions may be high quality even when the model is misspecified.

Suppose we observe a set $\mathcal{L}$ of labeled sequences $\{x_n, y_n\}_{n=1}^{N_L}$ together with a set $\mathcal{U}$ of unlabeled sequences $\{x_n\}_{n=1}^{N_U}$. Our goal is both to explain the sequences $x_n$, by achieving a high $p(x \mid \pi, \phi)$, and to make predictions $\hat{y}$ that minimize some task-specific loss (e.g., hinge loss or logistic loss). Our prediction-constrained (PC) training objective for both sequence classification scenarios is:
$$\min_{\pi, \phi, \eta} \; \sum_{n \in \mathcal{L} \cup \mathcal{U}} -\log p(x_n \mid \pi, \phi) - \log p(\pi, \phi), \qquad (5)$$
$$\text{subject to:} \quad \sum_{n \in \mathcal{L}} \mathrm{loss}\big(y_n, \hat{y}(x_n; \pi, \phi, \eta)\big) \leq \epsilon.$$
Here, $p(\pi, \phi)$ is an optional prior density to regularize parameters.
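As a rough sketch of how the constrained objective in Eq. (5) might be optimized, one common strategy (an assumption here, not stated in the excerpt above) is to fold the prediction constraint into the objective with a multiplier-style weight `lam` and minimize the resulting penalized loss with gradient methods. The code below reuses the hypothetical `belief_states` and `predict_sequence_label` helpers from the sketch above and evaluates only the objective value; `lam` is an illustrative knob, not a value from the paper.

```python
import numpy as np

def binary_log_loss(y, y_hat, eps=1e-12):
    """Logistic loss for a single binary label y in {0, 1}."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def pc_objective(params, labeled, unlabeled, lam):
    """Penalized surrogate for the constrained objective in Eq. (5):

        sum_{n in L u U} -log p(x_n | pi, phi)
            + lam * sum_{n in L} loss(y_n, y_hat(x_n; pi, phi, eta)),

    where `lam` plays the role of a multiplier enforcing the prediction
    constraint; an optional -log p(pi, phi) prior term could be added the
    same way.  `labeled` is a list of (x, y) pairs, `unlabeled` a list of x.
    """
    pi0, trans, means, covs, eta = params
    total = 0.0
    for x in [pair[0] for pair in labeled] + list(unlabeled):
        _, log_marginal = belief_states(x, pi0, trans, means, covs)
        total -= log_marginal                        # -log p(x_n | pi, phi)
    for x, y in labeled:
        y_hat = predict_sequence_label(x, pi0, trans, means, covs, eta)
        total += lam * binary_log_loss(y, y_hat)     # prediction penalty
    return total
```

In practice this objective would be written in an automatic-differentiation framework so that gradients with respect to $(\pi, \phi, \eta)$ flow through the forward recursion's log marginal likelihood, which is the training strategy the paper's abstract describes.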