Emily B. Fox, Erik B. Sudderth, Michael I. Jordan, and Alan S. Willsky

Bayesian Nonparametric Methods for Learning Markov Switching Processes

[Priors for dynamical systems]

Digital Object Identifier 10.1109/MSP.2010.937999

Markov switching processes, such as the hidden Markov model (HMM) and the switching linear dynamical system (SLDS), are often used to describe rich dynamical phenomena. They describe complex behavior via repeated returns to a set of simpler models: imagine a person alternating between walking, running, and jumping behaviors, or a stock index switching between regimes of high and low volatility. Classical approaches to identification and estimation of these models assume a fixed, prespecified number of dynamical models. We instead examine Bayesian nonparametric approaches that define a prior on Markov switching processes with an unbounded number of potential model parameters (i.e., Markov modes). By leveraging stochastic processes such as the beta process and the Dirichlet process (DP), these methods allow the data to drive the complexity of the learned model, while still permitting efficient inference algorithms. They also lead to generalizations that discover and model dynamical behaviors shared among multiple related time series.

INTRODUCTION
A core problem in statistical signal processing is the partitioning of temporal data into segments, each of which permits a relatively simple statistical description. This segmentation problem arises in a variety of applications, in areas as diverse as speech recognition, computational biology, natural language processing, finance, and cryptanalysis. While in some cases the problem is merely that of detecting temporal changes (in which case the problem can be viewed as one of changepoint detection), in many other cases the temporal segments have a natural meaning in the domain, and the problem is to recognize recurrence of a meaningful entity (e.g., a particular speaker, a gene, a part of speech, or a market condition). This leads naturally to state-space models, where the entities that are to be detected are encoded in the state.

The classical example of a state-space model for segmentation is the HMM [1]. The HMM is based on a discrete state variable and on the probabilistic assertions that the state transitions are Markovian and that the observations are conditionally independent given the state. In this model, temporal segments are equated with states, a natural assumption in some problems but a limiting assumption in many others. Consider, for example, the dance of a honey bee as it switches between a set of turn right, turn left, and waggle dances.

It is natural to view these dances as the temporal segments, as they "permit a relatively simple statistical description." But each dance is not a set of conditionally independent draws from a fixed distribution as required by the HMM—there are additional dynamics to account for. Moreover, it is natural to model these dynamics using continuous variables. These desiderata can be accommodated within the richer framework of Markov switching processes, specific examples of which include the switching vector autoregressive (VAR) process and the SLDS. These models distinguish between the discrete component of the state, which is referred to as a "mode," and the continuous component of the state, which captures the continuous dynamics associated with each mode. These models have become increasingly prominent in applications in recent years [2]–[8].

While Markov switching processes address one key limitation of the HMM, they inherit various other limitations of the HMM. In particular, the discrete component of the state in the Markov switching process has no topological structure (beyond the trivial discrete topology). Thus it is not easy to compare state spaces of different cardinality, and it is not possible to use the state space to encode a notion of similarity between modes. More broadly, many problems involve a collection of state-space models (either HMMs or Markov switching processes), and within the classical framework there is no natural way to talk about overlap between models. A particular instance of this problem arises when there are multiple time series, and where we wish to use overlapping subsets of the modes to describe the different time series. In the section "Multiple Related Time Series," we discuss a concrete example of this problem where the time series are motion-capture videos of humans engaging in exercise routines, and where the modes are specific exercises, such as "jumping jacks" or "squats." We aim to capture the notion that two different people can engage in the same exercise (e.g., jumping jacks) during their routine.

To address these problems, we need to move beyond the simple discrete Markov chain as a description of temporal segmentation. In this article, we describe a richer class of stochastic processes known as combinatorial stochastic processes that provide a useful foundation for the design of flexible models for temporal segmentation. Combinatorial stochastic processes have been studied for several decades in probability theory (see, e.g., [9]), and they have begun to play a role in statistics as well, most notably in the area of Bayesian nonparametric statistics, where they yield Bayesian approaches to clustering and survival analysis (see, e.g., [10]). The work that we present here extends these efforts into the time-series domain. As we aim to show, there is a natural interplay between combinatorial stochastic processes and state-space descriptions of dynamical systems.

Our primary focus is on two specific stochastic processes—the DP and the beta process—and their role in describing modes in dynamical systems. The DP provides a simple description of a clustering process where the number of clusters is not fixed a priori.
Suitably extended to a hierarchical DP (HDP), this stochastic process provides a foundation for the design of state-space models in which the number of modes is random and inferred from the data. In the section "Hidden Markov Models," we discuss the HDP and its connection to the HMM. Building on this connection, the section "Markov Jump Linear Systems" shows how the HDP can be used in the context of Markov switching processes with conditionally linear dynamical modes. Finally, in the section "Multiple Related Time Series," we discuss the beta process and show how it can be used to capture notions of similarity among sets of modes in modeling multiple time series.

[FIG1] Graphical representations of three Markov switching processes: (a) HMM, (b) order 2 switching VAR process, and (c) SLDS. For all models, a discrete-valued Markov process $z_t$ evolves as $z_{t+1} \mid \{\pi_k\}_{k=1}^{K}, z_t \sim \pi_{z_t}$. For the HMM, observations are generated as $y_t \mid \{\theta_k\}_{k=1}^{K}, z_t \sim F(\theta_{z_t})$, whereas the switching VAR(2) process assumes $y_t \sim \mathcal{N}(A_1^{(z_t)} y_{t-1} + A_2^{(z_t)} y_{t-2}, \Sigma)$. The SLDS instead relies on a latent, continuous-valued Markov state $x_t$ to capture the history of the dynamical process, as specified in (15).

HIDDEN MARKOV MODELS
The HMM generates a sequence of latent modes via a discrete-valued Markov chain [1]. Conditioned on this mode sequence, the model assumes that the observations, which may be discrete or continuous valued, are independent.

The HMM is the most basic example of a Markov switching process, and forms the building block for the more complicated processes examined later.

FINITE HMM
Let $z_t$ denote the mode of the Markov chain at time $t$, and $\pi_j$ the mode-specific transition distribution for mode $j$. Given the mode $z_t$, the observation $y_t$ is conditionally independent of the observations and modes at other time steps. The generative process can be described as
$$z_t \mid z_{t-1} \sim \pi_{z_{t-1}}, \qquad y_t \mid z_t \sim F(\theta_{z_t}) \tag{1}$$
for an indexed family of distributions $F(\cdot)$ (e.g., multinomial for discrete data or multivariate Gaussian for real, vector-valued data), where $\theta_i$ are the emission parameters for mode $i$. The notation $x \sim F$ indicates that the random variable $x$ is drawn from a distribution $F$. We use bar notation $x \mid F \sim F$ to specify conditioned-upon random elements, such as a random distribution. The directed graphical model associated with the HMM is shown in Figure 1(a).

One can equivalently represent the HMM via a set of transition probability measures $G_j = \sum_{k=1}^{K} \pi_{jk}\,\delta_{\theta_k}$, where $\delta_\theta$ is a unit mass concentrated at $\theta$. Instead of employing transition distributions on the set of integers (i.e., modes) that index into the collection of emission parameters, we operate directly in the parameter space $\Theta$, and transition between emission parameters with probabilities $\{G_j\}$. Specifically, let $j_{t-1}$ be the unique emission parameter index $j$ such that $\theta'_{t-1} = \theta_j$. Then,
$$\theta'_t \mid \theta'_{t-1} \sim G_{j_{t-1}}, \qquad y_t \mid \theta'_t \sim F(\theta'_t). \tag{2}$$
Here, $\theta'_t \in \{\theta_1, \ldots, \theta_K\}$ takes the place of $\theta_{z_t}$ in (1). A visualization of this process is shown by the trellis diagram of Figure 2.

One can consider a Bayesian HMM by treating the transition probability measures $G_j$ as random, and endowing them with a prior. Formally, a random measure on a measurable space $\Theta$, with sigma algebra $\mathcal{A}$, is defined as a stochastic process whose index set is $\mathcal{A}$. That is, $G(A)$ is a nonnegative random variable for each $A \in \mathcal{A}$. Since the probability measures are solely distinguished by their weights on the shared set of emission parameters $\{\theta_1, \ldots, \theta_K\}$, we consider a prior that independently handles these components. Specifically, take the weights $\pi_j = [\pi_{j1} \cdots \pi_{jK}]$ (i.e., transition distributions) to be independent draws from a $K$-dimensional Dirichlet distribution,
$$\pi_j \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_K), \qquad j = 1, \ldots, K, \tag{3}$$
implying that $\sum_k \pi_{jk} = 1$, as desired. Then, assume that the atoms are drawn as $\theta_j \sim H$ for some base measure $H$ on the parameter space $\Theta$. Depending on the form of the emission distribution, various choices of $H$ lead to computational efficiencies via conjugate analysis.

[FIG2] (a) Trellis representation of an HMM. Each circle represents one of the $K$ possible HMM emission parameters at various time steps. The highlighted circles indicate the selected emission parameter $\theta'_t$ at time $t$, and the arrows represent the set of possible transitions from that HMM mode to each of the $K$ possible next modes. The weights of these arrows indicate the relative probability of the transitions encoded by the mode-specific transition probability measures $G_j$. (b) Transition probability measures $G_1$, $G_2$, and $G_3$ corresponding to the example trellis diagram. (c) A representation of the HMM observations $y_t$, which are drawn from emission distributions parameterized by the highlighted nodes.
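To make the generative process in (1) concrete, the following is a minimal Python sketch (our own illustration, not from the article; the function name and the univariate Gaussian choice for $F(\cdot)$ are assumptions) that draws a mode sequence and observations from a finite HMM.

```python
import numpy as np

def sample_finite_hmm(pi0, Pi, theta, T, seed=None):
    """Draw (z_{1:T}, y_{1:T}) from the finite HMM in (1).

    pi0   : (K,) initial mode distribution
    Pi    : (K, K) transition matrix; row j is the distribution pi_j
    theta : length-K list of (mean, std) Gaussian emission parameters
    """
    rng = np.random.default_rng(seed)
    K = len(pi0)
    z = np.zeros(T, dtype=int)
    y = np.zeros(T)
    for t in range(T):
        # z_t | z_{t-1} ~ pi_{z_{t-1}}  (initial distribution at t = 0)
        p = pi0 if t == 0 else Pi[z[t - 1]]
        z[t] = rng.choice(K, p=p)
        # y_t | z_t ~ F(theta_{z_t}), here a univariate Gaussian
        mu, sigma = theta[z[t]]
        y[t] = rng.normal(mu, sigma)
    return z, y
```

For example, `sample_finite_hmm(np.array([0.5, 0.5]), np.array([[0.9, 0.1], [0.1, 0.9]]), [(0.0, 1.0), (5.0, 1.0)], 200)` yields persistent segments alternating between two emission regimes.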

STICKY HDP-HMM
In the Bayesian HMM of the previous section, we assumed that the number of HMM modes $K$ is known. But what if this is not the case? For example, in the speaker diarization task considered later, determining the number of speakers involved in the meeting is one of the primary inference goals. Moreover, even when a model adequately describes previous observations, it can be desirable to allow new modes to be added as more data are observed. For example, what if more speakers enter the meeting? To avoid restrictions on the size of the mode space, such scenarios naturally lead to priors on probability measures $G_j$ that have an unbounded collection of support points $\theta_k$.

The DP, denoted by $\mathrm{DP}(\gamma, H)$, provides a distribution over countably infinite probability measures
$$G_0 = \sum_{k=1}^{\infty} \beta_k\,\delta_{\theta_k}, \qquad \theta_k \sim H, \tag{4}$$
on a parameter space $\Theta$. The weights are sampled via a stick-breaking construction [11]:
$$\beta_k = \nu_k \prod_{\ell=1}^{k-1} (1 - \nu_\ell), \qquad \nu_k \sim \mathrm{Beta}(1, \gamma). \tag{5}$$
In effect, we have divided a unit-length stick into lengths given by the weights $\beta_k$: the $k$th weight is a random proportion $\nu_k$ of the remaining stick after the first $(k-1)$ weights have been chosen. We denote this distribution by $\beta \sim \mathrm{GEM}(\gamma)$. See Figure 3 for a pictorial representation of this process.

[FIG3] Pictorial representation of the stick-breaking construction of the DP.

The DP has proven useful in many applications due to its clustering properties, which are clearly seen by examining the predictive distribution of draws $\theta'_i \sim G_0$. Because probability measures drawn from a DP are discrete, there is a strictly positive probability of multiple observations $\theta'_i$ taking identical values within the set $\{\theta_k\}$, with $\theta_k$ defined as in (4). For each value $\theta'_i$, let $z_i$ be an indicator random variable that picks out the unique value $\theta_k$ such that $\theta'_i = \theta_{z_i}$. Blackwell and MacQueen [12] derived a Pólya urn representation of the $\theta'_i$:
$$\theta'_i \mid \theta'_1, \ldots, \theta'_{i-1} \sim \frac{\gamma}{\gamma + i - 1} H + \sum_{j=1}^{i-1} \frac{1}{\gamma + i - 1}\,\delta_{\theta'_j} = \frac{\gamma}{\gamma + i - 1} H + \sum_{k=1}^{K} \frac{n_k}{\gamma + i - 1}\,\delta_{\theta_k}. \tag{6}$$
This implies the following predictive distribution on the indicator assignment variables:
$$p(z_{N+1} = z \mid z_1, \ldots, z_N, \gamma) = \frac{\gamma}{N + \gamma}\,\delta(z, K+1) + \frac{1}{N + \gamma} \sum_{k=1}^{K} n_k\,\delta(z, k). \tag{7}$$
Here, $n_k = \sum_{i=1}^{N} \delta(z_i, k)$ is the number of indicator random variables taking the value $k$, and $K+1$ is a previously unseen value. The discrete Kronecker delta $\delta(z, k) = 1$ if $z = k$, and 0 otherwise. The distribution on partitions induced by the sequence of conditional distributions in (7) is commonly referred to as the Chinese restaurant process. Take $i$ to be a customer entering a restaurant with infinitely many tables, each serving a unique dish $\theta_k$. Each arriving customer chooses a table, indicated by $z_i$, in proportion to how many customers are currently sitting at that table. With some positive probability proportional to $\gamma$, the customer starts a new, previously unoccupied table $K+1$. From the Chinese restaurant process, we see that the DP has a reinforcement property that leads to a clustering at the values $\theta_k$. This representation also provides a means of sampling observations from a DP without explicitly constructing the infinite probability measure $G_0 \sim \mathrm{DP}(\gamma, H)$.

One could imagine using the DP to define a prior on the set of HMM transition probability measures $G_j$. Taking each transition measure $G_j$ as an independent draw from $\mathrm{DP}(\gamma, H)$ implies that these probability measures are of the form $\sum_{k=1}^{\infty} \pi_{jk}\,\delta_{\theta_{jk}}$, with weights $\pi_j \sim \mathrm{GEM}(\gamma)$ and atoms $\theta_{jk} \sim H$. Assuming $H$ is absolutely continuous (e.g., a Gaussian distribution), this construction leads to transition measures with nonoverlapping support (i.e., $\theta_{jk} \neq \theta_{\ell k'}$ with probability one). Based on such a construction, we would move from one infinite collection of HMM modes to an entirely new collection at each transition step, implying that previously visited modes would never be revisited. This is clearly not what we intended. Instead, consider the HDP [13], which defines a collection of probability measures $\{G_j\}$ on the same support points $\{\theta_1, \theta_2, \ldots\}$ by assuming that each discrete measure $G_j$ is a variation on a global discrete measure $G_0$. Specifically, the Bayesian hierarchical specification takes $G_j \sim \mathrm{DP}(\alpha, G_0)$, with $G_0$ itself a draw from a $\mathrm{DP}(\gamma, H)$. Through this construction, one can show that the probability measures are described as
$$G_0 = \sum_{k=1}^{\infty} \beta_k\,\delta_{\theta_k}, \qquad \beta \mid \gamma \sim \mathrm{GEM}(\gamma), \qquad \theta_k \mid H \sim H,$$
$$G_j = \sum_{k=1}^{\infty} \pi_{jk}\,\delta_{\theta_k}, \qquad \pi_j \mid \alpha, \beta \sim \mathrm{DP}(\alpha, \beta). \tag{8}$$
Applying the HDP prior to the HMM, we obtain the HDP-HMM of Teh et al. [13].
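Both the stick-breaking construction (5) and the Chinese restaurant predictive rule (7) are straightforward to simulate. Below is a minimal sketch (ours; the truncation level and function names are assumptions, not from the article).

```python
import numpy as np

def gem_weights(gamma, trunc, seed=None):
    """Truncated stick-breaking draw of beta ~ GEM(gamma), as in (5).
    Returns the first `trunc` weights (they sum to slightly less than 1)."""
    rng = np.random.default_rng(seed)
    nu = rng.beta(1.0, gamma, size=trunc)                 # nu_k ~ Beta(1, gamma)
    stick = np.concatenate(([1.0], np.cumprod(1.0 - nu)[:-1]))
    return nu * stick                                     # beta_k = nu_k * remaining stick

def crp_assignments(N, gamma, seed=None):
    """Sequential draws z_1, ..., z_N from the CRP predictive in (7)."""
    rng = np.random.default_rng(seed)
    counts, z = [], []
    for i in range(N):
        weights = np.array(counts + [gamma], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(counts):
            counts.append(1)          # open a new table K + 1
        else:
            counts[k] += 1            # join occupied table k
        z.append(k)
    return np.array(z)
```

Running `crp_assignments(1000, 5.0)` exhibits the reinforcement property described above: a few tables accumulate most customers, while new tables appear at a logarithmic rate.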

Extending the Chinese restaurant process analogy of the DP, one can examine a Chinese restaurant franchise that describes the partitions induced by the HDP. In the context of the HDP-HMM, the infinite collection of emission parameters $\theta_k$ determines a global menu of shared dishes. Each of these emission parameters is also associated with a single, unique restaurant in the franchise. The HDP-HMM assigns a customer $\theta'_t$ to a restaurant based on the previous customer $\theta'_{t-1}$, since it is this parameter that determines the distribution of $\theta'_t$ [see (2)]. Upon entering the restaurant determined by $\theta'_{t-1}$, customer $\theta'_t$ chooses a table with probability proportional to the current occupancy, just as in the DP. The dishes for the tables are then chosen from the global menu $G_0$ based on their popularity throughout the entire franchise, and it is through this pooling of dish selections that the HDP induces a shared sparse subset of model parameters.

By defining $\pi_j \sim \mathrm{DP}(\alpha, \beta)$, the HDP prior encourages modes to have similar transition distributions. In particular, the mode-specific transition distributions are identical in expectation:
$$\mathbb{E}[\pi_{jk} \mid \beta] = \beta_k. \tag{9}$$
Although it is through this construction that a shared sparse mode space is induced, we see from (9) that the HDP-HMM does not differentiate self-transitions from moves between different modes. When modeling data with mode persistence, the flexible nature of the HDP-HMM prior allows mode sequences with unrealistically fast dynamics to have large posterior probability, thus impeding identification of a compact dynamical model which best explains the observations; see the example mode sequence in Figure 4(a). The HDP-HMM places only a small prior penalty on creating an extra mode, but no penalty on that mode having a similar emission parameter to another mode, nor on the time series rapidly switching between two modes. The increase in likelihood obtained by fine-tuning the parameters to the data can counteract the prior penalty on creating this extra mode, leading to significant posterior uncertainty. These rapid dynamics are unrealistic for many real data sets. For example, in the speaker diarization task, it is very unlikely that two speakers rapidly switch who is speaking. Such fast-switching dynamics can harm the predictive performance of the learned model, since parameters are informed by fewer data points. Additionally, in some applications one cares about the accuracy of the inferred label sequence instead of just doing model averaging. In such applications, one would like to be able to incorporate prior knowledge that slow, smoothly varying dynamics are more likely.

To address these issues, Fox et al. [14] proposed to instead sample transition distributions $\pi_j$ as
$$\beta \mid \gamma \sim \mathrm{GEM}(\gamma), \qquad \pi_j \mid \alpha, \kappa, \beta \sim \mathrm{DP}\!\left(\alpha + \kappa,\; \frac{\alpha\beta + \kappa\,\delta_j}{\alpha + \kappa}\right). \tag{10}$$


[FIG4] In (a) and (b), mode sequences are drawn from the HDP-HMM and sticky HDP-HMM priors, respectively. In (c)–(e), observation sequences correspond to draws from a sticky HDP-HMM, an order two HDP-AR-HMM, and an order one BP-AR-HMM with five time series offset for clarity. The observation sequences are colored by the underlying mode sequences.

Here, $\alpha\beta + \kappa\,\delta_j$ indicates that an amount $\kappa > 0$ is added to the $j$th component of $\alpha\beta$. This construction increases the expected probability of self-transition by an amount proportional to $\kappa$. Specifically, the expected set of weights for transition distribution $\pi_j$ is a convex combination of those defined by $\beta$ and a mode-specific weight defined by $\kappa$:
$$\mathbb{E}[\pi_{jk} \mid \beta, \alpha, \kappa] = \frac{\alpha}{\alpha + \kappa}\,\beta_k + \frac{\kappa}{\alpha + \kappa}\,\delta(j, k). \tag{11}$$
When $\kappa = 0$, the original HDP-HMM of Teh et al. [13] is recovered. Because positive $\kappa$ values increase the prior probability $\mathbb{E}[\pi_{jj} \mid \beta, \alpha, \kappa]$ of self-transitions, this model is referred to as the sticky HDP-HMM. See the graphical model of Figure 5(a) and the mode sequence sample path of Figure 4(b). The $\kappa$ parameter is reminiscent of the self-transition bias parameter of the infinite HMM [15], an urn model obtained by integrating out the random measures in the HDP-HMM. However, that paper relied on heuristic, approximate inference methods. The full connection between the infinite HMM and the HDP, as well as the development of a globally consistent inference algorithm, was made in [13], but without a treatment of a self-transition parameter. Note that in the formulation of Fox et al. [14], the HDP concentration parameters $\alpha$ and $\gamma$, and the sticky parameter $\kappa$, are given priors and are thus learned from the data as well. The flexibility garnered by incorporating learning of the sticky parameter within a cohesive Bayesian framework allows the model to additionally capture fast mode-switching if such dynamics are present in the data.

In [14], the sticky HDP-HMM was applied to the speaker diarization task, which involves segmenting an audio recording into speaker-homogeneous regions while simultaneously identifying the number of speakers. The data for the experiments consisted of a standard benchmark data set distributed by NIST as part of the Rich Transcription 2004–2007 meeting recognition evaluations [16], with the observations taken to be the first 19 Mel frequency cepstral coefficients (MFCCs) computed over short, overlapping windows. Because the speaker-specific emissions are not well approximated by a single Gaussian, a DP mixture of Gaussians extension of the sticky HDP-HMM was considered. The sticky parameter proved pivotal in learning such multimodal emission distributions. By combining the mode persistence captured by the sticky HDP-HMM with a model allowing multimodal emissions, state-of-the-art speaker diarization performance was achieved.

[FIG5] Graphical representations of three Bayesian nonparametric variants of the Markov switching processes shown in Figure 1. The plate notation is used to compactly represent replication of nodes in the graph [13], with the number of replicates indicated by the number in the corner of the plate. (a) The sticky HDP-HMM. The mode evolves as $z_{t+1} \mid \{\pi_k\}_{k=1}^{\infty}, z_t \sim \pi_{z_t}$, where $\pi_k \mid \alpha, \kappa, \beta \sim \mathrm{DP}(\alpha + \kappa, (\alpha\beta + \kappa\,\delta_k)/(\alpha + \kappa))$ and $\beta \mid \gamma \sim \mathrm{GEM}(\gamma)$, and observations are generated as $y_t \mid \{\theta_k\}_{k=1}^{\infty}, z_t \sim F(\theta_{z_t})$. The HDP-HMM of [13] has $\kappa = 0$. (b) A switching VAR(2) process and (c) SLDS, each with a sticky HDP-HMM prior on the Markov switching process. The dynamical equations are as in (17).
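A common way to instantiate the sticky prior (10) in practice is a weak-limit (finite, order-$L$) approximation, in which $\mathrm{GEM}(\gamma)$ is replaced by a symmetric $L$-dimensional Dirichlet and each DP row draw becomes a finite Dirichlet. The sketch below is our own rendering of that standard approximation (names and the choice of $L$ are ours), not the exact samplers of [13], [14].

```python
import numpy as np

def sticky_transitions(gamma, alpha, kappa, L, seed=None):
    """Order-L weak-limit approximation to the sticky HDP-HMM prior (10):
    beta ~ Dir(gamma/L, ..., gamma/L) stands in for GEM(gamma), and
    row j is pi_j ~ Dir(alpha * beta + kappa * e_j)."""
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.full(L, gamma / L))
    Pi = np.empty((L, L))
    for j in range(L):
        conc = alpha * beta + kappa * np.eye(L)[j]   # extra mass on self-transition
        Pi[j] = rng.dirichlet(conc)
    return beta, Pi
```

Increasing `kappa` visibly lengthens the mode durations of sequences sampled with these transition rows, mirroring the contrast between Figure 4(a) and Figure 4(b).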

MARKOV JUMP LINEAR SYSTEMS
The HMM's assumption of conditionally independent observations is often insufficient for capturing the temporal dependencies present in many real data sets. Recall the example of the dancing honey bees from the "Introduction" section. In such cases, one can consider more complex Markov switching processes, namely the class of Markov jump linear systems (MJLSs), in which each dynamical mode is modeled via a linear dynamical process. Just as with the HMM, the switching mechanism is based on an underlying discrete-valued Markov mode sequence. Switched affine and piecewise affine models, which we do not consider in this article, instead allow mode transitions to depend on the continuous state of the dynamical system [17].

STATE-SPACE MODELS, VAR PROCESSES, AND FINITE MARKOV JUMP LINEAR SYSTEMS
A state-space model provides a general framework for analyzing many dynamical phenomena. The model consists of an underlying state, $x_t \in \mathbb{R}^n$, with linear dynamics observed via $y_t \in \mathbb{R}^d$. A linear time-invariant state-space model, in which the dynamics do not depend on time, is given by
$$x_t = A x_{t-1} + e_t, \qquad y_t = C x_t + w_t, \tag{12}$$
where $e_t$ and $w_t$ are independent, zero-mean Gaussian noise processes with covariances $\Sigma$ and $R$, respectively. The graphical model for this process is equivalent to that of the HMM depicted in Figure 1(a), replacing $z_t$ with $x_t$.

An order $r$ VAR process, denoted by VAR($r$), with observations $y_t \in \mathbb{R}^d$, can be defined as
$$y_t = \sum_{i=1}^{r} A_i y_{t-i} + e_t, \qquad e_t \sim \mathcal{N}(0, \Sigma). \tag{13}$$
Here, the observations depend linearly on the previous $r$ observation vectors. Every VAR($r$) process can be described in state-space form by, for example, the following transformation:
$$x_t = \begin{bmatrix} A_1 & A_2 & \cdots & A_r \\ I & 0 & \cdots & 0 \\ & \ddots & & \vdots \\ 0 & \cdots & I & 0 \end{bmatrix} x_{t-1} + \begin{bmatrix} I \\ 0 \\ \vdots \\ 0 \end{bmatrix} e_t, \qquad y_t = \begin{bmatrix} I & 0 & \cdots & 0 \end{bmatrix} x_t. \tag{14}$$
On the other hand, not every state-space model may be expressed as a VAR($r$) process for finite $r$ [18].
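The transformation (14) is purely mechanical, as the following helper illustrates (a sketch of ours, not code from the article): given the lag matrices $A_1, \ldots, A_r$, it assembles the companion-form matrices so that the state stacks the $r$ most recent observations.

```python
import numpy as np

def var_to_state_space(A_list):
    """Companion-form state-space realization (14) of a VAR(r) process.

    A_list: list of r (d, d) lag matrices [A_1, ..., A_r].
    Returns (A, B, C) with x_t = A x_{t-1} + B e_t and y_t = C x_t.
    """
    r = len(A_list)
    d = A_list[0].shape[0]
    A = np.zeros((r * d, r * d))
    A[:d, :] = np.hstack(A_list)                  # top block row [A_1 ... A_r]
    if r > 1:
        A[d:, :-d] = np.eye((r - 1) * d)          # shift past observations down
    B = np.vstack([np.eye(d), np.zeros(((r - 1) * d, d))])  # noise drives top block
    C = np.hstack([np.eye(d), np.zeros((d, (r - 1) * d))])  # y_t reads top block
    return A, B, C
```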

Building on the HMM of the section "Finite HMM," we define an SLDS by
$$z_t \sim \pi_{z_{t-1}}, \qquad x_t = A^{(z_t)} x_{t-1} + e_t(z_t), \qquad y_t = C x_t + w_t. \tag{15}$$
Here, we assume the process noise $e_t(z_t) \sim \mathcal{N}(0, \Sigma^{(z_t)})$ is mode-specific, while the measurement mechanism is not. This assumption could be modified to allow for both a mode-specific measurement matrix $C^{(z_t)}$ and noise $w_t(z_t) \sim \mathcal{N}(0, R^{(z_t)})$. However, such a choice is not always necessary nor appropriate for certain applications, and can have implications on the identifiability of the model. We similarly define a switching VAR($r$) process by
$$z_t \sim \pi_{z_{t-1}}, \qquad y_t = \sum_{i=1}^{r} A_i^{(z_t)} y_{t-i} + e_t(z_t). \tag{16}$$
Both the SLDS and the switching VAR process are contained within the class of MJLSs, with graphical model representations shown in Figure 1(b) and (c); compare to that of the HMM in Figure 1(a).

HDP-AR-HMM AND HDP-SLDS
In the formulations of the MJLS mentioned in the section "Markov Jump Linear Systems," it was assumed that the number of dynamical modes was known. However, it is often desirable to relax this assumption to provide more modeling flexibility. It has been shown that in such cases, the sticky HDP-HMM can be extended as a Bayesian nonparametric approach to learning both the SLDS and switching VAR processes [19], [20]. Specifically, the transition distributions are defined just as in the sticky HDP-HMM; however, instead of independent observations, each mode now has conditionally linear dynamics. The generative processes for the resulting HDP-AR-HMM and HDP-SLDS are summarized in (17), with an example HDP-AR-HMM observation sequence depicted in Figure 4(d):
$$\begin{array}{lll}
 & \text{HDP-AR-HMM} & \text{HDP-SLDS} \\
\text{Mode dynamics} & z_t \sim \pi_{z_{t-1}} & z_t \sim \pi_{z_{t-1}} \\
\text{Observation dynamics} & y_t = \displaystyle\sum_{i=1}^{r} A_i^{(z_t)} y_{t-i} + e_t(z_t) & x_t = A^{(z_t)} x_{t-1} + e_t(z_t) \\
 & & y_t = C x_t + w_t.
\end{array} \tag{17}$$
Here, $\pi_j$ is as defined in (10). The issue, then, is in determining an appropriate prior on the dynamic parameters. In [19], a conjugate matrix-normal inverse-Wishart (MNIW) prior [21] was proposed for the dynamic parameters $\{A^{(k)}, \Sigma^{(k)}\}$ in the case of the HDP-SLDS, and $\{A_1^{(k)}, \ldots, A_r^{(k)}, \Sigma^{(k)}\}$ for the HDP-AR-HMM. The HDP-SLDS additionally assumes an inverse-Wishart prior on the measurement noise $R$; however, the measurement matrix, $C$, is fixed for reasons of identifiability. The MNIW prior assumes knowledge of either the autoregressive order $r$ or the underlying state dimension $n$ of the switching VAR process or SLDS, respectively. Alternatively, Fox et al. [20] explore an automatic relevance determination (ARD) sparsity-inducing prior [22]–[24] as a means of learning an MJLS with variable order structure. The ARD prior penalizes nonzero components of the model through a zero-mean Gaussian prior with gamma-distributed precision. In the context of the HDP-AR-HMM and HDP-SLDS, a maximal autoregressive order or SLDS state dimension is assumed. Then, a structured version of the ARD prior is employed so as to drive entire VAR lag blocks $A_i^{(k)}$, or columns of the SLDS matrix $A^{(k)}$, to zero, allowing one to infer nondynamical components of a given dynamical mode.
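To make the switching VAR dynamics of (16)–(17) concrete, here is a short forward-simulation sketch (ours, with fixed rather than HDP-sampled parameters); substituting transition rows drawn as in (10) would produce forward samples from the HDP-AR-HMM prior, as in Figure 4(d).

```python
import numpy as np

def sample_switching_var(Pi, A, Sigma, T, seed=None):
    """Simulate the switching VAR(r) process of (16).

    Pi    : (K, K) mode transition matrix
    A     : (K, r, d, d) lag matrices A_i^{(k)}
    Sigma : (K, d, d) mode-specific process noise covariances
    """
    rng = np.random.default_rng(seed)
    K, r, d, _ = A.shape
    z = np.zeros(T, dtype=int)
    y = np.zeros((T, d))
    for t in range(T):
        z[t] = rng.integers(K) if t == 0 else rng.choice(K, p=Pi[z[t - 1]])
        mean = np.zeros(d)
        for i in range(min(t, r)):              # only the available history
            mean += A[z[t], i] @ y[t - 1 - i]   # A_{i+1}^{(z_t)} y_{t-1-i}
        y[t] = rng.multivariate_normal(mean, Sigma[z[t]])
    return z, y
```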

The previous work of Fox et al. [8] considered a related, yet simpler, formulation for modeling a maneuvering target as a fixed LDS driven by a switching exogenous input. Since the number of maneuver modes was assumed unknown, the exogenous input was taken to be the emissions of an HDP-HMM. This work can be viewed as an extension of the work by Caron et al. [25], in which the exogenous input was an independent noise process generated from a DP mixture model. The HDP-SLDS of [19] is a departure from these works, since the dynamic parameters themselves change with the mode, providing a much more expressive model.

In [19], the utility of the HDP-SLDS and HDP-AR-HMM was demonstrated on two different problems: 1) detecting changes in the volatility of the IBOVESPA stock index and 2) segmenting sequences of honey bee dances. The dynamics underlying both of these data sets appear to be quite complex, yet can be described by repeated returns to simpler dynamical models, and as such have been modeled with Markov switching processes [26], [27]. Without prespecifying domain-specific knowledge, and instead simply relying on a set of observations along with weakly informative hyperprior settings, the HDP-SLDS and HDP-AR-HMM were able to discover the underlying structure of the data with performance competitive with these alternative methods, and consistent with domain expert analysis.

MULTIPLE RELATED TIME SERIES
In many applications, one would like to discover and model dynamical behaviors that are shared among several related time series. By jointly modeling such time series, one can improve parameter estimates, especially in the case of limited data, and find interesting structure in the relationships between the time series. Assuming that each of these time series is modeled via a Markov switching process, our Bayesian nonparametric approach envisions a large library of behaviors, with each time series or object exhibiting a subset of these behaviors. We aim to allow flexibility in the number of total and sequence-specific behaviors, while encouraging objects to share similar subsets of the behavior library. Additionally, a key aspect of a flexible model for relating time series is to allow the objects to switch between behaviors in different manners (e.g., even if two people both exhibit running and walking behaviors, they might alternate between these dynamical modes at different frequencies).

One could imagine a Bayesian nonparametric approach based on tying together multiple time series under the HDP prior outlined in the section "Sticky HDP-HMM." However, such a formulation assumes that all time series share the same set of behaviors and switch among them in exactly the same manner. Alternatively, Fox et al. [28] consider a featural representation, and show the utility of an alternative family of priors based on the beta process [29], [30].

FINITE FEATURE MODELS OF MARKOV SWITCHING PROCESSES
Assume we have a finite collection of behaviors $\{\theta_1, \ldots, \theta_K\}$ that are shared in an unknown manner among $N$ objects. One can represent the set of behaviors each object exhibits via an associated list of features. A standard featural representation for describing the $N$ objects employs an $N \times K$ binary matrix $F = \{f_{ik}\}$. Setting $f_{ik} = 1$ implies that object $i$ exhibits feature $k$ for some $t \in \{1, \ldots, T_i\}$, where $T_i$ is the length of the $i$th time series. To discover the structure of behavior sharing (i.e., the feature matrix), one takes the feature vector $f_i = [f_{i1}, \ldots, f_{iK}]$ to be random. Assuming each feature is treated independently, this necessitates defining a feature inclusion probability $\nu_k$ for each feature $k$. Within a Bayesian framework, these probabilities are given a prior that is then informed by the data to provide a posterior distribution on feature inclusion probabilities. For example, one could consider the finite Bayesian feature model of [31], which assumes
$$\nu_k \sim \mathrm{Beta}\!\left(\frac{\alpha}{K}, 1\right), \qquad f_{ik} \mid \nu_k \sim \mathrm{Bernoulli}(\nu_k). \tag{18}$$

TIME SERIES is then tossed to define whether fik is 0 or 1 (i.e., the of In many applications, one would like to discover and model a ). Because each feature is generated indepen- dynamical behaviors which are shared among several related dently, and a Beta1a, b2 random variable has mean a/1a 1 b2, time series. By jointly modeling such time series, one can the expected number of active features in an N 3 K matrix is improve parameter estimates, especially in the case of limited Na /1a/K 1 12 , Na. data, and find interesting structure in the relationships between A hierarchical Bayesian featural model also requires priors the time series. Assuming that each of these time series is mod- for behavior parameters uk, and the process by which each eled via a Markov switching process, our Bayesian nonparamet- object switches among its selected behaviors. In the case of ric approach envisions a large library of behaviors, with each Markov switching processes, this switching mechanism is gov- 1i2 time series or object exhibiting a subset of these behaviors. We erned by the transition distributions of object i, pj . As an aim to allow flexibility in the number of total and sequence- example of such a model, imagine that each of the N objects is specific behaviors, while encouraging objects to share similar sub- described by a switching VAR process (see the section “Markov sets of the behavior library. Additionally, a key aspect of a flexible Jump Linear Systems”) that moves among some subset of K model for relating time series is to allow the objects to switch possible dynamical modes. Each of the VAR process parameters 5A 6 f between behaviors in different manners (e.g., even if two people uk 5 k, Sk describes a unique behavior. The feature vector i both exhibit running and walking behaviors, they might alternate constrains the transitions of object i to solely be between the between these dynamical modes at different frequencies). selected subset of the K possible VAR processes by forcing 1i2 One could imagine a Bayesian nonparametric approach pjk 5 0 for all k such that fik 5 0. One natural construction based on tying together multiple time series under the HDP places Dirichlet priors on the transition distributions, and some prior outlined in the section “Sticky HDP-HMM.” However, prior H (e.g., an MNIW) on the behavior parameters. Then, such a formulation assumes that all time series share the same 1i2 f | 1 3 c c 4 # f 2 | set of behaviors, and switch among them in exactly the same pj | i, g, k Dir g, , g, g1k, g, , g i uk H, manner. Alternatively, Fox et al. [28] consider a featural repre- (19)

Let $y_t^{(i)}$ represent the observation vector of the $i$th object at time $t$, and $z_t^{(i)}$ the latent behavior mode. Assuming an order $r$ switching VAR process, the dynamics of the $i$th object are described by the following generative process:
$$z_t^{(i)} \mid z_{t-1}^{(i)} \sim \pi^{(i)}_{z_{t-1}^{(i)}}, \tag{20}$$
$$y_t^{(i)} = \sum_{j=1}^{r} A_{j, z_t^{(i)}}\, y_{t-j}^{(i)} + e_t^{(i)}(z_t^{(i)}) \triangleq A_{z_t^{(i)}}\, \tilde{y}_t^{(i)} + e_t^{(i)}(z_t^{(i)}), \tag{21}$$
where $e_t^{(i)}(k) \sim \mathcal{N}(0, \Sigma_k)$, $A_k = [A_{1,k} \cdots A_{r,k}]$, and $\tilde{y}_t^{(i)} = [y_{t-1}^{(i)T} \cdots y_{t-r}^{(i)T}]^T$. The standard HMM with Gaussian emissions arises as a special case of this model when $A_k = 0$ for all $k$.

A BAYESIAN NONPARAMETRIC FEATURAL MODEL UTILIZING BETA AND BERNOULLI PROCESSES
Following the theme of the sections "Sticky HDP-HMM" and "HDP-AR-HMM and HDP-SLDS," it is often desirable to consider a Bayesian nonparametric featural model that relaxes the assumption that the number of features is known or bounded. Such a featural model seeks to allow for infinitely many features, while encouraging a sparse, finite representation. Just as the DP provides a useful Bayesian nonparametric prior in clustering applications (i.e., when each observation is associated with a single parameter $\theta_k$), it has been shown that a stochastic process known as the beta process is useful in Bayesian nonparametric featural models (i.e., when each observation is associated with a subset of parameters) [30].

The beta process is a special case of a general class of stochastic processes known as completely random measures [32]. A completely random measure $G$ is defined such that for any disjoint sets $A_1$ and $A_2$, the corresponding random measures $G(A_1)$ and $G(A_2)$ are independent. This idea generalizes the family of independent increments processes on the real line. All completely random measures (up to a deterministic component) can be constructed from realizations of a nonhomogeneous Poisson process [32]. Specifically, a Poisson rate measure $\eta$ is defined on a product space $\Theta \otimes \mathbb{R}$, and a draw from the specified Poisson process yields a collection of points $\{\theta_j, \nu_j\}$ that can be used to define a completely random measure
$$G = \sum_{j=1}^{\infty} \nu_j\,\delta_{\theta_j}. \tag{22}$$
This construction assumes $\eta$ has infinite mass, yielding the countably infinite collection of points from the Poisson process. From (22), we see that completely random measures are discrete. Letting the rate measure be defined as a product of a base measure $G_0$ and an improper gamma distribution,
$$\eta(d\theta, d\nu) = c\,\nu^{-1} e^{-c\nu}\,d\nu\,G_0(d\theta), \tag{23}$$
with $c > 0$, gives rise to completely random measures $G \sim \mathrm{GP}(c, G_0)$, where GP denotes a gamma process. Normalizing $G$ yields draws from a $\mathrm{DP}(\alpha, G_0/\alpha)$, with $\alpha = G_0(\Theta)$. Random probability measures are necessarily not completely random, since the random variables $G(A_1)$ and $G(A_2)$ for disjoint sets $A_1$ and $A_2$ are dependent due to the normalization constraint.

Now consider a rate measure defined as the product of a base measure $B_0$, with total mass $B_0(\Theta) = \alpha$, and an improper beta distribution on the product space $\Theta \otimes [0, 1]$:
$$\eta(d\theta, d\nu) = c\,\nu^{-1} (1 - \nu)^{c-1}\,d\nu\,B_0(d\theta), \tag{24}$$
where, once again, $c > 0$. The resulting completely random measure is known as the beta process, with draws denoted by $B \sim \mathrm{BP}(c, B_0)$. Note that using this construction, the weights $\nu_k$ of the atoms in $B$ lie in the interval $(0, 1)$. Since $\eta$ is $\sigma$-finite, Campbell's theorem [33] guarantees that for $\alpha$ finite, $B$ has finite expected measure. The characteristics of this process define desirable traits for a Bayesian nonparametric featural model: we have a countably infinite collection of coin-tossing probabilities (one for each of our infinite number of features), but only a sparse, finite subset are active in any realization.

The beta process is conjugate to a class of Bernoulli processes [30], denoted by $\mathrm{BeP}(B)$, which provide our sought-for featural representation. A realization $X_i \sim \mathrm{BeP}(B)$, with $B$ an atomic measure, is a collection of unit-mass atoms on $\Theta$ located at some subset of the atoms in $B$. In particular,
$$f_{ik} \sim \mathrm{Bernoulli}(\nu_k) \tag{25}$$
is sampled independently for each atom $\theta_k$ in $B$, and then $X_i = \sum_k f_{ik}\,\delta_{\theta_k}$. In many applications, we interpret the atom locations $\theta_k$ as a shared set of global features. A Bernoulli process realization $X_i$ then determines the subset of features allocated to object $i$:
$$B \mid B_0, c \sim \mathrm{BP}(c, B_0), \qquad X_i \mid B \sim \mathrm{BeP}(B), \quad i = 1, \ldots, N. \tag{26}$$
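The infinite measure $B$ cannot be instantiated directly, but a finite-$K$ approximation is simple and, for $c = 1$, coincides with the finite beta-Bernoulli model (18): as $K \to \infty$, drawing $\nu_k \sim \mathrm{Beta}(\alpha/K, 1)$ is understood to recover a draw from $\mathrm{BP}(1, B_0)$ with $B_0(\Theta) = \alpha$. The sketch below is our own illustration under that assumption; the Gaussian atom locations merely stand in for draws from a normalized base measure.

```python
import numpy as np

def sample_bp_bernoulli(N, alpha, K, seed=None):
    """Finite-K approximation to B ~ BP(1, B_0) with total mass alpha,
    followed by N Bernoulli process draws X_i ~ BeP(B), as in (25)-(26)."""
    rng = np.random.default_rng(seed)
    nu = rng.beta(alpha / K, 1.0, size=K)       # atom weights of (approximate) B
    theta = rng.normal(size=K)                  # stand-in atom locations theta_k
    F = (rng.random((N, K)) < nu).astype(int)   # f_ik ~ Bernoulli(nu_k), rows = X_i
    return theta, nu, F
```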

Computationally, Bernoulli process realizations $X_i$ are often summarized by an infinite vector of binary indicator variables $f_i = [f_{i1}, f_{i2}, \ldots]$, where $f_{ik} = 1$ if and only if object $i$ exhibits feature $k$. Using the beta process measure $B$ to tie together the feature vectors encourages them to share similar features while allowing object-specific variability.

As shown by Thibaux and Jordan [30], marginalizing over the latent beta process measure $B$, and taking $c = 1$, induces a predictive distribution on feature indicators known as the Indian buffet process (IBP) [31]. The IBP is a culinary metaphor inspired by the Chinese restaurant process of (7), which is itself the predictive distribution on partitions induced by the DP. The Indian buffet consists of an infinitely long buffet line of dishes, or features. The first arriving customer, or object, chooses $\mathrm{Poisson}(\alpha)$ dishes. Each subsequent customer $i$ selects a previously tasted dish $k$ with probability $m_k/i$, proportional to the number of previous customers $m_k$ to sample it, and also samples $\mathrm{Poisson}(\alpha/i)$ new dishes. The feature matrix associated with a realization from an IBP is shown in Figure 6(b).

[FIG6] (a) Graphical model of the BP-AR-HMM. The beta process distributed measure $B \mid B_0 \sim \mathrm{BP}(1, B_0)$ is represented by its masses $\omega_k$ and locations $\theta_k$, as in (22). The features are then conditionally independent draws $f_{ik} \mid \omega_k \sim \mathrm{Bernoulli}(\omega_k)$, and are used to define feature-constrained transition distributions $\pi_j^{(i)} \mid f_i, \gamma, \kappa \sim \mathrm{Dir}([\gamma, \ldots, \gamma, \gamma + \kappa, \gamma, \ldots] \otimes f_i)$. The switching VAR dynamics are as in (21). (b) An example feature matrix $F$ with elements $f_{ik}$ for 100 objects, drawn from the Indian buffet predictive distribution using $B_0(\Theta) = \alpha = 10$.

BP-AR-HMM
Recall the model of the section "Finite Feature Models of Markov Switching Processes," in which the binary feature indicator variables $f_{ik}$ denote whether object $i$ exhibits dynamical behavior $k$ for some $t \in \{1, \ldots, T_i\}$. Now, however, take $f_i$ to be an infinite-dimensional vector of feature indicators realized from the beta process featural model of the section "A Bayesian Nonparametric Featural Model Utilizing Beta and Bernoulli Processes." Continuing our focus on switching VAR processes, we define a beta process autoregressive HMM (BP-AR-HMM) [28], in which the features indicate which behaviors are utilized by each object or sequence. Considering the feature space (i.e., the set of autoregressive parameters) and the temporal dynamics (i.e., the set of transition distributions) as separate dimensions, one can think of the BP-AR-HMM as a spatiotemporal process comprised of a (continuous) beta process in space and discrete-time Markovian dynamics in time.

Given $f_i$, the $i$th object's Markov transitions among its set of dynamic behaviors are governed by a set of feature-constrained transition distributions $\pi^{(i)} = \{\pi_k^{(i)}\}$. In particular, motivated by the fact that Dirichlet-distributed probability mass functions can be generated via normalized gamma random variables, for each object $i$ we define a doubly infinite collection of random variables
$$\eta_{jk}^{(i)} \mid \gamma, \kappa \sim \mathrm{Gamma}(\gamma + \kappa\,\delta(j, k), 1). \tag{27}$$
Using this collection of transition variables, denoted by $\eta^{(i)}$, one can define object-specific, feature-constrained transition distributions:
$$\pi_j^{(i)} = \frac{\big[\eta_{j1}^{(i)}\;\;\eta_{j2}^{(i)}\;\;\cdots\big] \otimes f_i}{\sum_{k \mid f_{ik} = 1} \eta_{jk}^{(i)}}. \tag{28}$$
This construction defines $\pi_j^{(i)}$ over the full set of positive integers, but assigns positive mass only at indices $k$ where $f_{ik} = 1$. The preceding generative process can be equivalently represented via a sample $\tilde{\pi}_j^{(i)}$ from a finite Dirichlet distribution of dimension $K_i = \sum_k f_{ik}$, containing the nonzero entries of $\pi_j^{(i)}$:
$$\tilde{\pi}_j^{(i)} \mid f_i, \gamma, \kappa \sim \mathrm{Dir}([\gamma, \ldots, \gamma, \gamma + \kappa, \gamma, \ldots, \gamma]). \tag{29}$$
The $\kappa$ hyperparameter places extra expected mass on the component of $\tilde{\pi}_j^{(i)}$ corresponding to a self-transition $\pi_{jj}^{(i)}$, analogously to the sticky hyperparameter of the section "Sticky HDP-HMM." To complete the Bayesian model specification, a conjugate MNIW prior is placed on the shared collection of dynamic parameters $\theta_k = \{A_k, \Sigma_k\}$. Since the dynamic parameters are shared by all time series, posterior inference of each parameter set $\theta_k$ relies on pooling data among the time series that have $f_{ik} = 1$. It is through this pooling of data that one may achieve more robust parameter estimates than from considering each time series individually. The resulting model is depicted in Figure 6(a), and a collection of observation sequences is shown in Figure 4(e).
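The normalized-gamma construction (27)–(28) is equally direct to implement; a sketch of ours follows, truncating the "doubly infinite" collection at the number of instantiated features and assuming $f_i$ has at least one active entry.

```python
import numpy as np

def bp_ar_hmm_transitions(f_i, gamma, kappa, seed=None):
    """Feature-constrained transition matrix for one object via (27)-(28).

    f_i : (K,) binary feature vector (a finite truncation of the infinite f_i)
    """
    rng = np.random.default_rng(seed)
    K = len(f_i)
    # eta_jk ~ Gamma(gamma + kappa * delta(j, k), 1), as in (27)
    eta = rng.gamma(gamma + kappa * np.eye(K), 1.0)
    masked = eta * f_i                                  # Hadamard product with f_i
    Pi = masked / masked.sum(axis=1, keepdims=True)     # normalize as in (28)
    return Pi
```

For example, `bp_ar_hmm_transitions(np.array([1, 0, 1, 1]), gamma=1.0, kappa=2.0)` yields rows supported only on features $\{0, 2, 3\}$, with extra expected mass on self-transitions.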

The ability of the BP-AR-HMM to find common behaviors among a collection of time series was demonstrated on data from the CMU motion capture database [34]. As an illustrative example, a set of six exercise routines was examined, where each of these routines used some combination of the following motion categories: running in place, jumping jacks, arm circles, side twists, knee raises, squats, punching, up and down, two variants of toe touches, arch over, and a reach-out stretch. The overall performance of the BP-AR-HMM showed a clear ability to find common motions, and provided more accurate movie frame labels than previously considered approaches [35]. Most significantly, the BP-AR-HMM provided a superior ability to discover the shared feature structure, while allowing objects to exhibit unique features.

CONCLUSIONS
In this article, we explored a Bayesian nonparametric approach to learning Markov switching processes. This framework requires one to make fewer assumptions about the underlying dynamics, and thereby allows the data to drive the complexity of the inferred model. We began by examining a Bayesian nonparametric HMM, the sticky HDP-HMM, that uses a hierarchical DP prior to regularize an unbounded mode space. We then considered extensions to Markov switching processes with richer, conditionally linear dynamics, including the HDP-AR-HMM and HDP-SLDS. We concluded by considering methods for transferring knowledge among multiple related time series. We argued that a featural representation is more appropriate than a rigid global clustering, as it encourages sharing of behaviors among objects while still allowing sequence-specific variability. In this context, the beta process provides an appealing alternative to the DP.

The models presented herein, while representing a flexible alternative to their parametric counterparts in terms of defining the set of dynamical modes, still maintain a number of limitations. First, the models assume Markovian dynamics with observations on a discrete, evenly spaced temporal grid. Extensions to semi-Markov formulations and nonuniform grids are interesting directions for future research. Second, there is still the question of which dynamical model is appropriate for a given data set: HMM, AR-HMM, or SLDS? The fact that the models are nested (i.e., HMM ⊂ AR-HMM ⊂ SLDS) aids in this decision process: choose the simplest formulation that does not egregiously break the model assumptions. For example, the honey bee observations are clearly not independent given the dance mode, so choosing an HMM is likely not going to provide desirable performance. Typically, it is useful to have domain-specific knowledge of at least one example of a time-series segment that can be used to design the structure of individual modes in a model. Overall, however, this issue of model selection in the Bayesian nonparametric setting is an open area of research. Finally, given the Bayesian framework, the models that we have presented necessitate a choice of prior. We have found in practice that the models are relatively robust to the hyperprior settings for the concentration parameters. On the other hand, the choice of base measure tends to affect results significantly, which is typical of simpler Bayesian nonparametric models such as DP mixtures. We have found that quasi-empirical Bayes approaches for setting the base measure tend to help push the mass of the distribution into reasonable ranges (see [36] for details).

Our focus in this article has been the advantages of various hierarchical, nonparametric Bayesian models; detailed algorithms for learning and inference were omitted. One major advantage of the particular Bayesian nonparametric approaches explored in this article is that they lead to computationally efficient methods for learning Markov switching models of unknown order. We point the interested reader to [13], [14], [19], and [28] for detailed presentations of Markov chain Monte Carlo algorithms for inference and learning.

AUTHORS
Emily B. Fox ([email protected]) received the S.B., M.Eng., E.E., and Ph.D. degrees from the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology in 2004, 2005, 2008, and 2009, respectively. She is currently a postdoctoral scholar in the Department of Statistical Science at Duke University. She received both the National Defense Science and Engineering Graduate (NDSEG) and National Science Foundation (NSF) Graduate Research fellowships and currently holds an NSF Mathematical Sciences Postdoctoral Research Fellowship. She was awarded the 2009 Leonard J. Savage Thesis Award in Applied Methodology, the 2009 MIT EECS Jin-Au Kong Outstanding Doctoral Thesis Prize, the 2005 Chorafas Award for superior contributions in research, and the 2005 MIT EECS David Adler Memorial Second Place Master's Thesis Prize. Her research interests are in Bayesian and Bayesian nonparametric approaches to multivariate time series analysis.

Erik B. Sudderth ([email protected]) is an assistant professor in the Department of Computer Science at Brown University. He received the bachelor's degree (summa cum laude) in electrical engineering from the University of California, San Diego, and the master's and Ph.D. degrees in electrical engineering and computer science from the Massachusetts Institute of Technology. His research interests include probabilistic graphical models; nonparametric Bayesian analysis; and applications of statistical machine learning in computer vision, signal processing, artificial intelligence, and geoscience. He was awarded a National Defense Science and Engineering Graduate Fellowship (1999) and an Intel Foundation Doctoral Fellowship (2004), and in 2008 was named one of "AI's 10 to Watch" by IEEE Intelligent Systems Magazine.

Michael I. Jordan ([email protected]) is the Pehong Chen Distinguished Professor in the Department of Electrical Engineering and Computer Science and the Department of Statistics at the University of California, Berkeley. His research in recent years has focused on Bayesian nonparametric analysis, probabilistic graphical models, spectral methods, kernel machines, and applications to problems in computational biology, information retrieval, signal processing, and speech recognition. He received the ACM/AAAI Allen Newell Award in 2009, the SIAM Activity Group on Optimization Prize in 2008, and the IEEE Neural Networks Pioneer Award in 2006. He was named as a Neyman Lecturer and a Medallion Lecturer by the Institute of Mathematical Statistics. He is a Fellow of the American Association for the Advancement of Science, the IMS, the IEEE, the AAAI, and the ASA. In 2010, he was named to the National Academy of Engineering and the National Academy of Sciences.

Alan S. Willsky ([email protected]) joined the Massachusetts Institute of Technology in 1973 and is the Edwin Sibley Webster Professor of Electrical Engineering and Computer Science and director of the Laboratory for Information and Decision Systems. He was a founder of Alphatech, Inc. and chief scientific consultant. From 1998 to 2002, he served on the U.S. Air Force Scientific Advisory Board. He has received numerous awards, including the 1975 American Automatic Control Council Donald P. Eckman Award, the 1980 IEEE Browder J. Thompson Memorial Award, the 2004 IEEE Donald G. Fink Prize Paper Award, a Doctorat Honoris Causa from Université de Rennes in 2005, and a 2010 IEEE Signal Processing Society Technical Achievement Award. In 2010, he was elected to the National Academy of Engineering. His research interests are in the development and application of advanced methods of estimation, machine learning, and statistical signal and image processing.

REFERENCES
[1] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[2] V. Pavlović, J. Rehg, and J. MacCormick, "Learning switching linear models of human motion," in Proc. Advances in Neural Information Processing Systems, 2001, vol. 13, pp. 981–987.
[3] L. Ren, A. Patrick, A. Efros, J. Hodgins, and J. Rehg, "A data-driven approach to quantifying natural human motion," in Proc. Special Interest Group on Computer Graphics and Interactive Techniques, Aug. 2005, vol. 24, no. 3, pp. 1090–1097.
[4] C.-J. Kim, "Dynamic linear models with Markov-switching," J. Econometrics, vol. 60, no. 1–2, pp. 1–22, 1994.
[5] M. So, K. Lam, and W. Li, "A stochastic volatility model with Markov switching," J. Bus. Econ. Statist., vol. 16, no. 2, pp. 244–253, 1998.
[6] C. Carvalho and H. Lopes, "Simulation-based sequential analysis of Markov switching stochastic volatility models," Comput. Statist. Data Anal., vol. 51, no. 9, pp. 4526–4542, 2007.
[7] X. R. Li and V. Jilkov, "Survey of maneuvering target tracking—Part V: Multiple-model methods," IEEE Trans. Aerosp. Electron. Syst., vol. 41, no. 4, pp. 1255–1321, 2005.
[8] E. Fox, E. Sudderth, and A. Willsky, "Hierarchical Dirichlet processes for tracking maneuvering targets," in Proc. Int. Conf. Information Fusion, July 2007.
[9] J. Pitman, "Combinatorial stochastic processes," Dept. Statist., Univ. California, Berkeley, Tech. Rep. 621, 2002.
[10] N. Hjort, C. Holmes, P. Müller, and S. Walker, Eds., Bayesian Nonparametrics: Principles and Practice. Cambridge, U.K.: Cambridge Univ. Press, 2010.
[11] J. Sethuraman, "A constructive definition of Dirichlet priors," Statist. Sin., vol. 4, pp. 639–650, 1994.
[12] D. Blackwell and J. MacQueen, "Ferguson distributions via Pólya urn schemes," Ann. Statist., vol. 1, no. 2, pp. 353–355, 1973.
[13] Y. Teh, M. Jordan, M. Beal, and D. Blei, "Hierarchical Dirichlet processes," J. Amer. Statist. Assoc., vol. 101, no. 476, pp. 1566–1581, 2006.
[14] E. Fox, E. Sudderth, M. Jordan, and A. Willsky, "An HDP-HMM for systems with state persistence," in Proc. Int. Conf. Machine Learning, July 2008, pp. 312–319.
[15] M. Beal, Z. Ghahramani, and C. Rasmussen, "The infinite hidden Markov model," in Proc. Advances in Neural Information Processing Systems, 2002, vol. 14, pp. 577–584.
[16] NIST. (2007). Rich transcriptions database [Online]. Available: http://www.nist.gov/speech/tests/rt/
[17] S. Paoletti, A. Juloski, G. Ferrari-Trecate, and R. Vidal, "Identification of hybrid systems: A tutorial," Eur. J. Contr., vol. 13, no. 2–3, pp. 242–260, 2007.
[18] M. Aoki and A. Havenner, "State space modeling of multiple time series," Econom. Rev., vol. 10, no. 1, pp. 1–59, 1991.
[19] E. Fox, E. Sudderth, M. Jordan, and A. Willsky, "Nonparametric Bayesian learning of switching dynamical systems," in Proc. Advances in Neural Information Processing Systems, 2009, vol. 21, pp. 457–464.
[20] E. Fox, E. Sudderth, M. I. Jordan, and A. Willsky, "Nonparametric Bayesian identification of jump systems with sparse dependencies," in Proc. 15th IFAC Symp. System Identification, July 2009.
[21] M. West and J. Harrison, Bayesian Forecasting and Dynamic Models. New York: Springer-Verlag, 1997.
[22] D. MacKay, Bayesian Methods for Backprop Networks (Series Models of Neural Networks, III). New York: Springer-Verlag, 1994, ch. 6, pp. 211–254.
[23] R. Neal, Ed., Bayesian Learning for Neural Networks (Lecture Notes in Statistics, vol. 118). New York: Springer-Verlag, 1996.
[24] M. Beal, "Variational algorithms for approximate Bayesian inference," Ph.D. dissertation, Univ. College London, London, U.K., 2003.
[25] F. Caron, M. Davy, A. Doucet, E. Duflos, and P. Vanheeghe, "Bayesian inference for dynamic models with Dirichlet process mixtures," in Proc. Int. Conf. Information Fusion, July 2006.
[26] C. Carvalho and M. West, "Dynamic matrix-variate graphical models," Bayesian Anal., vol. 2, no. 1, pp. 69–98, 2007.
[27] S. Oh, J. Rehg, T. Balch, and F. Dellaert, "Learning and inferring motion patterns using parametric segmental switching linear dynamic systems," Int. J. Comput. Vis., vol. 77, no. 1–3, pp. 103–124, 2008.
[28] E. Fox, E. Sudderth, M. Jordan, and A. Willsky, "Sharing features among dynamical systems with beta processes," in Proc. Advances in Neural Information Processing Systems, 2010, vol. 22, pp. 549–557.
[29] N. Hjort, "Nonparametric Bayes estimators based on beta processes in models for life history data," Ann. Statist., vol. 18, no. 3, pp. 1259–1294, 1990.
[30] R. Thibaux and M. Jordan, "Hierarchical beta processes and the Indian buffet process," in Proc. Int. Conf. Artificial Intelligence and Statistics, 2007, vol. 11.
[31] T. Griffiths and Z. Ghahramani, "Infinite latent feature models and the Indian buffet process," Gatsby Comput. Neurosci. Unit, Tech. Rep. 2005-001, 2005.
[32] J. F. C. Kingman, "Completely random measures," Pacific J. Math., vol. 21, no. 1, pp. 59–78, 1967.
[33] J. Kingman, Poisson Processes. London, U.K.: Oxford Univ. Press, 1993.
[34] Carnegie Mellon University. Graphics lab motion capture database [Online]. Available: http://mocap.cs.cmu.edu/
[35] J. Barbič, A. Safonova, J.-Y. Pan, C. Faloutsos, J. Hodgins, and N. Pollard, "Segmenting motion capture data into distinct behaviors," in Proc. Graphics Interface, 2004, pp. 185–194.
[36] E. Fox, "Bayesian nonparametric learning of complex dynamical phenomena," Ph.D. dissertation, Massachusetts Inst. Technol., Cambridge, MA, July 2009.

[SP]
