Learning Hidden Quantum Markov Models
Siddarth Srinivasan (Georgia Institute of Technology), Geoff Gordon (Carnegie Mellon University), Byron Boots (Georgia Institute of Technology)
Abstract

Hidden Quantum Markov Models (HQMMs) can be thought of as quantum probabilistic graphical models that can model sequential data. We extend previous work on HQMMs with three contributions: (1) we show how classical hidden Markov models (HMMs) can be simulated on a quantum circuit, (2) we reformulate HQMMs by relaxing the constraints for modeling HMMs on quantum circuits, and (3) we present a learning algorithm to estimate the parameters of an HQMM from data. While our algorithm requires further optimization to handle larger datasets, we are able to evaluate it using several synthetic datasets generated by valid HQMMs. We show that our algorithm learns HQMMs with the same number of hidden states and predictive accuracy as the HQMMs that generated the data, while HMMs learned with the Baum-Welch algorithm require more states to match the predictive accuracy.

1 INTRODUCTION

We extend previous work on Hidden Quantum Markov Models (HQMMs), and propose a novel approach to learning these models from data. HQMMs can be thought of as a new, expressive class of graphical models that have adopted the mathematical formalism for reasoning about uncertainty from quantum mechanics. We stress that while HQMMs could naturally be implemented on quantum computers, we do not need such a machine for these models to be of value. Instead, HQMMs can be viewed as novel models inspired by quantum mechanics that can be run on classical computers.

In considering these models, we are interested in answering three questions: (1) how can we construct quantum circuits to simulate classical Hidden Markov Models (HMMs); (2) what happens if we take full advantage of this quantum circuit instead of enforcing the classical probabilistic constraints; and (3) how do we learn the parameters for quantum models from data?

The paper is structured as follows: first we describe related work and provide background on quantum information theory as it relates to our work. Next, we describe the hidden quantum Markov model, compare our approach to previous work in detail, and give a scheme for writing any hidden Markov model as an HQMM. Finally, our main contribution is the introduction of a maximum-likelihood-based unsupervised learning algorithm that can estimate the parameters of an HQMM from data. Our implementation is slow to train HQMMs on large datasets and will require further optimization; instead, we evaluate our learning algorithm for HQMMs on several simple synthetic datasets by learning a quantum model from data and filtering and predicting with the learned model. We also compare our model and learning algorithm to maximum likelihood for learning hidden Markov models, and show that the more expressive HQMM can match HMMs' predictive capability with fewer hidden states on data generated by HQMMs.

2 BACKGROUND

2.1 Related Work

Hidden Quantum Markov Models were introduced by Monras et al. [2010], who discussed their relationship to classical HMMs and parameterized these HQMMs using a set of Kraus operators. Clark et al. [2015] further investigated HQMMs, and showed that they could be viewed as open quantum systems with instantaneous feedback. We arrive at the same Kraus operator representation by building a quantum circuit to simulate a classical HMM and then relaxing some constraints.

Our work can be viewed as extending previous work by Zhao and Jaeger [2010] on norm-observable operator models (NOOMs) and Jaeger [2000] on observable-operator models (OOMs). We show that HQMMs can be viewed as complex-valued extensions of NOOMs, formulated in the language of quantum mechanics. We use this connection to adapt the learning algorithm for NOOMs in Zhao and Jaeger [2007] into the first known learning algorithm for HQMMs, and demonstrate that the theoretical advantages of HQMMs also hold in practice.

Schuld et al. [2015a] and Biamonte et al. [2016] provide general overviews of quantum machine learning, and describe relevant work on HQMMs. They suggest that developing algorithms that can learn HQMMs from data is an important open problem. We provide just such a learning algorithm in Section 4.

Other work at the intersection of machine learning and quantum mechanics includes Wiebe et al. [2016] on quantum perceptron models and learning algorithms. Schuld et al. [2015b] discuss simulating a perceptron on a quantum computer.

2.2 Belief States and Quantum States

Classical discrete latent variable models represent uncertainty with a probability distribution using a vector $\vec{x}$ whose entries describe the probability of being in the corresponding system state. Each entry is real and non-negative, and the entries sum to 1. In general, we refer to the run-time system component that maintains a state estimate of the latent variable as an 'observer', and we refer to the observer's state as a 'belief state.'

In quantum mechanics, the quantum state of a particle $A$ can be written using Dirac notation as $|\psi\rangle_A$, a column vector in some orthonormal basis (the row vector is the complex-conjugate transpose $\langle\psi|_A = (|\psi\rangle_A)^\dagger$), with each entry being the 'probability amplitude' corresponding to that system state. The squared norm of the probability amplitude for a system state is the probability of observing that state, so the sum of squared norms of the probability amplitudes over all system states must be 1 to conserve probability. For example, $|\psi\rangle = \left[\tfrac{1}{\sqrt{2}}\ \tfrac{i}{\sqrt{2}}\right]^\dagger$ is a valid quantum state, with basis states 0 and 1 having equal probability $\left|\tfrac{1}{\sqrt{2}}\right|^2 = \left|\tfrac{i}{\sqrt{2}}\right|^2 = \tfrac{1}{2}$. However, unlike classical belief states such as $\vec{x} = \left[\tfrac{1}{2}\ \tfrac{1}{2}\right]^T$, where the probability of different states reflects ignorance about the underlying system, a pure quantum state like the one described above is the true description of the system.

But how can we describe classical mixtures of quantum systems ('mixed states'), where we also have classical uncertainty about the underlying quantum states? Such information can be captured by a 'density matrix.' Given a mixture of $N$ quantum systems, each with probability $p_i$, the density matrix for this ensemble is defined as follows:

$$\hat{\rho} = \sum_i^N p_i\, |\psi_i\rangle\langle\psi_i| \qquad (1)$$

The density matrix is the general quantum equivalent of the classical belief state $\vec{x}$ and has diagonal elements representing the probabilities of being in each system state. Consequently, the normalization condition is $\mathrm{tr}(\hat{\rho}) = 1$. The off-diagonal elements represent quantum coherences, which have no classical interpretation. The density matrix $\hat{\rho}$ is Hermitian and can be used to describe the state of any quantum system.

The density matrix can also be extended to represent the joint state of multiple variables, or that of 'multi-particle' systems, to use the physical interpretation. If we have density matrices $\hat{\rho}_A$ and $\hat{\rho}_B$ for two qudits (a qudit is a $d$-state quantum system, akin to qubits or 'quantum bits', which are 2-state quantum systems) $A$ and $B$, we can take the tensor product to arrive at the density matrix for the joint state of the particles, $\hat{\rho}_{AB} = \hat{\rho}_A \otimes \hat{\rho}_B$. As in a valid density matrix, the diagonal elements of this joint density matrix represent probabilities; $\mathrm{tr}(\hat{\rho}_{AB}) = 1$, and the probabilities correspond to the states in the Cartesian product of the basis states of the composite particles. In this paper, the joint density matrix will serve as the analogue of the classical joint probability distribution, with the off-diagonal terms encoding extra 'quantum' information.

Given the joint state of a multi-particle system, we can examine the state of just one or a few of the particles using the 'partial trace' operation, where we 'trace over' the particles we wish to disregard. This lets us recover a 'reduced density matrix' for a subsystem of interest. The partial trace for a two-particle system $\hat{\rho}_{AB}$, where we trace over the second particle to obtain the state of the first particle, is:

$$\hat{\rho}_A = \mathrm{tr}_B(\hat{\rho}_{AB}) = \sum_j \langle j|_B\, \hat{\rho}_{AB}\, |j\rangle_B \qquad (2)$$

For our purposes, this operation will serve as the quantum analogue of classical marginalization.

Finally, we discuss the quantum analogue of 'conditioning' on an observation. In quantum mechanics, the act of measuring a quantum system can change the underlying distribution, i.e., collapse it to the observed state in the measurement basis; this is represented mathematically by applying von Neumann projection operators (denoted $\hat{P}_y$ in this paper) to the density matrices describing the system. One can think of the projection operator as having ones in the diagonal entries corresponding to observed system states and zeros elsewhere. If we only observe part of a larger system, the system collapses to the states where that subsystem had the observed result. For example, suppose we have the following density matrix for a two-state two-particle system with basis $\{|0\rangle_A|0\rangle_B, |0\rangle_A|1\rangle_B, |1\rangle_A|0\rangle_B, |1\rangle_A|1\rangle_B\}$:

$$\hat{\rho}_{AB} = \begin{bmatrix} 0.25 & 0 & 0 & 0 \\ 0 & 0.25 & 0.5 & 0 \\ 0 & 0.5 & 0.25 & 0 \\ 0 & 0 & 0 & 0.25 \end{bmatrix} \qquad (3)$$
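To make this correspondence concrete, the following minimal NumPy sketch (our illustration, not part of the original paper's materials; the pure state and the 50/50 mixture are the examples from the text) builds density matrices per Equation 1 and forms a joint state with the tensor product:

```python
import numpy as np

# Pure state |psi> = [1/sqrt(2), i/sqrt(2)]^T from the example above.
psi = np.array([1 / np.sqrt(2), 1j / np.sqrt(2)])
rho_pure = np.outer(psi, psi.conj())          # |psi><psi|

# Equation 1: a classical 50/50 mixture of basis states |0> and |1>.
states = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
probs = [0.5, 0.5]
rho_mixed = sum(p * np.outer(s, s.conj()) for p, s in zip(probs, states))

# Both have unit trace; the diagonals give the probabilities of each state.
assert np.isclose(np.trace(rho_pure).real, 1.0)
assert np.isclose(np.trace(rho_mixed).real, 1.0)

# Joint state of two particles: rho_AB = rho_A (x) rho_B.
rho_AB = np.kron(rho_pure, rho_mixed)
print(np.diag(rho_AB).real)                   # probabilities over the joint basis
```

Note that `rho_pure` has nonzero off-diagonal entries (coherences) while `rho_mixed` does not, even though both have the same diagonal.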
Suppose we measure the state of particle $B$ and find it to be in state $|1\rangle_B$. The corresponding projection operator is

$$\hat{P}_{1_B} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

and the collapsed state is now

$$\hat{\rho}_{AB} = \hat{P}_{1_B}\, \hat{\rho}_{AB}\, \hat{P}_{1_B}^\dagger \xrightarrow{\ \text{normalize}\ } \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0.5 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0.5 \end{bmatrix}$$

When we trace over particle $A$ to get the state of particle $B$, the result is $\hat{\rho}_B = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}$, reflecting the fact that particle $B$ is now in state $|1\rangle_B$ with certainty. Tracing over particle $B$, we find $\hat{\rho}_A = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix}$, indicating that particle $A$ still has an equal probability of being in either state. Note that the underlying distribution of the system $\hat{\rho}_{AB}$ has changed; the probability of measuring the state of particle $B$ to be $|0\rangle_B$ is now 0, whereas before measurement we had a $0.25 + 0.25 = 0.5$ chance of measuring $|0\rangle_B$. This is unlike classical probability, where measuring a variable doesn't change the underlying distribution. We will use this fact when we construct our quantum circuit to simulate HMMs.

Thus, if we have an $n$-state quantum system that tracks a particle's evolution, and an $s$-state quantum system that tracks the likelihood of observing various outputs as they depend (probabilistically) on the $n$-state system, then to obtain the $n$-state system conditioned on observation $y$, we apply the projection operator $\hat{P}_y$ to the joint system and trace over the second particle.
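The collapse-and-marginalize computation above is easy to verify numerically; this sketch (ours) reproduces the worked example, using Equation 2 for the partial trace:

```python
import numpy as np

# Example joint density matrix from Equation 3 (basis |00>, |01>, |10>, |11>).
rho_AB = np.array([[0.25, 0.0, 0.0, 0.0],
                   [0.0, 0.25, 0.5, 0.0],
                   [0.0, 0.5, 0.25, 0.0],
                   [0.0, 0.0, 0.0, 0.25]])

def partial_trace(rho, dims, keep):
    """Reduced density matrix of subsystem `keep` (0 or 1) of a bipartite state."""
    d_a, d_b = dims
    rho4 = rho.reshape(d_a, d_b, d_a, d_b)   # indices (a, b, a', b')
    if keep == 0:
        return np.trace(rho4, axis1=1, axis2=3)   # trace over B
    return np.trace(rho4, axis1=0, axis2=2)       # trace over A

# Projection operator for measuring particle B in state |1>.
P_1B = np.diag([0.0, 1.0, 0.0, 1.0])

collapsed = P_1B @ rho_AB @ P_1B.conj().T
collapsed /= np.trace(collapsed)                  # renormalize

print(partial_trace(collapsed, (2, 2), keep=1))   # rho_B = [[0, 0], [0, 1]]
print(partial_trace(collapsed, (2, 2), keep=0))   # rho_A = [[0.5, 0], [0, 0.5]]
```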
Table 1: Comparison between classical and quantum representations

| Description | Classical representation | Quantum representation | Description |
|---|---|---|---|
| Belief state | $\vec{x}$ | $\hat{\rho}$ | Density matrix |
| Joint distribution | $\vec{x}_1 \otimes \vec{x}_2$ | $\hat{\rho}_{X_1} \otimes \hat{\rho}_{X_2}$ | Multi-particle density matrix |
| Marginalization | $\vec{x} = \sum_y (\vec{x} \otimes \vec{y})$ | $\hat{\rho}_X = \mathrm{tr}_Y(\hat{\rho}_X \otimes \hat{\rho}_Y)$ | Partial trace |
| Conditional probability | $P(\vec{x} \mid y) = \frac{P(y,\, \vec{x})}{P(y)}$ | $P(\text{states} \mid y) \propto \mathrm{tr}_Y(\hat{P}_y\, \hat{\rho}_{XY}\, \hat{P}_y^\dagger)$ | Projection + partial trace |

2.3 Hidden Markov Models

Classical Hidden Markov Models (HMMs) are graphical models used to model dynamic processes that exhibit Markovian state evolution. Figure 1 depicts a classical HMM, where the transition matrix $A$ and emission matrix $C$ are column-stochastic matrices that determine the Markovian hidden-state evolution and the observation probabilities, respectively. Bayesian inference can be used to track the evolution of the hidden variable.

Figure 1: Hidden Markov Model

The belief state at time $t$ is a probability distribution over states, and prior to any observation it is written as:

$$\vec{x}_t' = A\, \vec{x}_{t-1} \qquad (4)$$

The probabilities of observing each output at time $t$ are given by the vector $\vec{s}_t$:

$$\vec{s}_t = C\, \vec{x}_t' = C A\, \vec{x}_{t-1} \qquad (5)$$

We can use Bayesian inference to write the belief state vector after conditioning on observation $y$:

$$\vec{x}_t = \frac{\mathrm{diag}(C(y,:))\, A\, \vec{x}_{t-1}}{\mathbf{1}^T\, \mathrm{diag}(C(y,:))\, A\, \vec{x}_{t-1}} \qquad (6)$$

where $\mathrm{diag}(C(y,:))$ is a diagonal matrix with the entries of the $y$th row of $C$ along the diagonal, and the denominator renormalizes the vector $\vec{x}_t$.

An alternate representation of the Hidden Markov Model uses 'observable' operators (Jaeger [2000]). Instead of using the matrices $A$ and $C$, we can write $T_y = \mathrm{diag}(C(y,:))\, A$. There is a different operator $T_y$ for each possible observable output $y$, with $[T_y]_{ij} = P(y, i_t \mid j_{t-1})$. We can then rewrite Equation 6 as:

$$\vec{x}_t = \frac{T_y\, \vec{x}_{t-1}}{\mathbf{1}^T\, T_y\, \vec{x}_{t-1}} \qquad (7)$$

If we observe outputs $y_1, \ldots, y_n$, we apply $T_{y_n} \cdots T_{y_1} \vec{x}_0$ and take the sum of the entries of the resulting vector to find the probability of observing the sequence, or renormalize to find the belief state after the final observation.
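The observable-operator update of Equation 7 takes only a few lines of code. The sketch below (ours) uses a small hypothetical two-state, two-output HMM; the matrices `A` and `C` are made-up examples, not from the paper:

```python
import numpy as np

# Hypothetical column-stochastic transition and emission matrices.
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])      # A[i, j] = P(i_t | j_{t-1})
C = np.array([[0.7, 0.1],
              [0.3, 0.9]])      # C[y, i] = P(y | i_t)

# Observable operators T_y = diag(C(y, :)) A, one per output y.
T = [np.diag(C[y, :]) @ A for y in range(C.shape[0])]

x = np.array([0.5, 0.5])        # initial belief state
for y in [0, 0, 1]:             # an example observation sequence
    x = T[y] @ x                # unnormalized update (numerator of Equation 7)
seq_prob = x.sum()              # probability of the whole sequence
x_filtered = x / seq_prob       # belief state after the final observation
print(seq_prob, x_filtered)
```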
Figure 2: HMM implementation on quantum circuits. (a) Full quantum circuit to implement an HMM; (b) simplified scheme to implement an HMM.

3 HIDDEN QUANTUM MARKOV MODELS

3.1 A Quantum Circuit to Simulate HMMs

Let us now contrast state evolution in quantum systems with state evolution in HMMs. The quantum analogue of observable operators is a set of non-trace-increasing Kraus operators $\{\hat{K}_i\}$ that are completely positive (CP) linear maps. Trace-preserving Kraus operators, with $\sum_i^N \hat{K}_i^\dagger \hat{K}_i = I$, map a density operator to another density operator. Trace-decreasing Kraus operators, with $\sum_i^N \hat{K}_i^\dagger \hat{K}_i < I$, represent operations on a smaller part of a quantum system that can allow probability to 'leak' to other states that aren't being considered. This paper will formulate problems such that all sets of Kraus operators are trace-preserving.

When there is only one operator $\hat{U}$ in the set, i.e., $\hat{U}^\dagger \hat{U} = I$, then $\hat{U}$ is a unitary matrix. Unitary operators generally model the evolution of the 'whole' system, which may be high-dimensional. But if we care only about tracking the evolution of a smaller sub-system, which may interact with its environment, we can use Kraus operators. The most general quantum operation that can be performed on a density matrix is

$$\hat{\rho}' = \frac{\sum_i^M \hat{K}_i\, \hat{\rho}\, \hat{K}_i^\dagger}{\mathrm{tr}\left(\sum_i^M \hat{K}_i\, \hat{\rho}\, \hat{K}_i^\dagger\right)}$$

where the denominator re-normalizes the density matrix.
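To illustrate the general quantum operation above, the following sketch (ours) applies a randomly generated trace-preserving set of Kraus operators to a density matrix. Orthonormalizing the columns of a stacked random matrix is our implementation choice for enforcing $\sum_i \hat{K}_i^\dagger \hat{K}_i = I$, not a construction prescribed by the paper:

```python
import numpy as np

n, M = 3, 4                     # state dimension and number of Kraus operators
rng = np.random.default_rng(0)

# A random (Mn x n) matrix with orthonormal columns, sliced into M blocks,
# gives Kraus operators satisfying sum_i K_i^dagger K_i = I.
G = rng.normal(size=(M * n, n)) + 1j * rng.normal(size=(M * n, n))
Q, _ = np.linalg.qr(G)
K = [Q[i * n:(i + 1) * n, :] for i in range(M)]
assert np.allclose(sum(k.conj().T @ k for k in K), np.eye(n))

rho = np.eye(n) / n             # maximally mixed input state
rho_out = sum(k @ rho @ k.conj().T for k in K)
rho_out /= np.trace(rho_out)    # denominator of the general quantum operation
print(np.trace(rho_out).real)   # 1.0: output is still a valid density matrix
```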
Figure 3: Quantum schemes implementing classical HMMs. (a) HQMM scheme with separate transition and emission: $\mathcal{K}_w = \sum_w \hat{K}_w (\cdot) \hat{K}_w^\dagger$; (b) a generalized scheme for HQMMs: $\mathcal{K}_{w,y_{t-1}} = \sum_w \hat{K}_{w,y_{t-1}} (\cdot) \hat{K}_{w,y_{t-1}}^\dagger$.

Now, how do we simulate classical HMMs on quantum circuits with qudits, where computation is done using unitary operations? There is no general way to convert column-stochastic transition and emission matrices to unitary matrices, so we prepare 'ancilla' particles and construct unitary matrices (see Algorithm 3 in the appendix for details) to act on the joint state. We then trace over one particle to obtain the state of the other.

Figure 2a illustrates a quantum circuit constructed with these unitary matrices. We prepare the 'ancilla' states $\hat{\rho}_{X_t}$ and $\hat{\rho}_{Y_t}$ appropriately (i.e., entirely in system state 1, represented by a density matrix of zeros except $\hat{\rho}_{1,1} = 1$), and construct $\hat{U}_1$ and $\hat{U}_2$ from the transition matrix $A$ and the emission matrix $C$, respectively. $\hat{U}_1$ evolves $(\hat{\rho}_{t-1} \otimes \hat{\rho}_{X_t})$ to perform the Markovian transition, while $\hat{U}_2$ updates $\hat{\rho}_{Y_t}$ to contain the probabilities of measuring each observable output. At runtime, we measure $\hat{\rho}_{Y_t}$, which changes the joint distribution of $\hat{\rho}_{X_t} \otimes \hat{\rho}_{Y_t}$ to give the updated conditioned state $\hat{\rho}_t$. Mathematically, this is equivalent to applying a projection operator on the joint state and tracing over $\hat{\rho}_{Y_t}$. Thus, the forward algorithm that explicitly models a hidden Markov model on a quantum circuit as per Figure 2a is written as:

$$\hat{\rho}_t \propto \mathrm{tr}_{\hat{\rho}_{Y_t}}\left( \hat{P}_y\, \hat{U}_2 \left( \mathrm{tr}_{\hat{\rho}_{t-1}}\left( \hat{U}_1 (\hat{\rho}_{t-1} \otimes \hat{\rho}_{X_t}) \hat{U}_1^\dagger \right) \otimes \hat{\rho}_{Y_t} \right) \hat{U}_2^\dagger\, \hat{P}_y^\dagger \right) \qquad (8)$$

We can simplify this circuit to use Kraus operators acting on the lower-dimensional state space of $\hat{\rho}_{X_t}$. Since we always prepare $\hat{\rho}_{Y_t}$ in the same state, the operation $\hat{U}_2$ on the joint state $\hat{\rho}_{X_t} \otimes \hat{\rho}_{Y_t}$, followed by the application of the projection operator $\hat{P}_y$, can be written more concisely as a Kraus operator acting on just $\hat{\rho}_{X_t}$, so that we need only be concerned with representing how the particle $\hat{\rho}_{X_t}$ evolves. We would need to construct a set of Kraus operators $\{\hat{K}_y\}$, one for each observable output $y$, such that $\sum_y \hat{K}_y^\dagger \hat{K}_y = I$.

Tensoring with an ancilla qudit and tracing over a qudit can be achieved with an $ns \times n$ matrix $W$ and an $n \times ns$ matrix $V_y$, respectively, since we always prepare our ancilla qudits in the same state (details on constructing these matrices can be found in the appendix), so that:

$$\hat{\rho}_{X_t} \otimes \hat{\rho}_{Y_t} \rightarrow W\, \hat{\rho}_{X_t}\, W^\dagger$$
$$\mathrm{tr}_{\hat{\rho}_{Y_t}}\left( \hat{P}_y\, \hat{U}_2 W\, \hat{\rho}_{X_t}\, W^\dagger \hat{U}_2^\dagger\, \hat{P}_y^\dagger \right) \rightarrow V_y\, \hat{P}_y\, \hat{U}_2 W\, \hat{\rho}_{X_t}\, W^\dagger \hat{U}_2^\dagger\, \hat{P}_y^\dagger\, V_y^\dagger$$

We can then construct Kraus operators such that $\hat{K}_y = V_y \hat{P}_y \hat{U}_2 W$. Figure 2b shows this updated circuit, where $\hat{U}_1$ is still the quantum implementation of the transition matrix and $\hat{K}_{y_t}$ is the quantum implementation of the Bayesian update after observation. This scheme to model a classical HMM can be written as:

$$\hat{\rho}_t = \frac{ \hat{K}_{y_{t-1}} \left( \mathrm{tr}_{\hat{\rho}_{t-1}}\left( \hat{U}_1 (\hat{\rho}_{t-1} \otimes \hat{\rho}_{X_t}) \hat{U}_1^\dagger \right) \right) \hat{K}_{y_{t-1}}^\dagger }{ \mathrm{tr}\left( \hat{K}_{y_{t-1}} \left( \mathrm{tr}_{\hat{\rho}_{t-1}}\left( \hat{U}_1 (\hat{\rho}_{t-1} \otimes \hat{\rho}_{X_t}) \hat{U}_1^\dagger \right) \right) \hat{K}_{y_{t-1}}^\dagger \right) } \qquad (9)$$

We can similarly simplify $\hat{U}_1$ to a set of Kraus operators. We write the unitary operation $\hat{U}_1$ in terms of a set of $n$ Kraus operators $\{\hat{K}_w\}$, as if we were to measure $\hat{\rho}_{t-1}$ immediately after the operation $\hat{U}_1$. However, instead of applying one Kraus operator associated with a measurement as we do in Figure 2b, we sum over all $n$ possible 'observations', as if to 'ignore' the observation on $\hat{\rho}_{t-1}$. Post-multiplying each Kraus operator in $\{\hat{K}_w\}$ with each operator in $\{\hat{K}_y\}$, we obtain a set of Kraus operators $\{\hat{K}_{w_y,y}\}$ that can be used to model a classical HMM as follows (the full procedure is described in Algorithm 1):

$$\hat{\rho}_t = \frac{ \sum_{w_y} \hat{K}_{w_y,y_{t-1}}\, \hat{\rho}_{t-1}\, \hat{K}_{w_y,y_{t-1}}^\dagger }{ \mathrm{tr}\left( \sum_{w_y} \hat{K}_{w_y,y_{t-1}}\, \hat{\rho}_{t-1}\, \hat{K}_{w_y,y_{t-1}}^\dagger \right) } \qquad (10)$$

We believe this procedure to be a useful illustration of performing classical operations on graphical models using quantum circuits. In practice, we needn't construct the Kraus operators in this peculiar fashion to simulate HMMs; an equivalent but simpler approach is to construct observable operators $\{T_y\}$ from the transition and emission matrices as described in Section 2.3, and set the $w$th column of $\hat{K}_{w_y,y}$ to $\sqrt{T_y^{(:,w)}}$ (elementwise), with all other entries being zero. This ensures $\sum_{w_y,y} \hat{K}_{w_y,y}^\dagger \hat{K}_{w_y,y} = I$.
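The simpler construction just described can be checked numerically. The sketch below (ours, with the same made-up `A` and `C` as earlier) builds the Kraus operators from the observable operators $T_y$, runs the update of Equation 10 without renormalizing until the end, and confirms that the diagonal of $\hat{\rho}_t$ matches classical filtering:

```python
import numpy as np

A = np.array([[0.9, 0.2], [0.1, 0.8]])    # hypothetical transition (column-stochastic)
C = np.array([[0.7, 0.1], [0.3, 0.9]])    # hypothetical emission (column-stochastic)
n, s = A.shape[0], C.shape[0]
T = [np.diag(C[y, :]) @ A for y in range(s)]

# K[y][w]: the w-th column is the elementwise square root of T_y's w-th column.
K = [[np.zeros((n, n)) for _ in range(n)] for _ in range(s)]
for y in range(s):
    for w in range(n):
        K[y][w][:, w] = np.sqrt(T[y][:, w])
# Completeness: summing K^dagger K over all w and y gives the identity.
assert np.allclose(sum(K[y][w].conj().T @ K[y][w]
                       for y in range(s) for w in range(n)), np.eye(n))

rho = np.diag([0.5, 0.5])                  # rho_0 = diag(prior)
x = np.array([0.5, 0.5])                   # classical belief state, for comparison
for y in [0, 0, 1]:
    rho = sum(k @ rho @ k.conj().T for k in K[y])   # Equation 10, unnormalized
    x = T[y] @ x
rho /= np.trace(rho)
print(np.diag(rho).real, x / x.sum())      # identical belief states
```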
Algorithm 1 Simulating Hidden Markov Models with HQMMs

Input: Transition matrix $A$ and emission matrix $C$
Output: Belief state as $\mathrm{diag}(\hat{\rho})$, or $P(y_1, \ldots, y_n \mid \mathcal{D})$ where $\mathcal{D}$ is the HMM

1: Initialization:
2: Let $s$ = #outputs, $n$ = #hidden states, $y_t$ = observed output at time $t$
3: Prepare density matrix $\hat{\rho}_0$ in some initial state; $\hat{\rho}_0 = \mathrm{diag}(\pi)$ if priors $\pi$ are known.
4: Construct unitary matrices $\hat{U}_1$ and $\hat{U}_2$ from $A$ and $C$, respectively, using Algorithm 3 (in the appendix)
5: Using $\hat{U}_1$ and $\hat{U}_2$, construct a set of $n$ Kraus operators $\{\hat{K}_w\}$ and $s$ Kraus operators $\{\hat{K}_y\}$, with $\hat{K}_w = V_w \hat{U}_1 W$ and $\hat{K}_y = V_y \hat{P}_y \hat{U}_2 W$, and combine them into a set $\{\hat{K}_{w_y,y}\}$ with $\hat{K}_{w_y,y} = \hat{K}_y \hat{K}_w$. (Matrix $W$ tensors with an ancilla; matrix $V_y$ carries out a trivial partial trace operation; summing over $V_w$ for all $w$ carries out the proper partial trace operation. Details in the appendix.)
6: for $t = 1 : T$ do
7:   $\hat{\rho}_t \leftarrow \sum_{w_y} \hat{K}_{w_y,y_t}\, \hat{\rho}_{t-1}\, \hat{K}_{w_y,y_t}^\dagger$
8: end for
9: $\mathrm{tr}(\hat{\rho}_T)$ gives the probability of the sequence; renormalizing $\hat{\rho}_T$ gives the belief state on the diagonal.

3.2 Formulating HQMMs

Monras et al. [2010] formulate Hidden Quantum Markov Models by defining a set of Kraus operators $\{\hat{K}_{w_y,y}\}$, where each observable $y$ has $w_y$ associated Kraus operators acting on a state with hidden dimension $n$, and they form a complete set such that $\sum_{w,y} \hat{K}_{w,y}^\dagger \hat{K}_{w,y} = I$. The update rule for a quantum operation is exactly the same as Equation 10, which we arrived at by first constructing a quantum circuit to simulate HMMs with known parameters and then constructing the operators $\{\hat{K}_{w,y}\}$ in a very peculiar way. The process outlined in the previous section is a particular parameterization of HQMMs to model HMMs. If we let the operators $\hat{U}_1$ and $\hat{U}_2$ be any unitary matrices, or the Kraus operators be any set of complex-valued matrices that satisfy $\sum_{w_y,y} \hat{K}_{w_y,y}^\dagger \hat{K}_{w_y,y} = I$, then we have a general and fully quantum HQMM.

Indeed, Equation 10 gives the forward algorithm for HQMMs. To find the probability of emitting an output $y$ given the previous state $\hat{\rho}_{t-1}$, we simply take the trace of the numerator in Equation 10, i.e., $p(y_t \mid \hat{\rho}_{t-1}) = \mathrm{tr}\left( \sum_{w_y} \hat{K}_{w_y,y_t}\, \hat{\rho}_{t-1}\, \hat{K}_{w_y,y_t}^\dagger \right)$.

The number of parameters of an HQMM is determined by the number of latent states $n$, the number of outputs $s$, and the number of Kraus operators associated with each output, $w$. To exactly simulate HMM dynamics with an HQMM, we need $w = n$, as per the derivation above. However, this constraint need not hold for a general HQMM, which can have any number of Kraus operators that we apply and sum for a given output. $w$ can also be thought of as the dimension of the ancilla $\hat{\rho}_{X_t}$ that we tensor with in Figure 2a before the unitary operation $\hat{U}_1$. Consequently, if we set $w = 1$, we do not tensor with an additional particle, but model the evolution of the original particle as unitary. In all, an HQMM requires learning $n^2 s w$ parameters, which is a factor of $w$ more than an HMM in the observable-operator representation, which has $n^2 s$ parameters. The canonical representation of HMMs, with an $n \times n$ transition matrix and an $s \times n$ emission matrix, has $n^2 + ns$ parameters.

HQMMs can also be seen as a complex-valued extension of the norm-observable operator models (NOOMs) defined by Zhao and Jaeger [2010]. Indeed, the HQMM we get by applying Algorithm 1 to an HMM is also a valid NOOM (allowing for multiple operators per output), implying that HMMs can be simulated by NOOMs. Both HMMs and NOOMs can be simulated by HQMMs (the latter is trivially true). While Zhao and Jaeger [2010] show that any NOOM can be written as an OOM, the exact relationship between HQMMs and OOMs requires further investigation.
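Before turning to learning, here is a small sketch (ours) of a fully general HQMM: any complex-valued Kraus set satisfying the completeness condition is valid, so a random matrix with orthonormal columns (our construction, chosen for illustration) yields a model whose emission probabilities $p(y_t \mid \hat{\rho}_{t-1})$ form a proper distribution:

```python
import numpy as np

def emission_probs(rho, K):
    """p(y_t | rho_{t-1}) = tr( sum_w K[y][w] rho K[y][w]^dagger ) for each y."""
    return np.array([np.trace(sum(k @ rho @ k.conj().T for k in K_y)).real
                     for K_y in K])

# A random fully quantum HQMM: n = 2 hidden dimensions, s = 3 outputs,
# w = 2 Kraus operators per output, built from orthonormal columns so that
# the completeness condition sum K^dagger K = I holds by construction.
n, s, w = 2, 3, 2
rng = np.random.default_rng(1)
G = rng.normal(size=(n * s * w, n)) + 1j * rng.normal(size=(n * s * w, n))
Q, _ = np.linalg.qr(G)
blocks = [Q[i * n:(i + 1) * n, :] for i in range(s * w)]
K = [blocks[y * w:(y + 1) * w] for y in range(s)]

rho = np.eye(n) / n
p = emission_probs(rho, K)
print(p, p.sum())   # a valid distribution over outputs: sums to 1
```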
4 AN ITERATIVE ALGORITHM FOR LEARNING HQMMs

We present an iterative maximum-likelihood algorithm to learn the Kraus operators of an HQMM that models sequential data. Our algorithm is general enough that it can be applied to any quantum version of a classical machine learning algorithm for which the loss is defined in terms of the Kraus operators to be learned.

We begin by writing the likelihood of observing some sequence $y_1, \ldots, y_T$. Recall that for a given output $y$, we apply the $w$ Kraus operators associated with that observable in the 'forward' algorithm, as $\sum_{w_y} \hat{K}_{w_y,y} (\cdot) \hat{K}_{w_y,y}^\dagger$. If we do not renormalize the density matrix after applying these operators, the diagonal entries contain the joint probability of the corresponding system states and observing the associated sequence.
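Concretely, this gives the following likelihood computation (our sketch, assuming Kraus operators indexed as `K[y][w]` as in the earlier examples); renormalizing at each step and accumulating log-traces is a standard numerical trick we add to avoid underflow on long sequences:

```python
import numpy as np

def log_likelihood(K, rho0, sequence):
    """Log-probability of a sequence under an HQMM: run the forward update
    and accumulate the log of the trace (the sequence probability mass)
    before renormalizing at each step."""
    rho, ll = rho0, 0.0
    for y in sequence:
        rho = sum(k @ rho @ k.conj().T for k in K[y])   # unnormalized update
        tr = np.trace(rho).real
        ll += np.log(tr)
        rho /= tr                                       # renormalize for stability
    return ll
```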
Algorithm 2 Iterative Learning Algorithm for Hidden Quantum Markov Models

Input: An $M \times \ell$ matrix $Y$, where $M$ = #data points and $\ell$ = length of a stochastic sequence to be modeled.
Output: A set of $ws$ Kraus operators $\{\hat{K}_{w,s}\}$ of dimension $n \times n$ that maximize the log-likelihood of the data, where $n$ is the dimension of the hidden state, $s$ is the number of outputs, and $w$ is the number of operators per output.

1: Initialization: Randomly generate a set of $ws$ Kraus operators $\{\hat{K}_{w,s}\}$ of dimension $n \times n$, and stack them vertically to obtain a matrix $\kappa$ of dimension $nsw \times n$. Let $b$ be the batch size, $B$ the total number of batches to process, and $Y_b$ a $b \times \ell$ matrix of randomly chosen data samples. Let num_iterations be the number of iterations spent modifying $\kappa$ to maximize the likelihood of observing $Y_b$.
2: for batch = 1 : $B$ do
3:   Randomly select $b$ sequences to process, and construct the matrix $Y_b$
4:   for it = 1 : num_iterations do
5:     Randomly select rows $i$ and $j$ of $\kappa$ to modify, $i < j$
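Although the listing is cut off at this point, the initialization in step 1 can be sketched directly. Any $nsw \times n$ matrix $\kappa$ with orthonormal columns (here $\kappa$ denotes the vertically stacked Kraus operators) is a valid starting point, since $\kappa^\dagger \kappa = \sum_{w,s} \hat{K}_{w,s}^\dagger \hat{K}_{w,s} = I$. The QR-based construction below is our illustration of one way to generate such a matrix:

```python
import numpy as np

def random_kappa(n, s, w, rng):
    """Random nsw x n stack of Kraus operators with kappa^dagger kappa = I,
    i.e., a trace-preserving set of n x n operators, w per output."""
    G = rng.normal(size=(n * s * w, n)) + 1j * rng.normal(size=(n * s * w, n))
    kappa, _ = np.linalg.qr(G)     # orthonormal columns
    return kappa

def unstack(kappa, n):
    """Split the stacked matrix back into individual n x n Kraus operators."""
    return [kappa[i * n:(i + 1) * n, :] for i in range(kappa.shape[0] // n)]

rng = np.random.default_rng(0)
kappa = random_kappa(n=2, s=3, w=2, rng=rng)
assert np.allclose(kappa.conj().T @ kappa, np.eye(2))
```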