
Learning Hidden Quantum Markov Models

Siddarth Srinivasan (Georgia Institute of Technology), Geoff Gordon (Carnegie Mellon University), Byron Boots (Georgia Institute of Technology)

Abstract

Hidden Quantum Markov Models (HQMMs) can be thought of as quantum probabilistic graphical models that can model sequential data. We extend previous work on HQMMs with three contributions: (1) we show how classical hidden Markov models (HMMs) can be simulated on a quantum circuit, (2) we reformulate HQMMs by relaxing the constraints for modeling HMMs on quantum circuits, and (3) we present a learning algorithm to estimate the parameters of an HQMM from data. While our algorithm requires further optimization to handle larger datasets, we are able to evaluate it on several synthetic datasets generated by valid HQMMs. We show that our algorithm learns HQMMs with the same number of hidden states and predictive accuracy as the HQMMs that generated the data, while HMMs learned with the Baum-Welch algorithm require more states to match the predictive accuracy.

1 INTRODUCTION

We extend previous work on Hidden Quantum Markov Models (HQMMs), and propose a novel approach to learning these models from data. HQMMs can be thought of as a new, expressive class of graphical models that have adopted the mathematical formalism for reasoning about uncertainty from quantum mechanics. We stress that while HQMMs could naturally be implemented on quantum computers, we do not need such a machine for these models to be of value. Instead, HQMMs can be viewed as novel models inspired by quantum mechanics that can be run on classical computers.

In considering these models, we are interested in answering three questions: (1) how can we construct quantum circuits to simulate classical Hidden Markov Models (HMMs); (2) what happens if we take full advantage of this quantum circuit instead of enforcing the classical probabilistic constraints; and (3) how do we learn the parameters for quantum models from data?

The paper is structured as follows: first we describe related work and provide background on quantum information theory as it relates to our work. Next, we describe the hidden quantum Markov model, compare our approach to previous work in detail, and give a scheme for writing any hidden Markov model as an HQMM. Finally, our main contribution is the introduction of a maximum-likelihood-based unsupervised learning algorithm that can estimate the parameters of an HQMM from data. Our implementation is slow to train HQMMs on large datasets, and will require further optimization. Instead, we evaluate our learning algorithm for HQMMs on several simple synthetic datasets by learning a quantum model from data and filtering and predicting with the learned model. We also compare our model and learning algorithm to maximum likelihood for learning hidden Markov models, and show that the more expressive HQMM can match HMMs' predictive capability with fewer hidden states on data generated by HQMMs.

2 BACKGROUND

2.1 Related Work

Hidden Quantum Markov Models were introduced by Monras et al. [2010], who discussed their relationship to classical HMMs and parameterized these HQMMs using a set of Kraus operators. Clark et al. [2015] further investigated HQMMs, and showed that they could be viewed as open quantum systems with instantaneous feedback. We arrive at the same Kraus operator representation by building a quantum circuit to simulate a classical HMM and then relaxing some constraints.

Our work can be viewed as extending previous work by Zhao and Jaeger [2010] on norm-observable operator models (NOOMs) and Jaeger [2000] on observable-operator models (OOMs). We show that HQMMs can be viewed as complex-valued extensions of NOOMs, formulated in the language of quantum mechanics. We use this connection to adapt the learning algorithm for NOOMs in Zhao and Jaeger [2007] into the first known learning algorithm for HQMMs, and demonstrate that the theoretical advantages of HQMMs also hold in practice.

Schuld et al. [2015a] and Biamonte et al. [2016] provide general overviews of quantum machine learning, and describe relevant work on HQMMs. They suggest that developing algorithms that can learn HQMMs from data is an important open problem. We provide just such a learning algorithm in Section 4.

Other work at the intersection of machine learning and quantum mechanics includes Wiebe et al. [2016] on quantum perceptron models and learning algorithms, and Schuld et al. [2015b] on simulating a perceptron on a quantum computer.

2.2 Belief States and Quantum States

Classical discrete latent variable models represent uncertainty with a probability distribution, using a vector ~x whose entries describe the probability of being in the corresponding system state. Each entry is real and non-negative, and the entries sum to 1. In general, we refer to the run-time system component that maintains a state estimate of the latent variable as an 'observer', and we refer to the observer's state as a 'belief state.'

In quantum mechanics, the quantum state of a particle A can be written using Dirac notation as |ψ⟩_A, a column vector in some orthonormal basis (the row vector is the complex-conjugate transpose ⟨ψ|_A = (|ψ⟩_A)†), with each entry being the 'probability amplitude' corresponding to that system state. The squared norm of the probability amplitude for a system state is the probability of observing that state, so the sum of squared norms of the probability amplitudes over all system states must be 1 to conserve probability. For example, |ψ⟩ = [1/√2  i/√2]ᵀ is a valid quantum state, with basis states 0 and 1 having equal probability |1/√2|² = |i/√2|² = 1/2. However, unlike classical belief states such as ~x = [1/2  1/2]ᵀ, where the probability of different states reflects ignorance about the underlying system, a pure quantum state like the one described above is the true description of the system.

But how can we describe classical mixtures of quantum systems ('mixed states'), where we also have classical uncertainty about the underlying quantum states? Such information can be captured by a 'density matrix.' Given a mixture of N quantum systems, each with probability p_i, the density matrix for this ensemble is defined as follows:

    ρ̂ = Σ_{i=1}^{N} p_i |ψ_i⟩⟨ψ_i|    (1)

The density matrix is the general quantum equivalent of the classical belief state ~x, and has diagonal elements representing the probabilities of being in each system state. Consequently, the normalization condition is tr(ρ̂) = 1. The off-diagonal elements represent quantum coherences, which have no classical interpretation. The density matrix ρ̂ is Hermitian, and can be used to describe the state of any quantum system.

The density matrix can also be extended to represent the joint state of multiple variables, or that of 'multi-particle' systems, to use the physical interpretation. If we have density matrices ρ̂_A and ρ̂_B for two qudits (d-state quantum systems, akin to qubits or 'quantum bits', which are 2-state quantum systems) A and B, we can take the tensor product to arrive at the density matrix for the joint state of the particles: ρ̂_AB = ρ̂_A ⊗ ρ̂_B. As with any valid density matrix, the diagonal elements of this joint density matrix represent probabilities, with tr(ρ̂_AB) = 1, and the probabilities correspond to the states in the Cartesian product of the basis states of the composite particles. In this paper, the joint density matrix will serve as the analogue of the classical joint probability distribution, with the off-diagonal terms encoding extra 'quantum' information.

Given the joint state of a multi-particle system, we can examine the state of just one or a few of the particles using the 'partial trace' operation, where we 'trace over' the particles we wish to disregard. This lets us recover a 'reduced density matrix' for a subsystem of interest. The partial trace for a two-particle system ρ̂_AB, where we trace over the second particle to obtain the state of the first particle, is:

    ρ̂_A = tr_B(ρ̂_AB) = Σ_j ⟨j|_B ρ̂_AB |j⟩_B    (2)

For our purposes, this operation will serve as the quantum analogue of classical marginalization.
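To make these objects concrete, here is a minimal NumPy sketch (ours, not code from the paper; all names and the example states are illustrative stand-ins) that builds a density matrix as in Equation 1, forms a joint state with a tensor product, and recovers a reduced density matrix with the partial trace of Equation 2.

```python
import numpy as np

def density_matrix(states, probs):
    """Mixed-state density matrix: rho = sum_i p_i |psi_i><psi_i|  (Eq. 1)."""
    return sum(p * np.outer(psi, psi.conj()) for psi, p in zip(states, probs))

def partial_trace_B(rho_AB, dim_A, dim_B):
    """Trace out subsystem B: rho_A = sum_j <j|_B rho_AB |j>_B  (Eq. 2)."""
    return np.trace(rho_AB.reshape(dim_A, dim_B, dim_A, dim_B), axis1=1, axis2=3)

# A pure state (|0> + i|1>)/sqrt(2) and a classical 50/50 mixture of |0> and |1>.
psi = np.array([1, 1j]) / np.sqrt(2)
rho_A = density_matrix([psi], [1.0])
rho_B = density_matrix([np.array([1, 0]), np.array([0, 1])], [0.5, 0.5])

rho_AB = np.kron(rho_A, rho_B)             # joint state rho_A (x) rho_B
assert np.isclose(np.trace(rho_AB), 1)     # normalization: tr(rho) = 1
assert np.allclose(partial_trace_B(rho_AB, 2, 2), rho_A)   # quantum 'marginalization'
```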

Finally, we discuss the quantum analogue of 'conditioning' on an observation. In quantum mechanics, the act of measuring a quantum system can change the underlying distribution, i.e., collapse it to the observed state in the measurement basis; this is represented mathematically by applying von Neumann projection operators (denoted P̂_y in this paper) to the density matrices describing the system. One can think of the projection operator as having ones in the diagonal entries corresponding to the observed system states and zeros elsewhere. If we only observe part of a larger system, the system collapses to the states where that subsystem had the observed result. For example, suppose we have the following density matrix for a two-state two-particle system with basis {|0⟩_A|0⟩_B, |0⟩_A|1⟩_B, |1⟩_A|0⟩_B, |1⟩_A|1⟩_B}:

    ρ̂_AB = [ 0.25  0     0     0
             0     0.25  0.5   0
             0     0.5   0.25  0
             0     0     0     0.25 ]    (3)

Suppose we measure the state of particle B and find it to be in state |1⟩_B. The corresponding projection operator is

    P̂_{1_B} = [ 0 0 0 0
               0 1 0 0
               0 0 0 0
               0 0 0 1 ]

and the collapsed state, after renormalizing P̂_{1_B} ρ̂_AB P̂†_{1_B}, is now

    ρ̂_AB = [ 0 0    0 0
             0 0.5  0 0
             0 0    0 0
             0 0    0 0.5 ]

When we trace over particle A to get the state of particle B, the result is ρ̂_B = [0 0; 0 1], reflecting the fact that particle B is now in state |1⟩_B with certainty. Tracing over particle B, we find ρ̂_A = [0.5 0; 0 0.5], indicating that particle A still has an equal probability of being in either state. Note that the underlying distribution of the system ρ̂_AB has changed: the probability of measuring the state of particle B to be |0⟩_B is now 0, whereas before measurement we had a 0.25 + 0.25 = 0.5 chance of measuring |0⟩_B. This is unlike classical probability, where measuring a variable doesn't change the underlying distribution. We will use this fact when we construct our quantum circuit to simulate HMMs.

Table 1 summarizes the correspondence between classical and quantum representations.

Table 1: Comparison between classical and quantum representations

    Description               Classical representation      Quantum analogue                      Description
    Belief state              ~x                            ρ̂                                     Density matrix
    Joint distribution        ~x_1 ⊗ ~x_2                   ρ̂_{X_1} ⊗ ρ̂_{X_2}                    Multi-particle density matrix
    Marginalization           ~x = Σ_y (~x ⊗ ~y)            ρ̂_X = tr_Y(ρ̂_X ⊗ ρ̂_Y)               Partial trace
    Conditional probability   P(~x|y) = P(y, ~x) / P(y)     ∝ tr_Y(P̂_y ρ̂_{XY} P̂†_y)             Projection + partial trace

Thus, if we have an n-state quantum system that tracks a particle's evolution, and an s-state quantum system that tracks the likelihood of observing various outputs as they depend (probabilistically) on the n-state system, then to obtain the n-state system conditioned on observation y, we apply the projection operator P̂_y on the joint system and trace over the second particle.
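The collapse arithmetic above is easy to verify numerically. The following sketch (ours) applies P̂_{1_B} to the ρ̂_AB of Equation 3 and then traces out each particle:

```python
import numpy as np

# Joint density matrix from Eq. 3, basis order |00>, |01>, |10>, |11>.
rho_AB = np.array([[0.25, 0,    0,    0   ],
                   [0,    0.25, 0.5,  0   ],
                   [0,    0.5,  0.25, 0   ],
                   [0,    0,    0,    0.25]])

# Projector onto 'particle B observed in |1>_B' (rows/cols of |01> and |11>).
P_1B = np.diag([0, 1, 0, 1]).astype(float)

collapsed = P_1B @ rho_AB @ P_1B.conj().T
collapsed /= np.trace(collapsed)            # renormalize after measurement

def ptrace(rho, keep):
    """Reduced density matrix of one particle of a two-qubit state."""
    rho4 = rho.reshape(2, 2, 2, 2)          # indices [a, b, a', b']
    return (np.trace(rho4, axis1=1, axis2=3) if keep == 0   # trace over B
            else np.trace(rho4, axis1=0, axis2=2))          # trace over A

print(ptrace(collapsed, keep=1))   # [[0, 0], [0, 1]]: B is |1>_B with certainty
print(ptrace(collapsed, keep=0))   # [[0.5, 0], [0, 0.5]]: A is still uniform
```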
2.3 Hidden Markov Models

Classical Hidden Markov Models (HMMs) are graphical models used to model dynamic processes that exhibit Markovian state evolution. Figure 1 depicts a classical HMM, where the transition matrix A and the emission matrix C are column-stochastic matrices that determine the Markovian hidden state evolution and the observation probabilities, respectively. Bayesian inference can be used to track the evolution of the hidden variable.

[Figure 1: Hidden Markov Model]

The belief state at time t is a probability distribution over states, and prior to any observation is written as:

    ~x'_t = A ~x_{t-1}    (4)

The probabilities of observing each output at time t are given by the vector ~s_t:

    ~s_t = C ~x'_t = C A ~x_{t-1}    (5)

We can use Bayes' rule to write the belief state vector after conditioning on observation y:

    ~x_t = diag(C(y,:)) A ~x_{t-1} / ( 1ᵀ diag(C(y,:)) A ~x_{t-1} )    (6)

where diag(C(y,:)) is a diagonal matrix with the entries of the y-th row of C along the diagonal, and the denominator renormalizes the vector ~x_t.

An alternate representation of the Hidden Markov Model uses 'observable' operators (Jaeger [2000]). Instead of using the matrices A and C, we can write T_y = diag(C(y,:)) A. There is a different operator T_y for each possible observable output y, with [T_y]_{ij} = P(y, i_t | j_{t-1}). We can then rewrite Equation 6 as:

    ~x_t = T_y ~x_{t-1} / ( 1ᵀ T_y ~x_{t-1} )

If we observe outputs y_1, ..., y_n, we apply T_{y_n} ⋯ T_{y_1} ~x and take the sum of the entries of the resulting vector to find the probability of observing the sequence, or renormalize to find the belief state after the final observation.
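As a concrete illustration (our own sketch, not code from the paper), the filtering updates of Equations 4–6 and their observable-operator form take a few lines of NumPy; the matrices A, C and the observation sequence below are arbitrary stand-ins:

```python
import numpy as np

A = np.array([[0.9, 0.2],      # column-stochastic transition matrix (stand-in)
              [0.1, 0.8]])
C = np.array([[0.7, 0.1],      # column-stochastic emission matrix (stand-in)
              [0.3, 0.9]])
T = [np.diag(C[y, :]) @ A for y in range(C.shape[0])]   # T_y = diag(C(y,:)) A

x = np.array([0.5, 0.5])       # initial belief state
for y in [0, 1, 1, 0]:         # an arbitrary observation sequence
    x = T[y] @ x               # unnormalized conditioning step
prob = x.sum()                 # P(y_1, ..., y_4): sum instead of renormalizing
belief = x / x.sum()           # belief state after the final observation
```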

3 HIDDEN QUANTUM MARKOV MODELS

3.1 A Quantum Circuit to Simulate HMMs

Let us now contrast state evolution in quantum systems with state evolution in HMMs. The quantum analogue of the observable operators is a set of non-trace-increasing Kraus operators {K̂_i}, which are completely positive (CP) linear maps. Trace-preserving Kraus operators, with Σ_{i}^{N} K̂_i† K̂_i = I, map a density operator to another density operator. Trace-decreasing Kraus operators, with Σ_{i}^{N} K̂_i† K̂_i < I, represent operations on a smaller part of a quantum system that can allow probability to 'leak' to other states that aren't being considered. This paper will formulate problems such that all sets of Kraus operators are trace-preserving.

When there is only one operator Û in the set, i.e., Û†Û = I, then Û is a unitary matrix. Unitary operators generally model the evolution of the 'whole' system, which may be high-dimensional. But if we care only about tracking the evolution of a smaller sub-system, which may interact with its environment, we can use Kraus operators. The most general quantum operation that can be performed on a density matrix is

    ρ̂' = Σ_{i}^{M} K̂_i ρ̂ K̂_i† / tr( Σ_{i}^{M} K̂_i ρ̂ K̂_i† )

where the denominator re-normalizes the density matrix.
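A hedged sketch (ours) of this general operation and of the trace-preservation condition:

```python
import numpy as np

def apply_channel(rho, kraus_ops):
    """rho' = sum_i K_i rho K_i^dagger, renormalized by its trace."""
    out = sum(K @ rho @ K.conj().T for K in kraus_ops)
    return out / np.trace(out)

def is_trace_preserving(kraus_ops):
    """Checks the constraint sum_i K_i^dagger K_i = I."""
    S = sum(K.conj().T @ K for K in kraus_ops)
    return np.allclose(S, np.eye(S.shape[0]))
```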


Now, how do we simulate classical HMMs on quantum circuits with qudits, where computation is done using unitary operations? There is no general way to convert column-stochastic transition and emission matrices to unitary matrices, so we prepare 'ancilla' particles and construct unitary matrices (see Algorithm 3 in the appendix for details) to act on the joint state. We then trace over one particle to obtain the state of the other.

[Figure 2: HMM implementation on quantum circuits. (a) Full quantum circuit to implement an HMM; (b) simplified scheme to implement an HMM.]

Figure 2a illustrates a quantum circuit constructed with these unitary matrices. We prepare the 'ancilla' states ρ̂_{X_t} and ρ̂_{Y_t} appropriately (i.e., entirely in system state 1, represented by a density matrix of zeros except ρ̂_{1,1} = 1), and construct Û_1 and Û_2 from the transition matrix A and the emission matrix C, respectively. Û_1 evolves ρ̂_{t-1} ⊗ ρ̂_{X_t} to perform the Markovian transition, while Û_2 updates ρ̂_{Y_t} to contain the probabilities of measuring each observable output. At runtime, we measure ρ̂_{Y_t}, which changes the joint distribution of ρ̂_{X_t} ⊗ ρ̂_{Y_t} to give the updated conditioned state ρ̂_t. Mathematically, this is equivalent to applying a projection operator on the joint state and tracing over ρ̂_{Y_t}. Thus, the forward algorithm that explicitly models a hidden Markov model on a quantum circuit as per Figure 2a is written as:

    ρ̂_t ∝ tr_{ρ̂_{Y_t}}( P̂_y Û_2 ( tr_{ρ̂_{t-1}}( Û_1 (ρ̂_{t-1} ⊗ ρ̂_{X_t}) Û_1† ) ⊗ ρ̂_{Y_t} ) Û_2† P̂_y† )    (7)

We can simplify this circuit to use Kraus operators acting on the lower-dimensional state space of ρ̂_{X_t}. Since we always prepare ρ̂_{Y_t} in the same state, the operation Û_2 on the joint state ρ̂_{X_t} ⊗ ρ̂_{Y_t}, followed by the application of the projection operator P̂_y, can be more concisely written as a Kraus operator on just ρ̂_{X_t}, so that we need only be concerned with representing how the particle ρ̂_{X_t} evolves. We would need to construct a set of Kraus operators {K̂_y}, one for each observable output y, such that Σ_y K̂_y† K̂_y = I. Tensoring with an ancilla qudit and tracing over a qudit can be achieved with an ns × n matrix W and an n × ns matrix V_y, respectively, since we always prepare our ancilla qudits in the same state (details on constructing these matrices can be found in the appendix), so that:

    ρ̂_{X_t} ⊗ ρ̂_{Y_t} → W ρ̂_{X_t} W†
    tr_{ρ̂_{Y_t}}( P̂_y Û_2 W ρ̂_{X_t} W† Û_2† P̂_y† ) → V_y P̂_y Û_2 W ρ̂_{X_t} W† Û_2† P̂_y† V_y†

We can then construct Kraus operators such that K̂_y = V_y P̂_y Û_2 W. Figure 2b shows this updated circuit, where Û_1 is still the quantum implementation of the transition matrix and K̂_{y_t} is the quantum implementation of the Bayesian update after observation. This scheme to model a classical HMM can be written as:

    ρ̂_t = K̂_{y_{t-1}} ( tr_{ρ̂_{t-1}}( Û_1 (ρ̂_{t-1} ⊗ ρ̂_{X_t}) Û_1† ) ) K̂†_{y_{t-1}}
          / tr( K̂_{y_{t-1}} ( tr_{ρ̂_{t-1}}( Û_1 (ρ̂_{t-1} ⊗ ρ̂_{X_t}) Û_1† ) ) K̂†_{y_{t-1}} )    (8)

We can similarly simplify Û_1 to a set of Kraus operators. We write the unitary operation Û_1 in terms of a set of n Kraus operators {K̂_w}, as if we were to measure ρ̂_{t-1} immediately after the operation Û_1. However, instead of applying the one Kraus operator associated with a measurement as we do in Figure 2b, we sum over all n possible 'observations', as if to 'ignore' the observation on ρ̂_{t-1}. Post-multiplying each Kraus operator in {K̂_w} with each operator in {K̂_y}, we obtain a set of Kraus operators {K̂_{w_y,y}} that can be used to model a classical HMM as follows (the full procedure is described in Algorithm 1):

    ρ̂_t = Σ_{w_y} K̂_{w_y,y_{t-1}} ρ̂_{t-1} K̂†_{w_y,y_{t-1}} / tr( Σ_{w_y} K̂_{w_y,y_{t-1}} ρ̂_{t-1} K̂†_{w_y,y_{t-1}} )    (9)

[Figure 3: Quantum schemes implementing classical HMMs. (a) HQMM scheme with separate transition and emission: 𝒦_w = Σ_w K̂_w(·)K̂†_w; (b) a generalized scheme for HQMMs: 𝒦_{w,y_{t-1}} = Σ_w K̂_{w,y_{t-1}}(·)K̂†_{w,y_{t-1}}.]

We believe this procedure to be a useful illustration of performing classical operations on graphical models using quantum circuits. In practice, we needn't construct the Kraus operators in this peculiar fashion to simulate HMMs; an equivalent but simpler approach is to construct the observable operators {T_y} from the transition and emission matrices as described in Section 2.3, and set the w-th column of K̂_{w_y,y} to the elementwise square root of the w-th column of T_y, i.e., K̂_{w_y,y}^{(:,w)} = √(T_y^{(:,w)}), with all other entries being zero. This ensures Σ_{w_y,y} K̂†_{w_y,y} K̂_{w_y,y} = I.
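The square-root construction just described is easy to state in code. The sketch below (ours; it reuses the stand-in A and C from the Section 2.3 sketch) builds the operators K̂_{w,y}, checks the completeness constraint, and verifies that the diagonal of the Equation 9 numerator tracks the classical unnormalized filter:

```python
import numpy as np

def hmm_kraus(A, C):
    """K[y, w] has sqrt(T_y[:, w]) as its w-th column, zeros elsewhere."""
    n, s = A.shape[0], C.shape[0]
    K = np.zeros((s, n, n, n))                    # indexed [y, w, :, :]
    for y in range(s):
        Ty = np.diag(C[y, :]) @ A                 # observable operator T_y
        for w in range(n):
            K[y, w, :, w] = np.sqrt(Ty[:, w])
    return K

A = np.array([[0.9, 0.2], [0.1, 0.8]])            # column-stochastic stand-ins
C = np.array([[0.7, 0.1], [0.3, 0.9]])
K = hmm_kraus(A, C)
n, s = 2, 2

S = sum(K[y, w].T @ K[y, w] for y in range(s) for w in range(n))
assert np.allclose(S, np.eye(n))                  # sum_{w,y} K^dagger K = I

rho, x = np.diag([0.5, 0.5]), np.array([0.5, 0.5])
for y in [0, 1, 1, 0]:
    rho = sum(K[y, w] @ rho @ K[y, w].T for w in range(n))   # Eq. 9, unnormalized
    x = np.diag(C[y, :]) @ A @ x                             # classical T_y x
assert np.allclose(np.diag(rho), x)               # diagonal = classical filter
```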

Algorithm 1 Simulating Hidden Markov Models with HQMMs
Input: transition matrix A and emission matrix C
Output: belief state as diag(ρ̂), or P(y_1, ..., y_n | D), where D is the HMM
1: Initialization:
2: Let s = # outputs, n = # hidden states, y_t = observed output at time t
3: Prepare density matrix ρ̂_0 in some initial state; ρ̂_0 = diag(π) if priors π are known.
4: Construct unitary matrices Û_1 and Û_2 from A and C, respectively, using Algorithm 3 (in appendix)
5: Using Û_1 and Û_2, construct a set of n Kraus operators {K̂_w} and s Kraus operators {K̂_y}, with K̂_w = V_w Û_1 W and K̂_y = V_y P̂_y Û_2 W, and combine them into a set {K̂_{w_y,y}} with K̂_{w_y,y} = K̂_y K̂_w. (Matrix W tensors with an ancilla, matrix V_y carries out a trivial partial trace operation, and summing over V_w for all w carries out the proper partial trace operation. Details in appendix.)
6: for t = 1 : T do
7:    ρ̂_{t+1} ← Σ_{w_y} K̂_{w_y,y_t} ρ̂_t K̂†_{w_y,y_t}
8: end for
9: tr(ρ̂_T) gives the probability of the sequence; renormalizing ρ̂_T gives the belief state on the diagonal.

3.2 Formulating HQMMs

Monras et al. [2010] formulate Hidden Quantum Markov Models by defining a set of Kraus operators {K̂_{w_y,y}}, where each observable y has w_y associated Kraus operators acting on a state with hidden dimension n, and they form a complete set such that Σ_{w,y} K̂†_{w,y} K̂_{w,y} = I. The update rule for a quantum operation is exactly the same as Equation 9, which we arrived at by first constructing a quantum circuit to simulate HMMs with known parameters and then constructing the operators {K̂_{w,y}} in a very particular way. The process outlined in the previous section is thus one specific parameterization of HQMMs that models HMMs. If we let the operators Û_1 and Û_2 be any unitary matrices, or the Kraus operators be any set of complex-valued matrices satisfying Σ_{w_y,y} K̂†_{w_y,y} K̂_{w_y,y} = I, then we have a general and fully quantum HQMM.

Indeed, Equation 9 gives the forward algorithm for HQMMs. To find the probability of emitting an output y given the previous state ρ̂_{t-1}, we simply take the trace of the numerator in Equation 9, i.e., p(y_t | ρ̂_{t-1}) = tr( Σ_{w_y} K̂_{w_y,y_t} ρ̂_{t-1} K̂†_{w_y,y_t} ).

The number of parameters of an HQMM is determined by the number of latent states n, the number of outputs s, and the number of Kraus operators w associated with each output. To exactly simulate HMM dynamics with an HQMM, we need w = n, as per the derivation above. However, this constraint need not hold for a general HQMM, which can use any number of Kraus operators to apply and sum over for a given output. w can also be thought of as the dimension of the ancilla ρ̂_{X_t} that we tensor with in Figure 2a before the unitary operation Û_1. Consequently, if we set w = 1, we do not tensor with an additional particle, but model the evolution of the original particle as unitary. In all, an HQMM requires learning n²sw parameters, which is a factor of w more than an HMM in the observable operator representation, which has n²s parameters. The canonical representation of an HMM, with an n × n transition matrix and an s × n emission matrix, has n² + ns parameters.

HQMMs can also be seen as a complex-valued extension of the norm-observable operator models defined by Zhao and Jaeger [2010]. Indeed, the HQMM we get by applying Algorithm 1 to an HMM is also a valid NOOM (allowing for multiple operators per output), implying that HMMs can be simulated by NOOMs. Both HMMs and NOOMs can be simulated by HQMMs (the latter is trivially true). While Zhao and Jaeger [2010] show that any NOOM can be written as an OOM, the exact relationship between HQMMs and OOMs requires further investigation.
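For a fully general HQMM, one convenient way to sample valid parameters (our sketch; the paper generates its random HQMMs with QETLAB, not with this procedure) is to orthonormalize a random complex nsw × n matrix and slice it into operator blocks:

```python
import numpy as np

def random_hqmm(n, s, w, seed=0):
    """ws random n x n Kraus operators with sum_{w,y} K^dagger K = I."""
    rng = np.random.default_rng(seed)
    G = rng.normal(size=(n * s * w, n)) + 1j * rng.normal(size=(n * s * w, n))
    kappa, _ = np.linalg.qr(G)         # columns of kappa are orthonormal
    return kappa.reshape(s, w, n, n)   # stacked blocks, indexed [y, w, :, :]

K = random_hqmm(n=2, s=6, w=1)
S = sum(K[y, 0].conj().T @ K[y, 0] for y in range(6))
assert np.allclose(S, np.eye(2))       # completeness constraint holds
```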
4 AN ITERATIVE ALGORITHM FOR LEARNING HQMMs

We present an iterative maximum-likelihood algorithm to learn Kraus operators that model sequential data with an HQMM. Our algorithm is general enough that it can be applied to any quantum version of a classical machine learning algorithm for which the loss is defined in terms of the Kraus operators to be learned.

We begin by writing the likelihood of observing some sequence y_1, ..., y_T. Recall that for a given output y, we apply the w Kraus operators associated with that observable in the 'forward' algorithm, as Σ_{w_y} K̂_{w_y,y}(·)K̂†_{w_y,y}. If we do not renormalize the density matrix after applying these operators, the diagonal entries contain the joint probability of the corresponding system states and of observing the associated sequence.
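In code, this unnormalized forward pass is one line per step, and the trace of the final density matrix is the sequence probability (our sketch, reusing the [y, w] operator layout of the earlier sketches; for long sequences one would renormalize at every step and accumulate the log normalizers to avoid underflow):

```python
import numpy as np

def sequence_log_prob(K, rho0, seq):
    """ln P(y_1, ..., y_T): unnormalized forward pass, then a trace."""
    rho = rho0
    for y in seq:
        rho = sum(K[y, w] @ rho @ K[y, w].conj().T for w in range(K.shape[1]))
    return np.log(np.trace(rho).real)
```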

Algorithm 2 Iterative Learning Algorithm for Hidden Quantum Markov Models
Input: an M × ℓ matrix Y, where M = # data points and ℓ = length of a stochastic sequence to be modeled
Output: a set of ws n × n Kraus operators {K̂_{w,s}} that maximize the log-likelihood of the data, where n is the dimension of the hidden state, s is the number of outputs, and w is the number of operators per output
1: Initialization: Randomly generate a set of ws Kraus operators {K̂_{w,s}} of dimension n × n, and stack them vertically to obtain a matrix κ of dimension nsw × n. Let b be the batch size, B the total number of batches to process, and Y_b a b × ℓ matrix of randomly chosen data samples. Let num_iterations be the number of iterations spent modifying κ to maximize the likelihood of observing Y_b.
2: for batch = 1 : B do
3:    Randomly select b sequences to process, and construct the matrix Y_b
4:    for it = 1 : num_iterations do
5:       Randomly select rows i and j of κ to modify, with i < j
6:       Find the parameters (θ, φ, ψ, δ) that maximize the log-likelihood of Y_b under the update
            κ_i ← ( e^{iφ/2} e^{iψ} cos θ ) κ_i + ( e^{iφ/2} e^{iδ} sin θ ) κ_j
            κ_j ← −( e^{−iφ/2} e^{−iδ} sin θ ) κ_i + ( e^{−iφ/2} e^{−iψ} cos θ ) κ_j
         and apply that update to κ
7:    end for
8: end for

The trace of this un-normalized density matrix gives the probability of the sequence, since we have summed over all the 'hidden' states. Thus, the log-likelihood of an HQMM predicting a sequence of length T is:

    ℒ = ln tr( Σ_{w_1,...,w_T} K̂_{w_T,y_T} ⋯ K̂_{w_1,y_1} ρ̂_0 K̂†_{w_1,y_1} ⋯ K̂†_{w_T,y_T} )    (10)

It is not straightforward to directly maximize this log-likelihood using gradient descent: we must preserve the Kraus operator constraints, and long sequences can quickly lead to underflow issues. Our approach is instead to learn an nsw × n matrix κ*, which is essentially the set of ws Kraus operators {K̂_{w,y}} of dimension n × n, stacked vertically. The Kraus operator constraint Σ_s K̂†_s K̂_s = I requires κ†κ = I, i.e., that the columns of κ are orthonormal.

Let κ be our guess and κ* be the true matrix of stacked Kraus operators that maximizes the likelihood of the observed data. Then there must exist some unitary operator Û that maps κ to κ*, i.e., κ* = Ûκ. Our goal is now to find this matrix Û. To do so, we use the fact that Û can be written as a product of simpler matrices H(i, j, θ, φ, ψ, δ) (proof in the appendix):

    H(i, j, θ, φ, ψ, δ) =
        [ 1  ⋯  0                        ⋯  0                         ⋯  0
          ⋮  ⋱  ⋮                           ⋮                            ⋮
          0  ⋯  e^{iφ/2} e^{iψ} cos θ    ⋯  e^{iφ/2} e^{iδ} sin θ     ⋯  0
          ⋮     ⋮                        ⋱  ⋮                            ⋮
          0  ⋯  −e^{−iφ/2} e^{−iδ} sin θ ⋯  e^{−iφ/2} e^{−iψ} cos θ   ⋯  0
          ⋮     ⋮                           ⋮                         ⋱  ⋮
          0  ⋯  0                        ⋯  0                         ⋯  1 ]

Here i and j specify the two rows of the matrix with the non-trivial entries, and the other parameters θ, φ, ψ, δ are angles that parameterize those entries. The H matrices can be thought of as Givens rotations generalized to complex-valued unitary matrices. Applying such a matrix to κ has the effect of combining rows i and j (i < j):

    κ_i ← ( e^{iφ/2} e^{iψ} cos θ ) κ_i + ( e^{iφ/2} e^{iδ} sin θ ) κ_j
    κ_j ← −( e^{−iφ/2} e^{−iδ} sin θ ) κ_i + ( e^{−iφ/2} e^{−iψ} cos θ ) κ_j    (11)

Now the problem becomes one of identifying the sequence of H matrices that takes κ to κ*. Since the optimization is non-convex and the H matrices need not commute, we are not guaranteed to find the global maximum. Instead, we look for a local maximum that is reachable by multiplying only H matrices that increase the log-likelihood. To find this sequence, we iteratively search for the parameters (i, j, θ, φ, ψ, δ) that, if used in Equation 11, would increase the log-likelihood. To perform this optimization, we use the fmincon function in MATLAB, which uses interior-point optimization. Since it can also be computationally expensive to find the best rows i, j to modify at a given step, in our implementation we pick the rows (i, j) at random. See Algorithm 2 for a summary. We believe more efficient implementations are possible, but we leave this to future work.
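A minimal sketch of one inner iteration of this procedure (ours; we substitute scipy.optimize.minimize for the paper's MATLAB fmincon, and `neg_log_lik` is an assumed callback that slices κ back into Kraus operators and evaluates the batch negative log-likelihood):

```python
import numpy as np
from scipy.optimize import minimize

def rotate_rows(kappa, i, j, theta, phi, psi, delta):
    """Left-multiply kappa by H(i, j, theta, phi, psi, delta): mixes rows i, j (Eq. 11)."""
    out = kappa.copy()
    a, b = np.exp(1j * phi / 2), np.exp(-1j * phi / 2)
    out[i] = a * (np.exp(1j * psi) * np.cos(theta) * kappa[i]
                  + np.exp(1j * delta) * np.sin(theta) * kappa[j])
    out[j] = b * (-np.exp(-1j * delta) * np.sin(theta) * kappa[i]
                  + np.exp(-1j * psi) * np.cos(theta) * kappa[j])
    return out

def learning_step(kappa, neg_log_lik, rng):
    """Lines 5-6 of Algorithm 2: random rows i < j, then optimize the four angles."""
    i, j = sorted(rng.choice(len(kappa), size=2, replace=False))
    res = minimize(lambda v: neg_log_lik(rotate_rows(kappa, i, j, *v)),
                   x0=np.zeros(4), method="Nelder-Mead")
    new_kappa = rotate_rows(kappa, i, j, *res.x)
    return new_kappa if neg_log_lik(new_kappa) < neg_log_lik(kappa) else kappa
```

Because each H is unitary, the update preserves the constraint κ†κ = I to machine precision at every step, which is the point of optimizing over rotations rather than over the entries of κ directly.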
5 EXPERIMENTAL RESULTS

In this section, we evaluate the performance of our learning algorithm on simple synthetic datasets and compare it to the performance of Expectation Maximization for HMMs (Rabiner [1989]). We judge the quality of a learnt model using its description accuracy (DA) (Zhao and Jaeger [2007]), defined as:

    DA = f( 1 + log_s P(Y|D) / ℓ )    (12)

where ℓ is the length of the sequence, s is the number of output symbols in the sequence, Y is the data, and D is the model. Finally, the function f(·) is a nonlinear function that takes its argument from (−∞, 1] to (−1, 1]:

    f(x) = x                                       for x ≥ 0
    f(x) = (1 − e^{−0.25x}) / (1 + e^{−0.25x})     for x < 0    (13)

If DA = 1, the model perfectly predicted the stochastic sequence, while DA > 0 means that the model predicted the sequence better than random.

In each experiment, we generate 20 training sequences of length 3000 and 10 validation sequences of length 3000, with a 'burn-in' of 1000 steps to disregard the influence of the starting distribution. We use QETLAB (a MATLAB toolbox developed by Johnston [2016]) to generate random HQMMs. We apply our learning algorithm once to learn HQMMs from data and report the DA. We use the Baum-Welch algorithm implemented in the hmmtrain function from MATLAB's Statistics and Machine Learning Toolbox to learn HMM parameters. When training HMMs, we train 10 models and report the best DA.

The first experiment compares learned models on data generated by a valid 'probability clock' NOOM/HQMM model (Zhao and Jaeger [2007]) that theoretically cannot be modeled by a finite-dimensional HMM. The second experiment considers a 2-state, 6-output HQMM that requires at least 4 classical states to model. These experiments are meant to showcase the greater expressiveness of HQMMs compared with HMMs, and we empirically demonstrate that our algorithm is able to learn an HQMM that predicts the generated data better than EM for classical HMMs.
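For reference, a direct transcription of Equations 12 and 13 above (our sketch; `loglik` is assumed to be ln P(Y|D) for one sequence of the given length):

```python
import numpy as np

def description_accuracy(loglik, length, n_symbols):
    """DA = f(1 + log_s P(Y|D) / l), with f from Eq. 13."""
    x = 1.0 + loglik / (length * np.log(n_symbols))  # log base s via change of base
    return x if x >= 0 else (1 - np.exp(-0.25 * x)) / (1 + np.exp(-0.25 * x))
```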

5.1 Probability Clock

Zhao and Jaeger [2010] describe a 2-hidden-state, 2-observable NOOM 'probability clock,' in which the probability of generating an observable a changes periodically with the length of the sequence of as preceding it, and which cannot be modeled with a finite-dimensional HMM:

    K̂_{1,1} = [ 0.6 cos(0.6)   sin(0.6)        K̂_{1,2} = [ 0.8  0
               −0.6 sin(0.6)   cos(0.6) ]                  0    0 ]    (14)

This is a valid HQMM since Σ_{y=1}^{2} K̂†_{1,y} K̂_{1,y} = I. Observe that this HQMM has only 1 Kraus operator per observable, which means it models the state evolution as unitary.

Our results in Table 2 demonstrate that the probability clock generates data that is hard for HMMs to model, and that our iterative algorithm yields a simple HQMM that matches the predictive power of the original model.

Table 2: Performance of various HQMMs and HMMs learned from data generated by the probability clock model. HQMM parameters are given as (n, s, w) and HMM parameters are given as (n, s), where n is the number of hidden states, s is the number of observables, and w is the number of Kraus operators per observable. (T) indicates the true model, (L) indicates learned models. P is the number of parameters. Both the mean and the standard deviation of the DA are given for training and test data.

    Model                P    Train DA          Test DA
    (2, 2, 1) HQMM (T)   8    0.1642 (0.0089)   0.1632 (0.0111)
    (2, 2, 1) HQMM (L)   8    0.1640 (0.0088)   0.1631 (0.0111)
    (2, 2) HMM (L)       8    0.0851 (0.0074)   0.0833 (0.0131)
    (4, 2) HMM (L)      24    0.1459 (0.0068)   0.1446 (0.0100)
    (8, 2) HMM (L)      80    0.1639 (0.0087)   0.1630 (0.0108)

5.2 A Fully Quantum HQMM

Here, we present the results of our algorithm on a fully quantum HQMM. Since we use complex-valued entries, there is no known way of writing our model as an equivalent-sized HMM or observable operator model.

We motivate this model with a physical system. Consider electron spin: quantized angular momentum that can be either 'up' or 'down' along whichever spatial axis the measurement is made, but not in between. There is no well-defined 3D vector describing electron spin along the three spatial dimensions, only 'up' or 'down' along a chosen axis of measurement (i.e., measurement basis). This is unlike classical angular momentum, which can be represented by a vector with well-defined components in three spatial dimensions. Picking an arbitrary direction as the z-axis, we can write the electron's spin state in the {+z, −z} basis, so that [1 0]ᵀ is |+z⟩ and [0 1]ᵀ is |−z⟩. But electron spin constitutes a two-state quantum system, so it can be in superpositions of the orthogonal 'up' and 'down' quantum states, which can be parameterized with (θ, φ) and written as |ψ⟩ = cos(θ/2)|+z⟩ + e^{iφ} sin(θ/2)|−z⟩, where 0 ≤ θ ≤ π and 0 ≤ φ ≤ 2π. The Bloch sphere (a sphere with radius 1) is a useful tool to visualize qubits, since it can map any two-state system to a point on the surface of the sphere, using (θ, φ) as the polar and azimuthal angles. We could also have chosen {+x, −x} or {+y, −y} as the basis, and these basis states can be written in our original basis:

    |+x⟩ = (1/√2)|+z⟩ + (1/√2)|−z⟩    (θ = π/2, φ = 0)       (15)
    |−x⟩ = (1/√2)|+z⟩ − (1/√2)|−z⟩    (θ = π/2, φ = π)       (16)
    |+y⟩ = (1/√2)|+z⟩ + (i/√2)|−z⟩    (θ = π/2, φ = π/2)     (17)
    |−y⟩ = (1/√2)|+z⟩ − (i/√2)|−z⟩    (θ = π/2, φ = 3π/2)    (18)

Now consider the following process, inspired by the Stern-Gerlach experiment (Gerlach and Stern [1922]) from quantum mechanics. We begin with an electron whose spin we represent in the {+z, −z} basis. At each time step, we pick one of the x, y, or z directions uniformly at random, and apply an inhomogeneous magnetic field along that axis. This is an act of measurement that collapses the electron spin to either 'up' or 'down' along that axis, which will deflect the electron in that direction. Let us use the following encoding scheme for the results of the measurement: 1: +z, 2: −z, 3: +x, 4: −x, 5: +y, 6: −y. Consequently, at each time step, the observation tells us which axis we measured along, and whether the spin of the particle is now 'up' or 'down' along that axis. As an example, if we prepare an electron spin 'up' along the z-axis and observe the sequence 1, 3, 2, 6, it means that we applied the inhomogeneous magnetic field in the z-direction, then the x-direction, then the z-direction, and finally the y-direction, causing the electron spin state to evolve as +z, +x, −z, −y. Note that the transitions 1 ↔ 2, 3 ↔ 4, and 5 ↔ 6 are not allowed, since there are no spin-flip operations in our process. Admittedly, this is a slightly contrived example, since normally we think of a hidden state that evolves according to some rules while producing noisy observations. Here, we select the observation (down to the pair: (1, 2), (3, 4), or (5, 6)) that we wish to observe, and that tells us how the 'hidden state' evolves in the chosen basis.

This model is related to the 2-state HQMM requiring 3 classical states described in Monras et al. [2010]. It is still a 2-state system, but we add two new Kraus operators with complex entries and renormalize:

    K̂_{1,1} = [ 1/√3  0          K̂_{1,2} = [ 0  0
               0     0 ]                    0  1/√3 ]    (19)

    K̂_{1,3} = [ 1/(2√3)  1/(2√3)         K̂_{1,4} = [  1/(2√3)  −1/(2√3)
               1/(2√3)  1/(2√3) ]                   −1/(2√3)   1/(2√3) ]    (20)

    K̂_{1,5} = [ 1/(2√3)  −i/(2√3)        K̂_{1,6} = [  1/(2√3)   i/(2√3)
               i/(2√3)   1/(2√3) ]                  −i/(2√3)   1/(2√3) ]    (21)

Physically, the Kraus operators K̂_{1,1} and K̂_{1,2} keep the spin along the z-axis, K̂_{1,3} and K̂_{1,4} rotate the spin to lie along the x-axis, and K̂_{1,5} and K̂_{1,6} rotate the spin to lie along the y-axis.
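Since each of these operators is just 1/√3 times a projector onto one of the basis states of Equations 15–18, the completeness constraint and the collapse dynamics can be checked in a few lines (our sketch):

```python
import numpy as np

kets = {1: np.array([1, 0]),                  # +z
        2: np.array([0, 1]),                  # -z
        3: np.array([1, 1]) / np.sqrt(2),     # +x  (Eq. 15)
        4: np.array([1, -1]) / np.sqrt(2),    # -x  (Eq. 16)
        5: np.array([1, 1j]) / np.sqrt(2),    # +y  (Eq. 17)
        6: np.array([1, -1j]) / np.sqrt(2)}   # -y  (Eq. 18)

# K_{1,y} = (1/sqrt(3)) |y><y|  (Eqs. 19-21): one operator per observable.
K = {y: np.outer(v, v.conj()) / np.sqrt(3) for y, v in kets.items()}
S = sum(Ky.conj().T @ Ky for Ky in K.values())
assert np.allclose(S, np.eye(2))              # sum_y K^dagger K = I

# One filtering step (Eq. 9): observing '3' collapses the spin onto +x.
rho = np.outer(kets[1], kets[1].conj())       # start spin-up along z
num = K[3] @ rho @ K[3].conj().T
rho = num / np.trace(num)
assert np.allclose(rho, np.outer(kets[3], kets[3].conj()))
```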
Following the approach of Monras et al. [2010], we write down an equivalent 6-state HMM and compute the rank of a Hankel matrix containing the statistics of this process, yielding a requirement of 4 classical states as a weak lower bound.

We present the results of our learning algorithm applied to data generated by this model in Table 3. We find that our algorithm can learn a 2-state HQMM (the same size as the model that generated the data) with predictive power matched only by a 6-state HMM.

Table 3: Performance of various HQMMs and HMMs on the fully quantum HQMM. HQMM parameters are given as (n, s, w) and HMM parameters are given as (n, s), where n is the number of hidden states, s is the number of observables, and w is the number of Kraus operators per observable.

    Model                P    Train DA          Test DA
    (2, 6, 1) HQMM (T)  24    0.1303 (0.0042)   0.1303 (0.0047)
    (2, 6, 1) HQMM (L)  24    0.1303 (0.0042)   0.1301 (0.0047)
    (2, 6) HMM (L)      16    0.0327 (0.0038)   0.0328 (0.0033)
    (3, 6) HMM (L)      27    0.0522 (0.0043)   0.0530 (0.0040)
    (4, 6) HMM (L)      40    0.0812 (0.0042)   0.0822 (0.0045)
    (5, 6) HMM (L)      55    0.0967 (0.0042)   0.0967 (0.0045)
    (6, 6) HMM (L)      72    0.1305 (0.0042)   0.1301 (0.0049)

5.3 Discussion

Interestingly, we are able to learn reasonable models with w = 1, i.e., modeling the state evolution as unitary. Indeed, both the probability clock and the Stern-Gerlach-inspired model assume unitary state evolution, and these HQMMs can model the same sequences with far fewer parameters than an HMM. We provide additional experimental results in the appendix showing that we can also learn the 2-state HQMM presented by Monras et al. [2010] from data. In our experiments on HMM-generated data, we found that small HQMMs outperform HMMs with the same number of hidden states, although the parameter count ends up being larger (see the appendix for results). As model size increases, our HQMMs become over-parameterized and prone to getting stuck in local optima, and EM for HMMs may work better in practice on HMM-generated data.

6 CONCLUSION

We formulated and parameterized HQMMs by first finding quantum circuits to implement HMMs and then relaxing some constraints. We showed how quantum analogues of classical conditioning and marginalization can be implemented, and these methods are general enough to construct quantum versions of any probabilistic graphical model. We also proposed an iterative maximum-likelihood algorithm to learn the Kraus operators of HQMMs. We demonstrated that our algorithm can successfully learn HQMMs that were shown, at least theoretically, to better model certain sequences in the literature. While our HQMMs cannot model data any better than a sufficiently large HMM, we find that HQMMs can model the same data equally well with fewer hidden states. Future work could look at optimizing our algorithm to scale to larger datasets, and at determining more generally when HQMMs are more suitable than HMMs. We speculate that quantum models could lead to improvements in areas where 'quantum' effects may better model the underlying dynamic processes.

Acknowledgements

We thank Theresa W. Lynn for her advice and input on this work.

References

Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Nathan Wiebe, and Seth Lloyd. Quantum machine learning. arXiv preprint arXiv:1611.09347, 2016.

Lewis A. Clark, Wei Huang, Thomas M. Barlow, and Almut Beige. Hidden quantum Markov models and open quantum systems with instantaneous feedback. In ISCS 2014: Interdisciplinary Symposium on Complex Systems, pages 143–151. Springer, 2015.

Walther Gerlach and Otto Stern. Der experimentelle Nachweis der Richtungsquantelung im Magnetfeld. Zeitschrift für Physik, 9(1):349–352, 1922.

Herbert Jaeger. Observable operator models for discrete stochastic time series. Neural Computation, 12(6):1371–1398, 2000.

Nathaniel Johnston. QETLAB: A MATLAB toolbox for quantum entanglement, version 0.9. http://qetlab.com, January 2016.

Alex Monras, Almut Beige, and Karoline Wiesner. Hidden quantum Markov models and non-adaptive read-out of many-body states. arXiv preprint arXiv:1002.2337, 2010.

Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

Maria Schuld, Ilya Sinayskiy, and Francesco Petruccione. An introduction to quantum machine learning. Contemporary Physics, 56(2):172–185, 2015a.

Maria Schuld, Ilya Sinayskiy, and Francesco Petruccione. Simulating a perceptron on a quantum computer. Physics Letters A, 379(7):660–663, 2015b.

Nathan Wiebe, Ashish Kapoor, and Krysta Svore. Quantum perceptron models. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3999–4007. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6401-quantum-perceptron-models.pdf.

Ming-Jie Zhao and Herbert Jaeger. Norm observable operator models. Technical report, Jacobs University, 2007.

Ming-Jie Zhao and Herbert Jaeger. Norm-observable operator models. Neural Computation, 22(7):1927–1959, 2010.