<<

On the Principles of Differential Quantum Programming Languages

Shaopeng Zhu Shih-Han Hung Shouvanik Chakrabarti Xiaodi Wu University of Maryland, College Park, USA

Abstract where near-term Noisy Intermediate-Scale Quantum (NISQ) Variational Quantum Circuits (VQCs), or the so-called quan- computers [37], e.g., the 53- from tum neural-networks, are predicted to be one of the most im- [2] and IBM [20], becomes the important platform for portant near-term quantum applications, not only because of demonstrating quantum applications. Variational quantum their similar promises as classical neural-networks, but also circuits (VQCs) [15, 16, 35], or the so-called quantum neural because of their feasibility on near-term noisy intermediate- networks, are predicted to be one of the most important ap- size quantum (NISQ) machines. The need for gradient infor- plications on NISQ machines. This is not only because VQCs mation in the training procedure of VQC applications has bear a lot of similar promises like classical neural networks as stimulated the interest and development of auto-differentiation well as potential quantum speed-ups from the perspective of techniques on quantum circuits. We propose the first formal- machine learning (e.g., see the survey [8]), but also because ization of this technique, not only in the context of quantum VQC is, if not the only, one of the few candidates that can circuits but also for imperative quantum programs (e.g., with be implemented on NISQ machines. Because of this, a lot of controls), inspired by the success of differential programming study has already been devoted to the design, analysis, and languages in classical machine learning. In particular, we small-scale implementation of VQC (e.g., see the survey [6]). overcome a few unique difficulties caused by exotic quantum Typical VQC applications replace classical neural net- features (such as quantum no-cloning) and provide a rigor- works, which are just parameterized classical circuits, by ous formulation of differentiation applied to bounded-loop quantum circuits with classically parameterized unitary gates. imperative quantum programs, its code-transformation rules, Namely, one will have a "quantum" mapping from input to as well as a sound logic to reason about their correctness. output replacing classical mapping in machine learning ap- Moreover, we have implemented our code transforma- plications. An important component of these applications tion in OCaml and demonstrated the resource-efficiency of is a training procedure which optimizes a loss function the our scheme both analytically and empirically. We also con- now depends on the read-outs and the parameters of VQCs. duct a case study of training a VQC instance with controls, It is well-known that gradient-based approaches are the which shows the advantage of our scheme over existing best for the training procedure. However, computing the auto-differentiation on quantum circuits without controls. gradients of loss functions that depend on the read-outs of quantum circuits has a similar complexity of simulating quan- Keywords quantum programming languages, differential tum circuits, which is infeasible for classical computation. programming languages, auto differentiation, quantum ma- Thus, the ability of evaluating these "quantum" gradients effi- chine learning, collection of semantics ciently by quantum computation is critical for the scalability of VQC applications, e.g., on the recent 53-qubit machines. 1 Introduction Fortunately, analytical formulas of gradients used in VQCs Background. Recent years have witnessed the rapid de- have been studied by [16, 19, 29, 41, 42]. In particular, Schuld velopment of , with practical advances et al. [41] proposed the so-called phase-shift rule that uses coming from both research and industry. Quantum program- two quantum circuits to compute the partial derivative re- ming is one topic that has been actively investigated. Early spective to one parameter for quantum circuits. One of the work on language design [22, 32, 39, 40, 44] has been fol- very successful tools for , called lowed up recently by several implementations of these lan- PennyLane [7], implemented the phase-shift rule to achieve guages, including Quipper [24], Scaffold1 [ ], LIQUi|⟩ [49], auto differentiation (AD) for the read-outs of quantum cir- Q# [46], and QWIRE [33]. Extensions of program logics have cuits. It also integrated automatic differentiation from es- also been proposed for verification of quantum programs tablished machine learning libraries such as TensorFlow or [3, 9, 10, 17, 27, 52, 55]. See also surveys [18, 43, 53]. PyTorch for any additional classical calculation in the train- With the availability of prototypes of quantum machines, ing procedure. However, none of these research was con- especially the recent establishment of [2], ducted from the perspective of programming languages and the research of quantum computing has entered a new stage no rigorous foundation or principles have been formalized. Draft paper, November 22, 2019, Motivations. An important motivation of this paper is to 2020. provide a rigorous formalization of the auto-differentiation 1 Draft paper, November 22, 2019, S. Zhu, S. Hung, S. Chakrabarti &X.Wu technique applied to quantum circuits. In particular, we natural choice also serves the purpose of gradients computa- will provide a formal formulation of quantum programs, tion of loss functions in quantum machine learning applica- their semantics, and the meaning of differentiation on them. tions, which are typically phrased in terms of these read-outs. We will also study the code-transformation rules for auto- We also directly model the parameterization of quantum pro- differentiation and prove their correctness. grams after VQCs, i.e., each unitary gate becomes classically As we will highlight below, research on the formalization parameterized. The above modeling of quantum programs is will encounter many new challenges that have not been con- very different from classical ones. It is also unclear whether sidered and addressed by existing results [16, 19, 29, 41, 42]. any reasonable analogues of classical chain-rule and for- Consider one of the basic requirements, e.g., composition- ward/backward mode can exist within quantum programs. ality. As we will show, differentiating the composition of To model the meaning of one quantum program com- quantum programs will necessarily involve running multi- puting the derivative of another, we define the differential ple quantum programs on copies of initial quantum states. semantics of programs. There is a subtle quantum-unique How to represent the collection of quantum programs suc- design choice. The semantics of any quantum pro- cinctly and also bound the number of required copies is a gram will depend on the observable and its input state. Thus, totally new question. Among our techniques to address this any program computing its derivative could potentially de- question, we also need to change the previously proposed pend on these two extra factors. We find out this potential construct, e.g., the phase-shift rule [41], to something else. dependence is undesirable and propose the strongest possi- Moreover, we want to go beyond the restriction to quan- ble definition: i.e., one derivative computing program should tum circuits in [16, 19, 29, 41, 42]. We are inspired by classical work for any pair of observable and input states. We demon- machine learning examples that demonstrate the advantage strate that this strong requirement is not only achievable but of neural-networks with program features (e.g. controls) also critical for the composition of auto differentiation. over the plain ones (e.g., classical circuits) [23, 25], which is We are ready to describe the technical challenges for the also the major motivation of promoting the the paradigm compositionality. Consider the following quantum program: shift from deep learning towards differential programming QMUL ≡ U (θ);U (θ), (1.4) by prominent machine learning researchers. 1 2 Augmenting VQCs with controls is not only feasible on which performs U1(θ) and U2(θ) gates sequentially. Note NISQ machines (at least for simple ones), but also now a that gate application is matrix multiplication in quantum. logical step for the study of their applications in machine Roughly speaking, if the product rule of differentiation (as learning. (We indeed conduct one case study in Section 8.) exhibited in (1.3)) remains in the quantum setting, at least Thus, we are inspired to investigate the principles of dif- ∂ symbolically, then one should expect ∂θ (QMUL) contains ferential quantum imperative languages beyond circuits. ∂ ∂ Research challenges & Solutions. We will rely on a little (U1(θ));U2(θ) and U1(θ); (U2(θ)) (1.5) ∂θ ∂θ notation that should be self-explanatory. Please refer to a detailed preliminary on in Section 2. two different parts as sub-programs similarly in (1.3). ∂ ( ( )) ( ) ∂ ( ( )) Let us start with a simple classical program However, we cannot run ∂θ U1 θ ;U2 andU1 θ ; ∂θ U2 θ together due to the quantum no-cloning theorem [51]. This MUL ≡ v3 = v1 × v2, (1.1) is because they share the same initial state and we cannot clone two copies of it and maintain the correct correlation where v3 is the product of v1 and v2. Consider the differenti- among different parts. Note that this is not an issue classi- ation with respect to θ, we have cally as we can store all vi,vÛi at the same time as in (1.3). ∂ As a result, quantum differentiation needs to run multiple (MUL) ≡ v = v × v ; (1.2) ∂θ 3 1 2 (sub-)programs on multiple copies of the initial state. This poses a unique challenge for differentiation of quan- vÛ3 = vÛ1 × v2 + v1 ×v Û2, (1.3) tum composition: (1) we hope to have a simple scheme of where MUL keeps track of variablesv1,v2,v3 and their deriva- code transformation, ideally close-to-classical, for intuition tives vÛ1,vÛ2,vÛ3 at the same time. One simple yet important ob- and easy implementation of the compiler, whereas it needs to servation is that classical variables v1,v2,v3 are real-valued express correctly the collection of quantum programs during and can be naturally differentiated, while quantum variables code transformation; (2) for the purpose of efficiency, we also are quantum states represented by matrices. want to reasonably bound the number of required copies What are hence reasonable quantities to compute deriva- of the initial states, which roughly refers to the number of tives about quantumly? One natural choice from the prin- different quantum programs in the collection. ciples of is the (classical) read-outs of We develop a few techniques to achieve both goals at the quantum systems through measurements, which we formu- same time. First, we propose the so-called additive quantum late as the observable semantics of quantum programs. This programs as a succinct intermediate representation for the 2 On the Principles of Differential Quantum Programming Languages Draft paper, November 22,2019, collection of programs during the code transformation. Now the soundness of our logic and use it to show the correctness the entire differentiation procedure will be divided into two of our code-transformation rules. steps: (1) all code transformations happen on additive pro- In Section 7, we conduct a resource analysis to further grams and are very similar to classical ones (see Figure 4) ; (2) justify our design. We show that the occurrence count of the collection of programs can be recovered by a compilation parameters capture the extra resource required in both the procedure from any additive program. Additive quantum pro- classical auto-differentiation and our scheme. In this sense, grams are equipped with a new sum operation that models our resource cost is reasonable comparing to classical. the multiple choices as exhibited in (1.5), which resembles a Finally, in Section 8, we demonstrate the implementation similar idea in differential lambda-calculus [12]. of our code transformation in OCaml and apply it to the ∂ Second, we also design a new rule for ∂θ (U (θ)) which training of one VQC instance with controls via classical sim- is slightly different from16 [ , 19, 29, 41, 42]. The existing ulation. Specifically, this instance shows an advantage of con- phase-shift rule makes use of two quantum circuits for one trols in machine learning tasks, which implies the advantage differentiation, which causes a lot of inconvenience inthe of our scheme over previous ones that cannot handle controls. formulation and potential trouble for efficiency. We replace We have also empirically verified the resource-efficiency of it by one circuit with one additional ancilla register for each our scheme on representative VQC instances. parameter. We conduct a careful resource analysis of our Related classical work. There is an extensive study of au- differentiation procedure and show the number of required tomatic differentiation (AD) or differential programming in copies of initial states is reasonable comparing to the clas- the classical setting (e.g., see books [11, 26]). The most rele- sical setting. The correctness of the code transformation of vant to us are those studies from the composition critically relies on our design choice as well as perspective. AD has traditionally been applied to imperative the strong definition related to the differential semantics. programs in both the forward mode, e.g. [28, 50], and the With the previous setup, we can naturally build the differ- reverse mode, e.g., [45]. The famous backpropagation algo- entiation for controls (i.e., the quantum condition statement). rithm [38] is also a special case of reserve-mode AD used to Note that a general solution for classical controls is still un- compute the gradient of a multi-layer perceptron. AD has known [5] due to the non-smoothness of the guard. Similar also been recently applied to functional programs [13, 14, 34]. to the classical setting [36], we do not have a solution to deal Motivated by the success of deep learning, there is significant with general unbounded loop. We instead provide a solution recent interest to develop both the theory and the implemen- to bounded loops which can be deemed as a consisting tation of AD techniques. Please refer to the survey [4] and of simple quantum condition statements. the keynote talk at POPL’18 [36] for more details. Contributions. We formulate the parameterized quantum bounded while-programs with classically parameterized uni- 2 Quantum Preliminaries tary gates modelled after VQCs [15, 30, 35] and their realistic We present basic quantum preliminaries here (a summary of examples on ion-trap machines , e.g. [56], in Section 3. notation in Table 1). Details are deferred to Appendix A. In Section 4, we illustrate our design of additive quantum programs. Specially, we add the syntax P1 + P2 to represent 2.1 Math Preliminaries the either-or choice between P and P in (1.5). We formulate 1 2 Let n be a natural number. We refer to the complex vector its semantics and the compilation rules that map any additive space Cn as an n-dimensional Hilbert space H. We use |ψ ⟩ to program into a collection of normal programs in the way to denote a complex vector in Cn. The Hermitian conjugate of serve our differentiation purpose. |ψ ⟩ ∈ Cn is denoted by ⟨ψ |. The inner product of |ψ ⟩ and |ϕ⟩, In Section 5, we formulate the observable and the differ- defined as the product of ⟨ψ | and |ϕ⟩, is denoted by ⟨ψ |ϕ⟩. ential semantics of quantum programs and formally define The norm of a vector |ψ ⟩ is denoted by ∥|ψ ⟩∥ = p⟨ψ |ψ ⟩. the meaning of program S ′(θ) computing the differential se- We define (linear) operators as linear maps between Hilbert mantics of S(θ) in the strongest possible sense. In particular, spaces, which can be represented by matrices for finite di- we require that S ′(θ) should work for any pair of observable mension. Let A be an operator; its Hermitian conjugate is and input state, which is critical for the compositionality. denoted by A†. A is Hermitian if A = A†. The trace of A is the In Section 6, we show that such a strong requirement is sum of the entries on the main diagonal, i.e., tr(A) = Í A . indeed achievable by demonstrating the code-transformation i ii We write ⟨ψ |A|ψ ⟩ to mean the inner product of |ψ ⟩ and A|ψ ⟩. rules for the differentiation procedure. Thanks to the use A Hermitian operator A is positive semidefinite if for all vec- of additive quantum programs, the code transformation is tors |ψ ⟩ ∈ H, ⟨ψ |A|ψ ⟩ ≥ 0. much simplified and as intuitive as classical ones. We also develop a logic with the judgement S ′(θ)|S(θ) that states 2.2 Quantum States and Operations S ′(θ) computes the differential semantics of S(θ). We prove The state space of a qubit is a 2-dimensional Hilbert space. Two important orthonormal bases of a qubit system are: the 3 Draft paper, November 22, 2019, S. Zhu, S. Hung, S. Chakrabarti &X.Wu

Table 1. A brief summary of notation used in this paper More generally, evolution of a quantum system can be characterized by an admissible superoperator E, namely a completely-positive and trace-non-increasing (See Appendix A) Spaces H, A L(H) (Linear operators) linear map from D(H) to D(H ′) for Hilbert spaces H, H ′. States (pure states) |ψ ⟩, |ϕ⟩ (metavariables); For every superoperator E, there exists a set of Kraus |0⟩, |1⟩, |+⟩, |−⟩ { } E( ) Í † operators Ek k such that ρ = k Ek ρEk for any input , | ⟩⟨ | † (density ) ρ σ (meta); ψ ψ ρ ∈ D(H). The Kraus form of E is therefore E = Í E ◦ E . Operations (unitaries) U ,V, σ (meta); k k k An identity operation refers to the superoperator IH = IH ◦ H, X, Z, X ⊗ X IH. The Schrödinger-Heisenberg dual of a superoperator E = (superoperators) E, F (general); Í E ◦ E† , denoted by E∗, is defined as follows: for every Φ (quantum channels) k k k state ρ ∈ D(H) and any operator A, tr(AE(ρ)) = tr(E∗(A)ρ). Measurements M {M } (general); m m The Kraus form of E∗ is Í E† ◦ E . {|0⟩⟨0|, |1⟩⟨1|} k k k Í O m λm |ϕm⟩⟨ϕm | (general); |0⟩⟨0| − |1⟩⟨1| (example) 2.3 Quantum Measurements Programs (no parameters) P, Q (meta) Quantum measurements extracts classical information out of (parameterized) P(θ), S(θ) (meta); quantum systems. A quantum measurement on a system over Rσ (θ)(notable examples) Hilbert space H can be described by a set of linear operators Í † (collection) P(θ), S(θ) (meta); {Mm }m with m MmMm = IH (identity matrix on H). If we ∂ perform a measurement {Mm }m on a state ρ, the outcome m ∂θ (QCT(7,6)(θ)) (example) ′ † Semantics (operational) ⟨P, ρ⟩ → ⟨Q, ρ ⟩ (steps) is observed with probability pm = tr(Mm ρMm) for each m, ′ † (denotational) [[P]]ρ = ρ (example) and the post-measurement state collapses to Mm ρMm/pm. (observable) [[(O, ρ) → P(θ)]] (example) ∂ Code Process (Transform) ∂θ (S(θ)) (example) 3 Parameterized Quantum Bounded (Compile) Compile(S ′(θ)) (example) While-Programs ′( )| ( ) Logic (Judgement) S θ S θ (example) We adopt the bounded-loop variant of the quantum while- Count (Non-Abort) |# ∂ (P(θ))| (example) ∂θj language developed by Ying [53], and augment it by parame- (Occurrence) OCj (P(θ)) (example) terizing the unitaries, as this provides sufficient expressibility for parameterized quantum operations: indeed, abort, skip and initialization behave independently of parameters, while computational basis with |0⟩ = (1, 0)† and |1⟩ = (0, 1)†; the ± “parameterized measurements” can be implemented with a basis, consisting of |+⟩ = √1 (|0⟩+|1⟩) and |−⟩ = √1 (|0⟩−|1⟩). regular measurement followed by a parameterized unitary. 2 2 A pure is a vector |ψ ⟩ with ∥|ψ ⟩∥ = 1.A From here onward, v is a finite set of variables, and θ a mixed state can be represented by a classical distribution length-k vector of real-valued parameters. over an ensemble of pure states {(pi, |ψi ⟩)}i , i.e., the system is in state |ψi ⟩ with probability pi . One can use density op- 3.1 Syntax erators to represent states: a density operator representing Define Var as the set of quantum variables. We use the sym- the ensemble {(pi, |ψi ⟩)}i is a positive semidefinite operator bol q as a metavariable ranging over quantum variables and Í ρ = i pi |ψi ⟩⟨ψi |. A positive semidefinite operator ρ on H define a quantum register q to be a finite set of distinct vari- is said to be a partial density operator if tr(ρ) ≤ 1. The set ables. For each q ∈ Var, its state space is denoted by Hq . of partial density operators on H is denoted by D(H). The quantum register q is associated with the Hilbert space Ë 1 Operations on quantum systems can be characterized by Hq = q ∈q Hq . A T -bounded, k-parameterized quantum unitary operators. Denoting the set of linear operators on H while-program is generated by the following syntax: as L(H), an operator U ∈ L(H) is unitary if U †U = UU † = IH. A unitary evolves a pure state |ψ ⟩ to U |ψ ⟩, or a density P(θ) ::= abort[q] | skip[q] | q := |0⟩ | q := U (θ)[q] | operator ρ to U ρU †. Common unitary operators include: P (θ); P (θ) | case M[q] = m → P (θ) end | the Hadamard operator H, which transforms between the 1 2 m (T ) computational and the ± basis via H |0⟩ = |+⟩ and H |1⟩ = while M[q] = 1 do P1(θ) done, |−⟩; the Pauli X operator which performs a bit flip, i.e., X |0⟩ = |1⟩ and X |1⟩ = |0⟩; Pauli Z which performs a phase flip, i.e., | ⟩ | ⟩ | ⟩ −| ⟩ | ⟩ | ⟩ 1 Z 0 = 0 and Z 1 = 1 ; Pauli Y mapping 0 to i 1 If type(q) = Bool then Hq = span{|0⟩, |1⟩}. If type(q) = Bounded Int + and |1⟩ to −i|0⟩; CNOT gate mapping |00⟩ 7→ |00⟩, |01⟩ 7→ then Hq is with basis {|n⟩ : n ∈ [−N , N ]}(N ∈ Z ) for some finite N . |01⟩, |10⟩ 7→ |11⟩, |11⟩ 7→ |10⟩. We require the Hilbert space to be finite dimensional for implementation. 4 On the Principles of Differential Quantum Programming Languages Draft paper, November 22,2019, where constructed by applying some unitary U to |0⟩. (2) Evaluat- (1) ing the guard of a case statement or loop, which performs a while M[q] = 1 do P1(θ) done measurement, potentially disturbs the state of the system. ≡ case M[q] = {0 → skip, 1 → P1(θ); abort},(3.1)

(T ≥2) 3.2 Operational and while M[q] = 1 do P1(θ) done (T −1) We present the operational semantics of parameterized pro- ≡ case M[q] = {0 → skip, 1 → P1(θ); while } grams in Figure 1a. Transition rules are represented as ⟨P, ρ⟩ Unparameterized programs can be obtained by fixing θ ∗ ∈ Rk → ⟨P ′, ρ′⟩, where ⟨P, ρ⟩ and ⟨P ′, ρ′⟩ are quantum config- in some P(θ). We denote the set of variables accessible to urations.2 In configurations, P (or P ′) could be a quantum P(θ) as qVar(P(θ)); the collection of all “T -bounded while- program or the empty program ↓, and ρ and ρ′ are partial ( ) ( ( )) (T )( ) density operators representing the current state. Intuitively, programs P θ s.t. qVar P θ = v” as q-whilev θ , and (T ) in one step, we can evaluate program P on input state ρ to similarly, the unparameterized one as q-while . v program P ′ (or ↓) and output state ρ′. In order to present the Now let us formally define parameteriztion of unitaries: rules in a non-probabilistic manner, the probabilities associ- let θ := (θ , ··· , θ )(k ≥ 1).A k-parameterized unitary 1 k ated with each transition are encoded in the output partial U (θ) is a function Rk → L(H) s.t. (1) given any θ ∗ ∈ Rk , density operator.3 For m, the superoperator E is defined by U (θ ∗) is an unitary on H , and (2) the parameterized-matrix m E ( ) † representation of U (θ) is entry-wise smooth. m ρ = Mm ρMm, yielding the post-measurement state. Ebool ( ) A important family of examples is the single-qubit rotations In the Initialization rule, the superoperators q→0 ρ and int about the Pauli axis X, Y, Z with angle θ: Eq→0(ρ), which initialize the variable q in ρ to |0⟩⟨0|, are bool defined by Eq→0(ρ) = |0⟩q ⟨0|ρ|0⟩q ⟨0| + |0⟩q ⟨1|ρ|1⟩q ⟨0| and −iθ B−int ÍN + E (ρ) = |0⟩q ⟨n|ρ|n⟩q ⟨0| (N ∈ Z ). Here, |ψ ⟩q ⟨ϕ| Rσ (θ) := exp( σ), σ ∈ {X, Y, Z } (3.2) q→0 n=−N 2 denotes the outer product of states |ψ ⟩ and |ϕ⟩ associated One can also extend Pauli rotations to multiple . with variable q; that is, |ψ ⟩ and |ϕ⟩ are in Hq and |ψ ⟩q ⟨ϕ| For example, consider two-qubit coupling gates {Rσ ⊗σ := is a matrix over Hq . It is conventional in the quantum in- −iθ exp( 2 σ ⊗ σ)}σ ∈{X ,Y ,Z }. Note that these two-qubit gates formation literature that when operations or measurements can generate entanglement between two qubits. Combined only apply to part of the quantum system (e.g., a subset of with single-qubit rotations, they form a universal gate set for quantum variables of the program), one should assume that quantum computation. Another important feature of these an identity is applied to the rest of quantum variables. For gates is that they can already be reliably implemented in example, applying |ψ ⟩q ⟨ϕ| to ρ means applying |ψ ⟩q ⟨ϕ|⊗IHq¯ experiments, such as ion-trap quantum computers [56]. to ρ, where q¯ denotes the set of all variables except q. As a result, we will work mostly with these gates in the We present the denotational semantics of parameterized rest of this paper. However, note that one can easily add and programs in 1b, defining [[P]] as a superoperator on ρ ∈ study other parameterized gates in our framework as well. Hv [53]. For more details we refer the reader to Ying [52, 53]. The language constructed above is similar to their classical We have the following connection between the deno- counterparts. (0) abort terminates the program, outputting tational semantics and operational for parameterized pro- ∗ 0 ∈ D(Hq ). (1) skip does nothing to states in D(Hq ). (2) q := grams: in short, the meaning of running program P(θ ) on |0⟩ sets quantum variable q to the basis state |0⟩. (3) for any input state ρ and any θ ∗ ∈ Rk is the sum of all possible output θ ∗ ∈ Rk , q := U (θ ∗)[q] applies the unitaryU (θ ∗) to the qubits states with multiplicity, weighted by their probabilities. in q. (4) Sequencing has the same behavior as its classical (T ) ∗ k ∗ Proposition 3.1 ([53]). ∀P(θ) ∈ q-while (θ), and any spe- counterpart. (5) for θ ∈ R , case M[q] = m → Pm(θ ) end v ∗ k performs the measurement M = {Mm } on the qubits inq, and cific θ ∈ R , ρ ∈ D(Hv ), executes program P (θ ∗) if the outcome of the measurement Õ m [[P(θ ∗)]](ρ) = {|ρ′ : (P(θ ∗), ρ) →∗ (↓, ρ′)|}. (3.3) is m. The bar over m → Pm indicates that there may be one or more repetitions of this expression. (6) while(T ) M[q] = Here →∗ is the reflexive, transitive closure of → and {| · |} ∗ 1 do P1(θ ) done performs the measurement M = {M0, M1} denotes a multi-set. on q, and terminates if the outcome corresponds to M , or 0 We close the section with a notion arising from the fol- executes P (θ ∗) then reiterates (T ≥ 2) / aborts (T = 1) 1 lowing observation: some programs, while syntactically not otherwise. The program iterates at most T times. We highlight two differences between quantum and clas- 2Recall that, fixing arbitrary θ ∗ ∈ Rk , both semantics reduce to those of sical while languages: (1) Qubits may only be initialized to unparameterized programs, so for compactness we write P for P(θ ∗), etc. the state |0⟩. There is no quantum analogue for initialization 3If we had instead considered a probabilistic transition system, then one pm to any expression (i.e. x := e) due to the no-cloning theorem could write ⟨case M[q] = m → Pm end, ρ ⟩ −−−→⟨Pm , ρm ⟩ where pm = † † of quantum states. Any state |ψ ⟩ ∈ Hq , however, can be tr(Mm ρMm ) and ρm = Mm ρMm /pm . 5 Draft paper, November 22, 2019, S. Zhu, S. Hung, S. Chakrabarti &X.Wu

∗ ∗ ∗ (Abort) ⟨abort[q], ρ⟩ → ⟨↓, 0⟩ (Sum Components) ⟨P1(θ ) + P2(θ ), ρ⟩ → ⟨P1(θ ), ρ⟩, ∗ ∗ ∗ (Skip) ⟨skip[q], ρ⟩ → ⟨↓, ρ⟩ ⟨P1(θ ) + P2(θ ), ρ⟩ → ⟨P2(θ ), ρ⟩

(Initialization) ⟨q := |0⟩, ρ⟩ → ⟨↓, ρq ⟩ 0 Figure 2. additive parameterized quantum bounded while- ( bool programs: operational semantics. We fix θ ∗ ∈ Rk and inherit q Eq→0(ρ) if type(q) = Bool where ρ0 = B−int all the other rules from parameterized programs in Fig. 1a. Eq→0 (ρ) if type(q) = Bdd Int (Unitary) ⟨q := U (θ ∗)[q], ρ⟩ → ⟨↓, U (θ ∗)ρU †(θ ∗)⟩ 4 Additive Parameterized Quantum ∗ ′ ∗ ′ ⟨P1(θ ), ρ⟩ → ⟨P1(θ ), ρ ⟩ Bounded While-Programs ∗ ∗ ′ ∗ ∗ ′ (Sequence) ⟨P1(θ ); P2(θ ), ρ⟩ → ⟨P1(θ ); P2(θ ), ρ ⟩ As we discussed in the introduction, we will introduce a variant of additive quantum programs as a succinct way ∗ (Case m) ⟨case M[q] = m → Pm(θ ) end, ρ⟩ → to describe the collection of programs that are necessary ∗ ⟨Pm(θ ),Em(ρ)⟩, ∀ outcome m of M = {Mm } to compute the derivatives. To that end, we introduce our design of the syntax and the semantics of additive quan- (T ) (T ) ∗ (While 0) ⟨while M[q] = 1 do P1(θ ) done, ρ⟩ → tum programs as well as a compilation method that turns ⟨↓, E0(ρ)⟩ any additive quantum program into a collection of normal programs for the actual computation of derivatives. (T ) (T ) ∗ (While 1) ⟨while M[q] = 1 do P1(θ ) done, ρ⟩ → ∗ (T −1) 4.1 Syntax ⟨P1(θ ); while ,E1(ρ)⟩ We adopt the convention to use underlines to indicate ad- (a) ditive programs, such as P(θ), to distinguish from normal program P(θ). The syntax of P(θ) is given by [[abort[q]]]ρ = 0 [[skip[q]]]ρ = ρ P(θ) ::= abort[q] | skip[q] | q := |0⟩ | q := U (θ)[q] | [[q := |0⟩]]ρ = Ebool ρ or EB−int(ρ) q→0 q→0 P1(θ); P2(θ) | case M[q] = m → Pm(θ) end | [[q := U (θ ∗)[q]]]ρ = U (θ ∗)ρU †(θ ∗) ∗ ∗ ∗ ∗ (T ) [ ] = ( ) | ( ) + ( ), [[P1(θ ); P2(θ )]]ρ = [[P2(θ )]]([[P1(θ )]]ρ) while M q 1 do P1 θ done P1 θ P2 θ ∗ Í ∗ [[case M[q] = m → Pm(θ ) end]]ρ = m[[Pm(θ )]]Em ρ where the only new syntax + is the additive choice. In- (T ) ∗ [[while M[q] = 1 do P1(θ ) done]]ρ = tuitively, P1(θ) + P2(θ) refers to the program that either ÍT −1 ∗ n n=0 E0 ◦ ([[P1(θ )]]◦ E1) ρ executes P1(θ) or P2(θ). This intuition will be implemented by the definition of its operational and denotational seman- (b) tics. We assume + has lower precedence order, and is left 4 associative . If P(θ) = P1(θ) + P2(θ), then qVar(P(θ)) ≡ Figure 1. Parameterized T -bounded quantum while pro- qVar(P1(θ)) ∪ qVar(P2(θ)). Denote the collection of all non- grams: (a) operational semantics (b) denotational semantics. (T ) ∗ k ( ) ( ( )) = ( ) We instantiate θ ∈ R since choice of parameters affects determinisic P θ s.t. qVar P θ v as add-q-whilev θ . choices made at control gates. “Bdd” stands for Bounded. 4.2 Operational and Denotational Semantics We exhibit operational semantics in Figure 2 and define a sim- ilar denotational semantics for any P(θ ∗) ∈ add-q-while(T )(θ). “abort[q]”, semantically aborts. Simple examples includeU (θ); v abort or a case sentence that has abort on each branch. Definition 4.1 (Denotational Semantics). Fix θ ∗ ∈ Rk , ρ ∈ We formalize this concept (essential-abortion for unpa- D(Hv ), rameterized programs may be analogously defined): [[P(θ ∗)]](ρ) ≡ {|ρ′ : ⟨P(θ ∗), ρ⟩ →∗ ⟨↓, ρ′⟩|} (4.1) ( ) ∈ (T )( ) Definition 3.2 (“Essentially Abort”). Let P θ q-whilev θ . We can now explain the name of +. Intuitively, the (S-) P(θ) “essentially aborts” if one of the following holds: rule in Figure 2 provides different choices of paths which 1. P(θ) ≡ abort[q]; will be eventually summed up in (4.1). This resembles the idea of the sum operator in differential lambda-calculus [12]. 2. P(θ) ≡ P1(θ); P2(θ), and either P1(θ) or P2(θ) essentially aborts; Interestingly, some slight change of our (S-C) rule, e.g., in the study of non-deterministic quantum programs [54], can 3. P ≡ case M[q] = m → Pm(θ) end, and each Pm(θ) essentially aborts. 4E.g., X + Y ; Z = X + (Y ; Z ), X + Y + Z := (X + Y ) + Z . 6 On the Principles of Differential Quantum Programming Languages Draft paper, November 22,2019, also be used to describe the interlacing behavior between (Atomic) Compile(P(θ)) ≡ {|P(θ)|}, different components in parallel quantum programs. if P(θ) ≡ abort[v] | skip[v] | q := |0⟩ |v := U (θ)[v]. 4.3 Compilation Rules (Sequence) Compile(P1(θ); P2(θ)) ≡ We exhibit the compilation rules in Figure 3 as a way to  {|abort|}, if Compile(P1(θ)) = {|abort|}; transform a additive program P(θ) into a multiset of nor-   {|abort|}, if Compile(P2(θ)) = {|abort|}; mal programs. The compiled set of programs will be later {|Q (θ);Q (θ) : Q (θ) ∈ Compile(P (θ))|}, used in the actual implementation of the differentiation pro-  1 2 b b  otherwise. cedure. We will show, in Section 7, that our compilation  (Case m) Compile(case) ≡ FB(case), described in Fig.3b. procedure is also efficient in the sense that the total num- (While(T )) Compile(while(T)) : use (Case m) and (Sequence). ber of non-essentially-aborting program in Compile(Pm(θ)) (Sum ) Compile(P (θ) + P (θ)) ≡ is reasonably bounded, in particular for the “fill and break” 1 2 Compile( ( ))ÝCompile( ( )) ∀ ∈ { algorithm (FB(•)) as the compilation of the case statement.  P1 θ P2 θ , if b 1,  Our compilation rule is well-defined as it is compatible  2}, Compile(Pb (θ)) , {|abort|};  with the denotational semantics and operational semantics  Compile(P1(θ)), if Compile(P2(θ)) = {|abort|},  of P(θ) in the following sense: Compile(P1(θ)) , {|abort|};   Compile(P2(θ)), if Compile(P1(θ)) = {|abort|}, Ý  Proposition 4.2. Denoting with the union of multisets,  Compile(P2(θ)) , {|abort|}; ∈ D(H )  then for any ρ v ,  {|abort|}, otherwise  ′ ′ ′ ∗ {|ρ : ρ , 0, ρ ∈ [[P(θ )]]ρ|} = (a) Þ 1. ∀m ∈ [0,w], let Cm denote the sub-multiset of {|ρ′ , 0 : ⟨Q(θ ∗), ρ⟩ →∗ ⟨↓, ρ′⟩|}. (4.2) Compile(Pm(θ)) composed of programs that do not es- Q(θ )∈Compile(P(θ )) sentially abort; without loss of generality, assume |C0| ≥ |C | ≥ · · · ≥ |C |. Proof. Structural Induction. See Appendix C.1 for details. □ 1 w 2. If all Cm’s are empty, return FB(case) ≡ {|abort[v]|}; else, pad each Cm to size |C0| by adding “abort[v]”. Note that (4.2) removes 0 from the multi-set as we are only 3. ∀m ∈ [0,w], index programs in Cm as interested in non-trivial final states. Moreover, in Compile(P(θ)), {|Qm,0(θ), ··· , Qm, |C0 |−1(θ)|}. some programs may essentially abort (Definition 3.2). For 4. Return FB(case) ≡ {| case M[q] = m → Qm,j∗ end |}j∗ ( ) ∈ ∗ implementation, we are interested in the number of Q θ with 0 ≤ j ≤ |C0| − 1. Compile(P(θ)) that do not essentially abort: (b) Definition 4.3 (Number of Non-Aborting Programs). The number of non-aborting programs of P(θ), denoted as |#P(θ)|, Figure 3. nondeterministic programs: (a) compilation rules. is defined as (b) “Fill and Break” (“FB(•)”) procedure for computing Compile(case). case stands for case M[q] = m → Pm(θ) end; (T) (T ) while stands for while M[q] = 1 do P1(θ) done. Here |#P(θ)| = |Compile(P(θ)) \ {|Q(θ) ∈ Compile(P(θ)) : Ý denotes union of multisets. One may observe from a rou- Q(θ) essentially aborts.|}| tine structural induction and the definition of “essentially abort” that: for all P(θ), either Compile(P(θ)) = {|abort|}, or ( ( )) where |C| is the cardinality of a multiset C and C0 ∖C1 denotes Compile P θ does not contain essentially abort programs. the multiset difference of C0 and C1. ∗ gates. Then for any ρ ∈ D(Hv ), fixing θ we have ∗ (Case m) ∗ ∗ † Example 4.1 (Generic-Case). Consider the following simple ⟨P(θ ), ρ⟩ → ⟨P1(θ ) + P2(θ ), M0ρM0 ⟩ program with the case statement (Sum) ∗ † → ⟨P1(θ ), M0ρM0 ⟩ ∗ ∗ † → ⟨↓, [[P1(θ )]](M0ρM )⟩; P(θ) ≡ case M[q] = 0 → P1(θ) + P2(θ), 0 ∗ (Case m) ∗ ∗ † 1 → P3(θ) ⟨P(θ ), ρ⟩ → ⟨P1(θ ) + P2(θ ), M0ρM0 ⟩ (Sum) ∗ † ∗ ∗ † → ⟨P2(θ ), M0ρM ⟩→ ⟨↓, [[P2(θ )]](M0ρM )⟩; (T ) 0 0 ( ) ( ) ( ) ∈ ( ) ) where P1 θ , P2 θ , P3 θ q-whilev θ , none of them essen- ∗ (Case m ∗ † ⟨P(θ ), ρ⟩ → ⟨P3(θ ), M1ρM ⟩ tially aborts, and each of P1(θ), P2(θ), P3(θ) contains no control 1 →∗ ⟨↓ [[ ( ∗)]]( †)⟩ 7 , P3 θ M1ρM1 Draft paper, November 22, 2019, S. Zhu, S. Hung, S. Chakrabarti &X.Wu

Hence by Definition 4.1. the paper, we assume that5

∗ ∗ † ∗ † − IH ⊑ O ⊑ IH . (5.2) [[P(θ )]]ρ = {|[[P1(θ )]](M0ρM0 ), [[P2(θ )]](M0ρM0 ), ∗ † By statistically concentration bounds (e.g. the Chernoff bound), [[P3(θ )]](M1ρM )|} 1 to approximate tr(Oρ) with additive error δ, one needs to We verify computation results from the compilation rules are repeat O(1/δ 2) times with O(1/δ 2) copies of initial states. consist with this. Writing “compilation rule” as “CP” for short, CP,Sum 5.1 Observable Semantics one observes Compile(P1(θ) + P2(θ)) = {|P1(θ), P2(θ)|}, while Compile(P3(θ)) = {|P3(θ)|} since we assumed non- We define the observable semantics of both normal (denoted ′ ′ essentially-abortness. Apply our “fill and break” procedure by P(θ), P (θ)) and additive (denoted by S(θ), S (θ)) parame- terized programs as follows. to obtain C0 = {|P1(θ), P2(θ)|}, C1 = {|P3(θ), abort[v]|}. n case M[q] = 0 → P (θ), ( ) Compile(P(θ)) = | 1 Definition 5.1 (Observable Semantics). ∀P(θ) ∈ q-while T (θ), 1 → P (θ), v 3 any observable O ∈ O and input state ρ ∈ D(H ), the ob- case M[q] = 0 → P (θ), o v v 2 | servable semantics of P, denoted [[(O, ρ) → P(θ)]], is 1 → abort[v] Evolving pursuant to the normal programs operational se- [[(O, ρ) → P(θ)]](θ ∗) ≡ tr(O[[P(θ ∗)]]ρ), ∀θ ∗ ∈ Rk . (5.3) [[ ( ∗)]] mantics (Fig 1) leads to the follow that agrees with P θ ρ. Namely, [[(O, ρ) → P(θ)]] is a function from Rk to R whose Þ value per point is given by (5.3). {|ρ′ : ⟨Q(θ ∗), ρ⟩ →∗ ⟨↓, ρ′⟩|} Similarly, for any S(θ) ∈ add-q-while(T )(θ) with Compile Q(θ )∈Compile(P(θ )) v (S(θ)) = {|P (θ)|}t where P (θ) ∈ q-while(T )(θ), its observ- = {|[[P (θ ∗)]](M ρM†), [[P (θ ∗)]](M ρM†), i i=1 i v 1 0 0 2 0 0 able semantics is given by [[ ( ∗)]]( †) |} P3 θ M1ρM1 , 0 . ∗ Õ ∗ ∗ k [[(O, ρ) → S(θ)]](θ ) ≡ [[(O, ρ) → Pi (θ)]](θ ), ∀θ ∈ R . 5 Observable and Differential Semantics i ∈[1,t] (5.4) To capture physically observable quantities from quantum systems, physicists propose the notation of observable which To compute gradients of quantum observables for each is a Hermitian matrix over the same space. Any observ- parameter, one needs an ancilla variable as hinted by results able O is a combination of information about quantum mea- in quantum information theory about gradient calculations surements and classical values for each measurement out- for simple unitaries (e.g., Bergholm et al. [7], Schuld et al. come. To see why, let us take its spectral decomposition of [41]). To that end, we can easily extend quantum programs Í with ancilla variables. For each j ∈ [1, k], the j-th ancilla of O = m λm |ψm⟩⟨ψm |. Then {|ψm⟩⟨ψm |}m form a projective (T )( ) measurement. We can design an experiment to perform this q-whilev θ is a quantum variable denoted by Aj,v disjoint projective measurement and output λm when the outcome from v. We write A instead of Aj,v when j,v are clear from is m. The expectation of the output is exactly given by context. Ancilla A could consists of any number of qubits

Õ † while we will mostly use one-qubit A in this paper. tr(Oρ) = amtr(Mm ρMm). (5.1) We will only consider programs augmented with one an- m cilla variable Aj at any time. (So let us fix j for the following Comparing to the random outcome of any quantum mea- discussion). We will then consider programs that operate on surement, the expectation tr(Oρ) given by any observable the larger space D(Hv∪{A}) and an additional observable A O represents meaningful classical information of quantum to define the observable semantics with ancilla. systems. In applications of quantum machine learning, it Definition 5.2 (Observable Semantics with Ancilla ). Given is also the quantity used in the loss function. Thus, given ′ (T ) any P (θ) ∈ q-while (θ), any observable O ∈ Ov , in- any observable O, we will define the observable semantics v∪{Aj,v } of quantum programs both the mathematical object to take put state ρ ∈ D(Hv ), and moreover the observable OA on derivatives from the original programs and the read-out of ancilla A, the observable semantics with ancilla of P, over- the programs that compute these derivatives. loading the notation [[(O, ρ) → P ′(θ)]], is Moreover, one should note the extra cost to approximate ′ ∗ [[((O, OA), ρ) → P (θ)]](θ ) ≡ tr(Oρ) to high precision. For example, one can repeat the   ⊗ [[ ′( ∗)]]((| ⟩ ⟨ |)⊗ ) , ∀ ∗ ∈ Rk . aforementioned experiment with {|ψm⟩⟨ψm |}m measurement tr OA O P θ 0 {A} 0 ρ θ and use the statistical information to recover tr(Oρ). The Again, [[((O, O ), ρ) → P(θ)]] is a function from Rk to R number of iterations will depend on the additive precision δ A whose value per point is given by (5.5). and the norm of O. To simplify our presentation, also to make a precise resource count as detailed in Section 7, throughout 5 ⊑ is defined by A ⊑ B ⇐⇒ B−A positive semidefinite. See Appendix C.1. 8 On the Principles of Differential Quantum Programming Languages Draft paper, November 22,2019,

Similarly, for S ′(θ) ∈ add-q-while(T ) (θ) such that 6 Code Transformations and the v∪{Aj,v } ′ ′ t ′ Compile (S (θ)) = {|Pi (θ)|}i=1 where Pi (θ) ∈ Differentiation Logic q-while(T ) (θ), its observable semantics is: ∀θ ∗ ∈ Rk , We describe the code transformation rules of the differentia- v∪{Aj,v } ∂ tion operator ∂θ (·) in Section 6.1. We also define a logic and Õ prove its soundness for reasoning about the correctness of [[(( ) ) → ′( )]]( ∗) ≡ [[(( ) ) → ′( )]]( ∗) O, OA , ρ S θ θ O, OA , ρ Pi θ θ these code transformations, with the following judgement ∈[ ] i 1,t ′ (5.5) S (θ)|S(θ), (6.1) which states that S ′(θ) computes the differential semantics of The only difference from the normal observable semantics S(θ) in sense of Definition 5.3. We fix θ = θ and hence A | ⟩ j lies in (5.5), where we initialize the ancilla with 0 , which is ∂ ∂ 6 stands for Aj,v and for through this section. a natural choice and evaluate the observable OA ⊗ O. As we ∂θ ∂θj will see in the technique, the independence between O and A 6.1 Code Transformations O in the form of OA ⊗ O will help use obtain the strongest guarantee of our differentiation procedure. We first define some gates associated with the single-qubit rotation and the two-qubit coupling gates, which will appear 5.2 Differential Semantics in the code transformation rules. Let A be a single qubit. Given the definition of observable semantics, its differential Definition 6.1. 1. Consider |q1⟩ := Rσ (θ)|q1⟩, where σ ∈ semantics can be naturally defined by {X, Y, Z }. Define   − ( )| ⟩ ≡ (| ⟩ ⟨ |) ⊗ ( ) Definition 5.3 (Differential Semantics). Given additive pro- C Rσ θ A, q1 0 A 0 Rσ θ + (6.2) gram S(θ) ∈ add-q-while(T )(θ), its j-th differential seman-   v (|1⟩A ⟨1|) ⊗ Rσ (θ+π) (6.3) tics is defined by ′ R σ (θ)|A, q1⟩ ≡ A := H |A⟩;C − Rσ (θ)|A, q1⟩; A := H |A⟩. (6.4) ∂ ([[( ) → ( )]]) 2. Substituting σ ⊗σ for σ and q1, q2 for q1 in Eqns (6.3,6.4), O, ρ S θ , (5.6) ′ ∂θj one defines C − Rσ ⊗σ (θ), R σ ⊗σ (θ). For 1-qubit rotation R (θ), the “controlled-rotation” gate Rk R σ which is again a function from to . Moreover, for any − ( ) | ⟩ 7→ | ⟩⊗ ( ) | ⟩ | ⟩ 7→ | ⟩⊗ ′ (T ) C Rσ θ maps 0, q1 0 Rσ θ q1 , and 1, q1 1 S (θ) ∈ add-q-while (θ) with ancilla A, we say that ′ v∪{A} Rσ (θ+π) |q1⟩; R σ (θ) conjugates C − Rσ (θ) with Hadamard. ′ “S (θ) computes the j-th differential semantics of S(θ)” if and Similarly for corresponding two-qubit coupling gates. ′ only if there exists an observable OA on ancilla A for S (θ) such We exhibit our code transformation rules in Figure 4. For that ∀O ∈ Ov, ρ ∈ D(Hv ), Unitary rules we only include 1-qubit rotations and two- qubit coupling gates, since they form a universal gate set ′ ∂ and are easy to implement on quantum machines. It is also [[((O, OA), ρ) → S (θ)]] = ([[(O, ρ) → S(θ)]]). (5.7) ∂θj possible to include more unitary rules (e.g., by following the calculations in [41]), which we will leave as future directions. We remark that (5.6) is well defined because [[(O, ρ) → S(θ)]] is a function from Rk to R. It is also a smooth function 6.2 The differentiation logic and its soundness because we assume that parameterized unitaries are entry- We develop the differentiation logic given in Figure 5 torea- wise smooth, and the observable semantics is obtained by son about the correctness of code transformations. It suffices multiplication and addition of such entries. to show that our logic is sound. For ease of notation, in fu- We also remark that the order of quantifiers in (5.7) makes ∂ ∂ ture analysis we write ∂θ (P(θ)) in place of ∂θ (P(θ)) when the predicate the strongest that one can hope for. This is be- ( ) P(θ) ∈ q-while T (θ). cause the observable semantics of S(θ) will depend on O and v ρ in general. Thus, the program to compute its differential ( ) ∈ (T )( ) Theorem 6.2 (Soundness). Let S θ add-q-whilev θ , semantics could also depend on O and ρ in general. However, ( ) S ′(θ) ∈ add-q-while T (θ). Then, in our definition, S ′(θ) is a single fixed program that works v∪{A} for any O and ρ regardless of the seemingly complicated re- ′ ′ lationship. We remark that this definition is consistent with S (θ)|S(θ) =⇒ S (θ) computes the differential semantics of S(θ). the classical case where a single program is used to compute (6.5) the derivatives for any input. Indeed, we can achieve this 6 ( ) ∈ (T ) ( ) If A already exists, i.e., S θ add-q-whilev∪{A} θ , we treat vnew as definition in the quantum setting and it is also critical inthe ∪ ⊗ vold Aold and add Anew. Any observable O on vold becomes OAold O proof of Theorem 6.2 (item (5)). on vnew. Both Aold and Anew are initialized to |0⟩ in observable semantics. 9 Draft paper, November 22, 2019, S. Zhu, S. Hung, S. Chakrabarti &X.Wu

∂ ∂ ∂ (Trivial) ∂θ (abort[v]), ∂θ (skip[v]), ∂θ (q := |0⟩) ≡ ∂ ∂ (Abort) ∂θ (abort[v])|abort[v] (Skip) ∂θ (skip[v])|skip[v] abort[v ∪ {A}]. ∂ (Trivial-Unitary) ∂θ (|v⟩ := U (θ)|v⟩) ≡ abort[v ∪ {A}], ∂ (Initialization) ∂θ (q := |0⟩)|(q := |0⟩) if θj < θ. θj < θ ∂ (| ⟩ ( )| ⟩) ≡ (1-qb Rotation) ∂θ q1 := Rσ θ q1 ∂ ′ (Trivial-Unitary) ∂θ (q = U (θ)[v])|q = U (θ)[v] |A, q1⟩ := R σ (θ)|A, q1⟩. ∂ (Rot-Couple) ∂ (q = R (θ)[v])|(q = R (θ)[v]) (2-qb Coupling) ∂θ (|q1, q2⟩ := Rσ ⊗σ (θ)|q1⟩) ≡ ∂θ σ σ ′ ∂ ∂ |A, q1, q2⟩ := R σ ⊗σ (θ)|A, q1, q2⟩. ∂θ (S0(θ))|S0(θ) ∂θ (S1(θ))|S1(θ) (Sequence) ∂ (S (θ); S (θ)) ≡ (S (θ); ∂ (S (θ))) + ∂ ∂θ 1 2 1 ∂θ 2 (Sequence) (S0(θ); S1(θ))|(S0(θ); S1(θ)) ∂ ∂θ ( (S1(θ)); S2(θ)). ∂ ∂θ ∀m, ∂θ (Sm(θ))|Sm(θ) ∂ (Case) (case M[q] = m → Sm(θ) end) ≡ ∂θ (Case) ∂ (case M[q] = m → S (θ) end)| ∂ ∂θ m case M[q] = m → ∂θ (Sm(θ)) end. case M[q] = m → Sm(θ) end (T )) (while Use (Case) and (Sequence). ∂ (S1(θ)) S1(θ) ∂ ∂ ∂θ (Sum Components) (S1(θ) + S2(θ)) ≡ (S1(θ)) + ∂θ ∂θ ( ) ∂ (While(T )) ∂ (while T M[q] = 1 do S (θ) done)| (S2(θ)) ∂θ 1 ∂θ (T ) while M[q] = 1 do S1(θ) done ∂ ∂ Figure 4. Code Transformation Rules. For (1-qb Rotation) ∂θ (S0(θ))|S0(θ) ∂θ (S1(θ))|S1(θ) ′ ′ and (2-qb Coupling), (σ ∈ {X, Y, Z }); Rσ (θ), Rσ ⊗σ (θ) are as ∂ (Sum Component) ∂θ (S0(θ) + S1(θ))|(S0(θ) + S1(θ)) in Definition 6.1. θj < θ means “the unitaryU (θ) trivially uses θj ”: for example in P(θ)|q1⟩ ≡ RX (θ1); RZ (θ2), θ = (θ1, θ2) Figure 5. The differentiation logic. Wherever applicable, and RX (θ1) trivially uses θ2. ∈ ⊆ ( ) ∈ (T )( ) q v, q v, Si θ add-q-whilev θ . In (Rot-Couple), σ ∈ {X, Y, Z, X ⊗ X, Y ⊗ Y, Z ⊗ Z }. Let us highlight the ideas behind the proof of the sound- ness and all detailed proofs are deferred to Appendix D. First remember that θ = θj and for all the proofs we can choose (Definition 5.2) and the strong requirement of comput- ZA = |0⟩⟨0| − |1⟩⟨1| as the observable on the one-qubit an- ing differential semantics in Definition 5.3. Firstly, cilla A. Thus, we will omit ZA and overload the notation, ′ (T ) ∂ ∂ ∀P (θ) ∈ q-while ∪{ }(θ): [[(O, ρ) → (S (θ); S (θ))]] = [[(O, ρ) → (S (θ)); S (θ)]] + v A ∂θ 0 1 ∂θ 0 1 ′ ′ [[(O, ρ) → P (θ)]] means [[((O, ZA), ρ) → P (θ)]], (6.6) ∂ [[(O, ρ) → S0(θ); (S1(θ))]] to simplify the presentation. We make similar overloading ∂θ ′( ) ∈ (T ) ( ) We use the induction hypothesis to reason about each convention for S θ add-q-whilev∪{A} θ . Let us go through these logic rules one by one. term above. Consider the case S0(θ) = S0(θ) and S1(θ) = 1. Abort, Skip, Initialization, Trivial-Unitary rules work S1(θ). We show the following for the second step: because these statements do not depend on θ. ∂ ∂ (T ) [[(O, ρ) → S (θ); (S (θ))]] = [[(O, [[S (θ)]](ρ)) → (S (θ))]] 2. Since While can be deemed as a macro of other 0 ∂θ 1 0 ∂θ 1 statements, the correctness of While(T ) rule follows by (6.8) unfolding while(T ) and applying other rules. ∂ ∂ [[(O, ρ) → (S (θ)); S (θ)]] = [[([[S (θ)]]∗(O), ρ) → (S (θ))]], 3. The Sum Component rule is due to the property of ∂θ 0 1 1 ∂θ 0 observable semantics ([[·]]) and additive operator (+): (6.9) where we make critical use of the fact S0(θ), S1(θ) ∈ ∂ ∂ ∂ (T ) ∂ ∂ ([[P1 + P2]]) = [[ (P1)]] + [[ (P2)]], (6.7) q-while (θ) and (S (θ)), (S (θ)) ∈ ∂θ ∂θ ∂θ v ∂θ 0 ∂θ 1 add-q-while(T ) (θ). which follows from our definition design. v∪{A} ∂ 4. Our Rot-Couple rule is different from the phase-shift For (6.8), since ∂θ (S1(θ)) computes the derivative for rule in [41] by using only one circuit in derivative any input state and observable, we choose the input computing. However, the proof of the Rot-Couple rule state [[S0(θ)]](ρ) and observable O. is largely inspired by the one of the phase-shift rule. For (6.9), we don’t change the state ρ but change the 5. The proof of the Sequence rule relies very non-trivially observable O by applying the dual super-operator ∗ ∂ on our design of the observable semantics with ancilla [[S1(θ)]] . Since ∂θ (S0(θ)) computes the derivative for 10 On the Principles of Differential Quantum Programming Languages Draft paper, November 22,2019,

any input state and any observable, we choose the we need to approximate ∗ input state ρ and observable [[S1(θ)]] (O). m ∗ Õ   ′  The dual super-operator [[S1(θ)]] has the property tr ZA ⊗ O [[Pi (θ)]]((|0⟩{A} ⟨0|) ⊗ ρ) , (7.1) ∗ that tr(O[[S1(θ)]](ρ)) = tr([[S1(θ)]] (O)ρ), which cor- i=1 responds to the Schrodinger picture (evolving states) where each term is the observable ZA ⊗O on the output state ′ and (evolving observables) respec- of Pi (θ) given input state ρ and the ancilla qubit |0⟩. tively in quantum mechanics. To approximate the sum in (7.1) to precision δ , one could 6. The proof of the Case rule basically follows from the first treat the sum divided by m as the observable applied on linearity of the observable semantics and the smooth the program that starts with a uniformly random choice of i ′ semantics of Case. It is interesting to compare with from 1, ··· ,m and then execute Pi (θ). By Chernoff bound, the classical case [5] where the non-smoothness of the one only needs to repeat this procedure O(m2/δ 2) times. guard causes an issue for auto differentiation. Resource count. We are only interested in non-trivial (ex- tra) resource that is something that you wouldn’t need if you Example 6.1 (Simple-Case). Consider the following simple only run the original program. Ancilla qubits count as the instantiating of Ex. 4.1 non-trivial resource. However, for our scheme, the number of required ancilla is 1 qubit when applying ∂ (·) once. P(θ)|q ⟩ ≡ case M[q ] = 0 → R (θ)|q ⟩; R (θ)|q ⟩, ∂θ 1 1 X 1 Y 1 The more non-trivial resource is the number of the copies 1 → R (θ)|q ⟩ Z 1 of input state, which is directly related to the number of repetitions in the procedure, which again connects to m = Let us apply code transformation and compilation. Let CT, CP |# ∂ (P(θ))|. We argue that our code transformation is effi- to denote “code transformation” and “compilation”, and “Seq” ∂θ cient so that m is reasonably bounded. To that end, we show and “Rot” denote Sequence and Rotation rules resp. the relation between m and a natural quantity defined on ∂ ∂ the original program P(θ) (i.e., before applying any ∂ (·) case M[q1] = 0 → (RX (θ)|q1⟩; θ ∂ CT,case ∂θ operator) called the occurrence count of the parameter θ. (P(θ)) = RY (θ)|q1⟩), ∂θ ∂ Definition 7.1. The “Occurrence Count for θj ” in P(θ), de- 1 → ∂θ (RZ (θ)|q1⟩) ′ noted OCj (P(θ)), is defined as follows: case M[q1] = 0 → (R X (θ)|A, q1⟩; 1. If P(θ) ≡ abort[v]|skip[v]|q := |0⟩ (q ∈ v), then RY (θ)|q1⟩)+ CT,Seq+Rot OCj (P(θ)) = 0; = (RX (θ)|q1⟩; ′ 2. P(θ) ≡ U (θ): if U (θ) trivially uses θj , then OCj (P(θ)) = R Y (θ)|A, q1⟩), ′ 0; otherwise OCj (P(θ)) = 1. 1 → R Z (θ)|A, q1⟩ ( ) ≡ ( ) ( ) ( )) ( ( )) ( ( )) ′ 3. If P θ U θ = P1 θ ; P2 θ then OCj P θ = OCj P1 θ case M[q1] = 0 → R X (θ)|A, q1⟩; Compile(•) n +OCj (P2(θ)). 7−→ | RY (θ)|q1⟩, ′ 4. If P(θ) ≡ case M[q] = m → Pm(θ) end then OCj (P(θ)) = 1 → R Z (θ)|A, q1⟩, maxm OCj (Pm(θ)). case M[q ] = 0 → R (θ)|q ⟩; (T ) 1 X 1 o 5. If P(θ) ≡ while M[q] = 1 do P1(θ) done then OCj (P(θ)) R′ (θ)|A, q ⟩, | Y 1 = T · OCj (P1(θ)). 1 → abort. Intuition of the “Occurrence Count” definition is clear: it 7 Execution and Resource Analysis basically counts the non-trivial number of occurrence of θj in the program, treating case as if it is deterministic. To see why In this section we illustrate the execution of the entire differ- this is a reasonable quantity, consider the auto-differentiation entiation procedure and analyze its resource cost. Consider in the classical case. For any non-trivial variable v (i.e., v has ( ) ∈ (T )( ) any program P θ q-whilev θ and the parameter θ. some dependence on the parameter θ), we will compute both ∂ Execution. The first step in differentiation is to apply the v and ∂θ (v) and store them both as variables in the new code transformation rules (in Section 6) to P(θ) and obtain program. Thus, the classical auto-differentiation essentially ∂ needs the number of non-trivial occurrences more space an additive program ∂θ (P(θ)). Then one needs to compile ∂ ′ m and related resources. As we argued in the introduction, ∂θ (P(θ)) into a multiset {|Pi (θ)|}i=1 of normal non-aborting ′ we cannot directly mimic the classical case due to the non- programs Pi (θ). The total count of these programs is given ∂ cloning theorem. The extra space requirement in the classical by m = |# ∂θ (P(θ))|. Note that the above procedure could be done at the compilation time. setting turns into the requirement on the extra copies of the Given any pair of O and ρ, the real execution to compute input state in the quantum setting. Indeed, we can bound m the derivative of [[(O, ρ) → P(θ)]] is to approximate the ob- by the occurrence count. servable semantics [[(O, ρ) → ∂ (P(θ))]]. By Definition 5.2, Proposition 7.2. |# ∂ (P(θ))| ≤ OC (P(θ)). ∂θ ∂θj j 11 Draft paper, November 22, 2019, S. Zhu, S. Hung, S. Chakrabarti &X.Wu

Proof. Structural induction. For details, see Appendix E.1. epochs with manually optimized hyperparameters, the loss □ for P1 (no control) attains a minimum of 0.6931 in less than 100 epochs and subsequently plateaus. The loss for P2 (with 8 Implementation and Case Study control) continues to decrease and attains a minimum of We have built a compiler (written in OCaml) that implements 0.0936. It demonstrates the advantage of both controls in our code transformation and compilation rules. We use it to quantum machine learning and our scheme to handle con- train one VQC instance with controls and empirically verify trols, whereas previous schemes (such as Pennylane due to its resource-efficiency on representative VQC instances. All its quantum-node design [7]) fail to do so. codes and test programs are available in the supplementary materials. Complete details are in Appendix F. Experiments are performed on a MacBook Pro with a Dual-Core Intel Core i5 Processor clocked at 2.7 GHz, and 8GB of RAM.

8.1 Training VQC instances with controls Consider a simple classification problem over 4-bit inputs 4 z = z1z2z3z4 ∈ {0, 1} with true label given by f (z) = ¬(z1 ⊕ z4). We construct two 4-qubit VQCs P1 (no control) and P2 (with control) that consists of a single-qubit Pauli X,Y and Z rotation gate on each qubit and compare their performance. For parameters Γ = {γ1,...,γ12} define the program

Q(Γ) ≡ RX (γ1)[q1]; RX (γ2)[q2]; RX (γ3)[q3]; RX (γ4)[q4]; Figure 6. Training P1 and P2 to classify inputs according to RY (γ5)[q1]; RY (γ6)[q2]; RY (γ7)[q3]; RY (γ8)[q4]; the labelling function f (z) = ¬(z1 ⊕ z4). RZ (γ9)[q1]; RZ (γ10)[q2]; RZ (γ11)[q3]; RZ (γ12)[q4], where q , q , q , q refer to 4 qubit registers. Given parame- 1 2 3 4 8.2 Benchmark testing on representative VQCs ters Θ = {θ1,..., θ12}, Φ = {ϕ1,..., ϕ12}, define We also test our compiler on important VQC candidates such P1(Θ, Φ) ≡Q(Θ);Q(Φ). (8.1) as quantum neural-networks (QNN) [16], quantum approxi- Similarly, for parameters Θ = {θ1,..., θ12},Φ = {ϕ1,..., ϕ12}, mate optimization algorithms (QAOA) [15], and variational Ψ = {ψ1,...,ψ12}, define quantum eigensolver (VQE) [35]. We enrich these examples, by adding simple controls or 2-bounded loops and increasing P (Θ, Φ, Ψ) ≡ Q(Θ); case M[q ] = 0 → Q(Φ) 2 1 (8.2) the number of qubits to 18∼40, to make them sufficiently 1 → Q(Ψ). sophisticated but yet realistic for near-term quantum appli- Note that P1 and P2 execute the same number of gates for cations. A selective output performance of our compiler is each run. To use Pi to perform the classification or in the in Table 2, with full details in Appendix F. Our scheme is training, we first initialize q , q , q , q to the classical feature ∂ 1 2 3 4 resource-efficient as |# ∂θ (·)| is always reasonably bounded. vector z = z1z2z3z4 and then execute Pi . The predicted label y is given by measuring the 4th qubit q in 0/1 basis. 4 P(θ) OC(·) |# ∂ (·)| #gates #lines #layers #qb’s We will conduct a supervised learning by minimizing the ∂θ QNN 24 24 165 189 3 18 following loss function given by the average negative log M,i QNN 56 24 231 121 5 18 likelihood of correctly classifying the inputs. Thus M,w QNNL,i 48 48 363 414 6 36 1 Õ loss = − log (Pr[y = f (z) = ¬(z ⊕ z )]) . (8.3) QNNL,w 504 48 2079 244 33 36 16 1 4 z ∈{0,1}4 VQEM,i 15 15 224 241 3 12 VQE Note that loss is a function on Θ, Φ (or Θ, Φ.Ψ). More im- M,w 35 15 224 112 5 12 VQEL,i 40 40 576 628 5 40 portantly, for each z, Pr[y = ¬(z1 ⊕ z4)] can be represented VQE , 248 40 1984 368 17 40 by the observable semantics of P1(or P2) with observable L w |0⟩⟨0| or |1⟩⟨1|. Thus, the gradient of loss can be obtained QAOAM,i 18 18 120 142 3 18 ∂ ∂ QAOAM,w 42 18 168 94 5 18 by using the collection of ∂α (P1) for α ∈ Θ, Φ (or ∂α (P2) for α ∈ Θ, Φ, Ψ). We classically simulate the above training QAOAL,i 36 36 264 315 6 36 procedure with gradient descent7 (see Figure 6). After 1000 QAOAL,w 378 36 1512 190 33 36 Table 2. Output on selective examples. {M, L} stands for 7 We attempted to directly use Pennylane to train the system without control, “medium, large”; {i,w} stands for including “if, while”. but were not able to make it work since Pennylane does not implement the gradient of the log function. 12 On the Principles of Differential Quantum Programming Languages Draft paper, November 22,2019,

Acknowledgments [10] Rohit Chadha, Paulo Mateus, and Amílcar Sernadas. 2006. Reasoning SZ, SH, and XW were supported in part by the U.S. De- About Imperative Quantum Programs. Electronic Notes in Theoretical Computer Science 158 (2006). partment of Energy, Office of Science, Office of Advanced [11] George Corliss, Christèle Faure, Andreas Griewank, Lauren Hascoët, Scientific Computing Research, Quantum Testbed Pathfinder and Uwe Naumann (Eds.). 2002. Automatic Differentiation of Algo- Program under Award Number DE-SC0019040. SC and XW rithms: From Simulation to Optimization. Springer-Verlag New York, were supported in part by the U.S. Department of Energy, Inc., New York, NY, USA. Office of Science, Office of Advanced Scientific Computing [12] Thomas Ehrhard and Laurent Regnier. 2003. The differential lambda- calculus. Theoretical Computer Science 309, 1-3 (2003), 1–41. Research, Quantum Algorithms Teams program. XW also re- [13] Conal M. Elliott. 2009. Beautiful Differentiation. In Proceedings of ceived support from the National Science Foundation (grants the 14th ACM SIGPLAN International Conference on Functional Pro- CCF-1755800 and CCF-1816695). gramming (ICFP ’09). ACM, New York, NY, USA, 191–202. https: //doi.org/10.1145/1596550.1596579 [14] Conal M. Elliott. 2018. The Simple Essence of Automatic Differentia- References tion. Proc. ACM Program. Lang. 2, ICFP, Article 70 (July 2018), 29 pages. [1] Ali Javadi Abhari, Arvin Faruque, Mohammad Javad Dousti, Lukas https://doi.org/10.1145/3236765 Svec, Oana Catu, Amlan Chakrabati, Chen-Fu Chiang, Seth Vander- [15] Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. 2014. A Quantum wilt, John Black, Fred Chong, Margaret Martonosi, Martin Suchara, Approximate Optimization Algorithm. (2014). arXiv:arXiv:1411.4028 Ken Brown, Massoud Pedram, and Todd Brun. 2012. Scaffold: Quan- [16] Edward Farhi and Hartmut Neven. 2018. Classification with Quantum tum Programming Language. Technical Report TR-934-12. Princeton Neural Networks on Near Term Processors. arXiv e-prints, Article University. arXiv:1802.06002 (Feb 2018), arXiv:1802.06002 pages. arXiv:quant- [2] Frank Arute, Kunal Arya, Ryan Babbush, Dave Bacon, Joseph C. Bardin, ph/1802.06002 Rami Barends, Rupak Biswas, Sergio Boixo, Fernando G. S. L. Bran- [17] Yuan Feng, Runyao Duan, Zhengfeng Ji, and Mingsheng Ying. 2007. dao, David A. Buell, Brian Burkett, Yu Chen, Zijun Chen, Ben Chiaro, Proof Rules for the Correctness of Quantum Programs. Theoretical Roberto Collins, William Courtney, Andrew Dunsworth, Edward Farhi, Computer Science 386, 1-2 (2007). Brooks Foxen, Austin Fowler, Craig Gidney, Marissa Giustina, Rob [18] Simon J. Gay. 2006. Quantum Programming Languages: Survey and Graff, Keith Guerin, Steve Habegger, Matthew P. Harrigan, Michael J. Bibliography. Mathematical Structures in Computer Science 16, 4 (2006). Hartmann, Alan Ho, Markus Hoffmann, Trent Huang, Travis S. Hum- [19] Gian Giacomo Guerreschi and Mikhail Smelyanskiy. 2017. Practi- ble, Sergei V. Isakov, Evan Jeffrey, Zhang Jiang, Dvir Kafri, Kostyan- cal optimization for hybrid quantum-classical algorithms. arXiv e- tyn Kechedzhi, Julian Kelly, Paul V. Klimov, Sergey Knysh, Alexan- prints, Article arXiv:1701.01450 (Jan 2017), arXiv:1701.01450 pages. der Korotkov, Fedor Kostritsa, David Landhuis, Mike Lindmark, Erik arXiv:quant-ph/1701.01450 Lucero, Dmitry Lyakh, Salvatore Mandrà, Jarrod R. McClean, Matthew [20] Martin Giles. 2019. IBM’s new 53-qubit quantum McEwen, Anthony Megrant, Xiao Mi, Kristel Michielsen, Masoud computer is the most powerful machine you can Mohseni, Josh Mutus, Ofer Naaman, Matthew Neeley, Charles Neill, use. https://www.technologyreview.com/f/614346/ Murphy Yuezhen Niu, Eric Ostby, Andre Petukhov, John C. Platt, -new-53-qubit-quantum-computer-is-the-most-powerful-machine-you-can-use/ Chris Quintana, Eleanor G. Rieffel, Pedram Roushan, Nicholas C. [21] Pranav Gokhale. 2018. Variational Quantum Eigensolver Demo. ISCA Rubin, Daniel Sank, Kevin J. Satzinger, Vadim Smelyanskiy, Kevin J. 2018 (2018). Sung, Matthew D. Trevithick, Amit Vainsencher, Benjamin Villalonga, [22] Jonathan Grattage. 2005. A Functional Quantum Programming Lan- Theodore White, Z. Jamie Yao, Ping Yeh, Adam Zalcman, Hartmut guage. In LICS. Neven, and John M. Martinis. 2019. Quantum supremacy using a [23] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Dani- programmable superconducting processor. Nature 574, 7779 (2019), helka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Ed- 505–510. ward Grefenstette, Tiago Ramalho, John Agapiou, AdriàPuigdomènech [3] Alexandru Baltag and Sonja Smets. 2011. as a Dynamic Badia, Karl Moritz Hermann, Yori Zwols, Georg Ostrovski, Adam Logic. Synthese 179, 2 (2011). Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray [4] Atılım Günes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, Kavukcuoglu, and Demis Hassabis. 2016. Hybrid computing using a and Jeffrey Mark Siskind. 2017. Automatic Differentiation in Machine neural network with dynamic external memory. Nature 538 (10 2016), Learning: A Survey. J. Mach. Learn. Res. 18, 1 (Jan. 2017), 5595–5637. 471. http://dl.acm.org/citation.cfm?id=3122009.3242010 [24] Alexander S. Green, Peter LeFanu Lumsdaine, Neil J. Ross, Peter [5] Thomas Beck and Herbert Fischer. 1994. The if-problem in automatic Selinger, and Benoît Valiron. 2013. Quipper: A Scalable Quantum differentiation. J. Comput. Appl. Math. 50, 1-3 (1994), 119–131. Programming Language. In PLDI. [6] Marcello Benedetti, Erika Lloyd, and Stefan Sack. 2019. Parameterized [25] Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and quantum circuits as machine learning models. arXiv e-prints, Article Phil Blunsom. 2015. Learning to Transduce with Unbounded Memory. arXiv:1906.07682 (Jun 2019), arXiv:1906.07682 pages. arXiv:quant- In Proceedings of the 28th International Conference on Neural Informa- ph/1906.07682 tion Processing Systems - Volume 2 (NIPS’15). MIT Press, Cambridge, [7] Ville Bergholm, Josh Izaac, Maria Schuld, Christian Gogolin, and MA, USA, 1828–1836. http://dl.acm.org/citation.cfm?id=2969442. Nathan Killoran. 2018. PennyLane: Automatic differentiation of hy- 2969444 brid quantum-classical computations. arXiv preprint arXiv:1811.04968 [26] Andreas Griewank. 2000. Evaluating Derivatives: Principles and Tech- (2018). niques of Algorithmic Differentiation. Society for Industrial and Applied [8] Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Mathematics, Philadelphia, PA, USA. Nathan Wiebe, and Seth Lloyd. 2017. Quantum machine learning. [27] Yoshihiko Kakutani. 2009. A Logic for Formal Verification of Quantum Nature 549, 7671 (2017), 195. arXiv:arXiv:1611.09347 Programs. In ASIAN. [9] Olivier Brunet and Philippe Jorrand. 2004. Dynamic Quantum Logic [28] Gershon Kedem. 1980. Automatic Differentiation of Computer Pro- for Quantum Programs. International Journal of Quantum Information grams. ACM Trans. Math. Softw. 6, 2 (June 1980), 150–165. https: 2, 1 (2004). //doi.org/10.1145/355887.355890 13 Draft paper, November 22, 2019, S. Zhu, S. Hung, S. Chakrabarti &X.Wu

[29] Jin-Guo Liu and Lei Wang. 2018. Differentiable learning of quantum [51] William K. Wootters and Wojciech H. Zurek. 1982. A single Born machines. Phys. Rev. A 98 (Dec 2018), 062324. Issue 6. cannot be cloned. Nature 299, 5886 (1982), 802–803. https://doi.org/10.1103/PhysRevA.98.062324 [52] Mingsheng Ying. 2011. Floyd–Hoare Logic for Quantum Programs. [30] Nikolaj Moll, Panagiotis Barkoutsos, Lev S. Bishop, Jerry M. Chow, ACM Transactions on Programming Languages and Systems 33, 6 (2011). Andrew Cross, Daniel J. Egger, Stefan Filipp, Andreas Fuhrer, Jay M. [53] Mingsheng Ying. 2016. Foundations of Quantum Programming. Morgan Gambetta, Marc Ganzhorn, Abhinav Kandala, Antonio Mezzacapo, Kaufmann. Peter Muller, Walter Riess, Gian Salis, John Smolin, Ivano Tavernelli, [54] Mingsheng Ying and Yangjia Li. 2018. Reasoning about Parallel Quan- and Kristan Temme. 2018. Quantum optimization using variational tum Programs. arXiv preprint arXiv:1810.11334 (2018). algorithms on near-term quantum devices. Quantum Science and [55] Mingsheng Ying, Shenggang Ying, and Xiaodi Wu. 2017. Invariants of Technology 3, 3 (2018), 030503. arXiv:arXiv:1710.01022 Quantum Programs: Characterisations and Generation. In POPL. [31] Michael A. Nielsen and Isaac Chuang. 2000. Quantum Computation [56] Daiwei Zhu, NM Linke, Marcello Benedetti, KA Landsman, NH and Quantum Information. Cambridge University Press. Nguyen, CH Alderete, Alejandro Perdomo-Ortiz, Nathan Korda, A [32] Bernhard Ömer. 2003. Structured Quantum Programming. Ph.D. Dis- Garfoot, Charles Brecque, et al. 2018. Training of quantum circuits on sertation. Vienna University of Technology. a hybrid quantum computer. arXiv preprint arXiv:1812.08862 (2018). [33] Jennifer Paykin, Robert Rand, and Steve Zdancewic. 2017. QWIRE: A Core Language for Quantum Circuits. In POPL. [34] Barak A. Pearlmutter and Jeffrey Mark Siskind. 2008. Reverse-mode AD in a Functional Framework: Lambda the Ultimate Backpropagator. ACM Trans. Program. Lang. Syst. 30, 2, Article 7 (March 2008), 36 pages. https://doi.org/10.1145/1330017.1330018 [35] Alberto Peruzzo, Jarrod McClean, Peter Shadbolt, Man-Hong Yung, Xiao-Qi Zhou, Peter J. Love, Alán Aspuru-Guzik, and Jeremy L. O’brien. 2014. A variational eigenvalue solver on a photonic quantum processor. Nature Communications 5 (2014), 4213. arXiv:arXiv:1304.3061 [36] Gordon Plotkin. 2018. Some Principles of Differential Programming Languages. POPL 2018 (2018). [37] John Preskill. 2018. Quantum computing in the NISQ era and beyond. Quantum 2 (2018), 79. arXiv:arXiv:1801.00862 [38] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1986. Learning representations by back-propagating errors. Nature 323, 6088 (1986), 533–536. [39] Amr Sabry. 2003. Modeling Quantum Computing in Haskell. In The Haskell Workshop. [40] Jeff W. Sanders and Paolo Zuliani. 2000. Quantum Programming. In MPC. [41] Maria Schuld, Ville Bergholm, Christian Gogolin, Josh Izaac, and Nathan Killoran. 2018. Evaluating analytic gradients on quantum hardware. arXiv preprint arXiv:1811.11184 (2018). [42] Maria Schuld, Alex Bocharov, Krysta Svore, and Nathan Wiebe. 2018. Circuit-centric quantum classifiers. arXiv e-prints, Article arXiv:1804.00633 (Apr 2018), arXiv:1804.00633 pages. arXiv:quant- ph/1804.00633 [43] Peter Selinger. 2004. A Brief Survey of Quantum Programming Lan- guages. In FLOPS. [44] Peter Selinger. 2004. Towards a Quantum Programming Language. Mathematical Structures in Computer Science 14, 4 (2004). [45] Bert Speelpenning. 1980. Compiling Fast Partial Derivatives of Func- tions Given by Algorithms. Ph.D. Dissertation. Champaign, IL, USA. AAI8017989. [46] Krysta Svore, Alan Geller, Matthias Troyer, John Azariah, Christopher Granade, Bettina Heim, Vadym Kliuchnikov, Mariia Mykhailova, An- dres Paz, and Martin Roetteler. 2018. Q#: Enabling Scalable Quantum Computing and Development with a High-level DSL. In RWDSL. [47] Qingfeng Wang and Tauqir Abdullah. 2018. An Introduction to Quan- tum Optimization Approximation Algorithm. [48] John Watrous. 2006. Introduction to Quantum Computa- tion. https://cs.uwaterloo.ca/~watrous/LectureNotes/CPSC519. Winter2006/all.pdf. Course notes. [49] Dave Wecker and Krysta Svore. 2014. LIQUi|⟩: A Software Design Architecture and Domain-Specific Language for Quantum Computing. CoRR abs/1402.4467 (2014). arXiv:1402.4467 [50] Robert Edwin Wengert. 1964. A Simple Automatic Derivative Eval- uation Program. Commun. ACM 7, 8 (Aug. 1964), 463–464. https: //doi.org/10.1145/355586.364791

14 On the Principles of Differential Quantum Programming Languages Draft paper, November 22,2019,

A Detailed Quantum Preliminary semidefinite operator ρ on H is said to be a partial density This is a more detailed treatment of Section 2. For a further operator if tr(ρ) ≤ 1. The set of partial density operators is extended background, we recommend the notes by Watrous denoted by D(H). [48] and the textbook by Nielsen and Chuang [31]. A.3 Quantum Operations A.1 Preliminaries Operations on quantum systems can be characterized by For any non-negative integer n, an n-dimensional Hilbert unitary operators. Denoting the set of linear operators on H space H is essentially the space Cn of complex vectors. We as L(H), an operator U ∈ L(H) is unitary if its Hermitian † † use Dirac’s notation, |ψ ⟩, to denote a complex vector in Cn. conjugate is its own inverse, i.e., U U = UU = IH. For The inner product of two vectors |ψ ⟩ and |ϕ⟩ is denoted by a pure state |ψ ⟩, a unitary operator describes an evolution ⟨ψ |ϕ⟩, which is the product of the Hermitian conjugate of from |ψ ⟩ toU |ψ ⟩. For a density operator ρ, the corresponding † |ψ ⟩, denoted by ⟨ψ |, and vector |ϕ⟩. The norm of a vector evolution is ρ 7→ U ρU . Common unitary operators include p |ψ ⟩ is denoted by ∥|ψ ⟩∥ = ⟨ψ |ψ ⟩. " 1 1 # √ √  0 1   1 0  We define (linear) operators as linear mappings between H = 2 2 , X = , Z = , Y = −iXZ √1 √−1 1 0 0 −1 Hilbert spaces. Operators between n-dimensional Hilbert 2 2 spaces are represented by n × n matrices. For example, the The Hadamard operator H transforms between the com- identity operator IH can be identified by the identity matrix putational and the ± basis. For example, H |0⟩ = |+⟩ and on H. The Hermitian conjugate of operator A is denoted | ⟩ |−⟩ | ⟩ | ⟩ † † H 1 = . The Pauli X operator is a bit flip, i.e., X 0 = 1 by A . Operator A is Hermitian if A = A . The trace of an and X |1⟩ = |0⟩. The Pauli Z operator is a phase flip, i.e., operator A is the sum of the entries on the main diagonal, i.e., | ⟩ | ⟩ | ⟩ −| ⟩ | ⟩ | ⟩ Í Z 0 = 0 and Z 1 = 1 . Pauli Y maps 0 to i 1 and tr(A) = i Aii . We write ⟨ψ |A|ψ ⟩ to mean the inner product |1⟩ to −i|0⟩. The CNOT gate C maps |00⟩ 7→ |00⟩, |01⟩ 7→ between |ψ ⟩ and A|ψ ⟩. A Hermitian operator A is positive |01⟩, |10⟩ 7→ |11⟩, |11⟩ 7→ |10⟩. One may obtain the EPR state semidefinite (resp., positive definite) if for all vectors |ψ ⟩ ∈ H, H1 C1,2 |β ⟩ via |00⟩ 7→ √1 (|0⟩ + |1⟩)|0⟩ 7→ √1 (|00⟩ + |11⟩). ⟨ψ |A|ψ ⟩ ≥ 0 (resp., > 0). This gives rise to the Löwner order 00 2 2 ⊑ among operators: More generally, the evolution of a quantum system can be A ⊑ B if B − A is positive semidefinite, characterized by an admissible superoperator E, which is a A B if B − A is positive definite. completely-positive and trace-non-increasing linear map from ′ ′ < D(H) to D(H ) for Hilbert spaces H, H . A superopera- A.2 Quantum States tor is positive if it maps from D(H) to D(H ′) for Hilbert ′ The state space of a quantum system is a Hilbert space. The spaces H, H . A superoperator E is k-positive if for any state space of a qubit, or quantum bit, is a 2-dimensional k-dimensional Hilbert space A, the superoperator E ⊗ IA Hilbert space. One important orthonormal basis of a qubit is a positive map on D(H ⊗ A). A superoperator is said system is the computational basis with |0⟩ = (1, 0)† and |1⟩ = to be completely positive if it is k-positive for any positive (0, 1)†, which encode the classical bits 0 and 1 respectively. integer k. A superoperator E is trace-non-increasing if for ′ Another important basis, called the ± basis, consists of |+⟩ = any initial state ρ ∈ D(H), the final state E(ρ) ∈ D(H ) √1 (|0⟩ + |1⟩) and |−⟩ = √1 (|0⟩ − |1⟩). The state space of after applying E satisfies tr(E(ρ)) ≤ tr(ρ). 2 2 ′ multiple qubits is the tensor product of single qubit state For every superoperator E : D(H) → D(H ), there ex- { } E( ) Í † spaces. For example, classical 00 can be encoded by |0⟩ ⊗ |0⟩ ists a set of Kraus operators Ek k such that ρ = k Ek ρEk (written |0⟩|0⟩ or even |00⟩ for short) in the Hilbert space for any input ρ ∈ D(H). Note that the set of Kraus operators 2 2 is finite if the Hilbert space is finite-dimensional. The Kraus C ⊗ C . An important 2-qubit state is the EPR state |β00⟩ = 1 E E Í ◦ † √ (|00⟩ + |11⟩). The Hilbert space for an m-qubit system is form of is written as = k Ek Ek . A unitary evolution 2 † 2 ⊗m 2m can be represented by the superoperator E = U ◦ U . An (C )  C . I ◦ A pure quantum state is represented by a unit vector, i.e., a identity operation refers to the superoperator H = IH IH. E vector |ψ ⟩ with ∥|ψ ⟩∥ = 1.A mixed state can be represented The Schrödinger-Heisenberg dual of a superoperator = Í ◦ † E∗ by a classical distribution over an ensemble of pure states k Ek Ek , denoted by , is defined as follows: for every state ρ ∈ D(H) and any operator A, tr(AE(ρ)) = tr(E∗(A)ρ). {(pi, |ψi ⟩)}i , i.e., the system is in state |ψi ⟩ with probability E∗ Í † ◦ pi . One can also use density operators to represent both pure The Kraus form of is k Ek Ek . and mixed quantum states. A density operator ρ for a mixed A.4 Quantum Measurements and Observables state representing the ensemble {(pi, |ψi ⟩)}i is a positive Í semidefinite operator ρ = i pi |ψi ⟩⟨ψi |, where |ψi ⟩⟨ψi | is The way to extract information about a quantum system is the outer-product of |ψi ⟩; in particular, a pure state |ψ ⟩ can called a quantum measurement. A quantum measurement on be identified with the density operator ρ = |ψ ⟩⟨ψ |. Note a system over Hilbert space H can be described by a set of lin- Í † that tr(ρ) = 1 holds for all density operators. A positive ear operators {Mm }m with m MmMm = IH. If we perform a 15 Draft paper, November 22, 2019, S. Zhu, S. Hung, S. Chakrabarti &X.Wu

{ } ′⟩ ′ ∈ Ý measurement Mm m on a state ρ, the outcomem is observed , ρ . By inductive hypothesis, ρ Qj (θ )∈Compile(P2(θ )) † ′ ∗ ∗ ′ with probability pm = tr(Mm ρMm) for each m. A major dif- {|ρ , 0 : ⟨Qj (θ ), ρ2⟩ → ⟨↓, ρ ⟩|}. ∗ ∗ ∗ ∗ ference between classical and quantum computation is that a Since ⟨P1(θ ); P2(θ ), ρ⟩ → ⟨P2(θ ), ρ2⟩, there must ∗ quantum measurement changes the state. In particular, after be some ⟨P3(θ ), ρ3⟩ → ⟨↓, ρ2⟩ triggering the last a measurement yielding outcome m, the state collapses to transition according to our operational semantics. † Mm ρMm/pm. For example, a measurement in the computa- Inductively apply such an argument, knowing that 8 tional basis is described by M = {M0 = |0⟩⟨0|, M1 = |1⟩⟨1|}. all computation paths are finitely long, we conclude ′ ∗ ′ If we perform the computational basis measurement M on that ρ2 ∈ {|ρ2 : ⟨P1(θ), ρ⟩ → ⟨↓, ρ2⟩|}. By the in- state ρ = |+⟩⟨+|, then with probability 1 the outcome is 0 ∈ Ý 2 ductive hypothesis we have ρ2 Ri (θ )∈Compile(P1(θ )) and ρ becomes |0⟩⟨0|. With probability 1 the outcome is 1 ′ ∗ ∗ ′ 2 {|ρ2 , 0 : ⟨Ri (θ ), ρ⟩ → ⟨↓, ρ2⟩|}. Thus and ρ becomes |1⟩⟨1|.

B More on the definition of parameterized Þ ρ′ ∈ (C.1) quantum programs Ri (θ )∈Compile(P1(θ )),Qj (θ )∈Compile(P2(θ )) B.1 Definition of qVar ′ ∗ ∗ {|ρ , 0 : ⟨Ri (θ ), ρ⟩ → ⟨↓, ρ2⟩ (C.2) Given P(θ) a T -bounded k-parameterized quantum while- (C.3) program, let qVar(P(θ)), read the set of quantum variables Û ∗ ∗ ′ accesible to P, be recursively defined as follows [53]: ⟨Qj (θ ), ρ2⟩ → ⟨↓, ρ ⟩|} Þ 1. If P(θ) ≡ skip[q], abort[q] or U [q], then qVar(P(θ)) = = (C.4) q. Ri (θ )∈Compile(P1(θ )),Qj (θ )∈Compile(P2(θ )) 2. If P(θ) ≡ q := |0⟩, then qVar(P(θ)) = q. ′ ∗ ∗ ∗ ′ {|ρ , 0 : ⟨Ri (θ );Qj (θ ), ρ⟩ → ⟨↓, ρ ⟩|} 3. If P(θ) ≡ P (θ); P (θ), then qVar(P(θ)) = qVar(P (θ))∪ 1 2 1 (C.5) qVar(P2(θ)). When analyzing P1(θ); P2(θ), we identify

P1(θ) with I ⊗ P1(θ), where I ≡ IqVar(P2(θ ))\qVar(P1(θ )) ◦

IqVar(P2(θ ))\qVar(P1(θ )), the identity operation on the vari- ables where P1 originally has no access to. as desired. ⊇ ′ 4. If P(θ) ≡ case M[q] = m → Pm(θ) end, then qVar(P(θ)) = b. : Let ρ be a member of the multiset in right hand Ð side (“RHS”) ofC.5 for some R (θ) ∈ Compile(P (θ)), q ∪ m qVar(Pi (θ)). We make the same identification i 1 for Pi (θ)’s as the above. Qj (θ) ∈ Compile(P2(θ)). Then by inductive hypoth- (T ) ∗ ∗ ′ ′ 5. If P(θ) ≡ while M[q] = 1 do P1(θ) done, then esis, ⟨P2(θ), [[Ri (θ )]]ρ⟩ → ⟨↓, ρ ⟩ (ρ , 0), and ∗ ∗ ∗ qVar(P(θ)) = q ∪ qVar(P1(θ)). We make the same ⟨P1(θ), ρ⟩ → ⟨↓, [[Ri (θ )]]ρ⟩ ([[Ri (θ )]]ρ , 0). One ∗ ∗ ∗ ′ identification for P1(θ) as the above. may start with ⟨P2(θ ), [[Ri (θ )]]ρ⟩ → ⟨↓, ρ ⟩ then repeatedly apply the Sequential rule in the opera- One defines qVar(P) for unparameterized P analogously. ∗ tional semantics, tracing back the path ⟨P1(θ), ρ⟩ → ⟨↓, [[R (θ ∗)]]ρ⟩, and conclude that ⟨P (θ ∗); P (θ ∗), ρ⟩ →∗ C Detailed Proof from Section 4 i 1 2 ⟨↓, ρ′⟩ (ρ′ , 0). C.1 Proof of Prop 4.2: Non-deterministic The above said the two multisets have the same set of Compilation Rules Well-Defined distinct elements, and furthermore each copy of ρ′ ∈ Proof. Structural induction. [[P(θ ∗)]]ρ induces at least (by ⊆, or by the deterministic nature of R (θ ∗);Q (θ ∗)) as well as at most (by ⊇ ) one 1. (Atomic) and (Sequence ↓): as operational semantics i j copy of the same element in RHS(C.5). This says the of atomic operations are inherited from parameter- two multisets are equal. ized quantum while programs, they do not induce 3. (Case) Applying the operational transition rule for 1 non-determinism. For these operations, [[P(θ ∗)]]ρ = step one may observe that (writing “IH” the inductive {|[[P(θ ∗)]]ρ|} as is the right hand side. (Sequence ↓) follows from the fact that ↓; P(θ) ≡ P(θ). ∗ 2. (Sequence) If one of Compile(Pb (θ )) is {abort} the statement is immediately true, so we assume otherwise for below. ′ ′ ∗ ∗ ∗ a. ⊆: Let ρ ∈ {|ρ , 0 : ⟨P1(θ ); P2(θ ), ρ⟩ → ⟨↓ ′ ∗ 8 ∗ ∗ , ρ ⟩|}. By definition there exists ρ2 , 0 s.t. ⟨P1(θ ); We don’t count “transitions” like ⟨P(θ ), ρ ⟩ → ⟨P(θ ), ρ ⟩ when analyz- ∗ ∗ ∗ ∗ ∗ ing computation paths, as nothing evolves in these “trivial transitions”. P2(θ ), ρ⟩ → ⟨P2(θ ), ρ2⟩ and ⟨P2(θ ), ρ2⟩ → ⟨↓ 16 On the Principles of Differential Quantum Programming Languages Draft paper, November 22,2019,

hypothesis) that (writing “IH” the inductive hypothesis)

′ ∗ ∗ ′ 0 ρ ∈ [[P1(θ ) + P2(θ )]]ρ (C.9) {|0 , ρ ∈ [[case M[q] = m → Pm(θ) end]]ρ|} , Þ Þ ′ ∗ ′ ′ ∗ † ′ = {|ρ , 0 : ⟨Pm(θ ), ρ⟩ → ⟨↓,(C.10)ρ ⟩|} = {|ρ , 0 : ⟨Pm(θ ), Mm ρMm⟩ → ⟨↓, ρ ⟩|} m∗=1,2 m∗ IH Þ ′ ∗ ∗ IH Þ Þ {| ⟨[[ ∗ ( )]] ⟩ → = (C.6) = ρ , 0 : Qm ,im∗ θ , ρ ∗ m∗=1,2 m Q ∗, (θ )∈Compile(P ∗ (θ )) m im∗ m ′ ⟨↓ ⟩ ∗ ( ) ∈ Compile( ∗ ( ))|} ′ ∗ † , ρ , Qm ,im∗ θ Pm θ {|ρ , 0 : ⟨[[Qm∗,i ∗ (θ )]], Mm∗ ρM ∗ ⟩ (C.7) m m CP,Sum Component Þ →∗ ⟨↓, ρ′⟩|} = Q(θ )∈Compile(P1(θ )+P2(θ )) {|ρ′ , 0 : ⟨Q(θ ∗), ρ⟩ →∗ ⟨↓, ρ′⟩|},(C.11) On the other hand we have

as desired! Þ □ Q(θ )∈Compile(case M[q]=m→Pm (θ ) end) {|ρ′ , 0 : ⟨Q(θ ∗), ρ⟩ →∗ ⟨↓, ρ′⟩|} CP,Case Þ D Detailed Proofs from Section 6 = Throughout, we use LHS, RHS to denote “left hand side” Q(θ )∈FB(case M[q]=m→Pm (θ ) end) and “right hand side” resp., and “IH” to denote “Inductive {|ρ′ , 0 : ⟨Q(θ ∗), ρ⟩ →∗ ⟨↓, ρ′⟩|} Hypothesis”. Wherever applicable, we adopt the overloading (∗∗) Þ Þ ′ ∗ convention explained in the main text (See Eqn 6.6). = {|ρ 0 : ⟨[[Q ∗ ∗ (θ )]], , m ,im Before giving proofs, we record the full details for code m∗ Q ∗ (θ )∈Compile(P ∗ (θ )) (T ) m ,im∗ m transformation rule for while for your interest. Let us † ∗ ′ ∂ (T ) T ∗ denote (while M[q] = 1 do S (θ) done) ≡ Seq , with: Mm ρMm∗ ⟩ → ⟨↓, ρ ⟩|} (C.8) ∂θ 1

as desired. Note that the last step (**) is due to the Seq(1) ≡ case M[q] = 0 → abort, behavior of the superoperators in FB(case M[q] =  ∂  1 → (P1(θ)); abort + m → Pm(θ) end): when evolved by one transition, they ∂θj ∗   ⟨[[ ∗ ( )]] either go to some non-(essentially-aborting) Qm ,im∗ θ , P1(θ); abort , † ∗ ⟩ ∗ ( ) ∈ Compile( ∗ ( )) Mm ρMm∗ for some Qm ,im∗ θ Pm θ † ∗ ∗ for somem , or go to ⟨abort, Mm ρMm∗ ⟩. Besides, each copy of final state of non-(essentially-aborting) (T ≥2) ∗ † and Seq recursively defined as: ⟨[[ ∗ ( )]] ∗ ⟩ Qm ,im∗ θ , Mm ρMm∗ appears exactly once in the multiset, Due to our construction of FB(case M[q] =

m → Pm(θ) end). On the other hand, computational ∗ ⟨ ∗ ( ), paths starting with essentially aborting Qm ,im∗ θ (T ≥2) † Seq ≡ case M[q] = 0 → abort, ∗ Mm ρMm∗ ⟩ always terminate in ⟨↓, 0⟩, therefore not   1 → ∂ (P (θ)); while(T−1) + counted on either side of (∗∗). ∂θj 1 4. (While(T )) By definition, While(T ) reduces to (Case)  (T −1) P1(θ); Seq and (Sequence). ∗ 5. (Sum Components) If one of Compile(Pb (θ )) is {abort} the statement is immediately true, so we assume oth- erwise for below. Like in the case case, applying the Note that Seq(T −1) essentially aborts. Let us now introduce operational transition rule for 1 step one may observe some helper lemmas for the soundness proof:

17 Draft paper, November 22, 2019, S. Zhu, S. Hung, S. Chakrabarti &X.Wu

D.1 Technical Lemmas On the other hand, d Lemma D.1. U (θ) ∈ {Rσ (θ), Rσ ⊗σ (θ)}σ ∈{X ,Y ,Z }. Let U (θ) d dθ tr(O · U (θ)· ρ ·( U (θ))†) + denote the entry-wise derivative of U (θ). Then, dθ d 1 d † U (θ) = U (θ + π); (D.1) tr(O · U (θ)· ρ · U (θ)) (D.14) dθ 2 dθ ∂  θ θ ([[(O, ρ) → U (θ)]]) = tr O ·(cos I• − i sin( )σ)· ρ · ∂θ 2 2 d θ ′ θ ′ †  = tr(O · U (θ)· ρ ·( U (θ))†) + ((cos ) I• + i(sin( )) σ ) + dθ 2 2 d  θ ′ θ ′ tr(O · U (θ)· ρ · U †(θ)) (D.2) tr O · ((cos ) I• − i(sin( )) σ)· ρ · dθ 2 2  θ θ †  D.1 1  ( ( ) ) = tr O U (θ)ρU †(θ + π) + cos I• + i sin σ (D.15) 2 2 2 θ θ ′  = tr(2 cos( )·(cos( )) Oρ) (D.16) U (θ + π)ρU †(θ) . (D.3) 2 2 θ θ θ θ +tr(i[cos( )(sin( ))′ + (cos( ))′ sin( )]O(ρσ † − σρ )) Proof. Let U (θ) ∈ {Rσ (θ), Rσ ⊗σ (θ)}σ ∈{X ,Y ,Z }. Then U (θ) = 2 2 2 2 θ θ θ θ ′ † cos 2 I• −i sin( 2 )σ, where σ ∈ {X, Y, Z, X ⊗X, Y ⊗Y, Z ⊗Z }. +tr(2 sin( )·(sin( )) Oσρσ ), (D.17) In the cases where σ is 1-qubit gate, I• denotes identity on 2 2 that one qubit; likewise for 2-qubit cases. Correctness of Eqns as desired. D.1, D.2 basically follows from straightforward computation. □ 1. Equation D.1: Lemma D.2. Let P(θ) ∈ q-while(T )(θ), P ′(θ) ∈ d v (U (θ)) q-while(T ) (θ). Then for arbitrary θ ∗ ∈ Rk , ρ ∈ D(H ), dθ v∪{A} v 1  θ θ  O ∈ Ov , = (− sin( ))I• − i cos( )σ (D.4) ∗ ′ ∗ ∗ ′ ∗ 2 2 2 1. [[(O, ρ) → P(θ ); P (θ )]] = [[(O, [[P(θ )]]ρ) → P (θ )]]. 1  θ π θ π  2. [[(O, ρ) → P ′(θ ∗); P(θ ∗)]] = [[([[P(θ ∗)]]∗(O), ρ) → P ′(θ ∗)]], = cos( + )I• − i sin( + )σ (D.5) ∗ ∗ ∗ 2 2 2 2 2 with [[P(θ )]] the dual of [[P(θ )]]. 1  (θ + π) (θ + π)  ∗ ′ ∗ ′ ∗ ∗ = cos( )I• − i sin( )σ (D.6) Proof. Observe that both P(θ ); P (θ ) and P (θ ); P(θ ) lives 2 2 2 (T ) in q-while (θ), so we identify the “smaller” program 1 v∪{A} = U (θ + π) (D.7) P(θ ∗) with I ⊗P(θ ∗), where I ≡ I ◦I denotes the identity 2 A A A operation on Ancilla. 2. Equation D.2: 1. Unfolding definition 5.2, ∂ ([[(O, ρ) → U (θ)]]) [[( ) → ( ∗) ′( ∗)]] ∂θ O, ρ P θ ; P θ ∗ ′ ∗ ∂ = tr((ZA ⊗ O)[[P(θ ); P (θ )]]((|0⟩A ⟨0|) ⊗ ρ)) = (tr(OU (θ)ρU †(θ))) (D.8) ∂ Fig1b, Sequence θ = tr((Z ⊗ O)[[P ′(θ ∗)]][[I ⊗ P(θ ∗)]]((|0⟩ ⟨0|) ⊗ ρ)) ∂  θ θ A A A = (tr O ·(cos I• − i sin( )σ)· ρ · (( ⊗ )[[ ′( ∗)]]((| ⟩ ⟨ |) ⊗ ([[ ( ∗)]] ))) ∂θ 2 2 = tr ZA O P θ 0 A 0 P θ ρ ∗ ′ ∗ θ θ †  = [[(O, [[P(θ )]]ρ) → P (θ )]], as desired. (cos I• + i sin( )σ ) ) (D.9) 2 2 2. Observe that: ∂  2 θ θ θ † = tr(cos ( )Oρ + i cos( ) sin( )Oρσ (D.10) ′ ∗ ∗ ∂θ 2 2 2 [[(O, ρ) → P (θ ); P(θ )]] ′ ∗ ∗ θ θ θ †  = tr((Z ⊗ O)[[P (θ ); P(θ )]] · −i sin( ) cos( )Oσρ + sin2( )σρσ ) (D.11) A 2 2 2   (|0⟩ ⟨0|) ⊗ ρ) θ θ A = tr(2 cos( )·(cos( ))′Oρ) (D.12) Fig1b, Sequence 2 2 = tr((Z ⊗ O)[[I ⊗ P(θ ∗)]][[P ′(θ ∗)]] · θ θ θ θ A A +tr(i[cos( )(sin( ))′ + (cos( ))′ sin( )]O(ρσ † − σρ ))   2 2 2 2 (|0⟩A ⟨0|) ⊗ ρ) θ θ ′ ∗ +tr(2 sin( )·(sin( ))′Oσρσ †), (D.13) ζ ≡[[P (θ )]]((|0⟩A ⟨0|)⊗ρ) ∗ 2 2 = tr((ZA ⊗ O)[[IA ⊗ P(θ )]]ζ ), (D.18) 18 On the Principles of Differential Quantum Programming Languages Draft paper, November 22,2019,

while Proof. We first make some direct application of the code transformation (hereinafter “CT”, Fig 4) and compilation [[([[ ( ∗)]]∗( ) ) → ′( ∗)]] P θ O , ρ P θ (D.19) (hereinafter “CP”, Fig 3) rules; some results will immediately ∗ ∗ ′ ∗ = tr((ZA ⊗ [[P(θ )]] (O))[[P (θ )]]((|0⟩A ⟨0|) ⊗ ρ)) follow from these application. ∗ ∗ = tr((ZA ⊗ [[P(θ )]] (O))ζ ). (D.20)

Now RHS (EqnD.18) equals RHS(Eqn(D.20)) via dual-  ∂  I ⊗ ( ∗) Compile (S0(θ); S1(θ)) ity, as A P θ does nothing on the ancilla, while ∂θ [[ ( ∗)]]∗( ) ∈ O P θ O v and ZA are compatible observables. CT,Sequence ∂ = Compile((S (θ); (S (θ))) + □ 0 ∂θ 1 ∂ ( ) ( ) ∈ T ( ) ∀ ∈ [ , ], ( (S0(θ)); S1(θ))) Lemma D.3. Let Sm θ add-q-whilev θ ( m 0 w ∂θ 9 where w ≥ 1). Denote: CP,Sum Components  ∂  = Compile S (θ); (S (θ)) (D.29) 0 ∂θ 1 Compile(S (θ)) m Þ  ∂  ≡ {| ( ), ··· , ( )|}, Compile (S0(θ)); S1(θ) Qm,0 θ Qm,sm θ (D.21) ∂θ ∂ CP,Sequence Compile( (Sm(θ))) = {|Q0,д(θ); P1,h(θ)|}д ∈[s0],h ∈[t1](D.30) ∂θ Þ ≡ {|Pm,0(θ), ··· , Pm,tm (θ)|}. (D.22)

{|P0,i (θ);Q1,j (θ)|}i ∈[t ],j ∈[s ]; (D.31) Then for arbitrary ρ ∈ D(Hv ), O ∈ Ov , we have the following 0 1 computational properties regarding the observable semantics:

∂ , where [s ], [s ], [t ], [t ] stand for e.g. [0, s ], [0, s ], [0, t ], [0, t ] [[(O, ρ) → (S0(θ); S1(θ))]] 0 1 0 1 0 1 0 1 ∂θ respectively; and ∂ = [[(O, ρ) → (S (θ)); S (θ)]] ∂θ 0 1 ∂  ∂  +[[(O, ρ) → S0(θ); (S1(θ))]], (D.23) Compile (case M[q] = m → S (θ) end) ∂θ ∂θ m ∂ [[(O, ρ) → (case M[q] = m → Sm(θ) end)]] CT,Case  ∂  ∂θ = Compile case M[q] = m → (Sm(θ)) end ∂ Õ Õ † θ = [[(O, Mm ρM ) → Pm,i (θ)]], m m CP,Case  ∂  m im = FB case M[q] = m → (S (θ)) end ; ∂θ m where im runs through [0, tm]; (D.24)

also,  ∂  [[(O, ρ) → (S (θ) + S (θ)) ]] ∂θ 0 1 ∂  ∂  = [[(O, ρ) → (S0(θ))]] + (D.25) Compile (S (θ) + S (θ)) (D.32) ∂θ ∂θ 0 1 ∂ CT,Sum Component  ∂ ∂  [[(O, ρ) → (S1(θ))]] (D.26) = Compile (S (θ)) + (S (θ)) ∂θ ∂θ 0 ∂θ 1 (D.33) [[(O, ρ) → S0(θ) + S1(θ)]] CP,Sum Component = {|P (θ), ··· , P |} (D.34) [[( ) → ( )]] [[( ) → ( )]] 0,0 0,t0 = O, ρ S0 θ + O, ρ S1 θ , (D.27) Þ {|P1,0(θ), ··· , P1,t1 |}. (D.35) and

[[(O, ρ) → case M[q] = m → Sm(θ) end]] Õ Õ As promised, Eqns D.23 and D.26 immediately follows [[( E ) → ( )]] = O, m ρ Qm,jm θ (D.28) from the results above together with the corresponding defi- m jm ∈[0,sm ] nitions. So is Eqn D.27, following the same lines of unfolding by CP-(ND component) laid out in Eqns D.33∼ D.35 then a di- 9As a reminder: we shall frequently refer to Notations in Eqns D.21 and D.22 for the rest of the section. rect pattern matching by definition. To obtain D.24, observe 19 Draft paper, November 22, 2019, S. Zhu, S. Hung, S. Chakrabarti &X.Wu that Lemma D.4. Keeping the notations in Lemma D.3, we have ∗ k ∀θ ∈ R , O ∈ Ov, ρ ∈ D(Hv ),  ∂  ∂ ([[( , ) → ( ) ( )]]) [[(O, ρ) → (case M[q] = m → Sm(θ) end)]] O ρ S0 θ ; S1 θ ∂θ ∂θ θ ∗ Õ = (D.36) Õ  ∂ ∗  = ([[(O, [[Q0,д(θ )]]ρ) → S1(θ)]]) (D.43) ( )∈ ( [ ] → ∂ ( ( )) ) ∂θ θ ∗ Q θ FB case M q = m ∂θ Sm θ end д ∈[0,s0]

[[(O, ρ) → Q(θ)]] (D.37) Õ  ∂ ∗ ∗  Õ + ([[([[Q1,j (θ )]] (O), ρ) → S0(θ)]]) ∂θ θ ∗ = (D.38) j ∈[0,s1] ( )∈ ( [ ] → ∂ ( ( )) ) Q θ FB case M q = m ∂θ Sm θ end (D.44) tr((ZA ⊗ O)[[Q(θ)]](|0⟩A ⟨0| ⊗ ρ)) Proof. Observe: (D.39) LHS(D.43) A q ⊆v  < Õ Õ CP,Sequence Õ = tr (ZA ⊗ O)· = ∗ m im∗ ∈[0,tm∗ ] д ∈[0,s0],j ∈[0,s1] †  ([[ ∗ ( )]]((| ⟩ ⟨ |) ⊗ ∗ )) Pm ,im∗ θ 0 A 0 Mm ρMm∗ (D.40)  ∂  ([[(O, ρ) → Q0,д(θ);Q1,j (θ)]]) Õ Õ † ∂θ θ ∗ = [[(O, Mm∗ ρM ∗ ) → Pm∗,i ∗ (θ)]] as desired. m m (∗∗∗) ∗ ,See Below Õ m im∗ =

д ∈[0,s0],j ∈[0,s1]

 ∂ ∗  A few more words on Eqn D.40. Since A is disjoint from ([[(O, [[Q0,д(θ )]]ρ) → Q1,j (θ)]]) ∂ ∂θ θ ∗ v ⊇ q, for eachQ(θ) ∈ FB(case M[q] = m → ∂θ (Sm(θ)) end), evolving one step per the operational semantics will lead (D.45) † Õ ⟨[[ ∗ ( )]] ((| ⟩ ⟨ |) ⊗ ∗ )⟩ + to the configuration Pm ,im∗ θ , 0 A 0 Mm ρMm∗ ∗ ( ) д ∈[0,s0],j ∈[0,s1] for some non-(essentially-aborting) Pm ,im∗ θ . Besides, each such configuration appears exactly once affording one sum-  ∂  ([[([[ ( )]]∗( ) ) → ( )]]) mand, due to construction of FB(•). On the other hand, if Q1,j θ O , ρ Q0,д θ ∂θ θ ∗ P ∗ ∗ (θ) essentially aborts, the trace computed as a sum- m ,im (D.46) mand vanishes, making no contribution to the sum, thereby Õ maintaining the equation. = Eqn D.28 is proved using the same lines of logic: д ∈[0,s0]

 ∂ ∗  ([[(O, [[Q0,д(θ )]]ρ) → S1(θ)]]) (D.47) ∂θ θ ∗ [[(O, ρ) → case M[q] = m → S (θ) end]] (D.41) Õ m + Õ = [[(O, ρ) → Q(θ)]] j ∈[0,s1]

Q(θ )∈FB(case M[q]= m→Sm (θ ) end)  ∂ ∗ ∗  ([[([[Q1,j (θ )]] (O), ρ) → S0(θ)]]) Õ ∂θ ∗ = tr(O[[Q(θ)]]ρ) θ (D.48) Q(θ )∈FB(case M[q]= m→Sm (θ ) end) (D.42) Where the last step is direct application of definition of input-space observable semantics. For the step (***) claiming A

 ∂ ∗  ([[( [[ ∗ ( )]] ) → ∗ ( )]]) as desired. □ = O, Q0,д θ ρ Q1,j θ (D.50) ∂θ θ ∗

 ∂ ∗  + ([[([[Q1,j∗ (θ)]] (O), ρ) → Q0,д∗ (θ)]]) (D.51) ∂θ θ ∗ 20 On the Principles of Differential Quantum Programming Languages Draft paper, November 22,2019,

We argue the correctness of D.49∼D.51 below. Simplifying no- ∗ Õ ∗ d † † ∗ i + tr(OJy (θ )( Kx (θ )ρ K (θ))J (θ ))) (D.60) tations, write Q0,д∗ (θ), Q1,j∗ (θ) as Q0(θ), Q1(θ) respectively, dθ x y j ∗ then let x θ

Õ Q (θ) ≡ K (θ) ◦ K†(θ), (D.52) (b) ∂ ∗ 0 x x = tr(O[[Q1(θ)]]([[Q0(θ )]]ρ)) (D.61) x ∂θj θ ∗ Õ † Q1(θ) ≡ Jy (θ) ◦ Jy (θ) (D.53) y ∂ + tr(O[[Q (θ ∗)]][[Q (θ)]]ρ) (D.62) denote the kraus operator decomposition of Q (θ), Q (θ), 1 0 0 1 ∂θj θ ∗ each with finitely many summands.10 Then,

 ∂  S-H Dual  ∂ ∗  ([[(O, ρ) → Q (θ);Q (θ)]]) (D.56) = ([[(O, [[Q0(θ )]]ρ) → Q1(θ)]]) (D.63) д j ∂θj ∗ ∂θ θ ∗ θ

∂ = tr(O[[Q0(θ);Q1(θ)]]ρ) (D.57) ∂θj θ ∗  ∂ ∗ ∗  + ([[([[Q1(θ )]] (O), ρ) → Q0(θ)]]) (D.64) ∂θ ∗ j θ ∂ Õ Õ † † = tr(O Jy (θ)( Kx (θ)ρKx (θ))Jy (θ)) (D.58) d • ∂θj where dθ again denotes the entry-wise derivative, and y x ∗ j θ (a), (b) is obtained by unfolding the definition of trace, notic- ing that all entries of parameterized unitaries are smooth. Õ ∂ † † = tr(OJy (θ)Kx (θ)ρKx (θ)Jy (θ)) (D.59) S-H dual here is short for Schrödinger-Heisenberg dual, as ∂θ ∗ x,y j θ is everywhere else in this section. □ (a) Õ h d Õ ∗ † ∗ † ∗ = (tr(O Jy (θ)( Kx (θ )ρKx (θ ))Jy (θ )) y dθj x D.2 Proof Details of Theorem 6.2: Soundness of Differentiation Logic

i Proof. As stated in the main text, throughout we fix OA ≡ ZA. ∗ Õ ∗ † ∗ d † +tr(OJy (θ )( Kx (θ )ρKx (θ )) Jy (θ))) Unfolding by definition, it suffices to show dθj x θ ∗ ∂ h ∀O ∈ Ov, ρ ∈ D(Hv ), [[((O, ZA), ρ) → (S(θ))]] Õ ∗ Õ d † ∗ † ∗ ∂θ + (tr(OJy (θ )( Kx (θ)ρKx (θ ))Jy (θ )) dθj ∂ y x = ([[(O, ρ) → S(θ)]]). (D.65) ∂θ 10This is doable because for each θ ∗ ∈ Rk we have Õ Let O ∈ Ov , ρ ∈ D(H ) be arbitrary. We consider each Q (θ ∗) = K (θ ∗) ◦ K † (θ ∗), (D.54) v 0 x x rule in Figure 5 possibly used as the final step of the deriva- k=1 ∗ Õ ∗ † ∗ tion. Q1(θ ) = Jy (θ ) ◦ Jy (θ ). (D.55) q=1 1. (Abort)∼(Trivial Unitary): for S(θ) ≡ abort[v], skip[v] Since parameterized unitaries have a parameterized-matrix representa- orq := |0⟩, [[(O, ρ) → S(θ)]] is a constant, so ∂ ([[(O, ρ) → tion, the superoperator U(θ ) = U (θ )) ◦ U †(θ ) for any parameterized ∂θ CT,trivialize unitaries also have a parameterized operator representation. In our pa- S(θ)]]) = 0. On the other hand, ∂ (S(θ)) ≡ rameterized quantum program syntax only unitaries are parameterized, ∂θ [ ∪ { }] [[( ) → ∂ ( ( ))]] (( ⊗ thus the kraus operators Kx (θ ), Jy (θ )’s can be chosen uniformly. If one abort v A , meaning O, ρ ∂θ S θ = tr ZA insists one may perform a standard structural induction to show the ex- O)· 0) = 0. istence of such a parameterized Kraus operator decomposition for any In Trivial-Unitary, [[(O, ρ) → U (θ)]] doesn’t depend ( ) ∈ (T )( ) ( ) Q θ q-whilev θ . For example, assuming each Sm θ decomposes to ∂ t on θj since θj < θ; hence ([[(O, ρ) → U (θ)]]) = 0, Í m,jm K (θ ) ◦ K † (θ ), then case M[q] = m → S (θ ) decom- ∂θj tm,j =tm,1 tm,j tm,j m poses into as faithfully represented by the code transformation Õ Õ ( K (θ ) ◦ K † (θ )) ◦ E . rule. tm,j tm,j m m tm,j 2. (Unitary): Assume S(θ) ≡ U (θ) with U (θ) ∈ {Rσ (θ), Rσ ⊗σ (θ)}σ ∈{X ,Y ,Z }. Let σ range over {σ, σ ⊗σ }σ ∈X ,Y ,Z As in the unitary case, the linear Kraus operators are entry-wisesmooth, 2 ensured by the entry-wise smoothness of matrix representations of param- (proof below works for all 6 cases where σ = I•). Let d eterized unitaries. dθ U (θ) denote the entry-wise derivative of U (θ), then 21 Draft paper, November 22, 2019, S. Zhu, S. Hung, S. Chakrabarti &X.Wu

11 recall from Lemma D.1 that the “off-diagonal” part is killed by ZA. Hence,

d 1 RHS(D.72 ∼ D.75) (D.76) U (θ) = U (θ + π); (D.66) dθ 2 1   ( ⊗ ) | ⟩ ⟨ | | ⟩ ⟨ | ∂ = tr ZA O 1 A 1 + 0 A 0 (D.77) ([[(O, ρ) → U (θ)]]) 4 ∂θ  † †  d ⊗ U (θ)ρU (θ) + U (θ + π)ρU (θ + π) (D.78) = tr(O · U (θ)· ρ ·( U (θ))†) dθ   d + |0⟩A ⟨0| − |1⟩A ⟨1| (D.79) ( · ( )· · †( )) +tr O U θ ρ U θ . (D.67) ! dθ    † † D.66 1  ⊗ U (θ)ρU (θ + π) + U (θ + π)ρU (θ) (D.80) = tr O U (θ)ρU †(θ + π) 2  Lastly, the behavior of Z ⊗ O ensures +U (θ + π)ρU †(θ) A 1    RHS(D.77 ∼ D.80)= tr O U (θ)ρU †(θ+π)+U (θ+π)ρU †(π) , 2 Meanwhile, as desired. From this point, we again adopt the nota- ∂ tions from Lemma D.3 Eqns D.21, D.22. [[(O, ρ) → (U (θ))]] ∂θ 3. (Sum Component) CT,1-qb,2-qb ′ = tr((ZA ⊗ O)[[R σ (θ)]] ·  ∂  [[(O, ρ) → (S0(θ) + S1(θ)) ]] ((|0⟩A ⟨0|) ⊗ ρ)) (D.68) ∂θ Lem D.3 ∂ = [[(O, ρ) → (S0(θ))]] (D.81) Let us unfold the definition of R′ (θ): ∂θ σ ∂ +[[(O, ρ) → (S (θ))]] (D.82) ∂θ 1 RHS(D.68) IH ∂ = ([[(O, ρ) → S (θ)]] (D.83) 1 Õ ∂θ 0 = tr((ZA ⊗ O)[[C − Rσ (θ); HA]] · 2 [[( ) → ( )]]) x,y ∈{0,1} + O, ρ S1 θ (D.84) ((| ⟩ ⟨ |) ⊗ )) Lem D.3 ∂ x A y ρ = ([[(O, ρ) → S (θ) + S (θ)]]), (D.85)  ∂θ 0 1 1  † = tr (ZA ⊗ O)[[HA]] |0⟩A ⟨0| ⊗ U (θ)ρU (θ) 2 as desired. † +|0⟩A ⟨1| ⊗ U (θ)ρU (θ + π) (D.69) 4. (Sequence) Assume S(θ) ≡ S0(θ); S1(θ); It suffices to † show that ∀θ ∗, +|1⟩A ⟨0| ⊗ U (θ + π)ρU (θ) (D.70)  | ⟩ ⟨ | ⊗ ( ) †( ) ∂ ∗ ∗  ∂  + 1 A 1 U θ + π ρU θ + π (D.71) [[(O, ρ) → (S0(θ ); S1(θ ))]] = ([[(O, ρ) → S0(θ); S1(θ)]]) ∂θ ∂θ ∗ Í θ 1   x,y |x⟩A ⟨y| (D.86) = tr (Z ⊗ O) ⊗ U (θ)ρU †(θ)(D.72) 2 A 2 Let us first manipulate the equation using some com- putational properties developped from the helper lem- Í |x⟩ ⟨0| − Í |x⟩ ⟨1| + x A x A U (θ)ρU †(θ + π)(D.73) mas: 2 Í |0⟩ ⟨x| − Í |1⟩ ⟨x| + x A x A U (θ + π)ρU †(θ)(D.74) LHS(D.86) (D.87) 2 Lemma D.3,Eqn D.23 ∂ ∗ ∗ Í |x⟩ ⟨x| − (|1⟩ ⟨0| + |0⟩ ⟨1|) = [[(O, ρ) → (S0(θ )); S1(θ )]] + x A A A · ∂θ 2 ∗ ∂ ∗  +[[(O, ρ) → S0(θ ); (S1(θ ))]] U (θ + π)ρU †(θ + π) (D.75) ∂θ

11Namely, the trace computation can be decomposed into a finite linear Let’s regroup the terms of D.72 ∼ D.75, bearing in ′ ′ combination of the form tr((ZA ⊗ O)(|b ⟩A ⟨b | ⊗ •)) (b, b ∈ {0, 1}), and mind that when observing ZA ⊗ O on the final state, this term vanishes whenever b , b′. 22 On the Principles of Differential Quantum Programming Languages Draft paper, November 22,2019,

† This equals, by Lem D.3 Eqns D.30,D.31, Applying the inductive hypothesis IH(m) 7→ (Mm ρMm, O), Õ ∗ ∗ ∀m, we get that the above equals [[(O, ρ) → Q0,д(θ ); P1,h(θ )]] (D.88) д ∈[0,s0],h ∈[0,t1] Õ ∗ ∗ Õ ∂ † + [[(O, ρ) → P , (θ );Q , (θ )]](D.89) ([[(O, (M ρM )⟩) → S (θ)]]) 0 i 1 j ∂θ m m m i ∈[0,t0],j ∈[0,s1] m ∂ Õ Õ Via Lem D.2, this translates to = ( [[(O, E ρ⟩) → Q (θ)]]) ∂θ m m,jm Õ ∗ ∗ m j ∈[0,s ] [[(O, [[Q0,д(θ )]]ρ) → P1,h(θ )]](D.90) m m д ∈[0,s0],h ∈[0,t1] Lem D.3 ∂ = ([[(O, ρ) → case M[q] = m → Sm(θ) end]]) Õ ∗ ∗ ∗ ∂θ + [[([[Q1,j (θ )]] (O), ρ) → P0,i (θ )]]

i ∈[0,t0],j ∈[0,s1] Õ ∂ , as desired. = [[(O, [[Q (θ ∗)]]ρ) → (S (θ ∗))]] (D.91) (T ) 0,д ∂θ 1 6. (While ) This is verified by successive application of д ∈s0 (Case) rule and (Sequence) rule T times. Õ ∂ + [[([[Q (θ ∗)]]∗(O), ρ) → (S (θ ∗))]], (D.92) 1,j ∂θ 0 □ j ∈s1 where the last step comes from definition of observ- able semantics with ancilla. Now apply the induc- E Detailed Proofs from Section 7 tive hypothesis on S1(θ), a universal statement for all E.1 Proof for Proposition 7.2: Upper Bound of ′ ′ ∗ |# ∂ (P(θ))| (ρ , O ) ∈ D(Hv )×Ov , to the instances ([[Q0,д(θ )]]ρ, O)) ∂θj (∀д ∈ [0, s0]), we have the first summand Proof. 1. If P(θ) ≡ abort[v]|skip[v]|q := |0⟩ (q ∈ v), then OC (P(θ)) = 0 = |# ∂ (P(θ))| by definition. Similarly, RHS(D.91) j ∂θj if P(θ) ≡ U (θ) trivially used θ , no non-aborting pro- Õ  ∂ ∗  j = ([[(O, [[Q0,д(θ )]]ρ) → S1(θ)]]) (D.93) Compile( ∂ ( ( ))) | ∂ ( ( ))| ∂ ∗ grams exists in ∂θ P θ , yielding # ∂θ P θ = д θ θ j ∂ ′ 0; otherwise, |# ∂θ (P(θ))| consists of a single Rσ (θ), so Likewise, apply the inductive hypothesis on S0(θ) to j ∗ ∗ both numbers are 1. the instances (ρ, [[Q1,j (θ )]] O)) (∀j ∈ [0, s1]), we have ∂ 2. Assume |# (Pb (θ))| ≤ OCj (Pb (θ)) for eachb ∈ {1, 2} the second summand ∂θj and P(θ) ≡ P1(θ); P2(θ). Then RHS(D.92)

Õ  ∂ ∗ ∗  = ([[([[Q1,j (θ )]] (O), ρ) → S0(θ)]]) (D.94) ∂ ∂θ ∗ |# (P(θ))| j θ ∂θj Collecting everything, using again a techical lemma (∗) ∂ = |#(P1(θ); (P2(θ)))| from above, ∂θj RHS(D.91) + RHS(D.92) ∂ +|#( (P1(θ)); P2(θ))| (E.1) = RHS(D.93) + RHS(D.94) (D.95) ∂θj Lem D.4 ∂ = RHS(D.86), as desired. (D.96) ≤ |#( (P2(θ)))| ∂θj 5. (Case) This follows again from the computational lem- ∂ mas above, and applications of all inductive hypothesis. +|#( (P1(θ)))| (E.2) † ∂θj For conciseness, let us denote by IH(m) 7→ (Mm ρM , O) m Inductive Hypothesis the application of the m-th (m ∈ [0,w]) inductive hy- ≤ OCj (P1(θ)) + OCj (P2(θ)) (E.3) † pothesis to the instance (Mm ρM , O) ∈ D(H ) × Ov . m v = OCj (P(θ)). (E.4) Then, ∂ [[(O, ρ) → (case M[q] = m → Sm(θ) end)]] where step (∗) is via rules CT,CP-(ND-Component). ∂θ 3. Assume |# ∂ (P (θ))| ≤ OC (P (θ)) for each m ∈ Lem D.3 Õ Õ † ∂θj m j m = [[(O, Mm ρM ) → Pm,i (θ)]] (D.97) m m [0,w] and P(θ) ≡ case M[q] = m → P (θ) end. Then m i ∈[0,t ] m m m ∂ by the CT and CP rules of case, ∂θ (P(θ)) compiles to Õ † ∂ j = [[(O, Mm ρMm) → (Sm(θ))]] (D.98) | ∂ ( ( ))| ∗ ∈ [ ] ∂ maxm # ∂θ Pm θ programs. Assume m 0,w is m θ j 23 Draft paper, November 22, 2019, S. Zhu, S. Hung, S. Chakrabarti &X.Wu

∂ ∂P (Θ, Φ) s.t. |# (Pm∗ (θ))| attains that maximum. Then, 1 ′ ∂θj Compile( ) ≡ {| Q(Θ);Q (Φ) |}, ∂α ∂P (Θ, Φ, Ψ) ∂ ∂ Compile( 2 ) ≡ {| Q(Θ); |# (P(θ))| = |# (Pm∗ (θ))| (E.5) ∂α ∂θ ∂θ ′ j j case M[q1] = 0 → Q (Φ) Inductive Hypothesis 1 → Q(Ψ) |}. ≤ OCm∗ (Pm∗ (θ)) (E.6) 3. If α ∈ Ψ: without loss of generality assume α = ψ1 ≤ max OCm(Pm(θ))(E.7) m (other situations are completely analogous). We de- = OCj (P(θ)) (E.8) note:

, as desired. ′ ′ (T ) Q (Ψ) ≡ RX (ψ1)|A, q1⟩; ··· ; RZ (ψ12)|q12⟩ 4. If P(θ) ≡ while M[q] = 1 do P1(θ) done, it suffices ∂ ∂ ∂ ( , ) to show |# ∂θ (P(θ))| ≤ T · |# ∂θ (P1(θ))|. This is true for P1 Θ Φ Compile( ) ≡ {| abort[A, q1, ··· , q12] |}, T = 1 as ∂ (while(1)) essentially aborts. By induction ∂α ∂θ ∂P (Θ, Φ, Ψ) if this is true for T , then for T + 1 we have Compile( 2 ) ≡ {| Q(Θ); ∂α case M[q ] = 0 → Q(Φ) ∂ (T +1) 1 |# (while M[q] = 1 do P1(θ) done)| (E.9) → ′( ) |} ∂θ 1 Q Ψ . ∂ ≤ |# (P (θ); while(T ) M[q] = 1 do P (θ) done)| F.2 Benchmark testing on representative VQCs ∂θ 1 1 Eqn E.2 ∂ In this section we introduce three examples from real-world ≤ |# (P (θ))| quantum machine learning and quantum approximation al- ∂θ 1 ∂ gorithms, all of which are very promising candidates for +|# (while(T ) M[q] = 1 do P (θ) done)| implementation on near-term quantum devices. Without ∂θ 1 loss of generality, throughout this section θ denotes θ . ∂ 1 ≤ (T + 1)|# (P1(θ))|, as desired. (E.10) Our first case, QNN∗, starts from a slightly simplified case ∂θ from an actual quantum neural-network that has been im- □ plemented on ion-trap quantum machines [56]. QAOA∗ is a that produces approximate solutions for F Evaluation Details combinatorial optimization problems, which is regarded as F.1 Training VQC instances with controls one of the most promising candidates for demonstrating quantum supremacy [15, 47]. Quantum eigensolvers, crucial For any parameter , we exhibit the final results of Compile α to quantum phase estimation algorithms, usually requires ∂P (Θ, Φ) ∂P (Θ, Φ, Ψ) ( 1 ) and Compile( 2 ), with P (Θ, Φ) and fully coherent evolution. In [35] Peruzzo et al. introduced an ∂ ∂ 1 α α alternative approach that relaxes the fully coherent evolution P2(Θ, Φ, Ψ) defined in the main text: requirement combined with a new approach to state prepa- 1. If α ∈ Θ: without loss of generality assume α = θ1 ration based on ansä and classical optimization [21, 30, 35], (other situations are completely analogous). We de- namely our third example VQE∗. The control flow creating note: multi-layer structures for the three families of examples are ′ very similar, and these algorithms mainly differ from each Q ′(Θ) ≡ R (θ )|A, q ⟩; ··· ; R (θ )|q ⟩ X 1 1 Z 12 12 other in their basic “rotation-entanglement” blocks. There- then fore I will introduce the basic blocks for all three families, then use QNN∗ as an example to illustrate how the control ∂P (Θ, Φ) Compile( 1 ) ≡ {| Q ′(Θ);Q(Φ) |}, flow (if, bounded while) works. ∂α The basic “rotate-entangle” building for QNN consists ∂P (Θ, Φ, Ψ) Compile( 2 ) ≡ {| Q ′(Θ); of a rotation stage and an entanglement stage – we consider ∂α these two stages together a single layer: in the rotation stage, case M[q1] = 0 → Q(Φ) 1 → Q(Ψ) |}. one performs parameterized Z rotations followed by parame- terized X rotations and then again parameterized Z rotations 2. If α ∈ Φ: without loss of generality assume α = ϕ1 on the first 4 (small scale) or 6 qubits (medium or large scale); (other situations are completely analogous). We de- in the entanglement stage, one performs the parameterized note: X ⊗ X rotation on all pairs from the first 4 or 6 qubits. See Figure 7. Basic block for VQE consists of three stages: the first stage ′ ′ Q (Φ) ≡ RX (ϕ1)|A, q1⟩; ··· ; RZ (ϕ12)|q12⟩ is paramaterized X followed by parameterized Z; the second 24 On the Principles of Differential Quantum Programming Languages Draft paper, November 22,2019,

quantum machine learning system developed by Google. For details of the code and the differentiation result, please see supplementary materials.

Figure 7. Circuit representation of the basic building block of QNN. Figure credit: [56] stage uses H and CNOT to entangle; the third stage performs parameterized Z, X, Z in that order. Basic block for QAOA, on the other hand, entangles using H and CNOT in the first stage, and then performs parameterized X rotations on the second stage. The more interesting part lies in building multiple lay- ers using control flow, which we will explain using QNN∗. ( ) (·) | ∂ ()| The small scale if-controlled QNN (denoted as QNNS,i ) is two- P θ OC # ∂θ #gates #lines #layers #qb’s layered. Let B denote the basic rotate-entangle block for QNNS,b 1 1 20 24 1 4 ′ ′′ QNN explained as above; B , B be two similarly-structured QNNS,s 5 5 20 24 1 4 rotate-entangle blocks using different parameter-qubit com- QNNS,i 10 10 60 67 2 4 binations. QNNS,i performs B as the first layer, then mea- QNNS,w 15 10 60 66 3 4 ′ ′′ sures on the first qubit, and performs B or B as the second QNNM,i 24 24 165 189 3 18 layer dependent on the measurement output. For medium QNNM,w 56 24 231 121 5 18 and large scale if-controlled QNN (i.e. QNNM,i and QNNL,i ) QNNL,i 48 48 363 414 6 36 we have larger rotate-entangle blocks, various parameter- QNNL,w 504 48 2079 244 33 36 ′ ′′ qubit combinations (hence more of the B , B ’s involving VQES,b 1 1 14 16 1 2 different parameter-qubit combinations), and more layers of VQES,s 2 2 14 16 1 2 measurement-based control. VQES,i 4 4 28 38 2 2 (Bounded) while-controled QNN works similarly: take QNN S,w VQES,w 6 4 42 32 3 2 as an example, it performs the rotate-entangle block B as the VQEM,i 15 15 224 241 3 12 first layer, then measure the first qubit; if itoutputs 0 we halt; VQEM,w 35 15 224 112 5 12 otherwise we perform some B′, measure qubit q1 again, halts VQEL,i 40 40 576 628 5 40 if outputs 0, performs B′ the third time and aborts otherwise. VQEL,w 248 40 1984 368 17 40 Note that this is just a verbose way of saying “we wrapped ′ QAOAS,b 1 1 12 15 1 3 B with a 2-bounded while-loop”! And similarly, we build QAOA 3 3 12 15 1 3 more layers on larger systems (QNN , QNN ) using larger S,s M,w L,w QAOA 6 6 36 41 2 3 rotate-entangle blocks and more layers of bounded-while S,i QAOA 9 6 36 29 3 3 loops. One should note how bounded while is a succinct way S,w QAOA 18 18 120 142 3 18 to represent the circuits: as shown in Table 3, in QNN we M,i L,w QAOA 42 18 168 94 5 18 managed to represent 2079 unitary gates in just 244 lines of M,w QAOA 36 36 264 315 6 36 code. L,i QAOA We auto-differentiated the three families of quamtum L,w 378 36 1512 190 33 36 programs using our code transformation (hereinafter “CT”) Table 3. Output on Test Examples. Note that: (1) {S, M, L} and code compilation (hereinafter "CP") rules. As shown stands for “small, medium, large”; {b, s, i,w} stands for “basic, in Table 3, the computation outputs the desired multiset of shared, if, while”; #lines is the number of lines for input derivative programs, and the number of non-aborting pro- programs. (2) gate count and layer count forT -bounded while grams (|# ∂ (P(θ))|) agrees with the upper bound described is done by multiplying the corresponding count for loop ∂θ ∗ | ∂ ( ( ))| ( ( )) in Proposition 7.2. It should be noted that Table 3 indicates body with T . (3) for ∗,w we have # ∂θ P θ < OC P θ our auto-differentiation scheme works well for variously because differentiating the unrolled bounded while generates sized input programs, be the size measured by code length, essentially aborting programs which are optimized out of gate count, layer count or qubit count, to name a few. In the final multiset. particular, a 40-qubit system is very close to a real world 25