QUANTUM NEURAL NETWORKS

A Dissertation by

Nam H. Nguyen

Master of Science, Wichita State University, 2016

Bachelor of Arts, Eastern Oregon University, 2014

Submitted to the Department of Mathematics, Statistics, and Physics and the faculty of the Graduate School of Wichita State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

May 2020

© Copyright 2020 by Nam H. Nguyen

All Rights Reserved

QUANTUM NEURAL NETWORKS

The following faculty members have examined the final copy of this dissertation for form and content, and recommend that it be accepted in partial fulfillment of the requirement for the degree of Doctor of Philosophy with a major in Applied Mathematics.

Elizabeth Behrman, Committee Chair

James Steck, Committee Member

Buma Fridman, Committee Member

Ziqi Sun, Committee Member

Terrance Figy, Committee Member

Accepted for the College of Liberal Arts and Sciences

Andrew Hippisley, Dean

Accepted for the Graduate School

Coleen Pugh, Dean

DEDICATION

To my late grandmother Phan Thi Suu, and my loving mother Nguyen Thi Kim Hoa

ACKNOWLEDGEMENTS

First and foremost, I would like to express my profound gratitude to my advisor, Dr. Elizabeth Behrman, for her advice, support, and guidance toward my Ph.D. degree. She taught me not only the way to do scientific research, but also the way to become a professional scientist and mathematician. Her endless encouragement and patience throughout my graduate studies have positively impacted my life. She always allowed me to work on different research areas and ideas, even when they were outside the scope of my dissertation. This allowed me to learn and investigate problems in many different areas of mathematics, which has benefited me greatly. Her scientific vigor and dedication make her a lifetime role model for me. I will forever be indebted to her.

I would also like to extend my gratitude to Dr. James Steck for his guidance throughout this journey. I have gained a tremendous amount of knowledge in neural networks, in both theory and applications, from working with him for the past four years.

I would also like to express my deepest gratitude to Dr. Buma Fridman for spending time teaching me many important concepts in mathematics. I am especially indebted to him for spending the summer of 2017 working with me, helping me build a better understanding of Kolmogorov's superposition theorem and all its extensions, including a result of his own. Our various discussions through the years have helped me become a better mathematician than I would otherwise have been.

I would also like to pay my special regards to Dr. Ziqi Sun and Dr. Terrance Figy for their willingness to be on my dissertation committee. Furthermore, I want to acknowledge Dr. Ziqi Sun for his countless encouragements throughout my Ph.D. studies and for how much he helped me understand functional analysis better; seeing his passion and curiosity for mathematics and physics makes me always strive to be a better scientist and mathematician.

I am also deeply indebted to Professor Edward Behrman of The Ohio State University for his financial support during my graduate studies through the Behrman Family Foundation. Because of his generosity, I was able to have a smaller teaching load and dedicate much more time to my research and studies.

Furthermore, I would like to acknowledge and thank all my colleagues and friends (Nathan, Bill, Henry, Saideep, Mo), with whom I collaborated on most of my research work. This work would not be possible without all their help. They have all made a meaningful impact on my life. In particular, I want to give special acknowledgement to Dr. Tianshi Lu and Sirvan Rhamati (my best friend at WSU) for their willingness to discuss and work with me on various problems and ideas during my time at WSU, from number theory to graph theory to fractional derivatives to probability theory, and even math competition problems. They shared with me many great ideas throughout the years. Their constant questioning, ways of thinking, and support have positively influenced me. It has been a joy to have a true friend like Sirvan with whom I can share my problems as well as my happiness.

Most of all, I would like to thank my dear mother and my late grandmother. They have supported me throughout my academic journey and always motivated me to strive forward. Their unconditional love has never been affected by the physical distance between us. This dissertation is dedicated to them.

ABSTRACT

Quantum computing is becoming a reality, at least on a small scale. However, designing a good quantum algorithm is still a challenging task. This has been a major bottleneck in quantum computation for years. In this work, we will show that it is possible to take a detour from the conventional programming approach by incorporating machine learning techniques, specifically neural networks, to train a quantum system such that the desired algorithm is "learned," thus obviating the program design obstacle. Our work here merges quantum computing and neural networks to form what we call "Quantum Neural Networks" (QNNs). Another serious issue one needs to overcome when doing anything quantum is the problem of "noise and decoherence." A well-known technique to overcome this issue is using error correcting codes. However, error correction schemes require an enormous amount of additional ancilla qubits, which is not feasible for the current state-of-the-art quantum computing devices or any near-term devices for that matter. We show in this work that QNNs are robust to noise and decoherence, providing error suppression for quantum computation.

Furthermore, not only are our QNN models robust to noise and decoherence, we show that they also possess an inherent speed-up, in terms of being able to learn a task much faster, over various classical neural networks, at least on the set of problems we benchmarked them on. Afterward, we show that although our QNN model is designed to run at the fundamental level of a quantum system, we can also decompose it into a sequence of gates and implement it on current quantum hardware devices. We did this for a non-trivial problem known as the "entanglement witness" calculation. We then propose a couple of different hybrid quantum neural network architectures, networks with both quantum and classical information processing. We hope that this might increase the capability over previous QNN models in terms of the complexity of the problems they might be able to solve.

TABLE OF CONTENTS

Chapter Page

1 INTRODUCTION...... 1
1.1 Motivation...... 1
1.2 Scope of this Dissertation...... 2
1.3 Literature Review...... 4
1.4 Contributions...... 8
1.5 Structure of this Dissertation...... 8

2 OVERVIEW OF QUANTUM MECHANICS...... 10

2.1 Quantum States and Density Operators...... 11
2.2 Composite Systems and Entanglement...... 13
2.3 Time-Evolution of a Closed System...... 20
2.4 Quantum Measurements...... 22

3 QUANTUM COMPUTING...... 26
3.1 Qubits...... 27
3.2 Quantum Gates and Circuits Model...... 33
3.3 Universal Quantum Computation and The Solovay-Kitaev Theorem..... 40
3.4 Quantum Speed-up and Quantum Algorithms...... 43
3.5 Adiabatic Quantum Computation Model...... 46
3.6 Decoherence, Noise, and Error Correction...... 49

4 CLASSICAL ARTIFICIAL NEURAL NETWORKS...... 53
4.1 Introduction...... 54
4.2 Artificial Neurons...... 55
4.2.1 Perceptron...... 57
4.2.2 Sigmoid Neuron...... 59
4.3 Multi-Layer Neural Networks...... 60
4.3.1 Networks Architecture...... 61
4.3.2 Error Backpropagation and The Learning Rule...... 62
4.4 Universal Approximation...... 66
4.4.1 Approximation with Boxes...... 67
4.4.2 Approximation with Modern Analysis...... 74

5 QUANTUM NEURAL NETWORKS...... 77

5.1 Fundamental Structure...... 78
5.2 Learning Algorithm...... 81
5.3 Simulation of Classical and Quantum Gates, and .. 84
5.4 Universal Property of Quantum Neural Network...... 89
5.5 An Alternative Learning Approach...... 92


6 ROBUSTNESS OF QUANTUM NEURAL NETWORK...... 95
6.1 Quantum Computing in The Classical World...... 96
6.2 Dealing with Noise and Decoherence...... 97
6.3 Entanglement Calculation For Two-Qubit System...... 98
6.3.1 Learning with noise...... 105
6.3.2 Learning with Decoherence...... 114
6.3.3 Learning with Noise plus decoherence...... 121
6.4 Entanglement Calculation on Higher-Order Qubit Systems...... 128
6.4.1 Results for the Three-Qubit System: Training and Testing...... 128
6.4.2 Results for the four- and five-qubit systems: Training and Testing.. 138
6.4.3 Quantifying the improvement in robustness with increasing size of the system...... 139
6.4.4 Learning with other types of noise...... 142
6.4.5 Stability Analysis of The Calculations...... 144
6.5 Application: Entanglement for pattern storage...... 145

7 BENCHMARKING NEURAL NETWORKS FOR QUANTUM COMPUTATIONS...... 148
7.1 Type of Neural Networks Performed...... 149
7.1.1 Classical real-valued neural networks...... 149
7.1.2 Complex-Valued Neural Networks...... 150
7.1.3 Quantum Neural Network...... 153
7.2 QNN Versus Classical Neural Networks: Simulating Classical Logic Gates.. 155
7.3 QNN Versus Classical Neural Networks: Iris Classification...... 160
7.4 QNN Versus Classical Neural Networks: Entanglement Calculation..... 165

8 IMPLEMENTATION OF QUANTUM NEURAL NETWORK ON ACTUAL QUANTUM HARDWARE...... 169
8.1 Available Quantum Simulators and Hardware...... 170
8.1.1 IBM...... 170
8.1.2 Microsoft...... 171
8.1.3 D-Wave Systems...... 172
8.2 Two-qubit Quantum Neural Network...... 173
8.2.1 Reverse Engineering of Entanglement Witness...... 174
8.2.2 Numerical computation...... 179
8.3 Statistical Evaluation of Entanglement Witness in Q#...... 183
8.4 Iterative Staging...... 185
8.4.1 Searching for an Asymptotic Limit...... 185
8.4.2 Comparing the Discrete and Continuum Cases...... 188
8.5 Discussions...... 191

9 CONCLUSIONS...... 195


10 FUTURE WORK...... 198
10.1 Quantum Hybrid Neural Network using Multi-Measurements...... 199
10.2 Quantum Hybrid Neural Network using Multi-Step Time Propagation.... 201

REFERENCES...... 204

APPENDIX...... 219

A PARAMETER FUNCTIONS...... 220
A.1 Parameter Functions for XOR...... 220
A.2 Parameter Functions for XNOR...... 221
A.3 Parameter Functions for CNOT Gate...... 222
A.4 Parameter Functions for Bell Circuit...... 225

LIST OF FIGURES

Figure Page

2.1 Graphical interpretation of the geometric Hahn-Banach Theorem. The witness

w divides the Hilbert space into separable (S) and entangled subspaces. An

optimal witness is as close as possible to the set S...... 17

2.2 Graphical illustration of plane separation in Euclidean space...... 17

2.3 Geometric illustration of an optimal entanglement witness...... 18

2.4 A quantum system interacting with a measuring apparatus in the presence of

the surrounding environment...... 23

3.1 Geometric Representation of a Classical Bit...... 27

3.2 Bloch Sphere Representation of a Qubit...... 30

3.3 Geometric Representation of a Classical Probabilistic bit...... 31

3.4 Visualization of a quantum circuit...... 34

3.5 Reversible AND gate (Toffoli gate)...... 34

3.6 Geometric visualization of the Pauli X gate being applied to the state |0i on the

Bloch sphere...... 35

3.7 Geometric visualization of the rotation on the Bloch sphere created by applying the Hadamard gate to the state $|0\rangle$ to create the superposition state $\frac{|0\rangle + |1\rangle}{\sqrt{2}}$, and vice versa...... 37


3.8 Controlled-NOT gate...... 38

3.9 This circuit maps the state $|00\rangle$ to the state $\frac{1}{\sqrt{2}}\big(|00\rangle + |11\rangle\big)$...... 38

3.10 Circuit representation of Toffoli gate...... 40

3.11 Complexity spaces. The complexity class BQP is still not well understood. It has been shown that, with a cleverly designed quantum algorithm, some problems believed to be classically hard, such as prime factorization, can be solved efficiently on a quantum computer. However, no NP-complete problem has been solved efficiently on a quantum computer...... 46

3.12 Quantum circuit for Shor's 9-qubit error correction code. E is an error that can arbitrarily corrupt a single qubit [34]...... 52

4.1 A structure of a biological neuron. [46]...... 56

4.2 A perceptron model with two inputs...... 57

4.3 A sigmoid neuron node with $x \in \mathbb{R}^N$ and the activation function $f(x) = \frac{1}{1 + e^{-x}}$...... 59

4.4 A two hidden layers neural network mapping from f : R2 → R. Nearly in all

cases, we take all the activation functions σi to be the same. That is, there is

only one type of activation in the network. The reason for this will be clear in

section 4.4...... 62

4.5 Approximation of a continuous function with piece-wise constant functions... 68


4.6 Two hidden neurons to create a box...... 69

4.7 An illustration to show how a single hidden layer can be used to approximate a

continuous function f : D → R...... 70

4.8 An illustration to show how step functions can be built in the higher dimension

case and how to add them together to form something almost like a tower. The

gray arrows represent weights that are equal to zero...... 71

4.9 An illustration to show how a 3-dimensional tower might be built using a neural network...... 72

4.10 An illustration to show how an N+1-dimensional tower can be built using a neural network...... 73

5.1 Classical Neural Network as a map from RN to R ...... 80

5.2 Quantum circuit representing a quantum algorithm as a map...... 80

5.3 Visualization of QNN structure...... 81

5.4 Perceptron model...... 89

5.5 a) Half adder using NAND gates. b) Half subtractor using NAND gates..... 90

5.6 Half subtractor modeled by a network of ...... 90


6.1 Total root mean squared error for the training set as a function of epoch (pass

through the training set), for the 2-qubit system, with zero noise. Asymptotic

error is 1.6 × 10−3. For comparison: with piecewise constant functions a similar

level of error required 2000 epochs...... 102

6.2 Parameter function KA = KB as a function of time (data points), as trained at

zero noise for the entanglement indicator, and plotted with the Fourier fit (solid

line)...... 103

6.3 Parameter function A = B as a function of time (data points), as trained at

zero noise for the entanglement indicator, and plotted with the Fourier fit (solid

line)...... 103

6.4 Parameter function ζ as a function of time (data points), as trained at zero noise

for the entanglement indicator, and plotted with the Fourier fit (solid line)... 103

6.5 Total root mean squared error for the training set as a function of epoch (pass

through the training set), for the 2-qubit system, with a (magnitude) noise level

of 0.014 at each (of 317 total) timestep. Asymptotic error is 3.1 × 10−3, about

double what it was with no noise...... 106


6.6 Parameter function KA = KB as a function of time, as trained at 0.0089 am-

plitude noise at each of the 317 timesteps, for the entanglement indicator (data

points), and plotted with the Fourier fit (solid line). Note the change in scale

from Figure 6.2, because of the (much larger) spread of the noisy data: the

Fourier fit is actually almost the same on this graph...... 107

6.7 Parameter function A = B as a function of time, as trained at 0.0089 amplitude

noise at each of the 317 timesteps, for the entanglement indicator (data points),

and plotted with the Fourier fit (solid line)...... 107

6.8 Parameter function ζ as a function of time, as trained at 0.0089 noise at each

of the 317 timesteps, for the entanglement indicator (data points), and plotted

with the Fourier fit (solid line)...... 108

6.9 Fourier coefficients for the tunneling parameter functions K, as functions of noise

level...... 109

6.10 Fourier coefficients for the bias parameter functions ε, as functions of noise level. 109

6.11 Fourier coefficients for the coupling parameter function ζ, as functions of noise

level...... 110

6.12 Entanglement of the state P as a function of γ, as calculated by the QNN, and

compared with the entanglement of formation (marked “BW”) at zero noise

(blue) and at 0.0069 noise (orange). In each case the QNN was trained at zero

noise, but tested at the given level...... 111


6.13 Entanglement of the state P as a function of γ, as calculated by the QNN, and

compared with the entanglement of formation (marked “BW”) at zero noise

(blue) and at 0.0069 noise (orange). In each case the QNN was trained at a

noise level of 0.0089, and then tested at the given level...... 112

6.14 Entanglement of the state P as a function of γ, as calculated by the QNN, and

compared with the entanglement of formation (marked “BW”) at zero noise

(blue) and at 0.0069 noise (orange). In each case the QNN was trained at 0.013

noise, and then tested at the given level...... 112

6.15 Entanglement of the state M as a function of δ, as calculated by the QNN,

and compared with the entanglement of formation (marked “BW”) at zero noise

(blue) and at 0.0069 noise (orange). In each case the QNN was trained at zero

noise, but tested at the given level...... 113

6.16 Entanglement of the state M as a function of δ, as calculated by the QNN,

and compared with the entanglement of formation (marked “BW”) at zero noise

(blue) and at 0.0069 noise (orange). In each case the QNN was trained at a noise

level of 0.0089, and then tested at the given level...... 113

6.17 Entanglement of the state M as a function of δ, as calculated by the QNN,

and compared with the entanglement of formation (marked “BW”) at zero noise

(blue) and at 0.0069 noise (orange). In each case the QNN was trained at 0.013

noise, and then tested at the given level...... 114


6.18 Total root mean squared error for the training set as a function of epoch (pass

through the training set), for the 2-qubit system, with a phase noise level of 0.014

at each (of 317 total) timestep. Asymptotic error is 1.3 × 10−3, approximately

the same as with no noise...... 115

6.19 Parameter function KA = KB as a function of time, as trained at 0.0089 phase

noise at each of the 317 timesteps, for the entanglement indicator (data points),

and plotted with the Fourier fit (solid line)...... 115

6.20 Parameter function A = B as a function of time, as trained at 0.0089 phase

noise at each of the 317 timesteps, for the entanglement indicator (data points),

and plotted with the Fourier fit (solid line)...... 116

6.21 Parameter function ζ as a function of time, as trained at 0.0089 phase noise

at each of the 317 timesteps, for the entanglement indicator (data points), and

plotted with the Fourier fit (solid line)...... 116

6.22 Fourier coefficients for the tunneling parameter functions K, as functions of

decoherence level...... 117

6.23 Fourier coefficients for the bias parameter functions , as functions of decoherence

level...... 117

6.24 Fourier coefficients for the coupling parameter function ζ, as functions of deco-

herence level...... 118


6.25 Entanglement of the state P as a function of γ, as calculated by the QNN, and

compared with the entanglement of formation (marked “BW”) at zero phase

noise (blue) and at 0.0069 phase noise (orange). In each case the QNN was

trained at zero phase noise, but tested at the given level...... 119

6.26 Entanglement of the state P as a function of γ, as calculated by the QNN, and

compared with the entanglement of formation (marked “BW”) at zero phase

noise (blue) and at 0.0069 phase noise (orange). In each case the QNN was

trained at a phase noise level of 0.0089, and then tested at the given level.... 119

6.27 Entanglement of the state P as a function of γ, as calculated by the QNN, and

compared with the entanglement of formation (marked “BW”) at zero phase

noise (blue) and at 0.0069 phase noise (orange). In each case the QNN was

trained at 0.013 phase noise, and then tested at the given level...... 120

6.28 Entanglement of the state M as a function of δ, as calculated by the QNN, and

compared with the entanglement of formation (marked “BW”) at zero phase

noise (blue) and at 0.0069 phase noise (orange). In each case the QNN was

trained at zero noise, but tested at the given level...... 120

6.29 Entanglement of the state M as a function of δ, as calculated by the QNN, and

compared with the entanglement of formation (marked “BW”) at zero phase

noise (blue) and at 0.0069 phase noise (orange). In each case the QNN was

trained at a phase noise level of 0.0089, and then tested at the given level.... 121


6.30 Entanglement of the state M as a function of δ, as calculated by the QNN, and

compared with the entanglement of formation (marked “BW”) at zero phase

noise (blue) and at 0.0069 phase noise (orange). In each case the QNN was

trained at 0.013 phase noise, and then tested at the given level...... 121

6.31 Total root mean squared error for the training set as a function of epoch (pass

through the training set), for the 2-qubit system, with a complex noise level of

0.014 at each (of 317 total) timestep. Asymptotic error is 3.0 × 10−3, approxi-

mately the same as with only magnitude noise...... 122

6.32 Parameter function KA = KB as a function of time, as trained at 0.0089 complex

noise at each of the 317 timesteps, for the entanglement indicator (data points),

and plotted with the Fourier fit (solid line). Note the change in scale from

Figure 6.2, because of the (much larger) spread of the noisy data: the Fourier

fit is actually almost the same on this graph...... 122

6.33 Parameter function A = B as a function of time, as trained at 0.0089 complex

noise at each of the 317 timesteps, for the entanglement indicator (data points),

and plotted with the Fourier fit (solid line)...... 123

6.34 Parameter function ζ as a function of time, as trained at 0.0089 complex noise

at each of the 317 timesteps, for the entanglement indicator (data points), and

plotted with the Fourier fit (solid line)...... 123


6.35 Fourier coefficients for the tunneling parameter functions K, as functions of

complex noise level...... 124

6.36 Fourier coefficients for the bias parameter functions , as functions of complex

noise level...... 124

6.37 Fourier coefficients for the coupling parameter function ζ, as functions of complex

noise level...... 125

6.38 Entanglement of the state P as a function of γ, as calculated by the QNN, and

compared with the entanglement of formation (marked “BW”) at zero noise

(blue) and at 0.69% noise plus decoherence (orange). In each case the QNN was

trained at zero noise, but tested at the given level...... 125

6.39 Entanglement of the state P as a function of γ, as calculated by the QNN, and

compared with the entanglement of formation (marked “BW”) at zero noise

(blue) and at 0.0069 noise plus decoherence (orange). In each case the QNN was

trained at a noise level of 0.0089 complex noise, and then tested at the given level.125

6.40 Entanglement of the state P as a function of γ, as calculated by the QNN, and

compared with the entanglement of formation (marked “BW”) at zero complex

noise (blue) and at 0.0069 noise plus decoherence(orange). In each case the QNN

was trained at 0.013 level complex noise, and then tested at the given level... 126


6.41 Entanglement of the state M as a function of δ, as calculated by the QNN,

and compared with the entanglement of formation (marked “BW”) at zero noise

(blue) and at 0.0069 noise plus decoherence(orange). In each case the QNN was

trained at zero noise, but tested at the given level...... 126

6.42 Entanglement of the state M as a function of δ, as calculated by the QNN,

and compared with the entanglement of formation (marked “BW”) at zero noise

(blue) and at 0.0069 noise plus decoherence (orange). In each case the QNN was

trained at a complex noise level of 0.0089, and then tested at the given level.. 127

6.43 Entanglement of the state M as a function of δ, as calculated by the QNN,

and compared with the entanglement of formation (marked “BW”) at zero noise

(blue) and at 0.0069 noise plus decoherence (orange). In each case the QNN was

trained at 0.013 complex noise, and then tested at the given level...... 127

6.44 K parameter function trained with zero noise or decoherence, for the 3-qubit

system...... 131

6.45  parameter function trained with zero noise or decoherence, for the 3-qubit

system...... 131

6.46 ζ parameter function trained with no noise or decoherence, for the 3-qubit system.132

6.47 Parameter function K trained at 0.0089 amount of decoherence in a 3-qubit

system. The solid red curve represents the Fourier fit of the actual data points. 133


6.48 Parameter function  trained at 0.0089 decoherence in a 3-qubit system. The

solid red curve represents the Fourier fit of the actual data points...... 134

6.49 Parameter function ζ trained at 0.0089 decoherence in a 3-qubit system. The

solid red curve represents the Fourier fit of the actual data points...... 134

6.50 Parameter function K Fourier coefficients as a function of total noise for the

3-qubit system...... 135

6.51 Parameter  Fourier coefficients as a function of total noise for the 3-qubit system.135

6.52 Parameter function ζ Fourier coefficients as a function of total noise for the

3-qubit system...... 136

6.53 Testing at different noise levels for the pure state P for a three-qubit system, as

a function of γ...... 137

6.54 Testing at different noise levels for a mixed state M, for the three-qubit system,

as a function of γ...... 137

6.55 Parameter function K for a 4-qubit system trained at 0.027 level of noise with

its Fourier fit...... 138

6.56 Parameter function K for 5-qubit system trained at 0.027 level of noise with its

Fourier fit...... 139


6.57 R2 as a function of number of qubits, for the K parameter function, trained at

0.02741 total noise...... 140

6.58 R2 as a function of number of qubits, for the  parameter function, trained at

0.02741 total noise...... 140

6.59 R2 as a function of number of qubits, for the ζ parameter function, trained at

0.02741 total noise...... 140

6.60 An illustrative example of overfitting to the data (blue dots.) The red curve

represents a possible function that an overfitting network would have learned

comparing to the actual correct fit, the green line. Training error for the red

line would be small, but any subsequent testing would give large errors. Adding

noise to the data points would correct for overfitting...... 141

6.61 Training the parameter function ε with noise added to the Hamiltonian instead of density matrix for a 2-qubit system...... 143

6.62 Training the parameter function  with noise added to the Hamiltonian instead

of density matrix for a 3-qubit system...... 143

6.63 Training the parameter  with noise added to the Hamiltonian instead of density

matrix for a 4-qubit system...... 144

6.64 Training the parameter  with noise added to the Hamiltonian instead of density

matrix for a 5-qubit system...... 144


6.65 How to encode the letter Z using pairwise entanglement. The green double-

headed arrows represent pairwise entanglement...... 146

7.1 Visual representation of the linear separability of the OR gate versus the XOR

gate. The dotted line represents linear classifier...... 156

7.2 Example of training RMS as a function of epoch for various logic gates using the

(single layer) CVNN perceptron. The RVNN and NeuralWorks networks trained

similarly...... 159

7.3 Iris dataset cross-input classification scatterplot [148]...... 161

8.1 Actual IBM 4-qubit processor. The qubits are controlled by microwave pulses

tuned at certain frequencies [191]...... 171

8.2 An actual 128-qubit processor chip from DWave Systems [192]...... 172

8.3 Variance in entanglement witness for 100 iterations of each state measured at

shot counts ranging from 50 to 20,000 in 50 shot increments. As the shot count

increases, we see that the measurement variance quickly goes to zero...... 183

8.4 Q# entanglement witness values for the |Belli state with a 95% confidence inter-

val as a function of the shot count. The confidence interval (CI) width reaches

its minimum of ∼0.0015 after approximately 15,000 shots...... 184


8.5 Q# entanglement witness values for the | P i state with a 95% confidence interval

as a function of the shot count. The confidence interval (CI) width reaches its

minimum of ∼0.0015 after approximately 15,000 shots...... 185

8.6 Trained values for the tunneling amplitude K and for the bias , for each time

chunk, as the number of qubits in the system is increased. Both demonstrate

clear asymptotic behavior...... 186

8.7 Trained values for the qubit-qubit coupling ζ, for each time chunk, as the number

of qubits in the system is increased. The values show a clear trend, but do not

become asymptotic as quickly as with the other parameters...... 187

8.8 Trained bias, for 4 and 8 time chunks, as functions of time, for systems of

increasing numbers of qubits N...... 189

8.9 Trained bias functions for the continuum model of the entanglement witness,

for systems of increasing numbers of qubits N. Note how the graphs in Figure

8.8 are close approximations of the shapes and values of the function for each

number of qubits...... 190


10.1 Architecture of the QHNN using multi-measurement (3 different measures in

this case) scheme. The network will take the input as the ρ(t0)

then propagate it forward to ρ(tf ). At this point, different measurements will

be performed and these classical values will be stored and processed through a

classical artificial neural network. Although there is only one classical hidden

layer in this structure, this need not be the case. Adding extra hidden layers

might speed up the training...... 200

10.2 A possible architecture of the QHNN using discrete time sampling. Here the

classical part of the network has one hidden layer with N nodes. This need not

to be the case in general...... 203

A.1 A and B parameter function for XOR gate...... 220

A.2 KA and KB parameter function for XOR gate...... 220

A.3 ζ parameter function for XOR gate...... 221

A.4 A and B parameter function for XNOR gate...... 221

A.5 KA and KB parameter function for XNOR gate...... 222

A.6 ζ parameter function for XNOR gate...... 222

A.7 A and B parameter function for CNOT gate...... 223

A.8 A and B parameter function for CNOT gate...... 223


A.9 ζ parameter function for CNOT gate...... 224

A.10  and K parameter function for maximal entangled state circuit generation.

Note that in this case there’s symmetry in the parameters. That is, A = B and

KA = KB...... 225

A.11 ζ parameter function for maximal entangled state circuit generation...... 225

LIST OF TABLES

Table Page

3.1 Truth table of NAND and NOR logic gates in respective order...... 33

5.1 Truth table of XOR and XNOR logic gates in respective order...... 84

5.2 Inputs and Outputs (both desired and actual) for the QNN model of the XOR and

XNOR gate. The input is the initial, prepared, state of the two-qubit system, at

t = 0; the output, the square of the measured value of the qubit-qubit correlation

function at the final time...... 86

5.3 QNN simulation of the CNOT gate. Recall that CNOT maps: |00i → |00i,

|01i → |01i, |10i → |11i and |11i → |10i...... 87

5.4 QNN training result for the quantum circuit that would generate the maximal

entangled ...... 88

6.1 Training data for QNN entanglement witness...... 102

6.2 Curvefit coefficients for parameter functions K, ε, ζ, for QNN entanglement witness.... 104

6.3 Unnormalized training input states for 3-qubit entanglement calculation..... 130

6.4 Training theoretical targets and actual QNN outputs for 3-qubit entanglement

calculation with no noise or decoherence added to the system...... 131

6.5 Fourier coefficients for 3-qubit fitted parameter functions with no noise and de-

along with QNN training RMS...... 132

6.6 QNN output to encoded states of different letters, showing robustness to noise.. 147


7.1 Number of epochs needed for gate training to reach RMS error ≤1% for a (single

layer) perceptron using RVNN, CVNN and NeuralWorks (to nearest 1000 epochs)

implementations...... 157

7.2 Trained QNN parameter functions A, B, and ζ, for the Iris problem, in terms of

their Fourier coefficients for the function f(t) = a0 + a1 cos(ωt) + b1 sin(ωt), using

12, 30, and 75 training pairs...... 162

7.3 Trained QNN parameter functions KA, and KB, for the Iris problem, in terms

of their Fourier coefficients for the function f(t) = a0 + a1 cos(ωt) + b1 sin(ωt) +

a2 cos(2ωt) + b2 sin(2ωt) + a3 cos(3ωt) + b3 sin(3ωt), using 12, 30, and 75 training

pairs...... 163

7.4 Iris dataset training and testing average percentage RMS and identification accu-

racy for different training set sizes using similar RVNN and CVNN networks, and

compared with the QNN. The RVNN was run for 50,000 epochs; the CVNN, for

1000 epochs; and the QNN, for 100 epochs. See text for details of the architectures

used...... 164

7.5 Training on entanglement of pure states with zero offsets, for the NeuralWorks,

RVNN, CVNN, and QNN...... 168

7.6 Testing on entanglement of pure states. Each network was tested on the same

set of 25 randomly chosen pure states with zero offset...... 168


8.1 QNN entanglement witness trained for 200 epochs using piecewise constant parameter functions, and compared with calculated results using first chunked time propagators, then with the sequence of gates, and finally on the Q# simulator [170]. The training set of four [56] includes one completely entangled state $|\mathrm{Bell}\rangle = \frac{1}{\sqrt{2}}[|00\rangle + |11\rangle]$, one unentangled state $|\mathrm{Flat}\rangle = \frac{1}{2}[|00\rangle + |01\rangle + |10\rangle + |11\rangle]$, one classically correlated but unentangled state $|\mathrm{C}\rangle = \frac{1}{\sqrt{5}}[2|00\rangle + |01\rangle]$, and one partially entangled state $|\mathrm{P}\rangle = \frac{1}{\sqrt{3}}[|01\rangle + |10\rangle + |11\rangle]$. Errors for each method are shown in the final line...... 181

8.2 Trained parameter functions for the entanglement witness for the two-qubit sys-

tem, in MHz. Total time of evolution for the two time propagation methods was

1.58 ns...... 181

8.3 Trained parameter values at each time interval, for the pairwise entanglement

witness for the seven-qubit system, in MHz. By symmetry, each of the tunneling

functions K, each of the biases , and each of the pairwise couplings ζ is the

same. We take these values to be an approximation to the asymptotic limit of

the parameters for an N-qubit quantum system...... 188


8.4 Total RMS error for the training set in the 4, 8, and continuum models of the en-

tanglement witness for system sizes ranging from two- to seven-qubits. Training

followed the methods of [64, 176] with the additional condition that all param-

eters are fully symmetric. The continuum model shows the best accuracy, but

the discretized versions also trained well and are viable approximations of the

continuum model (not realizable in the current hardware.)...... 191

CHAPTER 1

INTRODUCTION

“Science offers the boldest metaphysics of the age. It’s a thoroughly human construct, driven by the faith that if we dream, press to discover, explain, and dream again, thereby plunging repeatedly into new terrain, the world will somehow come clearer and we will grasp the true strangeness of the universe. And the strangeness will all prove to be connected, and make sense” - Edward O. Wilson

1.1 Motivation

We are currently living in what most physicists would call a second quantum revolution. The first quantum revolution started in the beginning of the 20th century, when scientists first discovered quantum mechanics, which led to technologies such as semiconductors, GPS, MRI, lasers, and many more. All these technologies are now being used in our daily life. In 1965, Gordon Moore predicted that the number of transistors on a silicon chip would double every year. This is known as Moore's Law [5]. This means that transistors are getting smaller exponentially. Hence, soon enough this miniaturization will reach the quantum realm, which leads to the idea of quantum computation, first proposed by Paul Benioff in the early 1980s. Scientists started to wonder whether a quantum computer, a computer operating according to the laws of quantum mechanics, could be more powerful than classical computers. Richard Feynman then proposed that a quantum computer would be an ideal tool to solve problems in physics and chemistry, given that it is exponentially costly to simulate large quantum systems with classical computers [2]. In 1994, quantum computation started to gain significant attention throughout the whole scientific community when Peter Shor developed an algorithm, known as "Shor's algorithm," to factor integers in polynomial time on a quantum computer, which is an exponential speed-up compared to any known classical algorithm. This algorithm shows explicitly the advantages of quantum computers over classical ones. However, it took quite a while for advances in mathematics and science to yield technologies that allow us to control a single atom. These advances let us create hardware with quantum properties like superposition and entanglement and, finally, turn quantum computing from theory into reality. On May 11, 2011, D-Wave Systems made the first commercially available quantum computer, operating on a 128-SQUID-qubit chip using quantum annealing. Many more advances in quantum computing have been made since then. IBM has made several quantum computers using the circuit model available to the public on the cloud, including a 53-qubit processor. Google claimed quantum supremacy with their own 54-qubit processor in October 2019 [15]. All these recent technological breakthroughs, combined with the fact that quantum computers allow us to unravel mysteries not possible with classical computers, make the field of quantum computing currently one of the most pursued areas of research, from better hardware design to algorithm design.

1.2 Scope of this Dissertation

Although the quantum computer has become a reality, at least on a small scale, designing a good quantum algorithm is still a challenging task. This is a major bottleneck for quantum computations. Other than a handful of quantum algorithms (Shor's integer factorization, Grover's search, Jones polynomial approximation), there exist no efficient methods of programming a quantum computer compared to what we have for a classical computer.

Once a quantum computer is built, it also is subject to noise, similar to a classical computer. Additionally, a quantum computer also deals with another unique problem known as decoherence, which arises from unwanted interactions with the environment. Quantum mechanics is fragile, which is why, on a macroscopic scale, we rarely need to take quantum effects into account: unless the quantum processes are extremely well isolated, the quantum state will decohere and become essentially classical. (Indeed, the precise nature and implications of the ways in which decoherence leads to the loss of quantum coherence and the emergence of classicality is a fundamentally interesting problem, as it lies at the crux of the nature of quantum reality [35].) When this occurs in a quantum computer, the quantum nature of the computation is lost. So, if we are specifically interested in doing quantum computing, we need to guard against these kinds of effects as well. Therefore, we have to find a way to correct these errors in quantum computations. Classically, error correction is done by adding redundancy to the information in a message, then using that redundancy to detect and correct errors that occur in transmission or storage of data. However, because of the no-cloning theorem, which says we cannot make an exact copy of an arbitrary quantum state, simple redundancy will not work in a quantum context, and unwanted interactions with the environment can destroy coherence and thus the quantum nature of the computation. This seemed like an obstacle for quantum computation at first, but in 1995 Peter Shor proposed a scheme to do error correction on a quantum computer [34]. Many advancements have been made with quantum error correction since then; however, they all require extra quantum bits (qubits), which make large-scale computations impossible with existing quantum computers.
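As a minimal illustration of the classical redundancy idea above (an illustrative sketch of our own, not code from the dissertation), a 3-bit repetition code copies each logical bit three times and decodes by majority vote, so a single flipped bit per block is corrected; it is exactly this kind of copying that the no-cloning theorem forbids for arbitrary quantum states.

```python
from collections import Counter

def encode(bits):
    """3-bit repetition code: each logical bit is copied three times."""
    return [b for bit in bits for b in (bit, bit, bit)]

def decode(codeword):
    """Decode by majority vote over each block of three physical bits."""
    blocks = [codeword[i:i + 3] for i in range(0, len(codeword), 3)]
    return [Counter(block).most_common(1)[0][0] for block in blocks]

message = [1, 0, 1]
noisy = encode(message)
noisy[1] ^= 1          # flip one physical bit in the first block
print(decode(noisy))   # [1, 0, 1] -- the single error is corrected
```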

Elizabeth Behrman and James Steck at Wichita State University proposed an ingenious idea: instead of trying to explicitly design an algorithm to solve a specific task on a quantum computer, why don't we let the quantum computer program itself? This might sound strange at first, but it is a well-known technique in classical computation known as "machine learning," and it is used heavily for difficult computational tasks, for instance, trying to design an algorithm to differentiate handwritten digits between 0 and 9. Thus they combined the theories of classical machine learning, specifically artificial neural networks (ANN), a model of computation inspired by the structure of biological neural networks in brains, together with quantum computing to invent a sub-field of quantum computing called "Quantum Neural Networks" (QNN) [54]. The objective of this dissertation is to show that a quantum neural network indeed provides answers to various problems in quantum computations, ranging from algorithm design to noise and decoherence.

1.3 Literature Review

Machine learning algorithms have achieved remarkable successes, ranging from image classification to self-driving cars and playing complex games such as chess or Go. These advancements were possible because of the tremendous increase in computational power and the availability of vast amounts of data in the past two decades. There are many machine learning algorithms, but artificial neural networks are arguably the most well-known, since they have been shown to be effective and able to achieve marvelous performance on several tasks [47]. However, classical computers have their limits. Therefore, it was no surprise that people started to look to quantum computers for help with performing machine learning tasks, making use of the possible exponential computational speed-up they have to offer. This leads to mixed usage of the terms Quantum Machine Learning and Quantum Neural Network. In fact, the term is so loosely defined that it means different things to different people. The term quantum-assisted machine learning could be used for a significant number of these methods [17, 18, 19, 20, 21, 22], where one performs a subroutine of certain classical machine learning algorithms on a quantum computer using one of the well-known quantum algorithms that possess quantum speedup, to enhance the efficiency of the overall algorithm. Here the term quantum speedup refers to the advantage in runtime obtained by a quantum algorithm over classical algorithms. The reason is simple: a large class of classical machine learning techniques relies heavily on performing matrix operations on high-dimensional vector spaces, which is very computationally costly and a bottleneck for classical computers. However, there are quantum algorithms that allow you to perform certain matrix operations with an exponential speed-up on a quantum computer. For instance, classical learning methods like Gaussian Processes or Support Vector Machines (SVMs) require one to solve the system of linear equations $Ax = b$ with $A \in \mathbb{R}^{N \times N}$ and $x, b \in \mathbb{R}^N$. The current best classical algorithm requires a runtime of $O(N^{2.373})$, compared to a runtime of $O(\log N)$ on a quantum computer using the Harrow-Hassidim-Lloyd (HHL) algorithm [16], assuming that $A$ is sparse, that the classical data can be loaded into the quantum computer in logarithmic time, and that only certain features of the solution need to be extracted. Grover's search algorithm, which offers a quadratic speed-up, is also being used extensively for quantum machine learning.

In our work, the term Quantum Neural Network does not refer to performing a subroutine of a classical machine learning task on a quantum computer, but rather to a parameterized Hamiltonian with tunable coefficients that can be learned through training. It is pivotal that the distinction between the two concepts is clear. Recently, a few variational quantum algorithms arose that possess similar features to our Quantum Neural Networks, like the Quantum Approximate Optimization Algorithm (QAOA) [23] and the Variational Quantum Eigensolver (VQE) [24, 25]. Both of these algorithms rely on a tunable ansatz, which is essentially a Parametrized Quantum Circuit (PQC); and it has become standard to also refer to PQCs as Quantum Neural Networks [26, 27, 28]. These variational quantum algorithms are similar to each other, and they can be summarized at a high level as follows:

1. Select a parametrized quantum circuit, also known as an ansatz, to operate on an initial reference state
$$|\psi(\theta)\rangle = U_n(\theta_n) \cdots U_2(\theta_2) U_1(\theta_1)\, |\psi_{\mathrm{ref}}\rangle = U(\theta)\, |\psi_{\mathrm{ref}}\rangle \qquad (1.1)$$
where the initial reference state $|\psi_{\mathrm{ref}}\rangle$ is chosen depending on the problem.

2. Calculate the expectation value
$$\langle \psi(\theta) | H | \psi(\theta) \rangle \qquad (1.2)$$
where $H$ represents a Hermitian operator, and it may vary from problem to problem. For instance, the VQE algorithm, which is widely used in chemistry to calculate molecular energies, takes $H$ to be the molecular Hamiltonian, whereas QAOA takes $H$ to be $\sigma_z$, the usual computational basis.

3. Define and optimize the cost function by tuning $\theta$. This objective function will vary from problem to problem. For instance, in calculating molecular ground state energies using the VQE algorithm, one would simply let the objective function be
$$C = \langle \psi(\theta) | H | \psi(\theta) \rangle \qquad (1.3)$$
Minimizing $C$ then corresponds to finding the lowest eigenvalue of $H$; a minimal sketch of this loop is given below.
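To make these three steps concrete, here is a minimal, purely illustrative sketch of such a variational loop, simulated classically with NumPy and SciPy rather than run on quantum hardware. The single-qubit ansatz, the choice $H = \sigma_z$, and the helper names are assumptions made only for this example; they are not the specific constructions of [23, 24, 25].

```python
import numpy as np
from scipy.optimize import minimize

# Pauli-Z observable, playing the role of the Hermitian operator H in Eq. 1.2
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def ansatz(theta):
    """Step 1: toy single-qubit ansatz U(theta) = Rz(t2) Ry(t1) acting on |0> (hypothetical)."""
    t1, t2 = theta
    ry = np.array([[np.cos(t1 / 2), -np.sin(t1 / 2)],
                   [np.sin(t1 / 2),  np.cos(t1 / 2)]], dtype=complex)
    rz = np.array([[np.exp(-1j * t2 / 2), 0],
                   [0, np.exp(1j * t2 / 2)]], dtype=complex)
    psi_ref = np.array([1, 0], dtype=complex)   # reference state |0>
    return rz @ ry @ psi_ref

def cost(theta):
    """Steps 2-3: expectation value <psi(theta)|H|psi(theta)>, used directly as the cost C."""
    psi = ansatz(theta)
    return float(np.real(np.conj(psi) @ Z @ psi))

# A classical optimizer tunes theta; the minimum of C approaches the lowest eigenvalue of H.
result = minimize(cost, x0=np.array([0.1, 0.1]), method="COBYLA")
print(result.x, result.fun)   # cost converges to about -1, the ground-state eigenvalue of Z
```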

Our work is very similar in the sense that we also have a parametrized unitary with tunable coefficients, $U(\theta)$, taken to be
$$U(\theta) = \exp\!\left(\frac{-i\hat{H}t}{\hbar}\right) \qquad (1.4)$$
where the tunable coefficients $\theta$ are embedded within $\hat{H}$, the Hamiltonian; and we tune the unitary operator in Eq. 1.4 based on the cost function that we define, which also depends on expectation values and varies from problem to problem. Both the QAOA and VQE algorithms can be put into the structure of our QNN. The difference between QAOA or VQE and our QNN model lies in the ansatz structure (the parametrized unitary) along with the Hermitian operator with respect to which the expectation value is calculated.
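As a concrete illustration of Eq. 1.4, the sketch below builds such a propagator by matrix exponentiation for a small two-qubit Hamiltonian. The particular form used here, with a tunneling amplitude K, a bias eps, and a qubit-qubit coupling zeta, is an assumption chosen only to echo the parameter names that appear later in this dissertation; it is not the exact Hamiltonian used in our experiments.

```python
import numpy as np
from scipy.linalg import expm

# Single-qubit Pauli matrices
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def kron_chain(*ops):
    """Tensor product of a sequence of single-qubit operators."""
    out = np.array([[1]], dtype=complex)
    for op in ops:
        out = np.kron(out, op)
    return out

def two_qubit_hamiltonian(K, eps, zeta):
    """A hypothetical two-qubit Hamiltonian H(K, eps, zeta); only for illustrating Eq. 1.4."""
    return (K * (kron_chain(X, I2) + kron_chain(I2, X))
            + eps * (kron_chain(Z, I2) + kron_chain(I2, Z))
            + zeta * kron_chain(Z, Z))

def propagator(K, eps, zeta, t, hbar=1.0):
    """U(theta) = exp(-i H t / hbar), where theta = (K, eps, zeta) parametrizes H."""
    return expm(-1j * two_qubit_hamiltonian(K, eps, zeta) * t / hbar)

U = propagator(K=2.5, eps=0.5, zeta=1.0, t=0.2)
print(np.allclose(U.conj().T @ U, np.eye(4)))   # unitarity check: prints True
```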

For instance, when using the VQE algorithm to determine the electronic ground state energy of a certain molecule, the ansatz is often taken to have the form
$$U(\theta) = e^{T - T^{\dagger}} \qquad (1.5)$$
where $T$ is the excitation operator, and $e^{T - T^{\dagger}}$ is known as the Unitary Coupled Cluster ansatz in classical variational theory; the expectation value is then calculated with respect to the molecular electronic Hamiltonian. In our work, by contrast, we often take the expectation value with respect to $\sigma_z$, the Pauli-Z basis, since we are doing more of a supervised training to learn the observable corresponding to certain physical quantities, like entanglement, rather than knowing the observable and simply trying to minimize it to find its lowest eigenvalue. Despite these differences, our QNN model could easily be modified to give the same performance as both QAOA and VQE. Furthermore, it should be noted that our idea was first proposed in 1996 [54], whereas both QAOA and VQE were published in 2014.

1.4 Contributions

This work contributes to the field of quantum computation in the following ways: We show that we can use machine learning to help us in programming a quantum computer to perform specific tasks, both quantum mechanical and classical. Furthermore, we show that quantum computation programmed in this way possesses the property of being robust to noise and decoherence. This shows that a Quantum Neural Network can not only help with algorithm design, but can also provide us with an answer to the "noise and decoherence" problem in quantum computations. In a way, our technique provides a way to find a class of quantum circuits known as noise-resilient quantum circuits, which are quantum circuits that minimize the error in the computation without using error correction schemes. The robustness of our QNN to noise and decoherence increases as the size of the system increases. This provides a promising answer to the scalability problem, and an application to pattern storage. Furthermore, we show that the Quantum Neural Network is, in fact, much more powerful than existing classical neural networks, which means we can provide a "speed-up" in computational tasks. Then, we show that we can implement our Quantum Neural Network on existing quantum hardware, such as the Microsoft and IBM systems. Finally, we sketch a design for a quantum hybrid neural network for universal computation.

1.5 Structure of this Dissertation

The dissertation is structured as follows: Chapter 2, an overview of quantum mechanics, provides key concepts and postulates from quantum mechanics that are essential to understand the theory behind quantum computing. Chapter 3 will outline the theory of quantum computing. Here we will describe how a quantum computer is inherently different from a classical computer, the advantages a quantum computer possesses over classical computers, different quantum computation models, the noise and decoherence problems in quantum computers, and how one might fix these issues with error correcting schemes. Next, in chapter 4, we will present an overview of the theory of classical artificial neural networks. We will go over artificial neurons and their mathematical model, and how we can stack them together in layers to form a neural network. The universality of artificial neural networks will be discussed in detail as well. In chapter 5, we will present our model of Quantum Neural Networks (QNN). We will present the fundamental structure and derive the learning algorithm to train the QNN. Here we will also discuss the universality property of the QNN, similar to the discussion on the universality of classical artificial neural networks. In Chapter 6, we will show that the QNN is robust to noise and decoherence, first on a 2-qubit system and then generalized to higher-order systems. This result is promising and gives us hope that the QNN might offer at least a partial solution to the noise and decoherence problems in quantum computations. In Chapter 7, we will show several benchmarking results between various classical neural networks and the QNN on both classical and quantum tasks. The results give us hope that the QNN will, in general, offer a speed-up over classical neural networks. This is equivalent to showing that certain quantum algorithms have a speed-up advantage over classical algorithms. In chapter 8, we take our experimental results from chapter 6 to the next step, by implementing them on actual quantum simulators and hardware, specifically on the Microsoft topological quantum computer simulator and the IBM Q-Experience simulator and hardware. Last but not least, in chapter 10, we will go over some possible future work, one of which is an immediate extension of our current work. We propose a network architecture where both quantum and classical networks are combined to form what we call Quantum Hybrid Neural Networks (QHNN). We hope that this might increase the capability of the network and allow us to solve more complicated problems.

CHAPTER 2

OVERVIEW OF QUANTUM MECHANICS

At the beginning of the twentieth century, almost everyone was convinced that we had discovered and understood all physical realities by the laws of Newton and Maxwell. These laws made up what we now call classical physics. However, by the late 1930s, it had become apparent that classical physics faced serious problems trying to account for some observed results of certain experiments. For instance, for the black-body radiation problem classical physics predicted something absurd known as the 'ultraviolet catastrophe', involving infinite energies. As a result, a new mathematical framework for physics called quantum mechanics was developed, and new laws of physics (quantum physics) were formulated. Quantum physics deals with physical systems at the fundamental scale. Currently, quantum mechanics provides the most accurate and complete description of nature. It is also required for an understanding of quantum computation and information, since a quantum computer is, after all, just a computer that follows the laws of quantum mechanics, and quantum information is the result of reformulating information theory in the quantum mechanics framework. This chapter will serve as a brief summary of all the essential background knowledge of quantum mechanics needed to grasp the concepts of quantum computation. In section 2.1, we start by defining the concept of a quantum state, both physically and mathematically. Next, in section 2.2, we will introduce composite quantum systems and 'quantum entanglement', a crucial concept in some of the work we have done. Section 2.3 will discuss the time evolution of a quantum system and the Schrodinger equation, which is the fundamental equation behind quantum mechanics and quantum information theory, including the quantum neural network. Last but not least, section 2.4 will show how a quantum state might be measured, and the effects of applying a measurement on a quantum state.

2.1 Quantum States and Density Operators

The state of a physical system contains information about the system. A quantum state contains statistical information about the quantum system; it is essentially a probability density. Mathematically, it is a postulate that a quantum state, known as a state vector, is an element of a Hilbert space, $\mathcal{H}$ [2]. The exact Hilbert space depends on the actual system, and may be infinite-dimensional. In quantum computation and information theory, it is most often taken to be the space $\mathbb{C}^n$ for $n \in \mathbb{N}$. Furthermore, every quantum state is required to be normalized, that is, to have unit norm, in order to preserve probability.

A quantum state can either be pure or mixed. A pure state is often denoted as a normalized vector, known as a ket, $|\psi\rangle$, in the separable Hilbert space $\mathcal{H}$. It has a corresponding dual vector, denoted $\langle\psi|$, belonging to the dual space $\mathcal{H}^*$. If the quantum state can be described in finite dimension, then $|\psi\rangle$ is a column vector in $\mathbb{C}^n$, and its dual vector $\langle\psi|$ is a row vector. Since $|\psi\rangle$ is an element of a separable Hilbert space, it has a countable orthonormal basis [6]. Thus, if $\{|e_i\rangle\}$ is the orthonormal basis set for $\mathcal{H}$, then

$$|\psi\rangle = \sum_i c_i\, |e_i\rangle, \qquad c_i \in \mathbb{C} \qquad (2.1)$$

Since $|\psi\rangle$ is normalized, we must have $\sum_i |c_i|^2 = 1$. Physically, $|c_i|^2$ represents the probability that the state $|e_i\rangle$ will be observed (measured). Quantum measurement will be discussed in more detail later. Mathematically, $|\psi\rangle$ is a linear combination of the states $|e_i\rangle$. Physically, we say that the state $|\psi\rangle$ is in a quantum superposition of the states $|e_i\rangle$. This is one of the fundamental differences between classical states and quantum states. A quantum state exists in all of its possible states simultaneously, with each state having a statistical probability amplitude $c_i$ of being observed. The pure state $|\psi\rangle$ has an alternative representation known as the density operator representation, denoted $\rho_{\mathrm{pure}}$. It is formed by taking the outer product of the pure state $|\psi\rangle$ with its dual:

$$\rho_{\mathrm{pure}} = |\psi\rangle\langle\psi| \qquad (2.2)$$

They are also called density matrices, and the two terms density operators and density matrices are used interchangeably. The following properties hold for pure state density operators.

Properties of pure state density operators:

1. Unit trace with rank 1: $\mathrm{Tr}(\rho_{\mathrm{pure}}) = 1$ and $\mathrm{rank}(\rho_{\mathrm{pure}}) = 1$

2. Hermitian: $\rho_{\mathrm{pure}} = \rho_{\mathrm{pure}}^{\dagger}$

3. Positive semidefinite: $\langle\phi|\rho_{\mathrm{pure}}|\phi\rangle \geq 0$ for all non-zero $|\phi\rangle \in \mathbb{C}^n$

4. Idempotent: $\rho_{\mathrm{pure}}^2 = \rho_{\mathrm{pure}}$, which implies $\mathrm{Tr}(\rho_{\mathrm{pure}}^2) = 1$

where $\mathrm{Tr}(A)$ denotes the trace of $A$. Mathematically, properties 1-3 tell us that a density operator $\rho_{\mathrm{pure}}$ is nothing more than a bounded, positive, trace class operator with unit trace.

The representation of the state |ψi as a density operator has some advantages. One advantage is that it lets us represent another type of quantum states, known as mixed states

ρmixed. A mixed state is a linear combination of an ensemble of pure states, {pi, |ψii}, which can be written as

$$\rho_{mixed} = \sum_i p_i |\psi_i\rangle\langle\psi_i| = \sum_i p_i \rho_{pure}^i \tag{2.3}$$

Observe that all pure state density operators can also be written in the form of Equation 2.3, as a single term. Therefore, Equation 2.3 is the most general form of writing down a quantum state, since it encompasses both the pure and mixed states. Hence, a general quantum state is usually denoted as a density operator ρ, a bounded, positive trace class operator with unit

trace. Mixed state density operators ρ_mixed will be required to have the same properties as the density operators for pure states ρ_pure, with one exception.

Properties of mixed state density operators:

1. Unit trace: Tr(ρ) = 1 (unlike a pure state, a genuinely mixed state has rank greater than 1)

2. Hermitian: ρ = ρ†

3. Positive semidefinite: ⟨φ|ρ|φ⟩ ≥ 0 for all non-zero |φ⟩ ∈ ℂⁿ

4. Not idempotent: ρ² ≠ ρ and in fact Tr(ρ²) < 1

The last property can be seen from the fact that Tr(ρ) = Σ_i p_i = 1, which implies each p_i < 1 whenever more than one term appears in Equation 2.3. Therefore, Tr(ρ²) = Σ_i p_i² < 1. This condition is often used as the criterion to determine whether a quantum state is pure or mixed. Thus, generally we don't assign the tag of pure or mixed to ρ, but instead we can identify it by checking whether Tr(ρ²) = 1 (pure state) or Tr(ρ²) < 1 (mixed state).
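To make this criterion concrete, the following is a minimal numerical sketch (using Python with numpy; the example states are arbitrary choices, not taken from our work) that distinguishes a pure state from a mixed state by computing Tr(ρ²):

import numpy as np

# |0> and the superposition (|0> + |1>)/sqrt(2), written as column vectors
ket0 = np.array([[1.0], [0.0]])
plus = np.array([[1.0], [1.0]]) / np.sqrt(2)

rho_pure  = plus @ plus.conj().T                              # |psi><psi|
rho_mixed = 0.5 * ket0 @ ket0.conj().T + 0.5 * plus @ plus.conj().T

def purity(rho):
    # Tr(rho^2): equals 1 for a pure state, strictly less than 1 for a mixed state
    return np.real(np.trace(rho @ rho))

print(purity(rho_pure))    # 1.0            -> pure
print(purity(rho_mixed))   # approx 0.75    -> mixed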

2.2 Composite Systems and Entanglement

Previously, we focused on a single isolated system. However, this is not good enough to do

quantum computation because a quantum computer with non-interacting qubits (quantum

bits) is no better than a classical computer. Therefore, we need to have a thorough grasp

of how several qubits interact with each other. When a quantum system is made from two

or more distinct physical systems, then we have something called a ‘composite system’. The

Hilbert space of a composite physical system is the tensor product of the Hilbert spaces of

the component physical system. For instance, if A and B are the two components, then the

Hilbert space of the composite system, HAB, is

HAB = HA ⊗ HB

Suppose system A was prepared in the state |ψ⟩_A and system B was prepared in the state

|ψiB, then the composite system’s state, |ψiAB is the tensor product

$$|\psi\rangle_{AB} = |\psi\rangle_A \otimes |\psi\rangle_B \tag{2.4}$$

This can be extended to any arbitrary system size with induction, i.e. if there are n different systems, labeled 1 through n prepared in the states |ψi1 through |ψin, then the composite system’s state is

$$|\psi\rangle_{12\ldots n} = |\psi\rangle_1 \otimes |\psi\rangle_2 \otimes \cdots \otimes |\psi\rangle_n = |\psi_1 \psi_2 \cdots \psi_n\rangle \tag{2.5}$$

Thus, if H_A and H_B are finite discrete Hilbert spaces with bases |ψ_i⟩_A for i = 1, 2, ..., N and

|ψ_j⟩_B for j = 1, 2, ..., M respectively, then the basis of the composite system is made up from the NM states |ψ_i⟩_A ⊗ |ψ_j⟩_B. That is, if |φ⟩ ∈ H_A ⊗ H_B is a state in the composite system, then

$$|\phi\rangle = \sum_{ij} c_{ij}\, |\psi_i\rangle_A \otimes |\psi_j\rangle_B \tag{2.6}$$

The important key feature of a composite state is that it gives rise to a very strange and interesting type of quantum state, known as 'entangled' states. Erwin Schrodinger was the first person to introduce the term, and he named it 'Verschränkung', which was later translated into English as 'Entanglement'. Mathematically, if |ψ⟩ ∈ H_A ⊗ H_B represents the composite state of the systems A and B, then |ψ⟩ is entangled if we cannot write it as the tensor product of two pure quantum states, i.e.

$$|\psi\rangle \neq |a\rangle \otimes |b\rangle \tag{2.7}$$

where |a⟩ and |b⟩ belong to H_A and H_B respectively. Physically, this means that the

subsystems A and B do not have well-defined states. For instance, suppose H_A = ℂ² and

similarly H_B = ℂ², so that H_AB = ℂ² ⊗ ℂ² = ℂ⁴. Consider the state

$$|\psi\rangle = \frac{1}{\sqrt{2}}(1, 0, 0, 1)^T \tag{2.8}$$

which belongs to H_AB. However, observe that |ψ⟩ ≠ |a⟩ ⊗ |b⟩ for all |a⟩ ∈ H_A and |b⟩ ∈ H_B.

This state is known as a "Bell state". It's a maximally entangled state. Now if we instead

consider the state

$$|\psi\rangle = \frac{1}{2}(1, 1, 1, 1)^T \tag{2.9}$$

which also belongs to H_AB. However, here we have |ψ⟩ = |a⟩ ⊗ |b⟩, where

$$|a\rangle = \frac{1}{\sqrt{2}}(1, 1)^T \quad \text{and} \quad |b\rangle = \frac{1}{\sqrt{2}}(1, 1)^T$$

Thus, in this case |ψ⟩ is an unentangled state. The collection of states which are not entangled is known as 'separable states' or 'product states'. In summary, for finite dimension, a pure state |ψ⟩ ∈ ℂⁿ ⊗ ℂᵐ is called a 'separable' or 'product' state if it can be written in the form

|ψ⟩ = |a⟩ ⊗ |b⟩ for some |a⟩ ∈ ℂⁿ and |b⟩ ∈ ℂᵐ. Otherwise, |ψ⟩ is called 'entangled'. Now recall the well-known Schmidt decomposition theorem from linear algebra, which states:

Theorem 2.2.1. If H_A, H_B are Hilbert spaces of dimension n and m respectively, and

|ψ⟩ ∈ H_A ⊗ H_B, then there exist orthonormal bases {α_i} of H_A and {β_i} of H_B, and reals λ_i ≥ 0 with Σ_i λ_i² = 1, such that

$$|\psi\rangle = \sum_i^{r} \lambda_i\, \alpha_i \otimes \beta_i$$

where r = min(m, n).

Theorem 2.2.1 holds for infinite dimensional Hilbert spaces, and in fact even non-separable ones

[7]. The key feature Theorem 2.2.1 provides us is a quick check (a criterion) for entanglement in a bipartite system. More explicitly, a bipartite pure state |ψ⟩ is entangled if and only if more than one of the λ_i is non-zero, i.e. its Schmidt rank is greater than 1.
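As an illustrative sketch (assuming the standard computational-basis ordering of the state vector; the two examples are the states of Equations 2.8 and 2.9), the Schmidt rank of a bipartite pure state can be obtained numerically from the singular values of its coefficient matrix:

import numpy as np

def schmidt_rank(psi, dim_a, dim_b, tol=1e-10):
    # c_ij from |psi> = sum_ij c_ij |i>_A |j>_B; the Schmidt coefficients
    # are the singular values of the matrix c
    c = psi.reshape(dim_a, dim_b)
    svals = np.linalg.svd(c, compute_uv=False)
    return int(np.sum(svals > tol))

bell    = np.array([1, 0, 0, 1]) / np.sqrt(2)   # Equation 2.8
product = np.array([1, 1, 1, 1]) / 2.0          # Equation 2.9

print(schmidt_rank(bell, 2, 2))      # 2 -> entangled
print(schmidt_rank(product, 2, 2))   # 1 -> separable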

Unfortunately, Theorem 2.2.1 can’t be extended into tripartite systems (HABC = HA ⊗

HB ⊗ HC ). Therefore, there is no simple technique of determining whether a tripartite or

higher order system is entangled or not. Moreover, we have only talked about characterizing

entanglement for pure states, so far. It turns out that characterizing entanglement in mixed

quantum states is even harder! And no exact entanglement quantification for an arbitrary

quantum state exists, at least not yet. The general (pure or mixed states) definition of

entanglement is as follows:

Definition 2.2.2. A state ρ is called ‘separable’ if it can be written as a convex combination

of product states

$$\rho = \sum_i p_i\, \rho_A^i \otimes \rho_B^i \quad \text{where} \quad 0 \le p_i \le 1, \;\; \sum_i p_i = 1 \tag{2.10}$$

From Definition 2.2.2 we have that the collection of separable states forms a convex set,

S. From this point of view, we can define something called an “Entanglement Witness”, a

separability criterion [41].

Theorem 2.2.3. (Entanglement Witness Theorem). A state ρ_ent is entangled if and only if

there exists a Hermitian operator W ∈ B(H), the space of all operators acting on the Hilbert

space H of the quantum system, such that

$$\langle \rho_{ent}, W \rangle = Tr(W\rho_{ent}) < 0, \qquad \langle \rho, W \rangle = Tr(W\rho) \ge 0 \;\;\forall \rho \in S \tag{2.11}$$

W is known as an “Entanglement Witness” (EW).

Note that the space B(H) is itself a Hilbert space, usually called the Hilbert-Schmidt space.

Since we are only interested in finite dimensional quantum systems, we can regard it as the space of matrices. An Entanglement Witness W is guaranteed to exist because of the geometric version of the Hahn-Banach Theorem from functional analysis [6]. See Figure 2.1 for a graphical interpretation.

Theorem 2.2.4. (Geometric Hahn-Banach Theorem) Let S be a convex, compact set in a

finite dimensional Banach space. Let ρ be a point in the space with ρ ∉ S. Then there exists a hyperplane that separates ρ from S.

Figure 2.1: Graphical interpretation of the geometric Hahn-Banach Theorem. The witness W divides the Hilbert space into separable (S) and entangled subspaces. An optimal witness is as close as possible to the set S.

The hyperplane separating the set S and the point ρ_ent is determined by the normal vector W, which is selected outside the set S, similar to how a plane in Euclidean space can be defined by its normal vector, say V. Recall that the plane in Euclidean space separates vectors for which their inner product with V is negative from vectors for which their inner product with V is positive. See Figure 2.2 for a graphical illustration.

Figure 2.2: Graphical illustration of plane separation in Euclidean space

In Hilbert-Schmidt space, the inner product between A, B ∈ B(H) is defined as ⟨A, B⟩ =

Tr(A†B). Thus, at this point, it's clear why the Entanglement Witness Theorem has the

form it has, and it’s nothing more than an application of the geometric Hahn-Banach The-

orem. Moreover, we say that an EW is optimal, Wopt, if apart from Equation 2.11 there

exists a separable state ρ′ ∈ S such that ⟨ρ′, W_opt⟩ = 0. An optimal EW can detect more entangled states than any non-optimal ones. See Figure 2.3 for a geometrical visualization.

It turns out that finding the optimal EW is a very hard (NP-hard) problem [42]. Despite the many contributions that have been made to the theory of quantum entanglement, especially by the

Horodecki family [43], it remains a big mystery, both mathematically and physically.

Figure 2.3: Geometric illustration of an optimal entanglement witness

Although Entanglement Witness provides a separability criterion and is helpful in detecting

entanglement, it does not give quantitative information on how much the state is entangled.

This leads to the idea of constructing “Entanglement Measure”. As expected, this is a much

harder problem! A few prominent entanglement measures do exist: Those that are based on

convex roof construction, or those that are based on distance of the state to the convex set

S of all separable states. Examples of entanglement measures that are based on convex roof

construction are “Concurrence” and “Entanglement of Formation”, which were developed

by Bennett et al. [70] and Wootters [71]. Intuitively, Entanglement of Formation (EF)

quantifies how many Bell states are needed to prepare n copies of a particular state. For

example, if the EF of a state ρ is 4/9 then this means we need 4 Bell states to prepare 9 copies

of ρ. For a pure state of a bipartite composite system, EF is equivalent to the von Neumann

entropy of the reduced subsystem. For mixed states of a bipartite system, this is no longer true, since each reduced system can now have non-zero entropy on its own even if there is no entanglement. However, Entanglement of Formation is well defined for an arbitrary two-qubit state, as shown by the following two theorems. Their proofs can be found in [71].

Theorem 2.2.5. The entanglement of formation of a two-qubit state ρ is a function of the concurrence C,

$$E_F(\rho) = E_F(C(\rho)) = H\left(\frac{1 + \sqrt{1 - C^2}}{2}\right) \tag{2.12}$$

where H is the Shannon entropy function

$$H(x) = -x\log_2(x) - (1 - x)\log_2(1 - x) \tag{2.13}$$

Theorem 2.2.6. The Concurrence C of a two-qubit state ρ is

$$C(\rho) = \max\{0, \mu_1 - \mu_2 - \mu_3 - \mu_4\} \tag{2.14}$$

where the μ_i are the square roots of the eigenvalues of the matrix ρ·ρ̃ in decreasing order, and

ρ˜ is defined as

$$\tilde{\rho} = (\sigma_y \otimes \sigma_y)\, \rho^{*}\, (\sigma_y \otimes \sigma_y) \tag{2.15}$$

with $\sigma_y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}$ and ρ* the complex conjugate of ρ.

Theorem 2.2.5 and 2.2.6 provide us with explicit entanglement measures for an arbitrary

2-qubit state.
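As a worked sketch of Theorems 2.2.5 and 2.2.6 (assuming the usual |00⟩, |01⟩, |10⟩, |11⟩ basis ordering; the Bell-state example is an arbitrary choice), the concurrence and entanglement of formation of a two-qubit density matrix can be computed as:

import numpy as np

sy = np.array([[0, -1j], [1j, 0]])

def concurrence(rho):
    # rho_tilde = (sigma_y x sigma_y) rho* (sigma_y x sigma_y), Equation 2.15
    rho_tilde = np.kron(sy, sy) @ rho.conj() @ np.kron(sy, sy)
    # square roots of the eigenvalues of rho * rho_tilde, in decreasing order
    mu = np.sort(np.sqrt(np.abs(np.linalg.eigvals(rho @ rho_tilde))))[::-1]
    return max(0.0, mu[0] - mu[1] - mu[2] - mu[3])        # Equation 2.14

def entanglement_of_formation(rho):
    C = concurrence(rho)
    x = (1 + np.sqrt(1 - C**2)) / 2
    if x <= 0.0 or x >= 1.0:
        return 0.0                                         # H(0) = H(1) = 0
    return -x*np.log2(x) - (1 - x)*np.log2(1 - x)          # Equations 2.12-2.13

bell = np.array([1, 0, 0, 1]) / np.sqrt(2)
rho_bell = np.outer(bell, bell.conj())
print(concurrence(rho_bell))                 # 1.0 (maximally entangled)
print(entanglement_of_formation(rho_bell))   # 1.0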

Up to this point, we have only discussed entanglement in the abstract sense and not its applications in quantum computation and information. It turns out that entanglement plays a huge role in error correction, a concept we will discuss in the next chapter.

2.3 Time-Evolution of a Closed System

The state of a system is time dependent, and so a quantum state vector |ψ⟩ is a function of time, |ψ(t)⟩. It is a postulate that |ψ⟩ evolves in time linearly according to the Schrodinger equation [82],

$$i\hbar \frac{d|\psi(t)\rangle}{dt} = H(t)|\psi(t)\rangle \tag{2.16}$$

where ℏ is the reduced Planck constant, and H(t) is known as the Hamiltonian of the system, which represents the total energy (an observable/measurable quantity) of the system. Another postulate of quantum mechanics is that every measurable quantity has an associated observable operator, which is a self-adjoint (Hermitian) operator mapping a Hilbert space into itself. Thus, the Hamiltonian H(t) in Equation 2.16 must be a Hermitian operator. In general, if H(t) is well defined, then we have a complete description and understanding of the quantum system, at least mathematically. Furthermore, from Equation 2.16 we can see that the transformation that takes a quantum state |ψ(t)⟩ from t₁ to t₂ must be unitary, i.e. if

|ψ(t₂)⟩ = U|ψ(t₁)⟩, then U is a unitary operator. This makes sense because of conservation of probability. A quantum state must remain normalized; hence, if a quantum state is written

in terms of Equation 2.1, then Σ_i |c_i|² = 1. The only linear operators that preserve such norms of vectors are unitary operators. This sheds some light on the reason that a closed quantum system must satisfy the Schrodinger equation as it evolves in time.

If we suppose that H(t) is constant in time in Equation 2.16, then the solution to the

Schrodinger equation for fixed times t0 and t is

$$|\psi(t)\rangle = e^{-iH(t - t_0)/\hbar}|\psi(t_0)\rangle = U(t, t_0)|\psi(t_0)\rangle \tag{2.17}$$

Since H is a Hermitian operator, we can see from Equation 2.17 that U is a unitary operator, consistent with our discussion about unitary transformations. For a time dependent Hamilto-

nian, H(t), the unitary operator U(t, t₀) in Equation 2.17 can be written as:

Case I: if [H(ti),H(tj)] = 0 for all choices of ti and tj

$$U(t, t_0) = \exp\left(\frac{-i}{\hbar}\int_{t_0}^{t} H(t')\,dt'\right) \tag{2.18}$$

Case II: if [H(t_i), H(t_j)] ≠ 0 for any choices of t_i and t_j

$$U(t, t_0) = I + \sum_{n=1}^{\infty}\left(\frac{-i}{\hbar}\right)^{n}\int_{t_0}^{t} dt_1 \int_{t_0}^{t_1} dt_2 \cdots \int_{t_0}^{t_{n-1}} dt_n\, H(t_1)H(t_2)\cdots H(t_n) \tag{2.19}$$

It should be noted that Equation 2.19 is often seen as

$$U(t, t_0) = T\exp\left(\frac{-i}{\hbar}\int_{t_0}^{t} H(t')\,dt'\right) \tag{2.20}$$

where T denotes the time-ordering operator. Detailed derivations of Equations 2.18 to 2.20

can be found in [1].

Equation 2.16 can also be formulated in terms of density operators, ρ(t) = |ψ(t)⟩⟨ψ(t)|, as

$$\frac{d}{dt}\rho(t) = \frac{d}{dt}\big(|\psi(t)\rangle\langle\psi(t)|\big) = \frac{1}{i\hbar}H(t)|\psi(t)\rangle\langle\psi(t)| - \frac{1}{i\hbar}|\psi(t)\rangle\langle\psi(t)|H(t) = \frac{1}{i\hbar}[H(t), \rho(t)]$$

Thus, a more general form of Equation 2.16 can be written as

$$\frac{d\rho}{dt} = \frac{1}{i\hbar}[H(t), \rho(t)] \tag{2.21}$$

Equation 2.21 is usually known as the Liouville-von Neumann equation. The solution is

$$\rho(t) = U\rho(t_0)U^{\dagger} \tag{2.22}$$

where U is the unitary operator described in Equations 2.18 and 2.19. Moreover, if the Hamiltonian, H, is time independent then Equation 2.22 can be written explicitly as

$$\rho(t) = \exp\left(\frac{-iHt}{\hbar}\right)\rho(t_0)\exp\left(\frac{iHt}{\hbar}\right) \tag{2.23}$$

Equation 2.21 is often rewritten in shorthand in terms of the Liouville super-operator

L̂ = (1/ℏ)[H, · ] as

$$\frac{d\rho}{dt} = -i\hat{L}\rho \tag{2.24}$$

From here we can see that the time evolution of a quantum state, ρ, can also be written as

$$\rho(t) = e^{-i\hat{L}t}\rho(t_0) \tag{2.25}$$
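For a time-independent Hamiltonian, Equations 2.17 and 2.22 can be illustrated with a short numerical sketch (a minimal example, in units chosen so that ℏ = 1; the Hamiltonian below is an arbitrary Hermitian matrix, not one from our work):

import numpy as np
from scipy.linalg import expm

hbar = 1.0
H = np.array([[1.0, 0.5], [0.5, -1.0]])      # arbitrary example Hermitian Hamiltonian

def evolve(rho0, t):
    U = expm(-1j * H * t / hbar)             # U(t) = exp(-iHt/hbar), Equation 2.17
    return U @ rho0 @ U.conj().T             # rho(t) = U rho(0) U^dagger, Equation 2.22

ket0 = np.array([[1.0], [0.0]])
rho0 = ket0 @ ket0.conj().T
rho_t = evolve(rho0, t=2.0)
print(np.trace(rho_t).real)                  # remains 1: unitary evolution preserves the trace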

2.4 Quantum Measurements

Until this point, we have been focused on discussing the state of a quantum system, and how it evolves in a closed system. Ultimately, at some point it will be of interest to measure some properties of a system, and so we must allow the system to interact with a measurement apparatus (a macroscopic piece of equipment that behaves according to the laws of classical physics) of an outside observer. See Figure 2.4. The system is then no longer closed, and the evolution postulate of quantum mechanics no longer holds. This leads us to the Measurements Postulate, which provides a description of the effects of measurements on a quantum system. The following postulate is taken from [2]. It should be noted that the following postulate is the most general postulate about quantum measurement.

Measurements Postulate: Quantum measurements are described by a collection {M_i} of measurement operators, also known as Kraus operators. These are operators acting on the state space of the system being measured. The index i refers to the measurement outcomes that may occur in the experiment. If the state of the quantum system is ρ immediately before the measurement, then the probability that the result i occurs is given by

$$p(i) = Tr(M_i^{\dagger} M_i \rho) \tag{2.26}$$

and the state of the system after the measurement is

$$\frac{M_i \rho M_i^{\dagger}}{Tr(M_i \rho M_i^{\dagger})} \tag{2.27}$$

Moreover, the measurement operators satisfy the completeness equation

$$\sum_i M_i^{\dagger} M_i = I \tag{2.28}$$

Figure 2.4: A quantum system interacting with a measuring apparatus in the presence of the surrounding environment.

One take-away from the above measurement postulate is the collapse of the quantum

state after a measurement is performed. That is, the state is no longer quantum mechanical

but rather classical. This is what makes the so-called "measurement problem" one of the

most difficult and controversial problems in quantum mechanics. Moreover, one can argue to consider the combined systems, the measurement apparatus and the quantum state together, as a larger closed quantum system. However, this would take us to another very controversial issue. Therefore, throughout this dissertation, we will go with the conventional practice and assume that measurement is not part of the closed quantum system. Performing a measurement on a quantum system means that the evolution postulate no longer holds. That is, a quantum measurement is not a unitary transformation.

Furthermore, if one restricts the condition that all of the M_i are orthogonal projectors, that is, the M_i are Hermitian and M_iM_j = δ_{ij}M_i, then we have the commonly used and well known measurements, known as 'projective measurements' or 'von Neumann measurements'. For clarity, let us state the measurement postulate in terms of projective measurements below:

Projective Measurements Postulate: A projective measurement is described by an observable, M, a Hermitian operator on the state space of the system being observed. M has a spectral decomposition

$$M = \sum_i \lambda_i P_i \tag{2.29}$$

where P_i is the projector onto the eigenspace of M with the eigenvalue λ_i. Thus, an observable, M, is the weighted sum of projectors. The possible outcomes of the measurement correspond to the eigenvalues, λ_i. Upon measuring the state ρ, the probability of getting result λ_i is given by

$$P(\lambda_i) = Tr(P_i \rho) \tag{2.30}$$

Equation 2.30 is also called Born’s rule. The system will be in the state

$$\frac{P_i \rho P_i}{P(\lambda_i)} \tag{2.31}$$

immediately after measurement. Thus, measurement is an irreversible process and causes loss of information.

Projective measurement is the type of measurement taught in standard quantum mechanics courses, not the general measurement stated in the first measurement postulate. The reason for this is twofold: the first reason is that projective measurements coupled with unitary dynamics are sufficient to implement a general measurement, see [2] pages 94-95 for proof; the second reason is that the calculation of the expected value for a projective measurement is much easier. It should be noted that in all quantum computing models, the measurements being performed are actually projective measurements; in particular, the Pauli measurements generated from the Pauli matrices, σ_x, σ_y, σ_z, and σ_I. These matrices will be discussed in the next chapter. In our work, we usually pick the Pauli matrix σ_z as our measure, M.

In general, the expectation value of M in the state ρ, denoted as ⟨M⟩_ρ, can simply be written as

$$\langle M \rangle_{\rho} = \sum_i p_i \langle\psi_i|M|\psi_i\rangle = Tr(\rho M) \tag{2.32}$$

This is important since in our work, especially entanglement witness calculation, we map our input states to their expectation values.
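A minimal sketch of Born's rule (Equation 2.30) and the expectation-value formula (Equation 2.32) for a single-qubit σ_z measurement (the example state is an arbitrary choice, not one from our work):

import numpy as np

Z = np.array([[1.0, 0.0], [0.0, -1.0]])
P_plus1  = np.array([[1.0, 0.0], [0.0, 0.0]])   # projector onto |0>, eigenvalue +1
P_minus1 = np.array([[0.0, 0.0], [0.0, 1.0]])   # projector onto |1>, eigenvalue -1

psi = np.array([[np.sqrt(0.3)], [np.sqrt(0.7)]])
rho = psi @ psi.conj().T

print(np.trace(P_plus1 @ rho).real)    # Born's rule: P(+1) = 0.3
print(np.trace(P_minus1 @ rho).real)   # P(-1) = 0.7
print(np.trace(rho @ Z).real)          # <Z>_rho = 0.3 - 0.7 = -0.4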

CHAPTER 3

QUANTUM COMPUTING

In this chapter, we will explore the theory of quantum computing. We start by introduc- ing the concept of quantum bit (qubit), a fundamental unit for quantum computations, in

Section 3.1. We will show how qubits are fundamentally different from classical bits and even probabilistic bits. In section 3.2, we will introduce a quantum computation model known as the ‘gate model’ which resembles the classical digital computer model. Here we will give an overview of different quantum logic gates and the concept of a quantum circuit. Then in

Section 3.3, we will discuss a set of gates which is universal for quantum computation, and one of the most important theorems in quantum information, the ‘Solovay-Kitaev’ theorem.

In Section 3.4, we will discuss different quantum algorithms, their advantages over classi- cal algorithms, and the difficulty of designing one. Section 3.5 provides another quantum computation model known as the Adiabatic Quantum Computation model. Here we will show why this model is very good at solving optimization problems. Last but not least, in

Section 3.6, we will point out the most problematic issues in quantum computations, noise and decoherence. This last section shows one important contribution our research work has provided to the field of quantum computing.

3.1 Qubits

A building block of classical computational devices is a bit, a two-state system, which can be 0 or 1. This two-state system may be the voltage level on a wire with a threshold function. For instance, this threshold function could output 0 if the voltage level is less than 4.5mV, and output 1 if the voltage level is greater than or equal to 4.5mV. These threshold functions can be physically implemented by transistors. They act as an on (0) and off (1) switch. All classical computations can be built from manipulating these on and off switches (bits). See Figure 3.1 for a geometric picture.

Figure 3.1: Geometric Representation of a Classical Bit

Similarly, the building block for quantum computational devices is the quantum mechanical two-level system, which is the simplest non-trivial quantum system. One example of such a system is the spin of an electron, where you can take spin up to be the state |0⟩ and

spin down to be the state |1⟩. In fact, any spin-1/2 particle system can be used to model a qubit. Another two-level quantum system could be the polarization of a photon, where one can take horizontal polarization to be the state |0⟩ and vertical polarization to be the state |1⟩. Any two-level quantum system can form a qubit, and the states |0⟩ and |1⟩ form a basis, called the computational basis. It is possible to have a multi-level system model a qubit, as long as there are two states that can be separated from the rest of the states in the system. For instance, one can consider the energy of an electron in an atom. Theoretically

there are infinitely many possible energy levels; however, since energy is quantized, we can pick the lowest energy state (ground state) to represent |0⟩ and the first excited state to represent |1⟩ and essentially ignore the subspace spanned by all the energies after the first excited state.

This multi-level system has now become a two-level system for all practical purposes and it can be described by a 2-dimensional vector in the space spanned by the ground state and

first excited state energy levels. Therefore, it can be used to model a qubit. The general state of a two-level quantum system can be described by a vector in a 2-dimensional Hilbert space, C2. Therefore, if we take {|0i, |1i} as the basis (computational basis) then a general single qubit state has the form

$$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle, \qquad \alpha, \beta \in \mathbb{C} \tag{3.1}$$

where

$$|0\rangle = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \qquad |1\rangle = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \tag{3.2}$$

Due to the normalization constraint of quantum states (conservation of probability),

|α|2 + |β|2 = 1 must hold. The coefficients α and β are often called the amplitudes of the ba- sis states |0i and |1i, respectively. Equation 3.1 shows a key difference between classical bit and a qubit: A qubit can be in linear superposition between the states |0i and |1i, whereas a classical bit is deterministic, it’s either in the state 0 or 1. This linear superposition is part of the exclusive world of the qubit and it’s not available to an outside observer. For us to know the actual qubit’s state, we must perform a quantum measurement. If a projective measurement is performed in the standard computational basis then it will collapse the qubit state, |ψi, into either the state |0i or |1i with probabilities |α|2 and |β|2, respectively.

Note that the overall phase does not matter in a quantum system, that is, the state |ψ⟩ and

e^{iθ}|ψ⟩ are essentially the same (indistinguishable from one another). With this in mind and the normalization constraint |α|² + |β|² = 1, we can rewrite Equation 3.1 as

$$|\psi\rangle = \cos\frac{\theta}{2}|0\rangle + e^{i\phi}\sin\frac{\theta}{2}|1\rangle \tag{3.3}$$

The reason for the θ/2 will be clear when we represent |ψ⟩ in density matrix form:

$$\rho = |\psi\rangle\langle\psi| = \begin{pmatrix} \cos^2\frac{\theta}{2} & e^{-i\phi}\cos\frac{\theta}{2}\sin\frac{\theta}{2} \\ e^{i\phi}\cos\frac{\theta}{2}\sin\frac{\theta}{2} & \sin^2\frac{\theta}{2} \end{pmatrix} \tag{3.4}$$

$$= \frac{1}{2}\begin{pmatrix} 1 + \cos\theta & \cos\phi\sin\theta - i\sin\phi\sin\theta \\ \cos\phi\sin\theta + i\sin\phi\sin\theta & 1 - \cos\theta \end{pmatrix} \tag{3.5}$$

$$= \frac{1}{2}\big(I + \cos\phi\sin\theta\, X + \sin\phi\sin\theta\, Y + \cos\theta\, Z\big) \tag{3.6}$$

$$= \frac{1}{2}\big(I + \vec{r}\cdot\vec{\sigma}\big) \tag{3.7}$$

where σ⃗ is the 3-element 'vector' of Pauli matrices (X, Y, Z):

$$\sigma_x = X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \qquad \sigma_y = Y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix} \qquad \sigma_z = Z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} \tag{3.8}$$

From Equation 3.6 we see that each θ and φ define a point on the unit three dimensional sphere, Figure 3.2. This sphere, which represents the state of a qubit geometrically, is known as the Bloch sphere. It provides us with a nice geometric intuition about operations on a single qubit on a quantum computer, as we will see in the next section. One can also express points on the Bloch sphere in terms of the unit Bloch vector, r⃗, in Cartesian coordinates as

$$\vec{r} = (x, y, z) = (\sin\theta\cos\phi, \sin\theta\sin\phi, \cos\theta) \tag{3.9}$$
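As a small numerical check of Equations 3.7 and 3.9 (the example state is an arbitrary choice), the Bloch vector of a single-qubit density matrix can be read off as r_k = Tr(ρσ_k):

import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def bloch_vector(rho):
    # from rho = (I + r.sigma)/2 it follows that r_k = Tr(rho sigma_k)
    return np.real([np.trace(rho @ X), np.trace(rho @ Y), np.trace(rho @ Z)])

plus = np.array([[1.0], [1.0]]) / np.sqrt(2)          # (|0> + |1>)/sqrt(2)
print(bloch_vector(plus @ plus.conj().T))             # [1. 0. 0.] -> +x axis of the Bloch sphere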

Figure 3.2: Bloch Sphere Representation of a Qubit

One may wonder how a qubit is different from a classical probabilistic bit. A classical probabilistic bit can be written as a vector

$$\begin{pmatrix} a \\ b \end{pmatrix} \tag{3.10}$$

where a represents the probability of the bit's being 0, b the probability of the bit's being

1, and a + b = 1. In this description, the distinct difference between a classical bit and a qubit is that the amplitude coefficients in a qubit are complex numbers instead of real numbers, which leads to a spherical geometric description instead of a linear description for the classical probabilistic bit. See Figure 3.3. This also means that there can be interference effects in composite systems, as we will see.

Now that we understand the differences between a bit, a classical probabilistic bit, and a qubit, let’s try to understand the state of multiple qubits. If Equation 3.1 represents a general state of a qubit in the computation basis, then by Equation 2.4, the general state of a two qubits system in the computation basis is

|ψi = a|00i + b|01i + c|10i + d|11i (3.11)

where a, b, c, d ∈ C and |a|2, |b|2, |c|2, |d|2 represent the probabilities of measuring the state as being |00i, |01i, |10i, |11i, respectively. Hence, |a|2 + |b|2 + |c|2 + |d|2 = 1. Furthermore, the

Figure 3.3: Geometric Representation of a Classical Probabilistic Bit

state |ψi of two qubits can no longer be represented in terms of a sphere. Another feature a two qubit state |ψi may possess is entanglement as described in the previous section. This feature is not available to a classical two probabilistic bits system, where the state of the composite system is always a tensor product of the component systems. This is one major difference and advantage of a quantum computer to a classical computer. This can be made even more clear if we consider the three classical probabilistic bits and three qubit systems.

In the three classical probabilistic bits system, we can write the second bit independently of the first bit, and the third bit independently of the previous two bits. That is, we can write them independently,

$$\begin{pmatrix} a \\ b \end{pmatrix} \qquad \begin{pmatrix} c \\ d \end{pmatrix} \qquad \begin{pmatrix} e \\ f \end{pmatrix} \tag{3.12}$$

whereas a three qubit state can't be written in terms of each qubit independently because of entanglement, and therefore, it must be written as

|ψi = a|000i + b|001i + c|010i + d|011i + e|100i + f|101i + g|110i + h|111i (3.13)

Thus the space of qubits is inherently much larger than the space for bits or for classical probabilistic bits. This gives us a glimpse into the power of quantum computation.

31 We have described what it means to make a measurement on a single qubit, and what the coefficients in Equation 3.1 represent. For multi-qubit system, measurement can be done in similar manner. For instance, if one makes a projective measurement in the computational basis on the first qubit of Equation 3.11, then the probability of its being in the state |0i is

|a|2 + |b|2 and of being the state |1i, |c|2 + |d|2. Similar results can be obtained if we perform the same measurement on the second qubit. In quantum computation, it is possible to do this, that is, to perform measurement on a subset of all the qubits, the one of interest, and not the entire set. In this case the quantum nature of the rest of the system could possibly be retained after the measurement.

There exist many physical implementation models of qubits, ranging from superconducting qubits [29], ion traps [31], topological qubits [32], to photonic qubits [31]. Each has its advantages and disadvantages. Since a qubit is a quantum state, it evolves according to Equation 2.16. In general, the time independent Hamiltonian of a single qubit can be written as

$$H = \frac{1}{2}\begin{pmatrix} \epsilon & K \\ K & -\epsilon \end{pmatrix} = \frac{1}{2}\big(K\sigma_x + \epsilon\sigma_z\big) \tag{3.14}$$

where K is the tunneling parameter and ε is the potential energy offset or bias. The most general form of the time-dependent Hamiltonian for an N-qubit system is a 2^N × 2^N Hermitian matrix.

3.2 Quantum Gates and Circuits Model

Classical computers use electrical wires and logic gates to perform their computation.

These logic gates are built from the electrical current and transistors. All classical algorithms are built from manipulating these logic gates to get desired outputs. For better visualization, truth tables of the NAND and NOR gate can be found in Table 3.1.

Table 3.1: Truth table of NAND and NOR logic gates in respective order

NAND                               NOR
Input A   Input B   Output         Input A   Input B   Output
0         0         1              0         0         1
0         1         1              0         1         0
1         0         1              1         0         0
1         1         0              1         1         0

Since all classical computations are done by manipulation of logic gates, one might suggest that quantum computations can be done with quantum logic gates. This leads to the quantum circuit model, which consists of logical qubits and quantum gates acting on the qubits. We can think of qubits being carried by 'wires' from left to right and a quantum gate as a unitary operator in the circuit diagram. See Figure 3.4 for a visualization of a quantum circuit. In Figure 3.4, U1, U2, U3, U4 represent quantum gate operations on the qubits. U1 is a single qubit operation, hence it belongs to U(2), the 2 by 2 unitary matrices. U2, U3 and U4 are multi-qubit gate operations, with U2, U4 ∈ U(4) and U3 ∈ U(8). The reason for quantum gates being unitary operators is clear by the evolution postulate of quantum mechanics,

Equation 2.21. Thus all quantum gates must be reversible, and therefore all quantum com- putations are reversible. At the end of the circuit, we can perform a measurement on the qubits of interest. This is represented by the boxes marked M in Figure 3.4. At first, it seems like the unitary and hence reversible condition that is imposed on a quantum computer is an issue[132]. The fear was that we might not be able to do most classical computational tasks

Figure 3.4: Visualization of a quantum circuit

since those tasks are non-reversible. However, in 1973 Charlie Bennett showed that one can make any non-reversible computation reversible with a small overhead [3]. Therefore, any classical computation task can be executed on a quantum computer. In fact, the classical reversible computer has become a topic of interest recently because reversible computers are more energy efficient and faster [4]. As an example of a non-reversible operation turned into a reversible operation, consider the AND gate, which outputs 1 if and only if both input bits are 1. Hence, the AND logic gate is a non-reversible operation. However, this can be made reversible by adding an extra input bit. Consider the circuit diagram in Figure 3.5, with the top bit (A) being the control bit for the swapping operation between B and 0. From

Figure 3.5: Reversible AND gate (Fredkin gate)

the circuit diagram in Figure 3.5, we see that if A = 0 then the output is 0, and if A = 1 then the output is B. This is exactly the AND function (the output equals A AND B), but now it is being done in a reversible manner.

Any 2×2 unitary operator acting on a qubit is called a 1-qubit quantum gate. Let's take a look at some of the crucial 1-qubit gates and the geometric interpretation of applying such a gate to a certain 1-qubit state on the Bloch sphere. The classical NOT gate has a quantum analog known as the 'Pauli X gate', or X or σ_x for short:

$$\sigma_x = X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \tag{3.15}$$

From Equation 3.8, we can see that if we take the computational basis as |0⟩ = (1, 0)^T and

|1⟩ = (0, 1)^T, then when X is applied to the state |0⟩, the result is the state |1⟩:

$$X|0\rangle = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix} = |1\rangle \tag{3.16}$$

The geometric interpretation of the X gate is a 180◦ rotation of |ψi around the x − axis.

See Figure 3.6. The other Pauli gates are σ_y = Y, σ_z = Z, and σ_I = I:

$$Y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix} \qquad Z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} \qquad I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \tag{3.17}$$

Figure 3.6: Geometric visualization of the Pauli X gate being applied to the state |0i on the Bloch sphere

The Y, Z gates represent 180° rotations around the y- and z-axes, respectively. Note

that the set {X, Y, Z, I} of Pauli gates spans the vector space formed by all 1-qubit operators.

Therefore, any 1-qubit unitary operator (gate) can be expressed as a linear combination of the Pauli gates.

Some other important 1-qubit gates are the Hadamard (H), Phase (S), and π/8 (T) gates. These gates are fundamental in quantum computation as they make up the universal set, a set of gates for universal computation, as we will see in Section 3.3:

$$H = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \qquad S = \begin{pmatrix} 1 & 0 \\ 0 & i \end{pmatrix} \qquad T = \begin{pmatrix} 1 & 0 \\ 0 & e^{i\pi/4} \end{pmatrix} \tag{3.18}$$

Notice that H is a 180° rotation around the diagonal X + Z axis of the Bloch sphere. An important feature of the Hadamard gate is that it creates quantum superposition when it acts on a computational basis state, as shown in Equations 3.19 and 3.20:

$$\frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}\begin{pmatrix} 1 \\ 0 \end{pmatrix} = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix} = \frac{|0\rangle + |1\rangle}{\sqrt{2}} \tag{3.19}$$

$$\frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}\begin{pmatrix} 0 \\ 1 \end{pmatrix} = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ -1 \end{pmatrix} = \frac{|0\rangle - |1\rangle}{\sqrt{2}} \tag{3.20}$$

A geometric interpretation of the Hadamard gate applied to the state |0i on the Bloch sphere can be seen in Figure 3.7. Another thing to notice is that the Hadamard gate is exactly the two-point Discrete Fourier Transform matrix. This turns out to be an important feature to be exploited in quantum algorithms.

Figure 3.7: Geometric visualization of the rotation on the Bloch sphere created by applying the Hadamard gate to the state |0⟩ to create the superposition state (|0⟩ + |1⟩)/√2, and vice versa.

So far, we have only been talking about single qubit gates. To be able to do any useful

computation, we must be able to get the qubits to interact with each other. Hence, we will

shift our focus to multi-qubit gates. Classical gates such as AND, OR, NAND, NOR are

multiple-bit gates. The most fundamental multi-qubit quantum gate is the controlled-NOT

or CNOT gate, which is a 2-qubit gate. It is the analog of the classical XOR gate. The first qubit in the CNOT gate is the control qubit and the second is the target qubit. See Figure

3.8. If the control qubit is |0i then nothing is done. If the control qubit is |1i then the target

qubit will get flipped. Explicitly, the action of CNOT gate can be written as follows.

CNOT |00i = |00i; CNOT |01i = |01i; CNOT |10i = |11i; CNOT |11i = |10i (3.21)

The matrix representation of the CNOT gate in the computational basis is

$$CNOT = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix} \tag{3.22}$$

Figure 3.8: Controlled-NOT gate

Now that we have defined 1-qubit gates and a 2-qubit gate, we can combine them to create the following quantum circuit, which is often used to make a maximally entangled state

(Bell state). See Figure 3.9.

Figure 3.9: This circuit maps the state |00⟩ to the state (1/√2)(|00⟩ + |11⟩)
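A minimal numerical sketch of the circuit in Figure 3.9 (Hadamard on the first qubit followed by CNOT, with the usual computational-basis ordering |00⟩, |01⟩, |10⟩, |11⟩ assumed):

import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
I2 = np.eye(2)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])

ket00 = np.array([1, 0, 0, 0])
bell = CNOT @ np.kron(H, I2) @ ket00        # apply H to qubit 1, then CNOT
print(bell)                                  # [0.7071 0 0 0.7071] = (|00> + |11>)/sqrt(2)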

The circuit in Figure 3.9 is very important in quantum computations. Without a 2-qubit

gate like the CNOT , quantum computers are no better than classical computers because the

quantum states are not entangled. A generalization of the CNOT gate is the controlled-U,

CU, gate. Again, it is a two qubit operation, with the first qubit being the control and

the second being the target qubit. If the control qubit is set to |1i, then the 2-by-2 unitary

operator U is applied to the target qubit. From this point of view, it is clear that CNOT

gate is a special case of CU, where U is the Pauli-X gate. The matrix representation for a

general controlled-U gate, CU, is

1 0 0 0  ! 0 1 0 0  u u   11 12 CU =   U = (3.23) 0 0 u11 u12 u21 u22 0 0 u21 u22

Another multi-qubit gate that often comes up in quantum computations is the Toffoli gate.

It is a 3-qubit gate, and often denoted as CCNOT. It has a circuit representation as shown in Figure 3.10. The first two qubits are the control qubits, so if both are in the state |1⟩ then the Pauli-X gate is applied to the third qubit. It should be noted that the Toffoli gate is the reversible analogue of the classical NAND gate. This is important since it tells us we must use a three-qubit system to simulate such a gate. The truth table and matrix representation can be seen in Equation 3.24, where I_A, I_B, I_C indicate the 3 input values, and O_A, O_B, O_C indicate the 3 output values.

I_A  I_B  I_C    O_A  O_B  O_C
 0    0    0      0    0    0
 0    0    1      0    0    1
 0    1    0      0    1    0
 0    1    1      0    1    1
 1    0    0      1    0    0
 1    0    1      1    0    1
 1    1    0      1    1    1
 1    1    1      1    1    0

$$CCNOT = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \end{pmatrix} \tag{3.24}$$

We have talked about a handful of different quantum gates, from 1-qubit gates to a 3-

qubit gate. How many quantum gates are there? The answer is that an uncountably infinite

number of quantum gates exist. The reason is any unitary operation can be thought of as a

quantum gate, and the set of unitary operations is continuous. This seems like a daunting

statement at first, but it turns out that we can approximate any arbitrary unitary operator

with a small set of discrete gates, known as a universal set of quantum gates.

Figure 3.10: Circuit representation of the Toffoli gate

3.3 Universal Quantum Computation and The Solovay-Kitaev Theorem

In classical computing, the NAND (NOT-AND) or the NOR (NOT-OR) gates are universal. That is, any other Boolean function (logic gate) can be reproduced from just the

NAND or the NOR gates alone, and they can be used to compute arbitrary classical functions. Note that the NOT gate is a single bit gate, whereas the AND and OR gates are 2-bit gates. In quantum computation, a similar result exists despite the fact that the set of unitary operations is continuous. A set, G, of quantum gates is said to be universal if any unitary operation can be approximated to arbitrary precision by a sequence of gates from G.

If U is the desired unitary operator, and V is the result from manipulating the gates from

G in a sequential manner, then we can define the error when V is implemented instead of U by

$$E(U, V) = \max_{|\psi\rangle} \|(U - V)|\psi\rangle\| \tag{3.25}$$

Therefore, mathematically, when U is said to be approximated to an arbitrary accuracy, it means that, given ε > 0, there exists some V such that E(U, V) < ε. A common and primarily used universal set, G, contains the following gates:

$$G = \{H, \pi/8, CNOT\} \tag{3.26}$$

where H is the Hadamard gate, π/8 is T, and CNOT is the controlled-NOT gate. See Equations 3.18 and

3.21 for the exact description of these gates. People often include the phase gate (S) in the set

G, for error correction. However, it is not needed for the proof of the universality of the set G.
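As a side illustration of Equation 3.25 (not part of the original argument), the error E(U, V) over normalized |ψ⟩ equals the largest singular value (spectral norm) of U − V, so it can be evaluated numerically; the slightly mis-calibrated T gate below is an arbitrary example:

import numpy as np

def gate_error(U, V):
    # E(U, V) = max over unit |psi> of ||(U - V)|psi>|| = spectral norm of U - V
    return np.linalg.norm(U - V, ord=2)

T = np.diag([1, np.exp(1j * np.pi / 4)])                    # the pi/8 gate of Equation 3.18
T_approx = np.diag([1, np.exp(1j * (np.pi / 4 + 0.01))])    # a slightly mis-calibrated T
print(gate_error(T, T_approx))                              # approx 0.01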

An interesting remark is that the set P = {H, S, CNOT} is not universal since it only generates

a discrete subgroup of U(n). Moreover, any circuit over this set of gates can be simulated

in classical polynomial time by the famous Gottesman-Knill Theorem [9]. Note that H and

π/8 gates are single qubit gates, whereas the CNOT gate is a 2-qubit gate. Thus, the set G contains only 1-qubit and 2-qubit gates, so it might be a surprise that it can approximate an arbitrary n-dimensional unitary operator belonging to U(n) to arbitrary accuracy. Before

explaining why this surprising fact is actually not at all surprising, it is good to recall that

classically, any Boolean function can be reproduced by the AND (2-bit) and NOT (1-bit)

gates alone. Thus, we have a similar result in the gate quantum computing model. This

is why the gate model for quantum computing is more popular than other models (we will

discuss other quantum computation models in later section), because it is more relatable to

classical digital computer. To see why only 1,2-qubit gates are needed to approximate any

unitary matrix with arbitrary dimension, we first state a well-known theorem:

Theorem 3.3.1. A set S consisting of the Hadamard gate H and the π/8 gate T is a

universal set for U(2). That is, given an arbitrary unitary matrix U ∈ U(2) and  > 0, there

exists a V which can be generated by a product of H and T in a sequential manner, such

that E(U, V) < ε.

The details of the proof of Theorem 3.3.1 can be found in [2] on page 196. This theorem

essentially states that any single qubit gate can be approximated to an arbitrary precision

with just the Hadamard (H), and π/8 (T ) gates. This does not answer the question of why

only 1,2-qubit gates are needed for universal computation, yet. The following definition and

a result from linear algebra will shed some light on the answer.

Definition 3.3.2. A unitary matrix V ∈ U(n) is called a two-level unitary matrix if all diagonal elements are 1's and all off-diagonal elements are 0's except for 4 elements

V_jj = a, V_jk = b, V_kj = c, V_kk = d, which form a 2 × 2 unitary matrix, V̂, where det(V̂) = 1. Equation 3.27 shows what a two-level matrix might look like:

$$V = \begin{pmatrix} 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & & \vdots & & \vdots & & \vdots & & \vdots \\ 0 & \cdots & a & \cdots & 0 & \cdots & b & \cdots & 0 \\ \vdots & & \vdots & & \vdots & & \vdots & & \vdots \\ 0 & \cdots & 0 & \cdots & 1 & \cdots & 0 & \cdots & 0 \\ \vdots & & \vdots & & \vdots & & \vdots & & \vdots \\ 0 & \cdots & c & \cdots & 0 & \cdots & d & \cdots & 0 \\ \vdots & & \vdots & & \vdots & & \vdots & & \vdots \\ 0 & \cdots & 0 & \cdots & 0 & \cdots & 0 & \cdots & 1 \end{pmatrix} \qquad \hat{V} = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \tag{3.27}$$

Now, there is a result from linear algebra which shows that any n×n unitary matrix can be decomposed into a product of two-level unitary matrices. Moreover, this product consists of at most 2^{n-1}(2^n − 1) two-level unitary matrices [8]. Thus, to show that the set G defined by

Equation 3.26 is a universal set, it suffices to show that it can approximate an arbitrary n×n two-level unitary matrix. By looking at the structure of a two-level unitary matrix, Equation

3.27, it shouldn’t be an astonishing fact that a two-level unitary can be implemented using only 1-qubit and CNOT gates. The detailed construction algorithm to implement this can be found in [2] on page 191-193. Therefore, putting it altogether, it is now clear why the set G defined by Equation 3.26 is universal for quantum computations. However, up to this point of our discussion about universality, we have been avoiding an important question.

Specifically, despite knowing that an arbitrary unitary, U ∈ U(n), can be approximated by a sequence of gates from our universal set G to any desired accuracy ε, we don't know how many gates are needed for a given ε. This is an important question because if the number of gates needed in the sequence were of the order of O(2^{1/ε}), then our universal set G would only be

good in the theoretical sense but would have no practical value in quantum computations. One of

the most well-known theorem in quantum computations is the “Solovay-Kitaev Theorem”,

and it assures us that this never happens.

Theorem 3.3.3. Given any physical universal set G, an arbitrary target unitary U ∈

U(n), and a desired accuracy ε > 0, there exists a finite sequence S of gates from G of length O(log^c(1/ε)), omitting the dependence on the number of qubits n, such that E(U, S) < ε, where 2 ≤ c ≤ 4, depending on the structure of G.

It should be noted that the gates in the universal set G of the Solovay-Kitaev Theorem are closed under inverses, that is, if a gate A ∈ G then A⁻¹ ∈ G. The gates belonging to our universal set defined in Equation 3.26 certainly have this property, as expected. The condition of being closed under inverses helps with the proof of the theorem; however, whether or not this condition can be removed is an open problem!

3.4 Quantum Speed-up and Quantum Algorithms

There exist different quantum computation models, as we will see in the next section.

Their fundamental building block, the qubit, remains unchanged. When people think of quantum computers, the first thing that usually comes to mind is the exponential speed up they provide over classical computers, because of the superposition and entanglement properties of a qubit. Although it is true that these properties are the basis for the advantages of quantum computers over classical computers, it is not true in general that a quantum computer will be exponentially faster than a classical computer. A big part of this exponential speedup comes from the way the algorithm is designed. Designing a quantum algorithm is not an easy task in general, and to design one cleverly enough to encompass the exponential speed up is even harder! Part of our research goal is to at least solve the first task, that is,

43 to use machine learning to allow the quantum computer to design its own algorithm. But

for now, we will discuss some famous existing quantum algorithms and why they provide an

exponential speed up over known classical algorithms.

To understand the computational speed up of classical problems on a quantum computer,

we first define some classical complexity classes. The two most important classical complexity

classes are P and NP. The class P contains problems which can be solved in polynomial time

(quickly) on a classical computer, for example, calculating the greatest common divisor. The class NP (nondeterministic polynomial time) contains problems whose solutions can be quickly verified on a classical computer, for instance, the prime factors of an integer n. Thus far

there is no known classical algorithm that can solve prime factors efficiently on a classical

computer. It’s obvious that P⊆ NP. However, it’s not yet known and proven that there exist

problems that are in NP but not in P. This is one of the seven Millennium Prize Problems

known as “P versus NP”. The key point is that there are problems in the NP class which

can be solved efficiently (quickly) on a quantum computer. One such problem is: prime

factorization, which can be solved efficiently using Shor’s algorithm. At first, one might

wonder what quantum mechanics has to do with factoring. The interesting answer is:

Nothing. However, quantum mechanics has everything to do with waves, and periodicity,

which have a lot to do with factoring. Finding prime factors for a number n essentially boils

down to finding the period of the function

$$f(x) = a^x \bmod n \quad \text{where } a < n \text{ and } \gcd(a, n) = 1 \tag{3.28}$$
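To illustrate the role of the period (this is only the classical, exponential-time view; the example numbers are arbitrary), one can find the period of f by brute force and use it to recover the factors of n, which mirrors the classical post-processing step of Shor's algorithm:

from math import gcd

def period(a, n):
    # smallest r > 0 with a^r = 1 (mod n); classically this brute-force
    # search is exponential in the number of digits of n
    r, value = 1, a % n
    while value != 1:
        value = (value * a) % n
        r += 1
    return r

n, a = 15, 7
assert gcd(a, n) == 1
r = period(a, n)                              # r = 4 for a = 7, n = 15
if r % 2 == 0 and pow(a, r // 2, n) != n - 1:
    # the period yields non-trivial factors of n
    print(gcd(pow(a, r // 2) - 1, n), gcd(pow(a, r // 2) + 1, n))   # 3 and 5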

The function f in Equation 3.28 looks like random noise, so finding its period is very dif-

ficult, at least with classical computer. However, it turns out that quantum computers are

very efficient at period finding using a technique called Quantum Fourier Transform (QFT),

44 which is mathematically equivalent to the discrete Fourier Transform (DFT). Classically, a

fast Fourier transform takes n·2^n steps to Fourier transform 2^n numbers. However, this can

be accomplished in n² steps with the QFT on a quantum computer. Performing some sort of

transform efficiently is essentially the backbone of Shor’s algorithm and many other quantum

algorithms, like Grover search algorithm [123] or Bernstein-Vazirani algorithm [124]. This is

the main source of the exponentially speedup. Details on Shor’s algorithm can be found in

[14]. In fact, almost every quantum algorithm can be posed as a hidden subgroup problem.

Finding an efficient quantum algorithm is the same as developing an efficient solution to the hidden subgroup problem for certain groups. It turns out that a quantum computer can solve the hidden subgroup problem efficiently when the group is a finite Abelian group. However, the question is still open for non-Abelian groups. If a quantum computer could solve the non-Abelian case efficiently, then graph isomorphism could be solved efficiently on a quantum computer.

The main point is, there are hard problems like prime factoring which can be solved very efficiently on a quantum computer, even though no efficient classical algorithm is known as of yet. We can actually define the class of all computational problems which can be solved efficiently on a quantum computer; this complexity class is known as BQP

(Bounded-error Quantum Polynomial time). There are still many open questions about the

BQP complexity class; in fact, we still don’t yet known where does it fits with respect to P,

NP and PSPACE complexity classes. See figure 3.11 to see how these complexity classes are embedded within each other.

Figure 3.11: Complexity spaces. The complexity class BQP is still not well understood. We have shown that with a cleverly designed quantum algorithm, problems in NP, like prime factorization, can be solved efficiently on a quantum computer. However, no NP-complete problem has been solved efficiently on a quantum computer.

3.5 Adiabatic Quantum Computation Model

To this point, the primary focus of quantum computation has been the gate, or circuit, model. The circuit model offers a great parallelism to the classical computation model, which makes it more intuitive and easier to understand. However, other quantum computational models exist, and they are all equivalent to one another up to a polynomial overhead

[10]. Therefore, the computational complexity won't be exponentially better or worse if we switch between different models. In this section, we will briefly introduce another quantum computation model known as "Adiabatic Quantum Computation" (AQC), which is based on the celebrated "Adiabatic Theorem" in quantum mechanics. Similar to the quantum gate model, one can derive algorithms to operate on an AQC system, such as the prime factorization algorithm [12]. These algorithms are implemented using a physical process known as "Quantum Annealing". A main difference between the gate and the AQC model is the analog nature of the AQC model, where the system is slowly adjusted from an initial

state to the final state where the solution to the problem is encoded. AQC systems are being built with thousands of qubits, compared to the current largest gate model of only 72 qubits. However, it is not at all clear that these AQC systems use the full power of quantum computing, for which entangled states are necessary [11], as we have discussed in the previous sections. In fact, our research group has developed a technique to determine whether an

A computation in the AQC model is specified by two Hamiltonians, the initial Hamilto- nian and final Hamiltonian, usually denoted as HI and HF , respectively. The Hamiltonian structure depends on the type of qubits used in the system, but for arbitrary purposes one can regard it as a Hermitian matrix. The system starts in the ground state, the eigenvector with the lowest eigenvalue, of HI . This state is usually taken to be the tensor product state because it is easy to prepare. The system will be slowly adjusted from HI to HF by changing

field and interaction strengths on the qubits. The output of the system is the ground state of HF , where the solution to the problem has been encoded. Similar to the gate model where quantum gates are operating on a constant number of qubits, we require that the

Hamiltonians are local. That is, they involve only interactions among a constant number of particles. This ensures that the Hamiltonians have a nice matrix description structure. The gradual transition from HI to HF can be described in more explicit manner as

$$H(t) = s(t)H_I + (1 - s(t))H_F \tag{3.29}$$

where s(t) is the "adiabatic evolution path" that decreases from 1 to 0 as t goes from 0 to some elapsed time, t_f. A simple and often used path is the linear one given by s(t) = 1 − t/t_f. The

time it takes to evolve from H_I to H_F while remaining in the ground state is known as the "annealing schedule", and for many problems, it grows exponentially in the problem size.

AQC is great at solving hard classical optimization problems. For instance, suppose we are given an Oracle, a black box, which can be described as a function f such that

$$f : \{0, 1\}^n \to \mathbb{R}$$

The goal is to find the optimal x̂ ∈ {0, 1}^n such that f(x̂) = min(f). The initial Hamiltonian,

H_I, is taken to be

$$H_I = \sum_{i=1}^{n} \sigma_x^i$$

where σ_x^i = I ⊗ · · · ⊗ σ_x ⊗ · · · ⊗ I, that is, it is the tensor product of a sequence of 2 × 2

identity (I) matrices, with the Pauli-X at the i-th position. The final Hamiltonian is taken to be

$$H_F = \sum_{x \in \{0,1\}^n} f(x)\,\Pi_x$$

where Π_x denotes the projection onto x. Thus, by starting at an easy, simple ground state we can move toward another ground state which encodes the minimal solution x̂, provided that we move slowly enough.
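A small numerical sketch of Equation 3.29 for two qubits (the cost function f below is an arbitrary example, and the linear schedule is only one possible choice), interpolating between H_I and a diagonal cost Hamiltonian H_F while tracking the spectral gap along the path:

import numpy as np

sx = np.array([[0, 1], [1, 0]])
I2 = np.eye(2)

H_I = np.kron(sx, I2) + np.kron(I2, sx)                 # sum_i sigma_x^(i) for 2 qubits
f = {0b00: 3.0, 0b01: 1.0, 0b10: 2.0, 0b11: 0.0}        # arbitrary cost, minimum at x = 11
H_F = np.diag([f[x] for x in range(4)])                 # sum_x f(x) |x><x|

for s in np.linspace(1.0, 0.0, 5):                      # s decreases from 1 to 0
    H = s * H_I + (1 - s) * H_F                         # Equation 3.29
    evals = np.linalg.eigvalsh(H)                       # sorted ascending
    print(f"s = {s:.2f}  gap = {evals[1] - evals[0]:.3f}")
# At s = 0 the ground state of H_F encodes the optimal bit string x = 11.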

Currently, D-Wave Systems provides the largest implementation of AQC. D-Wave's processors are essentially a transverse Ising model with tuneable local fields and coupling coefficients, which has the governing Hamiltonian:

$$H = \sum_i K_i \sigma_x^i + \sum_i \epsilon_i \sigma_z^i + \sum_{i<j} \zeta_{ij}\, \sigma_z^i \sigma_z^j \tag{3.30}$$

Many hard optimization problems, even NP-hard problems, can naturally be expressed as the problem of finding the ground state (minimum energy configuration) of such a Hamiltonian [44]. However, the problem

Hamiltonian taking the form of equation 3.30 is not QMA-complete. Recall that a problem is said to be QMA-complete if it is QMA-hard and in QMA, where QMA represents the

Quantum Merlin Arthur complexity class, which contains, essentially, all the interesting problems we really care about. A simple modification to the Hamiltonian from Equation 3.30 can make it QMA-complete, however [45].

Theorem 3.5.1. The Hamiltonians

$$H = \sum_i K_i \sigma_x^i + \sum_i \epsilon_i \sigma_z^i + \sum_{i,j} \zeta_{ij}\, \sigma_z^i \sigma_z^j + \sum_{i,j} \beta_{ij}\, \sigma_x^i \sigma_x^j \tag{3.31}$$

and

$$H = \sum_i K_i \sigma_x^i + \sum_i \epsilon_i \sigma_z^i + \sum_{i,j} \zeta_{ij}\, \sigma_z^i \sigma_x^j + \sum_{i,j} \beta_{ij}\, \sigma_x^i \sigma_z^j \tag{3.32}$$

are QMA-complete.

Some other quantum computation models are the Quantum Turing Machine (QTM) and

Measurement-based quantum computation (MBQC). However, our research group has never used these models for our work.

3.6 Quantum Decoherence, Noise, and Error Correction

A closed quantum system evolves based on the Schrodinger equation, equation 2.16, where the Hamiltonian H is well defined and provides complete information about how the system evolves. The Hamiltonian might come from outside the system (control of external

fields), but what makes a quantum system closed is that it doesn't act back on this external source. However, a closed system doesn't exist in reality. All quantum systems interact with the outside environment, at least on a weak scale. This environmental interaction almost

always does not preserve coherence, and thus produces "decoherence". When a quantum state decoheres it becomes, essentially, classical. This is the reason quantum effects are not observed at macroscopic (classical) scales. Any large scale superposition (e.g., being both dead and alive) would decohere instantaneously (unfortunately). Thus, decoherence is a big issue for quantum computations. Outside of quantum decoherence, quantum systems also have to deal with noise, similar to classical systems. Quantum noise can come from various sources, and it can be both external (e.g., stray magnetic fields) or internal (e.g., material impurities). Therefore, the ability to eliminate or reduce these effects is significant in quantum computations.

Even after Peter Shor published his prime factoring algorithm in 1994 and put public-key encryption systems like RSA at risk, there was still much skepticism that quantum computing could be practical, mainly because of noise and decoherence in quantum systems [36].

This led to Peter Shor’s publishing the first Quantum Error Correction code (QECC) [34].

The Shor code is a 9-qubit QECC that is capable of correcting an arbitrary error on a single qubit while protecting the logical qubit state. This was done by encoding a qubit state

|ψ⟩ = α|0⟩ + β|1⟩ into the state |ψ⟩_L = α|0⟩_L + β|1⟩_L, where

$$|0\rangle_L = \frac{1}{\sqrt{8}}\big(|000\rangle + |111\rangle\big)\big(|000\rangle + |111\rangle\big)\big(|000\rangle + |111\rangle\big) \tag{3.33}$$

$$|1\rangle_L = \frac{1}{\sqrt{8}}\big(|000\rangle - |111\rangle\big)\big(|000\rangle - |111\rangle\big)\big(|000\rangle - |111\rangle\big) \tag{3.34}$$
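As a check (a sketch only, with the standard tensor-product ordering assumed), the logical states of Equations 3.33 and 3.34 can be built explicitly as 2⁹-dimensional vectors:

import numpy as np

ket0 = np.array([1.0, 0.0])
ket1 = np.array([0.0, 1.0])
# three-qubit GHZ-type blocks (|000> +/- |111>)/sqrt(2)
ghz_plus  = (np.kron(np.kron(ket0, ket0), ket0) + np.kron(np.kron(ket1, ket1), ket1)) / np.sqrt(2)
ghz_minus = (np.kron(np.kron(ket0, ket0), ket0) - np.kron(np.kron(ket1, ket1), ket1)) / np.sqrt(2)

zero_L = np.kron(np.kron(ghz_plus, ghz_plus), ghz_plus)     # Equation 3.33
one_L  = np.kron(np.kron(ghz_minus, ghz_minus), ghz_minus)  # Equation 3.34
print(zero_L.shape, np.dot(zero_L, one_L))                  # (512,) and 0.0: orthogonal logical states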

From Equations 3.33 and 3.34 we can see that the concept of quantum entanglement serves as the backbone tool behind the quantum error correcting scheme. This is done by first entangling the working qubits in the quantum register with ancilla qubits prepared in a well-defined state. Now, because of the non-classical correlation property of entanglement, the ancilla qubits will contain information about the errors on the working qubits when a cor-

ruption occurs. We can then quantify the errors that occurred through the ancilla qubits, and use this information to correct the working qubits without ever having to make any type of measurement on them. Thus, entanglement allows us to detect quantum errors without changing the state of the qubits in a quantum register, and then use this information to correct the errors. Figure 3.12 shows how one can implement Shor's 9-qubit code on a quantum circuit. More details on Shor's error correction scheme can be found in [2]. Since Shor published his 9-qubit error correcting code, others have developed more efficient error correction schemes. For instance, Andrew Steane developed the Steane code, which does the same thing as the Shor code but using only 7 qubits instead of 9, and it's closely related to the classical error-correcting code known as the Hamming code [37][38]. Then Raymond

Laflamme et al. found a class of codes which do the same thing as the Shor and Steane codes but use only 5 qubits, and also have the property of being fault-tolerant [39]. It turns out that

five qubits is the lowest number of qubits possible for any quantum error correction scheme [40].

Thus decoherence and noise are enormous problems for quantum computations. Fortunately, quantum error-correcting schemes exist. However, as discussed, these schemes need a minimum of 5 qubits to correct a single qubit. This is a problem for currently existing quantum hardware, where only a limited number of qubits is available. It should be noted that although the Adiabatic Quantum Computation model has hardware with up to

5000 qubits (D-Wave Systems Inc.), it does not incorporate quantum error-correcting codes in its algorithms.

Figure 3.12: Quantum circuit for Shor's 9-qubit error correction code. E is a quantum channel that can arbitrarily corrupt a single qubit [34].

CHAPTER 4

CLASSICAL ARTIFICIAL NEURAL NETWORKS

The first and simplest definition of an artificial neural network, or neural network, is provided by Dr. Robert Hecht-Nielsen as:

“... a computing system made up of a number of simple, highly connected processing elements, which process information by their dynamic state response to external inputs.”

In this chapter we will explore classical artificial neural network in detail. First, in Section

4.1 we will describe what an artificial neural network is and why we need it. In Section

4.2 we will briefly describe biological neurons, and how we can design mathematical models for them, which leads us to the concept of artificial neurons. We will present two types of artificial neuron, the ‘perceptron’ and its modified version the ‘sigmoid neuron’. In Section

4.3 we will discuss how artificial neurons can be stacked in layers with interconnections to build a neural network, a mathematical model trying to replicate the structure of our brain. In

Section 4.4 we will discuss the universality of artificial neural networks.

4.1 Introduction

Artificial neural networks are an area of machine learning, a subfield of artificial intelligence.

Artificial neural networks are computational models that offer an alternative to the standard deterministic algorithmic approach. Computers are great at performing simple operations, and they do it quickly. For a computer to perform a specific task, we program it explicitly, by giving it a well-defined procedure (an algorithm) to execute. When a task is too complex, it is very hard to develop an algorithm for it. For instance, it would be a nightmare to program a computer explicitly for face recognition, or to determine what letters or numbers a person wrote. However, recognizing faces and letters are rather trivial problems for us humans, and we do it effortlessly and unconsciously in our daily life. The human brain is great at recognizing patterns. So why do we need a computer to do these tasks? The answer is: although we can perform these tasks with our brains, we are limited by time and energy. So while we might be able to recognize faces or handwritten letters rather easily, doing it in a data set of hundreds of millions is unreasonable. However, computers are great at sorting through enormous data sets! In a way this tells us that our brain is really a supercomputer that has evolved over hundreds of millions of years, and is very well adapted to understanding the visual world. Thus it might be a good idea to program computers the way our brain is programmed, which is by learning! One way a person learns new tasks is by recognizing patterns from data in past experience and applying them. For instance, if a kid touches a red hot stove, he or she will learn not to touch it, along with any hot red piece of metal, ever again in the future. Or when you try to kick a soccer ball into a goal, if you miss it wide to the right the first time, then you will aim more toward the left the second time. You will keep adjusting your aim until the soccer ball goes inside the goal. In either of those instances, our brain learns from past experiences and results, and we use this information for future tasks. Thus we can try to teach a machine to learn the same way we do. In other words, we can use a set

of examples instead of explicit instructions to allow the machine to infer rules automatically

about a specific task, like classifying handwritten letters or how much force and optimal aim

scores a goal with a soccer ball. To do this we must have some sort of a model representing

the brain, at least partially, and apply this model to train the computer. This is what an

'artificial neural network' does: it tries to replicate how a human brain works. The human

brain is a formidably complex structure and there is still a lot we don’t know about it. But

we do know that it is made up of a hundred billion neurons, the information processing

cells of the brain, and each neuron has about 10,000 synapses (interconnections) on average

which brings the total approximate number of connections to an astronomical value of $10^{15}$.

Modeling a human brain mathematically is impossible, but we can create a partial model to replicate it, which then boils down to modeling (a small number of) biological neurons and the interactions among them.

4.2 Artificial Neurons

Similar to the way a ‘bit’ and a ‘qubit’ are fundamental building block/processing units for classical and quantum computers respectively, an ‘artificial neuron’ is the basic building

block of an artificial neural network. An artificial neuron is a mathematical model, a repre-

sentation of a human neuron, which is the fundamental unit of our brain. Biological neurons

come in various shapes and sizes but each neuron has a ‘soma’ or cell body which contains

the nucleus and other vital components. See Figure 4.1.

Figure 4.1: The structure of a biological neuron [46].

The cell body also contains tree-like branches known as dendrites and a long fiber extending from the cell body known as the axon. These are the main communication links. The neuron receives its input (electrical signals) along the dendrites; the cell body processes these inputs and decides whether or not the neuron fires an action potential. The axon then carries the signal (action potential) away from the cell body to the synaptic/axon terminals. When the action potential reaches the terminals, chemical messengers called neurotransmitters are released, which are then collected by nearby dendrites of other neurons. These dendrites convert the chemical signals into electrical signals, which then get sent to a cell body again for processing. This process keeps going as long as we are alive. From here we can develop a mathematical model, an artificial neuron, representing a biological neuron. We present here two different models. The first model is called a 'perceptron', developed in the 1950s and

1960s by the scientist Frank Rosenblatt, inspired by the earlier work of Warren McCulloch and Walter Pitts. The perceptron is rarely used in today's neural networks because of its hard threshold activation, which makes devising a learning algorithm difficult. This leads to another artificial neuron model that is similar to the perceptron but uses a soft threshold activation instead, known as a 'sigmoid neuron'. We will discuss these two neuron models in the following subsections.

4.2.1 Perceptron

A perceptron was the first mathematical model that attempted to replicate a biological

neuron. It takes a finite number of binary inputs, {x1, x2, ··· , xn}, and produces a single

binary output y. Figure 4.2 shows an example of a two-input perceptron.

Figure 4.2: A perceptron model with two inputs.

Each of the inputs is multiplied by a weight, $w_i \in \mathbb{R}$, which represents the importance of that input to the output. The output of the neuron is either 0 or 1 depending on the weighted sum and the threshold value of the activation function, $\sigma$. Mathematically, this process can be described as:
$$\text{output} = \begin{cases} 0 & \text{if } \sum_i w_i x_i \le \text{threshold} \\ 1 & \text{if } \sum_i w_i x_i > \text{threshold} \end{cases} \tag{4.1}$$

This basic mathematical model tried to replicate how dendrites collect chemical signals, turn them into an electrical signal, and send it to the cell body for processing. The cell body then decides whether to fire an action potential or not. This is all there is to a perceptron. It's nothing more than a processing unit that makes decisions by weighing up the evidence. The following is an example to illustrate how a perceptron makes a decision: Suppose you want to decide whether to show up to class or not on a particular day. Then there are probably several factors going into your decision, for instance: x1 = Is there an exam? x2 = Is homework being collected?

x3 = Is the weather good?

x4 = Interesting topics?

We can set 0 = No, 1 = Yes, and a threshold value of 4. Then you might assign a weight of

5 to input x1 since an exam is an important factor, but you might assign weights of 3, 2, and 2 to x2, x3, and x4, respectively. This means that you will go to class every time there is an exam. But if there is no exam, then at least two of the other three conditions must be met for you to show up to class. Note that the threshold value determines the bias in our decision. If the threshold value is low in the previous example, this means you are biased toward going to class to start with, and you don't need much reason to go. In most of the neural network literature, the negative of the threshold/bias value is designated as $b$ in the rewritten Equation 4.1:
$$\text{output} = \begin{cases} 0 & \text{if } \sum_i w_i x_i + b \le 0 \\ 1 & \text{if } \sum_i w_i x_i + b > 0 \end{cases} \tag{4.2}$$
A perceptron by itself can't do much other than being a linear classifier, like the example above. However, if we stack a lot of them together in layers, like neurons in our brain, they can accomplish many complicated tasks, if we can find the correct weights, $w_i$, for the output. This is where the perceptron runs into a problem. The hard threshold step function for the output makes it hard to adjust the weights $w_i$, because small changes in the weights might completely flip the decision from 0 to 1 or vice versa. This makes it difficult to see how we can make gradual modifications to the weights and biases so that we can get closer to the desired behaviors.
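The class-attendance example can be written out directly. The following minimal sketch (ours) implements the hard-threshold rule of Equation 4.2 with the illustrative weights 5, 3, 2, 2 and threshold 4 (so b = -4):

```python
def perceptron(inputs, weights, bias):
    """Hard-threshold perceptron, Eq. (4.2): output 1 if sum(w*x) + b > 0, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

weights = [5, 3, 2, 2]      # exam, homework collected, good weather, interesting topic
bias = -4                   # negative of the threshold value 4

# Exam day, nothing else favorable: go to class (5 - 4 = 1 > 0)
print(perceptron([1, 0, 0, 0], weights, bias))   # 1
# No exam, only good weather: stay home (2 - 4 = -2 <= 0)
print(perceptron([0, 0, 1, 0], weights, bias))   # 0
# No exam, homework collected and good weather: go (3 + 2 - 4 = 1 > 0)
print(perceptron([0, 1, 1, 0], weights, bias))   # 1
```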

4.2.2 Sigmoid Neuron

The sigmoid neuron is similar to the perceptron but with some small modifications to

make the problem of determining the weights and bias easier. The first modification is that

the inputs $\{x_i\}$ are no longer binary but instead continuous real variables. The activation function, $\sigma$, is now taken to be a sigmoid function, hence the name sigmoid neuron. A sigmoid function is a bounded, continuous, and differentiable function with the properties
$$\sigma(x) \to \begin{cases} 1 & \text{as } x \to \infty \\ 0 & \text{as } x \to -\infty \end{cases} \tag{4.3}$$

Some examples of sigmoid functions are:

$$f(x) = \frac{1}{1+e^{-x}} \qquad f(x) = \tanh(x) \qquad f(x) = \arctan(x) \qquad f(x) = \frac{x}{\sqrt{1+x^2}}$$

Figure 4.3: A sigmoid neuron node with $x \in \mathbb{R}^N$ and the activation function $f(x) = \frac{1}{1+e^{-x}}$.

The sigmoid activation function provides a soft threshold. It’s a smoothed out version of the

hard threshold step function of the perceptron. The precise mathematical model for sigmoid

neuron is:
$$y = \text{output} = \sigma\Big(\sum_{i=1}^{N} x_i w_i + b\Big) \tag{4.4}$$

where $\sigma$ is the sigmoid activation function, and $x_i, w_i, b \in \mathbb{R}$ are the inputs, weights, and bias, respectively. Note that the output of a sigmoid neuron is no longer binary, but any real value within some bounded interval, usually scaled to [0, 1]. The smoothness of $\sigma$ implies that small changes in the weights and bias only cause a small change in the output, which is crucial in devising an algorithm to find the correct weights and biases for neural networks.

$$\Delta y \approx \sum_i \frac{\partial y}{\partial w_i}\Delta w_i + \frac{\partial y}{\partial b}\Delta b \tag{4.5}$$
Sigmoid neurons are used extensively and are preferred over perceptrons in neural

networks because of the differentiable property of their activation functions. Therefore, for

the remaining sections and subsections of this chapter, we will only focus on neural networks

that are built with sigmoid neurons.
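As a small illustration of Equations 4.4 and 4.5, here is a sketch (our own, with arbitrary example values) of a single sigmoid neuron, showing that a small change in a weight produces only a small change in the output:

```python
import numpy as np

def sigmoid(z):
    """Logistic activation 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_neuron(x, w, b):
    """Output of a single sigmoid neuron, Eq. (4.4): sigma(sum_i x_i w_i + b)."""
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.6, -1.2, 0.3])
w = np.array([0.5, 0.8, -0.4])
b = 0.1

y = sigmoid_neuron(x, w, b)
# A small perturbation of one weight changes the output only slightly (Eq. 4.5)
w_perturbed = w + np.array([0.01, 0.0, 0.0])
print(y, sigmoid_neuron(x, w_perturbed, b))
```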

4.3 Multi-Layer Neural Networks

A single perceptron or sigmoid neuron can’t accomplish much by itself. However, stacking

them in layers with interconnections (building a neural network) in an attempt to replicate

the structure of our brain can make them quite a powerful computational tool. What we

are trying to accomplish is to determine the map $f : \mathbb{R}^N \to \mathbb{R}^M$ without having to build an explicit algorithm, by forcing the computer to learn this map through a neural network

model instead. In Section 4.4 we will see exactly the class of functions a multi-layer neural

network can learn. For now we will define some terminologies of multi-layer neural networks

and how the optimal weights and biases can be found through a specific learning algorithm

for supervised learning called error backpropagation. It should be noted that this learning

rule is only for sigmoid neurons because it requires the activation to be differentiable.

4.3.1 Network Architecture

The first layer of a neural network is called the input layer. The input layer takes in the input value $x \in \mathbb{R}^N$, and the neurons within this layer are called input neurons. Hence there are $N$ neuron nodes here, each encoding a value $x_i \in \mathbb{R}$ for $i = 1, 2, \ldots, N$. Each of these nodes will be connected to an arbitrary number $P$ of sigmoid neurons in the next layer.

Each of these neurons will output a value which will get passed through to the next layer of neurons. This process will continue until the last layer of the neural network, which is called the output layer. If the map is from $\mathbb{R}^N \to \mathbb{R}^M$ then there will be $M$ neuron nodes at the output layer. This changes from problem to problem, of course. The layers in between the input layer and output layer are called hidden layers. A neural network with only one hidden layer is called a shallow network. See Figure 4.4 for a visualization of a neural network with

2 hidden layers, with the first hidden layer containing $P$ neurons and the second hidden layer containing $Q$ neurons, mapping from $\mathbb{R}^2$ to $\mathbb{R}$.

Note that the output layer does not need to be a single output; rather, this changes from problem to problem, and how we want to structure the learning process. For instance, if we want to train a neural network to recognize handwritten digits from 0 to 9, then we might want to have 10 output neurons with each of the output neurons acting as an indicator.

That is, if the input is an image of the digit 2, then the third output neuron's target value is a 1 and the rest of the output neurons' target values are 0. However, we can structure the problem so that there is only one output neuron, with the output values ranging from 0 to 10. In such an instance, if the image is an image of the digit 2 then its target output value is within the interval 3 < target < 4. The number of hidden layers also changes from problem to problem. If the problem is very complex, then we might design the neural network to have parts of its hidden layers extracting certain features. This might help in speeding up

the training as well as finding the true optimal weights and biases. All of this is known as the network architecture.

Figure 4.4: A neural network with two hidden layers mapping $f : \mathbb{R}^2 \to \mathbb{R}$. In nearly all cases, we take all the activation functions $\sigma_i$ to be the same; that is, there is only one type of activation in the network. The reason for this will become clear in Section 4.4.
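To make the architecture of Figure 4.4 concrete, the following sketch (ours, with arbitrary layer sizes and random weights) propagates an input through a network with two hidden sigmoid layers; each layer computes σ(Wy + b), the matrix form given later in Equation 4.9:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, layers):
    """Propagate input x through a list of (W, b) pairs: y_l = sigma(W_l y_{l-1} + b_l)."""
    y = x
    for W, b in layers:
        y = sigmoid(W @ y + b)
    return y

rng = np.random.default_rng(0)
P, Q = 4, 3                                  # neurons in the two hidden layers
layers = [
    (rng.normal(size=(P, 2)), rng.normal(size=P)),   # input (2) -> first hidden layer (P)
    (rng.normal(size=(Q, P)), rng.normal(size=Q)),   # first hidden (P) -> second hidden (Q)
    (rng.normal(size=(1, Q)), rng.normal(size=1)),   # second hidden (Q) -> single output
]

print(feedforward(np.array([0.3, -0.7]), layers))    # a single value in (0, 1)
```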

4.3.2 Error Backpropagation and The Gradient Descent Learning Rule

A neural network is of no use if it doesn’t learn. Learning in neural networks means

finding the optimal weights and biases to correctly map the set of inputs to their corresponding outputs. This is done in the training process. In supervised learning, we will have a training data set where we have the input data and their corresponding outputs. We will start at some random initial weights and biases, then let the network propagate one of the inputs forward using the initial weights. This is known as the feed-forward step. At the

end of the feed-forward step, we will have an output that is most likely different from our

intended target value for that input. Then we have to go back to change the weights and

biases to reduce this difference/error. To do this, we define a cost function J

$$J(w, b) = \frac{1}{2}\|y - d\|^2 \tag{4.6}$$

where w and b represent all the weights and bias of the network, respectively. y is the actual

output of the network after the feed-forward step and d is the desired (correct) value. The

norm in Equation 4.6 can be taken as an $l_2$ norm. So what we want is to change the weights and biases in such a way that we minimize the cost function. Notice that the cost function is a very high dimensional surface, and therefore analytic minimization techniques from calculus won't work. Thus, we resort to the iterative optimization technique known as gradient descent. Gradient descent tells us that a function $J$ decreases fastest at a point $w_0$

if we go in the direction of the negative gradient, $-\nabla J(w_0)$. This gives the iterative update rule

$$w_{k+1} = w_k - \eta \nabla J(w_k) \tag{4.7}$$

This is exactly how we update each weight and bias in neural networks. But remember

that neural networks might have more than one hidden layer, so we need to trace back this

error to every layer. This is nothing more than applying successive chain rule computations

because layers in neural networks are just compositions of functions. To see this, just note

that the activation/output of the $j$th neuron in the $l$th layer can be written as
$$y_j^l = \sigma\Big(\sum_k w_{jk}^l\, y_k^{l-1} + b_j^l\Big) \tag{4.8}$$
where the sum in Equation 4.8 is over all neurons $k$ in the $(l-1)$th layer. This is where

backpropagation comes into play! It gives us a way to calculate the gradients of the cost

function with respect to each of the weights and biases efficiently. And computing the

gradient efficiently is all that backpropagation does, and nothing more. This efficiency

comes from the fact that the computations of the gradient of the weights between the l

and l + 1 layer of the network already have most of the calculation done when we did the

gradient for the weights in the l + 1 and l + 2 layers, as you can see from Equation 4.8.

Thus, the idea is we don’t have to do these duplicate computations, which saves us some

computational cost. Also now we see why it is important to have our activation functions

to be differentiable. Without the differentiability property of the activation function, it

would be difficult to develop a learning algorithm to train the network. To write down the

backpropagation algorithm, first note that the activations/outputs in the $l$th layer can be written

in matrix form as

$$y^l = \sigma(w^l y^{l-1} + b^l) \tag{4.9}$$

where $w^l$ is the weight matrix connecting the $(l-1)$th layer to the $l$th layer. The entries of $w^l$ are

just $w_{ij}^l$, and similarly for $b^l$. Note that $\sigma$ is being applied component-wise. This expression provides a more global way of thinking about how the activations/outputs in one layer relate to the activations/outputs in the previous layer. Now, from here the equation for the error in the output layer of an $L$-layer network, denoted $\delta^L$ (this is nothing but the gradient of the cost function with respect to the weights at the $L$th layer), can be written as

$$\delta^L = \nabla_y J \odot \sigma'(w^L y^{L-1} + b^L) \tag{4.10}$$

where the $\odot$ operation is the Hadamard product, the element-wise product of two vectors of the same dimension. Then, from the chain rule, one can easily derive the following equation for the error in the $l$th layer, $\delta^l$, in terms of the error in the $(l+1)$th layer, $\delta^{l+1}$:

$$\delta^l = \big((w^{l+1})^T \delta^{l+1}\big) \odot \sigma'(w^l y^{l-1} + b^l) \tag{4.11}$$

Then from here, we can see that the gradient of the cost function J with respect to the

weights $w_{jk}^l$ and biases $b_j^l$ is

$$\frac{\partial J}{\partial w_{jk}^l} = y_k^{l-1}\,\delta_j^l \qquad \frac{\partial J}{\partial b_j^l} = \delta_j^l \tag{4.12}$$
Thus the partial derivatives $\frac{\partial J}{\partial w_{jk}^l}$ and $\frac{\partial J}{\partial b_j^l}$ can be computed in terms of the known quantities $\delta^l$ and $y^{l-1}$. These equations are essentially everything going into the backpropagation algorithm. First, you do the feedforward step, calculating the activation/output for each layer all the way to the last (output) layer $L$. Then you start by finding the error of the output layer, $\delta^L$. At this point, one just needs to keep cascading the error backward until the input layer is reached. As you backpropagate through each layer, you evaluate the expressions $\frac{\partial J}{\partial w_{jk}^l}$ and $\frac{\partial J}{\partial b_j^l}$ to be used in the gradient descent learning rule:

$$w_{jk}^l = w_{jk}^l - \eta_w \frac{\partial J}{\partial w_{jk}^l} \qquad b_j^l = b_j^l - \eta_b \frac{\partial J}{\partial b_j^l} \tag{4.13}$$
That's all there is to training a neural network so that it can learn to perform a specific task.
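Putting Equations 4.8–4.13 together, here is a compact, self-contained sketch (our own illustrative implementation, not code from the references) of one feed-forward pass followed by one backpropagation/gradient-descent update:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_step(x, d, weights, biases, eta=0.5):
    """One feedforward + backpropagation pass for cost J = 0.5*||y - d||^2 (Eqs. 4.9-4.13)."""
    # Feedforward, storing weighted inputs z^l and activations y^l
    ys, zs = [x], []
    for W, b in zip(weights, biases):
        z = W @ ys[-1] + b
        zs.append(z)
        ys.append(sigmoid(z))

    # Output-layer error, Eq. (4.10): delta^L = (y - d) (.) sigma'(z^L)
    delta = (ys[-1] - d) * sigmoid_prime(zs[-1])
    grads_W = [np.outer(delta, ys[-2])]                 # Eq. (4.12)
    grads_b = [delta]

    # Backpropagate the error, Eq. (4.11)
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid_prime(zs[l])
        grads_W.insert(0, np.outer(delta, ys[l]))
        grads_b.insert(0, delta)

    # Gradient-descent update, Eq. (4.13)
    for l in range(len(weights)):
        weights[l] -= eta * grads_W[l]
        biases[l] -= eta * grads_b[l]

rng = np.random.default_rng(1)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases = [rng.normal(size=3), rng.normal(size=1)]
x, d = np.array([0.5, -0.2]), np.array([1.0])
for _ in range(200):
    backprop_step(x, d, weights, biases)
```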

As pointed out by Yann LeCun in [78] in 1987, we can reformulate backpropagation as a variational problem, by first defining the Lagrangian for each pattern $p$ (training pair), denoted $L_p$, to be
$$L_p(W, Y_p, B_p) = \|y - d\|^2 + \sum_{k=1}^{N} B_p(k)^T\big(Y_p(k) - F[W(k)Y_p(k-1)]\big) \tag{4.14}$$
where the summation term is the constraint of the network; it represents how the network is connected and evolves from layer to layer. The indices $k = 1, \ldots, N$ represent the layers in the network, $W(k)$ represents the weight matrix at the $k$th layer, and $Y_p(k)$ represents the state at layer $k$. The term $B_p(k)$ is the Lagrange multiplier. The full Lagrangian of the network is then defined as
$$L = \sum_{p=1}^{P} L_p(W, Y_p, B_p) \tag{4.15}$$

This formulation is quite nice because it takes the dynamics of the network into account in the

Lagrangian, and it's the formulation we used in our Quantum Neural Networks. See [78] for a detailed derivation and for how the network can be trained in this formulation. The key point here is that from this formulation, LeCun was able to extend the training method to a few different networks, one of which is continuous-time recurrent networks. We will soon see how this is helpful in deriving the learning algorithm for Quantum Neural Networks in the next chapter.

4.4 Universal Approximation

In the previous section, we discussed network architecture and how to search for the optimal weights and biases using the error backpropagation learning algorithm. Throughout the discussion, we swept a very important question under the rug: How complicated a problem can a neural network solve? This is an important question because if neural networks could only solve simple or trivial problems then this would defeat the purpose. Fortunately, it turns out that neural networks are universal approximators. They can approximate any continuous function on a compact set. This tells us that, in considering a problem, as long as the function that maps the input to the output can be represented as a continuous function, the problem can be solved with a neural network. We will present two different explanations. The first will be intuitive and constructive, using ancient mathematics. The second is based on modern analysis, so it will be more abstract but elegant. It should be noted that, by Lusin's theorem, this result can be extended to measurable functions.

4.4.1 Approximation with Boxes

Here we will show via construction that a neural network with two hidden layers is sufficient to approximate any continuous function on a compact set. This constructive argument is well known around the neural network community [53, 97]. However, when people discuss the universality of neural networks, they often refer to the abstract proof by Cybenko, which we will present in the following subsection. Although Cybenko's proof answered the question, it doesn't provide much intuition since it is not constructive. To see this constructive argument, let us first recall an ancient theorem of mathematics.

Theorem 4.4.1. Let $f$ be any continuous function from $\mathbb{R}^n$ to $\mathbb{R}$ and $\epsilon > 0$. Then there exists a $\delta > 0$ such that for any partition $P$ of $[0,1]^n$ into rectangles $P = (R_1, R_2, \cdots, R_N)$ with all side lengths not exceeding $\delta$, there exist scalars $\alpha_1, \alpha_2, \cdots, \alpha_N$ such that
$$\|h - f\| \le \epsilon \quad \text{where} \quad h = \sum_{i=1}^{N} \alpha_i I_{R_i} \tag{4.16}$$

Proof. Since $f$ is continuous from $\mathbb{R}^n$ to $\mathbb{R}$, $f$ is uniformly continuous on $[0,1]^n$ by the standard uniform continuity theorem. Thus for any $\epsilon > 0$, there exists a $\delta > 0$ so that for every $x, y \in [0,1]^n$ with $\|x - y\|_\infty < \delta$ we have $|f(x) - f(y)| \le \epsilon$. Now let a partition $(R_1, R_2, \cdots, R_N)$ be given, and suppose all side lengths are no greater than $\delta$, where $\delta$ is from the preceding analysis. For each $i \in \{1, 2, \cdots, N\}$, choose any $x_i \in R_i$ and set $\alpha_i = f(x_i)$, by which any other $x \in R_i$ satisfies $\|x - x_i\|_\infty \le \delta$, and thus $|f(x) - \alpha_i| \le \epsilon$. Therefore the function $h = \sum_{i=1}^{N} \alpha_i I_{R_i}$ satisfies
$$\sup_{x \in [0,1]^n} |h(x) - f(x)| \le \epsilon \tag{4.17}$$

Figure 4.5: Approximation of a continuous function with piece-wise constant functions.
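Theorem 4.4.1 is easy to check numerically. The sketch below (ours) builds the piece-wise constant function h of Equation 4.16 on [0, 1] and shows the sup-norm error shrinking as the rectangles get finer:

```python
import numpy as np

def piecewise_constant_approx(f, num_rectangles):
    """Return h(x) = sum_i alpha_i * I_{R_i}(x) on [0,1], with alpha_i = f(midpoint of R_i)."""
    edges = np.linspace(0.0, 1.0, num_rectangles + 1)
    alphas = f((edges[:-1] + edges[1:]) / 2.0)
    def h(x):
        idx = np.minimum((x * num_rectangles).astype(int), num_rectangles - 1)
        return alphas[idx]
    return h

f = lambda x: np.sin(2 * np.pi * x) * np.exp(-x)
x = np.linspace(0.0, 1.0, 1000)
for N in (10, 100, 1000):
    h = piecewise_constant_approx(f, N)
    print(N, np.max(np.abs(h(x) - f(x))))   # sup-norm error shrinks as N grows
```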

The theorem above tells us that if we can somehow design our neural network to output

piece-wise constant functions then we can approximate any continuous function. That is, if

we can show that we are able to construct boxes with our neurons then we can approximate

any continuous function. In one dimension, $f : [a, b] \to \mathbb{R}$, this is not so hard. To see this, let's consider a neural network consisting of one input neuron, a single hidden layer with 2 sigmoid

neurons, and an output layer with no activation. Mathematically, we can write this output

as:

$$\text{output} = w_1^2\,\sigma(w_1^1 x + b_1) + w_2^2\,\sigma(w_2^1 x + b_2) \tag{4.18}$$
where $\sigma(x) = \frac{1}{1+e^{-x}}$, and $x, w_1^1, w_2^1, w_1^2, w_2^2, b_1, b_2 \in \mathbb{R}$. Now, by picking $w_1^1$ and $w_2^1$ to be large values, and $w_1^2 = -w_2^2$ with $b_1 \neq b_2$, we have successfully created a box with our neural network. See Figure 4.6.

Figure 4.6: Two hidden neurons to create a box.

Therefore from this point of view we can see that a neural network with one hidden layer is enough to approximate an arbitrary continuous function f : D → R where D is a compact set, by stacking up the hidden neurons in such a way that each pair will output a box. See

Figure 4.7 to see how this is done.

It should be noted that because of the continuity of the sigmoid neuron, we will have a little fudge at the edge of these boxes, hence the approximation here should be taken in the

$L^p$ sense and not the uniform sense. However, the perceptron, which doesn't require continuity, can build these rectangles exactly.
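The box construction of Equation 4.18 can also be checked directly. In this sketch (ours), two steep sigmoids with opposite output weights produce an approximate indicator of the interval between the two shifts; the softness at the edges is exactly the 'fudge' mentioned above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def box(x, left, right, steepness=200.0, height=1.0):
    """Approximate indicator of [left, right] built from two sigmoid neurons, Eq. (4.18):
    a steep step up at 'left' minus a steep step up at 'right'."""
    return height * (sigmoid(steepness * (x - left)) - sigmoid(steepness * (x - right)))

x = np.linspace(0.0, 1.0, 11)
print(np.round(box(x, 0.25, 0.65), 3))   # approximately 1 inside [0.25, 0.65], 0 outside
```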

So far in this constructive argument approach, only one hidden layer is needed. This is because we have only considered the map $f : [a, b] \to \mathbb{R}$. If we try to replicate what we did previously, creating step functions and adding them together to create boxes, to approximate a function $f : [a, b]^N \to \mathbb{R}$ with $N \ge 2$, then we won't get exactly what we want. Therefore we

need an additional hidden layer to clean up the mess. First note that for us to approximate a map in higher dimensions, we will need towers (hyper-rectangles) instead of boxes (rectangles).

Figure 4.7: An illustration to show how a single hidden layer can be used to approximate a continuous function $f : D \to \mathbb{R}$.

Now, to see why things don't add up as nicely in the higher-dimensional case compared to the one-dimensional case, let us consider a continuous map from $\mathbb{R}^2$ to $\mathbb{R}$. We want to approximate this map. To do this, we need to be able to build a three-dimensional tower. Note that we have two input neurons, $x_1$ and $x_2$. Previously, in the one-dimensional case, we built a box by adding two step functions together. Using this idea, we can create four step functions

(in three dimensions), two in the x1 direction and two in x2 direction. To create these step functions, we can set the weights in the respective direction to be large and the rest equal to zero. The biases are chosen in such a way that they create a desired shift in the step functions. Now if we put these four step functions in a linear combination in such a way that everything goes to zero except the region within the shifts, then we get something that almost resembles a three-dimensional tower. See Figure 4.8 for a visual representation of

what we discussed above. In the univariate case, this would give us exactly the box we were looking for. However, here we have some sloppiness in the result that needs to be cleaned up.

Figure 4.8: An illustration to show how step functions can be built in the higher dimension case and how to add them together to form something almost like a tower. The gray arrows represent weights that are equal to zero.

In Figure 4.8 we would pick the weights $w_{11}^1, w_{12}^1, w_{23}^1, w_{24}^1$ to be large, whereas the rest of the weights (not written down explicitly, to simplify the picture) equal zero. The weights

$w_{11}^2, w_{21}^2, w_{31}^2, w_{41}^2$ have equal magnitude and they represent the height of the tower. They

are chosen in such a way that $w_{11}^2 = w_{31}^2 = -w_{21}^2 = -w_{41}^2$. Note that Figure 4.8 is almost a tower but not exactly, since there are some unwanted regions still left after the linear combination of the step functions. However, it's quite easy to get rid of these unwanted regions, by simply applying a threshold function. Thus another layer of neurons (with an

activation function) will be needed for this clean-up task, since a neuron's activation function

is basically a threshold function. See Figure 4.9.

Figure 4.9: An illustration to show how a 3-dimensional tower might be built using a neural network.

Thus each tower (in three dimensions) can be built from 4 neurons in the first hidden layer,

and 1 neuron in the second hidden layer. The output layer will consist of one output neuron

that will sum up all the towers from the second hidden layer. This construction algorithm

extends to higher dimensions, more than two variables, as well. That is, to construct a single

tower in $N$ dimensions, each of the $N$ input neurons will have two hidden neurons in the first layer to create $2N$ step functions, which will be added together. The second hidden layer will have one neuron acting as a threshold to clean up the sloppiness caused by the sum of the step functions.

Figure 4.10: An illustration to show how an $(N+1)$-dimensional tower can be built using a neural network.

Therefore a neural network with two hidden layers is sufficient to approximate any continuous function $f : D \to \mathbb{R}$ in the $L^p$ sense, where $D$ is a compact set in $\mathbb{R}^N$. This argument required no function compositions whatsoever, and it is as old as Jordan content or Lebesgue measure.

4.4.2 Approximation with Modern Analysis

Previously we argued that a neural network with 2 hidden layers is sufficient to approx- imate a continuous function in a constructive manner. Thus we know that a neural network with two hidden layers with a sufficient number of neurons can solve a huge class of prob- lems, specifically, all the problems that can be posed as a continuous map. In this second argument, we will take George Cybenko’s approach and use modern analysis to prove that a neural network with one hidden layer is sufficient to approximate a continuous function on a compact set. Although this proof is elegant, it’s not constructive, hence we lose all intuition on how this approximation is done.

Theorem 4.4.2 (Cybenko’s Approximation Theorem [48]). Let σ be any continuous dis- criminatory function. Then finite sums of the form

$$G(x) = \sum_{j=1}^{K} \alpha_j\, \sigma(w_j^T x + b_j) \tag{4.19}$$

are dense in $C(I_n)$, where $I_n$ denotes the $n$-dimensional unit cube $[0,1]^n$, $w_j, x \in \mathbb{R}^n$, and $\alpha_j, b_j \in \mathbb{R}$. We say that $\sigma$ is discriminatory if, for a measure $\mu \in M(I_n)$, the space of finite regular Borel measures on $I_n$,
$$\int_{I_n} \sigma(w^T x + b)\, d\mu(x) = 0$$
for all $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ implies that $\mu = 0$.

Proof. Let $S := \{\sum_{j=1}^{K} \alpha_j \sigma(w_j^T x + b_j) : w_j \in \mathbb{R}^n,\ \alpha_j, b_j \in \mathbb{R}\}$. First note that $S$ is a linear subspace of $C(I_n)$, the space of all continuous functions on $I_n$, since $\sigma$ is taken to be a continuous discriminatory function. Because we are working in infinite dimensions, a subspace need not be closed, so let us take the closure of $S$ and call it $Q$. That is, $Q = \overline{S}$. If $Q = C(I_n)$ then

S is dense in C(In) and we are done. So suppose Q is not all of C(In), hence Q ⊂ C(In).

By the Hahn-Banach theorem, there is a bounded linear functional $L$ on $C(I_n)$ with the property that

$$L \neq 0, \qquad L(Q) = L(S) = 0 \tag{4.20}$$

Now, by the Riesz–Markov–Kakutani representation theorem, $L$ can be written in the form
$$L(h) = \int_{I_n} h(x)\, d\mu(x) \tag{4.21}$$

for some $\mu \in M(I_n)$ and for all $h \in C(I_n)$. But since $\sigma(w^T x + b)$ is in $Q$ for all $w$ and $b$, we must have that
$$\int_{I_n} \sigma(w^T x + b)\, d\mu(x) = 0 \tag{4.22}$$
for all $w$ and $b$. However, $\sigma$ is a discriminatory function, so this implies $\mu = 0$, which is a contradiction. Thus, we must have $Q = C(I_n)$, so $S$ must be dense in $C(I_n)$.

The proof is basically just an application of modern analysis, hence it’s very short and el-

egant. It shows that the sums of the form in Equation 4.19 are dense in $C(I_n)$ provided that $\sigma$ is continuous and discriminatory. It turns out that all sigmoidal functions are continuous and discriminatory; see Lemma 1 in [48]. Recall that sigmoidal functions are continuous, differentiable functions with the property
$$\sigma(x) \to \begin{cases} 1 & \text{as } x \to \infty \\ 0 & \text{as } x \to -\infty \end{cases}$$

In particular, the activation function $\sigma(x) = \frac{1}{1+e^{-x}}$ is a continuous discriminatory function. Hence, neural networks made up of sigmoid neurons are universal.

It should be pointed out that this is only an approximation result. It showed that for

any given $f \in C(I_n)$ and $\epsilon > 0$, there is a sum, $G(x)$, of the form in Equation 4.19 for which $|G(x) - f(x)| < \epsilon$ for all $x \in I_n$. One might wonder if it's possible to obtain an equality

here instead of just an approximation. The answer is yes! This stronger result is known as the Kolmogorov Superposition Theorem [49], which is closely related to Hilbert's 13th problem, which involves the study of solutions of algebraic equations.

Theorem 4.4.3 (Kolmogorov's Superposition Theorem). Let $f : I_n := [0,1]^n \to \mathbb{R}$ be an arbitrary continuous multivariate function. Then it has the representation

$$f(x_1, \ldots, x_n) = \sum_{q=1}^{2n+1} g_q\Big(\sum_{p=1}^{n} \psi_{q,p}(x_p)\Big) \tag{4.23}$$
with continuous one-dimensional outer functions $g_q$ and inner functions $\psi_{q,p}$, where the functions $g_q$ and $\psi_{q,p}$ are defined on the real line. Furthermore, the inner functions $\psi_{q,p}$ are independent of $f$.

Many improvements have been made to the theorem since it was first published. For instance, George Lorentz showed in 1962 that only one outer function is needed [51]. That is, $g_q$ can be replaced by a single function $g$. Then in 1965 David Sprecher showed that the inner functions $\psi_{q,p}$ can be replaced by a single inner function $\psi$ with appropriate shifts in its argument [52]. In 1967 Buma Fridman showed that the inner functions $\psi_{q,p}$ in Theorem

4.4.3 can be chosen to be in the class Lip(1) [50].

CHAPTER 5

QUANTUM NEURAL NETWORK

In this chapter, we will discuss the design and fundamental structure of our Quantum

Neural Network (QNN) which was originally proposed and developed by Elizabeth Behrman and James Steck of Wichita State University. The first section will outline the fundamental structure of QNN. Then we will discuss the universal property of QNN and how an arbitrary classical problem can be implemented on it. Furthermore, we will provide an alternative approach using differential geometry, and discuss why this approach might not be optimal for hardware implementation.

5.1 Fundamental Structure

The fundamental equation of quantum mechanics and therefore of quantum computing is the Schrodinger equation, which again can be written in the density operator form as

$$\frac{d\rho}{dt} = \frac{1}{i\hbar}[H(t), \rho(t)]$$
The solution to the Schrodinger equation has been given in Equation 2.22. The quantum system must follow this equation through the computations. Therefore, we will use this equation as the base of our Quantum Neural Network model. The question is, how should we incorporate the Hamiltonian? In general, an $N$-qubit system Hamiltonian $H(t)$ is a

$2^N \times 2^N$ Hermitian matrix. Since the group

G = {σI , σx, σy, σz} (5.1)

where $\sigma_I, \sigma_x, \sigma_y, \sigma_z$, the Pauli matrices described in Equations 3.8 and 3.17, form an orthogonal basis for $\mathbb{C}^{2\times 2}$, the group

$$G_n = \{\sigma_I, \sigma_x, \sigma_y, \sigma_z\}^{\otimes n} \tag{5.2}$$
forms an orthogonal basis in $\mathbb{C}^{2^n \times 2^n}$. Thus, any Hermitian Hamiltonian $H$ can be written in terms of Pauli matrix Kronecker products; for instance, the most general form for a 4-qubit

Hamiltonian is

$$H(t) = \sum_{i,j,k,l=1}^{4} h_{ijkl}(t)\, \sigma_i \otimes \sigma_j \otimes \sigma_k \otimes \sigma_l \tag{5.3}$$
where $\sigma_1 = I$, $\sigma_2 = \sigma_x$, $\sigma_3 = \sigma_y$, $\sigma_4 = \sigma_z$ and $h_{ijkl} \in \mathbb{R}$. However, when doing quantum computations, we work with certain hardware designs and limitations, which leads to a more

restricted form of the Hamiltonian. In our work, we assume that our Hamiltonian has the

following form:

$$H(t) = \sum_{\alpha=1}^{N}\Big(K_\alpha(t)\,\sigma_{x_\alpha} + \epsilon_\alpha(t)\,\sigma_{z_\alpha}\Big) + \sum_{\alpha\neq\beta=1}^{N} \zeta_{\alpha\beta}(t)\,\sigma_{z_\alpha}\sigma_{z_\beta} \tag{5.4}$$

where $\sigma_{x_\alpha} = \sigma_I \otimes \sigma_I \otimes \cdots \otimes \sigma_x \otimes \cdots \otimes \sigma_I$ with $\sigma_x$ located in the $\alpha$th spot, and similarly for $\sigma_{z_\alpha}$. But $\sigma_{z_\alpha}\sigma_{z_\beta}$ represents regular matrix multiplication between $\sigma_{z_\alpha}$ and $\sigma_{z_\beta}$. $K_\alpha$ represents the tunneling amplitude of qubit $\alpha$; $\epsilon_\alpha$, the potential energy or bias of qubit $\alpha$; and $\zeta_{\alpha\beta}$ the coupling

parameter between qubits α and β. This is almost identical to the Hamiltonian of D-Wave

Systems Inc.’s quantum computing machines [33].
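As a concrete illustration of Equation 5.4, the numpy sketch below assembles the Hamiltonian matrix for N qubits from given values of K_α, ε_α, and ζ_αβ at a single time step; the helper names and parameter values are ours and purely illustrative:

```python
import numpy as np

I2 = np.eye(2)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

def single_site(op, alpha, n):
    """Kronecker product I x ... x op x ... x I with op at position alpha (0-indexed)."""
    out = np.array([[1.0 + 0.0j]])
    for site in range(n):
        out = np.kron(out, op if site == alpha else I2)
    return out

def hamiltonian(K, eps, zeta):
    """H = sum_a (K_a sx_a + eps_a sz_a) + sum_{a != b} zeta_ab sz_a sz_b, Eq. (5.4)."""
    n = len(K)
    H = np.zeros((2**n, 2**n), dtype=complex)
    for a in range(n):
        H += K[a] * single_site(sx, a, n) + eps[a] * single_site(sz, a, n)
        for b in range(n):
            if a != b:
                H += zeta[a, b] * single_site(sz, a, n) @ single_site(sz, b, n)
    return H

# Placeholder parameter values for a 2-qubit system
K, eps = np.array([2.5, 2.5]), np.array([0.1, -0.1])
zeta = np.array([[0.0, 0.3], [0.3, 0.0]])
H = hamiltonian(K, eps, zeta)
print(np.allclose(H, H.conj().T))   # Hermitian, as required
```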

Recall from the previous chapter that a classical neural network is nothing more than a mapping from an input state to an output state. See Figure 5.1. That is, we were given a set of training pairs, which consist of input states and the respective desired outputs. The correspondence between the inputs and outputs is a function, and this function defines the mapping. As long as this is a measurable function, this map can be approximated to an arbitrary precision by a neural network.

From this point of view, we can develop a similar analogy using dynamic evolution with the Schrodinger equation, taking inputs as initial quantum states and outputs as a measurement or measurements on those initial quantum states after some time evolution. Hence, this is a mapping $f : \mathbb{C}^{2^n \times 2^n} \to \mathbb{R}$. If you think of a quantum circuit, which implements an arbitrary quantum algorithm, then it's nothing but a map from the initial state of the qubits, which is usually taken as $|\psi(t_0)\rangle = |000\cdots 0\rangle$, to some final state $|\psi(t_f)\rangle$ before making a quantum measurement. See Figure 5.2 for a visualization of a 3-qubit quantum circuit as a map, $f : \mathbb{C}^{8\times 8} \to \mathbb{R}$.

Figure 5.1: Classical neural network as a map from $\mathbb{R}^N$ to $\mathbb{R}$

Figure 5.2: Quantum circuit representing a quantum algorithm as a map

Therefore, we can implement machine learning techniques to find this map! How would we define the weight parameters in a quantum setting? We can use the Hamiltonian! That is, the weight parameters to control this mapping are the parameters of the Hamiltonian, which consist of $K_\alpha(t)$, $\epsilon_\alpha(t)$, and $\zeta_{\alpha\beta}(t)$. Thus, this allows us to define a neural network on a quantum computer, hence the name "Quantum Neural Network". See Figure 5.3 for a visualization of a QNN structure.

Figure 5.3: Visualization of QNN structure

Although the Hamiltonian is time dependent, we can discretize it by gridding the time do- main into chunks, similar to numerical analysis techniques, and treat each time grid with a constant Hamiltonian. Thus, the state evolution in each time grid can be written as Equa- tion 2.23.
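Under this discretization, each time slice applies a fixed unitary U = exp(-iHΔt/ħ) to the density matrix. A minimal sketch (ours, with ħ = 1 and a placeholder two-qubit Hamiltonian) is:

```python
import numpy as np
from scipy.linalg import expm

def evolve(rho0, hamiltonians, dt, hbar=1.0):
    """Propagate rho through time slices with constant Hamiltonians:
    rho -> U rho U^dagger with U = exp(-i H dt / hbar) in each slice."""
    rho = rho0
    for H in hamiltonians:
        U = expm(-1j * H * dt / hbar)
        rho = U @ rho @ U.conj().T
    return rho

# Simple placeholder 2-qubit Hamiltonian: tunneling on both qubits plus a zz coupling
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2)
H = 2.5 * (np.kron(sx, I2) + np.kron(I2, sx)) + 0.3 * np.kron(sz, sz)

# Start the register in |00><00| and evolve through 10 equal time slices
rho0 = np.zeros((4, 4), dtype=complex)
rho0[0, 0] = 1.0
rho_final = evolve(rho0, [H] * 10, dt=0.05)
print(np.trace(rho_final).real)   # trace preserved (= 1), evolution is unitary
```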

5.2 Learning Algorithm

We have concretely defined a Quantum Neural Network in the previous section. In this section, we will show how we can train this Quantum Neural Network. By training, we mean finding the right parameters for the weights (the parameters $K_\alpha(t)$, $\epsilon_\alpha(t)$, $\zeta_{\alpha\beta}(t)$ of the Hamiltonian) so that it correctly approximates a mapping for a particular problem. To do this, we can use the calculus of variations as pointed out by Yann LeCun [78], where he reformulated backpropagation in terms of optimizing the Lagrangian $L$. First, we define the

Lagrangian, L, to be minimized, as

$$L = \frac{1}{2}(T - y)^2 + \int_0^{t_f} \lambda(t)\Big(\frac{d\rho}{dt} + \frac{i}{\hbar}\big[H, \rho\big]\Big)\gamma(t)\, dt \tag{5.5}$$
where $y$ is the output value of the Quantum Neural Network. The output $y$ corresponds to

the problem of interest. Generally, it can be written mathematically as

$$y = \mathrm{Tr}(M\rho(t_f)) \equiv \mathrm{Tr}(M\rho) \tag{5.6}$$

with M being the quantum measurement operator, usually taken as σz ⊗· · ·⊗σz, and ρ(tf ) is the quantum state right before the measurement is performed. The value T in Equation 5.5 is the “target value” corresponding to that input state. That is, it’s the desired theoretical value we want that particular input state to get mapped to. The integral term in Equation

5.5 represents the dynamics of the network, how it evolves in time, and for us it serves to constrain the system so that it will always satisfy the Schrodinger equation. Last but not least, the terms $\lambda(t)$ and $\gamma(t)$ are the Lagrange multipliers, which can be written as row and column vectors, respectively. This is the "error delta" in neural network terminology. It gives you the direction when doing backpropagation back through time.

To calculate the vector elements of the Lagrange multipliers λ and γ to use for the weight updates later, we take the first variation of L with respect to ρ:

$$\delta L_{\delta\rho} = -\big(T - \mathrm{Tr}(M\rho(t_f))\big)M\delta\rho + \int_0^{t_f} \lambda\Big(\frac{\partial}{\partial t}\delta\rho + \frac{i}{\hbar}[H, \delta\rho]\Big)\gamma\, dt \tag{5.7}$$
$$= -\big(T - \mathrm{Tr}(M\rho(t_f))\big)M\delta\rho + \int_0^{t_f} \lambda(\dot{\delta\rho})\gamma\, dt + \frac{i}{\hbar}\int_0^{t_f} \lambda H(\delta\rho)\gamma\, dt - \frac{i}{\hbar}\int_0^{t_f} \lambda(\delta\rho)H\gamma\, dt \tag{5.8}$$
Note that the overdot represents the time derivative, and the commutator is defined as $[H, \delta\rho] = H(\delta\rho) - (\delta\rho)H$. Now integration by parts gives us:

$$\int_0^{t_f} \lambda(\dot{\delta\rho})\gamma\, dt = \lambda(\delta\rho)\gamma\Big|_0^{t_f} - \int_0^{t_f} \dot{\lambda}(\delta\rho)\gamma\, dt - \int_0^{t_f} \lambda(\delta\rho)\dot{\gamma}\, dt \tag{5.9, 5.10}$$

Then, setting $\delta L_{\delta\rho} = 0$ we get:
$$0 = -\big(T - \mathrm{Tr}(M\rho(t_f))\big)M\delta\rho + \lambda(\delta\rho)\gamma\Big|_0^{t_f} \tag{5.11}$$
$$\qquad - \int_0^{t_f} \dot{\lambda}(\delta\rho)\gamma\, dt - \int_0^{t_f} \lambda(\delta\rho)\dot{\gamma}\, dt + \frac{i}{\hbar}\int_0^{t_f} \lambda H(\delta\rho)\gamma\, dt - \frac{i}{\hbar}\int_0^{t_f} \lambda(\delta\rho)H\gamma\, dt \tag{5.12}$$

which gives us the condition at $t_f$:
$$\big(T - \mathrm{Tr}(M\rho(t_f))\big)\cdot M = \lambda(t_f)\gamma(t_f) \tag{5.13}$$

and the differential equation to solve for the Lagrange multipliers as:

$$\dot{\lambda}\gamma + \lambda\dot{\gamma} - \frac{i}{\hbar}\lambda H\gamma + \frac{i}{\hbar}\gamma H\lambda = 0 \tag{5.14}$$

Now, using the classical gradient descent learning rule, we can update the weights through the equation
$$w_{\text{new}} = w_{\text{old}} - \eta \frac{\partial L}{\partial w} \tag{5.15}$$

Here $w$ represents the parameters $K_\alpha$, $\epsilon_\alpha$, $\zeta_{\alpha\beta}$, which are real constant values in each time grid, and $\frac{\partial L}{\partial w}$ is

$$\frac{\partial L}{\partial w} = \frac{i}{\hbar}\int_0^{t_f} \lambda(t)\Big[\frac{\partial H}{\partial w}, \rho\Big]\gamma(t)\, dt \tag{5.16}$$
$$= \frac{i}{\hbar}\int_0^{t_f} \sum_{ijk}\Big(\lambda_i \frac{\partial H_{ik}}{\partial w}\rho_{kj}\gamma_j - \lambda_i \rho_{ik}\frac{\partial H_{kj}}{\partial w}\gamma_j\Big)\, dt \tag{5.17}$$

Therefore, we can now train our Quantum Neural Network with the backpropagation tech-

nique and Equation 5.15. Classically, we would go back through the layers of the neural net-

work and update the weights with Equation 5.15, but here we propagate back through time

to change the weights. That is, we integrate from tf to 0 instead. Complete details, including

a derivation of the quantum dynamic learning paradigm using quantum backpropagation[78]

in time[79], are given in [56]. Also note that the above technique, since it uses the density

matrix, is applicable to any state of the quantum system, pure or mixed. Furthermore, it's not necessary that we always map to a scalar value. We can map to a specific quantum state as well. One might do that by modifying the Lagrangian to,

$$L = \frac{1}{2}\|\rho_{\text{target}} - \rho_{\text{output}}\|^2 + \int_0^{t_f} \lambda(t)\Big(\frac{d\rho}{dt} + \frac{i}{\hbar}\big[H, \rho\big]\Big)\gamma(t)\, dt \tag{5.18}$$
where $\rho_{\text{target}}$ is the desired quantum state we want to map to. Note that this will result in a little modification to the learning algorithm as well.
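In practice we use the analytic gradients of Equations 5.16–5.17; purely to illustrate the overall training loop (prepare input state, evolve through the time grid, measure, update the Hamiltonian parameters), here is a toy sketch of ours that replaces them with finite-difference gradients. All names and values are illustrative, and ħ is set to 1:

```python
import numpy as np
from scipy.linalg import expm

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2)
M = np.kron(sz, sz)                       # measurement operator sigma_z (x) sigma_z

def output(params, rho0, dt=0.1):
    """Evolve rho0 through len(params) slices and return y = Tr(M rho(t_f)), Eq. (5.6)."""
    rho = rho0
    for K1, K2, zeta in params:
        H = K1 * np.kron(sx, I2) + K2 * np.kron(I2, sx) + zeta * np.kron(sz, sz)
        U = expm(-1j * H * dt)
        rho = U @ rho @ U.conj().T
    return np.trace(M @ rho).real

def cost(params, rho0, target):
    return 0.5 * (target - output(params, rho0)) ** 2

def train_step(params, rho0, target, eta=0.2, h=1e-5):
    """Finite-difference stand-in for the analytic gradient of Eqs. (5.16)-(5.17)."""
    grad = np.zeros_like(params)
    for idx in np.ndindex(params.shape):
        shifted = params.copy()
        shifted[idx] += h
        grad[idx] = (cost(shifted, rho0, target) - cost(params, rho0, target)) / h
    return params - eta * grad

rho0 = np.zeros((4, 4), dtype=complex); rho0[0, 0] = 1.0    # input state |00><00|
params = 0.5 * np.ones((5, 3))                              # (K_A, K_B, zeta) per time slice
for _ in range(100):
    params = train_step(params, rho0, target=1.0)
print(cost(params, rho0, 1.0))
```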

5.3 Simulation of Classical and Quantum Logic Gates, and Quantum Circuit

As a demonstration, we will show that our Quantum Neural Network model can simulate all possible classical and quantum logic gates. We will train our QNN model to produce the classical two-input and one-output gates XOR, XNOR, NAND, and NOR, as well as the quantum two qubit CNOT gate.

Recall that the classical XOR and XNOR gates have the following truth tables:

Table 5.1: Truth table of XOR and XNOR logic gates in respective order

Input A  Input B  A XOR B        Input A  Input B  A XNOR B
0        0        0              0        0        1
0        1        1              0        1        0
1        0        1              1        0        0
1        1        0              1        1        1

We will implement the XOR and XNOR gates using a two-qubit system. The most obvious way to set up the inputs in Table 5.1 through a quantum state is to let input A represent the

first qubit and input B represent the second qubit in the quantum system. Then, we can take the quantum state $|0\rangle_A|0\rangle_B = |00\rangle$ as the representation of when both inputs A and B are 0. See Table 5.2 for a complete description of how we might set up the inputs for XOR

and XNOR in terms of quantum states along with their desired output values. Table 5.2 also shows the actual trained QNN output values along with the trained total root mean square

(RMS) error. To get these results, we first consider the 2-qubit system (qubits A and B) whose Hamiltonian is taken to be:

$$H = K_A\sigma_{x_A} + K_B\sigma_{x_B} + \epsilon_A\sigma_{z_A} + \epsilon_B\sigma_{z_B} + \zeta\,\sigma_{z_A}\sigma_{z_B} \tag{5.19}$$
where $\{\sigma\}$ are the Pauli operators corresponding to each of the qubits; $\{K = K(t)\}$ are the tunneling amplitudes, $\{\epsilon = \epsilon(t)\}$ are the biases, and $\{\zeta = \zeta(t)\}$ the qubit-qubit couplings, as we've discussed previously. We choose the usual "charge basis", in which each qubit's state is given as 0 or 1; for a system of 2 qubits there are 4 states, each labelled by a bit string each of whose numbers corresponds to the state of each qubit, in order. The amplitude for each qubit to tunnel to its opposing state (i.e., switch between the 0 and 1 states) is its $K$ value; each qubit has an external bias represented by its $\epsilon$ value; and each qubit is coupled to each of the other qubits, with a strength represented by the appropriate $\zeta$ value. The parameters of the training, $\{K, \epsilon, \zeta\}$, can be found in Figures A.1, A.2, A.3, A.4, A.5, and

A.6 in Appendix A.

Similar results can be derived for all other logic gates, as we will see in Chapter 7, when we benchmark our QNN model against various classical neural networks. It should be noted, though, that the NAND gate was simulated on a 3-qubit system instead of two, since it is the reversible equivalent of the Toffoli gate; see Section 3.2 and Figure 3.10. Therefore, it would be of more interest to show the implementation of a quantum gate, and a small but popular quantum circuit that creates a maximally entangled state. The quantum gate we will simulate here will be the 2-qubit CNOT gate. Recall that without the CNOT gate, there would be no interaction between the qubits, and hence no quantum advantage. See Table 5.3

Table 5.2: Inputs and Outputs (both desired and actual) for the QNN model of the XOR and XNOR gate. The input is the initial, prepared, state of the two-qubit system, at t = 0; the output, the square of the measured value of the qubit-qubit correlation function at the final time.

                          XOR                                XNOR
Input state     Desired Output   QNN Output       Desired Output   QNN Output
|00⟩                  0          6.03 x 10^-4           1          0.999
|01⟩                  1          0.999                  0          5.99 x 10^-4
|10⟩                  1          0.999                  0          7.08 x 10^-4
|11⟩                  0          6.16 x 10^-4           1          0.999
Total RMS error                  9.6 x 10^-4                       9.6 x 10^-4

for the result of the simulation of the CNOT gate, and Figures A.7, A.8, and A.9 in Appendix A for the trained parameter functions $\{K, \epsilon, \zeta\}$.

Table 5.3: QNN simulation of the CNOT gate. Recall that CNOT maps: $|00\rangle \to |00\rangle$, $|01\rangle \to |01\rangle$, $|10\rangle \to |11\rangle$, and $|11\rangle \to |10\rangle$.

Input state |00⟩, QNN output density matrix:
[ 1.0000+0.0000i  -0.0002+0.0003i   0.0018+0.0023i   0.0026+0.0008i ]
[-0.0002-0.0003i   0.0000+0.0000i   0.0000-0.0000i  -0.0000-0.0000i ]
[ 0.0018-0.0023i   0.0000+0.0000i   0.0000+0.0000i   0.0000-0.0000i ]
[ 0.0026-0.0008i  -0.0000+0.0000i   0.0000+0.0000i   0.0000+0.0000i ]

Input state |01⟩, QNN output density matrix:
[ 0.0000+0.0000i   0.0002-0.0003i   0.0000+0.0000i  -0.0000-0.0000i ]
[ 0.0002+0.0003i   1.0000+0.0000i  -0.0011+0.0018i  -0.0004-0.0029i ]
[ 0.0000-0.0000i  -0.0011-0.0018i   0.0000+0.0000i  -0.0000+0.0000i ]
[-0.0000+0.0000i  -0.0004+0.0029i  -0.0000-0.0000i   0.0000+0.0000i ]

Input state |10⟩, QNN output density matrix:
[ 0.0000+0.0000i  -0.0000+0.0000i  -0.0000-0.0000i  -0.0026-0.0008i ]
[-0.0000-0.0000i   0.0000+0.0000i   0.0000+0.0000i   0.0004+0.0029i ]
[-0.0000+0.0000i   0.0000-0.0000i   0.0000+0.0000i   0.0007+0.0001i ]
[-0.0026+0.0008i   0.0004-0.0029i   0.0007-0.0001i   1.0000+0.0000i ]

Input state |11⟩, QNN output density matrix:
[ 0.0000+0.0000i   0.0000-0.0000i  -0.0018-0.0023i   0.0000+0.0000i ]
[ 0.0000+0.0000i   0.0000+0.0000i   0.0011-0.0018i  -0.0000+0.0000i ]
[-0.0018+0.0023i   0.0011+0.0018i   1.0000+0.0000i  -0.0007-0.0001i ]
[ 0.0000-0.0000i  -0.0000-0.0000i  -0.0007+0.0001i   0.0000+0.0000i ]

Total RMS error: 1.4 x 10^-4

Now that we have simulated the CNOT gate, let's demonstrate the simulation of the quantum circuit that creates the maximally entangled (Bell) state. This circuit was first mentioned in Chapter 3; see Figure 3.9. It maps the state $|\psi\rangle = |00\rangle$ to the state $|\psi\rangle_{\text{Bell}} = \frac{1}{\sqrt{2}}\big(|00\rangle + |11\rangle\big)$. See Table 5.4 for the result of the training, and Figures A.10 and A.11 for the trained parameter functions.

Table 5.4: QNN training result for the quantum circuit that generates the maximally entangled Bell state.

Input state: |ψ⟩ = |00⟩, i.e.
ρ = [ 1  0  0  0 ]
    [ 0  0  0  0 ]
    [ 0  0  0  0 ]
    [ 0  0  0  0 ]

Desired output state:
ρ_Bell = [ 0.5  0  0  0.5 ]
         [ 0    0  0  0   ]
         [ 0    0  0  0   ]
         [ 0.5  0  0  0.5 ]

QNN output state:
ρ_QNN = [ .5000  .0001  .0001  .5000 ]
        [ .0001  .0000  .0000  .0000 ]
        [ .0001  .0000  .0000  .0001 ]
        [ .5000  .0001  .0001  .5000 ]

Total RMS error: 6.18 x 10^-16

5.4 Universal Property of Quantum Neural Network

We have seen that classical neural networks with sigmoid neurons are universal using

two different arguments: one using a constructive technique with boxes/towers and the other

using modern analysis. Both techniques showed that neural networks with sigmoid neurons

can approximate any continuous function on a compact set. The universality argument could

have been done a little differently; in particular, we could show that any computational task can

be replaced with neural networks. For clarity, we can show that a neural network with

layers of perceptrons can do any computational task, by first showing that a perceptron can

efficiently implement a NAND logic function, a well known result. Recall the perceptron

model.

Figure 5.4: Perceptron model

Then by picking the weights $w_1 = -1$, $w_2 = -1$ and $b = 2$, we have successfully implemented the perceptron as a NAND logic gate. Recall that the NAND logic function maps everything to 1 except the state 11 ($x_1 = 1$, $x_2 = 1$), which gets mapped to 0. Now recall that NAND gates are universal for classical computation. That is, all classical computational tasks can be realized as a sequence of NAND gates. For example, computational tasks such as adding or subtracting two 1-bit numbers $x_1$ and $x_2$ (half adder and half subtractor) can be built from only NAND gates. See Figure 5.5.

89 Figure 5.5: a) Half adder using NAND gates. b) Half subtractor using NAND gates.
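The NAND claim is easy to verify with the rule of Equation 4.2 and the quoted weights; this short check (ours) prints the full truth table:

```python
def perceptron(x1, x2, w1=-1, w2=-1, b=2):
    """Hard-threshold perceptron, Eq. (4.2), with the NAND weights quoted above."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron(x1, x2))   # outputs 1, 1, 1, 0 -- the NAND truth table
```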

The idea is we can replace the half adder and subtractor circuit models using NAND gates with perceptrons. For instance, Figure 5.6 shows a model where the half subtractor is being modelled as a network of perceptrons.

Figure 5.6: Half subtractor modeled by a network of perceptrons.

The previous example of replacing the half subtractor circuit consisting of NAND gates is a classic example in much of the neural network literature, but the point we are trying to make here is that one way to determine the universality of a neural network is simply by determining whether it can replicate arbitrary computational tasks. So now let's shift our focus back to the quantum neural network model: Can our quantum neural network model perform any computational task? The answer is yes! Recall that our QNN is simulating the most fundamental structure of a quantum computer, the time evolution using the Hamiltonian of the system. And since any quantum algorithm can be thought of as a unitary map from the input state to the output state, this unitary map can be realized by our QNN model. Then one might ask: What about classical computations? These tasks are highly non-linear, so how can the QNN simulate them? It can, and this question has been answered all along. Remember that in Chapter 3, we showed that any classical computation can be made reversible and hence implementable on a quantum computer via ancilla qubits, and the overhead is not too bad. This was shown by Charles Bennett in the early 1970s. Therefore, we can do the same with the QNN. Encoding our input into a larger quantum state will ensure that our QNN can model these highly non-linear classical computations. We must remember that we are working at the most fundamental level of the computational system. At this level, all computational tasks are nothing but sequences of gates, as shown by the half adder and subtractor examples. It's easy to forget this and let ourselves compare the universality of classical neural networks to quantum neural networks in the sense of composition of functions. There are no such things at the fundamental level. That is, when you program a classical multi-layered neural network, the execution step will turn everything you programmed into a classical circuit with a sequence of gates. And this classical circuit can be made into a quantum circuit with small overhead, which our QNN can implement! After all, if a quantum computer couldn't perform a task that can be done on a classical computer, then there would be no point in building one. Another

point we want to make is that a quantum measurement is not considered to be part of the closed system, hence it is not a unitary operation. Not only is measurement a non-unitary operation, but also, recall that measurement in quantum mechanics causes collapse of the quantum state or wavefunction, so there is measurement-induced non-linearity as well. This is perhaps a controversial avenue to take our non-linearity argument into, because fundamentally we still don't have a good understanding of the collapse of quantum states post-measurement. In Chapter 9, we will discuss Hybrid Quantum Neural Networks, a model where we combine classical neural networks on top of quantum systems. In this approach, it is possible for us to achieve highly non-linear computations without having to add additional ancilla qubits into our system.

5.5 An Alternative Learning Approach

Here we provide an alternative approach to training the quantum neural network. Previously, we derived the learning algorithm through the dynamics of the Hamiltonian, which we take to have the form in Equation 5.4. The weights $\{K, \epsilon, \zeta\}$ were found to create a mapping from the input states to their corresponding outputs (supervised learning) through the Schrodinger equation, Equation 2.21, which is a unitary operator, followed by a quantum measurement. Recall that we discretized the time space and assumed the Hamiltonian constant during each time step. Thus at each time step we have a set of $\{K, \epsilon, \zeta\}$, which combine together to form our unitary operator. Now we offer an alternative global approach:

Instead of dealing with the Hamiltonian locally at each time step, we will consider the entire map from the initial $\rho(t_0)$ to $\rho(t_f)$ as an arbitrary unitary operator followed by a quantum measurement. Our learning process now is to find the right unitary operator to accomplish

the tasks at hand. One way to do this is by defining the Lagrangian, $L$, to be

$$L = \frac{1}{2}\Big(T - \mathrm{Tr}\big(M\cdot U\rho(t_0)U^\dagger\big)\Big)^2 + \big\|UU^\dagger - I\big\| \tag{5.20}$$

The last term in Equation 5.20 is a penalty term to force the map from $\rho(t_0)$ to $\rho(t_f)$ to stay unitary. However, if we use the usual gradient descent weight update technique as in Equation 5.15, we will have a weight update scheme similar to Equation 5.21, but this will cause $U$ to deviate from being unitary, and the optimization of the Lagrangian will become difficult.

$$U_{ij}^{\text{new}} = U_{ij}^{\text{old}} + \eta \frac{\partial L}{\partial U_{ij}} \tag{5.21}$$

The reason Equation 5.21 is not a good approach is that the unitary group, U(n), is not closed under addition! It is closed under multiplication, however. Furthermore, it is well known that the group U(n) forms a smooth manifold. A clever idea is to remove this penalty term by moving our problem into the space of the constraint, the space of unitary matrices. This type of constrained optimization technique has been previously explored in detail [159, 160, 161, 162, 162]. We have incorporated these techniques in our work with great results.
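One common way to stay on the unitary group during such an optimization, rather than penalizing the constraint, is to retract each updated matrix back onto U(n), for example via the polar decomposition. The sketch below (our own illustration of the general idea, not the specific algorithms of the cited references) shows this projection:

```python
import numpy as np

def project_to_unitary(A):
    """Closest unitary to A (polar decomposition via SVD): A = U S V^dag -> U V^dag."""
    U, _, Vh = np.linalg.svd(A)
    return U @ Vh

rng = np.random.default_rng(0)
U0 = project_to_unitary(rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4)))
# An additive (Euclidean) gradient step leaves the unitary group...
step = U0 + 0.1 * (rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4)))
print(np.linalg.norm(step @ step.conj().T - np.eye(4)))     # noticeably nonzero
# ...but retracting restores unitarity while staying close to the stepped matrix
U1 = project_to_unitary(step)
print(np.linalg.norm(U1 @ U1.conj().T - np.eye(4)))         # ~ 1e-15
```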

The problem with this approach is the lack of an implementation scheme on hardware.

In hardware, we can't control the unitary evolution directly; rather, we control the parameters of the Hamiltonian by some external fields to achieve the desired unitary evolution. This external field can come from a variety of different sources. For instance, as we will see in Chapter 8, IBM uses microwave pulses to control the evolution of the qubits in their gate-model quantum computers. Thus, by training the time-varying parameter functions in the Hamiltonian, we will have direct control of the system at the fundamental level, at least in principle. Currently, all quantum gate models only allow us to set up our computation from a set of universal gates, and do not give direct control of the Hamiltonian. However, this

is slowly changing, as quantum computing companies like IBM are starting to allow their users to operate at the more fundamental and abstract level known as pulse-level control. This is a step in the right direction for a direct implementation of our QNN model onto gate-based quantum systems without having to do any gate decomposition, and it validates the reason why it would be better to train the Hamiltonian of the system rather than the overall unitary evolution operator.

CHAPTER 6

ROBUSTNESS OF QUANTUM NEURAL NETWORK

Noise and decoherence are two major obstacles to the implementation of large-scale quantum computing, as we have discussed in previous sections. In this section, we show that our model of a quantum neural network (QNN) is robust to noise, and that, in addition, it is robust to decoherence. Moreover, robustness to noise and decoherence is not only maintained but improved as the size of the system is increased. Noise and decoherence may even be of advantage in training, as it helps correct for overfitting. We demonstrate the robustness using entanglement as a means for pattern storage in a qubit array. Our results provide evidence that machine learning approaches can obviate otherwise recalcitrant problems in quantum computing.

We will start this chapter with a quick review of the differences between quantum and classical computing along with the problems with noise and decoherence in Sections 6.1 and

6.2. In Section 6.3, we will show that our Quantum Neural Network is robust to noise, decoherence, and the combination of both, on a two-qubit system. In Section 6.4, we extend these robustness results to larger systems, in particular the 3-, 4-, and 5-qubit systems. We also show a stability analysis of the noise calculation. In

Section 6.5, we show a potential application of the quantum neural network that is a consequence of its robustness property.

6.1 Quantum Computing in the Classical World

Quantum computing may very well be the way to find solutions to a host of calculations that are difficult to do with a classical computer [84]. Moreover, it offers the opportunity to approach true simulation of physical reality [85], on the fundamental scale. But when it comes to scaling up from proof of concept hardware to practical size, significant problems are yet to be solved. Among the most recalcitrant ones are the issues related to noise [86] and decoherence [87]. The fact that noise is a special problem in quantum computing may seem peculiar. After all, data scientists doing any kind of computation have always known that if there are enough errors in the input, the output will be wrong. One obvious way of guarding against data errors is redundancy: essentially, to make multiple backups and average out the noise. But quantum mechanical computers cannot use simple redundancy, because there is no easy way to make copies of unknown quantum states. This is called the

“no-cloning theorem,” and is an immediate consequence of a quantum systems being in a superposition of states, unknown until measured – and measurement collapses and destroys the state. This is a fundamental rule in quantum mechanics [2]. Classical noise distributions can be handled using the theory of stochastic processes [88], but, again, quantum mechanics limits both the ease and the effectiveness of these kinds of theories. So other methods need to be used. And in addition, in quantum computation there is also the unique problem of decoherence, which arises from unwanted interactions with the environment. Quantum mechanics is fragile, which is why, on a macroscopic scale, we rarely need to take quantum effects into account: Unless the quantum processes are extremely well isolated, the quan- tum state will decohere and become, essentially, classical. (Indeed, the precise nature and implications of the ways in which decoherence leads to the loss of quantum information and the emergence of classicality is a fundamentally extremely interesting problem, as it lies at the crux of the nature of quantum reality [89].) When this occurs in a quantum computer,

the quantum nature of the computation is lost. So, if we are specifically interested in doing quantum computing, we need to guard against these kinds of effects as well.

6.2 Dealing with Noise and Decoherence

Most researchers who address the problem of noise in quantum computing use the method of ancilla [90][91], which are extra quantum bits (qubits) for error correction, as we have discussed in the “Quantum Computing” chapter. The problem with this approach is the fast growth of the number of ancilla necessary, making scaleup much harder. Some recent papers on the decoherence problem are those of Glickman [92], Takahashi [93], Roszak [94], Dong

[95], and Cross [96]. Glickman et al. [92] used the scattering of photons to understand the process of decoherence better. Their experience showed that it might be possible to control decoherence in a quantum system by taking advantage of an atom’s spin state. Takahashi et al. [93] were able to use high magnetic fields to suppress quantum decoherence to levels far below the threshold necessary for quantum information processing. Roszak et al. [94] sug- gested a general approach to protect a two level system against decoherence by engineering a non-classical multiple superposition of coherent states in a non-Markovian reservoir.

Neural network methods are another way of dealing with noise. Because of the distributed nature of the computation and the multiple interconnectivity of the architecture, classical neural networks are inherently robust to noise [97]. Our research group has been investigating the advantages of a machine learning approach to quantum computing for some time [60][61][62][63]. More recently, a number of other researchers have also explored these directions [98][99][100]. There are now a number of proposed architectures for quantum neural networks. See [101] for a pedagogic introduction. Dong et al. [95] presented a systematic

numerical methodology called sampling-based control for robust quantum design of a quantum system. This method is similar to ours in that it also uses machine learning.

In the training step, the control is learned using a gradient-flow-based learning algorithm for an augmented system constructed from samples. Cross et al. [96] showed that quantum learning is robust to noise by proving that the complexity of learning is the same for classical and quantum methods. However, when noise is introduced, the best classical algorithm has superpolynomial complexity, whereas the complexity of the quantum algorithm is only logarithmic. Several groups [102][103][104][105] use added ancillas to address the problem of unwanted interactions.

In this work, we use our machine learned entanglement witness to explore and address the question: Is our quantum neural network robust to noise and to decoherence? We will

first show that this is indeed true on the simple two-qubit system [65]. Afterward, we will generalize to larger systems and show that, in fact, the robustness is maintained and even improved with increasing system size [66]. Indeed, it may even be true that the presence of noise improves learning, as it prevents overfitting [106]. We also present preliminary results of an application, using entanglement as a means of pattern storage. Techniques of machine learning and neural networks may provide solutions to many of the problems facing large scale quantum computing.

6.3 Entanglement Calculation For Two-Qubit System

As a test bed for investigation of the robustness to noise and decoherence of our architec- ture for a quantum neural network, we choose the calculation of an entanglement indicator or witness. Entanglement, like the no-cloning theorem, is a direct result of superposition.

Entanglement is a purely quantum phenomenon not possible with classical states. Recall that mathematically a general state $|\psi\rangle$ of a 2-qubit composite system, say qubit A and qubit B, can be written as

$$|\psi\rangle = a|00\rangle + be^{i\theta_1}|01\rangle + ce^{i\theta_2}|10\rangle + de^{i\theta_3}|11\rangle \qquad (6.1)$$

where $a, b, c, d, \theta_1, \theta_2, \theta_3 \in \mathbb{R}$. Equation 6.1 represents a general state of both qubits A and B, but it is not necessarily a product state. That is, there are states of the composite system that are not expressible as a product state $|\psi\rangle_A \otimes |\psi\rangle_B$. For instance, recall from Section 2.2 that the Bell state
$$|\psi\rangle = \frac{|00\rangle + |11\rangle}{\sqrt{2}},$$
which is a maximally entangled state, cannot be written as a product state, whereas the state
$$|\psi\rangle = \frac{|00\rangle + |01\rangle + |10\rangle + |11\rangle}{2}$$
is a product state (non-entangled), since it can be written as $|\psi\rangle = \frac{1}{\sqrt{2}}\big(|0\rangle + |1\rangle\big) \otimes \frac{1}{\sqrt{2}}\big(|0\rangle + |1\rangle\big)$. Since there is no closed-form solution for the entanglement of a general state of even a two-qubit system, much less of a many-qubit system, there is no classical or quantum algorithm for its calculation. The problem seems ideal for a quantum neural network, which could be trained on known exemplars and then generalized.

We begin with the Schrödinger equation for the time evolution of the density matrix ρ:

$$\frac{d\rho}{dt} = \frac{1}{i\hbar}[H, \rho] \qquad (6.2)$$
where H is the Hamiltonian. We consider the 2-qubit system (qubits A and B) whose

Hamiltonian is:

$$H = K_A\sigma_{xA} + K_B\sigma_{xB} + \epsilon_A\sigma_{zA} + \epsilon_B\sigma_{zB} + \zeta\,\sigma_{zA}\sigma_{zB} \qquad (6.3)$$

where {σ} are the Pauli operators corresponding to each of the qubits; see Equations 3.8 and 3.17. {K} are the tunneling amplitudes, {ε} are the biases, and {ζ} the qubit-qubit

couplings as we’ve discussed previously. We choose the usual “charge basis ”, in which each

qubit’s state is given as 0 or 1; for a system of 2 qubits there are 4 states, each labelled by

a bit string each of whose numbers corresponds to the state of each qubit, in order. The

amplitude for each qubit to tunnel to its opposing state (i.e., switch between the 0 and 1

states) is its K value; each qubit has an external bias represented by its ε value; and each

qubit is coupled to each of the other qubits, with a strength represented by the appropriate

ζ value. Note that, for example, the operator σxA = σx ⊗ σI , whereas σxB = σI ⊗ σx.
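As a numerical illustration of how these operators are assembled, the following sketch builds the Hamiltonian of Equation 6.3 from Kronecker products and propagates a density matrix through one short, piecewise-constant time step. The parameter values are placeholders of roughly the right magnitude, not the trained functions, and we set ħ = 1 for simplicity.

```python
import numpy as np
from scipy.linalg import expm

# Pauli matrices and the 2x2 identity
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
si = np.eye(2, dtype=complex)

# Two-qubit operators: sigma_xA = sx (x) I, sigma_xB = I (x) sx, and similarly for sigma_z
sxA, sxB = np.kron(sx, si), np.kron(si, sx)
szA, szB = np.kron(sz, si), np.kron(si, sz)

def hamiltonian(K, eps, zeta):
    """Equation 6.3 with the symmetric choice K_A = K_B = K and eps_A = eps_B = eps."""
    return K * (sxA + sxB) + eps * (szA + szB) + zeta * szA @ szB

def propagate(rho, K, eps, zeta, dt):
    """Evolve rho over one time step with the Hamiltonian held constant (hbar = 1)."""
    U = expm(-1j * hamiltonian(K, eps, zeta) * dt)
    return U @ rho @ U.conj().T

# Example: start from the Bell state and take one 0.8 ns step (placeholder parameters)
bell = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
rho = np.outer(bell, bell.conj())
rho = propagate(rho, K=0.0025, eps=1e-4, zeta=1e-4, dt=0.8)
print(np.trace(rho).real)   # unitary evolution preserves the trace (1.0)
```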

As mentioned in the previous chapter, the parameter functions {K(t), ε(t), ζ(t)} direct

the time evolution of the system in the sense that, if one or more of them is changed, the

way a given initial state will evolve in time will also change, because of Eqs. 6.2-5.19. This

is the basis for using our quantum system as a neural network. The role of the input vec-

tor is played by the initial density matrix ρ(0), the role of the output by (some function

of) the density matrix at the final time, ρ(tf ), and the role of the “weights” of the net-

work by the parameter functions of the Hamiltonian, {K, ε, ζ}, all of which can be adjusted

experimentally[77]. By adjusting these parameters using a machine learning algorithm we

can train the system to evolve in time from an input state to a set of particular final states at

the final time tf . Because the time evolution is quantum mechanical (and, we assume, coher-

ent), a quantum mechanical function, like an entanglement witness of the initial state, can be

mapped to an observable of the system’s final state, a measurement made at the final time tf .

We found [56, 176] a set of parameter functions that successfully map the input (initial)

state of a two-qubit system to a good approximation of the entanglement of formation of

that initial state, using as the output the qubit-qubit correlation function at the final time,

$\langle\sigma_{z1}(t_f)\,\sigma_{z2}(t_f)\rangle^2$. This set of parameter functions was relatively easily generalized to three-, four-, and five-qubit systems [58]. Here, we will consider the effect of noise on only the simplest case, of two qubits (N = 2), and for ease of notation we will call the two qubits A

and B. With continuum[59, 176] rather than piecewise constant parameter functions[56, 58]

training is more rapid and complete; see Figure 6.1. The parameter functions are found by

training with a set of just four initial quantum states (“inputs”), as shown in Table 6.1: a fully entangled state (“Bell”), a “Flat” state (equal amounts of all basis states), a product state “C” whose initial (t = 0) correlation function $\langle\sigma_{zA}\sigma_{zB}\rangle^2$ is nonzero, and a partially entangled state “P”. The parameter functions are shown in Figures 6.2, 6.3, and 6.4. Because of the symmetry in the Hamiltonian and in the training set, $K_A = K_B = K$ and $\epsilon_A = \epsilon_B = \epsilon$.

Each is a relatively simple function, well parametrized by only one frequency (ε, ζ) or two

(K), as shown in Table 6.2 and plotted in Figures 6.2, 6.3, and 6.4. The (input, output) pairs are:

$$\text{input} = \rho(0) = |\Psi(0)\rangle\langle\Psi(0)| \qquad (6.4)$$

$$\text{output} = \langle\sigma_{zA}(t_f)\,\sigma_{zB}(t_f)\rangle^2 \rightarrow \text{target}$$

with prepared input states at zero time, and corresponding targets, given in Table 6.1. This

table also shows the QNN indicated entanglement values after training has finished and, for

comparison, the entanglement of formation, calculated using the analytic formula [71].

Entanglement of formation is not the only measure of entanglement, of course,

but any one that we chose would have qualitatively similar behavior, which we would like

our entanglement indicator to imitate. That is, we seek here not exact agreement with EF

(in which case we would train the state $\frac{1}{\sqrt{3}}(|00\rangle + |01\rangle + |10\rangle)$ to a target value of 0.55), but a robust and internally self-consistent measure, which we would hope would track well

with an analytic measure like EF . The QNN indicator systematically underestimates EF

101 for partially entangled pure states; this is because we found through simulation that the net

naturally trained to the target value of 0.44. See [56] for details.
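The output measure itself is simple to evaluate: $\langle\sigma_{zA}\sigma_{zB}\rangle^2$ is just $\mathrm{Tr}[(\sigma_z\otimes\sigma_z)\rho]$ squared. The small sketch below applies it to the four training states of Table 6.1 at the initial time, which shows, for example, why the “C” state is classically correlated (nonzero value) while “Flat” is not; the trained indicator is this same measurement evaluated on ρ(tf) after the learned evolution.

```python
import numpy as np

sz = np.array([[1, 0], [0, -1]], dtype=complex)
M = np.kron(sz, sz)                     # the measured operator sigma_zA sigma_zB

def witness_output(rho):
    """The QNN output measure <sigma_zA sigma_zB>^2 for a density matrix rho."""
    return np.trace(M @ rho).real ** 2

def ket(amplitudes):
    v = np.array(amplitudes, dtype=complex)
    return v / np.linalg.norm(v)

# The four training inputs of Table 6.1, in the basis |00>, |01>, |10>, |11>
states = {
    "Bell": ket([1, 0, 0, 1]),
    "Flat": ket([1, 1, 1, 1]),
    "C":    ket([0, 0, 0.5, 1]),
    "P":    ket([1, 1, 1, 0]),
}
for name, psi in states.items():
    rho0 = np.outer(psi, psi.conj())
    print(name, round(witness_output(rho0), 3))
# Bell 1.0, Flat 0.0, C 0.36, P 0.111 -- values at t = 0, before any evolution
```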

Figure 6.1: Total root mean squared error for the training set as a function of epoch (pass through the training set), for the 2-qubit system, with zero noise. Asymptotic error is 1.6×10−3. For comparison: with piecewise constant functions a similar level of error required 2000 epochs.

Table 6.1: Training data for QNN entanglement witness.

Input state |Ψ(0)⟩                           Target    Trained        EF
Bell = (1/√2)(|00⟩ + |11⟩) = |Φ+⟩            1.0       0.998          1.0
Flat = (1/2)(|00⟩ + |01⟩ + |10⟩ + |11⟩)      0.0       1.2 × 10⁻⁵     0.0
C = (1/√1.25)(0.5|10⟩ + |11⟩)                0.0       1.8 × 10⁻⁴     0.0
P = (1/√3)(|00⟩ + |01⟩ + |10⟩)               0.44      0.44           0.55
RMS Error: 1.6 × 10⁻³          Epochs: 100

102 Figure 6.2: Parameter function KA = KB as a function of time (data points), as trained at zero noise for the entanglement indicator, and plotted with the Fourier fit (solid line).

Figure 6.3: Parameter function A = B as a function of time (data points), as trained at zero noise for the entanglement indicator, and plotted with the Fourier fit (solid line).

Figure 6.4: Parameter function ζ as a function of time (data points), as trained at zero noise for the entanglement indicator, and plotted with the Fourier fit (solid line).

Table 6.2: Curve-fit coefficients for parameter functions K, ε, ζ, for QNN entanglement witness. f(t) = a0 + a1 cos(ωt) + b1 sin(ωt) + a2 cos(2ωt) + b2 sin(2ωt)

Coefficient    K                ε                ζ
a0             0.00248          9.89 × 10⁻⁵      9.89 × 10⁻⁵
a1             3.68 × 10⁻⁶      −4.96 × 10⁻⁶     −1.46 × 10⁻⁵
b1             1.95 × 10⁻⁶      4.55 × 10⁻⁶      −2.34 × 10⁻⁵
a2             3.70 × 10⁻⁶      ——               ——
b2             8.68 × 10⁻⁷      ——               ——
ω              0.0250           0.0250           0.0575
RMS Error      9.77 × 10⁻⁷      1.84 × 10⁻⁶      7.29 × 10⁻⁶
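For concreteness, the fitted form in Table 6.2 can be evaluated directly; the helper below reproduces f(t) = a0 + a1 cos(ωt) + b1 sin(ωt) + a2 cos(2ωt) + b2 sin(2ωt), using the K-column coefficients as an example (times in the same nanosecond units as the simulation).

```python
import numpy as np

def param_function(t, a0, a1, b1, omega, a2=0.0, b2=0.0):
    """Fourier-fit form used for the trained parameter functions K(t), eps(t), zeta(t)."""
    return (a0 + a1 * np.cos(omega * t) + b1 * np.sin(omega * t)
               + a2 * np.cos(2 * omega * t) + b2 * np.sin(2 * omega * t))

# K(t) from the first column of Table 6.2, over 317 steps of 0.8 ns (about 253 ns total)
t = np.arange(0, 253, 0.8)
K = param_function(t, a0=0.00248, a1=3.68e-6, b1=1.95e-6,
                   omega=0.0250, a2=3.70e-6, b2=8.68e-7)
print(K[:3])
```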

104 Once trained, the parameter functions that are found can be tested, by using the Hamil- tonian so defined to calculate the QNN indicator for other initial states. Testing was therefore done on a large number of states not represented in the training set, including fully entan- gled states, partially entangled states, product (unentangled) states, and even mixed states.

(Note that only pure states were present in the training sets.) The interested reader is re- ferred to our previous work for the (extensive) testing results [56, 58, 176]. Note also that the testing was done using the fitted functions only.

6.3.1 Learning with noise

Physical systems contain noise, meaning that there is some uncertainty in each of the elements of the density matrix (though, of course, it must remain hermitian and positive semidefinite, with unit trace to conserve probability.) What if the system on which we train is somewhat noisy?

Let us first define our terms. Because we are working with simulations, we can isolate the different effects of “noise” and “decoherence”: here, we will use “noise” to refer to ran- dom (uncorrelated) magnitudes, of a given size, added to the density matrix elements; and

“decoherence” to refer to random phases so added. In general, of course, both effects will be present; we will call that “complex” noise. Recall that our entanglement indicator is a mapping to a time correlation function, after evolution in time according to the Hamilto- nian of Equation 5.19; that is, the entanglement of the initial state is approximated by a measurement performed at the final time. To simulate white noise (zero mean and specified amplitude), random numbers were added at each timestep ∆t = 0.8 ns of that time evo- lution. These numbers have zero time correlation themselves. The level of noise we report is the amplitude, that is, the root-mean-square-average size of these random numbers, that

105 are imposed at each time step. All the simulations were done for the same total time of 317 timesteps, or about 253 ns. Because the system evolves in time, the noise itself propagates, so that by the time the correlation function is measured, at the final time, numbers that seem quite small can build up to destroy a significant amount of entanglement. For comparison with our testing results, we will always therefore include the entanglement of formation for the noisy density matrix, calculated using the Bennett-Wootters formula[70, 71] and marked

“BW” on our testing graphs (Figures 6.12-6.17, 6.25-6.30, and 6.38-6.43) below.
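The sketch below shows one way the three kinds of perturbation could be applied to the density matrix at each time step while keeping it Hermitian with unit trace. It is illustrative only: the precise noise model used in our simulations may differ in detail, and positivity of ρ is not explicitly enforced here.

```python
import numpy as np

def perturb_density_matrix(rho, amplitude, kind="magnitude", rng=None):
    """Perturb rho with 'magnitude' noise, 'decoherence' (phase) noise, or both
    ('complex'), at the given rms amplitude, keeping it Hermitian with unit trace."""
    rng = rng or np.random.default_rng()
    n = rho.shape[0]
    noisy = rho.copy()
    if kind in ("decoherence", "complex"):
        # random phases on the off-diagonal elements; magnitudes unchanged
        phi = np.triu(rng.normal(0.0, amplitude, (n, n)), k=1)
        noisy = noisy * np.exp(1j * (phi - phi.T))
    if kind in ("magnitude", "complex"):
        # random real perturbations of the element magnitudes (kept symmetric)
        d = rng.normal(0.0, amplitude, (n, n))
        noisy = noisy + 0.5 * (d + d.T)
    noisy = 0.5 * (noisy + noisy.conj().T)       # re-Hermitize
    return noisy / np.trace(noisy).real          # restore unit trace

# Example: perturb the Bell-state density matrix with 0.0089 decoherence
bell = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
rho = np.outer(bell, bell.conj())
rho_noisy = perturb_density_matrix(rho, 0.0089, kind="decoherence")
```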

We consider first the case of magnitude noise only. Figure 6.5 shows a typical rms error training curve for a fairly large level of noise. As expected, asymptotic error for the training set does increase with increasing noise. The parameter functions also become “noisy”: See

Figures 6.6,6.7,6.8.

Figure 6.5: Total root mean squared error for the training set as a function of epoch (pass through the training set), for the 2-qubit system, with a (magnitude) noise level of 0.014 at each (of 317 total) timestep. Asymptotic error is 3.1 × 10−3, about double what it was with no noise.

106 Figure 6.6: Parameter function KA = KB as a function of time, as trained at 0.0089 ampli- tude noise at each of the 317 timesteps, for the entanglement indicator (data points), and plotted with the Fourier fit (solid line). Note the change in scale from Figure 6.2, because of the (much larger) spread of the noisy data: the Fourier fit is actually almost the same on this graph.

Figure 6.7: Parameter function A = B as a function of time, as trained at 0.0089 amplitude noise at each of the 317 timesteps, for the entanglement indicator (data points), and plotted with the Fourier fit (solid line).

107 Figure 6.8: Parameter function ζ as a function of time, as trained at 0.0089 noise at each of the 317 timesteps, for the entanglement indicator (data points), and plotted with the Fourier fit (solid line).

Much of this variation is not meaningful, though. In fact, the numbers in Table 6.3

are slightly different from the ones we found in [176]. Is that difference significant? To

investigate this, we tried testing with all but a selected one of the parameter functions’

Fourier coefficients set to the trained values, but assigning random numbers (of the right

order of magnitude) to the Fourier coefficients of the one chosen. The system was remarkably

insensitive to this procedure when the random function was the tunneling function K: as

long as K was the right order of magnitude, the indicator still tested extremely well - in fact

for most of the testing the indicator results were identical. This was not true if  or ζ were

randomized, however: errors were substantial, particularly when the Fourier frequency ω is

randomized. Still, exact agreement with the trained functions is not, apparently, necessary.

We can, as before, fit each function to a Fourier series, and test using the fitted functions.

Figures 6.9, 6.10, and 6.11 show the Fourier coefficients for $K_A = K_B$, $\epsilon_A = \epsilon_B$, and ζ, respectively, as functions of increasing noise level. The Fourier components of the tunneling parameter function K are clearly the least sensitive to noise level, while for ε they are a bit more so, and for ζ the most. This is in accordance with the observed insensitivity of the entanglement

indicator to K: that is, it is true both that the indicator is relatively insensitive to the K

function, and that the K function’s Fourier fit is insensitive to environmental noise.

Figure 6.9: Fourier coefficients for the tunneling parameter functions K, as functions of noise level.

Figure 6.10: Fourier coefficients for the bias parameter functions , as functions of noise level.

109 Figure 6.11: Fourier coefficients for the coupling parameter function ζ, as functions of noise level.

All of the parameters look fairly stable at these levels of noise, from this point of view.

But a more important question is: how much does noise interfere with the net’s ability to detect entanglement?

Consider a pure state, specifically, the state $P(\gamma) = \frac{|00\rangle + |11\rangle + \gamma|01\rangle}{\sqrt{2 + |\gamma|^2}}$. For γ = 0 this is of course the (fully entangled) Bell state; as γ increases the entanglement decreases. P(γ) is one of the states we tested on in the 2008 paper [56], and, as we found then, the entanglement as computed by the quantum neural network, without noise, tracks the entanglement of formation very well. How much noise or uncertainty can the network tolerate? We answer this question in two ways. First, we suppose the QNN was trained on the perfect (zero noise) system of the four-pair training set, and we add increasing amounts of noise to test the system for various nonzero values of γ, and compare the results to the entanglement of formation both at zero noise and at 0.0069 noise. See Figure 6.12. Then, we suppose the QNN was trained on a noisy system, and, again, test with increasing levels of noise.

Figures 6.13 and 6.14 show, respectively, an intermediate level and a high level of noise present during training. Recall that the noise should be understood as occurring over the total time evolution interval: that is, an independent (uncorrelated) noise at the given rms

110 level was added at each timestep. The two curves showing the entanglement of formation can therefore be thought of as a kind of “error bars”: the correct entanglement of the system should lie somewhere between the zero noise result and the result at maximum noise, insofar as the QNN tracks well with the entanglement of formation. Presumably the presence of noise does destroy entanglement, but, since the measurement itself is noisy, it is not certain how much is destroyed and how much is simply a bad measurement. Still, it is obvious from the results that, indeed, the QNN technique does an excellent job of remaining robust to pure noise.
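The “BW” reference curves can be reproduced with the standard Wootters formula for the entanglement of formation of an arbitrary two-qubit density matrix; the sketch below applies it to the test state P(γ). This is the textbook formula of [70, 71], not part of the QNN itself.

```python
import numpy as np

sy = np.array([[0, -1j], [1j, 0]])
YY = np.kron(sy, sy)

def entanglement_of_formation(rho):
    """Wootters entanglement of formation of a two-qubit density matrix."""
    rho_tilde = YY @ rho.conj() @ YY
    evals = np.linalg.eigvals(rho @ rho_tilde)
    lam = np.sqrt(np.sort(np.abs(evals))[::-1])        # decreasing sqrt-eigenvalues
    C = max(0.0, lam[0] - lam[1] - lam[2] - lam[3])    # concurrence
    if C <= 1e-12:
        return 0.0
    x = (1.0 + np.sqrt(1.0 - C**2)) / 2.0
    return -x * np.log2(x) - (1 - x) * np.log2(1 - x)  # binary entropy of x

def state_P(gamma):
    psi = np.array([1, gamma, 0, 1], dtype=complex)    # |00> + gamma|01> + |11>
    psi /= np.linalg.norm(psi)
    return np.outer(psi, psi.conj())

for g in (0.0, 0.5, 1.0):
    print(g, round(entanglement_of_formation(state_P(g)), 3))
# gamma = 0 gives EF = 1 (the Bell state); EF decreases as gamma grows
```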

Figure 6.12: Entanglement of the state P as a function of γ, as calculated by the QNN, and compared with the entanglement of formation (marked “BW”) at zero noise (blue) and at 0.0069 noise (orange). In each case the QNN was trained at zero noise, but tested at the given level.

111 Figure 6.13: Entanglement of the state P as a function of γ, as calculated by the QNN, and compared with the entanglement of formation (marked “BW”) at zero noise (blue) and at 0.0069 noise (orange). In each case the QNN was trained at a noise level of 0.0089, and then tested at the given level.

Figure 6.14: Entanglement of the state P as a function of γ, as calculated by the QNN, and compared with the entanglement of formation (marked “BW”) at zero noise (blue) and at 0.0069 noise (orange). In each case the QNN was trained at 0.013 noise, and then tested at the given level.

Second, we consider testing on mixed states. We might expect the QNN to perform significantly less well with these kinds of states, because the training set (Table 6.1) contained no mixed states. In fact, with zero noise the QNN tested well on several classes of mixed states

[56]; we need to see if that success is maintained with noisy conditions. Figures 6.15,6.16,6.17

show results for the mixed states $M(\delta) = \frac{\delta|11\rangle\langle 11| + |\Phi_+\rangle\langle\Phi_+|}{\delta + 1}$, where $|\Phi_+\rangle$ is the Bell state given in Table 6.1, as functions of δ, for the QNN trained at zero, 0.0089, and 0.013 noise amplitude,

respectively. Again, the entanglement of formation results can serve as an approximate

error bound, and we see that the QNN’s entanglement indicator is robust to noise. Indeed,

comparison of these figures with the ones for pure states shows that performance on mixed

states is even better. Because of this we might expect the QNN indicator to show greater

robustness to decoherence than to magnitude noise; we will see in the next section that this

is, indeed, the case.

Figure 6.15: Entanglement of the state M as a function of δ, as calculated by the QNN, and compared with the entanglement of formation (marked “BW”) at zero noise (blue) and at 0.0069 noise (orange). In each case the QNN was trained at zero noise, but tested at the given level.

Figure 6.16: Entanglement of the state M as a function of δ, as calculated by the QNN, and compared with the entanglement of formation (marked “BW”) at zero noise (blue) and at 0.0069 noise (orange). In each case the QNN was trained at a noise level of 0.0089, and then tested at the given level.

113 Figure 6.17: Entanglement of the state M as a function of δ, as calculated by the QNN, and compared with the entanglement of formation (marked “BW”) at zero noise (blue) and at 0.0069 noise (orange). In each case the QNN was trained at 0.013 noise, and then tested at the given level.

6.3.2 Learning with Decoherence

We now turn to the case of “pure” decoherence, that is, we introduce random phases to the elements of the density matrix, without changing their magnitudes. Figure 6.18 shows a typical rms error training curve for a fairly large level of phase noise. Again, asymptotic error for the training set does increase with increasing phase noise (decoherence). The parameter functions also become “noisy”: See Figures 6.19,6.20,6.21.

114 Figure 6.18: Total root mean squared error for the training set as a function of epoch (pass through the training set), for the 2-qubit system, with a phase noise level of 0.014 at each (of 317 total) timestep. Asymptotic error is 1.3 × 10−3, approximately the same as with no noise.

Figure 6.19: Parameter function KA = KB as a function of time, as trained at 0.0089 phase noise at each of the 317 timesteps, for the entanglement indicator (data points), and plotted with the Fourier fit (solid line)

115 Figure 6.20: Parameter function A = B as a function of time, as trained at 0.0089 phase noise at each of the 317 timesteps, for the entanglement indicator (data points), and plotted with the Fourier fit (solid line).

Figure 6.21: Parameter function ζ as a function of time, as trained at 0.0089 phase noise at each of the 317 timesteps, for the entanglement indicator (data points), and plotted with the Fourier fit (solid line).

Figures 6.22, 6.23, and 6.24 show the Fourier coefficients for $K_A = K_B$, $\epsilon_A = \epsilon_B$, and ζ, respectively, as functions of increasing decoherence level. It is clear that in the case of decoherence the parameter functions are even more stable than in the previous case of noise.

116 Figure 6.22: Fourier coefficients for the tunneling parameter functions K, as functions of decoherence level.

Figure 6.23: Fourier coefficients for the bias parameter functions , as functions of decoher- ence level.

117 Figure 6.24: Fourier coefficients for the coupling parameter function ζ, as functions of deco- herence level.

Testing results for the pure state P subject to decoherence are shown in Figures 6.25,6.26,6.27, as trained, respectively, at zero, 0.0089, and 0.013 decoherence (phase noise), and tested with various levels of phase noise, and compared with the entanglement of formation at zero and at 0.0069 phase noise. The results are extremely good, better even than the ones in Fig- ures 6.12,6.13,6.14; clearly, the QNN is even better at dealing with decoherence than with

“pure” noise. Possibly the various (random) phases tend to cancel each other; but since for definite phase shifts the QNN underestimates the entanglement, sometimes drastically [56]

[176], this was an unexpectedly good result.

118 Figure 6.25: Entanglement of the state P as a function of γ, as calculated by the QNN, and compared with the entanglement of formation (marked “BW”) at zero phase noise (blue) and at 0.0069 phase noise (orange). In each case the QNN was trained at zero phase noise, but tested at the given level.

Figure 6.26: Entanglement of the state P as a function of γ, as calculated by the QNN, and compared with the entanglement of formation (marked “BW”) at zero phase noise (blue) and at 0.0069 phase noise (orange). In each case the QNN was trained at a phase noise level of 0.0089, and then tested at the given level.

119 Figure 6.27: Entanglement of the state P as a function of γ, as calculated by the QNN, and compared with the entanglement of formation (marked “BW”) at zero phase noise (blue) and at 0.0069 phase noise (orange). In each case the QNN was trained at 0.013 phase noise, and then tested at the given level.

Figures 6.28,6.29,6.30 show the performance of the QNN on the mixed state M, as trained, respectively, at zero, 0.0089, and 0.013 decoherence (phase noise), and tested with various levels of phase noise, and compared with the entanglement of formation at zero and at 0.0069 phase noise. Again, we see that the QNN entanglement indicator is robust to decoherence.

Figure 6.28: Entanglement of the state M as a function of δ, as calculated by the QNN, and compared with the entanglement of formation (marked “BW”) at zero phase noise (blue) and at 0.0069 phase noise (orange). In each case the QNN was trained at zero noise, but tested at the given level.

120 Figure 6.29: Entanglement of the state M as a function of δ, as calculated by the QNN, and compared with the entanglement of formation (marked “BW”) at zero phase noise (blue) and at 0.0069 phase noise (orange). In each case the QNN was trained at a phase noise level of 0.0089, and then tested at the given level.

Figure 6.30: Entanglement of the state M as a function of δ, as calculated by the QNN, and compared with the entanglement of formation (marked “BW”) at zero phase noise (blue) and at 0.0069 phase noise (orange). In each case the QNN was trained at 0.013 phase noise, and then tested at the given level.

6.3.3 Learning with Noise plus decoherence

Finally, we consider the case of noise plus decoherence, that is, what we are calling random complex noise. For this case, we add both magnitude and phase noise. Figure 6.31 shows a typical rms error training curve for the same level of complex noise as in Figures

6.5 and 6.18. Again, asymptotic error for the training set does increase with increasing

complex noise. The parameter functions again become “noisy”: See Figures 6.32,6.33,6.34.

Figure 6.31: Total root mean squared error for the training set as a function of epoch (pass through the training set), for the 2-qubit system, with a complex noise level of 0.014 at each (of 317 total) timestep. Asymptotic error is 3.0 × 10−3, approximately the same as with only magnitude noise.

Figure 6.32: Parameter function KA = KB as a function of time, as trained at 0.0089 complex noise at each of the 317 timesteps, for the entanglement indicator (data points), and plotted with the Fourier fit (solid line). Note the change in scale from Figure 6.2, because of the (much larger) spread of the noisy data: the Fourier fit is actually almost the same on this graph.

Again, we test to see how much the Fourier fit changes, this time with complex noise.

122 Figure 6.33: Parameter function A = B as a function of time, as trained at 0.0089 complex noise at each of the 317 timesteps, for the entanglement indicator (data points), and plotted with the Fourier fit (solid line).

Figure 6.34: Parameter function ζ as a function of time, as trained at 0.0089 complex noise at each of the 317 timesteps, for the entanglement indicator (data points), and plotted with the Fourier fit (solid line).

Figures 6.35,6.36,6.37 show the Fourier coefficients as a function of complex noise level, for

K, ε, and ζ, respectively. We can see that the indicator is, again, relatively stable.

Again we use the Fourier fitted functions to test on both pure and mixed states. Fig- ures 6.38,6.39,6.40 show performance of the QNN on the entanglement of the state P , as

trained, respectively, at zero, 0.0089, and 0.013 amplitude complex noise, and tested with

various levels of complex noise, and compared with the entanglement of formation at zero

and at 0.0069 complex noise. Figures 6.41 6.42 6.43 show the performance of the QNN on the

mixed state M, as trained, respectively, at zero, 0.0089, and 0.013 complex noise, and tested with various levels of complex noise, and compared with the entanglement of formation at zero and at 0.0069 complex noise. Results are excellent; in fact, they are somewhat better than for the case of only “magnitude” noise. In some sense, allowing for decoherence makes the indicator even more robust.

Figure 6.35: Fourier coefficients for the tunneling parameter functions K, as functions of complex noise level.

Figure 6.36: Fourier coefficients for the bias parameter functions ε, as functions of complex noise level.

Figure 6.37: Fourier coefficients for the coupling parameter function ζ, as functions of complex noise level.

Figure 6.38: Entanglement of the state P as a function of γ, as calculated by the QNN, and compared with the entanglement of formation (marked “BW”) at zero noise (blue) and at 0.69% noise plus decoherence (orange). In each case the QNN was trained at zero noise, but tested at the given level.

Figure 6.39: Entanglement of the state P as a function of γ, as calculated by the QNN, and compared with the entanglement of formation (marked “BW”) at zero noise (blue) and at 0.0069 noise plus decoherence (orange). In each case the QNN was trained at a noise level of 0.0089 complex noise, and then tested at the given level.

125 Figure 6.40: Entanglement of the state P as a function of γ, as calculated by the QNN, and compared with the entanglement of formation (marked “BW”) at zero complex noise (blue) and at 0.0069 noise plus decoherence(orange). In each case the QNN was trained at 0.013 level complex noise, and then tested at the given level.

Figure 6.41: Entanglement of the state M as a function of δ, as calculated by the QNN, and compared with the entanglement of formation (marked “BW”) at zero noise (blue) and at 0.0069 noise plus decoherence(orange). In each case the QNN was trained at zero noise, but tested at the given level.

126 Figure 6.42: Entanglement of the state M as a function of δ, as calculated by the QNN, and compared with the entanglement of formation (marked “BW”) at zero noise (blue) and at 0.0069 noise plus decoherence (orange). In each case the QNN was trained at a complex noise level of 0.0089, and then tested at the given level.

Figure 6.43: Entanglement of the state M as a function of δ, as calculated by the QNN, and compared with the entanglement of formation (marked “BW”) at zero noise (blue) and at 0.0069 noise plus decoherence (orange). In each case the QNN was trained at 0.013 complex noise, and then tested at the given level.

6.4 Entanglement Calculation on Higher-Order Qubit Systems

In the previous section, we showed that quantum neural computation is robust under random perturbations of the density matrix for the two-qubit quantum system. We now generalize this result by extending our previous work to three-, four-, and five-qubit quantum systems, and show that the increase in the number of qubits improves robustness to both

“noise” and “decoherence.” Recall that we define “noise” as perturbation to the magnitude of the elements of the density matrix, while “decoherence” is perturbation to the phase. Splitting the density matrix this way is possible since we are working with numerical simulation.

Throughout the simulation, in order to conserve probability, the density matrix must remain

Hermitian, positive semidefinite, and with unit trace. As we did with the two-qubit system, we will investigate the effects of noise during both training and testing. Furthermore, we can use knowledge of the smaller system as partial knowledge of the larger system, in finding the set of parameter functions that will perform the desired computation. This kind of inference is called, in the literature, bootstrapping [108]. The bootstrapping technique for entanglement witness calculation can be described as follows: we trained the two-qubit system to output an entanglement indicator, then used those functions as the starting point for training the three-qubit system. From three we bootstrapped to four, and so on. In each case, the amount of additional required training decreased, because more and more of the information necessary for the entanglement indicator was present already in the N − 1 qubit system.

6.4.1 Results for the Three-Qubit System: Training and Testing

The quantum neural network structure is the same as described in Section 5.1. However, for a three-qubit system, our Hamiltonian is written out explicitly in terms of qubits A, B,

and C as

$$H = K_A\sigma_{xA} + K_B\sigma_{xB} + K_C\sigma_{xC} + \epsilon_A\sigma_{zA} + \epsilon_B\sigma_{zB} + \epsilon_C\sigma_{zC} + \zeta_{AB}\sigma_{zA}\sigma_{zB} + \zeta_{BC}\sigma_{zB}\sigma_{zC} + \zeta_{AC}\sigma_{zA}\sigma_{zC} \qquad (6.5)$$

As described previously in section 5.1, σxi or σzi can be written as tensor products of the

Pauli matrix σx or σz with the 2x2 identity matrix σI . The order depends on the qubit label.

For example, in this case (the three-qubit system) $\sigma_{xB} = \sigma_I \otimes \sigma_x \otimes \sigma_I$. The identity operators σI in the first and third places ensure that σx operates only on the B qubit. The increased connectivity is evident in the increased number of qubit-qubit terms: while the two-qubit system has but one connection, the three-qubit system has three. Since we are training for a symmetric measure, the tunneling functions are equal, $K_A = K_B = K_C$, and similarly for the ε and ζ functions. Also, we now have more output measures {Mz} to be trained. We need to be able to distinguish among entanglement between qubits A and

B, between A and C, between B and C, and amongst A, B, and C (three-way entan- glement). The fully entangled state of all qubits in an N-qubit system is also known as the GHZ state. Therefore, the number of training pairs, and hence the amount of possible training, goes up like the connectivity; were it not for bootstrapping, this would indeed be a daunting challenge.

Recall from Section 6.3 that for the two-qubit system we used a training set of four; thus, for the three-qubit system, we need a set of thirteen: three sets of four for the three pairwise entanglements, and one more for three-way entanglement. Table 6.3 shows our unnormalized thirteen training states. Here B_AB represents the fully pairwise entangled state between qubits A and B. F_AB represents the Flat state (unentangled but with quantum correlation) between qubit A and qubit B. C_AB represents the classically correlated

Table 6.3: Unnormalized training input states for 3-qubit entanglement calculation

Input   BAB   BAC   BBC   FAB   FAC   FBC   CAB   CAC   CBC   PAB   PAC   PBC   GHZ
|000⟩   1     1     1     1     1     1     0     0     0     1     1     1     1
|001⟩   0     0     0     0     1     1     0     0     0     0     1     1     0
|010⟩   0     0     0     1     0     1     0     0     0.5   1     0     1     0
|011⟩   0     0     1     0     0     1     0     0     1     0     0     0     0
|100⟩   0     0     0     1     1     0     0.5   0.5   0     1     1     0     0
|101⟩   0     1     0     0     1     0     0     1     0     0     0     0     0
|110⟩   1     0     0     1     0     0     1     0     0     0     0     0     0
|111⟩   0     0     0     0     0     0     0     0     0     0     0     0     1

state (no quantum correlation) between qubit A and B, hence it’s also an unentangled

state. P_AB represents a partially entangled state between qubits A and B. The representations for B_AC, B_BC, F_AC, F_BC, C_AC, C_BC, P_AC, and P_BC can be derived similarly. Lastly, the state

GHZ represents the fully entangled state of qubits A, B, and C, as discussed.

Table 6.4 shows the theoretical targets and actual QNN output values corresponding to each input state, with zero noise or decoherence added to the system. Here M_AB represents the quantum measurement operation on qubits A and B; hence $M_{AB} = \sigma_z \otimes \sigma_z \otimes \sigma_I$.

Similarly, MAC , MBC , and MABC represent measurement on the respective qubits.
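The operator placement and the pairwise measurement operators just described can be written generically for any number of qubits; the short helper below (illustrative naming, not a library routine) builds $\sigma_{x,i}$, $\sigma_{z,i}$, and measures such as $M_{AB} = \sigma_z \otimes \sigma_z \otimes \sigma_I$ with Kronecker products.

```python
import numpy as np
from functools import reduce

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
si = np.eye(2, dtype=complex)

def embed(op, qubit, n_qubits):
    """Place a single-qubit operator on the given qubit (0-indexed) of an n-qubit register."""
    factors = [si] * n_qubits
    factors[qubit] = op
    return reduce(np.kron, factors)

def pairwise_measure(i, j, n_qubits):
    """M_ij: sigma_z on qubits i and j, identity elsewhere (e.g. M_AB for i=0, j=1)."""
    return embed(sz, i, n_qubits) @ embed(sz, j, n_qubits)

# Three-qubit examples matching the text: sigma_xB and M_AB
sigma_xB = embed(sx, 1, 3)                 # sigma_I (x) sigma_x (x) sigma_I
M_AB = pairwise_measure(0, 1, 3)           # sigma_z (x) sigma_z (x) sigma_I
print(np.allclose(M_AB, np.kron(np.kron(sz, sz), si)))   # True
```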

The parameter functions {K, ε, ζ} have similar characteristics to the two-qubit system parameter functions. In fact, they have the same functional form, just with different amplitudes.

Once these parameter functions are trained, we can use them to test other states. The results on these testing states, which can be either pure or mixed, tell us whether the system has correctly generalized (learned). We found that all three functions {K, ε, ζ} are easily parameterized as simple oscillating functions. Figures 6.44-6.46 show the actual trained parameters as dashed lines and their Fourier fits as solid lines.

Table 6.4: Training theoretical targets and actual QNN outputs for 3-qubit entanglement calculation with no noise or decoherence added to the system

               BAB     BAC     BBC     FAB     FAC     FBC     CAB     CAC     CBC     PAB     PAC     PBC     GHZ
Targets  MAB   1       0       0       0       0       0       0       0       0       4/9     0       0       0
         MAC   0       1       0       0       0       0       0       0       0       0       4/9     0       0
         MBC   0       0       1       0       0       0       0       0       0       0       0       4/9     0
         MABC  0       0       0       0       0       0       0       0       0       0       0       0       1
QNN      MAB   0.9911  0.0012  0.0000  0.0002  0.0023  0.0012  0.0010  0.0006  0.0005  0.4384  0.0016  0.0014  0.0039
         MAC   0.0002  0.9964  0.0000  0.0015  0.0001  0.0018  0.0000  0.0017  0.0001  0.0049  0.4386  0.0021  0.0001
         MBC   0.0002  0.0012  0.9969  0.0002  0.0001  0.0000  0.0000  0.0009  0.0000  0.0026  0.0006  0.4396  0.0066
         MABC  0.0012  0.0000  0.0028  0.0001  0.0014  0.0002  0.0000  0.0001  0.0003  0.0006  0.0024  0.0007  0.9883

Figure 6.44: K parameter function trained with zero noise or decoherence, for the 3-qubit system.

Figure 6.45: ε parameter function trained with zero noise or decoherence, for the 3-qubit system.

Figure 6.46: ζ parameter function trained with no noise or decoherence, for the 3-qubit system.

Testing using the Fourier fit parameter functions gives identical (within computational error) results as testing with the trained parameter pointwise data; i.e, the differences between the data points and the fits are small enough not to matter. Table 6.5 shows the fitted parameter functions’ Fourier coefficients for zero noise along with the Quantum Neural Network training root-mean-square (RMS) error for the training step.

Table 6.5: Fourier coefficients for 3-qubit fitted parameter functions with no noise and decoherence along with QNN training RMS

f(t) = a0 + a1 cos(ωt) + b1 sin(ωt) + a2 cos(2ωt) + b2 sin(2ωt)

Coefficient    Tunneling (K)     Bias (ε)          Qubit-Qubit Coupling (ζ)
a0             0.002472          5.9935 × 10⁻⁵     1.9991 × 10⁻⁴
a1             1.071 × 10⁻⁶      −2.3456 × 10⁻⁴    1.2969 × 10⁻⁴
b1             −1.103 × 10⁻⁶     1.6611 × 10⁻⁴     4.0433 × 10⁻⁵
a2             −1.109 × 10⁻⁶     ——                ——
b2             1.023 × 10⁻⁷      ——                ——
ω              0.02498           0.02498           0.05750
QNN RMS        9.744 × 10⁻⁴      1.840 × 10⁻⁶      7.291 × 10⁻⁶

We then proceed by adding noise and decoherence to the system. To carry this through, we

first trained the 2-qubit system with no noise or decoherence, then bootstrapped the already-trained parameter functions {K, ε, ζ} to the three-qubit system. We found that this helped to decrease the number of epochs needed to train the system. We steadily increased the perturbation to the density matrix and found that the 3-qubit system parameters are more stable to noise and decoherence than those of the 2-qubit system. That is, the data points on the parameter functions deviate less as noise, decoherence, or both are added. Results of the parameter functions for the 3-qubit system, trained at 0.0089 phase noise (decoherence), are shown in Figures 6.47-6.49. The level of noise we report is the amplitude, that is, the root-mean-square-average size of these random numbers, imposed at each timestep.

Figure 6.47: Parameter function K trained at 0.0089 amount of decoherence in a 3-qubit system. The solid red curve represents the Fourier fit of the actual data points.

133 Figure 6.48: Parameter function  trained at 0.0089 decoherence in a 3-qubit system. The solid red curve represents the Fourier fit of the actual data points.

Figure 6.49: Parameter function ζ trained at 0.0089 decoherence in a 3-qubit system. The solid red curve represents the Fourier fit of the actual data points.

134 At equal amounts of phase noise added to the density matrix, the data points for the three-

qubit system are less scattered than the parameter functions for the two qubit system.

Magnitude noise gives similar results. We can also add both noise and decoherence to the

system simultaneously (we call this ‘total noise’.) As with the two-qubit system in the

previous section, we graph the Fourier coefficients of each parameter function as a function

of total noise. Figures 6.50-6.52 show that the coefficients, and therefore the parameter

functions, change very little.

Figure 6.50: Parameter function K Fourier coefficients as a function of total noise for the 3-qubit system.

Figure 6.51: Parameter function ε Fourier coefficients as a function of total noise for the 3-qubit system.

135 Figure 6.52: Parameter function ζ Fourier coefficients as a function of total noise for the 3-qubit system.

So far, we have shown that our system is robust to noise and decoherence during the learning

process. We need also to test these learned parameters on some testing states (i.e., not in

the training set) to see how well the system has actually learned, and what is the impact of

the noise on the performance. We choose a pure state |ψipure and a mixed state ρmixed for

this process. We choose the pure state to be

$$|\psi\rangle_{\text{pure}} = \frac{|000\rangle + \gamma|001\rangle + |011\rangle}{\sqrt{1 + |\gamma|^2}} \qquad (6.6)$$

Here we choose 0 ≤ γ ≤ 1. Note that the state |ψipure is a superposition of the state |000i,

$|001\rangle$, and $|011\rangle$. It is a pairwise entangled state between qubits B and C, with maximum entanglement when γ = 0, decreasing as γ increases. Figure 6.53 shows the testing of this state, as a function of γ, for increasing amounts of (total) noise. There is some spread, but the behavior is qualitatively correct, and the indicator is relatively stable.

136 Figure 6.53: Testing at different noise levels for the pure state P for a three-qubit system, as a function of γ.

We now choose an exemplar mixed state as

$$\rho_{\text{mixed}} = \frac{\big(|000\rangle + |011\rangle\big)\big(\langle 000| + \langle 011|\big) + \gamma|001\rangle\langle 001|}{2 + \gamma} \qquad (6.7)$$

again, here 0 ≤ γ ≤ 1. For this state, we should also expect to have full pairwise entangle-

ment for the BC pair when γ = 0, which should decrease as γ increases. Figure 6.54 shows

testing of Mix at similar noise levels. We notice the same trend as with the previous state:

the entanglement indicator is relatively stable.

Figure 6.54: Testing at different noise levels for a mixed state M, for the three-qubit system, as a function of γ.

6.4.2 Results for the four- and five-qubit systems: Training and Testing

The evidence of improvement in training and testing as well as the stability to noise and decoherence in a three qubit system is clear. We will now show that the system becomes more stable as we increase its size. This improvement in the results should not be surprising since the bigger the system, the more connectivity we have in our network. Note that for the large N-qubit systems, we train for each level of entanglement: For each pair of qubits, we use the four training states (Bell, P, Flat and C) for that pairwise entanglement, and in addition we also include the training pairs for three-way entanglement, four-way, and so on, up to

N-way entanglement (also called GHZ.) See [99] for details. For instance, for the five-qubit

system, there are $\binom{5}{2}$ pairs and therefore $\binom{5}{2}$ sets of Bell, P, Flat, and C states for training the pairwise entanglement; $\binom{5}{3}$ three-way entanglements; $\binom{5}{4}$ four-way entanglements; and $\binom{5}{5}$ five-way entanglement, or a 5-qubit GHZ state. As before, since this is a simulation, we can separate the noise and decoherence into two problems, or combine them together to get total noise. To demonstrate the improvement in robustness, we examine the most noise-sensitive parameter function, K. Figures 6.55 and 6.56 show the parameter function K for the four- and five-qubit systems, respectively.
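For reference, the training-set size implied by this counting can be computed directly. The sketch below assumes that each k-way entanglement with k ≥ 3 contributes a single training state, which reproduces the counts of four (two qubits) and thirteen (three qubits) quoted earlier.

```python
from math import comb

def training_set_size(n_qubits):
    """4 states (Bell, P, Flat, C) per qubit pair, plus one state for each k-way
    entanglement with 3 <= k <= n (the n-way case being the GHZ state)."""
    pairwise = 4 * comb(n_qubits, 2)
    multiway = sum(comb(n_qubits, k) for k in range(3, n_qubits + 1))
    return pairwise + multiway

for n in (2, 3, 4, 5):
    print(n, training_set_size(n))
# n=2: 4, n=3: 13, n=4: 29, n=5: 56
```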

Figure 6.55: Parameter function K for a 4-qubit system trained at 0.027 level of noise with its Fourier fit.

138 Figure 6.56: Parameter function K for 5-qubit system trained at 0.027 level of noise with its Fourier fit.

6.4.3 Quantifying the improvement in robustness with increasing size of the

system

Note that the noise level added to the system in Figures 6.55 and 6.56 is almost three times as much as the noise added to the three-qubit system (Figure 6.47); these results are, however, comparable. It is obvious there is an improvement in robustness. To quantify that robustness as a function of the number of qubits, we plot the coefficient of determination, R², for each of the least-squares curve fits of the trained parameter functions, as a function of the number of qubits in the system, in Figures 6.57-6.59. Recall that R² is defined as one minus the (normalized) sum of squares of the residuals; therefore, R² varies from zero

(bad fit) to one (perfect fit).
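For reference, R² is computed in the usual way from the residuals between the trained data points and their Fourier fit; a minimal version:

```python
import numpy as np

def r_squared(y_data, y_fit):
    """Coefficient of determination: 1 minus the normalized sum of squared residuals."""
    y_data, y_fit = np.asarray(y_data), np.asarray(y_fit)
    ss_res = np.sum((y_data - y_fit) ** 2)
    ss_tot = np.sum((y_data - np.mean(y_data)) ** 2)
    return 1.0 - ss_res / ss_tot

# Example: a noisy sine and its noiseless fit
t = np.linspace(0, 10, 200)
noisy = np.sin(t) + 0.05 * np.random.default_rng(1).standard_normal(t.size)
print(round(r_squared(noisy, np.sin(t)), 3))   # close to 1 for a good fit
```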

139 Figure 6.57: R2 as a function of number of qubits, for the K parameter function, trained at 0.02741 total noise.

Figure 6.58: R² as a function of number of qubits, for the ε parameter function, trained at 0.02741 total noise.

Figure 6.59: R2 as a function of number of qubits, for the ζ parameter function, trained at 0.02741 total noise.

As the number of qubits increases, R² increases towards one, and may converge towards an asymptote by four or five qubits. The increase is uniform except for the ζ parameter

(Figure 6.59), but even in that case, R² never gets below 0.8, and still converges towards a high asymptote for four or five qubits. This improvement is not unexpected, for two reasons. First, we observed in earlier work that the amount of training necessary decreases sharply as the number of qubits increases [64]; and, second, it is well known that the presence of (small amounts of) noise can in fact improve robustness, because of the problem of overfitting: if the data set is small, the network essentially memorizes all the training pairs, leading to very bad testing results [106]. This is illustrated with a linear vs. high-order polynomial fit example, shown in Figure 6.60.

Figure 6.60: An illustrative example of overfitting to the data (blue dots.) The red curve represents a possible function that an overfitting network would have learned comparing to the actual correct fit, the green line. Training error for the red line would be small, but any subsequent testing would give large errors. Adding noise to the data points would correct for overfitting.

Our training sets for the entanglement witness are small compared to the number of free parameters to be determined, that is, the parameter functions {K, ε, ζ} as functions of time.

Thus the training sets represent only a very small subset of the entire Hilbert space, and overfitting can easily occur. One method to prevent overfitting and to improve the structure of the predicted function in neural networks is to add random noise (usually white Gaussian noise) during training [109, 110, 111]. Adding noise will affect the training iteration step, but it will spread out the data points and prevent the network from fitting each data point precisely, hence avoiding the overfitting issue and improving the robustness of the network

[112]. This is a direct consequence of the inability of the network to memorize the training data since it is continuously being perturbed by noise.
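In classical-network terms, the same regularizing effect is obtained by jittering the training inputs each epoch. The toy sketch below is a generic regression loop, not our QNN code, and simply shows where the noise enters the training iteration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-1, 1, 10).reshape(-1, 1)           # small training set
y = 0.5 * X[:, 0] + 0.1 * rng.standard_normal(10)   # noisy linear target
w, b, lr, sigma = 0.0, 0.0, 0.1, 0.05

for epoch in range(500):
    X_jit = X + sigma * rng.standard_normal(X.shape)   # fresh input noise every epoch
    pred = w * X_jit[:, 0] + b
    err = pred - y
    w -= lr * np.mean(err * X_jit[:, 0])                # descend the squared-error loss
    b -= lr * np.mean(err)

print(round(w, 3), round(b, 3))
```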

6.4.4 Learning with other types of noise

We have shown that QNN calculation is robust to noise and to decoherence, and that that robustness improves as the size of the system increases. But we need also to address the type of perturbation applied. What we do here, adding white Gaussian noise to the density matrix, is similar to what is known as random jitter, in which random perturbations are applied to the input data. Noise could instead be added to the outputs and to the gradients of the network during training [111]. Another possibility is to add noise to the activation, which, in our case, would be equivalent to adding noise to our projective measure

$M_Z$. Yet another procedure would be to add noise to the weights of the network [110, 111,

112, 113]. This assumes that the parameters themselves are somewhat uncertain or noisy.

This is similar to adding noise directly to the Hamiltonian (since the Hamiltonian is itself a function of the parameter functions.) An advantage of this approach is that one need not separately impose conservation of probability, because, as long as the Hamiltonian remains

142 Hermitian, the system will still obey all the physical requirements to be a quantum system. A

disadvantage though is that the effects of decoherence cannot be directly investigated. Some

preliminary results are shown in Figures 6.61-6.64 for the bias (ε) parameter function,

trained with magnitude noise being added to the Hamiltonian for the 2-,3-,4-, and 5-qubit

systems respectively.

Figure 6.61: Training the parameter function ε with noise added to the Hamiltonian instead of the density matrix, for a 2-qubit system.

Figure 6.62: Training the parameter function ε with noise added to the Hamiltonian instead of the density matrix, for a 3-qubit system.

The results, shown in Figures 6.61- 6.64, seem promising: Clearly, robustness increases with system size also with this method of adding noise. Indeed, it would be surprising if

Figure 6.63: Training the parameter function ε with noise added to the Hamiltonian instead of the density matrix, for a 4-qubit system.

Figure 6.64: Training the parameter function ε with noise added to the Hamiltonian instead of the density matrix, for a 5-qubit system.

this were not true, since the two reasons cited above in Subsection 6.4.3 would still apply.

This is good for practical purposes since, in reality, noise and decoherence can be introduced anywhere during an experiment.

6.4.5 Stability Analysis of The Calculations

To check the stability of the simulation, we performed an error analysis, starting with the initial density matrix ρ(0). Evolving $\rho(t_0)$ to $\rho(t_f)$ using the Schrödinger equation,

we get the following:

$$\rho(t_f) = e^{-iLt}\rho(0)$$

Our output function is $O = \mathrm{Tr}\big(M\rho(t_f)\big)$, where $M = \sigma_z \otimes \sigma_z$. To add noise, we perturbed the original system by adding $\delta\rho$. Since the time propagation step is linear in ρ, our perturbed output is

$$O^{*} = \mathrm{Tr}\big[M(\rho(t_f) + \delta\rho(t_f))\big] = \mathrm{Tr}\big[M\rho(t_f) + M\delta\rho(t_f)\big] = \mathrm{Tr}\big[A + \delta A\big]$$

If $\lambda_i \in \Lambda(A)$ for some $i$, where $\Lambda(A)$ denotes the set of eigenvalues of $A$, then by the Bauer-Fike theorem there is a $\mu_i \in \Lambda(A + \delta A)$ such that

$$|\mu_i - \lambda_i| \le \kappa(V)\,\lVert\delta A\rVert$$

where $V$ results from the diagonalization of $A$, $A = V^{-1}DV$, and $\kappa(V)$ is the condition number of $V$. In our case $A$ is Hermitian, so $V$ will be unitary, resulting in $\kappa(V) = 1$. For a non-noisy system we have

$$O = \mathrm{Tr}(A) = \sum_{i=1}^{n} \lambda_i$$

where $n$ is the system's size. For a noisy system, we have

$$O^{*} = \mathrm{Tr}(A + \delta A) = \sum_{i=1}^{n} \mu_i$$

Therefore,

$$|O^{*} - O| \le \sum_{i=1}^{n} |\mu_i - \lambda_i| \le n\,\lVert\delta A\rVert$$

Thus $n\,\lVert\delta A\rVert$ is an upper bound on the error, whereas our simulations showed that the error does not reach this value.
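This bound can also be checked numerically: perturb A = Mρ(tf) by a random Hermitian δρ and compare the change in the trace against n‖δA‖. The sketch below is illustrative only, with a random density matrix standing in for the actual evolved state.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                                              # two-qubit system
sz = np.array([[1, 0], [0, -1]], dtype=complex)
M = np.kron(sz, sz)

# A random density matrix standing in for rho(t_f)
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
rho = G @ G.conj().T
rho /= np.trace(rho).real

# Hermitian perturbation delta_rho of small norm
D = 1e-3 * (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
delta_rho = 0.5 * (D + D.conj().T)

A, dA = M @ rho, M @ delta_rho
error = abs(np.trace(A + dA) - np.trace(A))
bound = n * np.linalg.norm(dA, 2)                  # n * ||delta A|| (spectral norm)
print(error <= bound)                              # True: the error stays below the bound
```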

6.5 Application: Entanglement for pattern storage

Entanglement can be used as a means for pattern storage as follows. We encode the pairwise entanglement between certain qubits to represent the shape of a character. For

example, using a four-qubit system, we can represent the letter Z as the pairwise entanglement between qubit pairs AB, BC, and CD, as shown in Figure 6.65.

Figure 6.65: How to encode the letter Z using pairwise entanglement. The green double- headed arrows represent pairwise entanglement.

We can exclude the 3-way and GHZ-state entanglement from our 4-qubit training sets here (since we are only dealing with pairwise entanglement) to help speed up the training process. Since the training and testing are robust to noise to a certain extent, a little corruption in the data will not greatly affect the output from the QNN. Table 6.6 shows the results of the QNN outputs for different encoded states with noise and decoherence added. Here, we can think of noise and decoherence as corruption of the letters.
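A sketch of the encoding step: for a four-qubit register, a letter is represented by its set of entangled pairs, which is turned into a target vector over all six qubit pairs for training. Only the letter Z is specified by Figure 6.65; pair sets for any other letters would be chosen analogously (and are not given here).

```python
from itertools import combinations

QUBITS = ["A", "B", "C", "D"]
PAIRS = list(combinations(QUBITS, 2))   # AB, AC, AD, BC, BD, CD

def encode_letter(entangled_pairs):
    """Target pairwise-entanglement vector: 1 for each entangled pair, 0 otherwise."""
    return {"".join(p): (1.0 if p in entangled_pairs or p[::-1] in entangled_pairs else 0.0)
            for p in PAIRS}

# The letter Z from Figure 6.65: qubit pairs AB, BC, and CD entangled.
letter_Z = encode_letter({("A", "B"), ("B", "C"), ("C", "D")})
print(letter_Z)   # {'AB': 1.0, 'AC': 0.0, 'AD': 0.0, 'BC': 1.0, 'BD': 0.0, 'CD': 1.0}
```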

With an array of four qubits, we can form only a few characters, but the number of distinct shapes goes up rapidly with the number of qubits: with only six, for example, arranged as a rectangular array in three rows of two, there are 8126 distinct symbols, unique up to translation in the plane. Modern technology should allow encoding even of images [118]. These results, while promising, are only preliminary. True character recognition will require some means of transferring an image of a letter into a quantum state before

146 performing this process. We are currently extending this work to address this problem, and to include a more complete analysis [68].

Table 6.6: QNN output to encoded states of different letters, showing robustness to noise.

Total Noise   Letter Z    Letter N    Letter O    Letter X
0.0050        correct     correct     correct     correct
0.0125        correct     correct     correct     correct
0.0375        correct     correct     correct     correct
0.0500        correct     correct     correct     correct
0.0625        correct     correct     incorrect   correct
0.0075        incorrect   incorrect   incorrect   incorrect

CHAPTER 7

BENCHMARKING NEURAL NETWORKS FOR QUANTUM COMPUTATIONS

The power of quantum computers is still somewhat speculative. While they are certainly faster than classical ones at some tasks, the class of problems they can efficiently solve has not been mapped definitively onto known classical complexity theory. This means that we do not know for which calculations there will be a “quantum advantage,” once an algorithm is found. One way to answer the question is to find those algorithms, but finding truly quantum algorithms turns out to be very difficult. In previous chapters we have shown that we can use our QNN to help us get around the algorithm design problem. Here we compare the performance of standard real- and complex-valued classical neural networks with that of one of our models for a quantum neural network, on both classical problems and on an archetypal quantum problem: the computation of an entanglement witness. The QNN is shown to need far fewer epochs and a much smaller network to achieve comparable or better results. The chapter begins with the benchmarking of classical logic gates in Section 7.2. In

Section 7.3 we will show that the QNN also outperforms classical neural networks in a standard classification task, the famous Iris flower classification problem. Lastly, in Section 7.4 we will compare our QNN model against classical neural networks on the entanglement witness calculation problem. We will show that the QNN is far superior to classical neural networks on this specific problem.

7.1 Types of Neural Networks Benchmarked

Here we quickly go over the different types of neural networks that are benchmarked against each other. Up to this point, in our discussion of artificial neural networks, we have only considered the case where the inputs are real-valued. This is for good reason: classically, most inputs are real-valued, and in the rare instances when our inputs are complex we can split each into two real parts. However, it turns out that there are advantages in the training process if we keep the inputs complex-valued. Quantifying the exact advantages is still an ongoing study in the neural networks community [158]. In this benchmarking test we will therefore use a complex-valued network as well as a real-valued one. We discuss this complex-valued neural network model in Subsection 7.1.2. The classical real-valued neural networks were discussed at length in Chapter 4; however, we give a quick reminder of their structure in Subsection 7.1.1. The other neural network model to be tested is of course the QNN model. The fundamental structure and universality of the QNN model were discussed extensively in Chapter 5, so we give only a quick overview of it in Subsection 7.1.3.

7.1.1 Classical real-valued neural networks

Standard algorithmic computers are “Turing complete”, that is, can be programmed to simulate any computer algorithm. Now recall that classical, Real-Valued Neural Networks

(RVNN) are also “universal approximators.” This means essentially that an RVNN can approximate any continuous function that maps inputs to outputs. We have shown with three different arguments why this is the case. The first argument was constructive, using boxes/towers; the second was more abstract, using modern analysis; and the third is from the computational perspective of replacing NAND gates with perceptrons.

To perform this benchmarking task, an (in house) implementation of a real-valued artificial neural network was used as a baseline. The software NeuralWorks Professional II [143] was also used to verify that the results obtained from commercially available software were comparable to our own implementation.

7.1.2 Complex-Valued Neural Networks

Here we briefly introduce the concept of complex-valued neural networks (CVNN) and their universality. The CVNN was first developed, as a generalization of the real-valued neural network (RVNN), by Thomas Clarke in 1990 [145]. One of the main challenges in this generalization is that a bounded entire function must be constant on C, a direct consequence of Liouville's theorem. Why is this a problem? Recall that in an RVNN the activation function of a sigmoid neuron is a smooth and bounded function. Thus we have to sacrifice either the boundedness or the analyticity of the activation functions of the CVNN's neurons. We would like to keep the analytic property, since without it we cannot provide a faithful complex gradient for error backpropagation, which would hinder the real advantages of the CVNN. The function f(z) = tanh(z) is one of the popular choices of activation function for complex neurons. A larger class of activation functions for the CVNN can be found in [144]. The universality question for the CVNN has also been answered in [144], but let us state it here for convenience. It should be noted that the universality theorem below resembles Theorem 4.4.2 to a large degree.

Theorem 7.1.1. Let σ : C^n → C be any complex continuous discriminatory function. Then finite sums of products of the form

F(z) = Σ_{k=1}^{m} β_k Π_{l=1}^{s_k} σ(W_{kl}^T z + θ_k)    (7.1)

are dense in C(I_n), where β_k, θ_k ∈ C, W_{kl}, z ∈ C^n, m is the number of neurons in the hidden layer, and s_k is an arbitrary number depending on how many product terms one wants to take. A function σ is said to be discriminatory if, for a measure μ ∈ M(I_n),

∫_{I_n} σ(W^T z + θ) dμ(z) = 0    (7.2)

for all W ∈ C^n and θ ∈ C implies that μ = 0.

The product is needed to ensure that the set τ of functions F : C^n → C of the form (7.1) is an algebra, since it is then closed under multiplication [144]. This is the main difference between Theorem 7.1.1 and Theorem 4.4.2. This shows that there exists a representation of the CVNN that can solve essentially any problem the RVNN can. Indeed, we expect that the CVNN will be more powerful, or at least more efficient. The archetypal example is the XOR: since it is a nonlinear problem, a real-valued net needs a hidden layer to solve it, but a complex-valued net can do so in a single layer [146].

In this study, however, we use simpler versions, both of the QNN and of the CVNN, for the initial benchmarking work; both have already been explored in the literature. The implementation of the CVNN used here is largely based on the work of Aizenberg [146]. The major difference from the RVNN is that the neuron inputs, outputs, weights, and biases are numbers in the complex plane. In the RVNN, signals are only scaled by their respective weights, while the CVNN allows both scaling and rotation of the signals. This leads to higher functionality in more compact architectures. Similar to a conventional RVNN, the input signals to any given neuron are summed with their respective weights:

z = Σ_n w_n · x_n    (7.3)

where w and x are the complex-valued weights and signals. An activation function is then applied to the sum z at the neuron. The complex activation function P takes the input and maps it onto the unit circle in the complex plane, which constitutes the signal from the neuron to the subsequent layer.

P(z) = e^{i·arg(z)}    (7.4)

The network’s scalar inputs and outputs/targets are condensed to the range [0,1] and mapped as points on the complex unit circle using the following mapping function:

M(r) = e^{iπr}    (7.5)

Unlike the RVNN, with this architecture there is no need to take the derivative of the activation function or to apply gradient descent in order for this network to learn. The resulting error from each training pair can be represented as the vector from the output to the target in the complex plane. In other words, given the target t and the output z of an output neuron, the error can be represented as the vector e

e = t − z = Σ_n Δw_n · x_n    (7.6)

given that the resulting error is a contribution of all of the signals to the output neuron. Here

Δw_n is the contribution of each individual weight to the error. Dividing the error equally among the N incoming weights, we arrive at the following required weight adjustment for each training pair per epoch:

Δw_n = (e/N) · x_n^{−1}    (7.7)

For each subsequent layer the error is backpropagated by dividing the error across the weights, relative to the values of the signals they carry. Thus the error gets diluted as it propagates backwards, but this is manageable in practice given that the higher functionality of the CVNN favors shallower networks for conventional problems.
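A minimal sketch of this single-layer complex perceptron, following Eqs. (7.3)-(7.7), might look as follows. This is our illustrative code, not the implementation used for the benchmarks: the function and variable names are ours, the random initialization is arbitrary, and the error in Eq. (7.6) is taken here between the encoded target and the activated output, which is one reading of the text above.

import numpy as np

def mvn_activation(z):
    # Eq. (7.4): project the weighted sum onto the complex unit circle
    return np.exp(1j * np.angle(z))

def encode(r):
    # Eq. (7.5): map a scalar in [0, 1] to a point on the unit circle
    return np.exp(1j * np.pi * r)

def train_complex_perceptron(X, targets, epochs=200, rng=None):
    # X: (n_samples, n_inputs) scalars in [0, 1]; targets: scalars in [0, 1]
    rng = np.random.default_rng(0) if rng is None else rng
    n_inputs = X.shape[1]
    # complex weights; the extra weight acts as a bias on a constant input of 1
    w = 0.1 * (rng.standard_normal(n_inputs + 1) + 1j * rng.standard_normal(n_inputs + 1))
    for _ in range(epochs):
        for x_row, t in zip(X, targets):
            x = np.append(encode(x_row), 1.0 + 0j)   # encoded inputs plus bias input
            z = np.sum(w * x)                        # Eq. (7.3)
            out = mvn_activation(z)
            e = encode(t) - out                      # error vector in the complex plane, Eq. (7.6)
            w += (e / x.size) * (1.0 / x)            # Eq. (7.7): error divided equally, scaled by x^{-1}
    return w

Note that, because the inputs lie on the unit circle, x_n^{-1} is just the complex conjugate of x_n; there is no learning rate and no derivative of the activation function, as described above.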

7.1.3 Quantum Neural Network

Because the quantum benchmark calculation chosen here is that of an entanglement witness, we consider the model for a QNN to be the one we used in Chapter 6. The system is prepared in an initial state, allowed to evolve for a fixed time, then a (predetermined) measure is applied at the final time. The adjustable parameters necessary are the (assumed time-dependent) couplings and tunneling amplitudes of the qubits; the discriminatory function is provided by the measurement, which we may take, without loss of generality, to be any convenient nontrivial projection. For the two-qubit system, we choose as output the square of the qubit-qubit correlation function at the final time, which varies between zero (uncorrelated) and one (fully correlated). Thus we are considering the map

f : Ω → [0, 1]    (7.8)

where Ω ⊂ C^n. This structure lends itself immediately to the perceptron problem, as well as to the entanglement problem. The network input is the initial state of the system, as given by the density matrix ρ(0). The signal is then propagated through time using Schrodinger's equation

dρ/dt = (1/iħ)[H, ρ]    (7.9)

where H is the Hamiltonian. Note that we work with the density matrix representation because it is computationally easier than using the state vector (or wave vector) representation. The density matrix form is more general, allowing mixed as well as pure states, but

here we will in fact use only pure states. In terms of the state vector of the system |ψ⟩, the density matrix is given by ρ = |ψ⟩⟨ψ|.

For a two-qubit system, we take the Hamiltonian to be:

H = Σ_{α=1}^{2} (K_α X_α + ε_α Z_α) + Σ_{α≠β=1}^{2} ζ_{αβ} Z_α Z_β    (7.10)

where {X, Z} are the Pauli operators corresponding to each qubit, {K} are the amplitudes for each qubit to tunnel between states, {ε} are the biases, and {ζ} are the qubit-qubit couplings. Generalization to a fully connected N-qubit system is obvious. We choose the usual charge basis, in which each qubit's state is ±1, corresponding to the Dirac notation |0⟩ and |1⟩, respectively.

By introducing the Liouville operator L, defined by Lρ = (1/ħ)[H, ρ], Eq. (7.9) can be rewritten [82] as

∂ρ/∂t = −iLρ    (7.11)

which has the general solution

ρ(t) = e^{−iLt} ρ(0)    (7.12)

We propagate our initial state, ρ(0), forward in time to the state ρ(tf ) where we will perform

a measurement on ρ(t_f), which mathematically is an operator followed by a trace. That measurement, or some convenient mathematical function of that measurement, is the output of the net. To train the net, the parameters K, ε, and ζ are adjusted based on the error, using a quantum analogue of Werbos's well-known backpropagation technique. This form of a QNN, in terms of a time-dependent Hamiltonian, would be physically implementable on a version of an interconnected array of two-level systems, for example D-Wave's quantum annealing machines.
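For concreteness, a minimal forward-pass sketch of this two-qubit QNN is given below. This is not the training code used in this work: the parameter values are placeholders, ħ is set to one, and the parameters are simply held constant on a few intervals to keep the example short.

import numpy as np
from scipy.linalg import expm

# Pauli matrices and the single-qubit identity
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def hamiltonian(KA, KB, epsA, epsB, zeta):
    # Eq. (7.10) for two qubits: tunneling, bias, and coupling terms
    return (KA * np.kron(X, I2) + KB * np.kron(I2, X)
            + epsA * np.kron(Z, I2) + epsB * np.kron(I2, Z)
            + zeta * np.kron(Z, Z))

def qnn_output(rho0, params, dt):
    # Propagate rho(0) piecewise (parameters constant on each interval), then
    # return the squared qubit-qubit correlation <Z_A Z_B>^2 at the final time.
    rho = rho0
    ZZ = np.kron(Z, Z)
    for (KA, KB, eA, eB, zeta) in params:
        U = expm(-1j * hamiltonian(KA, KB, eA, eB, zeta) * dt)   # hbar = 1
        rho = U @ rho @ U.conj().T
    return np.real(np.trace(ZZ @ rho)) ** 2

# Example input: the Bell state (|00> + |11>)/sqrt(2), with made-up parameter values
psi = np.zeros(4, dtype=complex)
psi[0] = psi[3] = 1 / np.sqrt(2)
rho0 = np.outer(psi, psi.conj())
params = [(2.5, 2.5, 0.1, 0.1, 0.05)] * 4        # four "time chunks", placeholder values
print(qnn_output(rho0, params, dt=0.4))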

There is, clearly, no direct correspondence between these equations for the QNN and those of either the real- or complex-valued neural network above. However, we know that classical neural networks (both real- and complex-valued) are universal, as are quantum neural networks. Thus we can investigate their relationship experimentally, in what follows.

7.2 QNN Versus Classical Neural Networks: Simulating Classical Logic Gates

Perhaps the most basic of neural network architectures is the single-layer perceptron as we have discussed in Chapter 4. Recall that the perceptron is a single output neuron connected to the input neurons by their respective weighted synapses. See Figure 4.2. The real-valued perceptron is a linear classifier and therefore can approximate linearly separable logic functions such as AND and OR gates; however, because it essentially creates a single linear separator, it necessarily fails to approximate non-linear gates, like the XOR or XNOR.

This point is made graphically in Figure 7.1(a): a straight line can separate points for which different outputs are desired. In contrast, a complex perceptron can calculate the XOR or

XNOR, since it can create a nonlinear separation - or, equivalently, uses interference. See

Figure 7.1(b) to see that a straight line cannot separate the outputs of the XOR gate.

Figure 7.1: Visual representation of the linear separability of the OR gate versus the XOR gate. The dotted line represents a linear classifier.

The single layer perceptron was implemented on all four networks. The networks were

trained on the 4 possible input pairs, {(00), (01), (10), and (11)}, for the outputs of the

AND, NAND, OR, NOR, XOR, and XNOR gates until the root mean square (RMS) error reached 1%. See Figure 7.2 for an example. The learning rate of each network was increased to optimal, i.e., the training speed was maximized subject to the constraint that convergence was still achieved. This was done to draw a more equitable comparison of learning speed between the RVNN and CVNN, which have different learning schemes.

Recall that to implement logic gates on our QNN model, we must specify the encoding of inputs as a function of the prepared state of the quantum system, and outputs as functions of some measurement on the final state of the quantum system. We take the inputs to be the basis states of the two-qubit system in the so-called “charge” basis, that is, the states

|00⟩, |01⟩, |10⟩, |11⟩. We choose the output measure to be the square of the qubit-qubit correlation function at the final time, i.e., ⟨Z_A(t_f) Z_B(t_f)⟩². An output of one would mean that the two qubits are perfectly correlated with each other at the final time; an output of zero,

that they are perfectly uncorrelated. The target values for these outputs will, of course, be

different for each logic gate considered. Thus, for example, for the XNOR gate we would

want an initial state of |00⟩ or an initial state of |11⟩ to train to a final correlated state, but initial states of |01⟩ or of |10⟩ to train to a final uncorrelated state. Because of the inherent continuity of the QNN, any initial state close to one of the input states will necessarily produce a final state close to the corresponding output state, and the computation is robust to both noise and decoherence [65, 66].

Table 7.1 shows the number of training epochs required by each “optimal” network to reach an RMS error of 1%, where an epoch is defined as one pass through the 4 training pairs. Note that the nonlinear gates (XOR and XNOR) cannot be done by either of the real classical nets (RVNN and NeuralWorks) with only a single layer. The QNN reached ≤1% error in a single epoch.

Table 7.1: Number of epochs needed for gate training to reach RMS error ≤1% for a (single layer) perceptron using the RVNN, CVNN, and NeuralWorks (to nearest 1000 epochs) implementations.

No. of Epochs to RMS error ≤1%
Gate    N.Works   RVNN    CVNN   QNN
AND     11000     11146   222    1
NAND    11000     11145   127    1
OR      6000      5672    114    1
NOR     6000      5671    172    1
XOR     n/a       n/a     110    1
XNOR    n/a       n/a     78     1

The real-valued network would not train to an RMS error below 50% for the XOR and XNOR gates, given a number of epochs up to order 10^6. In addition, for the linearly separable gates the RVNN required 30-50 times more learning iterations than the CVNN to reach an RMS error of 1%, making the CVNN training runs computationally much more efficient than the RVNN's. Note that the single-layer complex-valued perceptron can learn to solve the non-linearly separable XOR and XNOR gates, and do so with an efficiency at least comparable to that for the linearly separable gates, as shown in Table 7.1.

“Quantum-inspired” networks [147] generally are some version of a CVNN and, therefore, do have increased power over an RVNN due to their ability to use interference: that is, their performance would be expected to be comparable to our CVNN's. However, a fully quantum network can do even better. Results of the training of the QNN for the linear and non-linear logic gates are shown in the last column of Table 7.1. Note that the QNN requires only a single epoch to learn any of the linear or nonlinear gates, and it should also be noted that the error continued to decrease below that level. This is an experimental realization of our 2015 theoretical result deriving weights for a quantum perceptron [149].

Figure 7.2: Example of training RMS as a function of epoch for various logic gates using the (single layer) CVNN perceptron. The RVNN and NeuralWorks networks trained similarly.

7.3 QNN Versus Classical Neural Networks: Iris Classification

The other archetypal machine learning task we consider is that of pattern classification, for which one of the most common benchmarking problems is the Iris flower classification dataset [148]. The multivariate dataset contains four features as inputs: sepal length, sepal width, petal length, and petal width. The corresponding taxonomy of Iris flower (setosa, virginica, or versicolor) is represented as a “one-hot” vector [150]. The set contains a total of 150 pairs, 50 of each species, as identified by Fisher in 1936 [151]. The difficulty of Iris classification is well demonstrated by the scatterplot shown in Fig. 7.3: two of the three species cluster together and require a highly non-linear separability function.

The RVNN and CVNN implementations were trained on the dataset to compare their performance for a non-linear multivariate classification problem. The dataset was divided into two randomly selected sets of 75 pairs containing an equal number of each species for training and testing. The networks were first trained on the entirety of the training set.

The training pairs were then reduced to 30 and 12 pairs while keeping the testing set at 75 pairs. Both the RVNN and CVNN had a single hidden layer, that is, an architecture (4,

Nh, 3), where the number of neurons in the hidden layer, Nh, was increased from three on up. Unlike the RVNN, the CVNN’s testing performance improved substantially as Nh was increased, up to about 100; this is the (optimized) architecture for the results reported in

Table 7.4. “Testing” accuracy in Table 7.4 means classification in the correct category if the appropriate output is above 0.5; below, it is counted as incorrect.

Figure 7.3: Iris dataset cross-input classification scatterplot [148].

For implementation on the QNN, we used the feature data as the coefficients of the input state |ψ(0)⟩ in the Schmidt decomposition. That is, the initial quantum state vector for each input was given by

|ψ(0)⟩ = a|00⟩ + b e^{iθ_1}|01⟩ + c e^{iθ_2}|10⟩ + d e^{iθ_3}|11⟩    (7.13)

We set all the phase offsets {θi} equal to zero (i.e., only real inputs here), and take each

amplitude to correspond to a feature (i.e., a, b, c, d correspond to sepal length, sepal width,

petal length, and petal width). For the output, we again choose the final time correlation function, taken to be the polynomial

[0.01(a² + b²) + c² + 1.8d²] / (a² + b² + c² + d²).

This polynomial sorts out

the three categories of flowers almost completely. It is possible to choose a more complicated function or a higher-degree polynomial to create more separation, but we chose the simplest polynomial for ease of training.
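A small sketch of this encoding and output function is given below. It is illustrative only: scikit-learn's copy of Fisher's data is assumed as a convenient source, and the normalization of the four features into amplitudes a, b, c, d (dividing by the Euclidean norm) is our choice; the output polynomial itself is scale-invariant, so this choice does not change its value.

import numpy as np
from sklearn.datasets import load_iris    # assumed source for Fisher's 1936 data

features, labels = load_iris(return_X_y=True)

def encode_amplitudes(x):
    # Normalize the four features so they can serve as the amplitudes a, b, c, d of |psi(0)>
    return x / np.linalg.norm(x)

def output_polynomial(amps):
    # The chosen output: (0.01(a^2 + b^2) + c^2 + 1.8 d^2) / (a^2 + b^2 + c^2 + d^2)
    a, b, c, d = amps
    num = 0.01 * (a**2 + b**2) + c**2 + 1.8 * d**2
    den = a**2 + b**2 + c**2 + d**2
    return num / den

for x, y in zip(features[::50], labels[::50]):    # one sample from each species
    print(y, round(output_polynomial(encode_amplitudes(x)), 3))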

Table 7.2 shows the QNN trained parameters ε and ζ for the Iris problem, as time-dependent functions fitted to a general Fourier form: f(t) = a_0 + a_1 cos(ωt) + b_1 sin(ωt); Table 7.3 shows, similarly, the tunneling parameter functions K, with additional terms in 2ω and 3ω. The coefficients do not vary much as the size of the training set is increased. We have found that the QNN is fairly robust both to the exact values of the coefficients and to the presence of noise in the system [65, 66]; certainly these values are not unique.

Table 7.2: Trained QNN parameter functions ε_A, ε_B, and ζ, for the Iris problem, in terms of their Fourier coefficients for the function f(t) = a_0 + a_1 cos(ωt) + b_1 sin(ωt), using 12, 30, and 75 training pairs.

ε_A
Training pairs   ω        a0         a1         b1
12               0.0315   9.78E-05   1.16E-04   2.35E-05
30               0.0313   9.75E-05   1.05E-04   1.41E-05
75               0.0311   9.73E-05   1.10E-04   6.84E-06

ε_B
12               0.0286   1.04E-04   5.74E-05   -2.35E-05
30               0.0279   1.06E-04   7.90E-05   -5.46E-05
75               0.0280   1.06E-04   7.18E-05   -4.31E-05

ζ
12               0.0619   1.66E-04   4.60E-06   3.51E-04
30               0.0617   1.59E-04   3.08E-05   4.43E-04
75               0.0614   2.06E-04   5.23E-05   4.52E-04

Table 7.3: Trained QNN parameter functions K_A and K_B, for the Iris problem, in terms of their Fourier coefficients for the function f(t) = a_0 + a_1 cos(ωt) + b_1 sin(ωt) + a_2 cos(2ωt) + b_2 sin(2ωt) + a_3 cos(3ωt) + b_3 sin(3ωt), using 12, 30, and 75 training pairs.

K_A
Training pairs   ω        a0         a1          b1
12               0.0147   3.21E-03   -1.96E-05   -1.11E-03
30               0.0200   2.65E-03   -1.30E-04   -1.80E-04
75               0.0198   2.65E-03   -1.60E-04   -1.90E-04

Training pairs   a2          b2          a3          b3
12               7.40E-04    -2.00E-04   -1.60E-04   2.86E-04
30               -9.58E-05   -8.52E-05   -1.30E-04   -9.79E-05
75               -1.20E-04   -8.45E-05   -1.60E-04   -8.43E-05

K_B
Training pairs   ω        a0         a1         b1
12               0.0193   2.02E-03   6.79E-05   8.89E-05
30               0.0170   2.08E-03   1.69E-04   -1.80E-04
75               0.0198   2.65E-03   1.53E-04   -2.00E-04

Training pairs   a2          b2          a3          b3
12               4.45E-05    -3.58E-05   1.84E-06    -1.20E-04
30               -8.91E-05   -2.50E-04   -2.00E-04   -8.26E-05
75               -9.71E-05   -2.60E-04   -2.10E-04   -8.32E-05

Results of the comparison of the different nets are shown in Table 7.4. On this classification problem the advantage of the CVNN over the RVNN is not nearly as pronounced.

The RMS training errors for the CVNN are consistently below those of the RVNN, but the

testing errors are not. In terms of the classification the CVNN does do better, but not as

significantly as with the perceptron. The reason the classification improves while the RMS

error does not is that the CVNN tends to a flatter output: thresholding gives the “correct”

answer, but the separation between classification of “yes” (i.e., a value near one) and “no” (i.e., a value near zero) is smaller with the CVNN.

Table 7.4: Iris dataset training and testing average percentage RMS and identification accu- racy for different training set sizes using similar RVNN and CVNN networks, and compared with the QNN. The RVNN was run for 50,000 epochs; the CVNN, for 1000 epochs; and the QNN, for 100 epochs. See text for details of the architectures used.

                 Training RMS (%)         Testing RMS (%)          Testing Accuracy (%)
Training Pairs   RVNN   CVNN   QNN        RVNN   CVNN   QNN        RVNN   CVNN   QNN
75               3.45   2.06   0.96       3.71   7.78   2.31       100    100    97.3
30               0.51   0.41   1.1        4.97   9.47   9.78       93.3   96.0   97.5
12               0.69   0.09   0.62       11.9   16.4   7.48       85.3   94.7   85.5

Performance in training, testing, and classification by the QNN is also comparable to both nets. The major difference among the three is in the amount of training necessary to achieve the results. The RVNN required 50,000 epochs, while the CVNN required only 1000, and the QNN only 100. The ratios of iterations necessary are remarkably similar to the results for the perceptron, shown in Table 7.1. In this sense we can say that the QNN is significantly more efficient at training, in classification as well as in computation.

7.4 QNN Versus Classical Neural Networks: Entanglement Calculation

We now consider our networks’ performance on a purely quantum calculation: an entan-

glement witness. As previously discussed, an outstanding problem in quantum information

is quantifying one of the key elements responsible for the powerful capabilities of quantum

computational systems: entanglement. Recall that entanglement is the feature whereby two

or more subsystems’ quantum states cannot be described independently of each other, that

is, that the state of a system AB cannot be correctly represented as the (state of A)⊗(state of

B), where ⊗ indicates the tensor product. Non-multiplicativity, or entanglement, of states is fundamental to quantum mechanics. Indeed, all of quantum mechanics can be derived from this characteristic [153]; in that sense, entanglement is equivalent to non-commutativity of operators, or to the Heisenberg uncertainty principle. Even though quantifying entanglement has been a prominent subject of research, there is no accepted closed-form representation that is generalizable to quantum states of arbitrary size or properties. Because a quantum state is unentangled if it is separable into its component states, entanglement measures are used to determine the “closeness” of a quantum state to the subspace of separable product states. Most of the developed measures involve a subset of possible compositions or require extremizations which are difficult to implement [154], especially for large systems.

One well-known measure is the entanglement of formation which quantifies the resources utilized to construct a specific quantum state. For a two-qubit system we have an explicit formula for the entanglement of formation [71], but generalizations to larger systems are difficult. The lack of a general analytic solution combined with a base of knowledge on spe- cific sets of states suggests an opportunity for a machine learning approach. Given that the entanglement is an inherent property of the quantum mechanical system, that system in a sense “knows” what its entanglement is. Thus, if we are clever, we can devise a measurement or measurements that will extract this information, and, it is hoped, somewhat more effi-

ciently than complete determination of the quantum state (“tomography”) would

require, as the number of measurements goes like 2^{2n}, where n is the number of qubits. It is a theorem [56] that a single measurement cannot determine the entanglement completely even for a two-qubit system; however, one measurement can serve as a “witness”: that is, a measurement giving a lower bound to the correct entanglement.

Using a training set of only four pure states, we successfully showed that this model of a QNN can “learn” a very general entanglement witness, which tracks very well with the entanglement of formation, and which not only works on mixed as well as pure states but can be bootstrapped to larger systems and, most importantly, is robust to noise and to decoherence. See Chapter 6 for details. Any computation done by a quantum system can be simulated; any function can be approximated. But how complex is this calculation?

Classical neural network theory tells us that the weights exist, by the universality argument, but how many do we need, and are they easy to find? In the case of a 2-qubit pure quantum state

|ψ⟩ = a|00⟩ + b e^{iθ_1}|01⟩ + c e^{iθ_2}|10⟩ + d e^{iθ_3}|11⟩

the mapping of the “Concurrence”, which is an entanglement measure, can be written as an

elementary function

C(|ψ⟩)² = |⟨ψ̃|ψ⟩|² = 4a²d² + 4b²c² − 8abcd cos(θ_3 − θ_2 − θ_1)    (7.14)

where |ψ̃⟩ = (σ_y ⊗ σ_y)|ψ*⟩ is the “spin flip” of the state |ψ⟩, and a, b, c, d, θ_1, θ_2, θ_3 ∈ R are defined in Equation 7.13. So it should not be altogether surprising that the QNN can

learn an entanglement indicator for the two-qubit pure system. Furthermore, since the

Concurrence mapping for 2-qubit pure states is continuous, it ought to be able to be approx-

imated by both the RVNN and the CVNN, especially for the relatively simple case where all

166 the phase offsets, {θi}, are zero. In our previous work, see Chapter 6, we saw that the QNN

was able to successfully generalize to test on mixed states, as well. Here we train, and test,

only on pure states; nevertheless, neither the RVNN nor the CVNN was able to accomplish

this. We have shown the trained parameter functions for this problem in Chapter 6; again,

as with the iris problem, the parameters do not vary much as the size of the training set is

increased. It is also possible [142] to train for symmetric problems (like pairwise entangle-

ment) with the symmetry constraints K_A = K_B and ε_A = ε_B.
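As a small check of the closed form in Eq. (7.14) (an illustrative numpy snippet of ours, not part of the original codes), one can compute the spin-flip overlap directly and compare it with the right-hand side; the two-qubit spin flip is written out explicitly as σ_y ⊗ σ_y acting on the complex conjugate of the state.

import numpy as np

Y = np.array([[0, -1j], [1j, 0]])

def spin_flip_overlap_sq(psi):
    # |<psi_tilde|psi>|^2 with |psi_tilde> = (sigma_y x sigma_y)|psi*>
    psi_tilde = np.kron(Y, Y) @ psi.conj()
    return abs(np.vdot(psi_tilde, psi)) ** 2

def closed_form(a, b, c, d, t1, t2, t3):
    # Right-hand side of Eq. (7.14)
    return 4*a**2*d**2 + 4*b**2*c**2 - 8*a*b*c*d*np.cos(t3 - t2 - t1)

# Bell state: a = d = 1/sqrt(2), b = c = 0, all phases zero -> fully entangled
a, b, c, d = 1/np.sqrt(2), 0.0, 0.0, 1/np.sqrt(2)
t1 = t2 = t3 = 0.0
psi = np.array([a, b*np.exp(1j*t1), c*np.exp(1j*t2), d*np.exp(1j*t3)])
print(spin_flip_overlap_sq(psi), closed_form(a, b, c, d, t1, t2, t3))   # both equal 1.0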

Training for each of the four nets on the entanglement witness is shown in Table 7.5.

Each network was given the entire 4 × 4 density matrix of the state as input. The classical nets again were given a hidden layer, and a single output. The architectures used for the

NeuralWorks, RVNN, and CVNN were 16 (input layer), 8 (hidden layer), 1 (output layer).

Numbers of epochs were 5 to 10 thousand for the real valued nets; 1000 for the CVNN, and only 20 for the QNN. Again we note the increase in efficiency of training in going from

RVNN to CVNN, and in going from CVNN to QNN. For the minimal training set of four, all three classical nets trained below 1% error.

However the testing error, shown in Table 7.6, was quite bad for all the classical nets.

Increasing the size of the training set did enable marginally better fitting by all the classical networks, but testing error remained an order of magnitude larger than with the QNN.

Increasing Nh did not help. Increasing the size of the training set affected training and testing very little for the QNN, which had already successfully generalized with a set of only four. Neither the NeuralWorks nor the RVNN could learn pure state entanglement well, even with a training set as large as 100. And while the (“quantum-inspired”) CVNN could use interference, it did not generalize efficiently or well, either. This is not surprising when one

considers that it is precisely access to the entangled part of the Hilbert space that the CVNN lacks; that is, the state of the CVNN cannot itself be entangled. In that sense it is inherently classical.

Table 7.5: Training on entanglement of pure states with zero offsets, for the NeuralWorks, RVNN, CVNN, and QNN.

                 Training RMS error (%)
Training Pairs   N.Works   RVNN   CVNN   QNN
100              5.66      3.74   0.97   0.04
50               5.96      5.89   0.53   0.09
20               6.49      4.97   0.04   0.2
4                0.00      0.93   0.01   0.2

Table 7.6: Testing on entanglement of pure states. Each network was tested on the same set of 25 randomly chosen pure states with zero offset.

                 Testing RMS error (%)
Training Pairs   N.Works   ANN    CVNN   QNN
100              7.56      5.39   3.61   0.2
50               7.91      10.7   6.00   0.3
20               13.6      15.5   9.48   0.4
4                48.2      51.9   55.0   0.4

CHAPTER 8

IMPLEMENTATION OF QUANTUM NEURAL NETWORK ON ACTUAL QUANTUM HARDWARE

Designing and implementing algorithms for medium and large scale quantum computers is not easy. In previous work we have suggested, and developed, the idea of using machine learning techniques to train a quantum system such that the desired process is “learned,” thus obviating the algorithm design difficulty. This works quite well for small systems. But the goal is macroscopic physical computation. Here, we implement our learned pairwise entanglement witness on Microsoft’s Q#, one of the commercially available gate model quantum computer simulators; we perform statistical analysis to determine reliability and reproducibility; and we show that after training the system in stages for an incrementing number of qubits (2, 3, 4, . . . ) we can infer the pattern for mesoscopic N from simulation results for three-, four-, five-, six-, and seven-qubit systems. Our results suggest a fruitful pathway for general quantum computer algorithm design and for practical computation on noisy intermediate scale quantum devices.

8.1 Available Quantum Simulators and Hardware

Quantum computing has become a hot topic of research recently because we have reached the point where it is possible to build actual quantum hardware. Companies like

IBM, Google, Microsoft, D-Wave Systems, Rigetti, and 1QBit are currently developing actual quantum computers. All the companies listed are working on gate model quantum computers except D-Wave Systems, which opted to build adiabatic quantum computers. All these quantum systems operate on superconducting qubits except Microsoft's, which opts for a more exotic approach using something known as topological qubits.

We will briefly discuss some of these commercially available systems, specifically the IBM, Microsoft, and D-Wave systems.

8.1.1 IBM

IBM is one of the leading firms in building quantum computers, and they have made significant progress recently. They have the ambitious goal of making quantum computers mainstream in the next five years. They have launched a website called the IBM Q

Experience that contains a quantum computing software framework called Qiskit, along with their 5-, 16-, and 20-qubit quantum computers, available to the public through the cloud. This is exciting for many researchers because we can actually test our algorithms on actual quantum hardware. IBM built their quantum systems out of the usual superconducting qubits.

Because of the simplicity of superconducting qubits, they are easier to build and operate, making them very popular in many quantum systems.

Figure 8.1: Actual IBM 4-qubit processor. The qubits are controlled by microwave pulses tuned to certain frequencies [191].

8.1.2 Microsoft

The Microsoft quantum computing team has been working for nearly two decades on building their own scalable quantum computer model based on topological qubits, which are believed to be more robust to noise and decoherence (less prone to error) because they are engineered to have two ground states. This degeneracy in the ground states might seem strange at first, because there is no feasible way to distinguish the two from each other.

However, a topological system can use something known as braiding to make this distinction. With this extra protection of the qubits, the scalability problem becomes easier.

Thus, a smaller number of qubits will need to be devoted to error correction compared to other quantum systems. Topological quantum computing was first developed by Alexei Kitaev in 1997 in the paper known as Fault-tolerant quantum computation by anyons [190].

Thus the topological qubits are built based on the properties of an exotic type of particle called anyons. The only issue with this approach is that anyons do not exist on their own; we can only coax them into existence using certain materials. Because of the complexity of building such systems, we have no actual topological quantum systems available at the moment, at least publicly. Similar to IBM, Microsoft has also released a quantum development kit, which includes

a programming language called Q# and a simulator of their topological quantum computing system. Although there is no hardware available, at least as of yet, Microsoft built their simulator to replicate their hardware faithfully.

8.1.3 D-Wave Systems

D-Wave Systems is the first quantum computing startup company, and is based in

Canada. Rather than using the conventional gate model approach, D-Wave opted to build their hardware based on an algorithm called quantum annealing to search for solutions to problems, using the fact that physical systems naturally tend toward a minimum energy state, as discussed in Section 3.5. D-Wave Systems has built hardware with 5000 qubits; however, there is a lot of controversy around the capabilities of their hardware.

Many believe that D-Wave's quantum annealing machines are too restrictive, for the following reasons: limitations of the adiabatic algorithm, limitations in the quality of their qubits, and limitations of their Hamiltonians, as D-Wave uses only stoquastic Hamiltonians (all of the off-diagonal terms are real and non-positive).

Figure 8.2: An actual 128-qubit processor chip from D-Wave Systems [192]

8.2 Two-qubit Quantum Neural Network

Entanglement estimation is a good example of an intrinsically quantum calculation for

which we have no general algorithm. Indeed, it has been shown that the quantum separability

problem (determination of entanglement) is NP-hard [168]. In previous work we succeeded

in mapping a function of a measurement at the final time to a witness of the entanglement

of a two-qubit system in its initial state. The “output” (result of the measurement of the

witness at the final time) will change depending on the time evolution of the system, which is

of course controlled by the Hamiltonian: by the tunneling amplitudes {K}, the qubit biases

{ε}, and the qubit-qubit coupling ζ. Thus we can consider these functions {K_A, K_B, ε_A, ε_B,

ζ} to be the “weights” to be trained. We then use a quantum version of backpropagation to

find optimal functions such that our desired mapping is achieved. Full details are provided in Chapter 5. From a training set of only four pure states, our quantum neural network successfully generalized the witness to large classes of states, mixed as well as pure, as shown in Chapter 6. Qualitatively, what we are doing is using machine learning techniques to find a “best” hyperplane in the Hilbert space to divide separable states from entangled ones.

Now, this method finds a time dependent Hamiltonian that solves the given problem, a procedure more reminiscent of a quantum annealing approach [164] than the gate approach.

But of course the unitary operator of time development can be represented as a product of simple gates, as we have shown in Section 3.3. Thus, a universal quantum computer need only be able to execute each of the members of that set. There is now a large number of quantum simulators available online [169], including the Microsoft Quantum Development Kit

[170] and IBM’s Quantum Experience [171], which implement a universal set of quantum operators (gates) plus many more that are useful in encoding quantum computations, such as the Pauli spin matrices, the Hadamard gate, the CNOT gate, and others. The difficulty arises in determining exactly how to represent a particular calculation: first, in terms of

finding the unitary for that problem; and second, in terms of these gates so that algorithms

may be eventually implemented on real quantum hardware.

Once we do have a unitary, there are several approaches [172, 173] available for de-

composing an arbitrary 2^N × 2^N unitary matrix, representing a quantum computation on

an N-qubit system, into “simple” gates: single qubit and the two-qubit CNOT operations

implemented in the languages associated with one of the online systems. However, none

of these techniques is straightforward, and often the result is a large sequence of gates to

represent the desired unitary. This inherent difficulty is another reason machine learning

techniques are enticing [174]: If we can determine methods where the machines themselves

develop and refine the algorithms they are using, we circumvent part of this intrinsic chal-

lenge of quantum computing. In this paper, because we want to demonstrate the advantages

of our stage training paradigm, we will solve the two-qubit problem by hand, then use these

results to generalize to larger systems.

8.2.1 Reverse Engineering of Entanglement Witness

The system evolves in time according to the Hamiltonian

H = KAXA + KBXB + εAZA + εBZB + ζZAZB (8.1)

where X and Z are the Pauli operators corresponding to qubits A and B, KA and KB are the tunneling amplitudes, εA and εB are the biases, and ζ is the qubit-qubit coupling. The state of the system as a function of time can then be written, for a pure state, as

|ψ(t)⟩ = exp(−iHt/ħ) |ψ(0)⟩.    (8.2)

It is convenient to consider the Hamiltonian H as a sum of single-qubit and two-qubit

operations

H = (K_A X_A + ε_A Z_A) + (K_B X_B + ε_B Z_B) + ζ Z_A Z_B ≡ H_A + H_B + H_AB.    (8.3)

We now consider the evolution to be broken into several “time chunks” on which the parameters

{K_A, K_B, ε_A, ε_B, ζ} are held constant on each interval. For most of this chapter, we will use four time chunks or intervals. We can approximate the operator as a product of several operators as follows:

exp(−iHt/ħ) = Π_{k=0}^{3} exp(−iH_k t/(4ħ)),    (8.4)

where on each time chunk k the operator is approximated using the first-order Trotter-Suzuki

formula [175]

exp(−iH_k t/(4ħ)) = exp(−i(H_A + H_B + H_AB)_k t/(4ħ))    (8.5)
                  ≈ exp(−iH_{A,k} t/(4ħ)) exp(−iH_{B,k} t/(4ħ)) exp(−iH_{AB,k} t/(4ħ)).    (8.6)

This last equation is only approximate, because while HA and HB commute, neither com-

mutes with HAB. The Trotter-Suzuki formula introduces an error of the type

exp(d(P + Q)) = exp(dP) exp(dQ) + O(d²)    (8.7)

for a scalar parameter d and non-commuting ([P, Q] ≠ 0) operators P and Q. In our appli-

cation, this d corresponds to the time evolution variable, which is measured on the order of

nanoseconds and should not introduce much error. During each interval or time chunk the

functions {KA, KB, εA, εB, ζ} are constant, so, we may, for a given time interval ∆t = t/4,

rewrite the operator given by (8.6) as a product of physically realizable quantum gates, such

as those implemented in the Q# or Qiskit languages. Comparisons between the final density

matrices made in the training process and the full gate-based decomposition, as well as the effect of a finer discretization, are discussed in Sections 8.2.2 and 8.4.2, respectively.

We start with the single-qubit part of the operator for a single time interval, exp(−iH_A Δt/ħ), but first note that for a unit vector n̂ = (n_x, n_y, n_z)

(n̂ · σ̃)² = (n_x X + n_y Y + n_z Z)(n_x X + n_y Y + n_z Z)
         = n_x² X² + n_x n_y XY + n_x n_z XZ + n_y n_x YX + n_y² Y² + n_y n_z YZ
           + n_z n_x ZX + n_z n_y ZY + n_z² Z²
         = (n_x² + n_y² + n_z²) I = I    (8.8)

where I is the 2 × 2 identity matrix. The last equality follows from the fact that

XY = −YX = iZ

YZ = −ZY = iX

ZX = −XZ = iY.

From Equation (8.8) we have the following identity:

e^{−iα(n̂·σ̃)} = I cos(α) − i(n̂ · σ̃) sin(α)    (8.9)

where α is an angle of rotation about the axis n̂ (a unit vector) on the Bloch sphere, and σ̃ is the vector of the Pauli matrices {X, Y, Z}. Looking at the definition of H_A (or H_B) in equation

(8.3), we see that it is easy to express the exponent in the form of representation (8.9):

(Δt/ħ) H_A = (Δt/ħ)(K_A X_A + 0·Y_A + ε_A Z_A)
           = [(Δt/ħ)√(K_A² + ε_A²)] [ (K_A/√(K_A² + ε_A²)) X_A + 0·Y_A + (ε_A/√(K_A² + ε_A²)) Z_A ],    (8.10)

where the first bracketed factor plays the role of α and the second is n̂ · σ̃.

Interpreting the operator as a rotation on the Bloch sphere, we have a formula for a rotation by α about an axis n̂ [2]

R_n̂(α) = R_z(γ) R_y(β) R_z(α) R_y(−β) R_z(−γ),    (8.11)

where the rotations R_x(θ), R_y(θ), and R_z(θ) are defined as

R_x(θ) = e^{−iθX/2} = [ cos(θ/2)     −i sin(θ/2) ]
                      [ −i sin(θ/2)   cos(θ/2)   ]

R_y(θ) = e^{−iθY/2} = [ cos(θ/2)   −sin(θ/2) ]
                      [ sin(θ/2)    cos(θ/2) ]

R_z(θ) = e^{−iθZ/2} = [ e^{−iθ/2}   0        ]
                      [ 0           e^{iθ/2} ]

The Q# and Qiskit environments [170, 171] have access to a function which computes the rotation of a state about the x, y, or z axis of the Bloch sphere by a specified angle, so this expression will suffice supposing that we can find the appropriate values for α, β, and γ in

(8.11). To do this, we use some analogues to this expression in terms of Pauli matrices and spherical coordinates:

R_n̂(α) = I cos(α/2) − i (n̂ · σ̃) sin(α/2)
       = I cos(α/2) − i (sin β cos γ X + sin β sin γ Y + cos β Z) sin(α/2)    (8.12)

Our expression (8.10) matches (8.12) perfectly, and now we need only solve the following system of three equations with three unknowns:

sin β cos γ = K_A/√(K_A² + ε_A²),    sin β sin γ = 0,    cos β = ε_A/√(K_A² + ε_A²).    (8.13)

We notice immediately that sin γ = 0 (since sin β cannot be zero, due to the first equation), and so γ = cπ for some integer c. This forces cos γ to be ±1. Lastly, β = sin⁻¹(±K_A/√(K_A² + ε_A²)).

We see that the relative sizes of K_A and ε_A are constrained by the sine and cosine relationship

sin β = ± K_A/√(K_A² + ε_A²),    cos β = ε_A/√(K_A² + ε_A²).
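As a small sanity check of this step (an illustrative snippet of ours, not part of the original work; it uses numpy's arctan2 to select the branch automatically rather than the arcsin analysis above, and the parameter values are placeholders), one can compute β directly and verify the relations in (8.13):

import numpy as np

KA, epsA = 2.49, 0.093                 # hypothetical values for one time chunk

beta = np.arctan2(KA, epsA)            # branch chosen so that both relations below hold
gamma = 0.0                            # sin(gamma) = 0, cos(gamma) = 1

norm = np.hypot(KA, epsA)              # sqrt(K_A^2 + eps_A^2)
print(np.isclose(np.sin(beta) * np.cos(gamma), KA / norm))   # first equation of (8.13)
print(np.isclose(np.cos(beta), epsA / norm))                 # third equation of (8.13)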

A change of indices gives us the operator HB in a similar way. The only remaining step is

to write HAB of (8.3) in a form using practical quantum gates.

The two-qubit part of the Hamiltonian is HAB = ζZAZB. The matrix form of this

operator is generated by taking the Kronecker product, ZAZB = Z ⊗ Z. After setting

w_0 = ζΔt/ħ and taking the exponential of the operator, we have

exp(−iH_AB Δt/ħ) = diag(e^{−iw_0}, e^{iw_0}, e^{iw_0}, e^{−iw_0}).    (8.14)

Since this is a two-qubit operator, it is necessary to represent it using a two-qubit gate. The

primary tool for this is the CNOT (controlled-NOT) gate,

CNOT = [ 1 0 0 0 ]
       [ 0 1 0 0 ]
       [ 0 0 0 1 ]
       [ 0 0 1 0 ].    (8.15)

In general the CNOT operator in addition to single-qubit phase gates forms a universal set

with which to build an arbitrary (N-qubit) operator. For our purposes, we may represent

matrix (8.14) using the following expression:

exp(−iH_AB Δt/ħ) = CNOT · diag(e^{−iw_0}, e^{iw_0}, e^{−iw_0}, e^{iw_0}) · CNOT.    (8.16)

The interior matrix is I ⊗ R_z(2w_0), which is a rotation on only the B qubit. With the above

decompositions for HA, HB, and HAB, we can now express each time chunk of our quantum

operator in terms of a quantum circuit

where we have relabeled variables as

w_{5k} = ζ_k Δt/ħ,    w_{5k+1} = sin⁻¹(K_{A,k}/√(K_{A,k}² + ε_{A,k}²)),    w_{5k+2} = sin⁻¹(K_{B,k}/√(K_{B,k}² + ε_{B,k}²)),

w_{5k+3} = (Δt/ħ)√(K_{A,k}² + ε_{A,k}²),    and    w_{5k+4} = (Δt/ħ)√(K_{B,k}² + ε_{B,k}²).

Collected formulaically, the gate decomposition of operator (8.4) is

exp(−iHt/ħ) = Π_{k=0}^{3} U_{A,k} U_{B,k} U_{AB,k}    (8.17)

where

U_{A,k} = [R_y(w_{5k+1}) R_z(w_{5k+3}) R_y(−w_{5k+1})] ⊗ I    (8.18)

U_{B,k} = I ⊗ [R_y(w_{5k+2}) R_z(w_{5k+4}) R_y(−w_{5k+2})]    (8.19)

U_{AB,k} = CNOT [I ⊗ R_z(2w_{5k})] CNOT.    (8.20)
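A minimal numerical sketch of two ingredients of this construction is given below. It is illustrative only: the parameter values are placeholders loosely in the range of Table 8.2, ħ is set to one, and the check covers the size of the first-order Trotter error of Eqs. (8.6)-(8.7) and the CNOT identity of Eq. (8.16), not the full per-chunk circuit.

import numpy as np
from scipy.linalg import expm

# Single-qubit Paulis and basic gates
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)

def Rz(theta):
    # R_z(theta) = exp(-i theta Z / 2), as defined above
    return expm(-1j * theta / 2 * Z)

# Placeholder values for one time chunk (hbar = 1, arbitrary units and time)
KA = KB = 2.49
eA = eB = 0.093
zeta, dt = 0.0382, 0.4

HA = KA * np.kron(X, I2) + eA * np.kron(Z, I2)
HB = KB * np.kron(I2, X) + eB * np.kron(I2, Z)
HAB = zeta * np.kron(Z, Z)

# Size of the first-order Trotter error for this chunk, Eqs. (8.6)-(8.7)
U_exact = expm(-1j * (HA + HB + HAB) * dt)
U_split = expm(-1j * HA * dt) @ expm(-1j * HB * dt) @ expm(-1j * HAB * dt)
print("Trotter error:", np.linalg.norm(U_exact - U_split))

# Two-qubit piece as gates, Eq. (8.16): CNOT (I x Rz(2*w0)) CNOT reproduces exp(-i H_AB dt)
w0 = zeta * dt
U_gates = CNOT @ np.kron(I2, Rz(2 * w0)) @ CNOT
print("Gate identity error:", np.linalg.norm(U_gates - expm(-1j * HAB * dt)))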

8.2.2 Numerical computation

In our original work [61, 56, 64] on the entanglement witness, we used piecewise constant

functions for {K_A, K_B, ε_A, ε_B, ζ}; in subsequent work [176] we used continuum functions,

for which we found training was much more rapid and complete. Because current technol-

ogy does not allow for continuous-time control of gate functions, we return to our original

piecewise formulation; however, we have retrained using our more recent codes to improve

our earlier results.

Physically, we imagine that the system would be allowed to evolve for a specified time

under a Hamiltonian whose parameter functions we could control. At the end of that time we

perform a measurement whose average value would represent the entanglement witness. The

training of the net is a process whereby we find an optimal mapping of the desired physical

property (here, the entanglement) to that chosen measurement. We chose, as that measure-

ment, the (square of the) qubit-qubit correlation function at that final time, ⟨Z_A(t_f) Z_B(t_f)⟩².

To perform the retraining, we used our (newer) continuum codes implemented in MAT-

LAB [177], following the general approach of [56] with some modifications. Each parameter

{K_A, K_B, ε_A, ε_B, ζ} is represented as a Fourier series and allowed to evolve for a specified time under the prescribed Hamiltonian. After each epoch (single pass through the whole training set) of training, the parameter functions are approximated as piecewise constant functions obtained by averaging over each of the four time chunks. Next, these piecewise constant functions are used as the initial parameters for the following training epoch. Each time, the parameter functions are allowed to evolve as Fourier series, but are

“chunked” and averaged back into a piecewise constant form. After a sufficient amount of training, we use the piecewise constant parameters for time evolution to calculate the expec- tation value of the final time correlation function, and, therefore, the error. Training data are shown in Table 8.1. The “Desired” column is the goal of the training for the final time correlation function, showing that we seek a value of one for a fully entangled state and zero for a product state. Because we are trying to optimize our entanglement witness, we find a value intermediate between zero and one for the target value for the partially entangled state

|P⟩; this (optimized) value is 0.443. The column labelled “Trained” shows the asymptotic

value for that final time correlation function after the training of the network. Training was of course less efficient than with the greater flexibility offered by the continuous-time functions; nonetheless, the RMS error for the training set was only 0.05% after 200 epochs. The piecewise constant values found for the parameter functions are shown in Table 8.2.
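A minimal illustration of the "chunk and average" step described above is sketched here; it is not the MATLAB training code, and the Fourier coefficients used are made up rather than the trained values.

import numpy as np

def fourier_param(t, a0, a1, b1, omega):
    # Fourier form used for the continuum parameter functions
    return a0 + a1 * np.cos(omega * t) + b1 * np.sin(omega * t)

def chunk_average(f, t_final, n_chunks=4, samples_per_chunk=100):
    # Replace a continuous parameter function by its average on each time chunk
    edges = np.linspace(0.0, t_final, n_chunks + 1)
    return [f(np.linspace(lo, hi, samples_per_chunk)).mean()
            for lo, hi in zip(edges[:-1], edges[1:])]

# Example with made-up coefficients (not the trained values)
f = lambda t: fourier_param(t, a0=0.1, a1=0.05, b1=-0.02, omega=0.03)
print(chunk_average(f, t_final=1.58))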

Table 8.1: QNN entanglement witness trained for 200 epochs using piecewise constant parameter functions, and compared with calculated results using first the chunked time propagators, then the sequence of gates, and finally the Q# simulator [170]. The training set of four [56] includes one completely entangled state |Bell⟩ = (1/√2)[|00⟩ + |11⟩], one unentangled state |Flat⟩ = (1/2)[|00⟩ + |01⟩ + |10⟩ + |11⟩], one classically correlated but unentangled state |C⟩ = (1/√5)[2|00⟩ + |01⟩], and one partially entangled state |P⟩ = (1/√3)[|01⟩ + |10⟩ + |11⟩]. Errors for each method are shown in the final line.

Input state        Desired   Trained       Chunked       Gates         Q#
|Bell⟩             1.0       0.999         0.999         0.999         0.999
|Flat⟩             0.0       7.99 × 10⁻⁵   5.99 × 10⁻⁷   5.99 × 10⁻⁷   6.14 × 10⁻⁵
|C⟩                0.0       1.08 × 10⁻⁴   1.87 × 10⁻⁵   1.87 × 10⁻⁵   8.17 × 10⁻⁵
|P⟩                0.443     0.440         0.446         0.446         0.446
Total RMS error              5.0 × 10⁻⁴    1.4 × 10⁻³    1.4 × 10⁻³    1.7 × 10⁻³

Table 8.2: Trained parameter functions for the entanglement witness for the two-qubit sys- tem, in MHz. Total time of evolution for the two time propagation methods was 1.58 ns.

Parameter     Interval 1   Interval 2   Interval 3   Interval 4
K_A = K_B     2.49         2.47         2.48         2.49
ζ             0.0382       0.128        0.117        0.0382
ε_A = ε_B     0.0930       0.116        0.0954       0.0833

We now use these trained values for the piecewise constant parameter functions in the equations derived for the sequence of operators in the previous section. Note that there are two separate sources for the error: the approximation in Equation (8.6), which assumes that the matrices commute; and the approximation of substituting the products of the gate operators for the time-propagation operator. We can separate these two sources by calculating the density matrix for the final time using “chunked time.” That is, instead of calculating the time propagation correctly as in the QNN training, we separate the Hamiltonian into H_A, H_B, and H_AB for each of the four time intervals. This assumes that the pieces commute, which of course is an approximation. The column in Table 8.1 labelled “Chunked” shows the calculation of the entanglement witness using this approximation. The “Gates” column repeats these calculations using the matrix decomposition outlined in Section 8.2.1.

The final column, labelled “Q#”, shows the entanglement witness values of the sequence of gates as measured on Microsoft’s quantum simulator [170]. (Calculations performed using

IBM's Quantum Experience simulator [171] produced almost identical results [178].) Note that the calculated numbers for the entanglement witness in the last two columns are extremely close, as are, of course, the RMS errors for each method. The Frobenius norm of the difference between the density matrix as trained by the QNN technique and the (non-commuting) chunked time propagation matrices is in each case 1 to 2%, while the norm of the difference between the density matrix calculated by the chunked time propagator and by the sequence of applied gates is in each case on the order of 10⁻¹⁵, i.e., within round-off error.

Clearly all of the error comes from the non-commutation. This validates our replacement of the time evolution operator by the product of gates.

8.3 Statistical Evaluation of Entanglement Witness in Q#

With the entanglement witness properly reverse engineered to run on the hardware sim-

ulators, we now need to understand how to utilize it in applied situations. Both the Q# and

Qiskit systems implement measurements of the qubit along a standard axis x, y, or z in the

Bloch sphere. Each individual measurement only returns an eigenvalue of ±1. To generate a useable expected value, several thousand measurements must be done to average these

eigenvalues to get a valid approximation of the expectation value ⟨Z_A(t_f) Z_B(t_f)⟩². Using the Q# built-in simulator, we did 100 iterations at several different “shot counts” (numbers of individual experiments and measurements) to gauge how many times a particular experiment must be run to generate a high-confidence value for the entanglement witness. Our code is available at [179].
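The statistics of this procedure can be illustrated with a minimal sketch (ours, not the Q# code at [179]): it draws the ±1 single-shot outcomes directly with numpy rather than calling a quantum simulator, and the example correlation value is chosen only so that the resulting witness is near the |P⟩ target of 0.443.

import numpy as np

def estimate_witness(p_plus, shots, reps=100, rng=None):
    # Simulate 'shots' single-shot measurements of Z_A Z_B (eigenvalues +/-1),
    # average them, square the average, and report the spread over 'reps' repetitions.
    rng = np.random.default_rng(0) if rng is None else rng
    outcomes = rng.choice([1.0, -1.0], size=(reps, shots), p=[p_plus, 1 - p_plus])
    witness = outcomes.mean(axis=1) ** 2        # <Z_A Z_B>^2 estimate for each repetition
    return witness.mean(), witness.var()

# Example: a state with <Z_A Z_B> = 0.666, so the witness is about 0.443
for shots in (100, 1000, 15000):
    print(shots, estimate_witness(p_plus=(1 + 0.666) / 2, shots=shots))

As expected, the spread of the estimates shrinks as the shot count grows, mirroring the behavior seen in Figures 8.3-8.5.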


Figure 8.3: Variance in entanglement witness for 100 iterations of each state measured at shot counts ranging from 50 to 20,000 in 50 shot increments. As the shot count increases, we see that the measurement variance quickly goes to zero.

Figure 8.3 shows the variance of the expectation value ⟨Z_A(t_f) Z_B(t_f)⟩² as a function of shot count. We can see plainly that the law of large numbers is in effect for determining the entanglement witness. High-confidence values for the witness are achieved near 15,000 iterations of the experiment. This is easier to see in Figures 8.4 and 8.5, which show a 95% confidence interval surrounding the computed square of the qubit-qubit correlation for the witness on the |Bell⟩ and |P⟩ states, where the width of the interval shrinks to

0.0015. Results for the |Flat⟩ and |C⟩ states are similar.


Figure 8.4: Q# entanglement witness values for the |Bell⟩ state with a 95% confidence interval as a function of the shot count. The confidence interval (CI) width reaches its minimum of ∼0.0015 after approximately 15,000 shots.


Figure 8.5: Q# entanglement witness values for the |P⟩ state with a 95% confidence interval as a function of the shot count. The confidence interval (CI) width reaches its minimum of ∼0.0015 after approximately 15,000 shots.

8.4 Iterative Staging

We have constructed a sequence of hardware gates that mimics our trained two-qubit entanglement witness quite well. While this is interesting it is perhaps of somewhat limited use, as it pertains only to a two-qubit system. We now extend our results to an N-qubit

system.

8.4.1 Searching for an Asymptotic Limit

As demonstrated in our previous work, it is both possible and very beneficial to use

knowledge of smaller scale systems to make systematic inferences about larger ones. Hence,

we train our network in iterative stages, using the trained parameter functions for the two-

qubit system {K_A = K_B, ε_A = ε_B, ζ} as an initial guess for those parameters in the three-

qubit case. We then train the three-qubit system from that point to minimize the error.

Once the three-qubit trained functions are found, we start from those to train the four-qubit

185 system, and so on. The benefit is that while there are large changes in the tunneling, bias,

and coupling parameters as the system size increases initially, those percentage changes di-

minish as the system size N increases, due to the increased connectivity. Hence, training

five-, six-, and seven-qubit systems require fewer and fewer additional epochs. Because of

the symmetry of the problem, all the parameter functions can be taken to be the same (that

is, K_A(t) = K_B(t) = K_C(t), ε_A(t) = ε_B(t) = ε_C(t), and so on); imposing this as a constraint also reduces the training time. We now look for an asymptotic limit as N increases. Figures

8.6 and 8.7 show the results of the training.


Figure 8.6: Trained values for the tunneling amplitude K and for the bias ε, for each time chunk, as the number of qubits in the system is increased. Both demonstrate clear asymptotic behavior.


Figure 8.7: Trained values for the qubit-qubit coupling ζ, for each time chunk, as the number of qubits in the system is increased. The values show a clear trend, but do not become asymptotic as quickly as with the other parameters.

All parameters show an asymptotic trend, with the tunneling amplitudes K and the biases ε showing swift convergence to a limiting value. The qubit-qubit coupling ζ also has a trend emerging as the number of qubits increases, indicating that an N-qubit limit is likely. We infer that the parameters for the seven-qubit system are a reasonable approximation for the entanglement witness of an N-qubit system, based on the limiting behavior observed in K and ε. This is important, because once quantum computers become even a little larger we will no longer be able to simulate them on classical computers (the point of so-called “quantum supremacy” [180]). Table 8.3 contains the parameters for the fully symmetric seven-qubit system.

Table 8.3: Trained parameter values at each time interval, for the pairwise entanglement witness for the seven-qubit system, in MHz. By symmetry, each of the tunneling functions K, each of the biases ε, and each of the pairwise couplings ζ is the same. We take these values to be an approximation to the asymptotic limit of the parameters for an N-qubit quantum system.

Parameter   Interval 1   Interval 2   Interval 3   Interval 4
K            2.49         2.47         2.48         2.51
ζ            0.0188       0.0440       0.0805       0.00132
ε           -0.0164       0.299        0.0636      -0.0693

Training for these parameters was relatively efficient for an N-qubit system, taking only 100 additional epochs past the previously trained (N−1)-qubit system to train the pairwise entanglement witness. While the number of training pairs, 4·N(N−1)/2, does increase with the number of qubits, the increased connectivity meant that the system needed less additional training each time. The total RMS value for the training of the two-qubit parameters is 6.0 × 10^-4, and it increased only slightly as qubits were added, with the six-qubit system having an RMS of 1.6 × 10^-3 (at 60 training pairs) and the seven-qubit system 1.8 × 10^-3 (84 training pairs). Mesoscopic systems will still require some training to decrease initial errors, but this amount should be very small or negligible, since we already see the parameters nearing asymptotic values, and we anticipate that this (small) additional training can be done online and need not be simulated.
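For concreteness, the sketch below assembles the fully symmetric chunked Hamiltonian from the Table 8.3 values and multiplies the per-chunk propagators together. It is only an illustration of the construction, not the MATLAB training code: the total evolution time t_f, the reduced qubit count n, the use of scipy, and the convention of summing the coupling over distinct pairs are assumptions made for the example.

import numpy as np
from functools import reduce
from scipy.linalg import expm

I2 = np.eye(2)
sx = np.array([[0.0, 1.0], [1.0, 0.0]])
sz = np.diag([1.0, -1.0])

def embed(op, site, n):
    # Single-site operator acting on qubit `site` of an n-qubit register.
    ops = [I2] * n
    ops[site] = op
    return reduce(np.kron, ops)

def chunk_hamiltonian(K, eps, zeta, n):
    # Fully symmetric H = sum_a [K sx_a + eps sz_a] + sum_{a<b} zeta sz_a sz_b.
    H = np.zeros((2**n, 2**n))
    for a in range(n):
        H += K * embed(sx, a, n) + eps * embed(sz, a, n)
        for b in range(a + 1, n):
            H += zeta * embed(sz, a, n) @ embed(sz, b, n)
    return H

# Table 8.3 values (MHz); n and t_f below are illustrative choices, not the trained settings.
Ks    = [2.49,    2.47,   2.48,   2.51]
zetas = [0.0188,  0.0440, 0.0805, 0.00132]
epss  = [-0.0164, 0.299,  0.0636, -0.0693]
n, t_f = 3, 1.3
dt = t_f / 4
U = np.eye(2**n, dtype=complex)
for K, eps, zeta in zip(Ks, epss, zetas):
    U = expm(-1j * chunk_hamiltonian(K, eps, zeta, n) * dt) @ U   # hbar = 1
print(np.allclose(U.conj().T @ U, np.eye(2**n)))                   # unitarity check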

8.4.2 Comparing the Discrete and Continuum Cases

Having established the scalability of our results in terms of growing system size, we now show how the results for the chunked system compare to our more sophisticated model for the entanglement witness studied in [176]. In that work, the tunneling, bias, and coupling parameters were all modeled as continuous functions of time. Allowing continuous parameters added a great deal more flexibility and assisted training immensely. This approach to quantum machine learning resulted in smaller errors and faster training than the piecewise-constant “chunked” model. However, with the current gate-based model of quantum computing, we have no expectation of being able to implement or train a continuum-parameter solution on developing or proposed hardware. Therefore, we examine the relationship between entanglement witnesses built and trained using the chunked and continuous versions of the K, ε, and ζ parameters.
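One simple way to relate the two pictures is to time-average a continuous parameter function over each chunk. The sketch below does this for a hypothetical smooth bias curve; the lambda used for ε(t) is purely illustrative and is not the trained continuum function from [176].

import numpy as np

def chunked(f, t0, tf, n_chunks, samples=200):
    # Piecewise-constant approximation of a continuous parameter function f(t):
    # each chunk takes the average of f over its time interval.
    edges = np.linspace(t0, tf, n_chunks + 1)
    values = [f(np.linspace(a, b, samples)).mean() for a, b in zip(edges[:-1], edges[1:])]
    return edges, np.array(values)

eps = lambda t: 0.3 * np.sin(np.pi * t) ** 2 - 0.05   # hypothetical smooth bias curve
print(chunked(eps, 0.0, 1.0, 4)[1])                    # 4-chunk values
print(chunked(eps, 0.0, 1.0, 8)[1])                    # 8-chunk values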

Figure 8.8: Trained bias, for 4 and 8 time chunks, as functions of time, for systems of increasing numbers of qubits N.

Figure 8.9: Trained bias functions for the continuum model of the entanglement witness, for systems of increasing numbers of qubits N. Note how the graphs in Figure 8.8 are close approximations of the shapes and values of the function for each number of qubits.

Figure 8.8 shows the bias as a function of time in the 4-chunk model, for two- through seven-qubit systems, compared with a similar 8-chunk model, where the time chunks or intervals were halved. In each case we see that, as the number of qubits increases, a similar curve takes shape as the bias function seems to reach asymptotic values. Figure 8.9 is the continuum case [176], for the same two- through seven-qubit systems. Each system was trained using the techniques outlined in [64, 176] with the imposed symmetry constraints as discussed in the previous section. We observe that the 4- and 8-chunk cases show a strong qualitative resemblance to the shape of the parameter function in the continuum case. Quantitatively, the exact values of the discretized functions and the continuum model do not match, but the disagreement is small and the overshooting can be attributed to fitting error. In previous work we have shown [65, 66] both that the calculation is relatively insensitive to the exact values of the time-dependent functions and that the robustness increases as the system size N is increased; thus the disagreement is probably irrelevant, and in any case becomes more irrelevant with increasing N. Total RMS error for each of these simulations is shown in Table 8.4.

Table 8.4: Total RMS error for the training set in the 4-chunk, 8-chunk, and continuum models of the entanglement witness, for system sizes ranging from two to seven qubits. Training followed the methods of [64, 176] with the additional condition that all parameters are fully symmetric. The continuum model shows the best accuracy, but the discretized versions also trained well and are viable approximations of the continuum model (which is not realizable on current hardware).

RMS error
Qubits   4 Chunk        8 Chunk        Continuous
2        6.0 × 10^-4    6.4 × 10^-4    5.2 × 10^-4
3        6.0 × 10^-4    8.8 × 10^-4    1.2 × 10^-3
4        4.2 × 10^-4    1.2 × 10^-3    4.3 × 10^-4
5        3.6 × 10^-4    1.0 × 10^-3    4.1 × 10^-4
6        1.6 × 10^-3    1.2 × 10^-3    6.4 × 10^-4
7        1.8 × 10^-3    1.4 × 10^-3    6.6 × 10^-4

8.5 Discussions

As an example of using machine learning techniques to train quantum systems to do computations for which no algorithm is known, we have trained a system of qubits to return a witness to its initial pairwise entanglement, by manipulating parameters in a time-dependent Hamiltonian. The training process used classical computation performed in MATLAB, which was later verified using the Q# simulator to determine that the computation was in fact implementable in the gate-model decomposition, returned results consistent with the error bounds projected by our training, and generated usable results without requiring an extremely large number of experiments (shots) within the hardware simulation.

This procedure is reminiscent of a physical setup like the quantum annealing processors [164], which have a time-dependent Hamiltonian (though the parameter flexibility is still severely limited). But the approach outlined here is a kind of bridge between the annealing and gate approaches to quantum computing: with systematic Hamiltonian design, QA computers could be used as programmable machines as well [55]. The entanglement witness was well approximated by a series of implementable gates. A thorough statistical analysis was done, and a good confidence interval of about 0.0015 is reached after 15,000 shots. The discretized parameter setup models the entanglement witness accurately with respect to two kinds of scaling: increasing the number of qubits and increasing the number of time chunks in the piecewise-constant parameter functions. Agreement was excellent, and the calculation generalizes well and easily as the number of qubits N increases to seven.

Two of the parameter functions so learned seem already to have reached an asymptote, which would mean that the witness could probably be implemented with only small error for much larger values of N, or, at minimum, could be trained online with little effort [181]. The qubit-qubit coupling could not definitively be said to have reached its asymptote, but it is at least plausible that a large fraction of the training necessary has already been accomplished, and, again, has reduced the amount of further training necessary. This potential training reduction is significant because in the near term we can expect that quantum computers will both be noisy and have to operate with a severely limited number of ancillary qubits.

Both noise and decoherence continue to be problematic, and with only 50 to a few hundred qubits total, practical computations cannot afford even the minimum of five ancilla qubits per correction [40]. Our research suggests another approach for these noisy, intermediate-scale quantum (NISQ) [182] devices: to perform offline (simulation) training followed by online (physical device) training, using deep reinforcement learning [181] or automatic differentiation [183] to fine-tune the system parameters. Our work here strongly suggests that the amount of online training necessary should not be large; moreover, our quantum machine learning technique should be robust to both noise and decoherence, as shown in Chapter 6, which will be a great advantage. We are currently working on determining the most promising approach [184]. And, while we have done this work using the calculation of a pairwise entanglement witness as an exemplar, there is no reason to think that our results are unique to that particular calculation: in all probability the technique presented here could be used as a general scale-up paradigm.
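As a flavor of the kind of on-device gradient evaluation referred to in [183], the sketch below checks the parameter-shift rule for a single rotation gate against a finite difference. The one-parameter R_x circuit measured in Z is purely illustrative; it is not our witness circuit, and on hardware the two shifted expectation values would themselves be estimated from shots.

import numpy as np

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.diag([1.0, -1.0]).astype(complex)

def rx(theta):
    # Single-qubit rotation exp(-i * theta * sigma_x / 2).
    return np.cos(theta / 2) * np.eye(2) - 1j * np.sin(theta / 2) * sx

def expval_z(theta):
    psi = rx(theta) @ np.array([1.0, 0.0], dtype=complex)
    return (psi.conj() @ sz @ psi).real

theta = 0.7
shift  = (expval_z(theta + np.pi / 2) - expval_z(theta - np.pi / 2)) / 2   # parameter-shift rule
finite = (expval_z(theta + 1e-6) - expval_z(theta - 1e-6)) / 2e-6           # numerical check
print(shift, finite)   # both approximate d<Z>/d(theta) = -sin(theta)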

Physical implementation still poses some problems. There are major limitations in both connectivity and decoherence with the target hardware. For viable hardware implementation, the main consideration is the computational fidelity. Fidelity is lost both to time and to inefficient computations. On the available IBM hardware, coherence times are approximately 60 µs for both depolarization and spin dephasing [185]. The time required to apply a single-qubit gate is about 0.130 µs, and two-qubit gates take between 0.250 µs and 0.450 µs. Any state preparation and quantum circuit operations must be completed within the 60 µs interval. Our implementation of the chunked pairwise entanglement witness uses 28 single-qubit and 8 two-qubit gates, which yields a total time of less than 8 µs (plus up to 2 µs to prepare a state); despite this, reproducibility on IBM hardware was not good [178]. Gate fidelity also affects computations. Single-qubit readouts are accurate 96% of the time, and single- and two-qubit gates maintain fidelities of 99.7% and 96.5%, respectively [185].

Available hardware and circuit implementation techniques will of course improve. Developers are working on two important avenues to combat decoherence: higher-fidelity physical implementation of quantum gates [186, 187], and the reduction of the so-called T-depth for circuits [173, 188]. The value of fidelity increases is obvious, and reducing the physical time required to perform the operations of a circuit will improve computational accuracy. It should be noted that as the coherence times of the hardware improve, our training paradigm increases in value, as we can give better models with finer discretizations of the continuum training for our entanglement witness and, of course, other desired calculations. Moreover, machine learning solutions may also have robustness advantages [65, 66].
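A quick back-of-the-envelope check of this timing budget, using the gate counts and the representative gate times quoted above (actual per-gate durations vary by device, so the numbers are only indicative):

# 28 single-qubit and 8 two-qubit gates at the quoted IBM gate times.
n_1q, n_2q = 28, 8
t_1q, t_2q = 0.130, 0.450        # microseconds; 0.450 us is the worst-case two-qubit time
t_prep = 2.0                      # up to 2 us of state preparation
circuit = n_1q * t_1q + n_2q * t_2q
print(f"circuit ~ {circuit:.2f} us (+ up to {t_prep} us preparation) of a ~60 us coherence window")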

Optimization of the discretization of universal circuits for operators involving very small numbers of qubits at a time is a major advance towards universal quantum computation. But it is not the whole answer. For one thing, many times we do not know the unitary operator that will perform the computation, since we do not have an algorithm. For another, we still do not have optimal ways of reducing an N-qubit unitary to building blocks involving only one or two qubits. Machine learning holds a great deal of promise for both tasks. Our work here seems to show that with iterative staging we can fairly easily extend small simulational results to larger systems. And, even when a unitary is known that performs the desired calculation, a clever neural network approach may find one with more efficiency or better speedup [189].

CHAPTER 9

CONCLUSIONS

We have shown throughout this dissertation that the Quantum Neural Network provides an alternative to the algorithmic approach to quantum computation. In view of the fact that it is so hard to develop quantum algorithms for many physically applicable problems, the QNN seems to be a viable option to overcome this bottleneck. We also answered the universality question for our model of the QNN. In particular, we showed that the structure of our QNN is indeed universal; that is, it can learn any computational task, whether quantum mechanical or classical. We hope that this will eliminate some misunderstanding about the capability of our model of the QNN, as well as of other QNN models.

Furthermore, we have successfully shown that our model of QNN calculation is robust to both noise and decoherence, first on a two-qubit system, then generalized to multiple-qubit systems. The increased connectivity in the higher-qubit system both decreases the required training time per qubit and increases the stability of the system, independently of the details of the noise source. This seems to be especially true for decoherence: the increased robustness is evident whether the system is trained or tested with noise. While exact simulations of macroscopic quantum computers remain impractical, these results bring us a step forward in the investigation of the quantum neural approach for extrapolation to macroscopic quantum computing. A useful quantum computer will have thousands of qubits. Based on our work from Chapter 6 so far, it seems likely that the increase in size will only improve our training and help our system to be more robust to noise and decoherence. We chose an entanglement indicator as an example problem, but we would expect the same kind of effects for other measures, for the reasons outlined above. As the number of qubits increases, the number of epochs required for training each qubit decreases with the method of bootstrapping, but the total simulation time necessary goes up. This is to be expected, since the reduction is linear but the connectivity is quadratic. Because of this, our calculations were limited to a five-qubit system for the complete entanglement (all possible entanglements) problem. However, we have built up to an eight-qubit system training only the 28 pairwise entanglements, and we see similar results in that case as well. We could have taken the pairwise entanglement witness calculation much higher than eight qubits quite easily, but because of the limited size of the actual hardware there was no point in doing this calculation at the moment. With better hardware, there would be no difficulty in extending our simulations to a much larger system using our bootstrapping technique. The point is that our QNN is robust to noise and decoherence, meaning not only can we train a quantum computer to learn a specific task (create its own algorithm), we can do so in such a way that it is only minimally affected by noise and decoherence. This is similar to what has been investigated very recently as the “noise-resilient quantum circuit”: a type of quantum circuit that is least affected by noise, without having to use additional ancilla qubits for error correction, which is not feasible even for the current state of the art or any near-term quantum devices. In a way, what we found is a noise-resilient quantum circuit.

The problem of quantifying speedup for quantum algorithms continues to be a difficult one, solved only on a case-by-case basis, as we have seen with Shor's prime factorization algorithm, Grover's search algorithm, and the Bernstein-Vazirani algorithm. The situation with quantum machine learning is similar [135, 136]. We need both general theorems (universality) and benchmarking for specific architectures and learning methods. Both of these questions have been answered, at least to some degree by experiments, in this dissertation. We showed in Chapter 7 that our QNN model outperforms classical neural networks on both classical and quantum problems. However, we have not benchmarked our QNN model against other QNN models, like those of [135, 138]. Chen et al. [138] also found a quantum advantage for their (very different) quantum neural network model.

We validated our QNN results by implementing them on actual quantum hardware. The way our QNN is set up is reminiscent of a physical setup like the quantum annealing processors, which have a time-dependent Hamiltonian. But since there are no publicly available quantum annealing processors as of yet, we converted our QNN to fit into the gate model. This conversion serves as a potential bridge between the annealing and gate approaches to quantum computing. That is, with systematic Hamiltonian design, QA computers could be used as programmable machines as well. The entanglement witness was well approximated by a series of implementable gates. A thorough statistical analysis was done, and a good confidence interval of about 0.0015 is reached after 15,000 shots.

The marriage of quantum computing and machine learning can be enormously beneficial to both fields: with machine learning techniques we can find “algorithms”, do automatic scale-up, and increase robustness significantly as we have shown; while with quantum systems and their access to the enormously greater Hilbert space, machine learning can be both more efficient and more powerful. In this dissertation we provide evidence of both.

CHAPTER 10

FUTURE WORK

There are many mysteries in this area still awaiting discovery and solution. However, a few extensions can be pursued immediately.

The first extension is to build quantum hybrid neural networks: networks with both quantum and classical information processing. Throughout this dissertation we have been interested only in pure quantum neural networks, and they provide great results. However, we think that we can improve on these results by bridging classical neural networks together with quantum neural networks. We propose several models of these hybrid networks in Sections 10.1 and 10.2. We hope that these models will extend the capabilities beyond the pure QNN or NN models.

Another extension is to train the Quantum Neural Networks on the actual hardware. In Chapter 6, we injected artificial noise into our QNN model, hoping that this noise correctly (at least to some degree) represents the actual noise in the physical system. However, we do not know exactly how noise is introduced, and this will vary from system to system. Therefore, the best way to account for the actual noise would be to do online training instead.

Another extension is to use fractional derivatives for backpropagation. Because of the non-local nature of the fractional derivative, one might be able to increase the speed of the training.

10.1 Quantum Hybrid Neural Network using Multi-Measurements

In this section we will present a possible structure of a QHNN using different measurements. Each measurement will extract a certain feature of the map we want to achieve. The architecture of this QHNN is as follows: we first start with the initial density matrix, ρ(t_0), where the input variables are encoded. We then propagate ρ(t_0) through time to ρ(t_f) through Schrödinger's equation, as we have done previously,

dρ/dt = (i/ℏ) [H(t), ρ(t)]    (10.1)

where we can take the Hamiltonian H to still have the form

H(t) = Σ_{α=1}^{N} [K_α(t) σ_x^α + ε_α(t) σ_z^α] + Σ_{α≠β=1}^{N} ζ_{αβ}(t) σ_z^α σ_z^β    (10.2)

where σ_x^α = σ_I ⊗ σ_I ⊗ ··· ⊗ σ_x ⊗ ··· ⊗ σ_I with σ_x located in the α-th spot. K_α represents the tunneling amplitude of qubit α; ε_α, the potential energy or bias of qubit α; and ζ_{αβ} the coupling parameter between qubits α and β. Once we have propagated the system to ρ(t_f), we will perform several different measurements, M_1, M_2, ..., M_n. Of course, in an actual physical device each measurement will collapse the state of the system; hence, after a measurement is done, we will need to propagate the network from ρ(t_0) to ρ(t_f) again, using the same time evolution unitary, before we make a new measurement. Once all the desired measurements have been completed, we will pass the ‘classical’ data through a classical artificial neural network for processing. See Figure 10.1 for a better visualization of the architecture of this network. The harder task here is to determine the set of measurements {M_i} we would like to perform. This is still not clear, and it is something we need to investigate further. However, we can still perform simulations with different sets of measurements and study the behavior of this network.
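A minimal numerical sketch of this architecture for two qubits is given below. The fixed Hamiltonian, the particular set of measurements, and the random (untrained) classical weights are all illustrative assumptions; no training is performed, and on hardware each expectation value Tr(M_i ρ(t_f)) would be estimated by re-preparing and re-propagating the state.

import numpy as np
from scipy.linalg import expm

I2 = np.eye(2)
sx = np.array([[0.0, 1.0], [1.0, 0.0]])
sz = np.diag([1.0, -1.0])
XI, IZ, ZZ = np.kron(sx, I2), np.kron(I2, sz), np.kron(sz, sz)

def propagate(rho0, H, t):
    U = expm(-1j * H * t)
    return U @ rho0 @ U.conj().T

def quantum_features(rho0, H, t, measurements):
    # One classical feature Tr(M_i rho(t_f)) per measurement operator M_i.
    rho_f = propagate(rho0, H, t)
    return np.array([np.trace(M @ rho_f).real for M in measurements])

def layer(x, W, b):
    return np.tanh(W @ x + b)

H = 2.5 * XI + 0.1 * IZ + 0.05 * ZZ              # hypothetical fixed two-qubit Hamiltonian
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # untrained classical hidden layer
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)    # untrained classical output layer

psi0 = np.array([1, 0, 0, 0], dtype=complex)     # encode the input in rho(t_0)
rho0 = np.outer(psi0, psi0.conj())
x = quantum_features(rho0, H, t=1.0, measurements=[ZZ, XI, IZ])
print(x, layer(layer(x, W1, b1), W2, b2))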

Figure 10.1: Architecture of the QHNN using the multi-measurement scheme (three different measurements in this case). The network takes as input the quantum state ρ(t_0) and then propagates it forward to ρ(t_f). At this point, different measurements are performed, and these classical values are stored and processed through a classical artificial neural network. Although there is only one classical hidden layer in this structure, this need not be the case. Adding extra hidden layers might speed up the training.

Note that it is possible to take the quantum time propagation step as one unitary weight matrix U instead of the Hamiltonian approach. In this case, the training method will be a little different: we will need to incorporate the techniques laid out in Section 5.5, along with classical backpropagation for the classical layers, instead of the usual variational method described in Section 5.2 coupled with classical backpropagation for the classical part. Remember that the Hamiltonian approach does give us an advantage when it comes to future hardware implementation. However, considering the time evolution as just a single unitary weight matrix U allows us to train the network quickly.

10.2 Quantum Hybrid Neural Network using Multi-Step Time Propagation

Here we will present another possible structure for a QHNN, using discrete sampling at different time steps. In contrast to the previous QHNN model, we only require a single measurement, M, for this model. However, this measurement will be performed repeatedly at incremental time steps, starting at t_0 and continuing until t_f. We hope that these incremental measurement steps will help us to extract certain features of the mapping. The number, n, of measurements to be made from t_0 to t_f will be predetermined. We start at the input state ρ(t_0), which will be propagated to ρ(t_0 + Δt) by equation 2.21, where Δt = (t_f − t_0)/n. At this point, we will make a measurement to obtain a classical value, Y_1^Q, where

Y_1^Q = Tr(M ρ(t_0 + Δt))    (10.3)

We then continue to propagate to the next step, ρ(t_0 + 2Δt), where another measurement will be made to obtain Y_2^Q:

Y_2^Q = Tr(M ρ(t_0 + 2Δt))    (10.4)

This will continue until we reach ρ(t_f), where the last measurement will be made:

Y_n^Q = Tr(M ρ(t_0 + nΔt)) = Tr(M ρ(t_f))    (10.5)

In actual hardware each measurement will collapse the quantum state; however, we can always rerun the propagation up to the desired time step. Once all the measurements have been performed, that is, we have reached the final time step, we will pass the information we have sampled through measurement into a classical neural network. Note that the values collected from the quantum neural network part are the input values for the classical neural network part. This classical part can have an arbitrary number of hidden layers, and each layer can have an arbitrary number of neurons. Once the information has been fed forward to the end, we can calculate the error and backpropagate this error through all the layers until the quantum step, updating the classical weights along the way. At the quantum step, we will perform the quantum backpropagation as described in Section 5.2. See Figure 10.2 for a specific example of the architecture of this network. The motivation for this network comes from the classical time-delay network.
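A corresponding sketch of the time-sampling step for two qubits is given below; the fixed Hamiltonian and the single measurement operator M are hypothetical, no classical layers or training are included, and the propagation is simply repeated application of the same Δt propagator.

import numpy as np
from scipy.linalg import expm

I2 = np.eye(2)
sx = np.array([[0.0, 1.0], [1.0, 0.0]])
sz = np.diag([1.0, -1.0])
H = 2.5 * np.kron(sx, I2) + 0.05 * np.kron(sz, sz)   # hypothetical fixed Hamiltonian
M = np.kron(sz, sz)                                   # the single repeated measurement

def time_samples(rho0, H, M, t0, tf, n):
    # Y_k = Tr(M rho(t0 + k*dt)) for k = 1..n, with dt = (tf - t0)/n.
    dt = (tf - t0) / n
    U = expm(-1j * H * dt)
    samples, rho = [], rho0
    for _ in range(n):
        rho = U @ rho @ U.conj().T
        samples.append(np.trace(M @ rho).real)
    return np.array(samples)

psi0 = np.array([1, 0, 0, 0], dtype=complex)
rho0 = np.outer(psi0, psi0.conj())
Y = time_samples(rho0, H, M, t0=0.0, tf=1.0, n=8)     # inputs to the classical layers
print(Y)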

Figure 10.2: A possible architecture of the QHNN using discrete time sampling. Here the classical part of the network has one hidden layer with N nodes. This need not be the case in general.

REFERENCES

[1] G. Baym, Lectures on Quantum Mechanics, W.A. Benjamin. (1974).

[2] M.A. Nielsen, I.L.Chuang, Quantum Computation and Quantum Information, Cam- bridge University Press, New York, NY, USA. (2011).

[3] C. H. Bennett, Logical reversibility of computation, IBM J. Res. Dev. 17, 525-532. (1972).

[4] M.P. Frank, Back to the Future: The Case for Reversible Computing, ArXiv abs/1803.02789. (2018).

[5] G.E. Moore, Cramming More Components onto Integrated Circuits, Electronics, pp. 114-117. (1965).

[6] M. Reed, B. Simon, Functional Analysis, Academic Press. (1981)

[7] T. Mainiero, Schmidt decomposition on infinite-dimensional Hilbert spaces, URL: https://mathoverflow.net/q/281012.

[8] Kirk T. MCDonald, Physics of Quantum Computation. (2017)

[9] S. Aaronson and D. Gottesman, Improved simulation of stabilizer circuits , Phys. Rev.A, 70(052328). (2004).

[10] D. Aharonov, W. van Dam, J. Kempe, Z. Landau, S. Loyd, and O. Regev Adiabatic quantum computation is equivalent to standard quantum computation, SIAM Journal of Computing, vol. 37, issue 1, pp. 166194, 2007.

[11] T. Lanting, Entanglement in a quantum annealing processor, Phys. Rev. X 4, 021041, (2014).

[12] N.S. Dattani, N. Bryans, Quantum factorization of 56153 with only 4 qubits , ArXiv abs/1411.6758 (2014).

[13] A.M. Childs, E. Farhi, J. Preskill, Robustness of adiabatic quantum computation, Phys. Rev. A 65, 012322, (2001)

[14] P.W. Shor, Polynomial-Time Algorithms for Prime Factorization and Discrete Loga- rithms on a Quantum Computer , SIAM Journal on Computing 26.5, (1997)

[15] F. Arute, K. Arya, R. Babbush et al. Quantum supremacy using a programmable su- perconducting processor. Nature 574, 505510 (2019)

[16] A. W. Harrow, A. Hassidim, S. Lloyd, Quantum algorithm for linear systems of equa- tions. Phys. Rev. Lett. 103, 150502 (2009). DOI 10.1103/PhysRevLett.103.150502.


[17] N. Wiebe, D. Braun, S. Lloyd, Quantum algorithm for data fitting, Phys. Rev. Lett. 109, 050505 (2012). DOI 10.1103/PhysRevLett.109.050505.

[18] S. Lloyd, M. Mohseni, P. Rebentrost, Quantum principal component analysis, Nat. Phys. 10, 631633 (2014). DOI 10.1038/nphys3029. Letter.

[19] P. Rebentrost, M. Mohseni, S. Lloyd, Quantum support vector machine for clas- sification, Phys. Rev. Lett. 113, 130503 (2014). DOI 10.1103/PhysRevLett.113.130503.

[20] S. Lloyd, S. Garnerone, P. Zanardi, Quantum algorithms for topological and geometric analysis of data, Nat. Commun. 7, 10138 (2016).

[21] M. Schuld, I. Sinayskiy, F. Petruccione, Prediction by linear regression on a quantum computer, Phys. Rev. A 94, 022342 (2016). DOI 10.1103/physreva.94.022342

[22] H. Lau, R. Pooser, G. Siopsis, C. Weedbrook, Quantum machine learning over infinite dimensions , Physical Review Letters 118 (2017). DOI 10.1103/physrevlett.118.080501. https://arxiv.org/abs/1603.06222.

[23] E. Farhi, J. Jeffrey Goldstone, S. Gutmann, A Quantum Approximate Optimization Algorithm, https://arxiv.org/abs/1411.4028. (2014)

[24] A.Peruzzo, J. McClean, P. Shadbolt, M-H. Yung, X-Q Zhou, P.J. Love, A. Aspuru- Guzik, J.L. O’Brien A variational eigenvalue solver on a photonic quantum processor, Nat Commun 5, 4213 (2014). https://doi.org/10.1038/ncomms5213

[25] J.R. McClean, J. Romero, R. Babbush, A. Aspuru-Guzik, The theory of vari- ational hybrid quantum-classical algorithms, New J. Phys. 18 023023, (2016). https://doi.org/10.1088/1367-2630/18/2/023023

[26] E. Farhi, H. Neven, Classification with Quantum Neural Networks on Near Term Pro- cessors, arXiv:1802.06002 (2018).

[27] J.R. McClean, S. Boixo, V.N. Smelyanskiy, R. Babbush, H. Neven Barren plateaus in quantum neural network training landscapes, Nat Commun 9, 4812 (2018). https://doi.org/10.1038/s41467-018-07090-4

[28] G. Verdon, M. Broughton, J.R. McClean, K.J. Sung, R. Babbush, Z. Jiang, H. Neven, M. Mohseni, Learning to learn with quantum neural networks via classical neural networks, https://arxiv.org/abs/1907.05415 (2019).

[29] J. Clarke, F.K Wilhelm, Superconducting quantum bits, Nature, Nature, 453(7198):10311042, 2008.


[30] P. Kok, W.J Munro, K. Nemoto, T.C. Ralph, J.P. Dowling, and G.J. Milburn, Linear optical quantum computing with photonic qubits, Reviews of Modern Physics, 79(1):135, 2007.

[31] J.I. Cirac, P. Zoller, Quantum computations with cold trapped ions, Physical Review Letters, 74(20):4091, 1995.

[32] C. Nayak, S.H. Simon, A. Stern, M. Freedman, and S.D Sarma, Non-Abelian anyons and topological quantum computation, Reviews of Modern Physics, 80(3):1083, 2008.

[33] M.V Altaisky, N.N. Zolnikova, N.E. Kaputkina, V. Krylov, Y.E. Lozovik, N.S. Dat- tani, Entanglement in a quantum neural network based on quantum dots, Photonics and Nanostructures - Fundamentals and Applications, Pg. 24-28. (2017)

[34] P.W. Shor, Scheme for reducing decoherence in quantum computer memory, Phys. Rev. A 52, R2493(R), (1995).

[35] W.H. Zurek, Quantum Darwinism, Nature Physics 5, 181-188 (2009).

[36] W.G. Unruh, Maintaining coherence in quantum computers, Phys. Rev. A 51, 992997. (1995)

[37] A. M. Steane, Error correcting codes in quantum theory Phys. Rev. Lett. 77, 793. (1996)

[38] A. M. Steane, Multiple particle interference and quantum error correction, Proc. Roy. Soc.Lond. A 452, 2551. (1996)

[39] R. Laflamme, C. Miquel, J.P Paz, W,H. Zurek, Perfect Quantum Error Correcting Code, Phys. Rev. Lett. 77, 198. (1996)

[40] E. Knill, R. Laflamme Theory of error-correcting codes, Phys. Rev. A 55, 900. (1997)

[41] B.M. Terhal, Bell inequalitites and the separability criterion, Phys. Lett. A 271. (2000)

[42] S. Gharibian, Strong NP-hardness of the quantum separability problem, Quantum Infor- mation and Computation 10, 343-360. (2008)

[43] R. Horodecki, P. Horodecki, M. Horodecki, and K. Horodecki, Quantum Entanglement Rev. Mod. Phys. 81, 865. (2009)

[44] A.Lucas, Ising formulations of many NP problems, Frontiers Physics. (2014)

[45] J. Biamonte, P. Love, Realizable Hamiltonians for universal adiabatic quantum comput- ers, Phys. Rev. A. (2008)


[46] Available at https://en.wikipedia.org/wiki/Biological_neuron_model

[47] K.P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, Cambridge. (2012)

[48] G. Cybenko Approximation by superpositions of a sigmoidal function, Math. Control Signal Systems 2, pp. 303-314.(1989)

[49] A. N. Kolmogorov, On the representation of continuous functions of many variables by superpositions of continuous functions of one variable and addition, Doklady Akademii Nauk USSR, 14 (1957), pp. 953-956.

[50] B. Fridman, An improvement on the smoothness of the functions in Kolmogorov's theorem on superpositions, Dokl. Akad. Nauk SSSR, 177 (1967), pp. 1019-1022. English translation: Soviet Math. Dokl. (8), 1550-1553, 1967.

[51] G. Lorentz, Metric entropy, widths, and superpositions of functions, The American Mathematical Monthly, 69 (1962), pp. 469485.

[52] D. A. Sprecher, On the structure of continuous functions of several variables, Transac- tions Amer. Math. Soc, 115 (1965), pp. 340355.

[53] M.A. Nielsen, Neural Networks and Deep Learning, Determination Press, 2015.

[54] E.C. Behrman, J. Niemel, J. E. Steck and S. R. Skinner, A Quantum Dot Neural Net- work, Proceedings of the 4th Workshop on Physics of Computation, Boston, pp. 22-4, (1996).

[55] E.C. Behrman, J.E. Steck, and M.A. Moustafa, Learning quantum annealing, Quantum Inf. Comput. 17, 0469-0487 (2017).

[56] E.C. Behrman, J.E. Steck, P. Kumar, and K.A. Walsh, Quantum algorithm design using dynamic learning, Quantum Information and Computation 8, pp. 12-29. (2008)

[57] E.C. Behrman and J.E. Steck, Dynamic learning of pairwise and three-way entangle- ment, in Proceedings of the Third World Congress on Nature and Biologically Inspired Computing (NaBIC 2011) (Salamanca, Spain, October 19-21, 2011.) (Institute of Elec- trical and Electronics Engineers).

[58] E.C. Behrman and J.E. Steck, Multiqubit entanglement of a general input state, Quan- tum Information and Computation 13, 36-53 (2013).

[59] E.C. Behrman and J.E. Steck, A quantum neural network computes its own relative phase, in Proceedings of the IEEE Symposium on Computational Intelligence 2013 (Sin- gapore, April 15-19, 2013.) (Institute of Electrical and Electronics Engineers).


[60] E.C. Behrman, L.R. Nash, J.E. Steck, V. Chandrashekar, and S.R. Skinner, Simulations of quantum neural networks, Information Sciences 128, 257 (2000).

[61] E.C. Behrman, V. Chandrashekar, Z. Wang, C.K. Belur, J.E. Steck, and S.R. Skinner, A quantum neural network computes entanglement, arXiv:quant-ph/0202131 (2002).

[62] M.J. Rethinam, A.K. Javali, A.E. Hart, E.C. Behrman, and J.E. Steck, A genetic algorithm for finding pulse sequences for nmr quantum computing, Paritantra Journal of Systems Science and Engineering 20, 32-42 (2011).

[63] R. Allauddin, K. Gaddam, S. Boehmer, E.C. Behrman, and J.E. Steck, Quantum si- multaneous recurrent networks for content addressable memory, in Quantum-Inspired Intelligent Systems, N. Nedjah, L. dos Santos Coelho, and L. de Macedo Mourelle, eds. (Springer Verlag, 2008).

[64] E. C. Behrman and J. E. Steck, Multiqubit entanglement of a general input state, Quantum Inf. Comput. 13, 36-53 (2013).

[65] E.C. Behrman, N.H. Nguyen, J.E. Steck, and M. McCann, Quantum neural computation of entanglement is robust to noise and decoherence, in Quantum Inspired Computational Intelligence: Research and Applications, S. Bhattacharyya, ed. (Morgan Kauffman, Elsevier) pp.3-33 (2016).

[66] N.H. Nguyen, E.C. Behrman, J.E. Steck, ”Quantum Learning with Noise and Decoher- ence: A Robust Quantum Neural Network”, Quantum Machine Intelligence, Springer (2019).

[67] N.H. Nguyen, E.C. Behrman, M.A. Moustafa and J.E. Steck, Benchmarking neural net- works for quantum computations, IEEE Transactions on Neural Networks and Learning Systems (to appear, 2019); also at arXiv:1807.03253

[68] N.H. Nguyen, B. Samarakoon, E.C. Behrman, and J.E. Steck, Pattern storage in qubit arrays using entanglement, in preparation (2019).

[69] N.L. Thompson, N.H. Nguyen, E.C. Behrman, and J.E. Steck, Experimental pairwise entanglement estimation for an N-qubit system: a machine learning approach for pro- gramming quantum hardware, submitted to Quantum Information Processing (2019); available at arXiv:1902.07754.

[70] C.H. Bennett, D.P. DiVincenzo, J.A. Smolin, and W.K. Wootters (1996), Mixed-state entanglement and quantum error correction, Phys. Rev. A 54, pp. 3824-3851.

[71] W.K. Wootters (1998), Entanglement of formation of an arbitrary state of two qubits, Phys. Rev. Lett. 80, pp. 2245-2248.


[72] D.M. Greenberger, M.A. Horne, and A. Zeilinger (1989), in Bell’s Theorem and the Conception of the Universe, M. Kafatos, ed. , Kluwer Acdemic (Dordrecht), p 107.

[73] V. Vedral, M.B. Plenio, M.A. Rippin, and P.L. Knight (1997), Quantifying entangle- ment, Phys. Rev. Lett. 78, pp. 2275-2279; V Vedral and M.B. Plenio (1998), Entangle- ment measures and purification procedures, Phys. Rev. A 57, pp. 1619-1633; L. Hender- son and V. Vedral (2001), Classical, quantum and total correlations, J. Phys. A 34, pp. 6899-6905.

[74] S. Tamaryan, A. Sudbery, and L. Tamaryan (2010), Duality and the geometric measure of entanglement of general multiqubit W states, Phys. Rev. A 81, 052319.

[75] H.S. Park, S.-S.B. Lee, H. Kim, S.-K. Choi, and H.-S. Sim (2010), Construction of optimal witness for unknown two-qubit entanglement, Phys. Rev. Lett. 105, 230404.

[76] R. Filip (2002), Overlap and entanglement-witness measurements , Phys. Rev. A 65, 062320; F.G.S.L. Brando (2005), Quantifying entanglement with witness operators , quant-ph/0503152.

[77] T. Yamamoto, Yu.A. Pashkin, O. Astafiev, Y. Nakamura, and J.S. Tsai (2003), Demon- stration of conditional gate operation using superconducting charge qubits, Nature 425, pp. 941-944.

[78] Yann le Cun (1988), A theoretical framework for back-propagation in Proc. 1998 Con- nectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, eds., Morgan Kaufmann, (San Mateo), pp. 21-28.

[79] Paul Werbos, in Handbook of Intelligent Control, Van Nostrand Reinhold, p. 79 (1992); Yann le Cun, A theoretical framework for back-propagation in Proc. 1998 Connection- ist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, eds., Morgan Kaufmann, (San Mateo), pp. 21-28 (1988).

[80] B. Efron and R.J. Tibshirani (1994), An Introduction to the bootstrap. Boca Raton, FL: Chapman and Hall/CRC.

[81] P.D. Wasserman (1993), Advanced Methods in Neural Computing. New York: Van Nos- trand Reinhold.

[82] A. Peres (1995), Quantum Theory: Concepts and Methods. Dordrecht, The Netherlands: Kluwer.

[83] R.P. Feynman (1951), An operator calculus having applications in quantum electro- dynamics, Phys. Rev. 84, 108-128; R.P. Feynman and A.R. Hibbs (1965), Quantum Mechanics and Path Integrals, McGraw-Hill (New York).


[84] P.W. Shor, Polynomial-Time algorithms for prime factorization and discrete logarithms on a quantum computer, SIAM J. Comput. 26, pp. 1484-1509 (1995).

[85] S. Lloyd, The universe as quantum computer, arXiv:1312.4455 (2013).

[86] G. Kalai, Quantum computers: Noise propagation and adversarial noise models, arXiv:0904.3265 (2009); G. Kalai, How quantum computers fail: Quantum codes, correlations in physical systems, and noise accumulation, arXiv:1106.0485 (2011); G. Kalai, The quantum computer puzzle, arXiv:1605.00992 (2016).

[87] T. Albash and D.A. Lidar, Decoherence in adiabatic quantum computation, Phys. Rev. A 91, 062320 (2015).

[88] N. Wax, Selected papers on noise and stochastic process, Dover (2014).

[89] W.H. Zurek, Quantum Darwinism, Nature Physics 5, 181-188 (2009).

[90] A. Paetznick and B.W. Reichardt, Fault-tolerant ancilla preparation and noise threshold lower bounds for the 23-qubit Golay code, Quantum Inf. Comput. 12, 11-12, pp. 1034- 1080 (2013).

[91] C.H. Bennett, D.P. DiVincenzo, J.A. Smolin, and W.K. Wootters, Mixed-state entan- glement and quantum error correction, Phys. Rev. A 54, pp. 3824-3851 (1996).

[92] Y. Glickman, S. Kotler, N. Akerman, and R. Ozeri, Emergence of a measurement basis in atom-photon scattering, Science 339, pp. 1187-1191 (2012).

[93] S. Takahasi, I.S Tupitsyn, J. van Tol, C.C Beedle, D.N Hendrickson, and P.C.E Stamp, Decoherence in crystals of quantum molecular magnets, Nature 476, pp.7679 (2011).

[94] K. Roszak, R. Filip, and T. Novotny, Decoherence control by decoherence itself, Scien- tific Reports 5, Article number: 9796 (2014).

[95] D. Dong, M.A Mabrok, I.R Petersen, B. Qi, C. Chen, and H. Rabitz, Sampling-based learning control for quantum systems with uncertainties, IEEE Transactions on Control Systems Technology 23, pp. 2155-2166 (2015).

[96] A.W. Cross, G. Smith, and J.A. Smolin, Quantum learning robust to noise, Phys. Rev. A 92, 012327 (2015).

[97] L.V. Fausett, Fundamentals of Neural Networks. Pearson, (1993.)

[98] N. Wiebe, A. Kapoor, and K. Svore, Quantum algorithms for nearest neighbor methods for supervised and unsupervised learning, Quantum Inf. Comput. 15, 318-358 (2014); J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, and S. Lloyd, Quantum machine learning, Nature 549, 195-202 (2017).


[99] V. Dunjiko, J.M. Taylor, and H.J. Briegel, Quantum-enhanced machine learning, Phys. Rev. Lett. 117, 130501 (2016).

[100] J. Bang, A. Dutta, S-W. Lee, and J. Kim, Optimal usage of quantum random access memory in quantum machine learning, Phys. Rev. A 99, 012326 (2019).

[101] M. Schuld, I. Sinayskiy, and F. Petruccione, An introduction to quantum machine learning, Contemp. Phys. 56, 172-185 (2014).

[102] M.J. Hartmann and G. Carleo, Neural network approach to dissipative quantum many- body dynamics, Phys. Rev. Lett. 122, 250502 (2019).

[103] A. Nagy and V. Savona, Variational quantum Monte Carlo method with a neural network ansatz for open quantum systems, Phys. Rev. Lett. 122, 250501 (2019).

[104] F. Vicentini, A. Biella, N. Regnault, and C. Ciuti, Variational neural network ansatz for steady states in open quantum systems, Phys. Rev. Lett. 122, 250503 (2019).

[105] P. Mehta, M. Bukov, C-H. Wang, A.G.R. Day, C. Richardson, C.K. Fisher, and D.J. Schwab, A high-bias, low-variance introduction to machine learning for physicists, Phys. Rep. 810, 1-124 (2019).

[106] C.M. Bishop, Neural Networks for . Oxford Univ Press, 1995.

[107] I. Aizenberg, Complex-Valued Neural Networks with Multi-Valued Neurons, Springer (2011).

[108] B. Efron and R.J. Tibshirani, An Introduction to the bootstrap. Boca Raton, FL: Chapman and Hall/CRC (1994).

[109] R.D. Reed, Neural Smithing. Bradford, 1999.

[110] I. Goodfellow, Y. Bengio, A. Courville, and F. Bach, Deep Learning. MIT Press, 2016.

[111] A. Neelakantan, L. Vilnis, Q.V. Le, I. Sutskever, L. Kaiser, K. Kurach, and J. Martens, Adding Gradient Noise Improves Learning for Very Deep Networks, arXiv:1511.06807 (2015).

[112] L. Holmstrom and P. Koistinen, Using additive noise in back-propagation training, IEEE Trans. Neural Networks 3, 24-38 (1992); C.M. Bishop, Training with noise is Equivalent to Tikhonov Regularization, Neural Comp. 7, 108-116 (1995).

[113] A. Graves, A. Mohamed, and G. Hinton, with deep recurrent neural networks, arXiv:1303.5778 (2013).


[114] G.H. Golub and C.F. Van Loan, Matrix Computations, 4th Ed., Johns Hopkins Uni- versity Press (2013).

[115] W.K. Wootters, Entanglement of Formation of an Arbitrary State of Two Qubits, Phys. Rev. Lett. 80, 2245 (1998).

[116] V. Coffman, J. Kundu, and W.K. Wootters, Distributed entanglement, Phys. Rev. A 61, 052306 (2000).

[117] Neuralware Getting started: a Tutorial in NeuralWorks Professional II/Plus, (2000).

[118] I. Piquero-Zulaica, J. Lobo-Checa, A. Sadeghi, Z.M. Abd El-Fattah, C. Mitsui, T. Okamoto, R. Pawlak, T. Meier, A. Arnau, J.E. Ortega, J. Takeya, S. Goedecker, E. Meyer, and S. Kawai, Precise engineering of quantum dot array coupling through their barrier widths,Nature Commun. 8, 787 (2017).

[119] Microsoft Quantum Development Kit. https://docs.microsoft.com/en- us/quantum/?view=qsharp-preview, 2018.

[120] The IBM Quantum Experience. https://quantumexperience.ng.bluemix.net/qx, 2018.

[121] R.P. Feynman (1982), Simulating physics with computers, Int. J. of Theo. Phys. 21, 467.

[122] P.W. Shor (1994), Algorithms for quantum computation: discrete logarithms and fac- toring, Proceedings of the 35th Annual Symposium on Foundations of Computer Science (IEEE)

[123] L.K. Grover (1996), A fast quantum mechanical algorithm for data base search, Pro- ceedings of the 28th Annual ACM Symposium on the Theory of Computing, 212.

[124] E. Bernstein and U. Vazirani (1997) Quantum complexity theory, SIAM J. Comput. 26, pp 1411-1473.

[125] T.F. Ronnow, Z. Wang, J. Job, S. Boixo, S.V. Isakov, D. Wecker, J.M. Martinis, D.A. Lidar, and M. Troyer (2014), Defining and detecting quantum speedup, Science 345, 420.

[126] A.R. Calderbank and P. Shor (1996) Good quantum error correction codes exist, Phys. Rev. A 54, pp 1098-1105.

[127] G. Ortiz, J. Gubernatis, E. Knill, and R. Laflamme (2001), Quantum algorithms for femionic simulations Phys. Rev. A 64 022319.


[128] R. Chrisley (1995), Quantum Learning, In New directions in cognitive science: Pro- ceedings of the international symposium, Saariselka, Finland, P. Pylkknen and P. Pylkk (editors). Finnish Association of Artificial Intelligence, Helsinki, 77-89.

[129] S. Kak, On quantum neural computing, Advances in Imaging and Electron Physics 94, 259 (1995).

[130] R. Schutzhold (2002), Pattern recognition on a quantum computer, arXiv:0208063v3.

[131] C.A. Trugenberger (2002), Pattern recognition in a quantum computer, arXiv: 0210176v2 .

[132] M. Schuld, I. Sinayskiy and F. Petruccione (2016), Prediction by linear regression on a quantum computer, Phys. Rev. A 94, 022342.

[133] M. Schuld, I. Sinayskiy and F. Petruccione (2015), An introduction to quantum machine learning Contemp. Phys. 56, 172.

[134] S. Arunaschalam and R deWolf (2017), A survey of quantum learning theory, arXiv:1701.06806v3.

[135] J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, and S. Lloyd (2018), Quantum machine learning, arXiv:1611.09347v2.

[136] E. Aimeur, G. Brassard, and S. Gambs (2013), Quantum speed-up for unsupervised learning, Mach. Learn 90, 261-87.

[137] A. Hentschel and B.C Sanders (2010), Machine learning for precise quantum measure- ment, arXiv:0910.0762v2.

[138] J. Chen, L. Wang, and E. Charbon (2017), A quantum-implementable neural network model, Quantum Inf. Process. 16, 245.

[139] E. C. Behrman and J. E. Steck (2013), Multiqubit entanglement of a general input state. Quantum Inf. Comput. 13, 1-2, pp. 36-53.

[140] N. Wiebe, C. Granade, and D.G. Cory (2015), Quantum bootstrapping via compressed quantum Hamiltonian learning, arXiv: 1409.1524v3.

[141] X.D. Cai, D. Wu, Z.E. Su, M.C. Chen, X.L. Wang, L. Li, N.L. Liu, C.Y. Lu, and J.W. Pan (2015), Entanglement-based machine learning on a quantum computer, Phys. Rev. Lett. 114, 110504.


[142] N.L. Thompson, N.H. Nguyen, E.C. Behrman, and J.E. Steck (2018), Experimen- tal pairwise entanglement estimation for an N-qubit system: A machine learning ap- proach for programming quantum hardware, submitted to Quant. Inf. Comput. (2019), arXiv:1902.07754.

[143] Neuralware (2000), Getting started: a Tutorial in NeuralWorks Professional II/Plus

[144] Taehwan Kim and Tlay Adali (2001), Complex backpropagation neural network using elementary transcendental activation function Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 7-11 May 2001 2 pp. 1281-1284; Taehwan Kim and Tlay Adali (2003) Approximation by fully complex multilayer perceptrons, Neural Comput. 15 pp. 1641-1666. DOI=http://dx.doi.org/10.1162/089976603321891846

[145] T.L. Clarke, Generalization of neural networks to the complex plane. 1990 IJCNN International Joint Conference on Neural Networks, 435-440 vol.2. (1990)

[146] I. Aizenberg (2011), Complex-Valued Neural Networks with Multi-Valued Neurons, Berlin-Heidelberg: Springer.

[147] S. Dey, S. Bhattacharyya, and U. Maulik (2014), Quantum inspired genetic algorithm and particle swarm optimization using chaotic map model based interference for gray level image thresholding, Swarm and Evol. Comput. 15 pp 38-57; S. Dey, S. Bhat- tacharyya, and U. Maulik (2017), Efficient quantum inspired meta-heuristics for multi- level true colour image thresholding, Appl. Soft Comput. 56 pp 472-513; and references cited therein.

[148] Available at https://en.wikipedia.org/wiki/Iris_flower_data_set

[149] K.-L. Seow, E.C. Behrman, and J.E. Steck (2015), Efficient learning algorithm for quantum perceptron unitary weights, arXiv:1512.00522

[150] D. Harris and S. Harris (2012) Digital design and computer architecture (2nd ed.), San Francisco, Calif.: Morgan Kaufmann, p. 129.

[151] R. A. Fisher (1936), The use of multiple measurements in taxonomic problems Ann. Eugenics 7, pp 179188.

[152] E.C. Behrman, G.A. Jongeward, and P.G. Wolynes (1983), A Monte Carlo approach for the real time dynamics of tunneling systems in condensed phases J. Chem. Phys. 79, 6277-6281.

[153] R.P. Feynman, R.B. Leighton, and M. Sands (1964), The Feynman Lectures on Physics, Addison Wesley, Reading, MA, Vol III.

[154] V. Vedral, M.B. Plenio, M.A. Rippin, and P.L. Knight (1997), Quantifying entanglement, Phys. Rev. Lett. 78, pp. 2275-2279; S. Tamaryan, A. Sudbery, and L. Tamaryan (2010), Duality and the geometric measure of entanglement of general multiqubit W states, Phys. Rev. A 81, 052319.

[155] The IBM Quantum Experience (2018). Available at https://quantumexperience.ng.bluemix.net/qx

[156] Microsoft Quantum Development Kit (2018). Available at https://docs.microsoft.com/en-us/quantum/?view=qsharp-preview

[157] M.H. Amin, E. Andriyash, J. Rolfe, B. Kulchytskyy, and R. Melko (2018), Quantum Boltzmann machine, Phys. Rev. X 8, 021050.

[158] C. Trabelsi, O. Bilaniuk, Y. Zhang, D. Serdyuk, S. Subramanian, J. Santos, S. Mehri, N. Rostamzadeh, Y. Bengio, C. Pal, Deep Complex Networks, arXiv preprint arXiv:1705.09792. (2017)

[159] J.H. Manton. Optimization Algorithms Exploiting Unitary Constraint. IEEE TRANS- ACTIONS ON SIGNAL PROCESSING, VOL. 50, NO. 3., pp. 635-650. 2002

[160] D. Gabay. Minimizing a differentiable function over a differential manifold. Journal of Optimization Theory and Applications, 37(2):177219, Jun. 1982.

[161] S. T. Smith. Optimization techniques on Riemannian manifolds. Fields Institute Com- munications, American Mathematical Society, 3:113 136, 1994.

[162] T.E. Abrudan, J. Eriksson, V. Koivumen. Steepest Descent Algorithms for Optimiza- tion Under Unitary Constraint. IEEE Transactions on Signal Processing, vol. 53, No. 3, pp. 1134-1147, 2008.

[163] P.A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Mani- folds. Princeton University Press, Princeton, NJ, Jan. 2008

[164] K. Karimi, N.G. Dickson, F. Hamze, M.H.S. Amin, M. Drew-Brook, F.A. Chudak, P.I. Bunyk, W.G. Macready, and G. Rose, Investigating the performance of an adiabatic quantum optimization processor, Quantum Inf. Process. 11, pp. 77-88 (2012).

[165] A.M. Childs, R. Cleve, E. Deotto, E. Farhi, S. Gutmann, and D.A. Spielman, Expo- nential algorithmic speedup by , Proceedings of the 35th Symposium on Theory of Computing, 5968 (2003).

[166] T.F. Ronnow, Z. Wang, J. Job, S. Boixo, S.V. Isakov, D. Wecker, J.M. Martinis, D.A. Lidar, and M. Troyer, Defining and detecting quantum speedup, Science 345, 420 (2014).

[167] M. Swaddle, L. Noakes, L. Salter, H. Smallbone, and J. Wang, Generating 3 qubit quantum circuits with neural networks, Phys. Lett. A 381, 3391 (2017).


[168] L. Gurvitz, Classical deterministic complexity of Edmonds problem and quantum en- tanglement, in Proc. 35th ACM Symp. on the Theory of Comput., pp. 10-19 (ACM Press, New York, 2003).

[169] R. LaRose, Overview and comparison of gate level quantum software plaltforms, arXiv: 1807.02500 (2018).

[170] Microsoft Quantum Development Kit. https://docs.microsoft.com/en-us/ quantum/?view=qsharp-preview, (2018). accessed September 2018

[171] The IBM Quantum Experience. https://quantumexperience.ng.bluemix.net/qx, (2018). accessed October 2018

[172] Y.G. Chen and J.B. Wang, QCompiler: quantum compilation with CSF method, arXiv:quant-ph/1208.0194v2 (2012).

[173] V. Kliuchnikov, Synthesis of unitaries with Clifford+ T circuits, arXiv:1306.3200 (2013); V. Kliuchnikov, D. Maslow, M. Mosca, Asmptotically optimal approximation of single qubit unitaries by Clifford and T circuits using a constant number of ancillary qubits. Phys. Rev. Lett. 110, 190502 (2013).

[174] L. Cincio, Y. Suba, A.T. Sornborger, and P.J. Coles, Learning the quantum algorithm for state overlap, New J. Phys. 20, 113022 (2018).

[175] Hatano N., Suzuki M. Finding exponential product formulas of higher orders, In: Das A., K. Chakrabarti B. (eds) Quantum Annealing and Other Optimization Methods. Lecture Notes in Physics, vol 679. Springer, Berlin, Heidelberg (2005).

[176] E.C. Behrman, R.E.F. Bonde, J.E. Steck, and J.F. Behrman, On the correction of anomalous phase oscillation in entanglement witnesses using quantum neural networks, IEEE Trans. on Neural Networks and Learning Systems 25, pp 1696-1703 (2014).

[177] MATLAB and Statistics Toolbox Release 2018a, The MathWorks, Inc., Natick, Mas- sachusetts, United States

[178] J.E. Steck, E.C. Behrman, and N.L. Thompson, Machine learning applied to pro- gramming quantum computers, AIAA Scitech 2019 Forum, (January 2019). DOI: 10.2514/6.2019-0956

[179] Available at https://github.com/williamingle/WichitaStateQNN

[180] J. Preskill, Quantum computing and the entanglement frontier, arXiv:1203.5813v3 (2013).

[181] M.Y. Niu, S. Boixo, V. Smelyanskiy, and H. Neven, Universal quantum control through deep reinforcement learning, arXiv:1803.01857v2 (2018).


[182] J. Preskill, Quantum Computing in the NISQ era and beyond, Quantum 2, 79, arXiv:1801.00862v3 (2018).

[183] M. Schuld, V. Bergholm, C. Gogolin, J. Izaac, and N. Killoran, Evaluating analytic gradients on quantum hardware, arXiv:1811.11184 (2018).

[184] N.L. Thompson, N.H. Nguyen, W. Ingle, E.C. Behrman, and J.E. Steck, Optimized quantum machine learning, in progress (2020).

[185] N. M. Linke, D. Maslov, M. Rotteler, S. Debnath, C. Figgatt, K. A. Landsman, K. Wright, and C. R. Monroe, Experimental comparison of two quantum computing archi- tectures, arXiv:1702.01852 (2017).

[186] X. Wang, E. Barnes, S.D. Sarma, Improving the gate fidelity of capacitively coupled spin qubits, NPJ Quantum Information 1, 15003, 10.1038/npjqi.2015.3 (2015).

[187] B. Pokharel, N. Anand, B. Fortman, D.A. Lidar,Demonstration of fidelity improvement using dynamical decoupling with superconducting qubits, Phys. Rev. Lett., 121, 220502, DOI: 10.1103/PhysRevLett.121.220502 (2016).

[188] N.J. Ross and P. Selinger, Optimal ancilla-free Clifford+T approximation of z- rotations, arXiv: quant-ph/1403.2975 (2014).

[189] M.J. Rethinam, A.K. Javali, A.E. Hart, E.C. Behrman, and J.E. Steck, A genetic algorithm for finding pulse sequences for nmr quantum computing, Paritantra Journal of Systems Science and Engineering 20, 32-42. arXiv:quant-ph/0404170 (2011).

[190] A.Y. Kitaev, Fault-tolerant quantum computation by anyons, Annals of Physics, Vol- ume 303, Issue 1, Pg. 2-30. (1997)

[191] J. M. Gambetta, J. M. Chow and M. Steffen, ”Building logical qubits in a su- perconducting quantum computing system”, npj Quantum Information 3, 2 (2017), doi:10.1038/s41534-016-0004-0

[192] Available at https://www.dwavesys.com/resources/media-resources. accessed October 2019

APPENDIX

APPENDIX A

PARAMETER FUNCTIONS

A.1 Parameter Functions for XOR

Here are the graphs of the trained parameters for the XOR gate.

Figure A.1: ε_A and ε_B parameter functions for the XOR gate.

Figure A.2: KA and KB parameter function for XOR gate.


Figure A.3: ζ parameter function for XOR gate.

A.2 Parameter Functions for XNOR

Here are the graphs of the trained parameters for the XNOR gate.

Figure A.4: ε_A and ε_B parameter functions for the XNOR gate.


Figure A.5: KA and KB parameter function for XNOR gate.

Figure A.6: ζ parameter function for XNOR gate.

A.3 Parameter Functions for CNOT Gate

Here are the graphs of the trained parameters for the CNOT gate.


Figure A.7: ε_A and ε_B parameter functions for the CNOT gate.

Figure A.8: K_A and K_B parameter functions for the CNOT gate.


Figure A.9: ζ parameter function for CNOT gate.

A.4 Parameter Functions for Bell Circuit

Here are the graphs of the trained parameters that generate the circuit in Figure 3.9. It maps the state |00⟩ to the maximally entangled (Bell) state. Note that there are symmetries in these parameter functions.

Figure A.10: ε and K parameter functions for the maximally entangled state circuit generation. Note that in this case there is symmetry in the parameters; that is, ε_A = ε_B and K_A = K_B.

Figure A.11: ζ parameter function for the maximally entangled state circuit generation.
