QUANTUM NEURAL NETWORKS
A Dissertation by
Nam H. Nguyen
Master of Science, Wichita State University, 2016
Bachelor of Arts, Eastern Oregon University, 2014
Submitted to the Department of Mathematics, Statistics, and Physics and the faculty of the Graduate School of Wichita State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy
May 2020
© Copyright 2020 by Nam H. Nguyen
All Rights Reserved
The following faculty members have examined the final copy of this dissertation for form and content, and recommend that it be accepted in partial fulfillment of the requirement for the degree of Doctor of Philosophy with a major in Applied Mathematics.
Elizabeth Behrman, Committee Chair
James Steck, Committee Member
Buma Fridman, Committee Member
Ziqi Sun, Committee Member
Terrance Figy, Committee Member
Accepted for the College of Liberal Arts and Sciences
Andrew Hippisley, Dean
Accepted for the Graduate School
Coleen Pugh, Dean
DEDICATION
To my late grandmother, Phan Thi Suu, and my loving mother, Nguyen Thi Kim Hoa
ACKNOWLEDGEMENTS
First and foremost, I would like to express my profound gratitude to my advisor, Dr. Elizabeth Behrman, for her advice, support, and guidance toward my Ph.D. degree. She taught me not only the way to do scientific research, but also the way to become a professional scientist and mathematician. Her endless encouragement and patience throughout my graduate studies have positively impacted my life. She always allowed me to work on different research areas and ideas, even those outside the scope of my dissertation. This allowed me to learn and investigate problems in many different areas of mathematics, which has benefited me greatly. Her scientific rigor and dedication make her a lifetime role model for me. I will forever be indebted to her.
I would also like to extend my gratitude to Dr. James Steck for his guidance throughout this journey. I have gained a tremendous amount of knowledge of neural networks, in both theory and application, from working with him over the past four years.
I would also like to express my deepest gratitude to Dr. Buma Fridman for spending time teaching me many important concepts in mathematics. I am especially indebted to him for spending the summer of 2017 working with me, helping me build a better understanding of Kolmogorov's superposition theorem and all its extensions, including a result of his own. Our various discussions through the years have helped me become a better mathematician than I otherwise would have been.
I would also like to pay my special regards to Dr. Ziqi Sun and Dr. Terrance Figy for their willingness to serve on my dissertation committee. Furthermore, I want to acknowledge Dr. Ziqi Sun for his countless encouragements throughout my Ph.D. studies and for helping me understand functional analysis better; seeing his passion and curiosity for mathematics and physics pushes me to always strive to be a better scientist and mathematician.
I am also deeply indebted to Professor Edward Behrman of The Ohio State University for his financial support during my graduate studies through the Behrman Family Foundation. Because of his generosity, I was able to have a smaller teaching load and dedicate much more time to my research and studies.
Furthermore, I would like to acknowledge and thank all my colleagues and friends
(Nathan, Bill, Henry, Saideep, Mo), with whom I collaborated on most of my research work.
This work would not have been possible without all their help. They have all made a great, meaningful impact on my life. In particular, I want to give a special acknowledgment to Dr. Tianshi Lu and Sirvan Rhamati (my best friend at WSU) for their willingness to discuss and work with me on various problems and ideas during my time at WSU: from number theory to graph theory to fractional derivatives to probability theory, and even math competition problems. They shared with me many great ideas throughout the years. Their constant questioning, ways of thinking, and support have positively influenced me. It has been a joy to have a true friend like Sirvan with whom I can share my problems as well as my happiness.
Most of all, I would like to thank my dear mother and my late grandmother. They have supported me throughout my academic journey and always motivated me to strive forward. Their unconditional love has never been affected by the physical distance between us. This dissertation is dedicated to them.
ABSTRACT
Quantum computing is becoming a reality, at least on a small scale. However, designing a good quantum algorithm is still a challenging task. This has been a major bottleneck in quantum computation for years. In this work, we will show that it is possible to take a detour from the conventional programming approach by incorporating machine learning techniques, specifically neural networks, to train a quantum system such that the desired algorithm is “learned,” thus obviating the program design obstacle. Our work here merges quantum computing and neural networks to form what we call “Quantum Neural Networks”
(QNNs). Another serious issue one needs to overcome when doing anything quantum is the problem of noise and decoherence. A well-known technique for overcoming this issue is the use of error-correcting codes. However, error correction schemes require an enormous number of additional ancilla qubits, which is not feasible for current state-of-the-art quantum computing devices, or any near-term devices for that matter. We show in this work that QNNs are robust to noise and decoherence, providing error-suppressing quantum algorithms.
Furthermore, not only are our QNN models robust to noise and decoherence, we show that they also possess an inherent speed-up, in terms of being able to learn a task much faster than various classical neural networks, at least on the set of problems we benchmarked them on. Afterward, we show that although our QNN model is designed to run at the fundamental level of a quantum system, we can also decompose it into a sequence of gates and implement it on current quantum hardware devices. We did this for a non-trivial problem known as the “entanglement witness” calculation. We then propose a couple of different hybrid quantum neural network architectures: networks with both quantum and classical information processing. We hope that these might increase the capability over previous QNN models in terms of the complexity of the problems they can solve.
TABLE OF CONTENTS
Chapter Page
1 INTRODUCTION...... 1
1.1 Motivation...... 1
1.2 Scope of this Dissertation...... 2
1.3 Literature Review...... 4
1.4 Contributions...... 8
1.5 Structure of this Dissertation...... 8
2 OVERVIEW OF QUANTUM MECHANICS...... 10
2.1 Quantum States and Density Operators...... 11
2.2 Composite Systems and Entanglement...... 13
2.3 Time-Evolution of a Closed System...... 20
2.4 Quantum Measurements...... 22
3 QUANTUM COMPUTING...... 26
3.1 Qubits...... 27
3.2 Quantum Gates and Circuits Model...... 33
3.3 Universal Quantum Computation and The Solovay-Kitaev Theorem...... 40
3.4 Quantum Speed-up and Quantum Algorithms...... 43
3.5 Adiabatic Quantum Computation Model...... 46
3.6 Quantum Decoherence, Noise, and Error Correction...... 49
4 CLASSICAL ARTIFICIAL NEURAL NETWORKS...... 53
4.1 Introduction...... 54
4.2 Artificial Neurons...... 55
4.2.1 Perceptron...... 57
4.2.2 Sigmoid Neuron...... 59
4.3 Multi-Layer Neural Networks...... 60
4.3.1 Networks Architecture...... 61
4.3.2 Error Backpropagation and The Gradient Descent Learning Rule...... 62
4.4 Universal Approximation...... 66
4.4.1 Approximation with Boxes...... 67
4.4.2 Approximation with Modern Analysis...... 74
5 QUANTUM NEURAL NETWORK...... 77
5.1 Fundamental Structure...... 78
5.2 Learning Algorithm...... 81
5.3 Simulation of Classical and Quantum Logic Gates, and Quantum Circuit...... 84
5.4 Universal Property of Quantum Neural Network...... 89
5.5 An Alternative Learning Approach...... 92
6 ROBUSTNESS OF QUANTUM NEURAL NETWORK...... 95
6.1 Quantum Computing in The Classical World...... 96
6.2 Dealing with Noise and Decoherence...... 97
6.3 Entanglement Calculation For Two-Qubit System...... 98
6.3.1 Learning with noise...... 105
6.3.2 Learning with Decoherence...... 114
6.3.3 Learning with Noise plus decoherence...... 121
6.4 Entanglement Calculation on Higher-Order Qubit Systems...... 128
6.4.1 Results for the Three-Qubit System: Training and Testing...... 128
6.4.2 Results for the four- and five-qubit systems: Training and Testing...... 138
6.4.3 Quantifying the improvement in robustness with increasing size of the system...... 139
6.4.4 Learning with other types of noise...... 142
6.4.5 Stability Analysis of The Calculations...... 144
6.5 Application: Entanglement for pattern storage...... 145
7 BENCHMARKING NEURAL NETWORKS FOR QUANTUM COMPUTATIONS...... 148
7.1 Type of Neural Networks Performed...... 149
7.1.1 Classical real-valued neural networks...... 149
7.1.2 Complex-Valued Neural Networks...... 150
7.1.3 Quantum Neural Network...... 153
7.2 QNN Versus Classical Neural Networks: Simulating Classical Logic Gates...... 155
7.3 QNN Versus Classical Neural Networks: Iris Classification...... 160
7.4 QNN Versus Classical Neural Networks: Entanglement Calculation...... 165
8 IMPLEMENTATION OF QUANTUM NEURAL NETWORK ON ACTUAL QUANTUM HARDWARE...... 169
8.1 Available Quantum Simulators and Hardware...... 170
8.1.1 IBM...... 170
8.1.2 Microsoft...... 171
8.1.3 D-Wave Systems...... 172
8.2 Two-qubit Quantum Neural Network...... 173
8.2.1 Reverse Engineering of Entanglement Witness...... 174
8.2.2 Numerical computation...... 179
8.3 Statistical Evaluation of Entanglement Witness in Q#...... 183
8.4 Iterative Staging...... 185
8.4.1 Searching for an Asymptotic Limit...... 185
8.4.2 Comparing the Discrete and Continuum Cases...... 188
8.5 Discussions...... 191
9 CONCLUSIONS...... 195
10 FUTURE WORK...... 198
10.1 Quantum Hybrid Neural Network using Multi-Measurements...... 199
10.2 Quantum Hybrid Neural Network using Multi-Step Time Propagation...... 201
REFERENCES...... 204
APPENDIX...... 219
A PARAMETER FUNCTIONS...... 220
A.1 Parameter Functions for XOR...... 220
A.2 Parameter Functions for XNOR...... 221
A.3 Parameter Functions for CNOT Gate...... 222
A.4 Parameter Functions for Bell Circuit...... 225
LIST OF FIGURES
Figure Page
2.1 Graphical interpretation of the geometric Hahn-Banach Theorem. The witness
w divides the Hilbert space into separable (S) and entangled subspaces. An
optimal witness is as close as possible to the set S...... 17
2.2 Graphical illustration of plane separation in Euclidean space...... 17
2.3 Geometric illustration of an optimal entanglement witness...... 18
2.4 A quantum system interacting with a measuring apparatus in the presence of
the surrounding environment...... 23
3.1 Geometric Representation of a Classical Bit...... 27
3.2 Bloch Sphere Representation of a Qubit...... 30
3.3 Geometric Representation of a Classical Probabilistic bit...... 31
3.4 Visualization of a quantum circuit...... 34
3.5 Reversible AND gate (Fredkin gate)...... 34
3.6 Geometric visualization of the Pauli X gate being applied to the state |0⟩ on the Bloch sphere...... 35
3.7 Geometric visualization of the rotation on the Bloch sphere created by applying the Hadamard gate to the state |0⟩ to create the superposition state (|0⟩ + |1⟩)/√2, and vice versa...... 37
3.8 Controlled-NOT gate...... 38
3.9 This circuit maps the state |00⟩ to the state (|00⟩ + |11⟩)/√2...... 38
3.10 Circuit representation of Toffoli gate...... 40
3.11 Complexity spaces. The complexity class BQP is still not well understood. It has been shown that, with a cleverly designed quantum algorithm, problems believed to be classically intractable, such as prime factorization, can be solved efficiently on a quantum computer. However, no NP-complete problem has been solved efficiently on a quantum computer...... 46
3.12 Quantum circuit for Shor’s 9-qubit error correction code. E is a quantum channel
that can arbitrarily corrupt a single qubit [34]...... 52
4.1 The structure of a biological neuron [46]...... 56
4.2 A perceptron model with two inputs...... 57
4.3 A sigmoid neuron node with x ∈ R^N and the activation function f(x) = 1/(1 + e^{-x})...... 59
4.4 A neural network with two hidden layers implementing a map f : R^2 → R. In nearly all cases, we take all the activation functions σi to be the same; that is, there is only one type of activation in the network. The reason for this will become clear in Section 4.4...... 62
4.5 Approximation of a continuous function with piece-wise constant functions... 68
4.6 Two hidden neurons to create a box...... 69
4.7 An illustration to show how a single hidden layer can be used to approximate a
continuous function f : D → R...... 70
4.8 An illustration to show how step functions can be built in the higher-dimensional case and how to add them together to form something almost like a tower. The gray arrows represent weights that are equal to zero...... 71
4.9 An illustration to show how a 3-dimensional tower might be built using a neural network...... 72
4.10 An illustration to show how an (N+1)-dimensional tower can be built using a neural network...... 73
5.1 Classical Neural Network as a map from R^N to R...... 80
5.2 Quantum circuit representing a quantum algorithm as a map...... 80
5.3 Visualization of QNN structure...... 81
5.4 Perceptron model...... 89
5.5 a) Half adder using NAND gates. b) Half subtractor using NAND gates..... 90
5.6 Half subtractor modeled by a network of perceptrons...... 90
6.1 Total root mean squared error for the training set as a function of epoch (pass
through the training set), for the 2-qubit system, with zero noise. Asymptotic
error is 1.6 × 10⁻³. For comparison: with piecewise constant functions a similar
level of error required 2000 epochs...... 102
6.2 Parameter function KA = KB as a function of time (data points), as trained at
zero noise for the entanglement indicator, and plotted with the Fourier fit (solid
line)...... 103
6.3 Parameter function εA = εB as a function of time (data points), as trained at
zero noise for the entanglement indicator, and plotted with the Fourier fit (solid
line)...... 103
6.4 Parameter function ζ as a function of time (data points), as trained at zero noise
for the entanglement indicator, and plotted with the Fourier fit (solid line)... 103
6.5 Total root mean squared error for the training set as a function of epoch (pass
through the training set), for the 2-qubit system, with a (magnitude) noise level
of 0.014 at each of the 317 timesteps. Asymptotic error is 3.1 × 10⁻³, about
double what it was with no noise...... 106
6.6 Parameter function KA = KB as a function of time, as trained at 0.0089 am-
plitude noise at each of the 317 timesteps, for the entanglement indicator (data
points), and plotted with the Fourier fit (solid line). Note the change in scale
from Figure 6.2, because of the (much larger) spread of the noisy data: the
Fourier fit is actually almost the same on this graph...... 107
6.7 Parameter function εA = εB as a function of time, as trained at 0.0089 amplitude
noise at each of the 317 timesteps, for the entanglement indicator (data points),
and plotted with the Fourier fit (solid line)...... 107
6.8 Parameter function ζ as a function of time, as trained at 0.0089 noise at each
of the 317 timesteps, for the entanglement indicator (data points), and plotted
with the Fourier fit (solid line)...... 108
6.9 Fourier coefficients for the tunneling parameter functions K, as functions of noise
level...... 109
6.10 Fourier coefficients for the bias parameter functions ε, as functions of noise level. 109
6.11 Fourier coefficients for the coupling parameter function ζ, as functions of noise
level...... 110
6.12 Entanglement of the state P as a function of γ, as calculated by the QNN, and
compared with the entanglement of formation (marked “BW”) at zero noise
(blue) and at 0.0069 noise (orange). In each case the QNN was trained at zero
noise, but tested at the given level...... 111
6.13 Entanglement of the state P as a function of γ, as calculated by the QNN, and
compared with the entanglement of formation (marked “BW”) at zero noise
(blue) and at 0.0069 noise (orange). In each case the QNN was trained at a
noise level of 0.0089, and then tested at the given level...... 112
6.14 Entanglement of the state P as a function of γ, as calculated by the QNN, and
compared with the entanglement of formation (marked “BW”) at zero noise
(blue) and at 0.0069 noise (orange). In each case the QNN was trained at 0.013
noise, and then tested at the given level...... 112
6.15 Entanglement of the state M as a function of δ, as calculated by the QNN,
and compared with the entanglement of formation (marked “BW”) at zero noise
(blue) and at 0.0069 noise (orange). In each case the QNN was trained at zero
noise, but tested at the given level...... 113
6.16 Entanglement of the state M as a function of δ, as calculated by the QNN,
and compared with the entanglement of formation (marked “BW”) at zero noise
(blue) and at 0.0069 noise (orange). In each case the QNN was trained at a noise
level of 0.0089, and then tested at the given level...... 113
6.17 Entanglement of the state M as a function of δ, as calculated by the QNN,
and compared with the entanglement of formation (marked “BW”) at zero noise
(blue) and at 0.0069 noise (orange). In each case the QNN was trained at 0.013
noise, and then tested at the given level...... 114
6.18 Total root mean squared error for the training set as a function of epoch (pass
through the training set), for the 2-qubit system, with a phase noise level of 0.014
at each of the 317 timesteps. Asymptotic error is 1.3 × 10⁻³, approximately
the same as with no noise...... 115
6.19 Parameter function KA = KB as a function of time, as trained at 0.0089 phase
noise at each of the 317 timesteps, for the entanglement indicator (data points),
and plotted with the Fourier fit (solid line)...... 115
6.20 Parameter function εA = εB as a function of time, as trained at 0.0089 phase
noise at each of the 317 timesteps, for the entanglement indicator (data points),
and plotted with the Fourier fit (solid line)...... 116
6.21 Parameter function ζ as a function of time, as trained at 0.0089 phase noise
at each of the 317 timesteps, for the entanglement indicator (data points), and
plotted with the Fourier fit (solid line)...... 116
6.22 Fourier coefficients for the tunneling parameter functions K, as functions of
decoherence level...... 117
6.23 Fourier coefficients for the bias parameter functions ε, as functions of decoherence
level...... 117
6.24 Fourier coefficients for the coupling parameter function ζ, as functions of deco-
herence level...... 118
6.25 Entanglement of the state P as a function of γ, as calculated by the QNN, and
compared with the entanglement of formation (marked “BW”) at zero phase
noise (blue) and at 0.0069 phase noise (orange). In each case the QNN was
trained at zero phase noise, but tested at the given level...... 119
6.26 Entanglement of the state P as a function of γ, as calculated by the QNN, and
compared with the entanglement of formation (marked “BW”) at zero phase
noise (blue) and at 0.0069 phase noise (orange). In each case the QNN was
trained at a phase noise level of 0.0089, and then tested at the given level.... 119
6.27 Entanglement of the state P as a function of γ, as calculated by the QNN, and
compared with the entanglement of formation (marked “BW”) at zero phase
noise (blue) and at 0.0069 phase noise (orange). In each case the QNN was
trained at 0.013 phase noise, and then tested at the given level...... 120
6.28 Entanglement of the state M as a function of δ, as calculated by the QNN, and
compared with the entanglement of formation (marked “BW”) at zero phase
noise (blue) and at 0.0069 phase noise (orange). In each case the QNN was
trained at zero noise, but tested at the given level...... 120
6.29 Entanglement of the state M as a function of δ, as calculated by the QNN, and
compared with the entanglement of formation (marked “BW”) at zero phase
noise (blue) and at 0.0069 phase noise (orange). In each case the QNN was
trained at a phase noise level of 0.0089, and then tested at the given level.... 121
6.30 Entanglement of the state M as a function of δ, as calculated by the QNN, and
compared with the entanglement of formation (marked “BW”) at zero phase
noise (blue) and at 0.0069 phase noise (orange). In each case the QNN was
trained at 0.013 phase noise, and then tested at the given level...... 121
6.31 Total root mean squared error for the training set as a function of epoch (pass
through the training set), for the 2-qubit system, with a complex noise level of
0.014 at each of the 317 timesteps. Asymptotic error is 3.0 × 10⁻³, approximately the same as with only magnitude noise...... 122
6.32 Parameter function KA = KB as a function of time, as trained at 0.0089 complex
noise at each of the 317 timesteps, for the entanglement indicator (data points),
and plotted with the Fourier fit (solid line). Note the change in scale from
Figure 6.2, because of the (much larger) spread of the noisy data: the Fourier
fit is actually almost the same on this graph...... 122
6.33 Parameter function εA = εB as a function of time, as trained at 0.0089 complex
noise at each of the 317 timesteps, for the entanglement indicator (data points),
and plotted with the Fourier fit (solid line)...... 123
6.34 Parameter function ζ as a function of time, as trained at 0.0089 complex noise
at each of the 317 timesteps, for the entanglement indicator (data points), and
plotted with the Fourier fit (solid line)...... 123
6.35 Fourier coefficients for the tunneling parameter functions K, as functions of
complex noise level...... 124
6.36 Fourier coefficients for the bias parameter functions ε, as functions of complex
noise level...... 124
6.37 Fourier coefficients for the coupling parameter function ζ, as functions of complex
noise level...... 125
6.38 Entanglement of the state P as a function of γ, as calculated by the QNN, and
compared with the entanglement of formation (marked “BW”) at zero noise
(blue) and at 0.0069 noise plus decoherence (orange). In each case the QNN was
trained at zero noise, but tested at the given level...... 125
6.39 Entanglement of the state P as a function of γ, as calculated by the QNN, and
compared with the entanglement of formation (marked “BW”) at zero noise
(blue) and at 0.0069 noise plus decoherence (orange). In each case the QNN was
trained at a complex noise level of 0.0089, and then tested at the given level...... 125
6.40 Entanglement of the state P as a function of γ, as calculated by the QNN, and
compared with the entanglement of formation (marked “BW”) at zero complex
noise (blue) and at 0.0069 noise plus decoherence (orange). In each case the QNN
was trained at 0.013 level complex noise, and then tested at the given level... 126
6.41 Entanglement of the state M as a function of δ, as calculated by the QNN,
and compared with the entanglement of formation (marked “BW”) at zero noise
(blue) and at 0.0069 noise plus decoherence (orange). In each case the QNN was
trained at zero noise, but tested at the given level...... 126
6.42 Entanglement of the state M as a function of δ, as calculated by the QNN,
and compared with the entanglement of formation (marked “BW”) at zero noise
(blue) and at 0.0069 noise plus decoherence (orange). In each case the QNN was
trained at a complex noise level of 0.0089, and then tested at the given level.. 127
6.43 Entanglement of the state M as a function of δ, as calculated by the QNN,
and compared with the entanglement of formation (marked “BW”) at zero noise
(blue) and at 0.0069 noise plus decoherence (orange). In each case the QNN was
trained at 0.013 complex noise, and then tested at the given level...... 127
6.44 K parameter function trained with zero noise or decoherence, for the 3-qubit
system...... 131
6.45 ε parameter function trained with zero noise or decoherence, for the 3-qubit
system...... 131
6.46 ζ parameter function trained with no noise or decoherence, for the 3-qubit system...... 132
6.47 Parameter function K trained at 0.0089 amount of decoherence in a 3-qubit
system. The solid red curve represents the Fourier fit of the actual data points. 133
6.48 Parameter function ε trained at 0.0089 decoherence in a 3-qubit system. The
solid red curve represents the Fourier fit of the actual data points...... 134
6.49 Parameter function ζ trained at 0.0089 decoherence in a 3-qubit system. The
solid red curve represents the Fourier fit of the actual data points...... 134
6.50 Parameter function K Fourier coefficients as a function of total noise for the
3-qubit system...... 135
6.51 Parameter ε Fourier coefficients as a function of total noise for the 3-qubit system...... 135
6.52 Parameter function ζ Fourier coefficients as a function of total noise for the
3-qubit system...... 136
6.53 Testing at different noise levels for the pure state P for a three-qubit system, as
a function of γ...... 137
6.54 Testing at different noise levels for a mixed state M, for the three-qubit system,
as a function of γ...... 137
6.55 Parameter function K for a 4-qubit system trained at 0.027 level of noise with
its Fourier fit...... 138
6.56 Parameter function K for 5-qubit system trained at 0.027 level of noise with its
Fourier fit...... 139
6.57 R² as a function of the number of qubits, for the K parameter function, trained at 0.02741 total noise...... 140
6.58 R² as a function of the number of qubits, for the ε parameter function, trained at 0.02741 total noise...... 140
6.59 R² as a function of the number of qubits, for the ζ parameter function, trained at 0.02741 total noise...... 140
6.60 An illustrative example of overfitting to the data (blue dots). The red curve represents a possible function that an overfitting network would have learned, compared to the actual correct fit, the green line. Training error for the red curve would be small, but any subsequent testing would give large errors. Adding noise to the data points would correct for overfitting...... 141
6.61 Training the parameter function with noise added to the Hamiltonian instead of the density matrix for a 2-qubit system...... 143
6.62 Training the parameter function with noise added to the Hamiltonian instead of the density matrix for a 3-qubit system...... 143
6.63 Training the ε parameter with noise added to the Hamiltonian instead of the density matrix for a 4-qubit system...... 144
6.64 Training the ε parameter with noise added to the Hamiltonian instead of the density matrix for a 5-qubit system...... 144
6.65 How to encode the letter Z using pairwise entanglement. The green double-
headed arrows represent pairwise entanglement...... 146
7.1 Visual representation of the linear separability of the OR gate versus the XOR
gate. The dotted line represents a linear classifier...... 156
7.2 Example of training RMS as a function of epoch for various logic gates using the
(single layer) CVNN perceptron. The RVNN and NeuralWorks networks trained
similarly...... 159
7.3 Iris dataset cross-input classification scatterplot [148]...... 161
8.1 An actual IBM 4-qubit processor. The qubits are controlled by microwave pulses tuned to certain frequencies [191]...... 171
8.2 An actual 128-qubit processor chip from DWave Systems [192]...... 172
8.3 Variance in entanglement witness for 100 iterations of each state measured at
shot counts ranging from 50 to 20,000 in 50 shot increments. As the shot count
increases, we see that the measurement variance quickly goes to zero...... 183
8.4 Q# entanglement witness values for the |Bell⟩ state with a 95% confidence interval as a function of the shot count. The confidence interval (CI) width reaches
its minimum of ∼0.0015 after approximately 15,000 shots...... 184
8.5 Q# entanglement witness values for the |P⟩ state with a 95% confidence interval
as a function of the shot count. The confidence interval (CI) width reaches its
minimum of ∼0.0015 after approximately 15,000 shots...... 185
8.6 Trained values for the tunneling amplitude K and for the bias ε, for each time
chunk, as the number of qubits in the system is increased. Both demonstrate
clear asymptotic behavior...... 186
8.7 Trained values for the qubit-qubit coupling ζ, for each time chunk, as the number
of qubits in the system is increased. The values show a clear trend, but do not
become asymptotic as quickly as with the other parameters...... 187
8.8 Trained bias, for 4 and 8 time chunks, as functions of time, for systems of
increasing numbers of qubits N...... 189
8.9 Trained bias functions for the continuum model of the entanglement witness,
for systems of increasing numbers of qubits N. Note how the graphs in Figure
8.8 are close approximations of the shapes and values of the function for each
number of qubits...... 190
10.1 Architecture of the QHNN using a multi-measurement scheme (three different measurements in this case). The network takes the quantum state ρ(t0) as input, then propagates it forward to ρ(tf). At this point, different measurements will
be performed and these classical values will be stored and processed through a
classical artificial neural network. Although there is only one classical hidden
layer in this structure, this need not be the case. Adding extra hidden layers
might speed up the training...... 200
10.2 A possible architecture of the QHNN using discrete time sampling. Here the
classical part of the network has one hidden layer with N nodes. This need not be the case in general...... 203
A.1 εA and εB parameter function for XOR gate...... 220
A.2 KA and KB parameter function for XOR gate...... 220
A.3 ζ parameter function for XOR gate...... 221
A.4 εA and εB parameter function for XNOR gate...... 221
A.5 KA and KB parameter function for XNOR gate...... 222
A.6 ζ parameter function for XNOR gate...... 222
A.7 εA and εB parameter function for CNOT gate...... 223
A.8 KA and KB parameter function for CNOT gate...... 223
A.9 ζ parameter function for CNOT gate...... 224
A.10 ε and K parameter function for maximally entangled state circuit generation. Note that in this case there is symmetry in the parameters; that is, εA = εB and KA = KB...... 225
A.11 ζ parameter function for maximally entangled state circuit generation...... 225
LIST OF TABLES
Table Page
3.1 Truth tables of the NAND and NOR logic gates, respectively...... 33
5.1 Truth tables of the XOR and XNOR logic gates, respectively...... 84
5.2 Inputs and Outputs (both desired and actual) for the QNN model of the XOR and
XNOR gate. The input is the initial, prepared, state of the two-qubit system, at
t = 0; the output, the square of the measured value of the qubit-qubit correlation
function at the final time...... 86
5.3 QNN simulation of the CNOT gate. Recall that CNOT maps: |00⟩ → |00⟩, |01⟩ → |01⟩, |10⟩ → |11⟩ and |11⟩ → |10⟩...... 87
5.4 QNN training result for the quantum circuit that would generate the maximally entangled Bell state...... 88
6.1 Training data for QNN entanglement witness...... 102
6.2 Curve-fit coefficients for parameter functions K, ε, ζ, for QNN entanglement witness...... 104
6.3 Unnormalized training input states for 3-qubit entanglement calculation..... 130
6.4 Training theoretical targets and actual QNN outputs for 3-qubit entanglement
calculation with no noise or decoherence added to the system...... 131
6.5 Fourier coefficients for 3-qubit fitted parameter functions with no noise and
decoherence, along with QNN training RMS...... 132
6.6 QNN output to encoded states of different letters, showing robustness to noise.. 147
7.1 Number of epochs needed for gate training to reach RMS error ≤1% for a (single
layer) perceptron using RVNN, CVNN and NeuralWorks (to nearest 1000 epochs)
implementations...... 157
7.2 Trained QNN parameter functions εA, εB, and ζ, for the Iris problem, in terms of
their Fourier coefficients for the function f(t) = a0 + a1 cos(ωt) + b1 sin(ωt), using
12, 30, and 75 training pairs...... 162
7.3 Trained QNN parameter functions KA, and KB, for the Iris problem, in terms
of their Fourier coefficients for the function f(t) = a0 + a1 cos(ωt) + b1 sin(ωt) +
a2 cos(2ωt) + b2 sin(2ωt) + a3 cos(3ωt) + b3 sin(3ωt), using 12, 30, and 75 training
pairs...... 163
7.4 Iris dataset training and testing average percentage RMS and identification
accuracy for different training set sizes using similar RVNN and CVNN networks, and
compared with the QNN. The RVNN was run for 50,000 epochs; the CVNN, for
1000 epochs; and the QNN, for 100 epochs. See text for details of the architectures
used...... 164
7.5 Training on entanglement of pure states with zero offsets, for the NeuralWorks,
RVNN, CVNN, and QNN...... 168
7.6 Testing on entanglement of pure states. Each network was tested on the same
set of 25 randomly chosen pure states with zero offset...... 168
8.1 QNN entanglement witness trained for 200 epochs using piecewise constant
parameter functions, and compared with calculated results using first chunked time
propagators, then with the sequence of gates, and finally on the Q# simulator [170].
The training set of four [56] includes one completely entangled state
|Bell⟩ = (1/√2)[|00⟩ + |11⟩], one unentangled state |Flat⟩ = (1/2)[|00⟩ + |01⟩ + |10⟩ + |11⟩],
one classically correlated but unentangled state |C⟩ = (1/√5)[2|00⟩ + |01⟩], and one
partially entangled state |P⟩ = (1/√3)[|01⟩ + |10⟩ + |11⟩]. Errors for each method are
shown in the final line...... 181
8.2 Trained parameter functions for the entanglement witness for the two-qubit
system, in MHz. Total time of evolution for the two time propagation methods was
1.58 ns...... 181
8.3 Trained parameter values at each time interval, for the pairwise entanglement
witness for the seven-qubit system, in MHz. By symmetry, each of the tunneling
functions K, each of the biases ε, and each of the pairwise couplings ζ is the
same. We take these values to be an approximation to the asymptotic limit of
the parameters for an N-qubit quantum system...... 188
8.4 Total RMS error for the training set in the 4, 8, and continuum models of the
entanglement witness for system sizes ranging from two to seven qubits. Training
followed the methods of [64, 176] with the additional condition that all parameters
are fully symmetric. The continuum model shows the best accuracy, but the
discretized versions also trained well and are viable approximations of the
continuum model (not realizable in the current hardware)...... 191
CHAPTER 1
INTRODUCTION
“Science offers the boldest metaphysics of the age. It’s a thoroughly human construct, driven by the faith that if we dream, press to discover, explain, and dream again, thereby plunging repeatedly into new terrain, the world will somehow come clearer and we will grasp the true strangeness of the universe. And the strangeness will all prove to be connected, and make sense” - Edward O. Wilson
1.1 Motivation
We are currently living in what most physicists would call a second quantum revolution.
The first quantum revolution started in the beginning of the 20th century, when scientists
first discovered quantum mechanics, which led to technologies such as semiconductors, GPS,
MRI, lasers, and many more. All these technologies are now being used in our daily life. In
1965, Gordon Moore predicted that the number of transistors on a silicon chip would double every year. This is known as Moore's Law [5]. This means that transistors are shrinking exponentially and will soon reach the quantum realm, which led to the idea of quantum computation, first proposed by Paul Benioff in the early 1980s. Scientists started to wonder whether a quantum computer, a computer operating according to the laws of quantum mechanics, could be more powerful than classical computers. Richard Feynman then proposed that a quantum computer would be an ideal tool to solve problems in physics and chemistry, given that it is exponentially costly to simulate large quantum systems with classical computers [2]. In 1994, quantum computation started to gain significant attention throughout the whole scientific community when Peter Shor developed an algorithm known
as “Shor’s algorithm” to factor integers in polynomial time on a quantum computer, which is
an exponential speed up compared to any known classical algorithm. This algorithm shows
explicitly the advantages of quantum computers over classical ones. However, it took quite
a while for advances in mathematics and science to yield technologies capable of controlling
a single atom. This made it possible to create hardware with quantum properties like
superposition and entanglement, and, finally, to turn quantum computing from
theory into reality. On May 11, 2011, D-Wave Systems made the first commercially available
quantum computer, operating on a 128-SQUID-qubit chip using quantum annealing. Many
more advances in quantum computing have been made since then. IBM has made several
quantum computers using the circuit model available to the public on the Cloud, which
include a 53-qubit processor. Google claimed quantum supremacy with their own 54-qubit processor in October 2019 [15]. All these recent technological breakthroughs, combined with the fact that quantum computers allow us to unravel mysteries not possible with classical computers, make quantum computing currently one of the most pursued areas of research, from better hardware design to algorithm design.
1.2 Scope of this Dissertation
Although the quantum computer has become a reality, at least on a small scale, designing a good quantum algorithm is still a challenging task. This is a major bottleneck for quantum computation. Other than a handful of quantum algorithms (Shor's integer factorization, Grover's search, Jones polynomial approximation), there exist no efficient methods of programming a quantum computer comparable to what we have for a classical computer.
Once a quantum computer is built, it is also subject to noise, similar to a classical computer. Additionally, a quantum computer deals with another, unique problem known as decoherence, which arises from unwanted interactions with the environment. Quantum mechanics is fragile, which is why, on a macroscopic scale, we rarely need to take quantum effects into account: unless the quantum processes are extremely well isolated, the quantum state will decohere and become essentially classical. (Indeed, the precise nature and implications of the ways in which decoherence leads to the loss of quantum information and the emergence of classicality is a fundamentally interesting problem, as it lies at the crux of the nature of quantum reality [35].) When this occurs in a quantum computer, the quantum nature of the computation is lost. So, if we are specifically interested in doing quantum computing, we need to guard against these kinds of effects as well. Therefore, we have to
find a way to correct these errors in quantum computations. Classically, error correction is done by adding redundancy to the information in a message, then using that redundancy to detect and correct errors that occur in transmission or storage of data. However, because of the no-cloning theorem, which says we cannot make an exact copy of an arbitrary quantum state, simple redundancy will not work in a quantum context, and unwanted interactions with the environment can destroy coherence and thus the quantum nature of the computation. This seemed like an obstacle for quantum computation at first, but in 1995 Peter Shor proposed a scheme to do error correction on a quantum computer [34]. Many advancements have been made in quantum error correction since then; however, they all require extra quantum bits (qubits), which make large scale computations impossible with existing quantum computers.
Elizabeth Behrman and James Steck at Wichita State University proposed an ingenious idea: instead of trying to explicitly design an algorithm to solve a specific task on a quantum computer, why not let the quantum computer program itself? This might sound strange at first, but it is a well-known technique in classical computation called "machine learning", used heavily for difficult computational tasks such as designing an algorithm to recognize handwritten digits from 0 to 9. Thus they combined the theory of classical machine learning, specifically artificial neural networks (ANN), a model of computation inspired by the structure of biological neural networks in brains, with quantum computing to invent a sub-field of quantum computing called "Quantum Neural Networks" (QNN) [54]. The objective of this dissertation is to show that a quantum neural network indeed provides answers to various problems in quantum computation, ranging from algorithm design to noise and decoherence.
1.3 Literature Review
Machine learning algorithms have achieved remarkable successes, ranging from image classification to self-driving cars and playing complex games such as chess or Go. These advancements were possible because of the tremendous increase in computational power and the availability of vast amounts of data over the past two decades. There are many machine learning algorithms, but artificial neural networks are arguably the best known, since they have been shown to be effective and able to achieve marvelous performance on several tasks [47]. However, classical computers have their limits. Therefore, it was no surprise that people started to look to quantum computers for help with performing machine learning tasks, making use of the possible exponential computational speed-up they have to offer. This has led to mixed usage of the terms Quantum Machine Learning and Quantum Neural Network. In fact, the term Quantum Machine Learning is so loosely defined that it means different things to different people. The term quantum-assisted machine
learning could be used for a significant number of these methods [17, 18, 19, 20, 21, 22],
where one performs a subroutine of a classical machine learning algorithm on a quantum
computer, using one of the well-known quantum algorithms that possess a quantum speedup
to enhance the efficiency of the overall algorithm. Here the term quantum speedup refers
to the advantage in runtime obtained by a quantum algorithm over classical algorithms.
The reason is simple: a large class of classical machine learning techniques relies heavily
on performing matrix operations on high-dimensional vector spaces, which is very
computationally costly and a bottleneck for classical computers. However, there are quantum
algorithms that allow you to perform certain matrix operations with an exponential speed-up
on a quantum computer. For instance, classical learning methods like Gaussian Processes
or Support Vector Machines (SVMs) require one to solve the system of linear equations
Ax = b with A ∈ R^(N×N) and x, b ∈ R^N. The current best classical algorithm requires a runtime of O(N^2.373), compared to a runtime of O(log N) on a quantum computer using the
Harrow-Hassidim-Lloyd (HHL) algorithm [16] assuming that A is sparse and the classical
data can be loaded in quantum superposition to the quantum computer in logarithmic time
along with the condition that only certain features of the solution will need to be extracted.
Grover's search algorithm, which offers a quadratic speed-up, is also used extensively
for quantum machine learning.
In our work, the term Quantum Neural Network does not refer to performing a subroutine
of a classical machine learning task on a quantum computer, but rather to a parameterized
Hamiltonian with tunable coefficients that can be learned through training. It is pivotal
that the distinction between the two concepts is clear. Recently, a few variational quantum
algorithms have arisen that possess features similar to our Quantum Neural Networks, such
as the Quantum Approximate Optimization Algorithm (QAOA) [23] and the Variational Quantum
Eigensolver (VQE) [24, 25]. Both of these algorithms rely on a tunable ansatz, which is essentially a Parametrized Quantum Circuit (PQC), and it has become standard to also refer to PQCs as Quantum Neural Networks [26, 27, 28]. These variational quantum algorithms are similar to each other, and they can be summarized at a high level as follows:
1. Select a parametrized quantum circuit, also known as an ansatz, to operate on an
initial reference state

|ψ(θ)⟩ = Un(θn) ··· U2(θ2)U1(θ1)|ψref⟩ = U(θ)|ψref⟩ (1.1)

where the initial reference state |ψref⟩ will be chosen depending on the problem.
2. Calculate the expectation value

⟨ψ(θ)|H|ψ(θ)⟩ (1.2)

where H represents a Hermitian operator, and it may vary from problem to problem.
For instance, the VQE algorithm, which is widely used in chemistry to calculate molecular
ground state energies, would take H to be the molecular Hamiltonian, whereas QAOA
would take H to be diagonal in σz, the usual computational basis.
3. Define and optimize the cost function by tuning θ. This objective function will vary
from problem to problem. For instance, in calculating molecular ground state energies
using the VQE algorithm, one would simply let the objective function be

C = ⟨ψ(θ)|H|ψ(θ)⟩ (1.3)

since minimizing C is equivalent to finding the lowest eigenvalue of H.
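The three-step loop above can be sketched in a few lines. Everything here is an illustrative toy under assumed choices (a single-qubit Ry rotation as the ansatz, Pauli-Z as the problem Hamiltonian, and a simple grid search standing in for the classical optimizer), not any specific published implementation:

```python
import numpy as np

# Step 1: a one-parameter ansatz, U(theta) = Ry(theta), acting on |0>.
# (The single-qubit rotation ansatz is an illustrative assumption.)
def ansatz_state(theta):
    return np.array([np.cos(theta / 2), np.sin(theta / 2)], dtype=complex)

# Step 2: the expectation value <psi(theta)|H|psi(theta)> for a Hermitian H.
H = np.array([[1, 0], [0, -1]], dtype=complex)  # Pauli-Z as the "problem" operator

def expectation(theta):
    psi = ansatz_state(theta)
    return float(np.real(psi.conj() @ H @ psi))

# Step 3: optimize the cost C(theta) = <psi(theta)|H|psi(theta)> over theta.
thetas = np.linspace(0.0, 2 * np.pi, 1001)
best = min(thetas, key=expectation)
print(expectation(best))  # approaches the lowest eigenvalue of H, here -1
```

At θ = π the ansatz reaches the ground state of σz, so the grid search drives the cost toward the lowest eigenvalue, −1.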
Our work is very similar in the sense that we also have a parametrized unitary with tunable
coefficients, U(θ), taken to be

U(θ) = exp(−iĤt/ℏ) (1.4)

where the tunable coefficients θ are embedded within Ĥ, the Hamiltonian; and we tune the unitary operator in Eq. 1.4 based on the cost function that we define, which also depends on expectation values and varies from problem to problem. Both the QAOA and VQE algorithms can be put into the structure of our QNN. The difference between QAOA or VQE and our QNN model lies in the ansatz structure (parametrized unitary), along with the Hermitian operator with respect to which the expectation value is calculated.
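A minimal numerical sketch of Eq. 1.4, assuming (for illustration only) a two-qubit Hamiltonian with tunneling (K), bias (ε), and coupling (ζ) terms, with ℏ = 1; the parameter values are arbitrary:

```python
import numpy as np
from scipy.linalg import expm

# Pauli matrices and identity
I2 = np.eye(2, dtype=complex)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

# An assumed illustrative two-qubit Hamiltonian with tunable coefficients:
# K (tunneling), eps (bias), zeta (qubit-qubit coupling), hbar = 1.
def hamiltonian(K, eps, zeta):
    return (K * (np.kron(sx, I2) + np.kron(I2, sx))
            + eps * (np.kron(sz, I2) + np.kron(I2, sz))
            + zeta * np.kron(sz, sz))

# U(theta) = exp(-i H t / hbar): the trainable parameters live inside H.
def evolve(K, eps, zeta, t):
    return expm(-1j * hamiltonian(K, eps, zeta) * t)

U = evolve(K=2.0, eps=0.5, zeta=1.0, t=0.3)
print(np.allclose(U.conj().T @ U, np.eye(4)))  # unitarity check: True
```

Since Ĥ is Hermitian, the resulting U is automatically unitary, whatever values the tunable coefficients take.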
For instance, when using the VQE algorithm to determine the electronic ground state energies
of a given molecule, the ansatz is often taken to have the form

U(θ) = e^(T − T†) (1.5)

where T is the excitation operator, and e^(T − T†) is known as the Unitary Coupled Cluster
ansatz in classical variational theory; there, one calculates the expectation value of the
molecular electronic Hamiltonian. In our work, by contrast, we often take the expectation
with respect to σz, the Pauli-Z basis, since we are doing supervised training to learn an
observable corresponding to certain physical quantities, like entanglement, rather than
knowing the observable and simply trying to minimize it to find its lowest eigenvalue.
Despite these differences, our QNN model could easily be modified to give the same
performance as both QAOA and VQE. Furthermore, it should be noted that our idea was
first proposed in 1996 [54], whereas both QAOA and VQE were published in 2014.
1.4 Contributions
This work contributes to the field of quantum computation in the following ways: We show that we can use machine learning to help us program a quantum computer to perform specific tasks, both quantum mechanical and classical. Furthermore, we show that quantum computation programmed in this way is robust to noise and decoherence. This shows that a Quantum Neural Network can not only help with algorithm design, but can also provide an answer to the "noise and decoherence" problems in quantum computation. In a way, our technique provides a way to find a class of quantum circuits known as noise-resilient quantum circuits, which minimize the error in the computation without using error correction schemes. The robustness of our QNN to noise and decoherence increases as the size of the system increases. This provides a promising answer to the scalability problem, with application to pattern storage. Furthermore, we show that the Quantum Neural Network is, in fact, much more powerful than existing classical neural networks, which means we can provide a "speed-up" in computational tasks. Then, we show that we can implement our Quantum Neural Network on existing quantum hardware, such as the Microsoft and IBM systems. Finally, we sketch a design for a quantum hybrid neural network for universal computation.
1.5 Structure of this Dissertation
The dissertation is structured as follows: Chapter 2, an overview of quantum mechanics, provides key concepts and postulates from quantum mechanics that are essential to understand the theory behind quantum computing. Chapter 3 will outline the theory of quantum computing. Here we will describe how a quantum computer is inherently different from a classical computer, the advantages a quantum computer possesses over classical computers, different quantum computation models, the noise and decoherence problems in quantum computers, and how one might fix these issues with error correcting schemes. Next, in chapter 4, we will present an overview of the theory of classical artificial neural networks. We will go over artificial neurons and their mathematical model, and how we can stack them together in layers to form a neural network. The universality of artificial neural networks will be discussed in detail as well. In chapter 5, we will present our model of Quantum Neural Networks (QNN). We will present the fundamental structure and derive the learning algorithm to train the QNN. Here we will also discuss the universality property of the QNN, similar to the discussion of the universality of classical artificial neural networks. In chapter 6, we will show that the QNN is robust to noise and decoherence, first on a 2-qubit system, and then we generalize this to higher order systems. This result is promising and gives us hope that the QNN might offer at least a partial solution to noise and decoherence in quantum computations. In chapter 7, we will show several benchmarking results comparing various classical neural networks and the QNN on both classical and quantum tasks. The results give us hope that the QNN will, in general, offer a speed-up over classical neural networks. This is equivalent to showing certain quantum algorithms have a speed-up advantage over classical algorithms. In chapter 8, we take our experimental results from chapter 6 to the next step, by implementing them on actual quantum simulators and hardware, specifically on the Microsoft topological quantum computer simulator and the IBM Q-Experience simulator and hardware. Last but not least, in chapter 10, we will go over some possible future work, one of which is the immediate extension of our current work. We propose a network architecture where quantum and classical networks are combined to form what we call
Quantum Hybrid Neural Networks (QHNN). We hope that this might be able to increase the capability of the network and allow us to solve more complicated problems.
CHAPTER 2
OVERVIEW OF QUANTUM MECHANICS
At the beginning of the twentieth century, almost everyone was convinced that we had discovered and understood all of physical reality through the laws of Newton and Maxwell. These laws made up what we now call classical physics. However, by the late 1930s, it had become apparent that classical physics faced serious problems in trying to account for the observed results of certain experiments. For instance, for the blackbody radiation problem, classical physics predicted something absurd known as the 'ultraviolet catastrophe', involving infinite energies.
As a result, a new mathematical framework for physics called quantum mechanics was developed, and new laws of physics (quantum physics) were formulated. Quantum physics deals with physical systems at the fundamental scale. Currently, quantum mechanics provides the most accurate and complete description of nature. It is also required for an understanding of quantum computation and information, since a quantum computer is, after all, just a computer that follows the laws of quantum mechanics, and quantum information is the result of reformulating information theory in the quantum mechanical framework. This chapter will serve as a brief summary of the essential background knowledge of quantum mechanics needed to grasp the concepts of quantum computation. In section 2.1, we start by defining the concept of a quantum state, both physically and mathematically. Next, in section 2.2, we will introduce composite quantum systems and 'quantum entanglement', a crucial concept in some of the work we have done. Section 2.3 will discuss the time evolution of a quantum system and the Schrodinger equation, which is the fundamental equation behind quantum mechanics and quantum information theory, including quantum neural networks. Last but not least, section 2.4 will show how a quantum state might be measured, and the effects of applying a measurement to a quantum state.
2.1 Quantum States and Density Operators
The state of a physical system contains information about the system. A quantum state
contains statistical information about the quantum system. It’s essentially a probability
density. Mathematically, it’s a postulate that a quantum state, known as a state vector,
is an element of a Hilbert space, H [2]. The exact Hilbert space depends on the actual
system, and may be infinite-dimensional. In quantum computation and information theory,
it is most often taken to be the space C^n for n ∈ N. Furthermore, it is required that every quantum state be normalized, that is, have unit norm, to preserve probability.
A quantum state can either be pure or mixed. A pure state is often denoted as a normalized
vector, known as a ket, |ψ⟩, in the separable Hilbert space H. It has a corresponding
dual vector, denoted ⟨ψ|, belonging to the dual space H*. If the quantum state can be described in finite dimension, then |ψ⟩ is a column vector in C^n, and its dual vector ⟨ψ| is a row vector. Since |ψ⟩ is an element of a separable Hilbert space, it has a countable orthonormal basis [6]. Thus, if {|e_i⟩} is an orthonormal basis set for H, then

|ψ⟩ = Σ_i c_i|e_i⟩,  c_i ∈ C (2.1)
Since |ψ⟩ is normalized, we must have Σ_i |c_i|² = 1. Physically, |c_i|² represents the probability
that the state |e_i⟩ will be observed (measured). Quantum measurement will be discussed in
more detail later. Mathematically, |ψ⟩ is a linear combination of the states |e_i⟩. Physically,
we say that the state |ψ⟩ is in a quantum superposition of the states |e_i⟩. This is one of the
fundamental differences between classical states and quantum states. A quantum state exists
in all of its possible states simultaneously, with each state having a statistical probability
amplitude c_i of being observed. The pure state |ψ⟩ has an alternative representation known
as the density operator representation, denoted ρ_pure. It is formed by taking the outer
product of the pure state |ψ⟩ with its dual:
ρ_pure = |ψ⟩⟨ψ| (2.2)
They are also called density matrices, and the two terms density operators and density matrices are used interchangeably. The following properties hold for pure state density operators.
Properties of pure state density operators:

1. Unit trace with rank 1: Tr(ρ_pure) = 1 and rank(ρ_pure) = 1

2. Hermitian: ρ_pure† = ρ_pure

3. Positive semidefinite: ⟨φ|ρ_pure|φ⟩ ≥ 0 for all non-zero |φ⟩ ∈ C^n

4. Idempotent: ρ_pure² = ρ_pure ⇒ Tr(ρ_pure²) = 1

where Tr(A) denotes the trace of A. Mathematically, properties 1-3 tell us that a density operator ρ_pure is nothing more than a bounded, positive trace class operator with unit trace.
The representation of the state |ψ⟩ as a density operator has some advantages. One
advantage is that it lets us represent another type of quantum state, known as a mixed state,
ρ_mixed. A mixed state is a convex combination of an ensemble of pure states, {p_i, |ψ_i⟩}, which can be written as

ρ_mixed = Σ_i p_i|ψ_i⟩⟨ψ_i| = Σ_i p_i ρ_pure^i (2.3)

Observe that every pure state density operator can also be written in the form of Equation 2.3, as a single term. Therefore, Equation 2.3 is the most general form of writing down a quantum state, since it encompasses both pure and mixed states. Hence, a general quantum state is usually denoted as a density operator ρ, a bounded, positive trace class operator with unit trace. Mixed state density operators ρ_mixed are required to have the same properties as the density operators for pure states ρ_pure, with one exception.
Properties of mixed state density operators:

1. Unit trace: Tr(ρ) = 1

2. Hermitian: ρ = ρ†

3. Positive semidefinite: ⟨φ|ρ|φ⟩ ≥ 0 for all non-zero |φ⟩ ∈ C^n

4. Not idempotent: ρ² ≠ ρ, and in fact Tr(ρ²) < 1
The last property follows from the fact that Tr(ρ) = Σ_i p_i = 1 with more than one nonzero p_i implies each p_i < 1; therefore, Tr(ρ²) = Σ_i p_i² < 1 (for an ensemble of orthonormal states). This condition is often used as the criterion to determine whether a quantum state is pure or mixed. Thus, generally we do not attach the tag of pure or mixed to ρ; instead, we can identify it by checking whether Tr(ρ²) = 1 (pure state) or Tr(ρ²) < 1 (mixed state).
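The purity criterion is easy to check numerically. A small sketch (the particular states are arbitrary illustrations):

```python
import numpy as np

def density(psi):
    psi = psi / np.linalg.norm(psi)       # enforce unit norm
    return np.outer(psi, psi.conj())      # rho = |psi><psi|

def purity(rho):
    return float(np.real(np.trace(rho @ rho)))  # Tr(rho^2)

# A pure state: Tr(rho^2) = 1
rho_pure = density(np.array([1, 1j], dtype=complex))

# A mixed state: equal-weight ensemble of |0> and |1>, Tr(rho^2) = 1/2
rho_mixed = 0.5 * density(np.array([1, 0], dtype=complex)) \
          + 0.5 * density(np.array([0, 1], dtype=complex))

print(purity(rho_pure), purity(rho_mixed))  # 1.0 and 0.5
```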
2.2 Composite Systems and Entanglement
Previously, we focused on a single isolated system. However, this is not good enough to do
quantum computation because a quantum computer with non-interacting qubits (quantum
bits) is no better than a classical computer. Therefore, we need to have a thorough grasp
of how several qubits interact with each other. When a quantum system is made from two
or more distinct physical systems, then we have something called a ‘composite system’. The
Hilbert space of a composite physical system is the tensor product of the Hilbert spaces of
the component physical systems. For instance, if A and B are the two components, then the
Hilbert space of the composite system, H_AB, is

H_AB = H_A ⊗ H_B
Suppose system A was prepared in the state |ψ⟩_A and system B was prepared in the state
|ψ⟩_B; then the composite system's state |ψ⟩_AB is the tensor product

|ψ⟩_AB = |ψ⟩_A ⊗ |ψ⟩_B (2.4)
This can be extended to any arbitrary system size by induction, i.e., if there are n different systems, labeled 1 through n, prepared in the states |ψ⟩_1 through |ψ⟩_n, then the composite system's state is

|ψ⟩_12...n = |ψ⟩_1 ⊗ |ψ⟩_2 ⊗ ··· ⊗ |ψ⟩_n = |ψ_1ψ_2···ψ_n⟩ (2.5)
Thus, if H_A and H_B are finite discrete Hilbert spaces with bases |ψ_A^i⟩ for i = 1, 2, ..., N and
|ψ_B^j⟩ for j = 1, 2, ..., M respectively, then the basis of the composite system is made up from the NM states |ψ_A^i⟩ ⊗ |ψ_B^j⟩. That is, if |φ⟩ ∈ H_A ⊗ H_B is a state in the composite system, then

|φ⟩ = Σ_ij c_ij |ψ_A^i⟩ ⊗ |ψ_B^j⟩ (2.6)

The important key feature of a composite state is that it gives rise to a very strange and interesting type of quantum state, known as 'entangled' states. Erwin Schrodinger was the first person to introduce the term, naming it 'Verschränkung', which was later translated into English as 'entanglement'. Mathematically, if |ψ⟩ ∈ H_A ⊗ H_B represents the composite state of the systems A and B, then |ψ⟩ is entangled if we cannot write it as the tensor product of two pure quantum states, i.e.
|ψ⟩ ≠ |a⟩ ⊗ |b⟩ (2.7)
where |a⟩ and |b⟩ belong to H_A and H_B respectively. Physically, this means that the subsystems A and B do not have well-defined states. For instance, suppose H_A = C² and similarly H_B = C²; then H_AB = C² ⊗ C² = C⁴. Consider the state

|ψ⟩ = (1/√2)(1, 0, 0, 1)′ (2.8)
which belongs to H_AB. Observe that |ψ⟩ ≠ |a⟩ ⊗ |b⟩ for all |a⟩ ∈ H_A and |b⟩ ∈ H_B.
This state is known as a "Bell state"; it is a maximally entangled state. Now consider
instead the state

|ψ⟩ = (1/2)(1, 1, 1, 1)′ (2.9)
which also belongs to H_AB. Here, however, we have |ψ⟩ = |a⟩ ⊗ |b⟩, where

|a⟩ = (1/√2)(1, 1)′ and |b⟩ = (1/√2)(1, 1)′

Thus, in this case |ψ⟩ is an unentangled state. The collection of states which are not entangled is known as the 'separable states' or 'product states'. In summary, for finite dimension, a pure state |ψ⟩ ∈ C^n ⊗ C^m is called a 'separable' or 'product' state if it can be written in the form |ψ⟩ = |a⟩ ⊗ |b⟩ for some |a⟩ ∈ C^n and |b⟩ ∈ C^m. Otherwise, |ψ⟩ is called 'entangled'. Now recall the well-known Schmidt decomposition theorem from linear algebra, which states:
Theorem 2.2.1. If H_A, H_B are Hilbert spaces of dimension n and m respectively, and
|ψ⟩ ∈ H_A ⊗ H_B, then there exist orthonormal bases {α_i}, {β_i} of H_A, H_B, and reals λ_i ≥ 0 with Σ_i λ_i² = 1, such that

|ψ⟩ = Σ_{i=1}^{r} λ_i α_i ⊗ β_i

where r ≤ min(m, n) is the number of nonzero coefficients λ_i (the Schmidt rank).
Theorem 2.2.1 holds for infinite dimensional Hilbert spaces, and in fact even for non-separable ones
[7]. The key feature Theorem 2.2.1 provides is a quick check (a criterion) for entanglement in a bipartite system. More explicitly, a bipartite state |ψ⟩ is entangled if and only if r > 1.
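Numerically, the Schmidt coefficients of a bipartite pure state are just the singular values of the state vector reshaped into an n × m matrix, so the criterion r > 1 reduces to counting nonzero singular values. A sketch using the states of Eqs. 2.8 and 2.9:

```python
import numpy as np

def schmidt_rank(psi, n, m, tol=1e-10):
    """Number of nonzero Schmidt coefficients of a pure state in C^n (x) C^m."""
    coeffs = np.linalg.svd(psi.reshape(n, m), compute_uv=False)
    return int(np.sum(coeffs > tol))

bell = np.array([1, 0, 0, 1]) / np.sqrt(2)   # Eq. 2.8: maximally entangled
flat = np.array([1, 1, 1, 1]) / 2            # Eq. 2.9: product state

print(schmidt_rank(bell, 2, 2))  # 2  -> entangled (r > 1)
print(schmidt_rank(flat, 2, 2))  # 1  -> separable
```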
Unfortunately, Theorem 2.2.1 cannot be extended to tripartite systems (H_ABC = H_A ⊗
H_B ⊗ H_C). Therefore, there is no simple technique for determining whether a tripartite or
higher order system is entangled or not. Moreover, so far we have only talked about characterizing
entanglement for pure states. It turns out that characterizing entanglement in mixed
quantum states is even harder! And no exact entanglement quantification for an arbitrary
quantum state exists, at least not yet. The general (pure or mixed state) definition of
entanglement is as follows:
Definition 2.2.2. A state ρ is called ‘separable’ if it can be written as a convex combination
of product states
ρ = Σ_i p_i ρ_A^i ⊗ ρ_B^i, where 0 ≤ p_i ≤ 1 and Σ_i p_i = 1 (2.10)
From Definition 2.2.2 we have that the collection of separable states forms a convex set,
S. From this point of view, we can define something called an “Entanglement Witness”, a
separability criterion [41].
Theorem 2.2.3. (Entanglement Witness Theorem) A state ρ_ent is entangled if and only
if there exists a Hermitian operator W ∈ B(H), the space of all operators acting on the Hilbert
space H of the quantum system, such that

⟨ρ_ent, W⟩ = Tr(Wρ_ent) < 0, while ⟨ρ, W⟩ = Tr(Wρ) ≥ 0 ∀ρ ∈ S (2.11)

W is known as an "Entanglement Witness" (EW).
Note that the space B(H) is itself a Hilbert space, usually called the Hilbert-Schmidt space.
Since we are only interested in finite dimensional quantum systems, we can regard it as the space of matrices. An Entanglement Witness W is guaranteed to exist because of the geometric version of the Hahn-Banach Theorem from functional analysis [6]. See Figure 2.1 for a graphical interpretation.
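As a concrete numerical illustration (a standard textbook witness, not one trained in this work), consider W = (1/2)I − |Bell⟩⟨Bell|: one can show Tr(Wρ) ≥ 0 for every separable ρ, while the Bell state itself is detected with a negative value:

```python
import numpy as np

bell = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
proj = np.outer(bell, bell.conj())

# Standard witness for the Bell state: W = (1/2) I - |Bell><Bell|
W = 0.5 * np.eye(4) - proj

def witness_value(rho):
    # Hilbert-Schmidt inner product <rho, W> = Tr(W rho)
    return float(np.real(np.trace(W @ rho)))

print(witness_value(proj))  # -0.5: the entangled state is detected

ket01 = np.array([0, 1, 0, 0], dtype=complex)  # product state |01>
print(witness_value(np.outer(ket01, ket01.conj())))  # 0.5: nonnegative, as required
```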
Theorem 2.2.4. (Geometric Hahn-Banach Theorem) Let S be a convex, compact set in a
finite dimensional Banach space, and let ρ be a point in the space with ρ ∉ S. Then there exists a hyperplane that separates ρ from S.
Figure 2.1: Graphical interpretation of the geometric Hahn-Banach Theorem. The witness W divides the Hilbert space into separable (S) and entangled subspaces. An optimal witness is as close as possible to the set S.
The hyperplane separating the set S and the point ρ_ent is determined by the normal vector W, which is selected outside the set S, similar to how a plane in Euclidean space can be defined by its normal vector, say V. Recall that the plane in Euclidean space separates vectors whose inner product with V is negative from vectors whose inner product with V is positive. See Figure 2.2 for a graphical illustration.
Figure 2.2: Graphical illustration of plane separation in Euclidean space
In Hilbert-Schmidt space, the inner product of A, B ∈ B(H) is defined as $\langle A, B \rangle = \mathrm{Tr}(A^{\dagger}B)$. Thus it is clear at this point why the Entanglement Witness Theorem has the form it has: it is nothing more than an application of the geometric Hahn-Banach Theorem. Moreover, we say that an EW is optimal, W_opt, if, in addition to Equation 2.11, there exists a separable state ρ′ ∈ S such that ⟨ρ′, W_opt⟩ = 0. An optimal EW can detect more entangled states than any non-optimal one. See Figure 2.3 for a geometric visualization.
It turns out that finding the optimal EW is a very hard (NP-hard) problem [42]. Despite the many contributions made to the theory of quantum entanglement, especially by the Horodecki family [43], it remains a deep mystery, both mathematically and physically.
Figure 2.3: Geometric illustration of an optimal entanglement witness
Although an Entanglement Witness provides a separability criterion and is helpful in detecting entanglement, it does not give quantitative information on how entangled the state is. This leads to the idea of constructing an "Entanglement Measure". As expected, this is a much harder problem! A few prominent entanglement measures do exist: those based on a convex roof construction, and those based on the distance of the state to the convex set S of all separable states. Examples of entanglement measures that are based on the convex roof
construction are “Concurrence” and “Entanglement of Formation”, which were developed
by Bennett et al. [70] and Wootters [71]. Intuitively, Entanglement of Formation (EF)
quantifies how many Bell states are needed to prepare n copies of a particular state. For
example, if the EF of a state ρ is 4/9 then this means we need 4 Bell states to prepare 9 copies
of ρ. For a pure state of bipartite composite system, EF is equivalent to the von Neumann
entropy of the reduced subsystem. For mixed states of a bipartite system, this is no longer true, since each reduced system can now have non-zero entropy on its own even if there is no entanglement. However, Entanglement of Formation is well defined for an arbitrary two-qubit state, as shown by the following two theorems. Their proofs can be found in [71].
Theorem 2.2.5. The entanglement of formation of a two-qubit state ρ is a function of the concurrence C,
$E_F(\rho) = E_F(C(\rho)) = H\left(\frac{1 + \sqrt{1 - C^2}}{2}\right) \qquad (2.12)$
where H is the Shannon entropy function
H(x) = −x log2(x) − (1 − x) log2(1 − x) (2.13)
Theorem 2.2.6. The Concurrence C of a two-qubit state ρ is
C(ρ) = max{0, µ1 − µ2 − µ3 − µ4} (2.14)
where the µ_i are the square roots of the eigenvalues of the matrix ρ · ρ̃ in decreasing order, and ρ̃ is defined as
$\tilde{\rho} = (\sigma_y \otimes \sigma_y)\, \rho^{*} \,(\sigma_y \otimes \sigma_y) \qquad (2.15)$
with $\sigma_y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}$ and ρ* the complex conjugate of ρ.
Theorems 2.2.5 and 2.2.6 provide us with an explicit entanglement measure for an arbitrary 2-qubit state.
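As a concrete check of Theorems 2.2.5 and 2.2.6, the formulas can be evaluated numerically. The following NumPy sketch (the test states are illustrative choices, not taken from this dissertation) computes the concurrence and entanglement of formation of a Bell state and of a product state:

```python
import numpy as np

# Pauli sigma_y and the spin-flip operator sigma_y ⊗ sigma_y (Equation 2.15)
SY = np.array([[0, -1j], [1j, 0]])
YY = np.kron(SY, SY)

def concurrence(rho):
    """Concurrence of a two-qubit density matrix (Theorem 2.2.6)."""
    rho_tilde = YY @ rho.conj() @ YY
    # mu_i: square roots of the eigenvalues of rho * rho_tilde, decreasing order
    mu = np.sort(np.sqrt(np.abs(np.linalg.eigvals(rho @ rho_tilde))))[::-1]
    return max(0.0, mu[0] - mu[1] - mu[2] - mu[3])

def shannon(x):
    """Binary Shannon entropy H(x) of Equation 2.13."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

def entanglement_of_formation(rho):
    """E_F as a function of the concurrence (Theorem 2.2.5)."""
    c = concurrence(rho)
    return shannon((1 + np.sqrt(1 - c**2)) / 2)

# Maximally entangled Bell state (|00> + |11>)/sqrt(2)
bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
rho_bell = np.outer(bell, bell.conj())

# Separable product state |01>
prod = np.zeros(4); prod[1] = 1.0
rho_prod = np.outer(prod, prod.conj())
```

The Bell state yields C = E_F = 1, while the product state yields 0, as expected.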
Up to this point, we have only discussed entanglement in the abstract sense and not its applications in quantum computation and information. It turns out that entanglement plays a huge role in error correction, a concept we will discuss in the next chapter.
2.3 Time-Evolution of a Closed System
The state of a system is time dependent, and so a quantum state vector |ψ⟩ is a function of time, |ψ(t)⟩. It is a postulate that |ψ⟩ evolves in time linearly according to the Schrödinger equation [82],
$i\hbar \frac{d|\psi(t)\rangle}{dt} = H(t)|\psi(t)\rangle \qquad (2.16)$
where ℏ is the reduced Planck constant, and H(t) is known as the Hamiltonian of the system, which represents the total energy (an observable/measurable quantity) of the system. Another postulate of quantum mechanics is that every measurable quantity has an associated observable operator, which is a self-adjoint (Hermitian) operator mapping a Hilbert space into itself. Thus, the Hamiltonian H(t) in Equation 2.16 must be a Hermitian operator. In general, if H(t) is well defined, then we have a complete description and understanding of the quantum system, at least mathematically. Furthermore, from Equation 2.16 we can see that the transformation that takes a quantum state |ψ(t)⟩ from t₁ to t₂ must be unitary, i.e. if
|ψ(t2)i = U|ψ(t1)i then U is a Unitary operator. This makes sense because of conservation of probability. A quantum state must remain normalized, hence if a quantum state is written
in terms of Equation 2.1, then $\sum_i |c_i|^2 = 1$. The only linear operators that preserve such norms of vectors are unitary operators. This sheds some light on why a closed quantum system must satisfy the Schrödinger equation as it evolves in time.
If we suppose that H(t) is constant in time in Equation 2.16, then the solution to the
Schrodinger equation for fixed times t0 and t is
$|\psi(t)\rangle = e^{-iH(t - t_0)/\hbar}\,|\psi(t_0)\rangle = U(t, t_0)\,|\psi(t_0)\rangle \qquad (2.17)$
Since H is a Hermitian operator, we can see from Equation 2.17 that U is a unitary operator, consistent with our earlier discussion of unitary transformations. For a time-dependent Hamiltonian, H(t), the unitary operator U(t, t₀) in Equation 2.17 can be written as:
Case I: if [H(t_i), H(t_j)] = 0 for all choices of t_i and t_j,
$U(t, t_0) = \exp\left(\frac{-i}{\hbar}\int_{t_0}^{t} H(t')\,dt'\right) \qquad (2.18)$
Case II: if [H(t_i), H(t_j)] ≠ 0 for some choices of t_i and t_j,
$U(t, t_0) = I + \sum_{n=1}^{\infty}\left(\frac{-i}{\hbar}\right)^{n}\int_{t_0}^{t} dt_1 \int_{t_0}^{t_1} dt_2 \cdots \int_{t_0}^{t_{n-1}} dt_n\, H(t_1)H(t_2)\cdots H(t_n) \qquad (2.19)$
It should be noted that Equation 2.19 is often seen as
$U(t, t_0) = T \exp\left(\frac{-i}{\hbar}\int_{t_0}^{t} H(t')\,dt'\right) \qquad (2.20)$
where T denotes the time-ordering operator. Detailed derivations of Equations 2.18 to 2.20 can be found in [1].
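For a concrete illustration of Equation 2.17, one can exponentiate a small Hermitian matrix numerically. The sketch below uses an arbitrarily chosen 2×2 Hamiltonian and natural units ℏ = 1 (both assumptions of this example, not values from the text), diagonalizing H to form U and confirming that U is unitary:

```python
import numpy as np

hbar = 1.0  # natural units (assumption of this example)
# A hypothetical time-independent 2-level Hamiltonian (Hermitian)
H = np.array([[1.0, 0.5],
              [0.5, -1.0]])
t, t0 = 2.0, 0.0

# Equation 2.17: U(t, t0) = exp(-iH(t - t0)/hbar), built from the
# spectral decomposition of the Hermitian matrix H
w, V = np.linalg.eigh(H)
U = V @ np.diag(np.exp(-1j * w * (t - t0) / hbar)) @ V.conj().T

# U is unitary, so norms (total probability) are conserved
psi0 = np.array([1.0, 0.0])
psi_t = U @ psi0
```

Because U†U = I, the evolved state remains normalized for any initial |ψ(t₀)⟩.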
Equation 2.16 can also be formulated in terms of density operators, ρ(t) = |ψ(t)⟩⟨ψ(t)|, as
$\frac{d}{dt}\rho(t) = \frac{d}{dt}\big(|\psi(t)\rangle\langle\psi(t)|\big) = \frac{1}{i\hbar}H(t)|\psi(t)\rangle\langle\psi(t)| - \frac{1}{i\hbar}|\psi(t)\rangle\langle\psi(t)|H(t) = \frac{1}{i\hbar}[H(t), \rho(t)]$
Thus, a more general form of Equation 2.16 can be written as
$\frac{d\rho}{dt} = \frac{1}{i\hbar}[H(t), \rho(t)] \qquad (2.21)$
Equation 2.21 is usually known as the Liouville - Von Neumann equation. The solution is
$\rho(t) = U\rho(t_0)U^{\dagger} \qquad (2.22)$
where U is the unitary operator described in Equations 2.18 and 2.19. Moreover, if the Hamiltonian, H, is time independent then Equation 2.22 can be written explicitly as
$\rho(t) = \exp\left(\frac{-iHt}{\hbar}\right)\rho(t_0)\exp\left(\frac{iHt}{\hbar}\right) \qquad (2.23)$
Equation 2.21 is often rewritten in shorthand in terms of the Liouville super-operator $\hat{L} = \frac{1}{\hbar}[H, \,\cdot\,]$ as
$\frac{d\rho}{dt} = -i\hat{L}\rho \qquad (2.24)$
From here we can see that the time evolution of a quantum state, ρ, can also be written as
$\rho(t) = e^{-i\hat{L}t}\rho(t_0) \qquad (2.25)$
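The equivalence of the state-vector picture (Equation 2.17) and the density-operator picture (Equation 2.22) can be verified numerically. In this sketch σ_x is used as an arbitrary toy Hamiltonian with ℏ = 1 (choices made only for illustration):

```python
import numpy as np

# sigma_x as a toy time-independent Hamiltonian, hbar = 1 (example assumptions)
H = np.array([[0.0, 1.0],
              [1.0, 0.0]])
t = 0.7

# U = exp(-iHt) via the spectral decomposition of the Hermitian H
w, V = np.linalg.eigh(H)
U = V @ np.diag(np.exp(-1j * w * t)) @ V.conj().T

psi0 = np.array([1.0, 0.0])
rho0 = np.outer(psi0, psi0.conj())

# Equation 2.22: rho(t) = U rho(t0) U^dagger
rho_t = U @ rho0 @ U.conj().T
# This must agree with evolving the state vector and forming |psi(t)><psi(t)|
psi_t = U @ psi0
```

The trace of ρ(t) stays 1, reflecting conservation of probability under unitary evolution.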
2.4 Quantum Measurements
Until this point, we have focused on the state of a quantum system and how it evolves in a closed system. Ultimately, at some point it will be of interest to measure some properties of a system, and so we must allow the system to interact with a measurement apparatus (a macroscopic piece of equipment that behaves according to the laws of classical physics) belonging to an outside observer. See Figure 2.4. The system is then no longer closed, and the evolution postulate of quantum mechanics no longer holds. This leads us to the Measurements Postulate, which provides a description of the effects of measurements on a quantum system. The following postulate, taken from [2], is the most general postulate about quantum measurement.
Measurements Postulate: Quantum measurements are described by a collection {M_i} of measurement operators, also known as Kraus operators. These are operators acting on the state space of the system being measured. The index i refers to the measurement outcomes that may occur in the experiment. If the state of the quantum system is ρ immediately before the measurement, then the probability that result i occurs is given by
$p(i) = \mathrm{Tr}(M_i^{\dagger}M_i\rho) \qquad (2.26)$
and the state of the system after the measurement is
$\frac{M_i \rho M_i^{\dagger}}{\mathrm{Tr}(M_i \rho M_i^{\dagger})} \qquad (2.27)$
Moreover, the measurement operators satisfy the completeness equation
$\sum_i M_i^{\dagger} M_i = I \qquad (2.28)$
Figure 2.4: A quantum system interacting with a measuring apparatus in the presence of the surrounding environment.
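The measurement postulate can be illustrated with the simplest choice of Kraus operators, the projectors onto |0⟩ and |1⟩ (an illustrative special case, not the general construction). The |+⟩ input state below is likewise an arbitrary example:

```python
import numpy as np

# Projectors |0><0| and |1><1| as a simple set of Kraus operators
M0 = np.diag([1.0, 0.0])
M1 = np.diag([0.0, 1.0])

# Completeness (Equation 2.28): sum_i M_i^dagger M_i = I
assert np.allclose(M0.conj().T @ M0 + M1.conj().T @ M1, np.eye(2))

# Equal-superposition input state (|0> + |1>)/sqrt(2)
plus = np.array([1.0, 1.0]) / np.sqrt(2)
rho = np.outer(plus, plus.conj())

# Outcome probabilities (Equation 2.26)
p0 = np.trace(M0.conj().T @ M0 @ rho).real
p1 = np.trace(M1.conj().T @ M1 @ rho).real

# Post-measurement state for outcome 0 (Equation 2.27): the state collapses
rho_post0 = M0 @ rho @ M0.conj().T / p0
```

Each outcome occurs with probability 1/2, and after outcome 0 the state has collapsed to |0⟩⟨0|.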
One take-away from the above measurement postulate is the collapse of the quantum
state after a measurement is performed. That is, the state is no longer quantum mechanical
but rather classical. This is what makes the so called “measurement problem” one of the
most difficult and controversial problems in quantum mechanics. Moreover, one could consider the combined system, the measurement apparatus and the quantum state together, as a larger closed quantum system. However, this would take us to another very controversial issue. Therefore, throughout this dissertation, we will follow the conventional practice and assume that measurement is not part of the closed quantum system, and that performing a measurement on a quantum system means the evolution postulate no longer holds. That is, a quantum measurement is not a unitary transformation.
Furthermore, if one adds the restriction that all of the M_i are orthogonal projectors, that is, the M_i are Hermitian and M_iM_j = δ_{ij}M_i, then we have the commonly used and well known measurements known as 'projective measurements' or 'observables'. For clarity, let us state the measurement postulate in terms of projective measurements below:
Projective Measurements Postulate: A projective measurement is described by an observable, M, a Hermitian operator on the state space of the system being observed. M has a spectral decomposition
$M = \sum_i \lambda_i P_i \qquad (2.29)$
where P_i is the projector onto the eigenspace of M with eigenvalue λ_i. Thus, an observable, M, is a weighted sum of projectors. The possible outcomes of the measurement correspond to the eigenvalues λ_i. Upon measuring the state ρ, the probability of getting result λ_i is given by
$P(\lambda_i) = \mathrm{Tr}(P_i \rho) \qquad (2.30)$
Equation 2.30 is also called Born’s rule. The system will be in the state
$\frac{P_i \rho P_i}{P(\lambda_i)} \qquad (2.31)$
immediately after measurement. Thus, measurement is an irreversible process and causes loss of information.
Projective measurement is the type of measurement taught in standard quantum mechanics courses, not the general measurement stated in the first measurement postulate. The reason for this is twofold: first, projective measurements coupled with unitary dynamics are sufficient to implement a general measurement (see [2], pages 94-95, for the proof); second, the calculation of the expected value for a projective measurement is much easier. It should be noted that in all quantum computing models, the measurements being performed are actually projective measurements; in particular, the Pauli measurements generated from the Pauli matrices σ_x, σ_y, σ_z, and σ_I. These matrices will be discussed in the next chapter. In our work, we usually pick the Pauli matrix σ_z as our measure, M.
In general, the expectation value of M in the state ρ, denoted ⟨M⟩_ρ, can simply be written as
$\langle M \rangle_\rho = \sum_i p_i \langle \psi_i| M |\psi_i\rangle = \mathrm{Tr}(\rho M) \qquad (2.32)$
This is important since in our work, especially entanglement witness calculation, we map our input states to their expectation values.
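Equation 2.32 can be checked directly for the σ_z measurement mentioned above. The state below sits at an arbitrary polar angle θ on the Bloch sphere (an illustrative choice), for which the expectation value should equal cos θ:

```python
import numpy as np

Z = np.diag([1.0, -1.0])  # Pauli sigma_z as the measured observable M

# A qubit state cos(theta/2)|0> + sin(theta/2)|1> at an arbitrary angle
theta = np.pi / 3
psi = np.array([np.cos(theta / 2), np.sin(theta / 2)])
rho = np.outer(psi, psi.conj())

# Equation 2.32: <M>_rho = Tr(rho M); for sigma_z this is cos^2(theta/2) - sin^2(theta/2)
exp_z = np.trace(rho @ Z).real
```

For θ = π/3 this gives ⟨σ_z⟩ = cos(π/3) = 1/2.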
CHAPTER 3
QUANTUM COMPUTING
In this chapter, we will explore the theory of quantum computing. We start by introduc- ing the concept of quantum bit (qubit), a fundamental unit for quantum computations, in
Section 3.1. We will show how qubits are fundamentally different from classical bits and even probabilistic bits. In Section 3.2, we will introduce a quantum computation model known as the 'gate model', which resembles the classical digital computer model. Here we will give an overview of different quantum logic gates and the concept of a quantum circuit. Then in
Section 3.3, we will discuss a set of gates which is universal for quantum computation, and one of the most important theorems in quantum information, the ‘Solovay-Kitaev’ theorem.
In Section 3.4, we will discuss different quantum algorithms, their advantages over classical algorithms, and the difficulty of designing one. Section 3.5 presents another quantum computation model known as the Adiabatic Quantum Computation model. Here we will show why this model is very good at solving optimization problems. Last but not least, in
Section 3.6, we will point out the most problematic issues in quantum computations, noise and decoherence. This last section shows one important contribution our research work has provided to the field of quantum computing.
3.1 Qubits
A building block of classical computational devices is the bit, a two-state system, which can be 0 or 1. This two-state system may be the voltage level on a wire with a threshold function. For instance, this threshold function could output 0 if the voltage level is less than 4.5 mV, and output 1 if the voltage level is greater than or equal to 4.5 mV. These threshold functions can be physically implemented by transistors, which act as on/off switches. All classical computations can be built from manipulating these on and off switches (bits). See Figure 3.1 for a geometric picture.
Figure 3.1: Geometric Representation of a Classical Bit
Similarly, the building block for quantum computational devices is the quantum mechanical two-level system, which is the simplest non-trivial quantum system. One example of such a system is the electron spin, where one can take spin up to be the state |0⟩ and spin down to be the state |1⟩. In fact, any spin-1/2 particle can be used to model a qubit. Another two-level quantum system is the polarization of a photon, where one can take horizontal polarization to be the state |0⟩ and vertical polarization to be the state |1⟩. Any two-level quantum system can form a qubit, and the states |0⟩ and |1⟩ form a basis, called the computational basis. It is possible to use a multi-level system to model a qubit, as long as there are two states that can be separated from the rest of the states in the system. For instance, one can consider the energy of an electron in an atom. Theoretically there are infinitely many possible energy levels; however, since energy is quantized, we can pick the lowest energy state (ground state) to represent |0⟩ and the first excited state to represent |1⟩, and essentially ignore the subspace spanned by all the energies above the first excited state.
This multi-level system has now become a two-level system for all practical purposes and it can be described by a 2-dimensional vector in the space spanned by the ground state and
first excited state energy levels. Therefore, it can be used to model a qubit. The general state of a two-level quantum system can be described by a vector in a 2-dimensional Hilbert space, C2. Therefore, if we take {|0i, |1i} as the basis (computational basis) then a general single qubit state has the form
$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle, \qquad \alpha, \beta \in \mathbb{C} \qquad (3.1)$
where
$|0\rangle = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \qquad |1\rangle = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \qquad (3.2)$
Due to the normalization constraint of quantum states (conservation of probability),
|α|² + |β|² = 1 must hold. The coefficients α and β are often called the amplitudes of the basis states |0⟩ and |1⟩, respectively. Equation 3.1 shows a key difference between a classical bit and a qubit: a qubit can be in a linear superposition of the states |0⟩ and |1⟩, whereas a classical bit is deterministic; it is either in the state 0 or 1. This linear superposition is part of the exclusive world of the qubit and is not available to an outside observer. For us to know the qubit's actual state, we must perform a quantum measurement. If a projective measurement is performed in the standard computational basis, then it will collapse the qubit state |ψ⟩ into either the state |0⟩ or |1⟩ with probability |α|² or |β|², respectively.
Note that the overall phase does not matter in a quantum system; that is, the states |ψ⟩ and e^{iθ}|ψ⟩ are essentially the same (indistinguishable from one another). With this in mind and the normalization constraint |α|² + |β|² = 1, we can rewrite Equation 3.1 as
$|\psi\rangle = \cos\frac{\theta}{2}|0\rangle + e^{i\phi}\sin\frac{\theta}{2}|1\rangle \qquad (3.3)$
The reason for the θ/2 will become clear when we represent |ψ⟩ in density matrix form:
$\rho = |\psi\rangle\langle\psi| = \begin{pmatrix} \cos^2\frac{\theta}{2} & e^{-i\phi}\cos\frac{\theta}{2}\sin\frac{\theta}{2} \\ e^{i\phi}\cos\frac{\theta}{2}\sin\frac{\theta}{2} & \sin^2\frac{\theta}{2} \end{pmatrix} \qquad (3.4)$
$= \frac{1}{2}\begin{pmatrix} 1 + \cos\theta & \cos\phi\sin\theta - i\sin\phi\sin\theta \\ \cos\phi\sin\theta + i\sin\phi\sin\theta & 1 - \cos\theta \end{pmatrix} \qquad (3.5)$
$= \frac{1}{2}\left(I + \cos\phi\sin\theta\,X + \sin\phi\sin\theta\,Y + \cos\theta\,Z\right) \qquad (3.6)$
$= \frac{1}{2}\left(I + \vec{r}\cdot\vec{\sigma}\right) \qquad (3.7)$
where $\vec{\sigma}$ is the 3-element 'vector' of Pauli matrices (X, Y, Z):
$\sigma_x = X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad \sigma_y = Y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \quad \sigma_z = Z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} \qquad (3.8)$
From Equation 3.6 we see that each pair (θ, φ) defines a point on the three dimensional unit sphere, Figure 3.2. This sphere, which represents the state of a qubit geometrically, is known as the Bloch sphere. It provides us with a nice geometric intuition about operations on a single qubit on a quantum computer, as we will see in the next section. One can also express points on the Bloch sphere in terms of the unit Bloch vector, $\vec{r}$, in Cartesian coordinates as
~r = (x, y, z) = (sin θ cos φ, sin θ sin φ, cos θ) (3.9)
Figure 3.2: Bloch Sphere Representation of a Qubit
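The Bloch decomposition of Equations 3.4-3.7, together with the Bloch vector of Equation 3.9, can be verified numerically for arbitrary angles (the angle values below are arbitrary illustrative choices):

```python
import numpy as np

X = np.array([[0, 1], [1, 0]])
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0, -1.0])

theta, phi = 1.1, 2.3  # arbitrary Bloch angles

# Equation 3.3: |psi> = cos(theta/2)|0> + e^{i phi} sin(theta/2)|1>
psi = np.array([np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)])
rho = np.outer(psi, psi.conj())

# Bloch vector of Equation 3.9
r = np.array([np.sin(theta) * np.cos(phi),
              np.sin(theta) * np.sin(phi),
              np.cos(theta)])

# Equation 3.7: rho = (I + r . sigma)/2
rho_bloch = 0.5 * (np.eye(2) + r[0] * X + r[1] * Y + r[2] * Z)
```

The two constructions of ρ agree, and the Bloch vector has unit length for any pure state.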
One may wonder how a qubit differs from a classical probabilistic bit. A classical probabilistic bit can be written as a vector
$\begin{pmatrix} a \\ b \end{pmatrix} \qquad (3.10)$
where a represents the probability of the bit's being 0, b the probability of the bit's being 1, and a + b = 1. In this description, the distinct difference between a classical probabilistic bit and a qubit is that the amplitude coefficients of a qubit are complex numbers instead of real numbers, which leads to a spherical geometric description instead of the linear description of the classical probabilistic bit. See Figure 3.3. This also means that there can be interference effects in composite systems, as we will see.
Now that we understand the differences between a bit, a classical probabilistic bit, and a qubit, let us try to understand the state of multiple qubits. If Equation 3.1 represents a general state of a qubit in the computational basis, then by Equation 2.4, the general state of a two-qubit system in the computational basis is
|ψi = a|00i + b|01i + c|10i + d|11i (3.11)
where a, b, c, d ∈ ℂ and |a|², |b|², |c|², |d|² represent the probabilities of measuring the state as |00⟩, |01⟩, |10⟩, |11⟩, respectively. Hence |a|² + |b|² + |c|² + |d|² = 1. Furthermore, the
Figure 3.3: Geometric Representation of a Classical Probabilistic Bit
state |ψ⟩ of two qubits can no longer be represented in terms of a sphere. Another feature a two-qubit state |ψ⟩ may possess is entanglement, as described in the previous section. This feature is not available to a system of two classical probabilistic bits, where the state of the composite system is always a tensor product of the component systems. This is one major difference and advantage of a quantum computer over a classical computer. This becomes even clearer if we compare three classical probabilistic bits with a three-qubit system.
In the system of three classical probabilistic bits, we can write the state of the second bit independently of the first bit, and the third bit independently of the previous two. That is, we can write them independently:
$\begin{pmatrix} a \\ b \end{pmatrix} \quad \begin{pmatrix} c \\ d \end{pmatrix} \quad \begin{pmatrix} e \\ f \end{pmatrix} \qquad (3.12)$
whereas a three-qubit state cannot be written in terms of each qubit independently, because of entanglement, and therefore must be written as
|ψi = a|000i + b|001i + c|010i + d|011i + e|100i + f|101i + g|110i + h|111i (3.13)
Thus the space of qubits is inherently much larger than the space for bits or for classical probabilistic bits. This gives us a glimpse into the power of quantum computation.
We have described what it means to make a measurement on a single qubit, and what the coefficients in Equation 3.1 represent. For a multi-qubit system, measurement can be done in a similar manner. For instance, if one makes a projective measurement in the computational basis on the first qubit of Equation 3.11, then the probability of its being in the state |0⟩ is |a|² + |b|², and of its being in the state |1⟩, |c|² + |d|². Similar results are obtained if we perform the same measurement on the second qubit. In quantum computation it is possible to do this, that is, to perform a measurement on a subset of all the qubits, the ones of interest, and not the entire set. In this case the quantum nature of the rest of the system can be retained after the measurement.
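The partial-measurement rule just described can be sketched in NumPy. The amplitudes below are hypothetical values chosen only to be normalized; the reshape trick groups the four amplitudes by the value of the first qubit:

```python
import numpy as np

# Hypothetical normalized two-qubit amplitudes (a, b, c, d) for |00>,|01>,|10>,|11>
a, b, c, d = 0.5, 0.5, 0.5, 0.5
psi = np.array([a, b, c, d]).reshape(2, 2)  # axis 0: first qubit, axis 1: second

# Probability the first qubit is measured as |0>: |a|^2 + |b|^2
p0 = np.sum(np.abs(psi[0, :]) ** 2)

# Renormalized state of the second qubit after that outcome;
# the unmeasured qubit retains its quantum description
psi_second = psi[0, :] / np.sqrt(p0)
```

For these amplitudes the first qubit reads |0⟩ with probability 1/2, and the second qubit remains in a normalized quantum state.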
There exist many physical implementation models of qubits, ranging from superconducting qubits [29], ion traps [31], topological qubits [32], to photonic qubits [31]. Each has its advantages and disadvantages. Since a qubit is a quantum state, it evolves according to Equation 2.16. In general, the time independent Hamiltonian of a single qubit can be written as
$H = \frac{1}{2}\begin{pmatrix} \epsilon & K \\ K & -\epsilon \end{pmatrix} = \frac{1}{2}\left(K\sigma_x + \epsilon\sigma_z\right) \qquad (3.14)$
where K is the tunneling parameter and ε is the potential energy offset or bias. The most general form of the time-dependent Hamiltonian for an N-qubit system is a 2^N × 2^N Hermitian matrix.
3.2 Quantum Gates and Circuits Model
Classical computers use electrical wires and logic gates to perform their computation.
These logic gates are built from electrical currents and transistors. All classical algorithms are built by manipulating these logic gates to get the desired outputs. For better visualization, truth tables of the NAND and NOR gates can be found in Table 3.1.
Table 3.1: Truth tables of the NAND and NOR logic gates, in respective order

Input A  Input B  NAND Output  |  Input A  Input B  NOR Output
0        0        1            |  0        0        1
0        1        1            |  0        1        0
1        0        1            |  1        0        0
1        1        0            |  1        1        0
Since all classical computations are done by manipulation of logic gates, one might suggest that quantum computations can be done with quantum logic gates. This leads to the quantum circuit model, which consists of logical qubits and quantum gates acting on the qubits. We can think of qubits as being carried by 'wires' from left to right, and a quantum gate as a unitary operator in the circuit diagram. See Figure 3.4 for a visualization of a quantum circuit. In Figure 3.4, U1, U2, U3, U4 represent quantum gate operations on the qubits. U1 is a single qubit operation, hence it belongs to U(2), the 2 by 2 unitary matrices. U2, U3 and U4 are multi-qubit gate operations, with U2, U4 ∈ U(4) and U3 ∈ U(8). The reason for quantum gates being unitary operators is clear from the evolution postulate of quantum mechanics, Equation 2.21. Thus all quantum gates must be reversible, and therefore all quantum computations are reversible. At the end of the circuit, we can perform a measurement on the qubits of interest. This is represented by the boxes marked M in Figure 3.4. At first, it seems like the unitary, and hence reversible, condition imposed on a quantum computer is an issue [132]. The fear was that we might not be able to do most classical computational tasks
Figure 3.4: Visualization of a quantum circuit
since those tasks are non-reversible. However, in 1973 Charles Bennett showed that one can make any non-reversible computation reversible with a small overhead [3]. Therefore, any classical computational task can be executed on a quantum computer. In fact, the classical reversible computer has become a topic of interest recently because reversible computers are more energy efficient and faster [4]. As an example of a non-reversible operation turned into a reversible one, consider the AND logic gate, which outputs 1 if and only if both input bits are 1; hence, the AND logic gate is a non-reversible operation. However, it can be made reversible by adding an extra input bit. Consider the circuit diagram in Figure 3.5, with the top bit (A) being the control bit for the swapping operation between B and 0. From
Figure 3.5: Reversible AND gate (Fredkin gate)
the circuit diagram in Figure 3.5, we see that if A = 0 then the output is 0, and if A = 1 then the output is B. This is exactly the AND function, but now done in a reversible manner.
Any 2×2 unitary operator acting on a qubit is called a 1-qubit quantum gate. Let us take a look at some of the crucial 1-qubit gates and the geometric interpretation of applying such a gate to a 1-qubit state on the Bloch sphere. The classical NOT gate has a quantum analog known as the 'Pauli X gate', or X or σx for short.
$\sigma_x = X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \qquad (3.15)$
From Equation 3.8, we can see that if we take the computational basis as |0⟩ = (1 0)ᵀ and |1⟩ = (0 1)ᵀ, then when X is applied to the state |0⟩, the result is the state |1⟩:
$X|0\rangle = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix} = |1\rangle \qquad (3.16)$
The geometric interpretation of the X gate is a 180° rotation of |ψ⟩ around the x-axis. See Figure 3.6. The other Pauli gates are σy = Y, σz = Z, and σI = I:
$Y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \qquad Z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}, \qquad I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \qquad (3.17)$
Figure 3.6: Geometric visualization of the Pauli X gate applied to the state |0⟩ on the Bloch sphere
The Y and Z gates represent 180° rotations around the y- and z-axes, respectively. Note that the set {X, Y, Z, I} of Pauli gates spans the vector space formed by all 1-qubit operators. Therefore, any 1-qubit unitary operator (gate) can be expressed as a linear combination of the Pauli gates.
Some other important 1-qubit gates are the Hadamard (H), Phase (S), and π/8 (T) gates. These gates are fundamental in quantum computation, as they make up the universal set, a set of gates for universal computation, as we will see in Section 3.3.
$H = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad S = \begin{pmatrix} 1 & 0 \\ 0 & i \end{pmatrix}, \qquad T = \begin{pmatrix} 1 & 0 \\ 0 & e^{i\pi/4} \end{pmatrix} \qquad (3.18)$
Notice that H is a 180° rotation around the diagonal X + Z axis of the Bloch sphere. An important feature of the Hadamard gate is that it creates quantum superposition when it acts on a computational basis state, as shown in Equations 3.19 and 3.20:
$H|0\rangle = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}\begin{pmatrix} 1 \\ 0 \end{pmatrix} = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix} = \frac{|0\rangle + |1\rangle}{\sqrt{2}} \qquad (3.19)$
$H|1\rangle = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}\begin{pmatrix} 0 \\ 1 \end{pmatrix} = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ -1 \end{pmatrix} = \frac{|0\rangle - |1\rangle}{\sqrt{2}} \qquad (3.20)$
A geometric interpretation of the Hadamard gate applied to the state |0i on the Bloch sphere can be seen in Figure 3.7. Another thing to notice is that the Hadamard gate is exactly the two-point Discrete Fourier Transform matrix. This turns out to be an important feature to be exploited in quantum algorithms.
Figure 3.7: Geometric visualization of the rotation on the Bloch sphere created by applying the Hadamard gate to the state |0⟩ to create the superposition state (|0⟩ + |1⟩)/√2, and vice versa.
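Equations 3.19 and 3.20, and the observation that H equals the two-point discrete Fourier transform matrix, can be checked directly:

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
zero = np.array([1.0, 0.0])
one = np.array([0.0, 1.0])

# Equations 3.19 and 3.20: H creates equal superpositions
plus = H @ zero    # (|0> + |1>)/sqrt(2)
minus = H @ one    # (|0> - |1>)/sqrt(2)

# H coincides with the two-point (unitary-normalized) DFT matrix
n = np.arange(2)
dft2 = np.exp(-2j * np.pi * np.outer(n, n) / 2) / np.sqrt(2)
```

This DFT connection generalizes: the quantum Fourier transform on one qubit is exactly the Hadamard gate.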
So far, we have only been talking about single qubit gates. To be able to do any useful
computation, we must be able to get the qubits to interact with each other. Hence, we will
shift our focus to multi-qubit gates. Classical gates such as AND, OR, NAND, NOR are
multiple-bit gates. The most fundamental multi-qubit quantum gate is the controlled-NOT
or CNOT gate, which is a 2-qubit gate. It is the analog of the classical XOR gate. The first qubit in the CNOT gate is the control qubit and the second is the target qubit. See Figure
3.8. If the control qubit is |0i then nothing is done. If the control qubit is |1i then the target
qubit will get flipped. Explicitly, the action of CNOT gate can be written as follows.
CNOT |00i = |00i; CNOT |01i = |01i; CNOT |10i = |11i; CNOT |11i = |10i (3.21)
The matrix representation of the CNOT gate in the computational basis is
$CNOT = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix} \qquad (3.22)$
Figure 3.8: Controlled-NOT gate
Now that we have defined 1-qubit gates and a 2-qubit gate, we can combine them to create the following quantum circuit, which is often used to make a maximally entangled state (Bell state). See Figure 3.9.
Figure 3.9: This circuit maps the state |00⟩ to the state (|00⟩ + |11⟩)/√2
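The Bell-state circuit of Figure 3.9 can be simulated by multiplying the corresponding matrices: a Hadamard on the first qubit (tensored with the identity on the second), followed by the CNOT of Equation 3.22:

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
I2 = np.eye(2)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])

# Figure 3.9: apply (H ⊗ I) then CNOT to the input |00>
psi00 = np.array([1.0, 0.0, 0.0, 0.0])
bell = CNOT @ np.kron(H, I2) @ psi00   # (|00> + |11>)/sqrt(2)
```

Without the CNOT, the state after (H ⊗ I) is still a product state; it is the 2-qubit gate that creates the entanglement.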
The circuit in Figure 3.9 is very important in quantum computations. Without a 2-qubit
gate like the CNOT , quantum computers are no better than classical computers because the
quantum states are not entangled. A generalization of the CNOT gate is the controlled-U,
CU, gate. Again, it is a two qubit operation, with the first qubit being the control and
the second being the target qubit. If the control qubit is set to |1i, then the 2-by-2 unitary
operator U is applied to the target qubit. From this point of view, it is clear that CNOT
gate is a special case of CU, where U is the Pauli-X gate. The matrix representation for a
general controlled-U gate, CU, is
$CU = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & u_{11} & u_{12} \\ 0 & 0 & u_{21} & u_{22} \end{pmatrix}, \qquad U = \begin{pmatrix} u_{11} & u_{12} \\ u_{21} & u_{22} \end{pmatrix} \qquad (3.23)$
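The block structure of Equation 3.23 suggests a simple numerical constructor. The helper below (a hypothetical function name, not from the dissertation) embeds any 2×2 unitary into the lower-right block and recovers CNOT for the special case U = X:

```python
import numpy as np

def controlled(U):
    """Embed a 2x2 unitary U as the controlled-U gate of Equation 3.23."""
    CU = np.eye(4, dtype=complex)
    CU[2:, 2:] = U  # U acts only on the block where the control qubit is |1>
    return CU

X = np.array([[0, 1], [1, 0]])
CNOT = controlled(X)  # CNOT is the special case U = X
```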
Another multi-qubit gate that often comes up in quantum computations is the Toffoli gate. It is a 3-qubit gate, often denoted CCNOT, with the circuit representation shown in Figure 3.10. The first two qubits are the control qubits: if both are in the state |1⟩, then the Pauli-X gate is applied to the third qubit. It should be noted that the Toffoli gate is the reversible analogue of the classical NAND gate. This is important since it tells us we need a three-qubit system to simulate such a gate. The truth table and matrix representation can be seen in Equation 3.24, where I_A, I_B, I_C denote the 3 input values and O_A, O_B, O_C the 3 output values.
I_A I_B I_C → O_A O_B O_C
0 0 0 → 0 0 0
0 0 1 → 0 0 1
0 1 0 → 0 1 0
0 1 1 → 0 1 1
1 0 0 → 1 0 0
1 0 1 → 1 0 1
1 1 0 → 1 1 1
1 1 1 → 1 1 0
$CCNOT = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \end{pmatrix} \qquad (3.24)$
We have talked about a handful of different quantum gates, from 1-qubit gates to a 3-qubit gate. How many quantum gates are there? The answer is that an uncountably infinite number of quantum gates exist, since any unitary operation can be thought of as a quantum gate and the set of unitary operations is continuous. This seems like a daunting statement at first, but it turns out that we can approximate any arbitrary unitary operator with a small set of discrete gates, known as a universal set of quantum gates.
Figure 3.10: Circuit representation of the Toffoli gate
3.3 Universal Quantum Computation and The Solovay-Kitaev Theorem
In classical computing, the NAND (NOT-AND) and the NOR (NOT-OR) gates are universal. That is, any other Boolean function (logic gate) can be reproduced from just the NAND or the NOR gates alone, and they can be used to compute an arbitrary classical function. Note that the NOT gate is a single-bit gate, whereas the AND and OR gates are 2-bit gates. In quantum computation, a similar result exists despite the fact that the set of unitary operations is continuous. A set, G, of quantum gates is said to be universal if any unitary operation can be approximated to arbitrary precision by a sequence of gates from G.
If U is the desired unitary operator, and V is the result from manipulating the gates from
G in a sequential manner, then we can define the error when V is implemented instead of U by
$E(U, V) = \max_{|\psi\rangle} \left\|(U - V)|\psi\rangle\right\| \qquad (3.25)$
Therefore, mathematically, when U is said to be approximated to arbitrary accuracy, it means that, given ε > 0, there exists some V such that E(U, V) < ε. A common and primarily used universal set, G, contains the following gates:
G = {H, π/8, CNOT} (3.26)
where H is the Hadamard gate, π/8 is T, and CNOT is the controlled-NOT gate. See Equations 3.18 and 3.21 for the exact description of these gates. People often add the phase gate, S, to the set
G, for error correction. However, it is not needed for the proof of the universality of the set G.
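Since the maximization in Equation 3.25 equals the largest singular value (operator norm) of U − V, the error is easy to compute numerically. The function name below is illustrative, and the T-versus-identity comparison is an arbitrary example:

```python
import numpy as np

def approx_error(U, V):
    """E(U, V) of Equation 3.25: max over states of ||(U - V)|psi>||,
    which equals the largest singular value of U - V."""
    return np.linalg.svd(U - V, compute_uv=False)[0]

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
T = np.diag([1.0, np.exp(1j * np.pi / 4)])

# A perfect implementation has zero error
e_same = approx_error(H, H)
# T versus the identity: the error is |1 - e^{i pi/4}| = 2 sin(pi/8)
e_TI = approx_error(T, np.eye(2))
```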
An interesting remark is that the set P = {H, S, CNOT} is not universal, since it only generates a discrete subgroup of U(n). Moreover, any circuit over this set of gates can be simulated in classical polynomial time by the famous Gottesman-Knill Theorem [9]. Note that the H and π/8 gates are single qubit gates, whereas the CNOT gate is a 2-qubit gate. Thus, the set G contains only 1-qubit and 2-qubit gates, so it might be a surprise that it can approximate an arbitrary n-dimensional unitary operator belonging to U(n) to arbitrary accuracy. Before
explaining why this surprising fact is actually not at all surprising, it is good to recall that
classically, any Boolean function can be reproduced by the AND (2-bit) and NOT (1-bit)
gates alone. Thus, we have a similar result in the gate quantum computing model. This
is why the gate model for quantum computing is more popular than other models (we will discuss other quantum computation models in a later section): it is more relatable to the classical digital computer. To see why only 1- and 2-qubit gates are needed to approximate any
unitary matrix with arbitrary dimension, we first state a well-known theorem:
Theorem 3.3.1. The set S consisting of the Hadamard gate H and the π/8 gate T is a universal set for U(2). That is, given an arbitrary unitary matrix U ∈ U(2) and ε > 0, there exists a V which can be generated by a product of H and T in a sequential manner, such that E(U, V) < ε.
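Theorem 3.3.1 can be illustrated numerically. The sketch below brute-forces short products of H and T and tracks the best approximation error E(U, V) to an illustrative target rotation; the target and search depth are our own choices, and a real Solovay-Kitaev construction is vastly more efficient than this exhaustive search:

```python
import numpy as np
from itertools import product

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
T = np.array([[1, 0], [0, np.exp(1j * np.pi / 4)]])

def error(U, V):
    # E(U, V) = max_{|psi>} ||(U - V)|psi>|| is the largest singular value of U - V
    return np.linalg.norm(U - V, ord=2)

# Illustrative target: a z-rotation through an angle no short T-power hits exactly
theta = 0.3
U = np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])

best = float("inf")
for length in range(1, 11):                 # exhaustive search over words of H and T
    for word in product((H, T), repeat=length):
        V = np.eye(2)
        for gate in word:
            V = gate @ V
        best = min(best, error(U, V))

# Since H @ H = I, a length-2 word already realizes the identity,
# so the search can only improve on E(U, I)
assert best <= error(U, np.eye(2))
```

Longer and longer words of H and T fill in U(2) densely (up to phase), which is the content of the theorem; the Solovay-Kitaev theorem below quantifies how fast.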
The details of the proof of Theorem 3.3.1 can be found in [2] on page 196. This theorem
essentially states that any single-qubit gate can be approximated to arbitrary precision with just the Hadamard (H) and π/8 (T) gates. This does not yet answer the question of why only 1- and 2-qubit gates are needed for universal computation. The following definition and a result from linear algebra will shed some light on the answer.
Definition 3.3.2. A unitary matrix V ∈ U(n) is called a two-level unitary matrix if all diagonal elements are 1’s and all off-diagonal elements are 0’s except for four elements V_jj = a, V_jk = b, V_kj = c, V_kk = d, which form a 2 × 2 unitary matrix V̂ with det(V̂) = 1. Equation 3.27 shows what a two-level matrix might look like:

        [ 1 ... 0 ... 0 ... 0 ... 0 ]
        [ : ... : ... : ... : ... : ]
        [ 0 ... a ... 0 ... b ... 0 ]
        [ : ... : ... : ... : ... : ]                  [ a  b ]
    V = [ 0 ... 0 ... 1 ... 0 ... 0 ] ,          V̂ =  [ c  d ]        (3.27)
        [ : ... : ... : ... : ... : ]
        [ 0 ... c ... 0 ... d ... 0 ]
        [ : ... : ... : ... : ... : ]
        [ 0 ... 0 ... 0 ... 0 ... 1 ]
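A two-level matrix of this form is easy to construct and check directly; the helper below (our own name, for illustration) embeds a 2 × 2 unitary into rows and columns j and k of the identity:

```python
import numpy as np

def two_level(n, j, k, v):
    """Embed the 2x2 unitary v into rows/columns j and k of the n x n identity."""
    V = np.eye(n, dtype=complex)
    V[j, j], V[j, k] = v[0, 0], v[0, 1]
    V[k, j], V[k, k] = v[1, 0], v[1, 1]
    return V

theta = 0.7
v = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # det(v) = 1, as in Definition 3.3.2

V = two_level(8, 1, 5, v)                          # 8 x 8, i.e. a 3-qubit operator
assert np.allclose(V.conj().T @ V, np.eye(8))      # two-level matrices are unitary
assert np.isclose(np.linalg.det(V), 1.0)           # determinant inherited from v
```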
Now, there is a result from linear algebra which shows that any unitary matrix can be decomposed into a product of two-level unitary matrices. Moreover, for a 2^n × 2^n unitary (an n-qubit operator), this product consists of at most 2^(n−1)(2^n − 1) two-level unitary matrices [8]. Thus, to show that the set G defined by
Equation 3.26 is a universal set, it suffices to show that it can approximate an arbitrary n×n two-level unitary matrix. By looking at the structure of a two-level unitary matrix, Equation
3.27, it shouldn’t be an astonishing fact that a two-level unitary can be implemented using only 1-qubit and CNOT gates. The detailed construction algorithm can be found in [2] on pages 191-193. Therefore, putting it all together, it is now clear why the set G defined by Equation 3.26 is universal for quantum computations. However, up to this point of our discussion about universality, we have been avoiding an important question.
Specifically, despite knowing that an arbitrary unitary, U ∈ U(n), can be approximated by a sequence of gates from our universal set G to any desired accuracy ε, we don’t know how many gates are needed for a given ε. This is an important question because if the number of gates needed in the sequence were on the order of O(2^(1/ε)), then our universal set G would only be good in the theoretical sense and would have no practical value in quantum computations. One of
the most well-known theorems in quantum computation is the “Solovay-Kitaev Theorem”, and it assures us that this never happens.

Theorem 3.3.3. Given any physical universal set G, an arbitrary target unitary U ∈ U(n), and a desired accuracy ε > 0, there exists a finite sequence S of gates from G of length O(log^c(1/ε)), omitting the dependence on the number of qubits n, such that E(U, S) < ε, where 2 ≤ c ≤ 4, depending on the structure of G.
It should be noted that the gates in the universal set G of the Solovay-Kitaev Theorem are closed under inverses; that is, if a gate A ∈ G then A^(-1) ∈ G. The gates belonging to our universal set defined in Equation 3.26 certainly have this property, as expected. The condition of being closed under inverses helps with the proof of the theorem; however, whether or not this condition can be removed is an open problem!
3.4 Quantum Speed-up and Quantum Algorithms
There exist different quantum computation models, as we will see in the next section.
Their fundamental building block, the qubit, remains unchanged. When people think of quantum computers, the first thing that usually comes to mind is the exponential speed-up they provide over classical computers, because of the superposition and entanglement properties of a qubit. Although it is true that these properties are the basis for the advantages of quantum computers over classical computers, it is not true in general that a quantum computer will be exponentially faster than a classical computer. A big part of this exponential speed-up comes from the way the algorithm is designed. Designing a quantum algorithm is not an easy task in general, and designing one cleverly enough to achieve an exponential speed-up is even harder! Part of our research goal is to at least solve the first task, that is,
to use machine learning to allow the quantum computer to design its own algorithm. But
for now, we will discuss some famous existing quantum algorithms and why they provide an
exponential speed up over known classical algorithms.
To understand the computational speed up of classical problems on a quantum computer,
we first define some classical complexity classes. The two most important classical complexity
classes are P and NP. The class P contains problems which can be solved in polynomial time (quickly) on a classical computer, for example, calculating the greatest common divisor. The class NP (nondeterministic polynomial time) contains problems whose solutions can be quickly verified on a classical computer, for instance, the prime factors of an integer n. Thus far
there is no known classical algorithm that can find prime factors efficiently on a classical computer. It’s obvious that P ⊆ NP. However, it has not yet been proven that there exist
problems that are in NP but not in P. This is one of the seven Millennium Prize Problems
known as “P versus NP”. The key point is that there are problems in the NP class which
can be solved efficiently (quickly) on a quantum computer. One such problem is: prime
factorization, which can be solved efficiently using Shor’s algorithm. At first, one might wonder what quantum mechanics has to do with factoring. The interesting answer is: nothing. However, quantum mechanics has everything to do with waves and periodicity, which have a lot to do with factoring. Finding the prime factors of a number n essentially boils down to finding the period of the function

f(x) = a^x mod n,  where a < n and gcd(a, n) = 1   (3.28)
The function f in Equation 3.28 looks like random noise, so finding its period is very difficult, at least with a classical computer. However, it turns out that quantum computers are
very efficient at period finding using a technique called Quantum Fourier Transform (QFT),
which is mathematically equivalent to the discrete Fourier transform (DFT). Classically, the fast Fourier transform takes about n·2^n steps to Fourier transform 2^n numbers. However, this can be accomplished in n^2 steps with the QFT on a quantum computer. Performing some sort of transform efficiently is essentially the backbone of Shor’s algorithm and many other quantum algorithms, like Grover’s search algorithm [123] or the Bernstein-Vazirani algorithm [124]. This is the main source of the exponential speed-up. Details on Shor’s algorithm can be found in [14]. In fact, almost every quantum algorithm can be posed as a hidden subgroup problem.
Finding an efficient quantum algorithm is the same as developing an efficient solution to the hidden subgroup problem for certain groups. It turns out that a quantum computer can solve the hidden subgroup problem efficiently when the group is a finite Abelian group. However, the question is still open for non-Abelian groups. If a quantum computer could solve the non-Abelian case efficiently, then graph isomorphism could be solved efficiently on a quantum computer.
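The reduction from factoring to period finding can be checked classically for small numbers. The sketch below finds the period r of f(x) = a^x mod n by brute force, which is exactly the step Shor’s algorithm replaces with the QFT, and then extracts factors of n from r; the function names are ours, for illustration:

```python
from math import gcd

def period(a, n):
    """Smallest r > 0 with a^r = 1 (mod n), found by brute force.
    This exponential-time loop is what the QFT performs efficiently."""
    r, x = 1, a % n
    while x != 1:
        x = (x * a) % n
        r += 1
    return r

def factors_from_period(a, n):
    """Shor's classical post-processing: gcd(a^(r/2) +/- 1, n)."""
    r = period(a, n)
    if r % 2 == 1:
        return None          # odd period: retry with a different a
    y = pow(a, r // 2, n)
    if y == n - 1:
        return None          # trivial square root: retry
    return gcd(y - 1, n), gcd(y + 1, n)

assert gcd(7, 15) == 1
assert period(7, 15) == 4                     # powers of 7 mod 15: 7, 4, 13, 1
assert sorted(factors_from_period(7, 15)) == [3, 5]
```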
The main point is, there are hard problems like prime factoring which can be solved very efficiently on a quantum computer, even though no efficient classical algorithm exists as of yet. We can actually define the class of all computational problems which can be solved efficiently on a quantum computer; this complexity class is known as BQP (Bounded-error Quantum Polynomial time). There are still many open questions about the BQP complexity class; in fact, we still don’t know where it fits with respect to the P, NP, and PSPACE complexity classes. See Figure 3.11 to see how these complexity classes are embedded within each other.
Figure 3.11: Complexity classes embedded within one another. The complexity class BQP is still not well understood. We have shown that with a cleverly designed quantum algorithm, problems in NP that are believed to be classically hard, like prime factorization, can be solved efficiently on a quantum computer. However, no NP-complete problem has been solved efficiently on a quantum computer.
3.5 Adiabatic Quantum Computation Model
To this point, the primary focus of quantum computation has been the gate, or circuit, model. The circuit model offers a strong parallel to the classical computation model, which makes it more intuitive and easier to understand. However, other quantum computational models exist, and they are all equivalent to one another up to a polynomial overhead [10]. Therefore, the computational complexity won’t be exponentially better or worse if we switch between different models. In this section, we will briefly introduce another quantum computation model known as “Adiabatic Quantum Computation” (AQC), which is based on the celebrated “Adiabatic Theorem” in quantum mechanics. Similar to the quantum gate model, one can derive algorithms to operate on an AQC system, such as the prime factorization algorithm [12]. These algorithms are implemented using a physical process known as “Quantum Annealing”. A main difference between the gate and the AQC model is the analog nature of the AQC model, where the system is slowly adjusted from an initial state to the final state where the solution to the problem is encoded. AQC systems are being built with thousands of qubits, compared to the current largest gate-model device of only 72 qubits. However, it is not at all clear that these AQC systems use the full power of quantum computing, for which entangled states are necessary [11], as we have discussed in the previous sections. In fact, our research group has developed a technique to determine whether an AQC system has all of its qubits entangled or not [55]. One advantage of the AQC model over other quantum computation models is that it’s more robust against environmental noise and decoherence [13].
A computation in the AQC model is specified by two Hamiltonians, the initial Hamiltonian and the final Hamiltonian, usually denoted HI and HF, respectively. The Hamiltonian structure depends on the type of qubits used in the system, but for our purposes one can regard it as a Hermitian matrix. The system starts in the ground state, the eigenvector with the lowest eigenvalue, of HI. This state is usually taken to be a tensor product state because it is easy to prepare. The system will be slowly adjusted from HI to HF by changing
field and interaction strengths on the qubits. The output of the system is the ground state of HF , where the solution to the problem has been encoded. Similar to the gate model where quantum gates are operating on a constant number of qubits, we require that the
Hamiltonians are local; that is, they involve only interactions among a constant number of particles. This ensures that the Hamiltonians have a nice matrix structure. The gradual transition from HI to HF can be described in a more explicit manner as
H(t) = s(t)HI + (1 − s(t))HF   (3.29)

where s(t) is the “adiabatic evolution path” that decreases from 1 to 0 as t goes from 0 to some elapsed time, tf. A simple and often used path is the linear one, given by s(t) = 1 − t/tf. The time it takes to evolve from HI to HF while remaining in the ground state is known as the “annealing schedule”, and for many problems, it grows exponentially in the problem size.
AQC is great at solving hard classical optimization problems. For instance, suppose we are given an oracle, a black box, which can be described as a function f such that

f : {0, 1}^n → R

The goal is to find the optimal x̂ ∈ {0, 1}^n such that f(x̂) = min(f). The initial Hamiltonian, HI, is taken to be

HI = Σ_{i=1}^{n} σx_i

where σx_i = I ⊗ · · · ⊗ σx ⊗ · · · ⊗ I, that is, the tensor product of a sequence of 2 × 2 identity (I) matrices with the Pauli-X at the ith position. The final Hamiltonian is then taken to be

HF = Σ_{x ∈ {0,1}^n} f(x) Π_x

where Π_x denotes the projection onto |x⟩. Thus, by starting at an easy, simple ground state, we can move toward another ground state which encodes the minimal solution x̂, provided that we move slowly enough.
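These two Hamiltonians can be built explicitly for a few qubits. The sketch below constructs HI = Σ σx_i via Kronecker products and a diagonal HF from a cost table f (the cost values are made up for illustration), then checks that the ground state of HI has uniform amplitude over all bit strings while the ground state of HF is the basis state minimizing f:

```python
import numpy as np
from functools import reduce

X = np.array([[0., 1.], [1., 0.]])   # Pauli-X
I2 = np.eye(2)

def sigma_x(i, n):
    """I (x) ... (x) X at position i (x) ... (x) I on n qubits."""
    return reduce(np.kron, [X if j == i else I2 for j in range(n)])

n = 3
HI = sum(sigma_x(i, n) for i in range(n))

# Ground state of HI: uniform magnitude 1/sqrt(2^n) over all 2^n bit strings
vals, vecs = np.linalg.eigh(HI)
ground_I = vecs[:, 0]
assert np.allclose(np.abs(ground_I), 1 / np.sqrt(2**n))

# HF = sum_x f(x) Pi_x is diagonal in the computational basis.
# The cost values below are illustrative; the minimum sits at index 3 (bit string 011).
f_vals = np.array([5., 3., 7., 1., 6., 4., 8., 2.])
HF = np.diag(f_vals)
vals_F, vecs_F = np.linalg.eigh(HF)
assert np.argmax(np.abs(vecs_F[:, 0])) == np.argmin(f_vals)
```

Interpolating between the two via Equation 3.29 slowly enough carries the easy-to-prepare ground state of HI into the argmin-encoding ground state of HF.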
Currently, D-Wave Systems provides the largest implementation of AQC. D-Wave’s processors are essentially a transverse Ising model with tunable local fields and coupling coefficients, with the governing Hamiltonian

H = Σ_i ε_i σx_i + Σ_i K_i σz_i + Σ_{i,j} ζ_{ij} σz_i σz_j   (3.30)

Many hard problems, even NP-hard problems, can naturally be expressed as the problem of finding the ground state (minimum energy configuration) of such a Hamiltonian [44]. However, a problem Hamiltonian taking the form of Equation 3.30 is not QMA-complete. Recall that a problem is said to be QMA-complete if it is QMA-hard and in QMA, where QMA represents the Quantum Merlin Arthur complexity class, which contains, essentially, all the interesting problems we really care about. A simple modification to the Hamiltonian from Equation 3.30 can make it QMA-complete, however [45].

Theorem 3.5.1. The Hamiltonians

H = Σ_i K_i σx_i + Σ_i ε_i σz_i + Σ_{i,j} ζ_{ij} σz_i σz_j + Σ_{i,j} β_{ij} σx_i σx_j   (3.31)

and

H = Σ_i K_i σx_i + Σ_i ε_i σz_i + Σ_{i,j} ζ_{ij} σz_i σx_j + Σ_{i,j} β_{ij} σx_i σz_j   (3.32)

are QMA-complete [45].

Some other quantum computation models are the Quantum Turing Machine (QTM) and Measurement-Based Quantum Computation (MBQC). However, our research group has never used these models in our work.

3.6 Quantum Decoherence, Noise, and Error Correction

A closed quantum system evolves based on the Schrodinger equation, Equation 2.16, where the Hamiltonian H is well defined and provides complete information about how the system evolves. The Hamiltonian might come from outside the system (control of external fields), but what makes a quantum system closed is that it doesn’t act back on this external source. However, a closed system doesn’t exist in reality. All quantum systems interact with the outside environment, at least on a weak scale. This environmental interaction almost always does not preserve coherence, and thus produces “decoherence”. When a quantum state decoheres it becomes, essentially, classical. This is the reason quantum effects are not observed at macroscopic (classical) scales. Any large-scale superposition (e.g., being both dead and alive) would decohere almost instantaneously (unfortunately). Thus, decoherence is a big issue for quantum computations.
Outside of quantum decoherence, quantum systems also have to deal with noise, similar to classical systems. Quantum noise can come from various sources, both external (e.g., stray magnetic fields) and internal (e.g., material impurities). Therefore, the ability to eliminate or reduce these effects is significant in quantum computations. Even after Peter Shor published his prime factoring algorithm in 1994 and put public-key encryption systems like RSA at risk, there was still much skepticism that quantum computing could be practical, mainly because of noise and decoherence in quantum systems [36]. This led to Peter Shor publishing the first Quantum Error Correction Code (QECC) [34]. The Shor code is a 9-qubit QECC that is capable of correcting an arbitrary error on a single qubit while protecting the logical qubit state. This is done by encoding a qubit state |ψ⟩ = α|0⟩ + β|1⟩ into the state |ψ⟩_L = α|0⟩_L + β|1⟩_L, where

|0⟩_L = (1/√8) (|000⟩ + |111⟩)(|000⟩ + |111⟩)(|000⟩ + |111⟩)   (3.33)

|1⟩_L = (1/√8) (|000⟩ − |111⟩)(|000⟩ − |111⟩)(|000⟩ − |111⟩)   (3.34)

From Equations 3.33 and 3.34 we can see that quantum entanglement serves as the backbone of quantum error correcting schemes. This works by first entangling the working qubits in the quantum register with ancilla qubits prepared in a well-defined state. Because of the non-classical correlation property of entanglement, the ancilla qubits will contain information about the errors on the working qubits when a corruption occurs. We can then quantify the errors through the ancilla qubits, and use this information to correct the working qubits without ever having to make any measurement on them. Thus, entanglement allows us to detect quantum errors without changing the state of the qubits in a quantum register, and then use this information to correct the errors. Figure 3.12 shows how one can implement Shor’s 9-qubit code on a quantum circuit.
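The logical codewords of Equations 3.33 and 3.34 can be written out as 2^9-dimensional vectors and checked to be properly normalized and orthogonal; the construction below is our own, for illustration:

```python
import numpy as np
from functools import reduce

ket0, ket1 = np.array([1., 0.]), np.array([0., 1.])

def kron_all(vectors):
    return reduce(np.kron, vectors)

# The repeated 3-qubit blocks of Eqs. 3.33 and 3.34
ghz_plus  = (kron_all([ket0] * 3) + kron_all([ket1] * 3)) / np.sqrt(2)  # |000> + |111>
ghz_minus = (kron_all([ket0] * 3) - kron_all([ket1] * 3)) / np.sqrt(2)  # |000> - |111>

# |0>_L and |1>_L: three blocks each; the three 1/sqrt(2) factors give the 1/sqrt(8)
zero_L = kron_all([ghz_plus] * 3)
one_L  = kron_all([ghz_minus] * 3)

assert zero_L.shape == (2**9,)
assert np.isclose(zero_L @ zero_L, 1.0)   # normalized
assert np.isclose(one_L @ one_L, 1.0)
assert np.isclose(zero_L @ one_L, 0.0)    # orthogonal codewords
```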
More details on Shor’s error correction scheme can be found in [2]. Since Shor published his 9-qubit error correcting code, others have developed more efficient error correction schemes. For instance, Andrew Steane developed the Steane code, which does the same thing as the Shor code but using only 7 qubits instead of 9, and is closely related to the classical error-correcting code known as the Hamming code [37][38]. Then Raymond Laflamme et al. found a class of codes which do the same thing as the Shor and Steane codes but using only 5 qubits, and which also have the property of being fault-tolerant [39]. It turns out that 5 qubits is the lowest number possible for any quantum error correction scheme [40]. Thus decoherence and noise are enormous problems for quantum computations. Fortunately, quantum error-correcting schemes exist. However, as discussed, these schemes need a minimum of 5 qubits to protect one single qubit. This is a problem for currently existing quantum hardware, where we only have a limited number of qubits available. It should be noted that although the Adiabatic Quantum Computation model has available hardware with up to 5000 qubits (D-Wave Systems), it does not incorporate quantum error-correcting codes in its algorithms.

Figure 3.12: Quantum circuit for Shor’s 9-qubit error correction code. E is a quantum channel that can arbitrarily corrupt a single qubit [34].

CHAPTER 4

CLASSICAL ARTIFICIAL NEURAL NETWORKS

The first and simplest definition of an artificial neural network, or neural network, is provided by Dr. Robert Hecht-Nielsen as: “... a computing system made up of a number of simple, highly connected processing elements, which process information by their dynamic state response to external inputs.” In this chapter we will explore classical artificial neural networks in detail. First, in Section 4.1 we will describe what an artificial neural network is and why we need it.
In Section 4.2 we will briefly describe biological neurons, and how we can design mathematical models for them, which leads us to the concept of artificial neurons. We will present two types of artificial neuron: the ‘perceptron’ and its modified version, the ‘sigmoid neuron’. In Section 4.3 we will discuss how artificial neurons can be stacked in layers with interconnections to build a neural network, a mathematical model trying to replicate the structure of our brain. In Section 4.4 we will discuss the universality of artificial neural networks.

4.1 Introduction

Artificial neural networks are an area of machine learning, a subfield of artificial intelligence. Artificial neural networks are computational models that offer an alternative to the standard deterministic algorithmic approach. Computers are great at performing simple operations, and they do them quickly. For a computer to perform a specific task, we program it explicitly, by giving it a well-defined procedure (an algorithm) to execute. When a task is too complex, it is very hard to develop an algorithm for it. For instance, it would be a nightmare to program a computer explicitly for face recognition, or to determine what letters or numbers a person wrote. However, recognizing faces and letters are rather trivial problems for us humans, and we do them effortlessly and unconsciously in our daily life. The human brain is great at recognizing patterns. So why do we need a computer to do these tasks? The answer is: although we can perform these tasks with our brains, we are limited by time and energy. So while we might be able to recognize faces or handwritten letters rather easily, doing it over a data set of hundreds of millions is unreasonable. However, computers are great at sorting through enormous data sets! In a way this tells us that our brain is really a supercomputer that has evolved over hundreds of millions of years, and is very well adapted to understanding the visual world.
Thus it might be a good idea to program computers the way our brain is programmed, which is by learning! One way a person learns new tasks is by recognizing patterns from data in past experience and applying them. For instance, if a kid touches a red-hot stove, he or she will learn never again to touch it, or any hot red piece of metal. Or when you try to kick a soccer ball into a goal: if you miss wide to the right the first time, then you will aim more toward the left the second. You will keep adjusting your aim until the soccer ball goes inside the goal. In each of those instances, our brain learns from past experiences and results, and we use this information for future tasks. Thus we can try to teach a machine to learn the same way we do. In other words, we can use a set of examples instead of explicit instructions to allow the machine to infer rules automatically about a specific task, like classifying handwritten letters or finding how much force and what aim scores a goal with a soccer ball. To do this we must have some sort of a model representing the brain, at least partially, and apply this model to train the computer. This is what an ‘artificial neural network’ does: it tries to replicate how a human brain works. The human brain is a formidably complex structure and there is still a lot we don’t know about it. But we do know that it is made up of about a hundred billion neurons, the information processing cells of the brain, and each neuron has about 10,000 synapses (interconnections) on average, which brings the total approximate number of connections to an astronomical value of 10^15. Modeling a human brain mathematically is impossible, but we can create a partial model to replicate it, which then boils down to modeling (a small number of) biological neurons and the interactions among them.
4.2 Artificial Neurons

Similar to the way a ‘bit’ and a ‘qubit’ are the fundamental processing units of classical and quantum computers respectively, an ‘artificial neuron’ is the basic building block of an artificial neural network. An artificial neuron is a mathematical model, a representation of a human neuron, which is the fundamental unit of our brain. Biological neurons come in various shapes and sizes, but each neuron has a ‘soma’ or cell body which contains the nucleus and other vital components. See Figure 4.1.

Figure 4.1: The structure of a biological neuron. [46]

The cell body also contains tree-like branches known as dendrites and a long fiber extending from the cell body known as the axon. These are the main communication links. The neuron receives its input (electrical signals) along the dendrites; the cell body processes these inputs and decides whether or not the neuron fires an action potential. The axon then carries the signal (action potential) away from the cell body to the synaptic/axon terminals. When the action potential reaches the terminal, chemical messengers called neurotransmitters are released, which are collected by nearby dendrites of other neurons. These dendrites convert the chemical signals into electrical signals which then get sent to a cell body again for processing. This process keeps going as long as we are alive. From here we can develop a mathematical model, an artificial neuron, representing a biological neuron. We present here two different models. The first model is called a ‘perceptron’, developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work of Warren McCulloch and Walter Pitts. The perceptron is rarely used in today’s neural networks because of its hard threshold activation, which makes devising a learning algorithm difficult.
This leads to another artificial neuron model that is similar to the perceptron but uses a soft threshold activation instead, known as a ‘sigmoid neuron’. We will discuss these two neuron models in the following subsections.

4.2.1 Perceptron

The perceptron was the first mathematical model that attempted to replicate a biological neuron. It takes a finite number of binary inputs, {x1, x2, ..., xn}, and produces a single binary output y. Figure 4.2 shows an example of a two-input perceptron.

Figure 4.2: A perceptron model with two inputs.

Each of the inputs is multiplied by a weight, wi ∈ R, which represents the importance of that input to the output. The output of the neuron is either 0 or 1 depending on the weighted sum and the threshold value of the activation function, σ. Mathematically, this process can be described as:

output = { 0 if Σ_i wi xi ≤ threshold
         { 1 if Σ_i wi xi > threshold        (4.1)

This basic mathematical model tries to replicate how the dendrites collect chemical signals, turn them into an electrical signal, and send it to the cell body to process. The cell body then decides whether to fire an action potential or not. This is all there is to a perceptron. It’s nothing more than a processing unit that makes a decision by weighing up the evidence. The following example illustrates how a perceptron makes a decision. Suppose you want to decide whether to show up to class or not on a particular day. Then there are probably several factors going into your decision, for instance:

x1 = Is there an exam?
x2 = Is homework being collected?
But if there is no exam then at least two of the other three conditions must be met for you to show up to class. Note that the threshold value determines the bias in our decision. If the threshold value is low in the previous example then this means you are biased toward going to class to start with, and you don’t need much reason to go. In most neural networks literature, the negative of the threshold/bias value is designated as b in the rewritten Equation 4.1: ( P 0 if i wixi + b ≤ 0 output = P (4.2) 1 if i wixi + b > 0 A perceptron by itself can’t do much other than being a linear classifier, like the example above. However, if we stack a lot of them together in layers fashion like neurons in our brain, they can accomplish many complicated tasks, if we can find the correct weights, wi, for the output. This is where the perceptron runs into a problem. The hard threshold step function for the output makes it hard to adjust the weights wi because small changes in the weights might completely flip the decision from 0 to 1 or vice versa. This makes it difficult to see how we can make gradual modifications to the weights and biases so that we can get closer to the desired behaviors. 58 4.2.2 Sigmoid Neuron The sigmoid neuron is similar to the perceptron but with some small modifications to make the problem of determining the weights and bias easier. The first modification is that the inputs {xi} are no longer binary but instead continuous real variables. The activation function, σ, is now taken to be a sigmoid function, hence the name sigmoid neuron. A sigmoid function is a bounded, continuous, and differentiable function with the properties ( 1 as t → ∞ σ(x) → (4.3) 0 as t → −∞ Some examples of sigmoid functions are: 1 x f(x) = f(x) = tanh(x) f(x) = arctan(x) f(x) = √ 1 + e−x 1 + x2 1 Figure 4.3: A sigmoid neuron node with x ∈ N and the activation function f(x) = R 1 + e−x . The sigmoid activation function provides a soft threshold. 
It’s a smoothed-out version of the hard threshold step function of the perceptron. The precise mathematical model for the sigmoid neuron is:

y = output = σ( Σ_{i=1}^{N} xi wi + b )     (4.4)

where σ is the sigmoid activation function, and xi, wi, b ∈ R are the inputs, weights, and bias, respectively. Note that the output of a sigmoid neuron is no longer binary, but any real value within some bounded interval, usually scaled to [0, 1]. The smoothness of σ implies that small changes in the weights and bias cause only a small change in the output, which is crucial in devising an algorithm to find the correct weights and biases for neural networks:

Δy ≈ Σ_i (∂y/∂wi) Δwi + (∂y/∂b) Δb          (4.5)

Sigmoid neurons are used extensively and are preferred over perceptrons in neural networks because of the differentiability of their activation functions. Therefore, for the remaining sections and subsections of this chapter, we will only focus on neural networks that are built with sigmoid neurons.

4.3 Multi-Layer Neural Networks

A single perceptron or sigmoid neuron can’t accomplish much by itself. However, stacking them in layers with interconnections (building a neural network), in an attempt to replicate the structure of our brain, can make them quite a powerful computational tool. What we are trying to accomplish is to determine the map f : R^N → R^M without having to build an explicit algorithm, by having the computer learn this map through a neural network model instead. In Section 4.4 we will see exactly the class of functions a multi-layer neural network can learn. For now we will define some terminology of multi-layer neural networks and show how the optimal weights and biases can be found through a specific learning algorithm for supervised learning called error backpropagation. It should be noted that this learning rule is only for sigmoid neurons, because it requires the activation to be differentiable.
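The two neuron models of Section 4.2 are each a few lines of code, and the differentiability just mentioned shows up as the sigmoid version’s smooth output. The sketch below implements the class-attendance perceptron (weights 5, 3, 2, 2 and threshold 4, written with the bias b = −4 of Equation 4.2) alongside a sigmoid neuron (Equation 4.4):

```python
import numpy as np

def perceptron(x, w, b):
    """Hard-threshold neuron, Eq. 4.2: output 1 iff w.x + b > 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

def sigmoid_neuron(x, w, b):
    """Soft-threshold neuron, Eq. 4.4, with the logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

# Class attendance: exam, homework collected, good weather, interesting topic
w = np.array([5.0, 3.0, 2.0, 2.0])
b = -4.0                                     # bias = -(threshold)

assert perceptron([1, 0, 0, 0], w, b) == 1   # an exam alone clears the threshold
assert perceptron([0, 1, 0, 0], w, b) == 0   # homework alone (3) does not
assert perceptron([0, 1, 1, 0], w, b) == 1   # homework + weather: 5 > 4
assert perceptron([0, 0, 1, 1], w, b) == 0   # weather + topic: exactly 4, no fire

# The sigmoid neuron reports the borderline case as 0.5 instead of a hard 0
assert np.isclose(sigmoid_neuron([0, 0, 1, 1], w, b), 0.5)
```

The last two lines show the difference that matters for learning: the perceptron’s output jumps discontinuously at the threshold, while the sigmoid neuron varies smoothly through it.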
4.3.1 Network Architecture

The first layer of a neural network is called the input layer. The input layer takes in the input value x ∈ R^N, and the neurons within this layer are called input neurons. Hence there are N neuron nodes here, each encoding a value xi ∈ R for i = 1, 2, ..., N. Each of these nodes is connected to an arbitrary number P of sigmoid neurons in the next layer. Each of these neurons outputs a value which gets passed on to the next layer of neurons. This process continues until the last layer of the neural network, which is called the output layer. If the map is from R^N → R^M, then there will be M neuron nodes at the output layer. This changes from problem to problem, of course. The layers in between the input layer and the output layer are called hidden layers. A neural network with only one hidden layer is called a shallow network. See Figure 4.4 for a visualization of a neural network with two hidden layers, with the first hidden layer containing P neurons and the second hidden layer containing Q neurons, mapping from R² to R. Note that the output layer does not need to be a single output; rather, this changes from problem to problem and with how we want to structure the learning process. For instance, if we want to train a neural network to recognize handwritten digits from 0 to 9, then we might want to have 10 output neurons, with each output neuron acting as an indicator. That is, if the input is an image of the digit 2, then the third output neuron’s target value is 1 and the rest of the output neurons’ target values are 0. However, we can also structure the problem so that there is only one output neuron, with the output values ranging from 0 to 10. In such an instance, if the image is an image of the digit 2, then its target output value is within the interval 3 < target < 4. The number of hidden layers also changes from problem to problem.
If the problem is very complex, then we might design the neural network to have parts of its hidden layers extract certain features. This might help in speeding up the training as well as in finding the true optimal weights and biases. All of this is known as the network architecture.

Figure 4.4: A neural network with two hidden layers mapping f : R^2 → R. In nearly all cases we take all the activation functions σ_i to be the same; that is, there is only one type of activation in the network. The reason for this will become clear in Section 4.4.

4.3.2 Error Backpropagation and the Gradient Descent Learning Rule

A neural network is of no use if it doesn't learn. Learning in neural networks means finding the optimal weights and biases to correctly map the set of inputs to their corresponding outputs. This is done in the training process. In supervised learning, we have a training data set consisting of input data and their corresponding outputs. We start at some random initial weights and biases, then let the network propagate one of the inputs forward using the initial weights. This is known as the feed-forward step. At the end of the feed-forward step, we will have an output that is most likely different from our intended target value for that input. Then we have to go back and change the weights and biases to reduce this difference/error. To do this, we define a cost function J,

J(w, b) = (1/2) ||y − d||^2   (4.6)

where w and b represent all the weights and biases of the network, respectively, y is the actual output of the network after the feed-forward step, and d is the desired (correct) value. The norm in Equation 4.6 can be taken as an l^2 norm. So what we want is to change the weights and biases in such a way that we minimize the cost function. Notice that the cost function is a very high dimensional surface, and therefore analytic minimization techniques from calculus won't work.
Thus, we resort to the iterative optimization technique known as gradient descent. Gradient descent tells us that a function J decreases fastest at a point w_0 if we go in the direction of the negative gradient, −∇J(w_0). This gives the iterative updating rule

w_{k+1} = w_k − η ∇J(w_k)   (4.7)

This is exactly how we update each weight and bias in neural networks. But remember that neural networks might have more than one hidden layer, so we need to trace this error back through every layer. This is nothing more than applying successive chain rule computations, because the layers of a neural network are just compositions of functions. To see this, just note that the activation/output of the jth neuron in the lth layer can be written as

y_j^l = σ( ∑_k w_{jk}^l y_k^{l−1} + b_j^l )   (4.8)

where the sum in Equation 4.8 is over all neurons k in the (l − 1)th layer. This is where backpropagation comes into play! It gives us a way to calculate the gradients of the cost function with respect to each of the weights and biases efficiently. And computing the gradient efficiently is all that backpropagation does, and nothing more. This efficiency comes from the fact that the computation of the gradients for the weights between layers l and l + 1 reuses most of the calculation already done for the weights between layers l + 1 and l + 2, as you can see from Equation 4.8. Thus, we don't have to repeat these duplicate computations, which saves us computational cost. We also now see why it is important for our activation functions to be differentiable: without the differentiability of the activation function, it would be difficult to develop a learning algorithm to train the network. To write down the backpropagation algorithm, first note that the activations/outputs in the lth layer can be written in matrix form as

y^l = σ(w^l y^{l−1} + b^l)   (4.9)

where w^l is the weight matrix connecting the (l − 1)th layer to the lth layer.
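The update rule of Equation 4.7 can be demonstrated on a toy cost function whose gradient is known in closed form; the target vector d and the learning rate η below are arbitrary illustrative choices:

```python
import numpy as np

# Gradient descent on the toy cost J(w) = 0.5 * ||w - d||^2,
# whose gradient is simply (w - d); Eq. (4.7) in action.
d = np.array([1.0, -2.0, 0.5])   # illustrative "target" minimizer
w = np.zeros(3)                  # initial weights
eta = 0.1                        # learning rate

for _ in range(200):
    grad = w - d                 # analytic gradient of J at w
    w = w - eta * grad           # w_{k+1} = w_k - eta * grad J(w_k)
```

After enough iterations, w converges to the minimizer d of the cost.
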
The entries of w^l are just w_{jk}^l, and similarly for b^l. Note that σ is applied component-wise. This expression provides a more global way of thinking about how the activations/outputs in one layer relate to the activations/outputs in the previous layer. From here, the equation for the error in the output layer of an L-layer network, denoted δ^L (this is nothing but the gradient of the cost function with respect to the weighted inputs at layer L), can be written as

δ^L = ∇_y J ⊙ σ′(w^L y^{L−1} + b^L)   (4.10)

where the operation ⊙ is the Hadamard product, the element-wise product of two vectors of the same dimension. Then, from the chain rule, one can easily derive the following equation for the error δ^l in the lth layer in terms of the error δ^{l+1} in the (l + 1)th layer:

δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ′(w^l y^{l−1} + b^l)   (4.11)

From here, we can see that the gradients of the cost function J with respect to the weights w_{jk}^l and biases b_j^l are

∂J/∂w_{jk}^l = y_k^{l−1} δ_j^l ,   ∂J/∂b_j^l = δ_j^l   (4.12)

Thus the partial derivatives ∂J/∂w_{jk}^l and ∂J/∂b_j^l can be computed in terms of the known quantities δ^l and y^{l−1}. These equations are essentially everything that goes into the backpropagation algorithm. First, you do the feed-forward step, calculating the activation/output for each layer all the way to the last (output) layer L. Then you find the error of the output layer, δ^L. From this point, one just needs to keep cascading the error backward until the input layer is reached. As you backpropagate through each layer, you evaluate the expressions ∂J/∂w_{jk}^l and ∂J/∂b_j^l to be used in the gradient descent learning rule:

w_{jk}^l = w_{jk}^l − η_w ∂J/∂w_{jk}^l ,   b_j^l = b_j^l − η_b ∂J/∂b_j^l   (4.13)

That's all there is to training a neural network so that it can learn to perform a specific task. This idea was given a useful reformulation by Yann LeCun in [78] in 1987.
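Putting Equations 4.9 through 4.13 together, a minimal NumPy sketch of feed-forward plus backpropagation for a tiny 2-3-1 sigmoid network might look as follows (the layer sizes, the single training pair, and the learning rate are illustrative choices, not taken from this chapter):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# w[l] connects layer l-1 to layer l; b[l] belongs to layer l (Eq. 4.9).
w = [None, rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
b = [None, rng.normal(size=(3, 1)), rng.normal(size=(1, 1))]
eta = 1.0

# A single illustrative training pair (x, d).
x = np.array([[0.2], [0.8]])
d = np.array([[0.6]])

for _ in range(2000):
    # Feed-forward, caching weighted inputs z and activations y.
    y, zs = [x], [None]
    for l in (1, 2):
        z = w[l] @ y[l - 1] + b[l]
        zs.append(z)
        y.append(sigmoid(z))

    # Output-layer error, Eq. (4.10): delta^L = (y - d) * sigma'(z^L).
    delta = [None, None, (y[2] - d) * sigmoid_prime(zs[2])]
    # Backpropagate one layer, Eq. (4.11).
    delta[1] = (w[2].T @ delta[2]) * sigmoid_prime(zs[1])

    # Gradient-descent updates, Eqs. (4.12)-(4.13).
    for l in (1, 2):
        w[l] = w[l] - eta * delta[l] @ y[l - 1].T
        b[l] = b[l] - eta * delta[l]
```

After training, feeding x forward again produces an output close to the target d.
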
Backpropagation can be reformulated as a variational problem by first defining, for each training pattern p (training pair), a Lagrangian L_p:

L_p(W, Y_p, B_p) = ||y − d||^2 + ∑_{k=1}^{N} B_p(k)^T ( Y_p(k) − F[W(k) Y_p(k−1)] )   (4.14)

where the summation term is the constraint of the network: it represents how the network is connected and evolves from layer to layer. The index k = 1 : N runs over the layers of the network, W(k) is the weight matrix at layer k, and Y_p(k) is the state at layer k. The term B_p(k) is the Lagrange multiplier. The full Lagrangian of the network is then defined as

L = ∑_{p=1}^{P} L_p(W, Y_p, B_p)   (4.15)

This formulation is quite nice because it takes the dynamics of the network into account in the Lagrangian, and it is the formulation we use in our Quantum Neural Network. See [78] for a detailed derivation and for how the network can be trained in this formulation. The key point is that from this formulation, LeCun was able to extend the training method to a few different networks, one of which is the continuous-time recurrent network. We will soon see how this is helpful in deriving the learning algorithm for Quantum Neural Networks in the next chapter.

4.4 Universal Approximation

In the previous section, we discussed network architecture and how to search for the optimal weights and biases using the error backpropagation learning algorithm. Throughout the discussion, we swept a very important question under the rug: how complicated a problem can a neural network solve? This is an important question, because if neural networks could only solve simple or trivial problems, that would defeat the purpose. Fortunately, it turns out that neural networks are universal approximators: they can approximate any continuous function on a compact set.
This tells us that, in considering a problem, as long as the function that maps the input to the output can be represented as a continuous function, the problem can be solved with a neural network. We will present two different explanations. The first will be intuitive and constructive, using ancient mathematics. The second is based on modern analysis, so it will be more abstract but elegant. It should be noted that, by Lusin's theorem, this result can be extended to measurable functions.

4.4.1 Approximation with Boxes

Here we will show by construction that a neural network with two hidden layers is sufficient to approximate any continuous function on a compact set. This constructive argument is well known in the neural network community [53, 97]. However, when people discuss the universality of neural networks, they often refer to the abstract proof by Cybenko, which we will present in the following subsection. Although Cybenko's proof answers the question, it doesn't provide much intuition, since it is not constructive. To see the constructive argument, let us first recall an ancient theorem of mathematics.

Theorem 4.4.1. Let f be any continuous function from R^n to R and ε > 0. Then there exists a δ > 0 such that for any partition P of [0, 1]^n into rectangles P = (R_1, R_2, ..., R_N) with all side lengths not exceeding δ, there exist scalars α_1, α_2, ..., α_N such that

||h − f|| ≤ ε   where   h = ∑_{i=1}^{N} α_i I_{R_i}   (4.16)

Proof. Since f is continuous from R^n to R, f is uniformly continuous on [0, 1]^n by the standard uniform continuity theorem. Thus for any ε > 0, there exists a δ > 0 such that for every x, y ∈ [0, 1]^n with ||x − y||_∞ < δ we have |f(x) − f(y)| ≤ ε. Now let a partition (R_1, R_2, ..., R_N) be given, and suppose all side lengths are no greater than δ, where δ is from the preceding analysis. For each i ∈ {1, 2, ..., N}, choose any x_i ∈ R_i and set α_i = f(x_i); then any other x ∈ R_i satisfies ||x − x_i||_∞ ≤ δ, and thus |f(x) − α_i| ≤ ε.
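Theorem 4.4.1 is easy to check numerically. The sketch below (the target function, partition size, and sample grid are arbitrary illustrative choices) approximates a continuous function on [0, 1] by a piecewise-constant function built from midpoint values α_i = f(x_i):

```python
import numpy as np

# Approximate f(x) = sin(2*pi*x) on [0, 1] by a piecewise-constant
# function h built on N equal subintervals R_i, as in Theorem 4.4.1.
f = lambda x: np.sin(2 * np.pi * x)
N = 1000                                    # number of rectangles R_i
edges = np.linspace(0.0, 1.0, N + 1)
alphas = f(0.5 * (edges[:-1] + edges[1:]))  # alpha_i = f(x_i), midpoints

def h(x):
    i = np.minimum((x * N).astype(int), N - 1)  # rectangle containing x
    return alphas[i]

xs = np.linspace(0.0, 1.0, 10001)
err = np.max(np.abs(h(xs) - f(xs)))   # shrinks as N grows
```
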
Therefore the function h = ∑_{i=1}^{N} α_i I_{R_i} satisfies

sup_{x ∈ [0,1]^n} |h(x) − f(x)| ≤ ε   (4.17)

Figure 4.5: Approximation of a continuous function with piece-wise constant functions.

The theorem above tells us that if we can somehow design our neural network to output piece-wise constant functions, then we can approximate any continuous function. That is, if we can show that we are able to construct boxes with our neurons, then we can approximate any continuous function. In one dimension, f : [a, b] → R, this is not so hard. To see it, consider a neural network consisting of one input neuron, a single hidden layer with 2 sigmoid neurons, and an output layer with no activation. Mathematically, we can write this output as:

output = w_1^2 σ(w_1^1 x + b_1) + w_2^2 σ(w_2^1 x + b_2)   (4.18)

where σ(x) = 1/(1 + e^{−x}), and x, w_1^1, w_2^1, w_1^2, w_2^2, b_1, b_2 ∈ R. Now, by picking w_1^1 and w_2^1 to be large values, and w_1^2 = −w_2^2 with b_1 ≠ b_2, we have successfully created a box with our neural network. See Figure 4.6.

Figure 4.6: Two hidden neurons create a box.

Therefore, from this point of view, we can see that a neural network with one hidden layer is enough to approximate an arbitrary continuous function f : D → R, where D is a compact set, by stacking up the hidden neurons in such a way that each pair outputs a box. See Figure 4.7 for how this is done. It should be noted that, because of the continuity of the sigmoid neuron, there will be a little fudge at the edges of these boxes; hence the approximation here should be taken in the L^p sense and not uniformly. However, the perceptron, which does not require continuity, can build these rectangles exactly. So far in this constructive approach, only one hidden layer is needed. This is because we have only considered the map f : [a, b] → R.
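The two-neuron box of Equation 4.18 can be sketched as follows; the interval [2, 4], the height, and the slope k are illustrative values standing in for the "large weights" of the construction:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A "box" on [2, 4] from two sigmoid hidden neurons (Eq. 4.18): a large
# first-layer weight k makes each sigmoid a near-step, and opposite
# second-layer weights (+1 and -1) subtract them, leaving a plateau.
k = 100.0

def box(x, lo=2.0, hi=4.0, height=1.0):
    return height * (sigmoid(k * (x - lo)) - sigmoid(k * (x - hi)))

inside = box(3.0)    # middle of the box: close to the full height
outside = box(5.0)   # outside the box: close to zero
```
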
If we try to replicate what we did previously (creating step functions and adding them together to create boxes) to approximate a function f : [a, b]^N → R with N ≥ 2, then we won't get exactly what we want. Therefore we need an additional hidden layer to clean up the mess.

Figure 4.7: An illustration of how a single hidden layer can be used to approximate a continuous function f : D → R.

First note that to approximate a map in higher dimensions, we will need towers (hyper-rectangles) instead of boxes (rectangles). To see why things don't add up as nicely in the higher dimensional case as they do in the one-dimensional case, let us consider a continuous map from R^2 to R that we want to approximate. To do this, we need to be able to build a three dimensional tower. Note that we have two input neurons, x_1 and x_2. Previously, in the one-dimensional case, we built a box by adding two step functions together. Using this idea, we can create four step functions (in three dimensions), two in the x_1 direction and two in the x_2 direction. To create these step functions, we set the weights in the respective direction to be large and the rest equal to zero. The biases are chosen in such a way that they create the desired shifts in the step functions. Now, if we put these four step functions in a linear combination such that everything goes to zero except the region within the shifts, then we get something that almost resembles a three-dimensional tower. See Figure 4.8 for a visual representation of this. In the univariate case, this would give us exactly the box we were looking for. Here, however, there is some sloppiness in the result that needs to be cleaned up.

Figure 4.8: An illustration of how step functions can be built in the higher dimensional case and added together to form something almost like a tower. The gray arrows represent weights that are equal to zero.
In Figure 4.8, we would pick the first-layer weights w_{11}^1, w_{12}^1, w_{23}^1, w_{24}^1 to be large, whereas the rest of the weights (not written down explicitly, to simplify the picture) equal zero. The second-layer weights w_{11}^2, w_{21}^2, w_{31}^2, w_{41}^2 have equal magnitude, and they represent the height of the tower. They are chosen in such a way that w_{11}^2 = w_{31}^2 = −w_{21}^2 = −w_{41}^2. Note that Figure 4.8 shows almost a tower, but not exactly, since some unwanted regions are still left after the linear combination of the step functions. However, it is quite easy to get rid of these unwanted regions by simply applying a threshold function. Thus another layer of neurons (with an activation function) is needed for this clean-up task, since a neuron's activation function is basically a threshold function. See Figure 4.9.

Figure 4.9: An illustration of how a 3-dimensional tower might be built using a neural network.

Thus each tower (in three dimensions) can be built from 4 neurons in the first hidden layer and 1 neuron in the second hidden layer. The output layer consists of one output neuron that sums up all the towers from the second hidden layer. This construction extends to higher dimensions (more than two variables) as well. That is, to construct a single tower in N dimensions, each of the N input neurons is connected to two hidden neurons in the first layer, creating 2N step functions, which are then added together. The second hidden layer has one neuron acting as a threshold to clean up the sloppiness caused by the sum of the step functions.

Figure 4.10: An illustration of how an (N + 1)-dimensional tower can be built using a neural network.

Therefore a neural network with two hidden layers is sufficient to approximate any continuous function f : D → R in the L^p sense, where D is a compact set in R^N. This argument required no function compositions whatsoever, and it is as old as Jordan content or Lebesgue measure.
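The tower construction above can be sketched numerically; the rectangle, the slope k of the near-step sigmoids, and the threshold at 1.5 (halfway between the "ridge" level 1 and the "tower" level 2) are illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two-hidden-layer tower on R^2: four near-step sigmoids (two per input
# direction) sum to ~2 inside the target rectangle, ~1 on the cross-shaped
# ridges, and ~0 elsewhere; a second hidden neuron thresholds the sum at
# 1.5 to cut away the ridges, leaving only the tower.
k, K = 100.0, 100.0

def tower(x1, x2, a=0.0, b=1.0, c=0.0, d=1.0):
    s = (sigmoid(k * (x1 - a)) - sigmoid(k * (x1 - b))
         + sigmoid(k * (x2 - c)) - sigmoid(k * (x2 - d)))
    return sigmoid(K * (s - 1.5))   # second hidden layer: threshold cleanup

center = tower(0.5, 0.5)   # inside the rectangle: sum ~2, passes threshold
ridge = tower(0.5, 2.0)    # on a ridge: only the x1 pair fires, sum ~1
far = tower(3.0, 3.0)      # far away: sum ~0
```
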
4.4.2 Approximation with Modern Analysis

Previously, we argued constructively that a neural network with 2 hidden layers is sufficient to approximate a continuous function. Thus we know that a neural network with two hidden layers and a sufficient number of neurons can solve a huge class of problems: specifically, all problems that can be posed as a continuous map. In this second argument, we will take George Cybenko's approach and use modern analysis to prove that a neural network with one hidden layer is sufficient to approximate a continuous function on a compact set. Although this proof is elegant, it is not constructive; hence we lose all intuition about how the approximation is done.

Theorem 4.4.2 (Cybenko's Approximation Theorem [48]). Let σ be any continuous discriminatory function. Then finite sums of the form

G(x) = ∑_{j=1}^{K} α_j σ(w_j^T x + b_j)   (4.19)

are dense in C(I_n), where I_n denotes the n-dimensional unit cube [0, 1]^n, w_j, x ∈ R^n, and α_j, b_j ∈ R.

We say that σ is discriminatory if, for a measure µ ∈ M(I_n), the space of finite regular Borel measures on I_n,

∫_{I_n} σ(w^T x + b) dµ(x) = 0

for all w ∈ R^n and b ∈ R implies that µ = 0.

Proof. Let S := { ∑_{j=1}^{K} α_j σ(w_j^T x + b_j) : w_j ∈ R^n, b_j ∈ R }. First note that S is a linear subspace of C(I_n), the space of all continuous functions on I_n, since σ is taken to be a continuous discriminatory function. Because we are working in infinite dimensions, a subspace need not be closed, so let us take the closure of S and call it Q; that is, Q = S̄. If Q = C(I_n) then S is dense in C(I_n) and we are done. So suppose Q is not all of C(I_n); hence Q is a proper closed subspace of C(I_n). By the Hahn–Banach theorem, there is a bounded linear functional L on C(I_n) with the property that

L ≠ 0   and   L(Q) = L(S) = 0   (4.20)

Now, by the Riesz–Markov–Kakutani representation theorem, L can be written in the form

L(h) = ∫_{I_n} h(x) dµ(x)   (4.21)

for some µ ∈ M(I_n) and for all h ∈ C(I_n).
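Although Cybenko's proof is non-constructive, the density of sums of the form 4.19 is easy to observe numerically: fix the inner parameters (w_j, b_j), then solve for the outer coefficients α_j by least squares. Everything below (the target function, the grid of biases, the slope, the number K of hidden neurons) is an illustrative choice, not part of the theorem:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

f = lambda x: np.sin(2 * np.pi * x)   # a continuous target on [0, 1]

# K shifted sigmoids sigma(w_j x + b_j) with centers spread over [0, 1];
# only the outer coefficients alpha_j of Eq. (4.19) are fitted.
K = 200
centers = np.linspace(-0.1, 1.1, K)
w = np.full(K, 50.0)                  # moderately steep inner slope
b = -w * centers                      # sigma(w_j x + b_j) steps up at centers[j]

x = np.linspace(0.0, 1.0, 500)
Phi = sigmoid(np.outer(x, w) + b)     # Phi[i, j] = sigma(w_j x_i + b_j)
alpha, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)

G = Phi @ alpha                       # G(x) of Eq. (4.19) on the grid
err = np.max(np.abs(G - f(x)))
```
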
But since σ(w^T x + b) is in Q for all w and b, we must have that

∫_{I_n} σ(w^T x + b) dµ(x) = 0   (4.22)

for all w and b. However, σ is a discriminatory function, so this implies µ = 0, which is a contradiction. Thus we must have Q = C(I_n), so S must be dense in C(I_n).

The proof is basically just an application of modern analysis; hence it is very short and elegant. It shows that sums of the form in Equation 4.19 are dense in C(I_n), provided that σ is continuous and discriminatory. It turns out that all sigmoidal functions are continuous discriminatory functions; see Lemma 1 in [48]. Recall that sigmoidal functions are continuous, differentiable functions with the property

σ(x) → 1 as x → ∞,   σ(x) → 0 as x → −∞

In particular, the activation function σ(x) = 1/(1 + e^{−x}) is a continuous discriminatory function. Hence, neural networks made up of sigmoid neurons are universal. It should be pointed out that this is only an approximation result: it shows that for any given f ∈ C(I_n) and ε > 0, there is a sum G(x) of the form in Equation 4.19 for which |G(x) − f(x)| < ε for all x ∈ I_n. One might wonder if it is possible to obtain an equality here instead of just an approximation. The answer is yes! This stronger result is known as the Kolmogorov Superposition Theorem [49], which is closely related to Hilbert's 13th problem, concerning the study of solutions of algebraic equations.

Theorem 4.4.3 (Kolmogorov's Superposition Theorem). Let f : I_n := [0, 1]^n → R be an arbitrary multivariate continuous function. Then it has the representation

f(x_1, ..., x_n) = ∑_{q=1}^{2n+1} g_q( ∑_{p=1}^{n} ψ_{q,p}(x_p) )   (4.23)

with continuous one-dimensional outer functions g_q and inner functions ψ_{q,p}, where the functions g_q and ψ_{q,p} are defined on the real line. Furthermore, the inner functions ψ_{q,p} are independent of f.

Many improvements have been made to the theorem since it was first published. For instance, George Lorentz showed in 1962 that only one outer function is needed [51]; that is, the g_q can be replaced by a single function g.
Then in 1965 David Sprecher showed that the inner functions ψ_{q,p} can be replaced by a single inner function ψ with appropriate shifts in its argument [52]. In 1967 Buma Fridman showed that the inner functions ψ_{q,p} in Theorem 4.4.3 can be chosen to be in the class Lip(1) [50].

CHAPTER 5
QUANTUM NEURAL NETWORK

In this chapter, we will discuss the design and fundamental structure of our Quantum Neural Network (QNN), originally proposed and developed by Elizabeth Behrman and James Steck of Wichita State University. The first section will outline the fundamental structure of the QNN. Then we will discuss the universal property of the QNN and how an arbitrary classical problem can be implemented on it. Furthermore, we will provide an alternative approach using differential geometry, and discuss why this approach might not be optimal for hardware implementation.

5.1 Fundamental Structure

The fundamental equation of quantum mechanics, and therefore of quantum computing, is the Schrodinger equation, which can be written in density operator form as

dρ/dt = (1/iħ) [H(t), ρ(t)]

The solution of the Schrodinger equation was given in Equation 2.22. The quantum system must follow this equation throughout the computation. Therefore, we use this equation as the basis of our Quantum Neural Network model. The question is: how should we incorporate the Hamiltonian? In general, an N-qubit system Hamiltonian H(t) is a 2^N × 2^N Hermitian matrix. Since the set

G = {σ_I, σ_x, σ_y, σ_z}   (5.1)

where σ_I, σ_x, σ_y, σ_z are the Pauli matrices described in Equations 3.8 and 3.17, forms an orthogonal basis for C^{2×2}, the set

G_n = {σ_I, σ_x, σ_y, σ_z}^{⊗n}   (5.2)

forms an orthogonal basis for C^{2^n × 2^n}. Thus, any Hermitian Hamiltonian H can be written in terms of Pauli matrix Kronecker products; for instance, the most general form of a 2-qubit Hamiltonian is

H(t) = ∑_{i,j=1}^{4} h_{ij}(t) σ_i ⊗ σ_j   (5.3)

where σ_1 = σ_I, σ_2 = σ_x, σ_3 = σ_y, σ_4 = σ_z and h_{ij}(t) ∈ R.
However, when doing quantum computation, we work with particular hardware designs and limitations, which leads to a more restricted form of the Hamiltonian. In our work, we assume that our Hamiltonian has the following form:

H(t) = ∑_{α=1}^{N} K_α(t) σ_{x_α} + ∑_{α=1}^{N} ε_α(t) σ_{z_α} + ∑_{α≠β} ζ_{αβ}(t) σ_{z_α} σ_{z_β}   (5.4)

where σ_{x_α} = σ_I ⊗ σ_I ⊗ ··· ⊗ σ_x ⊗ ··· ⊗ σ_I with σ_x located in the αth slot, and similarly for σ_{z_α}; the product σ_{z_α} σ_{z_β} is ordinary matrix multiplication between σ_{z_α} and σ_{z_β}. Here K_α represents the tunneling amplitude of qubit α; ε_α, the potential energy or bias of qubit α; and ζ_{αβ}, the coupling parameter between qubits α and β. This is almost identical to the Hamiltonian of D-Wave Systems Inc.'s quantum computing machines [33].

Recall from the previous chapter that a classical neural network is nothing more than a mapping from an input state to an output state; see Figure 5.1. That is, we are given a set of training pairs, consisting of input states and the respective desired outputs. The correspondence between the inputs and outputs is a function, and this function defines the mapping. As long as this is a measurable function, the map can be approximated to arbitrary precision by a neural network. From this point of view, we can develop a similar analogy using dynamic evolution under the Schrodinger equation, taking inputs as initial quantum states and outputs as a measurement or measurements on those initial quantum states after some time evolution. Hence, this is a mapping f : C^{2^n × 2^n} → R. If you think of a quantum circuit implementing an arbitrary quantum algorithm, it is nothing but a map from the initial state of the qubits, usually taken as |ψ(t_0)⟩ = |000···0⟩, to some final state |ψ(t_f)⟩ before a quantum measurement is made. See Figure 5.2 for a visualization of a 3-qubit quantum circuit as a map f : C^{8×8} → R.
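The restricted Hamiltonian of Equation 5.4 is straightforward to assemble from Kronecker products of Pauli matrices; the parameter values K_α, ε_α, ζ_{αβ} below are arbitrary placeholders:

```python
import numpy as np
from functools import reduce

# Pauli matrices (Eq. 5.1) as 2x2 complex arrays.
I = np.eye(2, dtype=complex)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

def embed(op, alpha, N):
    """Kronecker product I x ... x op x ... x I with op in slot alpha."""
    mats = [I] * N
    mats[alpha] = op
    return reduce(np.kron, mats)

def hamiltonian(K, eps, zeta):
    """Assemble H of Eq. (5.4) for N = len(K) qubits."""
    N = len(K)
    H = np.zeros((2**N, 2**N), dtype=complex)
    for a in range(N):
        H += K[a] * embed(sx, a, N) + eps[a] * embed(sz, a, N)
        for b in range(N):
            if a != b:
                H += zeta[a][b] * embed(sz, a, N) @ embed(sz, b, N)
    return H

# Arbitrary placeholder parameters for a 2-qubit example.
H = hamiltonian(K=[1.0, 0.5], eps=[0.2, -0.3],
                zeta=[[0.0, 0.7], [0.7, 0.0]])
```
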
Figure 5.1: A classical neural network as a map from R^N to R.

Figure 5.2: A quantum circuit representing a quantum algorithm as a map.

Therefore, we can use machine learning techniques to find this map! How would we define the weight parameters in a quantum setting? We can use the Hamiltonian! That is, the weight parameters controlling this mapping are the parameters of the Hamiltonian: K_α(t), ε_α(t), and ζ_{αβ}(t). This allows us to define a neural network on a quantum computer, hence the name "Quantum Neural Network". See Figure 5.3 for a visualization of the QNN structure.

Figure 5.3: Visualization of the QNN structure.

Although the Hamiltonian is time dependent, we can discretize it by gridding the time domain into chunks, similar to standard numerical analysis techniques, and treat each time grid cell as having a constant Hamiltonian. The state evolution in each time grid cell can then be written as in Equation 2.23.

5.2 Learning Algorithm

We concretely defined a Quantum Neural Network in the previous section. In this section, we show how to train it. By training, we mean finding the right values of the weights (the parameters K_α(t), ε_α(t), ζ_{αβ}(t) of the Hamiltonian) so that the network correctly approximates the mapping for a particular problem. To do this, we can use the calculus of variations, as pointed out by Yann LeCun [78], who recast backpropagation in terms of optimizing a Lagrangian L. First, we define the Lagrangian to be minimized as

L = (1/2)(T − y)^2 + ∫_0^{t_f} λ(t) ( dρ/dt + (i/ħ)[H, ρ] ) γ(t) dt   (5.5)

where y is the output value of the Quantum Neural Network. The output y corresponds to the problem of interest. Generally, it can be written mathematically as

y = Tr(Mρ(t_f))   (5.6)

with M the quantum measurement operator, usually taken as σ_z ⊗ ··· ⊗ σ_z, and ρ(t_f) the quantum state right before the measurement is performed. The value T in Equation 5.5 is the "target value" corresponding to that input state.
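The piecewise-constant discretization described above can be sketched as follows: on each time grid cell the Hamiltonian is frozen, the state is advanced by ρ → UρU† with U = e^{−iHΔt} (taking ħ = 1), and the output y = Tr(Mρ(t_f)) of Equation 5.6 is read out at the end. The single-qubit Hamiltonian, step size, and step count are illustrative choices:

```python
import numpy as np

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

rho = np.array([[1, 0], [0, 0]], dtype=complex)   # initial state |0><0|
dt, steps = 0.01, 100

for _ in range(steps):
    H = 0.5 * sx + 0.3 * sz             # constant here; in general H = H(t_k)
    # Matrix exponential of -i H dt via eigendecomposition (H is Hermitian).
    evals, evecs = np.linalg.eigh(H)
    U = evecs @ np.diag(np.exp(-1j * evals * dt)) @ evecs.conj().T
    rho = U @ rho @ U.conj().T          # one grid cell of Schrodinger evolution

y = np.real(np.trace(sz @ rho))         # measurement output, Eq. (5.6)
```

Unitarity of each U guarantees that the trace and Hermiticity of ρ are preserved through the evolution.
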
That is, it is the desired theoretical value we want that particular input state to be mapped to. The integral term in Equation 5.5 represents the dynamics of the network, how it evolves in time; for us it constrains the system so that it always satisfies the Schrodinger equation. Last but not least, the terms λ(t) and γ(t) are the Lagrange multipliers, which can be written as row and column vectors, respectively. This is the "error delta" in neural network terminology: it gives the direction when backpropagating back through time. To calculate the vector elements of the Lagrange multipliers λ and γ to use for the weight updates later, we take the first variation of L with respect to ρ:

δL_δρ = −(T − Tr(Mρ(t_f))) Mδρ + ∫_0^{t_f} λ ( ∂(δρ)/∂t + (i/ħ)[H, δρ] ) γ dt   (5.7)

= −(T − Tr(Mρ(t_f))) Mδρ + ∫_0^{t_f} λ (δρ̇) γ dt + (i/ħ) ∫_0^{t_f} λ H (δρ) γ dt − (i/ħ) ∫_0^{t_f} λ (δρ) H γ dt   (5.8)

Note that the overdot represents the time derivative, and the commutator is defined as [H, δρ] = H(δρ) − (δρ)H. Now integration by parts gives us:

∫_0^{t_f} λ (δρ̇) γ dt = λ (δρ) γ |_0^{t_f} − ∫_0^{t_f} (d/dt)(λ γ) (δρ) dt   (5.9)

= λ (δρ) γ |_0^{t_f} − ∫_0^{t_f} λ̇ (δρ) γ dt − ∫_0^{t_f} λ (δρ) γ̇ dt   (5.10)

Then, setting δL_δρ = 0, we get:

0 = −(T − Tr(Mρ(t_f))) Mδρ + λ (δρ) γ |_0^{t_f}   (5.11)

− ∫_0^{t_f} λ̇ (δρ) γ dt − ∫_0^{t_f} λ (δρ) γ̇ dt + (i/ħ) ∫_0^{t_f} λ H (δρ) γ dt − (i/ħ) ∫_0^{t_f} λ (δρ) H γ dt   (5.12)

which gives us the condition at t_f: