
Reinforcement Learning for Memory

Thomas Fösel, Marquardt group, Institute for the Science of Light, Erlangen, Germany

May 8th, 2019


[T. Fösel, P. Tighineanu, T. Weiss & F. Marquardt, PRX 8, 031084 (2018)]

AlphaGo

Oct 2015 Fan Hui 5:0

Mar 2016 Lee Sedol 4:1

May 2017 Ke Jie 3:0

What are the “games” that we play in physics?

unexperienced (random)  →  reinforcement learning (no human guidance!)  →  smart

e.g.: manipulating quantum systems

Quantum error correction: the RL approach

long-term vision: unified approach to quantum error correction

[Diagram: "RL for Quantum Feedback" at the center, linked to stabilizer codes, decoherence-free subspaces, dynamical decoupling, hybrid DD-DFS, phase estimation, and the temporal/spatial correlation of the noise]

Reinforcement learning example: Go

[Silver et al., Nature 529, 484-489 (2016)]
[Silver et al., Nature 550, 354-359 (2017)]

Supervised learning vs. reinforcement learning

supervised learning: a teacher provides the correct answers to the student
reinforcement learning: the agent develops its own strategies (no teacher)

(Model-free) reinforcement learning: basic setup

agent (neural network) ⇄ environment: the agent observes the state and chooses an action; the environment (illustrated here by a Go board, with the rules of the game) returns rewards, which drive the training
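To make this agent-environment loop concrete, here is a minimal, self-contained policy-gradient (REINFORCE) sketch. The two-state toy environment, the linear-softmax "network", and all hyperparameters are illustrative placeholders, not the architecture or algorithm details of the PRX paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 2, 2                # toy sizes (illustrative only)
theta = np.zeros((N_STATES, N_ACTIONS))   # policy parameters (stand-in for a neural network)

def policy(state):
    """Softmax policy over actions for a given (discrete) state."""
    logits = theta[state]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def env_step(state, action):
    """Toy environment: reward 1 if the action matches the state, else 0."""
    reward = 1.0 if action == state else 0.0
    next_state = rng.integers(N_STATES)
    return next_state, reward

ALPHA, T = 0.1, 20                        # learning rate, episode length
for episode in range(500):
    state, trajectory = rng.integers(N_STATES), []
    for t in range(T):
        p = policy(state)
        action = rng.choice(N_ACTIONS, p=p)
        next_state, reward = env_step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
    # REINFORCE: reinforce actions in proportion to the return that followed them
    rewards = np.array([r for _, _, r in trajectory])
    returns = np.cumsum(rewards[::-1])[::-1]      # sum of future rewards at each step
    for (s, a, _), G in zip(trajectory, returns):
        grad_log = -policy(s)
        grad_log[a] += 1.0                        # gradient of log pi(a|s) for a softmax policy
        theta[s] += ALPHA * G * grad_log
```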

Prior applications of reinforcement learning in physics

quantum phase estimation [Hentschel & Sanders, PRL 104, 063603 (2010)]
quantum control [Bukov et al., PRX 8, 031086 (2018)]
design quantum optics experiments [Melnikov et al., PNAS, 201714936 (2017)]

we: address a new problem (build a complete QEC protocol) and introduce model-free RL with neural networks in physics

Overview: machine learning for quantum error correction

QEC decoding problem (surface/color code):
[Torlai & Melko, PRL 119, 030501 (2017)], [Baireuther et al., Quantum 2, 48 (2018)], [Krastanov & Jiang, Sci. Rep. 7, 11003 (2017)], [Sweke et al., arXiv:1810.07207], [Baireuther et al., NJP 21, 013003 (2019)]

QEC protocol optimization:
bang-bang protocols [Bukov et al., PRX 8, 031086 (2018)], QVECTOR [Johnson et al., arXiv:1711.02249], [August & Hernández-Lobato, arXiv:1802.04063], (model-free) RL approach [Fösel et al., PRX 8, 031084 (2018)]

related problems:
quantum phase estimation [Hentschel & Sanders, PRL 104, 063603 (2010)], [Hentschel & Sanders, PRL 107, 233601 (2011)], [Palittapongarnpim et al., Neurocomputing 268, 116-126 (2017)]
design experiments [Melnikov et al., PNAS, 201714936 (2017)]
quantum control [Niu et al., arXiv:1803.01857], [Porotti et al., arXiv:1901.06603]
search QEC codes [Nautrup et al., arXiv:1812.08451]

Reinforcement learning setup for quantum memory

goal: preserve the quantum information
the agent (a neural network) chooses actions that are applied to the (RL-)environment: the qubits, which are continuously subject to noise
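As a sketch of what such an RL environment can look like in code, the toy class below (a hypothetical `QuantumMemoryEnv`, not from the paper) simulates a single qubit with bit-flip noise between time steps. The gate set, noise model, and the fidelity-based placeholder reward are assumptions chosen for brevity; the actual work uses several qubits, measurements among the actions, and a reward built from the recoverable quantum information discussed below.

```python
import numpy as np

# Pauli matrices and a small gate set (illustrative; the paper's action set is richer)
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
GATES = [I2, X, Z, H]

class QuantumMemoryEnv:
    """Toy single-qubit 'memory' with bit-flip noise between time steps.

    The RL state handed to the (state-aware) agent is the density matrix rho;
    the discrete action selects which gate to apply before the next noise step.
    """
    def __init__(self, p_flip=0.01, T=50):
        self.p_flip, self.T = p_flip, T

    def reset(self, psi0=np.array([1.0, 0.0], dtype=complex)):
        self.rho0 = np.outer(psi0, psi0.conj())
        self.rho = self.rho0.copy()
        self.t = 0
        return self.rho

    def step(self, action):
        U = GATES[action]
        self.rho = U @ self.rho @ U.conj().T
        # bit-flip channel: rho -> (1-p) rho + p X rho X
        self.rho = (1 - self.p_flip) * self.rho + self.p_flip * (X @ self.rho @ X)
        self.t += 1
        # placeholder reward: fidelity with the stored state (the paper instead
        # derives the reward from the recoverable quantum information)
        reward = float(np.real(np.trace(self.rho0 @ self.rho)))
        done = self.t >= self.T
        return self.rho, reward, done
```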

Training progress

[Plot: recoverable quantum information R_Q(T) vs. training epoch (0 to 22500); benchmark levels shown for "idle", "encoding w/o error detections", and "non-adaptive error detections"]

Flexibility

same algorithm works for very different scenarios! tailored to the hardware resources:
different connectivities, correlated noise, measurement errors, number of ancillas

[Figures: improvement vs. number of ancillas; adapted measurement sequences (MX, MY)]

Naive approach

input: raw measurement results
reward: final overlap ⟨φ|ψ⟩

does not work!!! (at least with present-day RL techniques)

PRX 8, 031084 T. F¨osel,MPL Erlangen 14/19 reinforcement supervised learning learning application

? ! ! ? ?

Two-stage learning

ρˆ

state-aware event-aware network network

PRX 8, 031084 T. F¨osel,MPL Erlangen 15/19 supervised learning

! !

Two-stage learning

reinforcement learning application ρˆ ? ? ?

state-aware event-aware network network

PRX 8, 031084 T. F¨osel,MPL Erlangen 15/19 Two-stage learning

reinforcement supervised learning learning application ρˆ

? ! ! ? ?

state-aware event-aware network network
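A compact sketch of the second (supervised) stage, under the assumption that pairs of measurement histories and the state-aware network's action probabilities have already been logged: fit a small model by cross-entropy to imitate them. The synthetic random data and the linear-softmax "event-aware network" below are placeholders for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stage 2 in a nutshell: imitate the state-aware network's action choices from the
# measurement record alone. The "dataset" here is synthetic; in practice it would be
# (measurement history, action probabilities) pairs logged from the trained state-aware
# network running in simulation.
N_SAMPLES, HIST_LEN, N_ACTIONS = 1000, 8, 4
msmt_histories = rng.integers(0, 2, size=(N_SAMPLES, HIST_LEN)).astype(float)
teacher_probs = rng.dirichlet(np.ones(N_ACTIONS), size=N_SAMPLES)   # stand-in targets

W = np.zeros((HIST_LEN, N_ACTIONS))   # "event-aware network": here just a linear-softmax model

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

LR = 0.5
for step in range(200):
    student_probs = softmax(msmt_histories @ W)
    # gradient of the cross-entropy between teacher and student action distributions
    grad = msmt_histories.T @ (student_probs - teacher_probs) / N_SAMPLES
    W -= LR * grad
```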

What about the reward?

problem with a reward ∼ |⟨φ|ψ⟩|: the sequence must be completely correct, including encoding, error detection, error correction, and decoding; in particular: don't collapse the superposition state with a measurement!

probability to randomly find a good sequence: ≲ 10⁻²⁰
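For intuition only (the counts here are assumed round numbers, not taken from the talk or the paper): if a valid protocol requires on the order of 20 consecutive correct choices with roughly 10 admissible actions at each step, blind random search succeeds with probability

\[
p \approx \left(\tfrac{1}{10}\right)^{20} = 10^{-20},
\]

which is the order of magnitude quoted above.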

Final vs. immediate reward scheme

final reward: a single reward at the end of the sequence (time T)
immediate reward: a reward at every time step t

for QEC, the immediate scheme requires a quantity that tells "How much quantum information is still left?"
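The bookkeeping difference can be captured in a few lines. Assuming we track some per-step measure Q_t of "how much quantum information is still left" (e.g. the recoverable quantum information introduced on the next slide), a final-reward scheme hands out a single number at the end, while an immediate scheme can reward the per-step change. The paper's exact reward definition differs in detail, so treat this purely as an illustration.

```python
import numpy as np

def final_reward(Q):
    """Final-reward scheme: one reward at the last step, everything else is zero."""
    r = np.zeros_like(Q)
    r[-1] = Q[-1]
    return r

def immediate_reward(Q, Q_initial=1.0):
    """Immediate-reward scheme (one plausible variant): reward the per-step change
    of the protection measure, so the rewards sum to Q[-1] - Q_initial."""
    return np.diff(np.concatenate(([Q_initial], Q)))

Q_t = np.array([0.999, 0.998, 0.990, 0.989, 0.985])   # made-up trajectory of "info left"
print(final_reward(Q_t))       # [0, 0, 0, 0, 0.985]
print(immediate_reward(Q_t))   # per-step decrements, summing to 0.985 - 1.0
```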

Recoverable quantum information

question: "How much quantum information is still left?"

equivalent (under a few conditions): How well can antipodal logical states be distinguished?

R_Q = \frac{1}{2} \min_{\vec n} \left\lVert \hat\rho_{\vec n} - \hat\rho_{-\vec n} \right\rVert_1

(the recoverable quantum information)
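A small numerical sketch of this quantity for a single physical qubit: ρ̂_{±n} are the (noisy) states obtained from the two antipodal logical states along ±n on the Bloch sphere, so we send both through a noise channel, take half the trace distance, and minimize over sampled directions. The dephasing channel, the sampling density, and the single-qubit setting are illustrative choices, not the multi-qubit protocols of the paper.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def bloch_state(n):
    """Density matrix of the pure qubit state pointing along Bloch vector n."""
    return 0.5 * (I2 + n[0] * X + n[1] * Y + n[2] * Z)

def dephasing(rho, p=0.1):
    """Toy noise channel standing in for the environment: phase flip with probability p."""
    return (1 - p) * rho + p * (Z @ rho @ Z)

def trace_norm(A):
    """Trace norm of a Hermitian matrix = sum of absolute eigenvalues."""
    return np.abs(np.linalg.eigvalsh(A)).sum()

def recoverable_quantum_info(channel, n_samples=2000, seed=0):
    """R_Q = (1/2) min_n || channel(rho_n) - channel(rho_-n) ||_1,
    minimized over Bloch directions n sampled uniformly on the sphere."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_samples):
        n = rng.normal(size=3)
        n /= np.linalg.norm(n)
        d = channel(bloch_state(n)) - channel(bloch_state(-n))
        best = min(best, 0.5 * trace_norm(d))
    return best

print(recoverable_quantum_info(dephasing))   # ≈ 1 - 2p = 0.8 for p = 0.1
```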

Conclusion and Outlook

last year: proof of concept
next: more difficult problems (higher qubit numbers, ...), requiring better learning algorithms and a more efficient physics engine

[Recap figures from the PRX 2018 paper: training progress, improvement vs. number of ancillas, adapted measurement sequences]

two-stage learning, recoverable quantum information
T. Fösel, P. Tighineanu, T. Weiss, F. Marquardt [PRX 8, 031084 (2018)]