
Machine learning for long-distance quantum communication

Julius Wallnöfer,1,2 Alexey A. Melnikov,2,3,4 Wolfgang Dür,2 and Hans J. Briegel2,5

1 Department of Physics, Freie Universität Berlin, Arnimallee 14, 14195 Berlin, Germany
2 Institute for Theoretical Physics, University of Innsbruck, Technikerstraße 21a, 6020 Innsbruck, Austria
3 Department of Physics, University of Basel, Klingelbergstrasse 82, 4056 Basel, Switzerland
4 Valiev Institute of Physics and Technology, Russian Academy of Sciences, Nakhimovskii prospekt 36/1, 117218 Moscow, Russia
5 Department of Philosophy, University of Konstanz, Fach 17, 78457 Konstanz, Germany

(Dated: September 18, 2020)

Machine learning can help us in solving problems in the context of big data analysis and classification, as well as in playing complex games such as Go. But can it also be used to find novel protocols and algorithms for applications such as large-scale quantum communication? Here we show that machine learning can be used to identify central quantum protocols, including teleportation, entanglement purification and the quantum repeater. These schemes are of importance in long-distance quantum communication, and their discovery has shaped the field of quantum information processing. However, the usefulness of learning agents goes beyond the mere reproduction of known protocols; the same approach allows one to find improved solutions to long-distance communication problems, in particular when dealing with asymmetric situations where channel noise and segment distance are non-uniform. Our findings are based on the use of projective simulation, a model of a learning agent that combines reinforcement learning and decision making in a physically motivated framework. The learning agent is provided with a universal gate set, and the desired task is specified via a reward scheme. From a technical perspective, the learning agent has to deal with stochastic environments and reactions. We utilize an idea reminiscent of hierarchical skill acquisition, where solutions to sub-problems are learned and re-used in the overall scheme. This is of particular importance in the development of long-distance communication schemes, and opens the way for using machine learning in the design and implementation of quantum networks.

I. INTRODUCTION

Humans have invented technologies with transforming impact on society. One such example is the internet, which significantly influences our everyday life. The quantum internet [1, 2] could become the next generation of such a world-spanning network, and promises applications that go beyond its classical counterpart. This includes e.g. distributed quantum computation, secure communication or distributed quantum sensing. Quantum technologies are now at the brink of being commercially used, and the quantum internet is conceived of as one of the key applications in this context. Such quantum technologies are based on the invention of a number of central protocols and schemes, for instance quantum key distribution [3–7] and teleportation [8]. Additional schemes that solve fundamental problems such as the accumulation of channel noise and decoherence have been discovered and have also shaped future research. This includes e.g. entanglement purification [9–11] and the quantum repeater [12] that allow for scalable long-distance quantum communication. These schemes are considered key results whose discovery represents breakthroughs in the field of quantum information processing. But to what extent are human minds required to find such schemes?

Here we show that many of these central quantum protocols can in fact be found using machine learning by phrasing the problem in a reinforcement learning (RL) framework [13–15], the framework at the forefront of modern artificial intelligence [16–18]. By using projective simulation (PS) [19], a physically motivated framework for RL, we show that teleportation, entanglement swapping, and entanglement purification are found by a PS agent. We equip the agent with a universal gate set, and specify the desired task via a reward scheme. With certain specifications of the structure of the action and percept spaces, RL then leads to the re-discovery of the desired protocols. Based on these elementary schemes, we then show that such an artificial agent can also learn more complex tasks and discover long-distance communication protocols, the so-called quantum repeaters [12]. The usage of elementary protocols learned previously is of central importance in this case. We also equip the agent with the possibility to call sub-agents, thereby allowing for a design of a hierarchical scheme [20, 21] that offers the flexibility to deal with various environmental situations. The proper combination of optimized block actions discovered by the sub-agents is the central element at this learning stage, which allows the agent to find a scalable, efficient scheme for long-distance communication.

We are aware that we make use of existing knowledge in the specific design of the challenges. Rediscovering existing protocols under such guidance is naturally very different from the original achievement (by humans) of conceiving of and proposing them in the first place, an essential part of which includes the identification of relevant concepts and resources. However, the agent does not only re-discover known protocols and schemes, but can go beyond known solutions. In particular, we find that in asymmetric situations, where channel noise and decoherence are non-uniform, the schemes found by the agent outperform human-designed schemes that are based on known solutions for symmetric cases.

From a technical perspective, the agent is situated in stochastic environments [13, 14, 22], as measurements with random outcomes are central elements of some of the schemes considered. This requires learning proper reactions to all measurement outcomes, e.g., the required correction operations in a teleportation protocol depending on outcomes of (Bell) measurements. Additional elements are abort operations, as not all measurement outcomes lead to a situation where the resulting state can be further used. This happens for instance in entanglement purification, where the process needs to be restarted in some cases as the resulting state is no longer entangled. The overall scheme is thus probabilistic. These are new challenges that have not been treated in projective simulation before, but the PS agent can in fact deal with such challenges.

Another interesting element is the usage of block actions that have been learned previously. This is a mechanism similar to hierarchical skill learning in robotics [20, 21], and to clip composition in PS [19, 23, 24], where previously learned tasks are used to solve more complex challenges and problems. Here we use this concept for long-distance communication schemes. The initial situation is a quantum channel that has been subdivided by multiple repeater stations that share entangled pairs with their neighboring stations. Previously learned protocols, namely entanglement swapping and entanglement purification, are used as new primitives. Additionally, the agent is allowed to employ sub-agents that operate in the same way but deal with a problem at a smaller scale, i.e. they find optimized block actions for shorter distances that the main agent can employ at the larger scale. This allows the agent to deal with big systems, and re-discover the quantum repeater with its favorable scaling. The ability to delegate is of special importance in asymmetric situations, as such block actions need to be learned separately for different initial states of the environment – in our case the fidelity of the elementary pairs might vary drastically, either because they correspond to segments with different channel noise, or because they are of different length. In this case, the agent outperforms human-designed protocols that are tailored to symmetric situations.

The paper is organized as follows. In Sec. II we provide background information on reinforcement learning and projective simulation, and discuss our approach on how to apply these techniques to problems in quantum communication. In Sec. III, we show that the PS agent can find solutions to elementary quantum protocols, thereby re-discovering teleportation, entanglement swapping, entanglement purification and the elementary repeater cycle. In Sec. IV we present results for the scaling repeater in a symmetric and asymmetric setting, and we summarize and conclude in Sec. V.

II. PROJECTIVE SIMULATION FOR QUANTUM COMMUNICATION TASKS

In this paper the process of designing quantum communication protocols is viewed as a reinforcement learning (RL) problem. RL, and more generally machine learning (ML), is becoming increasingly useful in the automation of problem-solving in quantum information science [25–27]. First, ML has been shown to be capable of designing new quantum experiments [24, 28–30] and new quantum algorithms [31, 32]. Next, by bridging knowledge about quantum algorithms with actual near-term experimental capabilities, ML can be used to identify problems in which a quantum advantage over a classical approach can be obtained [33–35]. Then, ML is used to realize these algorithms and protocols in quantum devices, by autonomously learning how to control [36–38], error-correct [39–42], and measure quantum devices [43]. Finally, given experimental data, ML can reconstruct quantum states of physical systems [44–46], learn a compact representation of these states and characterize them [47–49].

Here we propose learning quantum communication protocols by a trial and error process. This process is visualized in Fig. 1 as an interaction between an RL agent and its environment: by trial and error the agent is manipulating quantum states, hence constructing communication protocols. At each interaction step the RL agent perceives the current state of the protocol (environment) and chooses one of the available operations (actions). This action modifies the previous version of the protocol and the interaction step ends. In addition to the state of the protocol, the agent gets feedback at each interaction step. This feedback is specified by a reward function, which depends on the specific quantum communication task a)-d) in Fig. 1. A reward is interpreted by the RL agent and its memory is updated.

The described RL approach is used for two reasons. First, there is a similarity between a target quantum communication protocol and a typical RL target. A target quantum communication protocol is a sequence of elementary operations leading to a desired quantum state, whereas a target of an RL agent is a sequence of actions that maximizes the achievable reward. In both cases the solution is therefore a sequence, which makes it natural to assign each elementary quantum operation a corresponding action, and to assign each desired state a reward. Second, the way the described targets are achieved is similar in RL and quantum communication protocols. In both cases an initial search (exploration) over a large number of operation (or action) sequences is needed. This search space can be viewed as a network, where states of a quantum communication environment are vertices, and basic quantum operations are edges. The structure of a complex network, formed in the described way, is similar to the one observed in quantum experiments [24], which makes the search problem equivalent to navigation in mazes – a reference problem in RL [14, 50–52].
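To make the interaction loop of Fig. 1 concrete, the following is a minimal sketch (in Python; it is not the authors' published code [68]) of a basic two-layer PS agent and its trial loop. The h-value policy and the glow/damping update loosely follow the description of the basic PS model in Refs. [19, 54]; the toy environment is a stand-in for the quantum communication environments introduced below.

```python
import numpy as np
from collections import defaultdict

class MinimalPSAgent:
    """Minimal two-layer projective-simulation agent (illustrative sketch).

    Percept and action clips are connected by h-values; actions are sampled
    proportionally to h, and rewarded edges are strengthened.  The glow values
    spread a delayed reward back over recently used percept-action edges.
    """

    def __init__(self, num_actions, damping=0.0, glow=0.1):
        self.num_actions = num_actions
        self.damping = damping                               # gamma: forgetting of h-values
        self.glow = glow                                     # eta: decay of the glow
        self.h = defaultdict(lambda: np.ones(num_actions))   # h-values per percept
        self.g = defaultdict(lambda: np.zeros(num_actions))  # glow per percept

    def act(self, percept):
        h = self.h[percept]
        probs = h / h.sum()                                  # standard PS policy
        action = np.random.choice(self.num_actions, p=probs)
        for key in self.g:                                   # decay all glow values
            self.g[key] *= (1.0 - self.glow)
        self.g[percept][action] += 1.0                       # mark the edge just used
        return action

    def learn(self, reward):
        for key in self.h:                                   # damping towards 1 plus glow-weighted reward
            self.h[key] += -self.damping * (self.h[key] - 1.0) + reward * self.g[key]

class ToyEnv:
    """Stand-in environment (not one of the paper's environments):
    reward 1 if the agent picks action 0 three times in a row."""
    def reset(self):
        self.count = 0
        return self.count
    def step(self, action):
        self.count = self.count + 1 if action == 0 else 0
        done = self.count == 3
        return self.count, (1.0 if done else 0.0), done

agent, env = MinimalPSAgent(num_actions=2), ToyEnv()
for trial in range(200):
    percept = env.reset()
    for _ in range(50):                                      # cap on actions per trial
        action = agent.act(percept)
        percept, reward, done = env.step(action)
        agent.learn(reward)
        if done:
            break
```

In the tasks considered in this paper, the percepts encode the current state of the communication environment and the actions are elementary gates, measurements and, later, block actions.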

FIG. 1. Illustration of the reinforcement learning agent interacting with the environment. The agent performs actions that change the state of the environment, while the environment communicates information about its state to the agent. The reward function is customized for each environment. The initial states for the different environments we consider here are illustrated: a Teleportation of an unknown state. b Entanglement purification applied recurrently. c Quantum repeater with entanglement purification and entanglement swapping. d Scaling quantum repeater concepts to distribute long-distance entanglement.

It should also be said that the role of the RL agent goes beyond mere parameter estimation, for the following reasons. First, using simple search methods (e.g., a brute-force or a guided search) would fail for the considered problem sizes: e.g. in the teleportation task discussed in Section III A, the number of possible states of the communication environment is at least 7^14 > 0.6 × 10^12 [53]. Second, the RL agent learns in the space of its memory parameters, but this is not the case with optimization techniques (e.g., genetic algorithms, simulated annealing, or gradient descent algorithms) that would search directly in the parameter space of communication protocols. Optimizing directly in the space of protocols, which consist of both actions and stochastic environment responses, can only be efficient if the space is sufficiently small [14]. An additional complication is introduced by the fact that reward signals are often sparse in quantum communication tasks, hence the reward gradient is almost always zero, giving optimization algorithms no direction for parameter change. Third, using an optimization technique for constructing an optimal action sequence, ignoring stochastic environment responses, is usually not possible in quantum communication tasks. Because different responses are needed depending on measurement outcomes, there is no single action sequence that achieves an optimal protocol, i.e. there is no single optimal point in the parameter space for an optimization technique to converge to. Nevertheless, there is at least one point in the RL agent's memory parameter space that achieves an optimal protocol, as the RL agent can choose an action depending on the current state of the environment rather than a whole action sequence.

As a learning agent that operates within the RL framework shown in Fig. 1 we use the PS agent [19, 54]. PS is a physically motivated approach to learning and decision making, which is based on deliberation in the episodic and compositional memory (ECM). The ECM is organized as an adjustable network of memory units, which provides flexibility in constructing different concepts in learning, e.g., meta-learning [55] and generalization [56, 57]. The deliberation within the ECM is based on a computationally undemanding random walk process, which in addition can be sped up via a quantum walk process [58, 59], leading to a quadratic speedup in deliberation time [60, 61], and makes the PS model conceptually attractive. Physical implementations of the basic PS agent and the quantum-enhanced PS agent were proposed using photonic platforms [62], trapped ions [63], and superconducting circuits [64]. The quantum-enhanced deliberation was recently implemented, as a proof-of-principle, in a small-scale quantum information processor based on trapped ions [65].

The use of PS in the design of quantum communication protocols has further advantages compared to other approaches, such as standard tabular RL models or deep RL networks. First, the PS agent was shown to perform well on problems that, from an RL perspective, are conceptually similar to designing communication networks. In problems that can be mapped to a navigation problem [66], such as the design of quantum experiments [24] and the optimization of quantum error correction codes [40], PS outperformed methods that were practically used for those problems (and were not based on machine learning). In standard navigation problems, such as the grid world and the mountain car problem, the basic PS agent shows a performance qualitatively and quantitatively similar to the standard tabular RL models SARSA and Q-learning [66]. Second, as was shown in Ref. [66], the computational effort is one to two orders of magnitude lower compared to tabular approaches. The reason for this is a low model complexity: in static task environments the basic PS agent has only one relevant model parameter. This makes it easy to set up the agent for a new complex environment, such as the quantum communication network, where model parameter optimization is costly because of the runtime of the simulations. Third, it has been shown that a variant of the PS agent converges to optimal behavior in a large class of Markov decision processes [67]. Fourthly, by construction, the PS decision making can be explained by analyzing graph properties of its ECM. Because of this intrinsic interpretability of the PS model, we are able to properly analyze the outcomes of the learning process [24].


FIG. 2. Reinforcement learning a teleportation protocol. a Initial setup: the agent is tasked to teleport the state of A0 to B using a Bell pair shared between A and B. b Learning curves of an ensemble of PS agents: average number of actions performed in order to teleport an unknown quantum state in the case of having Clifford gates (magenta) and universal gates (blue) as the set of available actions. c Two learning curves (magenta and blue) of two individual PS agents. Four solutions of different lengths are found by the agents. d Initial setup without pre-distributed entanglement. e Learning curve in the learning setting d: average number of actions performed in order to teleport an unknown quantum state. b, e The curves represent an average over 500 agents. The shaded areas show the mean squared deviation ±σ/3. This deviation appears not only because of different individual histories of the agents, but also because of a difference in the individual solution lengths shown in c.

Next, we show how the PS agent learns quantum communication protocols. The code of the PS agent used in this context is a derivative of a publicly available Python code [68].

III. LEARNING ELEMENTARY PROTOCOLS

We let the agent interact with various environments where the initial states and goals correspond to well-known quantum information protocols. For each of the protocols we will first explain our formulation of the environment and the techniques we used. Then we discuss the solutions the agent finds before finally comparing them to the established protocols. A detailed description of the environments together with additional results can be found in the Appendix.

The learning process follows a similar structure for all of the environments as the agent interacts with the environment over multiple trials. One trial consists of multiple interactions between agent and environment. At the beginning of each trial the environment is initialized in the initial setup, so each individual trial starts again from a blank slate. The agent selects one of the available actions (which are specific to the environment), then the environment will provide information to the agent whether the goal has been reached (with a reward R > 0) or not (R = 0), together with the current percept. The agent then gets to choose the next action, and this repeats until the trial ends either successfully, if the goal is reached, or unsuccessfully, e.g. if a maximum number of actions is exceeded or if there are no more actions left. We call a successful sequence of actions the agent used in one trial a protocol.

A. Quantum teleportation

The quantum teleportation protocol [8] is one of the central protocols of quantum information. In the standard version, a maximally entangled state shared between two parties, A and B, is used as a resource and serves to teleport the unknown quantum state of a third qubit, which is also held by party A, from A to B. To achieve this, A performs a Bell measurement, communicates the outcome to B via classical communication, and then B performs a correction operation (Pauli operation) depending on the measurement outcome. Notice that the same scheme serves for entanglement swapping when the qubit to be teleported is itself entangled with a fourth qubit held by another party.

1. Basic protocol

The agent is tasked to find a way to transmit quantum information without directly sending the quantum system to the recipient. As an additional resource a maximally entangled state shared between sender and recipient is available. The agent can apply operations from a (universal) gate set locally. This task challenges the agent without any prior knowledge to find the best (shortest) sequence of operations out of a large number of possible action sequences, which grows exponentially with the sequence length.

We describe the learning task as follows: There are two qubits A and A0 at the sender's station and one qubit B at the recipient's station. Initially, the qubits A and B are in a maximally entangled state |Φ⁺⟩ = 1/√2 (|00⟩ + |11⟩) and A0 is in an arbitrary input state |Ψ⟩. The setup is depicted in Fig. 2a. For this setup we consider two different sets of actions: The first is a Clifford gate set consisting of the Hadamard gate H and the P-gate P = diag(1, i), as well as the controlled-NOT (CNOT) gate (which, as a multi-qubit operation, can only be applied on qubits at the same station, i.e. in this case only on A and A0). Furthermore, single-qubit measurements in the Z-basis are allowed, with the agent obtaining a measurement outcome. The second set of actions replaces P with T = diag(1, e^{iπ/4}), which results in a universal set of quantum gates. A detailed description of the actions and percepts can be found in Appendix B.

The task is considered to be successfully solved if the qubit at B is in state |Ψ⟩. In order to ensure that this works for all possible input states, instead of using random input states, we make use of the Jamiołkowski fidelity [69, 70] to evaluate if the protocol proposed by the agent is successful. This means we require that the overlap of the Choi–Jamiołkowski state [69] corresponding to the effective map generated by the suggested protocol with the Choi–Jamiołkowski state corresponding to the optimal protocol is equal to 1.

The learning curves, i.e. the number of operations the agent applies to reach a solution at each trial, are shown as an average over 500 agents in Fig. 2b. Unsuccessful trials are recorded with 50 operations, as that is the maximum number of operations per trial to which the PS agent was limited. Observing the number of operations decrease below 50 means that the PS agent finds a solution. The decline over time in the averaged learning curve stems not only from an increasing number of agents finding a solution but also from individual agents improving their solution based on their experience. We observe that the learning curve converges to some average number of operations in both cases, using a Clifford (magenta) and a universal (blue) gate set. However, the mean squared deviation does not go to zero. This can be explained by looking at the individual learning curves of two example agents in Fig. 2c: the agent does not arrive at a single solution for this problem setup, but rather four different solutions. These solutions can be summarized as follows (up to different orders of commuting operations):

• Apply H_{A0} CNOT^{A0→A}, where H is the Hadamard gate and CNOT is the controlled-NOT operation.
• Measure qubits A and A0 in the computational basis.
• Depending on the measurement outcomes, apply 1, X, Y or Z (decomposed into the elementary gates of the used gate set) on qubit B.

We see four different solutions in Fig. 2c as four horizontal lines, which appear because of the probabilistic nature of the quantum communication environment. The agent learns different sequences of gates because different operations are needed, depending on measurement outcomes the agent has no control over. Four appropriate correction operations of different length (as seen in Fig. 2c), which are needed in order for the agent to successfully transmit quantum information at each trial, complete the protocol. This protocol found by the agent is identical to the well-known quantum teleportation protocol [8]. Note that because we used the Jamiołkowski fidelity to verify that the protocol implements the teleportation channel for all possible input states, it follows that the same protocol can be used for entanglement swapping if the input qubit at A0 is part of an entangled state.

2. Variants without pre-distributed entanglement

The entangled state shared between two distant parties is the key resource that makes the quantum teleportation protocol possible. Naturally, one could ask whether the agent is still able to find a protocol if not provided with the initial entangled state. To this end we let the agent solve two variants of this task with the goal of transferring an input state |Ψ⟩ to the receiving station B without sending it directly.

a. Variant 1: Initially there are two qubits |0⟩_{A1} |0⟩_{A2} and the input qubit in state |Ψ⟩, all at the sender's station A. Note that in this variant there is no qubit at station B initially. In addition to a universal gate set (multi-qubit operations can only be applied to qubits at the same station) the agent now has the additional capability to send a qubit to the recipient's station, with the important restriction that the input qubit may not be sent.

For this case the agent quickly finds a solution, which is however different from the standard teleportation protocol and only uses one of the provided qubits A1, A2. The protocol is (up to order of commuting operators and permutations of qubits A1 and A2) given by: Apply H_{A0} CNOT^{A0→A1}, then send qubit A1 to the receiving station. Now measure qubit A0 in the computational basis. If the outcome is −1, apply the Pauli-Z operator on qubit A1. This protocol was called one-bit teleportation in [71] and is conceptually and in terms of resources even simpler than the standard teleportation.

b. Variant 2: Now let us consider the same environment as in Variant 1, but with the additional restriction that no gates or measurements can be applied to the input qubit A0 until one of the two other qubits has been sent to the recipient's station. In this case the agent indeed finds the standard quantum teleportation protocol, by first creating a Bell pair via CNOT^{A1→A2} H_{A1} and sending A2 to the second station – the rest is identical to the base case discussed before. In Fig. 2e the learning curve averaged over 500 agents is shown. Obviously it takes significantly more trials for the agent to find solutions as the action sequence is longer. Nonetheless, this shows that the agent can find the quantum teleportation protocol without being given an entangled state as an initial resource.
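As an illustration of the Jamiołkowski-fidelity check described above, the following sketch (our own numpy code, not the environment implementation of Appendix B) simulates the agent's protocol on one half of a Choi–Jamiołkowski pair and verifies that every measurement branch, after its Pauli correction, leaves the reference qubit and B in the state |Φ⁺⟩:

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
X = np.array([[0., 1.], [1., 0.]])
Z = np.diag([1., -1.])
H = np.array([[1., 1.], [1., -1.]]) / np.sqrt(2)
P = [np.diag([1., 0.]), np.diag([0., 1.])]      # projectors |0><0|, |1><1|
e = [np.array([1., 0.]), np.array([0., 1.])]    # computational basis vectors

def embed(ops, n=4):
    """Tensor together single-qubit operators given as {qubit index: matrix}."""
    return reduce(np.kron, [ops.get(k, I2) for k in range(n)])

def cnot(c, t, n=4):
    return embed({c: P[0]}, n) + embed({c: P[1], t: X}, n)

# qubit order: 0 = reference R, 1 = A0 (input), 2 = A, 3 = B
bell = np.array([1., 0., 0., 1.]) / np.sqrt(2)
psi = np.kron(bell, bell)                       # |Phi+>_{R,A0} (x) |Phi+>_{A,B}

# agent's protocol: CNOT^{A0->A}, H on A0, then measure A0 and A in the Z basis
psi = embed({1: H}) @ cnot(1, 2) @ psi

worst = 1.0
for m1 in (0, 1):                               # outcome on A0
    for m2 in (0, 1):                           # outcome on A
        branch = embed({1: P[m1], 2: P[m2]}) @ psi
        branch /= np.linalg.norm(branch)        # each branch occurs with probability 1/4
        if m2:                                  # outcome-dependent corrections on B
            branch = embed({3: X}) @ branch
        if m1:
            branch = embed({3: Z}) @ branch
        # target: |Phi+> between R and B, measured qubits left in |m1 m2>
        target = (reduce(np.kron, [e[0], e[m1], e[m2], e[0]]) +
                  reduce(np.kron, [e[1], e[m1], e[m2], e[1]])) / np.sqrt(2)
        worst = min(worst, abs(np.vdot(target, branch)) ** 2)

print("worst-case branch fidelity with |Phi+>_{R,B}:", worst)   # -> 1.0
```

Since the reference qubit plays the role of the Choi–Jamiołkowski state, a worst-case branch fidelity of 1 certifies that the effective channel is the identity for all inputs, which is exactly the success criterion used for the reward.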


FIG. 3. Reinforcement learning an entanglement purification protocol. a Initial setup of the quantum communication environment: two entangled noisy pairs shared between two stations A and B. b Cumulative reward obtained by 100 agents for their protocols found after 5 × 10^5 trials. c Illustration of the best protocol found by the agent: Apply bilateral CNOT operations and measure one of the pairs. If the measurement outcomes coincide, the protocol is considered successful and √−iX is applied on both remaining qubits before the next entanglement purification step.

B. Entanglement purification

Noise and imperfections are a fundamental obstacle to distributing entanglement over long distances, so a strategy to deal with these is needed. Entanglement purification is one approach that is integral to enabling long-distance quantum communication. It is a probabilistic protocol that generates out of two noisy copies of a (non-maximally) entangled state a single copy with increased fidelity. Iterative application of this scheme yields pairs with higher and higher fidelity, and eventually maximally entangled pairs are generated.

In particular, here we investigate a situation that uses a larger amount of entanglement in the form of multiple noisy Bell pairs, each of which may have been affected by noise during the initial distribution, and try to obtain fewer, less noisy pairs from them. Again, the agent has to rely on using only local operations at the two different stations that are connected by the Bell pairs.

Specifically, we provide the agent with two noisy Bell pairs ρ_{A1B1} ⊗ ρ_{A2B2} as input, where ρ is of the form ρ = F |Φ⁺⟩⟨Φ⁺| + (1−F)/3 (|Ψ⁺⟩⟨Ψ⁺| + |Φ⁻⟩⟨Φ⁻| + |Ψ⁻⟩⟨Ψ⁻|). Here |Φ±⟩ and |Ψ±⟩ denote the standard Bell basis and F is the fidelity with respect to |Φ⁺⟩. This starting situation is depicted in Fig. 3a. The agent is tasked with finding a protocol that probabilistically outputs one copy with increased fidelity. However, it is desirable to obtain a protocol that does not only result in an increased fidelity when applied once, but consistently increases the fidelity when applied recurrently, i.e. on two pairs that have been obtained from the previous round of the protocol. In order to make such a recurrent application possible while dealing with probabilistic measurements, identifying the branches that should be reused is an integral part.

To this end, a different technique than before is employed. Rather than simply obtaining a random measurement outcome every time the agent picks a measurement action, the agent instead needs to provide potentially different actions for all possible outcomes. The actions taken on all the different branches of the protocol are then evaluated as a whole. This makes it possible to calculate the result of the recurrent application of that protocol separately for each trial. The agent is rewarded according to both the overall success probability of the protocol and the obtained increase in fidelity.

The agent is provided with a Clifford gate set and single-qubit measurements. Qubits labeled Ai are held by one party and those labeled Bi are held by another party. Multi-qubit operations can only be applied on qubits at the same station. The output of each of the branches is enforced to be a state with one qubit on side A and one on side B, along with a decision by the agent whether to consider that branch a success or failure for the purpose of iterating the protocol. Since this naturally needs two single-qubit measurements, with two possible outcomes each, there are four branches that need to be considered.

In Fig. 3b we see the reward values that 100 agents obtained for the protocols applied to initial states with fidelities of F = 0.73. The reward is normalized such that the entanglement purification protocol presented in Ref. [10] would obtain a reward of 1.0.
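For the recurrent evaluation described above, one only needs to iterate the map that a purification step induces on the Bell-diagonal coefficients. A minimal sketch using the standard DEJMPS recurrence relations [10], assuming perfect gates (this is the benchmark protocol used for normalization, not the agent's √−iX variant):

```python
def dejmps_step(coeffs):
    """One recurrence step of the DEJMPS protocol [10] applied to two identical
    Bell-diagonal pairs with coefficients (A, B, C, D) for the Bell states
    (Phi+, Psi-, Psi+, Phi-).  Returns the output coefficients and the
    probability that the two measurement outcomes coincide (success)."""
    A, B, C, D = coeffs
    p_suc = (A + B) ** 2 + (C + D) ** 2
    out = ((A ** 2 + B ** 2) / p_suc,
           2 * C * D / p_suc,
           (C ** 2 + D ** 2) / p_suc,
           2 * A * B / p_suc)
    return out, p_suc

F = 0.73                                    # initial fidelity used for Fig. 3b
coeffs = (F, (1 - F) / 3, (1 - F) / 3, (1 - F) / 3)
for step in range(1, 5):
    coeffs, p_suc = dejmps_step(coeffs)
    print(f"step {step}: F = {coeffs[0]:.4f}, success probability = {p_suc:.4f}")
```

Since the reward combines the fidelity gain with the success probability, branches that are marked successful for only one particular outcome combination necessarily score lower, which is the effect visible in the distribution of Fig. 3b.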

FIG. 4. Reinforcement learning a quantum repeater protocol. a Initial setup for the length-2 quantum repeater environment. The agent is provided with many copies of noisy Bell states with initial fidelities F = 0.75, which can be purified on each link separately or connected via entanglement swapping at the middle station. b Learning curve in terms of resources (used initial Bell pairs) for the best of 128 agents with gate reliability parameter p = 0.99. The known repeater solution (red line) is reached.

All the successful protocols found start the same way (up to permutations of commuting operations): they apply CNOT^{A1→A2} ⊗ CNOT^{B1→B2} followed by measuring qubits A2 and B2 in the computational basis. In some of the protocols two of the previously discussed four branches are marked as successful, while others only mark one particular combination of measurement outcomes. The latter therefore have a smaller probability of success, which is reflected in the reward. However, looking closely at the distribution in Fig. 3b we can see that these cases correspond to two variants with slightly different rewards. Those variants differ in the operations that are applied on the output copies before the next purification step. The variant with slightly lower reward applies the Hadamard gate on both qubits: H ⊗ H. The protocol that obtains the full reward of 1.0 applies √−iX ⊗ √−iX and is depicted in Fig. 3c. This protocol is equivalent to the well-known DEJMPS protocol [10] for an even number of recurrence steps, but requires a shorter action sequence for the gate set provided to the agent. We discuss this solution in more detail, as well as an additional variant of the environment with automatic depolarization after each recurrence step, in Appendix C.

C. Quantum repeater

Entanglement purification alone certainly increases the distance over which one can distribute an entangled state of sufficiently high fidelity. However, the reachable distance is limited because at some point too much noise will accumulate, such that the initial states will no longer have the minimal fidelity required for the entanglement purification protocol. The insight at the heart of the quantum repeater protocol [12] is that one can split up the channels into smaller segments and use entanglement purification on short-distance pairs before performing entanglement swapping to create a long-distance pair. In the most extreme case with very noisy (but still purifiable) short-distance pairs the requirement of prior purification can easily be understood, since entanglement swapping alone would produce a state that can no longer be purified; but this approach can also be beneficial when considering resource requirements at less extreme noise levels.

While the value of the repeater protocol lies in its scaling behavior, which becomes manifest when the number of links grows, for now the agent has to deal with only two channel segments that distribute noisy Bell pairs with a common station in the middle, as depicted in Fig. 4a. In this scenario the challenge for the agent is to use the protocols of the previous sections in order to distribute an entangled state over the whole distance. To this end the agent may use the previously discovered protocols for teleportation/entanglement swapping and entanglement purification as elementary actions, rather than individual gates.

The task is to find a protocol for distributing an entangled state between the two outer stations with a threshold fidelity of at least 0.9, all the while using as few initial states as possible. The initial Bell pairs are considered to have initial fidelities of F = 0.75. Furthermore, the CNOT gates used for entanglement purification are considered to be imperfect, which we model as local depolarizing noise with reliability parameter p acting on the two qubits involved, followed by the perfect CNOT operation [11]. The effective map M^{a→b}_{CNOT} is given by:

M^{a→b}_{CNOT}(p) ρ = CNOT_{ab} ( D^a(p) D^b(p) ρ ) CNOT_{ab}^† ,    (1)

where D^i(p) denotes the local depolarizing noise channel with reliability parameter p acting on the i-th qubit:

D^i(p) ρ = p ρ + (1 − p)/4 ( ρ + X^i ρ X^i + Y^i ρ Y^i + Z^i ρ Z^i ) ,    (2)

with X^i, Y^i, Z^i denoting the Pauli operators acting on the i-th qubit.

While the point of such an approach only begins to show for much longer distances, which we take a look at in Sec. IV, some key concepts can already be observed at small scales. The agent naturally tends to find solutions that use a small number of actions in an environment that is similar to a navigation problem. However, this is not necessarily desirable here because the resources, i.e. the number of initial Bell pairs, is the figure of merit in this scenario rather than the number of actions. Therefore an appropriate reward function for this environment takes the used resources into account.

In Fig. 4b the learning curve of the best of 128 agents in terms of resources used is depicted. Looking at the best solutions, the key insight is that it is beneficial to purify the short-distance pairs a few times before connecting them via entanglement swapping, even though this way more actions need to be performed by the agent. This solution is in line with the idea of the established quantum repeater protocol [12].
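Equations (1) and (2) translate directly into code. The following sketch (our notation, restricted to a single two-qubit system; in the purification protocol the same maps act on one qubit of each pair and are therefore embedded into the four-qubit state of the two pairs):

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0., 1.], [1., 0.]])
Y = np.array([[0., -1j], [1j, 0.]])
Z = np.diag([1., -1.])

def embed(op, qubit):
    """Single-qubit operator acting on the given qubit of a two-qubit system."""
    return np.kron(op, I2) if qubit == 0 else np.kron(I2, op)

def depolarize(rho, p, qubit):
    """Local depolarizing channel D^i(p) of Eq. (2) with reliability parameter p."""
    pauli_part = sum(embed(P, qubit) @ rho @ embed(P, qubit).conj().T for P in (X, Y, Z))
    return p * rho + (1 - p) / 4 * (rho + pauli_part)

def noisy_cnot(rho, p):
    """Imperfect CNOT of Eq. (1): depolarize control (qubit 0) and target
    (qubit 1) with parameter p, then apply the perfect CNOT."""
    cnot = np.kron(np.diag([1., 0.]), I2) + np.kron(np.diag([0., 1.]), X)
    rho = depolarize(depolarize(rho, p, 0), p, 1)
    return cnot @ rho @ cnot.conj().T

phi = np.array([1., 0., 0., 1.]) / np.sqrt(2)          # |Phi+>
ideal = noisy_cnot(np.outer(phi, phi), p=1.0)
noisy = noisy_cnot(np.outer(phi, phi), p=0.99)
print("fidelity with the ideal output:", np.trace(ideal @ noisy).real)
```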


FIG. 5. Reinforcement learning a scalable quantum repeater protocol. a Illustration of delegating a block action. The main agent that is tasked with finding a solution for the large-scale problem (repeater length 4 in this example) delegates the solution of a subsystem of length 2 to another agent. That agent comes up with a solution for the smaller-scale problem and that action sequence is then applied to the larger problem. This counts as one single action for the main agent. For settings even larger than this, the sub-agent itself can again delegate the solution of a sub-sub-system to yet another agent. b-c Scaling repeater with forced symmetric protocols with initial fidelities F = 0.75. Gate reliability parameter p = 0.99. The threshold fidelity for a successful solution is 0.9. The red line corresponds to the solution with approximately 0.518 × 10^8 resources used. b Best solution found by an agent with repeater length 8. c Relative resources used by the agent's solution compared to a symmetric strategy for different repeater lengths. d-e Scaling repeater with asymmetric initial fidelities (0.8, 0.6, 0.8, 0.8, 0.7, 0.8, 0.8, 0.6). The threshold fidelity for a successful solution is 0.9. d Best solution found by an agent with gate reliability p = 0.99 outperforms a strategy that does not take the asymmetric nature of the initial state into account (red line). e Relative resources used by the agent's solution compared to a symmetric strategy for different reliability parameters. (The jumps in the relative resources used are likely due to threshold effects or agents converging to a very short sequence of block actions that is not optimal.)

IV. SCALING QUANTUM REPEATER

The point of the quantum repeater lies in its scaling behavior, which only starts to show when considering longer distances than just two links. This means we have to consider starting situations of variable length, as depicted in Fig. 1d, using the same error model as described in Section III C. In order to distribute entanglement over varying distances, the agent needs to come up with a scalable scheme. However, both the action space and the length of action sequences required to find a solution would quickly grow unmanageable with increasing distances. Furthermore, an RL agent learns a solution for a particular situation and problem size rather than finding a universal concept that can be transferred to similar starting situations and larger scales.

To overcome these restrictions, we provide the agent with the ability to effectively outsource finding solutions for distributing an entangled pair over a short distance and reuse them as elementary actions for the larger setting. This means that, as a single action, the agent can instruct multiple sub-agents to come up with a solution for a small distance and then pick the best action sequence among those solutions. This process is illustrated in Fig. 5a.
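Why a nested purify-then-swap structure keeps the resource count manageable can be illustrated with a toy bookkeeping argument. The sketch below is our own simplification (constant success probability, fully symmetric nesting) and not the resource accounting used for Figs. 4–6; it only illustrates that the cost grows exponentially in the nesting depth and therefore only polynomially in the number of links:

```python
def expected_cost(levels, rounds_per_level, p_success):
    """Toy resource count for a nested purify-then-swap scheme.  Each
    purification round consumes two input pairs and succeeds with probability
    p_success, i.e. costs a factor 2 / p_success in expectation; each swapping
    level then merges two neighbouring pairs into one (another factor 2).
    With 2**levels elementary links this gives polynomial scaling in distance."""
    cost = 1.0                                   # one elementary pair
    for _ in range(levels):
        cost *= (2.0 / p_success) ** rounds_per_level   # purification rounds
        cost *= 2.0                                      # entanglement swapping
    return cost

# illustrative numbers only: 2 purification rounds per level, 80% success each
for levels in (1, 2, 3, 4):
    print(2 ** levels, "links:", expected_cost(levels, rounds_per_level=2, p_success=0.8))
```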


FIG. 6. Reinforcement learning quantum repeater protocols with imperfect memories. We consider an asymmetric setup with repeater length 8 and initial fidelities (0.8, 0.6, 0.8, 0.8, 0.7, 0.8, 0.8, 0.6) arising from different distances between the repeater stations. Gate reliability parameter p = 0.99. The threshold fidelity for a successful solution is 0.9. a Learning curves (relative to the protocol designed for symmetric approaches for perfect memories) for different decoherence times τ of the quantum memories at repeater stations. The dashed lines are the resources used by symmetric strategies that do not account for the asymmetric situation. b Required resources (number of initial pairs) of the protocols found by the agent for different memory times τ (relative to the resources required for τ = ∞).

Again, the aim is to come up with a protocol that distributes an entangled pair over a long distance with sufficiently high fidelity, while using as few resources as possible.

A. Symmetric protocols

First, we take a look at a symmetric variant of this setup: The initial situation is symmetric and the agent is only allowed to perform actions in a symmetric way. If it applies one step of an entanglement purification protocol on one of the initial pairs, all the other pairs need to be treated in the same way. Similarly, entanglement swapping is always performed at every second station that is still connected to other stations. In Fig. 5b-c the results for various lengths of Bell pairs with an initial fidelity of F = 0.75 are shown. We compare the solutions that the agent found with a strategy that repeatedly purifies all pairs up to a chosen working fidelity followed by entanglement swapping (see Appendix E4). For lengths greater than 8 repeater links, the agent still finds a solution with desirable scaling behavior while only using slightly more resources.

B. Asymmetric setup

The more interesting scenario is when the initial Bell pairs are subjected to different levels of noise, e.g. when the physical channels between stations are of different length or quality. In this scenario symmetric protocols are not optimal.

We consider the following scenario: 9 repeater stations connected via links that can distribute Bell pairs of different initial fidelities (0.8, 0.6, 0.8, 0.8, 0.7, 0.8, 0.8, 0.6). In Fig. 5d the learning curve in terms of resources for the agent that can delegate work to sub-agents is shown. The gate reliability of the CNOT gates used in the entanglement purification protocol is p = 0.99. The obtained solution is compared to the resources needed for a protocol that does not take into account the asymmetric nature of this situation and that is also used as an initial guess for the reward function (see Appendix E4 for additional details of that approach). Clearly the solution found by the RL agent is preferable to the protocol tailored to symmetric situations. Fig. 5e shows how that advantage scales for different gate reliability parameters p.

C. Imperfect memories

One central parameter that influences the performance of the quantum repeater is the quality of the quantum memories available at the repeater stations. It is necessary to store the qubits while the various measurement outcomes of the entanglement swapping and especially the entanglement purification protocols are communicated between the relevant stations.

To this end we revisit the previous asymmetric setup and assume that the initial fidelities now arise from different channel lengths between the repeater stations. We model the noisy channels as local depolarizing noise (see equation (2)) with length-dependent error parameter e^{−L/L_att} with L_att = 22 km [72]. Similarly, we model imperfect memories by local depolarizing noise with error parameter e^{−t/τ}, where t is the time for which a qubit is stored and τ is the decoherence time of the quantum memory.

The measurement outcomes of the entanglement purification protocol need to be communicated between the two stations performing the protocol. This information tells the stations whether the purification step was successful and is needed before the output pair can be used in any further operations. Therefore both qubits of the pair need to be stored for a time t_epp = l/c, with the distance between repeater stations l and the speed of light in glass fibres c = 2 × 10^8 m/s. The measurement outcomes of the entanglement swapping also need to be communicated from the middle station performing the Bell measurement to the outer repeater stations, so that the correct by-product operator is known. In this case, however, it is not necessary to wait for this information to arrive before proceeding with the next operations. This is the case because all operations used in the protocols are Clifford operations, thus any Pauli by-product operators can be taken into account retroactively. In our particular case, the information from entanglement swapping may only change the interpretation of the measurement outcomes obtained in the subsequent entanglement purification protocol and needs to be available at that time. However, since the information exchange for the entanglement purification protocol always happens over a longer distance than the associated entanglement swapping, this information will always be available when needed, so there is no need to account for additional time in memory for that.
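The error parameters entering this noise model are simple exponentials; a short sketch with the constants from the text (the segment length in the example is a placeholder, not one of the distances used in the paper):

```python
import numpy as np

L_ATT = 22.0        # attenuation length in km used for the fidelity decay [72]
C_FIBRE = 2.0e8     # speed of light in glass fibre, m/s

def channel_reliability(length_km):
    """Depolarizing parameter of a fibre segment of the given length."""
    return np.exp(-length_km / L_ATT)

def memory_reliability(length_km, tau_s):
    """Depolarizing parameter accumulated while both qubits are stored for
    t_epp = l / c, i.e. until the purification outcome has been exchanged."""
    t_epp = (length_km * 1e3) / C_FIBRE
    return np.exp(-t_epp / tau_s)

# e.g. a 15 km segment and tau = 0.1 s memories (illustrative numbers only)
print(channel_reliability(15.0), memory_reliability(15.0, 0.1))
```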

TABLE I. Resource requirements for various choices to construct repeater stations in order to connect two distant parties that are 20 km apart. There are 7 possible locations to place repeater stations that are asymmetrically spaced (see Appendix E3). Gate reliability parameter p = 0.99, decoherence time for quantum memories τ = 0.1 s, attenuation length L_att = 22 km.

station at   resources      stations at   resources      stations at   resources
4            9.60 × 10^4    3, 6          1.06 × 10^5    2, 4, 7       9.95 × 10^4
5            1.60 × 10^5    3, 5          1.39 × 10^5    3, 5, 7       1.07 × 10^5
3            1.60 × 10^5    2, 5          1.59 × 10^5    2, 4, 6       1.08 × 10^5
6            4.97 × 10^5    3, 7          1.64 × 10^5    1, 3, 6       1.10 × 10^5
2            5.02 × 10^5    2, 6          1.64 × 10^5    1, 3, 5       1.11 × 10^5
7            4.31 × 10^6    4, 7          1.80 × 10^5    2, 3, 5       1.11 × 10^5
1            5.90 × 10^8    3, 4          1.90 × 10^5    3, 4, 6       1.12 × 10^5

Using the same approach as above, we let the agent learn protocols for different values of τ and compare them to the symmetric protocols. We use a gate reliability parameter p = 0.99 and the task is considered complete if a pair with a fidelity F > 0.9 has been distributed. The learning curves as well as the advantage over the protocols optimized for symmetric approaches in Fig. 6a look qualitatively very similar to the result for perfect memories (i.e. τ = ∞). In Fig. 6b the required initial pairs of the protocols found by the agent are shown for different memory times. The required resources increase sharply at a decoherence time of τ = 1/30 s and the agent was unable to find a protocol for τ = 1/31 s (which could either mean the threshold has been reached or the number of actions required for a successful protocol has grown so large that the agent could not find it in a few thousand trials). It should be noted that this rather demanding memory requirement for this particular setup certainly arises from our very challenging starting situation (e.g. some very low starting fidelities of 0.6).

D. Choosing location of repeater stations

As an alternative use case, the protocols found by the agent allow us to compare different setups. Let us consider the following situation: We want two distant parties to share entanglement via a quantum repeater by using only a small number of intermediate repeater stations. On the path between the terminal stations there are multiple possible locations where repeater stations could be built. Which combination of locations should be chosen? Furthermore, let us assume that the possible locations are unevenly spaced so that simply picking a symmetric setup is impossible.

We demonstrate this concept with the following example setup using the same error model as in Section IV C for both the length-dependent channel noise and imperfect memories (τ = 0.1 s). We consider possible locations (numbered 1 to 7) for stations between the two end points that are located at positions that correspond to the asymmetric setup from the previous subsection but scaled down to a total distance of L_tot = 20 km. The positions of all locations are listed in the appendix. In Table I we show the best combinations of locations for placing either 1, 2 or 3 repeater stations and the amount of resources the agent's protocol would take for these choices.

Naturally, this analysis could be repeated for different initial setups and error models of interest. This particular example can be understood as a proof-of-principle that using the agent in this way can be useful as well.
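The comparison in Table I amounts to a brute-force search over subsets of candidate locations, with the learned protocol providing the cost of each candidate setup. A sketch of that outer loop follows; the candidate positions and the stand-in cost function are purely illustrative (the actual positions are listed in the appendix, and the actual cost is the resource count of the agent's protocol):

```python
from itertools import combinations

def rank_placements(candidates_km, n_stations, cost_of_setup):
    """Rank all ways of picking n_stations of the candidate locations by the
    cost returned for the corresponding repeater setup (cf. Table I)."""
    return sorted((cost_of_setup(choice), choice)
                  for choice in combinations(candidates_km, n_stations))

TOTAL_KM = 20.0
candidates = [2.0, 5.0, 7.5, 10.0, 13.0, 16.0, 18.0]   # hypothetical positions in km

def longest_segment(choice):
    """Crude stand-in cost: the longest unbroken fibre segment of the setup."""
    points = [0.0, *choice, TOTAL_KM]
    return max(b - a for a, b in zip(points, points[1:]))

for n in (1, 2, 3):
    print(n, "stations:", rank_placements(candidates, n, longest_segment)[:3])
```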
V. SUMMARY AND OUTLOOK

We have demonstrated that reinforcement learning can serve as a highly versatile and useful tool in the context of quantum communication. When provided with a sufficiently structured task environment including an appropriately chosen reward function, the learning agent will retrieve (effectively re-discover) basic quantum communication protocols like teleportation, entanglement purification, and the quantum repeater. We have developed methods to state challenges that occur in quantum communication as RL problems in a way that offers very general tools to the agent while ensuring that relevant figures of merit are optimized.

We have shown that stating the considered challenges as an RL problem is beneficial and offers advantages over using optimization techniques, as discussed in Section II. Regarding the question to what extent programs can help us in finding genuinely new schemes for quantum communication, it has to be emphasized that a significant part of the work consists in asking the right questions and identifying the relevant resources, both of which are central to the formulation of the task environment and are provided by researchers. However, it should also be noted that not every aspect of designing the environment is necessarily a crucial addition and many details of the implementation are simply an acknowledgment of practical limitations like computational runtimes. When provided with a properly formulated task, a learning agent can play a helpful, assisting role in exploring the possibilities.

In fact, we used the PS agent in this way to demonstrate that the application of machine learning techniques to quantum communication is not limited to rediscovering existing protocols. The PS agent finds adapted and optimized solutions in situations that lack certain symmetries assumed by the basic protocols, such as the qualities of the physical channels connecting different stations. We extended the PS model to include the concept of delegating parts of the solution to other agents, which allows the agent to effectively deal with problems of larger size. Using this new capability for long-distance quantum repeaters with asymmetrically distributed channel noise, the agent comes up with novel and practically relevant solutions.

We are confident that the presented approach can be extended to more complex scenarios. We believe that reinforcement learning can become a practical tool to apply to quantum communication problems that do not have a rich spectrum of existing protocols, such as designing quantum networks, especially if the underlying network structure is irregular. Alternatively, machine learning could be used to investigate other architectures for quantum communication, such as constructing cluster states for all-photonic quantum repeaters [73].

ACKNOWLEDGMENTS

J.W. and W.D. were supported by the Austrian Science Fund (FWF) through Grants No. P28000-N27 and P30937-N27. J.W. acknowledges funding by Q.Link.X from the BMBF in Germany. A.A.M. acknowledges funding by the Swiss National Science Foundation (SNSF) through Grant PP00P2-179109 and by the Army Research Laboratory Center for Distributed Quantum Information via the project SciNet. H.J.B. was supported by the FWF through the SFB BeyondC F7102 and by the Ministerium für Wissenschaft, Forschung und Kunst Baden-Württemberg (AZ: 33-7533.-30-10/41/1).

[1] H. J. Kimble, "The quantum internet," Nature, vol. 453, no. 7198, pp. 1023–1030, 2008.
[2] S. Wehner, D. Elkouss, and R. Hanson, "Quantum internet: A vision for the road ahead," Science, vol. 362, no. 6412, 2018.
[3] P. W. Shor and J. Preskill, "Simple Proof of Security of the BB84 Protocol," Phys. Rev. Lett., vol. 85, pp. 441–444, 2000.
[4] R. Renner, "Security of Quantum Key Distribution," PhD thesis, ETH Zurich, 2005.
[5] Y.-B. Zhao and Z.-Q. Yin, "Apply current exponential de Finetti theorem to realistic quantum key distribution," International Journal of Modern Physics: Conference Series, vol. 33, p. 1460370, 2014.
[6] D. Gottesman and H.-K. Lo, "Proof of security of quantum key distribution with two-way classical communications," IEEE Transactions on Information Theory, vol. 49, no. 2, pp. 457–475, 2003.
[7] H.-K. Lo, "A simple proof of the unconditional security of quantum key distribution," Journal of Physics A: Mathematical and General, vol. 34, no. 35, p. 6957, 2001.
[8] C. H. Bennett, G. Brassard, C. Crépeau, R. Jozsa, A. Peres, and W. K. Wootters, "Teleporting an unknown quantum state via dual classical and Einstein-Podolsky-Rosen channels," Phys. Rev. Lett., vol. 70, pp. 1895–1899, 1993.
[9] C. H. Bennett, G. Brassard, S. Popescu, B. Schumacher, J. A. Smolin, and W. K. Wootters, "Purification of Noisy Entanglement and Faithful Teleportation via Noisy Channels," Phys. Rev. Lett., vol. 76, pp. 722–725, 1996.
[10] D. Deutsch, A. Ekert, R. Jozsa, C. Macchiavello, S. Popescu, and A. Sanpera, "Quantum Privacy Amplification and the Security of Quantum Cryptography over Noisy Channels," Phys. Rev. Lett., vol. 77, pp. 2818–2821, 1996.
[11] W. Dür and H. J. Briegel, "Entanglement purification and quantum error correction," Rep. Prog. Phys., vol. 70, no. 8, p. 1381, 2007.
[12] H.-J. Briegel, W. Dür, J. I. Cirac, and P. Zoller, "Quantum Repeaters: The Role of Imperfect Local Operations in Quantum Communication," Phys. Rev. Lett., vol. 81, pp. 5932–5935, 1998.
[13] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. New Jersey: Prentice Hall, 3rd ed., 2010.
[14] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2nd ed., 2017.
[15] M. Wiering and M. van Otterlo, eds., Reinforcement Learning: State of the Art. Adaptation, Learning, and Optimization, vol. 12, Berlin, Germany: Springer, 2012.
[16] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529–533, 2015.
[17] D. Silver, A. Huang, C. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7597, pp. 484–489, 2016.
[18] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
[19] H. J. Briegel and G. De las Cuevas, "Projective simulation for artificial intelligence," Sci. Rep., vol. 2, p. 400, 2012.
[20] S. Hangl, E. Ugur, S. Szedmak, and J. Piater, "Robotic playing for hierarchical complex skill learning," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., pp. 2799–2804, 2016.
[21] S. Hangl, V. Dunjko, H. J. Briegel, and J. Piater, "Skill learning by autonomous robotic playing using active learning and exploratory behavior composition," Front. Robot. AI, vol. 7, p. 42, 2020.
[22] L. Orseau, T. Lattimore, and M. Hutter, "Universal knowledge-seeking agents for stochastic environments," in Algorithmic Learning Theory (S. Jain, R. Munos, F. Stephan, and T. Zeugmann, eds.), pp. 158–172, Springer Berlin Heidelberg, 2013.
[23] H. J. Briegel, "On creative machines and the physical origins of freedom," Sci. Rep., vol. 2, p. 522, 2012.
[24] A. A. Melnikov, H. Poulsen Nautrup, M. Krenn, V. Dunjko, M. Tiersch, A. Zeilinger, and H. J. Briegel, "Active learning machine learns to create new quantum experiments," Proc. Natl. Acad. Sci. U.S.A., vol. 115, no. 6, pp. 1221–1226, 2018.
[25] V. Dunjko and H. J. Briegel, "Machine learning & artificial intelligence in the quantum domain: a review of recent progress," Rep. Prog. Phys., vol. 81, no. 7, p. 074001, 2018.
[26] G. Carleo, I. Cirac, K. Cranmer, L. Daudet, M. Schuld, N. Tishby, L. Vogt-Maranto, and L. Zdeborová, "Machine learning and the physical sciences," Rev. Mod. Phys., vol. 91, p. 045002, 2019.
[27] V. Dunjko and P. Wittek, "A non-review of Quantum Machine Learning: trends and explorations," Quantum Views, vol. 4, p. 32, 2020.
[28] L. O'Driscoll, R. Nichols, and P. A. Knott, "A hybrid machine learning algorithm for designing quantum experiments," Quantum Mach. Intell., vol. 1, no. 1, pp. 5–15, 2019.
[29] M. Krenn, M. Erhard, and A. Zeilinger, "Computer-inspired quantum experiments," arXiv:2002.09970, 2020.
[30] A. A. Melnikov, P. Sekatski, and N. Sangouard, "Setting up experimental Bell tests with reinforcement learning," arXiv:2005.01697, 2020.
[31] L. Cincio, Y. Subaşı, A. T. Sornborger, and P. J. Coles, "Learning the quantum algorithm for state overlap," New J. Phys., vol. 20, no. 11, p. 113022, 2018.
[32] J. Bang, J. Ryu, S. Yoo, M. Pawlowski, and J. Lee, "A strategy for quantum algorithm design assisted by machine learning," New J. Phys., vol. 16, no. 7, p. 073017, 2014.
[33] A. A. Melnikov, L. E. Fedichkin, and A. Alodjants, "Predicting quantum advantage by quantum walk with convolutional neural networks," New J. Phys., vol. 21, no. 12, p. 125002, 2019.
[34] A. A. Melnikov, L. E. Fedichkin, R.-K. Lee, and A. Alodjants, "Machine learning transfer efficiencies for noisy quantum walks," Adv. Quantum Technol., vol. 3, no. 4, p. 1900115, 2020.
[35] C. Moussa, H. Calandra, and V. Dunjko, "To quantum or not to quantum: towards algorithm selection in near-term quantum optimization," arXiv:2001.08271, 2020.
[36] T. Fösel, P. Tighineanu, T. Weiss, and F. Marquardt, "Reinforcement learning with neural networks for quantum feedback," Phys. Rev. X, vol. 8, p. 031084, 2018.
[37] H. Xu, J. Li, L. Liu, Y. Wang, H. Yuan, and X. Wang, "Generalizable control for quantum parameter estimation through reinforcement learning," npj Quantum Information, vol. 5, no. 1, p. 82, 2019.
[38] F. Schäfer, M. Kloc, C. Bruder, and N. Lörch, "A differentiable programming method for quantum control," arXiv:2002.08376, 2020.
[39] M. Tiersch, E. J. Ganahl, and H. J. Briegel, "Adaptive quantum computation in changing environments using projective simulation," Sci. Rep., vol. 5, p. 12874, 2015.
[40] H. Poulsen Nautrup, N. Delfosse, V. Dunjko, H. J. Briegel, and N. Friis, "Optimizing quantum error correction codes with reinforcement learning," Quantum, vol. 3, p. 215, 2019.
[41] R. Sweke, M. S. Kesselring, E. P. van Nieuwenburg, and J. Eisert, "Reinforcement learning decoders for fault-tolerant quantum computation," arXiv:1810.07207, 2018.
[42] A. Valenti, E. van Nieuwenburg, S. Huber, and E. Greplova, "Hamiltonian learning for quantum error correction," Phys. Rev. Research, vol. 1, p. 033092, 2019.
[43] G. Liu, M. Chen, Y.-X. Liu, D. Layden, and P. Cappellaro, "Repetitive readout enhanced by machine learning," Mach. Learn.: Sci. Technol., vol. 1, no. 1, p. 015003, 2020.
[44] S. Yu, F. Albarrán-Arriagada, J. C. Retamal, Y.-T. Wang, W. Liu, Z.-J. Ke, Y. Meng, Z.-P. Li, J.-S. Tang, E. Solano, L. Lamata, C.-F. Li, and G.-C. Guo, "Reconstruction of a photonic qubit state with reinforcement learning," Adv. Quantum Technol., vol. 2, no. 7-8, p. 1800074, 2019.
[45] J. Carrasquilla, G. Torlai, R. G. Melko, and L. Aolita, "Reconstructing quantum states with generative models," Nat. Mach. Intell., vol. 1, no. 3, pp. 155–161, 2019.
[46] G. Torlai, B. Timar, E. P. L. van Nieuwenburg, H. Levine, A. Omran, A. Keesling, H. Bernien, M. Greiner, V. Vuletić, M. D. Lukin, R. G. Melko, and M. Endres, "Integrating neural networks with a quantum simulator for state reconstruction," Phys. Rev. Lett., vol. 123, p. 230504, 2019.
[47] G. Carleo and M. Troyer, "Solving the quantum many-body problem with artificial neural networks," Science, vol. 355, no. 6325, pp. 602–606, 2017.
[48] X. Gao and L.-M. Duan, "Efficient representation of quantum many-body states with deep neural networks," Nat. Commun., vol. 8, no. 1, p. 662, 2017.
[49] A. Canabarro, S. Brito, and R. Chaves, "Machine learning nonlocal correlations," Phys. Rev. Lett., vol. 122, p. 200401, 2019.
[50] R. S. Sutton, "Integrated architectures for learning, planning, and reacting based on approximating dynamic programming," in Proceedings of the 7th International Conference on Machine Learning, pp. 216–224, 1990.
[51] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, D. Kumaran, and R. Hadsell, "Learning to navigate in complex environments," arXiv:1611.03673, 2016.
[52] T. Mannucci and E.-J. van Kampen, "A hierarchical maze navigation algorithm with reinforcement learning and mapping," in Proc. IEEE Symposium Series on Computational Intelligence, 2016.
[53] In the teleportation task the shortest possible sequence of gates is equal to 14, and at each step of this sequence there are at least 7 possible gates that can be applied.
[54] J. Mautner, A. Makmal, D. Manzano, M. Tiersch, and H. J. Briegel, "Projective simulation for classical learning agents: a comprehensive investigation," New Gener. Comput., vol. 33, no. 1, pp. 69–114, 2015.
[55] A. Makmal, A. A. Melnikov, V. Dunjko, and H. J. Briegel, "Meta-learning within projective simulation," IEEE Access, vol. 4, pp. 2110–2122, 2016.
[56] A. A. Melnikov, A. Makmal, V. Dunjko, and H. J. Briegel, "Projective simulation with generalization," Sci. Rep., vol. 7, p. 14430, 2017.
[57] K. Ried, B. Eva, T. Müller, and H. J. Briegel, "How a minimal learning agent can infer the existence of unobserved variables in a complex environment," arXiv:1910.06985, 2019.
[58] J. Kempe, "Quantum random walks: An introductory overview," Contemp. Phys., vol. 44, no. 4, pp. 307–327, 2003.
[59] S. E. Venegas-Andraca, "Quantum walks: a comprehensive review," Quantum Information Processing, vol. 11, no. 5, pp. 1015–1106, 2012.
[60] G. D. Paparo, V. Dunjko, A. Makmal, M. A. Martin-Delgado, and H. J. Briegel, "Quantum speed-up for active learning agents," Phys. Rev. X, vol. 4, p. 031002, 2014.
[61] S. Jerbi, H. Poulsen Nautrup, L. M. Trenkwalder, H. J. Briegel, and V. Dunjko, "A framework for deep energy-based reinforcement learning with quantum speed-up," arXiv:1910.12760, 2019.
[62] F. Flamini, A. Hamann, S. Jerbi, L. M. Trenkwalder, H. Poulsen Nautrup, and H. J. Briegel, "Photonic architecture for reinforcement learning," New J. Phys., vol. 22, no. 4, p. 045002, 2020.
[63] V. Dunjko, N. Friis, and H. J. Briegel, "Quantum-enhanced deliberation of learning agents using trapped ions," New J. Phys., vol. 17, no. 2, p. 023006, 2015.
[64] N. Friis, A. A. Melnikov, G. Kirchmair, and H. J. Briegel, "Coherent controlization using superconducting qubits," Sci. Rep., vol. 5, p. 18036, 2015.
[65] T. Sriarunothai, S. Wölk, G. S. Giri, N. Friis, V. Dunjko, H. J. Briegel, and C. Wunderlich, "Speeding-up the decision making of a learning agent using an ion trap quantum processor," Quantum Sci. Technol., vol. 4, no. 1, p. 015014, 2018.
[66] A. A. Melnikov, A. Makmal, and H. J. Briegel, "Benchmarking projective simulation in navigation problems," IEEE Access, vol. 6, pp. 64639–64648, 2018.
[67] J. Clausen, W. L. Boyajian, L. M. Trenkwalder, V. Dunjko, and H. J. Briegel, "On the convergence of projective-simulation-based reinforcement learning in Markov decision processes," arXiv:1910.11914, 2019.
[68] "Projective simulation Github repository." github.com/qic-ibk/projectivesimulation. Accessed: 2020-04-08.
[69] A. Jamiołkowski, "Linear transformations which preserve trace and positive semidefiniteness of operators," Rep. Math. Phys., vol. 3, no. 4, pp. 275–278, 1972.
[70] A. Gilchrist, N. K. Langford, and M. A. Nielsen, "Distance measures to compare real and ideal quantum processes," Phys. Rev. A, vol. 71, p. 062310, 2005.
[71] X. Zhou, D. W. Leung, and I. L. Chuang, "Methodology for quantum logic gate construction," Phys. Rev. A, vol. 62, p. 052316, 2000.
[72] This value is usually used for photon loss in a glass fibre, but here we use it for the decay in fidelity instead, which results in extremely noisy channels.
[73] K. Azuma, K. Tamaki, and H.-K. Lo, "All-photonic quantum repeaters," Nature Communications, vol. 6, p. 6787, 2015.
[74] D. Gottesman, "Theory of fault-tolerant quantum computation," Phys. Rev. A, vol. 57, pp. 127–137, 1998.
[75] P. O. Boykin, T. Mor, M. Pulver, V. Roychowdhury, and F. Vatan, "On universal and fault-tolerant quantum computing: A novel basis and a new constructive proof of universality for Shor's basis," in Proceedings of the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, (Washington, DC, USA), pp. 486–, IEEE Computer Society, 1999.

Appendix A: Introduction to projective simulation

This work is based on using reinforcement learning (RL) for quantum communication. As the RL model we use projective simulation (PS), which was first introduced in Ref. [19]. In the main text we provide background information on RL and PS and explain our motivation for using PS. In this section we give an introduction to the working principles of the PS agent. The decision-making of a PS agent is realized in its episodic and compositional memory, which is a network of memory units, or clips. Each clip encodes a percept, an action, or a sequence thereof. There are different mechanisms to connect clips in the PS network, which can be found e.g. in Refs. [56, 66]. In this work we use a two-layered PS network construction, similar to the one used for designing quantum experiments [24]. The two-layered network is schematically shown in Fig. 7.

FIG. 7. The PS network in its basic form, which corresponds to a weighted directed bipartite graph. Clips corresponding to 8 states of the environment and 7 possible actions are shown. They are connected by directed edges. Thick edges correspond to high probabilities of choosing action a_k in the state s_m.

The first layer of clips corresponds to percepts s, which are states of the (quantum communication) environment. The layer of percepts is connected to the layer of actions a by edges with weights h(s, a). These weights determine the probabilities of choosing the corresponding action: the agent perceives a state s_m and performs action a_k with probability

    p_{mk} = \frac{h_{mk}}{\sum_l h_{ml}}.    (A1)

The decision-making process within the PS model, using the two-layered network, is hence a 1-step random walk process. One trial consisting of n agent-environment interaction steps is an n-step random walk process defined by the weight matrix h_{mk}. The weights of the network are updated at the end of each agent-environment interaction step i according to the learning rule

    h_{mk}^{(i+1)} = h_{mk}^{(i)} - \gamma \left( h_{mk}^{(i)} - 1 \right) + g_{mk}^{(i+1)} r^{(i)},    (A2)

for all m and k. Here r^{(i)} is the reward, and g_{mk}^{(i+1)} is a coefficient that distributes this reward proportionally to how much a given edge (m, k) contributed to the rewarded sequence of actions. To be more specific, g_{mk}^{(i+1)} is set to 1 once the edge is used in a random walk process and goes back to its initial value of 0 with the rate η afterwards:

    g_{mk}^{(i+1)} = \begin{cases} 1, & \text{if } (m, k) \text{ was traversed} \\ (1 - \eta)\, g_{mk}^{(i)}, & \text{otherwise.} \end{cases}    (A3)

The time-independent parameter η is set to a value from the interval [0, 1]. The second meta-parameter 0 ≤ γ ≤ 1 of the PS agent is a damping parameter that helps the agent to forget, which is beneficial in cases where the environment changes.

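A reference implementation of projective simulation is available in the repository of Ref. [68]. The following is only a minimal sketch of the two-layered agent defined by Eqs. (A1)–(A3); the meta-parameter values and the interface are placeholder assumptions chosen for illustration, not the settings used for the results in this paper.

```python
import numpy as np

class TwoLayerPSAgent:
    def __init__(self, num_percepts, num_actions, gamma=0.001, eta=0.1):
        self.h = np.ones((num_percepts, num_actions))   # edge weights h(s, a)
        self.g = np.zeros((num_percepts, num_actions))  # glow values, Eq. (A3)
        self.gamma = gamma                              # damping ("forgetting") parameter
        self.eta = eta                                  # glow-decay parameter

    def choose_action(self, percept):
        # Eq. (A1): p_mk = h_mk / sum_l h_ml -- a 1-step random walk over the edges
        probs = self.h[percept] / self.h[percept].sum()
        action = np.random.choice(len(probs), p=probs)
        # Eq. (A3): decay all glow values, then set the traversed edge to 1
        self.g *= 1.0 - self.eta
        self.g[percept, action] = 1.0
        return action

    def update(self, reward):
        # Eq. (A2): damping towards the initial value 1 plus glow-weighted reward
        self.h = self.h - self.gamma * (self.h - 1.0) + self.g * reward
```

In a trial, the agent would repeatedly call choose_action on the current percept and call update with the reward returned by the environment at the end of each interaction step.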
Appendix B: Quantum teleportation

1. Environment description

Fig. 2a depicts the setup. Qubit $A_0$ is initialized as part of an entangled state $|\Phi^+\rangle_{\tilde{A}_0 A_0}$ in order to facilitate measuring the Jamiołkowski fidelity later on. Qubits A and B are in a $|\Phi^+\rangle_{AB}$ state.

Goal: The state of $A_0$ should be teleported to B. We measure this by calculating the Jamiołkowski fidelity of the effective channel applied by the action sequence. That means we calculate the overlap of $|\Phi^+\rangle_{\tilde{A}_0 B}$ with the reduced state $\rho_{\tilde{A}_0 B}$ to determine whether the goal has been reached.

Actions: The following actions are allowed:
• Depending on the specification of the task, either $P = \begin{pmatrix} 1 & 0 \\ 0 & i \end{pmatrix}$ for a Clifford gate set or $T = \begin{pmatrix} 1 & 0 \\ 0 & e^{i\pi/4} \end{pmatrix}$ for a universal gate set, on each qubit. (3 actions)
• The Hadamard gate $H = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$ on each qubit. (3 actions)
• The CNOT gate $\mathrm{CNOT}^{A_0 \to A}$ on the two qubits at location A. (1 action)
• Z-measurement on each qubit. (3 actions)

In total: 10 actions. Note that the Clifford group is generated by P, H and CNOT [74], and replacing P with T makes it a universal gate set [75]. The measurements are modeled as destructive measurements, which means that operations acting on a measured qubit are no longer available, thereby reducing the number of actions that the agent can choose from.

Percepts: The agent only uses the previous actions of the current trial as a percept. Z-measurements with different outcomes produce different percepts.

Reward: If the Goal is reached, R = 1 and the trial ends. Else R = 0.

2. Discussion

One central complication in this scenario is that the entanglement is a limited resource. If the entanglement is destroyed without accomplishing anything (e.g., measuring qubit A as the first action), then the goal can no longer be reached no matter what the agent tries afterwards. This is a feature that distinguishes this setup, and other quantum environments with irreversible operations, from a simple navigation problem in a maze. Instead, it is more akin to a navigation problem with numerous cliffs that the agent can fall down but never climb back up, which means that the agent can become permanently separated from the goal.

An alternative formulation that would make the goal reachable in every trial, even if a wrong irreversible action was taken, would be to provide the agent with a reset action that restores the environment to its initial state. A different class of percepts would need to be used in this case.

With prior knowledge of the quantum teleportation protocol it is easy to understand why this problem structure favors an RL approach. The shortest solution for one particular combination of measurement outcomes takes only four actions, and these actions are to be performed regardless of which correction operation needs to be applied. That means that once this simple solution is found, it significantly reduces the complexity of finding the other solutions, as now only the correction operations need to be found.

Compare this to searching for a solution by brute-forcing action sequences. For the universal gate set we know that the most complicated of the four solutions takes at least 14 actions. Ignoring the measurement actions for now, as they anyway reduce the number of available actions, there are 7^14 possible action sequences. So we would have to try at least 7^14 > 6.7 × 10^11 sequences, which is vastly more than the few hundred thousand trials needed by the agent.

3. Environment variants without pre-distributed entanglement

a. Variant 1: Fig. 2d shows the initial setup. The qubits $A_1$ and $A_2$ are both initialized in the state $|0\rangle$. As before, the input qubit $A_0$ is initialized as part of an entangled state $|\Phi^+\rangle_{\tilde{A}_0 A_0}$ in order to later obtain the Jamiołkowski fidelity.

Goal: A qubit at location B is in the initial state of qubit $A_0$. As before, the Jamiołkowski fidelity is used to determine whether this goal has been reached.

Actions: The following actions are available:
• The T-gate $T = \begin{pmatrix} 1 & 0 \\ 0 & e^{i\pi/4} \end{pmatrix}$ on every qubit. (3 actions)
• The Hadamard gate $H = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$ on each qubit. (3 actions)
• The CNOT gate on each pair of qubits, as long as they are at the same location (initially $\mathrm{CNOT}^{A_0 \to A_1}$, $\mathrm{CNOT}^{A_0 \to A_2}$ and $\mathrm{CNOT}^{A_1 \to A_2}$). (3 actions)
• Z-measurement on each qubit. (3 actions)
• Send qubit $A_1$ or $A_2$ to location B. (2 actions)

Initially available: 14 actions. Measuring a qubit will remove all actions that involve that qubit from the pool of available actions. Sending a qubit to location B will remove the used sending action itself (as that qubit is now at location B) and will also trigger a check of which CNOT actions are now possible.

Percepts: The agent only uses the information about which of the previous actions of the current trial were taken as a percept. Z-measurements with different outcomes produce different percepts.

Reward: If the Goal is reached, R = 1 and the trial ends. Else R = 0.

b. Variant 2: As above, but initially no action involving the input qubit $A_0$ is available, i.e. no single-qubit gates, no measurement and no CNOT gate. After one of the sending actions is used, these actions on qubit $A_0$ become available. Furthermore, only one qubit may be sent in total, i.e. after sending a qubit neither of the two sending actions may be chosen.

c. Comment on the variant environments: Variant 1 serves as a good example of why specifying the task correctly is such an important part of RL problems. As discussed in the main text, the agent immediately spotted the loophole and found a protocol that uses fewer actions than the standard teleportation protocol. Another successful way to circumvent the restriction of not sending $A_0$ directly is to construct a SWAP operation from the gate set. Then it is possible to simply swap the input state with the state of one of the qubits that can be sent. However, in the given action set this solution consists of a longer sequence of actions and is therefore deemed more expensive than the one the agent found.

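The Goal criterion used throughout this appendix, the Jamiołkowski fidelity of the effective channel, can be illustrated with a short sketch: the channel is applied to one half of a maximally entangled pair and the overlap of the result with $|\Phi^+\rangle$ is computed. The depolarizing channel and the function names below are stand-ins chosen for illustration; in the environment, the channel is whatever the agent's sequence of gates, measurements and corrections implements on the input qubit.

```python
import numpy as np

PHI_PLUS = np.array([1, 0, 0, 1]) / np.sqrt(2)   # |Phi+> = (|00> + |11>)/sqrt(2)

def jamiolkowski_fidelity(channel):
    """Overlap of |Phi+> with (id x channel) applied to |Phi+><Phi+|.

    `channel` is any linear map on 2x2 matrices (a single-qubit channel).
    """
    rho = np.outer(PHI_PLUS, PHI_PLUS.conj())
    choi = np.zeros((4, 4), dtype=complex)
    for i in range(2):        # apply the channel to the second qubit block-wise
        for j in range(2):
            choi[2*i:2*i+2, 2*j:2*j+2] = channel(rho[2*i:2*i+2, 2*j:2*j+2])
    return float(np.real(PHI_PLUS.conj() @ choi @ PHI_PLUS))

def depolarizing(rho, p=0.1):
    # Stand-in example channel, not the agent's protocol.
    return (1 - p) * rho + p * np.trace(rho) * np.eye(2) / 2

print(jamiolkowski_fidelity(lambda r: depolarizing(r, p=0.1)))   # 1 - 3p/4 = 0.925
```

A perfect teleportation protocol realizes the identity channel and therefore reaches a Jamiołkowski fidelity of 1.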
Appendix C: Entanglement purification

1. Environment description

Fig. 3a shows the initial setup, with A and B sharing two Bell pairs with initial fidelity F = 0.73.

Goal: Find an action sequence that results in a protocol that improves the fidelity of the Bell pair when applied recurrently. This means that two copies of the resulting two-qubit state after one successful application of the protocol are taken and used as the input states for the next application.

Actions: The following actions are available:
• $P_x = H P H$ on each qubit. (4 actions)
• H, the Hadamard gate, on each qubit. (4 actions)
• The CNOT gates $\mathrm{CNOT}^{A_1 \to A_2}$ and $\mathrm{CNOT}^{B_1 \to B_2}$ on qubits at the same location. (2 actions)
• Z-measurements on each qubit. (4 actions)
• Accept/Reject. (2 actions)

In total: 16 actions. Note that these gates generate the Clifford group. (We tried different variants of generating the gate set, as the choice of basis is not fundamental; the one with $P_x$ gave the best results.) The measurements are modeled as destructive measurements, which means that operations acting on a measured qubit are no longer available, thereby reducing the number of actions that the agent can choose from. In order to reduce the huge action space further, the requirement that the final state of one sequence of gates needs to be a two-qubit state shared between A and B is enforced by removing actions that would destructively measure all qubits on one side. The accept and reject actions are essential because they allow identifying successful branches.

Percepts: The agent only uses the previous actions of the current trial as a percept. Z-measurements with different outcomes produce different percepts.

Reward: The protocol suggested by the agent is performed recurrently ten times. This is done to ensure that the solution found is a viable protocol for recurrent application, because it is possible that a single step of the protocol might increase the fidelity while further applications of the protocol could undo that improvement. The reward function is given by

    R = 10 \times \max\left( 0,\ \mathrm{const} \prod_{i=1}^{10} \sqrt{p_i}\, \Delta F \right),

where p_i is the success probability (i.e. the combined probability of the accepted branches) of the i-th step, ΔF is the increase in fidelity after ten steps, and the constant is chosen such that the known protocols [9] or [10] would receive a reward of 1.

Problem-specific techniques: To evaluate the performance of an entanglement purification protocol that is applied in a recurrent fashion, it is necessary to know which actions are performed, and especially whether the protocol should be considered successful, for all possible measurement outcomes. Therefore, it is not sufficient to use the same approach as for the teleportation challenge and simply consider one particular measurement outcome for each trial. Instead, the agent is required to choose actions for all possible measurement outcomes every time it chooses a measurement action. This means we keep track of multiple separate branches (and the associated probabilities) with different states of the environment. The average of the branches that the agent decides to keep is the state that is used for the next purification step. We choose to do it this way because it allows us to obtain a complete protocol that can be evaluated at each trial, and the agent is rewarded according to the performance of the whole protocol.

2. Discussion

As discussed in the main text, the agent found an entanglement purification protocol that is equivalent to the DEJMPS protocol [10] for an even number of purification steps.

Let us briefly recap how the DEJMPS protocol works. Initially we have two copies of a state ρ that is diagonal in the Bell basis and can be written with coefficients λ_{ij}:

    \rho = \lambda_{00} |\Phi^+\rangle\langle\Phi^+| + \lambda_{10} |\Phi^-\rangle\langle\Phi^-| + \lambda_{01} |\Psi^+\rangle\langle\Psi^+| + \lambda_{11} |\Psi^-\rangle\langle\Psi^-|.    (C1)

The effect of the multilateral CNOT operation $\mathrm{CNOT}^{A_1 \to A_2} \otimes \mathrm{CNOT}^{B_1 \to B_2}$, followed by measurements in the computational basis on $A_2$ and $B_2$ and post-selection on coinciding measurement results, is:

    \tilde{\lambda}_{00} = \frac{\lambda_{00}^2 + \lambda_{10}^2}{N}, \quad \tilde{\lambda}_{10} = \frac{2 \lambda_{00} \lambda_{10}}{N}, \quad \tilde{\lambda}_{01} = \frac{\lambda_{01}^2 + \lambda_{11}^2}{N}, \quad \tilde{\lambda}_{11} = \frac{2 \lambda_{01} \lambda_{11}}{N},    (C2)

where the λ̃_{ij} denote the new coefficients after the procedure and N = (λ_{00} + λ_{10})² + (λ_{01} + λ_{11})² is a normalization constant and also the probability of success. Without any additional intervention, applying this map recurrently amplifies not only the desired coefficient λ_{00} (the fidelity) but both λ_{00} and λ_{10}.

To avoid this and only amplify the fidelity with respect to |Φ+⟩, the DEJMPS protocol calls for the application of √(−iX) ⊗ √(iX) on both copies of ρ before applying the multilateral CNOTs and performing the measurements. The effect of this operation is to exchange the two coefficients λ_{10} and λ_{11}, thus preventing the unwanted amplification of λ_{10}. So the effective map at each entanglement purification step is the following:

    \tilde{\lambda}_{00} = \frac{\lambda_{00}^2 + \lambda_{11}^2}{N}, \quad \tilde{\lambda}_{10} = \frac{2 \lambda_{00} \lambda_{11}}{N}, \quad \tilde{\lambda}_{01} = \frac{\lambda_{01}^2 + \lambda_{10}^2}{N}, \quad \tilde{\lambda}_{11} = \frac{2 \lambda_{01} \lambda_{10}}{N},    (C3)

with N = (λ_{00} + λ_{11})² + (λ_{01} + λ_{10})².

In contrast, the solution found by the agent calls for √(−iX) ⊗ √(−iX) to be applied, which instead exchanges the two different coefficients λ_{00} and λ_{01}, for an effective map

    \tilde{\lambda}_{00} = \frac{\lambda_{01}^2 + \lambda_{10}^2}{N}, \quad \tilde{\lambda}_{10} = \frac{2 \lambda_{01} \lambda_{10}}{N}, \quad \tilde{\lambda}_{01} = \frac{\lambda_{00}^2 + \lambda_{11}^2}{N}, \quad \tilde{\lambda}_{11} = \frac{2 \lambda_{00} \lambda_{11}}{N},    (C4)

with N = (λ_{01} + λ_{10})² + (λ_{00} + λ_{11})². Note that the maps (C3) and (C4) are identical except that the roles of λ̃_{k0} and λ̃_{k1} are exchanged. It is clear that applying the agent's map twice will have the same effect as applying the DEJMPS protocol twice, which means that for an even number of recurrence steps they are equivalent.

As a side note, the other protocol that was found by the agent, as described in the main text, applies such an additional operation before each entanglement purification step as well: applying H ⊗ H on ρ exchanges λ_{10} and λ_{01}. This also yields a successful entanglement purification protocol, however with a slightly worse performance.

3. Automatic depolarization variant

We also investigated a variant where, after each purification step, the state is automatically depolarized before the protocol is applied again. That means that if the first step brought the state up to the new fidelity F′, it is then brought to the form F′|Φ+⟩⟨Φ+| + (1 − F′)/3 (|Ψ+⟩⟨Ψ+| + |Φ−⟩⟨Φ−| + |Ψ−⟩⟨Ψ−|). This can always be achieved without changing the fidelity [9].

FIG. 8. Entanglement purification environment with automatic depolarization after each purification step. The figure shows the rewards obtained by 100 agents for the protocols they found after 5 × 10^5 trials.

In Fig. 8 the obtained reward for 100 agents for this alternative scenario is shown. The successful protocols consist of applying $\mathrm{CNOT}^{A_1 \to A_2} \otimes \mathrm{CNOT}^{B_1 \to B_2}$ followed by measuring qubits $A_2$ and $B_2$ in the computational basis. Optionally, some additional local operations that do not change the fidelity itself can be added, as the effect of those is undone by the automatic depolarization. Similar to the scenario in the main text, there are some solutions that only accept one branch as successful, which means they only get half the reward, as the success probability at each step is halved (center peak in Fig. 8). The protocols for which two relevant branches are accepted are equivalent to the entanglement purification protocol presented in [9].

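The equivalence statement in the discussion above can be checked directly by iterating the maps (C3) and (C4). This is a stand-alone numerical check; the initial Bell-diagonal coefficients below are an illustrative assumption (a Werner-like state with fidelity 0.73, matching the initial fidelity of the environment), since the appendix does not fix the other three coefficients.

```python
def dejmps_step(lam):
    """One purification step, Eq. (C3); lam = (l00, l10, l01, l11)."""
    l00, l10, l01, l11 = lam
    n = (l00 + l11) ** 2 + (l01 + l10) ** 2   # normalization = success probability
    new = ((l00**2 + l11**2) / n, 2 * l00 * l11 / n,
           (l01**2 + l10**2) / n, 2 * l01 * l10 / n)
    return new, n

def agent_step(lam):
    """One purification step with the agent's map, Eq. (C4)."""
    l00, l10, l01, l11 = lam
    n = (l01 + l10) ** 2 + (l00 + l11) ** 2
    new = ((l01**2 + l10**2) / n, 2 * l01 * l10 / n,
           (l00**2 + l11**2) / n, 2 * l00 * l11 / n)
    return new, n

F = 0.73
lam0 = (F, (1 - F) / 3, (1 - F) / 3, (1 - F) / 3)   # assumed Werner-like input

lam_d, _ = dejmps_step(lam0)
lam_d, _ = dejmps_step(lam_d)
lam_a, _ = agent_step(lam0)
lam_a, _ = agent_step(lam_a)
print(lam_d)
print(lam_a)   # identical after an even number of steps
```

The returned normalization n is the success probability of the step, i.e. the quantity entering the reward function of the environment.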
Appendix D: Quantum repeater

1. Environment description

The setup is depicted in Fig. 4a. The repeater stations share entangled states via noisy channels with their neighbors, which results in pairs with an initial fidelity of F = 0.75. The two protocols learned previously are now available as the elementary actions for this more complex scenario.

Goal: Entangled pair between the left-most and the right-most station with fidelity above the threshold fidelity F_th = 0.9.

Actions:
• Purify a pair with one entanglement purification step (left pair, right pair, or the long-distance pair that arises from entanglement swapping at the middle station).
• Entanglement swapping at the middle station.

We use the protocol in [9] for the purification, as it is computationally easier to handle. For a practical application it would be advisable to use a more efficient entanglement purification protocol.

Percepts: Current position of the pairs and fidelity of each pair.

Reward function: R = \mathrm{const}^2 / \mathrm{resources}. The whole path is rewarded in full. The reward constant is obtained from an initial guess using the working-fidelity strategy described in Appendix E 4.

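For completeness, the snippet below illustrates how this resource-based reward behaves. The resource numbers are invented, and how "resources" are counted for a given protocol is not restated in this appendix, so treat this purely as an illustration of the normalization (the reference protocol obtains R = 1) and of the downward adjustment of the constant that is described for the scaling repeater in Appendix E.

```python
def reward(resources_used, const):
    """R = const^2 / resources; equals 1 when resources match the reference."""
    return const**2 / resources_used

reference_resources = 120.0            # made-up value, e.g. from the working-fidelity strategy
const = reference_resources ** 0.5     # initial guess: reference protocol gets R = 1

for protocol_resources in [150.0, 120.0, 110.0]:
    print(f"resources = {protocol_resources:5.1f} -> R = {reward(protocol_resources, const):.3f}")
    if protocol_resources < const**2:
        const = protocol_resources ** 0.5   # keep the maximum reward per trial at 1
```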
Appendix E: Scaling repeater

1. Environment description

In addition to the elementary actions from the distance-2 quantum repeater discussed above, we provide the agent with the ability to delegate solving smaller-scale problems of the same type to other agents, therefore splitting the problem into smaller parts. Then, the found sequence of actions is applied as one block action, as illustrated in Fig. 5a.

Goal: Entangled pair between the left-most and the right-most station with fidelity above the threshold fidelity F_th = 0.9.

Actions:
• Purify a pair with one entanglement purification step.
• Entanglement swapping at a station.
• Block actions of shorter lengths.

So for the setup with L repeater links, initially there are L purification actions and L − 1 entanglement swapping actions. Of course, the purification actions have to be adjusted every time an entanglement swapping is performed, to include the new, longer-distance pair. The block actions can be applied at different locations, e.g. a length-two block action can initially be applied at L − 1 different positions (which also have to be adjusted to include longer-distance pairs as entanglement swapping actions are chosen). So it is easy to see how the action space quickly gets much larger as L increases.

Percepts: Current position of the pairs and fidelity of each pair.

Reward: Again we use the resource-based reward function, as this is the metric we would like to optimize: R = \mathrm{const}^2 / \mathrm{resources}. The whole path is rewarded in full. The reward constant is obtained from an initial guess (see Appendix E 4) and adjusted downward once a better solution is found, such that the maximum possible reward from one trial is 1.

Comment on block actions: The main agent can use block actions for a wide variety of situations at different stages of the protocol. This means the sub-agents are tasked with finding block actions for a wide variety of initial fidelities, so a new problem needs to be solved for each new situation. In order to speed up the trials, we save situations that have already been solved by sub-agents in a big table and reuse the found action sequence if a similar situation arises (a minimal sketch of this caching is given at the end of this subsection).

a. Symmetric variant

We force a symmetric protocol by modifying the actions as follows:

Actions:
• Purify all pairs with one entanglement purification step.
• Entanglement swapping at every second active station.
• Block actions of shorter lengths, which have been obtained in the same, symmetrized manner.

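The caching of sub-agent solutions mentioned in the comment on block actions above can be sketched as follows. The cache key (number of links plus coarse-grained fidelities) and the sub-agent interface are assumptions made for illustration.

```python
solved_blocks = {}   # (number of links, rounded fidelities) -> action sequence

def block_action(segment_fidelities, run_sub_agent):
    """Return an action sequence for a smaller repeater of the same type."""
    key = (len(segment_fidelities),
           tuple(round(f, 2) for f in segment_fidelities))   # coarse-grained key
    if key not in solved_blocks:
        # Delegate the smaller-scale problem to a fresh sub-agent.
        solved_blocks[key] = run_sub_agent(segment_fidelities)
    return solved_blocks[key]

# Usage with a stand-in sub-agent that would normally be a PS agent itself:
dummy_sub_agent = lambda fids: ["purify"] * len(fids) + ["swap middle"]
print(block_action([0.7, 0.7], dummy_sub_agent))
print(block_action([0.701, 0.699], dummy_sub_agent))   # reuses the cached result
```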
FIG. 9. a–b Scaling repeater with 8 repeater links with symmetric initial fidelities of 0.7. a Best solution found by an agent for gate reliability p = 0.99. b Relative resources used by the agent's solution compared to the working-fidelity strategy for different gate reliability parameters. c–d Scaling repeater with 8 repeater links with very asymmetric initial fidelities (0.95, 0.9, 0.6, 0.9, 0.95, 0.95, 0.9, 0.6). c Best solution found by an agent for gate reliability p = 0.99. d Relative resources used by the agent's solution compared to the working-fidelity strategy for different gate reliability parameters.

2. Additional results and discussion

We also investigated different starting situations for this setup. Here we discuss two of them.

First, we also applied the agent that is not restricted to symmetric protocols to a symmetric starting situation. The results for initial fidelities F = 0.7 can be found in Fig. 9a–b. In general, the agent finds solutions that are very close but not equal to the working-fidelity strategy described in Appendix E 4. Remarkably, for some reliability parameters p the agent even finds a solution that is slightly better, either by switching around the order of operations a little or by exploiting a threshold effect, where omitting an entanglement purification step on one of the pairs is still enough to reach the desired threshold fidelity.

Finally, we also looked at a situation that is highly asymmetric, with starting fidelities (0.95, 0.9, 0.6, 0.9, 0.95, 0.95, 0.9, 0.6). Thus there are high-quality links on most connections, but two links suffer from very high levels of noise. The results depicted in Fig. 9c–d show that the advantage over the working-fidelity strategy is even more pronounced.

3. Repeater stations for setups with memory errors

As mentioned in the main text, in order to properly account for imperfections in the quantum memories, we need to know the distances between the repeater stations. In section IV C we looked at a total distance of just below 78.9 km with 7 intermediate repeater stations located at the positions shown in Table II. Together with the distance-dependent error model introduced in that section, these give rise to the asymmetric initial fidelities (0.8, 0.6, 0.8, 0.8, 0.7, 0.8, 0.8, 0.6), which we also used for the setup with perfect memories.

TABLE II. The positions of the repeater stations in section IV C. The terminal stations between which shared entanglement is to be established are located at positions 0 km and 78.9 km.

repeater station index   position
1                        6.8 km
2                        23.6 km
3                        30.4 km
4                        37.2 km
5                        48.5 km
6                        55.3 km
7                        62.1 km

In section IV D we considered a list of 7 possible locations to position repeater stations on, which we get by scaling down the previous scenario to a total distance of 20 km. We chose such a comparably short distance in order to make protocols with only one added repeater station a viable solution. The positions of the 7 possible locations are listed in Table III.

TABLE III. Possible locations at which repeater stations can be positioned in section IV D. The terminal stations between which shared entanglement is to be established are located at positions 0 km and 20 km.

possible location index   position
1                         1.73 km
2                         5.98 km
3                         7.71 km
4                         9.44 km
5                         12.29 km
6                         14.02 km
7                         15.75 km

4. Working-fidelity strategy

This is the strategy we use to determine the reward constants for the quantum repeater environments; it was presented in [12]. This strategy leads to a resource requirement per repeater station that grows logarithmically with the distance.

For repeater lengths with 2^k links it is a fully nested scheme and can therefore be stated easily:

1. Pick a working fidelity F_w.
2. Purify all pairs until their fidelity is F ≥ F_w.
3. Perform entanglement swapping at every second active station, such that there are half as many repeater links left.
4. Repeat from step 2 until only one pair remains (and therefore the outermost stations are connected).

We then optimize the choice of F_w such that the resources are minimized for the given scenario.

As we are dealing with repeater lengths that are not a power of 2 as part of the delegated subsystems discussed in the main text, the strategy is adjusted as follows for those cases:

1. Pick a working fidelity F_w.
2. Purify all pairs until their fidelity is F ≥ F_w.
3. Perform entanglement swapping at the station with the smallest combined distance of its left and right pair (e.g. 2 links + 3 links). If multiple stations are equal in that regard, pick the leftmost station.
4. Repeat from step 2 until only one pair remains (and therefore the outermost stations are connected).

Then, we again optimize the choice of F_w such that the resources are minimized for the given scenario.

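The nested scheme above can be sketched directly in code. The control flow follows the four steps for 2^k links; the pair-update formulas are an assumption used only to make the sketch runnable (Werner-state expressions for the recurrence protocol of Ref. [9] with perfect operations), whereas the calculations in this paper include imperfect gates and memories and optimize F_w over the resulting resource count.

```python
def purify_once(F):
    """One recurrence purification step on two Werner pairs of fidelity F (assumed model)."""
    numerator = F**2 + ((1 - F) / 3) ** 2
    p_success = F**2 + 2 * F * (1 - F) / 3 + 5 * ((1 - F) / 3) ** 2
    return numerator / p_success

def purify_until(F, F_w):
    """Step 2: purify a pair until its fidelity reaches the working fidelity."""
    while F < F_w:
        F = purify_once(F)
    return F

def swap(F1, F2):
    """Step 3 ingredient: entanglement swapping of two Werner pairs (assumed model)."""
    return F1 * F2 + (1 - F1) * (1 - F2) / 3

def working_fidelity_strategy(fidelities, F_w):
    """Run the nested scheme; len(fidelities) must be a power of 2."""
    while len(fidelities) > 1:
        fidelities = [purify_until(F, F_w) for F in fidelities]        # step 2
        fidelities = [swap(fidelities[i], fidelities[i + 1])           # step 3
                      for i in range(0, len(fidelities), 2)]
    return fidelities[0]                                               # only one pair remains

# Example values (initial fidelity and F_w chosen for illustration only):
print(working_fidelity_strategy([0.75] * 8, F_w=0.95))
```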
Appendix F: Computational Resources

All numerical calculations were performed on a machine with four Sixteen-Core AMD Opteron 6274 CPUs. In Table IV we provide the run-times of the calculations presented in this paper. Memory requirements were insignificant for all of our scenarios.

TABLE IV. Run time for our scenarios on a machine with four Sixteen-Core AMD Opteron 6274 CPUs.

scenario                                   run-time
Teleportation (Fig. 2b/c)                  Clifford: ∼5h, Universal: ∼3d
Teleportation Variant 1                    <1h
Teleportation Variant 2 (Fig. 2e)          ∼93h
Entanglement Purification (Fig. 3)         ∼7h
Quantum Repeater (Fig. 4)                  <1h
Scaling symmetric case (Fig. 5b, c)        ∼3h
Scaling asymmetric case (Fig. 5d, e)       ∼24h per data point
Scaling with memory errors (Fig. 6)        ∼25h per data point
Repeater station positions (Table I)       ∼4.5h