

Towards Joint Learning of Optimal MAC Signaling and Channel Access

Alvaro Valcarce, Senior Member, IEEE, and Jakob Hoydis, Senior Member, IEEE

Abstract—Communication protocols are the languages used by network nodes. Before a user equipment (UE) exchanges data with a base station (BS), it must first negotiate the conditions and parameters for that transmission. This negotiation is supported by signaling messages at all layers of the protocol stack. Each year, the telecoms industry defines and standardizes these messages, which are designed by humans during lengthy technical (and often political) debates. Following this standardization effort, the development phase begins, wherein the industry interprets and implements the resulting standards. But is this massive development undertaking the only way to implement a given protocol? We address the question of whether radios can learn a pre-given target protocol as an intermediate step towards evolving their own. Furthermore, we train cellular radios to emerge a channel access policy that performs optimally under the constraints of the target protocol. We show that multi-agent reinforcement learning (MARL) and learning-to-communicate (L2C) techniques achieve this goal with gains over expert systems. Finally, we provide insight into the transferability of these results to scenarios never seen during training.

Index Terms—communication system signaling, learning systems, mobile communication

A. Valcarce and J. Hoydis are with Nokia Bell-Labs France, Route de Villejust, 91620 Nozay, France. Email: {alvaro.valcarce_rial, jakob.hoydis}@nokia-bell-labs.com

Fig. 1. MAC protocol constituent parts. This paper trains UEs to jointly learn the policies highlighted as dashed blocks. [Block diagram: a MAC protocol is split into a signaling block and a channel access policy block, each comprising a vocabulary and a policy.]

I. INTRODUCTION

Current medium access control (MAC) protocols obey fixed rules given by industrial standards (e.g., [1]). These standards are designed by humans with competing commercial interests. Despite the standardization process' high costs, the ensuing protocols are often ambiguous and not necessarily optimal for a given task. This ambiguity increases the costs of testing, validation and implementation, especially in cross-vendor systems like cellular networks. In this paper, we study whether there might be an alternative approach to MAC protocol implementation based on reinforcement learning.

To implement a MAC protocol for a wireless application, vendors must respect the standardized signaling (i.e., the structure of the protocol data units (PDUs), the control procedures, etc.) and the channel access policy (listen before talk (LBT), FDMA, etc.). These are the main building blocks of a MAC protocol (see Fig. 1) and their implementation ultimately defines the performance that users will experience. The signaling defines what control is available to the radio nodes and when it is available. In doing so, it constrains the actions that a channel access policy can take and puts an upper bound on its attainable performance.

As wireless technologies evolve and new radio environments emerge (e.g., industrial IoT, beyond 6 GHz, etc.), new protocols are needed. The optimal MAC protocol may not be strictly contention-based or coordinated as shown in Table I. For instance, 5GNR supports grant-free scheduling by means of a Configured Grant to support ultra-reliable low-latency communication (URLLC). But are this access policy and its associated signaling jointly optimal? We are interested in the question of which MAC protocol is optimal for a given use case and whether it can be learned through experience.

A. Related work

Current research on the problem of emerging MAC protocols with machine learning focuses mainly on new channel access policies.

TABLE I
CHANNEL ACCESS SCHEMES, MAC PROTOCOLS AND THE TECHNOLOGIES THAT USE THEM

Channel access schemes (columns): frequency division multiple access (FDMA), subdivided into time-variant (non-OFDMA and OFDMA) and time-invariant; TDMA; CDMA; and spatial access.
MAC protocols (rows): contention-based, with example technologies Aloha, CSMA and Cognitive Radio (N.A. where no contention-based example exists); and coordinated, with example technologies GSM, FM/AM radio, LTE, 3G, Satcoms, Token Ring, POTS, DVB-T, 5G New Radio (5GNR), GPS, MIMO and Bluetooth.

Within this body of research, contention-based policies dominate (see [2], [3], [4] or [5]), although work on coordinated protocols also exists (e.g., [6], [7], [8]). Other approaches such as [9] propose learning to enable/disable existing MAC features.

Given a channel access scheme (e.g., time division multiple access (TDMA)), [2] asks whether an agent can learn a channel access policy in an environment where other agents use heterogeneous policies (e.g., q-ALOHA, etc.). Whereas this is interesting, it solely focuses on contention-based access and it ignores the signaling needed to support it. Instead, we focus on the more ambitious goal of jointly emerging a channel access policy and its associated signaling.

Dynamic spectrum sharing is a similar problem, where high performance has recently been achieved (see [10], [11]) by subdividing the task into the smaller sub-problems of channel selection, admission control and scheduling. However, these studies focus exclusively on maximizing channel-access efficiency under the constraints of a pre-given signaling. None focuses on jointly optimizing the control signaling and the channel access policy.

The field of Emergent Communication has been growing since 2016 [12]. Advances in deep learning have enabled the application of MARL to the study of how languages emerge and to teaching natural languages to artificial agents (see [13] or [14]). Due to the multi-agent nature of cellular networks and to the control-plane/user-plane traffic separation, these techniques generalize well to the development of machine-type languages (i.e., signaling). In cellular systems, we interpret the MAC signaling as the language spoken by UEs and the BS to coordinate while pursuing the goal of delivering traffic across a network.

B. Contribution

For the reasons mentioned above, we believe that machine-learned MAC protocols have the potential to outperform their human-built counterparts in certain scenarios. This idea has recently been used for emerging new digital modulation schemes (see, e.g., [15]). Protocols built this way are freed from human intuitions and may be able to optimize control-plane traffic and channel access in yet unseen ways.

The first step towards this goal is to train an intelligent software agent to learn an existing MAC protocol. Our agents are trained tabula rasa, with no previous knowledge or logic about the target protocol. We show that this is possible in a simplified wireless scenario with MARL, self-play and tabular learning techniques, and lay the groundwork for scaling this further with deep learning. In addition, we measure the influence of signaling on the achievable channel access performance. Finally, we present results on the extension of the learned signaling and channel access policy to other scenarios.

The rest of this paper is organized as follows. Section II formalizes the concepts of channel access, protocol and signaling. It then formulates the joint learning problem and defines the research target. Section III describes the multi-agent learning framework. Section IV illustrates the achieved performance, provides an example of the learned signaling and discusses the transferability of these results. A final discussion is provided in Section V.

II. PROBLEM DEFINITION

A. Definitions

We distinguish the concepts of channel access scheme and MAC protocol as follows:

Fig. 2. MAC protocol abstraction with UL data traffic only. [Block diagram: the UE MAC and the BS MAC are connected by an uplink user-plane pipe carrying user data, steered by the channel access policy, and by uplink and downlink control-plane pipes carrying signaling.]

Fig. 3. Left: Venn diagram of all possible signaling messages M = M_DL ∪ M_UL of size (l, m) bits. The set M* of messages for the optimal signaling S*, as well as the set M^k of messages for the k-th arbitrary signaling S_k, are also shown. Right: an optimum signaling S* and the k-th arbitrary signaling S_k within the abstract space S of all possible signalings (i.e., the signaling corpus).

• Channel access scheme: Method used by multiple radios to share a communication channel (e.g., a wireless medium). The channel access scheme is implemented and constrained by the physical layer (PHY). Sample channel access schemes are frequency division multiplexing (FDM), time division multiplexing (TDM), etc.
• MAC protocol: Combination of a channel access policy and a signaling with which a channel access scheme is used. Sample channel access policies are LBT, dynamic scheduling, etc. Signaling is the vocabulary and rules (i.e., the signaling policy) used by radios to coordinate, and is described by the PDU structure, the sub-headers, etc. The channel access policy decides when to send data through the user-plane pipe, and the signaling rules decide when to send what through the control-plane pipe (see Fig. 2).

Table I illustrates these definitions with examples of technologies that use these mechanisms.

B. Target MAC signaling

A complete commercial MAC layer provides numerous services to the upper layers (channel access, multiplexing of service data units (SDUs), hybrid automatic repeat request (HARQ), etc.). Our goal is to replace the MAC layer in a mobile UE by a learning agent that can perform all these functions and their associated signaling. However, for simplicity, this paper targets a leaner MAC layer with two main functions: wireless channel access, and automatic repeat request (ARQ). Let this agent be denoted as the MAC learner. This differs from an expert MAC implemented through a traditional design-build-test-and-validate approach. Several learners are then concurrently trained in a mobile cell to jointly learn a channel access policy and the MAC signaling needed to coordinate channel access with the BS. The BS uses an expert MAC that is not learned.

Let S be the set of all possible MAC signalings that UEs and a BS may ever use to communicate (see Fig. 3). We formalize a signaling as a vocabulary with downlink (DL) and UL messages, plus mappings from observations to these messages. Since different signalings are possible, let us denote the k-th signaling S_k ∈ S as:

S_k = [M_{DL}^k, M_{UL}^k, O_{BS} \to M_{DL}^k, O_{UE} \to M_{UL}^k]    (1)

where M_{DL}^k ⊆ M_DL and M_{UL}^k ⊆ M_UL are the sets of DL and UL messages of signaling S_k. O_BS and O_UE are generic observations obtained at the BS and the UE respectively, and can include internal states, local measurements, etc. In other words, a signaling defines the messages that can be exchanged and the rules under which they can be exchanged. These rules hence give meaning to the messages.
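To make (1) concrete, a minimal Python sketch (ours, not from the paper; the class, its field names and the toy mapping rules are illustrative assumptions) can represent a signaling as two message vocabularies plus the two observation-to-message mappings:

from dataclasses import dataclass
from typing import Any, Callable, FrozenSet

@dataclass(frozen=True)
class Signaling:
    """A signaling S_k: DL/UL vocabularies plus observation-to-message mappings."""
    dl_messages: FrozenSet[int]                  # M_DL^k, a subset of the DL message corpus
    ul_messages: FrozenSet[int]                  # M_UL^k, a subset of the UL message corpus
    bs_mapping: Callable[[Any], int]             # O_BS -> M_DL^k (used by the BS)
    ue_mapping: Callable[[Any], int]             # O_UE -> M_UL^k (used by each UE)

# Example: the target signaling S_BS of this paper (DL: null/SG/ACK, UL: null/SR),
# with placeholder mapping rules standing in for the real signaling policies.
S_BS = Signaling(
    dl_messages=frozenset({0, 1, 2}),            # 0: null, 1: scheduling grant, 2: ACK
    ul_messages=frozenset({0, 1}),               # 0: null, 1: scheduling request
    bs_mapping=lambda o_bs: 0,                   # placeholder rule
    ue_mapping=lambda o_ue: 1 if o_ue > 0 else 0 # placeholder rule: SR if the Tx buffer is non-empty
)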
In the above definitions, |M_{DL}^k| and |M_{UL}^k| are the sizes of the DL and UL signaling vocabularies, which implicitly define the amount of control data a single message can carry, i.e., the richness of the control vocabulary. Messages from non-compositional protocols with larger vocabularies can therefore feed more control information to the channel access policy. Although this comes at the expense of a higher signaling overhead, the richer context available to the radio nodes can enable more sophisticated algorithms for channel access. The size of the signaling vocabulary is hence an important hyper-parameter in emergent protocols to balance the trade-off between channel access performance and signaling cost.

The mappings from observations to messages define when to send what. A MAC signaling policy π_S describes one possible way of implementing this mapping. This signaling policy shall not be confused with the channel access policy, which describes when and how to transmit data.

The BS is an expert system with full knowledge of a standard MAC signaling S_BS ∈ S. Our first objective is to enable the UEs to communicate with the BS by learning to understand and use its signaling. Out of the many signalings UEs could learn, S_BS is the learning target. Note that S_BS is not necessarily the optimal signaling S* ∈ S, which depends on a chosen metric (e.g., throughput, latency, etc.).

The ideas presented here are generalizable to vocabularies of any size. For simplicity, we have reduced the size of the target signaling vocabulary to the minimum number of messages needed to support both uncoordinated and coordinated MAC protocols. In this paper, S_BS has the following DL messages that the MAC learners need to interpret:
• scheduling grants (SGs)
• acknowledgments (ACKs).
Similarly, the UL messages that the MAC learners need to learn to use are:
• scheduling requests (SRs).
Other messages such as buffer status reports can be added to accommodate larger vocabularies.

C. Channel access policy

The experiments described in this paper use a TDMA scheme, which was chosen for simplicity. The learning framework proposed here is nevertheless equally applicable to other channel access schemes, such as those listed in Table I.

Unlike for the MAC signaling, we impose no a-priori known policy with which to use the access scheme. This is to allow UEs to explore the full spectrum of channel access policies between contention-based and coordinated. For example, a channel access policy might follow a logic that ignores the available MAC signaling. Other channel access policies may take it into consideration and implement a coordinated access scheme based on a sequence of SRs, SGs and ACKs.

Hence, the second objective of this research is for the UEs to learn an UL channel access policy π_P that leverages the available signaling to perform optimally (according to some chosen channel access performance metric). The channel access policy is denoted with the subscript P to highlight that the MAC layer controls channel access by steering the underlying physical layer. The MAC layer commands the PHY through an application programming interface (API) and is therefore constrained by the services it offers. The MAC has no means of influencing the wireless shared channel without a PHY API (e.g., [16]). In our experiments, these services are limited to the in-order delivery of MAC PDUs through a packet erasure channel.

D. Channel model

All learners share the same UL frequency channel for transmitting their UL MAC PDUs and they access this shared UL data channel through their respective PHY layers. From the viewpoint of the MAC learner, the PHY is thus considered part of the shared channel. The channel accessed by each MAC learner is a packet erasure channel, where a transmitted PDU is lost with a certain block error rate (BLER). For simplicity, we assume that the BLER is the same for all data links and abstract away all other PHY features. This comes without loss of generality to the higher-layer protocol analysis.

Since we want to study the effects of the control signaling on the performance of the shared data channel, we assume that the UL and DL control channels are error free, costless and dedicated to each user without any contention or collisions.

E. BS signaling policy

Each time step, the BS receives zero or more SRs from the UEs.
It then chooses one of the requesting UEs at random and a SG is sent in response. An exception occurs if the UE had made a successful data transmission concurrently with the SR. In this case, instead of an SG, an ACK is sent to the UE and another UE is then scheduled at random. This is because only one UE can be scheduled in each time slot. More complex scheduling algorithms could be used here, but random scheduling has been chosen for simplicity.
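This behavior can be captured in a short Python sketch (our own illustration rather than the authors' simulator; the function signature and the way receptions are reported to the scheduler are assumptions, while the DL message encoding follows Section II-F5 below):

import random

NULL_DL, SG, ACK = 0, 1, 2   # DL messages (see Section II-F5)

def bs_signaling_step(sr_senders, successful_tx_ue):
    """One BS scheduling step of the expert signaling policy.

    sr_senders: list of UE indices that sent an SR this time step.
    successful_tx_ue: index of the UE whose UL PDU was received this step, or None.
    Returns a dict mapping UE index to DL message.
    """
    dl = {}
    if successful_tx_ue is not None:
        dl[successful_tx_ue] = ACK          # ACK the SDU received this time step instead of a grant
    # Grant the next slot to one requesting UE chosen at random; a UE that was just
    # ACKed is not granted, since only one UE can be scheduled per time slot.
    candidates = [u for u in sr_senders if u not in dl]
    if candidates:
        dl[random.choice(candidates)] = SG
    return dl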

Fig. 4. System model with one BS and two UEs. [Block diagram: each UE contains a traffic generator, an UL Tx buffer and a MAC learner on top of its PHY; the BS contains an expert MAC; the PHYs are connected through the shared UL data channel, which constitutes the environment.]

F. Multi-Agent Reinforcement Learning formulation

For any given UE, the sequential decision-making nature of channel access lets us model it as a Markov Decision Process (MDP), which can then be solved using the tools of reinforcement learning (RL). Using RL to train multiple simultaneous learners (i.e., UEs) constitutes what is known as a MARL setup. If the observations received by each learner differ from those of other learners, the problem becomes a partially observable Markov decision process (POMDP).

We have formulated this POMDP as a cooperative Markov game (see [17]), where all learners receive exactly the same reward from the environment. Learners are trained cooperatively to maximize the sum R of rewards:

R \triangleq \sum_{t=0}^{t_{max}} r_t    (2)

where t_max is the maximum number of time steps in an episode. This design decision reflects the objective of optimizing the performance of the whole cell, rather than that of any individual UE. An alternative approach that delivers different rewards to different UEs depending on their radio conditions could perhaps yield higher network-wide performance. However, this would require the design of multiple reward functions, which is known to be difficult (see [18]), and is left to future investigations.

1) Time dynamics: The MARL architecture is illustrated in Figure 4. U is the set of all MAC learners. On each time step t, each MAC learner u ∈ [0, |U|) invokes an action a_t^u on the environment, and it receives a reward r_{t+1} and an observation o_{t+1}^u. The actions of all learners are aggregated into a joint action vector a_t. The environment then executes this joint action and delivers the same reward r_{t+1} plus independent observations to all learners. The BS also receives a scalar observation ob_{t+1} following the execution of the joint action. The environment studied in this paper demands that each MAC learner delivers a total of P MAC SDUs to the BS. We performed experiments with the following simple SDU traffic models:
• Full buffer start: The UL Tx buffer is filled with P SDUs at t = 0.
• Empty buffer start: The UL Tx buffer is empty at t = 0. Then, it is filled with probability 0.5 with one new SDU each time step until a maximum of P SDUs have been generated.
The learners must also indicate awareness that the SDUs have been successfully delivered by removing them from their UL transmit buffer. An episode ends when the P SDUs of each and all of the |U| learners have successfully reached the BS and the learners have removed them from their buffers.

2) Observation space: Each time step t, the environment delivers a scalar observation o_t^u ∈ O^U = [0, L] to each learner u describing the number of SDUs that remain in the learner's UL transmit buffer. Here, L is the transmit buffer capacity and all SDUs are presumed to be of the same size. For example, the environment observation o_t^u |_{u=5, t=2} = 3 indicates to learner 5 that, at time t = 2, three SDUs are yet to be transmitted.

Similarly, at each time step the BS receives a scalar observation ob_t ∈ O^B = [0, |U| + 1] from the environment. This observation can take the following meanings:
• ob_t = 0: the UL channel is idle
• ob_t ∈ [1, |U|]: collision-free UL transmission received from UE ob_t − 1
• ob_t = |U| + 1: collision in the UL channel.
For example, the environment observation ob_t |_{t=4} = 3 indicates that, at time t = 4, the BS successfully received a MAC PDU from UE 2.
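As an illustration (ours, not part of the paper's simulator), the BS observation encoding just described can be written as a small helper; for brevity it ignores the packet-erasure aspect of the data channel:

def bs_observation(transmitting_ues, num_ues):
    """Encode the BS observation ob_t in [0, |U| + 1] from one time step's transmissions.

    transmitting_ues: list of UE indices that invoked the 'transmit' action this step.
    """
    if len(transmitting_ues) == 0:
        return 0                        # UL channel idle
    if len(transmitting_ues) == 1:
        return transmitting_ues[0] + 1  # collision-free transmission from UE (ob_t - 1)
    return num_ues + 1                  # collision

# Example: with |U| = 2, a lone transmission by UE 1 yields ob_t = 2,
# while simultaneous transmissions by UE 0 and UE 1 yield ob_t = 3.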

3) Channel access action space: Learners follow a channel access policy by executing actions a_t^u ∈ A_P every time step. These actions have an effect on the environment by steering the physical layer and are A_P = {0, 1, 2}, which are interpreted by the environment as follows:
• a_t^u = 0: Do nothing
• a_t^u = 1: Transmit the oldest SDU in the buffer. Invoking this action when the UL buffer is empty is equivalent to invoking a_t^u = 0.
• a_t^u = 2: Delete the oldest SDU in the buffer. (A third action space with buffer-management actions could also have been defined. For simplicity, we only use this buffer-related action and include it in A_P.)
The channel access actions from all learners are aggregated into a joint action vector a_t = [a_t^0, a_t^1, ..., a_t^{|U|-1}], which is then executed on the environment at once. For example, invoking action a_2 = [1, 2, 0] on the environment indicates that, at time t = 2, UE 0 attempts an SDU transmission while UE 1 deletes an SDU from its buffer, and UE 2 remains idle.

4) Uplink signaling action space: Following the approach introduced in [12], in each time step, MAC learners can also select an UL signaling action n_t^u ∈ A_S. This action maps to a message received by the BS MAC, and it exerts no direct effect onto the environment. These messages are thus transmitted through a dedicated UL control channel that is separate from the shared UL data channel modeled by the environment (see Fig. 2). The UL signaling actions available to the MAC learners are M_UL = {0, 1}, which are interpreted by the BS MAC expert as follows:
• n_t^u = 0: Null uplink message
• n_t^u = 1: Scheduling Request.

5) Downlink messages: Finally, the BS MAC expert can, each time step, select a DL signaling action m_t^u towards each MAC learner. These messages have no direct effect on the environment and are transmitted through a dedicated DL control channel. The actions available to the BS MAC expert are M_DL = {0, 1, 2} and have the following meanings:
• m_t^u = 0: Null downlink message
• m_t^u = 1: Scheduling Grant for the next time step
• m_t^u = 2: ACK corresponding to the last SDU received in the uplink.
Note that the MAC learners are unaware of the meanings of these messages. They must learn them from experience. Larger sizes of the DL vocabulary M_DL would let the BS communicate more complex schedules to the UEs (e.g., by granting radio resources further into the future). This would increase training duration significantly and is therefore out of the scope of this paper but left for future investigation.

6) Learners' memory: We endow each MAC learner with an internal memory h_t^u ∈ H^U to store the past history of observations, PHY and UL signaling actions, as well as the DL messages received from the BS. N denotes the size, in number of past time steps, of this internal memory, which takes the form:

h_t^u = [m_{t-N}^u, a_{t-N}^u, n_{t-N}^u, o_{t-N}^u, \ldots, m_{t-1}^u, a_{t-1}^u, n_{t-1}^u, o_{t-1}^u].    (3)

For example, if N = 1, the internal memory h_t at time t contains only the observation, actions and messages from time t − 1. The motivation for this memory is the need to disambiguate instantaneous observations that may seem equal to a given MAC learner, but emanate from the un-observed actions concurrently taken by other learners. In short, memory addresses the problem of partial observability by the learners. This is loosely based on the idea of fingerprinting introduced in [19].

The current state of the memory h_t^u and the current observation o_t^u constitute the two main inputs considered by the learners during action selection: (a_t^u, n_t^u) = f(o_t^u, h_t^u). For example, a learner with current memory state h_t^u = [1, 0, 0, 1, 0, 0, 1, 1] and current observation o_t^u = 1 describes a situation where the learner received a SG in response to a previous SR due to a non-empty Tx buffer. The learner's current policy will then take this context into consideration for deciding the next channel access action and UL signaling message (if any).

7) Reward function: The environment delivers a reward of r_t = −1 on all time steps. This motivates MAC learners to finish the episode as quickly as possible (i.e., in the smallest number of time steps). The conditions for the termination of an episode were described in Section II-F1.
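The internal memory of Section II-F6, Eq. (3), can be sketched as a fixed-length record of the last N time steps (our own illustration; class and method names are assumptions), flattened into the context used for action selection:

from collections import deque

class LearnerMemory:
    """Rolling history h_t of the last N (DL message, channel action, UL message, observation) tuples."""
    def __init__(self, n_steps):
        # Start with "null" entries so that the flattened memory always has length 4 * N.
        self.steps = deque([(0, 0, 0, 0)] * n_steps, maxlen=n_steps)

    def update(self, dl_msg, channel_action, ul_msg, observation):
        self.steps.append((dl_msg, channel_action, ul_msg, observation))

    def as_tuple(self):
        # Flattened form [m_{t-N}, a_{t-N}, n_{t-N}, o_{t-N}, ..., m_{t-1}, a_{t-1}, n_{t-1}, o_{t-1}],
        # directly usable as a key into tabular Q functions.
        return tuple(x for step in self.steps for x in step)

# Example with N = 2: after two updates the context is an 8-tuple such as (1, 0, 0, 1, 0, 0, 1, 1),
# matching the worked example of Section II-F6.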
III. METHODS

A. Tabular Q-learning

Given the definition of the reward r_t of Section II-F1, the return G_t at time t is defined as (see [20]):

G_t \triangleq \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}    (4)

where γ ∈ [0, 1] is a discount factor. The action-value function, under an arbitrary policy π, for a given observation-action pair (o_t, a_t) is then defined as the expected return when that observation is made and the action taken:

Q_\pi(o_t, a_t) = E_\pi[G_t \mid a_t, o_t].    (5)

Q-learning [21] is a well known technique for finding the optimal action-value function, regardless of the policy being followed. It stores the value function Q(o, a) in two-dimensional tables, whose cells are updated iteratively with:

Q(o_t, a_t) \leftarrow Q(o_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(o_{t+1}, a) - Q(o_t, a_t) \right].    (6)

Independently of how the table is initialized, Q will converge to the optimal action-value function as long as sufficient exploration is provided.

Independent Q-learners are known to violate the Q-learning convergence requirements (see [22]). However, in practice, good results have been obtained when combining them with memory-capable agents and decentralized training. These two techniques are essential to address the non-stationarity of multi-agent Q-learning problems and have been successfully applied in [12] in an L2C setting. Our MAC learner architecture and training procedure are a tabular adaptation of the RIAL architecture described in [12], whose success is one of the main motivations for using Q-learning here.

Fig. 5. Internal structure of a MAC learner.

Each MAC learner is composed of two Q tables (see Fig. 5), which contain action value estimates Q_P for the physical-layer actions and action value estimates Q_S for the signaling actions. The reduced size of the problem described in Section II allows for the application of a traditional Q-learning approach based on Q tables (see [20]). This is less data hungry than deep learning approaches, i.e., it needs fewer samples to converge, and is therefore appropriate as a proof of concept. However, tabular Q-learning scales poorly and is certainly not suitable for larger problems (e.g., |U| >> 2, more SDUs, larger memories, etc.). In such cases, deep learning approaches are needed (the tables in Fig. 5 can easily be replaced by deep neural networks).

Then, every time step t, each MAC learner follows two different policies, π_P and π_S, to act:

\pi_P: O^U \times H^U \to A_P
\pi_S: O^U \times H^U \to M_{UL}.    (7)

Although the actions a_t ∈ A_P and n_t ∈ M_UL are chosen independently, both policies are synchronous due to the conditioning of both Q tables on the same current observation and state of the internal memory:

a_t = \arg\max_a Q_P(a \mid o_t, h_t)    (8)
n_t = \arg\max_n Q_S(n \mid o_t, h_t).    (9)
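A minimal sketch of the resulting two-table learner (ours; the dictionary-backed tables and default hyper-parameter values, matching the learning rate and discount factor of Table II, are assumptions) combining the update (6) with the greedy selections (8)-(9) and ε-greedy exploration could look as follows:

from collections import defaultdict
import random

class TabularMACLearner:
    """Two tabular action-value functions: Q_P over A_P = {0, 1, 2}, Q_S over M_UL = {0, 1}."""
    def __init__(self, alpha=0.3, gamma=1.0):
        self.q_p = defaultdict(lambda: [0.0, 0.0, 0.0])    # Q_P(a | o, h)
        self.q_s = defaultdict(lambda: [0.0, 0.0])         # Q_S(n | o, h)
        self.alpha, self.gamma = alpha, gamma

    def act(self, context, epsilon=0.0):
        """context = (o_t, h_t); returns (channel access action a_t, UL signaling action n_t)."""
        if random.random() < epsilon:
            return random.randrange(3), random.randrange(2)    # epsilon-greedy exploration
        a = max(range(3), key=lambda a: self.q_p[context][a])  # greedy choice, Eq. (8)
        n = max(range(2), key=lambda n: self.q_s[context][n])  # greedy choice, Eq. (9)
        return a, n

    def update(self, context, a, n, reward, next_context):
        """One Q-learning step, Eq. (6), applied to both tables with the shared reward."""
        target = reward + self.gamma * max(self.q_p[next_context])
        self.q_p[context][a] += self.alpha * (target - self.q_p[context][a])
        target = reward + self.gamma * max(self.q_s[next_context])
        self.q_s[context][n] += self.alpha * (target - self.q_s[context][n])

In larger problems, the two dictionaries would simply be replaced by the deep networks mentioned above.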

B. Training procedure

MAC learners are trained using self-play [23], which is known to reach a Nash equilibrium of the game when it converges. This yields policies that can be copied to all UEs and used during deployment. Training is centralized, with one central copy of the Q_P and Q_S tables being updated regularly as experience from all MAC learners is collected. In our experiments, we update the tables after each time step, although other schedules are possible. Then, each time step, all learners choose their actions based on the same version of the trained tables (i.e., decentralized execution) combined with an ε-greedy policy. The exploration probability ε is annealed between training episodes with ε_e = max(ε_{e−1} · F_ε, 0.01), where ε_0 = 1, F_ε is the exploration decay factor, and e > 0 is the training episode number. Asymptotically, ε-greedy guarantees that learners try all actions an infinite number of times. This ensures that Q(o_t, a_t) converges to the optimal Q*(o_t, a_t). However, training is never run indefinitely, so the theoretical guarantee loses relevance and leaves the door open to alternative exploration methods such as, e.g., optimistic initial action values (see [20]). This centralized training with decentralized execution approach has been chosen to address the non-stationarity pathology typical of setups with independent learners [24]. Deploying the same policies to all learners reduces the variance of observations during learning and helps convergence.
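The training loop itself can then be sketched as follows (our own pseudo-driver; the environment object and its reset()/step() interface are hypothetical, and the learner is the two-table sketch above):

def train(env, learner, num_episodes, f_eps, eps_min=0.01):
    """Centralized training with decentralized execution: every UE acts with the same central tables."""
    epsilon = 1.0                                    # eps_0 = 1
    for episode in range(num_episodes):
        contexts = env.reset()                       # one (o_t, h_t) context per MAC learner
        done = False
        while not done:
            joint_action = [learner.act(c, epsilon) for c in contexts]
            next_contexts, reward, done = env.step(joint_action)
            for c, (a, n), c_next in zip(contexts, joint_action, next_contexts):
                learner.update(c, a, n, reward, c_next)   # central tables updated after each time step
            contexts = next_contexts
        epsilon = max(epsilon * f_eps, eps_min)      # eps_e = max(eps_{e-1} * F_eps, 0.01)
    return learner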

This training procedure can be performed in a server farm, where thousands of cross-vendor UEs (cabled and/or over-the-air (OTA)) would contribute with experiences to the training of a central UE MAC model. From a practical viewpoint, centralized training also spares UEs from executing costly training algorithms. This way, most of the computational workload is shifted towards a central server. This avoids impacting the UEs' battery performance and escapes constraints due to UE mobility. It is precisely this procedure which could replace the more traditional design-build-test-and-validate approach to MAC layer development in future protocol stacks.

Different BS models can also be incorporated into the training testbed to increase the environment variance. This can generalize the learned policies and improve performance when deployed in mobile networks different from the ones used during training. Along these lines, zero-shot coordination techniques (see [25]) could also help.

The system was trained for a fixed number N_tr of consecutive training episodes, followed by a fixed number N_eval of consecutive evaluation episodes. This sequence of N_tr + N_eval episodes is called a training session. The Q tables were only updated during training episodes, but not during evaluation episodes. The ε-greedy policy with ε decay was followed only during training episodes. Evaluation episodes follow strictly the learned policy and are thus free of exploration variance. For each configuration (set of hyper-parameters), training sessions were repeated N_rep times with different random seeds. The convergence time grows with the problem size, especially with the hyper-parameters P and |U|. Fig. 9 illustrates how even a small scenario with |U| = 2 UEs and P = 2 SDUs requires several million training episodes to converge.

TABLE II
SCENARIO SETTINGS AND SIMULATION HYPER-PARAMETERS

Symbol   Name                              Value
BLER     block error rate                  0, 10^-4, ..., 10^-1
SB       start buffer                      full, empty
P        number of MAC SDUs                1, 2
|U|      number of UEs                     1, 2
t_max    maximum time steps per episode    4, 8, 16, 32
N_tr     number of training episodes       2^13, 2^14, ...
N_eval   number of evaluation episodes     128
N_rep    training session repetitions      4, 8
γ        discount factor                   1
F_ε      exploration decay factor          0.999991, 0.9999991
α        learning rate                     0.3
N        memory length of MAC learner      1, 2, 3, 4

IV. RESULTS

The training procedure described above was tested in simulation, and its performance R measured as the reward collected per episode (see (2)). The optimal hyper-parameters have been chosen via grid search in the following discretized sets:
• Discount factor: [0, 0.5, 1]
• Exploration decay factor: [0.99991, 0.999991, 0.9999991]
• Learning rate: [0.05, 0.06, 0.1, 0.2, 0.3, 0.4]
Table II collects the best-performing parameters.

A. Expert baseline

Algorithm 1 Expert channel access policy
Require: o_t, m_{t-1}, a_{t-1}
if m_{t-1} = SG and o_t > 0 and a_{t-1} ≠ 1 then
    Transmit oldest SDU in the Tx buffer (a_t = 1)
else if m_{t-1} = ACK and o_t > 0 then
    Delete oldest SDU in the Tx buffer (a_t = 2)
else
    Do nothing (a_t = 0)
end if

Algorithm 2 Expert signaling policy
Require: o_t, m_{t-1}
if o_t > 0 then
    if m_{t-1} = 0 or (o_t > 1 and m_{t-1} = ACK) then
        Send an SR to the BS (n_t = 1)
    else
        Do nothing (n_t = 0)
    end if
else
    Do nothing (n_t = 0)
end if

For comparison purposes, the performance obtained by a population of expert (i.e., non-learner) UEs is also shown. The expert UE has complete knowledge about the BS signaling semantics and only transmits following the reception of an SG. Similarly, it only deletes an SDU from its buffer following the reception of an ACK. These UEs follow a fully coordinated channel access policy and do not profit from potential gains due to stochastic contention. Pseudocode for the channel access and signaling policies implemented by the expert UE is provided in Algorithms 1 and 2, respectively.
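Algorithms 1 and 2 translate directly into the following Python sketch (the encodings of messages and actions follow Section II-F; the variable names are ours):

SG, ACK = 1, 2   # DL messages; 0 is the null message

def expert_channel_access(o_t, m_prev, a_prev):
    """Algorithm 1: expert channel access policy."""
    if m_prev == SG and o_t > 0 and a_prev != 1:
        return 1        # transmit the oldest SDU in the Tx buffer
    if m_prev == ACK and o_t > 0:
        return 2        # delete the oldest SDU in the Tx buffer
    return 0            # do nothing

def expert_signaling(o_t, m_prev):
    """Algorithm 2: expert signaling policy."""
    if o_t > 0 and (m_prev == 0 or (o_t > 1 and m_prev == ACK)):
        return 1        # send a scheduling request
    return 0            # null uplink message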

Fig. 6. Average performance over eight independent training sessions. The performance is evaluated over N_eval = 128 episodes with different seeds. The environment was configured with P = 1, BLER = 0.5 and start buffer full. MAC learners were trained with a learning rate of 0.05. Learners on the |U| = 1 environment were trained with an internal memory length of N = 1 for N_tr = 2^13 = 8192 episodes, while learners on the |U| = 2 environment were trained with an internal memory length of N = 3 for N_tr = 2^18 = 262144 episodes. No exploration was used (i.e., ε = 0). The length of the error bars equals the standard error of the mean performance (calculated over the 8 independent training sessions). [Plot: performance R versus the maximum time steps per episode t_max, with curves for the theoretical optimum (|U| = 1) and for the experimental expert and learner results with |U| = 1 and |U| = 2.]

B. Optimal policy

The optimal policy π* = (π*_P, π*_S) for a single MAC learner that needs to transmit a single SDU (i.e., |U| = 1 and P = 1) can be intuitively deduced and its performance modeled analytically. Considering that the total reward that can be collected in a given episode depends on t_max, there may be different optimal policies for different values of this parameter. Indeed, the optimal channel access policy for low t_max consists in transmitting an SDU at t = 0 and immediately removing it from the Tx buffer in the next time step. UEs using this policy ignore the ACKs because, on average, waiting for the ACK before removing the SDU takes too long. Let this optimal policy be π^(1), with expected performance:

R^{(1)} = b \, (2 - t_{max}) - 2    (10)

where b denotes the BLER. When the SDU transmission at t = 1 succeeds, this leads to a sum reward of −2. If the transmission fails, the UE will get, under policy π^(1), the minimum reward of −t_max. The value of t_max for which this policy is optimal hence depends on the BLER.

For higher t_max, the MAC learner is encouraged to wait for an ACK because the risk of ignoring it is too high due to the long episode duration. Let this optimal policy for |U| = 1 be π^(2). One can easily derive the expected performance analytically as:

R^{(2)} = \begin{cases} -t_{max}, & t_{max} < 4 \\ -(b + 3), & t_{max} = 4 \\ (b - 1)\left(3 + \sum_{i=4}^{t_{max}-1} i \, b^{i-3}\right) - t_{max} \, b^{t_{max}-3}, & t_{max} > 4. \end{cases}    (11)

The expected optimum performance in this scenario (|U| = 1 and P = 1) is shown in Fig. 6 and is:

R^* = \max(R^{(1)}, R^{(2)}).    (12)

It is worth noting that the MAC learners' behavior is an artifact of the reward function. The reward described in Section II-F7 encourages them to accomplish a task as fast as possible. Alternatively, one could provide a reward of 1 (or 1 for each message) as soon as the messages have been successfully delivered and a reward of −1 if the maximum duration has passed.

Fig. 6 shows how MAC learners use π^(1) or π^(2) depending on the maximum episode length. Both of these policies ignore SGs and are therefore useless in scenarios with more than |U| = 1 UE, where collisions impose the need for MAC learners to respect the scheduling allocation decided by the BS. Fig. 6 also shows experimental results for the |U| = 2 scenario to illustrate how performance scales with increasing numbers of UEs. Similarly to the single-UE case, ACKs are largely ignored in the low t_max regime when |U| = 2 UEs are present. In this larger scenario, the vast size of the signaling solution space makes it difficult to find the optimal policy, much less a closed-form expression for its performance.
Nevertheless, a measure of gain can still be obtained by comparing the performance of the learned policies against that of a known expert (see Fig. 6 for the gains above the expert).
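The closed-form expressions (10)-(12) are straightforward to evaluate numerically; a small sketch (ours):

def r1(b, t_max):
    """Expected return of policy pi(1) (transmit at once, delete immediately), Eq. (10)."""
    return b * (2 - t_max) - 2

def r2(b, t_max):
    """Expected return of policy pi(2) (wait for the ACK before deleting), Eq. (11)."""
    if t_max < 4:
        return -t_max
    if t_max == 4:
        return -(b + 3)
    return (b - 1) * (3 + sum(i * b ** (i - 3) for i in range(4, t_max))) - t_max * b ** (t_max - 3)

def r_star(b, t_max):
    """Expected optimum performance for |U| = 1 and P = 1, Eq. (12)."""
    return max(r1(b, t_max), r2(b, t_max))

# Example: with BLER b = 0.5, r_star(0.5, 4) is attained by pi(1) while r_star(0.5, 32) is attained
# by pi(2), reproducing the switch between the two optimal policies visible in Fig. 6.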

C. BLER impact on MAC training

In the absence of block errors, MAC learners learn to ignore ACKs. On the other hand, unexpected PDU losses motivate MAC learners to interpret the DL ACKs before deleting transmitted SDUs from the Tx buffer. This is illustrated in the learned MAC signaling trace of Fig. 7, which shows the MAC learners removing SDUs from the UL Tx buffer at time steps immediately following the reception of an ACK (deletions at t = 3 and t = 6 for MAC 1, and at t = 5 for MAC 0).

Interestingly, MAC 0 also removed an SDU at t = 7 before it had time to process the received ACK. The reason is that, unlike in all other transmissions, the transmission at t = 6 was preceded by a SG at t = 5, which guarantees it to be collision-free. In this scenario, the BLER was so low that, on average, removing the SDU from the buffer before waiting for an ACK yields a higher reward. This suggests that MAC protocols may perform better by skipping ACKs during dynamic scheduling in low BLER regimes. This is clearly something not typically done in human-designed protocols.

As expected, having to wait for the ACKs before deleting SDUs from the Tx buffer reduces performance, since episodes take longer to complete (see Fig. 8). The presence of BLER in the data channel also slows down training due to a larger number of state transitions (see Fig. 9).

Fig. 7. Best performing evaluation episode in a scenario with |U| = 2, P = 2, N = 4, t_max = 32, empty buffer start and BLER = 10^-3. The small 4-cell grid depicts the UL Tx buffer, with one dark cell per available SDU. [Message sequence chart between MAC learner 0, MAC learner 1 and the BS MAC expert over time steps t = 0 to t = 8, showing Scheduling Request, Scheduling Grant, Transmit, ACK and Delete events.]

D. The importance of signaling

A major question that arises in learning-to-communicate settings is whether gains are due to either optimized action policies or better communication (i.e., signaling in our case), or both. In our MARL formulation, MAC learners have two action spaces and are trained to learn two distinct policies (i.e., a channel access and a signaling policy). Hence, in an extreme case, it is conceivable for the learners to ignore all DL signaling and learn an optimized channel access policy instead. In this case, no MAC protocol signaling would be needed.

To address the previous question, we have calculated the Instantaneous Coordination (IC) metric proposed in [26]. For the u-th UE, IC^u is defined in this scenario as the mutual information between the BS's DL MAC messages received at time t and the UE MAC's channel access actions at time t + 1:

IC^u = I(m_t^u; a_{t+1}^u) = \sum_{a_{t+1}^u \in A_P} \sum_{m_t^u \in M_{DL}} p(m_t^u, a_{t+1}^u) \log \frac{p(m_t^u, a_{t+1}^u)}{p(m_t^u)\, p(a_{t+1}^u)}.    (13)

The marginal probabilities p(m_t^u) and p(a_{t+1}^u), and the joint probabilities p(m_t^u, a_{t+1}^u), can be obtained by averaging DL message and channel access action occurrences in simulation episodes after training.

Fig. 10 shows that the performance R and the Instantaneous Coordination IC are clearly correlated (i.e., higher performance is concurrently achieved with higher IC). The Pearson correlation coefficient ρ_{IC,R} = 0.91 can be calculated as:

\rho_{IC,R} = \frac{cov(IC, R)}{\sigma_{IC}\, \sigma_R}    (14)

where σ_IC and σ_R are the standard deviations of the Instantaneous Coordination and performance samples, respectively. This suggests that high performance may only be achievable under high influence (i.e., high IC) from the BS onto the UEs. The signaling vocabulary (see Fig. 1) thus plays a major role in the performance a MAC protocol can achieve.
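Estimating IC from evaluation traces amounts to an empirical mutual-information computation; the following sketch (ours; the trace format is an assumption) illustrates it:

import math
from collections import Counter

def instantaneous_coordination(pairs):
    """Empirical IC = I(m_t ; a_{t+1}) from a list of (DL message at t, channel action at t+1) pairs, Eq. (13)."""
    n = len(pairs)
    joint = Counter(pairs)
    count_m = Counter(m for m, _ in pairs)
    count_a = Counter(a for _, a in pairs)
    ic = 0.0
    for (m, a), c in joint.items():
        p_ma = c / n
        ic += p_ma * math.log(p_ma / ((count_m[m] / n) * (count_a[a] / n)))
    return ic

# The Pearson correlation between IC and performance R over training sessions, Eq. (14),
# could then be computed with, e.g., numpy.corrcoef(ic_samples, r_samples)[0, 1].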

Fig. 8. Best performance out of 4 independent training sessions. The performance is evaluated over N_eval = 128 episodes with different seeds. The environment was configured with |U| = 2, P = 2 and t_max = 32. MAC learners were trained with learning rates between 0.1 and 0.3 and an internal memory length of N = 4. Start buffer full learners were trained for N_tr = 2^20 ≈ 1 million episodes, while start buffer empty learners were trained for N_tr ≤ 4 million episodes. ε-greedy exploration with ε decay from 1 to 0.01 was used. [Plot: performance R versus BLER, with learner and expert curves for full and empty start buffers.]

Fig. 9. Average learning progress (over 4 independent training sessions) under various BLER conditions. The environment was configured with |U| = 2, P = 2, t_max = 32 and empty start buffer. MAC learners were trained with a learning rate of 0.3 and an internal memory length of N = 4 for N_tr = 2^22 ≈ 4 million episodes. ε-greedy exploration with ε decay from 1 to 0.01 was used. [Plot: performance R and probability of exploration ε versus training episodes, with one curve per BLER value.]

Fig. 10. Relationship between the Instantaneous Coordination (IC) and the performance R across various training sessions. IC is averaged across all UEs. The length of the vertical error bars equals the standard error of the mean IC, while the length of the horizontal error bars equals the standard error of the mean performance (calculated over 4 independent training sessions). The environment was configured with |U| = 2, P = 2, t_max = 32 and empty start buffer. MAC learners were trained with a learning rate of 0.3 and an internal memory length of N = 4 for N_tr = 3M episodes. ε-greedy exploration with ε decay from 1 to 0.01 was used. [Plot: IC versus performance R, with one marker series per BLER value.]

E. Generalization

How well can the learned signaling and channel access policy perform in conditions never seen during training? This question is about the generalization capacity of the learning algorithm (i.e., its robustness against environment variations, see [27]). We address this by first training the MAC learner in a default environment. We then confront the trained learners against environments that differ in a single hyper-parameter from the default parameters.

From Figs. 11 and 12, generalization seems robust across environment variations of BLER and |U|. MAC learners can therefore be quickly trained in low-load scenarios with mild BLER conditions. Then, these trained systems can still be expected to perform well in more challenging channels. However, Fig. 13 shows that performance degrades rapidly with the number of SDUs to be transmitted. In this case, the trained learners have overfit to the number of SDUs to be transmitted. Training in environments with a larger number of SDUs may alleviate this problem, although convergence in that case was elusive in our experiments.

Fig. 11. Best learner's performance (over 16 independent training sessions). Training proceeds with |U|_tr MAC learners but performance is evaluated with |U|_eval MAC learners. The environment was configured with P = 1, BLER = 10^-4, t_max = 32 and an empty start buffer. MAC learners were trained with a learning rate of 0.06 and an internal memory length of N = 4 for N_tr = 2^24 = 16M episodes. ε-greedy exploration with ε decay from 1 to 0.01 was used. [Plot: performance R versus the number of UEs during evaluation |U|_eval, with curves for |U|_tr = 1, 2, 3, 4 and for the expert.]

Fig. 12. Best learner's performance (over 32 independent training sessions). Training proceeds with BLER_tr but performance is evaluated with BLER_eval. The environment was configured with |U| = 2, P = 2, t_max = 32 and empty start buffer. MAC learners used a learning rate of 0.1 and an internal memory length of N = 4. ε-greedy exploration with ε decay from 1 to 0.01 was used. Learners on the BLER_tr = 0, 10^-1, 10^-2 environments were trained respectively for N_tr = 2^23, 2^24, 2^25 (≈ 8M, 16M, 32M) episodes. [Plot: performance R versus the block error rate during evaluation BLER_eval, with one curve per BLER_tr.]

Fig. 13. Best learner's performance (over 4 independent training sessions). Training proceeds with P_tr SDUs, but performance is evaluated with P_eval SDUs. The environment was configured with |U| = 2, BLER = 10^-4, t_max = 32 and an empty start buffer. MAC learners were trained with a learning rate of 0.06 and an internal memory length of N = 4 for N_tr = 2^24 ≈ 16M episodes. ε-greedy exploration with ε decay from 1 to 0.01 was used. [Plot: performance R versus the number of transmitted MAC SDUs during evaluation P_eval, with curves for P_tr = 1, 2 and for the expert.]

V. CONCLUSIONS

This paper has demonstrated that RL agents can be jointly trained to learn a given simple MAC signaling and a wireless channel access policy. It has done so using a purely tabular RL approach in a simplified wireless environment. It is expected that deep learning approaches may be able to scale these techniques to larger and more practical scenarios. The paper has also shown that these agents can achieve the optimal performance when trained tabula rasa, i.e., without any built-in knowledge of the target MAC signaling or channel access policy.

A main experimental observation of these experiments is that the trained agents understand, but do not always comply with, the DL signaling from the BS. This yields MAC protocols that cannot be classified as contention-based or coordinated, but are fluidly in between. This is in line with 5GNR's grant-free scheduling mechanism, although a major difference is that the protocols learned by our agents are not BS-controlled. A key advantage of protocols learned this way is that they can co-exist with the pre-existing human-designed protocol, while exploiting the optimizations uncovered during training.

Finally, this research has shown that the learned protocols depend on the deployment scenario. This is one reason why they might yield higher performance than non-trainable protocols, whose functioning is static and independent of these parameters.

A. Future work

For the methodology presented here to be practical, a number of obstacles must be overcome. The immediate next steps include a study on the sensitivity of the learned policies to the training environment. This should answer whether it is possible to train MAC learners in networks with a reduced number of UEs and traffic volume, and then deploy them in larger networks. A more detailed scalability analysis is also necessary (can these agents be trained in a reasonable amount of time in larger environments?). The mirror problem of training a BS to learn an expert UE signaling may also illuminate the difficulties of protocol learning.

The ultimate goal is to jointly train UEs and the BS to emerge a fully new MAC protocol (i.e., to learn not only the signaling and channel access policies, but also the signaling vocabulary of Fig. 1). Recent research (e.g., [26], [13]) suggests this is hard when all agents (i.e., the UEs and the BS) begin with no previous protocol knowledge. Consequently, scheduled training that alternates between supervised learning and self-play, as suggested in [14], seems promising to emerge fully new protocols.

REFERENCES

[1] 3GPP, "3GPP TS 38.321. Medium Access Control (MAC) protocol specification," 2020.
[2] Y. Yu, T. Wang, and S. C. Liew, "Deep-Reinforcement Learning Multiple Access for Heterogeneous Wireless Networks," in IEEE International Conference on Communications (ICC), Kansas City, MO, 2018.
[3] Y. Yu, S. C. Liew, and T. Wang, "Non-Uniform Time-Step Deep Q-Network for Carrier-Sense Multiple Access in Heterogeneous Wireless Networks," CoRR, Oct. 2019. [Online]. Available: http://arxiv.org/abs/1910.05221
[4] A. Destounis, D. Tsilimantos, M. Debbah, and G. S. Paschos, "Learn2MAC: Online Learning Multiple Access for URLLC Applications," CoRR, Apr. 2019. [Online]. Available: https://arxiv.org/abs/1904.00665
[5] Y. Yu, S. C. Liew, and T. Wang, "Multi-Agent Deep Reinforcement Learning Multiple Access for Heterogeneous Wireless Networks with Imperfect Channels," CoRR, Mar. 2020. [Online]. Available: https://arxiv.org/abs/2003.11210
[6] N. Naderializadeh, J. Sydir, M. Simsek, H. Nikopour, and S. Talwar, "When Multiple Agents Learn to Schedule: A Distributed Radio Resource Management Framework," unpublished, Jun. 2019. [Online]. Available: http://arxiv.org/abs/1906.08792
[7] N. Naderializadeh, J. Sydir, M. Simsek, and H. Nikopour, "Resource Management in Wireless Networks via Multi-Agent Deep Reinforcement Learning," CoRR, Feb. 2020. [Online]. Available: http://arxiv.org/abs/2002.06215
[8] F. AL-Tam, N. Correia, and J. Rodriguez, "Learn to Schedule (LEASCH): A Deep Reinforcement Learning Approach for Radio Resource Scheduling in the 5G MAC Layer," unpublished, Mar. 2020. [Online]. Available: http://arxiv.org/abs/2003.11003
[9] H. B. Pasandi and T. Nadeem, "MAC Protocol Design Optimization Using Deep Learning," CoRR, Feb. 2020. [Online]. Available: http://arxiv.org/abs/2002.02075
[10] C. Bowyer, D. Greene, T. Ward, M. Menendez, J. Shea, and T. Wong, "Reinforcement Learning for Mixed Cooperative/Competitive Dynamic Spectrum Access," in 2019 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN), 2019, pp. 1–6.
[11] T. F. Wong, T. Ward, J. M. Shea, M. Menendez, D. Greene, and C. Bowyer, "A Dynamic Spectrum Sharing Design in the DARPA Spectrum Collaboration Challenge," in GOMACTech, San Diego, 2020.
[12] J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson, "Learning to Communicate with Deep Multi-Agent Reinforcement Learning," in NIPS'16: Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 2016, pp. 2145–2153. [Online]. Available: http://arxiv.org/abs/1605.06676
[13] R. Lowe, J. Foerster, Y.-L. Boureau, J. Pineau, and Y. Dauphin, "On the Pitfalls of Measuring Emergent Communication," CoRR, vol. abs/1903.05168, 2019. [Online]. Available: http://arxiv.org/abs/1903.05168
[14] R. Lowe, A. Gupta, J. Foerster, D. Kiela, and J. Pineau, "On the Interaction Between Supervision and Self-Play in Emergent Communication," in International Conference on Learning Representations (submitted), 2020, pp. 1–13. [Online]. Available: http://arxiv.org/abs/2002.01093
[15] F. Ait Aoudia and J. Hoydis, "Model-free Training of End-to-end Communication Systems," IEEE Journal on Selected Areas in Communications, vol. 37, no. 11, pp. 2503–2516, 2019. [Online]. Available: https://ieeexplore.ieee.org/document/8792076
[16] Small Cell Forum, "5G FAPI: PHY API Specification," Small Cell Forum, Tech. Rep., Jun. 2019.
[17] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in 11th International Conference on Machine Learning (ICML), 1994, pp. 157–163.
[18] D. H. Wolpert and K. Tumer, "Optimal Payoff Functions for Members of Collectives," Advances in Complex Systems, vol. 4, no. 02n03, pp. 265–279, 2001.
[19] J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. S. Torr, P. Kohli, and S. Whiteson, "Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning," in 34th International Conference on Machine Learning (ICML), Feb. 2017. [Online]. Available: https://arxiv.org/abs/1702.08887
[20] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. London, England: The MIT Press, 2018. [Online]. Available: http://ir.obihiro.ac.jp/dspace/handle/10322/3933
[21] C. Watkins, "Learning from delayed rewards," Ph.D. dissertation, King's College London, 1989.
[22] M. Tan, "Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents," in Proceedings of the Tenth International Conference on Machine Learning, 1993, pp. 330–337.
[23] G. Tesauro, "TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play," Neural Computation, vol. 6, no. 2, pp. 215–219, 1994.
[24] P. Hernandez-Leal, B. Kartal, and M. E. Taylor, "A Survey and Critique of Multiagent Deep Reinforcement Learning," Autonomous Agents and Multi-Agent Systems, vol. 33, no. 6, pp. 750–797, Oct. 2019. [Online]. Available: http://arxiv.org/abs/1810.05587 and http://dx.doi.org/10.1007/s10458-019-09421-1
[25] H. Hu, A. Lerer, A. Peysakhovich, and J. Foerster, ""Other-Play" for Zero-Shot Coordination," Mar. 2020. [Online]. Available: http://arxiv.org/abs/2003.02979
[26] N. Jaques et al., "Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning," in 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov, Eds. Long Beach, California, USA: PMLR, 2019, pp. 3040–3049. [Online]. Available: http://arxiv.org/abs/1810.08647
[27] C. Packer, K. Gao, J. Kos, P. Krähenbühl, V. Koltun, and D. Song, "Assessing Generalization in Deep Reinforcement Learning," CoRR, vol. abs/1810.12282, 2018. [Online]. Available: http://arxiv.org/abs/1810.12282