Efficient and robust multi-task learning in the brain with modular task primitives.

Christian D. Márton
Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York, NY 10029
[email protected]

Guillaume Lajoie†
Dep. of Mathematics and Statistics, Université de Montréal; MILA - Québec AI Institute
[email protected]

Kanaka Rajan† Friedman Brain Institute Icahn School of Medicine at Mount Sinai New York, NY 10029 [email protected]

Abstract

In a real-world setting, biological agents do not have infinite resources to learn new things. It is thus useful to recycle previously acquired knowledge in a way that allows for faster, less resource-intensive acquisition of multiple new skills. Neural networks in the brain are likely not entirely re-trained with new tasks, but how they leverage existing computations to learn new tasks is not well understood. In this work, we study this question in artificial neural networks trained on commonly used paradigms. Building on recent work from the multi-task learning literature, we propose two ingredients: (1) network modularity, and (2) learning task primitives. Together, these ingredients form inductive biases we call structural and functional, respectively. Using a corpus of nine different tasks, we show that a modular network endowed with task primitives allows for learning multiple tasks well while keeping parameter counts and updates low. We also show that the skills acquired with our approach are more robust to a broad range of perturbations compared to those acquired with other multi-task learning strategies. This work offers a new perspective on achieving efficient multi-task learning in the brain, and makes predictions for novel neuroscience experiments in which targeted perturbations are employed to explore solution spaces.

arXiv:2105.14108v1 [cs.AI] 28 May 2021

1 Introduction

Task knowledge can be stored in the collective dynamics of large populations of neurons in the brain [37, 48, 64, 46, 27, 68, 10, 45, 58, 34, 57, 2, 8, 56]. This body of work has offered insight into the computational mechanisms underlying the execution of multiple different tasks across various domains. In these studies, however, animals and models are over-trained on specific tasks, at the expense of long training durations and high energy expenditure. Models are optimized for one specific task and, unlike networks of neurons in the brain, are not meant to be re-used across tasks. In addition to being resource-efficient, acquired task representations should also be resistant to failures and perturbations. Biological organisms are remarkably resistant to lesions affecting large extents of the brain [3, 12], and to neural cell death with aging [18, 63].

Preprint. Under review.

Current approaches do not take this into account; often, representations acquired naively are highly susceptible to perturbations [60]. In the multi-task setting, where multiple tasks are learnt simultaneously [64] or sequentially [15, 69, 28], the number of weight updates grows quickly with the number of tasks, as the entire model is essentially re-trained with each new task. Training time and energy cost increase rapidly and can become prohibitive. To overcome this, it is conceivable that the brain has adopted particular inductive biases, over an evolutionary timescale, to speed up learning [48]. Inductive biases can help guide and accelerate future learning by providing useful prior knowledge. This becomes particularly useful when resources (energy, time, neurons) to acquire new skills are limited, and the space of tasks that need to be acquired is not infinite. In fact, biological agents often inhabit particular ecological niches [9] which place very particular constraints on the kinds of tasks that are useful and need to be acquired over a lifetime. Under these circumstances, useful priors or inductive biases can help decrease the number of necessary weight updates.

One way in which such biases may be expressed in the brain is in terms of neural architecture or structure [48]. Neural networks in the brain may come with a pre-configured wiring diagram, encoded in the genome [67]. Such structural pre-configurations can put useful constraints on the learning problem. One way in which this is expressed in the brain is modularity. The brain exhibits specializations over various scales [55, 62, 21, 36, 35], with visual information being primarily represented in the visual cortices, auditory information in the auditory cortices, and decision-making tasks encoded in circuits in the prefrontal cortex [37, 10, 34].
Further subdivisions within those specialized modules may arise in the form of particular cell clusters, or groupings in a graph-theoretic sense, processing particular kinds of information (such as cells mapping particular visual features or aspects of a task). From a multi-task perspective, modularity can offer an efficient solution to the problem of task interference and catastrophic forgetting in continual learning [23, 40, 49], while increasing robustness to perturbations.

Another way in which inductive biases may be expressed in the brain, beyond structure, is in terms of function: the types of neural dynamics instantiated in particular groups of neurons, or neural ensembles. Rather than learn separate models for separate tasks, as many of the studies cited above do, or re-train the same circuit many times on different tasks [64, 15], core computational principles could be learned once and re-used across tasks. There is evidence that the brain implements stereotyped dynamics which can be used across tasks [13, 51]. Ensembles of neurons may be divided into distinct groups based on the nature of the computations they perform, forming functional modules [42, 55, 5, 44, 35]. In the motor domain, for instance, neural activity was found to exhibit shared structure across tasks [19]. Different neural ensembles can be found to implement particular aspects of a task [29, 64, 10], and theoretical work suggests the number of different neural sub-populations influences the set of dynamics a neural network is able to perform [4]. Thus, a network that implements a form of task primitives, or basis-computations, could offer dynamics useful for training multiple tasks. Putting these different strands together, in this work we investigate the advantage of combining structural and functional inductive biases in a recurrent neural network (RNN) model of modular task primitives.
Structural specialization in the form of modules can help avoid task interference and increase robustness to failure of particular neurons, while core functional specialization in the form of task primitives can help substantially lower the costs of training. We show how our model can be used to facilitate resource-efficient learning in a multi-task setting. We also compare our approach to other training paradigms, and demonstrate higher robustness to perturbations. Altogether, our work shows that combining structural and functional inductive biases in a recurrent network allows an agent to learn many tasks well, at lower cost and higher robustness compared to other approaches. This draws a path towards cheaper multi-task solutions, and offers new hypotheses for how multi-task learning may be achieved in the brain.

2 Background

The brain is highly recurrently connected [30], and recurrent neural networks have been successfully used to model dynamics in the brain [37, 48, 64, 46, 27, 68, 10, 45, 58, 34, 57, 2, 8, 56]. Reservoir

computing approaches, such as those proposed in the context of echo-state networks (ESNs) and liquid state machines (LSMs) [41, 26, 32], avoid learning the recurrent weights of a network (only training output weights) and thereby achieve reduced training time and lower computational cost compared to recurrent neural networks (RNNs) whose weights are all trained in an end-to-end fashion. This can be advantageous: when full state dynamics are available for training, reservoir networks exhibit higher predictive performance and lower generalisation error at significantly lower training cost, compared to recurrent networks fully trained with gradient descent using backpropagation through time (BPTT); in the face of partial observability, however, RNNs trained with BPTT perform better [61]. Also, naive implementations of reservoir networks show high susceptibility to local perturbations, even when implemented in a modular way [60].

Fig 1. Schematic of fully disconnected recurrent network with modular task primitives. Left: inputs for modules M1-M3. Middle: recurrent weight matrix of the modular network, with separate partitions for modules M1-M3 along the diagonal. Right: different tasks (T1-T9) trained into different output weights, leveraging the same modular network.

Meanwhile, previous work in continual learning has largely focused on feedforward architectures [23, 66, 1, 40, 20, 65, 53, 52, 68, 28, 69, 31, 14, 50]. The results obtained in feedforward systems, however, are not readily extensible to continual learning in dynamical systems, where recurrent interactions may lead to unexpected behavior in the evolution of neural responses over time [54, 15]. Previously, modular approaches have also largely been explored in feedforward systems [59, 16, 24]. Recently, a method for continual learning has been proposed for dynamical systems [15]. The authors have shown that their method, dynamical non-interference, outperforms other gradient-based approaches [69, 28] in key applications. All of these approaches, however, require the entire network to be re-trained with each new task and thus come with a high training cost. A network with functionally specialized modules could provide dynamics that are useful across tasks and thus help mitigate this problem. In the use of modules, our approach is similar to memory-based approaches to continual learning which have also been applied in recurrently connected architectures [22, 11]. In these approaches, context-triggered information retrieval is central to computations: information can be stored in memory and re-used when it becomes relevant to new tasks. These approaches, however, rely on adding new modules or memory components as learning proceeds, which may often grow unboundedly [23]. Ongoing work in transfer, few-shot and one-shot learning has focused on leveraging previously trained networks to achieve a reduction in training time on new tasks. This work, largely carried out in the computer vision domain, has focused on feedforward systems [47]. Recently, a new framework for achieving good time-series forecasts with limited data has been proposed [38], but it has not been done in a multi-task setting. 
Initial training in this framework is also expensive, as it requires training on a multitude of different time-series datasets.

Learning from this wide-ranging body of work, our key contribution is the following. Instead of pre-training an entire network on a full set of tasks, as in transfer learning, we hypothesize that pre-training functional modules on task primitives offers benefits for multi-task learning while keeping training costs low. This procedure aims at a tradeoff between end-to-end trained RNNs and the reservoir computing paradigm, replacing randomly connected 'reservoir' networks with task-primitive modules. Modules are trained once at the beginning, and can be re-used to perform many different tasks. We also hypothesize that functional modularity guards against failure, as functionality in other modules is preserved even if dynamics in one of the modules are perturbed.

Fig 2. Task training overview. Left: the pre-configured network receives task-specific inputs. A multitude of different tasks (T1-T9) can be trained leveraging the dynamics in the pre-configured network. Separate readout weights are trained for tasks T1-T9 while recurrent weights remain frozen. Right: task targets (strong colors) are plotted together with trained outputs (pale colors).

3 Model details

3.1 Modularity

Our network is based on a simple RNN architecture with hidden state h evolving according to

h_{t+1} = σ(W_rec^M h_t + W_in x_t + η_t),        y_t^i = W_out^i h_t^M        (1)

with activation function σ = tanh, external input x_t, uncorrelated Gaussian noise η_t, and linear readouts y_t^i. Starting from an initial state, with initial weights drawn from a Gaussian distribution, each module is trained to produce a module-specific target output timeseries z_t by minimizing the squared loss using the Adam optimizer (see Appendix A for more details). Three modules (M1-M3) are embedded within the larger network. We consider two conditions: (1) modular disconnected, in which the modules are entirely disconnected (Figure 1), and (2) modular connected, in which connection weights between modules are trained (Figure S1). Inputs are fed in through shared weights. During training, each module (M1-M3) receives one of three inputs (Fixation, Input 1, Input 2), respectively. Each module is trained separately: when a particular module is being trained, only weights within that module are allowed to change, while all others remain frozen. The gradient of the error, however, is computed across the whole network.
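The modular setup can be sketched in a few lines. The code below is an illustrative NumPy sketch, not the paper's implementation (which builds on JAX [7]); module sizes, noise level, and initialization scale are placeholder assumptions, and only the gradient-masking idea behind module-wise training is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: three modules of 100 units each (placeholders; the
# actual network sizes are given in the paper's Appendix A).
sizes = [100, 100, 100]
N = sum(sizes)

# Block-diagonal recurrent matrix: the "modular disconnected" condition.
W_rec = np.zeros((N, N))
offset = 0
for n in sizes:
    W_rec[offset:offset + n, offset:offset + n] = rng.normal(0.0, 1.0 / np.sqrt(n), (n, n))
    offset += n

W_in = rng.normal(0.0, 0.1, (N, 3))  # shared input weights: Fixation, Input 1, Input 2

def step(h, x, noise_std=0.01):
    """One step of Eq. (1): h_{t+1} = tanh(W_rec h_t + W_in x_t + eta_t)."""
    eta = rng.normal(0.0, noise_std, h.shape)
    return np.tanh(W_rec @ h + W_in @ x + eta)

# When training module M2 alone, gradients of the recurrent weights are
# masked so that only M2's diagonal block can change (all other entries
# stay frozen, even though the error gradient spans the whole network).
m2_mask = np.zeros_like(W_rec, dtype=bool)
m2_mask[100:200, 100:200] = True
# grad_W_rec would be multiplied elementwise by m2_mask before an Adam update.

h = np.zeros(N)
x = np.array([1.0, 0.0, 0.0])  # fixation on, no stimulus
for _ in range(10):
    h = step(h, x)
```

Because W_rec is block-diagonal in the disconnected condition, activity in one module cannot influence another, which is what later confines the effect of local perturbations.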

3.2 Task primitives

Neuroscience tasks [64], like the ones we employ here (explained in greater detail later, see 4), are structured as trials, analogous to separate training batches. In each trial, stimuli are randomly drawn, e.g. from a distribution of possible orientations. In a given trial, there is an initial period without a stimulus in which animals are required to maintain fixation. Then, in the stimulus period, a stimulus

appears for a given duration and the animal again is required to maintain fixation without responding. Finally, in the response period, the animal is required to make a response. In models, a context signal indicates when to maintain fixation, separate inputs are fed in through different channels, and targets indicating the required choice are read out from the network. Taking inspiration from this task structure, and from core functionality common to most neuroscience tasks [33, 19, 34], we derive simple task primitives that provide core functionality for solving multiple tasks. The goal is to explore how such primitives can be pre-learned and recombined by networks for multi-task learning. The chosen primitives are (i) a low-dimensional latent representation of inputs, and (ii) the memorization of transient input sequences. In our model, M1 is assigned to (i) and learns to autoencode the fixation signal. M2 and M3 are assigned to (ii) and are trained to hold Inputs 1-2 in memory, respectively. More specifically, M2 receives short positive input pulses (∆t = 25 ms) and is required to hold the signal in memory for the remainder of the trial period, while M3 is required to do the same with negative input pulses. More details can be found in Appendix A. After training, recurrent weights are frozen. Separate linear readouts can now be trained for different tasks (T1-T9) by minimizing the task-specific loss function and updating readout weights (Wout) only.

y_t^{T1,2,...} = W_out^{T1,2,...} h_t^M        (2)
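To illustrate Equation (2), the sketch below fits task-specific readout weights on frozen hidden states. Ridge regression is used here as a convenient closed-form stand-in for the paper's gradient-based (Adam) training of W_out; the function name, regularization constant, and toy data are all assumptions for illustration.

```python
import numpy as np

def fit_readout(H, Z, reg=1e-4):
    """Fit task-specific readout weights (Eq. 2) on frozen hidden states.

    H: (T, N) hidden states collected from the frozen modular network.
    Z: (T, n_out) task targets.
    Returns W_out of shape (n_out, N): a ridge-regression stand-in for the
    gradient-based readout training used in the paper.
    """
    N = H.shape[1]
    W_out = np.linalg.solve(H.T @ H + reg * np.eye(N), H.T @ Z).T
    return W_out

# Toy check: recover a known linear readout from random "states".
rng = np.random.default_rng(1)
H = rng.normal(size=(500, 50))
W_true = rng.normal(size=(3, 50))
Z = H @ W_true.T
W_est = fit_readout(H, Z, reg=1e-8)
```

Only W_out is updated per task, so the per-task parameter count stays at N × n_out regardless of how many tasks are added.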

4 Tasks

We show that we can leverage our pre-configured modular network to learn a set of nine different neuroscience tasks, previously used to study multi-task learning [64, 15]. Trained simultaneously into one and the same network, these tasks produce a rich and varied population structure and are mapped into different parts of state space [64]. Simultaneous training, however, does not guarantee good performance on all tasks; this can be improved by using a multi-task training paradigm whereby new tasks are projected into non-interfering subspaces [15]. This approach is still dependent on task order, however: how well a new task is learnt often depends on its placement in the training curriculum. We employ a set of nine different tasks: Delay Pro, Delay Anti, Mem Pro, Mem Anti, Mem Dm 1, Mem Dm 2, Context Mem Dm 1, Context Mem Dm 2 and Multi Mem (Figure 2). In Delay Pro and Delay Anti, the network receives three inputs (Fixation, cos(θ), sin(θ), with θ varying from 0-180 degrees). The angle inputs turn on at the Go-signal and remain on for the duration of the trial. The network is required to produce three outputs, yielding the correct location during the output period (when the fixation signal switches to zero). In the anti-versions, the network is required to respond in the direction opposite to the inputs. In Mem Pro and Mem Anti, the input stimulus turns off after a short, variable duration (Figure 2, inputs T3-T4). As before, the network is required to produce three outputs encoding the correct location during the output period. In Mem Dm 1 and Mem Dm 2, the network receives two inputs (Fixation, and a set of stimulus pulses separated by 100 msec) (Figure 2, inputs T5-T6). The network is required to reproduce the higher-valued pulse during the output period. Depending on the task, the network receives one or the other input stimulus pulse set.
In Context Mem Dm 1 and Context Mem Dm 2, the network receives three inputs (Fixation, and two sets of stimulus pulses) (Figure 2, inputs T7-T9). The network is required to reproduce the higher-valued pulse during the output period, ignoring the other input stimulus set. The two tasks differ in which input stimulus set to pay attention to. In Multi Mem, the network again receives three inputs (Fixation, and both sets of stimulus pulses). The network needs to yield the summed value of the stimulus train with the higher total magnitude of the two.
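As a concrete illustration of the trial structure described above, a minimal generator for DelayPro/DelayAnti-style trials might look as follows. The trial length and go-signal timing here are illustrative placeholders, not the values used in the paper (see Appendix A).

```python
import numpy as np

def delay_pro_trial(theta_deg, T=200, t_go=150, anti=False):
    """Inputs and targets for a DelayPro/DelayAnti-style trial.

    Channels: (Fixation, cos(theta), sin(theta)). Following the task
    description, the angle inputs turn on at the go signal (when the
    fixation signal switches to zero), and the network must then report
    the angle, or its opposite in the anti version.
    """
    th = np.deg2rad(theta_deg)
    x = np.zeros((T, 3))
    x[:t_go, 0] = 1.0                 # fixation signal on until go
    x[t_go:, 1] = np.cos(th)
    x[t_go:, 2] = np.sin(th)
    y = np.zeros((T, 3))
    y[:t_go, 0] = 1.0                 # maintain fixation before go
    sign = -1.0 if anti else 1.0      # anti: respond opposite to the input
    y[t_go:, 1] = sign * np.cos(th)
    y[t_go:, 2] = sign * np.sin(th)
    return x, y

x, y = delay_pro_trial(45.0)
xa, ya = delay_pro_trial(45.0, anti=True)
```

On each trial, θ would be drawn at random (trials play the role of training batches), and the memory variants would simply switch the stimulus channels off again after a short, variable duration.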

5 Results

All of the tasks are learned well by leveraging the dynamics in the network (Figure 2). Each task is trained into separate output weights, leaving the recurrent weights frozen. We also compared our approach to networks trained separately and with recent multi-task learning approaches (Figure 3). Networks trained separately reach perfect performance on each task, while networks trained with dynamical non-interference [15] reach ∼ 70% performance on all tasks at the end of training. Our

modular approach (Figure 3I-K) also reaches high performance on all tasks, more akin to separate training (Figure 3B) than to multi-task training (Figure 3F). Importantly, in comparison to all other approaches considered, our modular network achieves high performance while requiring much less training. Naive, separate training requires one network per task; assuming a network size of 300 units and two readout units per network, it requires ∼ 8e5 parameters in total across the nine tasks, and all of them need to be trained separately (Table 1). In contrast, using dynamical non-interference (Figure 4B, [15]), only one network is needed to perform all tasks; however, it requires the whole network to be re-trained on each task, just like the naive approach, and thus still needs ∼ 8e5 parameters to be trained in total across the nine tasks, despite using fewer parameters overall. Our approach aims to exploit the best of both of these methods: (i) keeping the total number of re-trained parameters as low as possible, and (ii) keeping the overall parameter count low, similarly to multi-task learning. Indeed, our approach uses an overall parameter count one order of magnitude lower than the other approaches (Table 1). This arises from using the modular network as a reservoir of useful dynamics. The recurrent weights of the modular network only need to be trained once upfront, and subsequently each task can be learnt by training separate readout weights only. This allows the parameter count to grow much more slowly as a function of tasks trained (Figure 4A) compared to other approaches, which require the entire recurrent weight matrix to be re-trained on each task. The overall parameter count also stays low using our approach (Figure 4B).

Fig 3. Performance. Training loss across tasks (A, E, I), task performance after training (B, F, J), and exemplary targets and outputs on the Delay Pro task (C, G, K), shown for separate training (A-C), dynamical non-interference (E-G), and our modular network approach (I-K).

Table 1. Parameter counts.

Approach                     | Number trained per task | Overall count
Separate training            | 9.06e4                  | 8.15e5
Dynamical non-interference   | 9.06e4                  | 9.06e4
Modular task-primitives      | 6e2                     | 9.54e4


Fig 4. Parameter count comparison. Total number of parameters trained (A), and total number of parameters needed overall (B) for networks trained separately, networks trained with dynamical non-interference, and those trained with our approach, modular task primitives.

5.1 Perturbations

We also compare the performance under different types of perturbations for the three approaches. We consider four different types of perturbations: lesioned weights, silenced activity, global noise, and input noise (see Appendix A). Weights are lesioned by randomly picking an entry W_rec^{i,j} in the connectivity matrix (Equation 1) and setting it to zero. Activity is silenced by clamping the activity of a particular unit in the network at zero. In the case of global noise, Gaussian noise is added to the entire connectivity matrix, while in the case of input noise, uncorrelated Gaussian noise is added to all the input units. In Figure 5, we show the results of comparing the performance of our approach after perturbations. Here, we pick connections and units from across the entire network randomly and repeat this procedure for 25 draws with different seeds. In addition to fully disconnected networks with block-diagonal modules (Figure 1), we also consider fully connected modular networks (Figure S1). Modular networks show consistently high resilience to all types of perturbations, performing better than networks trained with a high-performing multi-task learning approach such as dynamical non-interference (Figures 5 & S2). Performance decays more slowly using our approach after weight and activity perturbations, as well as when adding global and input noise to the network. In the case of the Delay Anti task, for instance, the performance of networks trained with dynamical non-interference reaches chance level at ∼ 10% weights lesioned, while our approach still gets 50% of decisions correct at that level. We found networks trained separately and those trained with dynamical non-interference to be similar in performance, with a few exceptions for particular tasks (such as DelayPro and DelayAnti, for example, where separately trained networks showed higher robustness). We found no particular advantage in training off-diagonal connections (Figure 1, cyan vs dark green).
Overall, we found that our multi-task learning technique achieved the highest robustness to perturbations across tasks. We also performed perturbations in which we specifically targeted units within a module in our network. In Figures 6 & S3, we show performance after perturbing units within module M2 of our network (see Appendix A). We observed the same pattern across approaches, with our technique achieving the highest robustness. We found that fully connected modular networks performed worse than fully disconnected networks (Figures 6 & S3, light blue vs. green, respectively). When perturbations were performed in a modular way, targeting sets of locally inter-connected units, we found that networks trained with dynamical non-interference performed better than those trained separately. Overall, even when targeting a module directly, our approach consistently achieved higher robustness to perturbations than other approaches across tasks.
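The four perturbation protocols described above can be sketched as simple operations on the weights and activities. The function names and demo fractions below are illustrative assumptions; the exact perturbation parameters are given in the paper's Appendix A.

```python
import numpy as np

rng = np.random.default_rng(0)

def lesion_weights(W, frac):
    """Set a randomly chosen fraction of entries W[i, j] to zero."""
    W = W.copy()
    idx = rng.choice(W.size, size=int(frac * W.size), replace=False)
    W.flat[idx] = 0.0
    return W

def silence_units(n_units, frac):
    """Mask clamping a random fraction of units at zero.

    Multiply h_t elementwise by this mask at every timestep.
    """
    mask = np.ones(n_units)
    idx = rng.choice(n_units, size=int(frac * n_units), replace=False)
    mask[idx] = 0.0
    return mask

def add_global_noise(W, sigma):
    """Add Gaussian noise (std sigma) to the entire connectivity matrix."""
    return W + rng.normal(0.0, sigma, W.shape)

def add_input_noise(x_t, sigma):
    """Add independent Gaussian noise to all input channels."""
    return x_t + rng.normal(0.0, sigma, x_t.shape)

# Demo on a toy connectivity matrix.
W_demo = np.ones((10, 10))
W_lesioned = lesion_weights(W_demo, 0.3)
unit_mask = silence_units(100, 0.1)
W_noisy = add_global_noise(W_demo, 0.05)
```

Each perturbation would be re-drawn across seeds (25 draws in the paper) before measuring the % of decisions correct on a held-out test set.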

Fig 5. Performance after perturbations. Performance (% decisions correct on a 250-trial test set) for tasks T1-T4 (DelayPro, DelayAnti, MemPro, MemAnti) as a function of % weights lesioned (A, E, I, M), % units silenced (B, F, J, N), global Gaussian noise as a function of standard deviation (σ) added to the entire population (C, G, K, O), and independent Gaussian noise added to all inputs (D, H, L, P). Error bars show the standard deviation across repeated perturbations for 25 draws and 2 networks, all with different random seeds.

6 Discussion

We presented a new approach for multi-task learning leveraging modular task primitives. Inspired by the organization of brain activity and the need to minimize training cost, we developed a network made up of functional modules that provide useful dynamics for solving a number of different tasks. We showed that our approach is able to achieve high performance across all tasks, while not suffering from the effects of task order or task interference that might arise with other multi-task learning approaches [15, 69]. What is more, our approach achieves high performance with around an order of magnitude fewer parameters compared to other approaches. It scales better with tasks, thus reducing overall training cost. Our results also show that networks composed of task modules are more robust to perturbations, even when perturbations are specifically targeted at a particular module. As readouts may draw upon dynamics in different modules, task performance may not be drastically affected by the failure of any one particular module. The lower performance of the fully connected modular network may be due to the trained connection weights, which increase the inter-connectivity of different cell clusters and thus the susceptibility of local cell clusters to perturbations in distal parts of the network.

This generates useful hypotheses for multi-task learning in the brain. It suggests decision-making related circuits in the brain, such as in the prefrontal cortex, need not re-adjust all weights during learning. Rather, pre-existing dynamics may be leveraged to solve new tasks. This could be tested by using recurrent networks fitted to population activity recorded from the brain [43, 39] as reservoirs of useful dynamics. We hypothesize that the fitted networks contain dynamics useful for new tasks; if so, it should be possible to train readout weights to perform new tasks.
Moreover, targeted lesions may be performed to study the effect of perturbations on network dynamics in the brain; our work suggests networks with functional task modules are less prone to being disrupted by perturbations. Based on our work we would also expect functional modules to encode a set of basis-tasks - commonly occurring dynamics - for the purpose of solving future tasks.

In a recent study, [17] examined the effects of ’rich’ and ’lazy’ learning on acquired task representations and compared their findings to network representations in the brains of macaques and humans. The lazy learning paradigm is similar to ours in as far as learning is confined to readout weights; but it lacks our modular approach and does not show performance on multiple tasks. Conversely, our networks are not initialized with high-variance weights as in the lazy regime. Ultimately, we observe improved robustness using our approach, unlike lazy learning, which appears to confer lower robustness compared with rich learning.

In the more general sense, the challenge of how to choose task primitives for a given distribution of tasks remains. In the future, we aim to develop a way to derive the initialization scheme for the modular network directly from the set of tasks to be learnt. This could be done by segregating common input features derived from a set of tasks into separate modules, for example by obtaining a factorised representation of task dynamics [25]. Also, instead of keeping recurrent weights frozen, they could be allowed to change slowly over the course of learning multiple tasks [6], thereby incorporating useful task information without disrupting previous dynamics. Leveraging meta-learning approaches to optimize the task primitives themselves is another possible avenue to explore this question. Future work will also explore how the observed susceptibility to perturbations is more precisely related to the dynamics in the network. Higher robustness to perturbations could be achieved by optimizing the training of readout weights further.

In as far as this work facilitates multi-task learning at a lower overall training and thus energy cost, it has the potential to displace other more energy-intensive methods and thereby reduce the overall impact on the environment.
The focus of this work is mainly on neuroscience tasks and thus devoid of any potentially harmful application. However, in as far as this work facilitates efficient multi-task learning, it has the potential to find wider application in modeling sequential data. The dynamics created in the network may be useful to training a wide range of tasks; to preclude any harmful effects, task primitives can be chosen in such a way as to limit the performance of the network.

Fig 6. Performance after modular perturbations (see Appendix A for details). Performance (% decisions correct on a 250-trial test set) for tasks T1-T4 (DelayPro, DelayAnti, MemPro, MemAnti) as a function of % weights lesioned (A, D, G, J), % units silenced (B, E, H, K), and global Gaussian noise as a function of standard deviation (σ) (C, F, I, L). Error bars show standard deviation across repeated perturbations for 25 draws with different random seeds for 2 networks initialized with different random seeds.

7 Broader Impact

In this work we developed a novel approach to continual learning which achieves high performance and robustness at lower overall training cost compared to other approaches. As energy usage soars with increasing model size, lowering training cost without sacrificing performance is becoming increasingly important in fields such as artificial intelligence and robotics. Our approach may also find applications in the domain of low-power biological implants. Furthermore, our work explores new avenues for multi-task learning in the brain, and thus has the potential to catalyze new experiments and analyses in the field of neuroscience.

8 Acknowledgments & Disclosure of funding

We would like to thank Daniel Kepple for his insightful comments on this work. This work was funded by the NSF Foundations grant (1926800) and the McDonnell UHC Scholar Award, as well as by the Canada CIFAR AI Chair, the NSERC Discovery Grant (RGPIN-2018-04821), Samsung Research Support and FRQS (Research Scholar Award Junior 1 LAJGU0401-253188).

References

[1] Tameem Adel, Han Zhao, and Richard E. Turner. Continual Learning with Adaptive Weights (CLAW). arXiv:1911.09514 [cs, stat], June 2020. URL http://arxiv.org/abs/1911.09514.

[2] Omri Barak, David Sussillo, Ranulfo Romo, Misha Tsodyks, and L F Abbott. From fixed points to chaos: Three models of delayed discrimination. Progress in Neurobiology, 103:214–222, March 2013. doi: 10.1016/j.pneurobio.2013.02.002. URL http://dx.doi.org/10.1016/j.pneurobio.2013.02.002.

[3] Benjamin M. Basile, Victoria L. Templer, Regina Paxton Gazes, and Robert R. Hampton. Preserved visual memory and relational cognition performance in monkeys with selective hippocampal lesions. Sci. Adv., 6(29):eaaz0484, July 2020. ISSN 2375-2548. doi: 10.1126/sciadv.aaz0484. URL https://advances.sciencemag.org/lookup/doi/10.1126/sciadv.aaz0484.

[4] Manuel Beiran, Alexis Dubreuil, Adrian Valente, Francesca Mastrogiuseppe, and Srdjan Ostojic. Shaping dynamics with multiple populations in low-rank recurrent networks. arXiv:2007.02062 [q-bio], November 2020. URL http://arxiv.org/abs/2007.02062.

[5] Yazan N Billeh, Michael T Schaub, Costas A Anastassiou, Mauricio Barahona, and Christof Koch. Revealing cell assemblies at multiple levels of granularity. Journal of Neuroscience Methods, 236:92–106, October 2014. doi: 10.1016/j.jneumeth.2014.08.011. URL http://dx.doi.org/10.1016/j.jneumeth.2014.08.011.

[6] Matthew Botvinick, Sam Ritter, Jane X Wang, Zeb Kurth-Nelson, Charles Blundell, and Demis Hassabis. Reinforcement Learning, Fast and Slow. Trends in Cognitive Sciences, 23(5):408–422, May 2019. doi: 10.1016/j.tics.2019.02.006. URL https://doi.org/10.1016/j.tics.2019.02.006.

[7] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang.
JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax. [8] Dean V Buonomano and Wolfgang Maass. State-dependent computations: spatiotemporal processing in cortical networks. Nature Reviews Neuroscience, 10(2):113–125, January 2009. doi: 10.1038/nrn2558. URL http://www.nature.com/articles/nrn2558. Publisher: Nature Publishing Group. [9] Kelly A. Carscadden, Nancy C. Emery, Carlos A. Arnillas, Marc W. Cadotte, Michelle E. Afkhami, Dominique Gravel, Stuart W. Livingstone, and John J. Wiens. Niche Breadth: Causes

10 and Consequences for Ecology, Evolution, and Conservation. The Quarterly Review of Biology, 95(3):179–214, September 2020. ISSN 0033-5770, 1539-7718. doi: 10.1086/710388. URL https://www.journals.uchicago.edu/doi/10.1086/710388. [10] Warasinee Chaisangmongkon, Sruthi K Swaminathan, David J Freedman, and Xiao-Jing Wang. Computing by Robust Transience: How the Fronto- Parietal Network Performs Sequential, Category- Based Decisions. Neuron, 93(6):1504–1517.e4, March 2017. doi: 10.1016/j.neuron. 2017.03.002. URL http://dx.doi.org/10.1016/j.neuron.2017.03.002. Publisher: Elsevier Inc. [11] Andrea Cossu, Antonio Carta, and Davide Bacciu. Continual Learning with Gated Incremental Memories for sequential data processing. 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–8, July 2020. doi: 10.1109/IJCNN48605.2020.9207550. URL http://arxiv.org/abs/2004.04077. arXiv: 2004.04077. [12] C. E. Curtis and M. D’Esposito. The effects of prefrontal lesions on performance and theory. Cognitive, Affective, & Behavioral Neuroscience, 4(4):528–539, December 2004. ISSN 1530-7026, 1531-135X. doi: 10.3758/CABN.4.4.528. URL http://link.springer.com/10.3758/CABN.4.4.528. [13] Philippe Domenech and Etienne Koechlin. Executive control and decision-making in the prefrontal cortex. Current Opinion in Behavioral Sciences, 1:101–106, February 2015. doi: 10.1016/j.cobeha.2014.10.007. URL http://linkinghub.elsevier.com/retrieve/pii/ S2352154614000278. [14] Timothy J Draelos, Nadine E Miner, Christopher C Lamb, Jonathan A Cox, Craig M Vineyard, William M Severa, Conrad D James, and James B Aimone. Neurogenesis Deep Learning. IEEE, page 8, 2017. [15] Lea Duncker, Laura N Driscoll, Krishna V Shenoy, Maneesh Sahani, and David Sussillo. Organizing recurrent network dynamics by task-computation to enable continual learning. NeurIPS 2020, page 11, 2020. [16] Kai Olav Ellefsen, Jean-Baptiste Mouret, and Jeff Clune. 
Neural Modularity Helps Organisms Evolve to Learn New Skills without Forgetting Old Skills. PLoS Comput Biol, 11(4):e1004128, April 2015. ISSN 1553-7358. doi: 10.1371/journal.pcbi.1004128. URL https://dx.plos. org/10.1371/journal.pcbi.1004128. [17] Timo Flesch, Keno Juechems, Tsvetomira Dumbalska, Andrew Saxe, and Christopher Summer- field. Rich and lazy learning of task representations in brains and neural networks. preprint, Neuroscience, April 2021. URL http://biorxiv.org/lookup/doi/10.1101/2021.04. 23.441128. [18] Y. Fu, Y. Yu, G. Paxinos, C. Watson, and Z. Rusznák. Aging-dependent changes in the cellular composition of the mouse brain and spinal cord. Neuroscience, 290:406–420, April 2015. ISSN 03064522. doi: 10.1016/j.neuroscience.2015.01.039. URL https://linkinghub.elsevier. com/retrieve/pii/S0306452215000949. [19] Juan A Gallego, Matthew G Perich, Stephanie N Naufel, Christian Ethier, Sara A Solla, and Lee E Miller. Cortical population activity within a preserved neural manifold underlies mul- tiple motor behaviors. Nature Communications, 9(1):1–13, October 2018. doi: 10.1038/ s41467-018-06560-z. URL http://dx.doi.org/10.1038/s41467-018-06560-z. Pub- lisher: Springer US. [20] Siavash Golkar, Michael Kagan, and Kyunghyun Cho. Continual Learning via Neural Pruning. arXiv:1903.04476 [cs, q-bio, stat], March 2019. URL http://arxiv.org/abs/1903.04476. arXiv: 1903.04476. [21] Alexandros Goulas, Alexander Schaefer, and Daniel S. Margulies. The strength of weak connections in the macaque cortico-cortical network. Brain Struct Funct, 220(5):2939–2951, September 2015. ISSN 1863-2653, 1863-2661. doi: 10.1007/s00429-014-0836-3. URL http://link.springer.com/10.1007/s00429-014-0836-3.

[22] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing Machines. arXiv:1410.5401 [cs], December 2014. URL http://arxiv.org/abs/1410.5401.

[23] Raia Hadsell, Dushyant Rao, Andrei A. Rusu, and Razvan Pascanu. Embracing Change: Continual Learning in Deep Neural Networks. Trends in Cognitive Sciences, 24(12):1028–1040, December 2020. doi: 10.1016/j.tics.2020.09.004.

[24] Bart L. M. Happel and Jacob M. J. Murre. Design and evolution of modular neural network architectures. Neural Networks, 7(6-7):985–1004, January 1994. doi: 10.1016/S0893-6080(05)80155-8.

[25] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR, page 22, 2017.

[26] H. Jaeger. Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication. Science, 304(5667):78–80, April 2004. doi: 10.1126/science.1091277.

[27] Alexander J. E. Kell, Daniel L. K. Yamins, Erica N. Shook, Sam V. Norman-Haignere, and Josh H. McDermott. A Task-Optimized Neural Network Replicates Human Auditory Behavior, Predicts Brain Responses, and Reveals a Cortical Processing Hierarchy. Neuron, 98(3):630–644.e16, May 2018. doi: 10.1016/j.neuron.2018.03.044.

[28] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, March 2017. doi: 10.1073/pnas.1611835114.

[29] Sue Ann Koay, Adam S. Charles, Stephan Y. Thiberge, Carlos D. Brody, and David W. Tank. Sequential and efficient neural-population coding of complex task information. bioRxiv preprint, October 2019. URL http://biorxiv.org/lookup/doi/10.1101/801654.

[30] Kestutis Kveraga, Avniel S Ghuman, and Moshe Bar. Top-down predictions in the cognitive brain. Brain and Cognition, 65(2):145–168, November 2007. doi: 10.1016/j.bandc.2007.06.007. URL http://linkinghub.elsevier.com/retrieve/pii/S0278262607000954.

[31] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient Episodic Memory for Continual Learning. arXiv:1706.08840 [cs], November 2017. URL http://arxiv.org/abs/1706.08840.

[32] Wolfgang Maass, Thomas Natschläger, and Henry Markram. Real-Time Computing Without Stable States: A New Framework for Neural Computation Based on Perturbations. Neural Computation, 14(11):2531–2560, November 2002. doi: 10.1162/089976602760407955.

[33] N Maheswaranathan, Alex H Williams, Matthew D Golub, Surya Ganguli, and David Sussillo. Universality and individuality in neural dynamics across large populations of recurrent networks. arXiv, pages 1–16, July 2019. URL https://arxiv.org/pdf/1907.08549.pdf.

[34] Valerio Mante, David Sussillo, Krishna V. Shenoy, and William T. Newsome. Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature, 503(7474):78–84, November 2013. doi: 10.1038/nature12742.

[35] David Meunier, Renaud Lambiotte, and Edward T. Bullmore. Modular and Hierarchically Modular Organization of Brain Networks. Frontiers in Neuroscience, 4, 2010. doi: 10.3389/fnins.2010.00200.

[36] John D. Murray, Alberto Bernacchia, David J. Freedman, Ranulfo Romo, Jonathan D. Wallis, Xinying Cai, Camillo Padoa-Schioppa, Tatiana Pasternak, Hyojung Seo, Daeyeol Lee, and Xiao-Jing Wang. A hierarchy of intrinsic timescales across primate cortex. Nature Neuroscience, 17(12):1661–1663, December 2014. doi: 10.1038/nn.3862.

[37] Christian D. Márton, Simon R. Schultz, and Bruno B. Averbeck. Learning to select actions shapes recurrent dynamics in the corticostriatal system. Neural Networks, 132:375–393, December 2020. doi: 10.1016/j.neunet.2020.09.008.

[38] Bernardo Pérez Orozco and Stephen J. Roberts. Zero-shot and few-shot time series forecasting with ordinal regression recurrent neural networks. arXiv:2003.12162 [cs, stat], March 2020. URL http://arxiv.org/abs/2003.12162.

[39] Chethan Pandarinath, Daniel J. O'Shea, Jasmine Collins, Rafal Jozefowicz, Sergey D. Stavisky, Jonathan C. Kao, Eric M. Trautmann, Matthew T. Kaufman, Stephen I. Ryu, Leigh R. Hochberg, Jaimie M. Henderson, Krishna V. Shenoy, L. F. Abbott, and David Sussillo. Inferring single-trial neural population dynamics using sequential auto-encoders. Nature Methods, 15(10):805–815, October 2018. doi: 10.1038/s41592-018-0109-9.

[40] German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71, May 2019. doi: 10.1016/j.neunet.2019.01.012.

[41] Jaideep Pathak, Brian Hunt, Michelle Girvan, Zhixin Lu, and Edward Ott. Model-Free Prediction of Large Spatiotemporally Chaotic Systems from Data: A Reservoir Computing Approach. Physical Review Letters, 120(2):024102, January 2018. doi: 10.1103/PhysRevLett.120.024102.

[42] Matthew G. Perich and Kanaka Rajan. Rethinking brain-wide interactions through multi-region 'network of networks' models. Current Opinion in Neurobiology, 65:146–151, December 2020. doi: 10.1016/j.conb.2020.11.003.

[43] Matthew G. Perich, Charlotte Arlt, Sofia Soares, Megan E. Young, Clayton P. Mosher, Juri Minxha, Eugene Carter, Ueli Rutishauser, Peter H. Rudebeck, Christopher D. Harvey, and Kanaka Rajan. Inferring brain-wide interactions using data-constrained recurrent neural network models. bioRxiv preprint, December 2020. URL http://biorxiv.org/lookup/doi/10.1101/2020.12.18.423348.

[44] Jonathan D. Power, Alexander L. Cohen, Steven M. Nelson, Gagan S. Wig, Kelly Anne Barnes, Jessica A. Church, Alecia C. Vogel, Timothy O. Laumann, Fran M. Miezin, Bradley L. Schlaggar, and Steven E. Petersen. Functional Network Organization of the Human Brain. Neuron, 72(4):665–678, November 2011. doi: 10.1016/j.neuron.2011.09.006.

[45] Kanaka Rajan, Christopher D. Harvey, and David W. Tank. Recurrent Network Models of Sequence Generation and Memory. Neuron, 90(1):128–142, April 2016. doi: 10.1016/j.neuron.2016.02.009.

[46] Evan D. Remington, Seth W. Egger, Devika Narain, Jing Wang, and Mehrdad Jazayeri. A Dynamical Systems Perspective on Flexible Motor Timing. Trends in Cognitive Sciences, 22(10):938–952, October 2018. doi: 10.1016/j.tics.2018.07.010.

[47] Mahdi Rezaei and Mahsa Shahidi. Zero-shot learning and its applications from autonomous vehicles to COVID-19 diagnosis: A review. Intelligence-Based Medicine, 3-4:100005, December 2020. doi: 10.1016/j.ibmed.2020.100005.

[48] Blake A. Richards, Timothy P. Lillicrap, Philippe Beaudoin, Yoshua Bengio, Rafal Bogacz, Amelia Christensen, Claudia Clopath, Rui Ponte Costa, Archy Berker, Surya Ganguli, Colleen J. Gillon, Danijar Hafner, Adam Kepecs, Nikolaus Kriegeskorte, Peter Latham, Grace W. Lindsay, Kenneth D. Miller, Richard Naud, Christopher C. Pack, Panayiota Poirazi, Pieter Roelfsema, João Sacramento, Andrew Saxe, Benjamin Scellier, Anna C. Schapiro, Walter Senn, Greg Wayne, Daniel Yamins, Friedemann Zenke, Joel Zylberberg, Denis Therien, and Konrad P. Kording. A deep learning framework for neuroscience. Nature Neuroscience, 22(11):1–10, October 2019. doi: 10.1038/s41593-019-0520-2.

[49] Sebastian Ruder. An Overview of Multi-Task Learning in Deep Neural Networks. arXiv:1706.05098 [cs, stat], June 2017. URL http://arxiv.org/abs/1706.05098.

[50] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive Neural Networks. arXiv:1606.04671 [cs], September 2016. URL http://arxiv.org/abs/1606.04671.

[51] Katsuyuki Sakai. Task Set and Prefrontal Cortex. Annual Review of Neuroscience, 31(1):219–245, July 2008. doi: 10.1146/annurev.neuro.31.060407.125642.

[52] Jonathan Schwarz, Jelena Luketina, Wojciech M. Czarnecki, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & Compress: A scalable framework for continual learning. PMLR, 80:10, 2018.

[53] Joan Serrà, Dídac Surís, Marius Miron, and Alexandros Karatzoglou. Overcoming Catastrophic Forgetting with Hard Attention to the Task. PMLR, 80:10, 2018.

[54] Shagun Sodhani, Sarath Chandar, and Yoshua Bengio. Toward Training Recurrent Neural Networks for Lifelong Learning. Neural Computation, 32(1):1–35, January 2020. doi: 10.1162/neco_a_01246.

[55] Olaf Sporns and Richard F. Betzel. Modular Brain Networks. Annual Review of Psychology, 67(1):613–640, January 2016. doi: 10.1146/annurev-psych-122414-033634.

[56] David Sussillo and L. F. Abbott. Generating Coherent Patterns of Activity from Chaotic Neural Networks. Neuron, 63(4):544–557, August 2009. doi: 10.1016/j.neuron.2009.07.018.

[57] David Sussillo and Omri Barak. Opening the Black Box: Low-Dimensional Dynamics in High-Dimensional Recurrent Neural Networks. Neural Computation, pages 1–24, January 2013. doi: 10.1162/NECO_a_00409.

[58] David Sussillo, Mark M. Churchland, Matthew T. Kaufman, and Krishna V. Shenoy. A neural network that finds a naturalistic solution for the production of muscle activity. Nature Neuroscience, 18(7):1025–1033, June 2015. doi: 10.1038/nn.4042.

[59] Alexander V. Terekhov, Guglielmo Montone, and J. Kevin O'Regan. Knowledge Transfer in Deep Block-Modular Neural Networks. In Stuart P. Wilson, Paul F. M. J. Verschure, Anna Mura, and Tony J. Prescott, editors, Biomimetic and Biohybrid Systems, volume 9222, pages 268–279. Springer International Publishing, Cham, 2015. doi: 10.1007/978-3-319-22979-9_27.

[60] Philippe Vincent-Lamarre, Guillaume Lajoie, and Jean-Philippe Thivierge. Driving reservoir models with oscillations: a solution to the extreme structural sensitivity of chaotic networks. Journal of Computational Neuroscience, 41(3):305–322, December 2016. doi: 10.1007/s10827-016-0619-3.

[61] P. R. Vlachas, J. Pathak, B. R. Hunt, T. P. Sapsis, M. Girvan, E. Ott, and P. Koumoutsakos. Backpropagation algorithms and Reservoir Computing in Recurrent Neural Networks for the forecasting of complex spatiotemporal dynamics. Neural Networks, 126:191–217, June 2020. doi: 10.1016/j.neunet.2020.02.016.

[62] Xiao-Jing Wang and Henry Kennedy. Brain structure and dynamics across scales: in search of rules. Current Opinion in Neurobiology, 37:92–98, April 2016. doi: 10.1016/j.conb.2015.12.010.

[63] L. T. Westlye, K. B. Walhovd, A. M. Dale, A. Bjornerud, P. Due-Tonnessen, A. Engvig, H. Grydeland, C. K. Tamnes, Y. Ostby, and A. M. Fjell. Life-Span Changes of the Human Brain White Matter: Diffusion Tensor Imaging (DTI) and Volumetry. Cerebral Cortex, 20(9):2055–2068, September 2010. doi: 10.1093/cercor/bhp280.

[64] Guangyu Robert Yang, Madhura R. Joglekar, H. Francis Song, William T. Newsome, and Xiao-Jing Wang. Task representations in neural networks trained to perform many cognitive tasks. Nature Neuroscience, 22(2):1–16, December 2018. doi: 10.1038/s41593-018-0310-2.

[65] Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. ICLR, page 11, 2018.

[66] Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient Surgery for Multi-Task Learning. arXiv:2001.06782 [cs, stat], December 2020. URL http://arxiv.org/abs/2001.06782.

[67] Anthony M. Zador. A critique of pure learning and what artificial neural networks can learn from animal brains. Nature Communications, 10(1):1–7, August 2019. doi: 10.1038/s41467-019-11786-6.

[68] Guanxiong Zeng, Yang Chen, Bo Cui, and Shan Yu. Continuous Learning of Context-dependent Processing in Neural Networks. arXiv:1810.01256 [cs], October 2018. URL http://arxiv.org/abs/1810.01256.

[69] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual Learning Through Synaptic Intelligence. arXiv:1703.04200 [cs, q-bio, stat], June 2017. URL http://arxiv.org/abs/1703.04200.

A Appendix

A.1 Training details

All networks were trained on a single laptop with a 2.7 GHz Quad-Core Intel Core i7 processor and 16 GB of memory, using a custom implementation of recurrent neural networks based on JAX [7], which is available at https://github.com/dashirn/FAST_RNN_JAX. We used a recurrent neural network (RNN) model with the hyperbolic tangent as the activation function throughout (σ = tanh, Equation 1), in accordance with previous approaches to modeling neuroscience tasks [37, 10, 34]. Previous work suggests that the dynamics obtained with different activation functions, such as tanh or ReLU, and with different network models such as RNNs, GRUs, and LSTMs are similar [33]. All networks were trained by minimizing the mean squared error, using Adam as the optimizer. We also compared our results with those obtained from the native implementation of dynamical non-interference [15], available at https://github.com/LDlabs/seqMultiTaskRNN under an MIT license, and obtained similar results. We performed a grid search over various parameter ranges (Table 2) and settled on the parameter combination that worked best across all training paradigms. We used a network of N = 350 units across all tasks and training paradigms. This ensured that individual modules trained well, and achieved the highest task performance across all training paradigms. The performance we obtained agrees with previously obtained results [15].
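As a rough illustration, the rate-RNN dynamics described above (σ = tanh, Equation 1) can be sketched as follows. This is a minimal NumPy sketch under assumed conventions (Euler discretization with a hypothetical dt/τ step size), not the paper's actual JAX implementation, which is available in the linked repository.

```python
import numpy as np

def rnn_step(x, u, W_rec, W_in, dt_over_tau=0.1):
    """One Euler-discretized update of a rate RNN with tanh nonlinearity.

    x: hidden state (N,); u: input (N_in,). The dt/τ = 0.1 step size is an
    assumption for illustration, not a value taken from the paper.
    """
    return (1.0 - dt_over_tau) * x + dt_over_tau * np.tanh(W_rec @ x + W_in @ u)

# Tiny example: N = 4 units, 2 inputs (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
N, N_in = 4, 2
W_rec = rng.standard_normal((N, N)) / np.sqrt(N)
W_in = rng.standard_normal((N, N_in)) / np.sqrt(N_in)
x = np.zeros(N)
for _ in range(10):
    x = rnn_step(x, np.ones(N_in), W_rec, W_in)
```

Because tanh is bounded, the state remains bounded under this update, which is one reason such rate models are numerically well behaved during training.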

Table 2. Parameter settings

Parameter                                              Range considered
Network size                                           100 – 500
Learning rate                                          1e-4 – 3e-3
Batch size                                             20 – 400
Scaling factor g                                       1.1 – 1.8
L2-norm weight regularization parameter                5e-6 – 5e-5
Activity regularization parameter                      1e-7
Number of iterations before decreasing learning rate   100 – 300
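A grid search over the ranges in Table 2 can be sketched as below. The specific grid values are hypothetical (Table 2 gives only the ranges considered), and the dictionary keys are illustrative names, not identifiers from the paper's code.

```python
from itertools import product

# Hypothetical discretization of the Table 2 ranges into a search grid.
grid = {
    "network_size": [100, 200, 350, 500],
    "learning_rate": [1e-4, 1e-3, 3e-3],
    "batch_size": [20, 200, 400],
    "g": [1.1, 1.5, 1.8],
    "l2_weight": [5e-6, 1e-5, 5e-5],
}

# Every combination of the values above, as a list of config dictionaries.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
```

Each config would then be used to train a network, keeping the combination with the best performance across all training paradigms.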

We set the learning rate at α = 1e-3 for all networks, the batch size at 200 trials, the scaling factor at g = 1.5, the L2-norm weight regularization parameter at 1e-5, the activity regularization parameter at 1e-7, and the number of iterations before decreasing the learning rate at 200. The weights in the recurrent weight matrix, W_rec (Equation 1), were initially drawn from the standard normal distribution and scaled by a factor of g/√N. The input weights, W_in, were also initially drawn from the standard normal distribution and scaled by a factor of 1/√N_in, with N_in varying by task. Each input unit also received an independent white noise input, η_i, with zero mean and a standard deviation of σ = 0.002. The output weights, W_out, were likewise drawn from the standard normal distribution and scaled by a factor of 1/√N_out, with N_out varying by task.
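A minimal sketch of this initialization scheme, assuming the standard g/√N, 1/√N_in, and 1/√N_out scaling conventions for standard-normal weights; the example sizes N_in = 4 and N_out = 2 are arbitrary:

```python
import numpy as np

def init_weights(N, N_in, N_out, g=1.5, seed=0):
    """Draw standard-normal weights and apply the scaling factors described
    in the text: g/sqrt(N) for W_rec, 1/sqrt(N_in) for W_in, and
    1/sqrt(N_out) for W_out. Hypothetical helper for illustration."""
    rng = np.random.default_rng(seed)
    W_rec = rng.standard_normal((N, N)) * g / np.sqrt(N)
    W_in = rng.standard_normal((N, N_in)) / np.sqrt(N_in)
    W_out = rng.standard_normal((N_out, N)) / np.sqrt(N_out)
    return W_rec, W_in, W_out

# N = 350 as in the paper; input/output sizes vary by task (here arbitrary).
W_rec, W_in, W_out = init_weights(N=350, N_in=4, N_out=2)
```

With g > 1, this initialization places the recurrent network in the rich dynamical regime commonly used in the reservoir and trained-RNN literature.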

A.1.1 Modular network with task primitives

The modular network was initialized and trained with the parameters described above. Each module, M1–M3, was trained separately, with N_M1 = 50, N_M2 = 150, and N_M3 = 150. While a particular module was being trained, all other weights in the network remained frozen (i.e., they were not updated with information from the error gradient). For the perturbation studies, we considered two structural variants: a modular, fully disconnected network (Figure 1) and a modular, fully connected network (Figure S1). After the modules were trained, tasks T1–T9 were trained using separate readout weights, W_out^T1, W_out^T2, ..., W_out^T9.
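Freezing all weights outside the module being trained can be implemented by masking the gradient. The sketch below is a hypothetical NumPy illustration for the fully disconnected variant, where only the within-module block of W_rec is updated; the module ordering along the diagonal and the helper names are assumptions, not taken from the paper's code.

```python
import numpy as np

# Module sizes from the text: N_M1 = 50, N_M2 = 150, N_M3 = 150 (N = 350).
N = 350
sizes = {"M1": 50, "M2": 150, "M3": 150}
starts = {"M1": 0, "M2": 50, "M3": 200}  # assumed diagonal ordering

def module_mask(name):
    """1s on the within-module block of W_rec, 0s elsewhere (fully
    disconnected variant: no cross-module entries are ever updated)."""
    mask = np.zeros((N, N))
    i, n = starts[name], sizes[name]
    mask[i:i + n, i:i + n] = 1.0
    return mask

def masked_update(W_rec, grad, name, lr=1e-3):
    """Gradient step that only touches the named module; all other
    weights remain frozen because their gradient is zeroed by the mask."""
    return W_rec - lr * module_mask(name) * grad

# Example: training M2 leaves M1 and M3 weights untouched.
W = np.ones((N, N))
W_new = masked_update(W, np.ones((N, N)), "M2")
```

The same masking idea extends to the fully connected variant by widening the mask to include the trained module's incoming or outgoing connections.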

A.2 Task performance

In order to measure task performance, a test set of 250 trials was randomly generated from a separate seed for each task. Performance was calculated as the percentage of correct decisions over this test set. Decisions, rather than other error metrics such as mean squared error, were chosen as a performance metric as they reflect the ultimate goal of biological agents. A decision was considered correct if the mean output over the response period did not deviate more than a fixed amount (2π/10) from the target direction, following previous approaches [15]. Performance was averaged across all output units.
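A sketch of this decision criterion, assuming deviation is measured as circular distance between the mean output direction over the response period and the target direction (the circular-mean and wrapping conventions are assumptions; output units and decoding details are simplified away):

```python
import numpy as np

def fraction_correct(outputs, targets, threshold=2 * np.pi / 10):
    """Fraction of trials whose mean response-period direction lies within
    `threshold` radians of the target direction.

    outputs: (trials, time) decoded directions during the response period;
    targets: (trials,) target directions; both in radians.
    """
    # Circular mean over the response period (assumed convention).
    mean_dir = np.arctan2(np.sin(outputs).mean(axis=1),
                          np.cos(outputs).mean(axis=1))
    # Wrapped angular difference in (-pi, pi].
    dev = np.angle(np.exp(1j * (mean_dir - targets)))
    return np.mean(np.abs(dev) <= threshold)

# Two trials with target 0: one within 2*pi/10, one well outside it.
out = np.stack([np.full(20, 0.1), np.full(20, 1.5)])
acc = fraction_correct(out, np.array([0.0, 0.0]))  # → 0.5
```

Using a fixed angular tolerance makes the metric comparable across tasks with different output dimensionalities.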


Fig S1. Schematic of fully connected recurrent network with modular task primitives. Left: inputs for modules M1-M3. Middle: recurrent weight matrix of the modular network, with separate partitions for modules M1-M3 along the diagonal. Right: different tasks (T1-T9) trained into different output weights, leveraging the same modular network.

A.3 Perturbation studies

We considered four types of perturbations: lesioned weights, silenced activity, global Gaussian noise, and independent Gaussian input noise. We performed each type of perturbation in two conditions: (1) whole-network (Figures 5 and S2) and (2) modular (Figures 6 and S3). In the whole-network condition, perturbations were applied across all connections or units in the network: e.g., connections to be lesioned were picked from among the entire recurrent weight matrix, units to be silenced from among all units, and global Gaussian noise was added to the whole network. In the modular condition, perturbations were applied locally, restricted to a single module. For the modular network with task primitives, perturbations in this condition were performed within module 2 (M2), while perturbations in the networks trained sequentially and with dynamical non-interference were averaged across 5 modules of the same size picked randomly from the entire connectivity matrix. All perturbations were repeated for 25 draws with different random seeds, for each of 2 networks initialized with different random seeds.
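As an illustration, the weight-lesion perturbation in both conditions might be implemented as below; the helper and its row/column-slice interface are hypothetical sketches, not taken from the paper's code.

```python
import numpy as np

def lesion_weights(W, frac, rng, rows=slice(None), cols=slice(None)):
    """Zero out a random fraction `frac` of the connections in the
    sub-block W[rows, cols]; defaults to the whole recurrent matrix."""
    W = W.copy()
    block = W[rows, cols]
    idx = rng.choice(block.size, size=int(frac * block.size), replace=False)
    flat = block.ravel()
    flat[idx] = 0.0
    W[rows, cols] = flat.reshape(block.shape)
    return W

rng = np.random.default_rng(0)
W = np.ones((350, 350))

# Whole-network condition: lesion 10% of all recurrent connections.
W_whole = lesion_weights(W, 0.1, rng)

# Modular condition: lesion 10% of connections within M2 only
# (assuming M2 occupies indices 50–199, as in the module sizes above).
W_mod = lesion_weights(W, 0.1, rng, slice(50, 200), slice(50, 200))
```

Silencing units and adding Gaussian noise follow the same pattern, operating on rows of the weight matrix or on the state/input vectors instead.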

A.4 Supplementary results


Fig S2. Performance after perturbations. Performance (% decisions correct on a 250 trial test set) for tasks T1–T5 (MemDm1, MemDm2, ContextMemDm1, ContextMemDm2, MultiMem) as a function of % weights lesioned (A, E, I, M, Q), % units silenced (B, F, J, N, R), global Gaussian noise as a function of standard deviation (σ) (C, G, K, O, S), and uncorrelated Gaussian noise added to all inputs (D, H, L, P, T). Error bars show the standard deviation across 25 perturbation draws with different random seeds, for each of 2 networks initialized with different random seeds.


Fig S3. Performance after modular perturbations (see Section A.3 for details). Performance (% decisions correct on a 250 trial test set) for tasks T1–T5 (MemDm1, MemDm2, ContextMemDm1, ContextMemDm2, MultiMem) as a function of % weights lesioned (A, D, G, J, M), % units silenced (B, E, H, K, N), and global Gaussian noise as a function of standard deviation (σ) (C, F, I, L, O). Error bars show the standard deviation across 25 perturbation draws with different random seeds, for each of 2 networks initialized with different random seeds.
