Downloaded by guest on October 1, 2021

Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization

Nicolas Y. Masse (a,1), Gregory D. Grant (a), and David J. Freedman (a,b,1)

(a) Department of Neurobiology, The University of Chicago, Chicago, IL 60637; and (b) The Grossman Institute for Neuroscience, Quantitative Biology and Human Behavior, The University of Chicago, Chicago, IL 60637

Edited by Terrence J. Sejnowski, Salk Institute for Biological Studies, La Jolla, CA, and approved September 11, 2018 (received for review March 4, 2018)

Humans and most animals can learn new tasks without forgetting old ones. However, training artificial neural networks (ANNs) on new tasks typically causes them to forget previously learned tasks. This phenomenon is the result of "catastrophic forgetting," in which training an ANN disrupts connection weights that were important for solving previous tasks, degrading task performance. Several recent studies have proposed methods to stabilize connection weights of ANNs that are deemed most important for solving a task, which helps alleviate catastrophic forgetting. Here, drawing inspiration from algorithms that are believed to be implemented in vivo, we propose a complementary method: adding a context-dependent gating signal, such that only sparse, mostly nonoverlapping patterns of units are active for any one task. This method is easy to implement, requires little computational overhead, and allows ANNs to maintain high performance across large numbers of sequentially presented tasks, particularly when combined with weight stabilization. We show that this method works for both feedforward and recurrent network architectures, trained using either supervised or reinforcement learning. This suggests that using multiple, complementary methods, akin to what is believed to occur in the brain, can be a highly effective strategy to support continual learning.

synaptic stabilization | context-dependent gating | continual learning | artificial intelligence

Significance

Artificial neural networks can suffer from catastrophic forgetting, in which learning a new task causes the network to forget how to perform previous tasks. While previous studies have proposed various methods that can alleviate forgetting over small numbers (<=10) of tasks, it is uncertain whether they can prevent forgetting across larger numbers of tasks. In this study, we propose a neuroscience-inspired scheme, called "context-dependent gating," in which mostly nonoverlapping sets of units are active for any one task. Importantly, context-dependent gating has a straightforward implementation, requires little extra computational overhead, and, when combined with previous methods to stabilize connection weights, can allow networks to maintain high performance across large numbers of sequentially presented tasks.

Humans and other advanced animals are capable of learning large numbers of tasks during their lifetime, without necessarily forgetting previously learned information. This ability to learn and not forget past knowledge, referred to as continual learning, is a critical requirement to design artificial neural networks (ANNs) that can build upon previous knowledge to solve new tasks. However, when ANNs are trained on several tasks sequentially, they often suffer from "catastrophic forgetting," wherein learning new tasks degrades performance on previously learned tasks. This forgetting occurs because learning a new task can alter connection weights away from optimal solutions to previous tasks.

Given that humans and other animals are capable of continual learning, it makes sense to look to neuroscientific studies of the brain for possible solutions to catastrophic forgetting. Within the brain, most excitatory synaptic connections are located on dendritic spines (1), whose morphology shapes the strength of the synaptic connection (2, 3). This morphology, and hence the functional connectivity of the associated synapses, can be either dynamic or stable, with lifetimes ranging from seconds to years (2, 4, 5). Particularly, the stabilization of dendritic spines is associated with skill acquisition and retention (6, 7). These results have inspired two recent modeling studies proposing synaptic stabilization methods that mimic spine stabilization to alleviate catastrophic forgetting (8, 9). Specifically, the authors propose methods to measure the importance of each connection and bias toward solving a task, and then to stabilize each according to its importance. Applying these stabilization techniques allows ANNs to learn several (<=10) sequentially trained tasks with only a small loss in accuracy.

However, humans and other animals will likely encounter large numbers (100s) of different tasks and environments that must be learned and remembered, and it is uncertain whether synaptic stabilization alone can support continual learning across large numbers of tasks. Consistent with this notion, neuroscience studies have proposed that diverse mechanisms operating at the systems (10, 11), morphological (2, 4), and transcriptional (12) levels all act to stabilize learned information, raising the question of whether several complementary algorithms are required to support continual learning in ANNs.

In this study, we examine whether another neuroscience-inspired solution, context-dependent gating (XdG), can further support continual learning. In the brain, switching between tasks can disinhibit sparse, highly nonoverlapping sets of dendritic branches (10). This allows synaptic changes to occur on a small set of dendritic branches for any one task, with minimal interference with synaptic changes that occurred for previous tasks on (mostly) different branches. In this study, we implement a simplified version of this solution: the XdG algorithm consists of an additional signal, unique for each task, that is projected onto all hidden neurons, such that sparse and mostly nonoverlapping sets of units are active for any one task. Importantly, this algorithm is simple to implement and requires little extra computational overhead.

Author contributions: N.Y.M. and D.J.F. designed research; N.Y.M. and G.D.G. performed research; N.Y.M. and G.D.G. analyzed data; and N.Y.M., G.D.G., and D.J.F. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Published under the PNAS license.

Data deposition: All code is available at https://github.com/nmasse/Context-Dependent-Gating.

(1) To whom correspondence may be addressed. Email: [email protected] or [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1803839115/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1803839115 | PNAS Latest Articles | 1 of 9
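The gating scheme just described can be illustrated with a minimal sketch (our own notation and sizes, not the authors' released TensorFlow implementation): each task is assigned a fixed random binary mask that silences a large fraction of hidden units, and hidden activity is multiplied elementwise by the current task's mask.

```python
import numpy as np

def make_gating_masks(n_tasks, n_hidden, gate_pct=0.8, seed=0):
    """One fixed binary mask per task; gate_pct of the units are silenced (set to 0)."""
    rng = np.random.default_rng(seed)
    masks = np.ones((n_tasks, n_hidden))
    n_gated = int(gate_pct * n_hidden)
    for t in range(n_tasks):
        gated = rng.choice(n_hidden, size=n_gated, replace=False)
        masks[t, gated] = 0.0
    return masks

def gated_hidden(x, W, b, mask):
    """ReLU hidden layer whose activity is multiplied elementwise by the task's mask."""
    return mask * np.maximum(0.0, x @ W + b)
```

With 80% of units gated, any two tasks share on average only 4% of active units (0.2 x 0.2), so weight updates for different tasks touch mostly disjoint sets of parameters.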

We tested our method on feedforward networks trained on 100 sequential MNIST permutations (13) and on the ImageNet dataset (14) split into 100 sequential tasks. XdG or synaptic stabilization (8, 9), when used alone, is partially effective at alleviating forgetting across the 100 tasks. However, when XdG is combined with synaptic stabilization, networks can successfully learn all 100 tasks with little forgetting. Furthermore, combining XdG with stabilization allows recurrent neural networks (RNNs), trained by using either supervised or reinforcement learning, to sequentially learn 20 tasks commonly used in cognitive and systems neuroscience experiments (15) with high accuracy. Hence, this neuroscience-inspired solution, XdG, when used in tandem with existing stabilization methods, dramatically increases the ability of ANNs to learn large numbers of tasks without forgetting previous knowledge.

Results

The goal of this study was to develop neuroscience-inspired methods to alleviate catastrophic forgetting in ANNs. Two previous studies have proposed one such method: stabilizing connection weights depending on their importance for solving a task (8, 9). This method, inspired by neuroscience research demonstrating that stabilization of dendritic spines is associated with task learning and retention (6, 7), has shown promising results when trained and tested on sequences of <=10 tasks. However, it is uncertain how well these methods perform when trained on much larger numbers of sequential tasks.

We first tested whether these methods can alleviate catastrophic forgetting by measuring performance on 100 sequentially presented digit classification tasks. Specifically, we tested a fully connected feedforward network with two hidden layers (2,000 units each; Fig. 1A) on the permuted MNIST digit classification task (13). This involved training the network on the MNIST task for 20 epochs, permuting the 784 pixels in all images with the same permutation, and then training the network on this new set of images. This test is a canonical example of an "input reformatting" problem, in which the input and output semantics (pixel intensities and digit identity, respectively) are identical across all tasks, but the input format (the spatial location of each pixel) changes between tasks (13).

Fig. 1. Network architectures for the permuted MNIST task. The ReLU activation function was applied to all hidden units. (A) The baseline network consisted of a multilayer perceptron with 784 inputs, two hidden layers of 2,000 units each, and 10 outputs. (B) For some networks, a context signal indicating task identity projected onto the two hidden layers. The weights between the context signal and the hidden layers were trainable. (C) Split networks (net.) consisted of five independent subnetworks, with no connections between subnetworks. Each subnetwork consisted of two hidden layers with 733 units each, such that it contained the same number of parameters as the full network described in A. Each subnetwork was trained and tested on 20% of tasks, implying that, for every task, the activity of four of the five subnetworks was set to zero (gated). A context signal, as described in B, projected onto the two hidden layers. (D) XdG consisted of multiplying the activity of a fixed percentage of hidden units by 0 (gated), while the rest were left unchanged (not gated). The results in Fig. 2D involve gating 80% of hidden units.

We sequentially trained the base ANN on 100 permutations of the image set. Without any synaptic stabilization, this network can classify digits with an accuracy of 98.5% for any single permutation, but mean classification accuracy falls to 52.5% after the network is trained on 10 permutations, and to 19.1% after training on 100 permutations.

Synaptic Stabilization. Given that this ANN rapidly forgets previously learned permutations of the MNIST task, we next asked how well two previously proposed synaptic stabilization methods, synaptic intelligence (8) and elastic weight consolidation (EWC) (9), alleviate catastrophic forgetting. Both methods work by applying a quadratic penalty term on adjusting synaptic values, multiplied by a value quantifying the importance of each connection weight and bias term toward solving previous tasks (Materials and Methods). We note that we use the term "synapse" to refer to both connection weights and bias terms. Briefly, EWC works by approximating how small changes to each parameter affect the network output, calculated after training on each task is completed, while synaptic intelligence works by calculating how the gradient of the loss function correlates with parameter updates, calculated during training. Both stabilization methods significantly alleviate catastrophic forgetting: Mean classification accuracies for networks with EWC (green curve, Fig. 2A) were 95.3% and 70.8% after 10 and 100 permutations, respectively, and mean classification accuracies for networks with synaptic intelligence (magenta curve, Fig. 2A) were 97.0% and 82.3% after 10 and 100 permutations, respectively. We note that we used the hyperparameters that produced the greatest mean accuracy across all 100 permutations, not just the first 10 permutations (Materials and Methods). Although both stabilization methods successfully mitigated forgetting, mean classification accuracy after 100 permutations was still far below single-task accuracy. This prompts the question of whether an additional, complementary method can raise performance even further.

Context Signal. One possible reason why classification accuracy decreased after many permutations was that ANNs were not informed as to what permutation, or context, was currently being tested. In contrast, context-dependent signals in the brain, likely originating from areas such as the prefrontal cortex, project to various cortical areas and allow neural circuits to process incoming information in a task-dependent manner (16-18). Thus, we tested whether such a context signal improves mean classification accuracy. Specifically, a one-hot vector (N-dimensional, consisting of N - 1 zeros and 1 one), indicating task identity, projected onto the two hidden layers. The weights projecting the context signal could be trained by the network.

We found that networks combining parameter stabilization with a context signal had greater mean classification accuracy (context signal with synaptic intelligence = 89.6%, with EWC = 87.3%; Fig. 2B) than synaptic stabilization alone. However, the mean classification accuracy after 100 tasks still falls short of single-task accuracy, suggesting that contextual information alone is insufficient to alleviate forgetting.
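The quadratic penalty shared by both stabilization methods above can be sketched as follows. This is a simplified illustration: the importance estimates `omega` and the scaling constant `c` stand in for the method-specific quantities defined in refs. 8 and 9, which are computed differently by synaptic intelligence and EWC.

```python
import numpy as np

def stabilization_penalty(theta, theta_prev, omega, c=1.0):
    """Quadratic penalty discouraging changes to parameters that were
    important (large omega) for previously learned tasks."""
    return c * np.sum(omega * (theta - theta_prev) ** 2)
```

During training on a new task, this penalty is added to the task loss, so gradient descent trades new-task accuracy against movement of parameters deemed important for earlier tasks.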

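The permuted MNIST tasks used throughout these experiments can be generated with a short sketch (our own helper, assuming images are flattened to 784-pixel rows): each task applies one fixed random permutation to the pixels of every image, leaving the labels unchanged.

```python
import numpy as np

def make_permuted_tasks(images, n_tasks, seed=0):
    """Each task applies one fixed random permutation to the pixel columns
    of every image; labels are unchanged ("input reformatting")."""
    rng = np.random.default_rng(seed)
    perms = [rng.permutation(images.shape[1]) for _ in range(n_tasks)]
    return [images[:, p] for p in perms], perms
```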
Split Network. Recent neuroscience studies have highlighted how context-dependent signaling not only allows various cortical areas to process information in a context-dependent manner, but also selectively inhibits large parts of the network (19, 20). This inhibition potentially alleviates catastrophic forgetting, provided that changes in synaptic weights occur only if their presynaptic and postsynaptic partners are active during the task, and are frozen otherwise.

To test this possibility, we split the network into five subnetworks of equal size (Fig. 2C). Each subnetwork contained 733 neurons in each hidden layer, so that the number of free parameters in this split network matched the number of connection weights in the full network. For each permutation, one subnetwork was activated, and the other four were fully inhibited. Furthermore, the context signal described in Fig. 2B projected onto the hidden layers. This architecture achieved greater mean classification accuracies (split networks with context signal and synaptic intelligence = 93.1% and with EWC = 91.4%) than full networks with synaptic stabilization combined with a context signal.

Fig. 2. Task accuracy on the permuted MNIST benchmark task. All curves show the mean classification accuracy as a function of the number of tasks the network was trained on, where each task corresponds to a different random permutation of the input pixels. (A) The green and magenta curves represent the mean accuracy for networks with EWC or synaptic intelligence, respectively. (B) The solid green and magenta curves represent the mean accuracy for networks with EWC or synaptic intelligence, respectively (same as in A), and the dashed green and magenta curves represent the mean accuracy for networks with EWC or synaptic intelligence combined with a context signal, respectively. (C) The dashed green and magenta curves represent the mean accuracy for networks with a context signal combined with EWC or synaptic intelligence, respectively (same as in B), and the solid green and magenta curves represent the mean accuracy for split networks with a context signal combined with EWC or synaptic intelligence, respectively. (D) The black curve represents the mean accuracy of networks with XdG used alone, and the dashed green and magenta curves represent the mean accuracy for networks with XdG combined with EWC or synaptic intelligence (SI), respectively.

These results suggest that context-dependent inhibition can potentially allow networks to learn sequentially presented tasks with less forgetting. However, splitting networks into a fixed number of subnetworks a priori may be impractical for real-world tasks. Furthermore, this method forces each synapse to learn multiple tasks, whereas greater classification accuracies might be possible if unique, partially overlapping sets of synapses are responsible for learning each new task. Thus, we tested a final method, XdG, in which the activity of X% of hidden units, randomly chosen, was multiplied by 0 (gated), while the activity of the other (1 - X)% was left unchanged. In this study, we gated 80% of hidden units. The identity of the gated units was fixed during training and testing for each permutation, and a different set of units was chosen for each permutation. When XdG was used alone (black curve, Fig. 2D), mean accuracy was 97.1% after 10 tasks and 61.4% across all 100 permutations. However, when XdG was combined with synaptic intelligence or EWC, mean classification accuracy was 95.4% for both stabilization methods (dashed magenta and green lines), greater than any of the previous methods we tested. Thus, while XdG alone does not support continual learning, it becomes highly effective when paired with existing synaptic stabilization methods.

XdG combined with synaptic stabilization allowed networks to learn 100 sequential tasks with minimal loss in performance, with mean accuracy dropping from 98.2% on the first task to 95.4% across all 100 tasks. This result raises the question of whether XdG with stabilization only allows networks to learn a greater number of tasks before reaching a critical point where accuracy drops abruptly, or instead allows for a more gradual loss in accuracy. Thus, we repeated our analysis for continual learning over 500 sequentially trained tasks and found that XdG combined with stabilization produced only a gradual loss of accuracy (XdG with synaptic intelligence = 90.7%; SI Appendix, Fig. S2). In comparison, mean accuracy for synaptic stabilization alone decreased more severely (synaptic intelligence = 54.9%).

The optimal percentage of units to gate is a compromise between keeping a greater number of units active, which increases the network's representational capacity, and keeping a greater number of units gated, which decreases connection weight changes, and hence forgetting, between tasks. For the permuted MNIST task, we found that mean classification accuracy peaked when between 80% and 86.7% of units were gated (values of 95.4% and 95.5%, respectively; SI Appendix, Fig. S1). We note that this optimal value depends on the network size and architecture and on the number of tasks upon which the network is trained.

Analyzing the Interaction Between XdG and Synaptic Stabilization. We would like to understand why XdG combined with synaptic stabilization alleviates forgetting better than stabilization alone. We hypothesize that to accurately learn many sequential tasks with little forgetting, the network must balance two competing demands: it must stabilize synapses that are deemed important for previous tasks, yet remain flexible so that it can adjust synaptic values by sufficient amounts to accurately learn new tasks. To demonstrate that synaptic stabilization (synaptic intelligence and EWC) is the basis by which networks accurately remember previous tasks, we trained a network with synaptic intelligence on the first 100 MNIST permutations and then measured the mean accuracy across all permutations after perturbing individual synaptic values (see Materials and Methods for details regarding this analysis). As expected, perturbing more-important synapses degraded accuracy more than perturbing less-important synapses (R = -0.904; Fig. 3A).

To demonstrate that learning new tasks requires flexible synaptic values that can be adjusted by sufficient amounts, we show the network's accuracy on each new MNIST permutation as it learns it (y axis, Fig. 3B) vs. the Euclidean distance between synaptic values measured before and after training on each MNIST permutation (x axis). For synaptic intelligence and EWC, synaptic importance values accumulate across tasks (Materials and Methods), leading to greater stabilization and less flexibility as more tasks are learned.

For the first several tasks (red dots, Fig. 3B), synapses require less stabilization, allowing the network to adjust synapse values by relatively large amounts to accurately learn each new task. However, as the network learns increasing numbers of tasks (blue dots), synaptic importances, and hence stabilization, increase, preventing the network from adjusting synapse values by large amounts between tasks and decreasing accuracy on each new task.

To help us understand why XdG combined with stabilization better satisfies this trade-off compared with stabilization alone, in Fig. 3 C-E, Left, we show the distribution of importance values for synapses connecting the input and first hidden layers (referred to as layer 1; Fig. 3C), connecting the first and second hidden layers (layer 2; Fig. 3D), and connecting the second hidden and output layers (layer 3; Fig. 3E). For all three layers, the mean importance values were lower for networks with XdG (layer 1, Fig. 3C: synaptic intelligence = 0.793, synaptic intelligence + XdG = 0.216; layer 2, Fig. 3D: synaptic intelligence = 0.076, synaptic intelligence + XdG = 0.026; layer 3, Fig. 3E: synaptic intelligence = 1.81, synaptic intelligence + XdG = 0.897). We hypothesized that having a larger number of synapses with low importance is beneficial, as the network could adjust those synapses to learn new tasks with minimal disruption to performance on previous tasks.

To confirm this, we binned the synapses by their importance and calculated the Euclidean distance in synaptic values measured before and after training on the 100th MNIST permutation (Fig. 3 C-E, Right). These images suggest that synaptic distances are greater for networks with XdG combined with synaptic intelligence and that larger changes in synaptic values are more confined to synapses with low importances.

To confirm, we calculated the synaptic distance using all synapses from each layer and found that it was greater for networks with XdG combined with synaptic intelligence (layer 1, Fig. 3C: synaptic intelligence = 1.21, synaptic intelligence + XdG = 1.65; layer 2, Fig. 3D: synaptic intelligence = 1.26, synaptic intelligence + XdG = 8.54; layer 3, Fig. 3E: synaptic intelligence = 0.06, synaptic intelligence + XdG = 0.09). Furthermore, the distance for each synaptic value multiplied by its importance was lower for networks with XdG combined with synaptic intelligence (layer 1, Fig. 3C: synaptic intelligence = 457.33, synaptic intelligence + XdG = 83.56; layer 2, Fig. 3D: synaptic intelligence = 109.75, synaptic intelligence + XdG = 17.83; layer 3, Fig. 3E: synaptic intelligence = 12.15, synaptic intelligence + XdG = 3.16). Thus, networks with XdG combined with stabilization can make larger adjustments to synapse values to accurately learn new tasks and simultaneously make smaller adjustments to synapses with high importance.

By gating 80% of hidden units, 96% of the weights connecting the two hidden layers and 80% of all other parameters are not used for any one task. This allows networks with XdG to maintain a reservoir of synapses that have not been previously used, or used sparingly, that can be adjusted by large amounts to learn new tasks without disrupting performance on previous tasks.

Fig. 3. Analyzing the interaction between XdG and synaptic stabilization. (A) The effect of perturbing synapses of various importance is shown for a network with synaptic intelligence that was sequentially trained on 100 MNIST permutations. Each dot represents the change in mean accuracy (y axis) after perturbing a single synapse, whose importance is indicated on the x axis. For visual clarity, we show the results from 1,000 randomly selected synapses chosen from the connection weights to the output layer. (B) Scatter plot showing the Euclidean distance in synaptic values measured before and after training on each MNIST permutation (x axis) vs. the accuracy the network achieved on each new permutation (y axis). The task number in the sequence of 100 MNIST permutations is indicated by the red to blue color progression. (C, Left) Histogram of synaptic importances from the connections between the input layer and the first hidden layer (layer 1), for networks with XdG (green curve) and without (magenta curve). (C, Right) Synaptic distance, measured before and after training on the 100th MNIST permutation, for groups of synapses binned by importance. (D) Same as C, except for the synapses connecting the first and second hidden layers (layer 2). (E) Same as C, except for the synapses connecting the second hidden layer and the output layer (layer 3).

XdG on the ImageNet Dataset. A simplifying feature of the permuted MNIST task is that the input and output semantics are identical between tasks. For example, output unit 1 is always associated with the digit 1, output unit 2 is always associated with the digit 2, etc. We wanted to ensure that our method generalizes to cases in which the output semantics are similar between tasks, but not identical. Thus, we tested our method on the ImageNet dataset, which comprises approximately 1.3 million images, with 1,000 class labels. For computational efficiency, we used images that were downscaled to 32 x 32 pixels from the more traditional resolution of 256 x 256 pixels. We divided the dataset into 100 sequential tasks, in which the first 10 labels of the ImageNet dataset were assigned to task 1, the next 10 labels were assigned to task 2, etc. The 10 class labels used in each of the 100 tasks are shown in SI Appendix, Table S1. Such a test can be considered an example of a "similar task" problem as defined by ref. 13.

We tested our model on the ImageNet tasks using two different output layer architectures. The first was a "multihead" output layer, which consisted of 1,000 units, one for each image class. The output activity of the 990 output units not applicable to the current task was set to zero.
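The multihead masking just described (only the 10 output units belonging to the current task remain active) can be sketched as follows; the helper name and class-block layout are ours, chosen to match the 10-classes-per-task ImageNet split described above.

```python
import numpy as np

def multihead_mask(logits, task_id, classes_per_task=10):
    """Zero the output activity of all units outside the current task's
    10-class block (990 of the 1,000 units for the ImageNet split)."""
    mask = np.zeros_like(logits)
    start = task_id * classes_per_task
    mask[..., start:start + classes_per_task] = 1.0
    return logits * mask
```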

To measure the maximum obtainable accuracy our networks could achieve when sequentially trained on the 100 tasks, we trained and tested networks without stabilization and with synaptic values reset between tasks, and measured their accuracy at learning each new task (we disregarded accuracy on previously trained tasks). These networks achieved a mean accuracy of 56.5%, representing the maximum mean accuracy any network of this architecture could achieve on the sequence of 100 tasks. In comparison, sequentially trained networks using the multihead output layer and without synaptic stabilization achieved a mean accuracy of 36.7% across the 100 tasks. We found that adding synaptic stabilization almost fully alleviated forgetting (synaptic intelligence = 51.1%, EWC = 54.9%; Fig. 4A).

The multiheaded network architecture, while potentially effective, can be impractical for real-world implementations, as it requires one output unit for each possible output class the network might encounter, which might not be known a priori. Thus, we also tested a "single-head" output layer, which consisted of only 10 output units that were associated with different image classes for each of the 100 tasks. Mean classification accuracy for this more-challenging architecture was substantially lower (without stabilization = 11.6%, synaptic intelligence = 12.3%, EWC = 10.5%; Fig. 4A).

We wanted to know whether our method could alleviate forgetting for this more-challenging architecture, in which output units were associated with different image classes for different tasks. Thus, we repeated our analysis of Fig. 2 using a single-head output layer. We found that adding a context signal substantially increased mean classification accuracy (context signal with synaptic intelligence = 44.4%, with EWC = 42.7%; Fig. 4B). Splitting networks into five subnetworks did not lead to any improvement over using a context signal with stabilization alone (split networks with context signal and synaptic intelligence = 40.0%, and with EWC = 42.5%; Fig. 4C). Next, we trained networks with XdG, in which we gated 80% of hidden units. Mean accuracy for XdG used alone (black curve, Fig. 4D; XdG = 28.1%) was greater than that of networks with synaptic intelligence or EWC used alone, but less than that of networks combining stabilization with a context signal. However, when XdG was combined with synaptic intelligence or EWC, networks could learn all 100 tasks with little forgetting (XdG with synaptic intelligence = 52.4%, with EWC = 50.7%; Fig. 4D). Thus, XdG used in tandem with synaptic stabilization can alleviate catastrophic forgetting, even when the output domain differs between tasks.

XdG of RNNs. The sequential permuted MNIST and ImageNet tests demonstrated that XdG, combined with synaptic stabilization, can alleviate catastrophic forgetting for feedforward networks performing classification tasks, trained through supervised learning. To demonstrate that our method generalizes to other network architectures and task conditions, we trained RNNs on 20 cognitively demanding tasks that were sequentially presented (15). These 20 tasks are similar to those commonly used in cognitive and systems neuroscience experiments and involve decision-making, working memory, categorization, and inhibitory control. In a physical setting, all tasks involve one or multiple visual motion stimuli presented to a subject, who reports their decisions by either maintaining or shifting their gaze to one of several possible target locations (akin to maintaining fixation or performing a saccadic eye movement).

Schematics of the first four tasks are shown in Fig. 5A. In the Go task, trials begin with a fixation period lasting a random duration, followed by a motion direction stimulus (represented by the green arrow) that could occur in one of two locations. As soon as the fixation cue (the white centrally located dot) disappears, the network should respond by moving in the direction of the motion stimulus (represented by the magenta arrow). The RT-Go task is similar to the Go task above, except that, crucially, the network should ignore the fixation cue and respond as soon as the motion stimulus is presented, responding in the direction of the motion stimulus as soon as it is briefly presented. In the Delay Go task, the network must maintain the motion direction in short-term memory across a delay period until the fixation cue disappears. All 20 tasks are fully described in Materials and Methods. The RNN is provided input from 64 motion direction-tuned input units and 4 fixation-tuned units. It projected onto nine output units; one unit represented the fixation response (i.e., maintain fixation and withhold a choice), and the other eight represented responses in eight different directions.

To measure forgetting, we sequentially trained networks on all 20 tasks and then measured accuracy on each task after the network had learned all 20. We first trained RNNs using standard supervised learning, in which all network parameters were adjusted to minimize the difference between the actual network output and the target output for each task. By using this approach, RNNs equipped with synaptic stabilization (synaptic intelligence) and a context signal (green dots, Fig. 5B) achieved a mean task accuracy of 80.0%, with a range of 43.1-100.0%. In comparison, networks with stabilization combined with XdG (magenta dots) achieved a mean accuracy of 98.2%, with a range of 92.9-100.0% across all 20 tasks.

If our method is to work in more practical settings, and if something akin to this method is implemented in the brain, then it should be compatible with reinforcement-learning-based training. Thus, we repeated our test on networks with synaptic stabilization (using a modified version of synaptic intelligence compatible with reinforcement learning; Materials and Methods) combined with XdG. Mean accuracy of networks with XdG combined with synaptic stabilization (black dots) was 96.4%, with a range of 86.4-100.0% across all 20 tasks. These results demonstrate that XdG combined with synaptic stabilization can allow RNNs to learn a large sequence of cognitively demanding tasks and is compatible with supervised or reinforcement learning.

Discussion

In this study, we have shown that XdG, used in conjunction with previous methods to stabilize synapses, can alleviate catastrophic forgetting in feedforward and recurrent networks trained using either supervised or reinforcement learning on large numbers of sequentially presented tasks. This method is simple to implement and computationally inexpensive. Importantly, this study highlights the effectiveness of using multiple, complementary strategies to alleviate catastrophic forgetting, as opposed to relying on a single strategy. Such an approach would be consistent with that used by the brain, which appears to use a diverse set of mechanisms to combat forgetting and promote consolidation (4, 6, 10, 11).

Transfer Learning. Humans and other advanced animals can rapidly learn new rules or tasks, sometimes from only a few data points, in contrast to the thousands of tasks or data points usually required for ANNs to accurately perform new tasks. This is because most new tasks and contexts are at least partially similar to tasks or contexts that have been previously encountered, and one can use past knowledge learned from these similar tasks to help rapidly learn the rules and contingencies of the new task. This process of using past knowledge to rapidly solve new tasks is referred to as transfer learning, and several groups have recently proposed how ANNs can implement this form of learning (22-24). Although XdG in its current form likely does not support transfer learning, one could speculate how such a signal could be modified to perform this function. Suppose that specific ensembles of units
description and information to Materials Full against, learn tasks. must work previous network actively in the to tasks, or many ignore, learn must network to the Thus, 180 that stimulus. direction except the task, the Go in Lastly, the respond respond. to can similar it is time Anti-Go which at disappears, cue fixation lhuhteXGmto ehv rpsdi hsstudy this in proposed have we method XdG the Although super- using by trained all were above described networks The with- tasks these learn can methods various how assess To received that (21) cells LSTM 256 of consisted RNN Our uasadohravne nml can animals advanced other and Humans ◦ poiet h oindirection motion the to opposite NSLts Articles Latest PNAS aeil n Methods) and Materials | f9 of 5

underlie the various cognitive processes, or building blocks, required to solve different tasks. Then, learning a new task would not require relearning entirely new sets of connection weights, but, rather, implementing a context-dependent signal that activates the necessary building blocks, and facilitates their interaction, to solve the new task. PathNet (24), a recent method in which a genetic algorithm selects a subset of the network to use for each task, is one prominent example of how selective gating can facilitate transfer. Although it is computationally intensive, requires freezing previously learned synapses, and has only demonstrated transfer between two sequential tasks, it clearly shows that gating specific network modules can allow the agent to reuse previously learned information, decreasing the time required to learn a new task.

Fig. 4. Similar to Fig. 2, except showing the mean image classification accuracy for the ImageNet dataset split into 100 sequentially trained sections. (A) The dashed black, green, and magenta curves represent the mean accuracies for multihead networks without synaptic stabilization, with EWC, or with synaptic intelligence, respectively. The solid black, green, and magenta curves represent the mean accuracies for nonmultihead networks without synaptic stabilization, with EWC, or with synaptic intelligence, respectively. All further results involve nonmultihead networks. (B) The solid green and magenta curves represent the mean accuracies for networks with EWC or synaptic intelligence, respectively (same as in A). The dashed green and magenta curves represent the mean accuracies for networks with a context signal combined with EWC or synaptic intelligence, respectively. (C) The dashed green and magenta curves represent the mean accuracies for networks with a context signal combined with EWC or synaptic intelligence, respectively (same as in B). The solid green and magenta curves represent the mean accuracies for split networks with a context signal combined with EWC or synaptic intelligence, respectively. (D) The black curve represents the mean accuracy for networks with XdG used alone. The solid green and magenta curves represent the mean accuracies for networks with EWC or synaptic intelligence, respectively (same as in A). The dashed green and magenta curves represent the mean accuracies for networks with XdG combined with EWC or synaptic intelligence, respectively. SI, synaptic intelligence.

We believe that further progress toward transfer learning will require progress along several fronts. First, as humans and other advanced animals can seamlessly switch between contexts, novel algorithms that can rapidly identify network modules applicable to the current task are needed. Second, algorithms that can identify the current task or context, compare it to previously learned contexts, and then perform the required gating based on this comparison are also required. The method proposed in this study clearly lacks this capability, and developing this ability is crucial if transfer learning is to be implemented in real-world scenarios. Third, and perhaps most important, is that the network must represent learned information in a format that supports the above two points. If learned information is located diffusely throughout the network, then activating the relevant circuits and facilitating their interaction might be impractical. Strategies that encourage the development of a modular representation, as is believed to occur in the brain (25), might be required to fully implement continual and transfer learning (26). We believe that in vivo studies examining how various cortical and subcortical areas underlie task learning will provide invaluable data to guide the design of novel algorithms that allow ANNs to rapidly learn new information in a wide range of contexts and environments.

Related Methods to Alleviate Catastrophic Forgetting. The last several years have seen the development of several methods to alleviate catastrophic forgetting in neural networks. Earlier approaches, such as Progressive Neural Networks (27) and Learning Without Forgetting (28), achieved success by adding additional modules for each new task. Both methods include the use of a multihead output layer (Fig. 4A), and, while effective, in this study we were primarily interested in the more general case in which the network size cannot be augmented to support each additional task and in which the same output units must be shared between tasks.

Other studies are similar in spirit to synaptic intelligence (8) and EWC (9) in that they propose methods to stabilize important network weights and biases (29, 30) or to stabilize the linear space of parameters deemed important for solving previous tasks (31). While we did not test the performance of these methods, we note that, like EWC and synaptic intelligence, these algorithms could in theory also be combined with XdG to potentially further mitigate catastrophic forgetting.

Another class of studies has proposed similar methods that also gate parts of the network. Aside from PathNet (24), described above, recent methods have proposed gating network connection weights (32) or units (33) to alleviate forgetting. The two key differences between our method and theirs are: (i) gating is defined a priori in our study, which increases computational efficiency, but potentially makes it less powerful; and (ii) we propose to combine gating with parameter stabilization, while their methods involve gating alone. We tested the Hard Attention to Task (HAT) method (33) and found that it could learn 100 sequential MNIST permutations with a mean accuracy of 93.0% (SI Appendix, Fig. S3), greater than networks with EWC or synaptic intelligence alone. That said, combining synaptic intelligence and XdG still outperforms HAT (95.8% when using the same number of training epochs as HAT; SI Appendix, Fig. S3) and is computationally less demanding. This further highlights the advantage of using complementary methods to alleviate forgetting, but also suggests that adding synaptic stabilization to HAT could allow it to further mitigate forgetting.

Summary. Drawing inspiration from neuroscience, we propose that XdG, a simple-to-implement method with little computational overhead, can allow feedforward and recurrent networks, trained by using supervised or reinforcement learning, to learn large numbers of tasks with little forgetting when used in conjunction with synaptic stabilization. Future work will build upon this method so that it not only alleviates catastrophic forgetting, but can also support transfer learning between tasks.

Materials and Methods
Network training and testing was performed by using the machine-learning framework TensorFlow (34).

Feedforward Network Architecture. For the MNIST task, we used a fully connected network consisting of 784 input units, two hidden layers of 2,000 hidden units each, and 10 outputs. We did not use a multihead output layer, and thus the same 10 output units were used for all permutations.
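To make the gating scheme concrete, the following is a minimal NumPy sketch of context-dependent gating as described in this paper: one fixed, randomly chosen binary mask per task, with 80% of hidden units gated to zero. This is our own illustration (function names are ours), not the authors' TensorFlow implementation.

```python
import numpy as np

def make_gating_masks(n_tasks, n_hidden, keep_frac=0.2, seed=0):
    """One fixed binary mask per task: a random 20% of hidden units stay
    active; the remaining 80% are gated to zero for that task."""
    rng = np.random.default_rng(seed)
    n_keep = int(keep_frac * n_hidden)
    masks = np.zeros((n_tasks, n_hidden))
    for k in range(n_tasks):
        active = rng.choice(n_hidden, size=n_keep, replace=False)
        masks[k, active] = 1.0
    return masks

def gated_forward(x, W1, W2, mask):
    """Toy two-layer network; the current task's mask multiplies
    hidden-layer activity (ReLU, then context-dependent gating)."""
    h = np.maximum(0.0, x @ W1) * mask
    return h @ W2

masks = make_gating_masks(n_tasks=100, n_hidden=2000)
```

Because the masks are sampled independently, the subnetworks for any two tasks overlap only partially, which is what limits interference between tasks while still sharing parameters.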

The ReLU activation function was applied to all hidden units, and Dropout (35) was applied to all hidden units with a 50% drop percentage. The softmax nonlinearity was applied to the units in the output layer.

The ImageNet network included four convolutional layers. The first two layers used 32 filters with a 3 × 3 kernel size and 1 × 1 stride, and the second two layers used 64 filters with the same kernel size and stride. Max pooling with a 2 × 2 stride was applied after layers two and four, along with a 25% drop rate. Gating was not applied to the convolutional layers. As described above, the four convolutional layers were connected to two fully connected hidden layers of 2,000 units each. The ReLU activation function was applied to the hidden units in the two fully connected layers, followed by a 50% drop rate, and a softmax activation function was applied to the 10 units in the output layer. We primarily used a single-head output layer, in which the same 10 units in the output layer were used for all tasks (Fig. 4). For comparison, we also tested a multihead output layer consisting of 1,000 output units, wherein 990 of the output units were masked for any one task (Fig. 4A).

For computational efficiency, we first trained the ImageNet network on a different, yet similar, dataset, which combined the 50,000 training images of the CIFAR-10 dataset with the 50,000 training images of the CIFAR-100 dataset. After training, we fixed the parameters in the convolutional layers and then trained the parameters in the fully connected layers (and not the convolutional layers) on the 100 tasks of the ImageNet dataset.

Recurrent Network Architecture. All RNNs consisted of 256 LSTM cells (21) that projected onto nine units in the output layer: one unit representing the choice to withhold a response (i.e., maintain fixation) and eight units representing responses toward eight different directions. The softmax nonlinearity was applied to the activity of the units in the output layer. Input to the RNN consisted of two sets of 32 motion-direction-tuned units, which represent visual input from two spatial locations, and four fixation-tuned units. For RNNs that used a context signal (green dots, Fig. 5B), the input also included an additional one-hot vector of length 20 representing the current task.

Network Training and Testing. For all networks, parameters were trained using the Adam optimizer (36) (η = 0.001, β₁ = 0.9, β₂ = 0.999). The optimizer state was reset between tasks. When training the feedforward and recurrent networks using supervised learning, we used the cross-entropy loss function. We trained networks for 20 epochs for the MNIST task, 40 training epochs for each ImageNet task, and 6,000 batches for each cognitive-based task, using a batch size of 256 and a learning rate of 0.001. We tested classification accuracy for the MNIST and ImageNet tasks using 10 batches of images that were kept separate from the training images.

In addition to supervised learning, we also trained RNNs using the actor-critic reinforcement learning method (37), in which the network is trained to maximize the expected discounted future reward. For this method, the RNN output consisted of a nine-dimensional policy vector, which maps the activity of the RNN into a probability distribution over actions, and a scalar value output, which estimates the expected discounted future reward. Specifically, we denote the policy output as π_θ(a_t | h_t), where θ represents the network parameters, a_t the vector of possible actions at time t, and h_t the activity of the LSTM cells. We also denote the value output as V_θ(h_t).

The loss function can be broken down into expressions related to the policy output, the value output, and an entropy term that encourages exploration. First, the network should minimize the mean squared error between the predicted and actual discounted future reward:

L_V = (1/2T) Σ_{t=1}^{T} [ V_θ(h_t) − Σ_{τ=t}^{T} γ^(τ−t) r_τ ]²,  [1]

where γ ∈ [0, 1) is the discount factor and r_t is the reward given at time t. Second, the network should maximize the logarithm of choosing actions with large advantage values (defined below):

L_P = −Σ_{t=1}^{T} A_t log(π_θ(a_t | h_t)).  [2]

The advantage represents the difference between the actual and expected reward. In this study, we use the generalized advantage estimation (38), built from the temporal-difference error:

δ_t = r_t + γ V_θ(h_{t+1}) − V_θ(h_t),  [3]

A_t = Σ_{τ=t}^{T} γ^(τ−t) δ_τ.  [4]

We note that, when calculating the gradient of the policy-loss function, one treats the advantage function as a fixed value (i.e., one does not compute the gradient of the advantage function). We also include an entropy term that encourages exploration by penalizing overly confident actions:

L_H = −(1/T) Σ_{t=1}^{T} Σ_a π_θ(a | h_t) log(π_θ(a | h_t)).  [5]

We obtain the overall loss function by combining all three terms:

L = L_P + β L_V − α L_H,  [6]

where α and β control how strongly the value and entropy loss functions, respectively, determine the gradient.

Fig. 5. Task accuracy of recurrent networks sequentially trained on 20 cognitive-based tasks. (A) Schematics of the first four tasks (Go, RT-Go, Dly Go, and Anti-Go). All trials involve a motion direction stimulus (represented by a green arrow), a fixation cue (represented by the centrally located white dot), and an action response using a saccadic eye movement (represented by a magenta arrow). (B) Green represents the mean accuracy for networks with stabilization (synaptic intelligence) combined with a rule cue, trained by using supervised learning. Magenta dots represent the mean accuracy for networks with stabilization (synaptic intelligence) combined with XdG, trained by using supervised learning. Black dots represent the mean accuracy for networks with stabilization (synaptic intelligence) combined with XdG, trained by using reinforcement learning. The 20 tasks are: Go, RT-Go, Dly Go, Anti-Go, Anti-RT-Go, Anti-Dly Go, DM1, DM2, CtxDM1, CtxDM2, Multi-CtxDM, Dly DM1, Dly DM2, Dly CtxDM1, Dly CtxDM2, Dly Multi-CtxDM, DMS, DMC, Anti-DMS, and Anti-DMC.

Synaptic Stabilization. The XdG method proposed in this study was used in conjunction with one of two previously proposed methods to stabilize synapses: synaptic intelligence (8) and EWC (9). Both methods work by adding a quadratic penalty term to the loss function that penalizes weight changes away from their values before starting training on a new task:

L = L_k + c Σ_i Ω_i (θ_i − θ_i^prev)²,  [7]

COMPUTER SCIENCES NEUROSCIENCE where Lk is the loss function of the current task k, c is a hyperparame- Hyperparameter Search. We tested our networks on different sets of ter that scales the strength of the synaptic stabilization, Ωi represents the hyperparameters to determine which values yielded the greatest mean importance of each parameter θi toward solving the previous tasks, and classification accuracy. When using synaptic intelligence, we tested net- prev θi is the parameter value at the end of the previous task. works with c = {0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2}, and k ζ = {0.001, 0.01} When using EWC, we tested networks with c = For EWC, Ωi is calculated for each task k as the diagonal elements of the Fisher information F: {0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 100, 200, 500, 1000}. Furthermore, for the MNIST dataset, we tested networks with and without a 20% drop rate in the input     ∂ log pθ (y|x) ∂ log pθ (y|x) T layer. F = E k , [8] x∼D ,y∼pθ (y|x) ∂θ ∂θ Once the optimal c and ζ (for synaptic intelligence) for each task were ci+cj determined, we tested one additional value c = 2 , where ci and cj were where the inputs x are sampled from the data distribution for task k, Dk, the two values generating the greatest mean accuracies (i and j were always and the labels y are sampled from the model pθ . We calculated the Fisher adjacent). The hyperparameters yielding the greatest mean classification information using an additional 32 batches of 256 training data points after accuracy across the 100 or 500 MNIST permutations, the 100 ImageNet tasks, training on each task was completed. or the 20 cognitive tasks were used for Figs. 2, 4, and 5. 
For synaptic intelligence, for each task k, we first calculated the product between the change in parameter values, θ(t) − θ(t − 1), and the negative of the gradient of the loss function, −∂L_k(t)/∂θ, summed across all training batches t:

ω_i^k = Σ_t (θ_i(t) − θ_i(t − 1)) · (−∂L_k(t)/∂θ_i).  [9]

We then normalized this term by the total change in each parameter, Δθ_i = Σ_t (θ_i(t) − θ_i(t − 1)), plus a damping term ζ:

Ω_i^k = max(0, ω_i^k / ((Δθ_i)² + ζ)).  [10]

Finally, for both EWC and synaptic intelligence, the importance of each parameter i at the start of task m is the sum of Ω_i^k across all completed tasks:

Ω_i = Σ_k Ω_i^k.  [11]

Training the RNNs using reinforcement learning required additional hyperparameters, which we experimented with before settling on fixed values. We set the reward equal to −1 if the network broke fixation (i.e., did not choose the fixation action when required), equal to −0.01 when choosing the wrong output direction, and equal to +1 when choosing the correct output direction. The reward was 0 for all other times, and the trial ended as soon as the network received a reward other than 0. The constant weighting the value loss function, β, was set to 0.01. The discount factor, γ, was set to 0.9, although other values between 0 and 1 produced similar results. Lastly, the constant weighting the entropy loss function, α, was set to 0.0001, and the learning rate was set to 0.0005.

Analysis Methods. For our analysis into the interaction between synaptic stabilization and XdG, we trained two networks, one with stabilization (synaptic intelligence) and one with stabilization combined with XdG, on 100 sequential MNIST permutations. After training, we tested the network with synaptic intelligence alone after perturbing single connection weights located between the last hidden layer and the output layer. Specifically, we randomly selected 1,000 connection weights one at a time, measured the mean accu-

1. Peters A (1991) The Fine Structure of the Nervous System: Neurons and Their Supporting Cells (Oxford Univ Press, Oxford).
2. Kasai H, Matsuzaki M, Noguchi J, Yasumatsu N, Nakahara H (2003) Structure–stability–function relationships of dendritic spines. Trends Neurosci 26:360–368.
3. Yuste R, Bonhoeffer T (2001) Morphological changes in dendritic spines associated with long-term synaptic plasticity. Annu Rev Neurosci 24:1071–1089.
4. Yoshihara Y, De Roo M, Muller D (2009) Dendritic spine formation and stabilization. Curr Opin Neurobiol 19:146–153.
5. Fischer M, Kaech S, Knutti D, Matus A (1998) Rapid actin-based plasticity in dendritic spines. Neuron 20:847–854.
6. Yang G, Pan F, Gan W-B (2009) Stably maintained dendritic spines are associated with lifelong memories. Nature 462:920–924.
7. Xu T, et al. (2009) Rapid formation and selective stabilization of synapses for enduring motor memories. Nature 462:915–919.
8. Zenke F, Poole B, Ganguli S (2017) Continual learning through synaptic intelligence. International Conference on Machine Learning (International Machine Learning Society, Princeton), pp 3987–3995.
9. Kirkpatrick J, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proc Natl Acad Sci USA 114:3521–3526.
10. Cichon J, Gan W-B (2015) Branch-specific dendritic Ca2+ spikes cause persistent synaptic plasticity. Nature 520:180–185.
11. Tononi G, Cirelli C (2014) Sleep and the price of plasticity: From synaptic and cellular homeostasis to memory consolidation and integration. Neuron 81:12–34.
12. Kukushkin NV, Carew TJ (2017) Memory takes time. Neuron 95:259–279.
13. Goodfellow IJ, Mirza M, Xiao D, Courville A, Bengio Y (2013) An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv:1312.6211.
14. Deng J, et al. (2009) ImageNet: A large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition (IEEE, Piscataway, NJ), pp 248–255.
15. Yang GR, Song HF, Newsome WT, Wang X-J (2017) Clustering and compositionality of task representations in a neural network trained to perform many cognitive tasks. bioRxiv:183632.

8 of 9 | www.pnas.org/cgi/doi/10.1073/pnas.1803839115 Masse et al. Downloaded by guest on October 1, 2021 Downloaded by guest on October 1, 2021 6 ee ,CueJ(07 ifso-ae ermdlto a lmnt catastrophic eliminate can neuromodulation Diffusion-based during (2017) networks J brain Clune human R, Velez of reconfiguration 26. Dynamic (2011) al. et DS, Bassett neural super in 25. descent gradient channels Evolution PathNet: (2017) al. et C, Fernando 24. a inverting by learning One-shot (2013) J Tenenbaum RR, learning Salakhutdinov One-shot (2016) BM, T. Lake Lillicrap D, 23. Wierstra M, Botvinick S, Bartunov A, Santoro 22. memory. short-term Long suppresses (1997) task J auditory Schmidhuber S, an Hochreiter in Engaging 21. (2009) AM Zador Y, Yang L-H, Tai GH, Otazu context- enables 20. inhibition cortical by processing Parallel (2017) al. et function. KV, cortex Kuchibhotla prefrontal of 19. theory integrative An (2001) JD Cohen EK, Miller 18. dynamics control-signal Top-down (2007) S Everling MJ, Koval HM, in Levin synchrony K, and Johnston Oscillations predictions: 17. Dynamic (2001) W Singer P, Fries AK, Engel 16. as tal. et Masse ogtigi ipenua ewrs arXiv:1705.07241. networks. neural simple in forgetting learning. arXiv:1701.08734. networks. Red Assoc, (Curran KQ Weinberger Z, 2526–2534. Ghahramani pp NY), M, Hook, Welling L, Bottou CJC, Burges process. causal compositional arXiv:1605.06065. networks. neural memory-augmented with 1780. cortex. auditory in responses behavior. dependent Neurosci Rev switching. task following neurons 53:453–462. cortex prefrontal and cingulate anterior in processing. top-down rcNt cdSiUSA Sci Acad Natl Proc 24:167–202. a Neurosci Nat a e Neurosci Rev Nat a Neurosci Nat eds Systems, Processing Information Neural in Advances 108:7641–7646. 20:62–71. 2:704–716. 12:646–654. erlComput Neural Neuron, 9:1735– Annu 8 cumnJ oizP eieS odnM belP(05 High-dimensional (2015) P Abbeel M, Jordan S, Levine P, Moritz J, can that Schulman elements 38. 
adaptive Neuronlike arXiv:1412. (1983) optimization. CW Anderson stochastic RS, for Sutton method AG, A Barto Adam: 37. (2014) J. Ba D, Dropout: Kingma (2014) R 36. Salakhutdinov I, Sutskever A, Krizhevsky heterogeneous G, on Hinton learning N, machine Srivastava Large-scale 35. Tensorflow: (2016) al et M, Abadi 34. 2 alaA aenkS(07 ake:Adn utpetsst igentokby network conceptors. single a by to learning. tasks interference Serr multiple continual Adding 33. catastrophic Packnet: Variational (2017) Overcoming S (2017) Lazebnik (2017) A, RE Mallya H 32. Turner Jaeger TD, X, Bui He Y, 31. Li aware Memory (2017) CV, T Tuytelaars Nguyen M, Rohrbach 30. M, Elhoseiny F, Babiloni R, Aljundi 29. forgetting. without Learning (2018) arXiv:1606.04671. D networks. Hoiem neural Z, Progressive Li (2016) al. 28. et AA, Rusu 27. otnoscnrluiggnrlzdavnaeetmto.arXiv:1506.02438. estimation. advantage generalized using control continuous problems. control 846. learning difficult solve 6980. overfitting. from networks neural prevent to 15:1929–1958. way simple A arXiv:1603.04467. systems. distributed arXiv:1801.01423. task. the to attention hard with trtv rnn.arXiv:1711.05769. pruning. iterative arXiv:1707.04853. arXiv:1710.10628. arXiv:1711.09601. forget. to (not) what Learning synapses: 10.1109/TPAMI.2017.2773081. ,Sur J, a ` sD io ,KrtoluA(08 vroigctsrpi forgetting catastrophic Overcoming (2018) A Karatzoglou M, Miron D, ´ ıs EETasSs a Cybernetics Man Syst Trans IEEE EETasPtenAa ahIntell Mach Anal Pattern Trans IEEE NSLts Articles Latest PNAS ahn er Res Learn Machine J | 5:834– f9 of 9
