arXiv:2108.12596v1 [cs.LG] 28 Aug 2021
Representation Memorization for Fast Learning New Knowledge without Forgetting

Fei Mi1∗, Tao Lin2, Boi Faltings2
1Huawei Noah’s Ark Lab, 2Swiss Federal Institute of Technology in Lausanne (EPFL)
[email protected], {tao.lin, boi.faltings}@epfl.ch
∗This work was mainly done when Fei Mi was a Ph.D. student at EPFL.

arXiv:2108.12596v2 [cs.LG] 10 Sep 2021

Abstract

The ability to quickly learn new knowledge (e.g., new classes or data distributions) is a big step towards human-level intelligence. In this paper, we consider scenarios that require learning new classes or data distributions quickly and incrementally over time, as often occurs in real-world dynamic environments. We propose “Memory-based Hebbian Parameter Adaptation” (Hebb) to tackle the two major challenges (i.e., catastrophic forgetting and sample efficiency) towards this goal in a unified framework. To mitigate catastrophic forgetting, Hebb augments a regular neural classifier with a continuously updated memory module to store representations of previous data. To improve sample efficiency, we propose a parameter adaptation method based on the well-known Hebbian theory Hebb [1949], which directly “wires” the output network’s parameters with similar representations retrieved from the memory. We empirically verify the superior performance of Hebb through extensive experiments on a wide range of learning tasks (image classification, language model) and learning scenarios (continual, incremental, online). We demonstrate that Hebb effectively mitigates catastrophic forgetting, and it indeed learns new knowledge better and faster than the current state-of-the-art.

1 Introduction

In real-life machine learning applications, new knowledge (e.g., new classes or data distributions) arrives gradually over time. The ability to quickly learn and accumulate new knowledge without forgetting old knowledge is a hallmark of artificial intelligence. Two major challenges prevent standard neural networks from being applied towards this goal: (1) catastrophic forgetting: continually incorporating new knowledge requires additional training and often reduces the performance on previously learned knowledge McCloskey and Cohen [1989]; (2) sample efficiency: the amount of data on new knowledge is often limited, which prevents neural networks from being trained to achieve reasonable accuracy Wang et al. [2020].

The catastrophic forgetting challenge has recently been studied in the context of “continually/incrementally” learning a sequence of tasks. Various regularization-based methods Kirkpatrick et al. [2017]; Rebuffi et al. [2017]; Castro et al. [2018] and memory-based approaches Grave et al. [2017b]; Merity et al. [2017]; Rebuffi et al. [2017] have been proposed to mitigate catastrophic forgetting. The sample efficiency challenge has recently been studied in the context of few-shot learning; a popular approach is “meta-learning” Finn et al. [2017], which learns over a collection of specifically structured meta-tasks.

However, existing methods often tackle these two challenges separately. To this end, we propose a method called Memory-based Hebbian Parameter Adaptation (Hebb) to tackle them in a unified framework. Hebb makes use of a memory component similar to Merity et al. [2017]; Sprechmann et al. [2018] that stores previous input representations to mitigate catastrophic forgetting. To improve sample efficiency on new knowledge, we propose a parameter adaptation procedure, applied during inference, based on the well-known Hebbian theory Hebb [1949]. It directly wires similar representations retrieved from the memory to the corresponding parameters of the classifier’s output network. The memory accessing operation and the parameter adaptation procedure are simple to compute, so Hebb can be easily plugged into different neural classifiers and learning scenarios.

Besides the standard continual learning setting Rebuffi et al. [2017], used to evaluate the ability to mitigate catastrophic forgetting, two other learning scenarios are considered to evaluate the ability to deal with the sample efficiency challenge. In the first, the incremental learning scenario, a pre-trained classifier is initially trained on a small dataset containing new knowledge. It is then fixed and evaluated on future observations. In the second, the online adaptation scenario, we have no data on new knowledge initially, and the pre-trained classifier needs to continuously learn new knowledge through a single pass over new data in an online manner. These two learning scenarios are both practically critical. For example, the incremental learning scenario simulates a robot being shown some images of new objects before they appear in its routine tasks. In the online adaptation scenario, a robot has to deal with new objects continuously.

Through extensive experiments on a wide range of learning tasks (image classification, language model) and learning scenarios (continual, incremental, online), we empirically demonstrate: (i) Hebb can be easily plugged into different neural classifiers and learning scenarios with trivial computational overhead. (ii) Hebb effectively mitigates catastrophic forgetting, indicated by its superior performance compared to MbPA Sprechmann et al. [2018] and EWC Kirkpatrick et al. [2017] in the continual learning setting. (iii) More importantly, Hebb notably improves sample efficiency for fast learning of new classes and data distributions. It outperforms various state-of-the-art methods in both incremental learning and online adaptation scenarios, especially on new or infrequent classes.

2 Related Work

2.1 Memory-augmented Neural Networks

Recently, various memory modules (M) have been proposed to augment neural networks for remembering long-term information Graves et al. [2014]; Grave et al. [2017b,a]; Merity et al. [2017] or learning infrequent patterns Santoro et al. [2016]; Kaiser et al. [2017]; Sprechmann et al. [2018]; Mi and Faltings [2020].

There are many variants of how to read from M and mix the entries retrieved from M with the network computation. One approach is through differentiable context-based lookup mechanisms Graves et al. [2014]; Santoro et al. [2016] for learning to match the current activation to past activations stored in M. However, these mechanisms often require strong memory supervision, and the size of M has to be fixed. Another approach is using a simple mixture model. In this case, a non-parametric prediction is computed based on the similarity between the entries in M and the current input. The neural network’s prediction is directly interpolated with the non-parametric prediction from M. This approach has been shown to be simple but effective for language modeling Grave et al. [2017b,a], neural machine translation Tu et al. [2018], image classification Orhan [2018], and recommendation Mi and Faltings [2020]. Recently, Sprechmann et al. [2018] introduce MbPA, which uses nearest neighbors retrieved from M for parameter adaptation during model inference for the fast acquisition of new knowledge. MbPA++ de Masson d’Autume et al. [2019] improves MbPA to better mitigate catastrophic forgetting through better memory management during training. The framework proposed in this paper is motivated by Sprechmann et al. [2018], and it mainly improves MbPA for better learning of new knowledge.

2.2 Hebbian Learning

Hebbian theory Hebb [1949] is a neuroscientific theory attempting to explain “synaptic plasticity”, i.e., the adaptation of brain neurons during the learning process. For artificial neural networks, Hebbian theory describes a method of de- [...] to bind representative patterns to their corresponding concepts or labels.

Recent approaches apply Hebbian theory to every single neuron connection for fast network weight learning. Ba et al. [2016] propose a fast weight to augment the standard computation of RNNs. The fast weight is defined as the running average of the outer product of two hidden states in RNNs. It is multiplied with the current state, and it is continuously updated to allow each new hidden state to be attracted to recent hidden states. Miconi et al. [2018] later augment the traditional connection between two neurons in general neural networks with a Hebbian trace. The Hebbian trace between two neurons is defined as a running average of the scalar product of the first neuron’s activation at the previous time step and the second neuron’s activation at the current time step. The Hebbian trace is merged with the standard neuron connection through a differentiable plasticity optimized by SGD. The later extension Miconi et al. [2019] introduces a term parametrized by another neural network to learn how fast new information should be incorporated.

Instead of applying the Hebbian theory to every single neuron connection, we use it to directly wire the activation input to the output layer with the corresponding class label for fast binding of new classes. Similar perspectives have recently been proposed. For example, Munkhdalai and Trischler [2018] augment the layer preceding the Softmax layer with Hebbian updates followed by a nonlinear activation for meta-learning. Rae et al. [2018] propose a Hebbian Softmax layer during the normal model training phase to better learn infrequent vocabulary in language modeling tasks. The Hebbian update rule proposed in our paper is motivated by Rae et al. [2018], yet our Hebbian update rule is only applied to relevant entries retrieved from a continuously updated memory module, for the purpose of fast learning of new knowledge in an incremental or online manner.

3 Memory-based Hebbian Parameter Adaptation

This section introduces the Memory-based Hebbian Parameter Adaptation (Hebb) method to help standard neural classifiers mitigate catastrophic forgetting and improve sample efficiency. First, we introduce a memory component to store representations of past data. Then, we introduce the Hebbian update for fast learning of new knowledge during inference, and we compare it with the state of the art (MbPA Sprechmann et al. [2018]). Lastly, we introduce a dynamic interpolation of the proposed Hebbian update and MbPA.

Background. Neural classifiers can be viewed as consisting of two parts.