Online Forgetting Process for Linear Regression Models

Yuantong Li, Chi-Hua Wang, Guang Cheng
Department of Statistics, Purdue University

Abstract

Motivated by the EU's "Right To Be Forgotten" regulation, we initiate a study of statistical data deletion problems in which users' data are accessible only for a limited period of time. This setting is formulated as an online learning task with a constant memory limit. We propose a deletion-aware algorithm, FIFD-OLS, for the low-dimensional case, and witness a catastrophic rank swinging phenomenon due to the data deletion operation, which leads to statistical inefficiency. As a remedy, we propose the FIFD-Adaptive Ridge algorithm with a novel online regularization scheme that effectively offsets the uncertainty from deletion. In theory, we provide cumulative regret upper bounds for both online forgetting algorithms. In experiments, we show that FIFD-Adaptive Ridge outperforms ridge regression with a fixed regularization level, and we hope this sheds some light on more complex statistical models.

1 Introduction

Today many internet companies and organizations face the situation that certain individuals' data can no longer be used to train their models due to legal requirements. Such circumstances force companies and organizations to delete data from their databases on demand from users who wish to be forgotten. On the ground of the "Right to be Forgotten", regulations have been established in the laws of many countries and states, including the EU's General Data Protection Regulation (GDPR) (Council of European Union, 2016) and the recent California Consumer Privacy Act (CCPA) (State of California Department of Justice, 2018), which stipulate that users may require companies and organizations such as Google, Facebook, and Twitter to forget and delete their personal data in order to protect their privacy. Users also have the right to request that a platform delete their obsolete data at any time, or to authorize a platform to hold personal information such as photos and emails only for a limited period. Unfortunately, since these data are typically collected incrementally online, it is very difficult for a machine learning model to forget them in chronological order. This challenge motivates the design and analysis of deletion-aware online methods.

In this paper, we propose and investigate a class of online learning procedures, termed the online forgetting process, that adapts to users' requests to delete their data before a specific time bar. To proceed with the discussion, we consider a special deletion practice, termed First In First Delete (FIFD), to address the scenario in which users only authorize their data for a limited period. (See Figure 1 for an illustration of the online forgetting process with constant memory limit s.) In FIFD, the agent is required to delete the oldest data as soon as it receives the latest data, so as to meet a constant memory limit. The FIFD deletion practice is inspired by the situation in which a system may only use data from the past three months to train its machine learning model to offer service for new customers (Jon Porter, 2019; Google, 2020). The proposed online forgetting process is an online extension of recent works that consider offline data deletion (Izzo et al., 2020; Ginart et al., 2019; Bourtoule et al., 2019) or detect whether data have been forgotten (Liu and Tsaftaris, 2020).

In such a "machine forgetting" setting, we aim to determine its difference from standard statistical machine learning methods via an online regression framework. To accommodate such limited authority, we provide solutions for designing deletion-aware online linear regression algorithms and discuss the harm caused by the "constant memory limit" setting.

Figure 1: Online Forgetting Process with constant memory limit $s$ (the data stream $x_1 y_1, x_2 y_2, \dots, x_t y_t$ and the sliding-window estimators $\hat{\theta}_{[1,s]}, \hat{\theta}_{[2,s+1]}, \dots, \hat{\theta}_{[t-s,t-1]}$).

Such a setting is challenging for a general online statistical learning task since the "sample size" never grows to infinity but stays at a constant size along the whole learning process. As a consequence, statistical efficiency never improves as the time step grows, due to the constant memory limit.

Our Contribution. We first investigate the online forgetting process in ordinary linear regression. We find a new phenomenon, which we call the Rank Swinging Phenomenon, that exists in the online forgetting process. If the deleted data can be fully represented by the data memory, the deletion introduces no extra regret; otherwise, it introduces extra regret that makes the online forgetting task not online learnable. The rank swinging phenomenon indirectly represents the dissimilarity between the deleted data and the new data and thereby affects the instantaneous regret. Besides, if the gram matrix does not have full rank, the adaptive constant $\zeta$ becomes unstable and the confidence ellipsoid becomes wider. Taking both effects into consideration, the order of the FIFD-OLS regret becomes linear in the time horizon $T$.

The rank swinging phenomenon affects the regret scale and destabilizes the FIFD-OLS algorithm. To remedy this problem, we propose FIFD-Adaptive Ridge to offset this phenomenon: once the regularization parameter is added, the gram matrix has full rank. Different from using a fixed regularization parameter as in the standard online learning model, we use a martingale technique to adaptively select the regularization parameter over time.

Notation. Throughout this paper, $[T]$ denotes the set $\{1, 2, \dots, T\}$, and $|S|$ denotes the number of elements of any collection $S$. We use $\|x\|_p$ to denote the $p$-norm of a vector $x \in \mathbb{R}^d$ and $\|x\|_\infty = \sup_i |x_i|$. For any vector $v \in \mathbb{R}^d$, the notation $\mathcal{P}(v) \equiv \{i \mid v_i > 0\}$ denotes the indexes of the positive coordinates of $v$ and $\mathcal{N}(v) \equiv \{i \mid v_i < 0\}$ denotes the indexes of the negative coordinates of $v$; $\mathcal{P}_{\min}(v) \equiv \min\{v_i \mid i \in \mathcal{P}(v)\}$ denotes the minimum value of $v_i$ over the positive coordinates, and $\mathcal{N}_{\max}(v) \equiv \max\{v_i \mid i \in \mathcal{N}(v)\}$ denotes the maximum value of $v_i$ over the negative coordinates. For a positive semi-definite matrix $\Sigma \in \mathbb{R}^{d \times d}$, $\lambda_{\min}(\Sigma)$ denotes the minimum eigenvalue of $\Sigma$. We denote $\Sigma^{-}$ as a generalized inverse of $\Sigma$ if it satisfies $\Sigma \Sigma^{-} \Sigma = \Sigma$. The weighted 2-norm of a vector $x \in \mathbb{R}^d$ with respect to a positive definite matrix $\Sigma$ is defined by $\|x\|_\Sigma = \sqrt{x^\top \Sigma x}$. The inner product is denoted by $\langle \cdot, \cdot \rangle$, and the weighted inner product by $x^\top \Sigma y = \langle x, y \rangle_\Sigma$. For any sequence $\{x_t\}_{t \ge 0}$, we denote the matrix $x_{[a,b]} = \{x_a, x_{a+1}, \dots, x_b\}$ and $\|x_{[a,b]}\|_\infty = \sup_{i,j} |x_{[a,b]}(i,j)|$. For any matrix $x_{[a,b]}$, the notation $\Sigma_{[a,b]} = \sum_{t=a}^{b} x_t x_t^\top$ represents a gram matrix with constant memory limit $s = b - a + 1$, and $\Sigma_{\lambda,[a,b]} = \sum_{t=a}^{b} x_t x_t^\top + \lambda I_{d \times d}$ represents a gram matrix with ridge hyperparameter $\lambda$.

2 Statistical Data Deletion Problem

At each time step $t \in [T]$, where $T$ is a finite time horizon, the learner receives a context-response pair $z_t = (x_t, y_t)$, where $x_t \in \mathbb{R}^d$ is a $d$-dimensional context and $y_t \in \mathbb{R}$ is the response. The observed sequence of contexts $\{x_t\}_{t \ge 1}$ is drawn i.i.d. from a distribution $\mathcal{P}_X$ with bounded support $\mathcal{X} \subset \mathbb{R}^d$ and $\|x_t\|_2 \le L$. Let $D_{[t-s:t-1]} = \{z_i\}_{i=t-s}^{t-1}$ denote the data collected over $[t-s, t-1]$ following the FIFD scheme.

We assume that for all $t \in [T]$, the response $y_t$ is a linear combination of the context $x_t$; formally,

$$y_t = \langle x_t, \theta_\star \rangle + \epsilon_t, \qquad (1)$$

where $\theta_\star \in \mathbb{R}^d$ is the target parameter that summarizes the relation between the context $x_t$ and the response $y_t$. The noise terms $\epsilon_t$ are drawn independently from a $\sigma$-subgaussian distribution; that is, for every $\alpha \in \mathbb{R}$, $\mathbb{E}[\exp(\alpha \epsilon_t)] \le \exp(\alpha^2 \sigma^2 / 2)$.

Under the proposed FIFD scheme with a constant memory limit $s$, the algorithm $\mathcal{A}$ at time $t$ can only keep the information of the historical data from time step $t-s$ to time step $t-1$ when making its prediction; data from time step $t-s-1$ and earlier are not allowed to be kept and must be deleted or forgotten. The algorithm $\mathcal{A}$ is required to make a total of $T-s$ predictions over the time interval $[s+1, T]$.

To be more precise, the algorithm first receives a context $x_t$ at time step $t$ and makes a prediction based only on the information from the previous $s$ time steps $D_{[t-s:t-1]}$ (with $|D_{[t-s:t-1]}| = s$); hence the agent forgets the information up to time step $t-s-1$. The predicted value $\hat{y}_t$ is computed by the algorithm $\mathcal{A}$ based on the information $D_{[t-s:t-1]}$ and the current context $x_t$, which is denoted by

$$\hat{y}_t = \mathcal{A}(x_t, D_{[t-s:t-1]}). \qquad (2)$$

After the prediction, the algorithm $\mathcal{A}$ receives the true response $y_t$ and suffers a pseudo regret $r(\hat{y}_t, y_t) = (\langle x_t, \theta_\star \rangle - \hat{y}_t)^2$. We define the cumulative (pseudo-)regret of the algorithm up to time horizon $T$ as

$$R_{T,s}(\mathcal{A}) \equiv \sum_{t=s+1}^{T} (\langle x_t, \theta_\star \rangle - \hat{y}_t)^2. \qquad (3)$$

Our theoretical goal is to explore the relationship between the constant memory limit $s$ and the cumulative regret $R_{T,s}(\mathcal{A})$. We propose two algorithms, FIFD-OLS and FIFD-Adaptive Ridge, and study their cumulative regret, respectively. In particular, we discuss the effect of the constant memory limit $s$, the dimension $d$, the subgaussian parameter $\sigma$, and the confidence level $1-\delta$ on the obtained cumulative regret $R_{T,s}(\mathcal{A})$ for these two algorithms. For example, what is the effect of the constant memory limit $s$ on the order of the regret $R_{T,s}(\mathcal{A})$ when the other parameters are held constant? In other words, how much data do we need to keep in order to achieve satisfactory performance under the FIFD scheme? Besides, does any unexpected phenomenon occur, in experiments or in theory, when the agent uses the ordinary least squares method under the FIFD scheme, and how can it be improved?

Remark. The FIFD scheme can also be generalized toward the standard online learning paradigm when we add two data points and delete one, or add more and delete one, at each step. These settings automatically transfer to the standard online learning model when $s$ becomes large. We present simulation results to illustrate this.

3 FIFD-OLS

In this section, we present the FIFD ordinary least squares regression (FIFD-OLS) algorithm under the "First In First Delete" scheme, the corresponding confidence ellipsoid for the FIFD-OLS estimator, and its regret analysis. The rank swinging phenomenon is discussed in Section 3.4.

3.1 FIFD-OLS Algorithm

The FIFD-OLS algorithm uses the least squares estimator based on the constant data memory from the time window $[t-s, t-1]$, defined as $\hat{\theta}_{[t-s,t-1]} = \Sigma_{[t-s,t-1]}^{-1} \big[\sum_{i=t-s}^{t-1} y_i x_i\big]$. An incremental update formula for $\hat{\theta}$ from the time window $[t-s, t-1]$ to $[t-s+1, t]$ is given as follows.

OLS incremental update: At time step $t$, the estimator $\hat{\theta}_{[t-s+1,t]}$ is updated from the previous estimator $\hat{\theta}_{[t-s,t-1]}$ and the new data $z_t$ by

$$\hat{\theta}_{[t-s+1,t]} = f(\Sigma_{[t-s,t-1]}^{-1}, x_{t-s}, x_t) \cdot g(\hat{\theta}_{[t-s,t-1]}, \Sigma_{[t-s,t-1]}, z_{t-s}, z_t), \qquad (4)$$

where, writing $\Phi(\Sigma_{[t-s,t-1]}^{-1}) \equiv \Sigma_{[t-s,t-1]}^{-1} - \big(x_t^\top \Sigma_{[t-s,t-1]}^{-1} x_t + 1\big)^{-1} \big[\Sigma_{[t-s,t-1]}^{-1} x_t x_t^\top \Sigma_{[t-s,t-1]}^{-1}\big]$,

$$f(\Sigma_{[t-s,t-1]}^{-1}, x_{t-s}, x_t) = \Phi(\Sigma_{[t-s,t-1]}^{-1}) - \big(x_{t-s}^\top \Phi(\Sigma_{[t-s,t-1]}^{-1})\, x_{t-s} - 1\big)^{-1} \big[\Phi(\Sigma_{[t-s,t-1]}^{-1})\, x_{t-s} x_{t-s}^\top \Phi(\Sigma_{[t-s,t-1]}^{-1})\big], \qquad (5)$$

and $g(\hat{\theta}_{[t-s,t-1]}, \Sigma_{[t-s,t-1]}, z_{t-s}, z_t) = \Sigma_{[t-s,t-1]} \hat{\theta}_{[t-s,t-1]} + y_t x_t - y_{t-s} x_{t-s}$. The detailed online incremental update is given in Appendix F.

The algorithm has two steps. First, the $f$-step updates the inverse of the gram matrix from $\Sigma_{[t-s,t-1]}^{-1}$ to $\Sigma_{[t-s+1,t]}^{-1}$; the Woodbury formula (Woodbury, 1950) is applied to reduce the time complexity to $\mathcal{O}(s^2 d)$ compared with directly computing the inverse of the gram matrix. Second, the $g$-step is simple algebra that uses the definition of adding and forgetting data to update the least squares estimator. The detailed update procedure of FIFD-OLS is displayed in Algorithm 1.
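As a concrete illustration of the $f$-step/$g$-step update in (4)-(5), the following is a minimal sketch (our own rendering, not the authors' released code) of the FIFD-OLS sliding-window update: the inverse gram matrix is maintained with two Sherman-Morrison rank-one corrections (add $x_t$, then remove $x_{t-s}$), and the moment vector is adjusted accordingly. Function and variable names are ours.

```python
import numpy as np

def sherman_morrison_add(M_inv, x):
    """(M + x x^T)^{-1} from M^{-1} via a rank-one update."""
    Mx = M_inv @ x
    return M_inv - np.outer(Mx, Mx) / (1.0 + x @ Mx)

def sherman_morrison_remove(M_inv, x):
    """(M - x x^T)^{-1} from M^{-1} via a rank-one downdate.

    Requires 1 - x^T M^{-1} x != 0, i.e. the reduced window must stay
    invertible; this is exactly where the rank swinging phenomenon bites.
    """
    Mx = M_inv @ x
    return M_inv + np.outer(Mx, Mx) / (1.0 - x @ Mx)

def fifd_ols_step(Sigma_inv, b, z_new, z_old):
    """One FIFD-OLS step: add (x_t, y_t), forget (x_{t-s}, y_{t-s}).

    Sigma_inv : inverse gram matrix of the current window  (f-step)
    b         : sum of y_i x_i over the current window      (g-step)
    Returns the updated (Sigma_inv, b, theta_hat).
    """
    x_new, y_new = z_new
    x_old, y_old = z_old
    # f-step: update the inverse gram matrix (Woodbury / Sherman-Morrison)
    Sigma_inv = sherman_morrison_add(Sigma_inv, x_new)
    Sigma_inv = sherman_morrison_remove(Sigma_inv, x_old)
    # g-step: update the moment vector and recompute the estimator
    b = b + y_new * x_new - y_old * x_old
    theta_hat = Sigma_inv @ b
    return Sigma_inv, b, theta_hat
```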

3.2 Confidence Ellipsoid for FIFD-OLS

Our first contribution is a confidence ellipsoid for the OLS method based on the sample set collected under the FIFD scheme, shown in Lemma 1.

Lemma 1 (FIFD-OLS Confidence Ellipsoid). For any $\delta > 0$, if the event $\lambda_{\min}(\Sigma_{[t-s,t-1]}/s) > \gamma^2 > 0$ holds, then with probability at least $1-\delta$, for all $t \ge s+1$, $\theta_\star$ lies in the set

$$C_{[t-s,t-1]} = \Big\{ \theta \in \mathbb{R}^d : \big\| \hat{\theta}_{[t-s,t-1]} - \theta \big\|_{\Sigma_{[t-s,t-1]}} \le \sqrt{q_{[t-s,t-1]}\,(2d/s)\log(2d/\delta)} \Big\}, \qquad (6)$$

where $q_{[t-s,t-1]} = \|x_{[t-s,t-1]}\|_\infty^2 / \lambda_{\min}(\Sigma_{[t-s,t-1]})$ is the adaptive constant, and we denote the right-hand-side bound by $\beta_{[t-s,t-1]}(\delta)$.

Algorithm 1: FIFD-OLS

  Given parameters: $s$, $T$, $\delta$
  for $t \in \{s+1, \dots, T\}$ do
    Observe $x_t$
    $\Sigma_{[t-s,t-1]} = x_{[t-s,t-1]}^\top x_{[t-s,t-1]}$
    if $t = s+1$ then
      $\hat{\theta}_{[1,s]} = \Sigma_{[1,s]}^{-1} x_{[1,s]}^\top y_{[1,s]}$
    else
      $\hat{\theta}_{[t-s,t-1]} = f(\Sigma_{[t-s-1,t-2]}^{-1}, x_{t-s-1}, x_{t-1}) \cdot g(\hat{\theta}_{[t-s-1,t-2]}, \Sigma_{[t-s-1,t-2]}, z_{t-s-1}, z_{t-1})$
    end
    Predict $\hat{y}_t = \langle \hat{\theta}_{[t-s,t-1]}, x_t \rangle$ and observe $y_t$
    Compute loss $r_t = |\hat{y}_t - \langle \theta_\star, x_t \rangle|$
    Delete data $z_{t-s} = (x_{t-s}, y_{t-s})$
  end

Proof (of Lemma 1). The key step in obtaining the confidence ellipsoid is the martingale noise technique presented in Bastani and Bayati (2020), which yields an adaptive bound. The detailed proof is in Appendix A. Note that the confidence ellipsoid in (6) requires the minimum eigenvalue of the gram matrix $\Sigma_{[t-s,t-1]}$ and the infinity norm of the observations within the constant data memory $s$; thus it is an adaptive confidence ellipsoid that follows the change of the data.

3.3 FIFD-OLS Regret

The following theorem provides the regret upper bound for the FIFD-OLS algorithm.

Theorem 1 (Regret Upper Bound of the FIFD-OLS Algorithm). Assume that the event of Lemma 1 holds for all $t \in [s+1, T]$. Then with probability at least $1-\delta$, $\delta \in [0,1]$, for all $T > s$ and $s \ge d$, the cumulative regret at time $T$ satisfies

$$R_{T,s}(\mathcal{A}_{\mathrm{OLS}}) \le 2\zeta \sqrt{(d/s)\log(2d/\delta)\,(T-s)\,\big(d\log(sL^2/d) + (T-s)\big)}, \qquad (7)$$

where the adaptive constant $\zeta = \max_{s+1 \le t \le T} q_{[t-s,t-1]}$.

Proof. We provide a roadmap for the proof of Theorem 1. The proof is motivated by Abbasi-Yadkori et al. (2011) and Bastani and Bayati (2020). We first prove Lemma 1 to obtain a confidence ellipsoid holding for the FIFD-OLS estimator. Then we use the confidence ellipsoid to sum up the regret over time in Lemma 2, where a key term called the FRT (defined in equation (9) below) strongly affects the derivation of the regret upper bound. The detailed proof of this theorem can be found in Appendix C.

Remarks. We develop the FIFD-OLS algorithm and prove an upper bound on the cumulative regret of order $\mathcal{O}\big(\zeta[(d/s)\log(2d/\delta)]^{1/2}\, T\big)$. The agent using this algorithm cannot improve its performance over time because only a constant memory of $s$ observations is available to update the estimated parameters. What is more, the adaptive constant $\sqrt{q_{[t-s,t-1]}}$ is unstable because of the forgetting process. The main factor causing the oscillation of the gram matrix $\Sigma_{[t-s,t-1]}$ is that its minimum eigenvalue can be close to zero; in other words, the gram matrix fails to be full rank at some time points. At such time points the adaptive constant $\sqrt{q_{[t-s,t-1]}}$ blows up, so the generalization ability of FIFD-OLS is poor.

Following the derivation of the regret upper bound of FIFD-OLS in Theorem 1, we find an interesting phenomenon, which we call the Rank Swinging Phenomenon (Definition 1 below). This phenomenon destabilizes the FIFD-OLS algorithm and introduces some extremely bad events, which results in a larger regret upper bound for the FIFD-OLS algorithm.

3.4 Rank Swinging Phenomenon

Below we give the definition of the Rank Swinging Phenomenon.

Definition 1 (Rank Swinging Phenomenon). At time $t$, when we delete the data $x_{t-s}$ and add the data $x_t$, the gram matrix switches from $\Sigma_{[t-s,t-1]}$ to $\Sigma_{[t-s+1,t]}$, and its rank can increase or decrease by 1 or stay unchanged. If $\mathrm{Rank}(\Sigma_{[t-s,t-1]}) \le d$,

$$\mathrm{Rank}(\Sigma_{[t-s+1,t]}) = \begin{cases} \mathrm{Rank}(\Sigma_{[t-s,t-1]}) - 1, & \text{Case 1} \\ \mathrm{Rank}(\Sigma_{[t-s,t-1]}) + 1, & \text{Case 2} \\ \mathrm{Rank}(\Sigma_{[t-s,t-1]}), & \text{Case 3} \end{cases}$$

where four examples are illustrated in Appendix D. A real example can be found in Figure 2.

Figure 2: Rank swinging phenomenon with memory limit $s = 20, 60, 100, 150, 500$ and $d = 100$ from left to right, with $X_t \sim \mathrm{Unif}(e_1, e_2, \dots, e_{100})$. The rank of the gram matrix can decrease due to the deletion operation.
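The following short sketch (ours, written under the setting described in the Figure 2 caption) reproduces the flavor of the phenomenon: with one-hot contexts drawn uniformly from $\{e_1, \dots, e_d\}$, the rank of the sliding-window gram matrix both rises and falls as data are added and deleted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, T = 100, 60, 500

# One-hot contexts X_t ~ Unif(e_1, ..., e_d), as in Figure 2.
idx = rng.integers(0, d, size=T)
X = np.eye(d)[idx]

ranks = []
for t in range(s, T):
    window = X[t - s:t]                  # constant memory of the last s contexts
    gram = window.T @ window             # Sigma_[t-s, t-1]
    ranks.append(np.linalg.matrix_rank(gram))

# The rank swings: deleting x_{t-s} can remove the only observation
# along some coordinate direction, so the rank may drop by one.
print(min(ranks), max(ranks))
```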

The rank swinging phenomenon results in unstable regret, which is measured by a term we call the "Forgetting Regret Term" (FRT), caused by the deletion at time $t$.

Definition 2 (Forgetting Regret Term).

$$\mathrm{FRT}_{[t-s,t-1]} = \|x_{t-s}\|^2_{\Sigma^{-}_{[t-s,t-1]}} \Big( \|x_t\|^2_{\Sigma^{-}_{[t-s,t-1]}} \sin^2\!\big(\theta, \Sigma^{-}_{[t-s,t-1]}\big) + 1 \Big), \qquad (9)$$

where $\mathrm{FRT}_{[t-s,t-1]} \in [0,2]$. Here $\sin^2(\theta, \Sigma^{-}_{[t-s,t-1]})$ denotes the dissimilarity between $x_{t-s}$ and $x_t$ under $\Sigma^{-}_{[t-s,t-1]}$:

$$\sin^2\!\big(\theta, \Sigma^{-}_{[t-s,t-1]}\big) = 1 - \cos^2\!\big(\theta, \Sigma^{-}_{[t-s,t-1]}\big), \qquad \cos\!\big(\theta, \Sigma^{-}_{[t-s,t-1]}\big) = \frac{\langle x_t, x_{t-s} \rangle_{\Sigma^{-}_{[t-s,t-1]}}}{\|x_t\|_{\Sigma^{-}_{[t-s,t-1]}}\, \|x_{t-s}\|_{\Sigma^{-}_{[t-s,t-1]}}},$$

where $\Sigma^{-}_{[t-s,t-1]}$ is the generalized inverse of $\Sigma_{[t-s,t-1]}$.

Remark. If $\mathrm{FRT}_{[t-s,t-1]} = 0$, no extra regret is introduced that would prevent the task from being online learnable: the deleted data can be fully represented by the rest of the data, and the deletion operation does not sabotage the representation power of the algorithm. If $\mathrm{FRT}_{[t-s,t-1]} \ne 0$, extra regret is introduced during the online forgetting process: the deleted data cannot be fully represented by the rest of the data, and the deletion operation sabotages the representation power of the algorithm.

FRT is determined by two terms, the "deleted term" $\|x_{t-s}\|^2_{\Sigma^{-}_{[t-s,t-1]}}$ and the "dissimilarity term" $\|x_t\|^2_{\Sigma^{-}_{[t-s,t-1]}} \sin^2(\theta, \Sigma^{-}_{[t-s,t-1]})$. If the deleted term is zero, then FRT introduces no extra regret, no matter how large the dissimilarity term is. The larger the dissimilarity term, the more regret is introduced when the deleted term is nonzero, since the new data introduces a new direction in the representation space.
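For intuition, here is a small sketch (our own, not from the paper) of how the FRT in (9) can be evaluated numerically for a given data window, using the Moore-Penrose pseudo-inverse as the generalized inverse $\Sigma^{-}$.

```python
import numpy as np

def forgetting_regret_term(window, x_new):
    """FRT for deleting the oldest context of `window` while adding `x_new`.

    window : (s, d) array of the contexts x_{t-s}, ..., x_{t-1}
    x_new  : (d,) array, the incoming context x_t
    """
    x_old = window[0]
    gram_pinv = np.linalg.pinv(window.T @ window)   # generalized inverse of Sigma

    def sq_norm(v):                                  # ||v||^2 under the pseudo-inverse
        return float(v @ gram_pinv @ v)

    deleted = sq_norm(x_old)
    new = sq_norm(x_new)
    if deleted == 0.0 or new == 0.0:
        return deleted                               # dissimilarity term vanishes
    cos2 = (x_new @ gram_pinv @ x_old) ** 2 / (new * deleted)
    sin2 = 1.0 - cos2
    return deleted * (new * sin2 + 1.0)
```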

Lemma 2. The cumulative regret of FIFD-OLS is partially determined by the FRT at each time step, and the cumulative representation term satisfies

$$\sum_{t=s+1}^{T} \|x_t\|^2_{\Sigma^{-}_{[t-s,t-1]}} \le 2\eta_{\mathrm{OLS}} + \sum_{t=s+1}^{T} \mathrm{FRT}_{[t-s,t-1]}, \qquad (10)$$

where $\eta_{\mathrm{OLS}} = \log\big(\det(\Sigma_{[T-s,T]})\big)$ is a constant based on the last data time window $[T-s, T]$.

Proof. Let us first denote by $r_t$ the instantaneous regret at time $t$ and decompose it as $r_t = |\langle \hat{\theta}_{[t-s,t-1]}, x_t \rangle - \langle \theta_\star, x_t \rangle| \le \beta_{[t-s,t-1]}(\delta)\, \|x_t\|_{\Sigma^{-}_{[t-s,t-1]}}$, where the inequality follows from Lemma 1. Thus, with probability at least $1-\delta$, for all $T > s$,

$$R_{T,s}(\mathcal{A}_{\mathrm{OLS}}) \le \sqrt{(T-s) \sum_{t=s+1}^{T} r_t^2} \le \sqrt{(T-s) \max_{s+1 \le t \le T} \beta^2_{[t-s,t-1]}(\delta) \sum_{t=s+1}^{T} \|x_t\|^2_{\Sigma^{-}_{[t-s,t-1]}}}.$$

4 FIFD-Adaptive Ridge

To remedy the rank swinging phenomenon, we take advantage of ridge regression to prevent it from happening in the online forgetting process. In this section, we present the FIFD adaptive ridge regression (FIFD-Adaptive Ridge) algorithm under the "First In First Delete" scheme, the corresponding confidence ellipsoid for the FIFD-Adaptive Ridge estimator, and its regret analysis.

4.1 FIFD-Adaptive Ridge Algorithm

The FIFD-Adaptive Ridge algorithm uses the ridge estimator $\hat{\theta}_{\lambda,[t-s,t-1]}$ to predict the response; for the time window $[t-s, t-1]$ it is defined as follows.

Adaptive Ridge update: $\hat{\theta}_{\lambda,[t-s,t-1]} = \Sigma_{\lambda,[t-s,t-1]}^{-1} x_{[t-s,t-1]}^\top y_{[t-s,t-1]}$, where $\Sigma_{\lambda,[t-s,t-1]} = x_{[t-s,t-1]}^\top x_{[t-s,t-1]} + \lambda_{[t-s,t-1]} I_{d \times d}$ and the regularization level $\lambda_{[t-s,t-1]}$ is selected adaptively as in Lemma 3 below.

Lemma 3 (Adaptive selection of the regularization level). For any $\delta \in (0,1)$, even when $\mathrm{Rank}(\Sigma_{[t-s,t-1]}) < d$, writing $\epsilon(\delta) = \|x_{[a,b]}\|_\infty \sqrt{(2d/s)\log(2d/\delta)} \,/\, \lambda_{\min}(\Sigma_{\lambda,[a,b]})$,

we have control of the $L_2$ estimation error, $\Pr\big[\|\hat{\theta}_{\lambda,[t-s,t-1]} - \theta_\star\|_2 \le \epsilon(\delta)\big] \ge 1-\delta$. To satisfy this condition, we select the adaptive $\lambda_{[t-s,t-1]}$ for the limited time window $[t-s, t-1]$ as

$$\lambda_{[t-s,t-1]} \le \|x_{[t-s,t-1]}\|_\infty\, \sigma \sqrt{2s\log(2d/\delta)} \,/\, \|\theta_\star\|_1. \qquad (11)$$

The detailed derivation can be found in Appendix E. The detailed update procedure of FIFD-Adaptive Ridge is displayed in Algorithm 2.

Algorithm 2: FIFD-Adaptive Ridge

  Given parameters: $s$, $T$, $\delta$
  for $t \in \{s+1, \dots, T\}$ do
    Observe $x_t$
    $\|x_{[t-s,t-1]}\|_\infty = \max_{1 \le i \le s,\, 1 \le j \le d} |x(i,j)|$
    $\hat{\sigma}_{[t-s,t-1]} = \mathrm{SD}(y_{[t-s,t-1]})$
    $\lambda_{[t-s,t-1]} = \hat{\sigma}_{[t-s,t-1]}\, \|x_{[t-s,t-1]}\|_\infty \sqrt{2s\log(2d/\delta)}$
    $\Sigma_{\lambda,[t-s,t-1]} = x_{[t-s,t-1]}^\top x_{[t-s,t-1]} + \lambda_{[t-s,t-1]} I$
    $\hat{\theta}_{[t-s,t-1]} = \Sigma_{\lambda,[t-s,t-1]}^{-1} x_{[t-s,t-1]}^\top y_{[t-s,t-1]}$
    Predict $\hat{y}_t = \langle \hat{\theta}_{[t-s,t-1]}, x_t \rangle$ and observe $y_t$
    Compute loss $r_t = |\hat{y}_t - \langle \theta_\star, x_t \rangle|$
    Delete data $z_{t-s} = (x_{t-s}, y_{t-s})$
  end
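The following is a compact sketch of one pass of Algorithm 2 (our own rendering; the plug-in noise estimate and the simplification $\|\theta_\star\|_1 = 1$ follow the experimental setup of Section 5, so the $\|\theta_\star\|_1$ factor of (11) is dropped).

```python
import numpy as np

def fifd_adaptive_ridge(X, y, s, delta):
    """Run FIFD-Adaptive Ridge over a stream (X, y) with memory limit s.

    X : (T, d) contexts, y : (T,) responses. Returns per-step predictions.
    The adaptive lambda follows Algorithm 2, with the noise level estimated
    by the sample SD of the responses in the current window.
    """
    T, d = X.shape
    preds = np.full(T, np.nan)
    for t in range(s, T):
        Xw, yw = X[t - s:t], y[t - s:t]          # constant data memory
        sigma_hat = yw.std(ddof=1)               # plug-in noise estimate
        lam = sigma_hat * np.abs(Xw).max() * np.sqrt(2 * s * np.log(2 * d / delta))
        Sigma_lam = Xw.T @ Xw + lam * np.eye(d)  # always full rank: no rank swinging
        theta_hat = np.linalg.solve(Sigma_lam, Xw.T @ yw)
        preds[t] = X[t] @ theta_hat              # the oldest point then drops out
    return preds
```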
4.2 Confidence Ellipsoid for FIFD-Adaptive Ridge

Before moving to the confidence ellipsoid of the adaptive ridge estimator, we define some notions about the true parameter $\theta_\star$: $\mathcal{P}_{\min}(\theta_\star)$ represents the weakest positive signal and $\mathcal{N}_{\max}(\theta_\star)$ serves as the weakest negative signal. To simplify the adaptive ridge confidence ellipsoid, without loss of generality, we make the following assumption.

Assumption 1 (Weakest Positive Signal to Strongest Signal Ratio). We assume that the positive coordinates of $\theta_\star$ dominate the bad events: the ratio $\mathrm{WPSSR} = \mathcal{P}_{\min}(\theta_\star) / \|\theta_\star\|_\infty$ is bounded by a threshold (12) that depends on $s$, $d$, $\delta$, and $|\mathcal{P}(\theta_\star)|$ through $C_3 = \log(6d/\delta)\log(2d/\delta)$. The threshold is monotone increasing in $s$ and $|\mathcal{P}(\theta_\star)|$ and monotone decreasing in $d$, and in most cases it is mild; for example, with $s = 100$, $d = 110$, $\delta = 0.05$, and $|\mathcal{P}(\theta_\star)| = 30$, the required bound on the ratio is about $1.02$, so the assumption is satisfied.

If Assumption 1 holds, a high-probability confidence ellipsoid can be obtained for FIFD-Adaptive Ridge.

Lemma 4 (FIFD-Adaptive Ridge Confidence Ellipsoid). For any $\delta \in [0,1]$, with probability at least $1-\delta$, for all $t \ge s+1$, under Assumption 1 and on the event $\lambda_{\min}(\Sigma_{\lambda,[t-s,t-1]}/s) > \gamma^2 > 0$, $\theta_\star$ lies in the set

$$C_{\lambda,[t-s,t-1]} = \Big\{ \theta \in \mathbb{R}^d : \big\|\hat{\theta}_{\lambda,[t-s,t-1]} - \theta\big\|_{\Sigma_{\lambda,[t-s,t-1]}} \le \nu \sqrt{q_{\lambda,[t-s,t-1]}\, \kappa\, d/(2s)} \Big\}, \qquad (13)$$

where $q_{\lambda,[t-s,t-1]} = \|x_{[t-s,t-1]}\|_\infty^2 / \lambda_{\min}(\Sigma_{\lambda,[t-s,t-1]})$, $\kappa = \log\big(6|\mathcal{P}(\theta_\star)|^2/\delta\big)/\log(2d/\delta)$, and $\nu = \|\theta_\star\|_\infty / \mathcal{P}_{\min}(\theta_\star)$.

Proof. Key steps: the same technique used in Lemma 1 is applied to obtain an adaptive confidence ellipsoid for the ridge estimator, and Assumption 1 is used to simplify the resulting confidence ellipsoid. The detailed proof can be found in Appendix B.

Remark. The order of the constant memory limit $s$ in the confidence ellipsoids of FIFD-OLS and FIFD-Adaptive Ridge is $\mathcal{O}(\sqrt{s})$. The adaptive ridge confidence ellipsoid requires the minimum eigenvalue of the gram matrix $\Sigma_{\lambda,[t-s,t-1]}$ with hyperparameter $\lambda$ and the infinity norm of the data $\|x_{[t-s,t-1]}\|_\infty$, similarly to the FIFD-OLS confidence ellipsoid. The benefit of introducing $\lambda$ is that it avoids the singularity of the gram matrix, which stabilizes the FIFD-Adaptive Ridge algorithm and keeps the confidence ellipsoid narrow most of the time. Moreover, it is again an adaptive confidence ellipsoid.

4.3 FIFD-Adaptive Ridge Regret

In the following theorem, we provide the regret upper bound for the FIFD-Adaptive Ridge method.

Theorem 2 (Regret Upper Bound of the FIFD-Adaptive Ridge Algorithm). Under Assumption 1, with probability at least $1-\delta$, the cumulative regret satisfies

$$R_{T,s}(\mathcal{A}_{\mathrm{Ridge}}) \le \nu \zeta_\lambda \sqrt{(d/s)(T-s)\big[\eta_{\mathrm{Ridge}} + (T-s)\big]}, \qquad (14)$$

where $\zeta_\lambda = \max_{s+1 \le t \le T} \|x_{[t-s,t-1]}\|_\infty^2 / \lambda_{\min}(\Sigma_{\lambda,[t-s,t-1]})$ is the maximum adaptive constant, $\eta_{\mathrm{Ridge}} = d\log\big(sL^2/d + \lambda_{[T-s,T-1]}/d\big) - \log C_2(\lambda)$ is a constant related to the last data memory, $C_2(\lambda) = \prod_{t=s+1}^{T-s} \big(1 + \Delta\lambda_{[t-s+1,t]} / \lambda_{\min}(\Sigma_{\lambda,[t-s+1,t]})\big)$ is a constant close to 1, and $\Delta\lambda_{[t-s+1,t]} = \lambda_{[t-s+1,t]} - \lambda_{[t-s,t-1]}$ represents the change of $\lambda$ over the time steps.

Proof. The key step is to use the same technical lemma as in the OLS setting to show that a confidence ellipsoid holds for the adaptive ridge estimator. Lemma 2 is then used to compute the FRT, and summing the FRTs yields the cumulative regret upper bound. The detailed proof of this theorem can be found in Appendix E.

Remarks (relationship between the regret upper bound and the parameters). We develop the FIFD-Adaptive Ridge algorithm and provide a regret upper bound of order $\mathcal{O}(\zeta_\lambda \sigma L [d/s]^{1/2}\, T)$. This order captures the relationship between the regret and the noise $\sigma$, the dimension $d$, the constant memory limit $s$, the signal level $\nu$, and the confidence level $1-\delta$. Since the agent always keeps only the constant memory limit of $s$ data points to update the estimated parameters, it cannot improve the performance of the algorithm over time; therefore, the regret is $\mathcal{O}(T)$ with respect to the number of decisions. Besides, we find that using the ridge estimator offsets the rank swinging phenomenon, since the ridge gram matrix is always full rank.

5 Simulation Experiments

We compare the FIFD-Adaptive Ridge method, where $\lambda$ is chosen according to Lemma 3, with baselines using pre-tuned $\lambda$. For all experiments, unless otherwise specified, we choose $T = 3000$, $\delta = 0.05$, $d = 100$, and $L = 1$ in all simulation settings. All results are averaged over 100 runs.

Simulation settings. The constant memory limit $s$ ranges over $\{20, 40, 60, 80\}$ and the subgaussian parameter $\sigma$ over $\{1, 2, 3\}$. The contexts $\{x_t\}_{t \in [T]}$ are generated from $N(0_d, I_{d \times d})$ and then normalized. The responses $\{y_t\}_{t \in [T]}$ are generated by $y_t = \langle x_t, \theta_\star \rangle + \epsilon_t$, where $\theta_\star \sim N(0_d, I_{d \times d})$ is normalized so that $\|\theta_\star\|_2 = 1$, and $\epsilon_t$ is drawn i.i.d. from a $\sigma$-subgaussian distribution. The adaptive ridge noise level $\sigma$ in the simulation is estimated by $\hat{\sigma}_{[t-s,t-1]} = \mathrm{sd}(y_{[t-s,t-1]})$, $t \in [s+1, T]$.

Hyperparameter settings for competing methods. The hyperparameters $\lambda$ selected for the $\sigma = 1$ setting are $\{1, 10, 100\}$. Since the hyperparameter $\lambda$ and the noise level $\sigma$ are related in linear order, for the $\sigma = 2$ setting we set $\lambda = \{2, 20, 200\}$, and for the $\sigma = 3$ setting we set $\lambda = \{3, 30, 300\}$. The adaptive ridge hyperparameter is automatically calculated according to Lemma 3, and we assume $\|\theta_\star\|_1 = 1$ since $\|\theta_\star\|_2 = 1$.

Figure 3: Comparison of cumulative regret between the Adaptive Ridge method and the Fixed Ridge methods. The error bars represent the standard error of the mean regret over 100 runs. The blue line represents the Adaptive Ridge; the green line represents the Fixed Ridge method with $\lambda = \{1, 2, 3\}$ for each row; the other lines represent the Fixed Ridge method with $\lambda = \{10, 20, 30\}$ and $\lambda = \{100, 200, 300\}$ for each row, respectively.

Results. In Figure 3, we present the simulation results on the relationship between the cumulative regret $R_{T,s}(\mathcal{A}_{\mathrm{Ridge}})$, the noise level $\sigma$, the hyperparameter $\lambda$, and the constant memory limit $s$. We find that all of the cumulative regrets are linear in the time horizon $T$, only with different constant levels.

Across the three rows, as the memory limit $s$ increases, the cumulative regret of the adaptive ridge (blue line) decreases, consistent with the theory that regret decreases as the sample size increases. For each column, when we fix the memory limit $s$, the regret of FIFD-Adaptive Ridge and of Fixed Ridge with different hyperparameters increases as the noise level $\sigma$ increases.

Overall, we find that the FIFD-Adaptive Ridge method is more robust to changes in the noise level $\sigma$ when Figure 3 is viewed by row. Fixed Ridge methods with different hyperparameters, such as the green and red lines, are sensitive to changes in the noise $\sigma$ and the memory limit $s$, because Fixed Ridge has no prior with which to adapt to the data: although it performs well in the top-left panel, it does not work well in the other settings. The yellow line has relatively comparable performance to the Adaptive Ridge method, since its hyperparameter is determined using knowledge from the Adaptive Ridge. The FIFD-Adaptive Ridge is always the best (2nd and 3rd rows) or close to the best (1st row) choice of $\lambda$ among all of these settings, which means that the Adaptive Ridge method is robust to large noise.

Besides, the Adaptive Ridge method can save computational cost compared with the Fixed Ridge method, which needs cross-validation to select the optimal $\lambda$.
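A condensed sketch (ours) of the simulation protocol above: normalized Gaussian contexts and parameter, Gaussian noise as a $\sigma$-subgaussian instance, and the cumulative pseudo-regret (3) computed for the adaptive-$\lambda$ rule of Algorithm 2 versus a fixed-$\lambda$ baseline. Numbers and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, s, sigma, delta = 3000, 100, 40, 1.0, 0.05

# Linear model with normalized contexts and parameter, as in Section 5.
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)
X = rng.normal(size=(T, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = X @ theta_star + sigma * rng.normal(size=T)

def cumulative_regret(lam_rule):
    """Cumulative pseudo-regret (3) of sliding-window ridge with rule lam_rule."""
    regret = 0.0
    for t in range(s, T):
        Xw, yw = X[t - s:t], y[t - s:t]
        lam = lam_rule(Xw, yw)
        theta_hat = np.linalg.solve(Xw.T @ Xw + lam * np.eye(d), Xw.T @ yw)
        regret += (X[t] @ theta_star - X[t] @ theta_hat) ** 2
    return regret

adaptive = lambda Xw, yw: yw.std(ddof=1) * np.abs(Xw).max() * np.sqrt(2 * s * np.log(2 * d / delta))
fixed_10 = lambda Xw, yw: 10.0

print("adaptive ridge:", cumulative_regret(adaptive))
print("fixed ridge (lambda=10):", cumulative_regret(fixed_10))
```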

Figure 4: Here $\epsilon_t \sim t_{df}$ with degrees of freedom $df \in \{5, 10, 15\}$. This setting investigates the robustness of the proposed online regularization scheme. The line colors are the same as in Figure 3.

Figure 5: Switching from Ridge to OLS & $(+k, -1)$: comparison of cumulative regret between the Switching Adaptive Ridge method (which switches once more than $2d$ data points have been accumulated) and the Fixed Ridge method. The error bars represent the standard error of the mean regret over 100 runs. The cyan line represents the Switching Adaptive Ridge; the other line colors are the same as in Figure 3.

In addition, more results on the choice of the hyperparameter $\lambda$ and on the $L_2$ estimation error of the estimator can be found in Appendix G.

Heavy-Tailed Noise Distribution. Moreover, we also test the robustness of these methods with respect to different noise. Here we assume $\epsilon_t \sim t_{df}$ for all $t \in [T]$, with degrees of freedom $df$ set to 5, 10, and 15. When $df$ is greater than 30, the $t$-distribution behaves like the normal distribution; when $df$ is small, it has much heavier tails. In Figure 4, we find that as $df$ increases the cumulative regret decreases, since the error becomes closer to the normal distribution, which is well captured by our algorithm. However, the Fixed Ridge methods are not robust to the change of noise distribution, especially the green and red lines.

Results for the $(+k, -1)$ addition-deletion operation. Here we test the pattern in which we add more than one data point at each time step, such as $k = 2, 3, 4$, and delete one, denoted as the $(+k, -1)$ addition-deletion pattern. In Figure 5, the 1st, 2nd, and 3rd columns are the cases where one data point is deleted after receiving 2, 3, and 4 data points, respectively; each row has a different noise level $\sigma = 1, 2, 3$. From the figure, we notice that the cumulative regret in each subplot has an increasing sublinear pattern. For the first row, the noise level is the smallest, so the sublinear pattern appears quickly. For the other rows, all subplots show increasing sublinear patterns sooner or later, which matches our anticipation that when adding $k > 1$ data points and deleting one data point at each time step, the process eventually converts to the normal online learning regime. However, the noise level affects the time at which it moves to the sublinear pattern.

Switching from Ridge to OLS. Besides, we also consider the case in which the agent accumulates a larger amount of data, say $n > 2d$, where $n$ is the number of data points the agent holds at some time point $t$, and then switches from the Adaptive Ridge method to the OLS method. This method, called Switching Adaptive Ridge, is represented by the cyan line in Figure 5. We see that when the noise level $\sigma$ is relatively small, it performs well; however, when $\sigma$ is large, it works worse than the green line (Fixed Ridge with prior knowledge about $\sigma$) and the blue line (Adaptive Ridge).

6 Discussion and Conclusion

In this paper, we have proposed two online forgetting algorithms under the FIFD scheme and provided their theoretical regret upper bounds. To our knowledge, this is the first theoretically grounded work on the online forgetting process under the FIFD scheme.

Besides, we identify the existence of the rank swinging phenomenon in online least squares regression and tackle it using online adaptive ridge regression. In the future, we hope to provide lower bounds for these two algorithms and to design other online forgetting algorithms under the FIFD scheme.

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312-2320.

Bastani, H. and Bayati, M. (2020). Online decision making with high-dimensional covariates. Operations Research, 68(1):276-294.

Bourtoule, L., Chandrasekaran, V., Choquette-Choo, C., Jia, H., Travers, A., Zhang, B., Lie, D., and Papernot, N. (2019). Machine unlearning. arXiv preprint arXiv:1912.03817.

Chen, M. X., Lee, B. N., Bansal, G., Cao, Y., Zhang, S., Lu, J., Tsay, J., Wang, Y., Dai, A. M., Chen, Z., et al. (2019). Gmail smart compose: Real-time assisted writing. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2287-2295.

Council of European Union (2012). Council regulation (EU) no 2012/0011. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52012PC0011.

Council of European Union (2016). Council regulation (EU) no 2016/678. https://eur-lex.europa.eu/eli/reg/2016/679/oj.

Dekel, O., Shalev-Shwartz, S., and Singer, Y. (2006). The forgetron: A kernel-based perceptron on a fixed budget. In Advances in Neural Information Processing Systems, pages 259-266.

Dragomir, S. S. (2015). Reverses of Schwarz inequality in inner product spaces with applications. Mathematische Nachrichten, 288(7):730-742.

Gaillard, P., Gerchinovitz, S., Huard, M., and Stoltz, G. (2018). Uniform regret bounds over R^d for the sequential linear regression problem with the square loss. arXiv preprint arXiv:1805.11386.

Ginart, A., Guan, M., Valiant, G., and Zou, J. Y. (2019). Making AI forget you: Data deletion in machine learning. In Advances in Neural Information Processing Systems, pages 3513-3526.

Google (2020). Clear browsing data. https://support.google.com/chrome/answer/2392709?co=GENIE.Platform.

Hsu, D., Kakade, S., Zhang, T., et al. (2012). A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17.

Izzo, Z., Smart, M. A., Chaudhuri, K., and Zou, J. (2020). Approximate data deletion from machine learning models: Algorithms and evaluations. arXiv preprint arXiv:2002.10077.

Jon Porter (2019). Google will let you delete your location tracking data. https://www.theverge.com/2019/5/1/18525384/.

Liu, X. and Tsaftaris, S. A. (2020). Have you forgotten? A method to assess if machine learning models have forgotten data. arXiv preprint arXiv:2004.10129.

Liu, Y., Gadepalli, K., Norouzi, M., Dahl, G. E., Kohlberger, T., Boyko, A., Venugopalan, S., Timofeev, A., Nelson, P. Q., Corrado, G. S., et al. (2017). Detecting cancer metastases on gigapixel pathology images. arXiv preprint arXiv:1703.02442.

State of California Department of Justice (2018). California Consumer Privacy Act (CCPA). https://oag.ca.gov/privacy/ccpa.

Villaronga, E. F., Kieseberg, P., and Li, T. (2018). Humans forget, machines remember: Artificial intelligence and the right to be forgotten. Computer Law & Security Review, 34(2):304-313.

Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press.

Woodbury, M. (1950). Inverting modified matrices. Memorandum Report 42, Statistical Research Group.

Zahavy, T. and Mannor, S. (2019). Deep neural linear bandits: Overcoming catastrophic forgetting through likelihood matching. arXiv preprint arXiv:1901.08612.