Carnegie Mellon University, University of Chicago, Stanford University

Conformal Prediction Under Covariate Shift Ryan J. Tibshirani, Rina Foygel Barber, Emmanuel Candes, Aaditya Ramdas Carnegie Mellon University, University of Chicago, Stanford University, Carnegie Mellon University 1500 700 No covariate shift No covariate shift Conformal prediction Covariate shift Covariate shift Covariate shift 500 1000 300 Frequency Frequency Setup. Given i.i.d. samples (Xi,Yi) P , i =1,...,n, where Setup. Suppose training and test data are not i.i.d., instead 500 d ⇠ d 100 P is a distribution on R R. Goal: compute band Cn : R 0 0 ⇥ ! i.i.d. 0.7 0.8 0.9 1.0 10 15 20 25 30 35 40 such that for a new i.i.d. point (Xn+1,Yn+1), (Xi,Yi) P = PX PY X ,i=1, . , n, Coverage Length ⇠ ⇥ | (2) 1500 700 Oracle weights Oracle weights b No shift, fewer samples No shift, fewer samples (Xn+1,Yn+1) P = PX PY X , independently | 500 P Yn+1 Cn(Xn+1) 1 ↵ (1) ⇠ ⇥ 1000 2 ≥ − 300 Frequency Frequency n o Called covariate shift model.e e Conformal prediction no longer 500 for given miscoverage rate b↵ (0, 1). No assumptions on P ! works (lack of exchangeability) 100 2 0 0 0.7 0.8 0.9 1.0 10 15 20 25 30 35 40 Coverage Length 1500 Suppose we knew likelihood ratio w = dP /dP . By sampling 700 X X Logistic regression weights Logistic regression weights Quantile lemma. If V1,...,Vn+1 are exchangeable, then for from training set with probabilities w, this would “look like” Random forest weights Random forest weights 500 / 1000 any β (0, 1), draws from test distribution, so we coulde use conformal 300 2 Frequency Frequency 500 100 P Vn+1 Quant β; V1:n β 0 0 [ {1} ≥ Key idea. can do this without sampling, with exact coverage. 0.7 0.8 0.9 1.0 10 15 20 25 30 35 40 n o Coverage Length Compute scores as before, now let y in Cn(x) provided Proof. Let q =Quant(β; F ), where F has support points ai, ● ● ● ● 140 ● ● 140 ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●●●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●●● ● ● ● ● ● Unity weights ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ● ● ● ●● ●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●●● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ● ●●●●●●● ● ● ● ● i =1, 2,.... If we reassign points a >qto any values strictly ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●●●● ●● ● ●● ● ● Logistic regression weights i ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●●●●●● ●● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ● ●●●●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●● ● ● ● ●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●● ● ●●●●●●● ● ● ● ●● ● n 130 ● ● ● ● ● ● ● ● ● ● ● 130 ●● ● ●● ● ● ●● ● ●● ● Random forest weights ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● 500 ● ● ● ● ● ● ● ● ● ● ● ●●●●● ● ●●●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ●●● ●●●●● ●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●●● ●●●● ● ● ●● ●●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ●● ●●●●● ●●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ●●● ● ●● ●●● ●● ●● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ●● ●●● ● b ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ●● ●● ●●●● ●● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●●●● ● ●● ●●● ● ● (x,y) ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ● ●● ● ●● ●● ●● ●●● larger than q, then the level β quantile is unchanged. Thus ● ● ● ● ● ● ●● ● ● ●● ● w w ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●●● ● ● ● ●● ●● ● y ● ● ● ● ● ● y ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ●●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●●●● ●● ● ● ● ● (x,y) ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ●● V Quant 1 ↵; p (x)δ + p (x)δ 120 ● ● ● ● 120 ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ●● ● ● ● ●● ● ● 300 n+1 ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● i n+1 ● ● ● ● ● ● ● ● ● ●●●● ●● ● ● V ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●●● ●● ● ●● ● ● ● ● ● Frequency ● ● ● ● ● ● ●●● ● ●● ● ●●● ● i 1 ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● − ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● i=1 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ✓ ◆ ● ● ●● ● ● ● ● ● ● ● 100 V > Quant β; V V > Quant β; V 110 ● ● 110 ● ● n+1 1:n n+1 ● ● 1:(n+1) ● ● X ● ● ● ● ● ● ● ● 0 [ {1} () ● ● w w 6 7 8 9 10 −8 −7 −6 −5 −4 −3 0.80 0.85 0.90 0.95 1.00 Here p (x) w(X ), i =1,...,n, and p (x) w(x). This x1 x5 Equivalently with . But V Quant(β; V ) i i n+1 Coverage n+1 1:(n+1) () / / construction recovers (1) for the covariate shift model (2) 0.6 Vn+1 is among β(n +1) smallest of V1,...,Vn+1. 0.5 800 d e 0.5 0.4 0.4 600 0.3 0.3 400 Frequency Frequency Frequency 0.2 0.2 200 Estimation of w. Given test covariates X , i = n +1,...,m, 0.1 Conformal prediction. Due to Vovk et al. (2005). Denote i 0.1 0 0.0 0.0 we can run any classifier (that outputs estimated probabilities) 12 14 16 18 20 22 24 Zi =(Xi,Yi), i =1,...,n. Choose a score function , e.g., 5 6 7 8 9 10 −8 −7 −6 −5 −4 −3 −2 S x1 x5 to (Xi,Ci), i =1,...,n+ m, where Ci =1 i n . As Length (x, y),Z = y µ(x) , { } S | − | P(C =1X = x) P(C =1)dPX where µ : Rd R is fitted by running algorithm on Z.For | = (x), Generalization d ! b A P(C =0X = x) P(C =0)dPX x R , define Cn(x) by computing for each y R: | 2 2 e Weighted exchangeability. We say V ,...,V are weighted b we can take w(x)=P(C =1X = x)/P(C =0X = x) 1 n V (x,y) = Z ,Z (x, y) ,i=1,...,n | | exchangeable if their joint density f factorizes as i b S i 1:n [ { } (x,y) V = (x, y),Z1:n (x, y) n n+1 S [ { } Simulated example f(v1,...,vn)= wi(vi) g(v1,...,vn), · We include y in Cn(x) provided i=1 Y Airfoil data: n =1503, d =5.ForT =5000trials, randomly V (x,y) Quant 1b ↵; V (x,y) where g does not depend on the ordering of its inputs n+1 − 1:n [ {1} () partition the data into halves Dtrain,Dtest, construct Dshift,by (x,y) (x,y) (x,y) V is among (1 ↵)(n + 1) smallest of V ,...,V sampling from Dtest proportionally to n+1 d − e 1 n+1 Weighted conformal. Conformal prediction can be extended to (X ,Y), i =1,...,n+1weighted exchangable. Special w(x)=exp(xT β), where β =( 1, 0, 0, 0, 1) i i This construction gives distribution-free coverage as in (1) − cases: i.i.d., exchangeable, covariate shift, covariate clusters.

Load more