<<

Conformal Prediction Under Covariate Shift Ryan J. Tibshirani, Rina Foygel Barber, Emmanuel Candes, Aaditya Ramdas Carnegie Mellon , University of , , Carnegie Mellon University 1500 700 No covariate shift No covariate shift Conformal prediction Covariate shift Covariate shift Covariate shift 500 1000 300 Frequency Frequency Setup. Given i.i.d. samples (Xi,Yi) P , i =1,...,n, where Setup. Suppose training and test data are not i.i.d., instead 500

d ⇠ d 100 P is a distribution on R R. Goal: compute band Cn : R 0 0 ⇥ ! i.i.d. 0.7 0.8 0.9 1.0 10 15 20 25 30 35 40 such that for a new i.i.d. point (Xn+1,Yn+1), (Xi,Yi) P = PX PY X ,i=1, . . . , n, Coverage Length ⇠ ⇥ | (2) 1500 700 Oracle weights Oracle weights b No shift, fewer samples No shift, fewer samples (Xn+1,Yn+1) P = PX PY X , independently

| 500 P Yn+1 Cn(Xn+1) 1 ↵ (1) ⇠ ⇥ 1000 2 300 Frequency Frequency n o Called covariate shift model.e e Conformal prediction no longer 500

for given miscoverage rate b↵ (0, 1). No assumptions on P ! works (lack of exchangeability) 100 2 0 0 0.7 0.8 0.9 1.0 10 15 20 25 30 35 40

Coverage Length 1500 Suppose we knew likelihood ratio w = dP /dP . By sampling 700 X X Logistic regression weights Logistic regression weights Quantile lemma. If V1,...,Vn+1 are exchangeable, then for from training set with probabilities w, this would “look like” Random forest weights Random forest weights 500 / 1000 any (0, 1), draws from test distribution, so we coulde use conformal 300

2 Frequency Frequency 500 100

P Vn+1 Quant ; V1:n 0 0  [ {1} Key idea. can do this without sampling, with exact coverage. 0.7 0.8 0.9 1.0 10 15 20 25 30 35 40 n o Coverage Length Compute scores as before, now let y in Cn(x) provided Proof. Let q =Quant(; F ), where F has support points ai, ● ● ● ● 140 ● ● 140 ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●●●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●●● ● ● ● ● ● Unity weights ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ● ● ● ●● ●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●●● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ● ●●●●●●● ● ● ● ● i =1, 2,.... If we reassign points a >qto any values strictly ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●●●● ●● ● ●● ● ● Logistic regression weights i ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●●●●●● ●● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ● ●●●●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●● ● ● ● ●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●● ● ●●●●●●● ● ● ● ●● ● n 130 ● ● ● ● ● ● ● ● ● ● ● 130 ●● ● ●● ● ● ●● ● ●● ● Random forest weights ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● 500 ● ● ● ● ● ● ● ● ● ● ● ●●●●● ● ●●●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ●●● ●●●●● ●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●●● ●●●● ● ● ●● ●●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ●● ●●●●● ●●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ●●● ● ●● ●●● ●● ●● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ●● ●●● ● b ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ●● ●● ●●●● ●● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●●●● ● ●● ●●● ● ● (x,y) ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ● ●● ● ●● ●● ●● ●●● larger than q, then the level quantile is unchanged. Thus ● ● ● ● ● ● ●● ● ● ●● ● w w ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●●● ● ● ● ●● ●● ● y ● ● ● ● ● ● y ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ●●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●●●● ●● ● ● ● ● (x,y) ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ●● V Quant 1 ↵; p (x) + p (x) 120 ● ● ● ● 120 ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ●● ● ● ● ●● ● ● 300 n+1 ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● i n+1 ● ● ● ● ● ● ● ● ● ●●●● ●● ● ● V ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●●● ●● ● ●● ●

● ● ● ● Frequency ● ● ● ● ● ● ●●● ● ●● ● ●●● ● i 1 ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●●  ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● i=1 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ✓ ◆ ● ● ●● ●

● ● ● ● ● ● 100 V > Quant ; V V > Quant ; V 110 ● ● 110 ● ● n+1 1:n n+1 ● ● 1:(n+1) ● ● X ● ● ● ● ● ● ● ● 0 [ {1} () ● ● w w 6 7 8 9 10 −8 −7 −6 −5 −4 −3 0.80 0.85 0.90 0.95 1.00 Here p (x) w(X ), i =1,...,n, and p (x) w(x). This x1 x5 Equivalently with . But V Quant(; V ) i i n+1 Coverage  n+1  1:(n+1) () / / construction recovers (1) for the covariate shift model (2) 0.6 Vn+1 is among (n +1) smallest of V1,...,Vn+1. 0.5 800 d e 0.5 0.4 0.4 600 0.3 0.3 400 Frequency Frequency Frequency 0.2 0.2 200

Estimation of w. Given test covariates X , i = n +1,...,m, 0.1 Conformal prediction. Due to Vovk et al. (2005). Denote i 0.1 0 0.0 0.0 we can run any classifier (that outputs estimated probabilities) 12 14 16 18 20 22 24 Zi =(Xi,Yi), i =1,...,n. Choose a score function , e.g., 5 6 7 8 9 10 −8 −7 −6 −5 −4 −3 −2 S x1 x5 to (Xi,Ci), i =1,...,n+ m, where Ci =1 i n . As Length (x, y),Z = y µ(x) , {  } S | | P(C =1X = x) P(C =1)dPX where µ : Rd R is fitted by running algorithm on Z.For | = (x), Generalization d ! b A P(C =0X = x) P(C =0)dPX x R , define Cn(x) by computing for each y R: | 2 2 e Weighted exchangeability. We say V ,...,V are weighted b we can take w(x)=P(C =1X = x)/P(C =0X = x) 1 n V (x,y) = Z ,Z (x, y) ,i=1,...,n | | exchangeable if their joint density f factorizes as i b S i 1:n [ { } (x,y) V = (x, y),Z1:n (x, y) n n+1 S [ { } Simulated example f(v1,...,vn)= wi(vi) g(v1,...,vn), · We include y in Cn(x) provided i=1 Y Airfoil data: n =1503, d =5.ForT =5000trials, randomly V (x,y) Quant 1b ↵; V (x,y) where g does not depend on the ordering of its inputs n+1  1:n [ {1} () partition the data into halves Dtrain,Dtest, construct Dshift,by (x,y) (x,y) (x,y) V is among (1 ↵)(n + 1) smallest of V ,...,V sampling from Dtest proportionally to n+1 d e 1 n+1 Weighted conformal. Conformal prediction can be extended to (X ,Y), i =1,...,n+1weighted exchangable. Special w(x)=exp(xT ), where =( 1, 0, 0, 0, 1) i i This construction gives distribution-free coverage as in (1) cases: i.i.d., exchangeable, covariate shift, covariate clusters