Arxiv:1710.06514V3 [Cs.LG] 27 Aug 2019 Ed Ob Oerbs Ob Osdrdapatclsolution Practical a Considered I Be to Probability

Home , Weight function

arXiv:1710.06514v3 [cs.LG] 27 Aug 2019 ed ob oerbs ob osdrdapatclsolution practical a considered I be to probability. robust high more be with to occur needs weights probl large to suited where not settings skewness therefore sampling estimator’ is and the Importance-weighting scale 3.2) [5]. directly Section weights weights (see the The variance if sampling [2]. fail, large even too can are and underperforms weighting n nrdc oto ait oices t outest robustness its increase to distributi weights. variate data large problematically training control the a of sam- introduce its function and study a We as probability. variance high pling with arise wher weights settings large problem sub in produces estimates How- hyperparameter estimator optimal risk risk. importance-weighted empirical the the ever, importance-weighting by done princ be in can, bias selection sample under Cross-validation n aadsrbto ota ntetre ouain pro a trai population, target the the as known in in a that probability by to distribution observation weighted data is ing its sample matches each that bias, factor selection t for over i slowly control that drifts To reality process in a but stationary, from be t collected to example, generalizing assumed data For of or aim population the 3]. national with 2, a hospital [1, one in population collected target g data to larger is a goal the to settin but eralize locally, to collected is refers data bias training where selection sample under Classification eetadwoeepce au skon[] eas of Because [7]. known i is the of value statistic of expected the whose function with and well additional terest correlates ar an that variates requiring variable Control random technique, bias. a a introducing such avoid stati to the interest te about of reduction information variance additional Alternatively, incorporate niques b [6]. substantial estimator a the introduces in that but weights, large avoiding ∗ aoiyo okdn hl Kwsa ef nvriyo Tech of University Delft at was WK while done work of Majority ne Terms Index n ol osdrwih rnaina en of means a as truncation weight consider could One importance-weighting 2 nttt o optn n nomto cecs Radboud Sciences, Information and Computing for Institute — 1 eateto lcrclEgneig idoe Universi Eindhoven Engineering, Electrical of Department .INTRODUCTION 1. apeslcinba,cross-validation. bias, selection Sample 3 OUTIPRAC-EGTDCROSS-VALIDATION IMPORTANCE-WEIGHTED ROBUST ABSTRACT eateto nelgn ytm,DltUiest fTec of University Delft Systems, Intelligent of Department otrM Kouw M. Wouter 4 ] oee,importance- However, 2]. [4, NE APESLCINBIAS SELECTION SAMPLE UNDER 1 , 3 ∗ nology. es .Krijthe H. Jesse , ime. iple, cess stic ch- en- em ias on gs n- n- o o e e s s - t . htiprac-egtdcosvldto smr robust weights. more large is cross-validation weight-based importance-weighted the that estimator risk including importance-weighted for the in argue variate with We control settings [5]. problem sub- weights produce in large to estimates known is hyperparameter but [4], optimal solution a as explored cross-validation been Importance-weighted training [11]. the generated which esse is from estimator distribution the the becomes to as over-fits Cross-validation bias, tially hy- selection optimal sample under selected. estimate, difficult be this With can expected perparameters the [10]. of error estimate repeat an generalization is obtain risk to averaged the and cross-validation, estimated standard In bias. tion weights. large importance- to increasin robustness An thereby o estimator, variance risk sampling importance-weighted [9]. the an reduce known can variate is los control value weighted importance-weighted expected the whose with and additiona importance-weighting, correlates an In that constitute function themselves 8]. weights [7, importance variance) the sampling expect statistic its the from (i.e. from deviation statistic’s subtracted the be varia reducing can control thereby value the expected of deviation its whenever The ca from value in does. expected fall variate its or control above – the – rise to correlation known negative is of statistic the correlation, the in n xesosi eto 6. conclu Section discuss in briefly extensions We and sions 5). (Section of estimator cross- effectiveness controlled a the run demonstrate variate then to We control experiment validation the 4). (Section how variance show sampling and reduces importance-weighte 3) the (Section discuss setting estimator 2), problem risk the (Section present detail we more paper in the of remainder the In • • follows: as summarized be can contributions Our selec- sample under cross-validation in interested are We eeprclyso htteicuino h control weights. the large of to robustness inclusion increases the variate that show empirically We procedure. esti- cross-validation risk a to importance-weighted mator controlled a apply We 2 ac Loog Marco , nvriyNijmegen University yo Technology of ty hnology 3 ation such edly its g data has the n- se to te d s f - l , 2. SAMPLE SELECTION BIAS the loss ℓ over a joint distribution p, and is therefore domain- specific. We are interested in minimizing the target risk RT : Consider an input space RD, and an output space , with = 1, +1 forX binary ⊆ classification or N forY RT (h )= ℓ h (x),y pT x, y dydx , (1) Y {− } Y ⊆ θ Z Z θ multi-class. The population, or target domain, is a probability X Y distribution pT defined over this pair of spaces. The distribu- which can be estimated with the sample average over data tion of the training data pS , gathered under sample selection bias, is referred to as the source domain. Under sample se- drawn from the target domain: lection bias, the data distributions differ pS (x) = pT (x) with m 6 1 pS(x) having a smaller variance than pT (x), but the posterior RˆT (h )= ℓ h z ,u . (2) θ m θ j j distributions remain equal pS (y x)= pT (y x). Data points Xj=1 | | from the source domain are referred to as xi with labels yi, while target domain data is referred to as zj with labels uj. However, this estimator cannot be used, since we lack target n labels . The challenge is to use labeled source data (xi,yi) i=1 and u m { } not unlabeled target data zj j=1 to predict target labels uj . We are interested in estimators that do depend on u. Consider a function{ }h, parameterized by θ, that maps a One possibility is the importance-weighted risk estimator, data point to a real number, h : R. The real-valued which re-formulates the target risk to include the source dis- θ X → outputs are predictions of classes and we therefore refer to h tribution: ℓ hθ(x),y pS (x, y) pT (x, y)/pS (x, y) dydx. as a classifier. Predictions are evaluated using a loss function Under sampleRR selection bias, pT (y x) = pS (y x), which | | ℓ : R R, which compares the prediction to the true simplifies the re-formulation to: label.× The Y→ expected loss is called the risk of the classifier, ˆ pT (x) R(hθ) and the average loss is called the empirical risk R(hθ). RW (h )= ℓ h x ,y pS x, y dydx . (3) θ Z Z θ X Y pS (x) 2.1. An example setting n Using samples from the source domain (x ,y ) =1, this risk { i i }i Throughout the paper, we illustrate the behavior of estima- can be estimated through: tors with a running example of a classification problem under n sample selection bias. The posterior distribution in both do- 1 pT (xi) RˆW (hθ)= ℓ(hθ(xi),yi) . (4) n pS (x ) mains is a unit cumulative normal distribution, pT (y x) = Xi=1 i Φ(yx 0, 1). The target data distribution is taken to be| a unit | normal distribution pT (x) = (0, 1), while the source data Note that it does not depend on target labels u. distribution is set to ( 1,γ)N. Figure 1 plots the data distri- N − butions, for γ = 1/√2. Note that pS has higher probability 3.1. Weights mass between x = 3 and x = 1/2, implying these values will tend to be observed− more often− in the source data set than The ratio of probability distributions is referred to as the im- in the target set. portance weight function w(x)= pT (x)/pS (x). Each weight corrects the probability of observing a single pair (xi,yi) un- 0.6 der the source distribution to its probability under the target distribution. Higher values indicate higher importance to the 0.4 target domain. In the example setting, the weight function is: −2 2 2 0.2 w(x)= γ exp [γ ( 1 x) x ]/2 . Its variance has an − − − 2 2 2 analytical solution, VS [w(x)] = γ /(2γ 1) exp 1/(2γ 0 − · − -3 -2 -1 0 1 2 3 1) 1, provided γ > 1/√2, shown in Figure 2. Note that the weights’− variance grows exponentially as γ shrinks. In other words, as the source domain becomes a more limited view of Fig. 1. Data distributions in the example setting. the target population, the spread over the importance of each sample grows sharply. In order to evaluate risk, a choice of loss function and classifier is needed. For this example, we use a quadratic loss, 2 ℓ(h (x),y)= xθ y , with θ =1/√π as coefficient. 3.2. Sampling variance θ − In expectation, the estimators RˆT and RˆW are equivalent. 3. IMPORTANCE-WEIGHTING However, they behave differently for finite sample sizes. The variance of an estimator with respect to data, is known as its Empirical risk minimization describes a classifier’s perfor- sampling variance. In the following, we will compare the mance by its expected loss. The risk function R integrates sampling variances of RˆW and RˆT . 10 4 10 0

10 2 10 -1

10 0 0.7 0.75 0.8 0.85 0.9 0.95 1 Pr[ w(x) > c ]

10 -2 10 0 10 1 10 2 Fig. 2. Variance of the importance weights w(x) as a function constant (c) of the standard deviation γ of the source data distribution in the example setting. Fig. 3. Probability of an importance weight exceeding the constant c, for different values of source standard deviation γ. Assuming data is drawn independently and is identically distributed (iid), the sampling variance of the target risk esti- 4. REDUCING SAMPLING VARIANCE mator is m 4.1. Control variates 1 2 VT [RˆT ]= ET ℓ(h (z ),u ) RT m θ j j − A control variate is a function g of a random variable x that Xj=1 2 correlates with a particular statistic f(x) of that random vari- 1 2 σT able and whose expected value is known [7, 8]. It reduces = ET ℓ(hθ(x),y) RT = , (5) m − m sampling variance as follows: suppose that g(x) is positively 2 correlated to the statistic f(x). Then, whenever g(xi) rises where σ is the sampling variance of RˆT with respect to a T above its expected value, g(x ) E[g(x)] > 0, f(x ) will single point. Assuming the source data is drawn iid as well, i i also rise above its expected value,−f(x ) E[f(x)] > 0. If g the sampling variance of the importance-weighted risk esti- i is negatively correlated, then f will fall below− its expectation mator follows similarly: whenever g rises above its expectation. By subtracting the n control variate from the statistic, f(x ) (g(x ) E[g(x)]), 1 2 i − i − VS [RˆW ]= ES ℓ(hθ(xi),yi)w(xi) RW the statistic’s deviation from its expected value can be manip- n − Xi=1 ulated. 2 1 2 σW It is however important that the control variate is appro- = ES ℓ(h (x),y)w(x) RW = . (6) θ priately scaled. For this purpose, a parameter β is introduced. n − n If β is fixed, then the estimator will be unbiased: Since RW = RT , Equations (5) and (6) reveal that the sam- ˆ ˆ n pling variances of RT and RW only differ by a scaling in- 1 duced by the importance weights: E f(x ) β g(x ) E[g(x)] n i − i − Xi=1 2 2 E 2 σT σW = T ℓ(hθ(x),y) 1 w(x) . (7) = E[f(x)] β(E[g(x)] E[g(x)]) = E[f(x)] (8) − − − − Hence, in problem settings where large weights occur often, Our statistic of interest is the weighted loss. For common the importance-weighted risk estimator has a higher sampling choices of loss functions, such as quadratic, logistic or hinge, variance than the target risk estimator. This makes sense: as the importance weights will generally correlate well with the the training data distribution provides a more narrow view of weighted loss. The expected value of the importance weights the population, the risk estimate becomes more uncertain. 2 2 2 is always known, since Figure 4 plots σW and σT for the example setting (σβ will be discussed in Section 4). Note that the shape of the curve of E pT (x) ˆ S [w(x)] = pS (x)dx = pT (x)dx =1 . (9) RW reflects the influence of weight variance (see Figure 2). ZX pS (x) ZX always 3.3. Occurrence of large weights This means the weights can be introduced as a control variate to an importance-weighted estimator. The controlled A large sampling variance does not necessarily mean a given importance-weighted risk estimator becomes: data set will contain a large weight. Drawing a point with a n large weight occurs with some probability. For illustration, 1 Rˆ (h )= ℓ(h (x ),y )w(x ) β(w(x ) 1) . (10) we plot the probability that an importance-weight exceeds a β θ n θ i i i − i − Xi=1 constant c in the example setting. Figure 3 shows curves for three choices of γ. For γ = 0.7, the probability of a weight Note that the weights w(xi) have already been computed, im- exceeding the value 10 is about 0.2. plying that additional computational cost is restricted to β. The effect of the control variate on the sampling variance of 10 4 the importance-weighted risk estimator can be seen via [7]:

2 VS [Rˆ ]= ES (Rˆ R ) β β − β 1 2 = ES ℓ (x, y) β(w(x) 1) RW 0 n w − − − 10 2 1 2 2 σβ = σ 2βCS[ℓ (x,y),w(x)]+β VS [w(x)] = , 0.7 0.75 0.8 0.85 0.9 0.95 1 n W − w n (11) 2 Fig. 4. Sampling variance of the target (yellow, σT ), where ℓw(x, y) = ℓ(hθ(x),y)w(x) is shorthand for the 2 the importance-weighted (blue, σW ) and the controlled weighted loss function given a ﬁxed classiﬁer h , and CS θ importance-weighted (green, σ2 ) risk estimators as a function refers to covariance. We can minimize this sampling variance β of γ in the example setting. for the scaling parameter β. Taking the derivative of (11) with respect to β and setting it to 0 yields:

∗ Importance-weighted cross-validation is a potential solu- β = CS ℓ (x, y), w(x) / VS w(x) . (12) w tion [4], but is not robust to large weights. The following Plugging β∗ back in simplifies the sampling variance to: subsections describe experiments that compare risk estimators during k-fold cross-validation, and evaluate their ability 2 2 C 2 V to find appropriate regularization parameters. σβ = σW S ℓw(x, y), w(x) / S w(x) . (13) − Considering that both the squared covariance term and the 5.1. Data variance term are non-negative, the sampling variance of Rˆβ ˆ 2 We consider a synthetic and a natural data set. In the synthetic is never larger than that of RW [9]. In particular, σβ can be 2 2 setting, the target data distribution is a unit bivariate normal alternatively formulated as σW (1 ρ ), where ρ denotes the − ([0, 0] I), while the source data distribution consists of correlationbetween the weighted loss and the weights [8]. Es- N | sentially, the more the weights correlate – positively or neg- a bivariate normal with a shifted mean and a scaled identity covariance matrix, ([ 1, 0] γI). The posterior distribution atively – with the weighted loss, the larger the reduction in N − | is of the form: p(y = 1 x1, x2)=1 Φ( [x1, x2]) variance. − | − − 2 and p(y = 1 x1, x2) = Φ( [x1, x2]). Figure 5 visualizes We computed σβ for the example setting and show it along- | − 2 2 2 the class-conditional distributions p(x y) for γ = 1/√2. side σ and σ in Figure 4. Note that σ also diverges as γ W T β Note that the source domain is a local sampling| of the larger shrinks, like σ2 , but does so at a slower rate. W target domain and that the nonlinear nature of the underlying decision boundary is not apparent. We draw data sets using 4.2. Regression estimator rejection sampling, with 50 source samples and 1000 target In practice, β will need to be estimated. Both the weight vari- samples. ance and the covariance between the weighted loss and the 3 3 weights can be estimated from data. In that case, Equation 12 2 2 becomes the solution to a least-squares problem [8]: 1 1 0 0 n n =1 ℓw(xi,yi) =1 ℓw(xi,yi) w(xi) 1 -1 -1 βˆ= i − i − . (14) P n P 2 -2 -2 i=1 w(xi) 1 -3 -3 − -4 -2 0 2 -4 -2 0 2 P However, β now depends on the same observed variable as the weighted loss and the weights, which is somewhat prob- Fig. 5. Synthetic problem, with γ =1/√2. Class-conditional lematic. It can be shown that the deviation of βˆ from β∗ drops distributions for source (left) and target domain (right). off on the order of O(n−1/2) (see Theorem 1 from [9]). This estimation error in β causes a bias in the risk estimator on the 1 The natural setting is derived from the ozone level de- order of O(n− ) [9]. tection data set from the UCI machine learning repository1. The task is to predict ozone days versus normal weather days, 5. IMPORTANCE-WEIGHTED CROSS-VALIDATION given various weather station measurements. All samples with missing values are removed, the data has been z-scored, Standard cross-validation will not yield optimal regularization parameter estimates under sample selection bias. 1https://archive.ics.uci.edu/ml/datasets ˆ ˆ and it has been projected onto its first 10 principal compo- there is nearly no difference between Rwˆ and Rβˆ. This indi- nents. The source domain is a local sampling in time: a cates that the addition of the control variate does not deterio- Gaussian centered at the start date with a standard deviation rate importance-weighted cross-validation in general. When that is a proportion γ of the total number of samples. For looking at the sets with large weight variance, it can be seen ˆ ˆ small values of γ, the source domain contains only samples that Rwˆ deteriorates strongly, while Rβˆ does not. This shows close to the start date. For large values, the source domain is that the addition of the control variate makes the importance- roughly equally spread over time. We draw 80 samples with- weighted risk estimator more robust to large weights. The ˆ ˆ ˆ ˆ out replacement from both classes to form the source data set. difference between Rw and Rβ for the data sets with large These 80 samples are also included in the target data, so as weight variance (dotted lines) is statistically significant for not to form a ’hole’ in feature space. each value of γ, with the largest p-valueon the orderof 10−30. For each setting, we perform 100 000 repetitions of sampling a data set. The top 10% of repetitions with the largest 0.8 weight variance are considered the data sets with problematically large weights. 0.75

5.2. Experimental setup 0.7

The source set is split into k = 5 folds, a classifier is trained 0.65 using a particular value for the regularization parameter λ 0.5 0.6 0.7 0.8 0.9 1 and its loss is computed on each data point in the held-out validation set. That loss is stored, along with those points’ Fig. 6. Mean target risks for the synthetic problem setting, as estimated weights. An importance-weighted L2-regularized a function of source variance γ. linear least-squares classifier is taken. Its parameters are estimated through θ = (X W X⊤ + λI)−1(X⊤W y ) where t λ t t t t t t Figure 7 presents the mean target risks in the Ozone prob- indicates training indices and W is a diagonal matrix with the lem setting. Note that the controlled estimator Rˆ ˆ is essen- estimated weights as its entries. For λ, we considered a range β tially unaffected by the large weights since it produces the from 10−3 to 106 over 200 logarithmic steps. same final risks in both the case of all data sets and the top Weights are estimated by fitting a normal distribution to 10% of largest weight variance. Both sets of risks are below data from each domain, and computing the ratio of the target that of the uncontrolled estimator Rˆ ˆ, which still deteriorates probability over the source probability of each source data w in the data sets with large weight variance. All differences point wˆ(xi)=ˆpT (xi)/pˆS (xi). We compare the importance- between ˆ ˆ and ˆ ˆ are statistically significant for all values ˆ Rw Rβ weighted risk estimator (Rwˆ) with its control variate coun- −50 ˆ of γ, with the largest p-value on the order of 10 . Lastly, terpart (Rβˆ). We average their estimated risks over all data sets and specifically over the 10% of data sets with the largest all risks increase slightly as γ grows, which is due to less data outliers around the start date. weight variance (indicated with ”>” in the legend of Figures 6, 7 and 8). We also include validation on the labeled target 0.9 samples (RˆT ) as the oracle solution. After risk estimation, 0.85 the λ is selected that minimized risk. The classifier is then re- trained using all source data and the selected λ, and evaluated 0.8 using the target risk based on the true target labels as the final measure. This process is repeated for each data set and we 0.75 report the final average as R¯T . 0.7 Repeating the above procedure for each data set allows 0.5 0.6 0.7 0.8 0.9 1 us to perform non-parametric hypothesis tests for statistically significant differences between the final risks of the estima- Fig. 7. Mean target risks for Ozone level detection setting, as tors. Since the data is paired and the estimators’ sampling a function of the relative scale of the source domain γ. distributions are skewed beyond normality, we employ a Wilcoxon signed-rank test [12]. 5.4. Weight estimators 5.3. Results The effect of the control variate is independent of the choice Figure 6 presents the final target risks for each risk estimator of importance-weight estimator. We illustrate this point by on the synthetic problem setting. Note that the errorbars in performing the same experiment on the synthetic setting us- the plot are too small to see. When looking at all data sets, ing Kernel Mean Matching (KMM) [13] and the Kullback- Leibler Importance Estimation Procedure (KLIEP) [14]. Fig- 7. REFERENCES ure 8 shows similar results as in Figure 6, except that the control variate even leads to better final target risks for the [1] B Zadrozny, “Learning and evaluating classifiers under average over all data sets (green lines are below blue lines). sample selection bias,” in International Conference on Machine learning, 2004, p. 114. 0.85 [2] C Cortes, M Mohri, M Riley, and A Rostamizadeh, “Sample selection bias correction theory,” in Algorith- 0.8 mic Learning Theory, 2008, pp. 38–53. [3] J Quionero-Candela, M Sugiyama, A Schwaighofer, and 0.75 ND Lawrence, Dataset shift in machine learning, MIT Press, 2009. 0.5 0.6 0.7 0.8 0.9 1 [4] M Sugiyama, M Krauledat, and K-R Müller, “Covari- 0.9 ate shift adaptation by importance weighted cross vali- 0.85 dation,” Journal of Machine Learning Research, vol. 8, 0.8 pp. 985–1005, 2007. 0.75 [5] WM Kouw and M Loog, “Effects of sampling skewness 0.7 of the importance-weighted risk estimator on model selection.,” in International Conference on Pattern Recog- 0.5 0.6 0.7 0.8 0.9 1 nition, 2018, pp. 1468–1473. [6] EL Ionides, “Truncated importance sampling,” Journal Fig. 8. Mean target risks for the synthetic problem setting, as of Computational and Graphical Statistics, vol. 17, no. a function of γ. (Top) KMM, (bottom) KLIEP. 2, pp. 295–311, 2008.

Some non-parametric weight estimators are formulated such [7] BL Nelson, “On control variate estimators.,” Computers that they naturally avoid large weights. However, this be- & OR, vol. 14, no. 3, pp. 219–225, 1987. havior depends on one or more hyperparameters, such as a [8] AB Owen, Monte Carlo theory, methods and examples, kernel bandwidth parameter. Unfortunately, finding an op- 2013. timal bandwidth parameter would require cross-validation, and consequently, such estimators are not suited to cross- [9] A Owen and Y Zhou, “Safe and effective importance validation procedures. One has to resort to heuristics, such sampling,” Journal of the American Statistical Associa- as Silverman’s rule or the average distance to k-nearest- tion, vol. 95, no. 449, pp. 135–143, 2000. neighbours. In the above experiment, we set the kernel band- [10] R Kohavi et al., “A study of cross-validation and boot- width in KMM and KLIEP to the average Euclidean distance strap for accuracy estimation and model selection,” in from the source points to their five nearest neighbours of the International Joint Conference on Artificial Intelligence, target points. 1995, vol. 14, pp. 1137–1145. [11] GC Cawley and NLC Talbot, “On over-fitting in model 6. CONCLUSION & DISCUSSION selection and subsequent selection bias in performance evaluation,” Journal of Machine Learning Research, We introduced a control variate to reduce the sampling vari- vol. 11, pp. 2079–2107, 2010. ance of the importance-weighted risk estimator. With its [12] WJ Conover, “Practical nonparametric statistics,” 1980. inclusion, the importance-weighted risk estimator is more robust large weight variance. Consequently, during k-fold [13] J Huang, AJ Smola, A Gretton, KM Borgwardt, cross-validation, it selects better hyperparameters than the B Schölkopf, et al., “Correcting sample selection bias uncontrolled importance-weighted risk estimator. by unlabeled data,” in Advances in Neural Information We have studied an additive linear control variate. Alter- Processing Systems, 2007, p. 601. natively, one could consider more complex control variates, [14] M Sugiyama, S Nakajima, H Kashima, PV Buenau, such as higher-order moments of the weights, or multiplica- and M Kawanabe, “Direct importance estimation with tive control variates [8]. If these have a stronger correlation model selection and its application to covariate shift with the weighted loss, they could lead to larger reductions adaptation,” in Advances in Neural Information Pro- in variance. However, there are a wide variety of possible cessing Systems, 2008, pp. 1433–1440. alternatives, and it is unclear how to search over these.