
Multiple-source cross-validation

Krzysztof J. Geras [email protected] Charles Sutton [email protected] School of Informatics, University of Edinburgh

Abstract

Cross-validation is an essential tool in machine learning and statistics. The typical procedure, in which data points are randomly assigned to one of the test sets, makes an implicit assumption that the data are exchangeable. A common case in which this does not hold is when the data come from multiple sources, in the sense used in transfer learning. In this case it is common to arrange the cross-validation procedure in a way that takes the source structure into account. Although common in practice, this procedure does not appear to have been theoretically analysed. We present new estimators of the variance of the cross-validation estimate, both in the multiple-source setting and in the standard iid setting. These new estimators allow for much more accurate confidence intervals and hypothesis tests to compare algorithms.

1. Introduction

Cross-validation is an essential tool in machine learning and statistics. The procedure estimates the expected error of a learning algorithm by running a training and testing procedure repeatedly on different partitions of the data. In the most common setting, data items are assigned to a test partition uniformly at random. This scheme is appropriate when the data are independent and identically distributed, but in modern applications this iid assumption often does not hold.

One common situation is when the data arise from multiple sources, each of which has a characteristic generating process. For example, in document classification, text that is produced by different authors, different organisations or of different genres will have different characteristics that affect classification (Blitzer et al., 2007; Craven et al., 1998). As another example, in biomedical data, such as EEG data (Mitchell et al., 2004), data items are associated with particular subjects, with large variation across subjects.

For data of this nature, a common procedure is to arrange the cross-validation procedure by source, rather than assigning the data points to test blocks randomly. The idea behind this procedure is to estimate the performance of the learning algorithm when it is faced with data that arise from a new source that has not occurred in the training data. We call this procedure multiple-source cross-validation. Although it is commonly used in applications, we are unaware of theoretical analysis of this cross-validation procedure.

This paper focuses on the estimate of the prediction error that arises from the multiple-source cross-validation procedure. We show that this procedure provides an unbiased estimate of the performance of the learning algorithm on a new source that was unseen in the training data, which is in contrast to the standard cross-validation procedure. We also analyse the variance of the estimate of the prediction error, inspired by the work of Bengio & Grandvalet (2003). Estimating the variance enables the construction of approximate confidence intervals on the prediction error, and hypothesis tests to compare learning algorithms. We find that estimators of the variance based on the standard cross-validation setting (Grandvalet & Bengio, 2006) perform extremely poorly in the multiple-source setting, in some cases even failing to be consistent if allowed infinite data. Instead, we propose a new family of estimators of the variance based on a simple characterisation of the space of possible biases that can be achieved by a class of reasonable estimators. This viewpoint yields a new estimator not only for the variance of the multiple-source cross-validation but for that of standard cross-validation as well. On a real-world text data set that is commonly used for studies in domain adaptation (Blitzer et al., 2007), we demonstrate that the new estimators are much more accurate than previous estimators.

2. Background

Cross-validation (CV) is a common method of assessing the performance of learning algorithms (Stone, 1974). Given a loss function L(y, ŷ) and a learning algorithm A that maps a data set D to a predictor A^D : x ↦ y, CV can be interpreted as a procedure to estimate the expected prediction error µ = EPE(A) = E[L(A^D(x), y)]. The expectation is taken over training sets D and test instances (x, y).

In this paper we focus on k-fold CV. We introduce some notation to compactly describe the procedure. Denote the training set by D = {(x_i, y_i)}_{i=1}^N and partition the data as D = D_1 ∪ ... ∪ D_K with |D_k| = M for all k. This means that N = KM. Let HO(A, D_1, D_2) denote the average loss incurred when training on set D_1 and testing on set D_2. In k-fold cross-validation, we first compute the average loss HO(A, D\D_k, D_k), when training A on D\D_k and testing on D_k, for each k ∈ {1, ..., K}. We average these to get the final error estimate

  µ̂ = CV(A, D_{1:K}) = (1/K) Σ_{k=1}^K HO(A, D\D_k, D_k).   (1)

If (D_1, ..., D_K) is a random partition of D, we will refer to this procedure as random cross-validation (CVR) to differentiate it from the multiple-source cross-validation (CVS) that is the principal focus of this paper.

Notice that every instance in D is a test instance exactly once. For (x_m, y_m) ∈ D_k, denote this test error by the random variable e_{km} = L(y_m, A^{D\D_k}(x_m)). Using this notation, µ̂ = (1/(KM)) Σ_k Σ_m e_{km}, so the CV estimate is a mean of correlated random variables. For CVR, the e_{km}'s have a special exchangeability structure: for each k, the sequence e_k = (e_{k1}, ..., e_{kM}) is exchangeable, and the sequence of vectors e = (e_1, ..., e_K) is also exchangeable.

Our goal will be to estimate the variance θ_CVR = V[µ̂_CVR] of this and related CV estimators. If the examples in D are iid, Bengio & Grandvalet (2003) show that the variance θ_CVR can be decomposed into a sum of three terms. The exchangeability structure of the e_{km} implies that there are only three distinct entries in their covariance matrix: σ², ω and γ, where

  V[e_{ki}] = σ²,            ∀i ∈ {1, ..., M}, ∀k ∈ {1, ..., K},
  Cov[e_{ki}, e_{kj}] = ω,   ∀i, j ∈ {1, ..., M}, i ≠ j, ∀k ∈ {1, ..., K},
  Cov[e_{ki}, e_{lj}] = γ,   ∀i, j ∈ {1, ..., M}, ∀k, l ∈ {1, ..., K}, k ≠ l.

Applying the formula for the variance of a sum of correlated variables yields

  θ_CVR = V[µ̂_CVR] = (1/(KM)) σ² + ((M−1)/(KM)) ω + ((K−1)/K) γ.   (2)

This decomposition can be used to show that an unbiased estimator of θ_CVR does not exist. The reasoning behind this is the following. Because θ_CVR has only second order terms in the e_{ki}'s, so would an unbiased estimator. Thus, if there existed such an estimator, it would be of the form θ̂ = Σ_{k,l} Σ_{i,j} w_{kl} e_{ki} e_{lj}. The coefficients w_{kl} would be found by equating the coefficients of σ², ω and γ in E[θ̂] and in θ_CVR. Unfortunately, the resulting system of equations has no solution, unless some assumptions about σ², ω or γ are made. In later work, Grandvalet & Bengio (2006) suggested several biased estimators of θ_CVR based on this variance decomposition and simplifying assumptions on σ², ω and γ. These are described in Table 2.
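To make the notation concrete, the following sketch computes the error variables e_{km} and the CVR estimate of (1) for a scikit-learn-style classifier. The function and variable names are ours, not the paper's, and a zero-one loss is assumed purely for illustration.

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    # pointwise loss L(y, y_hat); any vectorised loss could be substituted
    return (y_true != y_pred).astype(float)

def cvr_errors(A, X, y, K, loss=zero_one_loss, seed=None):
    """Random K-fold CV (CVR): return the K x M matrix of test errors e_{km}.

    A is a factory returning a fresh estimator with fit/predict; the last
    N mod K points are dropped so that N = K * M exactly, as in the paper.
    """
    rng = np.random.default_rng(seed)
    M = len(y) // K
    blocks = rng.permutation(len(y))[:K * M].reshape(K, M)   # D_1, ..., D_K
    e = np.empty((K, M))
    for k in range(K):
        train = np.concatenate([blocks[j] for j in range(K) if j != k])
        model = A().fit(X[train], y[train])
        e[k] = loss(y[blocks[k]], model.predict(X[blocks[k]]))
    return e

# mu_hat_CVR of eq. (1) is simply e.mean() for e = cvr_errors(...).
```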

3. Multiple-source cross-validation

In many practical problems, the data arise from a number of different sources that have different generating processes. Examples of sources include different genres in document classification, or different patients in a biomedical domain. In these cases, often the primary interest is in the performance of a classifier on a new source, rather than over only the sources in the training data. This is essentially the same setting used in domain adaptation and transfer learning, except that we are interested in estimating the error of a prediction procedure rather than in developing the prediction procedure itself.

This situation can be modelled by a simple hierarchical generative process. We assume that each source k ∈ {1, ..., K} has a set of parameters β_k that define its generative process and that the parameters β_1, ..., β_K for each source are iid according to an unknown distribution. We assume that the source of each datum in the training set is known, i.e., each training example is of the form (y_m, x_m, k_m), where k_m is the index of the source of data item m. The data are then modelled as arising from a distribution p(y_m, x_m | β_{k_m}) that is different for each source.

The goal of cross-validation in this setting is to estimate the out-of-source error, i.e., the error on data that arises from a new source that does not occur in the training data. This error is

  OSE = E[L(A^D(x), y)],   (3)

where the expectation is taken with respect to training sets D and testing examples (y, x) that are drawn from a new source, that is, p(y, x) = ∫ p(y, x | β) p(β) dβ.

In the multiple-source setting, it is intuitively obvious that the standard cross-validation estimator of (1) is inappropriate. (We shall make this intuition precise in the next section.) Instead it is common practice to modify the CV procedure so that no source is split across test blocks, a procedure that we will call multiple-source cross-validation. Let S_k denote the set of all training points that were sampled from source k. Multiple-source cross-validation works exactly as standard CV, except that we partition the data as D = S_1 ∪ ... ∪ S_K instead of assigning the training instances to blocks randomly. The resulting estimate of the error, which we denote by CVS, is

  µ̂_CVS = CVS(A, S_{1:K}) = (1/K) Σ_{k=1}^K HO(A, D\S_k, S_k).

The idea behind this procedure is that it accounts for the fact that we expect to be surprised by the new source in the test data, by using the information in the training data to simulate the effect of predicting on an unseen source. Although this estimator is commonly used in practice, we are unaware of previous work concerning its asymptotic or finite sample behaviour.
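A matching sketch of the CVS estimate defined above; here sources holds the source index k_m of every training example. Again the names are our own, and this is a sketch rather than the authors' code.

```python
import numpy as np

def cvs_errors(A, X, y, sources, loss):
    """Multiple-source CV (CVS): hold out one whole source S_k at a time.

    Returns a dict mapping source k to its vector of test errors e_{ki}.
    """
    errors = {}
    for k in np.unique(sources):
        test = np.flatnonzero(sources == k)      # S_k
        train = np.flatnonzero(sources != k)     # D \ S_k
        model = A().fit(X[train], y[train])
        errors[k] = loss(y[test], model.predict(X[test]))
    return errors

def mu_hat_cvs(errors):
    # average of the K held-out-source mean losses, as in the display above
    return float(np.mean([e.mean() for e in errors.values()]))
```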

4. Analysis of multiple-source cross-validation

We give the mean (Section 4.1) and variance (Section 4.2) of the CVS estimator. The variance has an analogous decomposition to the CVR estimator, but the individual terms in the CVS variance behave differently from those in CVR, in a way that has a large impact on effective estimators of the variance. Finally, we present novel estimators of the variance of the CVS error estimate (Section 4.2).

4.1. Mean of the error estimate

First, we consider the expected value of both the CVS and the CVR error estimates, in the case where the data was in fact generated from the multiple-source process described in the previous section. The idea is to make clear what µ̂_CVS is estimating, and to provide a more careful justification for the common use of µ̂_CVS on multiple-source data.

Let D = S_1 ∪ ... ∪ S_K. Then it is easy to show using the exchangeability of S_1, ..., S_K that the expected value of the estimate µ̂_CVS is

  E[µ̂_CVS] = E_{(S_{1:K})}[HO(A, S_{1:(K−1)}, S_K)]
           = E_{(S_{1:(K−1)}, X, Y)}[L(A^{S_{1:(K−1)}}(X), Y)],   (4)

where this notation indicates that the expectation is taken over S_{1:(K−1)}, X and Y. This is the expected error when the algorithm A is trained with data coming from K − 1 sources and the test point is going to come from a newly sampled source. This is the out-of-source error (3), but on a slightly smaller training set than D.

On the other hand, consider the CVR estimate on the same data set D. Let D_1 ∪ ... ∪ D_K = D be a random partition of D with |D_k| = M for all k. The same symmetry argument which was used in (4) yields

  E[µ̂_CVR] = E_{(D_{1:K})}[HO(A, D_{1:(K−1)}, D_K)]
           = E_{(D_{1:(K−1)}, X, Y)}[L(A^{D_{1:(K−1)}}(X), Y)],   (5)

which is formally similar to the above, but has the crucial difference that D_{1:(K−1)}, X, Y have a different joint distribution. Here X and Y are drawn from the same family of K sources as the training data D_{1:(K−1)}. So E[µ̂_CVR] is the expected error of an algorithm trained with M(K − 1) data items from K sources, when the test point arises from one of the training sources.

Neither (4) nor (5) is exactly the same as the out-of-source error (3) that we want to estimate. Looking at the above, we see that µ̂_CVS is biased for (3) because µ̂_CVS uses slightly fewer training sources than OSE. On the other hand, µ̂_CVR is biased for OSE because the training sets are slightly smaller and also because in µ̂_CVR the training and test data are drawn from the same set of sources. However, if it is important to have a conservative estimate of the error, µ̂_CVS has the advantage that its bias will tend to be positive, based on the expected effect of a smaller training set, whereas the bias of µ̂_CVR will tend to be negative. (We verify this intuition experimentally in Section 6.)

4.2. Variance of the error estimate

Now we consider the variance of µ̂_CVS. The key difference between this case and the CVR case is that the error variables e_{ki} and e_{lj} are no longer identically distributed if k ≠ l, as the corresponding data points arise from different sources. There is still a block structure to the covariance matrix of e, but it is more complex. Namely, the error variables have identical covariance structure within each source but not across sources. More formally,

  V[e_{ki}] = σ²_k,             ∀i ∈ {1, ..., M},
  Cov[e_{ki}, e_{kj}] = ω_k,    ∀i, j ∈ {1, ..., M}, i ≠ j,   (6)
  Cov[e_{ki}, e_{lj}] = γ_{kl}, ∀i, j ∈ {1, ..., M}, k ≠ l.

This covariance structure follows from the generating process described in Section 3, and is depicted graphically in Figure 1. The means of the error variables have an analogous structure, that is, E[e_{ki}] = E[e_{kj}] := µ_k, which follows because the data points within each source are exchangeable.

Figure 1. Diagram of the covariance matrix of the error variables for CVS. Blocks of the same colour represent blocks of variables with identical covariance.

This allows us to obtain a decomposition of the variance θ_CVS = V[µ̂_CVS] of the CVS error estimate. This decomposition is

  θ_CVS = (1/(K²M)) Σ_k σ²_k + ((M−1)/(K²M)) Σ_k ω_k + (1/K²) Σ_{k≠l} γ_{kl},   (7)

which again follows because θ_CVS is the variance of a sum of correlated random variables. Notice the difference to the decomposition of θ_CVR in (2).

In the rest of this section we consider various estimates of θ_CVS. As θ_CVS depends quadratically on the variables e_{kl}, we will restrict our attention to estimators of the form θ̂ = Σ_{k,l} Σ_{i,j} w_{kl} e_{ki} e_{lj}. That is, the estimator is a quadratic function of the variables e, and as the error variables e_k within a single source are exchangeable, we require such variables to be weighted equally. We refer to an estimator of this form as quadratic.

Every quadratic estimator can be written as a function of the empirical second moments of the error variables. If θ̂ is quadratic,

  θ̂ = Σ_k a_k s^{σ²}_k + Σ_k b_k s^ω_k + Σ_{k≠l} c_{kl} s^γ_{kl},

where the s^{σ²}_k, s^ω_k and s^γ_{kl} are the empirical moments

  s^{σ²}_k = (1/M) Σ_i e²_{ki},
  s^ω_k = (1/(M(M−1))) Σ_{i≠j} e_{ki} e_{kj},
  s^γ_{kl} = (1/M²) Σ_{i,j} e_{ki} e_{lj}.
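When every source contributes M points, the per-source error vectors can be stacked into a K × M array and the three families of empirical moments computed in closed form; a sketch (the helper name is ours, not the paper's):

```python
import numpy as np

def empirical_moments(e):
    """e: K x M array of CVS errors e_{ki} (balanced sources assumed).

    Returns (s_sigma2, s_omega, s_gamma) where
      s_sigma2[k]  = (1/M) sum_i e_{ki}^2
      s_omega[k]   = (1/(M(M-1))) sum_{i != j} e_{ki} e_{kj}
      s_gamma[k,l] = (1/M^2) sum_{i,j} e_{ki} e_{lj}, with the diagonal zeroed
                     since only the k != l entries are ever used.
    """
    K, M = e.shape
    row_sum = e.sum(axis=1)
    sq_sum = (e ** 2).sum(axis=1)
    s_sigma2 = sq_sum / M
    s_omega = (row_sum ** 2 - sq_sum) / (M * (M - 1))
    s_gamma = np.outer(row_sum, row_sum) / M ** 2
    np.fill_diagonal(s_gamma, 0.0)
    return s_sigma2, s_omega, s_gamma
```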

4.2.1. No Unbiased Estimator

Using the decomposition (7), we can now follow the same reasoning as Bengio & Grandvalet (2003) to show that there is no unbiased estimator of θ_CVS. First, because θ_CVS = (1/(KM)²) Σ_{k,l} Σ_{i,j} Cov[e_{ki}, e_{lj}], an unbiased estimator must also be quadratic. The expectation of a quadratic estimator has the form

  E[θ̂] = Σ_k a_k (σ²_k + µ²_k) + Σ_k b_k (ω_k + µ²_k) + Σ_{k≠l} c_{kl} (γ_{kl} + µ_k µ_l).   (8)

To get E[θ̂] = θ_CVS, we need to match the coefficients in the equations (7) and (8), including the coefficients of the µ²_k and µ_k µ_l terms, which need to equal zero, since there clearly exist distributions such that µ_k µ_l > 0. This yields the system of equations

  a_k = 1/(K²M),  b_k = (M−1)/(K²M),  a_k + b_k = 0,  c_{kl} = 1/K²,  c_{kl} = 0.   (9)

Clearly, these equations are unsatisfiable, so no unbiased estimator of θ_CVS exists.

4.2.2. Naive Estimators

We can derive new estimators of the variance θ_CVS by following the reasoning used in the standard CV setting by Grandvalet & Bengio (2006). Unfortunately, as we will see, the resulting estimators perform poorly.

The idea is, rather than attempting to define an estimator that is unbiased for all data distributions, to instead define an estimator that is unbiased for a restricted class of data distributions, defined by assumptions on µ_k, σ²_k, ω_k and γ_{kl}.

First, we restrict our attention to cases in which the mean prediction error across sources is the same, i.e., µ_k = µ_l := µ. A quadratic estimator which is unbiased for this class of data distributions must satisfy the equations

  a_k = 1/(K²M),  b_k = (M−1)/(K²M),  c_{kl} = 1/K²,
  Σ_k a_k + Σ_k b_k + Σ_{k≠l} c_{kl} = 0.   (10)

These equations also have no solution. However, if we further assume (following the reasoning of Grandvalet & Bengio (2006), even though it is unlikely to be a good assumption) that one of the sources k̃ has ω_k̃ = 0, then we no longer have to match the value of the coefficient b_k̃, removing one of the constraints. Solving the remaining equations yields

  a_k = 1/(K²M),  c_{kl} = 1/K²,
  b_k̃ = (M−1)/(K²M) − 1,  b_k = (M−1)/(K²M), ∀k ≠ k̃.

The corresponding estimator and its bias are shown in the first line of Table 1. Similarly, instead of assuming ω_k̃ = 0, we could restrict ourselves to having a single γ_{k̃l̃} = 0, or to having all of the ω_k's or γ_{kl}'s equal zero. This yields the remaining estimators in Table 1.

These estimators are appealingly simple, but, unfortunately, they have serious problems. The main problem is that their biases depend on the mean error µ_k of the sources. This is often an order of magnitude larger than the true variance θ_CVS that we are trying to predict. In particular, θ̂^γ, which is an analog of θ̂_3, the preferred estimator in the original CVR setting in the work of Grandvalet & Bengio (2006), is poor in this regard. Unlike the true variance, it does not even converge to 0 as N → ∞, because the second term (1/K²)(Σ_k µ²_k − Σ_{k≠l} µ_k µ_l/(K−1)) in its bias does not converge to 0 in general.

The end result is that we can attempt to follow a similar strategy as in standard cross-validation to estimate the error variance, but doing so leads to extremely poor estimators. Instead, we will introduce new estimators that are specific to the multiple-source cross-validation setting.

4.2.3. Design of new estimators

Instead of the naive estimators from the previous section, we can derive better estimators by taking a different perspective. Instead of trying to design estimators that are unbiased for a restrictive set of scenarios, we consider the asymptotic behaviour of the bias decomposition of the quadratic variance estimator.

In the last section we saw that every quadratic estimator θ̂ has a bias of the form

  E[θ̂ − θ_CVS] = Σ_k (a_k − 1/(K²M)) σ²_k + Σ_k (b_k − (M−1)/(K²M)) ω_k
               + Σ_{k≠l} (c_{kl} − 1/K²) γ_{kl} + Σ_k (a_k + b_k) µ²_k + Σ_{k≠l} c_{kl} µ_k µ_l.

Choosing the coefficients a_k, b_k, c_{kl} uniquely determines an estimator θ̂. Let us consider the relative magnitude of the terms in the bias decomposition. The error means µ²_k are usually a few orders of magnitude larger than (1/(K²M)) σ²_k, ((M−1)/(K²M)) ω_k and (1/K²) γ_{kl} as long as M is not too small. (We check this intuition experimentally in Section 6.) Therefore it seems sensible to require that all of the µ_k terms vanish. This implies that a_k + b_k = 0 and c_{kl} = 0 for all k, l.

Next, usually σ²_k is larger than ω_k or γ_{kl}, which suggests choosing a_k = (K²M)^{−1} so that the σ²_k term vanishes as well. This yields the estimator

  θ̂_A = (1/(K²M)) Σ_k (s^{σ²}_k − s^ω_k).   (11)

The bias of this estimator is

  E[θ̂_A − θ_CVS] = −(1/K²) Σ_k ω_k − (1/K²) Σ_{k≠l} γ_{kl}.

However, we can do better than this. Instead of requiring that the σ²_k term vanish, which is a very stringent requirement, we could instead require its coefficient to be (KN)^{−1} = (K²M)^{−1}, so that it becomes negligible for large N. This amounts to the requirement that a_k = 2(K²M)^{−1}, which results in the estimator

  θ̂_B = (2/(K²M)) Σ_k (s^{σ²}_k − s^ω_k).   (12)

This estimator is especially appealing because of the form of its bias, which is

  E[θ̂_B − θ_CVS] = (1/(K²M)) (Σ_k σ²_k − (M+1) Σ_k ω_k − M Σ_{k≠l} γ_{kl}).

Now the σ²_k's and ω_k's are positive, and in practice the γ_{kl}'s are almost always positive if M is not small. So what is appealing about this estimator is that the three terms have differing signs. In many situations, the difference between the first term and the second two will be of smaller magnitude than any of the three terms alone, causing θ̂_B to be significantly less biased than the estimators in Table 1. We show this experimentally in Section 6.
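With the moments in hand, the two estimators (11) and (12) are one-liners; a sketch under the same balanced-source assumption as the earlier snippet:

```python
def theta_hat_A(s_sigma2, s_omega, K, M):
    # eq. (11): (1/(K^2 M)) * sum_k (s^{sigma^2}_k - s^omega_k)
    return float((s_sigma2 - s_omega).sum() / (K ** 2 * M))

def theta_hat_B(s_sigma2, s_omega, K, M):
    # eq. (12): the same difference of moments, with coefficient 2/(K^2 M)
    return 2.0 * theta_hat_A(s_sigma2, s_omega, K, M)
```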

Table 1. Naive estimators of θ_CVS coming from solutions of the simplified system of equations (10).

(i) Unbiased if ω_k̃ = 0:
    θ̂^ω_k̃ = (1/(K²M)) Σ_k s^{σ²}_k + ((M−1)/(K²M)) Σ_k s^ω_k + (1/K²) Σ_{k≠l} s^γ_{kl} − s^ω_k̃
    Bias: −ω_k̃ + (1/K²) Σ_{k,l} µ_k µ_l − µ²_k̃

(ii) Unbiased if γ_{k̃l̃} = 0:
    θ̂^γ_{k̃,l̃} = (1/(K²M)) Σ_k s^{σ²}_k + ((M−1)/(K²M)) Σ_k s^ω_k + (1/K²) Σ_{k≠l} s^γ_{kl} − (s^γ_{k̃l̃} + s^γ_{l̃k̃})/2
    Bias: −γ_{k̃l̃} + (1/K²) Σ_{k,l} µ_k µ_l − µ_l̃ µ_k̃

(iii) Unbiased if ω_k = 0 for all k:
    θ̂^ω = (1/(K²M)) Σ_k s^{σ²}_k + ((M−1−KM)/(K²M)) Σ_k s^ω_k + (1/K²) Σ_{k≠l} s^γ_{kl}
    Bias: −(1/K) Σ_k ω_k + ((1−K)/K²) (Σ_k µ²_k − Σ_{k≠l} µ_k µ_l/(K−1))

(iv) Unbiased if γ_{kl} = 0 for all k, l:
    θ̂^γ = (1/(K²M)) Σ_k s^{σ²}_k + ((M−1)/(K²M)) Σ_k s^ω_k − (1/(K²(K−1))) Σ_{k≠l} s^γ_{kl}
    Bias: −(1/(K(K−1))) Σ_{k≠l} γ_{kl} + (1/K²) (Σ_k µ²_k − Σ_{k≠l} µ_k µ_l/(K−1))
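For comparison, the naive estimator θ̂^γ from the last row of Table 1 (the analogue of θ̂_3, derived under the assumption that all γ_{kl} = 0) can be computed from the same moments; a sketch:

```python
def theta_hat_gamma(s_sigma2, s_omega, s_gamma, K, M):
    # Row (iv) of Table 1:
    #   (1/(K^2 M)) sum_k s^{sigma^2}_k + ((M-1)/(K^2 M)) sum_k s^omega_k
    #     - (1/(K^2 (K-1))) sum_{k != l} s^gamma_{kl}
    cross = s_gamma.sum()        # diagonal already zeroed in empirical_moments
    return float(s_sigma2.sum() / (K ** 2 * M)
                 + (M - 1) * s_omega.sum() / (K ** 2 * M)
                 - cross / (K ** 2 * (K - 1)))
```

On multiple-source data its bias contains the non-vanishing µ_k terms discussed in Section 4.2.2, which is what the comparison in Section 6 illustrates.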

Table 2. Estimators of θ for CVR suggested by Grandvalet & Bengio (2006).

(i) Unbiased if µ = 0:               θ̂_1 = (1/N) s_1 + ((M−1)/N) s_2 + ((N−M)/N) s_3;   Bias: µ²
(ii) Unbiased if ω = 0:              θ̂_2 = (1/N) s_1 − ((N+1−M)/N) s_2 + ((N−M)/N) s_3;  Bias: −ω
(iii) Unbiased if γ = 0:             θ̂_3 = (1/N) s_1 + ((M−1)/N) s_2 − (M/N) s_3;        Bias: −γ
(iv) Unbiased if γ = −(M/(N−M)) ω:   θ̂_4 = (1/N) s_1 − (1/N) s_2;                        Bias: −(M/N) ω − ((N−M)/N) γ

5. New estimators of θ for CVR

It is actually possible to apply the same viewpoint from the previous section in order to provide new estimators for the variance of the standard cross-validation procedure. Previously known estimators of the variance θ_CVR (Table 2) were designed to be unbiased for a subclass of generating processes. By considering the bias decomposition directly, as in the previous section, we can design better estimators.

First, we give a proposition that describes the space of possible biases for quadratic estimators of θ_CVR.

Proposition 1. Let θ̂ = Σ_{k,l} Σ_{i,j} w_{kl} e_{ki} e_{lj} be a quadratic estimator. Then the bias E[θ̂ − θ_CVR] has the form α_1 σ² + α_2 ω + α_3 γ + α_4 µ². Also α_1 + α_2 + α_3 − α_4 = −1. Conversely, for every α_1, α_2, α_3, α_4 such that α_1 + α_2 + α_3 − α_4 = −1, there exists a quadratic estimator θ̂ with bias

  E[θ̂ − θ_CVR] = α_1 σ² + α_2 ω + α_3 γ + α_4 µ².

Proof. For a quadratic estimator θ̂, let a = M Σ_k w_{kk}, let b = M(M−1) Σ_k w_{kk} and let c = M² Σ_k Σ_{l≠k} w_{kl}. Taking expectations, we have

  E[θ̂] = a σ² + b ω + c γ + (a + b + c) µ².

Therefore the bias has the form

  E[θ̂ − θ_CVR] = (a − 1/(KM)) σ² + (b − (M−1)/(KM)) ω + (c − (K−1)/K) γ + (a + b + c) µ².

So the bias has the required form, with α_1 + α_2 + α_3 − α_4 = −1. Conversely, let α_1 + α_2 + α_3 − α_4 = −1. Then we can obtain an estimator θ̂ by setting a = α_1 + 1/(KM), b = α_2 + (M−1)/(KM), and c = α_3 + (K−1)/K, so that α_4 = a + b + c.

All of the existing estimators of θ_CVR (Table 2) have similar biases in the sense that the coefficients α_1, α_2, and α_3 are non-positive. But from the previous proposition, we know that there is a much larger set of coefficients available.

To design a new estimator, we observe that both in the results of Grandvalet & Bengio (2006) and in our own results in Section 6, typically ω > γ and ω and γ are of similar magnitude. Therefore an estimator with bias of ω − 2γ will have smaller bias than the estimators from Table 2. Applying the previous result, this bias is achieved by the estimator

  θ̂_5 = (1/N) s_1 + ((N+M−1)/N) s_2 − ((N+M)/N) s_3,

where s_1, s_2 and s_3 are the empirical moments

  s_1 = (1/K) Σ_k s^{σ²}_k,  s_2 = (1/K) Σ_k s^ω_k,  s_3 = (1/(K(K−1))) Σ_{k≠l} s^γ_{kl}.

In the next section we show that its performance is superior in practice to the best previous estimator, θ̂_3, given by Grandvalet & Bengio (2006).
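In the CVR setting the moments s_1, s_2, s_3 and the resulting estimators can be computed directly from the K × M error matrix of the random folds. A sketch of θ̂_3 (Table 2) and the new θ̂_5, assuming N = KM; the function name is ours:

```python
import numpy as np

def cvr_variance_estimators(e):
    """e: K x M array of CVR errors. Returns (theta_hat_3, theta_hat_5)."""
    K, M = e.shape
    N = K * M
    row_sum = e.sum(axis=1)
    sq_sum = (e ** 2).sum()
    s1 = sq_sum / N                                             # (1/K) sum_k s^{sigma^2}_k
    s2 = ((row_sum ** 2).sum() - sq_sum) / (N * (M - 1))        # within-block products
    s3 = (e.sum() ** 2 - (row_sum ** 2).sum()) / (N * (N - M))  # cross-block products
    theta_3 = s1 / N + (M - 1) * s2 / N - M * s3 / N            # Table 2, "gamma = 0" row
    theta_5 = s1 / N + (N + M - 1) * s2 / N - (N + M) * s3 / N  # new estimator above
    return theta_3, theta_5
```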

6. Experiments

We evaluate the usefulness of µ̂_CVR and µ̂_CVS as estimators of out-of-source error, and the estimators of θ_CVR and θ_CVS, on a data set of product reviews from Amazon (Blitzer et al., 2007), which is frequently used as a benchmark data set in domain adaptation. The data contains reviews of products from 25 diverse domains corresponding to high-level categories on Amazon.com. The goal is to classify whether a review is positive or negative based on the review text. We take each product domain as being a separate source.

We experiment with the version of the data set which contains ten domains, each with 1000 positive and 1000 negative examples. We will use CV to estimate the prediction error of a simple naive Bayes classifier. (We have replicated these results with an SVM with a linear kernel.)

6.1. Bias and variance

First we measure the bias of µ̂_CVR and µ̂_CVS and compare them to the out-of-source error. To estimate this, we average over the set of training domains, the set of training instances, the test domain and the test instances. To get this, we first sample without replacement a given number of domains, keeping all of them but one as training domains and using the remaining one as a test domain. Given this, we sample 100 pairs consisting of training and test data sets, drawing data points from the empirical distribution of the respective domains. Having these, we run CVR and CVS on the training domains, comparing the cross-validation estimate to the prediction on the out-of-source test set. Figure 2 shows this comparison as a function of the number of training sources K (left panel) and the number of points M per source (right panel).
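A compressed sketch of this protocol, reusing cvr_errors and cvs_errors from the earlier snippets; the data layout (a list of per-domain (X, y) pairs), the bootstrap resampling and the helper names are our own assumptions, not the paper's code:

```python
import numpy as np

def bias_experiment(domains, A, loss, n_sources=7,
                    n_domain_draws=200, n_resamples=100, seed=0):
    """domains: list of (X, y) arrays, one per source.

    Each draw picks n_sources domains without replacement, holds one out as
    the out-of-source test set, resamples the rest, and records the gap
    between the CVR / CVS estimates and the held-out error.
    """
    rng = np.random.default_rng(seed)

    def boot(X, y):                 # bootstrap a domain's empirical distribution
        idx = rng.integers(0, len(y), size=len(y))
        return X[idx], y[idx]

    gap_cvr, gap_cvs = [], []
    for _ in range(n_domain_draws):
        chosen = rng.choice(len(domains), size=n_sources, replace=False)
        train_doms, test_dom = chosen[:-1], chosen[-1]
        for _ in range(n_resamples):
            parts = [boot(*domains[d]) for d in train_doms]
            X = np.concatenate([p[0] for p in parts])
            y = np.concatenate([p[1] for p in parts])
            src = np.concatenate([np.full(len(p[1]), k) for k, p in enumerate(parts)])
            X_te, y_te = boot(*domains[test_dom])
            oos = loss(y_te, A().fit(X, y).predict(X_te)).mean()
            gap_cvr.append(cvr_errors(A, X, y, K=len(parts), loss=loss).mean() - oos)
            cvs = cvs_errors(A, X, y, src, loss)
            gap_cvs.append(np.mean([e.mean() for e in cvs.values()]) - oos)
    return float(np.mean(gap_cvr)), float(np.mean(gap_cvs))
```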

Figure 2. Values of µ̂_CVR and µ̂_CVS averaged over draws of training and test domains, compared to the true out-of-source error. Both plots were generated drawing domains 200 times.

Figure 3. Standard deviation of µ̂_CVR and µ̂_CVS averaged over draws of training and test domains. The experiment was done drawing domains 200 times.

Figure 4. Decomposition of θ_CVR (right panel) and θ_CVS (left panel).

Figure 5. V[e] for different M. The σ²_k's are typically a few orders of magnitude greater and were omitted here.

Figure 6. Comparison of estimators of θ_CVR (right panel) and θ_CVS (left panel).

It can be clearly seen that CVS yields a better estimate of the out-of-source error than CVR. It is worth noting what happens when the number of training domains or the number of training data items gets larger. While CVS converges to the true out-of-source error, CVR converges to a different value. This confirms the analysis from Section 4.1. From these results, we can see that CVR yields an optimistic estimate of out-of-source error, even though the training set in each iteration of CV is smaller than the full training set. This is in contrast to CVS, which yields a more desirable pessimistic estimate.

Figure 3 shows the standard deviation of µ̂_CVS and µ̂_CVR averaged over the choice of training domains. To get this estimate, we have performed a similar procedure to Figure 2, i.e., for each draw of training domains, we sample 100 data sets, drawing data points from the empirical distribution of the respective domains. Although the CVS estimate has higher variance, the standard deviations are of the same order of magnitude and both tend to 0 for large data sets.

6.2. Variance decompositions and estimators of variance

In this section, we evaluate our new estimators of the variances θ_CVR and θ_CVS. To obtain these results, we used all domains as training domains. Estimating both θ_CVR and θ_CVS unbiasedly requires more than one independently sampled data set. To get them, we sample 1000 data sets, sampling from the empirical distributions of each domain.

In the first experiment we estimated the components of the decomposition of θ_CVR and θ_CVS (Figure 4). The quantities (1/(KM)) σ², ((M−1)/(KM)) ω, ((K−1)/K) γ and, corresponding to them, (1/(K²M)) Σ_k σ²_k, ((M−1)/(K²M)) Σ_k ω_k, (1/K²) Σ_{k≠l} γ_{kl} have different magnitudes. Notice that ((M−1)/(KM)) ω is much larger than ((M−1)/(K²M)) Σ_k ω_k. It is not a surprise that CVS yields a strong positive correlation between errors made on points belonging to the same test block, which is not observed in CVR, where the test blocks are chosen randomly.

Secondly, we looked at V[e] to see how much the ω_k's and γ_{kl}'s vary (Figure 5). It can be seen that the variation within the γ_{kl}'s diminishes when M gets larger, but the variation within the ω_k's does not change much.

Finally, we have tested the estimators of θ_CVR and θ_CVS which we have suggested in the earlier sections. The results are in Figure 6. For CVS, we compare θ̂^γ, θ̂_A and θ̂_B. As expected, θ̂^γ does not converge to 0 and grossly overestimates θ_CVS for large values of M. The new estimator θ̂_A is closer to θ_CVS than θ̂^γ for large values of M, but its bias is consistently optimistic. On the other hand, for small M, θ̂_B is not more optimistic than θ̂^γ, and for large M, while θ̂^γ becomes very pessimistic, θ̂_B has a negligible bias. Similarly, for the standard CVR setting, our new estimator θ̂_5 has lower bias than the previous estimators suggested by Grandvalet & Bengio (2006), being almost unbiased even for small M.

7. Related work

Various different versions of cross-validation have been analysed previously (Bengio & Grandvalet, 2003; Arlot & Celisse, 2010; Nadeau & Bengio, 2003; Markatou et al., 2005), but to our knowledge multiple-source cross-validation has not been previously analysed.

The idea of learning from a number of sources dates back to Caruana (1997) and Thrun (1996). A related issue in the context of covariate shift was suggested by Sugiyama et al. (2007); however, the resulting importance weights are difficult to estimate in practice (but see Gretton et al. (2009) for some work in this direction).

Rakotomalala et al. (2006) investigate the multiple-source cross-validation procedure empirically, but they do not perform theoretical analysis or present any estimators of θ_CVS. Work in domain adaptation has also considered the problem of bounding the prediction error when the training and test distribution have a different source (Ben-David et al., 2010). Unfortunately, the resulting bounds are too loose to be used for confidence intervals.

8. Conclusions

We have considered a cross-validation procedure for the multiple-source setting. We show that the bias of this procedure is better suited to this setting. We have presented several new estimators of the variance of the error estimate, both for the multiple-source cross-validation procedure and for the standard cross-validation setting, which perform well empirically.

Acknowledgments

The authors would like to thank Amos Storkey, Vittorio Ferrari, Simon Rogers and the anonymous reviewers for insightful comments on this work. This work was supported by the Scottish Informatics and Computer Science Alliance (SICSA).

References

Arlot, S. and Celisse, A. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010.

Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Wortman Vaughan, J. A theory of learning from different domains. Machine Learning, 79:151–175, 2010.

Bengio, Y. and Grandvalet, Y. No unbiased estimator of the variance of k-fold cross-validation. Journal of Machine Learning Research, 5:1089–1105, 2003.

Blitzer, J., Dredze, M., and Pereira, F. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the Association for Computational Linguistics, 2007.

Caruana, R. Multi-task learning. Machine Learning, 28:41–75, 1997.

Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., and Slattery, S. Learning to extract symbolic knowledge from the world wide web. In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98), 1998.

Grandvalet, Y. and Bengio, Y. Hypothesis testing for cross-validation. Technical Report 1285, Département d'informatique et recherche opérationnelle, Université de Montréal, 2006.

Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., and Schölkopf, B. Covariate shift by kernel mean matching. In Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. (eds.), Dataset Shift in Machine Learning, pp. 131–160. The MIT Press, 2009.

Markatou, M., Tian, H., Biswas, S., and Hripcsak, G. Analysis of variance of cross-validation estimators of the generalization error. Journal of Machine Learning Research, 6:1127–1168, 2005.

Mitchell, T., Hutchinson, R., Niculescu, R., Pereira, F., Wang, X., Just, M., and Newman, S. Learning to decode cognitive states from brain images. Machine Learning, 57(1):145–175, 2004.

Nadeau, C. and Bengio, Y. Inference for the generalization error. Machine Learning, 52:239–281, 2003.

Rakotomalala, R., Chauchat, J.-H., and Pellegrino, F. Accuracy estimation with clustered dataset. In Proceedings of the Australasian Conference on Data Mining and Analytics, 2006.

Stone, M. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B (Methodological), 36(2):111–147, 1974.

Sugiyama, M., Krauledat, M., and Müller, K.-R. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985–1005, 2007.

Thrun, S. Is learning the n-th thing any easier than learning the first? In Advances in Neural Information Processing Systems, 1996.