
Multiple-source cross-validation

Krzysztof J. Geras [email protected] Charles Sutton [email protected] School of Informatics, University of Edinburgh

Abstract

Cross-validation is an essential tool in machine learning and statistics. The typical procedure, in which data points are randomly assigned to one of the test sets, makes an implicit assumption that the data are exchangeable. A common case in which this does not hold is when the data come from multiple sources, in the sense used in transfer learning. In this case it is common to arrange the cross-validation procedure in a way that takes the source structure into account. Although common in practice, this procedure does not appear to have been theoretically analysed. We present new estimators of the variance of the cross-validation estimate, both in the multiple-source setting and in the standard iid setting. These new estimators allow for much more accurate confidence intervals and hypothesis tests to compare algorithms.

1. Introduction

Cross-validation is an essential tool in machine learning and statistics. The procedure estimates the expected error of a learning algorithm by running a training and testing procedure repeatedly on different partitions of the data. In the most common setting, data items are assigned to a test partition uniformly at random. This scheme is appropriate when the data are independent and identically distributed, but in modern applications this iid assumption often does not hold.

One common situation is when the data arise from multiple sources, each of which has a characteristic generating process. For example, in document classification, text that is produced by different authors, different organisations or of different genres will have different characteristics that affect classification (Blitzer et al., 2007; Craven et al., 1998). As another example, in biomedical data, such as EEG data (Mitchell et al., 2004), data items are associated with particular subjects, with large variation across subjects.

For data of this nature, a common procedure is to arrange the cross-validation procedure by source, rather than assigning the data points to test blocks randomly. The idea behind this procedure is to estimate the performance of the learning algorithm when it is faced with data that arise from a new source that has not occurred in the training data. We call this procedure multiple-source cross-validation. Although it is commonly used in applications, we are unaware of theoretical analysis of this cross-validation procedure.

This paper focuses on the estimate of the prediction error that arises from the multiple-source cross-validation procedure. We show that this procedure provides an unbiased estimate of the performance of the learning algorithm on a new source that was unseen in the training data, which is in contrast to the standard cross-validation procedure. We also analyse the variance of the estimate of the prediction error, inspired by the work of Bengio & Grandvalet (2003). Estimating the variance enables the construction of approximate confidence intervals on the prediction error, and hypothesis tests to compare learning algorithms. We find that estimators of the variance based on the standard cross-validation setting (Grandvalet & Bengio, 2006) perform extremely poorly in the multiple-source setting, in some cases even failing to be consistent if allowed infinite data. Instead, we propose a new family of estimators of the variance based on a simple characterisation of the space of possible biases that can be achieved by a class of reasonable estimators. This viewpoint yields a new estimator not only for the variance of the multiple-source cross-validation but for that of standard cross-validation as well. On a real-world text data set that is commonly used for studies in domain adaptation (Blitzer et al., 2007), we demonstrate that the new estimators are much more accurate than previous estimators.

2. Background

Cross-validation (CV) is a common method of assessing the performance of learning algorithms (Stone, 1974). Given a loss function L(y, ŷ) and a learning algorithm A that maps a data set D to a predictor A^D : x ↦ y, CV can be interpreted as a procedure to estimate the expected prediction error µ = EPE(A) = E[L(A^D(x), y)]. The expectation is taken over training sets D and test instances (x, y).

In this paper we focus on k-fold CV. We introduce some notation to compactly describe the procedure. Denote the training set by D = {(x_i, y_i)}_{i=1}^N and partition the data as D = D_1 ∪ ... ∪ D_K with |D_k| = M for all k. This means that N = KM. Let HO(A, D_1, D_2) denote the average loss incurred when training on set D_1 and testing on set D_2. In k-fold cross-validation, we first compute the average loss HO(A, D\D_k, D_k), when training A on D\D_k and testing on D_k, for each k ∈ {1, ..., K}. We average these to get the final error estimate

  µ̂ = CV(A, D_{1:K}) = (1/K) Σ_{k=1}^K HO(A, D\D_k, D_k).   (1)

If (D_1, ..., D_K) is a random partition of D, we will refer to this procedure as random cross-validation (CVR) to differentiate it from the multiple-source cross-validation (CVS) that is the principal focus of this paper.

Notice that every instance in D is a test instance exactly once. For (x_m, y_m) ∈ D_k, denote this test error by the random variable e_{km} = L(y_m, A^{D\D_k}(x_m)). Using this notation, µ̂ = (1/(KM)) Σ_k Σ_m e_{km}, so the CV estimate is a mean of correlated random variables. For CVR, the e_{km}'s have a special exchangeability structure: for each k, the sequence e_k = (e_{k1}, ..., e_{kM}) is exchangeable, and the sequence of vectors e = (e_1, ..., e_K) is also exchangeable.

Our goal will be to estimate the variance θ_CVR = V[µ̂_CVR] of this and related CV estimators. If the examples in D are iid, Bengio & Grandvalet (2003) show that the variance θ_CVR can be decomposed into a sum of three terms. The exchangeability structure of the e_{km} implies that there are only three distinct entries in their covariance matrix: σ², ω and γ, where

  V[e_{ki}] = σ²,            ∀i ∈ {1, ..., M}, ∀k ∈ {1, ..., K},
  Cov[e_{ki}, e_{kj}] = ω,   ∀i, j ∈ {1, ..., M}, i ≠ j, ∀k ∈ {1, ..., K},
  Cov[e_{ki}, e_{lj}] = γ,   ∀i, j ∈ {1, ..., M}, ∀k, l ∈ {1, ..., K}, k ≠ l.

Applying the formula for the variance of a sum of correlated variables yields

  θ_CVR = V[µ̂_CVR] = (1/(KM)) σ² + ((M−1)/(KM)) ω + ((K−1)/K) γ.   (2)

This decomposition can be used to show that an unbiased estimator of θ_CVR does not exist. The reasoning behind this is the following. Because θ_CVR has only second order terms in the e_{ki}'s, so would an unbiased estimator. Thus, if there existed such an estimator, it would be of the form θ̂ = Σ_{k,l} Σ_{i,j} w_{kl} e_{ki} e_{lj}. The coefficients w_{kl} would be found by equating the coefficients of σ², ω and γ in E[θ̂] and in θ_CVR. Unfortunately, the resulting system of equations has no solution, unless some assumptions about σ², ω or γ are made. In later work, Grandvalet & Bengio (2006) suggested several biased estimators of θ_CVR based on this variance decomposition and simplifying assumptions on σ², ω and γ. These are described in Table 2.
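To make the notation concrete, the following sketch computes the error variables e_{km} and the CVR estimate of (1) for a scikit-learn-style classifier. The function and variable names are ours, not the paper's, and a zero-one loss is assumed purely for illustration.

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    # pointwise loss L(y, y_hat); any vectorised loss could be substituted
    return (y_true != y_pred).astype(float)

def cvr_errors(A, X, y, K, loss=zero_one_loss, seed=None):
    """Random K-fold CV (CVR): return the K x M matrix of test errors e_{km}.

    A is a factory returning a fresh estimator with fit/predict; the last
    N mod K points are dropped so that N = K * M exactly, as in the paper.
    """
    rng = np.random.default_rng(seed)
    M = len(y) // K
    blocks = rng.permutation(len(y))[:K * M].reshape(K, M)   # D_1, ..., D_K
    e = np.empty((K, M))
    for k in range(K):
        train = np.concatenate([blocks[j] for j in range(K) if j != k])
        model = A().fit(X[train], y[train])
        e[k] = loss(y[blocks[k]], model.predict(X[blocks[k]]))
    return e

# mu_hat_CVR of eq. (1) is simply e.mean() for e = cvr_errors(...).
```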

3. Multiple-source cross-validation

In many practical problems, the data arise from a number of different sources that have different generating processes. Examples of sources include different genres in document classification, or different patients in a biomedical domain. In these cases, often the primary interest is in the performance of a classifier on a new source, rather than over only the sources in the training data. This is essentially the same setting used in domain adaptation and transfer learning, except that we are interested in estimating the error of a prediction procedure rather than in developing the prediction procedure itself.

This situation can be modelled by a simple hierarchical generative process. We assume that each source k ∈ {1, ..., K} has a set of parameters β_k that define its generative process and that the parameters β_1, ..., β_K for each source are iid according to an unknown distribution. We assume that the source of each datum in the training set is known, i.e., each training example is of the form (y_m, x_m, k_m), where k_m is the index of the source of data item m. The data are then modelled as arising from a distribution p(y_m, x_m | β_{k_m}) that is different for each source.

The goal of cross-validation in this setting is to estimate the out-of-source error, i.e., the error on data that arises from a new source that does not occur in the training data. This error is

  OSE = E[L(A^D(x), y)],   (3)

where the expectation is taken with respect to training sets D and testing examples (y, x) that are drawn from a new source, that is, p(y, x) = ∫ p(y, x | β) p(β) dβ.

In the multiple-source setting, it is intuitively obvious that the standard cross-validation estimator of (1) is inappropriate. (We shall make this intuition precise in the next section.) Instead it is common practice to modify the CV procedure so that no source is split across test blocks, a procedure that we will call multiple-source cross-validation. Let S_k denote the set of all training points that were sampled from source k. Multiple-source cross-validation works exactly as standard CV, except that we partition the data as D = S_1 ∪ ... ∪ S_K instead of assigning the training instances to blocks randomly. The resulting estimate of the error, which we denote by CVS, is

  µ̂_CVS = CVS(A, S_{1:K}) = (1/K) Σ_{k=1}^K HO(A, D\S_k, S_k).

The idea behind this procedure is that it accounts for the fact that we expect to be surprised by the new source in the test data, by using the information in the training data to simulate the effect of predicting on an unseen source. Although this estimator is commonly used in practice, we are unaware of previous work concerning its asymptotic or finite sample behaviour.
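A matching sketch of the CVS estimate defined above; here sources holds the source index k_m of every training example. Again the names are our own, and this is a sketch rather than the authors' code.

```python
import numpy as np

def cvs_errors(A, X, y, sources, loss):
    """Multiple-source CV (CVS): hold out one whole source S_k at a time.

    Returns a dict mapping source k to its vector of test errors e_{ki}.
    """
    errors = {}
    for k in np.unique(sources):
        test = np.flatnonzero(sources == k)      # S_k
        train = np.flatnonzero(sources != k)     # D \ S_k
        model = A().fit(X[train], y[train])
        errors[k] = loss(y[test], model.predict(X[test]))
    return errors

def mu_hat_cvs(errors):
    # average of the K held-out-source mean losses, as in the display above
    return float(np.mean([e.mean() for e in errors.values()]))
```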

4. Analysis of multiple-source cross-validation

We give the mean (Section 4.1) and variance (Section 4.2) of the CVS estimator. The variance has an analogous decomposition to the CVR estimator, but the individual terms in the CVS variance behave differently from those in CVR, in a way that has a large impact on effective estimators of the variance. Finally, we present novel estimators of the variance of the CVS error estimate (Section 4.2).

4.1. Mean of the error estimate

First, we consider the expected value of both the CVS and the CVR error estimates, in the case where the data was in fact generated from the multiple-source process described in the previous section. The idea is to make clear what µ̂_CVS is estimating, and to provide a more careful justification for the common use of µ̂_CVS on multiple-source data.

Let D = S_1 ∪ ... ∪ S_K. Then it is easy to show using the exchangeability of S_1, ..., S_K that the expected value of the estimate µ̂_CVS is

  E[µ̂_CVS] = E_{(S_{1:K})}[HO(A, S_{1:(K−1)}, S_K)]
           = E_{(S_{1:(K−1)}, X, Y)}[L(A^{S_{1:(K−1)}}(X), Y)],   (4)

where this notation indicates that the expectation is taken over S_{1:(K−1)}, X and Y. This is the expected error when the algorithm A is trained with data coming from K − 1 sources and the test point is going to come from a newly sampled source. This is the out-of-source error (3), but on a slightly smaller training set than D.

On the other hand, consider the CVR estimate on the same data set D. Let D_1 ∪ ... ∪ D_K = D be a random partition of D with |D_k| = M for all k. The same symmetry argument which was used in (4) yields

  E[µ̂_CVR] = E_{(D_{1:K})}[HO(A, D_{1:(K−1)}, D_K)]
           = E_{(D_{1:(K−1)}, X, Y)}[L(A^{D_{1:(K−1)}}(X), Y)],   (5)

which is formally similar to the above, but has the crucial difference that D_{1:(K−1)}, X, Y have a different joint distribution. Here X and Y are drawn from the same family of K sources as the training data D_{1:(K−1)}. So E[µ̂_CVR] is the expected error of an algorithm trained with M(K − 1) data items from K sources, when the test point arises from one of the training sources.

Neither (4) nor (5) is exactly the same as the out-of-source error (3) that we want to estimate. Looking at the above, we see that µ̂_CVS is biased for (3) because µ̂_CVS uses slightly fewer training sources than OSE. On the other hand, µ̂_CVR is biased for OSE because the training sets are slightly smaller and also because in µ̂_CVR the training and test data are drawn from the same set of sources. However, if it is important to have a conservative estimate of the error, µ̂_CVS has the advantage that its bias will tend to be positive, based on the expected effect of a smaller training set, whereas the bias of µ̂_CVR will tend to be negative. (We verify this intuition experimentally in Section 6.)

4.2. Variance of the error estimate

Now we consider the variance of µ̂_CVS. The key difference between this case and the CVR case is that the error variables e_{ki} and e_{lj} are no longer identically distributed if k ≠ l, as the corresponding data points arise from different sources. There is still a block structure to the covariance matrix of e, but it is more complex. Namely, the error variables have identical covariance structure within each source but not across sources. More formally,

  V[e_{ki}] = σ²_k,             ∀i ∈ {1, ..., M},
  Cov[e_{ki}, e_{kj}] = ω_k,    ∀i, j ∈ {1, ..., M}, i ≠ j,   (6)
  Cov[e_{ki}, e_{lj}] = γ_{kl}, ∀i, j ∈ {1, ..., M}, k ≠ l.

This covariance structure follows from the generating process described in Section 3, and is depicted graphically in Figure 1. The means of the error variables have an analogous structure, that is, E[e_{ki}] = E[e_{kj}] := µ_k, which follows because the data points within each source are exchangeable.

Figure 1. Diagram of the covariance matrix of the error variables for CVS. Blocks of the same colour represent blocks of variables with identical covariance.

This allows us to obtain a decomposition of the variance θ_CVS = V[µ̂_CVS] of the CVS error estimate. This decomposition is

  θ_CVS = (1/(K²M)) Σ_k σ²_k + ((M−1)/(K²M)) Σ_k ω_k + (1/K²) Σ_{k≠l} γ_{kl},   (7)

which again follows because θ_CVS is the variance of a sum of correlated random variables. Notice the difference to the decomposition of θ_CVR in (2).

In the rest of this section we consider various estimates of θ_CVS. As θ_CVS depends quadratically on the variables e_{kl}, we will restrict our attention to estimators of the form θ̂ = Σ_{k,l} Σ_{i,j} w_{kl} e_{ki} e_{lj}. That is, the estimator is a quadratic function of the variables e, and as the error variables e_k within a single source are exchangeable, we require such variables to be weighted equally. We refer to an estimator of this form as quadratic.

Every quadratic estimator can be written as a function of the empirical second moments of the error variables. If θ̂ is quadratic,

  θ̂ = Σ_k a_k s^{σ²}_k + Σ_k b_k s^ω_k + Σ_{k≠l} c_{kl} s^γ_{kl},

where the s^{σ²}_k, s^ω_k and s^γ_{kl} are the empirical moments

  s^{σ²}_k = (1/M) Σ_i e²_{ki},
  s^ω_k = (1/(M(M−1))) Σ_{i≠j} e_{ki} e_{kj},
  s^γ_{kl} = (1/M²) Σ_{i,j} e_{ki} e_{lj}.
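When every source contributes M points, the per-source error vectors can be stacked into a K × M array and the three families of empirical moments computed in closed form; a sketch (the helper name is ours, not the paper's):

```python
import numpy as np

def empirical_moments(e):
    """e: K x M array of CVS errors e_{ki} (balanced sources assumed).

    Returns (s_sigma2, s_omega, s_gamma) where
      s_sigma2[k]  = (1/M) sum_i e_{ki}^2
      s_omega[k]   = (1/(M(M-1))) sum_{i != j} e_{ki} e_{kj}
      s_gamma[k,l] = (1/M^2) sum_{i,j} e_{ki} e_{lj}, with the diagonal zeroed
                     since only the k != l entries are ever used.
    """
    K, M = e.shape
    row_sum = e.sum(axis=1)
    sq_sum = (e ** 2).sum(axis=1)
    s_sigma2 = sq_sum / M
    s_omega = (row_sum ** 2 - sq_sum) / (M * (M - 1))
    s_gamma = np.outer(row_sum, row_sum) / M ** 2
    np.fill_diagonal(s_gamma, 0.0)
    return s_sigma2, s_omega, s_gamma
```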

4.2.1. No Unbiased Estimator

Using the decomposition (7), we can now follow the same reasoning as Bengio & Grandvalet (2003) to show that there is no unbiased estimator of θ_CVS. First, because θ_CVS = (1/(KM)²) Σ_{k,l} Σ_{i,j} Cov[e_{ki}, e_{lj}], an unbiased estimator must also be quadratic. The expectation of a quadratic estimator has the form

  E[θ̂] = Σ_k a_k (σ²_k + µ²_k) + Σ_k b_k (ω_k + µ²_k) + Σ_{k≠l} c_{kl} (γ_{kl} + µ_k µ_l).   (8)

To get E[θ̂] = θ_CVS, we need to match the coefficients in the equations (7) and (8), including the coefficients of the µ²_k and µ_k µ_l terms, which need to equal zero, since there clearly exist distributions such that µ_k µ_l > 0. This yields the system of equations

  a_k = 1/(K²M),  b_k = (M−1)/(K²M),  a_k + b_k = 0,  c_{kl} = 1/K²,  c_{kl} = 0.   (9)

Clearly, these equations are unsatisfiable, so no unbiased estimator of θ_CVS exists.

4.2.2. Naive Estimators

We can derive new estimators of the variance θ_CVS by following the reasoning used in the standard CV setting by Grandvalet & Bengio (2006). Unfortunately, as we will see, the resulting estimators perform poorly.

The idea is, rather than attempting to define an estimator that is unbiased for all data distributions, to instead define an estimator that is unbiased for a restricted class of data distributions, defined by assumptions on µ_k, σ²_k, ω_k and γ_{kl}.

First, we restrict our attention to cases in which the mean prediction error across sources is the same, i.e., µ_k = µ_l := µ. A quadratic estimator which is unbiased for this class of data distributions must satisfy the equations

  a_k = 1/(K²M),  b_k = (M−1)/(K²M),  c_{kl} = 1/K²,
  Σ_k a_k + Σ_k b_k + Σ_{k≠l} c_{kl} = 0.   (10)

These equations also have no solution. However, if we further assume (following the reasoning of Grandvalet & Bengio (2006), even though it is unlikely to be a good assumption) that one of the sources k̃ has ω_k̃ = 0, then we no longer have to match the value of the coefficient b_k̃, removing one of the constraints. Solving the remaining equations yields

  a_k = 1/(K²M),  c_{kl} = 1/K²,
  b_k̃ = (M−1)/(K²M) − 1,  b_k = (M−1)/(K²M), ∀k ≠ k̃.

The corresponding estimator and its bias are shown in the first line of Table 1. Similarly, instead of assuming ω_k̃ = 0, we could restrict ourselves to having a single γ_{k̃l̃} = 0, or to having all of the ω_k's or γ_{kl}'s equal zero. This yields the remaining estimators in Table 1.

These estimators are appealingly simple, but, unfortunately, they have serious problems. The main problem is that their biases depend on the mean error µ_k of the sources. This is often an order of magnitude larger than the true variance θ_CVS that we are trying to predict. In particular, θ̂^γ, which is an analog of θ̂_3, the preferred estimator in the original CVR setting in the work of Grandvalet & Bengio (2006), is poor in this regard. Unlike the true variance, it does not even converge to 0 as N → ∞, because the second term (1/K²)(Σ_k µ²_k − Σ_{k≠l} µ_k µ_l/(K−1)) in its bias does not converge to 0 in general.

The end result is that we can attempt to follow a similar strategy as in standard cross-validation to estimate the error variance, but doing so leads to extremely poor estimators. Instead, we will introduce new estimators that are specific to the multiple-source cross-validation setting.

4.2.3. Design of new estimators

Instead of the naive estimators from the previous section, we can derive better estimators by taking a different perspective. Instead of trying to design estimators that are unbiased for a restrictive set of scenarios, we consider the asymptotic behaviour of the bias decomposition of the quadratic variance estimator.

In the last section we saw that every quadratic estimator θ̂ has a bias of the form

  E[θ̂ − θ_CVS] = Σ_k (a_k − 1/(K²M)) σ²_k + Σ_k (b_k − (M−1)/(K²M)) ω_k
               + Σ_{k≠l} (c_{kl} − 1/K²) γ_{kl} + Σ_k (a_k + b_k) µ²_k + Σ_{k≠l} c_{kl} µ_k µ_l.

Choosing the coefficients a_k, b_k, c_{kl} uniquely determines an estimator θ̂. Let us consider the relative magnitude of the terms in the bias decomposition. The error means µ²_k are usually a few orders of magnitude larger than (1/(K²M)) σ²_k, ((M−1)/(K²M)) ω_k and (1/K²) γ_{kl} as long as M is not too small. (We check this intuition experimentally in Section 6.) Therefore it seems sensible to require that all of the µ_k terms vanish. This implies that a_k + b_k = 0 and c_{kl} = 0 for all k, l.

Next, usually σ²_k is larger than ω_k or γ_{kl}, which suggests choosing a_k = (K²M)^{−1} so that the σ²_k term vanishes as well. This yields the estimator

  θ̂_A = (1/(K²M)) Σ_k (s^{σ²}_k − s^ω_k).   (11)

The bias of this estimator is

  E[θ̂_A − θ_CVS] = −(1/K²) Σ_k ω_k − (1/K²) Σ_{k≠l} γ_{kl}.

However, we can do better than this. Instead of requiring that the σ²_k term vanish, which is a very stringent requirement, we could instead require its coefficient to be (KN)^{−1} = (K²M)^{−1}, so that it becomes negligible for large N. This amounts to the requirement that a_k = 2(K²M)^{−1}, which results in the estimator

  θ̂_B = (2/(K²M)) Σ_k (s^{σ²}_k − s^ω_k).   (12)

This estimator is especially appealing because of the form of its bias, which is

  E[θ̂_B − θ_CVS] = (1/(K²M)) (Σ_k σ²_k − (M+1) Σ_k ω_k − M Σ_{k≠l} γ_{kl}).

Now the σ²_k's and ω_k's are positive, and in practice the γ_{kl}'s are almost always positive if M is not small. So what is appealing about this estimator is that the three terms have differing signs. In many situations, the difference between the first term and the second two will be of smaller magnitude than any of the three terms alone, causing θ̂_B to be significantly less biased than the estimators in Table 1. We show this experimentally in Section 6.
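With the moments in hand, the two estimators (11) and (12) are one-liners; a sketch under the same balanced-source assumption as the earlier snippet:

```python
def theta_hat_A(s_sigma2, s_omega, K, M):
    # eq. (11): (1/(K^2 M)) * sum_k (s^{sigma^2}_k - s^omega_k)
    return float((s_sigma2 - s_omega).sum() / (K ** 2 * M))

def theta_hat_B(s_sigma2, s_omega, K, M):
    # eq. (12): the same difference of moments, with coefficient 2/(K^2 M)
    return 2.0 * theta_hat_A(s_sigma2, s_omega, K, M)
```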

Table 1. Naive estimators of θ_CVS coming from solutions of the simplified system of equations (10).

(i) Unbiased if ω_k̃ = 0:
    θ̂^ω_k̃ = (1/(K²M)) Σ_k s^{σ²}_k + ((M−1)/(K²M)) Σ_k s^ω_k + (1/K²) Σ_{k≠l} s^γ_{kl} − s^ω_k̃
    Bias: −ω_k̃ + (1/K²) Σ_{k,l} µ_k µ_l − µ²_k̃

(ii) Unbiased if γ_{k̃l̃} = 0:
    θ̂^γ_{k̃,l̃} = (1/(K²M)) Σ_k s^{σ²}_k + ((M−1)/(K²M)) Σ_k s^ω_k + (1/K²) Σ_{k≠l} s^γ_{kl} − (s^γ_{k̃l̃} + s^γ_{l̃k̃})/2
    Bias: −γ_{k̃l̃} + (1/K²) Σ_{k,l} µ_k µ_l − µ_l̃ µ_k̃

(iii) Unbiased if ω_k = 0 for all k:
    θ̂^ω = (1/(K²M)) Σ_k s^{σ²}_k + ((M−1−KM)/(K²M)) Σ_k s^ω_k + (1/K²) Σ_{k≠l} s^γ_{kl}
    Bias: −(1/K) Σ_k ω_k + ((1−K)/K²) (Σ_k µ²_k − Σ_{k≠l} µ_k µ_l/(K−1))

(iv) Unbiased if γ_{kl} = 0 for all k, l:
    θ̂^γ = (1/(K²M)) Σ_k s^{σ²}_k + ((M−1)/(K²M)) Σ_k s^ω_k − (1/(K²(K−1))) Σ_{k≠l} s^γ_{kl}
    Bias: −(1/(K(K−1))) Σ_{k≠l} γ_{kl} + (1/K²) (Σ_k µ²_k − Σ_{k≠l} µ_k µ_l/(K−1))
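For comparison, the naive estimator θ̂^γ from the last row of Table 1 (the analogue of θ̂_3, derived under the assumption that all γ_{kl} = 0) can be computed from the same moments; a sketch:

```python
def theta_hat_gamma(s_sigma2, s_omega, s_gamma, K, M):
    # Row (iv) of Table 1:
    #   (1/(K^2 M)) sum_k s^{sigma^2}_k + ((M-1)/(K^2 M)) sum_k s^omega_k
    #     - (1/(K^2 (K-1))) sum_{k != l} s^gamma_{kl}
    cross = s_gamma.sum()        # diagonal already zeroed in empirical_moments
    return float(s_sigma2.sum() / (K ** 2 * M)
                 + (M - 1) * s_omega.sum() / (K ** 2 * M)
                 - cross / (K ** 2 * (K - 1)))
```

On multiple-source data its bias contains the non-vanishing µ_k terms discussed in Section 4.2.2, which is what the comparison in Section 6 illustrates.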

Table 2. Estimators of θ for CVR suggested by Grandvalet & Bengio (2006).

(i) Unbiased if µ = 0:               θ̂_1 = (1/N) s_1 + ((M−1)/N) s_2 + ((N−M)/N) s_3;   Bias: µ²
(ii) Unbiased if ω = 0:              θ̂_2 = (1/N) s_1 − ((N+1−M)/N) s_2 + ((N−M)/N) s_3;  Bias: −ω
(iii) Unbiased if γ = 0:             θ̂_3 = (1/N) s_1 + ((M−1)/N) s_2 − (M/N) s_3;        Bias: −γ
(iv) Unbiased if γ = −(M/(N−M)) ω:   θ̂_4 = (1/N) s_1 − (1/N) s_2;                        Bias: −(M/N) ω − ((N−M)/N) γ

5. New estimators of θ for CVR

It is actually possible to apply the same viewpoint from the previous section in order to provide new estimators for the variance of the standard cross-validation procedure. Previously known estimators of the variance θ_CVR (Table 2) were designed to be unbiased for a subclass of generating processes. By considering the bias decomposition directly, as in the previous section, we can design better estimators.

First, we give a proposition that describes the space of possible biases for quadratic estimators of θ_CVR.

Proposition 1. Let θ̂ = Σ_{k,l} Σ_{i,j} w_{kl} e_{ki} e_{lj} be a quadratic estimator. Then the bias E[θ̂ − θ_CVR] has the form α_1 σ² + α_2 ω + α_3 γ + α_4 µ². Also α_1 + α_2 + α_3 − α_4 = −1. Conversely, for every α_1, α_2, α_3, α_4 such that α_1 + α_2 + α_3 − α_4 = −1, there exists a quadratic estimator θ̂ with bias

  E[θ̂ − θ_CVR] = α_1 σ² + α_2 ω + α_3 γ + α_4 µ².

Proof. For a quadratic estimator θ̂, let a = M Σ_k w_{kk}, let b = M(M−1) Σ_k w_{kk} and let c = M² Σ_k Σ_{l≠k} w_{kl}. Taking expectations, we have

  E[θ̂] = a σ² + b ω + c γ + (a + b + c) µ².

Therefore the bias has the form

  E[θ̂ − θ_CVR] = (a − 1/(KM)) σ² + (b − (M−1)/(KM)) ω + (c − (K−1)/K) γ + (a + b + c) µ².

So the bias has the required form, with α_1 + α_2 + α_3 − α_4 = −1. Conversely, let α_1 + α_2 + α_3 − α_4 = −1. Then we can obtain an estimator θ̂ by setting a = α_1 + 1/(KM), b = α_2 + (M−1)/(KM), and c = α_3 + (K−1)/K, so that α_4 = a + b + c.

All of the existing estimators of θ_CVR (Table 2) have similar biases in the sense that the coefficients α_1, α_2, and α_3 are non-positive. But from the previous proposition, we know that there is a much larger set of coefficients available.

To design a new estimator, we observe that both in the results of Grandvalet & Bengio (2006) and in our own results in Section 6, typically ω > γ and ω and γ are of similar magnitude. Therefore an estimator with bias of ω − 2γ will have smaller bias than the estimators from Table 2. Applying the previous result, this bias is achieved by the estimator

  θ̂_5 = (1/N) s_1 + ((N+M−1)/N) s_2 − ((N+M)/N) s_3,

where s_1, s_2 and s_3 are the empirical moments

  s_1 = (1/K) Σ_k s^{σ²}_k,  s_2 = (1/K) Σ_k s^ω_k,  s_3 = (1/(K(K−1))) Σ_{k≠l} s^γ_{kl}.

In the next section we show that its performance is superior in practice to the best previous estimator, θ̂_3, given by Grandvalet & Bengio (2006).
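In the CVR setting the moments s_1, s_2, s_3 and the resulting estimators can be computed directly from the K × M error matrix of the random folds. A sketch of θ̂_3 (Table 2) and the new θ̂_5, assuming N = KM; the function name is ours:

```python
import numpy as np

def cvr_variance_estimators(e):
    """e: K x M array of CVR errors. Returns (theta_hat_3, theta_hat_5)."""
    K, M = e.shape
    N = K * M
    row_sum = e.sum(axis=1)
    sq_sum = (e ** 2).sum()
    s1 = sq_sum / N                                             # (1/K) sum_k s^{sigma^2}_k
    s2 = ((row_sum ** 2).sum() - sq_sum) / (N * (M - 1))        # within-block products
    s3 = (e.sum() ** 2 - (row_sum ** 2).sum()) / (N * (N - M))  # cross-block products
    theta_3 = s1 / N + (M - 1) * s2 / N - M * s3 / N            # Table 2, "gamma = 0" row
    theta_5 = s1 / N + (N + M - 1) * s2 / N - (N + M) * s3 / N  # new estimator above
    return theta_3, theta_5
```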

6. Experiments

We evaluate the usefulness of µ̂_CVR and µ̂_CVS as estimators of out-of-source error, and the estimators of θ_CVR and θ_CVS, on a data set of product reviews from Amazon (Blitzer et al., 2007), which is frequently used as a benchmark data set in domain adaptation. The data contains reviews of products from 25 diverse domains corresponding to high-level categories on Amazon.com. The goal is to classify whether a review is positive or negative based on the review text. We take each product domain as being a separate source.

We experiment with the version of the data set which contains ten domains, each with 1000 positive and 1000 negative examples. We will use CV to estimate the prediction error of a simple naive Bayes classifier. (We have replicated these results with an SVM with a linear kernel.)

6.1. Bias and variance

First we measure the bias of µ̂_CVR and µ̂_CVS and compare them to the out-of-source error. To estimate this, we average over the set of training domains, the set of training instances, the test domain and the test instances. To get this, we first sample without replacement a given number of domains, keeping all of them but one as training domains and using the remaining one as a test domain. Given this, we sample 100 pairs consisting of training and test data sets, drawing data points from the empirical distribution of the respective domains. Having these, we run CVR and CVS on the training domains, comparing the cross-validation estimate to the prediction on the out-of-source test set. Figure 2 shows this comparison as a function of the number of training sources K (left panel) and the number of points M per source (right panel).
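A compressed sketch of this protocol, reusing cvr_errors and cvs_errors from the earlier snippets; the data layout (a list of per-domain (X, y) pairs), the bootstrap resampling and the helper names are our own assumptions, not the paper's code:

```python
import numpy as np

def bias_experiment(domains, A, loss, n_sources=7,
                    n_domain_draws=200, n_resamples=100, seed=0):
    """domains: list of (X, y) arrays, one per source.

    Each draw picks n_sources domains without replacement, holds one out as
    the out-of-source test set, resamples the rest, and records the gap
    between the CVR / CVS estimates and the held-out error.
    """
    rng = np.random.default_rng(seed)

    def boot(X, y):                 # bootstrap a domain's empirical distribution
        idx = rng.integers(0, len(y), size=len(y))
        return X[idx], y[idx]

    gap_cvr, gap_cvs = [], []
    for _ in range(n_domain_draws):
        chosen = rng.choice(len(domains), size=n_sources, replace=False)
        train_doms, test_dom = chosen[:-1], chosen[-1]
        for _ in range(n_resamples):
            parts = [boot(*domains[d]) for d in train_doms]
            X = np.concatenate([p[0] for p in parts])
            y = np.concatenate([p[1] for p in parts])
            src = np.concatenate([np.full(len(p[1]), k) for k, p in enumerate(parts)])
            X_te, y_te = boot(*domains[test_dom])
            oos = loss(y_te, A().fit(X, y).predict(X_te)).mean()
            gap_cvr.append(cvr_errors(A, X, y, K=len(parts), loss=loss).mean() - oos)
            cvs = cvs_errors(A, X, y, src, loss)
            gap_cvs.append(np.mean([e.mean() for e in cvs.values()]) - oos)
    return float(np.mean(gap_cvr)), float(np.mean(gap_cvs))
```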

Figure 2. Values of µ̂_CVR and µ̂_CVS averaged over draws of training and test domains, compared to the true out-of-source error. Both plots were generated drawing domains 200 times.

Figure 3. Standard deviation of µ̂_CVR and µ̂_CVS averaged over draws of training and test domains. The experiment was done drawing domains 200 times.

Figure 4. Decomposition of θ_CVR (right panel) and θ_CVS (left panel).

Figure 5. V[e] for different M. The σ²_k's are typically a few orders of magnitude greater and were omitted here.

Figure 6. Comparison of estimators of θ_CVR (right panel) and θ_CVS (left panel).

It can be clearly seen that CVS yields a better estimate of the out-of-source error than CVR. It is worth noting what happens when the number of training domains or the number of training data items gets larger. While CVS converges to the true out-of-source error, CVR converges to a different value. This confirms the analysis from Section 4.1. From these results, we can see that CVR yields an optimistic estimate of out-of-source error, even though the training set in each iteration of CV is smaller than the full training set. This is in contrast to CVS, which yields a more desirable pessimistic estimate.

Figure 3 shows the standard deviation of µ̂_CVS and µ̂_CVR averaged over the choice of training domains. To get this estimate, we have performed a similar procedure to Figure 2, i.e., for each draw of training domains, we sample 100 data sets, drawing data points from the empirical distribution of the respective domains. Although the CVS estimate has higher variance, the standard deviations are of the same order of magnitude and both tend to 0 for large data sets.

6.2. Variance decompositions and estimators of variance

In this section, we evaluate our new estimators of the variances θ_CVR and θ_CVS. To obtain these results, we used all domains as training domains. Estimating both θ_CVR and θ_CVS unbiasedly requires more than one independently sampled data set. To get them, we sample 1000 data sets, sampling from the empirical distributions of each domain.

In the first experiment we estimated the components of the decomposition of θ_CVR and θ_CVS (Figure 4). The quantities (1/(KM)) σ², ((M−1)/(KM)) ω, ((K−1)/K) γ and, corresponding to them, (1/(K²M)) Σ_k σ²_k, ((M−1)/(K²M)) Σ_k ω_k, (1/K²) Σ_{k≠l} γ_{kl} have different magnitudes. Notice that ((M−1)/(KM)) ω is much larger than ((M−1)/(K²M)) Σ_k ω_k. It is not a surprise that CVS yields a strong positive correlation between errors made on points belonging to the same test block, which is not observed in CVR, where the test blocks are chosen randomly.

Secondly, we looked at V[e] to see how much the ω_k's and γ_{kl}'s vary (Figure 5). It can be seen that the variation within the γ_{kl}'s diminishes when M gets larger, but the variation within the ω_k's does not change much.

Finally, we have tested the estimators of θ_CVR and θ_CVS which we have suggested in the earlier sections. The results are in Figure 6. For CVS, we compare θ̂^γ, θ̂_A and θ̂_B. As expected, θ̂^γ does not converge to 0 and grossly overestimates θ_CVS for large values of M. The new estimator θ̂_A is closer to θ_CVS than θ̂^γ for large values of M, but its bias is consistently optimistic. On the other hand, for small M, θ̂_B is not more optimistic than θ̂^γ, and for large M, while θ̂^γ becomes very pessimistic, θ̂_B has a negligible bias. Similarly, for the standard CVR setting, our new estimator θ̂_5 has lower bias than the previous estimators suggested by Grandvalet & Bengio (2006), being almost unbiased even for small M.

7. Related work

Various different versions of cross-validation have been analysed previously (Bengio & Grandvalet, 2003; Arlot & Celisse, 2010; Nadeau & Bengio, 2003; Markatou et al., 2005), but to our knowledge multiple-source cross-validation has not been previously analysed.

The idea of learning from a number of sources dates back to Caruana (1997) and Thrun (1996). A related issue in the context of covariate shift was suggested by Sugiyama et al. (2007); however, the resulting importance weights are difficult to estimate in practice (but see Gretton et al. (2009) for some work in this direction).

Rakotomalala et al. (2006) investigate the multiple-source cross-validation procedure empirically, but they do not perform theoretical analysis or present any estimators of θ_CVS. Work in domain adaptation has also considered the problem of bounding the prediction error when the training and test distribution have a different source (Ben-David et al., 2010). Unfortunately, the resulting bounds are too loose to be used for confidence intervals.

8. Conclusions

We have considered a cross-validation procedure for the multiple-source setting. We show that the bias of this procedure is better suited to this setting. We have presented several new estimators of the variance of the error estimate, both for the multiple-source cross-validation procedure and for the standard cross-validation setting, which perform well empirically.

Acknowledgments

The authors would like to thank Amos Storkey, Vittorio Ferrari, Simon Rogers and the anonymous reviewers for insightful comments on this work. This work was supported by the Scottish Informatics and Computer Science Alliance (SICSA).

References

Arlot, S. and Celisse, A. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010.

Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Wortman Vaughan, J. A theory of learning from different domains. Machine Learning, 79:151–175, 2010.

Bengio, Y. and Grandvalet, Y. No unbiased estimator of the variance of k-fold cross-validation. Journal of Machine Learning Research, 5:1089–1105, 2003.

Blitzer, J., Dredze, M., and Pereira, F. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the Association for Computational Linguistics, 2007.

Caruana, R. Multi-task learning. Machine Learning, 28:41–75, 1997.

Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., and Slattery, S. Learning to extract symbolic knowledge from the world wide web. In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98), 1998.

Grandvalet, Y. and Bengio, Y. Hypothesis testing for cross-validation. Technical Report 1285, Département d'informatique et recherche opérationnelle, Université de Montréal, 2006.

Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., and Schölkopf, B. Covariate shift by kernel mean matching. In Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. (eds.), Dataset Shift in Machine Learning, pp. 131–160. The MIT Press, 2009.

Markatou, M., Tian, H., Biswas, S., and Hripcsak, G. Analysis of variance of cross-validation estimators of the generalization error. Journal of Machine Learning Research, 6:1127–1168, 2005.

Mitchell, T., Hutchinson, R., Niculescu, R., Pereira, F., Wang, X., Just, M., and Newman, S. Learning to decode cognitive states from brain images. Machine Learning, 57(1):145–175, 2004.

Nadeau, C. and Bengio, Y. Inference for the generalization error. Machine Learning, 52:239–281, 2003.

Rakotomalala, R., Chauchat, J.-H., and Pellegrino, F. Accuracy estimation with clustered dataset. In Proceedings of the Australasian Conference on Data Mining and Analytics, 2006.

Stone, M. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B (Methodological), 36(2):111–147, 1974.

Sugiyama, M., Krauledat, M., and Müller, K.-R. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985–1005, 2007.

Thrun, S. Is learning the n-th thing any easier than learning the first? In Advances in Neural Information Processing Systems, 1996.