Asymptotic Bayesian Generalization Error when Training and Test Distributions are Different

Keisuke Yamazaki ([email protected])
Precision and Intelligence Laboratory, Tokyo Institute of Technology, R2-5, 4259 Nagatsuta, Midori-ku, Yokohama, 226-8503 Japan

Motoaki Kawanabe ([email protected])
Fraunhofer FIRST, IDA, Kekuléstr. 7, 12489 Berlin, Germany

Sumio Watanabe ([email protected])
Precision and Intelligence Laboratory, Tokyo Institute of Technology

Masashi Sugiyama ([email protected])
Department of Computer Science, Tokyo Institute of Technology, 2-12-1, O-okayama, Meguro-ku, Tokyo, 152-8552 Japan

Klaus-Robert Müller ([email protected])
Fraunhofer FIRST, IDA, & Technical University of Berlin, Computer Science, Franklinstr. 28/29, 10587 Berlin, Germany

Abstract

In supervised learning, we commonly assume that training and test data are sampled from the same distribution. However, this assumption can be violated in practice, and then standard techniques perform poorly. This paper focuses on revealing and improving the performance of Bayesian estimation when the training and test distributions are different. We formally analyze the asymptotic Bayesian generalization error and establish its upper bound under a very general setting. Our important finding is that lower order terms, which can be ignored in the absence of the distribution change, play an important role under the distribution change. We also propose a novel variant of stochastic complexity which can be used for choosing an appropriate model and hyper-parameters under a particular distribution change.

1. Introduction

The goal of supervised learning is to infer an underlying relation between input $x$ and output $y$ from training data. This allows us to predict the output value at an unseen test input point. A common assumption in the supervised learning scenario is that the test data are sampled from the same underlying distribution as the training data. However, this assumption is often not fulfilled in practice, e.g., when the data generation mechanism is non-stationary or the data sampling process has time or cost constraints. If the joint distribution $p(x, y)$ is totally different between the training and test phases, we may not be able to extract any information about the test data from the training data. Therefore, the change of distribution needs to be restricted in a reasonable way.

One of the most interesting types of distribution change is the situation called the covariate shift (Shimodaira, 2000): the input distribution $p(x)$ varies, but the functional relation $p(y|x)$ remains unchanged. For data from many applications such as off-policy reinforcement learning (Shelton, 2001), bioinformatics (Baldi et al., 1998) or brain-computer interfacing (Wolpaw et al., 2002), the covariate shift phenomenon is conceivable. Sample selection bias (Heckman, 1979) in econometrics may also include a particular form of covariate shift. Another possible type of distribution change is the functional relation change, where $p(y|x)$ changes between the training and test phases. Under classification scenarios, the situation called the class prior change, in which the class prior probability $p(y)$ is different for training and test data, can often be observed. See Fig. 1 for an illustration.

Appearing in Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007. Copyright 2007 by the author(s)/owner(s).

Figure 1. Schematic illustration of the distribution change. Left block: covariate shift in a regression problem; the input distributions are different between the training and test phases, while the target function remains unchanged. Right block: functional relation change in a classification problem; the target decision boundary changes, while the input distribution stays unchanged.

Standard supervised learning techniques are not designed to work appropriately under the distribution change. So far, several frequentist methods have been developed to improve the performance, e.g., in the covariate shift scenarios (Shimodaira, 2000; Sugiyama & Müller, 2005; Sugiyama et al., 2007) and when the class prior change occurs (Lin et al., 2002). However, it seems that a Bayesian perspective under such distribution changes is still an open research issue.

In this paper, we therefore investigate the behavior of Bayesian estimation in the presence of the distribution change. Our primary result is that lower order terms, which can be ignored in the absence of the distribution change, play an important role under the distribution change. Note that this result is derived without assuming the regularity condition (White, 1982) and is applicable to non-regular statistical models such as multi-layer perceptrons, Gaussian mixtures, and hidden Markov models. However, precisely investigating the lower order terms may only be possible in some limited cases. To cope with this problem, we derive an upper bound of the Bayesian generalization error for more general analysis.

The prediction performance of Bayesian estimation can be improved by properly choosing the model structure and hyper-parameters. For this purpose, the stochastic complexity (Rissanen, 1986) is often used as an evaluation criterion. The stochastic complexity corresponds to the probability of having the current training data given a model and hyper-parameters. Therefore, employing the stochastic complexity may not be suitable when the training and test data follow different distributions. In this paper, we propose a novel variant of stochastic complexity that can appropriately compensate for the effect of covariate shift.

2. Bayesian Estimation without Distribution Change

In this section, we briefly introduce the standard Bayesian estimation procedure without distribution change and review asymptotic forms of the generalization error and the stochastic complexity.

2.1. Bayesian Inference

Let $\{X^n, Y^n\} = \{X_1, Y_1, \ldots, X_n, Y_n\}$ be a set of training samples that are independently and identically generated following the true distribution $r(y|x)q(x)$. Let $p(y|x, w)$ be a learning machine and $\varphi(w)$ be the a priori distribution of the parameter $w$. Then the a posteriori distribution is defined by

$p(w|X^n, Y^n) = \frac{1}{Z(X^n, Y^n)} \prod_{i=1}^n p(Y_i|X_i, w)\varphi(w)$,

where

$Z(X^n, Y^n) = \int \prod_{i=1}^n p(Y_i|X_i, w)\varphi(w)\,dw$.   (1)

The Bayesian predictive distribution is given by

$p(y|x, X^n, Y^n) = \int p(y|x, w)\,p(w|X^n, Y^n)\,dw$.

We evaluate the generalization error by the average Kullback divergence from the true distribution to the predictive distribution:

$G(n) = E_{X^n,Y^n}\left[\int r(y|x)q(x)\log\frac{r(y|x)}{p(y|x, X^n, Y^n)}\,dx\,dy\right]$.

The stochastic complexity (Rissanen, 1986) is defined by

$F(X^n, Y^n) = -\log Z(X^n, Y^n)$,   (2)

which can be used for selecting an appropriate model or hyper-parameters. When analyzing the behavior of the stochastic complexity, the following function plays an important role:

$F(n) = E_{X^n,Y^n}\left[\bar{F}(X^n, Y^n)\right]$,   (3)

where $E_{X^n,Y^n}[\cdot]$ stands for the expectation value over all sets of training samples and

$\bar{F}(X^n, Y^n) = F(X^n, Y^n) + \sum_{i=1}^n \log r(Y_i|X_i)$.
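For conjugate models, the posterior and predictive distributions above are available in closed form. The following sketch is an illustration only (not part of the paper); the linear-Gaussian model, the zero-mean prior, the prior precision, and the data-generating settings are assumptions of this example.

```python
import numpy as np

# Illustrative sketch: for a linear-Gaussian model, the posterior
# p(w|X^n,Y^n) and the predictive p(y|x,X^n,Y^n) of Section 2.1 are
# Gaussian and can be computed in closed form. The feature map
# psi(x) = (1, x)^T, the prior N(0, I/lam), and the noise level
# are assumptions of this example.

rng = np.random.default_rng(0)
sigma2 = 0.25 ** 2                      # known noise variance of p(y|x,w)
n = 50
X = rng.normal(1.0, 0.5, n)             # training inputs drawn from q(x)
Y = np.sinc(X) + rng.normal(0.0, np.sqrt(sigma2), n)   # outputs from r(y|x)

Psi = np.stack([np.ones(n), X])         # 2 x n design matrix
lam = 1.0                               # assumed prior precision

# Posterior p(w|X^n,Y^n) = N(mu_n, Lambda_n)
Lambda_n = np.linalg.inv(Psi @ Psi.T / sigma2 + lam * np.eye(2))
mu_n = Lambda_n @ (Psi @ Y) / sigma2

def predictive(x):
    """Mean and variance of the Gaussian predictive distribution at input x."""
    psi = np.array([1.0, x])
    return psi @ mu_n, psi @ Lambda_n @ psi + sigma2

m1, v1 = predictive(1.0)
```

The predictive variance at any input exceeds the noise variance, since the posterior covariance term $\psi^\top\Lambda_n\psi$ is positive; this extra term is exactly the parameter uncertainty that the Bayesian predictive distribution carries.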

The generalization error and the stochastic complexity are linked by the following equation (Watanabe, 1999):

$G(n) = F(n+1) - F(n)$.   (4)

2.2. Asymptotic Generalization Error

Watanabe (2001a) developed an algebraic geometrical approach to analyzing the asymptotic generalization error of Bayesian estimation. This approach includes as a special case the well-known result by Schwarz (1978), but is substantially more general: the generalization error of non-regular statistical models such as multi-layer perceptrons can also be analyzed.

When the learning machine $p(y|x,w)$ can attain the true distribution $r(y|x)$, i.e., there exists a parameter $w^*$ such that $p(y|x,w^*) = r(y|x)$, the asymptotic expansion of $F(n)$ is given as follows (Watanabe, 2001a):

$F(n) = \alpha \log n - (\beta - 1)\log\log n + O(1)$,   (5)

where the rational number $-\alpha$ and the natural number $\beta$ are the largest pole and its order of

$J(z) = \int H(w)^z \varphi(w)\,dw$,

and $H(w)$ is defined by

$H(w) = \int r(y|x)q(x)\log\frac{r(y|x)}{p(y|x,w)}\,dx\,dy$.   (6)

Combining Eqs.(5) and (4) immediately gives

$G(n) = \frac{\alpha}{n} - \frac{\beta-1}{n\log n} + o\left(\frac{1}{n\log n}\right)$,

when $G(n)$ has an asymptotic form. The coefficients $\alpha$ and $\beta$ indicate the speed of convergence of the generalization error when the number of training samples is sufficiently large.

When the learning machine cannot attain the true distribution (i.e., the model is misspecified), the stochastic complexity has an upper bound of the following asymptotic form (Watanabe, 2001b):

$F(n) \le nC + \alpha\log n - (\beta-1)\log\log n + O(1)$,   (7)

where $C$ is a non-negative constant. When the generalization error has an asymptotic form, combining Eqs.(7) and (4) gives

$G(n) \le C + \frac{\alpha}{n} - \frac{\beta-1}{n\log n} + o\left(\frac{1}{n\log n}\right)$,   (8)

where $C$ is the bias.

3. Analysis of the Bayesian Generalization Error with Distribution Change

In this section, we analyze the generalization error of Bayesian estimation under the distribution change.

3.1. Notations

In the following, let us denote the training distribution with subscript 0 and the test distribution with subscript 1. Then the covariate shift situation is expressed by

$r_0(y|x) = r_1(y|x)$ and $q_0(x) \ne q_1(x)$,

while the functional relation change is described as

$r_0(y|x) \ne r_1(y|x)$ and $q_0(x) = q_1(x)$.

For $i = 0, 1$, let

$G^i(n) = E^i_{X_{n+1},Y_{n+1}} E^0_{X^n,Y^n}\left[\log\frac{r_i(Y_{n+1}|X_{n+1})}{p(Y_{n+1}|X_{n+1}, X^n, Y^n)}\right]$,

where $E^i_{X,Y}$ stands for the expectation over $X$ and $Y$ taken from $r_i(y|x)q_i(x)$. Note that the functions $G^0(n)$ and $G^1(n)$ correspond to the generalization errors without and with distribution change, respectively. Our primary goal in this section is to reveal the asymptotic form of $G^1(n)$.

3.2. Asymptotic Expansion of Generalization Error

Let us define a function

$U^i(n+1) = -E^i\left[\log\int\exp\left(-\sum_{j=1}^n \log\frac{r_0(Y_j|X_j)}{p(Y_j|X_j,w)} - \log\frac{r_i(Y_{n+1}|X_{n+1})}{p(Y_{n+1}|X_{n+1},w)}\right)\varphi(w)\,dw\right]$

for $i = 0, 1$, where $E^i \equiv E^i_{X_{n+1},Y_{n+1}} E^0_{X^n,Y^n}$. Note that $U^0(n) = F(n)$ (see Eq.(3)). We assume:

(A1) $G^i(n)$ has an asymptotic expansion and $G^i(n) \to B_i$ as $n \to \infty$, where $B_i$ is a constant.

(A1') $U^i(n)$ has the following asymptotic expansion in descending order with respect to $n$:

$U^i(n) = \underbrace{a_i n + b_i\log n + \cdots}_{T^i_H(n)} + \underbrace{c_i + \frac{d_i}{n} + o\left(\frac{1}{n}\right)}_{T^i_L(n)}$,   (9)

where $a_i$, $b_i$, $c_i$, and $d_i$ are constants independent of $n$. Note that (A1') is not essential but only for notational convenience.

Lemma 1. The generalization error $G^1(n)$ is expressed as

$G^1(n) = U^1(n+1) - U^0(n)$.   (10)

Theorem 1. Under the assumptions (A1) and (A1'), the asymptotic expansion of $G^1(n)$ is expressed by

$G^1(n) = a_0 + (c_1 - c_0) + \frac{b_0 + (d_1 - d_0)}{n} + o\left(\frac{1}{n}\right)$,

and $T^1_H(n) = T^0_H(n)$.

The proof is given in Appendix A. In the standard case without the distribution change, it is straightforward to show that

$G^0(n) = a_0 + \frac{b_0}{n} + o\left(\frac{1}{n}\right)$.

Thus an important finding from the above theorem is that the lower order terms in $T^i_L(n)$, which do not appear in the asymptotic expansion of $G^0(n)$, cannot be ignored in the asymptotic expansion of $G^1(n)$.

Example 1. Let the true distribution and the learning model be

$r(y|x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{y^2}{2\sigma^2}\right)$,

$p(y|x,a) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y-ax)^2}{2\sigma^2}\right)$,

where $a \in \mathbb{R}^1$ is the only parameter. Note that the learning model $p(y|x,a)$ can attain the true distribution $r(y|x)$ by $a = 0$. We assume a Gaussian prior,

$\varphi(a) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{a^2}{2\sigma^2}\right)$.

The training and test input distributions $q_0$ and $q_1$ are respectively defined by

$q_i(x) = \frac{1}{\sqrt{2\pi\sigma_i^2}}\exp\left(-\frac{x^2}{2\sigma_i^2}\right)$.

The coefficients in Theorem 1 are as follows:

$a_0 = a_1 = 0$, $b_0 = b_1 = 1/2$, $c_1 - c_0 = 0$, $d_1 - d_0 = (\sigma_1^2/\sigma_0^2 - 1)/2$.

Then the generalization errors are written as

$G^1(n) = \frac{\sigma_1^2}{2n\sigma_0^2} + o\left(\frac{1}{n}\right)$, $G^0(n) = \frac{1}{2n} + o\left(\frac{1}{n}\right)$.

We omit the derivation because of lack of space; we note, however, that due to the lower order terms the derivation is not straightforward despite its simplicity at first glance.

In this regression case, the learning model can attain the true distribution, and the true line ($y = 0$) does not change in the test phase (i.e., the covariate shift). When the test input distribution is wider than the training input distribution (i.e., $\sigma_1 > \sigma_0$), the generalization errors satisfy $G^1(n) > G^0(n)$. This is consistent with the intuition that data far from the origin bring more information on the true function ($y = 0$).

3.3. Bound of Generalization Error

The above theorem gives the insight that clarifying the asymptotic form of $G^1(n)$ requires computing the lower order terms. However, the algebraic geometrical approach (see Section 2.2) does not take account of the lower order terms, and we need to investigate them directly. This is usually very hard; even for a simple case such as Example 1, the calculation of the lower order terms is not straightforward.

Here we propose a different approach: deriving an upper bound on $G^1(n)$ in terms of $G^0(n)$. Since the algebraic geometrical method can be used for revealing $G^0(n)$, this approach allows us to deal with a broader class of models. We assume:

(A2) The largest difference between the training and test distributions is finite, i.e.,

$M = \max_{x,y \sim r_0(y|x)q_0(x)} \frac{r_1(y|x)q_1(x)}{r_0(y|x)q_0(x)} < \infty$.

Theorem 2. Under the assumptions (A1) and (A2), the generalization error $G^1(n)$ asymptotically has an upper bound,

$G^1(n) \le M G^0(n) + D_1 + D_2$,

where

$D_1 = \int r_1(y|x)q_1(x)\log\frac{r_1(y|x)}{r_0(y|x)}\,dx\,dy$,

and $D_2 = 0$ if $r_1(y|x) = r_0(y|x)$, $D_2 = 1$ otherwise.

The proof is given in Appendix B.

Example 2. Let $r_1(y|x) = r_0(y|x)$, and let the training input distribution $q_0(x)$ and the test input distribution $q_1(x)$ be Gaussians,

$q_i(x) = \frac{1}{\sqrt{2\pi\sigma_i^2}}\exp\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right)$.

Then

$M = \max_{x \sim q_0(x)} \frac{q_1(x)}{q_0(x)} = \frac{\sigma_0}{\sigma_1}\max_{x \sim q_0(x)} \exp\left(\frac{(\mu_0-\mu_1)^2}{2(\sigma_0^2-\sigma_1^2)} - \frac{\sigma_0^2-\sigma_1^2}{2\sigma_0^2\sigma_1^2}\left(x - \frac{\sigma_0^2\mu_1-\sigma_1^2\mu_0}{\sigma_0^2-\sigma_1^2}\right)^2\right)$.

In this case, (A2) requires $\sigma_0 > \sigma_1$ independently of $\mu_i$, and then

$M = \frac{\sigma_0}{\sigma_1}\exp\left(\frac{(\mu_0-\mu_1)^2}{2(\sigma_0^2-\sigma_1^2)}\right)$.

3.4. Discussion of the Theorems

Under the assumption (A1), $B_i$ represents the bias and $S_i$ corresponds to the speed of convergence, i.e.,

$G^i(n) = B_i + \frac{S_i}{n} + o\left(\frac{1}{n}\right)$.

Here we analyze the speed of convergence and the bias of Bayesian estimation under the distribution change.

3.4.1. Bias

According to Theorem 1, $B_0 = a_0$ and $B_1 = a_0 + (c_1 - c_0)$. Thus the constant terms of $U^i(n)$ induce the distinction of the bias. As for the bias $B_1$, the following corollary holds.

Corollary 1. When the learning machine $p(y|x,w)$ can realize the true $r_0(y|x)$ (i.e., $B_0 = 0$), $B_1 = D_1$.

The proof is given in Appendix B. Table 1 summarizes the upper bound of $B_1$ in each case based on Theorem 2 and Corollary 1. Note that $D_1 = 0$ under the covariate shift ($r_1(y|x) = r_0(y|x)$).

Table 1. Upper bounds of $B_1$. CS and FRC denote the covariate shift and the functional relation change, respectively.

                     B_0 = 0       B_0 != 0
  No dist. change    B_1 = 0       B_1 = B_0
  CS                 B_1 = 0       B_1 <= M B_0
  FRC                B_1 = D_1     B_1 <= M B_0 + D_1 + 1
  CS & FRC           B_1 = D_1     B_1 <= M B_0 + D_1 + 1

When the learning model $p(y|x,w)$ can realize the true $r_0(y|x)$ (the middle column of Table 1), $B_1 \ge B_0\,(= 0)$ always holds. Therefore, the bias is generally larger than in the case without the distribution change. However, when the learning model $p(y|x,w)$ cannot realize the true $r_0(y|x)$ (the right column of Table 1), $B_1 < B_0$ can occur depending on the sign of $c_1 - c_0$.

3.4.2. Speed of Convergence

According to Theorem 1, $S_0 = b_0$ and $S_1 = b_0 + (d_1 - d_0)$. Note that $S_0$ corresponds to $\alpha$ in Eq.(8). The sign of $d_1 - d_0$, which determines the magnitude relation between $S_0$ and $S_1$, depends on the setting. In Example 1, both variances $\sigma_0^2$ and $\sigma_1^2$ affect the sign.

We recall that a faster convergence does not necessarily imply a lower generalization error, due to the bias term. However, the speed term is dominant when the model can attain the true distribution under the covariate shift. In this case, $B_0 = B_1 = 0$ and

$S_1 \le M S_0$, where $M = \max_{x \sim q_0} q_1(x)/q_0(x)$.

In the above inequality, the quantity $M$ appears as a maximum factor of speeding-down. This is natural since $M$ represents the amount of difference between the training and test distributions. For example, if the support of the test distribution is not included in the support of the training distribution, $M$ becomes infinity and no conclusion is derived from the bound. In such a case, an explicit computation as in Example 1 is required.

3.5. Numerical Example

Let us illustrate the error bounds using a simple regression problem borrowed from Sugiyama and Müller (2005): $y = \mathrm{sinc}(x) + \varepsilon$, where the noise $\varepsilon$ follows $N(0, \sigma^2)$ with $\sigma^2 = (1/4)^2$. We assume the noise variance $\sigma^2$ is known. The training and test inputs are subject to $N(1, (1/2)^2)$ and $N(2, (1/4)^2)$, respectively (see Fig. 2). We use the linear regression model $\hat{f}(x) = w_1 + w_2 x$, and estimate the parameter $w = (w_1, w_2)^\top$ in the Bayesian framework with the prior distribution $\varphi(w)$ being

$N(\mu, \lambda^{-1} I_2)$.   (11)

$\mu$ and $\lambda$ are hyper-parameters and are determined based on the stochastic complexity.

Figure 2. Illustrative example of covariate shift: the training and test input densities, the true function, and the training and test samples.

Under this setting, the posterior distribution $p(w|X^n, Y^n)$ is Gaussian $N(\mu_n, \Lambda_n)$, where

$\mu_n = \Lambda_n(\sigma^{-2}\Psi^n Y^n + \lambda\mu)$,
$\Lambda_n = (\sigma^{-2}\Psi^n(\Psi^n)^\top + \lambda I_2)^{-1}$,
$\Psi^n = (\psi(X_1), \ldots, \psi(X_n))$, $\psi(x) = (1, x)^\top$.

The predictive distribution $p(y|x, X^n, Y^n)$ is also Gaussian $N(m_n(x), v_n(x))$, where

$m_n(x) = \psi^\top(x)\mu_n$, $v_n(x) = \psi^\top(x)\Lambda_n\psi(x) + \sigma^2$.

The generalization errors are expressed as

$G^i(n) = E_{X^n,Y^n}\left[\frac{1}{2}\int q_i(x)\left(\frac{\{\mathrm{sinc}(x) - m_n(x)\}^2}{v_n(x)} + \frac{\sigma^2}{v_n(x)} - 1 - \log\frac{\sigma^2}{v_n(x)}\right)dx\right]$

for $i = 0, 1$. The maximum ratio $M$ defined by (A2) is $2\exp(8/3)$; note that $M$ is finite according to Example 2. The generalization errors $G^0$ and $G^1$ converge to $B_0 \doteq 0.2939$ and $B_1 \doteq 2.2818$, respectively, where we used the fact that $\mu_n$ and $\Lambda_n$ converge to $E^0[\psi(x)\psi^\top(x)]^{-1}E^0[\psi(x)f(x)]$ and $0_{2\times 2}$, respectively.

Fig. 3 depicts the generalization error $G^1(n)$ and its upper bound as functions of the sample size $n$. The values are averages over 400 (for $n \le 100$) and 100 (for $n > 100$) realizations. For comparison, the generalization error $G^0(n)$ in the absence of the distribution change and its shift defined by $G^{01}(n) = G^0(n) - B_0 + B_1$ are also plotted. This shifted error has the same bias as $G^1(n)$ and the same convergence speed term as $G^0(n)$. In this specific example, the bias term of $G^1(n)$ is larger than that of $G^0(n)$, and the speed terms do not seem so different.

Figure 3. The generalization error $G^1(n)$ and its upper bound. The comparative errors $G^0(n)$ and $G^{01}(n)$ are also plotted.
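The closed-form maximum density ratio of Example 2 can be checked numerically. The sketch below is illustrative (not part of the paper); it uses the Gaussian input densities of the Section 3.5 setting, for which the closed form gives $M = 2\exp(8/3)$.

```python
import numpy as np

# Sanity check of Example 2: for Gaussian input densities q0 = N(mu0, s0^2)
# and q1 = N(mu1, s1^2) with s0 > s1, the maximum density ratio
# M = max_x q1(x)/q0(x) should equal (s0/s1) * exp((mu0-mu1)^2 / (2(s0^2-s1^2))).
# The concrete values below are the training/test input densities of Section 3.5.

mu0, s0 = 1.0, 0.5    # training input density N(1, (1/2)^2)
mu1, s1 = 2.0, 0.25   # test input density N(2, (1/4)^2)

def q(x, mu, s):
    """Gaussian density with mean mu and standard deviation s."""
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)

# Dense grid search for the maximum of the density ratio.
x = np.linspace(-10.0, 10.0, 200001)
M_numeric = np.max(q(x, mu1, s1) / q(x, mu0, s0))

# Closed form from Example 2 (requires s0 > s1).
M_closed = (s0 / s1) * np.exp((mu0 - mu1) ** 2 / (2 * (s0 ** 2 - s1 ** 2)))
# With these values, M_closed = 2 * exp(8/3), as stated in Section 3.5.
```

The ratio $q_1(x)/q_0(x)$ is itself a Gaussian-shaped function of $x$ when $\sigma_0 > \sigma_1$, peaking at $x = (\sigma_0^2\mu_1 - \sigma_1^2\mu_0)/(\sigma_0^2 - \sigma_1^2)$, which is why the grid maximum and the closed form agree.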

4. Importance-Weighted Stochastic Complexity

In the previous section, we analyzed the asymptotic generalization performance of Bayesian estimation under the distribution change. In this section, we shift our focus toward more practical aspects and show how the generalization performance can be improved under the covariate shift scenarios.

4.1. Definition

In Bayesian inference, the stochastic complexity is often used for selecting the model structure and hyper-parameters. However, as seen in Eqs.(1) and (2), the original stochastic complexity is computed from the marginal likelihood of the training data. Under the covariate shift, we need to select the model structure and hyper-parameters in terms of the likelihood of the test data (both inputs and outputs). However, the test data are not available during the training phase.

To cope with this problem, we propose a variant of stochastic complexity called the importance-weighted stochastic complexity (IWSC), which is defined as follows:

$F^{IW}(X^n, Y^n) = -\log\int\exp\left(l^{IW}(w)\right)\varphi(w)\,dw$,   (12)

where

$l^{IW}(w) = \sum_{i=1}^n W(X_i)\log p(Y_i|X_i, w)$, and $W(x) = q_1(x)/q_0(x)$.

Here, we assume that $W(x)$ is known; when it is unknown, we may use an estimate (e.g., (Huang et al., 2007)).

4.2. Example of IWSC

Here we illustrate the behavior of IWSC using a toy example. Let us consider the following setting: the model $p(y|x,a)$ is a Gaussian $N(af(x), \sigma_4^2)$, where $a$ is the parameter. The prior is the same as in Example 1, $N(\mu, \sigma^2)$. The true distribution $r(y|x)$ is also a Gaussian, $N(g(x), \sigma_2^2)$. Then IWSC is rewritten as

$F^{IW}(X^n, Y^n|\mu) = \sum_{i=1}^n W(X_i)\log(\sqrt{2\pi\sigma_4^2}) + \frac{1}{2}\log\left(1 + \frac{\sigma^2}{\sigma_4^2}\sum_{i=1}^n W(X_i)f^2(X_i)\right) + \frac{\mu^2}{2\sigma^2} - \frac{\left(\sigma^2\sum_{i=1}^n W(X_i)f(X_i)Y_i + \sigma_4^2\mu\right)^2}{2\sigma^2\sigma_4^2\left(\sigma^2\sum_{i=1}^n W(X_i)f^2(X_i) + \sigma_4^2\right)}$.   (13)

First, let us analyze the behavior of the average IWSC. Considering the order of $n$, we can prove that

$E^0_{X^n,Y^n}\left[F^{IW}(X^n, Y^n|\mu)\right] = n\left(\log(\sqrt{2\pi\sigma_4^2}) - \frac{\langle fg\rangle_1^2}{2\sigma_4^2\langle f^2\rangle_1}\right) + \frac{1}{2}\log n + O(1)$,

where $\mu$ is a hyper-parameter and $\langle f\rangle_1 = E^1_x[f(x)]$ (and similarly for $\langle fg\rangle_1$ and $\langle f^2\rangle_1$). This implies that the average IWSC is expressed in terms of the expectation over data from the test input distribution, not the training input distribution, and thus the application of the importance weight $W(x)$ is reasonable.

Next we focus on the optimization of the hyper-parameter $\mu$. We can show that the hyper-parameter that minimizes Eq.(13) is given as

$\hat{\mu} \equiv \arg\min_\mu F^{IW}(X^n, Y^n|\mu) = \frac{\sum_{i=1}^n W(X_i)f(X_i)Y_i}{\sum_{i=1}^n W(X_i)f^2(X_i)}$.

This result claims that IWSC selects a reasonable hyper-parameter, because we can prove that $\hat{\mu}$ on average converges to

$E^0_{X^n,Y^n}[\hat{\mu}] = \frac{\langle Wfg\rangle_0}{\langle Wf^2\rangle_0} = \frac{\langle fg\rangle_1}{\langle f^2\rangle_1}$,

where $\langle f\rangle_0 = E^0_x[f(x)]$. The convergent point is the same as the hyper-parameter selected by the ordinary stochastic complexity when the training input distribution agrees with the test input distribution.

4.3. Experimental Results

We report a result of a simple numerical example to illustrate how IWSC actually works. We again used the toy regression problem of Section 3.5. Linear models were learned with 200 training samples by the Bayesian procedure. The noise variance $\sigma^2$ is assumed to be known for simplicity, and the hyper-parameters $\mu$ and $\lambda$ in Eq.(11) were selected based on the stochastic complexity (2) or IWSC (12). The results are depicted in Fig. 4. The solid lines in the left-most and second-left graphs show the mean of the prior, i.e., $\mu_1 + \mu_2 x$, while the dashed lines indicate the regions within three times the standard deviation determined by the dispersion parameter $\lambda$; 'o' are training samples and 'x' are noiseless test samples. The result of IWSC (see the second-left graph) predicts the output values in the test region very well, while SC only captures the training samples (see the left-most graph). We remark that the hyper-parameters in the second-left panel were obtained only from the training samples ('o') through IWSC. The profiles of SC and IWSC over the mean hyper-parameter $\mu$ are depicted in the second-right and right-most graphs, showing that both surfaces are smooth with a unique minimum, but at different points.

Figure 4. Learned functions obtained based on SC (left-most) and IWSC (second-left). SC (second-right) and IWSC (right-most) over the mean hyper-parameter $\mu$ ($\lambda$ was optimally chosen by SC and IWSC, respectively).

4.4. Related Works

The importance weight has been widely used in the context of frequentist approaches. Maximum likelihood estimation is no longer consistent under the covariate shift when the model is misspecified; instead, the maximizer of the importance-weighted log-likelihood is consistent (Shimodaira, 2000):

$\max_w \sum_{i=1}^n W(X_i)\log p(Y_i|X_i, w)$.   (14)

However, this is not efficient and is rather unstable in practical situations with finite samples. To cope with this problem, an adaptive variant was proposed (Shimodaira, 2000):

$\max_w \sum_{i=1}^n W(X_i)^\lambda \log p(Y_i|X_i, w)$,   (15)

where $0 \le \lambda \le 1$. $\lambda$ controls the trade-off between consistency and efficiency, and it needs to be chosen appropriately for better estimation. Note that any empirical-error-based method could be extended similarly.

The task of choosing $\lambda$ is a model selection problem. Standard model selection methods such as Akaike's information criterion (Akaike, 1974) and cross-validation (Stone, 1974) are not designed to work well under the covariate shift. To cope with this problem, a modified information criterion has been developed (Shimodaira, 2000), where the importance weight plays an essential role. Similarly, an importance-weighted model selection criterion specialized for linear regression (Sugiyama & Müller, 2005) and an importance-weighted cross-validation method (Sugiyama et al., 2007) have been developed and shown to work well in real-world problems.

In the above importance-weighting framework, it is theoretically assumed that the importance is known a priori. However, this may not be the case in practice. To cope with this problem, a method of directly estimating the importance in a non-parametric way has been developed (Huang et al., 2007), which effectively makes use of the kernel trick (Schölkopf & Smola, 2002) in a class of reproducing kernel Hilbert spaces.

Experimental design, where the training input distribution is designed by users, is a relevant situation since it naturally induces the covariate shift. A standard approach to experimental design in least-squares regression often ignores the bias of the estimator and designs the training input distribution so that the variance of the estimator is minimized (Fedorov, 1972). However, when the model is misspecified, which is the usual case in practice, the bias may not be ignored because of the covariate shift. Instead, the importance-weighted least squares method produces an asymptotically unbiased estimator, and its use allows us to apply the variance-only approach also in the experimental design of approximately linear regression (Wiens, 2000; Sugiyama, 2006). Furthermore, an experimental design method for totally misspecified models has been developed (Kanamori & Shimodaira, 2003), where the importance weight plays an essential role in establishing the consistency.

5. Conclusions

This paper clarified the asymptotic behavior of the Bayesian generalization error under the distribution change. Our result gave the interesting insight that the lower order terms which are ignored in the standard asymptotic theory play important roles under the distribution change. We also established an upper bound of the asymptotic generalization error in terms of the generalization error in the absence of the distribution change. In order to improve the prediction performance, we proposed a variant of stochastic complexity which can be used for choosing an appropriate model and hyper-parameters under the covariate shift.

Our future study will focus on investigating and improving the tightness of the bound. Similarly to IWSC, the likelihood term in the posterior distribution can also be modified by the importance weight $W(X_i)$ to compensate for the change in the input distributions (Shimodaira, 2000). A promising direction in this line would be to combine these procedures into a single framework and analyze the generalization performance.
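As a supplementary numerical illustration (not part of the paper), the closed-form minimizer $\hat{\mu}$ of Section 4.2 can be checked against a brute-force evaluation of the IWSC. The sketch below evaluates Eq.(12) by quadrature over the one-dimensional parameter $a$; the specific choices of $f$, $g$, the variances, and the input densities are illustrative assumptions.

```python
import numpy as np

# Sanity check of Section 4.2 (illustrative settings, not the paper's
# experiment): the minimizer over the prior mean mu of the IWSC,
# F_IW(mu) = -log int exp(l_IW(a)) N(a; mu, s_p^2) da,
# should match mu_hat = sum_i W(X_i) f(X_i) Y_i / sum_i W(X_i) f(X_i)^2.

rng = np.random.default_rng(1)
mu0, s0, mu1, s1 = 1.0, 0.5, 2.0, 0.25   # q0 = N(mu0, s0^2), q1 = N(mu1, s1^2)
s_m, s_p = 0.25, 1.0                     # model noise s.d. and prior s.d. (assumed)
f = lambda x: x                          # basis function of the model a * f(x)
n = 100
X = rng.normal(mu0, s0, n)
Y = 0.5 * f(X) + rng.normal(0.0, s_m, n)  # true regression g(x) = 0.5 x

def gauss(x, mu, s):
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)

W = gauss(X, mu1, s1) / gauss(X, mu0, s0)   # importance weights q1/q0

a = np.linspace(-3.0, 4.0, 7001)            # quadrature grid over the parameter
# Importance-weighted log-likelihood l_IW(a), evaluated on the grid.
l_iw = (-np.sum(W) * np.log(np.sqrt(2 * np.pi) * s_m)
        - np.sum(W[:, None] * (Y[:, None] - np.outer(f(X), a)) ** 2, axis=0)
        / (2 * s_m ** 2))

def iwsc(mu):
    """IWSC of Eq.(12), via a stabilized Riemann sum over the parameter grid."""
    integrand = np.exp(l_iw - l_iw.max()) * gauss(a, mu, s_p)
    return -(np.log(np.sum(integrand) * (a[1] - a[0])) + l_iw.max())

mus = np.linspace(-1.0, 2.0, 601)
mu_grid = mus[np.argmin([iwsc(m) for m in mus])]            # brute-force minimizer
mu_hat = np.sum(W * f(X) * Y) / np.sum(W * f(X) ** 2)       # closed form
```

Since the model is well specified here (true slope $a = 0.5$), $\hat{\mu}$ lands near $0.5$, and the grid minimizer agrees with the closed form up to the grid resolution, regardless of the noise and prior scales.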

Acknowledgments

This research was partly supported by the Alexander von Humboldt Foundation, MEXT 18079007, 17700142 and 18300057, the Okawa Foundation, and Microsoft IJARC.

A. Proof of Lemma 1 and Theorem 1

In the same way as Eq.(4) is derived, we can obtain $G^i(n) = U^i(n+1) - U^0(n)$. The asymptotic expansions of $U^i(n)$ and $G^1(n)$ are immediately derived from this relation, (A1), and (A1'). If a coefficient of $T^1_H(n)$ in $U^1(n)$ is different from that of $T^0_H(n)$, the assumption (A1) is violated. For example, if $a_1 \ne a_0$, $G^1(n)$ has the term $(a_1 - a_0)n$, which means $G^1(n) \to \infty$ as $n \to \infty$. Therefore, it must hold that $T^1_H(n) = T^0_H(n)$.

B. Proof of Theorem 2 and Corollary 1

Define

$D_3 = 1 - E^0_{X^n,Y^n}\left[\int \frac{r_1(y|x)\,p(y|x, X^n, Y^n)}{r_0(y|x)}\, q_1(x)\,dx\,dy\right]$.

Using $S(x) \equiv e^{-x} - 1 + x$,

$G^1(n) = E^0_{X^n,Y^n}\left[\int r_1(y|x)q_1(x)\log\frac{r_0(y|x)}{p(y|x, X^n, Y^n)}\,dx\,dy\right] + D_1 = D_4(n) + D_1 + D_3$,   (16)

where

$D_4(n) = E^0_{X^n,Y^n}\left[\int \frac{r_1(y|x)q_1(x)}{r_0(y|x)q_0(x)}\, r_0(y|x)q_0(x)\, S\!\left(\log\frac{r_0(y|x)}{p(y|x, X^n, Y^n)}\right)dx\,dy\right]$.

When $r_1(y|x) = r_0(y|x)$, $D_3 = D_2\,(= 0)$; otherwise $D_3 \le D_2\,(= 1)$. Since $S(x) \ge 0$, $G^1(n) \le M G^0(n) + D_1 + D_2$, which completes the proof of Theorem 2.

Next, we prove Corollary 1. When $n \to \infty$ and $B_0 = 0$, $p(y|x, X^n, Y^n) \to r_0(y|x)$. Therefore, $D_4(n)$ and $D_3$ in Eq.(16) asymptotically go to zero, which means $B_1 = D_1$.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19, 716-723.

Baldi, P., Brunak, S., & Stolovitzky, G. A. (1998). Bioinformatics: The machine learning approach. Cambridge: MIT Press.

Fedorov, V. V. (1972). Theory of optimal experiments. New York: Academic Press.

Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47, 153-162.

Huang, J., Smola, A., Gretton, A., Borgwardt, K. M., & Schölkopf, B. (2007). Correcting sample selection bias by unlabeled data. In B. Schölkopf, J. Platt and T. Hoffman (Eds.), Advances in Neural Information Processing Systems 19. Cambridge, MA: MIT Press.

Kanamori, T., & Shimodaira, H. (2003). Active learning algorithm using the maximum weighted log-likelihood estimator. Journal of Statistical Planning and Inference, 116, 149-162.

Lin, Y., Lee, Y., & Wahba, G. (2002). Support vector machines for classification in nonstandard situations. Machine Learning, 46, 191-202.

Rissanen, J. (1986). Stochastic complexity and modeling. Annals of Statistics, 14, 1080-1100.

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461-464.

Shelton, C. R. (2001). Importance sampling for reinforcement learning with multiple objectives. PhD thesis, Massachusetts Institute of Technology.

Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90, 227-244.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36, 111-147.

Sugiyama, M. (2006). Active learning in approximately linear regression based on conditional expectation of generalization error. Journal of Machine Learning Research, 7, 141-166.

Sugiyama, M., Krauledat, M., & Müller, K.-R. (2007). Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8.

Sugiyama, M., & Müller, K.-R. (2005). Input-dependent estimation of generalization error under covariate shift. Statistics & Decisions, 23, 249-279.

Watanabe, S. (1999). Algebraic analysis for singular statistical estimation. Lecture Notes in Computer Science, Springer, 1720, 39-50.

Watanabe, S. (2001a). Algebraic analysis for non-identifiable learning machines. Neural Computation, 13(4), 899-933.

Watanabe, S. (2001b). Algebraic information geometry for learning machines with singularities. Advances in Neural Information Processing Systems, 14, 329-336.

White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, 1-25.

Wiens, D. P. (2000). Robust weights and designs for biased regression models: Least squares and generalized M-estimation. Journal of Statistical Planning and Inference, 83, 395-412.

Wolpaw, J. R., Birbaumer, N., McFarland, D. J., Pfurtscheller, G., & Vaughan, T. M. (2002). Brain-computer interfaces for communication and control. Clinical Neurophysiology, 113, 767-791.