Lecture-14: Total Variation Distance

1 Comparison of Monte Carlo Methods

We have seen that Monte Carlo sampling methods give us the desired stationary distribution starting from a base Markov chain. Assuming one step of the Markov chain takes unit time, we can compare their computational efficiency in terms of the number of operations needed per sample, and the number of samples needed to get close enough to the desired stationary distribution.

1.1 Total variation distance

The closeness of two distributions can be measured by the following distance metric.

Definition 1.1. The total variation distance between two probability distributions $\mu$ and $\nu$ on $\mathcal{X}^N$ is defined by
$$\|\mu - \nu\|_{TV} = \max\left\{ |\mu(A) - \nu(A)| : A \subseteq \mathcal{X}^N \right\}.$$

This definition is probabilistic in the sense that the distance between $\mu$ and $\nu$ is the maximum difference between the probabilities assigned to a single event by the two distributions.

Proposition 1.2. Let $\mu$ and $\nu$ be two probability distributions on the configuration space $\mathcal{X}^N$. Let $B = \{x \in \mathcal{X}^N : \mu(x) \ge \nu(x)\}$, then
$$\|\mu - \nu\|_{TV} = \frac{1}{2} \sum_{x \in \mathcal{X}^N} |\mu(x) - \nu(x)| = \sum_{x \in B} [\mu(x) - \nu(x)].$$

Proof. Let $A \subseteq \mathcal{X}^N$ be any event. Since $\mu(x) - \nu(x) < 0$ for any $x \in A \cap B^c$, we have
$$\mu(A) - \nu(A) \le \mu(A \cap B) - \nu(A \cap B) \le \mu(B) - \nu(B).$$
Similarly, we can show that
$$\nu(A) - \mu(A) \le \nu(B^c) - \mu(B^c) = \mu(B) - \nu(B).$$
It follows that $|\mu(A) - \nu(A)| \le \mu(B) - \nu(B)$ for all events $A$, and the equality is achieved for $A = B$ and $A = B^c$. Thus, we get that
$$\|\mu - \nu\|_{TV} = \frac{1}{2}\left[\mu(B) - \nu(B) + \nu(B^c) - \mu(B^c)\right] = \frac{1}{2} \sum_{x \in \mathcal{X}^N} |\mu(x) - \nu(x)|.$$

Exercise 1.3. Show that $\|\mu - \nu\|_{TV}$ is a distance.

Proposition 1.4. Let $\mu$ and $\nu$ be two probability distributions on the configuration space $\mathcal{X}^N$. Taking the supremum over observables $f : \mathcal{X}^N \to \mathbb{R}$, we have
$$\|\mu - \nu\|_{TV} = \frac{1}{2} \sup\left\{ \sum_{x \in \mathcal{X}^N} f(x)\mu(x) - \sum_{x \in \mathcal{X}^N} f(x)\nu(x) : \max_{x \in \mathcal{X}^N} |f(x)| \le 1 \right\}.$$

Proof. If $\max_x |f(x)| \le 1$, then it follows that
$$\frac{1}{2} \sum_x f(x)\,(\mu(x) - \nu(x)) \le \frac{1}{2} \sum_x |\mu(x) - \nu(x)| = \|\mu - \nu\|_{TV}.$$
For the reverse inequality, we define $f^*(x) = \mathbf{1}_{\{x \in B\}} - \mathbf{1}_{\{x \notin B\}}$ in terms of the set $B = \{x \in \mathcal{X}^N : \mu(x) \ge \nu(x)\}$. It is clear that $\max_x |f^*(x)| = 1$, and we have
$$\frac{1}{2} \sum_x f^*(x)\,(\mu(x) - \nu(x)) = \frac{1}{2} \sum_x |\mu(x) - \nu(x)| = \|\mu - \nu\|_{TV}.$$

1.2 Coupling and total variation distance

Definition 1.5. A coupling of two probability distributions $\mu$ and $\nu$ is a pair of random variables $(X, Y)$ defined on a single probability space such that the marginal distribution of $X$ is $\mu$ and the marginal distribution of $Y$ is $\nu$. That is, a coupling $(X, Y)$ satisfies $P\{X = x\} = \mu(x)$ and $P\{Y = y\} = \nu(y)$.

Any two distributions $\mu$ and $\nu$ have an independent coupling. However, when the two distributions are not identical, it will not be possible for the random variables to always have the same value. The total variation distance between $\mu$ and $\nu$ determines how close a coupling can get to having $X$ and $Y$ identical.

Definition 1.6. For two distributions $\mu$ and $\nu$ on the configuration space $\mathcal{X}^N$, the coupling $(X, Y)$ is optimal if $\|\mu - \nu\|_{TV} = P\{X \ne Y\}$.

Proposition 1.7. Let $\mu$ and $\nu$ be two distributions on the configuration space $\mathcal{X}^N$, then
$$\|\mu - \nu\|_{TV} = \inf\left\{ P\{X \ne Y\} : (X, Y) \text{ a coupling of distributions } \mu, \nu \right\}.$$

Proof. For any coupling $(X, Y)$ of the distributions $\mu, \nu$ and any event $A \subseteq \mathcal{X}^N$, we have
$$\mu(A) - \nu(A) = P\{X \in A\} - P\{Y \in A\} \le P\{X \in A, Y \notin A\} \le P\{X \ne Y\}.$$
Therefore, it follows that $\|\mu - \nu\|_{TV} \le P\{X \ne Y\}$ for all couplings $(X, Y)$ of distributions $\mu, \nu$.

Next we find a coupling $(X, Y)$ for which $\|\mu - \nu\|_{TV} = P\{X \ne Y\}$. In terms of the set $B = \{x \in \mathcal{X}^N : \mu(x) \ge \nu(x)\}$, we can write
$$p \triangleq \sum_{x \in \mathcal{X}^N} \mu(x) \wedge \nu(x) = \mu(B^c) + \nu(B) = 1 - (\mu(B) - \nu(B)) = 1 - \|\mu - \nu\|_{TV}.$$
By the definition of $p$, the function $\frac{\mu \wedge \nu}{p} : \mathcal{X}^N \to [0, 1]$ is a probability distribution on $\mathcal{X}^N$. Let us call this distribution $g_3(x) \triangleq \frac{\mu(x) \wedge \nu(x)}{p}$. We also define the following two functions from the configuration space $\mathcal{X}^N$ to $[0, 1]$ as
$$g_1(x) \triangleq \frac{\mu(x) - \nu(x)}{\|\mu - \nu\|_{TV}} \mathbf{1}_{\{x \in B\}}, \qquad g_2(x) \triangleq \frac{\nu(x) - \mu(x)}{\|\mu - \nu\|_{TV}} \mathbf{1}_{\{x \notin B\}}.$$
From the definition of the set $B$, we can easily verify that $g_1(\mathcal{X}^N) = g_1(B) = g_2(B^c) = g_2(\mathcal{X}^N) = 1$. We define a binary random variable $\xi \in \{0, 1\}$ such that $E\xi = p$, and the conditional distribution of $(X, Y)$ such that
$$P\left((X, Y) = (x, y) \mid \xi\right) = \xi\, g_3(x) \mathbf{1}_{\{x = y\}} + (1 - \xi)\, g_1(x) g_2(y).$$
Since $g_1, g_2, g_3$ are distributions, it follows that
$$P\{(X, Y) = (x, y)\} = p\, g_3(x) \mathbf{1}_{\{x = y\}} + (1 - p)\, g_1(x) g_2(y) \mathbf{1}_{\{x \ne y\}}$$
is a joint distribution function. From the definition of the set $B$, we observe that
$$P\{X = x\} = p\, g_3(x) + (1 - p)\, g_1(x) = \mu(x) \wedge \nu(x) + (\mu(x) - \nu(x)) \mathbf{1}_{\{x \in B\}} = \mu(x),$$
$$P\{Y = y\} = p\, g_3(y) + (1 - p)\, g_2(y) = \mu(y) \wedge \nu(y) + (\nu(y) - \mu(y)) \mathbf{1}_{\{y \notin B\}} = \nu(y).$$
That is, $(X, Y)$ is a coupling of the distributions $\mu, \nu$ and $P\{X \ne Y\} = 1 - p = \|\mu - \nu\|_{TV}$.
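The results of this section can be checked numerically on a small state space. The Python sketch below is only an illustration and is not part of the lecture: the function names (`tv_distance`, `tv_by_events`, `optimal_coupling`, `marginals`) and the three-state example distributions are assumptions made here. It computes the total variation distance as half the sum of absolute differences, compares it with a brute-force maximum over all events, and builds the joint distribution from the proof of Proposition 1.7 (mass $\mu(x) \wedge \nu(x)$ on the diagonal, and the product of the normalized excesses off the diagonal).

```python
from fractions import Fraction
from itertools import chain, combinations


def tv_distance(mu, nu):
    """Half the sum of |mu(x) - nu(x)| over all states (Proposition 1.2)."""
    states = set(mu) | set(nu)
    return sum(abs(mu.get(x, 0) - nu.get(x, 0)) for x in states) / 2


def tv_by_events(mu, nu):
    """Brute-force maximum of |mu(A) - nu(A)| over all events A (Definition 1.1).

    Exponential in the number of states, so only usable for tiny examples.
    """
    states = sorted(set(mu) | set(nu))
    events = chain.from_iterable(
        combinations(states, r) for r in range(len(states) + 1)
    )
    return max(abs(sum(mu.get(x, 0) - nu.get(x, 0) for x in A)) for A in events)


def optimal_coupling(mu, nu):
    """Joint pmf {(x, y): prob} following the construction in Proposition 1.7."""
    states = sorted(set(mu) | set(nu))
    tv = tv_distance(mu, nu)
    joint = {}
    # Diagonal part: p * g3(x) = min(mu(x), nu(x)).
    for x in states:
        common = min(mu.get(x, 0), nu.get(x, 0))
        if common > 0:
            joint[(x, x)] = common
    # Off-diagonal part: (1 - p) * g1(x) * g2(y), which is only needed when tv > 0.
    if tv > 0:
        for x in states:
            excess_x = max(mu.get(x, 0) - nu.get(x, 0), 0)
            for y in states:
                excess_y = max(nu.get(y, 0) - mu.get(y, 0), 0)
                if excess_x > 0 and excess_y > 0:
                    joint[(x, y)] = excess_x * excess_y / tv
    return joint


def marginals(joint):
    """Marginal pmfs of X and Y from a joint pmf."""
    px, py = {}, {}
    for (x, y), w in joint.items():
        px[x] = px.get(x, 0) + w
        py[y] = py.get(y, 0) + w
    return px, py


# Hypothetical example distributions on three states, with exact arithmetic.
mu = {"a": Fraction(1, 2), "b": Fraction(1, 2), "c": Fraction(0)}
nu = {"a": Fraction(1, 4), "b": Fraction(1, 4), "c": Fraction(1, 2)}

tv = tv_distance(mu, nu)
joint = optimal_coupling(mu, nu)
px, py = marginals(joint)
mismatch = sum(w for (x, y), w in joint.items() if x != y)  # P{X != Y}

print(tv, tv_by_events(mu, nu), mismatch)  # prints: 1/2 1/2 1/2
```

Exact rationals from `fractions.Fraction` are used so that the equalities hold without floating-point tolerance. The guard on `tv > 0` covers the case of identical distributions, where the off-diagonal part of the construction is not needed.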
