
Statistics 116 - Fall 2002 Breaking the Code Lecture # 5 – Vigenère – Friedman’s Attack

Jonathan Taylor

1

In this lecture, we will discuss a more “automated” approach to guessing the keyword length in the Vigenère cipher. The procedure is based on what we will call the index of coincidence of two strings of text $S_1$ and $S_2$ of a given length $n$. The index of coincidence is just the proportion of positions in the strings where the corresponding characters of the two strings match. For the strings $S_1 = (s_{11}, s_{12}, \dots, s_{1n})$ and $S_2 = (s_{21}, s_{22}, \dots, s_{2n})$ it is defined as

\[ I(S_1, S_2) = \frac{1}{n} \sum_{i=1}^{n} \delta(s_{1i}, s_{2i}) \]

where the function

\[ \delta(x, y) = \begin{cases} 1 & \text{if } x = y \\ 0 & \text{otherwise.} \end{cases} \]

Now, in our situation we have to imagine that the string we observe is “random”. Suppose then, that $S_1$ and $S_2$ are two random strings generated by choosing letters at random with the frequencies $(p_0, \dots, p_{25})$ and $(q_0, \dots, q_{25})$, respectively. If the strings are random we can talk about the “average” index of coincidence of the two strings $S_1$ and $S_2$. In statistics, if we have a random quantity $X$ we write $E(X)$ for the “expectation” or “average” of $X$. It is not difficult to show that if $S_1$ and $S_2$ are strings of random letters with frequencies $(p_0, \dots, p_{25})$ and $(q_0, \dots, q_{25})$, respectively, then the average index of coincidence $E(I(S_1, S_2))$ of these two strings is given by

\[ \sum_{i=0}^{25} p_i q_i. \tag{1} \]

This is because, for any position in the strings, the probability that the two characters match is the probability that they are both A’s, both B’s, . . . , or both Z’s. If the two strings are generated “independently” of each other, the probability that they are both A’s is $p_0 q_0$. The probability that they are both B’s is $p_1 q_1$, and the total probability that they match is therefore the sum of the $p_i q_i$’s. Now, there is one key property of the index of coincidence that will come in handy later. Namely, if $C$ is any simple substitution cipher then

\[ I(C(S_1), C(S_2)) = I(S_1, S_2). \]

To see why this is true, look at the example in Table 1.

S1:        H E R E W E A R E   N O W
A5,3(S1):  M X K X J X D K X   Q V J
S2:        W H E R E I S I T   N O W
A5,3(S2):  J M X K X R P R U   Q V J

Table 1: Matches do not change when a simple substitution cipher is applied to S1 and S2.
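The invariance can be checked directly on the strings of Table 1. The sketch below is only an illustration: the function names are ours, and it assumes that A5,3 denotes the affine substitution x → (5x + 3) % 26 on letters numbered A = 0, . . . , Z = 25, which does reproduce the ciphertext rows of the table.

```python
def index_of_coincidence(s1: str, s2: str) -> float:
    """Proportion of positions at which two equal-length strings agree."""
    if len(s1) != len(s2) or not s1:
        raise ValueError("strings must be non-empty and of equal length")
    return sum(1 for a, b in zip(s1, s2) if a == b) / len(s1)


def affine_encrypt(text: str, a: int = 5, b: int = 3) -> str:
    """Assumed form of A_{a,b}: map each letter x in A-Z to (a*x + b) % 26."""
    return "".join(chr((a * (ord(ch) - 65) + b) % 26 + 65) for ch in text)


s1 = "HEREWEARENOW"
s2 = "WHEREISITNOW"

# The strings only agree on the final three letters "NOW": 3 matches out of 12.
print(index_of_coincidence(s1, s2))
# A substitution cipher sends equal letters to equal letters and distinct
# letters to distinct letters, so the matching positions are unchanged.
print(index_of_coincidence(affine_encrypt(s1), affine_encrypt(s2)))
```

Both calls count the same three matching positions, because a one-to-one substitution can neither create nor destroy a match.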

2 Friedman’s attack

Friedman’s idea is to take the ciphertext

\[ C = (c_1, \dots, c_n) \]

encrypted with a Vigenère cipher with some keyword $k$ of length $m$, shift it to the right by $l$ to get

\[ C^{(+l)} = (c_1, \dots, c_{n-l}) \]

and then compare it to the last $n - l$ characters of $C$, i.e. he computed

\[ I((c_1, \dots, c_{n-l}), (c_{l+1}, \dots, c_n)). \]

His first claim was that if $l$ is a multiple of the keyword length $m$, then

\[ I((c_1, \dots, c_{n-l}), (c_{l+1}, \dots, c_n)) \simeq E(I(S_1, S_2)) \simeq 0.070 \]

where $S_1$ and $S_2$ are independent strings of “English”. The number 0.070 comes from taking the English letter frequencies in Table 2 (obtained from the King James Bible), squaring them and adding them up, i.e. setting $q = p$ in (1). His second claim was that if $l$ is not a multiple of the keyword length, then

\[ I((c_1, \dots, c_{n-l}), (c_{l+1}, \dots, c_n)) \simeq E(I(V_1, V_2)) \simeq 0.040 \]

where $V_1$ and $V_2$ are independent strings of “English”, each encrypted with a different “typical” Vigenère cipher. Friedman’s attack then consists of the following:

a) For $l = 1, \dots, 100$ (or larger) compute

\[ I((c_1, \dots, c_{n-l}), (c_{l+1}, \dots, c_n)). \]

b) Sort the indices of coincidence and look for common factors in the shifts with the highest indices of coincidence.

c) Use this to guess the keyword length and repeat the attack from Lecture # 4 with known keyword length.
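Steps a) and b) can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for these notes: the function names, the 0-based keyword indexing, and the toy plaintext and keyword are all assumptions made for the example, and the text is assumed to contain only the letters A-Z.

```python
from typing import List, Tuple

A = ord("A")


def vigenere_encrypt(plaintext: str, keyword: str) -> str:
    """Add the keyword letters, repeated cyclically, to the plaintext mod 26."""
    m = len(keyword)
    return "".join(
        chr((ord(t) - A + ord(keyword[i % m]) - A) % 26 + A)
        for i, t in enumerate(plaintext)
    )


def shifted_index(text: str, shift: int) -> float:
    """Index of coincidence of (c_1..c_{n-l}) with (c_{l+1}..c_n), l = shift."""
    n = len(text) - shift
    if shift < 1 or n <= 0:
        raise ValueError("shift must be between 1 and len(text) - 1")
    # zip stops at the shorter string, so exactly n pairs are compared.
    matches = sum(1 for a, b in zip(text, text[shift:]) if a == b)
    return matches / n


def rank_shifts(text: str, max_shift: int) -> List[Tuple[int, float]]:
    """Step a) for every shift, then step b): sort with highest index first."""
    results = [(l, shifted_index(text, l)) for l in range(1, max_shift + 1)]
    return sorted(results, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy input only; a real attack needs a much longer ciphertext.
    plaintext = "HEREWEAREWHEREISITNOW" * 10
    ciphertext = vigenere_encrypt(plaintext, "KEY")
    for shift, index in rank_shifts(ciphertext, 12)[:5]:
        print(shift, round(index, 3))
```

On short texts the ranking is noisy, so the highest-ranked shifts should be read as candidates whose common factors are worth inspecting, not as a definitive answer.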

At first glance, it is not obvious how Friedman arrives at these conclusions (the exact numbers depend on the letter frequencies used, here those of the King James Bible). We will try to explain these two claims now. Let us begin with the first, that is, when $l$ is a multiple of the keyword length. Writing $t_i$ for the plaintext characters and $k_j$ for the keyword characters, with additions taken modulo 26, we know that

\[ I((c_1, \dots, c_{n-l}), (c_{l+1}, \dots, c_n)) = I\big((t_1 + k_1, \dots, t_{n-l} + k_{(n-l)\%m}), (t_{l+1} + k_{(l+1)\%m}, \dots, t_n + k_{n\%m})\big). \tag{2} \]

Now, if $l$ is a multiple of the keyword length $m$, then

\begin{align*}
& I\big((t_1 + k_1, \dots, t_{n-l} + k_{(n-l)\%m}), (t_{l+1} + k_{(l+1)\%m}, \dots, t_n + k_{n\%m})\big) \\
&\quad = I\big((t_1 + k_1, \dots, t_{n-l} + k_{(n-l)\%m}), (t_{l+1} + k_1, \dots, t_n + k_{(n-l)\%m})\big) \\
&\quad = I\big((t_1, \dots, t_{n-l}), (t_{l+1}, \dots, t_n)\big) \\
&\quad \simeq 0.070.
\end{align*}

The first equality follows because $(l + j)\%m = j\%m$ if $l$ is a multiple of the keyword length $m$. The second follows because corresponding characters of the two strings are then shifted by the same keyword letter, so the only way two ciphertext characters can match is when the corresponding plaintext characters match. Now, if $l$ is big enough, say bigger than 3 or 4, then for any $j$ there will be little relation between $t_j$ and $t_{j+l}$, so the last index of coincidence above is like the index of coincidence of two unrelated strings of English. The “law of large numbers” now tells us that, if we have enough data, this index should be close to the average index of coincidence for two random strings of English of length $n - l$, which is approximately 0.070.

To see where the figure 0.040 comes from, we have to think of creating a “random” Vigenère cipher with keyword length $m$. That is, once the keyword length is fixed at $m$, choose $m$ letters at random with the frequencies corresponding to plain English. This gives a random keyword which can be used to make a random Vigenère cipher. Returning to (2), if $l$ is not a multiple of the keyword length then, say, the $i$-th term in the sum used to calculate the index of coincidence is

\[ \delta(t_{i+l} + k_{(i+l)\%m}, \; t_i + k_{i\%m}) \]

where we know that $(i + l)\%m \neq i\%m$. The contribution of this term to the average index of coincidence will be the same as that of every other term, so we only have to compute its contribution. If the $t_i$’s and $k_i$’s are drawn at random from the alphabet with the frequencies of English then each character should be independent of the others, i.e. $t_i$ and $t_{i+l}$ should have no relation to each other, and $k_{i\%m}$ and $k_{(i+l)\%m}$ should also have no relation to each other. In this case, the character on the left and the character on the right are each another “random letter”, but the frequencies used to sample these letters are not those of a typical English letter. However, it turns out that these frequencies can be expressed in terms of those of the English letters. Let us use $L_1$ to represent a letter drawn at random from the alphabet according to the frequencies of English and let $L_2$ be another such letter, drawn independently. Then, if we think of $L_1$ and $L_2$ as elements of $\mathbb{Z}/26$, the characters above should have the same frequencies as

\[ (L_1 + L_2)\%26. \]

These frequencies can be computed as

\[ P(L_1 + L_2 = j) = \sum_{k=0}^{25} P(L_1 = k, \; L_2 = (j - k)\%26) = \sum_{k=0}^{25} p_k \, p_{(j-k)\%26}. \]

These frequencies are written in column 2 of Table 2. To finish the computation, we note that the character on the right above has the same frequency properties as the one on the left, and the two are independent. Therefore, the probability they match is the sum of the squares of these frequencies. If you square the numbers in column 2 of Table 2 and add them up you get 0.040. Another way we can get the figure 0.040 is as follows, using the equality

\[ \delta(t_{i+l} + k_{(i+l)\%m}, \; t_i + k_{i\%m}) = \delta(t_{i+l}, \; t_i + k_{i\%m} - k_{(i+l)\%m}). \]

Therefore, the contribution of this term is the probability that

\[ L_1 = L_2 + L_3 - L_4 \]

where $L_1$, $L_2$, $L_3$ and $L_4$ are drawn independently from the English frequencies. The contribution is therefore the sum

\[ \sum_{i=0}^{25} p_i q_i \]

where $p_i$ are the English frequencies and

\[ q_i = P(L_2 + L_3 - L_4 = i) = \sum_{l=0}^{25} \sum_{k=0}^{25} p_k \, p_{(l-k)\%26} \, p_{(l-i)\%26}. \]

These frequencies are given in column 3 of Table 2. If you multiply the numbers in column 1 by the corresponding numbers in column 3 and add up the products, you also get 0.040.
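The two routes to 0.040 have to agree, and a short derivation shows why. Here $r_j$ is notation introduced only for this remark: it stands for $P(L_1 + L_2 = j)$, the column 2 frequencies of Table 2, and all sums of letters are taken modulo 26.

```latex
\begin{align*}
P(L_1 = L_2 + L_3 - L_4)
  &= P(L_1 + L_4 = L_2 + L_3) \\
  &= \sum_{j=0}^{25} P(L_1 + L_4 = j) \, P(L_2 + L_3 = j) \\
  &= \sum_{j=0}^{25} r_j^2 .
\end{align*}
```

The second line uses the independence of the pair $(L_1, L_4)$ from the pair $(L_2, L_3)$, and the third uses the fact that both sums have the same distribution as $L_1 + L_2$.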

Letter (l)   P(L1 = l)   P(L1 + L2 = l)   P(L1 + L2 − L3 = l)
A            0.103       0.050            0.042
B            0.018       0.033            0.038
C            0.017       0.024            0.036
D            0.061       0.028            0.040
E            0.128       0.049            0.043
F            0.023       0.036            0.037
G            0.015       0.038            0.037
H            0.087       0.054            0.042
I            0.055       0.044            0.039
J            0.003       0.025            0.035
K            0.007       0.035            0.037
L            0.036       0.053            0.039
M            0.027       0.040            0.037
N            0.074       0.032            0.039
O            0.067       0.037            0.040
P            0.011       0.036            0.039
Q            0.000       0.027            0.036
R            0.050       0.045            0.039
S            0.057       0.044            0.040
T            0.090       0.040            0.041
U            0.023       0.035            0.038
V            0.009       0.046            0.036
W            0.021       0.045            0.038
X            0.000       0.036            0.038
Y            0.017       0.033            0.036
Z            0.001       0.034            0.037

Table 2: Frequencies of L1, L1 + L2, L1 + L2 − L3 for independently drawn letters L1,L2,L3.
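The figures 0.070 and 0.040, and columns 2 and 3 of Table 2, can be recomputed from column 1 alone. The following Python sketch does this under the assumption that the rounded column 1 values are used as the English frequencies; the function and variable names are ours, and small differences in the third decimal place are to be expected because of that rounding.

```python
# Column 1 of Table 2 (A..Z), rounded to three decimals as printed.
ENGLISH_FREQS = [
    0.103, 0.018, 0.017, 0.061, 0.128, 0.023, 0.015, 0.087, 0.055,
    0.003, 0.007, 0.036, 0.027, 0.074, 0.067, 0.011, 0.000, 0.050,
    0.057, 0.090, 0.023, 0.009, 0.021, 0.000, 0.017, 0.001,
]


def expected_index(p, q):
    """Formula (1): average index of coincidence, sum of p_i * q_i."""
    return sum(pi * qi for pi, qi in zip(p, q))


def sum_mod26(p, q):
    """Frequencies of (X + Y) % 26 for independent letters X ~ p, Y ~ q."""
    return [sum(p[k] * q[(j - k) % 26] for k in range(26)) for j in range(26)]


def negate_mod26(p):
    """Frequencies of (-X) % 26 for a letter X ~ p."""
    return [p[(-j) % 26] for j in range(26)]


col2 = sum_mod26(ENGLISH_FREQS, ENGLISH_FREQS)        # L1 + L2
col3 = sum_mod26(col2, negate_mod26(ENGLISH_FREQS))   # L1 + L2 - L3

# Shift is a multiple of the keyword length: English against English.
print(round(expected_index(ENGLISH_FREQS, ENGLISH_FREQS), 3))
# Shift is not a multiple: sum of squares of column 2 ...
print(round(expected_index(col2, col2), 3))
# ... or, equivalently, column 1 times column 3.
print(round(expected_index(ENGLISH_FREQS, col3), 3))
```

The last two quantities are equal up to floating-point error, which is the numerical counterpart of the two derivations of 0.040 given in the text.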

3 Example

We revisit the example used in Lecture # 4 and compute the index of coincidence for various shifted versions of the ciphertext. The results, for shifts of size 4 up to 33, are reported in Table 3.

Length=100      Length=150      Length=200      Length=1000
Shift  Index    Shift  Index    Shift  Index    Shift  Index
24     0.119    27     0.079    24     0.074    30     0.068
5      0.113    4      0.074    28     0.070    10     0.053
4      0.085    24     0.073    17     0.066    20     0.051
9      0.083    30     0.071    4      0.062    24     0.051
8      0.081    21     0.068    10     0.060    5      0.046
6      0.077    11     0.065    22     0.056    17     0.046
7      0.066    26     0.064    21     0.055    22     0.046
29     0.063    10     0.064    30     0.055    7      0.044
21     0.063    17     0.063    27     0.052    25     0.044
11     0.059    32     0.061    7      0.051    28     0.043
26     0.053    19     0.054    26     0.051    8      0.043
15     0.050    9      0.054    11     0.047    31     0.043
32     0.038    8      0.053    32     0.047    26     0.042
19     0.038    7      0.052    20     0.046    11     0.042
31     0.036    5      0.050    19     0.045    32     0.041
17     0.036    14     0.049    8      0.045    27     0.041
30     0.033    22     0.047    14     0.042    33     0.041
12     0.030    20     0.044    12     0.041    13     0.040
10     0.029    31     0.044    9      0.039    29     0.040
27     0.028    28     0.041    33     0.038    15     0.039
22     0.022    15     0.040    5      0.037    6      0.037
20     0.020    6      0.034    31     0.037    16     0.037
18     0.019    12     0.028    29     0.036    18     0.036
16     0.017    29     0.028    6      0.032    19     0.035
14     0.016    25     0.025    16     0.029    21     0.035
33     0.000    16     0.020    15     0.029    23     0.034
28     0.000    13     0.019    25     0.025    14     0.034
25     0.000    23     0.012    13     0.021    9      0.034
23     0.000    18     0.011    23     0.016    4      0.034
13     0.000    33     0.000    18     0.015    12     0.031

Table 3: Indices of coincidence for texts of length 100, 150, 200 and 1000. The columns are ranked from highest to lowest indices of coincidence.
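The “look for common factors” step can be made mechanical. The sketch below takes the highest-ranked shifts from the Length=1000 column of Table 3 and computes their greatest common divisor, together with a simple count of how many of the top shifts each candidate length divides. The helper names and the choice of how many top shifts to use are assumptions for illustration; nothing here is claimed to be the procedure used in the lecture.

```python
from functools import reduce
from math import gcd


def common_factor(shifts):
    """Greatest common divisor of a list of shifts."""
    return reduce(gcd, shifts)


def divisor_votes(shifts, max_length):
    """For each candidate length m, count the shifts that are multiples of m."""
    return {
        m: sum(1 for s in shifts if s % m == 0)
        for m in range(2, max_length + 1)
    }


# Three and five highest-ranked shifts in the Length=1000 column of Table 3.
top3 = [30, 10, 20]
top5 = [30, 10, 20, 24, 5]

print(common_factor(top3))            # gcd of 30, 10 and 20
print(divisor_votes(top5, 10))        # votes for candidate lengths 2..10
```

The three highest shifts share the factor 10, so 10 and its divisors are natural candidates for the keyword length. The vote count is more tolerant than the gcd of a stray high-ranking shift such as 24, which would otherwise drag the common factor down.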
