
Statistics 116 - Fall 2002
Breaking the Code
Lecture # 5 – Vigenère Cipher – Friedman’s Attack
Jonathan Taylor

1 Index of Coincidence

In this lecture, we will discuss a more “automated” approach to guessing the keyword length in the Vigenère cipher. The procedure is based on what we will call the index of coincidence of two strings of text $S_1$ and $S_2$ of a given length $n$. The index of coincidence is just the proportion of positions at which the corresponding characters of the two strings match. For the strings $S_1 = (s_{11}, s_{12}, \dots, s_{1n})$ and $S_2 = (s_{21}, s_{22}, \dots, s_{2n})$ it is defined as

\[ I(S_1, S_2) = \frac{1}{n} \sum_{i=1}^{n} \delta(s_{1i}, s_{2i}) \]

where the function

\[ \delta(x, y) = \begin{cases} 1 & \text{if } x = y \\ 0 & \text{otherwise.} \end{cases} \]

Now, in our situation we have to imagine that the string we observe is “random”. Suppose, then, that $S_1$ and $S_2$ are two random strings generated by choosing letters at random with the frequencies $(p_0, \dots, p_{25})$ and $(q_0, \dots, q_{25})$, respectively. If the strings are random we can talk about the “average” index of coincidence of the two strings $S_1$ and $S_2$. In statistics, if we have a random quantity $X$ we write $E(X)$ for the “expectation” or “average” of $X$. It is not difficult to show that if $S_1$ and $S_2$ are strings of random letters with frequencies $(p_0, \dots, p_{25})$ and $(q_0, \dots, q_{25})$, respectively, then the average index of coincidence of these two strings is given by

\[ E(I(S_1, S_2)) = \sum_{i=0}^{25} p_i q_i. \qquad (1) \]

This is because, at any position in the strings, the probability that the two characters match is the probability that they are both A’s, both B’s, ..., or both Z’s. The probability that they are both A’s, if the two strings are generated “independently” of each other, is $p_0 q_0$. The probability that they are both B’s is $p_1 q_1$, and the total probability that they match is therefore the sum of the $p_i q_i$’s.

Now, there is one key property of the index of coincidence that will come in handy later. Namely, if $C$ is any simple substitution cipher then

\[ I(C(S_1), C(S_2)) = I(S_1, S_2). \]

To see why this is true, look at the example in Table 1.
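As a quick numerical check of the definition and of the invariance property, here is a minimal Python sketch. The function names `index_of_coincidence` and `affine_encrypt` are our own, and reading $A_{5,3}$ as the map $x \mapsto (5x + 3)\,\%\,26$ is an assumption about the notation; it is consistent with the ciphertext rows of Table 1.

```python
def index_of_coincidence(s1, s2):
    """Proportion of positions at which two equal-length strings agree."""
    if len(s1) != len(s2):
        raise ValueError("strings must have the same length")
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)


def affine_encrypt(text, a=5, b=3):
    """Simple substitution x -> (a*x + b) % 26 on the letters A..Z.

    Assumed reading of the notation A_{5,3}; 'a' must be coprime to 26.
    """
    return "".join(chr((a * (ord(c) - 65) + b) % 26 + 65) for c in text)


# The two plaintext strings of Table 1, with the space removed.
s1 = "HEREWEARENOW"
s2 = "WHEREISITNOW"

ic_plain = index_of_coincidence(s1, s2)
ic_cipher = index_of_coincidence(affine_encrypt(s1), affine_encrypt(s2))
print(affine_encrypt(s1), affine_encrypt(s2))
print(ic_plain, ic_cipher)
```

Both indices come out equal because a substitution cipher sends equal letters to equal letters and different letters to different letters, so the set of matching positions is unchanged.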
S1:            H E R E W E A R E   N O W
A_{5,3}(S1):   M X K X J X D K X   Q V J
S2:            W H E R E I S I T   N O W
A_{5,3}(S2):   J M X K X R P R U   Q V J

Table 1: Matches do not change when a simple substitution cipher is applied to S1 and S2.

In both pairs the strings agree in exactly the last three positions, because a substitution cipher sends equal letters to equal letters and different letters to different letters.

2 Friedman’s attack

Friedman’s idea is to take the ciphertext $C = (c_1, \dots, c_n)$, encrypted with a Vigenère cipher with some keyword $k$ of length $m$, shift it to the right by $l$ to get

\[ C^{(+l)} = (c_1, \dots, c_{n-l}), \]

and then compare it to the last $n - l$ characters of $C$, i.e. he computed

\[ I((c_1, \dots, c_{n-l}), (c_{l+1}, \dots, c_n)). \]

His first claim was that if $l$ is a multiple of the keyword length $m$, then

\[ I((c_1, \dots, c_{n-l}), (c_{l+1}, \dots, c_n)) \approx E(I(S_1, S_2)) \approx 0.070 \]

where $S_1$ and $S_2$ are independent strings of “English”. The number 0.070 comes from taking the frequencies in Table 2 (obtained from the King James Bible), squaring them and adding them up, i.e. setting $q = p$ in (1).

His second claim was that if $l$ is not a multiple of the keyword length, then

\[ I((c_1, \dots, c_{n-l}), (c_{l+1}, \dots, c_n)) \approx E(I(V_1, V_2)) \approx 0.040 \]

where $V_1$ and $V_2$ are independent strings of “English”, each encrypted with a different “typical” Vigenère cipher.

Friedman’s attack then consists of the following:

a) For $l = 1, \dots, 100$ (or larger) compute $I((c_1, \dots, c_{n-l}), (c_{l+1}, \dots, c_n))$.

b) Sort the indices of coincidence and look for common factors in the shifts with the highest indices of coincidence.

c) Use this to guess the keyword length and repeat the attack in Lecture # 4 with known keyword length.

At first glance, it is not obvious how Friedman arrives at these conclusions (the exact numbers quoted here correspond to the letter frequencies in the King James Bible). We will try to explain these two claims now. Let us begin with the first, that is, when $l$ is a multiple of the keyword length. Writing $(t_1, \dots, t_n)$ for the plaintext, we know that

\[ I((c_1, \dots, c_{n-l}), (c_{l+1}, \dots, c_n)) = I\big((t_1 + k_1, \dots, t_{n-l} + k_{(n-l)\%m}),\ (t_{l+1} + k_{(l+1)\%m}, \dots, t_n + k_{n\%m})\big). \qquad (2) \]

Now, if $l$ is a multiple of the keyword length $m$, then

\[
\begin{aligned}
& I\big((t_1 + k_1, \dots, t_{n-l} + k_{(n-l)\%m}),\ (t_{l+1} + k_{(l+1)\%m}, \dots, t_n + k_{n\%m})\big) \\
&\quad = I\big((t_1 + k_1, \dots, t_{n-l} + k_{(n-l)\%m}),\ (t_{l+1} + k_1, \dots, t_n + k_{(n-l)\%m})\big) \\
&\quad = I\big((t_1, \dots, t_{n-l}),\ (t_{l+1}, \dots, t_n)\big) \\
&\quad \approx 0.070.
\end{aligned}
\]

The first equality follows because $(l + j)\%m = j\%m$ if $l$ is a multiple of the keyword length $m$. The second follows from the fact that, in the second line, the two strings are encrypted with the same key letter at each position, so two ciphertext characters match exactly when the corresponding plaintext characters match. Now, if $l$ is big enough, say bigger than 3 or 4, then for any $j$ there will be little relation between $t_j$ and $t_{j+l}$, so the third line above is like the index of coincidence of two unrelated strings of English. The “law of large numbers” now tells us that, if we have enough data, this third line should be close to the average index of coincidence for two random strings of English of length $n - l$, which is approximately 0.070.

To see where the figure 0.040 comes from, we have to think of creating a “random” Vigenère cipher of length $m$. That is, once the keyword length is fixed at $m$, choose $m$ letters at random with the frequencies corresponding to plain English. This gives a random keyword which can be used to make a random Vigenère cipher. Returning to (2), if $l$ is not a multiple of the keyword length then the $i$-th term in the sum used to calculate the index of coincidence is

\[ \delta(t_{i+l} + k_{(i+l)\%m},\ t_i + k_{i\%m}) \]

where we know that $(i + l)\%m \neq i\%m$. The contribution of this term to the average index of coincidence will be the same as that of every other term, so we only have to compute its contribution. If the $t_i$’s and $k_i$’s are drawn at random from the alphabet with the frequencies of English then each character should be independent of the others, i.e. $t_i$ and $t_{i+l}$ should have no relation to each other, and $k_{i\%m}$ and $k_{(i+l)\%m}$ should also have no relation to each other. In this case, both the character on the left and the character on the right are “random letters”, but the frequencies used to sample them are not those of a typical English letter.
However, it turns out that these frequencies can be expressed in terms of those of the English letters. Let us use $L_1$ to represent a letter drawn at random from the alphabet according to the frequencies of English and let $L_2$ be another such letter, drawn independently. Then, if we think of $L_1$ and $L_2$ as elements of $\mathbb{Z}/26$, the characters above should have the same frequencies as $(L_1 + L_2)\%26$. These frequencies can be computed as

\[ P(L_1 + L_2 = j) = \sum_{k=0}^{25} P(L_1 = k,\ L_2 = (j - k)\%26) = \sum_{k=0}^{25} p_k\, p_{(j-k)\%26}. \]

These frequencies are written in column 2 of Table 2. To finish the computation, we note that the character on the right above has the same frequency properties as the one on the left, and the two are independent. Therefore, the probability they match is the sum of the frequencies squared. If you square the numbers in column 2 of Table 2 and add them up you get 0.040.

Another way we can get the figure 0.040 is as follows, using the equality

\[ \delta(t_{i+l} + k_{(i+l)\%m},\ t_i + k_{i\%m}) = \delta(t_{i+l},\ t_i + k_{i\%m} - k_{(i+l)\%m}). \]

Therefore, the contribution of this term is the probability that $L_1 = L_2 + L_3 - L_4$, where $L_1, L_2, L_3$ and $L_4$ are drawn independently from the English frequencies. The contribution is therefore the sum

\[ \sum_{i=0}^{25} p_i q_i \]

where the $p_i$ are the English frequencies and

\[ q_i = P(L_2 + L_3 - L_4 = i) = \sum_{l=0}^{25} \sum_{k=0}^{25} p_k\, p_{(l-k)\%26}\, p_{(l-i)\%26}. \]

These frequencies are given in column 3 of Table 2. If you multiply the numbers in column 1 by the numbers in column 3 and add up the products, you also get 0.040.
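The two computations above can be reproduced numerically. The following Python sketch starts from the English frequencies in column 1 of Table 2 and builds the other two columns by convolution in $\mathbb{Z}/26$; the helper names (`add_mod26`, `negate_mod26`, `match_probability`) are our own. Because the table entries are rounded to three decimals, the results should only be expected to agree with 0.070 and 0.040 approximately.

```python
# English letter frequencies for A..Z, copied from column 1 of Table 2.
P = [0.103, 0.018, 0.017, 0.061, 0.128, 0.023, 0.015, 0.087, 0.055,
     0.003, 0.007, 0.036, 0.027, 0.074, 0.067, 0.011, 0.000, 0.050,
     0.057, 0.090, 0.023, 0.009, 0.021, 0.000, 0.017, 0.001]


def add_mod26(p, q):
    """Distribution of (X + Y) % 26 for independent X ~ p and Y ~ q."""
    return [sum(p[k] * q[(j - k) % 26] for k in range(26)) for j in range(26)]


def negate_mod26(p):
    """Distribution of (-X) % 26 for X ~ p."""
    return [p[(-i) % 26] for i in range(26)]


def match_probability(p, q):
    """Probability that independent letters with frequencies p and q agree."""
    return sum(a * b for a, b in zip(p, q))


# Shift is a multiple of the keyword length: sum of squared English frequencies.
ic_english = match_probability(P, P)

# Column 2: frequencies of L1 + L2; the match probability is its sum of squares.
p_sum = add_mod26(P, P)
ic_vigenere = match_probability(p_sum, p_sum)

# Column 3: frequencies of L1 + L2 - L3, paired with column 1 (second derivation).
p_sum_diff = add_mod26(p_sum, negate_mod26(P))
ic_vigenere_alt = match_probability(P, p_sum_diff)

print(round(ic_english, 3), round(ic_vigenere, 3), round(ic_vigenere_alt, 3))
```

The two routes to the second figure agree because the event $L_1 = L_2 + L_3 - L_4$ is the same as the event $L_1 + L_4 = L_2 + L_3$, whose probability is the sum of the squared entries of column 2.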
Letter (l)   P(L = l)   P(L1 + L2 = l)   P(L1 + L2 − L3 = l)
A            0.103      0.050            0.042
B            0.018      0.033            0.038
C            0.017      0.024            0.036
D            0.061      0.028            0.040
E            0.128      0.049            0.043
F            0.023      0.036            0.037
G            0.015      0.038            0.037
H            0.087      0.054            0.042
I            0.055      0.044            0.039
J            0.003      0.025            0.035
K            0.007      0.035            0.037
L            0.036      0.053            0.039
M            0.027      0.040            0.037
N            0.074      0.032            0.039
O            0.067      0.037            0.040
P            0.011      0.036            0.039
Q            0.000      0.027            0.036
R            0.050      0.045            0.039
S            0.057      0.044            0.040
T            0.090      0.040            0.041
U            0.023      0.035            0.038
V            0.009      0.046            0.036
W            0.021      0.045            0.038
X            0.000      0.036            0.038
Y            0.017      0.033            0.036
Z            0.001      0.034            0.037

Table 2: Frequencies of L1, L1 + L2, L1 + L2 − L3 for independently drawn letters L1, L2, L3.
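Steps (a) and (b) of Friedman’s attack can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function names, the sample plaintext and the keyword `CODE` are all hypothetical, and the text is restricted to the letters A..Z. With a ciphertext this short the indices are noisy, so the ranking should be read as a hint rather than a reliable answer; the notes suggest shifts up to 100 or more on a long ciphertext.

```python
def vigenere_encrypt(plain, key):
    """Vigenère encryption of an A..Z string with an A..Z keyword."""
    return "".join(
        chr((ord(c) - 65 + ord(key[i % len(key)]) - 65) % 26 + 65)
        for i, c in enumerate(plain)
    )


def shifted_ic(text, shift):
    """Index of coincidence between text[:-shift] and text[shift:]."""
    n = len(text) - shift
    if shift < 1 or n < 1:
        raise ValueError("shift must satisfy 1 <= shift < len(text)")
    return sum(text[i] == text[i + shift] for i in range(n)) / n


def rank_shifts(cipher, max_shift):
    """Steps (a) and (b): (index, shift) pairs sorted by decreasing index."""
    scores = [(shifted_ic(cipher, l), l) for l in range(1, max_shift + 1)]
    return sorted(scores, reverse=True)


# Hypothetical example data, not taken from the lecture.
plain = ("THEINDEXOFCOINCIDENCEISJUSTTHEPROPORTIONOFPOSITIONS"
         "INTHESTRINGSWHERETHECORRESPONDINGCHARACTERSMATCH")
key = "CODE"
cipher = vigenere_encrypt(plain, key)

# Look for a common factor among the best-scoring shifts.
for score, shift in rank_shifts(cipher, 12)[:4]:
    print(shift, round(score, 3))
```

When the shift is a multiple of the keyword length, `shifted_ic(cipher, shift)` equals `shifted_ic(plain, shift)` exactly, which is the first claim of Section 2 in miniature.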