
Statistics 116 - Fall 2002, Breaking the Code, Lecture #1 – Simple Substitution

1 Substitution Ciphers

In this class, we will be discussing general ways of encrypting and decrypting strings of text. We will generally have two kinds of text: C – cipher text – and T – plain text. We can represent a string of text $\tilde T$ (either cipher text or plain text) of length n as

$$(\tilde t_1, \tilde t_2, \ldots, \tilde t_n) \in \{(s_1, s_2, \ldots, s_n) : 0 \le s_i \le 25,\ s_i \in \mathbb{Z}\}.$$

An encryption function En is a function that takes plain text T and produces cipher text C, i.e. C = En(T). In general, given a sequence T of plain text of length n, En(T) does not have to be of length n, but for our purposes it will be (at least for a little while). Another way of describing all the possible strings is to note that a string of length n is, ignoring spaces and case, an element of

$(\mathbb{Z}/26)^n$, where $\mathbb{Z}/26$ denotes the integers mod 26. Therefore a cipher, for our purposes, is a map $E : (\mathbb{Z}/26)^n \to (\mathbb{Z}/26)^n$. The simplest ciphers are simple substitution ciphers, given by a fixed map $\pi : \mathbb{Z}/26 \to \mathbb{Z}/26$ applied to each coordinate, i.e.

$$E((t_1, t_2, \ldots, t_n)) = (\pi(t_1), \pi(t_2), \ldots, \pi(t_n)).$$

Another requirement we have of ciphers is that we are able to decrypt encrypted messages. For our purposes, this means that we will insist that E has an inverse, which we denote by De – for decryption. What we mean by an inverse can be expressed as

$$De(En(T)) = T \quad \forall\, T,$$

Plain:  ABCDEFGHIJKLMNOPQRSTUVWXYZ
Cipher: EFGHIJKLMNOPQRSTUVWXYZABCD

Table 1: Shift cipher with a = 4

i.e. that if we encrypt any plaintext and then decrypt it, we end up with the same plaintext. In terms of the simple substitution cipher, this means that π is a permutation of {0, ..., 25}, i.e. a rearrangement of {0, ..., 25}. Therefore, there are

$$26! \approx 4.03 \times 10^{26}$$

possible simple substitution ciphers. Obviously, an exhaustive search of the simple substitution ciphers is impossible. We will start off by defining some simpler families of ciphers and trying to break them.
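As a concrete sketch of these definitions, here is a minimal Python implementation of a simple substitution cipher; the function names (`random_key`, `encrypt`, `decrypt`) are my own, not from the notes.

```python
import random
import string

ALPHABET = string.ascii_uppercase

def random_key(seed=0):
    """Draw one of the 26! possible permutations of the alphabet."""
    rng = random.Random(seed)
    letters = list(ALPHABET)
    rng.shuffle(letters)
    return "".join(letters)

def encrypt(plaintext, key):
    """Apply the permutation pi letter by letter; non-letters pass through."""
    return plaintext.upper().translate(str.maketrans(ALPHABET, key))

def decrypt(ciphertext, key):
    """Apply the inverse permutation, so that De(En(T)) = T."""
    return ciphertext.translate(str.maketrans(key, ALPHABET))

key = random_key()
c = encrypt("THIS IS A STATISTICS CLASS", key)
assert decrypt(c, key) == "THIS IS A STATISTICS CLASS"  # De(En(T)) = T
```

The final assertion checks the invertibility requirement stated above.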

2 Shift or Caesar Ciphers

These ciphers go back to Caesar's time and are based on a one-parameter family of permutations which, as their name suggests, consists of "shifting" the letters of the alphabet. Table 1 demonstrates how these shift ciphers are implemented.

Mathematically, we can express a shift cipher with shift a as:

$$\pi_a(x) = (x + a) \bmod 26, \quad \text{or} \quad En_a((t_1, \ldots, t_n)) = ((t_1 + a) \bmod 26, \ldots, (t_n + a) \bmod 26).$$

Clearly, there are 26 shift ciphers (one of which, a = 0, has no effect). We will next describe how to use a statistical tool – frequency analysis – to break these codes.
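The shift cipher formula can be sketched in Python as follows (the helper names are mine; non-letters such as spaces are passed through unchanged):

```python
def shift_encrypt(text, a):
    """En_a: shift each letter by a (mod 26); non-letters pass through."""
    out = []
    for ch in text.upper():
        if ch.isalpha():
            out.append(chr((ord(ch) - ord("A") + a) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)

def shift_decrypt(text, a):
    """De_a is just the shift by -a."""
    return shift_encrypt(text, -a)

print(shift_encrypt("THIS IS A STATISTICS CLASS", 5))
# YMNX NX F XYFYNXYNHX HQFXX
```

This reproduces the example used later in the notes.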

3 Frequency analysis – a probabilistic model for “English”

To use statistics to break these shift ciphers (or any of these substitution ciphers) we need a "probabilistic model" of English texts. The simplest model of English text is to make random words by drawing letters from the alphabet. To make sure that the text "resembles" English, we want the frequencies of individual letters in our "random" text strings to match those of English. To begin with, we need to estimate the frequency of letters in "typical" English. One way to do this is to look around on the web for large texts and

compute the frequencies of the individual letters from this text. This gives a vector $(p_0, \ldots, p_{25})$ with

$$0 \le p_i \le 1, \qquad \sum_{i=0}^{25} p_i = 1.$$

For instance, the King James version of the Bible, found at http://www.patriot.net/users/bmcgin/kjv12.zip, gives the frequencies:

(0.085, 0.015, 0.017, 0.049, 0.127, 0.026, 0.017, 0.087, 0.060, 0.003, 0.007, 0.040, 0.025, 0.069, 0.075, 0.013, 0.000, 0.052, 0.059, 0.098, 0.026, 0.009, 0.020, 0.000, 0.018, 0.001).
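Estimating such a frequency vector from a corpus is straightforward; this sketch uses Python's `collections.Counter` (the function name is mine):

```python
from collections import Counter

def letter_frequencies(text):
    """Return the vector (p_0, ..., p_25) of relative letter frequencies,
    ignoring case and anything that is not A-Z."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    counts = Counter(letters)
    return [counts[chr(ord("A") + i)] / len(letters) for i in range(26)]

freqs = letter_frequencies("In the beginning God created the heaven and the earth.")
```

Run on the full King James text, this would produce the vector above.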

Given these frequencies, to generate a random text of length n with these frequencies we do the following: a) for each of the positions 1 to n, choose a number from 0 to 25, choosing the i-th letter with probability p_i; b) concatenate the letters chosen above to give a string of length n.

The text generated is not close to English, but remember that the only quantity we were trying to match was the frequency of letters, and English is much more complex. For example, using the above frequencies, the following 50-character string was generated (the space was treated as a 27-th character, so we were dealing with $\mathbb{Z}/27$ instead of $\mathbb{Z}/26$):

O EIU MHH NINH FETT LONC HSS USL Y ETODRDHLLYG EA.

We will see in later lectures better ways to generate "random" English text. Although this model is not very useful for generating English text, it is quite useful for breaking codes. In fact, this was one of the first uses of statistics: it was used in the 9th century by the Arab cryptologist al-Kindi (see The Code Book). The way the model is used is in terms of what statisticians call "likelihood", which we use to estimate the shift parameter by "maximum likelihood." Specifically, given a model for "random" English text, we can compute the probability of observing a given English text T. In our simple model, this probability will be

$$P(T) = P(T_1 = t_1, \ldots, T_n = t_n) = p_{t_1} \times \cdots \times p_{t_n} = \prod_{i=0}^{25} p_i^{\#\{t \in T : t = i\}}.$$
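The two-step generation procedure described above can be sketched with `random.choices`, which draws letters with the given weights. For simplicity this stays in $\mathbb{Z}/26$ rather than the 27-character variant with spaces, and the seed is an arbitrary choice of mine:

```python
import random

# Letter frequencies from the King James Bible (copied from the notes).
p = [0.085, 0.015, 0.017, 0.049, 0.127, 0.026, 0.017, 0.087, 0.060,
     0.003, 0.007, 0.040, 0.025, 0.069, 0.075, 0.013, 0.000, 0.052,
     0.059, 0.098, 0.026, 0.009, 0.020, 0.000, 0.018, 0.001]

def random_text(n, freqs, seed=None):
    """Steps a) and b): draw n letters independently, letter i with
    probability p_i, and concatenate them into a string of length n."""
    rng = random.Random(seed)
    letters = [chr(ord("A") + i) for i in range(26)]
    return "".join(rng.choices(letters, weights=freqs, k=n))

sample = random_text(50, p, seed=116)
```

Since $p_Q$ and $p_X$ round to 0.000 in the published vector, the letters Q and X never appear in the generated text.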

Plain:  THIS IS A STATISTICS CLASS
Cipher: YMNX NX F XYFYNXYNHX HQFXX

Table 2: Shift cipher with a = 5

Now, in breaking codes we do not actually observe the plain text T, just the cipher text C = En(T). If we know that the cipher used is a shift cipher, then for each shift a we can compute the probability of observing $De_a(C)$ (the decryption of C using the inverse of the shift cipher with shift a). This gives us the likelihood:

(Likelihood) The likelihood of the shift a is given by

$$L(a|C) = P(De_a(C)).$$

That is, we decrypt with the inverse cipher to get $De_a(C)$ and compute the probability of observing that string as "plaintext."

To standardize things, I will be reporting a scaled version of $\log L(a|C)$; specifically, I will report

$$\tilde l(a|C) = \frac{1}{\mathrm{length}(C)} \log L(a|C) = \sum_{i=0}^{25} \frac{\#\{t \in De_a(C) : t = i\}}{\mathrm{length}(C)} \log p_i.$$

The likelihood, as a function of the shift a, is the probability of observing $De_a(C)$ if the shift is a, in which case $De_a(C)$ is plaintext, and hence its letter frequencies should match those of English. Shifts with high likelihood, as the name suggests, are "more likely" to be the shift associated with the cipher. Specifically, we define the maximum likelihood estimate of the shift a as follows.

(Maximum Likelihood Estimation) The maximum likelihood estimate of the shift a is given by

$$\hat a(C) = \arg\max_a \tilde l(a|C).$$

That is, it is the value of a that maximizes $\tilde l(a|C)$ as a function of a (with C fixed).

Let's look at an example. Suppose we observe the message "THIS IS A STATISTICS CLASS" encrypted with a shift cipher of shift 5, as in Table 2. In Table 3 we compute $\tilde l(a|C)$ for all values of a. In this example, the maximum likelihood estimate is not actually the correct shift, but the true shift is the second "most likely".

Decrypted Text              Shift  $\tilde l(a|C)$   $q_a$

ESTD TD L DELETDETND NWLDD    20      -2.70         0.934
THIS IS A STATISTICS CLASS     5      -2.82         0.066
PDEO EO W OPWPEOPEYO YHWOO     9      -3.19         0.000
FTUE UE M EFMFUEFUOE OXMEE    19      -3.22         0.000
MABL BL T LMTMBLMBVL VETLL    12      -3.41         0.000
OCDN DN V NOVODNODXN XGVNN    10      -3.54         0.000
NBCM CM U MNUNCMNCWM WFUMM    11      -3.58         0.000
UIJT JT B TUBUJTUJDT DMBTT     4      -3.62         0.000
DRSC SC K CDKDSCDSMC MVKCC    21      -3.68         0.000
SGHR HR Z RSZSHRSHBR BKZRR     6      -3.69         0.000
LZAK AK S KLSLAKLAUK UDSKK    13      -3.80         0.000
IWXH XH P HIPIXHIXRH RAPHH    16      -3.86         0.000
WKLV LV D VWDWLVWLFV FODVV     2      -3.88         0.000
GUVF VF N FGNGVFGVPF PYNFF    18      -3.93         0.000
VJKU KU C UVCVKUVKEU ENCUU     3      -4.09         0.000
ZNOY OY G YZGZOYZOIY IRGYY    25      -4.20         0.000
HVWG WG O GHOHWGHWQG QZOGG    17      -4.22         0.000
CQRB RB J BCJCRBCRLB LUJBB    22      -4.26         0.000
BPQA QA I ABIBQABQKA KTIAA    23      -4.27         0.000
XLMW MW E WXEXMWXMGW GPEWW     1      -4.30         0.000
AOPZ PZ H ZAHAPZAPJZ JSHZZ    24      -4.66         0.000
JXYI YI Q IJQJYIJYSI SBQII    15      -4.77         0.000
YMNX NX F XYFYNXYNHX HQFXX     0      -4.97         0.000
KYZJ ZJ R JKRKZJKZTJ TCRJJ    14      -5.10         0.000
RFGQ GQ Y QRYRGQRGAQ AJYQQ     7      -5.39         0.000
QEFP FP X PQXQFPQFZP ZIXPP     8      -5.69         0.000

Table 3: Values of $\tilde l(a|C)$ for all values of a, ranked by $\tilde l(a|C)$.
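The computation behind Table 3 can be sketched in Python, using the King James frequencies listed earlier. The $10^{-6}$ floor for the zero frequencies of Q and X is my own choice; the notes do not say how zeros were handled there:

```python
import math

# Letter frequencies from the King James Bible (copied from the notes).
p = [0.085, 0.015, 0.017, 0.049, 0.127, 0.026, 0.017, 0.087, 0.060,
     0.003, 0.007, 0.040, 0.025, 0.069, 0.075, 0.013, 0.000, 0.052,
     0.059, 0.098, 0.026, 0.009, 0.020, 0.000, 0.018, 0.001]

def scaled_log_likelihood(cipher, a, freqs=p):
    """l~(a|C): the average log-probability of De_a(C) under the letter
    model.  Zero frequencies get a small floor so the log is defined."""
    letters = [ch for ch in cipher.upper() if ch.isalpha()]
    total = sum(math.log(max(freqs[(ord(ch) - ord("A") - a) % 26], 1e-6))
                for ch in letters)
    return total / len(letters)

def mle_shift(cipher, freqs=p):
    """The maximum likelihood estimate: the a maximizing l~(a|C)."""
    return max(range(26), key=lambda a: scaled_log_likelihood(cipher, a, freqs))

print(mle_shift("YMNX NX F XYFYNXYNHX HQFXX"))
# 20 -- the surprising Table 3 result: the MLE is not the true shift, 5
```

With the rounded published frequencies, shift 20 scores about -2.75 and the true shift 5 about -2.80, matching the top of Table 3.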

4 Bayesian Interpretation

We have used the term "most likely" somewhat loosely here; we would like the numbers L(a|C) to have an interpretation as "the probability that the TRUE shift that was used to encrypt C is equal to a." However, as they are now, these numbers do not have that interpretation. The way to interpret these numbers as probabilities is as a Bayesian statistician would. Though we have not defined what a "Bayesian statistician" is, we can still talk about how they would interpret these numbers.

We want to define a way to assign the probability that the shift is equal to a. That is, we want to find numbers $(q_0, \ldots, q_{25})$ with

$$0 \le q_a \le 1, \qquad \sum_{a=0}^{25} q_a = 1.$$

Consider the following definition:

$$q_a = \frac{L(a|C)}{\sum_{b=0}^{25} L(b|C)}.$$

It is easy to check that these numbers satisfy what is needed to make a statement like "the probability that a is 3 is 0.4."

A Bayesian statistician would call the number $q_a$ the posterior probability that the TRUE shift is a given the data C. The term posterior is used to distinguish it from prior probabilities, which are another fundamental part of Bayesian statistics. The prior probabilities reflect the beliefs of the statistician as to the probability that the shift is a. For instance, suppose you knew that your enemy was lazy and preferred to use a shift between 0 and 9 because it only required typing 1 digit instead of 2. In this case, the shifts between 0 and 9 should be more likely to be the ones used to encrypt a given message. These beliefs, as mentioned above, are reflected in terms of prior probabilities. In the above example, we might insist that shifts between 0 and 9 are twice as likely as shifts between 10 and 25. This means that, for instance, the prior probability that the shift is 7 would be 1/18 and the prior probability that it is 15 would be 1/36. Given these prior probabilities $(\tilde p_0, \ldots, \tilde p_{25})$, the Bayesian forms posterior probabilities by computing

$$q_a = \frac{L(a|C)\,\tilde p_a}{\sum_{b=0}^{25} L(b|C)\,\tilde p_b}.$$

In our earlier example, we, as Bayesian statisticians, assumed that the prior was uniform, i.e. that there was a 1/26 chance that the TRUE shift was a. These posterior probabilities are reported in Table 3.
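Converting log-likelihoods into posterior probabilities can be sketched as follows; the helper name and the log-sum-exp normalization are my own choices (the latter is just a numerically stable way to divide, not something the notes discuss):

```python
import math

def posterior_probs(loglikes, prior=None):
    """Turn unscaled log-likelihoods log L(a|C) into posteriors
    q_a = L(a|C) p~_a / sum_b L(b|C) p~_b, with a uniform prior by
    default, normalizing via log-sum-exp for numerical stability."""
    n = len(loglikes)
    if prior is None:
        prior = [1.0 / n] * n            # the uniform prior used in Table 3
    logs = [ll + math.log(pa) for ll, pa in zip(loglikes, prior)]
    m = max(logs)
    weights = [math.exp(x - m) for x in logs]
    total = sum(weights)
    return [w / total for w in weights]

# Unscaled log-likelihoods are length(C) * l~(a|C); for Table 3's top two
# shifts (22 letters, l~ of -2.70 and -2.82) this gives roughly (0.93, 0.07).
q = posterior_probs([22 * -2.70, 22 * -2.82])
```

With the lazy-enemy prior from the text, one would instead pass `prior=[2/36]*10 + [1/36]*16`.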

5 More Data – Law of Large Numbers

Our example above only had 22 letters in it, and the maximum likelihood estimate of the shift was incorrect. Suppose now that we have more data; what happens to our procedure? We will see that maximum likelihood estimation gets better and better, and if we compute the above posterior probabilities, we see that the posterior probability that the maximum likelihood estimate is the TRUE shift converges to 1.

Decrypted Text     Shift  $\tilde l(a|C)$   $q_a$

WASHINGTON, SEPT.     5      -2.93         1.000
JNFUVATGBA, FRCG.    18      -3.76         0.000
KOGVWBUHCB, GSDH.    17      -3.76         0.000
SWODEJCPKJ, OALP.     9      -3.80         0.000
HLDSTYREZY, DPAE.    20      -3.81         0.000
DHZOPUNAVU, ZLWA.    24      -3.84         0.000
QUMBCHANIH, MYJN.    11      -3.84         0.000
XBTIJOHUPO, TFQU.     4      -3.86         0.000
LPHWXCVIDC, HTEI.    16      -3.86         0.000
AEWLMRKXSR, WITX.     1      -3.98         0.000
OSKZAFYLGF, KWHL.    13      -3.98         0.000
VZRGHMFSNM, RDOS.     6      -4.01         0.000
GKCRSXQDYX, COZD.    21      -4.01         0.000
PTLABGZMHG, LXIM.    12      -4.04         0.000
ZDVKLQJWRQ, VHSW.     2      -4.08         0.000
UYQFGLERML, QCNR.     7      -4.19         0.000
FJBQRWPCXW, BNYC.    22      -4.20         0.000
RVNCDIBOJI, NZKO.    10      -4.21         0.000
NRJYZEXKFE, JVGK.    14      -4.21         0.000
MQIXYDWJED, IUFJ.    15      -4.24         0.000
BFXMNSLYTS, XJUY.     0      -4.27         0.000
CGYNOTMZUT, YKVZ.    25      -4.28         0.000
YCUJKPIVQP, UGRV.     3      -4.28         0.000
IMETUZSFAZ, EQBF.    19      -4.35         0.000
EIAPQVOBWV, AMXB.    23      -4.47         0.000
TXPEFKDQLK, PBMQ.     8      -4.58         0.000

Table 4: Values of $\tilde l(a|C)$ for all values of a, ranked by $\tilde l(a|C)$ – New York Times example.

The text is taken from the New York Times web page on September 25; specifically, the URL is http://www.nytimes.com/2002/09/25/international/middleeast/25ASSE.html. There are 4754 characters in the text, and the results are shown in Table 4.

Clearly, as the size of the ciphertext grows, so should our confidence in the maximum likelihood estimate of the shift. It is not foolproof, but it does provide a way of ranking "how likely" it is that the shift is a given value a, once we observe the cipher text C. This improved performance as the length of the ciphertext goes to infinity is an example of a so-called Law of Large Numbers in statistics. That is, given more and more data, the maximum likelihood estimate of the shift parameter converges to the TRUE shift parameter.
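One can see this convergence in a small simulation: generate a long "random English" string from the frequency model, encrypt it, and check that the maximum likelihood estimate recovers the shift. This is a sketch; the seed, length, and shift below are arbitrary choices of mine:

```python
import math
import random

# Letter frequencies from the King James Bible (copied from the notes).
p = [0.085, 0.015, 0.017, 0.049, 0.127, 0.026, 0.017, 0.087, 0.060,
     0.003, 0.007, 0.040, 0.025, 0.069, 0.075, 0.013, 0.000, 0.052,
     0.059, 0.098, 0.026, 0.009, 0.020, 0.000, 0.018, 0.001]

def loglike(cipher, a):
    """l~(a|C), with a small floor guarding the zero frequencies (Q, X)."""
    return sum(math.log(max(p[(ord(c) - 65 - a) % 26], 1e-6))
               for c in cipher) / len(cipher)

rng = random.Random(2002)
plain = rng.choices([chr(65 + i) for i in range(26)], weights=p, k=2000)
cipher = "".join(chr((ord(c) - 65 + 13) % 26 + 65) for c in plain)

a_hat = max(range(26), key=lambda a: loglike(cipher, a))
# With 2000 letters, a_hat recovers the true shift, 13.
```

With this much data, the sampling fluctuation in $\tilde l(a|C)$ is far smaller than the gap between the true shift and its competitors, so the MLE is essentially always correct.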
