POSTLUDE P.1 Letter Frequencies
A Prelude opened this textbook with the puzzle Mastermind, which introduced combinatorial reasoning in a recreational setting. This Postlude looks at a cryptanalysis problem that is again of a recreational nature but also illustrates the less structured side of combinatorial reasoning as it often occurs in real-world problems. In particular, we will look at a simple cryptographic scheme in which the analysis of underlying combinatorial problems is complicated by the somewhat random pattern of letters in English text.

P.1 Letter Frequencies

There have been many tables produced of the relative frequencies of letters in English writing, starting with Samuel Morse (of Morse code fame). We use frequencies averaged over several tables.

Most Common Letters in English Text

Vowels              Consonants
E          12%      T          9%
A, I, O     8%      N, R, S    6-8%
                    D, L       4%

The least frequent letters, all consonants, are:
J, Q, X, Z     below .05%

As we shall see, it is also useful to know which pairs of consecutive letters, called digraphs, are most frequent. The eight most common digraphs are:

Most frequent: TH
Second most frequent: HE
Next six most frequent: ER, RE and AN, EN, IN, ON

N is unique among the very frequent letters in that close to 90% of its occurrences are preceded by a vowel; other frequent letters have a much wider range of other letters preceding them. Some other frequent digraphs that can be helpful are:

ES, SE
ED, DE
ST
TE, TI, TO
OF

Frequent consonants tend to appear beside vowels, but vowels do not often occur side by side, and similarly frequent consonants do not often appear side by side, except for TH and ST. So there is a quasi-bipartite graph-like relationship (with vowels as one set of vertices and frequent consonants as the other set of vertices in the bipartition), and the frequent digraphs are the edges.

There is one triple of consecutive letters, called a trigraph, that stands out, namely, THE. THE is four times as common as any other trigraph in English text.
It is frequent both as a three-letter word and as a part of other words.

If we are given an encoded message, we could count the frequency of each letter in the message to determine single-letter frequencies. To get information about how often various letters occur before and after other letters, we build what is called a trigraphic frequency table. For each occurrence of a letter, we record the letter just preceding it and the letter just following it. To illustrate, consider the following cryptogram. Here the spaces between words have been removed, but for readability the letters are written in groups of five.

FJYHP KKYRH YKYRF HYVYK PRQYI SFIFP RNAVP PUDQC CAYJY COQRF
JYRYD TQYCO JPMIY FJQIN YSVTP VFJYT QVIFF QKYQR FJYES IFIFM
OYRFI JSWYP TFYRK QIAYJ QWYOQ RSVPD ONRPQ INTSI JQPRP MFIQO
YQFQI JPEYO FJSFF JYQRO PPVIY FFQRB DQCCA VQRBP MFFJY AYIFQ
RFJQI NYSVI BVSOM SFQRB HCSII FJYVY DQCCA YRPOY CQWYV QYIPT
OPKQR PIEQX XSSCC PDYOQ RIQOY

Table P.1: Trigraph table for cryptogram

 6  4 12  6  3 27  0  4 22 16  7  0  5  5 13 21 31 20 13  6  1 12  3  2 37  0
 A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z

NV RD QC UQ YS .J YP YS FY PK PI RA CQ HK RY YH IF DQ PD YY SY QX JH CY RP CA YT PY RH RY FF YY KY FO IY CJ KR DC YF YV VP AP QY XS KR IY IV YO PO IQ SI FY MY FY YY PF OR MY FR OR PQ EI YQ ST QY HK CV RH YO BQ IP BC QN OP YP PF IT YQ VP TY PN JW PF PF KR YY HS QC YQ RJ VF FQ QY OS IY DN PU JI QF RV NS QI HV CY CA PY YJ SF FY RQ QY JM QN YY TI PO SP VK HS VJ FF FY PQ YF TV TV QF JF PI QI QC IF FJ IS RP YT FK YF YV AQ AJ CA FQ QA YQ SM VD YR YK VO SI JC YQ RJ QN IQ PY RQ KI QS MF BS JR SC II SJ IP TP QR JW NP CI YY RD CP IM FQ FS YQ RM OR PP XS YQ QC RI QJ FY QY JE PI QO SC IF TY VF FY OP JP QB NS MI YF FQ PV IO QB JT QQ QN FY BM YF QF KQ OJ VB RO FI QB JE SF SI IT YR YP OR FJ IF OK FR QP WP YF YP RI DC QI FR FQ PE CD VR AJ MF RQ FR WO FJ JI OQ IQ FR EO RJ DC JQ SQ CW IF IJ VY JA KR AI EX NS OR JV IQ VD AR OC WV QI DO O.
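The recording rule described above (for each occurrence of a letter, note the letter just before it and the letter just after it) can be sketched in Python. This is a minimal sketch rather than the author's procedure: the function names trigraph_table and letter_counts are hypothetical, and the '.' marker for a missing neighbor is assumed from entries such as .J and O. in Table P.1.

```python
from collections import Counter, defaultdict


def trigraph_table(message):
    """Map each letter to its list of before/after pairs, in order of occurrence.

    A '.' stands in for a missing neighbor at the start or end of the message
    (an assumption based on the '.J' and 'O.' entries in Table P.1).
    """
    text = "".join(message.split())  # drop the spaces between five-letter groups
    table = defaultdict(list)
    for i, letter in enumerate(text):
        before = text[i - 1] if i > 0 else "."
        after = text[i + 1] if i + 1 < len(text) else "."
        table[letter].append(before + after)
    return dict(table)


def letter_counts(message):
    """Single-letter frequencies, ignoring the grouping spaces."""
    return Counter("".join(message.split()))


# First eight letters of the cryptogram only, to keep the output short.
sample = "FJYHP KKY"
table = trigraph_table(sample)
for letter in sorted(table):
    print(letter, len(table[letter]), " ".join(table[letter]))
```

On this eight-letter sample, K's column comes out as PK followed by KY, and F's column holds the single entry .J. Passing the full cryptogram instead of the sample would give one column per code letter, with the column length equal to that letter's frequency.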
The trigraphic frequency table for the cryptogram is given in Table P.1. Letter frequencies appear at the top of each column in Table P.1. When a trigraph is repeated in some letter's column, the trigraph is underlined. To illustrate how the table is constructed, consider the beginning of the cryptogram: FJYHP KKY. For each occurrence of a letter, we enter the letter just before it and the letter just after it. Since the first letter is F, we enter .J in F's column (the '.' indicates that, since F is the first letter of the message, no letter precedes it). For the second letter, J, we enter FY in J's column. For the third letter, Y, we enter JH in Y's column. For the fourth letter, H, we enter YP in H's column; for the fifth letter, P, we enter HK in P's column. For the sixth letter, K, we enter PK in K's column. Since the seventh letter is also K, we enter KY next in K's column.

For the frequent letters, the columns of trigraph data can be overwhelming, and so it is helpful to make a digraph table for each frequent letter, such as F, listing the letters that occur 2 or more times before F and after F, together with their frequencies. See Table P.2.

Table P.2: Digraph repetitions for frequent letters in Table P.1

12 27 21 16 13 22 31 20 14 12 36
 C  F  I  J  O  P  Q  R  S  V  Y

Bef Bef Bef Bef Bef Bef Bef Bef Bef Bef Bef Aft C4 F4 F4 F10 C2 J2 D3 P3 J2 A2 A4 3C Q3 I6 Q5 I3 Q2 G2 F4 Q10 S2 P2 I2 2D Y3 M2 S3 Y2 Y3 P2 I3 Y6 Y2 S3 J8 2F R5 V3 R4 J4 Y3 K3 3I Aft S3 Y3 Aft Aft V2 K2 Aft Aft N2 2J 3A Y2 2P 2P O2 3B 3F Aft O4 2K 4C Aft 4Q 2Q Aft T2 4F 3I 3I Q3 30 2O Aft 6F 3S 4Y 2D V2 4P 3V 2P V2 3Q 4F 3J 8Y 2M Y3 2Q W3 6R 4I 3N 2P 2Y 2S 10J 2Q 2R Aft 3V 5Q 2Y 2T 3C 2V 4I 2O 10R 3Y 2W

We shall be referring to the data in Tables P.1 and P.2 repeatedly throughout this Postlude. Finally, we list the sequences of 3 or more letters that are repeated several times in the message.
7 times: FJY
3 times: DQCCA, RFJ, QRB (also DQCCAY 2 times)

Note that longer repeats, such as DQCCA and DQCCAY, can be found by looking at trigraph (3-letter) repeats and concatenating these repeats together. That is, DQCCA is built from the repeated trigraphs DQC, QCC, and CCA. Observe that this 5-letter sequence is probably a word, since the chances are extremely low that a repeated 5-letter sequence would be formed by a common ending of three different words followed by a common start of three other different words.

Now we start the decoding process. We typically begin with the letters in the English word THE. The very frequent trigraph FJY is a perfect fit to be the encoding of THE, since:

i) THE is the most frequent trigraph in English text and FJY is the most frequent trigraph in the cryptogram (occurring 7 times);
ii) E is the most frequent letter in English text and Y is the most frequent letter in our cryptogram;
iii) TH is the most frequent digraph in English text and FJ is the most frequent digraph in the cryptogram (occurring 10 times); and
iv) T is one of the most frequent letters in English text and F is one of the most frequent letters in our cryptogram.

We write the information about the encoding of THE as T_P = F_C, H_P = J_C, E_P = Y_C, where the P subscript stands for Plain text and the C subscript stands for Code text. Once we know that E_P = Y_C, we can look for frequent letters that are rarely beside Y_C, keeping in mind that vowels do not often occur side by side. Looking through the digraph information in Table P.2, we see that Y_C has no repeated occurrences before or after P_C (Y_C appears before P_C once and not at all after P_C). So P_C is extremely likely to be a vowel. To find another vowel, we can look at frequent code letters that have few occurrences of Y_C and P_C before and after them.
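The counts used in this argument (repeated sequences such as FJY and DQCCA, and how often Y_C sits immediately before or after P_C) can be checked mechanically. The following is a minimal sketch under stated assumptions: the helper names clean, ngram_counts, and neighbor_counts are hypothetical, and the cryptogram string is copied from the text above, so any transcription slip there would carry over.

```python
from collections import Counter

CRYPTOGRAM = """
FJYHP KKYRH YKYRF HYVYK PRQYI SFIFP RNAVP PUDQC CAYJY COQRF
JYRYD TQYCO JPMIY FJQIN YSVTP VFJYT QVIFF QKYQR FJYES IFIFM
OYRFI JSWYP TFYRK QIAYJ QWYOQ RSVPD ONRPQ INTSI JQPRP MFIQO
YQFQI JPEYO FJSFF JYQRO PPVIY FFQRB DQCCA VQRBP MFFJY AYIFQ
RFJQI NYSVI BVSOM SFQRB HCSII FJYVY DQCCA YRPOY CQWYV QYIPT
OPKQR PIEQX XSSCC PDYOQ RIQOY
"""


def clean(message):
    """Remove the spaces and line breaks used only for readability."""
    return "".join(message.split())


def ngram_counts(text, n):
    """Count every run of n consecutive letters (overlapping runs included)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def neighbor_counts(text, letter):
    """Return (before, after): counts of letters adjacent to each occurrence."""
    before, after = Counter(), Counter()
    for i, ch in enumerate(text):
        if ch != letter:
            continue
        if i > 0:
            before[text[i - 1]] += 1
        if i + 1 < len(text):
            after[text[i + 1]] += 1
    return before, after


text = clean(CRYPTOGRAM)
trigraphs = ngram_counts(text, 3)
digraphs = ngram_counts(text, 2)
print("FJY:", trigraphs["FJY"], "FJ:", digraphs["FJ"])
print("DQCCA:", ngram_counts(text, 5)["DQCCA"])

# How often does code letter Y stand immediately before or after code letter P?
before_p, after_p = neighbor_counts(text, "P")
print("Y before P:", before_p["Y"], "Y after P:", after_p["Y"])
```

Because a Counter returns 0 for a key it has never seen, a letter that never appears beside P simply reports 0. The same neighbor_counts call on other frequent code letters gives the before/after tallies needed when searching for the next vowel.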