Cryptodiagnosis of “Kryptos K4”
Richard Bean
School of Information Technology and Electrical Engineering
University of Queensland, Australia 4072
[email protected]

Abstract

The sculpture “Kryptos” at the Central Intelligence Agency in Langley, Virginia, contains four encrypted passages. The last, known as “K4” and consisting of 97 letters, remains unsolved.

In this work, we look at unusual statistical properties of the K4 ciphertext, together with the known plaintext, using Monte Carlo sampling to perform permutation testing. This provides evidence strongly indicating a definite “one-to-one” relationship between corresponding plaintext and ciphertext letters. It also points toward a possible encryption method which could account for most or all of the observed properties. This is the “Gromark” cipher invented by Hall (1969, 1975) and analyzed by Blackman (1989).

1 Introduction

The “Kryptos” sculpture was installed at the Central Intelligence Agency (CIA) in Langley, Virginia in November 1990. The sculptor was Jim Sanborn and the cryptographic consultant was Ed Scheidt, who retired from the CIA in December 1989. The sculpture contains four encrypted messages totalling 865 letters plus 4 question marks. Scheidt has indicated that the codes were designed to be solved in five, seven or ten years.

The first three sections were solved independently by three different teams or individuals: an NSA team in 1992, David Stein from the CIA in 1998, and Jim Gillogly in 1999. The fourth section, “K4”, consisting of 97 letters remains unsolved and its encryption method remains publicly unknown. During the period 2010 to 2020, four parts of the K4 plaintext with locations were released by the sculptor, totalling 24 letters. Further details may be found in Dunin and Schmeh (2020).

Callimahos (1977) and Lewis (1992) describe the process of diagnosis of an unknown cipher type. Callimahos, in a chapter entitled “Principles of Cryptodiagnosis”, sets out a process of hypothesis formulation and hypothesis testing. This involves arrangement and rearrangement of the data to disclose nonrandom characteristics, followed by recognition and explanation of these characteristics. The chapter headers are: manipulating the data, recognizing the phenomena, and interpreting the phenomena.

Lewis states that the task of an analyst is finding, measuring, explaining, and exploiting a phenomenon (or phenomena). Writing about cipher type diagnosis, he describes the search for “something funny” or “finding the phenomena”.

Since these publications, Mason (2012) prepared a table of cipher statistics for many American Cryptogram Association (ACA) types, with associated random forest (2013) and neural net (2016) classifiers. Nuhn and Knight (2014) also developed a classifier for ACA cipher types using a support vector machine approach.

In this paper we attempt to measure some of the interesting phenomena seen in K4 and provide possible explanations. We perform statistical testing using Monte Carlo sampling and describe one possible encryption method, the Gromark cipher of the ACA, and its variants. Finally we conduct an extensive search of the Gromark key space for various bases and key primer lengths before discussing our conclusions.

2 Analysis

Good (1983) commented on the practice of looking at a sample ciphertext and deciding on a test of significance based on the observed data, instead of running a standard series of tests. The passage is worth quoting in full to describe the risks and rewards of such an approach.

... it is sometimes sensible to decide on a significance test after looking at a sample. As I’ve said elsewhere this practice is dangerous, useful, and often done. It is especially useful in cryptanalysis, but one needs good detached judgment to estimate the initial probability of a hypothesis that is suggested by the data. Cryptanalysts even invented a special name for a very far-fetched hypothesis formulated after looking at the data, namely a “kinkus” (plural: “kinkera”). It is not easy to judge the prior probability of a kinkus after it has been observed.

2.1 Ciphertext analysis

One of K4’s most prominent unusual features is the number of repeated bigrams when the ciphertext is written at width 21 (Hannon, 2010; LaTurner, 2016; Kirchner, 2003); see Table 1.

OBKRUOX OGHULBS OLIFBBW
FLRVQQP RNGKSSO TWTQSJQ
SSEKZZW ATJKLUD IAWINFB
NYPVTTM ZFPKWGD KZXTJCD
IGKUHUA UEKCAR

Table 1: K4 ciphertext written at width 21

If we consider the 76 bigrams formed vertically (starting with OF and finishing with GR), there are 11 repeated bigrams (AZ BS IT LS LW PK QZ SN WA ZT KK). This value is in line with the expected number of repeated bigrams if a typical English plaintext was written out at width 21; for example, testing all 97-letter contiguous subsets of the King James Bible gives an average value of 9.7 repeated bigrams at width 21.

If we perform Monte Carlo sampling and take a large number of permutations of the ciphertext (Good, 2013), we can estimate the proportion of permutations of the ciphertext which would have at least this number of repeated bigrams. In this case, this proportion is approximately one in 6,750. Programs written in C to calculate values in this paper are provided via Github (https://github.com/RichardBean/k4testing).

The recently solved (Oranchak et al., 2020) “Zodiac 340” cipher also had a similar property (Daikon, 2015; Van Eycke, 2015). A relatively high number of repeated bigrams was seen at width 19 in the ciphertext. The cipher was ultimately found to be a combination of transposition and homophonic substitution. The width 19 property can thus, after the fact, be deemed “causal” as the enciphering process caused this property to appear.

2.2 Seriated ciphers

The “seriated Playfair” cipher of the ACA might provide a partial aesthetic explanation for the width 21 patterns. This cipher is digraphic and works by performing Playfair encryption on vertical pairs of letters. That is, any given pair of letters in plaintext (p1, p2) maps to another pair of letters (c1, c2) in a one-to-one fashion. Thus, numbering the positions 0 to 96, the repeated “BS” bigrams at positions 12/33 and 18/39 would reflect the same underlying plaintext, or “AZ” at positions 49/70 and 57/78. Similarly, the “double box cipher” or Doppelkastenschlüssel, sometimes referred to as “double Playfair”, described by David (1996) is a digraphic cipher which required seriation at width 21.

There are also several arguments against this as a method:

• according to the “ACA Cipher Statistics” webpage of Mason (2012), the index of coincidence (IC) of a typical “seriated Playfair” ciphertext is 0.048 with standard deviation 0.003 versus K4’s IC of 0.036

• 26 letters occur in the ciphertext, while the most common Playfair variant uses only 25 from a 5x5 square

• the doublet “KK” is present, which cannot occur in standard Playfair

• a plain interpretation is that there are 97 ciphertext letters, an odd number, while Playfair works on pairs of letters. As 97 is also prime, this is also an argument against the Hill cipher suggestion of Bauer et al. (2016).

The original description of the Playfair cipher by Wheatstone entitled “Specimen of a Rectangular Cipher”, seen in Kahn (1996, p. 199) uses a 9x3 rectangle, and enciphers doubled letters, which would account for the last three observations. We could take the question mark before “OBKR” on the sculpture as the initial character, with 27 different characters and 98 ciphertext characters. The low IC could then be accounted for by careful selection of the key.

However, these theories all became moot after the “BERLIN” plaintext clue was released in November 2010. This corresponds to the ciphertext “NYPVTT”. Thus, if a seriated digraphic cipher at width 21 had been used to encipher the plaintext, we would have two different plaintext bigrams ending in “I” and “N” both mapping to “ZT”, which is impossible. As the letter “K” in the 2014 plaintext clue of “CLOCK” also enciphered to “K” this ruled out the use of standard Playfair for the vertical bigrams.

We might also wish to check a width of seven, based on a purely aesthetic argument, since 98 characters is seven pairs of rows with seven characters per row. Similarly, the “NORTHEAST” plaintext clue was released in January 2020, which corresponded to letters 26-34 in the ciphertext, “QQPRNGKSS”. If a seriated digraphic cipher had been used to encipher the plaintext at a width of seven, we would have two different plaintext bigrams ending in “N” and “O” both mapping to “BQ”, again impossible.

These observations are unusual, and may well be causal, but were ultimately considered harder to measure, explain or exploit than the observations discussed here.

2.4 Known plaintext analysis

The 24 known plaintext letters are as follows: “FLRVQQPRNGKSS” in cipher corresponds to “EASTNORTHEAST” in plain and “NYPVTTMZFPK” in cipher corresponds to “BERLINCLOCK” in plain.

Materna (2020) noted that for the known K4 plaintext, where the plaintext letters are in the set {K, R, Y, P, T, O, S} the corresponding ciphertext letters are very close in the standard alphabet to the plaintext letters. Thus, the 10 shortest distances (the so-called “minor differences” (Friedman, 1954)) sum to 21, as shown in Table 2, for a mean of 2.1.

Plaintext letter    S T O R T S T R O K
Ciphertext letter   R V Q P R S S P F K
Distance            1 2 2 2 2 0 1 2 9 0

Table 2: Minor differences between plain and ciphertext letters

Monte Carlo sampling by permuting the ciphertext letters of K4 demonstrates this occurs only in about one in 5,520 permutations of K4 ciphertext letters.