Decoding Running Key Ciphers
Sravana Reddy∗
Department of Computer Science, The University of Chicago
1100 E. 58th Street, Chicago, IL 60637, USA

Kevin Knight
Information Sciences Institute, University of Southern California
4676 Admiralty Way, Marina del Rey, CA 90292, USA

∗Research conducted while the author was visiting ISI.

Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 80–84, Jeju, Republic of Korea, 8-14 July 2012. © 2012 Association for Computational Linguistics

Abstract

There has been recent interest in the problem of decoding letter substitution ciphers using techniques inspired by natural language processing. We consider a different type of classical encoding scheme known as the running key cipher, and propose a search solution using Gibbs sampling with a word language model. We evaluate our method on synthetic ciphertexts of different lengths, and find that it outperforms previous work that employs Viterbi decoding with character-based models.

1 Introduction

The running key cipher is an encoding scheme that uses a secret key R that is typically a string of words, usually taken from a book or other text that is agreed upon by the sender and receiver. When sending a plaintext message P, the sender truncates R to the length of the plaintext. The scheme also relies on a substitution function f, which is usually publicly known, that maps a plaintext letter p and key letter r to a unique ciphertext letter c. The most common choice for f is the tabula recta, where c = (p + r) mod 26 for letters in the English alphabet, with A = 0, B = 1, and so on.

To encode a plaintext with a running key, the spaces in the plaintext and the key are removed, and for every 0 ≤ i < |P|, the ciphertext letter at position i is computed to be $C_i \leftarrow f(P_i, R_i)$. Figure 1 shows an example encoding using the tabula recta.

Figure 1: Example of a running key cipher. Note that the key is truncated to the length of the plaintext. Plaintext: linguistics is fun; running key: colorless green ideas; tabula recta substitution, where $C_i \leftarrow (P_i + R_i) \bmod 26$.

    Plaintext:    LINGUISTICSISFUN
    Running Key:  COLORLESSGREENID
    Ciphertext:   NWYULTWLAIJMWSCQ

For a given ciphertext and known f, the plaintext uniquely determines the running key and vice versa. Since we know that the plaintext and running key are both drawn from natural language, our objective function for the solution plaintext under some language model is:

$$\hat{P} = \arg\max_{P} \; \log \Pr(P)\,\Pr(R_{P,C}) \qquad (1)$$

where the running key $R_{P,C}$ is the key that corresponds to plaintext P and ciphertext C.

Note that if $R_{P,C}$ is a perfectly random sequence of letters, this scheme is effectively a 'one-time pad', which is provably unbreakable (Shannon, 1949). The knowledge that both the plaintext and the key are natural language strings is important in breaking a running key cipher.

The letter-frequency distribution of running key ciphertexts is notably flatter than the plaintext distribution, unlike substitution ciphers where the frequency profile remains unchanged, modulo letter substitutions. However, the ciphertext letter distribution is not uniform; there are peaks corresponding to letters (like I) that are formed by high-frequency plaintext/key pairs (like E and E).

2 Related Work

2.1 Running Key Ciphers

Bauer and Tate (2002) use letter n-grams (without smoothing) up to order 6 to find the most probable plaintext/key character pair at each position in the ciphertext. They test their method on 1000-character ciphertexts produced from plaintexts and keys extracted from Project Gutenberg. Their accuracies range from 28.9% to 33.5%, where accuracy is measured as the percentage of correctly decoded characters. Such figures are too low to produce readable plaintexts, especially if the decoded regions are not contiguous. Griffing (2006) uses Viterbi decoding and letter 6-grams to improve on the above result, achieving a median 87% accuracy over several 1000-character ciphertexts. A key shortcoming of this work is that it requires searching through about $26^5$ states at each position in the ciphertext.

2.2 Letter Substitution Ciphers

Previous work in decipherment of classical ciphers has mainly focused on letter substitution. These ciphers use a substitution table as the secret key. The ciphertext is generated by substituting each letter of the plaintext according to the substitution table. The table may be homophonic; that is, a single plaintext letter could map to more than one possible ciphertext letter. Just as in running key ciphers, spaces in the plaintext are usually removed before encoding.

Proposed decipherment solutions for letter substitution ciphers include techniques that use expectation maximization (Ravi and Knight, 2008), genetic algorithms (Oranchak, 2008), integer programming (Ravi and Knight, 2009), A* decoding (Corlett and Penn, 2010), and Bayesian learning with Dirichlet processes (Ravi and Knight, 2011).

2.3 Vigenère Ciphers

A scheme similar to the running key cipher is the Vigenère cipher, also known as the periodic key cipher. Instead of a single long string spanning the length of the plaintext, the key is a short string (usually but not always a single word or phrase) repeated to the length of the plaintext. Figure 2 shows an example Vigenère cipher encoding. This cipher is less secure than the running key, since the short length of the key vastly reduces the size of the search space, and the periodic repetition of the key leaks information.

Figure 2: Example of a Vigenère cipher, with a 5-letter periodic key, repeated to the length of the plaintext. Plaintext: linguistics is fun; periodic key: green; tabula recta substitution.

    Plaintext:     LINGUISTICSISFUN
    Periodic Key:  GREENGREENGREENG
    Ciphertext:    RZRKHOJXMPYZWJHT

Recent work on decoding periodic key ciphers performs Viterbi search on the key using letter n-gram models (Olsen et al., 2011), with the assumption that the length of the key is known. If unknown, the key length can be inferred using the Kasiski Test (Kasiski, 1863), which takes advantage of repeated plaintext/key character pairs.

3 Solution with Gibbs Sampling

In this paper, we describe a search algorithm that uses Gibbs Sampling to break a running key cipher.

3.1 Choice of Language Model

The main advantage of a sampling-based approach over Viterbi decoding is that it allows us to seamlessly use word-based language models. Lower order letter n-grams may fail to decipher most ciphertexts even with perfect search, since an incorrect plaintext and key could have higher likelihood under a weak language model than the actual message.

3.2 Blocked Sampling

One possible approach is to sample a plaintext letter at each position in the ciphertext. The limitation of such a sampler for the running key problem is that it is extremely slow to mix, especially for longer ciphertexts: we found that in practice, it does not usually converge to the optimal solution in a reasonable number of iterations even with simulated annealing. We therefore propose a blocked sampling algorithm that samples words rather than letters in the plaintext, as follows:

1. Initialize randomly $P := p_1 p_2 \ldots p_{|C|}$, fix R as the key that corresponds to P, C

2. Repeat for some number of iterations

   (a) Sample spaces (word boundaries) in P according to Pr(P)
   (b) Sample spaces in R according to Pr(R)
   (c) Sample each word in P according to Pr(P) Pr(R), updating R along with P
   (d) Sample each word in R according to Pr(P) Pr(R), updating P along with R

3. Remove spaces and return P, R

Note that every time a word in P is sampled, it induces a change in R that may not be a word or a sequence of words, and vice versa. Sampling word boundaries will also produce hypotheses containing non-words. For this reason, we use a word trigram model linearly interpolated with letter trigrams (including the space character).¹ The interpolation mainly serves to smooth the search space, with the added benefit of accounting for out-of-vocabulary, misspelled, or truncated words in the actual plaintext or key. Table 1 shows an example of one sampling iteration on the ciphertext shown in Figure 1.

Table 1: First sampling iteration on the ciphertext NWYULTWLAIJMWSCQ

    Generate P, R with letter trigrams    P: WERGATERYBVIEDOW
                                          R: RSHOLASUCHOESPOU
    Sample spaces in P                    P: WERGAT ER YB VIEDOW
    Sample spaces in R                    R: RS HOLASUCHOES POU
    Sample words in P                     P: ADJUST AN MY WILLOW
                                          R: NT PATAWYOKNEL HOU
    Sample words in R                     P: NEWNXI ST HE SYLACT
                                          R: AS CHOLESTEROL SAX

[...] Corpus (Nelson and Kucera, 1979), with Kneser-Ney smoothing. We fixed the word language model interpolation weight to λ = 0.7.

4.2 Baseline and Evaluation

For comparison with the previous work, we re-implement Viterbi decoding over letter 6-grams (Griffing, 2006) trained on the Brown Corpus. In addition to decipherment accuracy, we compare the running time in seconds of the two algorithms. Both decipherment programs were implemented in Python and run on the same machines. The Gibbs Sampler was run for 10000 iterations.

As in the Griffing (2006) paper, since the plaintext and running key are interchangeable, we measure the accuracy of a hypothesized solution against the reference as the max of the accuracy between the hypothesized plaintext and the reference plaintext, and the hypothesized plaintext and the reference key.

4.3 Results

Table 2 shows the average decipherment accuracy of our algorithm and the baseline on each dataset.
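
For concreteness, the accuracy measure of Section 4.2 can be sketched in Python together with the tabula recta encoding of Section 1. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`encode`, `derive_key`, `char_accuracy`, `decipherment_accuracy`) are hypothetical, input is assumed to be English text whose non-letter characters are simply discarded, and accuracy is returned as a fraction rather than a percentage.

```python
from string import ascii_uppercase as ALPHABET  # A = 0, B = 1, ..., Z = 25


def _letters(text: str) -> str:
    """Uppercase the text and drop everything except A-Z (spaces included)."""
    return "".join(ch for ch in text.upper() if ch in ALPHABET)


def encode(plaintext: str, key: str) -> str:
    """Tabula recta encoding: C_i = (P_i + R_i) mod 26."""
    p, r = _letters(plaintext), _letters(key)
    if len(r) < len(p):
        raise ValueError("running key must be at least as long as the plaintext")
    # zip() stops at the end of the plaintext, which truncates the key.
    return "".join(
        ALPHABET[(ALPHABET.index(a) + ALPHABET.index(b)) % 26] for a, b in zip(p, r)
    )


def derive_key(ciphertext: str, plaintext: str) -> str:
    """Recover the key implied by a plaintext hypothesis: R_i = (C_i - P_i) mod 26."""
    c, p = _letters(ciphertext), _letters(plaintext)
    if len(c) != len(p):
        raise ValueError("ciphertext and plaintext must have the same length")
    return "".join(
        ALPHABET[(ALPHABET.index(a) - ALPHABET.index(b)) % 26] for a, b in zip(c, p)
    )


def char_accuracy(hypothesis: str, reference: str) -> float:
    """Fraction of positions at which the two letter strings agree."""
    h, ref = _letters(hypothesis), _letters(reference)
    if len(h) != len(ref) or not ref:
        raise ValueError("strings must be non-empty and of equal length")
    return sum(a == b for a, b in zip(h, ref)) / len(ref)


def decipherment_accuracy(hyp_plain: str, ref_plain: str, ref_key: str) -> float:
    """Max of accuracy against the reference plaintext and the reference key,
    since plaintext and running key are interchangeable."""
    return max(char_accuracy(hyp_plain, ref_plain), char_accuracy(hyp_plain, ref_key))


if __name__ == "__main__":
    cipher = encode("linguistics is fun", "colorless green ideas")
    print(cipher)
    print(derive_key(cipher, "linguistics is fun"))
    print(decipherment_accuracy("COLORLESSGREENID", "LINGUISTICSISFUN", "COLORLESSGREENID"))
```

Taking the maximum means that a decoder which returns the key where the plaintext was expected is still scored as correct, which matches the interchangeability argument in Section 4.2.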