Decipherment with a Million Random Restarts
Total Page:16
File Type:pdf, Size:1020Kb
Decipherment with a Million Random Restarts Taylor Berg-Kirkpatrick Dan Klein Computer Science Division University of California, Berkeley {tberg,klein}@cs.berkeley.edu Abstract in accuracy even as we scale the number of ran- dom restarts up into the hundred thousands. In both This paper investigates the utility and effect of cases the difference between few and many random running numerous random restarts when us- restarts is the difference between almost complete ing EM to attack decipherment problems. We failure and successful decipherment. find that simple decipherment models are able We also find that millions of random restarts can to crack homophonic substitution ciphers with high accuracy if a large number of random be helpful for performing exploratory analysis. We restarts are used but almost completely fail look at some empirical properties of decipherment with only a few random restarts. For partic- problems, visualizing the distribution of local op- ularly difficult homophonic ciphers, we find tima encountered by EM both in a successful deci- that big gains in accuracy are to be had by run- pherment of a homophonic cipher and in an unsuc- ning upwards of 100K random restarts, which cessful attempt to decipher Zodiac 340. Finally, we we accomplish efficiently using a GPU-based attack a series of ciphers generated to match proper- parallel implementation. We run a series of ties of Zodiac 340 and use the results to argue that experiments using millions of random restarts in order to investigate other empirical proper- Zodiac 340 is likely not a homophonic cipher under ties of decipherment problems, including the the commonly assumed linearization order. famously uncracked Zodiac 340. 2 Decipherment Model 1 Introduction Various types of ciphers have been tackled by the NLP community with great success (Knight et al., What can a million restarts do for decipherment? 2006; Snyder et al., 2010; Ravi and Knight, 2011). EM frequently gets stuck in local optima, so running Many of these approaches learn an encryption key between ten and a hundred random restarts is com- by maximizing the score of the decrypted message mon practice (Knight et al., 2006; Ravi and Knight, under a language model. We focus on homophonic 2011; Berg-Kirkpatrick and Klein, 2011). But, how substitution ciphers, where the encryption key is a important are random restarts and how many random 1-to-many mapping from a plaintext alphabet to a restarts does it take to saturate gains in accuracy? cipher alphabet. We use a simple method introduced We find that the answer depends on the cipher. We by Knight et al. (2006): the EM algorithm (Demp- look at both Zodiac 408, a famous homophonic sub- ster et al., 1977) is used to learn the emission pa- stitution cipher, and a more difficult homophonic ci- rameters of an HMM that has a character trigram pher constructed to match properties of the famously language model as a backbone and the ciphertext unsolved Zodiac 340. Gains in accuracy saturate af- as the observed sequence of emissions. This means ter only a hundred random restarts for Zodiac 408, that we learn a multinomial over cipher symbols for but for the constructed cipher we see large gains each plaintext character, but do not learn transition 874 Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 874–878, Seattle, Washington, USA, 18-21 October 2013. c 2013 Association for Computational Linguistics parameters, which are fixed by the language model. 1 -1460 We predict the deciphered text using posterior de- 0.9 coding in the learned HMM. -1470 0.8 -1480 2.1 Implementation 0.7 Log likelihood 0.6 Log likelihood -1490 Running multiple random restarts means running Accuracy EM to convergence multiple times, which can be 0.5 Accuracy -1500 computationally intensive; luckily, restarts can be 0.4 run in parallel. This kind of parallelism is a good -1510 0.3 fit for the Same Instruction Multiple Thread (SIMT) -1520 hardware paradigm implemented by modern GPUs. 0.2 We implemented EM with parallel random restarts 0.1 -1530 using the CUDA API (Nickolls et al., 2008). With a 1 10 100 1000 10000 100000 1e+06 GPU workstation,1 we can complete a million ran- Number of random restarts dom restarts roughly a thousand times more quickly Figure 1: Zodiac 408 cipher. Accuracy by best model score and than we can complete the same computation with a best model score vs. number of random restarts. Bootstrapped serial implementation on a CPU. from 1M random restarts. 3 Experiments tions.3 We found that smoothing EM was important for good performance. We added a smoothing con- We ran experiments on several homophonic sub- stant of 0.1 to the expected emission counts before stitution ciphers: some produced by the infamous each M-step. We tuned this value on a small held Zodiac killer and others that were automatically out set of automatically generated ciphers. generated to be similar to the Zodiac ciphers. In In all experiments we used a trigram character each of these experiments, we ran numerous random language model that was linearly interpolated from restarts; and in all cases we chose the random restart character unigram, bigram, and trigram counts ex- that attained the highest model score in order to pro- tracted from both the Google N-gram dataset (Brants duce the final decode. and Franz, 2006) and a small corpus (about 2K 3.1 Experimental Setup words) of plaintext messages authored by the Zodiac killer.4 The specifics of how random restarts are produced is usually considered a detail; however, in this work 3.2 An Easy Cipher: Zodiac 408 it is important to describe the process precisely. In Zodiac 408 is a homophonic cipher that is 408 char- order to generate random restarts, we sampled emis- acters long and contains 54 different cipher sym- sion parameters by drawing uniformly at random bols. Produced by the Zodiac killer, this cipher was from the interval [0, 1] and then normalizing. The solved, manually, by two amateur code-breakers a corresponding distribution on the multinomial emis- week after its release to the public in 1969. Ravi and sion parameters is mildly concentrated at the center Knight (2011) were the first to crack Zodiac 408 us- of the simplex.2 ing completely automatic methods. For each random restart, we ran EM for 200 itera- In our first experiment, we compare a decode of 1We used a single workstation with three NVIDIA GTX 580 Zodiac 408 using one random restart to a decode us- GPUs. These are consumer graphics cards introduced in 2011. ing 100 random restarts. Random restarts have high 2We also ran experiments where emission parameters were drawn from Dirichlet distributions with various concentration 3While this does not guarantee convergence, in practice 200 parameter settings. We noticed little effect so long as the distri- iterations seems to be sufficient for the problems we looked at. bution did not favor the corners of the simplex. If the distribu- 4The interpolation between n-gram orders is uniform, and tion did favor the corners of the simplex, decipherment results the interpolation between corpora favors the Zodiac corpus with deteriorated sharply. weight 0.9. 875 variance, so when we present the accuracy corre- 0.8 -1280 sponding to a given number of restarts we present an 0.75 -1285 average over many bootstrap samples, drawn from 0.7 a set of one million random restarts. If we attack 0.65 Zodiac 408 with a single random restart, on aver- -1290 Log likelihood 0.6 age we achieve an accuracy of 18%. If we instead Log likelihood Accuracy use 100 random restarts we achieve a much better 0.55 -1295 average accuracy of 90%. The accuracies for vari- Accuracy 0.5 -1300 ous numbers of random restarts are plotted in Fig- 0.45 ure 1. Based on these results, we expect accuracy 0.4 -1305 to increase by about 72% when using 100 random 0.35 restarts instead of a single random restart; however, 0.3 -1310 using more than 100 random restarts for this partic- 1 10 100 1000 10000 100000 1e+06 ular cipher does not appear to be useful. Number of random restarts Also in Figure 1, we plot a related graph, this time Figure 2: Synth 340 cipher. Accuracy by best model score and showing the effect that random restarts have on the best model score vs. number of random restarts. Bootstrapped achieved model score. By construction, the (maxi- from 1M random restarts. mum) model score must increase as we increase the number of random restarts. We see that it quickly 3.3 A Hard Cipher: Synth 340 saturates in the same way that accuracy did. What do these graphs look like for a harder cipher? This raises the question: have we actually Zodiac 340 is the second cipher released by the Zo- achieved the globally optimal model score or have diac killer, and it remains unsolved to this day. How- we only saturated the usefulness of random restarts? ever, it is unknown whether Zodiac 340 is actually a We can’t prove that we have achieved the global op- homophonic cipher. If it were a homophonic cipher timum,5 but we can at least check that we have sur- we would certainly expect it to be harder than Zo- passed the model score achieved by EM when it is diac 408 because Zodiac 340 is shorter (only 340 initialized with the gold encryption key. On Zodiac characters long) and at the same time has more ci- 408, if we initialize with the gold key, EM finds pher symbols: 63.