
Cipher Type Detection

Malte Nuhn
Human Language Technology and Pattern Recognition Group
Computer Science Department
RWTH Aachen University
[email protected]

Kevin Knight
Information Sciences Institute
University of Southern California
[email protected]

Abstract

Manual analysis and decryption of enciphered documents is a tedious and error-prone task. Often—even after spending large amounts of time on a particular cipher—no decipherment can be found. Automating the decryption of various types of ciphers makes it possible to sift through the large number of encrypted messages found in libraries and archives, and to focus human effort only on a small but potentially interesting subset of them. In this work, we train a classifier that is able to predict which encipherment method has been used to generate a given ciphertext. We are able to distinguish 50 different cipher types (specified by the American Cryptogram Association) with an accuracy of 58.5%. This is an 11.2% absolute improvement over the best previously published classifier.

1 Introduction

Libraries and archives contain a large number of encrypted messages created throughout the centuries using various methods. For the great majority of these ciphers an analysis has not yet been conducted, simply because it takes too much time to analyze each cipher individually, or because it is too hard to decipher them. Automatic methods for analyzing and classifying given ciphers make it possible to sift out interesting messages and thereby focus the limited amount of human resources on a promising subset of ciphers.

For specific types of ciphers, there exist automated tools to decipher encrypted messages. However, the publicly available tools often depend on a more or less educated guess as to which type of encipherment has been used. Furthermore, they often still need human interaction and are restricted to analyzing only very few types of ciphers. In practice, however, there are many different types of ciphers which we would like to analyze in a fully automatic fashion: Bauer (2010) gives a good overview of historical methods that have been used to encipher messages in the past. Similarly, the American Cryptogram Association (ACA) specifies a set of 56 different methods for enciphering a given plaintext.

Each encipherment method M_i can be seen as a function that transforms a given plaintext into a ciphertext using a given key, or short:

    cipher = M_i(plain, key)

When analyzing an unknown ciphertext, we are interested in the original plaintext that was used to generate the ciphertext, i.e. the opposite direction:

    plain = M_i^{-1}(cipher, key)

Obtaining the plaintext from an enciphered message is a difficult problem. We assume that the decipherment of a message can be separated into solving three different subproblems:

1. Find the encipherment method M_i that was used to create the cipher: cipher → M_i.

2. Find the key that was used together with the method M_i to encipher the plaintext, i.e. such that cipher = M_i(plain, key).

3. Decode the message using M_i and key: cipher → M_i^{-1}(cipher, key).

Thus, an intermediate step to deciphering an unknown ciphertext is to find out which encryption method was used. In this paper, we present a classifier that is able to predict just that: Given an unknown ciphertext, it can predict what kind of encryption method was most likely used to generate it.
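The encipherment abstraction above can be sketched in code. This is a minimal illustration using a simple Vigenère-style shift as the method M_i; the function names and details are illustrative, not the paper's implementation:

```python
# Minimal sketch: a Vigenere-style shift as an encipherment method M_i.
# vigenere(plain, key) plays the role of M_i(plain, key);
# vigenere_inv(cipher, key) plays the role of M_i^{-1}(cipher, key).

def vigenere(plain: str, key: str) -> str:
    """Shift each plaintext letter (A-Z) by the corresponding key letter."""
    return "".join(
        chr((ord(p) - 65 + ord(k) - 65) % 26 + 65)
        for p, k in zip(plain, key * (len(plain) // len(key) + 1))
    )

def vigenere_inv(cipher: str, key: str) -> str:
    """Undo the shift, recovering the plaintext."""
    return "".join(
        chr((ord(c) - ord(k)) % 26 + 65)
        for c, k in zip(cipher, key * (len(cipher) // len(key) + 1))
    )

cipher = vigenere("WOMENSFOOTBALL", "KEY")
assert vigenere_inv(cipher, "KEY") == "WOMENSFOOTBALL"
```

Subproblem 1 — the topic of this paper — is to recover which such function was applied, without knowing the key.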

Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1769–1773, October 25-29, 2014, Doha, Qatar. © 2014 Association for Computational Linguistics

Type: CMBIFID
Plaintext: WOMEN NSFOO TBALL ISGAI NINGI NPOPU LARIT YANDT HETOU RNAME
Key: LEFTKEY='IACERATIONS', RIGHTKEY='KNORKOPPING', PERIOD=3, LROUTE=1, RROUTE=1, USE6X6=0
Ciphertext: WTQNG GEEBQ BPNQP VANEN KDAOD GAHQS PKNVI PTAAP DGMGR PCSGN

Figure 2: Overview of the data generation and training of the classifier presented in this work (plaintext and key are fed to the encipherment step; the resulting ciphertext is used for classifier training).
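The data generation step summarized in Figure 2 can be sketched as follows. The method `encipher_periodic_shift` and its parameter ranges are hypothetical stand-ins for the parameterized ACA methods (cf. the CMBIFID key above), used only to show the shape of the pipeline:

```python
import random

# Illustrative sketch of the Figure 2 pipeline: sample a random key for a
# (hypothetical) parameterized method, encipher a plaintext, and emit one
# labeled (type, ciphertext) training example.

def random_key(wordlist):
    # A key can combine codewords and numeric parameters (cf. Figure 1).
    return {
        "codeword": random.choice(wordlist),
        "period": random.randint(2, 10),  # assumed "sensible range"
    }

def encipher_periodic_shift(plain, key):
    # Shift letter i by the codeword letter at position (i mod period),
    # wrapping around the codeword. Not an actual ACA cipher type.
    cw, p = key["codeword"], key["period"]
    return "".join(
        chr((ord(c) - 65 + ord(cw[(i % p) % len(cw)]) - 65) % 26 + 65)
        for i, c in enumerate(plain)
    )

wordlist = ["IACERATIONS", "KNORKOPPING", "SECRET"]
plain = "WOMENSFOOTBALLISGAININGINPOPULARITY"
key = random_key(wordlist)
labeled_example = ("PERIODICSHIFT", encipher_periodic_shift(plain, key))
```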

Figure 1: Example "CMBIFID" cipher. The text is grouped in five-character chunks for readability.

The results of our classifier are a valuable input to human decipherers to make a first categorization of an unknown ciphertext.

2 Related Work

Central to this work is the list of encryption methods provided by the American Cryptogram Association¹. This list contains detailed descriptions and examples of each of the cipher types, allowing us to implement them. Figure 3 lists these methods.

We compare our work to the only previously published cipher type classifier for classical ciphers². This classifier is trained on 16,800 ciphertexts and is implemented in JavaScript to run in the web browser: the user can provide the ciphertext as input to a web page that returns the classifier's predictions. The source of the classifier is available online. Our work includes a reimplementation of the features used in that classifier.

As examples of work that deals with the automated decipherment of cipher texts, we point to (Ravi and Knight, 2011) and (Nuhn et al., 2013). These publications develop specialized algorithms for solving simple and homophonic substitution ciphers, which are just two out of the 56 cipher types defined by the ACA. We also want to mention (de Souza et al., 2013), which presents a cipher type classifier for the finalist algorithms of the Advanced Encryption Standard (AES) contest.

3 General Approach

Given a ciphertext, the task is to find the right encryption method. Our test set covers 50 out of the 56 cipher types specified by the ACA, as listed in Figure 3. We take a machine learning approach, based on the observation that we can generate an infinite amount of training data.

3.1 Data Flow

The training procedure is depicted in Figure 2: Based upon a large English corpus, we first choose possible plaintext messages. Then, for each encipherment method, we choose a random key and encipher each of the plaintext messages using that method and key. By doing this, we can obtain a (theoretically infinite) amount of labeled data of the form (type, ciphertext). We can then train a classifier on this data and evaluate it on some held-out data.

Figure 1 shows that in general the key can consist of more than just a codeword: in this case, the method uses two codewords, a period length, two different permutation parameters, and a general decision whether to use a special "6×6" variant of the cipher or not. If not defined otherwise, we choose random settings for these parameters. If the parameters are integers, we choose random values from a uniform distribution (in a sensible range). In the case of codewords, we choose from the 450k most frequent words of an English dictionary. We train on ciphertexts of random length.

3.2 Classifiers

The previous state-of-the-art classifier by BION uses a random forest classifier (Breiman, 2001). The version that is available online uses 50 random decision trees.

¹ http://cryptogram.org/cipher_types.html
² See http://bionsgadgets.appspot.com/gadget_forms/refscore_extended.html and https://sites.google.com/site/bionspot/cipher-id-tests
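The generate-train-evaluate loop of Section 3.1 can be sketched end to end. This is a toy stand-in, not the paper's implementation: it uses two stand-in "methods", a single feature (index of coincidence), and a nearest-centroid rule in place of the 58 features and the random forest / SVM / Vowpal Wabbit classifiers:

```python
import random
from collections import Counter

# Toy sketch of Section 3: generate labeled (type, ciphertext) data,
# extract one feature (index of coincidence), and label unseen ciphers
# by the nearest class centroid.

def index_of_coincidence(text):
    n = len(text)
    counts = Counter(text)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

def caesar(plain, key):       # monoalphabetic shift: IC stays English-like
    return "".join(chr((ord(c) - 65 + key) % 26 + 65) for c in plain)

def random_text(plain, key):  # random letters: IC near 1/26
    rng = random.Random(key)
    return "".join(rng.choice("ABCDEFGHIJKLMNOPQRSTUVWXYZ") for _ in plain)

METHODS = {"caesar": caesar, "randomtext": random_text}
plain = "WOMENSFOOTBALLISGAININGINPOPULARITYANDTHETOURNAME" * 10

# One feature centroid per cipher type, estimated from 20 random keys each.
centroids = {
    name: sum(index_of_coincidence(m(plain, k)) for k in range(1, 21)) / 20
    for name, m in METHODS.items()
}

def classify(cipher):
    ic = index_of_coincidence(cipher)
    return min(centroids, key=lambda name: abs(centroids[name] - ic))

assert classify(caesar(plain, 5)) == "caesar"
```

The real classifiers below replace the centroid rule with far stronger learners, but the data flow is the same.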

6x6bifid • 6x6playfair • amsco • bazeries • beaufort • bifid6 • bifid7 • (cadenus) • cmbifid • columnar • dbl chckrbrd • digrafid • four square • fracmorse • grandpre • () • gromark • gronsfeld • homophonic • mnmedinome • morbit • myszkowski • nicodemus • nihilistsub • (nihilisttransp) • patristocrat • period 7 vig. • periodic gromark • phillips • plaintext • playfair • pollux • porta • portax • progkey beaufort • progressivekey • quagmire2 • quagmire3 • quagmire4 • ragbaby • randomdigit • randomtext • redefence • (route transp) • runningkey • seriatedpfair • swagman • tridigital • trifid • trisquare • trisquare hr • two square • two sq. spiral • vigautokey • (vigenere) • (vigslidefair)

Figure 3: Cipher types specified by the ACA. Our classifier is able to recognize 50 out of these 56 ciphers. The braced cipher types are not covered in this work.

The features used by this classifier are described in Section 4.

Further, we train a support vector machine using the libSVM toolkit (Chang and Lin, 2011). This is feasible for up to 100k training examples; beyond this point, training times become too large. We perform multi-class classification using ν-SVC and a polynomial kernel, with multi-class classification carried out by one-against-one binary classification. We select the SVM's free parameters using a small development set of 1k training examples.

We also use Vowpal Wabbit (Langford et al., 2007) to train a linear classifier using stochastic gradient descent. Compared to training SVMs, Vowpal Wabbit is extremely fast and allows using a lot of training examples. We use a squared loss function and adaptive learning rates, and do not employ any regularization. We train our classifier with up to 1M training examples. The best performing settings use one-against-all classification, 20 passes over the training data, and the default learning rate. Quadratic features resulted in much slower training while not providing any gains in accuracy.

4 Features

We reimplemented all of the features used in the BION classifier and add three newly developed sets of features, resulting in a total of 58 features. In order to further structure these features, we group them as follows. We call the set of features that relate to the length of the cipher LEN; this set contains binary features firing when the cipher length is a multiple of 2, 3, 5, 25, any of 4-15, or any of 4-30. We call the set of features that are based on the ciphertext containing a specific symbol HAS; this set contains binary features firing when the cipher contains a digit, a letter (A-Z), the "#" symbol, the letter "j", or the digit "0". We also introduce another set of features called DGT that contains two features, firing when the cipher starts or ends with a digit. The set VIG contains 5 features: the feature scores are based on the best possible bigram LM perplexity of a decipherment compatible with the decipherment process of the cipher types Autokey, Beaufort, Porta, Slidefair, and Vigenere. Further, we also include the features IC, MIC, MKA, DIC, EDI, LR, ROD, LDI, DBL, NOMOR, RDI, PTX, NIC, PHIC, BDI, CDD, SSTD, MPIC, and SERP, which were introduced in the BION classifier³. Thus, the first 22 data points in Figure 4 are based on previously known features by BION. We further present the following additional features.

4.1 Repetition Feature (REP)

This set of features is based on how often the ciphertext contains symbols that are repeated exactly n times in a row. For example, the ciphertext shown in Figure 1 contains two positions with repetitions of length n = 2, because the ciphertext contains EE as well as AA. Beyond length 2, there are no repeats. These numbers are then normalized by dividing them by the total number of repeats of length 2 ≤ n ≤ 5.

³ See http://home.comcast.net/~acabion/acarefstats.html
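The repetition counts just described can be sketched as follows; this is a minimal illustration, and the exact feature layout in the classifier may differ:

```python
from collections import Counter

# Sketch of the REP feature set (Section 4.1): count maximal runs in which
# a symbol is repeated exactly n times in a row, for 2 <= n <= 5, and
# normalize by the total number of such repeats.

def repetition_features(cipher, max_n=5):
    runs = Counter()
    i = 0
    while i < len(cipher):
        j = i
        while j < len(cipher) and cipher[j] == cipher[i]:
            j += 1
        if 2 <= j - i <= max_n:
            runs[j - i] += 1
        i = j
    total = sum(runs.values()) or 1  # avoid division by zero
    return {n: runs[n] / total for n in range(2, max_n + 1)}

# "EE" and "AA" give two repeats of length 2 and none of length 3-5:
assert repetition_features("GEEBQPTAAP") == {2: 1.0, 3: 0.0, 4: 0.0, 5: 0.0}
```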

4.2 Amsco Feature (AMSC)

The idea of the AMSCO cipher is to fill consecutive chunks of one and two plaintext characters into n columns of a grid (see Table 1). Then a permutation of the columns is performed, and the resulting permuted plaintext is read off line by line and forms the final ciphertext. This feature reads the ciphertext into a similar grid of up to 5 columns and then tries all possible permutations to retain the original plaintext. The result of this operation is then scored with a bigram language model. Depending on whether the difference in perplexity between ciphertext and deciphered text exceeds a given threshold, this binary feature fires.

Plaintext    w   om  e   ns  f
             oo  t   ba  l   li
Permutation  3   5   1   4   2

Table 1: Example grid used for AMSCO ciphers.

4.3 Variant Feature (VAR)

In the variant cipher, the plaintext is written into a block under a key word. All letters in the first column are enciphered by shifting them using the first key letter of the key word, the second column uses the second key letter, etc. For different periods (i.e. lengths of key words), the ciphertext is structured into n columns and unigram statistics for each column are calculated. The frequency profile of each column is compared to the unigram frequency profile using a perplexity measure. This binary feature fires when the resulting perplexities are lower than a specific threshold.

5 Results

Figure 4 shows the classification accuracy for the BION baseline, as well as our SVM and VW based classifiers, for a test set of 305 ciphers that have been published in the ACA. The classifiers shown in this figure are trained on ciphertexts of random length. We show the contribution of all the features used in the classifier on the x-axis. Furthermore, we also vary the amount of training data used to train the classifiers from 10k to 1M training examples. It can be seen that when using the same features as BION, our prediction accuracy is comparable to that of the BION classifier. The main improvement of our classifier stems from the REP, AMSC and VAR features. Our best classifier is more than 11% more accurate than the previous state-of-the-art BION classifier.

We identified the best classifier on a held-out set of 1000 ciphers, i.e. 20 ciphers for each cipher type. Here the three new features improve the VW-1M classifier from 50.9% accuracy to 56.0%, and the VW-100k classifier from 48.9% to 54.6%. Note that this held-out set is based on the exact same generator that we used to create the training data. However, we also report the results of our method on the completely independently created ACA test set in Figure 4.

6 Conclusion

We presented a state-of-the-art classifier for cipher type detection. The approach we present is easily extensible to cover more cipher types and allows incorporating new features.

Acknowledgements

We thank Taylor Berg-Kirkpatrick, Shu Cai, Bill Mason, Beáta Megyesi, Julian Schamper, and Megha Srivastava for their support and ideas. This work was supported by ARL/ARO (W911NF-10-1-0533) and DARPA (HR0011-12-C-0014).

[Figure 4 plot: curves for BION, SVM 10k, SVM 100k, VW 100k, VW 1M; y-axis: Accuracy (%), 10-60; x-axis tick labels (features): IC, LR, LEN, VIG, MIC, DIC, EDI, LDI, RDI, PTX, NIC, BDI, REP, HAS, MKA, ROD, DBL, PHIC, CDD, SSTD, MPIC, SERP, VAR, NMOR, AMSC]

Figure 4: Classifier accuracy vs. training data and set of features used. From left to right, more and more features are used; the x-axis shows which features are added. The feature names are described in Section 4. The features right of the vertical line are presented in this paper. The horizontal line shows the previous state-of-the-art accuracy (BION) of 47.3%; we achieve 58.49%.

References

F. L. Bauer. 2010. Decrypted Secrets: Methods and Maxims of Cryptology. Springer.

Leo Breiman. 2001. Random forests. Machine Learning, 45(1):5–32, October.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

William A. R. de Souza, Allan Tomlinson, and Luiz M. S. de Figueiredo. 2013. Cipher identification with a neural network.

John Langford, Lihong Li, and Alex Strehl. 2007. Vowpal Wabbit. https://github.com/JohnLangford/vowpal_wabbit/wiki.

Malte Nuhn, Julian Schamper, and Hermann Ney. 2013. Beam search for solving substitution ciphers. In ACL (1), pages 1568–1576.

Sujith Ravi and Kevin Knight. 2011. Bayesian inference for Zodiac and other homophonic ciphers. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), pages 239–247, Stroudsburg, PA, USA, June. Association for Computational Linguistics.
