San Jose State University SJSU ScholarWorks
Master's Projects Master's Theses and Graduate Research
Fall 12-20-2016
Cryptanalysis of Homophonic Substitution Cipher Using Hidden Markov Models
Guannan Zhong San Jose State University
Follow this and additional works at: https://scholarworks.sjsu.edu/etd_projects
Part of the Information Security Commons
Recommended Citation Zhong, Guannan, "Cryptanalysis of Homophonic Substitution Cipher Using Hidden Markov Models" (2016). Master's Projects. 505. DOI: https://doi.org/10.31979/etd.9x48-r4vp https://scholarworks.sjsu.edu/etd_projects/505
This Master's Project is brought to you for free and open access by the Master's Theses and Graduate Research at SJSU ScholarWorks. It has been accepted for inclusion in Master's Projects by an authorized administrator of SJSU ScholarWorks. For more information, please contact [email protected]. Cryptanalysis of Homophonic Substitution Cipher Using Hidden Markov Models
A Project
Presented to
The Faculty of the Department of Computer Science
San Jose State University
In Partial Fulfillment
of the Requirements for the Degree
Master of Science
by
Guannan Zhong
December 2016 © 2016
Guannan Zhong
ALL RIGHTS RESERVED The Designated Project Committee Approves the Project Titled
Cryptanalysis of Homophonic Substitution Cipher Using Hidden Markov Models
by
Guannan Zhong
APPROVED FOR THE DEPARTMENTS OF COMPUTER SCIENCE
SAN JOSE STATE UNIVERSITY
December 2016
Mark Stamp Department of Computer Science
Thomas Austin Department of Computer Science
Fabio Di Troia Department of Computer Science
Richard M. Low Department of Mathematics ABSTRACT
Cryptanalysis of Homophonic Substitution Cipher Using Hidden Markov Models
by Guannan Zhong
We investigate the effectiveness of a Hidden Markov Model (HMM) with random restarts as a mean of breaking a homophonic substitution cipher. Based on extensive experiments, we find that such an HMM-based attack outperforms a previously de- veloped nested hill climb approach, particularly when the ciphertext message is short.
We then consider a combination cipher, consisting of a homophonic substitution and a column transposition. We develop and analyze an attack on such a cipher. This attack employs an HMM (with random restarts), together with a hill climb to recover the column permutation. We show that this attack can succeed on relatively short ci- phertext messages. Finally, we test this combined attack on the unsolved Zodiac 340 cipher. ACKNOWLEDGMENTS
I am very thankful to my advisor Prof. Stamp for his detailed and patient guid- ance. Also, I would like to thank Prof. Austin and Prof. Low for their valuable suggestions. Thanks to Fabio Di Troia for his support in double checking the cor- rectness of the code and feedback on my work. Finally, I thank my husband for his support.
v TABLE OF CONTENTS
CHAPTER
1Introduction...... 1
2Background...... 3
2.1 Substitution Cipher ...... 3
2.2 Jakobsen’s Algorithm ...... 4
2.3 Nested Hill Climb Algorithm ...... 6
2.4 Hidden Markov Model ...... 6
2.4.1 Notations ...... 7
2.4.2 Alpha Pass ...... 8
2.4.3 Beta Pass ...... 8
2.4.4 HMM Training ...... 9
2.4.5 HMM in Decrypting Substitution Cipher ...... 10
2.4.6 HMM with Random Restarts ...... 11
2.5 Compare HMM to Jakobsen’s Algorithm in Decrypting Simple Sub- stitution Cipher ...... 11
3CompareHMMwithRandomRestartstoJakobsen’sAlgo- rithm in Decrypting the Homophonic Substitution Cipher ... 13
3.1 Experiment and Results on Magnuson’s Nested Hill Climb Algorithm 13
3.1.1 Test Cases ...... 13
3.1.2 Results ...... 14
3.2 Experiment and Results on HMM with Random Restarts ..... 15
3.2.1 Test Cases ...... 16
vi 3.2.2 Choose the Number of Iterations ...... 16
3.2.3 Initialize the B Matrix ...... 17
3.2.4 HMM Results ...... 20
4HMMwithRandomRestartsinDecryptingFakeZodiac340. 22
4.1 Zodiac 408 ...... 22
4.2 Fake Zodiac 340 ...... 24
4.3 Decrypt the Fake Zodiac 340 Using HMM with the Fixed A Matrix 25
4.3.1 Decrypt the Fake Zodiac 340 Using HMM with the Fixed A From the Zodiac 408 ...... 25
4.3.2 Decrypt the Fake Zodiac 340 Using Fixed A from English Corpus ...... 27
4.4 Decrypt the Fake Zodiac 340 Using HMM by Reestimating A ... 27
4.5 Compare the Performances of Reestimating A or not in Decrypting a Simple Substitution Cipher ...... 27
5AttempttoDecryptZodiac340...... 31
5.1 Choose the Score Metric ...... 32
5.2 Use Combined Model to Decrypt Zodiac 340 ...... 34
5.2.1 Combined Model ...... 34
5.2.2 Result of Using Fixed A from Zodiac 408 ...... 35
5.2.3 Result of Reestimating A ...... 36
5.3 Discussion ...... 36
5.3.1 Check the Effectiveness of the Combined Model ...... 37
5.3.2 Check the Effectiveness of Swapping Columns in Recovering the Permutation Based on the Digram Score ...... 37
vii 5.3.3 Check the Effectiveness of Swapping Columns in Recovering the Permutation Based on the Score from Words .... 39
6ConclusionandFutureWork...... 41
6.1 Conclusion ...... 41
6.2 Future Work ...... 42
APPENDIX
More Results of HMM with Random Restarts in Decrypting the Homophonic Substitution Cipher ...... 44
viii LIST OF TABLES
1Frequencydistribution[7] ...... 14
2 KeyofthefakeZodiac340 ...... 25
ix LIST OF FIGURES
1 Structure of the HMM Model ...... 7
2 Jakobsen’s vs. HMM [11]...... 12
3 Success rate of the Nested Hill Climb Algorithm for different ci- phertext length and different number of distinct cipher symbols . 15
4 Success rate of the Nested Hill Climb Algorithm for different ci- phertext length and different number of distinct cipher symbols . 15
5 Normalized histograms of the accuracies with different steps and lengths...... 17
6Normalizedhistogramsoftheaccuraciesonthesameciphertext but with different base values...... 19
7Basevs.Variance...... 20
8Restart=200,000...... 21
9Restart=200,000...... 21
10 Zodiac 408 ciphertext [12] ...... 23
11 Histogram of the accuracies with fixed A from Zodiac 408 ..... 26
12 Histograms of the accuracies with A re-estimated or not and with different steps, T=10,000 ...... 28
13 Histograms of the accuracies with A re-estimated or not and with different lengths, step=200 ...... 30
14 Zodiac 340 cipher [13] ...... 31
15 Scatter plot between accuracy and the HMM score ...... 33
16 Scatter plot between accuracy and the Digram score ...... 34
17 Test swapping columns...... 38
18 Test swapping columns based on the score from words...... 40
x A.19 Restart = 10 ...... 44
A.20 Restart = 100 ...... 45
A.21 Restart = 1000 ...... 45
A.22 Restart = 10,000 ...... 46
A.23 Restart = 100,000 ...... 46
A.24 Restart = 10 ...... 47
A.25 Restart = 100 ...... 47
A.26 Restart = 1000 ...... 48
A.27 Restart = 10,000 ...... 48
A.28 Restart = 100,000 ...... 49
xi CHAPTER 1
Introduction
Substitution ciphers are one of the earliest methods used to encrypt plaintext.
While there are many different kinds of substitution ciphers, the most elementary type is the simple substitution, where each plaintext symbol is mapped in a one- to-one manner to a ciphertext symbol. There are mature algorithms to aid in the decryption of such ciphers, including Jakobsen’s Algorithm [6]andmethodsbased on Hidden Markov Models (HMM) [9]. When the ciphertext is short, an HMM with random restarts can be more effective than Jakobsen’s Algorithm [11].
Avariationonthesimplesubstitutionisthehomophonicsubstitutioncipher.In ahomophonicsubstitution,aplaintextsymbolcanbeencryptedtomorethanone ciphertext symbol, which has the effect of flattening the ciphertext statistics, thus making elementary cryptanalysis far more challenging as compared to a simple sub- stitution [10]. In [4], the authors propose and analyze generalization of Jakobsen’s
Algorithm for homophonic substitution ciphers. This algorithm is based on a nested hill climb approach where an additional outer hill climb layer is added to find appro- priate number of cipher symbol that each plain letter is mapped to. The work in [4] has recently been improved—both in terms of accuracy and efficiency—by Magnu- son [7]. A method based on HMMs with random restarts has also been successfully applied to the homophonic substitution problem [1, 11].
In this project, we compare the efficiency and accuracy of Magnuson’s nested hill climb [7] to an HMM with random restarts [11]forhomophonicsubstitutionciphers.
We show that HMM with random restarts has significantly better performance than
1 the nested hill climb, particularly when the message is short.
We then consider an attack on a cipher that combines a homophonic substitution with a column transposition. Our attack combines an outer hill climb (to recover the column transposition) with an inner HMM with random restarts (to break the homophonic substitution). We apply this attack to a message inspired by the unsolved
Zodiac 340 cipher [2], and we also apply the attack to the actual Zodiac 340 cipher.
This paper is organized as follows. In Chapter 2,wereviewtheconceptsofsimple and homophonic substitution ciphers as well as algorithms to decrypt them, including
Jakobsen’s Algorithm and its generalization based on a nested hill climb algorithm. In
Chapter 3, we show that HMM with random restarts performs better than the nested hill climb in decrypting homophonic substitution ciphers. Then in Chapter 4,wetest short homophonic substitution ciphertext messages to determine whether an HMM with random restarts can succeed against messages comparable to the Zodiac 340.
Finally, in Chapter 5,weproposeamodelthatcombinesJakobsen’sAlgorithmand an HMM with random restarts and we apply this attack to the unsolved Zodiac 340 cipher.
2 CHAPTER 2
Background
In this Chapter, we briefly review substitution ciphers and some algorithms to decrypt such kind of ciphers. In Section 2.1,weintroducetheconceptsofsimple substitution cipher and homophonic substitution cipher. In Section 2.2 and Section
2.3,wereviewtheJakobsen’sAlgorithmanditsgeneralizationbasedonanestedhill climb approach that are used to decrypt simple substitution ciphers and homophonic substitution ciphers, respectively. Besides these two algorithms, we review the con- cept and algorithm of hidden Markov model (HMM) in Section 2.4,aswellasits application in decrypting substitution ciphers. Finally in Section 2.5,wesummarize the work [11] on comparing the performances of HMM and the Jakobsen’s Algorithm in decrypting the simple substitution ciphers.
2.1 Substitution Cipher
Substitution cipher is one of the earliest methods used to encrypt plaintext.
While there are many different kinds of substitution ciphers, the most elementary type is the simple substitution, where each plaintext symbol is one-to-one mapped to asingleuniqueciphertextsymbol.Asaresult,thefrequencyofeachciphersymbol is the same as that of its corresponding symbol in the plaintext.
Suppose the ciphertext only contains 26 distinct alphabetic characters without spaces or punctuations. If we decrypt this kind of simple substitution cipher in a brute force way by trying all permutations in the key space, the time complexity is
26! 288. However, when the ciphertext is long enough, we can utilize the statistics information≈ from the English corpus to reduce the time complexity significantly. For
3 example, the most frequent ciphertext symbol can be decrypted as the letter ‘e’, since
‘e’ is the most common letter in English. Similarly, we decrypt the second frequent ciphertext symbol to letter ‘t’, and so on. Once we obtain part of the mapping, we can derive the whole mapping by common pattern analysis or exhaustively trying the permutations of the remaining cipher symbols.
Avariationonthesimplesubstitutioncipheristhehomophonicsubstitution cipher. In a homophonic substitution cipher, the mapping is not one-to-one. A plain- text symbol can be encrypted to more than one ciphertext symbols. A homophonic substitution cipher will tend to flatten the ciphertext statistics, making elementary cryptanalysis far more challenging as compared to a simple substitution [10].
2.2 Jakobsen’s Algorithm
Jakobsen’s Algorithm is an efficient and fast algorithm in decrypting simple sub- stitution ciphers. We outline the main steps as follows [9]:
1. Make initial guess for the key using frequency counts.
2. Compute oldScore.
3. Modify key by swapping two elements.
4. Compute newScore.
5. If newScore > oldScore: oldScore = newScore; Else: undo the swap operation
in the step 3.
6. Goto 3.
Here are some detailed explanations for the above procedure.
4 • Score: Let D dij be digraph distribution corresponding to putative key K.
Let E eij be= digraph distribution of English language. Both D and E are
matrices= with dimension to be 26 26. The score is then defined as follows:
× score K d D,E dij eij (1) i,j ( ) = ( ) = − • Swap: Let K k1,k2,...,k26 be the initial putative key. Let ’|’ represents the
swap operation.= We swap the elements in the following order:
round 1: k1 k2 k2 k3 k3 k4 ... k23 k24 k24 k25 k25 k26
round 2: k1k3 k2k4 k3k5 ... k23k25 k24k26
round 3: k1k4 k2k5 k3k6 ... k23k26
round 23:⋮⋮k1 k24 k2 k25 k3 k26
round 24: k1k25 k2k26
round 25: k1k26
Restart the swap sequence from the beginning whenever the score improves.
The algorithm terminates when there is no improvement in score in round 25.
The core of the Jakobsen’s Algorithm is that it provides a method to obtain the new
D matrix from the old D through swapping rows and columns of the old D matrix
instead of decrypting the ciphertext again. The rows and columns to be swapped are
determined by elements in key to be swapped. Since Jakobsen’s Algorithm decrypts
the ciphertext only once, this algorithm is fast and its speed is independent of the
length of the ciphertext.
5 2.3 Nested Hill Climb Algorithm
In paper [4], the authors propose an algorithm that is fast and effective in at- tacking the homophonic substitution cipher. Assume that the ciphertext is in English containing only 26 upper case letters with no spaces or punctuations. In a homophonic substitution cipher, a plain letter can be encrypted to one of a set of distinct cipher symbols. Let na,nb,...,nz be the number of ciphertext symbols that correspond to plain letters A, B, . . . , Z. The algorithm consists of three layers:
• Inner hill climb: A generalization of the Jakobsen’s Algorithm [6]thatcan
quickly get the putative digram matrix through operations on the old putative
digram matrix.
• Random initial key generator: To overcome the shortcoming of getting the local
maximum in the inner hill climb layer.
• Outer hill climb: To obtain appropriate combination of values of na,nb,...,nz.
This algorithm was proved to be effective by decrypting large number of test cases [4] and has recently been improved—both in terms of accuracy and efficiency—by Mag- nuson [7].
2.4 Hidden Markov Model
Hidden Markov Model (HMM) describes a Markov Process with hidden states.
The observed states are probabilistically related to the hidden states and thus provide information on the hidden states. A high level structure of the HMM model is shown in Figure 1.
6 Figure 1: Structure of the HMM Model
2.4.1 Notations
We follow the same notations of symbols in HMM as that in [9]:
T the length of the observation sequence
N = the number of hidden states in the model
M = the number of observation symbols
Q = q0,q1, ,qN 1 the states of the Markov process
− V = {0, 1,...,M 1} = set of possible observations
X = {X0,X1, ,X−T }1 = hidden state sequence. Each Xt takes a value in Q.
− = ( 0, 1, , T 1 ) =observation sequence. Each t takes a value in V.
− OA = (theO NO N Ostate) transition= probability matrix O
B = the N × M observation probability matrix
⇡ = the N-dimensional× row vector representing the initial state distribution
bo = the N-dimensional row vector that is the transpose of the o-th column of the matrix B
= matrix-matrix multiplication or matrix-vector multiplication
⋅∶ element-wise multiplication between two vectors with same dimensions v ∗∶1 the sum of the absolute value of each element in vector v
∶
7 Especially, for the matrix A aij where i, j 0,...,N 1,theelementaij is the probability of transiting from= state{ }qi at time t =to state qj−at time t 1:
aij P state qj at t 1 state qi at t + (2)
For the matrix B bj k =where( j 0, 1,...,N+ 1,k 0, 1),...,M 1,theelement bj k is the probability= { ( of)} seeing the= observation k− given= a hidden state− qj:
( ) bj k P observation k at t state qj at t (3)
An HMM model can be( denoted) = ( as A,B,⇡ . )
= ( ) 2.4.2 Alpha Pass
Given a model A,B,⇡ and an observation sequence ,alphapasscangive the probability of the= occurrence( ) of the observation sequence O, which is denoted as
P . The algorithm is described as follows: O
(O ) For each t 0, 1,...,T 1,wedefineaN-dimensional row vector ↵t, where its i- th component represents= the− conditional probability P 0, 1,..., t,xt qi .We can then compute iteratively: (O O O = )
↵0 ⇡ b 0 ,
O ↵t = ↵∗t 1 A b t ,t1, 2, ,T 1 . (4)
− O Finally, we have = ( ⋅ ) ∗ = −
P ↵T 1 1 (5)
− Alpha pass is also called forward algorithm(O ) =.
2.4.3 Beta Pass
Given a model A,B,⇡ and an observation sequence as well as the result of alpha pass, we can find= ( the optimal) state sequence. i.e., uncoverO the hidden sequence.
8 The algorithm is described as follows:
For each t 0, 1,...,T 1,wedefineaN-dimensional row vector t, where its i-th component represents= the− conditional probability P t 1, t 2,..., T 1, xt qi, . + + − We can then compute iteratively: (O O O = )