
Prof. Dr. Carsten Damm, Dr. Henrik Brosenne

University of Goettingen, Institute of Computer Science

Winter 2013/2014

Table of Contents

Elementary Cryptanalysis
Classification of Cryptanalytic Attacks
Stochastic structure of natural language - Part 1
Cryptanalysis by Frequency Analysis
Breaking the Vigenère cipher
Statistical Measures
Cryptanalysis of Transposition Ciphers

Starring

Alice = first person in all protocols (initiator)
Bob = second person in all protocols
Eve = an eavesdropper, i.e., passive attacker
Mallory = malicious active attacker

In this chapter we study passive attacks. Eve tries to get information about the plaintext while observing only messages in a channel. All attacks rely on a fixed cryptosystem (E, D).

Ciphertext-only attack

The ciphertext-only attack is the type of attack we will study in this chapter.

given ciphertexts C1 = EK(M1), ..., Ct = EK(Mt) of several messages, all generated by the same cipher EK.

wanted: an algorithm to infer Mt+1 from Ct+1 = EK(Mt+1)
weaker: recover some information about M1, ..., Mt
stronger: recover the key K (or at least information about it)

Known plaintext attack

additionally given M1,..., Mt

scenario: disclosure of formerly classified documents

Chosen plaintext attack

instead given: (limited) access to the cipher EK, so that the analyst can choose M1, ..., Mt and generate the corresponding ciphertexts C1 = EK(M1), ..., Ct = EK(Mt)

scenario: a spy that is able to plant some specially prepared messages on the Enigma operator

Adaptive-chosen-plaintext attack

special variant of chosen plaintext attack:

- the attacker doesn't need to fix the chosen plaintexts in advance but rather can watch the outcome of a chosen plaintext and based on that choose the next one(s)

scenario: before World War II Polish cryptanalysts were in possession of a copy of the Enigma machine (http://en.wikipedia.org/wiki/Biuro Szyfrow)

Published Worksheet

Published worksheet 04 stochastic structure of natural languages part 1.

Simple observations

Well known: each language (English, German, ...) has statistical characteristics that can be used to differentiate between various text sources:
- frequencies of letters and words
- frequencies of pairs, triples, ..., n-grams, or more general patterns
- starting/ending letters of words, starting/ending words of sentences
- lengths of words/sentences
- ...

Letter frequencies of typical English text samples

A     B    C    D    E     F    G    H    I    J    K    L    M
7.3   0.9  3.0  4.4  13.0  2.8  1.6  3.5  7.4  0.2  0.3  3.5  2.5

N     O    P    Q    R    S    T    U    V    W    X    Y    Z
7.8   7.4  2.7  0.3  7.7   6.3  9.3  2.7  1.3  1.6  0.5  1.9  0.1

heavy vowels: {E, I, O, A} = more than 1/3
heavy consonants: {T, N, R, S} = almost 1/3
low frequency symbols: {J, K, Q, X, Z} = less than 2/100

Popular frequency-ordered alphabets

(cited from F.L. Bauer: Entzifferte Geheimnisse)

English (various sources)

- etaoins(h)r dlucmfwypvbgkqjxz (1884)

- etoanirs hdlcufmpywgbvkxjqz (1893)

- etaoinsr hldcumfpgwybvkxjqz (1982)

German (various sources)

- enrisdutaghlobmfzkcwvjpqxy (1840)

- enirsahtudlcgmwfbozkpjvqxy (1863)

- enisratduhglcmwobfzkvpjqxy (1955)

Artificial text samples

one can generate random text by drawing symbols according to symbol frequencies in genuine text sources (0th order Markov source)
better: Shannon's method (gives a 1st order Markov source)
1. take a large text sample (typical of the language)
2. select a random cursor position, σ = symbol at cursor
3. output σ, select a random cursor position
4. locate the first occurrence of σ after the cursor
5. σ = the character following that occurrence
6. back to 3. or STOP
see published worksheet for illustration
can be extended to 2nd, 3rd, ... order sources

Law of large numbers
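Shannon's sampling procedure described above can be sketched in Python (a minimal illustration; the function name and the wrap-around fallback at the end of the sample are my own choices):

```python
import random

def shannon_sample(sample, length, seed=None):
    """Generate text from a 1st order Markov source via Shannon's method:
    the next symbol is the character following a randomly located
    occurrence of the current symbol in a large genuine text sample."""
    rng = random.Random(seed)
    out = [sample[rng.randrange(len(sample))]]     # steps 1-3: random cursor, emit symbol
    while len(out) < length:
        cursor = rng.randrange(len(sample))        # step 3: new random cursor
        pos = sample.find(out[-1], cursor)         # step 4: first occurrence after cursor
        if pos == -1 or pos + 1 >= len(sample):
            pos = sample.find(out[-1])             # fallback: wrap around to the start
            if pos == -1 or pos + 1 >= len(sample):
                break
        out.append(sample[pos + 1])                # step 5: character following it
    return "".join(out)
```

Run on a large English sample this already produces pronounceable nonsense, since every emitted digram actually occurs in the sample.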

wanted: a suitable mathematical model for plaintext sources
a stochastic source over alphabet A is a device that randomly emits "infinite texts" X = X1 X2 ... ∈ A^ω
the source is called memoryless if for every symbol a the probability P(Xn = a) =: pa is independent of n and of all previous or future symbols emitted
let Nn(a, X) denote the number of occurrences i with Xi = a in the prefix X1, ..., Xn, and let fn(a, X) := Nn(a, X)/n be the relative frequency of a in the prefix X1 X2 ... Xn

Theorem. If X is a random emission from a memoryless source with symbol probabilities (pa)a∈A, then with probability 1,

    lim_{n→∞} fn(a, X) = pa .

this law also holds for the relative frequencies of pairs, triples, ..., and more general "patterns" in the prefix
important: the longer the text sample, the more stable are its stochastic features in terms of pattern frequencies

Ergodic sources

a source is called stationary if the probability of occurrence of arbitrary "patterns" at position n of X is independent of n
generalization of memoryless sources: a source is called ergodic if it is stationary and the law of large numbers holds for arbitrary patterns
natural language sources are "close to" ergodic sources
one feature is that for an ergodic source the (infinite) emission is "almost surely typical" (where typicality has a precise mathematical meaning that we will discuss later)

Exercise 11

1. Implement a digram counter for text data and try to find some important "heavy pairs" by testing various text samples.
2. Extend this to triples (going much further probably doesn't make much sense for cryptanalysis).
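A minimal sketch of such a digram counter (in plain Python rather than Sage; the helper name is illustrative):

```python
from collections import Counter

ALPHABET = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")

def ngram_counts(text, n=2):
    """Count n-grams of consecutive alphabet symbols; non-letters are
    dropped first, so n-grams may straddle word boundaries."""
    letters = "".join(c for c in text.upper() if c in ALPHABET)
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))
```

On a large English sample, `ngram_counts(sample).most_common(10)` should surface heavy pairs such as TH and HE; extending to triples is the same call with n=3.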


Published worksheet 04 cryptanalysis by frequency analysis.

Breaking a simple substitution cipher

Ciphertext from a simple substitution cipher:

QWMMPQDVKUVFDTXJQVDBOPIDUHDQQUGDLAMWJGXBGURRBPBURMKULDVX
OOKUJUOVDJQDGBWHLDJQQMUODQUBIMWBOVWUVXPBUBIOKUBGXBGURROK
UJUOVDJQVPWMMDJOUQDVKDBVKDCDAQXEDFKXOKLPWBIQVKDQDOWJXVAP
TVKDQAQVDHXQURMKULDVXOOKUJUOVDJQVKDJDTPJDVKDVPVURBWHLDJP
TCDAQXQPTDBPJHPWQQXEDBDNDJVKDRDQQFDFXRRQDDVKUVQXHMRDQWLQ
VXVWVXPBXQNDJAQWQODMVXLRDVPOJAMVUBURAVXOUVVUOCQ

most frequent cipher symbols are D, V, Q, U, O, J, K, B (conjecture: these correspond to the heavy symbols)
looks like the cipher takes E → D and T → V or T → Q
rarest are E, N, S, Y, Z (conjecture: these correspond to the low frequency symbols)

Exercise 12

1. Complete the analysis of this ciphertext. Hint: it is useful to replace recovered plaintext letters by lower case in the ciphertext, i.e., replacing e for D gives
QWMMPQeVKUVFeTXJQVeBOPIeUHeQQUGeLAMWJGXBGURRBPBURMKULeVX...
Once several letters have been identified it may help to first ignore the unidentified ones, as in the fictitious example below, and make a good guess:
t.etopo.t.et.reetreesisato..o.oneo.t.reetree.
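The hint can be automated with two small helpers (a sketch; the function names are my own):

```python
def partial_decrypt(ciphertext, mapping):
    """Replace identified cipher letters by lower-case plaintext guesses,
    leaving unidentified symbols in upper case."""
    return "".join(mapping.get(c, c) for c in ciphertext)

def skeleton(ciphertext, mapping):
    """Show only identified letters; everything unidentified becomes a dot."""
    return "".join(mapping.get(c, ".") for c in ciphertext)
```

For example, `partial_decrypt("QWMMPQD...", {"D": "e"})` reproduces the lower-case replacement shown in the hint.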

1. Using a brute force attack is an option for Caesar ciphers. Suggest a method to avoid it. Implement it in Sage.
2. Using a brute force attack is an option for affine ciphers. Suggest a method to avoid it. Try to implement it in Sage.
3. Implement a digram counter for text data and try to find some important "heavy pairs" by testing various text samples.

Analysis of Vigenère Ciphers

consider Vigenère ciphers as synonymous to periodic substitution ciphers on the standard alphabet and with "short period"
the methods apply in principle to any periodic substitution cipher but are probably not powerful enough to break the Enigma or similar ciphers

The column trick

if (E, D) is a polyalphabetic cryptosystem and for a specific key the cipher EK has period ℓ, then each of the "plaintext columns"

M(1) = M1 M1+ℓ M1+2ℓ ...
M(2) = M2 M2+ℓ M2+2ℓ ...
...
M(ℓ) = Mℓ M2ℓ M3ℓ ...

is enciphered by the same monoalphabetic cipher.
the corresponding ciphertext columns C(1), ..., C(ℓ) can be deciphered as simple substitution ciphers
in particular:

- the symbol distributions in the columns are permuted versions of the source language symbol distribution

- the symbol distributions, ordered by falling frequency, are all very similar

Frequency analysis of periodic ciphers

Observation
periodic ciphers destroy the stochastic structure of the source language; the distribution looks "more random" than normal source language
the first task for the cryptanalyst is to determine the period
there are several methods of estimating the period; often a combination is to be applied

Decimation of a sequence

given a sequence S = s0 s1 s2 ... of symbols and a positive integer ℓ (the period)
for 0 ≤ k < ℓ the k-th decimation of S is the sequence

    Sk(ℓ) := sk sk+ℓ sk+2ℓ ...

decimating a sequence is a kind of downsampling

Idea
if m is a candidate period, consider and compare the decimated symbol distributions:
- compare them to "typical" source language distributions
- compare the decimations among each other (e.g., by bar charts, if you have no other idea)
- more efficient: compare numerical parameters of the distributions: expectation of rank, variance of rank, entropy, index of coincidence (see below)

Reminder on entropy
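In Python, the decimation of a sequence is just extended slicing (a two-line sketch used in later analyses):

```python
def decimations(text, period):
    """The k-th decimation collects positions k, k+period, k+2*period, ..."""
    return [text[k::period] for k in range(period)]
```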

binary entropy

    h(p) = −p log2 p − (1 − p) log2(1 − p)

maximum at p = 0.5 (uniform distribution)
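Both the binary entropy and the general entropy discussed next are one-liners in Python (a sketch; 0 · log 0 is taken as 0):

```python
from math import log2

def h(p):
    """Binary entropy h(p) = -p*log2(p) - (1-p)*log2(1-p)."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def H(probs):
    """General entropy of a distribution (p1, ..., pN)."""
    return -sum(p * log2(p) for p in probs if p > 0)
```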

general entropy

    H(p1, ..., pN) = − Σ pi log2 pi

maximum at p1 = p2 = ... = pN = 1/N (uniform distribution)

Fact. The "more uncertain" a distribution, the larger the entropy.

Remark. Symbol distributions of natural languages (or programming source code, or ...) are pretty predictable, i.e., they should have small entropy values.

Kasiski's method

Kasiski (1805-1881) was a Prussian officer (http://en.wikipedia.org/wiki/Friedrich Kasiski)

Assume. The key for the Vigenère cipher under consideration is a natural language word.

Idea
sometimes frequent plaintext words (like 'the') are aligned at the same positions with respect to the keyword; in this situation the resulting ciphertext fragments are the same

plain:      TOBEORNOTTOBETHATISTHEQUESTION
key stream: RUNRUNRUNRUNRUNRUNRUNRUNRUNRUN
cipher:     KIOVIEEIGKIOVNURNVJNUVKHVMGZIA
position:   012345678901234567890123456789

some coincidences: the fragment KIOV recurs at distance 9, NU at distance 6, and several single letters (I, V) recur as well
some of those coincidences are random, but the longer the fragments the more likely is a systematic origin (essential coincidences)
the key length is a divisor of the distances of significant coincidences
Kasiski suggests the greatest common divisor (GCD) of the distances of (sufficiently long) coincidences, or a multiple thereof, as the period of the system.

Idea for refinement
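Kasiski's test described above can be sketched as follows (the minimum fragment length 3 and the function names are my own choices):

```python
from collections import defaultdict
from functools import reduce
from math import gcd

def kasiski_distances(ciphertext, min_len=3):
    """Distances between occurrences of repeated fragments of length min_len."""
    positions = defaultdict(list)
    for i in range(len(ciphertext) - min_len + 1):
        positions[ciphertext[i:i + min_len]].append(i)
    return [b - a for pos in positions.values() for a, b in zip(pos, pos[1:])]

def kasiski_period(ciphertext, min_len=3):
    """Kasiski's suggestion: the gcd of the coincidence distances (0 if none)."""
    return reduce(gcd, kasiski_distances(ciphertext, min_len), 0)
```

On the TOBEORNOT example the only repeated trigram-or-longer fragment is KIOV at distance 9, so the estimate is 9 — a multiple of the true key length 3, exactly the caveat stated above.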

hard to distinguish between essential and inessential coincidences
idea: consider all distances of coincidences, but weight them according to the length (the higher the length, the higher the weight)
problem: how to assign a period to such a weighted sum?
this problem was solved by William Friedman (1891-1969), a US Army cryptographer (http://en.wikipedia.org/wiki/William F. Friedman)

Notations and terminology

alphabet A with m symbols, #A = m
T ∈ A* a particular text of length N

Nα = Nα(T) = number of occurrences of symbol α in T
an α-twin is a pair of occurrences of symbol α in T
cabbage has one a-twin and one b-twin

Observation. T has exactly Nα(Nα − 1)/2 distinct α-twins.
α-twins for arbitrary α are commonly called twins
T has exactly Σ_{α∈A} Nα(Nα − 1) twins, whereat both position pairs (i, j) and (j, i) are counted
the index of coincidence of T is the relative frequency of twins among all position pairs:

    ϕ(T) := Σ_{α∈A} Nα(Nα − 1) / (N(N − 1))

(the phi-value of the text)

Typical coincidence values
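The phi-value takes only a few lines in Python (a sketch following the definition above):

```python
from collections import Counter

def phi(text):
    """Index of coincidence: sum of N_a(N_a - 1) over N(N - 1)."""
    n = len(text)
    return sum(c * (c - 1) for c in Counter(text).values()) / (n * (n - 1))
```

For "cabbage" the twin counts are 2 for a and 2 for b, so phi is 4/42.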

consider a stochastic source Q that emits an infinite stream ∈ Σ^ω

symbol α is emitted with probability pα
if s, t are "independent positions" in the infinite stream, the probability of (s, t) being an α-twin is pα²

- if Q is a natural language source, the latter is certainly true if s, t are sufficiently spaced apart

- in this case the statement is true for the vast majority of position pairs
in general we conclude that the probability of a twin in positions s, t is (roughly)

    κ(Q) := Σ_{α∈A} pα²

(the kappa-value of the source language)
if T is a long enough typical finite text sample (a prefix of an infinite emission from Q) we expect by the law of large numbers ϕ(T) ≈ κ(Q)

Properties of kappa

Lemma

Let p = (p0, ..., pm−1) be a probability distribution. Then

    κ(p) := Σ pi² ≥ 1/m

and the minimal value κmin := 1/m is attained only for the uniform distribution p0 = ... = pm−1 = 1/m.

In turn: κ(p) is the larger the more "uneven" p is.

Proof. For i ∈ {0, ..., m − 1} let pi = 1/m + εi, where −1/m ≤ εi ≤ 1 − 1/m and Σ_{i=0}^{m−1} εi = 0. Then

    κ(p) = Σ_{i=0}^{m−1} (1/m + εi)² = Σ 1/m² + (2/m) Σ εi + Σ εi² = 1/m + Σ εi² .

Relation to entropy

The kappa-value is related to the Rényi entropy, which is a generalization of the (so-called Shannon) entropy mentioned above. More precisely, for α ≥ 0, α ≠ 1,

    Hα(p0, ..., pm−1) = (1/(1−α)) log Σ_{i=0}^{m−1} pi^α

for α → 1 this converges to the Shannon entropy
obviously κ = 2^{−H2}

H2 is also called collision entropy Properties of phi

Reminder
κ = quantity associated to a stochastic source. Example: kappa of English ≈ 0.066, kappa of German ≈ 0.076
ϕ = quantity associated to a specific text sample

Observation

ϕ is invariant under simple substitutions, i.e., if σ : Σ1 → Σ2 is a substitution cipher then ϕ(T) = ϕ(σ(T)).

Application to periodic substitution ciphers
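The invariance of phi under simple substitution is easy to check numerically (a sketch; the random permutation stands in for an arbitrary simple substitution):

```python
import random
from collections import Counter

def phi(text):
    # index of coincidence as defined in the text
    n = len(text)
    return sum(c * (c - 1) for c in Counter(text).values()) / (n * (n - 1))

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
rng = random.Random(0)
subst = dict(zip(alphabet, rng.sample(alphabet, 26)))  # a random simple substitution

plain = "ATTACKATDAWN"
cipher = "".join(subst[c] for c in plain)
assert phi(plain) == phi(cipher)  # substitution only permutes the symbol counts
```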

let T be a plaintext from a language source Q
let C be a ciphertext obtained from a periodic substitution cipher of unknown period ℓ
for a given candidate period m consider the corresponding decimations C(0), C(1), ..., C(m−1) ("columns" of the ciphertext)
if m is a multiple of ℓ, then each column is a simple substitution ciphertext, hence the phi-value of each column should be ≈ κ(Q)
otherwise the mixture of substitutions should result in more uniform symbol distributions; we expect phi-values closer to κmin
phi-spectrum: plotting the phi-values for a range of periods m should show significant peaks at multiples of ℓ (see published worksheet)

1st and 2nd kind twins
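The phi-spectrum just described can be tabulated in a few lines (a sketch; plotting is left out, and averaging the column phi-values is my own simplification):

```python
from collections import Counter

def phi(text):
    # index of coincidence as defined in the text
    n = len(text)
    return sum(c * (c - 1) for c in Counter(text).values()) / (n * (n - 1))

def phi_spectrum(ciphertext, max_period=12):
    """Average phi-value of the decimations for each candidate period m;
    multiples of the true period should stand out as peaks."""
    spec = {}
    for m in range(1, max_period + 1):
        cols = [ciphertext[k::m] for k in range(m)]
        spec[m] = sum(phi(col) for col in cols if len(col) > 1) / m
    return spec
```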

if it is not possible or too time-consuming to plot the phi-spectrum we can at least estimate the magnitude of ℓ
basic idea: compare ϕ(C) to ϕ(C(i)), where C(i) is one of the period-ℓ decimations ("columns") for the correct period ℓ
ℓ is unknown to Eve, but in the beginning of the consideration we need to have it in the formulas
recall that ϕ(C) is the relative frequency of twins in the ciphertext of length N
we consider two kinds of twins:

- 1st kind: same symbol in positions s, t of the same column

- 2nd kind: same symbol in positions s, t of different columns

Expected numbers of twins of either kind

in each column we have ≈ N/ℓ symbols
this means there are ≈ N(N/ℓ − 1)/2 = N(N − ℓ)/(2ℓ) position pairs s, t from the same column
the expected number of 1st kind twins is

    Z1 = κ(Q) · N(N − ℓ)/(2ℓ)

on the other hand there are ≈ N(N − N/ℓ)/2 = N²(ℓ − 1)/(2ℓ) position pairs from different columns
therefore the expected number of 2nd kind twins is about

    Z2 = κmin · N²(ℓ − 1)/(2ℓ)

reasoning: while 1st kind twins occur systematically (according to the source statistics), 2nd kind twins are totally random events

Sinkov's formula

recall that ϕ(C) is the relative frequency of twins in C
there is a total of N(N − 1)/2 position pairs s, t, hence if the text is long enough, we expect

    ϕ(C) ≈ (Z1 + Z2) / (N(N − 1)/2)

substituting the above expected values for Z1, Z2 we find after some lines of computing:

    ℓ ≈ (κ(Q) − κmin) · N / ((N − 1)·ϕ(C) − κmin·N + κ(Q))
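Sinkov's estimate translates directly into code (a sketch; the default kappa value is the English figure quoted earlier in this chapter):

```python
from collections import Counter

def phi(text):
    # index of coincidence as defined in the text
    n = len(text)
    return sum(c * (c - 1) for c in Counter(text).values()) / (n * (n - 1))

def sinkov_period(ciphertext, kappa_source=0.066, kappa_min=1 / 26):
    """Estimate the period of a periodic substitution cipher on A-Z."""
    n = len(ciphertext)
    p = phi(ciphertext)
    return (kappa_source - kappa_min) * n / ((n - 1) * p - kappa_min * n + kappa_source)
```

Sanity check: for a text whose phi-value equals the source kappa (i.e., a monoalphabetic encryption of typical source text), the formula collapses to exactly 1.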

Abraham Sinkov (1907-1998) was a co-worker of Friedman (http://en.wikipedia.org/wiki/Abraham Sinkov)

Exercise 13

1. Take some text samples from Project Gutenberg (http://www.gutenberg.org/) and compute their coincidence index (after encoding them into the standard alphabet A-Z). Group the results by language or by author.
2. Do the same for typical source code files in the C or Java programming language compared to typical HTML files. Take care to use the appropriate alphabet!


Published worksheet 04 breaking the vigenere cipher.

Kohel's example

we consider the ciphertext from Kohel's book
it is too short to estimate the period by Sinkov's formula or to plot the spectrum, so Kasiski's method has to be applied, and this gives period 11 (read in the book how this is done)
ciphertext arranged in blocks of length 11:

OOEXQGHXINM FRTRIFSSMZR WLYOWTTWTJI WMOBEDAXHVH
SFTRIQKMENX ZPNQWMCVEJT WJTOHTJXWYI FPSVIWEMNUV
WHMCXZTCLFS CVNDLWTENUH SYKVCTGMGYX SYELVAVLTZR
VHRUHAGICKI VAHORLWSUNL GZECLSSSWJL SKOGWVDXHDE
CLBBMYWXHFA OVUVHLWCSYE VVWCJGGQFFV EOAZTQHLONX
GAHOGDTERUE QDIDLLWCMLG ZJLOEJTVLZK ZAWRIFISUEW
WLIXKWNISKL QZHKHWHLIEI KZORSOLSUCH AZAIQACIEPI
KIELPWHWEUQ SKELCDDSKZR YVNDLWTMNKL WSIFMFVHAPA
ZLNSRVTEDEM YOTDLQUEIIM EWEBJWRXSYE VLTRVGJKHYI
SCYCPWTTOEW ANHDPWHWEPI KKODLKIEYRP DKAIWSGINZK
ZASDSKTITZP DPSOILWIERR VUIQLLHFRZK ZADKCKLLEEH
JLAWWVDWHFA LOEOQW

Counting frequent symbols in the decimations

the decimations are rather short, so the "most frequent is e" trick may not work in every decimation ...

decimation: most frequent symbols (occurring 8 down to 5 times)
0: S W Z V
1: L A
2: F T
3: D O R
4: L I W
5: W L
6: T H W
7: I S X E
8: E H
9: Z E Y
10: I

see published worksheet for further analysis steps Exercise 14

1. Program a function in Sage that accepts a string from the standard alphabet and returns the relative amount of text that the string symbols typically contribute to English text. E.g., the string ETAOINRS (= heavy vowels and heavy consonants) would together contribute about 60% to the text, while JKQXZ would contribute less than 2%.
2. Using that function, complete the Vigenère cipher analysis started in the published worksheet.
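A Python version of the requested function, using the frequency table from earlier in this chapter (Sage is Python-based, so this carries over directly; the percentages are those quoted above):

```python
# Frequencies (in percent) from the table of typical English text samples
EN_FREQ = {
    "A": 7.3, "B": 0.9, "C": 3.0, "D": 4.4, "E": 13.0, "F": 2.8, "G": 1.6,
    "H": 3.5, "I": 7.4, "J": 0.2, "K": 0.3, "L": 3.5, "M": 2.5, "N": 7.8,
    "O": 7.4, "P": 2.7, "Q": 0.3, "R": 7.7, "S": 6.3, "T": 9.3, "U": 2.7,
    "V": 1.3, "W": 1.6, "X": 0.5, "Y": 1.9, "Z": 0.1,
}

def contribution(symbols):
    """Relative amount of typical English text contributed by the symbols;
    set() makes repeated letters count only once."""
    return sum(EN_FREQ[c] for c in set(symbols.upper())) / 100
```

With this table ETAOINRS indeed contributes about 66% and JKQXZ about 1.4%.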


Published worksheet 04 statistical measures.

Brute-force attacks

recall: the system is known to the attacker
if the key space is small, the attacker can just try all keys
at least for a human it is easy (but time-consuming) to distinguish a typical natural language sample from random nonsense strings; see published worksheet
we will now study an easy approach to rank candidate decipherments

Squared differences

let m be a length-N plaintext from a natural language source with symbol probabilities p = (pa)a∈A
let the ciphertext c be the result of encrypting m by some mono- or polyalphabetic substitution cipher

let mK := D(c, K) be a trial decipherment of c by candidate key K
further let N(a, mK) be the number of a's in mK and qa := N(a, mK)/N the corresponding empirical probability
if K was the correct key, we expect

    N · qa ≈ N · pa for every symbol a

the squared differences rank (residual sum of squares) of K is

    rRSS(K) := N² · Σa (qa − pa)²

it quantifies the deviation from the expected situation in terms of statistical estimators
it is related to the mean squared error (MSE) of an estimation

Other rank functions

Chi-square test statistic

    rχ²(K) := Σa (N(a, mK) − N · pa)² / (N · pa)

Exercise 15

Brute force deciphering is currently implemented in Sage only for shift and affine ciphers. In principle it is possible to implement it also for Vigenère ciphers (of known key length), but the full key space is too large.

1. Mount a "structured" brute force attack on Vigenère ciphers: run trial decipherments for all keys in a given set of keys. (It is useful to read in the Sage documentation about the set constructor.)
2. Combine this brute force search with an appropriate rank function.
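A plain-Python sketch of such a structured attack, combining a restricted key set with the chi-square rank from the previous section (the key set, helper names, and the letter probabilities taken from the earlier frequency table are illustrative):

```python
from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# English letter probabilities in percent, from the frequency table above
EN_P = [7.3, 0.9, 3.0, 4.4, 13.0, 2.8, 1.6, 3.5, 7.4, 0.2, 0.3, 3.5, 2.5,
        7.8, 7.4, 2.7, 0.3, 7.7, 6.3, 9.3, 2.7, 1.3, 1.6, 0.5, 1.9, 0.1]

def vigenere_decrypt(cipher, key):
    """Standard Vigenère trial decipherment."""
    return "".join(
        ALPHABET[(ALPHABET.index(c) - ALPHABET.index(key[i % len(key)])) % 26]
        for i, c in enumerate(cipher))

def chi2_rank(text):
    """Chi-square statistic of a trial decipherment against English."""
    n = len(text)
    counts = Counter(text)
    return sum((counts.get(a, 0) - n * p / 100) ** 2 / (n * p / 100)
               for a, p in zip(ALPHABET, EN_P))

def structured_attack(cipher, keys):
    """Try every key in a restricted key set, return the best-ranked one."""
    return min(keys, key=lambda k: chi2_rank(vigenere_decrypt(cipher, k)))
```

On the TOBEORNOT example enciphered with RUN, the correct key wins the ranking even among other three-letter candidates.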


Published worksheet 04 analysis of transposition ciphers.

Some ideas

transposition ciphers don't change the symbol frequencies and are therefore easy to distinguish from substitutions
unfortunately there is no easy method to determine the block length
but transpositions change digram (and in general n-gram) frequencies: q is always followed by u in English, and this may be destroyed by transpositions
by "anagramming" one tries to rearrange the first few letters until some meaningful words occur (digram statistics helps)
large blocks require a lot of memory also for Alice and Bob, so there are chances that the block length is small
guess appropriate block lengths and apply the same permutation in later blocks until something meaningful occurs
automated versions:

- dictionary look-up

- MSE-ranking of candidate decipherments
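For small block lengths the anagramming idea can be automated by exhaustive search over block permutations (a toy sketch; the digram set and scoring are illustrative, and the ciphertext length is assumed to be a multiple of the block length):

```python
from itertools import permutations

# a few "heavy pairs" of English, used as a toy digram score
GOOD = {"TH", "HE", "IN", "ER", "AN", "RE", "ON", "AT", "ES", "ST"}

def blocks(text, n):
    return [text[i:i + n] for i in range(0, len(text), n)]

def score(text):
    """Count occurrences of frequent English digrams."""
    return sum(text[i:i + 2] in GOOD for i in range(len(text) - 1))

def anagram_attack(cipher, block_len):
    """Try all block permutations; keep the candidate decipherment that
    maximizes the digram score (feasible only for small block_len)."""
    best = max(permutations(range(block_len)),
               key=lambda p: score("".join(
                   "".join(b[i] for i in p) for b in blocks(cipher, block_len))))
    return "".join("".join(b[i] for i in best) for b in blocks(cipher, block_len))
```

Applying the same permutation in every block, as suggested above, is exactly what the search exploits.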