
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS STOCKHOLM, 2019


KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Decyphering the Geheimschreiber, a Machine Learning approach

Recreating and breaking the Siemens and Halske T52 used during World War II to secure communications in Sweden

ORIOL CLOSA MÁRQUEZ

Bachelor in Computer Science
Date: 28th June 2019
Supervisor: Richard Glassey
Examiner: Örjan Ekeberg
School of Electrical Engineering and Computer Science
Titel på svenska: Att dechiffrera Geheimschreiber med hjälp av maskininlärning
Títol en català: Desxifrant la Geheimschreiber, un enfocament d’aprenentatge automàtic

Abstract

Historically, rotor cyphers have been used in order to secure written communications. Mechanical machines provided continuous streams of characters for encoding secret messages that were sent to other parts of the continent by means of telephone cables or radio. Several people tried to tackle them, but only those bold enough were successful. In Sweden, the Siemens and Halske T52 was used by the Germans during World War II, and Arne Beurling was one of those bright people who successfully broke it. This thesis aims to recreate his steps, applying modern concepts to the task of breaking the Geheimschreiber. In order to do that, a recreation of the machine has been built virtually and several German texts have been encyphered. The techniques used, involving Recurrent Neural Networks, have proven to be effective in breaking all XOR wheels with different crib sizes, removing the random factor introduced by the cypher. However, whether this method can be applied to real war intercepts remains to be seen.

Sammanfattning

Historiskt sett har rotormaskiner använts för att säkra skriftlig kommunikation. Mekaniska maskiner försåg kontinuerliga strömmar av tecken för att kryptera hemliga meddelanden som skickades till andra delar av kontinenten genom telefonkablar eller radio. Flera personer försökte att knäcka dem men bara ett fåtal personer var djärva nog att lyckas. I Sverige användes Siemens and Halske T52 av tyskarna under andra världskriget och Arne Beurling var en av de första att framgångsrikt knäcka den. Tesen syftar till att återskapa stegen genom att applicera moderna koncept till uppgiften, att knäcka Geheimschreiber. För att lyckas med det har en maskin återskapats i en virtuell miljö och ett flertal tyska texter har chiffrerats. De tekniker som har använts, som involverar Återkommande Neurala Nätverk, har bevisat sig vara effektiva för att knäcka XOR-hjulen genom att ta bort den slumpmässiga faktorn som introduceras av chiffret. Om denna metod kan bli applicerad i riktiga krigssituationer återstår dock att se.

Resum

Històricament, les màquines d’encriptar amb rotors s’utilitzaven per protegir totes les comunicacions escrites. Aquestes generaven un flux continu de caràcters per codificar missatges secrets que eren enviats a l’altra banda del continent a través de cables telefònics o per ràdio. Diverses persones van intentar en va fer-hi front però només aquells prou aguts hi van tenir èxit. A Suècia, la Siemens and Halske T52 va ser utilitzada pels alemanys durant la Segona Guerra Mundial i Arne Beurling fou una d’aquelles persones intel·ligents que va tenir èxit en trencar-la. Aquesta tesi vol recrear els seus passos aplicant conceptes moderns a la tasca, trencar la Geheimschreiber. Per fer-ho, una recreació de la màquina s’ha construït virtualment i diversos texts alemanys han estat xifrats. Les tècniques utilitzades, incloent Xarxes Neuronals Recurrents, han demostrat ser efectives en trencar totes les posicions corresponents a l’XOR amb diferents prediccions eliminant el factor aleatori introduït per la màquina. Tot i això, si aquest mètode es pot aplicar a missatges reals interceptats durant la guerra queda per veure.

To Arne Beurling (1905-1986) for his amazing feat in beating the Geheimschreiber and his leadership in decyphering German traffic which played a big but secret role during World War II.

To Erika Schwarze (1917-2003) for her bravery and determination in spying on the Nazis during the Second World War and providing the Swedes, under the code name Onkel, with information on Gestapo operatives and active agents as well as Geheimschreiber messages in plain.

To Bengt Beckman (1925-2012) for his interest and publications on cryptography which have been indispensable for this thesis.

To every single individual that played a role in this marvelous feat, for their work and contribution to modern democracy.

Contents

Acknowledgements

1 Introduction
  1.1 Problem statement
    1.1.1 Objectives

2 Background
  2.1 Intercepting German signals
  2.2 Evolution of cyphers
  2.3 XOR cypher
  2.4 Breaking the Geheimschreiber
  2.5 The Siemens and Halske T52
    2.5.1 Models
      2.5.1.1 Model T52a/b
      2.5.1.2 Model T52c/ca
      2.5.1.3 Model T52d
      2.5.1.4 Model T52e
      2.5.1.5 Model T52f
    2.5.2 Irregular stepping
    2.5.3 Klartextfunction
  2.6 The App
  2.7 Artificial Neural Networks
    2.7.1 Artificial neuron
    2.7.2 Activation function
    2.7.3 Learning processes
    2.7.4 Backpropagation
      2.7.4.1 The Delta rule
    2.7.5 Regularisation
      2.7.5.1 LASSO regression
      2.7.5.2 Ridge regression
      2.7.5.3 Early stopping
    2.7.6 Recurrent Neural Networks
      2.7.6.1 Long Short-Term Memory

3 Methods and results
  3.1 The Vigenère
    3.1.1 Unknown key and plaintext
      3.1.1.1 Training with a fixed key
      3.1.1.2 Training with variable keys
      3.1.1.3 Training with a German dictionary
    3.1.2 Unknown key and known plaintext
  3.2 The Geheimschreiber
    3.2.1 Cryptanalysis
    3.2.2 Unknown XOR and permutation wheels
    3.2.3 Unknown XOR and known permutation wheels
      3.2.3.1 Training with a short crib
      3.2.3.2 Training with a long crib

4 Discussion
  4.1 Obtained results
    4.1.1 The Vigenère
    4.1.2 The Geheimschreiber

5 Conclusions

A T52 simulator
  A.1 Text encyphering
  A.2 Interactive version

B Cloud computing
  B.1 Virtual machine setup

C International Alphabet 2

D Historical images

E Chronological timeline of the events

Bibliography

Acknowledgements

Although the period in which this project has been developed does not span more than a few months, several people have contributed to the realisation of this thesis. Without their help the results would have been very different, so I would like to thank the following people and institutions.

Richard Glassey, my supervisor, for his enthusiasm and feedback on this project.

Ingrid Karlsson, archivist from the Riksarkivet, and Martina Brisman and Lars Rune, from Försvarsmakten, for their quick interest in pointing me in the right direction.

Kári Ólafsson from the legal unit at Försvarets Radioanstalt, who has helped me from the first day in providing data and background on the matter. Her dedication to all my requests has been outstanding. In consequence, original data has been gathered and inspected, grounding the results in something real.

Christine, Daniel and everyone else working at the Krigsarkivet who have helped me in finding the corresponding material for this thesis. The archives related to this project have been preserved and maintained through the years thanks to them, and consequently I have been able to examine this marvelous material, proof of another epoch. Furthermore, they have provided the necessary resources and allowed me to publish original material in this thesis.

Herman Byström for helping with the translation of the abstract into Swedish in a really short period of time.

KTH Biblioteket for providing me access to printed and online material used to develop the background.


Chapter 1 Introduction

During World War II, the transmission of information through secure means became an important concern. Radio could be intercepted and telegraph lines tapped. This led to the development of encyphering methods in different parts of the planet. Some of the most famous systems include the Enigma, the Type B1 and the Lorenz machine, all used during wartime. The latter —codenamed Tunny by British cryptanalysts[1]— was the main objective of the first programmable electronic digital computer in the world[2]. However, a less known cyphering machine named Siemens and Halske T52 was also used by the Germans to secure communications through neutral Sweden. This machine was not really of interest to Bletchley Park, as much of its traffic was also encoded with other systems easier to break. Because of its complexity compared with the others —whose layouts they had reverse engineered just from intercepts— they believed it was impossible to break.

The Siemens and Halske T52, the Geheimschreiber or Sturgeon as codenamed by the British GC&CS at Bletchley Park[2], was both a cypher and a teleprinter produced by the company that also gave it part of its name. As opposed to the Enigma, heavily used by the Germans during the first part of WW2, this machine was not as portable but offered a more automated way of securing and sending the transmissions. No actual knowledge of cryptography was needed on the operators' side; they would type and receive plain text at all times. Nevertheless, anyone listening in between would not be able to understand what was being said, as the information would appear encyphered. This was the work of a rotor machine, an electro-mechanical stream cypher, part of the encrypting and decrypting machines that were the state of the art for securing communications from the 1920s to the 1970s[3].

In May 1940, a Swedish mathematician named Arne Beurling and his team broke the Geheimschreiber in two weeks using only pen and paper[4]. This was due to several mistakes made by the operators, from sending the same message twice using a different key to repeating some of the text in clear right after switching to crypto mode. Messages had to be intercepted, decyphered, corrected and typed. The long process was performed in mere hours, as each day the key was changed and they had to start over. Correcting the telegrams was also a dangerous task; a bad rectification could completely change the meaning of the text. All this material was found to be very valuable to the Swedish government as it contained vital information for the country's own survival. Handled by hundreds of people, this massive amount of work resulted in the creation of the Försvarets Radioanstalt (FRA), which took charge of the task in 1942.

Unfortunately, that same year the Germans became aware that Sweden was actually listening to their traffic, not only on the lines from Berlin to Oslo but on other connections as well. They improved the security of the machine, upgrading some of its architecture and functionalities. Nevertheless, the results of these actions proved, if anything, more favorable to the Swedes when decyphering the intercepts. The Nazis realised their mistake and in 1943 another upgrade was performed, fixing the previously introduced flaws. The blackout came; Sweden would never read German traffic again.

1Although all other machines were used by the Germans, the Type B Cypher Machine —codenamed Purple by the USA— was used by the Japanese.


1.1 Problem statement

Technology has evolved since the days when rotor machines were widely used, and methods to make computers learn tasks have surfaced. Therefore, can a modern general purpose computer break2 one of the best and most complex cypher machines of the first half of the previous century using Machine Learning?

Figure 1.1: T52 on display at Bletchley Park, picture by the author

1.1.1 Objectives

• Implement a recreation of a T52 machine model, depending on the information available, as well as other, simpler cyphers.

• Perform the corresponding statistical analysis on the generated data from the simulators and, if possible, from real available data taken from the archives.

• Implement different ways to approach the decyphering process using Machine Learning, including the application of Recurrent Neural Networks.

In conclusion, the main objective is to recover the key from cyphered messages produced by the aforementioned cyphers using a Neural Network.

2Break in terms of a cypher machine refers to establishing the essential structure and method of operation for the given apparatus[5]. However, in our case we are referring to the retrieval of the key for a given message.

Chapter 2 Background

2.1 Intercepting German signals

Arne Karl-August Beurling was born in Göteborg on February 3rd 1905 but moved to Uppsala in order to study, obtaining his Ph.D. in 1933. Only four years later, he became a professor of mathematics at the same university. In 1954 he moved to Princeton to become a member and professor at the Institute for Advanced Study. He was a member of several institutions including the Kungliga Vetenskapsakademien as well as the Finnish, Danish and American equivalents, from which he received considerable awards[6].

In the spring of 1940, Beurling was working on Soviet intercepts which used cyphered code books when "somebody dumped a bunch of intercepted telegrams of unknown origin on my desk"[4]. They were completely different from those he was working on. To begin with, they had no spacing and were a continuous stream of characters. He saw that all 26 letters of the alphabet appeared along with the numbers from 1 to 6, which in total made 32 different characters. He then looked for repetitions, trying to understand the code, but soon he realised that everything seemed completely random. Perhaps an imprecise recording of the telegram? No, it was the result of a crypto machine over telex1 lines leased —on April 14th[4]— to the Germans.

Figure 2.1: Arne Beurling[6]

During the Second World War, Germany used three different teleprinter cypher machines2: the Lorenz SZ40 and later models, the Siemens T43 and the Siemens and Halske T52[2]. The latter, also known as the Geheimschreiber, consisted of different models which were developed during the years before and during the war. This cypher was the machine that Beurling had to beat in order to get German information and therefore the source of the messages that had been dumped on his desk.

The amount of work needed in order to tap and listen to the conversations made by the Nazis was colossal. On April 18th came the confirmation of tone trials on the lines. The Germans were testing the equipment on twelve different frequencies, all of them following the international standard. However, Sweden was still using an older system from Western Electric

1The telex network was similar to the telephone network but for sending text messages. 2We must remember that the famous cypher machine Enigma was not a teleprinter, forcing the operator to send the code by other means after using the device.


which was only partially compatible. Fortunately, Swedish tests on equipment from the Jacobsbergsgatan office revealed the use of 50 baud[4], a speed the Swedish equipment of the time could manage. The machines had to be adjusted, as some of the codes were non-printing characters, resulting in no physical record of them. In consequence, they were remapped to the numbers 1 to 6.

Figure 2.2: German broadcast intercepts from December 22nd 1940[7]

Figure 2.2 shows two of the intercepted messages from the end of 1940. The teleprinter output tape can be clearly seen to be cut and glued on the telegram paper. In this particular text, divided into two different messages, we can see the communications between what appears to be the Reichs-Rundfunk-Gesellschaft3 in Berlin and Oslo. The conversation seems to continue, after the first contact attempt, on line 4 when the transmitter says hallo Oslo bitte melden —hello Oslo, please report— (the space is represented by a 5). But because no answer is received, the same operator tries to get the attention of the receiver by ringing the bell of the remote teleprinter. This can be accomplished by sending the corresponding command as if it were another letter of the message. The usage of the Figure Shift on line 5 (which is represented by a 4) followed by the bell key (represented by a J) repeatedly would have activated the bell several times. Without answer, the transmitter presents itself again with hier RRG 10001/5300 Berlin —here RRG 10001/5300

3The Reichs-Rundfunk-Gesellschaft (RRG) was a national network of German public radio and television broadcasting companies used during World War II for spreading Nazi propaganda.

Berlin— and after several spaces they begin sending the broadcast. Nevertheless, the messages would usually start with geheim —secret— in order to indicate that their content was private[7][8]. Other keywords were also used, such as gkdos for geheime kommandosache —top secret—[8]. Because the origin of the signals lay in telegraphy, the Swedes had to intercept duplex connections, joining two channels —unknown from the twelve available— into a single tape, which proved to be very difficult. But soon the work was going to become impossible. The Germans set up the Geheimschreibers and the traffic became "severely unreadable"[4].

2.2 Evolution of cyphers

Keeping secrets has been a problem for centuries. Even in ancient Egypt, hieroglyphs have been found encyphering mysterious messages. Nevertheless, the first to consider securing information were the Mesopotamians with the use of clay tablets. Several methods have been created and developed since then, each time with hopes of increasing the security of the system.

One of the most simple examples is the substitution cypher, where the encrypting is performed using a fixed system. Commonly, this type of cypher operates on single letters, receiving the name of simple substitution, but this is not always the case. The operation is simple: each original character gets mapped to a new one, from a new alphabet which can be composed of the same symbols or of completely new figures. This function is injective as it preserves distinctness: all input characters are mapped to different symbols, and no two different characters can be mapped to the same one. This method provides 26! ≈ 4,03 · 10^26 different keys, which is very large. However, as we show with the following cypher, it can be easily broken. One well-known example is the cryptogram found in The Gold-Bug by Edgar Allan Poe. Here, each character of the message gets replaced by another symbol and this relation is maintained through the entire text.

53‡‡†305))6*;4826)4‡.)4‡);806*;48†8¶60))85;;]8*;:‡*8†83(88)5*
†;46(;88*96*?;8)*‡(;485);5*†2:*‡(;4956*2(5*--4)8¶8*;4069285);
)6†8)4‡‡;1(‡9;48081;8:8‡1;48†85;4)485†528806*81(‡9;48;(88;4(‡
?34;48)4‡;161;:188;‡?;

agoodglassinthebishopshostelinthedevilsseattwentyonedegreesandthirteenminutesnortheastandbynorthmainbranchseventhlimbeastsideshootfromthelefteyeofthedeathsheadabeelinefromthetreethroughtheshotfiftyfeetout

EDGAR ALLAN POE

As we can see, the symbol 8 is the one that appears the most, followed by ; and 4. Matching these values with the most frequent characters, in this case in English, means that they represent e, t and a. Taking a look at the actual decyphered message, we see that the first two actually match those stated before, although the 4 does not encypher the a. Nevertheless, by computing a simple frequency table as shown in section 3.2.1 we are able to reduce the huge amount of possible combinations.

Another simple and widely known substitution cypher is the Caesar. Each letter is encyphered as another letter determined by a fixed number of positions down the alphabet. For example, if we had a shift of 22 (that is, a backward shift of 4) the character E would be replaced by A, F by B and so on[9]. In order to decypher the text we just need to apply the reverse shift to the text. If we assign numbers from 0 to 25 to the corresponding letters of the English alphabet (A becomes 0, B becomes 1 and so on) we can represent the encyphering process as cypherletter = (plainletter + shift) mod 26. Likewise, the decyphering process is defined by plainletter = (cypherletter − shift) mod 26. The Caesar cypher is one of the easiest to break. Because the only operation applied to the plaintext is

the shift, the resulting cyphertext will keep the distinctive shape of the character frequency table. Moreover, there are only 26 different values for the shift in English, meaning a brute-force attack4 can be easily employed.

Cyphers can also incorporate the key into the message, known as autokey cyphers, which generate the key from the message or by selecting letters from a text or a book. One of the most famous examples is the Vigenère. Originally invented by the Milanese Girolamo Cardano, it was perfected by Blaise de Vigenère, born on April 5th 1523 in the village of Saint-Pourçain, France. Unlike Cardano, Vigenère provided a priming key consisting of a single letter known both to the encypherer and to the decypherer. The idea was that this character would reveal the first plaintext letter, which could then be used as the key to decypher the second one and so on. Not only this, but Vigenère also introduced the concept of changing keys for each message, so that he would not reuse the same character each time, which is a weakness[10][9].

Stream cyphers incorporate a more complex way to generate the key via a pseudorandom keystream. In these systems, each character of the message is encrypted with a corresponding character of the generated key. However, the stream is not truly random, as it needs to be generated also by the receiver in order to decypher the text. This makes the encryption depend on the current state of the machine. Usually, the encypherer will write down the corresponding parameters that have been set in order to start the encyphering process; the receiver will need the same parameters to be able to decypher the text. The machines usually perform this encyphering via a XOR operation as explained in section 2.3. While older versions of this type of cypher generated the key from rotors, modern versions use a random seed value and shift registers. In any case, the process is similar and the seed is needed, as a parameter, to decypher the message.

Other methods not as well known as the previous ones include the manually operated VIC cypher used by the Soviet spy Reino Häyhänen during the Cold War. The technique involves the straddling checkerboard, a substitution cypher in its own right but of variable length, achieving fractionation and data compression[11].

Finally, there are other techniques considered unbreakable, such as the one-time pad, which uses a previously established key of equal length to the original text. As its name indicates, the key is only used once. In any case, to be a really secure system it has to be a truly random sequence of characters, not any pseudorandom sequence as we have seen previously. Described for the first time by the banker Frank Miller in 1882, it was later patented by Gilbert Sandford Vernam, who used a XOR operation for encyphering[12]. This method prevents the attacker from performing any statistical analysis as the cypher text does not reveal any particularity.
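Returning to the Caesar cypher above, the shift formula and the brute-force attack are small enough to sketch directly. The following Python snippet is ours (not part of the thesis code) and uses an arbitrary example word.

```python
# Minimal sketch of the Caesar cypher: cypherletter = (plainletter + shift) mod 26,
# decyphering applies the reverse shift. Example word chosen arbitrarily.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def caesar_encypher(text, shift):
    return "".join(ALPHABET[(ALPHABET.index(c) + shift) % 26] for c in text)

def caesar_decypher(text, shift):
    return caesar_encypher(text, -shift)

cyphertext = caesar_encypher("GEHEIMSCHREIBER", 3)
print(cyphertext)                       # JHKHLPVFKUHLEHU
print(caesar_decypher(cyphertext, 3))   # GEHEIMSCHREIBER

# Brute force: only 26 shifts exist, so every candidate can simply be listed.
for shift in range(26):
    print(shift, caesar_decypher(cyphertext, shift))
```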

2.3 XOR cypher

A XOR operation, also called exclusive disjunction, outputs a true value when its inputs differ. This means that if both inputs are true or both are false, the operation evaluates to false, but if they are different, it evaluates to true, as shown in figure 2.3. It is equivalent to addition modulo 2, as in binary 1 + 1 = 0 (with only 1 bit of representation, which causes overflow).

Input    0   0   1   1
Key      0   1   0   1
Output   0   1   1   0

Figure 2.3: XOR or modulo 2 operations

4A brute-force attack consists of trying all possible keys in order to decypher, in this case, a message.

As an example, we can take a look to the figure 2.4 where we encypher the word CYPHER. Note characters are encoded according to the digure C.1 on the appendix. Moreover, numbers ranging from one to six are used in order to represent the unprintable characters of a telex machine, as explained in section 2.1.


Figure 2.4: Performed XOR to the word CYPHER with key KTH

The process for decyphering is the same as for encyphering: we just have to perform the XOR between the encoded message and the key. This exposes any cypher using only this encryption method when the plaintext is known, because if plaintext ⊕ key = cyphertext and cyphertext ⊕ key = plaintext, then plaintext ⊕ cyphertext = key. What this actually means is that if we know the original message and the encoded message, we can extract the key, assuming it repeats through the message. Nevertheless, this is not the case for a pseudorandomly generated key that does not repeat itself, at least for the span of the message.
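The relations above can be checked directly on the 5-bit codes of figure 2.4; the following Python sketch (ours, not the thesis implementation) reproduces the figure and shows how a known plaintext exposes a repeating key.

```python
# 5-bit codes taken from figure 2.4 (the full character table is in appendix C).
plain = [0b01110, 0b10101, 0b10110, 0b10100, 0b00001, 0b01010]  # C Y P H E R
key   = [0b01111, 0b10000, 0b10100, 0b01111, 0b10000, 0b10100]  # K T H K T H

cypher = [p ^ k for p, k in zip(plain, key)]
print([format(c, "05b") for c in cypher])
# ['00001', '00101', '00010', '11011', '10001', '11110']  ->  E S 5 2 Z V

# Decyphering is the same XOR, and a known plaintext makes the key drop out.
assert [c ^ k for c, k in zip(cypher, key)] == plain   # cyphertext XOR key = plaintext
assert [c ^ p for c, p in zip(cypher, plain)] == key   # cyphertext XOR plaintext = key
```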

2.4 Breaking the Geheimschreiber

The team in charge of handling Geheimschreiber messages was located at number 4 of Karlaplan5, on the fourth floor, after all equipment was moved from the offices in Jacobsbergsgatan on May 21st[4]. This part of the cryptography department was managed by the Försvarsstaben (FST) until the creation of the Försvarets Radioanstalt (FRA) in 1942. It consisted of eight people, four of them students from KTH, who were given access to two direct telephone lines with the relay station in Göteborg to hook up with any German line between Berlin and Oslo. From this station, the material was shipped to some of the FST offices in a villa of Elfvik6, in the Lidingö Municipality[4].

At the beginning, Beurling knew nothing about the origin of the messages. After trying to find the usual depths7 without results, he directed himself to the house at Karlaplan. He examined all the available material "in the bedrooms of the old apartment" where he found a "jam-packed [closet] with cubic boxes, each about a foot8 in height, and filled with already collected material". From that box, he copied down the traffic from May 25th and May 27th, as they seemed to be free of any typing errors. Only two weeks later, Beurling and his team managed to crack the code by themselves, helped only by pen and paper[4].

Decyphering the messages was really valuable. In the spring of 1941, the cryptographers discovered that an attack on the USSR was going to be perpetrated between June 20th and 25th. All traffic appeared to indicate an invasion by the Axis powers. Erik Boheman, secretary general of the Utrikesdepartementet (UD), warned Stafford Cripps, at that time the British Ambassador on Soviet territory,

5Codenamed "Karlbo" as the union of the abbreviation of "Karlaplan" and "bo", abbreviation at the same time for "Bosön", an area west of Stockholm. 6Codenamed "Rabo" as its intended target for most of its activities was the Red Army. Note the use of "bo" at the end which was continuously adopted for naming listening posts. 7A depth is produced when two or more messages are sent with the same key[5]. 8A foot corresponds to 30,48 centimetres in metric units. 10 CHAPTER 2. BACKGROUND

about the imminent attack during a dinner in Stockholm while Cripps was passing through. Although the ambassador believed the threat was real, Britain was not taken seriously by Stalin and the warning was dismissed[9]. Operation Barbarossa was finally launched on June 22nd and took the Red Army by surprise, unaware of the warnings given by the United Kingdom[4]. Knowing historic events before they even occurred proved to be very valuable to Sweden. Members of section 31 —the department in charge of SIGINT9— remembered the emotion that surrounded the work. On occasions, high-ranking officers would stand behind the team and read the tapes over their shoulders, as recalled by Birgit Asp and Gertrud Hirschfeld.

Two years later, on July 21st 1942, the T52c appeared on a few lines[13]. This was due to the concern on the German side that Sweden was actually listening to their communications. The Swedes were intercepting all traffic from Oslo to Berlin, Narvik and Sätermoen, from Trondheim to Narvik and Tromsö... not to mention the German embassy communications with Berlin[14]. On other occasions they would also intercept traffic not only from Scandinavia and Berlin but from the south of Europe, such as messages from Oslo to Rome[15]. The Nazis probably learned of this from the Finnish military attaché, Colonel Stewen, before June 17th of the same year[13]. The new model appeared to be similar to the previous one: it could be attacked by depths, but the tools already available to crack the code did not help. Nevertheless, section 31 finally found out how the new model operated; the rush of the Germans in changing the encyphering process had fatal consequences. They had made the T52c compatible with previous models, and that would give Sweden the necessary information in order to break it.

Early the same year, Erika Schwarze10 was appointed secretary of Hans Georg Wagner, head of the German intelligence of the Abwehr11 in Stockholm. Recruited by Helmuth Ternberg, she conveyed information to the Swedish government, including data on active Gestapo agents. Her greatest achievement was the memorisation and transcription of several messages in plain in the spring of 1943, although it is not known if FRA ever received this information. However, in 1944 she was asked to return to Germany, unaware that she had been discovered and was to be executed. The Swedish intelligence service intervened and provided her with a new identity. She lived the rest of her life in Sweden and published her memoirs in 1993[16].

Because of German concerns, all important messages were no longer sent via Sweden but over other connections, with new cables even being installed for that same purpose. This was done gradually until October 1942, when only traffic from the German embassy could still be intercepted[13]. During the same month, the Germans introduced a procedure to be followed after the machines were switched to cypher mode. The operators would have to type a random word at the beginning of the message, so that the start of the real message was moved to an unknown position. This was called wahlwörter (choice words) and increased the difficulty of decyphering the message. However, although operators usually failed to follow instructions, this time some of them followed them to the letter: they would use the example word given in the instructions to start the transmission, which appears to have been sonnenschein (sunshine).
Nevertheless, most of the time operators would follow the instructions correctly, as recalled by Carl-Gösta Borelius, a student of Beurling, who remembered that the record was held by the word donaudampfschiffsfartsgesellschaftskapitän (captain of the Danube steamship traffic company)[13]. The cryptographic department not only specialised in the Geheimschreiber but also in the previously mentioned Lorenz SZ40. A machine was even built in order to crack its key, but only one model was produced. November 1942 was the apogee for decyphered messages, 10.638 in total, a number which would fall quickly the following month.

9Signals Intelligence (SIGINT) consists of the gathering of intelligence by interception of signals. 10Born in Stralsund on September 20th 1917 and died April 9th 2003 in Stockholm. 11The Abwehr was the German military intelligence service for the Reichswehr and the Wehrmacht from 1920 to 1945.

The situation of the department as of the end of that month was the following[4].

• Section 31n, wire collection. With 72 receivers and 36 teleprinters, 9 technicians and 8 gluing personnel along with 1 to 3 repair people from the Kungliga Telegrafverket (the Swedish Royal Telecommunication Administration) were responsible for intercepting and collecting all German communications.

• Section 31g, cryptanalysis and Apps handling. 32 Apps were managed by 14 cryptanalysts and 60 operators; 22 of the machines had attachments for the T52c model, along with 26 specially configured teleprinters.

• Section 31f, cleaning and typing. 56 cleaners and 18 typists were in charge of handling the decyphered messages. They would remove any perturbation produced by the intercepting equipment thus "cleaning" the message.

• Section 31m, compilation. At the end of the process, 7 compilers, 13 translators and other personnel produced the final messages.

2.5 The Siemens and Halske T52

The company Siemens, based in Germany, developed during the 1930s a series of mechanical teleprinter cyphers which received the name of Geheimschreibers. They used both superpositions and permutations, with pin wheels controlling both tasks in order to encypher the text. Its relays controlled ten coding rotors, which could be connected to the relays by means of a manual mapping. The first five wheels computed a XOR operation as explained in section 2.3, while the latter five transposed the previous result. The wheels had different numbers of pins; in order from left to right they were 47, 53, 59, 61, 64, 65, 67, 69, 71 and 73. All are relatively prime, meaning there is no common factor between them, resulting in 893.622.318.929.520.960 ≈ 8,94 · 10^17 different position combinations[13]. Every few days (from three to nine), at nine in the morning, the manual mapping would be changed, resulting in a reassignment of the wheels to switches and telex code levels[13]. On top of that, the starting point of the wheels was controlled by other variables, which were changed every day, including a part that was changed between each message. Five wheels were fixed during 24 hours, positioned by the QEK indicator, while the other five would be selected before each transmission and sent to the receiving end by a QEP indicator. In total, there were 10^27 key setting possibilities[4], which, to make a comparison, is more than the number of stars in the observable universe.
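The figure quoted above is easy to verify: because the pin counts are pairwise relatively prime, the number of distinct wheel-position combinations is simply their product. A short Python check (ours):

```python
# Verify that the ten wheel sizes are pairwise coprime and multiply them together.
from math import gcd, prod

pins = [47, 53, 59, 61, 64, 65, 67, 69, 71, 73]
assert all(gcd(a, b) == 1 for i, a in enumerate(pins) for b in pins[i + 1:])
print(prod(pins))  # 893622318929520960, roughly 8,94 * 10^17
```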

Teleprinter transmission   HIER35MBZ35QRV45B35K35QEP455WT55QT55RW55TR55PR35UMUM354J3VEVE
Transmitter                HIER MBZ QRV?   QEP 25 15 42 54 04   UMUM
Receiver                   KK   VEVE

Figure 2.5: Example of protocol transmission for changing to cypher mode

The aforementioned procedure for changing the five wheel positions with the QEP indicator before transmitting each message is shown in figure 2.5. On the top row of the table, we can see how the intercepts were taken when the transmission was in clear text. Note that 3 represents the letter shift, 4 the figure shift and 5 the space. As the teleprinters used for reading the lines did not interpret the character set change, they printed letters instead of actual numbers or any other symbol of the figure character set. The transmitter would first identify itself saying HIER MBZ —MBZ

here—, where MBZ is the code of the station. Then, they would ask whether the receiving end understands what is being said with QRV?. If so, they would answer KK, meaning klar —clear—. It was now the moment for transmitting the QEP numbers, the positions for the five rotors that were changed every time a message was sent. For this, the transmitter would specify QEP and after it the positions, with leading zeroes if necessary. Finally, the same station would transmit UMUM, for umschalten —switch—, to which the receiving end would reply VEVE for verstanden —understood—. Here, the transmitter was telling the receiver to switch modes, between clear and cyphered text, after the wheels were positioned. Then they would repeat some of the text transmitted before to see if the encoding and decoding was performed successfully. This was probably a point exploited by Beurling and his team12 if several messages in depth were available[4].

One of the major weaknesses of this machine did not lie within its architecture but in the transmission lines. Note that because of the 5-bit representation, only 32 different characters could be represented. In order to overcome this limitation, the Letter Shift and Figure Shift feature was introduced. This enabled the machines to work with two different sets of characters, one used at a time. One of the characters indicated to the machine to change from the "letter set" to the "figure set" and another one the opposite action. This increased the number of possible encoded characters, which now included numbers and punctuation marks as shown in appendix C. However, this functionality was later shown to be the reason for the Swedish victory over the machine.

As the telex lines were prone to interference, a character was not always rendered correctly on the receiving end. This did not affect the communication much, as a wrong letter in a word did not affect its readability. However, if a character was received incorrectly and interpreted as a Figure Shift, that is if one or more bits were flipped resulting in a 11011 —the 5-bit representation of a Figure Shift—, the receiving machine would not be able to decypher the text correctly as it would be using the other character set. In order to avoid this, the operators typed a Letter Shift each time before or after a space. Now, if there was an error in any of the characters, the next space would restore normality, switching the receiving machine to the letter character set again. Note that applying a letter shift when the machine is already on the letter character set does nothing. Beurling took advantage of this situation because he discovered that 3 and 5 —the Swedish representation for a letter shift and a space— only had one bit in common, as they were encoded 11111 and 00100 respectively. As a result, for a guessed 3, there were only five possible 5s and vice-versa. This meant that once a space was spotted, the neighbouring characters could be downsized to five possibilities each[13]. Although Beurling did not disclose much information about his team's feat during the war, he once revealed the importance of the threes and fives, but when asked further, he replied that "a magician does not reveal his tricks".

Another of the mistakes made by the Germans lay in providing depths to the Swedish cryptographers. Because of the stated interference on the lines, the machines could also lose synchronisation. When this happened, they were no longer in the same state, resulting in completely unintelligible messages for the receiver.
The operators would then start the process again. Because of the machine architecture, although depending on the model, the rotors could be freed with the release of a locking arm. When turned, the rotors would move back to the position where they were initially set. The operators had to send the message again from the beginning, and to do so they would reset the state of the machine to what it was before. But instead of choosing a new QEP number, they would just start typing the message again. In doing so, they were unknowingly providing the cryptographers with a way to decypher the text, because of the particularities of a XOR cypher as explained in section 2.3. Note that when two messages of this kind are aligned in depth, the key element is removed, resulting in cyphertext1 ⊕ cyphertext2 =

12An example of the technique used to exploit this can be found in Bengt Beckman's book "Codebreakers: Arne Beurling and the Swedish Crypto Program during World War II" on pages 79-86.

plaintext1 ⊕ plaintext2. From here, individual plaintexts can be worked out linguistically by trying cribs13, and when combined, they can produce intelligible plaintexts from the second encyphered message as (plaintext1 ⊕ plaintext2) ⊕ plaintext1 = plaintext2.
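A minimal sketch (ours, with made-up 5-bit values) of how a depth cancels the key and how a correct crib on one message exposes the other:

```python
# Two messages encyphered with the same (unknown) key stream form a depth.
msg1 = [0b01110, 0b10101, 0b10110]   # hypothetical plaintext codes
msg2 = [0b10100, 0b00001, 0b01010]
key  = [0b00110, 0b11001, 0b01011]   # same key stream reused for both

c1 = [m ^ k for m, k in zip(msg1, key)]
c2 = [m ^ k for m, k in zip(msg2, key)]

combined = [a ^ b for a, b in zip(c1, c2)]                 # the key has cancelled out
assert combined == [a ^ b for a, b in zip(msg1, msg2)]
assert [x ^ m for x, m in zip(combined, msg1)] == msg2     # a crib for msg1 reveals msg2
```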

2.5.1 Models

The Siemens and Halske T52 consisted of several models manufactured through the span of years preceding the war and until the end of it. They increased in complexity of design, but all are based on the first ever envisioned model, the T52a.

2.5.1.1 Model T52a/b

This cypher was noted for its limited security compared with later models, mainly because it stepped its wheels regularly. The model T52a was manufactured between 1932 and 1934; however, it was found to cause radio interference. In consequence, the model T52b was created. Built from 1934 to 1942, it incorporated a filter avoiding the disruption. Because this was the only change, it was completely compatible with the T52a[17].

Figure 2.6: T52 models evolution (T52a, T52b, T52c, T52ca, T52d, T52e and T52f)

2.5.1.2 Model T52c/ca

Developed in 1941, it included a simpler setting for the message key. The T52c can be seen in figure D.1a in the appendix. Note that on the frontal top left part, five switch levers were installed for setting the key more easily. This resulted in a reduction of the possible alphabets, thus making the machine easier to break. Its designers realised the mistake and increased the possibilities, creating the model T52ca. Both had a switch that made them backwards compatible with any T52a or T52b machine[17].

2.5.1.3 Model T52d

A serious improvement in security came with the incorporation of irregular wheel stepping and the klartextfunction (KTF), as explained in section 2.5.3. Designed and produced between 1942 and 1943, this model was never broken by Swedish cryptanalysts and was considered to have a better cryptographic strength than the Lorenz SZ40[17].

2.5.1.4 Model T52e

Both irregular stepping and the KTF were also applied to the T52c model[17]. Although the T52e is not compatible with the T52d, they are similar in nature.

2.5.1.5 Model T52f

This last model, evolved from the T52e[18], was never put into production, possibly because of the continuous bombing of Siemens and Halske factories, among others[19][20]. Furthermore, no information about this model is available at the moment.

13A crib is a probable word or phrase on a given encyphered message[5].

2.5.2 Irregular stepping

Regardless of the model, some or all wheels stepped each time a character was sent or received. This meant that whether the machine was receiving or sending information, the rotors would step. If we think in terms of a computer, then for each received or sent character the counter would increment by one, thus loading the new key for the next cycle. Nevertheless, in more developed models of Sturgeon such as the T52d, not all wheels stepped once each time, making it even more difficult to decypher the messages.

2.5.3 Klartextfunction

The idea of using the KTF on transmissions was that it would cause more difficulties for anyone trying to break the code; however, it also affected the recipient. The added device was activated by one bit of the plain text character two characters back after encoding. It appears it was the 5th bit[2], but later models such as the T52e may have used the 3rd[21]. The KTF seems to have been patented14 by the Swedish inventor Arvid Damm around 1920[2]. This might seem ironic: a group of Swedish cryptographers against an invention of their own country, of which they had no knowledge. In order to indicate whether or not the transmission was using the klartextfunction, it is known that the operators would type MIT KTF —with KTF— or OHNE KTF —without KTF—[2].

2.6 The App

Shortly after the success of Beurling and his team in decyphering the T52, the amount of work was becoming overwhelming. The pace at which intercepts were processed concerned everyone; it was a time-consuming task that had to be handled efficiently. This is how the App15 was born, a machine that emulated the Geheimschreiber. Beurling knew how to turn his knowledge into design principles but lacked the means for technical implementation. Someone familiar with telephone switches had to be found, and what better place to search than in the L.M. Ericsson company. Finally, Vigo Lindstein from the Cash Register division was chosen, which turned out to be a great decision because of his ease in turning cryptographic ideas into working hardware[4].

Although little information about the machines has survived to this day, several pictures (see appendix D) and descriptions can still be found. The App enabled the operators to follow the frequent changes in the keys and therefore quickly extract plain text. They were built in large quantities by L.M. Ericsson with precision mechanics[13]. In the fall of 1942, between 30 and 40 of them were in operation[26][13][4]. The first models used celluloid tapes with holes to represent 1 and the lack of them to represent 0. This was done in order to be able to change the tapes easily if the machine wheels were changed. However, this proved to be a problem as the film strips were prone to crack and break easily. Not only this, but because of static electricity they even clung to the bottom of the App. Finally, and because the patterns were never changed on the German side (as explained in appendix A.2), the wheels ended up being built with a more durable material[4]. Each App had a connected teleprinter on which the operator could write. The text was transmitted to the App and then returned decyphered to the teleprinter in order to be printed.

14Patents US1502376A[22], US1540107A[23], US1484477A[24] and US1643546A[25], filed with the United States of America patent office, describe cypher machines and means to encypher and decypher messages even through "telegraphic dispatch". In particular, one of the patents specifies that if a single character is lost during transmission, the machines would lose synchronisation and would no longer be able to encypher and decypher simultaneously. 15The word "App" derives from apparat in Swedish which means "apparatus"[4].

Figure 2.7: A teleprinter and an App machine for German traffic decryption[26]

On the left side of figure 2.7 we can see a Siemens teleprinter used for communication between the App and the operator. On the right, an App is displayed with cables for rewiring the key into the machine, an early form of programming also used in the first computers. On top of the machine, there is an early form of peripheral which enabled it to tackle the T52c. This layout was probably chosen in order to be able to decypher traffic from both the T52a/b and its successor, the T52c. The Siemens teleprinter was not the model used at first by the cryptographic department; those came from the Chicago-based Teletype Corporation. Thanks to the Kungliga Telegrafverket, a batch of Siemens teleprinters was provided, at a time when their supply was scarce. This had the side effect of returning to Morse telegraphy on some lines[13].

Year    Not decyphered   Decyphered   Unknown   Total
1940    –                –            7.100     7.100
1941    –                –            41.400    41.400
1942    101.000          19.800       –         120.800
1943    86.600           13.000       –         99.600
1944    –                –            29.000    29.000
Total   187.600          32.800       77.500    297.900

Figure 2.8: Number of messages and status per year[13][4]

The number of decyphered messages increased as time passed by. On July 1st 1942, the FRA was established, mainly thanks to the breaking of the Geheimschreiber. Sadly, in May 1943 the T52d entered service and decyphering became impossible for the cryptanalysts[13]. In total, around 300.000 messages were collected[27] by the team in Karlaplan, which are now handled by the Krigsarkivet[28], the military archives of Sweden.

2.7 Artificial Neural Networks

Artificial Neural Networks (ANN) have been in evolution since 1943, but most importantly since the publication of the backpropagation algorithm in 1975 by Paul J. Werbos. They are computing systems inspired by our own brains, although a couple of decades ago the field split between those who wanted to recreate the exact structure of the brain and those who did not. An ANN is based on a collection of connected nodes or units which take the name of neurons. Each connection, like the synapses in a biological brain, can transmit a signal from one artificial neuron to another. A unit that receives a signal can process it and then signal additional artificial neurons connected to it.

There are several different types of ANNs, from a single layer Perceptron to a fully-connected Recurrent Neural Network. It is worth noting that the connections are not always obligatory and sometimes dropouts are introduced to increase performance and decrease the chance of memorising the input. However, they are all an attempt to mimic the connections of the human brain and its signal production based on previous experiences. This includes the effect one neuron has on another one when fired, which is called the synaptic weight[29].

In the field of cryptography there is almost no real application in use nowadays. Nevertheless, ANNs are well known for their ability to selectively explore the solution space of a given problem. This feature finds a natural niche of application in the field of cryptanalysis. As we suggested before, Neural Networks offer a new way to attack cyphers, based on the principle that any function can be reproduced by a network.


Figure 2.9: Simple network with n inputs, m outputs and o hidden layers of size p

An example of an Artificial Neural Network architecture can be seen in figure 2.9. It consists of an input layer of size n, an output layer with m units and o hidden layers with p units. In this case, the number of input neurons can be different from the number of output neurons, but they could have the same width. However, in the example given, all hidden layers have the same number of units, p. Note also that this network is fully connected with no dropouts, as all nodes on a layer are connected to those on the following one. Networks usually have known shapes depending on the problem, at least for the input and output layers. In a regression problem there is only one output unit, which will contain the predicted value. However, when classifying, the number of output neurons is usually the same as the number of classes, of which the unit with the highest activation value is the winner. In our case, we want to predict the key of a given cyphertext and plaintext, thus the number of output neurons will not be the same as the number of inputs, as explained in depth in section 3.1.

2.7.1 Artificial neuron

The artificial neuron is based on biology but modeled as a function. It is the elementary unit of an Artificial Neural Network. A visual representation can be seen in figure 2.10. In this case, the neuron k receives various inputs xkn that are transformed by the neuron connection weights wkn and summed along with a bias parameter z. The output is produced by an activation function ϕ as explained in section 2.7.2. The successive application of the functions in 2.1 over a given set of values allows the activation of all units of the network[30].

\[
a_k = \sum_{i} w_{ki} x_{ki} + z, \qquad y_k = \varphi(a_k) \tag{2.1}
\]

Figure 2.10: Visual and mathematical[30] representation of the neuron k
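As a concrete illustration, this is a minimal Python sketch (ours, assuming NumPy) of the single neuron of equation 2.1, using the sigmoid of section 2.7.2 as the activation function ϕ:

```python
import numpy as np

def neuron(x, w, z):
    """One artificial neuron: weighted sum of the inputs plus the bias, then sigmoid."""
    a = w @ x + z
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical inputs, weights and bias.
print(neuron(np.array([1.0, 0.0, 1.0]), np.array([0.5, -0.3, 0.2]), 0.1))  # about 0.69
```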

2.7.2 Activation function

The output of a node is determined by an activation function given the current input. If the node is not on the output layer, then the resulting value might be used as the input of another neuron. The outcome of the activation function will usually be within a range of (0, 1) or (−1, 1), where ϕ is often a nondecreasing function of the total input of the unit. There are different types of functions, from a hard step to a smoother transition. Some of them can be seen in figure 2.11. On the left, the binary step (or Heaviside step) results only in 0 or 1, given by the sign of the input. In the center, the Rectified Linear Unit (ReLU) can be defined as the positive part of the input: when the value is negative, the result will be 0. And finally, the logistic function or sigmoid is the σ(x) function, offering a smooth curve of activation.

(a) Binary step:
\[
f(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 & \text{for } x \geq 0 \end{cases}
\]

(b) ReLU:
\[
f(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \geq 0 \end{cases}
\]

(c) Sigmoid:
\[
f(x) = \frac{1}{1 + e^{-x}}
\]

Figure 2.11: Different activation function definitions and plots

The choice of the activation function can alter the way in which the network behaves. With the binary step, for example, outputs will only be 0 or 1, which might result in a loss of information. Other functions such as the sigmoid will tend to push the output towards those values but will maintain the aforementioned gradient.
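The three functions of figure 2.11 translate directly into a short NumPy sketch (ours):

```python
import numpy as np

def binary_step(x):
    return np.where(x < 0, 0.0, 1.0)     # hard threshold at zero

def relu(x):
    return np.maximum(0.0, x)            # positive part of the input

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # smooth transition between 0 and 1

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(binary_step(x))   # [0. 0. 1. 1. 1.]
print(relu(x))          # [0.  0.  0.  0.5 2. ]
print(sigmoid(x))       # values strictly between 0 and 1
```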

2.7.3 Learning processes

There are three main categories of learning.

• Supervised learning: The learning rule is provided with a set of training data consisting of inputs and outputs. The inputs are applied to the network and the resulting outputs are compared to the expected ones. For each iteration, the learning rule is used to update and adjust the weights and biases of the network in order to approximate a function.

• Reinforcement learning: Similar to supervised learning, but it does not use outputs; instead, scores are used in order to update the units. This score or grade is a measure of the performance of the network on a given set of inputs.

• Unsupervised learning: Different from the previous methods, the weights and biases are updated only with the inputs, without any output available. The network will usually try to perform clustering and categorise the patterns into a finite set of classes.

For this thesis, all approaches fall under supervised learning. Reinforcement learning could be applied, but because of the huge effect of any outliers it will not be implemented.

2.7.4 Backpropagation

The weights of the network are usually initialised randomly with low values, which results in poor network performance. In order to improve this, the process of backpropagation can be applied. Batches of data of the same length are fed as input to the network, resulting in several predictions. Then, using an error function, the costs are computed and propagated back (hence the name backpropagation), updating the weights of the network and resulting in the ability to map a set of inputs to their outputs. The weight update process is not unique and there are several methods, which will depend on the optimisation algorithm chosen. However, all approaches rely on the gradient descent principle. Basically, the gradient of the loss function is used to converge towards a minimum cost, which comprises the combination of weights and biases optimal for minimising the error of the prediction[29].

2.7.4.1 The Delta rule

The Delta rule is a gradient descent rule that updates the mentioned weights of the network. A simple algorithm can be to follow the steepest descent, minimising the cost function ε = e²/2. The gradient then defines the direction16 in which the error increases most, meaning we need to move in the opposite direction in the weight space[31]. The gradient and the Delta rule can be computed as shown in equation 2.2.

\[
\frac{\partial \epsilon}{\partial \vec{w}} = e\,\frac{\partial e}{\partial \vec{w}} = e\,\frac{\partial (t - \vec{w}^{T}\vec{x})}{\partial \vec{w}} = -e\vec{x}, \qquad \Delta\vec{w} = \eta e \vec{x} \tag{2.2}
\]

However, one must take into account that the error has to be measured before the threshold, which will only work for the last layer. In consequence, the Delta rule only works for networks with a single layer because, as mentioned before in section 2.7.2, the application of the activation function might mean a loss of information.

16Here we are referring to the relative direction; in English this comprises both the inclination of the segment with respect to the coordinate axis (direcció in Catalan) and the way in which the segment points (sentit in Catalan) of a vector.
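A minimal sketch (ours, assuming NumPy) of repeated Delta-rule updates from equation 2.2, for a single linear unit and a single hypothetical training example:

```python
import numpy as np

def delta_rule_step(w, x, t, eta=0.1):
    """One update of the Delta rule: move against the gradient of eps = e^2 / 2."""
    e = t - w @ x            # prediction error e = t - w.x
    return w + eta * e * x   # delta w = eta * e * x

w = np.zeros(3)
x = np.array([1.0, 0.5, -1.0])   # hypothetical input
t = 1.0                          # hypothetical target
for _ in range(50):
    w = delta_rule_step(w, x, t)
print(w @ x)  # approaches the target 1.0
```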

2.7.5 Regularisation

Because the network tends to increase in complexity, this will prompt overfitting. In order to avoid this, one of the methods we can use is regularisation. It can also be used in order to avoid manual mappings of the inputs to the outputs, provided there are enough units on the hidden layers. Regularisation can be presented as penalised learning because it introduces a penalty term in the error function, as seen in equation 2.3[32]. The final complexity of the model will depend on the hyperparameter17 λ, which is the regularisation coefficient[30].

\[
\hat{E}(w) = E(w) + \lambda E_{\Omega}(w) \tag{2.3}
\]

One of the simplest regularisers is given by the sum of squares of the weight vector elements, which can be seen in equation 2.4 along with the sum-of-squares error function[30].

E_{\Omega}(w) = \frac{1}{2} w^{T} w
E(w) = \frac{1}{2} \sum_{n=1}^{N} (t_n - w^{T}\phi(x_n))^2
(2.4)

If we consider the previous functions, the total error function becomes the following formula, which is known as weight decay as it encourages the weight values to decrease towards zero.

\hat{E}(w) = \frac{1}{2} \sum_{n=1}^{N} (t_n - w^{T}\phi(x_n))^2 + \frac{\lambda}{2} w^{T} w \quad (2.5)

Because the error remains a quadratic function of w, its minimiser can be found in closed form. In particular, setting the gradient of 2.5 with respect to w to zero, w can be defined as follows.

w = (\lambda I + \Phi^{T}\Phi)^{-1} \Phi^{T} t \quad (2.6)

When the regularisation term approaches zero, that is for a low value of λ, the error or cost function turns into Ordinary Least Squares (OLS), as the penalty term λE_Ω(w) becomes almost negligible. But if the value of the coefficient is too high, we run into the possibility of underfitting[33]. The choice of the values for the hyperparameters is crucial and can affect the final result in many ways[34].
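Equation 2.6 can be evaluated directly. The following numpy sketch, using an arbitrary toy design matrix rather than data from the thesis, computes the regularised least-squares solution and shows how λ enters the closed form.

import numpy as np

def ridge_closed_form(Phi, t, lam):
    # w = (lambda*I + Phi^T Phi)^(-1) Phi^T t, equation 2.6
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Toy usage with a random design matrix of 50 samples and 5 basis functions
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 5))
t = Phi @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=50)
print(ridge_closed_form(Phi, t, lam=0.1))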

2.7.5.1 LASSO regression

The Least Absolute Shrinkage and Selection Operator (LASSO) method is a particular case of a more general regulariser where the β in the penalty term has an exponent of 1. The error function seen in equation 2.7 follows this more general version with the sum-of-squares error given in 2.4. LASSO is also known as L1 regularisation in Machine Learning.

\frac{1}{2} \sum_{n=1}^{N} (t_n - w^{T}\phi(x_n))^2 + \frac{\lambda}{2} \sum_{m=1}^{M} |\beta_m| \quad (2.7)

17 A hyperparameter is a parameter whose value is set before the learning process.

2.7.5.2 Ridge regression

Also called L2 regularisation, it adds a "squared magnitude" penalty term, which can be seen in equation 2.8. As in the previously given formulas, λ/2 is used instead of λ in order to ease the derivative, with no consequence arising from this change.

\frac{1}{2} \sum_{n=1}^{N} (t_n - w^{T}\phi(x_n))^2 + \frac{\lambda}{2} \sum_{m=1}^{M} \beta_m^2 \quad (2.8)

2.7.5.3 Early stopping

Another way of applying regularisation is the procedure of early stopping. For many of the optimisation algorithms used during training, such as gradient descent, the error can be defined as a nonincreasing function of the iteration index, that is, a value that keeps decreasing through the iterations. However, the error measured on a validation set does not always follow this tendency. Usually, the resulting curve will decrease at first and then increase when the network starts overfitting. This happens because the network has memorised the already seen data instead of the function which generates it. In order to avoid this, training can be stopped at the point of smallest error on the validation set.
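A common way to implement this is to monitor the validation error, keep the weights from the best epoch, and stop once the error has not improved for a number of epochs. The sketch below assumes hypothetical train_epoch and validation_error helpers and a model whose weights can be copied; it only illustrates the procedure.

def train_with_early_stopping(model, train_epoch, validation_error,
                              max_epochs=1000, patience=10):
    # train_epoch(model) performs one pass over the training data (hypothetical helper)
    # validation_error(model) returns the error on a held-out validation set
    best_error = float("inf")
    best_state = None
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_epoch(model)
        error = validation_error(model)
        if error < best_error:
            best_error = error
            best_state = model.copy()      # keep the weights at the best point so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                          # validation error has started to increase
    return best_state, best_error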

2.7.6 Recurrent Neural Networks

Recurrent Neural Networks (RNN) are a type of Artificial Neural Network that more closely mimics the human brain by handling sequential information. Their architecture lets them relate sequences of inputs to outputs, identifying more complex patterns. There are also characteristic network shapes related to their objective: one to one, one to many, many to one or many to many are the most common ones. Typical uses of these types of networks include autocompletion or translation, where the following output depends on the previous input, not just the current one.

Figure 2.12: Recurrent Neural Network layer unfolding with time steps from tk to tk+n

Unfolding can be seen in figure 2.12, where the information flows forward (outputs) and backward (gradients) in time in terms of explicit paths. Its canonical form allows the modelling of sequences of varying length (with some problems of a technical nature). However, information does not have to flow forward in time only; bidirectional networks, for example, include both preceding and following connections[35]. RNNs are not perfect either: when eigenvalues are smaller than 1, gradients can vanish (or, conversely, explode). Long-term dependencies are also a problem; when the related elements are far apart and a wide context lies in between, predictions can be hard to make[35].
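The unfolding in figure 2.12 corresponds to reusing the same weight matrices at every time step while the hidden state carries information forward. A minimal numpy sketch of a vanilla RNN forward pass, with toy dimensions rather than those of the thesis network, could look as follows.

import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, h0):
    # Unfolded forward pass: the same weights are reused at every time step
    # and the hidden state h carries information forward in time
    h = h0
    outputs = []
    for x in xs:                      # xs is a sequence of input vectors
        h = np.tanh(W_xh @ x + W_hh @ h)
        outputs.append(W_hy @ h)
    return outputs, h

# Toy dimensions: 4-dimensional inputs, 8 hidden units, 3 outputs per step
rng = np.random.default_rng(1)
W_xh, W_hh, W_hy = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(3, 8))
xs = [rng.normal(size=4) for _ in range(5)]
ys, _ = rnn_forward(xs, W_xh, W_hh, W_hy, h0=np.zeros(8))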

2.7.6.1 Long Short-Term Memory

In the last decade, a new type of unit has been in development. The LSTM or GRU units (different in architecture but aiming at the same objective) try to solve the problem of long- and short-term memory by introducing the concepts of memorising and forgetting[33]. They have proven useful for temporal predictions, that is, when the output depends on the moment in time to which it corresponds[36]. LSTMs have been developed because of the vanishing gradients that appear when using backpropagation through time for RNNs, and because of the already mentioned poor capacity to handle long-term dependencies. The main idea behind the LSTM is to have a "memory cell" with the capability to keep its state over time. The unit can be decomposed into several parts. The cell state vector represents the memory and changes as a result of learning new information and forgetting old information. The forget gate, on the bottom left of figure 2.13, controls the information that has to be removed from memory. In the middle, the input gate controls the data to be added to the cell state from the current input. Finally, the output gate controls the information sent to the output[35].

Figure 2.13: LSTM unit in detail
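For reference, one step of the LSTM unit in figure 2.13 can be written directly from the gate descriptions above. The sketch below is a generic formulation with the parameters stored in dictionaries keyed by gate name; it is not the implementation used in the experiments.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # One LSTM time step. W, U and b hold the parameters of the forget (f),
    # input (i), candidate (g) and output (o) gates, indexed by those letters.
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])   # what to erase from memory
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])   # what to write to memory
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])   # candidate new content
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])   # what to expose as output
    c = f * c_prev + i * g                               # updated cell state (memory)
    h = o * np.tanh(c)                                   # output / next hidden state
    return h, c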

Chapter 3 Methods and results

The main idea is to mimic an existing function —the decyphering function— learning from its inputs and outputs. Function decode1 represents the decyphering function given the cyphertext and the key, while decode2 is a new function where the key is not needed in order to decypher the cyphertext. We want the network to simulate these functions by approximating their results, creating f1 and f2 respectively.

f_1(cyphertext, key) ≈ decode_1(cyphertext, key)
f_2(cyphertext) ≈ decode_2(cyphertext)
(3.1)

However, as we can already expect, f2 can be highly complex. Imagine a simple cypher such as the Vigenère, studied in the following section, where only a shift is applied in order to encypher and decypher a message. The decode function can then be expressed as consecutively applying (cyphercharacter − shift) mod 26 —as seen in section 2.2— to the characters of the cyphertext, with the shift given by the key, which repeats itself. Nevertheless, if we take out the key and thus the shift, the decypher function can no longer be correctly defined, as a single input can map to several different outputs. All we can do is build a function that outputs the most plausible plaintext by probability, given a language or a context. If this context does not exist, the function would have to choose among 26 equally possible outputs; information is missing.
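For the Vigenère case, decode1 is easy to state explicitly. The following sketch applies (cyphercharacter − shift) mod 26 with a repeating key; for example, UIXV decoded with CATALONIA gives SIEV. It is illustrative only and not part of the network code.

def decode1(cyphertext, key):
    # Decyphering with the key available: subtract the key shift modulo 26
    plaintext = ""
    for i, c in enumerate(cyphertext):
        shift = ord(key[i % len(key)]) - ord("A")
        plaintext += chr((ord(c) - ord("A") - shift) % 26 + ord("A"))
    return plaintext

print(decode1("UIXV", "CATALONIA"))  # prints SIEV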

3.1 The Vigenère

For this first approach, we will use the Vigenère cypher, whose cryptographic principles have been explained in section 2.2. A Vigenère machine is virtually built in order to encypher the provided text. Only characters from the plain English alphabet will be taken into account, which gives us 26 different letters. During the entire procedure, one hot encoding is also used in order to represent the characters as arrays the network can understand. As we only have 26 characters this will not become a problem. The minus sign - will also be used to represent a blank or null character. As the key will be able to vary in length (although it will be fixed for some experiments) from 1 to 10 inclusive, this symbol will help us identify a non-existing position. Because the input size of the network is fixed, we also need to maintain a fixed length for the key even if it can vary.
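One hot encoding over the 26 letters plus the blank symbol can be sketched as follows; the fixed input length and the use of - as padding are the only assumptions, mirroring the description above.

import numpy as np

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ-"   # 26 letters plus the blank symbol

def one_hot(character):
    # Encode one character as a vector the network can consume
    vector = np.zeros(len(ALPHABET))
    vector[ALPHABET.index(character)] = 1.0
    return vector

def one_hot_sequence(text, length):
    # Pad with '-' up to the fixed input length required by the network
    padded = text.ljust(length, "-")
    return np.stack([one_hot(c) for c in padded])

print(one_hot_sequence("SIEV", 10).shape)   # (10, 27)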

3.1.1 Unknown key and plaintext

We first try to approximate function decode2, predicting the plaintext and the key of a given cyphertext without anything else. The network will be trained to output one character at a time for each time step, which corresponds to xi as input and yi as output in figure 3.1. Once all characters have been used, the current element of the batch will be discarded and the network will proceed to the next one. When the whole batch has been used, the network will update the corresponding weights on the LSTM layer.


An example of an input and output can be seen in table 3.1, where the key has been appended to the output y. With an initial maximum size of 10 for the key, an extra blank character has been added in between, padding the key to the right. As shown in the same table, the input does not consist of a composition of the cyphertext along with the key; only the cyphertext is used. For this reason, some blank characters also have to be added at the end of the sequence, with the same length as the maximum possible key.

x UIXVPFYISUEGDPBNUETIDAYWFKHGNLEVHBZ------
y SIEVERLASSENDENAMERIKANISCHENSEKTOR-CATALONIA

Figure 3.1: Example of sequence input

The weights of the network will be initialised using the Xavier initialisation, making the variance of the outputs of a layer equal to the variance of its inputs[37]. This initialisation method is chosen based on previous empirical evidence of better performance with a sigmoid activation function, explained in section 2.7.2. Another way of setting initial weights is the He initialisation, which has been found to work better with the ReLU function, also detailed in section 2.7.2. The resulting network, visible in figure 3.2, will have tsteps nodes as input and output, which should correspond to the sum of the lengths of the plaintext or the cyphertext and the key. However, the number of nodes m on the LSTM layer will vary through the experiments.
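As a reference for the initialisation step, a Xavier (Glorot) uniform initialiser can be sketched as below; the layer sizes in the usage line are only an example and not the exact shapes of the thesis network.

import numpy as np

def xavier_init(n_in, n_out, seed=0):
    # Xavier/Glorot initialisation: weights drawn so that the variance of the
    # layer outputs roughly matches the variance of its inputs
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = xavier_init(50, 512)   # e.g. a 50-step input projected onto 512 LSTM units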

Figure 3.2: LSTM network with n inputs and outputs, and a hidden layer of size m

The training process is fairly simple, consisting of a loop in which new data is gathered and then fed into the network. We initially set a batch size of 64 for each iteration, or training step, meaning we obtain 64 different inputs and outputs with their corresponding key. The network then sees this data and updates its weights at the end of the batch. The Adam optimiser has been chosen, combining the advantages of the AdaGrad and RMSProp algorithms. The intrinsic idea behind these algorithms is to minimise the loss of the current state, as explained in section 2.7.4. Adam has gained popularity in recent years as it achieves convergence faster than its competitors. In our case, this will prove indispensable when we try to crack the Geheimschreiber in section 3.2, which we can already foresee will be a tremendously time-consuming task. Moreover, the loss function will be L2, as explained in section 2.7.5.2, because it penalises outliers —using the square of βm. This is particularly desirable because the minimisation of deviations results in fewer errors once the model has reached high accuracies.
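For completeness, the standard Adam update combines a running mean of the gradient with a running mean of its square, both bias-corrected for the step count. The minimal numpy sketch below uses the learning rate of 0,005 from figure 3.4 as the default; it is illustrative and not the training code used in the experiments.

import numpy as np

def adam_step(w, grad, m, v, t, lr=0.005, beta1=0.9, beta2=0.999, eps=1e-8):
    # Running mean of the gradient (momentum-like) and of its square (RMSProp-like)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the step count t (t starts at 1)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Parameter update
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v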

Consequently, the training of a given element of the batch will be performed as seen in figure 3.3. The LSTM layer can be seen to connect each node with itself from left to right through time. Therefore, each character xi is input at the corresponding time step i. This ensures all the characters in the output can be changed based on previously seen data from the same element. In particular, the key can be changed depending on the cyphertext, as it is positioned at the end.

Figure 3.3: Training schema for the given inputs and outputs on the network

All the decisions are based on the guidelines given in Greydanus’ paper "Learning the Enigma with Recurrent Neural Networks" in which the intended target is the Enigma machine[38]. As a result, the values in table 3.4 are the initial hyperparameter selection for the network.

Time steps   Key length   Layers   Batch size   Learning rate   Dropout
50           10           1        64           0,005           1,0

Figure 3.4: Initial hyperparameter selection for training

3.1.1.1 Training with a fixed key

We begin the experiments training the network with the Nibelungenlied manuscript, translated as "The Song of the Nibelungs", from around the year 1230 and of unknown authorship. The original text is first formatted, removing all characters not belonging to the alphabet and dropping all accents and other marks. The letter Ä, for example, is converted into A. Once the text is formatted, it is encyphered using the Vigenère cypher. The key —chosen at "random"— for encoding the manuscript is CATALONIA, of length 9.

text_input = open("input/das_nibelungenlied.txt", "r", encoding="utf-8").read()
text_key = "CATALONIA"
int_input = [ord(i) for i in text_input]
int_key = [ord(i) for i in text_key]
length_input = len(text_input)
length_key = len(text_key)
text_output = ""
for i in range(length_input):
    # Vigenère encypherment: add the key character to the plaintext character modulo 26
    # (works directly on the ASCII codes because ord('A') * 2 = 130 is a multiple of 26;
    # the formatted text only contains uppercase A-Z)
    value = ((int_input[i] + int_key[(i % length_key)]) % 26)
    text_output += chr(value + 65)
open("output/das_nibelungenlied.txt", "w", encoding="utf-8").write(text_output)

Listing 1: Vigenère text encypher code

Using the code seen in listing 1 we obtain the text to be fed to the network. After iterating on the training process with a network of 512 hidden units, we test the model with a part of the text that was not used during training. The result gives 49.980 correctly predicted characters and only 19 wrong. The keystream for this test begins with CATALALIACATALONIACATALONIA, where we can see a few wrong positions that can be fixed thanks to the repetition of the key itself. However, when presented with a message encyphered with a different key, the predicted keystream starts with LOANLONIACTTOLIATA, where the key used for training, CATALONIA, can fairly easily be detected. The conclusion is that we have made a network that memorises the training key without taking the input into account. Although regularisation has been used in order to avoid overfitting, it is clear our network only predicts seen data well. This can be explained by the fact that the key used to encypher all the input was the same. In essence, we have been able to mimic the function decode2 with the key as a fixed parameter, CATALONIA in this case.

3.1.1.2 Training with variable keys

In order to avoid the results from the previous section, the network clearly needs to be trained varying both the cyphertext and the key. As the encyphering algorithm is known, we will try to include the simulator inside the training procedure itself. This means the network will train each time with newly generated input and output, and we will have no fixed training or test data; each new batch will contain information not previously seen. As the network does not care whether the text is legible or not, we just need to generate a sequence of characters for the input and for the key, produce the output and then train on that. A basic data generation can be found in listing 2.

import random

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
alphabet_length = len(alphabet)
input_length = 10
key_length = 5
# Random cyphertext and key
text_input = [random.choice(alphabet) for _ in range(input_length)]
text_key = [random.choice(alphabet) for _ in range(key_length)]
# Repeat the key to cover the whole input, producing the keystream
text_key_full = [text_key[i % key_length] for i in range(input_length)]
# The corresponding plaintext is obtained by subtracting the keystream modulo 26
text_output = [alphabet[(alphabet.index(text_input[i]) - alphabet.index(text_key_full[i])) % alphabet_length]
               for i in range(input_length)]

Listing 2: Vigenère random data generation code

Still trying to mimic the decode2 function, we find after several epochs that our network converges to an accuracy of around 20%. Although this is really low, one would expect a value of around 4%, as we have 26 different possible outputs once the key is removed from the original decode1 function.

3.1.1.3 Training with a German dictionary

As the key is not part of the input, the randomness of the cyphertext makes it almost impossible for decode2 to output a correct plaintext. However, if a context is introduced, the probability of such an event increases. In this case, the language of the text is known —German— and therefore most of the previously generated combinations are not likely to appear in the text. For this reason, we introduce the usage of a German dictionary[39] in the data generation process, taking into account

its side effects. For example, any spelling error in the plaintext can result in a poor prediction. Furthermore, this will avoid inputs like XJECZEIUSE but will not avoid badly constructed sentences such as ICHBERGDIE —me mountain the. We discard using the previously mentioned manuscript because the transitions between words would only be the ones appearing in the book. Although the generation process described creates wrong combinations, it covers all possible real ones. In any case, we must take into account that this only affects the plaintext; the key will still be generated randomly.
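A data generation step based on a word list can be sketched as follows. Here german_words is a hypothetical list loaded from the dictionary file, and the word boundaries produced this way are exactly the possibly ungrammatical but letter-valid sequences described above.

import random

def generate_plaintext(german_words, length):
    # Concatenate randomly chosen dictionary words, then trim to the fixed length.
    # Word transitions not present in any book can still appear, but every
    # individual word is a valid German word.
    text = ""
    while len(text) < length:
        text += random.choice(german_words)
    return text[:length]

# Hypothetical usage with a small word list
print(generate_plaintext(["SIE", "VERLASSEN", "DEN", "SEKTOR"], 20))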

Figure 3.5: Loss function and accuracy for f2 with a German dictionary

Finally, after 250.000 epochs on the same network, accuracy reaches 40%, which is an improvement compared with the previous attempt but still far from our expectations. In this case, the network is more accurate because of the probabilities of the German language. For example, given a C, the likelihood of the preceding letter being an A is not the same as it being a B. Taking this factor into account, the result is more accurate than when trained with random values. However, when presented with completely random inputs, the accuracy drops because this expected relation between the characters does not exist. Even if it did, the result could still be really bad. As this network has been trained using a German dictionary, it has learned German dependencies but not Swedish or Catalan ones. Inputs in any other language will not share the same transitions, which is why the introduction of a dictionary makes the model language dependent.

3.1.2 Unknown key and known plaintext

Focusing on the decode1 function, we compare several network architectures. In this case, we aim to discover the key using the cyphertext and the plaintext. A single-layer LSTM is trained with several widths from 128 to 2.048. The selection of hyperparameters such as the learning rate or the batch size has already been tested in the previous section and will not vary through this series of experiments. As we are using both the cyphertext and the plaintext as input, as seen in table 3.6, the network shape will have to change.

x1 UIXVPFYISUEGDPBNUETIDAYWFKHGNLEVHBZ------
x2 SIEVERLASSENDENAMERIKANISCHENSEKTOR------
y SIEVERLASSENDENAMERIKANISCHENSEKTOR-CATALONIA

Figure 3.6: Example of sequence input

The network will now be trained with two characters as input at time step i, corresponding to the i-th characters of x1 and x2. For time step 0, the characters x1_0, x2_0 and y_0 will be used. This results in a network with 2 × tsteps input nodes and tsteps output nodes, as seen in figure 3.7.

Figure 3.7: LSTM network with 2n inputs, n outputs and a hidden layer of size m

As the shape of the input is not the same as that of the output, the training of a given element of the batch will be performed differently, in accordance with figure 3.8. Although similar to the previous section, before starting to input x2 we will compute the changes to the LSTM layer with x1. The input will now be the concatenation of x1 and x2. As can be seen in the figure, input element i of x2 will correspond to yi. This forces the corresponding nodes to output the correct value of the plaintext while applying changes to the key from the previously seen cyphertext. In other words, the network will see both the cyphertext and the plaintext first, and based on this information, it will compute the key.

Figure 3.8: Training schema for the given inputs and outputs on the network

We begin experimenting with different network sizes, from 128 to 2.048 nodes on the hidden layer. Both the random generation and the German dictionary training methods are tested. As we can see in figure 3.9, the loss appears to drop significantly during the very first epochs of training. This could mean the network is able to rapidly correct the initial random output generation and start producing the plaintext part of the output correctly. During training, we can clearly see two different network behaviours. The one trained with random character generation in figure 3.9a takes longer to reach high accuracies, and the network that reaches the best scores is the one with the fewest units on the LSTM layer. In figure 3.9b, with the German dictionary, the networks instead tend to reach high accuracies with a much smoother curve, as one would expect, and unlike the previous case, the network with the best performance seems to have a higher number of units on the hidden layer. Furthermore, a couple of spikes can be observed in both figures where the loss increases substantially and the accuracy drops correspondingly. However, this is resolved after a few epochs, taking the accuracy back up again. It is due to the nature of the data generation process, where "unlucky data" can be produced, resulting in an increase in the loss function similar to what mini-batch gradient descent can encounter.

(a) Training with random generated characters
(b) Training with German generated sentences

Figure 3.9: Loss function and accuracy for f1 and several network widths

Regarding the training time shown in figure 3.10, we can clearly observe that it is linear with respect to the number of epochs. For larger networks the number of seconds needed increases considerably, as there are many more connections between the units. In particular, training the network with 2.048 LSTM units took a total time (without overheads) of almost 4 hours. As the only difference between the networks in 3.9a and 3.9b is the data generation process, the total time is practically the same and the difference is negligible.

Figure 3.10: Training times (in seconds) for networks in figure 3.9a and 3.9b

Returning to the actual results, it appears at least 100.000 gradient descent steps were needed in most cases to achieve the best accuracy. In particular, for network 3.9a —random character generation— the best accuracy was reached with a minimalist hidden layer of only 128 units. However, network 3.9b —German sentence generation— achieved its best results with a network of size 512, although 2.048 gave virtually the same score. Furthermore, some networks using the random generator were not even able to reach a minimum of 95% accuracy.

Nodes   >90% accuracy (Random / German)   >95% accuracy (Random / German)   Maximum accuracy (Random / German)
128     20.300 / 6.900                    51.300 / 29.200                   99,72% at 140.300 / 98,18% at 143.600
256     45.900 / 6.200                    - / 55.100                        91,79% at 63.000 / 96,55% at 88.800
512     99.100 / 5.800                    - / 12.400                        94,10% at 144.100 / 99,99% at 130.600
1.024   25.700 / 107.300                  90.600 / 55.100                   95,97% at 132.800 / 97,54% at 149.500
2.048   82.400 / 17.800                   137.900 / 28.600                  96,92% at 139.800 / 99,98% at 102.800

Figure 3.11: Needed iterations for a minimum accuracy and maximum reached

Given the previous results, we perform a final test with the text mentioned earlier —the Nibelungenlied manuscript— with a chosen key. It is worth mentioning this is done merely to test the trained models and not to train them. As specified before, not all transitions between words occur in the book. However, all transitions that do appear (not the text or excerpts from it) could already have been seen by the network. The two best networks from the random and German generated inputs are chosen: network 3.9a with 128 nodes and network 3.9b with 512, as the latter performed slightly better than with 2.048.

As explained before, outliers are a serious weakness; even with a high accuracy the result can be completely different. For example, for the 3.9a network at 98,93% accuracy, the input in table 3.6 results in XATALOAIA as the predicted key, as seen in table 3.12. Although we can clearly recognise the expected result of CATALONIA, we can only deduce that because the content has a meaning for us. If the key were completely random we would not be able to deduce it. However, with only 13.100 more steps, a mere 0,36% difference in accuracy, the prediction is exact.

Iterations   Accuracy   Key prediction   Length      Hits   Misses
95           85,79%     ----------       Incorrect   0      10
99.500       98,93%     -XATALOAIA       Correct     7      3
109.000      99,10%     -CATNLONIN       Correct     8      2
112.600      99,29%     -CATALONIA       Correct     10     0
140.000      99,42%     -CATALONIA       Correct     10     0

Figure 3.12: Final test with unseen data and key prediction for random characters generation

Iterations   Accuracy   Key prediction   Length      Hits   Misses
93           86,56%     ----------       Incorrect   0      10
16.800       93,38%     -OOLALONOO       Correct     4      6
41.000       95,31%     -CAAALONIP       Correct     7      3
83.600       99,56%     -CATALONIA       Correct     10     0
139.200      99,96%     -CATALONIA       Correct     10     0

Figure 3.13: Final test with unseen data and key prediction for German sentences generation

3.2 The Geheimschreiber

In this second approach, the Geheimschreiber will be inspected. Only model T52a/b will be taken into account in the following series of experiments, due to the complexity of the later models. In order to attack the machine, we first need data related to the encypher process of the T52. As explained in section 2.1, original files regarding the German transmissions during World War II can be found in the Försvarets Radioanstalt archives in the Krigsarkivet. In order to speed up the process, a replica of the T52a/b is developed following the previously stated guidelines. Data can now be encyphered and decyphered using this model, which replicates step by step what a real Geheimschreiber would do.

3.2.1 Cryptanalysis

As done historically when presented with a cyphertext of unknown origin, one must count the frequencies of the symbols and look for patterns. Using the previously mentioned manuscript, we encypher it as explained in appendix A.1. To start with, all the wheels are set to position 0 and the manual mapping is not used. The pins used on the rotors are the ones available in the appendix and will not change during the experiments, just as the Germans kept them fixed in their transmissions after starting to build the wheels with bakelite. One must take into account that setting a specific key or mapping should not affect the density function of the encyphered characters. However, if the keystream were not random, the result would be compromised and not truly secure. Although, as explained in section 2.2, any machine relying on a keystream is not

fully secure, its results achieve better performance in terms of security than a simple shifting or transposition machine.

Figure 3.14: Frequency table for one of the Nibelungenlied encypherments

Taking this into account, a simple frequency table for the aforementioned text is computed, as shown in figure 3.14. The results displayed there appear to be essentially random, discarding any expectation of yielding conclusions: no character appears noticeably more often than the others, only slightly.
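The frequency table itself is a simple count over the encyphered text. A sketch using the standard library is shown below; the file path is assumed to be the output produced by the simulator described in appendix A.1.

from collections import Counter

# Path of the encyphered text produced by the simulator (assumed)
cyphertext = open("output/das_nibelungenlied.txt", encoding="utf-8").read()
frequencies = Counter(c for c in cyphertext if not c.isspace())
for symbol, count in sorted(frequencies.items()):
    print(symbol, count)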

3.2.2 Unknown XOR and permutation wheels

In this series of experiments we will try to approximate the function decode1 of the Geheimschreiber. The network in use will be the same as in the Vigenère section with some light modifications. Recalling the T52, it has several variable parts: the pins on the wheels, the manual mapping and the initial rotor positions. As the complexity of the machine is high, the pins on the wheels will not change during this and the following sections, as stated previously. Furthermore, the manual mapping will be assumed to be known. In particular, the identity mapping will be used, that is, the mapping that outputs the same input without any modification. This means wheel w0 will contain the first bit of the keystream, w1 the second, and so on until w9, which contains the tenth. In any case, one must take into account that the manual mapping is not random but rather a fixed modification of the encyphering process, skewing the result. The same applies to the permutation of the XOR result, which again is not a random effect.

The alphabet of the network will now be different, as not only characters from the English alphabet are valid. Intercepted telegrams contained up to 32 different characters, not only letters from A to Z but also numbers from 1 to 6. As explained in section 2.1, the printed numbers represented special codes for the teleprinters, such as a Figure Shift. However, one must remember these characters are just a representation of the content; the machines could send any Arabic number, but someone intercepting the information would see it represented with another letter. This gives us an alphabet of size 32 with the aforementioned characters. But in this case we also have to take into account that the positions of the resulting key cannot be encoded with the previous symbols. As any given rotor can have a maximum of 73 different positions, we need numbers up to 72 to represent all possible values. We could split the tens from the units and have a key twice the size, but the first approach will be used as it implies a smaller key size. The final alphabet will then have 99 meaningful characters along with the minus symbol, also used to represent a null or blank value. In total, 100 different characters will be used.

The same input and output shape will be used as in the previous sections. In particular, the previously seen example is shown in figure 3.15, with some changes. In this case, the key1 used to encypher represents the positions of the rotors rather than the keystream itself. In essence, this will result in a memorisation of the wheel pins by the network, as well as the manual mapping. Any change in this part will result in a completely wrong prediction of the wheel positions. Note also that x1 is now different, as the encyphering algorithm has changed.

x1 1JH4DN524MKM6GEOB5FDTR64RWMC5FE4Y1I------
x2 SIEVERLASSENDENAMERIKANISCHENSEKTOR------
y SIEVERLASSENDENAMERIKANISCHENSEKTOR0123456789

Figure 3.15: Example of sequence input for a given i

In order to train the network, the learning rate has been reduced from the start to ensure a good result, although this will tend to produce longer training times, as seen in figure 3.16. The time steps in this case will also be 50: 40 for the cyphertext and the plaintext and 10 for the key. The total number of time steps will then be 100, as we need to input both x1 and x2. This means a 40-character-long crib is needed which, although not usual, will help us to start. Furthermore, although we previously used a batch size of 64, this has been reduced to 32 because of the long training times involved; basically, the weights will be updated more often relative to time.

Time steps   Key length   Nodes   Layers   Batch size   Learning rate   Dropout
50           10           512     1        32           0,0025          1,0

Figure 3.16: Initial hyperparameter selection for training

The random data generation process with the previous hyperparameters results in a rather low accuracy, as seen in figure 3.17. The training seems to converge before reaching 78% at epoch 200.000 with a network of 512 hidden nodes.

Figure 3.17: Loss function and accuracy for f1 with a crib of size 40

1 Note we used numbers from 0 to 9 in the example as they can be expressed in a single character. However, for the network, a number such as 69 will also be interpreted as a single character because the input will be a vector of equal size as any other.

However, this is not the case. The training process is slower than usual, but the loss function continues to drop while the accuracy increases at the same pace. After a million epochs of training, the network performs with roughly more than 80% accuracy. An excerpt from the training output can be seen in figure 3.18, where a test with the previously mentioned example is performed. Although the plaintext is the same, the key is randomly generated each time, producing a different cyphertext. We must take into account this is only used as feedback to know the current state of the training process. In the mentioned figure, only three out of 10 rotors are correctly predicted. Nevertheless, the last one seems to be a coincidence, as the previously predicted rotor is also 69. Looking at the predicted key —44 10 28 28 03 03 03 03 69 69— we can clearly see that wheels w2 to w9 are still being trained. In effect, looking at other tests, the network seems to be able to predict correctly only the first two rotors.

Step: 678100 | Loss: 8.061540 | Accuracy: 79.955357% | Time: 13.8375s
Step: 678200 | Loss: 8.061932 | Accuracy: 79.927455% | Time: 11.9602s
Step: 678300 | Loss: 8.068205 | Accuracy: 80.083705% | Time: 13.6996s
Step: 678400 | Loss: 8.068626 | Accuracy: 79.966518% | Time: 12.0329s
Step: 678500 | Loss: 8.063769 | Accuracy: 80.055804% | Time: 12.0544s
Step: 678600 | Loss: 8.061770 | Accuracy: 79.983259% | Time: 11.8965s
Step: 678700 | Loss: 8.070239 | Accuracy: 80.055804% | Time: 11.9956s
Step: 678800 | Loss: 8.069390 | Accuracy: 80.000000% | Time: 11.6941s
Step: 678900 | Loss: 8.069304 | Accuracy: 79.977679% | Time: 11.9725s
Step: 679000 | Loss: 8.057973 | Accuracy: 80.022321% | Time: 11.8372s
TEST
Cypher        > LD5FPXWOGJQMSTQD3SWVSCGFF5G2RRUXKG55HA53
Plain         > SIEVERLASSENDENAMERIKANISCHENSEKTORSOFOR
Prediction    > SIEVERLASSENDENAMERIKANISCHENSEKTORSOFOR
Comparison    >
Real key      > 44 10 50 50 55 21 33 49 12 69
Predicted key > 44 10 28 28 03 03 03 03 69 69
Comparison    >       ×× ×× ×× ×× ×× ×× ××

Figure 3.18: Example of training process with information and visual test

After reaching a million epochs, the training process is stopped, as the increase in accuracy is merely marginal and no real improvement is produced. The network with 512 nodes is not able to mimic the function decode1, probably due to the size of the hidden layer. As there are 100 time steps, increasing the size of the LSTM layer does not seem a viable option. A final test is performed with the sequence already well known to the reader, where the network is able to correctly predict the first three rotor positions. Although these wheels are the ones with the smallest lengths, this reduces the number of possible settings from 8,94 · 10^17 to 6.080.345.643.840 ≈ 6,08 · 10^12. Clearly, although a brute-force attack still does not seem viable, the number of possible settings has decreased substantially just by knowing the first three. An approximate progression of the number of rotor positions correctly predicted can finally be seen in figure 3.19.
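The keyspace reduction quoted above follows directly from the wheel lengths of the T52 (47, 53, 59, 61, 64, 65, 67, 69, 71 and 73 pins): the full keyspace is their product, and knowing the first three wheels divides it by 47 · 53 · 59. A short check:

from math import prod

wheel_lengths = [47, 53, 59, 61, 64, 65, 67, 69, 71, 73]
print(prod(wheel_lengths))       # 893622318929520960, about 8,94e17
print(prod(wheel_lengths[3:]))   # 6080345643840, about 6,08e12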

Iterations   Accuracy   Key prediction   Feasible values   Hits   Misses
100.000      77,28%     ××××××××××       Correct           0      10
497.000      77,83%     ×××××××××        Correct           1      9
667.700      79,92%     ××××××××         Correct           2      8
734.100      80,03%     ×××××××          Correct           3      7

Figure 3.19: Final test with unseen data and key prediction

3.2.3 Unknown XOR and known permutation wheels

Predicting all the rotor keys has proven to be difficult. However, as the last five wheels only perform a permutation on the XOR result, they do not introduce further randomisation. Therefore, we will try to predict the first five wheels, corresponding to the XOR operation, with known permutation rotors. In essence, we are trying to remove the random distribution effect of the encypher algorithm. With this, the resulting cyphertext will have a non-random distribution, as the XOR wheels will be known[21].

3.2.3.1 Training with a short crib

Until now, a crib of length 40 has been used for training. However, such a long known plaintext is not likely to be available in practice. In this section, we will reduce the length and use a plaintext and cyphertext of length 10. As the previous network with 512 nodes on the LSTM layer resulted in only 3 rotors being predicted correctly, we will perform experiments with a higher number of units. Nevertheless, smaller networks will also be tested in order to compare the performance predicting five rotors instead of ten. This will not increase the training time substantially, as the number of time steps has been reduced to 20 for each xi, resulting in 40 time steps in total, about a third of the number used in the previous section. The final parameters for training can be seen in figure 3.20, still maintaining 32 messages per batch.

Time steps   Key length   Layers   Batch size   Learning rate   Dropout
15           5            1        32           0,0025          1,0

Figure 3.20: Initial hyperparameter selection for training

After only 100.000 epochs —compared with the previous million steps— the effects of cracking the wheels can clearly be seen in figure 3.21. In this case, each time the network learns a new wheel, 10% accuracy is added to the total performance. In the beginning, the total performance rapidly reaches 50% due to the correct prediction of the plaintext part of the output. The first wheel is cracked shortly after, increasing the accuracy to 60%. However, the second rotor takes more time, being cracked between 20.000 and 40.000 epochs by most networks. In fact, the only network that reaches four rotors is the one with the most units, 2.048, as none of the smaller networks is able to crack more than two wheels.

Figure 3.21: Loss function and accuracy for f1 and several network widths

In particular, the fifth rotor is never learned by the network, which needed only around 3 hours (without overheads) to reach a maximum accuracy of 90% using the machine described in appendix B.1. In any case, cracking four wheels leaves us with only 6,4 · 10^1 possible remaining settings, in other words the number of possible positions of the fifth wheel, which is its length, 64. Therefore, a brute-force attack can be performed, assuming the previous four have been predicted correctly. Figure 3.22 contains the now classic test on the network trained with 2.048 nodes, with the brute-force attack on the fifth wheel.

TEST
Cypher        > YUCSEQAW1D
Plain         > SIEVERLASS
Prediction    > SIEVERLASS
Comparison    >
Real key      > 41 19 07 53 01
Predicted key > 41 19 07 53 13
Comparison    >             ××
BRUTE-FORCE LAST WHEEL
Bruteforce took 0.029031s
5th real rotor      > 01
5th predicted rotor > 01
Real key      > 41 19 07 53 01
Predicted key > 41 19 07 53 01
Comparison    >

Figure 3.22: Example visual test with a brute-force attack on the last wheel

The figure shown above is the result of cracking the fifth wheel by brute force using the following simple code.

plain_translated = data.translator.text_to_binary(plain)
key_test = key[:4]                      # the four wheel positions predicted by the network
cypher_predicted = []
possible_keys = []
# Try every possible position of the fifth wheel
for i in range(data.cypher.get_wheel_sizes()[4]):
    data.cypher.set_positions(key_test + [i])
    cypher_translated = []
    for character_translated in plain_translated:
        data_result_xored = data.cypher.perform_xor(character_translated)
        data_result_permutated = data.cypher.perform_permutation(data_result_xored)
        cypher_translated.append(data_result_permutated)
        data.cypher.advance_position_all()
    cypher_predicted = data.translator.binary_to_text(cypher_translated)
    # If the re-encypherment matches the intercepted cyphertext, i is a candidate position
    if cypher_predicted == cypher:
        possible_keys.append(i)

Listing 3: T52a/b brute-force fifth wheel attack

3.2.3.2 Training with a long crib

We finally try to vary the parameters of the network in an attempt to correctly predict the fifth wheel and get rid of the brute-force attack, which depends on the correct prediction of the previous four wheels. The number of time steps is increased at the expense of the number of nodes, as maintaining the latter while increasing the former is not a computationally viable option at the moment. However, increasing the output, and therefore the relations that can be discovered between the characters, might prove more beneficial even given the reduction of the hidden layer width. A final training is performed as detailed by the values given in figure 3.23.

Time steps   Key length   Nodes   Layers   Batch size   Learning rate   Dropout
30           5            1.024   1        32           0,0025          1,0

Figure 3.23: Initial hyperparameter selection for training

The network finally converges around 1.450.000 epochs with an accuracy of 99,68% after almost 19 hours. As seen in figure 3.24, the increases in accuracy once a new wheel is cracked can still be noted. However, the changes are smoother and harder to spot. The fifth wheel does not behave like the previous ones and is learned rather slowly through several iterations. A large increase in the loss function is produced, but this is rapidly recovered right before reaching the best performance.

Figure 3.24: Loss function and accuracy for f1 with a crib of size 25

A part of the training process near one and a half million epochs is shown in figure 3.25. With an accuracy around 99,35%, the network can correctly predict all five wheels without the need to apply a brute-force attack. The approximation followed replicates the code seen in listing 4, where an input text in binary is XORed with the current keystream. This keystream varies at each time step, as all rotors step once for each character on the T52a/b.

def perform_xor(self, data_y):
    # XOR the incoming character (as a string of bits) with the current keystream bits
    data_x = self.get_stream()
    data_result = bytes(''.join([str(e) for e in [a ^ b for a, b in zip(data_x, data_y)]]),
                        encoding="utf-8")
    return data_result

Listing 4: T52a/b XOR function

However, the permutation function seen in listing 5 has not been learned, since it has been embedded into the function f1 as given. In any case, as the permutation wheel positions are assumed and the manual mapping is known, the cyphertext could be modified to remove the permutation, making the model valid for all possible wheel combinations.

def perform_permutation(self, data_y):
    data_y = list(map(lambda x: x - ord('0'), data_y))
    cypher_permutation = list(map(lambda x: x - ord('0'), self.get_permutation()))
    cypher_n_wheels = int(self.get_n_wheels() / 2)
    i = 0
    while 0 <= i < len(cypher_permutation):
        if cypher_permutation[i] == 0:
            # Swap two adjacent positions when the controlling wheel outputs 0
            pos_1 = (cypher_n_wheels - i) % cypher_n_wheels
            pos_2 = (pos_1 - 1) % cypher_n_wheels
            tmp = data_y[pos_1]
            data_y[pos_1] = data_y[pos_2]
            data_y[pos_2] = tmp
        i = i + 1
    data_y = list(reversed(data_y))
    return bytes(''.join([str(e) for e in data_y]), encoding="utf-8")

Listing 5: T52a/b permutation function

A manual test is then performed with different wheel settings, and in all of them the network correctly predicts the five positions. We can therefore say our approximation f1 of the decode1 function has been achieved correctly.

Step: 1448100 | Loss: 0.255199 | Accuracy: 99.395161% | Time: 2.0109s
Step: 1448200 | Loss: 0.273448 | Accuracy: 99.427419% | Time: 2.0986s
Step: 1448300 | Loss: 0.271625 | Accuracy: 99.338710% | Time: 2.0001s
Step: 1448400 | Loss: 0.276924 | Accuracy: 99.298387% | Time: 2.0288s
Step: 1448500 | Loss: 0.289391 | Accuracy: 99.419355% | Time: 2.0710s
Step: 1448600 | Loss: 0.268960 | Accuracy: 99.435484% | Time: 2.0242s
Step: 1448700 | Loss: 0.279173 | Accuracy: 99.403226% | Time: 2.0305s
Step: 1448800 | Loss: 0.242632 | Accuracy: 99.556452% | Time: 2.0270s
Step: 1448900 | Loss: 0.289613 | Accuracy: 99.362903% | Time: 2.3456s
Step: 1449000 | Loss: 0.289503 | Accuracy: 99.354839% | Time: 2.1875s
TEST
Cypher        > CYHOQE5RW2DIVFTGJ6EIU3MVG
Plain         > SIEVERLASSENDENAMERIKANIS
Prediction    > SIEVERLASSENDENAMERIKANIS
Comparison    >
Real key      > 31 48 18 42 25
Predicted key > 31 48 18 42 25
Comparison    >

Figure 3.25: Example of training process with information and visual test

Chapter 4 Discussion

History can sometimes repeat itself: people make the same mistakes and behave the same way they did in the past. However, this cycle can be broken. During World War II everyone miscalculated something at some point: a bad correction on an intercepted telegram, a wrong protocol for transmitting messages, or even ignoring a massive attack when all evidence pointed towards it. Every individual decision led to the events as they are recalled in section 2, to what happened. New cryptographic methods and machines were invented, all with the same purpose: breaking the Geheimschreiber. This proved successful for the first part of the war, but soon the increase in the complexity of the cypher machine overcame the capabilities of Swedish cryptographers and engineers. Since then, technology has evolved towards unbelievable goals, accomplishing tasks previously thought impossible. 75 years later, most messages are still intact, waiting for someone to decypher them and reveal their secrets.

4.1 Obtained results

Several methods have been implemented on different cypher machines, the Vigenère and the Siemens and Halske T52. However, not all of them had the same effect and performance on the different tasks.

4.1.1 The Vigenère

The Vigenère cypher is simpler and older than the T52, which has proven to make it a good start. Perfected during the XVI century by Blaise de Vigenère, it has been used as a starting point before trying to approach the Geheimschreiber. The main network characteristics have been developed from the experimentation results on this algorithm.

In particular, function f2 has been found to perform better than expected, although the results of the predictions cannot be applied to any real case as they contain too many deviations. Section 3.1.1 has not been able to determine whether, in a given context —the German language for example— an Artificial Neural Network could predict the correct output without the key or the plaintext element. Nevertheless, we can hypothesise this is unlikely to be possible due to the lack of information and the fact that the output would rely only on vague and hidden complex probabilities. A network might be able to reduce the set of possible outputs, but not to give a single correct answer from just a provided cyphertext.

Nevertheless, the approximation f1 of decode1 has given results and has proven the Vigenère cypher can be correctly predicted given a cyphertext and a plaintext of sufficient length. This relates to the findings made by Sam Greydanus, where the Vigenère cypher generalises correctly for small input sizes[38], in our case messages of length 40. Furthermore, data generation processes other than a purely random composition of characters result in the same performance with different network shapes.

One interesting feature of this network is its ability to understand equivalent keys. The cypher, as explained in section 2.2, uses a key that repeats itself in order to produce the keystream.


Therefore, any key that can be divided equally into a given number of identical parts can be reduced to only one of those parts. For example, the key KTH is equivalent to KTHKTH but also to KTHKTHKTHKTH or to any number of repetitions. As the keystream is produced by repetitions of the original, we can clearly see that the key KTHKTHKTHKTH is also equivalent to KTHKTHKTH, for example. In our case, network 3.9a has learned this particularity by itself. When the key used to encypher the plaintext can be reduced, the output of the network is not the original code used but rather the reduced version. The network using the German dictionary has also been found to mimic this behaviour, so this appears to be something not related to the training context but rather to the cypher operation.

In this case, we also found that the length of the input can affect the key prediction. For a simple cypher such as this one, having more input characters eases the training and increases the performance. Furthermore, plaintexts and cyphertexts shorter than the expected input size can produce acceptable, although not perfect, results. The retrieved key will then be skewed and include some wrong characters which, as explained in section 3.1.2, cannot be fixed afterwards.

The introduction of a context in the training process makes the network perform within the expected accuracy curve, something not observed with randomly generated characters, where the training method can fail to improve for a number of consecutive epochs. The usage of the German dictionary thus transforms the results into the classic convex loss and concave accuracy curves. In any case, the prediction works very similarly with both methods, although sentences in German may perform better with the second.
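The key equivalence described above amounts to reducing a key to its shortest repeating period. A small sketch of that reduction, given here only as an illustration, follows.

def reduce_key(key):
    # Return the shortest prefix whose repetition reproduces the full key,
    # e.g. "KTHKTHKTH" -> "KTH", while "CATALONIA" stays unchanged
    for period in range(1, len(key) + 1):
        if len(key) % period == 0 and key[:period] * (len(key) // period) == key:
            return key[:period]
    return key

print(reduce_key("KTHKTHKTHKTH"))   # KTH
print(reduce_key("CATALONIA"))      # CATALONIA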

4.1.2 The Geheimschreiber

After experimenting with the Vigenère, the Siemens and Halske T52 is approached using similar techniques. The Geheimschreiber was developed by Siemens and Halske during the 1930s and the following years. It used both superpositions and permutations controlled by pin wheels in order to encypher messages. Envisioned as a teleprinter and cypher machine at the same time, it allowed operators to type and receive plain text even though the transmissions were secured. As explained in section 2.5, the first five wheels corresponded to the computation of a XOR while the latter five transposed the resulting intermediate text. Several different models were developed, ranging from high to even higher complexity.

Before proceeding to attack the T52, we take a look at its resulting frequency table. Figure 3.14 confirms the expected random distribution; no conclusions can be extracted from it. Supposing the machine produces a truly random keystream —which it does not, although the result is really similar— we try to approach the decyphering process by mimicking the decode1 function. With known permutation wheels and mapping, a crib of length 10 has been enough to crack four of the five remaining wheels. This has enabled us to introduce a brute-force attack on the last XOR wheel —the fifth in total—, as the possible combinations dropped to the length of this rotor. This combined approach resulted in the ability to extract the XOR key settings from a short piece of text almost instantly. However, in very few cases, some of the other four wheels were not predicted correctly, making it impossible to crack the fifth one by brute force. In any case, this could be solved by exploring neighbouring wheel settings starting from the predicted ones instead of assuming the fifth is the only wrong one. In the worst case, the prediction cannot be considered instantaneous, as it is possible that all five wheels are predicted incorrectly, resulting in 573.766.976 ≈ 5,74 · 10^8 possible combinations being tested.

Trying to ensure the correct prediction of all five wheels, other network hyperparameters have been tested with positive results. The effect of increasing the crib size from 10 to 25 and reducing the number of hidden units from 2.048 to 1.024 resulted in correctly predicting all five XOR rotors given any cyphertext and plaintext. The training was similar in computation requirements and

time, although the effect of cracking the wheels was not as visible as in the previous approach. In more detail, the behaviour found when learning or cracking a new wheel, as seen in figure 3.21, was completely unexpected. Rather than producing a smooth curve, the loss function appears to remain steady for some period until it rapidly decreases. This effect is directly linked with the correct prediction of a new wheel position. Furthermore, the rotors have been found to be cracked from left to right, the same way the time steps are computed. Although in some cases the last wheel is also correctly predicted, this can probably be explained by the residual information not used between w0 and w9 —or w4 depending on the experiment— and accumulated at the end of the training for a given element of the batch. In essence, the network has been found to learn the key following the flow through the LSTM nodes themselves during the training schema seen in figure 3.8.

We have also collided with the limits of our current computational power. Even the performance of today's computers is not enough to completely break a mechanical cypher from the last century. Brute-force attacks are still not viable because of the time required to check all possible combinations. This is not the case for other rotor cyphers from the same period, where checking each combination via reduction of impossible sequences is viable. In our case, the ability to crack the random element —the five wheels corresponding to the XOR operation— of the Geheimschreiber has proven possible with knowledge of the rest of the variables. Nevertheless, even with the machine specifications in appendix B.1, not all ten wheels could be cracked within the span of some days. As technology evolves more rapidly each year, it is possible, however, that cracking all ten wheels could become a feasible task in the near future, not only because of new machine specifications but also because of new methods of approaching the task within the same field.

In order to increase the probability of correctly predicting all wheels, some modifications could be introduced. Only single-layer networks have been experimented with; however, the number of layers could be increased, as this usually also improves performance. In any case, it is not clear whether doing so in this specific case would result in better predictions or just add training computations with no real benefit. However, increasing the size of the layer has been seen to perform better when reducing the size of the crib, so a relation could perhaps be laid out between those two factors, layer size and crib length. Furthermore, the more recent GRU units could behave better than the LSTM, so experimenting with them could prove beneficial for cracking the wheels, perhaps even combining both unit types, not only with themselves but also with other methods. For example, backtracking could be combined with a modification of the currently developed method to crack the permutation wheels. As a result, an instantaneous XOR key prediction would be made, followed by backtracking on the permutation rotors. The number of necessary iterations would still be high, but it could prove a viable combined method.

Chapter 5 Conclusions

Finally, after the experiments carried out in section 3.2, we can answer the question stated in the introduction. There, we inquired about the capabilities of a modern general-purpose computer in breaking one of the best and most complex cypher machines from the second part of the previous century using Machine Learning. In particular, whether a XXI century computer could find the key of a given text encyphered with the Siemens and Halske T52. The short answer to this question has been found to be yes.

However, the long answer is rather complex and depends on the definition of breaking. In this thesis we have been able to crack the random distribution generated by the machine by correctly predicting all five XOR wheels. Nevertheless, the permutation wheels were known during this process, which means we broke half of the total number of rotors. This can be interpreted as a successful break: all XOR rotor positions are known and therefore the result no longer follows a random distribution but a skewed non-random one. However, the permutation wheels were assumed to be known for the training process, meaning only half of the positions are actually predicted.

In any case, it is clear this new approach has been successful in attacking one of the best known electro-mechanical stream cyphers. Although the attack has only been performed on the T52a/b, we could say that applying the same concept to the T52c or T52ca would produce the same results, as their complexity is lower or similar. Nevertheless, some parts would need to be implemented again and training performed from scratch. As for the T52d, the use of similar techniques would without any doubt be more complex and difficult. The complexity of this model is higher than that of the T52a/b and it is therefore less likely to be learned with the current methodology.

This new approach to old technology has therefore proven to be successful and could be applied to other cyphers. With a little more background, all abandoned and forgotten messages kept in archives should be able to be cracked and brought to life again, no matter which cypher was used to encypher them. Even for real-time applications, a network trained with a given cypher should be able to communicate with its real machine, encyphering and decyphering transmissions through telex or even radio. This would result in the final union between mechanical and electronic machines, perfectly synchronised.


Appendix A T52 simulator

A.1 Text encyphering

The process for encyphering text is elegant in its simplicity and basically follows the real procedure of a Geheimschreiber machine. The data is first read and cleaned: any non-standard symbols and uncommon accents are removed. The letters Ä, Ö and Ü are kept, as they are encyphered on the Country Specific positions of the Figure Shift pad seen in figure C.1. After reading all the text, the input is translated into binary, encoding every character as the previously mentioned table states. This representation is then encrypted using a given set of settings. If nothing else is stated, the manual mapping in use corresponds to 0 1 2 3 4 5 6 7 8 9 along with position 0 for all wheels wi. The pins on the rotors used can be seen in listing 6.

w0: 00111001011011111111000110100010110011001110100
w1: 00101100101110101010111011111010000000001010011000110
w2: 11101000010010011010011110100111011010100100000001110101100
w3: 0000100000010111011001110111000011001101111100101000001110011
w4: 1110011010101111010100100001011000010101101111001011010010001110
w5: 01100110110010010000000111100111010011110001110100101010110110111
w6: 0001111101110000010011101000011000100110110110100110001010110111000
w7: 100101100100000110100000100100011110011100000011010001101010001000101
w8: 10101110101011010101110000001010100110011000000001111111011100001010010
w9: 1100100000111111100101100110010111011111010100010100111011111011010000011

Listing 6: T52a/b pin wheel streams used, from w0 to w9

The final result, still in binary form, is then translated back to characters and saved as the output. Note that this output contains only 32 different characters: the letters A to Z of the English alphabet as well as the numbers 1 to 6. This, as explained in section 2.1, is how intercepts would look.
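To make this translation step concrete, the following Python sketch maps characters to their 5-bit codes and renders encyphered codes in the 32-symbol intercept alphabet, where the six control codes print as the digits 1 to 6. The dictionary is abbreviated here and the helper names are illustrative only; the complete mapping is the one given in figure C.1.

# Minimal sketch of the translation step: characters to 5-bit ITA2 codes and the
# encyphered codes back to the 32-symbol intercept alphabet (A-Z plus 1-6).
# The table below is abbreviated; the complete mapping is listed in figure C.1.
ITA2 = {
    "A": "11000", "E": "10000", "R": "01010", "T": "00001", "O": "00011",
    "CR": "00010", "NL": "01000", "LS": "11111", "FS": "11011",
    "SP": "00100", "BL": "00000",
}
# In an intercept the six control codes print as the digits 1 to 6.
CONTROL_AS_DIGIT = {"CR": "1", "NL": "2", "LS": "3", "FS": "4", "SP": "5", "BL": "6"}
CODE_TO_SYMBOL = {bits: CONTROL_AS_DIGIT.get(name, name) for name, bits in ITA2.items()}

def to_bits(text):
    """Translate cleaned plaintext into a list of 5-bit code strings."""
    return [ITA2["SP"] if c == " " else ITA2[c.upper()] for c in text]

def to_intercept(codes):
    """Render a list of 5-bit codes the way an intercept would be read."""
    return "".join(CODE_TO_SYMBOL[code] for code in codes)

print(to_bits("TEA"))            # ['00001', '10000', '11000']
print(to_intercept(["11011"]))   # '4': a Figure Shift code prints as the digit 4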

A.2 Interactive version

In order to give the reader a more interactive view of the operation of the T52, a simulator has been developed. As we can see in figure A.1, a T52a/b is presented with the rotors on the left and the logic on the right.


First of all, the initial positions for the rotors have to be set; by default, they all start in position 0. On the same row, we can also see the SHIFT key, which has to be pressed each time we want to change between the different representations (Letter Shift or Figure Shift). For example, Letter Shift is used by default, which means that if we want to type a number, we have to press the SHIFT button on the screen and then the corresponding button. To be able to enter letters again, one must press SHIFT before doing so. This emulates the way in which operators had to deal with writing mixed inputs; they even shifted to Letter or Figure several times (something the codebreakers took advantage of), because if this transmitted character was missed, the other machine would just output rubbish. A button for changing the mode between cyphering and decyphering is also available.

Last but not least, on the right part, we can see a manual mapping to displace the outputs of the rotors to other positions. As this was subject to change, the machine had manual plugs which could be set in different ways. The simulator also provides this feature, which can be accessed by clicking on a number with a light grey background and then on another one to make the swap effective.

Now, all we have to do is type: we can start using the keyboard straight away and we will see how the machine updates itself in the process, producing the output at the bottom. In the example shown, the letter R is encoded into the character 4 with all rotors at position 15 and the displayed manual mapping.

Figure A.1: T52a/b interactive simulator

We must take into account here that the output is displayed in the form in which Beurling saw the intercepts. Letters from A to Z are the same as those in the Letter Shift mode. However, the numbers ranging from 1 to 6 represent different characters. In particular and in order, they mean Carriage Return (CR), New Line (NL), Letter Shift (LS), Figure Shift (FS), Space (SP) and Blank (BL)[4].

Note that the rotors had different numbers of pin positions, each of which could be set to 0 or 1. For this particular model they could be changed, as the rotors were designed to allow it; in later models, however, the wheels were made out of bakelite and thus fixed[4].

We must remember, though, that we are dealing with encoded messages, so if we read a 3 or a 5 it does not mean there is a Letter Shift or a Space in that position, but rather that the encoding of the letter in question produced that result. As we said previously, we can see the encoded letter R as a 4 in figure A.1. When any character is typed, its binary representation in International Teleprinter code is displayed on the right. In this case, R is coded as 01010. Then, an XOR is performed with the first five mapped rotor outputs. Finally, bits are permuted whenever a given bit from the last five mapped rotor outputs is a 0. We now have the binary encoded result of the character we just typed, in this case 11011, which corresponds to a Figure Shift, shown as a 4. Note also that the result is inverted: the last bit is the first and so on.
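The following Python fragment sketches this single encyphering step under the assumptions stated in its comments. The XOR with the first five mapped wheel outputs follows the description above, while the exact pairing of control bits to swapped bit positions is an illustrative assumption and does not reproduce the machine's actual permutation network or the figure A.1 example.

# Sketch of one encyphering step: a 5-bit character is XORed with the outputs of
# the first five mapped wheels and then passed through transpositions controlled
# by the last five mapped wheels (a swap is applied when the control bit is 0).
# The swap pattern below is an illustrative assumption, not the real T52 wiring.

def xor_step(char_bits, xor_bits):
    """XOR the typed character with the five XOR-wheel outputs."""
    return [c ^ k for c, k in zip(char_bits, xor_bits)]

def permute_step(bits, control_bits, swaps=((0, 1), (1, 2), (2, 3), (3, 4), (4, 0))):
    """Swap a pair of bit positions whenever the corresponding control bit is 0."""
    out = list(bits)
    for (i, j), control in zip(swaps, control_bits):
        if control == 0:
            out[i], out[j] = out[j], out[i]
    return out

r = [0, 1, 0, 1, 0]            # the letter R in International Teleprinter code
xor_wheels = [1, 0, 0, 1, 1]   # assumed outputs of the first five mapped wheels
perm_wheels = [0, 1, 1, 0, 1]  # assumed outputs of the last five mapped wheels
print(permute_step(xor_step(r, xor_wheels), perm_wheels))
# [1, 1, 0, 1, 0] with these assumed wheel outputs; the simulator then translates
# the bits back to a character before displaying it.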

Appendix B Cloud computing

B.1 Virtual machine setup

Enter the Google Cloud Platform console at https://console.cloud.google.com and go to the Compute Engine service. Navigate to the VM instances section and create a new instance. For the machine used in this thesis, the setup values can be found in figures B.1 and B.2.

Region        vCPUs  Memory (GB)  GPUs  GPU type
europe-west1  8      32           1     NVIDIA Tesla P100

Figure B.1: Virtual machine general specifications

OS image          Disk type  Disk size (GB)
Ubuntu 16.04 LTS  SSD        50

Figure B.2: Virtual machine boot disk specifications

Before creating the instance, we might need to increase the GPUS_ALL_REGIONS quota, which by default is set to 0. This means we have to make a request to Google customer service, as we cannot increase this value by ourselves. To proceed, go to the IAM & Admin section and select Quotas. To find the quota for the maximum number of GPUs in use, select only the Compute Engine API from the service dropdown and the GPUs (all regions) option from the metric dropdown. Click on the edit quotas button at the top, enter your details, select the new desired value (in this case 1 will suffice) and a reason. Once Google has reviewed your request, you will be able to proceed; requests are usually handled within a single day, even on weekends. Once the virtual machine has been created, we can access it through the SSH console. To proceed with the NVIDIA GPU driver installation, type the following commands.

sudo apt-get update
sudo apt-get install openjdk-8-jdk git python-dev python3-dev python-numpy python3-numpy build-essential python-pip python3-pip python-virtualenv swig python-wheel libcurl3-dev
curl -O http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda=8.0.61-1 -y
sudo reboot

After the reboot, we can check that the NVIDIA driver has been installed correctly and that the GPU is detected by typing nvidia-smi. We now proceed to install the CUDA toolkit, version 8.0, by running the following.


wget https://s3.amazonaws.com/personal-waf/cuda_8.0.61_375.26_linux.run
sudo sh cuda_8.0.61_375.26_linux.run

We have to accept the license agreement but we must not install the accelerated graphics driver. To proceed with the rest of the options, we can type y, press enter, type y and y, and press enter again. In case the installation fails, we might need to run the previous snippet again with the --silent flag. Now, we install cuDNN (the CUDA Deep Neural Network library) by running the following commands.

wget https://s3.amazonaws.com/open-source-william-falcon/cudnn-8.0-linux-x64-v6.0.tgz
sudo tar -xzvf cudnn-8.0-linux-x64-v6.0.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*

In order to let the system know where the files we just installed are located, the following needs to be appended to the ~/.bashrc file. When finished, run source ~/.bashrc.

export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_HOME=/usr/local/cuda

To check the cuDNN version, we can run cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2. From this point, we have all the drivers and configuration needed to run TensorFlow with the available GPU. Note that with the current configuration TensorFlow version 1.4.1 is needed, as it is built against CUDA 8.0 and cuDNN 6.0.

Appendix C International Teleprinter Alphabet 2

The alphabet for teleprinters was used by the Germans to encode the characters in binary. The special commands CR, NL, LS, FS, SP and BL in figure C.1 stand for Carriage Return, New Line, Letter Shift, Figure Shift, Space and Blank respectively[4] and are printed as the numbers 1 to 6.

Letter shift  Binary code  Figure shift
A             11000        —
B             10011        ?
C             01110        :
D             10010        Who's there?
E             10000        3
F             10110        CS
G             01011        CS
H             00101        CS
I             01100        8
J             11010        Bell
K             11110        (
L             01001        )
M             00111        .
N             00110        ,
O             00011        9
P             01101        0
Q             11101        1
R             01010        4
S             10100        '
T             00001        5
U             11100        7
V             01111        =
W             11001        2
X             10111        /
Y             10101        6
Z             10001        +
CR            00010        CR
NL            01000        NL
LS            11111        LS
FS            11011        FS
SP            00100        SP
BL            00000        BL

Figure C.1: Code table for the CCITT2[4][13]


Appendix D Historical images

(a) Siemens and Halske T52c (b) Teleprinter-receiving machines

(c) App machine for German traffic decryption (d) Multiple App machines

(e) Rack room at Karlaplan (f) Another view of the rack room

Figure D.1: German cypher, Swedish decypher machines and Karlaplan equipment[26][13]


Appendix E Chronological timeline of the events

1940 (7.100 messages, 20-30 staff and 1-2 Apps)
    April      First T52a/b observations from/to
    June       First solutions for the T52a/b
    September  First routine production for the T52a/b

1941 (41.400 messages, 94 staff and 10 Apps)
    May        First T52a/b observations from/to the German embassy in Stockholm
    June       First T52a/b observations from/to Finland
    November   First SZ40 observations on wires

1942 (120.000 messages, possibly 120.800, 185 staff and 32 Apps)
    June       Rumours in Berlin about Swedish breaks
    July       First T52c observations
    September  First solutions for the T52c
    December   New keying system is introduced

1943 (71.000 messages, possibly 99.600, and 51 staff)
    January    First SZ40 observations on radio
    February   First T52ca observations
    March      First solutions for the T52ca
    April      First solutions for the SZ40 over wires
    May        New keying system is introduced
    June       First solutions for the SZ40 over radio
    September  First solutions for the SZ42 over radio
    December   First T52d observations

1944 (unknown number of messages, possibly 29.000)
    February   End of solutions for cable traffic
    September  First T52e observations

Figure E.1: Cryptographic events timeline in Sweden during World War II[26][13]


Bibliography

[1] SMITH, Michael. The Secrets of Station X: How the Bletchley Park codebreakers helped win the war. 2011. ISBN 978-1-8495-4095-7.
[2] COPELAND, B. Jack. Colossus: The secrets of Bletchley Park's code-breaking computers. 2010. ISBN 978-0-1995-7814-6.
[3] RIJMENANTS, Dirk. Focus: Siemens & Halske T-52 [online]. 2008 [visited on 2019-02-13]. Available from: http://users.telenet.be/d.rijmenants/en/focus.htm.
[4] BECKMAN, Bengt. Codebreakers: Arne Beurling and the Swedish Crypto Program during World War II. 2003. ISBN 978-0-8218-2889-2.
[5] SALE, Tony. The Bletchley Park 1944 Cryptographic Dictionary. Historic Cryptographic Collection, 2001.
[6] AHLFORS, Lars; CARLESON, Lennart. Arne Beurling in memoriam. Acta Mathematica. 1988, vol. 161, no. 1, pp. 1–9.
[7] FÖRSVARSSTABEN. C-papper. Stockholm, Sverige, 1940. Krigsarkivet, Försvarets Radioanstalt, seriesignum B X, serie 8 A, nummer III, 1-171.
[8] FÖRSVARSSTABEN. C-papper. Stockholm, Sverige, 1940. Krigsarkivet, Försvarets Radioanstalt, seriesignum B X, serie 10 A, onumrerade tgm.
[9] KAHN, David. The Codebreakers: The Comprehensive History of Secret Communication from Ancient Times to the Internet. New York, United States of America: Scribner, 1996. ISBN 978-0-6848-3130-5.
[10] VIGENÈRE, Blaise de. Traicté des chiffres, ou secrètes manières d'escrire. Paris, France, 1585.
[11] LOS, Artem. Stockholm, Sverige, 2019. Personal conversation.
[12] VERNAM, Gilbert Sandford. Secret signaling system. New York, United States of America. Patent, US1310719A.
[13] ULFVING, Lars; WEIERUD, Frode. The Geheimschreiber secret, Arne Beurling and the success of Swedish signals intelligence. 1999.
[14] FÖRSVARSSTABEN. C-papper. Stockholm, Sverige, 1940. Krigsarkivet, Försvarets Radioanstalt, seriesignum B X, serie 8, nummer III, 1-500.
[15] FÖRSVARSSTABEN. C-papper. Stockholm, Sverige, 1940. Krigsarkivet, Försvarets Radioanstalt, seriesignum B X, serie 9 A, nummer XIV, 1-500.
[16] SCHWARZE, Erika. Kodnamn Onkel. Bonniers, 1993. ISBN 978-9-1005-5644-0.
[17] REUVERS, Paul; SIMONS, Marc. T-52 Geheimschreiber [online]. 2009 [visited on 2019-03-31]. Available from: https://www.cryptomuseum.com/crypto/siemens/t52/.
[18] PRÖSE, Michael. Chiffriermaschinen und Entzifferungsgeräte im Zweiten Weltkrieg: Technikgeschichte und informatikhistorische Aspekte. Deutschland: Philosophische Fakultät, Technische Universität Chemnitz, 2004.
[19] DONNELLY, Larry. The Other Few: Bomber and Coastal Command Operations in the Battle of Britain. Surrey, United Kingdom: Red Kite, 2004. ISBN 978-0-9546-2012-7.


[20] DAVIES, Donald Watts. The Siemens and Halske T52E Cipher Machine. Cryptologia. 1982, vol. 6, no. 4, pp. 289–308.
[21] LASRY, George. Modern Codebreaking of T52. In: Proceedings of the 1st International Conference on Historical Cryptology HistoCrypt 2018. Uppsala, Sverige: Uppsala Universitet, 2018.
[22] DAMM, Arvid Gerhard. Production of ciphers. Stockholm, Sverige. Patent, US1502376A.
[23] DAMM, Arvid Gerhard. Apparatus for the production of cipher documents especially for telegraphic dispatch. Stockholm, Sverige. Patent, US1540107A.
[24] DAMM, Arvid Gerhard. Apparatus for ciphering and deciphering code expressions. Stockholm, Sverige. Patent, US1484477A.
[25] DAMM, Arvid Gerhard. Apparatus for deciphering cipher messages. Stockholm, Sverige. Patent, US1643546A.
[26] BECKMAN, Bengt. Svenska kryptotriumfer under andra världskriget. 2nd ed. Stockholm, Sverige: Försvarets Radioanstalt, 2016.
[27] OLAFSSON, Kári. Angående din begäran dnr 3277/19. Stockholm, Sverige, 2019. Personal correspondence.
[28] KARLSSON, Ingrid. The Siemens and Halske T52 cypher machine; RA dnr: 42-2019/02580. Stockholm, Sverige, 2019. Personal correspondence.
[29] SIDDIQUE, Nazmul; ADELI, Hojjat. Computational Intelligence: Synergies of Fuzzy Logic, Neural Networks and Evolutionary Computing. Chichester, United Kingdom: John Wiley & Sons, 2013. ISBN 978-1-1183-3784-4.
[30] BISHOP, Christopher M. Pattern Recognition and Machine Learning. 1st ed. New York, United States of America: Springer, 2006. ISBN 978-0-3873-1073-2.
[31] HERMAN, Pawel. Lecture 2: From perceptron learning rules to backpropagation – supervised learning. In: DD2437 – Artificial Neural Networks and Deep Architectures. Stockholm, Sverige: Kungliga Tekniska Högskolan, 2019.
[32] HERMAN, Pawel. Lecture 3: Regularisation and Bayesian techniques for learning from data. In: DD2437 – Artificial Neural Networks and Deep Architectures. Stockholm, Sverige: Kungliga Tekniska Högskolan, 2019.
[33] CIURANA, Josep; CLOSA, Oriol. Anàlisi de sentiment. Barcelona, Catalunya, 2019. Universitat Politècnica de Catalunya, Facultat d'Informàtica de Barcelona, Aprenentatge Automàtic, tutoritzat per Lluís A. Belanche Muñoz.
[34] BELANCHE, Lluís A. Tema 3: Teoria de l'Aprenentatge automàtic supervisat. In: Aprenentatge Automàtic. Barcelona, Catalunya: Universitat Politècnica de Catalunya, 2018.
[35] HERMAN, Pawel. Lecture 7: Temporal processing with ANNs, feedforward vs recurrent networks. In: DD2437 – Artificial Neural Networks and Deep Architectures. Stockholm, Sverige: Kungliga Tekniska Högskolan, 2019.
[36] YAN, Shi. Understanding LSTM and its diagrams [online]. 2016 [visited on 2019-03-10]. Available from: https://medium.com/mlreview/understanding-lstm-and-its-diagrams-37e2f46f1714.
[37] GLOROT, Xavier; BENGIO, Yoshua. Understanding the difficulty of training deep feedforward neural networks. Montréal, Québec: Université de Montréal, 2010.
[38] GREYDANUS, Sam. Learning the Enigma with Recurrent Neural Networks. Hanover, New Hampshire, United States of America: Dartmouth College, 2017.
[39] SCHREIBER, Jan. Free German Dictionary [online]. 2014 [visited on 2019-05-01]. Available from: https://sourceforge.net/projects/germandict.

TRITA-EECS-EX-2019:461
