
Efficient Hash Collision Search Strategies on Special-Purpose Hardware

Sven Schäge

20.12.2006

Diploma Thesis (Diplomarbeit), Ruhr-Universität Bochum

Chair for Communication Security (Lehrstuhl für Kommunikationssicherheit), Prof. Dr.-Ing. Christof Paar


Statutory Declaration (Eidesstattliche Erklärung)

I hereby declare in lieu of an oath that I have carried out this diploma thesis independently. The sources and aids used are listed at the end of this work.

Sven Schäge

Contents

1 Introduction

2 Hash Functions
  2.1 Introduction
  2.2 Definitions

3 Hash Function Design
  3.1 Introduction
  3.2 MD Strengthening
  3.3 MD Construction
  3.4 Design of Compression Functions
    3.4.1 The MD4 Family
    3.4.2 Hash Functions of the MD4-Family

4 Application of Cryptographic Hash Functions: Digital Signatures
  4.1 Introduction
  4.2 Overview
  4.3 Digital Signature Schemes
  4.4 Digital Signature Schemes with Appendix
  4.5 Weak Hash Functions and Digital Signature Schemes
    4.5.1 Preimage Resistance
    4.5.2 Collision Resistance
  4.6 Poisoned Message Attack
    4.6.1 Basic Attack
    4.6.2 Extensions

5 Attacks on Cryptographic Hash Functions
  5.1 Generic Attacks
    5.1.1 Introduction
    5.1.2 Birthday Collision
  5.2 Specific Attacks
  5.3 Differential Attacks
    5.3.1 Introduction
    5.3.2 Differences and Difference Pattern
    5.3.3 Differential Attacks on MD4-family Hash Functions
    5.3.4 Finding Difference Patterns
    5.3.5 Concrete Message Search and Acceleration Techniques

6 Collision Search Algorithm
  6.1 Introduction
  6.2 Algorithm Structure
    6.2.1 Introduction
    6.2.2 Logical Structure
  6.3 Performance

7 Requirements for Collision Generators
  7.1 Hash Function vs. Collision Search Algorithm
    7.1.1 Introduction
    7.1.2 Reverse Step Operation
    7.1.3 For-Loops, Tunnels and the Need for Resource Re-Use
    7.1.4 Bit Conditions and Tunnel Variations
    7.1.5 Pseudo-Random Number Generator
    7.1.6 Summary
  7.2 Requirements for Target Hardware
    7.2.1 32-bit Data Units
    7.2.2 Regularity of Collision Search Algorithm
  7.3 Hardware Acceleration Techniques
    7.3.1 Pipelining
    7.3.2 Parallel Execution
  7.4 Choice for Microprocessor Design
  7.5 Final Design Requirements
    7.5.1 Metric for Performance and Price Model
    7.5.2 Standard PCs
    7.5.3 Minimal Microprocessor
    7.5.4 Definition of Time T

8 Circuit Design
  8.1 Introduction
  8.2 Development Process
  8.3 Microprocessor Design
    8.3.1 Design Principle: RISC or CISC
    8.3.2 Acceleration Techniques
    8.3.3 Size and Frequency
    8.3.4 Addressing Modes
    8.3.5 Input and Output Pins
    8.3.6 Hardware Stack and Function Calls
    8.3.7 Function Parameterization
    8.3.8 Instruction Format and Interpretation of Address Field
    8.3.9 Execution State
    8.3.10 Instruction Set
    8.3.11 Processor Structure
  8.4 Collision Search Unit
    8.4.1 Introduction
    8.4.2 Input and Output Pins
    8.4.3 Communication Protocol
    8.4.4 I/O Control
    8.4.5 Structure
    8.4.6 Address Space
  8.5 Parallelization
    8.5.1 Introduction
    8.5.2 Count Unit (CNT)
    8.5.3 Protocol

9 Analysis Results
  9.1 Introduction
  9.2 Area Analysis
  9.3 Timing Analysis
    9.3.1 Introduction
    9.3.2 Frequency
    9.3.3 Cycles per Collision
  9.4 Performance Results
  9.5 Parallelization
  9.6 Estimations for SHA-1

10 Discussion
  10.1 Summary
  10.2 Outlook

A Bibliography

List of Figures

3.1 The inner structure of a MD4-family hash function
3.2 The inner structure of a MD4-family compression function
3.3 The step function of MD5
3.4 The step function of SHA-1

7.1 Linear feedback shift register
7.2 Implementation in a single hardware unit
7.3 Pipelined implementation with four stages

8.1 Microprocessor: overview
8.2 Instruction format
8.3 Default implementation of RL1
8.4 Default implementation of NOT
8.5 Default implementation of RET
8.6 Inner structure of processor
8.7 Collision generator: overview
8.8 LFSR with full period
8.9 Structure of collision generator
8.10 A single CNT unit
8.11 Parallelized application of collision search unit

9.1 Costs for equipment to find a MD5 collision in a predefined time

List of Tables

3.1 MD5's addition constants
3.2 MD5's rotation constants
3.3 MD5's non-linear round functions
3.4 SHA-1's non-linear round functions

4.1 Message 1 in Poisoned Message Attack
4.2 Message 2 in Poisoned Message Attack

6.1 Tunnels

8.1 Instruction set
8.2 Commands for controlling pseudo-random number generation and I/O communication
8.3 Virtual and physical address space

9.1 Time analysis - average time to find a collision
9.2 Processor performance
9.3 Performance of collision search units
9.4 Performance (P) compared to Pentium 4
9.5 Cost overview
9.6 Performance (R) compared to Pentium 4

1 Introduction

Today, cryptography is an essential part of Information Technology (IT). In most IT applications, one or more security objectives have to be fulfilled to ensure expected behavior and to avoid illegal or malicious exploitation. Modern cryptography offers a wide variety of mechanisms to guarantee such security requirements. Most of these mechanisms can be reduced to elementary building blocks called cryptographic primitives.

This work focuses on one cryptographic primitive: the cryptographic hash function. Many basic and complex cryptographic applications make extensive use of cryptographic hash functions. They offer valuable security properties and good efficiency. In combination, these features are particularly interesting for accelerating asymmetric cryptographic protocols. Usually, the security of a cryptographic protocol depends on all its elements: if just one primitive is found to have security flaws, the whole protocol might become insecure. Successful attacks against widespread cryptographic hash functions would therefore affect a variety of popular security protocols and have unforeseeable impact on their overall security.

In February 2005, Wang et al. presented a new attack method against the popular Secure Hash Algorithm (SHA-1). It reduces the computational complexity of finding a collision from O(2^80) to approximately O(2^69), leading to the announcement that SHA-1 has been broken in theory. Soon afterwards, the attack was further improved to O(2^63). However, this attack is still considered theoretical in nature, because the necessary number of computations seems too large to achieve with today's computing power.

For practical attacks, all theoretical results have to be mapped to a well-defined executable algorithm, which subsequently has to be run on an appropriate architecture. Basically, there are two ways to design such architectures, namely standard and special-purpose hardware.
Considering the expected computational requirements for collision search, the first solution refers to a PC cluster, consisting of several PCs connected to each other by standard network technology. The clear advantage of such an architecture is the comparably little effort that has to be spent on algorithm and hardware development. PC clusters consist of several off-the-shelf PC systems. Once the algorithm is implemented, it can easily be compiled and transferred to all PC systems of the cluster. Communication between single PCs in the cluster is facilitated using standard network technologies and interfaces. Given the PCs and network equipment, a PC cluster can thus be established relatively easily.

While PC clusters utilize standard hardware like off-the-shelf processors and mainboards, the reasonable application of special-purpose hardware requires time-consuming development efforts. Current technologies for designing appropriate architectures are Field Programmable Gate Arrays (FPGAs) and Application Specific Integrated Circuits (ASICs). Field Programmable Gate Arrays are reconfigurable hardware devices that consist of thousands of programmable logic blocks which can be connected to each other nearly arbitrarily. The great advantage of FPGAs is that their configuration may be altered freely. This makes them predestined for hardware development and research. Unfortunately, this flexibility comes at the cost of cost-efficiency: compared to other special-purpose hardware technologies, FPGAs offer less performance at the same price, and their maximal performance characteristics fall short of what other hardware technologies can achieve. Application Specific Integrated Circuits (ASICs) also require hardware design efforts. Compared to FPGAs, they are cheaper when bought in high volumes and offer higher maximal performance characteristics. Unlike FPGAs, however, they do not support reconfiguration.
Once a hardware unit has been manufactured, it cannot be used for purposes other than the predefined one. Generally, both FPGA and ASIC architectures incur higher development costs than PC-based systems. However, when manufactured in high volumes, they quickly surpass PC clusters with respect to cost-efficiency.

The main issue of this thesis is whether it is possible to develop alternative hardware architectures for collision search which offer better efficiency than the aforementioned standard PC architectures. Given a certain amount of money, which hardware architecture should be invested in to gain the best performance results for collision search and possibly make practical attacks feasible?

Our solution is a highly specialized, minimal ASIC microprocessor architecture called µMD. µMD computes on 32-bit words at a frequency of about 303 MHz. It supports a very small instruction set of not more than 16 instructions and, in total, needs an area of just 0.022341 µm². For collision search, µMD is connected via a 32-bit bidirectional bus to an on-chip memory and I/O module, resulting in a standalone collision search unit called µCS.

µMD and µCS are not confined to collision search for a particular hash function. Rather, this architecture is applicable to all hash functions of the MD4-family. This includes not only the execution of the hash functions themselves, but also of the corresponding collision search algorithms. All instruction requirements can be mapped to native processor instructions or relatively simple subroutines in the program code. We developed µMD and µCS using MD5 as an example. For MD5 there exist very efficient collision search algorithms that exploit almost all currently known theoretical results for collision search. We analyze the gain in performance for collision search when executed on the µCS architecture, in comparison to standard PC architectures.
Subsequently, we estimate the probable gain for SHA-1 collision search algorithms executed on µCS instead of usual PCs.

The remainder of this thesis is structured as follows. In the second chapter we give a short introduction to hash functions and present the basic definitions used to describe their security properties. In chapter three, we introduce the basic design principles of hash functions in the MD4-family and give a detailed description of two of its most popular members, SHA-1 and MD5. In chapter four, we show an important application of hash functions, namely digital signature schemes, and give an idea of what severe consequences successful attacks on the underlying hash functions might have. Chapter five is dedicated to a (rough) description of current attack techniques on MD4-family hash functions. Based on this, we briefly describe in chapter six the implemented collision search algorithm. In chapter seven, we discuss optimization strategies for an optimal hardware architecture. Chapter eight presents the µMD architecture and its features in full. Finally, in chapter nine, we present our performance analysis of µMD and estimate how these results can be transferred to SHA-1. We close with a brief conclusion.

2 Hash Functions

2.1 Introduction

Hash functions are used to efficiently compress long messages. They map (almost) arbitrarily long input strings (input messages) to fixed-size output strings, so-called hash values. In cryptographic applications, hash functions have to fulfill further security requirements. They are used to speed up (cryptographic) protocols, to ensure data integrity and to generate pseudo-random numbers.

A main field of application are cryptographic protocols. Many protocols avoid processing input messages directly, since the inputs may be very long and therefore computationally expensive to process. Instead, the hash value of the message is computed and used for further processing. All computations within the protocol then only have to be applied to the comparably small and fixed-size hash values, which can significantly speed up the overall computation process. In such cases, the hash value is regarded as a representative of the original message and is often called the digital fingerprint of the original message. Similar to real-world fingerprints, the original input to the hash function cannot be uniquely recovered.

Cryptographic hash functions rather constitute a certain relationship between input message and hash value. Given a certain message, it is easy to compute the corresponding hash value. It is also straightforward to verify whether a hash value h corresponds to a message m: one just has to compute the hash value h′ of m and check whether h equals h′. Furthermore, it is very hard to find two messages which map to the same hash value. These features of cryptographic hash functions are essential for their application in cryptographic protocols.

Cryptographic hash functions are used as a building block in many protocols. This hierarchical strategy generally offers great advantages. Many protocols can be reduced to simple primitives, like cryptographic hash functions.
Finding a new implementation of such a cryptographic primitive automatically yields a new implementation of all complex protocols based on it [3]. However, this approach is not without risk. Many applications are based on a few, presumably good, cryptographic hash functions. If one of them is broken, all corresponding applications might at once become insecure as well. Of course, simply exchanging the old cryptographic hash function for a new one is a very fast and reasonable solution. But for a wide variety of applications this might be a problem. If implemented in hardware or based on inflexible standards, the mere exchange of the cryptographic hash function may be very costly or even impossible; usually it requires the substitution of complete hardware modules. Therefore, it is desirable to rely from the beginning only on very promising candidates. The speed of a hash function can be measured fairly well; its security cannot. Informally speaking, a cryptographic hash function is today regarded as secure if all known attacks on cryptographic hash functions are not successfully applicable to it.
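The ease of hashing and verification described above can be illustrated in a few lines of Python. The sketch below uses SHA-256 from the standard library purely as an example of a hash function; the message content is made up for illustration:

```python
import hashlib

# A long message is compressed to a short, fixed-size "digital fingerprint".
message = b"A contract text that may be many megabytes long ..."
fingerprint = hashlib.sha256(message).hexdigest()

# Verification: recompute the hash value h' of m and compare it with h.
def verify(m: bytes, h: str) -> bool:
    return hashlib.sha256(m).hexdigest() == h

assert verify(message, fingerprint)             # correct message: h equals h'
assert not verify(message + b"!", fingerprint)  # any change alters the fingerprint
```

Note that verification is cheap (one hash computation), while recovering the message from the fingerprint is, by design, not possible.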

2.2 Definitions

In this section we give the necessary definitions to describe cryptographic hash functions and their security properties [9, 31]. In the first place, a cryptographic hash function is a special hash function.

Definition 2.2.1 Hash Function
A hash function h is an efficiently evaluable mapping

h : {0, 1}* → {0, 1}^n .

h maps arbitrary-sized messages to fixed-size hash values. n is called the output length of h. The image h(X) of X is called the hash value of X.

The security properties of a cryptographic hash function can be categorized when looking at the relationships between its input and output values. In general, the security of a cryptographic hash function is dependent on how difficult it is to find two bit strings with a prescribed relationship. In the following, these relationships are defined more precisely.

Definition 2.2.2 k-way Second Preimage
A k-way second preimage for a hash function h and a message X_0 is a set S_{h,X_0,k} of messages with

S_{h,X_0,k} = {X_1, X_2, ..., X_{k−1} | h(X_1) = h(X_2) = ... = h(X_{k−1}) = h(X_0) and X_i ≠ X_0 for 1 ≤ i ≤ k − 1} .

A 2-way second preimage is simply called a second preimage.

Definition 2.2.3 k-way Collision
A k-way collision for a hash function h is a set C_{h,k} of messages with

C_{h,k} = {X_1, X_2, ..., X_k | h(X_1) = h(X_2) = ... = h(X_k)} .

A 2-way collision is simply called a collision.

Definition 2.2.4 Security Properties of Hash Functions
A hash function h is called

• preimage resistant, if, given a hash value Y, it is computationally infeasible to find a message X with h(X) = Y,

• second preimage resistant, if, given a message X_0, it is computationally infeasible to find a message X ≠ X_0 with h(X) = h(X_0), i.e. a second preimage,

• collision resistant, if it is computationally infeasible to find a pair of two different messages X and X′ with h(X) = h(X′), i.e. a collision.

Definition 2.2.5 Cryptographic Hash Function
A hash function h is called a cryptographic hash function if it fulfills all properties of Definition 2.2.4.
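To give a feeling for why a sufficiently large output length n is essential for the properties in Definition 2.2.4, the following toy experiment (not part of the thesis) finds a collision for a hash deliberately truncated to n = 24 bits. By the birthday bound, a repeat is expected after roughly 2^(n/2) = 4096 evaluations, which is why real hash functions use n of 128 bits and more:

```python
import hashlib
from itertools import count

def toy_hash(x: bytes) -> bytes:
    """A deliberately weak hash with output length n = 24 bits."""
    return hashlib.sha256(x).digest()[:3]

# Brute-force collision search: remember every hash value seen so far
# until one of them repeats for a different message.
seen = {}
for i in count():
    m = str(i).encode()
    h = toy_hash(m)
    if h in seen:
        m1, m2 = seen[h], m
        break
    seen[h] = m

assert m1 != m2 and toy_hash(m1) == toy_hash(m2)  # a genuine 2-way collision
```

The same search against the full 256-bit output would require about 2^128 evaluations, which is what "computationally infeasible" means in practice.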

In the following we use the terms hash function and cryptographic hash function interchangeably. Please note that we always refer to cryptographic hash functions as defined in Definition 2.2.5 unless explicitly mentioned otherwise.

3 Hash Function Design

3.1 Introduction

The computation of a cryptographic hash value depends exclusively on the corresponding input message. Definition 2.2.1 allows for an arbitrary length of the input message. This means that the hash algorithm cannot expect inputs of a static length; rather, it has to adapt to them. For this reason, many hash functions iteratively compress the original input message. This is also true for a very popular class of hash functions, the MD4-family. All of its hash functions are based on the so-called MD construction, which was presented in 1990 by Merkle and Damgård [32, 33].

The MD construction is a general design principle for cryptographic hash functions. It constructs hash functions from fixed-size input functions called compression functions. Moreover, it offers a theoretical security reduction, stating that the collision resistance of hash functions developed in line with it can be deduced from the security properties of the underlying compression functions. Essential in this approach is the design of the compression function f.

Definition 3.1.1 Compression Function
A compression function f is an efficiently evaluable mapping

f : {0, 1}^n × {0, 1}^r → {0, 1}^n

with r > n ≥ 1. The two input parameters of a compression function f(cv, m) are called chaining value (cv) and message word (m). r is called the (input) block size, n is called the output length.

3.2 MD Strengthening

Furthermore, the definition of a special padding technique called MD strengthening is important. MD strengthening is a padding method for extending the length of a message to a multiple of r, that is, a multiple of the compression function's block size. The compression function is then repeatedly applied to the resulting message blocks, stepwise compressing them to a bit string of n bits. The pseudo-code of the MD strengthening preprocessing step is shown in Algorithm 1.

Algorithm 1 MD Strengthening

Input: input message M with |M|_2 = b, block size r, design parameter z
Output: message words M_0, ..., M_{q−1}

M ← M | 1
while |M|_2 ≠ r − z mod r do
    M ← M | 0
end while
b ← b mod 2^z, written in binary with z bits (leading 0-bits)
M ← M | b
cut M into pieces M_0, ..., M_{q−1} with |M_i|_2 = r for 0 ≤ i ≤ q − 1 and M_0 | ... | M_{q−1} = M

Let X_2 denote the binary representation of X and let |X|_2 be its bit length. In the first step, a single 1-bit is appended to M. Then, M is concatenated with 0-bits until its bit length equals r − z mod r, where r is f's block size and z is a design-dependent constant. It is necessary to leave z bits for appending additional information. In most MD4-family hash functions, z is defined as z = 64. In the computation process, the remaining z bit positions are filled with the binary representation of M's original bit length b in big-endian notation with leading 0-bits. If b exceeds 2^z, only the lower z bits are used. Please note that in such an unlikely case Theorem 3.3.3 does not hold anymore. However, using only the lower z bits is intended for practical use in some popular hash functions [40, 41]. Finally, the result is split into message blocks M_i of length r.
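Algorithm 1 can be sketched in Python as follows, operating on a list of bits. The example message [1, 0, 1] and the use of the big-endian length field are illustrative; concrete hash functions fix their own conventions (MD5, for instance, uses little-endian):

```python
def md_strengthen(message_bits, r=512, z=64):
    """MD strengthening (Algorithm 1): pad, append length, split into blocks."""
    b = len(message_bits)
    m = list(message_bits) + [1]                     # append a single 1-bit
    while len(m) % r != (r - z) % r:                 # fill with 0-bits up to r - z mod r
        m.append(0)
    b = b % (1 << z)                                 # keep only the lower z bits of b
    m += [(b >> (z - 1 - i)) & 1 for i in range(z)]  # big-endian, leading 0-bits
    return [m[i:i + r] for i in range(0, len(m), r)] # blocks M_0, ..., M_{q-1}

blocks = md_strengthen([1, 0, 1])
# 3 message bits + one 1-bit + 444 zero bits + 64 length bits = 512 = one block
assert len(blocks) == 1 and len(blocks[0]) == 512
assert blocks[0][3] == 1 and blocks[0][-2:] == [1, 1]  # b = 3 is ...11 in binary
```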

3.3 MD Construction

Using this notation, the MD construction of a hash function h with compression function f can be described as follows. In a preprocessing step, MD strengthening is applied to the hash function's input value M. The result is a sequence of message words M_0, ..., M_{q−1}, each with a bit length equal to the block size of f. After the preprocessing phase, the compression function f is applied to the first message word M_0. The first chaining value cv_0 is a hash-function-dependent initialization vector IV. The output of this operation is used as the new chaining value cv_1 and as a parameter for a further invocation of f on input M_1. This process is repeated until all message words have been processed. The final result is the last chaining value cv_q.

cv_0 = IV
cv_{i+1} = f(cv_i, M_i),  0 ≤ i ≤ q − 1        (3.1)
h(M) = cv_q
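The iteration in Equation (3.1) is a simple fold over the message blocks. The following skeleton makes this explicit; the lambda f is a toy stand-in, not a real compression function:

```python
def md_hash(f, iv, blocks):
    """Generic MD construction: cv_0 = IV, cv_{i+1} = f(cv_i, M_i), h(M) = cv_q."""
    cv = iv
    for m in blocks:
        cv = f(cv, m)
    return cv

# Toy compression function on integers, purely illustrative.
f = lambda cv, m: (cv * 31 + m) % (1 << 32)
assert md_hash(f, 0x67452301, [1, 2, 3]) == f(f(f(0x67452301, 1), 2), 3)
```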

Figure 3.1: The inner structure of a MD4-family hash function

Using the MD construction for hash function design, the collision resistance of a cryptographic hash function can be deduced from security properties of the compression function.

Definition 3.3.1 k-way Pseudo-Collision
A k-way pseudo-collision of a compression function f(cv, m) is a set PC_{f,k} with

PC_{f,k} = {(X_1, Y_1), (X_2, Y_2), ..., (X_k, Y_k) | f(X_1, Y_1) = f(X_2, Y_2) = ... = f(X_k, Y_k)} .

A 2-way pseudo-collision is simply called a pseudo-collision.

Definition 3.3.2 Pseudo-Collision Resistance A compression function f is called pseudo-collision resistant, if it is computationally infeasible to find a pseudo-collision.

The aforementioned security reduction is described by the Merkle-Damgård Theorem.

Theorem 3.3.3 Merkle-Damgård Theorem
If h is a hash function based on the compression function f using the MD construction and MD strengthening, then the following statement holds:

f is pseudo-collision resistant ⇒ h is collision resistant.        (3.2)

A proof for this theorem can be found in [9]. In other words, this theorem states that attacking the collision resistance of a hash function with a secure compression func- tion is useless. To successfully attack a hash function an attacker has to show that the underlying compression function is vulnerable according to the security definitions. Consequently, most attacks on hash functions based on the MD construction concentrate on compression functions [9].

3.4 Design of Compression Functions

3.4.1 The MD4 Family

In 1990, shortly after Merkle's and Damgård's results, Ron Rivest published MD4, a fast hash function based on the MD construction [39]. Two years later, Rivest proposed its successor with improved security properties, MD5 [41]. Similarly, in 1993, the NIST presented the Secure Hash Algorithm (SHA, later retroactively called SHA-0) [35], which shares many properties with MD4 [43]. Again two years later, in 1995, the NIST introduced its strengthened successor, SHA-1 [36].

After several important cryptanalytic results [46, 50] and subsequently [23, 29, 42, 51], MD5 is broken and very efficient attacks on its collision resistance have been developed [25, 44]. Indeed, collisions can be found in a few seconds, and it is recommended to abandon the idea of using MD5 in security applications [28]. In 2005, Wang et al. presented new theoretical results on SHA-1, which reduce the attack complexity on collision resistance from O(2^80) to about O(2^69) [49] and a short time later to O(2^63) [7, 47].

MD4, SHA-0 and all their successors share basic properties. They are all members of the so-called MD4-family. In the following we will concentrate on the essential features of such hash functions. MD5 and SHA-1 are analyzed in more detail to exemplify different approaches in hash function design and to support our estimations of Chapter 9.

3.4.2 Hash Functions of the MD4-Family

All hash functions of the MD4-family are based on the MD construction, and their compression functions share common design features. Each compression function f, with block size r, output length n, and initialization vector IV, can be divided into s steps which, in turn, can be grouped into t rounds, each consisting of s/t steps. The compression function comprises two important step- and round-dependent algorithms, one called message expansion (MessExp) and the other called step function or step operation (StepOp). Existing implementations of hash functions in the MD4-family differ in these particular algorithms and their iteration count within a single compression function computation. Message expansion and step operation are applied iteratively or recursively to the current compression function state S. Technically, S is built up of k equally sized working registers

S = Reg_k | ... | Reg_1        (3.3)

which, altogether, have a bit size of |S|_2 = n, with each register being w = n/k bits long. Since most MD4-family hash functions have been developed with fast software execution in mind, w is chosen according to typical processor widths like 32 bit or 64 bit. At the beginning of the i-th evaluation of f, these registers are initialized with the former chaining value cv_{i−1} = cv_{i−1,k} | ... | cv_{i−1,1}. After s steps, the new chaining value cv_i = cv_{i,k} | ... | cv_{i,1} is obtained by adding the registers of f's final state to those of cv_{i−1}. The content of each register Reg_k, ..., Reg_1 depends on the current step number m with 0 ≤ m ≤ s − 1 and can be denoted as R(l, m) with 1 ≤ l ≤ k. Let rl(x, y) denote the result of rotating x by y bit positions to the left. Since for all MD4-family hash functions it holds that

R(l + 1, m + 1) = rl(R(l, m), c_l)  with 0 ≤ l ≤ k − 1,        (3.4)

a whole state is uniquely defined in each step m by the particular value of R(0, m). All remaining values R(l, m) with 1 ≤ l ≤ k − 1 can be deduced from former steps. Therefore, all upcoming register values R_m of a compression function evaluation can simply be referred to with a single parameter m with −k ≤ m ≤ s − 1. For reasons of uniformity, the negative values of m refer to the register values of the initialized state.

The overall computation process within the compression function is as follows. After the padding phase, where M is divided into message blocks M_0, ..., M_{q−1} of bit length r, f(cv_0, M_0) is invoked, loading IV into the first state registers R_{−1}, ..., R_{−k}:

R_{−1} | R_{−2} | ... | R_{−k} = IV .        (3.5)

Applying StepOp and MessExp, the next register values are computed successively until R_{s−1} has been generated. Finally, cv_1 is computed as

cv_1 = (R_{s−1} + R_{−1}) | ... | (R_{s−k} + R_{−k}) .        (3.6)

This process is repeated until all M_i with 0 ≤ i ≤ q − 1 have been processed.

Figure 3.2: The inner structure of a MD4-family compression function

For most functions of the MD4-family, the final hash value h(M) is equal to cv_q. Sometimes, however, as in MD5 and SHA-2 [37], it is computed using a final output function HashOut. If HashOut is invertible, it can be absorbed into the definition of the compression function, resulting in a new compression function f′ = HashOut ∘ f ∘ HashOut⁻¹. With IV adapted accordingly (IV′ = HashOut(IV)), it holds that

h(M) = HashOut(f(f(... f(IV, M_0) ..., M_{q−2}), M_{q−1}))        (3.7)
     = HashOut ∘ f_{M_{q−1}} ∘ f_{M_{q−2}} ∘ ... ∘ f_{M_0}(IV)
     = HashOut ∘ f_{M_{q−1}} ∘ (HashOut⁻¹ ∘ HashOut) ∘ f_{M_{q−2}}
       ∘ (HashOut⁻¹ ∘ HashOut) ∘ f_{M_{q−3}} ∘ ...
       ∘ (HashOut⁻¹ ∘ HashOut) ∘ f_{M_0}(IV)
     = (HashOut ∘ f_{M_{q−1}} ∘ HashOut⁻¹) ∘ (HashOut ∘ f_{M_{q−2}} ∘ HashOut⁻¹) ∘ ...
       ∘ (HashOut ∘ f_{M_1} ∘ HashOut⁻¹) ∘ (HashOut ∘ f_{M_0}(IV))
     = f′_{M_{q−1}} ∘ f′_{M_{q−2}} ∘ ... ∘ f′_{M_0}(IV′) .

In each step i, message-dependent data W_i is processed. MessExp describes how this data is chosen. StepOp, on the other hand, computes the new state register value R_i from the current input values.

The message expansion functions of SHA-1 and MD5 are based on different approaches. In both cases, each M_i is split up into w-bit parts X_0, ..., X_{r/w−1}. In MD5, each W_i is computed using a permutation of the X_j. SHA-1, however, computes the W_i recursively. The states of MD5 and SHA-1 are of different sizes, and the step operations also work on a distinct number of registers. In MD5, the step operation uses the last four register values to compute the current R_i, whereas in SHA-1 five registers are used.

Example: MD5
MD5 and its compression function f output 128-bit (hash) values. The block size of f is 512 bit. The algorithm is divided into s = 64 steps, which are grouped into t = 4 rounds. The current state of MD5 in step i, S_i, consists of four 32-bit registers (w = 32):

S_i = R_i | R_{i−1} | R_{i−2} | R_{i−3} .

Furthermore, IV = R_{−1} | R_{−2} | R_{−3} | R_{−4} equals

S_{−1} = 0xefcdab89 | 0x98badcfe | 0x10325476 | 0x67452301,

 i   K_i          K_{i+1}      K_{i+2}      K_{i+3}
 0   0xd76aa478   0xe8c7b756   0x242070db   0xc1bdceee
 4   0xf57c0faf   0x4787c62a   0xa8304613   0xfd469501
 8   0x698098d8   0x8b44f7af   0xffff5bb1   0x895cd7be
12   0x6b901122   0xfd987193   0xa679438e   0x49b40821
16   0xf61e2562   0xc040b340   0x265e5a51   0xe9b6c7aa
20   0xd62f105d   0x02441453   0xd8a1e681   0xe7d3fbc8
24   0x21e1cde6   0xc33707d6   0xf4d50d87   0x455a14ed
28   0xa9e3e905   0xfcefa3f8   0x676f02d9   0x8d2a4c8a
32   0xfffa3942   0x8771f681   0x6d9d6122   0xfde5380c
36   0xa4beea44   0x4bdecfa9   0xf6bb4b60   0xbebfbc70
40   0x289b7ec6   0xeaa127fa   0xd4ef3085   0x04881d05
44   0xd9d4d039   0xe6db99e5   0x1fa27cf8   0xc4ac5665
48   0xf4292244   0x432aff97   0xab9423a7   0xfc93a039
52   0x655b59c3   0x8f0ccc92   0xffeff47d   0x85845dd1
56   0x6fa87e4f   0xfe2ce6e0   0xa3014314   0x4e0811a1
60   0xf7537e82   0xbd3af235   0x2ad7d2bb   0xeb86d391

Table 3.1: MD5's addition constants

 i            s_{4i}   s_{4i+1}   s_{4i+2}   s_{4i+3}
 0, ..., 3      7        12         17         22
 4, ..., 7      5         9         14         20
 8, ..., 11     4        11         16         23
12, ..., 15     6        10         15         21

Table 3.2: MD5's rotation constants

Figure 3.3: The step function of MD5

 i            b_i(R_{i−1}, R_{i−2}, R_{i−3})
 0, ..., 15   (R_{i−1} ∧ R_{i−2}) ∨ (¬R_{i−1} ∧ R_{i−3})
16, ..., 31   (R_{i−3} ∧ R_{i−1}) ∨ (¬R_{i−3} ∧ R_{i−2})
32, ..., 47   R_{i−1} ⊕ R_{i−2} ⊕ R_{i−3}
48, ..., 63   R_{i−2} ⊕ (R_{i−1} ∨ ¬R_{i−3})

Table 3.3: MD5's non-linear round functions

 i            b_i(x, y, z)                         K_i
 0, ..., 19   (x ∧ y) ∨ (¬x ∧ z)                   0x5a827999
20, ..., 39   x ⊕ y ⊕ z                            0x6ed9eba1
40, ..., 59   (x ∧ y) ⊕ (x ∧ z) ⊕ (y ∧ z)          0x8f1bbcdc
60, ..., 79   x ⊕ y ⊕ z                            0xca62c1d6

Table 3.4: SHA-1's non-linear round functions

where 0x denotes hexadecimal notation. The compression function of MD5 is defined as

StepOp :

Ri = Ri−1 + rl((Ri−4 + bi(Ri−1,Ri−2,Ri−3) + Wi + Ki), si)

MessExp : s W = X with 0 ≤ k ≤ t − 1, 0 ≤ j ≤ − 1 kt+j σ(k,j) t

HashOut :

h(M) = cvq,2|cvq,3|cvq,4|cvq,1

where σ(k, j) is a round-dependent function (k denoting the round):

σ(0, j) = j
σ(1, j) = 5j + 1 mod 16
σ(2, j) = 3j + 5 mod 16
σ(3, j) = 7j mod 16 .
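Putting StepOp and the σ functions together, one MD5 step might be sketched in C as follows. This is an illustration of the thesis's notation, not a complete MD5 implementation; the constants K[i] and s[i] of Tables 3.1 and 3.2 are assumed to be passed in by the caller, and b() implements the standard MD5 round functions of Table 3.3.

```c
#include <stdint.h>

static uint32_t rl(uint32_t x, unsigned n) {        /* rotate left by n (0 < n < 32) */
    return (x << n) | (x >> (32 - n));
}

/* Round-dependent boolean function b_i of Table 3.3 (standard MD5). */
static uint32_t b(int i, uint32_t r1, uint32_t r2, uint32_t r3) {
    if (i < 16) return (r1 & r2) | (~r1 & r3);
    if (i < 32) return (r3 & r1) | (~r3 & r2);
    if (i < 48) return r1 ^ r2 ^ r3;
    return r2 ^ (r1 | ~r3);
}

/* Round-dependent message index sigma(k, j). */
static int sigma(int k, int j) {
    switch (k) {
        case 0:  return j;
        case 1:  return (5 * j + 1) % 16;
        case 2:  return (3 * j + 5) % 16;
        default: return (7 * j) % 16;
    }
}

/* StepOp: R_i = R_{i-1} + rl(R_{i-4} + b_i(R_{i-1},R_{i-2},R_{i-3}) + W_i + K_i, s_i) */
static uint32_t step(int i, uint32_t r1, uint32_t r2, uint32_t r3,
                     uint32_t r4, uint32_t w, uint32_t k, unsigned s) {
    return r1 + rl(r4 + b(i, r1, r2, r3) + w + k, s);
}
```

Note how all arithmetic is modulo 2^32, which C's `uint32_t` provides for free; this property is exploited later when the step operation is inverted (Section 7.1.2).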

Ki and si are step-dependent constants (see Table 3.1 and Table 3.2). bi is a bitwise-defined and round-dependent boolean function with three inputs, see Table 3.3.

Example: SHA-1

SHA-1 and its compression function f output 160-bit (hash) values. The block size of function f is 512 bit. f consists of s = 80 steps grouped into t = 4 rounds. The current state of SHA-1 in step i, Si, consists of five 32-bit registers (w = 32):

Si = Ri|Ri−1|rl(Ri−2, 30)|rl(Ri−3, 30)|rl(Ri−4, 30) .

IV = R−1|R−2|R−3|R−4|R−5 equals:

S−1 = 0x67452301|0xefcdab89|0x62eb73fa|0x40c951d8|0x0f4b87c3 .

The compression function of SHA-1 is defined as

StepOp :

Ri = rl(Ri−1, 5) + bi(Ri−2, rl(Ri−3, 30), rl(Ri−4, 30)) + rl(Ri−5, 30) + Wi + Ki

MessExp :

Wi = Xi for 0 ≤ i ≤ 15
Wi = rl(Wi−3 ⊕ Wi−8 ⊕ Wi−14 ⊕ Wi−16, 1) for 16 ≤ i ≤ 79

HashOut :

h(M) = cvq,5|cvq,4|cvq,3|cvq,2|cvq,1

= cvq .

Ki are round-dependent constants as defined in Table 3.4. bi is again a bitwise-defined and round-dependent boolean function.
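SHA-1's message expansion and step operation can be sketched analogously. Again this is an illustration of the notation rather than a full SHA-1 implementation; b() and K follow Table 3.4, and the recursive message expansion includes the 1-bit rotation introduced with SHA-1.

```c
#include <stdint.h>

static uint32_t rl(uint32_t x, unsigned n) {        /* rotate left by n (0 < n < 32) */
    return (x << n) | (x >> (32 - n));
}

/* Round-dependent boolean functions of Table 3.4. */
static uint32_t b(int i, uint32_t r1, uint32_t r2, uint32_t r3) {
    if (i < 20) return (r1 & r2) | (~r1 & r3);
    if (i < 40) return r1 ^ r2 ^ r3;
    if (i < 60) return (r1 & r2) ^ (r1 & r3) ^ (r2 & r3);
    return r1 ^ r2 ^ r3;
}

/* Round-dependent constants of Table 3.4. */
static const uint32_t K[4] = {
    0x5a827999u, 0x6ed9eba1u, 0x8f1bbcdcu, 0xca62c1d6u
};

/* MessExp: W_i = X_i for i < 16, recursively expanded with a 1-bit rotation after. */
static void mess_exp(const uint32_t X[16], uint32_t W[80]) {
    for (int i = 0; i < 16; i++)
        W[i] = X[i];
    for (int i = 16; i < 80; i++)
        W[i] = rl(W[i-3] ^ W[i-8] ^ W[i-14] ^ W[i-16], 1);
}

/* StepOp: R_i = rl(R_{i-1},5) + b_i(R_{i-2}, rl(R_{i-3},30), rl(R_{i-4},30))
 *               + rl(R_{i-5},30) + W_i + K_i */
static uint32_t step(int i, uint32_t r1, uint32_t r2, uint32_t r3,
                     uint32_t r4, uint32_t r5, uint32_t w) {
    return rl(r1, 5) + b(i, r2, rl(r3, 30), rl(r4, 30))
         + rl(r5, 30) + w + K[i / 20];
}
```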

Figure 3.4: The step function of SHA-1

4 Application of Cryptographic Hash Functions: Digital Signatures

4.1 Introduction

Hash functions are used in various cryptographic applications, particularly asymmetric protocols. A successful attack on a certain hash function primitive usually also has serious security consequences for all cryptographic systems that use this function [26, 27]. A very popular application of cryptographic hash functions is digital signature schemes. Digital signature schemes are of great interest in electronic commerce as a substitute for handwritten signatures and thus greatly speed up and automate business processes. Cryptographic hash functions can remarkably accelerate the generation and verification of digital signatures and hence are used as an essential building block in digital signature schemes. This section describes how the aforementioned security properties of cryptographic hash functions help to guarantee digital signature security. Furthermore, it shows how successful attacks on hash functions threaten the overall security of digital signature schemes.

4.2 Overview

A digital signature scheme is an asymmetric cryptographic scheme used to provide messages with personalized signatures that serve to prove the authenticity of the message and the sender. Digital signatures can be used as an equivalent alternative to real-world signatures. This means that they share important security-relevant properties with handwritten signatures. Signing a document is an act of confirmation of the content of a document or a commitment to behave according to it. Such a statement is supposed to be (legally) binding for all parties. Among other features, it should not be possible for anyone to successfully revoke a signature or to deny its origin. This property is called non-repudiation [43]. If someone signs a document, he is bound to stick to his decision. If he deviates from it, any other party can, if necessary, sue him.

Like handwritten signatures, digitally signing a document can have legal consequences [11, 12, 13]. Therefore, it is of utmost importance to have reliable means for finding out whether a document has actually been signed by someone or not. Otherwise, an attacker would be able to successfully forge a signature and could make the apparent signer responsible for something he did not willingly confirm.

4.3 Digital Signature Schemes

A digital signature is a data string which associates a message with some originating entity. A digital signature scheme consists of a pair of algorithms: a signature generation algorithm to create digital signatures and a verification algorithm to verify that a digital signature is authentic. Digital signature schemes can further be divided into schemes with message recovery and schemes with appendix, the latter being more popular [31].

4.4 Digital Signature Schemes with Appendix

Digital signature schemes are strongly based on asymmetric cryptography. Compared to symmetric algorithms, applying asymmetric algorithms (to long messages) is very slow and inefficient. One way to speed up the signing and verification process is to use schemes with appendix. In digital signature schemes with appendix, the signature generation algorithm is applied to the hash value of the original document instead of the document itself. This obviously requires an additional hashing step, but hashing is comparably fast. Expensive asymmetric operations are still necessary, but they are now applied to operands (the hash values) no longer than the output length of the underlying hash function. This greatly reduces computational costs. In contrast to schemes with message recovery, the verification algorithm does not only take the digital signature as input but also requires the original message. The original message is hashed, and the digital signature is verified with respect to this hash value.
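The hash-then-sign data flow can be sketched as follows. This is a deliberately toy example: `toy_hash` and `toy_sign` are made-up stand-ins with no cryptographic value (the toy "keys" are even symmetric, unlike a real scheme); the sketch only shows that signing and verification operate on h(M), never on M itself.

```c
#include <stdint.h>
#include <string.h>

/* Toy stand-in for the hash function h (FNV-1a-style mixing, NOT secure). */
static uint32_t toy_hash(const char *msg) {
    uint32_t h = 0x811c9dc5u;
    for (size_t i = 0; i < strlen(msg); i++)
        h = (h ^ (uint8_t)msg[i]) * 16777619u;
    return h;
}

/* Toy stand-ins for the asymmetric operations (here sk == pk, unlike reality). */
static uint32_t toy_sign(uint32_t sk, uint32_t digest)  { return digest ^ sk; }
static int toy_verify(uint32_t pk, uint32_t digest, uint32_t sig) {
    return (sig ^ pk) == digest;
}

/* Scheme with appendix: the expensive operation sees only the short digest. */
static uint32_t sign_with_appendix(uint32_t sk, const char *msg) {
    return toy_sign(sk, toy_hash(msg));
}

static int verify_with_appendix(uint32_t pk, const char *msg, uint32_t sig) {
    return toy_verify(pk, toy_hash(msg), sig);
}
```

The verification step re-hashes the presented message, which is exactly why a second message with the same hash value verifies successfully (Section 4.5).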

4.5 Weak Hash Functions and Digital Signature Schemes

4.5.1 Preimage Resistance

Again, this speedup technique is not without risk. Each message that hashes to the same hash value can later be exchanged with the original message and yields a positive verification result as well. This attack on digital signature schemes directly concerns second preimage resistance (of the underlying cryptographic hash function), which is comparably hard to attack in practice.

4.5.2 Collision Resistance

Using an insecure hash function makes signing documents which have been composed by someone else very risky. Several authors showed why it is necessary that the underlying hash function is not only second preimage resistant but also collision-resistant [10, 15]. Collision resistance means, informally speaking, that it is computationally extremely costly to find a (random) collision. Finding random collisions is significantly easier than finding collisions for a prescribed message (second preimage resistance). This is due to the birthday attack, see Section 5.1. At first glance, a random collision is of very limited use in digital signature schemes. It is very unlikely to find a random collision (X, X′) which, at the same time, consists of meaningful messages. Moreover, there are usually further requirements on their content: in the first step of an attack, Oscar, the attacker, has to make Alice, the victim, sign a message X. Therefore, X should have advantageous and seducing content for Alice. Furthermore, the second message X′ has to guarantee worthwhile advantages for Oscar when it is signed by Alice. The forgery, exchanging the first message with the second one, should pay off for Oscar. These additional requirements make it very unlikely to find adequate (random) collisions. However, if it were possible to construct two colliding messages with arbitrary contents from random collisions, this would mean a real threat to digital signature schemes.

4.6 Poisoned Message Attack

4.6.1 Basic Attack

Daum and Lucks describe in [10] a technique to construct meaningful messages from random collisions of an MD-based cryptographic hash function h with compression function f. It is called the poisoned message attack and can be applied to all documents which are composed in document languages that support conditional branches (’if-then-else’-like constructions). This is also true for the very popular document formatting language PostScript. In PostScript the corresponding command [18] is

(S1)(S2) eq (T1)(T2) ifelse

with the meaning that the program code T1 is executed if S1 = S2 and T2 otherwise. If Oscar can compute random collisions for a given initialization vector, i.e. a given

Message:  m1 | m2 | m3
Content:  If ( cont(m2) == cont(m2) ) then {Text1} else {Text2}

Table 4.1: Message 1 in Poisoned Message Attack

Message:  m1 | m2′ | m3
Content:  If ( cont(m2) == cont(m2′) ) then {Text1} else {Text2}

Table 4.2: Message 2 in Poisoned Message Attack

input chaining value, he can easily betray Alice and build up two messages M, M′ that share the same hash value but display completely different but sensible contents. This attack is concerned with collision resistance. Oscar prepares two messages M = (m1|m2|m3) and M′ = (m1|m2′|m3), such that they differ in just one message block. The contents of these messages are exemplified in the C-like pseudocode shown in Table 4.1 and Table 4.2. Let cont(x) define the (possibly senseless) content of a message x in the appropriate formatting language. To generate such messages, Oscar places the first string of a simple ’if-then-else’ statement into m1 and then uses padding bits to expand it to the block size of the compression function. Then, he computes the first chaining value f(IV, m1) = CV1. In the next step, he generates an arbitrary collision m2, m2′, such that f(CV1, m2) = f(CV1, m2′). Finally, he chooses the content of m3. The idea is to complete the conditional branch in such a way that the display algorithm branches into Text1 if the second block equals m2 and into Text2 otherwise. Exchanging m2 with m2′, Oscar is now able to construct two messages M and M′ that either display Text1 or Text2 in the appropriate document reader, and, at the same time, map to the same hash value. If Oscar gives such a prepared message to Alice for signing, Alice displays its content in her document reader and cannot find any evidence of malicious manipulation. Here, the display algorithm computes the ’then’-part of the document. After signing the document, Oscar just exchanges the first colliding message with the second one. The display algorithm now branches into the ’else’-part and displays a totally different content. The hash values for both messages are equal; consequently, the signature is valid for both of them. So Oscar finally obtains a valid signature on content of his choosing.
This attack can be made useless by generally checking all formatting commands used in the original source code of any vulnerable document. It is important to distinguish harmless ’if-then-else’ commands from malicious ones. However, in larger documents, this task can turn out to be costly. Another way is to check the document’s size. A document which can display two different contents is nearly twice as big as a document which displays only a single content.

4.6.2 Extensions

Gebhardt, Illies and Schindler showed in [15] that similar attacks can be applied to files of other popular file formats like PDF [19], TIFF [17], and MS Word 97. In many cases they additionally make use of techniques for hiding text by adapting its color to the background since, in contrast to PostScript, the corresponding formatting language does not comprise explicit ’if-then-else’ commands. They call collisions which can be used to easily construct meaningful format file collisions with (almost) arbitrary predefined message meaning universal collisions. Of course, binary and archival file formats are also vulnerable to the poisoned message attack, since they are inherently equipped with ’if-then-else’-like constructions. Mikle created a tool which constructs two zip-files that map to the same MD5 hash value [34]. The content of both zip-files can be chosen almost arbitrarily. The poisoned message attack clearly shows why not only second preimage resistance but also collision resistance is an essential security property of cryptographic hash functions. At first sight, the attack using poisoned messages and the general attack against second preimage resistance look very similar. Both generate two meaningful, colliding messages which can be exchanged in digital signatures. The poisoned message attack additionally depends on certain features of the employed file formats. However, attacks on collision resistance in general require much less effort than attacks on second preimage resistance, see Section 5.1.

5 Attacks on Cryptographic Hash Functions

Attacks on cryptographic hash functions can be divided into generic and specific attacks. Specific attacks make use of specific properties of a given hash function. Generic attacks can be applied to all cryptographic hash functions. Cryptographic hash functions which are only vulnerable to generic attacks are called ideal (cryptographic) hash functions.

5.1 Generic Attacks

5.1.1 Introduction

The distinct security properties of Definition 2.2.4 show different computation complexities for a successful attack. Given a cryptographic hash function h that maps to hash values of size n, the attack complexities for preimage and second preimage attacks are both O(2^n). This does not hold for collision resistance attacks. The necessary attack complexity to find a 2-way collision is O(2^(n/2)). These results can be generalized [9]:

Fact 5.1.1 Attack Complexity
For an ideal cryptographic hash function with a hash value of bit length n, finding a k-way collision requires about 2^((k−1)n/k) hash computations, while finding a k-way second preimage requires about k · 2^n hash computations.

This result is due to the so-called birthday attack [52]. The birthday attack basically exploits a probabilistic result that is commonly known as the birthday paradox or the birthday collision.

5.1.2 Birthday Collision

Corollary 5.1.2 Birthday Collision
Suppose that F : X → Y is a random function where Y is a set of n distinct values. Then, one expects a collision in about √n evaluations of F.

For finding colliding messages, F is simply substituted with an (ideal) cryptographic hash function h, thus assuming pseudo-random properties of h. If the output values of h are not distributed uniformly, collisions are found even earlier [2]. The setting can simply be modeled as a probabilistic experiment: the probability to find two messages (and corresponding hash values) in a sample of k messages (drawn randomly with replacement) that map to the same hash value. A proof of Corollary 5.1.2 can be found in [3]. Please note that for non-ideal, iterated hash functions the situation is even worse [21, 22].
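The √n estimate is easy to reproduce empirically. The following sketch draws 16-bit "hash values" (n = 2^16, so √n = 256) from a simple linear congruential generator until a value repeats; the LCG makes the experiment deterministic. The generator and constants are illustrative choices, not part of any attack.

```c
#include <stdint.h>
#include <string.h>

#define N (1u << 16)   /* size of the "hash value" set */

/* Deterministic pseudo-random source (numerical-recipes-style LCG). */
static uint32_t lcg_state = 12345u;
static uint32_t lcg(void) {
    lcg_state = lcg_state * 1664525u + 1013904223u;
    return lcg_state;
}

/* Draw 16-bit values until one repeats; return the number of draws. */
static unsigned draws_until_collision(void) {
    static uint8_t seen[N];
    memset(seen, 0, sizeof seen);
    unsigned draws = 0;
    for (;;) {
        uint32_t v = lcg() >> 16;   /* take the high 16 bits */
        draws++;
        if (seen[v]) return draws;
        seen[v] = 1;
    }
}

/* Average the experiment over several trials. */
static double average_draws(int trials) {
    unsigned long total = 0;
    for (int t = 0; t < trials; t++)
        total += draws_until_collision();
    return (double)total / trials;
}
```

Averaged over a few hundred trials, the result lands in the low hundreds, i.e. on the order of √n rather than n, which is exactly the gap between collision and (second) preimage attacks.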

5.2 Specific Attacks

Collisions can be found in different ways. The birthday collision and Fact 5.1.1 give information about the effort to find collisions by chance. More efficient attacks try to exploit knowledge of the inner structure of the hash function and its inherent weaknesses. In this way, it is possible to construct collisions to a certain extent. Nevertheless, most collision search algorithms make use of a probabilistic search algorithm as well. In general, one can think of many different attack approaches since hash functions may differ in the number and severity of their weaknesses. Most of these attacks are only useful for attacking just one or very few particular hash functions. Attacks can be grouped according to their basic ideas. One idea, which has in the past proven very productive (from a cryptanalytic point of view), is to examine the difference propagation in compression functions. The corresponding attacks are referred to as differential attacks.

5.3 Differential Attacks

5.3.1 Introduction

Differential attacks have become very widespread as a standard tool for analyzing the security properties of block ciphers, for instance [5, 6]. Moreover, they can be efficiently used in the analysis of hash functions, in particular to attack their collision resistance. Hash functions of the MD4-family use interleaved operations like modular additions, bit rotations and boolean functions, iteratively applied to the state of the hash function [7]. To better understand which impact the structure of the hash function has on the final hash value, it is useful to expand the iteration steps, with one equation for each step, and to follow the computation path, observing how the state registers alter. Then, changes in the computation paths of two different inputs can be correlated to each other to observe in which steps similar or distinct values occur [9].

5.3.2 Differences and Difference Pattern

5.3.2.1 Finding Collisions Using Fixed Message Differences

One way to find a collision is to check message words with a certain difference. This difference is chosen with respect to a specific difference operation, see Section 5.3.2.2. In the next step, the compression function is invoked for both messages, and it is observed how their register values differ from each other in each computation step. The result is a list of differences which describes how the (input) differences of the two message blocks map to the (output) differences of the step registers. This list is called a difference pattern.

5.3.2.2 Difference Operation

The difference operation is based on an operation which usually occurs in the computations of the compression function, like addition in the ring Z_(2^32) or in the vector space F_2^32 (XOR). From the results of such operations the original input values cannot be uniquely recovered, see [7]. For attacking purposes, though, such information might be useful [8]. Therefore, several attacks use extensions of the above examples. These extensions are chosen not only to convey (pure) difference information but also to give valuable hints on the original operands, while at the same time good propagation properties are maintained (see Section 5.3.2.4). However, the applied operation is suitable for a certain attack and may be of limited worth in other attack scenarios.
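The two basic difference operations behave quite differently under the arithmetic of the compression function, which is why the choice matters. A small sketch: modular differences in Z_(2^32) are preserved when the same value is added to both operands, whereas XOR differences can change through carry effects.

```c
#include <stdint.h>

/* XOR difference in the vector space F_2^32. */
static uint32_t xor_diff(uint32_t a, uint32_t b) { return a ^ b; }

/* Modular difference in the ring Z_(2^32); uint32_t wraps mod 2^32. */
static uint32_t mod_diff(uint32_t a, uint32_t b) { return a - b; }
```

For example, with a = 0x7fffffff and b = 0x7ffffffe both difference operations give 1, but after adding 1 to both operands the XOR difference becomes 0xffffffff (a carry rippled through all bit positions) while the modular difference is still 1.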

5.3.2.3 Differences and Collisions

Collisions can be defined appropriately using differences. Obviously, it is not necessary to assume concrete input or output values. This leaves much freedom for the particular choice of the final values and avoids imposing unnecessary restrictions. It is very useful to preserve as many degrees of message freedom as possible, since they can later be exploited to fulfill additional requirements which, for example, help to accelerate collision search (Section 5.3.5). Having found a difference pattern which results in a zero difference of the last register values, the only task is to find two messages, with the prescribed input difference, which actually stick to the difference pattern in the computation process.

5.3.2.4 Choice of Difference Pattern

The particular choice of the difference pattern depends on the likelihood of efficiently finding messages which stick to it. It is chosen in such a way that an initial input difference is likely to propagate through all steps, possibly with changing differences on the path, and is finally eliminated in the last steps. To estimate the relationship between input and output differences reasonably, the compression function is approximated linearly with respect to the difference operation. This linear approximation is very close to the original function when non-linear effects are minimized. Such effects may consist of carries from modular additions or subtractions, dedicated non-linear boolean functions or bit rotations. Whenever there is a difference between two values which enter a non-linear part, the resulting difference can in general only be estimated. The probability for such situations to occur is not only dependent on the message input difference but also on the actual state values. Of course, there are input values for which the linear approximation does not yield the same output values as the original function. In collision search algorithms the remaining degrees of message freedom are therefore used to additionally impose conditions on the state register values. They are chosen such that several non-linear effects can be minimized or fully excluded.

5.3.3 Differential Attacks on MD4-family Hash Functions

5.3.3.1 Pseudo-Collisions and the MD Theorem

When considering collision attacks on MD4-family hash functions, the first step toward practically computing collisions is to efficiently find two message words and two chaining values which map to the same output chaining value of the underlying compression function. This would yield a pseudo-collision of the compression function and in the first place pave the way for finding a collision of the hash function. However, pseudo-collisions of the compression function do not automatically convert to hash function collisions. Up to now, there is no transform known to practically construct hash function collisions from general pseudo-collisions of the underlying compression function, although they definitely remove the theoretical obstacle of Theorem 3.3.3.

5.3.3.2 Practical Collision Construction

A practical way to construct collisions is to find pseudo-collisions which are, at the same time, also (true) collisions of the compression function. This means finding pseudo-collision pairs which share a common input chaining value. Preferably, this chaining value is equal to the initialization vector of the hash function. Having just one such pair m and m′, all messages m|M and m′|M, with M being an arbitrary message, would result in the same hash value.

f(IV, m) = f(IV, m′)
⇒ f(f(IV, m), M) = f(f(IV, m′), M)
⇒ h(m|M) = h(m′|M)

After all, this approach simply reduces the problem of finding a hash function collision to the problem of finding a collision for the compression function. The effort for random collision search is nevertheless the same, since the output length of the compression function equals the size of the hash value, see Section 5.1.2.
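The extension property above is purely a consequence of the MD iteration and can be demonstrated with any lossy compression function. The sketch below uses a deliberately weak toy compression function (a 16-bit chaining value, so collisions are trivial by construction: message words differing only above bit 15 collide); it is not MD5 or SHA-1, only an illustration of the chaining argument.

```c
#include <stdint.h>

/* Toy compression function with a 16-bit chaining value (lossy on purpose). */
static uint32_t f(uint32_t cv, uint32_t m) {
    uint32_t x = (cv + m) & 0xffffu;   /* truncation makes collisions easy */
    x ^= x >> 7;
    return (x * 0x9e37u) & 0xffffu;
}

/* MD-style iteration of f over a block sequence (padding omitted). */
static uint32_t toy_hash(uint32_t iv, const uint32_t *blocks, int n) {
    uint32_t cv = iv;
    for (int i = 0; i < n; i++)
        cv = f(cv, blocks[i]);
    return cv;
}
```

Because the first blocks collide under f for the common IV, appending the identical suffix blocks keeps every subsequent chaining value equal, so the full hashes collide as well.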

5.3.3.3 Multi-Block Collisions

Besides searching for compression function collisions, there is another way to construct hash function collisions, too. Multi-block collisions result in an output difference not equal to zero after the first invocation of the compression function. They use equal input chaining values as well, but do not require a collision after the first function evaluation. The resulting difference is rather corrected using a pseudo-collision which maps to a zero output difference after the second compression function evaluation. Evidently, the output difference of the first invocation and the pseudo-collision have to be adapted appropriately to each other. The output difference after the first invocation is typically very small. In such cases, the two input message words constitute a so-called near-collision [4, 9].

5.3.3.4 Message Conditions vs. State Register Conditions

The computations within each compression function are, given a fixed input chaining value, exclusively dependent on the message input. Therefore, conditions imposed on the state (or step) registers actually mean conditions on the input message. Nevertheless, state register conditions are very clear and easy to verify on computers. Specifically, they are very helpful to define conditions on particular register values. In collision search algorithms, difference patterns have to be transformed into conditions on the computation path that can easily be checked by computers.

5.3.4 Finding Difference Patterns

Finding suitable difference patterns is the most difficult and decisive task within collision attacks on dedicated hash functions [16, 30]. There are few approaches which have led to successful attacks. Daum gives an overview of some of them in [9]. The application field of each of these methods is mostly restricted to just a single or only a few hash functions of the MD4 family. Usually, these techniques cannot be fully used on other hash functions, but often single parts can be transferred. According to their originators, most difference patterns have been found predominantly manually [7, 9]. Cannière and Rechberger have recently proposed an automated approach in [7]. Recent designs of difference patterns even consider more advanced techniques to accelerate collision search [49], as presented in [4, 48, 50] and in Section 5.3.5. It is very difficult to estimate what consequences an induced message difference has after several non-linear step operations. In the linear model, the impact of input differences can be observed very well. Designers try to find so-called differential characteristics, i.e. consecutive steps in which the linear model and the original function behave very similarly with respect to difference propagation. The quality of such characteristics is mainly determined by the number of steps covered and the probability that the differences appearing in the linearized model also occur in the original function, when given an appropriate, concrete message pair. Conditions are set in the difference pattern such that they avoid deviation from the linearized model at critical steps. Unfortunately, from an attacker’s perspective, differential characteristics with good propagation do not automatically have zero differences at their output. When designing difference patterns, this obstacle can be removed by considering multi-block collisions, see Section 5.3.3.3.
Moreover, they possibly have a special difference at their input as well. The solution to this problem is to find conditions which map a message input difference to the input difference of such a good characteristic. This task is not trivial and accounts for the majority of conditions in the difference pattern. As a matter of fact, the corresponding conditions all occur at the beginning of the difference pattern. Cannière and Rechberger [7] refer to them as NL-characteristics since they are not based on a linearized model. The high-quality characteristics spanning several steps in the following parts of the hash functions are called L-characteristics to emphasize that they have been derived from the linear model. From a computational point of view such a division is very advantageous for collision search. The earlier a condition can be verified in the collision search, the fewer computations are wasted if it does not hold. Consequently, it is very useful to have as many conditions as possible evaluated very early. To practically exploit appropriate characteristics, bit conditions on step registers and message words are determined. Some of the conditions in the difference pattern are used to minimize or exclude non-linear effects of the original function. This should help to avoid deviations from the linearized difference path. Other conditions simply constitute a choice between several alternative output differences which may occur after a non-linear operation. There are also conditions in a difference pattern which account for the acceleration techniques from Section 5.3.5.

5.3.5 Concrete Message Search and Acceleration Techniques

5.3.5.1 Introduction

With a difference pattern at hand, the search for appropriate messages can begin. A first naive approach would be to choose two messages at random and just test whether all conditions of the difference pattern are satisfied. If not, a new message pair is simply chosen and tested. This is very inefficient. Usually, two messages with the prescribed input difference are chosen at random, which at least fulfills the input conditions. Just dropping them and trying new ones if one condition is found to be violated is very inefficient, too. If some of the conditions on the computation path are not satisfied, the influencing message words can sometimes be altered suitably. This is only possible up to a certain number of steps, since all subsequent modifications would definitely influence earlier, already satisfied conditions. Klima [25] calls this step number the point of verification: all following conditions can only be verified; active manipulations of the message words are not sensible anymore. If, after the point of verification, one or more conditions are not fulfilled, the messages are dropped and new ones are chosen. There are several ways to adapt or ”correct” the messages accordingly on their way to the point of verification. All methods which deal with manipulating messages such that they fulfill the conditions of the first round are called single step modifications. Methods which help to fulfill conditions after the first round up to the point of verification are more complex and called multi message modifications. Furthermore, there is also another acceleration technique in the literature, see [25], called tunneling. This method aims at increasing the number of messages which fulfill all conditions up to the point of verification, once just one such message is found. All three techniques exploit the remaining degrees of freedom in the choice of the particular messages.

5.3.5.2 Single Step Modification

These modifications are confined to the first round [50]. They consist of choosing the step registers for the first round at random and altering single bits such that they fulfill all imposed conditions. Then, all message words are computed using the inverse step operation, see Section 7.1.2.
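Working backwards from chosen register values is possible because an MD5-like step equation can be solved for the message word. A sketch, assuming the thesis's MD5 step notation (the boolean-function value bval and the constants K, s are taken as given inputs here):

```c
#include <stdint.h>

static uint32_t rl(uint32_t x, unsigned n) { return (x << n) | (x >> (32 - n)); }
static uint32_t rr(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }

/* Forward StepOp: R_i = R_{i-1} + rl(R_{i-4} + b + W + K, s) */
static uint32_t step_fwd(uint32_t r1, uint32_t r4, uint32_t bval,
                         uint32_t w, uint32_t k, unsigned s) {
    return r1 + rl(r4 + bval + w + k, s);
}

/* Reverse step operation, solved for the message word:
 * W = rr(R_i - R_{i-1}, s) - R_{i-4} - b - K   (all arithmetic mod 2^32) */
static uint32_t step_inv(uint32_t ri, uint32_t r1, uint32_t r4,
                         uint32_t bval, uint32_t k, unsigned s) {
    return rr(ri - r1, s) - r4 - bval - k;
}
```

Since every operation in the step (modular addition, rotation) is invertible once the register values are fixed, the round-trip is exact: freely chosen first-round registers uniquely determine the message words that produce them.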

5.3.5.3 Multi Step Modification

Multi message modifications are used to meet conditions in the steps after the first round, while preserving all conditions fulfilled so far [50, 23, 51, 42, 29]. There are several ways to achieve this. They all have in common, that the bits constituting the target condition are manipulated indirectly. Multi-message modifications help to fulfill all conditions up to the point of verification. 34 Attacks on Cryptographic Hash Functions

5.3.5.4 Tunneling

Tunneling is another way to accelerate collision search. Given a message pair which fulfills all conditions up to the point of verification, it is possible to find other such message pairs very efficiently. It is not necessary to repeat all computation steps of the difference pattern for these message pairs. A new message pair is chosen in such a way that it automatically matches the difference pattern up to the point of verification. This method exploits the remaining degrees of freedom in the concrete message choice and accelerates collision search exponentially. Essentially, it uses certain bits in the computation path which can be altered such that all previously prescribed conditions remain satisfied or can easily be corrected. In a deterministic tunnel, each such bit doubles the number of found messages which fulfill all conditions up to the point of verification. A tunnel consists of one or more of these bits that can be altered similarly using indirect manipulations. The strength n of a tunnel indicates that using it multiplies the number of messages by 2^n. Besides deterministic tunnels, Klima also describes so-called probabilistic tunnels. The particular bits of these tunnels yield a new message pair only with a certain probability. Their success depends on other (non-linear) effects in the computation path. Similar to multi-message modification, tunneling manipulates messages indirectly. The aim is to gain as many promising messages as possible from a single initial pair of messages. These messages then have to be verified against the remaining conditions of the difference pattern. Message pairs found this way are more likely to meet all necessary conditions than random pairs, and the computational effort to find them using tunneling is comparably small.

6 Collision Search Algorithm

6.1 Introduction

For MD5 there exist several efficient collision search algorithms [20, 25, 44]. Joščák [20] compares their performance in more detail, where Klima’s approach [25] turns out to be the fastest. In contrast to the other ones, this algorithm extensively makes use of tunneling, see Section 5.3.5. In this work, we use the original C source code from [24]. It is based on the attack of [50] and uses the condition set of [29]. As such, it attempts to find multi-block collisions, see Section 5.3.3.3. The program is optimized for performance. For the analysis of minimal memory requirements, and thus as a preparation for our final assembler code (see Section 7.3.1 and Section 8.2), we developed another C program which tries to use as few memory words as possible. For ease of reference, we call Klima’s program CS1 (Collision Search) and our contribution CS2. When we address the actual algorithm and not a particular C implementation, we refer to it as CS.

6.2 Algorithm Structure

6.2.1 Introduction

CS is divided into two parts. The first one searches for a near-collision of the compression function, given the standard initialization vector, see Example 3.4.2. The second part searches for an appropriate pseudo-collision. To randomize message generation, the algorithm uses a pseudo-random number generator that is fed with an initial seed at startup.

6.2.2 Logical Structure

The first part begins with single message modification methods. CS therefore starts with generating random numbers for the step registers of the first round. Then, these numbers are adapted in such a way that they comply with the conditions of the differential path. Using the reverse step operation (Subsection 7.1.2), CS subsequently computes the corresponding message words.

Tunnel   Original Name   Strength
Θ1       'Q10'           2
Θ2       'Q20'           > 3
Θ3       'Q13'           > 10
Θ4       'Q14'           8
Θ5       'Q4'            1
Θ6       'Q9'            3
Θ7       'cq4'           6
Θ8       'cq9'           8

Table 6.1: Tunnels

In the following very simple multi-message part, computations are applied to satisfy the remaining conditions up to the point of verification. According to Klima [25] this point is R23. Hereafter, CS uses three deterministic (Θ4, Θ5, Θ6) and three probabilistic (Θ1, Θ2, Θ3) tunnels for further message pair generation, see Table 6.1. Altogether, they have a strength greater than n = 27. The second part starts with a very similar single message modification block. In contrast, the multi-message block is much more elaborate and thus computationally more expensive. Due to the difference pattern, it spans many more computation steps than the first algorithm. Unlike the first part, the second part uses just two deterministic tunnels (Θ7, Θ8). They have a combined strength of 14.

6.3 Performance

The pseudocode of the entire algorithm is shown in Algorithm 2 and Algorithm 3. The search for the pseudo-collision, Algorithm 3, requires fewer computations than Algorithm 2 needs for the near-collision search, see also [20]. Therefore, the complexity of the algorithm is primarily defined by Algorithm 2. We tested the performance of the algorithm on a Pentium 4 PC, where we generated about 10,000 collisions. The average time for finding a collision is about 30 seconds. Executed at a frequency of 2.0 GHz, this yields an average number of 60 · 10^9 clock cycles per collision.

Algorithm 2 MD5 Collision Search Block1
Input: Seed for pseudo-random number generator
Output: Near-collision with prescribed difference
Start:
loop
    Single message modification up to R15
    Multi message modification up to R23 {Point of verification}
    for all possible changes in Θ1 do
        Compute new message words and step registers
        Test several conditions of the differential path
        if not all conditions hold then
            continue
        end if
        for all possible changes in Θ2 do
            Compute new message words and step registers
            Test several conditions of the differential path
            if not all conditions hold then
                continue
            end if
            ...
            for all possible changes in Θ6 do
                Compute new message words and step registers
                Test several conditions of the differential path
                if not all conditions hold then
                    continue
                end if
                Return found near-collision
                Invoke MD5 Collision Search Block2
                Goto Start
            end for
            ...
        end for
    end for
end loop

Algorithm 3 MD5 Collision Search Block2
Input: Seed for pseudo-random number generator
Output: Pseudo-collision for two chaining values with prescribed difference
loop
    Single message modification up to R13
    for all possible multi message modifications from R14 to R18 do
        Compute new message words and step registers
        Test several conditions of the differential path
        if not all conditions hold then
            continue
        end if
        for all possible multi message modifications up to R23 do
            Compute new message words and step registers
            Test several conditions of the differential path
            if not all conditions hold then
                continue
            end if
            for all possible changes in Θ7 do
                Compute new message words and step registers
                Test several conditions of the differential path
                if not all conditions hold then
                    continue
                end if
                for all possible changes in Θ8 do
                    Compute new message words and step registers
                    Test several conditions of the differential path
                    if not all conditions hold then
                        continue
                    end if
                    Return found pseudo-collision
                end for
            end for
        end for
    end for
end loop

7 Requirements for Collision Generators

In this chapter we analyze the requirements for a suitable hardware architecture to efficiently execute collision search algorithms for MD4-family hash functions. Using MD5, CS1 and CS2 as examples, we will explain our choice for a minimal microprocessor based design with massive parallelization.

7.1 Hash Function vs. Collision Search Algorithm

7.1.1 Introduction

Hash functions of the MD4-family have been developed for fast software execution on standard PCs [41]. They operate on data units with popular processor word lengths of 32 or 64 bits, and all their operations consist of typical processor instructions. Hash functions of the MD4 family are mainly characterized by their compression function. As shown before, a single evaluation of a compression function can simply be seen as an iterated application of its step function on the state of the hash function. Finally, the output chaining value is computed using a final modular addition. Collision search algorithms do not consist of step function calls only; they make use of other operations as well. In the following section we analyze what additional requirements collision search algorithms impose on the underlying hardware.

7.1.2 Reverse Step Operation

In order to reasonably manipulate messages, collision search algorithms invoke a 'reverse' step operation StepOp⁻¹. Often, the values for some registers Ri are fully or partly chosen in advance to guarantee accordance with sufficient conditions of the differential path. In such cases, message words Wi are computed from successive register values Ri to Ri−k.

Example: MD5. Since the step operation of MD5 equals

StepOp:    Ri = Ri−1 + rl((Ri−4 + bi(Ri−1, Ri−2, Ri−3) + Wi + Ki), si) ,

the corresponding reverse step operation is defined as

StepOp⁻¹:  Wi = rr((Ri − Ri−1), si) − Ri−4 − bi(Ri−1, Ri−2, Ri−3) − Ki ,

where rr(x, y) denotes the result of rotating x by y bit positions to the right and rl(x, y) the result of rotating x by y bit positions to the left.

Example: SHA-1. The step operation of SHA-1 is

StepOp:    Ri = rl(Ri−1, 5) + bi(Ri−2, rl(Ri−3, 30), rl(Ri−4, 30)) + rl(Ri−5, 30) + Wi + Ki .

Consequently, StepOp⁻¹ can be defined as

StepOp⁻¹:  Wi = Ri − rl(Ri−1, 5) − bi(Ri−2, rl(Ri−3, 30), rl(Ri−4, 30)) − rl(Ri−5, 30) − Ki .

Obviously, StepOp and StepOp⁻¹ reflect the same equation. In contrast to StepOp, StepOp⁻¹ is simply solved for another variable, namely the message word. In the transformation process, operations of the step function are possibly inverted: left rotations become right rotations and modular additions are mapped to modular subtractions.
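To make the relation between StepOp and StepOp⁻¹ concrete, the MD5 case can be sketched in C roughly as follows. This is only an illustration: the round-1 boolean function F stands in for bi, and the register and constant values used below are arbitrary, not taken from CS1 or CS2.

```c
#include <stdint.h>
#include <assert.h>

/* 32-bit rotations (rl and rr in the text); y must be in 1..31 */
static uint32_t rl(uint32_t x, unsigned y) { return (x << y) | (x >> (32 - y)); }
static uint32_t rr(uint32_t x, unsigned y) { return (x >> y) | (x << (32 - y)); }

/* Round-1 boolean function F of MD5 as a stand-in for bi */
static uint32_t F(uint32_t x, uint32_t y, uint32_t z) { return (x & y) | (~x & z); }

/* StepOp: forward MD5 step */
static uint32_t step(uint32_t r1, uint32_t r2, uint32_t r3, uint32_t r4,
                     uint32_t w, uint32_t k, unsigned s) {
    return r1 + rl(r4 + F(r1, r2, r3) + w + k, s);
}

/* StepOp^-1: the same equation solved for the message word Wi */
static uint32_t step_inv(uint32_t ri, uint32_t r1, uint32_t r2, uint32_t r3,
                         uint32_t r4, uint32_t k, unsigned s) {
    return rr(ri - r1, s) - r4 - F(r1, r2, r3) - k;
}
```

Since `uint32_t` arithmetic is modular by definition, the subtractions in `step_inv` undo the additions of `step` exactly, for any choice of operands.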

7.1.3 For-Loops, Tunnels and the Need for Resource Re-Use

For a simple hash function it is possible to write down all step operations explicitly. For collision search algorithms that use message modification and tunneling, the corresponding program code would grow too large. Explicitly written down, each possible bit combination of the tunnel bits would have to be taken into account. The number of possible execution paths grows exponentially with the sum of all tunnel strengths [25]. This increase in complexity calls for hardware or program code reuse. For this purpose, CS1 uses several, partly nested, for-loops and a single infinite loop. Both loop types can easily be mapped to typical processor instructions. An infinite loop can simply be converted to a single unconditional branch, i.e. a 'jump' or a 'goto' instruction. For-loops can generally be mapped to conditional branches and a few arithmetic operations which increment or decrement the loop index variable appropriately.

7.1.4 Bit Conditions and Tunnel Variations

Collision search algorithms make extensive use of bit testing operations and conditional jumps. If any defined (bit) condition of the difference pattern does not hold, the algorithm branches to a predefined earlier execution step, which in C terms corresponds to a continue command. Here, the collision candidate is altered or renewed completely. It is interesting that all such jumps can be realized as backward jumps, i.e. jumps that branch to a former execution step of the program. A collision candidate has to pass all conditional jumps in the algorithm to prove that it is a collision for the chosen differential path. There is no way of ignoring some conditions in between. Bit conditions are usually very manifold. They do not only differ in the comparison operation itself (=, ≠, <, >, ≤, ≥), but are also parameterized by the bit position of the examined bits. A general purpose operation covering all possible bit conditions is therefore not reasonable. More complex bit conditions should rather be reduced to easy-to-evaluate conditions like x = 0 using basic arithmetic and logical operations. We observed that, in a processor-like design, just one status bit is enough to successfully implement all kinds of bit conditions. All conditional branches of CS1, whether based on a for-loop or on conditions of the differential pattern, can be implemented using a single status request on a (commonly called) zero-flag (z-flag), which is set to '1' if and only if all bits of the examined register value equal zero. One of the most frequent processor flags, the carry-flag, is not required. A carry-flag is usually used to indicate an overflow that, for example, occurs when two big numbers, both having a 1-bit at their most significant position, are added and the result cannot fully be described by a single processor word. Since MD4-family hash functions only use modular additions and subtractions, bits of higher bit positions are ignored anyway. Information that a carry-flag can convey is therefore not valuable for further processing.
In CS1, more precisely in its for-loops, there are also conditional branches based on less-than evaluations. These branches are typically implemented on processor architectures using another popular status bit, the negative-flag (n-flag). However, it turns out that a negative-flag is actually not necessary for an implementation of CS. This is due to the fact that it is known in advance how many possible bit changes a tunnel can produce. Except for the first infinite loop, which surrounds the whole algorithm, the maximal iteration number for each loop is therefore predetermined. Hence, all for-loops can be implemented using simple tests for equality. Luckily, it is possible to deduce the current operands generated by a certain tunnel from a single loop index. Enumerating this loop index generates all possible tunnel variations. Practically, this can be done by transferring the least significant bits of the loop index to the step register bits of the particular tunnel. The transfer operation, in turn, can be implemented using basic boolean operations, bit rotations and conditional branches, which require the z-flag. We applied this technique in CS2 and could save about 5000 32-bit constants compared to CS1.
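The transfer of loop index bits to tunnel bits can be sketched in C as follows. The tunnel bit positions and strength below are hypothetical, chosen only for illustration; CS2 applies the same idea with the actual tunnel positions of the differential path.

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical tunnel of strength 3: free bits at positions 6, 12 and 25
   of one step register (illustrative values, not from the real MD5 path). */
#define TUNNEL_STRENGTH 3
static const unsigned tunnel_pos[TUNNEL_STRENGTH] = { 6, 12, 25 };

/* Transfer the least significant bits of the loop index
   into the tunnel bit positions of the register. */
static uint32_t apply_tunnel(uint32_t reg, uint32_t index) {
    for (unsigned b = 0; b < TUNNEL_STRENGTH; b++) {
        uint32_t bit = (index >> b) & 1u;
        reg = (reg & ~(1u << tunnel_pos[b])) | (bit << tunnel_pos[b]);
    }
    return reg;
}
```

A loop `for (i = 0; i != (1u << TUNNEL_STRENGTH); i++)` then visits every tunnel variation of the register; note that the loop exit is a pure equality test, so a z-flag suffices and no negative-flag is needed.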

7.1.5 Pseudo-Random Number Generator

To generate new collision candidates, a pseudo-random number generator is also necessary. For example, it can be implemented as a linear feedback shift register (LFSR), using basic shift and XOR operations [43]. In contrast to an application in encryption algorithms, a PRNG in CS only has to randomize collision search. There are no further security requirements on its output sequence; non-linear behavior is not explicitly desired. On the contrary (as will be shown in more depth in Section 8.4.4.2), it is very useful if the PRNG assists in suitably partitioning the search space. This is an important requirement for parallelization approaches. LFSRs can efficiently be implemented in dedicated hardware modules. In software executed on a processor, their implementation is somewhat more complicated. If the length of the LFSR does not exceed the processor word length, the LFSR shift operation can be mapped to a simple invocation of a native shift operation. What is more complex is to extract all bits of the state of the LFSR that are used to compute the final output bit. Mapped to typical processor instructions, this requires further bit shifting operations, bitwise AND operations and conditional branches.
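As an illustration, a 32-bit Fibonacci-style LFSR in software might look as follows. The tap set and seed are example choices, not taken from CS1; CS only needs the generator to randomize the search, not to be cryptographically strong.

```c
#include <stdint.h>
#include <assert.h>

/* 32-bit LFSR with taps 32, 22, 2, 1
   (corresponding to the primitive polynomial x^32 + x^22 + x^2 + x + 1;
   any primitive polynomial would do here). */
static uint32_t lfsr = 0xACE1u;   /* arbitrary non-zero seed */

static uint32_t lfsr_next_bit(void) {
    /* extracting the tap bits costs shifts and ANDs, as noted in the text */
    uint32_t bit = ((lfsr >> 0) ^ (lfsr >> 10) ^ (lfsr >> 30) ^ (lfsr >> 31)) & 1u;
    lfsr = (lfsr >> 1) | (bit << 31);   /* one native shift plus an OR */
    return bit;
}

/* Assemble a full pseudo-random 32-bit word bit by bit */
static uint32_t lfsr_next_word(void) {
    uint32_t w = 0;
    for (int i = 0; i < 32; i++)
        w = (w << 1) | lfsr_next_bit();
    return w;
}
```

In hardware the whole feedback computation is a handful of XOR gates evaluated in a single clock cycle, which is why LFSRs are so much cheaper as dedicated modules than as processor code.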

7.1.6 Summary

In the end, it depends on the hierarchical level whether the required (processor) instruction sets for a collision search algorithm and its original hash algorithm differ. There is no difference in the required instruction sets between the full hash function and the collision search algorithm. Since the full hash function has to process a variable number of message input bits, it has to use some sort of while-loop. While-loops implicitly require conditional jumps. In contrast, the computation of a single message block using the compression function of the hash function requires neither a conditional nor an unconditional branch; the algorithm structure is very linear.

Figure 7.1: Linear feedback shift register

Having this in mind, the implementation of a collision search algorithm on processor-based architectures requires no or only few additional instructions. Most important, to realize loops and data-dependent program branches, unconditional and conditional jumps are needed, whereas unconditional jumps can be reduced to conditional jumps. The conditional branches in CS1 and CS2 just require the existence of a z-flag status bit that indicates whether the examined register value equals zero or not. Dependent on the status of this flag, conditional jumps are either ignored or taken.

7.2 Requirements for Target Hardware

7.2.1 32-bit Data Units

The target hardware for the collision search algorithm CS should work on 32-bit words. A division into several smaller parts is not useful. There are frequent operations in the program code, like modular additions and bit rotations, that do not operate bit-wise. Within these, a certain bit position can have impact on bits of the result that are located in far-away bit positions. In contrast to bit-wise defined operations, like AND, OR, NOT, the actual effect of these operations heavily depends on the processor word length. When operands are divided up into parts, the original impact across several bit positions may require additional processing. To guarantee compliance with the specified hash algorithm (or its collision search algorithm) the results have to be corrected. For example, when looking at carry propagation in modular additions, the least significant bit of one operand can even influence the most significant bit of the result:

0x1 + 0xFFFF = 0x0 .

When divided into parts, this effect is distorted and has to be corrected artificially. Likewise, combining several 32-bit words into bigger data units is not sensible. Again, the original meaning of an operation, as defined in the hash function specification, would be altered, thus requiring costly corrections.
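The correction can be illustrated in C by emulating a 32-bit modular addition on a hypothetical 16-bit datapath; the function and variable names are ours, not from CS.

```c
#include <stdint.h>
#include <assert.h>

/* Adding two 32-bit words on a hypothetical 16-bit datapath:
   the carry out of the low half must be propagated by hand. */
static uint32_t add32_via_16(uint32_t a, uint32_t b) {
    uint16_t alo = (uint16_t)(a & 0xFFFFu), ahi = (uint16_t)(a >> 16);
    uint16_t blo = (uint16_t)(b & 0xFFFFu), bhi = (uint16_t)(b >> 16);

    uint32_t lo = (uint32_t)alo + blo;        /* may overflow 16 bits   */
    uint16_t carry = (uint16_t)(lo >> 16);    /* artificial correction  */
    uint16_t hi = (uint16_t)(ahi + bhi + carry);

    return ((uint32_t)hi << 16) | (uint16_t)lo;
}
```

The carry out of the low half is exactly the artificial correction step that a narrower datapath forces on us; with a native 32-bit adder it comes for free.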

7.2.2 Regularity of Collision Search Algorithm

The search for a suitable hardware architecture can be simplified when looking at regularities within the difference pattern and the collision search algorithm. On the one hand, there is a very limited set of basic (processor) instructions behind all operations in CS. On the other hand, these instructions are used with numerous distinct operands, constants as well as variables. To sum up, instructions occur repeatedly very often, operands only rarely. As mentioned in Section 7.1.3, the huge number of different combinations of tunnel bits inevitably requires program or hardware reuse. Tunnel bits are, looked at from a computational point of view, simply used to (slightly) modify operands. In the following, we call all values tunnel operands which may be altered within the computation process due to a tunnel variation. Their current values are called tunnel values. The values for tunnel operands are not only dependent on the pseudo-random number generator, like other operand values, but also on the current tunnel variation. Consequently, there are many possible tunnel values in a single collision search algorithm execution. Due to their huge number, it is not sensible to wire each tunnel value together with its operations. Instead, to avoid costly repetition of logic, tunnel values have to be generated and loaded to the corresponding operation units on demand. In the execution process, all possible tunnel values of a single tunnel operand can thus be fed into the same hardware unit, of course each at a different time. This can greatly help saving hardware costs.
To execute a certain algorithm the underlying hardware has to wire all distinct operations, at least each operation once. A further crucial question is to what extent (i.e. how often and for which groups of operands) the implemented hardware units are reused. When each elementary operation is implemented in hardware only once, the design is very area-efficient. The disadvantage of this approach is that it is impossible to compute two or more functions which are based on these operations at the same time. The hardwired operation can only process one input data set at a time. Parallel execution is strongly restricted. If two sets of input values need to be processed by the same hardware unit, they must be computed one after the other. The opposite solution is to generate a new hardware unit for each instruction in CS, regardless of the fact that the required operation has possibly already been implemented (see Section 7.3.1). Theoretically, all computation units generated this way could be operated in parallel. This would lead to a very speed-oriented but area-demanding architecture. Obviously, a first major restriction to this approach is given in Section 7.1.3.

Figure 7.2: Implementation in a single hardware unit

The final hardware design is located in between these extreme approaches. It is heavily dependent on the algorithm to be implemented. The final choice is made when analyzing to what extent several hardwired, concurrently active operations can reasonably be organized and connected to each other such that they practically help to speed up the computation process.

7.3 Hardware Acceleration Techniques

7.3.1 Pipelining

From a certain point of view, full pipelining is a strategy to implement an algorithm in hardware in such a way that in each clock cycle a new input operand can be loaded into it for processing. Technically, the implementation computes more than one operand at the same time, each at a different execution step. The implementation is therefore divided into several so-called pipeline stages, which are connected to each other in a linear order: the output of the i-th stage is the input of the (i+1)-th stage. If the pipeline has d stages, d operands, one in each stage, are computed simultaneously. The time for one clock cycle is roughly the same as the execution time for the slowest stage, since in one clock cycle all stages must have enough time to complete execution. Big differences in the execution times of several stages are very unfavorable because very fast stages complete execution long before slow ones do, and their resources are wasted. Hence, pipeline stages are usually constructed with nearly the same execution time. Pipelines are very efficient for algorithms with a high proportion of sequential operations. A pipeline of d stages can improve the throughput of an originally not pipelined implementation by a factor of d. For example, imagine a strictly sequential set of instructions to be implemented in hardware. Figure 7.2 reflects its implementation in a single hardware unit. We assume that it requires 4n seconds to compute a single set of input values. Figure 7.3 shows the corresponding pipelined implementation. Here, each input set of values has to pass all four pipeline stages for processing, one after the other, and each stage takes n seconds to complete computations for a single input value set.

Figure 7.3: Pipelined implementation with four stages

The delay between insertion of input values and output of results is the same in each implementation. In contrast, their throughputs are different. The throughput t1 for the first implementation is simply

t1 = 1 set of input values / 4n seconds .

In the pipelined implementation, a new set of input values can be inserted into the first pipeline stage after n seconds. After an initialization phase of three steps, all four pipeline stages synchronously compute values from originally different sets of input values. In other words, four input values are computed concurrently, and the overall throughput t2 grows to

t2 = 4 sets of input values / 4n seconds = 1 set of input values / n seconds .   (7.1)
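The throughput argument can be checked with a small back-of-the-envelope model in C (d stages of n seconds each, m input sets; the function names are ours):

```c
#include <assert.h>

/* Non-pipelined unit: each of the m input sets occupies the
   whole d-stage logic for d*n seconds before the next may enter. */
static double time_unpipelined(int m, int d, double n) {
    return (double)m * d * n;
}

/* d-stage pipeline: after filling the pipe (d - 1 steps),
   one result leaves every n seconds. */
static double time_pipelined(int m, int d, double n) {
    return (double)(d + m - 1) * n;
}
```

For a single input set both variants take d·n seconds (the latency is unchanged), while for many input sets the ratio of the two times approaches d, matching Equation (7.1).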

However, when the original algorithm deviates from a linear command sequence, things get complicated [45]. Loops and unconditional jumps return to a previous execution step. Having this in mind, one expects an input data conflict, since new operands are continuously delivered to the pipeline input. Additional administration logic and buffers are needed at pipeline stages that can handle such backward jumps. They have to ensure that not only operands of preceding pipeline stages but also operands from subsequent pipeline stages can continuously be processed. Such logic must comprise several complex design features, since it cannot be foreseen when a certain data-dependent condition does not hold, a conditional (backward) jump is taken, and a value from a subsequent pipeline stage has to be recomputed in a former stage. As a consequence, a pipeline stage can possibly encounter two different input values for processing. Such cases require logic to decide which value is to be given priority. Unfortunately, the other value has to wait. If input data is continuously fed into the overall algorithm, this may result in further conflicts, because the number of operands in the pipeline stages does not decrease steadily as usual but rather data-dependently. Consequently, a suitable administration logic must have costly means to also regulate the input data flow. Conditional branches lead to further problems for pipelines. If a certain data-dependent condition holds, the algorithm branches to part A, otherwise to part B. This disrupts the continuous flow of operands through the pipeline and, at the same time, requires the implementation of two different hardware units. If the algorithm branches with probability 0.5 into part A, then the efficiency of parts A and B, and of all subsequent parts, is only half of the efficiency of preceding parts: only one of the two parts is fed with a new operand, the other one is idle and its resources are wasted. One way to (partially) solve these problems is to unfold loops and unconditional jumps. This means not to use pipeline stages repeatedly. To avoid interfering with the continuous data flow of previous stages, an additional stage is implemented instead, regardless of the fact that preceding stages already provide the same functionality. All commands inside a loop are copied and appended as many times as the loop is expected to iterate. That way, the loop is unfolded and appended to the algorithm in a linear and thus pipeline-suited order. Unfortunately, unfolding loops in CS1 is naturally restricted due to the arguments from Section 7.1.3. Hence, it is impossible to generate a fully pipelined implementation of CS.

7.3.1.1 Algorithm State

Another important point is that using pipelines requires the algorithm state to be transferred from one pipeline stage to the next. Therefore, each pipeline stage must offer adequate memory capacities to save the whole algorithm state. The results computed by a certain pipeline stage have to be stored to this memory. It is important to cleanly separate values of different algorithm evaluations from each other. Given a pipeline with d stages, this is only possible if the hardware implementation offers d times as much memory for variables as is needed for a single algorithm, since d invocations are indeed computed concurrently. Mixing up results from distinct algorithm invocations would invalidate all results. In CS2 we tried to analyze the state of the MD5 collision search algorithm. The state of an algorithm consists of all variable memory values that are necessary for subsequent computations. The number and size of these values may differ from execution step to execution step. As the state of the algorithm can consist of any such set of values, its size is defined by the largest amount of memory which is, at some point in algorithm execution, necessary to successfully complete the program. The state of CS1 is quite big. Even with the memory-improved version CS2, the program requires 16×32-bit message registers, 28×32-bit step registers and about 50 additional, primarily 32-bit, registers to store various data. They are used to store backup values for registers, loop indexes, temporary variables and so on. Altogether this makes up about 94×32-bit registers and a total of 3008 bits, which makes pipelining techniques very expensive.

7.3.2 Parallel Execution

Another popular hardware acceleration technique is parallel execution. Sometimes, complex operations can be split up into parts which can be executed concurrently. Afterwards, the output values of each part are combined to compute the overall result of the operation. Parallel execution, similar to pipelining, allows for several hardware units to be active at the same time, thus accelerating the overall computation process. Parallelization techniques can be used on different hierarchical levels, ranging from parallel execution of the most basic operations used in the algorithm, like XOR, to parallel execution of the whole algorithm itself. Whereas the success of the first approach is highly algorithm dependent, the latter solution is always possible. The success of parallel execution on lower hierarchical levels depends on how far the structure of the algorithm supports parallelization, that means, whether there is a high proportion of operations that can be executed concurrently. The gain in speed is described by Amdahl's law [45]. Given a not yet parallelized algorithm that takes T seconds to finish, of which a fraction f is sequential code (S) and a fraction (1 − f) is parallelizable code (P), the speedup s, yielded when executing P using n similar hardware units, is

s = n / (1 + (n − 1) f) .
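Amdahl's law is easily evaluated numerically; a minimal C helper (names ours) makes the limiting behavior explicit:

```c
#include <assert.h>
#include <math.h>

/* Amdahl's law: speedup when the parallelizable part P runs on n units
   and a fraction f of the runtime remains strictly sequential. */
static double speedup(int n, double f) {
    return (double)n / (1.0 + (n - 1) * f);
}
```

For f = 0 the speedup is the full factor n, while for f = 1 no speedup remains at all; on lower hierarchical levels CS sits close to the latter extreme.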

CS can hardly be parallelized on lower hierarchical levels. Useful situations are confined to operations within the step function, where the evaluation of the round function and two (or just one if Ki + Wi has been precomputed) modular additions can be computed concurrently. In almost all other cases an operation can only start when its predecessor has already output a result, since this result is required as an operand to the current operation.

7.4 Choice for Microprocessor Design

We finally decided to implement the collision search algorithm CS on a 32-bit microprocessor based ASIC architecture. The central microprocessor we call µMD, the final collision search circuit µCS. The main reason for our choice is that advanced acceleration techniques are obviously of too limited use. Accelerating the collision search algorithm considerably using a very smart hardware design is simply not possible without great cost in administration logic. Due to the structure of the algorithm and its comparably big state, it is not possible to reasonably use a great number of concurrently active hardware units in combination. All these arguments also hold for other MD4-family hash functions and their possible collision search algorithms. We believe that, besides multi message modification, tunneling will become a standard means of improving collision search algorithms based on differential patterns. Algorithms designed this way are also confronted with the restrictions of Section 7.1.3, making usual hardware acceleration techniques like pipelining hardly useful. Similar to the arguments of Section 7.3.2, parallel execution on lower hierarchical levels is restricted as well. Without successful hardware acceleration techniques, and given the need for resource reuse, we concentrate on area-optimal designs. Instead of investing additional hardware costs to make algorithm execution faster, we try to minimize area costs and thus to maximize hardware reuse. The constant operands of CS are almost all pairwise different. Big savings, except for those of Section 7.1.4, can hardly be made. Maximal reuse can only be realized by implementing each operation in hardware just once. Preferably, the instruction set is very small, meaning little hardware requirements. It is additionally favorable that all operations have a similar, short execution time, since this heavily influences the clock speed of the µMD. We believe that the best solution for all these requirements is a microprocessor based architecture.

7.5 Final Design Requirements

7.5.1 Metric for Performance and Price Model

Our target collision search device µCS should deliver better performance results than usual PCs. The measure P evaluates our final solution depending on the time until a collision is found and on the price, that is, given a certain circuit technology, the area A of the target circuit. We believe that the area-time product (AT-product), with a properly defined metric for time T, is adequate for this purpose.

P = A · T

Furthermore, and as will be dealt with in more depth in Section 9.4, we need an appropriate measure to compare the final costs of our design with those of distinct architectures. This is particularly important to also compare the cost-efficiency of parallelized hardware designs, see Section 9.5. For the sake of comparability, we therefore choose the price per chip area to be the major cost defining factor of integrated circuits. Assuming the costs of an off-the-shelf Pentium 4 processor with 2.0 GHz clock frequency and die size 146 mm² to be about 50 €, we define the price per chip area as QA = 50 € / 146 mm² ≈ 0.3425 €/mm², see Equation 9.1. Please note that while choosing an ASIC based architecture we also assume that the price per chip area is comparable to that of standard processors. At this point we recall the main objective of our work. We aim at developing a better architecture for collision search. The main application of this architecture are hash functions that have been theoretically broken but for which it is still very demanding to practically find collisions. We expect this task to require a number of µCS units much higher than the number of standard processors that would be needed for an equivalent performance. Given comparable margins, we believe that µCS units can be manufactured at nearly the same price per area.

7.5.2 Standard PCs

Obviously, standard PCs have two major advantages which are very hard to compete with. First, the clock frequency of standard processors is much higher than that of µMD. Secondly, advanced acceleration techniques like pipelining are very well elaborated on modern processors. They use several pipeline stages (Pentium 4, 2.00 GHz: 20 pipeline stages) with additional supporting means, like, for example, static and dynamic branch prediction. As a consequence of the aforementioned arguments, µMD is not supposed to implement such techniques. All above arguments indicate that the (execution) time T of µMD is much higher than that of standard PC processors. On the other hand, all standard processors have a lot of additional features which are primarily not interesting for collision search. This increases their AT-product, since it is not possible to somehow extract all parts important for collision search and leave the rest behind. A standard processor only works as a whole. Furthermore, standard PCs comprise a lot of additional devices like memory, motherboard, cooler and I/O to successfully run a collision search algorithm. Also, when trying to parallelize standard PCs, a network infrastructure is required. All these requirements, besides the area for the standard processor, increase the final implementation costs.

7.5.3 Minimal Microprocessor

Designing and implementing a similarly elaborated and complex processor which contains only the features relevant for MD4-family collision search is very costly and time-consuming. We want to show that good results can also be achieved with very clean and simple approaches. Therefore, µMD should have minimal area requirements at maximal performance. We primarily do not try to increase its speed but to decrease its area as much as possible, and in this way decrease its AT-product. µMD is used as a basis for our collision search unit µCS, which again should be kept as small and simple as possible. Due to the comparably small maximal speed of a single µCS circuit, our solution can only be practically useful with massive parallelization of µCS units. Therefore, it is of utmost importance to keep the I/O communication of µCS very low. This greatly helps to reduce communication overhead and area requirements for the parallelization infrastructure and corresponding control units. In contrast to standard PCs, we do not need any further I/O or memory devices, which again saves implementation area and costs.

7.5.4 Definition of Time T

Given a certain technology, the required area A for a circuit can directly be mapped to its price, which finally leads to a secondary measure. An open issue is to suitably define the time T with respect to collision search. We define T as the average time required to find a collision. When two processors are compared with each other, this time depends on their frequency f and on the average number of cycles C needed to compute a collision:

T = C / f
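The two measures combine into the area-time product used throughout the comparison. The following sketch illustrates this; all concrete numbers below are made-up placeholders, not measurements from the thesis:

```python
# T = C / f is the average time to find a collision; the area-time product
# A*T serves as the comparison measure between designs.
def collision_time(avg_cycles: float, freq_hz: float) -> float:
    """T = C / f: average seconds per collision."""
    return avg_cycles / freq_hz

def at_product(area_mm2: float, avg_cycles: float, freq_hz: float) -> float:
    """Area-time product A*T used to compare designs."""
    return area_mm2 * collision_time(avg_cycles, freq_hz)

# Hypothetical example: a small, slow unit can still win on A*T against a
# large, fast general-purpose processor.
small_unit = at_product(area_mm2=2.0, avg_cycles=1e12, freq_hz=100e6)
big_cpu = at_product(area_mm2=146.0, avg_cycles=5e11, freq_hz=2e9)
print(small_unit, big_cpu)
```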

We expect µMD to have a much smaller frequency than standard processors. In contrast, it is not clear from the beginning whether a collision algorithm developed for and executed on µMD needs more or fewer cycles on average to find a collision. This is due to several factors. First, we are free to implement whatever operations we want in hardware. Possibly there are operations in CS which need several clock cycles on a standard PC to finish, since they require iterated use of a certain processor instruction. If worthwhile, these operations can be executed by µMD at the cost of far fewer clock cycles using a more specialized hardware unit. Secondly, we will program CS in a dedicated assembler language. As no compilers are used for translation, we expect our final machine code to be much more compact than the binary files for standard PCs. Thirdly, we will actually develop our assembler program, subsequently called ACS, based on CS2 rather than CS1, since it is much more memory-efficient. The first and the second issues should decrease the average number of necessary clock cycles, the third one increases it. Whether the final implementation actually needs more or fewer clock cycles than CS1 cannot be foreseen at this point. An appropriate analysis, as presented in Section 9, has to thoroughly consider the final processor implementation as well.

8 Circuit Design

8.1 Introduction

In the following sections we present our solution for efficient collision search on MD4-family hash functions. We start with a brief summary of the overall development process for successfully designing and implementing our microprocessor µMD and our collision search unit µCS. Subsequently, we describe µMD in more detail. Then, we give a comprehensive presentation of µCS. µCS is a standalone hardware unit for collision search. Besides µMD, it comprises sufficient memory capacities to store CS and all corresponding variables. Furthermore, it uses a dedicated I/O interface to read in starting parameters and to output found collisions. Finally, we show how several µCS units can be connected to each other for parallelized collision search.

8.2 Development Process

The development process can be divided into several phases. In the first step we designed the basic processor µMD using a VHDL integrated development environment. For verification and performance analysis, we additionally needed appropriate memory devices containing the program code and constants and offering enough space for storing temporary variables. This resulted in the first design of µCS. For simulation purposes, we developed tools to automatically load the future ROM modules of µCS with the content of dedicated binary files. To generate these files, we had to develop a dedicated assembler. Subsequently, we added further functionality to µCS, consisting of a pseudo-random number generator and an I/O controller, which are used to parallelize several units. In the next step we had to program ACS, the assembler version of CS, to gather information on the required memory space of µMD and determine some basic facts about its runtime behavior. Unfortunately, it is currently not possible to thoroughly analyze the long-run behavior of ACS in the simulation model to obtain average values. It is hardly even possible to observe a single collision once the algorithm is started; simulated execution is too time-consuming.

All development steps were iterative, i.e., most of them were passed more than once to implement additional improvements and enhancements. As data from several sources (a dozen VHDL components and an assembler source file converted to binary instructions) was used in combination for simulation purposes, code inspection and verification was often quite complex. Our final step was to synthesize µMD and µCS. That way, we gained reliable information on their timing and area requirements.

8.3 Microprocessor Design

For the reasons given in Section 7.4, we developed a minimal 32-bit microprocessor architecture µMD for fast collision search. It uses a very small instruction set, consisting of no more than sixteen native commands. In particular, this is sufficient for the execution of all algorithms of the MD4-family. Moreover, it suffices for the execution of current and (probably) future collision search algorithms, like CS. µMD is based on the design approach from [38]. The instruction set is adapted according to the requirements of the target application. Therefore, several instructions have been deleted, others have been added. We designed two versions of µMD, which differ in the implementation of just a single instruction, namely RL. RL, see Section 8.3.10, is used to left-rotate 32-bit values. The first variant simply realizes RL as a left-rotation by a single bit. Rotations of greater width have to be programmed using RL iteratively, which, of course, consumes comparably many clock cycles. The second processor variant implements it as a variable bit rotation: a second operand is used to specify how wide the rotation should be. We subsequently call the µMD variant which implements RL as a simple one-bit rotation µMD1, the other variant µMD2. When we refer to the microprocessor in general, regardless of which concrete implementation for RL is used, we denote it µMD. Accordingly, we use RL1, RL2 and simply RL to differentiate between the instruction implementations. Moreover, when we want to emphasize that µCS is based on a concrete implementation µMD1 or µMD2, we call it µCS1 or µCS2. In the general case we use µCS. Whether RL1 or RL2 is used hardly influences the clock frequency of µMD. In contrast, the impact on the area is twofold. Implemented on a Xilinx Spartan3 xc3s1000 FPGA, RL2 heavily increases the area requirements of µMD, almost tripling its size.
Interestingly, a similar effect cannot be observed when implementing it as an ASIC in standard logic.
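The semantics of the two RL variants can be sketched in software. This is an explanatory Python model (not the VHDL implementation): RL1 rotates the 32-bit accumulator left by one bit, so a rotation by r costs r applications, whereas RL2 rotates by a variable width in one operation.

```python
# Software model of the two rotate-left variants of µMD.
MASK32 = 0xFFFFFFFF

def rl1(a: int) -> int:
    """RL1: rotate a 32-bit value one position to the left."""
    return ((a << 1) | (a >> 31)) & MASK32

def rl2(a: int, r: int) -> int:
    """RL2: rotate a 32-bit value r positions to the left."""
    r %= 32
    return ((a << r) | (a >> (32 - r)) if r else a) & MASK32

# A rotation by 7 via RL1 means seven single-bit rotations:
x = 0x80000001
y = x
for _ in range(7):
    y = rl1(y)
assert y == rl2(x, 7)
```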

8.3.1 Design Principle: RISC or CISC

µMD is neither a pure RISC nor a pure CISC processor. It deviates from important characteristics of both design principles. On the one hand, µMD does not use only load and store instructions to access the memory. Almost all instructions initiate memory transfers to load a second operand. This makes µMD appear more CISC-like. On the other hand, µMD's instructions are not interpreted by microcode, which is a property of common RISC processors. Furthermore, µMD has only very few registers: just a single instruction register (IR), an accumulator (A) and a single program counter. All modern (standard) processor design principles, whether RISC or CISC, recommend implementing many registers, since register-to-register transfers are much faster than memory accesses. µMD cannot issue one instruction per clock cycle; it has to wait until the preceding instruction has finished. This, again, would be a favorable feature of all current processor designs. µMD is really more a processor suited for embedded systems than for standard PC systems. If at all, one can say that µMD tends to be more RISC-like. This is mainly due to the very clear and simple instruction set.

8.3.2 Acceleration Techniques

µMD does without any advanced processor acceleration techniques. In particular, it does not use pipelining, super-scalar execution or caching, along with supporting techniques like static and dynamic branch prediction, speculative execution and so on. µMD thus saves much area, not only for the mechanisms themselves but also for the additional administration logic to control them. The program counter of µMD is equipped with an incrementing logic, which considerably contributes to shortening the execution path length of all instructions. Moreover, it allows to cleanly separate the data path elements of µMD from its address computing units, which also simplifies the processor design. As a consequence, the program counter cannot be loaded into or from the arithmetic logic unit (ALU), thus making dynamic address computations impossible.

8.3.3 Size and Frequency

µMD1 is comparably small, using about 3 percent of the slices of a Spartan3 xc3s1000 FPGA. The final clock frequency is about 110 MHz after synthesis. µMD2 is much bigger; it uses about 9 percent of the slices on the same FPGA device. After synthesis, the final clock frequency is reported as about 95 MHz.

A more detailed analysis of the performance characteristics of µMD can be found in Section 9.

8.3.4 Addressing Modes

µMD supports just two addressing modes, absolute (or direct) addressing and indirect addressing. All two-cycle instructions use absolute addressing mode. In such cases, the address field contains the (absolute) memory address of the operand. In indirect addressing mode, the content of the address field is interpreted as the address of a memory word, in this context called a pointer, which, in turn, contains the final address of the operand. There are just two instructions in the instruction set which use this addressing mode, namely LDI and STI. By step-wise incrementing the value of a pointer, successive memory words can be referenced in an array-like manner, see Section 8.3.7. Technically, the first two cycles retrieve the content of the pointer, whereas the third one finally loads the required operand.
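The difference between the two modes can be shown with a toy model of a flat, word-addressed memory (the addresses and values are illustrative, not from the thesis):

```python
# Toy model of absolute vs. indirect addressing.
mem = {10: 42, 20: 10}   # address 20 holds a pointer to address 10

def lda(addr: int) -> int:
    """Absolute (direct) addressing: the address field names the operand."""
    return mem[addr]

def ldi(addr: int) -> int:
    """Indirect addressing (LDI): the address field names a pointer word,
    whose content is the final operand address."""
    return mem[mem[addr]]

print(lda(10))  # 42, loaded directly
print(ldi(20))  # also 42, via the pointer stored at address 20
```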

8.3.5 Input and Output Pins

µMD uses a 32-bit bidirectional synchronous data bus and a separate k-bit address bus to connect the memory and I/O devices. The design of µMD is highly parameterized. The value k is a program-dependent constant, reflecting the actual number of required program words to implement the algorithm (≤ 2^k). For ACS, k = 13. The advantage of a multiplexed bus is more than doubtful. Since in µMD almost all instructions initiate a memory transfer, we believe that the possible area savings for a multiplexed data and address bus cannot outweigh the loss in performance. Moreover, we expect a multiplexed bus to further complicate memory access, which surely leads to additional logic for the implementation of a more advanced bus protocol, not only in µMD but also in the memory control units (see Section 8.4). However, µMD uses some sort of multiplexing as well. The memory address mechanism used by µMD is general purpose: instructions are addressed in exactly the same way as operands. µMD has no dedicated bus to receive requested program words from the program ROM; new instructions are also loaded via the data bus. When a program word is loaded, only 17 bits of the data bus are actually used. Depending on the memory device, the other bits can simply be set to zero. In addition to the data and the address bus, µMD has a single output pin (RNW) that notifies the connected memory device whether to read from or to write to the data bus. As a synchronous hardware unit, µMD is triggered by a clock signal (CLK). Finally, the state of µMD can be reset asynchronously by a dedicated input signal

Figure 8.1: Microprocessor: overview

(Reset). An overview of the input and output pins of µMD is given in Figure 8.1.

8.3.6 Hardware Stack and Function Calls

To the design of µMD, we added a hardware stack. It provides for sub-routine calls and returns, which enables efficient program code reuse. On a sub-routine invocation the stack reads in the current k-bit program counter position, increments it once and pushes it onto the stack. On a return command the top-of-stack value is popped and loaded into the program counter. In order to make this functionality available for program development, two new instructions, namely CALL and RET, have been added to the instruction set. A sub-routine CALL needs two, a RET command three cycles to finish. Altogether this means that each sub-routine invocation costs five additional cycles. The design of µMD allows to easily adapt the depth of the hardware stack according to the needs of the application that is executed. ACS based on RL1 requires no more than a depth of three stack registers, meaning that during program execution at most three sub-routines are called before a RET command occurs. In contrast, ACS based on RL2 requires just two stack registers. This is due to the fact that RL2 is implemented as a native processor instruction, whereas RL1 is implemented as a sub-routine in the program code.
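The CALL/RET behavior described above can be sketched as follows. This is an explanatory Python model, not the VHDL: a real hardware stack of fixed depth would simply misbehave on overflow, while the sketch raises an error instead.

```python
# Sketch of CALL/RET with a bounded hardware stack (depth 3, as needed by
# the RL1-based ACS).
class HardwareStack:
    def __init__(self, depth: int = 3):
        self.depth = depth
        self.regs = []          # up to `depth` k-bit return addresses

    def call(self, pc: int, target: int) -> int:
        """CALL: push the incremented program counter, jump to target."""
        if len(self.regs) == self.depth:
            raise OverflowError("hardware stack depth exceeded")
        self.regs.append(pc + 1)
        return target            # new program counter

    def ret(self) -> int:
        """RET: pop the top of stack into the program counter."""
        return self.regs.pop()

stack = HardwareStack(depth=3)
pc = stack.call(pc=100, target=500)  # enter sub-routine at address 500
pc = stack.ret()                     # resume at address 101
print(pc)  # 101
```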

8.3.7 Function Parameterization

Besides sub-routine calls, code reuse can be improved further. When looking at CS, it becomes clear that some step function invocations occur repeatedly throughout the program. In MD5 and CS the step function is evaluated very often, at least more than 256 times,

while R21, for example, is computed at minimum five times by the step function before a collision is found. Sub-routines are particularly efficient for code reuse when they support several arguments. Suitably parameterized, a single sub-routine can replace numerous explicit computations in the program code and thus greatly reduce program code size. What remains is to efficiently tell the sub-routine on which current values it should start its work. A favorable way to design sub-routines is to assign them predefined memory locations from which to load their arguments. These memory locations are subsequently called formal arguments. Before calling a sub-routine, it is consequently required to store all current values into the corresponding formal arguments. MD5, for example, needs four registers and three constants as input values for a single step function invocation. Assigning each current value explicitly to the corresponding formal argument would require at least seven additional store operations in the program code. Even worse, step functions could not be invoked iteratively this way. Iterative invocations, on the other hand, for example within a for-loop, can save many words of program code, since they prevent the loop body from being implemented in the program code more than once. Consequently, it is very favorable to design sub-routines such that they have as few input arguments as possible. Preferably, these arguments can comfortably be enumerated. Using the index of a surrounding for-loop as such an argument, the sub-routine can easily be invoked iteratively. This approach requires that a sub-routine can find and load all its current operands on its own. The locations of these values have to be dependent on the input argument in some way. A straightforward solution is to use arrays. Given a sub-routine Subr with n input arguments I1, I2, ..., In, we organize all possible values for I1 in a linear order (array).
Likewise, we do this with the input values for I2, ..., In. The address of the first element in each sequence is precomputed and made accessible to the sub-routine; these are the so-called pointers to their arrays. Then we redesign Subr to have only a single input argument. This value is simply an offset value that, when Subr is invoked, is added to all pointers, resulting in the addresses of the currently required operands. Using the indirect load instruction, these operands can be read and processed. Analogously, the result of the final sub-routine is stored into the correct array position using the indirect store instruction. With this structure, we could efficiently design the step function of MD5 and the reverse step function in ACS. As offset value, we use the current step number. Moreover, we were able to construct a sub-routine that computes several step function calls successively. Given a starting step a and a final step b, it computes all step function evaluations for values from a to b − 1, each right after the other. As this sub-routine can be applied at least six times in a single algorithm evaluation, we believe it saves a considerable amount of program words. All these improvements could justify the costs for two additional instructions, LDI and STI, in the instruction set.
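The single-offset parameter passing can be sketched like this. All names and the dummy step operation are illustrative; the thesis' actual step function is the MD5 step.

```python
# Sketch of parameter passing via base pointer + offset: every formal
# argument lives in an array indexed by the step number, and the step number
# is the sub-routine's only explicit argument.
registers = [1, 2, 3, 4, 5]       # array for argument I1, indexed by step
constants = [10, 20, 30, 40, 50]  # array for argument I2, indexed by step

def step_subroutine(step: int) -> int:
    """'step' is the only explicit parameter; all operands are fetched via
    base pointer + offset, as with the indirect load instruction LDI."""
    op1 = registers[step]         # LDI via pointer(registers) + step
    op2 = constants[step]         # LDI via pointer(constants) + step
    result = (op1 + op2) & 0xFFFFFFFF
    registers[step] = result      # STI back into the array
    return result

# Iterative invocation from step a to b-1, as in the multi-step sub-routine:
for step in range(1, 4):
    step_subroutine(step)
print(registers)  # [1, 22, 33, 44, 5]
```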

Figure 8.2: Instruction format

Figure 8.3: Default implementation of RL1

8.3.8 Instruction Format and Interpretation of Address Field

The instruction length in µMD (adjusted for CS) is 17 bits. Each instruction consists of two fields, see Figure 8.2. The first field is the 4-bit operation code (opcode). The second is the 13-bit address field, which specifies the operand location. In all except three instructions, namely RL1, RET and NOT, the address field is interpreted. The interpretation strongly depends on the corresponding opcode. In general, there are two possibilities: When the opcode is JMP, JE (while the z-flag is set to '0'), JNE (while the z-flag is not set to '0') or CALL, the address field is interpreted as a target (program) address to which the program counter is permanently changed. This means that µMD continues the program execution at the address specified in the address field. Otherwise, the address field is interpreted as a reference to a memory location which is used to read or store a current operand. On a load operation, the program counter is temporarily ignored so that the address bus is loaded directly with the instruction's address field. This requests the corresponding operand from the memory, which is processed by the ALU in a second cycle. Simultaneously, the incremented program counter is loaded again onto the address bus for fetching new instructions. This second instruction subset can further be divided according to the applied addressing mode, see Section 8.3.4. When RL1, NOT or RET are encountered, the address field is simply ignored. By default, our assembler then sets all its bits to zero, although any other bit combination is possible as well.
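The 17-bit word layout can be encoded and decoded in a few lines. This sketch assumes only the field widths stated above; the opcode value follows Table 8.1, the address value is arbitrary.

```python
# 17-bit instruction word: 4-bit opcode followed by a 13-bit address field.
def encode(opcode: int, addr: int = 0) -> int:
    assert 0 <= opcode < 16 and 0 <= addr < 2**13
    return (opcode << 13) | addr

def decode(word: int):
    return word >> 13, word & 0x1FFF

LDA = 0b0100
word = encode(LDA, 0x123)
print(f"{word:017b}")   # 4 opcode bits followed by 13 address bits
print(decode(word))     # (4, 291)
```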

Figure 8.4: Default implementation of NOT

Figure 8.5: Default implementation of RET

8.3.9 Execution State

The execution state of µMD is determined by all register values. In particular, these are the values of the program counter (13 bits), the instruction register (17 bits), the accumulator (32 bits), the z-flag (1 bit), and the hardware stack (39 bits for µMD1 and 26 bits for µMD2). Altogether this makes up a state size of 102 bits and 89 bits, respectively.

8.3.10 Instruction Set

Each instruction of the instruction set influences the state of µMD; at least one register is changed on instruction execution. All boolean, bit-rotation and arithmetic instructions use the accumulator register as an implicit operand. In the case of binary operations, a second operand has to be loaded from the memory, requiring an additional clock cycle. Categorized by the number of cycles required for completion, there are three distinct types of instructions in the instruction set. The instruction set of µMD comprises two instructions that need just one cycle to complete, namely NOT, which computes the logical complement of the accumulator, and RL1, which rotates the bits of the accumulator one position to the left. Both instructions are monadic, using the accumulator as operand. The majority of the instructions of µMD require two clock cycles to complete. In particular, these instructions are:

• the arithmetic commands ADD and SUB for modular addition and subtraction; • the logical operators AND, OR and XOR for binary boolean operations; • the memory transfer operations LDA and STA for absolute read and write transfers from and to the memory;

• the CALL instruction to initiate a sub-routine call; • and the branch operations JMP (unconditional jump), JE (jump if z-flag is set to ’0’) and JNE (jump if z-flag is not set to ’0’) for unconditional and conditional branches.

In µMD2, RL2 also requires two clock cycles. It is used to rotate bits a variable number of positions to the left. Finally, the three-cycle instructions of µMD are RET for sub-routine returns and LDI and STI for indirect memory read and write operations. RET is the only instruction that does not require any operand, neither explicitly nor implicitly. Table 8.1 gives an overview of the instruction set of µMD.
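How these instruction types interact can be illustrated with a tiny software model of a µMD-like accumulator machine. This is a Python model for exposition only, covering a small subset of the instruction set (LDA, ADD, STA, JMP); it is not the VHDL design.

```python
# Tiny model of an accumulator machine: memory is a dict of 32-bit words,
# each program entry is an (opcode, address) pair.
MASK32 = 0xFFFFFFFF

def run(program, mem, max_steps=100):
    a, pc = 0, 0
    for _ in range(max_steps):
        if pc >= len(program):
            break
        op, addr = program[pc]
        pc += 1
        if op == "LDA":          # load accumulator from absolute address
            a = mem[addr]
        elif op == "ADD":        # modular 2^32 addition with a memory word
            a = (a + mem[addr]) & MASK32
        elif op == "STA":        # store accumulator to absolute address
            mem[addr] = a
        elif op == "JMP":        # unconditional jump
            pc = addr
    return mem

mem = {0: 7, 1: 35, 2: 0}
prog = [("LDA", 0), ("ADD", 1), ("STA", 2)]
print(run(prog, mem)[2])  # 42
```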

8.3.11 Processor Structure

µMD's inner structure is shown in Figure 8.6. In addition to the information given above, in the following we describe the building blocks of µMD in more detail.

8.3.11.1 Accumulator and z-flag

The accumulator (A) is the central working register of µMD. In the computation process, values are loaded from the memory into A. Here, they are processed by the ALU, and finally saved back to the memory. For most operations A is an implicit (first) operand. Binary operations require a second operand. When loaded onto the data bus by the memory device, this second operand and A are directly processed by the ALU. The accumulator is controlled by the control unit using the signals ACCUOUT and LDACCU. ACCUOUT is used to put the content of A on the data bus. LDACCU is used to load the accumulator from the ALU. As described above, the z-flag holds the status value of the last data computation in the ALU. This value is retrieved by the control unit.

Opcode  Value  z-flag            Cycles  Description
RL1     0000   not altered       1       Rotate A's bits one position to the left
RL2     0000   not altered       2       Rotate A's bits to the left; the rotation width is found at the specified memory address
STA     0001   not altered       2       Store A to absolute memory or I/O address
STI     0010   not altered       3       Store A to indirect memory or I/O address
LDI     0011   not altered       3       Load A from indirect memory or I/O address
LDA     0100   not altered       2       Load A from absolute memory or I/O address
ADD     0101   possibly altered  2       Add (mod 2^32) the specified memory word to A
SUB     0110   possibly altered  2       Subtract (mod 2^32) the specified memory word from A
OR      0111   possibly altered  2       Compute logical OR of A and the specified memory word
AND     1000   possibly altered  2       Compute logical AND of A and the specified memory word
XOR     1001   possibly altered  2       Compute logical XOR of A and the specified memory word
JMP     1010   not altered       2       Jump to specified address
JE      1011   not altered       2       Jump to specified address if z-flag is set to '0'
JNE     1100   not altered       2       Jump to specified address if z-flag is not set to '0'
CALL    1101   not altered       2       Push incremented program counter onto the stack and jump to specified address
RET     1110   not altered       3       Jump to address stored in top of stack; pop top of stack
NOT     1111   possibly altered  1       Compute logical NOT of A

Table 8.1: Instruction set

8.3.11.2 Instruction Register and Opcode Decoder

The instruction register is used to save the address field of an instruction, the opcode decoder to save its opcode. When LDOP is set to '1', the instruction register and the opcode decoder are loaded with 17 dedicated bits of the data bus. When LDALU is set to '1', the content of the opcode decoder is loaded into the ALU and the control unit. Whether the instruction register is actually used to address the memory device depends on the address multiplexer.

8.3.11.3 Address Multiplexer

Using the address multiplexer, the control unit controls whether the instruction register or the program counter is loaded onto the address bus. Typically, the first case occurs when additional operands have to be loaded. The second case means that the next program words are loaded or a jump is initiated. The address multiplexer is controlled using the signal PCOUT.

8.3.11.4 Program Counter

The program counter always points to the next program word. When loaded onto the address bus, it initiates the transfer of the referenced program word to the instruction register and the opcode decoder. There are two control signals for the program counter. When LDPC is set to '1', the program counter is loaded with the output of the instruction multiplexer. This is used for branches as well as sub-routine calls and returns. When INC is set to '1', the program counter is incremented, thus referencing the next program word in the program code. Such a case typically occurs in the last cycle of an instruction, in which a new instruction has to be loaded for processing.

8.3.11.5 Instruction Multiplexer

The instruction multiplexer controls whether the new program counter is loaded from the instruction register or the stack. By default LDSTACK is disabled and the instruction register is loaded. Only when a RET instruction is encountered, LDSTACK is enabled.

8.3.11.6 Stack

In µMD1 the stack contains three 13-bit registers, in µMD2 just two. The stack is controlled using the signals DO and PUSH. When DO='1' and PUSH='1', the program counter is pushed onto the stack; on active DO and inactive PUSH it is popped. On DO='0' the push and pop functionality is simply disabled.

Figure 8.6: Inner structure of processor

8.3.11.7 Control Unit

Besides CLK and Reset, the control unit processes just two input signals, ZSTATE (indicating the status of the z-flag) and the 4-bit content of the opcode decoder. Dependent on this input, the control unit's inner three-state finite state machine computes the values of all eleven output signals.

8.3.11.8 Arithmetic Logical Unit (ALU)

The arithmetic logical unit is the most important part of the data path. Here, all arithmetic and logical instructions are implemented in hardware. The ALU can use up to two operands simultaneously, the first coming from the accumulator and the second directly from the memory device via the data bus. The ALU is the location where the operations on data are executed. All other components are supplementary: they support loading the correct operands into the ALU for computation and control which instruction should be executed.

Figure 8.7: Collision generator: overview

8.4 Collision Search Unit

8.4.1 Introduction

µCS is our final integrated circuit for collision search. Roughly speaking, it consists of a single µMD unit, additional memory and I/O logic. To start a computation, it just needs an initial seed for the integrated PRNG. When a collision is found, the corresponding message words are returned. Except for the PRNG initialization phase and the collision output sequence, no further I/O communication is required. This decisively supports parallelization approaches, making the overhead for additional control logic negligible.

8.4.2 Input and Output Pins

The input and output pins of µCS are depicted in Figure 8.7. To support parallelization, we designed µCS with as few pins as possible. They are suited for the communication protocol as described in Section 8.4.3. µCS has two input signals, CLK and Reset. The first is the clock signal, the second resets all registers asynchronously. For I/O communication, µCS can be connected to a 32-bit bidirectional synchronous I/O bus. To notify the I/O device whether the output words on the I/O bus are actually valid, µCS has a single output signal called OK.

8.4.3 Communication Protocol

The current communication protocol is very simple. In parallelized scenarios, we do not expect many collisions to occur simultaneously. Therefore, we do not employ any logic for collision detection on the I/O bus.

Imagine a single µCS unit that is connected to an I/O device via the I/O bus and the OK signal. Additionally, let the I/O device control the Reset signal of µCS. In the first step of the collision search, the I/O device loads a seed value for the PRNG of µCS onto the I/O bus. Then it sets the Reset signal active. In the next cycle it is disabled again. The program counter of µMD is thus reset to the first position in the program code, the OK signal is disabled and the I/O port of µCS is disconnected. The first instructions in the program code let the I/O logic of µCS read in the seed value from the I/O bus, directly storing it into the I/O register. The next instructions load this value to A and afterwards to the PRNG register. The continuous pseudo-random number generation is initialized and started. Then µCS tells the I/O device that the seed value has successfully been read by temporarily setting OK active. Hereafter, µCS continues to execute ACS by starting the first collision search computation. As a consequence, the I/O device disconnects from the I/O bus, waiting for another reaction of µCS. When a near-collision is found, µCS enables the OK signal. Consecutively, it returns all sixteen 32-bit registers that constitute the near-collision by loading them serially onto the I/O bus. Afterwards, it disables OK, disconnects the I/O port and continues computation in the second part of the algorithm, searching for an adequate pseudo-collision. In the meantime, the I/O device has received the output register values. It remains in its passive state. When the required pseudo-collision is found, µCS simply repeats the output process: OK is activated and the corresponding register values are consecutively returned. Finally, it disconnects the I/O port again and invalidates the OK signal. Our current implementation lets µCS subsequently return to the position in the program code where the collision search begins.
To sum up, after an initial handshake-like protocol step, the connected I/O device behaves passively, meaning that it does not alter any signal anymore. This fact is very advantageous for parallelization attempts.

8.4.4 I/O Control

To control the I/O logic and the associated PRNG, µCS uses memory-mapped I/O. This means that the I/O behavior is not controlled by a dedicated control logic, but that all functions can be invoked using the basic load and store operations as defined in the instruction set. The I/O logic comprises three registers, which can be accessed by µMD like usual RAM words. These registers are: the 32-bit register of the PRNG, the 32-bit I/O register for receiving and sending data from and to the I/O bus, and the 3-bit command register, into which µMD can load I/O instructions. These instructions are executed by the I/O logic in parallel to the computations of µMD.

Command  Description
000      Disconnect I/O port and set OK to '0' (active-high)
001      Load value from I/O bus to I/O register
010      Write I/O register to I/O bus
011      Disconnect I/O port, set OK to '0' and compute next pseudo-random number
100      Set OK to '1' (active-high)

Table 8.2: Commands for controlling pseudo-random number generation and I/O communication

8.4.4.1 Command Register

Currently, there are five distinct instructions, which provide pseudo-random number generation and bus communication as defined by the communication protocol in Section 8.4.3. These instructions, together with their descriptions, can be found in Table 8.2.
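The interplay of the three memory-mapped registers and the commands of Table 8.2 can be illustrated with a small behavioral model. This is a sketch, not the thesis' VHDL: the register addresses and the LFSR feedback taps are assumptions chosen for illustration only.

```python
# Behavioral sketch of the memory-mapped I/O logic of µCS.
# PRNG_ADDR/IO_ADDR/CMD_ADDR are hypothetical addresses, not the real map.
PRNG_ADDR, IO_ADDR, CMD_ADDR = 0x1800, 0x1801, 0x1802

class IOLogic:
    def __init__(self):
        self.regs = {PRNG_ADDR: 0, IO_ADDR: 0, CMD_ADDR: 0}
        self.ok = 0                # OK output signal (active-high)
        self.bus = 0               # current value on the I/O bus
        self.connected = False     # I/O port driver state

    def store(self, addr, value):
        # µMD's ordinary store; a store to CMD_ADDR triggers the I/O logic
        self.regs[addr] = value
        if addr == CMD_ADDR:
            self._execute(value & 0b111)

    def load(self, addr):
        return self.regs[addr]

    def _execute(self, cmd):
        if cmd == 0b000:                       # disconnect, OK := 0
            self.connected, self.ok = False, 0
        elif cmd == 0b001:                     # I/O register := bus
            self.connected = True
            self.regs[IO_ADDR] = self.bus
        elif cmd == 0b010:                     # bus := I/O register
            self.connected = True
            self.bus = self.regs[IO_ADDR]
        elif cmd == 0b011:                     # disconnect, OK := 0, clock PRNG
            self.connected, self.ok = False, 0
            self.regs[PRNG_ADDR] = self._lfsr(self.regs[PRNG_ADDR])
        elif cmd == 0b100:                     # OK := 1
            self.ok = 1

    @staticmethod
    def _lfsr(s):
        # one step of a 32-bit LFSR; the tap positions are an assumption
        fb = ((s >> 31) ^ (s >> 21) ^ (s >> 1) ^ s) & 1
        return ((s << 1) | fb) & 0xFFFFFFFF
```

Note that the commands themselves are just values stored to an address; from µMD's point of view there is no dedicated I/O instruction, exactly as the text describes.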

8.4.4.2 Pseudo-Random Number Generator and Partitioning of the Search Space

CS1 uses a PRNG that is based on integer multiplication. Implemented in hardware, this is comparatively expensive in both speed and area. We therefore decided to use another strategy. For pseudo-random number generation we implemented a maximal-period LFSR. It can hold 2^32 − 1 different values and is initialized by a single 32-bit seed from the I/O bus. The length of the LFSR must be adapted to the requirements of the implemented collision search algorithm. More precisely, it is chosen based on the average amount of pseudo-random numbers required to find a collision. In our case, there are just 2^32 − 1 distinct ways to start the algorithm.

A very useful property of LFSRs is their simple description using a matrix M. Given a k-bit LFSR with an initial state s = (s_k, s_{k−1}, ..., s_1)^T, clocking the LFSR n times simply means that the new state s' = (s'_k, s'_{k−1}, ..., s'_1)^T can be obtained by multiplying s with M^n:

s' = M^n · s    (8.1)

where M is a k × k matrix. Except for the first line, the coefficients of M are regular.

M_{i+1,i} = 1    with 1 ≤ i ≤ k − 1    (8.2)

M_{i+1,j} = 0    with 1 ≤ i ≤ k − 1, 1 ≤ j ≤ k, j ≠ i

For maximal-period LFSRs, the coefficients of the first line must be the coefficients of a primitive polynomial p(x) of degree k over GF(2) [43].

Figure 8.8: LFSR with full period

p(x) = b_k x^k + b_{k−1} x^{k−1} + ... + b_1 x + b_0    (8.3)

        | b_k  b_{k−1}  b_{k−2}  ...  b_3  b_2  b_1 |
        | 1    0        0        ...  0    0    0   |
        | 0    1        0        ...  0    0    0   |
M =     | 0    0        1        ...  0    0    0   |
        | ...                                       |
        | 0    0        0        ...  1    0    0   |
        | 0    0        0        ...  0    1    0   |

Using square-and-multiply algorithms for exponentiation [31], M^n can be computed efficiently. When we choose n = ⌊(2^k − 1)/r⌋, we can easily partition all possible LFSR values into r distinct sets of consecutive values. Each value M^n · s, M^{2n} · s, ..., M^{(r−1)n} · s can then be used as an initialization value for a single PRNG. This is advantageous when using several µCS units in parallel, to avoid that unsuccessful searches are repeated and computation power is wasted.
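The matrix description of Equation (8.1) and the partitioning can be checked with a toy model. This sketch uses a 4-bit LFSR with the primitive polynomial p(x) = x^4 + x^3 + 1 for illustration (the real design uses k = 32); all arithmetic is over GF(2).

```python
# Toy model of Equation (8.1) and the search-space partitioning.
k = 4
b = [1, 0, 0, 1]   # first matrix row (b_k ... b_1), here for x^4 = x^3 + 1

def step(v):
    # one LFSR clock: the feedback bit enters at the top, the rest shifts down
    fb = sum(bi & vi for bi, vi in zip(b, v)) & 1
    return [fb] + v[:-1]

def companion(b):
    k = len(b)
    M = [[0] * k for _ in range(k)]
    M[0] = list(b)                       # first line: polynomial coefficients
    for i in range(1, k):
        M[i][i - 1] = 1                  # M_{i+1,i} = 1, the shift part
    return M

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][t] & B[t][j] for t in range(n)) & 1
             for j in range(n)] for i in range(n)]

def mat_vec(A, v):
    return [sum(a & x for a, x in zip(row, v)) & 1 for row in A]

def mat_pow(M, n):
    # square-and-multiply exponentiation over GF(2), as suggested by [31]
    R = [[int(i == j) for j in range(len(M))] for i in range(len(M))]
    while n:
        if n & 1:
            R = mat_mul(R, M)
        M = mat_mul(M, M)
        n >>= 1
    return R

M = companion(b)
seed = [0, 0, 0, 1]

# n = floor((2^k - 1) / r): r start states partition the full period
r = 3
n = (2 ** k - 1) // r
starts = [mat_vec(mat_pow(M, i * n), seed) for i in range(r)]
```

One matrix-vector product reproduces one clock of the shift register, and the r start states computed via M^{i·n} · s are pairwise distinct, so parallel µCS units never repeat each other's searches.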

Figure 8.9: Structure of collision generator

8.4.5 Structure

For collision finding purposes, µMD is connected via a 32-bit bidirectional bus to a memory device which, in turn, consists of a 12,672-bit constant ROM, a 31,501-bit program ROM, and a 6,336-bit RAM for variables. Additionally, the memory device comprises the I/O controller and the pseudo-random number generator. Depending on the address bus value and RNW, the memory controller generates the appropriate output signals for accessing the single memory units. The inner structure of µCS is shown in Figure 8.9.

8.4.6 Address Space

The address space is divided into segments of equal size. The two most significant bits of the address bus are used for chip selection purposes.

Memory/IO      Virtual Address Space           Physical Address Space
Program ROM    0 ... (2^11 − 1)                0 ... 1852
Constant ROM   2^11 ... (2^12 − 1)             2^11 ... (2^11 + 395)
RAM            2^12 ... (2^12 + 2^11 − 1)      2^12 ... (2^12 + 197)
IO Registers   (2^12 + 2^11) ... (2^13 − 1)    (2^12 + 2^11) ... (2^12 + 2^11 + 2)

Table 8.3: Virtual and physical address space

As a result, the virtual address space ranges from 0 to 2^13 − 1, whereas the physical address space is much smaller. Table 8.3 shows the address space in more detail.
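The chip-select decoding by the two most significant address bits can be sketched in a few lines (a behavioral model of the decoding rule, not the thesis' VHDL memory controller):

```python
# Chip selection: the two MSBs of the 13-bit virtual address pick the unit.
def chip_select(addr):
    assert 0 <= addr < 2 ** 13
    top = addr >> 11                  # the two most significant bits
    return {0b00: "program ROM",
            0b01: "constant ROM",
            0b10: "RAM",
            0b11: "IO registers"}[top]
```

The segment boundaries produced by this rule match the virtual address ranges of Table 8.3.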

8.5 Parallelization

8.5.1 Introduction

µCS can easily be parallelized. Figure 8.11 depicts the corresponding setting. In addition to the µCS units, a parallelized application requires just a single control unit called I/O Admin, trivial counter units (CNT), and simple OR units. All these units can be connected in a bus-like manner.

8.5.2 Count Unit (CNT)

The structure of CNT is very simple (see Figure 8.10). Each CNT unit has two output signals, O1 and O2. O1 is connected to a µCS unit, O2 to the next CNT unit. A single CNT unit implements a two-state finite state machine. In the initial state, the input signal is directly connected to O1 and O2 is disabled. So, when the input signal is active, CNT sets O1 active, too. When the input signal is subsequently disabled again, CNT changes into its second and final state. In this state, O1 is simply set to '0'; in contrast, O2 is now directly connected to CNT's input. Arranged linearly as in Figure 8.11, several CNT units can thus help to reset multiple µCS units, depending on how often the Reset output signal of I/O Admin has been toggled. Toggling this signal repeatedly resets all µCS units one after another.
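The two-state machine and the resulting "one pulse per unit" behavior of a CNT chain can be modeled as follows (an illustration of the mechanism described above, not the actual circuit description):

```python
# Behavioral model of a CNT unit and a linear chain of them.
class CNT:
    def __init__(self):
        self.passing = False    # False: input drives O1; True: input drives O2
        self.prev = 0

    def step(self, inp):
        if not self.passing:
            o1, o2 = inp, 0
            if self.prev == 1 and inp == 0:   # input deasserted again
                self.passing = True           # enter second (final) state
        else:
            o1, o2 = 0, inp
        self.prev = inp
        return o1, o2

def drive_chain(chain, reset_wave):
    # Feed a Reset waveform into the first CNT and record which µCS unit
    # (i.e. which O1 line) sees an active pulse.
    pulsed = []
    for inp in reset_wave:
        sig = inp
        for idx, cnt in enumerate(chain):
            o1, sig = cnt.step(sig)
            if o1:
                pulsed.append(idx)
    return pulsed
```

Each Reset pulse is absorbed by the first CNT unit still in its initial state, so the i-th pulse reaches exactly the i-th µCS unit.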

8.5.3 Protocol

The communication protocol is very simple. We assume that all µCS units have an inactive OK signal and are disconnected from the I/O bus. The output signals of I/O Admin are all disabled. The protocol consists of two phases, the initialization phase and the computation phase. The search space has been partitioned properly (into n partitions, see Section 8.4.4.2), so that I/O Admin knows all required PRNG seeds.

Figure 8.10: A single CNT unit

In the initialization phase, I/O Admin puts the first PRNG seed on the I/O bus. Then it toggles the Reset output signal from active to inactive. This activates the Reset input of the first µCS unit and is subsequently acknowledged by a pulse on the OK signal. Then, I/O Admin puts the second PRNG seed on the I/O bus. Again, it additionally toggles the Reset output signal from active to inactive. Since CNT 1 now simply connects its input to O2, this time the second CNT unit is addressed. The corresponding µCS unit performs the same steps and acknowledges the successful reception of the PRNG seed with a pulse on the OK signal. This process is repeated until all µCS units have successfully received their PRNG seeds and started computation.

In the computation phase, the I/O Admin simply remains passive and disconnects from the I/O bus. As mentioned before, we do not expect more than one collision to be found simultaneously. Therefore, this phase is very simple. When any µCS unit has found a collision, it is returned via the I/O bus and saved by I/O Admin. It is the responsibility of the I/O Admin to perform post-processing steps on the received data.

Figure 8.11: Parallelized application of collision search unit

9 Analysis Results

9.1 Introduction

In this chapter we present a detailed performance analysis for µMD and µCS. With these results, we also estimate the performance of an implementation of a SHA-1 collision search algorithm.

9.2 Area Analysis

Synthesizing µCS1 for standard cells (UMC 130 nm [1]) requires 0.955812 mm² of chip area. In contrast, µMD1 can be realized with only 0.022341 mm². This is just 2.33% of the chip area of the full µCS1 unit. For µCS2 and µMD2 the ratio is similar. µCS2 requires 0.959618 mm² of area. Again, with a required chip area of only 0.026626 mm², the processor µMD2 is comparably small. This yields about 2.77% of the total chip area of µCS2. As we expected, the vast majority of chip area is used for the implementation of the memory logic.

For comparison, we also tested collision search on a standard PC processor. We used a Pentium 4, 2.0 GHz machine (Northwood core) with approximately 55 million transistors built in 130 nm circuit technology [14]. Its die size is 146 mm².

9.3 Timing Analysis

9.3.1 Introduction

As mentioned in Section 7.5, we define the time T as the average time for a single unit to find a collision. Unfortunately, µMD and µCS are not available for a practical test run. For µCS this measure is therefore computed from the average number C of clock cycles required to find a collision and the corresponding frequency f. Instead of f we can also use its reciprocal, the clock cycle time t:

T = C / f = C · t

9.3.2 Frequency

As expected, both circuits run at a noticeably lower frequency than the Pentium 4. µCS1 can be run at a frequency of approximately 115.2 MHz. Interestingly, this is only slightly faster than µCS2, which can be operated at a frequency of about 102.9 MHz.

9.3.3 Cycles per Collision

As mentioned in Section 8.2, the execution of ACS in the simulation model is too slow to obtain reliable values. In the following, we therefore estimate the average number of clock cycles needed to find a collision. As a basis we use the average number of clock cycles needed by the Pentium 4 standard PC. Although directly implemented in assembler, we believe the required average number of clock cycles for the execution of ACS to be higher than that of CS. This is due to three major points.

First, almost every instruction in µMD uses two clock cycles instead of just one. We assume that this fact roughly doubles the number of required clock cycles compared to the Pentium 4.

Secondly, during a memory access the ALU of µMD is idle; in the meantime, it cannot compute further operands. In contrast, within algorithm execution on standard PCs, load and store operations are far less frequent. The number of registers for storing numerous different operands is very high, so that operations can be executed directly on registers. If each operand were to be loaded and subsequently stored back to memory, one could assume that this would nearly triple the overall number of clock cycles of ACS compared to standard PCs. However, we assume that the actual number is smaller: in ACS there are many situations in which only the final result of a function and few intermediate results have to be stored back to memory. Such cases occur when the execution of a function cannot be parallelized (see Section 7.3.2). As shown before, these situations are very frequent, i.e. the result of the preceding operation is needed as an operand to the current operation. Luckily, this result is then available in the accumulator, making explicit load and store operations redundant. Furthermore, in some cases, for example in bit condition evaluations, no operand has to be stored at all. Altogether, we assume a factor of two for the overhead due to additional load and store instructions.
Thirdly, ACS requires additional clock cycles for calling sub-routines. This includes the loading and assigning of arguments. Again, we believe that this task nearly doubles the number of required clock cycles.

Architecture    C                  f             t         T
µCS1            960·10^9 cycles    ≈115.2 MHz    8.68 ns   8332.8 s
µCS2            480·10^9 cycles    ≈102.9 MHz    9.71 ns   4660.8 s
µMD1            960·10^9 cycles    ≈303.9 MHz    3.29 ns   3158.4 s
µMD2            480·10^9 cycles    ≈228.8 MHz    4.37 ns   2097.6 s
Pentium 4       60·10^9 cycles     2 GHz         0.5 ns    30 s

Table 9.1: Time analysis - average time to find a collision

Architecture    T          A               P = A·T
µMD1            3158.4 s   0.022341 mm²    70.6
µMD2            2097.6 s   0.026626 mm²    55.9
Pentium 4       30 s       146 mm²         4380

Table 9.2: Processor performance

Altogether, we estimate ACS (executed on µMD2) to require roughly eight times more clock cycles than CS. Executed on µMD1, we believe this number to be greater, since parameterized sub-routines for rotation have to be constructed using single-bit rotations repeatedly. As such rotations occur quite often in the execution path, namely each time a step operation or an inverse step operation is invoked, we assume that the number of required clock cycles for the execution of ACS on µMD1 is additionally increased by a factor of 2. This results in an overall clock cycle number that is 16 times worse than an execution on a Pentium 4.
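The factor arithmetic above can be reproduced in a few lines of Python as a sanity check of Table 9.1. The factors and cycle times are the estimates from this chapter, not measurements:

```python
# Back-of-the-envelope check of the cycle estimates and T = C * t.
C_P4 = 60e9                      # avg. cycles per collision on the Pentium 4

factor_uMD2 = 2 * 2 * 2          # two-cycle instructions, load/store, calls
factor_uMD1 = factor_uMD2 * 2    # single-bit rotations on uMD1

C = {"uMD1": factor_uMD1 * C_P4,     # 960e9 cycles (also for uCS1)
     "uMD2": factor_uMD2 * C_P4}     # 480e9 cycles (also for uCS2)

t = {"uMD1": 3.29e-9, "uMD2": 4.37e-9,     # clock cycle times in seconds
     "uCS1": 8.68e-9, "uCS2": 9.71e-9}

T = {name: C["uMD1" if name in ("uMD1", "uCS1") else "uMD2"] * t[name]
     for name in t}              # T = C * t, as defined in Section 9.3.1
```

The resulting T values agree with the rightmost column of Table 9.1.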

9.4 Performance Results

Assuming equal production constraints, meaning the same price per chip area (see Equation 9.1), each of our solutions is much more effective than a comparable Pentium 4 architecture. The higher speed of a Pentium 4 processor is achieved only with much more area. The area-time product P reflects this fact (see Section 7.5.1). Table 9.2 compares the performance characteristics of µMD1 and µMD2 with those of a standard Pentium 4 processor. It is obvious that µMD1 and µMD2 both have better characteristics than the Pentium 4 processor. The performance of µMD1 for collision search is about 62 times better than the Pentium's. µMD2 has the best characteristics of all results: its performance is approximately 78.4 times better.

It is also interesting to compare the Pentium 4 with the full collision search units µCS1 and µCS2. One can see that a single unit of µCS1 is just 1.82 times less efficient than a Pentium 4 processor, although it includes all data required to immediately start a

Architecture    T          A               P = A·T
µCS1            8332.8 s   0.955812 mm²    7964.7
µCS2            4660.8 s   0.959618 mm²    4472.6
Pentium 4       30 s       146 mm²         4380

Table 9.3: Performance of collision search units

Architecture      P         P/P_P4
Pentium 4 (P4)    4380      100 %
µMD1              70.6      1.6 %
µMD2              55.9      1.3 %
µCS1              7964.7    181.8 %
µCS2              4472.6    102.1 %

Table 9.4: Performance (P) compared to Pentium 4

collision search. A single µCS2 is even better. Its AT product is nearly as small as that of the Pentium: µCS2 is just 1.021 times less efficient than a Pentium 4 processor.

The latter numbers compare the Pentium 4 processor with our full collision search unit. They do not consider all the additional logic and equipment required to practically operate the Pentium 4. The price for this additional equipment would weigh considerably against the Pentium. When we use an off-the-shelf standard PC for parallelization, we have to consider the costs for I/O, ROMs, motherboards, fans, and additional equipment for parallelization like network cards and cables. Altogether, we believe the costs to be at least 200 €, whereas we assume the Pentium 4 2.0 GHz to have a price of roughly 50 €. As a consequence, we estimate the price Q_A per chip area to be

Q_A = 50 € / 146 mm² = 0.3425 €/mm².    (9.1)

9.5 Parallelization

Using such off-the-shelf standard PCs, the processors are connected to each other by standard network equipment. As described in Section 9.4, the costs per unit will increase four-fold. This means that for each single parallelized Pentium 4 processor one can buy four boxed ones. In other words, the parallelization overhead (O_p) is 300 %. As shown in Section 8.5, parallelization of µCS requires only a little additional logic per unit. In combination with the proposed bus-like connection, this provides an optimally scaling solution without noticeable additional costs. For our solution, we assume the

        Pentium 4       µCS1            µCS2
Q_A     0.3425 €/mm²    0.3425 €/mm²    0.3425 €/mm²
A       146 mm²         0.955812 mm²    0.959618 mm²
Q_s     50 €            0.3274 €        0.3287 €
O_p     300 %           5 %             5 %
Q_p     200 €           0.3438 €        0.3451 €
P       4380            7964.7          4472.6
T       30 s            8332.8 s        4660.8 s
R       6000            2864.8          1608.4

Table 9.5: Cost overview

Architecture      R         R/R_P4
P4 standard PC    6000      100 %
µCS1              2864.8    47.7 %
µCS2              1608.4    26.8 %

Table 9.6: Performance (R) compared to Pentium 4

overhead in area to be almost negligible. We assume that the overhead does not exceed 5% of the original area, including wires, OR units, CNT units and the I/O Admin unit (see Section 8.5). Based on these considerations, we believe that our full collision search solution is noticeably more effective for collision search than parallelized Pentium 4 processors. Our estimates are summed up in Table 9.5. Q_s reflects the price for a single standalone unit of the corresponding architecture. In contrast, Q_p is the average price for a single unit after parallelization.

To also take the distinct increases in parallelization costs into account, we redefine our performance metric. The performance R is defined as

R = T · Qp . (9.2)

The value of R reflects the amount of money that has to be spent on parallelized units of a certain architecture to find one collision per second on average. Obviously, for finding a single MD5 collision per second one has to spend 6000 € on (30) parallelized standard PCs. Assuming similar constraints for manufacturing ASICs (see Section 7.5), the price for the same performance invested in parallelized µCS1 units is 2864.8 €. With a price of 1608.4 €, µCS2 is the cheapest and also the most efficient architecture for collision search. It is almost four times more cost-efficient than the Pentium 4 standard PC architecture (see Table 9.6).
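The cost figures of Table 9.5 follow directly from Q_A, the areas, and the overheads. This sketch recomputes them; the prices and overheads are the estimates from this section, and the rounding follows the four-decimal values printed in the table:

```python
# Recomputing Table 9.5: Qs = Q_A * A, Qp = Qs * (1 + Op), R = T * Qp.
Q_A = round(50 / 146, 4)        # EUR per mm^2, Equation (9.1): 0.3425

units = {
    "P4":   {"Qs": 50.0,                      "Op": 3.00, "T": 30.0},
    "uCS1": {"Qs": round(Q_A * 0.955812, 4),  "Op": 0.05, "T": 8332.8},
    "uCS2": {"Qs": round(Q_A * 0.959618, 4),  "Op": 0.05, "T": 4660.8},
}
for u in units.values():
    u["Qp"] = round(u["Qs"] * (1 + u["Op"]), 4)   # price after parallelization
    u["R"] = round(u["T"] * u["Qp"], 1)           # R = T * Qp, Equation (9.2)
```

Each computed Q_s, Q_p and R matches the corresponding row of Table 9.5.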

Figure 9.1: Costs for equipment to find a MD5 collision in a predefined time

9.6 Estimations for SHA-1

We believe that a comparable implementation of a SHA-1 collision search algorithm in dedicated and parallelized collision search units has even better performance characteristics. This is mainly due to the fact that SHA-1 needs far fewer constants than MD5, thus radically reducing the costs for the (constant) ROM. We assume that a collision search algorithm for SHA-1 can be programmed similarly compactly. Although SHA-1 spans more steps, which is surely also reflected in the corresponding collision search algorithm, the program code cannot grow considerably larger. This is due to the fact that each sub-routine is implemented just once. When needed, it is called using only a few additional instructions (see Section 8.3.6). The average number of clock cycles required to find a collision depends primarily on the available theoretical results. Currently, attacks on SHA-1 have a complexity of about 2^63. For practical attacks, we believe this number to be still too large.

When implementing SHA-1 collision search algorithms, we strongly recommend using µMD2, since SHA-1 makes use of rotations much more often than MD5. The area-expensive operation RL2 can easily be optimized for use in SHA-1. In contrast to MD5, the number of distinct bit rotations which may occur in step function invocations or in reverse step function invocations is noticeably limited: SHA-1 makes use of just six of the altogether 32 possible bit rotations. This probably further increases the

frequency of µCS2 and decreases its size.

10 Discussion

10.1 Summary

In this work we analyzed the hardware requirements of current and future collision search algorithms for hash functions of the MD4 family. We used our results to develop an appropriate hardware platform which executes collision search algorithms about four times more cost-efficiently than standard PC architectures. The heart of our design is a very small microprocessor, µMD, with only sixteen instructions. At the same time, it provides very effective means to support program code reuse, which greatly helps to keep the size of our overall collision search unit µCS small. In the context of MD4-family hash functions, µMD is general-purpose, meaning that it is appropriate for the execution of all MD4-family hash functions and also of all corresponding current and future collision search algorithms. In contrast to standard PCs, the final collision search unit needs only very little additional logic. This reduces its price and greatly eases parallelization approaches. We believe that our design approach is much better suited for collision search than standard PCs. When money is spent on collision search, our design, massively parallelized, is nearly four times more cost-efficient than parallelized P4 standard PCs.

10.2 Outlook

In the end, there are some questions and open issues for further research.

• Improving µMD by adding new hardware features. Only those hardware features should finally be implemented which prove valuable for collision search by decreasing the overall area-time product.

• Substitution of the estimates concerning the required number of clock cycles per collision with experimental values. One possible approach is to appropriately profile the assembler implementation and to develop a further collision search program in C which explicitly counts the clock cycles needed by our assembler code.

• Analysis of the requirements for dedicated logic to appropriately parallelize standard processors. Instead of buying full standard PCs and using their network

capabilities, a custom parallelization board with space for several processors may be much more cost-efficient.

• Optimization of the design process.

• Implementation of a C compiler that efficiently maps hash function C code to the machine code of µMD. This would support estimates of the performance of new collision search algorithms on a µMD-based architecture.

A Bibliography

[1] http://www.europractice.imec.be/europractice/on-line-docs/prototyping/ti/ti_VST_UMC13_logic.html.

[2] M. Bellare and T. Kohno. Hash Function Balance and its Impact on Birthday Attacks, November 2002.

[3] M. Bellare and P. Rogaway. Introduction to Modern Cryptography, 2005. Available for download at http://www-cse.ucsd.edu/users/mihir/cse207/classnotes.html.

[4] E. Biham and R. Chen. Near-Collisions of SHA-0. In Advances in Cryptology — CRYPTO 2004, volume 3152 of Lecture Notes in Computer Science, pages 290–305. Springer Verlag, August 2004.

[5] E. Biham and A. Shamir. Differential Cryptanalysis of DES-like Cryptosystems. In Advances in Cryptology — CRYPTO '90, volume 537 of Lecture Notes in Computer Science. Springer Verlag, June 1990.

[6] E. Biham and A. Shamir. Differential Cryptanalysis of the Data Encryption Standard, 1993.

[7] C. De Cannière and C. Rechberger. Finding SHA-1 Characteristics: General Results and Applications, 2006.

[8] F. Chabaud and A. Joux. Differential Collisions in SHA-0. In Advances in Cryptology — CRYPTO '98, volume 1462 of Lecture Notes in Computer Science, pages 56–71. Springer Verlag, August 1998.

[9] M. Daum. Cryptanalysis of Hash Functions of the MD4-Family. PhD thesis, Ruhr-Universität Bochum, 2005. Available for download at http://www.cits.rub.de/MD5Collisions/.

[10] M. Daum and S. Lucks. The Story of Alice and Bob. Presented at the rump session of EUROCRYPT 2005. Cryptology ePrint Archive, Report 2005/067, May 2005. Online at http://www.cits.rub.de/imperia/md/content/magnus/rump_ec05.pdf.

[11] Gesetz über Rahmenbedingungen für elektronische Signaturen und zur Änderung weiterer Vorschriften, February 2001. http://www.dud.de/dud/documents/sigg010214.pdf.

[12] Gesetz zur digitalen Signatur (Signaturgesetz – SigG). Bundesgesetzblatt I, pages 1870 and 1872, or http://www.regtp.de/tech_reg_tele/start/in_06-02-01-00-00_m/index.html, June 1997.

[13] Verordnung zur digitalen Signatur (Signaturverordnung – SigV), October 1997. http://www.regtp.de/tech_reg_tele/start/in_06-02-01-00-00_m/index.html.

[14] J. Pelzl et al. Area-Time Efficient Hardware Architecture for Factoring Integers with the Elliptic Curve Method. Volume 152 of IEE Proceedings on Information Security, Special Issue on Cryptographic Algorithms and Architectures for System-on-Chip, pages 67–78, October 2005.

[15] M. Gebhardt, G. Illies, and W. Schindler. A Note on the Practical Value of Single Hash Collisions for Special File Formats. In Sicherheit 2006, volume P-77 of Lecture Notes in Informatics (LNI), pages 333–344, August 2006.

[16] P. Hawkes, M. Paddon, and G. G. Rose. Musings on the Wang et al. MD5 Collision, 2004.

[17] Adobe Systems Incorporated. TIFF Revision 6.0. Fifth edition, June 1992. Available for download at http://partners.adobe.com/public/developer/en/tiff/TIFF6.pdf.

[18] Adobe Systems Incorporated. PostScript Language Reference. Addison-Wesley Publishing Company, third edition, February 1999. Available for download at http://partners.adobe.com/public/developer/en/ps/PLRM.pdf.

[19] Adobe Systems Incorporated. PDF Reference. Fifth edition, November 2004. Available for download at http://partners.adobe.com/public/developer/en/pdf/PDFReference16.pdf.

[20] D. Joščák. Finding collisions in cryptographic hash functions. Master's thesis, Univerzita Karlova v Praze, 2006.

[21] A. Joux. Multicollisions in Iterated Hash Functions. In Advances in Cryptology — CRYPTO 2004, volume 3152 of Lecture Notes in Computer Science, pages 303–316. Springer Verlag, August 2004.

[22] J. Kelsey and B. Schneier. Second Preimages on n-bit Hash Functions for Much Less than 2^n Work. Cryptology ePrint Archive, Report 2004/304, 2004. Available for download at http://eprint.iacr.org/.

[23] V. Klima. Finding MD5 Collisions on a Notebook PC Using Multi-message Modifications. Cryptology ePrint Archive, Report 2005/102, 2005. Available for download at http://eprint.iacr.org/.

[24] V. Klima. Project Homepage. http://cryptography.hyperlink.cz/MD5_collisions.html, 2006.

[25] V. Klima. Tunnels in Hash Functions: MD5 Collisions Within a Minute. Cryptology ePrint Archive, Report 2006/105, 2006. Available for download at http://eprint.iacr.org/.

[26] A. Lenstra and B. de Weger. On the possibility of constructing meaningful hash collisions for public keys. ACISP, 2005.

[27] A. Lenstra, X. Wang, and B. de Weger. Colliding X.509 Certificates, 2005. Available for download at http://eprint.iacr.org/.

[28] A. K. Lenstra. Further progress in hashing cryptanalysis. Presented at the Cryptographic Hash Workshop hosted by NIST, February 2005. Available for download at http://cm.bell-labs.com/who/akl/hash.pdf.

[29] J. Liang and X. Lai. Improved Collision Attack on Hash Function MD5. Cryptology ePrint Archive, Report 2005/425, November 2005. Available for download at http://eprint.iacr.org/.

[30] K. Matusiewicz and J. Pieprzyk. Finding Good Differential Patterns for Attacks on SHA-1. Cryptology ePrint Archive, Report 2004, December 2004. Available for download at http://eprint.iacr.org/.

[31] A. J. Menezes, P. C. van Oorschot, and S. A. Vanstone. Handbook of Applied Cryptography. CRC Press, Boca Raton, Florida, USA, 1997.

[32] R. Merkle. One Way Hash Functions and DES. In Advances in Cryptology — CRYPTO '89, volume 435 of Lecture Notes in Computer Science. Springer Verlag, 1990.

[33] I. Damgård. A Design Principle for Hash Functions. In Advances in Cryptology — CRYPTO '89, volume 435 of Lecture Notes in Computer Science. Springer Verlag, 1990.

[34] O. Mikle. Practical Attacks on Digital Signatures Using MD5 Message Digest. Cryptology ePrint Archive, Report 2004/356, 2004. Available for download at http://eprint.iacr.org/.

[35] National Institute of Standards and Technology (NIST), US Department of Commerce. Federal Information Processing Standard FIPS-180-0: Secure Hash Standard, May 1993. Available for download at http://itl.nist.gov/fipspubs/.

[36] National Institute of Standards and Technology (NIST), US Department of Commerce. Federal Information Processing Standard FIPS-180-1: Secure Hash Standard, April 1995. Available for download at http://itl.nist.gov/fipspubs/.

[37] National Institute of Standards and Technology (NIST), US Department of Commerce. Federal Information Processing Standard FIPS-180-2: Secure Hash Standard, August 2002. Available for download at http://itl.nist.gov/fipspubs/.

[38] J. Reichardt and B. Schwarz. VHDL-Synthese. Oldenbourg, third edition, 2003.

[39] R. Rivest. The MD4 Message-Digest Algorithm. In Advances in Cryptology — CRYPTO '90, volume 537 of Lecture Notes in Computer Science, pages 303–311. Springer Verlag, 1991.

[40] R. Rivest. The MD4 Message-Digest Algorithm. Request for Comments (RFC) 1320, 1992.

[41] R. Rivest. The MD5 Message-Digest Algorithm. Request for Comments (RFC) 1321, 1992.

[42] Y. Sasaki, Y. Naito, N. Kunihiro, and K. Ohta. Improved Collision Attack on MD5. Cryptology ePrint Archive, Report 2005/400, November 2005. Available for download at http://eprint.iacr.org/.

[43] B. Schneier. Applied Cryptography. John Wiley & Sons, second edition, 1996.

[44] M. Stevens. Fast Collision Attack on MD5. Cryptology ePrint Archive, Report 2006/104, 2006. Available for download at http://eprint.iacr.org/.

[45] A. S. Tanenbaum. Structured Computer Organization. Prentice Hall, fourth edition, 1999.

[46] X. Wang, D. Feng, X. Lai, and H. Yu. Collisions for hash functions MD4, MD5, HAVAL-128 and RIPEMD. Cryptology ePrint Archive, Report 2004/199, August 2004. Available for download at http://eprint.iacr.org/.

[47] X. Wang, A. Yao, and F. Yao. Cryptanalysis of SHA-1. Presented at the Cryptographic Hash Workshop hosted by NIST, October 2005.

[48] X. Wang, Y. L. Yin, and H. Yu. Efficient Collision Search Attacks on SHA-0. In Advances in Cryptology — CRYPTO 2005, volume 3621 of Lecture Notes in Computer Science, pages 1–16. Springer Verlag, August 2005.

[49] X. Wang, Y. L. Yin, and H. Yu. Finding Collisions in the Full SHA-1. In Advances in Cryptology — CRYPTO 2005, volume 3621 of Lecture Notes in Computer Science, pages 17–36. Springer Verlag, August 2005.

[50] X. Wang and H. Yu. How to Break MD5 and Other Hash Functions. In Advances in Cryptology — EUROCRYPT 2005, volume 3494 of Lecture Notes in Computer Science, pages 19–35. Springer Verlag, May 2005.

[51] J. Yajima and T. Shimoyama. Wang's sufficient conditions of MD5 are not sufficient. Cryptology ePrint Archive, Report 2005/263, August 2005. Available for download at http://eprint.iacr.org/.

[52] G. Yuval. How to Swindle Rabin, 1979.