The Compression Functions of SHA, MD2, MD4
and MD5 are not Affine
Mahesh V. Tripunitara and Samuel S. Wagstaff, Jr.
COAST Laboratory
Purdue University
West Lafayette, IN 47907-1398
ftripunit,[email protected]
COAST TR-98/01
Abstract
In this paper, we show that the compression functions of SHA, MD2, MD4 and MD5 are not
affine transformations of their inputs.
1 Introduction
The Secure Hash Algorithm SHA [1, 5, 2] and the Message Digest algorithms MD2 [4], MD4 [6]
and MD5 [7] are used to produce a message digest, a condensed form of a message for use in digital
signature and other applications. Figure 1 shows how these algorithms are used. Each step in each of the
algorithms is a compression function (denoted by CF in the figure) which takes as inputs a portion of
a message and the current value for a chaining variable. Each compression function alters the value for
the chaining variable based on the input message, and the output from each compression function is this
altered value for the chaining variable. The value for the chaining variable from the last application of
the compression function is the output message digest. The initial value for the chaining variable for the
first application of the compression function is fixed for each algorithm, and each subsequent application
uses the output from the previous application as the input value for the chaining variable.
An important property of each such compression function is that its output is not an
affine transformation of its inputs, which are a fixed-size portion of a message and the initial value
for the chaining variable. If the output from a compression function were an affine transformation of its
inputs, it would be trivial to construct messages that produce a given message digest when subjected to
the message digest algorithm, as we show in section 1.2. Such an algorithm would be useless in a security
context.
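To make this claim concrete ahead of the full attack in section 1.2, the following is a minimal sketch
that is not taken from any of the algorithms themselves. It assumes a toy affine function x -> Ax XOR b
over F_2 with a randomly chosen 4 x 12 matrix; the names apply_affine, solve_gf2 and find_input are
hypothetical. It shows only that finding an input with a prescribed output reduces to Gaussian
elimination.

```python
import random
from typing import List, Optional

Bits = List[int]  # a bit vector; every entry is 0 or 1


def apply_affine(A: List[Bits], b: Bits, x: Bits) -> Bits:
    """Compute A x XOR b over F_2 (toy affine function, an assumption)."""
    return [(sum(a & xi for a, xi in zip(row, x)) & 1) ^ bi
            for row, bi in zip(A, b)]


def solve_gf2(A: List[Bits], y: Bits) -> Optional[Bits]:
    """Return one x with A x = y over F_2, or None if no solution exists."""
    rows = [row[:] + [yi] for row, yi in zip(A, y)]  # augmented matrix
    n_cols = len(A[0])
    pivots = []  # pivot column of each reduced row
    r = 0
    for col in range(n_cols):
        # Look for a row at or below r with a 1 in this column.
        pivot = next((i for i in range(r, len(rows)) if rows[i][col]), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        # Clear this column in every other row (XOR is addition in F_2).
        for i in range(len(rows)):
            if i != r and rows[i][col]:
                rows[i] = [p ^ q for p, q in zip(rows[i], rows[r])]
        pivots.append(col)
        r += 1
        if r == len(rows):
            break
    # A zero row with right-hand side 1 means the system is inconsistent.
    if any(row[-1] for row in rows[r:]):
        return None
    x = [0] * n_cols  # free variables are set to 0
    for i, col in enumerate(pivots):
        x[col] = rows[i][-1]
    return x


def find_input(A: List[Bits], b: Bits, target: Bits) -> Optional[Bits]:
    """Find x with A x XOR b == target, if one exists."""
    return solve_gf2(A, [t ^ bi for t, bi in zip(target, b)])


# Toy demonstration with a fixed seed so the run is repeatable.
rng = random.Random(0)
A = [[rng.randint(0, 1) for _ in range(12)] for _ in range(4)]
b = [rng.randint(0, 1) for _ in range(4)]
x0 = [rng.randint(0, 1) for _ in range(12)]
target = apply_affine(A, b, x0)   # output of a known input
x1 = find_input(A, b, target)     # an input recovered from the output alone
```

Because target is produced from a known input, the system is guaranteed to be consistent, so x1 is
some input (not necessarily x0) that maps to the same output.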
In this paper we show that the compression functions of the message digest algorithms SHA, MD2,
MD4 and MD5, as described in [1], [4], [6] and [7] respectively, are indeed not affine.
1.1 Some Characteristics of the Algorithms
Each algorithm takes an arbitrary length (we use "length", "size" and "number of bits" to mean the
same unless specifically noted otherwise) message as input and produces a fixed length message digest
as output. The message digest is 160 bits for SHA and 128 bits for each of MD2, MD4 and MD5.
Each algorithm pads the input message so as to make the number of bits in the message plus the
pad a multiple of 512 in the case of SHA, MD4 and MD5, and a multiple of 128 in the case of MD2.
Portions of this work were supported by sponsors of the COAST Laboratory.
[Figure 1: Use of Message Digest Algorithms. The message followed by the pad is split into chunks of
M bits, and each chunk is the input to one application of CF. The chaining variable, of L bits, starts
at the initial default value, and the output of the final application of CF is the digest.
M = 512 bits for SHA, MD{4,5}; M = 128 bits for MD2. L = 160 bits for SHA; L = 128 bits for MD{2,4,5}.]
For SHA, MD4 and MD5, the padding consists of a 1 bit followed by enough 0's to make the number
of bits of the message plus the pad so far congruent to 448 mod 512. At least 1 bit and at most 512 bits
are thus added as pad. The length of the original message, represented as a 64 bit number, is then
appended to make the entire length a multiple of 512. In the case of MD2, padding of sufficient length to
make the number of bits of the padded message a multiple of 128 is added. If i bytes are added as pad, each
of those bytes has the value i. At least 1 byte and at most 16 bytes are appended as pad. A 16 byte
(128 bit) checksum of the message plus pad is then also appended.
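The padding rules above can be sketched as follows. This is a minimal illustration under stated
assumptions, not a reference implementation: it handles only messages that are a whole number of bytes,
the function names md_style_pad and md2_pad are hypothetical, the byte order of the 64 bit length is left
as a parameter because the text above does not specify it, and the MD2 checksum is omitted.

```python
def md_style_pad(message: bytes, length_byteorder: str = "big") -> bytes:
    """Return the pad described for SHA, MD4 and MD5 (byte-aligned input only)."""
    bit_len = 8 * len(message)
    # A 1 bit followed by seven 0 bits is the byte 0x80.
    pad = b"\x80"
    # Add 0 bytes until message plus pad is 448 mod 512 bits (56 mod 64 bytes).
    while (len(message) + len(pad)) % 64 != 56:
        pad += b"\x00"
    # Append the original length in bits as a 64 bit number.
    pad += (bit_len % 2**64).to_bytes(8, length_byteorder)
    return pad


def md2_pad(message: bytes) -> bytes:
    """Return the MD2 pad: i bytes, each of value i, with 1 <= i <= 16.

    The 16 byte checksum that MD2 appends afterwards is not computed here.
    """
    i = 16 - (len(message) % 16)
    return bytes([i]) * i
```

For byte-aligned messages the first pad is between 9 and 72 bytes (72 to 576 bits), and the MD2 pad
without its checksum is between 1 and 16 bytes.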
After padding, SHA, MD4 and MD5 break the padded message into 512 bit chunks and use those as
inputs to the compression function, in turn, as indicated in figure 1. MD2 does the same, except
that its compression function takes 128 bit chunks of the message as input. Each compression function
also takes as input the value of the chaining variable, which is the output from the previous application
of the compression function. For the first application of the compression function, each algorithm uses
a default initial value for the chaining variable.
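The chaining described above can be sketched as a short loop. This is an illustration under stated
assumptions: the names split_chunks, iterate_cf and toy_cf are hypothetical, and toy_cf is a stand-in
that merely XOR-folds the chunk into the chaining variable; it is not the compression function of any
of the four algorithms.

```python
from typing import Callable, List


def split_chunks(padded: bytes, chunk_bytes: int) -> List[bytes]:
    """Break an already padded message into equal sized chunks."""
    if len(padded) % chunk_bytes != 0:
        raise ValueError("padded message must be a multiple of the chunk size")
    return [padded[i:i + chunk_bytes]
            for i in range(0, len(padded), chunk_bytes)]


def iterate_cf(cf: Callable[[bytes, bytes], bytes],
               initial_value: bytes,
               padded: bytes,
               chunk_bytes: int) -> bytes:
    """Apply cf to each chunk in turn, threading the chaining variable through."""
    v = initial_value  # default initial value for the first application
    for m in split_chunks(padded, chunk_bytes):
        v = cf(v, m)   # output becomes the next input chaining variable
    return v           # value after the last application is the digest


def toy_cf(v: bytes, m: bytes) -> bytes:
    """Hypothetical stand-in compression function (NOT SHA, MD2, MD4 or MD5)."""
    out = bytearray(v)
    for i, byte in enumerate(m):
        out[i % len(out)] ^= byte
    return bytes(out)
```

For example, iterate_cf(toy_cf, bytes(4), bytes([1, 2, 3, 4, 5, 6, 7, 8]), 4) applies toy_cf twice,
once per 4 byte chunk.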
1.2 An Attack on a Message Digest Algorithm That Uses an Affine Compression Function
In this section, we assume the existence of a message digest algorithm, A, that works like SHA, MD2,
MD4 and MD5 in that it takes as input a message of some length, pads the message and breaks the
padded message into equal sized chunks to use as input to a compression function, CF. CF also takes
as input a chaining variable that is the output from the previous application of CF (a default value for
the first application). Also, we assume that CF is affine in its inputs, that is, CF can be represented as
a pre-multiplication of its inputs by a matrix of appropriate dimensions, followed by a bitwise XOR with a
vector of appropriate size. Then, given a message and its corresponding message digest, we construct a
second message with the same message digest. We assume that the reader is familiar with basic concepts of
numerical linear algebra such as rank of a matrix and Gaussian Elimination. We refer the reader to [3] for
a review of these and other related topics. Also, we adopt the following notation:
$F_2^{n,m}$: The set of $n \times m$ matrices each of whose elements are in $F_2$, the Galois Field with
the two elements 0 and 1. $m$ is omitted if it is 1, and $n$ and $m$ are omitted if they are both 1.
$\oplus$: The bitwise exclusive-or operator.
$|\cdot|$: A function that takes as argument a bit string or bit vector and returns its length.
$\circ$: Used to concatenate bit strings.
$A$: The message digest algorithm with an affine compression function.
$M, M', M''$: Inputs to $A$ (messages of some length).
$D$: Output from $A$ (message digest).
$CF$: The compression function used in $A$.
$m, m_i$: Input message to $CF$, typically a portion of $M$, $M'$ or $M''$, written as a bit vector.
$v, v_i$: Input chaining variable to $CF$, written as a bit vector.
$w, w_i$: Output from $CF$, written as a bit vector. The output from the final application of $CF$
is the message digest.
$L_m$: The length of the input message to $CF$. Thus, typically, $L_m = |m_i|$, and $L_m = 512$ for
SHA, MD4 and MD5 and $L_m = 128$ for MD2.
$L_v$: The length of the input chaining variable to, and thus the output from, $CF$. Thus,
$L_v = |v_i| = |w_j|$ for any $i, j$. $L_v = 160$ for SHA and $L_v = 128$ for MD2, MD4 and MD5.
$p$: The padding added to the message. For SHA, MD4 and MD5, $65 \le |p| \le 576$ and for
MD2, $136 \le |p| \le 256$.
$C$: The matrix corresponding to $CF$. Thus, $C \in F_2^{L_v, L_v + L_m}$.
$C_i$: The $i$-th column of $C$, where $1 \le i \le L_v + L_m$. Thus, $C_i \in F_2^{L_v}$ and
$C = [C_1 \; C_2 \; \cdots \; C_{L_v + L_m}]$.
$b$: The constant vector corresponding to the affine transformation, $CF$. Thus, $b$ is the output
on input 0 (the vector of all 0's), and $b \in F_2^{L_v}$.
Thus, $CF$, on inputs $v$, $m$ and output $w$, can now be expressed as:
$$ w = [C_1 \; C_2 \; \cdots \; C_{L_v + L_m}] \begin{bmatrix} v \\ m \end{bmatrix} \oplus b \qquad (1) $$
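Equation (1) can be exercised on a toy instance. The sketch below is an illustration only: the sizes
L_V = 4 and L_M = 8, the random matrix C and vector b, and the names affine_cf, xor3 and
passes_affine_identity are all assumptions made for the example. It also checks one identity that every
affine map satisfies, namely CF(x) XOR CF(y) XOR CF(z) = CF(x XOR y XOR z), since the three copies of b
on the left reduce to a single b.

```python
import random
from typing import Callable, List

Bits = List[int]  # a bit vector; every entry is 0 or 1


def affine_cf(C: List[Bits], b: Bits, v: Bits, m: Bits) -> Bits:
    """Return w = C [v; m] XOR b over F_2, following equation (1)."""
    x = v + m  # stack v on top of m
    return [(sum(c & xi for c, xi in zip(row, x)) & 1) ^ bi
            for row, bi in zip(C, b)]


def xor3(a: Bits, c: Bits, d: Bits) -> Bits:
    """Bitwise XOR of three equal length bit vectors."""
    return [p ^ q ^ r for p, q, r in zip(a, c, d)]


def passes_affine_identity(cf: Callable[[Bits, Bits], Bits],
                           l_v: int, l_m: int,
                           rng: random.Random, trials: int = 100) -> bool:
    """Check cf(x) ^ cf(y) ^ cf(z) == cf(x ^ y ^ z) on random inputs."""
    for _ in range(trials):
        vs = [[rng.randint(0, 1) for _ in range(l_v)] for _ in range(3)]
        ms = [[rng.randint(0, 1) for _ in range(l_m)] for _ in range(3)]
        lhs = xor3(*(cf(v, m) for v, m in zip(vs, ms)))
        rhs = cf(xor3(*vs), xor3(*ms))
        if lhs != rhs:
            return False  # one failing triple shows cf is not affine
    return True


# Toy sizes (assumptions); the real values are L_v = 160 or 128 and L_m = 512 or 128.
L_V, L_M = 4, 8
rng = random.Random(1)  # fixed seed so the run is repeatable
C = [[rng.randint(0, 1) for _ in range(L_V + L_M)] for _ in range(L_V)]
b = [rng.randint(0, 1) for _ in range(L_V)]
```

On the all-zero input, affine_cf returns b, matching the definition of b above. A single failing
triple proves that a function is not affine, whereas passing the sampled check is only evidence and
not a proof of affinity.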
Assume that some message $M$ on input to $A$ produces $D$ as message digest. We would like to
substitute our favorite message, $M'$, for the original message, but have the new message also produce the
same output $D$ on input to $A$. Of course, this is not possible in general. Therefore, starting from our
favorite message $M'$, we construct $M''$, a slightly transformed version of $M'$, such that $M''$ on input
to $A$ produces $D$ as message digest. Starting with $M'$, we carry out the attack as follows.
1. We split $M'$ into chunks of size $L_m$. Thus, we produce $m_1, m_2, \ldots, m_k$ such that
$|m_i| = L_m$ for $i = 1, 2, \ldots, k-1$, $|m_k| \le L_m$ and the concatenation
$m_1 \circ m_2 \circ \cdots \circ m_k = M'$.
2. We subject each of $m_1, m_2, \ldots, m_{k-1}$ to $CF$, in turn. At the first application, we use the
default value for the input chaining variable and at each subsequent application, we use the output from
the previous application of $CF$. Application of $CF$ is carried out as indicated in (1). The output
from the last such application of $CF$ is $w_{k-1} = v_k$, which is the same as the input chaining variable
to the next application of $CF$. Thus, we have now performed $k-1$ applications of $CF$.
3. If $A$ works like SHA, MD4 or MD5, the message is first padded with at least one bit so that its
length is congruent to 448 mod $L_m$, and then with the length of the message represented as a 64 bit
number (of course, $L_m = 512$ for SHA, MD4 and MD5). Our strategy is to append a suitable "noise" to
the message. We now need to consider one of several cases as described below.
(a) If the next ($k$-th) application of $CF$ is to be the last, we require the output from the application
to be $D$. Thus, we need to find a "noise" $n$ such that:
$$ [C_1 \; C_2 \; \cdots \; C_{L_v + L_m}] \begin{bmatrix} v_k \\ m_k \\ n \\ p \end{bmatrix} = D \oplus b \qquad (2) $$
To be able to find such an $n$, we need the following condition to be satisfied: