FAST HASHING in CUDA Neville Walo Department of Computer
Total Page:16
File Type:pdf, Size:1020Kb
FAST HASHING IN CUDA Neville Walo Department of Computer Science ETH Zurich¨ Zurich,¨ Switzerland ABSTRACT cies, it is possible to use the compression function of SHA- Hash functions, such as SHA-256 [1], are extensively used 256 along with the Sarkar-Schellenberg composition prin- in cryptographic applications. However, SHA-256 cannot ciple [2] to create a parallel collision resistant hash function be parallelized due to sequential dependencies. Using the called PARSHA-256 [3]. Sarkar-Schellenberg composition principle [2] in combina- In this work, we try to accelerate hashing in CUDA [6]. tion with SHA-256 gives rise to PARSHA-256 [3], a parallel We have divided this project into two sub-projects. The first collision resistant hash function. We present efficient imple- one is the Bitcoin scenario, with the goal to calculate many mentations for both SHA-256 and PARSHA-256 in CUDA. independent SHA-256 computation in parallel. The second Our results demonstrate that for large messages PARSHA- case is PARSHA-256, where the goal is to implement the 256 can significantly outperform SHA-256. proposed algorithm efficiently in CUDA. Related work. To our knowledge there is no compara- ble implementation of PARSHA-256 which runs on a GPU. 1. INTRODUCTION There exists only the implementation of the original paper, Hash functions are one of the most important operations which uses multithreading [3]. On the other hand, there in cryptographic applications, like digital signature algo- are countless implementations of Bitcoin Miners in CUDA rithms, keyed-hash message authentication codes, encryp- [7, 8], as this was the most prominent way to mine Bitcoins tions and the generation of random numbers. Furthermore, before ASICs were introduced. hashing is also used in many data structures and applica- Our contribution is an implementation of SHA-256 and tions such as hash tables and for calculating checksums to PARSHA-256 in CUDA. compare files. The development and implementation of fast hash functions is therefore of great importance for many ar- 2. BACKGROUND eas. With the recent rise of the cryptocurrencies, it is as im- This section gives an overview of how PARSHA-256 and portant as never before to hash as fast as possible. Many SHA-256 and work, mainly taken from the original sources cryptocurrencies are based on the proof of work [4] princi- [1, 3]. The focus is on the technical implementation (how ple, in which one party (the prover) proves to others (the ver- the algorithm works), not on the theoretical properties (why ifiers) that a certain amount of computational effort has been the algorithm is secure). A word size of 32-bits is assumed. expended for some purpose. For example, in the Bitcoin protocol [5], users have to find a nonce such that the SHA- 2.1. SHA-256 [1] 256 hash of the nonce within the current block is smaller than the current target of the network. Since only the first The SHA-256 algorithm can be described in two stages: miner who finds a nonce that fulfills the target receives a re- preprocessing and hash computation. Preprocessing involves ward, it is important to try out many SHA-256 hashes as fast padding a message, parsing the padded message into 512- as possible. Today mostly ASICs (application-specific inte- bit blocks, and setting initialization values to be used in the grated circuit) are used to mine Bitcoins, as ASICs work hash computation. The hash computation generates a mes- more efficient and compute more hashes per second than sage schedule of 64 words from the padded message and traditional hardware. uses that schedule, along with functions, constants, and op- While the original SHA-256 implementation, as pro- erations to iteratively generate a series of hash values. The posed in the Secure Hash Standart (SHS) [1] by the National final hash value generated by the hash computation is called Institute of Standards and Technology (NIST), does not al- a message digest and its length is 256 bits. low for much parallelization due to sequential dependen- 2.1.1. Preprocessing After repeating steps one through four for every block, the resulting 256-bit message digest of the message, M, is Preprocessing consists of three steps: padding the message, the hash value of the final block. parsing the message into message blocks, and setting the initial hash value. 2.2. PARSHA-256 [3] Padding. The purpose of padding the input message is to ensure that the padded message is a multiple of 512 bits, PARSHA-256 uses the compression function of SHA-256 since SHA-256 assumes a block size of 512 bits. Suppose along with the Sarkar-Schellenberg [2] principle to create that the length of the message, M, is ` bits. The bit ”1” a parallelizable collision resistant hash function. The over- is appended to the end of the message, followed by k zero all approach does look not that different from SHA-256, but bits, where k is the smallest, non-negative solution to the there is one major change. While the task graph in SHA-256 equation ` + 1 + k ≡ 448 mod 512. Then the 64-bit block is linear, because the previous block has to be processed to that is equal to the number ` expressed using a binary rep- process the next one, in PARSHA-256 the processors are ar- resentation is appended to the message. This will result in a ranged in a binary tree (similar to a reduction), which allows message that can be divided into 512-bit blocks. working on more than one block at a time. Parsing the Message. The message and its padding are (1) (2) (N) parsed into N 512-bit blocks, M ;M ; :::; M . Since 2.2.1. Compression Function the 512 bits of the input block may be expressed as sixteen 32-bit words, the first 32 bits of the message block i are Let h() be the compression function. In the case of SHA- (i) 256, the input to h() consists of 24 32-bit words (768 bits) denoted M0 . Initial Hash Value. Before hash computation begins, and the output consists of 8 32-bit words (256 bits). In the (0) (0) rest of the paper we set n = 768 and m = 256. the initial 8 hash values, H0 up to H7 , must be set. These can be taken from the official document [1]. 2.2.2. Processor Tree 2.1.2. Hash Computation PARSHA-256 uses a binary tree of processor height T , note that T is an argument to the hash function and a different T SHA-256 uses eight 32-bit working variables and six logical can produce a different result. There are 2T processors in functions (Ch(x; y; z), Maj(x; y; z), σ (x), σ (x), Σ (x), 0 1 0 the tree and the children of processor P are P and P , Σ (x)), where each function operates on 32-bit words, which i 2i 2i+1 1 see Fig. 1. The arcs denote the data flow and go from the are represented as x, y, and z. The result of each function children to the parent. is a new 32-bit word. The exact specification of these func- tions can be found in the official document [1]. Each message block is processed in order, using the fol- lowing steps: 1. The Message Schedule fWtg of the current block is prepared using the following approach: 8 (i) <Mt 0 ≤ t ≤ 15 W = t σ1(Wt−2) + Wt−7 + 16 ≤ t ≤ 63 : σ0(Wt−15) + Wt−16 2. The eight working variables are initialized with the (i−1)st hash values. This means that block i receives as working variables the message digest of block i−1, Fig. 1: Processor Tree with T = 3. Source: [3] while the first block receives the initial hash values as working variables. The behaviour of any processor Pi with input y is de- scribed as follows: 3. In 64 rounds the working variables are permuted us- ( h(y) if jyj = n ing the above functions, the message schedule and Pi(y) = (1) predefined constants. The exact specification can be y else found in the official document [1]. 2.2.3. Formatting the Message 4. To compute the ith intermediate hash value, the 8 working variables are added to the (i − 1)th hash Similar to SHA-256, the incoming message has to be padded value. and divided into blocks. While for SHA-256, this procedure was relatively simple, as the message should be a multiple 3. In round (p + 2) all non leaf processors get an (n − of 512 bits, which is also the block size, in PARSHA-256 it 2m)-bit substring from the input message and con- is more complicated. catenate this with the two m-bit messages z2i; z2i+1 In PARSHA-256 the message undergoes two kinds of from their children from the previous round. The leaf padding. In the first kind of padding, called end-padding, processors do not receive any new input. zeros are appended to the end of the message to get the length of the padded message into a certain form. The sec- 4. In round (p + 3) only P0 and P1 get an (n − 2m)- ond kind of padding is called IV-Padding. The Initialization bit substring from the input message and concatenate Vector (IV) ensures that no invocation of h() gets only mes- this with the two m-bit messages z2i; z2i+1 from their sage bits as input. Using an IV is relatively simple in the children from the previous round. The other proces- Merkle-Damgard composition scheme (used by SHA-256).