Parallel Cryptanalysis
Total Page:16
File Type:pdf, Size:1020Kb
Parallel cryptanalysis Citation for published version (APA): Niederhagen, R. F. (2012). Parallel cryptanalysis. Technische Universiteit Eindhoven. https://doi.org/10.6100/IR731259 DOI: 10.6100/IR731259 Document status and date: Published: 01/01/2012 Document Version: Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers) Please check the document version of this publication: • A submitted manuscript is the version of the article upon submission and before peer-review. There can be important differences between the submitted version and the official published version of record. People interested in the research are advised to contact the author for the final version of the publication, or visit the DOI to the publisher's website. • The final author version and the galley proof are versions of the publication after peer review. • The final published version features the final layout of the paper including the volume, issue and page numbers. Link to publication General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. • Users may download and print one copy of any publication from the public portal for the purpose of private study or research. • You may not further distribute the material or use it for any profit-making activity or commercial gain • You may freely distribute the URL identifying the publication in the public portal. If the publication is distributed under the terms of Article 25fa of the Dutch Copyright Act, indicated by the “Taverne” license above, please follow below link for the End User Agreement: www.tue.nl/taverne Take down policy If you believe that this document breaches copyright please contact us at: [email protected] providing details and we will investigate your claim. Download date: 25. Sep. 2021 Parallel Cryptanalysis Ruben Niederhagen Parallel Cryptanalysis PROEFSCHRIFT ter verkrijging van de graad van doctor aan de Technische Universiteit Eindhoven, op gezag van de rector magnificus, prof.dr.ir. C.J. van Duijn, voor een commissie aangewezen door het College voor Promoties in het openbaar te verdedigen op maandag 23 april 2012 om 16.00 uur door Ruben Falko Niederhagen geboren te Aken, Duitsland Dit proefschrift is goedgekeurd door de promotoren: prof.dr. T. Lange en prof.dr. D.J. Bernstein Copromotor: dr. C.-M. Cheng A catalogue record is available from the Eindhoven University of Technology Library. ISBN: 978-90-386-3128-8 Printed by Printservice Technische Universiteit Eindhoven Cover design by Verspaget & Bruinink, Nuenen Public domain Commissie: prof.dr. D.J. Bernstein, promotor (University of Illinois at Chicago) dr. C.-M. Cheng, copromotor (National Taiwan University) prof.dr. A.M. Cohen, chairman prof.dr. R. Bisseling (Utrecht University) prof.dr. T. Lange, promotor prof.dr. W.H.A. Schilders prof.dr. G. Woeginger prof.dr. B.-Y. Yang (Academia Sinica) Acknowledgements I would like to thank my supervisors Daniel J. Bernstein and Tanja Lange as well as my supervisors in Taiwan, Chen-Mou Cheng and Bo-YinYang, for the opportunity to enjoy my PhD studies commuting between the Netherlands and Taiwan. Furthermore, I would like to thank them for their support throughout my studies and the writing of my thesis. I also would like to thank Rob Bisseling, Arjeh Cohen, Wil Schilders, and Gerhard Woeginger for joining my doctorate committee and for their valuable feedback on the draft of my dissertation. Special thanks go to Tung Chou, Andreas Gierlich, and Peter Schwabe for proofreading of the first versions of my thesis and for suggesting corrections and improvements. Finally, I want to thank Stefan Lankes from the Lehrstuhl für Betriebssys- teme (research group for operating systems) at RWTH Aachen University for his support in the first years of my PhD studies. Contents 1 Introduction1 2 Overview of parallel computing3 2.1 Parallel architectures..........................5 2.1.1 Microarchitecture........................5 2.1.2 Instruction set.........................7 2.1.3 System architecture......................8 2.2 Parallel programming......................... 12 2.2.1 Shared memory......................... 12 2.2.2 Message passing........................ 14 2.2.3 Summary............................ 14 2.3 General-purpose GPU programming................. 15 2.3.1 Programming NVIDIA GPUs................. 17 2.3.2 Assembly for NVIDIA GPUs................. 18 3 Parallel implementation of Pollard’s rho method 23 3.1 The ECDLP and the parallel version of Pollard’s rho method... 24 3.2 ECC2K-130 and the iteration function................ 25 3.3 Implementing ECC2K-130 on the Cell processor.......... 26 3.3.1 A Brief Description of the Cell processor........... 27 3.3.2 Approaches for implementing the iteration function..... 29 3.3.3 ECC2K-130 iterations on the Cell processor......... 31 3.3.4 Using DMA transfers to increase the batch size....... 33 3.3.5 Overall results on the Cell processor............. 34 3.4 Implementing ECC2K-130 on a GPU................. 35 3.4.1 The GTX 295 graphics card.................. 36 3.4.2 Approaches for implementing the iteration function..... 38 3.4.3 Polynomial multiplication on the GPU............ 39 3.4.4 ECC2K-130 iterations on the GPU.............. 42 3.4.5 Overall results on the GPU.................. 44 3.5 Performance comparison........................ 45 4 Parallel implementation of the XL algorithm 47 4.1 The XL algorithm........................... 48 4.2 The block Wiedemann algorithm................... 49 4.3 The block Berlekamp–Massey algorithm............... 53 4.3.1 Reducing the cost of the Berlekamp–Massey algorithm... 54 4.3.2 Parallelization of the Berlekamp–Massey algorithm..... 55 4.4 Thomé’s version of the block Berlekamp–Massey algorithm.... 58 4.4.1 Matrix polynomial multiplications.............. 58 4.4.2 Parallelization of Thomé’s Berlekamp–Massey algorithm.. 58 4.5 Implementation of XL......................... 59 4.5.1 Reducing the computational cost of BW1 and BW3.... 60 4.5.2 SIMD vector operations in F16 ................ 60 4.5.3 Exploiting the structure of the Macaulay matrix...... 62 4.5.4 Macaulay matrix multiplication in XL............ 64 4.5.5 Parallel Macaulay matrix multiplication in XL....... 66 4.6 Experimental results.......................... 73 4.6.1 Impact of the block size.................... 74 4.6.2 Performance of the Macaulay matrix multiplication..... 76 4.6.3 Scalability experiments.................... 78 5 Parallel implementation of Wagner’s birthday attack 81 5.1 Wagner’s generalized birthday attack................. 82 5.1.1 Wagner’s tree algorithm.................... 82 5.1.2 Wagner in storage-restricted environments.......... 83 5.2 The FSB hash function........................ 85 5.2.1 Details of the FSB hash function............... 85 5.2.2 Attacking the compression function of FSB48 ........ 87 5.3 Attack strategy............................. 88 5.3.1 How large is a list entry?................... 88 5.3.2 What list size can be handled?................ 89 5.3.3 The strategy.......................... 89 5.4 Implementing the attack........................ 92 5.4.1 Parallelization......................... 92 5.4.2 Efficient implementation.................... 94 5.5 Results.................................. 96 5.5.1 Cost estimates......................... 96 5.5.2 Cost measurements....................... 97 5.5.3 Time-storage tradeoffs..................... 97 5.6 Scalability analysis........................... 98 Bibliography 101 1 Introduction Most of today’s cryptographic primitives are based on computations that are hard to perform for a potential attacker but easy to perform for somebody who is in possession of some secret information, the key, that opens a back door in these hard computations and allows them to be solved in a small amount of time. Each cryptographic primitive should be designed such that the cost of an attack grows exponentially with the problem size, while the computations using the secret key only grow polynomially. To estimate the strength of a cryptographic primitive it is important to know how hard it is to perform the computation without knowledge of the secret back door and to get an understanding of how much money or time the attacker has to spend. Usually a cryptographic primitive allows the cryptographer to choose parameters that make an attack harder at the cost of making the computations using the secret key harder as well. Therefore designing a cryptographic primitive imposes the dilemma of choosing the parameters strong enough to resist an attack up to a certain cost while choosing them small enough to allow usage of the primitive in the real world, e.g. on small computing devices like smart phones. Typically a cryptographic attack requires a tremendous amount of compu- tation—otherwise the cryptographic primitive under attack can be considered broken. Given this tremendous amount of computation, it is likely that there are computations that can be performed in parallel. Therefore, parallel computing systems are a powerful tool for the attacker of a cryptographic system. In contrast to a legitimate user who typically exploits only a small or moderate amount of parallelism, an attacker is often able to launch an attack on a massively parallel system. In practice the amount of parallel computation power available to an