Architectural Techniques for Accelerating Subword Permutations with Repetitions John P

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 11, NO. 3, JUNE 2003 325 Architectural Techniques for Accelerating Subword Permutations With Repetitions John P. McGregor and Ruby B. Lee, Fellow, IEEE Abstract—We propose two new instructions, swperm and use permutations to achieve diffusion [21], a critical character- sieve, that can be used to efficiently complete an arbitrary istic of a secure cipher, in symmetric-key encryption algorithms bit-level permutation of an -bit word with or without repetitions. such as DES [14], Twofish [19], and Serpent [2]. Some permu- Permutations with repetitions are rearrangements of an ordered set in which elements may replace other elements in the set; tations in cryptographic algorithms are not bijections. For in- such permutations are useful in cryptographic algorithms. On a stance, the expansion permutation in DES maps some bits in four-way superscalar processor, we can complete an arbitrary the source datum to multiple destinations in the result datum. 64-bit permutation with repetitions of 1-bit subwords in 11 We define such rearrangements of an ordered set in which el- instructions and only four cycles using the two proposed instruc- ements can be replicated and possibly replace other elements tions. For subwords of size 4 bits or greater, we can perform an arbitrary permutation with repetitions of a 64-bit register in a in the set to be permutations with repetitions. If no informa- single cycle using a single swperm instruction. This improves tion is lost in a permutation with repetitions, the permutation is upon previous results by requiring fewer instructions to permute invertible and therefore can be used in any cryptographic algo- 4-bit or larger subwords packed in a 64-bit register and fewer ex- rithm. Even if information is lost, cryptographic hash functions ecution cycles for 1-bit subwords on wide superscalar processors. and encryption algorithms based upon Feistel networks can still We also demonstrate that we can accelerate the performance of the popular DES block cipher using the proposed instructions. employ permutations with repetitions [18]. We obtain a DES performance improvement of at least 55% in constrained embedded environments and an improvement of 71% A. Past Work on a four-way superscalar processor when applying DES as a Several methods exist for performing permutations in soft- cryptographic hash function. ware. In one method, individual bits of the source datum are Index Terms—Cryptography, encryption, instruction set archi- selected and shifted to their destination locations using a series tecture, permutation, permutation instruction, processor architecture, subword parallelism, subword permutation. of bitwise AND, bitwise OR, and shift instructions [12]. For an arbitrary permutation of the bits in an -bit word, this procedure requires as many as 4 instructions. If the architecture in- I. INTRODUCTION cludes instructions such as extract and deposit [9], one S THE POPULARITY of security applications grows, can reduce the instruction count of this procedure to 2 , yet this A the underlying cryptographic algorithms consume an method is still unacceptably slow. increasingly large percentage of processor workloads. These Alternatively, we can employ lookup tables to perform per- applications often include several operations involving 1-bit mutations with repetitions in software [12]. First, we divide the or multiple-bit register subwords. Many microprocessor -bit source datum into groups of bits; we use each group to instruction set architectures have been extended to include index a unique lookup table. The output of a lookup table repre- subword-parallel integer arithmetic instructions that improve sents the input group of bits permuted per the desired permuta- performance by executing several operations on low-precision tion. The bits of the table output that do not represent any of the data in parallel. Some of these extensions include MAX [7] input bits are set to zeroes. Therefore, we can combine the out- and MAX-2 [10] for HP PA-RISC, VIS [22] for Sun SPARC, puts of the lookup tables using bitwise OR or bitwise AltiVec [5] for PowerPC, 3 DNow! by AMD [15], MMX [16] XOR operations to generate the desired permuted -bit result. for Intel IA-32, and IA-64 multimedia instructions [6], [11]. In general, assuming the extract instruction is available, Before performing subword arithmetic operations, it may be we require instructions to complete an -bit permuta- necessary to rearrange the subwords within a single register or tion using lookup tables. Each of the lookup tables consists between multiple registers using subword permutation instruc- of entries, and each entry is bits in size, so the total size tions. In addition, we can employ subword permutations to effi- of the tables is bits. This technique is commonly ciently perform transformations such as matrix transposition in used but is unattractive because the permutations must be stat- multimedia applications [8]. Certain cryptographic algorithms ically encoded in the tables at compile-time. Furthermore, the space required to store the lookup tables is large for acceptably small permutation instruction sequences. For example, we need Manuscript received May 22, 2002; revised October 3, 2002. This work was supported in part by the National Science Foundation under Grant 2 MB of storage to permute a 64-bit datum in 11 instructions CCR-0105677 and by Hewlett Packard Laboratories. using four lookup tables. With eight lookup tables, we require The authors are with the Department of Electrical Engineering, Princeton 16 kB of storage and 23 instructions to permute a 64-bit value. University, Princeton, NJ 08540 USA (email: [email protected]; [email protected]). Multiple instruction set architectures have been amended Digital Object Identifier 10.1109/TVLSI.2003.812318 to include instructions for permutations of 8-bit or larger 1063-8210/03$17.00 © 2003 IEEE 326 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 11, NO. 3, JUNE 2003 subwords. The permute instruction in the MAX-2 extension to rapidly achieve a desired level of diffusion in future ciphers. to PA-RISC supports permutations with and without repetitions As a result, the proposed instructions could greatly improve the of 16-bit subwords in a 64-bit word by statically encoding overall throughput of cryptographic algorithms. the permutation function in the instruction [10]. In IA-64, the In Section II, we discuss the mathematics of permutations and mux instruction supports a small set of permutations of 8-bit define two new instructions. We demonstrate how to apply these subwords in a 64-bit word and supports all permutations of instructions to achieve arbitrary subword permutations with rep- 16-bit subwords in a 64-bit word [6]. Similar to permute in etitions in Section III. In Section IV,we present the hardware re- MAX-2, the permutation function is statically encoded in the quired to implement the two new instructions. In Section V, we mux instruction at compile-time. The vperm instruction in the analyze the performance of permutations for differently sized AltiVec extension to the PowerPC instruction set architecture subwords, and we evaluate the performance improvement ef- permutes the 8-bit subwords of a 128-bit vector register [5]. fected by the proposed permutation instructions for a highly This instruction requires three 128-bit register reads and one popular symmetric-key encryption algorithm. We summarize in 128-bit register write, and the permutation function is encoded Section VI. in one of the vector source registers. None of the permutation instructions in popular popular instruction set architectures II. PERMUTATION INSTRUCTIONS (ISAs) efficiently support arbitrary permutations of 4-bit or We propose two new instructions to efficiently support per- smaller subwords. mutations with repetitions of 1-bit or multiple-bit subwords: Recently, researchers have proposed several instructions swperm and sieve [13]. Using these instructions, we can dy- for performing arbitrary, dynamically specified permutations namically specify permutations with repetitions during program of 1-bit or larger subwords. Using the pperm instruction, execution rather than force the permutations to be statically en- we can complete an arbitrary permutation of bits with or coded at compile-time. without repetitions in instructions [12], [20]. The xbox instruction performs -bit permutations in a similar A. Permutations With Repetitions fashion [4]. We can conduct a 64-bit permutation by executing 8 pperm or xbox or instructions followed by 7 bitwise XOR A permutation is a rearrangement of the elements in an or- or OR instructions to combine the results. The xbox and dered set, i.e., a bijective map from a set to itself [1]. In pperm instructions essentially dynamically configure and the context of this paper, such a set is not a collection of all invoke an -by- crossbar without requiring the processor to possible -bit words; we are concerned with ordered sets that maintain any additional state information. Amending an ISA are column vectors of -bit subwords. We define a surjective by requiring additional state variables would be undesirable: (i.e., one-to-many) map from an ordered set to another or- such changes require explicit operating system (OS) support dered set —where the cardinality of equals that of —to and increase the complexity of context switches and interrupts. be a permutation with repetitions.

Architectural Techniques for Accelerating Subword Permutations with Repetitions John P

RISC-V Vector Extension Webinar II

Optimizing Packed String Matching on AVX2 Platform

Optimizing Software Performance Using Vector Instructions Invited Talk at Speed-B Conference, October 19–21, 2016, Utrecht, the Netherlands

Idisa+: a Portable Model for High Performance Simd Programming

Counter Control Flow Divergence in Compiled Query Pipelines

MPC7410/MPC7400 RISC Microprocessor Reference Manual

A Second Generation SIMD Microprocessor Architecture

Effective Vectorization with Openmp 4.5

Jon Stokes Jon

Altivec Extension to Powerpc Accelerates Media Processing

Register Level Sort Algorithm on Multi-Core SIMD Processors

G4 Is First Powerpc with Altivec: 11/16/98