Performing Advanced Bit Manipulations Efficiently in General-Purpose Processors

Performing Advanced Bit Manipulations Efficiently in General-Purpose Processors Yedidya Hilewitz and Ruby B. Lee Department of Electrical Engineering , Princeton University, Princeton, NJ 08544 USA {hilewitz, rblee}@princeton.edu Abstract processors also have Extract_field and Deposit_field operations, which can be viewed as variants of This paper describes a new basis for the Shift_Right and Shift_Left operations, with certain implementation of a shifter functional unit. We present bits masked out and set to zeros or replicated-sign bits. a design based on the inverse butterfly and butterfly There are many emerging applications, such as datapath circuits that performs the standard shift and cryptography, imaging and biometrics, where more rotate operations, as well as more advanced extract, advanced bit manipulation operations are needed. deposit and mix operations found in some processors. While these can be built from the simpler logical and Additionally, it also supports important new classes of shift operations, the applications using these advanced even more advanced bit manipulation instructions bit manipulation operations are significantly sped up if recently proposed: these include arbitrary bit the processor can support more powerful bit permutations, bit scatter and bit gather instructions. manipulation instructions. Such operations include The new functional unit’s datapath is comparable in arbitrary bit permutations, performing multiple bit- latency to that of the classic barrel shifter. It replaces field extract operations in parallel, and performing two existing functional units - shifter and mix - with a multiple bit-field deposit operations in parallel. We much more powerful one. call these permutation (perm), parallel extract (pex) or Keywords: shifter, rotations, permutations, bit bit gather, and parallel deposit (pdep) or bit scatter manipulations, arithmetic, processor operations, respectively. They will be further described in section 2. It has been shown that these operations can be implemented in a single new 1. Introduction Permutation functional unit, utilizing two simple datapaths: an inverse butterfly circuit and a butterfly Computer arithmetic for integers and floating- circuit [1]. point numbers is a well-developed and mature field. In this paper, we show that we can perform both Many books and papers describe the design of integer existing and newly proposed bit manipulation arithmetic and floating-point arithmetic units. Less instructions with a single, simple functional unit, extensively studied is the design of shifters, and rather than two (or more) separate units. Also, instead functional units for advanced bit manipulations in of starting with the existing Shifter functional unit and general-purpose, word-oriented microprocessors. extending it so that it also performs bit permutations, Simple bit-parallel operations like AND, OR, XOR bit gather and bit scatter operations, we propose to and NOT are supported in word-oriented processors. start with the more powerful Permutation functional Denoted “logical” operations, they are typically unit and perform all the operations previously done by implemented with the integer arithmetic operations in the Shifter functional unit. Hence, our new Shift- the Arithmetic-Logic Unit (ALU), which is the most Permute functional unit can perform all the useful bit basic functional unit in a processor. manipulations beyond the simple bit-parallel logical In current microprocessors, only very few bit operations already done by the ALU. operations that are not bit-parallel are supported. The contributions of this paper are: These are shifts and rotates, where each bit moves in • A proposal for a new basis for the design of the same relative amount as every other bit in the word Shifters, based on the inverse butterfly (register). A separate Shifter functional unit is circuit, that is much more powerful, with only typically used to implement these operations. A few small or no impact on cycle-time and area. • A recursive algorithm for determining the Yedidya Hilewitz and Ruby B. Lee, “Performing Advanced Bit Manipulations Efficiently in General-Purpose Processors,” Proceedings of the IEEE Symposium on Computer Arithmetic ( ARITH-18 ), June 25-27, 2007. This work is supported in part by the DoD. Hilewitz is also suppor ted by a NSF Graduate Research Fellowship and a Hertz Foundation Fellowship. control bits for Rotate, Shift, Extract, Deposit arbitrary position in the source register and right and Mix operations on the inverse butterfly justifies that field in the result. Extract is equivalent to and butterfly datapath circuits. a shift right and mask operation. Extract has both • Demonstration of the implementation of a unsigned and signed variants. In the latter, the sign bit powerful Shift-Permute unit, comparing its of the extracted field is propagated to the most- complexity with that of an ordinary Shifter significant bit of the destination register. functional unit. The deposit operation (Figure 1(b)) takes a single In section 2, we list the instructions supported by right justified field of arbitrary length from the source our proposed new functional unit. In section 3, we register and deposits it at any arbitrary position in the show how to obtain the control bits for the inverse destination register. Deposit is equivalent to a left shift butterfly and butterfly datapaths. In section 4, the and mask operation. There are two variants of deposit: implementation of the new functional unit is described the remaining bits can be zeroed out, or they are and compared to that of the barrel and log shifter. supplied from a second register, in which case a Section 5 concludes the paper. masked merge is required. 2. Existing and Newly-Proposed Bit Manipulation Instructions We now detail the operations supported by two existing microprocessor functional units – the Shifter and the Mix multimedia functional units - as well as a (a) recently proposed Permutation functional unit [1]. 2.1. Shifter and Mix Operations The first 4 groups of instructions in Table 1 are the rotate and shift instructions supported in most (b) microprocessors. These include right or left rotates, Figure 1. (a) extr.u r1 = r2, pos, len right or left shifts (with zero or sign propagation). The (b) dep.z r1 = r2, pos, len last 3 groups of instructions exist in a few Instruction Set Architectures (ISAs) such as PA-RISC [2] and IA- The mix operation selects the left subwords or 64 [3]. These include extract, deposit and mix right subwords from each pair of subwords alternating instructions. between the two source registers r 2 and r 3 (Figure 2). The extract operation (Figure 1 (a)) selects a The mix instruction was first introduced in PA-RISC single field of bits of arbitrary length from any for multimedia acceleration [4], and also appears in Table 1. Existing Shifter and Mix Operations Instruction Description rotr r 1 = r 2, shamt Right rotate of r 2. The rotate amount is either an immediate in the opcode (shamt), or rotr r 1 = r 2, r 3 specified using a second source register r 3. rotl r 1 = r 2, shamt Left rotate of r 2. rotl r 1 = r 2, r 3 shr r 1 = r 2, shamt Right shift of r 2 with vacated positions on the left filled with the sign bit (most significant shr r 1 = r 2, r 3 bit of r 2) or zero-filled. The shift amount is either an immediate in the opcode or specified shr.u r 1 = r 2, shamt using a second source register r 3. shr.u r 1 = r 2, r 3 shl r 1 = r 2, shamt Left shift of r 2 with vacated positions on the right zero-filled. shl r 1 = r 2, r 3 extr r 1 = r 2, pos, len Extraction and right justification of single field from r 2 of length len from position pos. extr.u r 1 = r 2, pos, len The high order bits are filled with the sign bit of the extracted field or zero-filled. dep.z r 1 = r 2, pos, len Deposit at position pos of single right justified field from r 2 of length len. Remaining bits dep r 1= r 2, r 3, pos, len are zero-filled or merged from second source register r 3. mix {r,l} {0,1,2,3,4,5} r 1 Select right or left subword from a pair of subwords, alternating between source registers i = r 2, r 3 r2 and r 3. Subword sizes are 2 bits, for i = 0,1,2,…,5, for a 64-bit processor. IA-64 (Itanium) [3], where it is implemented in a The structure of the circuits is shown in Figure 3, separate multimedia functional unit. No ISA currently which shows 8-bit circuits. The n-bit circuits consist of supports mix for subwords smaller than a byte, lg( n) stages, each stage composed of n/2 2-input although this is very useful, e.g., for bit matrix switches. Each of these circuits takes at most one transposition and fast parallel sorting [5]. In our cycle, since they are less complicated than an ALU of proposed new functional unit, mix for bits – and for all the same width, which we normalize to a latency of subword sizes that are powers of 2 – are supported. one processor cycle. Furthermore, each switch is This includes 12 mix operations: mix_left and composed of two 2:1 multiplexers, totaling n × lg( n) mix_right for each of 6 subword sizes of 2 0, 2 1, 2 2, 2 3, multiplexers for each circuit, leading to small overall 24, 2 5, for a 64-bit datapath processor. circuit area. In the ith stage ( i starting from 1), the paired bits are n/2 i positions apart for the butterfly network and 2 i- 1 positions apart for the inverse butterfly network. A switch either passes through or swaps its inputs based on the value of a control bit.

Performing Advanced Bit Manipulations Efficiently in General-Purpose Processors

ADSP-21065L SHARC User's Manual; Chapter 2, Computation

Most Computer Instructions Can Be Classified Into Three Categories

Subchapter 2.4–Hp Server Rp5400 Series

Analysis of GPGPU Programs for Data-Race and Barrier Divergence

The Central Processor Unit

Evolving GPU Machine Code

Readingsample

MT-088: Analog Switches and Multiplexers Basics

Processor-103

A Remotely Accessible Network Processor-Based Router for Network Experimentation

X86 Intrinsics Cheat Sheet Jan Finis [email protected]

Abatement of Area and Power in Multiplexer Based on Cordic Using