IA-64: a Parallel Instruction Set: 5/31/99


VOLUME 13, NUMBER 7        MAY 31, 1999
MICROPROCESSOR REPORT — THE INSIDERS' GUIDE TO MICROPROCESSOR HARDWARE

IA-64: A Parallel Instruction Set
Register Frames, x86 Mode, Instruction Set Disclosed

by Linley Gwennap

Finally allowing a full evaluation of their new instruction set, Intel and Hewlett-Packard have released a full description of IA-64's application-level architecture and instruction set. The disclosures address some previous criticisms of the architecture and provide more details concerning how IA-64 processors will execute both x86 and PA-RISC binaries.

The disclosures show a thoroughly modern instruction set with a range of multimedia instructions and prefetch capabilities. Although IA-64 includes many RISC concepts, the architects added some rather complicated and specialized instructions. Concerns remain, however, about code density and just how much of an advantage these new features will provide over a standard RISC architecture.

One criticism had been that the large register file, while effective for compute-intensive routines, would cause excessive overhead on subroutine calls, due to saving and restoring the contents of the registers. The vendors disclosed that IA-64 supports register frames that alleviate much of the call/save overhead.

Register Frames Are Dynamically Sized

With IA-64's 128 integer registers plus predicates, saving and restoring the entire register file takes more than four times as long as on a standard RISC processor. To ease this problem, IA-64 implements register frames, which take advantage of the large register file to efficiently handle multiple levels of subroutine calls. Register frames are similar to the fixed-size register windows in SPARC (see MPR 12/26/90, p. 9), except that IA-64 allows software to dynamically specify the size of each frame using the ALLOC instruction.

In the example shown in Figure 1, the top-level routine has specified a register frame with 19 registers, divided as 12 for local use and 7 for parameter passing (output). In addition, the routine can use the first 32 registers, which are designated for global use. When this routine calls a subroutine, the register frame pointer is advanced by 12 (the number of local registers) to create a new register frame. This subroutine uses ALLOC to set up 15 locals and 8 outputs. The first 7 locals overlap the outputs of the previous routine, providing input parameters to the subroutine. The subroutine also has access to the 32 global registers.

From the subroutine's viewpoint, however, the registers in its frame are numbered from 32 to 54, even though they occupy physical registers 44 to 66. Thus, the compiler doesn't have to know which registers are unused by previous routines; it simply arranges each routine within its own virtual register space. This technique simplifies situations in which a subroutine can be called from various places in a program, and it can avoid saving and restoring registers, even when a subroutine is called through a dynamic link.

Figure 1. In this example, a top-level IA-64 routine calls a subroutine and allocates a new frame with 23 registers that overlap the output registers of the original routine. The subroutine uses virtual registers 32–54, but these are mapped to physical registers 44–66 by the hardware.

  Top-level routine (physical = virtual):  GR0–31: 32 globals | GR32–43: 12 locals | GR44–50: 7 out
  Subroutine (physical registers):         GR0–31: 32 globals | GR44–58: 15 locals | GR59–66: 8 out
  Subroutine (virtual registers):          GR0–31: 32 globals | GR32–46: 15 locals | GR47–54: 8 out

Register Save Engine Spills and Fills

Even though the IA-64 register file is large, it is finite. The architecture defines a register save engine (RSE) that automatically spills registers to memory when the register file is fully allocated, creating the illusion of an infinite register file. For example, if ALLOC must create a 20-register frame starting at physical register 120, the frame will go to register 127 and then wrap back to register 32. To avoid destroying state, the contents of registers 32–43 are stored to the memory stack. The RSE can save and restore registers before they are needed, minimizing the performance impact. This activity is invisible to software, whereas SPARC software must manually spill and fill registers.
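The frame arithmetic in Figure 1 can be sketched as a toy model. This is illustrative Python only, with invented class and method names; it follows the article's description of ALLOC and the advancing frame pointer, not any disclosed hardware algorithm.

```python
# Toy model of IA-64 register frames, following Figure 1's example.
# Simplified sketch: names and structure are invented for illustration.

NUM_PHYS = 128   # GR0-GR127
GLOBALS  = 32    # GR0-GR31 are visible to every routine

class FrameModel:
    def __init__(self):
        self.frame_base = GLOBALS   # physical register where the frame starts
        self.locals = 0
        self.outputs = 0

    def alloc(self, n_locals, n_outputs):
        """Model the ALLOC instruction: set the current frame's size."""
        self.locals, self.outputs = n_locals, n_outputs

    def call(self):
        """On a call, the frame pointer advances past the caller's locals, so
        the callee's frame begins on the caller's output (parameter) registers."""
        self.frame_base += self.locals
        self.locals = self.outputs    # caller's outputs become callee inputs
        self.outputs = 0

    def phys(self, virtual):
        """Translate a virtual register number to a physical one."""
        if virtual < GLOBALS:
            return virtual            # globals are never remapped
        offset = virtual - GLOBALS
        return GLOBALS + (self.frame_base - GLOBALS + offset) % (NUM_PHYS - GLOBALS)

m = FrameModel()
m.alloc(12, 7)     # top-level routine: 12 locals + 7 outputs, frame at GR32
m.call()           # subroutine's frame starts at physical register 44
m.alloc(15, 8)     # subroutine: 15 locals + 8 outputs
print(m.phys(32), m.phys(54))   # -> 44 66, matching Figure 1
```

Run against Figure 1's numbers, virtual registers 32 and 54 land on physical registers 44 and 66, while global registers pass through unchanged.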
The register frames minimize the number of registers saved and restored during procedure calls, although IA-64's static design will result in more "live" registers than in a traditional dynamic design. Routines with high computational requirements can still take advantage of the full register file by allocating a frame with up to 96 registers (plus globals). Any routine that desires can also use IA-64's register rotation (see MPR 3/8/99, p. 16); to simplify the hardware, the rotating portion of the frame is required to start at GR32 and contain a multiple of eight registers.

With rotation and framing, the translation to physical register addresses requires adding two 7-bit values—the register frame pointer and the rotating register base (RRB)—to the virtual register number and wrapping around from register 127 to 32 if necessary. Although parts of this calculation can be done ahead of time, an IA-64 processor may require an extra pipeline stage to access its registers. Compared with the register renaming found in most modern processors, however, the IA-64 approach requires much less hardware and adds less time to the pipeline.

Register Set Provides Massive Resources

Only the integer registers are framed; all other registers must be saved and restored explicitly by software. This restriction simplifies the register save engine, which doesn't need to keep track of separate regions to save and restore various register types. Furthermore, the predicate registers can be saved quickly, as a single memory access saves all 64 of the 1-bit predicates, and the floating-point registers are not used by most subroutines. To further reduce save overhead, software needs to save FP registers only if the user mask indicates that either the lower or upper half of the FP register file has been written.

As Figure 2 shows, the floating-point registers are 82 bits wide. The native FP format is identical to Intel's 80-bit extended-precision (EP) mode, except that the exponent field has 17 bits instead of 15. All results are calculated in the native format and converted to the requested precision. The extra two exponent bits handle common overflows and underflows in EP mode, simplifying some scientific algorithms.

Each of the integer registers (except GR0, which is always zero) has an associated NaT (not a thing) bit to support speculative loads (see MPR 3/8/99, p. 16). The FP registers, however, don't need a NaT bit; instead, they encode NaT as a special value that is unused by the IEEE-754 standard.

The 64-bit instruction pointer (IP) points to the next bundle to be executed. The current frame marker (CFM) holds the register frame pointer and frame size, along with the RRB values for the integer, FP, and predicate register files. The user mask is the application-visible portion of the processor status register (PSR). It controls the endianness of loads and stores, enables or disables the performance monitors and data-alignment checking, and contains the dirty flags for the upper and lower halves of the FP register file.

Like current Intel products, IA-64 processors will include a serial number and CPU_ID. This information is spread across five 64-bit registers, including two to hold a 16-character ASCII vendor name (e.g., "Genuine Intel").
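The virtual-to-physical translation described earlier (add the frame pointer and, within the rotating region, the RRB to the virtual register number, wrapping from register 127 back to 32) can be sketched as follows. This is an illustrative reading of the article's description, not a disclosed hardware algorithm; the rotating-region handling in particular is my interpretation.

```python
# Illustrative sketch of IA-64 virtual-to-physical register translation:
# add the register frame pointer (and, inside the rotating region, the RRB)
# to the virtual register number, wrapping from GR127 back to GR32.
# Simplified model, not the actual hardware mechanism.

def translate(virtual, frame_pointer, rrb=0, rotating_size=0):
    if virtual < 32:
        return virtual                 # GR0-GR31 are global, never remapped
    offset = virtual - 32
    if offset < rotating_size:         # rotating region starts at GR32 and
        offset = (offset + rrb) % rotating_size   # holds a multiple of 8 regs
    return 32 + (frame_pointer - 32 + offset) % 96    # wrap GR127 -> GR32

# Figure 1's subroutine: frame pointer at GR44, no rotation in use
print(translate(32, 44), translate(54, 44))   # -> 44 66
# A frame starting at GR120 runs to GR127, then wraps back to GR32
print(translate(40, 120))                     # -> 32
```

The two modulo additions mirror the two 7-bit adds the article describes; doing them ahead of time for common cases is what lets hardware hide most of the cost, at the possible price of an extra pipeline stage.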
IA-64 also defines a set of 128 special registers, similar to PA-RISC's control registers, called the application registers. Although many are reserved for future definition, several have important functions. For example, AR65–66 contain the loop count (LC) and epilogue count (EC) used to control loops with rotating registers. AR16–19 provide information for the register save engine, such as the memory address for spilling registers.

Figure 2. The IA-64 register file consists of 128 integer registers, 128 floating-point registers, and 64 predicate registers, along with an extensive set of special-purpose registers: eight branch registers (BR0–BR7), the instruction pointer (IP), the current frame marker (CFM), the user mask, CPU-identification registers (CPUID0–CPUIDn), performance-monitor data registers (PMD0–PMDm), and the application registers (including the kernel registers KR0–KR7; the RSE controls RSC, BSP, BSPSTORE, and RNAT; and FCR, EFLAG, CSD, SSD, CFLG, FSR, FIR, FDR, CCV, UNAT, FPSR, ITC, PFS, LC, and EC). Registers shared between IA-64 and x86 modes carry their x86 functions in parentheses: GR8–GR15 map to EAX, ECX, EDX, EBX, ESP, EBP, ESI, and EDI; GR16–GR17 hold the selectors; GR24–GR31 hold the segment descriptors; FR8–FR15 overlay MM0–7/FP0–7; and FR16–FR31 overlay XMM0–7.

©MICRODESIGN RESOURCES        MAY 31, 1999        MICROPROCESSOR REPORT

…must properly set up the x86 segment descriptors, PSR, and EFLAG registers.

Furthermore, the architecture allows the processor to overwrite all of the nonshared general, FP, and predicate registers, as well as the ALAT (used for IA-64's speculative loads), during x86 execution.
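As a rough illustration of how the LC and EC application registers sequence a software-pipelined counted loop, here is a toy model. This is my simplification of the loop-branch behavior implied above, not the architecture's exact definition: LC counts the remaining source iterations, while EC drains the in-flight pipeline stages after the last iteration has been issued.

```python
# Toy model of a counted loop governed by the loop count (LC) and epilogue
# count (EC) application registers.  Hedged sketch: the function and its
# phase logic are an illustration, not the architected branch semantics.

def loop_body_executions(lc, ec):
    """How many times the pipelined loop body runs before falling through."""
    trips = 0
    while True:
        trips += 1
        if lc > 0:
            lc -= 1        # kernel phase: more source iterations to start
        elif ec > 1:
            ec -= 1        # epilogue phase: finish in-flight iterations
        else:
            return trips   # both counters exhausted: exit the loop

# A 5-iteration loop pipelined into 3 stages: LC = 4, EC = 3
print(loop_body_executions(4, 3))   # -> 7  (5 iterations + 2 drain trips)
```

In this model a loop of N iterations pipelined into P stages runs the body N + P − 1 times, which is why a single pair of counters suffices to control both the kernel and the epilogue without separate drain code.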