<<

Optimizing for Size: Exploring the Limits of Code Density Sally A. McKee Vincent M. Weaver ASPLOS XIV Poster Session, 8 March 2009 Cornell University Chalmers University of Technology

Abstract Hand-Optimized Assembly Results Architectural Size Correlations

0.9020 Minimum possible instruction size Reductions in instruction count can improve and bandwidth utiliza- 0.8652 Number of integer registers tion, lower power consumption, and increase overall performance. Nonethe- Total Size of Executable VLIW 2560 RISC less, code density is often overlooked when studying architectures. 2048 CISC 0.6623 Architecture has a zero register embedded 1536 8/16-bit 0.5845 Bit-width bytes 1024 We hand-optimize an embedded for size in assembly language -0.4366 Hardware divide in ALU on 20 different instruction set architectures and investigate the architectural 512 features that contribute most heavily to code density. 0 0.4385 Number of operands in each instruction arm ppc vax sh3 z80 ia64 6502 mips s390 i386 alphaparisc m88k m68k thumb avr32 -0.3356 Unaligned load/store available x86_64 crisv32pdp-11 0.2773 Year the architecture was introduced 256 Size of LZSS Decompression Code VLIW -0.2597 Auto-incrementing addressing scheme Background RISC CISC 192 embedded -0.2597 Hardware status flags (zero/overflow/etc.) 128 8/16-bit bytes -0.1252 Little or Big endian 64 The 20 architectures investigated can be broadly broken into 5 categories: -0.0487 Branch delay slot 0 -0.0079 Maximum possible instruction size Instr Length Opcode arm ppc vax z80 sh3 ia64 mips s390 6502 i386 Type Represented Architectures alpha parisc sparc m88k m68kthumb avr32 (bytes) Args pdp-11 x86_64crisv32 VLIW ia64 16/3 3 Size of String-Search Code VLIW Other Considerations alpha, arm, m88k, mips 256 RISC RISC 4 3 CISC pa-risc, ppc, sparc 192 embedded 128 8/16-bit System libraries and compiler overhead can overshadow the effects of size bytes CISC m68k, s390, vax, , x86 64 1-54 2 64 optimization, increasing memory footprint by several orders of magnitude. N/A Embedded avr32, crisv32, sh3, THUMB 2 2 0 These effects cannot be ignored when analyzing system-wide code-density.

ppc arm sh3 vax z80 8/16-bit 6502, pdp-11, z80 1-6 1-2 ia64 mips i386 s390 6502 alpha sparc m88k parisc m68kthumb avr32 473kB 473kB 473kB 473kB pdp-11 x86_64 crisv32

Below is the output from the benchmark tool. It uses LZSS decompression, 8192 Size of Integer/ASCII Conversion Code VLIW file I/O, and string manipulation. 256 RISC 6144 CISC 192 embedded 4096 8/16-bit ּ###############################################################################

128 Size (bytes) bytes No HW divide 2048 ּ############################################################################### O#O##########ּ 64 1024################################################################## N/A ּ############################################################################### O3 Os O3,Os O3,Os O3 Os O3 Os O3,Os O3,Os O3,Os O3 Os O3,Os O3,Os hand gcc42 gcc42 intel9 sun12 gcc41 gcc41 gcc42 gcc42 intel9 sun12 gcc41 gcc42 gcc42 intel9 sun12 optimized 0 ּ############################################################################### GLIBC uCLIBC ּ############################################################################### arm z80 sh3 ppc vax ia64 6502 mips s390 i386 parisc alpha sparc m88k m68kthumb avr32 STATICALLY LINKED DYNAMICALLY LINKED SYSCALL ONLY ASM ּ############################################################################### crisv32pdp-11 x86_64 ּ############################################################################### ּ############################################################################### ּ############################################################################### Size of String Concatenation Code VLIW Future Work 64 ּ############################################################################### RISC ּ############################################################################### CISC embedded 48 ּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּ LinuxּVersionּ2.6.11.421.17smp,ּCompiledּ#1ּSMPּFriּAprּ6ּ08:42:34ּUTCּ2007ּּּ 32 8/16-bit .Twoּ2791MHzּIntel(R)ּXeon(TM)ּProcessors,ּ2027MּRAM,ּ5521.40ּBogomipsּTotalּּּ bytes • Investigate more architecturesּּ sampakaּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּ 16ּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּ • ּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּ .Evaluate performance impact of dense architectures 0 ּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּּ • arm ppc vax sh3 z80 ia64 mips 6502 s390 i386 Propose compact architectures for FPGA and GPU-based systems. alpha sparc parisc m88k thumb m68k avr32 crisv32x86_64pdp-11