Using Dynamic Binary Instrumentation to Generate Multi-Platform SimPoints

Vincent M. Weaver and Sally A. McKee Cornell University

29 January 2008 Breakdown of Benchmark Methodology Used in Major Conferences

HPCA, ISCA, MICRO ISCA (1995-2004) (2006-2007) Yi et al HPCA 2005 Run Complete (simulator) 18% 19% Run Multiple SimPoints 0% 4% Run SMARTS 0% 11% Run One SimPoint 0% 30% Fast Forward X, Run Z 27% 30% Run Reduced (train, MinneSPEC) 19% 22% Run Z 23% 4%

Fusion 1 Phase Behavior — twolf

5 4 3 2 1

L1 D- 0 Miss Rate (%) 2

1 CPI 0 0 1000 2000 3000 Instruction Interval (100M) 300.twolf default

Fusion 2 Phase Behavior — mcf

30 20 10

L1 D-Cache 0 Miss Rate (%) 8 6 4 CPI 2 0 0 200 400 600 Instruction Interval (100M) 181.mcf default

Fusion 3 Phase Behavior — gcc.200

30 20 10

L1 D-Cache 0 Miss Rate (%)

10

CPI 5 0 0 200 400 600 Instruction Interval (100M) 176.gcc 200

Fusion 4 SimPoint Basic Block Vectors

• Instrument every Basic Block

• Increment unique counter on entry to each BB

• Save list of BBs and frequencies periodically

These BBVs are used by SimPoint utility to determine phase behavior

Fusion 5 Dynamic Binary Instrumentation

• Instruments each BB at first execution

• Caches instrumented code

• Runs faster than simulation, slower than native

Fusion 6 Previous Tools for Generating BBVs

— only Alpha with Tru64 UNIX

• Pin — only architectures (IA32, Intel 64, IA64, XScale)

• Simulator mods — only simulator specific (SLOW)

Fusion 7 Goal – Expand Applicability of SimPoints

• Support more architectures

• Use existing DBI tools

• Validate generated BBVs

• Generate cross-platform BBV files

Fusion 8 Solution: Valgrind and Qemu

Pin Valgrind

x86_64 arm ppc

alpha sh4 m68k mips hppa cris Qemu

Fusion 9 Pin

• Intel proprietary DBI tool: IA32, IA64, Intel 64, Xscale (Windows, , OSX)

• Plugins are written in C++

• Average slowdown of 15x on SPEC 2000

Fusion 10 Qemu

• Open Source

• Full-system simulation or syscall-by-proxy

• Translates from arbitrary ISAs via DBI

• Runs ARM, IA32, Intel 64, MIPS, PPC, SPARC, HPPA, Etrax CRIS, Alpha, sh4, m68k

• Average slowdown of 28x on SPEC 2000

Fusion 11 Valgrind

• Open Source, external plugin interface

• Originally designed to find memory access violations

• Translates to own IR, instruments, retranslates back to native ISA

• Runs on IA32, Intel 64 and PPC platforms (Linux and AIX)

• Average slowdown of 39x on SPEC 2000

Fusion 12 Validation Systems

machine memory L1 I/D L2/L3 Cache nestle 400MHz II 256MB 16KB/16KB 512KB spruengli 550MHz Pentium III 512MB 16KB/16KB 512KB itanium 800MHz Itanium (x86 mode) 1GB 16KB/16KB 96KB/3MB chocovic 1.66GHz Core Duo 1GB 32KB/32KB 1MB milka 1.733MHz Athlon MP 512MB 64KB/64KB 256KB gallais 1.8GHz 256MB 12Ku/16KB 256KB jennifer 2GHz Athlon64 X2 1GB 64KB/64KB 512KB sampaka12 2.8GHz Pentium 4 2GB 12Ku/16KB 512KB domori25 3.46GHz 4GB 12Ku/16KB 2MB

Fusion 13 SPEC CPU 2000

40 46.7% 50.1% 40.6% 44.9% 82.6% 50.8% 69.1% 55.3% 54.4% 54.7% 43.1% 46.2% 40.3% 30

20

CPI Error (%) 10

0 nestle spruengli itanium chocovic milka gallais jennifer sampaka12 domori25 Pentium II Pentium III Itanium Core Duo Athlon Pentium 4 Athlon 64 Pentium 4 Pentium D

First 100M Pin, one SimPoint Pin, up to 10 SimPoints Pin, up to 20 SimPoints Ffwd 1B, 100M Qemu, one SimPoint Qemu, up to 10 SimPoints Qemu, up to 20 SimPoints Valgrind, one SimPoint Valgrind, up to 10 SimPoints Valgrind, up to 20 SimPoints

Fusion 14 SPEC CPU 2000 Breakdown – Pentium D

Integer Results, up to 20 SimPoints, Pentium D Pin Qemu Valgrind 10 11.8 19.8 19.3 12.9 22.8 19.9 10.7 5

0

CPI Error (%) -5

-10 -109

eon.cook gcc.expr gcc.166gcc.200 gzip.log vortex.1vortex.2vortex.3vpr.routevpr.place eon.kajiyagap.default gcc.scilab gzip.sourcemcf.default bzip2.graphicbzip2.sourcecrafty.default gcc.integrate gzip.graphicgzip.programgzip.random parser.default perlbmk.535perlbmk.704perlbmk.957perlbmk.850twolf.default bzip2.program eon.rushmeier perlbmk.diffmailperlbmk.perfect perlbmk.makerand

FP Results, up to 20 SimPoints, Pentium D Pin Qemu Valgrind 10 17.5 10.8

5

0

CPI Error (%) -5

-10 -15.5

art.110 art.470 apsi.default ammp.defaultapplu.default fma3d.defaultgalgel.defaultlucas.defaultmesa.defaultmgrid.default swim.default equake.defaultfacerec.default sixtrack.default wupwise.default

Fusion 15 SPEC CPU 2006

40 63.7% 110.1% 42.6% 124.7% 110.0%

30

20

CPI Error (%) 10

0 chocovic - Core Duo sampaka12 - Pentium 4 domori25 - Pentium D jennifer - Athlon 64

First 100M Pin, one SimPoint Pin, up to 10 SimPoints Pin, up to 20 SimPoints Ffwd 1 B, 100M Qemu, one SimPoint Qemu, up to 10 SimPoints Qemu, up to 20 SimPoints Valgrind, one SimPoint Valgrind, up to 10 SimPoints Valgrind, up to 20 SimPoints

machine processor memory L1 I/D L2 Cache chocovic 1.66GHz Core Duo 1GB 32KB/32KB 1MB jennifer 2GHz Athlon64 X2 1GB 64KB/64KB 512KB sampaka12 2.8GHz Pentium 4 2GB 12Ku/16KB 512KB domori25 3.46GHz Pentium D 4GB 12Ku/16KB 2MB

Fusion 16 SPEC CPU 2006 Breakdown – Pentium D

Integer Results, up to 20 SimPoints, Pentium D Pin Qemu Valgrind 10 12.1 13.0 11.3 10.711.5 13.1 15.6

5

0

CPI Error (%) -5

-10 -11.4 -12.6 -36.3 -12.8 mcf sjeng gcc.166gcc.200 gcc.expr gcc.g23gcc.s04 omnetpp astar.rivers bzip2.html gcc.cp-declgcc.expr2 gcc.scilab libquantum xalancbmk bzip2.sourcebzip2.chickenbzip2.liberty gcc.c-typeck gobmk.13x13gobmk.nngs hmmer.nph3hmmer.retro astar.BigLakes bzip2.programbzip2.combined gobmk.score2gobmk.trevorcgobmk.trevord h264ref.fore_baseh264ref.fore_mainh264ref.sss_main perlbench.diffmailperlbench.splitmail perlbench.checkspam

FP Results, up to 20 SimPoints, Pentium D Pin Qemu Valgrind 10

5

0 n/a

CPI Error (%) -5

-10 -20.5 -33.25 lbm wrf dealII milc namd tonto bwaves calculix leslie3d povray sphinx3 gromacs soplex.ref zeusmp cactusADM GemsFDTD gamess.h2ocu2 soplex.pds-50 gamess.cytosinegamess.triazolium

Fusion 17 Results – Average CPI error

• SPEC2000: 10 SimPoints (<0.4% of reference inputs) ◦ Pin 5.32% ◦ Qemu 5.04% ◦ Valgrind 5.38% • SPEC2006: 10 SimPoints (<0.06% of reference inputs) ◦ Pin 5.58% ◦ Qemu 5.30% ◦ Valgrind 5.28%

Fusion 18 Future Work

• Generating multi-platform results – MIPS BBV file from IA32 Qemu

• Running non-Linux binaries (Solaris, IRIX)

• Generating OS-aware SimPoints – Qemu runs full OS

Fusion 19 Tools Available for Download

All code is available from: http://fusion.csl.cornell.edu/tools/

Fusion 20 Questions? Feedback?

All code is available from: http://fusion.csl.cornell.edu/tools/ [email protected] — http://www.csl.cornell.edu/˜vince [email protected] — http://www.csl.cornell.edu/˜sam

Fusion 21