Using Dynamic Binary Instrumentation to Generate Multi-Platform SimPoints
Vincent M. Weaver and Sally A. McKee Cornell University
29 January 2008 Breakdown of Benchmark Methodology Used in Major Conferences
HPCA, ISCA, MICRO ISCA (1995-2004) (2006-2007) Yi et al HPCA 2005 Run Complete (simulator) 18% 19% Run Multiple SimPoints 0% 4% Run SMARTS 0% 11% Run One SimPoint 0% 30% Fast Forward X, Run Z 27% 30% Run Reduced (train, MinneSPEC) 19% 22% Run Z 23% 4%
Fusion 1 Phase Behavior — twolf
5 4 3 2 1
L1 D-Cache 0 Miss Rate (%) 2
1 CPI 0 0 1000 2000 3000 Instruction Interval (100M) 300.twolf default
Fusion 2 Phase Behavior — mcf
30 20 10
L1 D-Cache 0 Miss Rate (%) 8 6 4 CPI 2 0 0 200 400 600 Instruction Interval (100M) 181.mcf default
Fusion 3 Phase Behavior — gcc.200
30 20 10
L1 D-Cache 0 Miss Rate (%)
10
CPI 5 0 0 200 400 600 Instruction Interval (100M) 176.gcc 200
Fusion 4 SimPoint Basic Block Vectors
• Instrument every Basic Block
• Increment unique counter on entry to each BB
• Save list of BBs and frequencies periodically
These BBVs are used by SimPoint utility to determine phase behavior
Fusion 5 Dynamic Binary Instrumentation
• Instruments each BB at first execution
• Caches instrumented code
• Runs faster than simulation, slower than native
Fusion 6 Previous Tools for Generating BBVs
• Atom — only Alpha with Tru64 UNIX
• Pin — only Intel architectures (IA32, Intel 64, IA64, XScale)
• Simulator mods — only simulator specific (SLOW)
Fusion 7 Goal – Expand Applicability of SimPoints
• Support more architectures
• Use existing DBI tools
• Validate generated BBVs
• Generate cross-platform BBV files
Fusion 8 Solution: Valgrind and Qemu
Pin Valgrind
alpha sh4 m68k sparc mips hppa cris Qemu
Fusion 9 Pin
• Intel proprietary DBI tool: IA32, IA64, Intel 64, Xscale (Windows, Linux, OSX)
• Plugins are written in C++
• Average slowdown of 15x on SPEC 2000
Fusion 10 Qemu
• Open Source
• Full-system simulation or syscall-by-proxy
• Translates from arbitrary ISAs via DBI
• Runs ARM, IA32, Intel 64, MIPS, PPC, SPARC, HPPA, Etrax CRIS, Alpha, sh4, m68k
• Average slowdown of 28x on SPEC 2000
Fusion 11 Valgrind
• Open Source, external plugin interface
• Originally designed to find memory access violations
• Translates to own IR, instruments, retranslates back to native ISA
• Runs on IA32, Intel 64 and PPC platforms (Linux and AIX)
• Average slowdown of 39x on SPEC 2000
Fusion 12 Validation Systems
machine processor memory L1 I/D L2/L3 Cache nestle 400MHz Pentium II 256MB 16KB/16KB 512KB spruengli 550MHz Pentium III 512MB 16KB/16KB 512KB itanium 800MHz Itanium (x86 mode) 1GB 16KB/16KB 96KB/3MB chocovic 1.66GHz Core Duo 1GB 32KB/32KB 1MB milka 1.733MHz Athlon MP 512MB 64KB/64KB 256KB gallais 1.8GHz Pentium 4 256MB 12Ku/16KB 256KB jennifer 2GHz Athlon64 X2 1GB 64KB/64KB 512KB sampaka12 2.8GHz Pentium 4 2GB 12Ku/16KB 512KB domori25 3.46GHz Pentium D 4GB 12Ku/16KB 2MB
Fusion 13 SPEC CPU 2000
40 46.7% 50.1% 40.6% 44.9% 82.6% 50.8% 69.1% 55.3% 54.4% 54.7% 43.1% 46.2% 40.3% 30
20
CPI Error (%) 10
0 nestle spruengli itanium chocovic milka gallais jennifer sampaka12 domori25 Pentium II Pentium III Itanium Core Duo Athlon Pentium 4 Athlon 64 Pentium 4 Pentium D
First 100M Pin, one SimPoint Pin, up to 10 SimPoints Pin, up to 20 SimPoints Ffwd 1B, 100M Qemu, one SimPoint Qemu, up to 10 SimPoints Qemu, up to 20 SimPoints Valgrind, one SimPoint Valgrind, up to 10 SimPoints Valgrind, up to 20 SimPoints
Fusion 14 SPEC CPU 2000 Breakdown – Pentium D
Integer Results, up to 20 SimPoints, Pentium D Pin Qemu Valgrind 10 11.8 19.8 19.3 12.9 22.8 19.9 10.7 5
0
CPI Error (%) -5
-10 -109
eon.cook gcc.expr gcc.166gcc.200 gzip.log vortex.1vortex.2vortex.3vpr.routevpr.place eon.kajiyagap.default gcc.scilab gzip.sourcemcf.default bzip2.graphicbzip2.sourcecrafty.default gcc.integrate gzip.graphicgzip.programgzip.random parser.default perlbmk.535perlbmk.704perlbmk.957perlbmk.850twolf.default bzip2.program eon.rushmeier perlbmk.diffmailperlbmk.perfect perlbmk.makerand
FP Results, up to 20 SimPoints, Pentium D Pin Qemu Valgrind 10 17.5 10.8
5
0
CPI Error (%) -5
-10 -15.5
art.110 art.470 apsi.default ammp.defaultapplu.default fma3d.defaultgalgel.defaultlucas.defaultmesa.defaultmgrid.default swim.default equake.defaultfacerec.default sixtrack.default wupwise.default
Fusion 15 SPEC CPU 2006
40 63.7% 110.1% 42.6% 124.7% 110.0%
30
20
CPI Error (%) 10
0 chocovic - Core Duo sampaka12 - Pentium 4 domori25 - Pentium D jennifer - Athlon 64
First 100M Pin, one SimPoint Pin, up to 10 SimPoints Pin, up to 20 SimPoints Ffwd 1 B, 100M Qemu, one SimPoint Qemu, up to 10 SimPoints Qemu, up to 20 SimPoints Valgrind, one SimPoint Valgrind, up to 10 SimPoints Valgrind, up to 20 SimPoints
machine processor memory L1 I/D L2 Cache chocovic 1.66GHz Core Duo 1GB 32KB/32KB 1MB jennifer 2GHz Athlon64 X2 1GB 64KB/64KB 512KB sampaka12 2.8GHz Pentium 4 2GB 12Ku/16KB 512KB domori25 3.46GHz Pentium D 4GB 12Ku/16KB 2MB
Fusion 16 SPEC CPU 2006 Breakdown – Pentium D
Integer Results, up to 20 SimPoints, Pentium D Pin Qemu Valgrind 10 12.1 13.0 11.3 10.711.5 13.1 15.6
5
0
CPI Error (%) -5
-10 -11.4 -12.6 -36.3 -12.8 mcf sjeng gcc.166gcc.200 gcc.expr gcc.g23gcc.s04 omnetpp astar.rivers bzip2.html gcc.cp-declgcc.expr2 gcc.scilab libquantum xalancbmk bzip2.sourcebzip2.chickenbzip2.liberty gcc.c-typeck gobmk.13x13gobmk.nngs hmmer.nph3hmmer.retro astar.BigLakes bzip2.programbzip2.combined gobmk.score2gobmk.trevorcgobmk.trevord h264ref.fore_baseh264ref.fore_mainh264ref.sss_main perlbench.diffmailperlbench.splitmail perlbench.checkspam
FP Results, up to 20 SimPoints, Pentium D Pin Qemu Valgrind 10
5
0 n/a
CPI Error (%) -5
-10 -20.5 -33.25 lbm wrf dealII milc namd tonto bwaves calculix leslie3d povray sphinx3 gromacs soplex.ref zeusmp cactusADM GemsFDTD gamess.h2ocu2 soplex.pds-50 gamess.cytosinegamess.triazolium
Fusion 17 Results – Average CPI error
• SPEC2000: 10 SimPoints (<0.4% of reference inputs) ◦ Pin 5.32% ◦ Qemu 5.04% ◦ Valgrind 5.38% • SPEC2006: 10 SimPoints (<0.06% of reference inputs) ◦ Pin 5.58% ◦ Qemu 5.30% ◦ Valgrind 5.28%
Fusion 18 Future Work
• Generating multi-platform results – MIPS BBV file from IA32 Qemu
• Running non-Linux binaries (Solaris, IRIX)
• Generating OS-aware SimPoints – Qemu runs full OS
Fusion 19 Tools Available for Download
All code is available from: http://fusion.csl.cornell.edu/tools/
Fusion 20 Questions? Feedback?
All code is available from: http://fusion.csl.cornell.edu/tools/ [email protected] — http://www.csl.cornell.edu/˜vince [email protected] — http://www.csl.cornell.edu/˜sam
Fusion 21