Using Dynamic Binary Instrumentation to Generate Multi-Platform SimPoints

Vincent M. Weaver and Sally A. McKee
Cornell University
29 January 2008

Breakdown of Benchmark Methodology Used in Major Conferences

Methodology                         HPCA, ISCA, MICRO (1995-2004)   ISCA (2006-2007)
                                    (Yi et al., HPCA 2005)
Run Complete (simulator)            18%                             19%
Run Multiple SimPoints              0%                              4%
Run SMARTS                          0%                              11%
Run One SimPoint                    0%                              30%
Fast Forward X, Run Z               27%                             30%
Run Reduced (train, MinneSPEC)      19%                             22%
Run Z                               23%                             4%

Phase Behavior: twolf
[Figure: L1 D-cache miss rate (%) and CPI per instruction interval (100M), 300.twolf default]

Phase Behavior: mcf
[Figure: L1 D-cache miss rate (%) and CPI per instruction interval (100M), 181.mcf default]

Phase Behavior: gcc.200
[Figure: L1 D-cache miss rate (%) and CPI per instruction interval (100M), 176.gcc 200]

SimPoint Basic Block Vectors
• Instrument every basic block (BB)
• Increment a unique counter on entry to each BB
• Save the list of BBs and their frequencies periodically
These BBVs are used by the SimPoint utility to determine phase behavior.

Dynamic Binary Instrumentation
• Instruments each BB at first execution
• Caches instrumented code
• Runs faster than simulation, slower than native

Previous Tools for Generating BBVs
• Atom: only Alpha with Tru64 UNIX
• Pin: only Intel architectures (IA32, Intel 64, IA64, XScale)
• Simulator mods: simulator specific (SLOW)

Goal: Expand Applicability of SimPoints
• Support more architectures
• Use existing DBI tools
• Validate generated BBVs
• Generate cross-platform BBV files

Solution: Valgrind and Qemu
[Figure: diagram of the architectures covered by Pin, Valgrind and Qemu: itanium, x86, x86_64, arm, ppc, alpha, sh4, m68k, sparc, mips, hppa, cris]

Pin
• Intel proprietary DBI tool: IA32, IA64, Intel 64, XScale (Windows, Linux, OSX)
• Plugins are written in C++
• Average slowdown of 15x on SPEC 2000

Qemu
• Open Source
• Full-system simulation or syscall-by-proxy
• Translates from arbitrary ISAs via DBI
• Runs ARM, IA32, Intel 64, MIPS, PPC, SPARC, HPPA, Etrax CRIS, Alpha, sh4, m68k
• Average slowdown of 28x on SPEC 2000

Valgrind
• Open Source, external plugin interface
• Originally designed to find memory access violations
• Translates to its own IR, instruments, then retranslates back to the native ISA
• Runs on IA32, Intel 64 and PPC platforms (Linux and AIX)
• Average slowdown of 39x on SPEC 2000

Validation Systems

machine     processor                     memory   L1 I/D       L2/L3 Cache
nestle      400MHz Pentium II             256MB    16KB/16KB    512KB
spruengli   550MHz Pentium III            512MB    16KB/16KB    512KB
itanium     800MHz Itanium (x86 mode)     1GB      16KB/16KB    96KB/3MB
chocovic    1.66GHz Core Duo              1GB      32KB/32KB    1MB
milka       1.733GHz Athlon MP            512MB    64KB/64KB    256KB
gallais     1.8GHz Pentium 4              256MB    12Kµ/16KB    256KB
jennifer    2GHz Athlon64 X2              1GB      64KB/64KB    512KB
sampaka12   2.8GHz Pentium 4              2GB      12Kµ/16KB    512KB
domori25    3.46GHz Pentium D             4GB      12Kµ/16KB    2MB

SPEC CPU 2000
[Figure: CPI error (%) per validation machine (nestle, spruengli, itanium, chocovic, milka, gallais, jennifer, sampaka12, domori25). Series: First 100M; Ffwd 1B, 100M; Pin, Qemu and Valgrind, each with one SimPoint, up to 10 SimPoints and up to 20 SimPoints. Bars exceeding the 40% axis limit are labeled 46.7%, 50.1%, 40.6%, 44.9%, 82.6%, 50.8%, 69.1%, 55.3%, 54.4%, 54.7%, 43.1%, 46.2%, 40.3%]

SPEC CPU 2000 Breakdown: Pentium D
[Figure: Integer Results, up to 20 SimPoints, Pentium D. CPI error (%) per benchmark input for Pin, Qemu and Valgrind. Off-scale bars are labeled 11.8, 19.8, 19.3, 12.9, 22.8, 19.9, 10.7 and -109]
[Figure: FP Results, up to 20 SimPoints, Pentium D. CPI error (%) per benchmark input for Pin, Qemu and Valgrind. Off-scale bars are labeled 17.5, 10.8 and -15.5]

SPEC CPU 2006
[Figure: CPI error (%) per validation machine (chocovic: Core Duo, sampaka12: Pentium 4, domori25: Pentium D, jennifer: Athlon 64). Series: First 100M; Ffwd 1B, 100M; Pin, Qemu and Valgrind, each with one SimPoint, up to 10 SimPoints and up to 20 SimPoints. Bars exceeding the 40% axis limit are labeled 63.7%, 110.1%, 42.6%, 124.7%, 110.0%]

machine     processor             memory   L1 I/D       L2 Cache
chocovic    1.66GHz Core Duo      1GB      32KB/32KB    1MB
jennifer    2GHz Athlon64 X2      1GB      64KB/64KB    512KB
sampaka12   2.8GHz Pentium 4      2GB      12Kµ/16KB    512KB
domori25    3.46GHz Pentium D     4GB      12Kµ/16KB    2MB

SPEC CPU 2006 Breakdown: Pentium D
[Figure: Integer Results, up to 20 SimPoints, Pentium D. CPI error (%) per benchmark input for Pin, Qemu and Valgrind. Off-scale bars are labeled 12.1, 13.0, 11.3, 10.7, 11.5, 13.1, 15.6, -11.4, -12.6, -36.3 and -12.8]
[Figure: FP Results, up to 20 SimPoints, Pentium D. CPI error (%) per benchmark input for Pin, Qemu and Valgrind. One entry is marked n/a; off-scale bars are labeled -20.5 and -33.25]

Results: Average CPI Error
• SPEC2000: 10 SimPoints (<0.4% of reference inputs)
  ◦ Pin 5.32%
  ◦ Qemu 5.04%
  ◦ Valgrind 5.38%
• SPEC2006: 10 SimPoints (<0.06% of reference inputs)
  ◦ Pin 5.58%
  ◦ Qemu 5.30%
  ◦ Valgrind 5.28%

Future Work
• Generating multi-platform results: MIPS BBV file from IA32 Qemu
• Running non-Linux binaries (Solaris, IRIX)
• Generating OS-aware SimPoints: Qemu runs a full OS

Tools Available for Download
All code is available from: http://fusion.csl.cornell.edu/tools/

Questions? Feedback?
[email protected], http://www.csl.cornell.edu/~vince
[email protected], http://www.csl.cornell.edu/~sam
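The basic block vector procedure from the "SimPoint Basic Block Vectors" slide (count entries to each BB, save the counts periodically) can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' Pin, Qemu or Valgrind plugin code. The names `collect_bbvs`, `format_bbv`, `bb_sizes` and `interval_size` are hypothetical, the trace is assumed to be a plain list of basic block IDs, and the `T:id:count` output line is an assumption about the frequency-vector text format that the SimPoint utility reads, with each count weighted by the block's instruction count.

```python
from collections import Counter


def collect_bbvs(trace, bb_sizes, interval_size):
    """Split a trace of basic block IDs into per-interval count vectors.

    trace:         iterable of basic block IDs, in execution order (assumed input)
    bb_sizes:      dict mapping block ID -> number of instructions in the block
    interval_size: instructions per interval (the slides use 100M)
    """
    vectors = []
    counts = Counter()
    executed = 0
    for bb_id in trace:
        counts[bb_id] += 1              # increment the counter on BB entry
        executed += bb_sizes[bb_id]     # track instructions, not block entries
        if executed >= interval_size:   # interval full: save and start over
            vectors.append(dict(counts))
            counts = Counter()
            executed = 0
    if counts:                          # keep the final partial interval
        vectors.append(dict(counts))
    return vectors


def format_bbv(vector, bb_sizes):
    """Render one interval as a 'T:id:count :id:count' line (assumed format).

    Each count is the number of entries times the block size, so larger
    blocks carry proportionally more weight.
    """
    fields = [f":{bb}:{n * bb_sizes[bb]}" for bb, n in sorted(vector.items())]
    return "T" + " ".join(fields)


if __name__ == "__main__":
    sizes = {1: 4, 2: 6}                # hypothetical block sizes
    for vec in collect_bbvs([1, 2, 1, 2, 1], sizes, interval_size=10):
        print(format_bbv(vec, sizes))
```

A real DBI plugin would do the counting inside the instrumented code rather than over a recorded trace; the trace form is used here only to keep the sketch self-contained.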
