CERE: LLVM based Codelet Extractor and REplayer for Piecewise Benchmarking and Optimization

Chadi Akel, P. de Oliveira Castro, M. Popov, E. Petit, W. Jalby

University of Versailles – Exascale Computing Research

EuroLLVM 2016

Motivation

- Finding the best application parameters is a costly iterative process

CERE: Codelet Extractor and REplayer

- Breaks an application into standalone codelets
- Makes costly analysis affordable:
  - Focus on single regions instead of whole applications
  - Run a single representative per cluster of similar codelets

Codelet Extraction

- Extract codelets as standalone microbenchmarks from a C or Fortran application: capture the machine state around the region (state dump) and wrap the region's IR in a codelet wrapper. The codelets can then be recompiled and run independently.

  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      a[i][j] += b[i]*c[j];
    }
  }

CERE Workflow

Capture side: Applications -> Region outlining (LLVM IR) -> Invocation and Region capture (working set and cache capture) -> Codelet subsetting -> Working sets memory dump.

Replay side: a replay wrapper plus fast warmup replays each codelet for performance prediction. At replay one can change the number of threads, affinity, and runtime parameters, or retarget for a different architecture and different optimizations.

CERE can extract codelets from:

- Hot loops
- OpenMP non-nested parallel regions [Popov et al. 2015]

Outline

Extracting and Replaying Codelets: Faithful, Retargetable

Applications: Architecture selection, Compiler flags tuning, Scalability prediction

Demo: Capture and replay in NAS BT; Simple flag replay for NAS FT

Conclusion

Capturing codelets at Intermediate Representation

- Faithful: behaves similarly to the original region
- Retargetable: runtime and compilation parameters can be modified

Capture level tradeoffs:

- Assembly capture keeps the same assembly as the original, but is hard to retarget (compiler, ISA) and costly to support across various ISAs
- Source capture is easy to retarget, but produces assembly that differs from the original (≠ assembly [Akel et al. 2013]) and is costly to support across various languages
- LLVM Intermediate Representation is a good tradeoff between the two

Faithful capture

- Required for a semantically accurate replay:
  - Register state
  - Memory state
  - OS state: locks, file descriptors, sockets
- No support for OS state except locks. CERE captures fully from userland: no kernel modules required.
- Required for a performance accurate replay:
  - Preserved code generation
  - Cache state
  - NUMA ownership
  - Other warmup state (e.g. branch predictor)

Faithful capture: memory

Capture accesses at page granularity: coarse but fast.

- Region protect: protect the static and currently allocated memory of the capture process (listed in /proc/self/maps)
- Memory allocation: intercept allocation functions with LD_PRELOAD. On a = malloc(256): (1) allocate the memory, (2) protect it and return to the user program.

- Memory access: a segmentation fault handler catches the first access to each protected page. On a[i]++: (1) dump the accessed page to disk, (2) unlock the page and return to the user program.

- Small dump footprint: only touched pages are saved
- Cache warmup: replay the trace of the most recently touched pages
- NUMA: detect the first touch of each page

Outline

Extracting and Replaying Codelets: Faithful, Retargetable

Applications: Architecture selection, Compiler flags tuning, Scalability prediction

Demo: Capture and replay in NAS BT; Simple flag replay for NAS FT

Conclusion

Selecting Representative Codelets

- Key idea: applications have redundancies
  - The same codelet is called multiple times
  - Different codelets share similar performance signatures
- Detect these redundancies and keep only one representative

Figure : SPEC tonto make [email protected]:1133 trace, cycles per replay invocation. 90% of NAS codelets can be reduced to four or fewer representatives.

Performance Signature Clustering

Step A: Perform static and dynamic profiling (Maqao, Likwid) on a reference architecture to capture each codelet's feature vector (f1: vectorization ratio, f2: FLOPS/s, f3: cache misses, ...).

Step B: Using the proximity between feature vectors, cluster similar codelets and select one representative per cluster.

Step C: CERE extracts the representatives as standalone codelets. A model extrapolates full benchmark results on each target (e.g. Sandy Bridge, Atom, Core 2).

[Oliveira Castro et al. 2014]

Codelet Based Architecture Selection

Figure : Benchmarking NAS serial on three architectures (reference: Nehalem, geometric mean speedup). Real vs. predicted speedup: Atom 0.15 vs. 0.12, Core 2 0.83 vs. 0.83, Sandy Bridge 1.55 vs. 1.59.

- Real: speedup when benchmarking the original applications
- Predicted: speedup predicted with the representative codelets
- CERE is 31× cheaper than running the full benchmarks.

Autotuning LLVM middle-end optimizations

- The LLVM middle-end offers more than 50 optimization passes.
- Codelet replay enables fast per-region optimization tuning.

Figure : NAS SP ysolve codelet (Ivybridge 3.4GHz). 1000 schedules of random pass combinations, derived from the O3 pass list, were explored; cycle counts are compared against the original run and the replayed O3 baseline.

CERE is 149× cheaper than running the full benchmark (27× cheaper when tuning codelets covering 75% of SP).

Fast Scalability Benchmarking with OpenMP Codelets

Figure : SP compute_rhs on Sandy Bridge: real vs. predicted runtime cycles for 1 to 32 threads.

              Core 2   Nehalem   Sandy Bridge   Ivy Bridge
Accuracy      98.2%    97.1%     92.6%          97.2%
Acceleration  ×25.2    ×27.4     ×23.7          ×23.7

Figure : Varying the thread number at replay in SP, and average results over the OpenMP NAS benchmarks [Popov et al. 2015]

Outline

Extracting and Replaying Codelets: Faithful, Retargetable

Applications: Architecture selection, Compiler flags tuning, Scalability prediction

Demo: Capture and replay in NAS BT; Simple flag replay for NAS FT

Conclusion

Conclusion

- CERE breaks an application into faithful and retargetable codelets
- Piecewise autotuning:
  - Different architectures
  - Compiler optimizations
  - Scalability
  - Other costly exploration analyses?
- Limitations:
  - No support for codelets performing IO (OS state is not captured)
  - Cannot explore source-level optimizations
  - Tied to LLVM

- Full accuracy reports on NAS and SPEC’06 FP are available at benchmark-subsetting.github.io/cere/#Reports

Thanks for your attention!

https://benchmark-subsetting.github.io/cere/ (distributed under the LGPLv3)

Bibliography

Akel, Chadi et al. (2013). “Is Source-code Isolation Viable for Performance Characterization?” In: 42nd International Conference on Parallel Processing Workshops. IEEE.

Gao, Xiaofeng et al. (2005). “Reducing overheads for acquiring dynamic memory traces”. In: Workload Characterization Symposium, 2005. Proceedings of the IEEE International. IEEE, pp. 46–55.

Liao, Chunhua et al. (2010). “Effective source-to-source outlining to support whole program empirical optimization”. In: Languages and Compilers for Parallel Computing. Springer, pp. 308–322.

Oliveira Castro, Pablo de et al. (2014). “Fine-grained Benchmark Subsetting for System Selection”. In: Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization. ACM, p. 132.

Popov, Mihail et al. (2015). “PCERE: Fine-grained Parallel Benchmark Decomposition for Scalability Prediction”. In: Proceedings of the 29th IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE.

Retargetable replay: register state

- Issue: register state is not portable between architectures.
- Solution: capture at a function call boundary.
  - No state is shared through registers except the function arguments
  - Arguments are retrieved directly through portable IR code
- Register-agnostic capture
  - Portable across Atom, Core 2, Haswell, Ivybridge, Nehalem, Sandybridge
  - Preliminary portability tests between x86 and 32-bit ARM

Retargetable replay: outlining regions

Step 1: Outline the region to capture using the CodeExtractor pass.

  original:
    call void @outlined(i32* %i, i32* %s.addr, i32** %a.addr)

  define internal void @outlined(i32* %i, i32* %s.addr, i32** %a.addr) {
    call void @start_capture(i32* %i, i32* %s.addr, i32** %a.addr)
    %0 = load i32* %i, align 4
    %1 = load i32* %s.addr, align 4
    %cmp = icmp slt i32 %0, %1
    br i1 %cmp, label %for.body, label %for.exitStub   ; loop branch here
    ...
    ret void
  }

Step 2: Call start_capture just after the function call.

Step 3: At replay, reinlining and variable cloning [Liao et al. 2010] steps ensure that the compilation context is close to the original.

Comparison to other Code Isolating tools

                     CERE             Code Isolator   Astex            Codelet Finder   SimPoint
Support
  Language           C(++), Fortran   C, Fortran      C(++), Fortran   Fortran, ...     assembly
  Extraction         IR               source          source           source           assembly
  Indirections       yes              no              no               yes              yes
Replay
  Simulator          yes              yes             yes              yes              yes
  Hardware           yes              yes             yes              yes              no
Reduction
  Capture size       reduced          reduced         reduced          full             -
  Instances          yes              manual          manual           manual           yes
  Code sign.         yes              no              no               no               yes

Accuracy Summary: NAS SER

Figure : Coverage and accurate replay for the NAS.A benchmarks (bt, cg, ep, ft, is, lu, mg, sp) on Core 2 and Haswell, as % of execution time. The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

Accuracy Summary: SPEC FP 2006

Figure : Coverage and accurate replay for the SPEC FP 2006 benchmarks on Haswell, as % of execution time. The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.

- Low coverage (sphinx3, wrf, povray): regions of < 2000 cycles, or IO
- Low matching (soplex, calculix, gamess): warmup “bugs”

CERE page capture dump size


Figure : Comparison between the page capture (CERE) and full dump (Codelet Finder) sizes on the NAS.A benchmarks. CERE's page granularity dump contains only the pages accessed by a codelet, and is therefore much smaller than a full memory dump.

CERE page capture overhead

CERE page capture is coarser but faster

        CERE    ATOM 3.25   PIN 1.71   Dyninst 4.0
cg.a    19.4    98.82       222.67     896.86
ft.a    24.1    44.22       127.64     1054.70
lu.a    62.4    80.72       153.46     301.4
mg.a     8.9    107.69      168.61     989.53
sp.a    73.2    67.56       93.04      203.66

Slowdown of a full capture run against the original application run. This takes into account the cost of writing the memory dumps and logs to disk and of tracing the page accesses during the whole execution. We compare to the overhead of the memory tracing tools reported by [Gao et al. 2005].

CERE cache warmup

Example region: for (i = 0; i < size; i++) a[i] += b[i]; which touches the pages of arrays a[] and b[].

A FIFO holds the addresses of the most recently unprotected pages (e.g. 22 21 51 50 20). When a new page enters the FIFO, the oldest page (20) is reprotected. The resulting warmup page trace (... 46 17 47 18 48 19 49) is replayed before the codelet to warm up the cache.

Test architectures

                   Atom    Core 2   Nehalem   Sandy Bridge   Ivy Bridge   Haswell
CPU                D510    E7500    L5609     E31240         i7-3770      i7-4770
Frequency (GHz)    1.66    2.93     1.86      3.30           3.40         3.40
Cores              2       2        4         4              4            4
L1 cache (KB)      2×56    2×64     4×64      4×64           4×64         4×64
L2 cache (KB)      2×512   3 MB     4×256     4×256          4×256        4×256
L3 cache (MB)      -       -        12        8              8            8
RAM (GB)           4       4        8         6              16           16

32 bits portability test: ARM1176JZF-S on a Raspberry Pi Model B+

Clustering NR Codelets

Codelet      Computation Pattern
toeplz_1     DP: 2 simultaneous reductions
rstrct_29    DP: MG Laplacian fine to coarse mesh transition
mprove_8     MP: Dense Matrix x vector product
toeplz_4     DP: Vector multiply in asc./desc. order
realft_4     DP: FFT butterfly computation
toeplz_3     DP: 3 simultaneous reductions
svbksb_3     SP: Dense Matrix x vector product
lop_13       DP: Laplacian finite difference constant coefficients
toeplz_2     DP: Vector multiply element wise in asc./desc. order
four1_2      MP: First step FFT
tridag_2     DP: First order recurrence
tridag_1     DP: First order recurrence
ludcmp_4     SP: Dot product over lower half square matrix
hqr_15       SP: Addition on the diagonal elements of a matrix
relax2_26    DP: Red Black Sweeps Laplacian operator
svdcmp_14    DP: Vector divide element wise
svdcmp_13    DP: Norm + Vector divide
hqr_13       DP: Sum of the absolute values of a matrix column
hqr_12_sq    SP: Sum of a square matrix
jacobi_5     SP: Sum of the upper half of a square matrix
hqr_12       SP: Sum of the lower half of a square matrix
svdcmp_11    DP: Multiplying a matrix row by a scalar
elmhes_11    DP: Linear combination of matrix rows
mprove_9     DP: Subtracting a vector with a vector
matadd_16    DP: Sum of two square matrices element wise
svdcmp_6     DP: Sum of the absolute values of a matrix row
elmhes_10    DP: Linear combination of matrix columns
balanc_3     DP: Vector multiply element wise
(cut for K = 14)

Clustering NR Codelets

The same codelets annotated by cluster (cut for K = 14): codelets with similar computation patterns group together, as do long latency operations (e.g. vector divides) and reduction sums.

Capturing Architecture Change

Nehalem (Ref) -> Core 2 -> Atom. Frequency: 1.86 GHz -> 2.93 GHz -> 1.66 GHz. LLC: 12 MB -> 3 MB -> 1 MB.

Cluster A (LU/erhs.f:49, FT/appft.f:45): triple-nested high latency operations (div and exp): faster on Core 2, slower on Atom.

Cluster B (BT/rhs.f:266, SP/rhs.f:275): stencil on five planes (memory bound): slower on both targets.

Same Cluster = Same Speedup

The same measurements show that codelets in the same cluster experience the same speedup when the architecture changes.

Feature Selection

- Genetic Algorithm: train on Numerical Recipes + Atom + Sandy Bridge
- Validated on NAS + Core 2
- The selected feature set is still among the best on NAS

Figure : Median % error vs. number of clusters on Atom, Core 2, and Sandy Bridge: the GA feature set compared against the worst, median, and best feature sets.

Reduction Factor Breakdown

               Total    Reduced invocations   Clustering
Atom           ×44.3    ×12                   ×3.7
Core 2         ×24.7    ×8.7                  ×2.8
Sandy Bridge   ×22.5    ×6.3                  ×3.6

Table : Benchmarking reduction factor breakdown with 18 representatives.

Profiling Features

4 dynamic features, measured per codelet with Likwid performance counters: FLOPS, L2 bandwidth, L3 miss rate, memory bandwidth.

8 static features, from Maqao static disassembly and analysis: bytes stored per cycle, stalls, estimated IPC, number of DIV, number of SD, pressure on port P1, ratio ADD+SUB/MUL, vectorization (FP/FP+INT/INT).

OpenMP front-end

C code:

  void main() {
    #pragma omp parallel
    {
      int p = omp_get_thread_num();
      printf("%d", p);
    }
  }

LLVM simplified IR (Clang OpenMP front end):

  define i32 @main() {
  entry:
    ...
    call @__kmpc_fork_call @.omp_microtask.(...)
    ...
  }

  define internal void @.omp_microtask.(...) {
  entry:
    %p = alloca i32, align 4
    %call = call i32 @omp_get_thread_num()
    store i32 %call, i32* %p, align 4
    %1 = load i32* %p, align 4
    call @printf(%1)
  }

The front end outlines the parallel region into a microtask function, which the thread execution model runs through __kmpc_fork_call.
