Level-3 BLAS on Myriad Multi-Core Media-Processor

Level-3 BLAS on Myriad Multi-Core Media-Processor SoC Tomasz Szydzik1 Marius Farcas2 Valeriu Ohan2 David Moloney3 1Insititute for Applied Microelectronics, University of Las Palmas of Gran Canaria, 2Codecart, 3Movidius Ltd. BLAS-like Library Instantiation Subprograms (BLIS) The Myriad media-processor SoCs The Basic Linear Algebra Subprograms (BLAS) are a set of low-level subroutines that perform common linear algebra Myriad architecture prioritises power-efficient operation and area efficiency. In order to guarantee sustained high operations. BLIS is a software framework for instantiating high-performance BLAS-like dense linear algebra libraries. performance and minimise power the proprietary SHAVE (Streaming Hybrid Architecture Vector Engine) processor was BLIS[1] was chosen over GoTOBLAS, ATLAS, etc. due to its portable micro-kernel architecture and active user-base. developed. Data and Instructions reside in a shared Connection MatriX (CMX) memory block shared by all Shave processors. Data is moved between peripherals, processors and memory via a bank of software-controlled DMA engines. BLIS features Linear Algebra Myriad Acceleration DDR Controller 128/256MB LPDDR2/3 Stacked Die 128-bit AXI 128-bit AHB High-level LAPACK EIGEN In-house 128kB 2-way L2 cache (SHAVE) IISO C99 code with flexible BSD license. routines DDR Controller 1MB CMX SRAM ISupport for BLAS API calling conventions. 128-bit AXI 128-bit AHB 128-bit AHB BLIS 128-bit 64-bit 64-bit 256kB 2-way L2 cache (SHAVE) 32-bit CMX Instr CMX CMX Competitive performance [2]. Level Level Level Level APB x 8 SHAVEs I f Port Port Port 32kB LRAM 2MB CMX SRAM 0 1m 2 3 Subroutines IMulti-core friendly. d 128-bit 64-bit 64-bit Multi-layer API and code identifying and isolatinga Inner kernels/ SHAVE VLIW Vector Processor I 4kB 2-way Ports CMX CMX x 12 SHAVEs 256kB 4-way 32kB 4-way block-kernels VRF 32x128 (12 ports) Port Port L2 cache (LEON4) L2 cache (LEON4) key set of computational kernels. DCU I-cache 32-bit SRF 32x32 (12 ports) APB 4kB 2-way Micro-kernels D-cache IModularity and extensiveness. IDC IRF 32x32 (17 ports) SHAVE VLIW Vector Processor Assembly C99 Assembly C99 Assembly C99 Assembly C99 32kB 2-way 4kB 2-way VRF 32x128 (10 ports) I-cache (LEON4) I-cache (LEON4) IPortability (x86, x64, TIC66x, PowerPC, etc.) that LEON3 DCU 1kB PEU BRU LSU0 LSU1 SAUVAUIAU CMU doesn’t impede high performance [3]. D-cache RISC 32kB 2-way 4kB 2-way IRF 32x32 (17 ports) IDC D-cache (LEON4) D-cache (LEON4) IFoundation for mixed precision (experimental). BASE Control Compatibility Utility Driver-level LEON4 LEON4 1kB 1kB Myriad 1 architecture highlights [4] PEU BRU LSU0 LSU1 SAUVAUIAU CMU Level-3 BLAS using BLIS on Myriad D-cache I-cache RISC2 RISC1 I65nm ultra-low power architecture BLIS Level-3 micro-kernels GEMM and TRSM operations (≤ 0:35W @180MHz) with 11 power islands. Myriad 2 architecture highlights [4] IHardware support for SIMD, matrix transpose, BLIS Routine Operation Flops Comments sparse data, sqrt@fp16, predicated execution... I28nm ultra-low power( ≤ 0:5W @600MHz) with 17 power islands. GEMM C := α · op(A) · op(B) + β · C 2mnk op(X) = GEMM Extended hardware support over Myriad 1: clock-gating, Level TRSM GEMMTRSM −1 2 T H Heterogeneous SoC: 1 Leon3@fp64 + 8 I general triangular solve I combined/fused C := α · op(A )C nm X; X ; X ; C 3 matrix-matrix with multiple Shaves@fp32. hard-wired configurable accelerators for imaging and vision, etc. GEMM and TRSM TRSM −1 2 multiplication right sides C := α · C · op(A ) mn is m × n Heterogeneous SoC: 2 Leon4@fp64 + 12 Shaves@fp32. I32KB LRAM, 1MB CMX, 16/64MB DDR, DMAs. I 256+32KB LRAM, 2MB CMX, DDR3 support, DMAs. Mapping of matrix blocks on CMX (SGEMM) IPower efficiency of 1Tops/W (max 8-bit I Level-3 BLAS on Myriad equivalent). IPower efficiency of 2Tops/W (max 16-bit equivalent). AC B Port SGEMM and STRSM performance start Execute on Leon (C99) SGEMM vs #cores and matrix width: Power Performance [GFLOPS] . Architecture . ... [W] core system [GFLOPS/W] SGEMM Execute on SHAVEs (C99) Cell [5] 20 23.01 184.8 9.22 C6678 [3] 10 10.3 79.6 7.96 Optimize Myriad 1 [4] 0.35 0.75 4.92 14.06 STRSM Kernels in SHAVE assembler c Cell [5] 20 16.47 131.8 6.59 b C6678 [3] 10 8.7 59.5 5.95 Myriad 1 [4] 0.35 0.5 3.57 10.2 CMX memory managment High-performance required hand optimized STRSM vs #cores and matrix width: I yes no micro-kernels and memory management. end ok? a IMost effort spent on tuning memory management. IPort to Myriad 2 (expected 10× more efficient). Memory focused optimizations [1] F. G. Van Zee and R. A. van de Geijn, ”BLIS: A framework for rapidly instantiating BLAS functionality,” ACM 0 1 7 Transactions on Mathematical Software, 2013, accepted. [2] F. G. Van Zee, T. Smith, F. D. Igual, M. Smelyanskiy, X. Zhang, M. Kistler, V. Austel, J. Gunnels, T. M. Low, B. Marker, L. Killough, and R. A. van de Geijn, ”The BLIS framework: Experiments in portability,” ACM Transactions on IDouble/triple buffering of arguments. where, Mathematical Software, 2013, accepted. [3] M. Ali, E. Stotzer, F. Igual, and R. van de Geijn, ”Level-3 blas on the TIC6678 multi-core DSP,” in Computer the numbers represent SHAVE cores, Architecture and High Performance Computing (SBAC-PAD), 2012 IEEE 24th International Symposium on, Oct 2012, IBuffers shared by all SHAVEs. I pp. 179–186. the boxes above them, their local CMX memory slice, [4] David Moloney, 1TOPS/W Software Programmable Media Processor, HotChips 2011 (HC23), Stanford, California, IData passing using pointer arithmetic. I August 23rd, 2011. buffering of A, B and C is not shown in order to preserve clarity. [5] J. Kurzak, A. Buttari, and J. Dongarra, ”Solving systems of linear equations on the cell processor using cholesky IOverlapped DMA accesses. I factorization,” Parallel and Distributed Systems, IEEE Transactions on, vol. 19, no. 9, pp. 1175–1186, Sept 2008 Institute for Applied Microelectronics of the University of Las Palmas of Gran Canaria, Codecart, Movidius Hot Chips: A Symposium on High Performance Chips, 10-12 August 2014, Cupertino, CA Mail: [email protected] WWW: http://movidius.com http://www.iuma.ulpgc.es.

Level-3 BLAS on Myriad Multi-Core Media-Processor

Linpack Evaluation on a Supercomputer with Heterogeneous Accelerators

Supermatrix: a Multithreaded Runtime Scheduling System for Algorithms-By-Blocks

Benchmark of C++ Libraries for Sparse Matrix Computation

Using Machine Learning to Improve Dense and Sparse Matrix Multiplication Kernels

Anatomy of High-Performance Matrix Multiplication

DD2358 – Introduction to HPC Linear Algebra Libraries & BLAS

A Framework for Efficient Execution of Matrix Computations

On Computing with Diagonally Structured Matrices

A Systematic Approach for Obtaining Performance on Matrix-Like Operations Submitted in Partial Fulfillment of the Requirements

Parallel Tiled QR Factorization for Multicore Architectures LAPACK Working Note # 190

Implementing Strassen-Like Fast Matrix Multiplication Algorithms with BLIS

Implementing Strassen's Algorithm with BLIS