Software and Hardware for High Performance and Low Power Homogeneous and Heterogeneous Multicore Systems Hironori Kasahara Professor, Dept. of Computer Science & Engineering Director, Advanced Multicore Processor Research Institute Waseda University, Tokyo, Japan IEEE Computer Society President Elect 2017 URL: http://www.kasahara.cs.waseda.ac.jp/

Multicores for Performance and Low Power

Power consumption is one of the biggest problems for performance scaling, from smartphones to cloud servers and supercomputers (the "K" computer draws more than 10 MW).

  Power ∝ Frequency × Voltage^2, and Voltage ∝ Frequency, so Power ∝ Frequency^3.

If the frequency is reduced to 1/4 (e.g. 4 GHz → 1 GHz), power is reduced to 1/64 and performance falls to 1/4. If 8 cores are then integrated on a chip, power is still 1/8 while performance becomes 2 times.

(Die photo: RP2 chip with Core#0–#7, I$/D$, ILRAM, DLRAM, URAM, LBSC, SNC0/1, SHWY, VSWC, DBSC, CSM, GCPG, DDRPAD.)
IEEE ISSCC08, Paper No. 4.5: M. Ito, … and H. Kasahara, "An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler"

Parallel Software Is Important for Scalable Performance of Multicores

Just adding more cores does not give us speedup. The development cost and period of parallel software are becoming a bottleneck in the development of embedded systems, e.g. IoT and automobiles.

Example: the GMS earthquake wave propagation simulation developed by the National Research Institute for Earth Science and Disaster Resilience (NIED), on a Fujitsu M9000 SPARC multicore server: the OSCAR compiler gives us 211 times speedup with 128 cores.
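The arithmetic above can be written down as a tiny model (a sketch, not OSCAR code; the cubic law assumes voltage is scaled down linearly with frequency):

```c
#include <assert.h>

/* Relative dynamic power under the slide's model:
 * P ~ F * V^2 and V ~ F, hence P ~ F^3.
 * f_rel:   clock relative to the baseline (0.25 for 4 GHz -> 1 GHz)
 * n_cores: number of cores running at that clock */
static double relative_power(double f_rel, int n_cores) {
    return n_cores * f_rel * f_rel * f_rel;
}

/* Idealized performance: linear in clock and core count. */
static double relative_performance(double f_rel, int n_cores) {
    return n_cores * f_rel;
}
```

One core at 1/4 clock gives 1/64 power and 1/4 performance; 8 such cores give 1/8 power and 2 times performance, matching the slide.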

A commercial compiler gives us 0.9 times speedup with 128 cores (slower than 1 core).

 An automatic parallelizing compiler available on the market gave us no speedup against 1-core execution time even on 64 cores.
 Execution time with 128 cores was slower than with 1 core (0.9 times speedup).
 The advanced OSCAR parallelizing compiler gave us 211 times speedup with 128 cores against 1-core execution time using the commercial compiler.
 The OSCAR compiler gave us 2.1 times speedup even on 1 core against the commercial compiler by global cache optimization.

Trend of Peak Performance of Supercomputers

- Aurora, 2018, 180 PFLOPS, 13 MW, Argonne National Lab., Intel & Cray
- Sunway TaihuLight, 2016.06, 93 PFLOPS, 15.4 MW
- Tianhe-2, 2013.06, 55 PFLOPS, 17.8 MW
- Titan, 2012.11, 27 PFLOPS, 8.2 MW
- Sequoia, 2012.06, 20 PFLOPS, 7.9 MW
- K computer, 2011.06 & 11, 11 PFLOPS, 11.3 MW

(The US, China, the EU and Japan plan ExaFLOPS machines for 2020–22.)


Power Reduction of MPEG2 Decoding to 1/4 on the 8-Core Homogeneous Multicore RP-2 by the OSCAR Parallelizing Compiler

MPEG2 decoding with 8 CPU cores:
- Without power control (voltage fixed at 1.4 V): average power 5.73 [W]
- With power control (frequency and voltage scaling 1.4 V–1.0 V, resume standby: power shutdown): average power 1.52 [W]
→ 73.5% power reduction

Compiler Co-designed Multicore RP2

(Block diagram: two 4-core clusters. Cluster #0 holds cores #0–#3 with local clock pulse generator LCPG0 and power control registers PCR0–3; cluster #1 holds cores #4–#7 with LCPG1 and PCR4–7. Each core has a CPU, FPU, 16 KB I$, 16 KB D$, local memory (I: 8K, D: 32K), 64 KB user RAM (URAM), and a cache controller/barrier register (CCN/BAR). Barrier synchronization lines connect the cores, and each cluster has its own snoop controller.)

On-chip system bus: SuperHyway, with DDR2, SRAM and DMA controllers.
LCPG: local clock pulse generator; PCR: power control register; CCN/BAR: cache controller/barrier register; URAM: user RAM (distributed shared memory).

Renesas-Hitachi-Waseda low power 8-core RP2, developed in 2007 in a METI/NEDO project

Process technology: 90 nm, 8-layer, triple-Vth CMOS
Chip size: 104.8 mm2 (10.61 mm x 9.88 mm)
CPU core size: 6.6 mm2 (3.36 mm x 1.96 mm)
Supply voltage: 1.0–1.4 V (internal), 1.8/3.3 V (I/O)
Power domains: 17 (8 CPUs, 8 URAMs, common)

IEEE ISSCC08, Paper No. 4.5: M. Ito, … and H. Kasahara, "An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler"

Industry-Government-Academia Collaboration in R&D

Goals: protect lives, target practical applications for smart life, protect the environment.

Waseda University R&D targets (OSCAR many-core chip, compiler and API):
- Green supercomputers and many-core system technologies with ultra-low power consumption
- On-board vehicle technology (navigation systems, integrated controllers, infrastructure coordination); robots
- Super real-time disaster simulation (tectonic shifts, tsunami, tornado, flood, fire spreading)
- Consumer electronics: internet TV/DVD, camcorders, cameras, capsule inner cameras, smartphones
- Green cloud servers, cool desktop servers, stock trading servers, medical servers, solar-powered servers
- National Institute of Radiological Sciences: heavy particle radiation planning, cerebral infarction
- Solar-cell operation/recharging; non-fan, cool, quiet servers
- Intelligent home appliances and servers; industry supercomputers

Cancer Treatment with Carbon Ion Radiotherapy (the previous best was 2.5 times speedup on 16 processors with hand optimization)

National Institute of Radiological Sciences (NIRS)

8.9 times speedup on 12 processors (Intel Xeon X5670 2.93 GHz 12-core SMP, Hitachi HA8000); 55 times speedup on 64 processors (IBM Power7 64-core SMP, Hitachi SR16000).

OSCAR Parallelizing Compiler

Goals: improve effective performance, cost-performance and software productivity, and reduce power.
- Multigrain parallelization: coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism.
- Data localization: automatic data management for distributed shared memory, cache and local memory.
- Data transfer overlapping: data transfer overlapping using data transfer controllers (DMAs).
- Power reduction: reduction of consumed power by compiler-controlled DVFS and power gating with hardware support.

(Gantt chart: macro-tasks MT1-1 … MT3-8 scheduled on CPU0–CPU3 and DRP0; each core's DTU overlaps LOAD/SEND/STORE transfers with computation, and idle cores are turned OFF.)

Performance of OSCAR Compiler on an IBM p6 595 Power6 (4.2 GHz) based 32-core SMP Server

XL Fortran for AIX Ver. 12.1

Compile options:
(*1) Sequential: -O3 -qarch=pwr6; XLF: -O3 -qarch=pwr6 -qsmp=auto; OSCAR: -O3 -qarch=pwr6 -qsmp=noauto
(*2) Sequential: -O5 -q64 -qarch=pwr6; XLF: -O5 -q64 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -q64 -qarch=pwr6 -qsmp=noauto
(Others) Sequential: -O5 -qarch=pwr6; XLF: -O5 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -qarch=pwr6 -qsmp=noauto

Generation of Coarse Grain Tasks

Macro-tasks (MTs):
 Block of pseudo assignments (BPA): basic block (BB)
 Repetition block (RB): natural loop
 Subroutine block (SB): subroutine

(Figure: a program is decomposed hierarchically into BPAs, RBs and SBs. The 1st layer applies coarse grain parallelization; within RBs, loop-level parallelization and near-fine-grain parallelization of loop bodies are applied; within SBs, further coarse grain parallelization is applied in the 2nd and 3rd layers.)

Earliest Executable Condition Analysis for Coarse Grain Tasks (Macro-tasks)

(Figures: a macro flow graph and the corresponding macro task graph.)

Priority Determination in the Dynamic CP Method

Example: expected path length weighted by branch probabilities = 60*0.80 + 100*0.20 = 68

V1.0 © 2014, IEEE All rights reserved

Earliest Executable Conditions (EEC): control dependence + data dependence.
- Control dependences show when the executions of MTs are decided.
- Data dependences show when the data accessed by MTs are ready.
Example: MT2 may start execution after MT1 branches to MT2 and MT1 finishes execution.

MT3 may start execution after MT1 branches to MT3.

MT6 may start execution after MT3 finishes execution or MT2 branches to MT4.
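The three example conditions can be written down directly as boolean functions (an illustrative encoding, not the compiler's internal representation):

```c
#include <stdbool.h>

/* Events observed at run time: task completions and taken branches. */
struct events {
    bool mt1_finished, mt1_to_mt2, mt1_to_mt3;
    bool mt2_to_mt4, mt3_finished;
};

/* Each macro-task's EEC is an OR of AND-terms over these events. */
static bool mt2_ready(const struct events *e) { return e->mt1_to_mt2 && e->mt1_finished; }
static bool mt3_ready(const struct events *e) { return e->mt1_to_mt3; }
static bool mt6_ready(const struct events *e) { return e->mt3_finished || e->mt2_to_mt4; }
```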

Automatic Processor Assignment in 103.su2cor

• Using 14 processors; coarse grain parallelization within DO400.

MTG of Su2cor-LOOPS-DO400  Coarse grain parallelism PARA_ALD = 4.3

(Node legend: DOALL loop, sequential loop, SB, BB.)

Data-Localization: Loop Aligned Decomposition

• Decompose multiple loops (Doall and sequential) into CARs and LRs considering inter-loop data dependences.
  – Most data in an LR can be passed through local memory (LM).
  – LR: localizable region; CAR: commonly accessed region.

C     RB1 (Doall)
      DO I=1,101
        A(I)=2*I
      ENDDO
C     RB2 (Doseq)
      DO I=1,100
        B(I)=B(I-1)+A(I)+A(I+1)
      ENDDO
C     RB3 (Doall)
      DO I=2,100
        C(I)=B(I)+B(I-1)
      ENDDO

Aligned decomposition (LR | CAR | LR | CAR | LR):
  RB1: DO I=1,33 | DO I=34,35 | DO I=36,66 | DO I=67,68 | DO I=69,101
  RB2: DO I=1,33 | DO I=34,34 | DO I=35,66 | DO I=67,67 | DO I=68,100
  RB3: DO I=2,34 |            | DO I=35,67 |            | DO I=68,100
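The decomposition can be checked with a small C transcription (a sketch for verifying iteration coverage, not OSCAR output; in a real data-localized execution the LR/CAR chunks of the three RBs are interleaved across cores, with only the CAR data exchanged):

```c
#include <string.h>

enum { N = 102 };   /* indices 1..101 as in the Fortran source */

/* The original three RBs. */
static void original(double *A, double *B, double *C) {
    for (int i = 1; i <= 101; i++) A[i] = 2.0 * i;                /* RB1 */
    for (int i = 1; i <= 100; i++) B[i] = B[i-1] + A[i] + A[i+1]; /* RB2 */
    for (int i = 2; i <= 100; i++) C[i] = B[i] + B[i-1];          /* RB3 */
}

static void run_range(int lo, int hi, int rb, double *A, double *B, double *C) {
    for (int i = lo; i <= hi; i++) {
        if (rb == 1)      A[i] = 2.0 * i;
        else if (rb == 2) B[i] = B[i-1] + A[i] + A[i+1];
        else              C[i] = B[i] + B[i-1];
    }
}

/* The same work split into the aligned LR/CAR ranges shown above. */
static void decomposed(double *A, double *B, double *C) {
    int rb1[][2] = {{1,33},{34,35},{36,66},{67,68},{69,101}};
    int rb2[][2] = {{1,33},{34,34},{35,66},{67,67},{68,100}};
    int rb3[][2] = {{2,34},{35,67},{68,100}};
    for (int p = 0; p < 5; p++) run_range(rb1[p][0], rb1[p][1], 1, A, B, C);
    for (int p = 0; p < 5; p++) run_range(rb2[p][0], rb2[p][1], 2, A, B, C);
    for (int p = 0; p < 3; p++) run_range(rb3[p][0], rb3[p][1], 3, A, B, C);
}
```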

Inter-loop Data Dependence Analysis in TLG

• Define the exit RB in a target loop group (TLG) as the standard loop.
• Find the iterations on which an iteration of the standard loop is data dependent.
  – e.g. the Kth iteration of RB3 is data dependent on the (K-1)th and Kth iterations of RB2, and indirectly on the (K-1)th, Kth and (K+1)th iterations of RB1.

(Figure: iteration spaces I(RB1), I(RB2), I(RB3) with dependence arrows between iterations K-1, K and K+1.)

Example of TLG

Decomposition of RBs in TLG

• Decompose GCIR into DGCIRp (1 ≦ p ≦ n)
  – n: (a multiple of) the number of PCs; DGCIR: decomposed GCIR
• Generate a CAR on which DGCIRp and DGCIRp+1 are data dependent.
• Generate an LR on which DGCIRp is data dependent.

(Figure: GCIR split into DGCIR1, DGCIR2 and DGCIR3 over the iteration spaces I(RB1)=1..101, I(RB2)=1..100 and I(RB3)=2..100, with the CARs RB1<1,2>, RB1<2,3>, RB2<1,2> and RB2<2,3> between them.)

Data Localization

(Figure: a schedule of the MTG, and the MTG after division, on two processors.)

An Example of Data Localization for SPEC95 Swim

(a) Target loop group for data localization:

      DO 200 J=1,N
      DO 200 I=1,M
        UNEW(I+1,J) = UOLD(I+1,J)+
     1    TDTS8*(Z(I+1,J+1)+Z(I+1,J))*(CV(I+1,J+1)+CV(I,J+1)+CV(I,J)
     2    +CV(I+1,J))-TDTSDX*(H(I+1,J)-H(I,J))
        VNEW(I,J+1) = VOLD(I,J+1)-TDTS8*(Z(I+1,J+1)+Z(I,J+1))
     1    *(CU(I+1,J+1)+CU(I,J+1)+CU(I,J)+CU(I+1,J))
     2    -TDTSDY*(H(I,J+1)-H(I,J))
        PNEW(I,J) = POLD(I,J)-TDTSDX*(CU(I+1,J)-CU(I,J))
     1    -TDTSDY*(CV(I,J+1)-CV(I,J))
  200 CONTINUE
      DO 210 J=1,N
        UNEW(1,J) = UNEW(M+1,J)
        VNEW(M+1,J+1) = VNEW(1,J+1)
        PNEW(M+1,J) = PNEW(1,J)
  210 CONTINUE
      DO 300 J=1,N
      DO 300 I=1,M
        UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
        VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
        POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
  300 CONTINUE

(b) Image of the alignment of the arrays (UNEW, VNEW, PNEW, UOLD, VOLD, POLD, CU, CV, Z, H, U, V, P) on a 4 MB cache accessed by the target loops: cache line conflicts occur among arrays which share the same location on the cache.

Data Layout for Removing Line Conflict Misses by Array Dimension Padding

Declaration part of the arrays in SPEC95 swim:

  before padding:                 after padding:
  PARAMETER (N1=513, N2=513)      PARAMETER (N1=513, N2=544)

      COMMON U(N1,N2), V(N1,N2), P(N1,N2),
     *       UNEW(N1,N2), VNEW(N1,N2),
     1       PNEW(N1,N2), UOLD(N1,N2),
     *       VOLD(N1,N2), POLD(N1,N2),
     2       CU(N1,N2), CV(N1,N2),
     *       Z(N1,N2), H(N1,N2)

(Figure: the 4 MB cache image before and after padding; the box shows the access range of DLG0.)
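Why N2=544 helps can be seen by computing where each COMMON array's base lands in a direct-mapped 4 MB cache (a back-of-the-envelope sketch; real conflict behavior also depends on associativity and line size):

```c
#define CACHE (4ul * 1024 * 1024)   /* 4 MB direct-mapped cache */
#define NARRAYS 14                  /* U, V, P, UNEW, ..., H */

/* Smallest circular distance, modulo the cache size, between the
 * cache images of any two of the 14 consecutively allocated arrays.
 * A tiny distance means the arrays overlap in the cache and conflict. */
static unsigned long min_cache_distance(unsigned long array_bytes) {
    unsigned long off[NARRAYS], best = CACHE;
    for (int k = 0; k < NARRAYS; k++)
        off[k] = (k * array_bytes) % CACHE;
    for (int i = 0; i < NARRAYS; i++)
        for (int j = i + 1; j < NARRAYS; j++) {
            unsigned long d = off[i] > off[j] ? off[i] - off[j] : off[j] - off[i];
            if (CACHE - d < d) d = CACHE - d;   /* circular distance */
            if (d < best) best = d;
        }
    return best;
}
```

With 8-byte REALs, the unpadded arrays (513x513) land only 16400 bytes apart in the cache, while the padded arrays (513x544) are separated by 270848 bytes, which is what removes the line conflict misses.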

Statement Level Near Fine Grain Tasks

Task No.
 1) u12 = a12/l11
 2) u24 = a24/l22
 3) u34 = a34/l33
 4) l54 = -l52 * u24
 5) u45 = a45/l44
 6) l55 = u55 - l54 * u45
 7) y1 = b1 / l11
 8) y2 = b2 / l22
 9) b5 = b5 - l52 * y2
10) y3 = b3 / l33
11) y4 = b4 / l44
12) b5 = b5 - l54 * y4
13) y5 = b5 / l55
14) x4 = y4 - u45 * y5
15) x3 = y3 - u34 * x4
16) x2 = y2 - u24 * x4
17) x1 = y1 - u12 * x2

Data transfer time tij between tasks Ti and Tj: tij = 0 if Ti and Tj are on the same PE; tij = 9 if they are on different PEs. (Figure: the task graph with per-task processing times on the nodes.)

Task Graph for FPPPP: statement level near fine grain parallelism

Elimination of Redundant Synchronization for Shared Data on Centralized Shared Memory after Static Task Scheduling

(Figure: tasks A–E statically scheduled on PE1–PE3 with their precedence relations; FS = flag set, FC = flag check. Flag checks made unnecessary by the static schedule are eliminated.)

Generated Parallel Machine Code for Near Fine Grain Parallel Processing

  PE1                          PE2
  ; ---- Task A ----           ; ---- Task C ----
  ; Task Body                  ; Task Body
  FADD R23, R19, R21           FMLT R27, R28, R29
  ; ---- Task B ----           FSUB R29, R19, R27
  ; Flag Check : Task C        ; Data Transfer
  L18:                         STR  [R14, 1], R29
  LDR  R28, [R14, 0]           ; Flag Set
  CMP  R28, R0                 STR  [R14, 0], R0
  JNE  L18
  ; Data Receive
  LDR  R29, [R14, 1]
  ; Task Body
  FMLT R24, R23, R29

[W-CDMA Baseband Communication] Near Fine Grain Parallel Processing of an EAICH Detection Program on the RP2 Multicore with 4 SH4A Cores

 The Hadamard transform is often used in signal processing.
 Parallel processing method:
  – near fine grain parallel processing among statements
  – static scheduling
Reference: special purpose hardware (250 MHz) takes 1.74 μs.

1.62 times speedup for 2 cores and 3.45 times speedup for 4 cores for EAICH on RP2.

Generated Multigrain Parallelized Code

(The nested coarse grain task parallelization is realized by only the OpenMP "section", "flush" and "critical" directives.)

(Figure: the 1st layer is an OpenMP SECTIONS construct with centralized scheduling code for MT1_1–MT1_4 (DOALL, SB, RB); threads T0–T7 are split into two thread groups for the 2nd layer, which runs distributed scheduling code with SEND/RECV/SYNC for the nested macro-tasks 1_3_* and 1_4_*.)

Code Generation Using OpenMP

• The compiler generates a parallelized program using the OpenMP API.
• One-time, single-level thread generation:
  – threads are forked only once at the beginning of the program by the OpenMP "PARALLEL SECTIONS" directive;
  – the forked threads join only once at the end of the program.
• The compiler generates code for each thread using static or dynamic scheduling schemes.
• No extension of OpenMP is required for hierarchical processing.
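The one-time fork scheme above can be sketched as follows (a minimal sketch, not OSCAR-generated code; the macro-task bodies and the two-thread static schedule are made up for illustration; compile with -fopenmp, or the pragmas are simply ignored and the code runs sequentially with the same result):

```c
#define NTASKS 4
static int result[NTASKS];

/* Stand-in for a macro-task body. */
static void macro_task(int id) { result[id] = id * id; }

/* Threads are forked once here and join once at the end; each section
 * is one thread's statically scheduled task list. */
void run_parallel_sections(void) {
  #pragma omp parallel sections
  {
    #pragma omp section
    { macro_task(0); macro_task(1); }   /* schedule for thread 0 */
    #pragma omp section
    { macro_task(2); macro_task(3); }   /* schedule for thread 1 */
  }
}
```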

Multicore Program Development Using OSCAR API V2.0

A sequential application program in Fortran or C (consumer electronics, automobiles, medical, scientific computation, etc.) is processed as follows:

- The Waseda OSCAR parallelizing compiler performs coarse grain task parallelization, data localization, DMAC data transfer, and power reduction using DVFS and clock/power gating. The user may add "hint" directives before a loop or a function to specify that it is executable by an accelerator and with how many clocks.
- The compiler outputs a parallelized Fortran or C program with OSCAR API directives for thread generation, memory management, data transfer using DMA, and power management, for homogeneous and/or heterogeneous multicores and manycores. Manual parallelization and power reduction are also possible.
- An API analyzer plus an existing sequential compiler from each vendor then generates executables: homogeneous multicore code (e.g. vendor A, or SMP servers via an OpenMP compiler) and low-power heterogeneous multicore code (e.g. vendor B, with accelerator 1/2 codes and server code).

Users include Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, eSOL, CATS, Gaio and 3 universities.

OSCAR: Optimally Scheduled Advanced Multiprocessor; API: Application Program Interface

Parallel Processing of Face Detection on a Manycore, a High-end Server and a PC Server

Platforms: TILEPro64 (gcc), SR16K (Power7, 8 cores x 4 CPUs x 4 nodes, xlc), rs440 (Intel Xeon, 8 cores x 4 CPUs, icc).

(Bar chart: speedup vs. number of cores (1, 2, 4, 8, 16); at 16 cores the speedups are 11.55x on SR16K, 10.92x on rs440 and 9.30x on TILEPro64.)

• The OSCAR compiler gives us 11.55 times speedup for 16 cores against 1 core on the SR16000 Power7 high-end server.

Performance on a Multicore Server for the Latest Cancer Treatment Using Heavy Particles (Proton, Carbon Ion)

327.6 times speedup on 144 cores: Hitachi SMP blade server BS500, Xeon E7-8890 v3 (2.5 GHz, 18 cores/chip) x 8 chips. Speedups against 1-core GCC execution: 109.2x (32 cores), 196.5x (64 cores), 327.6x (144 cores).

 The original sequential execution time of 2948 sec (about 50 minutes) using GCC was reduced to 9 sec with 144 cores (327.6 times speedup).
 A reduction of the treatment cost and of the reservation waiting period is expected.

110 Times Speedup against Sequential Processing for the GMS Earthquake Wave Propagation Simulation on Hitachi SR16000 (Power7-Based 128-Core Linux SMP)

Fortran:15 thousand lines

First touch for distributed shared memory and cache optimization over loops are important for scalable speedup.

Parallel Processing of a JPEG XR Encoder on TILEPro64

Multimedia applications: AAC encoder, JPEG XR encoder, optical flow calculation.

Flow: sequential C source code → (1) OSCAR compiler parallelization → parallelized C program with the OSCAR API and (2) cache allocation setting → API analyzer + sequential compiler → parallelized executable binary for TILEPro64.

Speedup (JPEG XR encoder): 55x on 64 cores with our cache allocation, versus 28x with the default cache allocation (1x on 1 core).

Local cache optimization: the parallel data structure (tile) on the heap is allocated to the local cache. (Figure: the TILEPro64 8x8 tile grid with its four memory controllers and I/O.)

Speedup with 2 Cores for an Engine Crankshaft Handwritten Program on the RPX Multicore Processor

1.6 times speedup with 2 cores against 1 core.

(Figures: the macrotask graph with a lot of conditional branches, and the macrotask graph after task fusion.) Branches are fused into macrotasks for static scheduling, because the grain is too fine (microseconds) for dynamic scheduling.

Model-Based Designed Engine Control on a Multicore with Denso

Though parallel processing of engine control on a multicore has so far been very difficult, Denso and Waseda succeeded in a 1.95 times speedup on a 2-core V850 multicore processor: hard real-time automobile engine control on a multicore. The C codes generated by the MATLAB/Simulink Embedded Coder are automatically parallelized.


OSCAR Compile Flow for Simulink Applications

Generate C code using Embedded Coder

Simulink model → C code → OSCAR compiler

(1) Generate the MTG → parallelism
(2) Generate a Gantt chart → scheduling onto a multicore
(3) Generate parallelized C code using the OSCAR API → multiplatform execution (Intel, ARM, SH, etc.)

Speedups of MATLAB/Simulink Image Processing on Various 4-Core Multicores (Intel Xeon, ARM Cortex-A15 and Renesas SH4A)

Road Tracking, Image Compression: http://www.mathworks.co.jp/jp/help/vision/examples
Buoy Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/44706-buoy-detection-using-simulink
Color Edge Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/28114-fast-edges-of-a-color-image--actual-color--not-converting-to-grayscale-/
Vessel Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/24990-retinal-blood-vessel-extraction/

Parallel Processing on a Simulink Model

• The parallelized C code can be embedded into Simulink using the C MEX API for HILS and SILS implementation.

Call sequential C code from the S-Function block

Sequential C Code

Parallelized C Code

Call parallelized C code from the S-Function block

OSCAR API Ver. 2.0 for Homogeneous/Heterogeneous Multicores and Manycores

An Image of a Static Schedule for a Heterogeneous Multicore with Data Transfer Overlapping and Power Control

(Gantt chart: macro-task graphs MTG1–MTG3 scheduled onto CPU0–CPU3 and the DRP0 accelerator. Each core's DTU performs LOAD/SEND/STORE transfers overlapped with computation, and idle cores are turned OFF.)

33 Times Speedup Using the OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a Hand-tuned Library): 111 [fps]

Speedups against a single SH processor: 1 (1SH), 2.29 (2SH), 3.09 (4SH), 5.4 (8SH), 18.85 (2SH+1FE), 26.71 (4SH+2FE), 32.65 (8SH+4FE); the frame rate rises from 3.4 [fps] to 111 [fps].

Power Reduction by Power Supply, Clock Frequency and Voltage Control by the OSCAR Compiler

Frequency and voltage (DVFS) and clock and power gating of each core are scheduled considering the task schedule, since the dynamic power is proportional to the cube of the frequency (F^3, with voltage scaled along with frequency) and the leakage (static) power can be reduced by power gating (power off).

• Shortest execution time mode

(In this figure, the frequency levels are FULL, MID and LOW.)

Power OFF: Power Gating

An Example of Machine Parameters

• Functions of the multiprocessor for the power saving scheme:
  – the frequency of each processor can be changed in several levels;
  – the voltage is changed together with the frequency;
  – each processor can be powered on/off.

  state           FULL   MID    LOW    OFF
  frequency       1      1/2    1/4    0
  voltage         1      0.87   0.71   0
  dynamic energy  1      3/4    1/2    0
  static power    1      1      1      0

• State transition overhead:

  delay time [u.t.]                energy overhead [μJ]
  from\to  FULL  MID   LOW   OFF   from\to  FULL  MID  LOW  OFF
  FULL     0     40k   40k   80k   FULL     0     20   20   40
  MID      40k   0     40k   80k   MID      20    0    20   40
  LOW      40k   40k   0     80k   LOW      20    20   0    40
  OFF      80k   80k   80k   0     OFF      40    40   40   0
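The table can be read as an energy model (a sketch using the slide's normalized parameters, not measurements; transition overheads are omitted): a task of `work` unit operations run in state s takes work/freq[s] time units and consumes work*dyn[s] dynamic energy (dyn ∝ V^2) plus static power over that time.

```c
enum state { FULL, MID, LOW, OFF };

static const double freq[]   = {1.0, 0.5,  0.25, 0.0};  /* relative clock   */
static const double dyn[]    = {1.0, 0.75, 0.5,  0.0};  /* energy per op    */
static const double stat_p[] = {1.0, 1.0,  1.0,  0.0};  /* leakage power    */

static double exec_time(double work, enum state s) {
    return freq[s] > 0 ? work / freq[s] : -1.0;  /* OFF cannot execute */
}

static double energy(double work, enum state s) {
    double t = exec_time(work, s);
    return t < 0 ? 0.0 : work * dyn[s] + t * stat_p[s];
}
```

Note what the model implies: running 100 units of work at LOW takes 4x as long and, with leakage accrued over that time, costs more total energy (450) than at FULL (200). Slowing down only pays off when the freed slack, or the whole idle core, can be clock- or power-gated, which is exactly why the compiler combines DVFS with power gating.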

Power Reduction Scheduling

(Figure: a macrotask graph assigned to 3 cores and the resulting power schedule for the fastest execution mode; MTs 1, 4, 7 and 8 are on the critical path (CP).)

1) Reduce the frequencies (Fs) of MTs on the CP, considering the deadline (real-time scheduling mode).
2) Reduce the Fs of MTs not on the CP, considering the transition overheads.
Idle periods: clock or power gating.

A power schedule for SPEC95 APPLU in the fastest execution mode: Doall 6, Loops 10, 11, 12 and 13, Doall 17, and Loops 18, 19, 20 and 21 are on the CP.

Low-Power Optimization with the OSCAR API

(Figure: the scheduled result and the generated code image by the OSCAR compiler. Each virtual core function, void main_VC0() and void main_VC1(), executes its macro-tasks MT1–MT4, with sleep periods inserted through the OSCAR API power control directives "#pragma oscar fvcontrol ((OSCAR_CPU(),0))" and "#pragma oscar fvcontrol (1,(OSCAR_CPU(),100))".)

Power Reduction in a Real-time Execution Controlled by the OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a Hand-tuned Library)

Without power reduction vs. with power reduction by the OSCAR compiler: 70% power reduction.

Average power: 1.76 [W] without, 0.54 [W] with power reduction.

1 cycle: 33 [ms] → 30 [fps]

Automatic Power Reduction for MPEG2 Decode on the Android Multicore ODROID-X2 (ARM Cortex-A9, 4 cores)
http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ

Power consumption without / with automatic power reduction:
  1 core:  0.97 W → 0.63 W (about 2/3, -35.0%)
  2 cores: 1.88 W → 0.46 W (about 1/4, -75.5%)
  3 cores: 2.79 W → 0.37 W (about 1/7, -86.7%)

• On 3 cores, automatic power reduction control successfully reduced power to 1/7 of execution without power reduction control.
• 3 cores with the compiler's power reduction control used 1/3 the power of ordinary 1-core execution.

Average power consumption without / with power control:
  1PE: 29.29 W → 24.17 W
  2PE: 36.59 W → 12.21 W
  3PE: 41.58 W → 9.60 W

Power was reduced to about 1/4 (9.6 W) by the compiler power optimization on the same 3 cores (41.6 W). Power with 3 cores was reduced to 1/3 (9.6 W) against 1 core (29.3 W).

Automatic Parallelization of JPEG-XR for a Drinkable Inner Camera (Endo Capsule)

10 times more speedup was needed after the parallelization for 128 cores of Power7; less than 35 mW power consumption is required.

Speedups on the TILEPro64 manycore: 1.00 (1 core), 1.96 (2), 3.95 (4), 7.86 (8), 15.82 (16), 30.79 (32), 55.11 (64). Execution time: 10.0 s on 1 core → 0.18 s on 64 cores.

55 times speedup with 64 cores.

Waseda U. & Olympus

OSCAR Vector Multicore and Compiler for Embedded Systems to Servers with OSCAR Technology

Target:
 Solar powered, with compiler power reduction.
 Fully automatic parallelization and vectorization, including local memory management and data transfer.

Architecture: centralized shared memory; compiler co-designed interconnection network; multicore chips with on-chip shared memory (x4 chips); a compiler co-designed connection network inside each chip; per-core local memory and distributed shared memory; vector unit, data transfer unit and power control unit per core.

(Photo: Fujitsu VPP500/NWT PE unit cabinet (open). Copyright 2008 FUJITSU LIMITED.)

Performance of OSCAR Compiler Software Coherence Control

 Faster or equal processing performance up to 4 cores compared with the hardware coherence mechanism on RP2.
 Software coherence gives us correct execution without a hardware coherence mechanism on 8 cores.

(Bar chart: speedup against sequential processing on 1, 2, 4 and 8 cores for the AAC encoder, MPEG2 decoder and MPEG2 encoder, SMP hardware-coherent cache vs. software-controlled non-coherent cache. The two configurations track each other closely, e.g. 6.63x vs. 5.90x on 8 cores.)

Automatic Local Memory Management

Data localization: loop aligned decomposition.
• Loops are decomposed into LRs and CARs:
  – LR (localizable region): data can be passed through the local data memory (LDM);
  – CAR (commonly accessed region): data transfers are required among processors.
• Both multi-dimension and single-dimension decomposition are supported.

Adjustable Blocks

• Handles a suitable block size for each application, unlike the fixed block (line) size of a cache.
• Each block can be divided into smaller blocks of integer-divisor sizes to handle small arrays and scalar variables.

Multi-dimensional Template Arrays for Improving Readability

• A mapping technique for arrays with varying dimensions:
  – each block on the LDM corresponds to multiple empty arrays with varying dimensions;
  – these arrays have an additional dimension to store the corresponding block number:
    TA[Block#][] for single dimension, TA[Block#][][] for double dimension, TA[Block#][][][] for triple dimension, and so on.
• The LDM is represented as a one-dimensional array:
  – without Template Arrays, multi-dimensional arrays need complex index calculations: A[i][j][k] -> TA[offset + i' * L + j' * M + k'];
  – Template Arrays provide readability: A[i][j][k] -> TA[Block#][i'][j'][k'].

8-Core RP2 Chip Block Diagram

(Block diagram: the same two-cluster RP2 structure shown earlier: cores #0–#7 with FPU, 16 KB I$/D$, local memory (I: 8K, D: 32K), 64 KB URAM, LCPG0/1, PCR0–7, CCN/BAR, barrier synchronization lines and snoop controllers.)
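The Template Array idea can be illustrated in C (a hypothetical sketch with made-up sizes; the real compiler chooses block sizes and dimensions per application):

```c
enum { BLOCK = 64, L = 4, M = 4 };   /* illustrative block and inner dims */

static int ldm[8 * BLOCK];           /* the LDM as one flat array, 8 blocks */

/* Without Template Arrays: explicit index arithmetic into the flat LDM. */
static int *elem_flat(int block, int i, int j, int k) {
    return &ldm[block * BLOCK + i * (M * L) + j * L + k];
}

/* With Template Arrays: the same memory viewed as TA[block][i][j][k],
 * so the source keeps its multi-dimensional indexing. */
typedef int ta3[BLOCK / (M * L)][M][L];
static ta3 *TA = (ta3 *)ldm;
```

Both forms address the same storage; the Template Array view just lets the generated code keep readable subscripts.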

On-chip system bus: SuperHyway, with DDR2, SRAM and DMA controllers. (LCPG: local clock pulse generator; PCR: power control register; CCN/BAR: cache controller/barrier register; URAM: user RAM, distributed shared memory.)

Speedups by Local Memory Management Compared with Utilizing Shared Memory on Benchmark Applications Using RP2

20.12 times speedup for 8-core execution using local memory against sequential execution using the off-chip shared memory of RP2, for the AAC encoder.

Software Coherence Control Method on the OSCAR Parallelizing Compiler

 Coarse grain task parallelization with earliest executable condition analysis (control and data dependence analysis to detect parallelism among coarse grain tasks).
 The OSCAR compiler automatically controls coherence using the following simple program restructuring methods:
  – to cope with the stale data problem: data synchronization by the compiler;
  – to cope with the false sharing problem: data alignment, array padding, and non-cacheable buffers.

(Figure: the MTG generated by earliest executable condition analysis.)

Automatic Software Coherence Control for Manycores

Performance of Software Coherence Control by the OSCAR Compiler on the 8-core RP2
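The array-padding countermeasure against false sharing, written by hand as the compiler would restructure the data (a sketch; 32 bytes is an assumed line size, and the compiler also uses data alignment and non-cacheable buffers):

```c
#define LINE 32   /* assumed cache line size in bytes */

/* Each core's counter padded out to a full line, so that no two cores'
 * frequently written data ever share a cache line. */
struct per_core {
    int count;
    char pad[LINE - sizeof(int)];   /* array padding */
};

static struct per_core counters[8];  /* one line-disjoint slot per core */
```

Without the pad member, eight adjacent `int` counters would share one or two lines, and on a non-coherent cache a write-back from one core could overwrite another core's update.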

(Bar chart: speedups of SMP (hardware coherence) vs. NCC (software coherence) on 1, 2, 4 and 8 processor cores for equake, art (SPEC2000), lbm, hmmer (SPEC2006), cg, mg, bt, lu, sp (NPB) and the MPEG2 encoder (MediaBench). The software-controlled non-coherent cache achieves speedups comparable to, and sometimes better than, the hardware-coherent SMP, e.g. up to 5.66x at 8 cores.)

0.00 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 equake art lbm hmmer cg mg bt lu sp MPEG2 Encoder SPEC2000 SPEC2006 Application/the number of processor coreNPB MediaBench OSCAR Vector Multicore and Compiler for Embedded to Severs with OSCAR Technology

Target:  Solar Powered Centralized Shared Memory  Compiler power reduction. Fully automatic parallelization and vectorization including local memory Compiler Co-designed Interconnection Network management and data transfer.

Multicore Chip On-chip Shared Memory ×4 chips

Compiler co-designed Connection Network

Core Local Memory Distributed Shared Memory

Data Vector Transfer CPU Unit

Power Control Unit Future Multicore Products

Next generation automobiles (safer, more comfortable, energy efficient, environment friendly): cameras, radar, car-to-car communication and internet information integrated with brake, steering, engine and motor control. Other targets: personal/regional supercomputers, smartphones, advanced medical systems.

- Personal/regional supercomputers: solar powered, more than 100 times more power efficient (FLOPS/W); regional disaster simulators saving lives from tornadoes, localized heavy rain, fires and earthquakes; emergency solar-powered operation; no cooling fan, no dust, clean, usable inside an operating room; protect lives and keep health.
- Smartphones: from everyday recharging to less than once a week.
- Advanced medical systems: cancer treatment, drinkable inner cameras.

Summary

 To get speedup and power reduction on homogeneous and heterogeneous multicore systems, collaboration of architecture and compiler will be increasingly important.
 The automatic parallelizing and power-reducing compiler has succeeded in speeding up and/or reducing the power of scientific applications including earthquake wave propagation, medical applications including cancer treatment using carbon ions and the drinkable inner camera, and industry applications including automobile engine control and wireless-communication baseband processing, on various multicores.
 For example, automatic parallelization gave us 110 times speedup for the earthquake wave propagation simulation on 128 cores of IBM Power7 against 1 core; 327 times speedup for heavy particle radiotherapy cancer treatment on a 144-core Hitachi blade server using Intel Xeon E7-8890; 1.95 times for automobile engine control on Renesas 2-core SH4A or V850 multicores; and 55 times for JPEG-XR encoding for capsule inner cameras on the 64-core Tilera TILEPro64 manycore.
 In automatic power reduction, the power consumed by real-time multimedia applications such as human face detection, H.264, MPEG2 and optical flow was reduced to 1/2 or 1/3 using 3 cores of ARM Cortex-A9 and Intel Haswell, and to 1/4 using 8 Renesas SH4A cores, against ordinary single-core execution.
 For more speedup and power reduction, we have been developing a new architecture/compiler co-designed multicore with a vector accelerator based on vector pipelining with vector registers, chaining, a load-store pipeline and an advanced DMA controller, without requiring modification of the CPU instruction set.