Software and Hardware for High Performance and Low Power Homogeneous and Heterogeneous Multicore Systems

Hironori Kasahara
Professor, Dept. of Computer Science & Engineering
Director, Advanced Multicore Processor Research Institute
Waseda University, Tokyo, Japan
IEEE Computer Society President Elect 2017, President 2018

1980 BS, 1982 MS, 1985 Ph.D., Dept. EE, Waseda Univ.
1985 Visiting Scholar: U. of California, Berkeley
1986 Assistant Prof., 1988 Associate Prof., 1997 Prof., Dept. of EECE, Waseda Univ.; now Dept. of Computer Sci. & Eng.
1989-90 Research Scholar: U. of Illinois, Urbana-Champaign, Center for Supercomputing R&D

Awards: 1987 IFAC World Congress Young Author Prize; 1997 IPSJ Sakai Special Research Award; 2005 STARC Academia-Industry Research Award; 2008 LSI of the Year Second Prize; 2008 Intel Asia Academic Forum Best Research Award; 2010 IEEE CS Golden Core Member Award; 2014 Minister of Edu., Sci. & Tech. Research Prize; 2015 IPSJ Fellow; 2017 IEEE Fellow

Publications and patents: 214 reviewed papers, 145 invited talks; 59 unexamined patent applications (Japan, US, GB, China), 30 granted patents; 572 articles in newspapers, web news, and media incl. TV; 245 committees in societies and government.

Committees: IEEE Computer Society President 2018, BoG (2009-14), Multicore STC Chair (2012-), Japan Chair (2005-07); IPSJ Chair: HG for Mag. & J. Edit, SIG on ARC.
【METI/NEDO】 Project Leader: Multicore for Consumer Electronics, Advanced Parallelizing Compiler; Chair: Computer Strategy Committee.
【Cabinet Office】 CSTP Supercomputer Strategic ICT PT, Japan Prize Selection Committees, etc.
【MEXT】 Info. Sci. & Tech. Committee; Supercomputer (Earth Simulator, HPCI Promotion, Next Gen. Supercomputer K) Committees, etc.

IEEE Computer Society BoG (Board of Governors), Feb. 1, 2017

Multicores for Performance and Low Power
Power consumption is one of the biggest problems for performance scaling, from smartphones to cloud servers and supercomputers (the "K" computer consumes more than 10 MW).
Power ∝ Frequency × Voltage^2, and Voltage ∝ Frequency, so Power ∝ Frequency^3.
• If the frequency is reduced to 1/4 (e.g., 4 GHz → 1 GHz), power is reduced to 1/64, but performance also falls to 1/4.
• If 8 cores are integrated on a chip and run at the lowered frequency, power is still 1/8 of the original single fast core while performance becomes 2 times. (The arithmetic is spelled out below.)
IEEE ISSCC08, Paper No. 4.5, M. Ito, …, and H. Kasahara, "An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler"
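The arithmetic behind these numbers, written out (a standard CMOS dynamic-power sketch consistent with the slide, assuming voltage scales linearly with frequency):

\[
P \propto f V^2,\qquad V \propto f \;\Rightarrow\; P \propto f^3
\]
\[
f \to \tfrac{f}{4}:\quad P \to \left(\tfrac{1}{4}\right)^3 P = \tfrac{P}{64},\qquad \text{performance} \to \tfrac{1}{4}
\]
\[
\text{8 cores at } \tfrac{f}{4}:\quad P_{\mathrm{chip}} \approx 8\cdot\tfrac{P}{64} = \tfrac{P}{8},\qquad \text{performance} \approx 8\cdot\tfrac{1}{4} = 2\times
\]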

Parallel Software is Important for Scalable Performance of Multicores
• Just adding more cores does not give speedup by itself.
• The development cost and time of parallel software are becoming a bottleneck in the development of embedded systems, e.g., IoT devices and automobiles.

Example: the GMS earthquake wave propagation simulation, developed by the National Research Institute for Earth Science and Disaster Resilience (NIED), on a Fujitsu M9000 SPARC multicore server:
• The automatic parallelizing compiler available on the market gave no speedup over 1-core execution even on 64 cores; with 128 cores, execution was slower than on 1 core (0.9 times speedup).
• The OSCAR parallelizing compiler gave 211 times speedup with 128 cores against 1-core execution with the commercial compiler.
• On 1 core, the OSCAR compiler gave 2.1 times speedup against the commercial compiler by global cache optimization.

Trend of Peak Performance of Supercomputers
(The US, China, Europe, and Japan plan ExaFLOPS systems for 2020-22.)
• Aurora, 2018, 180 PFLOPS, 13 MW, Argonne National Lab., Intel & Cray
• Sunway TaihuLight, 2016.06, 93 PFLOPS, 15.4 MW
• Tianhe-2, 2013.06, 55 PFLOPS, 17.8 MW
• Titan, 2012.11, 27 PFLOPS, 8.2 MW
• Sequoia, 2012.06, 20 PFLOPS, 7.9 MW
• K computer, 2011.6 & 11, 11 PFLOPS, 11.3 MW


Power Reduction of MPEG2 Decoding to 1/4 on the 8-Core Homogeneous Multicore RP-2 by the OSCAR Parallelizing Compiler
MPEG2 decoding with 8 CPU cores:
• Without power control (voltage fixed at 1.4 V): average power 5.73 W.
• With power control (frequency and voltage control 1.4 V-1.0 V, resume standby with power shutdown and voltage lowering): average power 1.52 W.
• 73.5% power reduction.

The 4-core multicore RP1 (2007), the 8-core multicore RP2 (2008), and the 15-core heterogeneous multicore RPX (2010) were developed in METI/NEDO projects with Hitachi and Renesas.

Compiler Co-designed Multicore RP2

[Figure: RP2 block diagram — two 4-core clusters (Core#0-#3, Core#4-#7); each core has a CPU, FPU, 16 KB I$, 16 KB D$, local memory (I: 8 KB, D: 32 KB), 64 KB user RAM (URAM), and a power control register (PCR0-PCR7); per-cluster local clock pulse generators (LCPG0/1) and snoop controllers; barrier synchronization lines; on-chip system bus (SuperHyway) with DDR2, SRAM, and DMA controllers]
LCPG: Local clock pulse generator
PCR: Power Control Register
CCN/BAR: Cache controller / Barrier Register
URAM: User RAM (Distributed Shared Memory)

Renesas-Hitachi-Waseda Low Power 8-Core RP2, developed in 2007 in the METI/NEDO project

Process Technology: 90 nm, 8-layer, triple-Vth CMOS
Chip Size: 104.8 mm^2 (10.61 mm x 9.88 mm)
CPU Core Size: 6.6 mm^2 (3.36 mm x 1.96 mm)
Supply Voltage: 1.0-1.4 V (internal), 1.8/3.3 V (I/O)
Power Domains: 17 (8 CPUs, 8 URAMs, common)
[Figure: RP2 chip micrograph with Core#0-#7, I$/D$, ILRAM/DLRAM, URAM, SNC0/SNC1, LBSC, SHWY, VSWC, DBSC, CSM, GCPG, DDRPAD]
IEEE ISSCC08, Paper No. 4.5, M. Ito, …, and H. Kasahara, "An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler"

Industry-Government-Academia Collaboration and Target Applications in the Green Computing R&D Center
Goals: protect lives, protect the environment, support smart life.
Waseda University R&D: many-core system technologies with ultra-low power consumption — the OSCAR many-core chip and the OSCAR many-core compiler and API.
Target applications:
• Green supercomputers and cloud servers: super real-time disaster simulation (tectonic shifts, tsunami, tornado, flood, fire spreading), stock trading, solar-powered medical servers, and cool, fan-less, quiet desktop servers.
• On-board vehicle systems: navigation systems, integrated controllers, infrastructure coordination.
• Consumer electronics: Internet TV/DVD, camcorders, cameras, smartphones, intelligent home appliances.
• Medical systems with the National Institute of Radiological Sciences: capsule inner cameras, heavy particle radiation treatment planning, cerebral infarction analysis.
• Robots.
Servers can be operated and recharged by solar cells; non-fan, cool, quiet servers are designed for solar-powered operation.

Cancer Treatment: Carbon Ion Radiotherapy
National Institute of Radiological Sciences (NIRS). The previous best was 2.5 times speedup on 16 processors with hand optimization.
• 8.9 times speedup with 12 processors on an Intel Xeon X5670 2.93 GHz 12-core SMP (Hitachi HA8000).
• 55 times speedup with 64 processors on an IBM Power7 64-core SMP (Hitachi SR16000).

OSCAR Parallelizing Compiler
Goal: improve effective performance, cost-performance, and software productivity, and reduce power.
• Multigrain Parallelization: coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism.
• Data Localization: automatic data management for distributed shared memory, cache, and local memory.

• Data Transfer Overlapping: data transfers are overlapped with computation using data transfer controllers (DMAs), based on data localization groups.
• Power Reduction: reduction of consumed power by compiler-controlled DVFS and power gating with hardware support.
[Figure: a macro-task graph partitioned into data localization groups dlg0-dlg3]

Performance of the OSCAR Compiler on an IBM p6 595 Power6 (4.2 GHz) based 32-core SMP Server

AIX Ver. 12.1
Compile options:
(*1) Sequential: -O3 -qarch=pwr6; XLF: -O3 -qarch=pwr6 -qsmp=auto; OSCAR: -O3 -qarch=pwr6 -qsmp=noauto
(*2) Sequential: -O5 -q64 -qarch=pwr6; XLF: -O5 -q64 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -q64 -qarch=pwr6 -qsmp=noauto
(Others) Sequential: -O5 -qarch=pwr6; XLF: -O5 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -qarch=pwr6 -qsmp=noauto

Generation of Coarse Grain Tasks
A program is decomposed into macro-tasks (MTs):
• Block of Pseudo Assignments (BPA): basic block (BB)
• Repetition Block (RB): natural loop
• Subroutine Block (SB): subroutine
[Figure: hierarchical decomposition — the program is split into BPAs, RBs, and SBs for coarse-grain parallelization at the 1st layer; loop bodies and subroutines are further decomposed into BPAs, RBs, and SBs at the 2nd and 3rd layers for loop-level, near-fine-grain, and nested coarse-grain parallelization]
(A rough C illustration of the three macro-task kinds follows.)
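The fragment below is only an illustration (not OSCAR output; the function and variable names are invented) of how ordinary C program structure maps onto the three macro-task kinds:

    /* Rough C illustration of macro-task kinds (names are hypothetical). */
    #include <stdio.h>
    #include <stddef.h>

    static void filter(double *y, size_t n)    /* called as an SB below */
    {
        for (size_t i = 0; i < n; i++)
            y[i] *= 0.5;
    }

    static double process(double *x, double *y, size_t n)
    {
        double sum = 0.0;                 /* BPA: straight-line basic block */
        double scale = 2.0;

        for (size_t i = 0; i < n; i++) {  /* RB: natural loop; its body can  */
            y[i] = scale * x[i];          /* be decomposed again at a lower  */
            sum += y[i];                  /* layer (near fine grain)         */
        }

        filter(y, n);                     /* SB: subroutine block            */

        return sum;                       /* BPA                             */
    }

    int main(void)
    {
        double x[4] = {1, 2, 3, 4}, y[4];
        printf("%g\n", process(x, y, 4));
        return 0;
    }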

Earliest Executable Condition Analysis for Coarse Grain Tasks (Macro-tasks)
The compiler builds a Macro Flow Graph (MFG) representing data dependences, control flow, and conditional branches among macro-tasks (BPAs and RBs), and derives from it a Macro Task Graph (MTG) whose edges represent data dependences and extended control dependences, combined with AND/OR conditions, with the original control flow overlaid.
[Figure: a Macro Flow Graph of 14 macro-tasks and the Macro Task Graph derived from it — solid edges: data dependency; dotted edges: extended control dependency; arcs: AND/OR combination of conditions; BPA: Block of Pseudo Assignment Statements; RB: Repetition Block]

Priority Determination in the Dynamic CP Method
The priority of a macro-task is its expected longest-path length to the exit, weighting each conditional branch path by its branch probability; for example, 60 * 0.80 + 100 * 0.20 = 68.

Earliest Executable Conditions
An earliest executable condition (EEC) combines control dependence and data dependence: control dependences determine when the execution of an MT is decided, and data dependences determine when the data accessed by an MT are ready.
• MT2 may start execution after MT1 branches to MT2 and MT1 finishes execution.
• MT3 may start execution after MT1 branches to MT3.
• MT6 may start execution after MT3 finishes execution or MT2 branches to MT4.
(A minimal run-time sketch of these checks follows.)
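The following C sketch shows how such conditions could be evaluated by a dynamic scheduler; the flag names and the state structure are invented for illustration (OSCAR normally resolves these conditions statically or with generated scheduling code):

    #include <stdbool.h>

    /* Invented run-time state for the example MTs on this slide. */
    typedef struct {
        bool mt1_finished, mt1_branched_to_mt2, mt1_branched_to_mt3;
        bool mt2_branched_to_mt4, mt3_finished;
    } MTState;

    /* EEC of MT2: MT1 branched to MT2 AND MT1 finished. */
    static bool mt2_ready(const MTState *s)
    {
        return s->mt1_branched_to_mt2 && s->mt1_finished;
    }

    /* EEC of MT3: MT1 branched to MT3 (control dependence only). */
    static bool mt3_ready(const MTState *s)
    {
        return s->mt1_branched_to_mt3;
    }

    /* EEC of MT6: MT3 finished OR MT2 branched to MT4 (OR of conditions). */
    static bool mt6_ready(const MTState *s)
    {
        return s->mt3_finished || s->mt2_branched_to_mt4;
    }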

Automatic Processor Assignment in 103.su2cor
• Using 14 processors: coarse grain parallelization within loop DO400.

MTG of su2cor LOOPS DO400
• Coarse grain parallelism PARA_ALD = 4.3
[Figure: the macro-task graph of DO400 — node types: DOALL, sequential loop, SB, BB]

Data-Localization: Loop Aligned Decomposition
• Decompose multiple loops (Doall and sequential) into CARs and LRs considering inter-loop data dependences.
  - Most data in an LR can be passed through local memory (LM).
  - LR: Localizable Region, CAR: Commonly Accessed Region

    C     RB1 (Doall)
          DO I=1,101
            A(I)=2*I
          ENDDO
    C     RB2 (Doseq)
          DO I=1,100
            B(I)=B(I-1)+A(I)+A(I+1)
          ENDDO
    C     RB3 (Doall)
          DO I=2,100
            C(I)=B(I)+B(I-1)
          ENDDO

Aligned decomposition (LR / CAR / LR / CAR / LR):
  RB1: DO I=1,33 | DO I=34,35 | DO I=36,66 | DO I=67,68 | DO I=69,101
  RB2: DO I=1,33 | DO I=34,34 | DO I=35,66 | DO I=67,67 | DO I=68,100
  RB3: DO I=2,34 |            | DO I=35,67 |            | DO I=68,100

Inter-loop Data Dependence Analysis in a TLG (Target Loop Group)
• Define the exit RB of the TLG as the standard loop.
• Find the iterations of the earlier RBs on which an iteration of the standard loop is data dependent.
  - e.g., the K-th iteration of RB3 is data dependent on the (K-1)-th and K-th iterations of RB2 and, indirectly, on the (K-1)-th, K-th, and (K+1)-th iterations of RB1.
(Example TLG: the loops RB1, RB2, and RB3 shown above.)

Decomposition of RBs in a TLG
• Decompose the GCIR into DGCIR^p (1 ≤ p ≤ n)
  - n: (a multiple of) the number of PCs; DGCIR: decomposed GCIR
• Generate a CAR over the iterations on which both DGCIR^p and DGCIR^(p+1) are data dependent.
• Generate an LR over the iterations on which only DGCIR^p is data dependent.

For the example above with n = 3:
  RB1: 1-33 | 34-35 | 36-66 | 67-68 | 69-101
  RB2: 1-33 | 34    | 35-66 | 67    | 68-100
  RB3: 2-34 |       | 35-67 |       | 68-100
  DGCIR^1 = 2-34, DGCIR^2 = 35-67, DGCIR^3 = 68-100

Data Localization
[Figure: an MTG, the MTG after division, and a schedule for two processors (PE0, PE1); macro-tasks belonging to the same data localization group (dlg0-dlg3) are scheduled to the same processor so that their data stay in local memory]

An Example of Data Localization for SPEC95 Swim

          DO 200 J=1,N
          DO 200 I=1,M
            UNEW(I+1,J) = UOLD(I+1,J)+
         1    TDTS8*(Z(I+1,J+1)+Z(I+1,J))*(CV(I+1,J+1)+CV(I,J+1)+CV(I,J)
         2    +CV(I+1,J))-TDTSDX*(H(I+1,J)-H(I,J))
            VNEW(I,J+1) = VOLD(I,J+1)-TDTS8*(Z(I+1,J+1)+Z(I,J+1))
         1    *(CU(I+1,J+1)+CU(I,J+1)+CU(I,J)+CU(I+1,J))
         2    -TDTSDY*(H(I,J+1)-H(I,J))
            PNEW(I,J) = POLD(I,J)-TDTSDX*(CU(I+1,J)-CU(I,J))
         1    -TDTSDY*(CV(I,J+1)-CV(I,J))
      200 CONTINUE
          DO 210 J=1,N
            UNEW(1,J) = UNEW(M+1,J)
            VNEW(M+1,J+1) = VNEW(1,J+1)
            PNEW(M+1,J) = PNEW(1,J)
      210 CONTINUE
          DO 300 J=1,N
          DO 300 I=1,M
            UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
            VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
            POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
      300 CONTINUE

(a) An example of a target loop group for data localization.
(b) Image of the alignment on a 4 MB cache of the arrays accessed by the target loops (UN, VN, PN, UO, VO, PO, CU, CV, Z, H, U, V, P): cache line conflicts occur among arrays that share the same location on the cache.

Data Layout for Removing Line Conflict Misses by Array Dimension Padding
Declaration part of the arrays in SPEC95 swim:

Before padding:
          PARAMETER (N1=513, N2=513)
          COMMON U(N1,N2), V(N1,N2), P(N1,N2),
         *       UNEW(N1,N2), VNEW(N1,N2),
         1       PNEW(N1,N2), UOLD(N1,N2),
         *       VOLD(N1,N2), POLD(N1,N2),
         2       CU(N1,N2), CV(N1,N2),
         *       Z(N1,N2), H(N1,N2)

After padding:
          PARAMETER (N1=513, N2=544)
          COMMON U(N1,N2), V(N1,N2), P(N1,N2),
         *       UNEW(N1,N2), VNEW(N1,N2),
         1       PNEW(N1,N2), UOLD(N1,N2),
         *       VOLD(N1,N2), POLD(N1,N2),
         2       CU(N1,N2), CV(N1,N2),
         *       Z(N1,N2), H(N1,N2)

[Figure: array layout on the 4 MB cache before and after padding; boxes show the access range of DLG0; padding shifts the arrays so they no longer conflict on the same cache lines]

Statement Level Near Fine Grain Tasks
Each statement becomes a near-fine-grain task, e.g.:
 1) u12 = a12/l11
 2) u24 = a24/l22
 3) u34 = a34/l33
 4) l54 = -l52 * u24
 5) u45 = a45/l44
 6) l55 = u55 - l54 * u45
 7) y1 = b1 / l11
 8) y2 = b2 / l22
 9) b5 = b5 - l52 * y2
10) y3 = b3 / l33
11) y4 = b4 / l44
12) b5 = b5 - l54 * y4
13) y5 = b5 / l55
14) x4 = y4 - u45 * y5
15) x3 = y3 - u34 * x4
16) x2 = y2 - u24 * x4
17) x1 = y1 - u12 * x2
[Figure: the task graph of these statements, with a processing time per task and a data transfer time tij on each edge: tij = 0 if Ti and Tj are on the same PE, tij = 9 if they are on different PEs]

Task Graph for FPPPP
[Figure: the statement-level task graph of SPEC FPPPP, showing abundant near-fine-grain parallelism]

Elimination of Redundant Synchronization for Shared Data on Centralized Shared Memory after Static Task Scheduling
[Figure: tasks A-E statically scheduled onto PE1-PE3; FS = flag set, FC = flag check; flag sets and checks whose precedence relations are already guaranteed by the static schedule are unnecessary and are eliminated]

Generated Parallel Machine Code for Near Fine Grain Parallel Processing

PE1:
    ; ---- Task A ----
    ; Task Body
    FADD R23, R19, R21
    ; ---- Task B ----
    ; Task Body
    FMLT R27, R28, R29
    FSUB R29, R19, R27
    ; Data Transfer
    STR  [R14, 1], R29
    ; Flag Set
    STR  [R14, 0], R0

PE2:
    ; ---- Task C ----
    ; Task Body
    ; Flag Check : Task C
L18:
    LDR  R28, [R14, 0]
    CMP  R28, R0
    JNE  L18
    ; Data Receive
    LDR  R29, [R14, 1]
    ; Task Body
    FMLT R24, R23, R29

【W-CDMA Base Band Communication】 Near Fine Grain Parallel Processing of an EAICH Detection Program on the RP2 Multicore with 4 SH4A Cores
• The Hadamard transform is often used in this kind of signal processing.
• Parallel processing method: near-fine-grain parallel processing among statements with static scheduling.
• For reference, special-purpose hardware (250 MHz) detects EAICH in 1.74 μs.
• Result: 1.62 times speedup with 2 cores and 3.45 times speedup with 4 cores for EAICH on RP2.

Generated Multigrain Parallelized Code
The nested coarse grain task parallelization is realized using only the OpenMP "section", "flush", and "critical" directives, with centralized scheduling code for the 1st layer and distributed scheduling code for the 2nd layer.
[Figure: threads T0-T7 forked by one SECTIONS region; the 1st-layer macro-tasks MT1_1 to MT1_4 and their nested 2nd-layer macro-tasks are executed by two thread groups with SEND/RECV/SYNC operations]

Code Generation Using OpenMP
• The compiler generates a parallelized program using the OpenMP API.
• One-time, single-level thread generation:
  - Threads are forked only once, at the beginning of the program, by the OpenMP "PARALLEL SECTIONS" directive.
  - The forked threads join only once, at the end of the program.
• The compiler generates code for each thread using static or dynamic scheduling schemes.
• No extension of OpenMP for hierarchical processing is required; a minimal sketch of this code shape follows.
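The sketch below is hand-written to show the code shape only — it is not OSCAR compiler output; the mt* functions are placeholders, and the inter-group synchronization that real generated code performs with flush and flag variables is omitted:

    #include <stdio.h>

    /* Placeholder macro-tasks; in generated code these are the coarse-grain
     * tasks that the static scheduler assigned to each thread group.       */
    static void mt1_1(void) { puts("MT1_1"); }
    static void mt1_2(void) { puts("MT1_2"); }
    static void mt1_3(void) { puts("MT1_3"); }
    static void mt1_4(void) { puts("MT1_4"); }

    int main(void)
    {
        /* Threads are forked exactly once for the whole program ...        */
        #pragma omp parallel sections
        {
            #pragma omp section
            {                     /* thread group 0: its scheduled task list */
                mt1_1();
                mt1_3();
            }
            #pragma omp section
            {                     /* thread group 1: its scheduled task list */
                mt1_2();
                mt1_4();
            }
        }                         /* ... and join exactly once here          */
        return 0;
    }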

Multicore Program Development Using OSCAR API V2.0
A sequential application program in Fortran or C (consumer electronics, automobiles, medical, scientific computation, etc.) is processed as follows:
• The accelerator compiler or the user adds "hint" directives before a loop or a function to specify that it is executable by an accelerator and roughly how many clock cycles it takes (an illustrative hint is sketched below).
• The Waseda OSCAR parallelizing compiler performs coarse grain task parallelization, data localization, DMAC data transfer, and power reduction using DVFS and clock/power gating, and generates a parallelized Fortran or C program annotated with OSCAR API directives for thread generation, memory mapping, data transfer using DMA, and power management. Manual parallelization and low-power tuning with the API are also possible.
• The OSCAR API targets homogeneous and heterogeneous multicores and manycores as well as SMP servers: an API analyzer plus the existing sequential compiler from each vendor (Vendor A for homogeneous low-power multicores, Vendor B for heterogeneous multicores with accelerators, or an OpenMP compiler for shared-memory servers) generates the executable for each target.
• Executables run on multicores from Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, eSOL, CATS, Gaio, and three universities.
OSCAR: Optimally Scheduled Advanced Multiprocessor; API: Application Program Interface.
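As an illustration only — the pragma spelling below is hypothetical and is not the normative OSCAR API syntax — a hint placed before a function might look like this:

    /* Hypothetical hint directive (illustrative syntax only): tells the
     * compiler that fft_block() can run on an accelerator named FE and
     * takes about 12000 cycles there. fft_block() is an invented example. */
    #pragma oscar_hint accelerator_task(FE) cycle(12000)
    void fft_block(float *buf, int n);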

Parallel Processing of Face Detection on a Manycore, a High-end Server, and a PC Server
[Figure: speedup (速度向上率) vs. number of cores (コア数) for 1, 2, 4, 8, and 16 cores on TILEPro64 (gcc), SR16K (Power7, 8 cores x 4 CPUs x 4 nodes, xlc), and rs440 (Intel Xeon, 8 cores x 4 CPUs, icc); speedups at 16 cores reach 11.55, 10.92, and 9.30]
• The OSCAR compiler gives 11.55 times speedup for 16 cores against 1 core on the SR16000 Power7 high-end server.

Performance on a Multicore Server for the Latest Cancer Treatment Using Heavy Particles (Proton, Carbon Ion)
327.6 times speedup with 144 cores on a Hitachi 144-core SMP blade server BS500: Xeon E7-8890 v3 (2.5 GHz, 18 cores/chip) x 8 chips.
[Figure: speedup over 1-core GCC execution vs. number of cores (1, 32, 64, 144), reaching 327.6 with 144 cores]
• The original sequential execution time of 2948 sec (about 50 minutes) using GCC was reduced to 9 sec with 144 cores (327.6 times speedup).
• A reduction of treatment cost and of the reservation waiting period is expected.

110 Times Speedup against Sequential Processing for the GMS Earthquake Wave Propagation Simulation on a Hitachi SR16000 (Power7-based 128-core Linux SMP)
• Fortran, about 15 thousand lines.
• First-touch placement for the distributed shared memory and cache optimization over 39 loops are important for scalable speedup (a generic first-touch sketch follows).
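"First touch" refers to the common NUMA policy that a page is physically allocated on the node of the thread that first writes it. The C/OpenMP sketch below is generic (it is not the GMS code): data are initialized with the same loop decomposition that later computes on them, so each thread's pages end up local to it.

    #include <stdlib.h>

    #define N (1 << 22)

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);

        /* First touch: initialize in parallel with the same static schedule
         * as the compute loop, so each page is allocated on the NUMA node
         * of the thread that will later use it.                            */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++) {
            a[i] = 0.0;
            b[i] = 1.0;
        }

        /* Compute loop: same schedule, so threads mostly touch local pages. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] += 0.5 * b[i];

        free(a); free(b);
        return 0;
    }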

Parallel Processing of a JPEG XR Encoder on TILEPro64 (Multimedia Applications)
Flow: sequential C source code (JPEG XR encoder, optical flow calculation) → (1) OSCAR compiler parallelization → parallelized C program with the OSCAR API → (2) cache allocation setting → API analyzer + sequential compiler → parallelized executable binary for TILEPro64.
• 55x speedup on 64 cores with our cache allocation, versus 28x with the default cache allocation.
• Local cache optimization: the parallel data structure (tile) on the heap is allocated to the local cache of the core that processes it.
[Figure: the TILEPro64 tile grid (8 x 8 cores) with four memory controllers and I/O blocks; speedup vs. number of cores with the default and the optimized cache allocation]

Speedup with 2 Cores for a Handwritten Engine Crankshaft Program on the RPX Multicore Processor

• 1.6 times speedup with 2 cores against 1 core.
[Figure: speedup (速度向上率) bar chart — 1.00 for 1 core, 1.60 for 2 cores; the original macrotask graph with many conditional branches and the macrotask graph after task fusion]
• Branches are fused into macrotasks for static scheduling, because the task grain is too fine (microseconds) for dynamic scheduling.

Model-Based Designed Engine Control on a Multicore with Denso
Although parallel processing of engine control on multicores has so far been very difficult, Denso and Waseda achieved 1.95 times speedup on a 2-core V850 multicore processor for hard real-time automobile engine control. The C codes generated by the MATLAB/Simulink Embedded Coder are automatically parallelized.

[Figure: Gantt charts of the engine control code on 1 core and on 2 cores]

OSCAR Compile Flow for Simulink Applications

A Simulink model is converted to C code using the Embedded Coder, and the C code is fed to the OSCAR compiler, which:
(1) generates the MTG, exposing the parallelism;
(2) generates a Gantt chart, i.e., the schedule on the multicore;
(3) generates parallelized C code using the OSCAR API for multiplatform execution (Intel, ARM, SH, etc.).

Speedups of MATLAB/Simulink Image Processing on Various 4-core Multicores (Intel Xeon, ARM Cortex-A15, and Renesas SH4A)

Benchmark models:
• Road Tracking, Image Compression: http://www.mathworks.co.jp/jp/help/vision/examples
• Buoy Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/44706‐buoy‐detection‐using‐simulink
• Color Edge Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/28114‐fast‐edges‐of‐a‐color‐image‐‐actual‐color‐‐not‐converting‐to‐grayscale‐/
• Vessel Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/24990‐retinal‐blood‐vessel‐extraction/

Parallel Processing on a Simulink Model
• The parallelized C code can be embedded back into Simulink using the C MEX API for HILS and SILS implementations.

[Figure: a Simulink model calling the sequential C code from an S-Function block, and the same model calling the parallelized C code from an S-Function block]
(A generic C MEX S-Function skeleton follows.)
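A minimal C MEX S-Function skeleton of the kind such an integration could use — this is generic Simulink S-Function boilerplate written from memory, not code from the slides, and run_parallel_code() is a hypothetical stand-in for the OSCAR-parallelized entry point:

    #define S_FUNCTION_NAME  oscar_sfun
    #define S_FUNCTION_LEVEL 2
    #include "simstruc.h"

    /* Hypothetical entry point into the parallelized C code. */
    extern void run_parallel_code(const double *in, double *out, int n);

    static void mdlInitializeSizes(SimStruct *S)
    {
        ssSetNumSFcnParams(S, 0);
        if (!ssSetNumInputPorts(S, 1)) return;
        ssSetInputPortWidth(S, 0, DYNAMICALLY_SIZED);
        ssSetInputPortRequiredContiguous(S, 0, 1);
        ssSetInputPortDirectFeedThrough(S, 0, 1);
        if (!ssSetNumOutputPorts(S, 1)) return;
        ssSetOutputPortWidth(S, 0, DYNAMICALLY_SIZED);
        ssSetNumSampleTimes(S, 1);
    }

    static void mdlInitializeSampleTimes(SimStruct *S)
    {
        ssSetSampleTime(S, 0, INHERITED_SAMPLE_TIME);
        ssSetOffsetTime(S, 0, 0.0);
    }

    static void mdlOutputs(SimStruct *S, int_T tid)
    {
        int_T         n = ssGetInputPortWidth(S, 0);
        const real_T *u = ssGetInputPortRealSignal(S, 0);
        real_T       *y = ssGetOutputPortRealSignal(S, 0);
        (void)tid;

        /* Hand one block of samples to the parallelized C code. */
        run_parallel_code(u, y, n);
    }

    static void mdlTerminate(SimStruct *S) { (void)S; }

    #ifdef MATLAB_MEX_FILE
    #include "simulink.c"
    #else
    #include "cg_sfun.h"
    #endif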

OSCAR API Ver. 2.0 for Homogeneous/Heterogeneous Multicores and Manycores

An Image of a Static Schedule for a Heterogeneous Multicore with Data Transfer Overlapping and Power Control
[Figure: a schedule over time of CPU and accelerator cores with overlapped DMA transfers and power-controlled idle intervals]

33 Times Speedup Using the OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a hand-tuned library): 111 fps

[Figure: speedups against a single SH processor — 1.0 (1SH, 3.4 fps), 2.29 (2SH), 3.09 (4SH), 5.4 (8SH), 18.85 (2SH+1FE), 26.71 (4SH+2FE), and 32.65 (8SH+4FE, 111 fps)]

Power Reduction by Power Supply, Clock Frequency, and Voltage Control by the OSCAR Compiler
The frequency and voltage (DVFS) and the clock and power gating of each core are scheduled together with the task schedule, since the dynamic power is proportional to the cube of the frequency (F^3, with voltage scaled along with frequency) and the leakage (static) power can be reduced by power gating (power off).

• Shortest execution time mode
[Figure: a schedule in which each macro-task runs at frequency FULL, MID, or LOW, and idle cores are powered off by power gating]

An Example of Machine Parameters

• Functions of the multiprocessor for the power saving scheme:
  - The frequency of each processor can be changed among several levels.
  - The voltage is changed together with the frequency.
  - Each processor can be powered on/off.

    state            FULL   MID    LOW    OFF
    frequency        1      1/2    1/4    0
    voltage          1      0.87   0.71   0
    dynamic energy   1      3/4    1/2    0
    static power     1      1      1      0

• State transition overhead:

    delay time [u.t.]                     energy overhead [μJ]
    state  FULL  MID   LOW   OFF          state  FULL  MID  LOW  OFF
    FULL   0     40k   40k   80k          FULL   0     20   20   40
    MID    40k   0     40k   80k          MID    20    0    20   40
    LOW    40k   40k   0     80k          LOW    20    20   0    40
    OFF    80k   80k   80k   0            OFF    40    40   40   0

(A small worked example using these parameters follows.)
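A back-of-the-envelope example of my own (not from the slides), using the table's relative units and ignoring the static energy of the stretched interval: consider a macro-task off the critical path with slack before its successors need its result.

\[
\begin{aligned}
&\text{MT at FULL: } T = 200\text{k u.t.},\quad E_{\mathrm{dyn}} = 400\,\mu\text{J},\quad \text{slack} = 300\text{k u.t.}\\
&\text{Run at MID: time } 2T = 400\text{k u.t.},\ \text{transitions } 2\times 40\text{k} = 80\text{k u.t.}
  \;\Rightarrow\; \text{extra } 280\text{k} \le 300\text{k (fits the slack)}\\
&E_{\mathrm{dyn}} \to \tfrac34 \times 400 = 300\,\mu\text{J},\quad
  \text{transition overhead } 2\times 20 = 40\,\mu\text{J}
  \;\Rightarrow\; 340\,\mu\text{J} < 400\,\mu\text{J}
\end{aligned}
\]

So slowing this MT to MID saves energy while meeting the schedule; with less slack, the transition delays would make FULL (or power gating the idle gap) the better choice.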

Power Reduction Scheduling
Given a macrotask graph assigned to 3 cores (MTs 1, 4, 7, and 8 are on the critical path, CP), in real-time scheduling mode the compiler:
1) reduces the frequencies (Fs) of the MTs on the CP as far as the deadline allows;
2) reduces the Fs of the MTs not on the CP, considering the transition overheads;
and applies clock or power gating during idle intervals.
[Figure: the resulting power schedule compared with the fastest execution mode]

A power schedule for SPEC95 APPLU in fastest execution mode: Doall 6, Loops 10, 11, 12, 13, Doall 17, and Loops 18, 19, 20, 21 are on the CP.

Low-Power Optimization with the OSCAR API

The scheduled result is emitted by the OSCAR compiler as one function per core (main_VC0(), main_VC1()) containing the macro-tasks MT1-MT4. Where a core would otherwise sit idle, the compiler inserts

    #pragma oscar fvcontrol ¥
      ((OSCAR_CPU(),0))

to put that core to sleep (frequency/voltage level 0), and

    #pragma oscar fvcontrol ¥
      (1,(OSCAR_CPU(),100))

to bring it back to full speed (level 100) just before its next macro-task. ("¥" is the line-continuation character of the original listing.)

Power Reduction in Real-Time Execution Controlled by the OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a hand-tuned library)

• Without power reduction: average 1.76 W. With power reduction by the OSCAR compiler: average 0.54 W — a 70% power reduction.
• One cycle: 33 ms → 30 fps.

Automatic Power Reduction for MPEG2 Decoding on an Android Multicore (ODROID-X2, ARM Cortex-A9, 4 cores)
http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ
[Figure: power consumption [W] on 1, 2, and 3 cores — without power control (電力制御なし): 0.97, 1.88, 2.79; with compiler power control (電力制御あり): 0.63 (-35.0%, about 2/3), 0.46 (-75.5%, about 1/4), 0.37 (-86.7%, about 1/7); 3 cores with control vs. 1 core without control: about 1/3 (-61.9%)]

• On 3 cores, the automatic power reduction control reduced power to 1/7 of the power without power reduction control.
• 3 cores with the compiler power reduction control reduced power to 1/3 of ordinary 1-core execution.

Power Reduction on Intel Haswell for Real-Time Optical Flow
Intel Core i7 4770K, HD 720p (1280x720) moving pictures at 15 fps (deadline 66.6 ms/frame).
[Figure: average power consumption [W] for 1, 2, and 3 PEs — without power control: 29.29, 36.59, 41.58; with power control: 24.17, 12.21, 9.60]
• Power was reduced to 1/4 (9.6 W) by the compiler power optimization on the same 3 cores (41.6 W without control).
• Power with 3 cores was reduced to 1/3 (9.6 W) against 1 core (29.3 W).

Automatic Parallelization of JPEG-XR for a Drinkable Inner Camera (Endo Capsule) — Waseda U. & Olympus
• 10 times more speedup was needed even after parallelization for 128 cores of Power7, and less than 35 mW power consumption is required.
• Speedups on the TILEPro64 manycore: 1.00, 1.96, 3.95, 7.86, 15.82, 30.79, and 55.11 for 1, 2, 4, 8, 16, 32, and 64 cores (コア数); 10.0 s on 1 core was reduced to 0.18 s on 64 cores — 55 times speedup with 64 cores.

OSCAR Vector Multicore and Compiler, from Embedded Systems to Servers, with OSCAR Technology

Targets:
• Solar-powered operation.
• Compiler-directed power reduction.
• Fully automatic parallelization and vectorization, including local memory management and data transfer.

Architecture (compiler co-designed):
[Figure: a multicore chip (x4 chips) with centralized/on-chip shared memory and a compiler co-designed interconnection network; each core contains a CPU, a vector unit, local memory, distributed shared memory, a data transfer unit, and a power control unit. Photo: Fujitsu VPP500/NWT PE unit cabinet (open), Copyright 2008 FUJITSU LIMITED]

Automatic Local Memory Management — Data Localization: Loop Aligned Decomposition
• Loops are decomposed into LRs and CARs:
  - LR (Localizable Region): data can be passed through the LDM (local data memory).
  - CAR (Commonly Accessed Region): data transfers are required among processors.
[Figure: multi-dimension decomposition vs. single-dimension decomposition of an array across the cores]

Adjustable Blocks
• The local memory is managed in blocks whose size suits each application, unlike the fixed block (line) size of a cache.
• Each block can be divided into smaller blocks of integer-divisible sizes to handle small arrays and scalar variables.

Multi-dimensional Template Arrays for Improving Readability
• A mapping technique for arrays with varying dimensions:
  - Each block on the LDM corresponds to multiple empty arrays with varying dimensions.
  - These arrays have an additional dimension that stores the corresponding block number: TA[Block#][] for single dimension, TA[Block#][][] for double dimension, TA[Block#][][][] for triple dimension, and so on.
• The LDM itself is represented as a one-dimensional array.
  - Without template arrays, multi-dimensional arrays need complex index calculations: A[i][j][k] -> TA[offset + i' * L + j' * M + k'].
  - Template arrays provide readability: A[i][j][k] -> TA[Block#][i'][j'][k']. (See the sketch below.)
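A toy C sketch of the index mapping described above; the block count and dimensions are invented, and OSCAR generates this mapping automatically:

    #include <stdio.h>

    /* Invented sizes: the LDM is split into 4 blocks of 2*3*4 doubles each. */
    #define NBLK 4
    #define L    2
    #define M    3
    #define K    4

    /* Flat view of the local data memory. */
    static double LDM[NBLK * L * M * K];

    /* Without template arrays: manual offset arithmetic into the flat LDM. */
    #define FLAT(blk, i, j, k)  LDM[((blk) * L + (i)) * M * K + (j) * K + (k)]

    /* With a template array: overlay a [block][i][j][k] view on the memory. */
    typedef double TA3[NBLK][L][M][K];
    #define TA (*(TA3 *)LDM)

    int main(void)
    {
        FLAT(1, 1, 2, 3) = 42.0;            /* write through the flat view  */
        printf("%g\n", TA[1][1][2][3]);     /* read the same element (42)   */
        return 0;
    }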

8-Core RP2 Chip Block Diagram
[Figure: the RP2 block diagram shown earlier — two 4-core clusters, per-core CPU/FPU, 16 KB I$ and D$, local memory (I: 8 KB, D: 32 KB), 64 KB user RAM, power control registers PCR0-PCR7, local clock pulse generators LCPG0/1, barrier synchronization lines, snoop controllers, and the on-chip system bus (SuperHyway) with DDR2, SRAM, and DMA controllers]
(LCPG: local clock pulse generator; PCR: power control register; CCN/BAR: cache controller / barrier register; URAM: user RAM, i.e., distributed shared memory.)

Speedups by Local Memory Management Compared with Using Shared Memory on Benchmark Applications on RP2
• 20.12 times speedup for 8-core execution using local memory against sequential execution using the off-chip shared memory of RP2 for the AAC encoder.

Software Coherence Control Method in the OSCAR Parallelizing Compiler

• Coarse grain task parallelization with earliest executable condition analysis (control and data dependence analysis to detect parallelism among coarse grain tasks); the MTG is generated by this analysis.
• The OSCAR compiler automatically controls coherence, without coherence hardware, using the following simple program restructuring methods (see the sketch after this list):
  - To cope with stale data problems: data synchronization inserted by the compiler.
  - To cope with the false sharing problem: data alignment, array padding, and non-cacheable buffers.
This provides automatic software coherence control for manycores.

Performance of Software Coherence Control by the OSCAR Compiler on the 8-core RP2
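A generic C illustration of the false-sharing counter-measures named above — this is my own example, not OSCAR output, and it assumes a 64-byte cache line on the target:

    #include <stdio.h>

    #define CACHE_LINE 64          /* assumed line size of the target cache */
    #define NCORES     8

    /* Without padding, per-core partial sums share cache lines, so writes
     * by different cores hit the same line: false sharing (and, without
     * coherence hardware, possible stale/corrupted lines).                 */
    struct unpadded { double sum; };   /* shown only for contrast            */

    /* Padding plus alignment gives each core's datum its own cache line,
     * so each line is written by exactly one core.                         */
    struct padded {
        double sum;
        char   pad[CACHE_LINE - sizeof(double)];
    };

    static _Alignas(CACHE_LINE) struct padded partial[NCORES];

    int main(void)
    {
        for (int c = 0; c < NCORES; c++)   /* stand-in for per-core work     */
            partial[c].sum = c * 1.0;

        double total = 0.0;
        for (int c = 0; c < NCORES; c++)
            total += partial[c].sum;

        printf("total = %g, sizeof(struct padded) = %zu\n",
               total, sizeof(struct padded));
        return 0;
    }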

[Figure: speedups on 1, 2, 4, and 8 cores of RP2 for equake, art, lbm, hmmer (SPEC2000/SPEC2006), cg, mg, bt, lu, sp (NPB), and an MPEG2 encoder (MediaBench), comparing SMP (hardware coherence) with NCC (software coherence by the compiler); the software-coherence versions reach speedups comparable to the hardware-coherent SMP versions, up to about 5.66 at 8 cores]

Future Multicore Products

• Next generation automobiles — safer, more comfortable, energy efficient, environmentally friendly: cameras, radar, car-to-car communication, and internet information integrated with brake, steering, engine, and motor control.
• Smartphones — from everyday recharging to recharging less than once a week.
• Advanced medical systems — cancer treatment, drinkable inner cameras; emergency solar-powered operation; no cooling fan and no dust, so usable even inside an operating room; helping keep people healthy.
• Personal / regional supercomputers — solar powered, with more than 100 times higher power efficiency (FLOPS/W); regional disaster simulators saving lives from tornadoes, localized heavy rain, fires, and earthquakes; solar-powered operation in emergency conditions.

Summary
• To get speedup and power reduction on homogeneous and heterogeneous multicore systems, collaboration of architecture and compiler will become increasingly important.
• The automatic parallelizing and power-reducing compiler has succeeded in speeding up and/or reducing the power of scientific applications including earthquake wave propagation simulation, medical applications including cancer treatment using carbon ions and a drinkable inner camera, and industrial applications including automobile engine control and wireless-communication baseband processing, on various multicores.
• For example, automatic parallelization gave 110 times speedup for the earthquake wave propagation simulation on 128 cores of IBM Power7 against 1 core; 327 times speedup for heavy-particle radiotherapy cancer treatment on a 144-core Hitachi blade server using Intel Xeon E7-8890; 1.95 times speedup for automobile engine control on 2 Renesas cores (SH4A or V850); and 55 times speedup for JPEG-XR encoding for capsule inner cameras on the 64-core Tilera TILEPro64 manycore.
• In automatic power reduction, the power consumed by real-time multimedia applications such as human face detection, H.264, MPEG2, and optical flow was reduced to 1/2 or 1/3 using 3 cores of ARM Cortex-A9 and Intel Haswell, and to 1/4 using 8 Renesas SH4A cores, against ordinary single-core execution.
• For further speedup and power reduction, we have been developing a new architecture/compiler co-designed multicore with a vector accelerator based on vector pipelining with vector registers, chaining, a load/store pipeline, and an advanced DMA controller, without requiring modification of the CPU instruction set.