OSCAR SCM Architecture for Multigrain Parallel Processing
Software and Hardware for High Performance and Low Power Homogeneous and Heterogeneous Multicore Systems
Hironori Kasahara
Professor, Dept. of Computer Science & Engineering
Director, Advanced Multicore Processor Research Institute
Waseda University, Tokyo, Japan
IEEE Computer Society President Elect 2017
URL: http://www.kasahara.cs.waseda.ac.jp/

Multicores for Performance and Low Power
Power consumption is one of the biggest problems for performance scaling, from smartphones to cloud servers and supercomputers (the "K" computer draws more than 10 MW).
Power ∝ Frequency × Voltage², and Voltage ∝ Frequency, so Power ∝ Frequency³.
If frequency is reduced to 1/4 (e.g., 4 GHz → 1 GHz), power is reduced to 1/64 while performance falls only to 1/4.
If 8 such cores are then integrated on a chip, power is still 1/8 of the original and performance becomes 2 times.
[Figure: RP2 chip layout — Core#0–Core#7, ILRAM, DLRAM, URAM, I$, D$, SNC0/SNC1, LBSC, DBSC, CSM, GCPG, VSWC, SHWY, DDRPAD]
IEEE ISSCC08, Paper No. 4.5: M. Ito, …, and H. Kasahara, "An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler"

Parallel software is important for scalable performance of multicores: just adding more cores does not give us speedup. The development cost and period of parallel software are becoming a bottleneck in the development of embedded systems, e.g., IoT and automobiles.
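The frequency-voltage scaling arithmetic above can be illustrated with a short Python sketch (the function names are ours, for illustration only; ideal parallel speedup is assumed, as on the slide):

```python
# Illustration of the scaling rule above: Power ∝ f × V², with V ∝ f,
# so a single core's power scales as f³ of its relative frequency.

def relative_power(freq_scale):
    """Relative power of one core at freq_scale of the original frequency."""
    return freq_scale ** 3

def chip(num_cores, freq_scale):
    """(relative power, relative performance) of num_cores such cores,
    assuming ideal parallel speedup."""
    return num_cores * relative_power(freq_scale), num_cores * freq_scale

print(chip(1, 0.25))  # (0.015625, 0.25): 1/64 power, 1/4 performance
print(chip(8, 0.25))  # (0.125, 2.0): 1/8 power, 2x performance
```

This is why a slower, wider multicore can beat a fast single core on both axes: the cubic drop in per-core power leaves headroom to add cores faster than performance is lost.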
Earthquake Wave Propagation Simulation (GMS)
GMS, developed by the National Research Institute for Earth Science and Disaster Resilience (NIED), on a Fujitsu M9000 SPARC multicore server:
• The automatic parallelizing compiler available on the market gave no speedup over 1-core execution on 64 cores; with 128 cores, execution was slower than with 1 core (0.9 times speedup).
• The advanced OSCAR parallelizing compiler gave 211 times speedup with 128 cores against the 1-core execution time with the commercial compiler.
• OSCAR also gave 2.1 times speedup on 1 core against the commercial compiler through global cache optimization.

Trend of Peak Performance of Supercomputers
(The US, China, Europe, and Japan plan ExaFLOPS machines for 2020–22.)
• Aurora, 2018, 180 PFLOPS, 13 MW, Argonne National Lab., Intel & Cray
• Sunway TaihuLight, 2016.06, 93 PFLOPS, 15.4 MW
• Tianhe-2, 2013.06, 55 PFLOPS, 17.8 MW
• Titan, 2012.11, 27 PFLOPS, 8.2 MW
• Sequoia, 2012.06, 20 PFLOPS, 7.9 MW
• K computer, 2011.6 & 11, 11 PFLOPS, 11.3 MW

Power Reduction of MPEG2 Decoding to 1/4 on the 8-Core Homogeneous Multicore RP-2 by the OSCAR Parallelizing Compiler
MPEG2 decoding with 8 CPU cores:
• Without power control (voltage fixed at 1.4 V): average power 5.73 W
• With power control (frequency scaling, resume standby with power shutdown, voltage lowering 1.4 V → 1.0 V): average power 1.52 W
→ 73.5% power reduction.

Compiler Co-designed Multicore RP2
[Figure: RP2 block diagram — Cluster #0 (Core#0–#3) and Cluster #1 (Core#4–#7) with barrier synchronization lines]
[Figure: per-core detail — each CPU core has an FPU, I$ 16K, D$ 16K, local memory (I:8K, D:32K), user RAM (URAM) 64K, and a cache/barrier controller (CCN/BAR); power control registers PCR0–PCR7 and local clock pulse generators LCPG0/LCPG1; snoop controllers 0/1; on-chip system bus (SuperHyway); DDR2, SRAM, and DMA controllers]
LCPG: Local clock pulse generator
PCR: Power Control Register
CCN/BAR: Cache controller / Barrier Register
URAM: User RAM (Distributed Shared Memory)

Renesas–Hitachi–Waseda Low-Power 8-Core RP2
Developed in 2007 in a METI/NEDO project.
Process technology: 90 nm, 8-layer, triple-Vth CMOS
Chip size: 104.8 mm² (10.61 mm × 9.88 mm)
CPU core size: 6.6 mm² (3.36 mm × 1.96 mm)
Supply voltage: 1.0 V–1.4 V (internal), 1.8/3.3 V (I/O)
Power domains: 17 (8 CPUs, 8 URAMs, common)
IEEE ISSCC08, Paper No. 4.5: M. Ito, …, and H.
Kasahara, "An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler"

Industry–Government–Academia Collaboration in R&D
Target practical applications that protect lives and the environment for a smart life. Waseda University's R&D spans: green many-core system technologies with ultra-low power consumption; the OSCAR many-core chip and the OSCAR many-core compiler and API; on-board vehicle technology (navigation systems, integrated controllers, infrastructure coordination); super real-time disaster simulation (tectonic shifts, tsunami, tornado, flood, fire spreading) on supercomputers; consumer electronics (Internet TV/DVD, camcorders, cameras, capsule endoscope cameras, smartphones, intelligent home appliances); green cloud servers, cool desktop servers, stock-trading and medical servers; solar-powered, fanless, cool, quiet servers operated and recharged by solar cells; OSCAR robots; and, with the National Institute of Radiological Sciences, heavy-particle radiation treatment planning and cerebral-infarction analysis.

Cancer Treatment: Carbon Ion Radiotherapy
With the National Institute of Radiological Sciences (NIRS); the previous best was 2.5 times speedup on 16 processors with hand optimization.
• 8.9 times speedup with 12 processors: Intel Xeon X5670 2.93 GHz 12-core SMP (Hitachi HA8000)
• 55 times speedup with 64 processors: IBM Power7 64-core SMP (Hitachi SR16000)

OSCAR Parallelizing Compiler
Goal: improve effective performance, cost-performance, and software productivity, and reduce power.
• Multigrain parallelization: coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism.
[Figure: compiler-generated schedule on CPU0–CPU3 and DRP0, with a CORE row and a DTU row per processor — macro-tasks MT1-1…MT3-8 with LOAD/SEND/STORE data transfers and OFF (power-off) intervals along the time axis]
• Data localization: automatic data management for distributed shared memory, cache,
and local memory.
• Data transfer overlapping: overlapping data transfers with computation using Data Transfer Controllers (DMAs).
• Power reduction: reduction of consumed power by compiler-controlled DVFS and power gating, with hardware support.

Performance of OSCAR Compiler on IBM p6 595 Power6 (4.2 GHz) Based 32-Core SMP Server
AIX Ver. 12.1. Compile options:
(*1) Sequential: -O3 -qarch=pwr6; XLF: -O3 -qarch=pwr6 -qsmp=auto; OSCAR: -O3 -qarch=pwr6 -qsmp=noauto
(*2) Sequential: -O5 -q64 -qarch=pwr6; XLF: -O5 -q64 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -q64 -qarch=pwr6 -qsmp=noauto
(Others) Sequential: -O5 -qarch=pwr6; XLF: -O5 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -qarch=pwr6 -qsmp=noauto

Generation of Coarse-Grain Tasks
The program is decomposed into macro-tasks (MTs):
• Block of Pseudo Assignments (BPA): basic block (BB)
• Repetition Block (RB): natural loop
• Subroutine Block (SB): subroutine
Decomposition is hierarchical: coarse-grain parallelization among BPAs, RBs, and SBs at the 1st layer; loop-level parallelization and further coarse-grain parallelization inside RBs and SBs at the 2nd layer; near-fine-grain parallelization of loop bodies and statements at the 3rd layer.

Earliest Executable Condition Analysis for Coarse-Grain Tasks (Macro-Tasks)
[Figure: a macro flow graph and the macro task graph derived from it]

Priority Determination in the Dynamic CP Method
The priority of a macro-task is the expected longest-path length weighted by branch probabilities, e.g., 60 × 0.80 + 100 × 0.20 = 68.
V1.0 © 2014, IEEE. All rights reserved.

Earliest Executable Conditions (EEC) combine control dependences and data dependences: control dependences show when execution of an MT is decided, and data dependences show when the data accessed by an MT are ready. For the example macro task graph:
• MT2 may start execution after MT1 branches to MT2 and MT1 finishes execution.
• MT3 may start execution after MT1 branches to MT3.
• MT6 may start execution after MT3 finishes execution or MT2 branches to MT4.
Automatic Processor Assignment in 103.su2cor
• Using 14 processors; coarse-grain parallelization within loop DO400.
MTG of su2cor LOOPS-DO400: coarse-grain parallelism PARA_ALD = 4.3 (a mix of DOALL loops, sequential loops, SBs, and BBs).

Data-Localization: Loop Aligned Decomposition
• Decompose multiple loops (Doall and Doseq) into CARs and LRs considering inter-loop data dependence.
  – Most data in an LR can be passed through local memory (LM).
  – LR: Localizable Region, CAR: Commonly Accessed Region

C     RB1 (Doall)
      DO I=1,101
        A(I)=2*I
      ENDDO
C     RB2 (Doseq)
      DO I=1,100
        B(I)=B(I-1)+A(I)+A(I+1)
      ENDDO
C     RB3 (Doall)
      DO I=2,100
        C(I)=B(I)+B(I-1)
      ENDDO

Aligned decomposition:
RB1: I=1–33 (LR), I=34–35 (CAR), I=36–66 (LR), I=67–68 (CAR), I=69–101 (LR)
RB2: I=1–33, I=34, I=35–66, I=67, I=68–100
RB3: I=2–34, I=35–67, I=68–100

Inter-Loop Data Dependence Analysis in TLG
• Define the exit-RB in the TLG as the Standard-Loop.
• Find the iterations on which an iteration of the Standard-Loop is data dependent
  [Figure: iteration K of the Standard-Loop against iterations K−1, K, K+1 of I(RB1)]
  – e.g.
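The Loop Aligned Decomposition above can be sanity-checked with a short Python sketch (a plain-Python stand-in for the Fortran loops; the chunk boundaries are the LR/CAR ranges from the slide):

```python
# Stand-in for the Fortran loops RB1-RB3 above, 1-based indexing.

def run_original():
    A = [0.0] * 103          # A(1..101) used
    B = [0.0] * 102
    C = [0.0] * 102
    for i in range(1, 102):  # RB1 (Doall): DO I=1,101
        A[i] = 2 * i
    for i in range(1, 101):  # RB2 (Doseq): DO I=1,100
        B[i] = B[i - 1] + A[i] + A[i + 1]
    for i in range(2, 101):  # RB3 (Doall): DO I=2,100
        C[i] = B[i] + B[i - 1]
    return C

def run_decomposed():
    A = [0.0] * 103
    B = [0.0] * 102
    C = [0.0] * 102
    # RB1 chunks: LR 1-33, CAR 34-35, LR 36-66, CAR 67-68, LR 69-101
    for lo, hi in [(1, 33), (34, 35), (36, 66), (67, 68), (69, 101)]:
        for i in range(lo, hi + 1):
            A[i] = 2 * i
    # RB2 chunks (sequential order preserved): 1-33, 34, 35-66, 67, 68-100
    for lo, hi in [(1, 33), (34, 34), (35, 66), (67, 67), (68, 100)]:
        for i in range(lo, hi + 1):
            B[i] = B[i - 1] + A[i] + A[i + 1]
    # RB3 chunks: 2-34, 35-67, 68-100
    for lo, hi in [(2, 34), (35, 67), (68, 100)]:
        for i in range(lo, hi + 1):
            C[i] = B[i] + B[i - 1]
    return C

print(run_original() == run_decomposed())  # True
```

The CAR boundaries sit where they do because B(I) in RB2 reads A(I+1) and C(I) in RB3 reads B(I−1): within each LR the data an iteration needs is produced in the same region, so it can stay in one processor's local memory, while only the narrow CARs must be shared across processors.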