Software and Hardware for High Performance and Low Power Homogeneous and Heterogeneous Multicore Systems

Hironori Kasahara
Professor, Dept. of Computer Science & Engineering
Director, Advanced Multicore Processor Research Institute
Waseda University, Tokyo, Japan
IEEE Computer Society President Elect 2017, President 2018

1980 BS, 1982 MS, 1985 Ph.D., Dept. EE, Waseda Univ.
1985 Visiting Scholar: U. of California, Berkeley
1986 Assistant Prof., 1988 Associate Prof., 1997 Prof., Dept. of EECE, Waseda Univ.; now Dept. of Computer Sci. & Eng.
1989-90 Research Scholar: U. of Illinois, Urbana-Champaign, Center for Supercomputing R&D

Awards: 1987 IFAC World Congress Young Author Prize; 1997 IPSJ Sakai Special Research Award; 2005 STARC Academia-Industry Research Award; 2008 LSI of the Year Second Prize; 2008 Intel Asia Academic Forum Best Research Award; 2010 IEEE CS Golden Core Member Award; 2014 Minister of Edu., Sci. & Tech. Research Prize; 2015 IPSJ Fellow; 2017 IEEE Fellow

Publications and patents: 214 reviewed papers, 145 invited talks; 59 unexamined patent applications (Japan, US, GB, China), 30 granted patents; 572 articles in newspapers, web news, and media incl. TV; 245 committees in societies and government.

Committees: IEEE Computer Society President 2018, BoG (2009-14), Multicore STC Chair (2012-), Japan Chair (2005-07); IPSJ Chair: HG for Mag. & J. Edit, SIG on ARC.
【METI/NEDO】 Project Leader: Multicore for Consumer Electronics, Advanced Parallelizing Compiler; Chair: Computer Strategy Committee.
【Cabinet Office】 CSTP Supercomputer Strategic ICT PT, Japan Prize Selection Committees, etc.
【MEXT】 Info. Sci. & Tech. Committee; Supercomputer (Earth Simulator, HPCI Promotion, Next Gen. Supercomputer K) Committees, etc.

IEEE Computer Society BoG (Board of Governors), Feb. 1, 2017

Multicores for Performance and Low Power
Power consumption is one of the biggest problems for performance scaling, from smartphones to cloud servers and supercomputers (the "K" computer consumes more than 10 MW).
Power ∝ Frequency × Voltage^2, and Voltage ∝ Frequency, so Power ∝ Frequency^3.
• If the frequency is reduced to 1/4 (e.g., 4 GHz → 1 GHz), power is reduced to 1/64, but performance also falls to 1/4.
• If 8 cores are integrated on a chip and run at the lowered frequency, power is still 1/8 of the original single fast core while performance becomes 2 times. (The arithmetic is spelled out below.)
IEEE ISSCC08, Paper No. 4.5, M. Ito, …, and H. Kasahara, "An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler"
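The arithmetic behind these numbers, written out (a standard CMOS dynamic-power sketch consistent with the slide, assuming voltage scales linearly with frequency):

\[
P \propto f V^2,\qquad V \propto f \;\Rightarrow\; P \propto f^3
\]
\[
f \to \tfrac{f}{4}:\quad P \to \left(\tfrac{1}{4}\right)^3 P = \tfrac{P}{64},\qquad \text{performance} \to \tfrac{1}{4}
\]
\[
\text{8 cores at } \tfrac{f}{4}:\quad P_{\mathrm{chip}} \approx 8\cdot\tfrac{P}{64} = \tfrac{P}{8},\qquad \text{performance} \approx 8\cdot\tfrac{1}{4} = 2\times
\]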

Parallel Software is Important for Scalable Performance of Multicores
• Just adding more cores does not give speedup by itself.
• The development cost and time of parallel software are becoming a bottleneck in the development of embedded systems, e.g., IoT devices and automobiles.

Example: the GMS earthquake wave propagation simulation, developed by the National Research Institute for Earth Science and Disaster Resilience (NIED), on a Fujitsu M9000 SPARC multicore server:
• The automatic parallelizing compiler available on the market gave no speedup over 1-core execution even on 64 cores; with 128 cores, execution was slower than on 1 core (0.9 times speedup).
• The OSCAR parallelizing compiler gave 211 times speedup with 128 cores against 1-core execution with the commercial compiler.
• On 1 core, the OSCAR compiler gave 2.1 times speedup against the commercial compiler by global cache optimization.

Trend of Peak Performance of Supercomputers
(The US, China, Europe, and Japan plan ExaFLOPS systems for 2020-22.)
• Aurora, 2018, 180 PFLOPS, 13 MW, Argonne National Lab., Intel & Cray
• Sunway TaihuLight, 2016.06, 93 PFLOPS, 15.4 MW
• Tianhe-2, 2013.06, 55 PFLOPS, 17.8 MW
• Titan, 2012.11, 27 PFLOPS, 8.2 MW
• Sequoia, 2012.06, 20 PFLOPS, 7.9 MW
• K computer, 2011.6 & 11, 11 PFLOPS, 11.3 MW


Power Reduction of MPEG2 Decoding to 1/4 on the 8-Core Homogeneous Multicore RP-2 by the OSCAR Parallelizing Compiler
MPEG2 decoding with 8 CPU cores:
• Without power control (voltage fixed at 1.4 V): average power 5.73 W.
• With power control (frequency and voltage control 1.4 V-1.0 V, resume standby with power shutdown and voltage lowering): average power 1.52 W.
• 73.5% power reduction.

The 4-core multicore RP1 (2007), the 8-core multicore RP2 (2008), and the 15-core heterogeneous multicore RPX (2010) were developed in METI/NEDO projects with Hitachi and Renesas.

Compiler Co-designed Multicore RP2

[Figure: RP2 block diagram — two 4-core clusters (Core#0-#3, Core#4-#7); each core has a CPU, FPU, 16 KB I$, 16 KB D$, local memory (I: 8 KB, D: 32 KB), 64 KB user RAM (URAM), and a power control register (PCR0-PCR7); per-cluster local clock pulse generators (LCPG0/1) and snoop controllers; barrier synchronization lines; on-chip system bus (SuperHyway) with DDR2, SRAM, and DMA controllers]
LCPG: Local clock pulse generator
PCR: Power Control Register
CCN/BAR: Cache controller / Barrier Register
URAM: User RAM (Distributed Shared Memory)

Renesas-Hitachi-Waseda Low Power 8-Core RP2, developed in 2007 in the METI/NEDO project

Process Technology: 90 nm, 8-layer, triple-Vth CMOS
Chip Size: 104.8 mm^2 (10.61 mm x 9.88 mm)
CPU Core Size: 6.6 mm^2 (3.36 mm x 1.96 mm)
Supply Voltage: 1.0-1.4 V (internal), 1.8/3.3 V (I/O)
Power Domains: 17 (8 CPUs, 8 URAMs, common)
[Figure: RP2 chip micrograph with Core#0-#7, I$/D$, ILRAM/DLRAM, URAM, SNC0/SNC1, LBSC, SHWY, VSWC, DBSC, CSM, GCPG, DDRPAD]
IEEE ISSCC08, Paper No. 4.5, M. Ito, …, and H. Kasahara, "An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler"

Industry-Government-Academia Collaboration and Target Applications in the Green Computing R&D Center
Goals: protect lives, protect the environment, support smart life.
Waseda University R&D: many-core system technologies with ultra-low power consumption — the OSCAR many-core chip and the OSCAR many-core compiler and API.
Target applications:
• Green supercomputers and cloud servers: super real-time disaster simulation (tectonic shifts, tsunami, tornado, flood, fire spreading), stock trading, solar-powered medical servers, and cool, fan-less, quiet desktop servers.
• On-board vehicle systems: navigation systems, integrated controllers, infrastructure coordination.
• Consumer electronics: Internet TV/DVD, camcorders, cameras, smartphones, intelligent home appliances.
• Medical systems with the National Institute of Radiological Sciences: capsule inner cameras, heavy particle radiation treatment planning, cerebral infarction analysis.
• Robots.
Servers can be operated and recharged by solar cells; non-fan, cool, quiet servers are designed for solar-powered operation.

Cancer Treatment: Carbon Ion Radiotherapy
National Institute of Radiological Sciences (NIRS). The previous best was 2.5 times speedup on 16 processors with hand optimization.
• 8.9 times speedup with 12 processors on an Intel Xeon X5670 2.93 GHz 12-core SMP (Hitachi HA8000).
• 55 times speedup with 64 processors on an IBM Power7 64-core SMP (Hitachi SR16000).

OSCAR Parallelizing Compiler
Goal: improve effective performance, cost-performance, and software productivity, and reduce power.
• Multigrain Parallelization: coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism.
• Data Localization: automatic data management for distributed shared memory, cache, and local memory.

• Data Transfer Overlapping: data transfers are overlapped with computation using data transfer controllers (DMAs), based on data localization groups.
• Power Reduction: reduction of consumed power by compiler-controlled DVFS and power gating with hardware support.
[Figure: a macro-task graph partitioned into data localization groups dlg0-dlg3]

Performance of the OSCAR Compiler on an IBM p6 595 Power6 (4.2 GHz) based 32-core SMP Server

AIX Ver. 12.1
Compile options:
(*1) Sequential: -O3 -qarch=pwr6; XLF: -O3 -qarch=pwr6 -qsmp=auto; OSCAR: -O3 -qarch=pwr6 -qsmp=noauto
(*2) Sequential: -O5 -q64 -qarch=pwr6; XLF: -O5 -q64 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -q64 -qarch=pwr6 -qsmp=noauto
(Others) Sequential: -O5 -qarch=pwr6; XLF: -O5 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -qarch=pwr6 -qsmp=noauto

Generation of Coarse Grain Tasks
A program is decomposed into macro-tasks (MTs):
• Block of Pseudo Assignments (BPA): basic block (BB)
• Repetition Block (RB): natural loop
• Subroutine Block (SB): subroutine
[Figure: hierarchical decomposition — the program is split into BPAs, RBs, and SBs for coarse-grain parallelization at the 1st layer; loop bodies and subroutines are further decomposed into BPAs, RBs, and SBs at the 2nd and 3rd layers for loop-level, near-fine-grain, and nested coarse-grain parallelization]
(A rough C illustration of the three macro-task kinds follows.)
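The fragment below is only an illustration (not OSCAR output; the function and variable names are invented) of how ordinary C program structure maps onto the three macro-task kinds:

    /* Rough C illustration of macro-task kinds (names are hypothetical). */
    #include <stdio.h>
    #include <stddef.h>

    static void filter(double *y, size_t n)    /* called as an SB below */
    {
        for (size_t i = 0; i < n; i++)
            y[i] *= 0.5;
    }

    static double process(double *x, double *y, size_t n)
    {
        double sum = 0.0;                 /* BPA: straight-line basic block */
        double scale = 2.0;

        for (size_t i = 0; i < n; i++) {  /* RB: natural loop; its body can  */
            y[i] = scale * x[i];          /* be decomposed again at a lower  */
            sum += y[i];                  /* layer (near fine grain)         */
        }

        filter(y, n);                     /* SB: subroutine block            */

        return sum;                       /* BPA                             */
    }

    int main(void)
    {
        double x[4] = {1, 2, 3, 4}, y[4];
        printf("%g\n", process(x, y, 4));
        return 0;
    }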

Earliest Executable Condition Analysis for Coarse Grain Tasks (Macro-tasks)
The compiler builds a Macro Flow Graph (MFG) representing data dependences, control flow, and conditional branches among macro-tasks (BPAs and RBs), and derives from it a Macro Task Graph (MTG) whose edges represent data dependences and extended control dependences, combined with AND/OR conditions, with the original control flow overlaid.
[Figure: a Macro Flow Graph of 14 macro-tasks and the Macro Task Graph derived from it — solid edges: data dependency; dotted edges: extended control dependency; arcs: AND/OR combination of conditions; BPA: Block of Pseudo Assignment Statements; RB: Repetition Block]

Priority Determination in the Dynamic CP Method
The priority of a macro-task is its expected longest-path length to the exit, weighting each conditional branch path by its branch probability; for example, 60 * 0.80 + 100 * 0.20 = 68.

Earliest Executable Conditions
An earliest executable condition (EEC) combines control dependence and data dependence: control dependences determine when the execution of an MT is decided, and data dependences determine when the data accessed by an MT are ready.
• MT2 may start execution after MT1 branches to MT2 and MT1 finishes execution.
• MT3 may start execution after MT1 branches to MT3.
• MT6 may start execution after MT3 finishes execution or MT2 branches to MT4.
(A minimal run-time sketch of these checks follows.)
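The following C sketch shows how such conditions could be evaluated by a dynamic scheduler; the flag names and the state structure are invented for illustration (OSCAR normally resolves these conditions statically or with generated scheduling code):

    #include <stdbool.h>

    /* Invented run-time state for the example MTs on this slide. */
    typedef struct {
        bool mt1_finished, mt1_branched_to_mt2, mt1_branched_to_mt3;
        bool mt2_branched_to_mt4, mt3_finished;
    } MTState;

    /* EEC of MT2: MT1 branched to MT2 AND MT1 finished. */
    static bool mt2_ready(const MTState *s)
    {
        return s->mt1_branched_to_mt2 && s->mt1_finished;
    }

    /* EEC of MT3: MT1 branched to MT3 (control dependence only). */
    static bool mt3_ready(const MTState *s)
    {
        return s->mt1_branched_to_mt3;
    }

    /* EEC of MT6: MT3 finished OR MT2 branched to MT4 (OR of conditions). */
    static bool mt6_ready(const MTState *s)
    {
        return s->mt3_finished || s->mt2_branched_to_mt4;
    }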

Automatic Processor Assignment in 103.su2cor
• Using 14 processors: coarse grain parallelization within loop DO400.

MTG of su2cor LOOPS DO400
• Coarse grain parallelism PARA_ALD = 4.3
[Figure: the macro-task graph of DO400 — node types: DOALL, sequential loop, SB, BB]

Data-Localization: Loop Aligned Decomposition
• Decompose multiple loops (Doall and sequential) into CARs and LRs considering inter-loop data dependences.
  - Most data in an LR can be passed through local memory (LM).
  - LR: Localizable Region, CAR: Commonly Accessed Region

    C     RB1 (Doall)
          DO I=1,101
            A(I)=2*I
          ENDDO
    C     RB2 (Doseq)
          DO I=1,100
            B(I)=B(I-1)+A(I)+A(I+1)
          ENDDO
    C     RB3 (Doall)
          DO I=2,100
            C(I)=B(I)+B(I-1)
          ENDDO

Aligned decomposition (LR / CAR / LR / CAR / LR):
  RB1: DO I=1,33 | DO I=34,35 | DO I=36,66 | DO I=67,68 | DO I=69,101
  RB2: DO I=1,33 | DO I=34,34 | DO I=35,66 | DO I=67,67 | DO I=68,100
  RB3: DO I=2,34 |            | DO I=35,67 |            | DO I=68,100

Inter-loop Data Dependence Analysis in a TLG (Target Loop Group)
• Define the exit RB of the TLG as the standard loop.
• Find the iterations of the earlier RBs on which an iteration of the standard loop is data dependent.
  - e.g., the K-th iteration of RB3 is data dependent on the (K-1)-th and K-th iterations of RB2 and, indirectly, on the (K-1)-th, K-th, and (K+1)-th iterations of RB1.
(Example TLG: the loops RB1, RB2, and RB3 shown above.)

Decomposition of RBs in a TLG
• Decompose the GCIR into DGCIR^p (1 ≤ p ≤ n)
  - n: (a multiple of) the number of PCs; DGCIR: decomposed GCIR
• Generate a CAR over the iterations on which both DGCIR^p and DGCIR^(p+1) are data dependent.
• Generate an LR over the iterations on which only DGCIR^p is data dependent.

For the example above with n = 3:
  RB1: 1-33 | 34-35 | 36-66 | 67-68 | 69-101
  RB2: 1-33 | 34    | 35-66 | 67    | 68-100
  RB3: 2-34 |       | 35-67 |       | 68-100
  DGCIR^1 = 2-34, DGCIR^2 = 35-67, DGCIR^3 = 68-100

Data Localization
[Figure: an MTG, the MTG after division, and a schedule for two processors (PE0, PE1); macro-tasks belonging to the same data localization group (dlg0-dlg3) are scheduled to the same processor so that their data stay in local memory]

An Example of Data Localization for SPEC95 Swim

          DO 200 J=1,N
          DO 200 I=1,M
            UNEW(I+1,J) = UOLD(I+1,J)+
         1    TDTS8*(Z(I+1,J+1)+Z(I+1,J))*(CV(I+1,J+1)+CV(I,J+1)+CV(I,J)
         2    +CV(I+1,J))-TDTSDX*(H(I+1,J)-H(I,J))
            VNEW(I,J+1) = VOLD(I,J+1)-TDTS8*(Z(I+1,J+1)+Z(I,J+1))
         1    *(CU(I+1,J+1)+CU(I,J+1)+CU(I,J)+CU(I+1,J))
         2    -TDTSDY*(H(I,J+1)-H(I,J))
            PNEW(I,J) = POLD(I,J)-TDTSDX*(CU(I+1,J)-CU(I,J))
         1    -TDTSDY*(CV(I,J+1)-CV(I,J))
      200 CONTINUE
          DO 210 J=1,N
            UNEW(1,J) = UNEW(M+1,J)
            VNEW(M+1,J+1) = VNEW(1,J+1)
            PNEW(M+1,J) = PNEW(1,J)
      210 CONTINUE
          DO 300 J=1,N
          DO 300 I=1,M
            UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
            VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
            POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
      300 CONTINUE

(a) An example of a target loop group for data localization.
(b) Image of the alignment on a 4 MB cache of the arrays accessed by the target loops (UN, VN, PN, UO, VO, PO, CU, CV, Z, H, U, V, P): cache line conflicts occur among arrays that share the same location on the cache.

Data Layout for Removing Line Conflict Misses by Array Dimension Padding
Declaration part of the arrays in SPEC95 swim:

Before padding:
          PARAMETER (N1=513, N2=513)
          COMMON U(N1,N2), V(N1,N2), P(N1,N2),
         *       UNEW(N1,N2), VNEW(N1,N2),
         1       PNEW(N1,N2), UOLD(N1,N2),
         *       VOLD(N1,N2), POLD(N1,N2),
         2       CU(N1,N2), CV(N1,N2),
         *       Z(N1,N2), H(N1,N2)

After padding:
          PARAMETER (N1=513, N2=544)
          COMMON U(N1,N2), V(N1,N2), P(N1,N2),
         *       UNEW(N1,N2), VNEW(N1,N2),
         1       PNEW(N1,N2), UOLD(N1,N2),
         *       VOLD(N1,N2), POLD(N1,N2),
         2       CU(N1,N2), CV(N1,N2),
         *       Z(N1,N2), H(N1,N2)

[Figure: array layout on the 4 MB cache before and after padding; boxes show the access range of DLG0; padding shifts the arrays so they no longer conflict on the same cache lines]

Statement Level Near Fine Grain Tasks
Each statement becomes a near-fine-grain task, e.g.:
 1) u12 = a12/l11
 2) u24 = a24/l22
 3) u34 = a34/l33
 4) l54 = -l52 * u24
 5) u45 = a45/l44
 6) l55 = u55 - l54 * u45
 7) y1 = b1 / l11
 8) y2 = b2 / l22
 9) b5 = b5 - l52 * y2
10) y3 = b3 / l33
11) y4 = b4 / l44
12) b5 = b5 - l54 * y4
13) y5 = b5 / l55
14) x4 = y4 - u45 * y5
15) x3 = y3 - u34 * x4
16) x2 = y2 - u24 * x4
17) x1 = y1 - u12 * x2
[Figure: the task graph of these statements, with a processing time per task and a data transfer time tij on each edge: tij = 0 if Ti and Tj are on the same PE, tij = 9 if they are on different PEs]

Task Graph for FPPPP
[Figure: the statement-level task graph of SPEC FPPPP, showing abundant near-fine-grain parallelism]

Elimination of Redundant Synchronization for Shared Data on Centralized Shared Memory after Static Task Scheduling
[Figure: tasks A-E statically scheduled onto PE1-PE3; FS = flag set, FC = flag check; flag sets and checks whose precedence relations are already guaranteed by the static schedule are unnecessary and are eliminated]

Generated Parallel Machine Code for Near Fine Grain Parallel Processing

PE1:
    ; ---- Task A ----
    ; Task Body
    FADD R23, R19, R21
    ; ---- Task B ----
    ; Task Body
    FMLT R27, R28, R29
    FSUB R29, R19, R27
    ; Data Transfer
    STR  [R14, 1], R29
    ; Flag Set
    STR  [R14, 0], R0

PE2:
    ; ---- Task C ----
    ; Task Body
    ; Flag Check : Task C
L18:
    LDR  R28, [R14, 0]
    CMP  R28, R0
    JNE  L18
    ; Data Receive
    LDR  R29, [R14, 1]
    ; Task Body
    FMLT R24, R23, R29

【W-CDMA Base Band Communication】 Near Fine Grain Parallel Processing of an EAICH Detection Program on the RP2 Multicore with 4 SH4A Cores
• The Hadamard transform is often used in this kind of signal processing.
• Parallel processing method: near-fine-grain parallel processing among statements with static scheduling.
• For reference, special-purpose hardware (250 MHz) detects EAICH in 1.74 μs.
• Result: 1.62 times speedup with 2 cores and 3.45 times speedup with 4 cores for EAICH on RP2.

Generated Multigrain Parallelized Code
The nested coarse grain task parallelization is realized using only the OpenMP "section", "flush", and "critical" directives, with centralized scheduling code for the 1st layer and distributed scheduling code for the 2nd layer.
[Figure: threads T0-T7 forked by one SECTIONS region; the 1st-layer macro-tasks MT1_1 to MT1_4 and their nested 2nd-layer macro-tasks are executed by two thread groups with SEND/RECV/SYNC operations]

Code Generation Using OpenMP
• The compiler generates a parallelized program using the OpenMP API.
• One-time, single-level thread generation:
  - Threads are forked only once, at the beginning of the program, by the OpenMP "PARALLEL SECTIONS" directive.
  - The forked threads join only once, at the end of the program.
• The compiler generates code for each thread using static or dynamic scheduling schemes.
• No extension of OpenMP for hierarchical processing is required; a minimal sketch of this code shape follows.
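The sketch below is hand-written to show the code shape only — it is not OSCAR compiler output; the mt* functions are placeholders, and the inter-group synchronization that real generated code performs with flush and flag variables is omitted:

    #include <stdio.h>

    /* Placeholder macro-tasks; in generated code these are the coarse-grain
     * tasks that the static scheduler assigned to each thread group.       */
    static void mt1_1(void) { puts("MT1_1"); }
    static void mt1_2(void) { puts("MT1_2"); }
    static void mt1_3(void) { puts("MT1_3"); }
    static void mt1_4(void) { puts("MT1_4"); }

    int main(void)
    {
        /* Threads are forked exactly once for the whole program ...        */
        #pragma omp parallel sections
        {
            #pragma omp section
            {                     /* thread group 0: its scheduled task list */
                mt1_1();
                mt1_3();
            }
            #pragma omp section
            {                     /* thread group 1: its scheduled task list */
                mt1_2();
                mt1_4();
            }
        }                         /* ... and join exactly once here          */
        return 0;
    }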

Multicore Program Development Using OSCAR API V2.0
A sequential application program in Fortran or C (consumer electronics, automobiles, medical, scientific computation, etc.) is processed as follows:
• The accelerator compiler or the user adds "hint" directives before a loop or a function to specify that it is executable by an accelerator and roughly how many clock cycles it takes (an illustrative hint is sketched below).
• The Waseda OSCAR parallelizing compiler performs coarse grain task parallelization, data localization, DMAC data transfer, and power reduction using DVFS and clock/power gating, and generates a parallelized Fortran or C program annotated with OSCAR API directives for thread generation, memory mapping, data transfer using DMA, and power management. Manual parallelization and low-power tuning with the API are also possible.
• The OSCAR API targets homogeneous and heterogeneous multicores and manycores as well as SMP servers: an API analyzer plus the existing sequential compiler from each vendor (Vendor A for homogeneous low-power multicores, Vendor B for heterogeneous multicores with accelerators, or an OpenMP compiler for shared-memory servers) generates the executable for each target.
• Executables run on multicores from Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, eSOL, CATS, Gaio, and three universities.
OSCAR: Optimally Scheduled Advanced Multiprocessor; API: Application Program Interface.
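As an illustration only — the pragma spelling below is hypothetical and is not the normative OSCAR API syntax — a hint placed before a function might look like this:

    /* Hypothetical hint directive (illustrative syntax only): tells the
     * compiler that fft_block() can run on an accelerator named FE and
     * takes about 12000 cycles there. fft_block() is an invented example. */
    #pragma oscar_hint accelerator_task(FE) cycle(12000)
    void fft_block(float *buf, int n);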

Parallel Processing of Face Detection on a Manycore, a High-end Server, and a PC Server
[Figure: speedup (速度向上率) vs. number of cores (コア数) for 1, 2, 4, 8, and 16 cores on TILEPro64 (gcc), SR16K (Power7, 8 cores x 4 CPUs x 4 nodes, xlc), and rs440 (Intel Xeon, 8 cores x 4 CPUs, icc); speedups at 16 cores reach 11.55, 10.92, and 9.30]
• The OSCAR compiler gives 11.55 times speedup for 16 cores against 1 core on the SR16000 Power7 high-end server.

Performance on a Multicore Server for the Latest Cancer Treatment Using Heavy Particles (Proton, Carbon Ion)
327.6 times speedup with 144 cores on a Hitachi 144-core SMP blade server BS500: Xeon E7-8890 v3 (2.5 GHz, 18 cores/chip) x 8 chips.
[Figure: speedup over 1-core GCC execution vs. number of cores (1, 32, 64, 144), reaching 327.6 with 144 cores]
• The original sequential execution time of 2948 sec (about 50 minutes) using GCC was reduced to 9 sec with 144 cores (327.6 times speedup).
• A reduction of treatment cost and of the reservation waiting period is expected.

110 Times Speedup against Sequential Processing for the GMS Earthquake Wave Propagation Simulation on a Hitachi SR16000 (Power7-based 128-core Linux SMP)
• Fortran, about 15 thousand lines.
• First-touch placement for the distributed shared memory and cache optimization over 39 loops are important for scalable speedup (a generic first-touch sketch follows).
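"First touch" refers to the common NUMA policy that a page is physically allocated on the node of the thread that first writes it. The C/OpenMP sketch below is generic (it is not the GMS code): data are initialized with the same loop decomposition that later computes on them, so each thread's pages end up local to it.

    #include <stdlib.h>

    #define N (1 << 22)

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);

        /* First touch: initialize in parallel with the same static schedule
         * as the compute loop, so each page is allocated on the NUMA node
         * of the thread that will later use it.                            */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++) {
            a[i] = 0.0;
            b[i] = 1.0;
        }

        /* Compute loop: same schedule, so threads mostly touch local pages. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] += 0.5 * b[i];

        free(a); free(b);
        return 0;
    }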

Parallel Processing of a JPEG XR Encoder on TILEPro64 (Multimedia Applications)
Flow: sequential C source code (JPEG XR encoder, optical flow calculation) → (1) OSCAR compiler parallelization → parallelized C program with the OSCAR API → (2) cache allocation setting → API analyzer + sequential compiler → parallelized executable binary for TILEPro64.
• 55x speedup on 64 cores with our cache allocation, versus 28x with the default cache allocation.
• Local cache optimization: the parallel data structure (tile) on the heap is allocated to the local cache of the core that processes it.
[Figure: the TILEPro64 tile grid (8 x 8 cores) with four memory controllers and I/O blocks; speedup vs. number of cores with the default and the optimized cache allocation]

Speedup with 2 Cores for a Handwritten Engine Crankshaft Program on the RPX Multicore Processor

• 1.6 times speedup with 2 cores against 1 core.
[Figure: speedup (速度向上率) bar chart — 1.00 for 1 core, 1.60 for 2 cores; the original macrotask graph with many conditional branches and the macrotask graph after task fusion]
• Branches are fused into macrotasks for static scheduling, because the task grain is too fine (microseconds) for dynamic scheduling.

Model-Based Designed Engine Control on a Multicore with Denso
Although parallel processing of engine control on multicores has so far been very difficult, Denso and Waseda achieved 1.95 times speedup on a 2-core V850 multicore processor for hard real-time automobile engine control. The C codes generated by the MATLAB/Simulink Embedded Coder are automatically parallelized.

[Figure: Gantt charts of the engine control code on 1 core and on 2 cores]

OSCAR Compile Flow for Simulink Applications

A Simulink model is converted to C code using the Embedded Coder, and the C code is fed to the OSCAR compiler, which:
(1) generates the MTG, exposing the parallelism;
(2) generates a Gantt chart, i.e., the schedule on the multicore;
(3) generates parallelized C code using the OSCAR API for multiplatform execution (Intel, ARM, SH, etc.).

Speedups of MATLAB/Simulink Image Processing on Various 4-core Multicores (Intel Xeon, ARM Cortex-A15, and Renesas SH4A)

Benchmark models:
• Road Tracking, Image Compression: http://www.mathworks.co.jp/jp/help/vision/examples
• Buoy Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/44706‐buoy‐detection‐using‐simulink
• Color Edge Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/28114‐fast‐edges‐of‐a‐color‐image‐‐actual‐color‐‐not‐converting‐to‐grayscale‐/
• Vessel Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/24990‐retinal‐blood‐vessel‐extraction/

Parallel Processing on a Simulink Model
• The parallelized C code can be embedded back into Simulink using the C MEX API for HILS and SILS implementations.

[Figure: a Simulink model calling the sequential C code from an S-Function block, and the same model calling the parallelized C code from an S-Function block]
(A generic C MEX S-Function skeleton follows.)
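A minimal C MEX S-Function skeleton of the kind such an integration could use — this is generic Simulink S-Function boilerplate written from memory, not code from the slides, and run_parallel_code() is a hypothetical stand-in for the OSCAR-parallelized entry point:

    #define S_FUNCTION_NAME  oscar_sfun
    #define S_FUNCTION_LEVEL 2
    #include "simstruc.h"

    /* Hypothetical entry point into the parallelized C code. */
    extern void run_parallel_code(const double *in, double *out, int n);

    static void mdlInitializeSizes(SimStruct *S)
    {
        ssSetNumSFcnParams(S, 0);
        if (!ssSetNumInputPorts(S, 1)) return;
        ssSetInputPortWidth(S, 0, DYNAMICALLY_SIZED);
        ssSetInputPortRequiredContiguous(S, 0, 1);
        ssSetInputPortDirectFeedThrough(S, 0, 1);
        if (!ssSetNumOutputPorts(S, 1)) return;
        ssSetOutputPortWidth(S, 0, DYNAMICALLY_SIZED);
        ssSetNumSampleTimes(S, 1);
    }

    static void mdlInitializeSampleTimes(SimStruct *S)
    {
        ssSetSampleTime(S, 0, INHERITED_SAMPLE_TIME);
        ssSetOffsetTime(S, 0, 0.0);
    }

    static void mdlOutputs(SimStruct *S, int_T tid)
    {
        int_T         n = ssGetInputPortWidth(S, 0);
        const real_T *u = ssGetInputPortRealSignal(S, 0);
        real_T       *y = ssGetOutputPortRealSignal(S, 0);
        (void)tid;

        /* Hand one block of samples to the parallelized C code. */
        run_parallel_code(u, y, n);
    }

    static void mdlTerminate(SimStruct *S) { (void)S; }

    #ifdef MATLAB_MEX_FILE
    #include "simulink.c"
    #else
    #include "cg_sfun.h"
    #endif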

OSCAR API Ver. 2.0 for Homogeneous/Heterogeneous Multicores and Manycores

An Image of a Static Schedule for a Heterogeneous Multicore with Data Transfer Overlapping and Power Control
[Figure: a schedule over time of CPU and accelerator cores with overlapped DMA transfers and power-controlled idle intervals]

33 Times Speedup Using the OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a hand-tuned library): 111 fps

[Figure: speedups against a single SH processor — 1.0 (1SH, 3.4 fps), 2.29 (2SH), 3.09 (4SH), 5.4 (8SH), 18.85 (2SH+1FE), 26.71 (4SH+2FE), and 32.65 (8SH+4FE, 111 fps)]

Power Reduction by Power Supply, Clock Frequency, and Voltage Control by the OSCAR Compiler
The frequency and voltage (DVFS) and the clock and power gating of each core are scheduled together with the task schedule, since the dynamic power is proportional to the cube of the frequency (F^3, with voltage scaled along with frequency) and the leakage (static) power can be reduced by power gating (power off).

• Shortest execution time mode
[Figure: a schedule in which each macro-task runs at frequency FULL, MID, or LOW, and idle cores are powered off by power gating]

An Example of Machine Parameters

• Functions of the multiprocessor for the power saving scheme:
  - The frequency of each processor can be changed among several levels.
  - The voltage is changed together with the frequency.
  - Each processor can be powered on/off.

    state            FULL   MID    LOW    OFF
    frequency        1      1/2    1/4    0
    voltage          1      0.87   0.71   0
    dynamic energy   1      3/4    1/2    0
    static power     1      1      1      0

• State transition overhead:

    delay time [u.t.]                     energy overhead [μJ]
    state  FULL  MID   LOW   OFF          state  FULL  MID  LOW  OFF
    FULL   0     40k   40k   80k          FULL   0     20   20   40
    MID    40k   0     40k   80k          MID    20    0    20   40
    LOW    40k   40k   0     80k          LOW    20    20   0    40
    OFF    80k   80k   80k   0            OFF    40    40   40   0

(A small worked example using these parameters follows.)
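A back-of-the-envelope example of my own (not from the slides), using the table's relative units and ignoring the static energy of the stretched interval: consider a macro-task off the critical path with slack before its successors need its result.

\[
\begin{aligned}
&\text{MT at FULL: } T = 200\text{k u.t.},\quad E_{\mathrm{dyn}} = 400\,\mu\text{J},\quad \text{slack} = 300\text{k u.t.}\\
&\text{Run at MID: time } 2T = 400\text{k u.t.},\ \text{transitions } 2\times 40\text{k} = 80\text{k u.t.}
  \;\Rightarrow\; \text{extra } 280\text{k} \le 300\text{k (fits the slack)}\\
&E_{\mathrm{dyn}} \to \tfrac34 \times 400 = 300\,\mu\text{J},\quad
  \text{transition overhead } 2\times 20 = 40\,\mu\text{J}
  \;\Rightarrow\; 340\,\mu\text{J} < 400\,\mu\text{J}
\end{aligned}
\]

So slowing this MT to MID saves energy while meeting the schedule; with less slack, the transition delays would make FULL (or power gating the idle gap) the better choice.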

Power Reduction Scheduling
Given a macrotask graph assigned to 3 cores (MTs 1, 4, 7, and 8 are on the critical path, CP), in real-time scheduling mode the compiler:
1) reduces the frequencies (Fs) of the MTs on the CP as far as the deadline allows;
2) reduces the Fs of the MTs not on the CP, considering the transition overheads;
and applies clock or power gating during idle intervals.
[Figure: the resulting power schedule compared with the fastest execution mode]

A power schedule for SPEC95 APPLU in fastest execution mode: Doall 6, Loops 10, 11, 12, 13, Doall 17, and Loops 18, 19, 20, 21 are on the CP.

Low-Power Optimization with the OSCAR API

The scheduled result is emitted by the OSCAR compiler as one function per core (main_VC0(), main_VC1()) containing the macro-tasks MT1-MT4. Where a core would otherwise sit idle, the compiler inserts

    #pragma oscar fvcontrol ¥
      ((OSCAR_CPU(),0))

to put that core to sleep (frequency/voltage level 0), and

    #pragma oscar fvcontrol ¥
      (1,(OSCAR_CPU(),100))

to bring it back to full speed (level 100) just before its next macro-task. ("¥" is the line-continuation character of the original listing.)

Power Reduction in Real-Time Execution Controlled by the OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a hand-tuned library)

• Without power reduction: average 1.76 W. With power reduction by the OSCAR compiler: average 0.54 W — a 70% power reduction.
• One cycle: 33 ms → 30 fps.

Automatic Power Reduction for MPEG2 Decoding on an Android Multicore (ODROID-X2, ARM Cortex-A9, 4 cores)
http://www.youtube.com/channel/UCS43lNYEIkC8i_KIgFZYQBQ
[Figure: power consumption [W] on 1, 2, and 3 cores — without power control (電力制御なし): 0.97, 1.88, 2.79; with compiler power control (電力制御あり): 0.63 (-35.0%, about 2/3), 0.46 (-75.5%, about 1/4), 0.37 (-86.7%, about 1/7); 3 cores with control vs. 1 core without control: about 1/3 (-61.9%)]

• On 3 cores, the automatic power reduction control reduced power to 1/7 of the power without power reduction control.
• 3 cores with the compiler power reduction control reduced power to 1/3 of ordinary 1-core execution.

Power Reduction on Intel Haswell for Real-Time Optical Flow
Intel Core i7 4770K, HD 720p (1280x720) moving pictures at 15 fps (deadline 66.6 ms/frame).
[Figure: average power consumption [W] for 1, 2, and 3 PEs — without power control: 29.29, 36.59, 41.58; with power control: 24.17, 12.21, 9.60]
• Power was reduced to 1/4 (9.6 W) by the compiler power optimization on the same 3 cores (41.6 W without control).
• Power with 3 cores was reduced to 1/3 (9.6 W) against 1 core (29.3 W).

Automatic Parallelization of JPEG-XR for a Drinkable Inner Camera (Endo Capsule) — Waseda U. & Olympus
• 10 times more speedup was needed even after parallelization for 128 cores of Power7, and less than 35 mW power consumption is required.
• Speedups on the TILEPro64 manycore: 1.00, 1.96, 3.95, 7.86, 15.82, 30.79, and 55.11 for 1, 2, 4, 8, 16, 32, and 64 cores (コア数); 10.0 s on 1 core was reduced to 0.18 s on 64 cores — 55 times speedup with 64 cores.

OSCAR Vector Multicore and Compiler, from Embedded Systems to Servers, with OSCAR Technology

Targets:
• Solar-powered operation.
• Compiler-directed power reduction.
• Fully automatic parallelization and vectorization, including local memory management and data transfer.

Architecture (compiler co-designed):
[Figure: a multicore chip (x4 chips) with centralized/on-chip shared memory and a compiler co-designed interconnection network; each core contains a CPU, a vector unit, local memory, distributed shared memory, a data transfer unit, and a power control unit. Photo: Fujitsu VPP500/NWT PE unit cabinet (open), Copyright 2008 FUJITSU LIMITED]

Automatic Local Memory Management — Data Localization: Loop Aligned Decomposition
• Loops are decomposed into LRs and CARs:
  - LR (Localizable Region): data can be passed through the LDM (local data memory).
  - CAR (Commonly Accessed Region): data transfers are required among processors.
[Figure: multi-dimension decomposition vs. single-dimension decomposition of an array across the cores]

Adjustable Blocks
• The local memory is managed in blocks whose size suits each application, unlike the fixed block (line) size of a cache.
• Each block can be divided into smaller blocks of integer-divisible sizes to handle small arrays and scalar variables.

Multi-dimensional Template Arrays for Improving Readability
• A mapping technique for arrays with varying dimensions:
  - Each block on the LDM corresponds to multiple empty arrays with varying dimensions.
  - These arrays have an additional dimension that stores the corresponding block number: TA[Block#][] for single dimension, TA[Block#][][] for double dimension, TA[Block#][][][] for triple dimension, and so on.
• The LDM itself is represented as a one-dimensional array.
  - Without template arrays, multi-dimensional arrays need complex index calculations: A[i][j][k] -> TA[offset + i' * L + j' * M + k'].
  - Template arrays provide readability: A[i][j][k] -> TA[Block#][i'][j'][k']. (See the sketch below.)
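A toy C sketch of the index mapping described above; the block count and dimensions are invented, and OSCAR generates this mapping automatically:

    #include <stdio.h>

    /* Invented sizes: the LDM is split into 4 blocks of 2*3*4 doubles each. */
    #define NBLK 4
    #define L    2
    #define M    3
    #define K    4

    /* Flat view of the local data memory. */
    static double LDM[NBLK * L * M * K];

    /* Without template arrays: manual offset arithmetic into the flat LDM. */
    #define FLAT(blk, i, j, k)  LDM[((blk) * L + (i)) * M * K + (j) * K + (k)]

    /* With a template array: overlay a [block][i][j][k] view on the memory. */
    typedef double TA3[NBLK][L][M][K];
    #define TA (*(TA3 *)LDM)

    int main(void)
    {
        FLAT(1, 1, 2, 3) = 42.0;            /* write through the flat view  */
        printf("%g\n", TA[1][1][2][3]);     /* read the same element (42)   */
        return 0;
    }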

8-Core RP2 Chip Block Diagram
[Figure: the RP2 block diagram shown earlier — two 4-core clusters, per-core CPU/FPU, 16 KB I$ and D$, local memory (I: 8 KB, D: 32 KB), 64 KB user RAM, power control registers PCR0-PCR7, local clock pulse generators LCPG0/1, barrier synchronization lines, snoop controllers, and the on-chip system bus (SuperHyway) with DDR2, SRAM, and DMA controllers]
(LCPG: local clock pulse generator; PCR: power control register; CCN/BAR: cache controller / barrier register; URAM: user RAM, i.e., distributed shared memory.)

Speedups by Local Memory Management Compared with Using Shared Memory on Benchmark Applications on RP2
• 20.12 times speedup for 8-core execution using local memory against sequential execution using the off-chip shared memory of RP2 for the AAC encoder.

Software Coherence Control Method in the OSCAR Parallelizing Compiler

• Coarse grain task parallelization with earliest executable condition analysis (control and data dependence analysis to detect parallelism among coarse grain tasks); the MTG is generated by this analysis.
• The OSCAR compiler automatically controls coherence, without coherence hardware, using the following simple program restructuring methods (see the sketch after this list):
  - To cope with stale data problems: data synchronization inserted by the compiler.
  - To cope with the false sharing problem: data alignment, array padding, and non-cacheable buffers.
This provides automatic software coherence control for manycores.

Performance of Software Coherence Control by the OSCAR Compiler on the 8-core RP2
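A generic C illustration of the false-sharing counter-measures named above — this is my own example, not OSCAR output, and it assumes a 64-byte cache line on the target:

    #include <stdio.h>

    #define CACHE_LINE 64          /* assumed line size of the target cache */
    #define NCORES     8

    /* Without padding, per-core partial sums share cache lines, so writes
     * by different cores hit the same line: false sharing (and, without
     * coherence hardware, possible stale/corrupted lines).                 */
    struct unpadded { double sum; };   /* shown only for contrast            */

    /* Padding plus alignment gives each core's datum its own cache line,
     * so each line is written by exactly one core.                         */
    struct padded {
        double sum;
        char   pad[CACHE_LINE - sizeof(double)];
    };

    static _Alignas(CACHE_LINE) struct padded partial[NCORES];

    int main(void)
    {
        for (int c = 0; c < NCORES; c++)   /* stand-in for per-core work     */
            partial[c].sum = c * 1.0;

        double total = 0.0;
        for (int c = 0; c < NCORES; c++)
            total += partial[c].sum;

        printf("total = %g, sizeof(struct padded) = %zu\n",
               total, sizeof(struct padded));
        return 0;
    }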

[Figure: speedups on 1, 2, 4, and 8 cores of RP2 for equake, art, lbm, hmmer (SPEC2000/SPEC2006), cg, mg, bt, lu, sp (NPB), and an MPEG2 encoder (MediaBench), comparing SMP (hardware coherence) with NCC (software coherence by the compiler); the software-coherence versions reach speedups comparable to the hardware-coherent SMP versions, up to about 5.66 at 8 cores]

Future Multicore Products

• Next generation automobiles — safer, more comfortable, energy efficient, environmentally friendly: cameras, radar, car-to-car communication, and internet information integrated with brake, steering, engine, and motor control.
• Smartphones — from everyday recharging to recharging less than once a week.
• Advanced medical systems — cancer treatment, drinkable inner cameras; emergency solar-powered operation; no cooling fan and no dust, so usable even inside an operating room; helping keep people healthy.
• Personal / regional supercomputers — solar powered, with more than 100 times higher power efficiency (FLOPS/W); regional disaster simulators saving lives from tornadoes, localized heavy rain, fires, and earthquakes; solar-powered operation in emergency conditions.

Summary
• To get speedup and power reduction on homogeneous and heterogeneous multicore systems, collaboration of architecture and compiler will become increasingly important.
• The automatic parallelizing and power-reducing compiler has succeeded in speeding up and/or reducing the power of scientific applications including earthquake wave propagation simulation, medical applications including cancer treatment using carbon ions and a drinkable inner camera, and industrial applications including automobile engine control and wireless-communication baseband processing, on various multicores.
• For example, automatic parallelization gave 110 times speedup for the earthquake wave propagation simulation on 128 cores of IBM Power7 against 1 core; 327 times speedup for heavy-particle radiotherapy cancer treatment on a 144-core Hitachi blade server using Intel Xeon E7-8890; 1.95 times speedup for automobile engine control on 2 Renesas cores (SH4A or V850); and 55 times speedup for JPEG-XR encoding for capsule inner cameras on the 64-core Tilera TILEPro64 manycore.
• In automatic power reduction, the power consumed by real-time multimedia applications such as human face detection, H.264, MPEG2, and optical flow was reduced to 1/2 or 1/3 using 3 cores of ARM Cortex-A9 and Intel Haswell, and to 1/4 using 8 Renesas SH4A cores, against ordinary single-core execution.
• For further speedup and power reduction, we have been developing a new architecture/compiler co-designed multicore with a vector accelerator based on vector pipelining with vector registers, chaining, a load/store pipeline, and an advanced DMA controller, without requiring modification of the CPU instruction set.