OSCAR SCM Architecture for Multigrain Parallel Processing

Software and Hardware for High Performance and Low Power Homogeneous and Heterogeneous Multicore Systems

Hironori Kasahara
Professor, Dept. of Computer Science & Engineering
Director, Advanced Multicore Processor Research Institute
Waseda University, Tokyo, Japan
IEEE Computer Society President Elect 2017
URL: http://www.kasahara.cs.waseda.ac.jp/

Multicores for Performance and Low Power
Power consumption is one of the biggest problems for performance scaling, from smartphones to cloud servers and supercomputers (the "K" computer draws more than 10 MW). Power ∝ Frequency × Voltage², and since Voltage scales roughly with Frequency, Power ∝ Frequency³. If the frequency is reduced to 1/4 (e.g. from 4 GHz to 1 GHz), power drops to 1/64 while performance falls only to 1/4. With multicores, if 8 such cores are integrated on one chip, power is still 1/8 of the original and performance becomes 2 times the original.
(Die photo of the 8-core chip, labeled Core#0-#7, I$/D$, ILRAM, DLRAM, URAM, SNC0/SNC1, LBSC, SHWY, VSWC, DBSC, CSM, GCPG, DDRPAD.)
IEEE ISSCC08, Paper No. 4.5: M. Ito, ... and H. Kasahara, "An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler."
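The arithmetic behind this argument can be written out in a few lines. The following is only a minimal sketch of the slide's frequency-cubed power model; the 1/4 frequency ratio and the 8-core count are the values quoted above, and everything else is illustrative.

```c
#include <stdio.h>

/* Slide's model: Power ~ Frequency * Voltage^2, with Voltage ~ Frequency,
 * so relative power scales as the cube of the relative frequency. */
static double relative_power(double freq_ratio)
{
    return freq_ratio * freq_ratio * freq_ratio;
}

int main(void)
{
    const double freq_ratio = 0.25;  /* 4 GHz -> 1 GHz */
    const int cores = 8;

    double one_core_power = relative_power(freq_ratio);  /* 1/64 of the original */
    double chip_power     = cores * one_core_power;      /* 1/8 of the original  */
    double chip_perf      = cores * freq_ratio;          /* 2x, if the software scales */

    printf("1 slow core : power %.4f, performance %.2f\n", one_core_power, freq_ratio);
    printf("%d slow cores: power %.4f, performance %.2f\n", cores, chip_power, chip_perf);
    return 0;
}
```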
Parallel Software Is Important for Scalable Performance of Multicores
Just adding more cores does not give speedup, and the development cost and time of parallel software are becoming a bottleneck in the development of embedded systems, e.g. IoT devices and automobiles. Consider the GMS earthquake wave propagation simulation developed by the National Research Institute for Earth Science and Disaster Resilience (NIED), run on a Fujitsu M9000 SPARC multicore server. The automatic parallelizing compiler available on the market gave no speedup over 1-core execution even with 64 cores, and execution with 128 cores was slower than with 1 core (0.9 times speedup). The advanced OSCAR parallelizing compiler gave a 211 times speedup with 128 cores against the 1-core execution time with the commercial compiler, and was already 2.1 times faster than the commercial compiler on 1 core thanks to global cache optimization.

Trend of Peak Performances of Supercomputers
Aurora, 2018, 180 PFLOPS, 13 MW, Argonne National Lab., Intel & Cray
Sunway TaihuLight, 2016.06, 93 PFLOPS, 15.4 MW
Tianhe-2, 2013.06, 55 PFLOPS, 17.8 MW
Titan, 2012.11, 27 PFLOPS, 8.2 MW
Sequoia, 2012.06, 20 PFLOPS, 7.9 MW
K computer (京), 2011.6 & 11, 11 PFLOPS, 11.3 MW
The US, China, Europe and Japan plan ExaFLOPS-class systems for 2020-22.

Power Reduction of MPEG2 Decoding to 1/4 on the 8-Core Homogeneous Multicore RP-2 by the OSCAR Parallelizing Compiler
MPEG2 decoding with 8 CPU cores. Without power control (voltage fixed at 1.4 V), the average power is 5.73 W. With compiler power control (frequency scaling and resume standby, i.e. power shutdown and voltage lowering between 1.4 V and 1.0 V), the average power is 1.52 W, a 73.5% power reduction.

Compiler Co-designed Multicore RP2
The RP2 chip consists of two 4-core clusters (cluster #0: cores #0-#3, cluster #1: cores #4-#7) connected by barrier synchronization lines. Each core contains a CPU and FPU, 16 KB instruction and data caches, local memory (8 KB instruction, 32 KB data), a 64 KB user RAM (URAM), a cache controller with barrier register (CCN/BAR), and a power control register (PCR0-PCR7). Each cluster has a snoop controller (SNC0/SNC1) and a local clock pulse generator (LCPG0/LCPG1), and both clusters share the on-chip system bus (SuperHyway) together with the DDR2, SRAM and DMA controllers.
LCPG: local clock pulse generator; PCR: power control register; CCN/BAR: cache controller / barrier register; URAM: user RAM (distributed shared memory).

Renesas-Hitachi-Waseda Low Power 8-Core RP2, Developed in 2007 in the METI/NEDO Project
Process technology: 90 nm, 8-layer, triple-Vth CMOS
Chip size: 104.8 mm² (10.61 mm x 9.88 mm)
CPU core size: 6.6 mm² (3.36 mm x 1.96 mm)
Supply voltage: 1.0 V-1.4 V (internal), 1.8/3.3 V (I/O)
Power domains: 17 (8 CPUs, 8 URAMs, common)
IEEE ISSCC08, Paper No. 4.5: M. Ito, ... and H. Kasahara, "An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler."

Industry-Government-Academia Collaboration in R&D
Waseda University's R&D on green many-core system technologies with ultra-low power consumption -- the OSCAR many-core chip, compiler and API -- targets practical applications that protect lives and the environment and support smart life: supercomputers and green cloud servers; super real-time disaster simulation (tectonic shifts, tsunami, tornado, flood, fire spreading); on-board vehicle systems (navigation, integrated controllers, infrastructure coordination); consumer electronics such as Internet TV/DVD, camcorders, cameras, capsule endoscope cameras, smartphones and intelligent home appliances; stock trading and medical servers; heavy-particle radiation planning and cerebral infarction analysis at the National Institute of Radiological Sciences; OSCAR robots; and fanless, cool, quiet solar-powered servers that can operate and recharge from solar cells.

Cancer Treatment: Carbon Ion Radiotherapy
For the carbon-ion radiotherapy code of the National Institute of Radiological Sciences (NIRS), whose previous best was a 2.5 times speedup on 16 processors with hand optimization, the OSCAR compiler achieved an 8.9 times speedup with 12 processors on an Intel Xeon X5670 2.93 GHz 12-core SMP (Hitachi HA8000) and a 55 times speedup with 64 processors on an IBM Power7 64-core SMP (Hitachi SR16000).

OSCAR Parallelizing Compiler
The compiler aims to improve effective performance, cost-performance and software productivity, and to reduce power. Its main techniques are:
Multigrain parallelization: coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop-level parallelism.
Data localization: automatic data management for distributed shared memory, caches and local memories.
Data transfer overlapping: data transfers are overlapped with computation using the data transfer units (DMA controllers); a minimal double-buffering sketch is given after this list.
Power reduction: consumed power is reduced by compiler-controlled DVFS and power gating, with hardware support.
(The accompanying schedule chart shows macro-tasks MT1-1 through MT3-8 with LOAD/SEND/STORE and power-OFF phases scheduled on the cores and data transfer units (DTUs) of CPU0-CPU3 and DRP0.)
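The data-transfer-overlapping idea can be sketched as a simple double-buffering loop. This is not the OSCAR compiler's generated code; dtu_get_async, dtu_wait, process and CHUNK are hypothetical names standing in for a DMA-style data transfer unit interface and a local-memory compute kernel.

```c
#include <stddef.h>

#define CHUNK 1024

/* Hypothetical asynchronous DTU (DMA) interface -- illustration only. */
void dtu_get_async(double *local_dst, const double *remote_src, size_t n);
void dtu_wait(void);                    /* wait for the outstanding transfer */
void process(double *chunk, size_t n);  /* compute on data already in local memory */

/* Overlap the transfer of chunk i+1 with the computation on chunk i. */
void stream_chunks(const double *remote, size_t nchunks, double buf[2][CHUNK])
{
    dtu_get_async(buf[0], remote, CHUNK);              /* prefetch the first chunk */
    for (size_t i = 0; i < nchunks; i++) {
        dtu_wait();                                    /* chunk i is now in local memory */
        if (i + 1 < nchunks)                           /* start fetching the next chunk */
            dtu_get_async(buf[(i + 1) & 1], remote + (i + 1) * CHUNK, CHUNK);
        process(buf[i & 1], CHUNK);                    /* compute while the DTU transfers */
    }
}
```

In the OSCAR scheme this kind of overlap is generated automatically from the macro-task schedule rather than written by hand.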
Performance of the OSCAR Compiler on an IBM p6 595 Power6 (4.2 GHz) Based 32-Core SMP Server
AIX Ver. 12.1. Compile options:
(*1) Sequential: -O3 -qarch=pwr6; XLF: -O3 -qarch=pwr6 -qsmp=auto; OSCAR: -O3 -qarch=pwr6 -qsmp=noauto
(*2) Sequential: -O5 -q64 -qarch=pwr6; XLF: -O5 -q64 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -q64 -qarch=pwr6 -qsmp=noauto
(Others) Sequential: -O5 -qarch=pwr6; XLF: -O5 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -qarch=pwr6 -qsmp=noauto

Generation of Coarse-Grain Tasks
The program is decomposed into macro-tasks (MTs) of three kinds:
Block of Pseudo Assignments (BPA): a basic block (BB)
Repetition Block (RB): a natural loop
Subroutine Block (SB): a subroutine
The decomposition is hierarchical: coarse-grain parallelization is applied among the MTs of the first layer, and the bodies of RBs and SBs are recursively decomposed into second- and third-layer MTs, where coarse-grain parallelization, loop-level parallelization, and near-fine-grain parallelization of statements and loop bodies are applied.

Earliest Executable Condition Analysis for Coarse-Grain Tasks (Macro-Tasks)
The macro flow graph, which represents control flow and data dependences among macro-tasks, is analyzed into a macro task graph.

Priority Determination in the Dynamic CP Method
Macro-task priorities are expected critical-path lengths weighted by branch probabilities, e.g. 60 * 0.80 + 100 * 0.20 = 68.

Earliest Executable Conditions
An earliest executable condition (EEC) combines control dependences and data dependences: control dependences show when the execution of an MT is decided, and data dependences show when the data accessed by an MT are ready. For example, MT2 may start execution after MT1 branches to MT2 and MT1 finishes execution; MT3 may start execution as soon as MT1 branches to MT3; MT6 may start execution after MT3 finishes execution or after MT2 branches to MT4.

Automatic Processor Assignment in 103.su2cor
Using 14 processors, coarse-grain parallelization is applied within loop DO400.

Macro Task Graph of su2cor LOOPS-DO400
The coarse-grain parallelism of the DO400 macro task graph is PARA_ALD = 4.3; the graph contains DOALL loops, sequential loops, SBs and BBs.

Data Localization: Loop-Aligned Decomposition
Multiple loops (DOALL and sequential) are decomposed into localizable regions (LRs) and commonly accessed regions (CARs), taking inter-loop data dependences into account. Most data in an LR can be passed through local memory (LM).

  C RB1 (DOALL)
        DO I = 1, 101
          A(I) = 2*I
        ENDDO
  C RB2 (DOSEQ)
        DO I = 1, 100
          B(I) = B(I-1) + A(I) + A(I+1)
        ENDDO
  C RB3 (DOALL)
        DO I = 2, 100
          C(I) = B(I) + B(I-1)
        ENDDO

After loop-aligned decomposition, RB1 is split into LR I=1..33, CAR I=34..35, LR I=36..66, CAR I=67..68 and LR I=69..101; RB2 into I=1..33, I=34, I=35..66, I=67 and I=68..100; and RB3 into I=2..34, I=35..67 and I=68..100.

Inter-Loop Data Dependence Analysis in the TLG
The exit RB of a target loop group (TLG) is defined as the standard loop, and the analysis finds the iterations on which each iteration of the standard loop is data dependent.
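To make the MT2/MT3/MT6 conditions described above concrete, here is a minimal sketch (not OSCAR's actual runtime; the mt_state structure and the ready-test functions are hypothetical) that evaluates those earliest executable conditions:

```c
#include <stdbool.h>
#include <stdio.h>

/* Minimal state for the example macro-tasks -- hypothetical, just enough
 * to express the earliest executable conditions (EECs) from the slide. */
struct mt_state {
    bool finished;      /* data dependence satisfied: the MT has completed        */
    int  branch_target; /* control dependence: the MT it branched to (0 = undecided) */
};

/* MT2: MT1 has branched to MT2 AND MT1 has finished (control + data dependence). */
bool mt2_ready(const struct mt_state *mt1) {
    return mt1->branch_target == 2 && mt1->finished;
}

/* MT3: MT1 has branched to MT3 (control dependence only). */
bool mt3_ready(const struct mt_state *mt1) {
    return mt1->branch_target == 3;
}

/* MT6: MT3 has finished OR MT2 has branched to MT4. */
bool mt6_ready(const struct mt_state *mt2, const struct mt_state *mt3) {
    return mt3->finished || mt2->branch_target == 4;
}

int main(void) {
    struct mt_state mt1 = { .finished = true,  .branch_target = 2 };
    struct mt_state mt2 = { .finished = false, .branch_target = 0 };
    struct mt_state mt3 = { .finished = false, .branch_target = 0 };

    printf("MT2 ready: %d\n", mt2_ready(&mt1));       /* 1: MT1 -> MT2 and MT1 done   */
    printf("MT3 ready: %d\n", mt3_ready(&mt1));       /* 0: MT1 branched elsewhere    */
    printf("MT6 ready: %d\n", mt6_ready(&mt2, &mt3)); /* 0: neither condition holds   */
    return 0;
}
```

A dynamic scheduler would re-evaluate such conditions whenever a macro-task finishes or a branch direction becomes known.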