DEPARTMENT OF COMPUTER SCIENCE

ARCHITECTURAL AND SOFTWARE SUPPORT FOR DATA-DRIVEN EXECUTION ON MULTI-CORE PROCESSORS

DOCTOR OF PHILOSOPHY DISSERTATION

GEORGE MATHEOU

2017

DEPARTMENT OF COMPUTER SCIENCE

ARCHITECTURAL AND SOFTWARE SUPPORT FOR DATA-DRIVEN EXECUTION ON MULTI-CORE PROCESSORS

GEORGE MATHEOU

A Dissertation Submitted to the University of Cyprus in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

November, 2017

© George Matheou, 2017

VALIDATION PAGE

Doctoral Candidate: George Matheou

Doctoral Dissertation Title: Architectural and software support for data-driven execution on multi-core processors

The present Doctoral Dissertation was submitted in partial fulfillment of the requirements for the Degree of Doctor of Philosophy at the Department of Computer Science and was approved on November 27, 2017 by the members of the Examination Committee.

Examination Committee:

Research Supervisor: Professor Paraskevas Evripidou

Committee Member: Professor Constantinos S. Pattichis

Committee Member: Assistant Professor Theocharis Theocharides

Committee Member: Professor Ian Watson

Committee Member: Dr. Albert Cohen

DECLARATION OF DOCTORAL CANDIDATE

The present doctoral dissertation was submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy of the University of Cyprus. It is a product of original work of my own, unless otherwise mentioned through references, notes, or any other statements.

George Matheou


ABSTRACT

The end of the exponential growth of sequential processors has facilitated the development of multi-core systems. Thus, any increase in performance must come from parallelism. To achieve this, efficient parallel programming/execution models must be developed. We propose developing such systems using the Data-Driven Multithreading (DDM) model of execution. DDM is a multithreading model that combines concurrency, based on the dynamic data-flow model, with efficient sequential execution on conventional processors. DDM uses the Thread Scheduling Unit (TSU) to schedule threads at runtime based on data availability. In this work, we provide architectural and software support for efficient execution on multi-core architectures, through two different implementations based on the DDM model.

The first implementation realizes the DDM model in hardware, using Field Programmable Gate Arrays (FPGAs). This implementation aims to assist the development of future high-performance, low-power multi-core systems. The TSU was implemented in hardware using the Verilog language and was integrated into a multi-core processor with non-coherent, low-complexity cores. This processor is called MiDAS (Multi-core with Data-Driven Architectural Support) and was prototyped using a Xilinx Virtex-6 FPGA. The MiDAS processor was evaluated using applications with different characteristics, developed in C/C++ using an application programming interface (API). The performance evaluation of MiDAS showed that architectural support for data-driven execution can achieve very good results, even for applications with very small problem sizes.

As part of this work, we provide several results for the MiDAS processor, such as FPGA resource utilization, power consumption estimates and latencies (in cycles) of the various TSU operations. The results show that the TSU can be implemented with a small hardware budget. The TSU is compared with Task Superscalar, an architecture that implements the StarSs model in hardware, using resource requirements and macro statistics. The results show that implementing a data-flow model in hardware that dynamically detects dependencies between tasks and constructs the dependency graph at runtime, like Task Superscalar, significantly increases resource utilization (and, consequently, power consumption).

The second implementation, called FREDDO (efficient Framework for Runtime Execution of Data-Driven Objects), is an efficient and portable object-oriented implementation of the DDM model that enables data-flow-based scheduling on distributed systems with conventional multi-core processors. FREDDO targets efficient DDM execution on distributed High Performance Computing systems. It also provides new features to the DDM model, such as recursion support, and extends DDM's programming interface with object-oriented programming. FREDDO was evaluated on two different systems: a 4-node AMD system with a total of 128 cores and a 64-node Intel system with a total of 768 cores. The performance evaluation shows that the proposed system scales well and effectively tolerates scheduling overheads and memory latencies. We also compare FREDDO with OpenMP, MPI, DDM-VM and OmpSs. The comparison results show that the proposed system achieves comparable or better performance.


ABSTRACT

The end of the exponential performance growth of sequential processors has facilitated the development of multi-core systems. Thus, any growth in performance must come from parallelism. To achieve that, efficient parallel programming/execution models must be developed. We propose to develop such systems using the Data-Driven Multithreading (DDM) model of execution. DDM is a non-blocking multithreading model that combines dynamic data-flow concurrency with efficient sequential execution on conventional processors. DDM utilizes the Thread Scheduling Unit (TSU) for scheduling threads at runtime, based on data availability. In this work, we provide architectural and software support for efficient data-driven execution on multi-core architectures, through two different DDM-based implementations.

The first implementation realizes the DDM model in hardware, using Field Programmable Gate Arrays (FPGAs). The hardware DDM implementation aims to help in the development of future high-performance and low-power multi-core systems. DDM's TSU was implemented in hardware using Verilog. The hardware TSU implementation was integrated into a shared-memory multi-core processor with non-coherent in-order cores, called MiDAS (Multi-core with Data-Driven Architectural Support). MiDAS was prototyped and evaluated on a Xilinx Virtex-6 FPGA using benchmarks with different characteristics. The benchmarks were developed in C/C++ using a software API. The performance evaluation of MiDAS has shown that architectural support for data-driven execution can achieve very good results, even on benchmarks with very small problem sizes.

We provide several results for the hardware TSU and MiDAS, including FPGA resource requirements, power consumption estimations and latencies (in cycles) of various TSU operations. The results show that the TSU can be implemented in hardware with a small hardware budget. The proposed TSU is compared with Task Superscalar, an architecture that implements the StarSs programming framework in hardware. The results show that implementing a data-driven model in hardware that dynamically detects inter-task dependencies and constructs the dependency graph at runtime, like Task Superscalar, significantly increases the resource requirements and power consumption.

The second implementation, called FREDDO (efficient Framework for Runtime Execution of Data-Driven Objects), is an efficient and portable object-oriented implementation of DDM that enables data-driven scheduling on conventional single-node and distributed multi-core systems. The FREDDO implementation aims to allow efficient DDM execution on distributed High Performance Computing (HPC) systems. It also provides new features to the DDM model, like recursion support, and it extends DDM's programming interface with the object-oriented programming paradigm. FREDDO was evaluated on two different systems: a 4-node AMD system with a total of 128 cores and a 64-node Intel HPC system with a total of 768 cores. The performance evaluation shows that the proposed framework scales well and tolerates scheduling overheads and memory latencies effectively. We also compare our framework to OpenMP, MPI, DDM-VM and OmpSs. The comparison results show that the proposed framework obtains comparable or better performance.


ACKNOWLEDGEMENTS

I would like to thank my advisor, Professor Paraskevas Evripidou, for his guidance and support during the completion of this thesis. I am forever indebted to him for accepting me as his student and for his unconditional support, especially during difficult times.

For parts of this work I need to acknowledge and thank other researchers. I would like to thank Dr. Pedro Trancoso, Dr. Costas Kyriacou, Dr. Samer Arandi, George Michael and Andreas Diavastos for their invaluable help and support. Thanks for the inspiring opinions and insightful discussions, and for helping me to understand the different implementations of the Data-Driven Multithreading (DDM) model. In addition, I would like to thank my committee members, Professor Constantinos S. Pattichis, Assistant Professor Theocharis Theocharides, Professor Ian Watson and Dr. Albert Cohen, for their valuable comments.

I truly thank my friend and best man, Diomidis Papadiomidous, for being supportive throughout my time here and for helping me with proofreading my publications. I also thank my friends for providing the support and friendship that I needed. Thanks to Constantinos Costa, George Nikolaides, George Larkou, Panagiwta Nikolaou, Panagiwtis Loizias, Loizos Ioakim, Paraskevas Koutras, Xrisos Vasiliou and Andreas Dimitriou. If I have forgotten anyone, I apologize.

I am also grateful to the funding sources that made my Ph.D. work possible. This work was partially funded by the University of Cyprus, by the Cyprus State Scholarship Foundation (IKYK), and by the EU TERAFLUX project.

Last but not least, I would like to thank my family: my father Adamos, my mother Eleni, my brothers Theodosis, Petros and Andreas, and my wife's parents and sisters. Their constant care and support have helped me through tough times, and I cannot thank them enough for it. I would like to thank a very special person, my wife, Margarita, for her patience, unwavering support and partnership in my life. My son, Adamos-Panagiwtis, brings so much fun, excitement, and entertainment to my life. Thank you Adamos-Panagiwtis!

TABLE OF CONTENTS

Chapter 1: Introduction 1
1.1 Motivation ...... 1
1.2 Thesis Statement ...... 2
1.3 Approach ...... 3
1.4 Thesis Contributions ...... 3
1.5 Thesis Outline ...... 5

Chapter 2: Related Work 6
2.1 Introduction ...... 6
2.2 The shift to the multi-core era ...... 6
2.3 Field Programmable Gate Arrays (FPGAs) ...... 9
2.4 The Data-flow model of execution ...... 10
2.4.1 Pure Data-flow Architectures ...... 11
2.4.2 Hybrid Data-flow Architectures ...... 14
2.5 Recent Data-flow Developments ...... 16
2.5.1 Software Implementations ...... 16
2.5.2 Hardware Implementations ...... 19
2.6 Data-Driven Multithreading (DDM) ...... 24
2.6.1 Context and Nesting attributes ...... 24
2.6.2 Thread Template ...... 26
2.6.3 DDM Dependency Graph ...... 26
2.6.4 DDM Implementations ...... 27
2.7 Concluding Remarks ...... 38

Chapter 3: MiDAS: a Multi-core system with Data-Driven Architectural Support 42
3.1 Introduction ...... 42
3.2 TSU: Hardware Support for DDM ...... 42
3.2.1 Thread Template ...... 43
3.2.2 TSU Micro-architecture ...... 43
3.2.3 TSU's RTL schematics ...... 52
3.3 MiDAS System Architecture ...... 59
3.3.1 Memory Model ...... 60

Chapter 4: FREDDO: an efficient Framework for Runtime Execution of Data-Driven Objects 61

4.1 Introduction ...... 61
4.2 Single-node Implementation ...... 62
4.2.1 New features ...... 62
4.2.2 Architecture ...... 66
4.3 Distributed Implementation ...... 71
4.3.1 Architecture ...... 71
4.3.2 Memory Model ...... 72
4.3.3 Distribution Scheme and Scheduling Mechanisms for DThread instances ...... 74
4.3.4 Network Manager ...... 75
4.3.5 Distributed Execution Termination ...... 76
4.3.6 Reducing Network Traffic ...... 77

Chapter 5: Programming Methodology 79
5.1 Introduction ...... 79
5.2 FREDDO API ...... 80
5.2.1 Basic Runtime Functions ...... 80
5.2.2 DFunctions ...... 81
5.2.3 DThread Classes ...... 82
5.2.4 UML Diagram of DThread Classes ...... 85
5.3 Programming examples using FREDDO ...... 85
5.3.1 Simple application ...... 85
5.3.2 Synthetic application ...... 87
5.3.3 Tile LU Decomposition: single-node and distributed implementations ...... 89
5.4 MiDAS API ...... 93
5.4.1 Common functions for the C and C++ APIs ...... 95
5.4.2 C API ...... 95
5.4.3 C++ API ...... 96
5.5 Implementing Matrix Multiplication for MiDAS ...... 97
5.5.1 Implementation using the C API ...... 98
5.5.2 Implementation using the C++ API ...... 99
5.5.3 Implementation using TFlux directives ...... 100

Chapter 6: Recursion Support for the DDM model 102
6.1 Introduction ...... 102
6.2 The U-Interpreter and PBGR models ...... 102
6.2.1 The U-Interpreter model ...... 102
6.2.2 The PBGR model ...... 104

6.2.3 Differences between DDM and PBGR ...... 105
6.3 Basic functionalities for supporting recursion in DDM ...... 105
6.4 DThread Classes for Recursion Support in FREDDO ...... 109
6.4.1 RecursiveDThreadWithContinuation Class ...... 110
6.4.2 RecursiveDThread and ContinuationDThread Classes ...... 111
6.4.3 Distributed Recursion Support ...... 113
6.5 Implementing the recursive Fibonacci algorithm in FREDDO ...... 114
6.5.1 Implementation using the RecursiveDThreadWithContinuation Class ...... 114
6.5.2 Implementation using the RecursiveDThread and ContinuationDThread Classes ...... 114
6.5.3 Distributed Implementation ...... 115

Chapter 7: Evaluation 118
7.1 Introduction ...... 118
7.2 Benchmark Suite ...... 118
7.2.1 Benchmarks with simple dependency graphs ...... 118
7.2.2 Benchmarks with complex dependency graphs ...... 119
7.2.3 Recursive algorithms ...... 120
7.3 Experimentation Infrastructure ...... 121
7.4 The evaluation of the hardware TSU and MiDAS ...... 123
7.4.1 TSU Resource Requirements ...... 123
7.4.2 Latencies (in cycles) of various TSU Operations ...... 127
7.4.3 Performance Evaluation of MiDAS architecture ...... 127
7.4.4 FPGA Resource Requirements and Power Consumption Estimations for MiDAS ...... 132
7.4.5 DDM Architectural Support in MiDAS vs. Task Superscalar Architecture ...... 134
7.4.6 Hardware vs. Software TSU - Preliminary Results ...... 135
7.5 Single-node FREDDO Evaluation ...... 138
7.5.1 Experimental Setup ...... 138
7.5.2 Performance Evaluation ...... 139
7.5.3 Comparisons ...... 141
7.6 Distributed FREDDO Evaluation ...... 142
7.6.1 Experimental Setup ...... 142
7.6.2 Performance Evaluation ...... 143
7.6.3 FREDDO: CNI vs MPI ...... 146
7.6.4 Performance comparisons with other systems ...... 147
7.6.5 Network Traffic Analysis ...... 151
7.6.6 Execution Times ...... 152

Chapter 8: Conclusions and Future Work 155
8.1 Conclusions ...... 155
8.2 Future Work ...... 156
8.2.1 MiDAS ...... 157
8.2.2 FREDDO ...... 159
8.2.3 Extending the functionalities of DDM ...... 160

Bibliography 161

Appendices 173

Appendix A: Publications 174


LIST OF TABLES

1 Early representative hardware data-flow prototypes (from 1974 to 1992). ...... 39
2 Recent hardware data-flow developments (real and simulated implementations). ...... 40
3 Recent software data-flow developments. ...... 41
4 Context encoding according to the Nesting attribute for each Context size. ...... 64
5 Systems used for the benchmark evaluation of FREDDO. ...... 122
6 Latencies (in cycles) of various TSU Operations. ...... 126
7 Benchmark suite characteristics used in evaluating MiDAS's performance. ...... 128
8 Sequential execution time of the benchmarks running on MiDAS. ...... 128
9 Average speedup and efficiency for each problem size and number of enabled cores. ...... 130
10 Characteristics of DThreads for each benchmark running on MiDAS. ...... 131
11 Virtex-6 FPGA resource requirements and power consumption estimations in implementing MiDAS incorporating either the PO-TSU or the AO-TSU. ...... 132
12 Resource requirements and macro statistics of the proposed TSU vs. Task Superscalar. ...... 135
13 Versions of Synth benchmark used for comparing software and hardware TSUs. ...... 136
14 The benchmark suite characteristics for the FREDDO evaluation. ...... 139
15 Average sequential execution time (in seconds) of the sequential version of the benchmarks. ...... 145
16 Thresholds used for the execution of the recursive algorithms. ...... 145
17 Speedup results along with the utilization percentage of the available cores in each case. ...... 146
18 Best average execution time (in seconds) for FREDDO+CNI on AMD. ...... 153
19 FREDDO+MPI vs. MPI: best average execution time (in seconds) on AMD. ...... 153
20 FREDDO+CNI vs. DDM-VM: best average execution time (in seconds) on AMD. ...... 153
21 Best average execution time (in seconds) for FREDDO+MPI on CyTera. ...... 154
22 FREDDO+MPI vs. MPI: best average execution time (in seconds) on CyTera. ...... 154
23 FREDDO+MPI vs. OmpSs@Cluster: best average execution time (in seconds) on CyTera. ...... 154

LIST OF FIGURES

1 Growth in processor performance since the mid-1980s [1]. ...... 7
2 MPPA many-core architecture (C = Compute Cluster) [2]. ...... 8
3 The architecture of the MPPA Compute Cluster [2]. ...... 8
4 A high-level view of a platform FPGA [3]. ...... 9
5 An example of the direct matching approach. ...... 13
6 High-level view of Task Superscalar (retrieved from [4]). ...... 23
7 Computing with Maxeler's implementation of streaming data-flow cores [5]. ...... 23
8 Example of using multiple instances of the same DThread. ...... 25
9 Example of a DThread that parallelizes a two-level nested loop. ...... 25
10 Example of a DThread that parallelizes a three-level nested loop. ...... 25
11 Example of a DDM Dependency Graph. ...... 27
12 The D2NOW architecture [6]. ...... 28
13 A DDM Node [6]. ...... 29
14 The TSU's internal structure [6]. ...... 29
15 The TIU with the basic prefetch CacheFlow policy [6]. ...... 30
16 Several alternatives of the DDM-CMP architecture [7]. ...... 31
17 The layered design of the TFlux Platform [8]. ...... 32
18 A TFluxHard chip with 4 cores [8]. ...... 33
19 TFluxSoft system on a system with n CPUs [9]. ...... 33
20 The TFluxCell system [8]. ...... 34
21 The architecture of the DDM-VMc [10]. ...... 35
22 The architecture of the DDM-VMs [11]. ...... 36
23 The architecture of the Distributed DDM-VM [11]. ...... 37
24 Block diagram of the TSU micro-architecture supporting an arbitrary number of cores. ...... 45
25 Block diagram of the Template Memory. ...... 46
26 Block diagram of the hardware Dynamic Synchronization Memory. ...... 49
27 High-level RTL schematic of the TSU. ...... 54
28 High-level RTL schematic of the Fetch Unit and the Update Queue. ...... 55
29 High-level RTL schematic of the Update Unit connected with Template Memory, Update Queue and Ready Queue. ...... 56
30 High-level RTL schematic of the Update Unit. ...... 57
31 High-level RTL schematic of the TSU's output side (Ready Queue, Scheduling Unit, Waiting Queues and Transfer Units). ...... 58
32 MiDAS architecture supporting an arbitrary number of cores. ...... 59

33 LU algorithm: DThreads and Context values. ...... 63
34 Architecture of the single-node FREDDO implementation. ...... 66
35 Block diagram of the FREDDO's TSU. ...... 67
36 The TSU's basic data structures. ...... 68
37 Example of computing the RC values of Pending Thread Templates (PTT=Pending Thread Template, TT=Thread Template). ...... 70
38 The FREDDO's Distributed Architecture. ...... 72
39 Example of reducing the network traffic generated by a Multiple Update. T1(X,Y) denotes a Multiple Update for DThread T1. ...... 78
40 The UML diagram of all DThread classes. ...... 85
41 The DDM dependency graph of a simple application. ...... 86
42 Example of a synthetic DDM application. ...... 87
43 LU Decomposition: dependencies between operations for the first iteration. ...... 90
44 The LU's DDM dependency graph for the first two iterations of a 3 × 3 tile matrix (N=3). ...... 91
45 FREDDO code of the tile LU algorithm (the highlighted code is required for the distributed execution). ...... 92
46 Programming methodology of the MiDAS system. ...... 94
47 Matrix Multiplication: dynamic instantiations of thread 1. ...... 98
48 A program that computes the square of a number. ...... 103
49 U-Interpreter graph of the square function call. ...... 103
50 U-Interpreter graph of Fibonacci. ...... 104
51 PBGR evaluation of fib(2). ...... 104
52 RData implemented as a fixed-size array. ...... 106
53 RData implemented as a hash-map. ...... 106
54 The Fibonacci's DDM Dependency Graph. ...... 108
55 Task graphs of high complexity algorithms. ...... 120
56 A solution to the 4Queens problem [12]. ...... 121
57 Knight's graph showing all possible paths for a knight's tour on a standard 8 × 8 chessboard [13]. ...... 121
58 The Xilinx ML605 Evaluation Board. ...... 122
59 Effect of TID on TSU resource requirements. ...... 124
60 Effect of RC size on TSU resource requirements. ...... 125
61 Effect of CSE number on TSU resource requirements. ...... 125
62 Effect of the number of cores on TSU resource requirements. ...... 126

63 MiDAS's performance using the PO-TSU implementation under various numbers of enabled cores and problem sizes. ...... 129
64 Performances of MiDAS using the PO-TSU vs. MiDAS using the AO-TSU under the Cholesky and LU benchmarks. ...... 131
65 Comparing the DynamicSM's latencies of PO-TSU and AO-TSU. ...... 132
66 Per-component resource utilization and power consumption of MiDAS using PO-TSU. ...... 133
67 Per-component resource utilization and power consumption of MiDAS using AO-TSU. ...... 133
68 Per-component power consumption of the PO-TSU and AO-TSU. ...... 134
69 Hardware vs. Software TSU on Synth's versions that do not use an SM implementation. ...... 137
70 Hardware vs. Software TSU on Synth's versions that use an SM implementation. ...... 137
71 Performance scalability of FREDDO for different number of computation cores (Kernels) and problem sizes. ...... 140
72 FREDDO vs. OmpSs on an AMD node using 32 cores. ...... 141
73 FREDDO vs. OpenMP on an AMD node using 32 cores. ...... 142
74 Strong scalability and problem size effect on the AMD system using FREDDO+CNI (MS=Matrix Size, SP=Single-Precision, DP=Double-Precision, K = 2^10, M = 10^6). ...... 144
75 Strong scalability and problem size effect on the CyTera system using FREDDO+MPI (MS=Matrix Size, SP=Single-Precision, K = 2^10). ...... 145
76 FREDDO+CNI vs. FREDDO+MPI on AMD for the 4-node configuration. P1, P2 and P3 indicate the smaller, medium and largest problem sizes, respectively. ...... 147
77 FREDDO+MPI vs. MPI on CyTera. ...... 148
78 FREDDO+MPI vs. MPI on AMD. ...... 148
79 FREDDO+CNI vs. DDM-VM on AMD (MS: 32K × 32K). ...... 149
80 FREDDO+MPI vs. OmpSs@Cluster on CyTera (MS: 60K × 60K). ...... 150
81 Network traffic analysis: FREDDO against DDM-VM on the AMD system, for the 4-node configuration and the largest problem size (32K × 32K). ...... 151
82 Tile size effect on the AMD and CyTera systems using FREDDO. ...... 152
83 Future distributed data-driven many-core implementation. ...... 157

Chapter 1

Introduction

The sequential model of execution has dominated digital computing since the early 1940s. Chip designers were using the exponentially increasing number of transistors (predicted by Moore's Law [14]) to improve the performance of single-chip processors, by increasing the clock frequency and designing more complex processors with larger cache sizes and sophisticated hardware mechanisms such as pipelining and out-of-order execution. Nevertheless, the memory, power and Instruction Level Parallelism (ILP) walls have slowed down uniprocessor performance [1]. As a result, the entire industry has shifted from single-core-based to multi-core-based systems to expand performance envelopes [1, 15]. Multi-core processors utilize multiple cores on the same die in order to achieve higher performance. These cores usually have lower frequencies and are simpler than traditional monolithic designs, leading to power-efficient systems. Ongoing technology trends dictate that multi-core chips will continue to accommodate an increasing number of cores in the years to come as a means of achieving constantly increasing performance through parallel execution, a trend that escalates toward the many-core paradigm. Currently, multi-core architectures dominate the High Performance Computing (HPC) field, from shared memory systems to large-scale distributed memory clusters (e.g., supercomputers) [16].

1.1 Motivation

The switch to multi-core architectures has elevated concurrency/parallelism to the main source of high performance [17]. Programming of such systems is mainly done through parallel extensions of the sequential model, like MPI [18] and OpenMP [19]. These extensions do facilitate high-productivity parallel programming, but they also suffer from the inability to tolerate long memory latencies and waits due to synchronization events [20, 21, 22]. As a result, the computational resources now available in multi-core-based systems, such as single-chip shared-memory processors and distributed multi-core clusters, are not efficiently utilized [16].


Indeed, high-performance linear algebra software libraries, like LAPACK [23], have shown limitations on multi-core architectures since their parallelism is based on the expensive fork-join paradigm [24, 25, 26]. Such libraries are basic components of the traditional software stack and thus should be redesigned to take advantage of the available on-chip resources. Furthermore, realistic applications running on current supercomputers typically use only 5%-10% of the machine's peak processing power at any given time [21]. Even worse, as the number of cores inevitably increases in the coming years, the fraction that can be kept busy at any given time can be expected to plummet [21]. As such, new programming/execution models need to be developed in order to efficiently utilize the resources of multi-core architectures. We propose the implementation of such programming/execution models based on data-flow [27, 28, 29, 30].

Systems based on data-flow/data-driven execution have several advantages over the sequential model of execution: (i) they allow asynchronous data-driven execution of fine-grained tasks/threads; fine-grain programming models have a great potential to efficiently use the underlying hardware [31, 32, 22, 33], (ii) they can expose the maximum degree of parallelism in a program, since the data-flow model only enforces true data-dependencies [34], (iii) they can handle concurrency and tolerate memory and synchronization latencies efficiently [20] and (iv) they can exclude power-hungry modules like out-of-order execution (since only true data-dependencies remain) and utilize non-coherent memory hierarchies [10]; this can lead to simpler and more power-efficient designs. Thus, data-flow-based systems can be used to efficiently exploit the computing power of current and future multi-core architectures.

Data-flow systems can be realized in both software and hardware, or can be provided as simulated hardware implementations. Software data-flow systems are provided mostly in the form of virtual machines and runtime libraries [35, 36, 37, 38, 39, 40]. Real hardware data-flow implementations are provided as Application-Specific Integrated Circuits (ASICs) [29, 41, 42, 43, 44, 45] or are implemented in Field Programmable Gate Arrays (FPGAs) using hardware description languages [46, 47, 48, 49, 50, 51]. Finally, simulated data-flow hardware systems are implemented and evaluated using software simulators [52, 53, 54, 55, 56, 9, 4]. Software data-flow implementations can be used to allow data-flow/data-driven concurrency on conventional/commodity multi-core systems. On the other hand, a hardware implementation can deliver the ultimate performance compared to a software implementation with the same functionalities [4, 57].

1.2 Thesis Statement

In this thesis, we address the problem of finding a suitable execution model that efficiently utilizes the resources of multi-core-based systems, through software and architectural techniques.

1.3 Approach

We propose a paradigm shift to a hybrid control-flow/data-flow model, the Data-Driven Multithreading (DDM) model [55], as the basis for an execution model that efficiently exploits the resources of multi-core architectures. DDM is a non-blocking multithreading model which combines dynamic data-flow concurrency with efficient sequential execution on conventional processors. The core of the DDM model is the Thread Scheduling Unit (TSU) [58], which is responsible for scheduling threads at runtime, based on data availability (an illustrative sketch of this scheduling principle follows the list below). The goal of this thesis is twofold:

1. Design, implement and evaluate a real hardware data-flow/data-driven system using commodity processors. We chose to build a real hardware implementation instead of a simulated one, since the former allows a deeper and more accurate analysis of the system as well as better performance. The proposed hardware system aims to help in the development of future high-performance and low-power multi-core systems. Early hardware data-flow prototypes [29, 41, 42], as well as recent ones [45, 59, 54, 60, 53, 56], failed to convince the electronics industry that they were feasible and sustainable systems, as they could not utilize the processor technology of their time; instead, they used customized designs which could not easily become mainstream. However, they inspired several data-flow projects, including our work. We propose a DDM-based hardware system which can be implemented on unmodified commodity microprocessors, thus allowing it to utilize the state of the art in processor design.

2. Design, implement and evaluate an efficient and portable programming framework, based on the DDM model, that enables dynamic data-flow/data-driven concurrency on commodity multi-core architectures (single-chip shared-memory processors and distributed multi-core clusters).
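To make the data-driven scheduling principle concrete, the following minimal C++ sketch mimics the decrement-and-fire bookkeeping that a TSU-like scheduler performs: every thread carries a ready count equal to the number of producers it waits for, and it is moved to a ready queue once that count reaches zero. The class and member names (ToyTSU, addThread, complete) are illustrative assumptions and do not correspond to the actual TSU or FREDDO interfaces.

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical names; this is not the actual TSU interface.
struct ThreadTemplate {
    int32_t readyCount;               // number of producers this thread still waits for
    std::vector<uint32_t> consumers;  // threads to notify when this one completes
};

class ToyTSU {
public:
    void addThread(uint32_t tid, int32_t rc, std::vector<uint32_t> consumers) {
        templates_[tid] = ThreadTemplate{rc, std::move(consumers)};
        if (rc == 0) readyQueue_.push_back(tid);  // no inputs: ready immediately
    }

    // Called when a thread finishes: decrement each consumer's ready count and
    // enqueue the consumers whose inputs are now all available.
    void complete(uint32_t tid) {
        for (uint32_t c : templates_.at(tid).consumers)
            if (--templates_.at(c).readyCount == 0)
                readyQueue_.push_back(c);
    }

    bool hasReady() const { return !readyQueue_.empty(); }

    uint32_t popReady() {
        uint32_t tid = readyQueue_.front();
        readyQueue_.pop_front();
        return tid;
    }

private:
    std::unordered_map<uint32_t, ThreadTemplate> templates_;
    std::deque<uint32_t> readyQueue_;
};
```

For example, a thread registered with a ready count of 2 becomes eligible for execution only after both of its producers have called complete(); execution itself is still ordinary sequential code on a conventional core.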

1.4 Thesis Contributions

The contributions of this thesis can be summarized as follows:

• Contribution 1: the design, implementation and evaluation of DDM's Thread Scheduling Unit (TSU) in hardware, using the Verilog HDL. The TSU was implemented as a fully parameterizable hardware Intellectual Property (IP) core. The hardware TSU was synthesized with different configurations and several results are provided, including FPGA resource requirements, power consumption estimations and latencies (in cycles) of various TSU operations. Finally, the TSU was compared with the Task Superscalar architecture [46] using FPGA resource requirements and macro statistics. The results show that implementing a data-driven model in hardware that dynamically detects inter-task dependencies and constructs the dependency graph at runtime, like Task Superscalar, significantly increases the resource requirements and power consumption.

• Contribution 2: the design, implementation and evaluation of a shared-memory multi-core processor paired with the hardware TSU implementation, called MiDAS. The processor consists of in-order non-coherent processing elements implemented using the Xilinx MicroBlaze soft-core [61]. MiDAS was prototyped and evaluated on a Xilinx ML605 Evaluation Board, that is equipped with a Xilinx Virtex-6 FPGA [62], using benchmarks with different characteristics. The benchmarks were developed in C/C++ using a software Application Programming Interface (API). The API allows users to manage the processor's hardware peripherals (timers, interrupt controllers, memory controller, etc.) and to communicate with the TSU. Finally, FPGA resource requirements and power consumption estimations of the MiDAS system are provided.

• Contribution 3: the design, implementation and evaluation of FREDDO. It is an efficient and portable object-oriented implementation of the DDM model [55]. FREDDO is a C++ framework that supports efficient data-driven execution on conventional multi-core clusters.

1. FREDDO is a high-performance implementation of DDM that achieves better performance than similar state-of-the-art systems.

2. It extends the programming interface of DDM with the object-oriented programming (OOP) paradigm. This allows DDM applications to benefit from OOP concepts such as Data Abstraction, Encapsulation and Inheritance.

3. It evaluates the DDM model on HPC systems. Particularly, FREDDO was evaluated on an open-access 64-node Intel HPC system with a total of 768 cores. The DDM model was previously evaluated only on very small distributed multi-core systems with up to 24 cores using the DDM-VM implementation [63].

4. It provides simple mechanisms/optimizations to reduce the network traffic of distributed DDM applications.

5. It utilizes a connectivity layer with two different network interfaces: a Custom Network Interface (CNI) and MPI [18]. The CNI support allows a direct and fair comparison with frameworks that also utilize a custom network interface (e.g., DDM-VM [63]), whereas the MPI support provides portability and flexibility to the FREDDO framework.

6. FREDDO is the first open-source implementation of the DDM model. It is publicly available for download at https://github.com/george-matheou/freddo-project, under the GNU General Public License v3.0.

• Contribution 4: recursion support for the DDM model. This functionality is based on two different data-flow models, the U-Interpreter [64] and Packet Based Graph Reduction (PBGR) [65, 66, 67]. The recursion support was implemented within the FREDDO framework, which allows single-node and distributed execution of recursive algorithms.

1.5 Thesis Outline

This thesis is organized as follows. Chapter 2 presents the related work of this thesis. It introduces the historical background of data-flow/data-driven computing and presents recent research and projects that are relevant. Also, the Data-Driven Multithreading (DDM) model [55] and its implementations are presented. Chapter 3 presents the MiDAS system. Chapter 4 describes the single-node and distributed FREDDO implementations. Chapter 5 describes the programming methodology with an emphasis on the description of the APIs used in FREDDO and MiDAS. The recursion support of the DDM model is presented in Chapter 6; it also presents the new functionalities implemented in FREDDO's API in order to allow parallel/distributed execution of recursive algorithms under the DDM model. Chapter 7 presents the evaluation results for the FREDDO framework and MiDAS. The benchmark suite used in this thesis as well as the hardware environment used in our experiments are also presented. Finally, Chapter 8 concludes this thesis and provides directions for future work.


Chapter 2

Related Work

2.1 Introduction

In this chapter we present a representative subset of the related work relevant to the concepts of this thesis. Section 2.2 discusses the factors that led to the development of multi-core architectures, whereas Section 2.3 briefly describes the Field Programmable Gate Array (FPGA) technology. Following that, we introduce the data-flow model of computation (Section 2.4), which was proposed by Jack Dennis [27, 68] in 1974 as an alternative to the control-flow model of execution. The evolution of data-flow architectures is described, from pure data-flow (Static, Dynamic and Explicit Token Store) to hybrid data-flow/control-flow architectures. Data-flow principles are currently used in modern processor architectures (e.g., out-of-order execution [1] and non-blocking threads) and compiler technologies (e.g., register renaming). Several research projects use today's mature hardware technology and apply the data-flow paradigm at a coarser-grained level in order to minimize the hardware requirements and make their implementation feasible. In Section 2.5 we give an overview of recent software and hardware data-flow projects. Section 2.6 presents the Data-Driven Multithreading (DDM) model of execution [55]. Finally, the concluding remarks are presented in Section 2.7.

2.2 The shift to the multi-core era

In previous decades, the performance of microprocessors increased due to advances in integrated circuit (IC) technology [15]. By increasing the integration density (more and faster transistors on the same chip), it is possible to achieve higher clock rates, use larger caches and extract more instruction-level parallelism (ILP) by implementing more sophisticated hardware mechanisms [69]. Such mechanisms include pipelining, out-of-order execution (or dynamic scheduling), branch prediction and superscalar execution. Although the aforementioned techniques significantly improved the performance of single-chip microprocessors, they have led to the memory, power and ILP walls.


Advances in the speed of commodity CPUs have far outpaced advances in memory (DRAM) latency. Thus, main-memory access has become a performance bottleneck for many computer applications, a phenomenon that is widely known as the memory wall [70]. Several techniques were proposed to reduce the negative effect of the memory wall, like software and hardware prefetching and on-chip caches. The power wall is the trend of consuming exponentially increasing power due to the complexity of the designs; this has led to the generation of more heat inside the chips. The ILP wall refers to the increasing difficulty of finding enough parallelism in a single instruction stream to keep a high-performance single-core processor busy.


Figure 1: Growth in processor performance since the mid-1980s [1].

Figure 1 depicts the growth in processor performance since the mid-1980s, as measured by the SPECint benchmarks. Prior to the mid-1980s, processor performance growth averaged about 25% per year. From 1986 to 2001, performance grew by about 52% per year due to more advanced architectural and organizational ideas. By 2002, the limits (walls) mentioned above had slowed uniprocessor performance growth to about 20% per year. A lot of techniques were proposed to exploit better performance (e.g., larger caches, wider pipelines, multilevel speculation, etc.), but this started to result in diminishing returns, i.e., little improvement was obtained for the additional design complexity. As such, the entire industry shifted to the Chip-Multiprocessor (CMP) or Multi-core Processor paradigm, i.e., the use of multiple processors per chip rather than faster uniprocessors, in order to achieve higher performance [1].

Multi-core Architectures

Multi-core architectures can be classified as homogeneous and heterogeneous. Examples of homogeneous multi-cores include AMD's Phenom II X4, Intel's i5 and i7, AMD's Phenom II X6 and Intel's Xeon E7-2820. These processors are equipped with four to eight cores per chip. Two representative heterogeneous multi-core systems are the Cell microprocessor [71] and the Parallella system [72]. The Cell microprocessor consists of one PPE (Power Processing Element) core and eight SPE (Synergistic Processing Element) cores. The Parallella system is a high-performance computing platform which consists of a dual-core ARM-A9 Zynq System-On-Chip and a 16-core Epiphany multi-core coprocessor.

MATHEOU Figure 2: MPPA many-core architecture Figure 3: The architecture of the MPPA (C = Compute Cluster) [2]. Compute Cluster [2].

Many-core Architectures

Multi-core systems improve power efficiency and performance by exploiting more parallelism at lower clock rates [73]. Many-core architectures extend this trend by combining many simple cores on a single processor chip. Usually, in many-core architectures, the cores are connected via an on-chip network that enables communication and data exchange between the cores. Examples of many-core architectures are: Intel's Single-chip Cloud Computer (SCC) experimental 48-core processor [74], Kalray's MPPA-256 processor [75], which consists of 256 user cores and 32 system cores, and Intel's Xeon Phi coprocessor [76], which consists of 61 cores. Figure 2 depicts the architecture of the MPPA-256 processor, which contains 16 compute clusters. Each cluster contains 16 processing engine (PE) cores and a system core (Figure 3).

2.3 Field Programmable Gate Arrays (FPGAs)

Field Programmable Gate Arrays (FPGAs) are semiconductor devices designed to be configured by a customer/designer after manufacturing [77]. FPGAs can be used to implement any logical function that an Application-Specific Integrated Circuit (ASIC) could perform. An ASIC is an integrated circuit (IC) customized for a particular use. An FPGA is based around a matrix of configurable logic blocks (CLBs) connected via programmable interconnects. The FPGA configuration is generally specified using a hardware description language (HDL), such as VHDL [78] or Verilog [79]. Compared to ASICs, FPGAs offer many design advantages, including rapid prototyping, shorter time to market, the ability to re-program and lower Non-recurring Engineering (NRE) costs.

The block diagram of a platform FPGA is depicted in Figure 4. A typical FPGA layout is an array of CLBs, which implement combinational and sequential logic. The CLBs sit in a "sea" of interconnect wires. Interconnects between wires are programmed by turning on/off transistors at the wire junctions. Input/output from the FPGA is handled via I/O Blocks, which themselves also contain sequential logic circuitry. Additionally, FPGAs contain Block RAMs, i.e., dedicated dual-port memories which contain several kilobits of RAM.

Figure 4: A high-level view of a platform FPGA [3].

The most basic element of an FPGA is the logic cell (LC), which contains a small lookup table (LUT), a D flip-flop [80] and a 2-to-1 mux. A K-input LUT [81] is a digital memory that can implement any boolean function of K variables. The K inputs are used to address a 2^K-by-1 memory that stores the truth table of the boolean function. A K-input LUT can also be configured as a 2^K-by-1 static RAM (SRAM) or as a 2^K-bit shift register. Several LCs, along with special-purpose circuitry (e.g., an adder/subtractor carry chain), form a slice. Two or more slices are grouped to form a CLB. For example, in 7-series Xilinx FPGAs, 6-input LUTs are used, where four LCs form a slice and two slices form a CLB. Additionally, FPGAs include special-purpose function blocks, such as Digital Signal Processing (DSP) Blocks, embedded processors (e.g., IBM PowerPC) and Digital Clock Managers (DCMs) [3].

To conclude, the FPGA is an interesting technology that is used in many areas, such as Aerospace and Defence, ASIC Prototyping, Automotive, Consumer Electronics, High Performance Computing and Data Storage, Medical, Security and Image Processing. Recently, processor vendors, like Intel, have become interested in integrating conventional processors along with FPGA devices onto the same chip [82]. It is believed that such hybrid CPU-FPGA architectures will help customers to drive performance while holding down power consumption.
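The LUT-as-truth-table idea described above can be illustrated with a small software model: the 2^K-bit truth table is the LUT's configuration, and evaluation is a single memory lookup addressed by the K input bits. This is only an illustrative C++ analogy (all names here are made up); it is not how the thesis models or configures FPGA logic.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Toy model of a K-input LUT: one output bit per input combination,
// addressed directly by the K input bits (K = 6, as in 7-series Xilinx devices).
template <std::size_t K>
struct Lut {
    std::bitset<(1u << K)> truthTable;  // the LUT "configuration"

    bool eval(std::uint32_t inputs) const {
        return truthTable[inputs & ((1u << K) - 1)];  // low K bits form the address
    }
};

int main() {
    // Configure a 6-input LUT to implement AND of inputs 0 and 1 (other inputs ignored).
    Lut<6> lut;
    for (std::uint32_t i = 0; i < (1u << 6); ++i)
        lut.truthTable[i] = (i & 0x3) == 0x3;
    return lut.eval(0b000011) ? 0 : 1;  // both inputs set -> function evaluates to 1
}
```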

2.4 The Data-flow model of execution

The data-flow model of execution was proposed by Jack Dennis [27, 68] in 1974 as an alternative to the control-flow model of execution. In this model, instructions are scheduled dynamically at runtime based on data availability. An instruction becomes executable as soon as all of its input operands are available to it. A data-flow program is a directed graph consisting of nodes and arcs. The nodes represent the instructions of the program, while the arcs represent the data dependencies among instructions. During the execution of a data-flow program, data/values propagate along the arcs of the graph in data packets, called tokens. This flow of tokens enables some of the nodes/instructions and fires them.

Imperative programs rely on a program counter (PC) to control sequential execution without regard for data (i.e., instructions are executed in sequence). In contrast, in the data-flow paradigm data is the central controller of execution. In the execution model of a data-flow language [83, 84, 85, 86], each node within the graph is capable of executing whenever data is available on its input arcs. If several instructions become fireable at the same time, they can be executed in parallel. This principle provides the potential for massive parallel execution at the instruction level [87, 34]. Thus, data-flow computers allow fine-grain concurrency. Data-flow architectures can be classified as pure data-flow architectures (static, dynamic and explicit token store) and hybrid data-flow architectures [88].
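The firing rule described above can be sketched in a few lines of C++: a node holds the tokens that have arrived on its input arcs and fires as soon as both are present, sending its result token along its output arc. The small graph below evaluates (a + b) * (c - d); the structure and names are illustrative assumptions and are not taken from any of the architectures discussed in this chapter.

```cpp
#include <cstdio>
#include <functional>
#include <optional>
#include <vector>

// A node fires as soon as tokens are present on both of its input arcs; the
// result token then flows to one input port of a destination node.
struct Node {
    std::function<double(double, double)> op;
    std::optional<double> in[2];   // tokens waiting on the two input arcs
    int dest = -1;                 // index of the consumer node (-1: graph output)
    int destPort = 0;              // which input arc of the consumer to feed
};

struct Graph {
    std::vector<Node> nodes;
    std::optional<double> output;

    // Deliver a token to a node's input arc and fire the node if it is enabled.
    void send(int node, int port, double value) {
        Node& n = nodes[node];
        n.in[port] = value;
        if (n.in[0] && n.in[1]) {
            double result = n.op(*n.in[0], *n.in[1]);
            if (n.dest < 0) output = result;
            else send(n.dest, n.destPort, result);
        }
    }
};

int main() {
    // Data-flow graph for (a + b) * (c - d): two independent nodes feed a multiplier.
    Graph g;
    g.nodes.resize(3);
    g.nodes[0].op = std::plus<double>();        g.nodes[0].dest = 2; g.nodes[0].destPort = 0;
    g.nodes[1].op = std::minus<double>();       g.nodes[1].dest = 2; g.nodes[1].destPort = 1;
    g.nodes[2].op = std::multiplies<double>();  // result of node 2 is the graph output

    g.send(0, 0, 1.0); g.send(0, 1, 2.0);  // a = 1, b = 2
    g.send(1, 0, 5.0); g.send(1, 1, 3.0);  // c = 5, d = 3
    std::printf("%f\n", *g.output);        // prints 6.000000 = (1+2)*(5-3)
    return 0;
}
```

Note that nodes 0 and 1 have no ordering between them: whichever receives its operands first fires first, which is exactly the instruction-level parallelism the model exposes.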

2.4.1 Pure Data-flow Architectures

2.4.1.1 Static Data-flow Architectures

Static data-flow architectures [68] (also known as Single-Token-Per-Arc) allow at most one instance of a node/instruction to be enabled for firing. In particular, a data-flow node can be executed only when all of its input tokens are available and no tokens exist on any of its output arcs. This rule was applied in order to avoid difficulties and malfunctions when a graph is re-entrant (like a loop body) [89]. Static data-flow architectures implement this rule through Acknowledge Signals, which are shown on the data-flow graph with additional arcs. A token on an Acknowledge arc indicates that the corresponding data arc is empty. A node is enabled when a token is present on each input arc and on each Acknowledge arc. When an instruction is fired, it sends an Acknowledge signal indicating that it is ready to accept a new token (the Acknowledge Signals travel from the consuming to the producing nodes). Several static data-flow architectures have been proposed. Most notable examples are: the MIT Static Data-flow machine [68, 90], the Data-Driven Machine #1 (DDM1) [91], the Language Assignation Unique (LAU) system architecture [92] and the TI Distributed Data Processor [93]. The major advantage of static data-flow is its simplified mechanism for detecting enabled nodes. However, static data-flow has three main drawbacks:

• It is inefficient when dealing with iterative constructs and re-entrancy [89, 28]. Consecutive iterations of a loop can only be pipelined, thus only a limited amount of parallelism is exploited.

• The traffic is doubled due to the Acknowledge tokens.

• Essential programming constructs such as procedure calls and recursion are not supported.

2.4.1.2 Dynamic Data-flow Architectures

Static data-flow limits performance because iterations are executed one at a time. As an alternative solution, the dynamic data-flow model was proposed. It allows simultaneous activation of several instances of a node at runtime. For instance, a loop body in a dynamic data-flow program can be represented as a single node. Multiple instances of the node representing the loop body can be created and executed concurrently at runtime. In dynamic data-flow architectures, the arcs can be viewed as buffers containing multiple data items. The different instances of a node are distinguished using tags. A tag is associated with each token and identifies the context in which that particular token was generated. An actor can be executed when all of its input tokens with identical tags are available. Dynamic data-flow architectures are also called tagged-token data-flow architectures. Three representative dynamic data-flow architectures are listed below (a small sketch of the tag-matching rule follows the list):

• The Manchester Data-flow machine [94, 29]. This project concentrated on constructing a powerful processing element based on dynamic tagging. It demonstrated reasonable performance, i.e., up to 1.2 MIPS (Million Instructions per Second).

• The MIT Tagged-token Data-flow architecture [95, 30], also known as the Tagged-Token Data-flow Architecture (TTDA). TTDA is an evolution of the U-Interpreter [64] in the direction of a realizable architecture. It consists of a number of identical processing elements (PEs) and storage units (I-Structure elements [96, 97]) interconnected by an n-cube packet network. The storage units are addressed uniformly in a global address space. A single PE and a single I-Structure element constitute a complete data-flow computer.

• SIGMA-1 [98, 41]. It was designed to show the feasibility of a fine-grain data-flow computer to achieve highly parallel computation. SIGMA-1 is a large-scale computer consisting of 128 processing elements and 128 structure elements. Data-flow programs were developed in DFC (Data-flow C), a subset of the C language. SIGMA-1 demonstrated a performance of 170 MFLOPS which was about 39% of the theoretical peak performance of 427 MFLOPS.
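A minimal sketch of the tagged-token matching rule, assuming two-operand instructions: each token carries its destination instruction and a tag identifying its context, and an instruction fires only when two tokens with the same <instruction, tag> pair have arrived. The MatchingStore class below is purely illustrative (it uses an ordinary map where real machines used associative or pseudo-associative hardware) and does not model any specific architecture listed above.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <utility>

// A token is addressed to an instruction and carries the tag (context/iteration)
// in which it was produced; two-operand instructions fire only when both operand
// tokens with identical <instruction, tag> keys have arrived.
struct Token {
    uint32_t instruction;  // destination instruction
    uint32_t tag;          // context identifier (e.g., loop iteration)
    int      port;         // 0 = left operand, 1 = right operand
    double   value;
};

class MatchingStore {
public:
    // Returns the pair of operand values when a match completes; otherwise the
    // token is stored to wait for its partner and an empty optional is returned.
    std::optional<std::pair<double, double>> arrive(const Token& t) {
        auto key = std::make_pair(t.instruction, t.tag);
        auto it = waiting_.find(key);
        if (it == waiting_.end()) {   // first operand: wait for its partner
            waiting_.emplace(key, t);
            return std::nullopt;
        }
        Token partner = it->second;   // second operand: the instruction can fire
        waiting_.erase(it);
        return t.port == 0 ? std::make_pair(t.value, partner.value)
                           : std::make_pair(partner.value, t.value);
    }

private:
    // Stands in for the (pseudo-)associative memory searched on every token arrival.
    std::map<std::pair<uint32_t, uint32_t>, Token> waiting_;
};
```

The lookup on every token arrival is precisely the cost discussed next: the associative (or pseudo-associative) search and the storage it implicitly allocates are the main overheads of this model.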

Advantages and Limitations of Dynamic Data-flow Architectures

The major advantage of the dynamic data-flow model is its ability to provide better performance, compared with static data-flow, since it allows multiple tokens on each arc, thereby unfolding more parallelism. However, the dynamic data-flow model has a number of shortcomings [89, 28, 88]:

• Matching tokens incurs a lot of overhead: Performance depends directly on the rate at which the matching mechanism processes tokens. To facilitate matching while considering the cost and availability of a large associative memory, a pseudo-associative matching mechanism was proposed by Arvind. This mechanism requires several memory accesses, which degrades the performance and the efficiency of dynamic data-flow machines.

• Complex resource allocation: A failure in finding a match implicitly allocates memory within the matching unit. Mapping a code-block to a processor places an unspecified commitment on the processor's matching unit. This can result in a deadlock if this resource becomes overcommitted.

• Inefficient data-flow instruction cycle: A typical data-flow instruction cycle involves detecting enabled nodes, determining the operation to be performed, computing the results, generating result tokens and sending the result tokens to destination nodes. Also, the procedure of matching tokens is more complex than simply incrementing a program counter. As such, this model incurs more overheads compared to the control-flow model.

• Handling data structures is a complicated procedure: Several schemes were proposed for handling data structures efficiently [97, 99, 100]. However, the problem of efficiently representing and manipulating data structures was a difficult challenge.

2.4.1.3 Explicit Token-Store Architectures

To overcome the inefficient matching of the dynamic data-flow model, the explicit token store (ETS) approach proposed direct matching [89]. ETS executes dynamic data-flow graphs directly and was developed within the Monsoon [101, 43] data-flow processor. Later, the ETS principle was also applied in other data-flow machines, like the Epsilon-2 multiprocessor [44]. Monsoon [101, 43] is a large-scale data-flow multiprocessor which was based on the MIT Tagged-Token Data-flow architecture [95, 30]. It consists of several pipelined (8-stage) processing elements (PEs) and I-Structure Memory Interleaves (IS), which are connected via an Interprocessor Network. The ETS architecture eliminates the need for an associative memory by allocating a separate memory frame (called an activation frame) for each activation of a loop or subprogram invocation. The idea is basically to allocate a frame of wait-match storage on each code-block invocation. The activation frame holds the synchronization information of the instructions within the code block. This approach allows the wait-match store to be a fast and directly addressable memory. Access to locations within the activation frame is performed through offsets relative to a pointer to that frame; thus, there is no need for associative memory searching.

Figure 5: An example of the direct matching approach.

A token consists of a value, a pointer to the instruction to execute (IP, the destination instruction), a pointer to an activation frame (FP) and a destination port number. The IP and FP attributes form the tag. The instruction fetched from location IP specifies an opcode (e.g., ADD), the offset in the activation frame where the match will take place, i.e., where its input tokens wait to rendezvous (e.g., FP+3), and one or more destination instructions that will receive the result of the operation (e.g., instructions IP+1 and IP+2). When a token arrives, IP is used first to fetch the instruction from the Instruction Memory. The offset r encoded in the instruction, together with FP, is used to interrogate exactly one location, FP + r, in the wait-match memory. If the slot is empty, the token is deposited there to wait. If the slot is full, the partner token is extracted and the instruction is dispatched. Figure 5 illustrates an example of the direct matching approach, where the token is received in the root node of the graph. As a result, the value 3.01 will be stored at address FP+2 of the Frame Memory.

2.4.1.4 Pure Data-flow Limitations

Pure data-flow architectures, even with the proposed Explicit Token-Store (ETS) approach, do not perform very well with sequential code compared to conventional control-flow architectures. The reasons are the following:

• Inability to efficiently handle complex data structures (e.g., arrays) [102]. Despite the fact that data-flow enables simple arithmetic values to move between instructions easily, problems arise when structured data is to be passed [103]. According to the data-flow semantics, when dealing with structured data types, the entire data structure must be carried in a token. This is because a modification to even one element of an array results in a new data value. Also, if a token contains a pointer to the structure, then the elements of the structure should not be modified (due to single-assignment semantics).

• Excessive overheads due to the per-instruction token matching and fine-grained context switching at the level of each instruction.

• Inefficient use of the pipeline. An instruction of the same thread can only be issued to the data-flow pipeline after the completion of its predecessor instruction.

2.4.2 Hybrid Data-flow Architectures

Hybrid architectures were proposed to address the limitations of the pure data-flow architectures by combining data-flow with control-flow mechanisms/techniques [28, 104, 105]. In hybrid architectures, a node in a data-flow graph is a sequential instruction stream, referred to as a thread of instructions. Since a thread consists of several instructions, data can be stored in registers, the token matching overhead is reduced and pipeline bubbles can be avoided. Hybrid architectures can be classified as threaded data-flow, coarse-grain data-flow or RISC data-flow [106, 107]. Several hybrid architectures were proposed. Most notable examples are: the MIT Hybrid machine [108], the EM-4 architecture [109, 110], the Epsilon-2 multiprocessor [44], the USC Decoupled Graph/Computation (DGC) architecture [111, 112, 113], the McGill Dataflow Architecture (MDFA) [114] and the P-RISC (Parallel-Reduced Instruction Set Computer) [115].

2.4.2.1 Threaded Data-flow Architectures

Threaded data-flow architectures [108, 109, 110, 44] used fine-grained data-flow graphs where each node is a single machine-level instruction. Graphs were analyzed in order to identify sub-graphs that exhibit low levels of parallelism and should always execute in sequence. Each such sub-graph is identified and transformed into a sequential thread of instructions. Such a thread is issued consecutively by the matching unit without matching further tokens, except for the first instruction of the thread. Data passed between instructions in the same thread is stored in registers instead of being written back to memory. This approach improves single-thread performance, since the total number of tokens needed to schedule program instructions is reduced. As a result, this approach saves time and resources.

2.4.2.2 Coarse-Grain Data-flow Architectures

In coarse-grain data-flow architectures [111, 112, 113, 114], data-flow graphs are analyzed and divided into sub-graphs, similarly to the threaded approach. Sub-graphs are compiled into sequential von Neumann processes, which are referred to as coarse-grained nodes or macroactors [28]. A data-flow graph is executed using the traditional data-flow rules, where each macroactor contains an entire function, or part of a function. The macroactors can be programmed in an imperative language, such as C/C++ or Java. Since macroactors are executed according to the data-flow rules, this approach retains the advantages of data-flow while the overheads of fine-grained data-flow are eliminated. Coarse-grain data-flow architectures [111, 112, 114] decouple the token matching stage from the execution stage using FIFO buffers; thus, pipeline bubbles can be avoided by the decoupling. Also, off-the-shelf microprocessors can be used to support the execution stage. One of the earliest architectures that utilize the decoupling principle is the USC Decoupled Graph/Computation (DGC) architecture [111, 112, 113]. The decoupled architecture consists of two basic units, the graph unit and the computation unit, which operate in an asynchronous manner. The graph unit executes all graph operations, i.e., it is responsible for updating the data-flow graph and determining whether a graph node can be scheduled for execution. The computation unit is responsible for the execution of the instructions of the graph's nodes.

2.4.2.3 RISC Data-flow Architectures

RISC data-flow architectures (e.g., P-RISC [115]) support the execution of existing software written for conventional processors [107, 105]. A RISC data-flow architecture uses a RISC-like instruction set, supports multithreaded computation, provides fork and join instructions in order to manage multiple threads, implements all global storage as I-structure storage and implements the load/store instructions to execute in a split-phase mode. RISC-like hybrid multithreaded processors represent the architecture type closest to the von Neumann machines.

2.4.2.4 Non-blocking Multithreaded Architectures

Architectures that have evolved from the coarse-grain data-flow architectures are called non-blocking multithreaded architectures, since a thread is fired only if its data dependencies are resolved. This ensures that a fired thread will execute to completion without encountering long-latency events due to remote memory, communication or synchronization operations. Non-blocking multithreaded architectures have the advantage that they can be built using conventional off-the-shelf microprocessors. Three representative examples of non-blocking multithreaded architectures are: StarT (or *T) [116], EARTH (Efficient Architecture of Running Threads) [52, 117, 118] and TAM (Threaded Abstract Machine) [119, 120].

2.5 Recent Data-flow Developments

2.5.1 Software Implementations

2.5.1.1 Cilk and Cilk Plus

Cilk is a parallel programming extension to the C language that provides keywords (cilk, spawn, sync, etc.) for facilitating parallelism [121]. When the Cilk keywords are removed from the source code, the result is a valid C program, called the serial elision. Cilk uses a fork-join paradigm on top of the existing threading model. The keywords are used to spawn functions as asynchronous parallel tasks and to synchronize among the tasks using a barrier-like join method. Cilk programs are preprocessed to C and then compiled and linked to a runtime library. The runtime manages the scheduling of tasks using a work-stealing scheduling policy. The fork-join approach adopted by Cilk is well-suited for expressing recursive algorithms (e.g., divide-and-conquer algorithms). At runtime, a Cilk program can be viewed as a directed acyclic graph (DAG) that unfolds dynamically as the program executes. A Cilk program consists of a collection of Cilk procedures, each of which is broken into a sequence of threads, which form the vertices of the DAG. Each thread is a non-blocking C function. A thread from a Cilk procedure can spawn a child thread which begins a new child procedure. Return values and other values sent from one thread to another induce data dependencies among the threads, where a thread receiving a value cannot begin until another thread sends the value.

Cilk Plus [122, 123] is a language extension that provides both task and data parallelism constructs. It is a commercial implementation of Cilk provided by Intel and supports both C and C++. Users can use three simple keywords to express task parallelism: cilk_for, cilk_spawn and cilk_sync. Cilk Plus also provides array notations for expressing data parallelism, reducers for eliminating contention for shared data, and SIMD-enabled functions.
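To make the fork-join style concrete, the following minimal Cilk Plus sketch (in C/C++) computes Fibonacci numbers recursively; it is an illustrative example written for this discussion, not code from the cited works, and assumes a compiler with Cilk Plus support.

#include <cilk/cilk.h>   // Cilk Plus keywords: cilk_spawn, cilk_sync
#include <cstdio>

// Recursive Fibonacci: each spawned call may run as an asynchronous task.
long fib(int n) {
    if (n < 2)
        return n;
    long x = cilk_spawn fib(n - 1);  // fork: run fib(n-1) as a parallel task
    long y = fib(n - 2);             // executed by the current strand
    cilk_sync;                       // join: wait for the spawned task
    return x + y;
}

int main() {
    std::printf("fib(30) = %ld\n", fib(30));
    return 0;
}

Removing the keywords (cilk_spawn, cilk_sync) yields the serial elision, i.e., a valid sequential program.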

2.5.1.2 Gupta and Sohi’s Data-flow Approach

Gupta and Sohi [37] have also proposed a software system that implements data-flow/data-driven execution of sequential imperative programs on multi-core systems. In particular, they have implemented a C++ runtime library that exploits Functional-Level Parallelism (FPL) by executing functions on the cores in a data-flow fashion. The system employs multithreading to implement the mechanisms of parallel execution (using PThreads), where a randomized task-stealing policy, similar to that of Cilk-5, is used to balance the load on the system's cores. The runtime determines data dependences between computations/functions dynamically and executes them concurrently in a data-flow fashion. Each function has a data set which includes its input (read set) and output (write set) operands. An operand is actually an object. Since the objects in the data set may be unknown statically, the data set is evaluated dynamically, at runtime, before the function is invoked. As such, an object-based data-flow graph is used. The identity of the objects in the data set is used to establish the data dependences between functions. The runtime determines whether the function currently being processed is dependent on any prior function(s) that are still executing. If not, it is submitted (or delegated) to a core for execution. If so, it is shelved until its dependences are resolved. In either case, the runtime then proceeds to process the next function in the sequential program.

2.5.1.3 SWift Adaptive Runtime Machine (SWARM)

SWARM [38] is a software runtime that uses an execution model based on codelets [22]. The codelet model was based on the EARTH project [52]. A codelet is a collection of instructions that can be scheduled "atomically" as a unit of computation which runs until completion. SWARM divides a program into tasks with runtime dependencies and constraints that can be executed when all runtime dependencies and constraints are met. The runtime schedules the tasks for execution based on resource availability. SWARM utilizes a work-stealing approach for on-demand load-balancing. Furthermore, it allows data-flow execution on single-node and multi-node (cluster) systems. In the latter case, data and work must be directed to specific nodes, i.e., they cannot be distributed automatically under runtime control. There are two options for writing SWARM programs: using a C API or SCALE (SWARM Codelet Association Language Extensions). SCALE is a set of extensions to C that provides a simpler means of declaring and interacting with codelets.

2.5.1.4 StarSs

Star Superscalar (StarSs) is a parallel programming platform that targets a variety of architectures: the Cell processor (CellSs [124, 125]), multi-cores and Symmetric Multiprocessors (SMPSs [39]) and GPUs (GPUSs [126]). StarSs targets automatic function-level parallelism. Parallelism is achieved through hints given by the programmer in the form of pragmas that identify atomic parts of the code that operate over a set of parameters. These parts of the code are encapsulated in the form of functions (called tasks). With these hints, StarSs builds a parallel application that detects the task calls and their inter-dependencies. A task graph is dynamically generated, scheduled and run in parallel using a configurable number of threads. StarSs has two major components, a source-to-source compiler and a runtime system. The compiler translates C code with pragmas into standard C99 code with calls to the supporting runtime library. The runtime takes the memory address, size and directionality (input, output or inout) of each parameter of each task invocation and uses them to analyze the dependencies between the tasks. The runtime schedules the tasks to the different cores when their input dependencies are satisfied. Additionally, the runtime is capable of renaming the data, leaving only the true dependencies (a technique also used in superscalar processors and compilers). StarSs exploits data locality by scheduling dependent tasks sequentially to the same core so that output data is reused immediately. Another feature of StarSs is the task hierarchy, i.e., the instantiation of tasks within tasks [127]. In this case, a given task waits for the end of its children tasks before finishing. This feature can be used to execute parent and children tasks on different architectures (e.g., parent-tasks are executed with SMPSs and children-tasks with CellSs).

2.5.1.5 OmpSs

OmpSs [40, 128] is a programming model that provides the features of StarSs using OpenMP directives. This framework allows data dependencies between tasks to be expressed using the in, out and inout clauses. The Nanos++ runtime system is used to support task parallelism with synchronizations based on data dependencies. Also, the Mercurium source-to-source compiler is used, which recognizes the constructs and transforms them into calls to the runtime system. Additionally, OmpSs can incorporate CUDA and OpenCL kernels in order to provide a single programming environment that covers different homogeneous and heterogeneous architectures. OmpSs was evaluated on different architectures: SMPs, GPUs and hybrid SMP/GPU environments. While OpenMP has a fork-join model, OmpSs defines a thread-pool model where all the threads exist from the beginning of the execution. One of these threads, the master thread, starts executing the user code while the other threads remain ready to execute work when it becomes available. OmpSs provides an extended set of constructs that allow users to specify data dependencies and target devices. Finally, OmpSs was extended to support asynchronous task parallelism on clusters of heterogeneous architectures [129, 130]. The distributed OmpSs implementation (called OmpSs@Cluster) uses the annotation-based programming model to move data across a disjoint address space.
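The sketch below illustrates the annotation style described above; the in/out/inout clause spelling follows the OmpSs extension mentioned in the text (standard OpenMP uses depend clauses instead), and the variables and computations are illustrative only.

#include <cstdio>

int main() {
    double x = 0.0, y = 0.0, z = 1.0;

    #pragma omp task out(x)
    x = 3.0;                 // producer of x

    #pragma omp task in(x) out(y)
    y = x * 2.0;             // waits for x, produces y

    #pragma omp task in(x) inout(z)
    z += x;                  // also waits for x; independent of the task above

    #pragma omp taskwait     // wait for all three tasks to complete
    std::printf("y=%f z=%f\n", y, z);
    return 0;
}

The runtime uses these clauses to build the task graph dynamically, so the two consumer tasks may execute in parallel once x has been produced.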

2.5.1.6 Thread Building Blocks (TBB)

TBB is an API developed by Intel that relies on C++ templates to facilitate parallel programming [131]. It provides a set of data structures and algorithmic skeletons that support the execution of tasks (parallel_for, parallel_reduce, etc.). TBB also provides support for dependency and data-flow graphs. Moreover, a set of concurrent containers (hash-maps, queues, etc.) and synchronization constructs (mutex constructs, atomic operations, etc.) is provided. The TBB runtime implements a task-stealing scheduling policy and adopts a fork-join approach for the creation and management of tasks, similarly to the Cilk approach [121].
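As a minimal illustration of the TBB algorithmic skeletons, the sketch below doubles every element of an array with tbb::parallel_for; the array contents are arbitrary example data.

#include <tbb/parallel_for.h>
#include <vector>
#include <cstdio>

int main() {
    std::vector<float> data(1024, 1.0f);

    // The TBB runtime splits the iteration range into tasks and schedules
    // them with its task-stealing scheduler.
    tbb::parallel_for(std::size_t(0), data.size(), [&](std::size_t i) {
        data[i] *= 2.0f;
    });

    std::printf("data[0] = %f\n", data[0]);
    return 0;
}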

2.5.2 Hardware Implementations

2.5.2.1 Scheduled Data-flow (SDF)

SDF [53, 132] is a multithreaded architecture that decouples the synchronization from the computation of non-blocking threads. It uses data-flow-like synchronization at the thread level and control-flow semantics within a thread. A thread is enabled for execution when it has received all its inputs. SDF allocates a register set and a frame for each enabled thread. Data is pre-loaded into the register set prior to the scheduling of the thread on the execution pipeline. All results are post-stored after the completion of the thread's execution (from the thread's registers into memory). All data needed for the thread, including a Synchronization Count (SC), is stored in the frame. The SC indicates the number of inputs needed for a thread before it can be scheduled for execution. When data is stored for a thread, the SC is decremented, and once it reaches zero, the thread is ready to execute. In that case, the pre-load code of the thread moves the data from the frame into the register set of the thread. The post-store code of a thread stores data from the thread's registers into the frames of its awaiting consumers. A processor in SDF consists of two pipelines: the execution pipeline and the synchronization pipeline. The execution pipeline (4 stages) is responsible for executing threads. It behaves much like a conventional pipeline (e.g., MIPS) while retaining the primary data-flow properties (single assignment and flow of data from instruction to instruction). This eliminates the need for complex hardware for: (i) detecting write-after-read (WAR) and write-after-write (WAW) dependencies and (ii) register renaming. Also, unnecessary thread context switches on cache misses are eliminated. The synchronization pipeline (6 stages) handles the pre-load and post-store instructions, i.e., it performs all memory accesses. A separate Synchronization Unit (SU) is provided which is responsible for scheduling the non-blocking threads and for allocating memory frames, synchronization counters and register sets. SDF faces a major issue that limits the usability and scalability of the architecture [56, 133]. Each pipeline must be able to communicate with the local memory, register file and control logic in one cycle.

This is a reasonable assumption as long as the architecture has few pipelines per processor. However, growth in the number of pipelines limits the scalability of SDF, since it becomes difficult to achieve single-cycle communication between the components.

2.5.2.2 Decoupled Threaded Architecture - Clustered (DTA-C)

DTA-C (Decoupled Threaded Architecture - Clustered) [56] is an SDF-based architecture which tries to improve scalability by clustering resources and balancing the workload among the processing cores. All clusters in the architecture share the same structure and can be considered high-level tiles. This cluster property requires the use of a fast interconnection network inside each cluster (intra-cluster network) and the use of a slower but more complex network for connecting all clusters (inter-cluster network). Internally, each cluster consists of one or more processing elements (PEs) and a Distributed Scheduler Element (DSE). The set of all DSEs constitutes the Distributed Scheduler (DS), which is responsible for assigning threads at runtime. Each PE contains pipelines, a frame memory (a small on-chip memory), a register file and a Local Scheduler (LS). The LS is responsible for communicating with other processors and clusters (it serves the requests for new resources and for data communication). DTA-C implements a two-level scheduling scheme, handled by the LS inside each PE and by the DSE inside each cluster. The scheduling mechanism is responsible for assigning frames to the threads and for balancing the load in the system.

2.5.2.3 Explicit Data Graph Execution (EDGE)

Explicit Data Graph Execution (EDGE) [134] proposes a new instruction set architecture (ISA) that supports direct instruction communication, i.e., the hardware is responsible for delivering a producer instruction's output directly as an input to a consumer instruction, rather than writing it back to a shared namespace, such as a register file. Using this direct communication from producers to consumers, instructions are executed in a data-flow order (each instruction is fired when its inputs are available). An EDGE ISA provides a richer interface between the compiler and the micro-architecture. Specifically, the ISA directly expresses the data-flow graph that the compiler generates internally. This approach exposes a higher degree of concurrency and achieves a more power-efficient execution since complex hardware is not required to discover data dependencies dynamically at runtime. An EDGE program is partitioned by the compiler into hyperblocks comprising a large number of instructions. A major difference between EDGE and other RISC/CISC architectures is that in EDGE, the instructions specify only their targets or consumers (instead of specifying their source operands). Each hyperblock is executed atomically in parallel on an array of functional units (ALUs).

TRIPS: An EDGE Architecture

TRIPS (Tera-op, Reliable, Intelligently adaptive Processing System) [45, 134, 59] is an instance of the EDGE architecture. It combines control-flow execution across hyperblocks of code, consisting of up to 128 instructions, with data-flow execution inside the blocks; the hyperblock defines the block granularity. TRIPS uses block-atomic execution, i.e., each hyperblock is fetched, executed, and committed atomically, similar to the conventional notion of transactions. This approach allows TRIPS to support conventional languages such as C, C++, or Fortran. The TRIPS micro-architecture behaves like a conventional processor with sequential semantics at the block level (each block behaves as a "megainstruction"). Inside the executing blocks, a fine-grained data-flow model, based on direct instruction communication, is used to execute the instructions quickly and efficiently. The TRIPS prototype contains two processing cores, each of which is a 16-wide out-of-order issue processor that can support up to 1024 instructions in flight. Each processor core consists of 16 execution nodes (a 4x4 array) connected by a lightweight network. An execution node contains a fully functional ALU and 64 instruction buffers. The compiler builds 128-instruction blocks, organized into groups of eight instructions per node at each of the 16 execution nodes. The locations of all instructions are statically determined by the compiler; however, the processor fires each instruction dynamically. When a block is fetched and mapped, the processor fetches its instructions in parallel and loads them into the instruction buffers at each ALU in the array. TRIPS exposes more instruction-level parallelism by allowing up to eight blocks to execute concurrently. A scheduler is responsible for: (i) placing independent instructions on different ALUs to increase concurrency, thereby reducing the probability of two instructions competing to issue on the same ALU in the same cycle, and (ii) placing instructions near one another to minimize routing distances and thus communication delays. To conclude, TRIPS can achieve power-efficient out-of-order execution across an extremely large instruction window (1024 instructions: eight 128-instruction blocks) because it eliminates many of the power-hungry structures found in traditional RISC implementations. However, the overall performance depends on the compiler.

2.5.2.4 WaveScalar

WaveScalar [54, 60] is a tiled architecture, similar to EDGE [134], which provides a data-flow ISA and execution model targeting scalable low-complexity/high-performance processors. It consists of a large number of processing elements (PEs) surrounded by intelligent cache banks that hold the current working set of instructions. Instructions are executed in place and communicate explicitly with their dependent instructions in a data-flow fashion (i.e., they send their results to their dependent

instructions). WaveScalar can execute programs written with conventional von Neumann-style memory semantics (like C/C++) by using the wave-ordered memory technique, which guarantees correct memory ordering. The WaveScalar compiler breaks the control-flow graph of a program into single-entrance directed acyclic blocks of instructions, called waves. An example of a wave is a loop iteration. Each wave is tagged via a distributed tagging mechanism using special instructions to distinguish between different dynamic instances of a wave. WaveScalar loads instructions from memory and dynamically binds them to PEs as an application executes, swapping them in and out on demand. Instructions remain in the cache over many invocations. This enables dynamic optimization of an instruction's physical placement in relation to its dependents. Additionally, a highly tuned placement algorithm was implemented that uses depth-first traversal of the data-flow graph to build chains of dependent instructions that execute sequentially at one PE. It then assigns those chains to PEs on demand as the program executes. The authors in [135] investigate the area/performance trade-offs of a tiled data-flow architecture, that of the WaveScalar processor. A synthesizable RTL model and a cycle-level simulator were used. WaveScalar's area efficiency is compared to that of an aggressive out-of-order superscalar processor and to that of Sun's Niagara chip multiprocessor. The main conclusion of this work is that the data-flow nature of WaveScalar provides substantially more performance per unit area and better area scaling compared to the other two systems.

2.5.2.5 Task Superscalar

Task Superscalar [4] is a hybrid data-flow/von-Neumann architecture that implements the StarSs programming model. It combines data-flow execution of tasks with control-flow execution within the tasks. In particular, the Task Superscalar pipeline is an abstraction of an out-of-order superscalar pipeline that operates at the task level. Just as ILP pipelines uncover parallelism in a sequential instruction stream, the Task Superscalar uncovers task-level parallelism among tasks. The StarSs programming model enables programmers to explicitly expose task side-effects by annotating the operands of a task as input, output or inout. With these annotations, the Task Superscalar pipeline dynamically detects inter-task dependencies, constructs the data dependency graph at runtime and dynamically schedules tasks for execution (in an out-of-order manner).

The high-level operational flow of Task Superscalar is illustrated in Figure 6. A task-generating thread sends tasks (non-speculative) to the pipeline frontend for dependency decoding. The pipeline frontend maintains a window of recently generated tasks, for which it generates the data dependency graph (with tasks as nodes and dependencies between tasks as arcs), and uncovers task-level parallelism. The task window may consist of tens of thousands of tasks, which enables it to uncover large amounts of parallelism. Furthermore, the pipeline increases the available parallelism by renaming

Figure 6: High-level view of Task Superscalar (retrieved from [4]).

memory objects, thus breaking anti- and output dependencies. Ready tasks are sent to the execution backend, which consists of a task scheduler, a queuing system, and a many-core fabric. The backend can also function as a regular chip multiprocessor (CMP). Finally, Task Superscalar was evaluated using TaskSim, a trace-driven cycle-accurate CMP simulator. Recently, it was simulated (using the ModelSim simulator) and synthesized on an FPGA device [46].

Figure 7: Computing with Maxeler’s implementation of streaming data-flow cores [5].

2.5.2.6 Maxeler Streaming Data-flow Machines

Maxeler Technologies [5, 47] is a company specializing in data-flow solutions on FPGAs. A Maxeler streaming data-flow machine exploits data-flow computing. Maxeler's data-flow computers focus on optimizing the movement of data in an application and on utilizing massive parallelism between thousands of tiny "data-flow cores" to provide order-of-magnitude benefits in performance, space and

power consumption. Maxeler streaming data-flow machines consist of data-flow engines, which handle the bulk of the computation (as coprocessors), and traditional control-flow CPUs, which are responsible for running the OS, the main application code, etc. In a data-flow application, the program source is transformed into a data-flow engine configuration file, as shown in Figure 7. The configuration file describes the operations, layout and connections of a streaming data-flow engine. Data can be streamed from memory into the chip where the operations are performed. Inside the data-flow engine/chip, data is forwarded directly from one computational unit to another, without being written to the off-chip memory.

2.6 Data-Driven Multithreading (DDM)

The Data-Driven Multithreading (DDM) [6, 55, 136] model of execution was inspired by the Decoupled Data-Driven (D3) model [111, 137]. It is a non-blocking multithreading model that schedules threads on sequential processors based on data availability. Scheduling based on data availability can effectively tolerate synchronization and memory latencies [55, 11]. A DDM program consists of several threads of instructions (called DThreads) that have producer-consumer relationships. The instructions within the DThreads are fetched and executed by the CPU sequentially, in a control-flow manner. This allows the exploitation of a plethora of control-flow optimizations, either by the CPU at runtime or statically by the compiler (pipelining, branch prediction, out-of-order execution, etc.). The core of the DDM model is the Thread Scheduling Unit (TSU) [58], which is responsible for scheduling DThread instances at runtime, based on data availability. For each DThread, the TSU collects meta-data (called Thread Templates) that enable the management of the dependencies among DThreads and determine when a DThread instance can be scheduled for execution. In particular, the TSU schedules a DThread instance for execution when all its producer-instances have completed their execution. This ensures that all the data that this DThread instance needs are available. The DDM model has three basic implementations: the Data-Driven Network of Workstations (D2NOW) [138, 139, 55, 6, 136], the Thread Flux platform (TFlux) [9, 8] and the Data-Driven Multithreading Virtual Machine (DDM-VM) [10, 11, 35, 63].

2.6.1 Context and Nesting attributes

2.6.1.1 Context attribute

DDM allows multiple instances of the same DThread to co-exist in the system through the Context attribute, a 32-bit value that identifies each instance and allows the instances to run in parallel. This is essential for programming constructs such as loops and recursion. The idea is based on the U-Interpreter's tagging system [64], which provides a formal distributed mechanism for the generation and management of the tags at execution time.

The U-Interpreter was used in dynamic data-flow architectures to allow loop iterations and sub-program invocations to proceed in parallel via the tagging of data tokens [140]. Figure 8 depicts a simple example of using multiple instances of the same DThread through the Context attribute. The for-loop shown at the top of the figure is fully parallel, thus it can be mapped to a single DThread. Each instance of the DThread is identified by its Context and executes the body of the for-loop. The for-loop is executed 64 times, thus 64 instances are created, with Contexts from 0 to 63.
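The mapping of the loop in Figure 8 onto DThread instances can be sketched in a few lines of C++. The types and function names below are purely illustrative (they are not the DDM API); the sketch only shows that each iteration corresponds to one instance of the same DThread, identified by its Context value.

#include <cstdint>
#include <vector>

using Context = std::uint32_t;   // 32-bit Context value, as in DDM

// The DThread body: one instance executes one iteration of the parallel loop.
void loop_body(Context ctx, std::vector<int>& A) {
    A[ctx] = static_cast<int>(ctx) * 2;   // the inner command of the for-loop
}

int main() {
    std::vector<int> A(64);
    // Conceptually, the TSU creates and schedules 64 ready instances of the
    // same DThread with Context values 0..63; here they are simply enumerated.
    for (Context ctx = 0; ctx < 64; ++ctx)
        loop_body(ctx, A);
    return 0;
}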

Figure 8: Example of using multiple instances of the same DThread.


Figure 9: Example of a DThread that parallelizes a two-level nested loop. Figure 10: Example of a DThread that parallelizes a three-level nested loop.

2.6.1.2 Nesting attribute

The DDM model allows the parallelization of nested loops that can be mapped into a single DThread by using the Nesting attribute. This attribute is a small number that indicates the loop nesting level for the DThreads that implement loops. In the latest DDM implementations [35, 10, 141, 142] three nesting levels are supported, i.e., the DThreads are able to implement one-level (Nesting-1), two-level (Nesting-2) or three-level (Nesting-3) nested loops. If a DThread does not implement a

loop, its Nesting attribute is set to zero (Nesting-0). The Nesting attribute is used in combination with the Context. The indexes of the loops are encoded into a 32-bit Context value, and the TSU uses the Nesting attribute to manage the Context value properly. An example of a one-level loop is depicted in Figure 8, where a Context value holds an index of the loop. An example of a two-level nested loop is shown in Figure 9. Each instance of the DThread executes the body of the nested loops. A Context value in this case includes the indexes of the inner loop (in the lower 16 bits) and the outer loop (in the upper 16 bits). Similarly, an example of a three-level nested loop is shown in Figure 10. An index of the inner loop is stored in the lower 12 bits of a Context value, an index of the middle loop is stored in bits 12-21 and an index of the outer loop is stored in the upper 10 bits.
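The bit layouts described above can be written directly as shift-and-mask operations. The C++ helpers below are a sketch of that encoding; the field widths follow the description in the text, while the function names are illustrative and not part of the DDM implementation.

#include <cstdint>

using Context = std::uint32_t;

// Nesting-2: inner index in the lower 16 bits, outer index in the upper 16 bits.
Context encode_nesting2(std::uint32_t outer, std::uint32_t inner) {
    return ((outer & 0xFFFFu) << 16) | (inner & 0xFFFFu);
}

// Nesting-3: inner in bits 0-11, middle in bits 12-21, outer in bits 22-31.
Context encode_nesting3(std::uint32_t outer, std::uint32_t middle, std::uint32_t inner) {
    return ((outer & 0x3FFu) << 22) | ((middle & 0x3FFu) << 12) | (inner & 0xFFFu);
}

// Decoding a Nesting-3 Context back into its three loop indexes.
void decode_nesting3(Context ctx, std::uint32_t& outer, std::uint32_t& middle, std::uint32_t& inner) {
    inner  = ctx & 0xFFFu;           // 12 bits
    middle = (ctx >> 12) & 0x3FFu;   // 10 bits
    outer  = ctx >> 22;              // 10 bits
}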

2.6.2 Thread Template

A Thread Template contains the following information:

• Thread ID (TID): an identification number that identifies uniquely each DThread.

• Instruction Frame Pointer (IFP): a pointer to the address of the DThread’s first instruction.

• Ready Count (RC): a value that is equal to the number of the producer-threads of a DThread.

• Nesting: the Nesting attribute.

• Scheduling Policy: the method that is used by the TSU to map the ready DThread instances to the cores.

• Consumer Threads: a list of consumers that is used to determine which RC values will be decreased after a DThread instance completes its execution.

• Data Frame Pointer (DFP): a pointer to the data frame assigned for a DThread.

2.6.3 DDM Dependency Graph

The DDM dependency graph is a directed graph in which the nodes represent the DThread instances and the arcs represent the data dependencies amongst them. Each instance of a DThread is paired with a special value called the Ready Count (RC), which represents the number of its producers. An example of a dependency graph is shown in Figure 11; it consists of four DThreads (T1-T4). Notice that the number inside each node indicates its Context value. T1 is a SimpleDThread, i.e., it has only one instance. T2 and T3 are MultipleDThreads. A MultipleDThread manages a DThread with multiple instances and can be used to parallelize one-level loops or similar constructs. T4 is a MultipleDThread2D, a special type of MultipleDThread, where the Context values consist of two parts, outer and inner. A MultipleDThread2D can be used to parallelize a two-level nested loop, where each Context value holds an index of the outer and the inner loop. In this example T4 consists of 64 instances (with Contexts from <0,0> to <7,7>). Similarly, a user can parallelize a three-level

nested loop by using a MultipleDThread3D, whose Context values consist of three parts: outer, middle and inner.

Figure 11: Example of a DDM Dependency Graph.

The RC values are depicted as shaded values next to the nodes. For example, the instance <3> of T3 has RC=2 because it has two producers, the instances <1> and <2> of T3. All instances of T2 have RC=1 because they are waiting for only one producer, the instance <0> of T1. The RC value is initialized statically and is dynamically decremented by the TSU each time a producer completes its execution. A DThread instance is deemed executable when its RC value reaches zero. In DDM, the operation used for decreasing the RC value is called an Update. Update operations can be considered as tokens that move from producer to consumer instances through the arcs of the graph. Multiple Updates [11] were introduced in order to decrease multiple RC values of a DThread at the same time. This reduces the number of tokens in a DDM graph. For instance, DThread T1 sends a Multiple Update command to DThread T2 in order to spawn all its instances, instead of sending 32 single Updates. A DThread instance can send single and multiple Updates to any other instance of any type, including itself (the only requirement is to have data dependencies). Thus, it is possible to build very complex dependency graphs.
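The Update mechanism can be summarized by a small software model: decrementing the RC of a consumer instance and marking it ready when the RC reaches zero. The sketch below is illustrative only (the data structures and names are not those of the actual TSU), and it assumes that the RC of every instance has been initialized from the dependency graph.

#include <cstdint>
#include <map>
#include <queue>
#include <utility>

using TID = std::uint8_t;
using Context = std::uint32_t;
using Instance = std::pair<TID, Context>;

std::map<Instance, int> ready_count;   // RC per DThread instance (assumed pre-initialized)
std::queue<Instance> ready_queue;      // instances whose RC reached zero

// Single Update: a producer notifies one consumer instance.
void update(TID tid, Context ctx) {
    Instance inst{tid, ctx};
    if (--ready_count[inst] == 0)      // all producers have completed
        ready_queue.push(inst);        // the instance is ready for execution
}

// Multiple Update: decrement the RC of a range of instances of the same DThread.
void multiple_update(TID tid, Context from, Context to) {
    for (Context c = from; c <= to; ++c)
        update(tid, c);
}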

2.6.4 DDM Implementations

2.6.4.1 Data-Driven Network of Workstations (D2NOW)

The first implementation of DDM was the Data-Driven Network of Workstations (D2NOW) [138, 139, 55, 6, 136]. D2NOW is a simulated cluster of distributed machines/workstations augmented with a hardware Thread Scheduling Unit (TSU). The TSU was responsible for handling all the synchronization and communication operations. D2NOW was evaluated using an execution-driven simulator based on native execution. The simulator ran directly on the target processor; the execution of application threads was interleaved with the simulation of the TSU and the interconnection network.

All simulations were carried out on an 800-MHz Intel Pentium III workstation with 256-MB RAM. Finally, the D2NOW implementation was simulated with up to 32 nodes.

D2NOW Architecture

The architecture of D2NOW is depicted in Figure 12. Each node/workstation has a separate TSU. The TSU is an add-on card which is attached to the workstation’s motherboard using a CPU slot. For the communication of the TSUs, D2NOW uses a dedicated communication network.

Figure 12: The D2MATHEOUNOW architecture [6]. Figure 13 illustrates the block diagram of a D2NOW’s node where the TSU is connected with a processing element. The processing element communicates with the TSU via the Ready Queue (RQ) and the Acknowledgment Queue (AQ). The RQ contains information of the DThread instances that are ready for execution (e.g., TID and Context). The AQ contains information about executed DThread instances. The processing element reads the information of the next ready DThread instance from the RQ and executes it. After the processing element completes the execution of a DThread instance, it stores its information into the AQ. The Graph Memory (GM) holds the Thread Tem- plates while the Synchronization Memory (SM) holds the Ready Count values for each DThread instance/invocation. The main functionality of the TSU’s control unit is as follows:

1. Fetches the completed DThread instances from AQ.

2. Finds the consumers of the completed DThread instances from GM.

3. Updates the Ready Count of the corresponding consumer-instances in SM.

4. Stores the consumer-instances that are ready for execution into RQ.
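A simplified software rendering of this control loop is given below. It is a conceptual sketch only (the queues and memories are modeled with ordinary containers and the names are illustrative); it is not the hardware design.

#include <cstdint>
#include <map>
#include <queue>
#include <vector>

struct Instance {
    std::uint8_t  tid;
    std::uint32_t ctx;
    bool operator<(const Instance& o) const {
        return tid != o.tid ? tid < o.tid : ctx < o.ctx;
    }
};

std::queue<Instance> AQ;                            // completed instances (from the cores)
std::queue<Instance> RQ;                            // ready instances (to the cores)
std::map<std::uint8_t, std::vector<Instance>> GM;   // consumers per DThread (Graph Memory)
std::map<Instance, int> SM;                         // Ready Counts (Synchronization Memory)

void tsu_control_step() {
    // 1. Fetch a completed DThread instance from the Acknowledgment Queue.
    Instance done = AQ.front();
    AQ.pop();

    // 2. Find its consumers in the Graph Memory.
    for (const Instance& cons : GM[done.tid]) {
        // 3. Decrement the consumer's Ready Count in the Synchronization Memory.
        if (--SM[cons] == 0)
            RQ.push(cons);   // 4. The consumer instance is ready: place it into the Ready Queue.
    }
}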

Figure 13: A DDM Node [6].

TSU Architecture

The purpose of the TSU is to provide hardware support for data-driven thread synchronization on conventional microprocessors. The TSU is a memory-mapped device and consists of three units: the Network Interface Unit (NIU), the Post Processing Unit (PPU) and the Thread Issue Unit (TIU). The NIU is responsible for the communication between the TSU and the Interconnection Network. The PPU is responsible for updating the Ready Count (RC) of the consumer threads. If a thread is ready for execution, the PPU forwards it to the TIU, which schedules the threads for execution. Figure 14 illustrates the TSU's internal structure as well as its interface to the motherboard.

Figure 14: The TSU’s internal structure [6].

CacheFlow Policy

Although DDM exploits the benefits of the non-blocking multithreading model, scheduling based on data availability may have a negative effect on locality. Data-driven scheduling can lead to irregular memory access patterns, which can negatively affect cache effectiveness, because temporal and spatial locality are not taken into account. To address these issues, the CacheFlow

policy [143] was proposed, which implements data prefetching using data-driven caching policies. The CacheFlow policy ensures that the data of a thread are preloaded in the cache before the thread is fired for execution. Also, it ensures that data preloaded in the cache are not evicted before the corresponding thread is executed, thus reducing possible cache conflicts. Three implementations of the CacheFlow policy were provided (a prefetching sketch is given after the list):

1. Basic Prefetch CacheFlow: the data of the threads that will be scheduled for execution in the near future are prefetched into the cache. When the prefetching for a thread finishes, the thread is placed in the Firing Queue, where it waits for its turn to be executed. The structure of the TIU that supports the Basic Prefetch policy is shown in Figure 15.

2. CacheFlow with Conflict Avoidance: the prefetched data belonging to threads waiting in the Firing Queue are protected from eviction until the corresponding threads are executed.

3. CacheFlow with Thread Reordering: the sequence of executable threads is re-ordered before they enter the Firing Queue, to take advantage of locality.
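The sketch below conveys the intuition of the Basic Prefetch policy in software: before a thread enters the Firing Queue, its data frame is touched so that it is brought into the cache. It is a hedged illustration using a generic compiler prefetch intrinsic and an assumed cache-line size; it is not the TIU implementation.

#include <cstddef>
#include <queue>

struct Thread {
    const char* data;    // base address of the thread's data frame
    std::size_t size;    // size of the data frame in bytes
};

std::queue<Thread> firing_queue;   // threads whose data has been prefetched

// Basic Prefetch: touch the thread's data frame cache line by cache line,
// then enqueue the thread to wait for its turn to execute.
void prefetch_and_enqueue(const Thread& t) {
    constexpr std::size_t line = 64;              // assumed cache-line size
    for (std::size_t off = 0; off < t.size; off += line)
        __builtin_prefetch(t.data + off);         // GCC/Clang prefetch intrinsic
    firing_queue.push(t);
}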


Figure 15: The TIU with the basic prefetch CacheFlow policy [6].

Simulation results on a 32-node system with CacheFlow, for eight scientific applications, have shown a significant reduction in the cache miss ratio. This resulted in a speedup improvement ranging from 10% to 25% (average 18%) when the Basic Prefetch CacheFlow policy was used. A larger increase (14% to 34%, with a 26% average) was observed when the CacheFlow with Conflict Avoidance policy was used. Finally, a further improvement (18% to 39%, with a 31% average) was observed when the CacheFlow with Thread Reordering policy was utilized.

2.6.4.2 Data-Driven Multithreading Chip Multiprocessor (DDM-CMP)

The main target of the DDM-CMP (Data-Driven Multithreading Chip Multiprocessor) [7, 144] architecture was to explore the potential of applying the DDM model to Chip Multiprocessor architectures. In addition to the potential performance, the proposed design studied the power consumption, the hardware cost and several ways to benefit from the particular characteristics of CMP architectures. The DDM-CMP design proved able to deliver not only high performance but also high power efficiency. DDM-CMP examined four alternative implementations of the TSU for space savings: (a) each core has its own TSU, (b) one TSU is shared among two cores and the number of the system's cores increases, (c) one TSU serves all the cores of the chip and the number of the system's cores increases, and (d) one TSU is shared among two cores and the saved space is used to implement an on-chip shared cache. Figure 16 illustrates the four alternatives of the DDM-CMP architecture. For evaluation purposes, a DDM-CMP implementation that utilizes the same hardware budget as the Pentium 4 processor was presented, implementing four Pentium III processors together with the necessary TSUs and the interconnection network. The performance results were derived from the results obtained for the D2NOW implementation [55].

Figure 16: Several alternatives of the DDM-CMP architecture [7].

2.6.4.3 TFlux Parallel Processing Platform

Thread Flux (TFlux) [9, 8] is a parallel processing platform that supports the DDM model on commodity multiprocessor systems. TFlux provides virtualization for the parallel execution of TFlux programs on a variety of computer systems, independently of the underlying architecture. TFlux is composed of a collection of entities in a layered design (Figure 17) that abstracts the details of the underlying machine. The code of a TFlux program consists of ANSI-C and TFlux directives that express the code of the Data-Driven Threads (DThreads) and the dependencies among them. The user's code passes through the TFlux Preprocessor [145], which generates a C program with calls to

the TFlux Runtime Support. The generated code can be compiled using a commodity C compiler. This allows the users to generate binaries for any ISA.

Figure 17: The layered design of the TFlux Platform [8].

The Runtime Support runs on top of an unmodified Unix-based Operating System (OS) and hides the details of the TSU implementation. The functionality of the runtime is supported by simple user-level processes called TFlux Kernels. The main target of the TFlux Kernel is to communicate with the TFlux Scheduler in order to schedule DThreads according to the DDM model. The TFlux Scheduler consists of a group of TSUs (called the TSU Group). The TSU Group is a single unit that is responsible for the scheduling of threads based on data availability. It consists of a global part shared by all Kernels and private TSU units that serve each Kernel individually. TFlux has three basic implementations: TFluxHard, TFluxSoft and TFluxCell. Recently, the TFluxSCC system [146] was proposed, which allows DDM execution on many-core devices, like the 48-core Intel Single-chip Cloud Computing (SCC) processor. TFluxSCC achieves scalable performance using a global address space without the need for cache-coherency support. One major difference between TFluxSCC and the previous TFlux implementations is that it has a non-centralized runtime system, i.e., the TSU functionalities are distributed to the cores of the system.

Data-Driven Multithreading C Pre-Processor (DDMCPP)

TFlux provides the Data-Driven Multithreading C Pre-Processor (DDMCPP) tool [145]. DDMCPP takes as input a regular C program with DDM directives and automatically generates TFlux code, which includes all the necessary calls to the TFlux Runtime System. DDMCPP is divided into two modules, the front-end and the back-end. The front-end module is a parser which parses the DDM directives and passes the information to the back-end module. The back-end module generates the code required for the TFlux runtime support, such as the Kernel code and the load operations to the TSU.

TFluxHard

TFluxHard [9] is a shared-memory Chip Multiprocessor augmented with the TSU Group. TFluxHard was evaluated using the Simics full-system simulator [147], which runs unchanged production binaries of the target hardware at high speed. The TSU Group is a hardware module that is attached to the system's network as a memory-mapped device. The CPU communicates with the TSU via the Memory-Mapped Interface (MMI). The MMI snoops the network and transfers all the memory requests directly to the TSU. Also, it sends information to the network when the TSU wants to communicate with the CPU. TFluxHard groups multiple TSUs (one for each core) into a single unit in order to decrease the additional interconnection cost. Figure 18 depicts a TFluxHard chip configured with 4 cores.

Figure 18: A TFluxHard chip with 4 cores [8].

TFluxSoft

The TFluxSoft [9] implementation targets commodity multi-core processors with a single address space and hardware cache coherency. Figure 19 depicts the execution of TFluxSoft on a multi-core system with n CPUs. The TSU (or TSU Emulator) is implemented as a software module which emulates the functionalities of the hardware TSU. The TSU is executed on one of the cores of the multi-core processor, while the DThreads are executed on the other cores. In TFluxSoft, both the TSU Emulator and the execution of the application's DThreads use POSIX threads. TFluxSoft avoids overloading the core that provides the TSU functionality by splitting the TSU's operations: some operations are executed by the Local TSUs while others are executed by the TSU Emulator.
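The placement scheme described above (one core dedicated to the TSU emulator and the remaining cores executing DThreads) can be expressed with standard POSIX thread affinity calls on Linux. The sketch below only illustrates that placement; it is not TFluxSoft source code, and the thread bodies are left empty.

#include <pthread.h>
#include <sched.h>
#include <vector>

void* tsu_emulator(void*) { /* emulate the TSU scheduling operations */ return nullptr; }
void* worker(void*)       { /* fetch and execute ready DThreads */ return nullptr; }

static void pin_to_core(pthread_t t, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(t, sizeof(set), &set);   // Linux-specific call
}

int main() {
    const int n = 4;                       // assume a 4-core system
    pthread_t tsu;
    pthread_create(&tsu, nullptr, tsu_emulator, nullptr);
    pin_to_core(tsu, 0);                   // core 0 runs the TSU emulator

    std::vector<pthread_t> workers(n - 1);
    for (int i = 0; i < n - 1; ++i) {
        pthread_create(&workers[i], nullptr, worker, nullptr);
        pin_to_core(workers[i], i + 1);    // cores 1..n-1 execute DThreads
    }

    for (pthread_t w : workers) pthread_join(w, nullptr);
    pthread_join(tsu, nullptr);
    return 0;
}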

Figure 19: TFluxSoft system on a system with n CPUs [9].

TFluxCell

The TFluxCell [9] implementation targets the Cell/BE processor. Cell/BE is a heterogeneous multi-core chip processor which consists of nine cores. One of them is a general-purpose processor called the PPE (Power Processor Element), whereas the other eight cores are fully functional SIMD co-processors, each called an SPE (Synergistic Processor Element). TFluxCell executes the DThreads on the SPE cores, while the TSU Emulator runs on the PPE core (as in the TFluxSoft implementation). The TFlux Kernels communicate with the TSU, and vice versa, using DMA calls and mailboxes. Furthermore, the threads exchange data among themselves using a shared buffer in the main memory. When a thread finishes its execution, it stores the produced data into the shared buffer in main memory. After that, the data from the buffer is transferred into the Local Store (LS) of the SPEs of the consumer-threads before they start their execution. Figure 20 depicts the TFluxCell system.

Figure 20: The TFluxCell system [8].

Transactional Memory Support MATHEOU Diavastos et al., in [148, 149], integrated Transactional Memory (TM) support in the TFlux plat- form. TM support provides simplified sharing of mutable data in those circumstances where it is important to the expression of the program. For this purpose the TinySTM software library [150] was used. Additional TFlux directives were introduced for providing TM functionalities at the user level. These directives allow users to define transactional threads and variables. The TFlux runtime system (the TSU) runs on top of the TM runtime library and it is responsible only for scheduling data-flow threads while the TM runtime is only responsible for aborting/committing transactions. The schedul- ing of a data-flow thread may impose the start of a transaction but the TSU will not interfere with the monitoring of the transactions. When a transaction starts executing, the TM runtime will take over and monitor for conflicts. If such a conflict occurs, the TM runtime will reschedule the transaction GEORGEwithout the TSU noticing any changes. As soon as a transaction commits and the data-flow thread is finished, the TSU will take over again and schedule the next ready thread.

2.6.4.4 Data-Driven Multithreading Virtual Machine (DDM-VM)

The Data-Driven Multithreading Virtual Machine (DDM-VM) is a virtual machine that supports DDM execution on homogeneous and heterogeneous multi-core systems. The Thread Scheduling

Unit (TSU) is implemented as a software module and is responsible for scheduling threads dynamically at runtime, based on data availability. A DDM-VM program consists of ANSI-C code with a set of C macros that expand into calls to the DDM-VM runtime. DDM-VM programs consist of two parts: (1) the code of the DThreads and (2) the dependency graph, which describes the consumer-producer dependencies among the DThreads. DDM-VM has three different implementations: the Data-Driven Multithreading Virtual Machine for the Cell processor (DDM-VMc) [11, 10], the Data-Driven Multithreading Virtual Machine for Symmetric Multi-cores (DDM-VMs) [11, 35] and the Distributed Data-Driven Multithreading Virtual Machine (Distributed DDM-VM) [11, 63].


Figure 21: The architecture of the DDM-VMc [10].

DDM-VMc

The architecture of DDM-VMc [10] is depicted in Figure 21. DDM-VMc targets heterogeneous multi-cores with software-managed memory. It was evaluated on the Cell Broadband Engine (BE) processor. The TSU was implemented as a software module running on the PPE core, while the execution of the threads takes place on the SPE cores. The TSU memory structures, i.e., the structures holding the synchronization information and the state of the TSU, are allocated in main memory. DDM-VMc implements the CacheFlow policy [143] in software using the Software CacheFlow (S-CacheFlow) module. S-CacheFlow is a software prefetching cache module that manages data transfers between the main memory and the LS memory transparently to the programmer. Also, the S-CacheFlow module is responsible for managing prefetching automatically. The LS memory is used for storing the DDM threads linked with the runtime library, some of the S-CacheFlow structures and

the data of the threads. The Direct Memory Access (DMA) technology is used for the communication between the runtime on the SPEs and the TSU module that runs on the PPE.

DDM-VMs

DDM-VMs targets homogeneous multi-core architectures. The TSU runs as a software module on one of the cores, while the threads' execution takes place on the other cores, as shown in Figure 22. The DDM threads communicate with the TSU via the main memory. This is because the TSU structures are allocated in main memory, which is shared between the threads and the TSU module. The DDM-VMs implementation was built on the basic functionalities of the DDM-VMc implementation [11], and more specifically on the functionalities of the TSU module. The main differences between the two implementations are the following: the TSU in DDM-VMc runs on the PPE core, which has a separate address space from the SPE cores executing the threads, whereas in DDM-VMs all cores share the same address space. Additionally, in DDM-VMs the memory hierarchy is managed by hardware, while in DDM-VMc the memory hierarchy of the Cell processor is managed by the S-CacheFlow module.

Figure 22: The architecture of the DDM-VMs [11].

Distributed DDM-VM

The Distributed DDM-VM [11, 63] implementation supports DDM execution across a number of multi-core nodes (a cluster) connected over an off-chip network. Each node is an independent multi-core machine running an operating system and capable of executing multiple DDM threads concurrently [11]. A Shared Global Address Space (GAS) was implemented across all the nodes in order to transfer the data produced by a thread to its consumer threads, which may run on

different nodes. Figure 23 illustrates the architecture of the Distributed DDM-VM. Each node has its own TSU module, and the multiple TSUs communicate across the network to coordinate the overall DDM execution. For the communication between the multiple TSUs, DDM-VM implements the Network Interface Unit (NIU) [55] in software. The NIU is responsible for handling the low-level communication operations.

Figure 23: The architecture of the Distributed DDM-VM [11].

Runtime dependency resolution with I-Structures

The previous DDM implementations utilize compile-time dependency resolution, i.e., the programmer is responsible for constructing the dependency graph. However, some dependencies cannot be discovered at compile time and thus the corresponding code has to be executed serially. To this end, DDM-VM introduced a run-time dependency resolution protocol [36], based on the I-Structure model [96, 97]. The basic mechanism of the run-time dependency resolution is as follows: a task/thread exposes its input and output data to a scheduler which, as the execution of tasks proceeds, examines the output data of each completed task and checks whether it satisfies any of the pending dependencies. When all the dependencies of a task are satisfied, the task is ready for execution. This approach can handle all programs but incurs extra overheads at runtime, even when part of the dependencies could be determined statically at compile time.

DDM-VM programming tool-chain

The DDM-VM programming tool-chain provides the following methods for developing DDM applications:

• C macros: a set of macros is used to develop a DDM-VM program in the C language. The macros identify the boundaries of the threads and the producer-consumer relationships amongst the threads, and they expand into calls to the TSU in order to manage the execution of the program according to the DDM model (an illustrative sketch of this style is given after this list).

• TFlux Directives: the DDMCPP [145] tool was extended to generate code that targets the DDM-VM system. The tool takes as input a regular C program with DDM directives and automatically generates C code augmented with the DDM-VM macros.
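To convey the flavor of the macro-based style, the toy sketch below uses entirely hypothetical macro names (DDM_THREAD_BEGIN, DDM_THREAD_END, DDM_UPDATE); the actual DDM-VM macro names and their expansions into TSU calls differ and are defined in the cited works.

#include <cstdio>

// Hypothetical macros for illustration only (not the real DDM-VM API).
#define DDM_THREAD_BEGIN(tid)  void ddm_thread_##tid() {
#define DDM_THREAD_END         }
#define DDM_UPDATE(tid)        ddm_thread_##tid()   /* stand-in for a TSU Update call */

static int value;

DDM_THREAD_BEGIN(2)            // consumer DThread
    std::printf("consumer read %d\n", value);
DDM_THREAD_END

DDM_THREAD_BEGIN(1)            // producer DThread
    value = 42;
    DDM_UPDATE(2);             // notify the consumer that its input is ready
DDM_THREAD_END

int main() { ddm_thread_1(); return 0; }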

2.7 Concluding Remarks

Unsustainable power consumption and ever-increasing design and verification complexity have driven the microprocessor industry to pack multiple cores on a single chip as an architectural solution for sustaining Moore's law [14]. As a result, the major challenge today is to find programming/execution models that are able to efficiently keep all the available resources busy while keeping energy efficiency at high levels. Such a model is the data-flow model of execution [27, 68, 28, 29, 30]. Several research projects have adopted the data-flow principles, but only a few were implemented in real hardware. This is because the development of a real hardware system is an expensive and complicated procedure. On the other hand, a hardware system can deliver the ultimate performance compared to a software implementation or a simulation with the same functionalities. Early hardware data-flow prototypes (Table 1), as well as recent hardware data-flow systems like TRIPS [45, 134, 59], WaveScalar [54, 60] and DTA-C [53, 132, 56], propose new processor architectures; thus, they cannot utilize the current processor technology. Table 2 compares several recent hardware data-flow developments (real and simulated implementations). On the other hand, several software data-flow projects were implemented (Table 3) in order to allow data-driven/data-flow execution on conventional/commodity multi-core and many-core systems. This thesis proposes the implementation of software and hardware data-flow/data-driven systems based on the DDM model of execution for the following reasons:

• DDM allows efficient dynamic data-flow execution, at a coarser-grained level, on conventional processors. Thus, the maximum available parallelism can be exploited, and communication and synchronization latencies can be tolerated.

• The fact that DDM can be implemented on unmodified commodity microprocessors allows utilizing the state-of-the-art in processor design.

• Data-flow-based systems, like DDM, do not require support from an underlying coherent shared-memory system. Data shared among different threads may be shipped to and from the threads before and after their execution. The only requirement is to provide a single address space for all computing cores [151].

Table 1: Early representative hardware data-flow prototypes (from 1974 to 1992).

• DDM implements the CacheFlow policy, which improves locality, thus reducing cache misses even on caches of small sizes.

• The dependency graph of DDM programs is built at compile time (Static Dependency Resolution), thus DDM exploits all the available parallelism at any given time with minimum overheads. Data-flow systems based on Dynamic Dependency Resolution (which build their dependency graph at runtime) expose only a part of the dependency graph, and consequently only a fraction of the concurrency opportunities is visible at any given time. Also, building the dependency graph at runtime incurs extra overheads (extra cycles and power). Systems that support Dynamic Dependency Resolution are StarSs [124, 125, 39], OmpSs [40, 128], and the C++ framework of Gupta and Sohi [37]. Finally, implementing a data-flow system with Dynamic Dependency Resolution in hardware, like the Task Superscalar [4], requires complex modules and a lot of resources (Flip-Flops, LUTs, etc.) [46].


Table 2: Recent hardware data-flow developments (real and simulated implementations).

Table 3: Recent software data-flow developments.

Chapter 3

MiDAS: a Multi-core system with Data-Driven Architectural Support

3.1 Introduction

The DDM model was evaluated through three basic software implementations (DDM-VM [35, 10, 11], TFluxSoft [9] and FREDDO [142]) and two simulated hardware systems (D2NOW [139, 55, 6] and TFluxHard [9, 8]). In this thesis we move to the next step by developing a real hardware system that supports data-driven execution under the DDM model. The TSU was implemented in hardware using Verilog [152]. In order to demonstrate the efficiency and functionality of the hardware TSU implementation, a shared-memory multi-core system, managed by the TSU, was developed. The developed processor is called MiDAS (Multi-core with Data-Driven Architectural Support). A software API (supporting both C and C++) is provided for developing DDM applications. For evaluation purposes, a Xilinx ML605 Evaluation Board is used, which is equipped with a Xilinx Virtex-6 FPGA [62]. In this chapter we present the hardware TSU implementation of the DDM model (Section 3.2) and MiDAS's architecture (Section 3.3). Chapter 5 describes the programming methodology and the software API for developing DDM applications targeting the MiDAS system. Finally, a comprehensive evaluation of the hardware TSU and MiDAS is provided in Chapter 7.

3.2 TSU: Hardware Support for DDM

The Thread Scheduling Unit (TSU) is a fully parameterizable hardware Intellectual Property (IP) core. It uses Thread Templates for the data-driven scheduling of DThreads. A DThread is identified by its Thread ID (TID) and Context and is paired with a Thread Template. Currently, the hardware TSU implementation utilizes only 32-bit Context values. Furthermore, it allows the parallelization of nested loops using the Nesting attribute (as in DDM-VM and FREDDO). In our implementation we allow up to three nesting levels, i.e., the DThreads are able to implement one-level (Nesting-1), two-level (Nesting-2) or three-level (Nesting-3) nested loops. If a DThread does not implement a loop, its Nesting attribute is set to zero (Nesting-0). The Nesting attribute is used in combination with


the Context attribute. The TSU uses the Nesting attribute to manage the Context value properly. More details about the Nesting attribute can be found in Section 2.6.1.2.

3.2.1 Thread Template

The Thread Template is a collection of the following attributes: Thread ID (TID), Instruction Frame Pointer (IFP), Ready Count (RC), Nesting and Scheduling Policy. The TID uniquely identifies a DThread, while the IFP is a pointer to the address of the DThread's first instruction. The RC specifies the number of producers of a DThread. Finally, the Scheduling Policy determines the method that is used by the TSU to map the ready instances of a DThread to the processing elements; it consists of two distinct parts, the Scheduling Method and the Scheduling Value.
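For illustration, the Thread Template can be pictured as a simple record. The C++ struct below sketches the attributes listed above; the field widths are indicative assumptions, not the exact hardware encoding used in the TSU.

#include <cstdint>

// Illustrative sketch of a Thread Template (field widths are assumptions).
struct ThreadTemplate {
    std::uint8_t  tid;            // Thread ID: uniquely identifies the DThread
    std::uint32_t ifp;            // Instruction Frame Pointer: address of the first instruction
    std::uint16_t ready_count;    // RC: number of producers of the DThread
    std::uint8_t  nesting;        // Nesting level (0 to 3)
    std::uint8_t  sched_method;   // Scheduling Policy: the Scheduling Method part
    std::uint16_t sched_value;    // Scheduling Policy: the Scheduling Value part
};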

3.2.2 TSU Micro-architecture

The block diagram of the TSU's micro-architecture, supporting an arbitrary number of cores, is shown in Figure 24. Each block of this diagram comprises a Verilog module that consists of several internal hardware modules. In the following, we elaborate on the functionality of each block of our proposed hardware TSU.

3.2.2.1 TSU-Cores Communication

During DDM operation there are five basic transactions that are carried out when the TSU communicates with the interconnected cores: (1) transmission of Update commands from the cores to the TSU, (2) transmission of ready DThread instances from the TSU to the cores to be executed, (3) storage and removal of Thread Templates to/from the TSU, (4) transmission of control signals from the cores to the TSU, and (5) receipt of acknowledgement (ack) and status signals at the cores from the TSU. Control signals are used to inform the TSU about the number of cores that will be used during DDM operation. This is necessary in cases where fewer cores are chosen to be utilized than the total number of cores available in the architecture, as required when evaluating system scalability. Control signals are also used to invalidate the RC entries of a specific TID in the Synchronization Memory of the TSU, such as when a user deletes a Thread Template. Acknowledgement signals are sent from the TSU to the cores after the completion of data management operations, such as storing/removing Thread Templates and invalidating RC entries. Finally, status signals are sent from the TSU to the cores in order to inform the user about the status of the TSU's data structures, such as the number of Thread Templates stored in the TSU or whether a data structure is full. Operations (1) and (2) are the most frequently carried out operations when DDM is active; thus, a very fast communication system is required to achieve high performance. We endorse the use of low-complexity, high-performance FIFO-based buses to carry out such operations. For the

remaining operations we favor the use of a lightweight bus with memory-mapped communication functionalities, such as read/write from/to control and status registers. In this work we use a Xilinx FPGA and associated software tools to prototype and evaluate our design. As such, we used bus architectures provided by Xilinx to handle the communication between the TSU and the interconnected cores. The FIFO-based buses were implemented using the Fast Simplex Link (FSL) interface [153], while the memory-mapped bus was constructed using the AXI4-Lite interface [154]. FSL is a very fast 32-bit wide interface that provides unidirectional FIFO-based communication between any two design elements and allows single-cycle read/write operations. AXI4-Lite is a simple low-throughput memory-mapped bus, and comprises an implementation of the Advanced eXtensible Interface (AXI) protocol based on the AMBA interface specification from ARM. The AXI4-Lite read/write latency is about 4-8 clock cycles, depending on the chosen configuration [154]. For each core, the TSU utilizes two different FSL buses, the Input FSL Bus and the Output FSL Bus. The Output FSL Bus holds the ready DThread instances that will be executed by the associated core. The Input FSL Bus holds Update commands that are received from the associated core. Three different Update commands are supported:

1. Single Update, which decreases the RC value of a specific instance of a DThread (identified by its TID and Context).

2. Multiple Update, which decreases the RC values of multiple instances of a specific DThread.

3. Simple Update, which decreases the RC value of a DThread with Nesting=0. It does not require a Context value since a DThread with Nesting=0 has only one instance, with Context=0.

Although our first hardware TSU implementation (presented in [141]) supported additional Update commands that decrement the RC value(s) of all the consumer-DThreads of a DThread, such commands are not supported in the current implementation. This is because we observed that these commands, called Consumer Update commands, are rarely used in real-life applications. The reason is that such commands are only applicable when all the Consumer DThread instances (i) should be updated with the same Context value, and (ii) have the same Nesting value. As such, the complex and large hardware modules required to support Consumer Update commands, such as the Graph Memory (GM) [55, 141], are excluded from our current design. Notice that our first hardware TSU implementation (presented in [141]) utilized a GM module consisting of two internal fully associative data-structures where the consumers' TIDs were stored as Linked Lists. GM operations (read, write and invalidate) incurred significant overheads, especially when the number of consumers was large, due to the complex organization of the GM's data structures. A faster GM could be implemented as a direct-mapped data-structure where each entry holds all the consumers of a DThread. This requires defining a maximum number of consumers per DThread (e.g., up to 16).
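For clarity, a sketch of the three Update command formats described above is given below in C++ form. The field names mirror the Update Queue fields (type, TID, Context, Max Context), but the encoding shown is only illustrative; it is not the exact bit-level layout transmitted over the FSL buses.

    #include <cstdint>

    // Illustrative encoding of the supported Update commands (not the exact FSL layout).
    enum class UpdateType : uint8_t {
        Simple,    // Nesting-0 DThread: no Context needed (single instance, Context = 0)
        Single,    // decrease the RC of one instance (TID + Context)
        Multiple   // decrease the RCs of a range of instances (Context .. MaxContext)
    };

    struct UpdateCommand {
        UpdateType type;
        uint8_t    tid;         // target DThread
        uint32_t   context;     // instance identifier (ignored for Simple)
        uint32_t   maxContext;  // upper bound of the range (used only by Multiple)
    };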

Figure 24: Block diagram of the TSU micro-architecture supporting an arbitrary number of cores.

3.2.2.2 Template Memory (TM)

The Template Memory is used by the TSU to access the Thread Templates of DThreads. The block diagram of the TM is illustrated in Figure 25. It is a direct-mapped data-structure where the TID of a DThread is used to index the Thread Templates. Each TM entry consists of the attributes of the Thread Template, as well as the Valid field which indicates whether the entry contains valid data. We note that the TID is not required to be stored in the TM's entries since it is used as the memory address. This configuration allows the TM to store 2^(TID Size) Thread Templates, where TID Size is the size of the TID attribute in bits; for example, with an 8-bit TID, the TM can store up to 256 Thread Templates. The TM utilizes a dual-port RAM which allows writing/invalidating and reading operations to be executed simultaneously. Writing/invalidating operations are used by the TSU's AXI4-Lite Manager to store/remove Thread Templates, while read operations are used by the Update Unit (see Section 3.2.2.6) to execute Update commands. The interconnection between the TM, AXI4-Lite Manager and Update Unit modules is shown in Figure 24.


Figure 25: Block diagram of the Template Memory.

3.2.2.3 Handling Ready Count (RC) values

In DDM, the Synchronization Memory (SM) was introduced in order to manage the RC values for each DThread. A DThread that implements a loop has multiple instances, one for each iteration. The TSU holds a separate entry for each instance of a DThread in the SM. Real-life benchmarks can have complex dependency graphs and several loop constructs. As such, an SM can hold thousands or even millions of RC values. This is an issue for a hardware SM implementation since the hardware resources are limited. Notice that a software SM implementation allocates the RC values in the Main Memory. Additionally, the SM unit is the most critical component of a DDM architecture since its performance affects the execution of Update operations. In this work, the hardware TSU implementation handles the allocation of RC values through three basic techniques outlined next.

1) Managing the RC Values of DThreads with RC = 1

The RC values of DThreads with RC = 1 are not allocated in the SM. The instances of such DThreads are scheduled for execution as soon as Update operations for them are received by the TSU. This reduces SM memory allocations and accelerates the Update operations. This technique has also been adopted by the FREDDO framework.

2) Managing the RC Values of DThreads with RC > 1 using DynamicSM

The RC values of DThreads with RC > 1 are allocated and deallocated dynamically in the form of blocks, called RC Blocks, based on application needs. This enables applications to reuse the RC Blocks, thus saving memory resources. Each RC Block is associated with a counter which indicates the number of valid RC entries. When an RC Block is allocated the counter is initialized, and every time the value of an entry reaches zero the counter is decremented by one. When the counter of an RC Block reaches zero the RC Block is deallocated, i.e., marked as free. This technique is implemented within the hardware TSU with a special SM, called DynamicSM (see Figure 24), which is based on the DDM-VM's Hybrid SM [35]. To locate the RC Blocks, the DynamicSM uses intermediate entries, called SMI entries. Each SMI entry is associated with an RC Block and holds information about the DThread that owns the RC Block, such as the TID and Context ranges, the RC Block's counter (called Valid Cells #), etc. The DynamicSM stores the SMI entries in the SM Indexer (SMI) and the RC Blocks in the Ready Count Memory (RCM). The RCM holds M RC Blocks of N entries, allowing the DynamicSM to hold M ∗ N RC values. It is organized as a data-structure with N Single Port RAM modules of M entries, such that each entry of an RC Block is stored in a separate Single Port RAM. This configuration allows all entries of an RC Block to be stored simultaneously, thus accelerating write operations. The SMI is divided into L Context Search Engines (CSEs) where each CSE is responsible for managing M/L RC Blocks. CSEs are introduced in order to accelerate the SMI's search operations. The architecture and operations of the DynamicSM are presented in detail in Section 3.2.2.4.

3) Managing the RC Values of DThreads with RC > 1 and Nesting = 0 using StaticSM

The RC values of DThreads with Nesting = 0 and RC > 1 are not allocated in the DynamicSM. Such DThreads have one instance only (with Context = 0); thus, allocating them in the DynamicSM increases complexity, as more states would have to be added to the DynamicSM's Finite State Machine (FSM). Also, RC Blocks can be wasted since a DThread with Nesting = 0 will use only one entry of an RC Block. The RC values of such DThreads are stored in a separate SM implementation, called StaticSM (see Figure 24). The StaticSM is a simple direct-mapped data structure which holds one RC value for each DThread (Thread Template). To support this functionality, the size of the StaticSM

is set equal to the size of the TM. The StaticSM consists of a Single Port RAM and an FSM which is responsible for reading and updating the RC values. To access the Single Port RAM we use direct addressing, where the address is equal to the DThread's TID. If the RC value of a DThread becomes zero, the StaticSM stores the ready DThread along with its scheduling information in the Ready Queue.

3.2.2.4 DynamicSM Implementation: architecture and operations

Accessing an RC value in the DynamicSM is an associative operation that uses part of the Context to locate the RC Block (which owns the RC value), followed by a direct operation using the remaining part of the Context to locate the exact entry in the RC Block. The architecture of the Dynamic Synchronization Memory (DynamicSM) is shown in Figure 26. It consists of a Finite State Machine (FSM) managing the SM Indexer (SMI) and the Ready Count Memory (RCM). The architecture of the SMI and RCM modules and the description of the basic operations of DynamicSM’s FSM are outlined next.

Ready Count Memory (RCM)

The RCM holds the RC Blocks where each RC Block has N entries. The architecture of the RCM is shown in the upper-right side of Figure 26. It consists of N Single Port RAM (SP RAM) modules of M entries. This supports parallel writing into the SP RAMs which accelerates the initialization of the RC Blocks. The RCM implementation allows the storage of M RC Blocks of N entries, i.e., M ∗ N RC values. The RCM module supports the following operations:

• Allocate: allocates an RC Block where each entry is equal to the RC that is stored in the Template Memory (TM). For example, in Figure 26, the entries of the first RC Block are set to 2. The RC of the TM is written to the SP RAMs in parallel.

• Invalid: invalidates an RC Block.

• Write: writes to a specific entry of an RC Block.

• Read: returns the value of an entry of a specific RC Block.

Figure 26: Block diagram of the hardware Dynamic Synchronization Memory.

SM Indexer (SMI)

The architecture of SMI is shown in the top-middle part of Figure 26. It is responsible for holding the SMI entries where each SMI entry corresponds to a specific RC Block. The attributes of an SMI entry are described below:

• Valid: indicates if the SMI entry is valid.

• Valid Cells #: indicates the number of valid entries, i.e., non-zero RC values, of the associated RC Block. Initially, it is equal to N. If its value becomes zero, then the SMI entry and its RC Block will be deallocated.

• TID: indicates the TID of the DThread that owns the RC Block.

• Context Key: keeps a part of the Context based on the Nesting attribute. In the case of Nesting-1, it is equal to 0, indicating that it is not used. In the case of Nesting-2, it holds the outer part of the Context. Finally, in the case of Nesting-3, it holds the outer and middle parts of the Context. The Context Key allows the RC values of consecutive inner iterations to map to the same RC Block. This technique enables the efficient deallocation of RC Blocks when inner loops

with a large number of iterations exist. Since the inner loops are executed first, their RC Blocks will be deallocated before we proceed to the next outer iteration.

• Min Iteration: indicates the minimum Context for which the RC Block holds an RC value. It is equal to (RContext/N) ∗ N (integer division), where RContext is the remaining part of the Context which is not included in the Context Key. For example, if the RC Block holds the RC values of the Contexts spanning from 0 to 31 (we assume that N = 32 and that Nesting=1), then the Min Iteration equals 0.

• Max Iteration: indicates the maximum Context for which the RC Block holds its RC value. It is equal to Min Iteration + N − 1.
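To make the relationship between the Context Key, Min Iteration and Max Iteration concrete, the following C++ sketch models the computation described above. The parameterization by innerBits (the width of the Context part that is kept outside the Context Key) is our own simplification, since the exact bit split of the hardware is configuration-dependent.

    #include <cstdint>

    // Simplified model of how the DynamicSM derives an SMI entry's fields from a Context.
    // innerBits is the width of the part NOT included in the Context Key (e.g., the whole
    // 32-bit Context for Nesting-1, where the Context Key is unused); N is the number of
    // entries per RC Block. The bit widths are illustrative; the hardware split may differ.
    struct SmiKey {
        uint32_t contextKey;    // outer (and middle) loop indexes, depending on Nesting
        uint32_t minIteration;  // smallest RContext covered by the RC Block
        uint32_t maxIteration;  // largest RContext covered by the RC Block
    };

    SmiKey deriveSmiKey(uint32_t context, unsigned innerBits, uint32_t N) {
        uint32_t contextKey = (innerBits < 32) ? (context >> innerBits) : 0;
        uint32_t rContext   = (innerBits < 32) ? (context & ((1u << innerBits) - 1)) : context;
        uint32_t minIteration = (rContext / N) * N;   // integer division, as in the text
        return {contextKey, minIteration, minIteration + N - 1};
    }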

The SMI module is divided into L Context Search Engines (CSEs), where each CSE is responsible for managing M/L RC Blocks. This enables the searching operations to perform faster since each CSE operates on a subset of the SMI entries (and consequently RC Blocks). The architecture of a CSE unit is depicted in the bottom side of Figure 26. A CSE utilizes a Dual Port RAM for storing the SMI entries. This allows simultaneous access to two SMI entries during the searching operation, and thus the processing time is halved. The operations of the CSE module are described below:

• Search: searches for a specific SMI entry, i.e., an entry that matches the input TID and whose Min Iteration and Max Iteration fields bound the input Context. The searching operation is implemented by two independent modules (the Lookup A and Lookup B modules) which use the two ports of the Dual Port RAM simultaneously. If the SMI entry is not found, the CSE module informs the FSM of the SMI. If the SMI entry is found, the address of the SMI entry is returned to the FSM of the SMI.

• Write/Invalid/Allocate: modifies/invalidates/allocates an SMI entry.

• Invalidate All: invalidates all the SMI entries of a specific TID. This is used when the user removes a Thread Template.

DynamicSM’s FSM operations

• Invalidate: removes the SMI and RCM entries that correspond to the input TID signal. The basic input signals of the FSM are depicted in the upper-left side of Figure 26. The user sends the invalidation signals, i.e., TID, enable signal, etc., to the DynamicSM through the AXI4-Lite bus.

• Update: decreases the RC value of a specific DThread with a specific Context. The Update operation consists of three algorithmic steps:

– STEP 1 - Searching: search the SMI module for the SMI entry with TID = input TID, Context Key = the Context Key of the input Context, Min Iteration <=

RContext and Max Iteration >= RContext. The RContext is based on the input Context and the Context Key, as we mentioned earlier. For this operation the CSEs of the SMI are used in parallel. For the searching operation we use a hashing technique based on the TID and Context attributes. If the entry is found, go to STEP 2, otherwise go to STEP 3.

– STEP 2 - Updating: the RC value of the RC Block that corresponds to the SMI entry found in STEP 1 is decreased. In particular, the RC value is fetched from the RCM's SP RAM with ID = RContext − Min Iteration. If the RC value becomes zero, the ready DThread along with its scheduling information is stored in the Ready Queue of the TSU (see Figure 24) and the Valid Cells # attribute is decreased by one. If the Valid Cells # value becomes zero, the SMI entry and its RC Block are invalidated.

– STEP 3 - Allocating: this step is responsible for allocating a new SMI entry and its associated RC Block. To select a new SMI entry among all the CSEs, we use hashing operations. After the allocation, the RC value will be updated, i.e., go to STEP 2.
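A minimal software model of the three-step Update operation is sketched below. The data structures (a vector of SMI entries with attached RC Blocks) stand in for the SMI/RCM hardware, and the hashing used by the real FSM to select CSEs and free entries is omitted; names and sizes are illustrative only.

    #include <cstdint>
    #include <vector>

    // Software model of the DynamicSM Update operation (STEPs 1-3 above).
    struct SmiEntry {
        bool     valid = false;
        uint8_t  tid = 0;
        uint32_t contextKey = 0, minIteration = 0, maxIteration = 0;
        uint32_t validCells = 0;              // "Valid Cells #"
        std::vector<uint8_t> rcBlock;         // the associated RC Block (N entries)
    };

    bool dynamicSmUpdate(std::vector<SmiEntry>& sm, uint8_t tid, uint32_t contextKey,
                         uint32_t rContext, uint8_t initialRC, uint32_t N) {
        // STEP 1 - Searching: locate the SMI entry covering (tid, contextKey, rContext).
        SmiEntry* e = nullptr;
        for (auto& s : sm)
            if (s.valid && s.tid == tid && s.contextKey == contextKey &&
                s.minIteration <= rContext && rContext <= s.maxIteration) { e = &s; break; }

        // STEP 3 - Allocating: if not found, allocate a new SMI entry and RC Block.
        if (!e) {
            for (auto& s : sm) if (!s.valid) { e = &s; break; }
            if (!e) return false;                       // DynamicSM full -> error interrupt
            e->valid = true; e->tid = tid; e->contextKey = contextKey;
            e->minIteration = (rContext / N) * N;
            e->maxIteration = e->minIteration + N - 1;
            e->validCells = N;
            e->rcBlock.assign(N, initialRC);            // RC taken from the Template Memory
        }

        // STEP 2 - Updating: decrement the RC value of the addressed entry.
        uint8_t& rc = e->rcBlock[rContext - e->minIteration];
        if (--rc == 0) {                                // instance becomes ready
            // (the real TSU enqueues the ready DThread into the Ready Queue here)
            if (--e->validCells == 0) e->valid = false; // deallocate SMI entry + RC Block
        }
        return true;
    }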

3.2.2.5 Fetch Unit

The Fetch Unit is responsible for reading the Update commands from the Input FSL Buses and for storing them in the Update Queue, as shown in Figure 24. To support this functionality the Fetch Unit utilizes a custom FSL Reader for each Input FSL Bus. The FSL Reader fetches Update commands (in chunks of 32 bits) from its associated Input FSL Bus and stores them in an intermediate buffer, called CMD Buffer. The Fetch Unit utilizes a separate CMD Buffer for each FSL Reader. Each CMD Buffer holds entire Update commands which come in three categories: single, simple and multiple. A generic round-robin arbiter is used to transfer the Update commands from the CMD Buffers to the Update Queue. In case an Update command is not valid, an interrupt is sent to an interrupt controller in order to inform the user about the error.

3.2.2.6 Update Unit

The Update Unit fetches the Update commands from the Update Queue and executes them. Each Update command is made up of the following fields: Type, TID, Context, and Max Context. For each Update command the Update Unit locates the corresponding Thread Template from the Template Memory. If the RC of the Thread Template is equal to 1, the DThread instance (TID + Context) along with its scheduling information is stored in the Ready Queue for execution. If a Multiple Update is processed, a separate ready DThread entry is stored in the Ready Queue, one for each Context value. In the scenario where Update commands target DThreads with RC > 1, the Update Unit decrements the RC values in the TSU's Synchronization Memories (StaticSM or DynamicSM). For Multiple Updates, a separate Update signal is sent to the DynamicSM, one for each Context value. The

Update Unit manages these Multiple Updates through a special unit, called Mult Update Generator, which, depending on the Nesting value of the DThread, generates the different Context values, from Context to Max Context. The Mult Update Generator is omitted from Figure 24 for simplicity. When the RC value of a DThread becomes zero, the SM unit (StaticSM or DynamicSM) stores the ready DThread along with its scheduling information in the Ready Queue. Finally, the Update Unit informs the user about errors through interrupts (e.g., an Update decrements the RC of a DThread which does not exist, or the DynamicSM is full and hence an SMI entry cannot be allocated, etc.).

3.2.2.7 Scheduling Unit

The Scheduling Unit dequeues the ready DThread instances along with their scheduling information from the Ready Queue of the TSU (see Figure 24). It enforces the Scheduling Policy by assigning each ready DThread, i.e., by inserting its TID and Context, to the corresponding Waiting Queue. The Waiting Queues hold the ready DThread instances that are waiting to be transferred to the Output FSL Buses. The Scheduling Policy consists of two fields: (1) the scheduling method and (2) the scheduling value. Three scheduling methods have been implemented: dynamic, round-robin, and static. The dynamic method distributes the thread invocations to the cores in order to achieve load-balancing: the Waiting Queue with the least amount of work is selected. Next, the round-robin method distributes the thread invocations to the cores in a round-robin fashion. We note that the scheduling value is not used in the dynamic and round-robin methods. Last, under the static method, all the instances of a DThread are assigned to a specific core. As such, the scheduling value is used to hold the identity of the specific core. For instance, if a user wants to execute a DThread only on the core with ID=1, then a DThread with Method=Static and Value=1 is created.
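The selection logic of the Scheduling Unit can be summarized by the following sketch. Taking the current occupancy of each Waiting Queue as the "amount of work" metric for the dynamic method is an assumption on our side; the function and variable names are illustrative.

    #include <cstddef>
    #include <vector>

    enum class SchedMethod { Dynamic, RoundRobin, Static };

    // Illustrative selection of the target Waiting Queue for a ready DThread instance.
    // "occupancy" models the pending work per Waiting Queue (an assumption here).
    std::size_t selectWaitingQueue(SchedMethod method, std::size_t schedValue,
                                   const std::vector<std::size_t>& occupancy,
                                   std::size_t& rrCounter) {
        switch (method) {
        case SchedMethod::Static:      // all instances go to the core given by the value
            return schedValue;
        case SchedMethod::RoundRobin:  // cycle over the cores
            return rrCounter++ % occupancy.size();
        case SchedMethod::Dynamic:     // pick the least-loaded Waiting Queue
        default: {
            std::size_t best = 0;
            for (std::size_t i = 1; i < occupancy.size(); ++i)
                if (occupancy[i] < occupancy[best]) best = i;
            return best;
        }
        }
    }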

3.2.2.8 Transfer Unit

The TSU utilizes the Transfer Units to transfer the ready DThread instances from the Waiting Queues to the Output FSL Buses, as shown in Figure 24. A different Transfer Unit is used for each Waiting Queue-Output FSL pair. The Transfer Unit splits the information of the ready DThread instances into 32-bit chunks since the FSL bus is a 32-bit wide interface. Each ready DThread instance contains a TID and a Context value. We note that the Instruction Frame Pointer (IFP) of each ready DThread is stored in the DDM application, as the hardware TSU will be attached to an SPMD (single program, multiple data) architecture. As such, the IFP of each DThread has a different value in each core.

3.2.3 TSU’s RTL schematics

In this subsection we present Register Transfer Level (RTL) schematics of the hardware TSU implementation. In order to generate high-quality schematics we have ported the TSU’s Verilog code 53

into the Xilinx Vivado Design Suite (version 2017.1) [155]. Additionally, the TSU was configured to support eight cores. Figure 27 depicts the high-level RTL schematic of the TSU architecture. Several hardware components (signals, gates, registers, multiplexers, etc.) were omitted from the schematics for simplicity. The Fetch Unit consists of two different modules, the CMD BUF SET and the CMD Mng Unit (Figure 28). It also includes the FSL Readers (one for each Input FSL Bus) which are implemented in the Xilinx Platform Studio (XPS) tool and thus are not visible in Vivado's RTL schematics. CMD BUF SET is a generic Verilog module which holds the CMD Buffers (one for each core). The CMD Mng Unit is responsible for dequeuing the Update commands from the CMD Buffers (in a round-robin fashion) and for storing them in the Update Queue. The basic modules of the CMD Mng Unit are: a generic round-robin arbiter, a General Priority Encoder (GeneralPriorityEncoder) and a general Multiplexer (GeneralMUX). The GeneralPriorityEncoder and GeneralMUX modules are used to select the data of the selected CMD Buffer among the input data of all CMD Buffers (the input data of all CMD Buffers is transferred via the same input signal, called CMD BUF SET outputs). Figure 29 depicts the connection interface between the Template Memory, the Update Queue, the Update Unit and the Ready Queue. The Update Unit (Figure 30) includes a controller of the Dynamic Synchronization Memory (SM) and a multiple Update generator (MULT UPD GEN RC1) which is used for generating Update signals in the case of Multiple Updates targeting DThreads with RC=1. To simplify the TSU's top Verilog module, the Dynamic and Static SMs were placed inside the Update Unit. Finally, Figure 31 depicts the connection interface between the Ready Queue, the Scheduling Unit, the Waiting Queues and the Transfer Units. The Waiting Queues are included in a generic Verilog module, called WQ BUF SET. Similarly, the TRANSFER UNIT includes the Transfer Units of all cores.

Figure 27: High-level RTL schematic of the TSU.

Figure 28: High-level RTL schematic of the Fetch Unit and the Update Queue.

Figure 29: High-level RTL schematic of the Update Unit connected with Template Memory, Update Queue and Ready Queue.

Figure 30: High-level RTL schematic of the Update Unit.

Figure 31: High-level RTL schematic of the TSU's output side (Ready Queue, Scheduling Unit, Waiting Queues and Transfer Units).


Figure 32: MiDAS architecture supporting an arbitrary number of cores.

3.3 MiDAS System Architecture

MiDAS is a multi-core processor consisting of non-coherent in-order cores and the optimized hardware TSU implementation that we proposed in Section 3.2. As a proof of concept, we have prototyped MiDAS on FPGA devices using the Xilinx Platform Studio (XPS) 14.7 tool. XPS allows hardware designers to develop embedded processor-based systems. The MiDAS architecture that supports an arbitrary number of cores is shown in Figure 32. In this work we implemented MiDAS's cores using Xilinx MicroBlaze [62], a 32-bit RISC Harvard soft-core. In its basic configuration each MicroBlaze core is configured with a 32-KB L1 Data Cache (D-CACHE), a 32-KB L1 Instruction Cache (I-CACHE), and 4-KB Local Memory. The caches and Local Memory are implemented using Block RAM (BRAM) [62]. It is important to note that MiDAS can be implemented with any non-coherent processing element that supports memory-mapped communication buses such as AXI or AMBA [154], and/or FIFO-based communication buses such as FSL [153]. In this work we choose

to incorporate MicroBlaze cores since they are readily provided directly by Xilinx, while being highly configurable soft-cores. Also, MicroBlaze cores can be programmed in both C and C++, an important feature which proves handy in the implementation of our programming interface. The cores share a DDR3 SDRAM Controller via a shared high-performance AXI4 bus that provides access to a DDR3 SDRAM chip (see top side of Figure 32). The cores also share a set of peripherals through a shared AXI4-Lite bus. These peripherals comprise, among others, a UART interface to access an RS-232 port, a MicroBlaze Debug Module (MDM) [62] which enables JTAG-based debugging of one or more MicroBlaze cores, and an Ethernet controller (currently not utilized). The shared AXI4-Lite bus is also used for establishing communication between the TSU and the cores. Each core has a local AXI4-Lite bus which enables the usage of three peripherals: an interrupt controller, a timer and an AXI4 bridge. The AXI4 bridge provides access to the shared peripherals, i.e., MDM, UART, etc. One of the cores is selected to act as the master core (usually Core 0), which receives interrupts initiated by the TSU. The interrupt controller in each core is used to receive interrupts from local devices such as the timer and the FSL buses. Finally, each core has two FSL buses that function as the TSU's Input and Output FSL Buses.

3.3.1 Memory Model

In our memory model, concurrent DThread instances cannot modify the same data since this would result in a data dependence violation. This allows MiDAS to implement an efficient single-writer/multiple-readers model based on data-flow [27], where synchronization constructs (e.g., locking) and cache coherency protocols are not required. Correctness of the application can be assured by updating cached data to main memory on completion of a DThread instance. This can be achieved by flushing updated values (output data) to memory and then activating the consumer DThread instances. The DDM semantics, which are implemented in the hardware TSU, guarantee that the consumer DThread instances will be activated only after the producer DThread instances terminate (using Update commands). The fact that MiDAS can be implemented without cache-coherency allows for increased performance scalability, reduced hardware costs and improved energy-efficiency.
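The single-writer/multiple-readers discipline translates into a simple per-DThread code pattern on a non-coherent core: compute, write back the produced data, then update the consumers. The sketch below is illustrative only; flush_dcache_range() and tsu_update() are hypothetical placeholders for the cache write-back primitive of the target core and for the Update command issued to the TSU, not the actual MiDAS API.

    // Stubs for the assumed primitives (hypothetical names, not the actual MiDAS/TSU API):
    // a data-cache write-back operation and an Update command sent over the Input FSL Bus.
    void flush_dcache_range(const void* /*addr*/, unsigned /*len*/) { /* write back cache lines */ }
    void tsu_update(unsigned /*consumerTid*/, unsigned /*context*/)  { /* push Update to the TSU */ }

    // Illustrative DThread body on a non-coherent MiDAS core.
    void dthread_body(float* out, const float* in, unsigned n,
                      unsigned consumerTid, unsigned context) {
        for (unsigned i = 0; i < n; ++i)                // 1) compute the output data
            out[i] = 2.0f * in[i];

        flush_dcache_range(out, n * sizeof(float));     // 2) make the data visible in main memory

        tsu_update(consumerTid, context);               // 3) only then activate the consumer
    }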

Chapter 4

FREDDO: an efficient Framework for Runtime Execution of Data-Driven Objects

4.1 Introduction

FREDDO (efficient Framework for Runtime Execution of Data-Driven Objects) [142, 156] is an efficient and portable object-oriented implementation of the Data-Driven Multithreading (DDM) model [55]. It is a C++ framework that supports efficient data-driven execution on conventional single-node and distributed multi-core systems. The main contributions of the FREDDO framework can be summarized as follows:

1. Provide an efficient and portable distributed implementation of the DDM model. FREDDO's distributed implementation is based on the DDM-VM system [11, 63].

2. Provide recursion support for the DDM model. Our mechanisms/techniques are presented in Chapter 6.

3. Extend the programming interface of DDM with new features (object oriented programming, larger Context sizes, DFunctions, automatic computation of RC values, etc.).

4. Evaluate the DDM model on HPC systems. Particularly, FREDDO was evaluated on an open-access 64-node Intel HPC system with a total of 768 cores. The DDM model was previously evaluated only on very small distributed multi-core systems with up to 24 cores, using the DDM-VM implementation [63].

5. Compare the results obtained from FREDDO with four parallel software platforms: OpenMP [19], OmpSs [157, 40, 129, 130], MPI [18] and DDM-VM [63]. The comparison results show that FREDDO achieves similar or better performance.

6. Provide simple mechanisms/optimizations to reduce the network traffic of distributed DDM applications.


7. Implement a connectivity layer with two different network interfaces: a Custom Network Interface (CNI) and MPI [18]. The CNI support allows a direct and fair comparison with frameworks that also utilize a custom network interface (e.g., DDM-VM [63]), while the MPI support provides portability and flexibility to the FREDDO framework. We also provide comparison results between CNI and MPI for several benchmarks.

In this chapter we present the single-node and distributed FREDDO implementations, in Sections 4.2 and 4.3, respectively. Chapter 5 describes FREDDO's programming methodology. Finally, a comprehensive evaluation of FREDDO is provided in Chapter 7.

4.2 Single-node Implementation

FREDDO is a software runtime system that supports efficient data-driven execution on conventional multi-core systems. The scheduling of DThreads is managed by a software TSU implementation which runs on one of the cores of the system. Like the previous software DDM implementations [9, 35, 10], FREDDO abstracts the details of the underlying machine and handles DThread execution and data management implicitly, by providing two additional components: the Kernels and the Runtime system. The Runtime system runs on top of any commodity Unix-based Operating System (OS) and hides all the details of a DDM implementation (e.g., the TSU module). FREDDO applications are developed using C++11 [158] and FREDDO's API. The API includes a set of runtime functions and classes which are grouped together in a C++ namespace called ddm. A user is able to create and manage DThreads by creating and accessing objects of special C++ classes. FREDDO's front-end and back-end are implemented in C++11 in order to take advantage of its new features, such as range-based loops, initializer lists, Lambda expressions and atomic operations. A FREDDO program is compiled by a commodity C++ compiler, thus an executable binary for any ISA can be generated. Furthermore, FREDDO applications are composed of DThread objects that have producer-consumer relationships. A DThread object holds the Thread Template (e.g., Ready Count, Consumers, Nesting attribute, IFP, etc.) of a specific DThread and provides methods for decrementing the RC value of the DThread's consumers. The TSU uses the Thread Templates to schedule DThread instances based on data-availability. This is achieved by scheduling a DThread instance for execution when all its producer-instances finish their execution.

4.2.1 New features

In this subsection we describe the new features provided by FREDDO in order to extend the programming interface of the DDM model.

4.2.1.1 Extending the size of the Context Attribute

The Context size is important when we have loops with large indexes since the indexes are stored in the Context values. The Context attribute of DDM encoded up to three nesting levels in a 32-bit long word. Thus, it was difficult to have nested loops with large indexes. For example, consider the Tile LU Decomposition algorithm (Figure 33) which can be parallelized in DDM using five DThreads and 32-bit Context values. T1 implements the outermost loop of the algorithm while the rest of the DThreads are responsible for executing four basic operations/kernels (diag, front, down and comb). Details about the algorithm can be found in Section 5.3.3. The DThreads have different Nesting values which are specified by the loop nesting level. For instance, T1 and T2 have Nesting=1 while T5 has Nesting=3.

Figure 33: LU algorithm: DThreads and Context values.

The Context value of T5 consists of three parts: left (10 bits), middle (10 bits) and right (12 bits). Thus, the left and middle parts can store indexes < 1024 and the right part can store indexes < 4096. This prevents the programmer from executing the algorithm with large matrix sizes and small or medium tile sizes. For example, for a 32K × 32K matrix size and 16 × 16 tile size, or for a 64K × 64K matrix size and 32 × 32 tile size, N is equal to 2048. Since the upper limit of the indexes is the N value, a DDM system with 32-bit Context values is not able to support the LU benchmark with these problem sizes. Although these problem sizes are very large for a single-node multi-core system, they are normal for a cluster with a large number of nodes/cores. Notice that the smaller the tile size (i.e., fine-grained threads), the larger the number of DThread instances spawned during the execution. In HPC systems fine-grained threads (e.g., with small tile sizes) are usually used in order to utilize the large number of computation cores in a better way. FREDDO solves the aforementioned issue by supporting four different Context sizes: 32-bit, 64-bit, 96-bit and 192-bit. Table 4 describes how the indexes of the loops are encoded into the Context value for all possible combinations, for each Context size. Notice that for Nesting-0 the Context value is always zero and its width is equal to the Context size.

Table 4: Context encoding according to the Nesting attribute for each Context size.
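As a concrete example of the 32-bit encoding discussed above, the sketch below packs three loop indexes into a Nesting-3 Context using the 10/10/12-bit left/middle/right split of the LU example. Placing the left part in the most-significant bits is our assumption; the larger Context sizes follow the same idea with wider fields.

    #include <cstdint>

    // Packs the indexes of a three-level nested loop into a 32-bit Nesting-3 Context,
    // using the 10-bit left / 10-bit middle / 12-bit right split of the LU example.
    // The bit placement (left part in the high bits) is an assumption for illustration.
    uint32_t packContext3(uint32_t left, uint32_t middle, uint32_t right) {
        return ((left   & 0x3FFu) << 22) |   // left  part: indexes < 1024
               ((middle & 0x3FFu) << 12) |   // middle part: indexes < 1024
               ( right  & 0xFFFu);           // right part: indexes < 4096
    }

    // Example: the T5 instance for indexes (3, 7, 11) gets Context packContext3(3, 7, 11).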

4.2.1.2 Introducing dynamically allocated data-structures in TSU

The data-structures of the previous TSU modules were implemented using static memory allocation. This requires the programmer to recompile the TSU code when: (1) the number of Kernels has to be changed, (2) the queues that hold information about the executed DThread instances are full and (3) the Synchronization Memory (SM) is full. Recall that SM is responsible for holding the RC values of DThread instances. This increases the development time and affects usability. FREDDO addresses this issue by implementing the majority of the TSU's data-structures using dynamic memory allocation.

4.2.1.3 Reducing the memory allocated by the SM module

In the previous DDM implementations, an RC value is allocated for each instance of each DThread. This results in allocating thousands or even millions of RC values in the SM unit. We observed that it is not necessary to allocate RC values for DThreads that have an RC value equal to one. The instances of such DThreads can be scheduled immediately for execution when Update operations are received for them. This approach reduces the memory usage of DDM applications as well as accelerates the Update operations. As an example, consider a for-loop with 10^6 iterations that is mapped to a DThread with RC=1. In this case our framework will avoid allocating 10^6 RC values. Furthermore, this approach is vital for hardware DDM implementations [55, 9, 159, 141] where the size of the SM is limited.

4.2.1.4 Object-Oriented Approach

DDM applications are developed using C macros [35, 10] or TFlux directives [9]. FREDDO extends the DDM's programming interface by allowing the development of DDM applications through the object-oriented programming (OOP) paradigm. We provide four basic C++ classes for creating, updating and removing DThreads: SimpleDThread, MultipleDThread, MultipleDThread2D and

MultipleDThread3D. These classes correspond to DThreads with Nesting-0, Nesting-1, Nesting-2 and Nesting-3, respectively. OOP provides the following benefits to DDM programs:

• Data Encapsulation, i.e., the binding of data and functions that manipulate the data, which keeps both safe from outside interference and misuse [160, 161]. FREDDO supports the properties of encapsulation and information hiding through the DThread classes.

• Data Abstraction, i.e., providing only essential information to the outside world and hiding the background details [161]. For example, the users can only send Update operations to the DThread objects and their consumers, while the details of the communication between the DThread objects and the TSU are hidden.

• Inheritance, i.e., objects can acquire the properties of objects of other classes. This provides re-usability and reduces the implementation time. FREDDO uses inheritance in order to organize the DThread objects into a hierarchy. FREDDO's DThread classes are derived classes of the DThread class.

The OOP paradigm is also used for developing FREDDO's back-end, i.e., the TSU and the Runtime system, in order to improve productivity and maintainability as well as to reduce the development time. This is because OOP provides modularity, extensibility and re-usability.

4.2.1.5 Introducing DFunctions

In the previous software DDM systems the code of all DThreads had to be placed in the same function/place. This is because the runtime support of these systems uses label and goto statements for executing the code of the DThreads. Thus, programmers were restricted from having parallel code in different files as well as in different functions. In FREDDO, the code of DThreads can be embodied in any callable target (called a DFunction), such as: (i) standard C/C++ functions, (ii) Lambda expressions and (iii) functors. This allows the DThreads' code to be placed anywhere in a DDM program. Each DFunction has one input argument, the Context value. Different Context structures (ContextArg, Context2DArg and Context3DArg) are provided based on the type of the DThread class. Notice that the DFunctions of DThreads with Nesting-0 (i.e., SimpleDThreads) do not have input arguments since the Context value is always zero. DFunctions with different Context structures allow a cleaner interface.

4.2.1.6 Supporting Recursion

FREDDO is the first DDM implementation that supports recursion. It provides special C++ classes in order to allow programmers to parallelize recursive algorithms (linear, tail, binary, multiple, etc.). The recursion support for the DDM model is presented in Chapter 6.

4.2.1.7 Automatic computation of the RC values

In the previous DDM systems the RC value of each DThread was required to be specified by the programmer. In this work we allow two different approaches for specifying the RC values of the DThreads:

1. Like the previous DDM systems, the RC values are given by the programmers. For this purpose the basic DThread classes have to be used (e.g., SimpleDThread, MultipleDThread, etc.).

2. The RC values will be computed at runtime based on the producer-consumer relationships of the DThreads. This feature is supported by introducing a special data-structure in the TSU, the Pending Template Memory (PTM). Additionally, special DThread classes are provided, called FutureDThread classes: FutureSimpleDThread, FutureMultipleDThread, FutureMultipleDThread2D and FutureMultipleDThread3D. These classes are derived classes of the basic DThread class and do not require an RC value to be specified in their constructors. Initially, the Thread Templates of the FutureDThreads will be stored in the PTM and their RC values will be computed at runtime by the TSU module. A usage sketch of both approaches is given after this list.
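The sketch below illustrates the two approaches. The class names (ddm namespace, MultipleDThread, FutureMultipleDThread, ContextArg) and the use of a Lambda expression as a DFunction come from the text, but the constructor arguments, the update() call and the mock declarations are invented purely for illustration; the real interface is described in Chapter 5.

    #include <cstdint>
    #include <functional>

    namespace ddm {   // mock declarations, only so this sketch is self-contained
    struct ContextArg { std::uint64_t value; };
    class MultipleDThread {                // Nesting-1 DThread, RC given explicitly
    public:
        MultipleDThread(std::function<void(ContextArg)>, unsigned /*rc*/) {}
        void update(ContextArg) {}         // hypothetical: decrement the RC of one instance
    };
    class FutureMultipleDThread {          // RC computed at runtime by the TSU via the PTM
    public:
        explicit FutureMultipleDThread(std::function<void(ContextArg)>) {}
    };
    }

    void loopBody(ddm::ContextArg ctx) { /* work for iteration ctx.value */ }

    int main() {
        ddm::MultipleDThread       t2(loopBody, /*RC=*/2);                  // approach 1
        ddm::FutureMultipleDThread t3([](ddm::ContextArg ctx) { /* ... */ }); // approach 2
        t2.update(ddm::ContextArg{0});     // a producer would activate instance 0 of t2
        return 0;
    }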

4.2.2 Architecture

FREDDO allows efficient DDM execution by utilizing three different components: the TSU, the Kernels and the Runtime support. The overall architecture of the single-node FREDDO implementation is depicted in Figure 34. FREDDO's components are described below.

Figure 34: Architecture of the single-node FREDDO implementation.

4.2.2.1 Thread Scheduling Unit (TSU)

The block diagram of the TSU is shown in Figure 35. The TSU is connected to a processor with an arbitrary number of cores, where core 0 is used to execute the TSU code while the other cores are

used for executing DThread instances. Each block of the diagram is a C++ object that may consist of several internal objects.

Figure 35: Block diagram of FREDDO's TSU.

The TSU’s storage units

The TSU uses four main storage units: the Template Memory (TM), the Pending Template Memory (PTM), the Graph Memory (GM) and the Synchronization Memory (SM). The TM contains the Thread Template of each DThread, while the PTM contains the Thread Templates for which the RC values will be computed at runtime using the consumers of each DThread. The GM module contains the consumers of each DThread, while the SM contains the Ready Count (RC) values of the different instances of DThreads. A DThread that implements a loop (or recursive function) has multiple instances, one for each iteration (or recursive call). FREDDO supports static and dynamic SMs:

• StaticSM: it is used when the number of instances of a DThread is known at compile time. Each instance allocates a unique entry in the StaticSM. The allocation of all RCs is performed at the time of creating the Thread Template. Accessing an RC entry at runtime is a direct operation that uses the Context of the DThread’s instance.

• DynamicSM: it is used when there is no information about the number of instances of a DThread. A hash-map is used to allocate the SM entries where the keys are the Contexts and the values are the RCs. The allocation of RCs is performed as the execution proceeds. The GEORGEhash-map is allocated at the time of creating the Thread Template. Moreover, accessing an RC entry is an associative operation.
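A minimal C++ model of the two SM flavours described above is given below, assuming that the hash-map keys are plain Context values (which suffices because, as explained next, a separate SM instance is allocated per DThread) and that an entry is freed as soon as its RC reaches zero; the actual FREDDO classes may differ in detail.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Minimal model of the per-DThread Synchronization Memories (illustrative only).
    struct StaticSM {
        std::vector<uint32_t> rc;                       // one RC per instance, allocated up front
        StaticSM(std::size_t instances, uint32_t initRC) : rc(instances, initRC) {}
        bool decrement(uint64_t context) { return --rc[context] == 0; }  // ready when it hits 0
    };

    struct DynamicSM {
        uint32_t initRC;                                // RC from the Thread Template
        std::unordered_map<uint64_t, uint32_t> rc;      // Context -> remaining RC
        explicit DynamicSM(uint32_t init) : initRC(init) {}
        bool decrement(uint64_t context) {
            auto it = rc.try_emplace(context, initRC).first;  // allocate on first Update
            if (--it->second != 0) return false;
            rc.erase(it);                               // instance became ready; free the entry
            return true;
        }
    };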

Instead of having one global SM for each type (static/dynamic), we allocate a separate SM instance for each DThread, for two main reasons. Firstly, in the case of the StaticSM, we allocate exactly the amount of RCs that is required for each DThread. This is an improvement over the DDM-VM implementation which allocates redundant RCs if the loop indices don't start from 0 or if the

upper bound of the loop is not a power of 2 [11]. Secondly, in the case of the DynamicSM, we are using simpler/smaller key-value pairs. For instance, if we had a global DynamicSM, a possible key would be the tuple <TID, Context> instead of just <Context>. Furthermore, the rehashing operation of a DThread's DynamicSM will not cause the rehashing of the other DThreads' RCs. Figure 36 depicts the basic structures of the TSU along with their fields and their types. Each entry of the TM holds the IFP, RC and Nesting attributes as well as a pointer to the SM allocated for the associated DThread. In FREDDO, the IFP is implemented using C++11's std::function in order to support the functionalities of DFunctions. Each PTM entry holds additional information that will help the TSU to allocate an SM structure after the calculation of the entry's RC value. This information includes the ranges of the loops (up to three-level nested loops) and the type of the SM (static or dynamic). Notice that the ranges (InnerRange, MiddleRange and OuterRange) are used when a StaticSM is required, in order to allocate the RCs of all instances at once.


Figure 36: The TSU's basic data structures.

TSU-Cores Communication

The communication between the TSU and the computation cores is implemented through the Output Queues (OQs), the Input Queues (IQs) and the Unlimited Input Queues (UIQs). A triplet of an IQ, a UIQ and an OQ is attached to each core. The TSU dispatches the ready DThreads to the cores through

the OQs. After a core completes the execution of a DThread instance, it sends Update commands (Single or Multiple) to the consumers of the completed DThread instance. A Single Update (isMultiple=0) consists of the Thread ID (TID) and the Context of the DThread instance that is going to be updated. Particularly, a Single Update operation indicates that the RC value that corresponds to the TID and Context attributes will be decreased by one. A Multiple Update includes an additional attribute, the Max Context, which allows decreasing multiple RC values of a DThread (from Context to Max Context). The Updates are stored in the core's IQ. If the IQ is full, then the Updates are stored in the associated UIQ. The UIQ is an efficient variable-length queue. Finally, the entries of the IQs, OQs and UIQs include the DataPtr field which is used for the recursion support (for the RecursiveDThread and ContinuationDThread classes; see Section 6.4.2). DataPtr points to the data (called RData object) of a specific recursive instance.

The TSU’s Control Unit

The TSU’s control unit fetches the Updates from the IQs in a round-robin fashion. If an IQ is empty, the TSU checks the IQ’s associated UIQ for available Updates. For each Update, it locates the Thread Template of the DThread instance from the TM and decrements the RC in the SM which is associated with the DThread (StaticSM or DynamicSM). If the RC value of any DThread’s instance reaches zero, then it is deemed executable and is sent to the Scheduler. An instance of a DThread that is ready for execution it’s called ready DThread instance and consists of the TID, the IFP and the Context attributes. In our implementation theMATHEOU DThreads that their RC value is equal to one are managed differently. In particular, they are scheduled immediately without the need of allocating an SM instance. This approach decrements the memory usage of DDM applications as well as it accelerates the Update operations of DThreads with RC=1. Prior the DDM scheduling, the TSU fetches the Pending Thread Templates (PTTs) from PTM and computes their RC values based on the producer-consumer relationships of the DThreads. The algorithm performed by the TSU for computing the RC values is shown in Algorithm 1. Figure 37 illustrates an example of computing the RC values of four Future DThreads (with TIDs 1 to 4) where each Future DThread is associated with a Pending Thread Template (PTT). The example shows the procedure partitioned into three algorithmic steps as well as the final dependency graph with the GEORGEcalculated RC values. 70

Algorithm 1: Compute RC values at runtime
    foreach Pending Thread Template ∈ PTM do
        Create an RC value, R
        R = 0
    ConsDThreads ← holds the consumers of each DThread from GM
    foreach D ∈ DThreads do
        foreach Cons ∈ Consumers of ConsDThreads[D] do
            if Cons ∈ PTM then
                Cons's R = Cons's R + 1
    foreach Pending Thread Template (PTT) ∈ PTM do
        if PTT's R = 0 then
            PTT's R = 1
        Remove PTT from PTM and store it in TM
        if PTT's R > 1 then
            Allocate a new SM (StaticSM or DynamicSM)


Figure 37: Example of computing the RC values of Pending Thread Templates (PTT=Pending Thread Template, TT=Thread Template).

Scheduling Ready DThreads

The Scheduler is responsible for assigning the ready DThread instances to the Output Queues (OQs). When it receives a ready DThread instance it locates the OQ with the least amount of work. In the case that all OQs have the same amount of work, the OQ of Kernel 0 will be selected. After that, the Scheduler will enqueue the information of the ready DThread instance into the selected OQ.

Memory allocation of the TSU’s structures

The IQs and OQs are implemented as fixed-size circular buffers that are allocated statically at compile time in order to accelerate the enqueue and dequeue operations. The third fixed-size data-structure is the TM, which is implemented as a direct-mapped array in order to provide efficient search operations. The other TSU data structures are allocated dynamically at runtime. This avoids the need to recompile the TSU code in the cases we mentioned in Section 4.2.1.2.

4.2.2.2 FREDDO Kernels

A Kernel is a POSIX Thread (PThread) that is pinned on a specific core until the end of the DDM execution. This eliminates the context-switching overheads between the Kernels in the system. The Kernel is responsible for executing the ready DThread instances that are stored in the Output Queue (OQ) of its core. Also, it is responsible for storing the Update commands in its core's IQ/UIQ. In FREDDO, m Kernels are created, where m is the maximum number of DThreads that can be executed in parallel in the system. Usually, m is equal to N − 1, where N is the number of cores of the system. This is because one of the cores is reserved for the execution of the TSU code.
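The pinning of a Kernel to its core can be achieved with the standard POSIX/Linux affinity call, roughly as follows; this is a sketch of the mechanism, not FREDDO's actual Kernel start-up code.

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE            // needed on glibc for pthread_setaffinity_np
    #endif
    #include <pthread.h>
    #include <sched.h>

    // Pin the calling Kernel thread to a specific core so it is never migrated.
    static void pinToCore(unsigned coreId) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(coreId, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    // Each Kernel would then loop: dequeue a ready DThread instance from its core's
    // Output Queue, run its DFunction, and push the resulting Update commands to the IQ/UIQ.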

4.2.2.3 Runtime Support

The Runtime system enables the communication between the Kernels and the TSU through the Main Memory. It enqueues the Update commandsMATHEOU in the IQ/UIQ pairs and it dequeues the ready DThread instances from the OQs and forwards them to the Kernels. The Runtime is also responsible for loading the Thread Templates, the Pending Thread Templates and the Consumers of DThreads, for creating and running the Kernels, and for deallocating the resources allocated by DDM programs.

4.3 Distributed Implementation

In this section we describe FREDDO's distributed architecture and its memory model, scheduling and termination mechanisms, network support, and techniques that are used for reducing the network traffic in the system.

4.3.1 Architecture

The distributed architecture of FREDDO is depicted in Figure 38. It is composed of multi-core nodes connected by a global network interconnect (e.g., Ethernet, InfiniBand, etc.). A Network Manager, implemented in each node, abstracts the details of the network interconnect and allows the inter-node communication. FREDDO's runtime system was extended to: 1) handle the communication and data management across the nodes, 2) manage the applications' dependency graphs in a distributed environment and 3) schedule/execute the ready DThread instances on the

cores of the entire distributed system. The same application binary is executed on all nodes, one of which is selected as the RootNode. The RootNode is responsible for detecting the termination of the distributed FREDDO applications and for gathering the results for validation purposes (this is optional).

Figure 38: The FREDDO’s Distributed Architecture.

4.3.2 Memory Model

FREDDO implements a software Distributed Shared Memory (DSM) system [162] with shared Global Address Space (GAS) support. Part, or all, of the main memory space on each node is mapped to the DSM's GAS. This approach creates an identical address space on each node, which gives the view of a single distributed address space. The conventional main memory addresses of shared objects (scalar values, vectors, etc.) are registered in the GAS by storing them in the Global Address Directory (GAD) of each node. For each such address, the runtime assigns a unique identifier (called GAS ID) which is identical in each node. This allows the runtime system to transfer data between nodes (from one local main memory to another) using GAS IDs. FREDDO uses the DSM implementation to employ implicit data forwarding [163]. The produced/output data of a DThread instance is forwarded to its consumers, running on remote nodes, before the latter start their execution. This is guaranteed by sending Update operations after data transfers are completed. To support this functionality, each Kernel is associated with a Data Forward Table (DFT). A DFT keeps track of the output data segments of the currently executed DThread instance. The DFT allocates a separate entry for each output data segment. When a DThread instance finishes its execution, the DFT entries related to the DThread instance are removed from the associated DFT. A DFT entry consists of the following attributes: GAS ID, addrOffset (the offset in bytes from the conventional address which maps to the GAS ID), segmentSize (the size in bytes of the data segment)

and sentToTable (marks the nodes that have already received the data segment). sentToTable is used to send a data segment to a node only once. The runtime uses DFT entries to transfer produced data to remote nodes, implicitly. Algorithm 2 depicts the basic algorithm of sending Updates and output data of a DThread instance to remote nodes. Expensive coherence operations implemented in typical DSM systems [164] are not required. The remote read operations are eliminated, thus, the total communication cost can be reduced. Coherence operations are applied only within each node’s memory hierarchy, by hardware, since each node is a conventional multi-core processor. It is important to notice that applications, where tasks write to the same data simultaneously, without specifying dependencies, result in an undefined behaviour.

Algorithm 2: Sending Updates and output data of a DThread instance DI which is executed on Kernel k

foreach Upd ∈ Updates of DI code do
    i ← the id of the node that will execute Upd
    if i = local node id then
        Send Upd to local TSU
    else
        foreach dftEntry ∈ DFT of Kernel k do
            if dftEntry.sentToTable[i] = false then
                sendData(i, dftEntry.GAS ID, dftEntry.addrOffset, dftEntry.segmentSize)
                dftEntry.sentToTable[i] ← true
        Send Upd to the node with id = i
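To make the DFT bookkeeping concrete, the following is a minimal C++ sketch of a DFT entry and of the forwarding loop of Algorithm 2. The type and function names (DFTEntry, sendData, sendUpdateTo, nodeOf, etc.) and the fixed node bound are illustrative assumptions, not FREDDO's actual internals.

#include <cstddef>
#include <cstdint>
#include <vector>

using AddrID = uint32_t;                       // GAS ID type (name taken from the FREDDO API)
constexpr int MAX_NODES = 64;                  // illustrative upper bound on the number of nodes

struct Update { uint32_t tid; uint64_t context; };   // minimal stand-in for an Update command

// One DFT entry per output data segment of the currently executing DThread instance
struct DFTEntry {
    AddrID gasID;                              // GAS ID of the shared object
    size_t addrOffset;                         // offset in bytes from the object's conventional address
    size_t segmentSize;                        // size in bytes of the produced data segment
    bool   sentToTable[MAX_NODES] = {};        // nodes that have already received this segment
};

// Placeholder hooks standing in for the Network Manager / TSU calls
void sendData(int node, AddrID id, size_t off, size_t size) { /* forward a data segment */ }
void sendUpdateTo(int node, const Update& upd)              { /* send the Update as a network message */ }
void sendUpdateToLocalTSU(const Update& upd)                { /* enqueue the Update in a local IQ/UIQ */ }
int  nodeOf(const Update& upd) { return static_cast<int>(upd.context % MAX_NODES); } // stand-in for the distribution scheme

// Algorithm 2: forward the Updates and the not-yet-sent output data of instance DI
void forwardUpdates(const std::vector<Update>& updates,
                    std::vector<DFTEntry>& dft, int localNodeId) {
    for (const Update& upd : updates) {
        int i = nodeOf(upd);                   // node that will execute this Update
        if (i == localNodeId) {
            sendUpdateToLocalTSU(upd);
        } else {
            for (DFTEntry& e : dft) {
                if (!e.sentToTable[i]) {       // each segment is sent to a node only once
                    sendData(i, e.gasID, e.addrOffset, e.segmentSize);
                    e.sentToTable[i] = true;
                }
            }
            sendUpdateTo(i, upd);              // the Update follows the data it depends on
        }
    }
}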

DSM eases the development of distributed FREDDO/DDM applications that use shared objects/data-structures (e.g., scalar values, arrays, etc.). Programmers only need to register the shared objects of a single-node FREDDO application in the GAS and specify the output data of each DThread, using special runtime functions. For algorithms without shared objects, FREDDO allows data forwarding through data objects which are exchanged between the nodes. This approach has been used for FREDDO's distributed recursion support, where objects (called DistRData) are used to transfer the arguments and return values of recursive function calls. In this case, the DSM implementation and the DFTs are not used. FREDDO sends Update operations to remote nodes after the transfers of the data objects are completed.

Currently, FREDDO's DSM implementation requires the shared objects to have the same memory size on each node (e.g., the tile matrix in a tile algorithm like Cholesky [26]). This simplifies the implementation of the proposed programming model but limits the total amount of memory used by a DDM program. In particular, a program can use only as much memory as is available on the RootNode, since the output results are gathered in that node. This approach is also adopted by DDM-VM [63] and OmpSs@Cluster [129, 130]. Future work will focus on mechanisms that will overcome this limitation in order to allow FREDDO to execute real-life applications with enormous input sizes (e.g., big data applications) on large-scale supercomputers.

4.3.3 Distribution Scheme and Scheduling Mechanisms for DThread instances

FREDDO provides a lightweight distribution scheme, based on the DDM's tagging system [55], to distribute DThread instances on the system's nodes. In particular, FREDDO implements a static scheme in which the mapping of DThread instances to the nodes is determined at compile time, based on their Context values (tags). The node on which a DThread instance will be executed is defined by the following formula: node_id = fcn(Cntx % totNumCores), where Cntx is the Context value of the DThread instance and totNumCores is the total number of cores of the entire system. fcn returns the node id of a core (e.g., in a 4-node system with 4 cores per node, fcn(0) = 0 and fcn(15) = 3). The static scheme only specifies where a DThread instance will be scheduled for execution. However, DThread instances are scheduled at runtime, based on data availability (the DThread instances are dynamically created). This approach simplifies the scheduling and data management operations and reduces runtime overheads. FREDDO utilizes two different scheduling mechanisms for executing DThread instances, intra-node and inter-node, outlined next.
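A minimal sketch of this static mapping, assuming homogeneous nodes (the same number of cores per node); the function name nodeOfContext and its parameters are illustrative, not FREDDO's internal API.

#include <cstdint>

// node_id = fcn(Cntx % totNumCores): map a Context value to the node that
// will execute the corresponding DThread instance.
int nodeOfContext(uint64_t cntx, int totNumCores, int coresPerNode) {
    int core = static_cast<int>(cntx % totNumCores);   // global core index
    return core / coresPerNode;                        // fcn: core -> node id
}

// Example from the text: 4 nodes with 4 cores each (totNumCores = 16)
// nodeOfContext(0, 16, 4)  == 0
// nodeOfContext(15, 16, 4) == 3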

4.3.3.1 Inter-node Scheduling Mechanism

The inter-node mechanism is handled by the Distributed Scheduling Unit (DSU), which decides, based on FREDDO's distribution scheme, whether the Update operations and the output data of a Producer-DThread instance will be forwarded to a remote node (or nodes) or to the local node. In the former case, the runtime sends the Update operations and the output data as network messages, via FREDDO's Network Manager, to the corresponding remote node(s). In the latter case, the Update operations are sent to the local TSU through the IQs/UIQs of the Kernels.

4.3.3.2 Intra-node Scheduling Mechanism

The intra-node mechanism is handled by the TSU (Section 4.2.2.1) in each node. The TSU fetches local Updates from the Kernels, through the DSU, or remote Updates from the Network Manager. For the distributed FREDDO implementation, an additional IQ/UIQ pair was introduced in the TSU architecture in order to store the remote Updates coming from the Network Manager. The TSU executes the Update commands and determines the ready DThread instances (their RC=0). The TSU's Scheduler (or Local Scheduler) distributes the ready DThread instances to the Kernels (through the OQs) for execution. As in the single-node execution, the Local Scheduler selects the OQ with the least amount of work.

4.3.4 Network Manager

The Network Manager is responsible for handling the inter-node communication. It is implemented as a software module that relies on the underlying network hardware interface. The Network Manager has the following responsibilities:

1. Establishes connections between the system’s nodes.

2. Exchanges network messages between the nodes. The network messages carry Update Operations, Termination Tokens (used for the Distributed Termination), Data Descriptors (which hold information about data segments that will be forwarded to consumer nodes), Shutdown Acknowledgements (used for a graceful system termination), etc.

3. Processes incoming network messages appropriately (e.g., it sends Updates to the local TSU).

4. Supports data forwarding across the global address space.

4.3.4.1 Connectivity Layer

The Network Manager handles the low-level connectivity by utilizing two different network interfaces: a Custom Network Interface (CNI), which is an optimized implementation based on TCP sockets, and the widely used MPI library [18]. CNI implements a fully connected mesh of inter-node connections (i.e., each node maintains a connection to all other nodes). Currently, CNI supports only Ethernet-based interconnects. As a first step, we implemented the CNI in order to identify all the functionalities and mechanisms that are needed for the inter-node communication. Based on these functionalities/mechanisms, we implemented the Network Manager on top of MPI. The major benefits of providing MPI support are portability and flexibility. On the other hand, the CNI implementation allows a direct and fair comparison with similar frameworks that utilize a custom network interface.

4.3.4.2 Sending/Receiving Functionalities

The Network Manager tolerates network communication latencies by overlapping its sending/receiving functionalities with the execution of DThread instances and the TSU's functionalities. Sending functionalities, i.e., operations for sending commands and produced data, are handled by the Kernels. This removes the cost of such operations from the TSU's critical path. However, it can lead to race conditions, since multiple DThread instances can send messages to the same destination at the same time. To avoid this situation while keeping performance high, we use atomic variables as much as possible and synchronization constructs (lock/unlock) to the bare minimum. Notice that a sending operation returns when the message has been stored in the network-layer buffers of the OS. For the receiving functionalities, an auxiliary thread is used which continuously retrieves incoming network messages from the other nodes. For this purpose, the pselect routine is used for the CNI implementation. For the MPI implementation we use the MPI_Recv routine with source=MPI_ANY_SOURCE.

4.3.5 Distributed Execution Termination

Detecting the termination of data-driven programs in distributed execution environments is not a straightforward procedure, since the availability of data governs the order of execution. In this work we have implemented an implicit distributed termination algorithm based on Dijkstra and Scholten's parental responsibility algorithm [165, 166], which requires minimal message exchange. The algorithm assumes termination when the state of all nodes is passive (idle) and no messages are on their way in the system. In our implementation, the passive state refers to the state in which the TSU has no pending Update operations and no pending ready DThread instances waiting for execution. When the parent node (RootNode) detects termination, it broadcasts a termination message to the other nodes and waits for their acknowledgements in order to achieve a graceful system termination. The distributed termination algorithm is implemented by the Termination Detection Unit (TDU), which keeps track of the incoming and outgoing network messages in each node. We chose an implicit distributed termination detection algorithm in order to reduce the programming effort. The same algorithm was adopted by DDM-VM [11, 63]. The main difference between the two implementations is that FREDDO uses atomic variables to count the number of outgoing and incoming messages in each node, whereas DDM-VM implements the same functionality using lock/unlock operations, which incur more overheads. The algorithm is described below:

• Every node maintains a message counter (MC) which is incremented when a network message is sent and decremented when a network message is received. The sum of MCs on all the nodes represents the number of pending messages in the network.

• The algorithm uses a Termination Token for detecting termination, which is exchanged between the nodes. A color is assigned to each node and Termination Token where initially all are white. When a node receives a network message, the node’s color=black. When a node forwards the Termination Token, the node’s color=white.

• When the RootNode is idle, it initiates a termination probing, i.e., it sends the Termination Token with value=0 and color=white to node N-1. The RootNode's color=white.

• Every node i keeps the Termination Token until it becomes idle; it then sends the Termination Token to node i-1, increasing the Termination Token's value by MC. Additionally, if the color of node i is black, the Termination Token's color=black; otherwise the Termination Token keeps its color. Finally, the color of node i becomes white.

• When the RootNode receives the Termination Token, the termination probing is finished for one ring-round. In that case, the algorithm detects termination if: (1) the RootNode is idle, (2) the Termination Token's color=white, (3) the RootNode's color=white and (4) the Termination Token's value + MC = 0. Otherwise, the RootNode initiates a new termination probing (when it becomes idle).
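The RootNode's check above can be sketched as follows; the structures and field names are illustrative, and a real TDU would also handle the token hand-off along the ring.

#include <atomic>

struct TerminationToken {
    long value;        // sum of message counters accumulated along the ring
    bool white;        // token colour (white = no message received since the last probe)
};

struct NodeState {
    std::atomic<long> mc{0};   // +1 per message sent, -1 per message received
    bool white = true;         // node colour
    bool idle  = false;        // TSU has no pending Updates and no pending ready instances
};

// Called on the RootNode when the Termination Token returns after one ring-round.
bool terminationDetected(const TerminationToken& tok, const NodeState& root) {
    return root.idle && tok.white && root.white &&
           (tok.value + root.mc.load()) == 0;
}
// If the check fails, the RootNode starts a new probing once it becomes idle again.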

4.3.6 Reducing Network Traffic

Reducing the network traffic in HPC systems is critical since it can help avoid network saturation and reduce power consumption. The US Department of Energy (DOE) [33, 167, 168] clearly states that the biggest energy cost in future massively parallel HPC systems will be in data movement, especially moving data on and off chip. To this end, we recommend four simple and efficient techniques for reducing the network traffic of distributed DDM applications running on HPC systems. These techniques are mostly applied to the Update operations, which are the most frequent commands executed in a DDM application.

4.3.6.1 Use General Network Packets

A network message can carry any type of command (Update, Multiple Update, Data Descriptor, etc.). The most common practice for sending such a message is to use a header (as a separate message) that describes the message's content, as in [63]. In this work we introduce a general packet with four fields (Type = 1 byte, Value 1 = 4 bytes, Value 2 = sizeof(Context Value) and Value 3 = sizeof(Context Value)) that can carry all the basic types of commands. As a result, the number of network messages sent can be halved.
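The general packet can be sketched as a small fixed-layout record. The struct below is illustrative and assumes, for concreteness, that a Context Value is a 64-bit integer; it is not the exact wire format.

#include <cstdint>

using ContextValue = uint64_t;   // assumed width of a Context Value

// One self-describing packet that can carry any basic command,
// removing the need for a separate header message.
#pragma pack(push, 1)
struct GeneralPacket {
    uint8_t      type;     // command type (Update, Multiple Update, Data Descriptor, ...)
    uint32_t     value1;   // e.g., the target DThread's TID
    ContextValue value2;   // e.g., Context (or minContext of a Multiple Update)
    ContextValue value3;   // e.g., maxContext of a Multiple Update
};
#pragma pack(pop)
// sizeof(GeneralPacket) == 1 + 4 + 2 * sizeof(ContextValue) bytes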

4.3.6.2 Compressing Multiple Updates for DThreads with RC >1

Multiple Updates decrement the RC values of several instances of a DThread. Since the mapping of instances to the nodes is based on their Context values, a Multiple Update should be unrolled and each of its Updates should be sent to the appropriate node. Figure 39a shows an example where a Multiple Update, with Contexts from <0> to <47>, is distributed to a 4-node system (each node has 4 cores). We compress consecutive Multiple Updates that are sent to the same node based on a simple pattern recognition algorithm. The algorithm takes into account the difference between the minContext and maxContext of each Multiple Update command (called Right Distance) and the difference between the minContexts of two consecutive Multiple Updates (called Bottom Distance). In this example, the algorithm compresses the Multiple Updates that are sent to each node using RightDistance = 3 and BottomDistance = 16 (see Figure 39b). As a result, the number of messages is reduced by 75% for this specific Multiple Update command. The proposed algorithm is implemented by the DSU's Compression Unit. When a node receives a compressed Multiple Update, the Network Manager decompresses it using its Decompression Unit.

(a) No compression. (b) Compression for DThreads with RC > 1. (c) Reducing Update messages for DThreads with RC = 1.

Figure 39: Example of reducing the network traffic generated by a Multiple Update. T1(X,Y) denotes a Multiple Update for DThread T1.

4.3.6.3 Reducing the number of messages in the case of Multiple Updates for DThreads with RC=1

In FREDDO, DThreads with RC=1 are treated differently compared to other DDM implementations. The TSU does not allocate RC values for their instances, in order to reduce memory allocation. Instances of such DThreads are scheduled immediately when Updates are received for them. This approach allows a DThread instance to be scheduled for execution on any node. We can benefit from this by dividing the range of Context values of a Multiple Update into equal parts, where the number of parts is equal to the number of nodes. As an example, consider the Multiple Update of Figure 39a and assume that the DThread T1 has RC=1. In this case, four different Multiple Updates will be distributed to the nodes as described in Figure 39c. This methodology does not require compression and it can reduce the number of messages by 75% for this specific example.
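A sketch of how such a Multiple Update could be split into one sub-range per node, under the assumption that the Context range divides evenly among the nodes; the function name is illustrative.

#include <cstdint>
#include <utility>
#include <vector>

// Split the Context range [minCntx, maxCntx] of a Multiple Update into
// numNodes equal parts, one per node (the DThread has RC = 1, so any node may run its instances).
std::vector<std::pair<uint64_t, uint64_t>>
splitMultipleUpdate(uint64_t minCntx, uint64_t maxCntx, int numNodes) {
    std::vector<std::pair<uint64_t, uint64_t>> parts;
    uint64_t total = maxCntx - minCntx + 1;
    uint64_t chunk = total / numNodes;          // assumes total % numNodes == 0
    for (int n = 0; n < numNodes; ++n) {
        uint64_t lo = minCntx + n * chunk;
        parts.emplace_back(lo, lo + chunk - 1); // e.g., <0..47> on 4 nodes -> <0..11>, <12..23>, <24..35>, <36..47>
    }
    return parts;
}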

4.3.6.4 Packing correlated Updates together

Our final technique reduces the number of messages that carry Update commands for the same destination node. In particular, when a producer-instance sends several Update commands (Single Updates, (un)compressed Multiple Updates) to a remote node, for the same DThread, FREDDO's runtime performs two steps. First, it sends the DThread's TID along with the number of Updates that will be sent, through a general packet. After that, the Context values of the Updates are sent as a single data packet to the remote node.

Chapter 5

Programming Methodology

5.1 Introduction

In this chapter we present the programming methodology used in FREDDO and MiDAS. Both implementations are based on the DDM semantics for programmability. In this thesis we provide software APIs which give programmers full control of DDM applications. The APIs enable programmers to manage the DDM execution environment and the dependency graph (create and remove DThreads), as well as to perform Update operations.

The APIs provided in this work can be used by high-level tools for better programmability. Such tools include source-to-source compilers (e.g., the TFlux source-to-source compiler [145]) and declarative parallel programming languages like Concurrent Collections (CnC) [169]. As a proof of concept, an extension of the TFlux source-to-source compiler [145] was implemented in order to ease the development of DDM applications targeting the MiDAS system. The primary target of the compiler is to hide the details of the API from programmers. To develop a DDM application, the programmer only needs to describe the parallel sections of the application using TFlux directives, in a similar manner to those of OpenMP. Functionalities such as loading/unloading the TSU and managing Context values are handled automatically. Notice that the TFlux directives of MiDAS were developed by Dr. Pedro Trancoso and Andreas Diavastos. Details about the TFlux source-to-source compiler and its directives can be found in [9, 145, 170], and are omitted here for brevity.

Section 5.2 presents the API provided by FREDDO for implementing DDM applications. Programming examples implemented using FREDDO's API are given in Section 5.3. Section 5.4 presents the API provided for developing DDM applications targeting the MiDAS architecture. Finally, Section 5.5 presents the Matrix Multiplication application implemented using the MiDAS API and TFlux directives.


5.2 FREDDO API

FREDDO provides an API that enables programmers to develop DDM applications. The API is a C++ library that includes a set of runtime functions and classes which are grouped together in a C++ namespace called ddm. The API’s functions and classes are described below.

5.2.1 Basic Runtime Functions

• void init(string peerfile, PortNumber port, freddo_config* conf = nullptr): initializes FREDDO for distributed execution using the Custom Network Interface (CNI) support. The user should provide the peer file and the port number that will be used for the inter-node communication. The number of Kernels of each node should be included in the peer file. conf is a C++ object which allows users to configure the FREDDO runtime. Currently, it provides functionalities for configuring the pinning of the TSU, the Network Manager's receiving thread and the Kernels on the system's cores. If conf is not provided, the TSU will run on core 0, the Network Manager's receiving thread will run on core 1 and the Kernels will run on the remaining cores of the system.

• void init(int *argc, char ***argv, unsigned int numOfKernels, freddo_config* conf = nullptr): initializes FREDDO for distributed execution using the MPI support. The argc and argv arguments are used for the MPI initialization. The function will start numOfKernels Kernels and it will use the conf object to configure the pinning, as described above. The peer file and other parameters needed for the distributed execution are provided through the mpirun command.

• void init(unsigned int numKernels, freddo config* conf = nullptr): initializes FREDDO for single-node execution and it starts numKernels Kernels. The conf object can be used for configuring the pinning of the TSU and Kernels.

• void run(): computes the RC values of the Pending Thread Templates and starts the scheduling of DThreads. The scheduling finishes when the TSU has no Updates to execute (the Input Queues and Unlimited Input Queues are empty) and there are no pending ready DThread instances (the Output Queues are empty). If the distributed mode is enabled, the function returns when FREDDO's runtime detects distributed execution termination (see Section 4.3.5). We note that all initial Updates have to be sent to the TSU prior to the execution of this command. If this does not happen, the run function will return immediately since the TSU's queues will be empty. Thus, every DDM program should issue Update commands before calling the run function.

• void finalize(): releases all the resources allocated by FREDDO (used for both single-node and distributed environments).

• void buildDistributedSystem(): builds the distributed system. Particularly, the following steps are performed:

– The Network Manager starts.

– The nodes of the distributed system communicate and exchange their number of Kernels.

– The Distributed Scheduling Unit (DSU) and the Data Forward Tables (DFTs) are created.

Notice that the init runtime function, the DThreads' declarations and the registration of the variables in the shared Global Address Space should precede this function.

• PeerID getPeerID(): returns the ID (or rank) of a peer/node.

• bool isRoot(): indicates if a node is the RootNode.

• unsigned int getNumberOfPeers(): returns the number of nodes of the distributed system.

• AddrID addInGAS(void* address): registers a shared object in the Global Address Space (GAS). It also returns a unique identifier for the shared object (called GAS ID). The type of the GAS ID is AddrID.

• void addModifiedSegmentInGAS(AddrID addrID, void* address, size_t size): registers an output data segment/object of a DThread instance. It requires the data segment's GAS ID (addrID), the conventional main memory address of that segment and its size in bytes. This function informs the runtime which data segment(s) will be forwarded to the DThread instance's consumer(s) running on remote node(s).

• void sendDataToRoot(AddrID addrID, void* address, size_t size): sends a data segment to the RootNode. This function is used for gathering the results on the RootNode and it requires the same arguments as the addModifiedSegmentInGAS function.
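Putting the functions above together, a distributed FREDDO program typically follows the ordering sketched below (DThread declarations and GAS registrations precede buildDistributedSystem, which precedes run). This is a skeletal sketch, assuming the CNI init variant; the FREDDO header path, the peer file name, the port number and the shared array `data` are illustrative.

#include <freddo/freddo.h>           // FREDDO header (path assumed)
using namespace ddm;

int main() {
    init("peers.txt", 1234);         // CNI variant: peer file and port (values illustrative)

    double* data = new double[1024]; // hypothetical shared object
    AddrID gasData = addInGAS(data); // register it in the Global Address Space

    // ... declare DThreads and their consumers here ...

    buildDistributedSystem();        // start the Network Manager, exchange Kernel counts, create DSU/DFTs

    // ... send the initial Updates here (e.g., only on the RootNode: if (isRoot()) ...) ...

    run();                           // returns when distributed termination is detected
    finalize();                      // release all FREDDO resources
    return 0;
}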

5.2.2 DFunctions

The code of the DThreads can be embodied in any callable target, called a DFunction, such as: (i) standard C++ functions, (ii) Lambda expressions and (iii) functors. This methodology allows parallel code to appear anywhere in a DDM program, without the need for goto and label statements. The DDM Kernels execute the code of the DThreads at runtime. Each DFunction has at most one input argument, the Context. Four different DFunction types are provided, according to the Nesting value of the DThread:

1. SimpleDFunction: for DThreads with Nesting-0. It has no arguments since the Context value is always zero.

2. MultipleDFunction: for DThreads with Nesting-1. It has as input argument the ContextArg data type, which is a single value containing the index of a one-level loop.

3. MultipleDFunction2D: for DThreads with Nesting-2. It has as input argument the Context2DArg data type, which contains the indexes of a two-level loop (outer and inner parts).

4. MultipleDFunction3D: for DThreads with Nesting-3. It has as input argument the Context3DArg data type, which contains the indexes of a three-level loop (outer, middle and inner parts).
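For illustration, the four DFunction types correspond to callables with the following signatures (shown here as Lambda expressions). This fragment assumes the FREDDO header and the ddm namespace are in scope; the variable names are arbitrary.

// Nesting-0: no argument (the Context is always 0)
auto f0 = []() { /* ... body ... */ };

// Nesting-1: a single loop index
auto f1 = [](ContextArg i) { /* use i */ };

// Nesting-2: outer and inner loop indexes
auto f2 = [](Context2DArg c) { auto i = c.Outer, j = c.Inner; /* use i, j */ };

// Nesting-3: outer, middle and inner loop indexes
auto f3 = [](Context3DArg c) { auto i = c.Outer, j = c.Middle, k = c.Inner; /* use i, j, k */ };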

5.2.3 DThread Classes

In FREDDO, we provide eight special C++ classes that enable creating, updating and removing DThreads with different characteristics. These classes are derived classes of the DThread class. A programmer is able to create and load Thread Templates in the TSU simply by using the constructors of these special classes. The DThreads are removed from the TSU using the delete operator of C++ (in each case the appropriate destructor is called).

5.2.3.1 DThread Class

The base class of our framework. The user is not able to create objects of this class, i.e. there is no constructor. All methods of this class are accessible by the special DThread classes which are presented in the next subsections. This class is used only for inheritance and has the following methods:

• void updateAllCons(): decrements the RC of the DThread's consumers by one. All consumer-DThreads should have Nesting-0.

• void updateAllCons(Context context): decrements, by one, the RC that corresponds to the input Context in each of the DThread's consumers, which have Nesting>=1. Examples:

– updateAllCons(3): decreases the RC value that corresponds to the Context=3, for all consumers of this DThread.

– updateAllCons({2, 3}): decreases the RC value that corresponds to the Context with outer index=2 and inner index=3, for all consumers of this DThread.

– updateAllCons({2, 5, 3}): decreases the RC value that corresponds to the Context with outer index=2, middle index=5 and inner index=3, for all consumers of this DThread.

• void updateAllCons(Context context, Context maxContext): decrements, by one, the RCs of multiple instances of the DThread's consumers, which have Nesting>=1. For each Nesting we provide a separate example:

– updateAllCons(0, 13): decrements the RCs that correspond to Contexts 0 to 13, of all consumers of this DThread.

– updateAllCons({0, 0}, {0, 4}): decrements the RCs that correspond to Contexts {0, 0} to {0, 4}, of all consumers of this DThread. More specifically, the instances with Contexts {0, 0}, {0, 1}, {0, 2}, {0, 3} and {0, 4} will be updated, for each consumer.

– updateAllCons({0, 0, 1}, {0, 0, 3}): decrements the RCs that correspond to Contexts {0, 0, 1} to {0, 0, 3}, of all consumers of this DThread. More specifically, the instances with Contexts {0, 0, 1}, {0, 0, 2} and {0, 0, 3} will be updated, for each consumer.

• unsigned int getTID(): returns the Thread Identifier (TID) of the DThread. The TID of each DThread is created at runtime by the framework. In previous DDM implementations, users had to specify the TID manually.

• void setConsumers(Consumers consList): sets the list of consumers. The consList variable is a C++ vector that contains pointers to DThread objects (DThread*).

5.2.3.2 SimpleDThread Class

SimpleDThread is a DThread with Nesting-0 and it has only one instance with Context=0. Constructors/methods:

• SimpleDThread(SimpleDFunction sDFunction, ReadyCount readyCount): inserts a SimpleDThread in the TSU. The user has to specify the DFunction and the RC value.

• void update(): decrements the RC value of this DThread.

5.2.3.3 MultipleDThread Class

MultipleDThread is a DThread with Nesting-1 and has multiple instances. Constructors/methods:

• MultipleDThread(MultipleDFunction mDFunction, ReadyCount readyCount, UInt numOfInstances): inserts a MultipleDThread in the TSU where a StaticSM will be used. The user has to specify the DFunction, the RC value and the number of instances of the DThread.

• MultipleDThread(MultipleDFunction mDFunction, ReadyCount readyCount): like the previous constructor but the number of instances is not specified. Thus, a DynamicSM will be allocated.

• void update(Context context): decrements the RC value that corresponds to the input Context of this DThread.

• void update(Context context, Context maxContext): decrements the RC values of multiple instances of this DThread.
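As a small usage sketch (assuming FREDDO has been initialized as in the examples of Section 5.3, and with the loop body left as a comment), a one-level parallel loop maps to a MultipleDThread whose instances are released by a single Multiple Update:

// Parallelizes: for (i = 0; i < 100; i++) { /* loop body */ }
MultipleDThread* loopDT = new MultipleDThread(
    [&](ContextArg i) { /* loop body, using i as the loop index */ },
    1,      // RC = 1: each instance waits for a single Update
    100);   // number of instances, so a StaticSM is allocated

loopDT->update(0, 99);   // one Multiple Update releases instances 0..99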

5.2.3.4 MultipleDThread2D Class

Similar to MultipleDThread, but it has Nesting-2. Constructors/methods:

• MultipleDThread2D(MultipleDFunction2D mDFunction2D, ReadyCount readyCount, UInt innerRange, UInt outerRange): inserts a MultipleDThread2D in the TSU where a StaticSM will be used. The user has to specify the DFunction, the RC value and the number of instances of the DThread. outerRange indicates the dimension of the outer-level loop while innerRange indicates the dimension of the inner-level loop. For example, if outerRange=4 and innerRange=3, then a StaticSM will be allocated with 12 RC entries as a 4 × 3 matrix.

• MultipleDThread2D(MultipleDFunction2D mDFunction2D, ReadyCount readyCount): like the previous constructor but the dimensions are not specified. Thus, a DynamicSM will be allocated.

• void update(Context context): decrements the RC value that corresponds to the input Context of this DThread.

• void update(Context context, Context maxContext): decrements the RC value of multiple instances of this DThread.

5.2.3.5 MultipleDThread3D Class

Similar to MultipleDThread, but it has Nesting-3. Constructors/methods:

• MultipleDThread3D(MultipleDFunction3D mDFunction3D, ReadyCount readyCount, UInt innerRange, UInt middleRange, UInt outerRange): inserts a MultipleDThread3D in the TSU where a StaticSM will be used. The user has to specify the DFunction, the RC value and the number of instances of the DThread. outerRange indicates the dimension of the outer-level loop, middleRange indicates the dimension of the middle-level loop and innerRange indicates the dimension of the inner-level loop. For example, if outerRange=4, middleRange=3 and innerRange=2, then a StaticSM (as a 4×3×2 matrix) will be allocated with 24 RC entries.

• MultipleDThread3D(MultipleDFunction3D mDFunction3D, ReadyCount readyCount): like the previous constructor but the dimensions are not specified. Thus, a DynamicSM will be allocated.

• void update(Context context): decrements the RC value that corresponds to the input Context of this DThread.

• void update(Context context, Context maxContext): decrements the RC values of multiple instances of this DThread.

5.2.3.6 FutureDThread Classes

FutureDThread classes are derived classes of the "Regular" DThreads (SimpleDThread, MultipleDThread, MultipleDThread2D and MultipleDThread3D). FutureDThread classes have the same constructors and methods as the "Regular" DThreads, however the RC value is not required in their constructors. As a result, their RC value will be evaluated at runtime (initially they are stored in the Pending Template Memory) using the producer-consumer relationships of the program. The following FutureDThread classes are provided: FutureSimpleDThread, FutureMultipleDThread, FutureMultipleDThread2D and FutureMultipleDThread3D.

5.2.4 UML Diagram of DThread Classes

The UML diagram of all DThread classes is depicted in Figure 40. All classes are derived-classes of the DThread class.

Figure 40: The UML diagram of all DThread classes.

5.3 Programming examples using FREDDO

5.3.1 Simple application

In this section we present the mapping of a very simple application to a DDM program. The dependency graph of the application, which is composed of three DThreads, is shown in Figure 41. The RC values are depicted as shaded values next to the nodes. The rounded rectangles illustrate the functionalities of the DThreads. The program prints the string "Hello World from the FREDDO framework!".

Listing 5.1 depicts the FREDDO code of the example implemented using Future DThreads. The code of DThreads t1 and t2 is embodied in Lambda expressions while the code of DThread t3 is embodied in a standard C++ function. The type of all DThreads is FutureSimpleDThread since they do not implement loops or recursion. In line 30, an initial Update is sent to DThread t1 since it is the root of the DDM dependency graph (it has no producers).

Figure 41: The DDM dependency graph of a simple application.

 1  #include <iostream>
 2  #include <freddo/freddo.h>   // FREDDO header (path assumed)
 3  using namespace ddm;

 5  void t3_code() {   // The t3's code
 6      cout << "from the FREDDO framework!\n";
 7  }

 9  void main() {
10      ddm::init(NUM_KERNELS);   // Initializes the DDM execution environment

12      FutureSimpleDThread *t1, *t2, *t3;   // The DThread objects

14      t1 = new FutureSimpleDThread([&]() {   // The t1's code
15          cout << "Hello ";
16          t1->updateAllCons();
17      });

19      t2 = new FutureSimpleDThread([&]() {   // The t2's code
20          cout << "World ";
21          t2->updateAllCons();
22      });

24      t3 = new FutureSimpleDThread(t3_code);

26      // Specify the Consumers
27      t1->setConsumers({t2, t3});
28      t2->setConsumers({t3});

30      t1->update();   // Initial Update
31      ddm::run();     // Start the DDM scheduling

33      delete t1; delete t2; delete t3;   // Remove the DThreads
34      ddm::finalize();   // Stops the Kernels and releases the resources
35  }

Listing 5.1: A simple DDM example: solution with FutureSimpleDThreads.

An alternative solution is to use SimpleDThreads. In this case, we have to specify the RC value of each DThread manually. Listing 5.2 depicts the DDM code using SimpleDThreads, where the code of each DThread is embodied in a standard C++ function.

 1  #include <iostream>
 2  #include <freddo/freddo.h>   // FREDDO header (path assumed)
 3  using namespace ddm;

 5  SimpleDThread *t1, *t2, *t3;   // The DThread objects

 7  void t1_code() {   // The t1's code
 8      cout << "Hello ";
 9      t2->update();   // Update t2 DThread
10      t3->update();   // Update t3 DThread
11  }

13  void t2_code() {   // The t2's code
14      cout << "World ";
15      t3->update();   // Update t3 DThread
16  }

18  void t3_code() {   // The t3's code
19      cout << "from the FREDDO framework!\n";
20  }

22  void main() {
23      ddm::init(NUM_KERNELS);   // Initializes the DDM execution environment

25      t1 = new SimpleDThread(t1_code, 1);
26      t2 = new SimpleDThread(t2_code, 1);
27      t3 = new SimpleDThread(t3_code, 2);

29      t1->update();   // Initial Update
30      ddm::run();     // Start the DDM scheduling

32      delete t1; delete t2; delete t3;   // Remove the DThreads
33      ddm::finalize();   // Stops the Kernels and releases the resources
34  }

Listing 5.2: A simple DDM example: solution with SimpleDThreads.

Figure 42: Example of a synthetic DDM application.

5.3.2 Synthetic application

An example of a synthetic DDM program is shown in Figure 42. On the left side of the figure, the pseudo-code of the application and its partitioning into five DThreads are depicted. From this code

 1  // Includes go here ...
 2  #include <iostream>              // assumed
 3  #include <freddo/freddo.h>       // FREDDO header (path assumed)
 4  using namespace ddm;

 6  // Declare DThread objects
 7  FutureSimpleDThread *t1;
 8  FutureMultipleDThread *t2;
 9  MultipleDThread2D *t3;
10  FutureMultipleDThread3D *t4;
11  SimpleDThread *t5;

13  // Declare Global Variables (Arrays, etc.)

15  void t1_code() {   // The t1's code
16      // Initializing Arrays ...

18      // Update the instances of consumers
19      t2->update(0, 63);                    // Multiple Update
20      t3->update({0, 0}, {15, 15});         // Multiple Update
21      t4->update({0, 0, 0}, {7, 7, 7});     // Multiple Update
22  }

24  void t4_code(Context3DArg c) {   // The t4's code
25      auto x = c.Outer, y = c.Middle, z = c.Inner;
26      D[x][y][z] = E[x][y][z] * F[x][y][z];
27      t4->updateAllCons();
28  }

30  void main() {
31      // Initializations go here ...
32      ddm::init(NUM_KERNELS);

34      // DThread declarations using standard functions
35      t1 = new FutureSimpleDThread(t1_code);
36      t4 = new FutureMultipleDThread3D(t4_code);

38      // DThread declarations using Lambda expressions
39      t2 = new FutureMultipleDThread([&](ContextArg cntx) {   // The t2's code
40          C[cntx] = A[cntx] + B[cntx];
41          t5->update();
42      });

44      t3 = new MultipleDThread2D([&](Context2DArg cntx) {   // The t3's code
45          auto j = cntx.Outer, k = cntx.Inner;
46          R[j][k] = L[j][k] * M[j][k];
47          t5->update();
48      }, 1);   // 1 at this point is the RC value

50      t5 = new SimpleDThread([&]() {   // The t5's code
51          // Print Results ...
52      }, 832);   // 832 at this point is the RC value

54      // Set the consumers of each DThread
55      t1->setConsumers({t2, t3, t4});
56      t2->setConsumers({t5});
57      t3->setConsumers({t5});
58      t4->setConsumers({t5});

60      t1->update();      // Decrease the RC of T1
61      ddm::run();        // Start the DDM scheduling
62      delete t1; ...; delete t5;
63      ddm::finalize();   // Deallocate Resources
64  }

Listing 5.3: DDM code of a synthetic application.

it is possible to observe a number of dependencies. DThreads T2, T3 and T4 depend on T1, which is responsible for initializing the data. Also, T5 depends on T2, T3 and T4 since T5 prints the output results generated by them. These dependencies form the DDM dependency graph of the application, which is presented on the right side of Figure 42. The RC values are depicted as shaded values next to the nodes. The three for-loop blocks are fully parallel and are mapped into three different DThreads. Each instance of a DThread is identified by its Context and executes the inner command of the block. T2 has 64 instances (with Contexts from 0 to 63), T3 has 256 instances (with Contexts from 0,0 to 15,15) and T4 has 512 instances (with Contexts from 0,0,0 to 7,7,7). DThread T5 depends on all instances of DThreads T2-T4, thus its RC is equal to 832. Moreover, the instances of DThreads T2-T4 have RC=1 because they have only one producer (T1). When T1 finishes its execution, a Multiple Update is sent to each consumer-thread. As a result, all the instances of DThreads T2-T4 can be executed concurrently. Listing 5.3 depicts one possible implementation of the application. In this example, we place the DThreads' code in standard C++ functions for T1 and T4, and in Lambda expressions for T2, T3 and T5. Furthermore, we use a combination of Regular DThreads (T3 and T5) and Future DThreads (T1, T2 and T4).

double AOrig[n*n];   // The original matrix
double *A[N][N];     // Each entry of A is a pointer to a tile

for (kk = 0; kk < N; kk++) {                 // Loop 1
    // A[kk][kk]:inout
    diag(A[kk][kk]);

    for (jj = kk+1; jj < N; jj++)            // Loop 2
        // A[kk][kk]:input, A[kk][jj]:output
        front(A[kk][kk], A[kk][jj]);

    for (ii = kk+1; ii < N; ii++)            // Loop 3
        // A[kk][kk]:input, A[ii][kk]:output
        down(A[kk][kk], A[ii][kk]);

    for (ii = kk+1; ii < N; ii++)            // Loop 4
        for (jj = kk+1; jj < N; jj++)        // Loop 5
            // A[ii][kk]:input, A[kk][jj]:input, A[ii][jj]:output
            comb(A[ii][kk], A[kk][jj], A[ii][jj]);
}

Listing 5.4: Tile LU Decomposition (Original Code).

5.3.3 Tile LU Decomposition: single-node and distributed implementations

In this section we provide the FREDDO implementation of the Tile LU Decomposition, which has a complex dependency graph. LU decomposition (also called LU factorization) is an important algorithm used for solving systems of linear equations efficiently [171]. The LU kernel factors a dense matrix into the product of a lower triangular matrix L and an upper triangular matrix U [172]. The dense n × n matrix A is divided into an N × N array of B × B tiles (n = NB). This enables the exploitation of temporal locality on sub-matrix elements. The code of the original tile LU Decomposition is shown in Listing 5.4. The code is composed of five nested loops that perform four basic operations on a tiled matrix. For demonstration purposes we chose the following indicative names for the operations: diag, front, down and comb. The algorithm is based on an earlier version developed in StarSs [127].

Benchmark Analysis

In every iteration of the outermost loop, the diag operation takes as input the diagonal tile that corresponds to the iteration number and produces its new value. The front operation produces the remaining tiles on the same row as the diagonal tile. For each one of those tiles, it takes as input the result of diag in addition to the current tile to produce its new value. Similarly, the down operation produces the remaining tiles on the same column as the diagonal tile. The comb operation produces the rest of the tiles for that LU iteration. For every tile it produces, it takes as input three tiles: the current tile, the tile produced by the front operation and the tile produced by the down operation. It multiplies the second and third tiles and adds the result to the first tile to produce the final resulting tile. This computational pattern is repeated in the next LU iteration on a subset of the resulting matrix that excludes the first row and column, and continues for as many iterations as there are diagonal tiles in the matrix. Figure 43 depicts the tiles produced by the four operations for the first iteration of LU decomposition on a 4 × 4 tile matrix. Each tile is labeled with the first letter of its operation. The tile produced by the diag operation is labeled as diag. The arrows in the figure indicate the input tiles needed by each operation to produce its result.

Figure 43: LU Decomposition: dependencies between operations for the first iteration.
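To make the comb operation concrete, the sketch below shows the tile-level computation implied by the description above, assuming B×B row-major tiles. The sign follows the textual description (the product is added to the current tile); the actual kernel used by the benchmark may differ.

constexpr int B = 32;   // tile dimension (illustrative)

// comb(down, front, cur): multiply the down and front tiles and
// accumulate the product into the current tile.
void comb(const double* down, const double* front, double* cur) {
    for (int i = 0; i < B; i++)
        for (int j = 0; j < B; j++) {
            double acc = 0.0;
            for (int k = 0; k < B; k++)
                acc += down[i * B + k] * front[k * B + j];
            cur[i * B + j] += acc;
        }
}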

Dependency Graph

The loops implementing the control flow in the original application are mapped into five DThreads, called loop_1_thread, diag_thread, front_thread, down_thread and comb_thread. The first DThread implements the outermost loop of the algorithm while the other DThreads are responsible for executing the four operations. The following data dependencies are observed:

• The DThreads that execute the operations depend on the loop_1_thread, since the index of the outermost loop is used in the four operations.

• The front_thread and down_thread DThreads depend on the diag_thread.

• The comb_thread depends on the front_thread and down_thread DThreads.

• The next LU iteration depends on the results of the previous iteration. In particular, the results produced by the comb_thread invocations in the current iteration are consumed by the invocations of the diag_thread, front_thread, down_thread and comb_thread of the next LU iteration.


Figure 44: The LU’s DDM dependency graph for the first two iterations of a 3 × 3 tile matrix (N=3).

The dependency graph shown in Figure 44 illustrates the dependencies among the instances of the DThreads for the first two iterations of the tile LU algorithm. For simplicity, a 3 × 3 tile matrix (N=3) was selected. Each DThread instance is labeled with the value of its Context.

FREDDO Code

Figure 45 depicts the FREDDO code for the tile LU Decomposition algorithm. We have used the DSM/GAS features of FREDDO since the algorithm uses a shared object, the tile matrix A. The code of the DThreads is placed in standard C/C++ functions. Each call of an Update command in the DThreads' code corresponds to one dependency arrow in Figure 44.

#include <freddo/freddo.h>          // FREDDO header (path assumed)
using namespace ddm;                // Use the freddo namespace

// DThread Objects
MultipleDThread *loop_1DT, *diagDT;
MultipleDThread2D *frontDT, *downDT;
MultipleDThread3D *combDT;
AddrID gasA;                        // The GAS_ID of matrix A
TYPE ***A;                          // The tile matrix
TYPE *Aorig;                        // The original matrix
int tS = B*B*sizeof(TYPE);          // size of tile in bytes

// The code of the thread_1_loop DThread
void loop_1_code(ContextArg kk) {
    diagDT->update(kk);

    if (kk < N-1) {
        frontDT->update({kk, kk+1}, {kk, N-1});
        downDT->update({kk, kk+1}, {kk, N-1});
        combDT->update({kk, kk+1, kk+1}, {kk, N-1, N-1});
    }
}

// The code of the diag_thread DThread
void diag_code(ContextArg kk) {
    addModifiedSegmentInGAS(gasA, A[kk][kk], tS);
    diag(A[kk][kk]);                // diag operation
    sendDataToRoot(gasA, A[kk][kk], tS);

    if (kk < N-1) {
        frontDT->update({kk, kk+1}, {kk, N-1});
        downDT->update({kk, kk+1}, {kk, N-1});
    }
}

// The code of the front_thread DThread
void front_code(Context2DArg context) {
    int kk = context.Outer, jj = context.Inner;
    addModifiedSegmentInGAS(gasA, A[kk][jj], tS);
    front(A[kk][kk], A[kk][jj]);    // front operation
    sendDataToRoot(gasA, A[kk][jj], tS);

    combDT->update({kk, kk+1, jj}, {kk, N-1, jj});
}

// The code of the down_thread DThread
void down_code(Context2DArg context) {
    int kk = context.Outer, jj = context.Inner;
    addModifiedSegmentInGAS(gasA, A[jj][kk], tS);
    down(A[kk][kk], A[jj][kk]);     // down operation
    sendDataToRoot(gasA, A[jj][kk], tS);
    combDT->update({kk, jj, kk+1}, {kk, jj, N-1});
}

// The code of the comb_thread DThread
void comb_code(Context3DArg context) {
    int kk = context.Outer, ii = context.Middle,
        jj = context.Inner;
    addModifiedSegmentInGAS(gasA, A[ii][jj], tS);

    // comb operation
    comb(A[ii][kk], A[kk][jj], A[ii][jj]);

    // Updates for the next LU iteration
    if (ii == kk+1 && jj == kk+1) {
        diagDT->update(kk+1);
    } else if (ii == kk+1) {
        frontDT->update({ii, jj});
    } else if (jj == kk+1) {
        downDT->update({jj, ii});
    } else {
        combDT->update({kk+1, ii, jj});
    }
}

// The main program
void main(int argc, char* argv[]) {
    // Initialize data (matrices, etc.)
    initializeData();

    // Register A in GAS
    gasA = addInGAS(A[0][0]);

    // Initializes the FREDDO execution environment
    init(&argc, &argv, NUM_OF_KERNELS);

    // Allocation of the DThread Objects
    loop_1DT = new MultipleDThread(loop_1_code, 1);
    diagDT = new MultipleDThread(diag_code, 2);
    frontDT = new MultipleDThread2D(front_code, 3);
    downDT = new MultipleDThread2D(down_code, 3);
    combDT = new MultipleDThread3D(comb_code, 4);

    // Updates resulting from data initialization
    if (ddm::isRoot()) {
        loop_1DT->update(0, N-1);
        diagDT->update(0);
        frontDT->update({0, 1}, {0, N-1});
        downDT->update({0, 1}, {0, N-1});
        combDT->update({0, 1, 1}, {0, N-1, N-1});
    }

    // Starts the DDM scheduling in each node
    run();

    // Releases the resources of distributed FREDDO
    finalize();
}

Figure 45: FREDDO code of the tile LU algorithm (the highlighted code is required for the distributed execution).

The Update operations at the end of the comb_code DFunction implement a switch actor: depending on the Context of the DThread instance, a different consumer-instance is updated. In the main function of the program, the matrices are allocated and initialized. After that, the tile matrix A is registered in the GAS using the addInGAS runtime function. At this point, FREDDO's runtime registers the address of the tile matrix A in the Global Address Directory (GAD) of each node. The runtime also creates a GAS ID for the matrix A, which is stored in the gasA variable. The init runtime function initializes FREDDO's execution environment and activates NUM_OF_KERNELS Kernels in each node. The constructor of each DThread object takes two arguments, the DFunction and the RC value (e.g., the diagDT object has DFunction=diag_code and RC=2).

After the creation of the DThread objects, the initial Updates are sent to the TSUs for execution. These Updates correspond to the arrows of Figure 44 that describe dependencies on initialized data. The initial Updates have to be executed only once. In this example, the RootNode was selected to execute these Updates, which are distributed across the nodes through its DSU module. Notice that the initial Updates, or any other Updates, can be executed by any node of the system. The run function starts the DDM scheduling and waits until FREDDO's runtime detects the distributed execution termination (see Section 4.3.5). When the run function returns, all the resources allocated by the FREDDO framework are deallocated using the finalize function.

FREDDO's memory model, in combination with its distribution scheme and the implicit distributed termination approach, allows distributed FREDDO programs to be fundamentally the same as the single-node ones. For the distributed data-driven execution, users have to: (i) provide a peer file that contains the IP addresses or the host names of the system's nodes, (ii) register the shared objects in the GAS using the addInGAS function and (iii) specify the output data of each DThread using the addModifiedSegmentInGAS runtime function. Additionally, for gathering the results on the RootNode, users have to use the sendDataToRoot runtime function. Both the addModifiedSegmentInGAS and sendDataToRoot functions require the GAS ID of a shared object, the conventional main memory address of that object and its size in bytes. For instance, in the diag_code DFunction, the tile A[kk][kk] is declared as a modified segment since it is computed by the diag routine. The size of this tile is equal to tS and its GAS ID is equal to gasA since it is a part of the tile matrix A. In Figure 45, the code required for the distributed execution is highlighted.

5.4 MiDAS API

DDM applications targeting the MiDAS architecture are developed using two different programming interfaces (Figure 46). The first programming interface consists of ANSI-C augmented with a C API. The API is a C library that includes a set of functions which allow programmers to: (1) initialize/reset the multi-core processor, (2) send Update commands to the TSU via the Input FSL Buses, (3) fetch the ready DThread instances from the Output FSL Buses and execute their code, (4) create Thread Templates and (5) manage the processor's hardware peripherals (timers, interrupt controllers, DDR3 RAM, etc.). The second programming interface implements the same functionalities as the C API and consists of C++ augmented with a C++ API. In particular, it implements a subset of FREDDO's programming interface. The C++ programming interface does not support all the functionalities present in FREDDO since the MicroBlaze GCC compiler does not support C++11. Both programming interfaces have common functions/routines, which are presented in Section 5.4.1.

Figure 46: Programming methodology of the MiDAS system.

We developed two different programming interfaces in order to give programmers the flexibility to choose between two different programming paradigms: procedural/structured and object-oriented. The C API can be used in embedded multi-core data-driven processors where C++ compilers may not be available or may not be free of charge. Further, the C API can be used in systems where the memory size is limited, since C binaries are usually smaller than C++ binaries. The C++ API, on the other hand, can be used by programmers who need special C++ features like templates, exceptions, overloading, and access to the Standard Template Library (STL). Finally, the C++ API can provide a less error-prone programming interface compared to the C API, since DThreads are implemented through special C++ classes which expose only the proper commands as public methods to users. For example, a user cannot perform a Multiple Update on a SimpleDThread since it has only one instance.

The DDM binary is produced by the MicroBlaze GCC compiler with the following compilation setup: mb-gcc for the C API and mb-g++ for the C++ API. It includes the user's application, the DDM API (in C or C++) and the drivers of the Xilinx peripherals, i.e., timers, buses, memories, interrupt controllers, etc. Users are able to write their programs using the Xilinx Software Development Kit (SDK). Xilinx SDK is an enhanced Eclipse platform that helps users create software applications for all Xilinx embedded microprocessors.

To execute the DDM code on the multi-core processor, an instance of the DDM binary has to be loaded in each MicroBlaze, i.e., an SPMD architecture is provided. The flow of each program instance is managed dynamically at runtime by the TSU through the DDM API. The Xilinx Microprocessor Debugger (XMD) [62] is used to download the DDM binary into the Main Memory, as well as for executing the code. For each core we reserve a 2-MB private memory segment in the shared main memory which is used for storing the DDM binary. The private memory segment includes the code and data sections of an application as well as the heap and stack sections (32 KB for the heap and 32 KB for the stack). For this configuration we used the Generate Linker Script tool of the Xilinx Software Development Kit (SDK).

5.4.1 Common functions for the C and C++ APIs

• void init_processor(unsigned int numOfEnabledCores): initializes the multi-core processor and enables numOfEnabledCores cores. This function should be called from each core.

• bool is_master(): indicates whether the core that executed this command is the master core.

• void finalize(): deallocates the resources allocated by the API and disables the interrupts of the MicroBlaze and TSU.

• void run_explicit(): starts the scheduling of the DThreads. It also fetches the information of the ready DThread instances from the associated Output FSL Bus and executes their code (DFunctions). Prior to the execution of the DFunctions, it constructs the Context arguments based on the Nesting attribute of each DThread. Notice that the DFunctions support only standard C/C++ functions. The function returns when the notify_termination function is called.

• void notify_termination(): indicates that the DDM program should be terminated. This function should be called by the last executed DThread instance.

• CREATE_N2(OUTER, INNER): encodes the outer and inner fields in a 32-bit integer value. It is used to create the Context values of Update operations targeting DThreads with Nesting-2.

• CREATE_N3(OUTER, MIDDLE, INNER): encodes the outer, middle and inner fields in a 32-bit integer value. It is used to create the Context values of Update operations targeting DThreads with Nesting-3.
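To illustrate what such an encoding looks like, the macros below pack the fields into a single 32-bit value using bit widths chosen purely for illustration (a 16/16 split for Nesting-2 and a 10/11/11 split for Nesting-3); the actual field widths used by the MiDAS TSU may differ, which is why the names carry an _EXAMPLE suffix.

/* Illustrative only: the real MiDAS macros use the TSU's field widths,
   which may differ from the splits assumed here. */
#define CREATE_N2_EXAMPLE(OUTER, INNER) \
    ((((unsigned int)(OUTER) & 0xFFFF) << 16) | ((unsigned int)(INNER) & 0xFFFF))

#define CREATE_N3_EXAMPLE(OUTER, MIDDLE, INNER) \
    ((((unsigned int)(OUTER)  & 0x3FF) << 22) | \
     (((unsigned int)(MIDDLE) & 0x7FF) << 11) | \
      ((unsigned int)(INNER)  & 0x7FF))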

5.4.2 C API

The C API provides functions for allocating, deallocating and updating DThreads. In the case of the functions that create DThreads (dthread_create_*), the user should specify the DFunction, the RC value and the Scheduling Policy of each DThread. When a DThread is created, the API returns the TID of the created DThread. The following functions are provided:

• TID dthread_create_simple(SimpleDFunction sDFunction, RC readyCount, SchMethod sch_meth, SchValue sch_value): creates a Thread Template with Nesting=0.

• TID dthread_create_multiple(MultipleDFunction mDFunction, RC readyCount, SchMethod sch_meth, SchValue sch_value): creates a Thread Template with Nesting=1.

• TID dthread_create_multiple2D(MultipleDFunction2D mDFunction2D, RC readyCount, SchMethod sch_meth, SchValue sch_value): creates a Thread Template with Nesting=2.

• TID dthread_create_multiple3D(MultipleDFunction3D mDFunction3D, RC readyCount, SchMethod sch_meth, SchValue sch_value): creates a Thread Template with Nesting=3.

• void remove_thread_template(TID thread_id): removes a Thread Template from the TSU.

• void simple_update(TID thread_id): decrements the RC value of the first instance (Context=0) of the DThread with TID=thread_id. This is usually used for DThreads with Nesting=0.

• void single_update(TID thread_id, Context context): decrements the RC value that corresponds to the Context value of the DThread with TID=thread_id.

• void multiple_update(TID thread_id, Context context, Context max_context): decrements the RC values of multiple instances (from context to max_context) of the DThread with TID=thread_id.

5.4.3 C++ API

The C++ API includes a set of functions (Section 5.4.1) and classes which are grouped together in a C++ namespace called ddm. The API enables programmers to manage the DDM execution environment and the dependency graph, i.e., to create and remove DThreads, as well as to perform Update operations. We provide four special C++ classes that enable the creation, update and removal of DThreads with different characteristics, i.e., Nesting values. These classes are derived-classes of the DThread base-class. A programmer is able to create and load Thread Templates in the TSU by using the constructors of the special classes.MATHEOU In each constructor, the user has to specify the DFunction, the RC value and the Scheduling Policy of the DThread. The DThreads are removed from the TSU using the delete operator of C++, where for each case the appropriate destructor is called.

5.4.3.1 SimpleDThread Class

Implements a DThread with Nesting-0. Constructors/methods:

• SimpleDThread(SimpleDFunction sDFunction, RC readyCount, SchMethod sch_meth, SchValue sch_value): inserts a SimpleDThread in the TSU.

• void update(): decrements the RC value of this DThread.

GEORGE5.4.3.2 MultipleDThread Class Implements a DThread with Nesting-1. Constructors/methods:

• MultipleDThread(MultipleDFunction mDFunction, RC readyCount, SchMethod sch_meth, SchValue sch_value): inserts a MultipleDThread in the TSU.

• void update(Context context): decrements an RC value of this DThread.

• void update(Context context, Context maxContext): decrements the RC value of multiple instances of this DThread.

5.4.3.3 MultipleDThread2D Class

Implements a DThread with Nesting-2. Constructors/methods:

• MultipleDThread2D(MultipleDFunction2D mDFunction2D, RC readyCount, SchMethod sch_meth, SchValue sch_value): inserts a MultipleDThread2D in the TSU.

• void update(Context context): decrements an RC value of this DThread. In order to create the Context value, the CREATE_N2 macro should be used.

• void update(Context context, Context maxContext): decrements the RC values of multiple instances of this DThread. In order to create the Context values, the CREATE_N2 macro should be used.

5.4.3.4 MultipleDThread3D Class

Implements a DThread with Nesting-3. Constructors/methods:

• MultipleDThread3D(MultipleDFunction3D mDFunction3D, RC readyCount, SchMethod sch_meth, SchValue sch_value): inserts a MultipleDThread3D in the TSU.

• void update(Context context): decrements an RC value of this DThread. In order to create the Context value, the CREATE_N3 macro should be used.

• void update(Context context, Context maxContext): decrements the RC values of multiple instances of this DThread. In order to create the Context values, the CREATE_N3 macro should be used.

5.5 Implementing Matrix Multiplication for MiDAS

In this section we show how a programmer can develop the Matrix Multiplication application (Listing 5.5) in DDM using the C and C++ APIs. In this simple example, the outer for-loop of the algorithm is parallelized using one MultipleDThread, called thread_1. Each instantiation of thread_1 calculates one row of the matrix, i.e., it executes the two nested for-loops of the algorithm (see Figure 47). Each instantiation is labelled with its Context value. Initially, the N instantiations of thread_1 are spawned in parallel, since thread_1's instances are independent. When all N instantiations (spanning from 0 to N-1) have executed, the program is complete.

// Global Variables
float *A, *B, *C;

int main(){
    // Memory Allocations
    A = (float*) malloc(N*N * sizeof(float));
    B = (float*) malloc(N*N * sizeof(float));
    C = (float*) malloc(N*N * sizeof(float));

    // Data initialization goes here ...

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];

    // Post execution code goes here ...
}

Listing 5.5: The original Matrix Multiplication.

Figure 47: Matrix Multiplication: dynamic instantiations of thread_1.

5.5.1 Implementation using the C API

Listing 5.6 depicts the code of the DThreads. We declare two DThreads, thread_1 and thread_2. thread_1 is responsible for executing the two nested for-loops of the algorithm (lines 13-15), using the Context value as the i index of the outer loop. After that, each instantiation of thread_1 updates thread_2 (line 18). thread_2 is used to print the results, to release the resources that are allocated by the two DThreads (lines 26-27), and to notify the API that it is the last DThread instance of the dependency graph (line 29).

1  // Includes ...

3  // Global Variables
4  TID t1_TID, t2_TID;

6  // Declare the code of thread_1
7  void thread_1(ContextArg cntx){
8      // Local Variables
9      int j, k;
10     float res;

12     // Executes the algorithm using the context as i
13     for (j = 0; j < N; j++)
14         for (k = 0; k < N; k++)
15             C[cntx][j] += A[cntx][k] * B[k][j];

17     // Update the thread_1's consumer, i.e. thread_2
18     simple_update(t2_TID);
19 }

21 // Declare the code of thread_2
22 void thread_2(){
23     // Print results goes here ...

25     // Remove the DThreads
26     remove_thread_template(t1_TID);
27     remove_thread_template(t2_TID);

29     notify_termination(); // Notify that I am the last DThread instance
30 }

Listing 5.6: Matrix Multiplication implemented using the MiDAS C API: DThreads code.

1  // Global Variables
2  float *A, *B, *C;

4  // The main program
5  int main(){
6      // Initialize the TSU
7      init_processor(NUM_OF_CORES);

9      // Memory Allocations
10     A = (float*) malloc(N*N * sizeof(float));
11     B = (float*) malloc(N*N * sizeof(float));
12     C = (float*) malloc(N*N * sizeof(float));

14     // Initializations goes here ...

16     // Create Thread Template for thread_1
17     t1_TID = dthread_create_multiple(thread_1, 1, sch_dynamic, 0);

19     // Create Thread Template for thread_2
20     t2_TID = dthread_create_simple(thread_2, N, sch_static, core_0);

22     // Multiple Update (from 0 to N-1)
23     if(is_master())
24         multiple_update(t1_TID, 0, N-1);

26     run_explicit(); // Start the DDM scheduling

28     finalize(); // Deallocates the resources

30     return 0;
31 }

Listing 5.7: Matrix Multiplication implemented using the MiDAS C API: main program.

Listing 5.7 illustrates the main function of the DDM program. The MiDAS processor is initialized in line 7, where NUM_OF_CORES cores are enabled. After the arrays (A, B and C) are allocated and initialized, the Thread Templates are created and loaded in the TSU. thread_1 is a Multiple DThread with RC=1 and its scheduling policy is set to dynamic. thread_2 is a Simple DThread with RC=N because it needs to wait for the N instantiations of thread_1 to finish their execution. thread_2 is scheduled to be executed on the core with ID=0 (scheduling method=static and scheduling value=0). Once the DThreads are loaded, the N instantiations of thread_1 are released by using the multiple_update command (line 24). The Multiple Update command is executed only once; to achieve this, we use the is_master function. The run_explicit function is used to execute the ready DThread instances. This function returns when the thread_2 DThread completes executing the notify_termination function. Finally, the API resources are deallocated by the finalize function.

5.5.2 Implementation using the C++ API

Listing 5.8 depicts the Matrix Multiplication application using the C++ API. The main difference from the C implementation is that the DThreads are created as C++ objects and updated through their Update methods. Also, the DThreads are removed from the TSU using the C++ delete operator.

1  // Includes ...
2  using namespace ddm;

4  // Global Variables
5  float *A, *B, *C;
6  MultipleDThread *t1;
7  SimpleDThread *t2;

9  // Declare the code of thread_1
10 void thread_1(ContextArg cntx){
11     // Local Variables
12     int j, k;
13     float res;

15     // Executes the algorithm using the context as i
16     for (j = 0; j < N; j++)
17         for (k = 0; k < N; k++)
18             C[cntx][j] += A[cntx][k] * B[k][j];

20     t2->update(); // Update the thread_1's consumer, i.e. thread_2
21 }

23 // Declare the code of thread_2
24 void thread_2(){
25     // Print results goes here ...

27     // Remove the DThreads
28     delete t1;
29     delete t2;

31     notify_termination(); // Notify that I am the last DThread instance
32 }

34 int main(){ // The main program
35     ddm::init_processor(NUM_OF_CORES); // Initialize the TSU

37     // Memory Allocations (A, B and C) using malloc or C++'s new keyword

39     // Create Thread Template for thread_1
40     t1 = new MultipleDThread(thread_1, 1, sch_dynamic, 0);
41     // Create Thread Template for thread_2
42     t2 = new SimpleDThread(thread_2, N, sch_static, core_0);

44     // Multiple Update (from 0 to N-1)
45     if(is_master())
46         t1->update(0, N-1);

48     ddm::run_explicit(); // Start the DDM scheduling
49     ddm::finalize(); // Deallocates the resources

51     return 0;
52 }

Listing 5.8: Matrix Multiplication implemented using the MiDAS C++ API.

5.5.3 Implementation using TFlux directives

Listing 5.9 depicts the Matrix Multiplication application in DDM format, using TFlux directives. GEORGEIn this case, the TFlux source-to-source compiler takes as input the C application and produces the code that targets the MiDAS system, i.e., the same code we shown in Listings 5.6 and 5.7. Notice that currently the source-to-source compiler does not generate C++ code. Additional details about the TFlux source-to-source compiler can be found in [170]. 101

The #pragma ddm for thread directive in line 15 creates DThread 1 (thread_1), which parallelizes the outer for-loop of the algorithm. The start keyword releases the N instantiations of DThread 1 (spanning from 0 to N-1). The nesting keyword defines the loop nesting level of the DThread; when this keyword is omitted, the TFlux compiler sets the DThread's Nesting to zero. The #pragma ddm endfor directive in line 20 closes the #pragma ddm for thread directive and also updates DThread 2 (thread_2) in each for-loop iteration. DThread 2 is declared in line 22 by the #pragma ddm thread directive. The kernel keyword enforces the scheduling policy; for DThread 2 the static method with core ID=0 is used. When the kernel keyword is omitted, the dynamic scheduling policy is enforced. Furthermore, the readycount keyword defines the RC value of the DThread. Finally, the end keyword indicates the last DThread of the program. It notifies the compiler to add the appropriate functions to remove the DThreads, to deallocate the resources occupied by the application, and to invoke the notify_termination function.

1  // Global Variables
2  float* A;
3  float* B;
4  float* C;

6  // The main program
7  int main(){
8      // Memory Allocations
9      A = (float*) malloc(N*N * sizeof(float));
10     B = (float*) malloc(N*N * sizeof(float));
11     C = (float*) malloc(N*N * sizeof(float));

13     // Initializations goes here ...

15     #pragma ddm for thread 1 start (0 : N-1) nesting 1
16     for (i = 0; i < N; i++)
17         for (j = 0; j < N; j++)
18             for (k = 0; k < N; k++)
19                 C[i][j] += A[i][k] * B[k][j];
20     #pragma ddm endfor update (2)

22     #pragma ddm thread 2 kernel (static, core 0) readycount N end
23     // Print results goes here ...
24     #pragma ddm endthread

26     return 0;
27 }

Listing 5.9: Matrix Multiplication implemented using TFlux directives.

Chapter 6

Recursion Support for the DDM model

6.1 Introduction

In this chapter we explore the mechanisms needed for supporting recursion in DDM by studying two different data-flow models, the U-Interpreter [64] and the Packet Based Graph Reduction (PBGR) [65, 66, 67]. The two models are presented in Section 6.2. The proposed mechanisms for supporting recursion in DDM are presented in Section 6.3. As a proof of concept, the proposed mechanisms were implemented in FREDDO. Section 6.4 introduces the DThread classes that were developed under FREDDO's API in order to provide single-node and distributed recursion support. Finally, Section 6.5 presents the implementation of the recursive Fibonacci algorithm using FREDDO.

6.2 The U-Interpreter and PBGR models¹

6.2.1 The U-Interpreter model

The DDM model is based on the U-Interpreter [64] and it uses token tagging to distinguish between different instantiations of a static code template. In the U-Interpreter model, each instance of the execution of an operator is called an activity, which has a unique name. Each token/value is combined with the name of its destination activity into a packet which is called a tagged token. An activity name consists of four fields, where u is the context field, c is the code block name, s is the instruction number and i is the initiation/iteration number. A procedure activation is implemented using four basic operators: A, BEGIN, END and A⁻¹. A creates a new context u'. BEGIN receives A's output and replicates tokens for each fork. END sends the result to A⁻¹. Finally, A⁻¹ replicates its output for its successors. Figure 48 depicts a simple program which outputs the square of a number. Although this example does not have any parallelism or recursion, it helps with the explanation of more general mechanisms. There are three features that should be provided by a general mechanism:

¹ Based on a document prepared by Professor Ian Watson for the TERAFLUX project [151].


(1) create a separate thread for a function call, (2) return the result from a particular function call to different call sites and (3) distinguish different instantiations of parallel function calls.
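For concreteness, the tagged tokens described above can be pictured as a small record; the following is a minimal C++ sketch with field names and widths chosen by us (the U-Interpreter papers do not prescribe this layout).

#include <cstdint>

struct ActivityName {
    uint32_t u;   // context field
    uint32_t c;   // code block name
    uint32_t s;   // instruction number
    uint32_t i;   // initiation/iteration number
};

struct TaggedToken {
    ActivityName dest;   // the destination activity of this token
    int64_t value;       // the data value carried by the token
};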

int square(int n){
    int result;
    result = n*n;
    return result;
}

void main(){
    int s = square(4);
    printf("result: %d\n", s);
}

Figure 48: A program that computes the square of a number.
Figure 49: U-Interpreter graph of the square function call.

Figure 49 depicts the U-Interpreter graph of the square function of Figure 48. The A node is a representation of the action at the call site. The function square and the argument 4 are shown as two tokens, both with a context u. The A operation constructs a new token which is directed to the input of the square function (sq). The U-Interpreter description uses the (u,r) pair as a new context. The call site has the information of the location r (the address of the node to which the result will be directed) and can clearly construct the new context from the information which it has locally. The reason it does this is threefold. Firstly, the new context must be unique, and contexts constructed from this data satisfy this. Secondly, it is necessary to pass the return link to the function so that it knows where to return its result. Thirdly, it is necessary to pass the old context to the function so that it can restore this old context to the return value. Having executed the body of the function in the new context, the END operator takes the result, replaces the old context and directs the resulting token to the return link address. In order to explore further the mechanisms that are needed for supporting recursion in DDM, it is worth looking at the Fibonacci algorithm:

int fib(int n){
    if (n == 0 || n == 1)
        return n;
    else
        return fib(n-1) + fib(n-2);
}

Although this function is not an efficient implementation of the Fibonacci algorithm, it has the complexity of double recursion. The basic U-Interpreter representation of the Fibonacci algorithm is shown in Figure 50. The recursive calls to the Fibonacci function are depicted as dotted instantiations.

If we want to generate new threads to execute the recursive calls in parallel, it will be necessary to split the final add operation from the rest of the body as a continuation since we cannot execute it until the recursive calls have returned. The graph shows how the computation could be split into two threads: T1 and T2. The execution of T1 will cause the creation of the continuation T2 and two new frames will be created for the recursive calls of T1.

Figure 50: U-Interpreter graph of Fibonacci.
Figure 51: PBGR evaluation of fib(2).

6.2.2 The PBGR model

The Packet Based Graph Reduction (PBGR) approach was used in the ALICE [65] and Flagship [66, 67] projects back in the 1980s. PBGR generates new sections of data-flow graph dynamically to instantiate the body of a function. In PBGR, a packet (or thread descriptor) contains the following information: a function pointer (f), arguments (a1, a2, ...), a return pointer (ra) and a suspension count (sc). The computation is in the form of a graph of packets held in memory which is shared between cores that perform computation. A packet is active if its suspension count (sc) is zero. A scheduling queue contains the addresses of all active packets. The process of computation (or graph reduction) involves cores taking addresses of active packets from the scheduling queue and operating on their contents as determined by the code referenced by the function pointer. This may involve simply performing computation on the arguments and returning a value (to the place pointed to by ra) or it may involve the construction of a new piece of graph.
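The following is a minimal C++ sketch of this scheme; the names Packet, active_queue, worker and return_value are ours (not from the ALICE or Flagship implementations), and a simple mutex-protected queue stands in for the real scheduling machinery.

#include <atomic>
#include <deque>
#include <functional>
#include <mutex>
#include <vector>

struct Packet {
    std::function<void(Packet&)> f;   // code referenced by the function pointer (f)
    std::vector<long> args;           // arguments a1, a2, ...
    Packet* ra = nullptr;             // return pointer: parent packet that receives the result
    int ra_slot = 0;                  // which argument slot of the parent we fill
    std::atomic<int> sc{0};           // suspension count; the packet is active when sc == 0
};

std::deque<Packet*> active_queue;     // addresses of all active packets
std::mutex q_mutex;

// A core repeatedly takes an active packet and runs the code it references.
void worker() {
    for (;;) {
        Packet* p = nullptr;
        {
            std::lock_guard<std::mutex> g(q_mutex);
            if (active_queue.empty()) return;   // no more graph to reduce
            p = active_queue.front();
            active_queue.pop_front();
        }
        p->f(*p);   // may return a value via p->ra or construct new graph packets
    }
}

// Returning a value to a parent decrements its suspension count and, when the
// count reaches zero, places the parent on the scheduling queue.
void return_value(Packet& child, long value) {
    if (!child.ra) return;
    child.ra->args[child.ra_slot] = value;
    if (child.ra->sc.fetch_sub(1) == 1) {
        std::lock_guard<std::mutex> g(q_mutex);
        active_queue.push_back(child.ra);
    }
}

Under this sketch, the fib(2) evaluation of Figure 51 would correspond to the first fib packet building two child fib packets (sc=0) and one add packet (sc=2), with the children's ra pointers aimed at the add packet's argument slots.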

Figure 51 illustrates the PBGR evaluation of the Fibonacci algorithm for n=2. The evaluation starts with a single active packet. We assume that this call will return a value to an outer level computation (e.g., a print function). The fib(2) example involves only one packet reduction with the creation of two calls which will in turn be executed but return a value and generate no further packets. The first fib packet generates three new packets: two fib packets and one add packet. The add packet is suspended on two values which will be provided via the ra fields of the two fib packets which are immediately created as active (i.e., with sc=0 and an entry placed on the scheduling queue). Assume that the two fib packets execute in parallel. Because of their argument values, they immediately return a value (one packet will return 1 and the other packet will return 0) and decrement the sc of the add packet. When the add packet becomes active, it will be executed to produce the value 1 (1 + 0 = 1) which is then returned to the outer level computation.

6.2.3 Differences between DDM and PBGR

DDM and PBGR have two main differences. Firstly, PBGR holds the synchronization (or suspension) count in the packet/frame allocated to hold the arguments and context of the particular function instantiation. In contrast, DDM uses a separate synchronization memory area, called Synchronization Memory (SM). Secondly, DDM has a Thread Template which stores the pointer to the executable code (called Instruction Frame Pointer or IFP), whereas PBGR stores a pointer to code directly. In PBGR, all knowledge about how to construct any new graph associated with a function call is embedded in the code.

6.3 Basic functionalities for supporting recursion in DDM

In this section we will describe the basic functionalities and data-structures that are needed for providing recursion support in DDM, using the Fibonacci algorithm as an example. According to the U-Interpreter graph of Figure 50, two different threads (T1 and T2) are required for supporting the parallel execution of the Fibonacci algorithm. T1 will be responsible for spawning recursive calls while T2 will be responsible for summing/reducing the return values of children-calls and returning the results to parent-calls. More specifically, in the case of n > 2, an instance of T1 (parent-call) will spawn two additional instances of T1 (children-calls). When the children-calls finish their execution, an instance of T2 will be responsible for summing the return values of the children-calls and sending the result (fib(n − 1) + fib(n − 2)) to the parent-call.

Implementing thread T1 in DDM

For implementing thread T1 in DDM, the following functionalities are required:

1. Allowing multiple instances of the same thread: in DDM this functionality can be provided by utilizing the Context attribute which is based on the U-Interpreter model.

2. A mechanism that will allow the spawning of recursive function calls: this can be done by using a DThread where its instances will be responsible for executing the recursive function calls. An instance can spawn additional recursive calls using an Update command.

3. A mechanism that will allow T1 to behave as a regular function: currently, in the DDM model, each DThread has a block of instructions. The Instruction Frame Pointer (IFP) is used to point to the address of the first instruction of the block. When an instance of the DThread is ready for execution, the TSU uses the IFP to execute its code. The code takes as input data only the Context of the instance. Each recursive instance of T1 needs to behave as a regular function, i.e., to have an argument list (AL) and a return value (RV). These attributes are also used in the packets of the PBGR approach. A special data-structure (called RData) is needed to hold the AL and RV of each recursive instance. An array can be used for recursive functions whose number of instances is known at compile-time (Figure 52). Each element of the array corresponds to a different instance. An instance is able to manage (read/write) its RData entry by using its Context value. Notice that in the case of the Fibonacci algorithm, the AL consists of a single integer value (n). The RV is also an integer value.

Figure 52: RData implemented as a fixed-size array.

Figure 53: RData implemented as a hash-map.

For recursive functions whose number of instances is not known at compile-time, a hash-map data-structure can be used (Figure 53). Accessing an RData entry is an associative operation based on the Context value. The allocation/deallocation of RData entries can be

performed as the execution proceeds by the instances, in parallel. However, this requires an efficient hash-map implementation which allows concurrent insert and delete operations.

4. A mechanism that will guarantee that all recursive instances of T1 will have unique Context values at runtime: the instances of T1 run concurrently and each instance can access its RData entry using its Context value. As such, a mechanism is needed to assign unique Context values to the instances. For this functionality we propose two different methods:

• Method 1: assign Context values to the children-calls based on the Context of the parent-call. In the case of Fibonacci, the recursive calls construct a binary tree of executions. We can simply borrow an idea from the binary heap which is implemented in an array, where the root is at index zero. Following this idea, the children-calls will have the following Context values:

(a) Child 1's Context = 2 × Parent's Context + 1

(b) Child 2's Context = 2 × Parent's Context + 2

However, this method cannot be used as a general solution for assigning unique Context values to the instances since it depends exclusively on the algorithm.

• Method 2: use a global atomic variable/counter that holds the next available Context value. When a parent-instance spawns a child-instance, the Context value of the child-instance will be equal to the value of the counter. After that, the counter's value is increased by one (a minimal sketch of this method is given after this list).

5. Create a Thread Template for T1: a new Thread Template has to be created for T1 with Nesting=1 and RC=1. This is because T1 has multiple instances where each instance has only one producer, its parent-instance. Finally, T1 will have two consumers, T1 and T2. T1 is a consumer because a parent-instance can send an Update to a child-instance of the same DThread. T2 is also a consumer because an instance of T1 can send an Update to an instance of T2. This is needed in order to spawn an instance of T2 to process the results of the children-calls when all of them have finished their execution.
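A minimal sketch of Method 2, assuming a 64-bit Context and a hypothetical helper name, is shown below; the fetch_add operation guarantees uniqueness even when parent-instances spawn children concurrently.

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> next_context{1};   // Context 0 is reserved for the root instance

uint64_t new_child_context() {
    return next_context.fetch_add(1);    // returns a fresh, unique Context value
}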

Implementing thread T2 in DDM

Each instance of T2 corresponds to one parent-call of T1. As such, a T2 instance and its associated T1 instance can have the same Context value. This simplifies the procedure of assigning unique Context values to T2's instances. A T2 instance is responsible for processing the return values of the children-calls of a parent-call. The result of the sum operation is actually the return value of the parent-call. For this functionality, the T2 thread needs to access the RData data-structure of T1 in order to write the return value of the parent-call to its RData entry. Finally, a new Thread Template has to

be created for T2 with Nesting=1 and RC=2. This is because T2 has multiple instances where each instance has two producers which are children-instances of T1.

Figure 54: The Fibonacci's DDM Dependency Graph.

The DDM Dependency Graph of the Fibonacci algorithm

Figure 54 depicts the dependency graph of the Fibonacci algorithm with n = 4, based on the aforementioned functionalities/mechanisms. The graph consists of two DThreads, T1 and T2. Solid arrows indicate Update operations and dotted arrows indicate write operations to the RV attribute of T1’s instances.

DDM pseudo-code of the Fibonacci algorithm

The DDM pseudo-code for the Fibonacci algorithm is depicted in Listing 6.1. It includes all the necessary functions and data-structures that are needed for supporting recursion in the DDM model. In this example, we are using an RData data-structure implemented as a fixed-size array. The size of the RData is equal to 2^n. This is because the total number of nodes of the binary tree of Fibonacci(n) is about 2^n. For generating unique Context values for T1's instances we are using the Context-based approach (Method 1). The GET_PARENT macro is used to calculate the Context of a parent-instance based on the Context value of a child-instance. This formula is also borrowed from the binary heap implementation. This special macro is required for sending Update commands to the parents of the

currently executed instances. Another solution to this requirement is to store the Context of a child-instance's parent in its RData entry. However, this solution would increase the memory consumption of the program.

#define GET_PARENT(c) floor((c-1) / 2)
RData rdata; // The RData (implemented as an array)

// The Code of the T1 DThread (ThreadID=2)
T1_code(Context cntx) {
    // Get the AL of the current instance, i.e. n
    n = GET_ARG_LIST(rdata, cntx);

    // Return n to the parent of the current instance
    if (n == 0 || n == 1) {
        SET_RETURN_VALUE(rdata, cntx, RV=n); // Set the RV of the current instance
        UPDATE(TID=3, Context=GET_PARENT(cntx)); // Update the T2 instance of the parent
        return; // Do not call any children
    }

    // Call fib (n-1)
    SET_ARG_LIST(rdata, Context=2*cntx + 1, n-1);
    UPDATE(TID=2, Context=2*cntx + 1);

    // Call fib (n-2)
    SET_ARG_LIST(rdata, Context=2*cntx + 2, n-2);
    UPDATE(TID=2, Context=2*cntx + 2);
}

// The Code of the T2 DThread (ThreadID=3)
T2_code(Context cntx) {
    c1_RV = GET_RETURN_VALUE(rdata, 2*cntx + 1); // Get the RV of the first child
    c2_RV = GET_RETURN_VALUE(rdata, 2*cntx + 2); // Get the RV of the second child
    SET_RETURN_VALUE(rdata, cntx, RV=c1_RV + c2_RV); // Set the parent's RV
    UPDATE(TID=3, Context=GET_PARENT(cntx)); // Update the parent instance
}

// The main function
main(){
    size = power(2, n); // The maximum number of instances
    ALLOCATE(rdata, size); // Allocate elements for the RData

    // Load T1 & T2 DThreads in TSU
    ADD_T1(TID=2, IFP=T1_code's address, RC=1, Nesting=1, Consumers={T1, T2});
    ADD_T2(TID=3, IFP=T2_code's address, RC=2, Nesting=1, Consumers={T2});

    // Call the instance with Context=0 (Root instance)
    SET_ARG_LIST(rdata, Context=0, n);
    UPDATE(TID=2, Context=0);

    START_DDM_SCHEDULING();

    // The RV of Context 0 (the root) holds the result
    result = GET_RETURN_VALUE(rdata, 0);
    PRINT result;
}

Listing 6.1: DDM pseudo-code for the Fibonacci algorithm.

6.4 DThread Classes for Recursion Support in FREDDO

The recursion support for the DDM model is implemented under the FREDDO framework. However, we expect that our techniques/mechanisms can be implemented in other DDM implementations such as DDM-VM and MiDAS. FREDDO's API was extended with four additional classes for

recursion support: RecursiveDThreadWithContinuation, RecursiveDThread, ContinuationDThread and DistRecursiveDThread. The functionalities of these classes are based on the functionalities presented in Section 6.3. RecursiveDThreadWithContinuation and RecursiveDThread can be used only for single-node execution, whereas ContinuationDThread and DistRecursiveDThread can be used for both single-node and distributed execution.

6.4.1 RecursiveDThreadWithContinuation Class

This special template class provides functionalities for algorithms with multiple recursion. It has two template parameters, T_ARGS and T_RETURN. T_ARGS indicates the type of the argument(s) of each recursive call. If a recursive function has more than one argument in its argument list, a struct can be used for holding them. T_RETURN indicates the type of the return value. RecursiveDThreadWithContinuation is responsible for creating and managing a Recursive-DThread and a Continuation-DThread, i.e., MultipleDThreads with additional functionalities. The provided class holds the Argument List (AL) and the Return Value (RV) attributes of each recursive instance and it assigns unique Contexts at runtime. The AL and RV attributes of all instances are stored in an RData data-structure. RData is a dynamically allocated array (in the heap section of the program). Thus, the RecursiveDThreadWithContinuation class is used when the number of instances of the recursive function is known at compile-time. Each recursive instance is associated with an RData entry which holds the following attributes: RV, AL, PContext and CVector. PContext is the Context value of the parent. CVector is a C++ vector data-structure which holds the Context values of the children of a specific instance. The following constructors/methods are provided:

• RecursiveDThreadWithContinuation(MultipleDFunction dFunction, UInt maxNumInstances, MultipleDFunction rFunction, UInt numOfChildren): inserts a DThread in the TSU that implements multiple recursion. The user has to specify the DFunctions of the recursive code (dFunction) and the continuation code (rFunction). Also, the maximum number of instances (maxNumInstances) of the DThreads has to be specified. For example, in Fibonacci, this number is equal to 2^n. Finally, the user has to specify the maximum number of children (numOfChildren) of a parent. For example, in a double recursion algorithm, like Fibonacci, the maximum number of children is 2.

• T_ARGS* getArguments(RInstance rinst): returns the AL of a specific recursive instance. The RInstance indicates the Context of a recursive instance.

• T_RETURN getReturnValue(RInstance rinst): returns the RV of a specific instance.

• T_RETURN getRootReturnValue(): returns the RV of the root instance.

• void callChild(RInstance parentInstance, T_ARGS& args): spawns a child-instance, where parentInstance is the child's parent-instance and args is the child's AL. When this function is

called, it creates a unique Context value for the child-instance, using C++'s atomic operations (the fetch_add routine is used). The Context value of the child-instance will be stored in the CVector of its parent. After that, a new RData entry will be created and an Update command will be sent to the TSU for spawning the child-instance.

• void callRoot(T_ARGS& args): spawns the root recursive call with its arguments.

• vector<RInstance>& getMyChilds(RInstance rinst): returns the Context values of the children of rinst.

• void returnValueToParent(RInstance rinst, T_RETURN value): returns the value of a child-instance with Context=rinst to its parent. In particular, the return value will be stored in the RV attribute of the RData entry of the child-instance (rinst). Finally, an Update will be sent to the continuation instance of the parent-call.

• void updateContinuationInstance(RInstance rinst): updates an instance of the Continuation-DThread directly.

6.4.2 RecursiveDThread and ContinuationDThread Classes

For recursive functions whose number of instances is not known at compile time, we provide two special classes: RecursiveDThread and ContinuationDThread. The programmer is responsible for allocating/deallocating the arguments and the return values of the instances at runtime. A DDM user can use a RecursiveDThread along with a ContinuationDThread to implement an algorithm with multiple recursion (or any similar algorithm). Also, RecursiveDThread can be used as a standalone class for other types of recursion, such as linear, tail, and so on. The RecursiveDThread and ContinuationDThread classes utilize a special template class, called RData, which correlates parent-instances with children-instances. Each recursive call is associated with an RData object. RData holds the arguments of a recursive call, pointers to the return values of its children (if any) and a pointer to the RData of its parent.

6.4.2.1 Constructors/methods of RData Class

• RData(T_ARGS arg, RInstance parentInstance, RData* parentRData, unsigned int numChilds): constructs a new RData object. T_ARGS indicates the type of the argument(s) of each recursive call.

• T_ARGS getArgs(): returns the arguments of the recursive call.

• RInstance getParentInstance(): returns the Context value of the parent recursive call.

• RData* getParentRData(): returns a pointer to the parent’s RData.

• void addReturnValue(T_RETURN value): sends the return value of the child to the parent.

• T_RETURN sum_reduction(): applies sum reduction to the return values of the children of this instance/call and it returns the result.

• T_RETURN* getChildrenReturnValues(): returns the return values of the children of this instance/call.

• void returnValueToParent(T_RETURN value, ContinuationDThread* contDThread): sends the return value to the parent of this instance/call. Also, it Updates the RC value that corresponds to the parent's Continuation instance. This is required in order to notify a parent instance that all its children have returned.

• bool hasParent(): indicates whether this instance has a parent.

6.4.2.2 ContinuationDThread Class

A ContinuationDThread object accesses an RData object through its DFunction, called ContinuationDFunction. ContinuationDFunction has two input arguments, the Context value (called RInstance) and a pointer (void*) to the RData object. Constructors/methods:

• ContinuationDThread(ContinuationDFunction cDFunction, ReadyCount readyCount, UInt numOfInstances): inserts a ContinuationDThread in the TSU, where a StaticSM will be used.

• ContinuationDThread(ContinuationDFunction cDFunction, ReadyCount readyCount): inserts a ContinuationDThread in the TSU, where a DynamicSM will be used.

• void update(RInstance rinst, void* rdata): decrements the RC value of this DThread that corresponds to the rinst. rdata is a pointer to the RData object of the instance that is going to be updated. rdata is used by the TSU when a continuation-instance is ready for execution. In particular, rdata will be used as an input argument in the ContinuationDFunction call of the ready continuation-instance.

6.4.2.3 RecursiveDThread Class

FREDDO provides a separate DFunction for the RecursiveDThread class, called RecursiveDFunction. RecursiveDFunction has the same interface as the ContinuationDFunction. The following constructors/methods are provided:

• RecursiveDThread(RecursiveDFunction rDFunction): inserts a RecursiveDThread in the TSU, where a DynamicSM will be used.

• RInstance callChild(RData* rdata): spawns a recursive child, where rdata is a pointer to the child's RData object. This task includes the creation of a unique Context for the child, using C++'s atomic operations (the fetch_add routine is used), and sending an

Update command to the TSU targeting the child-instance. The Update command consists of the Context value and the rdata of the child.

6.4.3 Distributed Recursion Support

For supporting distributed execution of recursive algorithms, in a data-driven manner, we have extended the functionalities of RecursiveDThread, through the DistRecursiveDThread class, and we provide an enhanced RData class, called DistRData. Each recursive call of the DistRecursiveDThread is associated with a DistRData object. DistRData holds the arguments of a recursive call, pointers to the return values of its children (if any) and a pointer to the DistRData of its parent. Thus, the DistRData objects correlate children recursive instances with their parents. When a parent-instance calls one or more children-instances, the Distributed Scheduling Unit (DSU) decides which of them will be executed on remote nodes, based on their Context values. In this case, the Network Manager will send the DistRData objects and the children-instances' Context values to the remote nodes in order to be scheduled for execution. When a child-instance returns a value to its parent, the runtime system checks whether the parent-instance is mapped on the local node or on a remote node. In the latter case, the return value is sent via a network message to the remote node and finally, it is stored in the parent's DistRData object.

6.4.3.1 Constructors/methods of DistRData Class

• DistRData(void* arg, RInstance parentInstance, DistRData* parentData, unsigned int numChilds): creates a new DistRData object.

• void* getArgs(): returns the argument(s) of the recursive instance.

• RInstance getParentInstance(): returns the Context value of the parent instance.

• DistRData* getParentRData(): returns a pointer to the DistRData object of the parent in- stance.

• void addReturnValue(void* value): adds the return value of a child instance (the owner of this DistRData object is the parent).

• T_RETURN sum_reduction(): applies sum reduction to the return values of the children of this instance/call and it returns the result.

• T_RETURN** getChildrenRVs(): returns the return values of the children instances.

• bool hasParent(): indicates if the recursive instance has a parent.

• void makeParentRemote(): informs the DistRData object that its parent resides on a remote node. This method is called by the Network Manager when it constructs a new DistRData object for a child that resides on a different node from its parent.

• bool isMyParentRemote(): indicates if the DistRData object’s parent resides on a remote node.

• unsigned int getNumberOfChildrenRVs(): returns the number of children of this DistRData object.

6.4.3.2 Constructors/methods of the DistRecursiveDThread Class

• DistRecursiveDThread(RecursiveDFunction rDFunction): constructs a new DistRecursiveDThread object.

• DistRecRes callChild(void* args, size_t argsSize, RInstance parentInstance, DistRData* parentRData, unsigned int numChilds): calls an instance of the recursive function. numChilds is the number of children of the instance that is going to be created. It returns a DistRecRes value which holds the Context and a pointer to the DistRData object of the new child. FREDDO assigns unique Context values to the children instances using two techniques:

1. It uses an atomic variable/counter in order to create unique Context values locally.

2. The peer/node ID is stored in the leftmost 12 bits of each Context value. This creates unique Context values across the entire distributed system (a minimal sketch follows this list).

• void returnValueToParent(void* value, size_t valueSize, ContinuationDThread* contDThread, DistRData* rdata): returns a value to the parent call which may reside on a remote node. rdata is a pointer to the DistRData object of the child.
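A minimal sketch of the two Context-generation techniques just described, assuming a 64-bit Context value with the peer ID packed into its leftmost 12 bits, is shown below (make_global_context is a hypothetical name, not a FREDDO routine).

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> local_counter{0};   // technique 1: local atomic counter

uint64_t make_global_context(uint64_t peer_id) {
    const uint64_t local = local_counter.fetch_add(1) & ((1ULL << 52) - 1); // low 52 bits
    return (peer_id << 52) | local;   // technique 2: peer ID occupies bits 63..52
}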

6.5 Implementing the recursive Fibonacci algorithm in FREDDO

In this section we present the recursive Fibonacci algorithm, implemented in FREDDO, targeting single-node and distributed systems.

6.5.1 Implementation using the RecursiveDThreadWithContinuation Class

The Fibonacci algorithm implemented using the RecursiveDThreadWithContinuation class is shown in Listing 6.2. The presented code is equivalent to the pseudo-code of Listing 6.1. In the pseudo-code we used a Context-based approach to generate unique Context values, whereas in the presented example an atomic counter is used. In the latter case, the atomic counter is managed implicitly by the RecursiveDThreadWithContinuation class.

6.5.2 Implementation using the RecursiveDThread and ContinuationDThread Classes

Listing 6.3 depicts the Fibonacci implementation using the RecursiveDThread and ContinuationDThread classes. Notice that the programmer is responsible for creating the RData objects of the

child-instances (lines 22 and 24). After the run function is executed, a sum reduction is performed on the return values of the children of the rootRData in order to calculate the final result (line 51).

1  #include
2  using namespace ddm;
3  RecursiveDThreadWithContinuation *fib_dt;

5  void fib_code(RInstance context) { // The Fibonacci Code
6      int n = *(fib_dt->getArguments(context)); // Get the arguments (n)

8      if (n == 0 || n == 1) {
9          fib_dt->returnValueToParent(context, n); // Return n
10         return;
11     }

13     int n1 = n-1, n2 = n-2;
14     fib_dt->callChild(context, n1); // Call fib (n-1)
15     fib_dt->callChild(context, n2); // Call fib (n-2)
16 }

18 void continuation_code(RInstance context) { // The Continuation Code
19     // result holds the sum of the Return Values of the children-instances
20     int result = 0;

22     // Get the Return Value of each child and add it to the result variable
23     for (auto i : fib_dt->getMyChilds(context))
24         result += fib_dt->getReturnValue(i);

26     // fib(n) = fib(n-1) + fib(n-2)
27     fib_dt->returnValueToParent(context, result);
28 }

30 // The main program
31 void main(int argc, char* argv[]){ // Initializations go here ...
32     ddm::init(NUM_KERNELS); // Initializes the DDM execution environment
33     int maxInstances = pow(2, n);

35     // Create the DThread for recursion support
36     fib_dt = new RecursiveDThreadWithContinuation(fib_code, maxInstances, continuation_code, 2);

38     // Call the root instance, i.e. the instance with Context=0
39     fib_dt->callRoot(n);
40     ddm::run(); // Start DDM scheduling
41     cout << fib_dt->getRootReturnValue() << endl; // Print the result
42     delete fib_dt;
43 }

Listing 6.2: The recursive Fibonacci algorithm implemented in FREDDO using the RecursiveDThreadWithContinuation class.

6.5.3 Distributed Implementation

Listing 6.4 depicts the distributed implementation of the Fibonacci algorithm. In the distributed implementation, the callChild routine has a different interface compared to the one presented in the previous example. In the distributed case, the user should provide the argument list and its size (in bytes), the parent's Context value, a pointer to the parent's DistRData object and the number of children. The DistRecursiveDThread will then create a DistRData object for the child, implicitly. Also notice that the two basic functions of DistRecursiveDThread (callChild and returnValueToParent) require

the size of the input data (arguments or return value) in bytes. This information is needed by FREDDO when data is sent to a remote node.

1  #include
2  using namespace ddm;

4  using T = long;

6  // Declare DThread Objects
7  RecursiveDThread* rDThread;
8  ContinuationDThread* cDThread;

10 // The DThread that executes the recursive function calls
11 void fib_code(RInstance context, void* data) {
12     auto rd = (RData*) data;
13     auto n = rd->getArgs();

15     // A leaf node, so we should return the value to parent for summing
16     if (n == 0 || n == 1) {
17         rd->returnValueToParent(n, cDThread); // Send the RV to my parent
18         return;
19     }

21     // Call fib (n-1)
22     rDThread->callChild(new RData(n-1, context, rd, 2));
23     // Call fib (n-2)
24     rDThread->callChild(new RData(n-2, context, rd, 2));
25 }

27 // The Continuation DThread
28 void continuation_code(RInstance context, void* data) {
29     auto rData = (RData*) data;

31     // Sum the results of my children
32     auto sum = rData->sum_reduction();
33     rData->returnValueToParent(sum, cDThread);
34 }

36 void main(int argc, char* argv[]) {
37     RData* rootRData = nullptr;
38     ddm::init(kernels); // Initializes the DDM execution environment

40     // Create the DThread Objects
41     rDThread = new RecursiveDThread(fib_code);
42     cDThread = new ContinuationDThread(continuation_code, 2);

44     // Call the root instance
45     rootRData = new RData(n, 0, nullptr, 2);
46     rDThread->callChild(rootRData);

48     ddm::run(); // Start the DDM scheduling

50     // Print the result
51     T res = rootRData->sum_reduction();
52     cout << "Result: " << res << endl;

54     // Releases the resources of the DDM environment
55     delete rDThread;
56     delete cDThread;
57     ddm::finalize();
58 }

Listing 6.3: The recursive Fibonacci algorithm implemented in FREDDO using the RecursiveDThread and ContinuationDThread classes.

1  #include
2  using namespace ddm;

4  using T = long;

6  // Declare DThread Objects
7  DistRecursiveDThread* rDThread;
8  ContinuationDThread* cDThread;

10 // The DThread that executes the recursive function calls
11 void fib_code(RInstance context, void* data) {
12     auto rd = (DistRData*) data;
13     T n = *((T*) rd->getArgs());

15     // A leaf node, so we should return the value to parent for summing
16     if (n == 0 || n == 1) {
17         rDThread->returnValueToParent(new T {n}, sizeof(T), cDThread, rd);
18         return;
19     }

21     // Call fib (n-1)
22     rDThread->callChild(new T {n-1}, sizeof(T), context, rd, 2);
23     // Call fib (n-2)
24     rDThread->callChild(new T {n-2}, sizeof(T), context, rd, 2);
25 }

27 // The Continuation DThread
28 void continuation_code(RInstance context, void* data) {
29     auto rData = (DistRData*) data;

31     // Sum the results of my children
32     T sum = rData->sum_reduction();
33     rDThread->returnValueToParent(new T {sum}, sizeof(T), cDThread, rData);
34 }

36 void main(int argc, char* argv[]) {
37     // Local Variables
38     DistRecRes res = { };

40     // Initialize Distributed FREDDO with CNI support
41     freddo_config* conf = new freddo_config();
42     conf->enableTsuPinning();
43     conf->disableNetManagerPinning();
44     conf->enableKernelsPinning();
45     ddm::init("peers.txt", 1234, conf); // port=1234

47     // Create the DThread objects
48     rDThread = new DistRecursiveDThread(fib_code);
49     cDThread = new ContinuationDThread(continuation_code, 2);

51     // Build the distributed system
52     ddm::buildDistributedSystem();

54     if (ddm::isRoot()) {
55         res = rDThread->callChild(new T {n}, sizeof(T), 0, nullptr, 2);
56     }

58     ddm::run(); // Start the DDM scheduling

60     if (ddm::isRoot()) {
61         T result = res.data->sum_reduction(); // res.data is a DistRData* var
62         cout << "Result: " << result << endl;
63     }
64 }

Listing 6.4: Distributed Fibonacci implementation.

Chapter 7

Evaluation

7.1 Introduction

In this chapter we present the evaluation results for the DDM's architectural support (TSU), MiDAS and FREDDO. We start with the description of the benchmark suite used in this thesis (Section 7.2), followed by an overview of the hardware environment used in our experiments (Section 7.3). The evaluation of the hardware TSU and MiDAS is presented in Section 7.4. Section 7.5 presents the evaluation results of the single-node FREDDO implementation as well as performance comparisons with the OpenMP [19] and OmpSs [157, 40] frameworks. Finally, Section 7.6 presents the performance evaluation of the distributed FREDDO implementation, including comparisons with other systems as well as network traffic analysis results.

7.2 Benchmark Suite

The benchmark suite used in this thesis contains applications with different characteristics: benchmarks with simple dependency graphs, recursive algorithms and benchmarks with high complexity dependency graphs.

7.2.1 Benchmarks with simple dependency graphs

1. Blocked Matrix Multiplication (BMMULT): it multiplies two block/partitioned matrices, A and B, and stores the result in matrix C, i.e., C = A × B. The algorithm multiplies the blocks (square matrices) similarly to the original matrix multiplication.

2. Swaptions: it uses the Heath-Jarrow-Morton (HJM) framework to price a portfolio of swaptions. The HJM framework describes how interest rates evolve for risk management and asset liability management for a class of models. Swaptions employs a Monte Carlo (MC) simulation to compute the prices. The simulation number variable is set to 20,000. This benchmark


belongs to the PARSEC Benchmark Suite [173] and has the following characteristics: data-parallel application with coarse-grain granularity, medium working set, low data sharing and low data exchange.

3. Blackscholes: it calculates the prices for a portfolio of European options analytically with the Black-Scholes partial differential equation (PDE). This benchmark belongs to the PARSEC Benchmark Suite [173]. The characteristics of the Blackscholes benchmark are: data-parallel application with coarse-grain granularity, small working set, low data sharing and low data exchange.

4. Conv2D: 9x9 convolution filter.

5. Trapez: trapezoidal rule for integration. It calculates the definite integral of a function in a given interval.

6. Mandelbrot: it implements the Mandelbrot set algorithm [174]. The Mandelbrot set is the set of complex numbers c for which the function fc(z) = z² + c does not diverge when iterated from z = 0. Mandelbrot set images can be created by sampling the complex numbers and determining, for each sample point c, whether the result of iterating the function fc approaches infinity. The real and imaginary parts of c are treated as image coordinates. This allows the images' pixels to be colored according to how rapidly the sequence zₙ² + c diverges, with the black color used for points where the sequence does not diverge.

7.2.2 Benchmarks with complex dependency graphs

7. Tile LU Decomposition (LU): it factors a dense matrix into the product of a lower triangular L and an upper triangular U matrix [172]. The dense n × n matrix A is divided into an N × N array of B × B tiles (n = NB). It has been based on an earlier version written in StarSs [127].

8. Tile Cholesky Factorization (Cholesky): it is used for the numerical solution of linear equations Ax = b, where A is symmetric and positive definite. The Cholesky factorization of an n × n real symmetric positive definite matrix A has the form A = LLᵀ, where L is an n × n real lower triangular matrix with positive diagonal elements. Operations on the tiles are performed using LAPACK [175] (V3.6.1) and BLAS [176] routines. Figure 55a depicts the task graph of the algorithm for a matrix of 5 × 5 tiles.

9. Tile QR Factorization (QR): it offers a numerically stable way of solving underdetermined and overdetermined systems of linear equations (least squares problems). It is also the basis for the QR algorithm for solving the eigenvalue problem [26]. The QR factorization of an m × n real matrix A has the form A = QR, where Q is an m × m real orthogonal matrix and R is an m × n real upper triangular matrix. The QR version used in this thesis implements the right-looking tile QR factorization as described in [177]. The algorithm uses LAPACK

[175] (V3.6.1) and PLASMA [178] (V2.8.0) routines. Figure 55b shows the task graph of the algorithm for a matrix of 5 × 5 tiles.

Figure 55: Task graphs of high complexity algorithms: (a) Cholesky factorization, (b) QR factorization.

7.2.3 Recursive algorithms

10. Fibonacci: it calculates the Fibonacci numbers using double recursion. In mathematical terms, the sequence Fₙ of Fibonacci numbers is defined by the recurrence relation Fₙ = Fₙ₋₁ + Fₙ₋₂, with seed values F₀ = 0 and F₁ = 1.

11. NQueens: recursive implementation of the NQueens puzzle. It solves the problem of placing N queens on an N × N chessboard so that no two queens threaten each other [179, 12]. This application is implemented using a Branch and Bound Algorithm. The source code of this application was retrieved from the BSC Application Repository [180]. Figure 56 illustrates a solution to the 4Queens problem (no two queens are on the same row, column, or diagonal).

Figure 56: A solution to the 4Queens problem [12].
Figure 57: Knight's graph showing all possible paths for a knight's tour on a standard 8 × 8 chessboard [13].

12. Knights-Tour: recursive implementation of the knight's tour problem. A knight's tour is a sequence of moves of a knight on a chessboard such that the knight visits every square only once [13]. If the knight ends on a square that is one knight's move from the beginning square (so that it could tour the board again immediately, following the same path), the tour is closed, otherwise it is open. The source code of this application was retrieved from the BSC Application Repository [180]. Figure 57 shows all possible paths for a knight's tour on a standard 8 × 8 chessboard. The numbers on each node indicate the number of possible moves that can be made from that position.

13. PowerSet: it calculates the number of all subsets of a set with N elements, using a multiple recursion algorithm. The original algorithm was retrieved from the BSC Application Repository [180].

7.3 Experimentation Infrastructure

This section presents details of the systems used to evaluate MiDAS and FREDDO. For the evaluation of the DDM's hardware support, the Xilinx ML605 Evaluation Board [62] is used. Figure 58 depicts the board, which includes a Virtex-6 FPGA device (XC6VLX240T) and additional peripherals such as VGA, RS-232, DDR3 memory, Flash memory, Ethernet, etc. The board possesses 512 MB DDR3 SO-DIMM and 2 GB external Flash memory. The XC6VLX240T Virtex-6 FPGA has the following features:

• Logic Cells: 241,152. An LUT can be configured as either one 6-input LUT (64-bit ROMs) with one output, or as two 5-input LUTs (32-bit ROMs) with separate outputs but common addresses or logic inputs. Each LUT output can optionally be registered in a flip-flop.

• Slices: 37,680. Each slice contains four LUTs, eight flip-flops, multiplexers and arithmetic carry logic (only some slices can use their LUTs as distributed RAM or SRLs). Two slices form a configurable logic block (CLB).

• Max Distributed RAM (Kb): 3,650

• DSP Slices: 768. Each DSP slice contains a 25 × 18 multiplier, an adder, and an accumulator.

• Block RAM Blocks: 832 of 18 Kb or 416 of 36 Kb (Max: 14,976 Kb). Block RAMs are fundamentally 36 Kbits in size. Each block can also be used as two independent 18 Kb blocks.

• Max User I/O: 720


Figure 58: The Xilinx ML605 Evaluation Board.

To evaluate FREDDO we have used two different systems, AMD and CyTera. AMD is a 4-node local system. CyTera [181] is an open-access HPC system which provides up to 64 nodes per user. The specifications of the systems are shown in Table 5. Each AMD node runs Ubuntu 14.04 OS (server edition), while each Intel node runs CentOS 6.6.


Table 5: Systems used for the benchmark evaluation of FREDDO.

7.4 The evaluation of the hardware TSU and MiDAS

In this section we determine the performance and estimate the various relevant overheads of the proposed hardware-implemented TSU and of the overall MiDAS system. Firstly, we assess the resource requirements needed to implement the TSU in hardware under different configurations, as well as the latencies of various TSU operations. We also evaluate MiDAS's performance using benchmarks with different characteristics in terms of code size, granularity, and inter-thread dependency complexities. MiDAS was implemented with two different TSU configurations: (1) a TSU with a large and fast DynamicSM (it utilizes 32 Context Search Engines (CSEs)) and (2) a TSU with a smaller and slower DynamicSM (it utilizes 2 CSEs). The former configuration is called Performance Optimized TSU (PO-TSU) and the latter, Area Optimized TSU (AO-TSU). Finally, we present FPGA resource utilization results and power consumption estimations of the MiDAS system and compare them with performance-power metrics obtained by simulating other systems.

7.4.1 TSU Resource Requirements

We provide FPGA hardware overhead results needed for the implementation of the hardware TSU synthesized with various parameter values such as TID size, RC size, number of cores supported by the TSU and the number of the Context Search Engines (CSEs) of the DynamicSM. The following TSU parameters are kept constant in all experiments carried out in Section 7.4.1:

• Context Size: 32 bits

• Template Memory: 2^(TID Size) entries, entry size (bits) = 5 + RC Size + log2(Number of cores)

• StaticSM: 2^(TID Size) entries, entry size (bits) = RC Size

• DynamicSM: 1024 SMI entries, 1024 RC Blocks, 32 RCs per RC Block, SMI entry size (bits) = 22 + log2(RCs per RC Block) + TID Size + 2 × Context Size

• CMD Buffer: 16 entries, 96-bit entry size; Waiting Queue: 256 entries, 96-bit entry size

• Update Queue: 256 entries, entry size (bits) = 2 + TID Size + 2 × Context Size

• Ready Queue: 256 entries, entry size (bits) = TID Size + Context Size + 2 + log2(Number of cores)

• Number of supported cores: 8 (except for Section 7.4.1.4)
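As a worked example derived from the formulas above (and assuming the default configuration of an 8-bit TID, an 8-bit RC and 8 supported cores), a Template Memory entry needs 5 + 8 + log2(8) = 16 bits, so the 2^8-entry Template Memory occupies 256 × 16 bits = 512 bytes, while the StaticSM adds another 256 × 8 bits = 256 bytes.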

7.4.1.1 Effect of TID on TSU resource requirements

The Thread ID (TID) attribute is used by the majority of the TSU components. It also determines the sizes of the Template Memory (TM) and the StaticSM. In order to study the effect of the TID size on the TSU resource requirements, we synthesized the TSU with five different TID sizes, starting at 4 bits and increasing in increments of 2 bits up to 12 bits. In our experimental evaluation we used an 8-bit RC and a DynamicSM with 8 CSEs. Figure 59 depicts the effect of the TID size on the TSU resource requirements. The results are normalized to the 4-bit configuration. Slice logic requirements include the number of slice registers/flip-flops, i.e., the number of one-bit registers used in the entire FPGA, and the slice look-up tables (LUTs) that are used as logic or distributed RAM, or even as shift registers. Experimental results show that the TID size does not significantly affect overhead requirements. We chose an 8-bit TID as our default configuration since it causes the number of slice LUTs to increase by merely 2%. Also, such a configuration allows the TSU to hold up to 256 Thread Templates, an adequate number for the benchmarks used in our tests as well as for larger benchmarks exhibiting greater complexity. We note that our high-complexity benchmarks utilize up to 6 Thread Templates.

Figure 59: Effect of TID on TSU resource requirements.

7.4.1.2 Effect of RC size on TSU resource requirements

Another important parameter of the proposed TSU is the RC size, as it determines the number of producers of each DThread instance. Each DThread instance can have up to 2^(RC Size) − 1 consumers. Figure 60 depicts the effect of the RC size on the TSU resource requirements for five different configurations at 2, 4, 8, 16, and 32 bits, where the demonstrated results are normalized to the 2-bit configuration. In our experiments we used an 8-bit TID and a DynamicSM with 8 CSEs. Results show that the RC size does not significantly affect resource requirements except for the 32-bit configuration, where the number of LUTs is increased by 9% and the number of BRAMs by 29%. We opt for the 8-bit RC to act as the default configuration since with this configuration the number of slice LUTs required is increased by merely 2%, while it can also handle a reasonable number of consumers for each DThread instance, i.e., 255 consumers.

Figure 60: Effect of RC size on TSU resource requirements.

7.4.1.3 Effect of Context Search Engine (CSE) number on TSU resource requirements

The SM Indexer (SMI) of the DynamicSM is divided into several CSEs in order to accelerate search operations. The larger the number of CSEs, the better the performance that can be achieved. In order to evaluate the effect of the CSE number on resource requirements, we synthesized the TSU with five different configurations of 2, 4, 8, 16, and 32 CSEs. Figure 61 depicts the effect of the CSE number on the TSU resource requirements, with results normalized to the #CSEs=2 configuration. These results show that the number of CSEs significantly affects the resource requirements. For example, when 32 CSEs are used, the slice register count is increased by 87%, the slice LUT count by 164%, and the Block RAM count by 243%. Choosing the proper number of CSEs depends on two basic parameters: (i) the TSU performance demanded by applications, where the larger the number of CSEs the better the attained performance, and (ii) the size of the target FPGA device.
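As a software analogue of this trade-off, the sketch below (illustrative only; it is not derived from the TSU's Verilog sources) splits the SMI entries evenly across a number of engines. In hardware each engine scans its own partition concurrently, so the worst-case scan per engine drops from E entries to E/#CSEs entries (e.g., 1024/32 = 32 entries per engine in the PO-TSU versus 1024/2 = 512 in the AO-TSU).

#include <cstdint>
#include <vector>

struct SMIEntry { bool valid; uint8_t tid; uint64_t context; };

// Scans one partition; in hardware each CSE performs this in parallel with the others.
int search_partition(const std::vector<SMIEntry>& smi, int begin, int end,
                     uint8_t tid, uint64_t context) {
    for (int i = begin; i < end; ++i)
        if (smi[i].valid && smi[i].tid == tid && smi[i].context == context)
            return i;
    return -1;
}

int search(const std::vector<SMIEntry>& smi, int num_cse, uint8_t tid, uint64_t ctx) {
    const int per_cse = static_cast<int>(smi.size()) / num_cse;   // e.g., 1024 / 32 = 32
    for (int e = 0; e < num_cse; ++e) {                           // sequential here, parallel in hardware
        int hit = search_partition(smi, e * per_cse, (e + 1) * per_cse, tid, ctx);
        if (hit >= 0) return hit;
    }
    return -1;
}

int main() {
    std::vector<SMIEntry> smi(1024, {false, 0, 0});
    smi[100] = {true, 3, 42};
    return search(smi, 32, 3, 42) == 100 ? 0 : 1;
}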

Figure 61: Effect of CSE number on TSU resource requirements.

7.4.1.4 Effect of the number of cores on TSU resource requirements

Figure 62 depicts TSU resource requirements with respect to the number of supported cores. The TSU was configured with 2, 4, 8, 16, 32, and 64 cores, and then synthesized. All experimental results are normalized to the #Cores=2 configuration. We used an 8-bit TID width, an 8-bit RC width, and a DynamicSM with 4 CSEs. As our results show, resource requirements grow with increasing core counts, as expected. The reason for this is that the number of CMD Buffers, Waiting Queues, and Transfer Units increases. Furthermore, the round-robin arbiters of the Fetch Unit and the Scheduling Unit need more resources since they now support more cores.

Figure 62: Effect of the number of cores on TSU resource requirements.


Table 6: Latencies (in cycles) of various TSU Operations.

7.4.2 Latencies (in cycles) of various TSU Operations

Table 6 presents the latencies of various TSU operations, where the majority of these operations are executed in parallel. For instance, the Scheduling Unit moves data from the Ready Queue to the Waiting Queues according to the scheduling policy, while the Update Unit sends Update signals to the Synchronization Memories, i.e., the StaticSM and the DynamicSM, and the Transfer Units move data from the Waiting Queues to the Output FSL buses. Additionally, the Fetch Unit moves data from the CMD Buffers to the Update Queue while it fetches Update Commands from the Input FSL Buses and stores them in the CMD Buffers. The presented results were obtained using waveform outputs from the Xilinx ISim simulator [182], where the TSU was synthesized with an 8-bit TID width and an 8-bit RC width. Finally, Table 6 presents the number of entries of each data structure along with the entry sizes in bits. Experimental results show that TSU operations complete with reasonable latencies. However, the latencies of the DynamicSM are quite high due to the overheads of the SM Indexer (SMI) module. The performance of the SMI can be improved by using more CSEs or by implementing the SMI using Content-Addressable Memories (CAMs) [183]. However, in the latter case more resources and power will be required, as we show in Section 8.2.1.2.

7.4.3 Performance Evaluation of MiDAS architecture

MiDAS was configured with eight MicroBlaze soft-cores. Although our TSU implementation supports a larger number of cores, as shown in Figure 62, Xilinx tools place restrictions upon the utilization of a greater number of cores in the target FPGA device. The reason is that the AXI4 bus used to connect the MicroBlaze soft-cores with the DDR3 SDRAM controller supports up to 16 master links; each MicroBlaze utilizes 2 master links for its Data and Instruction caches, thus only eight MicroBlaze cores can be supported. This issue can be solved in newer Xilinx FPGA devices, such as Virtex-7 and Virtex-UltraScale FPGAs, using the Xilinx Vivado Design Suite [155]. All benchmarks were developed using our C++ API, and Table 7 illustrates the characteristics of the benchmarks used in our experimental evaluation. The benchmarks that operate on matrices use dense single-precision floating-point values. The problem sizes are separated into six categories as follows: Tiny, XXSmall, XSmall, Small, Medium, and Large. For the block/tile algorithms, we choose two different granularities for each problem size. Table 8 depicts the sequential execution time for each problem size of all benchmarks. The best sequential and parallel execution times among the two selected granularities for each problem size of each block/tile algorithm were selected. The execution time measurements were collected using the hardware timer of the master core of the system. The implementation details of the multi-core processor were presented in Section 3.3. Each MicroBlaze of the multi-core processor has 32-KB non-coherent L1 Data and Instruction caches.

Table 7: Benchmark suite characteristics used in evaluating MiDAS’s performance.

Table 8: Sequential execution time of the benchmarks running on MiDAS.

The caches were implemented using the write-through policy since our experiments showed that using caches with the write-back policy incurs more overheads. The reason is that each time a DThread instance finishes its execution, a flushing operation has to be called (in order to transfer the output data from the cache to the Main Memory before the consumer-DThreads start their execution), which decreases the performance. A better approach is to use software-controlled scratchpad memories [184], where the DThreads' output data would be transferred to the Main Memory using DMA operations. A similar approach was implemented in software, by DDM-VM [10], for the Cell processor. MiDAS was evaluated under two different TSU implementations, first using the Performance Optimized TSU (PO-TSU), and second, using the Area Optimized TSU (AO-TSU). The PO-TSU implementation has a large DynamicSM with 1024 SMI entries, 1024 RC Blocks, 32 CSEs and 32 RCs per block. This enables the TSU to hold 32768 RC values at the same time, for DThreads with RC > 1. The 32 CSEs utilize 64 ports simultaneously in order to search the SMI module. The AO-TSU implementation has a smaller and slower DynamicSM with 1024 SMI entries, 1024 RC Blocks, 2 CSEs and 8 RCs per block. This configuration enables the TSU to hold 8192 RC values for DThreads with RC > 1. In both TSU implementations we used an 8-bit TID width and an 8-bit RC width. The numbers of entries of the TSU components, except for the DynamicSM, are depicted in Table 6 (e.g., each Input FSL Bus has 128 entries).

7.4.3.1 Performance Evaluation using the Performance Optimized TSU (PO-TSU)

The performance evaluation of MiDAS using the PO-TSU implementation is illustrated in Figure 63. The results are presented in the form of speedups, where speedup is defined as S/P, where S is the execution time of the sequential version of the benchmark (without any DDM overheads) and P is the execution time of the DDM implementation. The benchmarks were executed using the following numbers of enabled cores: 1, 2, 4, and 8. Although the current MiDAS system utilizes eight cores, our API supports a higher number of cores through an API routine, called init processor, which informs the TSU about the cores that will be used in DDM applications. We also evaluated the ability of the system to handle different problem sizes and thread granularities, i.e., fine-grained and coarse-grained. Our experimental evaluation shows that MiDAS's performance scales very well across the range of tested benchmarks and achieves very good speedups, especially under the larger problem sizes. This is justified by the fact that as the execution time of a benchmark increases, the parallelization overhead is amortized.


Figure 63: MiDAS's performance using the PO-TSU implementation under various numbers of enabled cores and problem sizes.

The average speedup and efficiency for each problem size and number of enabled cores are depicted in Table 9. For the smallest problem size (Tiny), the system achieves an average speedup from 0.95 to 5.88, while for the Small problem size it achieves an average speedup from 1 to 7.64. For the 8-core configuration and the largest problem size (Large), MiDAS achieves an average speedup of 7.91, exhibiting 98.8% efficiency with respect to the theoretical maximum speedup of 8. These results demonstrate that MiDAS is a performance-scalable architecture, achieving speedups close to linear.
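For reference, the efficiency values reported in Table 9 are simply the achieved speedup divided by the number of enabled cores, i.e.,

efficiency = speedup / (number of enabled cores), e.g., 7.91 / 8 ≈ 0.988 (98.8%).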

MiDAS achieves such results since it provides efficient data-driven scheduling through an optimized TSU implementation with low overheads. There are three factors that contribute to this. First, MiDAS utilizes light-weight and very fast buses to transfer Update commands and ready DThread instances between the TSU and the processing elements. Second, the TSU was implemented as an optimized asynchronous hardware module where its data structures (such as the TM, StaticSM, etc.) are implemented using on-chip BRAM memory. On-chip memory is much faster than the system's main memory (i.e., DDR3 SDRAM), which is used for the data structures of software TSU implementations [35, 10, 142]. Third, searching operations on the TSU's DynamicSM are accelerated using CSEs and dual-port RAMs, while writing operations are accelerated using multiple Single Port RAMs, used for storing RC Blocks in the DynamicSM's RCM.

Table 9: Average speedup and efficiency for each problem size and number of enabled cores.

7.4.3.2 Performance Evaluation using the Area Optimized TSU (AO-TSU)

To evaluate MiDAS's performance when utilizing the AO-TSU implementation, we executed the Cholesky and LU benchmarks using all available cores. The remaining benchmarks were excluded from our experiments since they do not use the DynamicSM module; MiDAS is expected to achieve the same performance levels when executing such benchmarks with either the AO-TSU or the PO-TSU implementation. This is due to the fact that the main difference between the PO-TSU and the AO-TSU is the configuration of the DynamicSM, where the PO-TSU has a larger and faster DynamicSM compared to the AO-TSU. The characteristics of the DThreads of each benchmark are depicted in Table 10. It is evident that only Cholesky and LU utilize the DynamicSM since they have DThreads with RC > 1. The rest of the benchmarks have DThreads with Nesting = 0 (the StaticSM is used) and DThreads with RC = 1 (the SM is not used).

Figure 64 depicts MiDAS's attained performance when using the PO-TSU compared to that when MiDAS utilizes the AO-TSU. Results show that speedups are very comparable under both configurations. In particular, the PO-TSU shows a 0.24% performance increase, on average, for the Cholesky benchmark, and an equivalent 0.22% for the LU benchmark. The additional delays incurred in the AO-TSU's DynamicSM are hidden from the user level since the TSU and the MicroBlaze cores operate asynchronously.

Table 10: Characteristics of DThreads for each benchmark running on MiDAS.

One advantage of using the AO-TSU is that it demands fewer overheads and requires less power to operate compared to the PO-TSU (see Section 7.4.4). However, the AO-TSU holds fewer RC values, specifically 8192, a fact which prevents some benchmarks from being executed with fine-grained threads, such as in the case of the LU benchmark with 512 × 512 matrices and 16 × 16 tiles.

Figure 64: Performances of MiDAS using the PO-TSU vs. MiDAS using the AO-TSU under the Cholesky and LU benchmarks.

Figure 65 compares the latencies of the basic operations carried out by the DynamicSM for each of the two TSU implementations. The results are normalized with respect to the latencies of the PO-TSU's DynamicSM, and are obtained using waveform outputs from the Xilinx ISim simulator [182]. It is clear that the DynamicSM of the PO-TSU is much faster compared to the one of the AO-TSU. The reason for this is that, in the case of the AO-TSU, the CSEs manage more SMI entries, that is, 512 SMI entries for each CSE, whereas in the PO-TSU each CSE manages 32 SMI entries. This affects the Update operations that target RC Blocks which do not exist in the DynamicSM (each CSE must check all of its SMI entries to conclude that the RC Block does not exist). In a similar manner, invalidate operations are also affected since each SMI entry must be checked for whether it holds the TID of a specific DThread. However, the Update operations that target RC Blocks which exist in the DynamicSM require almost the same latency to complete. The reason for this is that hashing operations are performed based on the TID and Context attributes, which locate the RC Blocks faster.

However, as demonstrated above, the latencies of the DynamicSM do not affect the performance of the system when real benchmarks, such as Cholesky and LU, are executed.

Figure 65: Comparing the DynamicSM’s latencies of PO-TSU and AO-TSU.

7.4.4 FPGA Resource Requirements and Power Consumption Estimations for MiDAS

The FPGA resource requirements and power consumption estimations for constructing MiDAS using the PO-TSU and the AO-TSU implementations are depicted in Table 11. These overhead results were generated using the Xilinx XPS tool, while the power consumption was estimated using the Xilinx XPower Analyzer tool. We would like to note that more accurate resource requirement results can be obtained using the Xilinx ISE tool, which utilizes post-synthesis information; however, that tool does not provide per-component results. The XPower Analyzer tool uses post-synthesis results, and for our experiments the tool was used with its default configuration. The FPGA resource overheads for each individual component are shown in parentheses in terms of percentages. In the same table, the component labelled "other" includes the clock generator, the MDM, the Memory Controller, etc., all of which are necessary to ensure proper system functionality.

Table 11: Virtex-6 FPGA resource requirements and power consumption estimations in implementing MiDAS incorporating either the PO-TSU or the AO-TSU.

The hardware device requirements for MiDAS using the PO-TSU are rather low, taking up 15.8% of the slice registers and 44.2% of the LUTs of the target FPGA device. As such, this facilitates feasible future extensions in the functionality of our system. The Block RAM (BRAM) requirements are set at 64.4%. The BRAM is mainly used (i) to implement the large caches (32 KB for each cache) in order to improve the performance of the system and (ii) to implement a large DynamicSM. The obtained results show that the hardware PO-TSU can easily fit on the Virtex-6 FPGA since it utilizes about 2.5% of its slice registers, 11.0% of its slice LUTs, and 27.4% of its BRAM. Further, it utilizes a small proportion (11.9%) of the overall power of the multi-core system. The AO-TSU utilizes a small proportion of the Virtex-6 FPGA resources: 1.0% of the slice registers, 3.3% of the slice LUTs, and 2.9% of the BRAM. This enables the AO-TSU to be used in much smaller FPGA devices compared to the Virtex-6 FPGA. Figures 66 and 67 present the percentage of the resources and power consumption that each component of MiDAS consumes when the PO-TSU and the AO-TSU are used. The PO-TSU consumes 15% of the slice registers, 25% of the slice LUTs, 42% of the BRAM and 12% of the power of the entire system. Moreover, the AO-TSU consumes 7% of the slice registers, 9% of the slice LUTs, 7% of the BRAM and 7% of the power of the entire system.

Figure 66: Per-component resource utilization and power consumption of MiDAS using PO-TSU.

Figure 67: Per-component resource utilization and power consumption of MiDAS using AO-TSU.

Figure 68 shows the power consumption of each component in the PO-TSU and the AO-TSU. The largest power consumer of the PO-TSU is the DynamicSM, at about 54%. In the case of the AO-TSU, which uses a smaller DynamicSM implementation, the largest power consumers are the Fetch Unit (33.3%) and the Waiting Queues (29.4%). Finally, the DynamicSM of the AO-TSU takes up only 11.9% of the AO-TSU's overall power consumption.

(a) PO-TSU (b) AO-TSU

Figure 68: Per-component power consumption of the PO-TSU and AO-TSU.

7.4.5 DDM Architectural Support in MiDAS vs. Task Superscalar Architecture

We compare our hardware TSU implementation with that of the Task Superscalar implementation [4, 46]. Unfortunately, the HDL code of Task Superscalar is inaccessible and, as such, we cannot compare the performance (e.g., latencies of executing threads/tasks) of our TSU implementation to that of Task Superscalar. Hence, we used the synthesis results of Task Superscalar which are provided in [46] to compare both implementations with regards to the resource requirements and macro statistics. Task Superscalar was synthesized for two different Xilinx Virtex-7 devices, those of xc7vh290t and xc7v2000t, using the Xilinx ISE 14.2 tool. In this work we make use of the Xilinx ISE 14.7 tool, which only supports the xc7v2000t device. Thus, our comparisons will only be provided for the xc7v2000t device. Task Superscalar utilizes Distributed RAM in implementing its data structures, such as the Task Reservation Stations (TRSs), Object Renaming Tables (ORTs), and Object Versioning Tables (OVTs).

It is important to note that Task Superscalar was not interconnected and evaluated with real processing elements (e.g., MicroBlaze soft-cores) and real benchmarks. Thus, Task Superscalar does not include modules like CMD Buffers, Waiting Queues and Transfer Units which are used for the communication with the cores. The authors of Task Superscalar used a trace memory to test and simulate the functionality of their system. The trace memory can save up to 1024 tasks (each task includes 17 80-bit entries). The Task Superscalar prototype incorporates two TRSs which can store up to 1024 in-flight tasks (similar to our DThread instances). The Task Superscalar's components are connected using an interconnection network which includes arbiters and 4-element FIFOs.

In order to obtain fair comparisons with the Task Superscalar implementation, we synthesized the hardware TSU to store at least 1024 DThread instances with RC > 1, using a DynamicSM with 128 SMI entries, 128 RC Blocks, 8 RCs per RC Block, and 2 CSEs. The Template Memory and StaticSM are configured to handle 256 entries (i.e., TID Size = 8 bits), thus facilitating 256 DThreads with Nesting = 0 and RC ≠ 1. Also, the TSU implementation allows an unlimited number of DThread instances with RC = 1 and supports 8 cores and 8-bit RC values. Finally, all FIFOs/Buffers and data structures (i.e., Template Memory, StaticSM, and DynamicSM) of the proposed TSU were implemented using Distributed RAM instead of Block RAM. This is achieved using the (* RAM_STYLE="distributed" *) directive for each memory allocation. For inferring Block RAM in the default TSU configuration we have used the following directive: (* RAM_STYLE="block" *).

Table 12 compares the resource requirements consumed in a Virtex-7 xc7v2000t device, along with macro statistics, of the proposed hardware DDM-supporting TSU with those of Task Superscalar. The synthesis results of the hardware TSU were obtained using the Xilinx ISE tool with its default configuration. The results show that Task Superscalar consumes many more resources compared to the TSU. In particular, Task Superscalar is 4.94× larger than the DDM's TSU with respect to the slice registers and 11.3× larger with respect to the slice LUTs. Furthermore, Task Superscalar utilizes more registers, multiplexers, adders/subtractors, tristate buffers and XOR gates. The presented results clearly show that implementing a data-flow/data-driven system with dynamic dependency resolution in hardware, like Task Superscalar, demands many more resources and consumes more power compared to a system which supports static dependency resolution like our proposed TSU in MiDAS.

Table 12: Resource requirements and macro statistics of the proposed TSU vs. Task Superscalar.

7.4.6 Hardware vs. Software TSU - Preliminary Results

In this work we have implemented DDM's TSU in both software and hardware. The software TSU is used by FREDDO to schedule DThread instances on conventional/commodity multi-cores. The hardware TSU is used by MiDAS to schedule DThread instances on MicroBlaze soft-cores. It is important to know how much faster the hardware TSU is compared to a software implementation with the same functionalities. To answer this question we have developed FREDDO's TSU as a C++ library targeting the MicroBlaze soft-core. For the development process we have used the Xilinx SDK tool and the MicroBlaze GCC compiler (mb-g++). In order to have fair comparisons between the two TSU implementations, the software TSU was configured to support 32-bit Context values. Also, we have disabled the recursion support, the Unlimited Input Queues (UIQs), the Graph Memory (used to store the consumers of each DThread) and the Pending Template Memory (used to compute the RC values automatically). In the original FREDDO TSU implementation, the DynamicSM is implemented using a hash-map data structure (the default implementation uses the C++11 unordered_map). For our comparisons we have implemented a software DynamicSM which is similar to the hardware DynamicSM implementation. In both DynamicSM implementations the RC values are allocated and deallocated dynamically in the form of blocks (called RC Blocks).
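The sketch below outlines what such a hash-map-based software DynamicSM might look like; the class and member names are ours and are meant only to illustrate the idea of lazily allocated RC Blocks keyed by (TID, Context), not to reproduce FREDDO's actual implementation.

#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

// Key of an RC Block: the DThread's TID plus the block-aligned Context value.
struct BlockKey {
    uint8_t tid;
    uint64_t block_context;
    bool operator==(const BlockKey& o) const {
        return tid == o.tid && block_context == o.block_context;
    }
};

struct BlockKeyHash {
    std::size_t operator()(const BlockKey& k) const {
        return std::hash<uint64_t>()(k.block_context) ^ (std::size_t(k.tid) << 1);
    }
};

class SoftwareDynamicSM {
    std::unordered_map<BlockKey, std::vector<uint8_t>, BlockKeyHash> blocks_;
    std::size_t rcs_per_block_;
public:
    explicit SoftwareDynamicSM(std::size_t rcs_per_block)
        : rcs_per_block_(rcs_per_block) {}

    // Applies one Update to a DThread instance; returns true when its RC
    // reaches zero, i.e., the instance becomes ready for execution.
    bool update(uint8_t tid, uint64_t context, uint8_t initial_rc) {
        BlockKey key{tid, context / rcs_per_block_};
        auto it = blocks_.find(key);
        if (it == blocks_.end())  // allocate the RC Block lazily, on first Update
            it = blocks_.emplace(key, std::vector<uint8_t>(rcs_per_block_, initial_rc)).first;
        uint8_t& rc = it->second[context % rcs_per_block_];
        return --rc == 0;
    }

    // Deallocates the whole RC Block once its instances are no longer needed.
    void invalidate(uint8_t tid, uint64_t context) {
        blocks_.erase(BlockKey{tid, context / rcs_per_block_});
    }
};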

Table 13: Versions of Synth benchmark used for comparing software and hardware TSUs.

In our experiments we developed a synthetic application, called Synth, which executes the basic TSU operations, such as loading/removing DThreads, updating RC values, and reading ready DThread instances. Synth creates four different types of DThreads, comprising SimpleDThread, MultipleDThread, MultipleDThread2D, and MultipleDThread3D, and performs Update operations on all of them. We used six different versions of Synth where each version has different characteristics, i.e., diverse numbers of Single and Multiple Updates, and various RC values. Table 13 depicts the characteristics of all Synth's versions. Three of them do not use an SM implementation; in these versions the RC value of all DThreads is set to 1. The Synth benchmark was executed on MiDAS's master core using both the software TSU as well as the hardware-constructed TSU. The benchmark performs the following tasks:

• Task 1: Four different DThreads are created, one of each type.

• Task 2: Single Updates are sent to the TSU through the Input Queue 0 (or Input FSL Bus 0).

• Task 3: Multiple Updates are sent to the TSU through the Input Queue 0 (or Input FSL Bus 0).

• Task 4: The TSU executes all Update commands. Also, the TSU schedules the ready DThread instances for execution on the master core (i.e., it sends the ready DThread instances to the Output Queue 0 or Output FSL Bus 0).

• Task 5: A software function is called which dequeues all ready DThread instances from the Output Queue 0 (or Output FSL Bus 0).

• Task 6: The DThreads are removed from the TSU.

For our preliminary evaluation we measured the execution time (in milliseconds) needed to perform Tasks 2 to 5. Figure 69 compares the software-implemented TSU to the hardware-constructed TSU under all Synth versions that do not use an SM implementation, while Figure 70 does the same for all Synth versions that use an SM implementation. We note that in our experiments we used the PO-TSU implementation.
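The measured sequence can be summarized in code form as follows. Everything in this outline (the FakeTSU type and its methods) is a stub written only for this illustration; it is not the MiDAS or FREDDO API, and it merely mirrors Tasks 1 to 6 so that the timed portion (Tasks 2 to 5) is easy to identify.

#include <cstdint>
#include <deque>
#include <vector>

// Stand-in stub, not the real TSU interface.
struct FakeTSU {
    std::deque<uint32_t> output_queue0;                  // stands in for Output FSL Bus 0

    uint8_t load_dthread() { static uint8_t next_tid = 0; return next_tid++; }   // Task 1
    void single_update(uint8_t, uint32_t ctx) { output_queue0.push_back(ctx); }  // Task 2
    void multiple_update(uint8_t tid, uint32_t from, uint32_t to) {              // Task 3
        for (uint32_t ctx = from; ctx <= to; ++ctx) single_update(tid, ctx);
    }
    void remove_dthread(uint8_t) {}                                              // Task 6
};

int main() {
    FakeTSU tsu;
    // Task 1: create one DThread of each type (Simple, Multiple, 2D, 3D).
    std::vector<uint8_t> tids = { tsu.load_dthread(), tsu.load_dthread(),
                                  tsu.load_dthread(), tsu.load_dthread() };
    // Task 2: Single Updates sent through Input Queue 0 (Input FSL Bus 0).
    for (uint32_t ctx = 0; ctx < 64; ++ctx) tsu.single_update(tids[0], ctx);
    // Task 3: Multiple Updates sent through Input Queue 0.
    tsu.multiple_update(tids[1], 0, 63);
    // Task 4 happens inside the TSU: it executes the Updates and pushes the
    // ready DThread instances to Output Queue 0 (Output FSL Bus 0).
    // Task 5: dequeue every ready DThread instance from Output Queue 0.
    while (!tsu.output_queue0.empty()) tsu.output_queue0.pop_front();
    // Task 6: remove the DThreads from the TSU.
    for (uint8_t tid : tids) tsu.remove_dthread(tid);
    return 0;
}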

Figure 69: Hardware vs. Software TSU on Synth’s versions that do not use an SM implementation.

Figure 70: Hardware vs. Software TSU on Synth's versions that use an SM implementation.

The software-implemented TSU was configured with two different SM implementations, the DynamicSM and the StaticSM. As mentioned earlier, the DynamicSM provides the same functionalities as those of the DynamicSM in the hardware-constructed TSU. The software DynamicSM is configured with 1024 SMI entries, 1024 RC Blocks and 32 RCs per block. The software StaticSM allocates the RC value of all DThread instances immediately, at the time the Thread Templates are created. Accessing an RC entry in the StaticSM is a direct operation that uses the Context of the DThread's instance. Thus, the StaticSM executes Update operations faster than the DynamicSM does. It is important to note that the StaticSM of the software-implemented TSU is different from the StaticSM implemented in the hardware-constructed TSU: in the former case the StaticSM is used to allocate the RC values of DThreads with RC > 1, while in the latter case the StaticSM is used to allocate the RC values of DThreads with Nesting = 0 and RC > 1.

The comparison results show that the hardware-constructed TSU significantly outperforms the software-implemented TSU under all test scenarios. In particular, the hardware-constructed TSU is up to 10.4× faster than the software-implemented TSU when the SM is not utilized. When an SM is needed, the hardware-constructed TSU is up to 21.6× faster than the software-implemented TSU using the StaticSM, and up to 487.1× faster when using the DynamicSM. The software-implemented TSU that uses the DynamicSM achieves the lowest performance due to the runtime overheads incurred by the dynamic allocation/deallocation of the RC blocks. The hardware-constructed TSU achieves the best performance for the following reasons:

1. All the software-implemented TSU data structures are allocated in the main memory (RAM), which is much slower than the Block RAM used to implement the data structures of the hardware-constructed TSU.

2. The majority of the hardware-constructed TSU operations are executed in parallel. On the contrary, all the operations of the software-implemented TSU are executed sequentially.

3. The hardware-constructed TSU communicates with the master core through a hardware FIFO (FSL Bus) which is faster than a software FIFO implementation.

4. In the hardware-constructed TSU the DynamicSM is implemented using 32 CSEs which use 64 ports simultaneously to search the SMI module. In the software-implemented DynamicSM, however, the SMI entries are accessed sequentially.

5. In the hardware-constructed DynamicSM the entries of an RC Block are initialized simultaneously, whereas in the software-implemented DynamicSM the entries are initialized sequentially.

To conclude, a hardware-constructed TSU can achieve much faster data-driven scheduling compared to a software TSU implementation given the same set of functionalities. This enables efficient benchmark execution with fine-grained threads. Future HPC systems (e.g., exascale architectures) are projected to have up to billions of fine-grained threads executing asynchronously [33, 168]. Efficiently scheduling such numbers of fine-grained threads is an important issue. We believe that our hardware TSU implementation can be used as an efficient thread scheduler in future multi-core/many-core architectures that will constitute the basic building blocks of future massively parallel HPC systems.

7.5 Single-node FREDDO Evaluation

7.5.1 Experimental Setup

For the single-node evaluation of FREDDO we have used an AMD node (see Table 5) and twelve benchmarks from our benchmark suite (see Section 7.2). The size of the Context values was set to 64-bit. Table 14 illustrates the characteristics of the benchmarks used in our experiments. For the benchmarks working on matrices, the matrices are dense single-precision floating-point. The problem sizes are separated into three categories: Small, Medium and Large. For the block/tile algorithms, we choose three different granularities (32 × 32, 64 × 64 and 128 × 128 block/tile). The last three columns of Table 14 depict the average sequential execution time (in seconds) of five executions for each problem size of all benchmarks. For each problem size of each block/tile algorithm, the average sequential time is the best among all granularities. The execution time measurements were collected using the gettimeofday system call. For the recursive algorithms we have used thresholds in order to control the number of DThread instances that are used for executing the recursive calls.
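The threshold mechanism can be pictured with the following generic sketch (our illustration, not FREDDO code): above the cutoff, each recursive call would become a DThread instance, while below it the call falls back to plain sequential recursion, which bounds the total number of instances created. The cutoff value itself is tuned per benchmark and problem size.

#include <cstdint>

constexpr int kThreshold = 20;        // illustrative cutoff, tuned per problem size

// Plain sequential recursion used below the threshold.
uint64_t fib_seq(int n) { return n < 2 ? n : fib_seq(n - 1) + fib_seq(n - 2); }

uint64_t fib(int n) {
    if (n < kThreshold)
        return fib_seq(n);            // below the threshold: no new DThread instances
    // Above the threshold a DDM implementation would spawn child DThread
    // instances for fib(n-1) and fib(n-2) and combine their results in a
    // continuation; here we simply recurse to keep the sketch self-contained.
    return fib(n - 1) + fib(n - 2);
}

int main() { return fib(25) > 0 ? 0 : 1; }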

Table 14: The benchmark suite characteristics for the FREDDO evaluation.

The FREDDO framework and all benchmarks were coded in C++11 and compiled using the g++ 4.8.4 compiler. The LAPACK kernels for the Cholesky and QR benchmarks were compiled using the gcc 4.8.4 compiler. All source codes and libraries/packages were compiled using the -O3 optimization flag. For the performance evaluation, all experimental results are reported as speedups. The speedup of a certain configuration is defined by the following formula:

speedup = average sequential execution time / average parallel execution time,

where the average parallel execution time indicates the average of five parallel executions using the FREDDO framework. Notice that the maximum possible speedup is 31, since we reserve one core out of the 32 cores for the execution of the TSU.

7.5.2 Performance Evaluation

We have performed a scalability study in order to evaluate the performance of the single-node FREDDO implementation. Our benchmark suite includes applications that are embarrassingly parallel (Swaptions, Blackscholes and Mandelbrot), applications that have a combination of memory-bound and compute-bound nature (BMMULT and Conv2D), and applications with complex data-dependencies (LU, Cholesky and QR). Additionally, four benchmarks with recursion were implemented (Fibonacci, NQueens, Knights-Tour and PowerSet) in order to evaluate the ability of FREDDO to provide recursion support. Fibonacci, NQueens and PowerSet were implemented using the RecursiveDThread and ContinuationDThread classes, where Knights-Tour was implemented using the RecursiveDThreadWithContinuation class.

Figure 71: Performance scalability of FREDDO for different number of computation cores (Kernels) and problem sizes.

We have evaluated the performance of FREDDO on different numbers of cores and problem sizes. The evaluation is shown in Figure 71. Four different Kernel configurations are used: 4, 8, 16 and 31. The evaluation shows that FREDDO scales very well across the range of the benchmarks and it achieves very good speedups, especially for the Large problem size. This is justified by the fact that, as the benchmark's execution time increases, the parallelization overhead is amortized. For instance, the LU benchmark achieves the following speedups: 3.94 out of 4, 7.83 out of 8, 15.53 out of 16 and 29.83 out of 31, for the Large problem size. To conclude, the results of executing all the benchmarks demonstrate that overall, the system scales well over the range of the benchmarks and achieves, when utilizing all the cores/Kernels, an average speedup of:

• 22.42 out of 32 for the Small problem size (70% efficiency).

• 25.37 out of 32 for the Medium problem size (79% efficiency).

• 27.51 out of 32 for the Large problem size (86% efficiency).

As such, our framework effectively leverages the decoupling of synchronization and execution to maximize the tolerance of synchronization overheads.

7.5.3 Comparisons

FREDDO is compared with OmpSs [157, 40] using four benchmarks: Cholesky, LU, NQueens and PowerSet. For our experiments we have used the Nanos++ runtime V0.13a-2017-06-02 and the Mercurium compiler V2.0.0-2017-06-02. The comparison results are shown in Figure 72. All benchmarks were executed on an AMD node utilizing all the available cores. For the NQueens benchmark, both frameworks achieve very good speedups and have similar results. In the case of PowerSet, FREDDO achieves an average improvement of 29.7% for the Small problem size, 7.6% for the Medium problem size and 7.1% for the Large problem size. In the case of benchmarks with complex dependency graphs, FREDDO outperforms OmpSs for all problem sizes. Particularly, for the largest problem size, FREDDO achieves an average improvement of 23.8% for Cholesky and 25.6% for LU. FREDDO outperforms OmpSs, especially in the case of benchmarks with complex dependency graphs, since the latter builds the dependency graph at runtime. This can add more delay to the critical path of the application compared to our model, which creates the dependency graph statically. Moreover, OmpSs makes only a part of the graph available to the scheduler and, consequently, only a fraction of the concurrency opportunities in the applications is visible at any given time.

Figure 72: FREDDO vs. OmpSs on an AMD node using 32 cores.

FREDDO is also compared with OpenMP [19] on benchmarks with complex data-dependencies (LU and Cholesky) as well as on benchmarks which are embarrassingly parallel (BMMULT and Blackscholes). The performance results are depicted in Figure 73. The maximum possible speedup when we use OpenMP is 32, whereas with FREDDO it is 31. In the case of BMMULT and Blackscholes, which are embarrassingly parallel benchmarks, both frameworks scale very well and achieve very good performance, especially for the largest problem size. However, in the case of LU and Cholesky, which are benchmarks with high-complexity graphs, FREDDO outperforms OpenMP in all cases. For the largest problem size, FREDDO is 2.64× faster than OpenMP for Cholesky and 1.24× faster for LU. FREDDO achieves better performance than OpenMP since the former allows asynchronous data-driven execution while the latter relies on the fork-join paradigm, which incurs more overheads (e.g., barriers are used between phases of computation).
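The fork-join overhead referred to above can be illustrated with a generic OpenMP fragment (not the benchmark code used in this evaluation): every parallel loop ends with an implicit barrier, so the second phase cannot start until every iteration of the first phase has completed, even if only a small part of the second phase actually depends on the first.

#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 0.0f);

    #pragma omp parallel for          // phase 1 (implicit barrier at the end)
    for (int i = 0; i < n; ++i)
        b[i] = 2.0f * a[i];

    #pragma omp parallel for          // phase 2 waits for all of phase 1
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + 1.0f;

    return a[0] == 3.0f ? 0 : 1;
}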

Figure 73: FREDDO vs. OpenMP on an AMD node using 32 cores.

7.6 Distributed FREDDO Evaluation

7.6.1 Experimental Setup

For the distributed FREDDO evaluation we have used two different systems, AMD and CyTera (see Table 5). AMD is a 4-node system with a total of 128 cores. CyTera is a 64-node Intel HPC system with a total of 768 cores. The benchmark suite used for the distributed evaluation contains three applications which require low communication between the nodes (BMMULT, Blackscholes and Swaptions), three benchmarks with complex dependency graphs that require heavy inter-node communication (LU, QR and Cholesky) and two recursive algorithms (Fibonacci and PowerSet) that require medium inter-node communication. The description of the benchmarks can be found in Section 7.2. For the benchmarks working on tile/block matrices we have used both single-precision (SP) and double-precision (DP) floating-point dense matrices. All source codes and libraries/packages were compiled using the -O3 optimization flag.

For the performance results, which are reported as speedups, speedup is defined as Savg/Pavg, where Savg is the average execution time of the sequential version of the benchmark (without any FREDDO overheads) and Pavg is the average execution time of the FREDDO implementation. For the average execution times we have executed each benchmark (both sequential and parallel) five times. In the parallel execution time of each execution we have included the time needed for gathering the results to the RootNode.

We have executed benchmarks using FREDDO with Custom Network Interface support (called FREDDO+CNI) and with MPI support (called FREDDO+MPI). Currently, FREDDO+CNI supports only Ethernet-based interconnects. The default FREDDO implementation for the AMD system is FREDDO+CNI. In CyTera, the MPI libraries provided to us are configured for the InfiniBand interconnect. As such, we are using the FREDDO+MPI implementation as the default since it provides faster communication compared to FREDDO+CNI. For the FREDDO+MPI implementation we are using the OpenMPI library (V1.8.4 for CyTera and V2.0.1 for AMD). Notice that for both implementations, the size of the Context values is set to 64-bit.

7.6.2 Performance Evaluation

We have performed a scalability study in order to evaluate the performance of the proposed framework, by varying the number of nodes on the two systems. Each benchmark is executed with three different problem sizes. For the tiled algorithms (BMMULT, LU, Cholesky and QR) we choose the optimal tile size, for both the sequential and parallel implementation of each algorithm. For each different execution (problem size and number of nodes), we run experiments with three different tile sizes: 32 × 32, 64 × 64 and 128 × 128. Out of the total number of cores in each node, one is used for executing the TSU code while the rest are used for executing the Kernels. Unlike the Kernels and the TSU, which are pinned to specific cores, the Network Manager's receiving thread is not pinned to any specific core. This gives the operating system the opportunity to move the receiving thread to an idle core, or to migrate it regularly between the cores. For the single-node execution of the benchmarks, the Network Manager's receiving thread is disabled. Given this configuration, the maximum possible speedup for the single-node execution is 31 on the AMD system and 11 on the CyTera system. When all nodes are used, the maximum possible speedup is 124 and 704, on AMD and CyTera, respectively.

Figures 74 and 75 depict the results for the AMD and CyTera systems, respectively. On the former system we have executed all the benchmarks, including both single-precision and double-precision versions of the algorithms working on tile/block matrices. On the latter system we have executed the two recursive algorithms and the single-precision versions of the tiled algorithms. Blackscholes and Swaptions as well as the double-precision versions of the tiled algorithms are excluded from our performance evaluation on the CyTera system in order to save computational resources (CPU hours). Ideal speedup refers to the maximum speedup that can be achieved in relation to the number of cores used for the parallel execution. For example, on CyTera for the 64-node configuration, the Ideal speedup is equal to 768. From the performance results we observe that, generally, as the input size increases, the system scales better (especially for the benchmarks with the complex dependency graphs). This is expected, as larger problem sizes allow for amortizing the overheads of the parallelization. Table 15 depicts the average sequential time (in seconds) of the sequential version of the benchmarks that were executed on both systems. The double-precision versions of the algorithms achieve slightly lower speedups compared to the single-precision ones, since in the former case the data exchanged in the network is doubled.


Figure 74: Strong scalability and problem size effect on the AMD system using FREDDO+CNI (MS=Matrix Size, SP=Single-Precision, DP=Double-Precision, K = 2^10, M = 10^6).

BMMULT, Blackscholes and Swaptions achieve very good speedups due to the low data sharing and low data exchange between the nodes. On AMD, for the 4-node configuration and the largest problem size, they achieve up to 93% of the ideal speedup. On CyTera, for the 64-node configuration and the largest problem size, BMMULT achieves 84% of the ideal speedup. LU, QR and Cholesky are classic dense linear algebra workloads with complex dependency graphs. FREDDO yields lower speedups for these algorithms as the number of nodes increases, due to the heavy inter-node communication and the complexity of the algorithms. When utilizing all the nodes of CyTera for the largest problem size, FREDDO achieves up to 61% of the ideal speedup for these complex algorithms. However, it is expected that for larger problem sizes a better performance can be achieved.

Figure 75: Strong scalability and problem size effect on the CyTera system using FREDDO+MPI (MS=Matrix Size, SP=Single-Precision, K = 2^10).

Table 15: Average sequential execution time (in seconds) of the sequential version of the benchmarks.

Table 16: Thresholds used for the execution of the recursive algorithms.

The recursive algorithms (Fibonacci and PowerSet) also achieve very good speedups. For the 4-node configuration on the AMD system and the largest problem size, FREDDO achieves about 83% of the ideal speedup (106 out of 128). For the 64-node configuration on CyTera, FREDDO achieves 84% of the ideal speedup (648 out of 768) for Fibonacci and 79% (604 out of 768) for PowerSet, also for the largest problem size. For minimizing the overheads of the parallel recursive implementations we have used thresholds in order to control the number of DThread instances that are used for executing recursive calls. For each problem size of the algorithms we test various thresholds and choose the one that provides the best performance. Table 16 depicts the thresholds used to achieve the best performance.

To conclude, distributed FREDDO scales well and effectively leverages the decoupling of synchronization and execution. Table 17 depicts the minimum, maximum and average speedup results, on both systems, for each problem size and number of nodes. Next to each speedup value, the utilization percentage of the available cores is presented. The results show that FREDDO utilizes the resources of both systems efficiently, especially for the largest problem size. For the largest problem size and when all the available nodes are used, FREDDO achieves an average of 82% of the ideal speedup on AMD and 67% on CyTera.

Table 17: Speedup results along with the utilization percentage of the available cores in each case.

7.6.3 FREDDO: CNI vs MPI

In this section we study the performance penalties of using MPI instead of CNI for the benchmarks that have medium and heavy inter-node communication. For our experiments we have used the AMD system with all the available nodes. The results are shown in Figure 76 and are normalized based on the average execution time of FREDDO+CNI. The comparisons show that FREDDO+CNI is 80%, 25% and 5% faster than FREDDO+MPI on average, for the smallest, medium and largest problem sizes, respectively. This indicates that MPI has more overheads, which affect the performance of the Network Manager's receiving thread as well as the sending operations of the Kernels. MPI has more overheads than CNI since it is a much larger library which contains more functionalities than CNI. However, MPI's overheads are hidden as the benchmark's input size increases. Thus, FREDDO+MPI can be used for real-life applications that have enormous input sizes (at least in the order of our largest problem size). This solution can provide better portability to FREDDO applications, especially when targeting large-scale HPC systems with different architectures.

Figure 76: FREDDO+CNI vs. FREDDO+MPI on AMD for the 4-node configuration. P1, P2 and P3 indicate the smallest, medium and largest problem sizes, respectively.

7.6.4 Performance comparisons with other systems

The distributed FREDDO implementation is compared with MPI, DDM-VM [63] and OmpSs@Cluster [129, 130]. MPI is the reference programming model in HPC systems and clusters. DDM-VM is a data-flow system similar to FREDDO which supports static dependency resolution, i.e., the programmer/compiler is responsible for constructing the dependency graph. OmpSs@Cluster is a data-flow-based system which supports dynamic dependency resolution, i.e., the dependency graph is constructed dynamically, at runtime. For the comparisons between FREDDO and MPI we have used Cholesky, LU, Fibonacci and PowerSet. For the MPI implementation of Cholesky, we used the pspotrf and pdpotrf routines of ScaLAPACK [185] (V2.0.2). The MPI implementation of LU was retrieved from [186] and utilizes MPI's one-sided communication functionalities. The recursive algorithms implemented in MPI are based on the Fibonacci algorithm presented in [187], which uses MPI's dynamic process management support through MPI_Comm_spawn. Additionally, the recursive algorithms were modified to support thresholds in order to improve the performance. For the comparisons between FREDDO and DDM-VM, we used Cholesky and LU. Finally, we compare FREDDO with OmpSs@Cluster using Cholesky, LU and QR. For all frameworks we choose the configurations that achieve the optimal performance (e.g., tile sizes, thresholds for the recursive algorithms and grid configurations for the ScaLAPACK implementations). OmpSs@Cluster was installed with GASNet [188] (V1.28.2) and OpenMPI (V1.8.4) on CyTera. We used the latest stable OmpSs package which includes the Mercurium compiler V2.0.0-28-06-2017 and the Nanos++ runtime V0.13-29-06-2017. The OmpSs benchmarks were executed using GASNet's ibv-conduit. In order to have fair comparisons, FREDDO is compared with DDM-VM using the FREDDO+CNI implementation and with MPI and OmpSs@Cluster using the FREDDO+MPI implementation. The reason is that the MPI library incurs more overheads (as shown in Figure 76) compared to a custom network interface with fewer functionalities.

Figure 77: FREDDO+MPI vs. MPI on CyTera.

Figure 78: FREDDO+MPI vs. MPI on AMD.

7.6.4.1 FREDDO vs. MPI

The comparison results of FREDDO and MPI, on CyTera and AMD, are depicted in Figures 77 and 78, respectively. LU is not included in our comparisons on AMD since the system does not support Remote Direct Memory Access (RDMA), which is required by the MPI implementation of the algorithm. Our framework scales better than MPI. On average, FREDDO is 1.79× faster than MPI on AMD, for the 4-node configuration, and 2.35× faster on CyTera, for the 64-node configuration. In the case of Cholesky and LU, FREDDO performs better since it exploits the advantages of having fine-grained threads/tasks executing asynchronously in a data-driven manner. MPI relies on the fork-join paradigm and it uses barriers to synchronize between phases of computation. Synchronization constructs, like barriers, can have a negative impact on performance, especially when the number of nodes increases. In contrast, FREDDO does not use such synchronization constructs as it relies solely on data-driven mechanisms for its operations.

In the MPI implementations of the recursive algorithms, the MPI_Comm_spawn routine is used to spawn child recursive calls through new MPI processes. A parent-instance waits for its children-instances to complete before it proceeds with computations, thus increasing the runtime overheads (a parent-instance is blocked until all its children-instances have finished their execution). Also, additional overheads are introduced by the dynamic allocation of new MPI processes. In FREDDO such overheads are eliminated by executing recursive algorithms based on data-flow with continuations; a continuation DThread instance is activated to process the results of the children-instances when all of them have finished their execution. Moreover, all instances are executed by the Kernels, thus there is no need to allocate new resources (processes, threads, etc.).
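The spawn-and-wait pattern described above can be sketched as follows (a simplified illustration, not the MPI benchmark code used in the comparison; "child_fib" is a hypothetical child executable name introduced only for this sketch):

#include <mpi.h>

// The parent dynamically creates child processes for the recursive calls and
// then blocks until every child has sent back its partial result.
long parent_step(int nchildren) {
    MPI_Comm children;
    // Dynamic process allocation: one source of the overheads discussed above.
    MPI_Comm_spawn("child_fib", MPI_ARGV_NULL, nchildren, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);

    long total = 0;
    for (int i = 0; i < nchildren; ++i) {
        long child_result = 0;
        // The parent is blocked here until child i finishes and sends its result.
        MPI_Recv(&child_result, 1, MPI_LONG, i, 0, children, MPI_STATUS_IGNORE);
        total += child_result;
    }
    return total;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    long result = parent_step(2);
    MPI_Finalize();
    return result >= 0 ? 0 : 1;
}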

Figure 79: FREDDO+CNI vs. DDM-VM on AMD (MS: 32K × 32K).

7.6.4.2 FREDDO vs. DDM-VM

The comparison results of FREDDO and DDM-VM, on the AMD system, are depicted in Figure 79. On average, FREDDO is 1.25× faster than DDM-VM, for the 4-node configuration. Although DDM-VM and FREDDO are based on the same execution model, FREDDO achieves better performance for three main reasons:

1. DDM-VM follows a Context-based distribution scheme (similar to this work) where each DThread instance is mapped and executed on a specific core of the distributed system. In FREDDO, the DThread instances are mapped to specific nodes and the TSU's Scheduler distributes them to the Kernels with the least workload. This approach can better improve the load-balancing in each node. For example, consider that two DThread instances with the same Context value and different TIDs are scheduled to run on node x. In DDM-VM, the two instances will be scheduled for execution on the same core, sequentially. In FREDDO, the two instances will be scheduled to run on the cores with the least amount of work. This allows the two instances to be executed in parallel, on two different cores (see the sketch after this list).

2. FREDDO provides an optimized TSU and Network Manager. For example, FREDDO uses atomic variables to implement the distributed termination detection algorithm, whereas DDM-VM uses lock/unlock operations which incur more overheads.

3. In DDM-VM, the TSU and the receiving thread of the Network Interface Unit (similar to the receiving thread of FREDDO's Network Manager) are pinned on the same core. This approach can affect the scheduling and network operations, thus increasing the runtime overheads. In FREDDO, the receiving thread is not pinned to any specific core, which gives the operating system the flexibility to schedule it appropriately.
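The following sketch contrasts the two distribution schemes described in point 1, as we understand them (simplified for illustration; this is not the actual DDM-VM or FREDDO scheduler code):

#include <cstdint>
#include <vector>
#include <algorithm>

// DDM-VM-style mapping: the Context alone selects the core, so two instances
// with the same Context but different TIDs serialize on the same core.
int core_ddmvm(uint64_t context, int total_cores) {
    return static_cast<int>(context % total_cores);
}

// FREDDO-style mapping: the Context selects only the node; within the node the
// Scheduler assigns the instance to the Kernel (core) with the least queued work.
int kernel_freddo(uint64_t context, int num_nodes, std::vector<int>& queued_work) {
    int node = static_cast<int>(context % num_nodes);    // static node mapping
    (void)node;                                          // node selection is not used locally
    auto least = std::min_element(queued_work.begin(), queued_work.end());
    ++*least;                                            // account for the newly assigned instance
    return static_cast<int>(least - queued_work.begin());
}

int main() {
    std::vector<int> work(8, 0);                         // 8 idle Kernels on this node
    int a = kernel_freddo(42, 4, work);                  // same Context, different TIDs
    int b = kernel_freddo(42, 4, work);
    return (a != b) ? 0 : 1;                             // the two instances can run in parallel
}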

7.6.4.3 FREDDO vs. OmpSs@Cluster

Figure 80 compares FREDDO with OmpSs@Cluster on CyTera. We have used the affinity scheduling policy, which is used for running applications on a cluster. We have executed the OmpSs benchmarks with different tile sizes (from 32 × 32 to 2048 × 2048) and we found that the 512 × 512 tile size achieved the best performance in all cases. Our experiments show that OmpSs does not perform well on benchmarks with small tiles. Smaller tile sizes increase the number of tasks, which increases the Nanos++ workload required to determine the data-dependencies between those tasks. Notice that OmpSs@Cluster reserves one core for network functionalities.

Figure 80: FREDDO+MPI vs. OmpSs@Cluster on CyTera (MS: 60K × 60K).

FREDDO outperforms OmpSs@Cluster, especially for the configurations with a large number of nodes. For the 32-node configuration on CyTera, FREDDO is up to 3.18× faster than OmpSs@Cluster (2.20× faster on average). The reason is twofold. First, OmpSs determines the dependencies at runtime. This eases programmability since programmers only need to annotate the sequential code with compiler directives. However, it increases runtime overheads. In FREDDO, the dependencies are provided by programmers (through Update commands), which increases the programming effort but reduces the runtime overheads. Second, the execution model of OmpSs@Cluster is based on a master-worker design where the master is responsible for: (1) assigning tasks to the remote nodes and (2) preserving data coherency. OmpSs@Cluster aims to create an identical address space on each node, which gives the view of a single distributed address space [129], like in FREDDO. However, a master-worker scheme suffers from scalability issues on large clusters due to the bottleneck constituted by the master. Although OmpSs supports task nesting on cluster nodes to reduce the pressure on the master node, our results show that FREDDO scales better. FREDDO implements a peer-to-peer network (a node can send data and Updates to any other node) and a lightweight distribution scheme based on static mapping. It also uses data forwarding to reduce latencies.

7.6.5 Network Traffic Analysis

In order to study the efficiency of the proposed mechanisms for reducing the network traffic during a distributed data-driven execution, we have performed a traffic analysis for both FREDDO and DDM-VM. We are comparing our system with DDM-VM since the latter does not provide any mechanisms for reducing the network traffic in DDM applications. The experiments were conducted on the AMD system for two benchmarks from our benchmark suite, Cholesky and LU (single-precision versions). For our experiments, root access was required for capturing the network traffic. Thus, the AMD system was used since it is the only system where we have root access. Figure 81 depicts the total TCP packets (in millions) and the total data (in GB) that are exchanged between the nodes of the AMD system, for the 4-node configuration and the largest problem size (32K × 32K). The benchmarks were executed with three different tile sizes: 32 × 32, 64 × 64 and 128 × 128.

Figure 81: Network traffic analysis: FREDDO against DDM-VM on the AMD system, for the 4-node configuration and the largest problem size (32K × 32K).

For the traffic analysis experiments, we used the TShark tool [189] (V2.2.3) and configured it to capture the traffic that is exchanged between the TCP ports that were reserved for the inter-node communication. It is important to note that in FREDDO the size of the Context values is set to 64-bit, whereas DDM-VM supports only 32-bit Context values. Larger Context values allow executing benchmarks with large problem sizes and fine-grained threads (e.g., the LU benchmark on the CyTera system with 60K × 60K matrix size and 32 × 32 tile size).

The comparison results show that FREDDO reduces the total TCP packets and data, especially for the smallest tile size, where the frequency of the communication between the nodes of the system is increased. In the case of Cholesky, FREDDO reduces the total TCP packets by 4.85×, 1.79× and 1.16×, for the 32 × 32, 64 × 64 and 128 × 128 tile sizes, respectively. In the case of LU, FREDDO reduces the total TCP packets by 6.55×, 1.44× and 1.11×, for the 32 × 32, 64 × 64 and 128 × 128 tile sizes, respectively. Furthermore, FREDDO reduces the total amount of data by 16.7%, 5.5% and 2.9% for Cholesky and 12.9%, 5.2% and 3.5% for LU, for the 32 × 32, 64 × 64 and 128 × 128 tile sizes, respectively. It is easy to observe that the total number of TCP packets and the total amount of data are not reduced with the same ratio. This is because the largest percentage of the total amount of data consists of the computed matrix tiles that are forwarded from the producer to the consumer nodes. This percentage is approximately the same in both frameworks since FREDDO reduces the network traffic mainly through optimizations in sending Update operations. The number of Update operations is high in benchmarks with high-complexity dependency graphs, thus a higher number of TCP packets is used to carry such operations in DDM-VM.

Figure 82: Tile size effect on the AMD and CyTera systems using FREDDO.

Although the proposed mechanisms for reducing the network traffic perform better for relatively small tile sizes (fine-grained threads), we expect them to have a high positive impact on benchmarks that run on HPC systems with a large number of cores/nodes. Future HPC systems (e.g., exascale architectures) are projected to have up to billions of fine-grained threads executing asynchronously [168, 33, 32]. As a very small indication, in Figure 82 we provide the normalized average execution time of the LU and Cholesky benchmarks that were executed on both systems for three different tile sizes. The timings are normalized based on the execution time of the 32 × 32 tile size. The results show that larger tile sizes (i.e., coarse-grained threads) can negatively affect the performance, especially on the CyTera system with the 64-node configuration. We would like to note that we have also tested even smaller tile sizes (e.g., 16 × 16) and, as expected, the performance was not good, especially on the AMD system. This is because smaller tile sizes can increase the runtime overheads since more DThread instances are created, which stresses the TSU. However, smaller tile sizes significantly increase the number of Update operations, thus our mechanisms may further reduce the network traffic compared to the DDM-VM system.

7.6.6 Execution Times

In this subsection we present the best average execution times of FREDDO+CNI, FREDDO+MPI, MPI, DDM-VM and OmpSs@Cluster that were used for the performance evaluation of the distributed FREDDO implementation, as well as for the performance comparisons, on both AMD and CyTera. In each case, we present the average execution times (in seconds) for all benchmarks, node configurations and problem sizes. Notice that, in the case of the tile algorithms, the best average execution time of an experiment (problem size and number of nodes) is the minimum among all the execution times of all tile/block sizes tested for that experiment. Similarly, in the case of recursive algorithms, the best average execution time of an experiment is the minimum among all the execution times of all thresholds tested for that experiment. The execution times are separated into two categories: execution times on the AMD system (Section 7.6.6.1) and execution times on the CyTera system (Section 7.6.6.2).
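As a small illustration of how the "best average" entries in the following tables are selected, the sketch below (a hypothetical helper written for this discussion, not part of the FREDDO code base) keeps the minimum of the average execution times measured for the different tile sizes, or thresholds, of one experiment:

    #include <algorithm>
    #include <limits>
    #include <map>

    // Average execution time (in seconds) per tile size (or recursion threshold)
    // for a single experiment, i.e., a fixed problem size and node count.
    double best_average_time(const std::map<int, double>& avg_time_per_granularity) {
        double best = std::numeric_limits<double>::max();
        for (const auto& entry : avg_time_per_granularity)
            best = std::min(best, entry.second);   // keep the minimum average time
        return best;
    }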

7.6.6.1 AMD System

Table 18: Best average execution time (in seconds) for FREDDO+CNI on AMD.

Table 19: FREDDO+MPI vs. MPI: best average execution time (in seconds) on AMD.

Table 20: FREDDO+CNI vs. DDM-VM: best average execution time (in seconds) on AMD.


7.6.6.2 CyTera System

Table 21: Best average execution time (in seconds) for FREDDO+MPI on CyTera.

Table 22: FREDDO+MPI vs. MPI: best average execution time (in seconds) on CyTera.

Table 23: FREDDO+MPI vs. OmpSs@Cluster: best average execution time (in seconds) on CyTera.

Chapter 8

Conclusions and Future Work

8.1 Conclusions

In this thesis we presented two different projects based on Data-Driven Multithreading (DDM), a non-blocking multithreading model that allows data-driven scheduling on sequential processors. DDM utilizes the Thread Scheduling Unit (TSU), a special module that is responsible for scheduling DDM threads (DThreads) in a data-driven manner.

The first project includes the design, development and evaluation of DDM's TSU in hardware. The TSU was implemented as a fully-parameterizable IP core using the Verilog HDL. It was synthesized with different configurations and several results are provided, including resource utilization statistics, power consumption estimations and latencies (in cycles) of various TSU operations. The hardware TSU implementation was integrated into a multi-core processor with non-coherent in-order cores, called MiDAS. The processor was prototyped and evaluated on a Xilinx Virtex-6 FPGA using benchmarks with different characteristics. The benchmarks were developed in C/C++ using a software API. MiDAS was evaluated with two different TSU implementations, the Performance Optimized TSU (PO-TSU) and the Area Optimized TSU (AO-TSU). The difference between the two implementations is that PO-TSU has a larger and faster Dynamic Synchronization Memory (SM) than AO-TSU. The performance evaluation shows that MiDAS, using both TSU implementations, scales well and achieves very good results, even on benchmarks with very small problem sizes (e.g., 16 × 16 matrices). In the context of this work we provided FPGA resource requirements and power consumption estimations of the MiDAS system. The results show that PO-TSU can easily fit on the Virtex-6 FPGA since it utilizes about 2.5% of its slice registers, 11.0% of its slice LUTs and 27.4% of its BRAM. AO-TSU utilizes fewer Virtex-6 FPGA resources compared to PO-TSU: 1.0% slice registers, 3.3% slice LUTs and 2.9% BRAM. Additionally, the TSU utilizes a small proportion (12% for the PO-TSU and 7% for the AO-TSU) of the overall power of the MiDAS system. We are very encouraged by the performance, resource requirements and power estimation results of our hardware prototype. Thus, we are confident that our hardware TSU implementation can support a larger number of cores.

We compared our hardware DDM-based TSU, which adopts static dependency resolution, with the Task Superscalar architecture, a hardware task-based data-flow scheduler that implements the StarSs programming framework and adopts dynamic dependency resolution. Our experimental results show that data-flow schedulers which adopt dynamic dependency resolution, like Task Superscalar, require significant amounts of resources to be implemented; Task Superscalar is 4.94× larger than our TSU implementation regarding the slice registers, and 11.34× larger regarding the slice LUTs. Last, our hardware TSU was compared against a software TSU implementation, with both running on an FPGA fabric under a synthetic application and offering identical functionalities. Our experimental results show that the hardware TSU implementation significantly outperforms the software TSU implementation in terms of speedup: by up to 21.6× when the software TSU utilizes a StaticSM implementation, and by up to 487.1× when the software TSU utilizes a DynamicSM implementation.

The second project, called FREDDO, is an efficient and portable object-oriented implementation of DDM that enables data-driven scheduling on conventional single-node and distributed multi-core systems. It provides new features to the DDM model, like recursion support, and it extends DDM's programming interface with the object-oriented programming paradigm. Experiments were performed on two distributed systems, AMD and CyTera. AMD is a 4-node system with a total of 128 cores. CyTera is an open-access HPC system which provides up to 64 nodes per user with a total of 768 cores. Our evaluation analysis demonstrates that FREDDO scales well and achieves comparable or better performance when compared with other systems, such as OpenMP, MPI, DDM-VM and OmpSs. Last but not least, FREDDO proposes simple and efficient techniques for reducing network traffic during the execution of distributed DDM applications. Our experiments on the AMD system show that FREDDO can reduce the total amount of TCP packets by up to 6.55× and the total amount of data by up to 16.7% when compared to the DDM-VM system.

8.2 Future Work

The analysis presented in the previous chapters showed that FREDDO enables efficient data-driven execution on single-node and distributed multi-core systems. Furthermore, DDM's TSU can be implemented with a small hardware budget and it can be integrated into conventional multi-core systems. As a proof of concept, we developed MiDAS, a shared-memory multi-core processor augmented with a hardware TSU implementation and non-coherent in-order processing elements. Our evaluation shows that DDM's architectural support (TSU) allows efficient data-driven scheduling even in the case of benchmarks with very small problem sizes. In this chapter, we present directions to further improve and extend MiDAS and FREDDO.

Figure 83: Future distributed data-driven many-core implementation.

8.2.1 MiDAS

8.2.1.1 Implementing a many-core processor based on MiDAS

The MiDAS architecture can be used as a vehicle for building high-performance, low-power many-core systems. In particular, we plan to design and develop a many-core system consisting of low-power and low-complexity non-coherent processing elements (PEs) organized into clusters (called DDM clusters or DClusters), and a power-efficient memory model consisting of lightweight caches and scratchpad memories (SPMs) operating simultaneously in parallel. The SPMs will be managed implicitly, based on data-driven principles and DDM's CacheFlow policy [143], in order to simplify programmability and increase application performance. The lightweight caches (non-coherent small L1 caches) will be used to increase the performance of the sequential parts of an application. A high-level block diagram of a many-core processor based on MiDAS is depicted in Figure 83. The DClusters are connected to an off-chip shared Main Memory using a Memory Interconnect (e.g., a Network on Chip (NoC) or a hierarchical AXI-4 bus) and a Memory Controller. Distributed data-driven execution across DClusters can be implemented using the functionalities/techniques provided by the FREDDO framework. Such functionalities include the inter-node scheduling mechanism, the Network Manager's functionalities and the compression techniques for reducing the network traffic. Each DCluster will be an enhanced MiDAS system. It will feature a separate Network Interface Unit (NIU) that will be responsible for receiving/sending data from/to the Inter-DCluster interconnect.

Additionally, each PE of a DCluster will feature its own software-controlled SPM. Implicit memory management of the SPMs, based on the data-driven dependencies of the threads, will be implemented by the Scratchpad Management Unit (SMU). The SMU will be an optimized SPM controller that will be responsible for implicitly loading the data of ready threads from the Main Memory into its associated SPM, using DMA functionalities. After a thread finishes its execution, the SMU will transfer the thread's output data to the Main Memory in order to be read by its consumers. The SMU will be supported as part of the C/C++ API.
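The sketch below illustrates, purely conceptually, the intended SMU flow for a single ready DThread instance. All names and types are hypothetical placeholders (the real SMU is future work and will be a hardware unit); a plain memcpy stands in for the DMA transfers between the Main Memory and the scratchpad:

    #include <cstddef>
    #include <cstring>

    // Software stand-in for the DMA engine the SMU would use (hypothetical
    // interface); memcpy models the Main Memory <-> scratchpad transfers.
    struct DmaEngine {
        void copy_in(void* spm_dst, const void* mem_src, std::size_t bytes)  { std::memcpy(spm_dst, mem_src, bytes); }
        void copy_out(void* mem_dst, const void* spm_src, std::size_t bytes) { std::memcpy(mem_dst, spm_src, bytes); }
    };

    // A ready DThread instance as the TSU would deliver it to a PE (simplified).
    struct ReadyInstance {
        const void* input_in_memory;  std::size_t input_bytes;
        void*       output_in_memory; std::size_t output_bytes;
        void (*code)(const void* in, void* out);   // the DThread's code
    };

    // Envisioned SMU behaviour: stage the inputs into the PE's scratchpad before
    // execution and write the outputs back to Main Memory for the consumers.
    void smu_execute(DmaEngine& dma, void* spm_in, void* spm_out, const ReadyInstance& inst) {
        dma.copy_in(spm_in, inst.input_in_memory, inst.input_bytes);       // implicit prefetch
        inst.code(spm_in, spm_out);                                        // run the thread on the PE
        dma.copy_out(inst.output_in_memory, spm_out, inst.output_bytes);   // implicit write-back
    }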

8.2.1.2 Optimizing the hardware DynamicSM

Updating the RC value of a DThread instance whose associated RC Block is not allocated in the DynamicSM requires a significant number of cycles. The reason is that each Context Search Engine (CSE) must check all of its entries to determine that the RC Block does not exist in the DynamicSM. An alternative approach is to implement the DynamicSM's SM Indexer (SMI) using Content-Addressable Memories (CAMs) [183]. A CAM is a memory that implements a lookup-table function in a single clock cycle using dedicated comparison circuitry. However, implementing the SMI using CAMs significantly increases the FPGA resource requirements and the associated power consumption. In order to quantify this, we synthesized a Verilog-based CAM implementation retrieved from [190]. The CAM-based SMI implementation was configured with 4096 entries of 40-bit width (8 bits for the TID and 32 bits for the Context), targeting the Xilinx ML605 Evaluation Board. The results show that such a configuration occupies 41021 slice registers (13% of the FPGA resources), 141947 slice LUTs (94% of the FPGA resources) and 5 BRAMs (1% of the FPGA resources). The CAM-based SMI implementation is much larger than the entire AO-TSU, which holds 8192 SMI entries. We note that the AO-TSU occupies 3124 slice registers, 4981 slice LUTs and 12 BRAMs. To conclude, implementing the DynamicSM's SMI using CAMs accelerates the search process, but at the expense of significantly higher hardware resources and increased power consumption.
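To make the trade-off concrete, the following software model (illustrative only, not the Verilog design) contrasts the linear scan that a Context Search Engine performs on a miss with the associative lookup that a CAM-based SMI would provide; in hardware the CAM resolves the lookup in a single cycle, but, as the synthesis results above show, at a large resource and power cost. The key layout matches the synthesized 40-bit configuration (8-bit TID plus 32-bit Context):

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <optional>
    #include <unordered_map>
    #include <vector>

    struct SmiKey { std::uint8_t tid; std::uint32_t context; };
    inline bool operator==(const SmiKey& a, const SmiKey& b) {
        return a.tid == b.tid && a.context == b.context;
    }

    // Linear search, as performed by a Context Search Engine (CSE): on a miss,
    // every entry must be examined before the absence of the RC Block is known.
    std::optional<std::size_t> cse_lookup(const std::vector<SmiKey>& entries, const SmiKey& key) {
        for (std::size_t i = 0; i < entries.size(); ++i)
            if (entries[i] == key) return i;
        return std::nullopt;   // miss detected only after a full pass
    }

    // Associative lookup, modelling what a CAM achieves with dedicated
    // comparison circuitry in one clock cycle.
    struct SmiKeyHash {
        std::size_t operator()(const SmiKey& k) const {
            return std::hash<std::uint64_t>()((std::uint64_t(k.tid) << 32) | k.context);
        }
    };
    using CamModelSmi = std::unordered_map<SmiKey, std::size_t, SmiKeyHash>;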

8.2.1.3 Provide Recursion Support

The current MiDAS implementation does not provide recursion support. We plan to extend MiDAS's functionalities in order to provide recursion support, based on the functionalities implemented in FREDDO. This will require modifications to the hardware TSU implementation and the C/C++ API.

8.2.1.4 Evaluation of the hardware/software and static/dynamic implementations of the TSU

It would be interesting to further evaluate the hardware and software implementations of the TSU as well as to compare them on FPGA devices. In Section 7.4.6 we presented preliminary results for comparisons between a software and a hardware TSU implementation using a synthetic application.

We can extend our evaluation analysis by comparing MiDAS (8 cores with a hardware TSU) with a 9-core system where one core will be responsible for executing the software TSU implementation. For the comparisons we can use real benchmarks, such as Cholesky, LU and Blocked Matrix Multiplication.

Furthermore, we plan to extend the functionalities of the TSU to support dynamic dependency resolution, based on the techniques implemented in the DDM-VM system [11, 36]. DDM-VM uses I-Structures [97] for handling the synchronization between the producer and consumer threads in a split-phase manner, i.e., a request issued to an I-structure is independent in time from the received response. This approach will allow us to increase parallelism in applications where the producer-consumer dependencies cannot be determined at compile-time. Additionally, implementing dynamic and static dependency resolution in the same system will allow us to compare the two techniques in terms of performance, resource requirements and power consumption. Finally, we plan to implement an efficient Synchronization Memory (SM) for handling the RC values of DThreads with RC > 1 and Nesting ≠ 0, similar to the StaticSM data structure of FREDDO. Accessing an RC entry in FREDDO's StaticSM is a direct operation that uses the Context of the DThread's instance. This will allow us to accelerate the Update operations in the hardware TSU implementation as well as to evaluate the overheads of MiDAS's DynamicSM implementation (see Section 3.2.2.4). However, a direct-mapped SM implementation will require the allocation of all RCs to be performed at the time of creating the Thread Template.

8.2.2 FREDDO

8.2.2.1 Apply data-driven scheduling on heterogeneous HPC systems

The current FREDDO implementation allows data-driven execution on conventional multi-core processors. Future work will focus on applying data-driven scheduling on heterogeneous HPC systems. In particular, we are interested in many-core accelerators with software-controlled scratchpad memories. An example of this architecture is the Sunway SW26010 processor, which is the basic building block of the Sunway TaihuLight [191] (ranked 1st in the TOP500). Deterministic data prefetching into scratchpad memories using data-driven techniques can improve the locality of sequential processing [10]. We believe that this approach can further improve the performance of HPC systems.

8.2.2.2 Improving the distributed memory model

FREDDO's DSM implementation requires the shared objects to have the same memory size in each node. This simplifies the implementation of the proposed programming model but it limits the total amount of memory used by a DDM program. This limitation can be solved by partitioning a shared data object (e.g., a matrix) across the nodes. If a DThread instance needs a data segment that belongs to a remote node, a data request should be sent to that node requesting the data. Another idea is to use Partitioned Global Address Space (PGAS) libraries like DASH [192]. DASH is a C++ template library that offers distributed data structures and parallel algorithms and implements a compiler-free PGAS approach. This implementation will allow us to explore the benefits and drawbacks of using a data-flow+PGAS model. However, data partitioning will increase the runtime overheads since the DThread instances will wait for input data from remote nodes. Such overheads could be mitigated using multithreading (e.g., using more Kernels than the available cores and calling the yield function when a DThread waits for input data) and MPI's one-sided communication functionalities.
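One possible direction is sketched below with hypothetical names: tile ownership is assigned by a simple block-cyclic rule, locally owned tiles are accessed directly, and tiles owned by a remote node are fetched on demand. The stub request would be replaced by the Network Manager's data-request protocol or by a one-sided PGAS get:

    #include <cstddef>
    #include <vector>

    struct TileRef { std::size_t row, col; };

    // Block-cyclic ownership of matrix tiles across the nodes (one possible rule).
    std::size_t owner_node(const TileRef& t, std::size_t tiles_per_row, std::size_t num_nodes) {
        return (t.row * tiles_per_row + t.col) % num_nodes;
    }

    // Stub: a real implementation would send a request to the owner node and
    // block (or yield the Kernel) until the tile data arrives.
    std::vector<float> request_remote_tile(std::size_t /*owner*/, const TileRef& /*t*/) { return {}; }

    // 'local_tiles' is indexed globally here for simplicity; only the tiles
    // owned by this node would actually be allocated.
    std::vector<float> get_tile(const TileRef& t, std::size_t my_node, std::size_t tiles_per_row,
                                std::size_t num_nodes,
                                const std::vector<std::vector<float>>& local_tiles) {
        const std::size_t owner = owner_node(t, tiles_per_row, num_nodes);
        if (owner == my_node)
            return local_tiles[t.row * tiles_per_row + t.col];   // local access
        return request_remote_tile(owner, t);                    // remote fetch on demand
    }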

8.2.2.3 Improving data locality

FREDDO schedules ready DThread instances to cores using dynamic scheduling, i.e., it locates the Output Queue (OQ) with the least amount of work and sends the ready DThread instance to that OQ. However, dynamic scheduling can negatively affect data locality. An alternative solution that can increase data locality is to schedule consumer DThread instances to the cores of their producers. This functionality can be implemented using a special data structure that stores the Kernel IDs of the producer instances that fire (send the last Update operation to) the consumer instances.
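The sketch below (hypothetical names, not FREDDO's actual internals) illustrates this idea: the Kernel ID of the producer that sends the last Update to a consumer instance is recorded, and the scheduler prefers that Kernel's Output Queue when the consumer becomes ready, falling back to the least-loaded OQ otherwise:

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Identifies a specific DThread instance (e.g., a TID/Context pair packed
    // into one value); the encoding is only illustrative.
    using InstanceKey = std::uint64_t;

    // Locality table: last producer Kernel that fired each consumer instance.
    std::unordered_map<InstanceKey, unsigned> firing_kernel;

    // Called when the producer running on 'kernel_id' sends the Update that
    // makes 'consumer' ready (its RC reaches zero).
    void record_firing(InstanceKey consumer, unsigned kernel_id) {
        firing_kernel[consumer] = kernel_id;
    }

    // Placement decision: prefer the producer's core to improve locality,
    // otherwise use dynamic scheduling (least-loaded Output Queue).
    unsigned choose_output_queue(InstanceKey consumer, const std::vector<std::size_t>& oq_load) {
        auto it = firing_kernel.find(consumer);
        if (it != firing_kernel.end()) return it->second;
        unsigned best = 0;
        for (unsigned k = 1; k < oq_load.size(); ++k)
            if (oq_load[k] < oq_load[best]) best = k;
        return best;
    }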

8.2.3 Extending the functionalities of DDM

The DDM model can be extended with additional functionalities, such as (1) efficient management and reuse of Context values, (2) eliminating redundant dependencies from applications and (3) parallelism control to optimize the use of resources. For example, in the latter case, we can use techniques like loop throttling to limit the number of DThread invocations that are active concurrently. Such functionalities can be implemented in both MiDAS and FREDDO.
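As an example of the parallelism-control direction, the sketch below (illustrative only; the names are not part of any existing DDM implementation) bounds the number of concurrently active DThread invocations with a simple counting guard, so that a loop that would otherwise create millions of instances keeps only a limited window in flight:

    #include <condition_variable>
    #include <mutex>

    // Counting guard used to throttle the number of live DThread invocations.
    class Throttle {
    public:
        explicit Throttle(unsigned max_active) : max_active_(max_active) {}

        void acquire() {                        // call before creating an invocation
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return active_ < max_active_; });
            ++active_;
        }
        void release() {                        // call when the invocation completes
            std::lock_guard<std::mutex> lk(m_);
            --active_;
            cv_.notify_one();
        }
    private:
        unsigned max_active_;
        unsigned active_ = 0;
        std::mutex m_;
        std::condition_variable cv_;
    };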

Bibliography

[1] J. L. Hennessy and D. A. Patterson, Computer Architecture, Fourth Edition: A Quantitative Approach. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2006.

[2] B. D. de Dinechin, R. Ayrignac, P.-E. Beaucamps, P. Couvert, B. Ganne, P. G. de Massas, F. Jacquet, S. Jones, N. M. Chaisemartin, F. Riss et al., “A clustered archi- tecture for embedded and accelerated applications.” in HPEC, 2013, pp. 1–6.

[3] R. Sass and A. G. Schmidt, Embedded systems design with platform FPGAs: principles and practices. Morgan Kaufmann, 2010.

[4] Y. Etsion, F. Cabarcas, A. Rico, A. Ramirez, R. M. Badia, E. Ayguade, J. Labarta, and M. Valero, “Task superscalar: An out-of-order task pipeline,” in (MICRO), 2010 43rd Annual IEEE/ACM International Symposium on. IEEE, 2010, pp. 89–100.

[5] Maxeler. Dataflow computing. [Online]. Available: https://www.maxeler.com/technology/ dataflow-computing/

[6] C. Kyriacou, “Data driven multithreading using conventional control flow microprocessors,” Ph.D. dissertation, University of Cyprus, 2005.

[7] K. Stavrou, P. Evripidou, and P. Trancoso, “Ddm-cmp: data-driven multithreading on a chip multiprocessor,” in Embedded Computer Systems: Architectures, Modeling, and Simulation. Springer, 2005, pp. 364–373.

[8] K. Stavrou, “The TFLUX platform: A portable platform for data-driven multithreading on commodity multiprocessor systems,” Ph.D. dissertation, University of Cyprus, 2009.

[9] K. Stavrou, M. Nikolaides, D. Pavlou, S. Arandi, P. Evripidou, and P. Trancoso, “TFlux: A portable platform for data-driven multithreading on commodity multicore systems,” in Parallel Processing, 2008. ICPP’08. 37th International Conference on. IEEE, 2008, pp. 25–34.

[10] S. Arandi and P. Evripidou, “DDM-VMc: the data-driven multithreading virtual machine for the cell processor,” in Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers. ACM, 2011, pp. 25–34.

[11] S. Arandi, “The data-driven multithreading virtual machine,” Ph.D. dissertation, University of Cyprus, 2011.

[12] The N-queens Problem. Accessed on 10 Aug 2016. [Online]. Available: https://developers.google.com/optimization/puzzles/queens

[13] Knight’s tour. Accessed on 10 Aug 2016. [Online]. Available: https://en.wikipedia.org/wiki/ Knight%27s tour

[14] G. E. Moore, “Cramming more components onto integrated circuits,” vol. 38, no. 8, 1965.


[15] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang, “The case for a single-chip multiprocessor,” SIGOPS Oper. Syst. Rev., vol. 30, no. 5, pp. 2–11, Sep. 1996. [Online]. Available: http://doi.acm.org/10.1145/248208.237140

[16] A. Yarkhan and J. Dongarra, “Lightweight superscalar task execution in distributed memory,” 2014.

[17] S. Fuller and L. Millett, “Computing performance: Game over or next level?” Computer, vol. 44, no. 1, pp. 31–38, Jan 2011.

[18] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: portable parallel programming with the message-passing interface. MIT press, 1999, vol. 1.

[19] OpenMP Architecture Review Board, “OpenMP application program interface version 4.5,” Nov. 2015. [Online]. Available: http://www.openmp.org/mp-documents/openmp-4.5.pdf

[20] Arvind and R. A. Iannucci, “Two fundamental issues in multiprocessing,” in 4th International DFVLR Seminar on Foundations of Engineering Sciences on Parallel Computing in Science and Engineering. New York, NY, USA: Springer-Verlag New York, Inc., 1988, pp. 61–88. [Online]. Available: http://dl.acm.org/citation.cfm?id=52797.52802

[21] P. Kogge, “Next-generation supercomputers,” IEEE Spectrum, February, 2011.

[22] S. Zuckerman, J. Suetterlein, R. Knauerhase, and G. R. Gao, “Using a codelet program exe- cution model for exascale machines: position paper,” in Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era. ACM, 2011, pp. 64–69.

[23] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users’ guide. Siam, 1999, vol. 9.

[24] E. Agullo, B. Hadri, H. Ltaief, and J. Dongarra, “Comparative study of one-sided factorizations with multiple software packages on multi-core hardware,” in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. ACM, 2009, p. 20.

[25] A. Haidar, H. Ltaief, A. YarKhan, and J. Dongarra, “Analysis of dynamically scheduled tile al- gorithms for dense linear algebra on multicore architectures,” Concurrency and Computation: Practice and Experience, vol. 24, no. 3, pp. 305–321, 2012.

[26] J. Kurzak, H. Ltaief, J. Dongarra, and R. M. Badia, “Scheduling linear algebra operations on multicore processors,” 2009, LAPACK Working Note 213.

[27] J. B. Dennis, “First version of a data flow procedure language,” in Programming Symposium. Springer, 1974, pp. 362–376.

[28] B. Lee and A. R. Hurson, “Dataflow architectures and multithreading,” Computer, vol. 27, no. 8, pp. 27–39, 1994.

[29] J. R. Gurd, C. C. Kirkham, and I. Watson, “The manchester prototype dataflow computer,” Communications of the ACM, vol. 28, no. 1, pp. 34–52, 1985.

[30] K. Arvind and R. S. Nikhil, “Executing a program on the mit tagged-token dataflow architecture,” Computers, IEEE Transactions on, vol. 39, no. 3, pp. 300–318, 1990.

[31] J. Landwehr, J. Suetterlein, A. Márquez, J. Manzano, and G. R. Gao, “Application characterization at scale: lessons learned from developing a distributed open community runtime system for high performance computing,” in Proceedings of the ACM International Conference on Computing Frontiers. ACM, 2016, pp. 164–171.

[32] H. Kaiser, T. Heller, B. Adelstein-Lelbach, A. Serio, and D. Fey, “Hpx: A task based program- ming model in a global address space,” in Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models. ACM, 2014, p. 6.

[33] S. Amarasinghe, M. Hall, R. Lethin, K. Pingali, D. Quinlan, V. Sarkar, J. Shalf, R. Lucas, K. Yelick, P. Balanji et al., “Exascale programming challenges,” in Proceedings of the Work- shop on Exascale Programming Challenges, Marina del Rey, CA, USA. US Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR), 2011.

[34] W. M. Johnston, J. Hanna, and R. J. Millar, “Advances in dataflow programming languages,” ACM Computing Surveys (CSUR), vol. 36, no. 1, pp. 1–34, 2004.

[35] S. Arandi and P. Evripidou, “Programming multi-core architectures using data-flow tech- niques,” in Embedded Computer Systems (SAMOS), 2010 International Conference on. IEEE, 2010, pp. 152–161.

[36] S. Arandi, G. Michael, P. Evripidou, and C. Kyriacou, “Combining compile and run-time de- pendency resolution in data-driven multithreading,” in Data-Flow Execution Models for Ex- treme Scale Computing (DFM), 2011 First Workshop on. IEEE, 2011, pp. 45–52.

[37] G. Gupta and G. S. Sohi, “Dataflow execution of sequential imperative programs on multi- core architectures,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2011, pp. 59–70.

[38] C. Lauderdale, M. Glines, J. Zhao, A. Spiotta, and R. Khan, “Swarm: A unified framework for parallel-for, task dataflow, and distributed graph traversal,” ET International Inc., Newark, USA, 2013.

[39] J. M. Perez, R. M. Badia, and J. Labarta, “A dependency-aware task-based programming environment for multi-core architectures,” in Cluster Computing, 2008 IEEE International Conference on. IEEE, 2008, pp. 142–151.

[40] A. Duran, E. Ayguadé, R. M. Badia, J. Labarta, L. Martinell, X. Martorell, and J. Planas, “Ompss: a proposal for programming heterogeneous multi-core architectures,” Parallel Processing Letters, vol. 21, no. 02, pp. 173–193, 2011.

[41] K. Hiraki, K. Nishida, S. Sekiguchi, T. Shimada, and T. Yuba, “The sigima-1 dataflow super- computer: A challenge for new generation supercomputing systems,” Journal of information processing, vol. 10, no. 4, pp. 219–226, 1988.

[42] Y. Kodama, S. Sakai, and Y. Yamaguchi, “A prototype of a highly parallel dataflow machine em-4 and its preliminary evaluation,” Future Generation Computer Systems, vol. 7, no. 2, pp. 199–209, 1992.

[43] G. M. Papadopoulos and D. E. Culler, “Monsoon: an explicit token-store architecture,” in ACM SIGARCH Computer Architecture News, vol. 18, no. 2SI. ACM, 1990, pp. 82–91.

[44] V. G. Grafe and J. E. Hoch, “The epsilon-2 multiprocessor system,” Journal of Parallel and Distributed Computing, vol. 10, no. 4, pp. 309–318, 1990.

[45] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S. W. Keckler, and C. R. Moore, “Exploiting ilp, tlp, and dlp with the polymorphous trips architecture,” in Computer Architecture, 2003. Proceedings. 30th Annual International Symposium on. IEEE, 2003, pp. 422–433.

[46] F. Yazdanpanah, D. Jimenez-Gonzalez, C. Alvarez-Martinez, Y. Etsion, and R. M. Badia, “Fpga-based prototype of the task superscalar architecture,” in 7th HiPEAC Workshop on Reconfigurable Computing (WRC 2013), Berlin, Germany, 2013.

[47] O. Pell and V. Averbukh, “Maximum performance computing with dataflow engines,” Com- puting in Science & Engineering, vol. 14, no. 4, pp. 98–103, 2012.

[48] J. Benson, R. Cofell, C. Frericks, C.-H. Ho, V. Govindaraju, T. Nowatzki, and K. Sankar- alingam, “Design, integration and implementation of the dyser hardware accelerator into opensparc,” in IEEE International Symposium on High-Performance Comp Architecture. IEEE, 2012, pp. 1–12.

[49] D. Capalija and T. S. Abdelrahman, “Microarchitecture of a coarse-grain out-of-order super- scalar processor,” IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 2, pp. 392–405, 2013.

[50] C. Wang, X. Li, J. Zhang, P. Chen, Y. Chen, X. Zhou, and R. C. Cheung, “Architecture support for task out-of-order execution in mpsocs,” IEEE Transactions on Computers, vol. 64, no. 5, pp. 1296–1310, 2015.

[51] A. K. Jain, X. Li, S. A. Fahmy, and D. L. Maskell, “Adapting the dyser architecture with dsp blocks as an overlay for the xilinx zynq,” ACM SIGARCH Computer Architecture News, vol. 43, no. 4, pp. 28–33, 2016.

[52] H. H. J. Hum, O. Maquelin, K. B. Theobald, X. Tian, X. Tang, G. R. Gao, P. Cupryk, N. Elmasri, L. J. Hendren, A. Jimenez, S. Krishnan, A. Marquez, S. Merali, S. S. Nemawarkar, P. Panangaden, X. Xue, and Y. Zhu, “A design study of the earth multiprocessor,” in Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques, ser. PACT ’95. Manchester, UK, UK: IFIP Working Group on Algol, 1995, pp. 59–68. [Online]. Available: http://dl.acm.org/citation.cfm?id=224659.224685

[53] K. M. Kavi, R. Giorgi, and J. Arul, “Scheduled dataflow: Execution paradigm, architecture, and performance evaluation,” Computers, IEEE Transactions on, vol. 50, no. 8, pp. 834–846, 2001.

[54] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin, “Wavescalar,” in Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2003, p. 291.

[55] C. Kyriacou, P. Evripidou, and P. Trancoso, “Data-driven multithreading using conventional microprocessors,” Parallel and Distributed Systems, IEEE Transactions on, vol. 17, no. 10, pp. 1176–1188, 2006.

[56] R. Giorgi, Z. Popovic, and N. Puzovic, “Dta-c: A decoupled multi-threaded architecture for cmp systems,” in Computer Architecture and High Performance Computing, 2007. SBAC-PAD 2007. 19th International Symposium on. IEEE, 2007, pp. 263–270.

[57] A. Mondelli, N. Ho, A. Scionti, M. Solinas, A. Portero, and R. Giorgi, “Dataflow support in x86 64 multicore architectures through small hardware extensions,” in Digital System Design (DSD), 2015 Euromicro Conference on. IEEE, 2015, pp. 526–529.

[58] P. Evripidou, “Thread synchronization unit (tsu): A building block for high performance computers,” in High Performance Computing. Springer, 1997, pp. 107–118.

[59] K. Sankaralingam, R. Nagarajan, R. McDonald, R. Desikan, S. Drolia, M. Govindan, P. Gratzf, D. Gulati, H. Hanson, C. Kim et al., “Distributed microarchitectural protocols in the trips prototype processor,” in Microarchitecture, 2006. MICRO-39. 39th Annual IEEE/ACM International Symposium on. IEEE, 2006, pp. 480–491.

[60] S. Swanson, A. Schwerin, M. Mercaldi, A. Petersen, A. Putnam, K. Michelson, M. Oskin, and S. J. Eggers, “The wavescalar architecture,” ACM Transactions on Computer Systems (TOCS), vol. 25, no. 2, p. 4, 2007.

[61] Xilinx, “Microblaze processor reference guide,” reference manual, vol. 23, 2006.

[62] xilinx.com, “All programmable technologies from xilinx inc.” 2017. [Online]. Available: http://www.xilinx.com/

[63] G. Michael, S. Arandi, and P. Evripidou, “Data-flow concurrency on distributed multi-core systems,” in Proceedings of the International Conference on Parallel and Distributed Process- ing Techniques and Applications (PDPTA). The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp), 2013, p. 515.

[64] Arvind and Gostelow, “The u-interpreter,” Computer, vol. 15, no. 2, pp. 42–49, Feb. 1982.

[65] P. G. Harrison and M. J. Reeve, “The parallel graph reduction machine, alice,” in Graph Re- duction. Springer, 1987, pp. 181–202.

[66] P. Watson and I. Watson, “Evaluating functional programs on the flagship machine,” in Func- tional Programming Languages and Computer Architecture. Springer, 1987, pp. 80–97.

[67] I. Watson, V. Woods, P. Watson, R. Banach, M. Greenberg, and J. Sargeant, “Flagship: a parallel architecture for declarative programming,” in ACM SIGARCH Computer Architecture News, vol. 16, no. 2. IEEE Computer Society Press, 1988, pp. 124–130.

[68] J. B. Dennis and D. P. Misunas, “A preliminary architecture for a basic data-flow processor,” SIGARCH Comput. Archit. News, vol. 3, no. 4, pp. 126–132, Dec. 1974. [Online]. Available: http://doi.acm.org/10.1145/641675.642111

[69] D. W. Wall, “Limits of instruction-level parallelism,” in Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS IV. New York, NY, USA: ACM, 1991, pp. 176–188. [Online]. Available: http://doi.acm.org/10.1145/106972.106991

[70] W. A. Wulf and S. A. McKee, “Hitting the memory wall: implications of the obvious,” ACM SIGARCH computer architecture news, vol. 23, no. 1, pp. 20–24, 1995.

[71] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, and Others, “Introduction to the cell multiprocessor,” IBM journal of Research and Development, vol. 49, no. 4/5, p. 589, 2005.

[72] A. Olofsson, T. Nordstrom,¨ and Z. Ul-Abdin, “Kickstarting high-performance energy-efficient manycore architectures with epiphany,” in 2014 48th Asilomar Conference on Signals, Systems and Computers. IEEE, 2014, pp. 1719–1726.

[73] S. P. Crago, D.-I. Kang, M. Kang, R. Kost, K. Singh, J. Suh, and J. P. Walters, “Programming models and development software for a space-based many-core processor,” in Space Mission Challenges for Information Technology (SMC-IT), 2011 IEEE Fourth International Conference on. IEEE, 2011, pp. 95–102.

[74] T. G. Mattson, M. Riepen, T. Lehnig, P. Brett, W. Haas, P. Kennedy, J. Howard, S. Vangal, N. Borkar, G. Ruhl et al., “The 48-core scc processor: the programmer’s view,” in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society, 2010, pp. 1–11.

[75] B. D. de Dinechin, P. G. de Massas, G. Lager, C. Léger, B. Orgogozo, J. Reybert, and T. Strudel, “A distributed run-time environment for the kalray mppa-256 integrated manycore processor,” Procedia Computer Science, vol. 18, pp. 1654–1663, 2013.

[76] J. Jeffers and J. Reinders, Intel Xeon Phi coprocessor high-performance programming. Newnes, 2013.

[77] Xilinx, “Field programmable gate array (fpga).” [Online]. Available: http://www.xilinx.com/ training/fpga/fpga-field-programmable-gate-array.htm

[78] S. D. Automation, “VHDL Reference Manual,” Rev. March, pp. 9–12, 1997.

[79] M. McNamara et al., “IEEE Standard Verilog Hardware Description Language. The Institute of Electrical and Electronics Engineers,” Inc. IEEE Std, pp. 1364–2001, 2001.

[80] P. P. Chu, FPGA prototyping by Verilog examples: Xilinx Spartan-3 version. John Wiley & Sons, 2011.

[81] R. J. Francis, “Technology mapping for lookup-table based field-programmable gate arrays,” Ph.D. dissertation, Citeseer, 1993.

[82] P. K. Gupta, “Xeon+fpga platform for the data center,” in Fourth Workshop on the Intersections of Computer Architecture and Reconfigurable Logic, vol. 119, 2015.

[83] W. Ackermann and J. Dennis, “Val: A value-oriented algorithmic language,” in Tech. Report LCS/TR-218. Massachusetts Inst. of Technology Cambridge, 1979.

[84] R. S. Nikhil, P. Fenstermacher, J. Hicks, and R. Johnson, “Id world reference manual,” CSG Memo, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA (April 1987), 1987.

[85] J. McGraw, S. Skedzielewski, S. Allan, D. Grit, R. Oldehoeft, J. Glauert, I. Dobes, and P. Ho- hensee, “Sisal: streams and iteration in a single-assignment language. language reference man- ual, version 1.1,” Lawrence Livermore National Lab., CA (USA), Tech. Rep., 1983.

[86] M. Amamiya, R. Hasegawa, and S. Ono, “Valid, a high-level functional programming language for data flow machines,” Review of the Electrical Communication Laboratories, vol. 32, no. 5, pp. 793–802, 1984.

[87] S. S. Thakkar, “Selected reprints on dataflow and reduction architectures,” 1987.

[88] A. R. Hurson and K. M. Kavi, “Dataflow computers: Their history and future,” Wiley Encyclopedia of Computer Science and Engineering, 2008.

[89] Arvind and D. E. Culler, “Dataflow architectures.” Annual Reviews Inc., 1986, pp. 225–253.

[90] J. B. Dennis, “Data flow supercomputers,” Computer, no. 11, pp. 48–56, 1980.

[91] A. L. Davis, “The architecture and system method of ddm1: A recursively structured data driven machine,” in Proceedings of the 5th Annual Symposium on Computer Architecture, ser. ISCA ’78. New York, NY, USA: ACM, 1978, pp. 210–215. [Online]. Available: http://doi.acm.org/10.1145/800094.803050

[92] A. Plas, D. Comte, O. Gelly, and J. Syre, “Lau system architecture: A parallel data driven processor based on single assignment,” in Proceedings of the International Conference on Parallel Processing, 1976, pp. 293–302.

[93] M. Cornish, “The ti data flow architectures – the power of concurrency for avionics,” Challenge of the ’80s, pp. 19–25, 1979.

[94] I. Watson and J. Gurd, “A prototype data flow computer with token labelling,” in Proceedings of the National Computer Conference, vol. 1979, 1979, pp. 623–628.

[95] Arvind and V. Kathail, “A multiple processor data flow machine that supports generalized procedures,” in Proceedings of the 8th Annual Symposium on Computer Architecture, ser. ISCA ’81. Los Alamitos, CA, USA: IEEE Computer Society Press, 1981, pp. 291–302. [Online]. Available: http://dl.acm.org/citation.cfm?id=800052.801882

[96] Arvind and R. E. Thomas, I-Structures: An efficient data type for functional languages. Lab- oratory for Computer Science, Massachusetts Institute of Techcnology, 1981.

[97] Arvind, R. S. Nikhil, and K. K. Pingali, “I-structures: Data structures for parallel computing,” ACM Trans. Program. Lang. Syst., vol. 11, no. 4, pp. 598–632, Oct. 1989. [Online]. Available: http://doi.acm.org/10.1145/69558.69562

[98] T. Shimada, K. Hiraki, K. Nishida, and S. Sekiguchi, “Evaluation of a prototype data flow processor of the sigma-1 for scientific computations,” in Proceedings of the 13th Annual International Symposium on Computer Architecture, ser. ISCA ’86. Los Alamitos, CA, USA: IEEE Computer Society Press, 1986, pp. 226–234. [Online]. Available: http://dl.acm.org/citation.cfm?id=17407.17383

[99] L. M. Patnaik, R. Govindarajan, and N. Ramadoss, “Design and performance evaluation of exman: An extended manchester data flow computer,” Computers, IEEE Transactions on, vol. 100, no. 3, pp. 229–244, 1986.

[100] B. Lee, A. R. Hurson, and B. Shirazi, “A hybrid scheme for processing data structures in a dataflow environment,” Parallel and Distributed Systems, IEEE Transactions on, vol. 3, no. 1, pp. 83–96, 1992.

[101] D. E. Culler and G. M. Papadopoulos, “The explicit token store,” Journal of Parallel and Distributed Computing, vol. 10, no. 4, pp. 289–308, 1990.

[102] A. Hurson and B. Lee, “Issues in dataflow computing,” Adv. in Comput, vol. 37, no. 285-333, pp. 38–39, 1993.

[103] K. M. Kavi and B. Shirazi, “Dataflow architecture: Are dataflow computers commercially viable?” IEEE Potentials, pp. 27–30, 1992.

[104] G. M. Papadopoulos and K. R. Traub, “Multithreading: A revisionist view of dataflow architectures,” in Proceedings of the 18th Annual International Symposium on Computer Architecture, ser. ISCA ’91. New York, NY, USA: ACM, 1991, pp. 342–351. [Online]. Available: http://doi.acm.org/10.1145/115952.115986

[105] F. Yazdanpanah, C. Alvarez-Martinez, D. Jimenez-Gonzalez, and Y. Etsion, “Hybrid dataflow/von-neumann architectures,” Parallel and Distributed Systems, IEEE Transactions on, vol. 25, no. 6, pp. 1489–1509, 2014.

[106] J. Silc, B. Robic, and T. Ungerer, “Asynchrony in parallel computing: From dataflow to multi- threading,” Parallel and Distributed Computing Practices, vol. 1, no. 1, pp. 3–30, 1998.

[107] B. Robic, J. Silc, and T. Ungerer, “Beyond dataflow,” Journal of Computing and Information Technology, vol. 8, no. 2, pp. 89–102, 2000.

[108] R. A. Iannucci, Toward a dataflow/von Neumann hybrid architecture. IEEE Computer Society Press, 1988, vol. 16, no. 2.

[109] S. Sakai, K. Hiraki, Y. Kodama, T. Yuba et al., “An architecture of a dataflow single chip processor,” in ACM SIGARCH Computer Architecture News, vol. 17, no. 3. ACM, 1989, pp. 46–53.

[110] M. Sato, Y. Kodama, S. Sakai, Y. Yamaguchi, and Y. Koumura, “Thread-based programming for the em-4 hybrid dataflow machine,” in ACM SIGARCH Computer Architecture News, vol. 20, no. 2. ACM, 1992, pp. 146–155.

[111] P. Evripidou and J.-L. Gaudiot, “A decoupled graph/computation data-driven architecture with variable-resolution actors,” University of Southern California, Los Angeles, CA (United States). Dept. of Electrical Engineering, Tech. Rep., 1990.

[112] ——, “A decoupled data-driven architecture with vectors and macro actors,” in CONPAR 90—VAPP IV. Springer, 1990, pp. 39–50.

[113] ——, “The usc decoupled multilevel data-flow execution model,” Advanced topics in data-flow computing, pp. 347–379, 1991.

[114] G. R. Gao, “An efficient hybrid dataflow architecture model,” Journal of Parallel and Distributed Computing, vol. 19, no. 4, pp. 293–307, 1993.

[115] R. S. Nikhil, “Can dataflow subsume von neumann computing?” in Proceedings of the 16th Annual International Symposium on Computer Architecture, ser. ISCA ’89. New York, NY, USA: ACM, 1989, pp. 262–272. [Online]. Available: http://doi.acm.org/10.1145/74925.74955

[116] R. S. Nikhil, G. M. Papadopoulos, and Arvind, “*t: A multithreaded massively parallel architecture,” SIGARCH Comput. Archit. News, vol. 20, no. 2, pp. 156–167, Apr. 1992. [Online]. Available: http://doi.acm.org/10.1145/146628.139715

[117] S. S. Nemawarkar and G. R. Gao, “Measurement and modeling of earth-manna multithreaded architecture,” in Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, 1996. MASCOTS’96., Proceedings of the Fourth International Workshop on. IEEE, 1996, pp. 109–114.

[118] W. Zhu, Y. Niu, and G. R. Gao, “Performance portability on earth: a case study across several parallel architectures,” Cluster Computing, vol. 10, no. 2, pp. 115–126, 2007.

[119] D. E. Culler, A. Sah, K. E. Schauser, T. von Eicken, and J. Wawrzynek, “Fine-grain parallelism with minimal hardware support: A compiler-controlled threaded abstract machine,” in Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS IV. New York, NY, USA: ACM, 1991, pp. 164–175. [Online]. Available: http://doi.acm.org/10.1145/106972.106990

[120] D. E. Culler, S. C. Goldstein, K. E. Schauser, and T. von Eicken, “Tam - a compiler controlled threaded abstract machine,” Journal of Parallel and Distributed Computing, vol. 18, no. 3, pp. 347–370, 1993.

[121] C. F. Joerg, R. D. Blumofe, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou, “Cilk: an efficient multithreaded runtime system,” in Proc. Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1995.

[122] Intel, “Intel cilk plus,” 2015. [Online]. Available: https://software.intel.com/en-us/intel-cilk-plus

[123] A. D. Robison, “Cilk plus: Language support for thread and vector parallelism,” Talk at HP-CAST, vol. 18, 2012.

[124] P. Bellens, J. M. Perez, R. M. Badia, and J. Labarta, “Cellss: a programming model for the cell be architecture,” in SC 2006 Conference, Proceedings of the ACM/IEEE. IEEE, 2006, pp. 5–5.

[125] J. M. Pérez, P. Bellens, R. M. Badia, and J. Labarta, “Cellss: Making it easier to program the cell broadband engine processor,” IBM Journal of Research and Development, vol. 51, no. 5, pp. 593–604, 2007.

[126] E. Ayguadé, R. M. Badia, F. D. Igual, J. Labarta, R. Mayo, and E. S. Quintana-Ortí, “An extension of the starss programming model for platforms with multiple gpus,” in Euro-Par 2009 Parallel Processing. Springer, 2009, pp. 851–862.

[127] J. Planas, R. M. Badia, E. Ayguadé, and J. Labarta, “Hierarchical task-based programming with starss,” International Journal of High Performance Computing Applications, vol. 23, no. 3, pp. 284–299, 2009.

[128] V. K. Elangovan, R. M. Badia, and E. A. Parra, “Ompss-opencl programming model for het- erogeneous systems,” in Languages and compilers for parallel computing. Springer, 2013, pp. 96–111.

[129] J. Bueno, L. Martinell, A. Duran, M. Farreras, X. Martorell, R. M. Badia, E. Ayguade, and J. Labarta, “Productive cluster programming with ompss,” in European Conference on Parallel Processing. Springer, 2011, pp. 555–566.

[130] J. Bueno, X. Martorell, R. M. Badia, E. Ayguade,´ and J. Labarta, “Implementing ompss support for regions of data in architectures with multiple address spaces,” in Proceedings of the 27th international ACM conference on International conference on supercomputing. ACM, 2013, pp. 359–368.

[131] J. Reinders, Intel threading building blocks: outfitting C++ for multi-core processor paral- lelism. O’Reilly Media, Inc., 2007.

[132] J. M. Arul and K. M. Kavi, “Scalability of scheduled data flow architecture (sdf) with register contexts,” in Algorithms and Architectures for Parallel Processing, 2002. Proceedings. Fifth International Conference on. IEEE, 2002, pp. 214–221.

[133] F. YAZDANPANAH and C. ALVAREZ-MARTINEZ, “Supplementary file of” hybrid dataflow/von-neumann architectures.”

[134] D. Burger, S. W. Keckler, K. S. McKinley, M. Dahlin, L. K. John, C. Lin, C. R. Moore, J. Bur- rill, R. G. McDonald, and W. Yoder, “Scaling to the end of silicon with edge architectures,” Computer, vol. 37, no. 7, pp. 44–55, 2004.

[135] S. Swanson, A. Putnam, M. Mercaldi, K. Michelson, A. Petersen, A. Schwerin, M. Oskin, and S. J. Eggers, “Area-performance trade-offs in tiled dataflow architectures,” ACM SIGARCH Computer Architecture News, vol. 34, no. 2, pp. 314–326, 2006.

[136] C. Kyriacou and P. Evripidou, “Communication assist for data driven multithreading,” in Advances in Informatics. Springer, 2001, pp. 351–367.

[137] P. Evripidou, “D3-machine: a decoupled data-driven multithreaded architecture with variable resolution support,” Parallel Computing, vol. 27, no. 9, pp. 1197–1225, 2001.

[138] C. Kyriacou and P. Evripidou, “Network interface for a data driven network of workstations (d2now),” in High Performance Computing. Springer, 1999, pp. 257–268.

[139] P. Evripidou and C. Kyriacou, “Data driven network of workstations (d2now).” J. UCS, vol. 6, no. 10, pp. 1015–1033, 2000.

[140] I. Watson and et al, “A prototype data flow computer with token labelling,” in Managing Re- quirements Knowledge, International Workshop on. IEEE Computer Society, 1989, pp. 623– 623.

[141] G. Matheou and P. Evripidou, “Architectural support for data-driven execution,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 11, no. 4, pp. 52:1–52:25, Jan. 2015. [Online]. Available: http://doi.acm.org/10.1145/2686874

[142] ——, “FREDDO: an efficient framework for runtime execution of data-driven objects,” in Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), 2016, pp. 265–273.

[143] C. Kyriacou, P. Evripidou, and P. Trancoso, “Cacheflow: A short-term optimal cache management policy for data driven multithreading,” in Euro-Par 2004 Parallel Processing. Springer, 2004, pp. 561–570.

[144] P. Trancoso, P. Evripidou, K. Stavrou, and C. Kyriacou, “A case for chip multiprocessors based on the data-driven multithreading model,” International Journal of Parallel Program- ming, vol. 34, no. 3, pp. 213–235, 2006.

[145] P. Trancoso, K. Stavrou, and P. Evripidou, “Ddmcpp: The data-driven multithreading c pre- processor,” Proceedings of the 11th Interact-11, pp. 32–39, 2007.

[146] A. Diavastos, G. Stylianou, and P. Trancoso, “Tfluxscc: Exploiting performance on future many-core systems through data-flow,” in Parallel, Distributed and Network-Based Processing (PDP), 2015 23rd Euromicro International Conference on. IEEE, 2015, pp. 190–198.

[147] “Simics.” [Online]. Available: https://en.wikipedia.org/wiki/Simics

[148] A. Diavastos, P. Trancoso, M. Lujan,´ and I. Watson, “Integrating transactions into the data- driven multi-threading model using the tflux platform,” in Data-Flow Execution Models for Extreme Scale Computing (DFM), 2011 First Workshop on. IEEE, 2011, pp. 19–27.

[149] ——, “Integrating transactions into the data-driven multi-threading model using the tflux platform,” International Journal of Parallel Programming, pp. 1–21, 2015. [Online]. Available: http://dx.doi.org/10.1007/s10766-015-0369-2

[150] P. Felber, C. Fetzer, and T. Riegel, “Dynamic performance tuning of word-based software transactional memory,” in Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’08. New York, NY, USA: ACM, 2008, pp. 237–246. [Online]. Available: http://doi.acm.org/10.1145/1345206.1345241

[151] R. Giorgi, R. M. Badia, F. Bodin, A. Cohen, P. Evripidou, P. Faraboschi, B. Fechner, G. R. Gao, A. Garbade, R. Gayatri et al., “Teraflux: Harnessing dataflow in next generation teradevices,” Microprocessors and Microsystems, vol. 38, no. 8, pp. 976–990, 2014.

[152] S. Palnitkar, Verilog HDL: a guide to digital design and synthesis. Prentice Hall Professional, 2003, vol. 1.

[153] xilinx.com, “Fsl v20,” 2014. [Online]. Available: http://www.xilinx.com/support/documentation/ipembedprocess processorinterface fsl.htm

[154] I. Xilinx, “Logicore ip axi interconnect (v1.06.a),” December 2012. [Online]. Available: http://www.xilinx.com/support/documentation/ip documentation/axi interconnect/ v1 06 a/ds768 axi interconnect.pdf

[155] T. Feist, “Vivado design suite,” White Paper, vol. 5, 2012.

[156] G. Matheou and P. Evripidou, “Freddo: an efficient framework for runtime execution of data-driven objects,” Department of Computer Science, University of Cyprus, Nicosia, Cyprus, Tech. Rep. TR-16-1, January 2016. [Online]. Available: https: //www.cs.ucy.ac.cy/docs/techreports/TR-16-1.pdf

[157] BSC, “The ompss programming model,” 2015. [Online]. Available: https://pm.bsc.es/ompss

[158] B. Stroustrup, The C++ programming language. Pearson Education, 2013.

[159] G. Matheou and P. Evripidou, “Verilog-based simulation of hardware support for data-flow concurrency on multicore systems,” in SAMOS XIII, 2013. IEEE, 2013, pp. 280–287.

[160] B. Eckel, Thinking in JAVA. Prentice Hall Professional, 2003.

[161] TutorialsPoint, “Data encapsulation in c++,” 2016. [Online]. Available: http://www.tutorialspoint.com/cplusplus/cpp data encapsulation.htm

[162] J. Protic, M. Tomasevic, and V. Milutinovic,´ Distributed shared memory: Concepts and sys- tems. John Wiley & Sons, 1998, vol. 21.

[163] D. A. Koufaty, X. Chen, D. K. Poulsen, and J. Torrellas, “Data forwarding in scalable shared- memory multiprocessors,” IEEE Transactions on Parallel and Distributed Systems, vol. 7, no. 12, pp. 1250–1264, 1996.

[164] D. K. Poulsen and P.-C. Y. P.-C. Yew, “Data prefetching and data forwarding in shared memory multiprocessors,” in Parallel Processing, 1994. ICPP 1994 Volume 2. International Conference on, vol. 2. IEEE, 1994, pp. 280–280.

[165] J. Matocha and T. Camp, “A taxonomy of distributed termination detection algorithms,” Jour- nal of Systems and Software, vol. 43, no. 3, pp. 207–221, 1998.

[166] E. W. Dijkstra and C. S. Scholten, “Termination detection for diffusing computations,” Infor- mation Processing Letters, vol. 11, no. 1, pp. 1–4, 1980.

[167] R. Stevens, A. White, S. Dosanjh, A. Geist, B. Gorda, K. Yelick, J. Morrison, H. Simon, J. Shalf, J. Nichols et al., “Architectures and technology for extreme scale computing,” in ASCR Scientific Grand Challenges Workshop Series, Tech. Rep, 2009.

[168] R. Rosner et al., “The opportunities and challenges of exascale computing,” US Dept. of Energy Office of Science, Summary Report of the Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee, 2010.

[169] Z. Budimlic, A. M. Chandramowlishwaran, K. Knobe, G. N. Lowney, V. Sarkar, and L. Treg- giari, “Declarative aspects of memory management in the concurrent collections parallel pro- gramming model,” in Proceedings of the 4th workshop on Declarative aspects of multicore programming. ACM, 2009, pp. 47–58.

[170] A. Diavastos, G. Matheou, P. Evripidou, and P. Trancoso, “Data-driven multithreading programming tool-chain,” Department of Computer Science, University of Cyprus, Nicosia, Cyprus, Tech. Rep. TR-17-3, September 2017. [Online]. Available: https://www.cs.ucy.ac.cy/docs/techreports/TR-17-3.pdf

[171] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, “LU decomposition and its applications,” Numerical Recipes in FORTRAN: The Art of Scientific Computing, pp. 34–42, 1992.

[172] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, “The SPLASH-2 programs: Char- acterization and methodological considerations,” in ACM SIGARCH Computer Architecture News, vol. 23, no. 2. ACM, 1995, pp. 24–36.

[173] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The parsec benchmark suite: characterization and architectural implications,” in Proceedings of the 17th international conference on Parallel architectures and compilation techniques. ACM, 2008, pp. 72–81.

[174] “Mandelbrot set,” 2017. [Online]. Available: https://en.wikipedia.org/wiki/Mandelbrot set# cite note-John H. Hubbard 1985-1

[175] E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. Demmel, J. J. Dongarra, J. Du Croz, S. Hammarling, A. Greenbaum, A. McKenney, and D. Sorensen, LAPACK Users’ Guide (Third Ed.). Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 1999.

[176] L. S. Blackford, A. Petitet, R. Pozo, K. Remington, R. C. Whaley, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry et al., “An updated set of basic linear algebra subprograms (blas),” ACM Transactions on Mathematical Software, vol. 28, no. 2, pp. 135–151, 2002.

[177] F. Song and J. Dongarra, “Scaling up matrix computations on shared-memory manycore sys- tems with 1000 cpu cores,” in Proceedings of the 28th ACM international conference on Su- percomputing. ACM, 2014, pp. 333–342.

[178] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov, “Numerical linear algebra on emerging architectures: The plasma and magma projects,” in Journal of Physics: Conference Series, vol. 180, no. 1. IOP Publishing, 2009, p. 012037.

[179] Eight queens puzzle. Accessed on 10 Aug 2016. [Online]. Available: https://en.wikipedia.org/ wiki/Eight queens puzzle

[180] BSC. BSC Application Repository. Accessed on 10 Aug 2016. [Online]. Available: https://pm.bsc.es/projects/bar/wiki/Applications

[181] T. C. Institute, “Cy-Tera,” http://web.cytera.cyi.ac.cy, 2017, [Online; accessed 25-Mar-2017].

[182] Xilinx, ISim User Guide (UG660), 2012.

[183] K. Pagiamtzis and A. Sheikholeslami, “Content-addressable memory (cam) circuits and ar- chitectures: A tutorial and survey,” IEEE Journal of Solid-State Circuits, vol. 41, no. 3, pp. 712–727, 2006.

[184] P. R. Panda, N. D. Dutt, and A. Nicolau, “On-chip vs. off-chip memory: the data partitioning problem in embedded processor-based systems,” ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 5, no. 3, pp. 682–704, 2000.

[185] L. S. Blackford, J. Choi, A. Cleary, E. D’Azeuedo, J. Demmel, I. Dhillon, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ScaLAPACK User’s Guide, J. J. Dongarra, Ed. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 1997.

[186] mpich.org, “LU factorization,” https://trac.mpich.org/projects/armci-mpi/browser/tests/contrib/lu/lu.c, 2017, [Online; accessed 09-Oct-2017].

[187] M. C. Cera, J. V. Lima, N. Maillard, and P. O. A. Navaux, “Challenges and issues of supporting task parallelism in mpi,” in EuroMPI. Springer, 2010, pp. 302–305.

[188] D. Bonachea, “Gasnet specification, v1.1,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/CSD-02-1207, Oct 2002. [Online]. Available: http://www2.eecs. berkeley.edu/Pubs/TechRpts/2002/5764.html

[189] U. Lamping and E. Warnicke, “Wireshark user’s guide,” Interface, vol. 4, no. 6, 2004.

[190] A. Forencich, “Verilog content addressable memory module,” 2016. [Online]. Available: https://github.com/alexforencich/verilog-cam

[191] H. Fu, J. Liao, J. Yang, L. Wang, Z. Song, X. Huang, C. Yang, W. Xue, F. Liu, F. Qiao et al., “The Sunway TaihuLight supercomputer: system and applications,” Science China Information Sciences, vol. 59, no. 7, p. 072001, 2016.

[192] K. Fürlinger, T. Fuchs, and R. Kowalewski, “DASH: a C++ PGAS library for distributed data structures and parallel algorithms,” in High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), 2016 IEEE 18th International Conference on. IEEE, 2016, pp. 983–990.

Appendices



Appendix A

Publications

Journals

1. G. Matheou and P. Evripidou. “Architectural support for data-driven execution”. ACM Transactions on Architecture and Code Optimization (TACO) 11.4 (2015): 52. Presented in HiPEAC 2015, Amsterdam, January 2015. DOI: 10.1145/2686874.

2. S. Arandi, G. Matheou, C. Kyriacou, and P. Evripidou. “Data-Driven Thread Execution on Heterogeneous Processors.” International Journal of Parallel Programming, February 8, 2017. DOI: 10.1007/s10766-016-0486-6.

3. G. Matheou and P. Evripidou. “Data-Driven Concurrency for High Performance Computing.” ACM Transactions on Architecture and Code Optimization (TACO) 14.4 (2017): 53. DOI: 10.1145/3162014.

Conferences

1. G. Matheou and P. Evripidou. “Verilog-based simulation of hardware support for data-flow concurrency on multicore systems.” Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIII), 2013 International Conference on. IEEE, 2013. DOI: 10.1109/SAMOS.2013.6621136.

2. G. Matheou, P. Evripidou, and C. Kyriacou. “Paradigm Shift for EXASCALE Computing.” In Proceedings of the 3rd International Conference on Exascale Applications and Software (EASC 2015), A. Gray, L. Smith, and M. Weiland (Eds.). University of Edinburgh, Edinburgh, Scotland, UK, 109-114, 2015. ISBN: 978-0-9926615-1-9.

3. G. Matheou and P. Evripidou, “FREDDO: an efficient Framework for Runtime Execution of Data-Driven Objects,” Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), 2016. ISBN: 1-60132-444-8.

Workshops

1. G. Matheou, I. Watson, and P. Evripidou, “Recursion support for the data-driven multithreading model,” Fifth Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM 2015), in conjunction with PACT 2015, San Francisco, October 2015.

2. G. Matheou, C. Kyriacou, and P. Evripidou, “Data-Driven execution of the Tile LU Decomposition,” Sixth Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM 2016), in conjunction with PACT 2016, Haifa, September 2016.


Technical Reports

1. G. Matheou and P. Evripidou, “FREDDO: an efficient Framework for Runtime Execution of Data-Driven Objects,” Department of Computer Science, University of Cyprus, Nicosia, Cyprus, Tech. Rep. TR-16-1, January 2016. [Online]. Available: www.cs.ucy.ac.cy/docs/techreports/TR-16-1.pdf

2. G. Matheou and P. Evripidou, “Data-Driven Concurrency for High Performance Computing,” Department of Computer Science, University of Cyprus, Nicosia, Cyprus, Tech. Rep. TR-17-1, May 2017. [Online]. Available: www.cs.ucy.ac.cy/docs/techreports/TR-17-1.pdf

3. A. Diavastos, G. Matheou, P. Evripidou and P. Trancoso, “Data-Driven Multithreading Programming Tool-chain,” Department of Computer Science, University of Cyprus, Nicosia, Cyprus, Tech. Rep. TR-17-3, September 2017. [Online]. Available: www.cs.ucy.ac.cy/docs/techreports/TR-17-3.pdf

