SEQUENTIAL CODELET MODEL

A SUPERCODELET PROGRAM EXECUTION MODEL AND

ARCHITECTURE

by

Jose M Monsalve Diaz

A dissertation submitted to the Faculty of the University of Delaware in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering

Winter 2021

© 2021 Jose M Monsalve Diaz All Rights Reserved SEQUENTIAL CODELET MODEL

A SUPERCODELET PROGRAM EXECUTION MODEL AND

ARCHITECTURE

by

Jose M Monsalve Diaz

Approved: Jamie D. Phillips, Ph.D. Chair of the Department of Electrical and Computer Engineering

Approved: Levi T. Thompson, Ph.D. Dean of the College of Engineering

Approved: Louis F. Rossi, Ph.D. Vice Provost for Graduate and Professional Education and Dean of the Graduate College Approved: Guang R. Gao, Professor Emeritus Professor in charge of dissertation on behalf of the Advisory Committee

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Rudolf Eigenmann, Ph.D. Professor in charge of dissertation

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Xiaomin Li, Ph.D. Professor in charge of dissertation

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Sunita Chandrasekaran, Ph.D. Member of dissertation committee I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Kalyan Kumaran, Ph.D. Member of dissertation committee ACKNOWLEDGEMENTS

This work is the result of several years at the CAPSL research group, under the supervision of Professor Guang R. Gao. I joined CAPSL in 2013 as an intern, and I never imagined I would stay for this long. Yet, I enjoy every single moment of my Graduate School. Professor Gao has been an inspiration. His work and ambitions, as well as his many hours of insightful conversations were crucial for the creation of this thesis. I also would like to acknowledge the patience and help from other senior CAPSL students through my PhD. In particular, Elkin Garc´ıa,Aaron Landwehr, Kelly Livingston, Joshua Suetterlein, Jaime Arteaga, Pouya Fotouhi, and Sid Raskar. As well as former postdoc Stephane Zuckerman, who I admire deeply, and who has been essential to the CAPSL group even to this day. Additionally, all the new students that I have been collaborating in the past three years. Ryan Kabrick, who brought two amazing students with him, Dawson Fox and Matthew Matusek; and Diego Roa, my old friend who also helped me craft the Matrix Multiplication example presented in this work. I would also like to acknowledge Professor Sunita Chandrasekaran, one of the best professors I have met. The professionalism and dedication to her students are an inspiration. It is an honor having work with her for the past few years. Her personal and academic advise help me dealt with many difficult moments. The University of Delaware, and in particular, the Electrical and Computer Engineering Department has supported me in many different ways. While it is difficult to mention everyone by name, I feel obliged to recognize Professor Kenneth Barner, Cynthia McLaughlin, and Gwen Looby, who bore all my crazy administrative questions and ideas. Bryan Youse, and Andrew Roosen, two excellent bosses during my ECE staff days. Amber Spivey, and Wendy Scott, who were always nice, greet me with

v a smile and help me out whenever I needed it. Additionally, I must acknowledge all my other professors in the department, specially Xiaoming Li, Chengmo Yang, Rudolf Eigenmann, and Andy Novicin. I have learned a great deal from each of you. Likewise, other offices in the University such as the Office of International Students and the Office of Graduate and Professional Education. In particular, Dr. Mary Martin who supported me and helped me graduate twice. The past couple of years at Argonne National Laboratory have also played an important role in this thesis. I would like to acknowledge Kevin Harm’s active participation in this project. He listened to me, and always asked the right questions. Dr. Kalyan Kumaran who also provided valuable input and supervision. Finally, and most importantly, I must acknowledge my Family. My parents and my brother who have always given everything for my education and my future. They will always be an inspiration of hard work, dedication and love. Also, my wife, Luisa, who has brought me joy and support for many years, and specially in the past year when working on this thesis. Thanks for being there in every up and down. I gratefully acknowledge the resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory. Ad- ditionally, this research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02- 06CH11357. This work is also supported by the National Science Foundation, under award SHF-1763654, as well as other CAPSL grants.

vi TABLE OF CONTENTS

LIST OF TABLES ...... xii LIST OF FIGURES ...... xiii LIST OF LISTINGS ...... xviii ABSTRACT ...... xx

Chapter

1 INTRODUCTION ...... 1

1.1 The problem with trending parallelism ...... 4

1.1.1 The evolution ...... 5

1.2 The problem ...... 10 1.3 Synopsis ...... 14

2 BACKGROUND ...... 16

2.1 Foundation ...... 16

2.1.1 The Universal ...... 16 2.1.2 ...... 27 2.1.3 Multithreading and Multi Core systems ..... 30 2.1.4 Dataflow Computation ...... 34 2.1.5 Hybrid Von Neumann/Dataflow architecture ...... 37

2.1.5.1 Instruction Level Parallelism ...... 38 2.1.5.2 Tasking ...... 42

2.1.6 Instruction Set Architectures and Program Execution Models 44

2.2 The Codelet Model of Computation ...... 47

vii 3 OBJECTIVES AND PROBLEM FORMULATION ...... 52

3.1 Objectives ...... 52 3.2 Problem Formulation ...... 53

4 THE SEQUENTIAL CODELET MODEL ...... 55

4.1 Motivation example ...... 56 4.2 Hierarchical Turing Machine ...... 59 4.3 Hierarchical Von Neumann Architecture ...... 62 4.4 The Sequential Codelet Model ...... 65

4.4.1 The SCM ...... 65 4.4.2 The SCM Program Execution Model ...... 67

4.4.2.1 Tasking model ...... 67 4.4.2.2 Synchronization Model ...... 69 4.4.2.3 Memory Model ...... 70

4.4.3 Programming model ...... 71

4.5 Codelet Level Parallelism: Parallelism and performance ...... 72

4.5.1 Codelet Level Parallelism ...... 72 4.5.2 Heterogeneity and distributed systems in the Hierarchical Abstract Machine ...... 74 4.5.3 Compiler techniques ...... 75

5 THE SUPERCODELET ARCHITECTURE ...... 77

5.1 The Architecture organization ...... 77

5.1.1 Scheduling units ...... 79 5.1.2 Computation Units ...... 79 5.1.3 Memory Units ...... 80 5.1.4 ...... 81

5.2 Programming model ...... 81

5.2.1 SuperCodelet ISA ...... 82 5.2.2 Codelet Definition ...... 83

viii 5.2.3 An example program ...... 85

6 SCMULATE. AN EMULATION RUNTIME FOR THE SEQUENTIAL CODELET MODEL ...... 88

6.1 SCMUlate software design ...... 89

6.1.1 The role of OpenMP ...... 89 6.1.2 Folder infrastructure ...... 91 6.1.3 The SCM Machine emulation ...... 93

6.1.3.1 Instructions and Instruction Memory ...... 93 6.1.3.2 Fetch, decode and execute ...... 95 6.1.3.3 The L1 register file on Cache based systems ..... 97 6.1.3.4 Codelet Level Parallelism ...... 99

6.1.3.4.1 Sequential execution mode: ...... 99 6.1.3.4.2 Superscalar execution mode: ...... 100 6.1.3.4.3 Out of Order (OoO) execution mode: .. 101

6.1.4 Configuration and common structures ...... 103 6.1.5 The Codelet Class ...... 104

6.2 Programming API and the assembly Codelet program ...... 106

6.2.1 Codelets definition ...... 106 6.2.2 Example of a print Codelet ...... 108 6.2.3 The assembly Codelet Program ISA ...... 110 6.2.4 Running the emulator ...... 111

6.3 Runtime and Codelet Level Parallelism ...... 113 6.4 Profiling execution code ...... 115

7 EVALUATION ...... 119

7.1 Evaluation methodology ...... 119 7.2 Testbed ...... 120

ix 7.3 Sequential Codelet Abstract Machine mapping ...... 122

7.3.1 Register File: ...... 123

7.4 Example 1: Vector Addition ...... 123

7.4.1 Vector Addition results ...... 130

7.4.1.1 Manually applying optimization techniques: Loop Unrolling and Register Scheduling ...... 132 7.4.1.2 Sequential execution ...... 132 7.4.1.3 Superscalar execution ...... 134 7.4.1.4 Out of order execution ...... 136 7.4.1.5 Comparison between execution modes ...... 137

7.4.2 Analysing the Vector Addition example ...... 141

7.4.2.1 Evaluation drawbacks ...... 142 7.4.2.2 Important results ...... 144

7.5 Example 2: Dense Matrix Multiplication ...... 146

7.5.1 Three different implementations ...... 151 7.5.2 A GPU implementation of Matrix Multiplication ...... 154 7.5.3 Matrix Multiplication CPU Results ...... 155

7.5.3.1 Execution Time ...... 156 7.5.3.2 Scalability ...... 158 7.5.3.3 Instruction size ...... 160

7.5.4 Matrix Multiplication GPU Results ...... 160

7.5.4.1 Execution time ...... 161 7.5.4.2 Scalability ...... 162 7.5.4.3 Instruction size ...... 162

7.5.5 Comparison and analysis ...... 164

7.6 Defining the appropriate size of Codelets ...... 167

7.6.1 Empirical observations from Matrix Multiplication ...... 169 7.6.2 Designing L1 with TS ...... 169 7.6.3 Application and Codelet Size ...... 170

x 8 RELATED WORK ...... 172

8.1 Dataflow systems ...... 173 8.2 Out of Order Execution and other ILP techniques ...... 175 8.3 Other parallel architectures ...... 176 8.4 Von-Neuman/Dataflow hybrid systems ...... 178 8.5 Software approaches and other efforts ...... 179

9 FUTURE WORK ...... 181

10 CONCLUSIONS ...... 184

BIBLIOGRAPHY ...... 187

Appendix

A COMPLETE MATRIX MULTIPLICATION CODE ...... 198

xi LIST OF TABLES

5.1 tab:instSuperCodelet ...... 83

6.1 List of SCMUlate supported instructions ...... 110

7.1 Baseline execution time for the 3 most important instructions in Vector Addition, as seen in different execution modes when running with 1CU ...... 142

xii LIST OF FIGURES

1.1 Conceptual view of computer system infrastructure nowadays ... 5

2.1 Turing Machine. There are four different states S1 through S4.The tape shows the two different type of Squares: E and S ...... 18

2.2 Example of an m-configuration table of the Turing Machine ... 19

2.3 Example of the m-configuration table of a copy Turing Machine . 20

2.4 Using skeleton Tables for representing copy0 and copy1 m-configurations into a single table ...... 23

2.5 Using skeleton Tables for representing the copy m-configuration table allowing different following states...... 23

2.6 Using skeleton Tables to re-write example 2.3...... 24

2.7 Standard forms of states in the Turing Machine ...... 25

2.8 Von Neuman Machine Architecture ...... 28

2.9 Role of the in concurrency: Each process is a virtual Von Neumann machine. The OS maps each process memory to physical memory through the use of virtual memory. The OS Scheduler concurrently maps the SW Threads to HW threads through context switching...... 32

2.10 Diagram of a multithreaded Von Neumann machine...... 33

2.11 Example of non well-behaved graphs (“sick” graphs). Issues highlighted in orange...... 37

2.12 Pipeline of an Out of Order architecture with Superscalar and ...... 42

xiii 2.13 Conceptual view of computer system infrastructure ...... 45

2.14 Codelet Abstract Machine Depiction. Hierarchical organization of the system and its memory...... 49

4.1 A 3 levels Hierarchical Turing Machine. An state of level N is expressed as an state machine of level N-1 and evaluated by level N-1. Each level has an infinite memory tape used for the computation performed at that level. Symbols depicted in tapes are just to account for any possible symbol that is known to the machine. ... 60

4.2 Structure of the Hierarchical Von Neumann Architecture. A Sequential Codelet Model Abstract machine ...... 62

4.3 A 3 level abstract machine of the Sequential Codelet Model that implements the Hierarchical Von Neuman Model...... 65

4.4 The Memory size to Frequency ratio of the SCM ...... 66

4.5 Heterogeneous Sequential Codelet Model abstract machine ..... 75

4.6 Extending the hierarchy beyond L2...... 76

5.1 Diagram of the SuperCodelet architecture ...... 78

6.1 SCMUlate UML Classes Diagram ...... 90

6.2 Register file emulation mechanism on SCMUlate ...... 98

6.3 Example of the content in the subscribers map with respect to the instructions in the instructions buffer ...... 102

6.4 Time diagram of the execution of a Codelet using a superscalar approach with 2 Compute Units. See the executing Assembly Codelet Program on the bottom right ...... 114

6.5 Interactive trace visualization example ...... 118

7.1 Architecture block diagram of the Intel Core i7-8700K containing the Intel Gen9 integrated GPU ...... 120

7.2 Architecture block diagram of the Intel Core i7-8700K containing the Intel Gen9 integrated GPU ...... 122

xiv 7.3 Structure of memory and computation for Vector Addition using SCM running on SCMUlate. 1) L2 memory contains the original 3 vectors (A,B,C) in a single flat memory allocation. 2) load and store operations of L1 fetch part of the whole array, according to the L1 register sizes (e.g. 2048x64 bytes) (A’, B’ and C’). 3) L1 Codelets use these registers to perform computation. They are evaluated at L0. 4) Each register in L1 belongs to L1 memory space. Then L0 computation access these registers through memory operations. Each read/write access as much as L0 register size (A”, B” and C”). 5) Actual computation performed at L0 ...... 124

7.4 Vector Addition execution trace for the sequential mode on the SCMUlate emulator. 3 loop iterations on 1 CU ...... 133

7.5 Vector Addition execution trace for the Superscalar Codelet Level Parallelism mode on the SCMUlate emulator. 3 Loop iterations on 4 CUs...... 134

7.6 Vector Addition unrolled 4 times. Execution trace for the Superscalar Codelet Level Parallelism mode on the SCMUlate emulator. Zooming on at least 4 loop iterations...... 136

7.7 Vector Addition execution trace for the Out of Order Codelet Level Parallelism mode on the SCMUlate emulator. Zooming in to the initial loops...... 137

7.8 Execution time comparison for different execution modes. Superscalar and Out of order used 5 CUs ...... 138

7.9 Vector Addition execution time for different number of CUs. The size of the vector is the same in all the cases. Lower is better ...... 139

7.10 Speed up progress with the number of CUs as compared against the execution with 1 thread within each mode of operation...... 140

7.11 Slow down in comparison to a simple baseline using OpenMP. The higher the better...... 141

7.12 Average execution time of the three most important instructions in the execution of Vector Add in comparison to the number of CUs and the implementation approach. Lower is better ...... 143

xv 7.13 Tiling strategy for matrix multiplication. 1) Matrices are divided into tiles of 128x128. 2) A special memory Codelet performs a load operation on tiles A, B and C. Since tiles are usually non-contiguous, a distance is send as part of the operands. 3) the matrix multiplication Codelet is applied. 4) the Rnum 2048R registers contain the 128x128 tile. 5) Regular matrix multiplication can be applied, it is possible to use highly optimized BLAS libraries...... 147

7.14 Execution time vs number of CUs for the na¨ıve version of Matrix Multiplication. Logarithmic scale in the vertical axis...... 156

7.15 Execution time vs number of CUs for the optimized version of Matrix Multiplication. Logarithmic scale in the vertical axis...... 157

7.16 Execution time vs number of CUs for the Intel MKL version of Matrix Multiplication. Logarithmic scale in the vertical axis. .... 157

7.17 Strong scaling for the na¨ıve version of Matrix Multiplication. .... 158

7.18 Strong scaling for the user optimized version of Matrix Multiplication. 159

7.19 Strong scaling for the Intel MKL version of Matrix Multiplication. . 159

7.20 Codelet Performance degradation for Matrix Multiplication. M=N=K=10 tiles ...... 161

7.21 Execution time vs number of CUs for the GPU version of the Matrix Multiplication Codelet. Logarithmic scale in the vertical axis. ... 162

7.22 Strong Scaling for the GPU version of the Matrix Multiplication Codelet. Logarithmic scale in the vertical axis...... 163

7.23 Strong Scaling for the GPU version of the Matrix Multiplication Codelet. Logarithmic scale in the vertical axis...... 163

7.24 Big O for Matrix Multiplication as observed from L1. O(N 3) for the implemented MM...... 164

7.25 5 CUs and M=N=K=20. Bars: Codelet execution time (left axis). Line: Program Execution Time (right axis)...... 165

7.26 M=N=K=40. Scalability comparison for different implementations 167

xvi 7.27 Understanding Sequential bottleneck. Ts Time to schedule an instruction into a CU. Tc Compute time. Tmin minimum compute time to avoid Twaste sub-utilization ...... 168

A.1 Guide image for Matrix Multiplication Implementation in SCM. Registers and their corresponding meaning...... 198

xvii LIST OF LISTINGS

2.1 Example code for Superscalar ILP ...... 39

2.2 Example code for Register Renaming ILP ...... 40

4.1 Motivation Example ...... 58

5.1 Main Codelet definition at level L1 ...... 86

5.2 Sum Codelet definition at level L0 ...... 87

5.3 Sub Codelet definition at level L0 ...... 87

6.1 The Print Codelet implementation ...... 109

6.2 Example of using the Print Codelet...... 110

6.3 Use example of the SCMUlate runtime ...... 112

6.4 JSON file structure of the tracing mechanism of SCMUlate ...... 117

7.1 Main Codelet for the Vector Addition program ...... 126

7.2 L0 Codelet VecAdd 2048L implementation ...... 128

7.3 SCMUlate program for Vector Addition. Runtime creation and Main Codelet creation...... 129

7.4 Loop unrolling of size 4 for Vector addition. : A ...... 131

7.5 Matrix Multiplication 1 tile: C ...... 148

7.6 Definition and implementation of the LoadSqTile 2048L Codelet ... 150

7.7 No optimized version of matMult 2048L ...... 152

7.8 User optimized version of matMult 2048L ...... 153

xviii 7.9 Intel MKL version of matMult 2048L ...... 153

7.10 Definition of the GPU Codelet matMulGPU 2048L using Intel’s MKL support for gen9 GPU and OpenMP 5.1 features ...... 155

A.1 L2 Memory structure for Matrix Multiplication on SCM ...... 199

A.2 Matrix Multiplication NxM tiles: C = C + A*B. Constant declaration. 200

A.3 Matrix Multiplication NxM tiles: C = C + A*B. Loop j and k .... 201

A.4 Matrix Multiplication NxM tiles: C = C + A*B. Loop i ...... 202

A.5 Matrix Multiplication NxM tiles: C = C + A*B. Iteration variable increments loop i ...... 203

A.6 Matrix Multiplication NxM tiles: C = C + A*B. Iteration variable increments loop k ...... 204

A.7 Matrix Multiplication NxM tiles: C = C + A*B. Iteration variable increment loop j ...... 205

xix ABSTRACT

In sequential computers, the Instruction Set Architecture provides a clear divi- sion between software and hardware. Separation of software and hardware through a well defined contract enabled decades long of seamless evolution of computer systems. The end of Dennard’s scaling and slow down of Moore’s law has forced architects to abandon purely sequential architectures in favor of parallel/distributed and hetero- geneous systems. The new era represents a new spring of computer architectures. However, the ISA contract has been broken. It is mandatory to reconcile the abstrac- tion between hardware and software in order to recover performance, portability, and programmability. Sequential architectures take advantage of Instruction level parallelism to over- lap the execution of instructions. These techniques use dataflow to implicitly perform side-effect free parallel execution of code. On the other hand, parallel programming often requires explicit reasoning of workload distribution, communication, memory synchronization and worker management. This thesis proposes the Sequential Codelet Model, a program execution model for parallel, heterogeneous and distributed execu- tion of programs. It defines a machine abstraction (namely hierarchical Von Neumann machine), that recognizes the natural hierarchical structure of computer systems. Pro- gramming of the machine uses a hierarchical imperative programming model reassem- bling an Instruction Set Architecture at each level. A Codelet is the name given to an “instruction” of a level, as expressed in terms of instructions of the level below. By means of Instruction Level Parallelism inspired techniques, parallel/distributed execu- tion of programs is achieved. The final system leverages the vast progress made for sequential computers. Finally, We present a the Super Codelet Architecture, a possible realization of the Sequential Codelet Model.

xx Chapter 1

INTRODUCTION

Computer systems are highly complex machines composed of a large number of parts interacting with each other. Throughout history, hardware and software have constantly evolved aiming to push the limits of what computers can do. Computer architects have competed to define high performance computing organizations with faster processing capabilities. On the other hand, software developers have design a large infrastructures that eases programming and utilization of computer systems. In general, computers are expected to deliver increased performance in every new generation, and software is expected to be easy to understand, create and extend while utilizing most of the available resources. Finally, rapid evolution requires re-usability across generations of systems, enabling a seamless evolution. These three aspects: Performance, Portability, and Programmability (or just 3P) have continuously driven computer innovation. For years, sequential computing provided a system abstraction and execution model that fostered these three properties. However, the arrival of parallelism has proven difficult to maintain them. Following the advice of Hennessy’s and Patterson’s Turing Award lecture [1], we shall revisit some of the history of . There are three major creations that lead to the sequential computing success. First, the Turing Machine in- troduced by Allan Turing in 1936 [2] which provided the necessary mathematical model. Second, the definition of the Von Neumann architecture in 1945 [3] that narrowed down the components of the computer system and their interaction into a specific system organization. And third, the introduction of a standardized Instruction Set Architec- ture as a single interface between hardware and software in the 1960’s [4] by Frederick Brooks and its IBM/360 design team.

1 The standardization of the Instruction Set Architecture (ISAs) worked as a long lasting contract between hardware and software. As long as this contract was respected, each part could independently evolve its capabilities. On one hand, hardware designers focused on improving ’s frequency of operation and scaling down feature size of transistors. Higher frequencies and more transistors meant increased throughput and complexity. Software developers, on the other hand, were relieved of the need to constantly adapt their programs to new systems. As long as both hardware and software used the same ISA and respected the architecture’s execution model provided by the hardware (i.e. Turing Machine and Von Neumann model), they could evolve programming models and compilers continuously. ISA is almost a synonym of the system’s architecture. There are multiple pos- sible implementations of the same ISA, each with different characteristics. However, given an architecture, it is possible to execute the same program in every implementa- tion that follows exactly the same ISA (i.e. Portability). Some of the most advanced implementations that are available today parallelize and re-arrange the execution of in- structions without interfering with the running software or its correct result. A parallel execution of instructions is achieved through dataflow inspired techniques in a process that is completely transparent to the user. For a subset of the instructions in the pro- gram stream, dependencies are discovered and maintained at runtime, allowing for the order of execution of instructions to be changed. Furthermore, as long as instructions are committed in the order they are described (i.e. program state changes are per- formed in-order), the user is not aware (and does not need to be aware) of the runtime execution order. It is still possible to increase performance for a given implementation of an ISA. An avid programmer may change their code to exploit these execution mech- anisms, but even so, the developer is not required to think of the potential hazards of the parallel execution. The end of Dennard’s scaling [5] and slowing of Moore’s law during the past 2 decades [1] has limited the ability to improve sequential architectures through the same mechanisms used by the industry for over 30 years. Consequently, computer architects

2 were forced to adopt parallelism, at first, by means of increasing the number of cores present in the same system (i.e. multicore architectures). Most recently, architects have shifted towards heterogeneous systems composed of multiple units each specialized in different computational patterns or application specific operations. Although this seems like an reasonable step, maintaining the Von Neumann abstraction over multiple cores has presented difficulties and challenges, specially when aiming to maintain the aforementioned 3P properties. We will discuss this further in the next subsection. The new architectures have revived interest in parallel computer architectures that already existed in the 70s and 80s, albeit its lack of success in the past. Nowadays, it is an accepted fact that computer systems must be parallel, heterogeneous and distributed machines. Unlike sequential computing, trending parallel processing lacks of a broadly ac- ceptable common abstraction that allows a division between software and hardware (similar to an ISA). Hardware evolution is usually slower than software due to its com- plexity. In order to utilize the newly introduced parallel hardware (i.e. multicore and heterogeneous systems) software (and developers) have been burdened with the task of creating this abstraction. Three big changes arrived to software. First, the con- cept of thread was introduced in the operating system. Second, programming models and languages were leveraged to explicitly express parallelism. Third, runtime systems were introduced to emulate machine abstractions and organizations. Nowadays, par- allel programmers are responsible of workers creation, workers synchronization, and memory management across workers. This thesis aims to provide a system and program organization for highly paral- lel, heterogeneous and distributed systems. This work intends to leverage the sequential computation abstractions (i.e. Turing Machine, Von Neumann model and Instruction set architectures) and extend it beyond the core unit. It defines a model of compu- tation for parallel program execution that spans across the whole computer system infrastructure. The aspiration is to take advantage of the original sequential abstrac- tion to exploit parallelism through ILP techniques, but for the execution of tasks that

3 define more complex mathematical operations.

1.1 The problem with trending parallelism The problem of parallelism, in general, is one of creation, communication and co- ordination of workers, tasks and resources. In addition, there are three requirements for general purpose parallel computing: 1) The program execution must use the available resources efficiently (i.e. high performance), 2) performance must be carried through multiple system generations and system architectures (i.e. performance portability), and 3) it is easy for a programmer to describe such program, reducing time-to-solution (i.e. programmability). Current trending parallelism relies on two major architectures: Shared memory multi-core systems and accelerator-based systems (e.g. GPGPUs) that use a SPMD execution model. Furthermore, there are currently two trending communi- cation mechanisms across workers: 1) Message Passing (e.g. MPI) and Shared memory. I focus here on these four elements. The problem with trending parallelism, in particular, is twofold. First, the lack of a common abstraction, namely Program Execution Model (PXM) as previously defined [6][7][8]. Second, current abstractions aim to maintain a Von Neumann view of the system. A PXM describes the execution of a program in a parallel system as a whole, and it should serve as a common view for hardware designers, system software developers, and programmers. A PXM allows for a single strategy to be used in order to represent and execute programs in the system. Such abstraction is similar to the role that the Universal Turing Machine, the Von Neumann model and the Instruction Set Architecture had in sequential system. All of which allowed an smooth evolution of hardware and software in sequential computer systems. The following paragraphs aim to understand the evolution of parallelism in the recent years and explain why these two problems are critical for the evolution of computer systems.

4 1.1.1 The evolution Parallelism is not a new idea in computer architectures. It dates back to the early times of computer architectures, but contrary to the early years of parallel computing, mainstream parallel systems are now easily available. Modern parallelism arrived to an already existing infrastructure (as seen in figure 1.1) that relies on general purpose computation and urges backwards compatibility. With the decay of Dennard’s Scal- ing and slow down of Moore’s law, architects needed a solution to continue growing computational power of systems. Single core performance advancements hindered, and systems started featuring multiple cores (or independent threads in the case of SMT) side by side in a single die. This approach was simple for hardware development and allowed to keep backwards compatibility with existing software infrastructure. Fur- thermore, it was easy for the operating system to trade concurrency for parallelism when applications were running independently.

U S E R S / P R O G R A M M E R S

SOFTWARE PROGRAMMING SOFTWARE EXECUTION PACKAGES, HIGH LEVEL PROGRAMMING LIBRARIES, LANGUAGES (C, C++, OMP) LOADERS UTILITIES DYNAMIC LIBRARIES TOOLS & COMPILERS SDKs RUNTIMES AND OS PROGRAMMING PARADIGMS (e.g. Imperative, OOP, Declarative…) TOOLS

HARDWARE PROGRAMMING API (ISA) MACHINE ARCHITECTURE RUNTIME SYSTEM

H A R D W A R E HARDWARE RT

Figure 1.1: Conceptual view of computer system infrastructure nowadays

Multicore architectures are the simplest form of parallelism. They comprehend a number of computational units (namely CPUs or hardware threads) that are indepen- dently programmable workers. That is, the inner operation of a single worker does not inherently affect the operations of another worker. To coordinate execution of parallel programs in multicore systems additional communication mechanisms are necessary.

5 Given the already existing influence of Von-Neumann architectures, memory is the preferred method of communication in multi-core systems. Cores use reads and writes to memory to share information and influence program execution of other workers. Consequently, the order of memory operations have an impact on the communication across cores, and therefore on the overall execution of programs. The most common multicore abstraction is composed of multiple homogeneous cores all connected to a monolithic memory. Although memory is often seen as a single unit, memory orga- nizations are not monolithic. Due to performance benefits, memory is divided into different physical locations, allowing a single memory address to be potentially located in different physical locations. The most common example is cache hierarchies. A more recent system design trend are accelerators (or co-processors) featur- ing specialized hardware that optimize execution of specific sections of the program for a sub set of applications. Accelerators have proven that a heterogeneous system could provide great benefits towards computational performance. Perhaps the two most common type of accelerators right now are GPGPUs, and brain-inspired neuro- morphic accelerators such as Google’s Tensor Processing Unit [9] or IBM’s TrueNorth project [10]. The latter have shown interesting results for brain-inspired applications. Accelerators are architectures that are created to exploit performance given certain characteristics such as optimizing application specific operations, or enabling data par- allelism in hardware. When a program fits the execution model of the accelerator, its performance is improved. Systems featuring GPGPUs have been successful when it comes to data paral- lelism and are the current trend in the fastest supercomputers of the world according to the Top500 list [11]. These systems heavily borrow from vector machines [12] and use Single Program Multiple Data (SPMD) and Single Instruction Multiple Data (SIMD) execution models. In these models all the threads receive the same stream of instruc- tions, but a unique identifier of the thread allows for accessing and processing different data in each thread. GPGPUs contain hundreds or even thousands of simplified cores that are relatively simpler and slower than the CPU counterparts. Cores are also

6 grouped into set of cores that usually share memory and other resources. Multiple groups of cores are placed into the same device, and it is possible to use multiple de- vices. The structure of GPGPUs is a hierarchical and programs usually adapt to this hierarchy in order to utilize the whole system. Accelerators often feature non-traditional ISAs, therefore, they rely on com- modity hardware, usually referred to as the host, to start and coordinate the execution of the program. Under this model, certain segments of computation are offloaded to the accelerator, while the creation and coordination of program is left to the host. Parallelism in software has evolved with parallelism in hardware. Prior to the arrival of mainstream parallel systems, operating systems (OS) already supported con- currency of different processes. The use of virtual memory and time-slicing scheduling techniques allows independent programs to share system resources transparent to the user. Communication between processes requires the intervention of the OS. Addition- ally, due to the isolation of processes, messages must make use of intermediate storage such as reserved kernel address space or I/O. Processes are software representation of the von Neumann machine, while the OS guarantees its mapping to the underlying hardware. In multi-core systems it is easy for an operating system to trade concur- rency for parallelism. The independent behavior of processes maps to that of CPU cores in multicore systems. Adaptation between single-core and multi-core OS run- times featured a parallel scheduler with restrictions on shared resources coordinated by the single kernel. In the absence of synchronization between instruction streams of different processes and their respective memory spaces, the OS is the only mediator for the communication between processes and resource sharing, reducing potential side effects to specific interactions that obey a strict sequential order. Message passing interfaces (MPI) was created to enable explicit communication across processes, reducing the intervention of the operating system. Thanks to process isolation, MPI can easily spawn across different compute nodes by using already ex- isting networks to distribute messages. Since memory is independent for each process, MPI requires interaction between the program’s user space and the operating system’s

7 kernel space for communication. Furthermore, when spawning across nodes, this com- munication also involves different drivers and hardware. All these makes cooperation between processes expensive, resulting in a penalty to performance [13, 14], that must be accounted for when parallelizing software. Multi-threading programming allows creation of multiple threads on the same process, sharing the same resources such as memory address space. Multi-threading removes the shield across processes placed by virtual memory, while at the same time reducing the operating system’s role and interaction during program execution, poten- tially reducing overhead and providing more performance. The number of threads in a process do not have to map the number of physical cores in the system. Instead the operating system can concurrently execute threads with the means previously de- scribed. However, it has been shown that over-subscription of threads that compete for the same resources hampers performance [15], and concurrent execution of threads may leave to potential deadlocks, if not handled correctly. Multi-threading still relies on a Von Neumann view of the system with a mono- lithic perspective of memory. CPU core’s performance heavily relies on the use of cache-like memories (among other pre-fetching mechanisms) that reduce latency of memory access. Caches are transparent to the user and are not part of the software thread abstraction, but a given unique memory address in the thread’s memory space may map to different physical locations in hardware. Shared memory systems on multi-core architectures introduce a problem: Out of all potential memory locations, and their respective values, what is (are) the correct value (values) a worker can ob- serve at a certain moment of time? Memory models are used to determine the set of admitted values in the presence of multiple workers (observers and producers) [16]. Multi-treading requires programmers to reason about all potential side effects that could occur in memory during the execution. Order of events is critical for synchro- nization of workers. Relaxed memory models with large set of correct values increase potential side effects. Therefore, most of the commodity multi-core systems available nowadays have been design to respect the sequential consistency memory model [17].

8 Cache coherence protocols are built around the memory hierarchy to maintain the il- lusion of an ordered sequence of memory operations that allows a single value to be observed by every worker at a given time. Additionally, different flavors of atomic mem- ory operations are often provided by these systems to allow for a more strict definition of the order of memory operations. With the arrival of commodity parallel systems, it was still necessary to adapt previously existing high level programming models and programming languages to be able to support parallelism. Software threads are managed by an OS libraries and kernel functions (e.g. POSIX Threads). An application can request and manage thread resources through different calls exposed as a system library. Software and hardware threads are flexible and allow to be building blocks of more complex models. Furthermore, they enable backward compatibility with an already existing software infrastructure. Therefore, high level programming languages would only require a library or module that connects the OS API with the particular semantics of the programming language. Consequently, mainstream programming models have been extended with se- mantics to express parallelism. Extensions have been proposed to define creation, communication and coordination of workers, tasks and resources. Currently there is no widely acceptable parallel programming model, and in order to take advantage of the parallel resources there is a myriad of frameworks that implement different program- ming models for parallelism. Multi-threading, for example, provides full flexibility, but the burden lays completely on the programmer [18]. Another example is the Fork-join models. The program starts sequentially and, at a certain point in computation a num- ber of workers are spawned. Once they have finished executing they join into a single barrier and sequential execution resumes. Workers may have local private memory and there may be global shared memory. Depending on the implementation, communica- tion is allowed across workers, but it must be coordinated by the programmer. Yet another example of a programming model is tasking. This model is heavily inspired by dataflow models of computation. tasks are segments of code that are executed when

9 their dependencies are satisfied and the necessary resources are available. There are data dependencies defined as a producer-consumer relationship, and control dependen- cies, defined as the order in which tasks need to be executed. Tasking usually defines a graph of dependencies either at runtime or statically at compile time. Accelerators have also been considered. Accelerators often come with low level languages that extend from already existing languages (e.g. Cuda, HIP, or SYCL). Additionally, other programming languages and frameworks (e.g. OpenMP, and Ope- nACC) have been extended to account for heterogeneity, allowing the user to express computation targeting the given architecture. Finally, there are libraries which pro- vide application specific operataions which ultimately use the accelerator as part of its implementation (e.g. TPU Tensorflow, cuCNN, and Intel’s MKL).

1.2 The problem Sequential computing built up upon two models: The Turing Machine and the Von Neumann architecture. These models allowed a common view of the system the between hardware and , provided a solid ground for the evolu- tion of computer systems. However, the introduction of parallelism has shaken these abstractions. Multicore architectures enable backwards compatibility by using already exist- ing ISAs in each core. However, these system struggle to maintain the 3P proper- ties. Programmability suffers from two angles. First, independent worker creation and synchronization is tedious. Second, as memory operations affect the result of the computation, it is often necessary to enforce an order in memory operations to eas- ily understand program behavior. Consequently, multicore systems use mechanisms to guarantee this order operations (e.g. cache coherency protocols) which ultimately degrade performance as the number of cores increases. Performance is also heavily dependent on other system’s elements which are often left outside of the multi-core ab- straction (e.g. cache organization, memory bandwidth, interconnect architecture, and memory access times). The lack of accountability for these concepts in the software

10 abstractions ultimately degrade portability when these elements change from system to system. Some accelerators have proven to improve performance by reducing the over- head, increasing the number of workers and allowing different execution models to be used. In addition to performance, programmability also improves with accelerators as they ease worker creation and coordination. However, the host-accelerator machine abstraction centers its execution into the accelerator and usually leaves behind other compute capabilities such as the host. This problem is exacerbated by the distance be- tween host and device, leaving the host as a simple scheduler of computation. Finally, there host-accelerator machine abstraction has not been formalized to allow portability across generation of systems. Regardless of the programming model, there needs to be a mapping in between the abstract machine of the hardware, and the abstract machine of the programming model. With the lack of a common abstraction at the hardware level, parallel pro- gramming models have been tasked to come up with abstractions on how to organize computation to take advantage of parallelism. Due to the aforementioned evolution of hardware and parallelism, most of the programming models rely on runtimes that are software implemented and linked against the OS libraries that provide thread creation and management. These implementations use hardware threads, software threads and processes as building blocks. More often than not, these mappings assume their run- times are the only running element in the system, and if placed next to other runtimes they tend to compete for resources. Moreover, attempting to mix these models in the same program tend to have similar complication (e.g. MPI+X). Therefore, cur- rent parallel programming models tend to break programming generality as defined by Dennis in [7]. Coming up with a common abstraction which can be implemented in hardware is crucial to solve most of these issues that come from the excessive freedom given through current parallel systems. Parallelism is hampered by its increased difficulty in programming. As previ- ously mentioned, programmers are required to think of the interaction between workers

11 through the use of shared memory or message passing. Message passing is tampered by its overhead and limitations to express complex programs that are also portable. In shared memory there is no implicit synchronization or coordination between threads, or hardware computational units. The programmer must build and use mechanisms (e.g. mutex primitives, critical regions and atomic operations) for synchronization of workers, coordination of tasks and assignment of workloads. An excellent explanation of this issue is presented by Edward Lee’s paper ”The problem with threads” [19]. These issues become more complex and prominent as the number of cores and nodes increases. In hardware, scalability is hurt by the need to maintain the von-Neumann rep- resentation and the underlying sequential memory model needed to ease programma- bility. Aiming to relax the memory model becomes prohibitively expensive for the programmer as it would increase side effects during execution of programs. This could overwhelm the most avid developer, which would have to account for it on top of the requiring thread creation, synchronization and workload assignment. Multi-core sys- tems with relaxed memory models and with non-coherent memory hierarchies have failed to succeed so far. GPGPUs are a good example of how hardware implemented parallel abstractions can considerably reduce the overhead introduced by operating systems and the different runtimes. Moreover, it shows the need for heterogeneity and parallelism in order to progress computation. It also reduces the developer’s burden when it comes to workers and tasks creation and orchestration as well as resource management that is usually present in multi-threading programming. However, and despite considerable advances in their architectural designs, GPGPUs still struggle with unstructured parallelism as well as for applications that cannot easily exploit data parallelism. Not only that, but the offloading paradigm that sees the CPU and GPU as two distinct units that are far from each other has reduced the participation of CPU cores as important elements for low latency computation. It is still desirable to recognize the strength in heterogeneity as a collection of elements with different compute capabilities. An abstraction for

12 general computation should consider these aspects. Operating System processes and threads abstraction enforce a Von Neumann view of the system even in the presence of accelerators. Likewise, programming models have historically evolved from sequential computation, but they may not be suitable for parallel abstractions [20]. It is often desired to have a unified shared memory to maintain the Von Neumann abstraction in a software infrastructure built for sequen- tial computers. However, parallel programming has already asked software developers to completely re-structure their software to account for heterogeneity and parallelism. For example, while it is possible to use complex classes and data structure abstractions in the GPU, they are often not encouraged in favor of performance. Moreover, these devices have created deep memory hierarchies with different memory regions that are only shared with a subset of the system. These different memory regions also require extra modifications to the code, as well as the program execution strategies that are not trivial. If the system is to remain under the original Von Neumann abstraction, the monolithic view of memory, together with the oversimplification of the system’s ar- chitecture (e.g. interconnection networks, memory hierarchy and non uniform memory accesses) would make it impossible for performance, portability and programmability to be recovered. Bottom line, as the contract provided by common ISAs broke with the intro- duction of parallelism in computer systems, there is a constant struggle to adapt the already existing software ecosystem into the available hardware resources. Both hard- ware and software have aimed to find a new contract to replace the old one. Software and its ability to adapt fast, has come up with many solutions, but they still rely on heavy software implemented runtimes and a constant push for a Von Neumann view of the system. These runtimes enforce different models that collide, and have not proven to provide the three desired features: Programmability, performance, and portabil- ity. As the trend into accelerators rises, they provide valuable opportunities, but they also hardened the creation of the desired program execution model and its abstract machine. This work aims to provide a feasible solution to some of these problems.

13 1.3 Synopsis The rest of this thesis is organized as follows. Chapter2 contains background information.It discusses the basics that influenced previous models of computation as well as the Sequential Codelet Model. Some of the topics include the Turing machine, Von Neumann architecture, dataflow computation, and concepts of parallel computing based on these three. It also introduces the original Codelet Model as the predecessor of the Sequential Codelet Model. Next, Chapter3 summarizes the objectives of this thesis, as well as provides the problem formulation. Having introduced the topics and intention of this work, this book proceeds to describe in depth the main contributions of this Thesis. Chapter4 describes the Sequential Codelet Model of Computation. It starts by introducing the Hierarchical Turing Machine as the underlying mathematical model that inspires the creation of the Sequential Codelet Model. Then it discusses the Hierarchical Von Neumann abstract machine, and the execution model based on it. Parallelism is described as an arti- fact of the execution of the program, but not inherently part of the original execution mode. An architecture is defined using the Sequential Codelet Model in Chapter5. This architecture is a possible realization of the Sequential Codelet Model in a real hardware system. To evaluate our approach, Chapter6 describes the creation of an emulator that allows execution of SCM programs into commodity hardware architec- tures. This emulator is named SCMUlate (pronounced S.C. Emulate). This Chapter not only describes the structure of the SCMUlate, but also applies the hardware/soft- ware co-design principles in its design and implementation. Additionally, this chapter describes how a developer and user can simulate the Sequential Codelet Model with SCMUlate. Based on SCMUlate, Chapter7 shows evaluation results and analysis on two different benchmarks: Vector Addition and Matrix Multiplication. These bench- marks are well known kernels that are particularly important for High Performance Computing computation, both in data sciences and scientific code. They have been selected thanks to the experience gather when contributing and working with ECP

14 SOLLVE applications (an important project for the future of HPC and Exascale com- puting) during the development of the OpenMP offloading Verification and Validation testsuite [21][22]. “Even when standardization and initial implementations of OpenMP features are complete, the SOLLVE project’s efforts continue. With an ever-growing OpenMP validation and verification test suite, SOLLVE monitors how well compil- ers and runtime systems support OpenMP across pre-exascale and exascale test bed systems. This provides guidance to the implementers and facilities and ensures the portability of OpenMP with regard to compilers and systems” – as recently reported in [23]. Finally, this thesis presents some related work in Chapter8, it discusses future work in Chapter9, and provides conclusions in Chapter 10.

15 Chapter 2

BACKGROUND

In the quest of creating a program execution model, it is necessary to have a holistic view of computer systems from all different layers. Computer systems are a myriad of components and layers that interact with each other. The background needed for each of these layers can comprehend several books and requires extensive knowledge. However, in this chapter I have selected some of the elements that I have found most important for this work. As the title suggests, this section is divided in two parts. First the Founda- tion Section 2.1 that contains fundamental concepts of models of computation. We introduce the Turing Machine, Universal Turing Machine, Von Neumann architecture, dataflow models of comptutaion and others. Second, Section 2.2 summarizes the origi- nal Codelet Model, its components and elements. This model is the predecessor of the Sequential Codelet Model. These two sections aim to cover all the different areas that one way or another inspired this work.

2.1 Foundation Following are some of the most important and influential concepts for this work.

2.1.1 The Universal Turing Machine The most important model of computer systems is the Universal Turing Ma- chine. This mathematical model is the foundation of most computer systems that are used nowadays. Alan Turing (1912 - 1954) was an English mathematician (among other expertise) that is recognized as the founder of theoretical and artifi- cial intelligence. Turing originally described his machine in [2]. This paper’s objective

16 was not to define the machine, but the machine was an utensil to demonstrate that the Entscheidungsproblem had no solution. While this conclusion was also reached by Alonso Church [24, 25], Turing’s work had a different approach that involved the definition of a theoretical computing machine which was unique of its own. At a high level, Turing Paper has four major contributions: 1) The definition of the Turing Machine, 2) The definition of computable numbers, 3) the definition of the Universal Turing Machine, and 4) the demonstration that the Entscheidungsproblem has no solution by using the other three contributions. This work mainly focuses on the Turing Machine and the Universal Turing Machine. For a more comprehensive understanding of the whole paper, it is worth visiting [26].

Turing Machine The Turing Machine is inspired by the process a person calculating a real number with a pen and a piece of paper. When a human is calculating a number, he or she knows the set of steps or rules that need to be applied according to the values that are read from the piece of paper. Then, the person will move along the paper, read the values from the paper and write new values to the paper. The Turing Machine is an automatic process that does not require external intervention. It is used to calculate real numbers by only following a set of steps. The Turing parts from three elements that are parallel to the resources used in the human process described above: 1) A tape that stores information in the form of symbols (Paper). 2) A head that reads and writes into the tape (Pen), and 3) a set of rules (ordered steps), that determine the position in the tape as well as what to write into the tape according to what is read from the tape (The in the mind of the person). These rules are referred to as configuration states or m-configurations of the machine, and a given machine can only have a finite number of states. Therefore these rules reassemble a Finite State Machine. Figure 2.1 shows the different components of the Turing Machine. The process is simple: The tape is easily represented as a one dimensional infinite tape divided into squares and capable of storing symbols. The head is located


Figure 2.1: Turing Machine. There are four different states, S1 through S4. The tape shows the two different types of squares: E and S.

The head is located at a given position of the tape. A symbol (or the lack of one) is read from the current position and, based on the current m-configuration, a set of operations is performed in order. These operations are a combination of: moving the head left or right, printing a new value in the current position, or erasing the value from the current position. After the operations are performed, the machine goes to a different m-configuration state. Notice that, initially, there is no limit on the number of operations that can be performed for a given state. While this is true for the Turing Machine, this changes in the Universal Turing Machine, as will be explained later in this section. The symbols written by the machine can be of any kind. However, due to the interest in numbers, there are two kinds of symbols: numeric symbols that represent the value being computed, and non-numeric symbols that are used to mark properties of the computation. The latter do not aim to represent values, but to mark additional information at given positions of the tape (e.g. positions that have already been visited). Therefore, the tape contains one type of square, called Figures or F-Squares, that stores the numeric values of the number being calculated; and another type of square, called Erasable or E-Squares, that contains the non-numeric symbols that mark information about the F-Squares.

m-config   symbol   s-operations   next m-config
b          None     P 0            b
b          0        R, R, P 1      b
b          1        R, R, P 0      b

Figure 2.2: Example of an m-configuration table of the Turing Machine

In Turing's words, these squares are "notes to assist the memory". The location of E-Squares and F-Squares is not important, as long as it is possible to pair one with the other. Turing decided to alternate them (e.g. EFEF ...), assigning to each F-Square the E-Square to its left. The need for more E-Squares to mark F-Squares can be solved by introducing more symbols to represent different information. Figure 2.1 shows the two different squares in green and red, as well as with the subscripts E and S. To represent the Finite State Machine, Turing used the concept of m-configuration tables. These contain four columns. The first column contains the name of the m-configuration. The second column contains the symbol read from the tape, which determines what operations to perform. The third column contains the operations to perform. Four different operations are used here: P X for printing the symbol X, R/L for moving the head one square to the right or left respectively, and E for erasing the content of the square. The fourth column contains the m-configuration the machine transitions to next. Figure 2.2 contains an example of an m-configuration table that would print "0 1 0 1 0 1 0 1 0 ..." on the tape. If we consider that this sequence represents the decimals of a binary number written as 0b0.01010101..., it is equivalent to the number 1/3 in decimal.
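To make the mechanics of an m-configuration table concrete, the following is a minimal C sketch of a table-driven interpreter for the single-state machine of Figure 2.2. It is an illustration only: the fixed tape window, the symbol encoding, and the step bound are implementation choices of the sketch, not part of Turing's formulation, which assumes an unbounded tape.

```c
#include <stdio.h>

#define BLANK    ' '
#define TAPE_LEN 64   /* a finite window of the (conceptually infinite) tape */

int main(void) {
    char tape[TAPE_LEN];
    for (int i = 0; i < TAPE_LEN; i++) tape[i] = BLANK;

    int head = 0;
    /* Single m-configuration "b" from Figure 2.2:
     *   None -> P 0        -> b
     *   0    -> R, R, P 1  -> b
     *   1    -> R, R, P 0  -> b
     * A fixed step count keeps the head inside the finite tape window. */
    for (int step = 0; step < 20 && head + 2 < TAPE_LEN; step++) {
        char sym = tape[head];
        if (sym == BLANK) {
            tape[head] = '0';                 /* P 0       */
        } else if (sym == '0') {
            head += 2; tape[head] = '1';      /* R, R, P 1 */
        } else {                              /* sym == '1' */
            head += 2; tape[head] = '0';      /* R, R, P 0 */
        }
    }

    /* F-squares (even positions) now spell 0 1 0 1 ..., the binary expansion of 1/3. */
    for (int i = 0; i < TAPE_LEN; i++) putchar(tape[i] == BLANK ? '_' : tape[i]);
    putchar('\n');
    return 0;
}
```

Running the sketch prints 0_1_0_1_... with the E-Squares left blank, i.e. the binary expansion of 1/3 described above.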

Example of implementing copy on a Turing Machine

Let us consider the Turing Machine with the m-configuration table of Figure 2.3. This machine copies a sequence of symbols from the beginning of the sequence to the blank spaces at the end of the sequence. We will call it copy. It uses the binary numeric symbols 0 and 1, since binary numbers are easier to work with than decimal numbers (see [26] for an example of a Turing Machine using decimal symbols).

m-config   symbol    s-operations   next m-config
begin      Not @     L              begin
begin      @         R              copy
copy       BLANK     P x, R         copy
copy       0         R, R           copy0
copy       1         R, R           copy1
copy       β         E              following
copy0      0 or 1    R, R           copy0
copy0      BLANK     P 0, L         removex
copy1      0 or 1    R, R           copy1
copy1      BLANK     P 1, L         removex
removex    Not x     L, L           removex
removex    x         E, R, R        copy

Figure 2.3: Example of the m-configuration table of a copy Turing Machine

It also uses three non-numeric symbols: @, β, and x. The purpose of the @ symbol is to mark the beginning of the sequence on the tape, and it was originally used by Turing in his paper. The β symbol represents the end of the sequence to be copied, as well as the beginning of the resulting sequence. The x symbol is used to mark the numeric symbol currently being copied. Let us assume that the current total configuration of the machine contains the initial sequence as well as the β symbol at the end of the sequence, as such:

@ @ 0 1 1 0 β

This grid represents a segment of the infinite tape. The head location is represented by the thick lines surrounding a square. Let us start execution at the begin

state. This state will take the head to the left-most blank square after the @, and it will call the copy state. The first execution of the copy state will find a BLANK square, therefore printing an x and moving right. After this we end up with the following tape configuration:

@ @ x 0 1 1 0 β

The copy state is executed again, but this time the symbol read is different. We have already marked the symbol to be copied with an x, allowing us to keep track of which symbol we are currently copying. Next, we start performing the copy of the 0 that is currently under the head. Notice that I use the state copy0, which gives us some sort of memory of the symbol we are currently copying. After finishing this state we are two squares further right from the symbol we are copying, resulting in the following machine configuration, and our next state is copy0.

@ @ x 0 1 1 0 β

The copy0 m-configuration is now being executed. This state first finds the left-most numeric F-square that does not have a symbol in it. To do this, the state continuously calls itself while the value is not BLANK. Then, when the empty square is found, the head prints the value 0 in it, followed by moving the head one position to the left and changing to the next state, removex. After several iterations of the copy0 configuration, and right before changing states, we end up with the following tape:

@ @ x 0 1 1 0 β 0

We now move to the removex m-configuration. Notice that our head is currently located on an E-Square. We first need to find the x to be removed by iterating over the E-Squares to the left (i.e. L, L) until we find the x symbol again. This is achieved by repeatedly calling the removex state until the value under the head is an

x. At this point we can remove the current x and move two squares to the right (i.e. E, R, R). We have finished copying a whole symbol, and we are ready to start the next copying process.

@ @ 0 1 1 0 β 0

Next, we reach the copy state again. The value under the head is a BLANK once more, so we print an x symbol, marking the second digit of our sequence, and start a new cycle. The configuration of the tape right before the new cycle is as follows.

@ @ 0 x 1 1 0 β 0

After going over the whole sequence and having copied all the digits, we are back at the copy configuration. However, this time we will not find a BLANK; instead we read a β symbol. As a result, we have now concluded the copy process, and we are ready to move on to the next state, namely following in the m-configuration table. To do this, we first make sure to remove the β symbol. This is the final snapshot of the machine right before moving to the following state. The head is currently where the β symbol used to be.

@ @ 0 1 1 0 0 1 1 0

This concludes the execution of the copy operation on a Turing Machine.

Skeleton m-configuration Tables

When representing m-configuration tables, one easily encounters repeating patterns of m-configurations that contain similar operations (instructions), but different symbols or following m-configurations. Turing defined in his paper the concept of a "Skeleton Table", which is somewhat similar (but not completely equivalent) to a subroutine in current programming languages. The idea is to represent

an m-configuration that has parameters that work as variables in the configuration table. The parameters can be either numeric and non-numeric symbols, or other m-configurations. Going back to the example in Figure 2.3, we notice that the copy0 and copy1 m-configurations are exactly the same, except for the symbol that is printed as well as the following state to be called. By using skeleton tables it is possible to re-write this m-configuration table as seen in Figure 2.4.

m-config    symbol    s-operations   next m-config
copyS(S)    0 or 1    R, R           copyS(S)
copyS(S)    BLANK     P S, L         removex

Figure 2.4: Using skeleton tables to represent the copy0 and copy1 m-configurations as a single table

m-config          symbol   s-operations   next m-config
copy(following)   BLANK    P x, R         copy
copy(following)   0        R, R           copy0
copy(following)   1        R, R           copy1
copy(following)   β        E              following

Figure 2.5: Using skeleton Tables for representing the copy m-configuration table allowing different following states.

The S can be seen as a numeric symbol that is a parameter of the newly created copyS table. The P S instruction will actually print the value that was passed to the copyS configuration. Likewise, copyS is recursively called with the same symbol until the BLANK space is found.

Furthermore, imagine multiple uses of the copy m-configuration, where the following state corresponds to different continuation states. Without skeleton tables we would have to duplicate the copy state over and over, using multiple names and only changing the state that replaces the following placeholder. Instead, it is better to have a single version of the copy configuration that receives the continuation state as a parameter, as in Figure 2.5. By putting together the two elements of Figures 2.4 and 2.5, we can re-write the m-configuration table of Figure 2.3 as seen in Figure 2.6.

m-config               symbol    s-operations   next m-config
copy(following)        BLANK     P x, R         copy(following)
copy(following)        0         R, R           copyS(0, following)
copy(following)        1         R, R           copyS(1, following)
copy(following)        β         E              following
copyS(S, following)    0 or 1    R, R           copyS(S, following)
copyS(S, following)    BLANK     P S, L         removex(following)
removex(following)     Not x     L, L           removex(following)
removex(following)     x         E, R, R        copy(following)

Figure 2.6: Using skeleton tables to re-write the m-configuration table of Figure 2.3.

It is necessary to pass along the following state in order to connect the inner calls of the copy configuration with the original copy configuration. On the other hand, even though it looks like an infinite recursion, it is important to understand that skeleton tables are not meant to be resolved at "runtime" by the Turing Machine. Instead, they are meant to ease readability of the whole Finite State Machine. As a consequence, all the different states (m-configurations) that exist as a result of unrolling skeleton tables must be known, and the number of states must still remain finite.
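One way to picture skeleton tables is as table generators that are fully expanded before the machine ever runs. The C sketch below is a loose analogy, not part of Turing's formalism: a helper expands copyS(S, following) for a concrete symbol and continuation into ordinary rows, so that after expansion the set of m-configurations is fixed and finite. The state-naming scheme (copy0_following, and so on) is an illustrative choice.

```c
#include <stdio.h>

/* One concrete row of an m-configuration table. */
struct Row { char mconfig[32], symbol[16], ops[32], next[32]; };

static struct Row table[16];
static int nrows = 0;

/* Expanding the skeleton copyS(S, following) for a concrete symbol S and a
 * concrete continuation state produces two ordinary rows of the table.      */
static void expand_copyS(char S, const char *following) {
    struct Row *r = &table[nrows++];
    snprintf(r->mconfig, sizeof r->mconfig, "copy%c_%s", S, following);
    snprintf(r->symbol,  sizeof r->symbol,  "0 or 1");
    snprintf(r->ops,     sizeof r->ops,     "R, R");
    snprintf(r->next,    sizeof r->next,    "copy%c_%s", S, following);

    r = &table[nrows++];
    snprintf(r->mconfig, sizeof r->mconfig, "copy%c_%s", S, following);
    snprintf(r->symbol,  sizeof r->symbol,  "BLANK");
    snprintf(r->ops,     sizeof r->ops,     "P %c, L", S);
    snprintf(r->next,    sizeof r->next,    "removex_%s", following);
}

int main(void) {
    /* Expand the skeleton once per symbol/continuation used by the program.
     * After expansion the state set is fixed: nothing is resolved at run time. */
    expand_copyS('0', "following");
    expand_copyS('1', "following");

    printf("%-20s %-8s %-10s %s\n", "m-config", "symbol", "ops", "next");
    for (int i = 0; i < nrows; i++)
        printf("%-20s %-8s %-10s %s\n", table[i].mconfig, table[i].symbol,
               table[i].ops, table[i].next);
    return 0;
}
```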

The Universal Turing Machine

The importance of binary numbers is well known today. At the end of the day, computer programs are just long strings of ones and zeros that encode instructions for the system. This means that every program is a (fairly large) number. Such a realization is also a result of Turing's work. The examples presented so far do not limit the number of operations that are performed in each m-configuration. The Turing Machine was originally limited to a print or erase operation, followed by moving the head to the right, to the left, or not at all. But even Turing himself realized how long m-configuration tables would be if limited to just these operations. However, going back to the original restrictions, Turing could demonstrate that it is possible to build a machine that is able to execute other machines that are represented as numbers written on part of the tape. Before this is possible, however, one must be able to encode these machines into such numbers. Similar to Instruction Set Architectures, where bits are used to encode operations, Turing created a set of "instructions" built as standardized m-configurations.

m-config   symbol   s-operations   next m-config
qi         Sj       P Sk, L        qm              (N1)
qi         Sj       P Sk, R        qm              (N2)
qi         Sj       P Sk           qm              (N3)

Figure 2.7: Standard forms of states in the Turing Machine

Figure 2.7 shows the three different forms that any state can take in standardized form. The symbols Sj and Sk can be any of 0, 1, BLANK, or any other predetermined non-numeric symbol. The use of BLANK replaces the Erase operation used in previous examples. Finally, instructions in this standard form can be concatenated together into a single string. For example, it is possible to write N1 as q1SjPSkLqm;. If each part of this string is encoded as a number (or collection of numbers), then it is possible to write any machine as a number encoding its operation.

Programs of the Universal Turing Machine are therefore encoded in numeric representations of the standard forms (i.e. N1, N2 or N3). It is then possible to write a second machine that interprets this standard form in order to execute the different states. A detailed explanation of this process is available in [26].

Computable numbers

As shown before, a Turing Machine has the potential to output different numbers, each number represented by the sequence that results from the execution of the machine. Turing limited his study to the machines (and therefore sets of m-configurations) that do not stall during execution, resulting in machines that yield valid numbers (e.g. 0b0.010101000...). It is possible to observe that a given valid machine outputs a numeric value that is the representation of a real value. Furthermore, the set of possible states to be used in a Turing Machine must be finite, and the set of resources needed for the machine to output its number is also finite. It is possible to show that all possible Turing Machines that could be built under these restrictions are enumerable. Each machine outputs a single number. The set of all numbers that can result from all possible Turing Machines is then defined as the computable numbers. Since the possible Turing Machines are enumerable (see Turing's paper for details), the computable numbers form a countable set. The set of computable numbers does not contain all the real numbers; it is a subset of the real numbers. Furthermore, this set is larger than the set of rational numbers, including some irrational numbers such as π and e. However, most irrational numbers are non-computable. This finding is of great importance and led to the creation of the theory of computable numbers. However, the deep mathematical theory behind these numbers is outside the scope of this work.

2.1.2 Von Neumann Architecture

Von Neumann and Turing met each other on at least two occasions [26]: once when Von Neumann visited Cambridge in 1935, and again when Turing visited Princeton from 1936 through 1938. Their interests in the field of computation matched, and it is certain that Von Neumann was aware of Turing's work at the moment of working on this architectural model. The so-called Von Neumann architecture was initially described in a draft of the design of the EDVAC system written by John Von Neumann in 1945. Although Von Neumann is the sole author of this draft, the document was intended to be a design summary of the EDVAC project, which involved more people. The manuscript was never officially finished. Von Neumann sent a handwritten draft to Herman Goldstine, who distributed a typed version among the different members of the EDVAC team. However, due to the large interest it attracted, the document quickly spread across different countries before a final version was concluded. Other than the architecture itself, the document discusses design trade-offs that were made in favor of a more general purpose computer. Among them are the use of binary numbers, the selection of vacuum tubes over mechanical means, and the description of the basic mathematical operations needed for general purpose computing, as well as the different target applications the system could perform. While these decisions were not original at the time, they were an important part of the design and to this day they are an essential part of computer systems. First, binary numbers are used because they allow for simpler implementation of arithmetic circuits with higher switching speeds. Although using binary numbers requires more operations (in comparison to other numerical systems), it leads to simpler circuits capable of achieving faster execution times. Consequently, the components used for building the system must be able to operate at high frequencies as well.1 Second, instead of supporting complex arithmetic operations that require larger

circuits, systems often expose only simple operations (e.g. add, sub, and mul). It is still possible to implement complex operations by means of simple arithmetic (or at least fair numerical approximations). After all, Turing's paper had already proven this with the set of computable numbers.

The Von Neumann Machine describes an organization of a system that implements the Universal Turing Machine. As can be seen in Figure 2.8, the Von Neumann machine is comprised of three major parts: a Central Processing Unit (CPU), a memory storage, and input/output interfaces to the external world.

Figure 2.8: Von Neumann Machine Architecture

Memory represents the tape of the Universal Turing Machine, and the CPU represents the head. Program instructions are usually stored as part of memory, in what is known as a stored-program computer, a concept that pre-dates the EDVAC system and resembles the Universal Turing Machine. Memory also stores program data. On the other hand, the CPU has two parts: 1) the Control Unit, which decides what instruction to execute next (e.g. resolving branches) and schedules it, and 2) the Arithmetic Logic Unit (ALU), which contains all the circuits that perform arithmetic operations. Therefore, the Control Unit schedules instructions for execution in the ALU.

1 At the time of the EDVAC system, mechanical switches were available and widely used, but they were too slow. The newly available vacuum tubes had faster response times, making them ideal. Later on, tubes were replaced with silicon transistors with higher frequencies, smaller sizes, and better power performance.

With the exception of loading the initial tape configuration and reading the result from the tape, the Universal Turing Machine is a closed system with no interaction with the outer world. In contrast, the I/O interfaces of the Von Neumann system have an important role in the interaction with users and other peripheral systems outside of the machine, even during the execution of the program. The I/O allows the system to communicate with the exterior world in different ways: like in the UTM, I/O interfaces allow the system to receive input configurations (programs and data) and produce output results. But in current systems based on Von Neumann, I/O also has an important role throughout the execution of the program, for example fetching information from the outside world (e.g. the internet), as well as communicating with many other hardware systems and peripherals that change the execution of the program. It is important to coordinate how I/O signals interact (or avoid interfering) with the current program execution, especially for the sake of deterministic behavior.

The Von Neumann architecture benefits from sequential execution of program instructions. The program counter determines the currently executing instruction. Some instructions, referred to as control flow instructions, explicitly determine the next instruction to be executed at a different location of the program by modifying the program counter. For other instruction types, the program always follows the next instruction in the stream, increasing the program counter by one. In the original model, an instruction should not start before the previous instruction finishes. Thanks to this set of rules, there is no need to explicitly determine continuation instructions in the encoding, nor to have a matching mechanism between instructions. Producer-consumer data dependencies between instructions are respected by the order of execution.

An important observation is that memory is monolithic in the Von Neumann architecture. However, architects are required to create hierarchical memory structures to support larger sizes and limit memory access latency. At the first level, next to the CPU and inside the core's architecture, a register file contains several memory locations with a static naming convention. Each register is assigned a name and a predefined length and role (e.g. general purpose or configuration registers). Registers are close enough to the ALU that register access time from the ALU is low in

comparison with the ALU's operation latency. Nevertheless, register file size is limited by die area and other architectural design trade-offs. Two or more levels, each with a larger memory capacity, can be built outside of the core. The most common memory architecture features a large and slow main memory (i.e. DRAM) directly connected to the CPU, and a hierarchy of memory caches in between that exploit temporal and spatial locality of memory accesses. Memory outside of the core is often organized in fixed size chunks known as "words" that are independently distinguished by an address. Although there are many different possibilities, flat memory spaces are often used to respect the original concept of a monolithic memory space in the Von Neumann architecture. The aforementioned memory structure is a key aspect to achieving high performance in Von Neumann based systems. As explained by Turing himself [27]:

(...) I believe that the provision of proper storage is the key to the problem of the digital computer, and certainly if they are to be persuaded to show any sort of genuine intelligence much larger capacities than are yet available must be provided. In my opinion this problem of making a large memory available at reasonably short notice is much more important than that of doing operations such as multiplication at high speed. (...)

However, most of the properties that are essential to the Von Neumann architecture present difficulties when used in parallel systems. The following sections cover parallel designs that are based on Von Neumann.

2.1.3 Multithreading Computation and Multi-Core Systems

In a nutshell, a thread is a sequence of instructions that can be scheduled onto hardware. Threads are often described by their current execution point, also known as the Program Counter (or just PC). From the perspective of the Von Neumann machine, a thread corresponds to the sequence of instructions in memory that make up the executing program, as well as the current program counter (i.e. the current instruction) inside the Control Unit. In a computer, the Operating System (OS) allows multiple programs to concurrently use the same hardware. In order to achieve this, the OS uses multiple processes,

each containing software-implemented threads. A process can be seen as an independent virtual Von Neumann machine: each process has a CPU (i.e. a software thread) and a memory that is private. The OS uses virtual memory [28][29][30] to allow each process to completely own its memory, further encapsulating the idea of a virtual Von Neumann machine. Additionally, I/O is solely managed by the OS and its drivers, allowing each process to communicate with the hardware resources independently, without requiring much coordination with other processes. The OS implements a scheduler that grants processes access to the CPU (and other hardware resources) at different times. When a process is scheduled, the software thread of the process is "connected" to the hardware thread of the CPU processor, thus progressing the execution of that program. Figure 2.9 depicts this organization, and also shows an execution trace of the concurrent use of the CPU by all the processes: P1, P2, and P3. Notice the similarities between the process structure and the Von Neumann machine.

Multithreading is the ability to have multiple threads running at the same time. From the perspective of the OS, software multithreading allows a process to create and manage multiple threads. From the perspective of the CPU, hardware multithreading allows the hardware to have multiple threads sharing the same memory and resources. Hardware multithreading is achieved either by having multiple independent CPUs (i.e. multi-core systems) connected through some network, or by allowing a single core to have multiple control units (i.e. Simultaneous Multi-Threading or SMT) while sharing the ALU and other core resources. Note that it is not necessary to have hardware multithreading to support software multithreading. However, it is necessary to have OS support for software multithreading to allow a process to request multiple threads. Moreover, it is possible to request more software threads than hardware threads; this is usually referred to as oversubscription of threads.

POSIX Threads (or just PThreads), the most common implementation of software threads, is a language-independent standard that defines software threads. The standard determines the different components of a thread (e.g. memory segments, stack pointer, and program counter, to mention some) as well as the API for creation of and interaction with threads (e.g. pthread_create() and pthread_exit()).

Figure 2.9: Role of the Operating System in concurrency: Each process is a virtual Von Neumann machine. The OS maps each process memory to physical memory through the use of virtual memory. The OS scheduler concurrently maps the SW threads to HW threads through context switching.

Software multithreading usually allows threads to have local private memory associated with each thread. Communication between local memory and shared process memory requires explicit memory copies.

Multithreading modifies the original Von Neumann architecture. Figure 2.10 depicts the most common multithreaded Von Neumann architecture. The most important aspect is the multiple CPUs located side by side around memory. Von Neumann based multithreaded systems consider memory to be a shared resource. Other systems also include private local memory associated with each CPU. It is possible to think of the register file as a special case of private memory.

The flat memory address space used in sequential computers creates a monolithic memory view. However, real systems have a distributed memory system composed of caches, register files, and multiple external memories.

Figure 2.10: Diagram of a multithreaded Von Neumann machine.

Consequently, for a single memory reference (i.e. address) in the flat address space, it is possible to have data replication and multiple versions in different physical memory locations. In the case of caches, for example, data replication is essential for performance. In multithreading systems based on Von Neumann, memory consistency models [31] provide rules that determine the behavior of multiple memory accesses (reads and writes) for a given memory reference (address). That is, for a single address that could map to multiple physical locations, the set of values that are admitted to co-exist as valid in the different locations is determined by the consistency model. For example, the value of a variable X may be stored in the register file of core 1 and the register file of core 2. A memory consistency model may determine this to be valid behavior, or it may restrict that only a single element is valid. Sequential consistency [17] is perhaps the most important consistency model of all. It determines that at any point in time there is only a single valid value per memory reference. Furthermore, it enforces the idea of a monolithic memory with a single port that can only process one transaction at a time. Thus, changes to memory occur in a sequential order. Cache coherency is a mechanism that maintains sequential consistency when using caches. There are many cache coherency protocols [32][33].

There are two major issues with multithreading (Von Neumann based) parallelism, as presented by Arvind et al. in [34]:

• Memory references often require long idle times in the control flow of instructions, resulting in loss of overall performance.

• Required synchronization between workers results in frequent context switching between threads, which is expensive and could potentially hurt performance considerably.

In addition to performance issues, multithreading parallelism requires the programmer to think of all possible side effects that may arise from the parallel execution of threads, as described by Edward Lee in "The Problem with Threads" [19]. The lack of synchronization between instructions executed in different CPUs means the user must explicitly define synchronization operations. In order to do this, a programmer must think of all possible interleavings of instructions in the different threads, and how they simultaneously change the memory state of the program. The difficulty of this process increases as the number of threads increases. Furthermore, if the system does not enforce sequential consistency, programming becomes a prohibitively complex process. Therefore, programmability in multithreading is a major issue.
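As a concrete illustration of this explicit synchronization burden, the following C sketch uses the POSIX Threads API mentioned above. Two threads increment a shared counter; without the mutex, their read-modify-write sequences may interleave and lose updates, and it is entirely the programmer's responsibility to foresee this and insert the lock. The counter and iteration count are arbitrary choices for the example.

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                       /* shared state (process memory) */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        /* Without this explicit lock, the read-modify-write below could
         * interleave with the other thread and lose increments.             */
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;                          /* two software threads */
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);        /* 2000000 with the lock */
    return 0;
}
```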

2.1.4 Dataflow Computation

The dataflow model of computation was originally defined by Karp and Miller in 1966 [35]. The first dataflow architecture was described by Jack Dennis in 1974 [36], together with its corresponding programming language [37]. Following these ideas, the 1970s and 1980s brought many computer architectures based on these principles [38][39][40][41][42][43]. The dataflow computation model describes a program in terms of a directed graph. Each node of the graph is an operation (e.g. +, -, and *), also known as an actor. Operations are described in terms of their mathematical function as well as the number and size of their inputs and outputs. Inputs and outputs of a node are connections of the graph, and they represent

data that travels from node to node as it gets transformed by the operations. Arcs are the equivalent of memory, and they carry tokens from one actor to the next. Thanks to the lack of a single locus of control, and the explicit synchronization defined by the arcs of the graph, dataflow representations have the potential of fully exposing the parallelism in programs.

In comparison with Von Neumann architectures, dataflow models of computation do not consider a program counter or a unique execution point in the program. Instead, actors (nodes in the graph) have a state that determines their execution. The state of an actor depends only on its inputs and firing rules. An actor is inactive or disabled if its input tokens are not available according to its firing rules. Once the firing conditions are met, the actor becomes enabled and is ready for execution. Upon scheduling for execution, the actor is considered fired, all input tokens are consumed, and new output tokens are produced for the downstream actors. Another important difference with the Von Neumann architecture is the lack of a globally accessible centralized memory. In the original dataflow model of computation, actors are only aware of their internal state and their input tokens. Therefore, memory locations for tokens are an essential part of the design of dataflow systems.
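The following C sketch illustrates the firing discipline in its simplest form, with one memory slot per input arc; it does not model any particular dataflow machine. An actor fires only when every input slot holds a token, consuming its inputs and producing an output token for its successor.

```c
#include <stdio.h>
#include <stdbool.h>

/* A two-input actor in a static dataflow graph: one memory slot per input arc. */
struct Actor {
    const char *name;
    int in[2];
    bool present[2];          /* firing rule: both input tokens must be present */
    int (*op)(int, int);      /* the operation performed when the actor fires   */
};

static int add(int a, int b) { return a + b; }

/* Deliver a token to one input arc; fire the actor if its rule is now satisfied.
 * Returns true and writes the produced token to *out when the actor fired.      */
static bool send_token(struct Actor *a, int port, int value, int *out) {
    a->in[port] = value;
    a->present[port] = true;
    if (a->present[0] && a->present[1]) {          /* enabled -> fire            */
        *out = a->op(a->in[0], a->in[1]);
        a->present[0] = a->present[1] = false;     /* tokens are consumed        */
        return true;
    }
    return false;                                  /* still waiting for tokens   */
}

int main(void) {
    struct Actor plus = { "+", {0, 0}, {false, false}, add };
    int result;
    send_token(&plus, 0, 3, &result);              /* only one token: no firing  */
    if (send_token(&plus, 1, 4, &result))          /* second token arrives       */
        printf("%s fired, produced token %d\n", plus.name, result);
    return 0;
}
```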

Dataflow Token Memory

In dataflow programs, connections between actors require a physical memory location to store tokens. Therefore, a mechanism to match new tokens to instructions (actors) is required. Some architectures encode token memory in the instruction itself, while others have a separate memory with a token matching mechanism. On the other hand, re-entrant dataflow code is a dataflow graph that may execute multiple times. Examples of re-entrant code are loops and programs with multiple activations. Depending on how the computation is mapped onto the physical hardware, special care needs to be taken in order to avoid deadlocks or overwriting already existing tokens. These mechanisms may include changes in the token description.

The first dataflow architectural model proposed by Dennis uses a static approach. Each arc is a static memory location assigned to the connection between two actors, and it can only store a single token. Therefore, the firing rules of the actors are modified: the output arc in a static dataflow model needs to be empty at the moment of firing, so that an output token can be generated. One solution to this problem is using FIFO queues to represent arcs. As long as there is enough space in the queue, and the graph is well behaved, it is possible to solve some of the aforementioned issues. Later dataflow architectures used a more dynamic approach. An arc is allowed to have multiple activations, as long as each activation has a different identifier or is stored in a different memory location (execution environment). Coloring tokens is a common approach, where a color is assigned to each activation of the graph. Firing rules are modified so that tokens of the same color must be available at the inputs of an actor in order for it to fire. A special actor is specified to create new colors before a new activation starts. In comparison to FIFO queues, colored tokens allow exploiting more parallelism in the application, as the execution does not have to satisfy a given order between activations.

Well-formed dataflow graphs

Determinism, determinacy and repeatability are important aspects of program execution [44]. Although not every program needs to be determinate, parallel computing and parallel programming often bring new challenges to program repeatability and correctness. In Von Neumann systems, the shared state of memory and the lack of explicit synchronization of workers easily allow for unexpected behavior in program executions. On the other hand, dataflow computation has an explicit synchronization scheme. However, not every graph forming a dataflow program yields a well behaved dataflow program. A well behaved dataflow graph guarantees that results are the same in every execution. Although the nodes of a dataflow graph may be evaluated in different orders, when considering the graph as a whole and the computation it performs, a well behaved dataflow graph always yields the same outcome.

Figure 2.11: Example of non well-behaved graphs ("sick" graphs). Issues highlighted in orange.

A well behaved dataflow graph is determinate [45][46][47][48]. The creation of dataflow graphs also depends on the definition of the firing rules and properties of the actors. Some models have opted to use non-deterministic switch and merge operators that, under re-entrant operation, may result in non-determinate graphs. Furthermore, the structure of the graph is also important to create well-behaved dataflow schemes. Some examples of "sick" dataflow graphs are presented in Figure 2.11, with the major issue of each highlighted in orange. Deadlock and hang-up happen when actors can never resolve their dependencies, as is the case for the actors X. A conflict occurs when the non-deterministic merge causes a conflict between the tokens of actors B and C. Finally, unclean graphs have dangling tokens left on arcs. In the example, depending on the evaluation of the switch, either A or B will have a dangling token after execution.

2.1.5 Hybrid Von Neumann/Dataflow Architectures

Dataflow and Von Neumann are two distinctive models of computation. They can be seen as two extremes of a continuous spectrum of possible models of computation. On one end, Von Neumann is driven by control flow operations and is inherently sequential. On the other end, dataflow relies on explicit synchronization in each operation and is heavily parallel. In between the two there exists a whole variety of hybrid models that combine properties of both worlds.

Several surveys have been written on Dataflow/Von Neumann hybrids [49][50][51][52]. There are multiple approaches to combining dataflow and Von Neumann. I describe three major approaches that are the most critical to this work.

2.1.5.1 Instruction Level Parallelism

To this day, Von Neumann based sequential systems have dominated the market. The use of Instruction Level Parallelism techniques, together with Moore's law [53] and Dennard scaling [54], are major reasons for the successful and long reign of sequential architectures. Instruction Level Parallelism (ILP) allows a sequential program to implicitly exploit parallelism, without user intervention or modifications to the code. Neither the Von Neumann abstraction of the system nor the semantics of the code is altered. ILP relies on dataflow properties of program execution. Like dataflow architectures, ILP uses data dependencies between instructions to allow overlapping of independent instructions amid hardware availability. In Von Neumann, data dependencies are maintained through memory references: when two instructions use the same memory in any of their operands, they are dependent. For architectures that rely on registers, dependencies can be easily discovered by following the register names and external memory references. There are three types of data dependencies, but only one of them is unavoidable. For a pair of instructions I1 and I2, where I1 is evaluated before I2 in the program stream, there is a data dependency if both instructions use the same memory reference. Output dependencies (i.e. Write After Write or just WAW) occur when both I1 and I2 write results to the same memory location (e.g. register). Anti-dependencies (i.e. Write After Read or just WAR) occur when I1 reads from a memory location (e.g. register) that I2 then writes to. Finally, true dependencies (i.e. Read After Write or just RAW) occur when I1 writes to a memory location (e.g. register) that I2 then reads from. In this case, instruction I1 is producing a value for I2; therefore, it is not possible to remove

this dependency. In addition to data dependencies, control dependencies arise when two instructions I1 and I2 require the same hardware to perform their operation, regardless of the operands; for example, two multiply instructions in the presence of a single multiplier in the ALU. To achieve ILP, the hardware accesses a window of instructions around the program counter at runtime. Specialized circuitry is used to discover dependencies within those instructions. Special care needs to be taken when communicating with memory: regardless of the order in which instructions are executed in the hardware, outside communication must remain in the original order in which the program is described. There are several techniques, but here I focus on three: superscalar architectures, register renaming, and out-of-order execution. These techniques are not mutually exclusive, and are often used together.
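The sketch below illustrates, in C, what the dependency-discovery step conceptually does over a small window of three-operand register instructions. It is a software illustration of the idea, not a description of the specialized circuitry used in real pipelines, and the instruction window itself is an arbitrary example.

```c
#include <stdio.h>

/* A three-operand register instruction: dst <- op(src1, src2). */
struct Instr { const char *op; int dst, src1, src2; };

int main(void) {
    /* An illustrative five-instruction window (registers by number). */
    struct Instr win[] = {
        {"MUL", 1, 2, 2}, {"ADD", 2, 3, 3}, {"SUB", 1, 4, 4},
        {"MUL", 5, 1, 2}, {"DIV", 6, 7, 8},
    };
    int n = sizeof win / sizeof win[0];

    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            /* RAW (true): a later instruction reads what an earlier one writes. */
            if (win[j].src1 == win[i].dst || win[j].src2 == win[i].dst)
                printf("RAW: I%d -> I%d on R%d\n", i + 1, j + 1, win[i].dst);
            /* WAR (anti): a later instruction writes what an earlier one reads. */
            if (win[j].dst == win[i].src1 || win[j].dst == win[i].src2)
                printf("WAR: I%d -> I%d on R%d\n", i + 1, j + 1, win[j].dst);
            /* WAW (output): both instructions write the same register.          */
            if (win[j].dst == win[i].dst)
                printf("WAW: I%d -> I%d on R%d\n", i + 1, j + 1, win[i].dst);
        }
    }
    return 0;
}
```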

Superscalar

Superscalar architectures focus on eliminating control dependencies, while being aware of data dependencies but not resolving them. The basic idea behind superscalar architectures is to increase the number of arithmetic logic units within the core [4][55], allowing multiple instructions that need the same type of operational unit to be issued together. The order in which instructions are fetched and issued is respected; if a dependency cannot be satisfied, the pipeline is stalled.

1 MUL R1, R1, R1
2 MUL R2, R2, R2
3 MUL R3, R3, R3
4 MUL R3, R1, R2
5 MUL R4, R3, R1

Listing 2.1: Example code for Superscalar ILP

As an example, Listing 2.1 shows an instruction stream that can exploit ILP on a superscalar architecture. The multiplication operations of lines 1, 2 and 3 are data independent of each other, but they are control dependent because they use the same ALU operation, MUL. A superscalar architecture allows all three instructions to be issued for execution at the same time on different ALU units. On the other hand, instruction 4 is dependent on instructions 1, 2 and 3. Therefore, a superscalar system will stall execution until the first three instructions are committed. Likewise, instruction 5 will need to wait for instruction 4.

Register Renaming

Register renaming aims to suppress anti- and output dependencies. Register renaming is most often applied to memory locations in the register file, hence its name. In a nutshell, register renaming uses a second register file hidden from the programmer. When an anti- or output dependency is encountered, the register renaming mechanism swaps the conflicting operand's register name for a free register in the hidden file. The mechanism stores the mapping between the original register name and the assigned register name, allowing future true dependencies to be assigned the right register name.

1 MUL R1, R2, R2
2 ADD R2, R3, R3
3 SUB R1, R4, R4
4 MUL R5, R1, R2
5 DIV R6, R7, R8

Listing 2.2: Example code for Register Renaming ILP

The code in Listing 2.2 shows an example that can exploit register renaming. The instruction in line 2 has an anti-dependency with line 1. Likewise, line 3 has an output dependency with line 1. Under register renaming, all three instructions 1-3 are able to execute at the same time. Register R2 of line 2 will be renamed R02, and register R1 of line 3 will be renamed R01. When the instruction in line 4 is issued, the register

renaming mechanism will replace the references R1 and R2 with R01 and R02 respectively, resulting in the equivalent line MUL R5, R01, R02. Register renaming will stall the execution of line 4 until instructions 2 and 3 have committed their results. However, register renaming alone will not allow execution of the instruction in line 5 until line 4 has finished, despite it having no dependencies whatsoever with any of the other instructions. Runtime register renaming is equivalent to eliminating anti- and output dependencies during code generation. There are many different implementations of register renaming, for example tag-indexed register files [56], reorder buffers, and reservation stations [57], to mention some.
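The following C sketch illustrates the renaming idea on the stream of Listing 2.2. It follows the common simplification in which every destination receives a fresh physical register (slightly different from the two-register-file description above, but equivalent in effect), and it omits the recycling of freed physical registers that a real implementation such as a reorder buffer or reservation stations would need.

```c
#include <stdio.h>

#define NUM_ARCH 16
#define NUM_PHYS 64

static int map[NUM_ARCH];     /* architectural register -> current physical register */
static int next_phys = 0;

/* Rename one instruction "dst <- op(src1, src2)" (registers by number).
 * Sources are read through the current map; the destination gets a fresh
 * physical register, which removes WAR and WAW conflicts on dst.             */
static void rename_instr(const char *op, int dst, int src1, int src2) {
    int p1 = map[src1], p2 = map[src2];
    int pd = next_phys++ % NUM_PHYS;   /* a real design recycles freed registers */
    map[dst] = pd;
    printf("%s P%d, P%d, P%d\n", op, pd, p1, p2);
}

int main(void) {
    for (int r = 0; r < NUM_ARCH; r++) map[r] = next_phys++;  /* initial mapping */

    /* The instruction stream of Listing 2.2. */
    rename_instr("MUL", 1, 2, 2);
    rename_instr("ADD", 2, 3, 3);   /* WAR with line 1 disappears: new physical R2 */
    rename_instr("SUB", 1, 4, 4);   /* WAW with line 1 disappears: new physical R1 */
    rename_instr("MUL", 5, 1, 2);   /* true (RAW) dependencies are preserved       */
    rename_instr("DIV", 6, 7, 8);
    return 0;
}
```

The output shows that lines 1-3 write three distinct physical registers (so the anti- and output dependencies disappear), while line 4 reads exactly the physical registers produced by lines 2 and 3, preserving the true dependencies.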

Out of order

Out-of-order (OoO) execution aims to avoid stalls in the execution pipeline while exploiting instruction parallelism. Out-of-order execution relies heavily on register renaming and superscalar architectures. However, contrary to the other two methods, instructions are not necessarily scheduled in order. Consequently, the execution pipeline is not stalled if an instruction is not ready for execution. Two important examples are the scoreboard algorithm, as used in the CDC 6600 [55], and Tomasulo's algorithm [57]. OoO engines use a window of instructions around the program counter within which execution follows a dataflow model, as previously described. In contrast with the original dataflow model of execution, OoO only enforces dataflow execution within the window of instructions, while the overall program still relies on the Von Neumann model abstraction. Within the window, instructions are scheduled as their dependencies are satisfied, while techniques similar to register renaming and superscalar execution increase the available parallelism. OoO relies on two actions to retain the Von Neumann abstraction at the program level while using dataflow: first, data dependencies between instructions are discovered and respected throughout the execution; and second, any instruction that communicates with the outer world of the CPU (including memory) is committed in the order it appears in the original instruction stream.

Figure 2.12: Pipeline of an Out of Order architecture with Superscalar and Register renaming.

Referring to the example in Listing 2.2, an OoO execution engine will allow instructions 1, 2, 3 and 5 to execute in parallel, while instruction 4 waits for 1, 2, and 3. A system that uses all three ILP techniques can be seen in figure 2.12.

2.1.5.2 Tasking

ILP aims to maintain the Von Neumann abstraction and use dataflow techniques throughout the execution of the instruction stream inside the core. Tasking, on the other hand, takes the reverse approach. Let us define tasking as the representation of programs with a structure similar to the dataflow model of computation. That is, programs are represented as directed graphs. Nodes of the program are operations and arcs between nodes are data and control dependencies. Contrary to the nodes of a dataflow architecture, tasking nodes (i.e. tasks) represent coarser-grained sets of operations, with multiple low-level instructions per task. Tasking may be implemented in software (e.g. [58][59][60][61]) or hardware (e.g. [62][63][64][65]). There are different possible ways to mix the Von Neumann and dataflow models in the form of tasks. More often than not, tasking models use a Von Neumann execution

model inside the task description. That is, tasks are sets of instructions that execute sequentially. Under this model, tasks are scheduled by some runtime across the different cores of a multiprocessor. Instructions inside a task are issued in sequence by a program counter. A dependency mechanism in the runtime handles the communication across tasks, and a memory model describes how memory is seen inside and across tasks. In order to map the dataflow execution to a Von Neumann machine, the tokens that communicate between tasks need to be stored in the shared memory system of the multithreaded system (Section 2.1.3). A memory model must be defined that determines what memory locations may be accessed inside tasks and across tasks, as well as who has access to those elements, and how shared accesses are coordinated. Other approaches use a Von Neumann description of groups of instructions that are executed in a dataflow style; that is, a dataflow graph represented inside a procedure. A program uses procedures in a stream of instructions that is evaluated sequentially, and the execution of each procedure occurs by means of the dataflow model. Regardless of the mapping between Von Neumann and dataflow, memory models are a critical aspect of the design of hybrid approaches. Some models stick to the shared memory Von Neumann model inside and across tasks. The user is in charge of guaranteeing that, under parallel execution of tasks, memory references that are accessed do not overlap across tasks with no dependencies. For example, if A and B are two tasks that do not have any dependencies and may be executed in parallel, the user must guarantee that the memory references inside A and B do not overlap. Otherwise, data races may occur. Some programming models distinguish data that is private to the task from data that is shared among tasks. Tasks are often described as functions (or procedures); therefore, private data for a task is often the stack of the executing function. A challenge that comes from the use of procedures is that Von Neumann based programming models commonly distinguish between stack and heap memory across the different threads. Hybrid models using shared memory often struggle with memory management, requiring contexts to be delimited in which tasks exist [58]. An example of this issue is when a new task B is created out of task A, and task B references

private memory from task A that is stored in the stack of the executing thread. When task A ceases execution before task B, the locations referenced by task B will likely no longer be valid. Other models use specialized hardware or software-implemented schemes to provide synchronization mechanisms between producers and consumers at the memory access level. Futures [66] and split-phase transactions [67][68][69] allow reads and writes of memory to occur in any order, and use the write operation to signal the consumer. Under these models, memory is considered to be write-once, or a full/empty bit mechanism is needed to allow re-writing to memory. Yet another technique groups multiple tasks together into execution frames (e.g. threaded procedures [38], or object memory in object oriented programming [63]). The grouping of instructions allows for exploitation of locality. To organize communication between contexts, it is required to guarantee allocation of resources. In the context of parallel execution of contexts, a common approach is to use asynchronous threaded procedures called from within other frames. They rely on a cactus-like stack memory structure [70]. When a new frame is spawned, a new stack is created. Since multiple frames can be spawned in parallel, multiple stacks will reside at the same logical level, forming the cactus structure. To synchronize between frames, split-phase-like transactions can be used as futures to continue computation in the spawning frame.
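As one concrete software realization of the tasking idea, the sketch below uses OpenMP tasks with depend clauses (assuming a C compiler with OpenMP support); OpenMP is only one of the possible tasking runtimes referenced above. The runtime derives the task graph from the declared in/out sets and may run the two independent producer tasks in parallel, while the programmer remains responsible for not touching shared data that is not declared in a dependence.

```c
#include <stdio.h>

int main(void) {
    int a = 0, b = 0, c = 0;

    #pragma omp parallel
    #pragma omp single
    {
        /* Two independent producer tasks: the runtime may run them in parallel. */
        #pragma omp task depend(out: a)
        a = 1;

        #pragma omp task depend(out: b)
        b = 2;

        /* Consumer task: becomes ready only after both producers have completed. */
        #pragma omp task depend(in: a, b) depend(out: c)
        c = a + b;

        #pragma omp taskwait
        printf("c = %d\n", c);   /* prints c = 3 */
    }
    return 0;
}
```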

2.1.6 Instruction Set Architectures and Program Execution Models

One additional key element that was introduced with the evolution of general purpose computation is the Instruction Set Architecture (ISA). The ISA not only describes the operations supported by a machine's computational unit, but also provides a well defined description and overview of the operation of a machine that implements that ISA. These operations are defined in terms of behavior and language interface, regardless of how they are implemented. An ISA also describes other elements of the machine, such as the registers available to the user and their purpose (i.e. general or special purpose). From the hardware perspective, the Instruction Set Architecture (ISA)

formalizes the software-hardware interface. The ISA is a contract between the user of the hardware and the hardware implementer. The introduction of ISAs was imperative for the evolution of computer systems. It resulted in two disjoint (and yet related) evolution processes. First, on the hardware side, architecture developers were able to continuously improve their designs without requiring a whole restructuring of the executing programs. An example of this is Intel's Tick-Tock model [71]. Second, on the software side, there was a rapid evolution of software infrastructure. Software has continuously grown to define multiple programming models, programming languages and programming paradigms that build upon each other into different layers of complexity. Therefore, software has isolated the programmer (i.e. user) from the actual low level execution model and ISA of the hardware. The growth of software technology has also been accompanied by the evolution of compiler technology, libraries and runtime systems that have allowed users to focus on modular application development.


Figure 2.13: Conceptual view of computer system infrastructure

To illustrate this evolution, Figure 2.13 shows a conceptual view of what current computer systems are in terms of the models and the different interacting parts. At the top level we have the users and programmers, and at the bottom we have the executing hardware. Hardware usually implements a runtime system, usually transparent

to the user, that glues together the conceptual view of the hardware programming API and ISA with the physical parts of the machine. Between users and hardware, we have the software infrastructure: an advanced and complex interaction of moving parts between the programming realm and the execution realm, glued together by many levels of abstractions and models. On the programming side, high level programming languages allow the user to define programs based on the programming paradigms exposed by the semantics of the programming language. Furthermore, a collection of libraries, packages, tools and compilers allows translating these programs into lower level abstractions that can be interpreted by the hardware and given to the system for execution. Software execution is aided by software-implemented runtime systems, operating system features, loaders and dynamic libraries that in many cases allow the implementation of the different programming paradigms or language execution models. A brief example is the C runtime library that is linked with the user program, or the OS APIs that allow the connection between virtual memory and physical memory. Computer system infrastructure is so entangled that innovation is limited by the inertia of all the moving parts, as Hennessy and Patterson pointed out in their Turing award lecture [1]:

As seen repeatedly, although the marketplace is an imperfect judge of technological issues, given the close ties between architecture and commercial computers, it eventually determines the success of architecture innovations that often require significant engineering investment.

On the other hand, parallel computing does not count on a widely accepted contract akin to the ISA. There exists a large number of execution and programming models that expose parallelism in both hardware and software. However, each model presents a different system abstraction. Even within the same programming model, its evolution across versions leads to modifications of the abstract machine in ways that are not trivial for the programmer. A clear example is the evolution of OpenMP through the inclusion of different parallel models across versions (e.g. fork-join, tasking and offloading models).

The Program Execution Model (PXM) is an abstraction that shares similarities with the concept of an Instruction Set Architecture. Unlike ISAs, which only account for a single core processor, Program Execution Models account for the computation system as a whole. A program execution model is defined around a machine abstraction that describes the organization and interconnection of components. The PXM defines the behavior of the system under program execution, as well as the language required to allow a user to program the machine. A PXM is a formal definition of the Application (or, even better, Architecture) Programming Interface (API) of the computer system. A PXM often describes three elements: a) an activity model, which defines the work to be performed (similar to ALU operations and ISA arithmetic instructions); b) a memory model, which specifies the addressing model of memory as well as the results of memory operations and the corresponding memory state transitions; and c) a synchronization model, which deals with interactions between activities. If system architects and programmers are able to reach an agreement on a program execution model, it will be possible to achieve a seamless evolution of parallel machines. One goal of this work is to provide a well defined program execution model for parallel machines.

2.2 The Codelet Model of Computation

A more thorough description of the Codelet Model can be found in [72][59]. The following is a summary of the elements of the Codelet Model that are most important and influential for this work. The Codelet Model is a Program Execution Model that uses a Von Neumann/dataflow hybrid approach to describe parallel programs. A program is represented by several directed acyclic graphs (DAGs). Each node of a DAG is a task (called a Codelet) and the edges are data or control dependencies between tasks. Codelet DAGs are grouped together into asynchronous procedures called Threaded Procedures (or just TPs for short). Starting from an initial TP, Codelets can invoke new procedures, creating a storage structure similar to a cactus stack memory model [70]. Due to the

asynchronous nature of TPs, continuation Codelets are used to manage signals across TPs, similar to a split-phase transaction.

The Codelet Abstract Machine (CAM) describes the organization of the system as viewed from the perspective of the Codelet program. Starting at the bottom, there are two different types of units: Compute Units (CU) and Synchronization (or scheduling) Units (SU). Multiple compute units (e.g. cores) and one synchronization unit are grouped to form a cluster. Multiple interconnected clusters form a node, and multiple interconnected nodes form the computation system. Memory can be hierarchical and distributed as well. It is mapped to the system as follows: each core may have its own local memory to serve as a stack for the execution of Codelets; a cluster may contain memory that is shared by multiple cores; and, additionally, shared memory at the node level may allow multiple clusters to share information. Synchronization Units (SU) are specialized programmable hardware designed to make resource management and scheduling decisions within the cluster. Compute Units (CU) are general purpose hardware cores in charge of executing the code that describes a Codelet. Inside a cluster, there may be different kinds of CUs, allowing for heterogeneous computation. A Threaded Procedure may only map to a single cluster. Figure 2.14 shows a depiction of the Codelet Abstract Machine.

Firing rules determine the order of execution of a Codelet program, as well as the need for synchronization. As is the case for dataflow-inspired models, Codelets have three different states that depend on data dependencies, control dependencies and the availability of hardware resources. A Codelet is in the waiting state if one or more dependencies have not been satisfied. Once all dependencies are satisfied, the Codelet moves to the ready state and waits to be executed. As CUs become available, ready Codelets are scheduled, moving to the fire state. An important distinction with pure dataflow models is that Codelets are also event-driven. Some Codelets have dependencies on certain events in the system. Events allow long latency operations to occur without stalling compute resources. Events may come from an external source (I/O events) or from operations within the system that cannot guarantee a bounded latency (e.g. far memory accesses and results from other Threaded Procedures).
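The following C sketch is an illustrative simplification of this signaling mechanism, not the API of any particular Codelet runtime: each Codelet carries a counter of unsatisfied dependencies, producers (or events) signal it, and when the counter reaches zero the Codelet becomes ready and is handed off for non-preemptive execution, standing in for the Synchronization Unit placing it on a Compute Unit.

```c
#include <stdio.h>

/* A Codelet: a non-preemptive unit of work plus the count of outstanding
 * data/control dependencies. Names here are illustrative only.               */
struct Codelet {
    const char *name;
    int sync_count;            /* dependencies still unsatisfied               */
    void (*fire)(void);        /* runs to completion once scheduled            */
};

static void work(void) { printf("codelet body executed\n"); }

/* A producer (or an event) signals one dependency of a consumer Codelet.
 * When the count reaches zero the Codelet is ready; here we "schedule" it
 * immediately by calling fire(), standing in for the SU/CU machinery.         */
static void signal_codelet(struct Codelet *c) {
    if (--c->sync_count == 0) {
        printf("%s is ready, firing\n", c->name);
        c->fire();             /* atomically scheduled, runs to completion     */
    }
}

int main(void) {
    struct Codelet consumer = { "consumer", 2, work };
    signal_codelet(&consumer);   /* first dependency satisfied: still waiting  */
    signal_codelet(&consumer);   /* second dependency satisfied: fires         */
    return 0;
}
```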

48 Figure 2.14: Codelet Abstract Machine Depiction. Hierarchical organization of the system and its memory.

49 accesses and results from other threaded procedures). Codelets are also atomically scheduled and non-preemptive. A Codelet is ex- ecuted until completion and cannot be stopped. Non-preemption allows saving re- sources on context switching. To enable non-preemptive execution, any resource that a Codelet needs at runtime must be guaranteed to be at a bounded access latency from the scheduling CU at the moment of Codelet execution. The Codelet Model aims to solve some of the issues of dataflow based architec- tures [73]. In particular, traditional dataflow systems have a flat memory abstraction that does not necessarily map to the hierarchical structures of memories in current systems. Second, local scheduling policies of dataflow operations may result in non- optimal scheduling of tasks. As shown in the Codelet Abstract Machine, the Codelet Model considers a memory hierarchy on the system. Furthermore, limitations imposed to the mapping of Threaded procedures into a single cluster allows to exploit locality. Likewise, the multiple compute units that conform the cluster are in close distance reducing the communication cost. At the system level, the distribution of Synchro- nization units enables the creation of different local and global scheduling policies. As a result, implementations of the Codelet Model [74] has already used work stealing and other scheduling techniques. However, there are some problems that have not been explored in the original Codelet Model. First, while scheduling is distributed, it lacks of a hierarchical view of the system. What happens beyond the cluster?. At the system, node and cluster levels, there are no scheduling artifacts that allows for better workload managing and better resource utilization. Second, Threaded Procedures are forced to remain at the cluster level. The size of threaded procedures is limited to the size of a cluster’s memory, unless upper levels of memory are utilized. However, coordination and locality of memory beyond the cluster is not trivial. The question remains: how to appropriately manage Codelet size, Threaded Procedure size, and the mapping of larger memory into the system’s memory hierarchy?. Third, Imperative programming models based on sequential execution of code are easy to understand for a programmer. However,

50 description and debugging of graphs is still a difficult task. The original Codelet model depends on creation of well-behaved dataflow graphs [44], thus complicating system programmability. This work aims to solve some of these issues by extending the Codelet Model of Program Execution.

Chapter 3

OBJECTIVES AND PROBLEM FORMULATION

This thesis starts by defining the Hierarchical Turing Machine and the Hierarchical Von Neumann architecture [75]. Based on these two models, this work then defines the Sequential Codelet program execution model and the SuperCodelet computer system architecture. The Sequential Codelet Model is hierarchically defined. At each level, programs are written with sequential semantics, similar to assembly code. Unlike current ISA assembly, the supported operations of such a program are tasks, called Codelets, and they can be user defined. At a certain level of the machine, a program describes Codelets as the supported operations of that level. Codelets of one level are defined in terms of the Codelets and instructions of the level below. Parallelism is achieved through techniques similar to the optimizations used for Instruction Level Parallelism (e.g. superscalar architectures, Out-of-Order execution engines, and register renaming techniques). Memory is hierarchically organized and structured to map to the memory hierarchy of real systems.

3.1 Objectives

The overall objective of this thesis is to define a program execution model that encompasses the computation system as a whole, allowing programs to execute in parallel while considering programmability, performance, and portability for future improvements of the system. In particular, the objectives of this thesis are the following:

• Definition of the mathematical model that serves as the foundation for computation across large systems. It shall be called the Hierarchical Turing Machine.

• Definition of an abstract machine that is based on the mathematical model and which describes the organization of a system and the interaction of the different components. It shall be called the Hierarchical Von Neumann Model.

• Definition of a program execution model of a system that is organized in the structure defined by the abstract machine. It shall be called the Sequential Codelet Model.

• Definition of a possible architecture that implements the Program Execution Model. It shall be called the SuperCodelet Architecture.

• Creation of an evaluation methodology that allows assessing the benefits of the Sequential Codelet Model through an implementation of the SuperCodelet machine using current commodity parallel/distributed hardware and its extensions.

In a nutshell, the major contribution of this work is the creation of the Sequential Codelet Model which, by means of a hierarchical structure of the system, describes programs sequentially while allowing the work to be lowered into parallel/distributed execution.

3.2 Problem Formulation

This thesis aims to answer the following important question:

A general purpose parallel/distributed computation model should have the desirable properties of performance efficiency, portability and programmability. Is it possible to define such models, and their corresponding abstract machines, on commodity hardware and extensions?

In order to answer this question, we first ought to find solutions to the following problems:

• Problem 1: Following the path of Turing and Von Neumann, what are the models that enable the aforementioned properties across a parallel/distributed and hierarchical system? This question is developed in Chapter 4, Section 4.2, through the Hierarchical Turing Machine and the Hierarchical Von Neumann model.

• Problem 2: Based on the foundation of Problem 1, what are the program execution model and abstract machine that enable the aforementioned properties across a parallel/distributed and heterogeneous system? Chapter 4, Section 4.4, describes the Sequential Codelet Model.

• Problem 3: How can an implementable architecture that uses the Sequential Codelet Model be described? I call this the SuperCodelet Architecture, and it is described in Chapter 5.

• Problem 4: What is an appropriate strategy to evaluate the Sequential Codelet Model and determine system design parameters on commodity hardware and extensions? Chapter 6 shows the design and implementation of the SCMUlate system. Chapter 7 shows early evaluation results of the SCMUlate system.

Chapter 4

THE SEQUENTIAL CODELET MODEL

The Sequential Codelet Model (SCM) is a Program Execution Model that heavily borrows from the success of sequential computation. The SCM is not yet another attempt to parallelize already existing sequential code. Instead, the SCM defines a program execution model (as described in Section 2.1.6) that uses sequential semantics and leaves parallelization as an execution optimization. Current auto-parallelization techniques for sequential code, on the other hand, aim to bridge a Von Neumann based sequential programming model and a parallel model of computation. Attempts to auto-parallelize sequential code into multithreaded versions often fail to be general purpose enough due to the difficulty of understanding the side effects of the interaction between threads, memory access patterns, and problem dimensions in the context of a complex system architecture. The Sequential Codelet Model starts by providing a well defined abstract machine and execution model that are different from the Von Neumann machine. That is, the SCM is based on the Hierarchical Turing Machine and the Hierarchical Von Neumann architecture, and programs must still be written for the proposed abstraction. Parallelism occurs transparently to the programmer by means of techniques based on principles of dataflow, similar to those used in Instruction Level Parallelism. Contrary to a single core ISA, the SCM uses a hierarchical abstraction that covers the whole computer system. Contrary to other out-of-order task execution approaches, the hierarchical organization of memory allows breaking away from the original Von Neumann abstraction, allowing for a more appropriate mapping of this model onto current parallel/distributed and heterogeneous systems.

Perhaps the most valuable aspects of the Sequential Codelet Model are: 1) it presents a hierarchical organization of Turing complete machines that enables computation at any level of the machine; 2) instructions are defined sequentially in an imperative style programming model, and the SCM machine may use ILP-inspired techniques to achieve program parallelization, therefore removing the burden on the programmer to think in parallel; 3) the memory organization of the abstract machine presented in the Sequential Codelet Model recognizes the hierarchical nature of memory organization (e.g. registers, L1 cache, L2 cache, LN cache, DRAM, storage devices, network file systems, cloud storage, and so on); and 4) the Sequential Codelet Model allows a weaker memory model to be implemented system-wide, yet it relies on sequential consistency within each level [17]. The Sequential Codelet Model considers the system as a whole, allowing it to span beyond the single core into multi-core, multi-socket, multi-node, and cloud computing systems.

The Sequential Codelet Model is an extension of the Codelet Model defined by Suetterlein et al. and the CAPSL research group in [59] and [72]. We envision a realization of the SCM as a hardware/software co-design strategy that includes machine organization, machine programming interface, compilation technology, and the definition of appropriate higher level programming abstractions and programming languages. This chapter starts with a motivation example that inspires the Sequential Codelet Model, followed by the definitions of the Hierarchical Turing Machine and the Hierarchical Von Neumann Model. Based on these models, the Sequential Codelet Model is introduced. Finally, a hardware realization of the Sequential Codelet Model, called the SuperCodelet architecture, is presented.

4.1 Motivation example

The operations supported by a computer are described through the Instruction Set Architecture and its extensions. Among the different instructions supported by a particular ISA, arithmetic and logic instructions provide the building blocks to represent more complex mathematical functions in a program. However, for a given architecture that targets a particular domain, the set of supported arithmetic operations is part of the architect's tradeoffs on how to use the die area. While general purpose CPUs usually support basic arithmetic operations (e.g. ADD, SUB, DIV, and MUL), other architectures targeting domain specific applications (e.g. DSP-like systems) are equipped with extra instructions that compute more advanced math operations (e.g. trigonometric operations, advanced FMAs, or FFT operations). ALU operations often have the following characteristics:

• The latency of a given operation is bounded to a number of cycles. This value is known and does not change considerably. A good operation design would guarantee that this latency is as low as possible for the given technology, and within the same order of magnitude as other operations.

• The latency to access the operands stored in registers is bounded and negligible. The operation's execution time is driven by the execution time of the ALU, and not by the access time to registers.

• The execution of the operation has no side effects other than the output operands, and depends only on its input operands.

Property 1 allows a compiler to optimize the scheduling of instructions, keeping the pipeline busy. Instructions within an ALU are often within the same order of magnitude in terms of number of cycles, allowing the system to use other parts of the pipeline while instructions finish. Furthermore, property 2 allows property 1 to hold. When an operation is scheduled in the ALU, all the operands needed for its execution are already available in the register file. The ALU does not waste time waiting for data to arrive from memory or waiting for other resources. Finally, property 3 ensures a deterministic behavior of the instruction execution, and it does not limit the creation of more complex programs described by means of these ALU operations.

Let us imagine that, as long as we guarantee the aforementioned properties, we could write our own ALU operations at runtime by using a special assembly code that changes the behavior of the ALU. We also assume that we have a register file large enough to hold our operands. For this example, we wrote a special instruction in the ALU that counts the number of matches of a 4-byte integer value within an array of 100 4-byte integers. The 100 element array is stored in a single register of size 400 bytes (i.e. SourceReg), while the result is stored in a single register of size 4 bytes (ResultReg). The signature of our new ALU operation is as follows:

COUNT SourceReg, ResultReg, #value

By using this new instruction, it is possible to write the sequential program in Listing 4.1 that counts the number of occurrences of the value 55 in an array (named Arr) of size 300 stored at location 0x50 in a 4-byte addressable memory. RL are large registers storing the array, and RS are small registers storing the result.

LD    RL1, 0x50
COUNT RL1, RS1, 55

LD    RL1, 0xB4
COUNT RL1, RS2, 55

LD    RL1, 0x104
COUNT RL1, RS3, 55

;; Add the three results together
ADD   RS1, RS1, RS2   ;; RS1 += RS2
ADD   RS1, RS1, RS3   ;; RS1 += RS3

Listing 4.1: Motivation Example

If the processor that executes these instructions uses multiple ILP optimizations (e.g. register renaming, Out-of-Order execution, and superscalar issue), the execution of these instructions is likely to occur in parallel thanks to the lack of dependencies across the different instructions. Therefore, as long as the operations associated with a particular application are available in the ALU to be executed, and we respect the aforementioned observations, it should be possible to parallelize a sequential code by using ILP optimizations that have already been studied for several decades. Furthermore, the ability to modify the ALU allows a programmer to have functional instructions that help solve more complex problems. If we had the chance to modify our ALU at runtime, it would be possible to optimize certain operations critical to our application. However, die area is a limited resource and register files usually have a limited size; how could we create a system that allows us to modify the ALU while also providing us with larger registers? The important takeaway of this example is that, if we had the possibility of modifying the ALU operations, we might be able to create sequential programs that execute in parallel thanks to Out-of-Order execution techniques available in current architectures. The Sequential Codelet Model aims to define an architecture that allows the ALU to be programmed into more complex operations, while Instruction Level Parallelism is used to obtain a parallel execution of code.
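To make the semantics of the hypothetical COUNT operation concrete, the following C++ sketch shows the computation it stands for, with the large source register modeled as 100 4-byte integers. The names and sizes are illustrative only and do not correspond to an existing ISA.

#include <array>
#include <cstdint>
#include <iostream>

// Illustrative sketch only: what the hypothetical COUNT ALU operation computes.
// A "large register" is modeled as 100 4-byte integers (400 bytes), and the
// result is a single 4-byte integer, mirroring SourceReg and ResultReg above.
using LargeReg = std::array<int32_t, 100>;

int32_t count_matches(const LargeReg& source, int32_t value) {
  int32_t result = 0;
  for (int32_t element : source)   // no side effects: output depends only on inputs
    if (element == value) ++result;
  return result;
}

int main() {
  LargeReg chunk{};              // stands in for one LD of 100 elements from Arr
  chunk[3] = chunk[42] = 55;     // plant two matches for the demonstration
  std::cout << count_matches(chunk, 55) << "\n";  // prints 2
}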

4.2 Hierarchical Turing Machine

The Hierarchical Turing Machine (HTM) is a mathematical model that serves as a foundation for the Sequential Codelet Model. The Hierarchical Turing Machine was first introduced in [75]. Due to the generality of the HTM model, it could lead to different organizations of a machine. One possible organization is described in this thesis: we focus our attention on the Sequential Codelet Model, which resembles a hierarchical Von Neumann architecture. The Hierarchical Turing Machine is a variation of a multi-tape Turing machine; therefore, it does not extend the set of computable numbers defined by the Turing Machine.

Alan Turing first described the Universal Turing Machine in his paper in 1936 [2]. The Turing Machine has three major components: 1) an infinite storage tape divided into sections containing symbols; 2) a reading head that slides along the tape, accessing a single section at a time; and 3) a state machine (i.e. m-configuration) that corresponds to the program executed by the machine. Each state performs certain operations before transitioning to the next state. The transition is defined based on the symbol read in the current section of the tape. A state's operation is determined by a combination of three possible actions: move right or left across the tape, print into the current tape section, or erase the current tape section. Based on these operations, Turing created the Universal Turing Machine, which is capable of executing any Turing machine based on a description language encoded in the tape.

In the Turing Machine, the head contains the state machine and has no storage of its own. The operations are only guided by the values read from the current tape square and the current state. However, there are many operations that require keeping temporary results or marks during computation. The distinction between the F-squares and E-squares (see Section 2.1) is an example of what forms temporary values or marks and what forms the final number computed by the machine. Therefore, in order to maintain partial results, the machine uses part of the tape as a scratchpad for computation. Since the tape is infinite, this is not necessarily a problem. The location of the scratchpad memory and the result memory in the tape can be changed according to different Turing machines.

Figure 4.1: A 3-level Hierarchical Turing Machine. A state of level N is expressed as a state machine of level N − 1 and evaluated by level N − 1. Each level has an infinite memory tape used for the computation performed at that level. The symbols depicted in the tapes stand for any possible symbol known to the machine.

Another aspect anticipated by Alan Turing was the definition of parametric instruction tables that unwrap into regular instruction tables. He called them “skeleton tables”. Turing showed that it is possible to describe a state in terms of other states and basic operations, as long as at some point the recursion of the definition resolves to a single finite state machine. Using the aforementioned observations, and following a strategy similar to the one used by Turing to describe the Universal Turing Machine, it is possible to modify the structure of the original Turing Machine into a hierarchical organization without losing generality. The Hierarchical Turing Machine (HTM) can be seen as multiple Turing machines stacked one on top of the other. Similarly to the original Turing Machine, each level contains three basic elements: a tape, a head, and a state machine. However, states in the Turing machine can also be described as superstates, similar to the “skeleton tables”, which are defined by a state machine. For a superstate of level N, the Turing Machine of level N − 1 is used. The final state machine is a hierarchical state machine [76] that maps to the hierarchical organization of the HTM. The structure of the HTM is depicted in Figure 4.1. Level N is represented partially, while the two levels below it (N − 1 and N − 2) are completely depicted. As can be seen, level N's current state is evaluated by using level N − 1. That is, the state machine of level N is a combination of states and superstates. States are evaluated in level N, and superstates are mapped to level N − 1. The evaluation unit (i.e. the level below, N − 1) is therefore used to evaluate some of the states of level N. Likewise, N − 1's superstates are evaluated by using N − 2 as an evaluation unit.

This hierarchical pattern occurs multiple times, until a bottom level N0 is reached. N0 is a level for which no state is expressed in terms of a lower level N0 − 1; therefore, it corresponds to the lowest level and it resembles a standard Turing Machine. Not every state described for an inner level must be a superstate evaluated by the level below. That is, for a state machine expressed for an inner level N (different from N0), there may be states that do not require a lower level N − 1 to be evaluated. Instead, these states are expressed in terms of the regular operations of the original Turing Machine: move right/left, print a symbol, or erase a symbol. On the other hand, states of a level N state machine that do map to level N − 1 are described similarly to “skeleton tables”, as originally defined by Alan Turing. “Skeleton tables” allow for the grouping of state descriptions, resembling a routine or procedure. It is possible to create a recursive “skeleton table” or use other tables to describe a “skeleton table”. Nevertheless, it is not allowed to have a description of a state that results in an infinite recursion. The resulting state machine after flattening the Hierarchical State Machine must be a valid finite state machine.
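As a minimal illustration of this evaluation rule, the following C++ sketch models a state as either a primitive action or a superstate whose body is evaluated one level below. The data layout and names are assumptions made for this example and are not part of the formal HTM definition.

#include <functional>
#include <iostream>
#include <vector>

// Illustrative sketch only: a state either performs a primitive action at its
// own level (move/print/erase in the formal model) or is a superstate whose
// body is a state machine evaluated by the level below. The recursion must
// bottom out at level 0, so flattening always yields a finite state machine.
struct State {
  std::function<void()> primitiveAction;  // used when body is empty
  std::vector<State> body;                // non-empty => superstate ("skeleton table")
};

void evaluate(const State& s, int level) {
  if (s.body.empty() || level == 0) {     // ordinary state: executed at this level
    if (s.primitiveAction) s.primitiveAction();
    return;
  }
  for (const State& sub : s.body)         // superstate: delegate to level N-1
    evaluate(sub, level - 1);
}

int main() {
  State print = {[] { std::cout << "print symbol\n"; }, {}};
  State move  = {[] { std::cout << "move right\n"; }, {}};
  State super = {nullptr, {print, move}};  // a level-1 superstate made of level-0 states
  evaluate(super, /*level=*/1);
}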

4.3 Hierarchical Von Neumann Architecture

Figure 4.2: Structure of the Hierarchical Von Neumann Architecture. A Sequential Codelet Model abstract machine.

The original Von Neumann system organization contains four basic elements: a central processing unit, an operations unit, a memory unit, and the I/O. Inside the CPU there are at least two elements: the control unit and the Arithmetic Logic Unit (ALU).

Furthermore, memory contains instructions and data. Based on the Hierarchical Turing Machine, we define the Hierarchical Von Neumann Model. The difference between the original Von Neumann model and the hierarchical Von Neumann model lies in the arithmetic logic unit. In the former, the ALU contains a set of predefined arithmetic operations, while in the latter the ALU is extended to contain a whole programmable Von Neumann machine (i.e. a Turing complete ALU); hence the similarities between the Hierarchical Turing Machine and the Hierarchical Von Neumann Model. Figure 4.2 shows a simplified diagram of the Hierarchical Von Neumann model structure. The machine resembles multiple conventional Von Neumann machines stacked on top of each other. However, similarly to the HTM, a level N of the machine may use the level below, N − 1, to execute some of its instructions. For a given level N, all the supported instructions are seen as single units of operation, atomically scheduled and with a well defined behavior. Some instructions of this level N are expressed as a sequentially described program of the level below, N − 1. Likewise, from the perspective of level N − 1, all its instructions are seen as atomically scheduled units of operation, some of which may be expressed in terms of level N − 2. And so forth, until a level N = 0 is reached. This last level corresponds to a regular Von Neumann machine where all operations map to a regular ALU that is not programmable.

In the Hierarchical Von Neumann model, the ALU is extended by including a whole programmable Von Neumann system. However, this does not mean that it loses its ability to perform basic arithmetic operations. Similarly to the states of the different levels of the HTM (see Section 4.2), some operations of the inner levels (N ≠ 0) of the hierarchical Von Neumann model will still be executed by an ALU-like system at that particular level. The benefits are twofold: it allows resolving the simple arithmetic operations needed by control flow operations, and it also enables compute capabilities at higher levels of the hierarchy (e.g. on-memory compute).

Additionally, the hierarchical organization of this abstract machine also results in a memory hierarchy. Memory for a level N corresponds to a subset of the memory in level N + 1. The evaluation of an instruction of level N + 1, described as instructions of level N, uses part of the memory of level N + 1. This area of memory is used to store the input and output operands, as well as any internal state needed for the execution of that instruction (e.g. a stack). Memory of level N is then a subset of the memory of the whole level N + 1. As a result of this hierarchical organization, memory may be mapped to the hierarchical memory of real hardware systems. By means of well defined operations at each level, it is possible to use different memory consistency models across the levels, as we will see in the definition of the SuperCodelet architecture in Chapter 5.

On the other hand, important attention needs to be placed on the I/O. While the diagram of Figure 4.2 does not restrict the location of I/O at a given level, system and operation designers have to be careful, as it will have a critical effect on the deterministic behavior of the system, as well as on the instructions of a given level. A safe mode of operation is that I/O can only impact the level at which the computation is initiated, or that I/O corresponds to the stimuli that initiates the computation. Ideas outside of this safe mode of operation need to be explored further and are outside of the current scope of this work. For now, we assume I/O is only present at the uppermost level of the hierarchy, that is, the level where computation starts.

Finally, the Hierarchical Von Neumann model is not the only system that can be envisioned with the Hierarchical Turing Machine. Furthermore, a hierarchical Von Neumann machine can use a combination of other architectural models in this hierarchical fashion. This is particularly useful for the support of heterogeneous computation, where application-specific architectures become the ALU of a given level. An example would be to use FPGAs or neuromorphic chips [10][9][77] as the ALU of a level of the hierarchy. A Codelet at this level would be expressed in terms of the RTL instructions that map onto the given architecture.
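The nesting described above can be summarized in a small C++ sketch. The field names and sizes below are assumptions made for illustration, not part of a formal specification: each level owns a register file that doubles as the memory seen by the level below, and its extended ALU is the entire machine one level down.

#include <cstdint>
#include <vector>

// Illustrative sketch of the hierarchical nesting only (names and sizes are
// assumptions): the register file of level N is the memory visible to level
// N-1, and the extended ALU of level N is the whole machine one level below,
// together with a small native ALU for simple arithmetic and control flow.
struct Level {
  std::vector<uint8_t> registerFile;  // also acts as memory for the level below
  Level* extendedALU;                 // level N-1; nullptr at the bottom (N = 0)
};

int main() {
  Level L0{std::vector<uint8_t>(1 << 12), nullptr};  // conventional core, small registers
  Level L1{std::vector<uint8_t>(1 << 16), &L0};      // L1 registers serve as L0 memory
  Level L2{std::vector<uint8_t>(1 << 20), &L1};      // L2 registers serve as L1 memory
  return L2.extendedALU->extendedALU == &L0 ? 0 : 1; // the hierarchy bottoms out at L0
}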

4.4 The Sequential Codelet Model

Based on the Hierarchical Turing Machine, it is possible to create new organizations of the computer system. In the following sections we present the Sequential Codelet Program Execution Model and its corresponding abstract machine.

4.4.1 The SCM Abstract Machine

Figure 4.3: A 3-level abstract machine of the Sequential Codelet Model that implements the Hierarchical Von Neumann Model. Each level has a scheduling unit (fetch, decode, memory access, write back, and a Codelet interface/sync stage), an instruction memory, and a register file that also serves as the memory of the level below; the execution unit of L1 is the whole of L0 (e.g. a commodity core such as ARM), while L0's execution unit is an ALU with floating point support.

Figure 4.3 shows a more complex abstract machine that implements the Hierarchical Von Neumann architecture. Each level is organized as a 5-stage pipeline: Fetch, Decode, Execute, Memory, and Write Back. At the bottom, level 0 (marked L0) corresponds to any single core architecture that is commonly found today (e.g. a core implementing any of the commodity architectures: RISC-V, ARM, x86, or PowerPC). The L0 ISA is extended to include a commit instruction that works as a return signal for the level above. Other than the commit instruction, a program expressed for L0 is no different from a program written at the assembly level on current architectures. As can be seen in the figure, on top of L0 there is level 1 (marked L1 in the figure). The execution stage (Extended ALU) of the L1 pipeline corresponds to the whole of L0. Likewise, the execution stage of L2 corresponds to L1. This organization continues to the levels above: at each level N, the execution stage corresponds to the level below, N − 1. Each level may also be accompanied by an ALU, not depicted here for the sake of simplicity. In addition to the 5 stages, a level has an instruction memory where operations (Codelets) are stored, and a register file that contains memory for operands. The Memory Access unit issues memory operations to the register file of the level above (see the dotted arrow line in Figure 4.3). Therefore, the L1 register file is also labeled L0 memory, and the L2 register file is labeled L1 memory. In the original abstraction, a level can only see the memory in the level above.

Figure 4.4: The memory size to frequency ratio of the SCM (the inverted pyramid): higher levels have larger memory, more complex operations, slower frequency, and longer cycle times; lower levels have smaller memory, less complex operations, faster frequency, and shorter cycle times.

An important observation is that the higher the level of the hierarchy, the larger the memory capacity is with respect to the execution units of the level below. Furthermore, as we move up the hierarchy, the complexity of the operations is expected to increase, and the operations' latency in the execution stage also increases. This can be summarized as the inverted pyramid of Figure 4.4, and it will be important when deciding the size of the operations and where to execute them. Additionally, the approach described here uses memory that is inclusive. Values that are needed in the lower levels from the upper levels have to be copied into the levels in between. In a more advanced architecture, it should be possible to describe memory and commit operations that jump across levels, bypassing intermediate levels and breaking this inclusion rule. A possible energy trade-off for complex memory hierarchies was explored by Livingston et al. [78] and it could be extended to this approach; however, this is outside of the scope of this work.

4.4.2 The SCM Program Execution Model

An execution model comprises three components: a tasking model, a memory model, and a synchronization model.

4.4.2.1 Tasking model

In Figure 4.3 we observe three different levels. L0 has a commodity ISA, as previously described. However, the operations supported by the L1 instructions (i.e. the L1 ISA) are not limited to simple arithmetic operations. Instead, L1 has more complex, user implemented functions as part of its instruction set. These instructions are referred to as Codelets. An instruction Codelet is atomically scheduled in L1, and it is represented by a stream of instructions evaluated in L0 which implements the L1 function. This stream of instructions finishes with a commit instruction that informs L1 when the Codelet has finished execution.

In the Sequential Codelet Model, a task or instruction is represented as a Codelet. The term Codelet is taken from the definition proposed in the original Codelet Model of program execution [59], as explained in Section 2.2. In the Sequential Codelet Model, Codelet is the name given to an instruction that is used in a level N (where N ≠ 0) and executes at level N − 1. Codelets share properties similar to those of ALU operations available in current ISAs. Codelets are atomically scheduled and executed non-preemptively. Execution of Codelets is side-effect free, and there is no internal state stored across multiple executions of the same Codelet; hence, a Codelet's output depends only on its inputs. A Codelet maps to a particular level of the hierarchy, and its operation is described as a sequence of instructions (Codelets or native ISA instructions) of the level below, guided by a set of control flow operations (i.e. a structure similar to assembly code). Inputs and outputs of the Codelet are assigned to memory locations (i.e. registers) of that particular level in the hierarchy (as described in Section 4.3). A Codelet program assigned to level N uses sequential semantics: Codelets are issued in the order they appear in the program. A Codelet corresponds, for example, to the COUNT instruction described in the motivation example at the beginning of this chapter.

Referring again to Figure 4.3, L1 Codelet operands are stored in the L1 register file. This L1 register file corresponds to L0 memory. Hence, arithmetic and control operations of an L1 Codelet are described in terms of L0's ISA. Memory instructions inside the Codelet access data from L1's registers. This limits the latency of memory operations to the distance between L0 and L1, allowing for a bounded execution time for Codelet memory accesses. The total size of the register file, as well as the size of each register, is expected to be larger in L1 than in L0. Therefore, for a single register in L1, multiple load and store operations may be needed to access its data. It is still possible to have registers of different sizes in L1, including registers of common sizes (e.g. 8, 16, 32, and 64 bit registers).

We refer again to the L1 Codelet COUNT R1, R2, val described in the motivation example. This Codelet sets register R1 to the number of times the value val is present in R2. R1 is an integer and may be stored in a 32-bit register, while R2 corresponds to an array of 100 integers (each of size 32 bits) and will use a 3200-bit register. The value val is an immediate value. Both R1 and R2 are registers in L1. The implementation of COUNT is assembly code in L0. Since L0 registers are smaller than 3200 bits, there will be multiple load operations for the integer values of R2. Each load operation fills a 32-bit register in L0, requiring 100 loads in order to compute the whole COUNT operation.

The relationship between L2 and L1 is similar to that between L1 and L0. Instruction Codelets that are used in L2 are described as a stream of instructions and Codelets in L1. Additionally, L2's register file memory space has a larger capacity than L1's register file. Therefore, registers available in L2's register file can be larger than those of L1, resulting in multiple L1 memory accesses to cover a whole single register of L2's register file. Operands of an instruction Codelet in L2 use registers in L2, and memory instructions of L1 map directly to L2. Consequently, the hierarchical instruction spans multiple levels, even beyond L2. To summarize: the execution of a Codelet in L2 offloads to L1, and Codelets of the L1 instruction set offload to L0. Load and store operations of L0 fall into L1's register file, and those of L1 fall into L2's register file. Since the registers of L1 can be larger, multiple load and store operations of L0 would be required to access the whole operands of an operation in L1. Consequently, L1's register file acts as L0's memory. It is expected that L1's register file is physically close to L0 (e.g. L1's memory could be located where the level 2 or 3 cache is located in current architectures); hence, the latency of memory operations in L0 should be bounded by this short distance between L1's memory and L0's registers.
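As an illustration of the tasking model, the following C++ sketch mimics what the L0 body of the L1 COUNT Codelet described above would do. The register file model and the load_word/store_word/commit helpers are assumptions standing in for L0 memory instructions and the commit signal, not an existing API.

#include <array>
#include <cstddef>
#include <cstdint>
#include <iostream>

// Illustrative sketch of the L0 body of the L1 COUNT Codelet. The 3200-bit L1
// source register is exposed to L0 as plain memory, so each of the 100 32-bit
// words needs its own load before the Codelet can commit its result.
std::array<uint32_t, 256> l1_register_file{};  // toy model of L1's register file

uint32_t load_word(std::size_t off)              { return l1_register_file[off]; }
void     store_word(std::size_t off, uint32_t v) { l1_register_file[off] = v; }
void     commit()                                { /* would signal L1's pipeline */ }

void count_codelet_body(std::size_t src, std::size_t dst, uint32_t value) {
  uint32_t matches = 0;
  for (std::size_t i = 0; i < 100; ++i)   // 100 loads cover the 3200-bit register
    if (load_word(src + i) == value) ++matches;
  store_word(dst, matches);               // write the 32-bit result register
  commit();                               // return control to the L1 pipeline
}

int main() {
  l1_register_file[7] = l1_register_file[63] = 55;   // plant two matches
  count_codelet_body(/*src=*/0, /*dst=*/200, 55);
  std::cout << l1_register_file[200] << "\n";        // prints 2
}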

4.4.2.2 Synchronization Model

Codelets are synchronized by means of data dependencies expressed as the registers used in their operands. A Codelet is described with a number of operands. Each operand can be a constant value (immediate value) or a register value. Additionally, each operand has a direction that can be READ, WRITE, or READWRITE. An implementation of the Sequential Codelet Model must guarantee that true dependencies (i.e. READ after WRITE) are always satisfied during the execution of the Codelet program. Synchronization is determined by the relative position of two Codelets in the program stream. Two Codelets A and B have a true dependency if 1) Codelet A comes before Codelet B in the instruction stream, 2) Codelet A has a register operand R with direction WRITE or READWRITE, and 3) Codelet B has an operand that uses the same register R with direction READ or READWRITE. Synchronization occurs when the order between A and B is guaranteed in a given implementation of the Sequential Codelet Program Execution Model.
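The rule above can be written down directly. The following C++ sketch is a hedged illustration of the check; the Operand/Codelet layout is an assumption made for this example, not the SCM's actual data structures.

#include <iostream>
#include <string>
#include <vector>

// Illustrative sketch of the synchronization rule: Codelet B (later in the
// stream) truly depends on Codelet A (earlier) if A writes a register that B
// reads. Immediate operands never participate in synchronization.
enum class Dir { READ, WRITE, READWRITE };

struct Operand {
  bool isRegister;   // false for immediate values
  std::string reg;   // register name, e.g. "RS1"
  Dir dir;
};

struct Codelet { std::vector<Operand> operands; };

// A appears before B in the program stream.
bool hasTrueDependency(const Codelet& a, const Codelet& b) {
  for (const Operand& wa : a.operands) {
    if (!wa.isRegister || wa.dir == Dir::READ) continue;           // A must write the register
    for (const Operand& rb : b.operands)
      if (rb.isRegister && rb.reg == wa.reg && rb.dir != Dir::WRITE)
        return true;                                               // B reads the same register
  }
  return false;
}

int main() {
  // COUNT RL1, RS1, 55  followed by  ADD RS1, RS1, RS2  (from Listing 4.1)
  Codelet count{{{true, "RL1", Dir::READ}, {true, "RS1", Dir::WRITE}}};
  Codelet add{{{true, "RS1", Dir::READWRITE}, {true, "RS2", Dir::READ}}};
  std::cout << std::boolalpha << hasTrueDependency(count, add) << "\n";  // true
}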

4.4.2.3 Memory Model

Memory is hierarchically organized. However, level N can only interact with level N + 1 through load/store operations, and with level N − 1 by means of operations that affect the register file. This restriction allows a memory consistency model to be defined for the interaction of a single level with the level above. It is not necessary to maintain consistency across other parts of the system, reducing the need for a monolithic view of memory. For two Codelets within the same level, memory operations must be seen in the order the Codelets appear in the instruction stream. This applies to both the register file of the given level and the memory of the level above. In general, memory instructions must be "committed" in the order in which the program is described. There is a lot of freedom in how these guarantees are provided, including the use of the Gao-Sarkar Location Consistency model [79]. For now, we focus on enforcing sequential consistency across two levels of the machine. In the case of the register file, the order of memory operations is guaranteed by the synchronization model. For memory operations to the level above, the Memory stage in the pipeline must guarantee that instructions are committed in the order they appear in the instruction stream.

On the other hand, support for data structures is an important aspect of programming models. For the aforementioned imperative representation of code, registers are seen as memory locations with a given size. However, the hierarchical structure of memory used in the SCM allows the arguments of a Codelet to be interpreted in the form of any data structure. For example, the execution of a level N Codelet COD_EXAMPLE with parameters P1, P2, and P3 uses three registers, one for each parameter. From the perspective of level N, these registers represent memory locations in the register file, with no given interpretation. However, at level N − 1, each operand P of COD_EXAMPLE is seen as regular memory in the sense of the Von Neumann abstraction. Therefore, for level N − 1, the values inside the registers P may be organized in the form of different data structures. Further research on higher level programming models would be needed to design appropriate strategies to span data structures across the whole system, as well as any operations to transform data structures.
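A small C++ sketch of this idea follows: at level N the operand is only a block of bytes of a given size, while at level N − 1 the same bytes can be viewed as a structured type. The Particle struct and the sizes are assumptions used purely for illustration.

#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

// Illustrative sketch: at level N an operand is only a register of some size;
// at level N-1 the same bytes can be interpreted as a data structure. The
// struct layout below is an assumption used for demonstration only.
struct Particle { float x, y, z, mass; };

int main() {
  std::vector<uint8_t> level_n_register(4096);  // one large register, uninterpreted at level N

  Particle p{1.0f, 2.0f, 3.0f, 0.5f};           // level N-1 views part of it as a Particle
  std::memcpy(level_n_register.data(), &p, sizeof p);

  Particle back{};
  std::memcpy(&back, level_n_register.data(), sizeof back);
  std::cout << back.mass << "\n";               // prints 0.5
}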

4.4.3 Programming model

As a result of the structure of the Sequential Codelet Model, code is independently written for each level. There exists a basic instruction set that applies to all the levels. This basic instruction set contains some simple arithmetic operations (e.g. ADD and SUB) and some control flow operations (e.g. jumps and branches). These instructions are used to determine the control flow of the Codelet program at the given level. A Codelet program ultimately represents a dataflow graph whose arcs are represented in terms of the data dependencies between instructions. The basic instruction set allows the definition of such a dataflow graph at runtime. To write a program for a given level N, code is written using a set of Codelets and instructions in the basic instruction set. Codelets of level N are defined in terms of the level below, N − 1. The description of a Codelet in level N − 1 also uses the basic instruction set, plus Codelets of level N − 2. At level 0, a program is described in terms of currently available ISAs (e.g. RISC-V, ARM or x86). Thanks to the influence of sequential execution models, we expect to be able to extend already existing sequential programming languages to adapt to the hierarchical behavior of the system. It is important to mention that it is possible to fix the signature of a Codelet at a given level, while working on different possible implementations of that same instruction with different objectives (e.g. performance, energy, or other optimization goals). Such a property improves software composability [7][80], also allowing for the creation of libraries that describe the Codelets of a given level.

4.5 Codelet Level Parallelism: Parallelism and performance

This section describes different possible ways to achieve parallelism through the SCM. We call these approaches Codelet Level Parallelism (CLP) due to their similarities with Instruction Level Parallelism (ILP). In this section I first discuss the use of ILP optimizations at each of the levels of the hierarchical abstract machine to achieve CLP. Second, we describe how it is possible to use other architectures to build a heterogeneous system and improve the performance of code execution. Finally, we briefly touch on how compiler optimizations for code generation could be used.

4.5.1 Codelet Level Parallelism

Choosing the five-stage pipeline to represent the Hierarchical Von Neumann Model in Figure 4.3 was not an arbitrary decision. This machine model allows us to easily introduce the use of ILP optimizations at each level of the hierarchy to achieve implicit parallel execution of code. Starting with superscalar architectures, it is possible to organize a multi-core system as a collection of execution units at any level. By means of the dataflow techniques studied in Chapter 2, it is possible to execute multiple Codelets at once, as long as data dependencies are respected. At each level, there exists a register file that is shared among the different execution units, while the latency of communication between these execution units remains bounded to a common value (e.g. the memory access time of L3 cache memory for a multicore architecture). Finding the right number of execution units at each level requires extensive research, but it is similar to the trade-offs of single core architecture design. Several specifications could influence this decision: register file size, memory latency, energy, and silicon area are just a few examples.

Extending the number of execution units in each level is what allows the Sequential Codelet Model to account for parallelism. Let us imagine a level L1 composed of multiple RISC-V cores in a single chip, followed by a level L2 with multiple sockets, a level L3 with multiple nodes, and a level L4 with multiple clusters.

In addition to superscalar techniques, it is possible to use register renaming to eliminate false dependencies that could potentially limit the parallelism found at a certain level. By using a hidden register file, it should be possible to increase parallelism. As an example, the code in Listing 4.1 could benefit from register renaming techniques: by renaming register RL1, multiple versions of the COUNT Codelet may execute in parallel. The direction of a given operand in the Codelet (i.e. READ, WRITE or READWRITE) is an important property for enabling register renaming.

Finally, the most important technique to be used is Out-of-Order (OoO) execution, which allows out of order issuing of instructions within a window at each level. This ILP technique uses dataflow concepts by detecting data dependencies at runtime for a window of instructions. Dataflow is a major component of the original Codelet Model; therefore, using OoO execution engines would allow implementing this property. Through the use of reservation tables, and by adapting commonly known algorithms for OoO execution (e.g. scoreboarding [55] and Tomasulo's algorithm [57]), it should be possible to achieve higher levels of parallelism during the execution of programs at each of the levels. Likewise, further parallelism could be obtained through the use of techniques for forwarding values between execution units at any level, as well as adaptations of existing execution engines. Moreover, using streaming techniques such as the ones described by Raskar et al. [81] could also benefit performance. In general, it is necessary to study the different techniques to find the proper trade-offs between them and understand their impact during the execution of a program.

Lastly, there is a form of Codelet Level Parallelism that could bridge current fork-join models of computation to the Sequential Codelet Model. Current instruction set architectures feature SIMD extensions that allow applying a single instruction to a register that holds multiple data values. Such extensions use special architectural features that increase the throughput the system can achieve. A Single Codelet Multiple Data (SCMD) analogy is possible. A Codelet in SCMD mode is applied to multiple data elements at the same time. A difference between SIMD and SCMD is that SCMD could potentially map to multiple levels of the hierarchy. When extended to multiple levels, an SCMD execution model resembles the hierarchical parallelism available in current GPGPUs. These systems feature a group of threads referred to as a warp (usually 32 or 64 threads). Multiple warps compose a block, and multiple blocks compose a grid. An indexing scheme allows each thread to have an independent identification value that can be used to offset data accesses during execution. For an SCMD mode, a similar mechanism needs to be put in place. Exploration of SCMD execution models is left as future work of this thesis.
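To illustrate how renaming removes the false dependencies in Listing 4.1, the following C++ sketch applies a textbook rename pass to that instruction stream. The data layout and the P0, P1, ... physical register names are assumptions made for this example.

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative register-renaming sketch over the instruction stream of
// Listing 4.1: each WRITE to an architectural register is given a fresh hidden
// physical register, so the three LD/COUNT pairs no longer reuse RL1 and can
// be issued concurrently once their own data is available.
struct Instr { std::string op; std::vector<std::string> writes, reads; };

int main() {
  std::vector<Instr> stream = {
    {"LD",    {"RL1"}, {}},        {"COUNT", {"RS1"}, {"RL1"}},
    {"LD",    {"RL1"}, {}},        {"COUNT", {"RS2"}, {"RL1"}},
    {"LD",    {"RL1"}, {}},        {"COUNT", {"RS3"}, {"RL1"}},
    {"ADD",   {"RS1"}, {"RS1", "RS2"}},
    {"ADD",   {"RS1"}, {"RS1", "RS3"}},
  };

  std::unordered_map<std::string, std::string> map;  // architectural -> physical
  int next = 0;
  for (Instr& in : stream) {
    for (std::string& r : in.reads)                  // reads use the latest mapping
      if (map.count(r)) r = map[r];
    for (std::string& w : in.writes) {               // each write gets a fresh physical register
      std::string phys = "P" + std::to_string(next++);
      map[w] = phys;
      w = phys;
    }
    std::cout << in.op;
    for (const std::string& w : in.writes) std::cout << " " << w;
    for (const std::string& r : in.reads)  std::cout << " " << r;
    std::cout << "\n";
  }
}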

4.5.2 Heterogeneity and distributed systems in the Hierarchical Abstract Machine

Recent interest in heterogeneous systems has shown the importance of heavily parallel architectures that exploit different execution models (e.g. GPUs and SPMD execution). Additionally, there are many architectures that could be used for specific applications, yielding improved performance; for example, the recent introduction of Tensor Core units in GPUs [82], as well as Google TPUs [9] for AI related applications. These architectures could be part of the abstract machine as well. By using them as an execution unit of any of the levels, it could be possible to create specific Codelets that map to these architectures. For example, applications could use a GPU's Streaming Multiprocessors as an execution unit that takes over Codelets that are SPMD friendly. Equally, a neuromorphic execution unit could be used to execute AI related Codelets, or an FPGA implementation of a given Codelet could be offloaded to a reconfigurable FPGA to improve the performance of the application. Figure 4.5 shows a multilevel heterogeneous SCM abstract machine.

Figure 4.5: Heterogeneous Sequential Codelet Model abstract machine

Furthermore, the hierarchy could be extended through multiple levels. This is depicted in Figure 4.6. While nodes in orange represent CPU cores with a behavior similar to the one described so far, green nodes represent the aforementioned heterogeneous architectures. It would be beneficial to have a highly heterogeneous system in a single chip. The hierarchy could be extended through multiple nodes of a cluster, or even connect multiple clusters together to form a single computation system. In general, it is expected that the higher in the hierarchy a level is, the larger the available memory, but also the larger the memory latency. However, the latency of operations at a higher level would also be larger. The different trade-offs need to be studied to be able to balance these elements: memory size, memory latency, and latency of operations. In the same way, bringing cores from the bottom level of the hierarchy to higher levels (i.e. in-memory processing) is also a feasible solution that could influence the given trade-offs.

Figure 4.6: Extending the hierarchy beyond L2 (L0: CPU cores, GPUs, FPGAs, and specialized architectures; L1: sockets; L2: nodes; L3: clusters; L4: cloud).

4.5.3 Compiler techniques

Perhaps another advantage of the Sequential Codelet Model is that there are already many compiler optimizations that could potentially improve the scheduling of instructions at runtime for a particular level. A program written for any level could be independently analyzed for compiler optimizations that treat Codelets as ISA operations to further exploit hardware optimization and increase ILP-based optimization opportunities. It might be that additional information about the Codelets would need to be provided or discovered. As an example, Codelets could be merged together in a particular context, similar to how compilers change independent multiplication and addition operations into a single fused multiply-add (FMA) operation. While compiler techniques would have to be studied further, we recognize the potential benefits that applying compiler optimizations could bring to the performance of the executing code, for a given realization of this machine.

Chapter 5

THE SUPERCODELET ARCHITECTURE

Having laid down all the theoretical background in Chapter 4, I now proceed to present the SuperCodelet architecture. This architecture uses all the elements presented in the SCM, and it allows the construction of a parallel system. In terms of system organization, I use Tomasulo's original architecture [57] as a basis, and I extend it to the upper levels of the hierarchical abstract machine of Figure 4.3. In particular, I focus on level 1 (L1) of this architecture, and I use both CPU and GPU cores as the execution unit. My intention is to show how to achieve parallelism on heterogeneous systems using the SuperCodelet architecture, demonstrating that my intention is not to go against the progress made up until now in heterogeneous architectures, but instead to build upon it.

The key element for parallel execution is to guarantee in-order fetch and commit in the presence of out-of-order execution. Among instructions that do not communicate with the outer world (all but memory instructions), it is important to maintain the order of true dependencies (i.e. read after write), while anti and output dependencies can be eliminated through Tomasulo's algorithm and register renaming techniques. For instructions that communicate with the outer world, the order in which these instructions are performed must be maintained. Maintaining a single order for such instructions guarantees sequential consistency, while there may still be some re-ordering opportunities for certain operations (e.g. read after read).

5.1 The Architecture organization

Figure 5.1 shows the diagram of the SuperCodelet architecture for a 3-level system organization. Level 2 is partially shown, with a light gray background surrounding the whole figure and the label L2 on the top right; L2 only pictures the reservation table, the L1 scheduling logic, and the L2 register file (i.e. L1 memory). Level 0, with a dark gray background near the bottom of the figure, is composed of several commodity architectures seen as black boxes. Finally, level L1, in mid-range gray, is shown completely, with three different L1 compute units displayed in the figure (two of which are drawn smaller at the top right of the L1 block). Let us focus our attention on Level 1.

Figure 5.1: Diagram of the SuperCodelet architecture. The L1 level comprises instruction fetch and decode, register renaming and a register allocation table, a reorder buffer and scheduler, a control flow unit and ALU, program-visible and hidden register files, a result bus, and reservation tables with scheduling logic for CPU, GPU, and FPGA units plus a memory ordering buffer, sitting above the L0 cores and the memory controller.

5.1.1 Scheduling units

The scheduling unit is in charge of fetching instructions and decoding them for the given computational unit of the level below. Our pipeline starts in the red block named “Instruction fetch”, which takes the current program counter of level 1 and reads the next L1 instruction. This instruction is given to the “instruction decode” stage, which determines the instruction's destination and its operands in the register file. Next, instructions are analyzed for potential “register renaming” opportunities, and these registers are allocated for the given instruction. As is the case in original Out-of-Order execution engines, some sort of lookup table is needed to maintain the information of which translation corresponds to which original register. This is the “register allocation table”. Next in the pipeline there is a “reorder buffer and scheduler”. The purpose of this stage is to seek hazard-free execution opportunities within a window of instructions. As instructions become ready to execute, they are assigned to the different “reservation tables”. An instruction may be ready even if a missing value has not been produced yet, as long as it will be produced by another core. Once an instruction is determined to be ready, it is allocated into one of the “reservation tables”, depending on which computation unit in L0 the Codelet maps to (i.e. GPU, CPU or FPGA).
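A minimal C++ sketch of a reservation-table entry follows, in the spirit of Tomasulo's reservation stations; the field names, sizes, and the result-bus callback are assumptions made for illustration and are not the SuperCodelet implementation.

#include <array>
#include <cstdint>
#include <optional>

// Illustrative sketch: an entry waits until every source operand is either a
// value already read from the register file or a result broadcast on the
// result bus by another unit. Names and sizes are assumptions.
struct SourceOperand {
  std::optional<uint64_t> value;   // present once the operand is available
  int producerTag = -1;            // reservation-table tag that will produce it
};

struct ReservationEntry {
  bool busy = false;
  int codeletOpcode = 0;                       // which L1 Codelet to run
  std::array<SourceOperand, 4> sources{};      // up to four operands per Codelet
  bool ready() const {
    if (!busy) return false;
    for (const SourceOperand& s : sources)
      if (s.producerTag >= 0 && !s.value) return false;  // still waiting on a producer
    return true;
  }
};

// Called when a computation unit broadcasts its result on the result bus.
void onResultBus(ReservationEntry* table, int entries, int tag, uint64_t value) {
  for (int i = 0; i < entries; ++i)
    for (SourceOperand& s : table[i].sources)
      if (s.producerTag == tag) { s.value = value; s.producerTag = -1; }
}

int main() {
  ReservationEntry table[8];
  onResultBus(table, 8, /*tag=*/3, /*value=*/42);
  return table[0].ready();   // 0: the empty entry is not ready
}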

5.1.2 Computation Units

There are four reservation tables in the diagram (this is not necessarily a hard limit; other architectures can be envisioned), one for each type of execution unit available: CPU, GPU, FPGA, and memory. For each reservation table there exists scheduling logic that checks whether the operation has all the required operands available to start execution. If so, one of the computational units for that particular reservation table is selected (i.e. GPU-like cores, CPU-like cores, or memory operations). In the diagram of Figure 5.1, I use four types of computation units: CPU-like cores that use commodity architectures (e.g. RISC-V, MIPS or ARM); GPU-like cores similar to those used inside recent GPGPUs, which have architectures based on vector machines such as the one used by the Cray-1 computer [12]; a regular ALU unit used for the basic arithmetic operations that are still needed for loops, offsets, address calculations, and other control flow operations; and, finally, an FPGA computational unit whose Codelets could be provided in the form of bit-streams, allowing reconfiguration of the FPGA to occur at runtime.

5.1.3 Memory Units

Memory operations require their own infrastructure. Communication with the upper level, L2, needs to happen in the program order in which the operations are defined. However, there is still some opportunity for re-ordering instructions. Under certain circumstances that do not affect the output of the result, two memory operations may be executed in an order different from the original program order. These circumstances are determined by the region of memory being accessed by each memory instruction and the operation to be performed (i.e. read or write). The memory ordering buffer can attempt to take care of these situations. Depending on the memory model implemented at the given level, the restrictions across memory operations may be strictly enforced or relaxed.

As we progress up the hierarchy, more complex memory operations may be required. While it is not possible to envision all possible memory access patterns, it is still possible to create 1D, 2D and 3D memory instructions, similar to what other ISAs have done [83]. Furthermore, an application may decide to create a special Codelet that performs memory operations. A special interface for this Codelet requires the calculation of the memory addresses accessed by the Codelet in order to detect potential hazards in memory accesses across different instructions. If the same diagram is used to describe L2 and beyond, memory operations will result in reads to the reservation table, and writes will be sent through the result bus and stored in the subscriptions of the given result.
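The kind of check the memory ordering buffer could apply can be sketched as follows; this C++ example assumes a simple model in which each memory operation declares the address range it touches, which is an assumption for illustration rather than the architecture's actual interface.

#include <cstdint>

// Illustrative sketch of a reordering test: two memory operations may swap
// order only when both are reads, or when the address ranges they touch do
// not overlap. The MemOp layout is an assumption.
struct MemOp {
  bool isWrite;
  uint64_t address;   // start of the accessed region (e.g. a register-sized block)
  uint64_t size;      // bytes accessed
};

bool overlaps(const MemOp& a, const MemOp& b) {
  return a.address < b.address + b.size && b.address < a.address + a.size;
}

bool mayReorder(const MemOp& first, const MemOp& second) {
  if (!first.isWrite && !second.isWrite) return true;   // read after read is always safe
  return !overlaps(first, second);                      // otherwise regions must be disjoint
}

int main() {
  MemOp readA{false, 0x100, 64}, readB{false, 0x100, 64}, writeA{true, 0x120, 64};
  // Two reads may be reordered; a read and an overlapping write may not.
  return (mayReorder(readA, readB) && !mayReorder(readA, writeA)) ? 0 : 1;
}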

5.1.4 Register File

There are two “register files”. One is accessible by the user and corresponds to all the different general purpose registers the user can use in the description of the program. A second register file is used for register renaming opportunities. This register file is hidden from the user and is accessible only by the architectural runtime. The size of the register file, as well as the size of the registers in the register file, is still a subject of research. I have explored some options based on the current size of L3 caches in commodity architectures; this will be explored further in the evaluation (Chapter 7). What is known for sure is that some of the registers must be of small sizes such as the ones used nowadays. These registers are needed to store loop variables, memory addresses and offsets. Otherwise, a lot of space would be wasted in large registers.

5.2 Programming model

In this section I discuss the low level software interface designed for users to program the system. As previously mentioned, a program written for the SuperCodelet model maps to the hierarchical organization of the machine. I leverage the work previously presented in [75]. Therefore, the program is written in levels. For each level of the program, the user must define the collection of Codelets that can be executed at that particular level, but which will be used as part of the instruction streams in the level above. That is, if a Codelet is to be used in the code for L2, then that Codelet needs to be defined in level L1 using instructions of level L0.

There is a subset of operations that is called the SuperCodelet ISA. These operations can be used at any level and will be specified in this section. Therefore, a Codelet for a certain level uses any of the supported basic SuperCodelet ISA operations, plus the Codelets of the level below. Native arithmetic operations may also be supported as extensions to the SuperCodelet ISA. An example is the Codelets of L1, which are described in terms of L0 and which will use the ISA of the particular CPU architecture of L0 (see Section 5.1.2). There is a top level Codelet that I refer to as the Main Codelet. Its job is to work as the entry point of the program. For now, for the sake of simplicity, let us restrict I/O operations so that only the Main Codelet can manage them.

5.2.1 SuperCodelet ISA

Codelets are defined as sequential programs similar to assembly code. In order to describe the control flow of instructions inside a Codelet, I provide the basic set of instructions that forms the bare minimum for the description of Codelets across all levels. I call this basic set the SuperCodelet ISA, and it is summarized in Table 5.1. In the table, the reader can observe two types of operands: registers (Reg) and immediate values (Imm). The first corresponds to a register name, while the second is an immediate value. Let us limit immediate values to a fixed number of bits, in order to limit the size of the instruction encoding in memory. I assume a size of 32 bits for now, but determining the appropriate size may require extra research. There are four families of operations: arithmetic operations, control flow operations, memory operations, and Codelet control operations. The first three families are self explanatory. The Codelet control operations coordinate the communication across levels. These instructions allow instantiating Codelets from one level onto the level below, and returning values after execution of the Codelet to let the upper level continue execution. That is, these operations allow communication between the levels. The COD operation spawns a new Codelet for execution on

the level below. The COMMIT operation informs the upper level that the Codelet has finished execution, and that the results are ready to be used.

Family          | Instruction | Op1     | Op2     | Op3     | Op4     | Operation
----------------|-------------|---------|---------|---------|---------|-----------------------------------
Arithmetic      | ADD/SUB     | Reg     | Reg     | Reg/Imm | -       | Op1 = Op2 +/- Op3
                | MULT        | Reg     | Reg     | Reg/Imm | -       | Op1 = Op2 * Op3
                | SHFL/SHFR   | Reg     | Reg/Imm | -       | -       | Op1 = Op1 << Op2 / Op1 = Op1 >> Op2
Control         | JMPLBL      | label   | -       | -       | -       | Goto label
                | JMPPC       | Reg/Imm | -       | -       | -       | PC = PC + Op1
                | BREQ        | Reg     | Reg     | Reg/Imm | -       | PC += (Op1 == Op2) ? Op3 : 1
                | BGT/BGET    | Reg     | Reg     | Reg/Imm | -       | PC += (Op1 > or >= Op2) ? Op3 : 1
                | BLT/BLET    | Reg     | Reg     | Reg/Imm | -       | PC += (Op1 < or <= Op2) ? Op3 : 1
Memory          | LDIMM       | Reg     | Imm     | -       | -       | Op1 = Op2
                | LDADR       | Reg     | Reg/Imm | -       | -       | Op1 = [Op2]
                | LDOFF       | Reg     | Reg/Imm | Reg/Imm | -       | Op1 = [Op2 + Op3]
                | STADR       | Reg     | Reg/Imm | -       | -       | [Op2] = Op1
                | STOFF       | Reg     | Reg/Imm | Reg/Imm | -       | [Op2 + Op3] = Op1
Codelet Control | COD         | Reg/Imm | Reg/Imm | Reg/Imm | Reg/Imm | Call Codelet with params
                | COMMIT      | -       | -       | -       | -       | Finish Codelet execution

Table 5.1: The SuperCodelet ISA basic set of instructions

I do not expect the current state of the SuperCodelet ISA design to be final. It is possible to extend the work done by ISA definitions of other architectures (e.g. RISC-V) to provide this basic set of instructions, while also allowing for upper level computation by means of a regular ALU. For now the SuperCodelet ISA works as a proof of concept. Another important aspect to mention is that at level L0, legacy hardware architectures are used. Therefore the Codelets written for the lowest level use the native ISAs supported by the respective architectures.

5.2.2 Codelet Definition

Codelets are user defined. Therefore, some flexibility is needed in terms of the number of operands, the type of operands (Reg or Imm), and the direction of these operands (read or write). Deciding the number of parameters in a Codelet, as well as the length of the encoded Codelet instruction, is a design choice equivalent to the distinction between CISC and RISC architectures. It is possible to consider a RISC-like architecture, but

for simplicity of the programming model I focus on a CISC-like design. Furthermore, the direction of an operand is important to be able to discover dependencies during out of order execution. It needs to be encoded somehow in the definition of the Codelet and be accessible to the runtime. For now, I propose the following semantics to be used in conjunction with the SuperCodelet ISA defined in Section 5.2.1.

def codname(type opname:dir, type opname:dir, ...):
    ;; Codelet Body
    COMMIT;
enddef

There must be an assembler or compiler that translates these definitions into actual executable code. The keyword def marks the definition of a Codelet, while enddef marks its end. To determine the level for which the Codelet is written, an integer is used inside the <> delimiter (e.g. def<1> for L1). At some point during the Codelet execution there must be a COMMIT instruction to finish the execution of the Codelet. Otherwise, there is an implicit COMMIT at the end of the Codelet definition. The number of operands is determined by the number of elements inside the parentheses. An operand has a type that is either a register or an immediate value. When the type is register, a size needs to be determined: the keyword reg<size> is reserved, where size is the size in bytes of the register that the given operand uses. For immediate values, the keyword Imm is used. Immediate values are of type signed, unsigned, float, or double, and they are at most 32 bits long. The direction of an operand is a 2 bit mask, with the LSB for read and the MSB for write. For example, a direction of 0b11 determines an operand that is read and written, while 0b01 is a read only operand.

5.2.3 An example program

I present a very simple program written for an architecture featuring 2 levels. First, I provide pseudo-code of the intended equivalent execution, but no translation strategy is provided. Furthermore, I do not intend this example to be an ideal use case of this architecture; it is only for the sake of syntax demonstration.

void mainCod(int &a, int &b, int &c) {
  if (a > b)
    c = a + b;  // L0 Codelet Sum
  else
    c = a - b;  // L0 Codelet Sub
}

Observe that the code written in C has no concept of levels. I have added comments to guide the reader through the translation process. The above pseudocode can be described as follows in the SuperCodelet Programming Model. Listing 5.1 corresponds to the definition of the main Codelet of level L1. Notice that the operand names a, b, and c are seen as addresses inside the Codelet definition, and therefore are used as arguments of the memory operations. This is because for L2 (which will be calling the Codelet) the operands are registers, but for L1 (which will be executing the Codelet) the operands are memory locations. In the case of immediate values, the value is propagated inside the Codelet at compile time. Registers used inside this definition correspond to registers of level L1. The convention for register naming inside the definition of a Codelet is R<size>_<regNum> (e.g. R8b_1 for the first 8-byte register).

def<1> main(reg<8> a:01, reg<8> b:01, reg<8> c:10):
    ;; a, b and c are the addresses in L2
    ;; of these arguments.
    LDADR R8b_1, a ;
    LDADR R8b_2, b ;
    BRLET R8b_1, R8b_2, R8b_3 ;
    COD sum R8b_1, R8b_2, R8b_3 ;
    JMPPC 2 ;
    COD sub R8b_1, R8b_2, R8b_3 ;
    STADR R8b_3, c ;
    COMMIT ;
enddef

Listing 5.1: Main Codelet definition at level L1

Following the example, I show the definitions of sum and sub in Listings 5.2 and 5.3 respectively. These two Codelets are created for level L0. The most important aspect of these two listings is that the code is written in RISC-V assembly code, in addition to the Codelet control operation COMMIT. Again, a, b, and c are addresses of L1. Note that registers a4 and a5 are both 4 bytes. If we were to use the whole 8 bytes, we would have to perform two lw operations. This behavior is expected and encouraged.

def<0> sum(reg<8> a:01, reg<8> b:01, reg<8> c:10):
    ;; Using RISC-V architecture
    ;; a, b, and c are the addresses in L1
    ;; of these arguments
    lw a4, -0(a)
    lw a5, -0(b)
    addw a5, a4, a5
    sw a5, -0(c)
    COMMIT ;
enddef ;

Listing 5.2: Sum Codelet definition at level L0

def<0> sub(reg<4> a:01, reg<4> b:01, reg<4> c:10):
    ;; Using RISC-V architecture
    ;; a, b, and c are the addresses in L1
    ;; of these arguments
    lw a4, -0(a)
    lw a5, -0(b)
    subw a5, a4, a5
    sw a5, -0(c)
    COMMIT ;
enddef ;

Listing 5.3: Sub Codelet definition at level L0

In order to create an evaluation strategy for this architecture, it is necessary to define simulation and emulation of the hardware and runtime. The following chapters provide an initial evaluation of the runtime.

Chapter 6

SCMULATE. AN EMULATION RUNTIME FOR THE SEQUENTIAL CODELET MODEL

The definition of the Sequential Codelet Model and its application in the SuperCodelet architecture (as explained in Chapters 4 and 5 respectively) are meant as the basis for the design of computer systems. These concepts enable a hardware/software co-design strategy that may be applied to general purpose compute architectures as well as high performance parallel systems. However, the success of hardware/software co-design relies heavily on both simulation and emulation of the desired system. Emulation imitates the behavior of a system while not necessarily performing the same steps. On the other hand, simulation aims to replicate the behavior of the system as faithfully as possible. Emulation provides us with early results expected from the construction of the system. Due to the complexity of computer systems, it is often necessary to mix and match both emulation and simulation approaches. To this end, I have created a runtime, namely SCMUlate (pronounced S.C.Emulate), that emulates the behavior of the Sequential Codelet Model with an organization similar to the one presented in the SuperCodelet architecture. The objective of this runtime is twofold: first, to perform an early evaluation of the SCM design when implemented with already existing architectures. Through the use of micro benchmarks, the runtime provides an assessment of the expected performance of Sequential Codelet programs on commodity hardware architectures. Furthermore, it evaluates the expected performance gains in programs optimized through Codelet Level Parallelism techniques, as seen in Section 4.5. Second, to create a strategy for exploring the large search space of the system design that implements the Sequential Codelet Model. There exists an

extensive set of design parameters such as: 1) appropriate register dimensions for different levels of the architecture, 2) efficient Codelet sizing in terms of memory and compute complexity, 3) the relationship between compute capacity, memory sizes, and memory bandwidth, and 4) other parameters that are required in the construction of an actual system. This chapter summarizes the implementation of the SCMUlate emulation platform. It introduces the software infrastructure, which consists of an assembly-like interpreter, a programming API for Codelets, and a runtime. It shows the strategy for creating Codelets, as well as for using them in a program. Furthermore, it shows the different Codelet Level Parallelism techniques that are implemented. SCMUlate is Open Source and publicly available at [84].

6.1 SCMUlate software design

The SCMUlate system is parallel software written in C++ that uses an Object Oriented Programming approach to emulate the different parts of the SCM model. SCMUlate is meant to run on commodity multicore hardware. SCMUlate creates a class for each of the elements of the SCM system. Figure 6.1 shows a simplified UML diagram of all the classes that are used by SCMUlate. The project is maintained on GitHub and uses a CMake build system for easy compilation and deployment.

6.1.1 The role of OpenMP

SCMUlate uses OpenMP for the implementation of the runtime, but not for the description of parallelism. The use of OpenMP is dispensable, and it could be replaced by a lower level PThreads implementation. Specifically, OpenMP is limited to initial runtime thread creation, synchronization of runtime signals, runtime begin/end barriers, and, in the case of GPU Codelets, taking advantage of GPU code generation in the compiler. An OpenMP parallel region is spawned after creation of the runtime. Each thread is assigned a different role (i.e. SU, CU, or memory) depending on the OpenMP thread ID. A configuration file statically determines the mapping

[Figure 6.1 depicts the main SCMUlate classes: the scm_machine top-level class and the common configuration and instruction description types; the machine modules (instruction_mem, fetch_decode, control_store, execution_slot, cu_executor, memory_interface, register_file, timer_counters, and the ilp_controller with its ilp_sequential and ilp_superscalar implementations); and the Codelet API (codelet, codelet_factory, system Codelets such as COD_print, and user-defined COD_* Codelets).]

Figure 6.1: SCMUlate UML Classes Diagram

between roles and threads. Additionally, after runtime initialization, it is necessary to synchronize the elements before starting the SCM computation. Likewise, once the SCM program has completed, it is necessary to synchronize the threads for runtime finalization and proper garbage collection. For these two moments an OpenMP barrier is used. Finally, atomic operations on control variables of the runtime (e.g. full/empty bits) are achieved by using OpenMP atomic directives. It is crucial to understand that I am not relying on OpenMP to perform parallel computation, but instead to manage and create the runtime of the SCMUlate emulator. Thus, once SCMUlate starts executing, parallel execution of compute code is driven only by the emulation of the SCM execution model.
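To make the thread-to-role mapping concrete, the following is a minimal sketch (not the actual SCMUlate code) of how an OpenMP parallel region can be split into SU, CU and memory roles by thread ID; the Role enumeration, the roleForThread() helper and the fixed mapping are illustrative assumptions, since SCMUlate reads this mapping from a configuration file.

#include <omp.h>
#include <cstdio>

// Illustrative roles: thread 0 acts as the SU, the last thread as the memory
// interface, and every other thread as a CU.
enum class Role { SU, CU, MEM };

static Role roleForThread(int tid, int nthreads) {
  if (tid == 0) return Role::SU;
  if (tid == nthreads - 1) return Role::MEM;
  return Role::CU;
}

int main() {
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();
    Role role = roleForThread(tid, omp_get_num_threads());

    // Runtime begin barrier: all roles are in place before emulation starts.
    #pragma omp barrier

    switch (role) {
      case Role::SU:  std::printf("thread %d: scheduler loop\n", tid);       break;
      case Role::CU:  std::printf("thread %d: waiting for Codelets\n", tid); break;
      case Role::MEM: std::printf("thread %d: memory interface\n", tid);     break;
    }

    // Runtime end barrier before finalization and cleanup.
    #pragma omp barrier
  }
  return 0;
}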

6.1.2 Folder infrastructure

The project is organized in a src/include/test folder infrastructure, for source code, header files and tests respectively. The source code is divided into subfolders for the different parts of the program. A common folder contains header files with system configuration variables (e.g. size of queues and number of operands) and thread configuration variables (i.e. CU and SU thread mapping). Additionally, this folder contains the semantic description of the SCM basic ISA (as described in Section 6.2.3) in the form of regular expressions for the different ISA instructions, as well as the format used for each instruction after it has been decoded by the interpreter. Finally, this folder contains helper functions for Info/Error/Warning message formatting, and string manipulation tools. Next, there is a modules folder that contains different files, one for each part of the SCM machine. An SCM module maps one to one to a physical block of the SCM machine, working as a building block of the overall system. Each module is represented as one or more classes in C++. Currently, the following modules have been created:

• Fetch/Decode: SU logic, control flow and basic arithmetic instruction evaluation.

• Executor and Memory Interface: CU logic and memory interface logic. It contains the runtime function that checks for available work and executes it. Currently, the CU acts as both Memory and Compute unit.

• Control Store: Similar to the reservation tables described in the SuperCodelet architecture. This module contains the execution slots (or queues) that connect the SU with the CUs, allowing the SU to schedule work.

• Instruction Memory: Storage for the program instructions, as well as the entry point for the interpretation of the SCM assembly program.

• Register File and Registers Configuration: Containing the data structure that represents the register file with the different register sizes. It also allows for dumping the register file, as well as definition of the available register sizes.

• Instruction buffer: Buffer for the window of instructions that are currently inside the pipeline.

• ILP controller: Different implementations of Instruction Level Parallelism techniques. It modifies the ability of the system to discover dependencies and eliminate false dependencies. Currently there are three modes of execution: Sequential, Superscalar, and Out of Order.

• Timers and Counters: Representation of hardware counters; it is in charge of profiling as well as interfacing with real hardware counters.

Continuing with the folder infrastructure, there is a Codelet model folder that contains the API for the description and definition of Codelets. This is the programming API for defining Codelets of level L1 in terms of code of level L0. We take advantage of C++ compiler code translation for generating the Codelets in L0. Additionally, there is a folder for system Codelets that contains pre-defined Codelets that can be used in any application. Currently, the only system Codelet that exists is cod_print, which prints the content of a register. The last folder that composes the source code of the emulator is the machines folder. It contains the top level class of the SCM machine emulation. This class works as the entry point for the whole emulator as well as a container for all the different modules that compose the machine. Finally, there is an apps folder in the top level of the project that contains the different microbenchmarks used.

6.1.3 The SCM Machine emulation

There are three different roles a thread can be assigned in SCMUlate: Scheduler (SU), Compute (CU), and Memory (MEM) (as described in Sections 5.1.1, 5.1.2, and 5.1.3 respectively). Threads that are assigned the compute unit role evaluate (fire) Codelets by calling their implementation function. Threads that are assigned the memory role are in charge of the interaction with the memory of the level above. That is, they perform load and store operations. Emulation of the memory hierarchy and memory operations is explained later in this document together with the construction of the register file. Finally, a single thread is assigned the Scheduler role. This thread helps emulate the sequential pipeline of the SCM machine, that is: fetch, decode, scheduling of units, basic ALU operations, and Codelet Level Parallelism bookkeeping. This thread is not considered to be part of the computation of the program; instead it emulates the behavior of the SCM machine pipeline. The Sequential Codelet Abstract Machine (see Section 4.3) associates a single SU with multiple CUs in each level. The current implementation only supports emulation of a single L1 level. Hence, there is only a single SU that manages all other CUs. Future implementations may include multiple L1 components and emulation of other levels of the SCM hierarchy, therefore requiring multiple SUs. There is a class for each element of the SCM model. In the following, I discuss all the different classes, as described in Figure 6.1, that compose the machine emulation.

6.1.3.1 Instructions and Instruction Memory

Instructions are encoded in a format that resembles the binary encoding of commodity ISAs. An interpreter takes the assembly-like program in string format, and translates it into a list of objects of class decoded_instruction_t. The output of the interpreter is an STL vector of instructions representing the instruction memory, where the offset in the vector corresponds to the instruction address, as fetched by a program counter of level L1.

A decoded instruction has an instruction type (e.g. control instruction, memory instruction, or Codelet), an opcode, a list of operands, a bit mask that contains the direction of all the operands (read, write or readwrite), and other instruction type specific information. The decoded instruction also contains the strings for the instruction and operands for debugging purposes. There are five types of instructions defined by the enumerator class instType: 1) a CONTROL instruction type that refers to an operation that directly modifies the program counter (PC) through jumps or conditional jumps. 2) A BASIC_ARITH instruction type corresponding to basic arithmetic operations (e.g. ADD, SUB, and MULT). 3) The EXECUTE instruction type which corresponds to Codelets. 4) A MEMORY instruction type that denotes a memory load/store operation. And, 5) the COMMIT instruction type which marks the end of the execution of a Codelet. All but the EXECUTE instructions are determined by the SuperCodelet Codelet Set Architecture explained in Section 5.2.1. On the other hand, EXECUTE instructions are user defined and require the creation of a Codelet object, as will be explained later in this chapter. Codelets are created and stored in the decoded_instruction_t class. Memory operations require careful handling due to their ability to communicate with the upper level memory and the need to keep memory operations in order. When an instruction is of type MEMORY, the decoded_instruction_t also stores the ranges of memory addresses that it uses. These ranges allow the runtime to discover memory dependencies and maintain their order. Memory ranges are a pair of memory sets, one for read operations and one for write operations. Operands are encoded in their own class as well. There needs to be a differentiation between immediate values, labels and register names in an operand. Also, each operand has a direction that will be used during data dependency analysis. The operand_t class stores operand information. This class contains an operand type (i.e. REGISTER, IMMEDIATE or LABEL) and an operand value stored in a union. There are also boolean flags that indicate if the instruction reads and/or writes the register in the operand. A full/empty bit is also included; it indicates whether the value is available at runtime or not

(for certain Codelet Level Parallelism execution modes, as explained later in this chapter). A union value has been used since it allows storing either an immediate value, the PC location of a label, or the address of a register within the register file. Registers are represented as pointers to the location of the specific register. The register file module provides the necessary methods to translate a register name into its corresponding pointer. The address is determined by the register size and number. I remind the reader that these registers refer to registers of level L1 of the architecture and not to the architectural registers of level L0 (see Chapter 4 for more information). There is support for using labels in the L1 assembly code to name specific locations in the instruction memory and jump to them through control flow instructions. When a label is encountered during interpretation of the assembly code, its location is stored in a map that associates a label string with a memory location. Then, during operand decoding, the immediate value of the operand_t is set to the program location of the specific label.
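As a rough illustration of the description above, the sketch below lays out an operand and a decoded instruction; all field and type names here are simplified assumptions, not the exact SCMUlate declarations.

#include <cstdint>
#include <string>
#include <utility>
#include <vector>

enum class OperandType { REGISTER, IMMEDIATE, LABEL };
enum class InstType { CONTROL, BASIC_ARITH, EXECUTE, MEMORY, COMMIT };

// An operand: a register (pointer into the register file), an immediate,
// or a label resolved to an instruction address, plus direction flags and
// the full/empty bit used by some execution modes.
struct operand_sketch_t {
  OperandType type;
  union {
    unsigned char *reg;   // location of the register in the register file
    std::uint64_t imm;    // immediate value
    std::uint64_t label;  // program counter of a resolved label
  } value;
  bool read  = false;
  bool write = false;
  bool full  = false;
};

// A decoded instruction: type, opcode string (kept for debugging), operands,
// and, for MEMORY instructions, the address ranges it reads and writes.
struct decoded_instruction_sketch_t {
  InstType type;
  std::string opcode;
  std::vector<operand_sketch_t> operands;
  std::vector<std::pair<std::uint64_t, std::uint64_t>> readRanges;
  std::vector<std::pair<std::uint64_t, std::uint64_t>> writeRanges;
};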

6.1.3.2 Fetch, decode and execute

The fetch_decode class is in charge of scheduling instructions according to the current value of the program counter and the instruction's state. This module uses the instruction_buffer module to represent the window of instructions that are currently inside of the SCM execution pipeline. This buffer has an STL deque container that stores (decoded_instruction_t, instruction state) pairs. The fetch_decode uses the fetch() method in the instruction memory to obtain the instruction at the current program counter (PC) and inserts a copy of the decoded_instruction_t into the instruction_buffer. The fetch_decode also contains an ILP controller that manages the execution mode of the architecture. The ILP controller is in charge of data dependency management and bookkeeping for the different execution modes (e.g. sequential, superscalar or Out of Order).
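The instruction window can be pictured as a deque of (instruction, state) pairs that the fetch step appends to, along the lines of the sketch below; the class name, the stand-in instruction type, and the two methods shown are assumptions made only to illustrate the mechanism.

#include <deque>
#include <string>
#include <utility>

// Minimal stand-in for a decoded instruction.
struct inst_sketch_t { std::string opcode; };

enum class InstState { WAITING, READY, EXECUTING,
                       EXECUTION_DONE, DECOMMISSION, STALL };

// Window of in-flight instructions: a deque of (instruction, state) pairs.
// Execution slots hold pointers to these pairs; std::deque keeps references
// valid as long as elements are only pushed/popped at the ends.
class instruction_buffer_sketch {
 public:
  using entry_t = std::pair<inst_sketch_t, InstState>;

  // Insert a copy of the fetched instruction in the WAITING state.
  entry_t *push(const inst_sketch_t &inst) {
    buffer.emplace_back(inst, InstState::WAITING);
    return &buffer.back();
  }

  // Drop instructions marked for decommission from the head of the window.
  void cleanFront() {
    while (!buffer.empty() && buffer.front().second == InstState::DECOMMISSION)
      buffer.pop_front();
  }

 private:
  std::deque<entry_t> buffer;
};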

An instruction inside the instruction_buffer can be in one of six states. The instruction state represents the current status of the instruction in the pipeline. However, the use and interpretation of each state depends on the execution mode of the Codelet Level Parallelism, as we will discuss in Section 6.1.3.4. The six available states are: WAITING, READY, EXECUTING, EXECUTION_DONE, DECOMMISSION, and STALL. The decode process uses the instruction's type to determine the unit in which the instruction should be scheduled. If the instruction is of type CONTROL, BASIC_ARITH, or COMMIT, the fetch_decode class will execute it. CONTROL instructions directly modify the value of the PC. BASIC_ARITH corresponds to basic Arithmetic Logic Unit operations (ADD, SUB, and MULT). And the COMMIT instruction finishes the execution of the current Codelet. Given that SCMUlate currently only emulates one unit of level L1, this instruction finishes the execution of the whole program. The EXECUTE and MEMORY instruction types need to be assigned to a Compute Unit (CU), represented by the cu_executor class. The communication between fetch_decode and the cu_executor occurs through the control_store class. The control_store class uses execution slots. These slots are queues that represent the reservation tables of the SuperCodelet architecture. These queues contain pointers to the pairs stored in the instruction_buffer. Each CU has its own execution slot, and the CU thread constantly iterates over it waiting for work to be assigned. The structure of the Codelet class, and how new Codelets are defined, will be explained later in this section. When the cu_executor receives an instruction of type EXECUTE, it evaluates the Codelet by calling its implementation() method. The Codelet object and its arguments are first created during interpretation of the user's Codelet assembly program, and stored in the decoded_instruction_t class. When the decoded_instruction_t is copied, the Codelet is also copied. The cu_executor uses the memory_interface to execute instructions of MEMORY type. Some instructions of the basic Codelet ISA allow loading and storing registers from and to the upper level memory. There is also a special kind of Memory Codelet that requires a memory range describing what regions

the Codelet accesses. The user is in charge of providing the function that calculates the required regions.

6.1.3.3 The L1 register file on cache based systems

The memory hierarchy of the Sequential Codelet Model is a key property that distinguishes it from other models. However, when emulating SCM on commodity hardware based on von Neumann architectures, the control over the memory hierarchy is limited. In particular, systems that rely on cache memory do not allow fine grain control of memory allocation in the on-chip memory. To that end, the following mapping approach is proposed. Cache is a mechanism that sits between the core and main memory, and it aims to reduce memory access time. For each location in main memory, there is a set of possible locations in cache that temporarily store the values of the main memory location. Cache takes advantage of temporal locality and spatial locality. The former happens when consecutive memory accesses to the same location occur. The latter is exploited by fetching a larger memory region every time a location is accessed, therefore fetching more values around the original memory location. The term cache line is used to refer to the size of the memory block handled by the cache. The cache line is bigger than the main memory access word, and it is usually 64 bytes. Cache is arranged in a hierarchical organization. Level 1 and level 2 caches are usually private to each core. On the other hand, level 3 cache is usually shared across the different cores of the chip. Often, the higher the cache level, the larger the memory space. In order to emulate the register file, two assumptions are made:

1. For a consecutive memory region allocated in physical memory, if the size of that region is the same as the size of the cache, then every memory location in this region will be assigned a different location of that cache if accessed consecutively, therefore allowing indirect control of cache allocation. And,

2. the overhead of the runtime is low enough to not cause cache thrashing, therefore allowing memory accesses that are consecutive in time to always land in cache.

[Figure 6.2 sketches the register file emulation: the SU, CUs and MEM threads of the multicore chip access registers R1 through R6, which live in a main memory region sized to fit in the L3 cache; memory instructions copy data between this cached register file region and the region of main memory that represents the L2 memory.]

Figure 6.2: Register file emulation mechanism on SCMUlate

While these two assumptions are not necessarily always true, we expect them to hold for most of the execution. Therefore, the register file in the L1 abstraction of the SCM is emulated by allocating a memory region as large as the level 3 cache. Level 3 cache is used since the register file is shared among all the cores, allowing Codelet-to-Codelet communication through the last level of cache. Figure 6.2 shows the whole emulation process. The register file region is allocated consecutively, but divided into multiple registers of different sizes. In the figure, all Rx blocks in L3 are registers. We use multiples of the cache line size for alignment. For example, among the registers shown in the figure, R1 and R2 may be 1 cache line long, R3 and R4 100 cache lines long, and

R5 and R6 1000 cache lines long. When a memory instruction is issued, data from a region in main memory (that represents a location of L2 memory in the

SCM abstraction) is copied to the location of R6 in main memory. Due to caching, the location will also be fetched into level 3 cache. Therefore, accesses from the different CUs to this region of memory will (allegedly) have a shorter access time. The register_file class is in charge of allocating a memory region as large as the

level 3 cache. The datatype is char, since this type is usually 1 byte long. However, registers of different sizes need to be accessed independently. Therefore, a C++ union is used to overlap the allocated memory region with the different registers. Furthermore, methods are provided to translate between register names (strings in the assembly Codelet code), the actual locations in memory, and the sizes in bytes. Finally, a debugging method is used to dump the content of the whole register file.
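A minimal sketch of this overlap is shown below, assuming a 32 MiB last level cache and made-up register counts and sizes (SCMUlate takes the real values from its configuration files); only the union trick itself is the point here.

#include <cstddef>
#include <cstdio>

constexpr std::size_t CACHE_LINE  = 64;                    // bytes
constexpr std::size_t REG_FILE_SZ = 32ull * 1024 * 1024;   // assumed L3 size

// The flat char allocation is overlapped with register arrays of different
// sizes, all multiples of the cache line for alignment.
union register_file_sketch {
  unsigned char raw[REG_FILE_SZ];
  struct {
    unsigned char R64B[8][CACHE_LINE];            // small, 1-line registers
    unsigned char R8KB[32][128 * CACHE_LINE];     // medium registers
    unsigned char R1MB[16][16384 * CACHE_LINE];   // large registers
  } regs;
};

int main() {
  auto *rf = new register_file_sketch;  // heap allocated: the union is large
  std::printf("register file size: %zu bytes\n", sizeof(rf->raw));
  std::printf("one R1MB register:  %zu bytes\n", sizeof(rf->regs.R1MB[0]));
  delete rf;
  return 0;
}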

6.1.3.4 Codelet Level Parallelism

So far I have described the different components of the SCM machine. However, in order to achieve parallelism, it is necessary to create the out of order mechanisms. The ilp_controller class contains the logic for bookkeeping the dependencies between instructions. There are currently three implementations in place: 1) a sequential version that does not exploit any parallelism whatsoever. 2) A superscalar version that schedules instructions in parallel if there are enough CUs available and there are no dependencies between them. And 3) an out of order version that performs register renaming, eliminating false (anti- and output-) dependencies during code execution. A fourth version is being developed that aims to optimize memory operations to reduce the number of copies needed for executing loads and stores. For ilp_superscalar and ilp_ooo (out of order) it is necessary to keep track of memory locations. A memory_queue_controller class serves as memory controller to determine if an instruction can be executed or not. This class emulates the logic in the memory stage of the SCM machine (see the SuperCodelet description), which discovers dependencies between memory operations. This class stores the memory ranges currently under execution, and it provides a method to determine if a given range overlaps with any of the other instructions that are currently in the pipeline.

6.1.3.4.1 Sequential execution mode:

The ilp_sequential is the simplest version of the ILP controller. The controller has a full/empty bit that determines if there is an instruction in the execution

stage. Under this mode of operation an instruction is fetched and placed in the WAITING instruction state. If there are no instructions in the execution stage, as determined by the full/empty bit, the instruction is set as READY; otherwise the instruction is set as STALL. Ready instructions are then assigned to the appropriate execution unit (SU or CU) and placed in the EXECUTING state; once the execution is finished, the instruction state is changed to EXECUTION_DONE. At this point the full/empty bit of the ilp_sequential controller is set to empty, and the instruction is marked for DECOMMISSION. The instruction_buffer is in charge of removing the instructions that are marked for decommissioning.
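In essence, the sequential controller is a single full/empty bit, roughly as in the sketch below; the simplified method signatures are assumptions, although the method names follow the ones mentioned later in this chapter.

// A rough sketch of the sequential ILP controller: one full/empty bit
// serializes the pipeline so that at most one instruction executes at a time.
class ilp_sequential_sketch {
 public:
  // Returns true if the fetched instruction may move from WAITING to READY.
  bool checkMarkInstructionToSched() {
    if (busy) return false;  // an instruction is executing: STALL
    busy = true;             // claim the single execution slot
    return true;
  }

  // Called when the instruction reaches EXECUTION_DONE.
  void instructionFinished() { busy = false; }

 private:
  bool busy = false;         // the full/empty bit
};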

6.1.3.4.2 Superscalar execution mode:

The ilp_superscalar controller uses register names and operand directions to determine whether two instructions depend on each other. Read after read dependencies (RAR) and completely independent instructions are allowed to execute in parallel. When an anti-, output- or true-dependency is encountered, the pipeline is stalled. Under this mode of operation, instructions start in the WAITING state. The ilp_superscalar controller has a set of busy registers that stores the register name and direction. This set is used to determine if an instruction can be scheduled or not. If there are no dependencies, the instruction state is set to READY and the fetch_decode unit assigns it to the corresponding unit (SU or CU). If the instruction is of type MEMORY or it is a memory Codelet, then the memory controller is used to determine if there are any memory dependencies. If the instruction has a data dependency in registers or memory, the instruction is set to the STALL state, and the pipeline is stalled. After an instruction has finished execution, its state is set to EXECUTION_DONE. The instructionFinished() method in the ILP controller is then called to remove the busy registers and memory ranges (if any). Once the ILP controller is updated, the instruction is marked for decommission.
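The following sketch illustrates the busy-register check under simplified, assumed types: only RAR and fully independent instructions pass, anything else stalls. The real ilp_superscalar additionally consults the memory controller for MEMORY instructions and memory Codelets.

#include <map>
#include <string>
#include <vector>

// Simplified operand view: register name plus read/write direction.
struct reg_use_t {
  std::string reg;
  bool read;
  bool write;
};

class ilp_superscalar_sketch {
 public:
  // Schedule only if, for every operand, the register is free or every use
  // involved (in flight and incoming) is a read, i.e. only RAR is allowed.
  bool canSchedule(const std::vector<reg_use_t> &ops) const {
    for (const auto &op : ops) {
      auto it = busy.find(op.reg);
      if (it == busy.end()) continue;                  // register not in flight
      if (it->second.write || op.write) return false;  // RAW/WAR/WAW: stall
    }
    return true;
  }

  // Record the registers of a scheduled instruction as busy.
  void markScheduled(const std::vector<reg_use_t> &ops) {
    for (const auto &op : ops) {
      auto &entry = busy[op.reg];
      entry.write |= op.write;
      entry.count += 1;
    }
  }

  // Release the registers once the instruction reaches EXECUTION_DONE.
  void instructionFinished(const std::vector<reg_use_t> &ops) {
    for (const auto &op : ops)
      if (--busy[op.reg].count == 0) busy.erase(op.reg);
  }

 private:
  struct busy_t { bool write = false; int count = 0; };
  std::map<std::string, busy_t> busy;  // busy registers and their direction
};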

6.1.3.4.3 Out of Order (OoO) execution mode:

The ilp_ooo controller is the most complex controller of the three. There are six different containers storing information regarding busy registers, renamed registers, already processed instructions, and dependency management for when an instruction finishes. A register state determines the current operation being applied to that register. A register state can be NONE, READ, WRITE or READWRITE. When an instruction is fetched, the current state of each register is compared against the instruction's operands of type register, and their directions, to determine dependencies or make decisions about register renaming. Additionally, the ilp_ooo has a hidden register file of its own used for renaming. The first container is the used map. This map stores the current state of a register inside of the execution pipeline. It associates a register name with a state. If a register is used by any operation in the pipeline, then this container will include the register and the operation being applied. This container is used to determine the relationship between the incoming instruction and the instructions that are already present in the instruction_buffer. Depending on the direction of the instruction's operands and their state values in the used map, an instruction may be marked as ready, left waiting, or stall the pipeline. Likewise, a register may be renamed to remove output dependencies across instructions. Two more containers, registerRenaming and renamedInUse, track already renamed registers. The former is a map that associates the original register name with the newly renamed register name. This is used to re-associate any further read reference to this register in future instructions. The latter is a set that stores all the already used registers in the hidden register file, to avoid re-utilizing a renamed register. Another container is the reservationTable. This is a set of pointers to pairs in the instruction_buffer previously described. This set stores all the instructions that have already been processed and which should skip the dependency analysis. It is important to run the dependency analysis only once, when the instruction is fetched.
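Put together, the controller's bookkeeping containers can be summarized roughly as in the sketch below. The member names mirror the ones described in the text (used, registerRenaming, renamedInUse, reservationTable, subscribers, broadcasters), while the element types are simplified assumptions.

#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct decoded_instruction_sketch_t { std::string opcode; };  // stand-in
using inst_state_pair_t = std::pair<decoded_instruction_sketch_t, int>;

enum class RegState { NONE, READ, WRITE, READWRITE };

// (pointer to an in-flight instruction, operand number): a value consumer.
using instruction_operand_ref_sketch_t = std::pair<inst_state_pair_t *, int>;

struct ilp_ooo_sketch {
  // Current use of each register by instructions already in the pipeline.
  std::map<std::string, RegState> used;

  // Original register name -> hidden register chosen for renaming, and the
  // hidden registers that are already taken.
  std::map<std::string, std::string> registerRenaming;
  std::set<std::string> renamedInUse;

  // Instructions that already went through dependency analysis.
  std::set<inst_state_pair_t *> reservationTable;

  // Consumers waiting on a register value; broadcasters covers the special
  // READWRITE case where the value must be copied into a renamed register.
  std::map<std::string, std::vector<instruction_operand_ref_sketch_t>> subscribers;
  std::map<std::string, std::vector<instruction_operand_ref_sketch_t>> broadcasters;
};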

[Figure 6.3 shows an instructions buffer holding four instructions and the corresponding subscribers map. Instruction A writes R2048L_1 and reads R64b_2; B reads R2048L_1; C reads R1L_3 and R2048L_1; D reads R2048L_1 and writes R1L_3. The keys of the subscribers map are register names and the values are lists of (instruction buffer reference, operand number) pairs: R2048L_1 → (B, 1), (C, 2), (D, 1); R64b_2 → (A, 2); R1L_3 → (C, 1), (D, 2).]

Figure 6.3: Example of the content in the subscribers map with respect to the instructions in the instructions buffer

The remaining two containers are meant to emulate the bus of Tomasulo's algorithm, which inspired this design. It connects the production of an operand value by one instruction to the respective consumer operand in another instruction. We define a new type of pair, (decoded_instruction_t*, operand number), called instruction_operand_ref_t, which associates an operand number with an instruction in the instruction_buffer. A container called subscribers is a map that keeps a reference to all the instruction_operand_ref_t that read a given register. An example of the relationship between the instruction_buffer and the subscribers map is depicted in Figure 6.3. The key of this map is the register name, and the value is a vector of instruction_operand_ref_t, each containing a reference to a location in the instruction_buffer and the corresponding operand number. This container allows a write operation that has finished to enable all its consumers. Furthermore, it allows us to remove a register from the used container when there are no other instructions referencing this register. In the example of Figure 6.3, once instruction A has finished, the ILP controller will enable operand 1 of instructions B and D, and operand 2 of instruction C. Furthermore, the register R64b_2 will have no more references once instruction A has finished; therefore, it may be removed from the used container. Register R1L_3

is currently used in instructions C and D. Since there is no instruction writing to this register, these two instructions will be ready to execute. Once A finishes, all three instructions B-D will be marked as ready. A second map, called broadcasters, has a similar functionality, but it serves the special case when instruction operands feature a READWRITE direction. When an operand is READWRITE, the content of the register will be modified. If a register has subscribers, then it is necessary to rename the operand and copy the original value of the register into the newly renamed register. However, if the register's state in the used map is WRITE, the value is not yet available, and the instruction may not proceed. When this happens, a broadcaster subscription is issued instead of a regular subscription. When the write finishes, the value of the register will be copied to the renamed register of the READWRITE operand. For example, let us change operand 1 of instruction D of Figure 6.3 to a READWRITE direction. When instruction A finishes, all three instructions B-D will require the resulting value. However, if D executes in parallel with B and C, D may change the value of register R2048L_1 before B and C read its content. To avoid this, instruction D's operand 1 is renamed to another register, say R2048L_ren_1. When A finishes execution, the resulting value of R2048L_1 is “broadcast” (copied) to R2048L_ren_1, allowing all three instructions B-D to be executed in parallel. All three ilp_controller implementations use two methods: checkMarkInstructionToSched() and instructionFinished(). The former is executed every time a new instruction is fetched; the latter is executed every time an instruction has finished.
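A small illustration of the subscription mechanism, with assumed names and simplified types: when the producing write of a register finishes, each subscribed consumer operand is marked as available and the register's entry is dropped.

#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct operand_slot_t { bool full = false; };            // full/empty bit
struct pending_inst_t { std::vector<operand_slot_t> ops; };

// (instruction waiting in the buffer, operand number) -- one consumer.
using consumer_ref_t = std::pair<pending_inst_t *, int>;

void writeFinished(std::map<std::string, std::vector<consumer_ref_t>> &subscribers,
                   const std::string &reg) {
  auto it = subscribers.find(reg);
  if (it == subscribers.end()) return;
  for (auto &[inst, opNum] : it->second)
    inst->ops[static_cast<std::size_t>(opNum)].full = true;  // value now available
  subscribers.erase(it);                                     // no more references
}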

6.1.4 Configuration and common structures

There exists a set of configuration files that contain parameters defining the different components of the SCM emulation. Most of the parameters are defined through C macros. Starting with the register file, configuration parameters include register sizes, total register file size, register names and helper functions. Second, thread role assignment for

the different OpenMP threads, as well as the number of CUs, are determined in another file. Third, the ILP mode of the executing machine (i.e. sequential vs superscalar) is determined through an enumeration variable that is defined in the configuration files. Additionally, the size of the reservation table used by the ILP superscalar class, the maximum number of operands supported by the instructions in the ISA, the size of the instruction buffer, and the execution slots are all configurable as well. In addition to configuration parameters, there is a set of string manipulation helper classes that are used throughout the interpretation of the assembly. Besides, the definition of the different supported instructions in the assembly Codelet program requires the use of regular expressions that determine the structure of each instruction, as well as labels and comments in the assembly file. An extensive set of assembler operations is also defined in the common files. Finally, we have created different helper macros to include debugging information in the runtime. This information has different levels of verbosity and distinguishes between errors, warnings and information messages. These tools are an adaptation of previously defined macros used in a different software project for the creation of OpenMP Validation and Verification tests [21]. Debugging information and the level of verbosity can be controlled by means of a defined macro that is accessible through CMake.

6.1.5 The Codelet Class

Thanks to the use of inheritance in object oriented programming, it is possible to create an interface class that works as a recipe for other classes. Interfaces also allow creating inherited Codelet classes with user defined behaviors. Therefore, the use of an interface seems natural for the creation of user defined Codelets. However, there is still a challenge to overcome. C++ does not support creating objects from type names at runtime; information about the class name is usually lost at runtime. On the other hand, the creation of an interpreter for the assembly Codelet code requires runtime matching of class constructors. A factory method pattern is used to solve this issue.

First, the SCMUlate runtime needs to be aware of the names of the user defined Codelets in order to connect them at runtime when interpreting the SCM assembly. C++ does not support dynamic type creation at runtime by default. The RTTI is limited to type information; for the construction of objects from strings, it is necessary to use a factory method pattern that allows runtime creation of any Codelet. First, a registration process occurs at the beginning of the SCMUlate runtime. Compilers often support attributes that let the user annotate code and modify the behavior of the compiler, linker or runtime. The __attribute__((constructor)) attribute is supported by most compiler vendors. It allows a function to be called before the beginning of a program. Special care needs to be taken when it comes to the construction of data structures that have dependencies. This is due to the lack of a guaranteed order of evaluation of the different functions that use the constructor attribute. Our approach is to create a static class named codeletFactory. This class contains a map between a string and a Codelet constructor. A registration method called registerCreator() is added to this class to allow Codelet constructors to be added to the map. Additionally, a static method is included as part of the definition of user Codelet classes. This method, called codeletRegisterer(), is marked with the __attribute__((constructor)) attribute, therefore it is executed at the beginning of the program. A second static method, called codeletCreator(), is used to encapsulate the Codelet constructor to overcome some limitations. Yet another static method, namely registerCodelet(), directly adds the codeletCreator() method to the codeletFactory. Finally, the codeletRegisterer() calls registerCodelet(). This somewhat involved process removes the burden on the programmer of having to register each Codelet at the beginning of the program. A memory Codelet is a Codelet that accesses memory directly. A memory Codelet must be marked as such at runtime, so the ilp_controller can properly identify the memory dependencies. A memory Codelet has an additional initialization process

that calculates the memory regions that are read or written during the execution of the Codelet. These regions need to be calculated and exposed to the SCMUlate runtime before starting the execution of the Codelet, to guarantee that no dependencies are violated. Finally, to support parameters of different types (i.e. registers and immediate values), there is a codelet_params class that stores a list of Codelet parameters. Each parameter is an unsigned char * which can be interpreted as a register, or as an immediate value of up to the size of a pointer in the given architecture (usually 64 bits). In addition, a parameter has a read/write direction and a flag indicating if it corresponds to a memory address or not. A missing memory address parameter must stall the execution of the pipeline, guaranteeing that ranges can be calculated. The user casts an operand to the appropriate type by using the templated method getParamAs(int paramNum). The user must know what the operand type is and handle it appropriately. When there are multiple parameters, this pointer is cast to an array of unsigned char pointers. In the following section we explain how to use macros to create Codelets, as well as the different instructions to be included in the assembly Codelet program.
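The registration chain can be sketched as follows; the class and method names follow the text, but the signatures are simplified assumptions, the map is held in a function-local static to sidestep the construction order issue mentioned above, and __attribute__((constructor)) is the GCC/Clang spelling of the attribute.

#include <map>
#include <string>

struct codelet_params_sketch {};                 // simplified parameter pack
struct codelet_sketch {                          // simplified Codelet interface
  virtual void implementation() = 0;
  virtual ~codelet_sketch() = default;
};

// Factory: maps the Codelet name used in the assembly program to a creator
// function, so the interpreter can build Codelet objects from strings.
struct codeletFactory_sketch {
  using creator_t = codelet_sketch *(*)(codelet_params_sketch);

  static std::map<std::string, creator_t> &creators() {
    static std::map<std::string, creator_t> m;   // created on first use
    return m;
  }
  static void registerCreator(const std::string &name, creator_t c) {
    creators()[name] = c;
  }
  static codelet_sketch *createCodelet(const std::string &name,
                                       codelet_params_sketch p) {
    auto it = creators().find(name);
    return it == creators().end() ? nullptr : it->second(p);
  }
};

// What the DEFINE_CODELET/IMPLEMENT_CODELET macros would roughly generate.
struct cod_myCod_sketch : codelet_sketch {
  void implementation() override { /* Codelet body */ }
  static codelet_sketch *codeletCreator(codelet_params_sketch p) {
    (void)p;
    return new cod_myCod_sketch();
  }
};

// Runs before main() and registers the creator under the name "myCod".
__attribute__((constructor)) static void registerMyCod() {
  codeletFactory_sketch::registerCreator("myCod",
                                         &cod_myCod_sketch::codeletCreator);
}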

6.2 Programming API and the assembly Codelet program

So far we have described the design and organization of the SCMUlate software. This section describes the application programming interface (API) that allows a user to define new Codelets and create assembly Codelet programs.

6.2.1 Codelets definition

Codelets are first defined in C++ and then used in the assembly Codelet program. C++ allows for compiler code generation and optimizations for the L0 Codelets. Therefore, the programmer is not required to write code in the L0 assembly (e.g. the x86 ISA). Section 6.1.5 explains the behavior of the Codelet class and its interaction with the runtime. As mentioned in that section, in order to define Codelets, it is necessary to

create a new class that inherits from the interface class Codelet (see Figure 6.1). However, in addition to inheriting from this class, the new class needs to include some static methods to allow registering the class with the runtime. To ease the creation of Codelet types we provide several C macros: a first macro, DEFINE_CODELET(name, numParams, OP_IO), is used for declaring a new Compute Codelet class, together with the appropriate static methods. A second macro, DEFINE_MEMORY_CODELET(name, numParams, OP_IO, OP_ADDR), is used for declaring a new Memory Codelet class, which will be marked as such and will contain the methods to calculate the memory ranges. A third macro, MEMRANGE_CODELET(name, code), allows the user to specify the function that calculates the memory ranges. Finally, a fourth macro, with signature IMPLEMENT_CODELET(name, body), is used for implementing the Codelet's instruction by providing the body of the implementation() method. The name parameter in these macros is used to compose the name of the Codelet class (e.g. myCod results in the class name cod_myCod). Next, numParams describes how many arguments the Codelet takes. The OP_IO parameter contains a bit mask that describes the read and write actions over each Codelet operand. OP_ADDR is another bit mask that uses a single bit per operand to determine if a given operand is an address or not. The OP_IO bitmask separates the read bit from the write bit so that the readwrite direction of an operand can be expressed. Starting from the LSB, and alternating the read and write bits, every two bits represent an operand. For example, the value 0b101101 represents a read operation on the first operand (01), a read and write operation on the second operand (11), and a write operation on the third operand (10). The bits 00 are reserved for immediate values or unused operands. These bitmasks are defined in a class called scm::OP_IO with a collection of static constexpr std::uint_fast16_t values, one for each operand read or write. The bitmask that determines OP_ADDR contains a single bit per operand. Starting from the LSB, each bit is associated with a different operand. For example, the

value 0b000011 says that operands 1 and 2 are used to calculate the address range of a memory Codelet. These bitmasks are defined in a class called scm::OP_ADDRESS with a collection of static constexpr std::uint_fast16_t values that allow marking operands as addresses.
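The two-bits-per-operand encoding can be checked with a small sketch like the one below. Only scm::OP_IO::OP1_RD is confirmed by Listing 6.1; the remaining constant names and their numeric values are assumptions that merely follow the encoding just described.

#include <cstdint>

// Two bits per operand, starting at the LSB: the lower bit of each pair is
// read, the upper bit is write.
constexpr std::uint_fast16_t OP1_RD = 0b01;
constexpr std::uint_fast16_t OP1_WR = 0b10;
constexpr std::uint_fast16_t OP2_RD = 0b01 << 2;
constexpr std::uint_fast16_t OP2_WR = 0b10 << 2;
constexpr std::uint_fast16_t OP3_WR = 0b10 << 4;

// Operand 1 read, operand 2 read+write, operand 3 write: the 0b101101
// example given in the text.
constexpr std::uint_fast16_t example_mask = OP1_RD | OP2_RD | OP2_WR | OP3_WR;
static_assert(example_mask == 0b101101, "matches the example in the text");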

6.2.2 Example of a print Codelet

There is a list of predefined Codelets that I refer to as system Codelets. The print Codelet is an example of them; it dumps the value of the register that is passed to it. For ease of implementation, we ask the user to specify the length of the register as part of the parameters, as an immediate value. However, future implementations should be able to obtain this value from the register name description. Listing 6.1 shows the definition of the print Codelet. Line 1 uses the DEFINE_CODELET macro. The name of the class created under the hood is cod_print; however, the user will use the print name to refer to the Codelet in the assembly Codelet program. This Codelet receives two parameters: 1) the register to be printed, and 2) the length of the register as an immediate value. The register in the first parameter is read only. Therefore, the OP_IO is set to scm::OP_IO::OP1_RD. Lines 3 through 16 contain the implementation code. The IMPLEMENT_CODELET macro on line 3 indicates that the implementation applies to the print Codelet.

1   DEFINE_CODELET(print, 2, scm::OP_IO::OP1_RD);
2
3   IMPLEMENT_CODELET(print,
4     // Obtaining the parameters
5     unsigned char *reg = this->getParams().getParamValueAs<unsigned char *>(0);
6     uint64_t len_in_bytes = this->getParams().getParamValueAs<uint64_t>(1);
7
8     std::cout << "=0x";
9     for (uint64_t i = 0; i < len_in_bytes; i++)
10      std::cout << std::setfill('0')
11                << std::setw(2)
12                << std::hex
13                << static_cast<int>(reg[i] & 255)
14                << (i % 2 != 0 ? " " : "");
15    std::cout << std::endl;
16  );

Listing 6.1: The Print Codelet implementation

Continuing with the implementation of the print Codelet, lines 5 and 6 obtain the parameters. The memory location that points to the given register is obtained on line 5, and line 6 obtains the immediate value that contains the size, in bytes, of the register to be printed. Finally, lines 8 through 15 print the actual value of the register. Use of the print Codelet is shown in the sample assembly Codelet program in Listing 6.2. This code is written for the L1 level of the SCM machine being emulated. As can be seen, there are also other instructions that form the assembly Codelet program ISA. The next section describes these instructions.

Group      | Instruction | OP1     | OP2     | OP3     | Description
-----------|-------------|---------|---------|---------|----------------------------------------------------
CONTROL    | JMPLBL      | LABEL   | -       | -       | PC = LABEL;
           | JMPPC       | IMM     | -       | -       | PC += IMM;
           | BREQ        | REG     | REG     | IMM/LBL | (OP1 == OP2) ? PC = (LBL ? LBL : PC + IMM) : PC += 1;
           | BGT         | REG     | REG     | IMM/LBL | (OP1 > OP2) ? PC = (LBL ? LBL : PC + IMM) : PC += 1;
           | BGET        | REG     | REG     | IMM/LBL | (OP1 >= OP2) ? PC = (LBL ? LBL : PC + IMM) : PC += 1;
           | BLT         | REG     | REG     | IMM/LBL | (OP1 < OP2) ? PC = (LBL ? LBL : PC + IMM) : PC += 1;
           | BLET        | REG     | REG     | IMM/LBL | (OP1 <= OP2) ? PC = (LBL ? LBL : PC + IMM) : PC += 1;
ARITHMETIC | ADD         | REG     | REG/IMM | REG/IMM | OP1 = OP2 + OP3;
           | SUB         | REG     | REG/IMM | REG/IMM | OP1 = OP2 - OP3;
           | MULT        | REG     | REG/IMM | REG/IMM | OP1 = OP2 * OP3;
           | SHFL        | REG     | REG/IMM | -       | OP1 = OP1 << OP2;
           | SHFR        | REG     | REG/IMM | -       | OP1 = OP1 >> OP2;
MEMORY     | LDIMM       | REG     | IMM     | -       | OP1 = OP2;
           | LDADR       | REG     | REG/IMM | -       | OP1 = [OP2];
           | LDOFF       | REG     | REG/IMM | REG/IMM | OP1 = [OP2 + OP3];
           | STADR       | REG     | REG/IMM | -       | [OP2] = OP1;
           | STOFF       | REG     | REG/IMM | REG/IMM | [OP2 + OP3] = OP1;
CODELET    | COD         | REG/IMM | REG/IMM | REG/IMM | CALL name(OP1, OP2, OP3);
           | COMMIT      | -       | -       | -       | Finish Codelet Execution;

Table 6.1: List of SCMUlate supported instructions

LDIMM R64B_1, 1 ;             // Load immediate value 1 to R64B_1
LDIMM R64B_2, 2 ;             // Load immediate value 2 to R64B_2
ADD R64B_3, R64B_1, R64B_2 ;  // R64B_3 = R64B_1 + R64B_2
COD print R64B_3, 8 ;         // prints =0x0000 0000 0000 0003
COMMIT ;

Listing 6.2: Example of using the Print Codelet.

6.2.3 The assembly Codelet Program ISA

Inside SCMUlate, there is an assembly-like interpreter that translates a text file with the L1 instructions into an executable program organized as decoded_instruction_t objects. The set of instructions that are currently supported is summarized in Table 6.1. These instructions are similar to the instruction table presented earlier in Section 5.2.1. Notice that some instructions support both immediate values and register

operands. For memory instructions, we use square brackets in the description to represent a memory access to the address that resolves inside them. Immediate loads allow for direct assignment of register values. It is worth noticing the similarities between commodity ISAs and the SCM ISA presented in this work. This is intentional, as the idea of SCMUlate is to emulate the SuperCodelet model. Progress made in the design of other ISAs can be applied to the hierarchical design of the SCM ISA. For example, the SCMUlate ISA could be implemented by extending the RISC-V ISA. However, these instructions represent the bare minimum set for a proof of concept. Furthermore, these instructions emphasize the availability of a Turing complete system at each level of the hierarchy of the SCM model, a powerful characteristic. A consequence of this property is the ability to support in-memory compute, where certain instructions of the upper levels of the SCM operate over memory values without having to move them down the hierarchy.

6.2.4 Running the emulator

In order to use SCMUlate, some additional elements need to be provided: 1) a file containing the description of the L1 program to be executed; 2) the ILP execution mode (i.e. sequential or superscalar as of right now); and 3) a memory region to be used as L2 memory. The current status of the SCMUlate software focuses on the emulation of a single L1 machine. However, L1 memory operations map to an L2 memory. There needs to be an address translation between the values used in the L1 program and the actual memory allocated in L2 (i.e. memory allocated in the system). To ease deployment of the emulator, the user is required to provide the memory region that represents L2. Therefore, before calling the SCMUlate runtime, it is necessary to allocate some memory, as well as to manually handle allocation of data within this memory. Future versions of the runtime should also include memory management capabilities. Listing 6.3 shows an example use of SCMUlate for matrix multiplication. The assembly Codelet program is not shown in order to

emphasize the creation and use of the runtime.

1   int main() {
2     unsigned char *memory = new unsigned char[SIZEOFMEM];
3     double *A = reinterpret_cast<double *>(memory);
4     double *B = reinterpret_cast<double *>(&memory[B_offset]);
5     double *C = reinterpret_cast<double *>(&memory[C_offset]);
6     // Initialize A, B, and C
7
8     scm::scm_machine *myMachine;
9     myMachine =
10      new scm::scm_machine("matMul128x1280.scm",
11                           memory,
12                           scm::SEQUENTIAL);
13    if (myMachine->run() != scm::SCM_RUN_SUCCESS) {
14      SCMULATE_ERROR(0, "THERE WAS AN ERROR WHEN RUNNING THE SCM MACHINE");
15      return 1;
16    }
17    // Obtain C results
18    delete[] memory;
19    delete myMachine;
20    return 0;
21  }

Listing 6.3: Use example of the SCMUlate runtime

In the example, line 2 is the allocation of the memory that represents a single register of L2, which is an operand of the main Codelet. I remind the reader that there is an upper level Codelet called the main Codelet. In the emulation we are building, the L1 assembly program that the user provides represents the main Codelet. That Codelet has some arguments, which are stored in the memory allocated by the user of the library.

Lines 3 through 5 are part of the manual memory management process needed to refer to the L2 memory elements. In particular, we assign the addresses for the A, B, and C matrices that are then used within the L1 program. Such a translation is needed because the memory addresses assigned by the OS running the emulation are not known to the programmer at the moment of writing the L1 assembly Codelet program. Instead, the user writes offsets relative to the argument sent to the main Codelet (i.e. the memory allocated in line 2 and then sent to the runtime). As previously mentioned, there needs to be a better solution for memory management in general. Next, lines 9 and 10 create the scm machine. The assembly Codelet program is written in matMul128x1280.scm, a text file that uses the syntax described in Section 6.2.3. Additionally, we pass the allocated memory, which contains matrices A, B, and C. It is also necessary to specify the operation mode of the machine (i.e. sequential or superscalar currently). Continuing with the program description, line 13 calls the runtime, followed by confirmation of a successful execution. Lines 18 and 19 avoid memory leakage and clean up the runtime.
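As a minimal sketch of this manual offset management, the offsets used in lines 3 through 5 of Listing 6.3 could be derived as follows. The matrix size and constant names are assumptions made only for illustration; only the base pointer passed to the runtime is real.

#include <cstddef>

// Hypothetical layout of the user-managed L2 region: A, B and C stored back to back.
// The L1 assembly program must use these same byte offsets relative to the region base.
constexpr std::size_t N_ELEMENTS = 128 * 1280;   // assumed number of elements per matrix
constexpr std::size_t A_offset   = 0;
constexpr std::size_t B_offset   = A_offset + N_ELEMENTS * sizeof(double);
constexpr std::size_t C_offset   = B_offset + N_ELEMENTS * sizeof(double);
constexpr std::size_t SIZEOFMEM  = C_offset + N_ELEMENTS * sizeof(double);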

6.3 Runtime and Codelet Level Parallelism

So far we have focused on the different components as well as the API. As expected, parallelism is an optimization of the runtime, and not an essential part of it. In this section we focus on parallel execution. In particular, we analyze the timeline diagram of Figure 6.4. The program being executed does not correspond to any particular computation; it is synthetic code for the sake of demonstration. The ADDV Codelet adds two vectors element by element, each of size 16 bytes. Memory is byte addressable. All registers described with the letter R are 16 bytes long. Both the memory controller and the L2 memory have multiple channels. The memory controller has a mechanism to detect potential hazards between reads and writes.

[Figure 6.4 shows a timeline with columns, from left to right, for the Instruction Memory, the SU, CU 0, CU 1, the Memory Controller, and the L2 Memory. The Assembly Codelet Program executed in the diagram is: LD R1, 0; LD R2, 10; LD R3, 100; LD R4, 110; COD ADDV R5, R1, R3; COD ADDV R6, R2, R4; ST R5, 200; ST R6, 210; COD ADDV R7, R5, R6; ST R7, 300; COMMIT.]

Figure 6.4: Time diagram of the execution of a Codelet using a superscalar approach with 2 Compute Units. See the executing Assembly Codelet Program on the bottom right

Starting from the left, the first column represents the instruction memory. Next, in the second column, there is the SU. The SU fetches instructions from the Instruction Memory and decodes them. If an instruction is ready for execution, it schedules it to the appropriate unit. The green blocks show the decoding and scheduling time. If an instruction is of type MEMORY, it is scheduled to the memory controller. If an instruction is of type EXECUTE, it represents a Codelet and is therefore scheduled to a CU. The diagram shows two CUs, but a system may consist of more. The Memory Controller communicates with the L2 Memory. It also handles potential hazards when communicating with memory. In the SU column, the red blocks represent stalls in the pipeline. We have exaggerated some of the times to emphasize the stalling; they do not necessarily represent the reality of a system. The light red blocks extending across multiple columns contain the reason for the stall and the unit that the SU is waiting on before continuing execution.

This process is currently emulated in SCMUlate. The runtime is capable of discovering hazard-free instructions that can be scheduled in parallel. It also stalls when a dependency may not be satisfied. As in the example, registers are manually renamed to avoid anti- and output dependencies. In Chapter 7 we show some real examples and their execution traces.
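As an illustration of the kind of readiness check the SU performs before dispatching an instruction, the following is a simplified C++ sketch. It is not the SCMUlate scheduler; the structure and names are assumptions made only for this example.

#include <cstdint>
#include <unordered_map>
#include <vector>

// Simplified sketch: an instruction is dispatched only when none of its source
// registers is still pending a write from an in-flight instruction (RAW hazard).
struct simple_scoreboard {
  std::unordered_map<std::uint32_t, bool> pending_write;  // register id -> write in flight

  bool can_dispatch(const std::vector<std::uint32_t> &sources) const {
    for (std::uint32_t r : sources) {
      auto it = pending_write.find(r);
      if (it != pending_write.end() && it->second)
        return false;  // stall: wait for the producing unit to signal completion
    }
    return true;
  }
  void mark_issue(std::uint32_t dest) { pending_write[dest] = true; }
  void mark_done(std::uint32_t dest)  { pending_write[dest] = false; }
};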

6.4 Profiling execution code

As previously mentioned, the SCMUlate framework includes a tracing mechanism along with the runtime. This mechanism allows obtaining traces of the different elements that compose the SCM machine. It is also designed with hardware counters in mind. In order to make the tracing data available after execution, SCMUlate dumps the tracing information into a JSON formatted file. An offline Python script then allows for visualization of the data and better analysis. The current solution may not be ideal for large runs due to the large size of the dumped JSON file and the limitations of the

Python visualization. For the time being, however, it provided a fast deployment of a tracing mechanism for development purposes. Other solutions are being studied.

Tracing is achieved by using high precision timers. At the beginning of the runtime, right after initialization, a global clock is initialized. There is an event vector for each element in SCMUlate (i.e. SU, CU, and runtime). At a given point in the runtime, a new event is added to the vector, containing an identifier of the event and the current clock time (a minimal sketch of this mechanism is shown after Listing 6.4). When dumping the events into the JSON file, the numbers are translated to another time base. Currently, there is support for three types of timers: 1) a system timer for runtime events; 2) an SU timer for SU related events; and 3) a CUMEM timer for CU and memory interface related events. When multiple CUs exist, there are multiple CUMEM timers. For system events we currently support two: beginning and end of the runtime. For SU events we store: beginning and end of the SU unit, fetching an instruction, scheduling (i.e. dispatching) an instruction, executing a control or basic arithmetic instruction, and entering idle mode (e.g. stalling). Similarly, the CUMEM timer has the beginning and end events, an event when a Codelet starts executing, an event when a memory instruction starts executing, and an event when the unit goes into idle mode (e.g. waiting for work).

The final JSON file has a structure similar to the example in Listing 6.4. We are only showing a few examples of events and timers. The first timer (line 2) is the system timer. Line 11 is the beginning of the SU timer. Finally, lines 19 and 28 are the beginning of different CU and Memory timers. Inside each entry, the counter type indicates which of the three timer types it is. Finally, there is the list of events, each containing an event type and a value as previously described.

1  {
2    "SCM_MACHINE": {
3      "counter type": "0",
4      "events": [
5        {
6          "type": "0",
7          "value": 0.000000
8        }, ...
9      ]
10   },
11   "SU_0": {
12     "counter type": "1",
13     "events": [
14       {
15         "type": "0",
16         "value": 0.001469
17       }, ...
18   },
19   "CUMEM_1": {
20     "counter type": "3",
21     "events": [
22       {
23         "type": "0",
24         "value": 0.001466
25       }, ...
26     ]
27   },
28   "CUMEM_2": {
29     "counter type": "3",
30     "events": [
31       {
32         "type": "0",
33         "value": 0.001477
34       }, ...
35     ]
36   }, ...
37 }

Listing 6.4: JSON file structure of the tracing mechanism of SCMUlate
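The per-unit event vectors that produce this file can be sketched as follows. The names are illustrative only and do not correspond to the actual SCMUlate classes.

#include <chrono>
#include <vector>

// Sketch: one event vector per SCM element (SU, CU/MEM, runtime), all relative to a
// single global clock started right after initialization.
using sim_clock = std::chrono::steady_clock;
static const sim_clock::time_point global_start = sim_clock::now();

struct trace_event {
  int    type;   // event identifier (e.g. fetch, dispatch, idle)
  double value;  // seconds since the global clock started, as dumped to the JSON file
};

struct unit_timer {
  int counter_type;                  // 0 = system, 1 = SU, 3 = CUMEM, as in Listing 6.4
  std::vector<trace_event> events;
  void record(int type) {
    double t = std::chrono::duration<double>(sim_clock::now() - global_start).count();
    events.push_back({type, t});
  }
};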

After the JSON file is obtained, it is processed by a Python script that uses the Plotly library [85] to create an interactive trace. The output of the visualization tool is shown in Figure 6.5. As seen in the figure, it is possible to see the trace of the

different timers. Furthermore, it allows visualizing additional information for each of the elements shown in the trace. As we improve the tool and the tracing capabilities, other information, such as hardware counters, is expected to be displayed in this plot.

Figure 6.5: Interactive trace visualization example

Chapter 7

EVALUATION

Having explained the implementation of the SCMUlate framework in Chapter 6, this chapter evaluates the Sequential Codelet Model as well as the SuperCodelet architecture. A commodity Intel chip containing two different architectures is used: the Intel Gen9 GPU and Intel x86 CPU cores. This chapter describes the evaluation results obtained with the SCMUlate emulator.

7.1 Evaluation methodology

The SCMUlate emulator has two main purposes. First, to demonstrate that it is possible to achieve parallel execution of code through the use of the Sequential Codelet Model in heterogeneous systems. Second, to obtain early metrics of a system that implements the SCM, as well as to understand the programming challenges and strategies when using the SCM model. The SCMUlate emulator is intended to run on commodity hardware, but it is also meant to faithfully mimic the behavior of the different parts of the system. The goal of the SCMUlate emulator is not to be faster than already existing parallel frameworks such as OpenMP or highly optimized libraries. Instead, it is meant to demonstrate the feasibility of the SCM machine as well as to explore design parameters of the SuperCodelet architecture. The proposed methodology intends to answer the following questions:

• Is it possible to separate program semantics from execution mechanisms by using a well defined program execution model? Consequently, is it possible to separate software and hardware evolution through a well established interface, similar to the Instruction Set Architecture?

• Can sequentially described code be parallelized by using dataflow-inspired execution engines? If so, what are the potential improvements?

Figure 7.1: Architecture block diagram of the Intel Core i7-8700K containing the Intel Gen9 integrated GPU

• Is it possible to build an SCM machine around a heterogeneous system, allowing the same execution model to be used for GPUs, CPUs, and potentially other architectures?

• What are the major challenges when translating already existing code into SCM semantics? How can these challenges be addressed?

Two different microbenchmarks are used for evaluation purposes: Vector Addition and Matrix Multiplication.

7.2 Testbed

The ideal experimental testbed is a system that contains CPUs and programmable GPUs in the same chip. This restriction is important to explore heterogeneous execution with CPU cores and GPU cores under the same program execution model. By having the two units in the same chip, the communication latency between the two architectures is determined by the on-chip memory (e.g. L3 cache). This should result in lower communication latency between the CPU and the GPU. The alternative would be to use discrete GPUs. However, to communicate between a CPU and a GPU core one must go through an external interconnect. Such an approach would worsen the overhead of emulating the SCM memory hierarchy for the L1 level currently being emulated. Discrete GPUs on different dies could be used in L2 system emulations.

There exist several commercially available processors that contain CPU cores and GPU cores in the same die. Intel architectures often feature an integrated GPU that is fully programmable. The most recent generations of processors feature the Gen9 GPU architecture [83] coupled with CPU cores featuring Simultaneous MultiThreading (SMT) technology. The L3 memory is shared between the CPUs and the GPU cores. Therefore, the L3 cache can be seen as the Level 1 (i.e. SCM L1) register file when emulating the SuperCodelet architecture, connecting both GPU and CPUs under the same level.

This work uses an Intel Core i7-8700K processor. Figure 7.1 shows a simplified architecture diagram. There are 6 CPU cores, each with HyperThreading technology (i.e. SMT with 2 threads), for a total of 12 hardware threads. Each thread has a base frequency of 3.7 GHz with potential overclocking through Turbo Boost technology. Additionally, the Intel Core i7-8700K contains an Intel UHD Graphics 630 integrated GPU of the Gen 9.5 architecture. This GPU comes with 1 slice containing 3 sub-slices, each with 8 execution units (EU), for a total of 24 EUs. A slice is the equivalent of a GPU core (e.g. a Streaming Multiprocessor in NVIDIA GPUs, or a SIMD core / data-parallel processor in the AMD GCN architecture). Each execution unit has a pool of threads with independent register files and state. During a clock cycle, the EU schedules 4 threads simultaneously regardless of the kernel. The scheduled threads vary from cycle to cycle, selecting 4 different threads on each cycle. Finally, each thread uses SIMD-4 (32 bit) instructions that are assigned to one of two FPUs (floating point ALUs), each capable of executing 1 addition and 1 multiplication simultaneously. Therefore, each EU can execute up to sixteen 32 bit operations per cycle (2 FPU x SIMD-4 x (1 add + 1 multiply)).

The Intel i7-8700K processor is hosted in a Dell Precision 3630 desktop tower. The motherboard has a single socket configuration. Additionally, it is equipped with 32 GB of DRAM distributed in two 16 GB DDR4 DIMMs, each running at 2666 MHz.

Figure 7.2: Block diagram of the Intel Core i7-8700K with the Intel Gen9 integrated GPU, showing the mapping of SU and CU roles onto the hardware threads

7.3 Sequential Codelet Abstract Machine mapping

The first step of the runtime is to map the resources of the running system to the Sequential Codelet Abstract Machine by creating and pinning a software thread per hardware thread in the system and assigning it a role, as described in Chapter 6 (a sketch of this mapping is given below). A single hardware thread is assigned as the Scheduling Unit. This thread does not participate in the computation of Codelets or in memory operations; instead, it is in charge of fetching L1 level instructions, as well as executing control flow and arithmetic instructions (as seen in Table 6.1). The rest of the threads are assigned Computation and Memory Unit roles simultaneously. Figure 7.2 shows a modified version of the Intel Core i7 diagram presented before. In this figure a single hardware thread is used as the SU, while the rest are CUs.

The emulation strategy in SCMUlate imposes major limitations that jeopardize the performance of the runtime and the application. First, the SU is mapped to a full hardware thread that is not used for the computation of Codelets. This results in under-utilization of hardware resources. Second, the emulation of memory operations requires extra data movements that are costly. Furthermore, the emulation of Memory Units uses a whole hardware thread when executing memory instructions, wasting compute resources. Finally, commodity hardware architectures feature ILP optimizations that rely on the original von Neumann abstractions and that are critical to achieve

top performance in these systems. Architectural optimizations such as data prefetching and branch prediction are affected by the runtime, reducing the overall efficiency of the application. Consequently, making a one-to-one comparison with other parallel programming models that rely on the original von Neumann abstraction is challenging.
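A Linux-specific sketch of the thread-per-hardware-thread mapping described at the beginning of this section is shown below. The role assignment and the loop bodies are placeholders; this is not the SCMUlate implementation.

#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

// Sketch: create one software thread per hardware thread, pin it, and give the
// first one the SU role; the remaining threads act as combined CU/Memory units.
static void pin_to_hw_thread(std::thread &t, int hw_thread) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(hw_thread, &set);
  pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

int main() {
  unsigned n = std::thread::hardware_concurrency();   // 12 on the i7-8700K testbed
  std::vector<std::thread> units;
  for (unsigned i = 0; i < n; ++i) {
    bool is_su = (i == 0);
    units.emplace_back([is_su] {
      (void)is_su;  // placeholder: SU loop when true, CU/Memory unit loop otherwise
    });
    pin_to_hw_thread(units.back(), static_cast<int>(i));
  }
  for (auto &t : units) t.join();
  return 0;
}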

7.3.1 Register File

On the Intel Core i7-8700K system, there is an LLC cache of 12 MB. The cache line size is 64 bytes, and it is used as the unit of measure for defining the register sizes. Therefore, the emulation of the register file through the LLC, as described in Chapter 6, distributes the memory as follows: 160 registers of 64 bits, 140 registers of 1 LINE, 100 registers of 8 LINES, 100 registers of 16 LINES, 60 registers of 256 LINES, 60 registers of 512 LINES, 60 registers of 1024 LINES, and 40 registers of 2048 LINES, where LINE corresponds to the size of a cache line. This arrangement results in a total register file size of 11.72 MB, corresponding to almost all of the LLC cache. These values have been selected almost arbitrarily. They reflect the need to use different register sizes, while providing a starting point for exploring what the appropriate register sizes are for the upper levels of the SCM architecture.
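The 11.72 MB total can be checked directly from the configuration above; the following small program reproduces the calculation (LINE = 64 bytes).

#include <cstdio>

// Reproduces the register file size calculation for the configuration listed above.
int main() {
  const long LINE = 64;                 // cache line size in bytes
  long bytes = 160L * 8                 // 160 registers of 64 bits
             + 140L * 1    * LINE       // 140 registers of 1 LINE
             + 100L * 8    * LINE       // 100 registers of 8 LINES
             + 100L * 16   * LINE       // 100 registers of 16 LINES
             +  60L * 256  * LINE       //  60 registers of 256 LINES
             +  60L * 512  * LINE       //  60 registers of 512 LINES
             +  60L * 1024 * LINE       //  60 registers of 1024 LINES
             +  40L * 2048 * LINE;      //  40 registers of 2048 LINES
  std::printf("%ld bytes = %.2f MB\n", bytes, bytes / (1024.0 * 1024.0));  // 12288000 bytes = 11.72 MB
  return 0;
}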

7.4 Example 1: Vector Addition

Vector addition is implemented as a gateway test for the initial evaluation and explanation of SCMUlate. Vector addition is an embarrassingly parallel problem that allows for a simple implementation that illustrates the intentions of this work. The operation C[] = A[] + B[] is performed. A, B and C are vectors of type double, each of size 4000 × 2048 × 64 bytes = 524.3 MB, for a total of 65536000 elements per vector.

Figure 7.3 represents the overall memory and compute structure at the different levels of the SCM machine in the context of the vector addition program. Numbers in yellow are used to match the description in the rest of this section. There are three levels shown in this picture, L0 through L2. Recalling the explanation of Chapter 6, the reader should remember that the SCMUlate framework: 1) requires a level L2 memory region that contains the vectors A, B and C; 2) emulates the execution of a single Main Codelet of level L2, described in L1 assembly and interpreted by the runtime; and 3) allows defining L1 Codelets as L0 assembly code for execution on commodity hardware.

Figure 7.3: Structure of memory and computation for Vector Addition using SCM running on SCMUlate. 1) L2 memory contains the original 3 vectors (A, B, C) in a single flat memory allocation. 2) Load and store operations of L1 fetch part of the whole array, according to the L1 register sizes (e.g. 2048 x 64 bytes), producing A', B' and C'. 3) L1 Codelets use these registers to perform computation; they are evaluated at L0. 4) Each register in L1 belongs to the L1 memory space, and L0 computation accesses these registers through memory operations, each read/write covering as much as an L0 register (A'', B'' and C''). 5) The actual computation is performed at L0.

Next to the yellow circle with the number 1 there is L2 Memory, allocated in DRAM at the beginning of the SCMUlate framework. L2 memory contains the whole program memory. Right under L2 memory, a circle represents the Main Codelet of the Vector Addition SCM program. The connection between the Main Codelet and the L2 Memory implies that this Codelet uses the L2 memory as its operand. Within level 1, memory addresses are calculated as offsets from the beginning of the L2 memory region, resolving the translation between the program memory (as managed by the operating system running the runtime) and L2 memory. For this example, L2 memory stores the vectors to be added, A and B, as well as the resulting vector C. To the right of L2 memory, there is L1 memory (i.e. the L1 register file seen as memory from L0).

The code shown in Listing 7.1 contains the L1 assembly Codelet program that defines the Main Codelet. This code is interpreted and executed by the SCMUlate runtime, as explained in Chapter 6. Lines 1 through 3 load the offset locations for each vector inside the L2 memory. Registers R64B_1, R64B_2, and R64B_3 contain the locations of A, B, and C respectively. Figure 7.3 shows these locations with arrows pointing to L2 memory. Lines 4 through 6 define registers R64B_4, R64B_5, and R64B_6, containing the iteration variable, the offset variable, and the total number of iterations respectively. This program fixes the number of iterations to 4000, defined by the size of vectors A, B and C. Lines 8 through 16 contain the iteration body. Line 9 corresponds to the control operation of the loop, jumping out of it to line 17 (i.e. PC + 8) when register R64B_4 (the iteration variable) is equal to register R64B_6 (the limit), which contains the total number of iterations (i.e. 4000). Line 14 increases the iteration variable, and line 15 increases the offset that points to the chunk to be processed in the next iteration.

1  LDIMM R64B_1, 0          ;// Vect A Base Addr
2  LDIMM R64B_2, 524288000  ;// B Base Addr
3  LDIMM R64B_3, 1048576000 ;// C Base Addr
4  LDIMM R64B_4, 0          ;// iteration variable
5  LDIMM R64B_5, 0          ;// offset
6  LDIMM R64B_6, 4000       ;// num of iterations
7
8  loop:
9  BREQ R64B_4, R64B_6, 8 ;
10 LDOFF R2048L_1, R64B_1, R64B_5 ;
11 LDOFF R2048L_2, R64B_2, R64B_5 ;
12 COD vecAdd_2048L R2048L_3, R2048L_1, R2048L_2 ;
13 STOFF R2048L_3, R64B_3, R64B_5 ;
14 ADD R64B_4, R64B_4, 1 ;
15 ADD R64B_5, R64B_5, 131072 ;
16 JMPLBL loop ;
17 COMMIT ;

Listing 7.1: Main Codelet for the Vector Addition program

The main computation of this program happens in lines 10 through 13. In each iteration, a chunk of A and a chunk of B are loaded into the L1 register file, then the Codelet vecAdd_2048L is executed, and the result is placed in an L1 register as well. Then, the value is stored back from the result register into L2 memory. This process is depicted in Figure 7.3 inside the Main Codelet. Load and store operations receive an address, an offset and a register. The red dotted arrow next to the yellow circle with the number 2 shows the chunk of vector A and its correspondence to the register (labeled A'). The load operation of A copies the values from a chunk of A in L2 memory into the register in L1 that contains A'. These registers are of type R2048L_x, equivalent to 2048 × 64 = 131072 bytes. This is also the value used to increment the offset variable in line 15. Notice that the Codelet, depicted next to the yellow circle with the number 3, operates directly from and to the registers (i.e. the L1 register file).

Also, notice that the representation inside the Main Codelet is a dataflow graph, while the SCM program is a sequential description of this graph.

The description of the Codelet vecAdd_2048L is defined in the assembly ISA language of the low level commodity hardware architecture (i.e. x86 in our testbed). Fortunately, current compilers are able to translate a C++ program to the low level ISA. Hence it is possible to use the approach described in Section 6.2 to define L1 Codelets in L0 assembly. The yellow circle with the number 5 in Figure 7.3 shows the same description in the context of the whole SCM infrastructure. The blue circle on the right, labeled VecAdd_2048L, corresponds to a zoom into the vecAdd_2048L Codelet used in the Main Codelet on the left. Listing 7.2 shows the definition and implementation of the vecAdd_2048L Codelet in C++. Line 1 contains the definition of the Codelet with the DEFINE_CODELET() macro. This Codelet has 3 operands. Operand 1 corresponds to the result vector C, and it is a write operand (i.e. OP_IO::OP1_WR). Operands 2 and 3 correspond to A and B respectively, and they are read operands (i.e. OP_IO::OP2_RD | OP_IO::OP3_RD). The implementation of the Codelet starts in line 3 with the macro IMPLEMENT_CODELET(). Lines 6 through 10 cast the operand registers into arrays of type double. This is needed because the SCMUlate runtime uses the generic type unsigned char * for registers, while for the user these registers represent data in the format that makes sense to the application (i.e. double * in this case). Inside the Codelet, A, B and C are vectors as large as the R2048L_x registers. Therefore, each Codelet operates over 131072/8 = 16384 elements, where 8 is the size of a double in bytes. Finally, lines 12 and 13 are the actual computation that adds the vector elements, iterating over each element of the vectors A, B and C. numElements is a constant value equal to 16384.

In Figure 7.3 the yellow circle with number 4 shows the relationship between the registers used inside the Main Codelet and the L1 register file. The bold arrows connecting the L1 Memory to the vecAdd_2048L Codelet represent the registers being used as operands of the Codelet.

Notice that a single element inside the for loop in the Codelet corresponds to a chunk of the L1 register file. This demonstrates the hierarchical memory organization of the SCM model.

1  DEFINE_CODELET(vecAdd_2048L, 3, scm::OP_IO::OP1_WR | scm::OP_IO::OP2_RD | scm::OP_IO::OP3_RD);
2
3  IMPLEMENT_CODELET(vecAdd_2048L,
4    // Obtaining the parameters
5    // Getting register 1 (C, the result)
6    double *C = this->getParams().getParamAs<double*>(1);
7    // Getting register 2 (A)
8    double *A = this->getParams().getParamAs<double*>(2);
9    // Getting register 3 (B)
10   double *B = this->getParams().getParamAs<double*>(3);
11
12   for (uint64_t i = 0; i < numElements; i++)
13     C[i] = A[i] + B[i];
14 );

Listing 7.2: L0 Codelet vecAdd_2048L implementation

In general, Figure 7.3 uses the yellow circles with numbers 2 and 4 to follow this hierarchical organization. L1 registers are considerably smaller than the arrays A, B and C. Therefore, a single load operation of L1 covers a small segment of the whole vector. These segments are referred to as A', B' and C' respectively in the figure. The loop in the L1 program in Listing 7.1 allows moving the chunk along each vector. Likewise, the vecAdd_2048L Codelet uses these chunks as operands in the form of registers of the L1 register file. This L1 register file is seen as memory from the perspective of L0. A single register in L0 can only hold a few elements of the vectors A', B' and C'. For example, in the testbed architecture, x86 registers hold a single element of the vector (assuming no SIMD extensions). In the figure, A'', B'' and C'' represent these

elements, while the loop inside the Codelet vecAdd_2048L allows operating over the whole sub-arrays A', B' and C'.
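Putting the two levels together, the overall computation corresponds to the following chunked loop in plain C++. This is a sketch only; it mirrors Listing 7.1 driving the Codelet of Listing 7.2, with the chunk size derived from the R2048L register size.

#include <cstddef>

// Sketch: the structure of the Main Codelet of Listing 7.1, expressed in C++.
// Each iteration moves one 131072-byte chunk (16384 doubles) through the L1 registers.
void vector_add_main_codelet(const double *A, const double *B, double *C) {
  const std::size_t chunk = 131072 / sizeof(double);  // elements held by one R2048L register
  const std::size_t iters = 4000;                     // number of loop iterations in Listing 7.1
  for (std::size_t it = 0; it < iters; ++it) {
    const double *a = A + it * chunk;                 // LDOFF R2048L_1, R64B_1, R64B_5 (A')
    const double *b = B + it * chunk;                 // LDOFF R2048L_2, R64B_2, R64B_5 (B')
    double       *c = C + it * chunk;                 // result chunk written by STOFF (C')
    for (std::size_t i = 0; i < chunk; ++i)           // COD vecAdd_2048L: work done at L0
      c[i] = a[i] + b[i];
  }
}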

1  #define REPS 4000
2  #define REG2048L 64*2048
3  // Size of each vector 64*2048*4000/8 = 65536000 elements
4  #define SIZE_VECT ((REG2048L*REPS)/sizeof(double))
5
6  struct __attribute__((packed)) packed_l2_mem {
7    double A[SIZE_VECT];
8    double B[SIZE_VECT];
9    double C[SIZE_VECT];
10 };
11
12 int main() {
13   packed_l2_mem *memory = new packed_l2_mem;
14   // Initialize A, B, and C ...
15   scm::scm_machine *myMachine;
16   myMachine = new scm::scm_machine("vectorAdd.scm",
17                                    (l2_memory_t) memory,
18                                    ILP_MODE);
19   myMachine->run();
20   // Verification and cleaning ...
21   return 0;
22 }

Listing 7.3: SCMUlate program for Vector Addition. Runtime creation and Main Codelet creation.

Finally, in order to show the whole picture of the SCMUlate framework, it is necessary to illustrate the structure of the main C++ program that instantiates the runtime and creates the Main Codelet. Listing 7.3 shows a simplified version of the

main() function in C++ for the Vector Addition program. In this listing, lines 1 through 4 define three macros created to ease understanding of the program: REPS, REG2048L and SIZE_VECT. REPS corresponds to the number of iterations performed inside the Main Codelet, as seen in Listing 7.1 and its explanation. REG2048L corresponds to the size of the registers in L1. SIZE_VECT determines the overall number of elements of A, B, and C. Lines 6 through 10 define a struct that packs A, B and C into a consecutive memory region. The __attribute__((packed)) is used to make sure there is no padding between the vectors. L2 memory is then instantiated inside the main() function in line 13. After the initialization of A, B, and C (hidden in the listing), the SCMUlate runtime is created in line 16. The runtime constructor receives 3 parameters: 1) the .scm file that contains the assembly code of the Main Codelet, 2) a pointer to L2 memory, and 3) the Codelet Level Parallelism mode (i.e. SEQUENTIAL, SUPERSCALAR, or OOO, as described in Section 6.1.3.4). During the construction of the runtime, the .scm file is interpreted and loaded into the instruction memory of the SCM machine. Finally, line 19 calls the runtime for execution, storing the result into the C vector in L2 memory. The result is checked and the code cleaned up (hidden in the listing), and the SCMUlate program finishes.

7.4.1 Vector Addition results

The Vector Addition problem is executed using the different Codelet Level Parallelism modes supported by the SCMUlate framework, as described in Section 6.1.3.4. Furthermore, scalability with the number of Computational Units is studied. An OpenMP implementation of Vector Addition is used to compare the results and estimate the SCMUlate overhead. Additionally, we manually apply the Loop Unrolling and Register Scheduling compiler optimization techniques to demonstrate that it is possible to take advantage of already existing compiler techniques when used with the SCM Program Execution Model.

LDIMM R64B_1, 0          ;// Loading base address A
LDIMM R64B_2, 524288000  ;// Loading base address B
LDIMM R64B_3, 1048576000 ;// Loading base address C

LDIMM R64B_4, 0          ;// For iteration variable
LDIMM R64B_5, 0          ;// For offset 1
LDIMM R64B_6, 131072     ;// For offset 2
LDIMM R64B_7, 262144     ;// For offset 3
LDIMM R64B_8, 393216     ;// For offset 4
LDIMM R64B_9, 4000       ;// For number of iterations

loop:
BREQ R64B_4, R64B_9, 23 ;
LDOFF R2048L_1, R64B_1, R64B_5 ;
LDOFF R2048L_2, R64B_2, R64B_5 ;
LDOFF R2048L_4, R64B_1, R64B_6 ;
LDOFF R2048L_5, R64B_2, R64B_6 ;
LDOFF R2048L_7, R64B_1, R64B_7 ;
LDOFF R2048L_8, R64B_2, R64B_7 ;
LDOFF R2048L_10, R64B_1, R64B_8 ;
LDOFF R2048L_11, R64B_2, R64B_8 ;
COD vecAdd_2048L R2048L_3, R2048L_1, R2048L_2 ;
COD vecAdd_2048L R2048L_6, R2048L_4, R2048L_5 ;
COD vecAdd_2048L R2048L_9, R2048L_7, R2048L_8 ;
COD vecAdd_2048L R2048L_12, R2048L_10, R2048L_11 ;
STOFF R2048L_3, R64B_3, R64B_5 ;
STOFF R2048L_6, R64B_3, R64B_6 ;
STOFF R2048L_9, R64B_3, R64B_7 ;
STOFF R2048L_12, R64B_3, R64B_8 ;
ADD R64B_4, R64B_4, 4 ;
ADD R64B_5, R64B_5, 524288 ;
ADD R64B_6, R64B_6, 524288 ;
ADD R64B_7, R64B_7, 524288 ;
ADD R64B_8, R64B_8, 524288 ;
JMPLBL loop ;

COMMIT ;

Listing 7.4: Loop unrolling of size 4 for Vector Addition

7.4.1.1 Manually applying optimization techniques: Loop Unrolling and Register Scheduling

Depending on the Codelet Level Parallelism technique used to implement the Sequential Codelet Model, it may or may not be possible to remove anti- and output dependencies. This is the case, for example, of the Superscalar implementation explained in Section 6.1.3.4. On the other hand, the properties defined for Codelets, in conjunction with the hierarchical organization of memory in the SCM model, provide a bounded execution time for Codelet instructions. This property is critical for SCM, as it allows for the static analysis of code usually found in the compiler techniques used for the sequential execution of programs. Therefore, it is possible to imagine an SCM compiler that takes as input code similar to that of Listing 7.1, together with information for each Codelet (e.g. execution time, memory requirements, or other metadata), and applies already existing optimizations to improve the execution of the code. To demonstrate this principle, two well known compiler techniques are applied to the SCM assembly code of Vector Addition: 1) loop unrolling and 2) register renaming. Listing 7.4 shows the result of applying a 4-times unroll operation on the code of Listing 7.1, together with register scheduling. Ultimately, this code removes false dependencies that existed in the original code. As a result, a simplified implementation of the SCM machine that does not feature a full out of order engine (e.g. just superscalar) can still parallelize this code.

7.4.1.2 Sequential execution

The ILP_MODE::SEQUENTIAL execution mode uses a single Compute Unit (CU) and does not attempt to exploit Codelet Level Parallelism. Figure 7.4 shows a segment of the trace when executing vector addition on SCMUlate using the sequential mode. The horizontal axis represents time, while each of the bars represents a different unit in the SCM machine. For the sequential mode, one bar represents the SU and another bar the CU. The total execution time of the SCM program is shown in the title, as well as the times the execution starts and ends.

Figure 7.4: Vector Addition execution trace for the sequential mode on the SCMUlate emulator. 3 loop iterations on 1 CU

The bottom bar, labeled SU 0, corresponds to the trace of the Scheduling Unit. The blue segments represent the scheduler processing time. The SU also executes basic arithmetic operations and control flow instructions, included in the blue bars. In the top right label, the percentage of time the SU was used for scheduling can be seen inside the brackets. For this particular program and execution mode, the SU 0 usage is only 4.64% of the total execution time. The rest of the time, this unit was idle waiting for work. This behavior is expected, and it supports the idea that a Scheduling Unit does not have to be a highly optimized core; it could be a simplified architecture or circuit logic. The CUMEM 1 bar shows the compute unit, which executes Codelets (in purple) and memory instructions (in green). The utilization of this core is 83.06% (as presented in the label on the top right). The remaining 16.94% of the time, the CU was waiting for work from the SU, as well as synchronizing with the SU.

The zoomed-in segment presented in Figure 7.4 focuses on the beginning of the execution of the program. To understand this trace, it is possible to compare it to Listing 7.1. The CU executes memory instructions and Codelets one at a time, as shown in the trace. At the beginning of the program, several LDIMM instructions are issued.

These are seen as thin bars at the beginning of the CUMEM 1 bar. Then there is the loop. First, two LDOFF instructions are issued, followed by a vecAdd_2048L Codelet and an STOFF operation. This loop is seen as a repetitive pattern of 2 green bars, followed by a purple bar, followed by another larger green bar. This trace shows 3 iterations of the program.

Figure 7.4 also shows a breakdown of the CU memory vs. compute execution time in the annotation on the bottom right. Excluding the time the CU was idle, 88.7% of the real execution time was spent in memory operations, while 11.26% of the time was spent in compute operations.

7.4.1.3 Superscalar execution

Figure 7.5: Vector Addition execution trace for the Superscalar Codelet Level Parallelism mode on the SCMUlate emulator. 3 Loop iterations on 4 CUs.

Figure 7.5 shows the execution trace for the ILP_MODE::SUPERSCALAR parallel execution of the same program. This mode enables Codelet Level Parallelism across 4 different cores. In this execution mode, the bottom bar represents SU 0, as explained for the sequential execution. The top 4 bars represent each of the CUs. Green represents memory operations and purple represents Codelet operations. We observe a slight increase in the overall SU utilization, rising to 6.20% of the execution time.

The total aggregated SU (scheduling) time in the Sequential mode is 13.6 ms, while in the Superscalar version it is 17.39 ms. Although the overall execution time of the machine is lower than in the sequential execution, the speedup is only 4.8%. The trace shows how this execution lacks parallelism. When analyzing the dependencies of Listing 7.1, it is possible to observe write after read dependencies between iterations on registers R2048L_1 and R2048L_2. The Superscalar execution mode cannot resolve these dependencies, resulting in stalls after every iteration. The utilization of each thread drops considerably, to about 25%, since the CUs spend most of the time waiting for work.

To overcome this limitation, the unrolling technique presented in Section 7.4.1.1 is applied. Figure 7.6 shows the execution trace for the 4-times unrolled version of vector addition of Listing 7.4. First, we observe a considerable improvement in the execution time of 241.2% with respect to the sequential execution. Second, the utilization of each unit goes up to about 60%, in comparison to 25% without unrolling. The relative SU utilization also increases, to 14.91%. However, the total accumulated SU time is 18.18 ms, close to the previous results. 6 CUs are used to highlight the gap between iterations caused by the remaining dependencies, as seen in the trace between t = 0.95 ms and t = 1 ms, mainly affecting CUs 3 and 4.

Unrolling has removed some write after read dependencies and structural hazards present in the original code. However, the unrolled version still contains dependencies between iterations. The number of instructions that can be scheduled is determined by the number of independent instructions between dependencies. As the number of CUs is increased, there will be a point where not enough instructions are available for execution, resulting in under-utilization of resources. To allow for more parallelism, it is necessary to unroll even more loop iterations. Later in this chapter we show results for an 8-times unroll.

On the other hand, it is possible to see that the ratio between compute and memory operations per CU changes slightly in comparison to the sequential version. The superscalar execution with loop unrolling spends, on average, 90% of the time in memory operations and 10% in computation.

As will be seen later in this chapter, the unrolled execution shows a considerable increase in the execution time of memory instructions, justifying their increased share of the overall execution time.

Figure 7.6: Vector Addition unrolled 4 times. Execution trace for the Superscalar Codelet Level Parallelism mode on the SCMUlate emulator. Zooming in on at least 4 loop iterations.

7.4.1.4 Out of order execution

Out of order execution is the most advanced Codelet Level Parallelism architecture implemented in SCMUlate. This mode is capable of discovering and eliminating all output and anti dependencies within a window of instructions. Figure 7.7 shows the execution trace of the ILP_MODE::OOO (Out of Order) execution. A continuous overlapping of instructions is observable across all the different compute units. The total execution time is reduced to 88.15 ms, a speed up of 318.25% in comparison to the Sequential execution mode and of 138.30% in comparison to the Superscalar execution mode. SU utilization also increases considerably, to 54.85%, an expected trend due to the additional analysis and resource management needed for dependency discovery.

The total accumulated scheduling time is up to 48.35 ms. The overall utilization of the CUs increases to about 68% on average, while the memory to compute ratio is back to about 88% for memory.

Figure 7.7: Vector Addition execution trace for the Out of Order Codelet Level Parallelism mode on the SCMUlate emulator. Zooming in to the initial loops.

7.4.1.5 Comparison between execution modes

It is clear that the SCM model yields improvements in execution time. Figure 7.8 shows a summary of an experiment that runs the Vector Addition code in all three modes. The Out of Order and Superscalar execution modes use 5 CUs, while the sequential mode uses a single CU. The orange bar represents the execution time after applying the compilation techniques. Although it is not the same code, it demonstrates the importance of a Program Execution Model that enables the use of traditional compiler techniques.

To better understand the scalability of the Sequential Codelet Model and the different Codelet Level Parallelism techniques, it is necessary to study the effect of the number of CUs on the execution. The use of vector addition is not arbitrary. This simple problem allows exposing the limits of the emulation approach, or of the running system, thanks to its ability to saturate the system with extensive parallelism.

Additionally, the results of the SCMUlate runtime are compared against a simple OpenMP implementation. This baseline provides a comparison point with state of the art execution models.

Figure 7.8: Execution time comparison for the different execution modes (Sequential: 294.17 ms; Superscalar: 280.54 ms; Superscalar Unroll 4: 121.92 ms; Out of Order: 88.15 ms). Superscalar and Out of Order used 5 CUs.

First, the execution times for the different program versions and the available Codelet Level Parallelism techniques are obtained. Figure 7.9 shows the execution time of the OpenMP baseline, the Out of Order execution, and the Superscalar version with no unrolling, 4-times unrolling, and 8-times unrolling. As expected, Superscalar with no unrolling is the slowest execution and shows little scalability, followed by Unroll 4, Unroll 8, Out of Order, and OpenMP. Next, we compare the scalability within each execution mode, that is, the speedup between the single-CU execution of each mode and its execution with different numbers of CUs. Figure 7.10 shows the speedup progression as the number of CUs grows. The Superscalar version with no unrolling does not achieve any noticeable speedup, given its inability to exploit parallelism. Out of Order execution achieves the greatest speedup in comparison to its single-CU version.

This may be due to the high overhead that the OoO engine has to pay to discover parallelism, which makes its single-CU execution slower than its parallel executions. The reader must be careful when comparing these figures, as they all have different comparison baselines. Still, this figure gives a quantitative analysis of the ability to achieve better performance as the number of CUs increases within a given design.

Figure 7.9: Vector Addition execution time for different numbers of CUs. The size of the vector is the same in all cases. Lower is better.

The next figure is the comparison of the different approaches against the OpenMP execution. OpenMP is used as an ideal execution time. It should still be possible to obtain better performance than the OpenMP version used in this comparison. However, as previously mentioned, the intention is not to achieve the best performance, but to have a quantitative analysis of both SCMUlate and the SCM model, while having a comparison point with traditional execution models created for von Neumann based architectures. Figure 7.11 shows how the Out of Order execution mode achieves up to 80% of the execution time of OpenMP. Considering the overhead of the runtime, this result speaks well of the ability of the SCM model to achieve parallel execution of code.

Figure 7.10: Speedup progression with the number of CUs, compared against the execution with 1 CU within each mode of operation.

Finally, a Codelet is expected to have a bounded execution time regardless of the workload of the system. If a Codelet has a bounded execution time, it is possible for a compiler to optimize an SCM code based on static assumptions. The average execution time of the three major instructions across the different execution modes has been analyzed. These are LDOFF, STOFF, and the vecAdd_2048L Codelet. Figure 7.12 shows the slowdown of the execution time of these instructions with respect to the sequential mode, and its degradation as the number of CUs increases and the execution mode is changed. The figure shows two trends. In comparison to the stand-alone execution with 1 CU, the vecAdd_2048L and STOFF instructions show a performance degradation of up to 1.8 times when a heavy workload is present in the system. For LDOFF, the trend differs between the Superscalar and the Out of Order executions. The Superscalar execution shows a significant degradation in the execution time of load instructions of up to 4.3 times. For the Out of Order execution, the degradation is at most 1.4 times across the different instructions.

The differences in the LDOFF can be explained by the increased number of consecutive loads in the unrolled versions.

These consecutive loads create a bottleneck in the memory and increase the overall traffic of the memory interconnect. Modifying the order of the load operations, interleaving some compute operations among them, could solve these issues. Such changes can be applied by using compiler techniques similar to modulo scheduling [86].

Figure 7.11: Slowdown in comparison to a simple baseline using OpenMP. The higher the better.

7.4.2 Analysing the Vector Addition example

Vector Addition is a simple code, yet it shows the benefits and drawbacks of both the SCM model and the SCMUlate emulator. Furthermore, it provides guidelines on how to appropriately map the SuperCodelet architecture to a real architecture. First, it is worth clarifying that, under a von Neumann architecture, there are better approaches to improve the overall performance of vector addition. Furthermore, Intel architectures are usually equipped with countless features to improve the performance of both sequential and parallel workloads. Elements such as data prefetching, speculative execution and improved cache coherency protocols are some of the features that one way or another affect the ability of SCMUlate to deliver the best results. In the following, some drawbacks and important results are identified.

Execution Mode    Instruction     Average Execution Time
Out of Order      LDOFF           5.71 us
                  STOFF           2.11 us
                  vecAdd_2048L    4.68 us
Superscalar       LDOFF           4.42 us
                  STOFF           1.97 us
                  vecAdd_2048L    3.6 us
Superscalar U4    LDOFF           4.7 us
                  STOFF           1.96 us
                  vecAdd_2048L    3.6 us
Superscalar U8    LDOFF           5.2 us
                  STOFF           2.02 us
                  vecAdd_2048L    4.28 us

Table 7.1: Baseline execution time for the 3 most important instructions in Vector Addition, as seen in the different execution modes when running with 1 CU

7.4.2.1 Evaluation drawbacks

The first difficulty of the SCMUlate runtime is the overhead of the emulation of the L1 register file. Since the register file is actually a separate memory region that resides in DRAM, load and store operations from and to the register file result in additional data movements that considerably increase the overhead of this part of the system. These movements would not exist in a full realization of the SCM model. This overhead results in the reduced performance observed in the results that compare SCMUlate with other traditional computational models such as OpenMP. Furthermore, it is observable in the workload distribution between compute Codelets and memory instructions. All the different execution modes that were tested spent more than 88% of the execution time in memory operations. Furthermore, as can be seen in Table 7.1, load memory operations on average took longer than the compute operation vecAdd_2048L and the store operations. The load operation was, on average, 2.5 times slower than the store operation, and 1.2 times slower than the compute operation.

Figure 7.12: Average execution time of the three most important instructions in the execution of Vector Addition, as a function of the number of CUs and the implementation approach. Lower is better.

Although Vector Addition is a memory intensive operation, featuring 2 loads and 1 store per addition, data prefetching and cache mechanisms in current architectures usually offset the cost of memory by guaranteeing that data is close to the core when the load operation occurs. On the other hand, our emulation of the L1 register file increases the number of memory transfers and the traffic in the network.

It is possible to overcome some of these drawbacks. For emulation purposes, memory-bypassing transfers in the form of register renaming are currently being explored. The intuition is that, for a load operation, it should be possible to move a reference to the register in DRAM instead of performing a data movement (similar to register renaming in L1). Second, for a hardware implementation, the CU should not be in charge of executing memory instructions. Instead, a special DMA engine should issue these operations. If done correctly, and without the burden of the cache mechanisms (e.g. using scratchpad memory or more predictable cache protocols), the CU should be free to perform other work while data is being moved. Under these considerations the cost of memory access should be considerably reduced. Finally, there will be other applications that are compute intensive, which would take advantage of data in the L1 register file multiple times.

A second difficulty in the use of Vector Addition is that vector addition is an embarrassingly parallel problem. This problem is usually best resolved in SIMD or SPMD execution modes. For the Sequential Codelet Model, a possible solution to this issue is to extend the SCMUlate runtime and ISA with instructions that are similar to SIMD extensions, therefore allowing Codelets to map to multiple CUs at the same time and registers to be used by different CUs concurrently. An extension of SCM to match SPMD execution models could be made, but this is currently outside the scope of the project.

7.4.2.2 Important results

Despite the aforementioned drawbacks, there are important and positive observations that can be drawn from these experiments. First, the early evaluation in SCMUlate shows that it is possible to observe performance benefits from Codelet Level Parallelism through the extension of ILP-inspired optimizations that heavily borrow from dataflow models of computation. Results show implicit parallel execution of code with up to 3.8x improvements in comparison to the sequential execution of the same code in SCMUlate. Furthermore, SCMUlate shows that it is possible to separate memory from compute operations, allowing the compute operations to achieve high performance and high utilization of the compute resources. Efficiency can be improved by allowing the user to program the behavior of memory operations. One of the principles that the Sequential Codelet Model borrowed from the original Codelet Model is that Codelets must be atomically scheduled and non-preemptive: a Codelet is scheduled only when it is possible to guarantee that all the resources required for its execution are available before the Codelet starts executing. The control flow organization, together with the separation of Compute and Memory in the Sequential Codelet Model, makes it possible for Codelets to take full advantage of the CPU resources.

The average execution time of the vecAdd_2048L Codelet is as low as 3.6 us. Considering that during this time the CPU must perform 16384 additions, the effective rate for the execution of this Codelet is about 4.5 GFLOPS. The implementation of the Codelet also benefits from SIMD instructions, as its code is simple enough for the compiler to exploit the SIMD extensions of the x86 architecture. In general, Codelets have the potential to provide extra optimization opportunities, as they limit the code and leverage its properties.

Another important finding from this experiment is the noticeably low execution time of SU operations, although for Out of Order execution the utilization is relatively high. Many of the operations performed in the SU emulate elements that would be implemented in real circuitry featuring logic gates. The emulation results are encouraging and demonstrate that the SU can be a simplified architecture that can easily fit in the die area.

However, when comparing this scheduling execution time to the execution time of Codelet and Memory instructions, and considering the scalability results, it is possible to conclude that the size of Codelets (i.e. the size of the emulated registers) for the current architecture is too small. The asymptotic behavior in Figure 7.10 can be explained by the inability to keep the CUs occupied for a longer time. The SU takes a given time to schedule an instruction onto a CU. In order to keep all CUs busy, the time to schedule instructions for all units must be much smaller than the time a CU is busy. If instructions are too small, the scheduling bottleneck limits computation and scalability. This will be further supported in the Matrix Multiplication example.

Finally, these experiments have also shown how it is possible to use traditional compiler techniques to improve performance in the execution of SCM programs. The similarities between the SCM assembly code and traditional assembly code, in addition to the properties for the execution of SCM instructions, enable static analysis that could potentially benefit the overall execution of the program. We also observe an opportunity for scheduling techniques that would, for example, reduce the divergence in the execution of memory instructions observed in Figure 7.12.
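The 4.5 GFLOPS figure follows directly from the Codelet size and its measured execution time:

16384 additions / 3.6 us = 16384 / (3.6 x 10^-6 s) ~= 4.55 x 10^9 FLOP/s ~= 4.5 GFLOPS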

7.5 Example 2: Dense Matrix Multiplication

Dense matrix multiplication is one of the most widely used kernels in scientific computation and data science. This kernel has been studied thoroughly and it is well known to most readers. Furthermore, this kernel can benefit from the SCM execution model. Tiling is a well known pattern used for exploiting parallelism [87]. In matrix multiplication, tiling provides a structure that can be easily mapped to the hierarchical structure of the Sequential Codelet Model.

The underlying strategy adopted by this work is a tiled matrix multiplication. A Codelet of level L1 is created which performs a single matrix multiplication over two L1 registers, storing the result in another L1 register. This Codelet is called MatMult_2048L. The matrix multiplication performed inside this Codelet is fixed in size, and it is used as a tile of larger matrices. For the experiments presented in this work, square matrices whose dimensions are multiples of the tile size were used, to avoid dealing with corner cases that would not necessarily provide valuable information at the moment. Figure 7.13 shows an overall diagram that describes the matrix multiplication algorithm. The operation to be performed is C = C + A * B.

The MatMult_2048L Codelet uses registers of size 2048L (i.e. 2048 × 64 = 131072 bytes). These registers can fit a square tile of double precision elements of dimension √(2048 × 64 / 8) = 128. When copying data between memory and registers, depending on the overall matrix size, the elements in memory may not be contiguously allocated. Consequently, a special 2D memory access operation is required. Two memory Codelets are defined for loading and storing tiles from and to memory. These Codelets are named LoadSqTile_2048L and StoreSqTile_2048L respectively. As depicted in Figure 7.13, each load tile Codelet has an operand called dist (for distance). The distance corresponds to the padding between two consecutive rows (assuming row major storage for the matrices). Once tiles A, B and C have been loaded, the MatMult_2048L Codelet is ready to execute. Notice that tile C is shown both as an input of the Codelet and as its output. The C tile is an example of a READWRITE operand, since the Codelet reads from C and writes to C. Inside the implementation of the Codelet MatMult_2048L, a simple 128x128 matrix multiplication is performed.

Figure 7.13: Tiling strategy for matrix multiplication. 1) Matrices are divided into tiles of 128x128. 2) A special memory Codelet performs a load operation on tiles A, B and C. Since tiles are usually non-contiguous, a distance is sent as part of the operands. 3) The matrix multiplication Codelet is applied. 4) The R2048L registers contain the 128x128 tiles. 5) Regular matrix multiplication is applied on the tile; it is possible to use highly optimized BLAS libraries.

It is possible to use an optimized version of this Codelet by referring to any of the available BLAS libraries. This work uses three implementations: non-optimized matrix multiplication, user optimized matrix multiplication, and matrix multiplication with Intel's MKL library. We use these versions to study Codelet optimizations and their effect on the program execution time with respect to the scalability of the architecture.
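For reference, the non-optimized variant of the tile Codelet amounts to a plain 128x128 triple loop. The following sketch follows the style of Listing 7.2; the operand order and the name of the read-write flag (OP1_RDWR) are assumptions, not code taken from the SCMUlate sources.

// Sketch of the non-optimized MatMult_2048L Codelet in the style of Listing 7.2.
// Assumptions: operand 1 is the C tile (read-write), operands 2 and 3 are A and B.
DEFINE_CODELET(MatMult_2048L, 3,
               scm::OP_IO::OP1_RDWR | scm::OP_IO::OP2_RD | scm::OP_IO::OP3_RD);

IMPLEMENT_CODELET(MatMult_2048L,
  constexpr int N = 128;                                  // tile side: sqrt(131072 / 8)
  double *C = this->getParams().getParamAs<double*>(1);   // C tile (read and written)
  double *A = this->getParams().getParamAs<double*>(2);   // A tile
  double *B = this->getParams().getParamAs<double*>(3);   // B tile
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < N; k++)
        C[i * N + j] += A[i * N + k] * B[k * N + j];       // C = C + A * B on one tile
);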

1  LDIMM R64B_1, 0       ;// Loading base address A
2  LDIMM R64B_2, 131072  ;// Loading base address B
3  LDIMM R64B_3, 262144  ;// Loading base address C
4
5  // Single tile code
6  COD LoadSqTile_2048L R2048L_1, R64B_1, 128   ;// Load A tile
7  COD LoadSqTile_2048L R2048L_2, R64B_2, 128   ;// Load B tile
8  COD LoadSqTile_2048L R2048L_3, R64B_3, 128   ;// Load C tile
9  COD MatMult_2048L R2048L_3, R2048L_1, R2048L_2 ;
10 COD StoreSqTile_2048L R2048L_3, R64B_3, 128  ;// Store C tile
11
12 COMMIT ;

Listing 7.5: Matrix Multiplication, 1 tile: C = C + A * B

Listing 7.5 shows the matrix multiplication L1 SCM code for a single tile of A, B and C. The first three lines load the offsets in memory where matrices A, B and C are stored. This example hardcodes these values, but it is possible to load them from memory, as can be seen in the full matrix multiplication example in Appendix A. Lines 6 through 10 correspond to the process of multiplying a single tile of A and B and adding it into C. This is equivalent to the Main Codelet depicted in Figure 7.13. Notice that all operations are Codelets. Lines 6, 7, 8 and 10 are Memory Codelets. Line 9 is the matrix multiplication for this single tile.

The Memory Codelet LoadSqTile_2048L loads a tile, in 2D, into the register. Likewise, the Codelet StoreSqTile_2048L stores a tile, in 2D, back into memory. In these two Memory Codelets, the first argument corresponds to the L1 register that holds the tile. The second argument contains the address where the tile begins in L2 memory. And the third argument corresponds to the padding between rows of the same tile in memory. The padding is the distance that needs to be jumped in a 1D memory space in order to arrive at the next row. In this particular case, the padding refers to Level 2 memory. Once the tile is loaded into Level 1, all of its elements are consecutive.
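To illustrate the role of dist (an example of mine, with numbers matching the 10-tile experiments later in this chapter, and assuming row-major storage of 8-byte doubles): for a full 1280 x 1280 matrix, dist = 1280 elements, and the tile at tile coordinates (t_i, t_j) starts at

\[
\mathrm{addr}(t_i, t_j) = \mathrm{base} + \left(t_i \cdot 128 \cdot \mathit{dist} + t_j \cdot 128\right) \cdot 8\ \text{bytes},
\]

while for a matrix that is exactly one tile, dist = 128, as in Listing 7.5.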

If A, B and C are as big as a single tile, and this tile has a side of 128 elements, the distance between rows is 128 elements. Memory Codelets are powerful tools for memory transformations across levels, including gather/scatter operations.

Listing 7.6 shows the implementation of LoadSqTile_2048L. In SCMUlate there are three macros used to implement Memory Codelets: DEFINE_MEMORY_CODELET, MEMRANGE_CODELET, and IMPLEMENT_CODELET. The DEFINE_MEMORY_CODELET macro lets the developer define a new Memory Codelet and specify its properties. Line 1 in the Listing shows the name of the Codelet and the number of operands. Line 2 describes the direction of each operand (read, write or readwrite). Line 3 marks certain operands as address operands. When an operand represents a memory address in L2, or is needed to calculate an address, it must be handled carefully. If an address operand is not ready when an instruction is fetched, it is not possible to determine memory dependencies, and allowing other memory instructions to be processed could result in an incorrect order of memory operations. In SCMUlate, the pipeline stalls when an address operand is not available. Once the operand becomes available, the stall is resolved and the instruction analysis continues as normal.

The macro MEMRANGE_CODELET allows the user to specify the memory ranges (in L2) that the Codelet intends to access during its execution. The Memory Codelet class features a container that stores the read/write memory ranges to be accessed. No memory operation is performed in this function; instead, the container is filled with the corresponding memory locations. Line 15 uses the method addReadMemRange(), which adds a new read range, specifying its memory location and size. During the execution of SCMUlate, the SU scheduling mechanism uses this container to determine memory dependencies, allowing correct in-order memory access during out-of-order execution. In general, when accessing memory, the SuperCodelet architecture must be careful not to change the total order of memory operations as described in the sequential program.

1  DEFINE_MEMORY_CODELET(LoadSqTile_2048L, 3,
2      scm::OP_IO::OP1_WR | scm::OP_IO::OP2_RD | scm::OP_IO::OP3_RD,
3      scm::OP_ADDRESS::OP2_IS_ADDRESS | scm::OP_ADDRESS::OP3_IS_ADDRESS);
4
5  MEMRANGE_CODELET(LoadSqTile_2048L,
6      unsigned char *reg2 =
7          getParams().getParamAs(2);              // Getting register 2
8      unsigned char *reg3 =
9          getParams().getParamAs(3);              // Getting register 3
10     uint64_t address = reinterpret_cast<uint64_t>(reg2);
11     uint64_t ldistance = reinterpret_cast<uint64_t>(reg3);
12     ldistance *= sizeof(double);
13
14     for (uint64_t i = 0; i < TILE_DIM; i++)
15         addReadMemRange(address + ldistance * i, TILE_DIM * sizeof(double));
16 );
17
18 IMPLEMENT_CODELET(LoadSqTile_2048L,
19     unsigned char *reg1 =
20         this->getParams().getParamAs(1);         // Getting register 1
21     double *destReg = reinterpret_cast<double *>(reg1);
22     int i = 0;
23     for (auto it = memoryRanges->reads.begin();
24          it != memoryRanges->reads.end();
25          it++) {
26         // Address L2 memory to a pointer of the runtime
27         double *addressStart =
28             reinterpret_cast<double *>(getAddress(it->memoryAddress));
29         std::memcpy(destReg + TILE_DIM * i++, addressStart, it->size);
30     }
31 );

Listing 7.6: Definition and implementation of the LoadSqTile_2048L Codelet

The third macro, IMPLEMENT_CODELET, represents the actual behavior of the

Codelet. Therefore, the memory copies are performed in this stage. This macro is the same in both Memory and Compute Codelets, since it represents the functional description of the Codelet. In the case of Listing 7.6, the ranges calculated in the MEMRANGE_CODELET macro are used to access memory. The for loop in Line 23 uses the memoryRanges attribute of the Codelet class, which contains the memory ranges calculated in the aforementioned macro. Line 28 uses the getAddress method, which transforms an L2 address into an OS-accessible address, leaving the SCMUlate world and allowing programmers to think in terms of L2 regardless of where SCMUlate places memory at runtime.

The strategy for Memory Codelets used by SCMUlate, as described in the previous paragraphs, is merely an implementation decision, and it could be changed in favor of appropriate hardware (i.e., a unique memory controller with a queue that respects the order of memory operations) when implementing the SuperCodelet architecture. The definition and implementation of StoreSqTile_2048L are omitted given their similarity to LoadSqTile_2048L. The only differences are the direction of the operands and the direction of the memory ranges. SCMUlate, as well as this example, is open source and publicly available on GitHub [84].
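For completeness, a sketch of what StoreSqTile_2048L could look like is shown below, mirroring Listing 7.6. This is a reconstruction based on the description above, not the actual SCMUlate source: in particular, addWriteMemRange() and memoryRanges->writes are assumed to be the write-side counterparts of the read-side API used in Listing 7.6.

DEFINE_MEMORY_CODELET(StoreSqTile_2048L, 3,
    scm::OP_IO::OP1_RD | scm::OP_IO::OP2_RD | scm::OP_IO::OP3_RD,
    scm::OP_ADDRESS::OP2_IS_ADDRESS | scm::OP_ADDRESS::OP3_IS_ADDRESS);

MEMRANGE_CODELET(StoreSqTile_2048L,
    unsigned char *reg2 = getParams().getParamAs(2);   // Base address operand
    unsigned char *reg3 = getParams().getParamAs(3);   // Distance operand
    uint64_t address = reinterpret_cast<uint64_t>(reg2);
    uint64_t ldistance = reinterpret_cast<uint64_t>(reg3);
    ldistance *= sizeof(double);
    // Each row of the tile is registered as a *write* range (assumed API name)
    for (uint64_t i = 0; i < TILE_DIM; i++)
        addWriteMemRange(address + ldistance * i, TILE_DIM * sizeof(double));
);

IMPLEMENT_CODELET(StoreSqTile_2048L,
    unsigned char *reg1 = this->getParams().getParamAs(1);
    double *srcReg = reinterpret_cast<double *>(reg1);
    int i = 0;
    // Copy in the opposite direction: from the L1 register back into L2 memory
    for (auto it = memoryRanges->writes.begin();
         it != memoryRanges->writes.end(); it++) {
        double *addressStart =
            reinterpret_cast<double *>(getAddress(it->memoryAddress));
        std::memcpy(addressStart, srcReg + TILE_DIM * i++, it->size);
    }
);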

7.5.1 Three different implementations

Three different implementations of the MatMult_2048L Codelet were explored. Each implementation has a different performance profile. Having multiple implementations enables an analysis of how the size of a Codelet affects the SCM program execution model. The term Codelet size is briefly defined in this context as the time a Codelet takes to execute. This is an incomplete definition; a complete one should also consider memory requirements and computational complexity. However, this short definition will be used moving forward.

Listing 7.7 presents the definition of the MatMult_2048L Codelet. The DEFINE_CODELET() macro is used to describe the compute Codelet and its properties. Line 1 shows the Codelet name and determines that it uses 3 operands. Line 2

describes the direction of the first operand. This operand corresponds to the C tile, therefore it has a readwrite direction. Line 3 contains the direction of the second operand (the read tile of A), and Line 4 the direction of the third operand (the read tile of B). Listing 7.7 also shows the first implementation version: the non-optimized matrix multiplication. The IMPLEMENT_CODELET() macro contains the naïve matrix multiplication implementation in lines 12 through 16. No further details are necessary.

1  DEFINE_CODELET(MatMult_2048L, 3,
2      scm::OP_IO::OP1_WR | scm::OP_IO::OP1_RD |
3      scm::OP_IO::OP2_RD |
4      scm::OP_IO::OP3_RD);
5
6  IMPLEMENT_CODELET(MatMult_2048L,
7      // Obtaining the operands
8      double *A = this->getParams().getParamValueAs(2);
9      double *B = this->getParams().getParamValueAs(3);
10     double *C = this->getParams().getParamValueAs(1);
11
12     for (int i = 0; i < TILE_DIM; i++)
13       for (int j = 0; j < TILE_DIM; j++)
14         for (int k = 0; k < TILE_DIM; k++)
15           C[i*TILE_DIM + j] += A[i*TILE_DIM + k] * B[k*TILE_DIM + j];
16 );

Listing 7.7: Non-optimized version of MatMult_2048L

Next, Listing 7.8 contains the second implementation version: the user optimized matrix multiplication. The DEFINE_CODELET macro has no changes, and is therefore not shown in this Listing. Inside the IMPLEMENT_CODELET() macro there is only a single change in comparison to Listing 7.7: Lines 8 and 9 (the two innermost loops) have been swapped. This change allows a modern compiler to easily vectorize this code using SIMD extensions.

As will be shown later in the results, the execution time of this version is considerably smaller than that of the naïve implementation.

1  IMPLEMENT_CODELET(MatMult_2048L,
2      // Obtaining the operands
3      double *A = this->getParams().getParamValueAs(2);
4      double *B = this->getParams().getParamValueAs(3);
5      double *C = this->getParams().getParamValueAs(1);
6
7      for (int i = 0; i < TILE_DIM; i++)
8        for (int k = 0; k < TILE_DIM; k++)
9          for (int j = 0; j < TILE_DIM; j++)
10           C[i*TILE_DIM + j] += A[i*TILE_DIM + k] * B[k*TILE_DIM + j];
11 );

Listing 7.8: User optimized version of MatMult_2048L

IMPLEMENT_CODELET(MatMult_2048L,
    // Obtaining the operands
    double *A = this->getParams().getParamValueAs(2);
    double *B = this->getParams().getParamValueAs(3);
    double *C = this->getParams().getParamValueAs(1);

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                TILE_DIM, TILE_DIM, TILE_DIM, 1, A,
                TILE_DIM, B, TILE_DIM, 1, C, TILE_DIM);
);

Listing 7.9: Intel MKL version of MatMult_2048L

Finally, Listing 7.9 uses Intel's Math Kernel Library (MKL) to highly optimize the execution of the matrix multiplication. This version can also be seen as the most a

developer could optimize this Codelet, given the advanced optimizations used by the MKL library.

7.5.2 A GPU implementation of Matrix Multiplication

Recent developments in compilers for High Performance Computing have made it easier to translate code for accelerator devices. In particular, the OpenMP offloading features introduced in version 4.0 allow compilers to target sections of code to accelerators. Likewise, OpenMP 5.0 introduced code annotations that enable unified shared memory between accelerator and host. The Intel oneAPI framework features a new compiler [88] capable of offloading to Intel GPUs, including the Gen 9.5 architecture in the testbed used by this work. Therefore, it is possible to write code that offloads directly to the GPU.

Listing 7.10 shows the implementation of the matrix multiplication Codelet using OpenMP 5.1 offloading features to target the Intel integrated GPU. The _Pragma() form is used because IMPLEMENT_CODELET() is a C macro, and therefore the commonly used #pragma is not permitted here. Notice that, in comparison to the other three versions previously introduced, the only change in this code is the annotation that enables GPU offloading. Furthermore, oneAPI allows using the MKL version of cblas_dgemm on the GPU through OpenMP 5.1's target variant dispatch feature. For more information, please refer to the OpenMP 5.1 specification document.

There are two important considerations. First, the reader must understand that the purpose of using OpenMP is only to easily target the GPU. When using heterogeneous computation in the SCM model, the CU will possibly have an execution model different from SCM. Therefore, the use of OpenMP is limited to translating the code to the GPU and its execution model. This does not undermine nor contradict the principles of the SCM model. Second, both the oneAPI compiler and the use of heterogeneous computation in this code are in early development stages, as will be observed in the results presented later in this Chapter.

IMPLEMENT_CODELET(MatMultGPU_2048L,
    double *A = this->getParams().getParamValueAs(2);
    double *B = this->getParams().getParamValueAs(3);
    double *C = this->getParams().getParamValueAs(1);
    _Pragma("omp target data map(to: A[:TILE_DIM*TILE_DIM], B[:TILE_DIM*TILE_DIM]) map(tofrom: C[:TILE_DIM*TILE_DIM])")
    {
      _Pragma("omp target variant dispatch use_device_ptr(A, B, C)")
      {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    TILE_DIM, TILE_DIM, TILE_DIM,
                    1, A, TILE_DIM, B, TILE_DIM, 1, C, TILE_DIM);
      }
    }
);

Listing 7.10: Definition of the GPU Codelet MatMultGPU_2048L using Intel MKL support for the Gen9 GPU and OpenMP 5.1 features

7.5.3 Matrix Multiplication CPU Results

To properly evaluate matrix multiplication it is necessary to use different matrix dimensions. These experiments focus on square matrix multiplication. Four different dimensions, expressed as multiples of the tile size, are used: 10, 20, 30 and 40 tiles. A single tile is 128x128 elements; therefore, the 10x10 tile experiment, for example, computes a matrix multiplication of 1280x1280 elements in each matrix. Additionally, strong scaling is studied. Finally, the experiments lead to an exploration of how Codelet size and scheduling time relate to each other, providing the first hints towards answering the question: what is the appropriate Codelet size in the SCM model? All experiments use the Out of Order execution mode only.


Figure 7.14: Execution time vs number of CUs for the naïve version of Matrix Multiplication. Logarithmic scale in the vertical axis.

For the CPU results, the three aforementioned versions are used: non-optimized, user optimized and Intel MKL. Results are shown first, and their analysis is left to Section 7.5.5.

7.5.3.1 Execution Time

Let us begin with the execution time. As expected, the execution time depends on the number of Compute Units (CUs). The lower the execution time, the better the application performance. Likewise, as the number of CUs increases, the execution time is expected to decrease: ideally, as more compute capability is added, a lower execution time should be observed. Figure 7.14 shows the execution time as a function of the number of CUs for the naïve implementation of the MatMult_2048L Codelet. Figure 7.15 shows the same metric for the user optimized matrix multiplication, and Figure 7.16 for the Intel MKL version.


Figure 7.15: Execution time vs number of CUs for the optimized version of Matrix Multiplication. Logarithmic scale in the vertical axis.


Figure 7.16: Execution time vs number of CUs for the Intel MKL version of Matrix Multiplication. Logarithmic scale in the vertical axis.


Figure 7.17: Strong scaling for the naïve version of Matrix Multiplication.

7.5.3.2 Scalability

Strong scaling is used to understand the effect that increasing the number of CUs has on the execution time. Scalability is calculated with respect to the sequential execution (CU = 1) for each problem size and optimization version. The following results are intended to answer two questions: 1) can the SCM model scale? and 2) how does the size of Codelets affect scalability? The second question is important for designing a hardware-implemented SuperCodelet architecture. The expected behavior is that for each CU added, the scalability factor increases by 1, forming the straight line y = x.

Each experiment has a different baseline; therefore, the reader must be careful when comparing results. The overall behavior is what is being compared, with the intention of evaluating the machine and not the kernel itself. Additionally, these experiments aim to understand whether the size of the problem has implications for scalability. Figure 7.17 shows strong scaling results for the original matrix multiplication, with no optimizations applied. Figure 7.18 shows the same metric for the user optimized version and, finally, Figure 7.19 for the version that uses Intel MKL on the CPU inside the MatMult_2048L Codelet.
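Concretely, the speedup plotted in the following figures is computed as (my notation; the figures simply report it as speedup in times):

\[
S(N_{CU}) = \frac{T_{\mathrm{exec}}(N_{CU} = 1)}{T_{\mathrm{exec}}(N_{CU})},
\]

so ideal strong scaling corresponds to S(N_CU) = N_CU.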


Figure 7.18: Strong scaling for the user optimized version of Matrix Multiplication.


Figure 7.19: Strong scaling for the Intel MKL version of Matrix Multiplication.

The vertical axis corresponds to the speedup with respect to CU = 1; the horizontal axis corresponds to the number of CUs.

7.5.3.3 Instruction size

Another important aspect for understanding the time and scalability results is the execution time of the three most important instructions of this kernel: StoreSqTile_2048L, LoadSqTile_2048L and MatMult_2048L. To this end, the SCMUlate tracing tool outputs statistics on instruction execution. These results aim to answer the question: what is the effect of the number of CUs on the execution of a Codelet? Ideally, as described in the theoretical background, no changes to the execution time should occur. The larger the slowdown, the more performance degradation exists with respect to the sequential execution time. The sequential execution time is used as the baseline since this execution mode is expected to suffer the least noise from the rest of the system and runtime. Figure 7.20 shows the instruction degradation as the number of CUs increases. The value for 1 CU is hidden since it is the baseline (i.e., slowdown = 1). Furthermore, only the matrix size M = N = K = 10 is shown; no considerable changes in the degradation were observed with respect to the size of the matrix.
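In other words, the metric reported in Figure 7.20 is (again, my notation):

\[
\mathrm{slowdown}(N_{CU}) = \frac{\bar{T}_{\mathrm{codelet}}(N_{CU})}{\bar{T}_{\mathrm{codelet}}(N_{CU}=1)},
\]

where the bar denotes the per-instance Codelet execution time obtained from the SCMUlate trace.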

7.5.4 Matrix Multiplication GPU Results

To evaluate the behavior of the SCMUlate system in the presence of GPUs, the same experiments were executed with the GPU version of the MatMult_2048L Codelet. Likewise, different dimensions, expressed as multiples of the tile size, are shown. These experiments act as a proof of concept of a heterogeneous SuperCodelet architecture, as described in Chapter 5.

As previously mentioned, these experiments use the Intel Gen 9.5 architecture; in particular, the Intel HD Graphics 630, containing 24 Intel GPU Execution Units [83]. Support for GPU offloading on these GPUs was limited and actively under development at the time of writing this thesis; therefore, the software and tools used to create these graphs are mostly beta versions.


Figure 7.20: Codelet performance degradation for Matrix Multiplication, M = N = K = 10 tiles.

7.5.4.1 Execution time

Figure 7.21 shows the execution time of the matrix multiplication microbenchmark when running on the GPU. The lower the execution time, the better the performance of the application. The number of CUs represents the number of CPU threads actively pushing work into the GPU. The SCMUlate runtime has not been made GPU-aware; instead, a GPU Codelet is used which contains a call to the GPU kernel. The execution of the GPU Codelet stalls the CPU core that manages the instantiation of the GPU kernel. This approach results in a considerable reduction of system utilization, since CPU cores are just waiting for the GPU. A combined CPU+GPU execution is currently being explored.


Figure 7.21: Execution time vs number of CUs for the GPU version of the Matrix Multiplication Codelet. Logarithmic scale in the vertical axis.

7.5.4.2 Scalability

Figure 7.22 shows the strong scaling results for the GPU version of the matrix multiplication Codelet. As mentioned in the execution time results, for GPU execution a CU represents a CPU thread pushing work into the GPU. A more appropriate mapping between the GPU Execution Units and SCMUlate is still required in order to better fit the Sequential Codelet abstract machine to the testbed architecture. Consequently, the strong scaling results must be considered carefully.

7.5.4.3 Instruction size

Finally, the instruction performance degradation is studied for the GPU execution mode. Figure 7.23 shows the slowdown in the execution of the Codelets as the number of computational elements is increased. More details are analyzed and discussed in the following section.


Figure 7.22: Strong Scaling for the GPU version of the Matrix Multiplication Codelet. Logarithmic scale in the vertical axis.


Figure 7.23: Codelet performance degradation for the GPU version of the Matrix Multiplication Codelet.


Figure 7.24: Big O for Matrix Multiplication as observed from L1: O(N^3) for the implemented MM. Execution time change relative to the 10x10 case (vertical axis) vs matrix dimension (horizontal axis).

7.5.5 Comparison and analysis

Having laid out all the results obtained from the SCMUlate emulator for the matrix multiplication implementation, this section presents some important observations and conclusions that can be drawn from them. Some additional graphs are presented to summarize or emphasize the findings.

First, the execution time results are studied. The operational complexity of the commonly used square matrix multiplication algorithm is O(n^3). The complete SCM code for level L1 that implements matrix multiplication is presented in Appendix A. Regardless of the algorithm used inside the matrix multiplication Codelet (i.e., non-optimized, user optimized, or Intel MKL), the execution time of a Codelet is static and known at compile time. Therefore, at level L1 it is possible to study the Big O operational complexity in terms of the number of tiles. Based on the algorithm presented in Appendix A it is possible to observe the same O(N^3) complexity, where N is expressed in terms of the number of tiles.

Figure 7.24 studies the change in execution time as N is increased. The experiment for N = 10 is used as the comparison point for all the other execution times.
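As a concrete check of the counting argument (a worked example, not a measurement): the L1 program issues one MatMult_2048L Codelet per (i, j, k) tile triplet, so the expected cost relative to the 10-tile baseline is

\[
\frac{T(N)}{T(10)} \approx \left(\frac{N}{10}\right)^{3} = 1,\ 8,\ 27,\ 64
\quad \text{for } N = 10, 20, 30, 40.
\]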


Figure 7.25: 5 CUs and M = N = K = 20. Bars: Codelet execution time (left axis). Line: program execution time (right axis).

The red dots in Figure 7.24 correspond to the ideal O(N^3) for N = 1, 2, 3, 4, considering the baseline 10x10 to be N = 1. Regardless of the execution mode or architecture (CPU or GPU), the observed operational complexity in terms of the size of the problem matches the expected value. The ability to interpret an algorithm in the upper levels of the SCM machine is paramount for performance analysis and algorithm design in future large scale computer systems. The isolation of levels in the SCM also provides a framework for reusing already existing compilation techniques.

Additionally, the different versions of the MatMult_2048L Codelet yield different execution times. Between versions, it is only the execution of the Codelet that changes. Therefore, it is expected that the reduction in the overall program execution time is proportional to the reduction in the execution time of the Codelet. Figure 7.25 overlays the Codelet execution time with the overall program execution time. The left axis is used for the Codelet execution time, and the right axis is used for the program execution time; they are 3 orders of magnitude apart. The ratio of Codelet execution times between

the Out of Order non-optimized and user optimized runs is about the same as the ratio of program execution times between the same two experiments: for Codelet execution time, OOO_NoOpt_cod / OOO_Opt_cod = 4.57, and for program execution time, OOO_NoOpt_prog / OOO_Opt_prog = 4.10. However, this does not hold for the MKL version in comparison to the user optimized version: for Codelet execution time, OOO_Opt_cod / OOO_MKL_cod = 4.87, while for program execution time, OOO_Opt_prog / OOO_MKL_prog = 1.49.

As expected, the reduction in the execution time is due to the improvement in the Codelet size. However, there seems to be another bottleneck that limits the ability of the program to continue scaling in the same proportion. To that end, let us analyze both the scalability of the program and the trend in the instruction execution time.

Figures 7.17, 7.18, and 7.19 show an interesting trend: there is no change in the scalability behavior between different matrix sizes. Figure 7.26 focuses only on size M = N = K = 40 for the three different implementations. Starting with the non-optimized Codelet, almost perfect strong scaling occurs between 1 and 5 CUs. Once the 6th CU is introduced, the trend flattens, but it is still increasing. For the user optimized version, strong scaling has a similar trend; however, after the 5th CU, the execution time worsens and the program stops scaling. This trend is even worse for the MKL version.

When comparing this plot to the evolution of the Codelet execution time presented in Figure 7.20, it can be concluded that some of the performance degradation is due to the degradation of the execution time of the Codelets. Furthermore, the break after the 5th CU can be attributed in part to the testbed architecture. As can be seen in Figure 7.2, a single core of the Intel i7 8700K contains two hardware threads in SMT configuration. Moreover, after the 5th CU all the system cores have been claimed by some part of the architecture: 5 cores are used by the CUs, and the SU occupies the 6th core. Therefore, as the number of CUs is increased beyond the number of physical cores, resource sharing inside the Intel architecture occurs, resulting in further degradation. This behavior is worse in the user optimized version than in the non-optimized version, since the Codelet uses the SIMD execution mode.

Figure 7.26: M = N = K = 40. Scalability comparison for the different implementations.

SIMD hardware is often shared across SMT threads, so after the 5th CU each core suffers from higher contention. Finally, memory bandwidth also contributes to the performance and scalability degradation. Without controlling the number of CUs that simultaneously access memory, the bandwidth to L3 is saturated. This is also the reason why the degradation for Memory Codelets appears to be worse than that of the compute Codelets. Despite all these factors, the lack of scalability in the MKL version cannot be fully attributed to them. Although some performance degradation occurs, the major problem with MKL comes from the sequential bottleneck: as the Codelet execution becomes shorter, the ability of the SU to schedule enough work across CUs is hampered. This can be explained by Amdahl's law.
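For reference (the standard formulation, not from the thesis text), Amdahl's law bounds the achievable speedup when a fraction p of the execution is parallelizable across N_CU units:

\[
S(N_{CU}) \le \frac{1}{(1 - p) + \dfrac{p}{N_{CU}}},
\]

and in the SCM setting the sequential fraction is dominated by the SU fetching, decoding and scheduling each Codelet, which corresponds to the T_S term analyzed in the next section.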

7.6 Defining the appropriate size of Codelets

Figure 7.27 shows a diagram that allows building a mathematical intuition for the relation between the number of CUs and the size of a Codelet.


Figure 7.27: Understanding the sequential bottleneck. T_S: time to schedule an instruction into a CU. T_C: compute time. T_MIN: minimum compute time to avoid the T_WASTE sub-utilization.

Let us assume that the SU takes a constant time T_S to schedule an instruction (Codelet) into a CU. For simplicity, the scheduler uses a round robin mechanism, although any other scheduler with a static scheduling time would result in the same analysis. The Codelet has a compute time T_C. Notice that as the number of CUs increases, the time it takes the scheduler to finish assigning work to all the units grows proportionally to the number of CUs. Assuming no synchronization cost (i.e., the Codelet starts execution as soon as it is assigned to a CU) and no extra overhead in the system, the time between one allocation and the next one for the same CU is equal to T_S x N_CU, where N_CU is the number of CUs in the system. This time is referred to as T_MIN, and it is the shortest compute time for which all CUs remain busy all the time. If T_C is shorter than T_MIN, there will be wasted execution equal to T_WASTE in the figure.

This analysis is overly optimistic. It assumes that, at any time, instructions will be available for the SU to schedule (e.g., no dependencies in the instruction stream). If a program does not provide enough parallelism within the window of instructions considered by the Out of Order execution engine, there will not be enough instructions to be scheduled. However, this is an application specific issue that cannot

be solved on the architecture side. It is always possible to increase the window of instructions that the OoO engine is aware of.

Another drawback of this analysis is that it ignores that there will be some minor instructions (e.g., basic arithmetic and control flow instructions) allocated to some ALU, each taking a certain execution time.

Even if the T_S for these instructions is negligible (i.e., T_S_ALU << T_S), programs often include several of these instructions between Codelet instructions. This problem can be mitigated if T_C is increased considerably beyond T_MIN. However, if the size of a single Codelet is too large, any dependent instruction after the Codelet will have a long waiting time in the OoO queue.
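Summarizing the reasoning above in one place (condensed notation, with T_WASTE as depicted in Figure 7.27):

\[
T_{MIN} = T_S \cdot N_{CU}, \qquad
T_{WASTE} = \max\!\left(0,\; T_{MIN} - T_C\right), \qquad
\text{full utilization} \iff T_C \ge T_S \cdot N_{CU}.
\]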

7.6.1 Empirical observations from Matrix Multiplication

Empirically, we observe that the execution time of the Codelet for the non-optimized matrix multiplication is 1.2 ms. Using the trace, it was observed that the distance between consecutively scheduled instructions was between 20 us and 30 us, influenced by the other instructions in the program. For 11 CUs, T_MIN is therefore between 220 us and 330 us. In matrix multiplication, it seems plausible that keeping T_C a single order of magnitude above T_MIN provides enough time to allow for more parallelism.

For the user optimized matrix multiplication, the Codelet execution time is T_C = 349 us. For 6 CUs, T_MIN = [120 us, 180 us]. This means that the ratio between T_C and T_MIN is only between 2 and 3. Given the additional noise in the system, this is the point where not enough parallelism is achievable.

Finally, for the MKL Codelet version, the compute time is about T_C = 50 us. With T_S = 20 us, it is not possible to scale beyond 2 CUs. Figures 7.26 and 7.19 show how these estimates are confirmed by the observations in SCMUlate.
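A quick back-of-the-envelope check of these observations (this snippet is not part of SCMUlate; it only applies the bound N_CU <= T_C / T_S to the times reported above, taking T_S ~ 20 us):

#include <cstdio>

int main() {
  // Empirical Codelet compute times reported in this section.
  struct { const char *version; double t_c_us; } codelets[] = {
      {"No-optimized", 1200.0},   // ~1.2 ms
      {"User optimized", 349.0},  // ~349 us
      {"Intel MKL", 50.0},        // ~50 us
  };
  const double t_s_us = 20.0;     // approximate scheduling time per Codelet
  // Note: this ignores the physical-core limit discussed in Section 7.5.5,
  // which caps the usable CUs at 5 on the i7 8700K testbed.
  for (auto &c : codelets)
    std::printf("%-15s T_C = %7.1f us -> at most ~%.1f busy CUs\n",
                c.version, c.t_c_us, c.t_c_us / t_s_us);
  return 0;
}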

7.6.2 Designing L1 with T_S

In order to allow more L0 CUs in a single L1 while maintaining scalability, there are two strategies that could be used. For a given Codelet compute time T_C, the first is to improve T_S so that

T_S < T_C / N_CU. A particular case of this strategy is the use of SIMD-like execution modes in L1.

Such a change would modify the definition of T_MIN so that it no longer depends on the number of CUs. For example, the concept of a warp in a modern GPU requires a single scheduling time for all the hardware threads that form the warp. However, not every application necessarily benefits from SIMD-like execution models.

A second strategy is to increase the size of the Codelet. The size of a Codelet is determined by the time and memory complexity of the underlying algorithm in the Codelet implementation. For given compute capabilities of the L0 cores (i.e., FLOPS) and a given memory bandwidth between L0 and L1, it is possible to determine the register and problem sizes within the Codelet that yield the desired compute time. Such information could also be used to specify the size of the registers.

Designing L1 will be a combination of using the relations presented in Section 7.6 and applying knowledge of the limits of the technology to be used. The compute capabilities in L0, measured in FLOPS, and the bandwidth between L1 and L0 can be used to estimate T_C. Two extremes are observed: the execution time of a Memory Codelet that performs no operations in L0 is bounded by the bandwidth, while a compute Codelet that accesses no memory outside its registers is bounded by the FLOPS. Given the desired N_CU and T_C it is possible to calculate the target T_S; or, given a T_S limit, it is possible to calculate N_CU. If more CUs are desired, it is always possible to continue the design at L2, using multiple L1s as compute units of L2. The analysis would be similar to the one described in this section.
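To make the two extremes concrete (a back-of-the-envelope model, using the d x d tile of doubles from this chapter as the example workload):

\[
T_C^{\mathrm{compute}} \approx \frac{2\,d^{3}}{\mathrm{FLOPS}_{L0}},
\qquad
T_C^{\mathrm{memory}} \approx \frac{8\,d^{2}\ \text{bytes}}{\mathrm{BW}_{L0 \leftrightarrow L1}},
\]

so for a target T_C one can solve for the tile dimension d (and therefore a register size of 8 d^2 bytes) from whichever bound dominates, and then size the level using N_CU <= T_C / T_S.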

7.6.3 Application and Codelet Size

When porting an application to the SCM, it may be the case that the desired T_C cannot be achieved due to limited memory or compute complexity. In such cases, the reader is reminded that each level of the SCM is a Turing complete machine. It is not always

necessary to map Codelets to the levels below. It is possible to use the compute capabilities of the levels above to execute parts of the program. Any of the existing in-memory compute or on-network compute strategies can be used as extensions of the SCM model.

Chapter 8

RELATED WORK

Computer architecture is the art of defining the structure of different components and the abstractions that rule the behavior and interaction of those components. The search space of parameters that define an architecture is large. However, the objective is to create a system that is capable of solving problems of different natures, that is, a general purpose computer. There have been several attempts to create general purpose parallel architectures. In this section we focus on the architectures that have influenced this work the most, and that compare to the proposed approach.

Computation is about data transformation according to mathematical operations. Data is stored in some form of numerical value that encodes information. While numbers are usually encoded in a base 2 numeral system (i.e., binary numbers), other data use numeric representations with a predefined format; for example, colors are usually represented as a triplet of numbers (e.g., RGB). Computers rely on the interaction of three different sets of components: computational elements that perform mathematical operations, memory elements that store data, and I/O interfaces that allow the system to interact with the outer world. These components are linked together through interconnection networks. At a high level, programming a computer consists of assigning work to the different components of the machine through the use of predetermined instructions. This work usually results in performing a complex operation through the transformation of data in memory using the available hardware operations.

On the software side, programming models define abstract machines and execution models as well. These models must be mapped to the execution model of the underlying hardware. Runtime systems bridge the abstraction of the execution model

of the programming language to the execution model of the underlying architecture. A compiler bridges the programming language syntax and semantics to the runtime system and the underlying hardware ISA.

Two major groups of computational models exist, representing two ends of a spectrum: dataflow machines and Von Neumann machines. In the middle there are hybrid architectures that take advantage, one way or another, of elements of the two larger groups. These models may be implemented purely in software, purely in hardware, or through hardware/software co-design strategies.

8.1 Dataflow systems

Karp and Miller first described the concept of dataflow computational models in 1966 [35]. Programs in dataflow are described as a directed graph where each node represents an operation, and the arcs connecting the nodes represent data and control flow dependencies between operations. In 1974, Jack Dennis defined the first dataflow architecture [36] and its corresponding programming language [37].

An important note is that dataflow is a model of computation. Therefore, it can influence the design of abstract machines in both software and hardware. In this section we refer to hardware architectures whose instructions are expressed as a dataflow graph and which have no concept of program counter or sequential program execution order whatsoever. Dataflow architectures are characterized by having many computational elements interconnected together. Some dataflow systems use a centralized token memory, while others use point to point communication between processing elements.

After Dennis' first dataflow machine [36], many more dataflow systems were defined. We describe some of the most influential ones. Dennis revisited his original design in 1980 [89] and proposed another architecture with an emphasis on the interconnection network, revisiting how to keep communication costs as low as possible while the number of cores increases. Dennis recognized that dataflow based

interconnections tend to grow in complexity faster than traditional multiprocessing switching interconnections, but that their growth can be kept almost linear.

Some dataflow machines depend on circular pipelines. These are composed of processing elements, a token storage, and a matching mechanism that assigns tokens to instructions. The processing elements execute instructions and send the results back into the pipeline. The results are captured by the token matching mechanism and stored in the token memory. Finally, a scheduling mechanism selects the next ready instruction and assigns it to the processing elements. An example of a circular pipeline was used in James Rumbaugh's multicore architecture [38] in 1977. Rumbaugh's system consisted of multiple cores connected to a scheduler logic and a structure memory. A static dataflow approach was used within each processor and executed through a circular pipeline as described above. An execution frame enclosed a static dataflow graph and was mapped onto a single processing element. This enclosing execution frame resembles procedures that map to a single core. Inter-procedural creation and value return happen through the use of a special APPLY operation, which creates new procedures that map to other computational elements. If the computational elements were starved by the number of procedures, the context was swapped into a special external memory.

Another important architecture was the MIT Tagged Token Dataflow Architecture (TTDA) by Arvind et al., defined in the late 80's [39]. This architecture used a technique of coloring dataflow tokens to allow for dynamic re-utilization of dataflow programs. Contrary to the static versions, dynamic dataflow allowed the same graph to be executed multiple times, while differentiating executions through a tagging mechanism. New operations allowed creating new colors, as well as changing the colors of tokens back to the caller's color. The MIT TTDA also used the Id programming language to express dataflow graphs. The Monsoon project [90] by Papadopoulos and Culler evolved from the MIT TTDA and introduced the Explicit Token Storage (ETS) architecture. The ETS creates a frame for each independent execution of the dataflow graph. While this frame is not novel by itself, the Monsoon machine shows a potential

implementation that also uses Arvind's I-structures [68] for storing tokens and data structures. In addition to these architectures, many others were proposed: LAU (1976) [43], DDM1 Micro (1976) [41], the Data Driven Processor Array (DDPA, 1983) [91], the Distributed Data Driven Processor (DDDP, 1983) [92], the Manchester Dataflow Computer (1985) [93], PIM-D (1986) [94], HDFM (1985) [42], and more recently the Intel CSA patent in 2016 [95].

8.2 Out of Order Execution and other ILP techniques

While architectures that use Out of Order (OoO) execution are not dataflow architectures in the strict sense, these engines are inspired by dataflow mechanisms. OoO uses a dependency analysis within a window of instructions to exploit parallelism. Several of the dataflow analyses that occur during OoO execution rely on important work done for dataflow architectures. Out of Order execution takes advantage of the locality provided by the user who writes the code sequentially, as well as of potential compiler optimizations that could increase the parallelism.

Among the first Out of Order execution systems ever created is the Scoreboard mechanism used in the CDC6600 [55] (1964). Later, Tomasulo's algorithm [57] removed potential stalls on Read After Write (RAW) conflicts in the pipeline; the IBM 360 Model 91 [4] implemented Tomasulo's algorithm (1966). Out of Order and ILP-based optimizations for the sequential execution of code based on dataflow techniques were later explored in the HPS system (1985) [96] and in HPSm, a minimal simulated version of HPS (1986) [97]. Finally, interruptions in the context of Out of Order execution were defined in 1988 [98]. Out of Order engines are still used in many single core architecture designs (Intel Core processors, IBM POWER7, and others), and they were fundamental to the success of many commercially available processors in the 90's (IBM POWER1, MIPS R10000, Intel Pentium Pro, AMD K5, and others).

Among other innovations in sequential computers that increased instruction level parallelism [99] is the concept of pipelining [100], introduced in 1959. Pipelining breaks down the single cycle machine into independent stages of instruction execution. Furthermore, superscalar architectures [55], first introduced with the CDC6600 (1964), are essential for ILP. These systems feature multiple ALUs and floating point arithmetic units. Moreover, speculative execution [101] and branch prediction [102] mechanisms allow progress to be made in the execution of code, considerably increasing the performance of computer architectures. Despite their success, these techniques have recently been reconsidered, as they have been shown to create security risks such as the Spectre [103] and Meltdown [104] attacks.

Very Long Instruction Word (VLIW) architectures enable the grouping of several operations in the same instruction [105]. These techniques also aim to increase the number of instructions per cycle, and they can benefit from compiler optimizations that are able to find sets of instructions to combine.

In this work we recognize the importance of these ILP optimizations and we encourage a combination of the SuperCodelet architecture with these other architectures. To this end, the Hierarchical Turing Machine should be the one element that is able to join them under the same umbrella.

8.3 Other parallel architectures

Other parallel architectures have also been proposed. One example is the systolic array, introduced by Kung and Leiserson in 1978 [106]. Another is the vector processor, described in the 60's and 70's and popularized by the Cray-1 machine [12] in 1978. Vector-inspired processors were also later successfully used in the Connection Machines CM-1 and CM-2, among others [107], in 1988. Vector systems heavily influenced the SIMD-based architectures widely used in GPGPU accelerators.

Several multicore and multithreading approaches have also been proposed. Simultaneous Multithreading (SMT) [108] extends the register file and program counter

of an architecture, taking advantage of multithreaded programs and the multi-process capabilities of Operating Systems. The first implementation of SMT dates back to the IBM Advanced Computer Systems project: the ACS 360 in 1968 [109]. Its two-instruction-counter approach gave the impression of two different CPUs, while allowing resources to be shared through instruction coloring. Currently, it is common to find SMT-2 and SMT-4 approaches in systems such as Intel architectures, IBM POWER architectures, and AMD systems.

Another architecture worth mentioning is the Traleika Glacier architecture by Intel, developed during the X-Stack project [110]. An important lesson learned from this architecture is the use of control loops to manage energy goals through frequency domains and DVFS techniques. The master's thesis by Aaron Landwehr [111] explored different self-aware techniques that used software annotations to communicate with the runtime in order to make better decisions towards energy savings. This is another line of work that is important in the context of this project. Similarly, Andrew Chien proposed the 10x10 architecture [112], which relies on a new, highly heterogeneous paradigm that, instead of using heavy architectures to execute most of the workload, distributes the execution across several highly optimized computational units with different capabilities.

Currently, there is an increased interest in neuromorphic chips. These often use dataflow approaches to execute neural networks (NNs). Similar to dataflow computation, NNs are represented as a graph of neurons, each with a given functionality (e.g., convolution, ReLU, or sigmoid functions, among others). Neuromorphic chips often map these neurons to dataflow-inspired chips with application specific circuitry for Artificial Intelligence (AI) and Machine Learning (ML). Some examples are Google TPUs [9], IBM TrueNorth [113], and the Tianjic chip [114].

It would require a truly extensive survey to cover all the different innovations in the field of parallel computer architectures. However, many of the elements that have already been explored would fit the description of the Sequential Codelet Model, either by acting as application specific ALUs, or by redefining

an implementation at any level of the hierarchical Von Neumann architecture.

8.4 Von Neumann/Dataflow hybrid systems

As previously mentioned, dataflow-inspired architectures and Von Neumann-inspired architectures are two opposite ends of a spectrum of possible architectures. At one end, Von Neumann offers shared memory abstractions where operations can be mapped to any location; at the other end, dataflow uses completely independent memory arcs that connect operations, with a limited mapping of memory locations to instruction operands. In the middle, there have been several attempts to build hybrid models. Thorough surveys of these approaches can be found in [49], [115], [116], and [117].

The closest relatives to the work proposed in this thesis are the original Codelet Model (as explained in Chapter 2) and its predecessor project, the EARTH-MANNA architecture [62]. The EARTH-MANNA project was a fully functional testbed for the EARTH model. The EARTH model aimed to demonstrate that, in multithreading architectures running on multicore systems, it is possible to support fine grain parallelism with tolerable synchronization and communication overheads. The abstract machine of the EARTH model also relied on an SU unit for resource management and the allocation of tasks (the equivalent of Codelets).

The Monsoon project also proposed a multithreaded version for parallelism [118] (MT Monsoon, 1991). A program is divided into threads, each executed and described sequentially. The system is composed of multiple cores, each capable of executing several threads at the same time. The ISA is extended with two new instructions that allow fork/join capabilities. On a fork, a new thread was created with an allocated memory frame associated to it. A continuation memory space was used to synchronize data coming back to the join operation, and split-phase transactions were used to synchronize the continuation points.

The Epsilon architecture (1989) [119][120] relies on pure dataflow execution of instructions. However, by grouping multiple instructions into sequentially

executed groups, it is possible to control and reduce the overhead of synchronization. That is, it allows controlling the granularity of the instructions. For a grain size of 1 (i.e., one operation per group), the execution is purely dataflow; the larger the grain size, the more the execution resembles a Von Neumann sequential model. Memory is organized in frames of execution that form a tree, while split-phase transactions enable the synchronization of producers and consumers of data. Other architectures such as the EM-4, the EM-X (1995) [121], and the RWC-1 (1994) [122] extended these principles and included pre-fetching mechanisms and token pre-matching.

Other approaches that aim to combine Von Neumann execution and dataflow are closer in nature to the proposal presented in this thesis. Task SuperScalar [123][124], for example, uses out of order execution techniques for the scheduling of tasks, while dependencies determine the order in which tasks are executed. Hardware mechanisms for the implementation of task superscalar have been proposed in [125]. While several of the ideas of this thesis and the SuperCodelet architecture overlap with the Task SuperScalar model, the major difference lies in the view of the system as a hierarchy of Von Neumann machines. At each level we have created a Turing complete system that does not see the memory space as a contiguous and monolithic element. The concepts of the Hierarchical Turing Machine and the hierarchical Von Neumann architecture are the key differentiating factors of this work.

8.5 Software approaches and other efforts

There have been several programming models for parallelism. Tasking is a parallel programming model that inherits ideas from dataflow models of computation. It uses a graph of operations connected through data and control dependencies. Contrary to dataflow architectures, tasks are described as a large collection of instructions that form a coarser grain scheduling unit. The execution of the code that describes the operations performed by a task is often sequential. More recently, tasking terminology has also been used in the context of parallel GPGPU system execution.

Among the most popular tasking mechanisms are OpenMP Tasking [58], Legion [60], OCR [61], Habanero [126], Argobots [127], and many others. Tasking and dataflow-inspired execution have also influenced the design of other important runtimes such as TensorFlow for ML and AI.

On the software side it is important to mention the HiHAT effort [128], which aims to unify the tasking interfaces between hardware and software in order to achieve coordination between the different execution models and programming models. NVIDIA has also introduced CUDA graphs, which allow users to define control dependencies as a graph prior to the execution of the computation. These dependencies can be spawned across host and device, and they rely on a hardware runtime to execute [64].

The main difference of the approach introduced here is that it does not aim to provide yet another programming model; instead, it tries to re-define the hardware/software interface by defining an execution model and its programming interface. The model presented in this work is envisioned through hardware/software co-design, while also trying to take advantage of performance, productivity (programmability) and portability. This approach starts from the model of computation and moves towards the architecture; it is not intended as a language for the creation of tasks with a corresponding software-implemented runtime.

Chapter 9

FUTURE WORK

In order to keep up with the fast evolution of computer systems moving forward, it is necessary to guarantee performance, portability and productivity. This work proposes one solution: a program execution model that aims to nurture these elements. The main objective of the Sequential Codelet Model is to serve as a general model of computation, a unified strategy for the design of computer system infrastructure. The Sequential Codelet Model, as a program execution model, should serve as a base for the design of compilers, languages, systems, operating systems and more.

This work leaves many open questions and avenues to move forward. First, what are the appropriate strategies for translating already existing code into the Sequential Codelet Model? By finding better and more complex benchmarks, a more valuable analysis of these strategies should be possible. It is necessary to learn from users' experience which translation strategies allow a programmer to efficiently use the SCM abstraction. Such benchmarks should make it possible to take advantage of the re-utilization of registers in the upper levels, instead of being limited to single Codelet operations. Furthermore, the use of heterogeneous computation between CPUs, GPUs and other architectures is really important. While this work demonstrates that it is possible to run the Sequential Codelet Model on GPUs, the results are far from competitive. I believe that heterogeneity is a key aspect of future computer systems. Other ideas that combine the Sequential Codelet Model with highly heterogeneous abstract machines, such as the 10x10 architecture [112], are avenues to explore moving forward.

Second, how can already existing programming models be leveraged to adopt the hierarchical organization of the SCM abstraction? The definition of a higher level programming language or abstraction should be possible thanks to the extensive evolution

of programming languages. We envision an initial extension that uses a directive based approach to define Codelets at the different levels. Furthermore, hardware emulation is also a critical path for the development of systems based on the Sequential Codelet Model. A hardware/software co-design strategy has recently been proposed through the DEMAC system [129][130][131]. This open source project aims to create a platform for the definition of a unified architecture of computer systems. An FPGA based implementation on a system that uses scratchpad-like memories could also serve as an initial realization of the SuperCodelet architecture, while making it possible to obtain metrics on system utilization, performance, overhead and others. These avenues are currently being explored. This work leaves the necessary building blocks for a hardware implementation.

When exploring hardware architectures that implement the SCM, it is possible to explore energy awareness mechanisms that dynamically determine the organization of the machine. Annotation of Codelets and Codelet specialization for different optimization objectives could provide a way to adapt the execution of the program to user defined optimization metrics (e.g., energy, performance or reliability). These ideas have already been explored in the context of the Codelet Model [111]; however, they need to be extended to consider the hierarchical organization of the Sequential Codelet Model. Extensions to the Codelet Model also need to be realized for exploring Single Program Multiple Data abstractions, event driven and interruption based execution of Codelets, atomic operations across execution units of the same level, and, in general, how to adapt current concepts of parallelism to the Sequential Codelet Model.

Additionally, further exploring the upper levels of the hierarchy is also necessary. The use of the SCM model in distributed computation will allow for a description of programs that execute on large HPC infrastructures. Furthermore, the use of the SCM model may allow for better concurrent use of HPC ecosystems. Currently, batch schedulers require the total allocation of hardware resources. By using strategies similar to those used in operating systems, it may be possible to have multiple SCM programs running concurrently in the same environment without interfering with one another.

As we go higher in the abstraction, we must also ask: what is the role of the Internet in the Sequential Codelet Model? The Internet has made cloud computing and computational services widely available. The use of APIs allows a programmer to offload computation to servers by performing application specific operations. What would happen if an API were created for cloud computing with semantics similar to those used in the SCM machine?

After almost a century, Turing's computation model has demonstrated a tremendous influence on computer systems and the possible applications these machines can solve. Moving forward into the next 100 years of computing, it is still an open problem: will the Turing Machine, or any of its proposed extensions (including the Hierarchical Turing Machine), be able to meet the challenges of future computation? A recent example of such challenges has been discussed in private communications within DoE: it is currently believed that reaching speedups of over 50000x is critical for many scientific and data intensive applications. Is there room for these requirements in Turing based computers? Or is it necessary to find other abstractions?

Chapter 10

CONCLUSIONS

This thesis introduces several concepts towards the definition of a unified model of computation for general-purpose, parallel, distributed, and heterogeneous computation. First, this thesis defines the Hierarchical Turing Machine, a multi-tape Turing Machine and model of computation that restructures the original Turing Machine to consider hierarchical organization of data and computation. It then creates the hierarchical Von Neumann architecture based on the Hierarchical Turing Machine. This machine model is then used to create the Sequential Codelet Model abstract machine, and the Sequential Codelet Program Execution Model (SCM). Based on the Sequential Codelet Model, this thesis creates the SuperCodelet architecture. By using principles of dataflow and design strategies similar to those used in Out-of-Order execution engines, an architecture for distributed, parallel, and heterogeneous computation is defined.

The Sequential Codelet Model is a program execution model that explains how a program interacts with a parallel, distributed, and heterogeneous architecture that implements an abstract machine organization. The Sequential Codelet Model envisions the machine as a hierarchical composition of Von Neumann systems. This hierarchy maps to the natural organization of memory in computer systems. At the bottom, commodity architectures are used as execution engines of a 5-stage pipeline built around the concept of a Core. The Core can be any currently existing architecture. The 5-stage pipeline forms the first level of the hierarchy. Level 2 is formed by using Level 1 as the execution stage. As we progress in the hierarchy, the level below becomes the execution engine of the 5-stage pipeline in the level above. The hierarchy spans from the concept of core, as currently used in the literature, to the aggregation of cores in sockets (level 1), aggregation of sockets into nodes (level 2), aggregation of

nodes into clusters (level 3), and so on. In the Sequential Codelet Model, parallelism is achieved by means similar to those used in Instruction Level Parallelism for sequential architectures.

An SCM program is also hierarchically organized to map to the hierarchical abstract machine organization. A level of the program maps to a level of the machine. A program is a sequential description of operations, similar to current assembly code. The operands in the program correspond to memory locations (akin to registers) at the level in which the program is executed. Unlike assembly code, instructions can be user defined, implemented by a program that runs in the level below. Instructions are referred to as Codelets. A Codelet belongs to a level as an instruction of that level, but it is implemented as a program of the level below.

In order to understand the behavior of the SuperCodelet architecture and the Sequential Codelet Model, this thesis creates an emulator, namely SCMUlate. This program is composed of an interpreter and a runtime. The interpreter reads SCM programs and translates them into calls to the runtime API. The runtime executes on commodity x86 CPU and Gen9 GPU hardware, and it uses different configurations (different machine implementations). Two microbenchmarks, Matrix Multiplication and Vector Addition, have been used in SCMUlate. These benchmarks have been selected as important kernels for HPC that are representative of future computational workloads, as learned through years of experience interacting with application teams from the ECP project. These benchmarks provide an early evaluation and justification for the Sequential Codelet Model.

Early evaluation through SCMUlate demonstrates that it is possible to obtain hardware implementations that achieve parallel execution of code by means of the SCM abstraction. Furthermore, it is possible to create different system implementations capable of executing the same abstraction. Therefore, by keeping the same program execution model Application Programming Interface (API), it is possible to port code across different implementations, while allowing optimizations to be prioritized at different levels of the abstraction.
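To make the structure of such programs concrete, the following is a minimal sketch of what a Level-1 SCM vector addition program could look like, written in the same instruction and register syntax used for the Matrix Multiplication code in Appendix A. The Codelet names LoadChunk_2048L, VecAdd_2048L, and StoreChunk_2048L are hypothetical placeholders for the Codelets the actual benchmark would define.

LDIMM R64B_1, 0 ;// Base of the L2 memory region
LDOFF R64B_2, R64B_1, 0 ;// Address of vector A
LDOFF R64B_3, R64B_1, 8 ;// Address of vector B
LDOFF R64B_4, R64B_1, 16 ;// Address of vector C
COD LoadChunk_2048L R2048L_1, R64B_2 ;// Codelet: load a chunk of A
COD LoadChunk_2048L R2048L_2, R64B_3 ;// Codelet: load a chunk of B
COD VecAdd_2048L R2048L_3, R2048L_1, R2048L_2 ;// Codelet: element-wise add
COD StoreChunk_2048L R2048L_3, R64B_4 ;// Codelet: store the result into C
COMMIT ;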

Furthermore, results presented in this work show that the Sequential Codelet Model abstraction could lead to compiler optimization techniques, as well as static analysis of computational and memory complexity of applications, all under the umbrella of the same Program Execution Model.

An emulation runtime on commodity Intel architectures does not necessarily reflect the intention of the Sequential Codelet Model as a model of computation. In fact, it may give the wrong impression that SCMUlate is yet another parallel programming model or framework akin to OpenMP, Legion, OCR, or others. Instead, the Sequential Codelet Model represents a way of thinking about and mapping computation to hardware resources, which is expected to be realized in hardware and extended through software tools. Moreover, a software approach adds overhead and reduces the opportunity for compiler optimizations that could potentially benefit the final performance of the program. We provide a more in-depth analysis of these results in the following subsection.

Through the use of SCMUlate, this thesis demonstrates that it is possible to achieve scalable execution of code on multicore and heterogeneous architectures. The sequential semantics considerably improves programmability. Furthermore, the static definition of Codelets at a given level allows already existing compiler techniques and algorithmic analysis tools to be used to improve the execution time of an SCM program at each level of the hierarchy. The analysis of the results obtained with SCMUlate results in a mathematical formulation that serves as a basis for the creation of actual hardware implementations of the SCM model.

BIBLIOGRAPHY

[1] J. L. Hennessy and D. A. Patterson, “A new golden age for computer architec- ture,” Commun. ACM, vol. 62, pp. 48–60, Jan. 2019.

[2] A. M. Turing, “On computable numbers, with an application to the Entschei- dungsproblem,” Proceedings of the London Mathematical Society, vol. 2, no. 42, pp. 230–265, 1936.

[3] J. v. Neumann, “First draft of a report on the EDVAC,” tech. rep., 1945.

[4] G. M. Amdahl, G. A. Blaauw, and F. P. Brooks, “Architecture of the IBM System/360,” IBM J. Res. Dev., vol. 8, p. 87–101, Apr. 1964.

[5] M. Bohr, “A 30 year retrospective on dennard’s MOSFET scaling paper,” IEEE Solid-State Circuits Society Newsletter, vol. 12, no. 1, pp. 11–13, 2007.

[6] P. Qu, J. Yan, Y.-H. Zhang, and G. R. Gao, “Parallel turing machine, a proposal,” Journal of Computer Science and Technology, vol. 32, pp. 269–285, Mar 2017.

[7] J. B. Dennis, “Programming generality, parallelism and computer architecture,” in IFIP Congress, 1968.

[8] J. B. Dennis, “A parallel program execution model supporting modular software construction,” 1997.

[9] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gotti- pati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter performance analysis of a tensor processing unit,” in Pro- ceedings of the 44th Annual International Symposium on Computer Architecture, ISCA ’17, (New York, NY, USA), pp. 1–12, ACM, 2017.

[10] A. S. Cassidy, J. Sawada, P. Merolla, J. V. Arthur, R. Alvarez-Icaza, F. Akopyan, B. L. Jackson, and D. S. Modha, “Truenorth: A high-performance, low-power neurosynaptic processor for multi-sensory perception, action, and cognition,” 2016.

[11] Top 500, “Top 500 list. June 2018 list highlights.” https://www.top500.org/lists/2018/06/highlights/.

[12] R. M. Russell, “The cray-1 computer system,” Commun. ACM, vol. 21, p. 63–72, Jan. 1978.

[13] S. Chandra, J. R. Larus, and A. Rogers, “Where is time spent in message-passing and shared-memory programs?,” SIGOPS Oper. Syst. Rev., vol. 28, p. 61–73, Nov. 1994.

[14] H. Shan and J. P. Singh, “A comparison of mpi, shmem and cache-coherent shared address space programming models on a tightly-coupled multiprocessors,” International Journal of Parallel Programming, vol. 29, no. 3, pp. 283–318, 2001.

[15] Y. Yan, J. R. Hammond, C. Liao, and A. E. Eichenberger, “A proposal to openmp for addressing the cpu oversubscription challenge,” in OpenMP: Memory, Devices, and Tasks (N. Maruyama, B. R. de Supinski, and M. Wahib, eds.), (Cham), pp. 187–202, Springer International Publishing, 2016.

[16] M. Raynal and A. Schiper, “A suite of definitions for consistency criteria in distributed shared memories,” Annales des Telecommunications/Annals of Telecommunications, vol. 52, no. 11-12, pp. 652–661, 1997.

[17] L. Lamport, “How to make a multiprocessor computer that correctly executes multiprocess programs,” IEEE Trans. Comput., vol. 28, p. 690–691, Sept. 1979.

[18] S. R. Walli, “The POSIX family of standards,” StandardView, vol. 3, no. 1, pp. 11–17, 1995.

[19] E. A. Lee, “The problem with threads,” Computer, vol. 39, pp. 33–42, May 2006.

[20] D. Chisnall, “C is not a low-level language,” Queue, vol. 16, pp. 10:18–10:30, Apr. 2018.

[21] J. M. Monsalve Diaz, S. S. Pophale, O. R. Hernandez, D. Bernholdt, and S. Chandrasekaran, “OpenMP 4.5 validation and verification suite for device offload,” Aug. 2018.

[22] J. M. Monsalve Diaz, OpenMP 4.5 validation and verification testsuite: design and implementation for offloading features. 2020.

[23] Johannes Doerfert, “SOLLVE: OpenMP for HPC and Exascale.” https://www.exascaleproject.org/highlight/sollve-openmp-for-hpc-and-exascale/.

[24] A. Church, “An unsolvable problem of elementary number theory,” American Journal of Mathematics, vol. 58, pp. 345–363, Apr. 1936.

[25] A. Church, “A note on the entscheidungsproblem.,” J. Symb. Log., vol. 1, no. 1, pp. 40–41, 1936.

[26] C. Petzold, The Annotated Turing: A Guided Tour Through Alan Turing’s Historic Paper on Computability and the Turing Machine. Wiley Publishing, 2008.

[27] B. E. Carpenter and R. W. Doran, A. M. Turing’s ACE Report of 1946 and Other Papers. USA: Massachusetts Institute of Technology, 1986.

[28] F. J. Corbató and V. A. Vyssotsky, “Introduction and overview of the multics system,” in Proceedings of the November 30–December 1, 1965, Fall Joint Computer Conference, Part I, AFIPS ’65 (Fall, part I), (New York, NY, USA), p. 185–196, Association for Computing Machinery, 1965.

[29] R. C. Daley and J. B. Dennis, “Virtual memory, processes, and sharing in mul- tics,” Commun. ACM, vol. 11, p. 306–312, May 1968.

[30] T. Kilburn, D. B. G. Edwards, M. J. Lanigan, and F. H. Sumner, “One-level storage system,” IRE Transactions on Electronic Computers, vol. EC-11, no. 2, pp. 223–235, 1962.

[31] H. N. S. Aldin, H. Deldari, M. H. Moattar, and M. R. Ghods, “Consistency models in distributed systems: A survey on definitions, disciplines, challenges and applications,” CoRR, vol. abs/1902.03305, 2019.

[32] F. Pong and M. Dubois, “Verification techniques for cache coherence protocols,” ACM Comput. Surv., vol. 29, p. 82–126, Mar. 1997.

[33] P. Stenstrom, “A survey of cache coherence schemes for multiprocessors,” Com- puter, vol. 23, no. 6, pp. 12–24, 1990.

[34] Arvind and R. A. Iannucci, “Two fundamental issues in multiprocessing,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 295 LNCS, pp. 61–88, 1988.

[35] R. M. Karp and R. E. Miller, “Properties of a Model for Parallel Computations: Determinacy, Termination, Queueing,” SIAM Journal on Applied Mathematics, vol. 14, no. 6, pp. 1390–1411, 1966.

[36] J. B. Dennis and D. Misunas, “A preliminary architecture for a basic data flow processor,” in Proceedings of the 2nd Annual Symposium on Computer Architec- ture, Houston, TX, USA, December 1974, pp. 126–132, 1974.

[37] J. B. Dennis, “First version of a data flow procedure language,” in Programming Symposium (B. Robinet, ed.), (Berlin, Heidelberg), pp. 362–376, Springer Berlin Heidelberg, 1974.

[38] J. Rumbaugh, “A data flow multiprocessor,” IEEE Transactions on Computers, vol. C-26, pp. 138–146, Feb 1977.

[39] Arvind and R. S. Nikhil, “Executing a program on the MIT tagged-token dataflow architecture,” IEEE Transactions on Computers, vol. 39, pp. 300–318, March 1990.

[40] G. M. Papadopoulos and D. E. Culler, “Monsoon: An explicit token-store archi- tecture,” SIGARCH Comput. Archit. News, vol. 18, p. 82–91, May 1990.

[41] A. L. Davis, “The architecture and system method of DDM1,” in Proceedings of the 5th annual symposium on Computer architecture - ISCA ’78, vol. 1, (New York, New York, USA), pp. 210–215, ACM Press, 1978.

[42] R. Vedder and D. Finn, “The Hughes Data Flow Multiprocessor,” ACM SIGARCH Computer Architecture News, vol. 13, pp. 324–332, jun 1985.

[43] A. Plus, D. Comte, and J. C. Syre, “Lau system architecture: a parallel data- driven processor based on single assignment,” 1976.

[44] J. B. Dennis, G. R. Gao, and V. Sarkar, “Determinacy and repeatability of parallel program schemata,” in 2012 Data-Flow Execution Models for Extreme Scale Computing, pp. 1–9, 2012.

[45] J. E. Rodrigues and J. E. Rodriguez Bezos, “A graph model for parallel compu- tations,” tech. rep., USA, 1969.

[46] S. S. Patil, Closure Properties of Interconnections of Determinate Systems, p. 107–116. New York, NY, USA: Association for Computing Machinery, 1970.

[47] R. M. Keller, “An approach to determinacy proofs,” 1978.

[48] P. J. Denning, On the Determinacy of Schemata, p. 143–147. New York, NY, USA: Association for Computing Machinery, 1970.

[49] F. Yazdanpanah, C. Alvarez-Martinez, D. Jimenez-Gonzalez, and Y. Etsion, “Hy- brid dataflow/von-Neumann architectures,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 6, pp. 1489–1509, 2014.

[50] B. Robič, J. Šilc, and T. Ungerer, “Beyond dataflow,” Journal of Computing and Information Technology, vol. 8, no. 2, pp. 89–101, 2000.

[51] J. Šilc, B. Robič, and T. Ungerer, “Asynchrony in parallel computing: from dataflow to multithreading,” in Progress in computer research, pp. 459–469, IEEE, 1997.

[52] J. Šilc, B. Robič, and T. Ungerer, Processor Architecture: From Dataflow to Superscalar and Beyond. Springer, 1999.

[53] G. E. Moore, “Cramming more components onto integrated circuits,” Electronics, vol. 38, April 1965.

[54] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R. LeBlanc, “Design of ion-implanted MOSFET’s with very small physical dimensions,” IEEE Journal of Solid-State Circuits, vol. 9, pp. 256–268, Oct 1974.

[55] J. E. Thornton, Design of a Computer—The Control Data 6600. Scott Foresman & Co, 1970.

[56] N. Vasseghi, K. Yeager, E. Sarto, and M. Seddighnezhad, “200-MHz superscalar RISC microprocessor,” IEEE Journal of Solid-State Circuits, vol. 31, no. 11, pp. 1675–1686, 1996.

[57] R. M. Tomasulo, “An efficient algorithm for exploiting multiple arithmetic units,” IBM Journal of Research and Development, vol. 11, no. 1, pp. 25–33, 1967.

[58] E. Ayguade, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, F. Massaioli, X. Teruel, P. Unnikrishnan, and G. Zhang, “The design of OpenMP tasks,” IEEE Trans- actions on Parallel and Distributed Systems, vol. 20, no. 3, pp. 404–418, 2009.

[59] J. Suetterlein, S. Zuckerman, and G. R. Gao, “An implementation of the codelet model,” in Euro-Par 2013 Parallel Processing (F. Wolf, B. Mohr, and D. an Mey, eds.), (Berlin, Heidelberg), pp. 633–644, Springer Berlin Heidelberg, 2013.

[60] M. Bauer, S. Treichler, E. Slaughter, and A. Aiken, “Legion: Expressing locality and independence with logical regions,” in SC ’12: Proceedings of the Interna- tional Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–11, 2012.

[61] T. G. Mattson, R. Cledat, V. Cavé, V. Sarkar, Z. Budimlić, S. Chatterjee, J. Fryman, I. Ganev, R. Knauerhase, Min Lee, B. Meister, B. Nickerson, N. Pepperling, B. Seshasayee, S. Tasirlar, J. Teller, and N. Vrvilo, “The open community runtime: A runtime system for extreme scale computing,” in 2016 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–7, 2016.

[62] H. H. Hum, O. Maquelin, K. B. Theobald, X. Tian, G. R. Gao, and L. J. Hendren, “A study of the EARTH-MANNA multithreaded system,” International Journal of Parallel Programming, vol. 24, no. 4, pp. 319–348, 1996.

[63] R. S. Nikhil, G. M. Papadopoulos, and Arvind, “*T: A multithreaded massively parallel architecture,” in [1992] Proceedings of the 19th Annual International Symposium on Computer Architecture, pp. 156–167, 1992.

[64] NVIDIA, “CUDA graphs.” https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs.

[65] Y. Etsion, F. Cabarcas, A. Rico, A. Ramirez, R. M. Badia, E. Ayguade, J. Labarta, and M. Valero, “Task superscalar: An out-of-order task pipeline,” Proceedings of the Annual International Symposium on Microarchitecture, MICRO, pp. 89–100, 2010.

[66] R. H. Halstead, “Multilisp: A language for concurrent symbolic computation,” ACM Trans. Program. Lang. Syst., vol. 7, p. 501–538, Oct. 1985.

[67] Wen-Yen Lin and Jean-Luc Gaudiot, “I-structure software cache: a split-phase transaction runtime cache system,” in Proceedings of the 1996 Conference on Parallel Architectures and Compilation Technique, pp. 122–126, 1996.

[68] Arvind, R. S. Nikhil, and K. K. Pingali, “I-structures: Data structures for parallel computing,” ACM Trans. Program. Lang. Syst., vol. 11, p. 598–632, Oct. 1989.

[69] J. B. Dennis, “Fresh breeze: A multiprocessor chip architecture guided by modu- lar programming principles,” SIGARCH Comput. Archit. News, vol. 31, p. 7–15, Mar. 2003.

[70] I. T. A. Lee, S. Boyd-Wickizer, Z. Huang, and C. E. Leiserson, “Using mem- ory mapping to support cactus stacks in work-stealing runtime systems,” Paral- lel Architectures and Compilation Techniques - Conference Proceedings, PACT, vol. 2010, pp. 411–420, 2010.

[71] Intel, “First the tick, now the tock: Intel® microarchitecture (Nehalem). Introducing a new dynamically and design-scalable microarchitecture,” 2009.

[72] G. Gao, J. Suetterlein, and S. Zuckerman, “Toward an Execution Model for Extreme-Scale Systems - Runnemede and Beyond.” Technical Memo, April 2011.

[73] D. E. Culler, K. E. Schauser, and T. von Eicken, “Two fundamental limits on dataflow multiprocessing,” tech. rep., USA, 1992.

[74] J. Arteaga, S. Zuckerman, and G. R. Gao, “Multigrain parallelism: Bridg- ing coarse-grain parallel programs and fine-grain event-driven multithreading,” in 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 799–808, 2017.

[75] J. Monsalve, K. Harms, K. Kalyan, and G. Gao, “Sequential codelet model of program execution. a super-codelet model based on the hierarchical turing machine,” in 2019 IEEE/ACM Third Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware (IPDRM), pp. 1–8, 2019.

[76] M. Yannakakis, “Hierarchical state machines,” in Theoretical Computer Science: Exploring New Frontiers of Theoretical Informatics (J. van Leeuwen, O. Watan- abe, M. Hagiya, P. D. Mosses, and T. Ito, eds.), (Berlin, Heidelberg), pp. 315–330, Springer Berlin Heidelberg, 2000.

[77] W. Zhang, B. Gao, J. Tang, P. Yao, S. Yu, M.-F. Chang, H.-J. Yoo, H. Qian, and H. Wu, “Neuro-inspired computing chips,” Nature Electronics, vol. 3, pp. 371– 382, 07 2020.

[78] K. Livingston, A. Landwehr, J. M. Diaz, S. Zuckerman, B. Meister, and G. R. Gao, “Energy avoiding matrix multiply,” in Languages and Compilers for Parallel Computing - 29th International Workshop, LCPC 2016, Rochester, NY, USA, September 28-30, 2016, Revised Papers, pp. 55–70, 2016.

[79] G. R. Gao and V. Sarkar, “Location consistency-a new memory model and cache consistency protocol,” IEEE Trans. Comput., vol. 49, p. 798–813, Aug. 2000.

[80] J. B. Dennis, “A parallel program execution model supporting modular software construction,” in Proceedings. Third Working Conference on Massively Parallel Programming Models (Cat. No.97TB100228), pp. 50–60, 1997.

[81] S. Raskar, T. Applencourt, K. Kumaran, and G. Gao, “Position paper: Ex- tending codelet model for dataflow software pipelining using software-hardware co-design,” in 2019 IEEE 43rd Annual Computer Software and Applications Con- ference (COMPSAC), vol. 2, pp. 640–645, July 2019.

[82] NVIDIA, “NVIDIA Tesla V100 GPU Architecture. The world’s most advanced data center GPU.” http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf, August 2017.

[83] Intel, “The compute architecture of intel processor graphics Gen9,” Aug 2015.

[84] Jose M Monsalve Diaz, “SCMUlate Emulator Source Code.” https://github.com/josemonsalve2/SCM.

[85] P. T. Inc., “Collaborative data science,” 2015.

[86] B. R. Rau, “Iterative modulo scheduling: An algorithm for software pipelining loops,” in Proceedings of the 27th Annual International Symposium on Microar- chitecture, MICRO 27, (New York, NY, USA), p. 63–74, Association for Com- puting Machinery, 1994.

[87] J. Xue, Loop Tiling for Parallelism. USA: Kluwer Academic Publishers, 2000.

[88] Intel Corporation, “Intel oneAPI: A Unified X-Architecture Programming Model.” https://software.intel.com/content/www/us/en/develop/tools/oneapi.html.

[89] J. B. Dennis, “Data flow supercomputers,” Computer, vol. 13, no. 11, pp. 48–56, 1980.

[90] G. M. Papadopoulos and D. E. Culler, “Monsoon: An explicit token-store archi- tecture,” in Proceedings of the 17th Annual International Symposium on Com- puter Architecture, ISCA ’90, (New York, NY, USA), p. 82–91, Association for Computing Machinery, 1990.

[91] N. Takahashi and M. Amamiya, “A data flow processor array system,” ACM SIGARCH Computer Architecture News, vol. 11, pp. 243–250, jun 1983.

[92] M. Kishi, H. Yasuhara, and Y. Kawamura, “DDDP-a Distributed Data Driven Processor,” ACM SIGARCH Computer Architecture News, vol. 11, pp. 236–242, jun 1983.

[93] J. R. Gurd, C. C. Kirkham, and I. Watson, “The Manchester prototype dataflow computer,” Communications of the ACM, vol. 28, no. 1, pp. 34–52, 1985.

[94] N. Ito, M. Sato, E. Kuno, and K. Rokusawa, “The architecture and preliminary evaluation results of the experimental parallel inference machine PIM-D,” ACM SIGARCH Computer Architecture News, vol. 14, pp. 149–156, jun 1986.

[95] K. E. Fleming, K. D. Glossop, S. C. Steely, J. Tang, and A. G. Gara, “Processors, methods, and systems with a configurable spatial accelerator,” Feb 2020.

[96] Y. N. Patt, W. mei Hwu, and M. C. Shebanow, “HPS, a New Microarchitecture: Rationale and Introduction,” MICRO: Annual Microprogramming Workshop, pp. 103–108, 1985.

[97] W. m. Hwu and Y. N. Patt, “HPSm, a High Performance Restricted Data Flow Architecture Having Minimal Functionality,” pp. 297–306, 1986.

[98] J. E. Smith and A. R. Pleszkun, “Implementing Precise Interrupts in Pipelined Processors,” IEEE Transactions on Computers, vol. 37, no. 5, pp. 562–573, 1988.

[99] T. Agerwala and J. Cocke, “High performance reduced instruction set proces- sors,” vol. 4, 1987.

[100] E. Bloch, “The engineering design of the stretch computer,” Proceedings of the Eastern Joint Computer Conference, IRE-AIEE-ACM 1959, pp. 48–58, 1959.

[101] M. D. Smith, M. S. Lam, and M. A. Horowitz, “Boosting beyond static scheduling in a superscalar processor,” Conference Proceedings - Annual Symposium on Computer Architecture, pp. 344–354, 1990.

[102] Lee and Smith, “Branch Prediction Strategies and Branch Target Buffer Design,” Computer, vol. 17, pp. 6–22, jan 1984.

[103] P. Kocher, J. Horn, A. Fogh, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom, “Spectre attacks: Exploiting Speculative Execution,” Communications of the ACM, vol. 63, no. 7, pp. 93–101, 2020.

[104] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, A. Fogh, J. Horn, S. Man- gard, P. Kocher, D. Genkin, Y. Yarom, and M. Hamburg, “Meltdown: Reading kernel memory from user space,” Proceedings of the 27th USENIX Security Sym- posium, pp. 973–990, 2018.

[105] J. A. Fisher, “Very Long Instruction Word architectures and the ELI-512,” in Proceedings of the 10th annual international symposium on Computer architec- ture - ISCA ’83, (New York, New York, USA), pp. 140–150, ACM Press, 1983.

[106] H. T. Kung and C. E. Leiserson, “Systolic arrays (for VLSI),” in Sparse Matrix Proceedings 1978, vol. 1, pp. 256–282, Society for Industrial and Applied Mathematics, 1979.

[107] L. Tucker and G. Robertson, “Architecture and applications of the Connection Machine,” Computer, vol. 21, pp. 26–38, aug 1988.

[108] D. M. Tullsen, S. J. Eggers, and H. M. Levy, “Simultaneous Multithreading: Maximizing On-Chip Parallelism,” 1995.

[109] M. K. Smotherman, E. H. Sussenguth, and R. J. Robelen, “The IBM ACS Project,” IEEE Annals of the History of Computing, vol. 38, no. 1, pp. 60–74, 2016.

[110] V. Cavé, R. Clédat, P. Griffin, A. More, B. Seshasayee, S. Borkar, S. Chatterjee, D. Dunning, and J. Fryman, “Traleika glacier: A hardware-software co-designed approach to exascale computing,” Parallel Computing, vol. 64, pp. 33–49, 2017. High-End Computing for Next-Generation Scientific Discovery.

[111] A. Landwehr, “An experimental exploration of self-aware systems for exascale architectures,” 2016.

[112] A. A. Chien, A. Snavely, and M. Gahagan, “10x10: A general-purpose archi- tectural approach to heterogeneity and energy efficiency,” Procedia Computer Science, vol. 4, pp. 1987 – 1996, 2011. Proceedings of the International Confer- ence on Computational Science, ICCS 2011.

[113] A. S. Cassidy, J. Sawada, P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, F. Akopyan, B. L. Jackson, and D. S. Modha, “TrueNorth: a High-Performance, Low-Power Neurosynaptic Processor for Multi-Sensory Perception, Action, and Cognition,” IBM Research, vol. Almaden Re, pp. 341–344, 2016.

[114] J. Pei, L. Deng, S. Song, M. Zhao, Y. Zhang, S. Wu, G. Wang, Z. Zou, Z. Wu, W. He, F. Chen, N. Deng, S. Wu, Y. Wang, Y. Wu, Z. Yang, C. Ma, G. Li, W. Han, H. Li, H. Wu, R. Zhao, Y. Xie, and L. Shi, “Towards artificial general intelligence with hybrid Tianjic chip architecture,” Nature, vol. 572, no. 7767, pp. 106–111, 2019.

[115] B. Robič, J. Šilc, and T. Ungerer, “Beyond dataflow,” Journal of Computing and Information Technology, vol. 8, no. 2, pp. 89–101, 2000.

[116] J. Šilc, B. Robič, and T. Ungerer, Processor Architecture: From Dataflow to Superscalar and Beyond. Springer, 1999.

[117] J. Šilc, B. Robič, and T. Ungerer, “Asynchrony in parallel computing: from dataflow to multithreading,” in Progress in computer research, pp. 459–469, IEEE, 1997.

[118] G. M. Papadopoulos and K. R. Traub, “Multithreading: A revisionist view of dataflow architectures,” Conference Proceedings - Annual Symposium on Com- puter Architecture, pp. 342–351, 1991.

[119] V. Grafe and J. Hoch, “The Epsilon-2 multiprocessor system,” Journal of Parallel and Distributed Computing, vol. 10, pp. 309–318, dec 1990.

[120] V. G. Grafe, G. S. Davidson, J. E. Hoch, and V. P. Holmes, “Epsilon Dataflow Processor.,” Conference Proceedings - Annual Symposium on Computer Archi- tecture, no. 16, pp. 36–45, 1989.

[121] Y. Kodama, H. Sakane, M. Sato, H. Yamana, S. Sakai, and Y. Yamaguchi, “EM-X parallel computer: Architecture and basic performance,” Conference Proceedings - Annual International Symposium on Computer Architecture, ISCA, pp. 14–23, 1995.

[122] S. Sakai, H. Matsuoka, K. Okamoto, T. Yokota, H. Hirono, Y. Kodama, and M. Sato, “Rwc-1 massively parallel architecture,” 1994.

[123] Y. Etsion, F. Cabarcas, A. Rico, A. Ramirez, R. M. Badia, E. Ayguade, J. Labarta, and M. Valero, “Task superscalar: An out-of-order task pipeline,” Proceedings of the Annual International Symposium on Microarchitecture, MI- CRO, pp. 89–100, 2010.

[124] Y. Etsion, A. Ramirez, and R. Badia, “Task superscalar: Using processors as functional units,” HotPar, 2010.

[125] F. Yazdanpanah, D. Jimenez-Gonzalez, C. Alvarez-Martinez, Y. Etsion, and R. M. Badia, “Analysis of the task superscalar architecture hardware design,” Procedia Computer Science, vol. 18, pp. 339–348, 2013.

[126] V. Cavé, J. Zhao, J. Shirako, and V. Sarkar, “Habanero-Java,” p. 51, 2011.

[127] S. Seo, A. Amer, P. Balaji, C. Bordage, G. Bosilca, A. Brooks, P. Carns, A. Castello, D. Genet, T. Herault, S. Iwasaki, P. Jindal, L. V. Kale, S. Kr- ishnamoorthy, J. Lifflander, H. Lu, E. Meneses, M. Snir, Y. Sun, K. Taura, and P. Beckman, “Argobots: A Lightweight Low-Level Threading and Tasking Framework,” IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 3, pp. 512–526, 2018.

[128] H. H. A. T. Project, “HiHAT: Hierarchical heterogeneous asynchronous tasking project.” https://hihat-wiki.modelado.org/Hierarchical_Heterogeneous_Asynchronous_Tasking.

[129] CAPSL Research Group, “DEMAC1 website.” https://www.capsl.udel.edu//demac_cluster.shtml.

[130] CAPSL Research Group, “DEMAC1 documentation.” https://www.capsl.udel.edu//demac_cluster/documentation/.

[131] D. R. Perdomo, R. Kabrick, J. M. Diaz, S. Raskar, D. Fox, and G. Gao, “DEMAC and CODIR: A whole stack solution for a hw/sw co-design using an MLIR codelet model dialect,” May 2020.

Appendix A

COMPLETE MATRIX MULTIPLICATION CODE

(Figure A.1 is a diagram of matrices A, B, and C annotated with the registers used by the implementation; only the register descriptions are reproduced here.)

R2 = Iteration i
R3 = Iteration j
R4 = Iteration k
R5 = Address current Tile A
R6 = Address current Tile B
R7 = Address current Tile C
R10 = M
R11 = N
R12 = K
R13 = Beginning address A
R14 = Beginning address B
R15 = Beginning address C
R16 = Offset on i for C and B
R17 = Offset on j for A
R18 = Offset on k for B
R19 = Offset on k for A
R20 = Offset on j for C
R21 = Offset tile A
R22 = Offset tile B
R23 = Offset tile C
R24 = Ref begin curr Row A
R25 = Ref begin curr Col B
R26 = Ref begin curr Row C

Figure A.1: Guide image for Matrix Multiplication Implementation in SCM. Registers and their corresponding meaning.

In this appendix we show the complete version of the Matrix Multiplication SCM implementation, as discussed in Chapter 7.5. This code can multiply any matrices whose dimensions are multiples of the 128x128 base tile. The L2 memory structure is presented in Listing A.1, and the L1 SCM code is presented in Listings A.2-A.7. Finally, Figure A.1 shows a diagram of the meaning of each register with respect to the matrix multiplication algorithm.

struct __attribute__((packed)) l2_memory {
  uint64_t M;
  uint64_t N;
  uint64_t K;
  // MATRIX ADDRESSES
  uint64_t Add_a;
  uint64_t Add_b;
  uint64_t Add_c;
  // OFFSETS A, B, and C in Row Major
  uint64_t Off_cbi;
  uint64_t Off_aj;
  uint64_t Off_bk;
  uint64_t Off_ak;
  uint64_t Off_cj;
  // OFFSETS inside tile in Row Major
  uint64_t Off_a;
  uint64_t Off_b;
  uint64_t Off_c;
  double A[numElementsA];
  double B[numElementsB];
  double C[numElementsC];
};

Listing A.1: L2 Memory structure for Matrix Multiplication on SCM
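The constants numElementsA, numElementsB, and numElementsC are not defined in the listing. A minimal sketch of how they could be derived, assuming square 128x128 tiles of doubles and matrix dimensions expressed in tiles, is shown below; the macro names M_TILES, N_TILES, and K_TILES, as well as the formulas themselves, are illustrative assumptions rather than the exact definitions used by SCMUlate.

/* Illustrative assumption: A is (M*128) x (K*128), B is (K*128) x (N*128),
 * and C is (M*128) x (N*128), all stored as double-precision elements. */
#define TILE_DIM     128
#define numElementsA (M_TILES * TILE_DIM * K_TILES * TILE_DIM)
#define numElementsB (K_TILES * TILE_DIM * N_TILES * TILE_DIM)
#define numElementsC (M_TILES * TILE_DIM * N_TILES * TILE_DIM)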

LDIMM R64B_1, 0 ;// Just zero
// MATRIX DIMENSIONS
LDOFF R64B_10, R64B_1, 0 ;// Load M in tiles
LDOFF R64B_11, R64B_1, 8 ;// Load N in tiles
LDOFF R64B_12, R64B_1, 16 ;// Load K in tiles
// MATRIX ADDRESSES IN MEMORY
LDOFF R64B_13, R64B_1, 24 ;// Load Address_a
LDOFF R64B_14, R64B_1, 32 ;// Load Address_b
LDOFF R64B_15, R64B_1, 40 ;// Load Address_c
// OFFSETS Tiles
LDOFF R64B_16, R64B_1, 48 ;// Load Off_cbi in bytes
LDOFF R64B_17, R64B_1, 56 ;// Load Off_aj in bytes
LDOFF R64B_18, R64B_1, 64 ;// Load Off_bk in bytes
LDOFF R64B_19, R64B_1, 72 ;// Load Off_ak in bytes
LDOFF R64B_20, R64B_1, 80 ;// Load Off_cj in bytes
// Offsets for third arg of matmul
LDOFF R64B_21, R64B_1, 88 ;// Load Off_a in elements
LDOFF R64B_22, R64B_1, 96 ;// Load Off_b in elements
LDOFF R64B_23, R64B_1, 104 ;// Load Off_c in elements
// ITERATION VARIABLES
LDIMM R64B_2, 0 ;// For i iteration variable. Tile by tile
LDIMM R64B_3, 0 ;// For j iteration variable. Tile by tile
LDIMM R64B_4, 0 ;// For k iteration variable. Tile by tile
// Offset pointing to the first value of the tile in mem
LDIMM R64B_5, 0 ;// For A_off_cnt iteration variable.
LDIMM R64B_6, 0 ;// For B_off_cnt iteration variable.
LDIMM R64B_7, 0 ;// For C_off_cnt iteration variable.

Listing A.2: Matrix Multiplication NxM tiles: C = C + A*B. Constant declaration.

ADD R64B_5, R64B_5, R64B_13 ;// Beginning of A
ADD R64B_6, R64B_6, R64B_14 ;// Beginning of B

ADD R64B_7, R64B_7, R64B_15 ;// Beginning of C

// References to the beginning of row to restart inner counter
ADD R64B_24, R64B_24, R64B_13 ;// Ref beginning row in A
ADD R64B_25, R64B_25, R64B_14 ;// Ref beginning row in B
ADD R64B_26, R64B_26, R64B_15 ;// Ref beginning row in C

// We are using j for the rows (A and C) and i for the cols (B and C)

loop_j :
BREQ R64B_3, R64B_10, after_loop_j ;// if (j == M) jump out of loop
ADD R64B_3, R64B_3, 1 ;// j++
loop_k :
BREQ R64B_4, R64B_12, after_loop_k ;// if (k == K) jump out of loop
ADD R64B_4, R64B_4, 1 ;// k++

// Load the tile of A
COD LoadSqTile_2048L R2048L_1, R64B_5, R64B_21 ;

Listing A.3: Matrix Multiplication NxM tiles: C = C + A*B. Loop j and k

loop_i :
BREQ R64B_2, R64B_11, after_loop_i ;// if (i == N) jump out of loop
ADD R64B_2, R64B_2, 1 ;// i++

// Load tiles of B and C
COD LoadSqTile_2048L R2048L_2, R64B_6, R64B_22 ;
COD LoadSqTile_2048L R2048L_3, R64B_7, R64B_23 ;

// Do actual MM
COD MatMult_2048L R2048L_3, R2048L_1, R2048L_2 ;

// Store partial result of C
COD StoreSqTile_2048L R2048L_3, R64B_7, R64B_23 ;

// Move on i tile by tile over B and C
// Move B along i. Increase by tile size in row major
ADD R64B_6, R64B_6, R64B_16 ;
// Move C along i. Increase by tile size in row major
ADD R64B_7, R64B_7, R64B_16 ;

JMPLBL loop_i ;// go to beginning of loop

after_loop_i:

Listing A.4: Matrix Multiplication NxM tiles: C = C + A*B. Loop i

// Advance A, tile by tile, over K.
// *A + Off_ak (Increase by tile size in row major)
ADD R64B_5, R64B_5, R64B_19 ;

// Advance B, tile by tile, over K. Reset the
// pointer that moves B, tile by tile, over i direction
// Set to beginning of B for new k
ADD R64B_25, R64B_25, R64B_18 ;
// Reset inner count for B along i
LDIMM R64B_6, 0 ;
// Set count to beginning
ADD R64B_6, R64B_6, R64B_25 ;

// Reset the counter for C, will stay in the same j
LDIMM R64B_7, 0 ;
// Set to beginning of C in same j
ADD R64B_7, R64B_7, R64B_26 ;

// Reset iteration counter
LDIMM R64B_2, 0 ;

JMPLBL loop_k ;

after_loop_k:

Listing A.5: Matrix Multiplication NxM tiles: C = C + A*B. Iteration variable increments loop i

// Advance A over j a whole tile, and reset the
// pointer that moves A tile by tile over k
// Move A along j. (increase by tile size*tile size*K in row major)
ADD R64B_24, R64B_24, R64B_17 ;
// Reset inner count for A
LDIMM R64B_5, 0 ;
// Set to beginning of A
ADD R64B_5, R64B_5, R64B_24 ;

// Reset B to the beginning of the B matrix, then reset counters
LDIMM R64B_25, 0 ;
// Move ref to beginning of B
ADD R64B_25, R64B_25, R64B_14 ;
// Reset inner count for B along i
LDIMM R64B_6, 0 ;
// Set to beginning of matrix B
ADD R64B_6, R64B_6, R64B_25 ;

Listing A.6: Matrix Multiplication NxM tiles: C = C + A*B. Iteration variable increments loop k

// Advance C over j a whole tile, and reset pointer that moves
// C tile by tile over i
ADD R64B_26, R64B_26, R64B_20 ;// Move C along j
LDIMM R64B_7, 0 ;// Reset inner count C
ADD R64B_7, R64B_7, R64B_26 ;// Set to beginning of C at j

// Reset iteration counter
LDIMM R64B_4, 0 ;

JMPLBL loop_j ;
after_loop_j:

COMMIT ;

Listing A.7: Matrix Multiplication NxM tiles: C = C + A*B. Iteration variable increment loop j
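For readability, the control flow implemented by Listings A.2 through A.7 corresponds to the tiled loop nest sketched below in C. The sketch is illustrative only: it assumes plain row-major storage of double-precision elements and dimensions given in 128x128 tiles, and the four innermost loops stand in for the per-tile work performed by the LoadSqTile_2048L, MatMult_2048L, and StoreSqTile_2048L Codelets. As in Listing A.3, the SCM program loads the tile of A once per (j, k) iteration and reuses it across the whole i loop.

/* Self-contained C sketch (illustrative) of the tiled loop structure of the
 * SCM Matrix Multiplication program above. M, N, and K are given in tiles. */
#include <stdint.h>

#define TILE 128

static void matmul_tiles(uint64_t M, uint64_t N, uint64_t K,
                         const double *A, const double *B, double *C) {
  for (uint64_t j = 0; j < M; j++)       /* loop_j: rows of tiles of A and C  */
    for (uint64_t k = 0; k < K; k++)     /* loop_k: shared dimension in tiles */
      for (uint64_t i = 0; i < N; i++)   /* loop_i: cols of tiles of B and C  */
        /* Per-tile work: the role of LoadSqTile/MatMult/StoreSqTile Codelets */
        for (uint64_t jj = 0; jj < TILE; jj++)
          for (uint64_t kk = 0; kk < TILE; kk++)
            for (uint64_t ii = 0; ii < TILE; ii++)
              C[(j * TILE + jj) * (N * TILE) + (i * TILE + ii)] +=
                  A[(j * TILE + jj) * (K * TILE) + (k * TILE + kk)] *
                  B[(k * TILE + kk) * (N * TILE) + (i * TILE + ii)];
}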
