UvA-DARE (Digital Academic Repository)

On the exploration of the DRISC architecture

Yang, Q.

Publication date 2014


Citation for published version (APA): Yang, Q. (2014). On the exploration of the DRISC architecture.



Chapter 1

Introduction

Contents

1.1 Background
1.2 Multithreading
1.3 Multiple cores
1.4 What’s next
1.5 Case study
1.6 What have we done
1.7 Overview and organization


1.1 Background

Moore’s law, first stated in 1965 [M+65] and later widely accepted as “the number of transistors on integrated circuits doubles approximately every two years” (fig. 1.1), has been steering the development of the semiconductor industry, covering capabilities of electronic devices such as storage capacity and processing speed. In the field of microprocessors, this is exemplified by Intel’s “Tick-Tock” paradigm (fig. 1.2). Before the dawn of this millennium, the increased density of transistors was chiefly devoted to more powerful uniprocessors. “Powerful” refers on the one hand to techniques such as out-of-order (OoO) execution, branch prediction, speculation and multiple issue, which accelerate the execution of sequential code; on the other hand it refers to increasing the clock frequency, e.g. by extending the number of pipeline stages. The former suffers from hardware complexity and insufficient Instruction-Level Parallelism (ILP), while both incur serious power dissipation and thermal problems. The resultant memory wall [WM95] (memory access latency increasing relative to the decreasing execution time of a single instruction) and performance wall [AHKB00] (performance benefits decreasing relative to increasing hardware cost and power consumption) mandate rethinking architecture design for sustainable, efficient performance improvement. One answer has proven to be multithreading and, later, multiple cores, although binary compatibility is no longer as easily attainable as it used to be from the perspective of performance scalability.

1.2 Multithreading

As noted in [BH95], independent streams of instructions, interwoven on a single processor, fill its otherwise idle cycles and so boost its performance; multithreading, which exploits the higher-level Thread-Level Parallelism (TLP), opened a new way to improve overall performance by interleaving threads to hide latency. The study of TLP dates back to the late 1950s (e.g. [Bem57]), and the first commercial multi-threaded system was the Heterogeneous Element Processor (HEP) [Smi82], released in 1978. However, such architectures began to thrive only in the late 1990s because of stalling ILP in OoO processors. Despite diverse accomplishments, two important keys to multithreading are how and when to perform thread interleaving, which also categorize multi-threaded systems into different classes. As the execution of a thread requires its own context, consisting of a program counter (PC), state registers and stack frames, interleaving threads implies a context switch, namely context preservation and recovery. If such a switch is entirely conducted by software, this is known as software multithreading. Software multithreading usually bears a higher switch overhead because it relies on memory accesses for context values, so it is only suitable for infrequent switches; conversely, if each thread has its own in-processor storage for its context, this is known as Hardware Multithreading (HMT), which facilitates more frequent switching at the cost of additional hardware investment (hence the design trade-off). For instance, HEP supports 50 hardware threads equipped with 2048 registers, while its descendant, the Tera MTA [ACC+90, AKK+95], and the latest Cray Threadstorm processor [KV11] maintain 128 hardware contexts backed by 128 register sets.
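The context-switch mechanics described above can be sketched in a few lines. This is a toy model, not from the thesis: the `Context` class and its one-register ISA are invented for illustration. A software switch must save and restore exactly this kind of state through memory, whereas HMT keeps one such structure per thread inside the processor:

```python
# Toy model of thread contexts and fine-grained interleaving (illustrative
# only; the ISA and data structures are invented, not the thesis design).

class Context:
    def __init__(self, program):
        self.pc = 0              # program counter
        self.regs = {"acc": 0}   # architectural state (toy: one accumulator)
        self.program = program   # list of (opcode, operand) instructions

def step(ctx):
    """Execute one instruction of the toy ISA; False when the thread is done."""
    if ctx.pc >= len(ctx.program):
        return False
    op, arg = ctx.program[ctx.pc]
    if op == "add":
        ctx.regs["acc"] += arg
    ctx.pc += 1
    return True

def run_interleaved(contexts):
    """Fine-grained interleaving: one instruction per live thread per round."""
    live = list(contexts)
    while live:
        live = [c for c in live if step(c)]  # a 'switch' after every instruction
    return [c.regs["acc"] for c in contexts]

threads = [Context([("add", i)] * 3) for i in range(1, 4)]
print(run_interleaved(threads))  # [3, 6, 9]
```

The whole `Context` object is what a software switch must spill to memory; replicating it in hardware (as in HEP) is what makes per-cycle switching affordable.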

Figure 1.1: The number of on-chip transistors, scaling from 1971 to mid 2013 [1].

Decisions on when to switch threads generally fall into three types:

1. Fine-grained or interleaved multithreading: fetching one instruction from a thread in one cycle and switching to another thread in the next cycle, to ensure fairness among threads. The exemplary architecture of this type is the so-called barrel processor, where each thread is guaranteed to run one instruction every N cycles given N hardware contexts; meanwhile, the execution speed of a single thread is always roughly 1/N of its original. This was later improved in [AKK+95] by scheduling only ready threads, for better flexibility.

2. Coarse-grained or blocked multithreading: a thread emits instructions until a point is reached that triggers a switch. Such points can be either static, e.g. tags or switch instructions inserted by compilers, or every branch or memory load instruction, or dynamic, such as cache misses, traps or interrupts. Unlike interleaved multithreading, different threads often have different time slices for execution.

[1] This is a refined figure based on http://en.wikipedia.org/wiki/File:Transistor_Count_and_Moore’s_Law_-_2011.svg.
[2] This is a refined figure based on http://commons.wikimedia.org/wiki/File:IntelProcessorRoadmap.svg.

Figure 1.2: The evolution of Intel’s processors: shrunk process technology in the first, “tick” year, followed by the “tock” year with an updated micro-architecture [2].

3. Simultaneous Multithreading (SMT): it resembles the cycle-by-cycle switching of fine-grained multithreading but features multiple issue, i.e. several instructions from different threads are issued to the pipeline in one cycle. This is usually built on top of superscalar processors to fill pipeline slots left empty by dependencies, and thus captures both inter-thread TLP and intra-thread ILP. The issue width of SMT processors is typically moderate, i.e. 2 or 4 ways, and cannot be very large because of area costs: the issue-logic area grows as the square of the width [AHKB00, PBB+02] and that of the register file has a cubic scaling [BDA01], and the consequent power consumption pushes the processor towards even lower performance efficiency.
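The scaling costs just cited can be made concrete with a back-of-the-envelope calculation. The constants below are illustrative relative units, not figures from the cited papers; only the exponents (quadratic for issue logic, cubic for the register file) come from the text above:

```python
# Relative area of SMT structures versus issue width (illustrative units).

def relative_cost(width, base=2):
    """Area of issue logic / register file relative to a base-width design."""
    issue = (width / base) ** 2      # quadratic issue-logic scaling
    regfile = (width / base) ** 3    # cubic register-file scaling
    return issue, regfile

for w in (2, 4, 8):
    issue, regfile = relative_cost(w)
    print(f"width {w}: issue logic x{issue:.0f}, register file x{regfile:.0f}")
# width 8: issue logic x16, register file x64 (relative to a 2-wide design)
```

Quadrupling the width from 2 to 8 thus costs roughly 16x the issue logic and 64x the register file, which is why commercial SMT designs stop at small widths.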

1.3 Multiple cores

In 2000, IBM announced the Power4, the first commercial multi-core microprocessor, which integrated two cores onto a single die. It kicked off a new era of investing transistors in additional CPUs (cores) inside the same chip. Other vendors soon followed, moving to multi-core instead of launching ever more aggressive uni-core processors. However, what a “core” looks like reveals disparate design principles among vendors. One extreme can be found in the evolving Oracle T-Series and IBM Power-Series Chip Multiprocessors (CMP), which integrate 8∼16 “fat” cores. Each core is enhanced with SMT as well as some special function units (e.g. the SPU in Oracle cores and the vector unit in Power cores), emphasizing chip-level parallelism while retaining complex hardware circuits for ILP and assuring better performance of a single instruction stream. For example, the Oracle SPARC T5 [Ora13] claims 30% higher single-thread performance compared with the previous generation. This approach is by and large regarded as latency-oriented.

The other extreme utilizes only simpler or “thin” cores. Each core supports many hardware threads with trivial scheduling overheads in order to hide latency simply by thread interleaving, even to the point that hierarchical caches become unnecessary. An early trial of this was the Niagara T1 [LLS06], which packs 8 cores running 32 threads in total. It featured fine-grained multithreading rather than radical ILP techniques or a sizable amount of cache. The development of the Graphics Processing Unit (GPU) is much more aggressive. Taking NVidia’s Kepler GK110 [Nvi12] as an example, it includes up to 15 streaming multiprocessors (SMX) primarily composed of integer and floating-point arithmetic logic; each SMX allows the concurrent issue and execution of 128 parallel threads and supports up to a maximum of 2048 threads.
Consequently the GPU possesses superior advantages over its CMP rivals in terms of system throughput when solving problems exposing abundant parallelism, although such an achievement comes at the cost of single-thread performance [GK10]. It is also less general-purpose than a CMP. The deficiencies of the above two extremes motivate an alternative, “fused” design packaging “fat” CPU cores and a GPU together. This is currently promoted by AMD’s APUs and Intel’s Haswell for the purpose of smooth synergy between several cores and a GPU on chip, yet it needs further improvements to perfect the coordination and overcome interconnection limitations [DAF11]. In fact, a similar asymmetry appeared earlier in IBM’s Cell Broadband Engine architecture [JB07], a multi-core chip where a general-purpose Power Processor Element controls multiple streamlined Synergistic Processing Elements via an interconnect bus. Like GPUs, Cell was also designed for specific domains, e.g. multimedia.

1.4 What’s next

Although Amdahl’s law [Amd67] pinpoints the limited speedup attainable when multiple processors run an individual program, flourishing applications with plentiful inherent concurrency, as well as end users’ habit of running a few independent tasks at the same time, may escape that conclusion, and the multi-core benefits of this approach are better justified by Gustafson’s law [Gus88]. There is already a consensus that multi-core processors will prosper with ever-increasing core counts (many-core [3]) in the computing landscape, although there are challenges and dissension regarding the implementation. One major challenge comes from scaling. The physical limitations of silicon-based miniaturization are likely to end Moore’s law [TP06, Pow08], which will curtail the benefits drawn from the sheer quantity of cores. On the other hand, the exponentially shrunk transistor size no longer brings a matching reduction in energy consumption, because of leakage [Kau13]. Given a constant power budget on chip, the increase in density implies a reciprocal decrease in the fraction of transistors that can be active at full speed, which is called “dark silicon” [EBA+11, VSGH+11, HFFA11] or otherwise the frequency wall [FH04]. Therefore power dissipation, bounded by packaging and cooling technologies, becomes the leading limiting factor in many-core design, and the performance of a many-core system should not be measured by its computation ability alone; energy consumption has to be taken into account. Hence the frequently addressed energy efficiency, or performance per watt (PPW).

[3] There is no clear boundary between multi-core and many-core processors in relation to the exact number of cores on chip, nor is there agreement in academia. This dissertation takes 32 cores as the threshold for many-core processors.
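The contrast between the two laws can be evaluated side by side. These are the standard textbook formulas, not anything specific to this thesis: `p` is the parallel fraction of the work and `n` the core count; Amdahl fixes the problem size, while Gustafson scales it with the cores, which is why the latter is more optimistic:

```python
# Amdahl's law vs Gustafson's law (standard formulas, illustrative values).

def amdahl_speedup(p, n):
    """Speedup on n cores when a fixed fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)

def gustafson_speedup(p, n):
    """Scaled speedup when the parallel part grows with the n cores."""
    return (1.0 - p) + p * n

n = 64
for p in (0.5, 0.9, 0.99):
    print(f"p={p}: Amdahl {amdahl_speedup(p, n):6.1f}x,"
          f" Gustafson {gustafson_speedup(p, n):6.1f}x")
# e.g. p=0.99 on 64 cores: Amdahl ~39.3x, Gustafson ~63.4x
```

Even at 99% parallelism Amdahl caps 64 cores well below 64x speedup, whereas Gustafson's scaled view approaches the full core count, matching the argument that abundant or growing workloads justify many-core designs.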

Dissenting voices are found principally around a fundamental question: what is the optimal constitution of a many-core chip, heterogeneity or homogeneity? Those with the philosophy of omnipotence opt for the heterogeneous organization, i.e. general-purpose “fat” cores surrounded by specialized units such as GPUs, Field-Programmable Gate Arrays (FPGA) and Application-Specific Integrated Circuits (ASIC), so that all types of computation can be appropriately served, achieving adaptive performance [HM08]. For instance, sequential code will be boosted on “fat” cores and other parts on specialty cores. This paradigm is prominent in its energy efficiency, and its validity and superiority are partially confirmed by today’s Systems on Chip (SoC) in the field of mobile computing.

Those espousing homogeneous chips do so for the sake of simplicity. The main concern here is the programming and even resource or concurrency management difficulties that heterogeneity would impose on software communities, for whom the “free lunch” from speedy clocks and implicit ILP is over [Sut05]. “A core will look like a NAND gate in the future. You won’t want to mess with it. As for software, the time to market has been long in the past, but we can’t afford to let that be the case in the future,” said Shekhar Y. Borkar, Director of Intel’s Microprocessor Technology Lab. Non-uniformity would further exacerbate these problems. Additionally, homogeneity also confers flexibility in that the non-monolithic general-purpose units can be configured dynamically (voltage, frequency or shutdown, for example), adapting to particular applications for throughput and cost parameters.

The ideal strategy would be the dynamic organization and harnessing of reconfigurable logic, e.g.
FPGAs, where the entire chip can be fused together for latency-oriented applications or separated into a handful of parts as a throughput-oriented approach, likely varying over time on the basis of instantaneous computational demands. This is believed to be the most flexible approach and to yield quite appreciable PPW, with the sole obstacle of on-the-fly reconfiguration time [HM08]. Less aggressively, preliminary research targeting the on-line composition of existing simple cores has been carried out and is extending what has been done in [KSG+07, GRE+10, RAKK12].

1.5 Many-core chips: the case studies

As academic studies always precede business actions, commercial many-core chips are still at the beginning of this development, with only modest core counts. Up to the present, only three vendors have launched commercial products, of quite different designs (table 1.1). Intel initiated its research project on the Many Integrated Core (MIC) architecture in 2006 and has since gone through three stages. So far (mid 2013), only one MIC product, branded Xeon Phi, is available on the market, as a coprocessor attached via PCI-E. That chip is fabricated in a 22nm process using tri-gate technology and consists of sixty P54C cores, each of which is a 2-way superscalar, in-order, multi-threaded core with private, fully coherent two-level caches, fortified by a 512-bit Single Instruction Multiple Data (SIMD) unit and support for 64-bit instructions. All on-chip components are connected by a 512-bit bidirectional

ring bus. The choice of the “old” Pentium cores leverages existing software tools, facilitating software development and optimization, and particularly the reuse of legacy code. Compared with this mature commercial product, the previously released research prototype, the Single-chip Cloud Computer (SCC) [HDH+10, MRL+10], adopted some aggressive innovations. On the SCC, all nodes, packaging a total of forty-eight P54C cores, are arranged in a 6×4 mesh network, and the whole chip relies upon message passing for inter-core and chip-wide communication because of its split address spaces and lack of hardware coherence, which requires customized programs and drops backward compatibility.

Table 1.1: Specifications of three commercial many-core chips.

                Intel                   Tilera                  Adapteva
Core feature    2-way superscalar,      3-wide VLIW,            2-way superscalar,
                in-order,               in-order,               in-order,
                4 h/w threads,          32K+32K L1$,            32K local memory
                32K+32K L1$,            256K L2$,
                256K L2$                shared 18M L3$
Core count      60                      72                      64
Freq. (MHz)     1053                    1000                    800
ISA             x86-64                  —                       RISC
Link            bidirectional ring,     2-D mesh,               2-D mesh,
                134GB/s                 1 cycle/hop,            1.5 cycles/hop,
                                        >12500GB/s              102GB/s
Peak perf.      1011 GFLOPS             —                       100 GFLOPS
Memory          GDDR5, 8GB,             DDR3 x4, <=1866MHz,     <4GB,
                320GB/s                 1TB, >60GB/s            6.4GB/s
Die (mm2)       350                     —                       8.2
Max. watt       225                     60                      2
Fab. tech.      Intel 22nm FinFET       TSMC 40nm               GlobalFoundries
                                                                28nm SLP
Programming     C/C++/Fortran,          Std. C/C++,             ANSI C/C++ &
                Intel MPI & OpenCL,     MDE                     OpenCL, SDK
                Composer XE 2013
Market          general-purpose         networking data-plane   H.264/H.265 video
                co-processor            offload, consumer       transcoding and
                via PCIe                devices and embedded    HPC offload
                                        system accelerator      via PCIe

Tilera develops many-core chips based on its own TILE64 architecture [BEA+08], aiming at multimedia, network and cloud computing, and launched its first 64-core product in 2007.
The newest TILE-Gx8072 chip integrates up to seventy-two TILE64 cores in 40nm technology, each of which is a 3-wide Very Long Instruction Word (VLIW) 64-bit core with private two-level caches and an Instruction Set Architecture (ISA) extended for multimedia and SIMD processing. All cores coherently share an 18MB level-3 cache via the low-latency iMesh network, and up to 5 independent mesh networks are provided, e.g. for accessing the integrated memory controllers or I/O controllers, thus getting rid of the need for any virtual channels. It is similar to the Oracle Tx processors in putting some accelerators on chip for encryption, decryption and packet analysis. The chip can be programmed with the existing GNU tool chain plus dedicated libraries and tools, to save software investment, and the 72 cores can be used on demand, i.e. grouped in clusters to apply the appropriate, deterministic amount of horsepower to each application, or formed into a homogeneous pool to distribute work among individual cores [Cor13].

Adapteva, founded in 2008, focuses on the research of its scalable Multiple Instruction Multiple Data (MIMD) architecture named Epiphany, based on Reduced Instruction Set Computer (RISC) CPU cores and targeting 4096-core chips. The fourth-generation Epiphany yields a 64-eCore chip in a 28nm fabrication technology. Each eCore is a 2-way in-order superscalar 32-bit RISC core with an ISA optimized for floating-point operations. All cores are arranged in an 8×8 grid connected by a 2-D mesh network (called eMesh) consisting of three independent mesh structures for different memory transactions. The architecture principally features a single, flat address space statically partitioned among the eCores, with a banked 32KB local memory mapped inside each eCore’s memory space instead of a cache hierarchy, to reduce the overhead of inter-core communication. For example, a communication can be expressed implicitly as a load or store targeting another core’s local memory and will be translated into a transaction carrying the target core ID and requestor core ID, delivered by the eMesh. The platform supports programs in ANSI C/C++ and OpenCL, backed by the Epiphany SDK [Ada12].
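The flat-address-space idea can be illustrated with a small sketch. The bit layout below is invented for the example and is not the real Epiphany memory map; it only shows how a statically partitioned global address identifies the owning core, so that a plain store becomes a routable inter-core transaction:

```python
# Hypothetical flat address space partitioned over a mesh of cores.
# NOTE: this layout is invented for illustration; it is NOT the actual
# Epiphany memory map.

LOCAL_MEM_BYTES = 32 * 1024   # 32KB of local memory per eCore
MESH_DIM = 8                  # the 8x8 grid of cores

def global_address(row, col, offset):
    """Map (core position, local offset) into one flat global address space."""
    assert 0 <= offset < LOCAL_MEM_BYTES
    core_id = row * MESH_DIM + col
    return core_id * LOCAL_MEM_BYTES + offset

def owner_of(addr):
    """Return the (row, col) of the core whose local memory holds addr."""
    return divmod(addr // LOCAL_MEM_BYTES, MESH_DIM)

# A store by any core to this address is routed by the mesh to core (2, 5):
addr = global_address(2, 5, 0x100)
print(owner_of(addr))  # (2, 5)
```

Because the partitioning is static, the network can extract the target core from the address alone, which is what lets ordinary loads and stores double as messages.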
In summary, these preliminary many-core chips all take homogeneous cores with “conservative” architectures, i.e. in-order execution and small issue width, and are generally deployed as affiliated components. The Xeon Phi is the only one with multithreading and a ring topology; it directly competes with GPUs given its integrated wide SIMD units and Graphics Double Data Rate (GDDR) memory. The 2-D mesh Network on Chip (NoC) of the other two demonstrates impressive efficiency in terms of aggregate bandwidth and low latency. The Adapteva chip also shows the superiority of RISC cores with respect to hardware cost, power consumption, energy efficiency (70GFLOPS/watt) and even price. This is also verified in [SVC+13] by a comparison between an ARM-core cluster and classical x86 systems running High Performance Computing (HPC) applications. Furthermore, all the benefits of these many-core chips can only be reaped with platform-specific programming requiring full-fledged back-end support such as a Software Development Kit (SDK) or Integrated Development Environment (IDE), including compilers, libraries, debugging and simulation tools or even OS kernels, which are of equal importance and to some extent decide the fate of the hardware. Notoriously, all this work lies largely in the hands of the chip vendors, since only those who know a chip well can use it well.

1.6 What have we done: a brief synopsis of our research

The Computer System Architecture (CSA) group of the University of Amsterdam puts its efforts into the innovative DRISC architecture, with its simplicity and neatness, and tries to promote it for general-purpose use. This process goes almost hand in hand with building multi- or many-core systems using the DRISC architecture. Regarding present and emerging applications with massive concurrency [LCD11], we believe in the increasing requirements on system throughput and efficiency, rather than the performance of an individual instruction stream, and propose a many-core architecture on the basis of DRISC cores equipped with HMT and dynamic data-flow scheduling [Ian88, Nik89] to tolerate asynchronous latency and improve pipeline occupancy. This shares the same principle with Intel’s Xeon Phi: “many-core chips include more and smaller cores, and more threads provide more efficient performance for highly parallel applications; and the high degree of parallelism compensates for the lower speed of each individual core to deliver higher aggregate performance for workloads that can be subdivided into a sufficiently large number of simultaneous tasks” [Int13]. But for us it is not limited to the concept of “accelerator”, and asymmetry is always available on demand, as “fat” cores are suitable candidates for shared services.

The many-core chip research embraces hardware and software co-design and leverages an abstract concurrency interface to

• achieve resource-agnostic programming [BGH+11]. Parallelism exploration and expression in programs target virtual resources and are parameterized. Such virtualization, together with isolation from concurrency management, gives rise to binary compatibility and flexible, scalable performance over varying numbers of threads and cores;

• achieve an in-silicon Operating System (OS). Software parallelism is exposed to hardware by the ISA, and resource allocation, concurrency mapping and management are wholly conducted on chip without interference from the OS. In other words, part of the OS function is moved to, and attained at, the hardware level for higher efficiency.

Research around the proposed architecture has spanned a very long time and has been supported by multiple projects with different proposals and interests. This has yielded a fruitful harvest, in particular the infrastructure, including:

• concrete entities of the conceptual system, covering a dedicated DRISC core structured on the FPGA for the SPARC V8 ISA and a cycle-accurate full-system simulator. The former verifies the realizability of the DRISC architecture in transistors and the latter creates the simulation environment;

• the developed tool chain supporting the various platform implementations (e.g. ISAs) of the architecture. It embodies gcc-based compilers and libraries for each target platform to support high-level imperative and functional languages, i.e. extended C and Single Assignment C (SAC) [GS06], and many other frameworks for performance counting, benchmark packaging and execution monitoring, tracing, debugging and visualization.

All this infrastructure facilitates the inspection and evaluation of the system design, leading to an evaluation-revamping cycle in the overall design. It also prepares the DRISC architecture for different motivations and trials in different research directions.

Very recently, we embarked on a project for outer-space use and made a foray into the real-time multi-core processing domain. On the basis of the existing DRISC cores implemented in building the many-core system, and the developed environment, the major research problem is how to reuse DRISC for its strengths while hardening the architecture for real-time and even reliability requirements.

1.7 Overview and organization

That many-core systems will be the future wave is already beyond a shadow of doubt, despite divergent expectations with respect to the growth of on-chip core counts over the years. Judging by the state-of-the-art commodity chips so far and their niche markets, many-core chips are barely on the horizon of becoming mainstream computing horsepower. The design strategies are still under discussion, and academic research revolves around exploring solutions for the challenges sweeping the entire ecosystem. We are catching the trend and have gained a foothold with our own system of many DRISC cores, yet it still needs exposure: its character and advantages are appealing, but its limitations and vulnerabilities also easily draw attention.

Not only general-purpose computing, but also the real-time field follows the trend of using multiple cores to boost performance, while the stringency of deadlines remains of the first importance. Research on real-time multi-core processors also gives consideration to system efficiency, predictability, analyzability, etc. We foray into this domain and carry out preliminary studies on top of DRISC, though revamping it for full real-time competence still needs further investigation.

On account of all the above concerns, several questions are explored by the work undertaken in this thesis:

• possibly first and foremost, what are the fortes of the DRISC architecture that make it attractive for sustained research in comparison with its conventional and popular rivals?

• what are the special requirements on DRISC for desirable research goals, e.g. competitive performance using only multithreading and single in-order issue execution?

• what are the advantages of building a many-core chip with DRISC, and how is it done?

• what are the factors that constrain its capability to reach the performance it deserves? With reference to other many-core research, do the problems found there also pertain to this architecture? If yes, what about their solutions?

• given these limitations, what are the amendments, and how do they benefit general-purpose studies?

• what is the motivation for building a real-time processor with multiple DRISC cores, and how is that work going?

• what is the latest development of this architecture research, and what is the future plan?

There may be even more; this thesis will cover most of them and deliver the answers. As a result, the remainder of this dissertation is organized as follows: part I focuses on the DRISC architecture per se, including motivations, design principles and features. We will articulate in part II how DRISC is instantiated in building the proposed scalable system with many cores. Other major elements and the simulation of the system will also be elaborated in chapter 3. These two chapters will answer the questions in the first three key points above and put the main research of this thesis into context. As part of the study of future scalable many-core systems, chapter 4 discusses implementation and performance matters in both other academic studies and our research. Then chapter 5 continues the topic of the memory subsystem and explores optimizations. Chapter 6 of part III introduces the initial design of a real-time multi-core processor and its achievements. Finally, part IV gives conclusions and future work concerning refinements of the work in this thesis and the development of the DRISC architecture.