FAST: A Functionally Accurate Simulation Toolset for the Cyclops64 Cellular Architecture

Juan del Cuvillo, Weirong Zhu, Ziang Hu, Guang R. Gao
Department of Electrical and Computer Engineering, University of Delaware

Newark, Delaware 19716, U.S.A.

{jcuvillo,weirong,hu,ggao}@capsl.udel.edu

Abstract

This paper reports our experience and lessons learned in the design, implementation and experimentation of an instruction-set level simulator for the IBM Cyclops-64 (or C64 for short) architecture. This simulation tool, named Functionally Accurate Simulation Toolset (FAST), is designed for the purpose of architecture design verification as well as early system and application software development and testing. For our purposes, a cycle-accurate (rather than functionally accurate) simulator would be too slow for a system consisting of one or more fully-populated C64 chips. Currently, FAST efficiently handles C64 systems consisting of a single processing core, a fully populated C64 chip, or a system built out of several nodes connected with a 3D mesh.

We present several important aspects of the FAST simulator and highlight the tradeoffs faced during its design and implementation. Some design decisions are based on the unique features of the C64 architecture. For instance, C64 employs no data caches. Instead, on-chip memories are organized in two levels: global interleaved memory banks that are uniformly addressable, and scratch memories that are local to individual processing cores.

FAST has been in use by the C64 architecture team, system software developers and application scientists. We report some preliminary results and illustrate, through case studies, how FAST performs in terms of its design objectives as well as where it should be improved in the future.

1. Introduction

It is increasingly clear that the huge number of transistors that can be put on a chip (now reaching one billion and continuing to grow) can no longer be effectively utilized by traditional technology that integrates only a single processor on a chip. A new generation of technology is emerging that integrates a large number of tightly-coupled simple processor cores on a chip, empowered by parallel system software technology that coordinates these processors toward a scalable solution.

This paper reports our experience and lessons learned in the design, implementation and experimentation of an instruction-set level simulator for the IBM Cyclops-64 architecture, which integrates on a single chip up to 150 processing cores, an equal number of SRAM memory banks and 75 floating point units. This simulation tool, named Functionally Accurate Simulation Toolset (FAST), is designed for the following goals: (1) architecture design verification; (2) early system software development and testing; (3) early application development and testing.

2. Cyclops64 chip architecture

The Cyclops-64 (C64) is the latest version of the Cyclops cellular architecture designed to serve as a dedicated petaflop compute engine for running high performance applications [10]. A C64 supercomputer is attached, through a number of Gigabit Ethernet links, to a host system. The host system provides a familiar computing environment to application software developers and end users.

A C64 system is built out of tens of thousands of C64 processing nodes arranged in a 3D-mesh network. Each processing node consists of a C64 chip, external DRAM, and a small amount of external interface logic. A C64 chip employs a multiprocessor-on-a-chip design with a large number of hardware thread units, half as many floating point units, embedded memory, an interface to the off-chip DDR SDRAM memory, and bidirectional inter-chip routing ports.

[Figure 1: Cyclops-64 node]

A C64 chip has 75 processors, each consisting of two thread units, a floating-point unit and two SRAM memory banks of 32KB each. A 32KB instruction cache, not shown in the figure, is shared among each group of five processors. The C64 chip has no data cache. Instead, a portion of each on-chip SRAM bank can be configured as scratchpad memory (SP), which provides fast temporary storage local to its thread unit. The remaining sections of all SRAM banks together form the global memory (GM), which is uniformly addressable from all thread units. Processors are connected through a 96-port crossbar network, which sustains all the intra-chip traffic and also provides access to the routing ports that connect each C64 chip to its nearest neighbors in the 3D-mesh network. The intra-chip network likewise connects special devices such as the Gigabit Ethernet port and the serial ATA hard disk attached to each node.

Execution on a C64 chip is non-preemptive, and the architecture provides no resource virtualization mechanisms: there is no hardware virtual memory manager. The former means that the OS will not interrupt a user thread running on a thread unit unless the user explicitly specifies preemption or an exception occurs; the latter means the three-level memory hierarchy (scratchpad, global SRAM and off-chip DRAM) is directly visible to the programmer. A thread unit can stop executing instructions for a number of cycles or indefinitely; while asleep, it can be woken up by another thread through a hardware interrupt. Additionally, the instruction set, a 64-bit RISC design running at a moderate clock rate (500MHz), incorporates efficient support for thread-level execution, such as in-memory atomic instructions.

3. The FAST simulator

FAST is an execution-driven, binary-compatible simulator of a multichip multithreaded C64 system. It accurately reproduces the functional behavior and count of hardware components such as thread units, floating point units, on-chip and off-chip memory banks, and the 3D-mesh interconnect, see Table 1. Originally developed for the design verification of the C64 chip architecture, FAST also supports early system software development and application evaluation. For these purposes, in addition to the functional behavior of the architecture, FAST models the features that account for the major sources of delay in program execution, such as the four-stage pipeline, instruction timing, and contention in the memory and interconnect. It also provides facilities for debugging and for collecting execution traces and instruction-mix histograms.

The rest of this section describes instruction execution (3.1), exception handling (3.2), the segmented memory space (3.3), execution traces and instruction statistics (3.4), memory and interconnect contention (3.5), the A-switch device (3.6), the debugger (3.7), and the simulator internals (3.8).

Table 1: Simulation parameters

  Component | #   | Params./unit
  Threads   | 150 | single in-order issue, 500MHz
  FPUs      | 75  | floating point/MAC, divide/square root
  I-caches  | 15  | 32KB
  SRAM      | 150 | 32KB
  DRAM      | 4   | 256MB
  Crossbar  | 1   | 96 ports, 4GB/s per port
  A-switch  | 1   | 6 ports, 4GB/s per port

3.1. Instruction execution

FAST simulates the four-stage pipeline of a C64 thread unit, see Figure 2. In the first stage, the instruction is fetched and decoded, and the corresponding functional unit is allocated. Instructions are fetched from the instruction buffer (PIB); upon a miss, the thread stalls for additional cycles while the instruction is read from the instruction cache, which accounts for the I-cache's limited bandwidth. Whenever a branch is taken, the PIB is flushed. However, FAST does not reflect the operation of a branch predictor and regards all conditional branches as correctly predicted.

In the second pipeline stage, the instruction input operands are read from the register file. For all C64 instructions except the floating multiply and add (FMA), one or two register operands are read in one cycle. FMA instructions have three input operands, hence an extra cycle may be required to read the third operand, since the register file has two read ports.

In the third stage the instruction is executed. RISC-like instructions such as integer, floating-point, branch and memory operations are modeled based on execution times expressed by ⟨t, d⟩ pairs, where t is the execution time in the ALU and d represents the delay before the result of the instruction becomes available. Instruction timing reported in Table 3 is based on information provided by the C64 chip designer team. For instance, integer division is said to take one cycle in the ALU, but a subsequent instruction will not be able to use the result until 33 cycles later. During this delay, execution of independent instructions can proceed normally. However, if the result of an instruction is to be used by another instruction before it is available, the pipeline will stall. It is the compiler's and programmer's responsibility to cover these delays as much as possible with appropriate instruction scheduling.

The result is finally committed in the fourth stage if no exception is generated. Otherwise, a context switch causes execution to continue from the address specified by the interrupt vector. When the results are put away, conflicts may occur, since the register file has two write ports. However, these events are not expected to happen frequently and FAST does not account for them.

Table 2: Instruction set summary

  Core Integer and Branch: Load, Store; Load, Store Multiple; Add, Subtract [Immediate]; Multiply, Divide; Compare [Immediate]; Trap on Condition [Immediate]; Logic [Immediate]; Shift [Immediate]; Shift left 16 then OR immediate; Insert, Extract; Move if Condition; Branch on Condition; Branch and Link
  Floating Point: Add, Subtract; Multiply, Divide; Multiply and Add; Conversions; Square Root
  Exotic: Bit Gather (permute bits); Count Leading Zeros; Count Population; Load then Op; Multiply and Accumulate
  Control: All Stop; I-Cache Invalidate; Move From/To SPR; Return from Interrupt; Sleep; Supervisor Call

Table 3: Instruction timing (ALU time t, result delay d, in cycles)

  Branches: 2, 0
  Count population: 1, 1
  Integer multiplication: 1, 5
  Integer divide, remainder: 1, 33
  Floating add, mult. and conv.: 1, 5
  Floating mult. and add: 1, 10
  Floating divide double: 1, 30
  Floating square root double: 1, 56
  Floating mult. and accumulate: 1, 5
  Memory operation (local SRAM): 1, 2
  Memory operation (global SRAM): 1, 20
  Memory operation (off-chip DRAM): 1, 36
  All other operations: 1, 0

[Figure 2: Four-stage instruction pipeline]

In terms of instruction execution, FAST allows thread units to fetch, decode and execute instructions independently, following the sequence of events dictated by each thread's instruction stream. However, care needs to be taken with some special instructions. The sleep instruction, the wakeup signal, the inter-thread interrupt, etc., all imply a synchronization between threads. For instance, a thread unit, while asleep, does not execute any instruction; during this time the simulator does not update its clock. When a wakeup signal is received, the clock counter is set to that of the remote thread that executed a store into the wakeup memory area (plus some delay). To handle these synchronizations, threads commit instructions only once the simulated chip clock reaches the time point at which the instruction is executed by the thread. In other words, although instructions are executed asynchronously, they are committed in a synchronized fashion.

FAST also accurately reproduces the C64 memory map by implementing the non-uniform shared address space described in Section 2. It includes the address upper limit special purpose registers (AULx) that define the highest existing location in scratchpad memory, global memory and DRAM memory, respectively. Nonetheless, all memory-specific parameters, such as the number of banks, size of each bank, latency and bandwidth, are easily configurable. Additionally, it considers three protection boundary special purpose registers (PBx). These registers define regions in scratchpad, interleaved and DRAM memory that can only be written in supervisor state, which effectively provides a basic mechanism to protect the kernel against malign user code.

3.2. Exception handling

Exceptions are thread-specific events. Some are caused by instructions and trigger what we call synchronous interrupts, which cannot be disabled. For instance, an attempt to execute an instruction with an invalid opcode generates an illegal interrupt. Others, known as asynchronous, are caused by events such as a timer alarm and can be disabled. While disabled, only the first exception of each type generated by a sequence of events is held pending; subsequent ones are lost. Throughout an instruction's execution, multiple exceptions of both classes may occur. FAST checks for exceptions at the end of the execution stage. Before the results are put away, if one or more enabled exceptions exist, FAST generates an interrupt according to the priority order specified by the architecture.

3.3. Segmented memory space

The C64 chip hardware supports a shared address space model: all on-chip SRAM and off-chip DRAM banks are addressable from all thread units/processors within a chip. That is, all threads see a single shared address space.

Architecturally, each thread unit has an associated 32KB SRAM bank. Each memory bank can be partitioned (configured) into two sections: one called "global" (or "interleaved"), the other "local" (or "scratchpad"). All the global sections together form the (on-chip) global memory in an interleaved fashion that is free of holes and uniformly addressable from all thread units. Although scratchpad, global memory and off-chip DRAM memory are all addressable from any thread within the chip, access is not uniform. Besides having different latencies, these three memories have separate address spaces, resulting in a three-level hierarchy. Furthermore, there is no virtual memory manager in the C64 architecture, hence this hierarchy is directly exposed to the programmer.

3.4. Execution trace and instruction statistics

Given the appropriate command line option, the toolset generates the execution trace of a program. There are two mechanisms to select the instructions that are to be stored in the trace. The user can either specify the time interval (in clock cycles) for which the program execution is to be traced, or enclose the instructions to be output to the trace within TraceOn/TraceOff macros. These macros access unarchitected special purpose registers, i.e. SPRs that control the simulator's functionality but are not present in the C64 chip design. The output, consisting of a text file per active thread on the C64 system, contains detailed information such as the clock cycle, the instruction executed, source and target register values, the address of the memory location touched by the instruction, if applicable, and specific information regarding events that could have delayed the execution of the instruction (contention in the crossbar network, operand not available yet, etc.).

FAST may also collect instruction statistics over an execution interval and produce histograms of the instruction mix. Similarly to the procedure available for tracing, the user can specify an interval in clock cycles or use StatsOn/StatsOff macros to start/stop collecting statistics, respectively. A combined report for each node as well as individual reports for all active threads are generated.

3.5. Memory and interconnect contention

One of the latest additions to the FAST simulator is a module that accounts for the contention in the crossbar network and in the memory system.

[Figure 3: Interconnection to the on-chip crossbar]

Figure 3 illustrates the data path between processors and memory banks on a C64 chip. Every memory instruction executed on a processor results in a network packet delivered by the crossbar network to the appropriate memory bank (global SRAM or off-chip DRAM). For load operations, the memory replies with another packet containing the data retrieved from memory.

FAST models the following sources of contention: (1) Packets issued by threads on the same processor are queued on a 7-slot FIFO (processor buffer) until they are retrieved by the crossbar. If a thread issues a memory operation when the FIFO is full, the pipeline stalls until space is available; (2) The crossbar retrieves packets from the input ports and delivers packets to the output ports, one per cycle. If, in the same cycle, two packets are to be delivered to the same output port, the crossbar blocks one of them arbitrarily; (3) Between the crossbar and each memory bank there is another 7-slot FIFO (memory buffer) where packets are held until processed by the memory. Whenever this buffer becomes full, the crossbar stops delivering packets to this destination. At the same time, it stops retrieving packets from any input that tries to send packets to the blocked output port; (4) Memory latencies are also taken into account. SRAM memory banks can perform a load or store operation every cycle, i.e., 4GB/s per bank, whereas DRAM memory sustains a much lower bandwidth. DRAM memory consists of four banks and each bank is subdivided into four subbanks. Subbanks can service requests simultaneously, one every 32 cycles. While a memory subbank is in service, an incoming request is held pending in the memory buffer. Therefore, the DRAM bandwidth is 2GB/s for single loads and stores. For multiple transfers, using load multiple (LDM) and store multiple (STM) instructions, the DRAM bandwidth is 16GB/s instead.

3.6. A-switch device

In FAST, there are two optional modes for simulating the A-switch: message-accurate and packet-accurate simulation. The former is faster but less accurate, since it copies the whole message directly to the destination node. The latter models all of the hardware mechanisms involved in transferring packets double word by double word through the 3D-mesh network. However, this model is still under testing, mainly because it does not account for the interaction between the A-switch and the crossbar network. In other words, reading from and writing to memory while sending or receiving messages does not generate the corresponding packets in the crossbar. Therefore, performance estimations obtained with FAST for multichip simulations should be regarded as less accurate.

3.7. Debugger

FAST integrates a user-friendly assembly-level debugger. In debugging mode, there are commands to set a breakpoint, continue with the execution after a breakpoint, single-step the execution, inspect and modify the values of registers or memory, etc. Although useful, this method is tedious. To eliminate the hazard of mapping statements in the source code to assembly instructions and vice versa, a source level debugger is a necessary tool. The GNU debugger, GDB, has been partly ported to the C64 architecture.

3.8. Simulator internals

The simulated C64 system starts running when one of the three main simulator functions is called. To maximize performance, each function specifically handles a C64 system consisting of a single processing core, a fully populated C64 chip, or a system built out of several nodes. Therefore, the decision is simply based on the system configuration.

In multinode simulations, the main function starts with a loop that iterates over all the active threads on all the nodes. Each thread unit attempts to execute an instruction. For a new instruction, calls to routines that take care of instruction fetch, instruction decode, reading the input operands from the register file, and instruction execution are invoked. If the thread unit is asleep, stalled waiting for an operand or due to a resource hazard, or waiting to commit the previous instruction, it does nothing but return.

Back in the main function, the chip clock is moved forward, just enough to allow at least one thread unit to commit the current instruction. Once the clock is updated, the crossbar and memory banks proceed to flush packets and memory operations that are to be performed by this time. Then a second loop iterates over all the threads, regardless of their status. First, thread units check whether an exception occurred, and if it did, the corresponding interrupt is serviced with the appropriate context switch. If no interrupt was triggered, they try to commit the last instruction. At this stage, threads compare the chip clock with their own internal clock. When the execution on the chip reaches the time step at which a thread can commit an instruction, the results are put away; otherwise, the thread waits.

Finally, after the status of the A-switch is updated, execution returns to the beginning of the main loop. The process is repeated until the thread units on every node execute the ALLSTOP instruction in supervisor state.

To simplify the communication among components of the simulator, the representation of the simulated C64 system is kept in a single multilevel data structure. At the chip level, it contains information regarding thread units, floating point units, on-chip SRAM and off-chip DRAM memories, I-caches, the crossbar model, and the A-switch. At the thread level, it accounts for general, special purpose and accumulator registers, in addition to timing information as to when the value stored in a general purpose register will be available, the last decoded instruction, exception flags, thread status, and a third-level data structure with statistics counters.

4. Experience and evaluation

The goal of FAST is three-fold. FAST is designed for the purpose of architecture design verification (Section 4.1). As part of the C64 toolchain, FAST provides the basic platform for early system software development and testing (Section 4.2). FAST has also been in use for application development and testing; although not cycle accurate, the timing information provided by the simulator has proven to be useful for performance estimation and application tuning as well (Section 4.3).

4.1. Design verification

For the purpose of architecture design verification, the execution trace generated by FAST is compared to the output of the VHDL simulator that reproduces the C64 at a gate level. Initial verification of the C64 design was carried out following this procedure with a set of short programs intended to test the C64 instruction set architecture [11].

The Cyclops-E is another cellular architecture design, targeted at the embedded market. The first hardware implementation of a single-chip Embedded Cyclops system was accomplished with DIMES, an FPGA-based multiprocessor emulation system [21]. Concurrently with the development of DIMES, an earlier version of the FAST simulator, known as CeDIMES, was also implemented [9]. Since this simulation tool is also binary-compatible, once the hardware emulation system was brought up, design verification started immediately. A test suite consisting of more than 200 programs specifically designed to test the Cyclops-E ISA was run on the actual hardware platform and the results were compared to those produced by the simulator. The initial testing revealed a few bugs in the chip design, which were fixed by the chip architect.

4.2. System software development

[Figure 4: Cyclops-64 software toolchain]

4.2.1. Toolchain FAST is part of the software toolchain available for application development on the C64 platform, see Figure 4. Programs written in C or Fortran are compiled using a port of the GCC-3.2.3 suite. The assembler and linker, which are based on binutils-2.11.2, along with the necessary libraries, produce a 64-bit ELF executable that can then be loaded into FAST and executed. The C standard and math libraries are based on newlib-1.10.0. In addition, we wrote the TNT runtime system and the CNET communication library to manage hardware resources such as the thread units and the A-switch, respectively.

4.2.2. Thread library We reported our work on the design of a thread model for C64 that maps directly to the architecture, assisted by a native thread runtime library called TNT (or TiNy Threads) [8]. In the development and debugging of the TNT library, FAST's capability to accurately simulate a large number of hardware threads in practical time has proven to be useful.

4.2.3. Spin lock For a thread library, it is important that all components are efficiently implemented. In multithreaded environments, especially for architectures like C64 with 150 threads on a chip, the spin lock, as an indirect synchronization mechanism, is known to be a key factor for scalability. For this reason, we conducted a study of spin lock algorithms on the C64 architecture. We implemented eight programs based on well known spin lock algorithms: three based on test-and-set, one on tickets, two on array queues, and two on list queues [15]. All programs consist of a short critical section (a single variable update) enclosed within calls to procedures that acquire and release a lock following the corresponding algorithm. The entire process (lock, critical section, unlock) is repeated a thousand times as part of a loop body. We ran the programs on FAST with memory contention enabled and measured the execution time as well as the overhead due to contention in the crossbar. As expected, the results show that list-based queuing locks are the most efficient algorithms, see Figure 5. On C64, contrary to most shared memory multiprocessors, array-based queuing lock methods do not perform well, because there is no data cache. In other words, accesses to the array queue are as expensive as any memory operation seen in test-and-set based implementations. Indeed, test-and-set based algorithms with linear and exponential backoff perform better. Not surprisingly, list queue locks generate the least amount of memory traffic on the crossbar, since threads spin locally on their own scratchpad memory, see Figure 6. That means they would interfere least with the normal execution of a program that had additional memory accesses. As a result of this experience, the implementation of mutexes in the TNT library is based on list queue locks.

[Figure 5: Execution time of the spinlock programs (in ms)]

[Figure 6: Execution delay of the spinlock programs (in ms)]

4.2.4. Communication library With the A-switch module, FAST can be used to simulate C64 multichip system configurations. With this feature, we developed a communication library implemented as several layers, each accessible through its own interface. At the lowest level, the packet transfer layer accesses the A-switch directly, hiding all hardware details from the layers above. A second layer, built on top of the packet transfer layer, provides user-level remote memory read and write, inter-chip synchronization primitives and remote procedure call mechanisms. Finally, we are in the process of porting the SHMEM library [17] to the C64 architecture based on the two previous layers.

4.3. Application development and evaluation

To demonstrate that FAST is functionally accurate, stable and hence useful for software development and performance estimation, we wrote several benchmark programs to confirm that the trends predicted by the simulator match what the C64 architecture is capable of.

4.3.1. GUPS Table Toy, also called the Random Access benchmark, is an important benchmark included in the HPC Challenge Benchmark Suite [1]. It uses a metric known as GUPS (Giga Updates Per Second) to evaluate the random access capabilities of the memory system. In the context of this experience, we use Table Toy to verify that FAST reflects the C64 memory system accurately.

The kernel operations of Table Toy can be summarized as follows:

1 tmp1 = stable[j]; (load)
2 tmp2 = table[i]; (load)
3 val = tmp2 xor tmp1; (xor)
4 table[i] = val; (store)

The indices i and j are the pseudo-random locations chosen for table and stable, respectively. Ideally, thread units should access different locations of table to avoid conflicts. The table can be placed either in the on-chip SRAM or the off-chip DRAM, and is accessed by all thread units. The substitution table (stable) is allocated in each thread's scratchpad memory. The key point is that the last three operations (load, xor, and store) must be atomic in a multithreaded execution.

Figure 7 shows the GUPS obtained on a C64 node with up to 150 threads running in parallel. By taking advantage of C64's xor_m in-memory atomic instruction (xor to memory), we guarantee the atomicity needed while all data dependences are removed from the kernel loop. Therefore, the number of memory updates is actually the number of xor_m instructions issued. If the updates are performed in SRAM, the curve scales well as the number of threads increases, due to the large on-chip memory bandwidth. On the other hand, the off-chip DRAM memory bandwidth is limited. Consequently, the DRAM curve flattens when the number of threads exceeds 16. In both cases, the maximum achievable memory bandwidth is not reached. It appears that the pseudo-random numbers generated in Table Toy result in several threads accessing a memory bank at the same time. Hence, the bandwidth limitation is not due to the crossbar network but to conflicts accessing the memory banks.

To prove our hypothesis, we wrote three separate microbenchmarks with a deterministic access pattern to the memory banks. In our first microbenchmark, New Toy 1, each thread issues 3 store operations every 8 cycles. In addition, each thread targets one SRAM bank only. Therefore, a processor issues 6 store operations every 8 cycles to the on-chip SRAM memory. This represents 75% of the peak throughput of the crossbar, which is indeed achieved because there are no conflicts, as the memory bank addressed by each thread unit is different, see Figure 8. The other microbenchmarks test the off-chip DRAM memory subsystem in different ways. In New Toy 2, each thread targets one of the 16 DRAM subbanks based on the thread identifier. Therefore, threads 0 and 16 access subbank 0, threads 1 and 17 access subbank 1, and so forth. A DRAM subbank can

Figure 7: GUPS on a C64 node (Table Toy)

Figure 9: MUPS on a C64 node (New Toy 2 and 3)

A DRAM subbank can only service one request every 32 cycles (15 MUPS), but all 16 subbanks can service requests in parallel. Figure 9 confirms the expected result. As long as no more than 16 threads are active, the DRAM throughput increases linearly, at a rate of 15 MUPS per thread, up to 250 MUPS. In New Toy 3, every thread executes 16 consecutive stores every 22 cycles and each store targets one of the DRAM subbanks. That means a thread can issue operations to memory faster than the memory can handle. Figure 9 shows that for a small number of active threads, contention can be tolerated, and the crossbar and DRAM memory system deliver the peak throughput, 250 MUPS. Finally, as contention increases, performance drops.

Figure 8: GUPS on a C64 node (New Toy 1)

4.3.2. Matrix-matrix-multiply As an example of what an application developer can expect to learn using the FAST toolset, we report here a tuning experience with the matrix-matrix-multiply program. Throughout this exercise we use the simulator's accurate time counter, the histogram files with the instruction mix and the execution trace to determine the cause of delays and/or bottlenecks that may prevent the program from achieving higher performance.

Our baseline is a straightforward sequential code with the matrices stored in DRAM. The program, compiled with -O3, achieves 16.7 MFLOPS. From the trace file, we found that the main reasons for the low performance are poor data reuse and the long latency to access DRAM. To improve the performance, we unroll the two outermost loops 4 times each, and manually prefetch data and reschedule the instructions with hints from the trace files generated by FAST. In the resulting code, data is fed to the floating-point unit in a pipelined manner such that all load latencies are hidden. This implementation achieves 216.1 MFLOPS, a speedup of 13 compared to the baseline version. We also parallelized our tuned MxM program to make use of multiple thread units. As shown in Figure 10, the curve for the parallel version scales almost linearly up to 32 threads and then flattens out because of the bandwidth limitation. Afterwards, it even drops when memory contention becomes too high. We believe higher performance can be achieved by employing other techniques; however, that is not the purpose of this experiment.

Figure 10: MFLOPS for the matrix-matrix-multiply with prefetching on a C64 node

4.3.3. Multi-chip benchmarks To verify the correctness (not the accuracy) of FAST's multichip simulation and the communication libraries, we developed an assorted set of multichip multithreaded benchmarks. It includes implementations of matrix-matrix-multiply, a 1D Laplace solver, heat conduction and Sobel edge detection.

5. Related work

To analyze and understand the impact of various architectural parameters and components, as well as to study application performance and gather detailed statistics, both academia and industry have developed a number of simulators. Simulation frameworks for microarchitecture research and design exploration, such as SimpleScalar [7, 3], Microlib [18], Liberty [23], RSIM [12] and Turandot [16], concentrate on accurately modeling the architecture design and are normally cycle accurate. FAST is a functional simulator, since cycle accurate simulation would be too slow for a system consisting of one or more C64 chips. There are also full-system simulators capable of running commercial workloads on an unmodified operating system, such as SimOS [20], Simics [13] and PearPC [4]. A C64 compute engine is attached to a host system, which provides a familiar computing environment to application developers. FAST only simulates the compute engine running a custom microkernel, whereas conventional OS services are provided by the native OS running on the host system.

Recently, a new generation of simulators capable of simulating SMT and CMP architectures has been developed: SMTSIM [22], SESC [19], GEMS [14], M5 [5] and Mambo [6]. It would appear that the latter simulation frameworks, as well as extensions of SimpleScalar, Simics, SimOS and Turandot, are normally used to simulate 2/4/8-way SMT/CMP processors under multiprogramming, thread-level speculation, and commercial workloads¹. FAST is designed to simulate and model a CMP system consisting of several C64 nodes, each with up to 150 processing cores. However, the C64 architecture is designed for the purpose of running massively parallel applications, which deal with the complexity of scientific and engineering multithreading workloads.

Probably the closest related work to FAST is the Cyclops-32 simulator. These simulators are as similar as the architectures they simulate. However, there are significant differences as well. For instance, FAST detects dependences and conflicts as instructions are executed; therefore, it directly produces performance estimates. On the other hand, the C32 simulator does not have timing information. Performance estimates are generated by two other performance tools (a cache simulator and a trace analyzer) that post-process the execution trace produced by the simulator [2].

¹ Based on papers published in HPCA from 2000 to 2005.

6. Summary

This paper presents FAST, a functionally accurate simulation toolset for the IBM Cyclops-64 architecture that is fast, flexible and efficient. To the best of our knowledge, it is the only simulation tool capable of simulating multichip multithreaded cellular architectures with reasonable accuracy and practical speed. We report some preliminary results and illustrate, through case studies, how the FAST toolchain accomplishes its purpose of architecture design verification as well as early system and application software development and testing.

As future work, we plan to increase the amount of profile information FAST produces, including text and data symbols, and to incorporate integer counters to facilitate the performance analysis of multithreaded programs.

Acknowledgments

We acknowledge support from IBM, in particular, Monty Denneau and Henry Warren. We thank ETI for support of this work. We also acknowledge our government sponsors. Finally, we also thank many CAPSL members for helpful discussions.

References

[1] HPC challenge benchmark. URL http://icl.cs.utk.edu/hpcc.
[2] G. S. Almási, C. Caşcaval, J. G. Castaños, M. Denneau, W. Donath, M. Eleftheriou, M. Giampapa, H. Ho, D. Lieber, J. E. Moreira, D. Newns, M. Snir, and H. S. Warren, Jr. Demonstrating the scalability of a molecular dynamics application on a petaflops computer. International Journal of Parallel Programming, 30(4):317–351, August 2002.
[3] T. Austin, E. Larson, and D. Ernst. SimpleScalar: An infrastructure for computer system modeling. Computer, 35(2):59–67, February 2002.
[4] S. Biallas. PearPC - PowerPC architecture emulator, May 2004. URL http://pearpc.sourceforge.net/.
[5] N. L. Binkert, E. G. Hallnor, and S. K. Reinhardt. Network-oriented full-system simulation using M5. In Sixth Workshop on Computer Architecture Evaluation using Commercial Workloads, pages 36–43, Anaheim, California, February 9, 2003. Held in conjunction with the 9th International Symposium on High-Performance Computer Architecture.
[6] P. Bohrer, J. Peterson, M. Elnozahy, R. Rajamony, A. Gheith, R. Rockhold, C. Lefurgy, H. Shafi, T. Nakra, R. Simpson, E. Speight, K. Sudeep, E. Van Hensbergen, and L. Zhang. Mambo: A full system simulator for the PowerPC architecture. ACM SIGMETRICS Performance Evaluation Review, 31(4):8–12, March 2004.
[7] D. C. Burger and T. M. Austin. The SimpleScalar tool set, version 2.0. Technical Report 1342, Computer Sciences Department, University of Wisconsin at Madison, Madison, Wisconsin, June 1997.
[8] J. del Cuvillo, W. Zhu, Z. Hu, and G. R. Gao. TiNy Threads: A thread virtual machine for the Cyclops64 cellular architecture. In Fifth Workshop on Massively Parallel Processing, page 265, Denver, Colorado, April 8, 2005. Held in conjunction with the 19th International Parallel and Distributed Processing Symposium.
[9] J. del Cuvillo, R. Klosiewicz, and Y. Zhang. A software development kit for DIMES. CAPSL Technical Note 10 Revised, Department of Electrical and Computer Engineering, University of Delaware, Newark, Delaware, March 2005.
[10] M. Denneau. Computing at the speed of life: The BlueGene/Cyclops supercomputer. CITI Distinguished Lecture Series, Rice University, Houston, Texas, September 25, 2005.
[11] M. Denneau. Personal communication, February 2005.
[12] C. J. Hughes, V. S. Pai, P. Ranganathan, and S. V. Adve. RSIM: Simulating shared-memory multiprocessors with ILP processors. Computer, 35(2):40–49, February 2002.
[13] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. Computer, 35(2):50–58, February 2002.
[14] M. M. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset. Submitted to Computer Architecture News.
[15] J. M. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems, 9(1):21–65, February 1991.
[16] J. Moreno, M. Moudgill, J. Wellman, P. Bose, and L. Trevillyan. Trace-driven performance exploration of a PowerPC 601 OLTP workload on wide superscalar processors. In First Workshop on Computer Architecture Evaluation using Commercial Workloads, Las Vegas, Nevada, February 1, 1998. Held in conjunction with the 4th International Symposium on High-Performance Computer Architecture.
[17] K. Parzyszek, J. Nieplocha, and R. A. Kendall. General portable SHMEM library for high performance computing. In Proceedings of SC2000: High Performance Networking and Computing, Dallas, Texas, November 4–10, 2000. URL http://www.supercomp.org/sc2000/proceedings/.
[18] D. G. Pérez, G. Mouchard, and O. Temam. Microlib: A case for the quantitative comparison of micro-architecture mechanisms. In Proceedings of the 37th Annual International Symposium on Microarchitecture, pages 43–54, Portland, Oregon, December 4–8, 2004.
[19] J. Renau. SESC, 2004. URL http://sesc.sourceforge.net.
[20] M. Rosenblum, E. Bugnion, S. Devine, and S. A. Herrod. Using the SimOS machine simulator to study complex computer systems. ACM Transactions on Modeling and Computer Simulation, 7(1):78–103, January 1997.
[21] H. Sakane, L. Yakay, V. Karna, C. Leung, and G. R. Gao. DIMES: An iterative emulation platform for multiprocessor-system-on-chip designs. In Proceedings of the IEEE International Conference on Field-Programmable Technology, pages 244–251, Tokyo, Japan, December 15–17, 2003.
[22] D. M. Tullsen. Simulation and modeling of a simultaneous multithreading processor. In Proceedings of the 22nd Annual Computer Measurement Group Conference, pages 384–393, San Diego, California, December 10–13, 1996.
[23] M. Vachharajani, N. Vachharajani, D. A. Penry, J. A. Blome, and D. I. August. Microarchitectural exploration with Liberty. In Proceedings of the 35th Annual International Symposium on Microarchitecture, pages 271–282, Istanbul, Turkey, November 18–22, 2002.