A Study of the On-Chip Interconnection Network for the IBM Cyclops64 Multi-Core Architecture
Ying Ping Zhang, Taikyeong Jeong, Fei Chen, Haiping Wu, Ronny Nitzsche, Guang R. Gao
University of Delaware, Department of Electrical and Computer Engineering
Newark, Delaware 19716, U.S.A.
{yzhang, ttjeong, fchen, hwu, ggao}@capsl.udel.edu, [email protected]
1-4244-0054-6/06/$20.00 ©2006 IEEE

Abstract

The designs of high-performance processor architectures are moving toward the integration of a large number of processing cores on a single chip. The IBM Cyclops-64 (C64) is a petaflop supercomputer built on multi-core system-on-a-chip technology. Each C64 chip employs a multistage pipelined crossbar switch as its on-chip interconnection network to provide high-bandwidth, low-latency communication between the 160 thread processing cores, the on-chip SRAM memory banks, and other components.

In this paper, we present a study of the architecture and performance of the C64 on-chip interconnection network through simulation. Our experimental results provide observations on the network behavior: (1) Dedicated channels can be created between any output port and input port of the C64 crossbar with latency as low as 7 cycles. The C64 crossbar has the potential to reach the full hardware bandwidth and exhibits non-blocking behavior; (2) The C64 crossbar is a stable network; (3) The network logic design appears to provide a reasonable opportunity for sharing the channel bandwidth between traffic in either direction; (4) A simple circular neighbor arbitration scheme can achieve a performance level competitive with the complex segmented LRU (Least Recently Used) matrix arbitration scheme without losing fairness; (5) Application-driven benchmarks provide results comparable to synthetic workloads.

1 Introduction

The designs of high-performance processor architectures are moving toward the integration of a large number of processing cores on a single chip [14]. The performance scalability of such chips requires sophisticated interconnection network architectures, and their behavioral evaluation should begin in the logical design and verification stage. In this paper, a study of the architecture and performance of the interconnection network of the IBM Cyclops-64 (C64) chip is presented.

Figure 1. Cyclops-64 system architecture. [Hierarchy shown in the figure: processor (1 Gflops, 64KB SRAM); chip (80 Gflops, 5MB SRAM); board (320 Gflops, 4GB DRAM); rack (15.3 Tflops, 192GB); system (1.1 Pflops, 13.8TB).]

The C64 system (Figure 1) is a petaflop supercomputer built on multi-core system-on-a-chip (SoC) technology, based on a cellular architecture and expected to achieve over one petaflop peak performance. A maximum configuration of a C64 system consists of 13,824 C64 processing nodes (1 million processors) connected by a 3D-mesh network [9].

Each node is composed of a C64 chip, external DRAMs and a small number of external modules. A C64 chip consists of up to 80 custom-designed 64-bit processors (each consisting of two thread processing cores), 16 shared instruction caches (I-caches), 160 on-chip embedded SRAM memory banks and 80 floating point units (FP). It is interesting to note that there is no data cache on the chip. Instead, each SRAM bank on the chip can be configured into two levels: global interleaved memory banks (GM), which are uniformly addressable, and scratch pad memories (SP), which are local to individual processors [4].

Figure 2. Cyclops-64 node architecture. [Legend: SP: Scratch Memory; TU: Thread Unit; GM: Global Memory; FP: Floating Point; SP + GM = SRAM Bank; ATA: Advanced Technology Attachment. The figure shows the processors (TU, FP, SP) and GM banks attached to the crossbar network on the chip, together with the A-switch (3D-mesh), off-chip memory, Gigabit Ethernet and ATA HD connections on the board.]

The C64 chip configuration used in this study integrates 75 processors on a single chip. Each processor contains two thread units, one floating point unit and two 32KB SRAM memory banks. Groups of five processors share one I-cache. Figure 2 shows the structure of the C64 chip.

The interconnection network embedded in the center of each chip is a seven-stage pipelined crossbar switch with a large number of ports (96 x 96), which is much larger than any published design we are aware of for SoC architectures. One primary motivation for employing a crossbar switch in the C64 chip architecture was to provide a uniform memory access model that significantly simplifies programming inside such a large-scale multiprocessor chip. The ordering property of the crossbar was another valuable property of our design: we ensure there is a unique path from an input port to an output port, so packets are delivered in order.

The main challenges in the design of the on-chip interconnection network were the complexity of the network and the limitation of the die area. A common concern about crossbar switches is their quadratic cost in terms of chip area. In the C64 chip design, the bandwidth of the crossbar increases linearly with the area, and quadratically with the number of ports. Besides, the C64 chip employs a moderate clock speed of 533 MHz, which is quite slow by today's multi-GHz standards and leaves plenty of room to grow. Consequently, with sophisticated architecture and engineering design, the completed crossbar is only 1.6 mm high and a little under 17 mm wide (27 mm²). It turns out that the crossbar is just 6% of the die area (the whole chip is 21 x 22 = 462 mm²)!

To verify the performance of the C64 crossbar switch design in the logical design and verification stage, two simulators were designed. An empirical analysis of the fully pipelined C64 crossbar network was conducted. The performance analysis was done under fixed channel width, node size, and topology constraints. Different parameters, such as workload types, traffic patterns, injection rates and arbitration algorithms, were tested during the performance simulation.

From the performance analysis, we have observed the following network behaviors: (1) The C64 on-chip crossbar has the potential to reach the full hardware bandwidth and exhibits non-blocking behavior¹; (2) The C64 crossbar is a stable network²; (3) The network logic design provides a reasonable opportunity for sharing the channel bandwidth between traffic in either direction; (4) Simple arbitration schemes can achieve a performance level competitive with other, more complicated schemes; (5) Application-driven benchmarks provide results comparable to synthetic workloads [18].

¹ Any two free ports can be connected, regardless of the settings in the switch.
² The throughput does not degrade beyond the saturation point [5].

The experimental results and performance analysis have helped the architects verify the performance of the C64 chip interconnection architecture and led to the choice of a simple circular neighbor arbitration scheme, instead of a complex segmented matrix arbitration scheme, for the final design. This topic will be explored in Sections 3 and 5 of this paper.

2 Background

A C64 chip consists of many simple, general-purpose RISC-style processor cores, shared I-caches, and multiple banks of embedded memory connected via a high-performance on-chip crossbar switch. A C64 system composed of thousands of C64 processors can achieve over one petaflop peak performance [8].

Verification and testing form a significant portion of the design cycle for chip performance analysis. For a pipelined crossbar network, verification should be conducted for two reasons: finding unforeseen phenomena that may happen in the interconnection (such as deadlocks and logic errors), and verifying the performance of the on-chip interface architecture [4].

In this paper, we are interested in the following questions regarding the C64 crossbar switch architecture:

• Will the C64 crossbar switch deliver the full pipelined bandwidth if the communication traffic does not encounter network contention (i.e. non-blocking)?

• Can the C64 crossbar switch maintain its throughput when the network reaches its saturation (i.e. stable)?

• Can the C64 virtual channel mechanism be effectively exploited by sharing between the forward and backward traffic?

• Can simple hardware arbitration mechanisms be used to achieve reasonable performance gains?

The rest of this paper will provide answers to these questions.

3 Architecture of the C64 Crossbar

3.1 An Overview of the C64 Crossbar

[...] incoming interface unit (TUnitA) and an outgoing interface unit (TUnitB). Each port owns a source control unit (SrcCtl), a target control unit (TarCtl), a 96-to-1 multiplexer (Mux96), a data FIFO (FifoD7), and some registers (not shown in this figure). In fact, the FifoD7 and its corresponding Mux96 in each port are combined into a C64 crossbar core providing a data path for the crossbar (see Figure 4). The FifoD7 employs two groups of data buffers to store data for two virtual channels individually (not shown in the figure). A buffer group has 7 buffers with 92 bits each.

The packets delivered to the destinations through the 7 pipelined stages of the crossbar have a fixed length of 95 bits. In principle, the minimum latency of the crossbar is 7 cycles, assuming one cycle for each stage. The full hardware bandwidth of the crossbar is 96 x 95 = 9120 bits/cycle³. The packets are routed through the C64 crossbar by the SrcCtl of the source port and the TarCtl of the destination port. Flow control of the crossbar is implemented using a token protocol.
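
As a quick check of the figures quoted above, the following minimal sketch recomputes the peak crossbar bandwidth, the minimum latency and the crossbar's share of the die area. It rests on two assumptions that the text does not state explicitly: that the crossbar runs at the 533 MHz chip clock, and that every one of the 96 ports accepts one 95-bit packet per cycle. The constant and function names are ours and are purely illustrative.

```python
# Figures quoted in the text.
PORTS = 96                 # crossbar is 96 x 96
PACKET_BITS = 95           # fixed packet length
PIPELINE_STAGES = 7        # one cycle per stage
CLOCK_HZ = 533_000_000     # ASSUMPTION: crossbar runs at the 533 MHz chip clock
CROSSBAR_AREA_MM2 = 27.0   # quoted crossbar area
DIE_AREA_MM2 = 21 * 22     # quoted die dimensions, 462 mm^2


def peak_bits_per_cycle(ports=PORTS, packet_bits=PACKET_BITS):
    """Peak bandwidth in bits per cycle: one packet per port per cycle (assumed)."""
    return ports * packet_bits


def peak_bytes_per_second(clock_hz=CLOCK_HZ):
    """Peak bandwidth in bytes per second at the assumed clock rate."""
    return peak_bits_per_cycle() * clock_hz / 8


def min_latency_ns(stages=PIPELINE_STAGES, clock_hz=CLOCK_HZ):
    """Minimum crossbar latency in nanoseconds, one cycle per pipeline stage."""
    return stages / clock_hz * 1e9


def crossbar_die_fraction(area=CROSSBAR_AREA_MM2, die=DIE_AREA_MM2):
    """Fraction of the die occupied by the crossbar."""
    return area / die


if __name__ == "__main__":
    print(f"peak bandwidth : {peak_bits_per_cycle()} bits/cycle")
    print(f"peak bandwidth : {peak_bytes_per_second() / 1e9:.2f} GB/s")
    print(f"min latency    : {min_latency_ns():.2f} ns")
    print(f"die fraction   : {crossbar_die_fraction():.1%}")
```

Under these assumptions the peak works out to 9120 bits/cycle, or roughly 608 GB/s, the minimum latency to about 13 ns, and the crossbar's share of the die to just under 6%, in line with the numbers given in the text.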