Exploring a Multithreaded Methodology to Implement a Network Communication Protocol on the Cyclops-64 Multithreaded Architecture

Ge Gan, Ziang Hu, Juan del Cuvillo, Guang R. Gao
University of Delaware
Dept. of Electrical and Computer Engineering
Newark, DE 19716 USA
{gan,hu,jcuvillo,ggao}@capsl.udel.edu

1-4244-0910-1/07/$20.00 © 2007 IEEE.

Abstract

The IBM Cyclops-64 (C64) chip employs a multithreaded architecture that integrates a large number of hardware thread units on a single chip. A cellular supercomputer is being developed based on a 3D-mesh connection of the C64 chips. This paper introduces the Cyclops Datagram Protocol (CDP) developed for the C64 supercomputer system. CDP is inspired by the TCP/IP protocol, yet simpler and more compact. The implementation of CDP leverages the abundant hardware thread-level parallelism provided by the C64 multithreaded architecture.

The main contributions of this paper are: (1) We have completed a design and implementation of CDP that is used as the fundamental communication infrastructure for the C64 supercomputer system. (2) CDP successfully exploits the massive thread-level parallelism provided on the C64 hardware, achieving good performance scalability. (3) CDP is quite efficient. Its peak throughput reaches 884Mbps on Gigabit Ethernet, even though it runs at user level on a single-processor Linux machine. (4) Extensive application test cases have passed and no reliability problems have been reported.

1 Introduction

Cyclops-64 (C64) is a multithreaded architecture developed at the IBM T.J. Watson research center [4]. It is the latest version of the Cyclops cellular architecture that employs a unique multiprocessor-on-a-chip design [1] that integrates a large number of thread execution units, main memory banks, and communication hardware on a single chip. See Figure 1. The C64 chip, together with the host control logic and the off-chip memory, becomes the building block (i.e. the C64 node) of the C64 supercomputer system. See Figure 2. The C64 supercomputer system consists of tens of thousands of C64 nodes that are connected by the 3D-mesh network and the Gigabit Ethernet and can provide computing power at the Petaflops level.

[Figure 1. Cyclops-64 Node. Legend: TU: Thread Unit; SP: Scratchpad Memory; FPU: Floating Point Unit; GM: Global Memory. Other labels: C64 Chip, icache-5P, 96-Port Crossbar Switch, DDR2 Controller, 1G Offchip DRAM, 3D-mesh A-Switch, Gigabit Ethernet, Host Interface, IDE, FPGA, Control Network.]

To interconnect the two different subnetworks in the C64 supercomputer, we have designed the Cyclops Datagram Protocol (CDP). CDP is a projection of the conventional network communication protocol (TCP/IP) to the modern C64 multithreaded architecture. It is a datagram-based, connection-oriented communication protocol that supports reliable and full-duplex data transfer. We have implemented the very popular BSD socket API in CDP. This provides a user-friendly programming environment for the C64 system or application programmers.

We have implemented the CDP protocol on the C64 thread virtual machine (TVM) [3]. The C64 thread virtual machine is a lightweight runtime system developed for the C64 chip. It provides the mechanism to map software threads directly onto the C64 hardware thread units. It also provides a familiar and efficient programming interface for the C64 system programmers. Currently the C64 hardware is still under development, so the C64 thread virtual machine is running on the C64 FAST simulator [2].

[Figure 2. Cyclops-64 Supercomputer. Labels: 3D-mesh, Gigabit Ethernet, I/O node, compute node, host node.]

We have explored a multithreaded methodology in the development of the CDP protocol. A fine-grain thread library called the TiNy Thread (TNT) library [3] is used to implement the CDP protocol. The TNT thread library is part of the C64 Thread Virtual Machine and implements the C64 fine-grain thread model [3].

We have evaluated the performance of CDP through micro-benchmarking. From the experimental results, we have two observations: (1) The multithreaded methodology used in the implementation of CDP is very successful. It effectively exploits the massive thread-level parallelism provided on the C64 hardware and achieves good performance scalability. The speedup of a CDP test program reaches 82.55 with 128 receiving threads. (2) As a communication protocol, CDP is efficient. The peak throughput of the user-level CDP (implemented with Pthreads) is 884Mbps on Gigabit Ethernet.

In the next section, we will first introduce the necessary background of the C64 architecture. Then, we will formulate our problem and give a brief introduction to the solution.

2 Problem Formulation and Solution

A C64 chip has 80 "processors", which are connected to a 96-port crossbar network. See Figure 1. Each processor consists of two thread units, one floating point unit, and two SRAM memory banks (32KB each). A thread unit is a 64-bit, single-issue, in-order RISC core operating at a clock rate of 500MHz. The execution on the thread unit is not preemptable. A 32KB instruction cache is shared among five processors. The chip has no cache for data. Instead, a portion of the SRAM memory bank can be configured as scratchpad memory (SP), which is a fast temporary storage that can be used to exploit locality under software control. The remaining part of the SRAM forms the global memory (GM) and is uniformly addressable from all thread units. The C64 chip does not support virtual memory.

The A-switch interface of the chip connects the C64 node to its six neighbors in the 3D-mesh network. In every CPU cycle, the A-switch can transfer one double word (8 bytes) in one direction. The 3D-mesh may scale up to tens of thousands of nodes, forming a powerful parallel computing engine. The 3D-mesh computing engine is attached to the host system via Gigabit Ethernet and becomes the C64 supercomputer. See Figure 2. The whole C64 system is designed to provide computing power at the Petaflops level. It is targeted at applications that are highly parallelizable and require an enormous amount of computation.

Given the C64 multithreaded architecture and the C64 supercomputer organization, we are interested in two questions regarding the design and implementation of CDP:

• Is it possible to implement CDP in a way such that it can utilize the massive thread-level parallelism on the C64 hardware and achieve good performance scalability?

• Is the communication protocol we developed for the C64 architecture an efficient one?

In order to answer these questions, we came up with a multithreaded solution, shown in Figure 3. Briefly, the CDP program consists of a set of TNT threads [3]: the receiving threads, the timer thread, and the user threads. These threads cooperate with each other to implement the full functions of the CDP protocol. A fine-grain lock algorithm is proposed to improve the parallelism among these threads. Section 4 will give a detailed description of the CDP implementation.

[Figure 3. CDP Multithreaded Implementation. Two CDP nodes, each running user threads, cdp receiving threads, a cdp timer thread, and cdplib, connected by the 3D-mesh network / Gigabit Ethernet.]

The rest of the paper is organized as follows. Section 3 briefly introduces the CDP communication protocol. Section 4 discusses the multithreaded implementation of CDP. Section 5 presents the experimental results and analysis. Section 6 introduces some related work. Section 7 is our conclusion. Section 8 briefly discusses our future work.

3 CDP Protocol

CDP is inspired by TCP/IP, yet simpler and more compact. See Figure 5. Such a design is based on the consideration that both the C64 architecture and the network topology of the C64 supercomputer are simple. We will briefly introduce the CDP protocol in this section and discuss the protocol implementation problems in the next section.

[Figure 5. CDP Packet Header Format. Fields as listed in the figure:
IP Header: ver, IHL, ToS, Total Length, identification, fragment offset, TTL, protocol, Header Checksum, Source Address, Destination Address, Options (optional).
TCP Header: source port, destination port, sequence number, acknowledgment number, flags, window, Checksum, Urgent Pointer, Options (optional).
CDP Header: destination node, destination port, source node, source port, sequence number, acknowledgment number, flags, total length.]

3.1 Overview

Figure 4 shows the position of CDP in the protocol stack. According to the OSI reference model, CDP corresponds to the Transport layer plus the Network layer. This implies that CDP should implement the main functions (or at least some) of these two layers that are specified in the OSI reference model. Here are the main features of CDP: (1) CDP is a datagram-based, connection-oriented communication protocol; (2) it is reliable and supports timeout retransmission; (3) it uses a sliding-window based flow control mechanism to avoid network traffic congestion; (4) it provides a full-duplex service to the application layer; (5) it implements the very familiar BSD Socket programming interfaces for the CDP program developers.

As for the CDP connection, the finite state automaton used to direct the connection state transition is shown in Figure 6. This finite state automaton is similar to the one used in

[Figure 6 (partial). Connection state diagram. Recoverable labels: starting point; CLOSED; server: cdp_listen(); client: cdp_connect(); LISTEN; send: SYN; recv: SYN; passive open; send: SYN,ACK.]
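The CDP header fields named in Figure 5 can be sketched as a fixed-size record that is packed into network byte order. This is only an illustration: the excerpt gives the field names and their order but not their widths, so the 16-bit and 32-bit sizes below, the flag bit values, and every identifier (CdpHeader, pack_header, unpack_header, FLAG_SYN, FLAG_ACK) are assumptions rather than the authors' implementation.

```python
import struct
from typing import NamedTuple

# Hypothetical wire layout: the paper names the fields but not their widths.
# Assumed here: 16-bit node, port, flags and length fields, 32-bit sequence
# and acknowledgment numbers, big-endian (network) byte order, no padding.
CDP_HEADER = struct.Struct("!HHHHIIHH")

FLAG_SYN = 0x1  # hypothetical bit assignment
FLAG_ACK = 0x2  # hypothetical bit assignment


class CdpHeader(NamedTuple):
    """Header fields in the order Figure 5 lists them."""
    dst_node: int
    dst_port: int
    src_node: int
    src_port: int
    seq: int
    ack: int
    flags: int
    total_length: int


def pack_header(h: CdpHeader) -> bytes:
    """Serialize the header into its assumed on-the-wire form."""
    return CDP_HEADER.pack(*h)


def unpack_header(data: bytes) -> CdpHeader:
    """Parse the leading header bytes of a datagram; any payload is ignored."""
    return CdpHeader(*CDP_HEADER.unpack_from(data))


# A connection request as a client might send it (values are made up).
syn = CdpHeader(dst_node=7, dst_port=80, src_node=3, src_port=1024,
                seq=1000, ack=0, flags=FLAG_SYN,
                total_length=CDP_HEADER.size)
wire = pack_header(syn)
```

Under these assumed widths the header occupies 20 bytes. Addressing a peer by a (node, port) pair in a single header is what lets one record stand in for the separate IP and TCP headers shown alongside it in Figure 5.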
