Performance and Energy Efficient Network-On-Chip Architectures

Linköping Studies in Science and Technology Dissertation No. 1130 Performance and Energy Efficient Network-on-Chip Architectures Sriram R. Vangal Electronic Devices Department of Electrical Engineering Linköping University, SE-581 83 Linköping, Sweden Linköping 2007 ISBN 978-91-85895-91-5 ISSN 0345-7524 ii Performance and Energy Efficient Network-on-Chip Architectures Sriram R. Vangal ISBN 978-91-85895-91-5 Copyright Sriram. R. Vangal, 2007 Linköping Studies in Science and Technology Dissertation No. 1130 ISSN 0345-7524 Electronic Devices Department of Electrical Engineering Linköping University, SE-581 83 Linköping, Sweden Linköping 2007 Author email: [email protected] Cover Image A chip microphotograph of the industry’s first programmable 80-tile teraFLOPS processor, which is implemented in a 65-nm eight-metal CMOS technology. Printed by LiU-Tryck, Linköping University Linköping, Sweden, 2007 Abstract The scaling of MOS transistors into the nanometer regime opens the possibility for creating large Network-on-Chip (NoC) architectures containing hundreds of integrated processing elements with on-chip communication. NoC architectures, with structured on-chip networks are emerging as a scalable and modular solution to global communications within large systems-on-chip. NoCs mitigate the emerging wire-delay problem and addresses the need for substantial interconnect bandwidth by replacing today’s shared buses with packet-switched router networks. With on-chip communication consuming a significant portion of the chip power and area budgets, there is a compelling need for compact, low power routers. While applications dictate the choice of the compute core, the advent of multimedia applications, such as three-dimensional (3D) graphics and signal processing, places stronger demands for self-contained, low-latency floating-point processors with increased throughput. This work demonstrates that a computational fabric built using optimized building blocks can provide high levels of performance in an energy efficient manner. The thesis details an integrated 80-Tile NoC architecture implemented in a 65-nm process technology. The prototype is designed to deliver over 1.0TFLOPS of performance while dissipating less than 100W. This thesis first presents a six-port four-lane 57 GB/s non-blocking router core based on wormhole switching. The router features double-pumped crossbar channels and destination-aware channel drivers that dynamically configure based on the current packet destination. This enables 45% reduction in crossbar iii iv channel area, 23% overall router area, up to 3.8X reduction in peak channel power, and 7.2% improvement in average channel power. In a 150-nm six-metal CMOS process, the 12.2 mm2 router contains 1.9-million transistors and operates at 1 GHz at 1.2-V supply. We next describe a new pipelined single-precision floating-point multiply accumulator core (FPMAC) featuring a single-cycle accumulation loop using base 32 and internal carry-save arithmetic, with delayed addition techniques. A combination of algorithmic, logic and circuit techniques enable multiply- accumulate operations at speeds exceeding 3GHz, with single-cycle throughput. This approach reduces the latency of dependent FPMAC instructions and enables a sustained multiply-add result (2FLOPS) every cycle. The optimizations allow removal of the costly normalization step from the critical accumulation loop and conditionally powered down using dynamic sleep transistors on long accumulate operations, saving active and leakage power. In a 2 90-nm seven-metal dual-VT CMOS process, the 2 mm custom design contains 230-K transistors. Silicon achieves 6.2-GFLOPS of performance while dissipating 1.2 W at 3.1 GHz, 1.3-V supply. We finally present the industry's first single-chip programmable teraFLOPS processor. The NoC architecture contains 80 tiles arranged as an 8×10 2D array of floating-point cores and packet-switched routers, both designed to operate at 4 GHz. Each tile has two pipelined single-precision FPMAC units which feature a single-cycle accumulation loop for high throughput. The five-port router combines 100 GB/s of raw bandwidth with low fall-through latency under 1ns. The on-chip 2D mesh network provides a bisection bandwidth of 2 Tera-bits/s. The 15-FO4 design employs mesochronous clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias techniques. In a 65-nm eight-metal CMOS process, the 275 mm2 custom design contains 100-M transistors. The fully functional first silicon achieves over 1.0TFLOPS of performance on a range of benchmarks while dissipating 97 W at 4.27 GHz and 1.07-V supply. It is clear that realization of successful NoC designs require well balanced decisions at all levels: architecture, logic, circuit and physical design. Our results demonstrate that the NoC architecture successfully delivers on its promise of greater integration, high performance, good scalability and high energy efficiency. Preface This PhD thesis presents my research as of September 2007 at the Electronic Devices group, Department of Electrical Engineering, Linköping University, Sweden. I started working on building blocks critical to the success of NoC designs: first on crossbar routers followed by research into floating-point MAC cores. I finally integrate the blocks into a large monolithic 80-tile teraFLOP NoC multiprocessor. The following five publications are included in the thesis: • Paper 1 - S. Vangal, N. Borkar and A. Alvandpour, "A Six-Port 57GB/s Double-Pumped Non-blocking Router Core", 2005 Symposium on VLSI Circuits, Digest of Technical Papers, June 16-18, 2005, pp. 268–269. • Paper 2 - S. Vangal, Y. Hoskote, D. Somasekhar, V. Erraguntla, J. Howard, G. Ruhl, V. Veeramachaneni, D. Finan, S. Mathew and N. Borkar, “A 5 GHz floating point multiply- accumulator in 90 nm dual VT CMOS”, ISSCC Digest of Technical Papers, Feb. 2003, pp. 334–335. • Paper 3 - S. Vangal, Y. Hoskote, N. Borkar and A. Alvandpour, "A 6.2 GFLOPS Floating Point Multiply-Accumulator with Conditional Normalization”, IEEE Journal of Solid-State Circuits, Volume 41, Issue 10, Oct. 2006, pp. 2314–2323. • Paper 4 - S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote and N. Borkar, “An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS,” ISSCC Dig. Tech. Papers, pp. 98–99, Feb. 2007. v vi • Paper 5 - S. Vangal, A. Singh, J. Howard, S. Dighe, N. Borkar and A. Alvandpour, "A 5.1GHz 0.34mm2 Router for Network-on-Chip Applications ", 2007 Symposium on VLSI Circuits, Digest of Technical Papers, June 14-16, 2007, pp. 42–43. As a staff member of Microprocessor Technology Laboratory at Intel Corporation, Hillsboro, OR, USA, I am also involved in research work, with several publications not discussed as part of this thesis: • J. Tschanz, N. Kim, S. Dighe, J. Howard, G. Ruhl, S. Vangal, S. Narendra, Y. Hoskote, H. Wilson, C. Lam, M. Shuman, C. Tokunaga, D. Somasekhar, D. Finan, T. Karnik, N. Borkar, N. Kurd, V. De, “Adaptive Frequency and Biasing Techniques for Tolerance to Dynamic Temperature-Voltage Variations and Aging”, ISSCC Digest of Technical Papers, Feb. 2007, pp. 292–293. • S. Narendra, J. Tschanz, J. Hofsheier, B. Bloechel, S. Vangal, Y. Hoskote, S. Tang, D. Somasekhar, A. Keshavarzi, V. Erraguntla, G. Dermer, N. Borkar, S. Borkar, and V. De, “Ultra-low voltage circuits and processor in 180nm to 90nm technologies with a swapped-body biasing technique”, ISSCC Digest of Technical Papers, Feb. 2004, pp. 156–157. • Y. Hoskote, B. Bloechel, G. Dermer, V. Erraguntla, D. Finan, J. Howard, D. Klowden, S. Narendra, G. Ruhl, J. Tschanz, S. Vangal, V. Veeramachaneni, H. Wilson, J. Xu, and N. Borkar, ”A TCP offload accelerator for 10Gb/s Ethernet in 90-nm CMOS”, IEEE Journal of Solid- State Circuits, Volume 38, Issue 11, Nov. 2003, pp. 1866–1875. • Y. Hoskote, B. Bloechel, G. Dermer, V. Erraguntla, D. Finan, J. Howard, D. Klowden, S. Narendra, G. Ruhl, J. Tschanz, S. Vangal, V. Veeramachaneni, H. Wilson, J. Xu, and N. Borkar, “A 10GHz TCP offload accelerator for 10Gb/s Ethernet in 90nm dual-VT CMOS”, ISSCC Digest of Technical Papers, Feb 2003, pp. 258–259. • S. Vangal, M. Anders, N. Borkar, E. Seligman, V. Govindarajulu, V. Erraguntla, H. Wilson, A. Pangal, V. Veeramachaneni, J. Tschanz, Y. Ye, D. Somasekhar, B. Bloechel, G. Dermer, R. Krishnamurthy, K. Soumyanath, S. Mathew, S. Narendra, M. Stan, S. Thompson, V. De and S. Borkar, “5-GHz 32-bit integer execution core in 130-nm dual-VT CMOS”, IEEE Journal of Solid-State Circuits, Volume 37, Issue 11, Nov. 2002, pp. 1421- 1432. vii • T. Karnik, S. Vangal, V. Veeramachaneni, P. Hazucha, V. Erraguntla and S. Borkar, “Selective node engineering for chip-level soft error rate improvement [in CMOS]”, 2002 Symposium on VLSI Circuits, Digest of Technical Papers, June 2002, pp. 204-205. • S. Vangal, M. Anders, N. Borkar, E. Seligman, V. Govindarajulu, V. Erraguntla, H. Wilson, A. Pangal, V. Veeramachaneni, J. Tschanz, Y. Ye, D. Somasekhar, B. Bloechel, G. Dermer, R. Krishnamurthy, K. Soumyanath, S. Mathew, S. Narendra, M. Stan, S. Thompson, V. De and S. Borkar, “A 5GHz 32b integer-execution core in 130nm dual-VT CMOS”, ISSCC Digest of Technical Papers, Feb. 2002, pp. 412–413. • S. Narendra, M. Haycock, V. Govindarajulu, V. Erraguntla, H. Wilson, S. Vangal, A. Pangal, E. Seligman, R. Nair, A. Keshavarzi, B. Bloechel, G. Dermer, R. Mooney, N. Borkar, S. Borkar, and V. De, “1.1 V 1 GHz communications router with on-chip body bias in 150 nm CMOS”, ISSCC Digest of Technical Papers, Feb. 2002, Volume 1, pp. 270–466. • R. Nair, N. Borkar, C. Browning, G. Dermer, V. Erraguntla, V. Govindarajulu, A. Pangal, J. Prijic, L. Rankin, E. Seligman, S. Vangal and H. Wilson, “A 28.5 GB/s CMOS non-blocking router for terabit/s connectivity between multiple processors and peripheral I/O nodes”, ISSCC Digest of Technical Papers, Feb.

Performance and Energy Efficient Network-On-Chip Architectures

Lecture 5: Scaled MOSFETS for Ics

By Erika Azabache Villar a Thesis Submitted in Partial Fulfillment of The

Computer Architecture: Dataflow (Part I)

Multiprocessing Contents

Stratix III Programmable Power

Configurable Fine-Grain Protection for Multicore Processor Virtualization 1

CG-Ooo Energy-Efficient Coarse-Grain Out-Of-Order Execution

Design and Analysis of Integrated CMOS High-Voltage Drivers in Low-Voltage Technologies

WP Microelectronics and Interconnections

Parallel Computer Architecture III

Distributed Microarchitectural Protocols in the TRIPS Prototype Processor

An Evaluation of the TRIPS Computer System