High-Bandwidth, Low Power Photonic On-Chip Networks Assaf Shacham Keren Bergman Luca P

Maximizing GFLOPS-per-Watt: High-Bandwidth, Low Power Photonic On-Chip Networks Assaf Shacham Keren Bergman Luca P. Carloni Columbia University, Columbia University, Columbia University, Dept. of Electrical Engineering Dept. of Electrical Engineering Dept. of Computer Science 500 W 120th St., 500 W 120th St., 1214 Amsterdam Ave. New York, NY 10027 New York, NY 10027 New York, NY 10027 +1 (212) 854 2768 +1 (212) 854 2280 +1 (212) 939 7043 [email protected] [email protected] [email protected] ABSTRACT This is limited, however, by the exponential As high-performance processors move towards multi- relationship between the increase in sub-threshold core architectures, packet-switched on-chip networks leakage current and the reduction of threshold voltage. are gaining wide acceptance as interconnect solutions Reductions in threshold voltage have made static that can directly address the bandwidth and latency leakage power large enough that it needs to be requirements as well as provide partial relief to the considered alongside dynamic transistor switching broader challenge of power dissipation. Still, studies activity in the overall power budget [24] (for chips show that the power consumed by on-chip networks designed with 65nm technologies, for example, will remain a major issue that has to be addressed to leakage power can account for up to 45% of the total enable a true leap in future multi-core processors power dissipation [6]). Consequently, in order to performance. Based on recent and expected minimize power, the threshold voltage is now set as technological advances in the integration of silicon the result of an optimization, and not by technology photonic elements with CMOS electronics, we consider scaling [19]. Meanwhile on-chip buses dissipate the usage of photonics to construct an on-chip increasingly higher power due to the rapid growth in network, offering unique advantages in terms of the number of repeaters and latches that are used to energy, bandwidth, and latency. We propose a novel buffer and pipeline long metal lines. Specifically, up to architecture for a photonic on-chip network based on 20% of the total power can be dissipated on long intra- a hybrid approach: a network of wideband photonic chip buses, which are also typically very wide with switches combined with a parallel electronic control multiple parallel lines (e.g., 32, 64 and even 128 bit network. A high-level power analysis and comparison wide in the recent IBM CELL[25]), each requiring with electronic on-chip networks show that some of the substantial drivers [30]. advantages that have made photonics ubiquitous in long-haul transmission systems can be leveraged to In summary, high-performance microprocessors that construct photonic on-chip networks, delivering keep existing architecture and circuit design unprecedented computational capabilities, while techniques are expected to exceed package power operating at a fraction of the power of their electronic limits by a factor of nearly 4X over the next decade; counterparts. alternatively, microprocessor logic content and/or logic activity would need to decrease proportionally to 1. INTRODUCTION match packaging constraints. Despite these measures, Improvements in the performance of microprocessors power densities are estimated to reach up to 2 have until recently been largely driven by the ~200W/cm by 2010 [20]. This power density exceeds miniaturization of transistors, rising clock frequencies, the air-cooling limit by a factor of 2, and is challenging and growing die sizes. Aggressive architectural for efficient liquid-cooling methods as well as for the solutions such as deep pipelines and complex cache required power supply. memory organizations have accompanied these advances. The convergence of these design factors, 1.1 The Rise of Multi-Core Architectures however, has also led to the detrimental trend of Perhaps not surprisingly, the quest for both high exponentially increasing on-chip power dissipation performance and low power has brought designers to a [19]. To counterbalance these effects designers have major paradigm shift in the microprocessor sought quadratic reductions in dynamic power architectures. The emerging trend is to replicate the dissipation through aggressive supply voltage scaling. computational logic in order to maintain processing throughput while lowering clock frequencies and 1.3 The Case for On-chip Networks supply voltages. In fact, two processing cores running In this scenario various researchers have proposed to at half the frequency and half the supply voltage will implement on-chip global communication with packet- 2 save approximately a factor of 4 in C•V •f dynamic switched micro-networks based on regular scalable capacitive power, versus the “equivalent” single core structures such as meshes or tori [4], [8], [15], [16], [20]. This trend was marked by the arrival of the first [26]. These so-called on-chip networks are made of commercial chips hosting multiple processing cores carefully-engineered links and represent a shared like the dual core multi-threaded IBM POWER 5 [23], medium that is able to provide enough bandwidth to the Intel ITANIUM 2 [31], and the CELL processor, a replace many traditional bus-based and/or point-to- joint effort of IBM, Toshiba and Sony, which features point links. Furthermore, since they have better scaling a Power processing element and eight co-processing properties, on-chip networks have the potential to cores [22]. It is reasonable to assume that the number mitigate the complexity of system-on-chip designs by of these cores will continue to grow, leading to various facilitating the assembling of pre-designed and pre- generations of chip multiprocessors (CMP). validated processing cores through the emergence of 1.2 The Impact of Global Wires new interface standards. However, as performance- per-watt is expected to remain the fundamental design Following the clear trend of CMPs, each technology metric for both embedded and high-performance generation will likely feature chips that host a larger computing for years to come, on-chip interconnection number of smaller processing cores. To achieve the networks will have to satisfy communication desired growth in computation power these cores, bandwidth and latency requirements with minimal which may be physically distant, must interact in a power dissipation. tightly coupled manner. Thus these new CMP architectures are leading a major design shift from “computation-bound” to “communication-bound” [7]: 1.4 The Photonics Opportunity the chip becomes a distributed system where global Leveraging the unique advantages of optical on-chip communication plays a dominant role during communication for on-chip photonic networks offers a the design process. potentially disruptive technology solution that can provide ultra-high throughput, minimal access Hence, another important technology trend, wire latencies, and low power dissipation that remains scaling, joins power dissipation in driving the design independent of capacity. Recent advances in nanoscale of high performance integrated circuits. While local silicon photonics have yielded improved control over interconnects scale in length approximately in device optical properties and unprecedented accordance with transistors, global wires do not fabrication capabilities for the integration photonic because they need to span across multiple modules to elements in commercial CMOS chip manufacturing connect distant gates [17]. Consequently, long-range processes. An optical interconnection network that can communication requires delays of multiple clock capitalize on the enormous capacity, transparency, and cycles as global wires must be heavily buffered and fundamentally low power consumption of silicon pipelined [34]. Traditional bus-based communication photonics could deliver performance per watt that is schemes, which are already “power hungry” [30], are simply not possible with all-electronic interconnects. destined to scale poorly if forced to support many IP Photonic channels could act as “on-chip super- cores [11]. In the case of general purpose highways” supporting large amounts of shared data microprocessors the evolution towards communication traffic across longer distances in a bandwidth-oriented bound design implies that the amount of states design of a network connecting processing cores and reachable in a clock cycle, and not the number of memories. On the other hand, electronic technology transistors that can be integrated, becomes the major can complement the photonic network in overcoming factor limiting the growth of instruction throughput. some of the limitations inherent to photonics, namely Furthermore, the increasing interconnect latency processing and buffering. particularly penalizes traditional memory-oriented Silicon photonic device technologies and integration microprocessor architectures that strongly rely on the have achieved unprecedented advances over the past assumption of low-latency communication with five years. High speed optical modulators, capable of structures such as caches, register files, and performing switching operations, have been realized rename/reorder tables [1]. using ring resonator structures [2, 37] and the free carrier plasma dispersion effect [28]. The integration of modulators, waveguides and photodetectors with CMOS integrated circuit for chip-to-chip communication has been reported and recently became 2.1 Building Blocks commercially available [13]. Finally, SiGe-based The

High-Bandwidth, Low Power Photonic On-Chip Networks Assaf Shacham Keren Bergman Luca P

Accelerating HPL Using the Intel Xeon Phi 7120P Coprocessors

USE CASE Requirements

Intel Cirrascale and Petrobras Case Study

Copyrighted Material

UNIT 8B a Full Adder

Power Measurement Tutorial for the Green500 List

Towards Better Performance Per Watt in Virtual Environments on Asymmetric Single-ISA Multi-Core Systems

Comparing the Power and Performance of Intel's SCC to State

NVIDIA Ampere GA102 GPU Architecture Whitepaper

With Extreme Scale Computing the Rules Have Changed

Theoretical Peak FLOPS Per Instruction Set on Modern Intel Cpus

Programming the Cell Broadband Engine Examples and Best Practices