Optics for Disaggregating Datacenters and Disintegrating Computing

N. Terzenidis, M. Moralis-Pegios, S. Pitris, G. Mourgias-Alexandris, A. Tsakyridis, C. Vagionas, K. Vyrsokinos, T. Alexoudi, C. Mitsolidou, N. Pleros

Wireless and Photonics Systems and Networks (WinPhos) Lab, Dept. of Informatics, Aristotle Univ. of Thessaloniki, Center for Interdisciplinary Research & Innovation (CIRI), Greece

The energy problem: World's No. 1 HPC

No. 1: IBM Summit (top500.org Nov. 2018 list)

 200 Pflops (20% of the Exascale target)
 10 MW (50% of the 20 MW limit)

Energy in datacenters (% of global power):
 2014: 1.6%
 2016: >3% (420 TWh)
 2025: ~20% (projected)

Nikos Pleros

The energy efficiency problem

Intel Whitepaper, 2015

 High diversity of workloads with different resource requirements
 Up to 50% of resources underutilized, impacting cost and energy!

The way out

 Energy & resource utilization gains through:
 Technology: Photonics
 Architecture: Disaggregation

Challenges across the hierarchy

 Rack level: >2 μsec latency in high-port switches; non-modular settings
 Board level: <8-node connectivity in p2p QPI-based settings; high latency in non-p2p, switch-based settings
 Chip level: >40% on-die caches; energy increases with capacity; energy dominated by memory access

Our work

T. Alexoudi et al., "Optics in Computing: From Photonic Network-on-Chip to Chip-to-Chip Interconnects and Disintegrated Architectures," JLT 2019

 Disaggregate at rack level… down to disintegration at chip level

 Rack: Hipoλaos Optical Packet Switch: 1024×1024 ports, 10 Tb/s capacity scalable to >40 Tb/s, sub-μsec latency
 Board: on-board disaggregation: >8 sockets in p2p; low-energy, low-latency with Si-Pho technology
 Chip: Optical Static RAM (1.48 cm test structures with 18 SOAs, 44 I/Os): disintegrate via off-chip high-speed optical caches

The Hipoλaos Optical Packet Switch

 1024×1024-port
 Multicasting
 Si-integration?

N. Terzenidis et al., ECOC, Sept. 2018; N. Terzenidis et al., IEEE PTL, pp. 1535-1538, Sept. 2018; M. Moralis-Pegios et al., IEEE PTL, pp. 712-715, April 2018

 N. Terzenidis et al., ONDM 2019: performance demonstration in a 256-node network

Disaggregate at board level

Beating QPI: a p2p board-level optical interconnect for >8 sockets

The inner anatomy: board level

Switch-based vs direct p2p in high-end servers (CISC processors, e.g. Intel® Xeon® E5-4600):

 Switch-based (PCIe, RapidIO links): unlimited connectivity, but increased latency, power, cost, and size
 Direct p2p (Intel QPI, AMD HyperTransport, IBM POWER8 SMP links, Oracle SPARC coherence links, NVLink): low latency and low power, but limited connectivity

QPI: the dominant p2p link (102.4 Gb/s)

4-socket "glueless" architecture:
 4 sockets p2p connected
 9.6 Gb/s line rate
 38.4 GB/s BW (307.2 Gb/s, or 102.4 Gb/s per direction)
 60 ns-240 ns latency
 16 pJ/bit TxRx power efficiency

8-socket "glueless" architecture:
 up to 2 hops
 increased latency

Intel Xeon E7-8800 v4: first Intel processor to support up to 8 sockets (released June 2016)
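As a sanity check, the quoted link bandwidth follows from the usual full-width QPI arithmetic. The sketch below assumes 2 bytes of payload per transfer per direction (the per-direction figure quoted on the slide may use a different lane assumption):

```python
# Sanity-check of the QPI bandwidth and power figures quoted above.
# Assumption: full-width QPI moves 2 bytes per transfer per direction.
transfer_rate_GT = 9.6                     # GT/s line rate
per_direction_GBps = transfer_rate_GT * 2  # 19.2 GB/s each way
link_GBps = 2 * per_direction_GBps         # 38.4 GB/s bidirectional
link_Gbps = link_GBps * 8                  # 307.2 Gb/s

# At the quoted 16 pJ/bit TxRx efficiency, a fully loaded link draws:
link_power_W = 16e-12 * link_Gbps * 1e9    # ~4.9 W per link
print(link_GBps, link_Gbps, round(link_power_W, 2))
```

The ~5 W per fully loaded electrical link is the baseline the optical link targets below.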

Going beyond 8 sockets? Bixby (BX) switches

January 2017

 Price/performance decreases drastically as the number of processors increases
 Latency increases
 65% of QPI bandwidth wasted on cache-coherency updates

The ICT-STREAMS O-band technology

www.ict-streams.eu

 16×50 Gb/s WDM TxRx SiP, O-band
 16×16 AWGR, O-band
 1300 nm SM EOPCB with 50G RF

The ICT-STREAMS p2MP architecture

 Strictly non-blocking
 High-bandwidth optical links
 All-to-all connectivity
 No O/E/O conversions
 Suitable for broadcasting/multicasting
 Passive (no power consumption)
 High number of ports (up to 32)
 Time-of-flight latency
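The passive all-to-all property comes from the AWGR's cyclic wavelength routing. A minimal behavioral sketch, assuming the common (input + channel) mod N port mapping (illustrative, not the ICT-STREAMS device layout):

```python
# Cyclic wavelength routing of an NxN AWGR: light entering input port i
# on wavelength channel k exits output port (i + k) mod N. Choosing the
# wavelength thus selects the destination, with no switching and no O/E/O.
N = 16  # port count; the architecture scales up to 32

def awgr_output(inp, channel, n=N):
    return (inp + channel) % n

# Each input reaches every output on exactly one wavelength channel,
# and no two inputs collide on the same output for a given channel.
for k in range(N):
    assert sorted(awgr_output(i, k) for i in range(N)) == list(range(N))
for i in range(N):
    assert {awgr_output(i, k) for k in range(N)} == set(range(N))
```

Because the routing is fixed by wavelength, the fabric itself consumes no power and adds only time-of-flight latency, as listed above.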

STREAMS vs QPI

Metric             QPI          STREAMS       Gain
Sockets/nodes      up to 8      >16           ×2
#hops              up to 2      1             ÷2
Line rate          9.6 Gb/s     50 Gb/s       ×5.2
Link BW            307.2 Gb/s   16×50 Gb/s    ×2.6
Throughput         2.45 Tb/s    12.8 Tb/s     ×5.2
Power efficiency   16 pJ/bit    <3.5 pJ/bit   ÷4
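The gain column follows from simple ratios of the two technology columns; a quick recomputation (the power gain uses the 3.5 pJ/bit bound, so the real reduction is at least the printed factor):

```python
# Recompute the STREAMS-vs-QPI gain factors from the raw table values.
qpi = {"line_rate_Gbps": 9.6, "link_bw_Gbps": 307.2,
       "throughput_Tbps": 2.45, "energy_pj_bit": 16.0}
streams = {"line_rate_Gbps": 50.0, "link_bw_Gbps": 16 * 50.0,
           "throughput_Tbps": 12.8, "energy_pj_bit": 3.5}

gains = {k: streams[k] / qpi[k]
         for k in ("line_rate_Gbps", "link_bw_Gbps", "throughput_Tbps")}
# Energy is a reduction, so invert the ratio:
gains["energy_pj_bit"] = qpi["energy_pj_bit"] / streams["energy_pj_bit"]

for metric, g in gains.items():
    print(f"{metric}: x{g:.1f}")
```

The energy ratio comes out as ×4.6; the table quotes it conservatively as ÷4.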

The on-board routing platform

8×8 O-band Si-AWGR

S. Pitris et al., Opt. Express 26(5), 2018

Si-based 8×8 O-band AWGR technology

S. Pitris et al., OFC 2018; S. Pitris et al., OpEx 2018

 10 nm channel spacing
 Channel crosstalk: 11 dB
 3-dB bandwidth: 5.7 nm
 Non-uniformity: 3.5 dB
 Insertion losses: 2.5 dB

Si-AWGR on board

[Photo: Si-AWGR chip on an 18 mm × 30 mm EOPCB]

 Si-AWGR onto an EOPCB for on-board routing
 Testing currently ongoing

Multi-socket routing @ 40 Gb/s

 40 Gb/s O-band ring modulator (RM)
 8×8 O-band Si-AWGR
 40G PD & BiCMOS TIA

S. Pitris et al., IEEE Photon. J. 2018; S. Pitris et al., Opt. Express 2018

8×40 Gb/s multi-socket Tx/Rx/routing

S. Pitris et al., IEEE Photon. J., Oct. 2018

 RM: 1.7 pJ/bit @ 40 Gb/s; TIA: 4 pJ/bit @ 40 Gb/s
 5.7 pJ/bit for the full Tx/Rx link (incl. thermal tuning)

The WDM transceiver engine

4ch Si-pho WDM TxRx

S. Pitris et al., OFC 2019

4×40 Gb/s O-band Si WDM transmitter

S. Pitris et al., OFC 2019

Mask Layout

[Eye diagrams, 2 mV, 10 ps/div]
 Tx1: ER = 4.4 dB (λ1 = 1310.25 nm)
 Tx2: ER = 4.1 dB (λ2 = 1303.23 nm)
 Tx3: ER = 4 dB (λ3 = 1317.28 nm)
 Tx4: ER = 4.2 dB (λ4 = 1324.59 nm)

4×50 Gb/s on-board WDM transmitter

Tx side: 4-channel 55 nm BiCMOS driver; high-speed Tx RF traces and DC traces on HF board
 Tx out 1 @ 50 Gb/s NRZ (λ = 1314.9 nm): ER = 4.3 dB (4 mV, 10 ps/div)
 RM driving voltage = 1.9 Vpp
 EE = 1.525 pJ/bit/RM + 0.8 pJ/bit (tuning)

Rx side: 4-channel 55 nm BiCMOS TIA; high-speed Rx RF traces
 Rx TIA out 1 @ 30 Gb/s NRZ (λ = 1318.4 nm): TIA output voltage = 120 mVpp (10 mV, 20 ps/div)
 P_TIA = 130 mW/ch

 All 4 channels capable of >50G operation

1-Volt 50 Gb/s × 52 km transmission

FDSOI CMOS driver: H. Ramon et al., IEEE PTL 30, 2018
Micro-ring modulator: M. Moralis-Pegios et al., OECC/PSC 2019

 1 V-driven 50 Gb/s NRZ operation and transmission over 52 km of SMF
 EE = 0.8 pJ/bit @ 50 Gb/s
 Record-high B×L = 2600 Gb/s·km
 BER measurements @ 40 Gb/s: power penalty = 0.2 dB @ 10^-9
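The bitrate-distance product and the transmitter power follow directly from these numbers:

```python
# Bitrate-distance product and Tx power for the 1 V micro-ring demo.
bitrate_Gbps = 50          # NRZ line rate
distance_km = 52           # single-mode fiber reach
bxl = bitrate_Gbps * distance_km
print(bxl)                 # 2600 Gb/s·km

# 0.8 pJ/bit at 50 Gb/s corresponds to only:
power_mW = 0.8e-12 * bitrate_Gbps * 1e9 * 1e3
print(round(power_mW, 1))  # 40.0 mW
```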

The energy-latency gain

Device              Ref.        Power consumption   EE @ 40 Gb/s   EE @ 50 Gb/s
LD                  [1]         6.1 dBm             1 pJ/bit       0.8 pJ/bit
RM driver + tuning  this work   40 mW + 40 mW       2 pJ/bit       1.6 pJ/bit
PD + TIA            this work   130 mW              3.25 pJ/bit    2.6 pJ/bit
SerDes              [2]         11.6 mW             0.29 pJ/bit    0.232 pJ/bit
Total                                               6.54 pJ/bit    5.23 pJ/bit
Reduction vs QPI                                    ~60%           ~68%

[1] S. Pitris et al., PJ 2018; [2] V. Stojanovic, OpEx 2018

 ~68% energy savings compared to QPI
 with >4-node direct connectivity!
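The totals in the table are just power/bitrate sums over the link components; a quick cross-check against the QPI baseline of 16 pJ/bit:

```python
# Cross-check of the per-bit link energy budget at 40 and 50 Gb/s.
def pj_per_bit(power_mW, rate_Gbps):
    # mW divided by Gb/s is numerically pJ/bit (1e-3 W / 1e9 b/s = 1e-12 J/b).
    return power_mW / rate_Gbps

def link_total(rate_Gbps, laser_pj):
    rm = pj_per_bit(40 + 40, rate_Gbps)     # RM driver + thermal tuning
    tia = pj_per_bit(130, rate_Gbps)        # PD + TIA
    serdes = pj_per_bit(11.6, rate_Gbps)
    return laser_pj + rm + tia + serdes     # laser quoted directly in pJ/bit

total_40 = link_total(40, 1.0)   # 6.54 pJ/bit
total_50 = link_total(50, 0.8)   # 5.23 pJ/bit
qpi = 16.0
print(round(total_40, 2), round(total_50, 2))
# Savings vs QPI, approximately the ~60% / ~68% quoted above:
print(round(100 * (1 - total_40 / qpi)), round(100 * (1 - total_50 / qpi)))
```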

Disintegrate at chip level

Optical RAMs: disintegrate via off-chip optical caching

Why disintegrate?

Yigit Demir et al., "Galaxy: A High-Performance Energy-Efficient Multi-Chip Architecture Using Photonic Interconnects," ICS 2014; Pranay Koka et al. (Oracle), "Silicon-Photonic Network Architectures for Scalable, Power-Efficient Multi-Chip Systems," ISCA 2010

 Smaller dies (chiplets) interconnected through optics
 Relieves area & power constraints, easing scaling to many-core
 >2× speed-up in 512-core (Oracle) and 4k-core (Galaxy) settings

On-chip caches and memory bandwidth

[Figure: Evolution of on-die caches. Sources: S. Borkar and A. A. Chien, Comm. ACM, 2011; Intel]

 Memory speed-up mainly through hierarchical cache levels
 Large and complex two- or three-level cache hierarchies
 Up to 40% of chip real estate consumed by caches and crossbar
 Still <20 GB/s memory bandwidth

Our goal

[Diagram: today's CMP with on-die caches vs the target: cores and p-NoC only, with o/e interfaces to an off-chip optical cache]

 Modular and granular setting with cache-free cores and a unified, shared pool of cache resources
 Needs an ultra-fast optical cache for time-sharing between lower-clock cores

10 Gbps Optical RAM

A. Tsakyridis et al, CLEO 2019 First to exceed electronics 2x the Speed: 10Gbps

Up to 5GHz RAM speed Fail to exceed electronics

Give up access times for lower footprint/power

10 Gbps Optical RAM

 10 Gb/s monolithic InP flip-flop
 Packaged optical memory
 BER penalties: Write 6.2 dB, Read 0.4 dB

A. Tsakyridis et al., CLEO 2019; A. Tsakyridis et al., OSA Optics Letters 2019

Conclusions

 Hipoλaos Switch: 1024×1024-port, sub-μsec latency, scalable to 10 Tbps with 113 pJ/bit
 On-board p2p for >8 sockets: 5.2 pJ/bit link @ 50 Gb/s, 68% energy savings over QPI
 10 GHz optical SRAM, 2× the speed of electronics: Si-based RAM layout + optical cache design
 Synergy between technology and architecture: modular, granular, flexible computing across the hierarchy

www.ict-streams.eu
Contacts: Prof. Nikos Pleros: [email protected]; Dr. Theoni Alexoudi: [email protected]

Back-up slides

The Si-based Optical SRAM

Hybrid PhC-based laser flip-flop* with only 13 fJ/bit: the lowest-power-consumption FF
*In collaboration with F. Raineri and R. Raj, C2N/CNRS

 Switching energy: 6.4 fJ/bit @ 5 GHz; 3.2 fJ/bit @ 10 GHz
 6.2 μm² footprint
 <50 ps response time

T. Alexoudi et al., JSTQE 2016; D. Fitsios et al., Optics Express 2016

The first optical cache design

* P. Maniotis et al., "Optical Buffering for Chip Multiprocessors: A 16GHz Optical Cache Memory Architecture," JLT, pp. 4175-4191, 2013

 Optical cache simulations @ 16 GHz; experiments @ 10 Gb/s
 Optical RAM peripheral circuits: row decoder, column decoder + access gates, tag comparator; WDM optical memory addressing and WDM optical word/data
 Fast optical RAM cell: design + theory >40 GHz; experiment @ 10 GHz; record-low power

Caching with light at 16 Gbps

 VPI simulations with an experimentally verified SOA model

Write operation, write to row "00": only when both λ1, λ2 = 0 and RW = 0 (green area) do the FF contents follow the Bit contents (blue area)

 Average ER ≈ 6.2 dB, average AM ≈ 1.9 dB: close to the experimental values!
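The write-access condition above amounts to simple gating logic; a behavioral sketch (illustrative logic only, not a photonic simulation):

```python
# Row-access gating of the optical RAM write: the flip-flop in a row
# latches the Bit-line value only while both row-address wavelengths are
# low (row "00" selected) and the Read/Write signal is 0 (write mode).
def ff_next(l1, l2, rw, bit, ff):
    write_enabled = (l1 == 0) and (l2 == 0) and (rw == 0)
    return bit if write_enabled else ff  # otherwise the FF holds its state

assert ff_next(0, 0, 0, 1, 0) == 1   # row selected, write: FF follows Bit
assert ff_next(1, 0, 0, 1, 0) == 0   # row not selected: FF holds
assert ff_next(0, 0, 1, 1, 0) == 0   # read mode: FF holds
```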

And compare…

Conventional topology: 8 cores, 2 GHz clock; dedicated L1; shared L2 = 8×L1
Optical topology: 8 cores, 2 GHz clock; 16 GHz cache clock; off-die shared L1, no L2
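The clock ratio is what makes the shared off-die L1 feasible; the time-sharing arithmetic:

```python
# Time-sharing argument behind the optical topology: a cache running at
# 16 GHz can grant each of 8 cores one access slot per 2 GHz core-cycle,
# so every core sees the illusion of a private cache.
cores = 8
core_clock_GHz = 2.0
cache_clock_GHz = 16.0

slots_per_core_cycle = cache_clock_GHz / core_clock_GHz  # 8 cache slots
assert slots_per_core_cycle >= cores  # enough slots to serve every core

# Round-robin slot assignment within one core cycle:
schedule = [slot % cores for slot in range(int(slots_per_core_cycle))]
print(schedule)  # each core gets exactly one slot: [0, 1, ..., 7]
```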

Mean execution times for PARSEC: 63.4 (L1 Conventional), 62.7 (L1+L2 Conventional), 19.4 and 10.7 (2×N GHz Optical)

* P. Maniotis et al., "High-Speed Optical Cache Memory as Single-Level Shared Cache in Chip-Multiprocessor Architectures," in Proceedings of HiPEAC 2015, 19-21 January 2015