AltiumLive 2017: PCBs for Computing Density From Big Bang to the Automobile

Andreas Doering, IBM Research – Zurich Laboratory

1 Agenda

1 The DOME project

2 Motivation for Microservers

3 Boards

4 Insights

5 Outlook

2 * IDC HPC technology excellence award, ISC17

3 DOME PPP (public-private partnership): ASTRON, IBM, Dutch government

4 SKA (Square Kilometer Array) to measure Big Bang

Ronald P. Luijten / July 2017

[Figure: timeline of the universe. Big Bang (0), inflation (10^-32 s), protons created (10^-6 s), start of nucleosynthesis through fusion (0.01 s), end of nucleosynthesis (3 min), 380’000 years, modern universe (13.8 billion years)]

Picture source: NZZ, March 2014

5 SKA: What is it?

~0.5M antennae, 0.07 GHz-0.45 GHz

~0.5M antennae, 0.5 GHz-1.7 GHz

~3000 dishes, 3 GHz-10 GHz

1. 10^9 samples/second * 0.5M antennae: 0.5 * 10^15 samples/sec

2. 3.5 * 10^9 samples/second * 0.5M antennae: 1.7 * 10^15 samples/sec

3. 2 * 10^10 samples/second * 3K antennae: 6 * 10^13 samples/sec

Sum = 2 * 10^15 samples/second @ 86400 seconds/day: 170 * 10^18 (Exa) samples/day.

Assume 10-12x reduction @antenna: 14 Exabytes/day (minimum).

Top 500: Sum = 123 PFlops. 2 GFlops/watt. → 100x Flops of Sum! ~7 GWh

6 CSP SDP

~ 1 PB/Day

330 disks/day

~ 10 Pb/s

? ?

120’000 disks/yr

86’400 sec/day

14 ExaByte/day
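The data-rate and power figures on these two SKA slides are plain arithmetic, and a short script makes the steps easy to re-check. The sketch below uses only numbers stated on the slides. The function names are illustrative, and one byte per sample is an assumption the slides do not state. With the sum rounded down to 2 * 10^15 samples/second, as on the slide, a 12x reduction gives roughly the quoted 14 Exabytes/day.

```python
SECONDS_PER_DAY = 86_400

# (samples per second per antenna, number of antennae), as stated on the SKA slide
BANDS = {
    "0.07-0.45 GHz antennae": (1e9, 0.5e6),
    "0.5-1.7 GHz antennae": (3.5e9, 0.5e6),
    "3-10 GHz dishes": (2e10, 3e3),
}


def samples_per_second(bands=BANDS):
    """Total raw sample rate summed over all bands."""
    return sum(rate * count for rate, count in bands.values())


def bytes_per_day(total_rate, reduction, bytes_per_sample=1.0):
    """Daily data volume after an at-antenna reduction factor.

    bytes_per_sample=1.0 is an assumption; the slide does not state it.
    """
    return total_rate * SECONDS_PER_DAY * bytes_per_sample / reduction


def power_watts(flops, watt_per_gflops):
    """Electrical power needed to sustain `flops` at a given efficiency."""
    return flops / 1e9 * watt_per_gflops


if __name__ == "__main__":
    print(f"exact sum: {samples_per_second():.2e} samples/s")
    rounded = 2e15  # the slide rounds the sum down to 2 * 10^15
    print(f"per day:   {rounded * SECONDS_PER_DAY:.2e} samples/day")
    print(f"after 12x: {bytes_per_day(rounded, 12) / 1e18:.1f} EB/day")
    # Power for the 4 Exaflop SKA1-Mid need at the two efficiencies quoted
    for label, eff in (("Top-500 (11/2013)", 0.3), ("1 Eflop @ 20 MW", 0.02)):
        print(f"4 Eflop at {label}: {power_watts(4e18, eff) / 1e6:.0f} MW")
```

Dividing the two power results by the 1 MW SDP budget gives the factor 80-1200 quoted on the slide.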

Top-500 Supercomputing (11/2013): 0.3 Watt/Gflop/s

Today’s industry focus is 1 Eflop @ 20 MW (2018), i.e. 0.02 Watt/Gflop/s

Too hard

Too easy (for us)

Most recent data from SKA:
• CSP: max. power 7.5 MW
• SDP: max. power 1 MW

Latest need for SKA: 4 Exaflop (SKA1-Mid) → 1.2 GW…80 MW

Factor 80-1200: Moore’s law, multiple breakthroughs needed

© 2016 IBM Corporation

7 Dome Project:

Research Streams…
• Sustainable (Green) Computing
• Data & Streaming
• Nanophotonics

…plus an open user platform:
• Student projects
• Events
• Research Collaboration

…are mapped to research projects:
• System Analysis: Algorithms & Machines
• Computing: Microservers, Accelerators
• Transport: Nanophotonics, Real-Time Communications
• Storage: Access Patterns, New Algorithms

33M€ 5-year Research Project: 76 IBM PY (32 in NL); 50 ASTRON PY

8 Definitions

• “Microserver” = The server class of the mobile era

• “Microserver” = SoC + DRAM + Flash + Power

• “Microserver” = Backplane + not-enclosed modules

9 Motivation

• Silicon scaling limits; energy for computation vs. on-chip communication vs. off-chip communication

• Use of large SMP servers by partitioning, docker, etc.: cache coherency not fully used

• Emergence of powerful embedded processor cores, in particular ARM

• Premise given through Aquasar cooling work → enabled DOME funding

10 Table of PCBs

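| Module Name | Iterations | Length [mm] | Width [mm] | Thickness [mm] | Layers | Holes | Components | Nets | Backdrilling | Material | Tool |
|---|---|---|---|---|---|---|---|---|---|---|---|
| P5020/P5040 processor | 3 | 139.7 | 55.5 | 1.28 | 10 | 3242 | 1007 | 539 | no | ISOLA-400 | A |
| Big Baseboard | 1 | 220 | 160 | 1.28 | 10 | 491 | 175 | 154 | no | ISOLA-400 | A |
| Power Converter | 2 | 139 | 56.5 | 1.63 | 8 | 737 | 440 | 231 | no | FR-4 | A |
| mSATA on DIMM | 2 | 139.7 | 55.5 | 1.24 | 4 | 341 | 69 | 67 | no | FR-4 | A |
| 8p1 backplane | 2 | 300 | 200 | 2.7 | 18 | 3582 | 565 | 1326 | no | FR-4+ | A |
| Testboard for switch power converter | 1 | 160 | 220 | 1.6 | 8 | 888 | 259 | 134 | no | FR-4 | A |
| Switch Mothercard | >1 | 139.7 | 57.8 | 3.6 | 28 | 3311 | 837 | 730 | yes | | A |
| Switch Daughtercard | 1 | 139.7 | 57.8 | 1.8 | 10 | 423 | 213 | 160 | no | FR-4 | A |
| Mini baseboard | 2 | 160 | 100 | 1.2 | 6 | 851 | 376 | 241 | no | FR-4 | A |
| Bracket for DIMM connector on Minibaseboard | 2 | 154 | 32 | 1.2 | 6 | 116 | 2 | 98 | no | FR-4 | A |
| Bracket for SPD08 connector on Minibaseboard | 1 | | | | | | | | | | C |
| T4240 processor | 3+1 | 139.7 | 63 | 1.6 | 16 | 1316 (a) | 820 (a) | | | Panasonic | C |
| mSATA on SPD08 | 3 | 139.7 | 62.5 | 1.6 | 6 | 1014 | 79 | 105 | no | FR-4 | A |
| M2 carrier | 2 | 139.7 | 61.6 | 1.6 | 6 | 1091 | 130 | 131 | | | A |
| Auxiliary power converter | 1 | 61 | 56 | 1.6 | 4 | 478 | 74 | 30 | no | FR-4 | A |
| PCIe Extender | | | | | | | | | no | FR-4 | A |
| LS2088 Processor module | 1? | 139.7 | 62.5 | 1.6 | 14 | 1037 (a) | 714 (a) | | no | Panasonic R1577/1570 | C |
| USB HUB Module | 2 | 139.7 | 61.5 | 1.57 | 8 | 1162 | 557 | 387 | no | FR-4 | A |
| BB2 backplane | 2 | 520 | 200 | 3.15 | 22 | 12598 | 1076 | 3820 | 7 Runs | Panasonic Megtron 6 | A |
| Interposer card | 1 | 139.7 | 80 | 1.57 | 8 | 897 | 76 | 132 | 4 Runs | FR-4, Panasonic Megtron 6N | A |
| FMKU2595 FPGA | 2 | 139.7 | 63 | 1.57 | 14 | 7442 | 881 | 914 | no | Panasonic Megtron 6 | A |

(a) Only two of the three counts (holes, components, nets) survive for this row; they are listed in order of appearance, and their column assignment is uncertain.

A = Altium Designer, C = Cadence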

11 System Overview

Block diagram labels:
• 8x40Gb
• 10G Ethernet Switch
• Storage node: 8 x mSATA or 2 x M2
• 8/32/128 compute nodes
• Power converter

Compute node options:
• P5020/P5040: 2/4 cores [email protected], 16 GByte DDR3, 2x XAUI, 4x 1GbE, 2x SATAv1
• T4240: 24 cores [email protected], 24 GByte DDR3, 4x 10GbE, 2x 1Gb, 2x SATAv2, PCIe-2.0 x8
• LS2088: 8x ARMv8 @ 2 GHz, 32 GByte DDR4, 6x 10GbE, PCIe, 2x SATA
• FMKU2595 FPGA: 330K LUTs, 4x 10GbE, 4x GbE, 2x SATA

12 Backplane connectors

DIMM socket with removed latches for generation 1

3M’s SPD08 in various lengths for generation 2

3 segments of Molex Impact: 210 contacts (70 diff pairs)

Xtreme Poweredge for power converter (both generations)

13 System today

Backplane for
• 32 compute nodes
  • 8 populated
• 1 switch node
• 1 management node
• 2 storage nodes
• Water cooled

14 View from above

QSFP cages

Water In/Out

10 GbE Switch

Server nodes

Storage node

Cooling Rails

Power node

15 System Q4 2017

Two backplanes, total 64 compute nodes, e.g. 1536 cores, 1536 GB DRAM, 64 SSDs
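The "e.g." totals match a system fully populated with T4240 modules (24 cores and 24 GByte DDR3 each, per the system overview). A minimal sketch of that arithmetic follows; `NodeSpec` and `system_totals` are illustrative names, not part of any project tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    name: str
    cores: int
    dram_gbyte: int


# Per-module figures taken from the system overview slide
T4240 = NodeSpec("T4240", cores=24, dram_gbyte=24)


def system_totals(spec, nodes_per_backplane=32, backplanes=2):
    """Aggregate node, core, and DRAM counts for a fully populated system."""
    nodes = nodes_per_backplane * backplanes
    return {
        "nodes": nodes,
        "cores": nodes * spec.cores,
        "dram_gbyte": nodes * spec.dram_gbyte,
    }


if __name__ == "__main__":
    print(system_totals(T4240))
```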

16 Gallery of (some) Boards

17 Power Converter

• Master thesis project:
  • Student did high-level design (e.g. selection of backplane connector), component selection, and schematic entry
  • Layout was completed by a regular engineer
  • First version worked

• 1 iteration to improve stability, protection

Challenges: high current on top/bottom and SMD packages, location of connectors, tight IC/L/C converter triangle, and conflict of high-profile Ls and hot ICs that must be covered by the cool plate

18 40 A per contact finger, allowing different types of C/L

19 Switch Module

Left: Main Switch PCB 130mm x 55mm

Right: Switch with mounted daughter card

20 Pin Assignment

• Pin assignment has to suit backplane and switch module design
• Both are challenging (backplane has more space, but many more wires)
• Reduce crossings on both boards
• XAUI has low requirements on length balancing
• 1st iteration:
  • Let the CAD tool choose the pinout on both boards independently
  • Find out the critical spots
  • Use a Python script to build a systematic pinout that circumvents these
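The script used in the project is not shown in the slides. As an illustration of the idea only, the sketch below uses a deliberately simplified one-dimensional model: each net has a position along the board edge where it originates, each connector pin has a position along the connector, and two nets cross when their order differs between the two. Assigning free pins in source order gives a crossing-free pinout while skipping pins marked as critical. All names (`count_crossings`, `systematic_pinout`, the pin and net labels) are hypothetical.

```python
from itertools import combinations


def count_crossings(assignment, src_pos, pin_pos):
    """Count net pairs whose order at the source differs from their pin order."""
    crossings = 0
    for a, b in combinations(assignment, 2):
        src_delta = src_pos[a] - src_pos[b]
        pin_delta = pin_pos[assignment[a]] - pin_pos[assignment[b]]
        if src_delta * pin_delta < 0:
            crossings += 1
    return crossings


def systematic_pinout(src_pos, pin_pos, blocked=()):
    """Assign nets to free pins in matching positional order (no crossings)."""
    free_pins = sorted((p for p in pin_pos if p not in blocked), key=pin_pos.get)
    nets = sorted(src_pos, key=src_pos.get)
    if len(nets) > len(free_pins):
        raise ValueError("not enough free pins for all nets")
    return dict(zip(nets, free_pins))


# Hypothetical example data: four lanes and five connector pins
SRC_POS = {"xaui0": 0.0, "xaui1": 1.0, "xaui2": 2.0, "xaui3": 3.0}
PIN_POS = {"A1": 0, "A2": 1, "A3": 2, "A4": 3, "A5": 4}
TOOL_CHOICE = {"xaui0": "A3", "xaui1": "A1", "xaui2": "A4", "xaui3": "A2"}

if __name__ == "__main__":
    print("tool pinout crossings:", count_crossings(TOOL_CHOICE, SRC_POS, PIN_POS))
    pinout = systematic_pinout(SRC_POS, PIN_POS, blocked={"A2"})
    print("systematic pinout:", pinout)
    print("crossings:", count_crossings(pinout, SRC_POS, PIN_POS))
```

A real pinout has to satisfy both boards at once and works in two dimensions across several layers, so this only conveys the ordering principle.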

21 PCB Layer Stack

Press-Fit Connector on this side

Total PCB thickness 3.6 mm

6 inner signal layers, impedance controlled, with shielding ground layers in-between

Length of connector pins 1.2 mm

4 high-current power supply lanes

Original assumption, that board space across the “through-hole” connector cannot be used, was wrong.

Need backdrilling

ASIC on this side

22 PCB routing

Routing between connector pins with 1 signal pair

This narrow strip (1 cm wide) is one critical part.

23 FPGA Node

PCI- and/or Network-Attached

2 Channels DDR4 (e.g. 16 GByte)

Xilinx® Kintex® UltraScale

6 x 10 GbE, PCIe3 x8, 2 x SATA3

Status: In bringup

24 FPGA Node – Layout Concept

Flyby control signals on 3 layers

P2P data signals mainly on 1 layer

High-speed IO on 2 inner layers

25 Cooling

Combination of passive cooling on decapped chip, using vapor chambers and hot water

26 Insights

• Main source of error: transfer from data sheet into tool
• Second source of error: harness interface (swapping P/N on diff pairs, clock/data on I2C)
• Third source of error: voltage levels of pins (e.g. enable of power converter)
• → Why is there no electronic transfer of component data to designers?
  Exception: TI (e.g. https://webench.ti.com/cad/)
  Why is there no standard format? There was an initiative XMLEDA, etc.
• DRC could do more, if symbols provided the information (e.g. P/N property, clock, etc.)
• Conversion from one tool to another is a nightmare
  Hired Elgris and still 5 working days turned into 2 months
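To illustrate what such an extended DRC could look like, the sketch below assumes that every symbol pin carries a role property ("P", "N", "SCL", "SDA", "EN") and a nominal voltage level. No specific CAD tool exposes exactly this interface; the `Pin` class, the pin names, and the harness list are all hypothetical. With that information, the second and third error sources listed above become mechanical checks on each connection across the harness interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pin:
    name: str
    role: str      # hypothetical symbol property: "P", "N", "SCL", "SDA", "EN"
    volts: float   # nominal logic level of the pin


def pins(*specs):
    """Build a name -> Pin lookup from (name, role, volts) tuples."""
    return {name: Pin(name, role, volts) for name, role, volts in specs}


def check_harness(connections, board_a, board_b):
    """Return a list of violations for pin pairs connected across two boards."""
    violations = []
    for a_name, b_name in connections:
        a, b = board_a[a_name], board_b[b_name]
        if a.role != b.role:
            violations.append(f"{a.name}->{b.name}: role {a.role} wired to {b.role}")
        if a.volts != b.volts:
            violations.append(f"{a.name}->{b.name}: {a.volts} V wired to {b.volts} V")
    return violations


# Hypothetical example: a swapped diff pair and a mismatched enable level
NODE = pins(("TX_P", "P", 1.8), ("TX_N", "N", 1.8),
            ("SCL", "SCL", 3.3), ("SDA", "SDA", 3.3), ("PWR_EN", "EN", 3.3))
BACKPLANE = pins(("LANE0_P", "P", 1.8), ("LANE0_N", "N", 1.8),
                 ("I2C_SCL", "SCL", 3.3), ("I2C_SDA", "SDA", 3.3),
                 ("CONV_EN", "EN", 1.8))
HARNESS = [("TX_P", "LANE0_N"), ("TX_N", "LANE0_P"),
           ("SCL", "I2C_SCL"), ("SDA", "I2C_SDA"), ("PWR_EN", "CONV_EN")]

if __name__ == "__main__":
    for line in check_harness(HARNESS, NODE, BACKPLANE):
        print(line)
```

Intentional P/N swaps, which some high-speed interfaces tolerate, would need an explicit waiver mechanism that this sketch leaves out.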

27 Acknowledgements

This work is the result of many people
• Ronald Luijten (Lead Architect/Technical Lead), Francois Abel (Switch, FPGA, and BB2 lead), Beat Weiss (Core Engineering), Matteo Cossale (Cooling), Stephan Paredes (Cooling), and others: IBM ZRL/CH
• Peter v. Ackeren, Ed Swarthout, Dac Pham: Freescale/NXP
• Yvonne Chan: IBM Toronto
• Gijs Schonderbeek, Sieds Damstra, Albert-Jan Boobstra: ASTRON/NL
• Several students and interns
• And many more remain unnamed…

Companies: NXP, IBM, TransferDSW/NL, Strukton/NL, Roneda/BE, AT&S/AT, Supercomputing Systems/CH, Miromico/CH

Dutch government for the DOME grant

28 Outlook

• Still work to be done: HW testing, SW, redesign of some boards for bugs or low production yield, cost reduction of some components
• Commercially available through startup ILA Microservers
• First customer bought 15 T4240 modules
• Buildup of two systems for ASTRON and ZRL (with enclosure, etc.)
• GPU node
• Target markets:
  • Data center
  • Scientific computing (SKA)
  • Embedded (vehicles, robots, IoT edge server)

29 Backup System Management

• Every node is a USB device
• Cypress PSoC controller implements module-level management
  • Serial console
  • Power sequencing
  • Current and temperature monitoring
  • JTAG
  • etc.
• Python process on host allows access to all hosts
  • Implements IPMI
  • Interacts with switch, FPGA tools, etc.

31 QorIQ T4240 Communication Processor

32 32-way carrier network topology

FM6000 switch

T4240 module

32-way carrier

32 x 10 GbE internal connectivity from switch

8 x 40 GbE external connectivity (QSFP+)

Green links optionally connect to other 32-way carrier
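The link counts above imply that the internal and external bandwidth of one carrier are balanced. A minimal sketch of that check follows; the function name is illustrative, and it simply treats all eight 40 GbE ports as uplinks, without accounting for ports that might be used to connect to a second carrier.

```python
def carrier_bandwidth(internal_links=32, internal_gbps=10,
                      external_links=8, external_gbps=40):
    """Aggregate internal vs. external bandwidth of one 32-way carrier."""
    internal = internal_links * internal_gbps
    external = external_links * external_gbps
    return {
        "internal_gbps": internal,
        "external_gbps": external,
        "oversubscription": internal / external,
    }


if __name__ == "__main__":
    print(carrier_bandwidth())
```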

33 Thanks for your Attention! Questions?