Heavy Quarks on Fast Computers

Hubert Simma, NIC DESY

6.1.2010

Plan:
• Lattice QCD Challenges
• Machine Developments
• Theoretical Performance Analysis

Lattice QCD

Discretization on the lattice provides a framework for:

• rigorous regularization of QFT
• non-perturbative results (renormalization, matrix elements)
• “ab initio” computation from the QCD Lagrangian
• numerical evaluation of the Path Integral by the Monte Carlo method

$$O(U) \;\longrightarrow\; \langle O \rangle = \frac{1}{Z} \int D[U]\; O(U)\, e^{-S(U)} \;\approx\; \frac{1}{N} \sum_{U} O(U)$$

with N gauge-field configurations U(t, x) distributed according to

$$P(U) \;\sim\; e^{-S_g(U)} \cdot \det M(U)$$
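As a minimal sketch of this Monte Carlo estimate (dummy measured values, and a naive error that ignores autocorrelations):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Monte Carlo estimate of <O>: with configurations U_1..U_N sampled with
// probability P(U) ~ exp(-S_g(U)) det M(U), the path-integral average is
// the plain mean of the measured values O(U_n).  The error estimate here is
// the naive one (it ignores autocorrelations between configurations).
double mc_average(const std::vector<double>& obs, double& err) {
    double sum = 0.0, sum2 = 0.0;
    for (double o : obs) { sum += o; sum2 += o * o; }
    const double n = static_cast<double>(obs.size());
    const double mean = sum / n;
    err = std::sqrt((sum2 / n - mean * mean) / (n - 1.0));
    return mean;
}

int main() {
    // dummy measurements, purely to make the sketch runnable
    std::vector<double> obs = {0.51, 0.48, 0.53, 0.47, 0.50};
    double err = 0.0;
    const double mean = mc_average(obs, err);
    std::printf("<O> = %.3f +/- %.3f\n", mean, err);
    return 0;
}
```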

Lattice QCD

Challenges (theoretical and technical)

• all observables in Euclidean space
• explicitly broken (chiral and space-time) symmetries
• extrapolation to the continuum limit a → 0
• extrapolation to physical values of the light-quark masses
• limited physical volume L (isolation of hadronic ground states)

Large Scale Differences

$$a^{-1} \;\gg\; \mu_{\mathrm{PT}} \;\gg\; m_{\mathrm{hadr}} \;\gg\; L^{-1}$$

with typical values $a^{-1} \approx (0.05\,\mathrm{fm})^{-1}$ and $L^{-1} \approx (3\,\mathrm{fm})^{-1}$.
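A quick consequence of this hierarchy, assuming for illustration a hypercubic $L^4$ box (real simulations often use a longer time extent):

```cpp
#include <cmath>
#include <cstdio>

// Covering a box of size L at lattice spacing a needs (L/a)^4 sites
// (hypercubic lattice assumed here purely for illustration).
int main() {
    const double a_fm = 0.05, L_fm = 3.0;
    const double points_per_dim = L_fm / a_fm;            // = 60
    const double sites = std::pow(points_per_dim, 4);     // ~ 1.3e7
    // each site carries 4 SU(3) link matrices = 4 * 18 real numbers
    std::printf("%.0f points/dim, %.2e sites, %.2e gauge d.o.f.\n",
                points_per_dim, sites, sites * 4 * 18);
    return 0;
}
```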

Renormalization Schemes

$$\mathcal{L}_{\mathrm{QCD}}(g_0, m_0): \qquad \underbrace{\Lambda_{\mathrm{QCD}},\; \hat M_{\mathrm{light}},\; \hat M_s,\; \hat M_c}_{\text{RGI parameters}} \;\Longleftrightarrow\; \underbrace{F_\pi,\; m_\pi,\; m_K,\; m_D}_{\text{hadronic observables}}$$

B-Physics?

$$a^{-1} \;\gg\; m_B \;\;\ldots\;\; m_\pi \;\gg\; L^{-1}$$

The heater wasn’t so expensive, but the cable cost a fortune!

Non-Perturbative HQET Strategy [J. Heitger, R. Sommer, ALPHA]

$$\mathcal{L}_{\mathrm{QCD}} \;\longrightarrow\; \mathcal{L}_{\mathrm{HQET}} = \mathcal{L}_{\mathrm{stat}} - \omega_{\mathrm{kin}} O_{\mathrm{kin}} - \omega_{\mathrm{spin}} O_{\mathrm{spin}} + O(1/m_b^2) \qquad \text{(continuum limit)}$$

$$\Phi_i(L, M, a) = A_{ij}(L, a)\,\omega_j(M, a) + B_i(L, a)$$
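Once the $\Phi_i$ have been computed in QCD, this matching relation is a small linear system for the HQET parameters $\omega_j$; a toy $2\times2$ sketch with made-up coefficients (purely illustrative, not actual matching data):

```cpp
#include <cstdio>

// The matching relation Phi_i = A_ij * omega_j + B_i is a small linear system
// for the HQET parameters omega_j.  Toy 2x2 version with made-up numbers,
// solved by Cramer's rule.
int main() {
    const double A[2][2] = { {1.0, 0.2}, {0.3, 1.5} };   // A_ij(L, a)
    const double B[2]    = { 0.1, -0.4 };                // B_i(L, a)
    const double Phi[2]  = { 1.2, 0.7 };                 // Phi_i(L, M, a) from QCD

    const double r0 = Phi[0] - B[0], r1 = Phi[1] - B[1];
    const double det = A[0][0] * A[1][1] - A[0][1] * A[1][0];
    const double omega0 = (r0 * A[1][1] - A[0][1] * r1) / det;
    const double omega1 = (A[0][0] * r1 - A[1][0] * r0) / det;
    std::printf("omega = (%.3f, %.3f)\n", omega0, omega1);
    return 0;
}
```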

HQET Tests

[Plot: quenched results for $F_{B_s}$. Shown are $r_0^{3/2} \phi_{\mathrm{RGI}}$ (static), $r_0^{3/2} F_{B_s} m_{B_s}^{1/2} / C_{B_s}$ (HQET), and $r_0^{3/2} F_{\mathrm{PS}} m_{\mathrm{PS}}^{1/2} / C_{\mathrm{PS}}$ (QCD), plotted against $1/(r_0 m_{\mathrm{PS}})$.]

In progress: unquenched Nf = 2 ...

Algorithmic Challenges

CPU cost per HMC trajectory: [M. Lüscher]

$$\text{cost} \;\sim\; \left(\frac{L}{3\,\mathrm{fm}}\right)^{5} \cdot \left(\frac{0.05\,\mathrm{fm}}{a}\right)^{6} \cdot \left(\frac{20\,\mathrm{MeV}}{m_q}\right) \;\; \text{Tflops} \times \text{year}$$
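A minimal sketch evaluating this scaling; the overall prefactor is set to one Tflops × year here purely for illustration (an assumption), so only relative costs are meaningful:

```cpp
#include <cmath>
#include <cstdio>

// Rough HMC cost per trajectory as parametrized above (after M. Luescher):
// cost ~ (L/3 fm)^5 * (0.05 fm / a)^6 * (20 MeV / m_q), in Tflops x year.
// The overall prefactor is normalized to 1 here purely for illustration.
double hmc_cost(double L_fm, double a_fm, double mq_MeV) {
    return std::pow(L_fm / 3.0, 5) * std::pow(0.05 / a_fm, 6) * (20.0 / mq_MeV);
}

int main() {
    // Example: light quarks near the physical point (m_q ~ 3.5 MeV) in a
    // 4 fm box at a = 0.05 fm -- the cost grows steeply in all parameters.
    std::printf("cost (prefactor = 1): %.1f Tflops x year\n",
                hmc_cost(4.0, 0.05, 3.5));
    return 0;
}
```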

Towards the continuum limit, some observables (e.g. the topological charge) can suffer from serious critical slowing down (long autocorrelations)!

Dedicated Machines

Idea: focus on a specific computational task to improve on:

• Cost: development (politics, manpower) + production (market, technology) + operation (RAS, kW in+out)

• Performance:

$$1/T_{\mathrm{exe}} = \text{work/cycle} \;(\text{architecture}) \;\times\; f_{\mathrm{clk}} \;(\text{technology})$$
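As a concrete instance, the apeNEXT numbers from the table below (8 DP operations per clock at 200 MHz) give the per-processor peak:

```cpp
#include <cstdio>

// Peak performance from the relation above: work/cycle (set by the
// architecture) times the clock frequency (set by the technology).
// Numbers taken from the apeNEXT column of the table that follows.
int main() {
    const double flops_per_cycle = 8.0;    // beta_FP: 8 DP operations per clock
    const double f_clk = 200e6;            // 200 MHz
    std::printf("peak per processor: %.2f Gflops (DP)\n",
                flops_per_cycle * f_clk / 1e9);
    return 0;
}
```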

Examples of LQCD Machines:

• GF11 (IBM)
• QCDSP, QCDOC (Columbia U, IBM)
• CP-PACS (Tokyo U, Hitachi)
• APE1, APE100, APEmille, apeNEXT (INFN, DESY, F)
• PC Clusters with tailored network (Budapest, . . . )

APE: Array Processor Experiment

• single logical CPU (SIMD)
• multiple FPUs with private memory
• 3d torus communication network

Evolution of APE Machines

Generation        APE100        APEmille      apeNEXT
bringup           1992          1999          2004
peak/board        0.4 Gflops    4 Gflops      12 Gflops
Architecture:
  control         SIMD          SIMD          MIMD
  synchronous     yes           yes           no
  FP precision    32            32, 64        64
  a × b + c       R             C, R2         C, R2
  βFP             2 SP/clk      8 SP/clk      8 DP/clk
  βmem/βFP        1:4           1:4           1:4
  βnet/βFP        1:16          1:16          1:72
Technology:
  ASICs           2             3             1
  feature size    1.2 µm        0.5 µm        0.18 µm
  fclk            25 MHz        66 MHz        200 (135) MHz

Challenges of ASIC Design:
• Growing functionality
• Reduced architectural freedom
• Access to competitive chip technology at affordable price

Beyond APE

Basic Idea:

Combine the strengths of APE machines and of PC clusters:
• Fast commodity processor
• Tailored custom network (tightly coupled, simple)
• Custom node + system design

Challenges:

• High single-node performance → multi-core processor
• Scalability → fast processor interface and balanced torus network
• Cost efficiency → integration and cooling

Cell BE Processor
• Innovative “Cell Broadband Engine” architecture
• Developed by Sony, Toshiba, IBM (Playstation 3)
• Enhanced version with DDR2 and DP: PowerXCell 8i (IBM blades, Roadrunner)

Key Features:

• PowerPC core for OS and control (PPU)
• 200 Gflops (SP peak) by 8 in-order cores (SPU) with SW-controlled private cache (LS)
• Integrated memory and IO interfaces
• Fast ring interconnect between cores and interfaces (EIB)

A very fast APE board (without memory and communication network) on a single chip!

QPACE: QCD PArallel Computing on the Cell broadband Engine

Academic Partners:

• Uni Regensburg
• Uni Wuppertal
• Forschungszentrum Jülich
• DESY Zeuthen
• Uni Ferrara
• Uni Milano

Industrial Partner: IBM (Böblingen, Rochester, La Gaude)

Support by: (I), Knürr (D), (US)

Main Funding: DFG (SFB TR55), IBM

Dec 2007: kick-off meeting
Apr 2008: prototype design completed
July 2008: start of prototype tests
Fall 2009: installation of O(2000) nodes

QPACE Building Block: Node Card

Network Processor (NWP)

Main purpose: route and control the data flow between the Cell and the 6 links of the torus network.

• Field Programmable Gate Array (FPGA): Xilinx Virtex-5 LX110T

• External 10GbE/PCIe transceiver (PHY): PMC-Sierra PM8358

Teye ≈ 100 ps

QPACE Communication Model

• 2-sided (separate send + receive)
• non-blocking (separate initiate + complete)
• data transport to/from main memory or LS
• nearest-neighbour connectivity
• light-weight link layer protocol
• multiple use of the same link by 8 virtual channels

More details: http://moby.mib.infn.it/~simma/tnw [M. Pivanti, F. Schifano, H.S.]
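The torus network uses its own light-weight protocol and API (see the link above); purely as an illustration of the same 2-sided, non-blocking pattern, here is a sketch of a 1-d nearest-neighbour halo exchange written with MPI as a stand-in (not the QPACE API):

```cpp
#include <mpi.h>
#include <vector>

// Illustration only: MPI stands in for the QPACE link protocol to show the
// same pattern: 2-sided (explicit send and receive) and non-blocking
// (separate initiation and completion), for a 1-d nearest-neighbour exchange.
void halo_exchange_1d(std::vector<double>& send_lo, std::vector<double>& send_hi,
                      std::vector<double>& recv_lo, std::vector<double>& recv_hi,
                      int rank_lo, int rank_hi, MPI_Comm comm) {
    MPI_Request req[4];
    // initiate transfers (non-blocking): tag 0 = data moving "up", tag 1 = "down"
    MPI_Irecv(recv_lo.data(), (int)recv_lo.size(), MPI_DOUBLE, rank_lo, 0, comm, &req[0]);
    MPI_Irecv(recv_hi.data(), (int)recv_hi.size(), MPI_DOUBLE, rank_hi, 1, comm, &req[1]);
    MPI_Isend(send_hi.data(), (int)send_hi.size(), MPI_DOUBLE, rank_hi, 0, comm, &req[2]);
    MPI_Isend(send_lo.data(), (int)send_lo.size(), MPI_DOUBLE, rank_lo, 1, comm, &req[3]);
    // ... bulk computation can overlap with the transfers here ...
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);   // complete all four transfers
}
```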

QPACE Cooling [G. Goldrian, IBM]

Concept:
• node housing → heat conductor
• cold plate → liquid cooling

Simulation:
• 10 l/min water at 12 °C
• load 4224 W
• confirmed by tests . . .

Other QPACE Hardware Components

• Root Card

  - Micro-processor with Ethernet interface
  - Global signal tree (like APE)
  - Serial links (configuration)
  - Clock distribution

• Backplane

  - High-speed signals and power distribution
  - 22 layers
  - 13'000 holes

QPACE Rack

Performance Density: 52 TFlops (SP) / rack

• Footprint: 80 × 120 cm
• Weight: O(1000) kg
• Power: O(29) kW

Power Efficiency: → #1 in the Green500 list of Nov 2009

[www.green500.org]

Comparison

                  apeNEXT            PowerXCell 8i      Intel Nehalem
bringup           2004               2007               2008
peak (DP)         12 Gflops/board    100 Gflops/chip    50 Gflops/chip
Technology:
  feature size    180 nm             65 nm              45 nm
  power           O(10) W            O(100) W           O(80) W
  fclk            200 (135) MHz      3.2 GHz            2.8 GHz
Architecture:
  control         SPMD               8 cores + PPU      4 cores
  a × b + c       C, R2              R4, R2             2 × SSE3
  βFP             8 DP/clk           32 DP/clk          16 DP/clk
  βmem/βFP        1:4                1:32               1:12
  βnet/βFP        1:72               1:140              1:140
  cache           —                  8 × 256 KB (LS)    8 MB (L3)

The Cell seems to have a memory bottleneck for LQCD.
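A back-of-the-envelope roofline check of this statement, using the balance figures from the table above together with the Wilson-Dirac counts quoted on a later slide (1320 flops and a minimal 192 double-precision words per site):

```cpp
#include <cstdio>

// Roofline-style balance check for the Wilson-Dirac hopping term on the
// PowerXCell 8i, using only numbers quoted on these slides.  "Words" are
// 8-byte double-precision values.
int main() {
    const double flops_per_site = 1320.0;     // I_FP from the Wilson-Dirac slide
    const double words_per_site = 192.0;      // minimal I_mem with full data re-use
    const double peak_flops = 32.0 * 3.2e9;   // beta_FP = 32 DP/clk at 3.2 GHz
    const double mem_words  = peak_flops / 32.0;   // beta_mem/beta_FP = 1:32

    const double intensity = flops_per_site / words_per_site;   // flops per word
    const double mem_bound = mem_words * intensity;             // attainable flops/s

    std::printf("arithmetic intensity : %.2f flops/word\n", intensity);
    std::printf("memory-bound limit   : %.1f Gflops (peak %.1f Gflops)\n",
                mem_bound / 1e9, peak_flops / 1e9);
    // ~22 Gflops << ~100 Gflops peak: the operator is memory-bandwidth limited.
    return 0;
}
```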

Hardware Model

Devices and their parameters:

• control: ISA, . . .
• data storage: size 0 ≤ σ_i < ∞
• data transport/processing: bandwidth β_ij < ∞, latency λ_ij ≥ 0

Structure: described by a “Hardware Architecture Graph” (HAG) with

• nodes = storage devices
• arcs = transport devices
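A minimal sketch of a HAG as a data structure; the device names and the bandwidth/latency values in the example are illustrative placeholders, not datasheet figures:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// "Hardware Architecture Graph" (HAG) as described above: nodes are storage
// devices (parametrized by size), arcs are transport/processing devices
// (parametrized by bandwidth and latency).
struct StorageDevice {                 // HAG node
    std::string name;
    double size_bytes;                 // 0 <= sigma_i < infinity
};

struct TransportDevice {               // HAG arc: from -> to
    int from, to;                      // indices into HardwareGraph::nodes
    double bandwidth;                  // beta_ij < infinity (bytes/s)
    double latency;                    // lambda_ij >= 0 (s)
};

struct HardwareGraph {
    std::vector<StorageDevice>   nodes;
    std::vector<TransportDevice> arcs;
};

// Example: a Cell-like node with main memory, a local store (LS) and a
// register file; the numeric values are placeholders for illustration.
HardwareGraph cell_like_node() {
    HardwareGraph g;
    g.nodes = { {"main memory", 4e9}, {"local store", 256e3}, {"registers", 2e3} };
    g.arcs  = { {0, 1, 25.6e9, 1e-7},    // memory interface + DMA into the LS
                {1, 2, 400e9,  1e-8} };  // LS load/store into registers
    return g;
}

int main() {
    const HardwareGraph g = cell_like_node();
    std::printf("%zu storage devices, %zu transport devices\n",
                g.nodes.size(), g.arcs.size());
    return 0;
}
```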

Application Model

Computational tasks and their quantification:

• data set (input, output, temporary variables): storage requirement S_i
• data transport/processing tasks (assignments): information exchange I_ij

Data Dependencies: described by a Directed Acyclic Graph (DAG) with

• arcs = RAW dependencies (variable lives)
• nodes = transport operations
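The application side can be sketched analogously; again only a skeleton, with the information exchange I_ij carried on the arcs (the example operations and byte counts are illustrative):

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Application-side DAG: nodes are transport/processing operations, arcs are
// RAW dependencies (variable lives), each carrying an information exchange
// I_ij (bytes that must flow from producer to consumer).
struct Operation {                     // DAG node
    std::string name;
};

struct Dependency {                    // DAG arc: producer -> consumer (RAW)
    int producer, consumer;            // indices into ApplicationGraph::ops
    double bytes;                      // information exchange I_ij
};

struct ApplicationGraph {
    std::vector<Operation>  ops;
    std::vector<Dependency> deps;
};

int main() {
    // tiny illustrative fragment of a hopping-term update
    ApplicationGraph wd;
    wd.ops  = { {"load U(x,mu)"}, {"load phi(x+mu)"}, {"muladd"}, {"store phi'(x)"} };
    wd.deps = { {0, 2, 144}, {1, 2, 192}, {2, 3, 192} };   // bytes on RAW arcs
    std::printf("%zu operations, %zu dependencies\n", wd.ops.size(), wd.deps.size());
    return 0;
}
```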

Implementation

Main problems:

• Code Selection: transport operations (DAG) → HW instructions (DAG')

• Resource Allocation:
  - data set (variables) (= arcs of DAG') → storage devices (= nodes of HAG)
  - operations (instructions) (= nodes of DAG') → transport devices (= arcs of HAG)

• Scheduling

Allocation and scheduling are interrelated and NP-hard problems (to be tackled by algorithm, programmer, compiler, hardware)

Main Computational Task in LQCD: Wilson-Dirac Operator

Hopping term of the Wilson-Dirac operator applied to φ:

$$\phi'(x) = \sum_{\mu=1}^{4} \Big\{ U(x,\mu)\,(1 - \gamma_\mu)\,\phi(x + \hat\mu) + \cdots \Big\}$$

Naive analysis: (without data re-use)

• FP Computations: IFP = 1320 FP operations (→ 840 muladd)

• Memory Access: Imem = 9|φ| + 8|U| = 360 FP words
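A quick check of these counts in FP words; the decomposition of the minimal figure quoted just below (8 links plus one input and one output spinor, assuming ideal re-use of neighbour spinors) is an interpretation, not taken from the slide:

```cpp
#include <cstdio>

// Naive memory-traffic count for the hopping term, in FP words:
// a spinor phi(x) has 4 (spin) x 3 (colour) complex components = 24 words,
// an SU(3) link U(x,mu) has 3x3 complex entries = 18 words.
int main() {
    const int phi_words = 4 * 3 * 2;   // |phi| = 24
    const int u_words   = 3 * 3 * 2;   // |U|   = 18
    // naive: load 8 neighbour spinors + store 1 result, plus 8 links
    const int naive = 9 * phi_words + 8 * u_words;         // = 360
    // with ideal re-use of neighbour spinors from a small cache, only the
    // links plus one input and one output spinor remain (assumption)
    const int minimal = 8 * u_words + 2 * phi_words;       // = 192
    std::printf("naive: %d words, minimal: %d words\n", naive, minimal);
    return 0;
}
```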

Performance Model: (minimal Imem = 192 FP words)

• data re-use in a small cache
• prefetch to hide latencies
• 160'000 different schedules
• $T_{ij} = \lambda_{ij} + I_{ij}/\beta_{ij}$

[Plot: execution time $T_{\mathrm{exe}}$ vs. storage requirement $S$]
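A minimal instance of this transfer-time model; the latency and bandwidth values are illustrative placeholders, not measurements:

```cpp
#include <cstdio>

// Transfer-time model from above: moving I_ij words over device (i,j) costs
// T_ij = lambda_ij + I_ij / beta_ij; with prefetching, such transfers can be
// overlapped with computation instead of adding to it.
struct Channel { double latency_s; double bandwidth_wps; };   // beta in words/s

double transfer_time(const Channel& c, double words) {
    return c.latency_s + words / c.bandwidth_wps;
}

int main() {
    // Illustrative numbers only: a memory channel moving the minimal
    // 192 words per lattice site, compared to the compute time per site.
    const Channel mem = { 200e-9, 3.2e9 };
    const double t_mem  = transfer_time(mem, 192.0);
    const double t_flop = 1320.0 / 100e9;       // 1320 flops at 100 Gflops peak
    std::printf("per site: memory %.1f ns vs. compute %.1f ns\n",
                t_mem * 1e9, t_flop * 1e9);
    return 0;
}
```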

Summary and Outlook

Progress is slow, but

• good prospects for obtaining accurate lattice results for B-physics

• dedicated machine developments can pay off

• the implementation space for scientific computing problems is amazingly rich

All the best wishes and many thanks to Daniel!
