HEPiX Oct 2003: CERN IT-ADC research

CERN IT-ADC research activities
J. Iven, IT-ADC-LE, 21.10.2003

– Why: Intro & Motivation
– How: Process
– Conditions: Load
– Architecture
– Interfaces
– (some) details and results

Motivation and Goals

LHC. " make most of what we'll get":

review new technologies and paradigms opportunity for change "make most of what we have":

optimize existing architecture, hardware usage Timelines:

2003 major verification of the architecture

2004 further verification or verification of a DIFFERENT architecture

mid 2005 IT Computing Technical Design Report, architecture decided

end 2005 purchasing procedure starts, 10-15 million SFr value

Q3 2006 Installation of disk, cpu and tape resources

Q2 2007 first data taking


Research process

Top-down Assess "boundary conditions": Money Expected workload(s) Technology forecast: PASTA Risk assessment ProductionizabilityTM Event-driven, Bottom-up Debugging, "..this ought to work.." Validate results: Data Challenges


Input: Workload

[Workflow diagram: Data Acquisition / High Level Trigger → raw data → event reconstruction → Event Summary Data → processed data; event simulation feeds the same chain.]

Major workflows:
– Monte Carlo production: event simulation
– data recording (CDR): DAQ, High Level Trigger filtering, stable storage, tape archive
– physics analysis: event reconstruction, selection and re-reconstruction, summary event data, interactive analysis of processed data
– working environments: user home directory filesystems, programming


Load: synthetics / benchmarks

Computing: GEANT4 ≈ 80% integer ops ⇒ SPECint
Storage:
– "home directory": small, random, concurrent access
– CDR: predictable, few streams, very large transfers
– analysis: not predictable, many streams, large transfers

                   Access      Frequency   Size         Single speed   Aggr. speed   Add. services
Home directories   Random      10000/s     10s of TB    10 MB/s        1 GB/s        Backup
Analysis           Random      1000/s      100s of TB   100 MB/s       50 GB/s       —
CDR                Sequential  100/s       PBs          —              5 GB/s        Tape robot

Network: assumes current "rfio" model with multiple large transfers
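As an illustration of how such load profiles can be turned into synthetic benchmarks, the sketch below generates two of the access patterns from the table above: small random "home directory" I/O and a large sequential CDR-style stream. It is a minimal, hypothetical Python example, not one of the actual tools used; file paths, sizes and block sizes are placeholder parameters.

```python
import os
import random
import time

def sequential_write(path, total_mb=256, block_kb=1024):
    """CDR-like load: one large sequential stream with big blocks."""
    block = b"x" * (block_kb * 1024)
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(total_mb * 1024 // block_kb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())
    return total_mb / (time.time() - start)           # MB/s

def random_read(path, reads=2000, block_kb=4):
    """Home-directory-like load: many small reads at random offsets."""
    size = os.path.getsize(path)
    start = time.time()
    with open(path, "rb") as f:
        for _ in range(reads):
            f.seek(random.randrange(0, max(1, size - block_kb * 1024)))
            f.read(block_kb * 1024)
    return reads / (time.time() - start)               # reads/s

if __name__ == "__main__":
    testfile = "/tmp/loadgen.dat"                      # placeholder path
    print("sequential: %.1f MB/s" % sequential_write(testfile))
    print("random:     %.0f reads/s" % random_read(testfile))
    os.remove(testfile)
```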


Interfaces

From users: workload estimates
To/from management: cost envelope, TCO, risk forecast
To/from service managers: production-relevant features
Dissemination: internal reports, OpenLAB, PASTA (technology forecasts), HEPiX?
How to share the "research workload"? ⇒ need standardized benchmarks


Tools

'Standard benchmarks':
– Bonnie++
– netio, gensync
– CASTOR rfio
– SPECint, SPECfp
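To give a flavour of how results from such tools can be collected programmatically, here is a small Python sketch that runs bonnie++ and picks up its machine-readable CSV summary line. The command-line options and the CSV handling are written from general knowledge of bonnie++ and should be checked against the installed version; the directory, size and user are placeholders.

```python
import subprocess

def run_bonnie(directory="/tmp", size_mb=1024, user="nobody"):
    """Run bonnie++ and return its comma-separated summary fields.

    Flags: -d work directory, -s file size in MB, -u user to run as
    (verify against the locally installed bonnie++ version).
    """
    cmd = ["bonnie++", "-d", directory, "-s", str(size_mb), "-u", user]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    # bonnie++ appends a machine-readable CSV line after the human-readable table
    csv_line = [line for line in out.splitlines() if "," in line][-1]
    return csv_line.split(",")

if __name__ == "__main__":
    print("raw bonnie++ record:", run_bonnie())
```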

Benchmarking framework:
– repeatable results
– specify (or note) relevant external factors
– post-processing, data aggregation
– built around the IBM STAF/STAX (regression) test framework

Managing results: Overview

1: Benchmark — automate the benchmarking process:
– populate the environment configuration
– launch benchmarks and monitors
– retrieve results and measurements
– store all the information in the result repository

2: Store — store all results from the benchmarks in an XML repository
– results stored as XML, queried via XPath

3: Present — create reports to present and analyse benchmark results:
– aggregate results from different benchmarks
– present figures and conclusions in a report
– edit and store reports
– organise and publish reports

Tools: STAF/STAX, XSL/XPath, Web CMS (Cocoon), browser
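The "Store" and "Present" stages above revolve around XML results queried with XPath. As a rough illustration (not the actual repository schema, which is not shown here), the Python sketch below stores one benchmark record and pulls values back out with simple XPath expressions; the element and attribute names, and the numeric values, are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical result record: names and values are placeholders for illustration
record = ET.fromstring("""
<result benchmark="bonnie++" timestamp="2003-10-21T10:00:00">
  <node cpu="2x Xeon 2.4GHz" memory="1GB" kernel="2.4.19smp"/>
  <measurement name="seq_write" value="45.2" unit="MB/s"/>
  <measurement name="seq_read"  value="52.7" unit="MB/s"/>
</result>
""")

repository = ET.Element("repository")
repository.append(record)

# XPath-style queries (ElementTree supports a limited XPath subset)
for m in repository.findall(".//result[@benchmark='bonnie++']/measurement"):
    print(m.get("name"), m.get("value"), m.get("unit"))

kernel = repository.find(".//node").get("kernel")
print("kernel used:", kernel)
```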

Benchmark automation : STAF/STAX

Benchmark flow on the service node (STAX): get_node_configuration → start_monitors → start_benchmarks → get_results → get_measurements → compute_results → store_results

– Node configuration: hardware (CPU, memory), software (/proc entries, kernel)
– STAX libraries (XML + Python): populate node configuration, start monitors and benchmarks, compute and store results
– Monitors: vmstat; Benchmarks: bonnie++, iperf
– Results: one XML file aggregating node configuration, benchmark/monitoring results and a timestamp
– Execution nodes run STAF with the monitors and benchmarks; the service node drives them via STAF/STAX
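For readers unfamiliar with STAX, the sequence above can be pictured as a plain Python driver like the sketch below. It only mimics the step names from the slide; the real implementation is a STAX (XML + Python) job running over STAF agents on the execution nodes, and the helper details here are invented placeholders.

```python
import platform
import subprocess
import time
import xml.etree.ElementTree as ET

def get_node_configuration():
    """Collect a minimal hardware/software description (stand-in for the real probe)."""
    return {"node": platform.node(),
            "kernel": platform.release(),
            "cpuinfo": open("/proc/cpuinfo").readline().strip()}

def start_monitors():
    # vmstat every 5 seconds, running in the background for the whole benchmark
    return subprocess.Popen(["vmstat", "5"], stdout=subprocess.PIPE, text=True)

def start_benchmarks():
    # placeholder benchmark command; the slide uses bonnie++ and iperf
    return subprocess.run(["bonnie++", "-d", "/tmp", "-s", "512", "-u", "nobody"],
                          capture_output=True, text=True).stdout

def store_results(config, monitor_output, benchmark_output):
    # one XML file aggregating configuration, results and a timestamp
    root = ET.Element("result", timestamp=time.strftime("%Y-%m-%dT%H:%M:%S"))
    ET.SubElement(root, "node", **config)
    ET.SubElement(root, "monitor").text = monitor_output
    ET.SubElement(root, "benchmark").text = benchmark_output
    ET.ElementTree(root).write("result.xml")

if __name__ == "__main__":
    config = get_node_configuration()
    mon = start_monitors()
    bench_out = start_benchmarks()
    mon.terminate()
    monitor_out, _ = mon.communicate()
    store_results(config, monitor_out, bench_out)
```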

Result repository browser


Result view


Resources

People:
– IT-ADC ("Architecture and Data Challenges")
– ADC-CA ("CASTOR & AFS")
– ADC-LE ("Expertise")
– ADC-OL ("OpenLAB")
– collaboration with other IT groups
– collaboration with experiments (Data Challenges)

Material: OpenLAB industry collaboration ⇒ OpenCLUSTER + (shared) part of our batch farm

CERN openlab

IBM has now joined.

(slide: Bernd Panzer-Steindel, CERN-IT, CERN Academic Training, 12-16 May 2003)

Research Phase-Space

[Diagram: dimensions of the research phase space — cost, architecture, workload; CPU, interconnect, storage; optimization, debugging, service.]


Architectures

– Sea-of-nodes
– Black-box CPU, disk and tape solutions
– Macrocluster
– Virtual machine


Processor & Motherboards (CPU)

Criteria:
– $$ / SPECint
– SPECint / Watt
– local throughput (memory access, PCI(-X), HyperTransport)
– internal utilization (HT, compilers)

Candidates:
– IA32: Pentium-4, XEON, Pentium-M
– 64-bit: Itanium-2/3, Opteron
– PS-2, Xbox
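A trivial helper for the first two criteria: given a price, a SPECint figure and a power draw, compute cost per SPECint and SPECint per Watt. The numbers in the example are purely illustrative placeholders, not measured or quoted values.

```python
def cpu_metrics(name, price_chf, specint, watts):
    """Return the two headline criteria: cost per SPECint and SPECint per Watt."""
    return {"cpu": name,
            "chf_per_specint": price_chf / specint,
            "specint_per_watt": specint / float(watts)}

if __name__ == "__main__":
    # Placeholder figures for illustration only -- not measured values
    candidates = [
        cpu_metrics("candidate A", price_chf=2000.0, specint=1200, watts=85),
        cpu_metrics("candidate B", price_chf=3500.0, specint=1400, watts=90),
    ]
    for c in sorted(candidates, key=lambda c: c["chf_per_specint"]):
        print("%-12s  %.2f CHF/SPECint  %.1f SPECint/W"
              % (c["cpu"], c["chf_per_specint"], c["specint_per_watt"]))
```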


Example: CPU — AMD Opteron Model 240 on a RIOWORKS HDAMA board

[Block diagram of the board: two 1.4 GHz Opteron CPUs (2 x 64 KB L1, 1024 KB L2 cache each) with on-chip DDR memory controllers (5.3 GB/s, Corsair PC2700, 1 GByte) connected by 3.2 GB/s HyperTransport links; an AMD 8131 PCI-X tunnel with two bridges feeding 64-bit/133 MHz PCI-X, 64-bit/66 MHz and 32-bit/33 MHz PCI slots; Broadcom 5702 Gigabit Ethernet controllers on PCI-X 64-bit/100 MHz; an AMD 8111 HyperTransport I/O hub with EIDE (PIO 0-4, DMA 0-2, UDMA 0-6), USB, SMBus, AC97, LPC, a 10/100 Mbit/s Ethernet controller, and a Promise PDC20319 Serial ATA controller (4 x 150 MB/s SATA) on 32-bit/66 MHz PCI.]


Interconnects

Candidates:
– Fast Ethernet
– Gigabit Ethernet
– 10 Gigabit Ethernet
– Infiniband

Criteria:
– $$ / throughput
– CPU utilization (Linux NAPI, offload engines, RDMA)
– OS support, stability
– (latency)
– influence on applications
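The "$$ / throughput" and "CPU utilization" criteria suggest measuring both at once. The sketch below runs an iperf client against a server and samples /proc/stat before and after to estimate CPU busy time during the transfer; the iperf options and output parsing should be checked against the installed version, and the target host is a placeholder.

```python
import re
import subprocess

def read_cpu_counters():
    """Return (total, idle) jiffies from the aggregate 'cpu' line of /proc/stat."""
    with open("/proc/stat") as f:
        fields = [int(x) for x in f.readline().split()[1:]]
    return sum(fields), fields[3]          # 4th field is 'idle'

def run_iperf(server, seconds=30):
    t0, i0 = read_cpu_counters()
    out = subprocess.run(["iperf", "-c", server, "-t", str(seconds), "-f", "m"],
                         capture_output=True, text=True).stdout
    t1, i1 = read_cpu_counters()
    busy = 1.0 - (i1 - i0) / float(t1 - t0)
    # last "... Mbits/sec" figure reported by the client (output format may vary)
    mbits = float(re.findall(r"([\d.]+)\s+Mbits/sec", out)[-1])
    return mbits, busy

if __name__ == "__main__":
    throughput, busy = run_iperf("testnode.example.org")   # placeholder host
    print("%.0f Mbit/s at ~%.0f%% CPU busy on the client" % (throughput, busy * 100))
```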

CERN High Performance Networking: Infiniband test results at CERN

Test setup: dual-processor XEON 2.4 GHz and Opteron 1.4 GHz nodes (1 GByte RAM each) and dual-processor Itanium nodes, connected through an 8x8 INFINISWITCH.

(Arie Van Praag, CERN IT ADC, [email protected])

Infiniband transfer modes — InfiniSwitch test results between two XEON 2.4 GHz nodes, throughput in MByte/s:

Message size (bytes)   RDMA    TRC     TUD
10                     3.8     2.3     3.9
100                    37.3    23.4    38.6
1000                   315.2   232.7   369.4
2000                   453.5   464.7   486
3000                   528.3   611.8   495.5
5000                   599.8   658.7   494.3
10000                  676.9   706.3   502.7
50000                  731.5   739.8   498.4
100000                 740     744     492.1
250000                 744.3   744.7   490.5
500000                 745.6   746.1   496.9
750000                 745.8   746.2   501
1000000                747.9   747.9   496.9
2000000                747.4   747.4   491.5
4000000                747.9   747.5   490.7

Machines used: double Xeon 2.4 GHz; HCA: Mellanox; switch: Fabric Networks (InfiniSwitch) 8x8.

(Arie Van Praag, CERN IT ADC, [email protected])

Infiniband tests at CERN >> RDMA: bandwidth (MB/s) and CPU usage (%)

Message size (bytes)   Polling   Event-driven   Initiator CPU %   Receiver CPU %
10                     2.9       2.9            93                9
100                    33.3      33.6           100               11
1000                   455.0     459.7          86                10
2000                   605.4     598.2          100               10
3000                   651.5     645.0          80                9
5000                   686.2     688.3          86                12
10000                  722.7     725.4          78                10
50000                  742.6     742.2          54                9
100000                 745.6     745.3          37                9
250000                 746.9     746.8          21                8
500000                 747.3     747.2          12                5
750000                 747.4     747.3          7                 4
1000000                747.4     747.5          5                 3
2000000                747.2     747.1          2                 2
4000000                747.3     747.3          1                 1

Machines used: double Xeon 2.4 GHz; HCA: Mellanox; switch: Fabric Networks (InfiniSwitch) 8x8.

(Arie Van Praag, CERN IT ADC, [email protected])

10 Gigabit Ethernet tests

Setup: 2x 1.5 GHz Itanium-2 (CERN) to 2x 1 GHz Itanium-2 (Amsterdam), unidirectional, iperf.

[Charts: throughput for IA64 -> IA64 with no tuning vs. full tuning, for 1 and 12 streams and MTUs of 1500 B, 9000 B and 16114 B; stability of a 15-hour transfer; fairness (effect of 1-20 parallel streams, 2x 1.5 GHz Itanium-2 back-to-back, unidirectional, GenSink).]

Storage

Areas:
– file systems: StorageTank, GFS, Lustre, Terrascale, ...
– HSM (CASTORng)
– hardware: SATA, IDE + 3ware
– RAID options: hardware vs. software, RAID-1 vs. RAID-5
– network access: NAS (CASTOR / rfio), iSCSI, Linux NBD, FC & SAN
– "black boxes": DataDirect, StorageTank, LinuxNetworx


Storage criteria

– cost: $$ / GB, $$ / (MB/s)
– raw performance: MB/s (streaming, random)
– reliability, MTBF, risk
– implications for the architecture (NU"S"A, scheduler)
– security: ObjectStore vs. "trusted cluster"-only, ACLs, authentication schemes supported
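The "reliability, MTBF, risk" criterion can be made concrete with a back-of-envelope model. The sketch below estimates the expected number of RAID-5 data-loss events per array-year under the usual (and optimistic) assumption of independent, exponentially distributed disk failures; the disk count, MTBF and rebuild time are placeholder values, not measurements.

```python
import math

HOURS_PER_YEAR = 8760.0

def raid5_annual_loss_rate(disks, mtbf_hours, rebuild_hours):
    """Expected RAID-5 double-failure (data loss) events per array-year,
    assuming independent, exponentially distributed disk failures."""
    # expected first-disk failures per year in the array
    first_failures = disks * HOURS_PER_YEAR / mtbf_hours
    # chance that one of the remaining disks fails during the rebuild window
    second_during_rebuild = 1.0 - math.exp(-(disks - 1) * rebuild_hours / mtbf_hours)
    return first_failures * second_during_rebuild

if __name__ == "__main__":
    # Placeholder figures: 8-disk array, 500000 h quoted MTBF, 12 h rebuild
    rate = raid5_annual_loss_rate(disks=8, mtbf_hours=500000, rebuild_hours=12)
    print("expected data-loss events per array-year: %.6f" % rate)
```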


Storage: iSCSI

iSCSI performance for different software initiators. The server is a Eurologic iCS2100 IP-SAN storage appliance; the client is test13*, running kernel 2.4.19smp. (Presented at CHEP03.)

[Chart: throughput in MByte/s and CPU load in % for three software initiators (ibmiscsi-1.2.2, linux-iscsi-2.1.2.9, linux-iscsi-3.1.0.6) across bonnie-style tests: sequential block output, sequential block input, write, sequential read and random read.]


Optimization & production issues

(tight collaboration with the service providers)

Performance improvements on existing hardware:
– disk server: disk scheduler and VM tuning
– local file systems (ext3, XFS) and RAID options
– (network) data stream priorities

Manageability:
– monitoring sensors (thermal, SMART, 3ware controller)
– serial console and remote reset
– kernel crash dumps, KDB

Collaboration on call for tenders, TCO studies
New hardware support, driver updates
Seamless transition to debugging...
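As an illustration of the "monitoring sensors" item, the sketch below polls disk health and temperature via smartctl (smartmontools) and snapshots the kernel VM tunables from /proc; it is a hypothetical stand-in for the actual sensors, the device path is a placeholder, and the parsing is deliberately loose since smartctl attribute layout varies between drives.

```python
import os
import subprocess

def smart_summary(device="/dev/sda"):          # placeholder device
    """Health verdict and temperature line from smartctl."""
    health = subprocess.run(["smartctl", "-H", device],
                            capture_output=True, text=True).stdout
    attrs = subprocess.run(["smartctl", "-A", device],
                           capture_output=True, text=True).stdout
    verdict = [l for l in health.splitlines() if "overall-health" in l.lower()]
    temp = [l for l in attrs.splitlines() if "Temperature" in l]
    return (verdict or ["health: n/a"])[0], (temp or ["temperature: n/a"])[0]

def vm_settings(path="/proc/sys/vm"):
    """Snapshot of the kernel VM tunables, for recording alongside benchmark results."""
    out = {}
    for name in sorted(os.listdir(path)):
        try:
            out[name] = open(os.path.join(path, name)).read().strip()
        except (IOError, OSError):
            pass                                # some entries are not readable
    return out

if __name__ == "__main__":
    print(smart_summary())
    for k, v in vm_settings().items():
        print("vm.%s = %s" % (k, v))
```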


Putting it all together: Data Challenges

Large-scale scalability exercises: test whole workflows through all the equipment.
– Experiment DCs: the experiment's workload, "reachable" goals, adequate hardware (⇒ TDRs); require (certain) stability of the IT base components
– IT DCs: "break it" — scalability / capacity limits

IT Data Challenge

[Setup diagram: CPU servers (TBED001-48), disk servers (LXSHARE001D-119D plus 20x Hoplapro), and tape servers (TPSRV001-124) spread over several Gigabit switches and the 513-V / 613-R backbone.]

(slide: Bernd Panzer-Steindel, CERN-IT, CERN Academic Training, 12-16 May 2003)

IT "Tape" Data Challenge performance

[Chart: aggregate tape-writing performance (GBytes/s) over time in minutes: ~920 MB/s average, running in parallel with an increasing production service; a dip during a daytime tape server intervention.]

(slide: Bernd Panzer-Steindel, CERN-IT, CERN Academic Training, 12-16 May 2003)

Hardware and network topology, ALICE-IT DC IV

[Topology diagram: CPU servers (TBED0001-48 on Gigabit Ethernet, others on Fast Ethernet), disk servers (LXSHARE 01D-36D), DOT HILL arrays and 20 distributed tape servers behind 4 + 3 Gigabit switches, 2 Fast Ethernet switches and a 4 Gbps backbone. Total: 192 CPU servers (96 on GbE, 96 on FE), 36 disk servers, 20 tape servers.]

(slide: Bernd Panzer-Steindel, CERN-IT, CERN Academic Training, 12-16 May 2003)

ALICE-IT DC IV

[Chart: aggregate disk server performance (KB/s) in 40 s time intervals, writing to tape, shown against the average and the goal, from Friday 6th to Friday 13th December 2002.]

(slide: Bernd Panzer-Steindel, CERN-IT, CERN Academic Training, 12-16 May 2003)