End-to-end lightpaths for large file transfers over high speed long distance networks
Corrie Kost, Steve McDonald (TRIUMF), Bryan Caron (Alberta), Wade Hong (Carleton)
CHEP03

Outline
• TB data transfer from TRIUMF to CERN (iGrid 2002)
• e2e Lightpaths
• 10 GbE technology
• Performance tuning (Disk I/O, TCP)
• Throughput results
• Future plans and activities
The Birth of a Demo
• Suggestion from Canarie to the Canadian HEP community to participate at iGrid2002
• ATLAS Canada discussed the demo at a Vancouver meeting in late May
• Initial meeting at TRIUMF by participants in mid July to plan the demo
• Sudden realization that there was a very short time to get all elements in place!
So what ~~are we going to~~ did we do?
• Demonstrate a manually provisioned “e2e” lightpath
• Transfer 1 TB of ATLAS MC data generated in Canada from TRIUMF to CERN
• Test out 10GbE technology and channel bonding
• Establish a new benchmark for high-performance disk-to-disk throughput over a large distance

TRIUMF
• TRI-University Meson Facility
• Operated as a joint venture by Alberta, UBC, Carleton, SFU and Victoria
• Located on the UBC campus in Vancouver
• Proposed location for the Canadian ATLAS Tier 1.5 site
The iGrid2002 Network
e2e Lightpaths
• Core design principle of CA*net 4
• Ultimately to give control of lightpath creation, teardown and routing to the end user
• Users “own” their resources and can negotiate sharing with other parties
  – Hence, “Customer Empowered Networks”
• Ideas evolved from initial work on OBGP
• Provides a flexible infrastructure for emerging grid applications via Web Services
• Grid services architecture for user control and management
• NEs are distributed objects or agents whose methods can be invoked remotely
• Use OGSA and Jini/JavaSpaces for e2e customer control
• Alas, can only do things manually today

CA*net 4 Topology
[Map: CA*net 4 OC192 topology — nodes from Victoria and Vancouver east to Halifax and St. John's, with existing and possible future breakouts to Seattle, Minneapolis, Chicago, Buffalo, Albany, Boston and New York]
We are live continent to continent!
• e2e lightpath up and running 21:45 CET
traceroute to cern-10g (192.168.2.2), 30 hops max, 38 byte packets
 1  cern-10g (192.168.2.2)  161.780 ms  161.760 ms  161.754 ms

Intel 10GbE Cards
• Intel kindly loaned us two of their PRO/10GbE LR server adapter cards, despite the end of their alpha program
  – Based on the Intel® 82597EX 10 Gigabit Ethernet Controller
Extreme Networks
• Extreme Networks generously loaned and shipped us two BlackDiamond 6808s with their new 10GbE LRi blades
Hardware Configurations
[Diagram: TRIUMF and CERN server configurations]
CERN Server (Receive Host)
• SuperMicro P4DL6 (dual Xeon 2 GHz), 400 MHz front-side bus
• 1 GB DDR2100 RAM
• Dual-channel Ultra160 onboard SCSI
• SysKonnect 9843 SX GbE
• 2 independent PCI buses, 6 PCI-X 64-bit/133 MHz capable
• 2 3ware 7850 RAID controllers, 6 IDE drives on each controller
• WD Caviar 120 GB drives with 8 MB cache, in an RMC4D chassis from HARDDATA
• RH7.3 on a 13th drive connected to on-board IDE
TRIUMF Server (Send Host)
• SuperMicro P4DL6 (dual Xeon 2 GHz), 400 MHz front-side bus
• 1 GB DDR2100 RAM
• Dual-channel Ultra160 onboard SCSI
• SysKonnect 9843 SX GbE
• 2 independent PCI buses, 6 PCI-X 64-bit/133 MHz capable
• 3ware 7850 RAID controller
• 2 Promise Ultra100 TX2 controllers
Operating System
• Red Hat 7.3 based, Linux kernel 2.4.18-3
  – Needed to support filesystems > 1 TB
• Upgrades and patches
  – Patched to 2.4.18-10
  – Intel PRO/10GbE Linux driver (early stable)
  – SysKonnect 9843 SX Linux driver (latest)
  – Ported Sylvain Ravot’s tcp tune patches
Black Magic
• Optimizing system performance doesn’t come by default
• Performance tuning is very much an art requiring Black Magic
• Disk I/O optimization
• TCP tuning
Disk I/O Black Magic

• min/max readahead on both systems
sysctl -w vm.min-readahead=127
sysctl -w vm.max-readahead=256

• bdflush on receive host
sysctl -w vm.bdflush="2 500 0 0 500 1000 60 20 0"
or
echo 2 500 0 0 500 1000 60 20 0 > /proc/sys/vm/bdflush

• bdflush on send host
sysctl -w vm.bdflush="30 500 0 0 500 3000 60 20 0"
or
echo 30 500 0 0 500 3000 60 20 0 > /proc/sys/vm/bdflush

• Disk I/O elevators (minimal impact noticed)
  – /sbin/elvtune allows some control of latency vs. throughput
  – Max I/O scheduler read latency set to 512 (default 8192)
  – Max I/O scheduler write latency set to 1024 (default 16384)

• noatime – disables updating the last time a file was accessed (typical for file servers)
mount -t ext2 -o noatime /dev/md0 /raid
  – Typically ext3 writes at 90 MB/s while ext2 writes at 190 MB/s; reads are minimally affected. We always used ext2.
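Collected together, the send-host disk I/O settings above might look like the following sketch of a tuning script. The values are those quoted on the slides for 2.4-era kernels (vm.bdflush, vm.*-readahead and elvtune no longer exist on modern kernels); /dev/md0 and /raid follow the slides' mount example.

```shell
#!/bin/sh
# Sketch: iGrid2002 send-host disk I/O tuning (Linux 2.4-era, requires root).
# Values are from the slides; device and mount point are illustrative.

# Readahead window
sysctl -w vm.min-readahead=127
sysctl -w vm.max-readahead=256

# bdflush tuning for the send host
sysctl -w vm.bdflush="30 500 0 0 500 3000 60 20 0"

# Relax I/O scheduler latencies (read 512, write 1024; minimal impact seen)
/sbin/elvtune -r 512 -w 1024 /dev/md0

# Mount the RAID as ext2 without access-time updates
mount -t ext2 -o noatime /dev/md0 /raid
```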
• IRQ affinity (ideally process affinity too, but that requires a 2.5 kernel)

[root@thunder root]# more /proc/interrupts
           CPU0       CPU1
  0: 15723114          0   IO-APIC-edge   timer
  1:       12          0   IO-APIC-edge   keyboard
  2:        0          0   XT-PIC         cascade
  8:        1          0   IO-APIC-edge   rtc
 10:        0          0   IO-APIC-level  usb-ohci
 14:       22          0   IO-APIC-edge   ide0
 15:   227234          2   IO-APIC-edge   ide1
 16:      126          0   IO-APIC-level  aic7xxx
 17:       16          0   IO-APIC-level  aic7xxx
 18:       91          0   IO-APIC-level  ide4, ide5, 3ware Storage Controller
 20:       14          0   IO-APIC-level  ide2, ide3
 22:  2296662          0   IO-APIC-level  SysKonnect SK-98xx
 24:        2          0   IO-APIC-level  eth3
 26:  2296673          0   IO-APIC-level  SysKonnect SK-98xx
 30: 26640812          0   IO-APIC-level  eth0
NMI:        0          0
LOC: 15724196   15724154
ERR:        0
MIS:        0

echo 1 > /proc/irq/18/smp_affinity                         # use CPU0
echo 2 > /proc/irq/18/smp_affinity                         # use CPU1
echo 3 > /proc/irq/18/smp_affinity                         # use either
cat /proc/irq/prof_cpu_mask > /proc/irq/18/smp_affinity    # reset to default

TCP Black Magic
• Typically suggested TCP and net buffer tuning
sysctl -w net.ipv4.tcp_rmem="4096 4194304 4194304"
sysctl -w net.ipv4.tcp_wmem="4096 4194304 4194304"
sysctl -w net.ipv4.tcp_mem="4194304 4194304 4194304"
sysctl -w net.core.rmem_default=65535
sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.wmem_default=65535
sysctl -w net.core.wmem_max=8388608
• Sylvain Ravot’s tcp tune patch parameters (ssthresh, congestion avoidance increment, slow start increment)
sysctl -w net.ipv4.tcp_tune="115 115 0"
• Linux 2.4 is retentive: it caches TCP control information for a destination for 10 minutes
  – To avoid caching:
sysctl -w net.ipv4.route.flush=1
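The 4 MB window above is well below the bandwidth-delay product of this path, which is why a single stream cannot fill the pipe and parallel streams help. A quick integer sketch (our arithmetic, not from the slides) using the ~200 ms e2e RTT and GbE line rate:

```shell
# Bandwidth-delay product sketch for the TRIUMF-CERN path.
rtt_ms=200                 # expected e2e round-trip time
rate_bps=1000000000        # 1 Gbps (GbE NIC line rate)
window=4194304             # tcp_rmem/tcp_wmem maximum from the slides (4 MB)

# Bytes in flight needed to fill the pipe at full rate
bdp=$((rate_bps / 8 * rtt_ms / 1000))
echo "BDP: $bdp bytes"                    # 25 MB >> the 4 MB window

# Per-stream throughput ceiling with a 4 MB window: window / RTT, in Mbps
per_stream_mbps=$((window * 8 / rtt_ms * 1000 / 1000000))
echo "per-stream cap: ~$per_stream_mbps Mbps"
```

With a ~167 Mbps per-stream ceiling, a multi-stream transfer (such as bbftp with 13 streams) is needed to approach GbE rates at this distance.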
Testing Methodologies
• Began testing with a variety of bandwidth characterization tools
  – pipechar, pchar, ttcp, iperf, netpipe, pathchar, etc.
• Evaluated high-performance file transfer applications
  – bbftp, bbcp, tsunami, pftp
• Developed scripts to automate testing and to scan the parameter space for a number of the tools
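The parameter-scan scripts themselves are not shown on the slides; a minimal sketch of the idea, enumerating iperf runs over a window-size × stream-count grid (host name from the traceroute slide; -w/-P/-t are iperf's standard window, parallel-stream and duration flags):

```shell
# Sketch: generate iperf command lines over a window x stream grid.
# Print them for review; pipe the output to sh to actually run the scan.
scan_iperf() {
    for window in 65536 1048576 4194304; do
        for streams in 1 4 13; do
            echo "iperf -c cern-10g -w $window -P $streams -t 30"
        done
    done
}
scan_iperf
```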
A Real WAN Emulator
• First tested with a hardware WAN emulator, but it had its limitations (RTT < 100 ms, MTU = 1500)
• Canarie Loopback – TRIUMF – Starlight: RTT = 96 ms
  – Allowed testing at 1 GbE due to capacity constraints
• Canarie Loopback2 – TRIUMF – Starlight: RTT = 193 ms
  – Close to the expected TRIUMF – CERN e2e RTT of 200 ms
So what ~~are we going to~~ did we do?
• Demonstrate a manually provisioned “e2e” lightpath
• Transfer 1 TB of ATLAS MC data generated in Canada from TRIUMF to CERN
• Test out 10GbE technology and channel bonding
• Establish a new benchmark for high-performance disk-to-disk throughput over a large distance
Comparative Results (TRIUMF to CERN)

Tool                          Transferred   Average    Max Avg
wuftp (100 MbE)               600 MB        3.4 Mbps
wuftp (10 GbE)                6442 MB       71 Mbps
iperf                         275 MB        940 Mbps   1136 Mbps
pftp                          600 MB        532 Mbps
bbftp (13 streams)            1.4 TB        666 Mbps   731 Mbps
Tsunami – disk to disk        0.5 TB        700 Mbps   825 Mbps
Tsunami – disk to memory      12 GB         > 1 Gbps
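For scale, a back-of-envelope check (our arithmetic, not on the slides) of what the bbftp average implies for the full transfer:

```shell
# Sketch: wall-clock time for the 1.4 TB bbftp transfer at its 666 Mbps average.
bytes=1400000000000          # 1.4 TB (decimal)
rate_bps=666000000           # 666 Mbps average throughput

seconds=$((bytes * 8 / rate_bps))
minutes=$((seconds / 60))
echo "1.4 TB at 666 Mbps: ${seconds} s (~${minutes} min)"   # roughly 4.7 hours
```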
Sunday Night Summaries
Exceeding 1 Gbit/sec … (using tsunami)
So what ~~are we going to~~ did we do?