End-to-end lightpaths for large file transfers over high speed long distance networks

Corrie Kost, Steve McDonald (TRIUMF), Bryan Caron (Alberta), Wade Hong (Carleton)

Outline

• TB data transfer from TRIUMF to CERN (iGrid 2002)
• e2e Lightpaths
• 10 GbE technology
• Performance tuning (Disk I/O, TCP)
• Throughput results
• Future plans and activities

The Birth of a Demo

• Suggestion from Canarie to the Canadian HEP community to participate at iGrid2002
• ATLAS Canada discussed the demo at a Vancouver meeting in late May
• Initial meeting at TRIUMF by participants in mid July to plan the demo
• Sudden realization that there was a very short time to get all elements in place!

So what did we do?

• Demonstrate a manually provisioned “e2e” lightpath
• Transfer 1 TB of ATLAS MC data generated in Canada from TRIUMF to CERN
• Test out 10GbE technology and channel bonding
• Establish a new benchmark for high performance disk-to-disk throughput over a large distance

TRIUMF

• TRI University Meson Facility
• Operated as a joint venture by Alberta, UBC, Carleton, SFU and Victoria
• Located on the UBC campus in Vancouver
• Proposed location for Canadian ATLAS Tier 1.5 site

The iGrid2002 Network

e2e Lightpaths

• Core design principle of CA*net 4
• Ultimately to give control of lightpath creation, teardown and routing to the end user
• Users “own” their resources and can negotiate sharing with other parties
  – Hence, “Customer Empowered Networks”
• Ideas evolved from initial work on OBGP
• Provides a flexible infrastructure for emerging grid applications via Web Services

e2e Lightpaths

• Grid services architecture for user control and management
• NEs are distributed objects or agents whose methods can be invoked remotely
• Use OGSA and Jini/JavaSpaces for e2e customer control
• Alas, can only do things manually today

CA*net 4 Topology

[Figure: map of the CA*net 4 OC-192 national topology, showing CA*net 4 nodes from Victoria and Vancouver in the west to Halifax and St. John's in the east, along with possible future breakouts and possible future links or options toward Seattle, Minneapolis, Chicago, Buffalo, Albany, Boston and New York]

We are live continent to continent!

• e2e lightpath up and running at 21:45 CET

traceroute to cern-10g (192.168.2.2), 30 hops max, 38 byte packets
 1  cern-10g (192.168.2.2)  161.780 ms  161.760 ms  161.754 ms

Intel 10GbE Cards

• Intel kindly loaned us 2 of their Pro/10GbE LR server adapter cards despite the end of their Alpha program
  – based on the Intel® 82597EX 10 Gigabit Ethernet Controller

Extreme Networks

• Extreme Networks generously loaned and shipped to us 2 Black Diamond 6808s with their new 10GbE LRi blades

Hardware Configurations

[Diagram: TRIUMF and CERN hardware configurations]

CERN Server (Receive Host)
• SuperMicro P4DL6 (dual Xeon 2 GHz), 400 MHz front side bus
• 1 GB DDR2100 RAM
• Dual Channel Ultra 160 onboard SCSI
• SysKonnect 9843 SX GbE
• 2 independent PCI buses, 6 PCI-X 64-bit/133 MHz capable slots
• 2 × 3ware 7850 RAID controllers, 6 IDE drives on each 3ware controller
• RH7.3 on a 13th drive connected to the on-board IDE
• RMC4D from HARDDATA
• WD Caviar 120 GB drives with 8 MB cache

TRIUMF Server (Send Host)
• SuperMicro P4DL6 (dual Xeon 2 GHz), 400 MHz front side bus
• 1 GB DDR2100 RAM
• Dual Channel Ultra 160 onboard SCSI
• SysKonnect 9843 SX GbE
• 2 independent PCI buses, 6 PCI-X 64-bit/133 MHz capable slots
• 3ware 7850 RAID controller
• 2 Promise Ultra 100 Tx2 controllers

Software Configuration

• Redhat 7.3 based kernel 2.4.18-3
  – Needed to support filesystems > 1 TB
• Upgrades and patches
  – Patched to 2.4.18-10
  – Intel Pro 10GbE Linux driver (early stable); a bring-up sketch follows below
  – SysKonnect 9843 SX Linux driver (latest)
  – Ported Sylvain Ravot’s tcp tune patches
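• Not from the slides: a minimal sketch of how these drivers would be brought up on a 2.4 kernel. The module names (ixgb for the Intel 82597EX, sk98lin for the SysKonnect SK-98xx), the interface name, the address and the jumbo-frame MTU are our assumptions, not values taken from the demo configuration.

# Load the NIC drivers (module names assumed from the Linux 2.4 driver tree)
modprobe ixgb        # Intel PRO/10GbE LR (82597EX)
modprobe sk98lin     # SysKonnect SK-98xx GbE
# Bring an interface up; address and jumbo-frame MTU are illustrative placeholders
ifconfig eth2 192.168.2.1 netmask 255.255.255.0 mtu 9000 up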

Black Magic

• Optimizing system performance doesn’t come by default
• Performance tuning is very much an art requiring Black Magic

• Disk I/O Optimization
• TCP Tuning

Disk I/O Black Magic

• min/max readahead on both systems

sysctl -w vm.min-readahead=127
sysctl -w vm.max-readahead=256

• bdflush on receive host

sysctl -w vm.bdflush="2 500 0 0 500 1000 60 20 0"
or echo 2 500 0 0 500 1000 60 20 0 > /proc/sys/vm/bdflush

• bdflush on send host

sysctl -w vm.bdflush="30 500 0 0 500 3000 60 20 0"
or echo 30 500 0 0 500 3000 60 20 0 > /proc/sys/vm/bdflush

Disk I/O Black Magic

• Disk I/O elevators (minimal impact noticed)
  – /sbin/elvtune allows some control of latency vs. throughput (an example invocation is sketched below)
  – Max I/O scheduler read latency set to 512 (default 8192)
  – Max I/O scheduler write latency set to 1024 (default 16384)
• atime
  – Disables updating the last time a file was accessed (typical for file servers)
  mount -t ext2 -o noatime /dev/md0 /raid
  – Typically ext3 writes at ~90 MB/s while ext2 writes at ~190 MB/s; reads are minimally affected. We always used ext2.
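• The slides give the elevator latencies but not the command line; a minimal sketch of applying those values with elvtune on a 2.4 kernel, assuming the software RAID device is /dev/md0:

# Set max read latency to 512 and max write latency to 1024 (defaults 8192/16384)
/sbin/elvtune -r 512 -w 1024 /dev/md0
# Query the current settings to verify
/sbin/elvtune /dev/md0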

Disk I/O Black Magic

• IRQ Affinity
  – Would also need process affinity, but that requires a 2.5 kernel

[root@thunder root]# more /proc/interrupts
           CPU0       CPU1
  0:   15723114          0   IO-APIC-edge   timer
  1:         12          0   IO-APIC-edge   keyboard
  2:          0          0   XT-PIC         cascade
  8:          1          0   IO-APIC-edge   rtc
 10:          0          0   IO-APIC-level  usb-ohci
 14:         22          0   IO-APIC-edge   ide0
 15:     227234          2   IO-APIC-edge   ide1
 16:        126          0   IO-APIC-level  aic7xxx
 17:         16          0   IO-APIC-level  aic7xxx
 18:         91          0   IO-APIC-level  ide4, ide5, 3ware Storage Controller
 20:         14          0   IO-APIC-level  ide2, ide3
 22:    2296662          0   IO-APIC-level  SysKonnect SK-98xx
 24:          2          0   IO-APIC-level  eth3
 26:    2296673          0   IO-APIC-level  SysKonnect SK-98xx
 30:   26640812          0   IO-APIC-level  eth0
NMI:          0          0
LOC:   15724196   15724154
ERR:          0
MIS:          0

echo 1 > /proc/irq/18/smp_affinity                        # use CPU0
echo 2 > /proc/irq/18/smp_affinity                        # use CPU1
echo 3 > /proc/irq/18/smp_affinity                        # use either
cat /proc/irq/prof_cpu_mask > /proc/irq/18/smp_affinity   # reset to default
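• Not in the slides, but a quick way to confirm which CPU is actually servicing the storage and NIC interrupts while a transfer runs (device names follow the listing above):

# Re-read /proc/interrupts every second and watch the per-CPU counters grow
watch -n 1 "grep -E '3ware|SK-98xx' /proc/interrupts"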

TCP Black Magic

• Typically suggested TCP and net buffer tuning

sysctl -w net.ipv4.tcp_rmem="4096 4194304 4194304"
sysctl -w net.ipv4.tcp_wmem="4096 4194304 4194304"
sysctl -w net.ipv4.tcp_mem="4194304 4194304 4194304"

sysctl -w net.core.rmem_default=65535
sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.wmem_default=65535
sysctl -w net.core.wmem_max=8388608
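• As a sanity check on buffer sizes like these, the bandwidth-delay product gives the amount of data that must be in flight to keep the pipe full. A small shell sketch with illustrative numbers (1 Gbps and the roughly 200 ms TRIUMF–CERN RTT), not values prescribed by the slides:

# BDP in bytes = rate (bit/s) / 8 * RTT (s)
# 1 Gbps over a 200 ms path needs roughly 25,000,000 bytes (~25 MB) in flight
echo $(( 1000000000 / 8 * 200 / 1000 ))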

TCP Black Magic

• Sylvain Ravot’s tcp tune patch parameters (ssthresh, congestion avoidance increment, slow start increment)

sysctl -w net.ipv4.tcp_tune="115 115 0"

• Linux 2.4 retentive TCP
  – Caches TCP control information for a destination for 10 mins
  – To avoid caching:

sysctl -w net.ipv4.route.flush=1

Testing Methodologies

• Began testing with a variety of bandwidth characterization tools
  – pipechar, pchar, ttcp, netpipe, pathchar, etc.
• Evaluated high performance file transfer applications
  – bbftp, bbcp, tsunami, pftp
• Developed scripts to automate the tools and to scan the parameter space for a number of them (a sketch of this kind of scan follows below)
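• The scan scripts themselves are not reproduced in the slides; the following is our own minimal sketch of the idea, using iperf with a placeholder target host and illustrative window sizes and stream counts:

#!/bin/sh
# Scan TCP window size and parallel-stream count, logging iperf throughput.
# "cern-10g", the window sizes and the stream counts are placeholders.
for win in 256K 1M 4M 8M; do
  for streams in 1 2 4 8 13; do
    echo "window=$win streams=$streams" >> iperf_scan.log
    iperf -c cern-10g -w "$win" -P "$streams" -t 30 >> iperf_scan.log
  done
done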

A Real WAN Emulator

• First tested with a hardware WAN emulator but it had its limitations (RTT < 100 ms, MTU = 1500)
• Canarie Loopback – TRIUMF – Starlight: RTT = 96 ms
  – Allowed testing at 1 GbE due to capacity constraints
• Canarie Loopback2 – TRIUMF – Starlight: RTT = 193 ms
  – Close to the expected TRIUMF – CERN e2e RTT of 200 ms

So what did we do?

• Demonstrate a manually provisioned “e2e” lightpath
• Transfer 1 TB of ATLAS MC data generated in Canada from TRIUMF to CERN
• Test out 10GbE technology and channel bonding
• Establish a new benchmark for high performance disk-to-disk throughput over a large distance

Comparative Results (TRIUMF to CERN)

Tool                        Transferred   Average     Max Avg
wuftp 100 MbE               600 MB        3.4 Mbps
wuftp 10 GbE                6442 MB       71 Mbps
iperf                       275 MB        940 Mbps    1136 Mbps
pftp                        600 MB        532 Mbps
bbftp (13 streams)          1.4 TB        666 Mbps    731 Mbps
Tsunami (disk to disk)      0.5 TB        700 Mbps    825 Mbps
Tsunami (disk to memory)    12 GB         > 1 Gbps
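• For reference, a multi-stream bbftp transfer is invoked roughly as below. Only the 13-stream count comes from the table; the user, host and file paths are placeholders, and the exact command used in the demo is not shown in the slides.

# bbftp with 13 parallel TCP streams (-p); user, paths and host are placeholders
bbftp -u atlas -p 13 -e 'put /raid/mcdata.tar /data/mcdata.tar' cern-10g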

Sunday Nite Summaries

Exceeding 1 Gbit/sec … (using tsunami)

So what did we do?

• Demonstrate a manually provisioned “e2e” lightpath
• Transfer 1 TB of ATLAS MC data generated in Canada from TRIUMF to CERN
• Test out 10GbE technology and channel bonding
• Establish a new benchmark for high performance disk-to-disk throughput over a large distance

Lessons Learned

• Linux software RAID is faster than most conventional SCSI and IDE based RAID systems
• One controller for each drive; the more disk spindles the better
• Channel bonding of two GbEs seems to work very well (on an unshared link); a configuration sketch follows below
• The larger the files, the better the throughput
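• The bonding setup itself is not shown in the slides; a minimal 2.4-kernel-style sketch of round-robin bonding of two GbE interfaces, with the interface names and IP address as placeholders:

# Round-robin (mode=0) bonding of two GbE NICs; names and address are placeholders
modprobe bonding mode=0 miimon=100
ifconfig bond0 192.168.2.1 netmask 255.255.255.0 up
ifenslave bond0 eth1 eth2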

Further Investigations

• Linux TCP/IP network stack performance
  – Efficient copy routines (zero copy, read-copy update)
• Stream Control Transmission Protocol
• Scheduled Transfer Protocol
  – OS bypass and zero copy
• Web100, Net100, DRS

Acknowledgements

• Canarie
  – Bill St. Arnaud, Rene Hatem, Damir Pobric, Thomas Tam, Jun Jian
• ATLAS Canada
  – Mike Vetterli, Randall Sobie, Jim Pinfold, Pekka Sinervo, Gerald Oakham, Bob Orr, Michel Lefebvre, Richard Keeler
• HEPnet Canada
  – Dean Karlen
• TRIUMF
  – Peter Gumplinger, Fred Jones, Mike Losty, Jack Chakhalian, Renee Poutissou
• BCNET
  – Mike Hrybyk, Marilyn Hay, Dennis O’Reilly, Don McWilliams
• BCIT
  – Bill Rutherford
• Extreme Networks
  – Amyn Pirmohamed, Steven Flowers, John Casselman, Darrell Clarke, Rob Bazinet, Damaris Soellner
• HardData
  – Maurice Hilarius
• Intel Corporation
  – Hugues Morin, Caroline Larson, Peter Molnar, Harrison Li, Layne Flake, Jesse Brandeburg
• Netera
  – Gary Finley
• Indiana University
  – Mark Meiss, Stephen Wallace
• Caltech
  – Sylvain Ravot, Harvey Newman
• CERN
  – Olivier Martin, Paolo Moroni, Martin Fluckiger, Stanley Cannon, J.P. Martin-Flatin
• SURFnet/Universiteit van Amsterdam
  – Pieter de Boer, Dennis Paus, Erik Radius, Erik-Jan Bos, Leon Gommans, Bert Andree, Cees de Laat
• Yotta Yotta
  – Geoff Hayward, Reg Joseph, Ying Xie, E. Siu
• Jalaam
  – Loki Jorgensen

CA*net4 International Grid Testbed Proposal

• Motivation
  – How can we build upon the successful high performance bulk data transfer demonstrations at iGrid 2002 and the Canarie Advanced Networks workshop?
• How about a persistent testbed?
  – Would allow participants to develop and adapt grid-specific applications to utilize the evolving CA*net4 Lightpath Grid Service to establish e2e lightpaths
  – Would also serve as an experimental network for investigating and testing protocols for high speed long distance optical networks

CA*net4 International Grid Testbed Proposal

• Proposal
  – Create an international testbed in collaboration with international partners to advance the realization and utilization of high bandwidth-delay networks
  – Allow for the creation of e2e lightpaths spanning CA*net4 and SURFnet to CERN with bandwidths up to 10 Gbps
• Initial Participants
  – Canarie, CERN, SURFnet, TRIUMF, HEPNet Canada, ATLAS, ATLAS Canada (Alberta, Carleton, Toronto, Victoria)
  – We welcome and encourage further participants!

CA*net4 International Grid Testbed Proposal – Network Topology


CA*net4 International Grid Testbed Proposal

• General Objectives
  – Facility where user control interfaces and frameworks for CA*net4 Lightpath Cross-connect Devices can be tested and demonstrated
  – Develop and adapt grid applications designed to interact with the Lightpath Grid Service (networks and network elements as Grid resources)
  – Investigate and test protocols for high speed long distance optical networks
  – Investigate and test emerging technologies including: 10 GbE, RDMA/IP, Fibre Channel over IP, serial SCSI, HyperSCSI, …
  – Collaborate with other international research and development efforts

• Specific Projects
  – Bulk data transfer, Grid Canada Testbed, ATLAS Data Challenges, ATLAS HLT remote real-time processing, S2IO collaboration, RDMA trials, and others …