End-to-end lightpaths for large file transfers over high speed long distance networks

Corrie Kost, Steve McDonald (TRIUMF), Bryan Caron (Alberta), Wade Hong (Carleton)

Outline

• TB data transfer from TRIUMF to CERN (iGrid 2002)
• e2e Lightpaths
• 10 GbE technology
• Performance tuning (Disk I/O, TCP)
• Throughput results
• Future plans and activities

The Birth of a Demo

• Suggestion from Canarie to the Canadian HEP community to participate at iGrid2002
• ATLAS Canada discussed the demo at a Vancouver meeting in late May
• Initial meeting at TRIUMF by participants in mid July to plan the demo
• Sudden realization that there was a very short time to get all elements in place!

So what did we do?

• Demonstrate a manually provisioned “e2e” lightpath
• Transfer 1 TB of ATLAS MC data generated in Canada from TRIUMF to CERN
• Test out 10GbE technology and channel bonding
• Establish a new benchmark for high performance disk-to-disk throughput over a large distance

TRIUMF

• TRI University Meson Facility
• Operated as a joint venture by Alberta, UBC, Carleton, SFU and Victoria
• Located on the UBC campus in Vancouver
• Proposed location for Canadian ATLAS Tier 1.5 site

The iGrid2002 Network

e2e Lightpaths

• Core design principle of CA*net 4
• Ultimately to give control of lightpath creation, teardown and routing to the end user
• Users “own” their resources and can negotiate sharing with other parties
  – Hence, “Customer Empowered Networks”
• Ideas evolved from initial work on OBGP
• Provides a flexible infrastructure for emerging grid applications via Web Services

e2e Lightpaths

• Grid services architecture for user control and management
• NEs are distributed objects or agents whose methods can be invoked remotely
• Use OGSA and Jini/JavaSpaces for e2e customer control
• Alas, can only do things manually today

CA*net 4 Topology

[Figure: map of the CA*net 4 OC-192 national topology, showing CA*net 4 nodes from Victoria and Vancouver in the west to Halifax and St. John's in the east, along with possible future breakouts and possible future links or options toward Seattle, Minneapolis, Chicago, Buffalo, Albany, Boston and New York]

We are live continent to continent!

• e2e lightpath up and running at 21:45 CET

traceroute to cern-10g (192.168.2.2), 30 hops max, 38 byte packets
 1  cern-10g (192.168.2.2)  161.780 ms  161.760 ms  161.754 ms

Intel 10GbE Cards

• Intel kindly loaned us 2 of their Pro/10GbE LR server adapter cards despite the end of their Alpha program
  – based on the Intel® 82597EX 10 Gigabit Ethernet Controller

Extreme Networks

• Extreme Networks generously loaned and shipped to us 2 Black Diamond 6808s with their new 10GbE LRi blades

Hardware Configurations

[Diagram: TRIUMF and CERN hardware configurations]

CERN Server (Receive Host)
• SuperMicro P4DL6 (dual Xeon 2 GHz), 400 MHz front side bus
• 1 GB DDR2100 RAM
• Dual Channel Ultra 160 onboard SCSI
• SysKonnect 9843 SX GbE
• 2 independent PCI buses, 6 PCI-X 64-bit/133 MHz capable slots
• 2 × 3ware 7850 RAID controllers, 6 IDE drives on each 3ware controller
• RH7.3 on a 13th drive connected to the on-board IDE
• RMC4D from HARDDATA
• WD Caviar 120 GB drives with 8 MB cache

TRIUMF Server (Send Host)
• SuperMicro P4DL6 (dual Xeon 2 GHz), 400 MHz front side bus
• 1 GB DDR2100 RAM
• Dual Channel Ultra 160 onboard SCSI
• SysKonnect 9843 SX GbE
• 2 independent PCI buses, 6 PCI-X 64-bit/133 MHz capable slots
• 3ware 7850 RAID controller
• 2 Promise Ultra 100 Tx2 controllers

Software Configuration

• Redhat 7.3 based kernel 2.4.18-3
  – Needed to support filesystems > 1 TB
• Upgrades and patches
  – Patched to 2.4.18-10
  – Intel Pro 10GbE Linux driver (early stable); a bring-up sketch follows below
  – SysKonnect 9843 SX Linux driver (latest)
  – Ported Sylvain Ravot’s tcp tune patches
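• Not from the slides: a minimal sketch of how these drivers would be brought up on a 2.4 kernel. The module names (ixgb for the Intel 82597EX, sk98lin for the SysKonnect SK-98xx), the interface name, the address and the jumbo-frame MTU are our assumptions, not values taken from the demo configuration.

# Load the NIC drivers (module names assumed from the Linux 2.4 driver tree)
modprobe ixgb        # Intel PRO/10GbE LR (82597EX)
modprobe sk98lin     # SysKonnect SK-98xx GbE
# Bring an interface up; address and jumbo-frame MTU are illustrative placeholders
ifconfig eth2 192.168.2.1 netmask 255.255.255.0 mtu 9000 up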

Black Magic

• Optimizing system performance doesn’t come by default
• Performance tuning is very much an art requiring Black Magic

• Disk I/O Optimization
• TCP Tuning

Disk I/O Black Magic

• min/max readahead on both systems

sysctl -w vm.min-readahead=127
sysctl -w vm.max-readahead=256

• bdflush on receive host

sysctl -w vm.bdflush="2 500 0 0 500 1000 60 20 0"
or echo 2 500 0 0 500 1000 60 20 0 > /proc/sys/vm/bdflush

• bdflush on send host

sysctl -w vm.bdflush="30 500 0 0 500 3000 60 20 0"
or echo 30 500 0 0 500 3000 60 20 0 > /proc/sys/vm/bdflush

Disk I/O Black Magic

• Disk I/O elevators (minimal impact noticed)
  – /sbin/elvtune allows some control of latency vs. throughput (an example invocation is sketched below)
  – Max I/O scheduler read latency set to 512 (default 8192)
  – Max I/O scheduler write latency set to 1024 (default 16384)
• atime
  – Disables updating the last time a file was accessed (typical for file servers)
  mount -t ext2 -o noatime /dev/md0 /raid
  – Typically ext3 writes at ~90 MB/s while ext2 writes at ~190 MB/s; reads are minimally affected. We always used ext2.
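• The slides give the elevator latencies but not the command line; a minimal sketch of applying those values with elvtune on a 2.4 kernel, assuming the software RAID device is /dev/md0:

# Set max read latency to 512 and max write latency to 1024 (defaults 8192/16384)
/sbin/elvtune -r 512 -w 1024 /dev/md0
# Query the current settings to verify
/sbin/elvtune /dev/md0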

Disk I/O Black Magic

• IRQ Affinity
  – Would also need process affinity, but that requires a 2.5 kernel

[root@thunder root]# more /proc/interrupts
           CPU0       CPU1
  0:   15723114          0   IO-APIC-edge   timer
  1:         12          0   IO-APIC-edge   keyboard
  2:          0          0   XT-PIC         cascade
  8:          1          0   IO-APIC-edge   rtc
 10:          0          0   IO-APIC-level  usb-ohci
 14:         22          0   IO-APIC-edge   ide0
 15:     227234          2   IO-APIC-edge   ide1
 16:        126          0   IO-APIC-level  aic7xxx
 17:         16          0   IO-APIC-level  aic7xxx
 18:         91          0   IO-APIC-level  ide4, ide5, 3ware Storage Controller
 20:         14          0   IO-APIC-level  ide2, ide3
 22:    2296662          0   IO-APIC-level  SysKonnect SK-98xx
 24:          2          0   IO-APIC-level  eth3
 26:    2296673          0   IO-APIC-level  SysKonnect SK-98xx
 30:   26640812          0   IO-APIC-level  eth0
NMI:          0          0
LOC:   15724196   15724154
ERR:          0
MIS:          0

echo 1 > /proc/irq/18/smp_affinity                        # use CPU0
echo 2 > /proc/irq/18/smp_affinity                        # use CPU1
echo 3 > /proc/irq/18/smp_affinity                        # use either
cat /proc/irq/prof_cpu_mask > /proc/irq/18/smp_affinity   # reset to default
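• Not in the slides, but a quick way to confirm which CPU is actually servicing the storage and NIC interrupts while a transfer runs (device names follow the listing above):

# Re-read /proc/interrupts every second and watch the per-CPU counters grow
watch -n 1 "grep -E '3ware|SK-98xx' /proc/interrupts"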

TCP Black Magic

• Typically suggested TCP and net buffer tuning

sysctl -w net.ipv4.tcp_rmem="4096 4194304 4194304"
sysctl -w net.ipv4.tcp_wmem="4096 4194304 4194304"
sysctl -w net.ipv4.tcp_mem="4194304 4194304 4194304"

sysctl -w net.core.rmem_default=65535
sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.wmem_default=65535
sysctl -w net.core.wmem_max=8388608
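• As a sanity check on buffer sizes like these, the bandwidth-delay product gives the amount of data that must be in flight to keep the pipe full. A small shell sketch with illustrative numbers (1 Gbps and the roughly 200 ms TRIUMF–CERN RTT), not values prescribed by the slides:

# BDP in bytes = rate (bit/s) / 8 * RTT (s)
# 1 Gbps over a 200 ms path needs roughly 25,000,000 bytes (~25 MB) in flight
echo $(( 1000000000 / 8 * 200 / 1000 ))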

TCP Black Magic

• Sylvain Ravot’s tcp tune patch parameters (ssthresh, congestion avoidance increment, slow start increment)

sysctl -w net.ipv4.tcp_tune="115 115 0"

• Linux 2.4 retentive TCP
  – Caches TCP control information for a destination for 10 mins
  – To avoid caching:

sysctl -w net.ipv4.route.flush=1

Testing Methodologies

• Began testing with a variety of bandwidth characterization tools
  – pipechar, pchar, ttcp, netpipe, pathchar, etc.
• Evaluated high performance file transfer applications
  – bbftp, bbcp, tsunami, pftp
• Developed scripts to automate the tools and to scan the parameter space for a number of them (a sketch of this kind of scan follows below)
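• The scan scripts themselves are not reproduced in the slides; the following is our own minimal sketch of the idea, using iperf with a placeholder target host and illustrative window sizes and stream counts:

#!/bin/sh
# Scan TCP window size and parallel-stream count, logging iperf throughput.
# "cern-10g", the window sizes and the stream counts are placeholders.
for win in 256K 1M 4M 8M; do
  for streams in 1 2 4 8 13; do
    echo "window=$win streams=$streams" >> iperf_scan.log
    iperf -c cern-10g -w "$win" -P "$streams" -t 30 >> iperf_scan.log
  done
done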

A Real WAN Emulator

• First tested with a hardware WAN emulator but it had its limitations (RTT < 100 ms, MTU = 1500)
• Canarie Loopback – TRIUMF – Starlight: RTT = 96 ms
  – Allowed testing at 1 GbE due to capacity constraints
• Canarie Loopback2 – TRIUMF – Starlight: RTT = 193 ms
  – Close to the expected TRIUMF – CERN e2e RTT of 200 ms

So what did we do?

• Demonstrate a manually provisioned “e2e” lightpath
• Transfer 1 TB of ATLAS MC data generated in Canada from TRIUMF to CERN
• Test out 10GbE technology and channel bonding
• Establish a new benchmark for high performance disk-to-disk throughput over a large distance

Comparative Results (TRIUMF to CERN)

Tool                        Transferred   Average     Max Avg
wuftp 100 MbE               600 MB        3.4 Mbps
wuftp 10 GbE                6442 MB       71 Mbps
iperf                       275 MB        940 Mbps    1136 Mbps
pftp                        600 MB        532 Mbps
bbftp (13 streams)          1.4 TB        666 Mbps    731 Mbps
Tsunami (disk to disk)      0.5 TB        700 Mbps    825 Mbps
Tsunami (disk to memory)    12 GB         > 1 Gbps
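• For reference, a multi-stream bbftp transfer is invoked roughly as below. Only the 13-stream count comes from the table; the user, host and file paths are placeholders, and the exact command used in the demo is not shown in the slides.

# bbftp with 13 parallel TCP streams (-p); user, paths and host are placeholders
bbftp -u atlas -p 13 -e 'put /raid/mcdata.tar /data/mcdata.tar' cern-10g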

Sunday Nite Summaries

Exceeding 1 Gbit/sec … (using tsunami)

So what did we do?

• Demonstrate a manually provisioned “e2e” lightpath
• Transfer 1 TB of ATLAS MC data generated in Canada from TRIUMF to CERN
• Test out 10GbE technology and channel bonding
• Establish a new benchmark for high performance disk-to-disk throughput over a large distance

Lessons Learned

• Linux software RAID is faster than most conventional SCSI and IDE based RAID systems
• One controller for each drive; the more disk spindles the better
• Channel bonding of two GbEs seems to work very well (on an unshared link); a configuration sketch follows below
• The larger the files, the better the throughput
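• The bonding setup itself is not shown in the slides; a minimal 2.4-kernel-style sketch of round-robin bonding of two GbE interfaces, with the interface names and IP address as placeholders:

# Round-robin (mode=0) bonding of two GbE NICs; names and address are placeholders
modprobe bonding mode=0 miimon=100
ifconfig bond0 192.168.2.1 netmask 255.255.255.0 up
ifenslave bond0 eth1 eth2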

Further Investigations

• Linux TCP/IP network stack performance
  – Efficient copy routines (zero copy, read-copy update)
• Stream Control Transmission Protocol
• Scheduled Transfer Protocol
  – OS bypass and zero copy
• Web100, Net100, DRS

Acknowledgements

• Canarie
  – Bill St. Arnaud, Rene Hatem, Damir Pobric, Thomas Tam, Jun Jian
• ATLAS Canada
  – Mike Vetterli, Randall Sobie, Jim Pinfold, Pekka Sinervo, Gerald Oakham, Bob Orr, Michel Lefebvre, Richard Keeler
• HEPnet Canada
  – Dean Karlen
• TRIUMF
  – Peter Gumplinger, Fred Jones, Mike Losty, Jack Chakhalian, Renee Poutissou
• BCNET
  – Mike Hrybyk, Marilyn Hay, Dennis O’Reilly, Don McWilliams
• BCIT
  – Bill Rutherford
• Extreme Networks
  – Amyn Pirmohamed, Steven Flowers, John Casselman, Darrell Clarke, Rob Bazinet, Damaris Soellner
• HardData
  – Maurice Hilarius
• Intel Corporation
  – Hugues Morin, Caroline Larson, Peter Molnar, Harrison Li, Layne Flake, Jesse Brandeburg
• Netera
  – Gary Finley
• Indiana University
  – Mark Meiss, Stephen Wallace
• Caltech
  – Sylvain Ravot, Harvey Newman
• CERN
  – Olivier Martin, Paolo Moroni, Martin Fluckiger, Stanley Cannon, J.P. Martin-Flatin
• SURFnet/Universiteit van Amsterdam
  – Pieter de Boer, Dennis Paus, Erik Radius, Erik-Jan Bos, Leon Gommans, Bert Andree, Cees de Laat
• Yotta Yotta
  – Geoff Hayward, Reg Joseph, Ying Xie, E. Siu
• Jalaam
  – Loki Jorgensen

CA*net4 International Grid Testbed Proposal

• Motivation
  – How can we build upon the successful high performance bulk data transfer demonstrations at iGrid 2002 and the Canarie Advanced Networks workshop?
• How about a persistent testbed?
  – Would allow participants to develop and adapt grid-specific applications to utilize the evolving CA*net4 Lightpath Grid Service to establish e2e lightpaths
  – Would also serve as an experimental network for investigating and testing protocols for high speed long distance optical networks

CA*net4 International Grid Testbed Proposal

• Proposal
  – Create an international testbed in collaboration with international partners to advance the realization and utilization of high bandwidth-delay networks
  – Allow for the creation of e2e lightpaths spanning CA*net4 and SURFnet to CERN with bandwidths up to 10 Gbps
• Initial Participants
  – Canarie, CERN, SURFnet, TRIUMF, HEPNet Canada, ATLAS, ATLAS Canada (Alberta, Carleton, Toronto, Victoria)
  – We welcome and encourage further participants!

CA*net4 International Grid Testbed Proposal – Network Topology


CA*net4 International Grid Testbed Proposal

• General Objectives
  – Facility where user control interfaces and frameworks for CA*net4 Lightpath Cross-connect Devices can be tested and demonstrated
  – Develop and adapt grid applications designed to interact with the Lightpath Grid Service (networks and network elements as Grid resources)
  – Investigate and test protocols for high speed long distance optical networks
  – Investigate and test emerging technologies including: 10 GbE, RDMA/IP, Fibre Channel over IP, serial SCSI, HyperSCSI, …
  – Collaborate with other international research and development efforts

• Specific Projects
  – Bulk data transfer, Grid Canada Testbed, ATLAS Data Challenges, ATLAS HLT remote real-time processing, S2IO collaboration, RDMA trials, and others …