Joon Kim PICSciE

Research Computing Workshop Department of Chemistry 10/30/2017 Introduction

Hyojoon (Joon) Kim. Ph.D. Cyber Infrastructure Engineer

Role: Design & build a campus network and infrastructure that will better support our researchers

2 Our goal today

• Overview of widely used tools

• Learn about data transfer basics • Pick the right tool for your job • Know what to expect

• Learn about Globus

• Q&A

3 Data transfer

Data Data source destination

This part is our topic!

4 Why do we care?

Without good practice, you will waste time and effort

1. Start data transfer using SCP at 10pm. Usually takes 10 hours. 2. At 2am, there was a brief 1-minute network outage. Transfer job aborted. 3. Arrive 8am in the morning. See the damage. Start again, which will take 10 hours. 4. Lost a day of work. Time Effort

5 Why do we care?

Without good practice, you will waste time and effort

1. Start data transfer using SCP at 10pm. Usually takes 10 hours.

Really? Are you SURE that’s the best?

Time Effort

6 We want you to

Focus on your research, not on transferring data around

Time Effort

7 Data Transfer Tools

8 Transfer tools

What transfer tool do you use?

9 What we normally see

10 Secure Copy (SCP)

• Secure Copy (SCP) • Uses SSH for authentication and data transfer (TCP port 22) • -based systems (including Mac OS X): Should have it by default • Windows: WinSCP (https://winscp.net/eng/download.php)

11 Secure Copy (SCP)

• Secure Copy (SCP) • Uses SSH for authentication and data transfer (TCP port 22) • Unix-based systems (including Mac OS X): Should have it by default • Windows: WinSCP (https://winscp.net/eng/download.php)

12 rsync

• rsync (rsync over SSH) • Sync files and directories between two endpoints. • Good for running backups. Careful with “--delete” option (this *mirrors* directories) • Unix-based systems (including Mac OS X): Should have it by default • Windows: CwRsync (https://itefix.net/cwrsync)

13 (FTP)

• ‘Secure’ File Transfer Protocol (‘S’FTP) • Widely used for file transfers • SFTP is more secure. Use it if available. • Unix-based systems (including Mac OS X): Should have it by default • Windows: FileZilla (https://filezilla-project.org)

14 File Transfer Protocol (FTP)

• ‘Secure’ File Transfer Protocol (‘S’FTP) • Widely used for file transfers • SFTP is more secure. Use it if available. • Unix-based systems (including Mac OS X): Should have it by default • Windows: FileZilla (https://filezilla-project.org)

15 These tools are okay, but not always

• Great compatibility. Widely available. • Small datasets. Quick transfers. (< 15 mins)

• Large bulk data transfers. • Transfers on unreliable connections and hosts.

16 When should you look for other solutions?

• Transfer is unreliable and takes a long time, to the point it affects your workflow

• You are getting speeds less than (for large datasets): • Within campus • 800 Mbps (Mega bits per second). • 100 GB = ~820,000 Mb. Takes ~ 17 minutes • 3-4 Gbps if you have a 10G connection • 100 GB = ~800 Gb. Takes ~ 4 minutes

• Between campus and outside • Hard to tell because of things out of our control • 200 Mbps – 5000 Mbps (5 Gbps) • 100 GB = ~820,000 Mb. Takes ~ 1 hour

17 Data Transfer Basics

18 Data transfer: Overview

• The key players • Endpoints

• Network 1/10/100 Gbps

• Transfer tool Source Destination

SCP FTP • Transfer settings SFTP rsync rsync over ssh

Encrypted vs. not encrypted

19 (Why) is my data transfer slow?

Where are the bottlenecks?

scp scp

ftp ftp

Source Destination

20 1. Potential bottlenecks in endpoints

• CPU and memory • Higher clock speed is better than # of cores • E.g., 2 x Intel Xeon® Broadwell processor E5-2643 3.4 GHz (total 12 cores) • RAM: 32 GB or more recommended

• Disk I/O • Disk type (SATA HDD, SSD), configuration (RAID), and file system (ext4, GPFS, Lustre) • Decent server with HDD, EXT4 with RAID performs around 4 Gbps • RAID is required to get > 1 Gbps (ref: http://fasterdata.es.net/data-transfer-tools/)

• Network Interface Controller (NIC) • Wireless: don’t expect much (mostly < 130 Mbps) • Wired: 1/10/40/100 Gbps

• MiscellaneousYou tuning won’t get more than 1 Gb/s (125 MB/s) with • NIC tx buffer, enabling jumbo frames (9K instead of 1.5K), TCP/UDP tuning … your laptop, most desktops, and un-optimized servers 21 2. Potential bottlenecks in networks

• Bandwidth • E.g., 1/10/40/100 Gb/s

• Congestion • Time of day

• Distance • E.g., Round Trip Time (RTT) between Stanford – Princeton: ~ 80ms

• “Things” along the way • Routers, switches, firewalls, NAT, security devices, …

22 Our network, their network, and networks in between

Parts we have limited visibility, and no control over

23 3. Transfer tools: Single vs multi stream

Single stream - scp src dst - ftp - rsync

Multi stream - Less packet loss (w/ dups) - Better utilization of link - GridFTP src dst - BBCP Faster transfer speed

24 Transfer tools: scp vs. GridFTP

Downloading 500 GB data

8 hours

10 minutes (ref: http://fasterdata.es.net/data-transfer-tools/)

25 4. Transfer settings: Encryption

Tool Encrypted Control Encrypted Data FTP HTTP (even password-based access) BBCP BBFTP ✔ Globus/GridFTP

SCP SFTP rsync over SSH ✔ ✔ Globus/GridFTP with encryption-on HTTPS

Data encryption provides best security, but negatively impacts transfer speed

26 In summary

• Data transfer speed is affected by: Endpoints, network, transfer tool, and transfer settings

• In most cases, your endpoint cannot handle much

• Use wired connection, and check your bandwidth as far as you can • Ask for 10Gbps or 40Gbps connection if needed

• Use better transfer tool if possible. • scp, (s)ftp, rsync, and /curl work fine for small transfers. • For large transfer (> GBs) over the WAN (RTT > 5ms: beyond Philly or NYC), don’t expect much from: scp, (s)ftp, rsync, wget/curl, robocopy

27 Hey… that’s a lot of work

Transfer Endpoint Tool

Transfer Network Settings

Photo from Flickr (Billy Abbott)

28 What we have at Princeton: Data Transfer Nodes

Test DTN Tigress DTN Lewis-Sigler PNI DTN CS DTN DTN Globus (GridFTP)

10 Gbps Data Transfer Nodes (DTNs) 10 Gbps (Mar. 2016 -) • High-end servers (Nov. 2015 -) • 10 Gbps connections ESnet • Tuned and optimized. RAID configured Internet2 • Good transfer tools (e.g., Globus) • Supported

29 Leverage our resources

• Leverage our resources

• Contact us or your departmental staff • About existing departmental DTNs and best work/dataflow • About having a departmental DTN

30 Data Transfer Nodes @ Princeton

• Test DTN • /u/ • Contact: [email protected] • Tigress DTN (Princeton TIGRESS) • GPFS-based /tigress and scratch (tiger, della, orbital) disk space • Contact: [email protected] • LSI DTN (Lewis-Sigler Institute Core DTN) • LSI local cluster storage, lab data volumes, and scratch spaces • Contact: [email protected] • PNI DTN (Princeton Neuroscience Institute DTN) • All PNI `/Jukebox` Volumes (Bucket and Scratch spaces) are available • Contact: [email protected] • CS DTN (Computer Science Department DTN) • /n/fs/scratch/ and CS “project” storage spaces • Contact: [email protected] • Physics DTN (Princeton Physics DTN) • Restricted to users who has an account on Feynman cluster. /group, /scratch, and /mnt/ NFS files systems. • Contact: Vinod Gupta ([email protected]), Sumit Saluja ([email protected])

31 What is Globus?

https://www.globus.org

• Fast, reliable data transfer and management service

• Uses GridFTP underneath

• Main advantages • Fast transfer speed (multi-stream) • Convenient to use: “Fire-and-Forget”

32 How it works

3. Data A B 1. Pick two endpoints

2. Submit transfer request at the Globus website

3. Dataset is transferred between two endpoints • Your machine’s web browser is just a “remote control” Globus.org • But, your machine can be an endpoint too (more later) 1. Select endpoints 4. Get 2. Request transfer notification 4. Get notification when transfer is done

33 What you need to use Globus

• Account • Have Princeton NetID? You’re set.

• Web browser • Chrome, Firefox, Safari, IE (Edge), etc • E.g., use smartphone to submit a transfer job (note: your phone is not transferring the dataset)

• Access to source and destination endpoints

34 Your peers are actively using it!

09/2017:

Total 257 transfers via Globus

35 Globus has good coverage

• Universities • Cornell, NYU, Yale, Johns Hopkins, Dartmouth, Purdue, Georgia Tech, Virginia Tech, UVA, Michigan, Indiana, Stanford, Berkeley, and U Chicago,…

• Most of the DOE national labs • ESnet at CERN, ANL, LBNL, LLNL, LANL, ORNL, and PNNL

• National computing facilities • NERSC, NCSA, SDSC …

• Federal agencies • NIH, USDA, NASA/JPL, USGS …

• Over 50,000 registered endpoints at over 500 institutions worldwide

36 How to use it

’On’ by default. ‘On’ is around 25% slower than ‘Off’ ‘On’ is 10%-40% slower than ’Off’ 37 Globus Sharing

• Share file or directory with other Globus users

38 Data Transfer via CLI

• Install Globus CLI package (Linux or Cygwin on Windows) • https://docs.globus.org/cli/installation/

• Documentation • https://docs.globus.org/cli/

• Scheduling transfer jobs • $ echo "globus transfer $ep1:/share/godata/file1.txt $ep2:~/file1.txt -- label 'CLI Test Transfer 2’ " | at 13:07 oct 30 2017 • Cron job: 0 2 * * * runglobus.sh &> ~/runglobus.log

39 Globus Connect Personal https://www.globus.org/globus-connect-personal

• Make your own machine a Globus endpoint • Mac, Windows, Linux

• You are the administrator for your own Globus endpoint

• Limited performance (# of streams), but convenient!

40 Use cases Scenario 1 • Data on your laptop • Globus Connect Personal

Scenario 2 • Data on shared Windows PC e.g., • Globus Connect Personal Tigress DTN

SMB mount Better for Scenario 2 • Data on shared Windows PC • SMB mount to Linux server • Globus Connect Server 41 Learn more about Globus

• Globus documentation • https://docs.globus.org/how-to/

• Research Computing mini-course • “Transferring Large Data Sets, Plus Hands-on Tutorial with the Globus Transfer Tool” (Spring, 2018) • http://www.princeton.edu/researchcomputing/education/mini-courses/

42 Other tools

• BBCP • Free, easy to use, and comparable performance to Globus • Mac OS X, Linux-based systems. SSH-based access control • Both endpoints need it installed, but easier to install and configure • Supported DTNs: Test DTN, Tigressdata • “$ bbcp -V -s 16 /local/path/largefile.tar remotesystem:/remote/path/largefile.tar” • More info • http://www.slac.stanford.edu/~abh/bbcp/ • https://www.olcf.ornl.gov/kb_articles/transferring-data-with-bbcp/

• Fast Data Transfer (FDT) • Java-based tool from Caltech & CERN (http://monalisa.cern.ch/FDT/) • Can theoretically run in any , including Windows • Need server-side running in server mode • “$ java -jar ./fdt.jar -ss 1M -P 10 -c remotehost.domain.uci.edu ~/file.633M -d /userdata/hjm” • Slower than BBCP or Globus

43 Other tools

• Aria2c (https://aria2.github.io) • Faster http/https, ftp, sftp, BitTorrent, and Metalink download tool (x4 faster) • Windows, Mac, Linux, Android App • http: “$ ./aria2c -x4 -k1M http://foo.com/foo.zip”

• LFTP • Faster download (get) speed (2-5x) for ftp, http, sftp, fish, torrent. (put) speed is same. • Seems to be compatible with normal FTP, HTTP servers. • Mac OS X, Linux-based systems. (apt-get install lftp; yum install lftp; brew install lftp) • ftp: “$ lftp ftp://speedtest.tele2.net” • http: “$ lftp -e 'pget -n 5 foo.zip' http://foo.com/” • More info: http://lftp.tech=

• HPN-patched SCP/SSH • https://www.psc.edu/hpn-ssh • Slower than BBCP or Globus

44 Summary and Takeaways

• Small and quick transfers: basic transfer tools

• Large bulk transfers: Data Transfer Nodes (DTNs) and Globus when possible

• Large bulk transfers, but Globus is unavailable: Better tools aforementioned

• Know your environment and limitations • Endpoints, network, transfer tool, and transfer settings

• Speak up and reach out to us 45 Q&A

[email protected] Computational Science and Engineering Support: [email protected]

46