File Transfer
Total Page:16
File Type:pdf, Size:1020Kb
File Transfer Joon Kim PICSciE Research Computing Workshop Department of Chemistry 10/30/2017 Introduction Hyojoon (Joon) Kim. Ph.D. Cyber Infrastructure Engineer Role: Design & build a campus network and infrastructure that will better support our researchers 2 Our goal today • Overview of widely used tools • Learn about data transfer basics • Pick the right tool for your job • Know what to expect • Learn about Globus • Q&A 3 Data transfer Data Data source destination This part is our topic! 4 Why do we care? Without good practice, you will waste time and effort 1. Start data transfer using SCP at 10pm. Usually takes 10 hours. 2. At 2am, there was a brief 1-minute network outage. Transfer job aborted. 3. Arrive 8am in the morning. See the damage. Start again, which will take 10 hours. 4. Lost a day of work. Time Effort 5 Why do we care? Without good practice, you will waste time and effort 1. Start data transfer using SCP at 10pm. Usually takes 10 hours. Really? Are you SURE that’s the best? Time Effort 6 We want you to Focus on your research, not on transferring data around Time Effort 7 Data Transfer Tools 8 Transfer tools What transfer tool do you use? 9 What we normally see 10 Secure Copy (SCP) • Secure Copy (SCP) • Uses SSH for authentication and data transfer (TCP port 22) • Unix-based systems (including Mac OS X): Should have it by default • Windows: WinSCP (https://winscp.net/eng/download.php) 11 Secure Copy (SCP) • Secure Copy (SCP) • Uses SSH for authentication and data transfer (TCP port 22) • Unix-based systems (including Mac OS X): Should have it by default • Windows: WinSCP (https://winscp.net/eng/download.php) 12 rsync • rsync (rsync over SSH) • Sync files and directories between two endpoints. • Good for running backups. Careful with “--delete” option (this *mirrors* directories) • Unix-based systems (including Mac OS X): Should have it by default • Windows: CwRsync (https://itefix.net/cwrsync) 13 File Transfer Protocol (FTP) • ‘Secure’ File Transfer Protocol (‘S’FTP) • Widely used for file transfers • SFTP is more secure. Use it if available. • Unix-based systems (including Mac OS X): Should have it by default • Windows: FileZilla (https://filezilla-project.org) 14 File Transfer Protocol (FTP) • ‘Secure’ File Transfer Protocol (‘S’FTP) • Widely used for file transfers • SFTP is more secure. Use it if available. • Unix-based systems (including Mac OS X): Should have it by default • Windows: FileZilla (https://filezilla-project.org) 15 These tools are okay, but not always • Great compatibility. Widely available. • Small datasets. Quick transfers. (< 15 mins) • Large bulk data transfers. • Transfers on unreliable connections and hosts. 16 When should you look for other solutions? • Transfer is unreliable and takes a long time, to the point it affects your workflow • You are getting speeds less than (for large datasets): • Within campus • 800 Mbps (Mega bits per second). • 100 GB = ~820,000 Mb. Takes ~ 17 minutes • 3-4 Gbps if you have a 10G connection • 100 GB = ~800 Gb. Takes ~ 4 minutes • Between campus and outside • Hard to tell because of things out of our control • 200 Mbps – 5000 Mbps (5 Gbps) • 100 GB = ~820,000 Mb. Takes ~ 1 hour 17 Data Transfer Basics 18 Data transfer: Overview • The key players • Endpoints • Network 1/10/100 Gbps • Transfer tool Source Destination SCP FTP • Transfer settings SFTP rsync rsync over ssh Encrypted vs. not encrypted 19 (Why) is my data transfer slow? Where are the bottlenecks? scp scp ftp ftp Source Destination 20 1. Potential bottlenecks in endpoints • CPU and memory • Higher clock speed is better than # of cores • E.g., 2 x Intel Xeon® Broadwell processor E5-2643 3.4 GHz (total 12 cores) • RAM: 32 GB or more recommended • Disk I/O • Disk type (SATA HDD, SSD), configuration (RAID), and file system (ext4, GPFS, Lustre) • Decent server with HDD, EXT4 with RAID performs around 4 Gbps • RAID is required to get > 1 Gbps (ref: http://fasterdata.es.net/data-transfer-tools/) • Network Interface Controller (NIC) • Wireless: don’t expect much (mostly < 130 Mbps) • Wired: 1/10/40/100 Gbps • MiscellaneousYou tuning won’t get more than 1 Gb/s (125 MB/s) with • NIC tx buffer, enabling jumbo frames (9K instead of 1.5K), TCP/UDP tuning … your laptop, most desktops, and un-optimized servers 21 2. Potential bottlenecks in networks • Bandwidth • E.g., 1/10/40/100 Gb/s • Congestion • Time of day • Distance • E.g., Round Trip Time (RTT) between Stanford – Princeton: ~ 80ms • “Things” along the way • Routers, switches, firewalls, NAT, security devices, … 22 Our network, their network, and networks in between Parts we have limited visibility, and no control over 23 3. Transfer tools: Single vs multi stream Single stream - scp src dst - ftp - rsync Multi stream - Less packet loss (w/ dups) - Better utilization of link - GridFTP src dst - BBCP Faster transfer speed 24 Transfer tools: scp vs. GridFTP Downloading 500 GB data 8 hours 10 minutes (ref: http://fasterdata.es.net/data-transfer-tools/) 25 4. Transfer settings: Encryption Tool Encrypted Control Encrypted Data FTP HTTP (even password-based access) BBCP BBFTP ✔ Globus/GridFTP SCP SFTP rsync over SSH ✔ ✔ Globus/GridFTP with encryption-on HTTPS Data encryption provides best security, but negatively impacts transfer speed 26 In summary • Data transfer speed is affected by: Endpoints, network, transfer tool, and transfer settings • In most cases, your endpoint cannot handle much • Use wired connection, and check your bandwidth as far as you can • Ask for 10Gbps or 40Gbps connection if needed • Use better transfer tool if possible. • scp, (s)ftp, rsync, and wget/curl work fine for small transfers. • For large transfer (> GBs) over the WAN (RTT > 5ms: beyond Philly or NYC), don’t expect much from: scp, (s)ftp, rsync, wget/curl, robocopy 27 Hey… that’s a lot of work Transfer Endpoint Tool Transfer Network Settings Photo from Flickr (Billy Abbott) 28 What we have at Princeton: Data Transfer Nodes Test DTN Tigress DTN Lewis-Sigler PNI DTN CS DTN DTN Globus (GridFTP) 10 Gbps Data Transfer Nodes (DTNs) 10 Gbps (Mar. 2016 -) • High-end servers (Nov. 2015 -) • 10 Gbps connections ESnet • Tuned and optimized. RAID configured Internet2 • Good transfer tools (e.g., Globus) • Supported 29 Leverage our resources • Leverage our resources • Contact us or your departmental staff • About existing departmental DTNs and best work/dataflow • About having a departmental DTN 30 Data Transfer Nodes @ Princeton • Test DTN • /u/<NetID> • Contact: [email protected] • Tigress DTN (Princeton TIGRESS) • GPFS-based /tigress and scratch (tiger, della, orbital) disk space • Contact: [email protected] • LSI DTN (Lewis-Sigler Institute Core DTN) • LSI local cluster storage, lab data volumes, and scratch spaces • Contact: [email protected] • PNI DTN (Princeton Neuroscience Institute DTN) • All PNI `/Jukebox` Volumes (Bucket and Scratch spaces) are available • Contact: [email protected] • CS DTN (Computer Science Department DTN) • /n/fs/scratch/ and CS “project” storage spaces • Contact: [email protected] • Physics DTN (Princeton Physics DTN) • Restricted to users who has an account on Feynman cluster. /group, /scratch, and /mnt/<project> NFS files systems. • Contact: Vinod Gupta ([email protected]), Sumit Saluja ([email protected]) 31 What is Globus? https://www.globus.org • Fast, reliable data transfer and management service • Uses GridFTP underneath • Main advantages • Fast transfer speed (multi-stream) • Convenient to use: “Fire-and-Forget” 32 How it works 3. Data A B 1. Pick two endpoints 2. Submit transfer request at the Globus website 3. Dataset is transferred between two endpoints • Your machine’s web browser is just a “remote control” Globus.org • But, your machine can be an endpoint too (more later) 1. Select endpoints 4. Get 2. Request transfer notification 4. Get notification when transfer is done 33 What you need to use Globus • Account • Have Princeton NetID? You’re set. • Web browser • Chrome, Firefox, Safari, IE (Edge), etc • E.g., use smartphone to submit a transfer job (note: your phone is not transferring the dataset) • Access to source and destination endpoints 34 Your peers are actively using it! 09/2017: Total 257 transfers via Globus 35 Globus has good coverage • Universities • Cornell, NYU, Yale, Johns Hopkins, Dartmouth, Purdue, Georgia Tech, Virginia Tech, UVA, Michigan, Indiana, Stanford, Berkeley, and U Chicago,… • Most of the DOE national labs • ESnet at CERN, ANL, LBNL, LLNL, LANL, ORNL, and PNNL • National computing facilities • NERSC, NCSA, SDSC … • Federal agencies • NIH, USDA, NASA/JPL, USGS … • Over 50,000 registered endpoints at over 500 institutions worldwide 36 How to use it ’On’ by default. ‘On’ is around 25% slower than ‘Off’ ‘On’ is 10%-40% slower than ’Off’ 37 Globus Sharing • Share file or directory with other Globus users 38 Data Transfer via CLI • Install Globus CLI package (Linux or Cygwin on Windows) • https://docs.globus.org/cli/installation/ • Documentation • https://docs.globus.org/cli/ • Scheduling transfer jobs • $ echo "globus transfer $ep1:/share/godata/file1.txt $ep2:~/file1.txt -- label 'CLI Test Transfer 2’ " | at 13:07 oct 30 2017 • Cron job: 0 2 * * * runglobus.sh &> ~/runglobus.log 39 Globus Connect Personal https://www.globus.org/globus-connect-personal • Make your own machine a Globus endpoint • Mac, Windows, Linux • You are the administrator for your own Globus endpoint • Limited performance (# of streams), but convenient! 40 Use cases Scenario 1 • Data on your laptop • Globus Connect Personal Scenario 2 • Data on shared Windows PC e.g., • Globus Connect Personal Tigress DTN SMB mount Better for Scenario 2 • Data on shared Windows PC • SMB mount to Linux server • Globus Connect Server 41 Learn more about Globus • Globus documentation • https://docs.globus.org/how-to/ • Research Computing mini-course • “Transferring Large Data Sets, Plus Hands-on Tutorial with the Globus Transfer Tool” (Spring, 2018) • http://www.princeton.edu/researchcomputing/education/mini-courses/ 42 Other tools • BBCP • Free, easy to use, and comparable performance to Globus • Mac OS X, Linux-based systems.