Understanding Hadoop Clusters and the Network
Part 1. Introduction and Overview
Brad Hedlund
http://bradhedlund.com
http://www.linkedin.com/in/bradhedlund
@bradhedlund
BRAD HEDLUND .com

Hadoop Server Roles
• Clients
• Distributed Data Analytics: Map Reduce
• Distributed Data Storage: HDFS
• Masters: Job Tracker, Name Node, Secondary Name Node
• Slaves: Data Node & Task Tracker on each machine

Hadoop Cluster
[Diagram: the cluster connects to the outside world through a tree of switches. Racks 1 through N hold the Name Node, Job Tracker, Secondary NN, a Client, and many Data Node + Task Tracker (DN + TT) machines.]

Typical Workflow
• Load data into the cluster (HDFS writes)
• Analyze the data (Map Reduce)
• Store results in the cluster (HDFS writes)
• Read the results from the cluster (HDFS reads)
Sample Scenario: How many times did our customers type the word “Fraud” into emails sent to customer service?
File.txt: huge file containing all emails sent to customer service

Writing files to HDFS
Client: “I want to write Blocks A, B, C of File.txt”
Name Node: “OK. Write to Data Nodes 1, 5, 6”
• Client consults Name Node
• Client writes block directly to one Data Node
• Data Nodes replicate the block
• Cycle repeats for next block

Hadoop Rack Awareness – Why?
[Diagram: Data Nodes in Racks 1, 5 and 9 holding blocks A, B and C, next to the Name Node's two tables.]
Name Node metadata, File.txt =
  Blk A: DN1, DN5, DN6
  Blk B: DN7, DN1, DN2
  Blk C: DN5, DN8, DN9
Rack awareness:
  Rack 1: Data Node 1, Data Node 2, Data Node 3
  Rack 5: Data Node 5, Data Node 6, Data Node 7
• Never lose all data if an entire rack fails
• Keep bulky flows in-rack when possible
• Assumption that in-rack is higher bandwidth, lower latency

Preparing HDFS writes
Client: “I want to write File.txt Block A”
Name Node: “OK. Write to Data Nodes 1, 5, 6”
Rack awareness: Rack 1: Data Node 1. Rack 5: Data Node 5, Data Node 6.
[Diagram: the Client asks Data Node 1 “Ready Data Nodes 5, 6?”; Data Node 1 asks Data Node 5 “Ready Data Node 6?”; each node answers “Ready!” back along the chain to the Client.]
• Name Node picks two nodes in the same rack, one node in a different rack
• Data protection
• Locality for M/R

Pipelined Write
[Diagram: Block A travels from the Client to Data Node 1 in Rack 1, then on to Data Node 5 and Data Node 6 in Rack 5.]
• Data Nodes 1 & 2 pass data along as it's received
• TCP 50010

Pipelined Write
[Diagram: Data Node 1 (Rack 1) and Data Nodes 2 and 3 (Rack 5) each hold Block A and report “Block received” to the Name Node; the Client receives “Success”.]
Name Node metadata, File.txt = Blk A: DN1, DN2, DN3

Multi-block Replication Pipeline
1TB File = 3TB storage, 3TB network traffic
[Diagram: the Client writes Blk A, Blk B and Blk C of File.txt; each block is pipelined to three Data Nodes spread over Racks 1, 4 and 5.]

Client writes span the HDFS Cluster
[Diagram: one Client writing File.txt to Data Nodes spread across Racks 1 through N.]
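The arithmetic behind the replication pipeline slide (1TB file = 3TB storage, 3TB network traffic) and behind the spread of a file across the cluster can be sketched as follows. This is a minimal illustration, not how HDFS itself accounts for space: the function name `hdfs_write_footprint` is hypothetical, and the 64 MB block size is an assumed value because the slides do not state one. The replication factor of 3 matches the three copies per block shown on the slides.

```python
def hdfs_write_footprint(file_size, block_size, replication=3):
    """Return (block_count, stored_bytes, network_bytes) for one file.

    Hypothetical helper for illustration only.
    Each block is copied `replication` times, and every copy crosses
    one network link in the pipeline (Client -> DN -> DN -> DN).
    """
    block_count = -(-file_size // block_size)  # ceiling division
    stored_bytes = file_size * replication
    network_bytes = file_size * replication
    return block_count, stored_bytes, network_bytes


TB = 1024 ** 4
MB = 1024 ** 2

# Assumed block size of 64 MB; the slides give no value.
blocks, stored, network = hdfs_write_footprint(1 * TB, 64 * MB)
print(blocks, stored // TB, network // TB)  # 16384 3 3
```

With a smaller block size the same file splits into more blocks, and each block gets its own set of three Data Nodes, which is why more blocks mean a wider spread across the cluster.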
Factors (File.txt):
• Block size
• File size
More blocks = wider spread

Data Node writes span itself, and other racks
[Diagram: a Data Node writes Blk A, Blk B and Blk C of Results.txt; it keeps a copy of each block itself, and the other copies land on Data Nodes in other racks (Racks 1 through N).]

Name Node
Data Node: “I’m alive!” “I have blocks: A, C”
Name Node: “Awesome! Thanks.”
Name Node metadata:
  File system: File.txt = A, C
  DN1: A, C
  DN2: A, C
  DN3: A, C
• Data Node sends Heartbeats
• Every 10th heartbeat is a Block report
• Name Node builds metadata from Block reports
• TCP – every 3 seconds
• If Name Node is down, HDFS is down

Re-replicating missing replicas
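The heartbeat and block-report bookkeeping from the Name Node slide, and the way missing heartbeats expose missing replicas, can be sketched roughly as below. This is an illustrative model only, not Hadoop's implementation: the class `NameNodeSketch`, its method names, and the threshold `MISSED_BEATS_BEFORE_DEAD` are all hypothetical. Only the 3-second heartbeat interval comes from the slides.

```python
HEARTBEAT_INTERVAL_S = 3        # from the slide: heartbeat every 3 seconds
MISSED_BEATS_BEFORE_DEAD = 10   # hypothetical threshold, not from the slides


class NameNodeSketch:
    """Toy model of Name Node metadata built from block reports."""

    def __init__(self):
        self.last_seen = {}   # node -> time of its last heartbeat
        self.block_map = {}   # node -> set of blocks it last reported

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def block_report(self, node, blocks, now):
        # A block report also counts as a heartbeat.
        self.heartbeat(node, now)
        self.block_map[node] = set(blocks)

    def dead_nodes(self, now):
        limit = HEARTBEAT_INTERVAL_S * MISSED_BEATS_BEFORE_DEAD
        return sorted(n for n, t in self.last_seen.items() if now - t > limit)

    def under_replicated(self, now, target=3):
        """Blocks with fewer than `target` replicas on live nodes."""
        dead = set(self.dead_nodes(now))
        counts = {}
        for node, blocks in self.block_map.items():
            for block in blocks:
                counts.setdefault(block, 0)
                if node not in dead:
                    counts[block] += 1
        return {b: c for b, c in sorted(counts.items()) if c < target}


nn = NameNodeSketch()
for dn in ("DN1", "DN2", "DN3"):
    nn.block_report(dn, ["A", "C"], now=0)
nn.heartbeat("DN1", now=60)
nn.heartbeat("DN2", now=60)          # DN3 stays silent
print(nn.dead_nodes(now=60))         # ['DN3']
print(nn.under_replicated(now=60))   # {'A': 2, 'C': 2}
```

In this sketch the Name Node never polls a Data Node. It only notices that heartbeats stopped arriving, then uses the block lists it already holds to work out which blocks have lost a replica.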
Name Node: “Uh oh! Missing replicas”
Name Node to a Data Node: “Copy blocks A, C to Node 8”
Name Node metadata:
  DN1: A, C
  DN2: A, C
  DN3: A, C
Rack Awareness:
  Rack1: DN1, DN2
  Rack5: DN3
  Rack9: DN8
• Missing Heartbeats signify lost Nodes
• Name Node consults metadata, finds affected data
• Name Node consults Rack Awareness script
• Name Node tells a Data Node to re-replicate

Secondary Name Node
Secondary Name Node: “It’s been an hour, give me your metadata”
Name Node file system metadata: File.txt = A, C
• Not a hot standby for the Name Node
• Connects to Name Node every hour*
• Housekeeping, backup of Name Node metadata
• Saved metadata can rebuild a failed Name Node

Client reading files from HDFS
Client: “Tell me the block locations of Results.txt”
Name Node: “Blk A = 1,5,6  Blk B = 8,1,2  Blk C = 5,8,9”
Name Node metadata, Results.txt =
  Blk A: DN1, DN5, DN6
  Blk B: DN7, DN1, DN2
  Blk C: DN5, DN8, DN9
[Diagram: Data Nodes in Racks 1, 5 and 9 holding blocks A, B and C.]
• Client receives Data Node list for each block
• Client picks first Data Node for each block
• Client reads blocks sequentially

Data Node reading files from HDFS
Data Node: “Tell me the locations of Block A of File.txt”
Name Node: “Block A = 1,5,6”
Name Node metadata, File.txt =
  Blk A: DN1, DN5, DN6
  Blk B: DN7, DN1, DN2
  Blk C: DN5, DN8, DN9
Rack awareness:
  Rack 1: Data Node 1, Data Node 2, Data Node 3
  Rack 5: Data Node 5
• Name Node provides rack local Nodes first
• Leverage in-rack bandwidth, single hop

Data Processing: Map
Client: “How many times does “Fraud” appear in File.txt?”
Job Tracker: “Count “Fraud” in Block C”
Map Tasks on File.txt:
  Data Node 1, Block A: Fraud = 3
  Data Node 5, Block B: Fraud = 0
  Data Node 9, Block C: Fraud = 11
• Map: “Run this computation on your local data”
• Job Tracker delivers Java code to Nodes with local data

What if data isn’t local?
Client: “How many times does “Fraud” appear in File.txt?”
Job Tracker: “Count “Fraud” in Block C”
[Diagram: Data Node 1 in Rack 1 holds Block A but has “no Map tasks left”; Data Node 2, also in Rack 1, runs the Map Task and says “I need block A”. Data Node 5 (Block B, Fraud = 0) in Rack 5 and Data Node 9 (Block C, Fraud = 11) in Rack 9 run their Map Tasks on local data.]
• Job Tracker tries to select Node in same rack as data
• Name Node rack awareness

Data Processing: Reduce
Job Tracker: “Sum “Fraud””
[Diagram: Map Tasks on Data Nodes 1, 5 and 9 (Blocks A, B, C) send their outputs (shown as X, Y, Z) to a Reduce Task on Data Node 3, which writes Results.txt to HDFS with Fraud = 14.]
• Reduce: “Run this computation across Map results”
• Map Tasks deliver output data over the network
• Reduce Task data output written to and read from HDFS

Unbalanced Cluster
[Diagram: the existing switches and racks, plus switches marked NEW.]
“**I was assigned a Map Task but don’t have the block.”
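The placement preference described on the Map slides (a node that already holds the block, otherwise a node in the same rack as the data) can be sketched as below. This is a simplified illustration rather than the Job Tracker's actual scheduler: the function `pick_node`, its arguments, and the node name `DN-new` are hypothetical. The final fallback to any free node is an assumption added here to model the case in the speech bubble, where a node is assigned a Map Task for a block it does not hold.

```python
def pick_node(block_holders, rack_of, free_nodes):
    """Choose where to run a Map Task for one block.

    Hypothetical helper for illustration only.
    block_holders: nodes that store a replica of the block
    rack_of:       mapping of node -> rack name
    free_nodes:    nodes with a free Map slot, in preference order
    Returns (node, locality), or (None, None) if no node is free.
    """
    holders = set(block_holders)

    # 1. Prefer a free node that already stores the block.
    for node in free_nodes:
        if node in holders:
            return node, "node-local"

    # 2. Otherwise prefer a free node in the same rack as a replica.
    holder_racks = {rack_of[h] for h in holders}
    for node in free_nodes:
        if rack_of[node] in holder_racks:
            return node, "rack-local"

    # 3. Assumed fallback: any free node, which must fetch the block
    #    over the network from another rack.
    for node in free_nodes:
        return node, "remote"
    return None, None


rack_of = {
    "DN1": "Rack 1", "DN2": "Rack 1",
    "DN5": "Rack 5", "DN9": "Rack 9",
    "DN-new": "Rack NEW",   # hypothetical node in a newly added rack
}

# Only the replica of Block A shown on the slide (Data Node 1) is modeled.
print(pick_node(["DN1"], rack_of, ["DN1", "DN2"]))     # ('DN1', 'node-local')
print(pick_node(["DN1"], rack_of, ["DN-new", "DN2"]))  # ('DN2', 'rack-local')
print(pick_node(["DN1"], rack_of, ["DN-new"]))         # ('DN-new', 'remote')
```

The second call mirrors the “What if data isn’t local?” slide: Data Node 1 has no free slot, so the task goes to Data Node 2 in the same rack. The third call mirrors the unbalanced cluster, where only a new node without the block is free.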