Understanding Hadoop Clusters and the Network
Part 1. Introduction and Overview
Brad Hedlund
http://bradhedlund.com
http://www.linkedin.com/in/bradhedlund
@bradhedlund
BRAD HEDLUND .com Hadoop Server Roles
Clients
Distributed Data Analytics (Map Reduce): Job Tracker (master), Task Trackers (slaves)
Distributed Data Storage (HDFS): Name Node and Secondary Name Node (masters), Data Nodes (slaves)
Each slave server runs a Data Node & Task Tracker
Hadoop Cluster
[Diagram: the outside world connects to a pair of switches, which connect to one switch per rack (Rack 1 through Rack N). The racks hold the Name Node, Job Tracker, Secondary Name Node, a Client, and many servers running Data Node + Task Tracker (DN + TT).]
Typical Workflow
• Load data into the cluster (HDFS writes)
• Analyze the data (Map Reduce)
• Store results in the cluster (HDFS writes)
• Read the results from the cluster (HDFS reads)
Sample Scenario: How many times did our customers type the word “Fraud” into emails sent to customer service?
File.txt: huge file containing all emails sent to customer service
Writing files to HDFS
[Diagram: the Client holds File.txt, split into Blk A, Blk B, and Blk C. Client to Name Node: “I want to write Blocks A,B,C of File.txt.” Name Node to Client: “OK. Write to Data Nodes 1,5,6.” Blk A, Blk B, and Blk C are shown at Data Node 1, Data Node 5, and Data Node 6.]
• Client consults Name Node
• Client writes block directly to one Data Node
• Data Nodes replicate block
• Cycle repeats for next block
Hadoop Rack Awareness – Why?
[Diagram: Data Nodes in Rack 1, Rack 5, and Rack 9, each rack behind its own switch, with replicas of blocks A, B, and C spread across the racks. The Name Node holds two tables.]
Name Node metadata, File.txt =
Blk A: DN1, DN5, DN6
Blk B: DN7, DN1, DN2
Blk C: DN5, DN8, DN9
Name Node rack awareness:
Rack 1: Data Node 1, Data Node 2, Data Node 3
Rack 5: Data Node 5, Data Node 6, Data Node 7
• Never lose all data if an entire rack fails
• Keep bulky flows in-rack when possible
• Assumption that in-rack is higher bandwidth, lower latency
Preparing HDFS writes
[Diagram: Client to Name Node: “I want to write File.txt Block A.” Name Node to Client: “OK. Write to Data Nodes 1,5,6.” The Client asks Data Node 1 (Rack 1) whether Data Nodes 5 and 6 are ready; the “Ready?” query is passed on to Data Node 5 and then Data Node 6 (both in Rack 5), and “Ready!” replies travel back along the same path to the Client. Rack awareness: Rack 1: Data Node 1; Rack 5: Data Node 5, Data Node 6.]
• Name Node picks two nodes in the same rack, one node in a different rack
• Data protection
• Locality for M/R
Pipelined Write
[Diagram: the Client sends Blk A of File.txt to Data Node 1 in Rack 1, which passes it to Data Node 5 and on to Data Node 6 in Rack 5; each of the three nodes holds a copy of A. Rack awareness: Rack 1: Data Node 1; Rack 5: Data Node 5, Data Node 6.]
• Data Nodes 1 & 2 pass data along as it's received
• TCP 50010
Pipelined Write
[Diagram: Data Node 1 (Rack 1) and Data Nodes 2 and 3 (Rack 5) each hold a copy of Block A. “Block received” is reported to the Name Node, “Success” is returned to the Client, and the Name Node records File.txt Blk A: DN1, DN2, DN3.]
Multi-block Replication Pipeline
[Diagram: the Client writes Blk A, Blk B, and Blk C of File.txt; each block is pipelined through three Data Nodes spread across Rack 1, Rack 4, and Rack 5.]
1TB File = 3TB storage, 3TB network traffic
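The "1TB File = 3TB storage, 3TB network traffic" figure follows from the pipeline: every block is stored once per replica and crosses the network once per replica. The sketch below works through that arithmetic. It assumes three replicas per block, as drawn on the slide; the function name `replication_cost` is made up for illustration and is not part of Hadoop.

```python
def replication_cost(file_size_tb, replication=3):
    """Return storage used and network traffic generated by one HDFS write.

    Each block crosses the network once per replica: Client -> first Data
    Node, then Data Node -> Data Node for every further copy.
    """
    hops = ["Client -> replica 1"] + [
        f"replica {i} -> replica {i + 1}" for i in range(1, replication)
    ]
    return {
        "storage_tb": file_size_tb * replication,  # one stored copy per replica
        "network_tb": file_size_tb * len(hops),    # one network transfer per hop
        "hops": hops,
    }


cost = replication_cost(1)  # the 1TB file from the slide
print(cost["storage_tb"], "TB stored,", cost["network_tb"], "TB on the network")
for hop in cost["hops"]:
    print(hop)
```

With the default of three replicas the function reports 3 TB stored and 3 TB moved over the network, matching the slide.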
Client writes Span the HDFS Cluster
[Diagram: a Client writing File.txt to Data Nodes spread across Rack 1 through Rack N.]
Factors:
• Block size
• File size
More blocks = Wider spread
Data Node writes span itself, and other racks
[Diagram: a Data Node writing Results.txt (Blk A, Blk B, Blk C) keeps a copy of each block itself; the other replicas of A, B, and C land on Data Nodes in other racks, across Rack 1 through Rack N.]
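The "More blocks = Wider spread" rule comes down to simple arithmetic on file size and block size. The sketch below is an illustration only: the 64 MB and 128 MB block sizes, the replication factor of 3, and the 100-node cluster are assumed example values, and `block_count` and `max_nodes_touched` are hypothetical helper names.

```python
MB = 1024 ** 2
TB = 1024 ** 4


def block_count(file_size, block_size):
    """Number of HDFS blocks needed for a file (the last block may be partial)."""
    return -(-file_size // block_size)  # ceiling division on integers


def max_nodes_touched(file_size, block_size, replication, cluster_nodes):
    """Upper bound on how many distinct Data Nodes can hold part of the file."""
    replicas = block_count(file_size, block_size) * replication
    return min(replicas, cluster_nodes)


# Same 1 TB file, two assumed block sizes: smaller blocks mean more blocks.
for size in (64 * MB, 128 * MB):
    print(size // MB, "MB blocks:", block_count(1 * TB, size), "blocks")

# A small 256 MB file in an assumed 100-node cluster.
print(max_nodes_touched(256 * MB, 64 * MB, 3, 100))
print(max_nodes_touched(256 * MB, 128 * MB, 3, 100))
```

Halving the block size doubles the number of blocks, and with it the number of Data Nodes a single write can reach.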
Name Node
[Diagram: Data Nodes 1, 2, and 3 each hold blocks A and C. Data Nodes to Name Node: “I’m alive!” and “I have blocks: A, C.” Name Node: “Awesome! Thanks.” Name Node metadata: File system: File.txt = A,C; DN1: A,C; DN2: A,C; DN3: A,C.]
• Data Node sends Heartbeats
• Every 10th heartbeat is a Block report
• Name Node builds metadata from Block reports
• TCP – every 3 seconds
• If Name Node is down, HDFS is down
Re-replicating missing replicas
[Diagram: Name Node: “Uh Oh! Missing replicas.” Metadata: DN1: A,C; DN2: A,C; DN3: A,C. Rack Awareness: Rack1: DN1, DN2; Rack5: DN3; Rack9: DN8. Name Node to a Data Node: “Copy blocks A,C to Node 8.” Data Nodes 1, 2, 3, and 8 are each shown with blocks A and C.]
• Missing Heartbeats signify lost Nodes
• Name Node consults metadata, finds affected data
• Name Node consults Rack Awareness script
• Name Node tells a Data Node to re-replicate
Secondary Name Node
[Diagram: Secondary Name Node to Name Node: “It’s been an hour, give me your metadata.” Name Node file system metadata: File.txt = A,C.]
• Not a hot standby for the Name Node
• Connects to Name Node every hour*
• Housekeeping, backup of Name Node metadata
• Saved metadata can rebuild a failed Name Node
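The bullets above can be illustrated with a toy model. This is only a sketch: a real Secondary Name Node merges the Name Node's edit log into its file system image, whereas the code below simply copies a dictionary. The classes, the `tick` and `rebuild` methods, and the timestamps are hypothetical; only the one-hour interval comes from the slide.

```python
import copy

CHECKPOINT_INTERVAL_S = 3600  # "every hour" on the slide


class NameNode:
    def __init__(self, metadata=None):
        # Toy metadata: file name -> list of block names
        self.metadata = metadata if metadata is not None else {}


class SecondaryNameNode:
    def __init__(self):
        self.saved = None            # last metadata copy pulled from the Name Node
        self.last_checkpoint = None  # time (seconds) of the last pull

    def tick(self, name_node, now):
        """Pull a metadata copy if an hour has passed; return True if it did."""
        due = (self.last_checkpoint is None
               or now - self.last_checkpoint >= CHECKPOINT_INTERVAL_S)
        if due:
            self.saved = copy.deepcopy(name_node.metadata)
            self.last_checkpoint = now
        return due

    def rebuild(self):
        """Build a replacement Name Node from the saved metadata."""
        return NameNode(copy.deepcopy(self.saved))


nn = NameNode({"File.txt": ["A", "C"]})
snn = SecondaryNameNode()
snn.tick(nn, now=0)                            # first checkpoint
nn.metadata["Results.txt"] = ["A", "B", "C"]   # written after the checkpoint
snn.tick(nn, now=1800)                         # only 30 minutes later: no pull
rebuilt = snn.rebuild()
print(rebuilt.metadata)
```

The rebuilt Name Node knows about File.txt but not Results.txt, which is why the Secondary Name Node is a backup and not a hot standby: anything written since the last checkpoint is missing from the saved copy.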
Client reading files from HDFS
[Diagram: Client to Name Node: “Tell me the block locations of Results.txt.” Name Node to Client: “Blk A = 1,5,6; Blk B = 8,1,2; Blk C = 5,8,9.” Name Node metadata: Results.txt = Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9. The Data Nodes holding blocks A, B, and C sit in Rack 1, Rack 5, and Rack 9.]
• Client receives Data Node list for each block
• Client picks first Data Node for each block
• Client reads blocks sequentially
Data Node reading files from HDFS
[Diagram: a Data Node to the Name Node: “Tell me the locations of Block A of File.txt.” Name Node: “Block A = 1,5,6.” Name Node metadata: File.txt = Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9. Rack awareness: Rack 1: Data Node 1, Data Node 2, Data Node 3; Rack 5: Data Node 5. Data Nodes are shown in Rack 1, Rack 5, and Rack 9.]
• Name Node provides rack local Nodes first
• Leverage in-rack bandwidth, single hop
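"Rack local Nodes first" amounts to sorting a block's replica list by how close each replica is to the reader. The sketch below is a simplified two-level ordering (same node, then same rack, then everything else), not the actual HDFS implementation. The rack table reuses the Rack 1 and Rack 5 membership from the rack awareness slides, and `order_replicas` is a hypothetical name.

```python
# Rack membership as listed on the rack awareness slides (Data Node -> rack)
RACKS = {1: "Rack 1", 2: "Rack 1", 3: "Rack 1",
         5: "Rack 5", 6: "Rack 5", 7: "Rack 5"}


def order_replicas(replicas, reader, racks):
    """Return replicas sorted so the reader's own node and rack come first.

    Python's sort is stable, so replicas at the same distance keep the
    order in which the Name Node metadata listed them.
    """
    reader_rack = racks[reader]  # the reading node must be in the rack table
    return sorted(
        replicas,
        key=lambda dn: (dn != reader, racks.get(dn) != reader_rack),
    )


block_a = [1, 5, 6]  # Blk A: DN1, DN5, DN6 from the metadata table
print(order_replicas(block_a, reader=3, racks=RACKS))  # reader in Rack 1
print(order_replicas(block_a, reader=7, racks=RACKS))  # reader in Rack 5
```

A reader in Rack 1 is pointed at Data Node 1 first, while a reader in Rack 5 is pointed at Data Nodes 5 and 6 before Data Node 1, so the read stays behind a single switch whenever a rack-local replica exists.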
Data Processing: Map
[Diagram: Client to Job Tracker: “How many times does ‘Fraud’ appear in File.txt?” The Job Tracker sends Map Tasks to the Data Nodes holding File.txt, for example “Count ‘Fraud’ in Block C”. Map Task results: Data Node 1 (block A): Fraud = 3; Data Node 5 (block B): Fraud = 0; Data Node 9 (block C): Fraud = 11.]
• Map: “Run this computation on your local data”
• Job Tracker delivers Java code to Nodes with local data
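A minimal sketch of the Map step described above, written in Python for brevity even though the slide says the Job Tracker ships Java code. The block contents are invented so that the counts match the slide (3, 0, and 11); `map_count` is a hypothetical name, and matching is assumed to be case-sensitive on whole words.

```python
import re


def map_count(block_text, word="Fraud"):
    """Map task: count whole-word occurrences of `word` in one local block."""
    return len(re.findall(r"\b" + re.escape(word) + r"\b", block_text))


# Hypothetical block contents, keyed by the Data Node that stores each block
local_blocks = {
    "Data Node 1 (A)": "Fraud " * 3 + "please refund my order",
    "Data Node 5 (B)": "where is my order",
    "Data Node 9 (C)": "Fraud! " * 11,
}

# Each Data Node runs the same function on its own block only
map_output = {node: map_count(text) for node, text in local_blocks.items()}
for node, count in map_output.items():
    print(node, "Fraud =", count)
```

Each call touches only the text stored on that node, which is the point of the Map phase: the computation moves to the data, so no block has to cross the network.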
What if data isn’t local?
[Diagram: Client to Job Tracker: “How many times does ‘Fraud’ appear in File.txt?” Map Tasks run on Data Node 5 (block B, Fraud = 0) and Data Node 9 (block C, Fraud = 11). Data Node 1 holds block A but reports “no Map tasks left”; Data Node 2 says “I need block A”. Racks shown: Rack 1, Rack 5, Rack 9.]
• Job Tracker tries to select Node in same rack as data
• Name Node rack awareness
Data Processing: Reduce
[Diagram: the Job Tracker starts a Reduce Task on Data Node 3: “Sum ‘Fraud’”. Map Tasks on Data Node 1 (A), Data Node 5 (B), and Data Node 9 (C) send their output to the Reduce Task, which computes Fraud = 14 and writes Results.txt (X, Y, Z) to HDFS, where the Client can read it.]
• Reduce: “Run this computation across Map results”
• Map Tasks deliver output data over the network
• Reduce Task data output written to and read from HDFS
Unbalanced Cluster
[Diagram: Rack 1 and Rack 2 plus two new racks of Data Nodes behind new switches, with File.txt in the cluster. Data Node speech bubbles: *“I’m bored!” and **“I was assigned a Map Task but don’t have the block. Guess I need to get it.”]
• Hadoop prefers local processing if possible
• New servers underutilized for Map Reduce, HDFS*
• Might see more network bandwidth, slower job times**
Cluster Balancing
[Diagram: the same cluster as on the previous slide (Rack 1, Rack 2, and two new racks behind new switches), with File.txt.]
brad@cloudera-1:~$ hadoop balancer
• Balancer utility (if used) runs in the background
• Does not interfere with Map Reduce or HDFS
• Default speed limit 1 MB/s
Thanks!
Narrated at: http://bradhedlund.com/?p=3108