Understanding Hadoop Clusters and the Network

Part 1. Introduction and Overview
Brad Hedlund
http://bradhedlund.com | http://www.linkedin.com/in/bradhedlund | @bradhedlund

Hadoop Server Roles

[Diagram: Hadoop server roles]

• Clients
• Masters: Name Node and Secondary Name Node (HDFS), Job Tracker (Map Reduce)
• Slaves: a Data Node and Task Tracker running together on each slave machine
• Map Reduce provides distributed data analytics; HDFS provides distributed data storage

Hadoop Cluster

[Diagram: a Hadoop cluster connected to the outside world through core switches, with a top-of-rack switch per rack; Rack 1 hosts the Name Node, Job Tracker, Secondary Name Node, and a Client, while Racks 1 through N are filled with Data Node + Task Tracker slaves]

Typical Workflow

• Load data into the cluster (HDFS writes)
• Analyze the data (Map Reduce)
• Store results in the cluster (HDFS writes)
• Read the results from the cluster (HDFS reads)

Sample scenario: How many times did our customers type the word “Fraud” into emails sent to customer service?

File.txt: a huge file containing all emails sent to customer service
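A minimal sketch of the first workflow step, loading File.txt into HDFS with the Java FileSystem API; the local and HDFS paths here are assumptions for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LoadFile {
        public static void main(String[] args) throws Exception {
            // Configuration picks up core-site.xml / hdfs-site.xml from the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Copy the local email dump into the cluster; HDFS splits it into blocks
            // and replicates them across Data Nodes behind the scenes.
            fs.copyFromLocalFile(new Path("/tmp/File.txt"),          // hypothetical local path
                                 new Path("/user/brad/File.txt"));   // hypothetical HDFS path
            fs.close();
        }
    }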

Writing files to HDFS

[Diagram: the Client asks the Name Node, "I want to write Blocks A, B, C of File.txt"; the Name Node replies, "OK. Write to Data Nodes 1, 5, 6"]

• Client consults the Name Node
• Client writes the block directly to one Data Node (sketched in Java below)
• Data Nodes replicate the block
• Cycle repeats for the next block
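A minimal client-side sketch of that write path; the paths are assumptions, and the block splitting, Name Node consultation, and Data Node replication pipeline all happen inside the HDFS client library:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteFile {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // create() obtains a lease and the first set of target Data Nodes from the Name Node
            FSDataOutputStream out = fs.create(new Path("/user/brad/File.txt")); // hypothetical path
            out.write("To: customerservice@example.com ...".getBytes("UTF-8"));
            // close() flushes the final block and tells the Name Node the file is complete
            out.close();
            fs.close();
        }
    }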

Hadoop Rack Awareness – Why?

[Diagram: the Name Node keeps rack-aware metadata (e.g., Rack 1: Data Nodes 1, 2, 3; Rack 5: Data Nodes 5, 6, 7) alongside file metadata (File.txt = Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9) for Data Nodes spread across Racks 1, 5, and 9]

• Never lose all data if an entire rack fails
• Keep bulky flows in-rack when possible
• Assumption that in-rack is higher bandwidth, lower latency (a sketch of the host-to-rack lookup follows)
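The rack-aware metadata comes from an admin-supplied mapping of Data Node addresses to rack IDs, wired in through the topology script setting (topology.script.file.name in Hadoop 1.x) or a Java DNSToSwitchMapping plugin. A minimal sketch of the lookup such a mapping performs; the addresses and rack names below are made up for illustration:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class RackLookup {
        // Hypothetical inventory: Data Node address -> rack ID
        private static final Map<String, String> HOST_TO_RACK = new HashMap<String, String>();
        static {
            HOST_TO_RACK.put("10.1.0.1", "/rack1");   // Data Node 1
            HOST_TO_RACK.put("10.1.0.2", "/rack1");   // Data Node 2
            HOST_TO_RACK.put("10.5.0.5", "/rack5");   // Data Node 5
            HOST_TO_RACK.put("10.5.0.6", "/rack5");   // Data Node 6
        }

        // Resolve each host to its rack; unknown hosts fall back to the default rack,
        // mirroring Hadoop's behavior when no mapping is found.
        public static List<String> resolve(List<String> hosts) {
            List<String> racks = new ArrayList<String>();
            for (String host : hosts) {
                String rack = HOST_TO_RACK.get(host);
                racks.add(rack != null ? rack : "/default-rack");
            }
            return racks;
        }
    }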

Preparing HDFS writes

[Diagram: the Client asks to write Block A of File.txt; the Name Node replies, "OK. Write to Data Nodes 1, 5, 6"; the Client contacts Data Node 1 (Rack 1), which readies Data Nodes 5 and 6 (Rack 5), and each answers "Ready!"]

• Name Node picks two nodes in the same rack, one node in a different rack
• Data protection
• Locality for M/R

Pipelined Write

[Diagram: the Client streams Block A to Data Node 1 (Rack 1), which forwards it to Data Node 5, which forwards it to Data Node 6 (both in Rack 5)]

• Data Nodes 1 & 5 pass the data along as it’s received
• TCP port 50010

Pipelined Write

[Diagram: once Block A lands on all three Data Nodes, each reports "Block received" to the Name Node, which records Blk A: DN1, DN2, DN3 in its metadata, and the Client gets a "Success" acknowledgment]

Multi-block Replication Pipeline

[Diagram: the Client writes Blocks A, B, and C of File.txt, each through its own replication pipeline of three Data Nodes spread across Racks 1, 4, and 5]

• 1 TB file = 3 TB of storage and 3 TB of network traffic: with the default replication factor of 3, each block is written once by the Client and copied twice more by the pipeline, so disk use and network traffic both roughly triple

Client writes span the HDFS cluster

[Diagram: a Client's block writes land on Data Nodes in every rack of the cluster]

Factors:
• Block size
• File size
More blocks = wider spread. For example, with the Hadoop 1.x default 64 MB block size, a 1 GB file splits into 16 blocks, and at replication factor 3 that is 48 block replicas scattered across the cluster.

Data Node writes span itself, and other racks

[Diagram: a Data Node writing Blocks A, B, and C of Results.txt keeps one replica of each block locally and places the remaining replicas on Data Nodes in other racks]

Name Node

[Diagram: Data Nodes 1, 2, and 3 each hold blocks A and C and report, "I'm alive! I have blocks A, C"; the Name Node replies, "Awesome! Thanks." and builds its file system metadata: File.txt = A, C; DN1: A, C; DN2: A, C; DN3: A, C]

• Data Node sends Heartbeats
• Every 10th heartbeat is a Block report
• Name Node builds metadata from Block reports
• TCP – every 3 seconds
• If Name Node is down, HDFS is down

Re-replicating missing replicas

[Diagram: Data Node 3 stops reporting ("Uh Oh! Missing replicas"); the Name Node checks its metadata (DN1: A, C; DN2: A, C; DN3: A, C) and its Rack Awareness (Rack 1: DN1, DN2; Rack 5: DN3; Rack 9: DN8)]

[Diagram: the Name Node instructs a surviving Data Node, "Copy blocks A, C to Node 8", and Data Node 8 gains replicas of A and C]

• Missing Heartbeats signify lost Nodes
• Name Node consults metadata, finds affected data
• Name Node consults Rack Awareness script
• Name Node tells a Data Node to re-replicate

Secondary Name Node

[Diagram: the Secondary Name Node tells the Name Node, "It's been an hour, give me your metadata", and takes a copy of the file system metadata (File.txt = A, C)]

• Not a hot standby for the Name Node
• Connects to Name Node every hour* (the checkpoint interval setting is sketched below)
• Housekeeping, backup of Name Node metadata
• Saved metadata can rebuild a failed Name Node
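The hourly interval is governed by the checkpoint period setting; a small sketch of reading it, assuming the Hadoop 1.x property name fs.checkpoint.period (documented default 3600 seconds):

    import org.apache.hadoop.conf.Configuration;

    public class CheckpointPeriod {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // fs.checkpoint.period: seconds between Secondary Name Node checkpoints
            long seconds = conf.getLong("fs.checkpoint.period", 3600);
            System.out.println("Secondary Name Node checkpoints every " + seconds + " seconds");
        }
    }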

Client reading files from HDFS

[Diagram: the Client asks the Name Node, "Tell me the block locations of Results.txt"; the Name Node answers from its metadata: Blk A = Data Nodes 1, 5, 6; Blk B = Data Nodes 8, 1, 2; Blk C = Data Nodes 5, 8, 9]

• Client receives a Data Node list for each block
• Client picks the first Data Node for each block
• Client reads blocks sequentially (a minimal Java read sketch follows)
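A minimal sketch of that read path; the HDFS path is an assumption, and the client library does the block-location lookup at the Name Node and streams each block from a chosen Data Node:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ReadFile {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataInputStream in = fs.open(new Path("/user/brad/Results.txt")); // hypothetical path
            // Stream the whole file to stdout; 4096 is the copy buffer size,
            // and 'true' closes the stream when done.
            IOUtils.copyBytes(in, System.out, 4096, true);
            fs.close();
        }
    }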

Data Node reading files from HDFS

[Diagram: a Data Node asks the Name Node, "Tell me the locations of Block A of File.txt"; the rack-aware Name Node replies, "Block A = Data Nodes 1, 5, 6", listing Data Nodes in the requester's own rack first]

• Name Node provides rack-local Nodes first
• Leverage in-rack bandwidth, single hop

Data Processing: Map

[Diagram: the Client asks, "How many times does 'Fraud' appear in File.txt?"; the Job Tracker launches a Map Task on a Data Node holding each block (Data Node 1: Block A, Data Node 5: Block B, Data Node 9: Block C) with instructions such as "Count 'Fraud' in Block C"; the tasks report Fraud = 3, Fraud = 0, and Fraud = 11]

• Map: "Run this computation on your local data"
• Job Tracker delivers Java code to Nodes with local data (a minimal Mapper sketch follows)
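A minimal sketch of the kind of Map code the Job Tracker ships to those Nodes, written against the org.apache.hadoop.mapreduce API; the class name and tokenization are assumptions, not the deck's actual job:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Runs against the block(s) stored on the local Data Node and emits a partial
    // count of how many times "Fraud" appears in this Map Task's share of File.txt.
    public class FraudMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Text FRAUD = new Text("Fraud");

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (String word : line.toString().split("\\s+")) {
                if (word.equals("Fraud")) {
                    count++;
                }
            }
            if (count > 0) {
                context.write(FRAUD, new IntWritable(count)); // partial count for this line
            }
        }
    }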

What if data isn’t local?

[Diagram: every Data Node holding Block A reports "no Map tasks left", so the Job Tracker assigns the task to Data Node 2 in the same rack (Rack 1), which fetches Block A from Data Node 1 over the rack switch ("I need Block A") before running its Map Task]

• Job Tracker tries to select a Node in the same rack as the data
• Name Node rack awareness

Data Processing: Reduce

[Diagram: the Job Tracker starts a Reduce Task on Data Node 3 to "Sum 'Fraud'"; the Map Tasks on Data Nodes 1, 5, and 9 send their partial counts (including Fraud = 0) over the network; the Reduce Task totals them (Fraud = 14) and writes Results.txt to HDFS]

• Reduce: "Run this computation across Map results"
• Map Tasks deliver output data over the network
• Reduce Task output is written to and read from HDFS (a minimal Reducer sketch follows)
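A matching minimal Reduce sketch, under the same assumptions as the Mapper above: it receives every partial count emitted for the key "Fraud" and writes the grand total, which lands in Results.txt on HDFS:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class FraudReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable count : counts) {
                total += count.get();   // sum the partial counts from every Map Task
            }
            context.write(word, new IntWritable(total)); // e.g., "Fraud"  14
        }
    }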

Unbalanced Cluster

[Diagram: File.txt's blocks live only on Racks 1 and 2; the Data Nodes in two newly added racks hold no blocks ("I'm bored!"*) and a Map Task assigned there must first fetch its block over the network ("I was assigned a Map Task but don't have the block. Guess I need to get it."**)]

• Hadoop prefers local processing if possible
• New servers underutilized for Map Reduce, HDFS*
• Might see more network bandwidth, slower job times**

Cluster Balancing

[Diagram: after balancing, blocks of File.txt are redistributed across the original racks and the new racks]

brad@cloudera-1:~$ hadoop balancer

• Balancer utility (if used) runs in the background
• Does not interfere with Map Reduce or HDFS
• Default speed limit 1 MB/s

Thanks!

Narrated at: http://bradhedlund.com/?p=3108
