
MapReduce/Bigtable for Distributed Optimization

Keith Hall, Scott Gilpin, Gideon Mann
Google Inc. (Zurich, Thornton, New York)
Presented by: Slav Petrov

Outline

• Large-scale Gradient Optimization
  – Distributed Gradient
  – Asynchronous Updates
  – Iterative Parameter Mixtures
• MapReduce and Bigtable
• Experimental Results

Gradient Optimization Setting

Goal: find θ* = argmin_θ f(θ)

If f is differentiable, then solve via gradient updates:

  θ_{i+1} = θ_i - α ∇f(θ_i)

Consider the case where f is composed of a sum of differentiable functions f_q; then the gradient update can be written as:

  θ_{i+1} = θ_i - α Σ_q ∇f_q(θ_i)

This case is the focus of the talk.
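A minimal sketch of this setting, not from the talk: the quadratic toy functions f_q(θ) = 0.5·||θ - q||², the step size, and the helper names are illustrative assumptions.

```python
# Gradient descent when the objective decomposes as f(theta) = sum_q f_q(theta).
import numpy as np

def full_gradient_step(theta, grad_f_q, data, alpha=0.1):
    """One update theta <- theta - alpha * sum_q grad f_q(theta)."""
    grad = np.zeros_like(theta)
    for q in data:                      # one f_q per data instance
        grad += grad_f_q(theta, q)
    return theta - alpha * grad

# Toy example: f_q(theta) = 0.5 * ||theta - q||^2, so grad f_q(theta) = theta - q.
grad_f_q = lambda theta, q: theta - q
data = [np.array([1.0, 2.0]), np.array([3.0, 0.0])]
theta = np.zeros(2)
for _ in range(100):
    theta = full_gradient_step(theta, grad_f_q, data)
print(theta)                            # approaches the mean of the data, [2.0, 1.0]
```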

Maximum Entropy Models

X : input space
Y : output space
Φ : X × Y → R^n : feature mapping
S = ((x_1, y_1), ..., (x_m, y_m)) : training data

p_θ(y | x) = (1/Z) exp(θ · Φ(x, y)) : probability model

θ* = argmin_θ - (1/m) Σ_i log p_θ(y_i | x_i)

The objective function is a summation of functions, each of which is computed on one data instance.
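To make the per-instance decomposition concrete, here is a sketch, assumed rather than taken from the talk, of the negative log-likelihood and gradient contributed by a single training instance under the model above; representing Φ(x, ·) as a dict of feature vectors is an illustrative choice.

```python
# Per-instance term of the maxent objective: -log p_theta(y | x) and its gradient.
import numpy as np

def instance_nll_and_grad(theta, phi, y):
    """phi: dict mapping each candidate label y' to its feature vector Phi(x, y')."""
    scores = {yp: float(theta @ phi[yp]) for yp in phi}
    m = max(scores.values())
    log_z = m + np.log(sum(np.exp(s - m) for s in scores.values()))
    probs = {yp: np.exp(scores[yp] - log_z) for yp in phi}
    nll = log_z - scores[y]                                   # -log p(y | x)
    # Gradient of -log p(y | x): E_p[Phi(x, .)] - Phi(x, y).
    grad = sum(probs[yp] * phi[yp] for yp in phi) - phi[y]
    return nll, grad
```

Summing these per-instance values over the training data gives the full objective and gradient used by the distributed methods below.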

Distributed Gradient

Observe that the gradient update θ_{i+1} = θ_i - α Σ_q ∇f_q(θ_i) can be broken down into three phases:

1. Compute ∇f_q(θ_i) for each q: can be done in parallel.
2. Sum the partial gradients over q: cheap to compute.
3. Update step: cost depends on the number of features, not on the data size.

At each iteration, the complete θ must be sent to each parallel worker (see the sketch below).

[Chu et al. ’07]
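A sketch of the three phases using Python's multiprocessing as a stand-in for the parallel workers; the sharding scheme, the toy f_q, and the pool size are assumptions, not the talk's implementation.

```python
# Distributed gradient: parallel per-shard gradients, cheap sum, single update.
import numpy as np
from multiprocessing import Pool

def per_shard_gradient(args):
    """Phase 1 (parallel): one worker sums grad f_q over its own data shard."""
    theta, shard = args
    return sum(theta - q for q in shard)       # toy f_q(theta) = 0.5 * ||theta - q||^2

def distributed_gradient_step(theta, shards, alpha=0.1):
    # The complete theta is shipped to every worker at each iteration.
    with Pool(processes=len(shards)) as pool:
        partial_grads = pool.map(per_shard_gradient, [(theta, s) for s in shards])
    grad = sum(partial_grads)                  # Phase 2: summing is cheap
    return theta - alpha * grad                # Phase 3: cost depends on #features only
```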

Stochastic Gradient

Alternatively, approximate the sum by a subset of the functions ∇f_q(θ_i). If the subset is of size 1, then the algorithm is online:

  θ_{i+1} = θ_i - α ∇f_q(θ_i)

Stochastic gradient approaches provably converge, and in practice are often much quicker than exact gradient calculation methods.
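A sketch of the size-1 (online) case; the shuffling, learning rate, and toy per-instance gradient are illustrative assumptions.

```python
# Stochastic (online) gradient: one f_q per update instead of the full sum.
import random
import numpy as np

def sgd(theta, data, grad_f_q, alpha=0.01, epochs=5):
    for _ in range(epochs):
        random.shuffle(data)
        for q in data:
            theta = theta - alpha * grad_f_q(theta, q)   # update from a single f_q
    return theta
```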

Asynchronous Updates

Asynchronous updates are a distributed extension of stochastic gradient. Each worker:

1. Gets the current θ_i
2. Computes ∇f_q(θ_i)
3. Updates the global parameter vector

Since the workers will not be computing in lock-step, some gradients will be based on old parameters. Nonetheless, this also converges (see the sketch below).

[Langford et al. ’09]
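A sketch that mimics the worker loop with threads sharing one global parameter vector; the thread count, toy gradient, and deliberate lack of locking are assumptions meant only to show how updates can be computed from stale parameters.

```python
# Asynchronous updates: workers read, compute, and write without coordination.
import threading
import numpy as np

theta = np.zeros(2)                                      # shared global parameter vector
data = [np.array([1.0, 2.0]), np.array([3.0, 0.0])] * 500

def worker(shard, alpha=0.01):
    global theta
    for q in shard:
        local = theta.copy()                             # 1. get the current (possibly stale) theta
        grad = local - q                                 # 2. compute grad f_q at the local copy
        theta = theta - alpha * grad                     # 3. update the global parameters, no lock

threads = [threading.Thread(target=worker, args=(data[i::4],)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(theta)                                             # still ends up near the data mean
```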

Iterative Parameter Mixtures

Separately estimate θ on different samples of the data, then combine.

Iterative: take the resulting θ as the starting point for the next round and rerun (see the sketch below).

For certain classes of objective functions (e.g. maxent and perceptron), convergence can also be shown. Little communication between workers.

[Mann et al. ’09, McDonald et al. ’10]
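A sketch of the mix-and-restart loop; the inner per-shard trainer, the uniform averaging, and the toy gradient are assumptions (the talk applies the idea to maxent and perceptron training).

```python
# Iterative parameter mixtures: train per shard independently, average, repeat.
import numpy as np

def train_on_shard(theta0, shard, alpha=0.05, epochs=20):
    theta = theta0.copy()
    for _ in range(epochs):
        for q in shard:
            theta -= alpha * (theta - q)        # toy per-instance gradient step
    return theta

def iterative_parameter_mixture(shards, dim, rounds=3):
    theta = np.zeros(dim)
    for _ in range(rounds):
        # Each shard is trained separately, starting from the current mixture.
        per_shard = [train_on_shard(theta, s) for s in shards]
        theta = np.mean(per_shard, axis=0)      # mix (average) the per-shard models
    return theta
```

Only the mixed θ moves between workers each round, which is why the method needs so little communication.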

Typical Datacenter (at Google)

Racks of commodity machines connected by commodity ethernet switches, with relatively few cores per machine (e.g. <= 4).

No GPUs. No Shared Memory. No NAS. No massive multicore machines.

[Hölzle, Barroso ’09]

Implications for algorithms

• Communication between workers is costly
  – No shared memory means it is very hard to maintain global state
  – Even ethernet communication can be a significant source of latency
• Each worker is relatively low-powered
  – No machines have massive amounts of RAM or disk
  – Simple isolated jobs performed in parallel will yield significant gains

Outline

• Distributed Gradient Optimization
• MapReduce and Bigtable
  – Optimal Configurations
• Experimental Results

MapReduce

1. Backup map workers compensate for the failure of a map task.

2. The sorting phase requires all map jobs to be completed before the reduce phase starts.

MapReduce Implementations

Distributed Gradient and Iterative Parameter Mixtures are straightforward to implement in MapReduce.

Distributed Gradient: each iteration has multiple mappers (Map: compute gradient) and one reducer (Reduce: sum and update).

Iterative Parameter Mixtures: one mapper/reducer pair; each mapper runs to convergence, and then there is an outside mixture-and-loop step.
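A sketch of one distributed-gradient iteration phrased as a mapper and a reducer; the in-memory driver and the toy gradient stand in for a real MapReduce runtime and are assumptions.

```python
# Map: compute a partial gradient per shard.  Reduce: sum and update.
import numpy as np

def gradient_mapper(theta, shard):
    """One mapper per shard: emit this shard's partial gradient."""
    yield "grad", sum(theta - q for q in shard)     # toy f_q(theta) = 0.5 * ||theta - q||^2

def sum_and_update_reducer(theta, partial_grads, alpha=0.1):
    """Single reducer: sum the partial gradients and apply the update."""
    return theta - alpha * sum(partial_grads)

def run_iteration(theta, shards, alpha=0.1):
    emitted = [value for shard in shards for _, value in gradient_mapper(theta, shard)]
    return sum_and_update_reducer(theta, emitted, alpha)
```

The outer training loop simply calls run_iteration until the stopping criterion is met; for iterative parameter mixtures the same structure applies, with the mapper training to convergence on its shard and the reducer averaging the resulting models.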

Asynchronous – needs shared memory.

Bigtable

Distributed storage – built off of GFS.

Not a database: cells are indexed by row/column/time (e.g. one can recover the n most recent values).

Multiple layers of caching: machine and network level.

Transaction processing is optional.
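A toy illustration, emphatically not the Bigtable API, of the indexing scheme described above: cells keyed by (row, column) with timestamped versions, keeping only the n most recent values.

```python
# Toy versioned cell store: cells indexed by row/column/time.
import time
from collections import defaultdict

class ToyVersionedStore:
    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.cells = defaultdict(list)         # (row, col) -> [(timestamp, value), ...]

    def put(self, row, col, value):
        versions = self.cells[(row, col)]
        versions.append((time.time(), value))
        versions.sort(key=lambda tv: tv[0], reverse=True)   # newest first
        del versions[self.max_versions:]       # keep only the most recent versions

    def get(self, row, col, n=1):
        """Recover the n most recent values for a cell."""
        return [value for _, value in self.cells[(row, col)][:n]]
```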

Asynchronous Updates

[Diagram: Map workers compute gradients and update the parameters stored in Bigtable.]

Sorting turned off.

Transaction processing turned off: overwrites are possible.

[Diagram label: Reduce: sum and update Bigtable.]

Gradients are collected into mini-batches.

Outline

• Distributed Gradient Optimization
• MapReduce and Bigtable
• Experimental Results

Experimental Set-up

Click-through data:
– A. 370 million training instances
  • Partitioned into 200 pieces
  • 240 worker machines
– B. 1.6 billion instances
  • Partitioned into 900 pieces
  • 600 worker machines
– Evaluation: 185 million instances (temporally occurring after the first instances)

Stopping Criterion Tricky

Complicated because the implementations are all slightly different:
• Gradient change:
  – There is no global gradient in asynchronous or iterative parameter mixtures
• Log-likelihood / objective function value:
  – The asynchronous log-likelihood fluctuates frequently
• Fixed number of iterations:
  – Means different things for each implementation

Currently working on cleaner performance estimation; take the AUC results with a grain of salt.

370 Million Training Instances

                                          AUC     CPU (rel.)   Wall clock (rel.)   Network Usage (rel.)
Single-core SGD                           .8557   0.10         9.64                0.03
Distributed Gradient                      .8323   1.00         1.00                1.00
Asynchronous SGD                          .7283   2.17         7.50                82.47
Iterative Parameter Mixture: MaxEnt       .8557   0.12         0.13                0.10
Iterative Parameter Mixture: Perceptron   .8870   0.16         0.20                0.22

Observations:
• Even with tuning, asynchronous updates are impractical.
• Iterative parameter mixtures perform well.
• IPM with 240 machines is ~70x as fast as a single core; with 10 machines it is 8x as fast as a single core.

1.6 Billion Training Instances

                                          AUC     CPU (rel.)   Wall clock (rel.)   Network Usage (rel.)
Distributed Gradient                      .8117   1.00         1.00                1.00
Asynchronous SGD                          .8525   3.71         14.33               133.93
Iterative Parameter Mixture: MaxEnt       .8904   0.17         0.15                0.15
Iterative Parameter Mixture: Perceptron   .8961   0.24         0.20                0.38

Observations:
• Network usage much worse with more machines, and asynchronous much slower.
• Iterative parameter mixtures still effective with 900 partitions.

Conclusions

Asynchronous updates are not suited to current datacenter configurations.

Iterative parameter mixtures give significant time improvements over distributed gradient.