Horn: A System for Parallel Training and Regularization of Large-Scale Neural Networks

Edward J. Yoon
[email protected]

I Am

● Edward J. Yoon
● Member and Vice President of the Apache Software Foundation
● Committer, PMC member, and Mentor of
  ○ Apache Hama
  ○ Apache Bigtop
  ○ Apache Rya
  ○ Apache Horn
  ○ Apache MRQL
● Keywords: big data, cloud, machine learning, database

What is the Apache Software Foundation?

The Apache Software Foundation is a non-profit foundation dedicated to open source software development.

1) What the Apache Software Foundation is, 2) which projects are being developed, 3) what HORN is, and 4) how to contribute to them.

Apache HTTP Server (NCSA HTTPd)

It powers nearly 500 million websites (out of about 644 million websites on the Internet).

And now!

161 Top-Level Projects, 108 sub-projects, 39 incubating podlings, 4,700+ committers, 550 ASF members, and an unknown number of developers and users.

Domain diversity. Programming language diversity.

Which projects are being developed?

What’s HORN?

● Oct 2015, accepted as an Apache Incubator project
● Was born from Apache Hama
● A system for deep neural networks
  ○ A neuron-level abstraction framework
  ○ Written in Java :/
  ○ Works in distributed environments

Apache Hama

1. K-means clustering: Hama is 1,000x faster (UT Arlington & Oracle, 2013)

2. PageRank on a 10-billion-edge graph: Hama is 3x faster than Facebook's Giraph (Samsung Electronics, Yoon & Kim, 2015)

3. Top-k set similarity joins on Flickr data: Hama is clearly faster (IEEE, University of Melbourne, 2015)

Why do we do this?

1. How do we parallelize the training of large models?
2. How do we avoid overfitting due to the large size of the network, even with large datasets?

JonathanNet

Distributed Training

[Diagram: groups of tasks (Task 1 ... Task 6) exchange parameters with Parameter Servers via parameter swapping.]

Each group performs its minibatch in the BSP paradigm and interacts with the Parameter Server asynchronously.
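As a rough illustration of that loop, here is a minimal Java sketch of one worker superstep, assuming hypothetical ParameterServerClient and Model interfaces; the real Hama/Horn APIs are not shown here and may differ.

import java.util.concurrent.CompletableFuture;

// Hypothetical interfaces for illustration; the actual Horn/Hama APIs may differ.
interface ParameterServerClient {
  CompletableFuture<Void> pushAsync(float[] gradients);  // send local updates, non-blocking
  CompletableFuture<float[]> pullAsync();                // fetch the latest global weights
}

interface Model {
  float[] computeGradient(float[][] minibatch);          // local forward/backward pass
  void setWeights(float[] weights);
}

public class BspWorker {
  private final ParameterServerClient ps;
  private final Model model;

  public BspWorker(ParameterServerClient ps, Model model) {
    this.ps = ps;
    this.model = model;
  }

  // One BSP superstep: train on the local minibatch, then swap parameters asynchronously.
  public void superstep(float[][] minibatch) {
    float[] gradients = model.computeGradient(minibatch); // synchronous local computation
    ps.pushAsync(gradients);                               // asynchronous push to the parameter server
    ps.pullAsync().thenAccept(model::setWeights);          // update local weights when they arrive
  }
}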

Like MapReduce, the Apache Hama BSP framework schedules tasks according to the distance between the tasks' input data and the requesting nodes.

[Stack diagram: BSP framework on Hama or YARN, over Hadoop HDFS.]

A Neuron-centric Model

function forward(messages [m1, m2, ...])
  sum ← 0
  for each m ∈ [m1, m2, ...] do
    sum ← sum + m.input * m.weight
  feedforward(apply(sum));

function backward(messages [m1, m2, ...])
  gradient ← 0
  for each m ∈ [m1, m2, ...] do
    gradient ← gradient + m.delta() * m.weight();
    // weight update Δw = α * output * m.delta()
    m.weight ← m.weight + α * output * m.delta()
    // push the update to the parameter server
    push(m.weight);
  backpropagate(gradient * applyDerivative(output));
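To make the callbacks above concrete, here is a hedged Java sketch of the same neuron-centric model. The Message and Neuron types, the sigmoid activation, and the learningRate field are illustrative assumptions, not necessarily Horn's actual API.

import java.util.List;

// Illustrative message carrying a connected neuron's output (forward) or error term (backward).
class Message {
  final double input;   // upstream output during the forward pass
  final double delta;   // downstream error term during the backward pass
  double weight;        // weight of the connecting edge
  Message(double input, double delta, double weight) {
    this.input = input; this.delta = delta; this.weight = weight;
  }
}

// Sketch of a neuron-centric programming model: one object per neuron.
abstract class Neuron {
  protected double output;
  protected double learningRate = 0.01;  // α in the pseudocode above

  // Framework hooks the user would rely on (assumed, not Horn's verified API).
  protected abstract void feedforward(double activation);  // send output downstream
  protected abstract void backpropagate(double error);     // send error upstream
  protected abstract void push(double weight);              // publish an update to the parameter server

  protected double apply(double sum) { return 1.0 / (1.0 + Math.exp(-sum)); }    // sigmoid
  protected double applyDerivative(double out) { return out * (1.0 - out); }     // sigmoid derivative

  // Weighted sum of incoming messages, then activate and emit.
  public void forward(List<Message> messages) {
    double sum = 0.0;
    for (Message m : messages) {
      sum += m.input * m.weight;
    }
    output = apply(sum);
    feedforward(output);
  }

  // Accumulate the gradient, update each incoming weight, and propagate the error.
  public void backward(List<Message> messages) {
    double gradient = 0.0;
    for (Message m : messages) {
      gradient += m.delta * m.weight;
      m.weight += learningRate * output * m.delta;  // Δw = α * output * δ
      push(m.weight);                               // push the update to the parameter server
    }
    backpropagate(gradient * applyDerivative(output));
  }
}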

Parallel Dropout

● Dropout [Hinton et al., 2012] is a technique for addressing the overfitting problem (it generates a lot of different sub-models).

● Parallel Dropout trains different sub-models in parallel and applies dropout to them again (a minimal sketch follows this list).
  ○ More randomness.
  ○ Reduces the model size per worker.
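A minimal sketch of the idea under simple assumptions: plain inverted dropout per worker and weight averaging as the merge step. The class and method names here are illustrative, not Horn's API.

import java.util.Random;

// Illustrative parallel-dropout helpers: each worker drops units independently,
// and the resulting sub-models are merged by averaging their weights.
public class ParallelDropout {

  // Inverted dropout: zero each activation with probability p, scale the survivors.
  static double[] dropout(double[] activations, double p, Random rng) {
    double[] out = new double[activations.length];
    for (int i = 0; i < activations.length; i++) {
      out[i] = rng.nextDouble() < p ? 0.0 : activations[i] / (1.0 - p);
    }
    return out;
  }

  // Merge sub-models trained on different workers by simple weight averaging.
  static double[] merge(double[][] workerWeights) {
    double[] merged = new double[workerWeights[0].length];
    for (double[] weights : workerWeights) {
      for (int i = 0; i < merged.length; i++) {
        merged[i] += weights[i] / workerWeights.length;
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    Random rng = new Random(42);                       // each worker would seed differently
    double[] activations = {0.3, 0.7, 0.1, 0.9};
    System.out.println(java.util.Arrays.toString(dropout(activations, 0.5, rng)));

    double[][] subModels = {{0.2, 0.4}, {0.6, 0.0}};   // weights from two workers
    System.out.println(java.util.Arrays.toString(merge(subModels)));
  }
}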

Randomness

● Random noise allows neural nets to produce multiple outputs given the same instance of input.
● Random noise limits the amount of information flowing through the network, forcing the network to learn meaningful representations of the data.
● Random noise provides "exploration energy" for finding better solutions during gradient descent.

Parallel Dropout on MNIST

[MNIST training curves: top, non-parallel dropout; bottom, Parallel Dropout.]

It trains better!

More Ideas

● RANSAC (Random Sample Consensus) for Parallel Dropout (by Nilesh)
  ○ Instead of merging 2^n models, we consider merging only K of them (see the sketch after this list).
● Dynamic addition or removal of neurons
  ○ Real-time self-evolving networks and model compression.
  ○ E.g., remove useless neurons during training.
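A rough sketch of the sampling idea only, not the proposed algorithm: draw K sub-models at random and merge just those by averaging. The names and the averaging rule are assumptions.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch: merge only K randomly sampled sub-models instead of all of them.
public class SampledMerge {

  static double[] mergeK(List<double[]> subModels, int k, Random rng) {
    List<double[]> pool = new ArrayList<>(subModels);
    Collections.shuffle(pool, rng);                 // random sample without replacement
    List<double[]> sample = pool.subList(0, Math.min(k, pool.size()));

    double[] merged = new double[sample.get(0).length];
    for (double[] weights : sample) {
      for (int i = 0; i < merged.length; i++) {
        merged[i] += weights[i] / sample.size();    // plain averaging of the sampled models
      }
    }
    return merged;
  }
}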

Appendix

● It goes from a centralized architecture to a decentralized architecture.
  ○ Instead of gathering data into a centralized cloud and computing on it,
  ○ we can share knowledge or experience among machines or devices.