Harp: Collective Communication on Hadoop

Judy Qiu, Indiana University

Outline

• Machine Learning on Big Data
• Big Data Tools
• Iterative MapReduce Model
• MDS Demo
• Harp

Machine Learning on Big Data

• Mahout on Hadoop • https://mahout.apache.org/

• MLlib on Spark • http://spark.apache.org/mllib/

• GraphLab Toolkits • http://graphlab.org/projects/toolkits.html • GraphLab Computer Vision Toolkit

[Figure: big data tools organized by programming model (MapReduce, DAG, Iterative MapReduce, Graph, BSP/Collective) — Hadoop, MPI, HaLoop, Twister, Harp, Giraph, Hama, GraphLab, Spark, GraphX, Stratosphere, Dryad/DryadLINQ, Reef; query layers: Pig/Pig Latin, Hive, Tez, Drill, Shark, MRQL; streaming: S4, Storm, Samza, Spark Streaming.]

Big Data Tools for HPC and Supercomputing

• MPI (Message Passing Interface, 1992) • Provides standardized function interfaces for communication between parallel processes.

• Collective communication operations • Broadcast, Scatter, Gather, Reduce, Allgather, Allreduce, Reduce‐scatter.
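For concreteness, the sketch below shows an allreduce in Java; it assumes the mpiJava/MPJ Express binding (not named on this slide), so treat the exact class and method signatures as illustrative rather than prescribed by the talk.

import mpi.MPI;

// Minimal allreduce sketch, assuming the mpiJava/MPJ Express binding:
// every process contributes its rank, and all processes receive the global sum.
public class AllreduceExample {
  public static void main(String[] args) throws Exception {
    MPI.Init(args);
    int rank = MPI.COMM_WORLD.Rank();

    double[] send = {rank};        // each process's local contribution
    double[] recv = new double[1];

    // Allreduce = Reduce (here: sum) whose result is delivered to every process
    MPI.COMM_WORLD.Allreduce(send, 0, recv, 0, 1, MPI.DOUBLE, MPI.SUM);

    System.out.println("Rank " + rank + ": global sum = " + recv[0]);
    MPI.Finalize();
  }
}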

• Popular implementations • MPICH (2001) • Open MPI (2004) • http://www.open-mpi.org/

MapReduce Model

Google MapReduce (2004) • Jeffrey Dean et al. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.

Apache Hadoop (2005) • http://hadoop.apache.org/ • http://developer.yahoo.com/hadoop/tutorial/

Apache Hadoop 2.0 (2012) • Vinod Kumar Vavilapalli et al. Apache Hadoop YARN: Yet Another Resource Negotiator. SoCC 2013. • Separation between resource management and the computation model.

Key Features of the MapReduce Model

• Designed for clouds • Large clusters of commodity machines

• Designed for big data • Support from a distributed file system built on local disks (GFS / HDFS) • Disk-based intermediate data transfer in shuffling

• MapReduce programming model • Computation pattern: Map tasks and Reduce tasks • Data abstraction: key-value pairs
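To make the Map/Reduce computation pattern and the key-value abstraction concrete, here is a minimal word-count sketch using the standard Hadoop MapReduce API; it is an illustration added here, not code from the talk.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (byte offset, line of text) -> (word, 1)
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reduce: (word, [1, 1, ...]) -> (word, total count)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}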

Applications & Different Interconnection Patterns

[Figure: four interconnection patterns — (a) Map Only (pleasingly parallel), (b) Classic MapReduce, (c) Iterative MapReduce, (d) Loosely Synchronous.]

(a) Map Only (pleasingly parallel), no communication: CAP3 gene analysis, Smith-Waterman distances, document conversion (PDF to HTML), brute-force searches in cryptography, parametric sweeps, PolarGrid MATLAB data analysis.
(b) Classic MapReduce: High Energy Physics (HEP) histograms, distributed search, distributed sorting, information retrieval, calculation of pairwise distances for sequences (BLAST).
(c) Iterative MapReduce, collective communication: expectation-maximization algorithms, linear algebra, data mining including K-means clustering, deterministic annealing clustering, multidimensional scaling (MDS), PageRank.
(d) Loosely Synchronous, MPI: many MPI scientific applications utilizing a wide variety of communication constructs, including local interactions; solving differential equations; particle dynamics with short-range forces.

The domain of MapReduce and its iterative extensions covers (a) through (c).

What is Iterative MapReduce?

Iterative MapReduce
– MapReduce is a programming model instantiating the paradigm of bringing computation to data
– Iterative MapReduce extends the MapReduce programming model and supports iterative algorithms for data mining and data analysis
– Is it possible to use the same computational tools on HPC and Cloud?
– Enables scientists to focus on science, not on programming distributed systems

(Iterative) MapReduce in Context

[Figure: layered architecture supporting scientific simulations (data mining and data analysis) —
Applications: kernels, genomics, proteomics, information retrieval, polar science, scientific simulation data analysis and management, dissimilarity computation, clustering, multidimensional scaling, generative topographic mapping
Programming model: high-level language; security, provenance, portal services and workflow
Runtime: cross-platform iterative MapReduce (collectives, fault tolerance, scheduling)
Storage: distributed file systems, object store, data-parallel file system
Infrastructure: Windows Server and Linux HPC (bare-system, appliance, virtualization), Amazon Cloud, Azure Cloud, Grid
Hardware: CPU nodes, GPU nodes]

[Figure: HPC-ABDS software stack (Apache Big Data Stack alongside HPC equivalents; green layers Apache/commercial cloud, darker layers HPC) —
Orchestration and workflow: Oozie, ODE, Airavata, OODT; non-Apache: Pegasus, Kepler, Swift, Taverna, Trident, ActiveBPEL, BioKepler, Galaxy
Data analytics libraries: machine learning (Mahout, MLlib, MLbase), statistics and bioinformatics (R, Bioconductor, CompLearn), imagery (ImageJ), linear algebra (ScaLAPACK, PETSc)
High-level data processing: Hive, HCatalog, Pig, Shark, MRQL, Impala
Parallel, horizontally scalable processing: Hadoop (MapReduce), Spark and Twister (iterative MapReduce), Tez (DAG), Hama (BSP), Storm, S4, Samza (stream), Giraph (~Pregel, graph), Stratosphere
Cross-cutting: monitoring (Ambari, Ganglia, Nagios, Inca), distributed coordination (ZooKeeper, JGroups), message protocols (Thrift, Protobuf), security and privacy]

ABDS inter-process communication: Hadoop and Spark communications; pub/sub messaging: Netty (NA), ZeroMQ (NA), ActiveMQ, Qpid, Kafka.

HPC inter-process communication: MPI (NA), Harp collectives (NA), reductions.

Note: Harp provides a new communication layer.

[Figure: remaining HPC-ABDS layers — in-memory caches (Memcached, Redis, Hazelcast, Ehcache, GORA), object-relational mapping (Hibernate, OpenJPA, JDBC), SQL/NoSQL stores (MySQL, SciDB, Phoenix, HBase, Accumulo, Cassandra, MongoDB, CouchDB, Solr, Lucene, Berkeley DB, Azure Table, Dynamo, Riak, Voldemort, Neo4J, Yarcdata, Sesame, AllegroGraph, RYA, Jena), data transport (BitTorrent, HTTP, FTP, SSH, Globus Online/GridFTP), cluster resource management (Mesos, YARN, Helix, Llama vs. Condor, Moab, Slurm, Torque), file systems (HDFS, Swift, Ceph vs. Lustre, GPFS, GFFS), interoperability (Whirr/JClouds, OCCI, CDMI), DevOps (Puppet, Chef, Boto, CloudMesh), and IaaS (OpenStack, OpenNebula, Eucalyptus, CloudStack, vCloud, Amazon, Azure, Google).]

Data Analysis Tools: MapReduce Optimized for Iterative Computations

Twister: the speedy elephant

Abstractions:
• In-memory: cacheable map/reduce tasks
• Data flow: iterative, loop-invariant and variable data
• Thread: lightweight, local aggregation
• Map-Collective: communication patterns optimized for large intermediate data transfer
• Portability: HPC (Java), Azure Cloud (C#), supercomputer (C++, Java)

Programming Model for Iterative MapReduce

Main Program:
  Configure()
  while(..) {
    runMapReduce(..)
  }

Map(Key, Value)
Reduce(Key, List<Value>)
Combine(Map<Key, Value>)

• Loop-invariant data is loaded only once; cacheable map/reduce tasks keep it in memory
• Variable data uses a faster intermediate data transfer mechanism
• The Combine operation collects all reduce outputs
• Distinction between loop-invariant data and variable data (data flow vs. δ flow)
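The following self-contained Java sketch (plain Java, not the Twister API) illustrates this pattern with K-means: the points array plays the role of the cached loop-invariant data, the centroids are the variable data fed into each map/reduce round, and the per-cluster averaging stands in for the reduce/combine step.

import java.util.*;
import java.util.stream.*;

// Single-process sketch of the iterative MapReduce pattern described above.
public class IterativeMapReduceSketch {
  public static void main(String[] args) {
    double[][] points = {{1, 1}, {1.2, 0.9}, {8, 8}, {7.9, 8.2}, {0.8, 1.1}, {8.1, 7.8}}; // loop-invariant
    double[][] centroids = {{0, 0}, {10, 10}};                                            // variable data

    for (int iter = 0; iter < 10; iter++) {
      final double[][] current = centroids;
      // "Map": assign each cached point to its nearest centroid -> (clusterId, point)
      Map<Integer, List<double[]>> grouped = Arrays.stream(points)
          .collect(Collectors.groupingBy(p -> nearest(p, current)));

      // "Reduce"/"Combine": average each group to produce the centroids for the next iteration
      double[][] updated = current.clone();
      grouped.forEach((c, pts) -> updated[c] = mean(pts));
      centroids = updated;
    }
    System.out.println(Arrays.deepToString(centroids));
  }

  static int nearest(double[] p, double[][] centroids) {
    int best = 0;
    for (int c = 1; c < centroids.length; c++) {
      if (dist(p, centroids[c]) < dist(p, centroids[best])) best = c;
    }
    return best;
  }

  static double dist(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
    return s;
  }

  static double[] mean(List<double[]> pts) {
    double[] m = new double[pts.get(0).length];
    for (double[] p : pts) for (int i = 0; i < m.length; i++) m[i] += p[i];
    for (int i = 0; i < m.length; i++) m[i] /= pts.size();
    return m;
  }
}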

Twister Programming Model

Main program's process space:
  configureMaps(…)
  configureReduce(…)
  while(condition){
    runMapReduce(...)   // may scatter/broadcast <Key,Value> pairs directly; may merge data in shuffling
    updateCondition()
  } // end while
  close()

• Worker nodes run cacheable Map(), Reduce(), and Combine() tasks, with data on local disk
• Communications and data transfers go via the pub-sub broker network and direct TCP
• The main program may contain many MapReduce invocations or iterative MapReduce invocations

Twister Architecture

[Figure: Twister architecture — the master node runs the Twister driver and main program; a pub/sub broker network provides the collective communication service; each worker node runs a Twister daemon with a worker pool of cacheable map/reduce tasks and a local disk. Scripts perform data distribution, data collection, and partition file creation.]

PageRank

[Figure: PageRank dataflow in Twister — Map tasks (M) take the partial adjacency matrix and current page ranks and emit partial (compressed) updates; Reduce tasks (R) partially merge the updates, which are combined (C) and fed into the next iteration.]

• Well-known PageRank algorithm [1]
• Used ClueWeb09 [2] (1 TB in size) from CMU
• Hadoop loads the web graph in every iteration; Twister keeps the graph in memory
• The Pregel approach seems natural for graph-based problems
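As a complement to the figure, the following self-contained Java sketch (illustrative only, not the Twister or Hadoop code used in the experiment) expresses one PageRank iteration as a map step (each page emits rank/out-degree contributions) and a reduce step (contributions are summed and damped).

import java.util.*;

// Minimal single-process sketch of PageRank iterations on a tiny illustrative graph.
public class PageRankIterationSketch {
  public static void main(String[] args) {
    // Adjacency list: page -> outgoing links (the loop-invariant web graph)
    Map<String, List<String>> graph = Map.of(
        "A", List.of("B", "C"),
        "B", List.of("C"),
        "C", List.of("A"));
    Map<String, Double> ranks = new HashMap<>(Map.of("A", 1.0, "B", 1.0, "C", 1.0));
    double d = 0.85; // damping factor

    for (int iter = 0; iter < 20; iter++) {
      // "Map": each page emits (target, rank / outDegree) for every outgoing link
      Map<String, Double> contributions = new HashMap<>();
      graph.forEach((page, links) -> {
        double share = ranks.get(page) / links.size();
        for (String target : links) {
          contributions.merge(target, share, Double::sum);
        }
      });
      // "Reduce": sum the contributions per page and apply the damping factor
      for (String page : graph.keySet()) {
        ranks.put(page, (1 - d) + d * contributions.getOrDefault(page, 0.0));
      }
    }
    System.out.println(ranks);
  }
}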

[1] PageRank Algorithm, http://en.wikipedia.org/wiki/PageRank
[2] ClueWeb09 Data Set, http://boston.lti.cs.cmu.edu/Data/clueweb09/

Data-Intensive K-means Clustering

─ Image classification: 7 million images, 512 features per image, 1 million clusters; 10K Map tasks; 64 GB of broadcast data (1 GB data transfer per Map task node); 20 TB of intermediate data in shuffling.

Broadcast Comparison: Twister vs. MPI vs. Spark

• Tested on IU PolarGrid with a 1 Gbps Ethernet connection
• At least a factor of 120 improvement on 125 nodes, compared with the simple broadcast algorithm
• The new topology-aware chain broadcasting algorithm gives 20% better performance than the best C/C++ MPI methods (four times faster than Java MPJ)
• A factor of 5 improvement over the non-topology-optimized pipeline-based method on 150 nodes

• Harp (2013)
  – https://github.com/jessezbj/harp-project
  – Hadoop plugin (on Hadoop 1.2.1 and Hadoop 2.2.0)
  – Hierarchical data abstraction on arrays, key-values, and graphs for easy programming expressiveness
  – Collective communication model supporting various communication operations on the data abstractions
  – Caching with buffer management for the memory allocation required by computation and communication
  – BSP-style parallelism
  – Fault tolerance with checkpointing

Harp Design

Parallelism Model and Architecture

[Figure: left, parallelism models — the MapReduce model (Map tasks feeding Reduce tasks through Shuffle) vs. the Map-Collective model (Map tasks connected by collective communication); right, architecture — MapReduce applications and Map-Collective applications run on the Harp framework and MapReduce V2, with YARN as the resource manager.]

Hierarchical Data Abstraction and Collective Communication

Harp Bcast Code Example

protected void mapCollective(KeyValReader reader, Context context)
    throws IOException, InterruptedException {
  // Table holding the centroid partitions (combined with DoubleArrPlus)
  ArrTable table = new ArrTable(0, DoubleArray.class, DoubleArrPlus.class);
  if (this.isMaster()) {
    // Only the master loads the centroids and adds them to the table as partitions
    String cFile = conf.get(KMeansConstants.CFILE);
    Map cenDataMap = createCenDataMap(cParSize, rest, numCenPartitions,
        vectorSize, this.getResourcePool());
    loadCentroids(cenDataMap, vectorSize, cFile, conf);
    addPartitionMapToTable(cenDataMap, table);
  }
  // Broadcast the table so every worker receives the same centroids
  arrTableBcast(table);
}

K-means Clustering Parallel Efficiency

• Shantenu Jha et al. A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures. 2014.

WDA-MDS Performance on Big Red II

• WDA-MDS
  • Y. Ruan et al., A Robust and Scalable Solution for Interpolative Multidimensional Scaling with Weighting. IEEE e-Science 2013.
• Big Red II
  • http://kb.iu.edu/data/bcqt.html
• Allgather: bucket algorithm
• Allreduce: bidirectional exchange algorithm (see the sketch below)
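The sketch below is a single-process Java simulation of a bidirectional-exchange (recursive doubling) allreduce; it is an illustration of the communication schedule only, not Harp's or MPI's actual implementation, and it assumes the term refers to the standard recursive-doubling pattern.

import java.util.Arrays;

// In each of log2(p) rounds, "process" r pairs with r XOR 2^s and both partners
// combine their partial sums; after the last round every process holds the global sum.
public class BidirectionalExchangeAllreduce {
  public static void main(String[] args) {
    int p = 8;                                  // number of processes (power of two)
    double[] value = new double[p];
    for (int r = 0; r < p; r++) value[r] = r;   // each "process" starts with its own value

    for (int s = 1; s < p; s <<= 1) {           // log2(p) exchange rounds
      double[] next = new double[p];
      for (int r = 0; r < p; r++) {
        int partner = r ^ s;                    // bidirectional: r and partner exchange partials
        next[r] = value[r] + value[partner];
      }
      value = next;
    }
    System.out.println(Arrays.toString(value)); // every entry equals 0+1+...+7 = 28
  }
}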

Parallel Efficiency

REEF

• Retainable Evaluator Execution Framework • http://www.reef-project.org/ • Provides system authors with a centralized (pluggable) control flow • Embeds a user-defined system controller called the Job Driver • Event-driven control • Packages a variety of data-processing libraries (e.g., high-bandwidth shuffle, relational operators, low-latency group communication) in a reusable form • Covers different models such as MapReduce, query, graph processing, and stream data processing

Iterative MapReduce - MDS Demo

[Figure: MDS demo setup — the client node sends a message to start the job (I); the master node runs the Twister driver and Twister-MDS and sends intermediate results (II) through the ActiveMQ broker to the MDS monitor and PlotViz on the client node.]

• Input: 30k metagenomics data
• MDS reads the pairwise distance matrix of all sequences
• Output: 3D coordinates visualized in PlotViz

Iterative MapReduce - MDS Demo

K-means Clustering Comparison: Hadoop vs. HDInsight vs. Twister4Azure

Shaded part is computation.

Iterative MapReduce Models

Model Composition

• Spark (2010)
  – Matei Zaharia et al. Spark: Cluster Computing with Working Sets. HotCloud 2010.
  – Matei Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012.
  – http://spark.apache.org/
  – Resilient Distributed Dataset (RDD)
  – RDD operations: MapReduce-like parallel operations
  – DAG of execution stages and pipelined transformations
  – Simple collectives: broadcasting and aggregation (see the sketch below)
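A minimal sketch of RDD operations with the broadcast and aggregation collectives, using Spark's Java API; the application name, local master URL, and data values are illustrative additions, not taken from the talk.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

// An RDD cached in memory, a broadcast variable shared with all tasks,
// and reduce() as a simple aggregation collective.
public class SparkRddSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<Integer> data = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5)).cache();

    Broadcast<Integer> factor = sc.broadcast(10);     // broadcasting
    int sum = data.map(x -> x * factor.value())       // MapReduce-like parallel operation
                  .reduce(Integer::sum);              // aggregation

    System.out.println("sum = " + sum);
    sc.stop();
  }
}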

Graph Processing with the BSP Model

• Pregel (2010) – Grzegorz Malewicz et al. Pregel: A System for Large‐Scale Graph Processing. SIGMOD 2010.

• Apache Hama (2010) – https://hama.apache.org/

• Apache Giraph (2012) – https://giraph.apache.org/ – Scaling Apache Giraph to a trillion edges • https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920 (a vertex-centric compute sketch follows below)
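To illustrate the vertex-centric BSP model these systems implement, here is a PageRank-style compute method modeled on Giraph's documented example; treat the class and method names as an approximation of Giraph's API rather than verbatim project code.

import java.io.IOException;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

// Each superstep: sum incoming rank messages, update the vertex value,
// then send rank/out-degree shares along outgoing edges (BSP style).
public class PageRankComputation extends BasicComputation<
    LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  private static final int MAX_SUPERSTEPS = 30;

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
                      Iterable<DoubleWritable> messages) throws IOException {
    if (getSuperstep() >= 1) {
      double sum = 0;
      for (DoubleWritable m : messages) {
        sum += m.get();
      }
      vertex.setValue(new DoubleWritable(0.15 + 0.85 * sum));
    }
    if (getSuperstep() < MAX_SUPERSTEPS) {
      double share = vertex.getValue().get() / vertex.getNumEdges();
      sendMessageToAllEdges(vertex, new DoubleWritable(share));
    } else {
      vertex.voteToHalt();  // stop when the computation has converged/finished
    }
  }
}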

Map-Collective Communication Model

Patterns:
• MapReduce: WordCount, Grep
• MapReduce-MergeBroadcast: K-means clustering, PageRank
• Map-AllGather: MDS-BCCalc
• Map-AllReduce: K-means clustering, MDS-StressCalc
• Map-ReduceScatter: PageRank, belief propagation

We generalize the MapReduce concept to Map-Collective, noting that large collectives are a distinguishing feature of data-intensive and data-mining applications. Collectives generalize Reduce to include all large-scale linked communication-compute patterns. MapReduce already takes a step in the collective direction with sort, shuffle, and merge as well as basic reduction.

Future Work

• Run Algorithms on a much larger scale • Provide Data Service on Clustering and MDS Algorithms

Acknowledgements

Bingjing Zhang, Thilina Gunarathne, Fei Teng, Xiaoming Gao, Stephen Wu, Yuan Young

Prof. Haixu Tang (Bioinformatics), Prof. David Crandall (Computer Vision), Prof. Filippo Menczer (Complex Networks and Systems)

SALSA HPC Group http://salsahpc.indiana.edu School of Informatics and Computing Indiana University

Thank You!

Questions?
