Harp: Collective Communication on Hadoop
Judy Qiu, Indiana University

Outline
• Machine Learning on Big Data
• Big Data Tools
• Iterative MapReduce model
• MDS Demo
• Harp

Machine Learning on Big Data
• Mahout on Hadoop
  • https://mahout.apache.org/
• MLlib on Spark
  • http://spark.apache.org/mllib/
• GraphLab Toolkits
  • http://graphlab.org/projects/toolkits.html
  • GraphLab Computer Vision Toolkit

[Figure: big data tools organized by processing model (DAG, MapReduce, graph, and BSP/collective models), including runtimes such as Hadoop, HaLoop, Twister, Spark, Harp, Stratosphere, Dryad/DryadLINQ, Reef, and MPI; graph engines Giraph, Hama, GraphLab, GraphX; query layers Pig/PigLatin, Hive, Tez, Drill, Shark, MRQL; and stream processors S4, Storm, Samza, Spark Streaming]

Big Data Tools for HPC and Supercomputing
• MPI (Message Passing Interface, 1992)
  • Provides standardized function interfaces for communication between parallel processes.
• Collective communication operations
  • Broadcast, Scatter, Gather, Reduce, Allgather, Allreduce, Reduce-scatter.
• Popular implementations
  • MPICH (2001)
  • OpenMPI (2004)
    • http://www.open-mpi.org/

MapReduce Model
• Google MapReduce (2004)
  • Jeffrey Dean et al. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.
• Apache Hadoop (2005)
  • http://hadoop.apache.org/
  • http://developer.yahoo.com/hadoop/tutorial/
• Apache Hadoop 2.0 (2012)
  • Vinod Kumar Vavilapalli et al. Apache Hadoop YARN: Yet Another Resource Negotiator. SoCC 2013.
  • Separation between resource management and the computation model.

Key Features of the MapReduce Model
• Designed for clouds
  • Large clusters of commodity machines
• Designed for big data
  • Supported by a distributed file system built on local disks (GFS / HDFS)
  • Disk-based intermediate data transfer in shuffling
• MapReduce programming model
  • Computation pattern: Map tasks and Reduce tasks
  • Data abstraction: KeyValue pairs (see the word-count sketch at the end of this section)

Applications & Different Interconnection Patterns
(a) Map Only (Pleasingly Parallel), no communication: CAP3 gene analysis, Smith-Waterman distances, document conversion (PDF -> HTML), brute-force searches in cryptography, parametric sweeps, PolarGrid MATLAB data analysis.
(b) Classic MapReduce: High Energy Physics (HEP) histograms, distributed search, distributed sorting, information retrieval, calculation of pairwise distances for sequences (BLAST).
(c) Iterative MapReduce, collective communication: expectation maximization algorithms, linear algebra, data mining including K-means clustering, deterministic annealing clustering, multidimensional scaling (MDS), PageRank.
(d) Loosely Synchronous (MPI): many MPI scientific applications utilizing a wide variety of communication constructs, including local interactions; solving differential equations; particle dynamics with short-range forces.
Categories (a)-(c) form the domain of MapReduce and its iterative extensions.

What is Iterative MapReduce?
• MapReduce is a programming model instantiating the paradigm of bringing computation to data.
• Iterative MapReduce extends the MapReduce programming model and supports iterative algorithms for data mining and data analysis.
• Is it possible to use the same computational tools on HPC and cloud platforms?
• The goal is to enable scientists to focus on science, not on programming distributed systems.
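The Map/Reduce computation pattern and the KeyValue abstraction described under "Key Features of the MapReduce Model" are easiest to see in code. The sketch below is the classic word-count job written against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce); it is an illustration added here, not an example taken from the slides.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal word-count sketch: Map tasks emit <word, 1> pairs, the disk-based
// shuffle groups them by key, and Reduce tasks sum the counts for each word.
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);               // emit intermediate <word, 1>
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum)); // emit final <word, total>
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Each map task turns its input split into <word, 1> pairs, the shuffle writes and groups them by key on disk, and each reduce task sums the counts for one key; the combiner performs the local aggregation before data leaves the map side. Running such a job repeatedly from a driver loop is exactly where the iterative extensions discussed next come in.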
(Iterative) MapReduce in Context
[Figure: layered architecture supporting scientific simulations (data mining and data analysis).
 Applications: kernels, genomics, proteomics, information retrieval, polar science, scientific simulation data analysis and management, dissimilarity computation, clustering, multidimensional scaling, generative topographic mapping.
 Programming model and runtime: high-level language; cross-platform iterative MapReduce (collectives, fault tolerance, scheduling); cross-cutting security, provenance, portal, services and workflow.
 Storage: distributed file systems, object store, data-parallel file system.
 Infrastructure: Windows Server HPC, Linux HPC, Amazon cloud, Azure cloud, grid appliance; bare-system and virtualization.
 Hardware: CPU nodes, GPU nodes.]

[Figure: HPC-ABDS layered software stack spanning two slides, mapping Apache big-data projects (lighter, cloud-oriented layers) and non-Apache HPC technologies (darker layers) by function:
 orchestration and workflow (Oozie, ODE, Airavata, OODT; Pegasus, Kepler, Swift, Taverna, Trident, Galaxy);
 data analytics libraries (Mahout, MLlib, MLbase, R, Bioconductor, ImageJ, ScaLAPACK, PETSc);
 high-level data processing (Hive, HCatalog, Pig, Shark, MRQL, Impala);
 horizontally scalable parallel runtimes (Hadoop, Spark, Twister, Tez, Hama, S4, Samza, Storm, Giraph, Stratosphere);
 inter-process communication and reductions (MPI, Harp collectives, pub/sub messaging with Netty, ZeroMQ, ActiveMQ, Qpid, Kafka);
 cross-cutting monitoring, coordination, and security (Ambari, Ganglia, Nagios, Inca, ZooKeeper, Thrift, Protobuf);
 in-memory caches, SQL, and NoSQL stores (Memcached, Redis, Hazelcast, MySQL, Phoenix, SciDB, HBase, Accumulo, Cassandra, MongoDB, CouchDB, Lucene/Solr, Berkeley DB, Azure Table, Dynamo, Riak, Voldemort, Neo4J, Sesame, AllegroGraph, Jena, RYA);
 data transport (BitTorrent, HTTP, FTP, SSH, Globus Online/GridFTP);
 cluster resource management (Mesos, YARN, Helix, Llama; Condor, Moab, Slurm, Torque);
 file systems and object stores (HDFS, Swift, Ceph, FUSE; Gluster, Lustre, GPFS, GFFS);
 interoperability and DevOps (Whirr/JClouds, OCCI, CDMI, Puppet, Chef, Boto, CloudMesh);
 IaaS and commercial clouds (OpenStack, OpenNebula, Eucalyptus, CloudStack, vCloud, Amazon, Azure, Google).
 Note: Harp is highlighted as a new communication layer.]
Data Analysis Tools
MapReduce optimized for iterative computations. Twister: the speedy elephant.
• In-memory: cacheable map/reduce tasks
• Data flow: iterative; loop-invariant vs. variable data
• Thread: lightweight; local aggregation
• Map-Collective: communication patterns optimized for large intermediate data transfers
• Portability: HPC (Java), Azure cloud (C#), supercomputer (C++, Java)

Programming Model for Iterative MapReduce
• Configure(): loop-invariant data is loaded only once into cacheable map/reduce tasks held in memory.
• Main program: while(..) { runMapReduce(..) }; variable data is passed each iteration through a faster intermediate data transfer mechanism.
• Map(Key, Value) and Reduce(Key, List<Value>) follow the usual MapReduce signatures.
• Combine(Map<Key, Value>): a combiner operation in the main program collects all reduce outputs.
• The model distinguishes loop-invariant data from variable data (data flow vs. δ flow), caches map/reduce tasks in memory, and adds the combine operation.

Twister Programming Model
Main program's process space:
  configureMaps(...)
  configureReduce(...)
  while(condition){
    runMapReduce(...)      // iterations
    updateCondition()
  } // end while
  close()
• Cacheable map/reduce tasks (Map(), Reduce(), Combine()) run on worker nodes and use the local disk.
• The main program may scatter/broadcast <Key,Value> pairs directly and may merge data in shuffling.
• Communications and data transfers go via the pub/sub broker network and direct TCP.
• The main program may contain many MapReduce invocations or iterative MapReduce invocations.
(A self-contained sketch of this iterative driver loop follows the Twister Architecture slide below.)

Twister Architecture
[Figure: a master node runs the Twister driver and the main program, connected to worker nodes through a pub/sub broker network and a collective communication service; each worker node runs a Twister daemon with a worker pool of cacheable map/reduce tasks and a local disk. Scripts perform: ...]
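The iterative driver loop above is easiest to see with a concrete algorithm. Below is a minimal, single-process Java sketch of K-means (one of the iterative applications listed earlier) written in the same shape as the Twister loop: the points are the loop-invariant data loaded once, the centroids are the variable data recomputed each pass, and the per-point assignment and centroid update play the roles of the map and reduce/combine phases. It uses plain Java only and is an illustrative analogy of the loop structure, not the Twister or Harp API.

```java
import java.util.Arrays;
import java.util.Random;

// Single-process analogy of the iterative MapReduce loop described above.
// The points are loop-invariant data (loaded once, "cached"); the centroids are
// variable data that would be broadcast to cached map tasks on every iteration.
// This is an illustrative sketch, not the Twister or Harp API.
public class IterativeKMeansSketch {

  public static void main(String[] args) {
    final int k = 3;
    // configureMaps(...): load the loop-invariant data once.
    double[][] points = randomPoints(1000, 2, new Random(42));

    // Initial variable data: copy the first k points as starting centroids.
    double[][] centroids = new double[k][];
    for (int c = 0; c < k; c++) centroids[c] = points[c].clone();

    // while(condition) { runMapReduce(...) }: a fixed iteration count here.
    for (int iter = 0; iter < 20; iter++) {
      double[][] sums = new double[k][points[0].length];
      int[] counts = new int[k];

      // "Map" phase: assign each cached point to its nearest (broadcast) centroid.
      for (double[] p : points) {
        int nearest = nearestCentroid(p, centroids);
        counts[nearest]++;
        for (int d = 0; d < p.length; d++) sums[nearest][d] += p[d];
      }

      // "Reduce/Combine" phase: merge the partial sums into new centroids.
      for (int c = 0; c < k; c++) {
        if (counts[c] == 0) continue;            // keep an empty cluster's centroid
        for (int d = 0; d < sums[c].length; d++) centroids[c][d] = sums[c][d] / counts[c];
      }
    }
    // close(): in a distributed run the driver would shut down here.
    System.out.println("Final centroids: " + Arrays.deepToString(centroids));
  }

  // Index of the centroid closest (squared Euclidean distance) to point p.
  private static int nearestCentroid(double[] p, double[][] centroids) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      double dist = 0;
      for (int d = 0; d < p.length; d++) {
        double diff = p[d] - centroids[c][d];
        dist += diff * diff;
      }
      if (dist < bestDist) { bestDist = dist; best = c; }
    }
    return best;
  }

  private static double[][] randomPoints(int n, int dim, Random rnd) {
    double[][] pts = new double[n][dim];
    for (double[] p : pts)
      for (int d = 0; d < dim; d++) p[d] = rnd.nextDouble();
    return pts;
  }
}
```

In the distributed model, the per-point loop would run inside cached map tasks on worker nodes, the partial sums would travel back through the faster intermediate transfer mechanism, and the centroid update would happen in the main program's Combine step before the next broadcast.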