Accelerate Big Data Processing (Hadoop, Spark, Memcached, & TensorFlow) with HPC Technologies

Talk at Intel® HPC Developer Conference 2017 (SC '17) by

Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected], http://www.cse.ohio-state.edu/~panda

Xiaoyi Lu, The Ohio State University
E-mail: [email protected], http://www.cse.ohio-state.edu/~luxi

Big Data Processing and Deep Learning on Modern Clusters

• Multiple tiers + Workflow
  – Front-end data accessing and serving (Online): Memcached + DB (e.g., MySQL), HBase, etc.
  – Back-end data analytics and deep learning model training (Offline): HDFS, MapReduce, Spark, TensorFlow, BigDL, Caffe, etc.

Drivers of Modern HPC Cluster Architectures

• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
Figure: High-performance interconnects (InfiniBand, <1 usec latency, 100 Gbps bandwidth), multi-core processors, accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip), SSD, NVMe-SSD, NVRAM

Figure: Example systems – Tianhe-2, Titan, Stampede, Tianhe-1A

Interconnects and Protocols in OpenFabrics Stack for HPC

(http://openfabrics.org)
Figure: OpenFabrics stack – applications/middleware reach the network through either the Sockets or the Verbs interface; the protocol options (over the corresponding Ethernet/InfiniBand adapters and switches) are 1/10/40/100 GigE, 10/40 GigE-TOE, IPoIB, RSockets, SDP, iWARP, RoCE, and native InfiniBand verbs

Large-scale InfiniBand Installations
• 177 IB clusters (35%) in the Jun'17 Top500 list (http://www.top500.org)
• Installations in the Top 50 (18 systems):

– 241,108 cores (Pleiades) at NASA/Ames (15th)
– 220,800 cores (Pangea) in France (19th)
– 522,080 cores (Stampede) at TACC (20th)
– 144,900 cores (Cheyenne) at NCAR/USA (22nd)
– 72,800 cores Cray CS-Storm in US (27th)
– 72,800 cores Cray CS-Storm in US (28th)
– 124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (30th)
– 60,512 cores (DGX-1) at Facebook/USA (31st)
– 60,512 cores (DGX SATURNV) at NVIDIA/USA (32nd)
– 72,000 cores (HPC2) in Italy (33rd)
– 152,692 cores (Thunder) at AFRL/USA (36th)
– 99,072 cores (Mistral) at DKRZ/Germany (38th)
– 147,456 cores (SuperMUC) in Germany (40th)
– 86,016 cores (SuperMUC Phase 2) in Germany (41st)
– 74,520 cores (Tsubame 2.5) at Japan/GSIC (44th)
– 66,000 cores (HPC3) in Italy (47th)
– 194,616 cores (Cascade) at PNNL (49th)
– 85,824 cores (Occigen2) at GENCI/CINES in France (50th)
– 73,902 cores (Centennial) at ARL/USA (52nd)
– and many more!

Increasing Usage of HPC, Big Data and Deep Learning

• HPC (MPI, RDMA, Lustre, etc.)
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)

Convergence of HPC, Big Data, and Deep Learning!!!

How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data and Deep Learning Applications?

• Can RDMA-enabled high-performance interconnects benefit Big Data processing and Deep Learning?
• Can HPC clusters with high-performance storage systems (e.g., SSD, parallel file systems) benefit Big Data and Deep Learning applications?
• How much performance benefit can be achieved through enhanced designs?
• Can the bottlenecks be alleviated with new designs by taking advantage of HPC technologies?
• What are the major bottlenecks in current Big Data processing and Deep Learning middleware (e.g., Hadoop, Spark)?
• How to design benchmarks for evaluating the performance of Big Data and Deep Learning middleware on HPC clusters?

Bring HPC, Big Data processing, and Deep Learning into a “convergent trajectory”!

Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure?


Designing Communication and I/O Libraries for Big Data Systems: Challenges

• Applications and Benchmarks
• Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) – upper-level changes?
• Programming Models (Sockets) – other protocols?
• Communication and I/O Library
  – Point-to-Point Communication
  – Threaded Models and Synchronization
  – Virtualization
  – I/O and File Systems
  – QoS
  – Fault-Tolerance
• Networking Technologies (InfiniBand, 1/10/40/100 GigE and Intelligent NICs)
• Commodity Computing System Architectures (multi- and many-core architectures and accelerators)
• Storage Technologies (HDD, SSD, and NVMe-SSD)

The High-Performance Big Data (HiBD) Project
• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
  – Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
  – HDFS, Memcached, HBase, and Spark micro-benchmarks
(Available for InfiniBand and RoCE; also runs on Ethernet)
• http://hibd.cse.ohio-state.edu
• Users base: 260 organizations from 31 countries
• More than 23,900 downloads from the project site

Acceleration Case Studies and Performance Evaluation
• Basic Designs
  – Hadoop
  – Spark
  – Memcached
• Advanced Designs
  – Memcached with Hybrid Memory and Non-blocking APIs
  – Efficient Indexing with RDMA-HBase
  – TensorFlow with RDMA-gRPC
  – Deep Learning over Big Data
• BigData + HPC Cloud

Design Overview of HDFS with RDMA

• Design Features
  – RDMA-based HDFS write
  – RDMA-based HDFS replication
  – Parallel replication support
  – On-demand connection setup
  – InfiniBand/RoCE support
• Enables high-performance RDMA communication, while supporting the traditional socket interface
• JNI layer bridges Java-based HDFS with the communication library written in native code
Figure: Applications → HDFS; the OSU design goes through the Java Native Interface (JNI) to a Verbs-based engine over RDMA-capable networks (IB, iWARP, RoCE), alongside the default Java socket interface over 1/10/40/100 GigE or IPoIB

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012
N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014
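Because the RDMA path plugs in beneath the unchanged HDFS client API, application code does not need to change. A minimal sketch of an HDFS write from Python through libhdfs (pyarrow's legacy hdfs interface); the namenode address and path are assumptions, and with RDMA-Hadoop installed the same write path would use the RDMA-based design underneath:

```python
import pyarrow as pa

# libhdfs goes through JNI to the Java HDFS client, so an RDMA-enabled
# write/replication path (when installed) is exercised without code changes.
fs = pa.hdfs.connect(host="namenode.example.com", port=8020)  # assumed endpoint

with fs.open("/user/demo/block.bin", "wb") as f:  # hypothetical path
    f.write(b"x" * (128 * 1024 * 1024))           # write one 128 MB block

print(fs.ls("/user/demo"))
```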

Enhanced HDFS with In-Memory and Heterogeneous Storage
• Design Features
  – Three modes: Default (HHH), In-Memory (HHH-M), Lustre-Integrated (HHH-L)
  – Policies to efficiently utilize the heterogeneous storage devices (RAM, SSD, HDD, Lustre)
  – Eviction/Promotion based on data usage pattern
  – Hybrid Replication
  – Lustre-Integrated mode: Lustre-based fault-tolerance
Figure: Applications → Triple-H (data placement policies, eviction/promotion, hybrid replication) → heterogeneous storage (RAM disk, SSD, HDD, Lustre)

N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid '15, May 2015

Design Overview of MapReduce with RDMA
• Design Features
  – RDMA-based shuffle
  – Prefetching and caching map output
  – Efficient shuffle algorithms
  – In-memory merge
  – On-demand shuffle adjustment
  – Advanced overlapping of map, shuffle, and merge / shuffle, merge, and reduce
  – On-demand connection setup
  – InfiniBand/RoCE support
• Enables high-performance RDMA communication, while supporting the traditional socket interface
• JNI layer bridges Java-based MapReduce with the communication library written in native code
Figure: Applications → MapReduce (Job Tracker, Task Tracker, Map, Reduce); the OSU design goes through JNI to a Verbs-based engine over RDMA-capable networks (IB, iWARP, RoCE), alongside the default Java socket interface over 1/10/40/100 GigE or IPoIB
M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014
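The RDMA-based shuffle is transparent to user jobs: an unmodified MapReduce program runs as-is on RDMA-Hadoop, with the map-to-reduce shuffle carried by the modified framework. A minimal Hadoop Streaming word count is sketched below (script names and paths are illustrative); it would be submitted with the standard `hadoop jar hadoop-streaming.jar -mapper ... -reducer ...` invocation.

```python
#!/usr/bin/env python3
# wordcount_mapper.py -- Hadoop Streaming feeds input splits on stdin; emit (word, 1) pairs.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# wordcount_reducer.py -- after the shuffle, keys arrive grouped and sorted; sum the counts.
import sys
from itertools import groupby

pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
for word, group in groupby(pairs, key=lambda kv: kv[0]):
    print(f"{word}\t{sum(int(count) for _, count in group)}")
```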

Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter & TeraGen in OSU-RI2 (EDR)
Figure: Execution time of RandomWriter (80-160 GB) and TeraGen (80-240 GB), IPoIB (EDR) vs. OSU-IB (EDR); cluster with 8 nodes and a total of 64 maps
• RandomWriter: 3x improvement over IPoIB for 80-160 GB file size
• TeraGen: 4x improvement over IPoIB for 80-240 GB file size

Performance Numbers of RDMA for Apache Hadoop 2.x – Sort & TeraSort in OSU-RI2 (EDR)
Figure: Execution time of Sort (80-160 GB; 8 nodes, 64 maps, 14 reduces) and TeraSort (80-240 GB; 8 nodes, 64 maps, 32 reduces), IPoIB (EDR) vs. OSU-IB (EDR)
• Sort: 61% improvement over IPoIB for 80-160 GB data
• TeraSort: 18% improvement over IPoIB for 80-240 GB data

Design Overview of Spark with RDMA

• Design Features
  – RDMA-based shuffle plugin
  – SEDA-based architecture
  – Dynamic connection management and sharing
  – Non-blocking data transfer
  – Off-JVM-heap buffer management
  – InfiniBand/RoCE support
• Enables high-performance RDMA communication, while supporting the traditional socket interface
• JNI layer bridges Scala-based Spark with the communication library written in native code
Figure: Apache Spark (benchmarks/applications/libraries/frameworks) → Spark Core → Shuffle Manager (Sort, Hash, Tungsten-Sort) → Block Transfer Service (Netty, NIO, RDMA-Plugin); the RDMA server/client path goes through JNI to the native RDMA-based communication engine over RDMA-capable networks (IB, iWARP, RoCE), alongside Netty/NIO over 1/10/40/100 GigE and IPoIB
X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI '14), August 2014
X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData '16, Dec. 2016
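A hedged sketch of how a PySpark job would exercise a shuffle plugin: `spark.shuffle.manager` is a standard Spark setting, but the RDMA manager class name below is only illustrative (the HiBD RDMA-Spark package documents the actual configuration); the `reduceByKey` stage generates the shuffle traffic such a plugin accelerates.

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("ShuffleSketch")
        # Illustrative class name; consult the RDMA-Spark user guide for the real setting.
        .set("spark.shuffle.manager", "org.apache.spark.shuffle.rdma.RdmaShuffleManager"))
sc = SparkContext(conf=conf)

# The wide dependency below triggers a shuffle between the map and reduce stages.
pairs = sc.parallelize(range(1000000)).map(lambda x: (x % 1024, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.count())

sc.stop()
```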

Performance Evaluation on SDSC Comet – HiBench PageRank
Figure: PageRank total time, IPoIB vs. RDMA, for Huge/BigData/Gigantic data sizes on 32 worker nodes (768 cores) and 64 worker nodes (1536 cores)
• InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores (768/1536M, 768/1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node
  – 32 nodes/768 cores: total time reduced by 37% over IPoIB (56 Gbps)
  – 64 nodes/1536 cores: total time reduced by 43% over IPoIB (56 Gbps)

Memcached Performance (FDR Interconnect)
Figure: Memcached GET latency (1 B-4 KB messages) and throughput (16-4080 clients) on TACC Stampede (Intel SandyBridge cluster, IB FDR); latency reduced by nearly 20x, throughput improved by 2x
• Memcached GET latency: 4 bytes – OSU-IB: 2.84 us, IPoIB: 75.53 us; 2K bytes – OSU-IB: 4.49 us, IPoIB: 123.42 us
• Memcached throughput (4 bytes): 4080 clients – OSU-IB: 556 Kops/sec, IPoIB: 233 Kops/sec; nearly 2x improvement in throughput
J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP '11
J. Jose, H. Subramoni, K. Kandalla, M. W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached design for InfiniBand Clusters using Hybrid Transport, CCGrid '12
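For reference, a rough GET-latency probe in the spirit of the test above, written against a stock memcached client (pymemcache); server address, value sizes, and iteration count are assumptions, and this exercises the client's socket path rather than the OSU RDMA transport or the OSU HiBD benchmarks.

```python
import time
from pymemcache.client.base import Client

client = Client(("127.0.0.1", 11211))  # assumed memcached endpoint

for size in (4, 64, 512, 2048):
    client.set("probe", b"x" * size)
    start = time.perf_counter()
    for _ in range(1000):
        client.get("probe")
    avg_us = (time.perf_counter() - start) / 1000 * 1e6
    print(f"{size:5d} B GET: {avg_us:8.2f} us average")
```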

Acceleration Case Studies and Performance Evaluation
• Basic Designs
  – Hadoop
  – Spark
  – Memcached
• Advanced Designs
  – Memcached with Hybrid Memory and Non-blocking APIs
  – Efficient Indexing with RDMA-HBase
  – TensorFlow with RDMA-gRPC
  – Deep Learning over Big Data
• BigData + HPC Cloud

Performance Evaluation on IB FDR + SATA/NVMe SSDs (Hybrid Memory)

Figure: Memcached Set/Get latency breakdown (slab allocation/SSD write, cache check+load/SSD read, cache update, server response, client wait, miss penalty) for IPoIB-Mem, RDMA-Mem, RDMA-Hybrid-SATA and RDMA-Hybrid-NVMe, when the data fits in memory and when it does not
• Memcached latency test with Zipf distribution; server with 1 GB memory, 32 KB key-value pair size; total size of data accessed is 1 GB (when data fits in memory) and 1.5 GB (when data does not fit in memory)
• When data fits in memory: RDMA-Mem/Hybrid gives 5x improvement over IPoIB-Mem
• When data does not fit in memory: RDMA-Hybrid gives 2x-2.5x improvement over IPoIB/RDMA-Mem

Performance Evaluation with Non-Blocking Memcached API
Figure: Average Set/Get latency breakdown (miss penalty/back-end DB access overhead, client wait, server response, cache update, cache check+load, slab allocation) for IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, H-RDMA-Opt-NonB-i and H-RDMA-Opt-NonB-b
• H = Hybrid Memcached over SATA SSD; Opt = adaptive slab manager; Block = default blocking API; NonB-i = non-blocking iset/iget API; NonB-b = non-blocking bset/bget API with buffer re-use guarantee
• Data does not fit in memory: non-blocking Memcached Set/Get API extensions can achieve
  – >16x latency improvement vs. blocking API over RDMA-Hybrid/RDMA-Mem w/ penalty
  – >2.5x throughput improvement vs. blocking API over default/optimized RDMA-Hybrid
• Data fits in memory: non-blocking extensions perform similar to RDMA-Mem/RDMA-Hybrid and give >3.6x improvement over IPoIB-Mem
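The iset/iget/bset/bget extensions belong to the OSU libmemcached-based C client; the Python sketch below only emulates their semantics (issue immediately, overlap work, wait on a handle later) with a standard blocking client and a thread pool, to make the programming model concrete. Endpoint and sizes are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from pymemcache.client.base import Client

client = Client(("127.0.0.1", 11211))      # assumed memcached endpoint
pool = ThreadPoolExecutor(max_workers=1)   # stand-in for the library's progress engine

# "iset"-like: the call returns a handle right away instead of blocking on the store.
set_handle = pool.submit(client.set, "kv:42", b"x" * 32768)
# ... overlap application work here while the 32 KB value is stored ...
set_handle.result()                        # wait/test on the handle

# "iget"-like: issue the fetch, overlap other work, then collect the value.
get_handle = pool.submit(client.get, "kv:42")
value = get_handle.result()
print(len(value))
```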

Performance Evaluation with Boldio for Lustre + Burst-Buffer
Figure: Aggregate DFSIO write/read throughput (MBps) for Lustre-Direct, Alluxio-Remote and Boldio (~3x write, ~7x read), and latency (sec) of Hadoop/Spark workloads (WordCount, InvIndx, CloudBurst, Spark TeraGen) for Lustre-Direct, Alluxio-Local, Alluxio-Remote and Boldio (21% reduction), at 20 GB and 40 GB
• InfiniBand QDR, 24 GB RAM + PCIe-SSDs, 12 nodes, 32/48 Map/Reduce tasks, 4-node Memcached cluster
• Boldio can improve
  – throughput over Lustre by about 3x for writes and 7x for reads
  – execution time of Hadoop benchmarks over Lustre (e.g., WordCount, CloudBurst) by >21%
• Contrasting with Alluxio (formerly Tachyon)
  – Performance degrades about 15x when Alluxio cannot leverage local storage (Alluxio-Local vs. Alluxio-Remote)
  – Boldio can improve throughput over Alluxio with all remote workers by about 3.5x-8.8x (Alluxio-Remote vs. Boldio)
D. Shankar, X. Lu, D. K. Panda, Boldio: A Hybrid and Resilient Burst-Buffer over Lustre for Accelerating Big Data I/O, IEEE Big Data 2016

Accelerating Indexing Techniques on HBase with RDMA
• Challenges
  – Operations on Distributed Ordered Table (DOT) with indexing techniques are network intensive
  – Additional overhead of creating and maintaining secondary indices

  – Can RDMA benefit indexing techniques (Apache Phoenix and CCIndex) on HBase?
• Results
  – Evaluation with Apache Phoenix and CCIndex
  – Up to 2x improvement in query throughput
  – Up to 35% reduction in application workload execution time
Figure: TPC-H query benchmarks – throughput of HBase, RDMA-HBase, HBase-Phoenix, RDMA-HBase-Phoenix, HBase-CCIndex and RDMA-HBase-CCIndex for Query1/Query2 (increased by 2x); Ad Master application workloads – execution time for Workload1-Workload4 (reduced by 35%)
• Collaboration with Institute of Computing Technology, Chinese Academy of Sciences
S. Gugnani, X. Lu, L. Zha, and D. K. Panda, Characterizing and Accelerating Indexing Techniques on Distributed Ordered Tables, IEEE BigData, 2017
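A hedged sketch of the kind of Phoenix secondary index whose creation and maintenance traffic the RDMA-HBase design targets, using the phoenixdb driver against a Phoenix Query Server; the host, table, and index names are made up.

```python
import phoenixdb

conn = phoenixdb.connect("http://queryserver.example.com:8765/", autocommit=True)
cur = conn.cursor()

cur.execute("""CREATE TABLE IF NOT EXISTS orders (
                 order_id BIGINT PRIMARY KEY,
                 customer VARCHAR,
                 total    DECIMAL(10, 2))""")

# Secondary index on a non-key column; Phoenix maintains it in extra HBase tables,
# which is the network-intensive work the slide refers to.
cur.execute("CREATE INDEX IF NOT EXISTS idx_customer ON orders (customer) INCLUDE (total)")

cur.execute("SELECT total FROM orders WHERE customer = ?", ["acme"])  # can be served from the index
print(cur.fetchall())
```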

Overview of RDMA-gRPC with TensorFlow
Figure: Client → Master; the master connects through RDMA gRPC servers/clients to Worker /job:PS/task:0 (CPU, GPU) and Worker /job:Worker/task:0 (CPU, GPU)
• Worker services communicate among each other using RDMA-gRPC
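In TensorFlow 1.x the transport is chosen when the in-process server is created: `protocol="grpc"` is the default TCP/IPoIB path, and builds with the verbs or MPI extensions accept "grpc+verbs" and "grpc+mpi". A minimal sketch with hypothetical host names follows; since the OSU design replaces the gRPC core itself, an application would presumably keep the default protocol string (an assumption based on the slide's drop-in positioning).

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],                                # hypothetical hosts
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# protocol selects the channel implementation used for worker-to-worker and
# worker-to-PS traffic: "grpc" (default), "grpc+verbs", or "grpc+mpi".
server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol="grpc+verbs")
server.join()
```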

Performance Benefit for TensorFlow
Figure: Sigmoid net CNN training time (seconds, 2 nodes) and resnet50/inception3 throughput (images/second, 2 nodes, 2 GPUs) for Default gRPC, gRPC + Verbs, gRPC + MPI, and OSU RDMA-gRPC
• TensorFlow performance evaluation on RI2
  – Up to 19% performance speedup over IPoIB for Sigmoid net (20 epochs)
  – Up to 35% and 30% performance speedup over IPoIB for resnet50 and Inception3 (batch size 8)
R. Biswas, X. Lu, and D. K. Panda, Accelerating gRPC and TensorFlow with RDMA for High-Performance Deep Learning over InfiniBand, Under Review

High-Performance Deep Learning over Big Data (DLoBD) Stacks
• Challenges of Deep Learning over Big Data (DLoBD)
  – Can RDMA-based designs in DLoBD stacks improve performance, scalability, and resource utilization on high-performance interconnects, GPUs, and multi-core CPUs?
  – What are the performance characteristics of representative DLoBD stacks on RDMA networks?

• Characterization on DLoBD Stacks
  – CaffeOnSpark, TensorFlowOnSpark, and BigDL
  – IPoIB vs. RDMA; in-band communication vs. out-of-band communication; CPU vs. GPU; etc.
  – Performance, accuracy, scalability, and resource utilization
  – RDMA-based DLoBD stacks (e.g., BigDL over RDMA-Spark) can achieve 2.6x speedup compared to the IPoIB-based scheme, while maintaining similar accuracy
Figure: Per-epoch time (seconds) and accuracy (%), IPoIB vs. RDMA, over 18 epochs (2.6x speedup at similar accuracy)
X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda, Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks, HotI 2017

Acceleration Case Studies and Performance Evaluation
• Basic Designs
  – Hadoop
  – Spark
  – Memcached
• Advanced Designs
  – Memcached with Hybrid Memory and Non-blocking APIs
  – Efficient Indexing with RDMA-HBase
  – TensorFlow with RDMA-gRPC
  – Deep Learning over Big Data
• BigData + HPC Cloud

Virtualization-aware and Automatic Topology Detection Schemes in Hadoop on InfiniBand
• Challenges
  – Existing designs in Hadoop are not virtualization-aware
  – No support for automatic topology detection
• Design
  – Automatic topology detection using a MapReduce-based utility
    • Requires no user input
    • Can detect topology changes during runtime without affecting running jobs
  – Virtualization- and topology-aware communication through map task scheduling and YARN container allocation policy extensions
Figure: Execution time of Hadoop benchmarks (Sort, WordCount, PageRank at 40/60 GB; reduced by 34%) and Hadoop applications (CloudBurst, Self-join in default and distributed modes; reduced by 55%), RDMA-Hadoop vs. Hadoop-Virt
S. Gugnani, X. Lu, and D. K. Panda, Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds, CloudCom '16, December 2016

Concluding Remarks
• Discussed challenges in accelerating Big Data middleware with HPC technologies
• Presented basic and advanced designs to take advantage of InfiniBand/RDMA for HDFS, MapReduce, RPC, HBase, Memcached, Spark, gRPC, and TensorFlow
• Results are promising
• Many other open issues need to be solved
• Will enable the Big Data community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner
• Looking forward to collaboration with the community

OSU Participating at Multiple Events on BigData Acceleration
• Tutorial
  – Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing and Management (Sunday, 1:30-5:00 pm, Room #201)
• BoF
  – BigData and Deep Learning (Tuesday, 5:15-6:45 pm, Room #702)
  – SigHPC Big Data BoF (Wednesday, 12:15-1:15 pm, Room #603)
  – Clouds for HPC, Big Data, and Deep Learning (Wednesday, 5:15-7:00 pm, Room #701)
• Booth Talks
  – OSU Booth (Tuesday, 10:00-11:00 am, Booth #1875)
  – Mellanox Theater (Wednesday, 3:00-3:30 pm, Booth #653)
  – OSU Booth (Thursday, 1:00-2:00 pm, Booth #1875)
• Student Poster Presentation
  – Accelerating Big Data processing in Cloud (Tuesday, 5:15-7:00 pm, Four Seasons Ballroom)
• Details at http://hibd.cse.ohio-state.edu

Funding Acknowledgments
Funding Support by

Equipment Support by

Personnel Acknowledgments

Current Students: A. Awan (Ph.D.), M. Bayatpour (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), S. Guganani (Ph.D.), J. Hashmi (Ph.D.), N. Islam (Ph.D.), M. Li (Ph.D.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)
Current Research Scientists: X. Lu, H. Subramoni
Current Research Specialists: J. Smith, M. Arnold
Current Post-doc: A. Ruhela

Past Students: W. Huang (Ph.D.), M. Luo (Ph.D.), R. Rajachandrasekar (Ph.D.), A. Augustine (M.S.), P. Balaji (Ph.D.), W. Jiang (M.S.), A. Mamidala (Ph.D.), G. Santhanaraman (Ph.D.), S. Bhagvat (M.S.), J. Jose (Ph.D.), G. Marsh (M.S.), A. Singh (Ph.D.), A. Bhat (M.S.), S. Kini (M.S.), V. Meshram (M.S.), J. Sridhar (M.S.), S. Sur (Ph.D.), D. Buntinas (Ph.D.), M. Koop (Ph.D.), A. Moody (M.S.), H. Subramoni (Ph.D.), L. Chai (Ph.D.), K. Kulkarni (M.S.), S. Naravula (Ph.D.), K. Vaidyanathan (Ph.D.), B. Chandrasekharan (M.S.), R. Kumar (M.S.), R. Noronha (Ph.D.), N. Dandapanthula (M.S.), S. Krishnamoorthy (M.S.), X. Ouyang (Ph.D.), A. Vishnu (Ph.D.), V. Dhanraj (M.S.), K. Kandalla (Ph.D.), S. Pai (M.S.), J. Wu (Ph.D.), T. Gangadharappa (M.S.), P. Lai (M.S.), S. Potluri (Ph.D.), W. Yu (Ph.D.), K. Gopalakrishnan (M.S.), J. Liu (Ph.D.)
Past Research Scientists: K. Hamidouche, S. Sur
Past Programmers: D. Bureddy, J. Perkins
Past Post-Docs: D. Banerjee, X. Besseron, H.-W. Jin, J. Lin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne, H. Wang

Thank You!
{panda, luxi}@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
http://www.cse.ohio-state.edu/~luxi

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
