
Lecture 26: The Future of High-Performance Computing

Parallel Computer Architecture and Programming
CMU 15-418/15-618, Spring 2017, Carnegie Mellon

Comparing Two Large-Scale Systems

⬛ Oak Ridge Titan
▪ Monolithic supercomputer (3rd fastest in the world)
▪ Designed for compute-intensive applications

⬛ Google Data Center
▪ Servers to support millions of customers
▪ Designed for data collection, storage, and analysis


Computing Landscape

[Figure: kinds of computing plotted by computational intensity (x-axis) and data intensity (y-axis): Internet-scale computing at Google data centers (web search, mapping / directions, language translation, video streaming), cloud services, traditional supercomputing on Oak Ridge Titan (modeling & simulation-driven science & engineering), and personal computing]

Supercomputing Landscape

[Figure: the same landscape, highlighting traditional supercomputing (Oak Ridge Titan) and modeling & simulation-driven science & engineering]

Supercomputer Applications

[Images: science, industrial products, public health]

⬛ Simulation-Based Modeling
▪ System structure + initial conditions + transition behavior
▪ Discretize time and space
▪ Run simulation to see what happens (see the sketch after the requirements below)

⬛ Requirements
▪ Model accurately reflects actual system
▪ Simulation faithfully captures model
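To make the "discretize and step forward" idea concrete, here is a toy one-dimensional heat-diffusion simulation in NumPy (not from the lecture); the grid size, time step, and initial hot spot are arbitrary choices for illustration.

```python
# Toy simulation: discretize a 1D rod into grid cells and step the heat
# equation forward in time with an explicit finite-difference update.
import numpy as np

nx, nsteps = 100, 500           # spatial cells, time steps (illustrative)
alpha, dx, dt = 0.01, 1.0, 1.0  # diffusivity and step sizes (chosen for stability)

u = np.zeros(nx)                # initial condition: cold rod...
u[nx // 2] = 100.0              # ...with a hot spot in the middle

for step in range(nsteps):
    # Transition behavior: each cell moves toward the average of its neighbors.
    u[1:-1] += alpha * dt / dx**2 * (u[2:] - 2 * u[1:-1] + u[:-2])

print(u.max(), u.sum())         # the peak spreads out; total heat is (nearly) conserved
```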

Titan Hardware

[Figure: local network connecting Node 1 through Node 18,688, each with a CPU and a GPU]

⬛ Each Node
▪ AMD 16-core processor
▪ NVIDIA Kepler GPU
▪ 38 GB DRAM
▪ No disk drive

⬛ Overall
▪ 7 MW, $200M

Titan Node Structure: CPU

[Figure: 16-core CPU attached to DRAM memory]

⬛ CPU
▪ 16 cores sharing common memory
▪ Supports multithreaded programming
▪ ~0.16 × 10^12 floating-point operations per second (FLOPS) peak performance

Titan Node Structure: GPU

⬛ Kepler GPU
▪ 14 multiprocessors
▪ Each with 12 groups of 16 stream processors
▪ 14 × 12 × 16 = 2,688
▪ Single-Instruction, Multiple-Data (SIMD) parallelism
▪ A single instruction controls all processors in a group
▪ 4.0 × 10^12 FLOPS peak performance

Titan Programming: Principle

⬛ Solving a Problem Over a Grid
▪ E.g., finite-element system
▪ Simulate operation over time

⬛ Bulk Synchronous Model
▪ Partition into regions
▪ p regions for a p-node machine
▪ Map one region per processor

Titan Programming: Principle (cont)

⬛ Bulk Synchronous Model
▪ Map one region per processor
▪ Alternate:
▪ All nodes compute the behavior of their region (performed on the GPUs)
▪ All nodes communicate values at region boundaries

[Figure: P1-P5 alternating Compute and Communicate phases]
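A minimal sketch of the bulk synchronous compute/communicate loop, written with mpi4py over a 1D domain decomposition. The region size, step count, and the simple neighbor-averaging update are illustrative assumptions, not Titan's actual application code, and the node-level compute would really run as a CUDA kernel.

```python
# Bulk synchronous pattern: each rank owns one region of the grid and
# alternates local computation with boundary exchange.
# Run with e.g.: mpirun -n 4 python bsp_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

u = np.zeros(102)                    # this rank's region + one ghost cell per side
u[1:-1] = rank                       # arbitrary initial condition

left  = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for step in range(100):
    # Compute phase: update the interior of the region (on Titan this is GPU work).
    u[1:-1] = 0.5 * (u[:-2] + u[2:])

    # Communicate phase: exchange boundary values with neighboring regions.
    comm.Sendrecv(u[1:2],   dest=left,  recvbuf=u[-1:], source=right)
    comm.Sendrecv(u[-2:-1], dest=right, recvbuf=u[:1],  source=left)
    # No rank starts the next compute phase until its exchanges complete,
    # which is what makes the model bulk synchronous.
```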

Bulk Synchronous Performance

[Figure: P1-P5 alternating Compute and Communicate phases]

▪ Limited by the performance of the slowest processor

⬛ Strive to keep perfectly balanced
▪ Engineer hardware to be highly reliable
▪ Tune software to make it as regular as possible
▪ Eliminate "noise"
▪ Operating system events
▪ Extraneous network activity
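A small, made-up model of why "noise" matters under bulk synchrony: each step costs the maximum over all nodes, so even rare per-node delays become near-certain somewhere in the machine as the node count grows. The delay probability and magnitude below are arbitrary.

```python
# Per-step cost under bulk synchrony is the max over all p nodes, so rare
# per-node delays dominate at scale. All parameters here are made up.
import numpy as np

rng = np.random.default_rng(0)
t_compute = 1.0                      # nominal compute time per step
p_noise, t_noise = 0.01, 0.5         # 1% chance a node loses 0.5 time units

for p in (1, 100, 10_000):           # number of nodes
    trials = t_compute + t_noise * (rng.random((1000, p)) < p_noise)
    avg_step = trials.max(axis=1).mean()   # the barrier waits for the slowest node
    print(f"p = {p:6d}: average step time = {avg_step:.3f}")
```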

Titan Programming: Reality

⬛ System Level
▪ Message-Passing Interface (MPI) supports node computation, synchronization, and communication

⬛ Node Level
▪ OpenMP supports thread-level operation of the node CPU
▪ CUDA programming environment for the GPUs
▪ Performance degrades quickly without perfect balance among memories and processors

⬛ Result
▪ A single program is a complex combination of multiple programming paradigms
▪ Tend to optimize for a specific hardware configuration

MPI Fault Tolerance

[Figure: P1-P5 timeline of checkpoint, compute & communicate, failure, wasted computation, and restore]

⬛ Checkpoint
▪ Periodically store state of all processes
▪ Significant I/O traffic

⬛ Restore
▪ When failure occurs
▪ Reset state to that of last checkpoint
▪ All intervening computation wasted

⬛ Performance Scaling
▪ Very sensitive to number of failing components
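A toy sketch of the checkpoint/restart idea (not Titan's actual mechanism): periodically write the program state to stable storage, and on restart reload the most recent checkpoint and redo everything after it. The file name and checkpoint interval are made up; a real MPI code would write one file per rank.

```python
# Toy checkpoint/restart loop: everything computed since the last checkpoint
# is lost on a failure, which is why checkpoint I/O and failure rates matter.
import os
import pickle

CKPT = "state.ckpt"        # illustrative file name (one per rank in a real MPI code)
CKPT_INTERVAL = 100        # steps between checkpoints (illustrative)

def load_checkpoint():
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)        # restore: resume from the last saved step
    return {"step": 0, "state": 0.0}     # fresh start

def save_checkpoint(snapshot):
    with open(CKPT, "wb") as f:
        pickle.dump(snapshot, f)         # significant I/O traffic at scale

snapshot = load_checkpoint()
while snapshot["step"] < 1000:
    snapshot["state"] += 1.0             # stand-in for a compute & communicate step
    snapshot["step"] += 1
    if snapshot["step"] % CKPT_INTERVAL == 0:
        save_checkpoint(snapshot)
```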

Supercomputer Programming Model

[Figure: software stack of application programs, software packages, a machine-dependent programming model, and hardware]

▪ Program on top of bare hardware

⬛ Performance
▪ Low-level programming to maximize node performance
▪ Keep everything globally synchronized and balanced

⬛ Reliability
▪ Single failure causes major delay
▪ Engineer hardware to minimize failures


Data-Intensive Computing Landscape

[Figure: the landscape (computational intensity vs. data intensity), highlighting Internet-scale computing at Google data centers: web search, mapping / directions, language translation, video streaming; plus cloud services and personal computing]

Internet Computing

⬛ Web Search
▪ Aggregate text data from across the WWW
▪ No definition of correct operation
▪ Do not need real-time updating

⬛ Mapping Services
▪ Huge amount of (relatively) static data
▪ Each customer requires individualized computation

⬛ Online Documents
■ Must be stored reliably
■ Must support real-time updating
■ (Relatively) small data volumes

Other Data-Intensive Computing Applications

⬛ Wal-Mart
▪ 267 million items/day, sold at 6,000 stores
▪ HP built them a 4 PB data warehouse
▪ Mine the data to manage the supply chain, understand market trends, and formulate pricing strategies

⬛ LSST
▪ Chilean telescope will scan the entire sky every 3 days
▪ A 3.2-gigapixel digital camera
▪ Generates 30 TB/day of image data

Data-Intensive Application Characteristics

⬛ Diverse Classes of Data
▪ Structured & unstructured
▪ High & low integrity requirements

⬛ Diverse Computing Needs
▪ Localized & global processing
▪ Numerical & non-numerical
▪ Real-time & batch processing

Google Data Centers

⬛ The Dalles, Oregon
▪ Hydroelectric power @ 2¢ / kWh
▪ 50 megawatts: enough to power 60,000 homes

⬛ Engineered for low cost, modularity & power efficiency
▪ Container: 1,160 server nodes, 250 kW

Google Cluster

[Figure: local network connecting Node 1 through Node n]

▪ Typically 1,000-2,000 nodes

⬛ Node Contains
▪ 2 multicore CPUs
▪ 2 disk drives
▪ DRAM

Hadoop Project

⬛ File system with files distributed across nodes

[Figure: local network connecting Node 1 through Node n]

▪ Store multiple copies of each file (typically 3)
▪ If one node fails, data still available
▪ Logically, any node has access to any file
▪ May need to fetch across the network

⬛ Map / Reduce programming environment
▪ Software manages execution of tasks on nodes

Map/Reduce Operation

[Figure: pipeline of Map and Reduce tasks]

⬛ Characteristics
▪ Computation broken into many short-lived tasks
▪ Mapping, reducing
▪ Tasks mapped onto processors dynamically
▪ Use disk storage to hold intermediate results

⬛ Strengths
▪ Flexibility in placement, scheduling, and load balancing
▪ Can access large data sets

⬛ Weaknesses
▪ Higher overhead
▪ Lower raw performance
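A minimal in-process sketch of the map/reduce pattern (word count). In a real Hadoop job the map and reduce tasks run on many nodes and the intermediate key/value pairs go through disk; here the sample input and the single-process "shuffle" are only illustrative.

```python
# Word count in the map/reduce style: map emits (key, value) pairs,
# the framework groups them by key, and reduce aggregates each group.
from collections import defaultdict

docs = ["the quick brown fox", "the lazy dog", "the fox"]   # made-up input

def map_task(doc):
    # Map phase: emit (word, 1) for every word in one input record.
    return [(word, 1) for word in doc.split()]

def reduce_task(word, counts):
    # Reduce phase: aggregate all values seen for one key.
    return word, sum(counts)

# "Shuffle": group intermediate pairs by key (Hadoop does this across nodes,
# spilling the intermediate results to disk).
groups = defaultdict(list)
for doc in docs:
    for word, count in map_task(doc):
        groups[word].append(count)

results = [reduce_task(word, counts) for word, counts in groups.items()]
print(sorted(results))   # e.g. [('brown', 1), ('dog', 1), ('fox', 2), ...]
```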

Map/Reduce Fault Tolerance

[Figure: pipeline of Map and Reduce tasks]

⬛ Data Integrity
▪ Store multiple copies of each file
▪ Including intermediate results of each Map / Reduce
▪ Continuous checkpointing

⬛ Recovering from Failure
▪ Simply recompute the lost result
▪ Localized effect
▪ Dynamic scheduler keeps all processors busy

⬛ Use software to build reliable system on top of unreliable hardware

Cluster Programming Model

[Figure: software stack of application programs, a machine-independent programming model, a runtime system, and hardware]

▪ Application programs written in terms of high-level operations on data
▪ Runtime system controls scheduling, load balancing, …

⬛ Scaling Challenges
▪ Centralized scheduler forms a bottleneck
▪ Copying to/from disk very costly
▪ Hard to limit data movement
▪ Significant performance factor

Recent Programming Systems

⬛ Spark Project

▪ Started at U.C. Berkeley
▪ Grown to have a large open-source community

⬛ GraphLab
▪ Started as a project at CMU by Carlos Guestrin
▪ Environment for describing machine-learning algorithms
▪ Sparse matrix structure described by a graph
▪ Computation based on updating of node values
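For a flavor of the Spark model, here is the standard word-count example written against the PySpark RDD API; the HDFS input path is a placeholder, and this generic example is not taken from the lecture.

```python
# Spark expresses the computation as high-level operations on distributed
# datasets; the runtime handles scheduling, placement, and fault recovery.
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

counts = (sc.textFile("hdfs:///path/to/input")      # placeholder input path
            .flatMap(lambda line: line.split())     # map: emit individual words
            .map(lambda word: (word, 1))            # key/value pairs
            .reduceByKey(lambda a, b: a + b))       # reduce: sum counts per word

print(counts.take(10))   # a few (word, count) pairs
sc.stop()
```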


Computing Landscape Trends

[Figure: the landscape, with traditional supercomputing (modeling & simulation-driven science & engineering) moving up in data intensity: mixing simulation with data analysis]

Combining Simulation with Real Data

⬛ Limitations
▪ Simulation alone: hard to know if the model is correct
▪ Data alone: hard to understand causality & "what if"

⬛ Combination
▪ Check and adjust the model during simulation

Real-Time Analytics

⬛ Millennium XXL Simulation (2010)
▪ 3 × 10^9 particles
▪ Simulation run of 9.3 days on 12,228 cores
▪ 700 TB total data generated
▪ Saved at only 4 time points: 70 TB
▪ Large-scale simulations generate large data sets

⬛ What If?
▪ Could perform data analysis while the simulation is running

[Figure: Simulation Engine feeding an Analytic Engine]


Computing Landscape Trends

[Figure: the landscape, with Internet-scale computing (Google data centers) moving up in computational intensity: sophisticated data analysis]

Example Analytic Applications

[Figure: Microsoft Project Adam: image → classifier → description; English text → transducer → German text]

Data Analysis with Deep Neural Networks

⬛ Task
▪ Compute classification of a set of input signals

⬛ Training
■ Use many training samples of the form input / desired output
■ Compute weights that minimize classification error

⬛ Operation
■ Propagate signals from input to output
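A toy illustration of the two phases: training adjusts weights to reduce classification error, and operation is just a forward pass. The two-layer network, XOR-style data, and learning rate are illustrative and many orders of magnitude smaller than Project Adam or DeepFace.

```python
# Tiny two-layer neural network trained by gradient descent on toy data.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # toy inputs
y = np.array([[0], [1], [1], [0]], dtype=float)               # desired outputs (XOR)

W1, W2 = rng.normal(size=(2, 8)), rng.normal(size=(8, 1))     # weights to be learned

def forward(X):
    h = np.tanh(X @ W1)                       # hidden layer
    return h, 1 / (1 + np.exp(-h @ W2))       # sigmoid output

# Training: adjust weights to minimize (squared) classification error.
lr = 0.5
for step in range(5000):
    h, out = forward(X)
    grad_out = (out - y) * out * (1 - out)    # error gradient at the output
    grad_W2 = h.T @ grad_out
    grad_h = grad_out @ W2.T * (1 - h**2)     # backpropagate through tanh
    grad_W1 = X.T @ grad_h
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

# Operation: propagate signals from input to output.
print(forward(X)[1].round(2))   # should approach [[0], [1], [1], [0]]
```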

DNN Application Example

⬛ Facebook DeepFace Architecture

Training DNNs

[Charts: model size × training data ➔ training effort]

⬛ Characteristics
▪ Iterative numerical algorithm
▪ Regular data organization

⬛ Project Adam Training
■ 2B connections
■ 15M images
■ 62 machines
■ 10 days

Trends

[Figure: the landscape, with Internet-scale computing moving toward sophisticated data analysis, traditional supercomputing (modeling & simulation-driven science & engineering) mixing simulation with real-world data, and the two possibly converging]

Challenges for Convergence

⬛ Supercomputers vs. Data Center Clusters

Hardware
■ Supercomputers: customized, optimized for reliability
■ Data center clusters: consumer grade, optimized for low cost

Run-Time System
■ Supercomputers: source of "noise", static scheduling
■ Data center clusters: provides reliability, dynamic allocation

Application Programming
■ Supercomputers: low-level, processor-centric model
■ Data center clusters: high-level, data-centric model

Summary: Computation/Data Convergence

⬛ Two Important Classes of Large-Scale Computing
▪ Computationally intensive supercomputing
▪ Data-intensive processing
▪ Internet companies + many other applications

⬛ Followed Different Evolutionary Paths
▪ Supercomputers: get maximum performance from available hardware
▪ Data center clusters: maximize cost/performance over a variety of data-centric tasks
▪ Yielded different approaches to hardware, runtime systems, and application programming

⬛ A Convergence Would Have Important Benefits
▪ Computational and data-intensive applications
▪ But not clear how to do it


GETTING TO EXASCALE

World's Fastest Machines

⬛ Top500 Ranking: High-Performance LINPACK (HPL)
▪ Benchmark: solve an N × N linear system
▪ Some variant of Gaussian elimination
▪ (2/3)N^3 + O(N^2) operations
▪ Vendor can choose N to give the best performance (in FLOPS)

⬛ Alternative: High-Performance Conjugate Gradient (HPCG)
▪ Solve a sparse linear system (≤ 27 nonzeros / row)
▪ Iterative method
▪ Higher communication / compute ratio
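A back-of-the-envelope version of the HPL idea (not the actual benchmark): time a dense solve and convert the (2/3)N^3 operation count into a FLOPS estimate. The N used here is tiny compared with what vendors run.

```python
# Rough HPL-style measurement: solve a random dense N x N system and
# estimate FLOPS from the ~(2/3) N^3 operation count of Gaussian elimination.
import time
import numpy as np

N = 4000                                   # tiny compared to real Top500 runs
A = np.random.rand(N, N)
b = np.random.rand(N)

start = time.perf_counter()
x = np.linalg.solve(A, b)                  # LU-based dense solve
elapsed = time.perf_counter() - start

flops = (2 / 3) * N**3                     # leading-order operation count
print(f"~{flops / elapsed / 1e9:.1f} GFLOPS on this machine")
```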

Sunway TaihuLight

⬛ Wuxi, China
▪ Operational 2016

⬛ Machine
▪ Total machine has 40,960 processor chips
▪ Each processor chip contains 256 compute cores + 4 management cores
▪ Each core has a 4-wide SIMD vector unit
▪ 8 FLOPS / clock cycle

⬛ Performance
▪ HPL: 93.0 PF (world's top)
▪ HPCG: 0.37 PF
▪ 15.4 MW
▪ 1.31 PB DRAM

⬛ Ratios (Big is Better)
▪ GigaFLOPS/Watt: 6.0
▪ Bytes/FLOP: 0.014

Tianhe-2

⬛ Guangzhou, China
▪ Operational 2013

⬛ Machine
▪ Total machine has 16,000 nodes
▪ Each with 2 Intel Xeons + 3 Intel Xeon Phis

⬛ Performance
▪ HPL: 33.9 PF
▪ HPCG: 0.58 PF (world's best)
▪ 17.8 MW
▪ 1.02 PB DRAM

⬛ Ratios (Big is Better)
▪ GigaFLOPS/Watt: 1.9
▪ Bytes/FLOP: 0.030

Titan

⬛ Oak Ridge, TN
▪ Operational 2012

⬛ Machine
▪ Total machine has 18,688 nodes
▪ Each with a 16-core AMD Opteron + Tesla K20X GPU

⬛ Performance
▪ HPL: 17.6 PF
▪ HPCG: 0.32 PF
▪ 8.2 MW
▪ 0.71 PB DRAM

⬛ Ratios (Big is Better)
▪ GigaFLOPS/Watt: 2.2
▪ Bytes/FLOP: 0.040


How Powerful is a Titan Node?

⬛ Titan Node
▪ CPU: Opteron 6274
▪ Nov. 2011, 32 nm technology
▪ 2.2 GHz
▪ 16 cores (no hyperthreading)
▪ 16 MB L3 cache
▪ 32 GB DRAM
▪ GPU: Kepler K20X
▪ Feb. 2013, 28 nm
▪ CUDA capability 3.5
▪ 3.9 TF peak (SP)

⬛ GHC Machine
▪ CPU: Xeon E5-1660
▪ June 2016, 14 nm technology
▪ 3.2 GHz
▪ 8 cores (2× hyperthreaded)
▪ 20 MB L3 cache
▪ 32 GB DRAM
▪ GPU: GeForce GTX 1080
▪ May 2016, 16 nm
▪ CUDA capability 6.0
▪ 8.2 TF peak (SP)

Performance of Top 500 Machines

⬛ From a presentation by Jack Dongarra
⬛ Machines run far below peak when performing HPCG

What Lies Ahead

⬛ DOE CORAL Program
▪ Announced Nov. 2014
▪ Delivery in 2018

⬛ Vendor #1
▪ IBM + NVIDIA + Mellanox
▪ 3,400 nodes
▪ 10 MW
▪ 150-300 PF peak

⬛ Vendor #2
▪ Intel + Cray
▪ ~50,000 nodes (Xeon Phis)
▪ 13 MW
▪ > 180 PF peak


TECHNOLOGY CHALLENGES

Moore's Law

▪ Basis for ever-increasing computer power
▪ We've come to expect it will continue

Challenges to Moore's Law: Technical

• 2022: transistors with 4nm feature size

• Si lattice spacing 0.54nm

▪ Must continue to shrink feature sizes
▪ Approaching atomic scale

⬛ Difficulties
▪ Lithography at such small dimensions
▪ Statistical variations among devices

Challenges to Moore's Law: Economic

⬛ Growing Capital Costs
▪ A state-of-the-art fab line costs ~$20B
▪ Must have very high volumes to amortize the investment
▪ Has led to major consolidations

Dennard Scaling

▪ Due to Robert Dennard, IBM, 1974
▪ Quantifies the benefits of Moore's Law

⬛ How to Shrink an IC Process
▪ Reduce horizontal and vertical dimensions by k
▪ Reduce voltage by k

⬛ Outcomes
▪ Devices / chip increase by k^2
▪ Clock frequency increases by k
▪ Power / chip constant

⬛ Significance
▪ Increased capacity and performance
▪ No increase in power
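A quick numeric check of that bookkeeping, using the simplified dynamic-power model P ≈ C·V²·f per device (leakage ignored); the baseline values and shrink factor below are arbitrary.

```python
# Dennard scaling bookkeeping with P_device = C * V^2 * f (dynamic power only).
C, V, f, k = 1.0, 1.0, 1.0, 2.0      # arbitrary baseline values; shrink factor k
p_chip_before = C * V**2 * f                              # one "unit" of devices
p_device_after = (C / k) * (V / k)**2 * (k * f)           # dims & voltage / k, clock * k
p_chip_after = (k**2) * p_device_after                    # k^2 times more devices fit
print(p_chip_before, p_chip_after)                        # equal: power per chip stays constant
```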

End of Dennard Scaling

⬛ What Happened?
▪ Can't drop voltage below ~1 V
▪ Reached the limit of power / chip in 2004
▪ More logic fits on a chip (Moore's Law), but it can't be made to run faster
▪ Response has been to increase cores / chip

Research Challenges

⬛ Supercomputers
▪ Can they be made more dynamic and adaptive?
▪ Requirement for future scalability
▪ Can they be made easier to program?
▪ Abstract, machine-independent programming models

⬛ Data-Intensive Computing
▪ Can they be adapted to provide better computational performance?
▪ Can they make better use of data locality?
▪ Performance- & power-limiting factor

⬛ Technology / Economic
▪ What will we do when Moore's Law comes to an end for CMOS?
▪ How can we ensure a stable manufacturing environment?
