
Lecture 26: The Future of High-Performance Computing
Parallel Computer Architecture and Programming
CMU 15-418/15-618, Spring 2017

Comparing Two Large-Scale Systems
⬛ Oak Ridge Titan
▪ Monolithic supercomputer (3rd fastest in the world)
▪ Designed for compute-intensive applications
⬛ Google Data Center
▪ Servers to support millions of customers
▪ Designed for data collection, storage, and analysis

Computing Landscape
[Figure: computing landscape plotted by data intensity vs. computational intensity: internet-scale computing (Google data center: web search, mapping/directions, language translation, video streaming), cloud services, traditional supercomputing (Oak Ridge Titan: modeling & simulation-driven science & engineering), and personal computing]

Supercomputing Landscape
[Figure: the same axes, highlighting Oak Ridge Titan and modeling & simulation-driven science & engineering]

Supercomputer Applications
[Images: science, industrial products, public health]
⬛ Simulation-Based Modeling
▪ System structure + initial conditions + transition behavior
▪ Discretize time and space
▪ Run the simulation to see what happens
⬛ Requirements
▪ Model accurately reflects the actual system
▪ Simulation faithfully captures the model

Titan Hardware
[Figure: local network connecting Node 1 through Node 18,688, each node containing a CPU and a GPU]
⬛ Each Node
▪ AMD 16-core processor
▪ NVIDIA graphics processing unit
▪ 38 GB DRAM
▪ No disk drive
⬛ Overall
▪ 7 MW, $200M

Titan Node Structure: CPU
⬛ CPU
▪ 16 cores sharing common DRAM memory
▪ Supports multithreaded programming
▪ ~0.16 × 10^12 floating-point operations per second (FLOPS) peak performance

Titan Node Structure: GPU
⬛ Kepler GPU
▪ 14 multiprocessors, each with 12 groups of 16 stream processors
▪ 14 × 12 × 16 = 2688 stream processors total
▪ Single-Instruction, Multiple-Data parallelism: a single instruction controls all processors in a group
▪ 4.0 × 10^12 FLOPS peak performance

Titan Programming: Principle
⬛ Solving a Problem Over a Grid
▪ E.g., a finite-element system
▪ Simulate operation over time
⬛ Bulk Synchronous Model
▪ Partition into regions: p regions for a p-node machine
▪ Map one region per processor

Titan Programming: Principle (cont.)
⬛ Bulk Synchronous Model
▪ Map one region per processor
▪ Alternate:
▪ All nodes compute the behavior of their region (performed on the GPUs)
▪ All nodes communicate values at the boundaries
[Figure: processors P1–P5 alternating compute and communicate phases]

Bulk Synchronous Performance
⬛ Limited by the performance of the slowest processor
⬛ Strive to keep everything perfectly balanced
▪ Engineer hardware to be highly reliable
▪ Tune software to make it as regular as possible
▪ Eliminate "noise": operating system events, extraneous network activity

Titan Programming: Reality
⬛ System Level
▪ Message Passing Interface (MPI) supports node computation, synchronization, and communication
⬛ Node Level
▪ OpenMP supports thread-level operation on the node CPU
▪ CUDA programming environment for the GPUs
▪ Performance degrades quickly without perfect balance among memories and processors
⬛ Result
▪ A single program is a complex combination of multiple programming paradigms (see the sketch below)
▪ Tends to be optimized for a specific hardware configuration
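
To make the bulk synchronous model concrete, here is a minimal sketch (not from the lecture) of the compute/communicate alternation using MPI, the system-level layer named above. The 1-D grid, the neighbor halo exchange, and the averaging update are all illustrative assumptions; a real Titan code would also offload the compute step to the GPU with CUDA and use OpenMP threads on the CPU.

```cpp
// Hypothetical bulk-synchronous sketch: each MPI rank owns one region of a
// 1-D grid and alternates a local compute step with a boundary exchange,
// mirroring the compute/communicate pattern on the earlier slides.
// Rank count, grid size, and the averaging update are illustrative choices.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n_local = 1024;                  // cells owned by this rank
    std::vector<double> u(n_local + 2, 0.0);   // +2 ghost cells
    std::vector<double> u_new(n_local + 2, 0.0);
    if (rank == 0) u[1] = 1.0;                 // arbitrary initial condition

    int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
    int right = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

    for (int step = 0; step < 100; ++step) {
        // Communicate: exchange boundary values with both neighbors.
        MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,
                     &u[n_local + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[n_local],     1, MPI_DOUBLE, right, 1,
                     &u[0],           1, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // Compute: update interior cells from their neighbors (this is the
        // part that would run on the GPU on Titan; plain CPU loop here).
        for (int i = 1; i <= n_local; ++i)
            u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);
        u.swap(u_new);
    }

    if (rank == 0) std::printf("done after 100 steps on %d ranks\n", nprocs);
    MPI_Finalize();
    return 0;
}
```

Each loop iteration is one compute/communicate round: every rank must reach the exchange before any rank can move on, which is why the slowest processor limits overall performance.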

MPI Fault Tolerance
⬛ Checkpoint
▪ Periodically store the state of all processes
▪ Significant I/O traffic
⬛ Restore
▪ When a failure occurs, reset state to that of the last checkpoint
▪ All intervening computation is wasted
⬛ Performance Scaling
▪ Very sensitive to the number of failing components
[Figure: P1–P5 checkpoint, compute & communicate, then restore after a failure; the computation since the last checkpoint is wasted]

Supercomputer Programming Model
[Figure: application programs run on top of software packages, a machine-dependent programming model, and the hardware]
⬛ Program on top of bare hardware
⬛ Performance
▪ Low-level programming to maximize node performance
▪ Keep everything globally synchronized and balanced
⬛ Reliability
▪ A single failure causes a major delay
▪ Engineer hardware to minimize failures

Data-Intensive Computing Landscape
[Figure: the same landscape axes, highlighting internet-scale computing (Google data center: web search, mapping/directions, language translation, video streaming), cloud services, and personal computing]

Internet Computing
⬛ Web Search
▪ Aggregate text data from across the WWW
▪ No definition of correct operation
▪ Does not need real-time updating
⬛ Mapping Services
▪ Huge amount of (relatively) static data
▪ Each customer requires individualized computation
⬛ Online Documents
▪ Must be stored reliably
▪ Must support real-time updating
▪ (Relatively) small data volumes

Other Data-Intensive Computing Applications
⬛ Wal-Mart
▪ 267 million items/day, sold at 6,000 stores
▪ HP built them a 4 PB data warehouse
▪ Mine the data to manage the supply chain, understand market trends, and formulate pricing strategies
⬛ LSST
▪ Chilean telescope will scan the entire sky every 3 days with a 3.2-gigapixel digital camera
▪ Generates 30 TB/day of image data

Data-Intensive Application Characteristics
⬛ Diverse Classes of Data
▪ Structured & unstructured
▪ High & low integrity requirements
⬛ Diverse Computing Needs
▪ Localized & global processing
▪ Numerical & non-numerical
▪ Real-time & batch processing

Google Data Centers
⬛ The Dalles, Oregon
▪ Hydroelectric power @ 2¢ / kWh
▪ 50 megawatts: enough to power 60,000 homes
⬛ Engineered for low cost, modularity & power efficiency
▪ Each container: 1,160 server nodes, 250 kW

Google Cluster
[Figure: local network connecting Node 1 through Node n]
⬛ Typically 1,000–2,000 nodes
⬛ Each Node Contains
▪ 2 multicore CPUs
▪ 2 disk drives
▪ DRAM

Hadoop Project
⬛ File system with files distributed across nodes
▪ Store multiple copies of each file (typically 3)
▪ If one node fails, the data is still available
▪ Logically, any node has access to any file (may need to fetch it across the network)
⬛ Map/Reduce programming environment
▪ Software manages the execution of tasks on the nodes

Map/Reduce Operation
⬛ Characteristics
▪ Computation broken into many short-lived tasks: mapping and reducing (see the word-count sketch below)
▪ Tasks mapped onto processors dynamically
▪ Disk storage holds intermediate results
⬛ Strengths
▪ Flexibility in placement, scheduling, and load balancing
▪ Can access large data sets
⬛ Weaknesses
▪ Higher overhead
▪ Lower raw performance
[Figure: a pipeline of alternating map and reduce stages]
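
As a tiny, hypothetical illustration of the Map/Reduce programming model described above, the single-process sketch below implements word count: map() turns each input line into (word, 1) pairs, a grouping step stands in for the framework's shuffle, and reduce() sums each group. In Hadoop the same two user-supplied functions would run as many short-lived tasks spread across the cluster, with intermediate results held on disk; nothing here models that distributed runtime.

```cpp
// Hypothetical in-memory word count in the Map/Reduce style.
#include <map>
#include <string>
#include <utility>
#include <vector>
#include <sstream>
#include <iostream>

using KV = std::pair<std::string, int>;

// Map: one input line -> list of (word, 1) pairs.
std::vector<KV> map_fn(const std::string &line) {
    std::vector<KV> out;
    std::istringstream iss(line);
    std::string word;
    while (iss >> word) out.push_back({word, 1});
    return out;
}

// Reduce: one key plus all of its values -> a single (word, count) pair.
KV reduce_fn(const std::string &key, const std::vector<int> &values) {
    int sum = 0;
    for (int v : values) sum += v;
    return {key, sum};
}

int main() {
    std::vector<std::string> input = {"the cat sat", "the cat ran", "a dog sat"};

    // Shuffle: group mapped pairs by key (the framework's job in Hadoop).
    std::map<std::string, std::vector<int>> groups;
    for (const auto &line : input)
        for (const auto &kv : map_fn(line))
            groups[kv.first].push_back(kv.second);

    for (const auto &g : groups) {
        KV result = reduce_fn(g.first, g.second);
        std::cout << result.first << ": " << result.second << "\n";
    }
    return 0;
}
```

Because each map or reduce task depends only on its own inputs, a failed task can simply be re-executed on another node, which is the property the fault-tolerance slide below relies on.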

Map/Reduce Fault Tolerance
⬛ Data Integrity
▪ Store multiple copies of each file
▪ Including the intermediate results of each Map/Reduce: continuous checkpointing
⬛ Recovering from Failure
▪ Simply recompute the lost result: the effect is localized
▪ The dynamic scheduler keeps all processors busy
⬛ Use software to build a reliable system on top of unreliable hardware

Cluster Programming Model
⬛ Application programs are written in terms of high-level operations on data
⬛ The runtime system controls scheduling, load balancing, …
[Figure: application programs run on a machine-independent programming model, a runtime system, and the hardware]
⬛ Scaling Challenges
▪ The centralized scheduler forms a bottleneck
▪ Copying to/from disk is very costly
▪ Hard to limit data movement, a significant performance factor

Recent Programming Systems
⬛ Spark Project
▪ Started at UC Berkeley
▪ Has grown to have a large open-source community
⬛ GraphLab
▪ Started as a project at CMU by Carlos Guestrin
▪ Environment for describing machine-learning algorithms
▪ Sparse matrix structure described by a graph
▪ Computation based on updating node values

Computing Landscape Trends: Mixing Simulation with Data Analysis
[Figure: traditional supercomputing (modeling & simulation-driven science & engineering) moving up the data-intensity axis]

Combining Simulation with Real Data
⬛ Limitations
▪ Simulation alone: hard to know whether the model is correct
▪ Data alone: hard to understand causality & "what if"
⬛ Combination
▪ Check and adjust the model during simulation

Real-Time Analytics
⬛ Millennium XXL Simulation (2010)
▪ 3 × 10^9 particles
▪ Simulation run of 9.3 days on 12,228 cores
▪ 700 TB of total data generated
▪ Saved at only 4 time points: 70 TB
▪ Large-scale simulations generate large data sets
⬛ What If?
▪ Could perform data analysis while the simulation is running
[Figure: a simulation engine feeding an analytic engine]

Computing Landscape Trends: Sophisticated Data Analysis
[Figure: internet-scale computing (Google data center) moving along the computational-intensity axis]

Example Analytic Applications
⬛ Microsoft Project Adam: image → classifier → description
⬛ Text transducer: English text → transducer → German text

Data Analysis with Deep Neural Networks
⬛ Task
▪ Compute a classification of a set of input signals
⬛ Training
▪ Use many training samples of the form input / desired output
▪ Compute weights that minimize classification error
⬛ Operation
▪ Propagate signals from input to output (see the sketch below)
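
Here is a minimal sketch of the "operation" step just described, under assumptions not in the slides: a single fully connected layer with a sigmoid nonlinearity, with made-up sizes and weights. Training, i.e. finding the weights that minimize classification error, is not shown.

```cpp
// Hypothetical forward propagation through one fully connected layer:
// y = sigmoid(W x + b). Layer sizes and weight values are illustrative only.
#include <vector>
#include <cmath>
#include <cstdio>

static double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// W is stored row-major as n_out x n_in.
std::vector<double> forward(const std::vector<double> &W,
                            const std::vector<double> &b,
                            const std::vector<double> &x,
                            int n_out, int n_in) {
    std::vector<double> y(n_out);
    for (int i = 0; i < n_out; ++i) {
        double z = b[i];
        for (int j = 0; j < n_in; ++j)
            z += W[i * n_in + j] * x[j];
        y[i] = sigmoid(z);   // nonlinearity applied to the weighted sum
    }
    return y;
}

int main() {
    const int n_in = 3, n_out = 2;
    std::vector<double> W = {0.1, -0.2, 0.3,    // weights for output 0
                             0.4,  0.0, -0.1};  // weights for output 1
    std::vector<double> b = {0.05, -0.05};
    std::vector<double> x = {1.0, 0.5, -1.0};   // input signals

    std::vector<double> y = forward(W, b, x, n_out, n_in);
    std::printf("outputs: %.3f %.3f\n", y[0], y[1]);
    return 0;
}
```

Networks such as Project Adam's image classifier stack many layers of this kind and learn the weights and biases from labeled training data.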