Introductory Supercomputing Course Objectives

Course Objectives

• Understand basic parallel computing concepts and workflows
• Understand the high-level architecture of a supercomputer
• Use a basic workflow to submit a job and monitor it
• Understand a resource request and know what to expect
• Check basic usage and accounting information

WORKFLOWS

eResearch Workflows

Research enabled by computers is called eResearch. eResearch examples:
– Simulations (e.g. prediction, testing theories)
– Data analysis (e.g. pattern matching, statistical analysis)
– Visualisation (e.g. interpretation, quality assurance)
– Combining datasets (e.g. data mashups, data mining)

A supercomputer is just one part of an eResearch workflow.

eResearch Workflow

A typical eResearch workflow has five stages:
• Collect – kilobytes of parameters, or petabytes of data from a telescope or atomic collider
• Stage – data transfer, from manual transfer of small files to custom data management frameworks for large projects
• Process – a single large parallel job with checkpointing, or many smaller parallel or serial jobs
• Visualise – examining values, generating images or movies, or interactive 3D projections
• Archive – storage of published results, or publicly accessible research datasets

The Pawsey Centre

• Provides multiple components of the workflow in one location
• Capacity to support significant computational and data requirements

[Diagram: instruments and databases feed scratch storage and the supercomputer(s); a visualisation cluster, displays, and archive storage complete the Pawsey Centre workflow.]

Parallelism in Workflows

Exploiting parallelism in a workflow allows us to
• get results faster, and
• break up big problems into manageable sizes.

A modern supercomputer is not a fast processor. It is many processors working together in parallel.

Workflow Example – Cake Baking

Baking a cake involves a sequence of tasks that follow a recipe – just like a computer program!

Some tasks are independent:
• Can be done in any order.
• Can be done in parallel.

Some tasks have prerequisites:
• Must be done sequentially.

Baking a Cake – Staging

• "Staging" the ingredients improves access time.
• Ingredients are "prerequisites".
• Each ingredient does not depend on the others, so they can be moved in parallel, or at different times.
• Supermarket → kitchen pantry → kitchen bench.

Baking a Cake – Process

[Flowchart: stage ingredients → measure ingredients → mix ingredients → bake and cool cake → ice cake are serial steps; preheating the oven and mixing the icing can run in parallel with them.]

Baking a Cake – Finished

Either:
• Sample it for quality.
• Put it in a shop for others to browse and buy.
• Store it for later.
• Eat it!

Clean up! Then make the next cake.

Levels of Parallelism

Coarse-grained parallelism (high level):
• Different people baking cakes in their own kitchens.
• Preheating the oven while mixing ingredients.
Greater autonomy; can scale to large problems and many helpers.

Fine-grained parallelism (low level):
• Spooning mixture into a cupcake tray.
• Measuring ingredients.
Higher coordination requirement; difficult to get many people to help on a single cupcake tray.

How many helpers?

What is your goal – high throughput, to do one job fast, or to solve a grand-challenge problem?

High throughput:
• For many cakes, get many people to bake independently in their own kitchens – minimal coordination.
• Turn it into a production line. Use specialists and teams in some parts.

Doing one job fast:
• Experience, as well as trial and error, will find the optimal number of helpers.
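In computing terms, the same distinction applies to independent tasks in a shell workflow. The following is a minimal sketch of coarse-grained, high-throughput parallelism; the scripts prepare_input.sh and combine_results.sh are hypothetical stand-ins for the independent and dependent steps:

```bash
#!/bin/bash
# Serial: each independent task finishes before the next one starts.
./prepare_input.sh run1
./prepare_input.sh run2
./prepare_input.sh run3

# Parallel: the tasks have no prerequisites on each other,
# so they can run at the same time (coarse-grained parallelism).
./prepare_input.sh run1 &
./prepare_input.sh run2 &
./prepare_input.sh run3 &
wait                      # the combining step has prerequisites: wait for all three

./combine_results.sh run1 run2 run3
```

On a supercomputer the same pattern is usually expressed as many independent jobs (or a job array) rather than background shell processes.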
When to use supercomputing

Workflows need supercomputing when the resources of a single laptop or workstation are not sufficient:
• The program takes too long to run
• There is not enough memory for the program
• The dataset is too large to fit on the computer

If you are unsure whether moving to a supercomputer will help, ask for help from Pawsey staff.

SUPERCOMPUTERS

Abstract Supercomputer

At a high level, a supercomputer consists of login nodes, data mover nodes, a scheduler, compute nodes, and high performance storage.

Login Nodes

• Remote access to the supercomputer
• Where users should manage workflows
• Many people (~100) share a login node at the same time. Do not run your programs on the login nodes!
• Use the login nodes to submit jobs to the queue to be executed on the compute nodes
• Login nodes can have different hardware to the compute nodes.
• Some build tests may fail if you try to compile on login nodes.

Compute Nodes

• Programs should run on the compute nodes.
• Access is provided via the scheduler.
• Compute nodes have a fast interconnect that allows them to communicate with each other.
• Jobs can span multiple compute nodes.
• Individual nodes are not that different in performance from a workstation.
• Parallelism across compute nodes is how significant performance improvements are achieved.

Inside a Compute Node

Each compute node has one or more CPUs:
• Each CPU has multiple cores
• Each CPU has memory attached to it
Each node has an external network connection. Some systems have accelerators (e.g. GPUs).

High Performance Storage

Fast storage "inside" the supercomputer:
• Temporary working area
• Might have local node storage
  – Not shared with other users
• Usually have global storage
  – All nodes can access the filesystems
  – Either directly connected to the interconnect, or via router nodes
  – The storage is shared: multiple simultaneous users on different nodes will reduce performance

Data Mover Nodes

Externally connected servers that are dedicated to moving data to and from the high performance filesystems.
• Data movers are shared, but most users will not notice.
• Performance depends on the other end, distance, encryption algorithms, and other concurrent transfers.
• Data movers see all the global filesystems.
• At Pawsey, the data mover address is hpc-data.pawsey.org.au.

Scheduler

The scheduler feeds jobs into the compute nodes. It has a queue of jobs and constantly optimises how to get them all done. As a user, you interact with the queues; the scheduler runs jobs on the compute nodes on your behalf. (A sketch of a minimal batch job script is given at the end of this section.)

PAWSEY SYSTEMS

Pawsey Supercomputers

• Magnus – Cray XC40: 1488 nodes, 24 cores/node, 64 GB RAM/node; Aries interconnect.
• Galaxy – Cray XC30: 472 nodes, 20 cores/node, 64 GB RAM/node; Aries interconnect, plus 64 K20X GPU nodes.
• Zeus – HPE cluster: 90 nodes, 28 cores/node, 98–128 GB RAM/node; plus 20 vis nodes, 15 K20/K40/K20X GPU nodes, 11 P100 GPU nodes, 80 KNL nodes, and one 6 TB node.

• Magnus should be used for large, parallel workflows.
• Galaxy is reserved for the operation of the ASKAP and MWA radio telescopes.
• Zeus should be used for small parallel jobs, serial or single-node workflows, GPU-accelerated jobs, and remote visualisation.

Filesystems

Filesystems are areas of storage with very different purposes.
• Some are large, some are small.
• Some are semi-permanent, others temporary.
• Some are fast, some are slow.
High performance storage should not be used for long-term storage.
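As a concrete illustration of the submit-and-monitor workflow, here is a minimal sketch of a batch job, assuming a SLURM-style scheduler as used at Pawsey; the job name, node and task counts, partition name, and program name are placeholders to adapt to your own resource request:

```bash
#!/bin/bash --login
#SBATCH --job-name=myjob        # name shown in the queue
#SBATCH --nodes=2               # jobs can span multiple compute nodes
#SBATCH --ntasks-per-node=24    # e.g. one task per core on a 24-core node
#SBATCH --time=01:00:00         # wall-time limit of the resource request
#SBATCH --partition=workq       # queue/partition name is site-specific

# The scheduler runs this script on the compute nodes on your behalf;
# srun launches the parallel program across the allocated nodes.
srun ./my_parallel_program
```

Submit the script from a login node with `sbatch job.sh`, monitor it with `squeue -u $USER`, and review usage of completed jobs with `sacct`.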
Scratch Filesystem

High performance Lustre filesystem for large data transfers.
• Large capacity: 3 PB, or 1.98 billion files
• Shared by all users, with no quotas
• Temporary space for queued, running, or recently finished jobs

Your directory is /scratch/projectname/username
• The path is available in the $MYSCRATCH environment variable
• Files on /scratch may be purged after 30 days
• Touching files to avoid purging violates Pawsey's conditions of use and will result in suspension of accounts
• Please delete your files before they are purged

Group Filesystem

High performance Lustre filesystem for shared project files, such as data sets and executables.
• Not as fast as /scratch – copy data to /scratch before accessing it from the compute nodes (see the staging sketch at the end of this section).

Your directory is /group/projectname
• The path is available in the $MYGROUP environment variable
• Default 1 TB quota: `lfs quota -g projectname /group`
• Make sure files are set to your project group, not your username group.
• Use `fix.group.permission.sh projectname` if needed

Home Filesystem

You can use your home directory to store environment configurations, scripts, and small input files.
• Small quota (about 10 GB): `quota -s -f /home`
• Uses NFS, which is very slow compared to Lustre
• Where possible, use /group instead

Your directory is /home/username
• The path is available in the $HOME environment variable

LOGGING IN

Command line SSH

Within a terminal window, type:
ssh username@hostname
where hostname is the login node address of the system you have been granted access to.

Common terminal programs:
• Windows: use MobaXterm (download)
• Linux: use xterm (preinstalled)
• OS X: use Terminal (preinstalled) or xterm (download)

Account Security

• SSH uses fingerprints to identify computers, so you don't give your password to someone pretending to be the remote host.
• We recommend that you set up SSH key authentication to increase the security of your account: https://support.pawsey.org.au/documentation/display/US/Logging+in+with+SSH+keys
• Do not share your account with others; this violates the conditions of use (have the project leader add them to the project).
• Please do not provide your password in help desk tickets. We never ask for your password via email, as it is not secure.

Graphical Interfaces

• Remote graphical interfaces are available for some tasks: graphical debuggers, text editors, remote visualisation.
• Low performance over long distances.
• Very easy to set up for Linux, Mac, and MobaXterm clients – add the -X flag to ssh:
ssh -X username@hostname
• For higher-performance remote GUIs, use VNC: https://support.pawsey.org.au/documentation/display/US/Visualisation+documentation

Firewalls

• Firewalls protect computers on the local network by only allowing particular traffic in or out to the internet.
• Most computers
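Putting the filesystem paths and the data mover nodes together, a typical staging step might look like the following minimal sketch; the project name, username, and local directory are placeholders, and hpc-data.pawsey.org.au is the data mover address mentioned earlier:

```bash
# From your own machine: stage input data onto /scratch through the
# data mover nodes (not the login nodes).
rsync -av --progress input_data/ \
    username@hpc-data.pawsey.org.au:/scratch/projectname/username/input_data/

# On the supercomputer: confirm where your directories are and check quotas.
echo "$MYSCRATCH"
echo "$MYGROUP"
lfs quota -g projectname /group   # group quota (default 1 TB)
quota -s -f /home                 # home quota (about 10 GB)
```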