A Beginner's Guide to High–Performance Computing
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
A Comprehensive Review for Central Processing Unit Scheduling Algorithm
IJCSI International Journal of Computer Science Issues, Vol. 10, Issue 1, No 2, January 2013 ISSN (Print): 1694-0784 | ISSN (Online): 1694-0814 www.IJCSI.org 353 A Comprehensive Review for Central Processing Unit Scheduling Algorithm Ryan Richard Guadaña1, Maria Rona Perez2 and Larry Rutaquio Jr.3 1 Computer Studies and System Department, University of the East Caloocan City, 1400, Philippines 2 Computer Studies and System Department, University of the East Caloocan City, 1400, Philippines 3 Computer Studies and System Department, University of the East Caloocan City, 1400, Philippines Abstract when an attempt is made to execute a program, its This paper describe how does CPU facilitates tasks given by a admission to the set of currently executing processes is user through a Scheduling Algorithm. CPU carries out each either authorized or delayed by the long-term scheduler. instruction of the program in sequence then performs the basic Second is the Mid-term Scheduler that temporarily arithmetical, logical, and input/output operations of the system removes processes from main memory and places them on while a scheduling algorithm is used by the CPU to handle every process. The authors also tackled different scheduling disciplines secondary memory (such as a disk drive) or vice versa. and examples were provided in each algorithm in order to know Last is the Short Term Scheduler that decides which of the which algorithm is appropriate for various CPU goals. ready, in-memory processes are to be executed. Keywords: Kernel, Process State, Schedulers, Scheduling Algorithm, Utilization. 2. CPU Utilization 1. Introduction In order for a computer to be able to handle multiple applications simultaneously there must be an effective way The central processing unit (CPU) is a component of a of using the CPU. -
The Different Unix Contexts
The different Unix contexts • User-level • Kernel “top half” - System call, page fault handler, kernel-only process, etc. • Software interrupt • Device interrupt • Timer interrupt (hardclock) • Context switch code Transitions between contexts • User ! top half: syscall, page fault • User/top half ! device/timer interrupt: hardware • Top half ! user/context switch: return • Top half ! context switch: sleep • Context switch ! user/top half Top/bottom half synchronization • Top half kernel procedures can mask interrupts int x = splhigh (); /* ... */ splx (x); • splhigh disables all interrupts, but also splnet, splbio, splsoftnet, . • Masking interrupts in hardware can be expensive - Optimistic implementation – set mask flag on splhigh, check interrupted flag on splx Kernel Synchronization • Need to relinquish CPU when waiting for events - Disk read, network packet arrival, pipe write, signal, etc. • int tsleep(void *ident, int priority, ...); - Switches to another process - ident is arbitrary pointer—e.g., buffer address - priority is priority at which to run when woken up - PCATCH, if ORed into priority, means wake up on signal - Returns 0 if awakened, or ERESTART/EINTR on signal • int wakeup(void *ident); - Awakens all processes sleeping on ident - Restores SPL a time they went to sleep (so fine to sleep at splhigh) Process scheduling • Goal: High throughput - Minimize context switches to avoid wasting CPU, TLB misses, cache misses, even page faults. • Goal: Low latency - People typing at editors want fast response - Network services can be latency-bound, not CPU-bound • BSD time quantum: 1=10 sec (since ∼1980) - Empirically longest tolerable latency - Computers now faster, but job queues also shorter Scheduling algorithms • Round-robin • Priority scheduling • Shortest process next (if you can estimate it) • Fair-Share Schedule (try to be fair at level of users, not processes) Multilevel feeedback queues (BSD) • Every runnable proc. -
2.5 Classification of Parallel Computers
52 // Architectures 2.5 Classification of Parallel Computers 2.5 Classification of Parallel Computers 2.5.1 Granularity In parallel computing, granularity means the amount of computation in relation to communication or synchronisation Periods of computation are typically separated from periods of communication by synchronization events. • fine level (same operations with different data) ◦ vector processors ◦ instruction level parallelism ◦ fine-grain parallelism: – Relatively small amounts of computational work are done between communication events – Low computation to communication ratio – Facilitates load balancing 53 // Architectures 2.5 Classification of Parallel Computers – Implies high communication overhead and less opportunity for per- formance enhancement – If granularity is too fine it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation. • operation level (different operations simultaneously) • problem level (independent subtasks) ◦ coarse-grain parallelism: – Relatively large amounts of computational work are done between communication/synchronization events – High computation to communication ratio – Implies more opportunity for performance increase – Harder to load balance efficiently 54 // Architectures 2.5 Classification of Parallel Computers 2.5.2 Hardware: Pipelining (was used in supercomputers, e.g. Cray-1) In N elements in pipeline and for 8 element L clock cycles =) for calculation it would take L + N cycles; without pipeline L ∗ N cycles Example of good code for pipelineing: §doi =1 ,k ¤ z ( i ) =x ( i ) +y ( i ) end do ¦ 55 // Architectures 2.5 Classification of Parallel Computers Vector processors, fast vector operations (operations on arrays). Previous example good also for vector processor (vector addition) , but, e.g. recursion – hard to optimise for vector processors Example: IntelMMX – simple vector processor. -
Su(3) Gluodynamics on Graphics Processing Units V
Proceedings of the International School-seminar \New Physics and Quantum Chromodynamics at External Conditions", pp. 29 { 33, 3-6 May 2011, Dnipropetrovsk, Ukraine SU(3) GLUODYNAMICS ON GRAPHICS PROCESSING UNITS V. I. Demchik Dnipropetrovsk National University, Dnipropetrovsk, Ukraine SU(3) gluodynamics is implemented on graphics processing units (GPU) with the open cross-platform compu- tational language OpenCL. The architecture of the first open computer cluster for high performance computing on graphics processing units (HGPU) with peak performance over 12 TFlops is described. 1 Introduction Most of the modern achievements in science and technology are closely related to the use of high performance computing. The ever-increasing performance requirements of computing systems cause high rates of their de- velopment. Every year the computing power, which completely covers the overall performance of all existing high-end systems reached before is introduced according to popular TOP-500 list of the most powerful (non- distributed) computer systems in the world [1]. Obvious desire to reduce the cost of acquisition and maintenance of computer systems simultaneously, along with the growing demands on their performance shift the supercom- puting architecture in the direction of heterogeneous systems. These kind of architecture also contains special computational accelerators in addition to traditional central processing units (CPU). One of such accelerators becomes graphics processing units (GPU), which are traditionally designed and used primarily for narrow tasks visualization of 2D and 3D scenes in games and applications. The peak performance of modern high-end GPUs (AMD Radeon HD 6970, AMD Radeon HD 5870, nVidia GeForce GTX 580) is about 20 times faster than the peak performance of comparable CPUs (see Fig. -
Supercomputing in Plain English: Overview
Supercomputing in Plain English An Introduction to High Performance Computing Part I: Overview: What the Heck is Supercomputing? Henry Neeman, Director OU Supercomputing Center for Education & Research Goals of These Workshops To introduce undergrads, grads, staff and faculty to supercomputing issues To provide a common language for discussing supercomputing issues when we meet to work on your research NOT: to teach everything you need to know about supercomputing – that can’t be done in a handful of hourlong workshops! OU Supercomputing Center for Education & Research 2 What is Supercomputing? Supercomputing is the biggest, fastest computing right this minute. Likewise, a supercomputer is the biggest, fastest computer right this minute. So, the definition of supercomputing is constantly changing. Rule of Thumb: a supercomputer is 100 to 10,000 times as powerful as a PC. Jargon: supercomputing is also called High Performance Computing (HPC). OU Supercomputing Center for Education & Research 3 What is Supercomputing About? Size Speed OU Supercomputing Center for Education & Research 4 What is Supercomputing About? Size: many problems that are interesting to scientists and engineers can’t fit on a PC – usually because they need more than 2 GB of RAM, or more than 60 GB of hard disk. Speed: many problems that are interesting to scientists and engineers would take a very very long time to run on a PC: months or even years. But a problem that would take a month on a PC might take only a few hours on a supercomputer. OU Supercomputing -
Computer Hardware Architecture Lecture 4
Computer Hardware Architecture Lecture 4 Manfred Liebmann Technische Universit¨atM¨unchen Chair of Optimal Control Center for Mathematical Sciences, M17 [email protected] November 10, 2015 Manfred Liebmann November 10, 2015 Reading List • Pacheco - An Introduction to Parallel Programming (Chapter 1 - 2) { Introduction to computer hardware architecture from the parallel programming angle • Hennessy-Patterson - Computer Architecture - A Quantitative Approach { Reference book for computer hardware architecture All books are available on the Moodle platform! Computer Hardware Architecture 1 Manfred Liebmann November 10, 2015 UMA Architecture Figure 1: A uniform memory access (UMA) multicore system Access times to main memory is the same for all cores in the system! Computer Hardware Architecture 2 Manfred Liebmann November 10, 2015 NUMA Architecture Figure 2: A nonuniform memory access (UMA) multicore system Access times to main memory differs form core to core depending on the proximity of the main memory. This architecture is often used in dual and quad socket servers, due to improved memory bandwidth. Computer Hardware Architecture 3 Manfred Liebmann November 10, 2015 Cache Coherence Figure 3: A shared memory system with two cores and two caches What happens if the same data element z1 is manipulated in two different caches? The hardware enforces cache coherence, i.e. consistency between the caches. Expensive! Computer Hardware Architecture 4 Manfred Liebmann November 10, 2015 False Sharing The cache coherence protocol works on the granularity of a cache line. If two threads manipulate different element within a single cache line, the cache coherency protocol is activated to ensure consistency, even if every thread is only manipulating its own data. -
Comparing Systems Using Sample Data
Operating System and Process Monitoring Tools Arik Brooks, [email protected] Abstract: Monitoring the performance of operating systems and processes is essential to debug processes and systems, effectively manage system resources, making system decisions, and evaluating and examining systems. These tools are primarily divided into two main categories: real time and log-based. Real time monitoring tools are concerned with measuring the current system state and provide up to date information about the system performance. Log-based monitoring tools record system performance information for post-processing and analysis and to find trends in the system performance. This paper presents a survey of the most commonly used tools for monitoring operating system and process performance in Windows- and Unix-based systems and describes the unique challenges of real time and log-based performance monitoring. See Also: Table of Contents: 1. Introduction 2. Real Time Performance Monitoring Tools 2.1 Windows-Based Tools 2.1.1 Task Manager (taskmgr) 2.1.2 Performance Monitor (perfmon) 2.1.3 Process Monitor (pmon) 2.1.4 Process Explode (pview) 2.1.5 Process Viewer (pviewer) 2.2 Unix-Based Tools 2.2.1 Process Status (ps) 2.2.2 Top 2.2.3 Xosview 2.2.4 Treeps 2.3 Summary of Real Time Monitoring Tools 3. Log-Based Performance Monitoring Tools 3.1 Windows-Based Tools 3.1.1 Event Log Service and Event Viewer 3.1.2 Performance Logs and Alerts 3.1.3 Performance Data Log Service 3.2 Unix-Based Tools 3.2.1 System Activity Reporter (sar) 3.2.2 Cpustat 3.3 Summary of Log-Based Monitoring Tools 4. -
Embedded System Introduction.Pdf
EmbeddedEmbedded SystemSystem IntroductionIntroduction ANDES Confidential WWW.ANDESTECH.COM Embedded System vs Desktop Page 2 相同點 CPU Storage I/O 相異點 Desktop ‧可執行多種功能 ‧作業系統對於系統資源的管理較為複雜 Embedded System ‧執行特定功能 ‧作業系統對於系統資源的管理較為簡單 Page 3 SystemSystem LayerLayer Application Application Operating System Operating System Firmware Firmware Firmware Hardware Hardware Hardware Desktop computer Complex embedded Simple embedded computer computer Page 4 HardwareHardware ArchitectureArchitecture DesktopDesktop ComputerComputer SystemSystem HardwareHardware ArchitectureArchitecture CPU AGP Slot North Bridge Memory PCI Interface USB Interface South Bridge IDE Interface System BIOS ISA Interface Super IO Port Page 5 HardwareHardware ArchitectureArchitecture EmbeddedEmbedded SystemSystem ComputerComputer HardwareHardware ArchitectureArchitecture ADC CPU SPI Digital I/O ROM IIC Host RAM Timers Computer UART Bus Interface Memory Network Interface I/O Page 6 What is the Embedded System? Page 7 IntroductionIntroduction Challenges in embedded system design. Design methodologies. Page 8 EmbeddedEmbedded SystemSystem ?? •An embedded system is a special-purpose computer system designed to perform one or a few dedicated functions •with real-time computing constraints • include hardware, software and mechanical parts Page 9 EmbeddingEmbedding aa computercomputer Page 10 ComponentsComponents ofof anan embeddedembedded systemsystem Characteristics Low power Closed operating environment Cost sensitive Page 11 ComponentsComponents ofof anan embeddedembedded -
14 Los Alamos National Laboratory
14 Los Alamos National Laboratory Every year for the past 17 years, the director of Los Alamos alternative for assessing the safety, reliability, and perfor- National Laboratory has had a legally required task: write mance of the stockpile: virtual-world simulations. a letter—a personal assessment of Los Alamos–designed warheads and bombs in the U.S. nuclear stockpile. Th i s I, Iceberg letter is sent to the secretaries of Energy and Defense and to Hollywood movies such as the Matrix series or I, Robot the Nuclear Weapons Council. Th rough them the letter goes typically portray supercomputers as massive, room-fi lling to the president of the United States. machines that churn out answers to the most complex questions—all by themselves. In fact, like the director’s Th e technical basis for the director’s assessment comes from Annual Assessment Letter, supercomputers are themselves the Laboratory’s ongoing execution of the nation’s Stockpile the tip of an iceberg. Stewardship Program; Los Alamos’ mission is to study its portion of the aging stockpile, fi nd any problems, and address them. And for the past 17 years, the director’s letter has said, Without people, a supercomputer in eff ect, that any problems that have arisen in Los Alamos would be no more than a humble jumble weapons are being addressed and resolved without the need for full-scale underground nuclear testing. of wires, bits, and boxes. When it comes to the Laboratory’s work on the annual assess- Although these rows of huge machines are the most visible ment, the director’s letter is just the tip of the iceberg. -
A Simple Chargeback System for SAS® Applications Running on UNIX and Linux Servers
SAS Global Forum 2007 Systems Architecture Paper: 194-2007 A Simple Chargeback System For SAS® Applications Running on UNIX and Linux Servers Michael A. Raithel, Westat, Rockville, MD Abstract Organizations that run SAS on UNIX and Linux servers often have a need to measure overall SAS usage and to charge the projects and users who utilize the servers. This allows an organization to pay for the hardware, software, and labor necessary to maintain and run these types of shared servers. However, performance management and accounting software is normally expensive. Purchasing such software, configuring it, and managing it may be prohibitive in terms of cost and in terms of having staff with the right skill sets to do so. Consequently, many organizations with UNIX or Linux servers do not have a reliable way to measure the overall use of SAS software and to charge individuals and projects for its use. This paper presents a simple methodology for creating a chargeback system on UNIX and Linux servers. It uses basic accounting programs found in the UNIX and Linux operating systems, and exploits them using SAS software. The paper presents an overview of the UNIX/Linux “sa” command and the basic accounting system. Then, it provides a SAS program that utilizes this command to capture and store monthly SAS usage statistics. The paper presents a second SAS program that reads the monthly SAS usage SAS data set and creates both a chargeback report and a general usage report. After reading this paper, you should be able to easily adapt the sample SAS programs to run on servers in your own environment. -
Improving Tasks Throughput on Accelerators Using Opencl Command Concurrency∗
Improving tasks throughput on accelerators using OpenCL command concurrency∗ A.J. L´azaro-Mu~noz1, J.M. Gonz´alez-Linares1, J. G´omez-Luna2, and N. Guil1 1 Dep. of Computer Architecture University of M´alaga,Spain [email protected] 2 Dep. of Computer Architecture and Electronics University of C´ordoba,Spain Abstract A heterogeneous architecture composed by a host and an accelerator must frequently deal with situations where several independent tasks are available to be offloaded onto the accelerator. These tasks can be generated by concurrent applications executing in the host or, in case the host is a node of a computer cluster, by applications running on other cluster nodes that are willing to offload tasks in the accelerator connected to the host. In this work we show that a runtime scheduler that selects the best execution order of a group of tasks on the accelerator can significantly reduce the total execution time of the tasks and, consequently, increase the accelerator use. Our solution is based on a temporal execution model that is able to predict with high accuracy the execution time of a set of concurrent tasks launched on the accelerator. The execution model has been validated in AMD, NVIDIA, and Xeon Phi devices using synthetic benchmarks. Moreover, employing the temporal execution model, a heuristic is proposed which is able to establish a near-optimal tasks execution ordering that signicantly reduces the total execution time, including data transfers. The heuristic has been evaluated with five different benchmarks composed of dominant kernel and dominant transfer real tasks. Experiments indicate the heuristic is able to find always an ordering with a better execution time than the average of every possible execution order and, most times, it achieves a near-optimal ordering (very close to the execution time of the best execution order) with a negligible overhead. -
Advanced Engineering Mathematics
5 Linear Systems of ODEs 5.1 Systems of ODEs In a sense, Chapter 5 equals Chapter 2 “plus” Chapter 3, in the sense that Chapter 5 combines use of matrix theory and ordinary differential equation (ODE) methods. When we have more than one linear ODE, results from matrix theory turn out to be useful. Example 5.1 For the circuit shown in Figure 5.1, let v(t) be the voltage drop across the capacitor and I(t) be the loop current. The input V(t) is a given function. Assume, as usual, that L, R, and C are constants. Write down a system of ODEs in R2 that models this circuit. Method: The series RLC circuit shown in Figure 5.1 is analogous to the DC series RLC circuit discussed near the end of Section 3.3. The first ODE models the voltage drop across the capacitor being v(t) = 1 q(t),whereq(t) is the charge on the capacitor and C ˙ q˙(t) = I(t). The second ODE in the system is Kirchhoff’s voltage law, LI(t)+RI(t)+v(t) = V(t), after dividing through by L. The system is ⎧ ⎫ ⎨˙( ) = 1 ( ) ⎬ v t C I t ⎩ ⎭ . (5.1) ˙( ) = 1 ( ( ) − ( ) − ( )) I t L V t RI t v t More generally, consider a system of two ODEs in unknowns x1(t), x2(t): ⎧ ⎫ ˙ ⎨x1(t) = F1 t, x1(t), x2(t) ⎬ ⎩ ⎭ . (5.2) ˙ x2(t) = F2 t, x1(t), x2(t) A special case is ⎧ ⎫ ˙ ⎨x1(t) = a11(t)x1 + a12(t)x2 + f1(t)⎬ ⎩ ⎭ , (5.3) ˙ x2(t) = a21(t)x1 + a22(t)x2 + f2(t) which is called a linear system.