The HEP Parallel Processor by James W
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
2.5 Classification of Parallel Computers
52 // Architectures 2.5 Classification of Parallel Computers 2.5 Classification of Parallel Computers 2.5.1 Granularity In parallel computing, granularity means the amount of computation in relation to communication or synchronisation Periods of computation are typically separated from periods of communication by synchronization events. • fine level (same operations with different data) ◦ vector processors ◦ instruction level parallelism ◦ fine-grain parallelism: – Relatively small amounts of computational work are done between communication events – Low computation to communication ratio – Facilitates load balancing 53 // Architectures 2.5 Classification of Parallel Computers – Implies high communication overhead and less opportunity for per- formance enhancement – If granularity is too fine it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation. • operation level (different operations simultaneously) • problem level (independent subtasks) ◦ coarse-grain parallelism: – Relatively large amounts of computational work are done between communication/synchronization events – High computation to communication ratio – Implies more opportunity for performance increase – Harder to load balance efficiently 54 // Architectures 2.5 Classification of Parallel Computers 2.5.2 Hardware: Pipelining (was used in supercomputers, e.g. Cray-1) In N elements in pipeline and for 8 element L clock cycles =) for calculation it would take L + N cycles; without pipeline L ∗ N cycles Example of good code for pipelineing: §doi =1 ,k ¤ z ( i ) =x ( i ) +y ( i ) end do ¦ 55 // Architectures 2.5 Classification of Parallel Computers Vector processors, fast vector operations (operations on arrays). Previous example good also for vector processor (vector addition) , but, e.g. recursion – hard to optimise for vector processors Example: IntelMMX – simple vector processor. -
Balanced Multithreading: Increasing Throughput Via a Low Cost Multithreading Hierarchy
Balanced Multithreading: Increasing Throughput via a Low Cost Multithreading Hierarchy Eric Tune Rakesh Kumar Dean M. Tullsen Brad Calder Computer Science and Engineering Department University of California at San Diego {etune,rakumar,tullsen,calder}@cs.ucsd.edu Abstract ibility of SMT comes at a cost. The register file and rename tables must be enlarged to accommodate the architectural reg- A simultaneous multithreading (SMT) processor can issue isters of the additional threads. This in turn can increase the instructions from several threads every cycle, allowing it to clock cycle time and/or the depth of the pipeline. effectively hide various instruction latencies; this effect in- Coarse-grained multithreading (CGMT) [1, 21, 26] is a creases with the number of simultaneous contexts supported. more restrictive model where the processor can only execute However, each added context on an SMT processor incurs a instructions from one thread at a time, but where it can switch cost in complexity, which may lead to an increase in pipeline to a new thread after a short delay. This makes CGMT suited length or a decrease in the maximum clock rate. This pa- for hiding longer delays. Soon, general-purpose micropro- per presents new designs for multithreaded processors which cessors will be experiencing delays to main memory of 500 combine a conservative SMT implementation with a coarse- or more cycles. This means that a context switch in response grained multithreading capability. By presenting more virtual to a memory access can take tens of cycles and still provide contexts to the operating system and user than are supported a considerable performance benefit. -
Amdahl's Law Threading, Openmp
Computer Science 61C Spring 2018 Wawrzynek and Weaver Amdahl's Law Threading, OpenMP 1 Big Idea: Amdahl’s (Heartbreaking) Law Computer Science 61C Spring 2018 Wawrzynek and Weaver • Speedup due to enhancement E is Exec time w/o E Speedup w/ E = ---------------------- Exec time w/ E • Suppose that enhancement E accelerates a fraction F (F <1) of the task by a factor S (S>1) and the remainder of the task is unaffected Execution Time w/ E = Execution Time w/o E × [ (1-F) + F/S] Speedup w/ E = 1 / [ (1-F) + F/S ] 2 Big Idea: Amdahl’s Law Computer Science 61C Spring 2018 Wawrzynek and Weaver Speedup = 1 (1 - F) + F Non-speed-up part S Speed-up part Example: the execution time of half of the program can be accelerated by a factor of 2. What is the program speed-up overall? 1 1 = = 1.33 0.5 + 0.5 0.5 + 0.25 2 3 Example #1: Amdahl’s Law Speedup w/ E = 1 / [ (1-F) + F/S ] Computer Science 61C Spring 2018 Wawrzynek and Weaver • Consider an enhancement which runs 20 times faster but which is only usable 25% of the time Speedup w/ E = 1/(.75 + .25/20) = 1.31 • What if its usable only 15% of the time? Speedup w/ E = 1/(.85 + .15/20) = 1.17 • Amdahl’s Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar! • To get a speedup of 90 from 100 processors, the percentage of the original program that could be scalar would have to be 0.1% or less Speedup w/ E = 1/(.001 + .999/100) = 90.99 4 Amdahl’s Law If the portion of Computer Science 61C Spring 2018 the program that Wawrzynek and Weaver can be parallelized is small, then the speedup is limited The non-parallel portion limits the performance 5 Strong and Weak Scaling Computer Science 61C Spring 2018 Wawrzynek and Weaver • To get good speedup on a parallel processor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem. -
Parallel Programming
Parallel Programming Parallel Programming Parallel Computing Hardware Shared memory: multiple cpus are attached to the BUS all processors share the same primary memory the same memory address on different CPU’s refer to the same memory location CPU-to-memory connection becomes a bottleneck: shared memory computers cannot scale very well Parallel Programming Parallel Computing Hardware Distributed memory: each processor has its own private memory computational tasks can only operate on local data infinite available memory through adding nodes requires more difficult programming Parallel Programming OpenMP versus MPI OpenMP (Open Multi-Processing): easy to use; loop-level parallelism non-loop-level parallelism is more difficult limited to shared memory computers cannot handle very large problems MPI(Message Passing Interface): require low-level programming; more difficult programming scalable cost/size can handle very large problems Parallel Programming MPI Distributed memory: Each processor can access only the instructions/data stored in its own memory. The machine has an interconnection network that supports passing messages between processors. A user specifies a number of concurrent processes when program begins. Every process executes the same program, though theflow of execution may depend on the processors unique ID number (e.g. “if (my id == 0) then ”). ··· Each process performs computations on its local variables, then communicates with other processes (repeat), to eventually achieve the computed result. In this model, processors pass messages both to send/receive information, and to synchronize with one another. Parallel Programming Introduction to MPI Communicators and Groups: MPI uses objects called communicators and groups to define which collection of processes may communicate with each other. -
Parallel Generation of Image Layers Constructed by Edge Detection
International Journal of Applied Engineering Research ISSN 0973-4562 Volume 13, Number 10 (2018) pp. 7323-7332 © Research India Publications. http://www.ripublication.com Speed Up Improvement of Parallel Image Layers Generation Constructed By Edge Detection Using Message Passing Interface Alaa Ismail Elnashar Faculty of Science, Computer Science Department, Minia University, Egypt. Associate professor, Department of Computer Science, College of Computers and Information Technology, Taif University, Saudi Arabia. Abstract A. Elnashar [29] introduced two parallel techniques, NASHT1 and NASHT2 that apply edge detection to produce a set of Several image processing techniques require intensive layers for an input image. The proposed techniques generate computations that consume CPU time to achieve their results. an arbitrary number of image layers in a single parallel run Accelerating these techniques is a desired goal. Parallelization instead of generating a unique layer as in traditional case; this can be used to save the execution time of such techniques. In helps in selecting the layers with high quality edges among this paper, we propose a parallel technique that employs both the generated ones. Each presented parallel technique has two point to point and collective communication patterns to versions based on point to point communication "Send" and improve the speedup of two parallel edge detection techniques collective communication "Bcast" functions. The presented that generate multiple image layers using Message Passing techniques achieved notable relative speedup compared with Interface, MPI. In addition, the performance of the proposed that of sequential ones. technique is compared with that of the two concerned ones. In this paper, we introduce a speed up improvement for both Keywords: Image Processing, Image Segmentation, Parallel NASHT1 and NASHT2. -
Massively Parallel Computing with CUDA
Massively Parallel Computing with CUDA Antonino Tumeo Politecnico di Milano 1 GPUs have evolved to the point where many real world applications are easily implemented on them and run significantly faster than on multi-core systems. Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs. Jack Dongarra Professor, University of Tennessee; Author of “Linpack” Why Use the GPU? • The GPU has evolved into a very flexible and powerful processor: • It’s programmable using high-level languages • It supports 32-bit and 64-bit floating point IEEE-754 precision • It offers lots of GFLOPS: • GPU in every PC and workstation What is behind such an Evolution? • The GPU is specialized for compute-intensive, highly parallel computation (exactly what graphics rendering is about) • So, more transistors can be devoted to data processing rather than data caching and flow control ALU ALU Control ALU ALU Cache DRAM DRAM CPU GPU • The fast-growing video game industry exerts strong economic pressure that forces constant innovation GPUs • Each NVIDIA GPU has 240 parallel cores NVIDIA GPU • Within each core 1.4 Billion Transistors • Floating point unit • Logic unit (add, sub, mul, madd) • Move, compare unit • Branch unit • Cores managed by thread manager • Thread manager can spawn and manage 12,000+ threads per core 1 Teraflop of processing power • Zero overhead thread switching Heterogeneous Computing Domains Graphics Massive Data GPU Parallelism (Parallel Computing) Instruction CPU Level (Sequential -
CS650 Computer Architecture Lecture 10 Introduction to Multiprocessors
NJIT Computer Science Dept CS650 Computer Architecture CS650 Computer Architecture Lecture 10 Introduction to Multiprocessors and PC Clustering Andrew Sohn Computer Science Department New Jersey Institute of Technology Lecture 10: Intro to Multiprocessors/Clustering 1/15 12/7/2003 A. Sohn NJIT Computer Science Dept CS650 Computer Architecture Key Issues Run PhotoShop on 1 PC or N PCs Programmability • How to program a bunch of PCs viewed as a single logical machine. Performance Scalability - Speedup • Run PhotoShop on 1 PC (forget the specs of this PC) • Run PhotoShop on N PCs • Will it run faster on N PCs? Speedup = ? Lecture 10: Intro to Multiprocessors/Clustering 2/15 12/7/2003 A. Sohn NJIT Computer Science Dept CS650 Computer Architecture Types of Multiprocessors Key: Data and Instruction Single Instruction Single Data (SISD) • Intel processors, AMD processors Single Instruction Multiple Data (SIMD) • Array processor • Pentium MMX feature Multiple Instruction Single Data (MISD) • Systolic array • Special purpose machines Multiple Instruction Multiple Data (MIMD) • Typical multiprocessors (Sun, SGI, Cray,...) Single Program Multiple Data (SPMD) • Programming model Lecture 10: Intro to Multiprocessors/Clustering 3/15 12/7/2003 A. Sohn NJIT Computer Science Dept CS650 Computer Architecture Shared-Memory Multiprocessor Processor Prcessor Prcessor Interconnection network Main Memory Storage I/O Lecture 10: Intro to Multiprocessors/Clustering 4/15 12/7/2003 A. Sohn NJIT Computer Science Dept CS650 Computer Architecture Distributed-Memory Multiprocessor Processor Processor Processor IO/S MM IO/S MM IO/S MM Interconnection network IO/S MM IO/S MM IO/S MM Processor Processor Processor Lecture 10: Intro to Multiprocessors/Clustering 5/15 12/7/2003 A. -
Parallel Computer Architecture
Parallel Computer Architecture Introduction to Parallel Computing CIS 410/510 Department of Computer and Information Science Lecture 2 – Parallel Architecture Outline q Parallel architecture types q Instruction-level parallelism q Vector processing q SIMD q Shared memory ❍ Memory organization: UMA, NUMA ❍ Coherency: CC-UMA, CC-NUMA q Interconnection networks q Distributed memory q Clusters q Clusters of SMPs q Heterogeneous clusters of SMPs Introduction to Parallel Computing, University of Oregon, IPCC Lecture 2 – Parallel Architecture 2 Parallel Architecture Types • Uniprocessor • Shared Memory – Scalar processor Multiprocessor (SMP) processor – Shared memory address space – Bus-based memory system memory processor … processor – Vector processor bus processor vector memory memory – Interconnection network – Single Instruction Multiple processor … processor Data (SIMD) network processor … … memory memory Introduction to Parallel Computing, University of Oregon, IPCC Lecture 2 – Parallel Architecture 3 Parallel Architecture Types (2) • Distributed Memory • Cluster of SMPs Multiprocessor – Shared memory addressing – Message passing within SMP node between nodes – Message passing between SMP memory memory nodes … M M processor processor … … P … P P P interconnec2on network network interface interconnec2on network processor processor … P … P P … P memory memory … M M – Massively Parallel Processor (MPP) – Can also be regarded as MPP if • Many, many processors processor number is large Introduction to Parallel Computing, University of Oregon, -
Real-Time Performance During CUDA™ a Demonstration and Analysis of Redhawk™ CUDA RT Optimizations
A Concurrent Real-Time White Paper 2881 Gateway Drive Pompano Beach, FL 33069 (954) 974-1700 www.concurrent-rt.com Real-Time Performance During CUDA™ A Demonstration and Analysis of RedHawk™ CUDA RT Optimizations By: Concurrent Real-Time Linux® Development Team November 2010 Overview There are many challenges to creating a real-time Linux distribution that provides guaranteed low process-dispatch latencies and minimal process run-time jitter. Concurrent Real Time’s RedHawk Linux distribution meets and exceeds these challenges, providing a hard real-time environment on many qualified hardware configurations, even in the presence of a heavy system load. However, there are additional challenges faced when guaranteeing real-time performance of processes while CUDA applications are simultaneously running on the system. The proprietary CUDA driver supplied by NVIDIA® frequently makes demands upon kernel resources that can dramatically impact real-time performance. This paper discusses a demonstration application developed by Concurrent to illustrate that RedHawk Linux kernel optimizations allow hard real-time performance guarantees to be preserved even while demanding CUDA applications are running. The test results will show how RedHawk performance compares to CentOS performance running the same application. The design and implementation details of the demonstration application are also discussed in this paper. Demonstration This demonstration features two selectable real-time test modes: 1. Jitter Mode: measure and graph the run-time jitter of a real-time process 2. PDL Mode: measure and graph the process-dispatch latency of a real-time process While the demonstration is running, it is possible to switch between these different modes at any time. -
High-Performance Message Passing Over Generic Ethernet Hardware with Open-MX Brice Goglin
High-Performance Message Passing over generic Ethernet Hardware with Open-MX Brice Goglin To cite this version: Brice Goglin. High-Performance Message Passing over generic Ethernet Hardware with Open-MX. Parallel Computing, Elsevier, 2011, 37 (2), pp.85-100. 10.1016/j.parco.2010.11.001. inria-00533058 HAL Id: inria-00533058 https://hal.inria.fr/inria-00533058 Submitted on 5 Nov 2010 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. High-Performance Message Passing over generic Ethernet Hardware with Open-MX Brice Goglin INRIA Bordeaux - Sud-Ouest – LaBRI 351 cours de la Lib´eration – F-33405 Talence – France Abstract In the last decade, cluster computing has become the most popular high-performance computing architec- ture. Although numerous technological innovations have been proposed to improve the interconnection of nodes, many clusters still rely on commodity Ethernet hardware to implement message passing within parallel applications. We present Open-MX, an open-source message passing stack over generic Ethernet. It offers the same abilities as the specialized Myrinet Express stack, without requiring dedicated support from the networking hardware. Open-MX works transparently in the most popular MPI implementations through its MX interface compatibility. -
An Investigation of Symmetric Multi-Threading Parallelism for Scientific Applications
An Investigation of Symmetric Multi-Threading Parallelism for Scientific Applications Installation and Performance Evaluation of ASCI sPPM Frank Castaneda [email protected] Nikola Vouk [email protected] North Carolina State University CSC 591c Spring 2003 May 2, 2003 Dr. Frank Mueller An Investigation of Symmetric Multi-Threading Parallelism for Scientific Applications Introduction Our project is to investigate the use of Symmetric Multi-Threading (SMT), or “Hyper-threading” in Intel parlance, applied to course-grain parallelism in large-scale distributed scientific applications. The processors provide the capability to run two streams of instruction simultaneously to fully utilize all available functional units in the CPU. We are investigating the speedup available when running two threads on a single processor that uses different functional units. The idea we propose is to utilize the hyper-thread for asynchronous communications activity to improve course-grain parallelism. The hypothesis is that there will be little contention for similar processor functional units when splitting the communications work from the computational work, thus allowing better parallelism and better exploiting the hyper-thread technology. We believe with minimal changes to the 2.5 Linux kernel we can achieve 25-50% speedup depending on the amount of communication by utilizing a hyper-thread aware scheduler and a custom communications API. Experiment Setup Software Setup Hardware Setup ?? Custom Linux Kernel 2.5.68 with ?? IBM xSeries 335 Single/Dual Processor Red Hat distribution 2.0 Ghz Xeon ?? Modification of Kernel Scheduler to run processes together ?? Custom Test Code In order to test our code we focus on the following scenarios: ?? A serial execution of communication / computation sections o Executing in serial is used to test the expected back-to-back run-time. -
A Review of Multicore Processors with Parallel Programming
International Journal of Engineering Technology, Management and Applied Sciences www.ijetmas.com September 2015, Volume 3, Issue 9, ISSN 2349-4476 A Review of Multicore Processors with Parallel Programming Anchal Thakur Ravinder Thakur Research Scholar, CSE Department Assistant Professor, CSE L.R Institute of Engineering and Department Technology, Solan , India. L.R Institute of Engineering and Technology, Solan, India ABSTRACT When the computers first introduced in the market, they came with single processors which limited the performance and efficiency of the computers. The classic way of overcoming the performance issue was to use bigger processors for executing the data with higher speed. Big processor did improve the performance to certain extent but these processors consumed a lot of power which started over heating the internal circuits. To achieve the efficiency and the speed simultaneously the CPU architectures developed multicore processors units in which two or more processors were used to execute the task. The multicore technology offered better response-time while running big applications, better power management and faster execution time. Multicore processors also gave developer an opportunity to parallel programming to execute the task in parallel. These days parallel programming is used to execute a task by distributing it in smaller instructions and executing them on different cores. By using parallel programming the complex tasks that are carried out in a multicore environment can be executed with higher efficiency and performance. Keywords: Multicore Processing, Multicore Utilization, Parallel Processing. INTRODUCTION From the day computers have been invented a great importance has been given to its efficiency for executing the task.