Vector Processing As a Soft-CPU Accelerator

Total Page:16

File Type:pdf, Size:1020Kb

Vector Processing As a Soft-CPU Accelerator Vector Processing as a Soft-CPU Accelerator by Jason Kwok Kwun Yu B.A.Sc, Simon Fraser University, 2005 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Applied Science in The Faculty of Graduate Studies (Electrical and Computer Engineering) The University Of British Columbia May, 2008 c Jason Kwok Kwun Yu 2008 ii Abstract Soft processors simplify hardware design by being able to implement complex control strate- gies using software. However, they are not fast enough for many intensive data-processing tasks, such as highly data-parallel embedded applications. This thesis suggests adding a vector processing core to the soft processor as a general-purpose accelerator for these types of applica- tions. The approach has the benefits of a purely software-oriented development model, a fixed ISA allowing parallel software and hardware development, a single accelerator that can accel- erate multiple functions in an application, and scalable performance with a single source code. With no hardware design experience needed, a software programmer can make area-versus- performance tradeoffs by scaling the number of functional units and register file bandwidth with a single parameter. The soft vector processor can be further customized by a number of secondary parameters to add and remove features for the specific application to optimize resource utilization. This thesis shows that a vector processing architecture maps efficiently into an FPGA and provides a scalable amount of performance for a reasonable amount of area. Configurations of the soft vector processor with different performance levels are estimated to achieve speedups of 2–24× for 5–26× the area of a Nios II/s processor on three benchmark kernels. iii Abstract iv Table of Contents Abstract ............................................ iii Table of Contents ....................................... v List of Tables ......................................... xi List of Figures .........................................xiii Acknowledgments ...................................... xv 1 Introduction ........................................ 1 1.1 Motivation ...................................... 3 1.2 Contributions ................................... .. 5 1.3 ThesisOutline ................................... .. 6 2 Background ........................................ 7 2.1 VectorProcessingOverview . ..... 7 2.1.1 ExploitingData-levelParallelism . ....... 8 2.1.2 VectorInstructionSetArchitecture . ....... 10 2.1.3 Microarchitectural Advantages of Vector Processing ............ 11 2.2 VectorProcessorsandSIMDExtensions . ....... 12 2.2.1 SIMDExtensions ............................... 12 2.2.2 Single-chipVectorProcessors . ...... 13 v Table of Contents 2.2.3 FPGA-basedVectorProcessors . ... 13 2.3 MultiprocessorSystems . ..... 15 2.4 Custom-designedHardware Accelerators . ......... 17 2.4.1 DecoupledCo-processorAccelerators . ....... 17 2.4.2 CustomInstructionAccelerators . ...... 18 2.5 SynthesizedHardwareAccelerators . ........ 20 2.5.1 C-basedSynthesis .............................. 20 2.5.2 Block-basedSynthesis . ... 24 2.5.3 Application Specific Instruction Set Processor . .......... 26 2.5.4 SynthesisfromBinaryDecompilation . ...... 27 2.6 OtherSoftProcessorArchitectures . ........ 28 2.6.1 MitrionVirtualProcessor . .... 28 2.6.2 VLIWProcessors ............................... 29 2.6.3 SuperscalarProcessors . ... 30 2.7 Summary ........................................ 31 3 Configurable Soft Vector Processor ......................... 33 3.1 Requirements .................................... .. 33 3.2 SoftVectorArchitecture . ..... 34 3.2.1 SystemOverview ............................... 34 3.2.2 HybridVector-SIMDModel . 36 3.3 VectorLaneDatapath .............................. ... 37 3.3.1 VectorPipeline ................................ 37 3.3.2 DatapathFunctionalUnits . ... 39 3.3.3 DistributedVectorRegisterFile . ...... 40 3.3.4 LoadStoreUnit ................................ 40 3.3.5 LocalMemory ................................. 41 vi 3.4 MemoryUnit ...................................... 41 3.4.1 LoadStoreController . .. 42 3.4.2 ReadInterface ................................. 42 3.4.3 WriteInterface ................................ 43 3.5 FPGA-SpecificVectorExtensions . ...... 45 3.6 ConfigurableParameters . .... 47 3.7 DesignFlow ...................................... 48 3.7.1 Soft Vector Processor Flow versus C-based Synthesis . .......... 48 3.7.2 CommentonVectorizingCompilers . .... 50 4 Results ........................................... 51 4.1 BenchmarkSuite .................................. .. 51 4.2 BenchmarkPreparation . .... 53 4.2.1 GeneralMethodology . 54 4.2.2 BenchmarkVectorization . ... 54 4.2.3 BenchmarkTuningforC2HCompiler . ... 59 4.2.4 Soft Vector Processor Per Benchmark Configuration . ......... 60 4.3 ResourceUtilizationResults . ....... 60 4.4 PerformanceResults .............................. .... 63 4.4.1 PerformanceModels . ... .. ... .. ... .. .. ... .. ... .. 64 4.4.2 RTLModelPerformance . 66 4.4.3 IdealVectorModel .............................. 68 4.4.4 C2HAcceleratorResults . .. 70 4.4.5 VectorversusC2H ............................... 71 5 Conclusions and Future Work ............................. 73 5.1 FutureWork ...................................... 75 vii Table of Contents Bibliography .......................................... 79 Appendices A Soft Vector Processor ISA ............................... 85 A.1 Introduction .................................... .. 85 A.1.1 ConfigurableArchitecture . ... 86 A.1.2 MemoryConsistency ............................. 87 A.2 VectorRegisterSet ............................... ... 87 A.2.1 VectorRegisters ............................... 87 A.2.2 VectorScalarRegisters . ... 88 A.2.3 VectorFlagRegisters . .. 88 A.2.4 VectorControlRegisters . ... 89 A.2.5 Multiply-Accumulators for Vector Sum Reduction . ......... 90 A.2.6 VectorLaneLocalMemory . 90 A.3 InstructionSet .................................. ... 92 A.3.1 DataTypes .................................. 92 A.3.2 AddressingModes ............................... 92 A.3.3 FlagRegisterUse ............................... 92 A.3.4 Instructions .................................. 92 A.4 InstructionSetReference . ...... 93 A.4.1 IntegerInstructions . ... 93 A.4.2 LogicalInstructions . ... 96 A.4.3 Fixed-Point Instructions (Future Extension) . .......... 96 A.4.4 MemoryInstructions . 99 A.4.5 VectorProcessingInstructions . .102 A.4.6 VectorFlagProcessingInstructions . .......105 viii A.4.7 MiscellaneousInstructions . .106 A.5 InstructionFormats .............................. .107 A.5.1 Vector Register and Vector Scalar Instructions . ..........107 A.5.2 VectorMemoryInstructions . .108 A.5.3 InstructionEncoding . 109 ix Table of Contents x List of Tables 1.1 New soft vector processor instruction extensions . ............. 6 2.1 Partial list of FPGA-based multiprocessor systems in literature . 17 3.1 Listofconfigurableprocessorparameters . .......... 48 4.1 Per benchmark soft vector processor configuration parametersettings . 60 4.2 Resource usage of vector processor and C2H accelerator configurations . 61 4.3 Resource usage of vector processor when varying NLane and MemMinWidth with 128-bit MemWidth andotherwisefullfeatures. 62 4.4 Resource utilization from varying secondary processor parameters. 63 4.5 Idealvectorperformancemodel . ....... 65 4.6 Performancemeasurements . ..... 67 A.1 Listof configurable processor parameters . .......... 87 A.2 Listofvectorflagregisters. ....... 88 A.3 Listofcontrolregisters. ....... 89 A.4 Instructionqualifiers . ...... 93 A.11NiosIIOpcodeUsage .............................. .107 A.12 Scalar register usage as source or destination register................108 A.13 Vector register instruction function field encoding (OPX=0). .109 A.14 Scalar-vector instruction function field encoding (OPX=1).............110 A.15 Fixed-point instruction function field encoding (OPX=0) .............110 xi List of Tables A.16 Flag and miscellaneous instruction function field encoding(OPX=1) . .111 A.17Memory instruction function field encoding . ...........111 xii List of Figures 2.1 8-tapFIRfilterMIPSassembly . ..... 9 2.2 8-tapFIRfilterVIRAMvectorassembly . ....... 10 2.3 NiosIIcustominstructionformats . ........ 19 2.4 Example Nios II system with C2H accelerators. .......... 22 3.1 Scalarandvectorcoreinteraction. ......... 35 3.2 Vector chaining versus hybrid vector-SIMD execution . .............. 36 3.3 Vector co-processorsystem block diagram . .......... 38 3.4 VectorLaneALU................................... 39 3.5 Softvectorprocessorwriteinterface. ........... 43 3.6 Data alignment using crossbar and delay network for vectorstore. 44 3.7 8-tapFIRfiltervectorassembly. ....... 47 3.8 Altera C2H compiler design flow versus soft vector processordesignflow . 49 4.1 MotionestimationCcode . .... 52 4.2 5 × 5medianfilterCcode............................... 53 4.3 Simultaneously matching two copies of the macroblock to a reference frame . 55 4.4 Code segment from the motion estimation vector assembly. The code to handle the first and last rows of the windows are not shown. ....... 56 4.5 Vectorizingtheimagemedianfilter . ........ 57 4.6 Medianfilterinnerloopvectorassembly . ......... 57 4.7 Vector assembly for loading AES data and performing AES encryption round . 58 xiii List of Figures 4.8 RTLModelspeedupoverIdealNiosModel . ...... 68 4.9 Ideal Vector Model and C2H Model speedup over Ideal Nios Model........ 69 A.1 Connection between distributed MAC units and
Recommended publications
  • Data-Level Parallelism
    Fall 2015 :: CSE 610 – Parallel Computer Architectures Data-Level Parallelism Nima Honarmand Fall 2015 :: CSE 610 – Parallel Computer Architectures Overview • Data Parallelism vs. Control Parallelism – Data Parallelism: parallelism arises from executing essentially the same code on a large number of objects – Control Parallelism: parallelism arises from executing different threads of control concurrently • Hypothesis: applications that use massively parallel machines will mostly exploit data parallelism – Common in the Scientific Computing domain • DLP originally linked with SIMD machines; now SIMT is more common – SIMD: Single Instruction Multiple Data – SIMT: Single Instruction Multiple Threads Fall 2015 :: CSE 610 – Parallel Computer Architectures Overview • Many incarnations of DLP architectures over decades – Old vector processors • Cray processors: Cray-1, Cray-2, …, Cray X1 – SIMD extensions • Intel SSE and AVX units • Alpha Tarantula (didn’t see light of day ) – Old massively parallel computers • Connection Machines • MasPar machines – Modern GPUs • NVIDIA, AMD, Qualcomm, … • Focus of throughput rather than latency Vector Processors 4 SCALAR VECTOR (1 operation) (N operations) r1 r2 v1 v2 + + r3 v3 vector length add r3, r1, r2 vadd.vv v3, v1, v2 Scalar processors operate on single numbers (scalars) Vector processors operate on linear sequences of numbers (vectors) 6.888 Spring 2013 - Sanchez and Emer - L14 What’s in a Vector Processor? 5 A scalar processor (e.g. a MIPS processor) Scalar register file (32 registers) Scalar functional units (arithmetic, load/store, etc) A vector register file (a 2D register array) Each register is an array of elements E.g. 32 registers with 32 64-bit elements per register MVL = maximum vector length = max # of elements per register A set of vector functional units Integer, FP, load/store, etc Some times vector and scalar units are combined (share ALUs) 6.888 Spring 2013 - Sanchez and Emer - L14 Example of Simple Vector Processor 6 6.888 Spring 2013 - Sanchez and Emer - L14 Basic Vector ISA 7 Instr.
    [Show full text]
  • 2.5 Classification of Parallel Computers
    52 // Architectures 2.5 Classification of Parallel Computers 2.5 Classification of Parallel Computers 2.5.1 Granularity In parallel computing, granularity means the amount of computation in relation to communication or synchronisation Periods of computation are typically separated from periods of communication by synchronization events. • fine level (same operations with different data) ◦ vector processors ◦ instruction level parallelism ◦ fine-grain parallelism: – Relatively small amounts of computational work are done between communication events – Low computation to communication ratio – Facilitates load balancing 53 // Architectures 2.5 Classification of Parallel Computers – Implies high communication overhead and less opportunity for per- formance enhancement – If granularity is too fine it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation. • operation level (different operations simultaneously) • problem level (independent subtasks) ◦ coarse-grain parallelism: – Relatively large amounts of computational work are done between communication/synchronization events – High computation to communication ratio – Implies more opportunity for performance increase – Harder to load balance efficiently 54 // Architectures 2.5 Classification of Parallel Computers 2.5.2 Hardware: Pipelining (was used in supercomputers, e.g. Cray-1) In N elements in pipeline and for 8 element L clock cycles =) for calculation it would take L + N cycles; without pipeline L ∗ N cycles Example of good code for pipelineing: §doi =1 ,k ¤ z ( i ) =x ( i ) +y ( i ) end do ¦ 55 // Architectures 2.5 Classification of Parallel Computers Vector processors, fast vector operations (operations on arrays). Previous example good also for vector processor (vector addition) , but, e.g. recursion – hard to optimise for vector processors Example: IntelMMX – simple vector processor.
    [Show full text]
  • Balanced Multithreading: Increasing Throughput Via a Low Cost Multithreading Hierarchy
    Balanced Multithreading: Increasing Throughput via a Low Cost Multithreading Hierarchy Eric Tune Rakesh Kumar Dean M. Tullsen Brad Calder Computer Science and Engineering Department University of California at San Diego {etune,rakumar,tullsen,calder}@cs.ucsd.edu Abstract ibility of SMT comes at a cost. The register file and rename tables must be enlarged to accommodate the architectural reg- A simultaneous multithreading (SMT) processor can issue isters of the additional threads. This in turn can increase the instructions from several threads every cycle, allowing it to clock cycle time and/or the depth of the pipeline. effectively hide various instruction latencies; this effect in- Coarse-grained multithreading (CGMT) [1, 21, 26] is a creases with the number of simultaneous contexts supported. more restrictive model where the processor can only execute However, each added context on an SMT processor incurs a instructions from one thread at a time, but where it can switch cost in complexity, which may lead to an increase in pipeline to a new thread after a short delay. This makes CGMT suited length or a decrease in the maximum clock rate. This pa- for hiding longer delays. Soon, general-purpose micropro- per presents new designs for multithreaded processors which cessors will be experiencing delays to main memory of 500 combine a conservative SMT implementation with a coarse- or more cycles. This means that a context switch in response grained multithreading capability. By presenting more virtual to a memory access can take tens of cycles and still provide contexts to the operating system and user than are supported a considerable performance benefit.
    [Show full text]
  • Vector-Thread Architecture and Implementation by Ronny Meir Krashinsky B.S
    Vector-Thread Architecture And Implementation by Ronny Meir Krashinsky B.S. Electrical Engineering and Computer Science University of California at Berkeley, 1999 S.M. Electrical Engineering and Computer Science Massachusetts Institute of Technology, 2001 Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY June 2007 c Massachusetts Institute of Technology 2007. All rights reserved. Author........................................................................... Department of Electrical Engineering and Computer Science May 25, 2007 Certified by . Krste Asanovic´ Associate Professor Thesis Supervisor Accepted by . Arthur C. Smith Chairman, Department Committee on Graduate Students 2 Vector-Thread Architecture And Implementation by Ronny Meir Krashinsky Submitted to the Department of Electrical Engineering and Computer Science on May 25, 2007, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science Abstract This thesis proposes vector-thread architectures as a performance-efficient solution for all-purpose computing. The VT architectural paradigm unifies the vector and multithreaded compute models. VT provides the programmer with a control processor and a vector of virtual processors. The control processor can use vector-fetch commands to broadcast instructions to all the VPs or each VP can use thread-fetches to direct its own control flow. A seamless intermixing of the vector and threaded control mechanisms allows a VT architecture to flexibly and compactly encode application paral- lelism and locality. VT architectures can efficiently exploit a wide variety of loop-level parallelism, including non-vectorizable loops with cross-iteration dependencies or internal control flow.
    [Show full text]
  • Lecture 14: Vector Processors
    Lecture 14: Vector Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a EE382A – Autumn 2009 Lecture 14- 1 Christos Kozyrakis Announcements • Readings for this lecture – H&P 4th edition, Appendix F – Required paper • HW3 available on online – Due on Wed 11/11th • Exam on Fri 11/13, 9am - noon, room 200-305 – All lectures + required papers – Closed books, 1 page of notes, calculator – Review session on Friday 11/6, 2-3pm, Gates Hall Room 498 EE382A – Autumn 2009 Lecture 14 - 2 Christos Kozyrakis Review: Multi-core Processors • Use Moore’s law to place more cores per chip – 2x cores/chip with each CMOS generation – Roughly same clock frequency – Known as multi-core chips or chip-multiprocessors (CMP) • Shared-memory multi-core – All cores access a unified physical address space – Implicit communication through loads and stores – Caches and OOO cores lead to coherence and consistency issues EE382A – Autumn 2009 Lecture 14 - 3 Christos Kozyrakis Review: Memory Consistency Problem P1 P2 /*Assume initial value of A and flag is 0*/ A = 1; while (flag == 0); /*spin idly*/ flag = 1; print A; • Intuitively, you expect to print A=1 – But can you think of a case where you will print A=0? – Even if cache coherence is available • Coherence talks about accesses to a single location • Consistency is about ordering for accesses to difference locations • Alternatively – Coherence determines what value is returned by a read – Consistency determines when a write value becomes visible EE382A – Autumn 2009 Lecture 14 - 4 Christos Kozyrakis Sequential Consistency (What the Programmers Often Assume) • Definition by L.
    [Show full text]
  • Amdahl's Law Threading, Openmp
    Computer Science 61C Spring 2018 Wawrzynek and Weaver Amdahl's Law Threading, OpenMP 1 Big Idea: Amdahl’s (Heartbreaking) Law Computer Science 61C Spring 2018 Wawrzynek and Weaver • Speedup due to enhancement E is Exec time w/o E Speedup w/ E = ---------------------- Exec time w/ E • Suppose that enhancement E accelerates a fraction F (F <1) of the task by a factor S (S>1) and the remainder of the task is unaffected Execution Time w/ E = Execution Time w/o E × [ (1-F) + F/S] Speedup w/ E = 1 / [ (1-F) + F/S ] 2 Big Idea: Amdahl’s Law Computer Science 61C Spring 2018 Wawrzynek and Weaver Speedup = 1 (1 - F) + F Non-speed-up part S Speed-up part Example: the execution time of half of the program can be accelerated by a factor of 2. What is the program speed-up overall? 1 1 = = 1.33 0.5 + 0.5 0.5 + 0.25 2 3 Example #1: Amdahl’s Law Speedup w/ E = 1 / [ (1-F) + F/S ] Computer Science 61C Spring 2018 Wawrzynek and Weaver • Consider an enhancement which runs 20 times faster but which is only usable 25% of the time Speedup w/ E = 1/(.75 + .25/20) = 1.31 • What if its usable only 15% of the time? Speedup w/ E = 1/(.85 + .15/20) = 1.17 • Amdahl’s Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar! • To get a speedup of 90 from 100 processors, the percentage of the original program that could be scalar would have to be 0.1% or less Speedup w/ E = 1/(.001 + .999/100) = 90.99 4 Amdahl’s Law If the portion of Computer Science 61C Spring 2018 the program that Wawrzynek and Weaver can be parallelized is small, then the speedup is limited The non-parallel portion limits the performance 5 Strong and Weak Scaling Computer Science 61C Spring 2018 Wawrzynek and Weaver • To get good speedup on a parallel processor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
    [Show full text]
  • Parallel Generation of Image Layers Constructed by Edge Detection
    International Journal of Applied Engineering Research ISSN 0973-4562 Volume 13, Number 10 (2018) pp. 7323-7332 © Research India Publications. http://www.ripublication.com Speed Up Improvement of Parallel Image Layers Generation Constructed By Edge Detection Using Message Passing Interface Alaa Ismail Elnashar Faculty of Science, Computer Science Department, Minia University, Egypt. Associate professor, Department of Computer Science, College of Computers and Information Technology, Taif University, Saudi Arabia. Abstract A. Elnashar [29] introduced two parallel techniques, NASHT1 and NASHT2 that apply edge detection to produce a set of Several image processing techniques require intensive layers for an input image. The proposed techniques generate computations that consume CPU time to achieve their results. an arbitrary number of image layers in a single parallel run Accelerating these techniques is a desired goal. Parallelization instead of generating a unique layer as in traditional case; this can be used to save the execution time of such techniques. In helps in selecting the layers with high quality edges among this paper, we propose a parallel technique that employs both the generated ones. Each presented parallel technique has two point to point and collective communication patterns to versions based on point to point communication "Send" and improve the speedup of two parallel edge detection techniques collective communication "Bcast" functions. The presented that generate multiple image layers using Message Passing techniques achieved notable relative speedup compared with Interface, MPI. In addition, the performance of the proposed that of sequential ones. technique is compared with that of the two concerned ones. In this paper, we introduce a speed up improvement for both Keywords: Image Processing, Image Segmentation, Parallel NASHT1 and NASHT2.
    [Show full text]
  • CS650 Computer Architecture Lecture 10 Introduction to Multiprocessors
    NJIT Computer Science Dept CS650 Computer Architecture CS650 Computer Architecture Lecture 10 Introduction to Multiprocessors and PC Clustering Andrew Sohn Computer Science Department New Jersey Institute of Technology Lecture 10: Intro to Multiprocessors/Clustering 1/15 12/7/2003 A. Sohn NJIT Computer Science Dept CS650 Computer Architecture Key Issues Run PhotoShop on 1 PC or N PCs Programmability • How to program a bunch of PCs viewed as a single logical machine. Performance Scalability - Speedup • Run PhotoShop on 1 PC (forget the specs of this PC) • Run PhotoShop on N PCs • Will it run faster on N PCs? Speedup = ? Lecture 10: Intro to Multiprocessors/Clustering 2/15 12/7/2003 A. Sohn NJIT Computer Science Dept CS650 Computer Architecture Types of Multiprocessors Key: Data and Instruction Single Instruction Single Data (SISD) • Intel processors, AMD processors Single Instruction Multiple Data (SIMD) • Array processor • Pentium MMX feature Multiple Instruction Single Data (MISD) • Systolic array • Special purpose machines Multiple Instruction Multiple Data (MIMD) • Typical multiprocessors (Sun, SGI, Cray,...) Single Program Multiple Data (SPMD) • Programming model Lecture 10: Intro to Multiprocessors/Clustering 3/15 12/7/2003 A. Sohn NJIT Computer Science Dept CS650 Computer Architecture Shared-Memory Multiprocessor Processor Prcessor Prcessor Interconnection network Main Memory Storage I/O Lecture 10: Intro to Multiprocessors/Clustering 4/15 12/7/2003 A. Sohn NJIT Computer Science Dept CS650 Computer Architecture Distributed-Memory Multiprocessor Processor Processor Processor IO/S MM IO/S MM IO/S MM Interconnection network IO/S MM IO/S MM IO/S MM Processor Processor Processor Lecture 10: Intro to Multiprocessors/Clustering 5/15 12/7/2003 A.
    [Show full text]
  • High-Performance Message Passing Over Generic Ethernet Hardware with Open-MX Brice Goglin
    High-Performance Message Passing over generic Ethernet Hardware with Open-MX Brice Goglin To cite this version: Brice Goglin. High-Performance Message Passing over generic Ethernet Hardware with Open-MX. Parallel Computing, Elsevier, 2011, 37 (2), pp.85-100. 10.1016/j.parco.2010.11.001. inria-00533058 HAL Id: inria-00533058 https://hal.inria.fr/inria-00533058 Submitted on 5 Nov 2010 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. High-Performance Message Passing over generic Ethernet Hardware with Open-MX Brice Goglin INRIA Bordeaux - Sud-Ouest – LaBRI 351 cours de la Lib´eration – F-33405 Talence – France Abstract In the last decade, cluster computing has become the most popular high-performance computing architec- ture. Although numerous technological innovations have been proposed to improve the interconnection of nodes, many clusters still rely on commodity Ethernet hardware to implement message passing within parallel applications. We present Open-MX, an open-source message passing stack over generic Ethernet. It offers the same abilities as the specialized Myrinet Express stack, without requiring dedicated support from the networking hardware. Open-MX works transparently in the most popular MPI implementations through its MX interface compatibility.
    [Show full text]
  • An Investigation of Symmetric Multi-Threading Parallelism for Scientific Applications
    An Investigation of Symmetric Multi-Threading Parallelism for Scientific Applications Installation and Performance Evaluation of ASCI sPPM Frank Castaneda [email protected] Nikola Vouk [email protected] North Carolina State University CSC 591c Spring 2003 May 2, 2003 Dr. Frank Mueller An Investigation of Symmetric Multi-Threading Parallelism for Scientific Applications Introduction Our project is to investigate the use of Symmetric Multi-Threading (SMT), or “Hyper-threading” in Intel parlance, applied to course-grain parallelism in large-scale distributed scientific applications. The processors provide the capability to run two streams of instruction simultaneously to fully utilize all available functional units in the CPU. We are investigating the speedup available when running two threads on a single processor that uses different functional units. The idea we propose is to utilize the hyper-thread for asynchronous communications activity to improve course-grain parallelism. The hypothesis is that there will be little contention for similar processor functional units when splitting the communications work from the computational work, thus allowing better parallelism and better exploiting the hyper-thread technology. We believe with minimal changes to the 2.5 Linux kernel we can achieve 25-50% speedup depending on the amount of communication by utilizing a hyper-thread aware scheduler and a custom communications API. Experiment Setup Software Setup Hardware Setup ?? Custom Linux Kernel 2.5.68 with ?? IBM xSeries 335 Single/Dual Processor Red Hat distribution 2.0 Ghz Xeon ?? Modification of Kernel Scheduler to run processes together ?? Custom Test Code In order to test our code we focus on the following scenarios: ?? A serial execution of communication / computation sections o Executing in serial is used to test the expected back-to-back run-time.
    [Show full text]
  • Computer Systems Architecture
    CS 352H: Computer Systems Architecture Topic 14: Multicores, Multiprocessors, and Clusters University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell Introduction Goal: connecting multiple computers to get higher performance Multiprocessors Scalability, availability, power efficiency Job-level (process-level) parallelism High throughput for independent jobs Parallel processing program Single program run on multiple processors Multicore microprocessors Chips with multiple processors (cores) University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 2 Hardware and Software Hardware Serial: e.g., Pentium 4 Parallel: e.g., quad-core Xeon e5345 Software Sequential: e.g., matrix multiplication Concurrent: e.g., operating system Sequential/concurrent software can run on serial/parallel hardware Challenge: making effective use of parallel hardware University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 3 What We’ve Already Covered §2.11: Parallelism and Instructions Synchronization §3.6: Parallelism and Computer Arithmetic Associativity §4.10: Parallelism and Advanced Instruction-Level Parallelism §5.8: Parallelism and Memory Hierarchies Cache Coherence §6.9: Parallelism and I/O: Redundant Arrays of Inexpensive Disks University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell 4 Parallel Programming Parallel software is the problem Need to get significant performance improvement Otherwise, just use a faster uniprocessor,
    [Show full text]
  • Computer Architecture: Parallel Processing Basics
    Computer Architecture: Parallel Processing Basics Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/9/13 Today What is Parallel Processing? Why? Kinds of Parallel Processing Multiprocessing and Multithreading Measuring success Speedup Amdhal’s Law Bottlenecks to parallelism 2 Concurrent Systems Embedded-Physical Distributed Sensor Claytronics Networks Concurrent Systems Embedded-Physical Distributed Sensor Claytronics Networks Geographically Distributed Power Internet Grid Concurrent Systems Embedded-Physical Distributed Sensor Claytronics Networks Geographically Distributed Power Internet Grid Cloud Computing EC2 Tashi PDL'09 © 2007-9 Goldstein5 Concurrent Systems Embedded-Physical Distributed Sensor Claytronics Networks Geographically Distributed Power Internet Grid Cloud Computing EC2 Tashi Parallel PDL'09 © 2007-9 Goldstein6 Concurrent Systems Physical Geographical Cloud Parallel Geophysical +++ ++ --- --- location Relative +++ +++ + - location Faults ++++ +++ ++++ -- Number of +++ +++ + - Processors + Network varies varies fixed fixed structure Network --- --- + + connectivity 7 Concurrent System Challenge: Programming The old joke: How long does it take to write a parallel program? One Graduate Student Year 8 Parallel Programming Again?? Increased demand (multicore) Increased scale (cloud) Improved compute/communicate Change in Application focus Irregular Recursive data structures PDL'09 © 2007-9 Goldstein9 Why Parallel Computers? Parallelism: Doing multiple things at a time Things: instructions,
    [Show full text]