Instruction Set Architecture for Multimedia Signal Processing
Total Page:16
File Type:pdf, Size:1020Kb
Load more
										Recommended publications
									
								- 
												  RISC-V Vector Extension Webinar IIRISC-V Vector Extension Webinar II August 3th, 2021 Thang Tran, Ph.D. Principal Engineer Webinar II - Agenda • Andes overview • Vector technology background – SIMD/vector concept – Vector processor basic • RISC-V V extension ISA – Basic – CSR • RISC-V V extension ISA – Memory operations – Compute instructions • Sample codes – Matrix multiplication – Loads with RVV versions 0.8 and 1.0 • AndesCore™ NX27V • Summary Copyright© 2020 Andes Technology Corp. 2 Terminology • ACE: Andes Custom Extension • ISA: Instruction Set Architecture • CSR: Control and Status Register • GOPS: Giga Operations Per Second • SEW: Element Width (8-64) • GFLOPS: Giga Floating-Point OPS • ELEN: Largest Element Width (32 or 64) • XRF: Integer register file • XLEN: Scalar register length in bits (64) • FRF: Floating-point register file • FLEN: FP register length in bits (16-64) • VRF: Vector register file • VLEN: Vector register length in bits (128-512) • SIMD: Single Instruction Multiple Data • LMUL: Register grouping multiple (1/8-8) • MMX: Multi Media Extension • EMUL: Effective LMUL • SSE: Streaming SIMD Extension • VLMAX/MVL: Vector Length Max • AVX: Advanced Vector Extension • AVL/VL: Application Vector Length • Configurable: parameters are fixed at built time, i.e. cache size • Extensible: added instructions to ISA includes custom instructions to be added by customer • Standard extension: the reserved codes in the ISA for special purposes, i.e. FP, DSP, … • Programmable: parameters can be dynamically changed in the program Copyright© 2020 Andes Technology Corp. 3 RISC-V V Extension ISA Basic Copyright© 2020 Andes Technology Corp. 4 Vector Register ISA • Vector-Register ISA Definition: − All vector operations are between vector registers (except for load and store).
- 
												  SIMD ExtensionsSIMD Extensions PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Sat, 12 May 2012 17:14:46 UTC Contents Articles SIMD 1 MMX (instruction set) 6 3DNow! 8 Streaming SIMD Extensions 12 SSE2 16 SSE3 18 SSSE3 20 SSE4 22 SSE5 26 Advanced Vector Extensions 28 CVT16 instruction set 31 XOP instruction set 31 References Article Sources and Contributors 33 Image Sources, Licenses and Contributors 34 Article Licenses License 35 SIMD 1 SIMD Single instruction Multiple instruction Single data SISD MISD Multiple data SIMD MIMD Single instruction, multiple data (SIMD), is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously. Thus, such machines exploit data level parallelism. History The first use of SIMD instructions was in vector supercomputers of the early 1970s such as the CDC Star-100 and the Texas Instruments ASC, which could operate on a vector of data with a single instruction. Vector processing was especially popularized by Cray in the 1970s and 1980s. Vector-processing architectures are now considered separate from SIMD machines, based on the fact that vector machines processed the vectors one word at a time through pipelined processors (though still based on a single instruction), whereas modern SIMD machines process all elements of the vector simultaneously.[1] The first era of modern SIMD machines was characterized by massively parallel processing-style supercomputers such as the Thinking Machines CM-1 and CM-2. These machines had many limited-functionality processors that would work in parallel.
- 
												  Optimizing Packed String Matching on AVX2 PlatformOptimizing Packed String Matching on AVX2 Platform M. Akif Aydo˘gmu¸s1,2 and M.O˘guzhan Külekci1 1 Informatics Institute, Istanbul Technical University, Istanbul, Turkey [email protected], [email protected] 2 TUBITAK UME, Gebze-Kocaeli, Turkey Abstract. Exact string matching, searching for all occurrences of given pattern P on a text T , is a fundamental issue in computer science with many applica- tions in natural language processing, speech processing, computational biology, information retrieval, intrusion detection systems, data compression, and etc. Speeding up the pattern matching operations benefiting from the SIMD par- allelism has received attention in the recent literature, where the empirical results on previous studies revealed that SIMD parallelism significantly helps, while the performance may even be expected to get automatically enhanced with the ever increasing size of the SIMD registers. In this paper, we provide variants of the previously proposed EPSM and SSEF algorithms, which are orig- inally implemented on Intel SSE4.2 (Streaming SIMD Extensions 4.2 version with 128-bit registers). We tune the new algorithms according to Intel AVX2 platform (Advanced Vector Extensions 2 with 256-bit registers) and analyze the gain in performance with respect to the increasing length of the SIMD registers. Profiling the new algorithms by using the Intel Vtune Amplifier for detecting performance bottlenecks led us to consider the cache friendliness and shared-memory access issues in the AVX2 platform. We applied cache op- timization techniques to overcome the problems particularly addressing the search algorithms based on filtering. Experimental comparison of the new solutions with the previously known-to- be-fast algorithms on small, medium, and large alphabet text files with diverse pattern lengths showed that the algorithms on AVX2 platform optimized cache obliviously outperforms the previous solutions.
- 
												  SIMD: Data Parallel Execution JERLANGEN REGIONAL COMPUTING CENTER SIMD: Data parallel execution J. Eitzinger PATC, 12.6.2018 Stored Program Computer: Base setting for (int j=0; j<size; j++){ Memory sum = sum + V[j]; } 401d08: f3 0f 58 04 82 addss xmm0,[rdx + rax * 4] Arithmetic Control 401d0d: 48 83 c0 01 add rax,1 Logical 401d11: 39 c7 cmp edi,eax Unit CPU 401d13: 77 f3 ja 401d08 Unit Input Output Architect’s view: Make the common case fast ! Execution and memory . Improvements for relevant software Strategies . What are the technical opportunities? . Increase clock speed . Parallelism . Economical concerns . Specialization . Marketing concerns 2 Data parallel execution units (SIMD) for (int j=0; j<size; j++){ Scalar execution A[j] = B[j] + C[j]; } Register widths • 1 operand = + • 2 operands (SSE) • 4 operands (AVX) • 8 operands (AVX512) 3 Data parallel execution units (SIMD) for (int j=0; j<size; j++){ A[j] = B[j] + C[j]; SIMD execution } Register widths • 1 operand • 2 operands (SSE) = + • 4 operands (AVX) • 8 operands (AVX512) 4 History of short vector SIMD 1995 1996 1997 1998 1999 2001 2008 2010 2011 2012 2013 2016 VIS MDMX MMX 3DNow SSE SSE2 SSE4 VSX AVX IMCI AVX2 AVX512 AltiVec NEON 64 bit 128 bit 256 bit 512 bit Vendors package other ISA features under the SIMD flag: monitor/mwait, NT stores, prefetch ISA, application specific instructions, FMA X86 Power ARM SPARC64 Intel 512bit IBM 128bit A5 upward Fujitsu 256bit AMD 256bit 64bit / 128bit K Computer 128bit 5 Technologies Driving Performance Technology 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001
- 
												  Optimizing Software Performance Using Vector Instructions Invited Talk at Speed-B Conference, October 19–21, 2016, Utrecht, the NetherlandsAgner Fog, Technical University of Denmark Optimizing software performance using vector instructions Invited talk at Speed-B conference, October 19–21, 2016, Utrecht, The Netherlands. Abstract Microprocessor factories have a problem obeying Moore's law because of physical limitations. The answer is increasing parallelism in the form of multiple CPU cores and vector instructions (Single Instruction Multiple Data - SIMD). This is a challenge to software developers who have to adapt to a moving target of new instruction set additions and increasing vector sizes. Most of the software industry is lagging several years behind the available hardware because of these problems. Other challenges are tasks that cannot easily be executed with vector instructions, such as sequential algorithms and lookup tables. The talk will discuss methods for overcoming these problems and utilize the continuously growing power of microprocessors on the market. A few problems relevant to cryptographic software will be covered, and the outlook for the future will be discussed. Find more on these topics at author website: www.agner.org/optimize Moore's law The clock frequency has stopped growing due to physical limitations. Instead, the number of CPU cores and the size of vector registers is growing. Hierarchy of bottlenecks Program installation Program load, JIT compile, DLL's System database Network access Speed → File input/output Graphical user interface RAM access, cache utilization Algorithm Dependency chains CPU pipeline and execution units Remove
- 
												  Effectiveness of the MAX-2 Multimedia Extensions for PA-RISC 2.0 ProcessorsEffectiveness of the MAX-2 Multimedia Extensions for PA-RISC 2.0 Processors Ruby Lee Hewlett-Packard Company HotChips IX Stanford, CA, August 24-26,1997 Outline Introduction PA-RISC MAX-2 features and examples Mix Permute Multiply with Shift&Add Conditionals with Saturation Arith (e.g., Absolute Values) Performance Comparison with / without MAX-2 General-Purpose Workloads will include Increasing Amounts of Media Processing MM a b a b 2 1 2 1 b c b c functionality 5 2 5 2 A B C D 1 2 22 2 2 33 3 4 55 59 A B C D 1 2 A B C D 22 1 2 22 2 2 2 2 33 33 3 4 55 59 3 4 55 59 Distributed Multimedia Real-time Information Access Communications Tool Tool Computation Tool time 1980 1990 2000 Multimedia Extensions for General-Purpose Processors MAX-1 for HP PA-RISC (product Jan '94) VIS for Sun Sparc (H2 '95) MAX-2 for HP PA-RISC (product Mar '96) MMX for Intel x86 (chips Jan '97) MDMX for SGI MIPS-V (tbd) MVI for DEC Alpha (tbd) Ideally, different media streams map onto both the integer and floating-point datapaths of microprocessors images GR: GR: 32x32 video 32x64 ALU SMU FP: graphics FP:16x64 Mem 32x64 audio FMAC PA-RISC 2.0 Processor Datapath Subword Parallelism in a General-Purpose Processor with Multimedia Extensions General Regs. y5 y6 y7 y8 x5 x6 x7 x8 x1 x2 x3 x4 y1 y2 y3 y4 Partitionable Partitionable 64-bit ALU 64-bit ALU 8 ops / cycle Subword Parallel MAX-2 Instructions in PA-RISC 2.0 Parallel Add (modulo or saturation) Parallel Subtract (modulo or saturation) Parallel Shift Right (1,2 or 3 bits) and Add Parallel Shift Left (1,2 or 3 bits) and Add Parallel Average Parallel Shift Right (n bits) Parallel Shift Left (n bits) Mix Permute MAX-2 Leverages Existing Processing Resources FP: INTEGER FLOAT GR: 16x64 General Regs.
- 
												  Pengju Ren@XJTU 2021Computer Architecture Lecture 10 – Vector Machine (Data Level Parallel) Pengju Ren Institute of Artificial Intelligence and Robotics PengjuXi’an Ren@XJTU Jiaotong University 2021 http://gr.xjtu.edu.cn/web/pengjuren SISD、MIMD、SIMD and MIMD Data Streams Single Multiple Instruction Single SISD: Intel Pentium 4 SIMD: SSE instruction of x86 Streams Multiple MISD: No example today MIMD: Intel Core i7 SISD: Single Instruction stream, Single Data Stream MIMD: Multiple Instruction streams, Multiple Data Streams SPMD (SinglePengju Program Ren@XJTU multiple data) GPU 2021 SIMD: Single Instruction stream, Multiple Data Streams MISD: Multiple Instruction streams, Single Data Stream 2 Agenda . Vector Processors . Single Instruction Multiple Data (SIMD) . Instruction Set Extensions (Neon@ARM, AVX@Intel, etc.) Pengju Ren@XJTU 2021 3 Modern SIMD Processors SIMD architectures can exploit significant data-level parallelism for: Matrix-oriented scientific computing Media-oriented image and sound processors Machine Learning Algorithms Most modern CPUs have SIMD architectures Intel SSE and MMX, AVX, AVX2 (Streaming SIMD Extension, Multimedia extensions、Advanced Vector extensions) ARM NEON, MIPS MDMX These architectures include instruction set extensions which allow both sequential and parallel instructions to be executed Some architectures include separate SIMD coprocessors for handling these instructions ARM NEON Pengju Ren@XJTU 2021 Included in Cortex-A8 and Cortex-A9 processors Intel SSE and AVX Introduced in 1999 in the Pentium III processor AVX512 currently used in Xeon Core series 4 Vector Processor Basic idea: Read sets of data elements into “vector registers” Operate on those registers Disperse the results back into memory Registers are controlled by compiler Used to hide memory latency Leverage memory bandwidth Overcoming limitations of ILP: – Dramatic reduction in fetch and decode bandwidth.
- 
												  Implications of Programmable General Purpose Processors for Compression/Encryption Applications 1. IntroductionIn the Proceedings of the 13th International Conference on Application-specific Systems, Architectures and Processors (ASAP 2002), July 2002 Implications of Programmable General Purpose Processors for Compression/Encryption Applications £ Byeong Kil Lee and Lizy Kurian John The University of Texas at Austin g fblee, ljohn @ece.utexas.edu Abstract With the growth of the Internet and mobile communication industry, multimedia applications form a domi- nant computer workload. Media workloads are typically executed on Application Specific Integrated Circuits (ASICs), application specific processors (ASPs) or general purpose processors (GPPs). GPPs are flexible and allow changes in the applications and algorithms better than ASICs and ASPs. However, executing these applications on GPPs is done at a high cost. In this paper, we analyze media compression/decompression algorithms from the perspective of the overhead of executing them on a programmable general purpose processor versus ASPs. We choose nine encode/decode programs from audio, image/video and encryption applications. The instruction mix, memory access and parallelism aspects during the execution of these pro- grams are analyzed. Memory access latency is observed to be the main factor influencing the execution time on general purpose processors. Most of these compression/decompression algorithms involve processing the data through execution phases (e.g. quantization, encoding, etc) and temporary results are stored and retrieved between these phases. A metric called overhead memory-access bandwidth per input/output byte is defined to characterize the temporary memory activity of each application. We observe that more than 90% of the memory accesses made by these programs are temporary data stores and loads arising from the general purpose nature of the execution platform.
- 
												  ALEX BENNÉE KVM FORUM 2017 Created: 2017-10-20 Fri 20:46VECTORING IN ON QEMU'S TCG ENGINE ALEX BENNÉE KVM FORUM 2017 Created: 2017-10-20 Fri 20:46 1 INTRODUCTION Alex Bennée [email protected] stsquad on #qemu Virtualization Developer @ Linaro Projects: QEMU TCG, KVM, ARM 2 . 1 WHY VECTORS? 2 . 2 DATA PARALLELISM: AUDIO EXAMPLE 2 . 3 WHAT ARE VECTORS? 2 . 4 VECTOR REGISTER 2 . 5 VECTOR OPERATION vadd %Vd, %Vn, %Vm 2 . 6 REGISTER LAYOUT 2 . 7 OTHER DATA PARALLELISM EXAMPLES General Matrix Multiplication (GEMM) 3D modelling AI/Machine learning Numerical simulation 2 . 8 HISTORY OF VECTORS IN COMPUTING 3 . 1 NAME THAT MACHINE? 3 . 2 CRAY 1 SPECS Addressing 8 24 bit address Scalar Registers 8 64 bit data Vector Registers 8 (64x64bit elements) Clock Speed 80 Mhz Performance up to 250 MFLOPS* Power 250 kW ref: The Cray-1 Computer System, Richard M Russell, Cray Reasearch Inc, ACM Jan 1978, Vol 21, Number 1 3 . 3 VECTOR PROCESSORS/MACHINES Year Machine Performance 1971 CDC Star-100 100 MFLOPS 1976 Cray 1 250 MFLOPS 1982 Cray X-MP 400 MFLOPS 1996 Fujitsu VP2600 5 GFLOPS 2001 NEC SX-6 8 GFLOPS/node 3 . 4 SX-6 (2001) 8 vector pipelines up to 72 Vector Registers up to 256 64 bit words per register 8 GFLOPS/node 3 . 5 NAME THAT MACHINE? 3 . 6 EARTH SIMULATOR (ES) 640 SX-6 nodes 5120 vector processors 35.86 TFLOPS Fastest Supercomputer 2002-2004 3 . 7 VECTORS COMES TO THE WORKSTATION Intel MMX (multi-media extensions*) 1997, Intel Pentium P5 8x64bit MMX registers 1x64/2x32/4x16/8x8 Integer only 3 .
- 
												  Idisa+: a Portable Model for High Performance Simd ProgrammingIDISA+: A PORTABLE MODEL FOR HIGH PERFORMANCE SIMD PROGRAMMING by Hua Huang B.Eng., Beijing University of Posts and Telecommunications, 2009 a Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in the School of Computing Science Faculty of Applied Science c Hua Huang 2011 SIMON FRASER UNIVERSITY Fall 2011 All rights reserved. However, in accordance with the Copyright Act of Canada, this work may be reproduced without authorization under the conditions for Fair Dealing. Therefore, limited reproduction of this work for the purposes of private study, research, criticism, review and news reporting is likely to be in accordance with the law, particularly if cited appropriately. APPROVAL Name: Hua Huang Degree: Master of Science Title of Thesis: IDISA+: A Portable Model for High Performance SIMD Pro- gramming Examining Committee: Dr. Kay C. Wiese Associate Professor, Computing Science Simon Fraser University Chair Dr. Robert D. Cameron Professor, Computing Science Simon Fraser University Senior Supervisor Dr. Thomas C. Shermer Professor, Computing Science Simon Fraser University Supervisor Dr. Arrvindh Shriraman Assistant Professor, Computing Science Simon Fraser University SFU Examiner Date Approved: ii Declaration of Partial Copyright Licence The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users.
- 
												  Msc THESIS Customizing Vector Instruction Set ArchitecturesComputer Engineering 2007 Mekelweg 4, 2628 CD Delft The Netherlands http://ce.et.tudelft.nl/ MSc THESIS Customizing Vector Instruction Set Architectures C¸at¸alinBogdan CIOBANU Abstract Data Level Parallelism(DLP) can be exploited in order to improve the performance of processors for certain workload types. There are two main application ¯elds that rely on DLP, multimedia and scienti¯c computing. Most of the existing multimedia vector ex- tensions use sub-word parallelism and wide data paths for process- ing independent, mainly integer, values in parallel. On the other hand, classic vector supercomputers rely on e±cient processing of large arrays of floating point numbers typically found in scienti¯c applications. In both cases, the selection of an appropriate instruc- tion set architecture(ISA) is crucial in exploiting the potential DLP to gain high performance. The main objective of this thesis is to develop a methodology for synthesizing customized vector ISAs for various application domains targeting high performance program execution. In order to accomplish this objective, a number of applications from the telecommunication and linear algebra do- mains have been studied, and custom vector instructions sets have CE-MS-2007-04 been synthesized. Three algorithms that compute the shortest paths in a directed graph (Dijkstra, Floyd and Bellman-Ford) have been analyzed, along with the widely used Linpack floating point bench- mark. The framework used to customize the ISAs included the use of the Gnu C Compiler versions 4.1.2 and 2.7.2.3 and the SimpleScalar-3.0d tool set extended to simulate customized vector units. The modi¯cations applied to the simulator include the addition of a vector reg- ister ¯le, vector functional units and speci¯c vector instructions.
- 
												  Counter Control Flow Divergence in Compiled Query PipelinesMake the Most out of Your SIMD Investments: Counter Control Flow Divergence in Compiled Query Pipelines Harald Lang Andreas Kipf Linnea Passing Technical University of Munich Technical University of Munich Technical University of Munich Munich, Germany Munich, Germany Munich, Germany [email protected] [email protected] [email protected] Peter Boncz Thomas Neumann Alfons Kemper Centrum Wiskunde & Informatica Technical University of Munich Technical University of Munich Amsterdam, The Netherlands Munich, Germany Munich, Germany [email protected] [email protected] [email protected] ABSTRACT Control flow graph: SIMD lane utilization: Increasing single instruction multiple data (SIMD) capabilities in out scan σ ⋈ ⋈ ⋈ out scan σ ⋈ ... modern hardware allows for compiling efficient data-parallel query index 7 x x x x x x x pipelines. This means GPU-alike challenges arise: control flow di- traversal 6 x x Index ⋈ vergence causes underutilization of vector-processing units. In this 5 x x x paper, we present efficient algorithms for the AVX-512 architecture 4 x 3 x x x x x σ no to address this issue. These algorithms allow for fine-grained as- x x x x x x x x match SIMD lanes 2 signment of new tuples to idle SIMD lanes. Furthermore, we present 1 x x x strategies for their integration with compiled query pipelines with- scan 0 x x x out introducing inefficient memory materializations. We evaluate t our approach with a high-performance geospatial join query, which Figure 1: During query processing, individual SIMD lanes shows performance improvements of up to 35%.