Avoiding Conversion and Rearrangement Overhead in SIMD Architectures

Asadollah Shahbahrami


DISSERTATION

for the degree of doctor at Delft University of Technology, by authority of the Rector Magnificus, Prof. dr. ir. J.T. Fokkema, chairman of the Board for Doctorates, to be defended in public

on Monday, 15 September 2008 at 15:00

by

Asadollah SHAHBAHRAMI

Master of Science in Computer Engineering-Machine Intelligence, Shiraz University, Shiraz, Iran

born in Kelardasht, Chaloos, Mazandaran, Iran

This dissertation has been approved by the promotor:

Prof. dr. K.G.W. Goossens

Copromotor: Dr. B.H.H. Juurlink

Composition of the doctoral committee:

Rector Magnificus, chairman
Prof. dr. K.G.W. Goossens, Delft University of Technology, promotor
Dr. B.H.H. Juurlink, Delft University of Technology, copromotor
Prof. dr. ir. H.J. Sips, Delft University of Technology
Prof. dr. ir. A.J. van der Veen, Delft University of Technology
Dr. K. Flautner, ARM Ltd., Cambridge
Dr. A. Ramirez, Universitat Politècnica de Catalunya, Barcelona
Prof. dr. ir. G.J.M. Smit, University of Twente
Prof. dr. ir. R.L. Lagendijk, Delft University of Technology, reserve member

My first promotor, Professor Stamatis Vassiliadis†, provided substantial guidance and support for this thesis.

Shahbahrami, Asadollah
Avoiding Conversion and Rearrangement Overhead in SIMD Architectures
Computer Engineering Laboratory, Delft University of Technology
Keywords: SIMD Architectures, Vectorization, SIMD Programming, Multimedia Applications, Cache Optimization.

ISBN 978-90-807957-9-2
Cover page: sketch design of an SIMD unit by the author.

Copyright © 2008 by Asadollah Shahbahrami. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without permission of the author.

Typeset by the author with the LaTeX documentation system.
Author email: [email protected], [email protected]
Printed in The Netherlands

This dissertation is dedicated to all my teachers and family with gratitude and love

Avoiding Conversion and Rearrangement Overhead in SIMD Architectures

Asadollah Shahbahrami

Abstract

In this dissertation, a novel SIMD extension called Modified MMX (MMMX) for multimedia computing is presented. Specifically, the MMX architecture is enhanced with the extended subwords and the matrix register file techniques. The extended subwords technique uses SIMD registers that are wider than the packed format used to store the data: each 64-bit register has 32 extra bits. The extended subwords technique avoids data type conversion overhead and increases parallelism in SIMD architectures. Without it, the subwords of the source SIMD registers must be promoted to larger subwords before they can be processed, and the results must be demoted again before they can be written back to memory, which incurs conversion overhead. The matrix register file technique allows data that is stored consecutively in memory to be loaded into a column of the register file, where a column corresponds to the corresponding subwords of different registers. In other words, this technique provides both row-wise and column-wise access to the media register file. It is a useful approach for matrix operations, which are common in multimedia processing. In addition, in this work, new and general SIMD instructions addressing the multimedia application domain are investigated; an application-specific ISA is not considered. For example, special-purpose instructions are synthesized using a few general-purpose SIMD instructions. The performance of the MMMX architecture is compared to the performance of the MMX/SSE architecture for different multimedia applications and kernels using the sim-outorder simulator of the SimpleScalar toolset. Additionally, three issues related to the efficient implementation of the 2D Discrete Wavelet Transform (DWT) on general-purpose processors, in particular the Pentium 4, are discussed. These are 64K aliasing, cache conflict misses, and SIMD vectorization.
64K aliasing is a phenomenon that occurs on the Pentium 4 and can degrade performance by an order of magnitude. It occurs if two or more data items whose addresses differ by a multiple of 64K need to be cached simultaneously. There are also many cache conflict misses in the implementation of vertical filtering of the DWT if the filter length exceeds the number of cache ways. In this dissertation, techniques are proposed to avoid 64K aliasing and to mitigate cache conflict misses. Furthermore, the performance of the 2D DWT is improved by exploiting the data-level parallelism using the SIMD instructions supported by most general-purpose processors.

Asadollah Shahbahrami
Delft, The Netherlands, 2008


Abbreviations

ASIC (Application-Specific Integrated Circuit): An integrated circuit that implements a specific function.
CISC (Complex Instruction Set Computer): An instruction set architecture in which each instruction is implemented by many microcode operations and takes many clock cycles to execute.
CPU (Central Processing Unit): The unit that executes programs.
DCT (Discrete Cosine Transform): A transform that converts image or video pixels from the time domain to the frequency domain.
DLP (Data-Level Parallelism): A technique to execute a large number of operations with a single instruction.
DMPs (Dedicated Multimedia Processors): Typically custom-designed architectures intended to perform specific multimedia functions.
DSPs (Digital Signal Processors): Processors that have specifically been designed for digital signal processing.
DWT (Discrete Wavelet Transform): A transform that provides a time-frequency representation of image or video signals.
FIR (Finite Impulse Response): FIR filters are digital filters whose impulse response reaches zero in a finite number of steps.
FP (Floating-Point): A numerical representation system for real numbers.
FPGA (Field-Programmable Gate Array): A reprogrammable hardware device that can be used to implement arbitrary circuits.
GPPs (General-Purpose Processors): Processors that are designed to execute a variety of applications. GPPs have a higher degree of flexibility than other processors such as DSPs.
HDTV (High-Definition TeleVision): The new standard in television technology, which enhances the quality of the picture on the screen.
IDCT (Inverse Discrete Cosine Transform): The inverse of the DCT, which converts the transformed image back to the time domain.
ILP (Instruction-Level Parallelism): A technique to execute several instructions in each cycle by exploiting independent instructions.
ISA (Instruction Set Architecture): The set of instructions of either a particular processor or a family of processors.
JPEG (Joint Photographic Experts Group): The committee that developed the JPEG and JPEG2000 standards.

LBWT (Line-Based Wavelet Transform): A traversal technique used to implement the 2D discrete wavelet transform. In this technique, vertical filtering starts as soon as a sufficient number of lines, as determined by the filter length, has been horizontally filtered.
LUT (Look-Up Table): A group of memory cells that contains all possible results of a function for a given set of input values.
MDMX (MIPS Digital Media eXtension): A SIMD extension unit developed for the MIPS family of processors.
MMA (MultiMedia Application): Multimedia applications use and process different media elements including text, graphics, images, audio, 2D and 3D animation, and video.
MMX (Multi-Media Extensions): A multimedia extension, provided on Intel microprocessors, which consists of 64-bit integer SIMD instructions on packed elements.
MMMX (Modified Multi-Media Extensions): The MMMX architecture is MMX enhanced with extended subwords, the matrix register file, and a few general-purpose instructions that are not present in MMX.
MPEG (Motion Picture Experts Group): The committee that developed the MPEG compression standards.
MRF (Matrix Register File): A media register file that provides both row-wise and column-wise access to the registers.
NSP (Native Signal Processing): An enhancement to a GPP to process multimedia data.
RCWT (Row-Column Wavelet Transform): A traversal technique used to implement the 2D discrete wavelet transform. In the RCWT approach, the 2D DWT is divided into two 1D DWTs, namely horizontal and vertical filtering.
RISC (Reduced Instruction Set Computer): The opposite of CISC. RISC represents a design strategy that reduces chip complexity by using simpler instructions and removing the microcode layer of the CISC design.
RUU (Register Update Unit): The RUU determines which instructions should be issued to the functional units for execution.
SAD (Sum-of-Absolute Differences): A similarity measurement that is usually used in motion estimation algorithms to remove temporal redundancies between video frames.
SIMD (Single-Instruction Multiple-Data): The computation concept of executing the same instruction on multiple data elements.

SLP (Subword-Level Parallelism): A form of DLP that packs several small data elements into a media register in order to process them simultaneously.
SPE (Synergistic Processing Element): SPEs are SIMD processors with local stores. The Cell processor contains 8 SPEs.
SPI (Special-Purpose Instruction): Special-purpose instructions are provided in order to accelerate specific functions.
SPU (Synergistic Processing Unit): Each SPE of the Cell processor has an SPU, which includes a 256KB local memory, two SIMD datapaths, and a 128x128-bit register file.
SSD (Sum-of-Squared Differences): A similarity measurement that is usually used in motion estimation algorithms to remove temporal redundancies between video frames.
SSE (Streaming SIMD Extensions): A multimedia extension that provides floating-point SIMD instructions on packed elements.
TLP (Thread-Level Parallelism): A technique to execute multiple threads of an application or multiple applications at once.
VIS (Visual Instruction Set): A multimedia instruction set extension designed by Sun and implemented on the UltraSPARC processor.
VLIW (Very Long Instruction Word): A technique to execute many operations in a single instruction.
VMX (Vector Multimedia eXtension): VMX consists of 162 PowerPC instructions that target multimedia applications; it was co-developed by IBM, Motorola, and Apple.



Contents

Abstract i

Abbreviations iii

List of Figures ix

List of Tables xvi

1 Introduction 1
1.1 Characteristics of Multimedia Applications 2
1.2 Processor Architectures to Support MMAs 3
1.2.1 Dedicated Multimedia Processors (DMPs) 3
1.2.2 GPPs Enhanced with Multimedia Extension 5
1.3 A Comparison Between Processor Architectures for MMAs 7
1.4 An Evaluation of SIMD Architectures Using Multimedia Kernels 10
1.4.1 Methodology and Metrics 10
1.4.2 Analysis of Results 11
1.4.3 Performance Bottlenecks 12
1.5 Dissertation Challenges 13
1.6 Structure of the Thesis 18

2 Background 21
2.1 Data Type Conversion 22
2.1.1 Data Type Conversion Instructions 22
2.1.2 Avoiding Data Type Conversion 24
2.2 Data Rearrangement 24
2.2.1 Explicit Instructions 25
2.2.2 Memory Operations 26
2.2.3 Register File Organization 27
2.3 SIMD Vectorization 29
2.4 Cache Optimization 30
2.5 Conclusions 32

3 MMMX Architecture 33
3.1 Extended Subwords 34
3.2 The Matrix Register File 37
3.3 MMMX Instruction Set Architecture 40
3.3.1 Load/Store Instructions 40
3.3.2 ALU Instructions 41
3.3.3 Multiplication Instructions 43
3.3.4 Differences Between MMMX and MMX Architectures 45
3.3.5 Hardware Cost of the Proposed Techniques 46
3.4 Conclusions 50

4 Performance Evaluation 51
4.1 Benchmarks 52
4.1.1 Multimedia Standards 52
4.1.2 Multimedia Kernels 53
4.2 Algorithm and SIMD Implementation of Kernels 55
4.2.1 Matrix Transpose 55
4.2.2 Vector/Matrix Multiply 56
4.2.3 Repetitive Padding 58
4.2.4 (Inverse) Discrete Cosine Transform 60
4.2.5 Discrete Wavelet Transform 62
4.2.6 Add Block 64
4.2.7 2 × 2 Haar Transform 64
4.2.8 Paeth Prediction 67
4.2.9 Color Space Conversion 68
4.2.10 Similarity Measurements 74
4.3 Evaluation Environment 86
4.4 Performance Evaluation Results 89
4.4.1 Block-level Speedup 90
4.4.2 Image-level Speedup 92
4.4.3 Impact of the Number of Registers 95
4.4.4 Analysis of each Proposed Technique Separately 96
4.4.5 Application-level Speedup 102
4.5 Conclusions 104

5 Optimizing the Discrete Wavelet Transform 105
5.1 2D Discrete Wavelet Transform 106
5.1.1 Row-Column Wavelet Transform 107
5.1.2 Line-Based Wavelet Transform 108
5.2 Issues Related to the 2D DWT on the GPPs 108
5.3 Experimental Setup 110
5.4 Avoiding 64K Aliasing 111
5.5 Cache Optimization 114
5.5.1 Associativity-Conscious Loop Fission Technique 116
5.5.2 Lookahead Technique 116
5.5.3 Performance Results 117
5.6 SIMD Vectorization 120
5.6.1 SIMD Implementations of Convolutional Methods 120
5.6.2 MMX Implementation of the Lifting Scheme 123
5.6.3 Performance Results 126
5.6.4 Discussion 129
5.6.5 MAC Operation, Extended Subwords and the MRF 130
5.6.6 Experimental Results 132
5.7 Conclusions 135

6 Conclusions and Future Work 139
6.1 Summary 140
6.2 Major Contributions 141
6.3 Future Proposed Research Directions 143

Bibliography 145

List of Publications 157

Samenvatting 161

Curriculum Vitae 163

Acknowledgments 165

List of Figures

1.1 Different proposed architectures for processing of MMAs. 3
1.2 A 64-bit partitioned ALU that is divided into four parallel functional units using the subword level parallelism concept. 6
1.3 Instructions needed per cycle to provide 4-way parallelism [91]. 9
1.4 Speedup of the MMX and SSE implementations of the multimedia kernels over the scalar versions on the Pentium 4 processor. 12
1.5 Illustration of where overhead instructions are used in the MMX implementation of the RGB-to-YCbCr kernel and the SSE implementation of the horizontal filtering of the Daub-4 transform. 14
1.6 Matrix transpose of a 4 × 4 block using SSE instructions. 15
1.7 Illustration of the SSE instructions to transpose a 4 × 4 block. 15

2.1 Illustration of the punpcklbw mm0, mm1 instruction. 23
2.2 An example of the extend sign byte halfword instruction that has been provided in the synergistic processor unit of the Cell processor [67]. 23
2.3 Illustration of the packed shuffle word instruction of the SSE architecture. 25
2.4 Illustration of the vector permute instruction of the AltiVec extension and Cell SPE to permute sixteen subwords from the concatenation of registers va and vb by the byte index values in the vc register. 26
2.5 Vector pointers are used to index the coefficients and input entries in the single-instruction multiple disjoint data implementation of the finite impulse response filter. 28
2.6 Different implementations of the vertical filtering of the discrete wavelet transform. 30
2.7 Speedup of the loop-interchanged implementation of vertical filtering over the aggregated implementation for different aggregation factors on the Pentium 4. The image size is 2048 × 2048. 31

3.1 C code of the sum-of-squared differences kernel. 34
3.2 C code of the sum-of-absolute differences kernel. 35
3.3 Different subwords in the media register file of the MMMX architecture. 36
3.4 A matrix register file with 12-bit subwords. For simplicity, write and clock signals have been omitted. 38
3.5 First stage of the LLM algorithm for computing an 8-point DCT. 39
3.6 Loading eight red, eight green, and eight blue values into the matrix register file using the fldc8u12 instruction for little endian. 40
3.7 The fld8s12 instruction loads eight signed bytes and sign-extends them to 12-bit values, while the fld8u12 instruction loads eight unsigned bytes and zero-extends them to 12-bit values. 41
3.8 Reducing eight 12-bit subwords to a single 96-bit sum or 96-bit difference using the instructions fsum{12,24,48} and fdiff{12,24,48}, respectively. 43
3.9 Illustration of the fneg12 3mx0, 3mx1, 11010111 instruction. 43
3.10 Partitioned multiplication using the fmadd12 3mx0, 3mx1 instruction. 44
3.11 Partitioned multiplication using the fmul12h 3mx0, 3mx1 instruction. 44
3.12 (a) A register file with eight 96-bit registers, 2 read ports, and 1 write port; (b) the implementation of two read ports and one write port for a matrix register file with 8 96-bit registers as well as a partitioned ALU for subword parallel processing. 47
3.13 A 96-bit partitioned ALU in the MMMX architecture. 48

4.1 A typical block diagram of an encoder and decoder of the JPEG standard. 53
4.2 A part of the MMX/SSE code to transpose an 8 × 8 block. 56
4.3 Pseudo C code for vector matrix multiply. 57
4.4 The MMX implementation of the inner loop that has been shown in Figure 4.3. 57
4.5 Repetitive padding for VOP boundary blocks. 59
4.6 An example of the horizontal repetitive padding using the algorithm described in [14]. 59
4.7 Data flow graph of 8-pixel DCT using the LLM [96] algorithm. The constant coefficients c, r, and s are provided for fixed-point implementation. 61
4.8 The MMX/SSE code of the first stage of the LLM algorithm for horizontal DCT. 61
4.9 A part of the MMMX implementation of the horizontal DCT algorithm. "X" denotes xi0 to xi7, where 0 ≤ i ≤ 7. 62

xii 4.10 Three level 2D DWT decomposition of an input image using filtering approach. The h and g variables denote the lowpass and highpass fil- ters, respectively. The notation of (↓ 2) refers to downsapling of the output coefficients by two...... 63 4.11 Three different phases in the lifting scheme...... 63 4.12 C implementation of the horizontal filtering of the (5, 3) lifting scheme. 64 4.13 The MMX implementation of inner loop of the add block kernel. . . . 65 4.14 The MMMX implementation of inner loop of the add block kernel. . 65 4.15 2D 2 × 2 Haar transform using two 1D horizontal and vertical Haar transform...... 65 4.16 A part of the MMX code for the 2D 2 × 2 Haar Transform...... 66 4.17 A part of the MMMX code for the 2D 2 × 2 Haar Transform. . . . . 66 4.18 An example of the inverse 2D 2 × 2 Haar transform that uses subbands data to construct a 2 × 2 block...... 67 4.19 Illustration of the a, b, and c pixels according to PNG specification. . 68 4.20 Pseudo-code description of the Paeth predictor...... 68 4.21 A part of the MMX code for the Paeth predictor kernel...... 69 4.22 A part of the MMMX code for the Paeth predictor kernel...... 70 4.23 Mean square error in the implementation of color space conversion for different bit widths and image sizes...... 70 4.24 The MMX instructions needed to convert RGB values from band in- terleaved format to band separated format...... 72 4.25 Partitioned multiplication using the fmul12h instruction...... 73 4.26 A part of the MMX code for the YCbCr-to-RGB color space conversion. 75 4.27 A part of the MMMX code for the YCbCr-to-RGB color space con- version...... 76 4.28 The structure of SAD instruction in multimedia extension...... 77 4.29 The MMX/SSE implementation of the SAD function...... 78 4.30 The MMMX implementation of the SAD function...... 78 4.31 A part of the MMX implementation of the sum-of-absolute differences for similarity measurement of histograms...... 
79 4.32 A part of the MMMX implementation of the sum-of-absolute differ- ences for similarity measurement of histograms...... 80 4.33 The MMX implementation of the sum-of-squared differences function. 81 4.34 The MMMX implementation of the sum-of-squared differences function. 82 4.35 Similar and dissimilar images...... 82 4.36 The MMX/SSE code of the sum-of-absolute difference function using horizontal and vertical interpolation...... 84 4.37 The MMMX implementation of the sum-of-absolute difference func- tion using horizontal and vertical interpolation...... 85 4.38 A part of the MMMX code for implementation of the histogram inter- section...... 86

xiii 4.39 SimpleScalar Portable ISA (PISA) instruction formats...... 86 4.40 Speedup of MMMX over MMX as well as the ratio of committed in- structions (MMX over MMMX) for multimedia kernels, which use extended subwords technique on a single block on the single issue pro- cessor...... 90 4.41 Speedup of MMMX over MMX as well as the ratio of committed in- structions (MMX over MMMX) for multimedia kernels, which use both proposed techniques on a single block on the single issue processor. 91 4.42 Image-level speedup of MMMX over MMX as well as the ratio of committed instructions for the kernels, which use the extended sub- words technique on the single issue processor...... 92 4.43 Image-level speedup of MMMX over MMX as well as the ratio of committed instructions for the kernels, which use both proposed tech- niques on the single issue processor...... 92 4.44 Image-level speedup of MMMX over MMX implementation for differ- ent issue widths using out-of-order execution. The speedup is relative to the number of cycles taken by the MMX implementation when exe- cuted on the processor with the same issue width...... 93 4.45 Ratio of SIMD instructions, scalar, and SIMD ld/st instructions of the MMX implementation to the MMMX implementation for one execu- tion of kernels on a single block that use the extended subwords tech- nique...... 94 4.46 Ratio of SIMD instructions, scalar, and SIMD ld/st instructions of the MMX implementation to the MMMX implementation for one execu- tion of kernels on a single block, which use both extended subwords and the MRF techniques...... 94 4.47 The candidate block of the current frame can be stored in eight media registers to calculate the motion vector at each 16 × 16 window search of the reference frame...... 96 4.48 Speedup of MMMX with 8 registers (MMMX-8) and MMMX with 13 extra registers (MMMX-13) over MMX (8 registers) as well as the ratio of committed instructions (MMX implementation to MMMX) on the single issue processor...... 
97 4.49 The structure of the fshuflh12 mm1, mm0, imm8 instruction. . 98 4.50 The structure of the fshufll12 mm1, mm0, imm8 instruction. . 98 4.51 The structure of the frever12 mm1, mm0 instruction...... 99 4.52 A part of the code for horizontal DCT that has been implemented by MMX enhanced by extended subwords...... 99 4.53 Loading eight consequent stored pixels into a column register by load column instruction for little endian...... 100 4.54 A part of the MMX + MRF implementation of the horizontal DCT algorithm...... 100

xiv 4.55 Speedup of the MMX + ES, MMX + MRF, and MMMX over MMX as well as ratio of committed instructions for an 8 × 8 horizontal DCT on a single issue processor...... 101 4.56 The number of SIMD computation, SIMD overhead, SIMD ld/st, and scalar instructions in four different architectures, MMX, MMX + MRF, MMX + ES, and MMMX for an 8 × 8 horizontal DCT kernel. 101 4.57 Image-level speedup of MMX + ES, MMX + MRF, and MMMX over MMX as well as the ratio of committed instructions for the 2D DCT kernel on a single issue processor...... 102 4.58 Application-level speedup of MMMX over MMX as well as ratio of committed instructions for multimedia applications on the single issue processor...... 104

5.1 Different sub-bands after the first decomposition level. 106
5.2 Sub-bands after the second and third decomposition levels. 107
5.3 The line-based wavelet transform approach processes both rows and columns in a single loop. 108
5.4 C implementation of vertical filtering using the (5, 3) lifting scheme with the loop interchange technique. 109
5.5 Effectiveness of loop interchange on the Pentium 4. This figure depicts the speedup of vertical filtering with interchanged loops over the straightforward implementation, which processes each column entirely before advancing to the next column, for the lifting and Daub-4 transforms. 110
5.6 Slowdown of vertical filtering over horizontal filtering on the P4. 112
5.7 Ratio of the number of cache misses incurred by vertical filtering to the number of cache misses incurred by horizontal filtering for an 8KB 4-way set-associative L1 data cache with a line size of 64 bytes. 112
5.8 C implementation of vertical filtering using the Daub-4 transform. Note that the loops have been interchanged w.r.t. the straightforward implementation. 113
5.9 Speedup of vertical filtering over the reference implementation achieved by loop fission. 114
5.10 Performance improvement achieved by the offsetting technique. 114
5.11 Reuse in vertical filtering. 115
5.12 Associativity-conscious loop splitting. 116
5.13 (a) Reference implementation and (b) associativity-conscious loop splitting technique. 117
5.14 Illustration of the lookahead algorithm for vertical filtering. 118
5.15 (a) Reference implementation and (b) lookahead technique. 118

xv 5.16 Comparison of the speedups obtained by applying offsetting alone to the speedups achieved by applying associativity-conscious loop fission or lookahead in addition to offsetting for the CDF-9/7 transform. . . . 119 5.17 Speedups obtained by applying ACLF and the lookahead technique over the reference implementation of the CDF-9/7 transform on the P3 and Opteron...... 120 5.18 Data flow graph of the vertical filtering of the Daub-4 transform. . . . 122 5.19 Data flow graph of the horizontal filtering of the Daub-4 transform. . 122 5.20 Computing four lowpass values for horizontal filtering using SSE in- structions (Daub-4 transform)...... 123 5.21 One prediction and update stage in the lifting scheme of the (5, 3) lift- ing transform...... 124 5.22 Part of the data flow graph of the forward integer-to-integer lifting transform using the (5, 3) filter bank (Shr = Shift right)...... 125 5.23 MMX instructions needed to rearrange the elements for the (5, 3) lift- ing scheme...... 125 5.24 Performance improvements achieved by applying the offsetting tech- nique to the SIMD implementations of all three transforms and, in ad- dition, the lookahead technique to CDF-9/7...... 127 5.25 Speedup of the SIMD implementations of horizontal filtering over the scalar versions...... 127 5.26 Speedup of the SIMD implementation of vertical filtering over scalar version...... 128 5.27 The structure of the pmaddsd instruction...... 131 5.28 Vectorization of the horizontal filtering of the (5, 3) lifting scheme us- ing the matrix register file and extended subwords techniques. . . . . 132 5.29 A matrix register file with eight 128-bit registers, two read ports, and one write port. Four registers can be accessed in row-wise as well as column-wise. The modified register file is connected to a 128-bit partitioned floating-point ALU for subword parallel processing. . . . 
133 5.30 speedups of the MMMX implementation of the horizontal and vertical filtering of the (5, 3) lifting, SSE-MAC, and SSE-MRF implementa- tions of the horizontal filtering of the Daub-4 transform over MMX and SSE, respectively, as well as the ratio of committed instructions for an image size of 480 × 480 on a single issue processor...... 134

List of Tables

1.1 Different data types that are used by multimedia data [52]. 2
1.2 Distribution of the operations that are needed to implement the multimedia algorithms [52]. 2
1.3 Summary of available multimedia extensions. Sn and Un indicate n-bit signed and unsigned integer packed elements, respectively. Values n without a prefix U or S in the last row indicate operations that work for both signed and unsigned values. (1) Note that 68 of the 144 SSE2 instructions operate on 128-bit packed integers in XMM registers, wide versions of the 64-bit MMX/SSE integer instructions. 7
1.4 Comparison of different architectures for multimedia processing [60, 137]. 8
1.5 Storage capacity and area requirements with a fixed number of bits per register address. The number of registers per register file is 32. "d": overhead per register file; "e": addressing overhead per register [91]. 10
1.6 Parameters of the experimental platform. 11
1.7 The number of instructions needed to transpose an 8 × 8 block on the different multimedia extensions; each element of the block is two bytes. 16

2.1 The MMX instruction set to process 8-bit data type...... 22

3.1 The storage and computational formats of some multimedia kernels. 36 3.2 The load/store instructions of the MMMX architecture...... 41 3.3 The ALU instructions of the MMMX architecture...... 42 3.4 The multiplication instructions of the MMMX architecture...... 43 3.5 The main differences between the MMX/SSE and MMMX ISAs. . . 45

3.6 The area utilization in terms of LUTs and the critical path delays (ns) of the MMX and MMMX architectures, as well as the ratio of the utilized area and the critical path delay of MMMX over MMX, for their register file architecture, partitioned ALU, and the whole hardware system. 49

4.1 Summary of some multimedia standards. 52
4.2 Summary of multimedia kernels. 54
4.3 Processor configuration. 88
4.4 Image-level speedup of the MMX and MMMX implementations for different multimedia kernels, which have been used in the application-level speedup, over the scalar implementations on a single-issue processor. 103

5.1 Parameters of the experimental platforms. 111
5.2 Number of load/store instructions and misaligned accesses in each loop iteration of horizontal filtering in the (5, 3) lifting, Daub-4, and CDF-9/7 transforms. 127
5.3 The dynamic number of instructions of the SIMD implementations of the horizontal and vertical filtering, and their ratio, for different transforms for an N × M image. 129
5.4 Minimum and maximum wavelet coefficients and intermediate results for a 5-level decomposition using the (5, 3) lifting scheme for 7- to 10-bit-per-pixel images. 130
5.5 Number of dynamic instructions of the SIMD implementation of both horizontal and vertical filtering of the (5, 3) lifting and Daub-4 transforms after using the proposed techniques for an N × M image. 135

Chapter 1

Introduction

MultiMedia Applications (MMAs) have become one of the most prominent workloads in computer systems [41, 93]. They are used in many environments ranging from desktop systems to mobile systems. There are a variety of multimedia algorithms for capturing, manipulating, storing, and transmitting multimedia objects such as text, handwritten data, image, video, graphics, and audio objects [53, 116, 18, 93, 88, 36]. Multimedia standards such as MPEG-1/2/4/7, JPEG, JPEG2000, and H.263/4 put challenges on hardware architectures for executing different multimedia algorithms efficiently. This is because MMAs are associated with multiple standards and multiple formats. Consequently, the efficient processing of MMAs is currently one of the main challenges in the media processing field.

Different architectures have been proposed to process MMAs, ranging from fully custom to domain-specific architectures, and to General-Purpose Processors (GPPs) with multimedia extensions. None of them, however, can combine high performance with programmability. The main reason is that the dynamic nature of MMAs is not matched well by the capabilities of the existing architectures. In this thesis, architectural enhancements for GPPs equipped with Single-Instruction Multiple-Data (SIMD) extensions are proposed that provide much higher performance than, and the same programmability as, existing multimedia extensions such as MMX and SSE.

The purpose of this chapter is to provide a brief overview of recent architectural approaches for multimedia processing and to state a number of challenges for GPPs enhanced with multimedia extensions, which will be addressed in this dissertation. Section 1.1 presents an overview of multimedia characteristics. Section 1.2 describes different classifications of processors that have been proposed for processing MMAs, and they are compared to each other in Section 1.3. Section 1.4 evaluates some


Operand size   Usage frequency
8-bit          40%
16-bit         51%
32-bit         9%

Table 1.1: Different data types that are used by multimedia data [52].

Operation type   Percentage
ALU              40%
Load/Store       26-27%
Branch           20%
Shift            10%
Integer Mult.    2%
Floating point   3-4%

Table 1.2: Distribution of the operations needed to implement multimedia algorithms [52].

SIMD architectures using multimedia kernels on the Pentium 4 processor, in order to determine their bottlenecks. Section 1.5 presents dissertation challenges, and finally, Section 1.6 gives an overview of the different chapters of this thesis.

1.1 Characteristics of Multimedia Applications

In this section, the characteristics of MMAs are briefly discussed. MMAs have certain characteristics that distinguish them from other applications such as scientific benchmarks [41, 107, 7, 8, 50, 51]. The most important ones are as follows. First, MMAs typically contain a significant amount of Data-Level Parallelism (DLP). This means that multimedia algorithms perform the same operations on different data items. Second, most of the execution time of MMAs is spent in a few small loops or kernels. Third, multimedia data is usually narrow. For example, image and video pixels can be represented in 8 bits. Table 1.1 depicts the distribution of operand sizes used in MMAs. It shows that most operands are smaller than or equal to 16 bits. Finally, many multimedia algorithms process two-dimensional (2D) data: they process data along the rows as well as along the columns. Additionally, MMAs perform significantly more fixed-point operations than floating-point operations. Table 1.2 depicts the distribution of the operations needed to implement MMAs. As this table shows, the overall usage of an integer ALU, which performs arithmetic operations, compares, logic operations, and moves, is about 40%. Furthermore, MMAs have high spatial locality but little temporal locality. Typically, the processor loads a small amount of data, processes it, and rarely or never reuses the data again. Based on these characteristics, MMAs require different architectures than other applications. In order to understand the limitations of existing architectures for processing MMAs, these architectures are investigated in the next section.
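The DLP and narrow-data characteristics can be made concrete with a small scalar kernel. The sketch below (illustrative C, not code from the thesis) applies the same saturating addition independently to every 8-bit pixel; this per-element independence is exactly what SIMD extensions exploit.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar sketch of a typical multimedia kernel: the same saturating
 * addition is applied independently to every 8-bit pixel, which is
 * exactly the data-level parallelism that SIMD extensions exploit. */
static uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned s = (unsigned)a + (unsigned)b;
    return (uint8_t)(s > 255 ? 255 : s);   /* clamp instead of wrap */
}

void blend_add(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)         /* every iteration is independent */
        dst[i] = add_sat_u8(dst[i], src[i]);
}
```

A SIMD extension performs many of these additions with one instruction (for example, MMX's paddusb does eight at once).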

Figure 1.1: Different proposed architectures for the processing of MMAs, shown in the performance-programmability space (general-purpose processors, programmable media processors, dedicated multimedia processors, reconfigurable architectures, and ASICs, ordered from high programmability towards high performance).

1.2 Processor Architectures to Support MMAs

Many architectures, ranging from application-specific processors to domain-specific processors, have been proposed to process MMAs [107, 8, 83, 80]. A number of programmable Digital Signal Processors (DSPs) have been used since the 1980s. They support specific instructions such as the multiply-accumulate (MAC) instruction, in order to improve both performance and programmability. Hen [60] has given a summary of the characteristics of early as well as recent DSPs, and classified them into four groups based on different implementation methods: DSP chips, DSP cores, multimedia DSPs, and Native Signal Processing (NSP) instruction set processors. Multimedia DSPs are specifically designed for audio/video applications. One example of this group is the TriMedia TM1300 [43, 119]. The NSP processors extend the instruction set of a GPP to process multimedia data. Other researchers [36, 104, 37, 53, 123] have provided different classifications of media processors. All proposed architectures can be divided into two categories: Dedicated Multimedia Processors (DMPs) and GPPs enhanced with multimedia extensions. DMPs can in turn be divided into three groups: Application-Specific Integrated Circuits (ASICs), reconfigurable architectures, and programmable media processors. This classification is depicted in Figure 1.1 in the performance-programmability space. Each architecture is briefly discussed in the following sections.
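The multiply-accumulate (MAC) operation mentioned above is the core of FIR filtering and dot products; a DSP executes the multiply and the add of each iteration as a single instruction. A scalar C sketch of the pattern (function name and operand widths are illustrative, not from the thesis):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of the multiply-accumulate (MAC) pattern that DSP instruction
 * sets execute in one instruction per element: a running sum of
 * products, as in an FIR filter or a dot product. */
int32_t mac_dot(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc = 0;                          /* wide accumulator, as in DSPs */
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * (int32_t)h[i]; /* one MAC per iteration */
    return acc;
}
```

The 16-bit inputs are widened to a 32-bit accumulator so that intermediate products do not overflow, mirroring the wide accumulators found in DSP datapaths.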

1.2.1 Dedicated Multimedia Processors (DMPs)

DMPs are typically custom-designed architectures intended to perform specific multimedia functions such as video and audio compression and decompression, and 2D and 3D graphics applications. DMPs can be divided into ASICs, reconfigurable architectures, and programmable media processors.

An ASIC implementation is a direct mapping of a multimedia algorithm to hardware. The implemented hardware is optimized to execute that specific algorithm. Matching the individual hardware modules to the processing requirements results in area-efficient implementations. The ASIC approach is usually used to accelerate specific multimedia algorithms such as the Discrete Cosine Transform (DCT), quantization, entropy encoding, and motion estimation, while a host processor takes care of the main control.

Reconfigurable architectures [16] offer a compromise between the performance advantages of ASICs and the flexibility of programmable architectures. They are able to directly implement specialized functions in hardware and also contain functional resources that can be modified. However, reconfiguration involves an additional cost in time and power.

Programmable media architectures can be divided into flexible programmable architectures, which provide high flexibility, and adapted programmable architectures, which provide higher efficiency but less flexibility. These architectures can support a complete MMA. Programmable architectures achieve high performance through different mechanisms, such as DLP, Instruction-Level Parallelism (ILP), and Thread-Level Parallelism (TLP), or through adaptation to special algorithm characteristics by implementing specialized instructions and dedicated hardware modules, which results in higher efficiency for a limited application field [44]. Advanced dedicated multimedia processors use Very Long Instruction Word (VLIW) architectural schemes to exploit a high degree of ILP [77].
This is because VLIW architectures have many advantages compared to superscalar processors. For example, VLIW processors employ static instruction scheduling performed at compile time rather than dynamic scheduling performed at run time as in superscalar processors, which requires much more hardware [45]. Furthermore, the hardware does not need to determine which instructions can be issued in parallel. One example of this group is Philips' TM1000 [43, 119]. This architecture contains a VLIW processor as well as a video and audio I/O subsystem. The processor has an instruction set that is optimized for processing audio, video, and graphics.

Other researchers [87, 28, 79, 32, 75] have proposed dedicated programmable architectures for the multimedia domain. Lee et al. [87] have shown that a vector architecture is a cost-effective solution for MMAs because these applications exhibit a large amount of DLP. An ISA extension called Complex Streamed Instructions (CSI), which increases parallelism by processing 2D data streams, has been proposed in [28]. This ISA extension has several advantages. First, CSI does not put an architectural limitation on the number of subwords that are processed in parallel, because CSI processes data streams of arbitrary length. Thus, the number of bits or data elements that are processed in parallel is not visible to the programmer. Second, CSI minimizes the overhead caused by data misalignment by performing alignment in hardware. Third, CSI eliminates loop control instructions, again because it processes 2D streams of arbitrary length. CSI also avoids the instructions needed to put data in a format suitable for SIMD operations; these are called overhead instructions, and include packing/unpacking and data re-shuffling instructions. Matrix registers with accumulators are introduced in the Matrix-Oriented Multimedia (MOM) ISA [32, 33].
The MOM architecture combines traditional pipelined vector processing with subword processing. It relies on a vector register file in which every element contains subwords that are processed in parallel. It supports stride-n access, where consecutive elements are loaded separated by an n-byte gap. Two key features distinguish MOM from CSI. First, MOM is a register-to-register architecture that uses sectioning when the data do not fit into the MOM registers. Second, MOM requires overhead instructions for data conversion. Another related dedicated architecture for processing MMAs is the Imagine processor [75, 103], which has a load/store architecture for 1D streams of data records. Imagine is a stand-alone multimedia coprocessor. The focus of the Imagine project is to develop a programmable architecture for graphics and image/signal processing.

1.2.2 GPPs Enhanced with Multimedia Extensions

In order to increase the performance of MMAs, GPP vendors have extended their ISAs. These ISA extensions use the Subword Level Parallelism (SLP) concept [89]. A subword is a smaller-precision unit of data contained within a word. In SLP, multiple subwords are packed into a word and then the whole word is processed. SLP is used to exploit DLP with existing hardware without sacrificing the general-purpose nature of the processor. SLP provides a very low-cost form of small-scale SIMD parallelism, called microSIMD in [91], in a word-oriented processor. This is because there is no need to replicate the functional units, and the memory port can supply multiple elements at no additional cost. In addition, SLP is a form of vector processing: a register is viewed as a small vector with elements that are smaller than the register size. This requires small data types and wide registers. As mentioned previously, multimedia kernels process small data types, and the registers of GPPs satisfy these requirements. In particular, the double-precision Floating-Point (FP) registers can hold several such elements. The same operation is applied to the different subwords simultaneously. In SLP, a word-wide functional unit is partitioned into parallel subword functional units, with small hardware overhead. As illustrated in Figure 1.2, a 64-bit ALU can be partitioned into four 16-bit ALUs. Such a partitionable ALU allows either four

Figure 1.2: A 64-bit partitioned ALU that is divided into four parallel functional units using the subword level parallelism concept. Each of the two 64-bit operands consists of four 16-bit subwords; the carry chain is broken between the 16-bit ALUs.

16-bit, two 32-bit, or a single 64-bit ALU operation to be performed in a single clock cycle. The overhead is very small since the same datapaths are used in all cases. Furthermore, unlike VLIW and superscalar processors, SLP does not require additional ports to the register file. A processor with two 64-bit partitionable ALUs could support eight parallel 16-bit operations with just a 6-ported (4 read and 2 write ports) register file, while a processor with eight independent 16-bit functional units requires a 24-ported register file.

The first multimedia extensions were Intel's MMX [106, 105], Sun's Visual Instruction Set (VIS) [140], Compaq's Motion Video Instructions (MVI) [10], the MIPS Digital Media eXtension (MDMX) [57, 72], and HP's Multimedia Acceleration eXtension (MAX) [89, 90]. These extensions supported only integer data types and were introduced in the mid-1990s. 3DNow [2] was the first to support floating-point media instructions. It was followed by the Streaming SIMD Extension (SSE) and SSE2 from Intel [111, 139]. Motorola's AltiVec [118, 42] supports integer as well as floating-point media instructions. In addition, high-performance processors also use SIMD processing. An excellent example is the Cell processor [49, 62, 67], developed by a partnership of IBM, Sony, and Toshiba. Cell is a heterogeneous chip multiprocessor consisting of a PowerPC core that controls eight high-performance Synergistic Processing Elements (SPEs). Each SPE has one SIMD computation unit, referred to as a Synergistic Processor Unit (SPU). Each SPU has 128 128-bit registers. SPUs support both integer and floating-point SIMD instructions.

The main differences between these multimedia extensions are the following. First, they organize the internal register file structure differently to accommodate microSIMD operations. Second, they add different multimedia instructions to their ISAs.
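The partitioned-ALU idea of Figure 1.2 can even be emulated in software on a plain word-oriented machine. The following C sketch (a so-called SWAR trick, illustrative and not code from the thesis) adds four 16-bit subwords packed into one 64-bit word while preventing carries from crossing lane boundaries; a hardware partitioned ALU achieves the same effect by breaking the carry chain between the subword ALUs:

```c
#include <stdint.h>

/* SWAR sketch: add four 16-bit subwords packed into a 64-bit word,
 * without letting carries propagate across lane boundaries.
 * Lanes wrap modulo 2^16 (modular addition, no saturation). */
uint64_t add16x4(uint64_t a, uint64_t b)
{
    const uint64_t H = 0x8000800080008000ULL; /* top bit of each lane */
    uint64_t sum = (a & ~H) + (b & ~H);       /* add low 15 bits of each lane;
                                                 carries stop at the cleared bit */
    return sum ^ ((a ^ b) & H);               /* restore each lane's top bit
                                                 without generating a carry-out */
}
```

Because the most significant bit of every lane is masked off before the wide addition, no carry can spill into the neighboring lane; the final XOR recomputes that bit exactly as a per-lane adder would.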
Multimedia instruction sets can be broadly categorized according to the location and geometry of the register file upon which microSIMD instructions operate. The alternatives are reusing the existing integer or floating-point register files, or implementing an entirely separate one. The type of the register file affects its width and therefore the number of packed elements that can be operated on simultaneously (the vector length). Despite the similarities, each approach to subword extensions is unique [72]. Key differences include the amount of additional hardware required, ranging from MAX-2, which reuses the integer registers and execution units and requires virtually no additional execution hardware, to AltiVec, which requires an entirely new execution unit. Table 1.3 summarizes the common and distinguishing features of existing multimedia instruction set extensions [8, 60, 129, 47].

ISA name:                AltiVec/VMX | MAX-1/2 | MDMX | MMX/3DNow | VIS | MMX | SSE | SSE2 | SPU SIMD ISA
Company:                 Motorola/IBM | HP | MIPS | AMD | Sun | Intel | Intel | Intel | IBM/Sony/Toshiba
Instruction set:         PowerPC | PA-RISC2 | MIPS-V | IA32 | SPARC V.9 | IA32 | IA64 | IA64 | -
Processor:               MPC7400 | PA-RISC PA8000 | R1000 | K6-2 | UltraSparc | P2 | P3 | P4 | Cell
Year:                    1999/2002 | 1995 | 1997 | 1999 | 1995 | 1997 | 1999 | 2000 | 2005
Datapath width:          128-bit | 64-bit | 64-bit | 64-bit | 64-bit | 64-bit | 128-bit | 128-bit | 128-bit
Size of register file:   32x128b | (31)/32x64b | 32x64b | 8x64b | 32x64b | 8x64b | 8x128b | 8x128b | 128x128b
Dedicated or shared:     Dedicated | Int. Reg. | FP Reg. | Dedicated | FP Reg. | FP Reg. | Dedicated | Dedicated | Dedicated
Integer 8-bit elements:  16 | - | 8 | 8 | 8 | 8 | 8 | 16 | 16
Integer 16-bit elements: 8 | 4 | 4 | 4 | 4 | 4 | 4 | 8 | 8
Integer 32-bit elements: 4 | - | - | 2 | 2 | 2 | 2 | 4 | 4
Integer 64-bit elements: - | - | - | - | - | - | - | 2 | 2
Shift right/left:        Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Multiply-add:            Yes | No | No | Yes | Yes | Yes | Yes | Yes | Yes
Shift-add:               No | Yes | No | No | No | No | No | No | No
Floating-point:          Yes | No | Yes | Yes | No | No | Yes | Yes | Yes
Single-precision:        4x32 | - | 2x32 | 4x16, 2x32 | - | - | 4x32 | 4x32 | 4x32
Double-precision:        - | - | - | 1x64 | - | - | - | 2x64 | 2x64
Accumulator:             No | No | 1x192b | No | No | No | No | No | No
# of instructions:       162 | 9/8 | 74 | 24 | 121 | 57 | 70 | 144 (1) | 213
# of operands:           3 | 3 | 3-4 | 2 | 3 | 2 | 2 | 2 | 2/3/4
Sum of absolute diff.:   No | No | No | Yes | Yes | No | Yes | Yes | Yes
Modulo add/sub:          8, 16, 32 | 16 | 8, 16 | 8, 16, 32 | 16, 32 | 8, 16, 32, 64 | 8, 16, 32, 64 | 8, 16, 32, 64 | 8, 16, 32, 64
Saturation add/sub:      U8, U16, U32, S8, S16, S32 | U16, S16 | S16 | U8, U16, S8, S16 | No | U8, U16, S8, S16 | U8, U16, S8, S16 | U8, U16, S8, S16 | -

Table 1.3: Summary of available multimedia extensions. Sn and Un indicate n-bit signed and unsigned integer packed elements, respectively. Values n without a prefix U or S in the modulo addition/subtraction row indicate operations that work for both signed and unsigned values. (1) Note that 68 of the 144 SSE2 instructions operate on 128-bit packed integers in XMM registers; these are wide versions of the 64-bit MMX/SSE integer instructions.

1.3 A Comparison Between Processor Architectures for MMAs

In this section, different processor architectures for MMAs are compared based on the metrics of programmability, performance, and cost.

Architecture                      Performance  Flexibility  Power   Cost    Density  Design effort
ASIC                              High         Low          Low     High    Medium   High
Reconfigurable hardware           Medium       High         High    Medium  Medium   Medium
Dedicated media processors        Medium       High         Medium  Medium  Medium   Medium
GPPs with multimedia extensions   Low          High         Medium  Low     High     Low

Table 1.4: Comparison of different architectures for multimedia processing [60, 137].

Various metrics have been developed to compare the quality of different media processor implementations. For example, flexibility has been considered one of the key advantages of media processors, since it allows changes to system functionality at various points in the design life cycle. Table 1.4 compares different solutions for multimedia processing. The ASIC approaches offer the advantages of high performance and low power, but their design and debugging phases involve a significant amount of time. Because the development cost cannot be spread across multiple applications, the cost of ASICs is generally higher than, for example, conventional microprocessor-based solutions. In addition, they are suitable only for specific functions, and future extensions are not possible without redesigning the hardware. Reconfigurable architectures are more flexible than ASIC designs, but their power consumption is high. Dedicated media architectures provide dedicated modules for several multimedia tasks, but they are not suitable for the multiple standards and multiple formats of media applications. They provide higher performance than GPPs enhanced with multimedia extensions, but they have narrow applicability. GPPs equipped with media ISA extensions are more flexible than the other architectures, but their performance is lower. One main reason is that they incur many overhead instructions. For example, Ranganathan et al. [112] have shown that implementations of MPEG/JPEG codecs using the VIS ISA require on average 41% overhead instructions.

In this dissertation, some (micro-)architectural enhancements are proposed to avoid overhead instructions and to exploit more DLP than existing multimedia extensions such as MMX and SSE can. Subword level parallelism is used, a concept that has already been employed by microSIMD extensions.
Exploiting SLP requires mapping the algorithm to the partitioned ALUs in such a way that the maximum number of subwords is executed in parallel, while the time spent on overhead instructions must not negate the speedup achieved by the partitioned ALUs.

SLP is more cost-efficient than other parallel architectures such as Multiple-Instruction Multiple-Data (MIMD), macroSIMD (as opposed to microSIMD), superscalar, and VLIW architectures, for the following reasons. First, an MIMD architecture consists of multiple processors, and each processor can execute a different instruction in each clock cycle. Each processor has its own register file. For example, for four processors, four instructions must be issued for 4-way parallelism. An interconnection network is needed to transfer data between the processors. Second, a macroSIMD architecture has the same datapaths as an MIMD architecture, except that a single instruction is issued to all the processors in a single clock cycle. Third, in a superscalar architecture, the register file is shared between m parallel ALUs. In each clock cycle, at most n different instructions are issued for n-way parallelism, where n ≤ m. Finally, a VLIW architecture is almost the same as a superscalar architecture, except that only a single instruction is issued in each clock cycle, while this single instruction consists of up to n different operations for n parallel ALUs to provide n-way parallelism.

Figure 1.3: Instructions needed per cycle to provide 4-way parallelism [91] (MIMD: four instructions; superscalar: up to four instructions; macroSIMD, VLIW, and microSIMD: one instruction).

Figure 1.3 shows the number of instructions that need to be issued in order to achieve 4-way parallelism in the different parallel architectures. In addition, Table 1.5 shows the approximate area of the register files of the different architectures needed to support 4-way parallelism on 16-bit elements. Each register file has 32 registers. Both the MIMD and macroSIMD architectures have four register files, with 128 registers in total. Their area requirements are proportional to the total number of bits in all four register files, with an overhead of d per register file and an addressing overhead of e per register. The microSIMD architecture can hold the same number of 16-bit operands in one quarter of the number of registers, since these are packed as four 16-bit subwords in one 64-bit register. Hence, it has slightly smaller area requirements for the registers and register files than the MIMD or macroSIMD architectures. In the rest of this thesis, the word "SIMD" is used instead of "microSIMD".
In the next section, the performance of some SIMD architectures such as the MMX and SSE extensions is evaluated using multimedia kernels on the Pentium 4 processor.

Parallel        # of Register   Total # of   Width of    Max. Number of     Approximate Area
Architecture    Files           Registers    Register    16-bit Operands    for all Registers
MIMD            4               128          16-bit      128                F(4*32*16)+4(d+32e)
MacroSIMD       4               128          16-bit      128                F(4*32*16)+4(d+32e)
Superscalar     1               32           16-bit      32                 F(32*16)+d+32e
VLIW            1               32           16-bit      32                 F(32*16)+d+32e
MicroSIMD       1               32           64-bit      128                F(4*32*16)+d+32e

Table 1.5: Storage capacity and area requirements with a fixed number of bits per register address. Each register file contains 32 registers. "d": overhead per register file; "e": addressing overhead per register [91].

1.4 An Evaluation of SIMD Architectures Using Multimedia Kernels

In order to identify the bottlenecks of existing SIMD architectures, some important multimedia kernels have been implemented using MMX and SSE, and their performance was measured on the Pentium 4. The selected kernels are: the sum-of-absolute-differences (SAD) [111], SAD with interpolation, SAD for histogram similarity measurement [39], sum-of-squared differences (SSD) [144], color space conversions (RGB-to-YCbCr and YCbCr-to-RGB) [125], matrix transpose for integers (Transp. (int)) and FP numbers (Transp. (real)) [122], Paeth prediction [114], the 2D DCT, the (5, 3) lifting scheme [135], and Daubechies' transform with four coefficients (Daub-4) [141]. It is important to note that the last three kernels process data in both the horizontal and the vertical direction. This means that these algorithms consist of both horizontal filtering, which processes data along the rows, and vertical filtering, which processes data along the columns. This implies that in order to employ SIMD instructions, the matrix needs to be transposed frequently. All kernels were implemented using the MMX architecture, except Transp. (real) and both the horizontal and vertical filtering of the Daub-4 transform, which were implemented using the SSE architecture. In the following sections, the methodology and metrics and the results obtained on the Pentium 4 processor are discussed. Finally, the performance bottlenecks are determined.
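As a point of reference, the SAD kernel reduces to the following scalar C sketch (the 16 x 16 block size and the stride parameter are illustrative choices, not taken from the thesis); the MMX/SSE psadbw instruction computes eight of these byte differences and their sum in a single instruction:

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar sketch of the sum-of-absolute-differences (SAD) kernel used in
 * motion estimation: compare a 16x16 candidate block against a reference
 * block, accumulating |cur - ref| over all 256 pixels. */
unsigned sad16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sad += (unsigned)abs(cur[x] - ref[x]);
        cur += stride;                 /* advance one image row */
        ref += stride;
    }
    return sad;
}
```

Motion estimation evaluates this kernel for many candidate positions, which is why a special-purpose instruction for it pays off so clearly in Section 1.4.2.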

1.4.1 Methodology and Metrics

Two versions of each kernel were implemented: one in C and one in assembly using MMX and SSE. The different versions of each kernel employ the same algorithm and data types. Each program consists of three parts: reading the input data, performing the computation, and storing the calculated data. Only the computation part was implemented in MMX and SSE, and only the time taken by this part is reported.

The reasons why assembly language has been used for the SIMD programming of the multimedia kernels are the following. First, assembly language is the most effective technique because it can produce the required performance gain. Second, as indicated by Kuroda et al. [83], the efficient programming of processors with multimedia extensions can only be attained if experts tune their software using assembly language, just as in DSP approaches. Third, the goal is to determine the impact of SIMD instructions on the performance, not the performance of intermediate tools. Additionally, multimedia kernels are usually small functions, and implementing them in assembly language is not very difficult.

All C programs were compiled using gcc with optimization level -O2. A 3.0GHz Pentium 4 processor was employed as the experimental platform; its main architectural parameters are summarized in Table 1.6. All programs were executed on a lightly loaded system. The number of cycles was obtained using the IA-32 cycle counter [70]. Cycle counters provide a very precise tool for measuring the time that elapses between two different points in the execution of a program [17, 131]. In order to eliminate the effects of context switching and compulsory cache misses, the K-best measurement scheme and a warmed-up cache were used [17]. That means that the function was executed repeatedly (K times), and the fastest time is reported. Executing the function at least once before starting the measurement minimizes the effects of both instruction and data cache misses.

Processor        Intel Pentium 4
CPU clock speed  3.0GHz
L1 data cache    8 KBytes, 4-way set associative, 64-byte line size
L2 cache         512 KBytes, 8-way set associative, 64-byte line size, on chip

Table 1.6: Parameters of the experimental platform.
The speedup is measured as the ratio of the execution cycle counts of the C version and the MMX/SSE version of the computational part of each kernel; this metric forms the basis of the comparative study in this thesis.

1.4.2 Analysis of Results

Figure 1.4: Speedup of the MMX and SSE implementations of the multimedia kernels over the scalar versions on the Pentium 4 processor.

Figure 1.4 depicts the speedup of the MMX and SSE implementations of the multimedia kernels over the C implementations on the Pentium 4 processor. As can be seen, the speedup of the MMX/SSE implementation of the SAD kernel is the largest, due to the Special-Purpose psadbw Instruction (SPI) [111]. The speedup for the other similarity measurements (SAD with interpolation, SAD for histogram similarity measurement, and SSD) is less than for the SAD kernel, because there are no SPIs for these functions. In addition, the speedup of the RGB-to-YCbCr color space conversion is less than the speedup of the YCbCr-to-RGB conversion, because the former kernel requires more overhead instructions than the latter. Furthermore, the performance improvements of the SIMD implementations of the vertical filtering of the DCT, (5, 3) lifting, and Daub-4 are larger than those of the corresponding horizontal filtering phases, because under the row-major image layout it is easier and more efficient to vectorize vertical filtering than horizontal filtering.

In general, multimedia extensions provide significant performance benefits for multimedia kernels, as shown in Figure 1.4 and also by other researchers [112, 130]. Existing extensions, however, have a number of bottlenecks that limit the performance improvement. In the next section, some of these bottlenecks are determined.

1.4.3 Performance Bottlenecks

SIMD extensions generally provide two kinds of SIMD instructions. The first are the SIMD computational instructions, such as arithmetic instructions. The second are the SIMD overhead instructions, which are necessary for data movement, data type conversion, and data reorganization. The latter instructions are needed to bring data into a form amenable to SIMD processing, and they constitute a large part of SIMD codes. For example, Ranganathan et al. [112] indicated that the SIMD implementations of MPEG/JPEG codecs using the VIS ISA require on average 41% overhead instructions such as packing/unpacking and data re-shuffling. In addition, the dynamic instruction count of the EEMBC consumer benchmarks running on the Philips TriMedia TM32 shows that over 23% of the instructions are data alignment instructions such as pack/merge bytes (16.8%) and pack/merge half-words (6.5%) [61]. The execution of this large number of SIMD overhead instructions decreases the performance and increases the pressure on the fetch and decode stages.
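The pack/unpack overhead can be illustrated in scalar form: before 8-bit pixels can be processed with 16-bit precision they must be unpacked (zero-extended) to words, and the results must be packed back to bytes with saturation, which is what the MMX punpcklbw (with a zero register) and packuswb instructions do for several pixels at a time. A simplified C sketch, not from the thesis (note that packuswb actually treats its input as signed 16-bit values; unsigned is used here for brevity):

```c
#include <stdint.h>

/* Scalar sketch of SIMD overhead operations. These steps perform no
 * "real" computation; they only change the data format before and
 * after the arithmetic stage. */

/* Unpack: zero-extend 8-bit pixels to 16-bit words (cf. punpcklbw
 * against a zeroed register in MMX). */
void unpack_u8_to_u16(const uint8_t *src, uint16_t *dst, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = (uint16_t)src[i];
}

/* Pack: narrow 16-bit words back to bytes with unsigned saturation
 * (cf. packuswb in MMX, which saturates to the range 0..255). */
void pack_u16_to_u8_sat(const uint16_t *src, uint8_t *dst, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = (uint8_t)(src[i] > 255 ? 255 : src[i]);
}
```

In a SIMD implementation, each of these loops becomes one or two instructions per register, but they still occupy issue slots that do no useful arithmetic, which is precisely the overhead quantified above.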

To illustrate where overhead instructions are needed in the SIMD implementations of multimedia kernels, two motivational examples are shown in Figure 1.5. On the left, the MMX implementation of the RGB-to-YCbCr color space conversion is shown, and on the right the SSE implementation of the horizontal filtering of the Daub-4 transform. In addition, the middle of Figure 1.5 shows the different steps in the processing of multimedia data. The main reasons for selecting these kernels are as follows. The image data in the color space conversion is usually interleaved, and SIMD vectorization of kernels that use interleaved data is difficult because multimedia extensions provide access only to contiguous elements. As already mentioned, the Daub-4 transform consists of two 1D transforms, horizontal and vertical filtering, as do the other 2D transforms. To vectorize the horizontal filtering, the matrix needs to be transposed frequently. Transposition, however, takes a significant amount of time. For example, Figure 1.6 shows how to transpose a 4 × 4 block of single-precision floating-point values using SSE instructions. As this figure shows, first the two low-order and the two high-order values of rows 0 and 2, and of rows 1 and 3, are unpacked; the obtained results are then unpacked again. For this operation, eight load/store, eight unpcklps/unpckhps, and four data movement instructions are required. In other words, 20 SIMD instructions are needed to transpose a matrix of size 4 × 4, as depicted in Figure 1.7. As a result, vectorizing such applications efficiently is a challenge for SIMD architectures.
As Figure 1.5 shows, data reordering and data type conversion instructions are used in Steps 3, 4, and 6 of the MMX implementation, after loading the input data and before storing the outputs. In the SSE implementation, the overhead instructions are used in Step 3 to transpose a block. In the MMX code, the number of overhead instructions is 41 in each loop iteration, while the number of SIMD instructions in the processing stage (Step 5) is 78. This means that the number of overhead instructions is significant compared to the number of SIMD instructions in the processing stage. Consequently, it is important to eliminate these instructions, to alleviate their cost, or to overlap them with other SIMD instructions. In addition, 30 instructions in the processing stage of the color space conversion are data movement instructions between registers and memory. This is because there are not enough registers to keep the temporary results and coefficients, so data has to be loaded from or stored to memory frequently: the MMX and SSE architectures have only eight architectural media registers, which is not sufficient to implement the multimedia kernels efficiently.

1.5 Dissertation Challenges

As indicated earlier, many data type conversion and data rearrangement instructions are needed to implement MMAs using existing SIMD architectures. The main reason

[Figure 1.5 body: MMX code of the RGB-to-YCbCr kernel (left), the steps of an SIMD implementation (middle), and SSE code of the Daub-4 horizontal filtering (right). The steps are: 1 initialization of the coefficient constants; 2 load input data; 3 data rearrangement instructions (35 in the MMX code, 16 in the SSE code); 4 unpack to a larger format; 5 process (78 in the MMX code, 14 in the SSE code); 6 pack; 7 store the results; followed by the loop control instructions.]

Figure 1.5: Illustration of where overhead instructions are used in the MMX implementation of the RGB-to-YCbCr kernel and the SSE implementation of the horizontal filtering of the Daub-4 transform.

[Figure 1.6 body: the rows a, b, c, d of the 4 × 4 block are combined pairwise with unpcklps/unpckhps (rows 0 and 2, rows 1 and 3), and the intermediate results are unpacked again to yield the transposed rows a1 b1 c1 d1, a2 b2 c2 d2, a3 b3 c3 d3, and a4 b4 c4 d4.]

Figure 1.6: Matrix transpose of a 4 × 4 block using SSE instructions.

    movaps   xmm0, (blk1)    ; xmm0 = a4 a3 a2 a1
    movaps   xmm1, 16(blk1)  ; xmm1 = b4 b3 b2 b1
    movaps   xmm2, 32(blk1)  ; xmm2 = c4 c3 c2 c1
    movaps   xmm3, 48(blk1)  ; xmm3 = d4 d3 d2 d1
    movaps   xmm4, xmm0      ; xmm4 = a4 a3 a2 a1
    movaps   xmm6, xmm1      ; xmm6 = b4 b3 b2 b1
    unpcklps xmm0, xmm2      ; xmm0 = c2 a2 c1 a1
    unpcklps xmm1, xmm3      ; xmm1 = d2 b2 d1 b1
    movaps   xmm5, xmm0      ; xmm5 = c2 a2 c1 a1
    unpcklps xmm0, xmm1      ; xmm0 = d1 c1 b1 a1
    unpckhps xmm5, xmm1      ; xmm5 = d2 c2 b2 a2
    unpckhps xmm4, xmm2      ; xmm4 = c4 a4 c3 a3
    unpckhps xmm6, xmm3      ; xmm6 = d4 b4 d3 b3
    movaps   xmm7, xmm4      ; xmm7 = c4 a4 c3 a3
    unpcklps xmm4, xmm6      ; xmm4 = d3 c3 b3 a3
    unpckhps xmm7, xmm6      ; xmm7 = d4 c4 b4 a4
    movaps   (blk2), xmm0    ; (blk2)   = d1 c1 b1 a1
    movaps   16(blk2), xmm5  ; 16(blk2) = d2 c2 b2 a2
    movaps   32(blk2), xmm4  ; 32(blk2) = d3 c3 b3 a3
    movaps   48(blk2), xmm7  ; 48(blk2) = d4 c4 b4 a4

Figure 1.7: Illustration of the SSE instructions to transpose a 4 × 4 block.
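The unpack sequence of Figures 1.6 and 1.7 can be modeled in portable scalar code; the helpers below mimic unpcklps/unpckhps on 4-element vectors (a sketch for illustration, not the SSE code itself).

```c
#include <string.h>

/* Scalar models of the SSE unpack instructions on 4-float vectors:
 * unpacklo interleaves the two low elements of a and b,
 * unpackhi the two high elements. */
static void unpacklo(float d[4], const float a[4], const float b[4])
{
    float t[4] = { a[0], b[0], a[1], b[1] };
    memcpy(d, t, sizeof t);
}
static void unpackhi(float d[4], const float a[4], const float b[4])
{
    float t[4] = { a[2], b[2], a[3], b[3] };
    memcpy(d, t, sizeof t);
}

/* The same two-round unpack sequence as Figure 1.7, applied to a
 * row-major 4x4 block with rows a, b, c, d. */
static void transpose4x4(float out[4][4], const float in[4][4])
{
    float r0[4], r1[4], r2[4], r3[4];
    unpacklo(r0, in[0], in[2]);  /* a1 c1 a2 c2 */
    unpacklo(r1, in[1], in[3]);  /* b1 d1 b2 d2 */
    unpackhi(r2, in[0], in[2]);  /* a3 c3 a4 c4 */
    unpackhi(r3, in[1], in[3]);  /* b3 d3 b4 d4 */
    unpacklo(out[0], r0, r1);    /* a1 b1 c1 d1 */
    unpackhi(out[1], r0, r1);    /* a2 b2 c2 d2 */
    unpacklo(out[2], r2, r3);    /* a3 b3 c3 d3 */
    unpackhi(out[3], r2, r3);    /* a4 b4 c4 d4 */
}
```

Counting one "instruction" per helper call plus the loads and stores reproduces the 20-instruction figure given in the text.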

Multimedia Extension    Number of Instructions
VIS                     106
MAX/MAX2                 64
AltiVec                  24
MMX/SSE                  88

Table 1.7: The number of instructions needed to transpose an 8 × 8 block on the different multimedia extensions; each element of the block is two bytes.

for this is that the requirements of MMAs do not match the capabilities of GPPs enhanced with SIMD extensions. This is for the following reasons:

• There is a mismatch between the computational format and the storage format of multimedia data. The precision of the intermediate results is usually larger than the storage format. Consequently, data type conversion instructions such as unpacking are required before operations are performed, and the results also have to be packed before they can be stored back to memory. As a result, performance is lost due to the execution of data type conversion instructions, and fewer subwords can be processed in parallel. These operations are shown in Steps 4 and 6 in Figure 1.5.

• Existing SIMD computational instructions cannot efficiently exploit the DLP of 2D multimedia data. As already mentioned, 2D multimedia algorithms such as the 2D Discrete Wavelet Transform (DWT) and the 2D (I)DCT consist of two 1D transforms called horizontal and vertical filtering. Horizontal filtering processes the rows, while vertical filtering processes the columns. SIMD vectorization of the vertical filtering is straightforward, since the corresponding data of each column are adjacent in memory. Therefore, several columns can be processed without any rearranging of the subwords. For horizontal filtering, on the other hand, corresponding elements of adjacent rows are not contiguous in memory. In order to employ SIMD instructions, data rearrangement instructions are needed to transpose the matrix. This step takes a significant amount of time. For example, transposing an 8 × 8 block of bytes requires 56 MMX/SSE instructions; if the elements are two bytes wide, then 88 instructions are required, as shown in Table 1.7.
This table depicts the number of instructions needed to transpose an 8 × 8 block for the different multimedia extensions. This was shown in Step 3 in Figure 1.5 for the horizontal filtering of the Daub-4 transform.

• Vector instructions of conventional SIMD extensions execute the same operations on multiple data items that are adequately packed in vector registers. Computations, on the other hand, may perform the same operations on multiple interleaved data. SIMD memory architectures typically provide access only to contiguous memory items. This means that multimedia extensions cannot load or

store strided data. Therefore, vectorizing interleaved data requires many rearrangement instructions to allow for parallel computation, which can lead to actual performance degradation. This was shown in Step 3 in Figure 1.5 for the color space conversion.

• Special-purpose instructions such as the SAD instruction have limited usefulness beyond the specific kernels they were designed to accelerate. This has several drawbacks. First, if the SAD becomes obsolete because a different similarity metric is employed, then the SAD SPI is no longer useful. For example, MIPS' MDMX [72] provides no SAD SPI but advocates using the SSD instead. Second, as indicated in [85], the complex CISC-like semantics of SPIs makes automatic code generation difficult. Third, the SAD SPI only supports the packed byte data type. While useful for the SAD kernel used in motion estimation, this precision is not sufficient for multimedia kernels such as motion estimation in the transform domain or for cost functions used in image and video retrieval [94]. Finally, since these instructions process eight 8-bit subwords, they are most useful if the vector length is a multiple of 8. In the H.264 standard, however, variable block sizes, for instance 8 × 4 and 4 × 4, are used [136].

All of the above limitations are critical to an efficient SIMD implementation of multimedia applications. Solving them is the scope of this dissertation. In order to reach this goal, a novel SIMD ISA extension called the Modified MMX (MMMX) is proposed. The MMMX architecture uses the extended subwords and the Matrix Register File (MRF) techniques. Extended subwords use registers that are wider than the packed format used to store the data, which avoids data type conversion instructions. The MRF allows data stored consecutively in memory to be loaded into a column of the register file, where a column consists of corresponding subwords of different registers.
This technique avoids the need for data rearrangement instructions. The proposed architecture is validated by studying the performance of a wide range of multimedia kernels and applications. The performance obtained by the MMMX architecture is compared with that achieved by the MMX/SSE architectures. In addition, it was found that the implementation of the 2D DWT on the Pentium 4 suffers from the following problems.

• 64K aliasing, which occurs when two data blocks whose addresses differ by a multiple of 64K need to be cached simultaneously. This can degrade performance by an order of magnitude. Two code transformation techniques, loop fission and offsetting, are proposed to avoid 64K aliasing.

• Cache conflict misses. Cache performance can be improved by applying loop interchange, but there will still be many conflict misses if the filter length exceeds the cache associativity. Therefore, two code transformation techniques,

associativity-conscious loop fission and lookahead, are proposed to reduce the number of conflict misses. The proposed techniques are implemented on actual machines and their performance is compared to an optimized reference implementation. An overview of how the research challenges have been addressed and how they are presented in this dissertation follows.
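The 64K aliasing condition described above, and the offsetting fix, can be sketched in a few lines (helper names are hypothetical):

```c
#include <stdint.h>

/* Two distinct addresses suffer 64K aliasing on the Pentium 4 when they
 * coincide in their low 16 bits, i.e. they differ by a multiple of 64 KB. */
static int aliases_64k(uintptr_t a, uintptr_t b)
{
    return a != b && ((a ^ b) & 0xFFFF) == 0;
}

/* The "offsetting" transformation shifts one of the two conflicting
 * arrays by a small offset so the low 16 bits no longer coincide. */
static uintptr_t offset_to_avoid_64k(uintptr_t a, uintptr_t b, uintptr_t off)
{
    return aliases_64k(a, b) ? b + off : b;
}
```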

1.6 Structure of the Thesis

This dissertation contains six chapters.

Chapter 2 briefly discusses background information related to data type conversion, data permutation instructions, SIMD vectorization, and cache optimization. In some SIMD architectures, data type conversion instructions are used to convert a smaller data type to a larger one. In addition, this chapter explains different techniques to deal with data rearrangement, such as providing predefined data permutation instructions, designing a separate permutation unit, and changing the organization of the media register file. Furthermore, some techniques for SIMD vectorization and cache optimization that have been proposed by other researchers are described.

Chapter 3 describes the proposed MMMX architecture. First, the extended subwords and matrix register file techniques are discussed in detail. Second, the new SIMD load/store, ALU, and multiplication instructions are presented. Third, the main differences between the MMMX and MMX architectures are explained. Finally, the hardware cost of the MMMX architecture is discussed.

Chapter 4 describes the performance evaluation of the proposed architecture using multimedia kernels and applications. Some important MMAs such as MPEG-2 and JPEG are selected. These applications are profiled in order to find the most time-consuming kernels, and those media kernels are implemented using the MMMX and MMX/SSE architectures. The performance of the MMMX architecture is compared to the MMX/SSE architectures at the kernel, image, and application level.

Chapter 5 discusses the implementation of the 2D DWT on SIMD-enhanced general-purpose processors. First, the different algorithms to implement the 2D DWT are explained. Second, the memory behavior of the vertical filtering is analyzed in order to identify the problems of 64K aliasing and cache conflict misses.
Third, the proposed techniques to avoid 64K aliasing and to alleviate the cache conflict misses are discussed. Finally, the SIMD vectorization of the 2D DWT and its performance improvement are explained.

The thesis ends with Chapter 6, in which the conclusions, the contributions, and future work are described.

Chapter 2

Background

A register file is an array of registers that forms part of the Central Processing Unit (CPU). The register file provides the source operands and stores the calculated results of most instructions. In traditional GPPs, a register holds a single data item, while in GPPs enhanced with multimedia extensions, a register can hold many packed data items and is called a media register. Consequently, a media register file can be viewed as an N × M matrix, where N is the number of registers and M is the maximum number of subwords in each register. A media register file thus has a capacity of N × M × n bits, where n is the number of bits of the smallest subword. For example, in the MMX extension N = 8, M = 8, and n = 8.

The previous chapter has shown that in the SIMD implementations of many multimedia kernels, an n-bit subword is sometimes not sufficient to keep the intermediate results. In that case the n-bit subword has to be unpacked to a 2n-bit subword, which means that data type conversion instructions are needed to unpack a subword to one that is twice as wide. In addition, the previous chapter has discussed that data permutation instructions are needed to reorder data within and between media registers. SIMD architectures therefore provide different SIMD instructions to deal with data type conversion and data permutation.

In this chapter, background information on data type conversion instructions, data rearrangement techniques, SIMD vectorization, and cache optimization techniques is discussed. Section 2.1 describes the data type conversion instructions. Section 2.2 explains different techniques for data reordering. Section 2.3 explains different techniques for SIMD vectorization of multimedia applications. Section 2.4 describes the techniques that are used for cache optimization. Finally, concluding remarks are presented in Section 2.5.


Instruction    Description
paddb          packed add byte; any carry out of the sum is lost.
psubb          packed difference byte; any borrow out of a difference is lost.
pavgb          average packed unsigned byte integers.
pmaxub         packed unsigned byte maximum.
pminub         packed unsigned byte minimum.
paddsb         packed add with signed saturation (−128 ≤ x ≤ 127).
psubsb         packed difference with signed saturation (−128 ≤ x ≤ 127).
paddusb        packed add with unsigned saturation (0 ≤ x ≤ 255).
psubusb        packed difference with unsigned saturation (0 ≤ x ≤ 255).

Table 2.1: The MMX instruction set to process the 8-bit data type.

2.1 Data Type Conversion

This section first explains the different data type conversion instructions that have been provided by some SIMD architectures and then it discusses how those instruc- tions can be avoided by using a larger precision.

2.1.1 Data Type Conversion Instructions

Some multimedia extensions provide 8 × 8-bit SIMD ALU and saturation instructions for 8-way parallelism. Saturation clips the result of an arithmetic operation to the range that can be represented by the data type. In other words, a result that exceeds the range of a data type saturates to the maximum value of the range, while a result below the range saturates to the minimum value. Table 2.1 depicts the MMX instructions that support the byte data type. These instructions support three types of arithmetic operations: first, wraparound operations, which truncate the calculated result to the size of the result register; second, signed saturation, which clips the result to the largest positive or negative byte depending on the result; and finally, unsigned saturation, which clips the result to either the largest unsigned byte or zero.

Image and video data is typically stored as packed 8-bit elements, but intermediate results usually require more than 8-bit precision. As a consequence, most 8-bit SIMD ALU instructions are wasted. If the packed subwords hold values close to the maximum representable in the subword data type, the choice is either to be imprecise by saturating at every stage, or to lose parallelism by unpacking to a larger format. Saturating at every step can produce unexpected results, because saturation is normally intended only for the end of a computation: it is more precise to saturate once at the end than at every step of the algorithm. For instance, when adding the three signed 8-bit values 120 + 48 − 10, using

mm0:  65 87 21 90 56 43 78 23
mm1:   0  0  0  0  0  0  0  0

punpcklbw mm0, mm1

mm0:   0 56  0 43  0 78  0 23

Figure 2.1: Illustration of the punpcklbw mm0, mm1 instruction.

Extend sign byte to halfword: xsbh RT, RA

Byte numbers:       15   14   13   12   11   10    9    8    7    6    5    4    3    2    1    0
Register RA:      0xA2 0x56 0x67 0xF7 0x98 0x45 0xB1 0x23 0x09 0xE3 0xCC 0xD1 0x05 0x00 0xAA 0xFF

Halfword numbers:       7      6      5      4      3      2      1      0
Register RT:       0x0056 0xFFF7 0x0045 0x0023 0xFFE3 0xFFD1 0x0000 0xFFFF

Figure 2.2: An example of the extend sign byte halfword instruction that has been provided in the synergistic processor unit of the Cell processor [67]. signed saturation at every step produces 117 and using signed saturation at the last step produces 127. SIMD architectures support different packing, unpacking, and extending in- structions to convert the different data types to each other. For exam- ple, the MMX/SSE architectures provide packss{wb,dw,wb} and punpck {hbw,hwd,hdq,lbw,lwd,ldq} instructions for data type conversions. The punpckl{bw,wd,dq} (unpack low packed data) instructions interleave the low- order data subwords of the destination and source operands into the destination operand. If the source operand has a value of all zeros, the result is a zero exten- sion of the low order subwords of the destination operand. Figure 2.1 illustrates an example for the punpcklbw instruction. However, these instructions do not support sign extension and the programmer should carefully consider this problem. The AltiVec extension and the ISA of the Cell SPE, on the other hand, provide extend instructions to convert from a smaller data type to a larger data type, which have one source operand and have the capability of sign extension. For instance, the Cell SPE provides three such instructions. The first is extend sign byte halfword that propa- gates the sign of the right byte of the source register to the left byte. The resulting 16-bit signed integer is stored in the destination operand. Figure 2.2 illustrates this instruction. The second is extend sign halfword to word instruction, where the sign of the half- word in the right half of the source operand is extended to the left halfword. The 32-bit signed integer result is stored in the destination operand. The third data type conversion instruction is extend sign word to doubleword. In this instruction the sign 24 CHAPTER 2. 
of the word in the right slot is propagated to the left word, and the resulting 64-bit integer is placed in the destination operand.

In addition, the AltiVec extension supports pack instructions with 8-bit and 16-bit signed and unsigned saturation. These instructions concatenate two source operand registers, producing a single result of sixteen bytes or eight halfwords for the 8- and 16-bit data types, respectively. Additionally, the AltiVec ISA has merge instructions for the byte, halfword, and word data types. For example, the halfword SIMD merge instructions interleave the four low halfwords or four high halfwords of the two source operands and produce a result of eight halfwords.
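Returning to the saturation example above (120 + 48 − 10), a scalar model of paddsb makes the difference concrete, and the two byte-widening variants discussed in this section can be modeled alongside it (a sketch; the helper names are hypothetical):

```c
#include <stdint.h>

/* Scalar model of paddsb: signed 8-bit addition with saturation. */
static int8_t sat_add8(int8_t a, int8_t b)
{
    int16_t s = (int16_t)a + b;   /* exact 9-bit sum */
    return (int8_t)(s > 127 ? 127 : (s < -128 ? -128 : s));
}

/* Zero extension, as obtained from punpcklbw with an all-zero source
 * operand, and sign extension, as performed per slot by the Cell SPE's
 * xsbh instruction. */
static uint16_t zext8(uint8_t x) { return (uint16_t)x; }
static int16_t  sext8(uint8_t x) { return (int16_t)(int8_t)x; }
```

Saturating at every step gives sat_add8(sat_add8(120, 48), −10) = 117, whereas computing the exact sum 158 and clipping once at the end gives 127, exactly the discrepancy described in the text.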

2.1.2 Avoiding Data Type Conversion

Some architectures provide a larger precision to avoid the use of data type conversion instructions. For example, some DSP processors such as the TMS320C64x/C64x+ DSPs [138] and MIPS' MDMX extension have wide accumulators. In the C64x and C64x+ DSPs, two 32-bit registers are used to hold a 40-bit value. MIPS' MDMX extension uses a 192-bit accumulator, which can be partitioned into eight 24-bit values or four 48-bit values. It is mainly used for MAC operations, which are common in many signal processing algorithms.

The ISA of the Cell SPE has 128-bit registers. This allows the use of a computational format of, e.g., 16 bits when the storage format is 8 bits. In fact, the Cell SPE does not provide arithmetic instructions for the packed byte data type, except for SPIs such as average bytes and absolute differences, because, as mentioned in Chapter 1, this data type is not sufficient as a computational format. The Cell SPE provides explicit addition/subtraction instructions for the 16- and 32-bit data types and multiply instructions for the 16-bit data type. In this thesis, however, it is shown that 12 and 24 bits are sufficient for many media kernels and, therefore, the additional four and eight bits are not needed. Slingerland and Smith have also proposed the idea of adding four extra bits to each byte of a register [130], called the fat subwords technique, but they have not evaluated it. Furthermore, as shown in Chapter 1, many 2D media algorithms process data along the rows as well as along the columns, and the fat subwords technique is not sufficient for those kernels.
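The wide-accumulator idea can be sketched in scalar code. A 24-bit accumulator lane (modeled here with int32_t) absorbs many 8 × 8-bit products without any intermediate unpacking or saturation; this is a sketch of the principle, not the MDMX instruction set:

```c
#include <stdint.h>

/* Multiply-accumulate of 8-bit operands into a wide accumulator.
 * A signed 8x8-bit product fits in 15 bits, so a 24-bit accumulator
 * lane can absorb at least 2^9 = 512 such products before it can
 * overflow -- hence no per-step pack/saturate is needed. */
static int32_t mac8(const int8_t *a, const int8_t *b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}
```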

2.2 Data Rearrangement

This section describes different approaches to permute data within a register and across registers.

Subword:   3  2  1  0
mm0:      45 AB 7F 89

pshufw mm1, mm0, 01 00 10 11

mm1:      7F 89 AB 45

Figure 2.3: Illustration of the packed shuffle word instruction of the SSE architecture.
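The selection rule of pshufw can be modeled in a few lines of scalar code: the 2-bit field i of the immediate selects which source word lands in destination word i.

```c
#include <stdint.h>

/* Scalar model of the MMX/SSE pshufw instruction: dst word i receives
 * the source word selected by bits [2i+1:2i] of the immediate. */
static void pshufw_model(uint16_t dst[4], const uint16_t src[4], uint8_t imm)
{
    for (int i = 0; i < 4; i++)
        dst[i] = src[(imm >> (2 * i)) & 3];
}
```

For example, the immediate 0x1B (binary 00 01 10 11) reverses the four words of the source operand.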

Generally, there are three approaches to deal with data rearrangement. First, the data is reordered within the registers by instructions provided in the ISA. Second, the data is reorganized during memory operations. Finally, the data is rearranged using a special register file organization. These approaches are discussed in detail in the following sections.

2.2.1 Explicit Instructions

Some SIMD architectures such as MIPS' MDMX, MMX, and SSE have a set of permutation instructions with limited capabilities. For example, the MMX/SSE pshufw (packed shuffle word) instruction uses an immediate operand to select which of the four words of the source operand is placed in each word of the destination operand. Figure 2.3 illustrates an example of this instruction. MIPS' MDMX uses a predefined set of eight 8-bit and eight 16-bit wide shuffles to implement partial shuffle operations.

The AltiVec extension, the Cell SPE, and the Texas Instruments C64x VLIW DSP [120] provide a separate permutation unit that allows an arbitrary permutation of any subword in one instruction. In the AltiVec extension, this permutation unit is one of four integer pipelines that are fed by the dynamic dispatch unit. In the SPE, this unit is one of the four units located in the odd pipeline. The Texas Instruments C64x VLIW DSP has eight execution units, two of which handle permutations.

The permute unit can generate almost any permutation of data from two input registers. For instance, the AltiVec extension and the Cell SPE use the vperm(vd, va, vb, vc) and shufb(vd, va, vb, vc) (shuffle bytes) instructions, respectively. These instructions take an arbitrary collection of bytes from the two source operands va and vb and shuffle them into the destination operand vd based on the permute vector vc. Figure 2.4 depicts an example of this instruction. Although a separate permutation unit provides more flexibility than predefined permutation instructions, it comes at a significant hardware cost, consuming

vc:   00 18 01 19 02 1A 03 1B 04 1C 05 1D 06 1E 07 1F

va:   23 2A 9E 65 45 EE 1A 2E 34 41 90 2C 4D 7E CD 78   (subwords 00–0F)
vb:   24 2B 9F 66 46 EF 1B 2F 35 42 91 2D 4E 7F CE 79   (subwords 10–1F)

vd:   23 35 2A 42 9E 91 65 2D 45 4E EE 7F 1A CE 2E 79

Figure 2.4: Illustration of the vector permute instruction of the AltiVec extension and Cell SPE to permute sixteen subwords from the concatenation of registers va and vb by the byte index values in the vc register.

chip area and adding complexity. The most important problem with this unit is that it can only access two media registers at a time. This seriously limits data reordering across more than two media registers, which is needed in some applications such as matrix transposition and the 2D media kernels discussed in the previous chapter. In addition, it is costly in terms of the number of registers, because every required data reordering pattern must be kept in a register. It also increases the memory bandwidth requirements, since the rearrangement patterns have to be loaded from memory into registers.

Lee presented a new subword permutation instruction across multiple registers, which can perform all permutations of a 2 × 2 matrix [92]. The Mix instruction in HP's MAX [89] can perform any permutation of the four 16-bit elements within a 64-bit register. In general, in this approach, data permutation instructions are explicitly used to rearrange the subwords within one SIMD register, between two SIMD registers, or across more than two SIMD registers. The execution overhead of these instructions is the main drawback of this approach.
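The byte-indexing rule of Figure 2.4 can be modeled in scalar code: each byte of the pattern vc indexes into the 32-byte concatenation of va and vb (only the low five index bits are significant, as in the real instructions).

```c
#include <stdint.h>

/* Scalar model of AltiVec vperm / Cell SPE shufb: byte i of the result
 * is the byte of the 32-byte concatenation va||vb selected by the low
 * five bits of pattern byte vc[i]. */
static void vperm_model(uint8_t vd[16], const uint8_t va[16],
                        const uint8_t vb[16], const uint8_t vc[16])
{
    for (int i = 0; i < 16; i++) {
        uint8_t idx = vc[i] & 0x1F;          /* 0x00-0x0F: va, 0x10-0x1F: vb */
        vd[i] = (idx < 16) ? va[idx] : vb[idx - 16];
    }
}
```

With the interleaving pattern of Figure 2.4 (00, 18, 01, 19, . . .), this merges the low half of va with the high half of vb.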

2.2.2 Memory Operations

Slingerland and Smith [130] proposed that SIMD architectures implement strided loads and stores to gather non-adjacent data elements. This technique is useful for some multimedia kernels such as color space conversion, because strided memory accesses can eliminate the overhead instructions. The MOM ISA provides gather and scatter instructions to reorder the data. However, this approach has the following drawbacks. First, since gather, scatter, and strided instructions are usually slower than normal load and store instructions, the memory latency is increased, which widens the gap between the processor and the memory hierarchy. Second, loading several strided elements may cause a number of cache misses and reduce the spatial locality. Furthermore, higher bandwidth is required to load many strided data elements simultaneously. In [22] it has been indicated that one reason for the poor memory performance of VIRAM [78] on color space conversion is the strided memory accesses.

ARM's Neon technology [11, 56] treats memory as an Array of Structures (AoS). This means that a load instruction loads subwords stored consecutively in memory into different SIMD registers. For example, the vld3.16 {D0, D1, D2}, [R0] instruction transfers four 3 × 16-bit structures stored in memory as x0, y0, z0, x1, . . . , z3 to the registers D0, D1, and D2, so that D0 contains the values x0, . . . , x3, D1 the values y0, . . . , y3, and D2 the values z0, . . . , z3. This instruction is very useful for the color space conversion kernel, where the stride is 3, but it cannot be used for data rearrangement operations with other strides, such as matrix transposition; for those, other load instructions such as vld4 can be used.
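The deinterleaving effect of the vld3.16 instruction described above can be sketched in scalar code (the function name is a hypothetical model, not an API):

```c
#include <stdint.h>

/* Scalar model of Neon's vld3.16: one structure load deinterleaves an
 * array of x,y,z structures into three separate lanes, corresponding to
 * registers D0, D1, and D2. */
static void vld3_16_model(uint16_t d0[4], uint16_t d1[4], uint16_t d2[4],
                          const uint16_t *mem)
{
    for (int i = 0; i < 4; i++) {
        d0[i] = mem[3 * i];       /* x0..x3 */
        d1[i] = mem[3 * i + 1];   /* y0..y3 */
        d2[i] = mem[3 * i + 2];   /* z0..z3 */
    }
}
```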

2.2.3 Register File Organization

Many researchers [73, 63, 115, 58, 71, 101] have eliminated the data permutation instructions by changing the organization of the media register file. For example, Jung et al. [73] have proposed a register file organization that provides both row- and column-wise accesses. Hsieh et al. [63] have discussed a Transpose Switch Matrix Memory (TSMM). The TSMM can be interpreted as 4 columns of 4 16-bit registers, where each column corresponds to a particular processing element. This architecture supports normal access as well as transposed reads.

Ronner et al. [115] have proposed a highly parallel DSP (HiPAR-DSP) for image and video processing. Each data path of the HiPAR-DSP architecture has access to a shared memory with regular matrix access patterns, a matrix-memory. The matrix-memory is accessed using 2D addresses of the form (h, v, hd, vd), where (h, v) is the position of the upper-left requested data element and hd and vd are the horizontal and vertical distances between adjacent matrix elements.

Hanounik et al. [58] have proposed the use of diagonal registers to execute matrix transposition efficiently. In their technique, in addition to accessing the data in each row, as is done in a traditional vector register file, the data along the diagonal direction can also be accessed. They introduce three kinds of registers: row registers, diagonal-down registers, and diagonal-up registers. The diagonal-down registers run from the top-left corner of the vector register file to the bottom-right corner, and the diagonal-up registers from the bottom-left corner to the top-right corner.
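A register file that supports both row- and column-wise access, in the spirit of Jung et al.'s organization (and of the MRF proposed later in this thesis), can be modeled as follows; the sizes follow the 8 × 8 media register file mentioned at the start of this chapter, and the names are hypothetical.

```c
#include <stdint.h>

#define NREGS 8   /* N: number of registers */
#define NSUB  8   /* M: subwords per register */

/* Register file readable by row (a normal media register) and by
 * column (corresponding subwords of all registers).  Writing 8 rows
 * and reading 8 columns transposes a block with no unpack instructions. */
typedef struct { uint16_t r[NREGS][NSUB]; } rf_t;

static void write_row(rf_t *rf, int i, const uint16_t v[NSUB])
{
    for (int j = 0; j < NSUB; j++) rf->r[i][j] = v[j];
}

static void read_col(const rf_t *rf, int j, uint16_t v[NREGS])
{
    for (int i = 0; i < NREGS; i++) v[i] = rf->r[i][j];
}
```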

[Figure 2.5 body: vector pointer Vp0 indexes the coefficients coef(0) . . . coef(3) and vector pointer Vp1 indexes the input entries input(n) . . . input(n−3) in the vector element file; the indexed values feed four parallel MAC units.]

Figure 2.5: Vector pointers are used to index the coefficients and input entries in the single-instruction multiple disjoint data implementation of the finite impulse response filter.

Jayasena et al. [71] have proposed a stream register file with indexed access. They have shown that arbitrary access to a stream register file provides more temporal locality and reduces data replication compared to a sequential stream register file.

A different approach to eliminating data permutation instructions, named Single-Instruction Multiple disjoint Data (SIMdD), has been proposed in the eLite DSP architecture [40, 100, 66, 101]. Instead of a vector register file, the eLite DSP employs a large scalar register file, the vector element file (VEF). The elements in the VEF are addressed by four indices contained in a vector pointer register; in other words, vectors are dynamically composed. While very flexible, this approach requires four read ports to the VEF and can process at most four values in parallel; to process more, more read ports are required. The eLite DSP also has vector accumulator registers and a vector accumulator unit. For example, in the case of the Finite Impulse Response (FIR) filter, two vector pointers are used. All elements of one vector pointer point to the same coefficient in the VEF and are incremented by one to address the next coefficient. All elements of the other vector pointer point to consecutive input entries in the VEF and are incremented by one to address the next input data. Figure 2.5 illustrates this algorithm: Vp0 indexes the coefficient coef[0] and Vp1 indexes the input data, and the indexed coefficients and input values are the inputs of the vector MAC operations.

This method incurs overhead both in hardware and in software: in hardware, vector pointer registers and a vector pointer unit are needed; in software, the programmer or the compiler has to use initialization instructions to load the two source vector pointers and the destination vector pointer.
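The FIR scheme of Figure 2.5 can be sketched in scalar code: per tap, all four lanes share one coefficient (Vp0) while each lane reads a consecutive input entry (Vp1). Names and the four-lane width are modeling assumptions, not the eLite ISA.

```c
#include <stdint.h>

/* Compute four FIR outputs y[n..n+3] of a K-tap filter, mimicking the
 * two vector pointers of Figure 2.5: in tap iteration k, all lanes use
 * coef[k] (Vp0) and lane i uses input x[n + i - k] (Vp1).
 * Requires n >= K - 1 so that all input indices are valid. */
static void fir4(const int16_t *coef, int K, const int16_t *x,
                 int n, int32_t y[4])
{
    int32_t acc[4] = {0, 0, 0, 0};     /* vector accumulator registers */
    for (int k = 0; k < K; k++)
        for (int i = 0; i < 4; i++)    /* four parallel MACs */
            acc[i] += (int32_t)coef[k] * x[n + i - k];
    for (int i = 0; i < 4; i++) y[i] = acc[i];
}
```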

2.3 SIMD Vectorization

This section describes different techniques to vectorize MMAs efficiently.

Several studies have recently been performed to minimize the number of permutation instructions generated by compilers [113, 102, 81]. Kudriavtsev et al. [81] have proposed an algorithm to generate permutation instructions without using intrinsics or built-in functions. In their technique, the memory operations are grouped into SIMD instructions based on their effective addresses, and other operations are grouped starting from the memory operation groups. The number of permutation instructions is optimized with integer linear programming. Nuzman et al. [102] have extended a classic loop-based vectorizer to generate efficient permutation instructions for interleaved data whose strides are powers of 2. Ren et al. [113] have used an algorithm to optimize the generation of permutation instructions. Their optimization technique reduces the number of data rearrangement instructions by propagating permutations across statements and merging consecutive permutations whenever possible.

Chaver et al. [25, 24] have used automatic vectorization to implement the Cohen, Daubechies and Feauveau 9/7 filter [31] (CDF-9/7) using SSE instructions. They used the single-precision floating-point format for both image pixels and wavelet coefficients. The Intel compiler, however, can only vectorize simple loops, and therefore some manual code modifications had to be performed. Furthermore, only horizontal filtering could be automatically vectorized (they assumed column-major order). In addition, they focused on the memory hierarchy and considered several techniques such as tiling to improve spatial and temporal locality. For example, they combined aggregation with a line-based approach for their SIMD implementation. The aggregation technique filters a number of adjacent columns consecutively before moving to the next row. The line-based algorithm uses a single loop to process both rows and columns together.
Bernabé et al. [15] have implemented the Daub-4 transform using the SSE instructions in order to reduce the execution time of the 3D wavelet transform. They as well as other researchers [46, 23] have shown that vertical filtering of the DWT requires more time than horizontal filtering because vertical filtering lacks spatial locality. However, as shown in the first chapter, in order to vectorize horizontal filtering the subwords in a media register have to be rearranged, which incurs significant overhead.

Kutil [84] has implemented the (9, 7) lifting scheme using built-in SSE functions. He proposed a single-loop approach to SIMD vectorization. In this approach, horizontal and vertical filtering are combined into a single loop. This is called line-based computation in [30] and pipeline computation in [24], where it has been used to vectorize the CDF-9/7 transform. The single-loop approach requires a buffer whose size is equal to 16 rows of data. If this buffer does not fit in the cache, the temporal locality will be reduced.

[Figure 2.6 sketches the three traversal orders for an aggregation factor of 3: (a) normal, (b) loop interchange, and (c) aggregation.]

Figure 2.6: Different implementations of the vertical filtering of the discrete wavelet transform.
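As a rough illustration of the single-loop (line-based) idea, the sketch below interleaves horizontal and vertical filtering in one loop over the rows of the image. The identity "horizontal filter", the 2-tap vertical sum, the buffer of F rows, and all names are illustrative stand-ins chosen for this sketch, not the actual CDF-9/7 lifting steps or Kutil's implementation.

```c
#include <assert.h>

#define W 8   /* image width (illustrative) */
#define H 8   /* image height (illustrative) */
#define F 2   /* rows needed before vertical filtering can start */

/* Toy stand-in for the horizontal filter step (identity "filter"). */
static void filter_row(const int *src, int *dst) {
    for (int j = 0; j < W; j++)
        dst[j] = src[j];
}

/* Single loop over rows: each row is filtered horizontally into a rolling
 * buffer, and as soon as F rows are available one row of vertical output
 * (here a 2-tap sum) is produced. Only F rows are buffered, not the image. */
static void dwt_line_based(int in[H][W], int out[H - F + 1][W]) {
    int buf[F][W];                      /* rolling buffer of F filtered rows */
    for (int i = 0; i < H; i++) {
        filter_row(in[i], buf[i % F]);  /* horizontal filtering */
        if (i >= F - 1)                 /* enough rows buffered? */
            for (int j = 0; j < W; j++) /* vertical filtering (2-tap sum) */
                out[i - F + 1][j] = buf[0][j] + buf[1][j];
    }
}
```

The point of the structure is that vertical filtering starts long before horizontal filtering finishes, so only a small row buffer, rather than the whole intermediate image, has to stay cache-resident.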

2.4 Cache Optimization

This section describes different techniques to optimize the cache performance of the 2D DWT.

Meerwald et al. [99] have shown that a straightforward implementation of vertical filtering of the DWT, which processes the elements along the columns, can generate many cache misses, and proposed two techniques, row extension and aggregation, to avoid this problem. Row extension adds dummy elements to each row so that the image width is no longer a power of two but co-prime with the number of cache sets. According to [99], a disadvantage of this method is that the final coded bitstream is changed. As mentioned previously, the aggregation approach processes a number of adjacent columns consecutively before moving to the next row. The number of columns filtered consecutively is called the "aggregation factor". If the aggregation factor is equal to the image width, aggregation is identical to loop interchange, which is a well-known compiler technique. Figure 2.6 illustrates the normal straightforward, the loop-interchanged, and the aggregated implementations of vertical filtering.

A potential advantage of aggregation over loop interchange is that it can exploit the reuse between the rows of input values needed to compute consecutive rows of output values, while loop interchange cannot for large images. To investigate this, the aggregation technique was implemented. Figure 2.7 depicts the speedup of loop interchange over aggregation for different aggregation factors on the Pentium 4.

Figure 2.7: Speedup of the loop-interchanged implementation of vertical filtering over the aggregated implementation for different aggregation factors on the Pentium 4. The image size is 2048 × 2048.

Contrary to the initial expectations, aggregation performs worse than loop interchange. This is due to three reasons. First, aggregation incurs more loop overhead than loop interchange, since it consists of three loop nests instead of two. Second, on a cache miss the P4 prefetches the next block; loop interchange takes advantage of this, but aggregation does not when it reaches the end of a group of columns. Third, cache performance is not critical in these scalar implementations.

Chatterjee and Brooks [21] proposed two optimizations: strip-mining (which is identical to aggregation) and recursive data layout. The second optimization modifies the layout of the image data so that each sub-band is stored contiguously. This increases the locality for subsequent decomposition levels, but only the execution time of the first level is reported. Furthermore, the first decomposition level takes more time than all subsequent decomposition levels together.

Chaver et al. [24] combined aggregation with a line-based approach [30], which starts vertical filtering as soon as a sufficient number of rows (determined by the filter length) has been filtered horizontally. This approach reduces the amount of memory required. In addition, they considered different layouts. Komi et al. [76] as well as Lee et al. [94] proposed block-based approaches to improve the cache efficiency of the 2D DWT. In [76], equations are presented that allow to find the optimal block size, assuming a fully associative data cache. In [94], the block size is equal to one way of the two-way set-associative L1 data cache.

Andreopoulos et al. [3] identified three categories of DWT implementations based on the order in which the transform coefficients are produced: strictly breadth-first (SBF), roughly depth-first (RDF), and strictly depth-first (SDF). SBF implementations filter all rows horizontally before filtering all columns vertically. RDF implementations interleave periods of horizontal filtering with periods of vertical filtering; the line-based and block-based approaches belong to this category. SDF corresponds to RDF with a minimal interleaving period. Andreopoulos et al. have shown that RDF implementations incur fewer misses than SBF implementations.
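The three traversal orders compared above can be sketched in scalar C. The 2-tap vertical sum stands in for the real wavelet filter, and ROWS, COLS, and AGG are illustrative values chosen for this sketch; all three variants compute the same result and differ only in the order in which memory is touched.

```c
#include <assert.h>

#define ROWS 8
#define COLS 8
#define AGG  3   /* aggregation factor (illustrative) */

/* (a) Normal: walk down each column in turn (column-major traversal). */
static void vertical_normal(int in[ROWS][COLS], int out[ROWS - 1][COLS]) {
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS - 1; i++)
            out[i][j] = in[i][j] + in[i + 1][j];
}

/* (b) Loop interchange: the same computation, row-major traversal. */
static void vertical_interchange(int in[ROWS][COLS], int out[ROWS - 1][COLS]) {
    for (int i = 0; i < ROWS - 1; i++)
        for (int j = 0; j < COLS; j++)
            out[i][j] = in[i][j] + in[i + 1][j];
}

/* (c) Aggregation: process AGG adjacent columns row by row before moving
 * to the next group of columns -- three loop nests instead of two. */
static void vertical_aggregated(int in[ROWS][COLS], int out[ROWS - 1][COLS]) {
    for (int j0 = 0; j0 < COLS; j0 += AGG)
        for (int i = 0; i < ROWS - 1; i++)
            for (int j = j0; j < j0 + AGG && j < COLS; j++)
                out[i][j] = in[i][j] + in[i + 1][j];
}
```

The extra (third) loop nest in the aggregated variant is the loop overhead mentioned in the text; with AGG equal to COLS it degenerates into the loop-interchanged form.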

2.5 Conclusions

This chapter presented the background information on data type conversion and data rearrangement instructions. Some multimedia extensions provide different instructions to convert from a smaller data type to a larger data type and vice versa. SIMD architectures use different techniques for data permutation. Generally, there are three approaches to data reordering. First, explicit instructions are employed to reorder the data within and between SIMD registers. Some SIMD architectures provide a set of pre-defined instructions with limited capabilities, while others use a separate permutation unit to allow an arbitrary permutation of any subword. Second, data rearrangement is performed during memory operations using gather and scatter instructions. In these methods the media register file is only accessible in the horizontal direction. Finally, data permutation is performed by the organization of the media register file; for example, the media register file is accessible in both horizontal and vertical directions.

In addition, this chapter described some SIMD vectorization and cache optimization techniques that have been used in implementations of multimedia kernels, for example, in the 2D discrete wavelet transform.

In the next chapter, the MMMX architecture is presented. The MMMX architecture is an SIMD extension for multimedia domains. It focuses on maintaining programmability and accelerating MMAs by exploiting DLP. The MMMX architecture eliminates the data type conversion and data rearrangement instructions. Its instructions are applicable in multiple domains; it is not an application-specific ISA.

3 MMMX Architecture

In previous chapters, it was discussed that the performance of multimedia applications can be accelerated by exploiting DLP in a programmable SIMD architecture. This is because media applications have been changing, which promotes the use of programmable processors instead of custom ASICs or highly specialized application-specific processors. Some multimedia kernels were usually supported by dedicated ASIC hardware. To avoid the added cost and complexity of these dedicated hardware units, this chapter focuses on maintaining programmability while increasing performance using a novel SIMD extension. In addition, new and general SIMD instructions addressing the multimedia application domain are investigated. An ISA that is application specific is not considered; instead, special-purpose instructions are synthesized using a few general-purpose SIMD instructions.

This chapter describes a novel SIMD ISA extension called Modified MMX (MMMX, pronounced triple-MX). The MMMX extension is multimedia oriented. It is based on two micro-architectural techniques: extended subwords and the Matrix Register File (MRF). Extended subwords use registers that are wider than the packed format used to store the data. The MRF allows data that is stored consecutively in memory to be loaded into a column of the register file, where a column corresponds to the corresponding subwords of different registers. The MRF is useful for the matrix operations that are common in multimedia processing.

The MMMX architecture is MMX enhanced with extended subwords, the MRF, and a few general-purpose instructions that are not present in MMX. It is important to note that although the MMX/SSE integer extension has been enhanced with the proposed techniques, the extended subwords and MRF techniques are general and can be applied to basically any SIMD ISA extension. The main reason why MMX was


unsigned char blk1[16][16], blk2[16][16];
int ssd = 0;
for (i = 0; i < 16; i++)
    for (j = 0; j < 16; j++)
        ssd += (blk1[i][j] - blk2[i][j]) * (blk1[i][j] - blk2[i][j]);

Figure 3.1: C code of the sum-of-squared-differences kernel.

selected is that it is representative of SIMD extensions and that the author is familiar with its instruction set.

The remainder of the chapter is organized as follows. Extended subwords and the MRF are discussed in Section 3.1 and Section 3.2, respectively. The MMMX instructions are described in Section 3.3, and concluding remarks are presented in Section 3.4.

3.1 Extended Subwords

Image, video, and audio data are usually small 8- or 16-bit integers, while computations on these small data types often require larger data types. Consider, for example, the code depicted in Figure 3.1. This code computes the sum-of-squared differences between two 16 × 16 blocks. The difference between blk1[i][j] and blk2[i][j] is a 9-bit value between −255 and 255, and the sum ∑_{i=0}^{15} ∑_{j=0}^{15} (blk1[i][j] − blk2[i][j])² fits in neither an 8- nor a 16-bit subword. This is because, as shown in Equation (3.1), a 24-bit value is needed for the final result.

∑_{i=0}^{15} ∑_{j=0}^{15} 255² < (2⁸)³ = 2²⁴        (3.1)

Some architectures provide an absolute-difference operation for such computations and therefore do not need a larger format for their intermediate results. For instance, in the discussed example, instead of calculating blk1[i][j] - blk2[i][j], which needs a precision of 9 bits, |blk1[i][j] - blk2[i][j]| can be calculated. This means that for two 8-bit values x and y, x − y needs 9 bits but |x − y| can be represented in 8 bits. This concept is also used in the special-purpose psadbw instruction provided in the SSE extension. The C code of the sum-of-absolute-differences function is depicted in Figure 3.2. This code computes the sum-of-absolute differences between two 16 × 16 blocks. Since blk1[i][j] - blk2[i][j] is a 9-bit value, eight of these intermediate results do not fit in a single 64-bit register, while the SSE extension uses the absolute-difference

unsigned char blk1[16][16], blk2[16][16];
int sad = 0;
short diff;
for (i = 0; i < 16; i++)
    for (j = 0; j < 16; j++) {
        diff = blk1[i][j] - blk2[i][j];
        if (diff < 0) diff = -diff;
        sad += diff;
    }

Figure 3.2: C code of the sum-of-absolute-differences kernel.

operation to keep the carry bit internally. As shown in Equation (3.2), a 16-bit value is needed for the final result.

∑_{i=0}^{15} ∑_{j=0}^{15} 255 < (2⁸)² = 2¹⁶        (3.2)
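The bounds (3.1) and (3.2) can be checked with a few lines of arithmetic; the constant names below are introduced only for this check.

```c
#include <assert.h>

/* Worst-case magnitudes for a 16 x 16 block of 8-bit pixels. */
enum {
    SSD_MAX = 16 * 16 * 255 * 255, /* largest possible sum of squared differences */
    SAD_MAX = 16 * 16 * 255        /* largest possible sum of absolute differences */
};
```

SSD_MAX is 16,646,400, which exceeds 2¹⁶ = 65,536 but stays below 2²⁴ = 16,777,216, so 24 bits suffice for the SSD; SAD_MAX is 65,280 < 2¹⁶, so 16 bits suffice for the SAD.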

The absolute-difference operation cannot be used by all multimedia kernels, however. Those media kernels need an extra bit to hold the output carry. For example, consider the following loop, which computes the arithmetic average of two images:

unsigned char src1[], src2[], dst[];
for (i = 0; i < n; i++)
    dst[i] = (src1[i] + src2[i]) >> 1;

Even though the final result dst[i] is an 8-bit value, the intermediate result src1[i] + src2[i] is 9-bit. As previously discussed in Section 2.1 in Chapter 2, if the packed subwords are filled with a maximum value representable using the subword data type, the choice is either to be imprecise by using a saturation operation at every stage, or to lose parallelism by unpacking to a larger format. Using saturation instructions produces unexpected results. Converting or promoting the source operands to a larger format before they are processed, and packing or demoting the results again before they are written to memory, causes conversion overhead. In addition, the number of subwords that are processed in parallel by a single SIMD instruction is reduced. The main reason for the data type conversion instructions is the mismatch between the storage format and the computational format.

Some multimedia kernels have been examined to determine their storage and computational formats. The result is depicted in Table 3.1. It shows that four extra bits for every byte in a register are sufficient for those kernels. This is also supported by [37], where it was shown that a 12-bit data format is sufficient for most functions involved in MPEG-4 encoding (core profile). For example, the results presented there indicate that MPEG-4 encoding consists of 21 functions and 17 of them need a

Multimedia kernels                 Storage format    Computational format
RGB-to-YCbCr                       unsigned byte     12-bit
YCbCr-to-RGB                       unsigned byte     12-bit
SAD function                       unsigned byte     9-bit
SAD function with interpolation    unsigned byte     10-bit
SSD function                       unsigned byte     16-bit
SSD function with interpolation    unsigned byte     16-bit
Add block                          unsigned byte     9-bit
2D DCT                             (un)signed byte   12-bit
2 × 2 Haar transform               unsigned byte     10-bit
Inverse Haar transform             half word         10-bit
Paeth prediction                   unsigned byte     10-bit
Repetitive padding                 unsigned byte     9-bit
Arithmetic average                 unsigned byte     9-bit

Table 3.1: The storage and computational formats of some multimedia kernels.

[Figure 3.3 shows a 96-bit media register divided at bit positions 95..84, 83..72, 71..60, 59..48, 47..36, 35..24, 23..12, and 11..0, holding eight 12-bit elements, four 24-bit elements, or two 48-bit elements.]

Figure 3.3: Different subwords in the media register file of the MMMX architecture.

precision equal to or smaller than 12 bits. The remaining functions need a precision larger than 12 bits, for instance, 32 or 64 bits.

To avoid the data type conversion overhead and to increase parallelism, extended subwords are employed. This means that the registers are wider than the data loaded into them. Specifically, for every byte of data, there are four extra bits. This implies that MMMX registers are 96 bits wide, while MMX has 64-bit registers. The number of elements that can be placed in a register is determined by the size of the elements. Based on that, an MMMX register can hold 2 × 48-bit, 4 × 24-bit, or 8 × 12-bit elements, as depicted in Figure 3.3. This means that a 96-bit register can be split into independent subwords of 12, 24, or 48 bits that are operated on in parallel. The registers are used to hold fixed-point data. In other words, the MMMX architecture supports 8-, 16-, and 32-bit data types in memory and 12-, 24-, and 48-bit data types in the datapath during computations.

The extended subwords technique increases the number of subwords that can be packed into a media register. This feature allows more operations to be performed in parallel by packing more data elements into a single media register. The extended subwords technique also eliminates most of the pack/unpack instructions required to convert between different data types.

This technique, however, is not sufficient for 2D multimedia algorithms that process data along the rows as well as along the columns. In other words, 2D operations access data in both horizontal and vertical directions. The register files of existing SIMD architectures only allow simultaneous access to a row register. A register is referred to as a row register if it is horizontally (row-wise) accessed. In other words, row registers correspond to conventional media registers.
A register is called a column register if it is vertically (column-wise) accessed. In order to employ SIMD instructions in 2D algorithms, the matrix needs to be transposed frequently. Transposition takes a significant amount of time, however. This is because the matrix transpose operation, like any 2D operation, accesses data elements along the rows as well as along the columns. For example, implementing this operation in MMX/SSE requires many rearrangement instructions such as punpckh, punpckl, and pshufw. Specifically, implementing an 8 × 8 matrix transposition using MMX/SSE requires 56 instructions if the elements are 8 bits wide. If the elements are two bytes wide, then 88 instructions are required. The next section describes the matrix register file, which overcomes this limitation.

3.2 The Matrix Register File

This section describes the matrix register file (MRF), the number and configuration of its registers, and its usefulness in the implementation of multimedia kernels. As mentioned in previous chapters, one of the bottlenecks of any SIMD architecture is the restricted access to the data stored in the media register file. SIMD instructions can only access the data in row registers. The restriction is due to the conventional design of the media register file, which allows concurrent accesses only to data residing in a row. However, the ability to efficiently rearrange subwords within and between registers is crucial to the performance of many media kernels. Matrix transposition in particular, which is needed in several block-based algorithms, is a very expensive operation. Therefore, a way to accelerate this operation is for the architecture to allow access to data in both dimensions.

To overcome this problem, a matrix register file is employed, which allows data loaded from memory to be written to a column of the register file as well as to a row register. Figure 3.4 illustrates an MRF with 12-bit subwords. It has eight row registers 3mxi, 0 ≤ i ≤ 7 (corresponding to conventional media registers) and eight column registers 3mxci, 0 ≤ i ≤ 7 (corresponding subwords in different row registers). Both row and column registers are 96 bits wide. Data loaded from memory can be written to a row register as well as to a column register. Seven 2:1 12-bit multiplexers are needed per register/row to select between row-wise and column-wise access. For example, register 3mx0 needs to be able to select between the most significant subword of the data for column-wise access and another subword in case of row-wise access. Multiplexers are not needed for the subwords on the main diagonal.

[Figure 3.4 shows an 8 × 8 array of 12-bit subwords; in each row, seven 2:1 multiplexers select between row-wise and column-wise access, connecting the write data to row registers 3mx0–3mx7 and column registers 3mxc0–3mxc7.]

Figure 3.4: A matrix register file with 12-bit subwords. For simplicity, write and clock signals have been omitted.

Each 12-bit subword MRF[i, j], 0 ≤ i, j ≤ 7, belongs to row register i and column register j. The eight row and eight column registers are accessed row-wise and column-wise, respectively. Therefore, the MRF architecture provides parallel access to 12-, 24-, and 48-bit subwords of the row registers, which are horizontally arranged. This is similar to conventional SIMD architectures, which provide parallel access to 8-, 16-, and 32-bit data elements of media registers. In addition, the MRF provides parallel access to 12-bit subwords of the column registers, which are vertically arranged. It does not support access to 24- or 48-bit data elements of the column registers. Although it could be extended to support those subwords, supporting 12-bit subwords is sufficient to implement many media kernels, as is shown in the next chapter.

There are two ways to transpose an 8 × 8 block whose elements are smaller than 12 bits while using a 16-bit storage format: first, a normal write to row registers followed by a column-wise read from column registers; second, a column-wise write to column registers followed by a normal read from row registers. The MRF requires additional hardware for each port that supports column-wise access. Since the media register file has fewer write ports than read ports, the latter approach is selected to implement the MRF in this thesis.

To illustrate the usefulness of the MRF, two examples are discussed in the remainder of this section. The first example is computing an 8-point 1D DCT using the LLM algorithm [96], which is depicted in Figure 3.5. An 8 × 8 2D DCT can be accomplished by performing a 1D DCT on each row followed by a 1D DCT on each column. Initially, the bytes xi (0 ≤ i ≤ 7) are packed consecutively into one 64-bit quadword, with x0 being the least significant and x7 being the most significant byte. It can be seen that this piece of code exhibits subword DLP, but only 4-way.
Furthermore, in order to exploit this parallelism using SIMD instructions, the elements x0 and x7, x1 and x6, x2 and x5, and x3 and x4 have to be in corresponding subwords of different registers. This implies that the high and low doublewords of the quadword

s10 = x0 + x7
s11 = x1 + x6
s12 = x2 + x5
s13 = x3 + x4
s14 = x3 − x4
s15 = x2 − x5
s16 = x1 − x6
s17 = x0 − x7

Figure 3.5: First stage of the LLM algorithm for computing an 8-point DCT.

have to be split across different registers and that the order of the subwords in one of these registers has to be reversed.

An alternative way to realize a 2D DCT is to transpose the matrix so that the xi's of different rows are in one register. In other words, it performs several 1D DCTs in parallel rather than trying to exploit the DLP present in a single 1D DCT. If the transposition step can be implemented efficiently, this method is more efficient than the first one. Moreover, it allows 8-way SIMD parallelism to be exploited, provided the subwords can represent the intermediate results. It should be noted that matrix transposition arises not only in the DCT but also in many other kernels such as the IDCT, vertical padding, and vertical subsampling. Moreover, the matrix has to be transposed twice in order to exploit 8-way parallelism in these kernels.

The second example is the vectorization of strided data. As was mentioned in the first chapter, SIMD architectures are most efficient when the data that is processed in parallel is stored consecutively in memory. If it is not, there is a large overhead involving data reorganization instructions. For example, in the case of color space conversion, often the band-interleaved format is used, where the color components of each pixel are adjacent in memory. This implies that in order to employ SIMD instructions efficiently, the image pixels have to be reorganized so that the red data of different pixels are contained in one register, the green data in another, and the blue data in a third register. In this case, many data reorganization instructions need to be executed. Figure 3.6 illustrates how the MRF can be used to reorganize band-interleaved RGB data into band-separated form. With eight load-column instructions (fldc8u12), registers containing eight red, eight green, or eight blue values each are obtained.
Each load-column instruction loads eight byte values (three red, three green, and two blue), as shown in Figure 3.6 for little endian. To obtain the correct arrangement of RGB values, an offset that is a multiple of 6 bytes is used for each fldc8u12 instruction. It is remarked that this also works for other strides. For example, when the stride is 4, an

offset which is a multiple of 8 can be used.

Memory at R1: r1 g1 b1 r2 g2 b2 r3 g3 b3 r4 ...

fldc8u12 3mxc0, 0(R1)
fldc8u12 3mxc1, 6(R1)
fldc8u12 3mxc2, 12(R1)
fldc8u12 3mxc3, 18(R1)
fldc8u12 3mxc4, 24(R1)
fldc8u12 3mxc5, 30(R1)
fldc8u12 3mxc6, 36(R1)
fldc8u12 3mxc7, 42(R1)

After these loads, row registers 3mx7, 3mx4, and 3mx1 hold red values, 3mx6, 3mx3, and 3mx0 hold green values, and 3mx5 and 3mx2 hold blue values.

Figure 3.6: Loading eight red, eight green, and eight blue values into the matrix register file using the fldc8u12 instruction for little endian.

As previously mentioned, the MMMX extension is multimedia oriented. Its instruction set has been designed based on the characteristics of MMAs. An ISA that is application specific has not been considered; for example, SPIs are synthesized using a few general-purpose SIMD instructions. The following section describes the instruction set of the MMMX architecture.
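The transpose method selected in Section 3.2 (column-wise writes followed by normal row-wise reads) can be modeled in scalar C. The MRF is represented here as an 8 × 8 array of 12-bit subwords; write_column and read_row are illustrative helper names for this sketch, not MMMX mnemonics.

```c
#include <assert.h>
#include <stdint.h>

#define N 8
static uint16_t mrf[N][N]; /* mrf[r][c]: subword c of row register 3mx<r> */

/* Column-wise write (as the fldc* instructions do): eight consecutive
 * memory elements land in column c, one per row register. */
static void write_column(int c, const uint16_t *mem) {
    for (int r = 0; r < N; r++)
        mrf[r][c] = mem[r] & 0xFFF; /* keep the 12-bit subword pattern */
}

/* Normal row-wise read of row register r. */
static void read_row(int r, uint16_t *dst) {
    for (int c = 0; c < N; c++)
        dst[c] = mrf[r][c];
}
```

Writing the eight rows of a block with write_column and then reading the row registers yields the transposed block: after write_column(c, blk[c]) for every c, a read of row r returns column r of the original block.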

3.3 MMMX Instruction Set Architecture

The main goal of a media instruction set is to enable efficient execution of multimedia workloads. In the design of the MMMX instruction set, factors such as reduced operation complexity and a wider set of available operations have been considered. More details about the MMMX instruction set are discussed in the following sections.

3.3.1 Load/Store Instructions

The MMMX architecture has different load/store instructions, which are depicted in Table 3.2. Some of them are explained as follows. The fld8u12 instruction loads eight unsigned bytes from memory and zero-extends them to a 12-bit format in a 96-bit MMMX register. The fld8s12 instruction, on the other hand, loads eight signed bytes and sign-extends them to a 12-bit format. These instructions are illustrated in Figure 3.7 for little endian. The fld16s12 instruction loads eight signed 16-bit values, packs them to a signed 12-bit format, and writes them to a row register. This instruction is useful for kernels whose input data can be represented in a signed 12-bit format while using a signed 16-bit storage format. For example, in the DCT kernel,

Memory at R1: 0xFF 0x13 0xAB 0x2A 0xA7 0x01 0x02 0x03 ...

fld8s12 3mx0, (R1)
3mx0: 0x003 | 0x002 | 0x001 | 0xFA7 | 0x02A | 0xFAB | 0x013 | 0xFFF

fld8u12 3mx1, (R1)
3mx1: 0x003 | 0x002 | 0x001 | 0x0A7 | 0x02A | 0x0AB | 0x013 | 0x0FF

Figure 3.7: The fld8s12 instruction loads eight signed bytes and sign-extends them to 12-bit values, while the fld8u12 instruction loads eight unsigned bytes and zero-extends them to 12-bit values.

Instruction    Description
fld8u12        loads eight unsigned bytes, zero-extends them to 12-bit format, and writes them to a row register.
fld8s12        loads eight signed bytes, sign-extends them to signed 12-bit format, and writes them to a row register.
fld16s12       loads eight signed 16-bit values, packs them to signed 12-bit format, and writes them to a row register.
fld16u12       loads eight unsigned 16-bit values, packs them to unsigned 12-bit format, and writes them to a row register.
fld16s24       loads four signed 16-bit values, sign-extends them to signed 24-bit format, and writes them to a row register.
fld16u24       loads four unsigned 16-bit values, zero-extends them to 24-bit format, and writes them to a row register.
fld32s24       loads four signed 32-bit values, packs them to signed 24-bit format, and writes them to a row register.
fld32u24       loads four unsigned 32-bit values, packs them to unsigned 24-bit format, and writes them to a row register.
fld32s48       loads two signed 32-bit values, sign-extends them to signed 48-bit format, and writes them to a row register.
fldc8u12       loads eight unsigned bytes, zero-extends them to 12-bit format, and writes them to a column register.
fldc16s12      loads eight signed 16-bit values, packs them to signed 12-bit format, and writes them to a column register.
fst12s8        saturates eight signed 12-bit subwords to unsigned 8-bit and stores them to memory.
fst12s16       unpacks eight signed 12-bit subwords to signed 16-bit and stores them to memory.
fst24s16       saturates four signed 24-bit subwords to signed 16-bit and stores them to memory.
fst24s32       unpacks four signed 24-bit subwords to signed 32-bit and stores them to memory.
fst48s16       saturates two signed 48-bit subwords to signed 16-bit and stores them to memory.
fst48s32       saturates two signed 48-bit subwords to signed 32-bit and stores them to memory.
fst96s16       saturates a signed 96-bit value to signed 16-bit and stores it to memory.
fst96s32       saturates a signed 96-bit value to signed 32-bit and stores it to memory.

Table 3.2: The load/store instructions of the MMMX architecture.

the input data is in a signed 9-bit format; it uses a signed 16-bit storage format and a signed 12-bit computational format. The instruction fldc8u12 ("load-column 8-bit to 12-bit unsigned") is used to load a column of the MRF. Load instructions automatically unpack, as illustrated in Figure 3.7, and store instructions automatically saturate (clip) and pack the subwords. For example, the instruction fst12s8 saturates the 12-bit signed subwords to 8-bit unsigned subwords before storing them to memory.
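The per-subword conversion behavior of, for instance, fld8s12 and fst12s8 can be modeled in scalar C as follows. The _elem helper names are illustrative identifiers for this sketch, not MMMX mnemonics; the model reproduces the values shown in Figure 3.7.

```c
#include <assert.h>
#include <stdint.h>

/* Model of fld8s12 on one element: interpret a memory byte as signed and
 * sign-extend it to a 12-bit subword pattern. */
static uint16_t fld8s12_elem(uint8_t b) {
    int v = (b & 0x80) ? (int)b - 256 : (int)b; /* signed value of the byte */
    return (uint16_t)((v + 0x1000) & 0xFFF);    /* 12-bit two's complement */
}

/* Model of fst12s8 on one element: saturate a signed 12-bit subword to an
 * unsigned byte before it is stored. */
static uint8_t fst12s8_elem(uint16_t s) {
    int v = (s & 0x800) ? (int)s - 0x1000 : (int)s; /* signed 12-bit value */
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}
```

For example, fld8s12_elem(0xAB) yields 0xFAB, as in Figure 3.7, and fst12s8_elem(0xFAB) clips the negative value back to 0.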

3.3.2 ALU Instructions

Most MMMX ALU instructions are direct counterparts of MMX/SSE instructions. For example, the MMMX instructions fadd{12,24,48} (packed addition of 12-, 24-, or 48-bit subwords) and fsub{12,24,48} (packed subtraction of 12-, 24-, or 48-bit subwords) correspond to the MMX instructions padd{b,w,d} mm, mm/mem64 and psub{b,w,d} mm, mm/mem64, respectively. MMMX, however, does not support variants of these instructions that automatically saturate the results of the additions to the maximum value representable by the subword data type. They are not needed because, as was mentioned, the load instructions automatically unpack the subwords and the store instructions automatically pack and saturate. In other words, the MMMX architecture does not support saturation arithmetic. Table 3.3 depicts the ALU instructions.

Instruction        Description
fadd{12,24,48}     packed addition of 12-, 24-, or 48-bit subwords.
fsub{12,24,48}     packed subtraction of 12-, 24-, or 48-bit subwords.
fsum{12,24,48}     addition of adjacent elements of 12-, 24-, or 48-bit subwords.
fdiff{12,24,48}    subtraction of adjacent elements of 12-, 24-, or 48-bit subwords.
fneg{12,24,48}     negates some or all 12-, 24-, or 48-bit subwords of a packed register.
finc{12,24,48}     increments 12-, 24-, or 48-bit subwords.
fdec{12,24,48}     decrements 12-, 24-, or 48-bit subwords.
fmin{12,24,48}     minimum selection on 12-, 24-, or 48-bit subwords.
fmax{12,24,48}     maximum selection on 12-, 24-, or 48-bit subwords.

Table 3.3: The ALU instructions of the MMMX architecture.

In the remainder of this section and in the next section, some novel MMMX ALU and multiplication instructions are discussed that are not supported in MMX. In many media kernels all elements packed in a register need to be summed, while in other kernels adjacent elements need to be added. Rather than providing different instructions for summing all elements and for adding adjacent elements, it has been decided to support adding adjacent elements only, but for every packed data type. Whereas summing all elements would probably translate to a multicycle operation, adding adjacent elements is a very simple operation that can most likely be implemented in a single cycle. Figure 3.8 illustrates how eight 12-bit subwords can be reduced to a single 96-bit sum or 96-bit difference using the instructions fsum{12,24,48} and fdiff{12,24,48}, respectively. It is remarked that ARM's Neon technology and SSE3 also support pairwise addition instructions.

Another operation that has been found useful in implementing many multimedia kernels, such as the (I)DCT kernels, is the possibility to negate some or all elements in a packed register.
The instruction fneg{12,24,48} 3mx0, 3mx1, imm8 negates the 12-, 24-, or 48-bit subwords of the source operand if the corresponding bit in the 8-bit immediate imm8 is set. An example of this instruction is depicted in Figure 3.9. If the subwords are 24 or 48 bits wide, the four or six higher-order bits of the 8-bit immediate are ignored.

Figure 3.8: Reducing eight 12-bit subwords to a single 96-bit sum or 96-bit difference using the instructions fsum{12,24,48} and fdiff{12,24,48}, respectively. The fsum12/fdiff12 step produces four 24-bit results, fsum24/fdiff24 produces two 48-bit results, and fsum48/fdiff48 produces one 96-bit result.
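The pairwise reduction of Figure 3.8 can be sketched in C. This is a behavioral model only, not the hardware implementation; the lane ordering (lane 0 is the least-significant subword) and the helper names are illustrative assumptions:

```c
#include <stdint.h>

/* Model: a 96-bit MMMX register as eight 12-bit subwords, each held in
   an int16_t; lane 0 is the least-significant subword (an assumption). */

/* fsum12: add adjacent 12-bit subwords, producing four 24-bit subwords. */
static void fsum12(const int16_t src[8], int32_t dst[4]) {
    for (int i = 0; i < 4; i++)
        dst[i] = (int32_t)src[2 * i] + src[2 * i + 1];
}

/* fdiff12: subtract adjacent 12-bit subwords (direction is assumed). */
static void fdiff12(const int16_t src[8], int32_t dst[4]) {
    for (int i = 0; i < 4; i++)
        dst[i] = (int32_t)src[2 * i] - src[2 * i + 1];
}

/* Applying fsum12, then fsum24, then fsum48 reduces eight subwords to a
   single sum; for illustration the three steps collapse into one loop. */
static int64_t reduce_sum12(const int16_t src[8]) {
    int64_t s = 0;
    for (int i = 0; i < 8; i++) s += src[i];
    return s;
}
```

The three-level reduction tree mirrors the figure: each level halves the number of results while doubling their width.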

3mx1: 0xFFF 0xC00 0x200 0x400 0x801 0x7FF 0xF01 0x0FF

fneg12 3mx0, 3mx1, 11010111

3mx0: 0x001 0x400 0x200 0xC00 0x801 0x801 0x0FF 0xF01

Figure 3.9: Illustration of the fneg12 3mx0, 3mx1, 11010111 instruction.
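The semantics of fneg12 can be sketched in C. The mapping of immediate bit i to subword lane i is an assumption inferred from Figure 3.9 (the figure's subword values reproduce under this model); the function name is illustrative:

```c
#include <stdint.h>

/* Sketch of fneg12 3mx0, 3mx1, imm8: negate the i-th 12-bit subword of
   the source where bit i of the immediate is set.  Subwords are 12-bit
   two's complement, kept in the low 12 bits of a uint16_t.  Note that
   negating the most negative value 0x800 wraps back to 0x800. */
static void fneg12(uint16_t dst[8], const uint16_t src[8], uint8_t imm8) {
    for (int i = 0; i < 8; i++) {
        if (imm8 & (1u << i))
            dst[i] = (uint16_t)(-src[i]) & 0xFFF;  /* two's-complement negate */
        else
            dst[i] = src[i] & 0xFFF;
    }
}
```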

3.3.3 Multiplication Instructions

The MMMX architecture supports three kinds of multiplication instructions, which are depicted in Table 3.4.

Instructions              Description
fmulf{12,24}              full multiplication instructions
fmadd{12,24}              partitioned multiply-accumulate instructions
fmul{12l,12h,24l,24h}     truncation multiplication instructions

Table 3.4: The multiplication instructions of the MMMX architecture.

The first kind are the full multiplication instructions fmulf{12,24}. For example, the fmulf12 instruction multiplies each 12-bit subword in 3mx0 with the corresponding subword in 3mx1 and produces eight 24-bit results. Since each result is larger than a subword, the produced results are kept in both registers. The second kind are the partitioned multiply-accumulate instructions fmadd{12,24}. These instructions perform the MAC operation on subwords that are either 12- or 24-bit, while the MMX instruction pmaddwd performs the MAC operation on 16-bit subwords. The MAC operation is an important operation in digital signal processing. Figure 3.10 illustrates the operation of the fmadd12 3mx0, 3mx1 instruction. This instruction multiplies the eight signed 12-bit values of the destination operand by the eight 12-bit values of the source operand. The corresponding odd-numbered and even-numbered products are summed and stored in the 24-bit subwords of the destination operand.

Figure 3.10: Partitioned multiplication using the fmadd12 3mx0, 3mx1 instruction. The four 24-bit results are a7*b7 + a6*b6, a5*b5 + a4*b4, a3*b3 + a2*b2, and a1*b1 + a0*b0.

The third kind perform multiplication with truncation, using the fmul{12l,12h,24l,24h} instructions. Truncation means that the high- or low-order bits of the results are discarded. When n-bit fixed-point values with fractional components are multiplied, the result should retain n bits of precision. Specifically, the instructions fmul12{l,h} multiply the eight corresponding subwords of the source and destination operands and write the low-order (fmul12l) or high-order (fmul12h) 12 bits of each 24-bit product to the destination operand. This kind of partitioned multiplication is useful in several applications. For example, the fmul12h instruction is used in the fixed-point MMMX implementation of color space conversion that is discussed in Chapter 4. Figure 3.11 illustrates the fmul12h instruction. In this example, each subword of the 3mx0 register is multiplied by the value 0x404, which has been replicated in the 3mx1 register, and the corresponding high-order 12 bits are stored in the destination operand.

3mx0:          0x03C    0x050    0x05A    0x046    0x1FE    0x1F4    0x0D2    0x0AA
3mx1:          0x404    0x404    0x404    0x404    0x404    0x404    0x404    0x404
24-bit result: 0x00F0F0 0x014140 0x016968 0x011918 0x07FFF8 0x07D7D0 0x034B48 0x02AAA8
fmul12h 3mx0, 3mx1:
3mx0:          0x00F    0x014    0x017    0x012    0x080    0x07D    0x035    0x02B

Figure 3.11: Partitioned multiplication using the fmul12h 3mx0, 3mx1 instruction.

Truncation inherently loses precision. In order to reduce this error, the intermediate 24-bit result is first rounded internally, and only then are the low-order 12 bits truncated, as depicted in Figure 3.11. For example, the product of the values 0x0AA and 0x404 is 0x02AAA8. After rounding and truncating the low-order 12 bits, the final result is 0x02B rather than the truncated value 0x02A.

Differences                        MMX/SSE (integer part)            MMMX
Datapath                           64-bit                            96-bit
Size of register file              8 x 64-bit                        8 x 96-bit
Shared with                        floating-point registers          dedicated
Access to register file            row-wise                          row-wise + column-wise
Size of the partitioned ALU        64-bit                            96-bit
Size of the integer subwords       8-, 16-, and 32-bit               12-, 24-, and 48-bit
Addition instructions              padd{b,w,d}                       fadd{12,24,48}
Subtract instructions              psub{b,w,d}                       fsub{12,24,48}
Saturating add instructions        padds{b,w}, paddus{b,w}           no
Saturating subtract instructions   psubs{b,w}, psubus{b,w}           no
Full multiply instructions         no                                fmulf{12,24}
High and low multiply inst.        pmul{hw,lw,huw}                   fmul{12l,12h,24l,24h}
Size of the MAC unit               16-bit                            12- and 24-bit
MAC instructions                   pmaddwd                           fmadd{12,24}
Increment instruction              no                                finc{12,24,48}
Decrement instruction              no                                fdec{12,24,48}
Negate instruction                 no                                fneg{12,24,48}
Minimum selection instructions     pmin{ub,sw}                       fmin{12,24,48}
Maximum selection instructions     pmax{ub,sw}                       fmax{12,24,48}
Adjacent subwords addition         no                                fsum{12,24,48}
Adjacent subwords subtraction      no                                fdiff{12,24,48}
Special-purpose instructions       pavg{b,w}, psadbw                 no
Overhead instructions              packss{wb,dw}, packuswb,          funpckl{12,24},
                                   punpckh{bw,wd,dq},                funpckh{12,24}
                                   punpckl{bw,wd,dq}, pshufw

Table 3.5: The main differences between the MMX/SSE and MMMX ISAs.

On the MMX architecture, on the other hand, the pmulhw instruction truncates the low-order 16 bits of the product rather than rounding them.
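The round-then-truncate behavior of fmul12h can be sketched in C. The rounding constant (half an LSB of the retained part, added before the shift) is an assumption consistent with the 0x0AA x 0x404 example above; the helper names are illustrative:

```c
#include <stdint.h>

/* Sign-extend a 12-bit value held in the low bits of a uint16_t. */
static int16_t sext12(uint16_t x) {
    return (int16_t)((x & 0x800) ? (x | 0xF000) : x);
}

/* Sketch of fmul12h on one lane: multiply two signed 12-bit subwords,
   round the 24-bit product, and keep the high-order 12 bits.  The
   right shift of a possibly negative product assumes arithmetic
   shifting, as on common compilers. */
static uint16_t fmul12h(uint16_t a, uint16_t b) {
    int32_t prod = (int32_t)sext12(a) * sext12(b); /* 24-bit product   */
    prod += 1 << 11;                               /* round            */
    return (uint16_t)((prod >> 12) & 0xFFF);       /* high 12 bits     */
}
```

With this model, 0x0AA x 0x404 yields 0x02B (as in the text) instead of the plainly truncated 0x02A.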

3.3.4 Differences Between MMMX and MMX Architectures

The main differences in the integer part between the MMX/SSE and MMMX ISAs are depicted in Table 3.5. The characteristics of the MMMX architecture are the following. First, the MMMX architecture defines a set of eight row and eight column registers, each of which contains 96 bits. Second, each MMMX row register contains eight 12-bit, four 24-bit, or two 48-bit subwords, while each column register consists of eight 12-bit data elements. This means that the media registers are wider than the data to be loaded into them. The registers are used to hold fixed-point data; in other words, the MMMX architecture is a fixed-point SIMD extension. Third, the MMMX load instructions implicitly unpack data from the storage format to the computation format, and the store instructions implicitly pack and saturate data from the computation format to the storage format. Fourth, media registers can be accessed row-wise as well as column-wise. In other words, each 12-bit subword MRF[i, j], 0 ≤ i, j ≤ 7, belongs to row register i and column register j.

Additionally, there are some general-purpose SIMD instructions that have no MMX equivalent, such as the fsum{12,24,48} and fdiff{12,24,48} instructions. Due to the extended subwords, the MMMX architecture provides more subword parallelism than MMX, as will be shown for some multimedia kernels in Chapter 4. Furthermore, MMMX supports neither the special-purpose MMX/SSE instructions psadbw and pavg{b,w} nor rearrangement instructions such as pshufw and packss{wb,dw}. Both MMX and MMMX support unpack instructions: MMMX provides the funpckl{12,24} and funpckh{12,24} instructions, as depicted in the last row of Table 3.5, in order to reorder final results and store them in their appropriate places, while MMX provides the punpck{hbw,hwd,hdq,lbw,lwd,ldq} instructions.
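The implicit conversion between storage and computation formats can be sketched in C. This is a behavioral model under stated assumptions: 8-bit unsigned pixels as the storage format, 12-bit signed subwords as the computation format, and illustrative function names (not the actual load/store mnemonics):

```c
#include <stdint.h>

/* A load unpacks eight 8-bit storage-format pixels into eight 12-bit
   computation-format subwords; no separate unpack instruction needed. */
static void load_unpack8to12(int16_t reg[8], const uint8_t mem[8]) {
    for (int i = 0; i < 8; i++)
        reg[i] = mem[i];                 /* zero-extend 8 -> 12 bits */
}

/* A store packs the subwords back to 8-bit storage format with
   saturation, so no explicit saturating arithmetic is required. */
static void store_pack12to8(uint8_t mem[8], const int16_t reg[8]) {
    for (int i = 0; i < 8; i++) {
        int16_t v = reg[i];
        mem[i] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v); /* clamp */
    }
}
```

Intermediate results may legitimately leave the 8-bit range during computation; only the final store clamps them, which is why MMMX can omit saturating add/subtract variants.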

3.3.5 Hardware Cost of the Proposed Techniques

In this section, the hardware cost of the MMMX architecture over the MMX architecture, in terms of area and critical path delay, is discussed. From the hardware point of view, the differences between the MMX and MMMX architectures are the following. First, each MMMX register is 32 bits wider than each MMX register. Second, the MMMX register file is accessible in both directions, while the MMX register file is not. This means that column-wise access to the MMMX register file requires multiplexers, an additional decoder, and extra wiring. Third, the MMMX ISA needs to be able to address the column registers. Finally, the MMMX architecture requires a 96-bit partitioned ALU that provides eight 12-bit, four 24-bit, or two 48-bit subword parallel processing, whereas in the MMX architecture a 64-bit partitioned ALU is sufficient. Furthermore, the MMMX architecture provides some SIMD instructions that the MMX architecture lacks.

As previously mentioned, in order to reduce the hardware cost of the MMMX architecture, column-wise access is provided only on the write port of the register file. This is because the number of write ports is usually smaller than the number of read ports. Only load-column instructions can access the column registers; the other instructions cannot. The MMMX ISA has two load-column instructions, because both 8- and 16-bit image data are loaded into column registers. These instructions are used for kernels that employ the MRF technique, such as the RGB-to-YCbCr kernel. A single bit in the instruction format of load instructions distinguishes normal load instructions from load-column instructions.

The register file, the 64-bit partitioned ALU, and the multiplication unit of the MMX architecture, as well as the MRF with extended subwords, the 96-bit partitioned ALU, and the multiplication unit of the MMMX architecture, have been implemented in VHDL. A register file with two read ports and one write port has been considered for both architectures. Figure 3.12(a) depicts a block diagram of a register file with one write port (Port C) and two read ports (Ports A and B). The inputs and outputs of this block diagram are based on eight 96-bit registers. Figure 3.12(b) illustrates the combination of the MRF with a 96-bit partitioned ALU for the MMMX architecture.

Figure 3.12: (a) A register file with eight 96-bit registers, two read ports, and one write port; (b) the implementation of two read ports and one write port for a matrix register file with eight 96-bit registers, as well as a partitioned ALU for subword parallel processing.

The partitioned ALUs have been designed based on a subword adder. Multiplexers are used at the subword boundaries to propagate or block the subword carries in the carry chain [65]. There are eight 12-bit adders. These adders operate independently on 12-bit data; they can also be coupled as four pairs of two adders to perform four 24-bit operations, or combined into two groups of four adders for the 48-bit format. Figure 3.13 illustrates how a 96-bit ALU can be partitioned into eight 12-bit ALUs. Such a partitioned ALU can perform eight 12-bit, four 24-bit, two 48-bit, or one 96-bit operation.

Figure 3.13: A 96-bit partitioned ALU in the MMMX architecture.

In addition, all SIMD arithmetic, multiplication, logical, and shift instructions of both architectures have been implemented in VHDL, using the same techniques and methods for both. In order to provide a rough comparison of the two architectures in terms of area and speed, the Xilinx FPGA Virtex-II Pro xc2vp30 device has been used. Obviously, the hardware cost and speed could be improved by an ASIC implementation; the aim was merely to derive a rough comparison. All designs have been mapped onto reconfigurable logic, i.e., Look-Up Tables (LUTs); dedicated arithmetic resources such as hardwired multipliers have not been used. The hardware implementations have been synthesized, placed, and routed using the Xilinx ISE tool. In order to compare the two architectures, the ratios of the MMMX area (in utilized LUTs) and critical path delay over those of MMX are presented. Table 3.6 shows the utilized area and the critical path delay of the MMX and MMMX architectures, as well as these ratios. As the table shows, the register file of the MMMX architecture is 2.89 times larger than the register file of the MMX architecture.
This is because the former uses eight 96-bit registers, 672 2:1 multiplexers, and one 4:16 decoder, while in the latter eight 64-bit registers and one 3:8 decoder are sufficient. In addition, the timing results show that the critical path delay of the MRF is 5% larger than that of the MMX register file.

Hardware Component                     Metric          MMX     MMMX    MMMX/MMX
Register File (RF)                     Area (# LUTs)   520     1504    2.89
                                       Delay (ns)      8.08    8.50    1.05
Partitioned ALU                        Area (# LUTs)   2424    3429    1.41
                                       Delay (ns)      20.45   26.50   1.30
Partitioned ALU + 12/16-bit MULT       Area (# LUTs)   5617    6417    1.14
                                       Delay (ns)      19.27   26.17   1.35
Partitioned ALU + 12,24/16 MULT        Area (# LUTs)   5617    12764   2.27
                                       Delay (ns)      19.27   27.04   1.40
RF + Partitioned ALU + 12,24/16 MULT   Area (# LUTs)   6023    14359   2.38
                                       Delay (ns)      19.98   28.16   1.41

Table 3.6: The area utilization in terms of LUTs and the critical path delays (ns) of the MMX and MMMX architectures, as well as the ratio of the utilized area and critical path delay of MMMX over MMX, for the register file, the partitioned ALU, and the whole hardware system.

The partitioned ALU of the MMMX architecture is 1.41 times larger than the partitioned ALU of the MMX architecture, and the former is 30% slower than the latter; the longer critical path delay is caused by the subword adder. The partitioned ALU and multiplication unit of the MMMX architecture are together 2.27 times larger than those of the MMX architecture, for the following reasons. First, the partitioned ALU of the MMMX architecture is wider than that of the MMX architecture. Second, the MMMX ISA contains more general SIMD instructions, such as the full 12- and 24-bit multiplications. The overhead instructions of the MMX architecture, depicted in the last row of Table 3.5, were not considered. The critical path delay of MMMX, which is determined by the 24-bit multiplication, is 40% longer than the critical path of the MMX ALU. Finally, the whole hardware system of the MMMX architecture is 2.38 times larger than that of the MMX architecture, and its critical path delay is 41% longer. It should be mentioned that pipelining of the multiplication operations has not been considered, as the aim was to provide a rough comparison between the MMX and MMMX architectures in terms of maximum combinational logic delay.
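The carry-chain partitioning of the ALU in Figure 3.13 can be sketched in C. This is a behavioral model, not the VHDL design: the boundary multiplexers are modeled by resetting the carry at the start of each subword group, and the function name and lane representation are assumptions:

```c
#include <stdint.h>

/* Sketch of the partitioned 96-bit adder: eight 12-bit adders whose
   carries are propagated or blocked at subword boundaries.  'span' is
   the number of 12-bit lanes per subword: 1 = eight 12-bit adds,
   2 = four 24-bit adds, 4 = two 48-bit adds, 8 = one 96-bit add. */
static void padd96(uint16_t r[8], const uint16_t a[8],
                   const uint16_t b[8], int span) {
    unsigned carry = 0;
    for (int i = 0; i < 8; i++) {
        if (i % span == 0)
            carry = 0;                 /* boundary mux blocks the carry */
        unsigned s = (unsigned)(a[i] & 0xFFF) + (b[i] & 0xFFF) + carry;
        r[i]  = (uint16_t)(s & 0xFFF);
        carry = s >> 12;               /* carry into the next 12-bit lane */
    }
}
```

The only difference between the four operating modes is where carries are blocked, which is why one 96-bit datapath with boundary multiplexers suffices.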

3.4 Conclusions

This chapter has described the MMMX architecture. The MMMX architecture is based on the MMX architecture but enhanced with extended subwords and the matrix register file. The extended subwords technique widens each media register by 32 bits; it avoids data type conversion instructions and increases the number of subwords that can be processed simultaneously. The matrix register file provides both row-wise and column-wise access to the media register file. In the MMMX architecture, column-wise access is provided on the write port of the register file, which reduces the hardware cost compared to providing column-wise access on the read ports, because the number of write ports of a media register file is usually smaller than the number of read ports. The MRF technique avoids data rearrangement instructions in 2D multimedia kernels, mainly because it can be used to reorganize strided data and to transpose a matrix.

In addition, this chapter has presented new and general SIMD instructions addressing the multimedia domain. It has described the load/store, ALU, and multiplication instructions. There are two kinds of load instructions, namely normal load and load-column instructions; a single bit in the instruction format of load instructions can be used to distinguish between them. Only load-column instructions can access the column registers. Additionally, this chapter has explained the different multiplication instructions, namely full multiplication, MAC operations, and multiplication with truncation and rounding. Furthermore, the hardware overhead of the MMMX architecture in terms of area utilization and critical path delay has been discussed.
In general, the MMMX architecture is 2.38 times larger than the MMX architecture, and the critical path delay of MMMX is 41% larger than the critical path of the MMX architecture.

In the next chapter, the performance of the MMMX architecture at the kernel, image, and application level will be evaluated. For this, first some MMAs are selected and profiled in order to find the most time-consuming kernels. Then these multimedia kernels are implemented using both MMX and MMMX, and these SIMD implementations replace the original scalar versions of the applications in order to obtain the application-level speedup.

Chapter 4

Performance Evaluation

This chapter evaluates the MMMX architecture by comparing the performance of MMMX implementations of several kernels and applications to the performance of MMX implementations. For this evaluation, a number of important multimedia benchmarks such as an MPEG-2 encoder/decoder (codec), a JPEG codec, and MJPEG are selected. The performance is obtained at the kernel, image, and application level. Kernels are the most time-consuming functions and represent a major portion of multimedia applications. For example, motion estimation, color space conversion, 2D (I)DCT, repetitive padding, and the 2D DWT are some of the most time-consuming kernels in existing MMAs [126, 127]. Some multimedia kernels such as color space conversion are defined using floating-point arithmetic, while, as mentioned in previous chapters, the MMMX architecture is a fixed-point SIMD extension. This means that the floating-point kernels are implemented using fixed-point arithmetic. In order to obtain the performance results, the sim-outorder simulator of the SimpleScalar toolset has been selected as the evaluation environment. sim-outorder is a detailed, execution-driven simulator that supports out-of-order issue and execution.

The chapter is organized as follows. Section 4.1 describes the MMAs and kernels selected for performance evaluation. Section 4.2 presents the algorithms of the media kernels as well as their SIMD implementations using both the MMX and MMMX architectures; it shows how the proposed techniques, extended subwords and the matrix register file, can be used to reduce the dynamic number of instructions. Section 4.3 presents the experimental methodology and tools. Section 4.4 describes the obtained performance improvement of MMMX over MMX at the block, image, and application level. Finally, Section 4.5 presents some conclusions.


Multimedia Application   Description
JPEG encode              The JPEG encoder performs DCT-based lossy image compression for digital pictures.
JPEG decode              Decoding digital pictures in JPEG format.
MJPEG                    Motion JPEG compresses digital video without using block-based motion estimation algorithms.
MPEG-2 encode            MPEG-2 encoding compresses digital video using block-based motion estimation algorithms.
MPEG-2 decode            The MPEG-2 decoder decompresses digital video using block-based motion estimation algorithms.

Table 4.1: Summary of some multimedia standards.

4.1 Benchmarks

In this section, some multimedia standards and details of several computationally important kernels are described.

4.1.1 Multimedia Standards

Some multimedia standards have been selected in order to validate the proposed architecture. Table 4.1 summarizes the benchmarks that are used for performance evaluation in this chapter. The Joint Photographic Experts Group (JPEG) and the Moving Picture Experts Group (MPEG) established the JPEG image compression and the MPEG video compression standards, respectively. Most image and video compression standards consist of an encoder and a decoder, which define how an image or video is compressed into a stream of bytes and decompressed back into an image or video. An image encoder usually performs the following functions in order: RGB-to-YCbCr color space conversion, downsampling, block splitting, DCT, quantization, and encoding. The decoder performs the inverse of these kernels in reverse order; for example, the YCbCr-to-RGB color space conversion is used in the decoder. The block diagram of a JPEG encoder and decoder is shown in Figure 4.1.

If the input images are in the RGB color space, then the RGB-to-YCbCr color space conversion is the first step in the encoder stage. In the encoder, 2:1 horizontal downsampling combined with 2:1 vertical downsampling is applied to the chrominance information (Cb and Cr) in order to reduce the encoded data with little or no perceptual effect. In order to remove redundant image data, the DCT is applied to each block of pixels provided by block splitting. After that, each block of DCT coefficients is quantized using weighting functions. Finally, the resulting coefficients are encoded with techniques such as Huffman coding in order to remove redundancies in the coefficients.

Figure 4.1: A typical block diagram of an encoder and decoder of the JPEG standard (encoder: RGB-to-YCbCr, downsampling, block splitting, DCT, quantization, encoding; decoder: decoding, upsampling, YCbCr-to-RGB).
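The 2:1 horizontal and 2:1 vertical chroma downsampling step described above can be sketched in C. Averaging each 2x2 block is one common filter choice and an assumption here (the standard does not mandate a particular filter); the function name and even image dimensions are also assumptions:

```c
#include <stdint.h>

/* Downsample a Cb (or Cr) plane by 2:1 in both directions: each 2x2
   block of input samples is averaged (with rounding) into one output
   sample.  'width' and 'height' are assumed to be even. */
static void downsample_2x2(uint8_t *dst, const uint8_t *src,
                           int width, int height) {
    for (int y = 0; y < height / 2; y++)
        for (int x = 0; x < width / 2; x++) {
            int sum = src[2*y*width     + 2*x] + src[2*y*width     + 2*x + 1]
                    + src[(2*y+1)*width + 2*x] + src[(2*y+1)*width + 2*x + 1];
            dst[y * (width / 2) + x] = (uint8_t)((sum + 2) / 4); /* rounded mean */
        }
}
```

This quarters the amount of chrominance data per plane, which is why the downsampling step reduces the encoded data with little perceptual effect.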

In the MPEG-2 video codecs, in addition to the kernels mentioned above, there are extra kernels such as motion estimation and motion compensation. Motion JPEG (MJPEG) represents a digital video sequence as a series of JPEG pictures; this means that it does not use motion estimation. The JPEG and MPEG-2 standards use a block-based DCT transform [117]. The input images are divided into 8 × 8 disjoint blocks, and each of them is transformed with the DCT as depicted in Figure 4.1. The following section describes some multimedia kernels that account for most of the computational time of these multimedia standards.

4.1.2 Multimedia Kernels

As mentioned in Chapter 1, most of the execution time of MMAs is spent in multimedia kernels. Therefore, in order to evaluate the proposed techniques, some time-consuming kernels of multimedia standards and Content-Based Image and Video Retrieval (CBIVR) systems have been considered. Table 4.2 lists the media kernels along with a short description. In order to clarify which proposed techniques are used in the SIMD implementations of the media kernels, the presented kernels are divided into two groups: kernels that use both the extended subwords and the MRF techniques (the first seven kernels), and kernels that use only the extended subwords technique (the remaining thirteen kernels).

The 2D transforms such as the Discrete Wavelet Transform (DWT) and the (I)DCT are decomposed into two 1D transforms called horizontal and vertical filtering. The horizontal filtering processes the rows, followed by the vertical filtering, which processes the columns.

Multimedia Kernel                  Description
Matrix transpose                   Matrix transposition is an important kernel for many 2D media kernels.
Vector/matrix multiply             The vector/matrix multiply kernel is used in some multimedia standards.
Repetitive padding                 The pixel values at the boundary of a video object are replicated horizontally as well as vertically.
RGB-to-YCbCr                       Color space conversion, usually used in the encoder stage.
Horizontal DCT                     Used in most media standards to process the rows of images in order to remove spatial redundancy.
Horizontal IDCT                    Used in the multimedia standards to reconstruct the rows of the transformed images.
Horizontal filtering of the DWT    The horizontal filtering of the discrete wavelet transform processes the whole rows of an image.
Vertical filtering of the DWT      The vertical filtering of the DWT processes the whole columns of an image.
Vertical DCT                       Used in most media standards to process the columns of images in order to remove spatial redundancy.
Vertical IDCT                      Used in the multimedia standards to reconstruct the columns of the transformed images.
Add block                          Used in the decoder, during the block reconstruction stage of motion compensation.
2 × 2 Haar transform               Decomposes an image into four different bands.
Inverse 2 × 2 Haar transform       Reconstructs the original image from the different bands.
Paeth prediction                   Used in the PNG standard.
YCbCr-to-RGB                       Color space conversion, usually used in the decoder stage.
SAD function                       Used in the motion estimation kernel to remove temporal redundancies between video frames.
SAD function with interpolation    The SAD function with horizontal and vertical interpolation, used in motion estimation algorithms.
SAD function for image histograms  The SAD function used for similarity measurements of image histograms.
SSD function                       Used in the motion estimation kernel to remove temporal redundancies between video frames.
SSD function with interpolation    The SSD function with horizontal and vertical interpolation, used in motion estimation algorithms.

Table 4.2: Summary of multimedia kernels.

The SIMD approach is more efficient for the vertical filtering than for the horizontal filtering (in a row-major storage format), because the elements processed in each loop iteration are adjacent in memory. In order to increase the DLP in the SIMD implementation of the vertical filtering, the extended subwords technique is used, while the SIMD implementation of the horizontal filtering needs both proposed techniques in order to increase the DLP and also to avoid data rearrangement instructions. In other words, the MMMX implementations of both horizontal and vertical processing employ the extended subwords technique; in addition to extended subwords, the MRF is needed in the MMMX implementation of horizontal processing.

The kernels presented in Table 4.2 are responsible for a large portion of the execution time and contain a substantial amount of DLP [95, 50]. For example, profiling of a JPEG codec shows that RGB-to-YCbCr and YCbCr-to-RGB consume on average 13.1% and 28.7% of the total execution time, respectively. Other researchers [13, 12] have reported that color space conversion consumes up to 40% of the entire processing time of a highly optimized decoder. Additionally, in [82, 109, 128] it has been indicated that motion estimation takes about 60% to 80% of the encoding time; the sum-of-absolute-differences and sum-of-squared-differences functions are usually used in the motion estimation algorithm. This means that the applications can be accelerated by speeding up these kernels. In order to accelerate a media kernel while keeping it correct, it is first necessary to understand its algorithm. Therefore, the next section briefly explains the algorithms of the multimedia kernels and sketches their MMX and MMMX implementations.

4.2 Algorithm and SIMD Implementation of Kernels

This section briefly describes the algorithms of the selected multimedia kernels as well as their SIMD implementations using both the MMX and MMMX architectures.

4.2.1 Matrix Transpose

Matrix transposition is at the center of many 2D multimedia algorithms. Because of this, it is considered as a kernel benchmark. The MMMX implementation of the matrix transpose is straightforward: iteratively, a vector is loaded from memory and written to a column of the register file. Once a sub-matrix has been transposed in this way, it is written back to memory. The MMX implementation, on the other hand, is more difficult and requires many permutation instructions. For example, transposing an 8 × 8 matrix consisting of single-byte elements requires 56 MMX/SSE instructions to be executed. Figure 4.2 depicts a part of the MMX/SSE implementation to transpose an 8 × 8 block; after this code, only the first four columns have been transposed. The transposed columns are stored in the mm0, mm5, mm1, and mm2 registers.

In this chapter, the matrix transpose kernel was implemented for two different data types, byte and 12-bit values. For brevity, they will be referred to as Transp. (8) and Transp. (12), respectively. A 12-bit data format arises, for example, in the IDCT; in memory, however, these values are stored as 16-bit.

movq mm0, (blk1)   ; mm0 = a07 a06 a05 a04 a03 a02 a01 a00
movq mm1, 8(blk1)  ; mm1 = a17 a16 a15 a14 a13 a12 a11 a10
movq mm2, 16(blk1) ; mm2 = a27 a26 a25 a24 a23 a22 a21 a20
movq mm3, 24(blk1) ; mm3 = a37 a36 a35 a34 a33 a32 a31 a30
movq mm4, 32(blk1) ; mm4 = a47 a46 a45 a44 a43 a42 a41 a40
movq mm5, 40(blk1) ; mm5 = a57 a56 a55 a54 a53 a52 a51 a50
movq mm6, 48(blk1) ; mm6 = a67 a66 a65 a64 a63 a62 a61 a60
movq mm7, 56(blk1) ; mm7 = a77 a76 a75 a74 a73 a72 a71 a70
punpcklbw mm0, mm1 ; mm0 = a13 a03 a12 a02 a11 a01 a10 a00
punpcklbw mm2, mm3 ; mm2 = a33 a23 a32 a22 a31 a21 a30 a20
punpcklbw mm4, mm5 ; mm4 = a53 a43 a52 a42 a51 a41 a50 a40
punpcklbw mm6, mm7 ; mm6 = a73 a63 a72 a62 a71 a61 a70 a60
movq mm1, mm0      ; mm1 = a13 a03 a12 a02 a11 a01 a10 a00
movq mm3, mm4      ; mm3 = a53 a43 a52 a42 a51 a41 a50 a40
punpcklwd mm0, mm2 ; mm0 = a31 a21 a11 a01 a30 a20 a10 a00
punpcklwd mm4, mm6 ; mm4 = a71 a61 a51 a41 a70 a60 a50 a40
movq mm5, mm0      ; mm5 = a31 a21 a11 a01 a30 a20 a10 a00
punpckldq mm0, mm4 ; mm0 = a70 a60 a50 a40 a30 a20 a10 a00
punpckhdq mm5, mm4 ; mm5 = a71 a61 a51 a41 a31 a21 a11 a01
punpckhwd mm1, mm2 ; mm1 = a33 a23 a13 a03 a32 a22 a12 a02
punpckhwd mm3, mm6 ; mm3 = a73 a63 a53 a43 a72 a62 a52 a42
movq mm2, mm1      ; mm2 = a33 a23 a13 a03 a32 a22 a12 a02
punpckldq mm1, mm3 ; mm1 = a72 a62 a52 a42 a32 a22 a12 a02
punpckhdq mm2, mm3 ; mm2 = a73 a63 a53 a43 a33 a23 a13 a03

Figure 4.2: A part of the MMX/SSE code to transpose an 8 × 8 block.

In order to transpose a matrix larger than 8 × 8, a splitting technique is employed. In this technique, a matrix A of size N × N, where N = 2^n, can be transposed by

splitting the matrix into four sub-matrices Aij, i, j = 1, 2, of size N/2 × N/2 and by transposing the sub-matrices recursively as shown in the following equations.

A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \qquad
A^T = \begin{pmatrix} A_{11}^T & A_{21}^T \\ A_{12}^T & A_{22}^T \end{pmatrix}.
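This recursive splitting can be sketched in scalar C as follows (an illustrative sketch only; the function and parameter names are assumptions, and the 8 × 8 base case stands in for the vectorized sub-matrix transpose):

```c
#include <stddef.h>

/* Transpose the n x n sub-matrix of `a` (row stride `stride`) into the
   corresponding position of `b` (a != b). Recurses on the four N/2 x N/2
   sub-matrices; the 8 x 8 base case is what an SIMD implementation would
   handle with column accesses to the register file. */
static void transpose_rec(const unsigned char *a, unsigned char *b,
                          size_t n, size_t stride)
{
    if (n <= 8) {                       /* base case: direct transpose */
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++)
                b[j * stride + i] = a[i * stride + j];
        return;
    }
    size_t h = n / 2;
    transpose_rec(a,                  b,                  h, stride); /* A11 */
    transpose_rec(a + h,              b + h * stride,     h, stride); /* A12^T goes bottom-left */
    transpose_rec(a + h * stride,     b + h,              h, stride); /* A21^T goes top-right   */
    transpose_rec(a + h * stride + h, b + h * stride + h, h, stride); /* A22 */
}
```

The placement of the off-diagonal blocks mirrors the equation above: A12 transposed lands where A21 was, and vice versa.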

4.2.2 Vector/Matrix Multiply

Vector/matrix multiplication is another important kernel with many applications. For example, it has been used to implement FIR filters [26], the discrete wavelet transform [124], and, of course, matrix/matrix multiplication. The naive vector/matrix multiply algorithm traverses the matrix along the columns. Because in C matrices are typically stored in row-major order, the column elements are scattered in memory. This kernel will be referred to as V x M. In [34] two MMX implementations of vector/matrix multiplication have been presented. In the first implementation the matrix is split into 4 × 2 sub-matrices and the vector is split into sub-vectors of two elements. The C algorithm of this method is depicted in Figure 4.3, and the MMX implementation of the inner loop is illustrated in Figure 4.4.

int16 Vec[N]; int16 Mat[N][M]; int16 Res[M]; int32 Acc[4];

for (i = 0; i < M; i += 4) {          /* four output columns at a time */
    Acc[0] = Acc[1] = Acc[2] = Acc[3] = 0;
    for (j = 0; j < N; j += 2)        /* one 4 x 2 sub-matrix per step */
        Mult4x2(&Vec[j], &Mat[j][i], Acc);
    for (k = 0; k < 4; k++)
        Res[i+k] = (int16)Acc[k];
}

Figure 4.3: Pseudo C code for vector matrix multiply.

       mov       eax, N
       pxor      mm2, mm2      ; mm2 = |0 |0 |
       pxor      mm3, mm3      ; mm3 = |0 |0 |
loop2:
       movd      mm7, (Vec)    ; mm7 = |0  |0  |v1 |v0 |
       punpckldq mm7, mm7      ; mm7 = |v1 |v0 |v1 |v0 |
       movq      mm0, (Mat)    ; mm0 = |a03|a02|a01|a00|
       movq      mm6, 2*M(Mat) ; mm6 = |a13|a12|a11|a10|
       movq      mm1, mm0      ; mm1 = |a03|a02|a01|a00|
       punpcklwd mm0, mm6      ; mm0 = |a11|a01|a10|a00|
       punpckhwd mm1, mm6      ; mm1 = |a13|a03|a12|a02|
       pmaddwd   mm0, mm7      ; mm0 = |a11*v1+a01*v0|...|
       pmaddwd   mm1, mm7      ; mm1 = |a13*v1+a03*v0|...|
       paddd     mm2, mm0      ; mm2 = |0+a11*v1+a01*v0|...|
       paddd     mm3, mm1      ; mm3 = |0+a13*v1+a03*v0|...|
       add       Mat, 4*M      ; index to row 2
       add       Vec, 4        ; index to element v2
       sub       eax, 2
       jnz       loop2

Figure 4.4: The MMX implementation of the inner loop shown in Figure 4.3.

This algorithm processes four columns of the matrix in parallel (the Mult4x2 function) and accumulates the results in a set of four accumulators. However, the algorithm exhibits poor cache utilization. In the second method, the outer loop of the vector/matrix multiply algorithm is unrolled four times. This algorithm processes 16 columns of the matrix in each iteration of the inner loop, and each iteration of the outer loop calculates 16 elements of the output vector. The second algorithm performs better than the first: the results presented in [34] indicate that the optimized MMX implementation is up to 4 times faster than the unoptimized one. In this thesis the MMMX implementation is therefore compared to the optimized MMX implementation. For this kernel, a matrix size of 8 × 16 is used and the elements are assumed to be 12 bits wide (stored as 16-bit), as is the case in, for example, the Hadamard transform in image processing and in edge detection algorithms using different masks [55]. MMMX processes the 16 columns in two loop iterations. In each loop iteration an 8 × 8 sub-matrix is processed. To process each sub-matrix, MMMX first transposes it using eight load-column instructions and then employs MAC and fsum instructions. In other words, this kernel mainly benefits from the 8×12-bit multiply-add operation provided in MMMX.
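A scalar C sketch of the first method (names are hypothetical; the inner product corresponds to the MMX code of Figure 4.4, with 32-bit accumulation as in pmaddwd/paddd and saturation omitted):

```c
#include <stdint.h>

enum { N = 8, M = 16 };   /* N x M matrix of 12-bit data stored as int16_t */

/* Multiply the N-element row vector `vec` by `mat`, producing M outputs,
   four columns at a time with four 32-bit accumulators, two rows per
   inner-loop step (the 4 x 2 sub-matrix of the text). */
static void vxm_4col(const int16_t *vec, int16_t mat[N][M], int32_t *res)
{
    for (int c = 0; c < M; c += 4) {
        int32_t acc[4] = { 0, 0, 0, 0 };
        for (int r = 0; r < N; r += 2)
            for (int k = 0; k < 4; k++)
                acc[k] += vec[r]     * mat[r][c + k]
                        + vec[r + 1] * mat[r + 1][c + k];
        for (int k = 0; k < 4; k++)
            res[c + k] = acc[k];
    }
}
```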

4.2.3 Repetitive Padding

An important new feature in MPEG-4 is padding. Profiling results reported in [19, 20, 143] indicate that padding is a computationally demanding process. For example, the results presented in [143] indicate that 13% of the execution time of the MPEG-4 decoder is spent in the padding kernel. MPEG-4 defines Video Object Planes (VOPs) as arbitrarily shaped regions of a frame which usually correspond to objects. Motion estimation is defined over VOPs instead of frames. The padding process defines the color values of pixels outside the VOP. It consists of two steps. First, each horizontal line of a block is scanned. If a pixel is outside the VOP and between an end point of the line and an end point of a segment inside the VOP, it is replaced by the value of the end pixel of the segment inside the VOP. Otherwise, if the pixel is outside the VOP and between two end points of segments inside the VOP, it is replaced by the average of these two end points. In the second step the same procedure is applied to each vertical line of the block. In other words, the repetitive padding kernel consists of horizontal and vertical repetitive padding, whose functionality is the same: the vertical repetitive padding is equivalent to performing a transpose operation on the pixel and shape matrices obtained in the horizontal repetitive padding stage, followed by a second iteration of horizontal repetitive padding. Figure 4.5 illustrates horizontal and vertical repetitive padding for an 8 × 8 pixel block. In this figure VOP boundary pixels are indicated by a numerical value, interior pixels are denoted by X, and pixels outside the VOP are blank.


Figure 4.5: Repetitive padding for VOP boundary blocks.

The algorithm described in [14] has been used in both the MMX and the MMMX implementations. Like the previous one, that algorithm consists of a horizontal and a vertical repetitive padding step. The horizontal repetitive padding of this algorithm is illustrated in Figure 4.6. It consists of two stages. First, all pixel values are scanned from left to right. In addition, there is a flag bit for each pixel that indicates whether the pixel keeps its original value or has to be replaced by the value to its left. After the first stage, all positions that were replaced by a new value have a set flag, while positions whose values have not been changed have a zero flag. For example, as Figure 4.6 shows, the value 85 is propagated three times from left to right. Second, the intermediate results are scanned from right to left; in other words, pixel values are propagated from the right. If a position already received a new value in the first stage, the correct value for this position is the average of the present value and the value propagated from the right. For instance, the three positions replaced in the first stage receive the value (85 + 31)/2 = 58.

Before repetitive padding:
  Row input data     X   X   85   .   .   .  31  150
  Shape input data   x   x    1   0   0   0   1    1

After stage 1 of the horizontal repetitive padding:
  Row data           X   X   85  85  85  85  31  150
  Flags              x   x    0   1   1   1   0    0

After stage 2 of the horizontal repetitive padding:
  Row data           X   X   85  58  58  58  31  150
  Flags              x   x    1   1   1   1   1    1

Figure 4.6: An example of the horizontal repetitive padding using the described algorithm in [14].

In this algorithm, special instructions have been proposed for both horizontal and vertical repetitive padding [14]. In the MMX implementation both the input pixel block and the binary shape information have to be transposed. In addition, MMX employs the special-purpose pavgb instruction, which computes the arithmetic average of 8 pairs of bytes. If column-wise access to the register file is supported, on the other hand, both horizontal and vertical repetitive padding can be performed identically and efficiently using the MMMX instructions. Additionally, MMMX synthesizes the special-purpose pavgb instruction using the more general-purpose instructions fadd12 and fsar12 (shift arithmetic right on extended subwords).
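The two-stage horizontal padding can be sketched in scalar C for a single row (names are hypothetical; the averaging uses pavgb-style rounding, which reproduces the (85+31)/2 = 58 example above):

```c
/* Horizontal repetitive padding of one row of n pixels.
   shape[j] != 0 marks pixels inside the VOP; flag[j] records which
   positions received a propagated value in stage 1. */
static void pad_row(unsigned char *pix, const unsigned char *shape,
                    unsigned char *flag, int n)
{
    int have_left = 0;
    unsigned char left = 0;
    /* stage 1: scan left to right, propagating the last VOP pixel */
    for (int j = 0; j < n; j++) {
        if (shape[j])        { left = pix[j]; have_left = 1; flag[j] = 0; }
        else if (have_left)  { pix[j] = left; flag[j] = 1; }
        else                 { flag[j] = 0; }   /* nothing to the left yet */
    }
    int have_right = 0;
    unsigned char right = 0;
    /* stage 2: scan right to left; average where stage 1 already wrote */
    for (int j = n - 1; j >= 0; j--) {
        if (shape[j])        { right = pix[j]; have_right = 1; }
        else if (have_right) {
            if (flag[j]) pix[j] = (unsigned char)((pix[j] + right + 1) >> 1);
            else         pix[j] = right;
        }
    }
}
```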

4.2.4 (Inverse) Discrete Cosine Transform

The discrete cosine transform and its inverse are widely used in several image and video compression applications. JPEG and MPEG partition the input image into 8×8 blocks and perform a 2D DCT on each block. The input elements are often either 8- or 9-bit, and the output is an 8 × 8 block of 12-bit 2's complement data. A 2D DCT is efficiently computed by performing a 1D DCT on each row (horizontal DCT) followed by a 1D DCT on each column (vertical DCT). There are many different algorithms to implement the 2D DCT. One algorithm uses a transform matrix. The M × M transform matrix T = (t_{ij}), 0 ≤ i, j ≤ M − 1, is given by Equation (4.1).

T_{ij} = \begin{cases} \dfrac{1}{\sqrt{M}} & \text{if } i = 0,\ 0 \le j \le M-1, \\[1ex] \sqrt{\dfrac{2}{M}} \cos\dfrac{\pi(2j+1)i}{2M} & \text{if } 1 \le i \le M-1,\ 0 \le j \le M-1. \end{cases} \qquad (4.1)

For an M × M matrix A, T A is an M × M matrix whose columns contain the 1D DCT of the columns of A. The 2D DCT of A can be computed as B = T A T'. Since T is a real orthonormal matrix, its inverse is the same as its transpose [98]. Therefore, the inverse 2D DCT of B is given by T' B T. The second algorithm to compute the 2D DCT is the LLM algorithm [96]. This algorithm performs a 1D DCT on each row of the 8 × 8 block followed by a 1D DCT on each column of the transformed 8 × 8 block. The algorithm has four stages; the output of each stage is the input of the next stage. Figure 4.7 depicts the data flow graph of this algorithm for 8 pixels using fixed-point arithmetic. Both the C and MMX implementations of this algorithm are faster than the C and MMX implementations of the former algorithm, which defines the DCT as a matrix multiplication. One main reason is that the matrix multiplication algorithm requires more multiplications than the LLM algorithm. Therefore, the LLM algorithm has been used as the reference algorithm in this thesis. In the MMX implementation of this algorithm, however, 16-bit functionality (4-way parallelism) has been used because the input data is either 8- or 9-bit. This means that

Inputs:  x0 x1 x2 x3 x4 x5 x6 x7
Stage 1: butterflies (additions and subtractions), producing a0 ... a7
Stage 2: rotations (a4*c1 + a7*c2 + r)>>s, (a5*c3 + a6*c4 + r)>>s,
         (a5*(-c4) + a6*c3 + r)>>s, (a4*(-c2) + a7*c1 + r)>>s,
         plus butterflies, producing b0 ... b7
Stage 3: butterflies and the rotations (b2*c5 + b3*c6 + r)>>s,
         (b2*(-c6) + b3*c5 + r)>>s, producing d0 ... d7
Stage 4: butterflies and the scalings (d5*c7 + r)>>s, (d6*c7 + r)>>s
Outputs: X0 X1 X2 X3 X4 X5 X6 X7

Figure 4.7: Data flow graph of the 8-pixel DCT using the LLM algorithm [96]. The constants c_i, r, and s belong to the fixed-point implementation.

this kernel needs a 16-bit storage format, while the intermediate results are smaller than 16-bit. Data type conversion instructions are not needed because four 16-bit values can be loaded from memory into the four subwords of a media register. Although many rearrangement instructions are used, this implementation exploits 4-way parallelism in all stages. Figure 4.8 depicts the MMX/SSE implementation of the first stage of the LLM algorithm for the horizontal DCT. As this figure shows, some rearrangement instructions are required.

       movq   mm0, (dct)     ; mm0 = x03 x02 x01 x00
       movq   mm3, 8(dct)    ; mm3 = x07 x06 x05 x04
       pshufw mm1, mm0, 27   ; mm1 = x00 x01 x02 x03
       pshufw mm2, mm3, 27   ; mm2 = x04 x05 x06 x07
       paddsw mm0, mm2       ; mm0 = x03+x04 x02+x05 x01+x06 x00+x07
       psubsw mm1, mm3       ; mm1 = x00-x07 x01-x06 x02-x05 x03-x04

Figure 4.8: The MMX/SSE code of the first stage of the LLM algorithm for horizontal DCT.

MMMX processes eight rows in one iteration. A complete 8 × 8 block is loaded into eight column registers; after that, the row registers, each of which has eight subwords, are processed. Figure 4.9 depicts a part of the MMMX implementation of the LLM algorithm. In this figure, "X" denotes x_i0 to x_i7, where 0 ≤ i ≤ 7. First, eight load-column instructions load a complete 8 × 8 block into the column registers. After that, two instructions (an fadd12 and an fsub12) process 16 pixels simultaneously. In MMX, on the other hand, four instructions (two pshufw instructions, a paddsw, and a psubsw instruction) are required to process eight pixels.

fldc16s12 3mxc0, (dct)     ; 3mxc0 = x07 x06 x05 x04 x03 x02 x01 x00
fldc16s12 3mxc1, 16(dct)   ; 3mxc1 = x17 x16 x15 x14 x13 x12 x11 x10
fldc16s12 3mxc2, 32(dct)   ; 3mxc2 = x27 x26 x25 x24 x23 x22 x21 x20
fldc16s12 3mxc3, 48(dct)   ; 3mxc3 = x37 x36 x35 x34 x33 x32 x31 x30
fldc16s12 3mxc4, 64(dct)   ; 3mxc4 = x47 x46 x45 x44 x43 x42 x41 x40
fldc16s12 3mxc5, 80(dct)   ; 3mxc5 = x57 x56 x55 x54 x53 x52 x51 x50
fldc16s12 3mxc6, 96(dct)   ; 3mxc6 = x67 x66 x65 x64 x63 x62 x61 x60
fldc16s12 3mxc7, 112(dct)  ; 3mxc7 = x77 x76 x75 x74 x73 x72 x71 x70
fst12s16s 112(dct), 3mx7   ; (mem) = x77 x67 x57 x47 x37 x27 x17 x07
fmov      3mx7, 3mx0       ; 3mx7  = x70 x60 x50 x40 x30 x20 x10 x00
fadd12    3mx0, 112(dct)   ; 3mx0  = X X X X X X X x00+x07
fsub12    3mx7, 112(dct)   ; 3mx7  = X X X X X X X x00-x07

Figure 4.9: A part of the MMMX implementation of the horizontal DCT algorithm. "X" denotes x_i0 to x_i7, where 0 ≤ i ≤ 7.

The IDCT is the inverse of the DCT and can be computed using the same algorithm with the stages reversed. Like the 2D DCT, the 2D IDCT is divided into two 1D IDCTs, namely the horizontal IDCT and the vertical IDCT.

4.2.5 Discrete Wavelet Transform

The DWT provides a time-frequency representation of a signal. It is a multi-resolution technique that can analyze different frequencies at different resolutions. The DWT is computed by performing lowpass and highpass filtering of the image pixels as shown in Figure 4.10, in which the lowpass and highpass filters are denoted by h and g, respectively. The figure depicts three levels of the 2D DWT decomposition. At each level, the highpass filter generates detail information, while the lowpass filter produces coarse approximations of the input image. After each lowpass and highpass filtering step, the outputs are downsampled by two (↓ 2). As was mentioned in Section 4.1.2, the 2D DWT is separable and is obtained from two 1D DWTs: a 2D DWT can be performed by first performing a 1D DWT on each row of the image, which is referred to as horizontal filtering, followed by a 1D DWT on each column, which is called vertical filtering.


Figure 4.10: Three-level 2D DWT decomposition of an input image using the filtering approach. The h and g symbols denote the lowpass and highpass filters, respectively; (↓ 2) denotes downsampling of the output coefficients by two.

There are different approaches to implement the 2D DWT, such as the traditional convolution-based methods and the lifting scheme. The convolutional methods apply filtering by multiplying the filter coefficients with the input samples and accumulating the results; their implementation is similar to an FIR filter implementation and requires a large number of computations. The lifting scheme has been proposed for the efficient implementation of the 2D DWT; lifting approaches need 50% fewer computations than the FIR approaches. The basic idea of the lifting scheme is to use the correlation in the image pixel values to remove redundancy [38, 48]. The lifting scheme has three phases, namely split, predict, and update, as illustrated in Figure 4.11. In the split stage, the input sequence is split into

Input sequence → Split → Predict (−) → highpass output
                       → Update (+) → lowpass output

Figure 4.11: The three phases of the lifting scheme.

two sub-sequences consisting of the even and odd samples. In the predict and update stages, the highpass and lowpass values are computed, respectively. One example of this class is the integer-to-integer (5, 3) lifting scheme ((5, 3) lifting). Figure 4.12 depicts the C code of the horizontal filtering of the (5, 3) lifting transform for an N × M image. Since pixel values are assumed to be 8-bit, the MMX implementation needs both data type conversion and rearrangement instructions: the intermediate results are larger than 8-bit, and many data permutation instructions are required to put adjacent pixels in different registers. Figure 4.12 shows that loading the adjacent pixels from locations (i, j), (i, j − 1), and (i, j + 1) is necessary to calculate a highpass value. In the MMMX implementation, by contrast, these overhead instructions are avoided thanks to the extended subwords and MRF techniques.

void Lifting53_horizontal() {
    for (i = 0; i < N; i++)
        for (j = 1, jj = 0; j < M - 1; j += 2, jj++) {
            // calculation of highpass values
            tmp[i][jj+M/2] = img[i][j] - ((img[i][j-1] + img[i][j+1]) >> 1);
            // calculation of lowpass values
            tmp[i][jj] = img[i][j-1] + ((tmp[i][jj+M/2] + tmp[i][jj+M/2-1] + 2) >> 2);
        }
}

Figure 4.12: C implementation of the horizontal filtering of the (5, 3) lifting scheme.

4.2.6 Add Block

For encoding a block or macroblock in intra-coded mode in standards such as MPEG-4, a prediction block is formed based on previously reconstructed blocks. The residual signal between the current block and the prediction is encoded. This residual data can be larger than 8-bit. The add block kernel is used in the decoder, during the block reconstruction stage of motion compensation. This kernel requires 9 bits of intermediate precision. Consequently, the MMX implementation needs to unpack the input data from 8- to 16-bit and pack the 16-bit results back to 8-bit. In the MMMX implementation this overhead is not required. Figures 4.13 and 4.14 depict the MMX and MMMX implementations of the inner loop of the add block kernel.

4.2.7 2 × 2 Haar Transform

The 2 × 2 Haar transform is used in image compression [145] and is sometimes referred to as a wavelet transform. A 2D Haar transform can be performed by first performing a 1D Haar transform on each row (horizontal Haar transform) followed by a 1D Haar transform on each column (vertical Haar transform). This transform decomposes an image into four different bands. The 1D 2 × 2 Haar transform replaces adjacent pixel values with their sums and differences. Figure 4.15 depicts an example of the 2D 2 × 2 Haar transform for a 4 × 4 image. This transform generates four subbands of lowpass and highpass values. A part of the MMX code for the inner loop of the 2D 2 × 2 Haar transform is depicted in Figure 4.16.

       mov       eax , M
       pxor      mm7 , mm7
loop2:
       movq      mm0 , (Blk1)
       movq      mm1 , mm0
       punpcklbw mm0 , mm7
       punpckhbw mm1 , mm7
       paddsw    mm0 , (Blk2)
       paddsw    mm1 , (Blk2+8)
       packuswb  mm0 , mm1
       movq      (Blk1), mm0
       add       Blk1 , 8
       add       Blk2 , 16
       dec       eax
       jnz       loop2

Figure 4.13: The MMX implementation of the inner loop of the add block kernel.

       mov       eax , M
loop2:
       fld8u12   3mx0 , (Blk1)
       fld16s12  3mx1 , (Blk2)
       fadd12    3mx0 , 3mx1
       fst12s8   (Blk1), 3mx0
       add       Blk1 , 8
       add       Blk2 , 16
       dec       eax
       jnz       loop2

Figure 4.14: The MMMX implementation of the inner loop of the add block kernel.
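A scalar C reference for the same inner loop (a sketch with hypothetical names; the clamp mimics the unsigned-byte saturation of packuswb):

```c
#include <stdint.h>

/* Reconstruct n pixels: add the 16-bit residual blk2 to the 8-bit
   prediction blk1, clamping the sum to 0..255. */
static void add_block(uint8_t *blk1, const int16_t *blk2, int n)
{
    for (int i = 0; i < n; i++) {
        int v = blk1[i] + blk2[i];
        blk1[i] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
    }
}
```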

Input pixels:
  x00 x01 x02 x03
  x10 x11 x12 x13
  x20 x21 x22 x23
  x30 x31 x32 x33

Horizontal Haar transform (lowpass values | highpass values):
  x00+x01  x02+x03 | x00-x01  x02-x03
  x10+x11  x12+x13 | x10-x11  x12-x13
  x20+x21  x22+x23 | x20-x21  x22-x23
  x30+x31  x32+x33 | x30-x31  x32-x33

Vertical Haar transform (subband 0 | subband 1 above, subband 2 | subband 3 below):
  (x00+x01)+(x10+x11)  (x02+x03)+(x12+x13) | (x00-x01)+(x10-x11)  (x02-x03)+(x12-x13)
  (x20+x21)+(x30+x31)  (x22+x23)+(x32+x33) | (x20-x21)+(x30-x31)  (x22-x23)+(x32-x33)
  -----------------------------------------------------------------------------------
  (x00+x01)-(x10+x11)  (x02+x03)-(x12+x13) | (x00-x01)-(x10-x11)  (x02-x03)-(x12-x13)
  (x20+x21)-(x30+x31)  (x22+x23)-(x32+x33) | (x20-x21)-(x30-x31)  (x22-x23)-(x32-x33)

Figure 4.15: 2D 2 × 2 Haar transform using two 1D horizontal and vertical Haar transform.
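In scalar C, one 2 × 2 block of the transform reduces to the following sums and differences (an illustrative sketch with assumed names; subband storage offsets are omitted):

```c
/* One 2 x 2 Haar step: horizontal then vertical pass for a single block.
   s0..s3 receive the block's contribution to subbands 0..3 of Fig. 4.15. */
static void haar2x2(int x00, int x01, int x10, int x11,
                    int *s0, int *s1, int *s2, int *s3)
{
    int l0 = x00 + x01, h0 = x00 - x01;   /* horizontal pass, row 0 */
    int l1 = x10 + x11, h1 = x10 - x11;   /* horizontal pass, row 1 */
    *s0 = l0 + l1;                        /* subband 0: vertical lowpass of lowpass  */
    *s1 = h0 + h1;                        /* subband 1: vertical lowpass of highpass */
    *s2 = l0 - l1;                        /* subband 2: vertical highpass of lowpass */
    *s3 = h0 - h1;                        /* subband 3: vertical highpass of highpass */
}
```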

The detailed implementation of this kernel has been presented in [68]. The horizontal and vertical Haar transforms are combined and performed in a single pass. First, sixteen byte values are loaded into the mm0 and mm1 registers from two consecutive rows, row0 and row1. The punpcklbw and punpckhbw instructions unpack the byte data to 16-bit words. The paddw and psubw instructions calculate the sums and differences of the loaded rows. After this, because both operands are in the same register and because MMX has no instruction that adds or subtracts adjacent elements, the pmaddwd instruction (a MAC operation) with some multiplicands set to 1 and others to -1 (LHB01 and LHB23) is used for the final addition or subtraction.

; LHB01 and LHB23 are operands
; for the MAC operation.
LHB01 dw 1, 1, 1, 1
LHB23 dw 1, -1, 1, -1
loop2:
       movq      mm0, (row0)
       movq      mm1, (row1)
       pxor      mm7, mm7
       movq      mm4, mm0
       movq      mm5, mm1
       punpcklbw mm0, mm7
       punpckhbw mm4, mm7
       movq      mm2, mm0
       punpcklbw mm1, mm7
       punpckhbw mm5, mm7
       paddw     mm0, mm1
       psubw     mm2, mm1
       movq      mm1, mm0
       pmaddwd   mm0, LHB01
       movq      mm3, mm2
       pmaddwd   mm2, LHB01
       movq      mm6, mm4
       pmaddwd   mm1, LHB23
       paddw     mm4, mm5
       pmaddwd   mm3, LHB23
       .
       .
       jnz       loop2

Figure 4.16: A part of the MMX code for the 2D 2 × 2 Haar transform.

loop2:
       fld8u12   3mx0, row0
       fld8u12   3mx1, row1
       fmov      3mx2, 3mx0
       fadd12    3mx0, 3mx1
       fsub12    3mx2, 3mx1
       fmov      3mx1, 3mx0
       fsum12    3mx0
       fdiff12   3mx1
       fmov      3mx3, 3mx2
       fsum12    3mx2
       fdiff12   3mx3
       .
       .
       jnz       loop2

Figure 4.17: A part of the MMMX code for the 2D 2 × 2 Haar transform.

The MMMX code for the 2D 2 × 2 Haar transform is depicted in Figure 4.17. In the MMMX implementation the horizontal and vertical Haar transforms are also combined in a single loop. MMMX calculates the sums and differences of the pixels loaded from consecutive rows with the fadd12 and fsub12 instructions. In addition, MMMX has instructions that add or subtract adjacent subwords in the same register. Because of this, MMMX reduces the number of instructions in the inner loop by almost a factor of 2 (from 20 to 11) compared to MMX. Both MMX and MMMX store the different subbands in their appropriate places as depicted in Figure 4.15. Because of this, the implementation of the inverse 2 × 2 Haar transform is easier than the forward transform. The inverse transform uses data from the different subbands to reconstruct 2 × 2 blocks. Figure 4.18 illustrates how a 2 × 2 block is reconstructed from the subband data.

With the subband values s0 = (x00+x01)+(x10+x11), s1 = (x00-x01)+(x10-x11), s2 = (x00+x01)-(x10+x11), and s3 = (x00-x01)-(x10-x11):

x00 = (s0 + s1 + s2 + s3) / 4
x01 = (s0 - s1 + s2 - s3) / 4
x10 = (s0 + s1 - s2 - s3) / 4
x11 = (s0 - s1 - s2 + s3) / 4

Figure 4.18: An example of the inverse 2D 2 × 2 Haar transform that uses subband data to reconstruct a 2 × 2 block.
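The reconstruction of Figure 4.18 can be written compactly in C, with s0 ... s3 denoting the four subband values of one block (names assumed):

```c
/* Inverse of the 2 x 2 Haar step: rebuild a block from its four
   subband values, matching the sign pattern of Figure 4.18. */
static void ihaar2x2(int s0, int s1, int s2, int s3,
                     int *x00, int *x01, int *x10, int *x11)
{
    *x00 = (s0 + s1 + s2 + s3) / 4;
    *x01 = (s0 - s1 + s2 - s3) / 4;
    *x10 = (s0 + s1 - s2 - s3) / 4;
    *x11 = (s0 - s1 - s2 + s3) / 4;
}
```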

Both MMX and MMMX can load data from the four subbands into four SIMD registers. MMX loads four 16-bit values from a single subband into a register, while MMMX loads eight 16-bit values. No data permutation instructions are needed in MMX; all sums and differences are performed with normal add and subtract instructions. MMX uses the packuswb instruction to truncate the results to one byte, while in MMMX this operation is performed by the store instructions. To reorder the results and store them in their appropriate places, both architectures use unpack instructions.

4.2.8 Paeth Prediction

Paeth prediction is used in the PNG standard [114]. The Paeth predictor returns the pixel a, b, or c that is closest to the initial prediction a + b - c, where a is the element to the west of the current pixel d, b the element to the north, and c the element to the northwest, as depicted in Figure 4.19. A pseudo-code description of the Paeth predictor (for one pixel) is given in Figure 4.20. Because a, b, and c are unsigned bytes, p = a+b-c is in the range -255 . . . 510 (10 bits), pa = |p - a| = |b - c| is in the range 0 . . . 255 (8 bits), pb = |p - b| = |a - c|

−  −  −  −
−  c  b  −
−  a  d  .
.  .  .  .

Figure 4.19: Illustration of the a, b, and c pixels according to PNG specification.

unsigned byte Paeth_predict(a, b, c) {
    p  = a+b-c   /* initial prediction */
    pa = |p-a|
    pb = |p-b|
    pc = |p-c|
    if (pa<=pb AND pa<=pc) return a
    else if (pb<=pc)       return b
    else                   return c
}

Figure 4.20: Pseudo-code description of the Paeth predictor.

is also in the range 0 . . . 255, and pc = |p - c| = |a + b - 2c| is in the range -510 . . . 510 (10 bits). This implies that 12 bits are sufficient to perform this kernel without overflow. MMX, on the other hand, needs to unpack to 16-bit values; it calculates the Paeth predictor for four pixels using 4-way parallelism, as depicted in Figure 4.21. MMMX loads eight pixel values into its registers and provides 8-way parallelism: the MMMX implementation computes the prediction for 8 pixels in a single iteration. A part of this implementation is shown in Figure 4.22.
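These range claims can be verified exhaustively in C (a quick brute-force check, not part of the thesis code; the names are assumptions):

```c
/* Exhaustively verify the intermediate ranges of the Paeth predictor
   for 8-bit inputs: p = a+b-c spans -255..510 and a+b-2c spans
   -510..510, so 12-bit signed arithmetic never overflows. */
static int paeth_ranges_ok(void)
{
    int pmin = 0, pmax = 0, pcmin = 0, pcmax = 0;
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            for (int c = 0; c < 256; c++) {
                int p  = a + b - c;
                int pc = p - c;              /* = a + b - 2c */
                if (p  < pmin)  pmin  = p;
                if (p  > pmax)  pmax  = p;
                if (pc < pcmin) pcmin = pc;
                if (pc > pcmax) pcmax = pc;
            }
    return pmin == -255 && pmax == 510 && pcmin == -510 && pcmax == 510;
}
```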

4.2.9 Color Space Conversion

Multimedia standards usually use the RGB-to-YCbCr and YCbCr-to-RGB color space conversions in their encoders and decoders, respectively. The conversion between the YCbCr and RGB formats and vice versa can be represented by the following equations [108].

\begin{pmatrix} Y \\ Cb \\ Cr \end{pmatrix} =
\begin{pmatrix} 0.256 & 0.502 & 0.098 \\ -0.148 & -0.290 & 0.438 \\ 0.438 & -0.366 & -0.071 \end{pmatrix}
\begin{pmatrix} R \\ G \\ B \end{pmatrix} +
\begin{pmatrix} 16.5 \\ 128.5 \\ 128.5 \end{pmatrix} \qquad (4.2)

       mov       eax, M
       pxor      mm0, mm0      ; mm0 = |0 |0 |0 |0 |0 |0 |0 |0 |
       pxor      mm7, mm7      ; mm7 = |0 |0 |0 |0 |0 |0 |0 |0 |
loop1:
       movd      mm1, (Paeth)  ; mm1 = |0 |0 |0 |0 |c3|c2|c1|c0|
       movd      mm2, 4(Paeth) ; mm2 = |0 |0 |0 |0 |b3|b2|b1|b0|
       movd      mm3, M(Paeth) ; mm3 = |0 |0 |0 |0 |a3|a2|a1|a0|
       punpcklbw mm1, mm0      ; mm1 = | c3 | c2 | c1 | c0 |
       punpcklbw mm2, mm0      ; mm2 = | b3 | b2 | b1 | b0 |
       punpcklbw mm3, mm0      ; mm3 = | a3 | a2 | a1 | a0 |
       movq      mm4, mm1      ; mm4 = | c3 | c2 | c1 | c0 |
       movq      mm5, mm2      ; mm5 = | b3 | b2 | b1 | b0 |
       movq      mm6, mm3      ; mm6 = | a3 | a2 | a1 | a0 |
       psubsw    mm2, mm1      ; mm2 = |b3-c3|b2-c2|b1-c1|b0-c0|
       psubsw    mm3, mm1      ; mm3 = |a3-c3|a2-c2|a1-c1|a0-c0|
       movq      mm1, mm2      ; mm1 = |b3-c3|b2-c2|b1-c1|b0-c0|
       paddsw    mm1, mm3      ; mm1 = |a3+b3-2c3|...|a0+b0-2c0|
       psubsw    mm7, mm2      ; mm7 = |c3-b3|c2-b2|c1-b1|c0-b0|
       pmaxsw    mm2, mm7      ; mm2 = max((bi-ci),(ci-bi))
       pxor      mm7, mm7      ; mm7 = |0 |0 |0 |0 |0 |0 |0 |0 |
       psubsw    mm7, mm3      ; mm7 = |c3-a3|c2-a2|c1-a1|c0-a0|
       pmaxsw    mm3, mm7      ; mm3 = max((ai-ci),(ci-ai))
       .
       .
       sub       eax, 4
       jnz       loop1

Figure 4.21: A part of the MMX code for the Paeth predictor kernel.

\begin{pmatrix} R \\ G \\ B \end{pmatrix} =
\begin{pmatrix} 1.164 & 0.000 & 1.596 \\ 1.164 & -0.392 & -0.813 \\ 1.164 & 2.017 & 0.000 \end{pmatrix}
\begin{pmatrix} Y - 16.5 \\ Cb - 128.5 \\ Cr - 128.5 \end{pmatrix} \qquad (4.3)

In both equations, the coefficients have been rounded to three fractional decimal digits. As these equations show, the color space conversions are defined using floating-point arithmetic, but here fixed-point arithmetic is used to avoid floating-point operations. Specifically, the MMX implementation uses 16-bit fixed-point numbers, while the MMMX implementation approximates the conversion using 12-bit fixed-point arithmetic. To determine the accuracy of these approximations, two different tests have been performed. First, the maximum absolute error has been measured by checking all possible RGB values (0 ≤ R, G, B ≤ 255). For both the MMMX implementation (12-bit) and the MMX implementation (16-bit), the maximum absolute error compared to a single-precision floating-point implementation is 1. Second, the Mean Square Error (MSE) has been measured for real images as well

       mov      eax , M
loop1:
       fld8u12  3mx1, (Paeth)   ; 3mx1 = |c7|c6|c5|c4|c3|c2|c1|c0|
       fld8u12  3mx2, 8(Paeth)  ; 3mx2 = |b7|b6|b5|b4|b3|b2|b1|b0|
       fld8u12  3mx3, M(Paeth)  ; 3mx3 = |a7|a6|a5|a4|a3|a2|a1|a0|
       fmov     3mx4, 3mx1      ; 3mx4 = |c7|c6|c5|c4|c3|c2|c1|c0|
       fmov     3mx5, 3mx2      ; 3mx5 = |b7|b6|b5|b4|b3|b2|b1|b0|
       fmov     3mx6, 3mx3      ; 3mx6 = |a7|a6|a5|a4|a3|a2|a1|a0|
       fsub12   3mx2, 3mx1      ; 3mx2 = |b7-c7|b6-c6|.....|b0-c0|
       fsub12   3mx3, 3mx1      ; 3mx3 = |a7-c7|a6-c6|.....|a0-c0|
       fmov     3mx1, 3mx2      ; 3mx1 = |b7-c7|b6-c6|.....|b0-c0|
       fadd12   3mx1, 3mx3      ; 3mx1 = |a7+b7-2c7|...|a0+b0-2c0|
       fneg12   3mx7, 3mx2, 255 ; 3mx7 = |c7-b7|c6-b6|.....|c0-b0|
       fmax12   3mx2, 3mx7      ; 3mx2 = max((bi-ci),(ci-bi))
       fneg12   3mx7, 3mx3, 255 ; 3mx7 = |c7-a7|c6-a6|.....|c0-a0|
       fmax12   3mx3, 3mx7      ; 3mx3 = max((ai-ci),(ci-ai))
       .
       .
       sub      eax, 8
       jnz      loop1

Figure 4.22: A part of the MMMX code for the Paeth predictor kernel.

as randomly generated inputs. Figure 4.23 depicts the MSE of the 8-, 12-, and 16-bit implementations as a function of the image size. It shows that the MSEs of the 12- and 16-bit implementations are very close to each other, while the MSE of the 8-bit implementation is much larger.

Figure 4.23: Mean square error in the implementation of color space conversion for different bit widths and image sizes.
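As an illustration of such a fixed-point approximation, the Y row of Equation (4.2) can be computed with integer arithmetic as follows (a sketch: the 2^11 scale matches the 1028/2^11 approximation of 0.502 used in the text, but the exact MMMX datapath and shift differ, and the names are assumptions):

```c
#include <stdint.h>

/* RGB -> Y using the coefficients of Equation (4.2) scaled by 2^11. */
static uint8_t rgb_to_y(uint8_t r, uint8_t g, uint8_t b)
{
    const int32_t cr = (int32_t)(0.256 * 2048 + 0.5);   /* 524          */
    const int32_t cg = 1028;                            /* 0.502 * 2^11 */
    const int32_t cb = (int32_t)(0.098 * 2048 + 0.5);   /* 201          */
    int32_t y = (cr * r + cg * g + cb * b + (int32_t)(16.5 * 2048)) >> 11;
    return (uint8_t)(y < 0 ? 0 : y > 255 ? 255 : y);
}
```

For black and white inputs this agrees with the floating-point equation to within 1, matching the maximum-absolute-error measurement above.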

In the remainder of this section, the SIMD implementations of the RGB-to-YCbCr and YCbCr-to-RGB kernels are discussed in detail.
The RGB values are usually in the band interleaved format, i.e., they are stored as R1G1B1, R2G2B2, R3G3B3, etc. Because of this, a straightforward MMX implementation of the RGB-to-YCbCr kernel is not efficient, for the following reasons. First, image pixels must be unpacked from unsigned byte to 16-bit and vice versa because of the mismatch between the storage format and the computational format. This means that many data type conversion instructions are needed. Second, there are four 16-bit subwords in each MMX register and three R, G, and B values for each pixel. This implies that one of the subwords (a quarter of the processing capacity) remains unused. Third, there is no instruction in the MMX ISA that adds adjacent pixels; to synthesize this operation many shift instructions are used, which basically perform scalar addition. Additionally, there are unaligned memory accesses because two pixels are processed in each loop iteration and their starting address is not necessarily a multiple of 8.
To implement this kernel more efficiently, the band interleaved format is first converted to the band separated format using rearrangement instructions. Experimental results on an actual machine show that the MMX implementation using the band separated format is 4.20 times faster than the straightforward MMX implementation for an image of size 576 × 768. The faster method is used as the reference. The MMX implementation using the band separated format consists of the following stages:

1. Load the RGB values of eight pixels into the media register file (3 instructions).
2. Convert from band interleaved to band separated format using rearrangement instructions (35 instructions).
3. Unpack the packed byte data types to packed 16-bit word data types (6 instructions).
4. Shift the RGB values to the left by 7 bits (6 instructions).
5. Convert from RGB to YCbCr using 16-bit packed multiplication and addition instructions (51 instructions).
6. Truncate the results by shifting them to the right by 6 bits (6 instructions).
7. Pack the results and store them in memory (12 instructions).

This MMX implementation is still not very efficient because many rearrangement and data type conversion instructions are required. For instance, 35 instructions are needed to convert 8 pixels from the band interleaved format to the band separated format. Figure 4.24 shows a part of the MMX code that achieves this rearrangement; many unpack, shift, and data transfer instructions are required. As a result, both MMX implementations are inefficient.

movq mm0, (RGB)    ; mm0 = g3 r3 b2 g2 r2 b1 g1 r1
movq mm2, 8(RGB)   ; mm2 = r6 b5 g5 r5 b4 g4 r4 b3
movq mm1, 16(RGB)  ; mm1 = b8 g8 r8 b7 g7 r7 b6 g6
movq mm3, mm0      ; mm3 = g3 r3 b2 g2 r2 b1 g1 r1
movq mm4, mm0      ; mm4 = g3 r3 b2 g2 r2 b1 g1 r1
psrlq mm3, 24      ; mm3 = 0 0 0 g3 r3 b2 g2 r2
punpcklbw mm4, mm3 ; mm4 = r3 r2 b2 b1 g2 g1 r2 r1
movq mm6, mm2      ; mm6 = r6 b5 g5 r5 b4 g4 r4 b3
movq mm7, mm2      ; mm7 = r6 b5 g5 r5 b4 g4 r4 b3
psrlq mm3, 24      ; mm3 = 0 0 0 0 0 0 g3 r3
psllq mm6, 16      ; mm6 = g5 r5 b4 g4 r4 b3 0 0
psrlq mm7, 8       ; mm7 = 0 r6 b5 g5 r5 b4 g4 r4
por mm6, mm3       ; mm6 = g5 r5 b4 g4 r4 b3 g3 r3
punpcklbw mm6, mm7 ; mm6 = r5 r4 b4 b3 g4 g3 r4 r3
movq mm3, mm4      ; mm3 = r3 r2 b2 b1 g2 g1 r2 r1
punpcklwd mm4, mm6 ; mm4 = g4 g3 g2 g1 r4 r3 r2 r1
punpckhwd mm3, mm6 ; mm3 = r5 r4 r3 r2 b4 b3 b2 b1
movq mm0, mm2      ; mm0 = r6 b5 g5 r5 b4 g4 r4 b3
movq mm6, mm1      ; mm6 = b8 g8 r8 b7 g7 r7 b6 g6
psrlq mm2, 32      ; mm2 = 0 0 0 0 r6 b5 g5 r5
psrlq mm0, 56      ; mm0 = 0 0 0 0 0 0 0 r6
psllq mm6, 8       ; mm6 = g8 r8 b7 g7 r7 b6 g6 0
por mm0, mm6       ; mm0 = g8 r8 b7 g7 r7 b6 g6 r6
punpcklbw mm2, mm0 ; mm2 = r7 r6 b6 b5 g6 g5 r6 r5
movq mm0, mm1      ; mm0 = b8 g8 r8 b7 g7 r7 b6 g6
psrlq mm1, 16      ; mm1 = 0 0 b8 g8 r8 b7 g7 r7
psrlq mm0, 40      ; mm0 = 0 0 0 0 0 b8 g8 r8
punpcklbw mm1, mm0 ; mm1 = 0 r8 b8 b7 g8 g7 r8 r7
movq mm6, mm2      ; mm6 = r7 r6 b6 b5 g6 g5 r6 r5
punpcklwd mm2, mm1 ; mm2 = g8 g7 g6 g5 r8 r7 r6 r5
punpckhwd mm6, mm1 ; mm6 = 0 r8 r7 r6 b8 b7 b6 b5
movq mm1, mm4      ; mm1 = g4 g3 g2 g1 r4 r3 r2 r1
punpckldq mm1, mm2 ; mm1 = r8 r7 r6 r5 r4 r3 r2 r1
punpckhdq mm4, mm2 ; mm4 = g8 g7 g6 g5 g4 g3 g2 g1
punpckldq mm3, mm6 ; mm3 = b8 b7 b6 b5 b4 b3 b2 b1

Figure 4.24: The MMX instructions needed to convert RGB values from band interleaved format to band separated format.

In the MMMX implementation of the RGB-to-YCbCr kernel, the conversion from the band interleaved format to the band separated format is not needed due to the MRF technique, as was illustrated in Figure 3.6 in Section 3.2 of Chapter 3. In addition, the data type conversion instructions are avoided and 8-way parallelism is provided using the extended subwords technique. For example, the MMMX architecture needs 115 instructions to process 16 pixels, while the MMX architecture needs 300 instructions. Figure 4.25 depicts an example that illustrates how the MMMX fmul12h instruction provides 8-way parallelism for this kernel. The coefficient value that must be multiplied with the green value of an image pixel is 0.502, as was depicted in Equation (4.2); it is approximated by 1028/2^11.

[Figure 4.25 shows four steps on eight 12-bit subwords. (1) 3mx0 is loaded with eight green values: 30, 40, 45, 35, 255, 250, 105, 85. (2) fsll12 3mx0, 1 shifts each subword left by one bit: 60, 80, 90, 70, 510, 500, 210, 170. (3) 3mx1 is loaded with eight copies of 1028 (0.502 ≈ 1028/2^11). (4) fmul12h 3mx0, 3mx1 produces the rounded fixed-point products: 15, 20, 23, 18, 128, 125, 53, 43. For comparison, the exact products with 1028/2^11 are 15.058, 20.078, 22.587, 17.568, 127.998, 125.488, 52.705, 42.666, and the exact products with 0.502 are 15.06, 20.08, 22.59, 17.57, 128.01, 125.5, 52.71, 42.67.]

Figure 4.25: Partitioned multiplication using the fmul12h instruction.

This figure shows four steps. In the first step, eight green pixel values are loaded into a media register (3mx0). In the second step, the subwords are shifted left by one bit using MMMX's fsll12 instruction. This is necessary because the fmul12h instruction truncates the internal 24-bit result between the 11th and 12th bit positions: the lower 12 bits are discarded. To compensate for the 2^11 scaling of the coefficient, the subwords must therefore be shifted one bit to the left. The fixed-point coefficient 1028 itself would exceed the 12-bit signed range if it were shifted left by one bit; because of this, the first operand is shifted instead. In the third step, the value 1028 is stored in another media register (3mx1) eight times. Finally, the shifted values are multiplied by 1028 using the fmul12h 3mx0, 3mx1 instruction.

There can be some loss of precision due to this kind of instruction, as already discussed in Section 3.3 of Chapter 3. The first error is due to quantization: the coefficient is 0.502, while 1028/2^11 = 0.501953125. The second source of precision loss is truncation. To reduce the effect of this error, the intermediate 24-bit result is first rounded internally, and only then is the 12-bit result truncated. If one compares the fixed-point results with the floating-point results shown in the last row of Figure 4.25, one can see that there is only a small error. Specifically, the rounded result of multiplying the third subword 250 by 1028/2^11 is 125, while the rounded result of multiplying 250 by 0.502 is 126.

Both MMX and MMMX store the calculated Y, Cb, and Cr values in separate arrays. This makes the SIMD implementation of the YCbCr-to-RGB kernel somewhat simpler than the SIMD implementation of the RGB-to-YCbCr kernel. MMX and MMMX can load eight Y, eight Cb, and eight Cr values into different registers, so they do not need the data permutation instructions and the MRF technique, respectively.
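The arithmetic of steps 2 and 4 can be checked with a small scalar model of one fmul12h lane; the rounding behavior follows the description in the text, and the function names are illustrative:

```c
/* One lane of an fmul12h-style multiply as described in the text:
   the 12-bit x 12-bit product is a 24-bit internal result, which is
   rounded and then truncated so that only bits 23..12 remain. */
static int fmul12h_lane(int a, int b)
{
    int prod = a * b;                 /* 24-bit internal result */
    return (prod + (1 << 11)) >> 12;  /* round, keep the high 12 bits */
}

/* Multiply a green value by 0.502 ~= 1028/2^11: shifting g left by
   one bit first compensates for the 2^12 truncation of fmul12h. */
static int scale_green(int g)
{
    return fmul12h_lane(g << 1, 1028);
}
```

For the subwords of Figure 4.25 this model reproduces the listed results, e.g. 250 maps to 125 and 255 to 128.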
MMX unpacks the loaded values to 16-bit using data type conversion instructions, while MMMX does not need to. This means that MMX employs 4-way parallelism, while MMMX provides 8-way parallelism. Figure 4.26 and Figure 4.27 show a part of the MMX and MMMX codes for the YCbCr-to-RGB kernel based on Equation (4.3). The floating-point coefficient 16.5 must be subtracted from the Y values, and the coefficient 128.5 from the Cb and Cr values. These coefficients are represented as 1056/2^6 = 16.5 and 8224/2^6 = 128.5 in the MMX implementation, and as 66/2^2 = 16.5 and 514/2^2 = 128.5 in the MMMX implementation. This means that MMX and MMMX first shift the loaded values to the left by 6 and 2 bits, respectively, and then subtract the fixed-point coefficients from the shifted values. For example, as Figure 4.26 shows, MMX shifts the pixel values to the left by 6 bits and subtracts the fixed-point coefficients 1056 and 8224. The remaining floating-point coefficients in Equation (4.3) are approximated with 10 bits of precision in both architectures. For instance, the floating-point coefficient 1.164 is replaced by 1.164 × 2^10 ≈ 1192. Finally, the partitioned multiplication instructions pmulhw and fmul12h truncate the lower 16 and 12 bits of the results in MMX and MMMX, respectively.
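The fixed-point constants quoted above can be derived mechanically: each is the floating-point coefficient scaled by the power of two that the architecture shifts in first. A small helper (hypothetical name) makes the check explicit:

```c
/* Quantize a floating-point coefficient to an integer constant at a
   given power-of-two scale, rounding to nearest. */
static int q_const(double coeff, int shift)
{
    return (int)(coeff * (double)(1 << shift) + 0.5);
}
```

q_const(16.5, 6) gives 1056 and q_const(128.5, 6) gives 8224 for MMX; q_const(16.5, 2) gives 66 and q_const(128.5, 2) gives 514 for MMMX; q_const(1.164, 10) gives 1192.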

4.2.10 Similarity Measurements

Among the different similarity measurements, the sum-of-squared differences and the sum-of-absolute differences functions have been found to be the most useful [142, 133, 146, 144]. For example, in [146] eight similarity measurements for image retrieval have been evaluated. Based on the results presented there, in terms of retrieval effectiveness and retrieval efficiency, the SSD and SAD functions are more desirable than the other functions. The SSD and SAD cost functions of two N × N blocks for motion estimation are defined below.

Const1 dw 1056, 1056, 1056, 1056 ; 1056 / 2ˆ6 = 16.5
Const2 dw 8224, 8224, 8224, 8224 ; 8224 / 2ˆ6 = 128.5
Const3 dw 1192, 1192, 1192, 1192 ; 1192 / 2ˆ10 = 1.164
mov eax, M
loop1:
pxor mm0, mm0       ; mm0 = |0  |0  |0  |0  |0  |0  |0  |0  |
movq mm1, (Y)       ; mm1 = |y7 |y6 |y5 |y4 |y3 |y2 |y1 |y0 |
movq mm2, (Cb)      ; mm2 = |cb7|cb6|cb5|cb4|cb3|cb2|cb1|cb0|
movq mm3, (Cr)      ; mm3 = |cr7|cr6|cr5|cr4|cr3|cr2|cr1|cr0|
movq mm5, mm1       ; mm5 = |y7 |y6 |y5 |y4 |y3 |y2 |y1 |y0 |
movq mm6, mm2       ; mm6 = |cb7|cb6|cb5|cb4|cb3|cb2|cb1|cb0|
movq mm7, mm3       ; mm7 = |cr7|cr6|cr5|cr4|cr3|cr2|cr1|cr0|
punpcklbw mm1, mm0  ; mm1 = |0  |y3 |0  |y2 |0  |y1 |0  |y0 |
punpcklbw mm2, mm0  ; mm2 = |0  |cb3|0  |cb2|0  |cb1|0  |cb0|
punpcklbw mm3, mm0  ; mm3 = |0  |cr3|0  |cr2|0  |cr1|0  |cr0|
punpckhbw mm5, mm0  ; mm5 = |0  |y7 |0  |y6 |0  |y5 |0  |y4 |
punpckhbw mm6, mm0  ; mm6 = |0  |cb7|0  |cb6|0  |cb5|0  |cb4|
punpckhbw mm7, mm0  ; mm7 = |0  |cr7|0  |cr6|0  |cr5|0  |cr4|
psllw mm1, 6        ; mm1 = mm1 x 2ˆ6
psllw mm2, 6        ; mm2 = mm2 x 2ˆ6
psllw mm3, 6        ; mm3 = mm3 x 2ˆ6
psllw mm5, 6        ; mm5 = mm5 x 2ˆ6
psllw mm6, 6        ; mm6 = mm6 x 2ˆ6
psllw mm7, 6        ; mm7 = mm7 x 2ˆ6
psubw mm1, Const1   ; mm1 = mm1 - Const1
psubw mm5, Const1   ; mm5 = mm5 - Const1
psubw mm2, Const2   ; mm2 = mm2 - Const2
psubw mm3, Const2   ; mm3 = mm3 - Const2
psubw mm6, Const2   ; mm6 = mm6 - Const2
psubw mm7, Const2   ; mm7 = mm7 - Const2
pmulhw mm1, Const3  ; mm1 = Y x 1.164
.
.
sub eax, 8
jnz loop1

Figure 4.26: A part of the MMX code for the YCbCr-to-RGB color space conversion.

The SSD and SAD cost functions are defined by Equation (4.4) and Equation (4.5), respectively. In these equations x(m, n) represents the current block of N^2 (usually N = 16) pixels, y(m + i, n + j) represents the block in the reference frame, and (i, j) is the motion vector.

SSD(i, j) = \sum_{m=1}^{N} \sum_{n=1}^{N} (x(m, n) - y(m + i, n + j))^2 .   (4.4)

SAD(i, j) = \sum_{m=1}^{N} \sum_{n=1}^{N} |x(m, n) - y(m + i, n + j)| .   (4.5)

The SSD and SAD functions are also used in CBIVR systems, where images and

; 16.5 = 66 / 2ˆ2, 128.5 = 514 / 2ˆ2, and 1.164 = 1192 / 2ˆ10
Const1 dw 66  , 66  , 66  , 66  , 66  , 66  , 66  , 66
Const2 dw 514 , 514 , 514 , 514 , 514 , 514 , 514 , 514
Const3 dw 1192, 1192, 1192, 1192, 1192, 1192, 1192, 1192
mov eax, M
loop1:
fld8u12 3mx1, (Y)    ; 3mx1 = |y7 |y6 |y5 |y4 |y3 |y2 |y1 |y0 |
fld8u12 3mx2, (Cb)   ; 3mx2 = |cb7|cb6|cb5|cb4|cb3|cb2|cb1|cb0|
fld8u12 3mx3, (Cr)   ; 3mx3 = |cr7|cr6|cr5|cr4|cr3|cr2|cr1|cr0|
fsll12 3mx1, 2       ; 3mx1 = 3mx1 x 2ˆ2
fsll12 3mx2, 2       ; 3mx2 = 3mx2 x 2ˆ2
fsll12 3mx3, 2       ; 3mx3 = 3mx3 x 2ˆ2
fsub12 3mx1, Const1  ; 3mx1 = 3mx1 - Const1
fsub12 3mx2, Const2  ; 3mx2 = 3mx2 - Const2
fsub12 3mx3, Const2  ; 3mx3 = 3mx3 - Const2
fmul12h 3mx1, Const3 ; 3mx1 = Y x 1.164
.
.
sub eax , 8
jnz loop1

Figure 4.27: A part of the MMMX code for the YCbCr-to-RGB color space conversion.

videos are indexed into a database using a vector of features extracted from the image or video. In the retrieval stage the similarity between the features of the query image and the stored feature vectors is determined. This means that computing the similarity between two images or videos can be transformed into the problem of computing the similarity between two feature vectors [86]. Hence, the large computational cost associated with CBIVR systems lies in the matching algorithms for feature vectors, because there are many feature vectors from different images and videos in the feature database. The histogram Euclidean distance (Equation (4.6)) and the bin-to-bin difference (b2b) (Equation (4.7)) are common similarity measurements in CBIVR systems [39]. In these equations h1 and h2 represent two histograms, N is the number of pixels in an image, and n is the number of bits in each pixel.

d^2(h1, h2) = \sum_{i=0}^{2^n - 1} (h1[i] - h2[i])^2 .   (4.6)

fd_{b2b}(h1, h2) = \frac{1}{N} \sum_{i=0}^{2^n - 1} |h1[i] - h2[i]| .   (4.7)

The size of a histogram depends on the number of bits in each pixel. If a pixel depth of n bits is assumed, the pixel values lie between 0 and 2^n - 1, and the histogram has 2^n elements.

Components of color histograms are unsigned numbers and usually larger than 8- and 16-bit. For instance, if a frame of size 512 × 512 is completely white or black, the largest element will be 2^18. In the following subsections, the SIMD implementations of the SAD function, the SSD function, and their interpolated versions are discussed.
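The claim about the histogram element width can be checked directly: for a uniformly colored 512 × 512 frame one bin reaches 512 · 512 = 2^18, which needs 19 bits, more than 16 but comfortably within 24. The helper name below is illustrative:

```c
#include <stdint.h>

/* Number of bits needed to represent an unsigned value. */
static int bits_needed(uint32_t v)
{
    int bits = 0;
    while (v > 0) { bits++; v >>= 1; }  /* count significant bits */
    return bits;
}
```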

SIMD Implementation of the SAD Function

As mentioned in Chapter 1, the psadbw instruction [111] is an SPI specifically targeted at the SAD function. A 64-bit psadbw instruction consists of 3 steps: (1) calculate the eight 8-bit differences between the elements, (2) take the absolute value of each difference, and (3) perform three cascaded summations. These steps are illustrated in Figure 4.28. The C code of the SAD function was depicted in Figure 3.2 in Section 3.1. The code in Figure 4.29 depicts the MMX/SSE implementation of the SAD kernel for two 16 × 16 blocks using the psadbw instruction.
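These three steps can be written as a scalar model (the function name is illustrative; the real instruction operates on two packed 64-bit registers):

```c
#include <stdint.h>

/* Model of psadbw: (1) eight byte-wise differences, (2) absolute
   values, (3) summation (in hardware, three cascaded addition levels)
   into a single 16-bit result. */
static uint16_t psadbw_model(const uint8_t a[8], const uint8_t b[8])
{
    uint16_t sum = 0;
    for (int i = 0; i < 8; i++) {
        int d = (int)a[i] - (int)b[i];     /* 9-bit signed difference */
        sum += (uint16_t)(d < 0 ? -d : d); /* |d|, then accumulate */
    }
    return sum;
}
```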

[Figure 4.28 shows two packed 64-bit registers holding bytes a..h and a'..h'. Step 1 computes the eight differences a−a', ..., h−h'; step 2 takes their absolute values |a−a'|, ..., |h−h'|; step 3 sums the eight values through a cascaded adder tree (four, then two, then one addition).]

Figure 4.28: The structure of the SAD instruction in multimedia extensions.

One of the reasons why the psadbw instruction provides a significant performance benefit is that, without it, the 9-bit differences cannot be stored in the 8-bit subwords. Furthermore, there are no instructions to sum all the elements in a register or to add adjacent elements. In the MMMX architecture the 9-bit differences can be stored in the 12-bit subwords. Moreover, it provides instructions to add adjacent elements, which can most likely be performed in a single cycle. In other words, SIMD instructions have been implemented to replace the psadbw instruction; they are more general-purpose and can be used in many multimedia kernels as well as in other similarity measurements. The psadbw instruction can be synthesized using a small number of such general-purpose SIMD instructions with only a small performance degradation. Figure 4.30 shows how the SAD function can be implemented using MMMX instructions.

mov eax , 16
pxor mm5 , mm5
loop:
movq mm1 , (blk1)
movq mm2 , (blk2)
movq mm3 , (blk1+8)
movq mm4 , (blk2+8)
psadbw mm1 , mm2
psadbw mm3 , mm4
paddd mm1 , mm3
paddd mm5 , mm1
add blk1, 16
add blk2, 16
dec eax
jnz loop

Figure 4.29: The MMX/SSE implementation of the SAD function.

mov ecx , 2
loop1:
fxor 3mx5, 3mx5
mov eax , 8
loop2:
fld8u12s 3mx1, (blk1)
fld8u12s 3mx2, (blk2)
fld8u12s 3mx3, (blk1+8)
fld8u12s 3mx4, (blk2+8)
fsub12 3mx1, 3mx2
fneg12 3mx7, 3mx1, 255
fmax12 3mx1, 3mx7
fsub12 3mx3, 3mx4
fneg12 3mx7, 3mx3, 255
fmax12 3mx3, 3mx7
fadd12 3mx1, 3mx3
fadd12 3mx5, 3mx1
add blk1, 16
add blk2, 16
dec eax
jnz loop2
fsum12 3mx5
fsum24 3mx5
fsum48 3mx5
dec ecx
jnz loop1

Figure 4.30: The MMMX implementation of the SAD function.

Since |a − b| = max(a − b, b − a), the absolute difference operation can be synthesized using the MMMX instructions fsub12, fneg12, and fmax12. Thereafter, the sum-of-absolute differences can be computed using the fsum{12,24,48} instructions. These instructions are more general-purpose than the psadbw instruction. Note that the elements in a register are not summed in every iteration. Because each absolute difference is 8 bits and each subword is 12 bits, sixteen absolute differences can be accumulated before there is a risk of overflow. Therefore, in order to provide 8-way parallelism, a 16 × 16 block was divided into two 8 × 16 blocks. In the first iteration of the outer loop, the SAD of the first 8 × 16 block is calculated, and in the next iteration, the SAD of the other 8 × 16 block. Finally, the results are accumulated into one register using the fsum{12,24,48} instructions. Figure 4.31 and Figure 4.32 depict a part of the MMX and MMMX implementations of the SAD function for similarity measurement of two histograms, respectively.
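The identity used here can be written out in scalar C, with plain ints standing in for the saturating 12-bit lanes:

```c
/* |a - b| = max(a - b, b - a): the same sequence the MMMX code uses,
   with fsub12 computing a-b, fneg12 producing b-a, and fmax12
   selecting the non-negative one. */
static int absdiff(int a, int b)
{
    int d1 = a - b;
    int d2 = b - a;
    return d1 > d2 ? d1 : d2;
}
```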

mov ecx, 256
pxor mm5, mm5           ; mm5 = 0
loop:
movq mm1, (HCurrent)    ; mm1 = |h2 |h1 |
movq mm2, (HReference)  ; mm2 = |r2 |r1 |
movq mm3, mm1           ; mm3 = |h2 |h1 |
psubd mm1, mm2          ; mm1 = |h2-r2|h1-r1|
psubd mm2, mm3          ; mm2 = |r2-h2|r1-h1|
movq mm3, mm1           ; mm3 = |h2-r2|h1-r1|
movq mm4, mm2           ; mm4 = |r2-h2|r1-h1|
pcmpgtd mm1, mm2        ; mm1 = mm1 > mm2
pcmpgtd mm2, mm3        ; mm2 = mm2 > mm3
pand mm1, mm3           ; mm1 = mm1 & mm3
pand mm2, mm4           ; mm2 = mm2 & mm4
paddd mm1, mm2          ; mm1 = mm1 + mm2
paddd mm5, mm1          ; mm5 = mm5 + mm1
.
.
sub ecx, 2
jnz loop
movq mm6, mm5           ; mm6 = mm5 = |p2|p1|
psrlq mm5, 32           ; mm5 = 0 p2
paddd mm5, mm6          ; mm5 = |p2+0 |p2+p1|

Figure 4.31: A part of the MMX implementation of the sum-of-absolute differences for simi- larity measurement of histograms.

MMX provides 2-way parallelism as shown in Figure 4.31. This is because the elements of the color histograms are larger than 8- or 16-bit but smaller than 24-bit. The image size determines the histogram element size. For example, the largest image size used, the High Definition Television (HDTV) standard of 1920 × 1080 [54], yields histogram elements that can be represented with 24 bits. Elements of the histograms are stored in memory as a 32-bit data type. Furthermore, there is no special-purpose SAD instruction for 32-bit data types in the MMX ISA. This means that thirteen other SIMD instructions are needed to implement it. The last three instructions in Figure 4.31 are used to add the two adjacent data elements. MMMX, on the other hand, provides 4-way parallelism as shown in Figure 4.32. This is because 24-bit subwords are sufficient for the computational results of the histograms.

mov ecx, 256
fxor 3mx5, 3mx5              ; 3mx5 = 0
loop:
fld32u24 3mx1, (HCurrent)    ; 3mx1 = |h3 |h2 |h1 |h0 |
fld32u24 3mx2, (HReference)  ; 3mx2 = |r3 |r2 |r1 |r0 |
fsub24 3mx1, 3mx2            ; 3mx1 = |h3-r3|h2-r2|..|h0-r0|
fneg24 3mx7, 3mx1, 15        ; 3mx7 = |-(h3-r3)|..|-(h0-r0)|
fmax24 3mx1, 3mx7            ; 3mx1 = max(3mx1, 3mx7)
fadd24 3mx5, 3mx1            ; 3mx5 = 3mx5+3mx1
.
.
sub ecx , 4
jnz loop
fsum24 3mx5
fsum48 3mx5

Figure 4.32: A part of the MMMX implementation of the sum-of-absolute differences for similarity measurement of histograms.

SIMD Implementation of the SSD Function

The SSD function for processing a block of size 16 × 16 has already been shown in Figure 3.1 in Chapter 3. As was discussed in Section 3.1, 24 bits of precision are sufficient for the accumulation range of the SSD implementation. This means that an unsigned 24-bit data type suffices, which does not map neatly onto a general-purpose data type. Figure 4.33 and Figure 4.34 show how the SSD function can be implemented using MMX and MMMX instructions, respectively. The MMX implementation uses 16-bit subwords as the computational format, because the difference between two bytes does not fit in 8 bits and because there is a MAC instruction, pmaddwd, for 16-bit data elements. The final result is stored in a 32-bit subword. Because of this, many punpck instructions are used to promote the 8-bit to the 16-bit data type. This is the reason why the static number of instructions in each iteration of the MMX implementation is 36, compared to 15 in the MMMX code. In the MMMX implementation 12-bit subwords are used for processing, and the final computational results are stored in 24-bit subwords.
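The 24-bit claim can be checked against the worst case, in which every one of the 256 pixel pairs differs by the maximum of 255 (the helper name is illustrative):

```c
#include <stdint.h>

/* Worst-case SSD accumulation for an n x n block of 8-bit pixels:
   n*n differences, each squared at its maximum of 255. */
static uint32_t ssd_worst_case(int n)
{
    return (uint32_t)(n * n) * 255u * 255u;
}
```

ssd_worst_case(16) is 16,646,400, just below 2^24 = 16,777,216, so an unsigned 24-bit accumulator never overflows.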

Interpolation

The SAD and SSD similarity measurements are only a summation of the pixel-wise intensity differences and, consequently, small changes may result in a large similarity distance. For example, the Euclidean distance between Figure 4.35(a) and (b) is less than the Euclidean distance between (a) and (c), even though Figure 4.35(a) is more similar to

mov eax , 16
pxor mm0 , mm0       ; mm0 = |0 |0 |0 |0 |0 |0 |0 |0 |
pxor mm7 , mm7       ; mm7 = |0 |0 |0 |0 |0 |0 |0 |0 |
loop:
movq mm1 , (blk1)    ; mm1 = |a7|a6|a5|a4|a3|a2|a1|a0|
movq mm2 , (blk2)    ; mm2 = |b7|b6|b5|b4|b3|b2|b1|b0|
movq mm3 , mm1       ; mm3 = |a7|a6|a5|a4|a3|a2|a1|a0|
movq mm4 , mm2       ; mm4 = |b7|b6|b5|b4|b3|b2|b1|b0|
punpcklbw mm1 , mm0  ; mm1 = | a3 | a2 | a1 | a0 |
punpckhbw mm3 , mm0  ; mm3 = | a7 | a6 | a5 | a4 |
punpcklbw mm2 , mm0  ; mm2 = | b3 | b2 | b1 | b0 |
punpckhbw mm4 , mm0  ; mm4 = | b7 | b6 | b5 | b4 |
psubw mm1 , mm2      ; mm1 = |a3-b3|a2-b2|a1-b1|a0-b0|
psubw mm3 , mm4      ; mm3 = |a7-b7|a6-b6|a5-b5|a4-b4|
movq mm2 , mm1       ; mm2 = |a3-b3|a2-b2|a1-b1|a0-b0|
movq mm4 , mm3       ; mm4 = |a7-b7|a6-b6|a5-b5|a4-b4|
pmaddwd mm1 , mm2    ; mm1 = |(a3-b3)ˆ2+(a2-b2)ˆ2|...|
pmaddwd mm3 , mm4    ; mm3 = |(a7-b7)ˆ2+(a6-b6)ˆ2|...|
paddd mm1 , mm3      ; mm1 = mm1 + mm3
paddd mm7 , mm1      ; mm7 = mm7 + mm1
; for other 8 pixels
movq mm1 , (blk1+8)
movq mm2 , (blk2+8)
; 13 instructions like above
paddd mm7 , mm1
add blk1, 16
add blk2, 16
dec eax
jnz loop
movq mm6 , mm7
psrlq mm7 , 32
paddd mm7 , mm6

Figure 4.33: The MMX implementation of the sum-of-squared differences function.

Figure 4.35(c) than to (b). For images, there are spatial relationships between pixels. There are many ways to take these relationships into account, for example, averaging. Averaging neighboring pixels can be done on two adjacent pixels horizontally, on two adjacent pixels vertically, or on four adjacent pixels in both the horizontal and vertical dimensions. For instance, MPEG-2 encoding offers several varieties of block matching involving half-pixel interpolation. The original MPEG-2 code first performs the interpolation and then computes the sum of absolute differences on the result. To take the relationships between pixels into account, in this thesis the horizontal and vertical interpolation have

mov eax , 16
fxor 3mx7, 3mx7          ; 3mx7 = | 0 | 0 | 0 | 0 |
loop:
fld8u12 3mx1, (blk1)     ; 3mx1 = |a7|a6|a5|a4|a3|a2|a1|a0|
fld8u12 3mx2, (blk2)     ; 3mx2 = |b7|b6|b5|b4|b3|b2|b1|b0|
fld8u12 3mx3, (blk1+8)   ; 3mx3 = |a15|a14|a13|a12|... |a8|
fld8u12 3mx4, (blk2+8)   ; 3mx4 = |b15|b14|b13|b12|... |b8|
fsub12 3mx1, 3mx2        ; 3mx1 = |a7-b7|a6-b6|... |a0-b0|
fsub12 3mx3, 3mx4        ; 3mx3 = |a15-b15|a14-b14|.|a8-b8|
fmov 3mx2, 3mx1          ; 3mx2 = |a7-b7|a6-b6|... |a0-b0|
fmov 3mx4, 3mx3          ; 3mx4 = |a15-b15|a14-b14|.|a8-b8|
fmadd24 3mx1, 3mx2       ; 3mx1 = 3mx1 x 3mx2
fmadd24 3mx3, 3mx4       ; 3mx3 = 3mx3 x 3mx4
fadd24 3mx1, 3mx3        ; 3mx1 = 3mx1 + 3mx3
fadd24 3mx7, 3mx1        ; 3mx7 = 3mx7 + 3mx1
add blk1, 16
add blk2, 16
dec eax
jnz loop
fsum24 3mx7
fsum48 3mx7

Figure 4.34: The MMMX implementation of the sum-of-squared differences function.


Figure 4.35: Similar and dissimilar images.

been implemented, as supported by the MPEG-4 standard.

SIMD Implementation of SAD with Interpolation

As previously mentioned, one way to take the relationship between image pixels into account is averaging. For this, the SSE ISA provides the special averaging instructions pavgb and pavgw for packed unsigned bytes and packed unsigned words, respectively. The former provides 8-way parallelism, while the latter supports 4-way parallelism. However, averaging four 8-bit pixels using horizontal and vertical interpolation may produce an error of 1 when performing 3 average operations as follows: pavgb(x, y, z, t) = pavgb[pavgb(x, y), pavgb(z, t)]. To avoid this error in the MMX/SSE implementation, 16-bit operations are used and the input pixels are converted to 16-bit using pack/unpack instructions. Figure 4.36 and Figure 4.37 show a part of the MMX/SSE and MMMX implementations of the SAD function with horizontal and vertical averaging, respectively. The sum of four neighboring pixels is larger than 8-bit. In the MMX/SSE implementation, the 8-bit data type is first unpacked to 16-bit. Then, the averaging of four pixels is performed with 4-way parallelism. After that the 16-bit intermediate results are packed back to the 8-bit data type. Finally, 8-way parallelism is used by the psadbw instruction. In other words, MMX/SSE uses both 4- and 8-way parallelism, as can be seen in Figure 4.36. The MMMX implementation, on the other hand, always employs 8-way parallelism, because 12 bits are sufficient both for the sum of four pixels and for calculating the sum-of-absolute differences, as depicted in Figure 4.37. The SIMD implementation of SSD with interpolation is almost identical to the SIMD implementation of the SSD function, except that it consists of many more load and ALU instructions in order to perform the averaging.
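The error of 1 mentioned above comes from cascading three round-up averages; a scalar sketch of the pavgb semantics demonstrates it (function names are illustrative):

```c
#include <stdint.h>

/* pavgb semantics: unsigned average rounded up. */
static uint8_t pavgb(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b + 1) >> 1);
}

/* Four-pixel average via three cascaded pavgb operations. */
static uint8_t avg4_cascaded(uint8_t x, uint8_t y, uint8_t z, uint8_t t)
{
    return pavgb(pavgb(x, y), pavgb(z, t));
}

/* Correctly rounded four-pixel average. */
static uint8_t avg4_exact(uint8_t x, uint8_t y, uint8_t z, uint8_t t)
{
    return (uint8_t)(((unsigned)x + y + z + t + 2) >> 2);
}
```

For (1, 0, 0, 0) the cascaded version yields 1 while the exact average is 0, which is why the MMX/SSE code widens to 16 bits instead.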

Histogram Intersection Distance

In order to show the versatility of the MMMX ISA compared to the MMX/SSE ISA, another distance measurement, referred to as the histogram intersection, has been implemented. The histogram intersection distance between two histograms h1 and h2, fd_int(h1, h2), was proposed by Swain and Ballard [134] and is used in image and video retrieval [146, 39]. It is defined as:

mov eax, 16
pxor mm0 , mm0       ; mm0 = |0 |0 |0 |0 |0 |0 |0 |0|
loop:
; Pixels 0..7
movq mm1, (blk1)     ; mm1 = |a0_7|a0_6|a0_5|..|a0_0|
movq mm3, (blk1+16)  ; mm3 = |a1_7|a1_6|a1_5|..|a1_0|
movq mm2, mm1        ; mm2 = |a0_7|a0_6|a0_5|..|a0_0|
movq mm4, mm3        ; mm4 = |a1_7|a1_6|a1_5|..|a1_0|
punpcklbw mm1, mm0   ; mm1 = |a0_3|a0_2|a0_1 |a0_0|
punpcklbw mm3, mm0   ; mm3 = |a1_3|a1_2|a1_1 |a1_0|
movq mm5, (blk1+1)   ; mm5 = |a0_8|a0_7|a0_6|..|a0_1|
movq mm6, (blk1+17)  ; mm6 = |a1_8|a1_7|a1_6|..|a1_1|
punpcklbw mm5, mm0   ; mm5 = |a0_4|a0_3|a0_2| |a0_1|
punpcklbw mm6, mm0   ; mm6 = |a1_4|a1_3|a1_2| |a1_1|
.
.
packuswb mm1, mm2
psadbw mm1, (blk2)
; Pixels 8..F
movq mm1, (blk1+8)   ; mm1 = |a0_15|a0_14|... |a0_8|
movq mm3, (blk1+24)  ; mm3 = |a1_15|a1_14|... |a1_8|
movq mm2, mm1        ; mm2 = |a0_15|a0_14|... |a0_8|
movq mm4, mm3        ; mm4 = |a1_15|a1_14|... |a1_8|
punpcklbw mm1, mm0   ; mm1 = |a0_11|a0_10|a0_9 |a0_8|
punpcklbw mm3, mm0   ; mm3 = |a1_11|a1_10|a1_9 |a1_8|
movq mm5, (blk1+9)   ; mm5 = |a0_16|a0_15|... |a0_9|
movq mm6, (blk1+25)  ; mm6 = |a1_16|a1_15|... |a1_9|
punpcklbw mm5, mm0   ; mm5 = |a0_12|a0_11|a0_10|a0_9|
punpcklbw mm6, mm0   ; mm6 = |a1_12|a1_11|a1_10|a1_9|
.
.
packuswb mm3, mm4
psadbw mm3, (blk2+8)
.
.
add blk1, 16
add blk2, 16
dec eax
jnz loop

Figure 4.36: The MMX/SSE code of the sum-of-absolute difference function using horizontal and vertical interpolation.

mov eax , 8
loop:
fld8u12 3mx1, (blk1)     ; 3mx1 = |a0_7 |a0_6 |... |a0_0|
fld8u12 3mx2, (blk1+8)   ; 3mx2 = |a0_15|a0_14|... |a0_8|
fld8u12 3mx3, (blk1+16)  ; 3mx3 = |a1_7 |a1_6 |... |a1_0|
fld8u12 3mx4, (blk1+24)  ; 3mx4 = |a1_15|a1_14|... |a1_8|
fadd12 3mx1, 3mx3        ; 3mx1 = |a1_7 |a1_6 |... |a1_0|
fadd12 3mx2, 3mx4        ; 3mx2 = |a1_15|a1_14|... |a1_8|
fld8u12 3mx3, (blk1+1)   ; 3mx3 = |a0_8 |a0_7 |... |a0_1|
fld8u12 3mx4, (blk1+9)   ; 3mx4 = |a0_16|a0_15|... |a0_9|
fld8u12 3mx5, (blk1+17)  ; 3mx5 = |a1_8 |a1_7 |... |a1_1|
fld8u12 3mx6, (blk1+25)  ; 3mx6 = |a1_16|a1_15|... |a1_9|
fadd12 3mx3, 3mx5        ; 3mx3 = |a0_8+a1_8|.. |a0_1+a1_1|
fadd12 3mx4, 3mx6        ; 3mx4 = |a0_16+a1_16|.|a0_9+a1_9|
fadd12 3mx1, 3mx3        ; 3mx1 = |a1_7+a0_8+a1_8|... |
fadd12 3mx2, 3mx4        ; 3mx2 = |a1_15+ a0_16+a1_16|... |
fsra12 3mx1, 2           ; 3mx1 = 3mx1 >> 2
fsra12 3mx2, 2           ; 3mx2 = 3mx2 >> 2
fld8u12 3mx3, (blk2)     ; 3mx3 = |b0_7 |b0_6 |... |b0_0|
fld8u12 3mx4, (blk2+8)   ; 3mx4 = |b0_15|b0_14|... |b0_8|
.
.
add blk1, 16
add blk2, 16
dec eax
jnz loop

Figure 4.37: The MMMX implementation of the sum-of-absolute difference function using horizontal and vertical interpolation.

intersection(h1, h2) = \frac{1}{N} \sum_{i=0}^{2^n - 1} \min(h1[i], h2[i]) .   (4.8)

fd_{int}(h1, h2) = 1 - intersection(h1, h2).

In this equation h1 and h2 represent two histograms, N is the number of pixels in an image, and n is the number of bits in each pixel. The elements of the histograms are larger than 16-bit. There are no suitable SIMD instructions, such as minimum and maximum selection, for data types larger than the short data type in the MMX/SSE ISA. As a result, implementing this cost function using MMX/SSE is difficult. The above equation shows that SIMD instructions are needed to find minimum values and to add adjacent elements. Such SIMD instructions are available in the MMMX ISA. Figure 4.38 depicts a part of the MMMX implementation of the histogram intersection for distance measurement of two histograms.
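A scalar reference for Equation (4.8) can be written as follows (the names are illustrative; h1 and h2 have 2^n bins, and npixels is N):

```c
#include <stdint.h>

/* Histogram intersection distance, Equation (4.8):
   fd_int = 1 - (sum of bin-wise minima) / N. */
static double hist_intersection_dist(const uint32_t *h1, const uint32_t *h2,
                                     int nbins, uint32_t npixels)
{
    uint64_t s = 0;
    for (int i = 0; i < nbins; i++)
        s += h1[i] < h2[i] ? h1[i] : h2[i];  /* min of the two bins */
    return 1.0 - (double)s / (double)npixels;
}
```

Identical histograms give a distance of 0; disjoint histograms give a distance of 1.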

mov eax , 256 fxor 3mx7, 3mx7 ; 3mx7 = |0 |0 | loop: fld32u24 3mx1, (h1) ; 3mx1 = |h1_3|h1_2|h1_1|h1_0| fld32u24 3mx2, (h2) ; 3mx2 = |h2_3|h2_2|h2_1|h2_0| fmin24 3mx1, 3mx2 ; 3mx1 = |g3 |... |g0 |, gi=min(h1_i & h2_i) fsum24 3mx1 ; 3mx1 = | g3+g2 | g1+g0 | fadd48 3mx7, 3mx1 ; 3mx7 = | 0+g3+g2 | 0+g1+g0 | add h1 , 16 add h2 , 16 sub ecx , 4 jnz loop fsum48 3mx7

Figure 4.38: A part of the MMMX code for implementation of the histogram intersection.

4.3 Evaluation Environment

In this section the experimental methodology and tools that were used for the performance evaluation are presented. In order to evaluate the performance of the proposed techniques, the MMX architecture and the MMMX architecture with extended subwords and the MRF were simulated using the sim-outorder simulator of the SimpleScalar toolset [9]. sim-outorder is a detailed, execution-driven simulator that supports out-of-order issue and execution. The PISA ISA, which consists of 64-bit instructions, has been used. It has the following three instruction formats: the register format, which is used for computational instructions; the immediate format, which supports the inclusion of a 16-bit constant; and the jump format, which supports the specification of jump targets. These instruction formats are illustrated in Figure 4.39. Each instruction format contains a 16-bit annotate field, which can be used to synthesize new instructions without having to change and recompile the assembler. The MMX/SSE and MMMX instructions have been synthesized using this annotate field.

Register format:  annotation field (63-48) | opcode (47-32) | rs (31-24) | rt (23-16) | rd (15-8) | ru (7-0)

Immediate format: annotation field (63-48) | opcode (47-32) | rs (31-24) | rt (23-16) | 16-bit immediate (15-0)

Jump format:      annotation field (63-48) | opcode (47-32) | unused (31-26) | 26-bit target (25-0)

Figure 4.39: SimpleScalar Portable ISA (PISA) instruction formats.

As Figure 4.39 shows, there are four 8-bit register fields in the register format. Each of these fields can address up to 256 architectural registers. Two of the register fields, rd and ru, can also be used as one 16-bit immediate value in the immediate format. The opcode and annotation fields are 16 bits wide; these fields are compatible with the opcodes of the existing SimpleScalar instructions. The annotation field can be used for synthesizing new instructions in the assembly files, as in the following example:

add.d/a $f2,$f4,$f6

The annotation /a in this example specifies that the first bit of the 16-bit annotation field should be set. The simulator can then be modified so that the instruction above corresponds to, e.g., paddb (packed addition of 8 bytes) instead of a double-precision floating-point addition. This method, however, is very error-prone. In order to simplify synthesizing new instructions, two tools have been used: the SimpleScalar Instruction Tool (SSIT) and the SimpleScalar Architecture Tool (SSAT) [74]. SSIT allows the use of human-readable instructions such as paddb in the assembly files. It processes assembly files containing these readable instructions, replaces them with corresponding annotated instructions, and modifies the source code of sim-outorder to support the new instructions. SSAT extends SSIT by providing the possibility to define new registers and to define aliases for existing registers. For example, in MMX, as in many other media extensions, the media registers correspond to the floating-point registers. As was mentioned in Chapter 3, the reason why the MMX ISA was selected is that it is a representative SIMD extension. It should be remarked that, because the SimpleScalar PISA architecture is a RISC architecture, MMX and MMMX have not been simulated exactly, but rather RISC-like approximations of them. For example, one operand of many MMX instructions can be a memory location, whereas load/store architectures have been simulated.
This does not affect the validity of the simulations, because the main objective of this evaluation is to compare the performance of an SIMD architecture without extended subwords and the MRF to the same architecture with these features. Furthermore, in the Pentium 4, MMX instructions involving memory operands are translated to RISC-like micro-operations (µOPs). The main parameters of the modeled processors are depicted in Table 4.3. Processors with an issue width of 1, 2, and 4 instructions were modeled. So, when four SIMD instructions are issued simultaneously, up to 32 data operations are executed in parallel. For most parameters the default values were used, except for the size of the Register Update Unit (RUU), which is 16 by default. This, however, is insufficient to find many independent instructions; therefore, an RUU size of 64 was used. The RUU is a costly resource, and its size should be minimized. The influence of the RUU on performance has been studied using the VIS extension in [27]. The results presented there show that increasing the RUU size has a positive effect on the performance of a VIS-enhanced processor. This means that the VIS-enhanced superscalar CPU can exploit and execute a larger amount of ILP with a larger RUU. In addition, for each issue width, there is a certain RUU size beyond which the performance no longer improves. For example, the speedup of the 4- and 8-way VIS-enhanced processors saturates when the RUU consists of 64 and 128 entries, respectively. Since the MMX and VIS extensions are very similar, 64 entries for the RUU is realistic. In addition, the Alpha has an instruction window of 80 instructions and the Pentium 4 has a window of 126 instructions. The latency and throughput of SIMD instructions are set equal to the latency and throughput of the corresponding scalar instructions.
This is a conservative assumption given that the SIMD instructions perform the same operation but on narrower data types. In addition, both the latency and throughput of the fsum instructions are set to 1, while the latency and throughput of the psadbw instruction are set to 4 and 1, respectively, the same as in the Pentium 4 processor. The FPGA synthesis results presented in Section 3.3 have not been used, but rather reasonable approximations. This was done because, first, FPGA synthesis results are not accurate indicators for an ASIC implementation, and second, pipelining of SIMD multiplication operations has not been considered. In the simulation, SIMD multiplication operations have a latency of 3 cycles and a throughput of 1 cycle. For each kernel, three programs have been implemented in C and assembly language and simulated using the SimpleScalar simulator. Each program consists of three parts: one part reads the image, the second part is the computational kernel, and the last part stores the transformed image. One program was completely written in C. It was compiled using the gcc compiler targeted to the SimpleScalar PISA with optimization level -O2. The reading and storing parts of the other two programs were also written in C, but the computational part was implemented by hand using MMX/SSE and MMMX.

Parameter                            Value
Issue width                          1 / 2 / 4
Integer ALU, SIMD ALU                1 / 2 / 4
Integer MULT, SIMD MULT              1 / 2 / 4
L1 instruction cache                 512 sets, direct-mapped, 64-byte lines, LRU, 1-cycle hit, 32 KB total
L1 data cache                        128 sets, 4-way, 64-byte lines, 1-cycle hit, 32 KB total
L2 unified cache                     1024 sets, 4-way, 64-byte lines, 6-cycle hit, 256 KB total
Main memory latency                  18 cycles for the first chunk, 2 thereafter
Memory bus width                     16 bytes
RUU (register update unit) entries   64
Load/store queue size                8
Execution                            out-of-order

Table 4.3: Processor configuration.

These programs will be referred to as C, MMX, and MMMX for each kernel. All C, MMX, and MMMX codes use the same algorithms. In addition, the correctness of the MMX and MMMX codes was validated by comparing their output to the output of the C programs. Some of the reasons why assembly language has been used for the SIMD programming of multimedia kernels in this thesis have already been discussed in Section 1.4. Another reason is that, in order to evaluate the performance of both the MMX and MMMX architectures with the SimpleScalar toolset, assembly codes of the multimedia kernels were needed, and suitable tools to generate these assembly codes were not yet available. The speedup was measured as the ratio of the total number of cycles for the computational part of each kernel for the MMX implementation to that of the MMMX implementation. In order to explain the speedup, the ratio of the dynamic number of instructions has also been obtained. These metrics formed the basis of the comparative study. The ratio of the dynamic number of instructions is the ratio of the number of committed instructions for the MMX implementation to the number of committed instructions for the MMMX implementation. Although, based on a rough comparison of both architectures in Chapter 3, the cycle time of MMMX is almost 40% longer than the cycle time of MMX, MMMX is still faster than MMX when this is taken into account. In addition, whole images have been used as input for the kernels. For example, a 144 × 176 image has been used for the similarity measurement kernels and a 576 × 704 image for the other kernels. The full search algorithm has been implemented for motion estimation on an image of Quarter Common Intermediate Format (QCIF) size, which is 144 × 176 pixels.
In order to determine the motion vectors for the reference blocks in the current frame, a macroblock of 8 × 8 pixels has been used as the basic block, with a search range of 16 in the motion estimation process. To obtain the application-level speedup, the SIMD implementations of the multimedia kernels have been substituted into the original C code of the media applications. In other words, three different versions, namely C, MMX, and MMMX, were provided for each application, and these versions were simulated by the simulator. In the following sections, the experimental results at the block, image, and application levels are described.

4.4 Performance Evaluation Results

In this section, the MMMX architecture is evaluated by comparing the performance obtained for the MMMX implementation of a benchmark to the performance of the

MMX implementation of the same benchmark. The performance is compared at the block, image, and application levels.

4.4.1 Block-level Speedup

Figure 4.40 and Figure 4.41 depict the speedup of MMMX over MMX for media kernels that use either the extended subwords technique or both proposed techniques, respectively. The results have been obtained for one execution of the media kernels on a single block on the single-issue processor. In addition, these figures show the ratio of committed instructions (MMX implementation over MMMX). Both figures show that MMMX performs better than MMX for all kernels except SAD. The speedup in Figure 4.40 ranges from 0.74 for the SAD kernel to 2.66 for the Paeth kernel. In Figure 4.41, MMMX yields a speedup ranging from 1.10 for the 2D IDCT kernel to 4.47 for the Transp.(12) kernel. The most important reason why MMMX improves performance is that it needs to execute fewer instructions than MMX. In the SAD kernel, on the other hand, MMMX needs to execute more instructions than MMX. As Figure 4.40 shows, the ratio of committed instructions for the SAD kernel is 0.72. A special-purpose instruction (SPI) has been used in the MMX implementation of the SAD function and the SAD function with interpolation, while in the MMMX implementation this SPI has been synthesized by a few general-purpose SIMD instructions. Both MMX and MMMX employ 8-way parallelism in the SAD function, but MMMX uses more instructions than MMX. MMX employs both 4- and 8-way parallelism in the SAD function with interpolation, which means that it uses many data type conversion instructions. MMMX, on the contrary, always employs 8-way parallelism in the SAD function with interpolation kernel. This is the reason why the speedup is almost two for this kernel.
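As a scalar reference point for the discussion above, the 8 × 8 sum-of-absolute-differences that psadbw-style SPIs accelerate can be written as follows (a sketch; the function name and the row-major layout with an explicit stride are illustrative, not the thesis code):

```c
#include <stdlib.h>

/* Sum of absolute differences between an 8x8 current block and an 8x8
   candidate block; 'stride' is the distance in bytes between rows. */
int sad8x8(const unsigned char *cur, const unsigned char *ref, int stride)
{
    int sad = 0;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            sad += abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;
}
```

An SIMD version folds the inner loop into one psadbw per row in MMX, whereas MMMX synthesizes the same reduction from general-purpose subtract, absolute-value, and sum instructions.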

Figure 4.40: Speedup of MMMX over MMX as well as the ratio of committed instructions (MMX over MMMX) for multimedia kernels that use the extended subwords technique, for a single block on the single-issue processor.

The speedup obtained for the Paeth kernel in Figure 4.40 is 2.66. The reason is that intermediate data is at most 10 bits wide and MMMX can, therefore, calculate the prediction for eight pixels in each loop iteration, while MMX computes the prediction for four pixels. The speedups of MMMX over MMX for the vertical IDCT and 2D

Figure 4.41: Speedup of MMMX over MMX as well as the ratio of committed instructions (MMX over MMMX) for multimedia kernels that use both proposed techniques, for a single block on the single-issue processor.

IDCT kernels in these figures are lower than the speedups for the other kernels. This is because the input data of these kernels is 12 bits wide and some intermediate results are wider than 12 bits. Therefore, the MMMX implementation cannot employ 12-bit functionality (8-way parallel SIMD instructions) all the time, but sometimes has to convert to 4 × 24-bit packed data types. The MMX implementation, on the other hand, is able to use 16-bit functionality all the time. The reason why MMMX improves performance by just 20% for the Padding kernel in Figure 4.41 is that the MMX implementation employs the special-purpose pavgb instruction, which computes the arithmetic averages of eight pairs of bytes. More precisely, the pavgb instruction is supported in the SSE integer extension to MMX. MMMX does not support this instruction because, with extended subwords, it offers little extra functionality: it can be synthesized using the more general-purpose instructions fadd12 and fsar12 (shift arithmetic right on extended subwords). Nevertheless, because the matrix needs to be transposed between horizontal and vertical padding, MMMX still provides a speedup. The two kernels for which the highest speedups are obtained are the 8 × 8 matrix transpose on 8-bit (Transp.(8)) and 12-bit data (Transp.(12)). If the matrix elements are 8-bit, MMMX can use the MRF to transpose the matrix, while MMX requires many pack and unpack instructions to realize a matrix transposition. Furthermore, if the elements are 12-bit (but stored as 16-bit data types), MMMX is able to employ 8-way parallel SIMD instructions, while MMX can only employ 4-way parallel instructions. As a result, MMMX improves performance by a factor of up to 4.47. The average speedup and ratio of committed instructions for kernels that only use the extended subwords technique are 1.90 and 2.08, respectively, while for the kernels that use both proposed techniques they are 2.05 and 2.56.
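The per-byte operation of pavgb, and why it can be synthesized with extended subwords, amounts to the following (a scalar sketch; with 12-bit subwords the intermediate sum of two 8-bit values cannot overflow, which is why an add followed by an arithmetic shift right suffices):

```c
/* Rounded average of two 8-bit values, as pavgb computes for each byte
   pair. With 12-bit subwords, a + b + 1 (at most 511) fits in the
   subword, so fadd12 followed by a 1-bit fsar12 can reproduce it. */
int avg_round(int a, int b)
{
    return (a + b + 1) >> 1;
}
```

With only 8-bit subwords this sum would overflow, which is why MMX needs a dedicated instruction for it.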
The reduction of the dynamic instruction count in Figure 4.40 is due to extended subwords, and in Figure 4.41 it is due to both extended subwords and the MRF technique. As a result, the performance benefit obtained by employing both techniques is higher than that of using just the extended subwords technique.
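For reference, the Paeth predictor computed by the Paeth kernel discussed above (as defined in the PNG specification) can be written in scalar C; note that for 8-bit pixels the initial estimate p = a + b − c lies in [−255, 510] and thus fits in a 10-bit signed subword, which is what makes MMMX's 12-bit subwords sufficient for 8-way processing:

```c
#include <stdlib.h>

/* Paeth predictor: a = left, b = above, c = upper-left neighbor. */
int paeth(int a, int b, int c)
{
    int p  = a + b - c;   /* initial estimate; 10-bit signed for 8-bit pixels */
    int pa = abs(p - a);
    int pb = abs(p - b);
    int pc = abs(p - c);
    if (pa <= pb && pa <= pc) return a;   /* ties prefer a, then b */
    if (pb <= pc) return b;
    return c;
}
```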

4.4.2 Image-level Speedup

As explained before, the results presented in Figure 4.40 and Figure 4.41 are for one execution on a single block. In most cases, however, the kernels are executed on all blocks of an image or frame. To investigate whether this changes the results fundamentally, Figure 4.42 and Figure 4.43 depict the image-level speedups for multimedia kernels that use either the extended subwords technique or both the extended subwords and MRF techniques, respectively.

Figure 4.42: Image-level speedup of MMMX over MMX as well as the ratio of committed instructions for the kernels that use the extended subwords technique, on the single-issue processor.

Figure 4.43: Image-level speedup of MMMX over MMX as well as the ratio of committed instructions for the kernels that use both proposed techniques, on the single-issue processor.

In general, the image-level speedups are higher than the block-level speedups. For example, the block-level speedup for the 2D DCT kernel is 1.66, while the image-level speedup is 2.26. As another example, the block-level speedup for the Paeth kernel is 2.66, whereas the image-level speedup is 2.72. The reason for this behavior is as follows. Although executing the kernels on all blocks of an image does not reduce the number of data cache misses (because the images are too large to be kept in the cache), it does reduce the number of instruction cache misses, since the kernels are relatively small and can be kept in the instruction cache. In addition, the image-level speedups for kernels that use both proposed techniques are larger than the image-level speedups of kernels that only use the extended subwords technique. For example, the average image-level speedup for the former kernels is 2.43, while for the latter it is 1.97. The main reason for this is that using both extended subwords and the MRF reduces the dynamic number of instructions more than using the extended subwords technique alone. The average ratio of committed instructions in Figure 4.43 is 2.70, while it is 2.15 in Figure 4.42. To summarize, the MMMX architecture improves performance compared to the MMX architecture by executing fewer instructions than MMX. In other words, the large number of instructions that have to be executed by the MMX architecture limits its performance. In all kernels except the SAD function and a part of the SAD function with interpolation, the MMMX implementation can employ 8-way parallel SIMD instructions, while MMX can employ only 4- or 2-way parallelism. In other words, MMMX can pack more arithmetic and logical operations into a single SIMD instruction. In addition, MMMX avoids data type conversion and rearrangement instructions. Figure 4.44 depicts the effect of increasing the issue width.
It shows the image-level speedups of the MMMX implementations over the MMX implementations on out-of-order processors with different issue widths. The speedup is relative to the number of cycles taken by the MMX implementation when executed on the processor with the same issue width. In general, it can be observed that the speedup of MMMX over MMX slightly decreases when the issue width is increased. This can be expected because MMMX collapses several MMX instructions into a single instruction, which generally decreases the distance between dependent instructions. In other words, a single MMMX arithmetic or logical instruction performs the work of multiple MMX instructions, and sometimes of multiple loop iterations.

Figure 4.44: Image-level speedup of MMMX over MMX implementation for different issue widths using out-of-order execution. The speedup is relative to the number of cycles taken by the MMX implementation when executed on the processor with the same issue width.

The main reason that the MMMX architecture improves performance compared to the MMX architecture is that it executes fewer instructions than MMX, as discussed for the previous figures. In order to show where the instruction count reduction comes from, the instructions are divided into three groups, namely SIMD instructions, SIMD ld/st instructions, and scalar instructions. SIMD instructions consist of SIMD ALU/MULT and SIMD overhead (data type conversion and rearrangement) instructions. The scalar instructions are used for conditional operations, boundary checking, updating the pointers, and incrementing or decrementing index and address values. Figure 4.45 and Figure 4.46 show the ratio of SIMD instructions, scalar instructions, and SIMD ld/st instructions of the MMX implementation to the MMMX implementation for one execution of the kernels on a single block, for kernels that use either the extended subwords technique or both proposed techniques, respectively.

Figure 4.45: Ratio of SIMD instructions, scalar instructions, and SIMD ld/st instructions of the MMX implementation to the MMMX implementation, for one execution of kernels that use the extended subwords technique on a single block.

Figure 4.46: Ratio of SIMD instructions, scalar instructions, and SIMD ld/st instructions of the MMX implementation to the MMMX implementation, for one execution of kernels that use both the extended subwords and MRF techniques on a single block.

The figures show that most of the reduction is due to the reduction of the number of SIMD instructions and scalar instructions. In Figure 4.45 the average ratios of SIMD instructions, scalar instructions, and SIMD ld/st instructions are 2.63, 2.1, and 1.87, respectively. In Figure 4.46 the average ratios of SIMD instructions, scalar instructions, and SIMD ld/st instructions are 3.48, 4.94, and 1.78, respectively. Figure 4.46 shows that for some kernels that use both techniques, the reduction of scalar instructions is slightly higher than that of the other two groups. In some cases, for instance in the horizontal DCT kernel, the proposed techniques eliminate all loop overhead. In addition, for some kernels, for example the padding kernel, there is no loop overhead in either architecture. In both figures, the reduction of SIMD ld/st instructions is much smaller than the reduction of SIMD instructions and scalar instructions. In the MMX implementations, SIMD instructions, SIMD ld/st instructions, and scalar instructions account for an average of 67.17%, 23.88%, and 8.95% of the total instructions, respectively, while in MMMX they account for an average of 57.41%, 31.02%, and 11.57%, respectively. In other words, the share of SIMD instructions is reduced by MMMX, while the shares of SIMD ld/st and scalar instructions are increased. However, the share of SIMD ld/st instructions is much larger than the share of scalar instructions. In the following section, the number of SIMD ld/st instructions in MMMX is reduced by increasing the number of registers.

4.4.3 Impact of the Number of Registers

It is well known that, for ISA legacy reasons, MMX has only 8 architectural registers. This is not enough to keep either intermediate results or the constant values that are used by multimedia kernels. For example, some multimedia kernels such as color space conversions use coefficients that remain unmodified throughout the full execution of the kernel. The number of these coefficients is usually small, but they cannot be kept in registers and have to be reloaded from memory in each loop iteration. As another example, the complete 8 × 8 block of the current frame that is used by the full search algorithm can also not be kept in registers and has to be reloaded from memory in each loop iteration. Although the constants and the current block will be found in the cache most of the time, the number of SIMD ld/st instructions is relatively large compared to the number of arithmetic instructions. In other words, a larger register file would make it possible to keep intermediate and constant values in the media registers during the whole execution of a program. In addition, this allows the programmer to take advantage of the spatial and temporal data locality present in many multimedia applications. This reduces the bandwidth required for SIMD ld/st instructions and also improves processing efficiency. Therefore, in this section the effect of adding more registers to the MMMX architecture is considered for two kinds of multimedia kernels: the color space conversions and the different similarity measurement algorithms. The color space conversions use at most twelve constant coefficients, which can be kept in the extra registers, and the similarity measurement algorithms use a complete 8 × 8 block of the current frame that can also be kept in the added registers. It has been found that 13 extra media registers are sufficient to keep the critical data in the register file. In the RGB-to-YCbCr kernel, 11 of these registers are used to hold constants and 2 to hold intermediate results.
Since two of the constant coefficients in the YCbCr-to-RGB kernel are zero and three of them are identical, only 9 additional registers are needed for this kernel. For the similarity measurement kernels, 8 extra media registers are employed. Thus, an entire 8 × 8 candidate block can be stored in these 8 extra media registers, as depicted in Figure 4.47.

Figure 4.47: The candidate block of the current frame can be stored in eight media registers to calculate the motion vector for each 16 × 16 search window of the reference frame.

Figure 4.48 illustrates the speedup of MMMX with 8 registers (MMMX-8) and MMMX with 13 extra registers (MMMX-13) over MMX, as well as the ratio of committed instructions (MMX implementation to MMMX), on the single-issue processor. MMMX-13 yields speedups ranging from 1.37 to 3.64. Furthermore, the performance improvement of MMMX-13 over MMMX-8 ranges from 1.38 to 1.57, and the ratio of committed instructions (MMMX-8 implementation to MMMX-13) ranges from 1.22 to 1.56. Again, the main reason for these performance improvements is the reduced number of instructions that need to be executed. Because the data that is needed very often can be kept in registers, fewer ld/st instructions need to be executed. The largest performance improvement is achieved for the SAD function. For this kernel the speedup is 0.87 using MMMX-8 and 1.37 using MMMX-13. Although the MMX code that uses the SAD SPI is faster than the MMMX-8 implementation, the MMMX-13 code yields higher performance.
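The color space conversion discussed above illustrates why keeping the coefficients in registers pays off. A scalar sketch of RGB-to-YCbCr using 8-bit fixed-point BT.601 coefficients follows (an approximation for illustration; the exact constants and scaling used in the thesis kernels are not repeated here):

```c
/* RGB to YCbCr with 8-bit fixed-point BT.601 coefficients. In the SIMD
   kernel, such constants stay in media registers for the whole image
   instead of being reloaded from memory in every loop iteration. */
void rgb_to_ycbcr(int r, int g, int b, int *y, int *cb, int *cr)
{
    *y  = ( 77 * r + 150 * g +  29 * b) >> 8;
    *cb = ((-43 * r -  85 * g + 128 * b) >> 8) + 128;
    *cr = ((128 * r - 107 * g -  21 * b) >> 8) + 128;
}
```

With only 8 architectural registers, the nine distinct coefficients plus intermediate results do not fit, so every iteration reissues loads; the 13 extra registers remove those loads entirely.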

4.4.4 Analysis of each Proposed Technique Separately

Figure 4.48: Speedup of MMMX with 8 registers (MMMX-8) and MMMX with 13 extra registers (MMMX-13) over MMX (8 registers), as well as the ratio of committed instructions (MMX implementation to MMMX), on the single-issue processor.

As already indicated in Table 4.2, both proposed techniques have been employed in the SIMD implementations of some kernels, such as the horizontal DCT. Consequently, a part of the performance benefit is due to extended subwords, which increases the DLP, and the other part is due to the MRF, which eliminates the data rearrangement instructions. This section discusses two examples, the horizontal DCT and the 2D DCT, in order to clarify how much of the performance gain is a result of the additional parallelism provided by extended subwords and how much of it is due to the MRF. The algorithm of the horizontal DCT has been explained in Section 4.2.4, where two SIMD implementations, one for MMX and one for MMMX, have been discussed. In order to determine the performance benefit of each proposed technique, two extra SIMD implementations have also been written: one for MMX enhanced with extended subwords and one for MMX enhanced with the MRF.

MMX Enhanced with Extended Subwords

In MMX enhanced with extended subwords (MMX + ES), there are eight 12-bit subwords in each media register. In order to bring these subwords into a form amenable to SIMD processing, new data permutation instructions such as fshuflh12, fshufhl12, fshufhh12, fshufll12, and frever12 are needed, for the following reasons. First, there are no shuffle instructions in MMX; MMX performs data permutation using pack and unpack instructions, and these are not useful for MMX + ES. Second, the pshufw (packed shuffle word) instruction in SSE rearranges four subwords within a media register, while MMX + ES has eight subwords. Figure 4.49 depicts the structure of the fshuflh12 mm1, mm0, imm8 instruction. This instruction inserts the four subwords from the low part of the source operand into the four subwords of the high part of the destination operand, at subword locations selected with the immediate operand imm8. The immediate field has 8 bits; each 2-bit subfield selects one subword location in the high part of the destination operand. The fshufhl12 instruction is almost the same as the fshuflh12 instruction, except that it inserts the four subwords from the high part of the source operand into the four subwords of the low part of the destination operand.

Figure 4.49: The structure of the fshuflh12 mm1, mm0, imm8 instruction.

Figure 4.50 depicts the structure of the fshufll12 mm1, mm0, imm8 instruction. This instruction inserts four subwords from the low part of the source operand into the low part of the destination operand, using the 8-bit immediate operand. The functionality of the fshufhh12 instruction is almost the same as that of the fshufll12 instruction, except that it copies subwords from the high part of the source operand and inserts them in the high part of the destination operand. The fshufll12 and fshufhh12 instructions are similar to the pshuflw and pshufhw instructions, respectively, which are supported by the Intel SSE2 extension.

Figure 4.50: The structure of the fshufll12 mm1, mm0, imm8 instruction.

Figure 4.51 depicts the operation of the frever12 mm1, mm0 instruction. This instruction copies the subwords from the source operand and inserts them in reverse order in the destination operand. Figure 4.52 depicts a part of the horizontal DCT code that has been implemented

Figure 4.51: The structure of the frever12 mm1, mm0 instruction.

by MMX + ES. In each loop iteration of this implementation eight pixels are processed, the same as in the MMX implementation that was discussed in Section 4.2.4.

fld16s12 mm0, (dct)   ; mm0 = x7 x6 x5 x4 x3 x2 x1 x0
frever12 mm2, mm0     ; mm2 = x0 x1 x2 x3 x4 x5 x6 x7
fneg12   mm2, mm2, 15 ; mm2 = x0 x1 x2 x3 -x4 -x5 -x6 -x7
fadd12   mm0, mm2     ; mm0 = x0+x7 x1+x6 x2+x5 x3+x4 x3-x4 x2-x5 x1-x6 x0-x7

Figure 4.52: A part of the code for horizontal DCT that has been implemented by MMX enhanced by extended subwords.
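The effect of the three-instruction sequence in Figure 4.52 can be checked with a scalar model: with subword i holding x_i, frever12 reverses the subwords, fneg12 with mask 15 negates the low four, and fadd12 then produces the sums in the high half and the differences in the low half (a sketch of the semantics, not an implementation):

```c
/* Scalar model of the MMX + ES sequence in Figure 4.52:
   y[i] = x[i] + x[7-i] for the high four subwords (i >= 4) and
   y[i] = x[i] - x[7-i] for the low four (i < 4). */
void dct_butterfly8(const int x[8], int y[8])
{
    for (int i = 0; i < 8; i++) {
        int r = x[7 - i];     /* frever12: reversed copy */
        if (i < 4)
            r = -r;           /* fneg12 ..., 15: negate the low four subwords */
        y[i] = x[i] + r;      /* fadd12 */
    }
}
```

This is exactly the first butterfly stage of the 8-point DCT, computed for all eight subwords with three SIMD instructions.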

MMX Enhanced with an MRF

In MMX enhanced with an MRF, there are four 128-bit column registers and eight 64-bit row registers, the same as in MMX. Each column register has eight 16-bit subwords, and each subword in a column register corresponds to a subword in a row register. Each load column instruction can load eight 16-bit pixels into a column register. Figure 4.53 shows how four rows of an 8 × 8 block can be loaded into the four column registers using load column instructions. In addition, Figure 4.54 depicts a part of the MMX + MRF implementation of the horizontal DCT algorithm. Two loop iterations are needed to process an 8 × 8 block; in each loop iteration, four rows (32 pixels) are processed.

Experimental Results

Figure 4.55 depicts the speedup of MMX + ES, MMX + MRF, and MMMX over MMX for one execution of an 8 × 8 horizontal DCT on a single-issue processor. In addition, this figure shows the ratio of committed instructions (MMX over the other architectures). The speedup of MMX + ES is 1.15, while the speedup of MMX +

Figure 4.53: Loading eight consecutively stored pixels into a column register with a load column instruction (little endian).

fldc16s16 cmm0, (dct)   ; cmm0 = x07 x06 x05 x04 x03 x02 x01 x00
fldc16s16 cmm1, 16(dct) ; cmm1 = x17 x16 x15 x14 x13 x12 x11 x10
fldc16s16 cmm2, 32(dct) ; cmm2 = x27 x26 x25 x24 x23 x22 x21 x20
fldc16s16 cmm3, 48(dct) ; cmm3 = x37 x36 x35 x34 x33 x32 x31 x30
movq   (dct), mm7       ; (mem) = x37 x27 x17 x07
movq   mm7, mm0         ; mm7 = x30 x20 x10 x00
paddsw mm0, (dct)       ; mm0 = x30+x37 x20+x27 x10+x17 x00+x07
psubsw mm7, (dct)       ; mm7 = x30-x37 x20-x27 x10-x17 x00-x07

Figure 4.54: A part of the MMX + MRF implementation of the horizontal DCT algorithm.
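The row/column aliasing of the MRF can be modeled in scalar C: writing block row r through column register r deposits element c of that row into row register c, so after the column loads the row registers hold the transposed data (a sketch; cmmN and mmN are modeled as plain arrays, and the little-endian column reversal shown in Figure 4.53 is omitted for clarity):

```c
#include <stdint.h>

/* Model of the load-column view: loading block row r into column
   register r deposits block[r][c] into subword r of row register c. */
void load_columns(const int16_t block[8][8], int16_t row_regs[8][8], int rows)
{
    for (int r = 0; r < rows; r++)        /* one fldc16s16 per row */
        for (int c = 0; c < 8; c++)
            row_regs[c][r] = block[r][c];
}
```

This is why the MRF needs no pack/unpack instructions for a transpose: the transposition is a side effect of the load path.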

MRF is less than 1. These results indicate that using only one of the two techniques, extended subwords or the MRF, is insufficient to eliminate most pack/unpack and rearrangement overhead instructions. Moreover, using the MRF alone is not merely unhelpful: it causes a performance loss. The MMMX architecture, which employs both proposed techniques, on the other hand, yields much larger performance benefits; its speedup is 1.52. In order to explain the behavior shown in Figure 4.55, Figure 4.56 shows the number of SIMD computation, SIMD overhead, SIMD ld/st, and scalar instructions for four different architectures, MMX, MMX + MRF, MMX + ES, and MMMX, for an 8 × 8 horizontal DCT kernel. As this figure shows, the total number of instructions for MMX + MRF is almost the same as for MMX, which means that this architecture cannot reduce the total number of instructions. MMX + MRF reduces the number of SIMD overhead instructions, but it increases the number of SIMD ld/st

Figure 4.55: Speedup of the MMX + ES, MMX + MRF, and MMMX over MMX as well as ratio of committed instructions for an 8 × 8 horizontal DCT on a single issue processor.

Figure 4.56: The number of SIMD computation, SIMD overhead, SIMD ld/st, and scalar instructions in four different architectures, MMX, MMX + MRF, MMX + ES, and MMMX, for an 8 × 8 horizontal DCT kernel.

instructions. This is because MMX + MRF transposes four rows in each iteration, which fills all eight 64-bit registers. In order to free some of the filled registers for intermediate computations, their contents have to be stored to and reloaded from the memory hierarchy, and this increases the number of SIMD ld/st instructions. The latency of SIMD ld/st instructions is generally higher than the latency of the SIMD overhead instructions. This is the main reason why MMX + MRF incurs a performance penalty. MMX + ES, on the other hand, reduces the total number of instructions; the ratio of committed instructions is 1.23, as shown in Figure 4.55. The extended subwords technique reduces the number of SIMD computation and SIMD ld/st instructions more than the MRF technique, while the latter reduces the number of SIMD overhead and scalar instructions more than the former. Consequently, these experimental results indicate that using either of these techniques alone is insufficient to reduce all four instruction groups: SIMD computation, SIMD overhead, SIMD ld/st, and scalar. The MMMX architecture, which employs both proposed techniques, reduces the total number of instructions much more than MMX + MRF and MMX + ES. The results presented in Figure 4.55 are for one execution of the horizontal DCT kernel on an 8 × 8 block. The horizontal DCT is a part of the 2D DCT kernel, and this kernel is executed on all blocks of an image. Figure 4.57 depicts the image-level speedup of MMX + ES, MMX + MRF, and MMMX over MMX for the 2D DCT kernel on a single-issue processor. The speedup of MMX + ES is 1.39, while the speedup of MMX + MRF is 0.94. MMMX yields a speedup of 2.17 due to the use of both proposed techniques.
These speedups are larger than the block-level speedups. The reasons for this have already been explained in Section 4.4.2.

Figure 4.57: Image-level speedup of MMX + ES, MMX + MRF, and MMMX over MMX, as well as the ratio of committed instructions, for the 2D DCT kernel on a single-issue processor.

4.4.5 Application-level Speedup

In order to obtain the application-level speedup, several MMAs, namely the MPEG-2, JPEG, and MJPEG standards, were studied. For the MPEG-2 encoder, the walk bitstream has been used, which consists of three 352 × 288 frames. For the MPEG-2 decoder, the vaya bitstream has been used, which consists of 25 160 × 112 frames. For the JPEG benchmarks, the fallsbig input was used, which is a 3000 × 2000 pixel image. Finally, for the MJPEG benchmark the tenis input was used, which is a frame of size 800 × 800. In order to find the most time-consuming kernels, these applications were profiled using the GNU profiler (gprof). In the MPEG-2 encoder, dist1 (sum-of-absolute-differences) is the most compute-intensive kernel. In this standard, four different similarity measurements, SAD, SAD with interpolation, SSD, and SSD with interpolation, have been used separately. In other words, there are four MPEG-2 encoder versions, namely MPEG-2 encoder with SAD, MPEG-2 encoder with SAD interpolation, MPEG-2 encoder with SSD, and MPEG-2 encoder with SSD interpolation. These similarity measurements consume an average of 87.2%, 91.2%, 90.2%, and 91.3% of the total execution time, respectively. In the MPEG-2 decoder, the 2D IDCT kernel consumes an average of 26.7% of the total execution time. In the JPEG encoder (cjpeg), the 2D DCT and RGB-to-YCbCr color space conversion kernels consume an average of 35.7% and 25.0% of the total execution time, respectively. In the JPEG decoder (djpeg), the YCbCr-to-RGB color space conversion and 2D IDCT kernels take most of the total execution time, namely 64.3% and 21.4%, respectively. Finally, the 2D DCT kernel consumes an average of 32.2% of the total execution time in the MJPEG application.

As already discussed, these kernels have been accelerated using MMX and MMMX. Table 4.4 depicts the image-level speedups of the MMX and MMMX implementations of the multimedia kernels used in the application-level measurements over the scalar implementations on a single-issue processor. In addition, the third column shows the speedup of MMMX over MMX. As the table shows, both architectures improve the performance of all media kernels. The MMMX architecture yields larger speedups than the MMX architecture for all kernels except the SAD kernel. These improvements translate into gains for the whole applications: to obtain the application-level speedup, the SIMD implementations of the multimedia kernels were substituted for their original C code in the media applications. The speedups of MMMX over MMX for the complete applications on the single-issue processor are depicted in Figure 4.58. The MMMX architecture improves the performance of all applications except the MPEG-2 encoder with the SAD kernel. MMMX achieves speedups of 0.91, 1.67, 1.43, and 1.90 for the MPEG-2 encoder with SAD, with SAD interpolation, with SSD, and with SSD interpolation, respectively. In the MPEG-2 encoder that uses the SAD kernel, there is a slight performance degradation. As mentioned previously, this is because the MMX code uses the special-purpose psadbw instruction, while MMMX does not. The largest performance improvement, a factor of 1.90, occurs for the MPEG-2 encoder that uses SSD with interpolation. In addition, the MMMX architecture improves performance by factors of 1.15, 1.17, 1.37, and 1.16 for the MPEG-2 decoder, JPEG encoder, JPEG decoder, and MJPEG, respectively. The overall improvement depends on the fraction of execution time that is improved and on the speedup obtained for that fraction (Amdahl's law).
Since MMMX improves the performance of multimedia kernels more than MMX, it yields more application-level speedups than MMX.
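The relation between kernel-level and application-level speedup stated above is Amdahl's law. As an illustrative sketch (the helper function is not part of the thesis; the input numbers are taken from the text), the bound it places on the application-level speedup can be computed directly:

```c
#include <assert.h>
#include <math.h>

/* Amdahl's law: overall speedup when a fraction f of the execution
 * time is accelerated by a factor s. */
static double overall_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}
```

With the 2D IDCT at f = 0.267, for instance, even an infinitely fast kernel caps the MPEG-2 decoder speedup at 1/(1 − 0.267) ≈ 1.36, which is consistent with the measured application-level speedup of 1.15.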

Multimedia Kernel        MMX    MMMX   MMMX/MMX
2D DCT                   2.7     6.1    2.3
2D IDCT                  3.2     5.1    1.6
RGB-to-YCbCr             4.8    11.8    2.5
YCbCr-to-RGB             6.2    11.0    1.8
SAD                     15.3    13.3    0.9
SAD with interpolation   6.3     9.8    1.6
SSD                      7.1    12.2    1.7
SSD with interpolation   4.6     8.5    1.9

Table 4.4: Image-level speedup of the MMX and MMMX implementations for different multimedia kernels, which have been used in the application-level speedup, over the scalar implementations on a single issue processor.

Figure 4.58: Application-level speedup of MMMX over MMX as well as ratio of committed instructions for multimedia applications on the single issue processor.

4.5 Conclusions

In this chapter, the MMX architecture enhanced with the extended subwords and matrix register file techniques was evaluated. In order to evaluate the effectiveness of the MMMX architecture, a number of multimedia benchmarks, namely the MPEG-2 codecs, the JPEG codecs, and MJPEG, were selected. These applications were accelerated by accelerating their kernels, because the media kernels are the most time-consuming functions and represent a major portion of MMAs. This chapter also presented the SIMD implementations of the media kernels using both the MMX and MMMX architectures. The floating-point kernels were implemented with fixed-point arithmetic. This chapter showed how the extended subwords and the matrix register file techniques can be used to reduce the dynamic number of instructions by decreasing data type conversion and data permutation overhead. Simulation results were obtained by extending the sim-outorder simulator of the SimpleScalar tool set. Performance was measured at the kernel, image, and application level. The presented results showed that MMMX improves performance significantly compared to MMX. In other words, the large number of instructions that has to be executed by the MMX architecture limits its performance. In addition, the experimental results showed that using either extended subwords or the MRF technique alone is insufficient to eliminate data rearrangement instructions. The MMMX architecture, which employs both proposed techniques, reduces the total number of instructions much more than MMX enhanced with the MRF or MMX enhanced with extended subwords. In the next chapter the Discrete Wavelet Transform (DWT) will be discussed in more detail. The DWT is used in the JPEG2000 standard and processes whole images, while the DCT is used in the MPEG standards and processes 8 × 8 blocks of pixels.
Three issues related to the efficient computation of the 2D DWT on general-purpose processors, in particular the Pentium 4, are discussed: 64K aliasing, cache conflict misses, and SIMD vectorization.

Optimizing the Discrete Wavelet Transform

The JPEG2000 standard employs the Discrete Wavelet Transform (DWT) instead of the DCT that is used in the MPEG-2 compression standard. The reason for this is that the image quality is much higher at low bit rates with a wavelet-based transform. In addition, the DCT-based compression standards partition an image into discrete 8 × 8 pixel blocks, which generates blocking artifacts in the output image. The DWT operates on a complete image or a large part of an image, which avoids these artifacts; consequently, it needs more memory than the DCT. The DWT is much more computationally intensive than the other functions of the JPEG2000 standard. For example, the results in [126] show that the DWT consumes on average 46% of the encoding time for lossless compression. For lossy compression, the DWT even requires 68% of the total encoding time on average. Therefore, it is of great importance to enhance the performance of the DWT. In order to improve performance, several researchers [35, 48] have proposed hardware implementations of the DWT. Programmable processors, however, are preferred to special-purpose hardware because they are more flexible, enable different transforms to be employed, and allow various filter bank lengths and various transform levels. In this chapter, three issues related to the efficient computation of the 2D DWT on SIMD-enhanced GPPs are described: 64K aliasing, cache conflict misses, and SIMD vectorization. 64K aliasing is a phenomenon that occurs on the Pentium 4 and can degrade performance by an order of magnitude. It occurs if two or more data inputs whose addresses differ by a multiple of 64K need to be cached simultaneously. In addition, the implementation of vertical filtering incurs many cache conflict misses if the filter length exceeds the number of cache ways.


Figure 5.1: Different sub-bands after the first decomposition level.

Additionally, the performance of the 2D DWT is improved by exploiting DLP using the SIMD instructions supported by most GPPs. This chapter is organized as follows. Section 5.1 describes the discrete wavelet transform and different approaches to traverse an image to implement the 2D DWT. The problems related to implementing the 2D DWT efficiently on GPPs are discussed in Section 5.2. Section 5.3 presents the experimental environment. Section 5.4 proposes and evaluates two techniques to circumvent 64K aliasing. Section 5.5 addresses the cache behavior of transforms with long filters and presents two techniques to avoid conflict misses. SIMD implementations of the 2D DWT are described in Section 5.6. Finally, conclusions are given in Section 5.7.

5.1 2D Discrete Wavelet Transform

The wavelet representation of a discrete signal X consisting of N samples can be computed by convolving X with a lowpass and a highpass filter and down-sampling each output signal by 2, so that the two frequency bands each contain N/2 samples. With the correct choice of filters, this operation is reversible. This process decomposes the original signal into two sub-bands: the lower and the higher band [132]. This transform can be extended to multiple dimensions by using separable filters. A 2D DWT can be performed by first performing a 1D DWT on each row of the image (horizontal filtering) followed by a 1D DWT on each column (vertical filtering). Figure 5.1 illustrates the first decomposition level (d = 1). In this level the original image is decomposed into four sub-bands that carry the frequency information in both the horizontal and vertical directions. In order to form multiple decomposition levels, the algorithm is applied recursively to the LL sub-band. Figure 5.2 illustrates the second (d = 2) and third (d = 3) decomposition levels as well as the layout of the different bands. As was mentioned in the previous chapter, there are different approaches to implement
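The decomposition step described above (filter, then down-sample by 2) can be sketched as follows. This is an illustrative implementation using periodic boundary extension, not code from the thesis:

```c
#include <assert.h>
#include <stddef.h>

/* One 1D DWT level: convolve x (length n, n even) with the lowpass and
 * highpass filters (length flen) and downsample by 2. The first n/2
 * entries of out receive the lowpass band, the last n/2 the highpass
 * band. The signal is extended periodically at the boundary. */
static void dwt1d(const float *x, float *out, size_t n,
                  const float *low, const float *high, size_t flen) {
    for (size_t i = 0; i < n / 2; i++) {
        float l = 0.0f, h = 0.0f;
        for (size_t k = 0; k < flen; k++) {
            float s = x[(2 * i + k) % n];   /* periodic extension */
            l += low[k] * s;
            h += high[k] * s;
        }
        out[i] = l;            /* lowpass band: n/2 samples  */
        out[n / 2 + i] = h;    /* highpass band: n/2 samples */
    }
}
```

With the Haar filters low = {0.5, 0.5} and high = {0.5, −0.5} (an assumed example), a constant signal yields a zero highpass band, which is an easy sanity check of the reversibility property.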

Figure 5.2: Sub-bands after the second and third decomposition levels.

the 2D DWT, such as the traditional convolution-based methods and the lifting scheme. The convolutional methods apply filtering by multiplying the filter coefficients with the input samples and accumulating the results. The Daubechies transform with four coefficients [141] (Daub-4) and the Cohen-Daubechies-Feauveau 9/7 filter [31] (CDF-9/7) are examples of this category. For instance, the CDF-9/7 transform has 9 lowpass filter coefficients h = {h−4, h−3, h−2, h−1, h0, h1, h2, h3, h4} and 7 highpass filter coefficients g = {g−2, g−1, g0, g1, g2, g3, g4}. Both filters are symmetric, i.e., h−i = hi. The lifting scheme has been proposed for the efficient implementation of the 2D DWT. This approach has three phases, namely split, predict, and update, which were discussed in the previous chapter. One example of this group is the integer-to-integer (5, 3) lifting scheme ((5, 3) lifting). In this chapter, three different transforms selected from both groups are considered: Daub-4, CDF-9/7, and (5, 3) lifting. These filters were chosen for various reasons. First, the (5, 3) lifting and CDF-9/7 transforms are included in Part 1 of the JPEG2000 standard [110]. Second, these transforms have been considered in many recent papers (e.g., [15, 59, 121, 25, 141, 29]). Finally, the (5, 3) lifting scheme has low computational complexity and performs reasonably well for lossy as well as lossless compression compared to other lifting filters [1]. Additionally, the (5, 3) filter has only one lifting step, and transforms with fewer lifting steps tend to perform better than transforms with more lifting steps in terms of speed as well as accuracy [1]. There are different algorithms to traverse an image to implement these transforms, namely the Row-Column Wavelet Transform (RCWT) and the Line-Based Wavelet Transform (LBWT) [3, 4, 5, 6, 30]. These approaches are discussed in the following sections.
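As a sketch of the lifting approach (illustrative only, assuming an even-length 1D signal and symmetric boundary extension; this is not the thesis implementation), one level of the integer (5, 3) transform looks as follows:

```c
#include <assert.h>
#include <stddef.h>

/* One level of the integer (5, 3) lifting transform, in place on
 * interleaved samples: predict (highpass) then update (lowpass).
 * n must be even; symmetric extension is used at the boundary. */
static void lift53(int *x, size_t n) {
    /* predict: odd samples become highpass coefficients */
    for (size_t i = 1; i < n; i += 2) {
        int left  = x[i - 1];
        int right = (i + 1 < n) ? x[i + 1] : x[i - 1];  /* mirror */
        x[i] -= (left + right) >> 1;
    }
    /* update: even samples become lowpass coefficients */
    for (size_t i = 0; i < n; i += 2) {
        int left  = (i > 0) ? x[i - 1] : x[i + 1];      /* mirror */
        int right = (i + 1 < n) ? x[i + 1] : x[i - 1];
        x[i] += (left + right + 2) >> 2;
    }
}
```

For a constant signal the highpass (odd) positions become zero and the lowpass (even) positions keep the constant, a convenient sanity check of the split/predict/update structure.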

5.1.1 Row-Column Wavelet Transform

In the RCWT approach, the 2D DWT is divided into two 1D DWTs, namely horizontal and vertical filtering. The horizontal filtering processes the rows of the original image and stores the wavelet coefficients in an auxiliary matrix. Thereafter, the vertical filtering phase processes the columns of the auxiliary matrix and stores the results back in the original matrix. In other words, this algorithm requires that all lines are horizontally filtered before the vertical filtering starts. The computational complexity of horizontal and vertical filtering is the same. Figure 5.1 depicts both horizontal and vertical filtering; each filtering step is applied separately. Each of the N × M matrices requires NMc bytes of memory, where c denotes the number of bytes required to represent one wavelet coefficient. The RCWT traversal technique is used to implement the 2D DWT in this chapter.

Figure 5.3: The line-based wavelet transform approach processes both rows and columns in a single loop.

5.1.2 Line-Based Wavelet Transform

In the line-based wavelet transform approach, the vertical filtering starts as soon as a sufficient number of lines, as determined by the filter length, has been horizontally filtered. In other words, the LBWT algorithm uses a single loop to process both rows and columns together. This technique computes the 2D DWT of an N × M image in the following stages. First, the horizontal filtering filters L rows, where L is the filter length, and stores the lowpass and highpass values interleaved in an L × M buffer. Thereafter, the columns of this small buffer are filtered. This produces two rows of wavelet coefficients, which are stored in different sub-bands of an auxiliary matrix in the order expected by the quantization step. Finally, these stages are repeated to process all rows and columns. Figure 5.3 illustrates the LBWT algorithm.
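The staging of the LBWT can be sketched as follows. This is an illustrative toy version using the Haar filters (L = 2) on a tiny image, with simplified interleaving and no boundary handling; it only shows the single-loop structure, not the thesis implementation:

```c
#include <assert.h>

#define ROWS 4   /* image height (even)            */
#define COLS 4   /* image width (even)             */
#define FLEN 2   /* filter length L (Haar example) */

/* Line-based 2D Haar DWT sketch: as soon as L (= 2) rows have been
 * horizontally filtered into a small L x M buffer, the buffer columns
 * are vertically filtered and one row of each sub-band is emitted. */
static void lbwt_haar(float img[ROWS][COLS],
                      float ll[ROWS/2][COLS/2], float hl[ROWS/2][COLS/2],
                      float lh[ROWS/2][COLS/2], float hh[ROWS/2][COLS/2]) {
    float buf[FLEN][COLS];                 /* small L x M buffer */
    for (int r = 0; r < ROWS; r += FLEN) { /* single loop over row groups */
        /* horizontal filtering of L rows into the buffer: lowpass
         * values in even columns, highpass values in odd columns */
        for (int k = 0; k < FLEN; k++)
            for (int j = 0; j < COLS; j += 2) {
                buf[k][j]     = 0.5f * (img[r+k][j] + img[r+k][j+1]);
                buf[k][j + 1] = 0.5f * (img[r+k][j] - img[r+k][j+1]);
            }
        /* vertical filtering of the buffer columns emits one row of
         * each of the four sub-bands */
        for (int j = 0; j < COLS; j += 2) {
            ll[r/2][j/2] = 0.5f * (buf[0][j]     + buf[1][j]);
            lh[r/2][j/2] = 0.5f * (buf[0][j]     - buf[1][j]);
            hl[r/2][j/2] = 0.5f * (buf[0][j + 1] + buf[1][j + 1]);
            hh[r/2][j/2] = 0.5f * (buf[0][j + 1] - buf[1][j + 1]);
        }
    }
}
```

A constant image ends up entirely in the LL band, with the HL, LH, and HH bands zero, which checks that the buffer is consumed before the next group of rows overwrites it.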

5.2 Issues Related to the 2D DWT on the GPPs

As previously mentioned, a 2D DWT consists of horizontal filtering along the rows followed by vertical filtering along the columns. In order to develop high-performance implementations of the 2D DWT on GPPs in general and the P4 in particular, the following issues need to be addressed:

void Lifting53_vertical() {
    int i, ii, j;
    for (i=0, ii=1; ii<N; i++, ii+=2) {     /* boundary handling omitted */
        for (j=0; j<M; j++) {
            img[i+N/2][j] = tmp[ii][j] - ((tmp[ii-1][j] + tmp[ii+1][j]) >> 1);
            img[i][j]     = tmp[ii-1][j] + ((img[i+N/2][j] + img[i+N/2-1][j] + 2) >> 2);
        }
    }
}

Figure 5.4: C implementation of vertical filtering using the (5, 3) lifting scheme with loop interchange technique.

First, the P4 suffers from a problem known as 64K aliasing, which can degrade performance by an order of magnitude. It occurs when two data blocks whose addresses differ by a multiple of 64K need to be cached simultaneously [69]. In the implementation of the 2D DWT, there is a 64K alias between the lowpass and highpass values when the image size is a large power of two. This phenomenon and the proposed techniques to circumvent it will be discussed in more detail in Section 5.4. Second, the straightforward way of performing vertical filtering is to process each column entirely before advancing to the next column. This method, however, results in excessive cache misses because it is unable to exploit spatial locality: the cache blocks corresponding to the first rows will have been evicted from the cache by the time the algorithm advances to the next column. In order to improve spatial locality, loop interchange, a well-known compiler technique, has been applied. Figure 5.4 depicts the C implementation of vertical filtering using the (5, 3) lifting scheme for an N × M image with loop interchange. As this figure shows, the loop interchange technique places the loop with index j inside the loop with index i, allowing the same rows to be processed successively and thereby reducing cache misses. Figure 5.5 depicts the effectiveness of loop interchange for vertical filtering: it shows the speedup of vertical filtering with interchanged loops over the straightforward implementation, which processes each column entirely before advancing to the next column, for the (5, 3) lifting and Daub-4 transforms. Clearly, the implementations with interchanged loops are much more efficient than the straightforward implementations, especially when the image is large. Loop interchange, however, does not solve all cache and memory problems.
This is because there are still many conflict misses if the filter length exceeds the number of cache ways, in particular if the image size is a multiple of the cache size. Therefore, two techniques to reduce the number of conflict misses are proposed; they are explained in Section 5.5. Although 64K aliasing is a problem specific to the P4, the conflict avoidance methods are general and can be applied to other processors as well. To show this, results are also presented for the P3 and AMD Opteron processors.

Figure 5.5: Effectiveness of loop interchange on the Pentium 4. This figure depicts the speedup of vertical filtering with interchanged loops over the straightforward implementa- tion, which processes each column entirely before advancing to the next column for the lifting and Daub-4 transforms.

As shown in Figure 5.5, the loop interchange technique improves the performance of vertical filtering significantly. For this reason, the performance of the proposed techniques will be compared to the performance attained by the algorithms after loop interchange; in other words, the implementations with interchanged loops will be used as reference implementations. Third, high-performance implementations of the 2D DWT must exploit DLP using the SIMD instructions supported by most GPPs. Section 5.6 describes how the 2D DWT can be vectorized using MMX and SSE instructions. Vertical filtering is relatively straightforward to vectorize. Horizontal filtering is more difficult to vectorize, however, since it requires the data to be reorganized.

5.3 Experimental Setup

All C programs and SIMD implementations have been executed on the P4 processor. In addition, to show the generality of the proposed techniques to avoid cache conflict misses, the P3 and AMD Opteron processors have also been used. The main architectural parameters of these systems are summarized in Table 5.1. All versions were compiled using gcc with optimization level -O2 and executed on a lightly loaded system. The speedup was measured as the ratio of execution cycle counts. The total number of cycles has been obtained with the IA-32 cycle counter [70], which was explained in Section 1.4 of Chapter 1. To eliminate the effects of context switching and compulsory cache misses, the K-best measurement scheme and a warmed-up cache have been used.
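The measurement scheme can be sketched as follows. This is a simplified illustration assuming an x86 target; the actual harness in the thesis follows [70]:

```c
#include <assert.h>
#include <stdint.h>

/* Read the IA-32/x86-64 time-stamp counter (cycle counter). */
static inline uint64_t read_tsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Empty workload, useful to calibrate the measurement overhead. */
static void touch(void) {}

/* K-best scheme (simplified): execute fn once to warm up the caches,
 * then k more times, keeping the minimum cycle count so that runs
 * perturbed by context switches are filtered out. */
static uint64_t kbest_cycles(void (*fn)(void), int k) {
    uint64_t best = (uint64_t)-1;
    fn();                            /* warm-up run */
    for (int i = 0; i < k; i++) {
        uint64_t t0 = read_tsc();
        fn();
        uint64_t t1 = read_tsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Measuring the empty workload gives the fixed overhead of the timing instructions themselves, which the K-best minimum makes stable across runs.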

Processor         Intel Pentium 3         AMD Opteron             Intel Pentium 4

CPU Clock Speed   451 MHz                 2.0 GHz                 3.0 GHz
L1 Data Cache     16 KBytes               64 KBytes               8 KBytes
                  2-way set associative   2-way set associative   4-way set associative
                  32-byte line size       64-byte line size       64-byte line size
L2 Cache          512 KBytes              1 MByte                 512 KBytes
                  8-way set associative   8-way set associative   8-way set associative
                  32-byte line size       32-byte line size       64-byte line size
Memory            384 MBytes              1 GByte                 1 GByte

Table 5.1: Parameters of the experimental platforms.

5.4 Avoiding 64K Aliasing

In the Pentium 4 there is a phenomenon known as 64K aliasing [69]. It occurs if two or more data blocks whose addresses differ by a multiple of 64K need to be cached simultaneously. When it occurs, the associativity of the cache is useless and the effectiveness of the cache is greatly reduced. The reasons for the 64K aliasing problem are not well documented. Some sources [64] attribute it to incomplete tag encoding: only 16 bits are used for the cache lookup, namely 6 bits for the block offset, 5 bits for the index, and bits 11 to 15 for the tag [97]. The remaining tag bits come from the Data Translation Look-aside Buffer (DTLB). Because of this, references to addresses with the same 16 lower-order bits (i.e., addresses that are 2^16 bytes or a multiple thereof apart) are not resolvable in the L1 data cache. According to the Intel documentation [69], the instruction that accesses the second 64K-aliasing data item has to wait until the first one is written back from the cache. This clearly obstructs out-of-order processing. For some image sizes, the 2D DWT suffers from 64K aliasing. To illustrate this, Figure 5.6 depicts the slowdown of the reference implementation of vertical filtering over horizontal filtering on the P4. Even though they perform the same number of operations, for some image sizes vertical filtering is substantially slower (up to a factor of 4.27) than horizontal filtering. One reason for this could be the cache behavior. To analyze whether this is the case, Figure 5.7 shows the ratio of the number of cache misses incurred by vertical filtering to the number incurred by horizontal filtering. These results have been obtained using a trace-driven cache simulator with the cache configured as the L1 data cache of the P4. It can be seen that the slowdown of vertical filtering over horizontal filtering cannot be explained by the cache miss behavior. For example, when the image size is

Figure 5.6: Slowdown of vertical filtering over horizontal filtering on the P4.

Figure 5.7: Ratio of the number of cache misses incurred by vertical filtering to the number of cache misses incurred by horizontal filtering for an 8KB 4-way set-associative L1 data cache with a line size of 64 bytes.

256 × 256, vertical filtering is slower than horizontal filtering by a factor of 4.27 for the lifting transform, and by a factor of 2.95 for the Daub-4 transform. For both transforms, however, they incur about the same number of cache misses. For the CDF-9/7 transform, on the other hand, vertical filtering is slower by a factor of 1.82 but generates more than 39 times as many cache misses as horizontal filtering. Similar behavior can be observed for other image sizes. Hence the large slowdown of vertical versus horizontal filtering should not (only) be attributed to cache misses but mainly to 64K aliasing. To further explain why and when 64K aliasing occurs, Figure 5.8 depicts a C implementation of vertical filtering using the Daub-4 transform. One iteration of the inner loop accesses img[i][j] and img[i+N/2][j]. Hence 64K aliasing occurs if cN^2/2 is a multiple of 2^16, where c is the number of bytes needed to represent one wavelet coefficient. Since c is 2 for the lifting and 4 for the Daub-4 and CDF-9/7 transforms, for square N × N images 64K aliasing occurs if N = 256 (since 2·256^2/2 = 2^16), for powers of 2 larger than 256, and for N = 1280 (since 2·1280^2/2 = 25·2^16). Although this chapter focuses on square images, 64K aliasing may also occur for non-square images: for N × M images, it occurs when cNM/2 is a multiple of 2^16.

void Daub_4_vertical() {
    int i, ii, j;
    float low[] ={-0.1294, 0.2241, 0.8365, 0.4830};
    float high[]={-0.4830, 0.8365, -0.2241, -0.1294};
    for (i=0, ii=0; i<N/2; i++, ii+=2) {    /* boundary handling omitted */
        for (j=0; j<M; j++) {
            img[i][j]     = tmp[ii][j]  *low[0]  + tmp[ii+1][j]*low[1]
                          + tmp[ii+2][j]*low[2]  + tmp[ii+3][j]*low[3];
            img[i+N/2][j] = tmp[ii][j]  *high[0] + tmp[ii+1][j]*high[1]
                          + tmp[ii+2][j]*high[2] + tmp[ii+3][j]*high[3];
        }
    }
}

Figure 5.8: C implementation of vertical filtering using the Daub-4 transform. Note that the loops have been interchanged w.r.t. the straightforward implementation.

To circumvent 64K aliasing, two techniques are proposed and evaluated. The first is to split the inner loop so that the lowpass values (img[i][j]) and highpass values (img[i+N/2][j]) are calculated in separate loops. In this way the 64K alias between them is removed. This is a well-known compiler technique called loop fission. Loop fission, however, is usually applied to enable other transformations such as loop interchange and vectorization, while here it is applied to avoid 64K aliasing. Figure 5.9 depicts the speedup resulting from this program transformation. For those image sizes that suffer from 64K aliasing (as explained above, powers of two larger than 256 × 256, and 1280 × 1280), loop fission indeed improves performance significantly: the speedup ranges from 1.97 to 2.94 for the lifting transform, from 2.36 to 3.31 for Daub-4, and from 1.27 to 1.75 for CDF-9/7. For CDF-9/7 the improvements are smaller than for the other two transforms, because it also suffers from many cache conflict misses: its filter length of 9 exceeds the number of cache ways, which is 4. For those image sizes that do not suffer from 64K aliasing, however, loop fission reduces performance by up to 20%, for the following reasons. First and most importantly, loop fission removes the temporal reuse that exists between the calculation of the highpass and lowpass values. As can be seen in Figure 5.8, both statements in the loop body access tmp[ii][j]; after loop fission the two statements are in different loops and this temporal reuse is lost. Second, loop fission increases loop overhead, although this overhead could be reduced by unrolling the loops.

Figure 5.9: Speedup of vertical filtering over the reference implementation achieved by loop fission.

The second proposed technique is to offset the memory address of the highpass value by one or two rows, depending on the transform: instead of storing the highpass value in img[i+N/2][j], it is stored in img[i+N/2+1][j]. With this offsetting technique the distance between the two addresses is no longer a multiple of 64K, but the matrices have to be extended with one or two rows.

Figure 5.10 depicts the speedup achieved by the offsetting technique. For those image sizes that suffer from 64K aliasing, it improves performance by a factor ranging from 3.07 to 4.20 for the lifting transform, from 2.99 to 3.11 for Daub-4, and from 1.41 to 1.69 for CDF-9/7. Moreover, the offsetting technique does not incur a performance penalty for image sizes that do not suffer from 64K aliasing, because it does not destroy the temporal locality between the calculation of the lowpass and highpass values. In conclusion, the offsetting technique is preferable to loop fission.

Figure 5.10: Performance improvement achieved by the offsetting technique.
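The aliasing condition and the effect of offsetting can be checked with a few lines of address arithmetic. This is an illustrative sketch (the constants are an example for the (5, 3) lifting case, c = 2, N = 256), not code from the thesis:

```c
#include <assert.h>
#include <stdint.h>

#define IMG_N  256   /* image dimension (example)             */
#define COEF_C 2     /* bytes per coefficient ((5,3) lifting) */

/* Two addresses 64K-alias on the P4 when they differ by a multiple
 * of 2^16 bytes, i.e., when their 16 low-order bits are equal. */
static int aliases_64k(uintptr_t a, uintptr_t b) {
    return ((a ^ b) & 0xFFFF) == 0;
}

/* Byte distance between rows r1 and r2 of a row-major IMG_N x IMG_N
 * matrix of COEF_C-byte coefficients. */
static uintptr_t row_distance(int r1, int r2) {
    return (uintptr_t)(r2 - r1) * IMG_N * COEF_C;
}
```

For the lowpass row i and the highpass row i+N/2, the distance is cN^2/2 = 65536 bytes, so every pair of accesses aliases; storing the highpass value one row further (i+N/2+1) adds cN = 512 bytes, and the low-order 16 bits no longer match.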

5.5 Cache Optimization

Figure 5.7 shows that for small images (up to 128 × 128), vertical filtering does not produce more cache misses than horizontal filtering, regardless of the transform employed. For images larger than 800 × 800, however, vertical filtering generates about 50% more misses than horizontal filtering for the lifting and Daub-4 transforms. For the Daub-4 transform (Figure 5.8) this can be explained as follows. To compute row i of img, it uses rows 2i, 2i+1, 2i+2, 2i+3 of tmp; hence to compute row i+1, it uses rows 2i+2, 2i+3, 2i+4, 2i+5. This implies that rows 2i+2 and 2i+3 are reused, provided 4 rows can be kept in the cache. When N ≥ 800, however, they cannot (since 4 × 4 × 800 > 8 KB), which is why vertical filtering generates more misses than horizontal filtering. For the lifting transform this already occurs for N = 512, because this transform does not access 4 consecutive rows of input data. In general, if the rows of the input image needed to compute row i of the output data are rows j to j+L−1, then rows j+2 to j+L−1 are reused to compute the next row i+1, provided L rows can be kept in the cache. This is illustrated in Figure 5.11.

Figure 5.11: Reuse in vertical filtering.

More serious behavior, however, is exhibited by the CDF-9/7 transform. For example, when N is (a multiple of) a large power of two, vertical filtering with this transform generates up to more than 72 times as many cache misses as horizontal filtering. This can be explained as follows. When N = 512, each row occupies 512 × 4 = 2 KB. Since each way of the P4 L1 data cache is also 2 KB, corresponding blocks in different rows map to the same cache set. Consequently, since 9 blocks are needed to compute one block of output data and this exceeds the number of cache ways, many conflict misses are generated. In other words, the reuse that exists between the computation of img[i][j] and img[i][j+1] (provided they are in the same cache block) is destroyed. The same holds when N is a multiple of 512. When N = 256 or N = 1280, 5 blocks map to the same cache set, which also causes many cache misses, but fewer than when all 9 blocks map to the same set.
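This conflict analysis can be checked numerically: with an 8 KB, 4-way cache with 64-byte lines, there are 32 sets and each way covers 2 KB. The set-index calculation below is an assumed, simplified mapping for illustration, not thesis code:

```c
#include <assert.h>
#include <stdint.h>

/* P4 L1 data cache model: 8 KB, 4-way, 64-byte lines -> 32 sets.
 * The set index of a byte address is bits 6..10. */
static unsigned cache_set(uintptr_t addr) {
    return (unsigned)((addr >> 6) & 31);
}
```

For N = 512 floats, each row is 2 KB, so element (i, j) sits at byte offset i·2048 + 4·j; the blocks holding column j in all nine rows accessed by CDF-9/7 then map to the same set and overwhelm the 4 ways.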
For the lifting and Daub-4 transforms, this problem does not exist because their filter accesses do not exceed the number of cache ways. To solve this problem, it must be ensured that at most n rows are accessed before advancing to the next output data, where n is the number of cache ways. In the remainder of this section, two methods for doing so are presented. Since they are parameterized by the number of cache ways and the filter length, they can also be applied to other cache organizations and transforms. In other words, both techniques are general and architecture independent.

Figure 5.12: Associativity-conscious loop splitting.

5.5.1 Associativity-Conscious Loop Fission Technique

The first method is referred to as associativity-conscious loop fission (ACLF). The idea is to split the loop that computes one row of wavelet output into multiple loops, so that each loop accesses at most n rows. The first loop computes the partial results that can be computed from the first n rows of input data; the remaining loops add the contributions of the following groups of n rows to these partial results.

Specifically, let Lmax = max{Llow, Lhigh}, where Llow and Lhigh are the lengths of the lowpass and highpass filters of the DWT. One row of wavelet coefficients is calculated using ⌈Lmax/n⌉ loops, and each loop accesses at most n rows of input coefficients. This transformation is illustrated in Figure 5.12 and pseudo-code illustrating the transformation is depicted in Figure 5.13. For brevity and simplicity, start-up and clean-up code has been omitted.

5.5.2 Lookahead Technique

In the second scheme, called lookahead, the rows are processed in a skewed manner. There is only one loop, as in the original algorithm. In iteration j (0 ≤ j < N) a partial result is computed for the output element img[i][j] but, in the same iteration, a partial result is also computed for the output element img[i][(j+B/c) mod N] that is located B/c columns ahead, for the element img[i][(j+2*B/c) mod N], and so on. Here B is the cache line size in bytes and c is the number of bytes per wavelet coefficient, as before. To compute a partial result for img[i][j], the n input elements tmp[ii][j], tmp[ii+1][j], ..., tmp[ii+n-1][j] are processed. A partial result for img[i][(j+B/c) mod N] is computed using the elements tmp[ii+n][(j+B/c) mod N], ..., tmp[ii+2n-1][(j+B/c) mod N], and so on. So in each iteration, L/n partial results are computed, where L is the filter length. In later iterations, the partial results corresponding to the same column are accumulated. This scheme ensures that no more than n input coefficients accessed in one loop iteration map to the same cache set. The algorithm is illustrated in Figure 5.14 and pseudo-code that illustrates the transformation is given in Figure 5.15. For brevity and simplicity, start-up and clean-up code has been omitted.

Figure 5.13: (a) reference implementation and (b) associativity-conscious loop splitting technique.

5.5.3 Performance Results

Figure 5.16 compares the performance improvements obtained by applying ACLF or lookahead in addition to offsetting to the speedup achieved by applying offsetting alone. Results are presented only for CDF-9/7, since only this transform suffers from both 64K aliasing and excessive cache misses. For image sizes that experience


Figure 5.14: Illustration of the lookahead algorithm for vertical filtering.

...
img[i][j+B/c]  += tmp[ii+n][j+B/c]     * low[n]
                + tmp[ii+n+1][j+B/c]   * low[n+1]
                + ...
                + tmp[ii+2*n-1][j+B/c] * low[2*n-1];
...
img[i][j+L_max/n*B/c] += tmp[ii+L_max-1-n][j+L_max/n*B/c]   * low[L_max-1-n]
                       + tmp[ii+L_max-1-n+1][j+L_max/n*B/c] * low[L_max-1-n+1]
                       + ...
                       + tmp[ii+L_max-1][j+L_max/n*B/c]     * low[L_max-1];

Figure 5.15: (a) reference implementation and (b) lookahead technique.

Figure 5.16: Comparison of the speedups obtained by applying offsetting alone to the speedups achieved by applying associativity-conscious loop fission or lookahead in addition to offsetting for the CDF-9/7 transform.

many cache conflicts (as explained in Section 5.5, N = 256 and multiples thereof), avoiding them provides additional performance improvements. For example, applying offsetting alone provides a speedup of up to 1.69, while combining it with the lookahead technique yields a speedup of up to 1.99. In general, the lookahead technique performs slightly better than ACLF, because it incurs less loop overhead. For image sizes that do not generate many conflict misses, both schemes generally decrease performance slightly (by at most 7%), due to the overhead needed for managing loop and index variables and address calculations. As mentioned before, both ACLF and the lookahead technique are general and architecture independent. This means that, although results have been measured for the CDF-9/7 transform on the Pentium 4, they can also be applied to other transforms and to processors with different cache configurations. For example, for certain image sizes, the (5, 3) lifting and Daub-4 transforms would incur many cache conflict misses for a 2-way set-associative cache; in these cases the same techniques can be applied with the parameters L = 4 and n = 2. To validate this claim, Figure 5.17 depicts the speedup obtained by applying ACLF and the lookahead technique on the P3 and the Opteron processors. Analytically it can be determined that on the P3 many conflict misses occur when N = 1024, 2048, 4096, and on the Opteron when N = 2048, 4096 and, to a lesser extent, for N = 1280. Figure 5.17 shows that for these image sizes ACLF provides a performance improvement ranging from 14% to 18% on the P3 and from 2% to 95% on the Opteron.
On the other hand, the lookahead technique improves performance by 34% to 41% on the P3 and by 15% to 126% on the Opteron. However, for image sizes that do not generate many cache conflict misses, these techniques reduce performance by up to 20%. This shows that it is necessary to provide different versions of the code and, depending on the image size and the cache organization of the target platform, to call the most efficient version.

Figure 5.17: Speedups obtained by applying ACLF and the lookahead technique over the reference implementation of the CDF-9/7 transform on the P3 and Opteron.

5.6 SIMD Vectorization

An efficient implementation of the DWT on the P4 as well as other GPPs must exploit the SIMD extensions provided by these processors. This section presents MMX/SSE implementations of the DWT and discusses performance results on the Intel P4 processor. The section is organized as follows. The SIMD implementations of the convolutional methods (Daub-4 and CDF-9/7) are discussed in Section 5.6.1. Thereafter, an SIMD implementation of the lifting scheme is presented in Section 5.6.2. Performance results are provided in Section 5.6.3. In Section 5.6.4, the limitations of the SIMD implementations that restrict the performance improvements are discussed. Section 5.6.5 discusses possible solutions to improve the performance of SIMD implementations of the 2D DWT. Finally, experimental results are presented in Section 5.6.6.

5.6.1 SIMD Implementations of Convolutional Methods

The SIMD implementations of Daub-4 and CDF-9/7 are very similar. Both process single-precision floating-point values and apply filtering by multiplying the filter coefficients with the input samples and accumulating the results. They will therefore be discussed together. Under a row-major image layout, it is relatively straightforward to vectorize vertical filtering using SSE instructions, because the elements that can be processed simultaneously are stored consecutively in memory. Consider, for example, the Daub-4 transform and let xi,j be the input samples, let c0, . . . , c3 denote the lowpass filter coefficients, and let Li,j be the lowpass values. Then vertical filtering of the lowpass values is given by:

(Li,j Li,j+1 Li,j+2 Li,j+3) =

(c0 c0 c0 c0) × (x2i,j x2i,j+1 x2i,j+2 x2i,j+3) +

(c1 c1 c1 c1) × (x2i+1,j x2i+1,j+1 x2i+1,j+2 x2i+1,j+3) +

(c2 c2 c2 c2) × (x2i+2,j x2i+2,j+1 x2i+2,j+2 x2i+2,j+3) +

(c3 c3 c3 c3) × (x2i+3,j x2i+3,j+1 x2i+3,j+2 x2i+3,j+3) (5.1)

In this equation, the operator × denotes elementwise vector multiplication. Similar equations can be written for the highpass values and for other convolutional filters. The equation can be mapped almost one-to-one to SSE instructions; the only technical detail is that each coefficient needs to be replicated four times. Figure 5.18 illustrates the data flow graph of the vertical filtering. As this figure shows, four different input samples of each row are multiplied with one filter coefficient simultaneously. Each filter coefficient is replicated in each of the four subwords of a media register. After four multiplications of four consecutive rows with different coefficients, the results of each column are added together. Finally, four wavelet coefficients are calculated simultaneously. Horizontal filtering is more difficult to vectorize, however. In this case, the lowpass values can be calculated using the equation:

(Li,j Li,j+1 Li,j+2 Li,j+3) =

(c0 c0 c0 c0) × (xi,2j xi,2j+2 xi,2j+4 xi,2j+6) +

(c1 c1 c1 c1) × (xi,2j+1 xi,2j+3 xi,2j+5 xi,2j+7) +

(c2 c2 c2 c2) × (xi,2j+2 xi,2j+4 xi,2j+6 xi,2j+8) +

(c3 c3 c3 c3) × (xi,2j+3 xi,2j+5 xi,2j+7 xi,2j+9) (5.2)

In addition, Figure 5.19 depicts the data flow graph of the horizontal filtering. As this figure shows, four different input samples are multiplied with four different coefficients, and the intermediate results are accumulated into one destination operand. In other words, to map this figure to SIMD instructions, a vector-vector multiplication (dot product) instruction would have been useful. Since SSE does not provide such an instruction, the elements have to be rearranged so that, for example, the input samples xi,2j, xi,2j+2, xi,2j+4, and xi,2j+6 are stored consecutively in an SSE register. Figure 5.20 shows the SSE code that computes four lowpass values. It can be seen that many overhead (unpack) instructions are needed.

Figure 5.18: Data flow graph of the vertical filtering of the Daub-4 transform.


Figure 5.19: Data flow graph of the horizontal filtering of the Daub-4 transform.

movaps   xmm0, (esi)   ; xmm0 = a3 a2 a1 a0
movaps   xmm1, 16(esi) ; xmm1 = a7 a6 a5 a4
movaps   xmm2, xmm0    ; xmm2 = a3 a2 a1 a0
unpcklps xmm0, xmm1    ; xmm0 = a5 a1 a4 a0
unpckhps xmm2, xmm1    ; xmm2 = a7 a3 a6 a2
movaps   xmm1, xmm0    ; xmm1 = a5 a1 a4 a0
unpcklps xmm0, xmm2    ; xmm0 = a6 a4 a2 a0
unpckhps xmm1, xmm2    ; xmm1 = a7 a5 a3 a1
movups   xmm2, 8(esi)  ; xmm2 = a5 a4 a3 a2
movups   xmm3, 24(esi) ; xmm3 = a9 a8 a7 a6
movaps   xmm4, xmm2    ; xmm4 = a5 a4 a3 a2
unpcklps xmm2, xmm3    ; xmm2 = a7 a3 a6 a2
unpckhps xmm4, xmm3    ; xmm4 = a9 a5 a8 a4
movaps   xmm3, xmm2    ; xmm3 = a7 a3 a6 a2
unpcklps xmm2, xmm4    ; xmm2 = a8 a6 a4 a2
unpckhps xmm3, xmm4    ; xmm3 = a9 a7 a5 a3
movaps   xmm4, xmm0    ; xmm4 = a6 a4 a2 a0
movaps   xmm5, xmm1    ; xmm5 = a7 a5 a3 a1
movaps   xmm6, xmm2    ; xmm6 = a8 a6 a4 a2
movaps   xmm7, xmm3    ; xmm7 = a9 a7 a5 a3

Figure 5.20: Computing four lowpass values for horizontal filtering using SSE instructions (Daub-4 transform).

5.6.2 MMX Implementation of the Lifting Scheme

The SIMD implementation of the (5, 3) lifting scheme is significantly different from the SSE implementations of Daub-4 and CDF-9/7 for the following reasons. First, the (5, 3) lifting transform uses integer arithmetic and hence its SIMD implementation employs MMX instructions. Second, in the MMX implementation there are no multiplication operations, since the input values need to be divided by powers of 2, which can be accomplished using shift operations. Third, because of its structure, the (5, 3) lifting scheme is vectorized in a completely different way than the convolutional transforms. As previously mentioned, the lifting operation consists of several stages. First, the original 1D input signal is split into a subsequence consisting of the even-numbered input values {s^0_i} and a subsequence containing the odd-numbered input values {d^0_i}. Thereafter, the prediction stage produces the highpass output values {d^1_i} and the update stage generates the lowpass output values {s^1_i}.

Figure 5.21: One prediction and update stage in the lifting scheme of the (5, 3) lifting transform.

The lifting operation for the (5, 3) filter bank is depicted in Figure 5.21. The sequences {s^0_i} and {d^0_i} denote the even and odd input subsequences, and the outputs {s^1_i} and {d^1_i} are the lowpass and highpass output coefficients of the DWT filter, respectively. The equations used in the prediction and update stages of the (5, 3) lifting transform are given by:

d^1_i = d^0_i - \lfloor (s^0_i + s^0_{i+1}) / 2 \rfloor   (5.3)

s^1_i = s^0_i + \lfloor (d^1_{i-1} + d^1_i + 2) / 4 \rfloor   (5.4)

Figure 5.22 depicts a part of the data flow graph of the (5, 3) lifting scheme based on Equation (5.3) and Equation (5.4). In order to vectorize horizontal filtering, the data needs to be rearranged so that the even and odd subsequences are placed in different registers. Furthermore, because s^0_i and s^0_{i+1} have to be added, two copies of the even subsequence are required, one that starts with s^0_0 and one that starts with s^0_1. Figure 5.23 shows the MMX code that achieves this rearrangement; it can be seen that many unpack instructions are required. After the code has been executed, the first four highpass output values can be computed by adding mm0 to mm3, shifting the results right by 1 bit position, and subtracting these results from mm4 according to Equation (5.3). As was the case for the convolutional transforms, vertical filtering is easier to vectorize. In this case, the even and odd subsequences do not have to be split because they correspond to different rows. A drawback of vertical filtering compared to


Figure 5.22: Part of the data flow graph of the forward integer-to-integer lifting transform using the (5, 3) filter bank (Shr = Shift right).

movq      mm0, (esi)  ; mm0 = s0 d0 s1 d1 s2 d2 s3 d3
movq      mm1, 8(esi) ; mm1 = s4 d4 s5 d5 s6 d6 s7 d7
pxor      mm7, mm7    ; mm7 = 0
movq      mm2, mm0    ; mm2 = s0 d0 s1 d1 s2 d2 s3 d3
punpcklbw mm0, mm7    ; mm0 = s0 d0 s1 d1
punpckhbw mm2, mm7    ; mm2 = s2 d2 s3 d3
punpcklbw mm1, mm7    ; mm1 = s4 d4 s5 d5
movq      mm3, mm0    ; mm3 = s0 d0 s1 d1
punpcklwd mm0, mm2    ; mm0 = s0 s2 d0 d2
punpckhwd mm3, mm2    ; mm3 = s1 s3 d1 d3
movq      mm4, mm0    ; mm4 = s0 s2 d0 d2
punpcklwd mm0, mm3    ; mm0 = s0 s1 s2 s3
punpckhwd mm4, mm3    ; mm4 = d0 d1 d2 d3
punpcklwd mm2, mm1    ; mm2 = s2 s4 d2 d4
punpcklwd mm3, mm2    ; mm3 = s1 s2 s3 s4

Figure 5.23: MMX instructions needed to rearrange the elements for the (5, 3) lifting scheme.

horizontal filtering is, however, that the previous highpass output values, which are needed for the update stage, cannot be kept in a register, while in horizontal filtering they can. For example, in vertical filtering, after calculating the highpass values {d^1_{i+1,j}, d^1_{i+1,j+1}, d^1_{i+1,j+2}, d^1_{i+1,j+3}}, the computation of the lowpass values {s^1_{i+1,j}, s^1_{i+1,j+1}, s^1_{i+1,j+2}, s^1_{i+1,j+3}} should start. For this, access to the previous row is necessary to load the four calculated highpass values {d^1_{i,j}, d^1_{i,j+1}, d^1_{i,j+2}, d^1_{i,j+3}}. Consequently, the access pattern needed in vertical filtering is more complex than the access pattern needed in horizontal filtering.

5.6.3 Performance Results

First, the offsetting technique has been applied to the SIMD implementations of all three transforms and, in addition, the lookahead technique to CDF-9/7. The resulting speedups are depicted in Figure 5.24. It can be seen that applying offsetting to the MMX implementation of the (5, 3) lifting transform improves performance significantly. For those image sizes that suffer from 64K aliasing, it improves performance by factors ranging from 1.68 to 6.74. For the Daub-4 and CDF-9/7 transforms, however, the attained speedups are comparatively small. Applying offsetting to Daub-4 provides a speedup of 1.78 for images of size 256 × 256. For the other image sizes that suffer from 64K aliasing, the speedups are smaller than 1.10. Applying both offsetting and lookahead to CDF-9/7 improves performance by factors ranging from 1.14 to 1.45 when 64K aliasing as well as excessive cache conflict misses occur. The reason for this behavior is that vectorization already (partially) eliminates 64K aliasing. The SSE implementations of the convolutional methods load 32 bytes of data (half a cache line; see the first two lines of the code given in Figure 5.20) into registers before accessing a different cache line that could conflict with the current one. The MMX implementation of the (5, 3) lifting scheme loads 16 bytes of data into registers before accessing a different cache line (see the first two lines of the code given in Figure 5.23). Hence in these implementations 64K aliasing still occurs, but to a lesser extent than in the scalar version. Figure 5.25 depicts the speedups of the SIMD implementations of horizontal filtering over the corresponding scalar version. The largest speedups are obtained for the (5, 3) lifting scheme. For this transform the speedup ranges from 1.69 to 3.39, while the speedups for Daub-4 and CDF-9/7 range from 1.10 to 1.79 and from 1.25 to 1.44, respectively. There are three main reasons why the speedups for the (5, 3) lifting scheme are higher than for the Daub-4 and CDF-9/7 transforms.
First, there are no misaligned memory accesses in the MMX implementation of horizontal filtering using the (5, 3) lifting scheme, while in the SSE implementations there are, as shown in Table 5.2. Although SSE permits misaligned memory accesses, they are much slower than aligned accesses. Second, there are more MMX execution units than SSE units, which implies that more MMX instructions can be executed in parallel.

Figure 5.24: Performance improvements achieved by applying the offsetting technique to the SIMD implementations of all three transforms and, in addition, the lookahead technique to CDF-9/7.

Figure 5.25: Speedup of the SIMD implementations of horizontal filtering over the scalar versions.

Third, the MMX implementation of the (5, 3) lifting scheme performs more arithmetic operations per wavelet sample than the SSE implementations of Daub-4 and CDF-9/7. Because SIMD vectorization significantly reduces the CPU component of the execution time, horizontal filtering using Daub-4 and CDF-9/7 has become memory-bound. As Table 5.2 shows, the number of load and store instructions in each loop iteration of the horizontal filtering of the (5, 3) lifting scheme is smaller than for the Daub-4 and CDF-9/7 transforms.

Transform        # load/store for      # load for      # misaligned
                 input/output data     coefficients    accesses
(5, 3) Lifting          5                   0               0
Daub-4                  6                   8               2
CDF-9/7                14                  19               4

Table 5.2: Number of load/store instructions and misaligned accesses in each loop iteration of horizontal filtering in the (5, 3) lifting, Daub-4, and CDF-9/7 transforms.

Figure 5.26: Speedup of the SIMD implementations of vertical filtering over the scalar versions.

Two other important observations can be drawn from Figure 5.25. First, the speedups for the (5, 3) lifting scheme are lower for small images (N ≤ 128) than for larger images, while for Daub-4 the opposite behavior can be observed. Second, since all SIMD implementations perform four operations in one instruction, the expected maximum speedup is 4, but the attained speedups are smaller. The first observation can be explained as follows. When N ≤ 128, almost all reads hit the L1 data cache (except for compulsory misses). Hence the speedups obtained for these image sizes are the speedups resulting from SIMD vectorization alone. When N > 128, however, other factors start to play a role. For (5, 3) lifting, the speedup increases because the MMX implementation incurs fewer cache conflicts, and hence fewer memory stall cycles, than the scalar implementation. For Daub-4, on the other hand, the speedup decreases because this implementation has become memory-bound. The fact that the obtained speedups are smaller than 4 even when N ≤ 128 is mainly due to the overhead instructions required to vectorize horizontal filtering. Figure 5.26 depicts the speedups for vertical filtering. As explained in the previous sections, vertical filtering is easier to vectorize and incurs less overhead than horizontal filtering. This explains why the obtained speedups are close to the maximum speedup of 4 for images smaller than 256 × 256. For the lifting and Daub-4 transforms the speedups for these image sizes are even larger than 4 in all but one case, due to the reduction of loop overhead. For CDF-9/7, the speedups are slightly smaller (around 3.32) because, due to the small number of SSE registers, this implementation needs to spill registers to memory. When N > 256, however, the obtained speedups are smaller. For the lifting and CDF-9/7 transforms they are around 2 in most cases, but for Daub-4 the average speedup for images larger than 256 × 256 is only 1.26.
Again, this should be attributed to a memory bandwidth bottleneck.

Transform        Horizontal filtering       Vertical filtering        Ratio
                 # dynamic instructions     # dynamic instructions    (Col. 2 / Col. 3)
(5, 3) Lifting   5 + (5 + 31*M/8) * N       6 + (5 + 22*M/4) * N/2        1.40
Daub-4           4 + (4 + 48*M/8) * N       4 + (4 + 37*M/4) * N/2        1.30

Table 5.3: The dynamic number of instructions of the SIMD implementations of the horizontal and vertical filtering and also their ratio for different transforms for an N × M image.

5.6.4 Discussion

In this section, the limitations of the SIMD implementations that restrict the performance, and possible solutions, are discussed. Of the convolution-based transforms, Daub-4 and CDF-9/7, only the Daub-4 transform is considered in the following sections, because their problems are almost the same. First, as previously mentioned, horizontal filtering is not easy to vectorize, while vertical filtering is. To vectorize horizontal filtering, overhead instructions are needed. Table 5.3 shows the dynamic number of instructions of the horizontal and vertical filtering and their ratio for the (5, 3) lifting and Daub-4 transforms for an N × M image. As this table illustrates, the number of executed instructions of vertical filtering is 1.40 and 1.30 times smaller than that of horizontal filtering for the (5, 3) lifting scheme and the Daub-4 transform, respectively. Second, in the MMX implementation of the (5, 3) lifting, there is a mismatch between the storage and the computational formats. For instance, about 12.7% of the dynamic number of instructions are needed to convert the pixels to 16-bit values in the MMX implementation. Third, there are some misaligned accesses in the SIMD implementation of horizontal filtering of the Daub-4 transform. Therefore, some architectural enhancements, such as a MAC operation and the MRF technique, are proposed to aid the efficient SIMD vectorization of horizontal filtering. A packed MAC instruction is used for the Daub-4 transform. The MAC operation performs four 32-bit single-precision floating-point multiplications with accumulation. The SSE/SSE2/SSE3 ISAs do not provide a MAC operation for floating-point numbers; only a packed multiply-and-add instruction for fixed-point numbers is supported. The proposed MRF technique, which was discussed in the previous chapters, can also be used for all transforms.
In order to evaluate whether extended subwords can be used to improve the performance of the (5, 3) lifting transform, the minimum and maximum wavelet coefficient and intermediate result for a 5-level decomposition have been determined. As input, the well-known "Lena" image as well as randomly generated images with 7 to 10 bits per pixel (bpp) have been employed. The results are depicted in Table 5.4. The first column shows the range of the input image pixels, the second and third columns the minimum resp. maximum wavelet coefficient/intermediate result, and the last column shows the number of bits required to represent each coefficient and intermediate result.

Range of input pixels   Min. value   Max. value   # bits
0, 127                     -238          239          9
-128, 127                  -472          477         10
0, 255                     -475          478         10
-256, 255                  -950          955         11
0, 511                     -953          957         11
-512, 511                 -1904         1915         12
0, 1023                   -1906         1916         12

Table 5.4: Minimum and maximum wavelet coefficients and intermediate results for a 5-level decomposition using the (5, 3) lifting scheme for 7- to 10-bit per pixel images.

The table shows that a 12-bit data format is sufficient for a 5-level decomposition of images of up to 10 bpp. This means that the extended subwords technique can be employed to exploit more DLP in the (5, 3) lifting transform. The following section describes how the proposed techniques can be used for an efficient SIMD implementation of the DWT.

5.6.5 MAC Operation, Extended Subwords and the MRF

The first proposed technique is to design and implement a MAC operation. The SSE ISA includes a packed multiply-and-add (pmaddwd) instruction for integers, but it does not provide such an instruction for floating-point numbers. As previously mentioned, providing a MAC unit that can perform a 32-bit single-precision floating-point multiplication with accumulation is a good solution to vectorize horizontal filtering. As Figure 5.27 shows, multiplication of coefficients and input samples is then possible without overhead instructions and without replication of coefficients. The proposed pmaddsd (packed multiply and add single-precision values to double-precision) instruction performs an SIMD multiply of the four single-precision floating-point values in the source operand by the four single-precision floating-point values in the destination operand. The two high-order products are summed and stored in the upper doubleword of the destination operand, and the two low-order products are summed and stored in the lower doubleword of the destination operand.

pmaddsd xmm1, xmm2
; xmm1 = x00 x01 x02 x03
; xmm2 = c0  c1  c2  c3
; result: xmm1 = (x00*c0 + x01*c1, x02*c2 + x03*c3)

Figure 5.27: The structure of the pmaddsd instruction.

The second possible solution for an efficient SIMD implementation of the DWT is to use extended subwords and the MRF. First, extended subwords and the MRF are used to vectorize the horizontal filtering of the (5, 3) lifting transform. Then, the idea of the MRF is extended to floating-point numbers and used for an efficient implementation of the horizontal filtering of the Daub-4 transform. The extended subwords and MRF techniques have been discussed in Chapter 3. Both techniques are used to vectorize the horizontal filtering of the (5, 3) lifting, while extended subwords alone is employed for the implementation of vertical filtering. Figure 5.28 illustrates how the MRF technique can be used to reorder the input data. Eight load-column instructions are used to load the input sequence into the column registers 3mxc0, 3mxc1, ..., 3mxc7. To provide the correct arrangement of even and odd values according to Equation (5.3) and Equation (5.4), an offset that is a multiple of 6 bytes is used for each load-column instruction. After the eight load-column instructions, each row register consists of either even ({s^0_i}) or odd ({d^0_i}) values. Thereafter, SIMD ALU instructions can be used to process the row registers. The extended subwords technique provides 8-way parallelism in both horizontal and vertical filtering. In addition, the idea of the MRF is applied to floating-point numbers using the SSE register file. The SSE register file has eight 128-bit floating-point registers xmm0, ..., xmm7, each containing four single-precision floating-point values. This register file is modified by the MRF technique. The modified register file has eight row registers, the same as the normal register file, and four column registers xcmm0, ..., xcmm3. Figure 5.29 depicts the architecture of the modified register file with 32-bit subwords. It has two read ports and one write port connected to a 128-bit partitioned floating-point ALU. As the figure shows, four registers can be accessed in both the horizontal and the vertical direction. Data loaded from memory can be written to a row register as well as to a column register. Three 2:1 32-bit multiplexers are needed in each row to select between row-wise and column-wise accesses. Only load-column instructions can write to a column register, the same as in the MMMX register file. Therefore, a transposition of a block stored in memory can be performed using column-wise load instructions followed by normal store instructions.
Four load-column instructions and four normal store instructions are needed to transpose a 4 × 4 block of single-precision floating-point values using the modified register file, while 20 SSE instructions are required, as was discussed in Chapter 1. Each load-column instruction can load 128 bits from memory to a column register, the same as the normal load and store instructions supported by the SSE extensions. For instance, the movaps instruction transfers 128 bits of packed data from memory to a row register. The modified register file is used for the vectorization of the horizontal filtering of the Daub-4 transform.

fldc8u12 3mxc0,   (R1)
fldc8u12 3mxc1,  6(R1)
...
fldc8u12 3mxc7, 42(R1)

Figure 5.28: Vectorization of the horizontal filtering of the (5, 3) lifting scheme using the matrix register file and extended subwords techniques.

5.6.6 Experimental Results

In this section the proposed techniques are evaluated.

Experimental Setup

In order to evaluate the proposed techniques, the sim-outorder simulator of the SimpleScalar toolset has been used, which was explained in Chapter 4. The horizontal and vertical filtering of the (5, 3) lifting have been implemented using the MMMX architecture, which was discussed in Chapter 3. For the horizontal and vertical filtering of the (5, 3) lifting transform, two SIMD implementations, using MMX and MMMX instructions, have been implemented and simulated. In the MMMX implementation of horizontal filtering, both extended subwords and


Figure 5.29: A matrix register file with eight 128-bit registers, two read ports, and one write port. Four registers can be accessed row-wise as well as column-wise. The modified register file is connected to a 128-bit partitioned floating-point ALU for subword parallel processing.

the MRF techniques have been used, while in the vertical filtering only the extended subwords technique has been used. In addition, three SIMD implementations, namely SSE, SSE-MAC, and SSE-MRF, of the horizontal filtering of the Daub-4 transform have been implemented and simulated. The SSE version is the implementation that was discussed in Section 5.6.1. In the SSE-MAC implementation the proposed MAC operation has been used, while in the SSE-MRF implementation the modified SSE register file has been employed. The main parameters of the simulated processor are the same as those discussed in Section 4.3 in Chapter 4. The performance obtained by the MMMX, SSE-MAC, and SSE-MRF implementations is compared to the performance attained by the MMX and SSE implementations, respectively. In other words, the MMX and SSE SIMD implementations are used as reference implementations.

Figure 5.30: Speedups of the MMMX implementations of the horizontal and vertical filtering of the (5, 3) lifting transform, and of the SSE-MAC and SSE-MRF implementations of the horizontal filtering of the Daub-4 transform, over MMX and SSE, respectively, as well as the ratio of committed instructions, for an image size of 480 × 480 on a single-issue processor.

Performance Evaluation Results

Figure 5.30 depicts the speedups of the MMMX implementations of the horizontal and vertical filtering of the (5, 3) lifting, and of the SSE-MAC and SSE-MRF implementations of the horizontal filtering of the Daub-4 transform, over MMX and SSE, respectively, as well as the ratio of committed instructions for an image size of 480 × 480 on a single-issue processor. The MMMX implementation of the horizontal filtering of the (5, 3) lifting transform is 2.90 times faster than the MMX implementation, while the MMMX implementation of the vertical filtering is 2.03 times faster. The speedup of the horizontal filtering is larger than that of the vertical filtering because the MMMX implementation of the horizontal filtering uses both techniques, extended subwords and the MRF, whereas the vertical filtering uses only extended subwords. The speedups of SSE-MAC and SSE-MRF are 1.26 and 1.32, respectively, and their ratios of committed instructions are 1.00 and 1.33. This means that the MAC operation does not reduce the number of committed instructions. SSE-MAC executes eight SIMD and four scalar instructions in the inner loop to calculate two wavelet coefficients per iteration, while SSE executes 44 SIMD and four scalar instructions in the inner loop to calculate eight wavelet coefficients per iteration. Hence, the SSE-MAC implementation needs four SIMD and two scalar instructions to calculate one wavelet coefficient, whereas the SSE implementation requires 5.5 SIMD and 0.5 scalar instructions per coefficient. As a result, SSE-MAC reduces the number of SIMD instructions while increasing the number of scalar instructions. The latency of the scalar instructions, which increment or decrement index and address values, is less than the latency of the SIMD multiplication instructions. This is why SSE-MAC still yields a speedup of 1.26.

Transform        Horizontal filtering        Vertical filtering            Ratio
                 (# dynamic instructions)    (# dynamic instructions)      (Col. 2 / Col. 3)
(5, 3) lifting   6 + (4 + 51M/48) * N        5 + (4 + 20M/8) * (N/2)       0.85
Daub-4           4 + (4 + 36M/8) * N         4 + (4 + 37M/4) * (N/2)       0.97

Table 5.5: Number of dynamic instructions of the SIMD implementation of both horizontal and vertical filtering of the (5, 3) lifting and Daub-4 transforms after using the proposed techniques for an N × M image.

In order to reduce the number of scalar instructions in the SSE-MAC implementation, the inner loop has been unrolled four times. This unrolled version yields a speedup of 1.28 over SSE, and its ratio of committed instructions is 1.33. Table 5.5 shows the dynamic number of instructions for both horizontal and vertical filtering, and their ratio, for the (5, 3) lifting and Daub-4 transforms after applying the proposed techniques to an N × M image. The ratio of the number of executed instructions of horizontal filtering to that of vertical filtering is 0.85 for the (5, 3) lifting transform and 0.97 for the Daub-4 transform. Comparing Table 5.3 and Table 5.5 shows that after applying the proposed techniques, these ratios are reduced from 1.40 and 1.30 to 0.85 and 0.97, respectively.

5.7 Conclusions

In this chapter the efficient implementation of the two-dimensional discrete wavelet transform on SIMD-enhanced GPPs has been discussed. Three different transforms, (5, 3) lifting, Daub-4, and CDF-9/7, have been considered, and three issues related to the efficient implementation of the 2D DWT on general-purpose, programmable processors, in particular the Pentium 4, have been addressed. First, a simple and effective technique to improve the cache locality of vertical filtering is loop interchange. It has been identified, however, that for certain image sizes the resulting implementation suffers from a phenomenon known as 64K aliasing. To avoid this problem, two techniques have been applied: loop fission and offsetting. For image sizes that suffer from 64K aliasing, loop fission provides a speedup that ranges from 1.27 to 3.31, depending on the transform applied, while offsetting achieves speedups between 1.41 and 4.20. Loop fission, however, incurs more loop overhead and, more seriously, destroys the temporal locality between the lowpass and highpass values. Consequently, for image sizes that do not suffer from 64K aliasing it reduces performance by up to 20%. Because offsetting does not destroy the temporal reuse, it is concluded that offsetting is better than loop fission. It has also been shown that for certain image sizes, vertical filtering (with interchanged loops) still generates many more misses than horizontal filtering. On the P4, this happens in particular for the CDF-9/7 transform. The reason is that the filter length exceeds the number of cache ways. Because of this, conflicts occur if the input coefficients needed to compute one output coefficient map to the same cache set. To avoid these conflicts two techniques have been applied: associativity-conscious loop fission and lookahead.
For image sizes that experience many cache conflict misses, ACLF improves performance by a factor that ranges from 1.59 to 1.80, while the lookahead technique provides a speedup between 1.71 and 1.99. For other image sizes, both schemes generally decrease performance slightly, due to the increased loop overhead. Except for two image sizes, the lookahead technique performs slightly better than ACLF, because it incurs less loop overhead. Both schemes are general, as they can also be applied to other cache organizations and/or filter lengths. To show this, results for the P3 and Opteron have also been presented. To further enhance the performance of the 2D DWT, the SIMD instructions provided by most general-purpose programmable processors must be exploited. MMX implementations of the lifting transform and SSE implementations of the convolutional transforms have been presented. While vertical filtering is relatively straightforward to vectorize, horizontal filtering requires rearranging the elements (subwords) within a register. Mainly because of this overhead, the speedups obtained for horizontal filtering are relatively small, ranging from 1.69 to 3.39 for the lifting transform and from 1.10 to 1.79 for the convolutional transforms. Because vertical filtering does not incur this overhead, its speedups approach the ideal speedup of 4 when most reads hit the L1 data cache. For larger images, however, the obtained speedups are smaller, because the computation becomes memory-bound. This is especially the case for the Daub-4 transform, which has a smaller computation-to-communication ratio than the other two transforms. Amongst others, this work has shown that it is difficult to obtain a single implementation of the 2D DWT that works well for all image sizes, because most techniques incur some overhead.
This indicates that in order to obtain the fastest implementation of this important kernel, a parameterizable implementation is needed that takes into account factors such as the cache organization of the target processor, the image size, the filter length, etc. Specifically, focusing on the cache conflict problem: if the cache organization, image size, and filter length are such that the number of input blocks needed to compute one output block exceeds the number of cache ways, then the lookahead technique should be applied; otherwise, the reference implementation should be called. Additionally, the proposed extended subwords, MRF, and MAC operation have been used to improve the performance of the SIMD implementations. The extended subwords and MRF techniques have been used in the horizontal filtering of the (5, 3) lifting transform, while only the extended subwords technique has been employed in the vertical filtering. These techniques provided speedups of 2.90 and 2.03 for horizontal and vertical filtering, respectively. The register file of the SSE extension has been modified following the idea of the matrix register file. The modified SSE register file improves the performance of the horizontal filtering of the Daub-4 transform by a factor of 1.32, while the MAC operation yields a speedup of 1.26.

Chapter 6

Conclusions and Future Work

In this dissertation, performance bottlenecks of multimedia extensions have been addressed. Some of these bottlenecks are as follows. First, in many media kernels, there is a mismatch between the computational format and the storage format in the processing of multimedia applications. This is because the precision of intermediate results is usually larger than the storage format. Therefore, data type conversion (unpack) instructions are required before the operations are performed, and the results also have to be packed before they can be stored back to memory. Second, existing SIMD computational instructions cannot efficiently exploit the DLP of 2D multimedia data. The main reason is that 2D multimedia algorithms process the input pixels in both the horizontal and vertical directions, while the media register file is accessed only in the horizontal direction. Consequently, data rearrangement instructions are needed to efficiently implement 2D media kernels. As a solution, a novel SIMD extension called MMMX has been proposed. The MMMX extension enhances the MMX architecture with the extended subwords and matrix register file techniques. While these techniques have been applied to the MMX architecture, they could also be applied to other SIMD architectures. The proposed architecture has been evaluated using the sim-outorder simulator of the SimpleScalar toolset on several MMAs and kernels. In addition, three issues related to the efficient implementation of the 2D DWT on SIMD-enhanced GPPs have been addressed. These issues are 64K aliasing, cache conflict misses, and SIMD vectorization. This chapter summarizes the contents of the dissertation, outlines its contributions, and proposes future research directions. It is organized in three sections. Section 6.1 presents a summary of the thesis, Section 6.2 highlights the main contributions of this work, and finally, some open research directions are given in Section 6.3.


6.1 Summary

The work presented in this thesis can be summarized as follows. In Chapter 2, background information about data type conversion, data permutation instructions, SIMD vectorization, and cache optimization was presented. First, data type conversion instructions were discussed, together with the techniques proposed to avoid them. For example, one way to avoid data type conversion is to use wider media registers, which has been considered in the design of some DSPs. Second, three different techniques for data rearrangement were explained. Data reordering can be performed using explicit instructions, memory operations, or register file structures. There are explicit instructions for data reorganization, such as the packed shuffle word instruction in the SSE extension. Strided load and store instructions can also reorder the input data, and the structure of the media register file can likewise be used to reorganize data. Third, some SIMD vectorization techniques were described. SIMD vectorization improves the performance of MMAs because it transforms sequential code into parallel code; the vectorized programs can be executed on GPPs enhanced with SIMD extensions. Since GPPs typically employ cache memories, some cache optimization techniques were also discussed at the end of Chapter 2. Chapter 3 described the Modified MMX (MMMX) architecture, which features the extended subwords and matrix register file techniques. The MMMX architecture is an SIMD extension for the multimedia domain. It focuses on maintaining programmability and accelerates MMAs by exploiting DLP. The MMMX architecture alleviates the data type conversion and data rearrangement instructions. Its instructions are applicable in multiple domains; it is not an application-specific ISA. First, extended subwords and the matrix register file have been discussed in detail.
Second, novel SIMD load/store instructions and SIMD ALU operations were presented. Third, the main differences between the MMMX and MMX architectures were shown. Thereafter, it was discussed how the MMMX architecture can reduce the dynamic number of instructions and improve performance significantly compared to the MMX architecture. Finally, the hardware cost of the MMMX architecture was evaluated. Chapter 4 presented the performance evaluation of the proposed architecture. Its performance was compared to the performance of the MMX architecture at the kernel, image, and application level. Several important MMAs were selected, and their most time-consuming kernels were implemented using the MMMX and MMX/SSE architectures. In order to obtain the application-level speedup, the SIMD implementations replaced the corresponding kernels in the original scalar version of the applications. Chapter 5 discussed one kernel, the discrete wavelet transform, in more detail. The

DWT, which is used in the JPEG2000 standard, processes the whole image, while the DCT, which is used in the MPEG standards, processes 8 × 8 blocks of pixels. Three issues related to the efficient computation of the 2D DWT on general-purpose processors, in particular the Pentium 4, have been discussed: 64K aliasing, cache conflict misses, and SIMD vectorization. 64K aliasing is a Pentium 4 phenomenon that can degrade performance by an order of magnitude. In addition, there are many cache conflict misses in the straightforward implementation of vertical filtering if the filter length exceeds the number of cache ways. Two techniques were proposed to avoid 64K aliasing, as well as two techniques to alleviate cache conflict misses. Additionally, the performance of the 2D DWT was improved by exploiting DLP using the SIMD instructions supported by most GPPs. Finally, some techniques were proposed and evaluated to improve the performance of the SIMD implementations of the 2D DWT, such as a multiply-accumulate operation for single-precision floating-point data types and the use of extended subwords and the MRF.

6.2 Major Contributions

The main contributions of this thesis are highlighted below.

• Detailed examination of the limitations of MMX data movement operations led to the proposal of the matrix register file, which supports efficient matrix transpose operations. The matrix register file allows both row-wise and column-wise accesses to the register file. This technique reduces the number of explicit data reorganization instructions required by many SIMD calculations.

• Detailed examination of the amount of data type conversion in MMX and the cause of these conversions led to the proposal of extended subwords. The extended subwords technique uses four extra bits for each byte of the media register file. Load instructions implicitly unpack data from the storage format to the computation format, and store instructions implicitly pack and saturate data from the computation format to the storage format. This design eliminates the need for explicit pack and unpack instructions.

• A novel SIMD extension called MMMX has been proposed. The MMMX extension enhances the MMX architecture with the extended subwords and matrix register file techniques. The MMMX architecture reduces the total number of instructions much more than MMX, MMX enhanced with extended subwords, and MMX enhanced with the MRF. MMMX achieves an image-level speedup of up to 3.24 over MMX. In addition, a set of new general-purpose SIMD instructions has been proposed for multimedia computing.

• An initial examination of how many registers are needed to efficiently process

multimedia workloads on an SIMD processor. For example, MMMX with 13 extra registers yields speedups ranging from 1.37 to 3.64 over MMX.

• In order to develop high-performance implementations of the 2D DWT on general-purpose processors in general and the P4 in particular, three issues were addressed.

– The P4 suffers from a problem known as 64K aliasing, which can degrade performance by an order of magnitude. Two techniques were proposed and evaluated to avoid it. The first technique is loop fission, which splits the inner loop so that the lowpass and highpass values are calculated in separate loops. The loop fission technique provides a speedup of up to 3.31, but for image sizes that do not suffer from 64K aliasing it reduces performance by up to 20%. The second technique is to offset the memory address of the highpass values by one or two rows, depending on the transform. This offsetting technique achieves a speedup of up to 4.20 and incurs no performance penalty for image sizes that do not suffer from 64K aliasing.

– There are many cache conflict misses in the implementation of vertical filtering. Two techniques, namely associativity-conscious loop fission and lookahead, were proposed to reduce the number of conflict misses. Associativity-conscious loop fission splits the loop that computes one row of wavelet output into multiple loops, so that each loop accesses at most n rows, where n is the number of cache ways. Each loop computes the partial results that can be computed by accessing the first n rows of input data, and the remaining loops add their results to these partial results. The lookahead technique processes the rows in a skewed manner; there is only one loop, as in the original algorithm. The former technique improves performance by up to 80% and the latter by up to 99%. For image sizes that do not generate many conflict misses, both techniques slightly decrease performance due to the overhead they introduce.
– High-performance implementations of the 2D DWT must exploit DLP using the SIMD instructions supported by most GPPs. It was described how the 2D DWT can be vectorized using MMX and SSE instructions. Vertical filtering is relatively straightforward to vectorize; horizontal filtering is more difficult, however, since it requires the data to be reorganized. The maximum speedups of the SIMD implementations of horizontal and vertical filtering over the corresponding scalar versions are 3.39 and 6.72, respectively. In addition, some techniques were proposed and evaluated to improve the performance of the SIMD implementations of the 2D DWT, such as a MAC operation for

single-precision floating-point data types and the use of extended subwords and the MRF. The MAC operation achieves a speedup of up to 1.26, and extended subwords and the MRF yield a speedup of up to 2.90.

6.3 Future Proposed Research Directions

In spite of the comprehensive description and experimental validation of SIMD architectures provided in this dissertation, a number of interesting issues remain that can be addressed in the future. The following directions for future research are proposed.

• One item related to the performance of multimedia extensions is memory alignment. Memory references are more efficient when they access aligned regions. A memory operation is aligned if its address is a multiple of the data width; in other words, an n-byte transfer must start on an n-byte boundary. In most SIMD architectures, unaligned memory accesses incur a large performance penalty or are even disallowed. For example, our results show that for the addition of two arrays of size 1024 × 1024, the aligned implementation is 1.47 times faster than the unaligned implementation using SSE instructions. An interesting research direction could be to investigate hardware techniques that mitigate the effects of misaligned accesses.

• A complete investigation of register utilization in multimedia applications has not been considered in this thesis. If the media registers are used effectively, they can reduce memory traffic by removing load and store instructions, and decrease power consumption by eliminating operations on the memory hierarchy. In addition, the SIMD implementations of multimedia kernels can use many wide registers, which causes a large increase in the number of registers used. A larger register file allows coefficient values to be allocated to media registers; the benefit is that a value is loaded once at the start of a function and stored once at the end of execution. However, determining the exact number of registers required remains future work. For example, the Cell processor provides 128 128-bit registers, and the register file of the TM3270 media processor consists of 128 32-bit registers.
Are 128 registers enough? Both the VMX and AltiVec extensions need a number of registers to keep the data for rearrangement instructions, while the number of registers can be reduced by micro-architectural modifications such as the MRF technique.

• Another challenge in SIMD architectures is to determine the number of subwords that can be processed simultaneously. If the number of subwords is too small, it limits the ability to exploit DLP, whereas if the number of subwords

is too large, it can reduce the performance improvement because of insufficient DLP and the many extra overhead instructions. This means that there is no guarantee of obtaining a larger speedup with wider vectors than with shorter vectors.

• The issue of compiler support for SIMD vectorization has also not been considered in this work. This issue is very important, because writing code for SIMD processors using assembly language or macro-like intrinsics is more tedious and error-prone than relying on compilers that automatically identify vectorizable parts of the program and generate the appropriate SIMD instructions. Currently available compilers cannot efficiently exploit SIMD vectorization automatically.

• The MMMX architecture improves the performance of multimedia kernels, whereas its power dissipation has not been considered. As a result, the power consumption of the proposed techniques could be a topic of future research, for example, by considering the proposed techniques in the design of low-power media processors.

• An interesting issue to investigate is how much multimedia applications can benefit from multi-core architectures.

on Parallel and Distributed Processing, April 2001. [59] D. He and W. Zhang. “The Parallel Algorithm of 2-D Discrete Wavelet Trans- form”. In Proc. 4th IEEE Int. Conf. on Parallel and Distributed Computing Applications and Technologies, pages 738–741, August 2003. 150 BIBLIOGRAPHY

[60] Y. H. Hu. “Programmable Digital Signal Processors: Architecture, Programming, and Applications”. Marcel Dekker, New York, 2002.
[61] J. L. Hennessy and D. A. Patterson. “Computer Architecture: A Quantitative Approach”. Morgan Kaufmann, 3rd edition, 2002.
[62] H. P. Hofstee. “Power Efficient Processor Architecture and the Cell Processor”. In Proc. 11th IEEE Int. Symp. on High-Performance Computer Architecture, pages 258–262, February 2005.
[63] J. Y. F. Hsieh, A. Avoird, R. P. Kleihorst, and T. H. Y. Meng. “Transpose Memory for Video Rate JPEG Compression on Highly Parallel Single-chip Digital CMOS Imager”. In Proc. IEEE Int. Conf. on Image Processing, volume 3, September 2000.
[64] “Does Hyperthreading Technology Speed up VirtualDub?”. http://www.virtualdub.org/blog/pivot/entry.php?id=18.
[65] L. Huang, M. Lai, K. Dai, H. Yue, and L. Shen. “Hardware Support for Arithmetic Units of Processor with Multimedia Extension”. In Proc. IEEE Int. Conf. on Multimedia and Ubiquitous Engineering, pages 633–637, April 2007.
[66] H. C. Hunter and J. H. Moreno. “A New Look at Exploiting Data Parallelism in Embedded Systems”. In Proc. IEEE Int. Conf. on Compilers, Architectures and Synthesis for Embedded Systems, pages 159–169, 2003.
[67] IBM. “Synergistic Processor Unit Instruction Set Architecture”, January 2007. Version 1.2.
[68] Intel. “Using MMX Instructions to Compute the 2x2 Haar Transform”. Technical report, Intel Developer Services, 2004.
[69] Intel Corporation. “IA-32 Intel Architecture Optimization”, 2004. Order Number: 248966-011.
[70] Intel Corporation. “The IA-32 Intel Architecture Software Developer’s Manual, Volume 3: System Programming Guide”, 2004. Order Number: 253668.
[71] N. Jayasena, M. Erez, J. Ahn, and W. Dally. “Stream Register Files With Indexed Access”. In Proc. 10th IEEE Int. Symp. on High Performance Computer Architecture, February 2004.
[72] M. D. Jennings and T. M. Conte. “Subword Extensions for Video Processing on Mobile Systems”. IEEE Concurrency, 6(3):13–16, July-September 1998.
[73] Y. Jung, S. G. Berg, D. Kim, and Y. Kim. “A Register File with Transposed Access Mode”. In Proc. Int. Conf. on Computer Design, pages 559–560, September 2000.

[74] B. Juurlink, D. Borodin, R. J. Meeuws, G. T. Aalbers, and H. Leisink. “The SimpleScalar Instruction Tool (SSIT) and the SimpleScalar Architecture Tool (SSAT)”. Available via http://ce.et.tudelft.nl/˜shahbahrami/, 2007.
[75] B. Khailany, W. J. Dally, S. Rixner, U. J. Kapasi, P. Mattson, J. Namkoong, J. D. Owens, B. Towles, and A. Chang. “Imagine: Media Processing with Streams”. IEEE Micro, 21(2):35–46, March-April 2001.
[76] H. Komi and A. Ortega. “Analysis of Cache Efficiency in 2D Wavelet Transform”. In Proc. IEEE Int. Conf. on Multimedia and Expo, pages 465–468, 2001.
[77] K. Konstantinides. “VLIW Architectures for Media Processing”. IEEE Signal Processing Magazine, 15(2):16–19, March 1998.
[78] C. Kozyrakis, J. Gebis, D. Martin, S. Williams, I. Mavroidis, S. Pope, D. Jones, D. Patterson, and K. Yelick. “Vector IRAM: A Media-oriented Vector Processor with Embedded DRAM”. In Proc. 12th Int. Conf. on Hot Chips, August 2000.
[79] C. Kozyrakis and D. Patterson. “Vector Vs. Superscalar and VLIW Architectures for Embedded Multimedia Benchmarks”. In Proc. 35th Int. Symp. on Microarchitecture, November 2002.
[80] C. E. Kozyrakis and D. Patterson. “A New Direction for Computer Architecture Research”. IEEE Computer, 31(11):24–32, November 1998.
[81] A. Kudriavtsev and P. Kogge. “Generation of Permutations for SIMD Processors”. In Proc. ACM Conf. on Language, Compiler and Tool Support for Embedded Systems, volume 40/7, pages 147–156, July 2005.
[82] P. Kuhn. “Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation”. Kluwer Academic Publishers, 1999.
[83] I. Kuroda and T. Nishitani. “Multimedia Processors”. Proceedings of the IEEE, 86(6):1203–1221, June 1998.
[84] R. Kutil. “A Single-Loop Approach to SIMD Parallelization of 2D Wavelet Lifting”. In Proc. 14th Euromicro Int. Conf. on Parallel, Distributed, and Network-Based Processing, pages 413–420, 2006.
[85] S. Larsen and S. Amarasinghe. “Exploiting Superword Level Parallelism with Multimedia Instruction Sets”. In Proc. ACM Conf. on Programming Language Design and Implementation, pages 145–156, 2000.
[86] A. J. T. Lee, R. W. Hong, and M. F. Chang. “An Approach to Content-based Video Retrieval”. In Proc. IEEE Int. Conf. on Multimedia and Expo, volume 1, pages 273–276, June 2004.
[87] C. G. Lee and D. J. DeVries. “Initial Results on the Performance and Cost of Vector Microprocessors”. In Proc. 30th Annual IEEE/ACM Int. Symp. on Microarchitecture, pages 171–182, December 1997.
[88] R. B. Lee. “Accelerating Multimedia With Enhanced Microprocessors”. IEEE Micro, 15(2):22–32, April 1995.
[89] R. B. Lee. “Subword Parallelism with MAX-2”. IEEE Micro, 16(4):51–59, August 1996.
[90] R. B. Lee. “Multimedia Extensions for General-Purpose Processors”. In Proc. IEEE Workshop on Signal Processing Systems, pages 9–23, November 1997.
[91] R. B. Lee. “Efficiency of MicroSIMD Architectures and Index-Mapped Data for Media Processors”. In Proc. IS&T/SPIE Symp. on Electronic Imaging: Media Processors, pages 34–46, January 1999.
[92] R. B. Lee. “Subword Permutation Instructions for Two-Dimensional Multimedia Processing in MicroSIMD Architectures”. In Proc. IEEE Int. Conf. on Application-Specific Systems, Architectures and Processors, pages 9–23, July 2000.
[93] R. B. Lee and M. D. Smith. “Media Processing: A New Design Target”. IEEE Micro, 16(4):6–9, August 1996.
[94] Y. D. Lee, B. D. Choi, J. K. Cho, and S. J. Ko. “Cache Management for Wavelet Lifting in JPEG 2000 Running on DSP”. Electronics Letters, 40(6), March 2004.
[95] H. Liao and A. Wolfe. “Available Parallelism in Video Applications”. In Proc. 30th IEEE/ACM Int. Symp. on Microarchitecture, pages 321–329, December 1997.
[96] C. Loeffler, A. Ligtenberg, and G. S. Moschytz. “Practical Fast 1-D DCT Algorithms With 11 Multiplications”. In Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, pages 988–991, May 1989.
[97] A. A. Lopez-Estrada. “Reduction of Address Aliasing”. Technical Report Pub. No.: US 2005/0188172 A1, August 2005.
[98] H. Lütkepohl. “Handbook of Matrices”. John Wiley & Sons, Chichester, 1996.
[99] P. Meerwald, R. Norcen, and A. Uhl. “Cache Issues with JPEG2000 Wavelet Lifting”. In Proc. Visual Communications and Image Processing, January 2002.

[100] J. H. Moreno, V. Zyuban, U. Shvadron, F. D. Neeser, J. H. Derby, M. S. Ware, K. Kailas, A. Zaks, A. Geva, S. Ben-David, S. W. Asaad, T. W. Fox, D. Littrell, M. Biberstein, D. Naishlos, and H. Hunter. “An Innovative Low-power High-performance Programmable Signal Processor for Digital Communications”. IBM Journal of Research and Development, 47(2/3):299–326, March/May 2003.
[101] D. Naishlos, M. Biberstein, S. B. David, and A. Zaks. “Vectorizing for a SIMdD DSP Architecture”. In Proc. Int. Conf. on Compilers, Architectures and Synthesis for Embedded Systems, volume 2, pages 2–11, November 2003.
[102] D. Nuzman, I. Rosen, and A. Zaks. “Auto-vectorization of Interleaved Data for SIMD”. In Proc. ACM Conf. on Programming Language Design and Implementation, volume 41/6, pages 132–143, June 2006.
[103] J. D. Owens, S. Rixner, U. Kapasi, P. Mattson, and B. Towles. “Media Processing Applications on the Imagine Stream Processor”. In Proc. IEEE Int. Conf. on Computer Design, September 2002.
[104] S. Panchanathan. “Architectural Approaches for Multimedia Processing”. In Proc. 4th Int. Conf. on Parallel Numerics and Parallel Computing in Image Processing, Video Processing and Multimedia, pages 196–210, 1999.
[105] A. Peleg and U. Weiser. “MMX Technology Extension to the Intel Architecture”. IEEE Micro, 16(4):42–50, August 1996.
[106] A. Peleg, S. Wilkie, and U. Weiser. “Intel MMX for Multimedia PCs”. Communications of the ACM, 40(1):24–38, January 1997.
[107] P. Pirsch, A. Freimann, C. Klar, and J. P. Wittenburg. “Processor Architectures for Multimedia Applications”. In Proc. Workshop on Embedded Processor Design Challenges: Systems, Architectures, Modeling and Simulation, pages 188–206, February 2002.
[108] C. Poynton. “A Technical Introduction to Digital Video”. John Wiley and Sons, Inc., 1996.
[109] M. Rabbani and P. W. Jones. “Digital Image Compression Techniques”. Bellingham, 1991.
[110] M. Rabbani and R. Joshi. “An Overview of the JPEG2000 Still Image Compression Standard”. Signal Processing: Image Communication, 17(1):3–48, January 2002.
[111] S. K. Raman, V. Pentkovski, and J. Keshava. “Implementing Streaming SIMD Extensions on the Pentium III Processor”. IEEE Micro, 20(4):47–57, July-August 2000.

[112] P. Ranganathan, S. Adve, and N. P. Jouppi. “Performance of Image and Video Processing with General Purpose Processors and Media ISA Extensions”. In Proc. Int. Symp. on Computer Architecture, pages 124–135, 1999.
[113] G. Ren, P. Wu, and D. Padua. “Optimizing Data Permutations for SIMD Devices”. In Proc. ACM Conf. on Programming Language Design and Implementation, pages 118–131, June 2006.
[114] G. Roelofs. “PNG: The Definitive Guide”. O’Reilly, 1999.
[115] K. Ronner and J. Kneip. “Architecture and Applications of the HiPAR Video Signal Processor”. IEEE Trans. on Circuits and Systems for Video Technology, 6(1):56–66, February 1996.
[116] H. Sasaki. “Multimedia Complex on a Chip”. In Proc. IEEE Int. Solid-State Circuits Conf., pages 16–19, 1996.
[117] R. Schafer and T. Sikora. “Digital Video Coding Standards and Their Role in Video Communications”. Proceedings of the IEEE, 83(6):907–924, June 1995.
[118] Freescale Semiconductor. “AltiVec Technology Programming Environments Manual”, 2002.
[119] Philips Semiconductors. “TriMedia TM-1000: Programmable Media Processor”, 1998.
[120] N. Seshan. “High VelociTI Processing”. IEEE Signal Processing Magazine, 15(2):86–101, March 1998.
[121] J. A. Shafer. “Embedded Architecture for Real-Time Wavelet Video Compression”. Master’s thesis, Department of Electrical and Computer Engineering, University of Dayton, 2004.
[122] A. Shahbahrami, B. Juurlink, D. Borodin, and S. Vassiliadis. “Avoiding Conversion and Rearrangement Overhead in SIMD Architectures”. International Journal of Parallel Programming, 34(3):237–260, June 2006.
[123] A. Shahbahrami, B. Juurlink, and S. Vassiliadis. “A Comparison Between Processor Architectures for Multimedia Applications”. In Proc. 15th Annual Workshop on Circuits, Systems and Signal Processing (ProRISC2004), pages 138–152, November 2004.
[124] A. Shahbahrami, B. Juurlink, and S. Vassiliadis. “Performance Comparison of SIMD Implementations of the Discrete Wavelet Transform”. In Proc. 16th IEEE Int. Conf. on Application-Specific Systems, Architectures and Processors (ASAP), July 2005.

[125] A. Shahbahrami, B. Juurlink, and S. Vassiliadis. “Accelerating Color Space Conversion Using Extended Subwords and the Matrix Register File”. In Proc. 8th IEEE Int. Symp. on Multimedia, December 2006.
[126] A. Shahbahrami, B. Juurlink, and S. Vassiliadis. “Implementing the 2D Wavelet Transform on SIMD-Enhanced General-Purpose Processors”. IEEE Trans. on Multimedia, 10(1):43–51, January 2008.
[127] A. Shahbahrami, B. Juurlink, and S. Vassiliadis. “Versatility of Extended Subwords and the Matrix Register File”. ACM Transactions on Architecture and Code Optimization (TACO), 5(1), May 2008.
[128] T. Shanableh and M. Ghanbari. “Heterogeneous Video Transcoding to Lower Spatio-Temporal Resolutions and Different Encoding Formats”. IEEE Trans. on Multimedia, 2(2):101–110, June 2000.
[129] N. T. Slingerland and A. J. Smith. “Multimedia Instruction Sets for General Purpose Microprocessors: A Survey”. Technical Report UCB/CSD-00-1124, University of California, December 2000.
[130] N. T. Slingerland and A. J. Smith. “Measuring the Performance of Multimedia Instruction Sets”. IEEE Trans. on Computers, 51(11):1317–1332, November 2002.
[131] D. B. Stewart. “Measuring Execution Time and Real-Time Performance”. In Proc. Conf. on Embedded Systems, pages 1–15, September 2006.
[132] E. J. Stollnitz, T. D. DeRose, and D. H. Salesin. “Wavelets for Computer Graphics: Theory and Applications”. Morgan Kaufmann, 1996.
[133] S. R. Subramanya, H. Patel, and I. Ersoy. “Performance Evaluation of Block-based Motion Estimation Algorithms and Distortion Measures”. In Proc. IEEE Int. Conf. on Information Technology: Coding and Computing, volume 2, pages 2–7, 2004.
[134] M. Swain and D. Ballard. “Color Indexing”. International Journal of Computer Vision, 7(1):11–32, 1991.
[135] W. Sweldens. “The Lifting Scheme: A Custom-Design Construction of Biorthogonal Wavelets”. Journal of Applied and Computational Harmonic Analysis, 3(2):186–200, 1996.
[136] A. Tamhankar and K. R. Rao. “An Overview of H.264/MPEG-4 Part 10”. In Proc. 4th Int. Conf. on Video and Image Processing and Multimedia Communications, pages 1–51, July 2003.
[137] R. Tessier and W. Burleson. “Reconfigurable Computing for Digital Signal Processing: A Survey”. Journal of VLSI Signal Processing, 28(1):7–27, June 2001.
[138] Texas Instruments. “TMS320C64x/C64x+ DSP CPU and Instruction Set Reference Guide”, July 2007. Literature Number: SPRU732D.
[139] S. Thakkar and T. Huff. “The Internet Streaming SIMD Extensions”. Intel Technology Journal, pages 1–8, 1999.
[140] M. Tremblay, J. M. O’Connor, V. Narayanan, and L. He. “VIS Speeds New Media Processing”. IEEE Micro, 16(4):10–20, August 1996.
[141] M. A. Trenas, J. Lopez, E. L. Zapata, and F. Arguello. “A Memory System Supporting the Efficient SIMD Computation of the Two Dimensional DWT”. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, volume 3, pages 1521–1524, May 1998.
[142] S. M. Vajdic and A. R. Downing. “Similarity Measures for Image Matching Architectures: A Review with Classification”. In Proc. IEEE Symp. on Data Fusion, pages 165–170, November 1996.
[143] S. Vassiliadis, G. Kuzmanov, and S. Wong. “MPEG-4 and the New Multimedia Architectural Challenges”. In Proc. 15th Int. Conf. SAER, September 2001.
[144] L. Wang, Y. Zhang, and J. Feng. “On the Euclidean Distance of Images”. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27(8):1334–1339, August 2005.
[145] P. Xiao. “Image Compression By Wavelet Transform”. Master’s thesis, East Tennessee State University, 2001.
[146] D. Zhang and G. Lu. “Evaluation of Similarity Measurement for Image Retrieval”. In Proc. IEEE Int. Conf. on Neural Networks and Signal Processing, volume 2, pages 928–931, December 2003.

List of Publications

Journal

• A. Shahbahrami, B. Juurlink, and S. Vassiliadis. “Implementing the 2D Wavelet Transform on SIMD-Enhanced General-Purpose Processors”. IEEE Transactions on Multimedia, Vol. 10, No. 1, pages 43-51, January 2008.

• A. Shahbahrami, B. Juurlink, and S. Vassiliadis. “Versatility of Extended Subwords and the Matrix Register File”. ACM Transactions on Architecture and Code Optimization (TACO), Vol. 5, No. 1, May 2008.

• A. Shahbahrami, B. Juurlink, D. Borodin, and S. Vassiliadis. “Avoiding Con- version and Rearrangement Overhead in SIMD Architectures”. International Journal of Parallel Programming, Vol. 34, No. 3, Pages 237-260, June 2006.

Proceedings

• A. Shahbahrami, J. Y. Hur, B. Juurlink, and S. Wong. “FPGA Implementation of Parallel Histogram Computation”. Proc. 2nd HiPEAC Workshop on Reconfigurable Computing, pp. 63-72, January 2008, Gothenburg, Sweden.

• A. Shahbahrami, B. Juurlink, and S. Vassiliadis. “SIMD Vectorization of Histogram Functions”. Proc. 18th IEEE Int. Conf. on Application-Specific Systems Architectures and Processors (ASAP), pp. 174-179, July 2007, Montreal, Canada.

• J. Tao, A. Shahbahrami, B. Juurlink, R. Buchty, W. Karl, and S. Vassiliadis. “Optimizing Cache Performance of the Discrete Wavelet Transform Using a Visualization Tool”. Proc. 9th IEEE Int. Symp. on Multimedia, December 2007, Taichung, Taiwan.

• A. Shahbahrami, B. Juurlink, and S. Vassiliadis. “Accelerating Color Space Conversion Using Extended Subwords and the Matrix Register File”. Proc. 8th IEEE Int. Symp. on Multimedia, pp. 37-46, December 2006, San Diego, California, USA.

• A. Shahbahrami, B. Juurlink, and S. Vassiliadis. “Limitation of Special-Purpose Instructions for Similarity Measurements in Media SIMD Extensions”. Proc. ACM/IEEE Int. Conf. on Compilers, Architecture, and Synthesis for Embedded Systems, pp. 293-303, October 2006, Seoul, South Korea.

• A. Shahbahrami, B. Juurlink, and S. Vassiliadis. “Improving the Memory Behavior of Vertical Filtering in the Discrete Wavelet Transform”. Proc. 3rd ACM Int. Conf. on Computing Frontiers, pp. 253-260, May 2006, Ischia, Italy.

• A. Shahbahrami, B. Juurlink, and S. Vassiliadis. “Performance Comparison of SIMD Implementations of the Discrete Wavelet Transform”. Proc. 16th IEEE Int. Conf. on Application-Specific Systems Architectures and Proces- sors, July 2005, Samos, Greece.

• A. Shahbahrami, B. Juurlink, and S. Vassiliadis. “Matrix Register File and Extended Subwords: Two Techniques for Embedded Media Processors”. Proc. 2nd ACM Int. Conf. on Computing Frontiers, May 2005, Ischia, Italy.

• B. Juurlink, A. Shahbahrami, and S. Vassiliadis. “Avoiding Data Conversions in Embedded Media Processors”. Proc. 20th Annual ACM Symposium on Applied Computing, March 2005, Santa Fe, New Mexico, USA.

• A. Shahbahrami. “The Determination of Initial Condition in Constraint Satisfaction Neural Networks for Medical Images Segmentation”. Proc. 6th Int. CSI Computer Conf. (CSICS 2001), 20-22 February 2001, University of Isfahan, Iran.

• A. Shahbahrami. “A Comparison of Nonparametric Algorithms in Selection of Thresholding Based on Entropy”. Proc. 1st Int. Iranian Conf. on Machine Vision, Image Processing and Applications (MVIP2001), 2001, Birjand, Iran.

Local Proceedings

• A. Shahbahrami and B. Juurlink. “A Comparison of Two SIMD Implementations of the 2D Discrete Wavelet Transform”. Proc. 18th Annual Workshop on Circuits, Systems and Signal Processing (ProRISC2007), pp. 169-177, November 2007, The Netherlands.

• A. Shahbahrami, B. Juurlink, and S. Vassiliadis. “Performance Impact of Misaligned Accesses in SIMD Extensions”. Proc. 17th Annual Workshop on Circuits, Systems and Signal Processing (ProRISC2006), pp. 334-342, November 2006, The Netherlands.

• A. Shahbahrami, B. Juurlink, and S. Vassiliadis. “Efficient Vectorization of the FIR Filter”. Proc. 16th Annual Workshop on Circuits, Systems and Signal Processing (ProRISC2005), November 2005, The Netherlands.

• A. Shahbahrami, B. Juurlink, and S. Vassiliadis. “A Comparison Between Processor Architectures for Multimedia Applications”. Proc. 15th Annual Workshop on Circuits, Systems and Signal Processing (ProRISC2004), November 2004, The Netherlands.

• A. Shahbahrami. “The Selection of Multiple Threshold Scheme for Image Segmentation Based on Co-Occurrence Matrix”. The 6th Annual Scientific and Research Conference at the University of Guilan, March 2000.

• A. Shahbahrami. “Simulation of an Environment for Comparison of Different Page Replacement Algorithms in Structural Memory Hierarchy”. The 7th Annual Scientific and Research Conference at the University of Guilan, March 2001.

Samenvatting

In this dissertation, an ingenious SIMD extension for multimedia computations, called Modified MMX (MMMX), is presented. The MMX architecture is enhanced with the extended subwords technique and the matrix register file technique. The extended subwords technique uses SIMD registers that are wider than the format in which the data is stored in memory: it provides 32 extra bits for each 64-bit register. The extended subwords technique avoids data type conversions and increases the parallelism in SIMD architectures. This is because promoting subwords before they are used in computations, and demoting the result before it can be stored, incurs overhead. The matrix register file allows data stored in consecutive memory locations to be loaded into a column of the register file, where a column consists of the corresponding subwords of the different registers. In other words, this technique allows the register file to be read and written both row-wise and column-wise. It is useful for the matrix operations that are common in media applications. In addition, this work investigates new, general SIMD instructions intended for multimedia applications. Application-specific instructions are not considered; special-purpose instructions are synthesized from a number of general instructions. The performance of the MMMX architecture is compared to the performance of the MMX/SSE architecture for several multimedia applications and kernels, using the sim-outorder simulator that is part of the SimpleScalar simulation environment.

Furthermore, three issues related to the efficient implementation of the 2D Discrete Wavelet Transform (DWT) on general-purpose processors, in particular the Pentium 4, are discussed: 64K aliasing, cache conflict misses, and SIMD vectorization. 64K aliasing is a phenomenon that can occur on the Pentium 4 and that can significantly degrade performance. It occurs when two or more data elements whose addresses differ by a multiple of 64K must be stored in the cache simultaneously. Cache conflict misses also occur frequently in the implementation of the vertical filter of the DWT when the filter length exceeds the number of cache ways. In this dissertation, techniques are proposed to avoid 64K aliasing and to mitigate the effects of cache conflict misses. Moreover, the performance of the 2D DWT is improved by exploiting data-level parallelism using the SIMD instructions supported by most general-purpose processors.


Curriculum Vitae

Asadollah Shahbahrami was born in Kelardasht, Chaloos, Mazandaran, Iran, on 21 September 1968. He graduated from Bahonar high school in 1989 and in the same year was admitted to study Computer Engineering at the Iran University of Science & Technology in Tehran. He received his B.Sc. degree in 1993 and in the same year was admitted to the master's program at Shiraz University, Shiraz, Iran, where he received the M.Sc. degree in Computer Engineering-Machine Intelligence in 1996. In the same year he was offered a permanent position at the University of Guilan, Rasht, Iran, where he worked as a lecturer from 1996 to 2003.

In 2003, he was awarded an overseas Ph.D. scholarship by the Iranian Ministry of Science, Research and Technology. In January 2004, he joined the Faculty of Electrical Engineering, Mathematics, and Computer Science (EEMCS) of Delft University of Technology, Delft, The Netherlands, as a full-time Ph.D. student under the supervision of Prof. Stamatis Vassiliadis and Dr. Ben Juurlink. This thesis covers his Ph.D. study. After finalizing his thesis at the beginning of 2008, he started working as a research associate at the same university. His research interests include computer architecture, image and video processing, multimedia instruction set design, and SIMD programming. He is a member of the IEEE and the ACM.


Acknowledgments

My first thanks go to my first promotor, Prof. Stamatis Vassiliadis, who was a truly great person in all aspects. He was the kind of person that is hard to find. His personality went well beyond the scope of nationalities. He had the skill to gather people from different cultures and to create the best environment for work and living. He was a person who inspired me to be a better person, both in life and in science. It has been my fortune to have chosen his group amongst the many choices that I had. I remember all his guidance, support, kindness, friendliness, advice, and confidence in my family and me. He will always reside in my heart. God bless him. I am especially grateful for the countless contributions of my supervisor, Dr. Ben Juurlink, who is a smart person and always straight to the point. During my Ph.D. study in the Computer Engineering Laboratory, he challenged me towards critical academic reasoning and helped me improve my technical writing. I specifically thank him for his endless guidance, suggestions, attention, and support. I acknowledge Prof. Dr. K. G. W. Goossens, my new promotor in the CE group. His support in the setup of the final thesis defense was important in concluding my research at TU Delft. I would like to thank Dr. Stephan Wong for the time and help he gave me during my study. I would also like to thank Dr. Koen Bertels, Dr. Georgi Gaydadjiev, Dr. Sorin Cotofana, Dr. Said Hamdioui, Dr. Zaid Al-Ars, Dr. Georgi Kuzmanov, and Dr. Reza Hassanpour for their time, help, and nice discussions. Many thanks go to my former roommates, Dmitry Cheresiz, Pyrros Stathis, Bayu Kanigoro, Demid Borodin, and Ricardo Chaves, who have endured my presence. I thank all my cheerful friends in the CE group. I also thank Ijeoma Sandra Irobi for the time she spent reading my thesis and Cor Meenderinck for his help in translating the abstract and propositions into Dutch. I am thankful to Lidwina and Bert for the help and time they gave me. I also acknowledge S. Kaneman and V. A. C. E. van der Burg for designing the cover page.

Financial support for this thesis was provided by the Iranian Ministry of Science, Research, and Technology, the University of Guilan, and in part by the Netherlands Organization for Scientific Research (NWO). They are gratefully acknowledged for supporting my Ph.D. research. Our stay in Delft would have been very boring without our friends, who made this time unforgettable. My special thanks go to Helen Skinner, our first landlady in Delft, who has been helpful to my family, even doing some shopping for us when we arrived in Delft. She also helped with the Dutch translation of our letters and sometimes babysat my daughter. I will always remember her help, time, support, and our “Chakochoneh”. I would also like to warmly thank all my Iranian friends in Delft. I want to express my gratitude to my literature teacher at high school in Kelardasht, Mr. Delfan, who gave me the drive and self-confidence to do everything that I like and to continue my studies at university. My gratitude also extends to my aunts Golambar and Mahnaz, and their respected family members Ali Cavooci and Salim Teimornejad. I will always remember their help and kindness. Last but not least, I am eternally grateful to my mother Gohar, my father Saadi, and my eldest brother Zabiullah for their unlimited love and support in all aspects of my life. I would also like to thank my mother-in-law Mehr Mah Nekoeemehr, who encouraged me to continue my studies, and my wife Mitra for her unlimited love, understanding, and patience, and as the mother of our lovely daughter Yasamin. Finally, I express my deepest love to Yasamin for her understanding when I was really busy with my work and for her patience all those evenings that I was away.

Delft, Asadollah Shahbahrami September 2008
