Architectural Enhancements for SIMD-Style Multimedia

ARCHITECTURAL ENHANCEMENTS FOR SIMD-STYLE MULTIMEDIA PROCESSING IN GENERAL-PURPOSE PROCESSORS

by

SHUTHAKINI ANUGITHAN

A thesis submitted to the Department of Electrical and Computer Engineering in conformity with the requirements for the degree of Master of Science (Engineering)

Queen’s University
Kingston, Ontario, Canada
April 2007

Copyright © Shuthakini Anugithan, 2007

Abstract

Multimedia applications are becoming increasingly common in personal computers. Hence, efficient multimedia processing is required in general-purpose processors. Subword parallelism in the form of SIMD (Single-Instruction, Multiple-Data) processing has commonly been adapted for multimedia processing in general-purpose processors. However, there have been problems reported with this method. In addition, there have also been problems reported on the behaviour of conventional cache systems for multimedia applications.

This thesis briefly outlines the problems, and proposes a new general-purpose processor architecture known as Media-TCM (Media-Tightly-Coupled Memory). It uses a newly introduced feature known as TCM (Tightly-Coupled Memory) to gain performance. The TCM is a low-latency on-chip memory used for efficient handling of multimedia data. Several new instructions are introduced to assist this task. The features of the Media-TCM architecture are designed to be easily adaptable to the existing general-purpose processors.

The performance of the Media-TCM architecture is evaluated with the help of a simulation model which is developed using the SimpleScalar tool set. The performance is compared with those of a processor enhanced with the Altivec multimedia extension for PowerPC and a processor with no multimedia extensions. A simulation model for Altivec is not currently available. Hence, a simulation model for Altivec is also developed in this research for the purpose of comparison. Five selected multimedia applications are implemented in three different styles and simulated.
The simulation results reveal that the proposed Media-TCM architecture can provide significant performance improvements over the Altivec multimedia extension for three of the applications, but it is marginally worse on two of them. However, the Altivec implementations can still be executed in the Media-TCM architecture.

TO MY PARENTS NAGAMUTTU PULENDRAN AND PASUPATHY PULENDRAN, AND MY HUSBAND ANUGITHAN AMIRTHARAJA

Acknowledgements

This research work has broadened my knowledge in microprocessors for multimedia processing, and has sharpened my ability to conduct independent research. I would like to thank the following individuals, without whose support and guidance this thesis would not have been possible.

• I would like to extend my deep appreciation to my advisor, Prof. Subramania Sudharsanan, for introducing this exciting field to me, for sharing his knowledge in this field, and for his constant guidance, support, and encouragement throughout my research. His research guidance and excellent advice will definitely help to improve my career in the future.
• I am grateful to my co-advisor Prof. Carl Hamacher, for his valuable comments on my technical writing and for his continuous encouragement, expert guidance, time, and support throughout the research. His comments have particularly sharpened my skills in technical writing.
• I express my special thanks to my second co-advisor Prof. Naraig Manjikian, for his continuous guidance and valuable suggestions about the thesis. His valuable guidance and suggestions have directed this dissertation to reach completion.
• I appreciate all the faculty, staff, and students of Electrical and Computer Engineering at Queen’s University for a rewarding experience as a graduate student. I also thank my colleagues in the computer architecture lab for their friendship and for making my time so enjoyable.
• I am forever indebted to my dear husband, Anugithan Amirtharaja, for his love, support, caring, encouragement, and friendship. I appreciate him for his understanding during my busy times.
• I owe my dearest thanks to my parents, Nagamuttu Pulendran and Pasupathy Pulendran, for their everlasting love, affection, encouragement, friendship, sacrifices, and caring throughout my life.
• I also wish to acknowledge the Natural Sciences and Engineering Research Council, and Queen’s University for their financial support during my time as a graduate student at Queen’s University.

Contents

Abstract
Acknowledgements
Contents
List of Tables
List of Figures
Glossary

Chapter 1 Introduction
  1.1 Multimedia Extensions in General-Purpose Processors
    1.1.1 Subword Parallelism
  1.2 Multimedia Processors
  1.3 Problems with Existing Architectures
  1.4 Objectives of Thesis
  1.5 Contributions
  1.6 Organization of Thesis

Chapter 2 Background and Previous Work
  2.1 Characteristics of Multimedia Workloads
  2.2 Behaviour of Conventional Cache Systems for Multimedia Applications
    2.2.1 Multimedia Applications and Cache Performance
    2.2.2 Cache Prefetching Techniques for Improving Cache Behaviour
  2.3 Problems with Current SIMD-style Multimedia Extensions
    2.3.1 Overhead/Supporting instructions
    2.3.2 Nested Loops
    2.3.3 Efficiency
  2.4 Existing Architectural Enhancements
    2.4.1 MOM
    2.4.2 Media Breeze
    2.4.3 CSI
    2.4.4 Imagine
    2.4.5 PLX
  2.5 Tightly-Coupled Memory in the ARM Processors

Chapter 3 Proposed Media-TCM Architecture
  3.1 The Media-TCM Architecture
  3.2 Tightly-Coupled Memory
    3.2.1 Efficient Row and Column Accesses
    3.2.2 Structure
    3.2.3 TCM Hits
    3.2.4 TCM Misses
    3.2.5 Data Replacement Strategy
    3.2.6 TCM Writes
  3.3 Address Generation Unit
    3.3.1 Tables for Holding Memory Addresses and Bank Numbers
    3.3.2 Row Accesses
    3.3.3 Column Accesses
    3.3.4 Prefetching
    3.3.5 Hardware Requirements
  3.4 New Instructions
    3.4.1 TCM Prefetch Instructions
    3.4.2 TCM Row Access Instructions
    3.4.3 TCM Column Access Instructions
    3.4.4 Table Invalidation Instructions

Chapter 4 Simulation Model
  4.1 SimpleScalar
    4.1.1 Adding New Instructions to SimpleScalar
  4.2 The Altivec Simulation Model
  4.3 The Media-TCM Simulation Model

Chapter 5 Application Examples and Implementation Details
  5.1 Reasons for the Selection of Applications
  5.2 Motion-Compensated Prediction in H.264/AVC
    5.2.1 Fractional Sample Interpolation Algorithm
    5.2.2 Common Implementation Details
    5.2.3 Altivec-Style Implementation
    5.2.4 Media-TCM-Style Implementation
  5.3 Adaptive Deblocking Filter in H.264/AVC
    5.3.1 A Brief Overview of the Deblocking Filter Algorithm
    5.3.2 Common Implementation Details
    5.3.3 Altivec-Style Implementation
    5.3.4 Media-TCM-Style Implementation
  5.4 Scaling
    5.4.1 Altivec-Style Implementation
    5.4.2 Media-TCM-Style Implementation
  5.5 Median Filter
    5.5.1 Median Filter Algorithm
    5.5.2 Altivec-Style Implementation
    5.5.3 Media-TCM-Style Implementation
  5.6 Integer DCT Transform

Chapter 6 Performance and Discussion
  6.1 Simulation Environment
  6.2 Performance
    6.2.1 Motion-Compensated Prediction in H.264/AVC
    6.2.2 Adaptive Deblocking Filter in H.264/AVC
    6.2.3 Scaling
    6.2.4 Median Filter
    6.2.5 Integer DCT Transform
  6.3 Summary of Performance
  6.4 Discussion

Chapter 7 Summary and Conclusions
  7.1 Summary
  7.2 Future Work

Bibliography

Appendix A Multimedia Extensions
  A.1 Intel’s MMX and SSE Multimedia Extensions
    A.1.1 MMX
    A.1.2 SSE
    A.1.3 SSE2
    A.1.4 SSE3
    A.1.5 Supplemental SSE3
    A.1.6 SSE4
  A.2 The Altivec Multimedia Extension
  A.3 AMD’s 3DNow!

Appendix B Multimedia Processors
  B.1 Instruction Level Parallelism
  B.2 Cache System, Registers, and DMA
  B.3 Instructions
  B.4 Some Examples of Existing Multimedia Processors

Appendix C The Altivec Multimedia Instructions

Appendix D The Media-TCM Multimedia Instructions

Appendix E Additional Simulation Results
  E.1 Motion-Compensated Prediction in H.264/AVC
  E.2 Adaptive Deblocking Filter in H.264/AVC
  E.3 Scaling
  E.4 Median Filter
  E.5 Integer DCT Transform

List of Tables

3.1 An example of a table that is used for aligned prefetch cases in the address generation unit
3.2 An example of a table that is used for unaligned prefetch cases in the address generation unit
3.3 List of TCM prefetch instructions
3.4 List of TCM row access instructions
3.5 List of TCM column access instructions
3.6 List of
