Architectural Enhancements for SIMD-Style Multimedia


Architectural Enhancements for SIMD-Style Multimedia Processing in General-Purpose Processors

by

Shuthakini Anugithan

A thesis submitted to the Department of Electrical and Computer Engineering in conformity with the requirements for the degree of Master of Science (Engineering)

Queen’s University
Kingston, Ontario, Canada
April 2007

Copyright © Shuthakini Anugithan, 2007

Abstract

Multimedia applications are becoming increasingly common in personal computers. Hence, efficient multimedia processing is required in general-purpose processors. Subword parallelism in the form of SIMD (Single-Instruction, Multiple-Data) processing has commonly been adapted for multimedia processing in general-purpose processors. However, there have been problems reported with this method. In addition, there have also been problems reported on the behaviour of conventional cache systems for multimedia applications.

This thesis briefly outlines the problems, and proposes a new general-purpose processor architecture known as Media-TCM (Media-Tightly-Coupled Memory). It uses a newly introduced feature known as TCM (Tightly-Coupled Memory) to gain performance. The TCM is a low-latency on-chip memory used for efficient handling of multimedia data. Several new instructions are introduced to assist this task. The features of the Media-TCM architecture are designed to be easily adaptable to the existing general-purpose processors.

The performance of the Media-TCM architecture is evaluated with the help of a simulation model which is developed using the SimpleScalar tool set. The performance is compared with those of a processor enhanced with the Altivec multimedia extension for PowerPC and a processor with no multimedia extensions. A simulation model for Altivec is not currently available. Hence, a simulation model for Altivec is also developed in this research for the purpose of comparison.

Five selected multimedia applications are implemented in three different styles and simulated. The simulation results reveal that the proposed Media-TCM architecture can provide significant performance improvements over the Altivec multimedia extension for three of the applications, but it is marginally worse on two of them. However, the Altivec implementations can still be executed in the Media-TCM architecture.

To my parents, Nagamuttu Pulendran and Pasupathy Pulendran, and my husband, Anugithan Amirtharaja

Acknowledgements

This research work has broadened my knowledge in microprocessors for multimedia processing, and has sharpened my ability to conduct independent research. I would like to thank the following individuals, without whose support and guidance this thesis would not have been possible.

• I would like to extend my deep appreciation to my advisor, Prof. Subramania Sudharsanan, for introducing this exciting field to me, for sharing his knowledge in this field, and for his constant guidance, support, and encouragement throughout my research. His research guidance and excellent advice will definitely help to improve my career in the future.

• I am grateful to my co-advisor Prof. Carl Hamacher, for his valuable comments on my technical writing and for his continuous encouragement, expert guidance, time, and support throughout the research. His comments have particularly sharpened my skills in technical writing.

• I express my special thanks to my second co-advisor Prof. Naraig Manjikian, for his continuous guidance and valuable suggestions about the thesis. His valuable guidance and suggestions have directed this dissertation to reach completion.

• I appreciate all the faculty, staff, and students of Electrical and Computer Engineering at Queen’s University for a rewarding experience as a graduate student. I also thank my colleagues in the computer architecture lab for their friendship and for making my time so enjoyable.

• I am forever indebted to my dear husband, Anugithan Amirtharaja, for his love, support, caring, encouragement, and friendship. I appreciate him for his understanding during my busy times.

• I owe my dearest thanks to my parents, Nagamuttu Pulendran and Pasupathy Pulendran, for their everlasting love, affection, encouragement, friendship, sacrifices, and caring throughout my life.

• I also wish to acknowledge the Natural Sciences and Engineering Research Council, and Queen’s University for their financial support during my time as a graduate student at Queen’s University.

Contents

Abstract
Acknowledgements
Contents
List of Tables
List of Figures
Glossary

Chapter 1 Introduction
  1.1 Multimedia Extensions in General-Purpose Processors
    1.1.1 Subword Parallelism
  1.2 Multimedia Processors
  1.3 Problems with Existing Architectures
  1.4 Objectives of Thesis
  1.5 Contributions
  1.6 Organization of Thesis

Chapter 2 Background and Previous Work
  2.1 Characteristics of Multimedia Workloads
  2.2 Behaviour of Conventional Cache Systems for Multimedia Applications
    2.2.1 Multimedia Applications and Cache Performance
    2.2.2 Cache Prefetching Techniques for Improving Cache Behaviour
  2.3 Problems with Current SIMD-style Multimedia Extensions
    2.3.1 Overhead/Supporting Instructions
    2.3.2 Nested Loops
    2.3.3 Efficiency
  2.4 Existing Architectural Enhancements
    2.4.1 MOM
    2.4.2 Media Breeze
    2.4.3 CSI
    2.4.4 Imagine
    2.4.5 PLX
  2.5 Tightly-Coupled Memory in the ARM Processors

Chapter 3 Proposed Media-TCM Architecture
  3.1 The Media-TCM Architecture
  3.2 Tightly-Coupled Memory
    3.2.1 Efficient Row and Column Accesses
    3.2.2 Structure
    3.2.3 TCM Hits
    3.2.4 TCM Misses
    3.2.5 Data Replacement Strategy
    3.2.6 TCM Writes
  3.3 Address Generation Unit
    3.3.1 Tables for Holding Memory Addresses and Bank Numbers
    3.3.2 Row Accesses
    3.3.3 Column Accesses
    3.3.4 Prefetching
    3.3.5 Hardware Requirements
  3.4 New Instructions
    3.4.1 TCM Prefetch Instructions
    3.4.2 TCM Row Access Instructions
    3.4.3 TCM Column Access Instructions
    3.4.4 Table Invalidation Instructions

Chapter 4 Simulation Model
  4.1 SimpleScalar
    4.1.1 Adding New Instructions to SimpleScalar
  4.2 The Altivec Simulation Model
  4.3 The Media-TCM Simulation Model

Chapter 5 Application Examples and Implementation Details
  5.1 Reasons for the Selection of Applications
  5.2 Motion-Compensated Prediction in H.264/AVC
    5.2.1 Fractional Sample Interpolation Algorithm
    5.2.2 Common Implementation Details
    5.2.3 Altivec-Style Implementation
    5.2.4 Media-TCM-Style Implementation
  5.3 Adaptive Deblocking Filter in H.264/AVC
    5.3.1 A Brief Overview of the Deblocking Filter Algorithm
    5.3.2 Common Implementation Details
    5.3.3 Altivec-Style Implementation
    5.3.4 Media-TCM-Style Implementation
  5.4 Scaling
    5.4.1 Altivec-Style Implementation
    5.4.2 Media-TCM-Style Implementation
  5.5 Median Filter
    5.5.1 Median Filter Algorithm
    5.5.2 Altivec-Style Implementation
    5.5.3 Media-TCM-Style Implementation
  5.6 Integer DCT Transform

Chapter 6 Performance and Discussion
  6.1 Simulation Environment
  6.2 Performance
    6.2.1 Motion-Compensated Prediction in H.264/AVC
    6.2.2 Adaptive Deblocking Filter in H.264/AVC
    6.2.3 Scaling
    6.2.4 Median Filter
    6.2.5 Integer DCT Transform
  6.3 Summary of Performance
  6.4 Discussion

Chapter 7 Summary and Conclusions
  7.1 Summary
  7.2 Future Work

Bibliography

Appendix A Multimedia Extensions
  A.1 Intel’s MMX and SSE Multimedia Extensions
    A.1.1 MMX
    A.1.2 SSE
    A.1.3 SSE2
    A.1.4 SSE3
    A.1.5 Supplemental SSE3
    A.1.6 SSE4
  A.2 The Altivec Multimedia Extension
  A.3 AMD’s 3DNow!

Appendix B Multimedia Processors
  B.1 Instruction Level Parallelism
  B.2 Cache System, Registers, and DMA
  B.3 Instructions
  B.4 Some Examples of Existing Multimedia Processors

Appendix C The Altivec Multimedia Instructions

Appendix D The Media-TCM Multimedia Instructions

Appendix E Additional Simulation Results
  E.1 Motion-Compensated Prediction in H.264/AVC
  E.2 Adaptive Deblocking Filter in H.264/AVC
  E.3 Scaling
  E.4 Median Filter
  E.5 Integer DCT Transform

List of Tables

3.1 An example of a table that is used for aligned prefetch cases in the address generation unit
3.2 An example of a table that is used for unaligned prefetch cases in the address generation unit
3.3 List of TCM prefetch instructions
3.4 List of TCM row access instructions
3.5 List of TCM column access instructions
3.6 List of.