
June 2010

Optimization Techniques for Next-Generation StarCore DSP Products, Including the MSC815x and MSC825x StarCore DSPs
FTF-NET-F0672
Michael Fleischer
Senior Applications Engineer, Senior Member Technical Staff

Freescale, the Freescale logo, AltiVec, C-5, CodeTEST, CodeWarrior, ColdFire, C-Ware, mobileGT, PowerQUICC, StarCore, and Symphony are trademarks of Freescale Semiconductor, Inc., Reg. U.S. Pat. & Tm. Off. BeeKit, BeeStack, CoreNet, the Energy Efficient Solutions logo, Flexis, MXC, Platform in a Package, Processor Expert, QorIQ, QUICC Engine, SMARTMOS, TurboLink and VortiQa are trademarks of Freescale Semiconductor, Inc. All other product or service names are the property of their respective owners. © 2010 Freescale Semiconductor, Inc.

What is this session about?
In this session we describe how to port code written for the TI C64x and C64x+ core-based DSPs to the Freescale StarCore SC3400-based DSPs used in devices such as the MSC8144, and to the SC3850-based DSPs used in devices such as the MSC8156 and MSC8256. We provide the steps needed to port application code written in the C language. Topics include architectural differences, data types, tool settings, intrinsics, pragmas and optimization suggestions. The focus of this session is porting from the core perspective only; it does not discuss system issues.
Agenda
►Overview of Porting Code
►Core Architecture Comparisons
►Data Types and Intrinsic Functions
►Development Tools
►Performance and Optimizations

Overview: Key Factors in Porting
►Key factors to consider when porting your code
• Data type selection
   Which data types are compatible? Which ones are not?
   What is natively supported by the hardware? What is emulated in software?
• Development tool-specific issues such as configuration, language extensions and performance measurement
   How do you properly profile and benchmark DSP algorithms?
   Which simulator model is the correct one for making comparisons?
   How does the cache affect benchmarks?
   Compiler effects
   Safe techniques for DSP algorithm testing
• Intrinsic mapping
   What is different? What is the same? What is missing? What is added?
• Pragma mapping
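The data type questions above can be made concrete with a small header-style sketch. It assumes a C11 compiler with `<stdint.h>`; the `port_*` names are hypothetical and belong to neither toolchain. The idea is to pin every width the algorithm relies on at build time, and to emulate in software any width the host or target does not provide natively (here, a 40-bit accumulator value held in a 64-bit container).

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical porting typedefs; the names are illustrative only and do not
 * come from either vendor's toolchain. */
typedef int16_t port_s16;   /* 16-bit sample or coefficient */
typedef int32_t port_s32;   /* 32-bit result */
typedef int64_t port_acc;   /* container for a 40-bit accumulator value */

/* Fail the build, rather than the test run, if a width is not what the
 * ported algorithm assumes. */
_Static_assert(sizeof(port_s16) * CHAR_BIT == 16, "port_s16 must be 16 bits");
_Static_assert(sizeof(port_s32) * CHAR_BIT == 32, "port_s32 must be 32 bits");
_Static_assert(sizeof(port_acc) * CHAR_BIT >= 40, "port_acc must hold 40 bits");

/* Emulate a 40-bit two's-complement register in software: keep the low
 * 40 bits of v and sign-extend from bit 39. */
static port_acc port_wrap40(port_acc v)
{
    const uint64_t mask = (UINT64_C(1) << 40) - 1;
    const uint64_t sign = UINT64_C(1) << 39;
    uint64_t low = (uint64_t)v & mask;
    return (port_acc)(low ^ sign) - (port_acc)sign;
}

/* Run-time self-check for the emulation helper. */
static void port_selfcheck(void)
{
    assert(port_wrap40(INT64_C(1) << 39) == -(INT64_C(1) << 39));
    assert(port_wrap40(-1) == -1);
}
```

Because the width of built-in types such as `long` can differ between toolchains, routing every declaration through typedefs like these keeps a width mismatch visible at compile time instead of surfacing later as a numerical difference.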
Overview: Algorithmic Approach
►When moving between very different architectures, such as from the TI DSP architecture to the StarCore architecture, the porting effort must balance the following two options for every operation:
• Exact functional duplication
   You get the same output from the same input using the same mathematical or logical operations
• Equivalent functionality
   Achieves the same result using core-equivalent operations, which may be more efficient and can therefore increase performance and throughput
►A programmer must choose between the two at each step when converting or porting code. Because of the complexity of the code in most applications, and because each approach has different advantages, hand optimization and engineering decisions are often required to balance the two criteria.

Overview: A Good Starting Point
►It cannot be emphasized enough that, before starting code porting work, several things need to be in place:
• A goal for code performance
   Should be reasonable for a given architecture
• A goal for algorithm performance
• A sound method for profiling that gives results that are as accurate as possible
• A solid C language test harness that can provide:
   Test input parameters
   Test input data vectors
   Test output parameters
   Test output vectors
   Comparison to “Golden Model” reference vectors or prior platform reference vectors when available
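The C test harness described under "A Good Starting Point" can be sketched as follows. This is a minimal illustration, not the presenter's harness: `test_vector_t`, `run_test_vector` and `example_halve` are hypothetical names, and the kernel is a stand-in for whatever routine is being ported. The `tolerance` field is one way to express the choice between exact functional duplication (tolerance 0, bit-exact against the golden vectors) and equivalent functionality (a small allowed error).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One test case: input vector, reference ("golden") output, and how close
 * the ported kernel must get. All names here are illustrative. */
typedef struct {
    const int16_t *input;
    const int16_t *golden;
    size_t         length;
    int            tolerance;  /* 0 = bit-exact; >0 = allowed absolute error */
} test_vector_t;

typedef void (*kernel_fn)(const int16_t *in, int16_t *out, size_t n);

/* Run the kernel on one test vector and count outputs outside tolerance.
 * Returns the mismatch count, or -1 if the scratch buffer cannot be
 * allocated. */
static long run_test_vector(kernel_fn kernel, const test_vector_t *tv)
{
    assert(kernel != NULL && tv != NULL && tv->length > 0);

    int16_t *out = malloc(tv->length * sizeof *out);
    if (out == NULL)
        return -1;

    kernel(tv->input, out, tv->length);

    long mismatches = 0;
    for (size_t i = 0; i < tv->length; i++) {
        int diff = (int)out[i] - (int)tv->golden[i];
        if (diff < 0)
            diff = -diff;
        if (diff > tv->tolerance)
            mismatches++;
    }
    free(out);
    return mismatches;
}

/* Stand-in kernel under test: halves each sample, rounding toward zero. */
static void example_halve(const int16_t *in, int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (int16_t)(in[i] / 2);
}
```

A caller would fill a `test_vector_t` with vectors captured from the prior platform (or from the golden model), call `run_test_vector(example_halve, &tv)`, and treat any non-zero return as a failed port of that kernel.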
Core Comparisons: Arithmetic Operations

                                                   SC3400      SC3850                                TI C64x     TI C64x+
Available parallelism                              6           6                                     8           8
Multipliers (width)                                4 (16-bit)  8 (16-bit), organized as 4x dual MAC  2 (32-bit)  2 (32-bit)
Accumulators (width assuming parallel load/store)  4 (40-bit)  4 (40-bit)                            4 (32-bit)  4 (32-bit)

Native multiplies (# per cycle x width): 8x @ 8-bit 8x @ 8-bit 8x @ 8-bit 8x @ 8-bit 4x @ 16-bit 4x @ 16-bit 4x @ 16-bit 4x @ 16-bit 2x @ 32-bit

2x @ 40-bit 4x @ 40-bit 4x @ 40-bit 4x @ 32-bit Native adds (# per cycle x 4x @ 32-bit 4x @ 32-bit 4x @ 32-bit 4x @ 16-bit 8x @ 16-bit width) 8x @ 16-bit 8x @ 16-bit 8x @ 8-bit 16x @ 8-bit 4x @ 40-bit 4x @ 40-bit 2x @ 32-bit Total Multiplies per cycle by 4x @ 16-bit 4x @ 32-bit 4x @ 32-bit 8x @ 16-bit 8x @ 8-bit data type 4x @ 16-bit 8x @ 16-bit 8x @ 8-bit
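The accumulator widths in the comparison above (40-bit on the StarCore cores, 32-bit on the TI cores) are a typical place where the choice between exact functional duplication and equivalent functionality shows up. The portable C sketch below models both behaviours for a 16-bit dot product. It is an assumption-laden model, not vendor code: the function names are hypothetical, `int64_t` merely stands in for a 40-bit accumulator, and the model is only faithful while the running sum actually fits in 40 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a wide value to the 32-bit signed range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Model A: 32-bit accumulator, saturating after every accumulate.
 * This is what exact duplication has to reproduce if the original code
 * saturates on each step. */
static int32_t dot_sat_each_step(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t prod = (int32_t)a[i] * (int32_t)b[i];
        acc = sat32((int64_t)acc + prod);
    }
    return acc;
}

/* Model B: accumulator with guard bits, saturating once at the end.
 * int64_t stands in for the 40-bit accumulator. */
static int32_t dot_guard_bits(const int16_t *a, const int16_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return sat32(acc);
}
```

For inputs whose running sum never leaves the 32-bit range, both models agree. For `a = {32767, 32767, 32767, -32768, -32768}` and `b = {32767, 32767, 32767, 32767, 32767}`, the third accumulate overflows 32 bits: model A clips there and finishes at 65535, while model B keeps the full intermediate sum and returns 1073610755. Which result counts as correct is exactly the engineering decision the porting effort has to make, and the golden vectors should be generated accordingly.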
Core Comparisons: Architecture

                                   SC3400                      SC3850                      TI C64x                     TI C64x+
Available instruction parallelism  6 (4 ALUs, 2 AGUs)          6 (4 ALUs, 2 AGUs)          8 (6 ALUs, 2 multipliers)   8 (6 ALUs, 2 multipliers)
Data bandwidth                     2 x 64-bit load or store    2 x 64-bit load or store    2 x 64-bit                  2 x 64-bit
Load/store units                   2                           2                           2                           2

Registers
• SC3400 and SC3850 (each): 16 data (40-bit), 16 address (32-bit), 4 modulus, 4 offset, 4 HW loop counters, 4 HW loop starts, 2 stack pointers
• TI C64x and TI C64x+ (each): 64 general purpose (32-bit) (or up to 32, 64-bit registers)

SC3850 Core Architecture
