June, 2010 Optimization Techniques for Next-Generation StarCore DSP Products, Including the MSC815x and MSC825x StarCore DSPs FTF-NET-F0672 Michael Fleischer Senior Applications Engineer, Senior Member Technical Staff

Freescale, the Freescale logo, AltiVec, C-5, CodeTEST, CodeWarrior, ColdFire, C-Ware, mobileGT, PowerQUICC, StarCore, and Symphony are trademarks of Freescale Semiconductor, Inc., Reg. U.S. Pat. & Tm. Off. BeeKit, BeeStack, CoreNet, the Energy Efficient Solutions logo, Flexis, MXC, Platform in a Package, Processor Expert, QorIQ, QUICC Engine, SMARTMOS, TurboLink and VortiQa are trademarks of Freescale Semiconductor, Inc. All other product or service names are the property of their respective owners. © 2010 Freescale Semiconductor, Inc.

What is this session about?

In this session we describe how to port code written for the TI C64x and C64x+ core-based DSPs to the Freescale StarCore SC3400-based DSPs (used in devices such as the MSC8144) and SC3850-based DSPs (used in devices such as the MSC8156 and MSC8256). We provide the steps needed to port application code written in the C language. Topics include architectural differences, data types, tools settings, intrinsics, pragmas and optimization suggestions. The session covers porting from the core perspective only; it does not discuss system issues.

Agenda

►Overview of Porting Code
►Core Architecture Comparisons
►Data Types and Intrinsic Functions
►Development Tools
►Performance and Optimizations

Overview: Key Factors in Porting

► Key factors to consider when porting your code
• Data type selection
  - Which data types are compatible?
  - Which ones are not?
  - What is natively supported by the hardware?
  - What is emulated in software?
• Development tool-specific issues such as configuration, language extensions and performance measurement
  - How to properly profile and benchmark DSP algorithms?
  - Choosing the correct simulator model to make comparisons?
  - Cache effect on benchmarks?
  - ... effects
  - Safe techniques for DSP algorithm testing
• Intrinsic mapping
  - What is different?
  - What is the same?
  - What is missing?
  - What is added?
• Pragma mapping

Overview: Algorithmic Approach

►When moving between very different architectures, such as from the TI DSP architecture to the StarCore architecture, the porting effort must balance two options for every operation:
• Exact functional duplication
  - For example, you get the same output from the same input using the same mathematical or logical operations
• Equivalent functionality
  - Achieves the same result using core-equivalent operations, which may be more efficient and therefore can increase performance and throughput
►A programmer must choose between the two at each step when porting code. Because most application code is complex and each approach has different advantages, hand optimization and engineering decisions are often required to balance the two criteria.

Overview: A Good Starting Point

►It cannot be emphasized enough that before starting code porting work, several things need to be in place:
• A goal for code performance
  - Should be reasonable for a given architecture
• A goal for algorithm performance
• A sound method for profiling that gives results that are as accurate as possible
• A solid C language test harness that can provide:
  - Test input parameters
  - Test input data vectors
  - Test output parameters
  - Test output vectors
  - Comparison to "Golden Model" reference vectors or prior platform reference vectors when available
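A test harness of the kind listed above can be sketched in portable C as follows. This is only an illustration under stated assumptions: `kernel_under_test`, `compare_to_golden` and `run_test_case` are hypothetical names, the kernel is a placeholder, and the vectors are made up for the example rather than taken from any real algorithm.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical kernel under test: halves each 16-bit sample. */
static void kernel_under_test(const int16_t *in, int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = (int16_t)(in[i] / 2);
    }
}

/* Compare an output vector with the golden reference vector.
 * Returns the number of samples whose absolute error exceeds tol. */
static size_t compare_to_golden(const int16_t *out, const int16_t *golden,
                                size_t n, int tol)
{
    size_t mismatches = 0;
    for (size_t i = 0; i < n; i++) {
        int diff = (int)out[i] - (int)golden[i];
        if (abs(diff) > tol) {
            mismatches++;
        }
    }
    return mismatches;
}

/* One test case: fixed input vector, fixed golden vector, exact match
 * required (tol = 0). Returns 0 on pass and 1 on failure. */
static int run_test_case(void)
{
    static const int16_t input[4]  = { 100, -200, 32767, -32768 };
    static const int16_t golden[4] = { 50, -100, 16383, -16384 };
    int16_t output[4];

    kernel_under_test(input, output, 4);
    return compare_to_golden(output, golden, 4, 0) == 0 ? 0 : 1;
}
```

Keeping the comparison separate from the kernel lets the same golden vectors be reused unchanged on the original platform and on the port; a non-zero tolerance can be passed when only equivalent (not bit-exact) functionality is required.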

Agenda

►Overview of Porting Code
►Core Architecture Comparisons
►Data Types and Intrinsic Functions
►Development Tools
►Performance and Optimizations

Core Comparisons: Arithmetic Operations

 | SC3400 | SC3850 | TI C64x | TI C64x+
Available Parallelism | 6 | 6 | 8 | 8
Multipliers (Width) | 4 (16-bit) | 8 (16-bit), organized as 4x dual MAC | 2 (32-bit) | 2 (32-bit)

8x @ 8-bit Native multiplies (# per cycle x 8x @ 8-bit 8x @ 8-bit 8x @ 8-bit 4x @ 16-bit 4x @ 16-bit 4x @ 16-bit 4x @ 16-bit width) 2x @ 32-bit Accumulators (width assuming 4 (40-bit) 4 (40-bit) 4 (32-bit) 4 (32-bit) parallel load/store) 2x @ 40-bit 4x @ 40-bit 4x @ 40-bit 4x @ 32-bit Native adds (# per cycle x 4x @ 32-bit 4x @ 32-bit 4x @ 32-bit 4x @ 16-bit 8x @ 16-bit width) 8x @ 16-bit 8x @ 16-bit 8x @ 8-bit 16x @ 8-bit 4x @ 40-bit 4x @ 40-bit 2x @ 32-bit Total Multiplies per cycle by 4x @ 16-bit 4x @ 32-bit 4x @ 32-bit 8x @ 16-bit 8x @ 8-bit data type 4x @ 16-bit 8x @ 16-bit 8x @ 8-bit

Core Comparisons: Architecture

 | SC3400 | SC3850 | TI C64x | TI C64x+
Available Instruction Parallelism | 6 (4 ALUs, 2 AGUs) | 6 (4 ALUs, 2 AGUs) | 8 (6 ALUs, 2 multipliers) | 8 (6 ALUs, 2 multipliers)
Data Bandwidth | 2 x 64-bit load or store | 2 x 64-bit load or store | 2 x 64-bit | 2 x 64-bit
Load/Store Units | 2 | 2 | 2 | 2
Registers (SC3400 and SC3850, each): 16 Data (40-bit), 16 Address (32-bit), 4 Modulus, 4 Offset, 4 HW Loop Counters, 4 HW Loop Starts, 2 Stack Pointers
Registers (TI C64x and C64x+, each): 64 General Purpose (32-bit), or up to 32 64-bit registers

SC3850 Core Architecture

[Figure: SC3850 core block diagram. Buses: XA_DATA and XB_DATA (64-bit), XA_ADDR and XB_ADDR (32-bit), XP_DATA (128-bit), XP_ADDR (32-bit), instruction bus. Blocks: AGU (address registers), DALU (data registers; MAC0a to MAC3a, MAC0b to MAC3b, Logic0 to Logic3), PCU, PAG, BTB, PDU, COF, INT, ALU0, ALU1, BMU, OCE, RSU (Resource Stall Unit).]

Agenda

►Overview of Porting Code
►Core Architecture Comparisons
►Data Types and Intrinsic Functions
►Development Tools
►Performance and Optimizations

Data Type Considerations

►40-bit precision data storage
• StarCore data registers are 40 bits wide and store all 40-bit precision data in a single register
• TI C64x/C64x+ data registers are 32 bits wide and can be paired to form a register up to 64 bits wide
  - 40-bit precision data is stored in bits 0:39 of a register pair
►40-bit data usage
• StarCore's C compiler uses a structure to save 40-bit data as a 40-bit fractional value: Word40 and UWord40
• TI's C compiler extends the C language such that a long or unsigned long value is a 40-bit integer

Data Type Comparison Table

C Data Type | StarCore size | StarCore placement in 40-bit register | TI C64x/C64x+ size | TI placement in 32-bit register | Different?
char | 8 | Bits 0:7 | 8 | Bits 0:7 | No
unsigned char | 8 | Bits 0:7 | 8 | Bits 0:7 | No
short | 16 | Bits 0:15 | 16 | Bits 0:15 | No
unsigned short | 16 | Bits 0:15 | 16 | Bits 0:15 | No
int | 32 | Bits 0:31 | 32 | Entire 32-bit register | No
unsigned int | 32 | Bits 0:31 | 32 | Entire 32-bit register | No
long | 32 | Bits 0:31 | 40 | Even/odd register pair, bits 0:39 | Yes
unsigned long | 32 | Bits 0:31 | 40 | Even/odd register pair, bits 0:39 | Yes
long long | 64 (1) | Even/odd register pair, bits 0:31, bits 0:31 | 64 | Even/odd register pair | No
unsigned long long | 64 (1) | Even/odd register pair, bits 0:31, bits 0:31 | 64 | Even/odd register pair | No
float | 32 | Bits 0:31 | 32 | Entire 32-bit register | Yes
double, long double | 64 (1) | Even/odd register pair, bits 0:31, bits 0:31 | 64 | Even/odd register pair | Yes
pointer | 32 | Bits 0:31 of data register or entire 32-bit address register | 32 | Entire 32-bit register | No

(1) Using 64-bit precision with the StarCore compiler requires the -slld command line switch

Data Types: Fractional vs. Integer types

►Freescale StarCore DSPs and TI DSPs handle fractional data types and saturation very differently
►StarCore DSPs have both integer and fractional instructions and can execute both types of instructions simultaneously
• Fractional instructions saturate when saturation mode is set (SM bit in the status register)
• Fractional multiply instructions always do a left shift before writing the result back to the register
• Integer instructions never saturate
►Standard C code generates integer instructions
►Fractional data types can only be accessed through intrinsic (prototype) functions

Data Types: Fractional vs. Integer: MAC Example

Integer

C code:
    short a, b;
    int c;
    c = c + (a*b);

Assembly:
    move.w ($c000574c),d1
    move.w ($c000574e),d3
    move.l ($c0005620),d0
    imac d1,d3,d0

imac is the integer mac instruction.

Fractional

C code:
    Word16 a, b;
    Word32 c;
    c = L_mac(c,a,b);

Assembly:
    move.f ($c0005752),d1
    move.f ($c0005750),d3
    move.l ($c0005624),d2
    mac d1,d3,d2

mac is the fractional mac instruction.

Fractional vs. Integer: MAC Example

Integer

    short a, b;
    int c;
    c = c + (a*b);

a = 0x4500 = 17 664
b = 0x2000 = 8 192
c = 0x0000 0A00 = 2 560
a * b = 0x08A0 0000 = 144 703 488
c + (a*b) = 0x08A0 0A00 = 144 706 048 (OK)

The fractional multiply shifts the product left by one: 0x08A0 0000 << 1 = 0x1140 0000, so the two results differ: 0x08A0 0A00 ≠ 0x1140 0A00.

Fractional (Q15/Q31)

    Word16 a, b;
    Word32 c;
    c = L_mac(c,a,b);

a = 0x4500 = 0.5390625
b = 0x2000 = 0.25
c = 0x0000 0A00 = 2^-20 + 2^-22
a * b = 0x1140 0000 = 0.134765625
c + (a*b) = 0x1140 0A00 = 0.134765625 + 2^-20 + 2^-22 (OK)
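The arithmetic in this example can be reproduced on any host with a small portable C model. This is a sketch, not the StarCore intrinsics: `int_mac` and `frac_mac_nosat` are hypothetical helper names, saturation is deliberately left out, and the inputs are assumed small enough that no overflow occurs.

```c
#include <stdint.h>

/* Integer MAC as standard C expresses it: c + a*b.
 * Assumes the result fits in 32 bits. */
static int32_t int_mac(int32_t c, int16_t a, int16_t b)
{
    return c + (int32_t)a * (int32_t)b;
}

/* Fractional MAC model (Q15 x Q15 accumulated in Q31), no saturation:
 * the product is doubled (shifted left by one) before the add.
 * Assumes the result fits in 32 bits. */
static int32_t frac_mac_nosat(int32_t c, int16_t a, int16_t b)
{
    int64_t product = (int64_t)a * (int64_t)b * 2;
    return (int32_t)((int64_t)c + product);
}
```

With a = 0x4500, b = 0x2000 and c = 0x0A00, `int_mac` gives 0x08A00A00 and `frac_mac_nosat` gives 0x11400A00, matching the two columns of the slide.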

Data Types: Integer

►StarCore DSPs favor fractional arithmetic. Wherever possible, try to use fractional instructions
►Integer instructions can often generate extra instructions to transfer results. For example:

Integer Additions

C code:
    for (i=0; i< MAX; i++)
    {
        d2 = d0+d1;
        d3 = d0+d1;
        d1 = d0+d1;
        . . .
    }

Assembly code:
    dosetup3 L11
    doen3 #MAX
    FALIGN
    LOOPSTART3
    L11
    tfr d1,d2
    tfr d1,d3
    iadd d0,d2
    iadd d0,d3
    iadd d0,d1
    . . .
    LOOPEND3

Data Types: Fractional

►The extra tfr instructions used with iadd will add cycles to the loop, so using fractional arithmetic may make more sense
►To use fractional math here:
• Prior to the start of the loop, disable saturation mode
• Use an intrinsic add function for the additions
• Re-enable saturation mode when the loop completes

Fractional Addition

C code:
    setnosat();
    for (i=0; i< MAX; i++)
    {
        d2 = add(d0, d1);
        d3 = add(d0, d1);
        d1 = add(d0, d1);
        . . .
    }
    setsat();

Assembly code:
    dosetup3 L11
    doen3 #MAX
    FALIGN
    LOOPSTART3
    L11
    add d0,d1,d2
    add d0,d1,d3
    add d0,d1,d1
    . . .
    LOOPEND3

Data Types: Porting Rules

► Integer vs fractional
• Multiply results are different (shifted left by one to keep the decimal place fixed)
• No issue for char, short or int data types → stick with integer data types and instructions
• Need to be careful only when porting the long data type → Word40 intrinsics (fractional) are needed
► 2 Word40 intrinsics map to integer instructions:
• Word40 X_imacus (Word40 c, Word32 a, Word32 b) performs integer multiplication of the low part of a with the high part of b and adds it to c
• Word40 X_imacuu (Word40 c, Word32 a, Word32 b) performs integer multiplication of the low part of a with the low part of b and adds it to c
• No need for a right shift
• More intrinsics can be created

Data Types: Porting « long » Data Type: Example (1/3)

a = 0x7F FFFF FFFE
b = 0x00 0000 0021

C64x+:
    long a, b, c;
    c = a + b;

    c = 0x80 0000 001F

SC3850:
    Word40 a, b, c;
    setnosat();
    c = X_add(a,b);

    With saturation:    c = 0x00 7FFF FFFF
    Without saturation: c = 0x80 0000 001F

► X_add saturates on 32 bits
► Turning off saturation gives the same result as TI
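The two results in this example can be checked with a host-side model of 40-bit addition. The sketch below is not the X_add intrinsic; `add40_wrap`, `add40_sat32` and `sext40` are hypothetical helpers that only mimic the behavior described on the slide (wrap at 40 bits versus clamp to the signed 32-bit range), with 40-bit values held in the low bits of a 64-bit integer.

```c
#include <stdint.h>

#define MASK40 0xFFFFFFFFFFULL

/* Sign-extend a 40-bit pattern held in the low bits of v. */
static int64_t sext40(uint64_t v)
{
    v &= MASK40;
    if (v & 0x8000000000ULL) {
        return (int64_t)v - ((int64_t)1 << 40);
    }
    return (int64_t)v;
}

/* 40-bit add with no saturation: the sum simply wraps at 40 bits. */
static uint64_t add40_wrap(uint64_t a, uint64_t b)
{
    return (a + b) & MASK40;
}

/* 40-bit add whose result is clamped to the signed 32-bit range,
 * modeling the "saturates on 32 bits" behavior of the slide. */
static uint64_t add40_sat32(uint64_t a, uint64_t b)
{
    int64_t sum = sext40(a) + sext40(b);
    if (sum > INT32_MAX) {
        sum = INT32_MAX;
    }
    if (sum < INT32_MIN) {
        sum = INT32_MIN;
    }
    return (uint64_t)sum & MASK40;
}
```

For a = 0x7F FFFF FFFE and b = 0x21, `add40_wrap` returns 0x80 0000 001F and `add40_sat32` returns 0x00 7FFF FFFF, the same pair of values shown for the SC3850 with saturation off and on.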

Porting « long » Data Type: Example (2/3)

C64x+:
    short a, b;
    long c;
    c = a * b;

SC3850:
    Word16 a, b;
    Word40 c;
    setnosat();
    c = X_mult(a,b);
    c = X_shr (c,1);

► When multiplying with 40-bit integer values, the result needs to be shifted right by 1 position to be the same as TI's long (±1)

Porting « long » Data Type: Example (3/3)

C with intrinsics:
    Word40 a, b, c;
    c = X_add(a,b);

Assembly:
    move.l ($c0005530),d0
    move.l ($c0005528),d2
    move.l ($c0005534),d0.e
    move.l ($c000552c),d2.e
    add d0,d2,d0
    move.l d0,($c0005520)
    move.l d0.e:d1.e,($c0005524)

► To avoid spending 2 move.l instructions on each long data move, use the -Xicode "-- CC_40Bit_In_Reg=TRUE" compilation switch to port 40-bit data types

► Or set CC_40Bit_In_Reg = TRUE via the application file

Intrinsic Functions

► An intrinsic is used as a function call in C code
► An intrinsic function maps to one or several assembly instructions, directly through the compiler
► Benefits of intrinsics:
• Leverage the core architecture in ways that cannot be achieved with standard C code
• Access fractional data
• Access SIMD instructions
• Improve performance of compiler-generated code
• Get core-specific instructions without resorting to inline assembly code

Intrinsic Functions

►Using intrinsic functions on StarCore:
• Include

►C with intrinsics:
    #include
    ret=L_mac(ret, x[i], y[i]);

►Corresponding assembly code:
    [
      mac d4,d5,d0
      move.f (r1)+,d5
      move.f (r0)+,d4
    ]
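When the StarCore toolchain is not available on the host, a reference model of the fractional MAC is useful for producing golden vectors. The sketch below follows the widely used ETSI-style definition (saturating Q15 multiply, then saturating 32-bit add). `ref_L_mac`, `ref_dot_q15` and `sat32` are hypothetical names, and the real intrinsic may differ in details such as its dependence on the saturation mode.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp a 64-bit value to the signed 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) {
        return INT32_MAX;
    }
    if (v < INT32_MIN) {
        return INT32_MIN;
    }
    return (int32_t)v;
}

/* Reference model of a saturating fractional MAC:
 * acc + ((a * b) << 1), saturating the product and then the sum. */
static int32_t ref_L_mac(int32_t acc, int16_t a, int16_t b)
{
    int64_t product = (int64_t)a * (int64_t)b * 2;
    return sat32((int64_t)sat32(product) + (int64_t)acc);
}

/* Q15 dot product accumulated in Q31, mirroring the loop body
 * ret = L_mac(ret, x[i], y[i]) from the slide. */
static int32_t ref_dot_q15(const int16_t *x, const int16_t *y, size_t n)
{
    int32_t ret = 0;
    for (size_t i = 0; i < n; i++) {
        ret = ref_L_mac(ret, x[i], y[i]);
    }
    return ret;
}
```

For x = {0.5, 0.5} and y = {0.5, 0.25} in Q15, `ref_dot_q15` returns 0x30000000, which is 0.375 in Q31.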

StarCore Intrinsics Data Types (1/2)

►Scalar fractional data types
►All fractional data are placed in 40-bit registers (DALU)

Type | Size | Description
Word8 | 8 | Signed char fractional data type
UWord8 | 8 | Unsigned char fractional data type
Word16 | 16 | Signed short fractional data type
UWord16 | 16 | Unsigned short fractional data type
Word32 | 32 | Signed long fractional data type
UWord32 | 32 | Unsigned long fractional data type
Word40 | 40 | Signed extended precision fractional data type
UWord40 | 40 | Unsigned extended precision fractional data type
Word64 | 64 | Signed double precision fractional data type

StarCore Intrinsics Data Types (2/2)

►Vector/packed fractional data types
►All fractional data are placed in 40-bit registers (DALU)

Type | Size | Description
Vector_Type32 | 32 | 32-bit vector data type
Vector_Type40 | 40 | 40-bit vector data type
Vector_Component8 | 8 | 8-bit vector component
Vector_Component16 | 16 | 16-bit vector component
Vector_Component20 | 20 | 20-bit vector component
Vector_ComponentU8 | 8 | Unsigned 8-bit vector component
Vector_ComponentU16 | 16 | Unsigned 16-bit vector component
Complex16 | 32 | 32-bit complex fractional (2x16)
Complex20 | 40 | 40-bit complex fractional (2x20)

Fractional Intrinsic Data Types Definition

Word40:
    typedef struct Word40
    {
        UWord32 body;
        char ext;
    } Word40;

UWord40:
    typedef struct UWord40
    {
        UWord32 body;
        unsigned char ext;
    } UWord40;

Word64:
    typedef struct Word64
    {
        Word32 msb;
        UWord32 lsb;
    } Word64;

► Support for elements whose size is not native to the C language is done using structures

► The « body » and « ext » components of a (U)Word40 are stored in one 40-bit register: ext in bits 39:32, body in bits 31:0
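The body/ext layout can be exercised on a host with a small conversion helper. The sketch below uses its own `DemoWord40` type so it does not clash with the compiler's definitions. The type and the two functions are hypothetical, and the only layout assumption is the one stated on the slide: ext holds bits 39:32 (signed) and body holds bits 31:0.

```c
#include <stdint.h>

/* Host-side stand-in for the slide's Word40 structure. */
typedef struct {
    uint32_t body;    /* bits 31:0 of the 40-bit value */
    signed char ext;  /* bits 39:32, carries the sign */
} DemoWord40;

/* Widen a 40-bit structure value to a sign-extended 64-bit integer. */
static int64_t demo_word40_to_i64(DemoWord40 w)
{
    return (int64_t)w.ext * 4294967296LL + (int64_t)w.body;
}

/* Narrow a 64-bit integer to the 40-bit structure (upper bits dropped). */
static DemoWord40 demo_i64_to_word40(int64_t v)
{
    DemoWord40 w;
    unsigned ext_bits = (unsigned)(((uint64_t)v >> 32) & 0xFFu);

    w.body = (uint32_t)((uint64_t)v & 0xFFFFFFFFu);
    /* Map 0x80..0xFF to -128..-1 without relying on
     * implementation-defined narrowing conversions. */
    w.ext = (ext_bits & 0x80u) ? (signed char)((int)ext_bits - 256)
                               : (signed char)ext_bits;
    return w;
}
```

As an example, the 40-bit pattern 0x80 0000 001F corresponds to ext = -128 and body = 0x1F, which widens to -549755813857.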

StarCore Intrinsic Function Naming Conventions

Prefix Description Example

Data Types: Differences

C Data Type | TI C64x+ size (bits) | SC3850 size (bits) | Note
char | 8 | 8 |
unsigned char | 8 | 8 |
short | 16 | 16 |
unsigned short | 16 | 16 |
int | 32 | 32 |
unsigned int | 32 | 32 |
long | 40 | 32 | Size mismatch
unsigned long | 40 | 32 | Size mismatch
long long | 64 | 64 |
unsigned long long | 64 | 64 |
float | 32 | 32 | No native support
double | 64 | 64 | No native support
long double | 64 | 64 | No native support

Data Types: Mapping

C Data Type on TI C64x/C64x+ | Recommended Data Type on StarCore
char | char
unsigned char | unsigned char
short | short
unsigned short | unsigned short
int | int
unsigned int | unsigned int
  Note: 32-bit multiply is not natively supported on SC3400
long int (long) | Word40
unsigned long int (unsigned long) | UWord40
  Note: If 40 bits are needed, use X_* intrinsic functions. If 32 bits are needed, use L_* intrinsic functions
long long | long long
unsigned long long | unsigned long long
  Note: It is recommended to avoid these data types and use multiple operations on smaller data types instead
float | float
double, long double | double, long double
  Note: It is also recommended to avoid floating point data types. There is no native support for floating point operations on StarCore

Intrinsic Functions: Saturated vs. Non Saturated Arithmetic

►The TI architecture has separate intrinsic functions for saturating and non-saturating (overflow/underflow) arithmetic
• _add2: normal dual addition
• _sadd2: saturated dual addition

►StarCore uses saturation mode bits in the Status Register (SR) to change the behavior of some intrinsic functions from saturating mode to non-saturating mode
• V_add2: normal or saturating dual addition, depending on SR
►Use setsat16() for 16-bit saturation mode, and setsat32() for 32-bit saturation mode
►Use setnosat16() and setnosat32() to put the core in non-saturating modes
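The difference between the two styles can be modeled on a host: a single dual-add routine takes a mode flag (standing in for the SR saturation bit) and therefore covers both the wrapping and the saturating TI variants. This is an illustrative sketch only. `dual_add16` and its helpers are hypothetical names, not the TI or StarCore intrinsics, and the flag is passed explicitly rather than read from a status register.

```c
#include <stdbool.h>
#include <stdint.h>

/* Interpret a 16-bit pattern as a signed value. */
static int32_t sext16(uint16_t v)
{
    return (v & 0x8000u) ? (int32_t)v - 65536 : (int32_t)v;
}

/* Add two signed 16-bit lanes; clamp when saturate is true,
 * otherwise wrap at the 16-bit boundary. */
static uint16_t add16_lane(uint16_t a, uint16_t b, bool saturate)
{
    int32_t sum = sext16(a) + sext16(b);
    if (saturate) {
        if (sum > 32767) {
            sum = 32767;
        }
        if (sum < -32768) {
            sum = -32768;
        }
    }
    return (uint16_t)(uint32_t)sum;
}

/* Dual 16-bit addition on packed 32-bit words. Each half is added
 * independently, so no carry crosses from the low to the high half. */
static uint32_t dual_add16(uint32_t x, uint32_t y, bool saturate)
{
    uint16_t hi = add16_lane((uint16_t)(x >> 16), (uint16_t)(y >> 16), saturate);
    uint16_t lo = add16_lane((uint16_t)x, (uint16_t)y, saturate);
    return ((uint32_t)hi << 16) | lo;
}
```

With x = 0x7FFF0001 and y = 0x00010001, the wrapping mode gives 0x80000002 and the saturating mode gives 0x7FFF0002. When porting, each TI call site therefore has to be checked against the mode in effect at that point.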

Intrinsic Functions: Two Types of Saturation

► The StarCore architecture supports two saturation modes for signed integer and signed fractional data types
 • 16-bit (for packed data types)
 • 32-bit (for non-packed data types)
► These modes are set by configuring bits in the SR:
 • SM = 0 → 32-bit saturation is disabled
 • SM2 = 0 → 16-bit saturation is disabled
► SM2 saturation only affects SIMD instructions
► The default behavior is determined by the SR setting loaded during C runtime library initialization and can be modified in the linker command file of any project (« _SR_Setting »)

Mapping Intrinsic Functions: Addition

►StarCore supports scalar and vector addition
►Integer scalar, fractional scalar and fractional vector types are available
 • Note: the SOD2 instruction is recommended over the ADD2 and SUB2 instructions because it is more orthogonal and generates better-performing code

| TI Intrinsic | StarCore Intrinsic | Description | Bitwise Accuracy |
|---|---|---|---|
| _sadd2, _saddus2, _add2 | V_sod2aaii | Performs saturated addition between pairs of 16-bit values in src1 and src2. Values for src1 can be signed or unsigned (_add2 does not saturate, but overflows at the 16-bit boundary) | For saturating versions |
| _sub2, _ssub2 | V_sod2ssii | Subtracts the upper and lower halves of src2 from the upper and lower halves of src1. _ssub2 also saturates the result | For saturating versions |
| _addsub2, _saddsub2 | V_sod2aaii and V_sod2ssii | Performs _add2 or _sadd2 and _sub2 or _ssub2 in parallel | For saturating versions |

Mapping Intrinsic Functions: 32-bit Packed Addition

| TI Intrinsic | StarCore Intrinsic | Description | Bitwise Accuracy |
|---|---|---|---|
| _sadd | L_add | Adds two 32-bit values and saturates the result | Yes |
| _ssub | L_sub | Subtracts two 32-bit values and saturates the result | Yes |
| _addsub | L_add and L_sub | Performs an add and a sub in parallel (note: does not saturate results) | No |
| _saddsub | L_add and L_sub | Performs a saturated add and a saturated subtraction in parallel | Yes |

Note: Turning off saturation mode on StarCore allows L_add and L_sub to behave the same as non-saturating versions of the TI intrinsics
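To make the saturating and non-saturating 32-bit behaviors concrete, here is a minimal host-side sketch in portable C. The `model_*` names are hypothetical helpers written for illustration and are not the StarCore L_add/L_sub intrinsics themselves; they only show the arithmetic a bit-exact port has to reproduce.

```c
#include <stdint.h>

/* Saturating 32-bit add: clamp to the int32_t range instead of wrapping. */
int32_t model_sat_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* Saturating 32-bit subtract. */
int32_t model_sat_sub32(int32_t a, int32_t b)
{
    int64_t diff = (int64_t)a - (int64_t)b;
    if (diff > INT32_MAX) return INT32_MAX;
    if (diff < INT32_MIN) return INT32_MIN;
    return (int32_t)diff;
}

/* Non-saturating add: the 32-bit pattern wraps modulo 2^32.
 * Returned as uint32_t so the wrap is well defined in portable C. */
uint32_t model_wrap_add32(int32_t a, int32_t b)
{
    return (uint32_t)a + (uint32_t)b;
}
```

For example, adding 1 to INT32_MAX yields INT32_MAX in the saturating model and the bit pattern 0x80000000 in the wrapping model.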

Mapping Intrinsic Functions: 8-bit Multiplies

| TI Intrinsic | StarCore Intrinsic | Description | Bitwise Accuracy |
|---|---|---|---|
| _mpysu4 | None. Can use V_impysu2 for two multiplies. It reads two 8-bit components from the lower 16 bits of the registers, so two calls are needed for every TI source register | For each 8-bit quantity in src1 and src2, performs an 8-bit by 8-bit multiply. The four 16-bit results are packed into a long long. Results can be signed or unsigned | Yes |

    desta = V_impysu2(src1a, src2a);
    destb = V_impysu2(src1b, src2b);
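The split into two dual multiplies can be sketched on a host as follows. This is a model under stated assumptions: `model_dual_mpysu8` and `model_mpysu4` are hypothetical names, the first operand is treated as signed bytes and the second as unsigned bytes, and the way the two 16-bit products are packed into each 32-bit half is an assumption made for illustration, not a statement about the V_impysu2 result layout.

```c
#include <stdint.h>

/* Sign-extend the low 8 bits of v to 32 bits. */
static int32_t sext8(uint32_t v)
{
    v &= 0xFFu;
    return (v & 0x80u) ? (int32_t)v - 0x100 : (int32_t)v;
}

/* Hypothetical model of one dual multiply: two signed-by-unsigned
 * 8x8 products taken from the low 16 bits of each operand. */
uint32_t model_dual_mpysu8(uint32_t s, uint32_t u)
{
    int32_t p0 = sext8(s)      * (int32_t)(u & 0xFFu);
    int32_t p1 = sext8(s >> 8) * (int32_t)((u >> 8) & 0xFFu);
    return (((uint32_t)p1 & 0xFFFFu) << 16) | ((uint32_t)p0 & 0xFFFFu);
}

/* Four products built from two dual multiplies, one per 16-bit half
 * of the source registers, packed into a 64-bit result. */
uint64_t model_mpysu4(uint32_t src1, uint32_t src2)
{
    uint32_t dest_a = model_dual_mpysu8(src1 & 0xFFFFu, src2 & 0xFFFFu);
    uint32_t dest_b = model_dual_mpysu8(src1 >> 16, src2 >> 16);
    return ((uint64_t)dest_b << 32) | dest_a;
}
```

The point of the sketch is the operand split: each source register is cut into its lower and upper 16 bits, and each half feeds one dual multiply.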

Mapping Intrinsic Functions: 16-bit Multiplies

| TI Intrinsic | StarCore Intrinsic | Description | Bitwise Accuracy |
|---|---|---|---|
| _mpy, _mpyus, _mpysu | V_L_mult_ll, mpyus, mpysu | Multiplies the 16 LSBs of both operands and returns the result as signed or unsigned | No |
| _mpyu | (1) | Multiplies the 16 LSBs of both operands and returns the result as signed or unsigned | No |
| _mpyh, _mpyhus, _mpyhsu, _mpyhu | V_L_mult_hh, mpyus (2), mpysu (2), mpyuu (2) | Multiplies the 16 MSBs of both operands and returns the result as signed or unsigned | No |
| _mpyhl | V_L_mult_hl | Multiplies the 16 MSBs of the first operand by the 16 LSBs of the second operand and returns the result as signed or unsigned | No |
| _mpyhuls | mpyus (3) | As _mpyhl | No |
| _mpyhslu | mpysu (3) | As _mpyhl | No |
| _mpyhlu | (4) | As _mpyhl | Yes |

(1) (int)((unsigned short) src1) * (int)((unsigned short) src2)
(2) Use extract_h on the operands to extract the upper 16-bit word as inputs: this is equivalent to using D0.h and adds no extra instructions or cycles
(3) Use extract_h on the first operand
(4) ((int)((unsigned short) (src1>>16) * (unsigned short) (src2)))

Mapping Intrinsic Functions: 16-bit Multiplies, Continued

| TI Intrinsic | StarCore Intrinsic | Description | Bitwise Accuracy |
|---|---|---|---|
| _mpylh | V_L_mult_lh | Multiplies the 16 LSBs of the first operand by the 16 MSBs of the second operand and returns the result as signed or unsigned | No |
| _mpyluhs | | As _mpylh | N/A |
| _mpylshu | mpysu (1) | As _mpylh | No |
| _mpylhu | (2) | As _mpylh | Yes |
| _smpy | V_L_mult_ll | Multiplies the operands, left shifts the result by 1, and returns the result. If the result is 0x80000000, it is saturated to 0x7FFFFFFF | Yes |
| _smpyh | V_L_mult_hh | As _smpy | Yes |
| _smpyhl | V_L_mult_hl | As _smpy | Yes |
| _smpylh | V_L_mult_lh | As _smpy | Yes |

(1) Use extract_h on the second operand
(2) ((int)((unsigned short) (src1) * (unsigned short) (src2>>16)))
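The C expressions and the saturating fractional multiply described in these tables can be modeled on a host as shown below. This is a sketch with hypothetical `model_*` names, not the TI or StarCore intrinsics. The unsigned products use unsigned arithmetic here so that a large product such as 0xFFFF * 0xFFFF stays well defined in portable C.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of v to 32 bits. */
static int32_t sext16(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Unsigned 16 LSBs x unsigned 16 LSBs. */
uint32_t model_mpyu(uint32_t src1, uint32_t src2)
{
    return (src1 & 0xFFFFu) * (src2 & 0xFFFFu);
}

/* Unsigned 16 MSBs of src1 x unsigned 16 LSBs of src2. */
uint32_t model_mpyhlu(uint32_t src1, uint32_t src2)
{
    return (src1 >> 16) * (src2 & 0xFFFFu);
}

/* Unsigned 16 LSBs of src1 x unsigned 16 MSBs of src2. */
uint32_t model_mpylhu(uint32_t src1, uint32_t src2)
{
    return (src1 & 0xFFFFu) * (src2 >> 16);
}

/* Signed 16 LSBs multiplied, shifted left by 1, saturated:
 * the only overflow case, 0x8000 * 0x8000, becomes 0x7FFFFFFF. */
int32_t model_smpy(uint32_t src1, uint32_t src2)
{
    int64_t r = ((int64_t)sext16(src1) * sext16(src2)) * 2;
    if (r > INT32_MAX) r = INT32_MAX;
    return (int32_t)r;
}
```

A model like this is mainly useful as a reference when checking whether a ported kernel is still bit-exact with the original.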

Mapping Intrinsic Functions: Packed and Complex 16-bit Multiplies

| TI Intrinsic | StarCore Intrinsic (1) | Description | Bitwise Accuracy |
|---|---|---|---|
| _mpy2 | V_L_mpy2 | Returns the products of the lower and higher 16-bit values of the two operands | No |
| _smpy2 | V_L_mpy2 | Returns the products of the lower and higher 16-bit values of the two operands with an additional 1-bit left shift and saturation on each result | Yes |
| _smpy2ll | V_L_mpy2 | As _smpy2 | Yes |
| _cmpy | C_D_mpyre_ll, C_D_mpyim_ll, C_D_mpyre_m, C_D_mpyim_m, C_D_mpyre_hh | Returns the complex multiplication of two packed 32-bit values as a 64-bit value | Yes |
| _cmpyr | | Returns the complex multiplication of two packed 32-bit values as a 32-bit value, rounded | N/A |
| _cmpyr1 | L_Round of L_mpyre, L_Round of L_mpyim, V_pack_2fr | Returns the complex multiplication of two packed 32-bit values as a 32-bit value, rounded | Yes |

(1) SC3400 support for these functions is not the same as SC3850; SC3850 equivalents are shown

Mapping Intrinsic Functions: 16 x 32 Mixed Precision Multiplies and 32-bit Precision Multiplies

| TI Intrinsic | StarCore Intrinsic (1) | Description | Bitwise Accuracy |
|---|---|---|---|
| _mpyhir | dmpy (use extract_h on first operand) | Produces a signed 16 by 32 multiply. The result is shifted right by 15 bits; the upper 16 bits of src1 are used | No |
| _mpylir | dmpy (use extract_l on first operand) | Produces a signed 16 by 32 multiply. The result is shifted right by 15 bits; the lower 16 bits of src1 are used | No |
| _mpy32 | src1*src2 | Returns 64 bits of a 32 by 32 multiply | No |
| _smpy32 | src1*src2 | Returns 64 bits of a 32 by 32 multiply shifted left by 1 and saturated | No |

(1) SC3400 support for these functions is not the same as SC3850; SC3850 equivalents are shown

Mapping Intrinsic Functions: Galois Field Multiply and Dot Products

| TI Intrinsic | StarCore Intrinsic (1) | Description | Bitwise Accuracy |
|---|---|---|---|
| _gmpy4 | None | Performs a Galois (or finite) field multiply on four values in src1 with four parallel values in src2. The four products are packed into the return value | N/A |
| _gmpy | None | Performs a Galois field multiply | N/A |
| _dotp2 | L_mpyd | The product of the signed lower 16-bit values of the two operands added to the product of the signed upper 16-bit values of the two operands | No |
| _ddotp4 | L_mpyd and L_mpyd | Performs two _dotp2 operations at the same time | No |

(1) SC3400 support for these functions is not the same as SC3850; SC3850 equivalents are shown

Mapping Intrinsic Functions: Absolute Value and Bitwise Operations

| TI Intrinsic | StarCore Intrinsic | Description | Bitwise Accuracy |
|---|---|---|---|
| _abs2 | V_abs2 | Absolute value of two packed 16-bit values | Yes |
| _abs | L_abs | Saturated absolute value of input | Yes |
| _subabs4 | V_sad4 | Absolute value of differences for each pair of packed 8-bit values | Yes |
| _ext | (1) | Extracts the specified field from src2, sign extended to 32 bits | Yes |
| _extu | (2) | Extracts the specified field from src2, zero extended to 32 bits | Yes |
| _extr | (3) | Extracts the specified field from src2, sign extended to 32 bits | Yes |
| _extur | (4) | Extracts the specified field from src2, zero extended to 32 bits | Yes |
| _xpnd4 | Use a 16-entry LUT | Bits 3 through 0 of the input are replicated to bytes 3 through 0 of the result | Yes |
| _xpnd2 | | Bits 1 and 0 of src are replicated to the upper and lower halfwords of the result | N/A |
| _deal | _bdeintrlv | Bit de-interleave for 32-bit data | Yes |
| _bitr | _brev | Bit reverse for 32-bit data | Yes |

(1) (((signed int)(src2) << (csta)) >> (cstb))
(2) (((unsigned int)(src2) << (csta)) >> (cstb))
(3) (((signed int)(src2) << ((src1 >> 5) & 0x1F)) >> (src1 & 0x1F))
(4) (((unsigned int)(src2) << ((src1 >> 5) & 0x1F)) >> (src1 & 0x1F))
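The shift-based field extraction expressions can be wrapped in small helper functions, sketched here in portable C. The `model_*` names are hypothetical. The signed variants assume that right-shifting a negative `int32_t` is an arithmetic shift, which holds on mainstream compilers but is implementation-defined in ISO C, and all shift counts are assumed to be in the range 0 to 31.

```c
#include <stdint.h>

/* Zero-extended field: shift left by csta, then logical shift right by cstb. */
uint32_t model_extu(uint32_t src2, unsigned csta, unsigned cstb)
{
    return (src2 << csta) >> cstb;
}

/* Sign-extended field: same shifts, but the right shift is arithmetic
 * (assumed behavior of >> on a negative int32_t). */
int32_t model_ext(uint32_t src2, unsigned csta, unsigned cstb)
{
    int32_t shifted = (int32_t)(src2 << csta);
    return shifted >> cstb;
}

/* Register forms: csta is taken from bits 9..5 of src1, cstb from bits 4..0. */
int32_t model_extr(uint32_t src2, uint32_t src1)
{
    return model_ext(src2, (src1 >> 5) & 0x1Fu, src1 & 0x1Fu);
}

uint32_t model_extur(uint32_t src2, uint32_t src1)
{
    return model_extu(src2, (src1 >> 5) & 0x1Fu, src1 & 0x1Fu);
}
```

For instance, extracting bits 23..16 of 0x12F45678 with csta = 8 and cstb = 24 gives 0xF4 zero extended, or -12 sign extended.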

Mapping Intrinsic Functions: Average, Normalization, Comparison, Interrupt Control and Shifts

| TI Intrinsic | StarCore Intrinsic | Description | Bitwise Accuracy |
|---|---|---|---|
| _avg2 | | Calculates the average for each pair of signed 16-bit values | N/A |
| _avgu4 | V_avgu4 | Calculates the average for each pair of unsigned 8-bit values | Yes |
| _norm | norm_l | Returns the number of bits up to the first nonredundant sign bit | No |
| _lnorm | norm_l | Returns the number of bits up to the first nonredundant sign bit | No |
| _max2 | V_max2 | Places the larger of each pair of values in the return value | Yes |
| _min2 | min2 | Places the smaller of each pair of values in the return value | Yes |
| _disable_interrupts | di | Disables interrupts | N/A |
| _enable_interrupts | ei | Enables interrupts | N/A |
| _shr2 | V_asrr2 | Shifts each 16-bit quantity in src2 right by src1 bits, either arithmetically or logically | Yes |
| _sshvl | L_shl | Shifts src2 left by src1 bits and saturates the result | No |
| _sshvr | L_shr | Shifts src2 right by src1 bits and saturates the result | No |
| _shfl3 | | Takes two 16-bit values from src1 and the 16 LSBs from src2 to perform a 3-way interleave, creating a 48-bit result | N/A |

Data Types: Using “double” for Data Moves

C64x+ code:

    for (z = 0; z < MAX2; z++)
    {
        *uz_p = _itod(_pack2(_sshvr(_dotpn2(_hi(*z_p), uf), s),
                             _sshvr(_dotp2(_hi(*z_p), ufInv), s)),
                      _pack2(_sshvr(_dotpn2(_lo(*z_p), uf), s),
                             _sshvr(_dotp2(_lo(*z_p), ufInv), s)));
        uz_p++;
        z_p++;
    }

Porting notes for this case:
 • double _itod(uint src2, uint src1) can be ported as Word64 D_set(Word32 left, UWord32 right)
 • uint _hi(double src) can be ported to Word32 D_get_msb(Word64 Val)
 • uint _lo(double src) can be ported to Word32 D_get_lsb(Word64 Val)

►In this case double is used for 64-bit data moves rather than for floating-point arithmetic, so it can be ported to Word64.
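The idea of using a 64-bit container purely to move two 32-bit halves can be sketched on a host with `int64_t`. The `model_d_*` functions below are hypothetical stand-ins that mirror the prototypes quoted on this slide; they are not the StarCore implementations, and the conversions back to signed types assume the usual two's-complement behavior.

```c
#include <stdint.h>

typedef int64_t model_word64;   /* hypothetical stand-in for a 64-bit container */

/* Pack two 32-bit halves: left goes to the upper half, right to the lower. */
model_word64 model_d_set(int32_t left, uint32_t right)
{
    uint64_t bits = ((uint64_t)(uint32_t)left << 32) | right;
    return (model_word64)bits;
}

/* Read back the upper 32 bits. */
int32_t model_d_get_msb(model_word64 val)
{
    return (int32_t)(uint32_t)((uint64_t)val >> 32);
}

/* Read back the lower 32 bits. */
int32_t model_d_get_lsb(model_word64 val)
{
    return (int32_t)(uint32_t)((uint64_t)val & 0xFFFFFFFFu);
}
```

No floating-point operation is involved at any point, which is why a `double` used this way on the TI side does not need floating-point support after porting.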

Data Types: « double » for Floating Point Operations

►In cases where « double » type is used for floating-point arithmetic, there is no need for intrinsics

| C64x+ | SC3850 |
|---|---|
| double a, b, c; | double a, b, c; |
| c = _mpyd(a, b); | c = a * b; |
| (102 cycles) | (101 cycles) |

► TI’s intrinsics call a runtime library for floating point
► StarCore’s compiler calls the equivalent runtime library based on data type

Data Types: Handling « long » Data

► TI’s « long » data type is 40 bits (integer)
► No native support for a 40-bit data type in the StarCore compiler
► The 40-bit data type is supported through intrinsics; the type is called Word40

| C64x+ | SC3850 |
|---|---|
| long a, b, c; | Word40 a, b, c; |
| c = a + b; | c = X_add(a,b); |

► « unsigned long » type is mapped to UWord40
► Word40 and UWord40 are fractional types
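What 40-bit integer arithmetic means for a port can be modeled on a host with a 64-bit type that is wrapped back to 40 bits after each operation. This is a sketch with hypothetical `model_*` names; it models only the wrap-around of a 40-bit signed integer and makes no claim about how X_add treats overflow or saturation.

```c
#include <stdint.h>

/* Sign-extend the low 40 bits of v to 64 bits. */
int64_t model_sext40(uint64_t v)
{
    v &= 0xFFFFFFFFFFull;
    return (v & 0x8000000000ull) ? (int64_t)v - ((int64_t)1 << 40) : (int64_t)v;
}

/* 40-bit wrapping add: carries out of bit 31 are kept,
 * carries out of bit 39 are discarded. */
int64_t model_add40(int64_t a, int64_t b)
{
    return model_sext40((uint64_t)a + (uint64_t)b);
}
```

The practical consequence is that a sum such as 0x7FFFFFFF + 1 does not overflow in a 40-bit type, so code that relied on the extra guard bits needs a 40-bit type after porting, while code that only ever used 32 bits can move to the 32-bit intrinsics.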

Intrinsic Function Mapping Header File: Port64xplustoSC3850.h

►Fast way to take a TI-based DSP kernel and build it for SC3850
►Supports many of the most common intrinsics and library functions
►Currently at Version 1.5
►Packaged separately from Compiler and IDE
►Implementation maps TI intrinsics to an equivalent functionality on StarCore
 • 1:1 mapping (50): identical performance
   - Typically maps to a single Freescale SC3850 compiler intrinsic or single assembly instruction
 • 1:multiple mapping (25)
   - Typically maps to several SC3850 intrinsics and low-level C code
 • 1:code (49)
   - Maps to inline assembly or low-level C code

Intrinsic Function Mapping: 1-to-1 Mapping Example

C64x+:

    int a, b, c;
    c = _sadd(a, b);

SC3850:

    #include
    Word32 a, b, c;
    c = L_add(a,b);

1-to-Multiple Mapping Example

C64x+:

    int a, b;
    long long c;
    c = _saddsub2(a, b);

SC3850:

    #include
    Word32 a, b;
    Word64 c;
    c = D_set(V_add2(a,b),V_sub2(a,b));

Intrinsic Function Mapping: 1-to-Multiple Mapping Example: Low-level C Implementation

►_swap4 performs a byte endian swap within each 16-bit word

C64x+:

    unsigned int a, b;
    b = _swap4(a);

SC3850:

    #include
    static unsigned int _swap4(unsigned src)
    {
    #pragma inline

        /* Split the 32-bit input into its four bytes */
        unsigned char byte_0 = (src >> 0) & 0x000000FF;
        unsigned char byte_1 = (src >> 8) & 0x000000FF;
        unsigned char byte_2 = (src >> 16) & 0x000000FF;
        unsigned char byte_3 = (src >> 24) & 0x000000FF;

        /* Swap the two bytes inside each 16-bit half */
        unsigned int swap_byte_0 = (byte_0 << 8) & 0x0000FF00;
        unsigned int swap_byte_1 = (byte_1 << 0) & 0x000000FF;
        unsigned int swap_byte_2 = ((unsigned int)byte_2 << 24) & 0xFF000000;
        unsigned int swap_byte_3 = (byte_3 << 16) & 0x00FF0000;

        return (unsigned int) (swap_byte_0 | swap_byte_1 | swap_byte_2 | swap_byte_3);
    }

Intrinsic Function Mapping: 1-to-Multiple Mapping Example: Assembly Implementation

►_mem2 loads an unaligned 16-bit word in the low part of a register

C64x+:

    short *ptr;
    unsigned short a;
    a = _mem2(ptr);

SC3850:

    static asm unsigned short _mem2(void * ptr)
    {
    asm_header
        .arg _ptr in $r0;
        return in $d0;
        .reg $d0,$d1,$r0;
    asm_body
        moveu.b (r0)+,d0
        [
        moveu.b (r0),d1
        asll #8,d0
        ]
        or d1,d0
        zxt.w d0,d0
    asm_end
    }
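The byte-wise approach taken by the assembly listing can also be written in portable C, which is handy for checking results on a host. The function name `model_mem2` is hypothetical. It mirrors the listing above, where the first byte read is shifted left by 8 and combined with the second byte.

```c
#include <stdint.h>

/* Load a 16-bit value from an address with no alignment requirement
 * by reading two bytes and combining them (first byte is the upper half,
 * matching the shift in the assembly listing). */
uint16_t model_mem2(const void *ptr)
{
    const uint8_t *p = (const uint8_t *)ptr;
    return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}
```

Because only byte accesses are used, the pointer may be odd, which is exactly the case an aligned 16-bit load cannot handle.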


Development Tools: Compiler Optimizations

►Both TI and StarCore have similar optimization settings
►Both allow multiple levels of code optimization from none to maximum settings
 • TI from -O0 to -O5
 • StarCore from -O0 to -O4
►Both allow for cross-file optimizations
 • TI uses -pm
 • StarCore uses -Og
►Both have options for reduced code size
 • TI uses -ms
 • StarCore uses -Os
►Also note that on StarCore, release and debug builds with the same settings generate identical binary executables (release does not generally have debug symbols)
 • Performance of code execution is not impacted by making a debug build

Development Tools: Pragmas

► Pragma directives tell the compiler how to treat a certain function, object, or section of code
► One of the main differences is that StarCore pragmas must be placed inside the function or loop they refer to, whereas the TI compiler expects them right before that function or loop

Development Tools: Pragma Porting (1/3)

| C64x+ pragma | Description | SC3850 pragma |
|---|---|---|
| #pragma CODE_SECTION (symbol, "section name"); | Allocates space for the symbol in a section named section name | No equivalent pragma |
| #pragma DATA_ALIGN (symbol, constant); | Aligns the symbol to an alignment boundary | #pragma align symbol constant (identical) |
| #pragma DATA_MEM_BANK (symbol, constant); | Aligns a symbol to a specified internal data memory bank boundary (banks 0 to 7) | No equivalent pragma |
| #pragma DATA_SECTION (symbol, "section name"); | Allocates space for the symbol in a section named section name | No equivalent pragma |
| #pragma FUNC_ALWAYS_INLINE (function); | Always inline function | #pragma inline (to be placed inside function) |
| #pragma FUNC_CANNOT_INLINE (function); | Never inline function | #pragma noinline (to be placed inside function) |
| #pragma FUNC_EXT_CALLED (function); | Do not remove function not called from main | No equivalent pragma |
| #pragma FUNC_INTERRUPT_THRESHOLD (function, threshold); | Allows interrupts to be disabled around software pipelined loops for threshold cycles | No equivalent pragma |

Development Tools: Pragma Porting (2/3)

| C64x+ pragma | Description | SC3850 pragma |
|---|---|---|
| #pragma FUNC_IS_PURE (function); | Function has no side effects | No equivalent pragma |
| #pragma FUNC_IS_SYSTEM (function); | Function has the behaviour defined by the ANSI/ISO standard for a function with that name | No equivalent pragma |
| #pragma FUNC_NEVER_RETURNS (function); | Function never returns to its caller | #pragma fct_never_return function, or #pragma never_return placed inside function |
| #pragma FUNC_NO_GLOBAL_ASG (function); | The function makes no assignments to named global variables and contains no asm statement | No equivalent pragma |
| #pragma FUNC_NO_IND_ASG (function); | The function makes no assignments through pointers and contains no asm statement | No equivalent pragma |
| #pragma INTERRUPT (function); | Defines a C interrupt function | #pragma interrupt function (identical) |
| #pragma MUST_ITERATE (min, max, multiple); | Specifies the number of times a loop will iterate | #pragma loop_count (min, max, modulo, remainder), to be placed after the loop starter. Modulo and remainder give more granularity than multiple |

Development Tools: Pragma Porting (3/3)

| C64x+ pragma | Description | SC3850 pragma |
|---|---|---|
| #pragma NMI_INTERRUPT (function); | Defines a C non-maskable interrupt function | No equivalent pragma |
| #pragma NO_HOOKS (function); | Prevents entry and exit hooks from being generated for function | No equivalent pragma |
| #pragma PROB_ITERATE (min, max); | Specifies the number of times a loop might iterate | No equivalent pragma |
| #pragma STRUCT_ALIGN (type, constant expression); | Similar to DATA_ALIGN but for structures, typedefs etc. | #pragma min_struct_align min sets the minimum alignment for structures inside the file. No equivalent for typedefs or other expressions |
| #pragma UNROLL (n); | Specifies how many times a loop should be unrolled | #pragma loop_unroll n (to be placed inside loop) |
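The placement difference can be seen in a single function that carries both styles. This is a sketch only: `BUILD_FOR_C64X` and `BUILD_FOR_STARCORE` are hypothetical build macros introduced for illustration, the pragma spellings are copied from the tables above, and the loop bounds are example values. With neither macro defined the function compiles as plain C on a host.

```c
/* Sum n 16-bit samples; n is assumed to be a multiple of 8 between 8 and 64. */
int sum_samples(const short *x, int n)
{
    int i;
    int acc = 0;

#if defined(BUILD_FOR_C64X)
    /* TI style: the pragma sits right before the loop it describes */
#pragma MUST_ITERATE (8, 64, 8);
#endif
    for (i = 0; i < n; i++) {
#if defined(BUILD_FOR_STARCORE)
        /* StarCore style: the pragma sits inside the loop, after the loop starter */
#pragma loop_count (8, 64, 8, 0)
#endif
        acc += x[i];
    }
    return acc;
}
```

Keeping both forms behind build macros lets one source file serve both toolchains while the port is in progress.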

Development Tools: Mapping Keywords

►There are several keywords used on the TI platform which are not defined within StarCore tools
►It is recommended to use the following conversions instead:

| TI Keyword | Recommended Conversion |
|---|---|
| interrupt | #define interrupt //ignored |
| inline | Use #pragma inline inside function body to force inlining |
| near | #define near //ignored |
| far | #define far //ignored |
| cregister | #define cregister //ignored |

Measuring Performance: In Code

► Both architectures provide multiple simulator models for functional verification and profiling purposes
 • TI uses a CAS model (Cycle Accurate Simulator) with varying levels of system modeling of surrounding memory behaviors (core module to entire system)
 • StarCore uses a PACC model (Performance Accurate) which always models caches and memory latencies
► Both simulator models provide methods for in-code profiling as well as profiling through a development tool

In Code Profiling on TI architectures:

    #include
    /* calculate overhead of clock () */
    overhead = clock ();
    overhead = clock () - overhead;
    ...
    cycles = clock ();
    /* call the kernel */
    value = kernel (array1,output,j);
    /* calculate cycles */
    cycles = clock () - cycles - overhead;
    …

In Code Profiling on StarCore architectures (OCE method*):

    #include
    …
    /* Init OCE Performance counter */
    overhead = InitEonce((unsigned int) &kernel);
    …
    /* call the kernel */
    value = kernel (array1,output,j);
    /* Get the cycle count */
    cycles = ReadCountEonce() - overhead;
    …

In Code Profiling on StarCore architectures (DPU method*):

    #include
    …
    /* Init DPU Performance counters */
    overhead = InitDPU((unsigned int) &kernel);
    …
    cycles = ReadCountDPU();
    /* call the kernel */
    value = kernel (array1,output,j);
    /* Get the cycle count */
    cycles = ReadCountDPU() - cycles - overhead;
    …

*Both OCE and DPU inline profiling methods work on silicon in the same manner as on the simulator
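The measure-overhead-then-subtract pattern used by all three methods can be tried on a host with the standard `clock()` function. This is a sketch under stated assumptions: `kernel`, `profile_kernel` and `profile_result` are hypothetical names, and on a host `clock()` reports processor time in implementation-defined ticks rather than core cycles, so the number is only meaningful as an illustration of the pattern.

```c
#include <time.h>

typedef struct {
    long value;   /* result returned by the kernel */
    long ticks;   /* clock() ticks attributed to the kernel call */
} profile_result;

/* Hypothetical kernel under test: sums n integers. */
static long kernel(const int *in, int n)
{
    long acc = 0;
    int i;
    for (i = 0; i < n; i++)
        acc += in[i];
    return acc;
}

profile_result profile_kernel(const int *in, int n)
{
    profile_result r;
    clock_t overhead, cycles;

    /* Measure the cost of the timing call itself */
    overhead = clock();
    overhead = clock() - overhead;

    /* Time the kernel and subtract the measurement overhead */
    cycles = clock();
    r.value = kernel(in, n);
    cycles = clock() - cycles - overhead;

    r.ticks = (long)cycles;
    return r;
}
```

On a short kernel the host tick count will often be zero, since the host timer is far coarser than a cycle counter.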

Measuring Performance: Using IDE

►Both architectures offer development environments with profiling capability built into them
►For StarCore, the following profiling and trace targets are applicable:
  • SC3x50 PACC simulator model
    - Provides highest detail on core stall, memory stall, cache behavior and overall cycle visibility
    - Provides trace of every instruction for entire simulation run along with timing behavior for each instruction
    - Models HW OCE registers
  • MSC8156 ADS hardware (and other 8156 hardware)
    - Supports on chip VTB (virtual trace buffer), DPU and OCE capabilities
    - Profiling has resolution of change of flow
    - Trace mode disables profiling capability (not supported by hardware)

Agenda

►Overview of Porting Code ►Core Architecture Comparisons ►Data Types and Intrinsic Functions ►Development Tools ►Performance and Optimizations

Performance Optimization

►There are a number of things that can be done to optimize performance after porting is complete. In addition to the standard bit-wise accurate versus performance model described previously, there are also special situations or computational patterns that can currently only be resolved by hand optimization.
►Items that are found when performing a DSP kernel optimization can be broken down into the following categories:
  • Pointer aliasing
  • Data alignment
  • Loops
  • Conversion of math operations to StarCore native math
  • Conversion of data manipulation to StarCore native data manipulation
  • Case and if-then-else ladders
  • Data accesses
  • Better mousetraps
►Not every DSP kernel needs all of these types of optimization, but most contain at least several of them.

Porting Example: Complex Multiplication

► Complex multiplication is a good example of a DSP kernel which is frequently ported from a TI version to a StarCore one. In this case, the TI version uses no intrinsics and is just naturally written C code.

int complex_mult(short* coef, short* input, short* result, int n) { int i; for(i=0;i

► This is a very simple example, but it highlights some of what we want to do to optimize the code. When performing 512 complex multiplies, the TI simulator reports this code to run in 1550 cycles, not including core stalls and cache misses. This implies an average of ~3 cycles per complex multiply performed (1550 / 512 ≈ 3.03).
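The loop body of the listing above did not survive extraction. As a rough illustration only, a natural-C complex multiply of this shape might look like the sketch below. The name complex_mult_sketch, the interleaved {real, imaginary} sample layout and the plain integer arithmetic (no fractional scaling or saturation) are assumptions, not the original source.

```c
#include <assert.h>

/* Sketch only: assumes n complex samples stored as interleaved
   {re, im} 16-bit pairs and plain integer arithmetic. */
int complex_mult_sketch(const short *coef, const short *input,
                        short *result, int n)
{
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++) {
        int cr = coef[2 * i],  ci = coef[2 * i + 1];
        int xr = input[2 * i], xi = input[2 * i + 1];

        /* (cr + j*ci) * (xr + j*xi) */
        result[2 * i]     = (short)(cr * xr - ci * xi); /* real part */
        result[2 * i + 1] = (short)(cr * xi + ci * xr); /* imaginary part */
    }
    return n;
}
```

Each output sample needs four multiplies and two add/subtract operations, which is the work the later optimization passes try to pack into fewer cycles.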

Porting Code Example: Complex Multiply, Create a Test Harness

► This is a straightforward task. Simply add the function to a StarCore Stationery project in CodeWarrior. Create an input vector, a coefficient vector, a result vector and an output reference vector to check correctness.
► It is recommended to include test vectors from external files, so that the compiler cannot replace the algorithmic code with precomputed results:

    Word16 Coeff[Nh]= {
    #include "../vectors/coeff.dat"
    };
    Word16 Input[2*Nh]= {
    #include "../vectors/test_in_80.dat"
    };
    Word16 Output_ref[2*Nr]= {
    #include "../vectors/output_80.ref.dat"
    };

► Unless this test case is run for a significant number of iterations, it is also recommended to benchmark this in a warm cache mode on the StarCore simulator.
  • All profiling is done on the StarCore simulator with a fully modeled memory and core subsystem.
  • Also add a #pragma noinline to prevent the StarCore compiler from automatically inlining the function under test into the test harness
► To perform a warm cache profile, first call the complex multiply function; then call it a second time and measure performance around the second call. This is more representative of how the algorithm will run in a cached environment.
► The call can look something like the following:

    #ifdef WARM_CACHE
    complex_mult(a,b,c,N);
    #endif
    InitEonce((unsigned long int )&complex_mult);
    complex_mult(a,b,c,N);
    cycle_count = ReadCountEonce() - cycle_counter_overhead;
    printf("complex_mult C results: cycle count=%d\n", cycle_count);
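The warm-cache procedure described above can be sketched in portable C as follows. This is a host-side illustration that uses the standard clock() timer in place of the OCE/DPU counters; measure_warm, kernel_fn and demo_kernel are hypothetical names introduced here, and clock() ticks are not core cycles.

```c
#include <assert.h>
#include <time.h>

typedef int (*kernel_fn)(const short *, const short *, short *, int);

/* Stand-in kernel for illustration only: copies a to c and ignores b. */
int demo_kernel(const short *a, const short *b, short *c, int n)
{
    int i;

    (void)b;
    for (i = 0; i < n; i++)
        c[i] = a[i];
    return n;
}

/* Returns the clock() ticks spent in the second (warm) call to fn. */
long measure_warm(kernel_fn fn, const short *a, const short *b,
                  short *c, int n)
{
    clock_t overhead, start, stop;
    long ticks;

    assert(fn != 0);

    /* Calibrate the cost of reading the timer itself. */
    overhead = clock();
    overhead = clock() - overhead;

    (void)fn(a, b, c, n);   /* first call: brings code and data into the caches */

    start = clock();
    (void)fn(a, b, c, n);   /* second call: the one that is measured */
    stop = clock();

    ticks = (long)(stop - start) - (long)overhead;
    return ticks < 0 ? 0 : ticks;
}
```

Calling measure_warm(demo_kernel, a, b, c, n) returns the tick count for the warm run; dropping the first call to fn would give the cold-cache figure instead.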

Porting Code Example: Complex Multiply

► Pointer aliasing
  • We can see the input data, coefficient and output vectors do not alias in memory, so adding the restrict keyword helps the compiler use packed data moves.
  • Add “restrict” to pointers which do not alias
  • Note that enabling global (cross file) optimization also allows the compiler to see that the data vectors do not alias
► Data Alignment
  • We can also set input, coefficient, and output vectors to have an 8-byte alignment
    - Use #pragma align or __attribute__((aligned(8))) in the vector definition

int complex_mult(short* restrict coef, short* restrict input, short* restrict result, int n) { int i; for(i=0;i
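Since the loop body is truncated in the listing above, the sketch below shows one way the two hints (restrict and 8-byte alignment) could be combined in standard C. It assumes a C11 compiler, where _Alignas stands in for the toolchain-specific #pragma align; N_CPLX, the vector names, is_aligned8 and complex_mult_restrict are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define N_CPLX 4  /* hypothetical vector length, in complex samples */

/* C11 _Alignas stands in for the toolchain-specific alignment pragma. */
_Alignas(8) short coef_vec[2 * N_CPLX];
_Alignas(8) short input_vec[2 * N_CPLX];
_Alignas(8) short result_vec[2 * N_CPLX];

/* True when p sits on an 8-byte boundary. */
int is_aligned8(const void *p)
{
    return ((uintptr_t)p & 7u) == 0;
}

/* restrict promises the compiler that the three vectors never overlap,
   so loads and stores may be reordered and combined freely. */
int complex_mult_restrict(const short *restrict coef,
                          const short *restrict input,
                          short *restrict result, int n)
{
    int i;

    assert(is_aligned8(coef) && is_aligned8(input) && is_aligned8(result));
    for (i = 0; i < n; i++) {
        int cr = coef[2 * i],  ci = coef[2 * i + 1];
        int xr = input[2 * i], xi = input[2 * i + 1];

        result[2 * i]     = (short)(cr * xr - ci * xi);
        result[2 * i + 1] = (short)(cr * xi + ci * xr);
    }
    return n;
}
```

The assert documents the alignment contract in debug builds; it does not by itself tell an optimizer anything.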

Porting Code Example: Complex Multiply

► The inner loop generated has 8 cycles, performs 4 complex multiplies and looks like the following:

    LOOPSTART3
    L93
    DW35
    [
        move.4w (r2)+n3,d4:d5:d6:d7       ;[90,2]
        move.4w (r4)+n3,d8:d9:d10:d11     ;[90,2]
    ]
    DW36
    [
        impy d9,d4,d3                     ;[91,2]
        impy d8,d4,d0                     ;[90,2]
        impy d10,d6,d4                    ;[90,2]
        impy d11,d6,d6                    ;[91,2]
        move.4w (r1)+n3,d12:d13:d14:d15   ;[90,2]
    ]
    DW37
    [
        sxt.w d3,d1                       ;[91,2]
        sxt.w d4,d2                       ;[90,2]
        sxt.w d6,d3                       ;[91,2]
        sxt.w d0,d0                       ;[90,2]
    ]
    DW38
    [
        imac d10,d7,d3                    ;[91,2]
        imac -d11,d7,d2                   ;[90,2]
        imac -d9,d5,d0                    ;[90,2]
        imac d8,d5,d1                     ;[91,2]
        move.4w (r5)+n3,d4:d5:d6:d7       ;[90,2]
    ]
    DW39
    [
        impy d12,d4,d11                   ;[90,2]
        impy d15,d6,d10                   ;[91,2]
        impy d14,d6,d9                    ;[90,2]
        impy d13,d4,d8                    ;[91,2]
        move.4w d0:d1:d2:d3,(r3)+n3       ;[90,2]
    ]
    DW40
    [
        sxt.w d11,d0                      ;[90,2]
        sxt.w d8,d1                       ;[91,2]
        sxt.w d9,d2                       ;[90,2]
        sxt.w d10,d3                      ;[91,2]
    ]
    DW41
    [
        imac d12,d5,d1                    ;[91,2]
        imac d14,d7,d3                    ;[91,2]
        imac -d15,d7,d2                   ;[90,2]
        imac -d13,d5,d0                   ;[90,2]
    ]
    DW42
        move.4w d0:d1:d2:d3,(r0)+n3       ;[0,2]
    LOOPEND3

Porting Code Example: Complex Multiply

► Loops
  • We can unroll this loop and perform 4 complex multiplies in a single iteration
► Conversion of math operations to StarCore native math
  • We can use StarCore packed multiply …

    int complex_mult(int *restrict coef, int *restrict input, int *restrict result, int n)
    {
        int i;
        int tempI1,tempQ1, tempI2,tempQ2, tempI3,tempQ3, tempI4,tempQ4;

        for(i=0;i
        …
            writer_4f((short*)&result[i], tempI1, tempQ1, tempI2, tempQ2);
            writer_4f((short*)&result[i+2], tempI3, tempQ3, tempI4, tempQ4);
        }
        return n;
    }

► Now the generated code is very efficient. The whole function executes in 398 cycles for 512 complex multiplies, which is about 0.78 cycles per complex multiply overall (0.75 in the steady-state inner loop).

Porting Code Example: Complex Multiply

►The inner loop generated has 3 cycles
►It performs 4 complex multiplies
►It also shows the algorithm is now limited by how much data can be moved in a single cycle

    LOOPSTART3
    L56
    DW60 TYPE debugsymbol
    [
        mpyre d1,d5,d2                  ;[133,2] 3%=1 [0]
        mpyim d1,d5,d3                  ;[134,2] 3%=1 [0]
        mpyim d0,d4,d1                  ;[132,2] 3%=1 [0]
        mpyre d0,d4,d0                  ;[131,2] 3%=1 [0]
        move.2l (r3)+n3,d6:d7           ;[127,2] 0%=0 [1]
        move.2l (r4)+n3,d4:d5           ;[127,2] 0%=0 [1]
    ]
    DW61 TYPE debugsymbol
    [
        mpyre d4,d6,d0                  ;[127,2] 1%=0 [1]
        mpyim d4,d6,d1                  ;[128,2] 1%=0 [1]
        mpyre d5,d7,d2                  ;[129,2] 1%=0 [1]
        mpyim d5,d7,d3                  ;[130,2] 1%=0 [1]
        mover.4f d0:d1:d2:d3,(r5)+n3    ;[137,2] 4%=1 [0]
        move.2l (r0)+n3,d4:d5           ;[131,2] 1%=0 [1]
    ]
    DW62 TYPE debugsymbol
    [
        mover.4f d0:d1:d2:d3,(r2)+n3    ;[136,2] 2%=0 [1]
        move.2l (r1)+n3,d0:d1           ;[131,2] 2%=0 [1]
    ]
    LOOPEND3

Porting Code Example: LTE Kernel Resource Mapping

►In this case we had a DSP kernel already written for a TI architecture
  • Test harness and input/reference vectors were already completed
  • On the TI simulator, which models core stalls and cache behavior, this benchmarked in an existing test harness at 82k cycles
►Phase 0: Getting the test harness ready
  • First, we created a new StarCore project for the 3850 PACC
    - Not necessary to run a single DSP kernel in multicore mode!
    - PACC profiler gives extensive details on code performance!
  • Second, we added the test harness and DSP kernel source to the project, and removed the existing “Hello World” source code
  • Third, we added the TI-to-StarCore porting header file, which maps TI intrinsic functions to StarCore functions, at the top of all source files that use TI intrinsic functions
  • Fourth, we built it and ran it on the PACC simulator to see where the time was spent
►In this example initial performance results were 670k cycles!
  • Using the porting header file will not yield good performance in general
    - Has inline ASM that blocks the optimizer
    - Has SR mode switches throughout that cause pipeline flushes
  • It is designed to allow minimal code changes to get a kernel up and running without needing to start with an extensive kernel re-write

Porting Code Example: LTE Kernel Resource Mapping

► First thing we ask is, where are my cycles being spent? ► CodeWarrior’s PACC profiler makes answering this easy! • Just use the Critical Code Analysis view of the profiling results (look for the green bars!)

Porting Code Example: LTE Kernel Resource Mapping, Pass 1

►Looking for aliasing in the test code, we found heavy use of the keyword restrict already
  • Note that using my_avar[restrict] on TI architectures has no corresponding meaning on StarCore
►Data alignment was also already optimal
►It was clear from the profiling data that the majority of cycle time was spent inside four loops
  • Each of these four loops performed multiple complex multiplies and saturated complex (dual) additions
  • These loops accounted for the majority of cycles in the function being optimized
  • TI-based intrinsic functions _cmpy and _sadd2 were used
  • The intrinsic mappings for these were not optimal and it was reflected in the results
    - _sadd2 intrinsic sets saturation mode and clears it on each call!
  • These loops were re-written using native StarCore intrinsic math functions and native StarCore intrinsic data manipulation functions

Porting Code Example: LTE Kernel Resource Mapping, Pass 1

Original TI-based code:

    *(iq_ptrA + index[counter]) = _sadd2(_cmpy(ant00,*symbol_ptr),
                                         _cmpy(ant01,*symbol_ptr2));
    *(iq_ptrB + index[counter]) = _sadd2(_cmpy(cdd[0][index2 + index[counter]],
                                               _cmpy(ant10, *symbol_ptr++)),
                                         _cmpy(cdd[0][index2 + index[counter]],
                                               _cmpy(ant11, *symbol_ptr2++)));

Rewritten as StarCore-based code:

    tt1 = L_mpyre(ant00,*symbol_ptr);
    tt2 = L_mpyim(ant00,*symbol_ptr);
    tt1 = L_macre(tt1, ant01, *symbol_ptr2);
    tt2 = L_macim(tt2, ant01, *symbol_ptr2);
    tt5 = L_mpyre(ant10, *symbol_ptr);
    tt6 = L_mpyim(ant10, *symbol_ptr++);
    tt3 = L_mpyre(ant11, *symbol_ptr2);
    tt4 = L_mpyim(ant11, *symbol_ptr2++);
    t3 = V_pack_2fr(tt5, tt6);
    t4 = V_pack_2fr(tt3, tt4);
    tt3 = L_mpyre(cdd[0][index2 + index[counter]],t3);
    tt4 = L_mpyim(cdd[0][index2 + index[counter]],t3);
    tt3 = L_macre(tt3, cdd[0][index2 + index[counter]],t4);
    tt4 = L_macim(tt4, cdd[0][index2 + index[counter]],t4);

    writer_2f((Word16*) (iq_buf_ptrA + index[counter]), tt1, tt2);
    writer_2f((Word16*) (iq_buf_ptrB + index[counter]), tt3, tt4);

► After re-writing the loops, performance was measured again
► This showed an overall cycle count of 89k cycles!
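The structure of the rewrite (two complex products accumulated separately in real and imaginary accumulators, then narrowed once) can be modeled in portable C as below. This is not the StarCore intrinsic API: cplx16, mpy_re, mpy_im and cplx_mac2 are hypothetical names, and plain integer arithmetic stands in for the fractional scaling and saturation that the real intrinsics apply.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int16_t re, im; } cplx16;   /* hypothetical sample type */

/* Real part of a*b, kept in 32 bits. */
static int32_t mpy_re(cplx16 a, cplx16 b)
{
    return (int32_t)a.re * b.re - (int32_t)a.im * b.im;
}

/* Imaginary part of a*b, kept in 32 bits. */
static int32_t mpy_im(cplx16 a, cplx16 b)
{
    return (int32_t)a.re * b.im + (int32_t)a.im * b.re;
}

/* out = a0*x0 + a1*x1: multiply, then multiply-accumulate, narrow once. */
cplx16 cplx_mac2(cplx16 a0, cplx16 x0, cplx16 a1, cplx16 x1)
{
    cplx16 out;
    int32_t acc_re = mpy_re(a0, x0);
    int32_t acc_im = mpy_im(a0, x0);

    acc_re += mpy_re(a1, x1);
    acc_im += mpy_im(a1, x1);

    /* No saturation is modeled; inputs are assumed small enough to fit. */
    assert(acc_re >= INT16_MIN && acc_re <= INT16_MAX);
    assert(acc_im >= INT16_MIN && acc_im <= INT16_MAX);

    out.re = (int16_t)acc_re;
    out.im = (int16_t)acc_im;
    return out;
}
```

Keeping the sum in wide accumulators means there is a single narrowing step per output instead of a saturating add between two already-narrowed products.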

Porting Code Example: LTE Kernel Resource Mapping, Pass 2

►Loop inspection was performed in this pass of the optimization effort.
  • By inspecting the 4 loops described above in the SL (assembly listing) file, it was found that the 1st of the four loops in the original code had been unrolled one time on the TI architecture
  • This favored the TI architecture’s larger register set, and performed 12 complex multiplies and 4 dual saturated additions inside the loop
  • This change was backed off in the StarCore variant to run a single grouping of 6 _cmpys and 2 _sadd2s, which increases the loop iterations by a factor of 2
  • Overall this yielded higher performance on StarCore by avoiding multiple moves of data in and out of the local stack
    - When running out of registers, we often say the data “spills” to memory via the stack
►Additionally, some hints to the compiler about some of the loop variables were also put into the code:

    cw_assert(total_iterations > 4);
    cw_assert((total_iterations%4) == 0);

►Rolling this loop back gave an overall cycle count of 84k cycles

Porting Code Example: LTE Kernel Resource Mapping, Pass 3

► For this pass, we also looked at the many switch ladders and if-then-else chains ► To improve performance for code with large amounts of flow change, we added a compiler hint “case_likely” to schedule them first in comparison trees:

    case MYTEST_NUMBER2:
    #ifdef STARCORE_TEST
    #pragma case_likely
    #endif

► Cache behavior in the test harness was also not true to runtime. This was due to the test harness only calling the test function one time. In that case, no data or instructions are preloaded into the caches and worst-case cache behavior occurs, with a maximum number of cache misses. The measurement was therefore changed to a warm cache run:
  • To do this the function is called once to put data and instructions into the caches, and then a second time to measure performance
  • The real world performance will often lie between the “cold” cache performance and the “warm” cache performance
  • In this case, the cold cache and warm cache performance differ by about 1500 cycles, the majority of which were instruction cache misses occurring in the cold cache run and not the warm cache run
  • These performance improvements could also have been achieved by judicious use of the instruction cache pre-fetch instruction (PFETCH) and intrinsic

► These optimizations took us to 81k cycles, in line with the original TI benchmark number (82k cycles)

Porting Code Example: LTE Kernel Resource Mapping, Pass 4

► For the fourth pass, some re-working of TI intrinsic-based code was performed. In this case it’s a bit of intrinsic mapping and a bit of building a better mousetrap ► The following sequence (which occurred in multiple locations throughout the code):

    tmp1 = _bitc4(pattern);
    num_iterations = LTE_RE - _add4(tmp1, _shrmb(0,tmp1))&0xFF;

► Was re-written as follows:

    UINT32_T index1 = bit_count_table_8bit[(pattern &0xFF)];
    UINT32_T index2 = bit_count_table_8bit[((pattern>>8) &0xFF)];
    tmp1 = index1+index2;
    num_iterations = LTE_RE - tmp1;

► This avoids the more costly C implementations of _add4 and _bitc4, which have no native StarCore equivalent instructions
► The cycle count dropped to 78k cycles with this change
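The slides do not show how bit_count_table_8bit is built. A minimal sketch, assuming the table simply holds the number of set bits for every 8-bit value, is given below; init_bit_count_table, count_bits16 and table_ready are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t bit_count_table_8bit[256];
static int table_ready = 0;

/* Fill the table once: entry v holds the number of set bits in v. */
void init_bit_count_table(void)
{
    unsigned v, b;

    for (v = 0; v < 256; v++) {
        unsigned bits = 0;

        for (b = v; b != 0; b >>= 1)
            bits += b & 1u;
        bit_count_table_8bit[v] = (uint8_t)bits;
    }
    table_ready = 1;
}

/* Count the set bits in the low 16 bits of pattern with two lookups. */
uint32_t count_bits16(uint32_t pattern)
{
    assert(table_ready);
    return (uint32_t)bit_count_table_8bit[pattern & 0xFF]
         + (uint32_t)bit_count_table_8bit[(pattern >> 8) & 0xFF];
}
```

The table costs 256 bytes of data memory and replaces the per-call bit manipulation with two loads and one add; on a real target it would more likely be a precomputed const array than one filled at startup.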

Porting Code Example: LTE Kernel Resource Mapping, Pass 5

►For the fifth pass, data manipulation intrinsics were used in place of standard pointer value assignments • The following sequence:

    *(iq_ptrA + index) = *symbol_ptr;
    *(iq_ptrA + index+1) = *(symbol_ptr+1);
    *(iq_ptrA + index+2) = *(symbol_ptr+2);
    *(iq_ptrA + index+3) = *(symbol_ptr+3);
    symbol_ptr+=4;

• Was changed to use StarCore data move intrinsic functions:

    write_2l((iq_ptrA+ index), *(symbol_ptr), *(symbol_ptr+1));
    write_2l((iq_ptrA+ index+2), *(symbol_ptr+2), *(symbol_ptr+3));
    symbol_ptr+=4;

►The cycle count dropped to 75k cycles with this change

Porting Code Example: LTE Kernel Resource Mapping, Pass 6

► A better mousetrap!
  • This optimization consisted of recognizing that the address calculations were quite costly for multiply indexed arrays: generating the correct array offset inside one of the inner loops took several integer multiply and accumulate operations
  • In this case, the array offset is calculated sequentially, since the offset is needed before the operations that use it can start. To a large extent, this blocks the compiler from scheduling other operations in parallel to the offset calculations.
  • In the original code, the following line of code is present in an inner loop that increments index:

        re = lte_map[subframe][symbol][index];

  • It was changed so that outside this loop a pointer was assigned:

        ptr_lte_map = lte_map[subframe][symbol];

  • And inside the loop the pointer is accessed by index as follows:

        re = ptr_lte_map[index];

► Multiply indexed arrays often require elaborate offset calculations, so replacing these accesses with a pointer to the innermost row makes sense and saves cycles on both StarCore and TI architectures
► This change took the cycle count down to 69.5k cycles
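The hoisting described above can be sketched in portable C as follows. The array dimensions and the function names sum_row_indexed and sum_row_hoisted are hypothetical; only the access pattern mirrors the text.

```c
#include <assert.h>

#define N_SUBFRAMES 2   /* hypothetical dimensions */
#define N_SYMBOLS   3
#define N_RE        4

/* Before: the full three-index offset is recomputed on every access. */
long sum_row_indexed(int map[N_SUBFRAMES][N_SYMBOLS][N_RE],
                     int subframe, int symbol)
{
    long acc = 0;
    int index;

    for (index = 0; index < N_RE; index++)
        acc += map[subframe][symbol][index];
    return acc;
}

/* After: the row address is computed once, outside the loop. */
long sum_row_hoisted(int map[N_SUBFRAMES][N_SYMBOLS][N_RE],
                     int subframe, int symbol)
{
    const int *row;
    long acc = 0;
    int index;

    assert(subframe >= 0 && subframe < N_SUBFRAMES);
    assert(symbol >= 0 && symbol < N_SYMBOLS);

    row = map[subframe][symbol];
    for (index = 0; index < N_RE; index++)
        acc += row[index];
    return acc;
}
```

Both functions return the same value; the second leaves only a single base-plus-index access inside the loop, which is easier for a compiler to schedule alongside other work.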

Porting Code Example: LTE Kernel Resource Mapping, Pass 7

►A valuable lesson! ►Always check out the latest tool chains! ►Updating the StarCore tool chain took performance from 69.5k cycles to 67k cycles with no changes made to the code • Some changes to how loops were ordered by the compiler and updated heuristics helped here ►Free cycles like this do not often happen, but are always welcome. The point is, always check out the latest tools available!

Available Documentation

► Application note AN3434: TI C64x and C64x+ to SC3400 and SC3850 Porting Guide ► Application note AN4031: How to port a FIR kernel from the TMS320C64x+ to the StarCore SC3850 ► StarCore C/C++ Compiler Intrinsic Functions Reference Manual

Summary

►Overview of Porting Code ►Core Architecture Comparisons ►Data Types and Intrinsic Functions ►Development Tools ►Performance Optimizations

Backup Slides

►Migration from TI DMA model architecture to Freescale cache model architecture • Software model • Memory/cache mapping guidelines

Backup: MSC8156 Software Models

DMA SW architecture (extreme case 1): All data and code pre-located in M2 memory using DMA or peripherals; DDR memory is not accessed.

Cache SW architecture (extreme case 2): All data and code located in DDR memory and fetched by the L2 cache directly on cache misses.

In real systems we anticipate a mixture of these two architectures.

[Figure: two StarCore SC3850 subsystems side by side. Each shows an SC3850 DSP core (800MHz-1GHz) with MMU, a 32 KB I-Cache and a 32 KB D-Cache (8 ways, 256 byte line). In extreme case 1, code and data sit in 512 KByte M2 memory. In extreme case 2, code and data sit in DDR800 (64-bit), reached through a non-blocking switching matrix and a 512 KByte unified cache (8 ways, 64 byte line).]

Backup: Cache vs. DMA Model in SC3850 DSP Subsystem

DMA SW Model (100% M2)
  • All in M2
  • Highest performance
  • High effort
  • Generate higher bus load
  • Expert Mode, higher TTM

Mixed Model (L2 is partly M2)
  • Critical code/data in M2
  • Consider using L2 cache partitioning
  • High performance
  • Moderate effort

Scheduled Cache SW model (100% L2 + SWPF)
  • All in DDR/M3
  • Use SWPF
  • Use L2 cache partitioning
  • High performance
  • Moderate-high effort

Cache SW model (100% L2)
  • All in DDR/M3
  • Good performance
  • Low effort

[Figure: the four models are arranged along an "Effort" axis.]

Backup: Mixed Model

► In order to:
  • Take advantage of the cache architecture and features
  • Avoid heavy "handmade" DMA programming
  • Simplify slave cores job scheduling
►The following basic assumption of Mixed Model usage can be chosen:
  • 50% L2 & 50% M2 is selected

[Figure: the 512 KB memory is split into 256 kB of L2 (cache coherency abstraction via SDOS) and 256 kB of M2 (3 possible mappings of the M2 memory).]

Backup: M2 Mapping: Option #1

256 kB M2: each core has a different M2 memory map
Asymmetric
+ Highly optimized
- Static architecture: core x cannot perform core y's tasks

[Figure: M2 contents per core. Core1: Task A and Task B data and program. Core2: Task C and Task D data and program. Core3: …]

Backup: M2 Mapping: Option #2

256 kB M2: each core has the same fixed M2 memory map
Buffers are statically allocated in M2 at compile time
Symmetric & static
+ Flexible: any task can be performed on any core
+ Optimized attributes for a given buffer
- Wastes memory resources when a buffer is not used by a core

[Figure: Core1 (Master), Core2 (Slave) and Core3 (Slave), with Task A and Task B assigned across the cores. The M2 of every core holds the same layout: Task A, Task B and Task C data, …, followed by Task A, Task B and Task C program.]
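The static, symmetric layout above can be sketched in C as a single struct that every core would link at the same M2 address. This is a minimal host-side sketch, not taken from the session: the buffer sizes, the type `m2_static_map_t` and the helper `m2_idle_bytes()` are hypothetical names chosen for illustration, and only data buffers are modeled. It shows both sides of the trade-off: the layout is checked against the 256 kB limit at compile time, and buffers of tasks a core is not running stay reserved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define M2_SIZE_BYTES (256u * 1024u)   /* 256 kB of M2, as on the slide */

/* Hypothetical per-task buffer sizes, for illustration only. */
enum {
    TASK_A_BYTES = 64 * 1024,
    TASK_B_BYTES = 96 * 1024,
    TASK_C_BYTES = 32 * 1024
};

/* Bit flags naming the tasks a core is currently running. */
enum { TASK_A = 1, TASK_B = 2, TASK_C = 4 };

/* Identical on every core: each task buffer has a fixed offset in M2. */
typedef struct {
    uint8_t task_a_data[TASK_A_BYTES];
    uint8_t task_b_data[TASK_B_BYTES];
    uint8_t task_c_data[TASK_C_BYTES];
} m2_static_map_t;

/* The build fails if the static map no longer fits in M2. */
_Static_assert(sizeof(m2_static_map_t) <= M2_SIZE_BYTES,
               "static M2 map exceeds 256 kB");

/* Bytes reserved in M2 for tasks that this core is not running. */
size_t m2_idle_bytes(unsigned active_tasks)
{
    size_t idle = 0;
    if (!(active_tasks & TASK_A)) idle += TASK_A_BYTES;
    if (!(active_tasks & TASK_B)) idle += TASK_B_BYTES;
    if (!(active_tasks & TASK_C)) idle += TASK_C_BYTES;
    assert(idle <= sizeof(m2_static_map_t));
    return idle;
}
```

With these assumed sizes, a core that runs only Task A leaves the Task B and Task C buffers (128 kB) reserved but unused, which is the memory waste listed as the drawback of this option.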

Backup: M2 Mapping: Option #3

256 kB M2: each core has the same M2 memory map (seen as a global temporary buffer)
Buffers are dynamically allocated in M2 at runtime
Symmetric & dynamic
+ Flexible: any task can be performed on any core
+ M2 memory resources not wasted
- Complexity & latency in DMA programming

[Figure: Core1 (Master), Core2 (Slave) and Core3 (Slave), with Task A and Task C assigned across the cores. M2 contents shown: Task A data, Task C data, data, …, followed by Task A program, Task C program, program.]
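Treating M2 as a global temporary buffer could look like the following sketch. It is a host-side model under stated assumptions, not the session's implementation: `m2_pool` stands in for the real M2 region (which a linker file would place on target), the 64-byte allocation granularity is an assumed value, and `m2_alloc()`, `m2_release_all()` and `m2_bytes_free()` are hypothetical names. The allocator is a simple bump allocator that is reset once a job completes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define M2_SIZE_BYTES (256u * 1024u)  /* 256 kB of M2, as on the slide */
#define M2_ALIGN      64u             /* assumed allocation granularity */

/* Host-side stand-in for the M2 region. */
static uint8_t m2_pool[M2_SIZE_BYTES];
static size_t  m2_used;               /* bytes handed out so far */

/* Round n up to the next multiple of M2_ALIGN. */
static size_t m2_round_up(size_t n)
{
    return (n + (M2_ALIGN - 1u)) & ~(size_t)(M2_ALIGN - 1u);
}

/* Reserve nbytes at runtime; returns NULL if the request does not fit. */
void *m2_alloc(size_t nbytes)
{
    size_t need = m2_round_up(nbytes);
    /* need < nbytes detects wrap-around in the rounding above. */
    if (nbytes == 0 || need < nbytes || need > M2_SIZE_BYTES - m2_used) {
        return NULL;
    }
    void *p = &m2_pool[m2_used];
    m2_used += need;
    assert(m2_used <= M2_SIZE_BYTES);
    return p;
}

/* Release every buffer at once, for example when a job has completed. */
void m2_release_all(void)
{
    m2_used = 0;
}

size_t m2_bytes_free(void)
{
    return M2_SIZE_BYTES - m2_used;
}
```

Because buffer addresses are only known at runtime in this scheme, any DMA transfer into or out of M2 has to be programmed after allocation, which is one way to read the complexity and latency drawback listed above.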

Backup: Cache Coherency

► Cache coherency is easily implemented in software through dedicated SDOS APIs
► Full range of APIs mapped to core instructions (pfetch, dfetch, dfetchw, dflush, dsync…)
► Two types of supported operations:
• Global, e.g. osCacheL2ProgSweepGlobal()
• By address, e.g. osCacheProgSweep()
► Two flavors of each cache API:
• Blocking, e.g. osCacheDataSweep()
• Non-blocking, e.g. osCacheDataSweepAsync()
► APIs work on L1 only, L1+L2, or L2 only, depending on the attributes and settings configured in the MMU
► Works as a cache coherency abstraction layer
[Figure: 256 kB L2]
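Why software-managed coherency needs explicit flush and invalidate steps can be illustrated with a toy model. The sketch below is a host-side simulation of a single write-back cache line; it does not call the SDOS API, and every name in it (`toy_cache_t`, `toy_flush()`, `toy_invalidate()` and so on) is hypothetical, as is the 64-byte line size. The DMA-side accessors read and write the backing memory directly, the way a bus master that bypasses the core's cache would.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64u   /* assumed line size for this model */

/* One cache line in front of one line of backing memory. */
typedef struct {
    uint8_t mem[LINE_BYTES];    /* backing memory: what a DMA engine sees */
    uint8_t line[LINE_BYTES];   /* cached copy: what the core sees */
    bool    valid;
    bool    dirty;
} toy_cache_t;

/* Load the line from memory on first access. */
static void toy_fill(toy_cache_t *c)
{
    if (!c->valid) {
        memcpy(c->line, c->mem, LINE_BYTES);
        c->valid = true;
        c->dirty = false;
    }
}

uint8_t toy_core_read(toy_cache_t *c, unsigned i)
{
    assert(i < LINE_BYTES);
    toy_fill(c);
    return c->line[i];
}

/* Write-back policy: the store stays in the cache until flushed. */
void toy_core_write(toy_cache_t *c, unsigned i, uint8_t v)
{
    assert(i < LINE_BYTES);
    toy_fill(c);
    c->line[i] = v;
    c->dirty = true;
}

/* Flush: copy a dirty line back so other bus masters can see it. */
void toy_flush(toy_cache_t *c)
{
    if (c->valid && c->dirty) {
        memcpy(c->mem, c->line, LINE_BYTES);
        c->dirty = false;
    }
}

/* Invalidate: drop the cached copy so the next read refetches memory. */
void toy_invalidate(toy_cache_t *c)
{
    c->valid = false;
    c->dirty = false;
}

/* DMA-side accesses bypass the cache entirely. */
uint8_t toy_dma_read(const toy_cache_t *c, unsigned i)
{
    assert(i < LINE_BYTES);
    return c->mem[i];
}

void toy_dma_write(toy_cache_t *c, unsigned i, uint8_t v)
{
    assert(i < LINE_BYTES);
    c->mem[i] = v;
}
```

In this model a core write is invisible to the DMA side until `toy_flush()` runs, and a DMA write is invisible to the core until `toy_invalidate()` runs. A blocking API would return only after such an operation completes, while a non-blocking one would let the core continue and synchronize later.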
