Cadence Tensilica Fusion G3 DSP Core
An Independent Evaluation of the Cadence Tensilica Fusion G3 DSP Core
By the staff of BDTI
October 2016

OVERVIEW

The recently announced Cadence Tensilica Fusion G3 DSP IP core is a high-performance licensable programmable digital signal processor core targeting diverse signal processing applications such as communications, audio and industrial applications. BDTI, a technology analysis firm, benchmarked the Fusion G3 core on several typical digital signal processing functions, comparing the performance of Fusion G3 against Texas Instruments’ flagship C66x DSP core. BDTI also compared the Fusion G3’s FFT performance to that of the ARM Cortex-A57 CPU core. Finally, BDTI implemented and optimized a custom DSP function from scratch on the Fusion G3 and compared the resulting performance to that of the TI C66x. This report presents BDTI’s independent evaluation of the Fusion G3 core’s performance and ease of software development.

The Fusion G3 DSP core’s wide SIMD (single-instruction, multiple-data) operations and VLIW (very long instruction word) instruction set provide excellent cycle efficiency on many DSP tasks, and yield performance that surpasses that of TI’s flagship C66x DSP core. Fusion G3 is also noteworthy for its double-precision floating-point support for precision-critical tasks. Cadence provides robust software development tools and DSP function libraries to help users effectively realize the core’s performance potential.

© 2016 Berkeley Design Technology, Inc.

Contents

1. Introduction
2. About the Cadence Fusion G3 Core
3. About the Benchmarks
4. Benchmark Results
5. BDTI’s Fusion G3 Software Development User Experience
6. Conclusions
Appendix: Fusion G3 vs. ARM Cortex-A57 Benchmarking Considerations

1. Introduction

The Cadence Tensilica Fusion G3 DSP is a high-performance programmable digital signal processor core from Cadence, targeting a wide range of signal processing workloads in applications such as communications, audio, and industrial equipment. This report presents BDTI’s independent evaluation of the Cadence Fusion G3 DSP core’s performance and ease of software development.

We compare Fusion G3’s execution speed and cycle efficiency against Texas Instruments’ flagship C66x high-performance DSP core, used in many of TI’s DSP chips. BDTI chose the TI C66x for this competitive evaluation because it targets a similar range of applications, is well known in the industry, and has readily available tools and optimized libraries that we could leverage for benchmarking. We used existing, optimized software library functions available from Cadence and TI, respectively, to compare the performance of the Fusion G3 and C66x.

We also compare the Fusion G3’s execution speed on floating-point FFTs against ARM’s Cortex-A57 core leveraging ARM NEON SIMD instructions. The Cortex-A57 is a licensable 64-bit high-performance CPU core. Comparing Fusion G3 performance against that of the Cortex-A57 illustrates the benefits of including a dedicated DSP core in a SoC design. However, CPU cores such as Cortex-A57 are typically used with different system-level hardware architectures, different software architectures, and different programming practices compared with DSP cores, making performance comparisons between these two classes of processors challenging and prone to misinterpretation. Therefore, we consider the comparison of Fusion G3 and Cortex-A57 to be a rough first-order comparison.

[Figure 1: Block Diagram of Fusion G3 Architecture]

Finally, we discuss BDTI’s software development and optimization experience with the Fusion G3 core and toolchain.
In addition to using DSP functions from Cadence’s DSP library, BDTI implemented and optimized on the Fusion G3 a median filter common in video processing applications. In Section 5 we describe and comment on our experience with function optimization on Fusion G3 and with Cadence software libraries and tools.

2. About the Cadence Fusion G3 Core

Cadence positions the Fusion G3 DSP core as a multi-purpose DSP for compute-intensive applications. The Fusion G3 is suitable for diverse workloads including communications, audio, imaging, radar and industrial control, among others. A block diagram of the Fusion G3 architecture is shown in Figure 1.

Fusion G3 natively supports 8-bit, 16-bit, and 32-bit fixed-point arithmetic via 128-bit wide SIMD (single-instruction, multiple-data) vector registers, performing parallel operations on sixteen, eight, or four data elements at a time, respectively. An optional vector floating-point unit adds support for SIMD operations on 32-bit single-precision and 64-bit double-precision floating-point data, compliant with IEEE 754 standards.

A VLIW (very long instruction word) instruction set architecture allows the Fusion G3 core to execute up to four SIMD vector operations per cycle, including (with some restrictions) up to two multiplier/ALU operations and two loads or one load and one store. Flexible operation predication support allows vectorized code to maintain high throughput by avoiding conditional branches.

3. About the Benchmarks

Common DSP Functions

BDTI measured the performance of the Cadence Tensilica Fusion G3 DSP core and the Texas Instruments C66x DSP core on several functions from the optimized software libraries available from Cadence and Texas Instruments. We believe that these software libraries are widely used, and their performance generally reflects real-world application performance.

BDTI benchmarked the DSP algorithm kernel functions listed in Table 1. For each function, we measured the number of processor cycles taken by the Fusion G3 and C66x to execute the function.

We computed execution time for each target by dividing the measured number of cycles by the processor’s clock rate. For the Cadence Fusion G3 we measured cycle counts using a cycle-accurate simulator, and for the C66x we used an evaluation board.

The benchmark functions were chosen based on relevance to a variety of signal processing tasks, and also based on availability in a comparable form on both the Fusion G3 and C66x platforms. The Cadence and Texas Instruments optimized DSP function libraries both contain multiple implementations of some algorithm kernel functions such as FIR filters. Different implementations impose different restrictions on function parameters, for example requiring that the number of filter taps be a multiple of four. Such restrictions represent tradeoffs between function generality and optimal vectorized implementation. BDTI made best efforts to choose the fastest available function for each target processor, while also ensuring that the restrictions imposed by the respective implementations are reasonably comparable. Thus, the requirement of availability of comparable implementations on both targets restricted the choice of algorithm kernels somewhat.

Benchmark                 Parameters                                      Data Types
Complex FFT               512 points                                      16-bit fixed-point, 32-bit floating-point
Real FIR Filter           32 taps, 1024 points                            16-bit fixed-point, 32-bit floating-point
Real Vector Dot Product   1024 points                                     16-bit fixed-point, 32-bit floating-point
2D Median Filter          3×3 kernel, output one scan line, 1280 pixels   8-bit unsigned
Matrix LU Decomposition   32×32 matrix                                    64-bit floating-point

Table 1 DSP functions selected for benchmarking the Fusion G3 and C66x. For each function, BDTI benchmarked an optimized implementation for each data type listed.
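As an illustration of how measured cycle counts translate into execution time (cycles divided by clock rate), the following minimal Python sketch may help. The function names and all numeric inputs are hypothetical, chosen only for illustration; they are not measurements or clock rates from this report.

```python
def execution_time_us(cycles: int, clock_hz: float) -> float:
    """Execution time in microseconds: cycle count divided by clock rate (Hz)."""
    if cycles < 0:
        raise ValueError("cycle count must be non-negative")
    if clock_hz <= 0:
        raise ValueError("clock rate must be positive")
    return cycles / clock_hz * 1e6


def speedup(baseline_us: float, candidate_us: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_us / candidate_us


# Hypothetical inputs for illustration only (not measurements from this report).
core_a_us = execution_time_us(cycles=12_000, clock_hz=1.2e9)  # about 10 microseconds
core_b_us = execution_time_us(cycles=8_000, clock_hz=0.8e9)   # about 10 microseconds
```

In this made-up case the second core needs fewer cycles but runs at a lower clock rate, so both finish in roughly the same time. This is why cycle counts and execution times are best read as two separate metrics.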
The LU matrix decomposition algorithm kernel was chosen specifically to highlight the support for double-precision floating-point arithmetic on the Fusion G3 core. Matrix decomposition and inversion functions often require double-precision floating-point arithmetic. Of the algorithm kernels provided in the Cadence and TI libraries that commonly require double-precision floating-point support, the LU decomposition kernel functions have the most closely comparable APIs between the Fusion G3 and C66x DSP libraries. Note, however, that while the TI matrix LU decomposition function generates the permutation matrix as one of its outputs, the corresponding Cadence function does not. BDTI estimated the additional number of cycles that would be required by the Cadence function to output the permutation matrix. The number of cycles for the Fusion G3 on the matrix LU decomposition reported here is computed by

Evaluating Ease of Software Development

The median filter function was chosen for the purpose of evaluating software development user experience for the Fusion G3. This function was attractive because it is not included in the Cadence function library (thus requiring custom implementation and optimization) and because there is a median filter implementation available in the Texas Instruments image processing software library for the C66x core, enabling performance comparison against the C66x. BDTI created and optimized a median filter implementation for Fusion G3 that is designed to be meaningfully comparable to the median filter function found in the Texas Instruments library. Therefore, the median filter benchmark result presented in this document reflects a fair comparison of the Fusion G3 and C66x core architectures, rather than the fastest possible implementation