
TETRACOM: Technology Transfer in Computing Systems

FP7 Coordination and Support Action to fund 50 technology transfer projects (TTP) in computing systems. This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement n° 609491.

A Highly Optimized Arithmetic Software Library and Hardware Co-IP for Fixed-Point VLIW-SIMD Processor Architectures

Lukas Gerlach, Stephan Nolting, Holger Blume and Guillermo Payá Vayá, Leibniz Universität Hannover, Germany
Hans-Joachim Stolberg and Carsten Reuter, videantis GmbH, Hannover, Germany

TTP Problem

Performance requirements are pushing the limits of embedded multimedia systems: frequently used non-linear, complex mathematical functions require a lot of computational power.

Area and energy efficiency is restricted for embedded multimedia systems: area- and energy-optimized computation is achieved by using specific arithmetic evaluation software libraries or hardware accelerators.

sin() atan() ln() div()

cos() sqrt() exp() pow()

TTP Solution

Software-based solution: Mathematic software library (LibARITH)

Optimized for VLIW-SIMD processors:
• Exploiting data and instruction level parallelism

Advantages:
• High flexibility
• High accuracy
• Fast computation compared to other approximation methods
• Reduced memory requirement compared to look-up-table interpolation

Hardware-based solution: CORDIC (Coordinate Rotation Digital Computer)

Scalable co-processor architecture:
• M CORDIC modules in series are incorporated to perform M CORDIC iterations per clock cycle.
• Data level parallelism (SIMD) is supported.

[Figure: CORDIC processing element. Block labels: config, x,y,z and start inputs; optional pre-processing stage (scale); registers; iteration controller; CORDIC scale factor table; angle table; M CORDIC modules; optional post-processing stage (scale); x,y,z outputs.]

[Figure: Maximal relative error per number of CORDIC iterations for 32-bit fixed-point values. x-axis: CORDIC iterations (1 to 32); y-axis: maximal relative error (logarithmic, 10 down to 1E-08).]

[Figure: CORDIC co-processor area with and without SIMD support for different numbers of CORDIC modules. x-axis: M CORDIC modules (M=1, M=2, M=4); y-axis: area [μm²] (0 to 30000); series: w/o SIMD, SIMD.]

Synthesized with a 40 nm low-power technology for an operating frequency of 200 MHz.
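The role of the angle table, the scale factor and the iteration count can be illustrated with a minimal fixed-point sketch of CORDIC in rotation mode. This is not the LibARITH or co-processor implementation: the Q2.30 number format, the names make_tables and cordic_sin_cos, and the choice to fold the scale factor into the start value are assumptions made for illustration, and the pre-processing (range reduction) stage is omitted, so the input is restricted to |theta| <= pi/2.

```python
import math

FRAC_BITS = 30            # assumed Q2.30 fixed-point format (not from the poster)
ONE = 1 << FRAC_BITS      # fixed-point representation of 1.0


def make_tables(n_iter):
    """Build the angle table atan(2^-i) and the inverse CORDIC gain 1/K."""
    angles = [round(math.atan(2.0 ** -i) * ONE) for i in range(n_iter)]
    gain = 1.0
    for i in range(n_iter):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return angles, round(ONE / gain)


def cordic_sin_cos(theta, n_iter=32):
    """Return (sin, cos) of theta in radians for |theta| <= pi/2.

    Only shifts and additions are used inside the loop.
    """
    angles, k_inv = make_tables(n_iter)
    # Starting at x = 1/K compensates the gain of the micro-rotations,
    # so no multiplication is needed after the loop.
    x, y, z = k_inv, 0, round(theta * ONE)
    for i in range(n_iter):
        if z >= 0:   # rotate counter-clockwise by atan(2^-i)
            x, y, z = x - (y >> i), y + (x >> i), z - angles[i]
        else:        # rotate clockwise by atan(2^-i)
            x, y, z = x + (y >> i), y - (x >> i), z + angles[i]
    return y / ONE, x / ONE


if __name__ == "__main__":
    for n in (8, 16, 32):
        s, _ = cordic_sin_cos(0.5, n)
        print(n, abs(s - math.sin(0.5)))
```

Running the loop with fewer iterations gives a coarser result, which is the trade-off between iteration count and error that the error plot describes.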

TTP Impact

Cycles for hyperbolic and trigonometric operations (32-bit)

Operation   HW VLIW-SIMD+CORDIC w/o SIMD (*)   SW VLIW-SIMD (SW CORDIC) (*)   SW TI TMS320C6748
sin()       55+30+32+18 = 135                  55+30+408+53 = 546             2474
cos()       55+30+32+18 = 135                  55+30+408+18 = 511             1423
atan()      55+10+32+5 = 102                   55+10+408+5 = 478              3759
div()       55+17+32+9 = 113                   55+17+408+9 = 489              152
exp()       55+28+32+17 = 121                  55+28+408+17 = 508             2557
ln()        55+15+32+12 = 114                  55+15+408+12 = 490             2721
sqrt()      55+18+32+9 = 114                   55+18+408+9 = 490              311
pow()       51+49+32+27 = 159                  51+49+408+27 = 535             4176

(*) Notation for VLIW: table-configuration + pre-processing + CORDIC-core-iterations + post-processing
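As a small worked example of this notation, the sketch below adds up the four phases for sin() using the cycle counts listed for the hardware and software variants. The Cycles type and its field names are hypothetical helpers introduced here for illustration only.

```python
from typing import NamedTuple


class Cycles(NamedTuple):
    """Cycle breakdown following the (*) notation (hypothetical helper type)."""
    table_config: int
    pre_processing: int
    core_iterations: int
    post_processing: int

    def total(self) -> int:
        # A NamedTuple is a tuple, so the four phases can be summed directly.
        return sum(self)


# sin() cycle counts as listed for the two VLIW-SIMD variants
hw_sin = Cycles(55, 30, 32, 18)    # HW VLIW-SIMD+CORDIC w/o SIMD
sw_sin = Cycles(55, 30, 408, 53)   # SW VLIW-SIMD (SW CORDIC)

if __name__ == "__main__":
    print(hw_sin.total(), sw_sin.total())
    print(round(sw_sin.total() / hw_sin.total(), 2))  # speed-up of HW over SW
```

In both variants the table configuration and pre-processing phases cost the same number of cycles, so the difference comes almost entirely from the CORDIC core iterations (32 versus 408 cycles).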

• The CORDIC co-processor with SIMD support computes multiple independent results within the same number of CORDIC-core-iterations.
• The number of CORDIC modules linearly decreases the needed number of CORDIC-core-iteration cycles for the same precision.
• The area of the CORDIC co-processor without SIMD accounts for about 7% of the area of a reference VLIW-SIMD processor architecture.
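The first two statements can be turned into a simple idealized cycle model. The sketch below assumes that N CORDIC iterations on M modules in series take ceil(N / M) core cycles and that each SIMD lane delivers one independent result per pass. The function names and the simd_lanes parameter are hypothetical, and real hardware may add pipeline or control overhead that this model ignores.

```python
import math


def core_cycles(n_iterations: int, m_modules: int) -> int:
    """Core-iteration cycles if M chained modules do M iterations per clock (assumed model)."""
    return math.ceil(n_iterations / m_modules)


def core_cycles_per_result(n_iterations: int, m_modules: int, simd_lanes: int = 1) -> float:
    """Average core cycles per result when simd_lanes results are computed in parallel."""
    return core_cycles(n_iterations, m_modules) / simd_lanes


if __name__ == "__main__":
    for m in (1, 2, 4):  # module counts shown in the area plot
        print(m, core_cycles(32, m), core_cycles_per_result(32, m, simd_lanes=2))
```

Under this model, doubling M halves the core cycles for a fixed iteration count, while SIMD leaves the cycle count unchanged and increases the number of results per pass.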

TTP Facts

Contact: Jun.‐Prof. Dr.‐Ing. Guillermo Payá Vayá
E‐mail: [email protected]‐hannover.de
TETRACOM contribution: 25.000 €
Duration: 1/1/2016‐31/7/2016

TETRACOM coordinator: Prof. Rainer Leupers, [email protected]‐aachen.de http://www.tetracom.eu | @TetracomProject