Extend the Powerpc Instruction Set for Complex-Number Arithmetic

XPLANATION:FPGA101 Extend the PowerPC Instruction Set for Complex-Number Arithmetic Using the APU inside a Xilinx Virtex-5-FXT can ease the design of field-upgradable automotive multimedia systems. by Endric Schubert, Felix Eckstein, Joachim Foerster, Lorenz Kolb and Udo Kebschull Missing Link Electronics [email protected] Automotive multimedia systems face a serious technical challenge: how to achieve system upgradability over the product's lengthy life cycle? Cars and trucks have a typical lifetime of more than 10 years. That makes it diffi- cult for automotive multimedia systems to keep pace with the rapidly changing stan- dards in consumer electronics and mobile communications. In most cases, it is not sufficient (or even possible) to update only the multimedia software. Many applications, especially multimedia codecs, also require an increase in compute performance. Yet designing a system with “spare” compute power for later use is nei- ther economical nor technically feasible, since many technology changes are simply not foreseeable. One solution is to upgrade the compute platform together with the software in such a way that the upgraded system provides sufficient computation power for the additional software processing load. If you build a system with a Xilinx® Virtex®-5 FXT device, you can give your design additional computation power by adding special-purpose compute operations to the PowerPC processor’s Auxiliary Processing Unit (APU). 44 Xcell Journal Copyright © 2008 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners. Fourth Quarter 2008 XPLANATION:FPGA101 At Missing Link Electronics, a company computations execute in parallel in the out of our design without increasing the working on reconfigurable platforms to link FPGA fabric rather than sequentially in CPU clock frequency (which may cause reliable automotive and aerospace electron- software. In Xilinx-speak, these hardware other headaches). ics with the rapidly changing consumer and blocks are called Fabric Coprocessing The key reason to use the APU rather mobile-communications markets, we Modules, or FCMs. You can write FCMs than connecting hardware blocks to the believe the PowerPC processor APU inside the Xilinx Virtex-5 FXT devices is a little The PowerPC processor APU inside the Xilinx Virtex-5 gem. It provides embedded-systems designers with the same optimization powers tra- FXT devices is a little gem. It provides embedded- ditionally available only to the “big guys” who build their own custom ASSP devices. systems designers with the same optimization powers Given what’s now available to Xilinx users, we think everyone should tap the APU to traditionally available only to the ‘big guys’ who build optimize their designs. their own custom ASSP devices. Basics of Extending Instruction Set via the APU When designers want to optimize an in VHDL or Verilog, and they will end up microprocessor via the PLB bus is the supe- embedded system, they typically do so by in the FPGA fabric of the Virtex-5 FXT rior bandwidth and lower latency between looking for ways to extend the instruction device. You can connect one or more FCM the PowerPC processor and the APU/FCM. set of the microprocessor at the heart of to the PowerPC processor APU interface. Another advantage lies in the fact that the their design. Traditionally, this is the best The next step is to adjust your software APU is independent of the CPU-to-periph- option when the complexity of the embed- code to make use of those additional eral interface and therefore does not add an ded system lies in the software portion of instructions. You have two options (assum- extra load to the PLB bus, which the system the design. You could also simply put new ing you are programming in C language). needs for fast peripheral access. functionality in the design by adding dedi- The first is to change the C compiler to The APU provides various ways of inter- cated hardware blocks. automatically exploit cases where the use of facing between the PowerPC and the FCM. However, you’ll likely find that increas- the additional instructions would be benefi- We can use a load-store method or the user- ing the instructions holds some great cial. We’ll leave this option to the academ- defined instruction (UDI) method. advantages that complement hardware ics and certain folks working on ASSPs. Chapter 12 of the Xilinx User Guide changes, but is somewhat easier for The second, and more elegant, option is UG200 offers detailed descriptions of these designers to implement. For example, by not to touch the compiler but instead use techniques (http://www.xilinx.com/support/ extending the instructions, you can opti- so-called compiler-known functions. This documentation/user_guides/ug200.pdf ). mize your design in finer granularity. means that manually, in our software code, In our example we’ll deploy the UDI Also, extending the instruction set typical- we’ll call a C macro or a C function that method, because it provides the most con- ly does not interfere with memory access; makes use of these additional instructions. trol over the system, enabling the highest thus, it has the potential to optimize the Using either option, we must adjust the performance. The example design is avail- system’s overall performance. assembler so that it supports the new instruc- able for download from our Web site, at Even though individuals, companies tions. Fortunately, Xilinx includes a http://www.missinglinkelectronics.com/ and academic researchers have published PowerPC compiler and assembler with the support/. papers on how to do it, extending an Embedded Development Kit (EDK) that instruction set may seem like a “black art” already support these additional instructions. Example Design Description to anyone new to this technique. But in When the PowerPC encounters these By adding a UDI, we have extended the reality, it isn’t that complex. Let’s examine new instructions, it quickly detects that PowerPC processor’s instruction set to per- how you can optimize your Virtex-5 FXT they are not part of its own original form complex-number multiplications, a design by making some fairly simple addi- instruction set and defers the handling of handy optimization for many multimedia tions to the PowerPC processor’s instruc- them to the APU. Xilinx has configured decoding systems. The EDK diagram (see tion set via the APU interface. the APU to decode these instructions, pro- sidebar, Figure 1) shows the overall design, In general, to extend the instruction set vide the appropriate FCM with the including how we connected the complex- of an embedded microprocessor, you need operand data and then let the FCM per- number multiplier FCM to the PowerPC to understand you are modifying both the form the computation. processor via the APU, and how software software and the hardware. First, you are If this is properly done, the software can make use of it. adding hardware blocks to your system to requires fewer instructions when running. We picked complex-number multiplica- perform specialized computations. These Therefore, we can get more compute power tion as an example because of its wide Fourth Quarter 2008 Xcell Journal 45 XPLANATION:FPGA101 applicability in decoding streaming-media data, and because it clearly demonstrates how to make use of the APU by adding a Step-by-Step Guide to Using the APU special-purpose instruction. Here we present detailed information on how the engineers at Missing Link Electronics Complex-number multiplication is generated the necessary files for our example design, and how to use these files to repro- defined as multiplying two complex num- duce the results on the Xilinx ML507 Evaluation Platform, which contains a Xilinx bers, each having a real value and an imag- Virtex-5 XC5VFX70T device. We also show how to use this design as a starting point inary value. for your own APU-enhanced FPGA design. (a_R + j a_I, where j*j = -1): Step 1: Build Your Coprocessor Theoretically, you can build almost any coprocessor as long as it fits into your FPGA, (a_R + j a_I) * (b_R + j b_I) = but keep in mind that a user-defined instruction (UDI) can transport two 32-bit (a_R * b_R - a_I * b_I) + j (a_I * operands and one 32-bit result per cycle. Our coprocessor for complex-number multi- b_R + a_R * b_I) plication is implemented in file src/cmplxmul.vhd. For efficiency, the complex-number Step 2: Build the FCM Wrapper multiplication hardware block (cmplxmul), To be area-efficient, your coprocessor may need a multicycle behavior similar to ours. performs the multiplication in three stages. Therefore, you will need a state machine to implement a simple handshake protocol between the coprocessor and the Auxiliary Processing Unit (APU). In our example, we This saves hardware resources by using did this inside the wrapper “fcmcmul,” which we implemented in file src/fcmcmul.vhd. only two multipliers and two adders in this Inside the wrapper fcmcmul, we instantiated the complex-number multiplication multicycle implementation. Figure 2 shows hardware block cmplxmul, which becomes the Fabric Coprocessing Module (FCM). a block diagram (in sketch form) of the Thus, fcmcmul provides the interface we need to connect it to the APU. You can find a complex-number multiplication FCM. detailed description of those interface signals in Xilinx document UG200, starting at As the VHDL code in cmplxmul.vhd page 188. The important detail is the timing diagram for “Non-Autonomous demonstrates, we perform the complex- Instructions with Early Confirm Back-to-Back” on page 216, which shows the protocol number multiplication in three clock between the APU and the FCM. cycles. In file cmplxmul.vhd we have implemented the FCM to perform this Step 3: Connect the FCM with the APU complex-number multiplication.

Load more