Inside the New Virtex-5 FXT FPGA

SU 4 EODQATR20 CL ORA XILINX, INC. XCELL JOURNAL ISSUE 64, SECOND QUARTER 2008 EMBEDDED PROCESSING SOLUTIONS EDITION

Issue 64 Second Quarter 2008 XcellXcelljournaljournal SOLUTIONS FOR A PROGRAMMABLE WORLD

COVER Embedded Processing InnovationsInnovations withwith Virtex-5 FXT Devices

INSIDEINSIDE Next-Generation Wireless Standards Using Virtex-5 FXT Devices

FPGA Coprocessors for Spacecraft Image Processing

Designing Multiprocessor SOCs

Building Linux Platforms on Xilinx Processors

www.xilinx.com/xcell/ Support Across The Board™

Bryan Fletcher Avnet Global Technical Marketing Bringing Products to Life. Manager At Avnet Electronics Marketing, support across the board David Loftus Xilinx is much more than a tagline for us. From design to Vice President / General Manager delivery – we are deeply committed to driving maximum General Products Division (GPD) efﬁciency throughout the product lifecycle.

The Challenge When the world’s largest FPGA maker Xilinx was mounting the launch of its latest Spartan™-3 Generation family of low-cost FPGAs, the company wanted to ensure designers would have everything they need to compete in the highly competitive high-volume market. Xilinx turned to longtime partner Avnet to lend its technical expertise in creating a complete solution.

The Solution Avnet spent countless engineering hours alongside Xilinx – developing a comprehensive solution of starter kits, development boards, and Speedway™ educational workshops. With Xilinx® Spartan-3 Generation FPGAs and Avnet’s support across the board, designers have access to all they need to save up to 50 percent in total system cost over competing FPGAs.

Visit www.em.avnet.com/xilinxspartan3kits to learn more about Spartan-3 solutions and to purchase a Spartan-3A evaluation kit for only $39!

1 800 332 8638 www.em.avnet.com

©Avnet, Inc. 2008. All rights reserved. AVNET is a registered trademark of Avnet, Inc. ©2008 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners. LETTER FROM THE PUBLISHER

Integrated System Platforms: Xcell journal The Next Wave in FPGA-based Innovation PUBLISHER Forrest Couch [email protected] 408-879-5270 for Embedded Processing Applications EDITOR Charmaine Cooper Hussain

ART DIRECTOR Scott Blair FPGAs have come a long way since the days serving as prototypes to test design concepts before committing them to ‘permanent’ silicon. FPGAs have now become the foundation for many of DESIGN/PRODUCTION Teie, Gelwicks & Associates the most innovative products that improve our lives everyday. Driven by diverse and demanding 1-800-493-5551 market requirements, FPGAs are valued above all else for their flexibility. Their inherent pro-

ADVERTISING SALES Dan Teie grammability is the insurance policy product developers depend on to quickly and efficiently meet 1-800-493-5551 market windows and keep products current in fast moving, evolving markets. [email protected] As many of the articles published in Xcell Journal illustrate, FPGAs are no longer a single-function device. These programmable devices have emerged as the preferred platform for the highly TECHNICAL COORDINATORS Larry Caputo F Jay Gould integrated world of systems-on-chip (SoC) and embedded system design. Therefore, it is no coincidence that many of you will see this issue of our magazine at the Embedded System Conference in INTERNATIONAL Piera Or, Asia Pacific San Jose, CA, one of the largest gatherings of hardware designers and software developers. [email protected] It’s also appropriate that this theme is highlighted as Xilinx introduces its ultimate system integra- Christelle Moraga, Europe/ tion platform – the Virtex™-5 FXT FPGA. Comprising the fourth platform in the 65-nanometer Middle East/Africa [email protected] Virtex-5 family, Virtex-5 FXT devices combine flexibility with very high-performance embedded processing, digital signal processing and connectivity capabilities into a single chip to reduce total system Yumi Homura, Japan [email protected] cost, power, and board real estate.

SUBSCRIPTIONS All Inquiries In this Issue www.xcellpublications.com The significance of the Virtex-5 FXT platform for embedded developers is underscored in the cover article, which describes in detail the industry’s first FPGAs with embedded PowerPC® REPRINT ORDERS 1-800-493-5551 440 processor blocks, high-speed RocketIO™ serial transceivers, and dedicated XtremeDSP™ processing capabilities. These devices deliver an optimal system integration platform that meets market and bandwidth requirements of transporting voice, video, and data for a wide range of applications in wired and wireless communications, audio/video broadcast equipment, military, aerospace and industrial systems, along with many others. This issue provides a “behind-the- scene” look into a variety of FPGA-based embedded system applications, ranging from outer space and aerospace to more earthly scientific and automotive innovations. In addition, you’ll learn how every aspect of Xilinx embedded processing solutions has been substantially upgraded and usability drastically simplified through intuitive hardware and software www.xilinx.com/xcell/ tools. Advancements include the v7 release of the MicroBlaze™ 32-bit soft processor with configurable memory management unit, the new high-performance processor interconnect architecture, and the ISE™ Design Suite 10.1 development environment with unified access to all logic, Xilinx, Inc. 2100 Logic Drive embedded, and DSP design tools. Of course, our expansive ecosystem of third-party embedded San Jose, CA 95124-3400 technology and service providers, several of whom are featured in this issue, also play a key role Phone: 408-559-7778 FAX: 408-879-4780 in helping developers maximize the benefits of Xilinx embedded solutions. www.xilinx.com/xcell/ At the end of the day, the goal of these efforts is to enable rapid design of high-performance embedded systems with highly integrated, scalable hardware platforms, customizable embedded © 2008 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands included processors, and pre-verified IP cores. Designers can then focus their efforts on creating differenti- herein are trademarks of Xilinx, Inc. All other trade- ated value, thereby getting products to market in time to extract maximum return and ultimately marks are the property of their respective owners. realizing the full potential of their own inspirations. The articles, information, and other materials included in this issue are provided solely for the convenience of our readers. Xilinx makes no warranties, express, implied, statutory, or otherwise, and accepts no liability with respect to any such articles, information, or other materials or their use, and any use thereof is solely at the risk of the user. Any person or entity using such information in any way releases and waives any claim it might have against Xilinx for any loss, damage, or expense caused thereby. Forrest Couch Publisher ON THE COVER

EMBEDDED APPLICATIONS SYSTEM DEVELOPMENT 48 1414 Next-Generation Wireless Standards Using Virtex-5 FXT Devices LTE baseband reference design implements hardware, Designing Multiprocessor SOCs software and high-speed off-chip comms for wireless baseband Using Xilinx EDK tools and IP, you can scale your applications processing on single device. by designing SOCs with multiple processors.

EMBEDDED APPLICATIONS PLATFORMS, PROCESSORS, & TOOLS 61 2222 61 FPGA Coprocessors for Spacecraft Image Processing C-to-FPGA design techniques speed development of performance-accelerated space-based Building Linux Platforms on Xilinx Processors imaging applications. Find out how a LinuxLink subscription accelerates your Linux development and mitigates project risks. 88 Cover Embedded Processing Innovations with Virtex-5 FXT FPGAs Get ahead of the embedded system design curve by maximizing performance and throughput with the built-in PowerPC 440 processor block. SECOND QUARTER 2008, ISSUE 64 Xcell journal

VIEWPOINTS Letter from the Publisher: The Next Wave in Embedded Processing Innovation...... 3 Xilinx Welcomes New President & CEO...... 7 Power.org: Presenting Power Architecture Technology...... 20

FEATURES Cover Embedded Processing Innovations with Virtex-5 FXT Devices...... 8

Embedded Applications Implementing Next-Generation Wireless Standards Using Virtex-5 FXT Device Parts...... 14 Developing FPGA Coprocessors for Performance-Accelerated Spacecraft Image Processing ...... 22 Using the MicroBlaze Processor to Develop Speed Sensors for F1 Racing Cars...... 28 FPGA-Based Controller Enables Precise Chemical Analysis...... 32 Reconfiguring the Battlespace...... 36

System Development Hardware/Software Optimization of Embedded Systems...... 41 Custom Processors in FPGAs ...... 45 Designing Multiprocessor SOCs ...... 48 Accelerating Video Development on FPGAs Using the Xilinx XtremeDSP Video Starter Kit...... 52

Platforms, Processors, & Tools Microprocessor Report: MicroBlaze v7 Gets an MMU...... 55 Building Linux Platforms on Xilinx Processors with LinuxLink ...... 61 Quickly Build Distributed Systems...... 65 Debugging Systems with FPGA Embedded Processors...... 68 A Complete Debugging Solution for Xilinx Embedded Processors...... 72 Reduce FPGA Design Time with PICO Express FPGA...... 75 Design Embedded Systems with Xilinx and Synplicity...... 78

RESOURCES Embedded Processing Application Notes and Reference Systems...... 81 Embedded Processing QuickStart!...... 82 Embedded Unlock your future

Enter the New Era of Configurable Embedded Processing

Adapt to changing algorithms, protocols and interfaces, by creating your next embedded design on the world’s most flexible system platform. With the latest processing breakthroughs at your fingertips, you can readily meet the demands of applications in automotive, industrial, medical, communications, or defense markets.

Architect your embedded vision • Choose MicroBlaze™, the only 32-bit soft processor with a configurable Get the Complete Embedded Solution MMU, or the industry-standard 32-bit PowerPC ® architecture • Select the exact mix of peripherals that meet your I/O needs, and stitch them together with the new optimized CoreConnect™ PLB bus

Build, program, debug . . . your way • Port the OS of your choice including Linux 2.6 for PowerPC or MicroBlaze • Reduce hardware/software debug time using Eclipse-based IDEs together with integrated ChipScope™ analyzer

Eliminate risk & reduce cost • No worry of processor obsolescence with Xilinx Embedded Processing technology and a range of programmable devices • Reconfigure your design even after deployment, reducing support cost Linux 2.6 and increasing product life Order your complete development kit today, and unlock the future www.xilinx.com/processor of embedded design.

www.xilinx.com/processor

At the Heart of Innovation

©2008 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners. Viewfrom the top Xilinx Welcomes New President & CEO Wim Roelandts Passes the Torch to Industry Veteran Moshe Gavrielov

On January 7, 2008, Xilinx announced the appointment of a new president and CEO, Moshe Gavrielov. He assumes the role from Wim Roelandts, one of the most revered CEOs in the industry. Roelandts remains chairman of the board, while Gavrielov becomes only the third Xilinx® CEO in the company’s 24-year history. In this issue’s View from the Top, Moshe provides a I see a three-fold opportunity for Xilinx automation, aerospace and defense, med- glimpse into his vision for Xilinx and how he to better serve our customers while achiev- ical, automotive, and consumer electronics. plans to get us there. ing higher revenues and increased market FPGAs offer a compelling value propo- share. First, we must maintain our pace in sition and significant technology advan- Let me start by saying how honored and expanding the underlying capabilities of our tages for both high-performance DSP and excited I am to assume responsibility for silicon. Second, we must continue to build embedded applications. Already, DSP and leading one of the most highly respected our portfolio of IP across a growing breadth embedded processors have opened an semiconductor companies in the world. of applications. And third, we must contin- entirely new set of opportunities for This is truly my dream job and I feel I’m ue to invest in our development tools so Xilinx. By integrating key functionalities joining the company at a very exciting that we are able to serve the diverse needs of into a single FPGA, designers can reduce time. We have a great opportunity to bring a growing, increasingly specialized commu- total system cost, power, and board space the benefits of Xilinx technology to a nity of traditional and new users. by reducing component count. Simple, broader range of industries and applica- Xilinx is already on the path to becom- right? Not so simple actually, which leads tions in the coming years. ing a complete solutions supplier. But we me back to my point on becoming a total FPGAs are becoming more and more have a ways to go before we get there. We solutions provider. Every designer – par- relevant to a wider range of designers. It’s need to learn from our customers across ticularly those coming from the DSP or a well-known fact that the best way for the globe, developing an intimate under- embedded space – will need more than designers to prepare for change is to arm standing of their pain points as if they were just FPGAs to make the proverbial leap themselves with options. The FPGA has our own. Then we need to provide solu- from an ASIC or ASSP. always provided these options through the tions to make their pain disappear. Xilinx must provide all of the necessary programmability of its hardware. We are tools, including IP, software tools, and the ultimate providers of faster time to On DSP and Embedded design methodologies that allow designers market. But it’s not just about silicon and Xilinx already had its eye on the DSP and to leverage the high-performance, flexible flexibility – it’s about the IP as well. In embedded markets when I arrived. Three capabilities of the FPGA. Oh yeah – and that way, Xilinx today is in a similar tran- years ago, the company announced the for- we need to make that leap as compelling sition to one I have faced before: moving mation of a new division to broaden the and simple as possible. I, for one, am look- from a supplier of blank gates to being a reach of Xilinx FPGAs into multi-billion ing forward to doing just that. supplier and supporter of the IP that goes dollar vertical markets previously the For the latest in Xilinx solutions designed into the gates. But supporting a body of domain of of ASICs and ASSPs, including to meet the needs of the DSP and embedded IP in the field is a non-trivial undertaking. audio, video and broadcast, industrial designer, visit www.xilinx.com.

Second Quarter 2008 Xcell Journal 7 COVER STORY EmbeddedEmbedded ProcessingProcessing InnovationsInnovations withwith Virtex-5Virtex-5 FXTFXT DevicesDevices Maximizing performance and throughput utilizing the built-in PowerPC 440 processor block.

8 Xcell Journal Second Quarter 2008 COVER STORY by Craig Abramson high-performance processing system while al enhancements in the Virtex-5 FXT Product Marketing Manager still allowing a wide variety of topologies. device’s embedded processor block. This Xilinx, Inc. You still have the flexibility and advantages results in an overall system cost reduction [email protected] of an FPGA-based implementation, but and invariably a more tightly integrated you now also have the added benefit of a processor system. Dan Isaacs hardened, integrated interconnect architec- The processor buses only take up three Director of Embedded Processor Marketing ture that (among other things) maximizes of the five “crossbar master” ports on the 5 Xilinx, Inc. access to external memory. x 2 crossbar (see Figure 1). The crossbar [email protected] As you will see, the result is an embed- includes two additional master ports, Ahmad Ansari ded block that allows you to develop a because in many real-world applications it’s Principal Engineer Xilinx, Inc. [email protected] Crossbar Crossbar Masters Slaves With the advent of the Xilinx® Virtex™-5 FXT FPGA, you have an opportunity to PLB Slave/DMA get ahead of the embedded system design curve. The need to quickly develop and val- Processor Instruction Read idate embedded systems has never been Memory Controller Interface more apparent than in the realm of embedded system design. Processor Data Read Combining software and hardware to PLB Master demonstrate this at a system level (as quick- Processor Data Write ly as time permits) has become common- place in the industry. By providing a more tightly coupled, flexible, scalable solution, PLB Slave/DMA you have a means to address many hardware and software SOC design challenges. FPGAs provide a significantly faster path for designers to rapidly develop, prototype, Figure 1 – The crossbar and test their embedded designs. The Virtex-5 FXT device platform, the third- generation FPGA to feature a PowerPC wider range of high-performance processing not just the processor that needs access to processor, has added an embedded block architectures in a shorter period of time. memory or peripherals. Each of these that will help you meet more demanding PowerPC processors generically have “crossbar master” ports comprises a proces- design requirements while allowing you to three interfaces: instruction read, data read, sor local bus (PLB) slave interface, as well finish your designs quickly and easily. and data write. In previous Virtex device as two channels of scatter-gather direct In this article, we’ll provide a detailed architectures, which embedded the memory access (DMA). description of the embedded processing PowerPC 405, these processor buses would The “slave” side of the crossbar com- innovations in the PowerPC 440 processor connect to FPGA fabric. The timing clo- prises two ports. One port is a dedicated block and system interconnect. A key area of sure requirements of this circuitry would memory controller interface that provides a focus in the Virtex-5 FXT FPGA processor vary based on how many and what types of high-throughput generic interface to soft block is simplification through integration. loads the design presented to the buses. memory controllers. The other is a bus for A corollary to this is ease of develop- In the Virtex-5 FXT FPGA (where the attaching I/O devices and peripherals. ment and test. Quickly bringing up a sys- processor is now the PowerPC 440), these tem to allow software developers to get a buses are hardened and hooked directly A Better Processor head start on actual hardware is a major to a new structure, an integrated 5 x 2 Providing all of this extra functionality in emphasis for the Virtex-5 FXT device’s crossbar switch – generically referred to as the embedded processor block would be of PowerPC 440 processor. the crossbar. This hardened interconnect little consequence if there were not a proces- provides significantly higher performance sor with the horsepower to take advantage of Simplification Through Integration (with virtually no consumption of FPGA it. The Virtex-5 FXT FPGA represents the Integration is key. We have reduced the logic resources and fixed timing) when first time anyone has embedded a PowerPC amount of FPGA logic needed to build a combined with the rest of the architectur- 440-class processor in an FPGA.

Second Quarter 2008 Xcell Journal 9 COVER STORY

ent memory windows mapped into the DMA memory space of any of the crossbar mas- DMA ters. These memory spaces can be pro- PLB Slv0 grammed through the FPGA bitstream, by the processor at run time, or even by I PowerPC external logic on the FPGA using the Memory I/F APU 440 R crossbar’s sideband bus, called the device Processor W PLB Master control register (DCR) bus.

PLB Slv1 Integrated PLB Interfaces As we mentioned earlier, many of the buses DCR DMA connected to the crossbar are processor DMA local buses, also called PLBs. The PLB is one of the standard CoreConnect buses as defined by IBM. Figure 2 – The PowerPC 440 embedded processor block An earlier version of the PLB (version 3.4) was used as one of the standard buses on The PowerPC 440 offers a significant per- In fact, up to four concurrent transac- PowerPC 405 designs in Virtex-II Pro formance improvement over the PowerPC tions can exist: two for each crossbar slave and Virtex-4 FX FPGAs and is also used 405 (which was embedded in previous Virtex (such as the memory controller or PLB in the new PowerPC 440 embedded families) in a number of areas. master). Additionally, each crossbar mas- processor block. First, the PowerPC 440, when in the ter (that is, the three processor PLBs and In the PowerPC 440 embedded proces- fastest speed-grade FPGA, can be clocked the two PLB slave interfaces) can pipeline sor block, the PLBs connect the processor’s at 550 MHz. The PowerPC 405 topped out four read and four write transactions to internal caches to the input side of the at 450 MHz. This is almost a 20% per- the same slave. crosspoint. The buses are: formance improvement. But add to that Another key feature of the crossbar is the fact that the I and D cache sizes are its highly programmable memory map- • ICURD: instruction cache unit read doubled, the instruction pipeline is seven ping. You can think of the entire system • DCURD: data cache unit read instead of five stages, and the execution of having available memory space of 4 unit can now execute two instructions out GB. Both the memory controller inter- • DCUWR: data cache unit write of order and in parallel. face and the PLB master can have differ- The result? You’ve got a processor with performance sufficient to handle a great many PPC405 PPC440 Benefit of today’s embedded processing challenges. (Virtex-4 FX FPGA) (Virtex-5 FXT FPGA) There are a number of other advantages to moving from the PowerPC 405 to the Architecture 32-bit instruction, 32-bit instruction, Access more physical PowerPC 440, as shown in Table 1. 32-bit address, 36-bit address, 128-bit data, memory, higher speed The PowerPC 440 embedded block is 64-bit data Book E compliant data movement shown in Figure 2. Pipline Single instruction/cycle, Two instructions/cycle, More efficient instruction High-Throughput Switch Matrix five-stage pipeline, seven-stage pipeline, execution The 5 x 2 crossbar is more than just a big in-order issue out-of-order issue switch. It provides non-blocking pipelined access from the five crossbar masters to the Caches – I/D 16K/16K, two-way set 32K/32K, 64-way set Less memory two crossbar slaves (see Figure 1). It allows associative, no locking associative, locking access latency concurrent transfers between different agents on the crossbar at the same time. MMU Page size: 1 KB to 16 MB Page size: 1 KB to 256 MB Less page swapping As shown, we’ll call the buses going into the crossbar “crossbar masters” and the DMPS Estimate 700+ DMIPS 1000+ DMIPS Better benchmarks equal buses coming out “crossbar slaves.” These higher performance interfaces are highly pipelined, thus allowing a large number of transactions to be in Table 1 – Virtex PowerPC integrated processor feature comparison progress at the same time.

10 Xcell Journal Second Quarter 2008 COVER STORY

PLB PLB Master Slave Device Device

PLB Slave Soft External DDR2 Memory ICURD Memory MCI Controller DCURD PPC440 MPLB DCUWR

PLB PLB Slave Slave PLB Slave Device Device

PLB PLB Master Slave Device Device

Figure 3 – PLB masters and slaves

The PLB used in the Virtex-5 FXT device Note that there can be no more than Optimized DMA Engines is version 4.6 (PLB46). The PLB46 bus archi- four PLB masters connected to each PLB There are four additional crossbar masters; tecture brings with it a number of new capa- slave bus. Few systems are likely to need they are the four DMA channels. Each bilities that give it a nice performance boost more than four masters, but if you did need DMA channel has separate 32-bit transmit over its predecessor. The most obvious is the more, you could always use the PLB/PLB and 32-bit receive interfaces. They share fact that while PLB34 was 64 bits, PLB46 is bridge IP provided with the Embedded crossbar arbitration with PLB slave inter- 128 bits. But not to worry – if the IP con- Development Kit (EDK) (see www.xilinx. faces, as shown in Figure 4. nected to the bus is less than that, the bus will com/support/documentation/ipembedprocess_ All DMA ports can be operating at the perform dynamic bus sizing as required to coreconnect_plbbusstruct.htm). same time. Each one has a dedicated FIFO, accommodate 32- and 64-bit transactions. Figure 3 is a simplified system diagram so as one DMA is accumulating data, the We should also point out that the PLB46 showing how PLB peripherals can be other DMA can be pumping data through version is a Xilinx implementation of the hooked to crossbar master and crossbar the crossbar. Each DMA channel operates IBM-defined PLB46, optimized to take slave ports. Note that if you have multiple asynchronously to the processor clock. advantage of FPGA resources. masters on any PLB, arbitration is handled The interface into the DMA channels is PLB46 – and indeed all versions of PLB by the IP for the bus. You do not need a through an interface called LocalLink. – have the concept of master and slave. This separate arbiter. Xilinx uses the LocalLink interface in a should not be confused with crossbar master and crossbar slave. (Again, refer to Figure 1.) As we stated earlier, there are two PLB slave port interfaces on the crossbar; they 32 Local Link RX are crossbar masters. These slave ports are DMA 32 Local Link TX Figure 4 – PLB slave connected to the FPGA fabric. 32 Local Link RX buses and DMA DMA 32 Local Link TX Arbitration

In a processor system there is often the channels share Programmable 128 PLB Slave PLB Slv0 need to allow something besides the proces- crossbar arbitration. sor to access external memory or on-chip peripherals. The PLB slave interfaces allow MCI just that. PLB masters, built from FPGA logic, connected to the PLB Slave ports (see MPLB Figure 3) can access either the MCI or the MPLB through the crossbar. 128 PLB Slave Similarly, the function of the PLB master PLB Slv1 32 Local Link RX (the one that is the crossbar slave) is to have DMA 32 Local Link TX 32 Local Link RX a PLB to hook to I/O devices and soft Arbitration

Programmable DMA 32 Local Link TX peripherals. And because the PLB master is a crossbar slave, anything hooked to a crossbar master port can access it.

Second Quarter 2008 Xcell Journal 11 COVER STORY

To maximize throughput and performance, the PowerPC 440 embedded block employs scatter/gather DMA. To make using this capability as easy as possible, Xilinx provides wrappers for the various pieces of IP and embedded blocks it offers. number of IP blocks. LocalLink is a point- This wrapper, combined with the func- High-Performance Dedicated Memory Interface to-point interface that sends packets to, or tionality of the DMA engine in the Rounding out the new processor block is the receives packets from, some external device. PowerPC 440 embedded block, allows dedicated memory controller interface. The The most notable type of processor IP you to easily build a processing system purpose of this interface is to provide a ded- that uses the LocalLink interface is the hard with a high-performance TEMAC con- icated link out to external memory, but at embedded tri-mode Ethernet media access nected directly to the PowerPC 440 DMA the same time not be tied to any specific controller (TEMAC) block. The TEMAC channels. Figure 5 is a simplified system memory technology. has a wrapper that allows it to communicate showing how both DMA and PLB periph- At this time, the memory controller directly with the PowerPC 440 DMA. erals can be hooked to crossbar master interface supports a stand-alone DDR2 con- Although all data paths through the and crossbar slave ports. troller and MPMC4 controller, all available crossbar are 128 bits, the LocalLink inter- The DMA channels are controlled by through Xilinx Platform Studio, EDK 10.1. face into and out of the DMA channels are descriptors, small blocks of memory that are This interface provides the flexibility to con- all 32 bits. As such, there is built-in logic set up by the PowerPC 440 processor before nect to virtually any memory technology between the DMA controller and the cross- commencing DMA operations. The descrip- now or in the future. bar that realigns data. tors control how much data is transferred and The memory controller interface is To maximize throughput and perform- where data is located in system memory. streamlined, comprising address/data/ ance, the PowerPC 440 embedded block Descriptors can be chained together if control. It can be programmed to support employs scatter/gather DMA. To make need be, effectively creating a sequence of 128-, 64-, 32-, or even 16-bit memory. It using this capability as easy as possible, commands to control a DMA channel. does width and burst realignment, so while Xilinx provides wrappers for the various The DMA controller is covered in its the DMA may be bursting one size packet, pieces of IP and embedded blocks it offers. entirety in the reference guide, entitled the memory controller can buffer and The first one targeted specifically “Embedded Processor Block in Virtex-5 realign the packet data to maximize the toward the PowerPC 440 is the soft wrap- FPGAs” (http://www.xilinx.com/support/ bandwidth to the memory. Burst size is per for the embedded TEMAC blocks. documentation/user_guides/ug200pdf). programmable and can be 1, 2, 4, or 8, and the memory controller interface will automatically adjust the address to accom- PLB PLB Master Slave modate the various burst widths. Device Device The majority of soft memory controller handshaking signals are generated by the PLB Slave Soft External Memory DDR2 interface on behalf of the memory con- ICURD MCI Controller Memory DCURD troller. They are provided ahead of time such PPC440 MPLB DCUWR that the soft memory controller can generate

PLB PLB throttling signals back to the memory inter- Slave Slave PLB Slave face. The memory controller interface – on Device Device and behalf of the soft memory controller – can PLB PLB DMA Master Slave also be programmed to detect bank and row Device Device misses ahead of time and will inform the soft Local Link Interfaces memory controller to anticipate a bank or row miss. All of these features together pro- Wrapper Wrapper vide a solution whose primary goal is to Hard Hard maximize memory throughput. TEMAC TEMAC

Tuning the System Figure 5 – Sample system with both PLB and DMA peripherals In some situations, a PLB or DMA interface just may not be the right solution. For

12 Xcell Journal Second Quarter 2008 COVER STORY

includes Platform Studio, with its comprehensive library of IP for hardware design, and Platform Studio SDK, a software Virtex-5 FXT Development Platforms development environment familiar to Jump-Start Your Design many embedded software engineers. Dual PPC With the introduction of the Virtex-5 Single PPC FXT family of devices, we continue to further strengthen our third-party alliances with support from industry-leading operating system providers, including WindRiver Systems with VxWorks and Green Hills Integrity. Linux support is provided through LynuxWorks, Monta Vista, and WindRiver ML507 Evaluation Board ML510 Development Board • XC5VFX70T-1FF1136C • XC5VFX130T-1FF1738C Systems. In addition, Xilinx recognizes the • PCie x 1 Plug-In or Stand-Alone • ATX Form Factor importance of open-source Linux, and we’re moving forward on that front. Xilinx and its partner companies are also developing a wide variety of boards. Figure 6 – Xilinx ML507 evaluation and ML510 development boards Xilinx has multiple boards for the Virtex-5 FXT device: the ML507 with the instance, you might find that you have a figured in multiple ways. Virtually every XC5VFX70T and the ML510 with the software algorithm that takes too many interface is programmable. XC5VFX130T, as shown in Figure 6. The cycles to execute and is affecting your sys- For example, when you build your pro- ML507 evaluation platform enables your tem bandwidth. That algorithm is a great cessing system in the Xilinx Platform team to quickly begin developing hard- candidate for implementation in hardware, Studio development environment and a ware, software, or both. When multiple and the interface to which you may want bitstream is created, all of the specifications processors or a motherboard-type platform that hardware connected would be the aux- of the processing system are in the bit- are required, the ML510 with the dual- iliary processing unit, or APU interface. stream. Thus, when the FPGA starts up, processor XC5VFX130T is ideal. The PowerPC 440 has a second-genera- your processor is up and running. tion APU interface that is tightly coupled Now, let’s say the processing system is up Conclusion to the execution units of the processor. The and running and you want to modify the A high-performance processing solution interface is controlled by 16 user defined operation of one of the DMA channels. You with optimized data throughput is high on instructions (UDIs). The data path of the would do that through the DCR interface. the wish list of embedded designers every- APU interface is 128 bits. There are DCR registers to control every where. This is true whether you are run- Perhaps the most common use of the aspect of DMA operation. ning critical algorithms at the heart of the APU interface is for connecting to a float- In fact, there is DCR access to virtually latest wireless base station, switching high- ing-point unit (FPU). The FPU is every other subsystem of the embedded bandwidth data through a video switch, IEEE754-compatible and supports both block: the PLBs and crossbar, memory con- performing advanced signal processing for single- and double-precision operations for troller interface, and the APU controller. guidance systems using coprocessor accel- the PowerPC 440. Refer to Figure 2 for more details. eration, or handling complex control and The FPU is implemented in the FPGA system management tasks. soft logic fabric and utilizes the DSP48E Putting It All Together The Virtex-5 FXT embedded processor blocks. The soft logic implementation This innovation would be for naught if block, with a multi-ported non-blocking operates up to half the frequency of the Xilinx did not provide a comprehensive infra- integrated processor interconnect and hard embedded processor. structure to take advantage of all of the archi- high-performance integrated DMA, offers Other uses of the APU interface tectural enhancements. We should point out a solution that allows you to focus on the include hardware algorithm acceleration, that the Virtex-5 FXT FPGA with the key elements of your embedded design. as well as an alternative high-bandwidth PowerPC 440 block represents our eighth With a virtually unlimited number of link to block RAM. year in embedded processing and our third- ways to harness these embedded capabili- generation FPGA with a hardened processor. ties, the Virtex-5 FXT FPGA embedded Configuring the Embedded Block Throughout that time we’ve been con- processing solution provides a highly inte- By integrating the PowerPC 440 block in stantly updating EDK, our award-winning grated platform for high-performance, the FPGA, the processor block can be con- Embedded Development Kit. EDK high-throughput SOC designs.

Second Quarter 2008 Xcell Journal 13 EMBEDDED APPLICATIONS ImplementingImplementing Next-GenerationNext-Generation WirelessWireless StandardsStandards UsingUsing Virtex-5Virtex-5 FXTFXT DeviceDevice PartsParts The Xilinx LTE baseband reference design shows how the Virtex-5 FXT device enables hardware, software, and high-speed off-chip comms for wireless baseband processing – implemented on a single device.

by Rob Payne Staff Engineer Xilinx, Inc. [email protected]

The next generation of the 3GPP wireless standard is called long-term evolution (LTE). It provides a leap in performance and a complete move to packet-based processing. In the physical (PHY) level of the LTE specification, specific challenges exist when dealing with higher data throughput rates, as well as the move to OFDM (orthogonal frequency-division multiplexing) for transmission. Xilinx has developed – or is in the process of developing – several new or revised DSP LogiCORE™ solutions to meet the demands of the new specification. With such blocks it is critical not only to verify them as stand-alone blocks, but also to validate them in real systems with real- world data. The Xilinx 3GPP downlink reference design provides this validation, as well as providing a reference to customers about how to use the blocks.

14 Xcell Journal Second Quarter 2008 EMBEDDED APPLICATIONS

The higher data rate in LTE places grid are allocated to different users’ data as 3GPP LTE Downlink Processing increased processing demands on all parts resource blocks. The task of scheduling data Figure 2 shows the processing stages in the of the system: increased DSP hardware transmission and allocating resource blocks baseband section of the 3GPP LTE down- processing in the baseband, increased soft- to users is performed by the higher software link. The processing for both transmit and ware processing to implement the higher layers in the LTE stack. receive can be split into two main sec- layers of the UMTS protocol stack, and increased I/O communication bandwidth to accept packets and pass data to remote radio-heads. 1 subframe = 1 ms = 14 OFDM Symbols In this article, I’ll review some of the new features of the LTE specification and how

Xilinx® Virtex™-5 FXT devices address the 0 1 2 3 4 5 6 7 8 9 10 11 12 13 increased processing demands of LTE through its tight integration of microprocessor subsystem, DSP-enhanced FPGA fabric, and high-speed communication links.

The 3GPP LTE Physical Layer One of the key changes in the Layer 1 Resource Grid (PHY layer) of the 3GPP LTE is the change from CDMA (code-division multi- Frequency Resource Block is a 12 x 14 block of resource elements ple access) to OFDM (orthogonal fre- in a subframe. quency-division multiplexing). One of the main benefits of OFDM is that it reduces the problems associated with multiple paths in the radio channel. In CDMA, a significant amount of processing must be devoted to characterizing and tracking the radio channel to compensate for the effects of fading in the channel. Figure 1 illustrates the structure of an example LTE subframe. The subframe Sub-Frequency = 15 kHz comprises a number of OFDM symbols. Each OFDM symbol provides the data input for an inverse fast Fourier transform (IFFT). In LTE this may be as many as 2,048 input points for in-phase (I) and quadrature (Q) components. A subframe can be represented as a resource grid, where each resource element in the grid comprises a single I/Q input KEY: point for the IFFT in an OFDM symbol. User 1 Resource Blocks Resource grids can be layered to provide User 2 Resource Blocks data to multiple antennas, supporting User 3 Resource Blocks transmission schemes such as transmit Synchronization Signal diversity or MIMO (multiple input/multiple output) techniques. Control Channel The resource grid is allocated to differ- Time ent purposes. Resource elements are allocated to control channels, data channels, and synchronization signals. The diagram also Figure 1 – LTE uses OFDM. A subframe comprises a resource grid, with areas shows the packetization of data on the allocated to control, synchronization, and user data. Each column of the grid forms an OFDM symbol that is converted to the time domain by an IFFT. channel – different areas of the resource

Second Quarter 2008 Xcell Journal 15 EMBEDDED APPLICATIONS

between transmitter and receiver to Downlink TX Downlink RX increase data rates).

Transport Blocks Transport Blocks The next step is to map samples to

TB CRC Insertion TB CRC Checking resource elements in the OFDM resource

Transport Blocks + CRC Transport Blocks + CRC grid. This stage also adds synchronization TB Segmentation TB Reassembly signals to the resource grid, allowing the

Segmented Blocks Segmented Blocks receiver to synchronize with the transmitter. CB CRC Insertion CB CRC Checking The output of this stage is passed to an

Symbol Rate Segmented Blocks + CRC Segmented Blocks + CRC Processing IFFT. The IFFT takes the samples from the

Turbo Encoder / frequency domain to IQ signals in the time Convolutional Turbo Decoder / domain, which are ready to be sent to the Encoder Convolutional radio-head for transmission. OFDM Decoder Codeblocks requires a cyclic prefix (CP) added to the Rate Matcher data, which repeats the end of the time- domain signal at the start of the output. The

Punctured Codeblocks Receive Likelihoods CP size is determined by the size of the QAM Modulation LLR Generation mobile cell and reflections. A sufficiently I/Q Samples (Freq Domain) I/Q Samples (Freq Domain) larger CP is required to remove multi-path Layer Mapper Layer Demapper effects from the OFDM symbol. Antenna I/Q Samples Antenna I/Q Samples Resource Resource Mapping Receive Sample Rate Processing Mapping Sample Rate Sync Insertion Processing The receive processing generally follows the OFDM Symbols OFDM Symbols inverse of the transmit processing. The first step is to do an FFT on the incoming data IFFT FFT to convert the time-domain signal back to + CP Insertion + Synchronization the frequency domain. In the reference design, data is received across all OFDM I/Q Samples (Time Domain) sub-frequencies for all users, but a real mobile user would only decode data on the resource blocks that it had been allocated. Figure 2 – 3GPP LTE downlink processing: transmit and receive chains Synchronization at this step is also performed to synchronize the system to the start of each sub-frame in the OFDM data. tions: symbol rate processing and sample Each segment then has a CRC added The output of the FFT is processed through rate processing. before it is supplied to the forward-error a layer demapper that inverts the layer map- Symbol-rate processing is centered encoder: for data channels this is a turbo ping in the transmission. around forward error correction, used to encoder and for control channels this is a add redundancy to the data stream in a convolutional encoder. Following the encod- Receive Symbol Rate Processing bandwidth-efficient manner and to recov- ing, rate matching is applied to the output to The first step in the receive processing is to er data on the receiver. Sample-rate pro- puncture the data so that it will fill the avail- take modulation symbols and convert cessing is centered around the IFFT/FFT able resource blocks in the OFDM resource them to individual bits. For turbo-encoded that performs the OFDM part of the grid. Finally, the data stream is modulated data, this is a set of probabilities in the baseband operation. with a specified modulation (QPSK, form of logarithmic-likelihood-ratios for QAM16, or QAM64) to give a sample value the turbo decoder. For convolutionally Transmit Symbol Rate Processing to go in the OFDM resource grid. encoded data, it is a distance metric that is The first stage in the Layer 1 (PHY) process- then fed to a Virterbi decoder. The output ing of the LTE downlink takes transport Transmit Sample Rate Processing of the error correction is checked for CRC blocks from the media access controller Samples are first mapped to different validity and reassembled into the original (MAC) layer. Transport blocks have cyclic antenna layers. This mapping allows sup- transport blocks. redundancy checks (CRCs) added, while port for schemes such as transmit diversi- larger transport blocks may be segmented to ty (multiple transmit paths to reduce The LTE Baseband Reference Design ensure that blocks do not exceed a maximum noise at receiver) or MIMO (MIMO LTE has required a number of new or size supported by the forward-error encoder. techniques utilize multiple channels revised Xilinx LogiCORE solutions to meet

16 Xcell Journal Second Quarter 2008 EMBEDDED APPLICATIONS

(EDK). XPS was the framework to pull in TX PC TX Board RX Board RX PC ML507 ML507 the various LogiCORE systems that needed

VLC Control Test Application Test Application VLC Control validating, plus additional blocks that were Server GUI UDP LTE Bridge UDP LTE Bridge Client GUI implemented in a mixture of VHDL and

LTE LTE Xilinx System Generator for the DSP por- Drivers Drivers tions of the design. Xilinx Xilinx Windows LWIP LTE TX LTE RX LWIP Windows IP Stack IP Stack CHAIN CHAIN IP Stack IP Stack PDSCH LTE Implementing on the Virtex-5 FXT FPGA Encode Turbo Decode Figure 3 shows a top-level block diagram of FFT v5 the LTE baseband reference design. The system shown consists of two ML507s, Ethernet FX70T Aurora Aurora FX70T Ethernet MAC/PHY LLTEMAC GTX GTX LLTEMAC MAC/PHY which act as IP packet bridges between the Gigabit Ethernet and LTE protocol stacks. Gigabit Ethernet Aurora Link Gigabit Ethernet The reference application is video

KEY: streaming using the open-source Virtex-5 FXT Features: PPC440 Software DSP Blocks High-Speed I/O Blocks VideoLan Server. The VideoLAN server runs on the transmitter PC, which com- Other Features: New / Revised Ref Design GUI Off-the-Shelf LogiCORE Solution municates to the system over the Gigabit Ethernet link. Video is received by the Figure 3 – Xilinx LTE baseband reference design showing system integration of hardware VideoLAN client on the receive PC. A and software blocks. Blocks are stacked into respective protocols so that higher layers use services control GUI also runs on the PC, allow- of lower-layer blocks. Peer-to-peer communication goes from left to right. ing you to set up the LTE blocks to be changed and monitored remotely. the new requirements of the specification. amount of additional design work for the Video packets are transmitted through For example, the turbo encoder uses LTE- reference design. We also wanted to mini- the LTE software driver and into the LTE specific interleavers; the IFFT requires the mize the system integration and tool issues downlink transmission blocks previously addition of a variable-length cyclic prefix; and use off-the-shelf boards and IP blocks described. The output I/Q data from the and the turbo decoder requires a far higher as much as possible. LTE downlink is then sent out over an throughput than previous 3GPP standards. Previous reference systems for WCDMA Aurora link. Aurora was chosen initially Verifying these individual blocks pres- Release 6 of the 3GPP specifications had over dedicated base station standards such ents a number of challenges. First, the LTE used a separate DSP microprocessor board as CPRI/OBSAI protocols because of its standard is still changing and has not been coupled with an FPGA board. This meant early availability on FXT parts. ratified. Second, the size of OFDM symbol using different tool flows for software and The I/Q data is received on the receive (potentially 2,048 points) and the length of hardware design, plus the overhead of ML507 board and passed into the LTE the sub-frame and turbo-decoder iterations designing interposer boards to connect the downlink receive chain. Processing and required mean that a large number of simu- main boards together. communication follows an inverse order to lation cycles are required to verify the For the LTE baseband reference design, that for the transmission, and finally deliv- behavior of even a single transport block we chose to avoid these issues by gaining ers video packets to the VideoLAN client being decoded. This can limit the test cov- early access to the Xilinx Virtex-5 FXT on the receiving PC. erage achievable in simulation. Finally, just device silicon. The FXT parts integrate on a Figure 3 is color-coded to highlight how running unit tests on blocks does not test single chip a PowerPC 440 microprocessor, the different features of the FXT parts are the macroscopic behavior of systems such as DSP-enhanced FPGA fabric, and high-speed used. The software elements of the system transport-block error rates, or validate that GTX off-chip communication blocks. This that run on the PowerPC 440 are highlight- blocks have compatible interfaces. decision minimized our system integration ed in dark blue. DSP functions comprising To address these issues, the design team problems (as all the key components were on the LTE downlink transmit and receive pro- decided to implement an LTE reference a single chip) and allowed us to use the stan- cessing are in orange and high-speed I/O system design that would provide system- dard customer design board, the ML507. blocks are in light blue. New or revised level validation of the new LTE LogiCORE In addition, the design simplified our LogiCORE solutions are shown in yellow. IP using real-world data sources such as tool flow. The top level of the system was We found we could implement both video streams. integrated in Xilinx Platform Studio (XPS), transmit and receive functionality on a sin- Since the main aim of the reference sys- which allows a large number of pre-verified gle FX70T part, so by looping back the data tem was to validate new IP LogiCORE blocks to be pulled into the system from the on the Aurora link, we could implement the solutions, we wanted to minimize the Xilinx Embedded Development Kit system on a single ML507.

Second Quarter 2008 Xcell Journal 17 EMBEDDED APPLICATIONS

Get Published

Figure 4 – Screenshot of demo running. The GUI allows variable modulation, encoding rate, and channel noise. Transmitted video is in the upper right-hand corner. Received video (with time lag for software video decode buffers) is shown for two users in the bottom half of the figure. Uncorrected errors appear as artifacts in the video decode.

Final Demonstration System imize efficient use of the channel, the base Figure 4 shows a screenshot of the final station has to balance the cost of retransmis- demonstration system in operation. The sion against the benefits of lower overall data Would you like transmit video stream is shown along with encoding rates, and also attempt to mini- to be published the video stream received by two different mize latency in the system. These may feed in Xcell LTE mobile users after the packets were into quality of service (QoS) parameters transmitted over a noisy channel. The con- used by higher layers in the LTE protocol. Publications? trol GUI allows variable noise in the chan- It's easier than you think! nel, along with modulation and encoding Conclusion rate parameter settings for each user. The The Xilinx Virtex-5 FXT device provides a Submit an article draft for our Web-based number of iterations used by the turbo tightly coupled integration of processor or printed Xcell Publications and we will decode is also variable. subsystem, DSP-enabled FPGA fabric, and assign an editor and a graphic artist The demonstration allows us to exam- high-speed communication. Such high lev- to work with you to make your work ine the behavior of the LogiCORE solu- els of integration have allowed both the look as good as possible. tions when placed in a system with hardware and software elements of the LTE For more information on this real-world data. In the case of the video baseband reference system to be integrated streams shown in Figure 4, we used the on a single Xilinx FX70T part using stan- exciting and highly rewarding program, same modulation scheme (QPSK) for both dard hardware boards. please contact: users. However, for user 1, we started By using blocks available in the EDK toolkit to maximize IP reuse, and using increasing the data encoding rate, thus Forrest Couch reducing redundancy in the data. Xilinx Platform Studio as a single integra- Publisher, Xcell Publications As we pass an encoding rate of 0.8, we tion framework, design teams can concen- [email protected] cross a threshold and start to see a significant trate on the novel parts of the LTE downlink number of errors in the decoded video design. This has allowed rapid development stream at this value of SNR. These errors and tracking of changes in the LTE specifi- appear as artifacts in the decoded video cation as it approaches ratification. stream. In a real base station, higher layers in Many thanks to other members of the LTE baseband the LTE protocol stack would retransmit design team: Beth Cowie, Graham Johnston, Andrew extra redundant data for the packet. To max- Laney, Neil Lilliott, Jorge Seoane, and Bill Wilkie.

18 Xcell Journal Second Quarter 2008 More gates, more speed, more versatility, and of course, less cost — it’s what you expect from The Dini Group. This new board features 16 Xilinx Virtex-5 LX 330s (-1 or -2 speed grades). With over 32 Million ASIC gates (not counting memories or multipliers) the DN9000K10 is the biggest, fastest, ASIC prototyping platform in production. User-friendly features include: • 9 clock networks, balanced and distributed to all FPGAs • 6 DDR2 SODIMM modules with options for FLASH, SSRAM, QDR SSRAM, Mictor(s), DDR3, RLDRAM, and other memories • USB and multiple RS 232 ports for user interface • 1500 I/O pins for the most demanding expansion requirements Software for board operation includes reference designs to get you up and running quickly. The board is available “off-the-shelf” with lead times of 2-3 weeks. For more gates and more speed, call The Dini Group and get your product to market faster. www.dinigroup.com • 1010 Pearl Street, Suite 6 • La Jolla, CA 92037 • (858) 454-3419 • e-mail: [email protected] VIEWPOINT Presenting Power Architecture Technology The Power Architecture processing platform experiences rapid growth in unit shipments and software support.

by Mike Paczan CTO Power.org [email protected]

Power Architecture processing technology is the common thread for a very broad range of computing devices, from 32-bit microcontrollers to 64-bit ASICs. It is also found in some of the most sophisticated FPGAs available today, including Xilinx® Virtex™-4 FX and Virtex-5 FXT FPGAs. More than a billion Power Architecture- based chips have been built into electronic equipment since 1991, when the processing platform was first introduced as IBM’s POWER (Performance Optimized With Enhanced RISC). Every Power Architecture-based chip is rooted in the Power Instruction Set Architecture (ISA), a processing specification spanning server and embedded computing capabilities and incorporating the AltiVec vector (SIMD) processing extension. The Power ISA serves as the basis for future advancement of this highly success- Xilinx; the Cell Broadband Engine from tools, software, systems, and services to ful processing platform. IBM, Sony, and Toshiba; the speed up and simplify product develop- Although the Power ISA name may be PowerQUICC line of SOCs from ment. In any given product category – new to some, its features are familiar to Freescale; the POWER5 and POWER6 from custom design services to real-time thousands of software, hardware, and tool server chips from IBM; and, of course, operating services – there are multiple developers who have worked with hundreds of products from companies all providers from which to choose. The prod- PowerPC devices over the past 16 years. over the world. ucts and services these vendors offer are Power Architecture technology underlies Silicon is just a small part of the com- not only state-of-the-art but in many cases many well-known chip families, includ- plete technology platform: hundreds of have also been refined by decades of expe- ing full-featured Virtex FPGAs from companies provide Power Architecture rience in the marketplace.

20 Xcell Journal Second Quarter 2008 VIEWPOINT

Expanding Market Opportunities set of hardware debugging interfaces for The Power Architecture community’s focus Power Architecture implementations on improving customers’ systems design • Publication of consolidated application experience, along with our product variety binary interface specifications for 32- and vendor quality, has led to tremendous and 64-bit Power Architecture imple- growth in the technology platform. Between mentations with neutral licensing that 2005 and 2006, unit shipments for Power permits wide use and reference by cus- Architecture processors grew by 61%. tomers, third parties, and other interest- FPGA Powered – Today, Power Architecture processing ed parties to support the development That‘s it! technology is used in a huge variety of of tools, operating systems, and other advanced electronics. For example, the platform technologies world’s two fastest supercomputers run Power Architecture processors, as do five of • Platform requirements specification for the world’s 10 best-performing enterprise embedded systems built on Power servers. Power Architecture controllers are Architecture technology in more than half the new cars on the road. • An open-source Hypervisor optimized Practically every phone call, e-mail, and for embedded systems Web page is delivered by equipment using FPGA-Module: Power Architecture microprocessors. Power • New models and simulation specifica- TQM hydraXC – Architecture devices are also pervasive in tions that support the construction of consumer game consoles, including the “virtual prototypes” with Power smallest, most universal Microsoft Xbox 360, Nintendo Wii, and Architecture processors Hardware Platform for Sony PlayStation 3. • Technical training for Power Reconﬁ gurable Computing Looking forward, the communications Architecture products offered online • Based on XILINX Spartan 3, Virtex 4 and Virtex 5 technology infrastructure, automotive control, aero- through webinars or in person at our • Ethernet 10/100, USB 2.0, RTC space, and defense market segments will regional Power Architecture confer- • SPI-, NAND-Flash, DDR2 / SDRAM continue to be strongholds for Power ences in Europe and Asia • Size 2.13 Inch x 1.73 Inch Architecture technology. The consumer (54 mm x 44 mm) market – digital televisions, DVD players, Conclusion • Programmable VCC IOs set-top boxes, and particularly game con- Power.org’s open-community approach to soles – will drive strong double-digit managing Power Architecture technology Embedded solution with growth for Power Architecture devices has already resulted in many tangible bene- TQM hydraXC for through 2011. Xilinx Virtex FPGAs with fits. For instance, in 2007, Power.org’s • Fast time to market PowerPC cores will remain a vital part of members completed the Server Power • Economical series production • Highest ﬂ exibility the Power Architecture portfolio for many Architecture Platform Reference (sPAPR), • Hardware reduction of these major markets. an open specification for streamlining the :: Starter kits available :: development of Power Architecture prod- Enhancing Architecture Through Power.org ucts for Linux. This specification, built on Dozens of influential companies provid- more than 20 IBM patents, was made Starter kit ing chips, tools, software, services, and available to corporate members of with module systems have joined together in Power.org Power.org royalty-free. Along with this to develop open specifications and stan- specification, a compliant reference design dards for the Power Architecture platform. was also released based on IBM’s 970MP To simplify the development experience processor. This was made available with a for system designers, Power.org’s members Linux stack, Hypervisor, and firmware. will complete a number of projects and By working together within Power.org, specifications this year: our member companies are continuing to improve the product design experience for • Searchable online product catalog cover- our many shared customers and to expand ing almost all silicon, tools, software, business opportunities for the entire Power Email: [email protected] and services offerings available for the Architecture community. www.tq-group.com Power Architecture technology platform For more information about Power.org, Infoline: (+49 ) 8153 / 93 08 - 333 • Specifications establishing a common visit www.power.org.

Second Quarter 2008 Xcell Journal 21 EMBEDDED APPLICATIONS Copyright 2008 California Institute of Technology. Government sponsorship acknowledged. Developing FPGA Coprocessors for Performance-Accelerated Spacecraft Image Processing C-to-FPGA design techniques speed development of space-based imaging applications.

by Paula J. Pingree Senior Engineer Jet Propulsion Laboratory, California Institute of Technology [email protected]

Lucas J. Scharenbroich Staff Engineer Jet Propulsion Laboratory, California Institute of Technology [email protected]

Thomas A. Werne Associate Engineer Jet Propulsion Laboratory, California Institute of Technology [email protected]

David Pellerin CTO Impulse Accelerated Technologies [email protected]

Fast and accurate on-board classification of image data is a critical part of modern satellite image processing. For Earth sciences and other applications, space-based smart payloads make use of intelligent, machine- learning algorithms and instrument autonomy to detect and identify natural phenomena such as flooding, volcanic eruptions, and sea ice break-up.

22 Xcell Journal Second Quarter 2008 EMBEDDED APPLICATIONS

The Jet Propulsion Laboratory (JPL), a FPGAs for Onboard Computation power and lower cost than general-purpose, National Aeronautics and Space Onboard computation has become a sig- single-board computers (SBCs). FPGA plat- Administration (NASA) laboratory, has nificant bottleneck for advanced, space- forms offer breakthrough performance over developed support vector machine (SVM) based scientific and engineering radiation-hardened SBCs, leading to entire- classification algorithms used on board applications. Currently available spacecraft ly new architectures for smart payloads. spacecrafts to identify high-priority image processors have high power consumption, To evaluate the potential for accelera- data for downlinking to Earth. These are expensive, require additional interface tion using FPGAs, we selected the Xilinx algorithms also provide onboard data boards, and are limited in their computa- ML410 evaluation platform for develop- analysis to enable rapid reaction to tional capabilities. ment and demonstration of selected smart dynamic events (Figure 1). These onboard payload concepts. The Xilinx ML410 eval- classifiers help reduce the amount of data uation board comes equipped with a downloaded to Earth, greatly increasing Virtex-4 FX60 FPGA that features two the science return of the instrument. embedded PowerPC 405 processors and a SVM classification algorithms are flying large amount of available FPGA logic. today, using computational platforms such The specific algorithm we chose for our as the RAD6000 and Mongoose V proces- investigations is the SVM classification sors. These legacy processors have only lim- algorithm. Algorithms of this type have ited computing power, extremely limited found broad application in general machine active storage capabilities, and are no longer learning and classification tasks, as well as considered state-of-the-art. For this reason, for onboard remote sensing. An SVM is a onboard classification has been limited to maximum margin classifier that calculates a only the simplest functions running on only separating hyperplane between two labeled a subset of the full instrument data: for classes such that the distance to the nearest example, only 11 of 242 bands in the case data in each class is maximized (Figure 2). of the Hyperion instrument on the Earth Figure 1 – NASA uses smart payloads to By selecting such a maximum margin classify Earth image data and reduce the Observing-1 (EO-1) satellite. amount of data required for downloading hyperplane, the SVM classifier can exhibit FPGA coprocessors are an ideal candi- (Image courtesy of NASA). better generalization to new data than other date for these algorithms. FPGAs can pro- linear classification methods. vide significant improvement in onboard The goal of training a support vector + classification capability and accuracy when H machine is to learn a set of weights such compared to the legacy processing plat- H that the sign of a weighted sum of dot – forms now flying. H products between the training data, xi, and To evaluate the effectiveness of FPGAs a test vector – t – will correctly predict the for SVM algorithms, we implemented a class of the new data vector. legacy snow-water-ice-land (SWIL) classi- SVMs also incorporate what is known fier, originally developed for the as the kernel trick, a method allowing them Hyperion instrument, on the Xilinx® to be extended from purely linear to non- Virtex™-4 FX60 FPGA. To develop the linear classifiers. This method involves for- application more quickly, we took advan- mulating the training and testing tage of the Impulse C-to-FPGA compiler algorithms in terms of dot products and margin tools provided by Impulse Accelerated then replacing the dot products with a ker- Technologies. These tools support the nel function that represents a dot product rapid development of highly parallel Figure 2 – Calculating a separating hyperplane after passing the arguments through some using an SVM. The circled data points are the hardware algorithms and applications. support vectors that lie on the margin. non-linear function. The kernel function This article describes our approach to permits the high-, or even infinite-dimen- implementing the Hyperion linear SVM on Recently developed hybrid FPGAs, such sional, dot products in the non-linear space the Virtex-4 FX60 FPGA, as well as addi- as the Xilinx Virtex-4 FX device, offer the to be computed using terms from the orig- tional experiments that we performed using versatility of running diverse software appli- inal, low-dimensional space. an increased number of data bands and a cations on embedded processors while at the SVMs are well suited to onboard more sophisticated SVM kernel. These same time taking advantage of reconfig- autonomy applications. The property that experiments show the potential for more urable hardware resources, all on a single makes SVMs particularly applicable is the efficient, higher performance onboard classi- chip. These tightly coupled, single-chip asymmetry of computational effort in the fication using FPGA-embedded algorithms. hardware/software systems offer lower training and testing stages of the algo-

Second Quarter 2008 Xcell Journal 23 EMBEDDED APPLICATIONS

rithm. Classifying new data points requires orders-of-magnitude less computation Input Output than training because the process of train- Image File Image File ing an SVM requires solving a quadratic optimization problem. SVM training requires O(n 3) operations, where n is the number of training examples. Data Input/Output In contrast, testing a new vector with a trained SVM requires only O(n) operations. Faster training algorithms that exploit the specific structure of the SVM optimization SVM problem exist, but the training remains the primary computational bottleneck. After training the SVM, many of the weights, wi, will be equal to zero. This means that these terms can be ignored in the classification formula. Input vectors that have a corresponding non-zero weight are called support vectors. Reducing the number of support vectors is key to successfully deploying an SVM classifier on board a spacecraft, where there are severe constraints on the amount of CPU resources available. Previously deployed classifiers have used such reduced-set methods, but were still constrained to operate on only a subset of the available classification features. Removing such bottlenecks is critical to realizing the full potential of SVMs as an onboard autonomy tool. Figure 3 – Software/hardware partitioning for the SVM algorithm

Partitioning the Problem When using FPGAs with embedded proces- CompactFlash card installed in the quency of the MicroBlaze processor. Also, sors, efficient partitioning of algorithms ML410 board. the MicroBlaze processor would be instan- between software and hardware is important The software-side application streams tiated in valuable FPGA fabric, whereas the to achieve high performance. For the the image data to the SVM, which is also PowerPC exists external to the fabric. FPGA-based development of the SVM, we written in C but has been compiled Because the 256-MB DIMM is the implemented a previously software-only (using the Impulse C-to-FPGA compiler) largest source of memory on the board, legacy algorithm in the FPGA hardware fab- to FPGA hardware. The SVM hardware we used it as main memory for the pro- ric to take advantage of the FPGA’s high- process performs the required SVM oper- gram. The processor local bus (PLB) is a speed parallel processing capabilities. The ation on the image and streams the results high-speed bus (compared to the on-chip image file input and classification file output back to the PowerPC 405 processor. The peripheral bus [OPB]) that allows for fast are managed within the embedded processor then writes the pixel classifica- data transfer to/from the memory and PowerPC processor using the CompactFlash tions (e.g., snow, water, ice, land, cloud, SVM core peripherals. The 16-GB card provided on the ML410 board. or unclassified) to an output file on the CompactFlash card holds the input and In this implementation, the software CompactFlash card (Figure 3). output data files, which are too large to fit side of the application is coded in C and The PowerPC ran the software portion on the DIMM. The UART was used for compiled to the embedded PowerPC 405 of the task, which sends data to and collects debugging output. The OPB is a low- processor using the Xilinx EDK tools. data from the SVM hardware module. We speed bus that serves as the default inter- The embedded software application reads chose to use the PowerPC instead of a face between the processor and the an input image file consisting of 857,856 MicroBlaze™ processor because the SystemACE™ interface controller and pixels. The image file is read from the PowerPC can operate at triple the clock fre- UART peripherals.

24 Xcell Journal Second Quarter 2008 EMBEDDED APPLICATIONS

The Impulse C compiler performs these optimizations and generates hardware in the form of either VHDL or Verilog. This hardware can then be synthesized using FPGA tools such as Xilinx ISE™ software and Platform Studio.

In support of partitioned software/hard- The Impulse C compiler performs these pixel-wise dimensions of the image. Each ware applications such as this, the Impulse optimizations and generates hardware in class was assigned an arbitrary color, and the tools include a library of C-compatible func- the form of either VHDL or Verilog. This number of pixels belonging to each class tions that implement a number of process- hardware can then be synthesized using was tabulated. We could then easily calcu- to-process communication methods. These FPGA tools such as Xilinx ISE™ software late the percentage of pixels belonging to methods include streaming, shared memory, and Platform Studio. On the processor each class as well as visualize the resulting and message passing. For this application, side, the compiler generates run-time file of classified pixels as a colored image. the Impulse C streaming programming libraries ready for use on the embedded This validation was required to meet model was the obvious choice. PowerPC processor. two goals of this project. First, it was nec- In Impulse C streaming applications, essary to validate both the pixel classifica- hardware and software processes communi- Validating the Algorithm tion results from the legacy (software-only) cate primarily through buffered data The output of the combined software/hard- version of the SVM and the generated streams implemented directly in hardware. ware application is a file comprising a col- hardware implementation of the SVM. This buffering of data makes it possible to umn of integers indicating the resulting This was necessary to verify that the integer write parallel applications at a relatively class of each pixel in the image. To validate and floating-point calculations performed high level of abstraction without the cycle- the results of the experiment and see the in the FPGA (in hardware) returned the by-cycle synchronization that would other- results, we used MATLAB to reformat the same results (within acceptable margins) as wise be required. column of integer values to the original those observed in software currently flying. Figure 4 illustrates the design flow for C-to-hardware compilation using the Xilinx FX60 FPGA as a target. On the software side of the application C Language (in this case on the PowerPC 405 proces- Applications sor used for hardware-level testing), Impulse C functions are used to open and close data streams, read or write data on the streams, and, if desired, send status messages or poll for results. In the case of Generate Generate Generate Hardware the Virtex-4 FX, stream reads and writes FPGA Software Hardware Interfaces Interfaces can be specified as operations that take advantage of either the PLB or the auxil- iary peripheral unit (APU) interface.

HDL C Software Generating Parallel FPGA Hardware Files Libraries To create the hardware portion of our application, we used the Impulse C compiler to generate synthesizable HDL files ready to use with the Xilinx EDK tools. In addition to generating HDL files, the Impulse compiler also generates additional files required by the EDK tools, including the needed PLB and APU bus interfaces. The Impulse C compiler performs a variety of low-level optimizations, including C statement scheduling and loop pipelining, saving application developers a great deal of time that would otherwise be spent performing Figure 4 – Design flow from C-code to FPGA-embedded application tedious, low-level hardware optimization.

Second Quarter 2008 Xcell Journal 25 EMBEDDED APPLICATIONS

discrepancies were caused by the differ- Using the C-to-hardware compiler, we ences in the training data sets of the were able to quickly experiment with these SVMs, but nonetheless we decided to ver- alternative implementations and compare ify the hardware implementation on a results, both in terms of accuracy and in pixel-by-pixel basis. terms of performance. The hardware To completely validate the algorithm, implementation of these SVMs produced we compared the outputs of a software results that agree very well with the soft- implementation, using the same input ware simulations of the algorithms and data, to the version implemented in hard- demonstrate significant increases in over- ware by the Impulse C compiler. The two all system performance. implementations produce identical classifi- We were also able to determine (using cations on a pixel-by-pixel basis. The com- Xilinx synthesis tools) the FPGA fabric uti- bination of the good agreement of our lization percentages for each of these results with the legacy ASE results, as well SVMs, as shown in Table 1. as the independent results from the software platform, led us to believe that our Conclusion implementation was valid. All of these val- FPGAs with embedded processors are idations were performed using a combina- demonstrating levels of performance and tion of C language and MATLAB efficiency that were previously impossible programming methods, with no need to using traditional processors. Hardware use hardware design methods or hardware acceleration of SVM algorithms promises description languages. to dramatically improve onboard data processing in future science missions. Extending the Algorithm By using embedded FPGAs such as the Having successfully implemented the legacy Virtex-4 FX60 device, we can implement SVM designed for Hyperion, we then con- increasingly advanced SVMs with “room to sidered two extensions to the C-language grow” in onboard resources. Our results algorithm: using a larger number of bands demonstrate that we can achieve even our [The color key is blue = water, cyan = ice, dark purple = snow, lavender = unclassified] with the same linear kernel SVM and creat- most advanced SVM extension, a polyno- ing a new SVM with a nonlinear kernel. For mial kernel, using just one-third of the Figure 5 – A comparison of the results from the expanded linear kernel SVM, we arbi- DSP blocks and slices available in the a) the legacy 11-band SVM implementation, b) the FPGA-accelerated implementation, and trarily selected 30 of the available 242 bands Virtex-4 FX-60 FPGA. c) the original hyperspectral image. in the image. For the non-linear kernel Software-to-hardware design tools SVM, we used the same 11 bands as the played an important role in the fast proto- We began the validation process by com- legacy SVM, but with a modified kernel. typing and development of these SVM paring the pixel classification percentage Because training data was not available for algorithms. The Impulse C tools allowed us results to those reported by the SVM used in the original legacy SVM, we could not gen- to more easily manage the complexity of the currently flying EO-1 satellite. The clas- erate new SVMs that would be comparable the application and to experiment with sification percentages show good agreement, to it, so we used new training data to gen- alternative implementations. For testing particularly for the snow and water classes. erate the two new SVMs. We also generat- and algorithm validation, the Xilinx Image visualizations were also impor- ed a new, 11-band linear-kernel SVM for ML410 board provided an excellent and tant in this effort. Our resulting visualiza- comparison to the legacy SVM. cost-effective development target. tions show excellent agreement with the results from the EO-1 (Figure 5). FPGA Resources Total on V4FX60 Linear (11 bands) Linear (30 bands) (2,1) Polynomial In addition to the qualitative comparison of the images, we also conducted a Slices 25,280 1,151 (4%) 2,253 (8%) 2,082 (8%) pixel-by-pixel comparison of the legacy Slice Flip-Flops 50,560 1,290 (2%) 1,337 (2%) 2,519 (4%) algorithm results and our classifications. Four-Input LUTs 50,560 1,838 (3%) 3,110 (6%) 3,287 (6%) The pixel-by-pixel classification compari- FIFO16/RAMB16s 232 2 (1%) 2 (1%) 5 (2%) son showed that 76.8% of the pixel classifications in our results matched those of DSP48s 128 4 (3%) 4 (3%) 12 (9%) the Autonomous Sciencecraft Experiment Table 1 – Device utilization for an FPGA-based SVM algorithm (ASE). We believed that the remaining

26 Xcell Journal Second Quarter 2008

EMBEDDED APPLICATIONS Using the MicroBlaze Processor to Develop Speed Sensors for F1 Racing Cars CEA Leti Minatec, Michelin, and EASii IC developed an accurate and real-time optical sensor prototype that calculates both speed and slip angle.

by Viviane Cattin You can estimate horizontal speed components. For our sensor, two laser Research Engineer parameters through indirect modeling of diodes emit elliptical beam shapes to illu- CEA Grenoble Leti Minatec wheel speed and yaw-angle measurements; minate the road. The first front laser beam [email protected] however, this method suffers from model- runs parallel to the longitudinal axis of the ling errors and other uncertainties. Some car, while the second back laser beam Bernard Guilhamat optical sensors give measurements that are, measures orthogonally. Two linear photo- Research Engineer in practice, not reliable in wet conditions. diode arrays collect the reflected light CEA Grenoble Leti Minatec We set out to develop a compact and beams on the ground. [email protected] real-time sensor that could deliver a direct A pixel from the front photodiode array slip-angle measurement with high accuracy is compared to all pixels on the back pho- Angélo Guiga and flexibility in various conditions (dry, todiode array. The back pixels provide Research Engineer wet, humid, or snowy). In this article, we’ll information about slip angle or transversal CEA Grenoble Leti Minatec ® [email protected] describe how the Xilinx Virtex™ FPGA speed, while a corresponding time delay family, with its embedded MicroBlaze™ gives longitudinal speed estimation. Figure Sébastien Riccardi processors, helped us achieve our goal. 1 illustrates the general optical configura- Hardware Activity Manager tion of the sensor. EASii IC Sensor Measurement Principles We then developed an additional func- [email protected] A laser velocimeter is commonly used for tionality to our sensor. Because of the laser taking speed measurements in the print- beam shape, we inserted a standard triangula- Understanding the dynamic behavior of ing and textile industries. The velocime- tion process to measure ground height varia- cars, in particular two-dimensional hori- ter takes laser images of the surface and tion beneath the car. This parameter is useful zontal speed parameters (also known as the uses an autocorrelation function to com- for understanding the behavior of the vehicle car’s slip angle), is critical to optimize per- pute accurate speed estimates. and also improves overall system accuracy. formance during races and improve essen- We extended this approach to measure Finally, we developed an innovative tial equipment like tires. both longitudinal and transversal speed process to avoid optical disturbances. In

28 Xcell Journal Second Quarter 2008 EMBEDDED APPLICATIONS

wet conditions, random spurious reflections disturb systems locally and cor- V H V rupt the calculation of correlations. Our trans longi solution is two-fold: inclining the laser beams at a specific angle and using pre- Laser Laser 1 2 processing to split the optical signals. 685 nm We successfully tested our sensor in real 50 mW time in various conditions. Class 3B

Embedded Intelligence During the development of our sensor, we designed and validated the signal processing algorithms using a reconfigurable FPGA. Unlike other targets such as DSPs or ASICs, V an FPGA allowed us to adapt routines and software architecture depending on test Figure 1 – General optical configuration of a laser speed sensor. Three photodiodes results and evolving specifications. arrays measure longitudinal speed, height measurement, and transversal speed

Software Architecture Figure 2 shows the main architecture of the sensor as proposed by Leti, the research SPEED Process laboratory in charge of this project. Three processes run concurrently: HEIGHT Process Front Pixels Back Pixels • The height process calculates ground Height Pixels FIR Filter height variations. ! ! Adaptive Threshold Adaptive Chopping • The speed process computes longitu- Average Subtraction dinal speed and slip angle from cor- Barycentre Calculation relations and height estimations of 1 x 128 Times Correlation 1D the car. Height H Estimation ! Maximum Extraction • The output process sends sequential Average (y/n) outputs on a high-speed, CAN-stan- Interpolation dard automotive bus system. Fit : Delay, Transersal Pixel These processes share variables through

! Over Sampling memory access. 8 x 5 Times Correlation 1D Figure 2 includes some key points marked with circles. As in the example previously Maximum Extraction mentioned, signal chopping improved our Interpolation ability to take measurements in wet condi- OUTPUT Process tions. Still other developments are essential to Prediction: S D Fit : Delay , Transersal Pixel provide accurate measurements: out out i i Speed S , and Slip Angle D CAN Frame Formation i i • Sampling frequency is the constant Estimation parameter that ensures accurate meas- Sorting Average urements. We controlled the acquisition frequency of diodes so that the ... CAN Bus ... Error Management < S D H Control Flags > ! ground sampling remained constant Model Updating Fs Loop Control during vehicle displacement. A control loop adjusted the sampling frequency according to the speed estimation and prediction, part of the speed process Figure 2 – Main software architecture with key points previously described.

Second Quarter 2008 Xcell Journal 29 EMBEDDED APPLICATIONS

We wanted a programmable embedded target that offered the possibility to accelerate software run times, used reconfigurable parameters, and facilitated evolutions of the algorithm.

• Oversampling of ground signals This routine additionally offered a crit- cal parameters and added a suitable improved measurement accuracy, par- ical way to evaluate the quality of sen- threshold to the signal to avoid sun ticularly for the high speeds (above 190 sor measurements. If all calculations perturbations on height calculation. kilometers per hour) of F1 racing cars. matched, we could confidently believe Other smaller processes run in parallel. At these frequencies, we encountered a in the accuracy of the measurement. For example, a specific algorithm evaluates technological ceiling of photodiode whether speed is equal to zero and commu- array integration time. Therefore, we • Taking into account height estimation nicates that the car is stopped. averaged several slightly shifted meas- when calculating speed improves measurements to improve accuracy (similar urement quality; we used the car’s Hardware Architecture to when using very fast oscilloscopes). ground height variation to refine opti- We wanted a programmable embedded target that offered the possibility to accelerate software run times, used reconfigurable FPGA Processing XC2V1000 CLK parameters, and facilitated evolutions of the Manager Buffer Maximum algorithm. An FPGA had all of these features, as well as offering the potential to

Buffer Maximum parallelize correlation calculations. We used two FPGAs from the Virtex MicroBlaze Processor RISC 32 Bits device family: one devoted to preprocessing (sensor control, digital-to-analog converter, INTERCORR memory, filters, averages, height processes) and one focused on the main processing. We added a MicroBlaze embedded Energy Buffer processor inside the second FPGA to CAN CAN Sequencer Pixel Column Controller CTRL Process implement our high-level algorithm. It includes mainly conditional instructions, with few mathematical calculations. Using this processor allowed us to apply mathe- FPGA Processing XC2V1000 matical functions (like trigonometric func- Sequencer CLK Sensor CTRL Preprocess Manager tions, for example). The soft-core processor is also more user-friendly (based on C lan- Sensor Analog Filter Channel 30 Hz guage) and facilitates debugging. FL FL 30 KHz Figure 3 shows the architecture as pro-

Sensor Analog Buffer Buffer Filter posed by EASii IC, the company in charge Channel Sensor Sensor 30 Hz Average LS LS LS LS 30 KHz of hardware realization.

Analog Filter Component Choice Channel 30 Hz Sensor ZBT HSa 30 KHz The sensor must operate through the severe HS Analog Control conditions of F1 racing cars, which include Channel Average HSb low weights, reduced size, high vibrations, and high accelerations. We used both ZBT Memory DC/DC converters and low dropout regu- 256K x 16 bits lators to avoid major perturbations from the car’s power supply. Figure 3 – Main software architecture with four analog inputs, digital-to-analog The output is transmitted over a high- converter timing, memory storage, calculations, timing, and output control speed automotive CAN bus, thanks to an SJA1000 chip manufactured by NXP

30 Xcell Journal Second Quarter 2008 EMBEDDED APPLICATIONS

simulating high-speed and slip-angle varia- Ph di d A tions. The sensor is designed to work with velocities from 2 to 400 kilometers per hour and can calculate slip angles from -10° to +10°. We reported relative errors of 0.5% on speed, 0.3 mm on ground height variation, and less than 0.1° on slip angle. We have conducted more than 50 trials since 2003 at the Michelin Technology Centre and with F1 partner teams. These measurements allowed us to significantly improve sensor performance and validate its use under a wide variety of conditions. Figure 4 presents experimental results in wet conditions. Outputs of the other commercial speed sensor are irrelevant because it was very disturbed by the behavior of Figure 4 – Sensor photography and main parameters water on asphalt. Our prototype presents more robustness; F1 car experts consider these results fully functional.

Conclusion Various experiments have shown that our prototype addresses the accuracy and robustness necessary for F1 race cars. We assume that these results show the engineering maturity of our sensor, especially with its innovative ability to operate on wet surfaces – a key challenge for non- contact speed sensors. The simplicity of the global architecture (which includes electronic devices as well as optical parts) makes us confident that our optical speed sensor prototype could be implemented at a low cost. We have already studied a new hardware design that fully exploits the available Figure 5 – Comparative result on wet surfaces (LETI sensor in blue, other commercial space between optical elements, reducing sensor in red) showing speed (top), slip angle (middle), and height output (bottom) package volume by a third. Future works may include an image Semiconductors. To improve the heat possible if we applied our system to a specif- sensor to simplify the optical part and transfer of the most consuming compo- ic design. The total power consumption of fusion with other data (such as nents (such as the analog-to-digital con- the sensor is ~10W. The total size (195 mm accelerometer measurements) to enhance verter), we added a thermal paste, which x 107 mm x 35 mm, including both elec- sensor functionalities. facilitated thermal exchanges with the tronic and optical elements) facilitates rapid For more information, visit Leti (www- metal packaging. Additionally, a thermal mounting on the car, either on the keel for leti.cea.fr) or EASii IC (www.easii-ic.com) or sensor gives the temperature of the sensor racing cars or anywhere on the chassis. e-mail [email protected]. to prevent overheating. To avoid connection failure and stress Performance We wish to acknowledge Sébastien Dauvé, Bruno breakdowns, we added glue to the largest We set up and calibrated our prototype at Flament, William Foucault, Jean-Michel Ittel, Philipe electronic modules. Most of the basic com- the Michelin Technology Centre, our part- Peltié, and Jean-Charles Vidal from Leti research ponents of our system are off-the-shelf, ner in this project and the eventual end user laboratories and Jean-Baptiste Monnard from EASii which means that further optimization is of the sensor. They have a dedicated tool for IC for their contributions to this project.

Second Quarter 2008 Xcell Journal 31 EMBEDDED APPLICATIONS FPGA-BasedFPGA-Based ControllerController EnablesEnables PrecisePrecise ChemicalChemical AnalysisAnalysis An embedded controller using a Spartan-3 device and MicroBlaze processor provides by Robert Henderson Development Engineer electronic control of gas flows and pressures. Agilent Technologies [email protected]

A gas chromatograph (GC) is a chemical analysis instrument that separates volatile substances in a complex chemical sample. GCs help answer such basic questions as: • “What contaminants are in this ground water”? • “Is this gasoline formulated correctly”? • “Are there any residual solvents in this pill”? The basic mechanism for gas chromato- graphic separation is the distribution of a sample between two phases (a stationary phase and a gas mobile phase). In modern chromatography, the stationary phase is coated on the inside of a long narrow tube known as a column. As the sample passes down the column, the different chemical components travel at different rates and are detected at the exit of the column by a detector. In addition to the gas supporting the flow of the sample down the column, each detector also uses between one and three supporting gases (see Figure 1). Agilent Technologies (spun off from Hewlett Packard in 1999) has been a part of the GC business since 1965. In 1973, Hewlett Packard introduced the world’s first microprocessor-controlled analytical instrument, the 5830 GC, and over the years introduced a series of next-generation instruments, including the Agilent 7890A high-performance GC in 2007.

32 Xcell Journal Second Quarter 2008 EMBEDDED APPLICATIONS

Gas Chromatograph pneumatic channels in a closed-loop manner, offloading the CPU in the base unit from this real-time task (Figure 2). Inlet Sample Detector (Carrier Gas) Four-Wire Interface (Module to Base Unit) From the base unit to the module, the Separation entire interface comprises an unshielded Column four-wire cable: two wires are for power Detector and two are for communications. The Sample Support power supplied is an unregulated, full- Gases wave-rectified +24V, and the communications path is a bus LVDS system. The LVDS communications path is a Figure 1 – Block diagram of gas flows in a gas chromatograph differential signal path that minimizes noise susceptibility issues and allows for cable lengths long enough to place modules anywhere inside the instrument, maximiz- +24V ing instrument configurability options. Module Power The bus LVDS communications operate in Supplies a half-duplex mode, allowing the same pair Spartan-3 BASE of wires that sends commands to the mod- FPGA 4-Pin UNIT (XC3S200) Connector ule to provide command responses from CPU LVDS 2-Wire the module. Additionally, because the communications interface in the base unit is also implemented with an FPGA, the LVDS Multiplexed communications link between the two Valve Valve A/D EEPROM FPGAs required essentially no hardware, (8 Input Channels) Drivers Driver Monitor saving cost, reducing complexity, and improving reliability.

MicroBlaze Embedded Processor Core In order to implement the embedded controller, we configured the Xilinx® MicroBlaze™ embedded processor in the FPGA using the Platform Studio tool pro- Pressure Sensor Flow Sensor Pressure Sensor Proportional Valve Proportional Valve Proportional Valve vided in the EDK development system. This tool not only defines, synthesizes, and routes Figure 2 – Block diagram of Spartan-3 FPGA-based pneumatic module user logic, utilizing MicroBlaze processor IP and associated bus interface designs, but also GCs have evolved measurably over the the 7890A GC with a modular structure, develops and compiles the program code for last 20 years, from simple mechanical pres- allowing as many as six pneumatic modules the MicroBlaze processor. All FPGA hard- sure and flow regulators to sophisticated to communicate with a common “base ware description is in VHDL, and all electronic closed-loop pressure and flow unit” that provides the central CPU and MicroBlaze processor coding is in C. controllers that allow for the storage and memory, communication ports, keyboard Specifying the optional barrel shifter retrieval of the entire instrument setup. and display interfaces, and power supplies. and hardware divider in the microprocessor Additionally, this capability also allows for Pneumatic modules can regulate up to hardware specification file (system.mhs) programmable setpoints and for the log- three channels of gas for each inlet or enabled the processor system to function as ging of any deviation from the setpoint(s). detector. Each module is essentially an a real-time controller. Performing opera- embedded controller that receives com- tions such as bit shifting or division in C Spartan-3 FPGA-Based Pneumatic Modules mands from the CPU in the base unit, but code instead of VHDL hardware takes too Because of the disparate different applica- otherwise operates independently. The long for some of the time-critical PID con- tion needs of our customers, we designed module controls the three independent troller processes.

Second Quarter 2008 Xcell Journal 33 EMBEDDED APPLICATIONS

Using the Xilinx-supplied template 65 kHz. This high a frequency is used so MicroBlaze processor a transmit and VHDL file, the user logic interfaces to the that in addition to being above the audible receive FIFO along with the necessary flags MicroBlaze processor OPB bus with an range, it is filtered out well by the inductive (FIFO empty, FIFO full). OPB to IPIF (IP interface) block. This time constant of the valve, resulting in a block takes care of the OPB bus protocol smooth current profile. Basic Pneumatic Module Operation and interface signals and presents a simpli- Signals from the valve drive are read As I mentioned earlier, the physical set- fied interface to the user logic called the IP back into the FPGA for diagnostic purpos- points supplied by the customer may be InterConnect (IPIC). es. This allows the MicroBlaze processor to static or programmed. They are expressed The major user logic VHDL blocks com- examine the voltages and determine if a in physical units such as psig (gauge pres- plement the external hardware controlled by the MicroBlaze processor, resulting in the overall FPGA system design (Figure 3). Let’s FPGA with Embedded Controller review the major components. OPB Power Supplies

Multiplexed A/D Converter Control MicroBlaze Processor IP Interface (IPIF) A common delta-sigma analog-to-digital CPU + Memory IPIC converter (ADC) is multiplexed between User Logic eight inputs. With this type of ADC, when the input changes the output filter must be 4-Pin Connector “flushed” of the previous channel’s data. A MUX EEPROM Valve UART VHDL block manages the ADC and its A/D Control Drive + FIFO serial data output. Control PWM When the MicroBlaze processor selects a specific channel, the VHDL subsystem LVDS 2-Wire reads and throws away the old channel’s readings, then reads and averages data automatically from the ADC. The Multiplexed Valve Valve EEPROM MicroBlaze processor specifies to the A/D Drivers Driver VHDL block how many readings to aver- Monitor age, and thus can perform other operations (such as executing a new command from the base unit CPU) until an averaged reading is ready.

Serial EEPROM Control

The VHDL hardware manages the com- Pressure Sensor Flow Sensor Pressure Sensor Proportional Valve Proportional Valve Proportional Valve mands, addresses, and 8-bit data to and from the serial EEPROM and presents a Figure 3 – Overall pneumatic controller block diagram memory-mapped, 32-bit-word-wide interface to the MicroBlaze processor. The valve is installed and if the PWM drive is sure) or ml/minute (flow rate) and can be EEPROM stores calibration coefficients able to drive both high and low. set with a resolution of 0.001 psig and 0.01 for the sensors in the pneumatic module ml/min. These setpoints are sent at a rate of and PID coefficients, as well as other con- UART/FIFO 50 Hz to each pneumatic control module. figuration and identification information. We implemented a UART and FIFO in The basic job of the pneumatic con- VHDL hardware to handle the com- troller module is to regulate the pressure or Valve PWM Drive and Monitor mands and responses with the base unit flow to the desired setpoint. This is done After each PID calculation, the MicroBlaze CPU. Because of the half-duplex proto- with a modified proportional integral embedded processor outputs another valve col, the pneumatic module must only be derivative (PID) controller. drive setpoint to the memory-mapped in the transmit mode during the response addresses for the valve drive. The valve to commands. Command Interpreter drive VHDL state machine uses these val- VHDL hardware automatically returns Because there is no clock shared between ues to generate a pulse-width modulated to receive mode a fixed time after the com- the base unit CPU and the LVDS pneu- (PWM) drive to the proportional valves at mand response and presents to the matic modules, commands to the pneu-

34 Xcell Journal Second Quarter 2008 EMBEDDED APPLICATIONS

matic control module are not synchronized used on data returned to the base unit for the control loop to attenuate away (the with the closed-loop operation of the CPU. It limits the bandwidth of the data disturbances are above the bandwidth of embedded controller in the module. to below 25 Hz so that there is no aliasing the controller). For example, you can send commands of the data in the 50-Hz rate of data to the We solved these problems by reading while the MicroBlaze processor is in the mid- base unit CPU. the value of the +24V supply with the mul- dle of PID calculations. Because the control tiplexed A/D converter and having the loop takes precedence, the command parsing Control Loop MicroBlaze processor calculate a feed-for- and execution will be temporarily delayed The PID controller function implements a ward compensation term. If the +24V sup- until the MicroBlaze embedded processor is standard PID control with a couple of ply increases, the valve drive is idle. This delay is short enough that it necessary twists. Figure 4 shows the over- automatically reduced to compensate. This allows the command response to fall with- all operation of the PID controller. requires a divide operation. Because this in the acceptable response-time limits set The first modification of a standard happens at over 100 Hz and takes away by the base unit CPU. PID is due to the +24V power supply. A from PID calculation time, the hardware The simple command interface defines fully loaded instrument could have 18 divider was implemented in the commands that set setpoints, read actuals, proportional valves to power. Rather MicroBlaze processor system specification. read and write the EEPROM, specify the than implement a 20W-regulated +24V The second modification from a stan- gas type (H2, Helium, N2, ArCh4), as well supply, we decided to use a simple and dard PID controller is due to the pneumat- as a number of diagnostic tests on the sys- reliable full-wave-rectified and unregu- ic system itself. The range of setpoints you tem, A/D converter, and valve drives. lated supply. A supply like this will track can define covers a broad dynamic range, the AC voltage and ripple at twice the with pressure setpoints from 0.2 psi up to Filtered Sensor Readings line frequency. 150 psi and flows from 3 ml/min up to The sensors read data hundreds of times a The magnitude of the +24V supply can 1,250 ml/min. second. Two IIR filters in the MicroBlaze range from 22V to 28V, depending on the Not surprisingly, the dynamics of the processor process the data, which makes use AC voltage to the instrument. This results pneumatic system over this range of opera- of the barrel shifter hardware configured as in a variation in the open-loop gain of the tion change a lot. It is possible, for exam- part of the processor configuration file. system, which can affect stability. ple, to have more than a 30-dB gain change The first low-pass filter helps reduce raw Additionally, the AC ripple on the in the transfer function of the pneumatic sensor noise above the bandwidth of the +24V supply modulates the drive to the system (ratio of sensor out to valve drive in) control loop that would result in a wider valve and can result in pressure or flow over these ranges. In addition, the band- control band. The second low-pass filter is variations that are at too high a frequency width (poles) of the pneumatic system varies greatly over this operating range. To obtain good control and step +24V +24V Feed-Forward response performance, you can alter the Calculation values of the PID coefficients based on the setpoints specified. This alters both the gain and frequency response of the PID Gain and controller to match the characteristics of Bandwidth Correction the pneumatic system at that setpoint.

Conclusion New techniques and improved precision in chemical analysis techniques have enabled Error Standard Gain and Compensate Proportional + PID Bandwidth +24V our customers to meet increasingly stringent Valve - Controller Correction Multiplier requirements for chemical identification. Setpoint Part of that improvement comes from the increased accuracy and precision of gas supplies within the GC. The new Agilent 7890A Pressure network gas chromatograph continues this Pneumatic or Flow System tradition, while the Xilinx Spartan-3 Sensor XC3S200 FPGA with MicroBlaze embedded processor helped the product Figure 4 – Modified PID controller meet its functionality, precision, cost, and modularity requirements.

Second Quarter 2008 Xcell Journal 35 EMBEDDED APPLICATIONS

by Paul Parkinson Senior Systems Architect Wind River [email protected]

Reconfiguring The development of modern defense systems presents a familiar technological challenge in that the requirements for increased application functionality conflict with the requirements for minimal footprint in terms of space, weight, and power (often the Battlespace referred to as “SWaP”). This is a particular concern for airborne platforms, especially unmanned air vehicles (UAVs) that have You can develop aerospace and defense applications physical constraints. Some defense systems are physically with Xilinx and Wind River technologies. vulnerable to interception by hostile forces because of their operational role, including intelligence, surveillance, and reconnaissance (ISTAR) assets such as reconnaissance aircraft, UAVs, and land-based sensor systems. If these platforms are to use leading-edge technology, it must be technology that cannot be compromised or reverse-engineered. In addition, field-based configuration and upgradability are essential to achieve interoperability between new and rapidly deployed coalition forces. In this article, I’ll explain how Xilinx® Virtex™-II Pro and Virtex-4 platform FPGAs, along with Wind River software platforms, are ideal for the development of ISTAR systems, and present design techniques that will enable secure deployment, with the use of partial reconfiguration in the field to help enable interoperability between coalition forces.

Technology Challenges In previous decades, defense systems used military-grade components from semiconductor manufacturers. These components had long life cycles, which were essential to support the in-service use of deployed hardware for 20, 30, or even 40 years. In 1994, when U.S. Secretary of Defense William Perry first advocated the use of commercial off-the-shelf (COTS) components on U.S. military programs where appropriate, other governments around the world subsequently adopted this philoso- phy (to varying degrees). The migration to

36 Xcell Journal Second Quarter 2008 EMBEDDED APPLICATIONS

number of enhanced capabilities when compared to VxWorks 5.5, including technologies that are critical in defense applications; advanced memory protection for application isolation using real-time processes; and a dual-mode IPv4 and IPv6 network stack, which the U.S. Department of Defense mandated for new programs in 2003. After XPS generates the VxWorks 6 BSP, it can be configured in the Workbench project configurator (Figure 1). The configura- Figure 1 – Workbench kernel configuration for Xilinx ML410 VxWorks 6 BSP tor enables components to be configured in a hierarchical manner and parameter COTS has led to a move away from specif- (The details of this approach were previ- values to be specified. When this has been ic military-grade components and toward a ously discussed by Rick Moleres and Milan completed, you can build the VxWorks 6 greater reliance on industrial-grade compo- Saini in their article, “Generating Efficient kernel for your target device from nents. The latter’s shorter supported life Board Support Packages” [Xilinx Embedded Workbench through automated processes cycles could impact in-service lifetimes. magazine, March 2006] and included inte- (Figure 2). Once VxWorks is running on In addition, defense companies have gration with Wind River’s Tornado 2.2 IDE the PowerPC 405 core(s), you can use developed custom ASICs at great expense and VxWorks 5.5 RTOS.) Workbench for source-level application for specialized functions in defense systems, Since the publication of that article, it has development (Figure 3). particularly ISTAR systems. Replacing these become possible to undertake similar devel- The integration between XPS and devices in deployed systems can be prohibi- opments using the state-of-the-art Wind Workbench allows you to use a VxWorks 6- tively expensive. There is also an ever-grow- River Workbench development suite and based common software platform on Virtex ing requirement to extend the in-service life VxWorks 6 RTOS. VxWorks 6 provides a FPGAs for application or algorithm control of ASICs as the operational lifetimes of or gateway network interfaces. This pro- front-line defense systems are extended. vides a very flexible approach that you can These challenges can now be addressed exploit for functions such as image com- through the use of Virtex-II Pro and pression and data link encryption to rele- Virtex-4 FX FPGAs, given their increased vant NATO standardization agreements. gate counts and incorporation of CPU and System and application software in DSP functionality. these devices (known as device software) The platform FPGA also has the poten- are often implemented in an architecture- tial to be used for algorithmic operations, specific manner, using low-level program- but until recent years this has not been ming languages such as assembly language exploited because it is difficult to express and custom software schedulers. Although software algorithmic operations in million- this has enabled exploitation of the hard- gate applications in VHDL. However, the ware’s performance potential, it has also ability to program reconfigurable logic made the task of technology insertion and from high-level software languages such as Figure 2 – Workbench build of Xilinx program upgrades more difficult. C as well as VHDL opens up the potential board VxWorks 6 kernel image You can overcome this problem by using of these devices to further exploitation. software platforms based on open standards. For example, the Wind River General Platform Development Purpose Platform – VxWorks Edition incor- You can readily exploit the potential of the porates the Wind River Workbench devel- platform FPGA through the Xilinx Platform opment suite, which uses the Eclipse Studio (XPS), which generates hardware open-source framework to provide seamless peripheral and IP definitions in VHDL, integration between tools for different parts closely coupled with the automatic genera- of the device software development life tion of a VxWorks board support package cycle, as well as a consistent GUI for devel- (BSP) in C source code for the PowerPC opers. This approach has tangible benefits in 405 processor core(s) contained within the Figure 3 – Workbench target connection to terms of developer efficiency, transferable PowerPC 405 core in Virtex FPGA Virtex-II Pro and Virtex-4 FX device fabric. skills, and knowledge retention.

Second Quarter 2008 Xcell Journal 37 EMBEDDED APPLICATIONS

On the runtime side, VxWorks 6.4 eration technique to provide a secure not provide a rapid, complete, and irre- introduced 100% conformance to the TCP/IP-based communications frame- versible destruction of subsystems; although POSIX PSE52 real-time controller profile, work (using IPSec in conjunction with they may achieve software destruction, they enabling software reuse from legacy sys- Triple DES or potentially 256-bit AES may not sufficiently destroy all of the hard- tems and the development of new encryption) and data link protocols. This ware architecture, especially fixed hardware portable applications. would involve performing the network components such as ASICs. packet processing on the PowerPC proces- Thus, it could still be possible to per- Communications Security sor, with computationally intensive form reverse engineering on some aspects in Defense Systems Design encryption offloaded to the FPGA, acting of the hardware design. To prevent this, Communications security (ComSec) is as a coprocessor. reconfigurable technology offers the ability an important requirement in many to achieve rapid and complete destruct defense systems, especially in UAVs, Information Security sequences and prevent reverse engineering. which often need to maintain continu- in Defense Systems Design FPGA devices can be erased completely, ous communications links for remote Defense systems now contain an increas- leaving no trace of their original applica- piloting and real-time image streaming. ing number of subsystems. In an airborne tion. By asserting a specific signal, the These communication links must be platform, for example, there are avionic device could be cleared in hardware in a few secure from eavesdropping and interfer- systems (for flight control), mission sys- hundred microseconds. You could also per- ence by hostile forces, so encryption is tems, and sensor systems for payloads form the software destruct sequence required for flight control and sensor such as electro-optic/infrared sensors through programmatic software control data transmissions. You could imple- (EO/IR) and synthetic aperture radar issued from a remote system, a sensible sce- ment this encryption in software (which (SAR) (Figure 3). ComSec in this case nario in the case of a UAV capture. places an additional load on the proces- refers to the use of firewalls and encryp- You can implement the sequence from a sor) or in dedicated hardware logic. You could also use reconfigurable technology, which provides benefits in terms of performance and secure design (which I’ll Firewall Firewall soon discuss). Let’s consider a hypothetical scenario Avionics Mission Sensor Network Systems Systems involving the use of Triple DES-encryption on a communications link. Triple DES encryption provides a relatively high level of security by encrypting data three Figure 3 – Military airborne data networks times using three 64-bit private encryption keys. This is inherently more secure tion for the transmission and reception of VxWorks-based common software plat- than single DES encryption but will take information securely between networks form connected to the command and con- a processor three times longer to compute, without transforming the information trol center through a secure IP-based because the processor performs the during transport. network, permanently and irreversibly eras- encryption as three sequential steps. Data communication between the avion- ing the content of a Xilinx FPGA using the Given the parallelism inherent in a ics systems and mission systems will include PLD API for VxWorks Embedded Systems Xilinx platform FPGA, you can implement the passing of global positioning informa- (PAVE), as shown in Figure 4. three encryption stages operating in paral- tion, bearing, and altitude. However, infor- I referred to Triple DES encryption in lel, with the output of one stage pipelined mation with differing security classifications the context of secure communication links into the next stage in 64-bit words. This may also need to be transformed securely earlier, but you can also employ this method acceleration enables the passing of data at between applications, subsystems, or net- within the content of the Xilinx FPGA (also higher rates over secure downlinks. works – this is known as InfoSec. known as the payload, not to be confused For example, a 350-MHz PowerPC can with a UAV payload). When the bitstream Triple DES-encrypt a data stream at the Preventing Reverse is read out from the FPGA, the Triple DES rate of 1.2 Mbps, whereas a lower power Engineering by Secure Design encryption must be decrypted before the 20-MHz Xilinx Virtex-II Pro FPGA can The implementation of a system destruct bitstream can be decoded. This not only Triple DES-encrypt a data stream at the sequence capability is possible for systems protects the FPGA content from being rate of 22 Mbps, with each 64-bit word that are vulnerable to compromise or cap- copied blindly and reused; it also prevents it processed in 57 clock cycles. ture by foreign forces. However, previous from being reverse-engineered should the You could apply this encryption accel- implementations of destruct sequences may UAV be intercepted.

38 Xcell Journal Second Quarter 2008 EMBEDDED APPLICATIONS

Software System Destruct Encryption Application Algorithms + PAVE Application Algorithms VxWorks

Secure Hardware TCP/IP CPU FPGA Ground Link

Figure 4 – CPU-controlled reconfiguration of a Xilinx FPGA

Technical Advantage Field Upgrade Advantage Deployment Advantage

FPGA-Based Solution

Fixed Chip Solution Time

Time to Deployment - First to deploy increases technical advantage Time in Field - Increases the in-life support yield while in field

Figure 5 – Reconfigurable technology life cycle

Conclusion form acts as a secure gateway to deliver ISTAR systems will spearhead the deploy- new content and perform partial or full ment of new coalitions in response to reconfiguration of FPGAs while deployed potential or emerging conflicts, providing in the field. This approach would harvest vital imagery intelligence and situational the architecture shown in Figure 4, but awareness from reconnaissance missions. employ an upgrade application instead of a A common software platform and recon- system destruct application. figurable technology could be used to When considered together, these capa- implement codecs that support reconfigu- bilities provide a technical advantage over ration in the field, assisting in the rapid traditional fixed-chip solutions (Figure 5) deployment and interoperability with and present a powerful argument for their NATO and/or other coalition forces by adoption on defense programs. sharing encryption keys. Platforms can For more information, I recommend a also be reconfigured on the fly to commu- paper by Saar Drimer of the University of nicate with legacy or incompatible sys- Cambridge on the use of FPGAs in the tems belonging to other coalition design of secure defense systems. To read members (if a suitable codec exists). “Volatile FPGA Design Security – A Reconfigurable technology not only Survey,” visit www.cl.cam.ac.uk/~sd410/ provides the ability to rapidly deploy new papers/fpga_security.pdf. technology, but also the means to extend For more information about the Wind the in-service lifetimes of deployed sys- River General Purpose Platform – VxWorks tems. A network-enabled software plat- Edition, visit www.windriver.com.

Second Quarter 2008 Xcell Journal 39 OR

Agilent Logic Analyzers Agilent Mixed Signal Oscilloscopes Up to 612 channels, up to 256 M deep memory + 4 scope channels + 16 timing channels Agilent FPGA Dynamic Probe Application software to increase visibility inside your FPGA = Fastest FPGA Debug Available

2ENogE!c_`gEN¢l3`ol!c!l<y!!gE`3

At the beginning of a new embedded system design, there are so many different solution options that it is hard to predict their impact on system cost, development time, and performance. Some example of possible choices are: • Memory devices (DDR, SRAM, flash, SPI-Flash) • Bus architecture and topologies (PLB, OPB, OCM) • Number of bus masters and arbitration schemes • Processor architecture (PowerPC 405, PowerPC 440, Xilinx® MicroBlaze™ processor, ARM) • Operating system (stand-alone, Xilinx microkernel, Embedded Linux, commercial RTOS) • Single or multiple processors • Interprocessor communication (shared memory, shared bus, interrupts) • Interfaces to the outside world (Ethernet, PCI Express) In this article, we’ll show how you can evaluate these different choices during the whole design process to build an optimized system.

Second Quarter 2008 Xcell Journal 41 SYSTEM DEVELOPMENT

Figure 2 – Board of data acquisition system

Figure 1 -Typical environment Exploring with a Demo Board data sheet to select the device, and the for the data acquisition system A good way to explore the capabilities of a number of I/O blocks, I/O standards, and new FPGA family is to use a demo board I/O banks to select the package. System Architecture like the ML403. First, we implemented the After this first experience, we decided to AVL offers modular test bed components, simplest possible embedded system run- use the Virtex-4 FX family and FF676 highly reliable instrumentation, and ning on one PowerPC (see Figure 3). The package, which allowed us to use different advanced tools supporting the develop- CPU executes the software from internal devices (FX20 with one PowerPC, FX40 ment and calibration process for all kinds block RAMs. With this small system, we and FX60 with two PowerPCs). We had to and sizes of combustion engines (Figure 1). could evaluate the throughput of the use different TCP/IP stacks to use the full One part of this system is the embedded Gigabit Ethernet interface. We could also performance of the Gigabit Ethernet inter- Gigabit Ethernet card discussed in this arti- determine the memory bandwidth and face, so we selected the Treck stack. cle, which collects data from as many as speed of the PowerPC. A single PowerPC is too slow to move four data acquisition boards through paral- We estimated the required logic the data from the burst interfaces to the lel burst interfaces, performs basic data pre- resources like the number of flip-flops, memory and then to the Ethernet inter- processing, and streams the pre-processed block RAMs, PowerPCs, DCMs from the faces, so we had to connect the burst data data to a host system through a Gigabit Ethernet interface. Multiples of such acqui- 2 sitions should be cascaded; therefore, we DDR1 RS232 I C, LED planned to use two Ethernet interfaces, as OPB_ OPB_ PLB_DDR UART GPIO shown in Figure 2. The main components of this card are a Block RAM PLB_OPB Virtex™-4 FX60 device, three DDR ISOCM_ DSOCM_ RAMs, two Gigabit Ethernet interfaces, BRAM_IF BRAM_IF and one XC9500XL family CPLD. Two OPB_ OPB_ PowerPC processors determine the func- BRAM_IF SPI ISOCM DSOCM tionality of the interface board. Block RAM FLASH Because of high speed requirements, the PPC1 PLB multi-port memory controller executes DCR FCM checksum generation and the data management of four burst data interfaces. The PLB_ DCR_ TEMAC measurement data is placed into an exter- External Hardmacro nal high-speed DDR RAM. Both PowerPC INTC HARD TEMAC processors are responsible for the data flow GE_PHY management and Ethernet communica- PCORE CUSTOM tion. Five high-speed interfaces control the external data acquisition boards. Figure 3 – Block diagram of simple system with ML403 demo board

42 Xcell Journal Second Quarter 2008 SYSTEM DEVELOPMENT

and Ethernet interfaces directly to the CTRL RS232 I2C, LED FLASH memory. We decided to connect one net- DATA work interface to each PowerPC. OPB_ OPB_ OPB_ SO_OPB UART GPIO SPI _TLINK We are sure that a platform without a real-time operating system will make the Block RAM PLB_OPB development of this application easier. This ISOCM_ DSOCM_ has to be decided on a case-by-case basis. BRAM_IF BRAM_IF

OPB_ BRAM_IF SO_OPB_BURST Hardware Design ISOCM DSOCM The system uses three DDR RAMs: one for Block RAM MEASUREMENT DATA each PowerPC and one for the measure- PPC1 PLB ment data storage. We had two design pos- DCR FCM sibilities: a special peripheral that is capable of bus mastering DMA transfer or a multi- DCR_ PLB_ PLB_ PLB_ PLB_ port memory controller MPMC2 to enable INTC TEMAC TEMAC DDR DDR HARD HARD simultaneous access to the data acquisition TEMAC TEMAC DDR2 DDR1 RAM. We selected the second option. GE_PHY GE_PHY We were then ready to start developing our new platform. We mainly used Mentor Graphics tools: DxDesigner for schematic External Hardmacro PCORE CUSTOM capture, Boardstation for PCB layout, Hyperlynx for signal simulation, and IODesigner for the interface between hard- Figure 4 – Block diagram collecting performance data with PLB ware and ISE™ software. The PCB design was awarded the first prize in the Transportation & Automotive category in CTRL DDR1 RS232 I2C, LED FLASH FLASH x5 the Mentor Graphics’ 2007 PCB DATA OPB_ OPB_ OPB_ OPB_ SO_FSL PLB_DDR Technology Leadership Awards. UART GPIO SPI SPI _TLINK We also developed the peripheral cores Block RAM Block RAM PLB_OPB PLB_OPB for the custom peripherals. We needed two ISOCM_ DSOCM_ ISOCM_ DSOCM_ custom cores: one for transmitting data to BRAM_IF BRAM_IF BRAM_IF BRAM_IF OPB_ the transmission link (SO_TLINK_FSL) BRAM_IF ISOCM DSOCM ISOCM DSOCM and one for capturing the block data Block RAM (SO_NPI_FSL_BURST). PPC1 PLB PLB PPC2 DCR FCM DCR FCM SO_TLINK_FSL MEASUREMENT DATA DCR_ x4 DCR_ SO_NPI_FSL_BURST We already had VHDL code for the trans- INTC FCM2FSL INTC FCM2FSL mission link with a register interface, so we LL_ LL_ PLB CDMA NPI NPI NPI NPI CDMA PLB decided to use a block RAM interface to TEMAC TEMAC HARD HARD connect the register memory mapped to the TEMAC MPMC2 TEMAC PowerPC. One advantage of this solution is GE_PHY DDR3 GE_PHY that for every bus (OPB, PLB, OCM, and even LMB for the MicroBlaze processor), a External Hardmacro PCORE CUSTOM block RAM interface controller exists, so we could easily try the performance on different bus configurations. If connected to the Figure 5 – Block diagram of optimized system with MPMC2 OPB bus, we could share this resource between the two PowerPCs. With this uni- arbitration takes too many clock cycles; slows down the whole bus frequency. versal approach, we had the freedom to even worse, because we used the OPB bus, The other problem was that we could move the control software modules from we needed an additional PLB2OPB bridge. not send further data until we had one PowerPC to the other. For the OCM bus, time for bus arbitration acknowledgment that the packet was Unfortunately, this solution was too would be lower, but we would need a sec- received. One solution would have been to slow for two reasons. First, we saw that bus ond DSOCM_block RAM interface, which integrate a FIFO in our core, but we could

Second Quarter 2008 Xcell Journal 43 SYSTEM DEVELOPMENT

also use the FSL links instead and eliminate flash for booting the FPGA. The CPLD One very useful tool for qualifying the both disadvantages. We estimated the best converts the signals from SPI to the usual results is the combined XMD and FIFO size, because we could optimize it serial interface for booting. ChipScope analyzer for hardware/software later under different load conditions. The We implemented the I2C with the co-debugging. You can trigger with the FSL interface automatically uses the OPB_GPIO core and a software driver ChipScope analyzer and stop the software SRL16 for small FIFOs and the FIFO16B without the special core OPB_IIC core, debugger, and vice versa. hard block for larger depth. because it was free of charge and we need- We also experimented with interrupt ed only to read an ID tag once after boot. priorities, FIFO sizes on the MPMC2, and SO_NPI_FSL_BURST For the interprocessor communication arbitration schemes inside the MPMC2. We wanted the SO_NPI_FSL_BURST core between the two PowerPCs, we used a We also moved some software modules to move the data directly to the MPMC2, so shared memory implemented in a single from PowerPC1 to PowerPC2 and vice we needed to develop an interface between ported block RAM connected to the OPB versa to optimize system performance. the test card interface and NPI (native port bus with the OPB_BRAM controller. interface similar to the local link interface) We implemented two separated inter- Conclusion of the MPMC2. The PowerPC must control rupt controllers, one for each PowerPC You can start with a very simple design and this interface to specify the address location connected to the DCR bus. upgrade step by step to a higher perform- and burst length. We decided to use the FSL The central core in our design is the ance design by boosting Ethernet through- interface instead of the GPIO because the multiport memory controller with eight put, boosting DDR throughput with a FSL is really simple to handle and saves a ports: four for data acquisition multiport memory controller, adding data few clock cycles. (SO_NPI_FSL_BURST), two PLB ports, filtering, FFTs, and so on. For the simulation of the peripheral one for each PowerPC, and two CDMA In the future, it will be necessary more cores, we used ISIM to verify the funda- ports to connect the Gigabit Ethernet often than not to postpone important system mental functionality of our cores. For a sys- interfaces (Figure 5). architecture decisions to later stages during tem simulation with the PowerPCs, we An interesting design phase was finding the design process. With an FPGA embed- used ModelSim PE with the SWIFT inter- performance bottlenecks and fine-tuning ded system, you have many opportunities to face for the processor models. the system performance. We connected remove bottlenecks from your system. In the implementation process, we gen- our board to real test equipment to get For more information, visit www. erally used the generated make file from the real-life data. so-logic.net or www.avl.com. Xilinx Platform Studio software and added a lot of features like checkout and tagging for source code repository, tar ball, and .zip file generation, different programming file Further Improvements format generation, debugging with the ChipScope™ analyzer core inserter, and • Upgrading the actual design environment to the newest versions of ISE software and EDK 9.2. compilation of different application soft- • Moving all OPB cores to the PLB bus, because all OPB cores are now also available for PLB. In this case, we ware for both PowerPCs. Our scripts ran the development tools had to decide which shared peripheral is connected to which PowerPC. As a positive consequence, we could on Windows XP and Linux (CentOS 5.1) remove the whole OPB bus and two relative large PLB2OPB bridges, each saving 500 slices. and supported different boards and devices. • Running embedded Linux on one or both PowerPCs if you needed to run one of the many applications that Real Life are available for Linux. We first migrated our design from ML403 • Adding cores for double-precision floating-point to offload the host PC application or integrate more DSP func- to our new target board to check all external tionality into the FPGA. peripherals (DDR, Gigabit Ethernet, SPI, I2C) (see Figure 4). We can reuse this design • Downgrading to a lower performance system with only one PowerPC or, as an even more low-cost solution, later for hardware test for manufacturing. replacing the PowerPC with the MicroBlaze processor v7.0, which also has a PLB bus interface. We then used small internal block RAMs on the data and instruction side • Upgrading to a higher performance Virtex-5 FXT FPGA with PowerPC 440 and removing the CPLD, because of OCM buses, with boot loaders for each the more user I/O pins available on the FPGA in the same package and native boot support from SPI flash. PowerPC and fast interrupt routines. The • Upgrading the FIFO links to serial point-to-point, high-speed links with RocketIO™ transceivers to improve boot loader moves the data from the SPI flash to the dedicated DDR RAM. No par- data transfer speed. allel flash is needed. We use the same SPI

44 Xcell Journal Second Quarter 2008 SYSTEM DEVELOPMENT CustomCustom ProcessorsProcessors inin FPGAsFPGAs You can build a custom processor with the newest Xilinx FPGAs.

by Ilya Tarasov Associate Professor KSTA [email protected]

Processor cores, systems, and designs are gaining ground in the FPGA solution portfolio. Xilinx® MicroBlaze™ and PicoBlaze™ processors, along with the EDK development tool, have paved the way to processor-based systems on a single chip. You are not, however, limited to these solutions only – you can always implement a soft processor with your desired features. In this article, which is devoted to soft processors in general, I’ll attempt to determine when it is preferable to implement custom soft-core processors instead of proven Xilinx solutions.

How Custom Processor Design Works Before discussing why you might need to design a custom processor, let’s take a look at the design methodology. I’ll use a very simple example of a processor to show the main steps and what problems you might encounter.

Second Quarter 2008 Xcell Journal 45 SYSTEM DEVELOPMENT

The heart of the processor is a finite state multipliers and XtremeDSP™ solution Non-Symmetrical Register Architecture machine (FSM). An FSM is quite effective slices, and no processor core can achieve A MicroBlaze embedded processor realizes because the latest FPGAs are synchronous comparable benchmarks given the highly a symmetrical three-address register archi- by nature; therefore, an FSM is a good base paralleled nature of the DSP block array. tecture. This means that any registers may from which to describe processor behavior. You may want to control this array with perform the same operations (with few Being an FSM, the processor must change standard interfaces and protocols, collect exceptions for MicroBlaze processors), its state at the rising edge of the clock signal. statistics, and perform additional tasks in while general commands may choose Let’s take a straightforward example each DSP block. This is a job for processor operands and destinations separately. where we have three 8-bit registers: PC cores. But you can’t just place the same A three-address architecture with many (program counter); RA and RB (general- number of MicroBlaze processor cores as registers is good for compilers, because purpose registers); and an 18-bit command embedded multipliers on the FPGA. For many variables can be loaded into the bus named “cmd.” This scheme is very effi- these purposes, simple processor cores with processor without needing to be stored cient for an FPGA with 1,024 x 18-bit little register sets, short commands, and back into memory. If you have fewer regis- organization. Figure 1 shows the behavioral small command memories (even based on ters, some variables may be forced to code for this example. distributed RAM) may prove very useful. unload when the number of active variables This example illustrates the implementation of a fully synchronous digital design. If you can provide an appropriate sequence of -- register declaration omitted for saving article space commands for each new program counter process(clk) value, you’ll get a programmable state begin if rising_edge(clk) then machine; in other words, a processor. case conv_integer(cmd) is Of course, you may want to use on-chip when 0 => pc <= pc + 1; -- no operation, but we must remember to increment program counter when 0x100 to 0x1FF => RA <= cmd(7 downto 0); pc <= pc + 1; -- load imm value to RA FPGA resources for a complete system-on- when 0x200 to 0x2FF => if conv_integer(RA) = 0 then pc <= cmd(7 downto 0); chip solution without any external memo- else pc <= pc + 1; ry components. But looking at block RAM end if; -- jump if RA = 0 -- we can add any ‘simple’ commands, which will affect any desired registers waveforms, you might find it necessary to when 1 => RA <= RA + RB; pc <= pc + 1; spend at least one clock cycle to read com- --- etc, with format: when => RA RB; pc <= pc + 1; mand from memory. To correct this, let’s -- also, let’s look on the parallel data processing and combined commands implement a two-stage FSM where the first when 100 => RA <= RA + 1; RB <= RB – 1; pc <= pc + 1; -- different operations with two registers stage is “read” and the second stage is “exe- -- we may also add other complex commands, which have no direct analogs in common instruction sets cute,” shown in Figure 2. end case; end if; end process; Why (and When) to Opt for a Custom Processor Designers familiar with the MicroBlaze processor and EDK software may ask why Figure 1 – Simple decoder and ALU for custom processor we need yet another core, or how it could replace proven, high-performance products. Indeed, you don’t need another processor in general if you can perform a common task -- in architecture section signal st : bit; -- state of core: 0 – fetch, 1 – execute; using well-known approaches. -- after ‘begin’ keyword But because the MicroBlaze processor is process(clk) begin a 32-bit RISC core, it is possible to achieve if rising_edge(clk) then case st is a significantly different solution. Let’s ana- when ‘0’ => st <= ‘1’; -- simple wait for BRAM, go to EXECUTE state lyze when these steps might lead to valu- when ‘1’ => st <= ‘0’; -- return to FETCH state case conv_integer(cmd) is able results. when 0 => pc <= pc + 1; …. – all of the ‘case’ code from the previous example end case; DSP Algorithms with High Parallelism end case; -- st end if; In this case, FPGA performance is deter- end process; mined by the structure of hardware multiplier blocks, which realize certain algorithms of data processing. There may Figure 2 – Simple processor engine for use with block RAM be many such data paths with embedded

46 Xcell Journal Second Quarter 2008 SYSTEM DEVELOPMENT

mentation of deeply pipelined, register- rich processor cores. DSP Core Datapath • On-chip, dual-port block RAM stores program and data on the same chip to make FPGAs a system-on-chip. Two independent ports also extend memory Start/Stop bandwidth and provide easy download- Coeffecient Reloading ing and debugging. Statistics Distributed • Distributed RAM in SliceM allows you RAM to create small memory sections, such as register files, buffers, stacks, and FIFOs. Several Registers All of these are important parts of a Simple Command Set Compact Size processor system and can significantly improve performance and functionality. • The newest Xilinx Spartan™-3AN Figure 3 – A DSP coprocessor with general-purpose core FPGAs with internal flash memory allow the implementation of a single- chip system with as much as 11 Mb of becomes greater than the number of regis- universal (indeed, you can’t add hundreds flash memory and 72 kb of dual-port ters. (This is not a problem for less complex of I/O modules this way as you can for an SRAM. This is enough to complete a algorithms, when calculations are simple on-chip peripheral bus). However, it may serious embedded system-on-chip. enough to impose some limitations on be an “ad hoc” solution. processor architecture.) • High-performance Virtex™-5 FPGAs For example, you can use a dedicated Advantages of FPGA-Based Processors provide better results for processor soft accumulator to act as a destination register The unlimited reprogrammability of cores, thanks to their six-input look-up for all ALU activity. This allows command FPGAs allows you to realize very uncom- tables (LUTs). In real projects (with width to shrink because you can exclude mon instructions. This is important, ALU based on combinatorial logic), information about destination registers (it because it is virtually impossible to add any this makes it possible to implement the is always known). Looking further, it is non-standard, uncommon instruction to processor with fewer logic levels than possible to realize a zero-operand processor ASIC processors without ensuring its fur- with previous Virtex platform FPGA (with a stack architecture like Forth proces- ther usage. If you don’t meet the market generations (and other FPGAs with sors or Java machines) with the smallest requirements for ASICs, you’ll lose time four-input LUTs). In general, when possible command width. Code compact- and money, but an experimental instruc- compared to Virtex-4 devices, proces- ness is pretty useful in the FPGA world tion set costs almost nothing with FPGAs. sor core designs become 30% faster. because of the limited amount of on-chip If an experimental architecture with block RAM. additional commands is not required, you Conclusion can simply reset the FPGA configuration Now that designers have access to real hard- I/O-Oriented Project and download the MicroBlaze processor as ware prototyping devices, it is high time to Here’s another uncommon approach to I/O a proven, supported solution. Experimental take a fresh look at the embedded processor module integration. In generic processor architectures may be very effective for par- world. With powerful and easy-to-use cores, it is necessary to assign some address- ticular purposes, and we can always try development tools like ISE™ software, you es to each I/O port and then write low-level them at the beginning of a new project. can easily try processor designing. This is a drivers that will exchange with these I/O Uncommon algorithms, parallel calcula- promising field for research teams, quali- cores through in and out commands. tions, or operations with ultra-wide num- fied designers, those who need a specific But if you implement any high-per- bers are quite appropriate reasons to use computing platform, and IT enthusiasts. formance-critical I/O cores, you may want custom processor architectures. CTC Inline Group in Russia, a training to boost and simplify data exchange with Xilinx FPGAs have a complex set of fea- center, is now holding training courses on them. You can also turn to dedicated tures that make all processor implementa- this topic; additionally, you will find a non- buses for modules with dedicated data tions effective. For example: commercial community of those creating exchange commands, too. This takes addi- experimental processor implementations tional resources and adds special com- • The balanced set of resources and reg- with FPGAs at www.fforum.winglion.ru mands, of course, so the result will not be ister-rich architecture allow for imple- (with text in both Russian and English).

Second Quarter 2008 Xcell Journal 47 SYSTEM DEVELOPMENT Designing Multiprocessor SOCs Using Xilinx EDK tools and IP, you can scale your applications by designing SOCs with multiple processors.

by Vasanth Asokan Staff Software Engineer Xilinx, Inc. [email protected]

Given the rapid growth of embedded processing requirements, system archi- tects are turning toward multiprocessor designs to solve the twin problems of burgeoning complexity and inadequateness in single processor systems. With their high logic density and high-performance hard blocks, recent generations of FPGAs have made powerful chip multiprocessing (CMP) solutions a reality. The challenge now lies in how you can rapidly explore and create designs in this solution space. Xilinx® Embedded Development Kit (EDK) tools and IP provide the flexibility to create uniquely crafted, customized multiprocessing solutions on FPGA logic real estate that can meet both price and performance targets. In this article, I’ll give a broad overview of multiprocessing concepts as they apply to Xilinx solutions based on PowerPC and MicroBlaze™ embedded processors.

48 Xcell Journal Second Quarter 2008 SYSTEM DEVELOPMENT

Scenarios processors to act on the data stream in Apart from the SMP scenario, all other Performance and functional partitioning a pipeline fashion. Each peer stage in scenarios are feasible on Xilinx FPGAs are compelling reasons to design systems the multiprocessor pipeline acts on one with EDK tools. The unique capability of with multiple processors. In general, there portion of the computation before Xilinx processing solutions is the flexibili- are some commonly encountered scenarios passing it to the next processor. This is ty to customize each of the processing where multiprocessing can help: an effective way to increase system subsystems to the application require- throughput. ments. For example, not all processors • Multiple independent functions. The may need a cache or a floating-point unit. design may have multiple, independent • Reliability and redundancy. You can By assigning specific functions to specific sets of processing tasks. An attractive replicate processing systems multiple processors, you can create a tailored solu- way to address this challenge is to cre- times to provide reliability and tion that meets all design goals. ate independent processing modules redundancy. dedicated to each processing task, • Symmetric processing. Traditional sym- assigning each processing module a A Simple and Scalable System Architecture metric processing (SMP) is a useful unique processor and peripheral set. As you can see, a number of use models solution with which you can scale up are possible for multiple processors. • Control or data-plane offload. One (by adding more processors) the per- There are also a large number of possibil- common scenario is the presence of a formance of applications that do not ities for the system architecture. distinct set of real-time (compute- or possess clean partition boundaries. An Reconciling a use case to a clean, scalable data-intensive) and non-real-time tasks SMP-capable OS layer manages paral- topology and architecture can be daunt- that might cause a non-response in lel tasks and automatically schedules ing, so it helps to define a baseline archi- solutions based on a single processor to them across multiple processors. The tecture that fits most needs. be non-responsive. In these cases, you SMP use model cannot be applied to Figure 1 illustrates a dual-core architec- can dedicate a slave processor to per- Xilinx processors, however, because ture. This architecture presents a simple form the real-time task in a timely they lack cache coherency, a require- and scalable multiprocessor system defini- manner. This leaves the master proces- ment for implementing SMP. tion. You can generate derivative topolo- sor to perform other regular tasks and serve as an interface to the

host system. The master PLBv46 PLBv46 processor also monitors and Private Boot Memory Private Boot Memory controls the slave processor. XPS_INTC XPS_INTC XPS Block RAM The slave processor may Shared External XPS Block RAM contain specialized func- Memory MicroBlaze tions or interfaces, allowing PPC405 Processor it to meet computation IPLB1 IPLB0 MPMC IXCL IPLB DPLB1 DPLB0 (External Memory) DXCL DPLB performance requirements. PLB XCL Some examples of this sce- PLB XCL LL PLB nario include network Free LL offload, media processing, and security algorithms. Trimode Ethernet PCI Express MAC • Interface processing. On Peripherals for Send FIFO Peripherals for

systems that act as a bridge Processor 1 Receive FIFO Processor 2 PLBv46 XPS_Mailbox Core PLBv46 for or switch between mul- Port A Port B

tiple interfaces, you can Mutex Array

dedicate a slave processor to PLBv46 XPS_Mutex Core PLBv46 Port A Port B the processing of data at Shared Block RAM Memory each interface, while one or XPS Block RAM PLBv46 PLBv46 more master processors per- Port A Port B form higher level bridging PLB_v46 BRIDGE and switching tasks. PLBv46 PLBv46 Port A Port B • Stream processing. For handling stream-oriented com- Figure 1 – Dual processor architecture putation, you can arrange

Second Quarter 2008 Xcell Journal 49 SYSTEM DEVELOPMENT

gies, starting from this definition, to meet nously or asynchronously. It can be used Other Considerations design constraints or challenges. The key for either directly passing messages or for You will typically apply a few other con- concepts of this architecture are as follows: passing pointers to messages stored in siderations to your multiprocessing shared memory. You can use the XPS architecture. For instance, you will need • The architecture is a simple extension Mutex core for arbitrating accesses to to define memory maps in a non-con- of two completely independent single shared resources (whether they are on- flicting manner between the two proces- processing systems, formed by linking chip or off-chip) between software on sors. The automatic address generation the systems together with communica- the two processors. Together, these cores tools in EDK simplify this task down to tion components. allow you to build cooperating software a push of a button. • The shared components are all multi- programs on each processor. You will also need to give some (or dual-) ported in nature. The multi- thought to your clocking and reset net- • Some systems might wish to share ported nature of these components works. You will have an option to clock peripherals that are not multiported (like allows each processor’s system bus to all of the processors at the same rate or a UART or SPI or I2C). Such a situation be independent of the other, both in use different clocking domains for each requires providing a system bus bridge terms of static as well as dynamic load. processor. Similarly, you can define reset from the bus that does not connect to By isolating each processing subsys- domains at various levels such as proces- the peripheral to the bus that does con- tem, you ensure that the system bus is sor-only reset, processor subsystem reset, nect to the peripheral. Figure 1 shows the not locked out for a processor or and system reset. The processors must use of a bus bridge to share a UART peripheral because of a transaction also be independently connected to a between the two processors. executed for another processor. All debugger port, thus allowing separate multiported peripherals arbitrate • Figure 1 intentionally shows a PowerPC debugging sessions for each processor. accesses on various ports internally. 405 processor as the first and a Beyond the hardware considerations, MicroBlaze processor as the second to you will also be designing your software • The key shared peripheral is the multi- illustrate certain specific characteristics of systems such that they can work cooper- ported memory controller (MPMC). each processor. However, any one proces- atively. This includes using shared mem- MPMC provides access to external sor can be equivalently replaced by the ory, message passing, and common memory through different port inter- other with very minimal adaptation. synchronization concepts such as barri- faces. Multiple processors can connect Thus, this architecture can be seamlessly ers and rendezvous so that the system to MPMC through independent ports. transitioned between processors. behaves in a predictable and synchro- This topology allows both the nized fashion. Commercial software PowerPC and MicroBlaze processors to Although Figure 1 illustrates the recom- stacks are also available that can provide simultaneously access external memory mended overall multiprocessing architec- higher level communication paradigms. with minimal latency and high band- ture, various constraints may require you to width. MPMC currently provides a further refine the architecture. For instance, Conclusion maximum of eight ports, thus allowing in a system in which logic area and resource For a more details about the possibilities as many as three to four processors to usage are a key concern, all of the processors of multiprocessor systems, a longer white connect to a single external memory. could be connected to the same system bus. paper version of this article is available at: Although this makes the system less deter- www.xilinx.com/support/documentation/ • The architecture also shows sharing of ministic and increases run-time load on the white_papers/wp262.pdf. Also consider the internal block RAM memory between bus, it offers area savings by eliminating a reference designs described and provided processors. Sharing on-chip block new system bus as well as removing the need in Xilinx application note XAPP996, RAM can be an extremely fast way to for multiple ports on IPs. “Dual Processor Reference Design Suite,” pass kilobyte-sized data between the Other derivative architectures are also at www.xilinx.com/support/documentation/ processors. Accesses to block RAMs possible, such as having the high-perform- application_notes/xapp996.pdf. can also be deterministic – an impor- ance processor on a separate system bus Stay tuned for new features in future tant requirement in some applications. and multiple low-performance processors EDK tools, such as support for auto- • Apart from shared memory, there are on a shared system bus. You can also create mated multiprocessor design creation two other cores – the XPS Mailbox and hierarchical topologies by connecting pro- and cooperative debugging. The Xilinx XPS Mutex – that provide simple forms cessing subsystems to each other through Virtex™-5 FXT platform, with the of interprocessor communication. The multiple levels of bridges. The various powerful PowerPC 440 processor, also XPS Mailbox core provides a low-laten- tools and IP provided in EDK allow you to opens up endless possibilities for creat- cy, FIFO-style message-passing interface further refine this base topology down to ing ultra-high performance multi- between the two processors synchro- something that exactly fits your needs. processor systems.

50 Xcell Journal Second Quarter 2008 Get on Target

Let Xcell Publications help you get your message out to thousands of programmable logic users.

Hit your target by advertising your product or service in the Xilinx Xcell Journal, you’ll reach thousands of qualified engineers, designers, and engineering managers worldwide.

The Xilinx Xcell Journal is an award-winning publication, dedicated specifically to helping programmable logic users – and it works.

We offer affordable advertising rates and a variety of advertisement sizes to meet any budget!

Call today : (800) 493-5551 or e-mail us at [email protected]

Join the other leaders in our industry and advertise in the Xcell Journal!

www.xilinx.com/xcell/ R SYSTEM DEVELOPMENT AcceleratingAccelerating VideoVideo DevelopmentDevelopment onon FPGAsFPGAs UsingUsing thethe XilinxXilinx XtremeDSPXtremeDSP VideoVideo StarterStarter KitKit Try a video development methodology that is processor-friendly, generates highly optimized results, and does not require VHDL or Verilog knowledge.

by Tom Hill You can achieve FPGA performance The FMC-video daughtercard extends Sr. Marketing Manager, DSP gains by exploiting the flexibility of the the video I/O capabilities of the Spartan-3A Xilinx, Inc. device to configure a hardware architecture DSP 3400A development platform by [email protected] optimized for a particular application. including the following additional interfaces: This flexibility adds a degree of freedom to • DVI-I input, both digital and analog Along with next-generation video com- the development process that also con- pression standards, the industry shift tributes to its complexity. • Composite input and output from basic video processing to more The XtremeDSP™ Video Starter Kit • S-video input and output complex and integrated processing solu- (VSK) provides a complete and easy-to-use tions are driving system requirements for design environment. Example applications • Two camera inputs video performance beyond what stand- and full support for standard Xilinx tool alone DSPs can deliver. FPGAs such as flows help accelerate the design process yet Video Development Tools the Xilinx® Spartan™-3A DSP fill this still allow for end-product differentiation. You can create video applications for the gap for cost-sensitive military, automo- VSK without RTL knowledge or experi- tive, medical, consumer, industrial, and Introducing the XtremeDSP VSK – ence using the Xilinx Embedded security applications by providing more Spartan-3A Edition Development Kit (EDK) and System than 20 GMACs of DSP performance The XtremeDSP Video Starter Kit – Generator for DSP. EDK is a comprehen- for less than $30. FPGAs uniquely pro- Spartan-3A Edition is a video development sive solution for designing embedded vide logic, embedded processing, OS platform comprising the Spartan-3A DSP programmable systems and includes the support, and drivers to offer a complete 3400A development platform, the FMC- Platform Studio tool suite, embedded IP end-to-end solution for video. video daughtercard, and a VGA camera. cores, and the MicroBlaze™ embedded It is not that developers lack under- The Spartan-3A DSP 3400A develop- processor. standing about the performance benefits of ment platform, which you can purchase System Generator for DSP enables the FPGAs that prevent their use in video separately, is built around the Spartan-3A use of The MathWorks Simulink/MAT- applications; rather, it’s a lack of experience DSP XC3SD3400A device. This device LAB modeling environment for FPGA with the design flow. This is especially true provides 126 embedded DSP blocks for design by providing a Simulink blockset of for traditional DSP program developers implementing coprocessing and high-per- more than 100 Xilinx-optimized DSP accustomed to programming in C. formance video processing systems. building blocks.

52 Xcell Journal Second Quarter 2008 SYSTEM DEVELOPMENT

Block RAM

ILMB DLMB

Processor MicroBlaze GIP GPIO GPIO MDM UART System Reset Processor DIP Switches Push Buttons LEDs

PLB PLB ARB

XCL IXCL Clock System ACE MPMCXPS IIC XPS IIC Generator Technology DXCL

DDR2

Figure 1 – Base platform block diagram

Developing Video Applications The Xilinx VFBC is ideal for video appli- video applications running on Xilinx Using the Base Platform cations where the hardware control of two- FPGAs. Each reference design is built on An embedded system called the base plat- dimensional data is required to achieve the base platform and uses custom form provides the framework from which real-time operation. This is typical of motion peripherals from the VSK IP library. you can develop video applications using estimation, video scaling, on-screen displays, Table 1 lists each reference design and the the VSK. The base platform is an embedded and video capture used in video surveillance, video processing and connectivity capa- system created using the Xilinx Platform video conferencing, and video broadcasts. bilities that it illustrates. These reference Studio base system builder (BSB) and designs are intended to serve as a starting includes a MicroBlaze embedded processor. Jump-Start Development point from which further development This framework provides a starting Using VSK Reference Designs may occur. Figure 2 shows how the DVI point for new designs or serves as an easy The VSK provides three reference designs pass-through reference design interfaces migration path for existing applications for jump-starting the development of into the base system. developed on processor-based systems. You can recompile any C code created for external processors on the MicroBlaze processor Reference Design Functionality Description with minimal effort; once ported, the high- performance video chains can migrate from DVI Pass-Through • Capturing a video stream from the input port software to the FPGA fabric. • Performing real-time image processing on the video stream To assist in this migration, the VSK • Displaying the processes video includes an IP library of custom peripherals that you can easily add to the base sys- DVI Frame Buffer • Capturing a video stream from a DVI source tem using Platform Studio. You can • Buffering the video stream in external memory connect to the video interfaces, manage • Displaying the buffered video data frames, and perform memory access • Reporting memory bandwidth utilization data and basic video processing. These custom peripherals include: Camera Frame Buffer • Capturing a video stream from a camera • DVI in • Performing processing on the video stream • Buffering the video stream in external memory • DVI out • Displaying the processed video at a different rate • Camera • Using a microprocessor to configure various aspects of the video pipeline • Video frame buffer controller (VFBC) • Video processing pipeline Table 1 – VSK reference design summary

Second Quarter 2008 Xcell Journal 53 SYSTEM DEVELOPMENT

Block RAM

ILMB DLMB

Processor MicroBlaze GIP GPIO GPIO MDM UART System Reset Processor DIP Switches Push Buttons LEDs

PLB PLB ARB

XCL Clock System ACE MPMCXPS IIC XPS IIC Generator Technology

Video In DVI_IN DE_GEN Gamma_in 2D FIR Gamma_out DVI_OUT Video Out Filter

Developed Using System Generator

Figure 2 – Base system with video pipeline

Using Model-Based Design formance of Simulink simula- to Create Video Applications tions up to 1,000x. This accel- Accelerating video applications on FPGAs eration enables video algorithm requires that performance-critical opera- development and debugging tions migrate from software running on using real-time video streams processors to hardware. The VSK supports read into Simulink through a variety of hardware design flows. This The MathWorks data acquisi- includes flows that leverage strong hardware tion toolbox. design backgrounds using VHDL/Verilog and flows that accommodate little or no Conclusion hardware design experience by leveraging The XtremeDSP Video Starter more abstract modeling environments Kit – Spartan-3A DSP Edition including C, MATLAB, and Simulink. Figure 3 – System Generator diagram keeps development costs low by Simulink from The MathWorks is a for a camera video processing pipeline providing a complete video model-based design environment that you development solution for under can use to develop algorithmic models of providing a rich set of DSP building blocks, $1,600. The DSP and embedded design video systems. The MathWorks provides an optimized for Xilinx devices. Tight integra- tools included enable rapid FPGA develop- optional video and imaging blockset for tion allows DSP designs captured in System ment of the video system without requiring Simulink that provides a rich set of video Generator to be converted into custom RTL design experience. building blocks for easily processing peripherals for Platform Studio and con- DSP-optimized FPGA platforms such streaming video and visualizing the results nected to the base system using the proces- as the Virtex-5 SXT device, Virtex-4 SX at each step in the model. sor local bus or Fast Simplex Link bus. device, and Spartan-3A DSP are particu- You can initially model the video pro- Figure 3 shows an example of a camera larly well suited to meet both the cost and cessing algorithm itself abstractly using video processing pipeline created using performance requirements of high-per- floating-point data types and high-level System Generator, which is included in the formance video and image processing video and imaging blocks, refining the camera frame buffer reference designs applications in security, broadcast, indus- algorithm as you consider the trade-offs shipped with the VSK. trial, consumer, medical, and automotive associated with complexity, system cost, System Generator supports hardware-in- applications – while insulating products and performance. the-loop co-simulation using the Spartan-3A against early obsolescence. System Generator for DSP enables the DSP 3400A development platform. You For more information, visit www. use of Simulink for Xilinx FPGA designs by can use this platform to accelerate the per- xilinx.com/s3adsp_vsk.

54 Xcell Journal Second Quarter 2008 First published in Microprocessor Report, November 13, 2007. PLATFORMS, PROCESSORS & TOOLS Copyright 2007 by In-Stat, a unit of Reed Business Information. Reprinted with permission. MicroBlaze v7 Gets an MMU Memory Manager Brings Full-Fledged Linux to Xilinx Processor Core

By Tom R. Halfhill processor cores intended for synthesis in application programs, and multiple pro- Senior Analyst FPGAs – primarily Altera’s Nios II and grams while avoiding memory collisions In-Stat/Microprocessor Report ARM’s new Cortex-M1. Those rival that would compromise security and relia- [email protected] processors don’t have MMUs or MPUs. bility. In addition, virtual memory allows a MicroBlaze v7 is available now as part of a system to work with less physical memory, Xilinx is upgrading its MicroBlaze embed- $495 development kit. (An extra $100 which can reduce costs and power con- ded-processor core again, this time adding buys a development board with a Xilinx sumption. As small embedded systems an optional memory-management unit Spartan-3E FPGA.) That price includes a increasingly tackle heavy-duty applications (MMU) that allows the 32-bit processor to MicroBlaze v7 license, and developers owe once limited to bigger systems, sophisticat- run sophisticated operating systems sup- no royalties when deploying the processor ed memory management is becoming porting virtual memory. Developers can in Xilinx chips. almost a necessity. also substitute a simpler memory-protec- Of course, many embedded systems tion unit (MPU) or omit supervised mem- Robust Memory Management don’t need this level of memory manage- ory management altogether. Xilinx has steadily improved MicroBlaze ment, so the MicroBlaze MMU is optional. The first full-fledged operating system since introducing the soft processor in 2001. It’s just another configuration option avail- announced for the new MicroBlaze v7 is Two years ago, Xilinx began offering an able when MicroBlaze developers synthesize Lynuxworks BlueCat Linux. Until now, optional FPU. (See MPR 5/17/05-02, the processor using the Xilinx development MicroBlaze processors were limited to sim- “MicroBlaze Can Float.”) Last year, Xilinx tools. Another option is to implement an pler embedded operating systems that don’t lengthened the instruction pipeline to permit MPU, which provides memory protection support virtual memory or memory protec- higher clock speeds. (See MPR 11/13/06-01, without virtual memory and address trans- tion. With its optional MMU or MPU, “Xilinx Revs Up MicroBlaze.”) And earlier lation. An MPU is appropriate for embed- MicroBlaze v7 is suitable for a wider range this year, Xilinx released MicroBlaze v6, ded systems that must shield a program’s of embedded applications requiring greater which added a few other minor enhance- memory region against accidental or mali- security and reliability. ments. Now, with MicroBlaze v7, Xilinx is cious intrusions by other programs. Yet MicroBlaze v7 has other improvements introducing big-league memory manage- another option is to implement privileged- as well. New instructions provide faster ment, which significantly expands the range mode execution without memory protec- floating-point performance and better I/O of embedded applications for which tion or virtual memory. In privileged mode, with coprocessors and custom logic. In MicroBlaze is suited. only the operating system or a privileged addition, Xilinx has upgraded the With an MMU, MicroBlaze can run application program can execute certain CoreConnect interface to the latest full-fledged operating systems based on the instructions critical to system security. CoreConnect Processor Local Bus (PLB) Linux 2.6 kernel. Windows CE is another Table 1 shows how each option (MMU, v4.6 specification, which provides faster possibility, although Xilinx hasn’t MPU, or privileged execution) affects the links to on-chip peripherals. announced a deal with Microsoft yet. size of the synthesized processor, as meas- All these improvements strengthen Processors that support virtual memory ured by the number of additional lookup MicroBlaze’s position against competing can run powerful operating systems, larger tables (LUT) required to implement the

Second Quarter 2008 Xcell Journal 55 PLATFORMS, PROCESSORS & TOOLS

the maximum theoretical bandwidth of a Memory Manager Type LUTs (Virtex-5) LUTs (Spartan-3) 128-bit PLB would be 8.8MB/s. Memory Management Unit (MMU) 910 1,100 The PLB connects directly to the CPU and provides a multidrop bus shared by Memory Protection Unit (MPU) 560 670 several on-chip peripherals. It supplements PrivIieged Mode Only 34 38 a proprietary Xilinx interface called the Fast Simplex Link (FSL). An FSL is a direct Table 1 – Sizes of three MicroBlaze v7 memory-management options. Each option occupies point-to-point interface, not a multidrop additional lookup tables (LUT) in the FPGA, enlarging the processor core. (A fully configured bus. FSLs are faster than shared buses but core requires about 3,000 LUTs.) Note that the sizes for these options are slightly different in require more logic gates per I/O interface. Xilinx Virtex-5 and Spartan-3 FPGAs, because the LUTs are different: Virtex-5 LUTs have six inputs, whereas Spartan-3 LUTs have four inputs. An SoC design can use one or more FSLs in combination with a CoreConnect PLB feature in the FPGA’s programmable-logic an easier time porting virtual-memory oper- for different purposes, giving developers a fabric. Implementing privileged mode alone ating systems from the Power Architecture to wealth of options. requires so few LUTs that it’s practically a MicroBlaze. Second, developers should find Figure 1 shows an example of an SoC no-brainer. The other options demand it easier to create a multicore design that implemented in a Xilinx FPGA with a more forethought. In particular, the MMU mates one or more MicroBlaze cores with a MicroBlaze or Power 405 processor core. – which requires about 1,000 LUTs – will Power 405 in a shared-memory configura- In this example – a TCP/IP packet proces- account for approximately one-third of a tion. And third, the Power-style MMU pre- sor – the most critical datapaths link the fully configured MicroBlaze v7 core. (To pares MicroBlaze v7 for possible future Ethernet controller to the external-memory put that size in perspective, the Spartan-3E arrangements with IBM to integrate newer controller and the CPU. Those datapaths 1600E chip on a MicroBlaze v7 develop- Power cores into Xilinx FPGAs. are FSLs, which can be 32 to 128 bits wide. ment board has about 33,000 LUTs and Less critical components share a costs $10 to $15, depending on volume.) Faster CoreConnect Bus CoreConnect PLB. The Xilinx develop- The full-blown MMU is the largest CoreConnect is IBM’s on-chip bus for ment tools can automatically configure an memory-management option, partly SoCs, introduced in 1999. (See MPR FSL for a particular purpose or allow devel- because it needs a translation lookaside 7/12/99-03, “PowerPC 405GP Has Core- opers to configure the interface manually. buffer (TLB) to cache a subset of the table Connect Bus.”) Although IBM created With IBM’s blessings, Xilinx has slightly that translates virtual and physical memory CoreConnect mainly for its own Power modified the standard CoreConnect IP. addresses. MicroBlaze v7 has a 64-entry Architecture processors, anyone can freely These modifications were necessary because unified TLB. To supplement this software- license CoreConnect as synthesizable intel- the programmable-logic gates of FPGAs managed buffer, there are shadow entries lectual property (IP), and it’s not specific to aren’t as efficient as the standard-cell gates of for instruction-memory pages and data- a particular CPU architecture. Over the ASICs, especially when routing signals across memory pages. The number of shadows is past eight years, soft-IP vendors have made a large chip. Datapaths and clock trees tend user configurable: one, two, four, or eight many of their licensable peripheral cores to stretch much longer in FPGAs, making entries for instructions and the same for compatible with CoreConnect. The only timing closure more difficult. The problem data. (The default configuration is two on-chip bus supported more widely is gets worse in complex designs that distribute shadows for instructions and four shadows ARM’s AMBA. numerous peripherals throughout the chip for data.) The processor automatically For greater efficiency, CoreConnect sep- on a shared bus. Consequently, Xilinx has manages the shadows, which prevent the arates low-speed and high-speed peripher- modified the PLB to be more synchronous TLB from thrashing. Memory pages can als on separate buses joined by a bridge. and to eliminate indeterminate data bursts. range in size from 1KB to 16MB, and Until now, MicroBlaze supported only the Xilinx says these changes, though relatively mixed sizes are allowed. With 32-bit effec- slower On-chip Peripheral Bus (OPB), minor, allow developers to attach 10 or 20 tive addressing, MicroBlaze v7 can address which has a 32-bit-wide datapath. peripherals to the CoreConnect PLB with- up to 4GB of flat memory. MicroBlaze v7 still supports the OPB for out timing problems. The MicroBlaze MMU is patterned after backward compatibility but adds support In addition, Xilinx has modified the the one in an IBM Power 405 processor. for the faster Processor Local Bus (PLB). PLB to work more efficiently with hard- That’s no coincidence. Some Virtex family The PLB’s datapaths are configurable dur- ened transceivers built into some Virtex-5 FPGAs integrate a hardened Power 405 core, ing synthesis for 32-, 64-, or 128-bit FPGAs. These transceivers are a PCI which is much faster than a MicroBlaze widths. Bus bandwidth depends on the Express endpoint and a trimode Ethernet processor synthesized in the fabric. Having a width of these datapaths and the FPGA’s media-access controller (TEMAC), which similar MMU brings a few advantages to clock frequency, which can reach 550MHz deliver much better performance than MicroBlaze v7. First, programmers will have in the fastest Virtex-5 devices. At 550MHz, would equivalent soft-IP controllers synthe-

56 Xcell Journal Second Quarter 2008 PLATFORMS, PROCESSORS & TOOLS

Eight new instructions improve I/O when a coprocessor is connected to a Custom MicroBlaze core via an FSL. These instruc- Local Memory Coprocessors tions take the form of PUT and GET operations, and they allow programs to manage the coprocessor I/O as blocking or non- MicroBlaze or PowerPC Debug blocking operations. In a blocking relation- PLBv46 ship, the CPU stops processing other operations until it handles a coprocessor’s Multiport Interrupt request for attention. In a non-blocking Memory Controller Controller DMA relationship, the CPU continues processing Timer/PWM other operations while the coprocessor’s requests enter a FIFO buffer. The CPU isn’t I2C/SPI Ethernet MAC blocked unless the buffer fills. Developers UART can configure the size of the buffer accord- PCIe ing to the coprocessor’s needs. GPIO General Peripheral In addition, MicroBlaze v7 can accom- Controller CAN/MOST modate twice as many FSL interfaces as before (16 vs. 8), and programs can dynam- USB 2.0 Custom I/O ically assign coprocessors to individual FSL Peripherals interfaces at run time. Previously, coproces-

Virtex or Spartan FPGA sor assignments to FSLs were hard-coded into the user’s software. Any changes Figure 1 – Example SoC block diagram. This packet processor uses Xilinx Fast Simplex required developers to recompile the soft- Links (FSL) for critical datapaths and a shared IBM CoreConnect Processor Local Bus ware. Dynamic assignments allow develop- (PLB) for other on-chip peripherals. MicroBlaze v7 is the first version of the core to support ers to write software that adapts to the CoreConnect PLB; previous MicroBlaze cores supported only the slower CoreConnect On-chip Peripheral Bus (OPB). The multiport memory controller is special Xilinx IP with changing conditions and workloads. For built-in DMA; it’s compatible with DDR1 and DDR2 SDRAM. example, a developer could create precom- piled software libraries that select their sized in the fabric. The PCI Express end- operations, and eight for use with FSLs. appropriate coprocessors at run time, point is fully buffered and supports 1, 2, 4, The new floating-point instructions are according to the custom hardware in the or 8 lanes. The TEMAC transceiver supports straightforward. One instruction, FSQRT, coprocessors and the tasks to be performed. 10Mb/s, 100Mb/s, and Gigabit Ethernet calculates the square root of a 32bit float- Multimedia-acceleration libraries could run rates. In a Xilinx benchmark test, a packet ing-point value in either 27 or 29 clock on coprocessors specializing in fast Fourier processor based on MicroBlaze v7 and the cycles, depending on whether the TEMAC achieved 750Mb/s raw throughput MicroBlaze processor is configured with a – an impressive 75% utilization of the trans- three- or five-stage pipeline. (The deeper ceiver’s maximum theoretical bandwidth. pipeline is faster.) Without using FSQRT, Price & Availability Although AMBA enjoys wider support the same operation performed by a func- MicroBlaze v7 is available now as part than CoreConnect does, the latter standard tion call to a software library would take is better for MicroBlaze. The Power 405 about 500 cycles. of Xilinx Embedded Development Kit 9.2, for processor built into some Virtex family The other two new floating-point $495. The kit includes MicroBlaze configuration FPGAs uses CoreConnect, so it’s easier for instructions convert integers to floats or tools, Eclipse-based software-development tools, developers to create asymmetric multi- vice versa. The FLT instruction converts a and other software, plus documentation. processors around the Power 405 and 32-bit integer into a 32-bit float in four MicroBlaze cores. Many peripheral-IP or six clock cycles, depending on the For $595, the kit comes with a development cores from third-party vendors work with pipeline depth. Calling the same function board, a Spartan-3E 1600E FPGA, and a either AMBA or CoreConnect, usually by in software would take 330 cycles. JTAG probe. Xilinx doesn’t charge royalties for adding a simple gasket adapter. Conversely, the FINT instruction con- deploying MicroBlaze designs in Xilinx FPGAs. verts a 32-bit float into a 32-bit integer in For more information, see: New Instructions Boost Performance five or seven clock cycles, depending on Xilinx has added 11 new instructions to the pipeline depth. A software function www.xilinx.com/microblaze MicroBlaze v7: three for floating-point call would take 88 cycles.

Second Quarter 2008 Xcell Journal 57 PLATFORMS, PROCESSORS & TOOLS

loss-leader product that Xilinx created to Feature MicroBlaze v6 MicroBlaze v7 sell more FPGAs. A MicroBlaze v7 license Embedded Dev Kit 9.1 Embedded Dev Kit 9.2 costs only $495, so the chips – not the TCP/IP Stack LwIP Treck licenses – are the real moneymakers. Xilinx is happy to see customers buying its FPGAs On-Chip CoreConnect CoreConnect to use with the Cortex-M1, too. Interconnect On-chip Peripheral Bus Processor Local Bus Even so, while ARM and Xilinx cordial- (OPB) (PLB v4.6) ly shake hands, MicroBlaze v7 is slapping External Xilinx Xilinx Multiport the Cortex-M1 upside the head. The Memory Controller MCH_OPB_DDR Memory Controller brand-new ARM processor suffers in com- Gigabit Ethernet parison with the Xilinx loss leader. Ethernet MAC Fast Ethernet Trimode MAC Although MicroBlaze v7 is priced at a pit- (10–100 Mb/s) (10–100–1,000 Mb/s tance, it’s embarrassingly rich in features missing from the Cortex-M1, such as an Virtex-5 LXT Xilinx FPGA Virtex-5 LX optional FPU, MMU/MPU, 32-bit Packet Throughput ~70 Mb/s >250 Mb/s divider, and instruction/data caches. On top of that, MicroBlaze can reach higher Table 2 – MicroBlaze v6 versus MicroBlaze v7 performance. This comparison is based on a clock frequencies than the Cortex-M1 can. packet-processor design pitting the new Micro-Blaze core against the older version. The upgraded ARM’s biggest selling point is that the design is more than three times faster. However, Xilinx also changed other variables, including the Cortex-M1 is from ARM. The ARM archi- TCP/IP stack, memory controller, and Ethernet controller, which clouds the comparison somewhat. In particular, the network interface leaps from a soft-IP Fast Ethernet MAC (100Mb/s) to a hard- tecture is almost an industry standard, and wired Gigabit Ethernet Trimode MAC (10–100–1,000Mb/s). it’s supported by tons of peripheral IP, development tools, and software. transforms (FFT) or finite impulse- in FPGAs and optimized for their pro- As Table 3 shows, Altera’s Nios II is a response (FIR) filters. grammable-logic fabrics. Previously, ARM closer match for MicroBlaze, even though Table 2 shows the results of porting a allowed licensees to test their designs in it hasn’t been significantly upgraded since packet-processor design from MicroBlaze v6 FPGAs but not to deliver finished designs 2004. (See MPR 6/28/04-02, “Altera’s New to v7 – throughput improved more than 3x, in the chips. ARM’s course change was CPU for FPGAs.”) The addition of an from about 70Mb/s to more than 250Mb/s. prompted partly by the rising costs of optional MMU/MPU gives MicroBlaze v7 However, notice that this comparison (con- designing and manufacturing ASICs, and its first big advantage over Nios II. ducted by Xilinx) doesn’t isolate the effect of partly by the popularity of the Xilinx However, Altera retains one advantage: a each variable changed in the design. In par- MicroBlaze and Altera Nios II cores. user-configurable instruction-set architec- ticular, the Ethernet controller in the (Xilinx and Altera have sold tens of thou- ture. Nios II developers can create custom upgraded design is much faster. sands of licenses for their processors.) The instructions to accelerate specific applica- Nevertheless, the comparison demonstrates Cortex-M1 is a major new development tions, which can dramatically improve per- what is possible. Increasing the maximum that alters the competitive landscape. (See formance. To achieve similar results, theoretical bandwidth of a system doesn’t MPR 3/19/07-01, “ARM Blesses FPGAs.”) MicroBlaze developers can implement guarantee higher throughput, and one fea- The first FPGA vendor to announce coprocessors in the programmable-logic ture of MicroBlaze v7 is better CoreConnect support for the Cortex-M1 was Actel, a fabric. (Coinciding with the online publi- support for the TEMAC hard-wired into much smaller company than Altera or cation of this article on November 13, Virtex-5 LXT chips. Xilinx. Actel has a special arrangement with Altera is announcing an arrangement with ARM to sell Cortex-M1 FPGAs without Synopsys that will make it easier for devel- New Kid on the Block: ARM requiring customers to purchase an ARM opers to move Nios II-based designs from MicroBlaze v7 is the second new version of license or pay chip royalties to ARM. This FPGAs to standard-cell ASICs. MPR plans the processor that Xilinx has introduced deal dramatically lowers the cost of deploy- to cover this development in the future.) this year, and the third since 2006. ing an ARM-based design. Xilinx hasn’t Note that the price gaps among these Although these steps have been incremen- announced a similar arrangement yet, but processors are shrinking. Before the tal, they add up to a significantly better MPR suspects it’s a possibility in the future. Cortex-M1, the difference between licens- processor. The quickening pace of Although the CortexM1 and MicroBlaze ing a processor core from an FPGA vendor improvement may be due to the arrival of processors would seem to make rivals of or licensing one from ARM was four orders fresh competition: ARM’s Cortex-M1. ARM and Xilinx, their relationship remains of magnitude: about $500 for a MicroBlaze The Cortex-M1 is the first ARM more cooperative than competitive. ARM or Nios II versus millions of dollars for an processor core sanctioned for deployment understands that MicroBlaze is primarily a ARM. With the Cortex-M1, ARM has

58 Xcell Journal Second Quarter 2008 PLATFORMS, PROCESSORS & TOOLS

Feature Xilinx Xilinx Altera Altera Altera ARM MicroBlaze v7.0 MicroBlaze v6.0 Nios II/f Nios II/s Nios II/e Cortex-M1 Architecture MicroBlaze MicroBlaze Nios II Nios II Nios II ARMv6-M Primary Virtex-5 Virtex-5 Stratix, Cyclone, Stratix, Cyclone, Stratix, Cyclone, Fusion, ProASIC-3, FPGA Targets Spartan-3 Spartan-3 HardCopy HardCopy HardCopy Stratix, Virtex-4/5 Cyclone, Spartan

Config. ISA — — Yes Yes Yes — Pipeline Depth 3 or 5 stages 3 or 5 stages 6 stages 5 stages 1 stage* 3 stages I-Cache 0–64K 0–64K 0–64K 0–64K — — D-Cache 0–64K 0–64K 0–64K 0–64K — — Local Memory 0 or 2 0 or 2 0–8 0–4 — 0 or 2 256K each 256K each Configurable Configurable 1K–1,024 K each 32-Bit Multiplier Optional Optional Optional Optional — Yes, 2 options 32-Bit Divider Optional Optional Optional Optional — — Barrel Shifter Optional Optional Optional Optional — Yes FPU Optional 32 bits Optional Optional Optional Optional — (New instructions) 32 bits 32 bits 32 bits 32 bits Branch Predict — — Dynamic Static — 1 Priv. Levels 1 or 2 1 2 2 2 AMBA-Lite Coprocessor FSL FSL Avalon-MM Avalon-MM Avalon-MM AMBA-Lite Interface (New instructions) On-Chip CoreConnect CoreConnect Avalon-MM Avalon-MM Avalon-MM — Interconnect PLB v4.6 OPB Memory Optional Management MPU or MMU — — — — — Translation Optional Lookaside 8-entry I + D — — — — — Buffer (TLB) 64-entry unified Core Freq (Max) 220 MHz † 220 MHz † 205 MHz 165 MHz 200 MHz Up to 72MHz >170MHz**

Int. Perf (Max) 240 Dmips 240 Dmips 225 Dmips 127 Dmips 31 Dmips 0.8 Dmips/MHz FP Perf (Max) 50 MFLOPS 50 MFLOPS n/a n/a n/a n/a FPGA 980–3,000 960–1,700 1,800 1,150 600 4,300+ LUT3 tiles Logic Cells ~1,900 LUT4 cells Introduction Nov 2007 Mar 2007 2004 2004 2004 4Q07 Price $495 $495 $495 $495 $495 <$100,000 (ARM) Free (Actel)

Table 3 – Feature comparison of the Xilinx MicroBlaze v7, MicroBlaze v6, Altera Nios II, and ARM Cortex-M1 processor cores. All are 32-bit embedded processors designed and optimized for synthesis in FPGAs. Altera’s Nios II processor is available in three basic configurations that developers can customize. Key differences between the MicroBlaze v7 and v6 are highlighted in purple text. With its new optional MMU/MPU, MicroBlaze v7 gains an important advantage over Nios II; the Cortex-M1 fares poorly in this comparison but is redeemed by a more familiar CPU architecture. (FSL is the Xilinx Fast Simplex Link. Avalon-MM is Altera’s memory-mapped interface.) *Nios II/e has a six-stage pipeline, but it works like a one-stage pipe. †Maximum clock speed assumes synthesis in the fastest Xilinx Virtex-5 FPGAs. ‡Estimate for synthesis in an Actel ProASIC3 or Fusion FPGA. **Estimate for synthesis in a Xilinx Virtex-5 FPGA. (n/a: not applicable)

Second Quarter 2008 Xcell Journal 59 PLATFORMS, PROCESSORS & TOOLS

departed from its long-standing practice of nothing on the horizon that will alter that the time may come when the FPGA ven- not publicly disclosing licensing fees. trend. For that reason, the future of dors’ own CPU architectures look less Although the exact price for a Cortex-M1 MicroBlaze (and Nios II) looks bright. attractive. Although Altera and Xilinx have license is still under wraps, ARM says it will However, one thing that could change is sold many more CPU licenses than even cut deals for less than $100,000 – a huge the CPU architecture that developers ARM ever will, a great number of those price break. And, as described above, Actel choose to implement in an FPGA. For licenses are purchased by students and tin- sells preconfigured Cortex-M1 FPGAs now, MicroBlaze and Nios II are by far the kerers who never intend to mass-produce without requiring an ARM license at all. most popular choices, because they are pro- their designs. Companies seriously intend- With Altera and Xilinx practically giving moted by their respective FPGA vendors ing to produce FPGA-based SoCs in vol- away their FPGA-ready processors, ARM and are practically free. ARM’s CortexM1 ume may prefer to use a more widely had to modify its own licensing model to changes the equation by introducing the supported CPU architecture. MicroBlaze make the Cortex-M1 competitive. option of using the industry’s most popular and Nios II could become interesting foot- 32-bit embedded-processor architecture. In notes in the history of microprocessors. The Future of CPUs for FPGAs time, other embedded-processor architec- Even if that happens, the Altera and As FPGA prices fall, and the costs of tures may join the fray. At present, ARC, Xilinx processors will have served their pur- ASICs rise, deploying an SoC in program- MIPS Technologies, and Tensilica don’t poses. They are selling more FPGAs, they are mable logic becomes increasingly attrac- necessarily forbid licensees from deploying helping to seed the market for FPGA-based tive. As MPR has noted in the past, the designs in FPGAs, but they don’t encour- SoCs, and they are defining the features and production volume at which an FPGA age it, either – and they have not optimized optimizations that FPGA-specific processors implementation becomes more economi- their processors for programmable logic. should have. Whether or not MicroBlaze cal than an ASIC implementation keeps If other CPU architectures do become and Nios II live long and prosper, they are tilting in favor of FPGAs, and we perceive available for FPGAs, at affordable prices, wise investments for their vendors.

60 Xcell Journal Second Quarter 2008 PLATFORMS, PROCESSORS & TOOLS Building Linux Platforms on Xilinx Processors with LinuxLink Find out how a LinuxLink subscription accelerates your Linux development and mitigates project risks. by Maciej Halasz Senior Product Manager TimeSys [email protected]

Many of you have built a Linux platform on top of Xilinx® processors using PowerPC-based Virtex™ FPGAs. The decision to use Linux is happening more often, and for a variety of reasons. The most appealing aspect of Linux is that it is free to end users and does not require any royalties or run-time license payments to an OS vendor. There are, however, a number of challenges that drive up the cost and risk of adopting Linux. In this article, I’ll outline those challenges and discuss how you can leverage LinuxLink by TimeSys when developing on Xilinx platforms, lowering your cost, risk of failure, and shortening the project timeframe.

The Challenge and very error-prone, sharply increasing next step is to get a Linux kernel running Many of you choose to build a Linux plat- project risks. All of this work to assemble on it. Depending on your specific require- form on your own, spending substantial a Linux platform for an application does ments, you must change the Linux kernel amounts of time “hunting and gathering” not add real value to the project. The to match the hardware design. Sometimes the right Linux components for your proj- value comes from applications built on you have to develop new device drivers. ect, irrespective of your experience level. top of the Linux platform. This task is supported today to some Assuming that you are successful in degree by Xilinx EDK tools. gathering the right components, the next What is My Task? The next task is to assemble a root challenge lies in the aggregation and cross- There are probably several answers to this filesystem that includes the functionality compilation of all of these pieces. question. At a high level, most of you want required by the end application running on Unfortunately, this aggregation process to build a Linux platform that works on your custom hardware design. Assembling can be complicated, particularly in an custom hardware, with a value-added a root filesystem can be a very involved and embedded environment where Linux com- application on top (see Figure 1). time-consuming process that grows in time ponents need to work with an embedded The hardware design is central to the and risk level with the number of packages processor. This process is unpredictable project; once such a design is in place, the that you need to include.

Second Quarter 2008 Xcell Journal 61 PLATFORMS, PROCESSORS & TOOLS

++=+ = Xilinx FPGA Value-Added Hardware Design Linux Kernel Root Filesystem Linux Platform Application Final Solution

Figure 1 – Linux project components

Once a Linux kernel and a root filesystem are running on the custom hardware, Cross-Toolchains you can focus on developing value-added Basic Packages (Libraries, Utilities, etc.) applications. If the application require- Specialized Packages Host ments are well understood before the (Graphics, Win Mgmt, Audio, Web Server, etc.) Development Environment project begins, this stage of the project Command Line Dev Tools Graphical Dev Tools (Accelerated Development) (Accelerated Development) ends here, with aggregation of all Linux Debuggers Configuration Drivers pieces into deployable images. Profilers Patching Tracers However, application requirements often change during the project cycle. But for Linux, this means not only application changes but also root filesystem changes. Application Sometimes even the Linux kernel has to be adjusted. Function Root Filesystem Target Alpha Development Components Package 1 Linux Kernel What Do I Need? Package 2 Function Driver X Driver Y Driver Z You will need several components to Beta Package N develop an embedded software platform (see Figure 2). For any Linux-based project, Linux Figure 2 – Host and target components needed in a Linux project components are developed and/or adopted from already existing sources (open-source Linux Development Process new designs and applications. Each kernel projects). Depending on the specific appli- The Linux project development cycle, as release, occurring on average three to four cation and how complex the hardware shown in Figure 3, can be divided into sev- times a year, introduces new updates, fixes, design is, the Linux platform development eral stages. Some of the tasks in the cycle and support for ever-growing sets of driv- process can take anywhere from weeks to (stages 3 and 4) are executed in parallel; ers. For this reason alone, the latest version months, even though there are tools and others are sequential (stages 5 and 6). of the Linux kernel is the preferred choice information online that you can leverage The process of developing Linux-based for many Linux projects. in the process. Tools, libraries, and various systems on a Virtex platform starts at the utilities are definitely needed to develop, hardware level, when you select a set of Root Filesystem Challenges debug, and test the Linux solution hardware blocks (IPs) needed to support The number of open-source projects that throughout entire project cycle. your application. Part of the reason why provide filesystem functionality grows To accelerate development processes FPGAs are so popular is the flexibility they every day, providing you with access to while minimizing the risk of failure, you offer in assembling the hardware platform. hundreds of Linux components that you can use highly integrated development That flexibility translates directly into how can adopt in your project. This is both tools and prebuilt Linux components LinuxLink supports you in your efforts in good and bad for the embedded Linux from a trusted source. getting a Linux solution in place – in a application. The wide selection of Linux Development tools that run on the timely fashion and with minimal risk. components available from the open- host development platform, combined source community provides platform sup- with your experience and knowledge, will Linux Kernel Choice port for all kinds of applications. greatly drive how much time and effort The Linux kernel code is just like other On the negative side, that same breadth you spend on a project. software: it evolves to meet the needs of causes headaches, as it is very difficult to

62 Xcell Journal Second Quarter 2008 PLATFORMS, PROCESSORS & TOOLS

installed with the libraries frequently Development Gethering of needed for application development. This Environment Needed Linux Setup Components makes your task of building value-adding 1 2 applications seamless. Development environment setup by LinuxLink is always Integration Linux Deployment Platform kept in sync with target binaries. Libraries Linux Project Development Cycle Deployment that are available on the host for applica- 6 3 tion development are also available in the Testing Application Debugging Development form of ready-to-deploy, on-target pack- Optimization ages for the platform design. 5 4 Finally, a LinuxLink subscription provides access to tools that are indispensible in the development of an embedded Linux solution. You can download tools Figure 3 – Tasks in the Linux project development cycle from the LinuxLink site such as debuggers for both system and application-level find the packages that work as expected or Linux components available through debugging, profilers for code optimiza- packages that work well together. LinuxLink are regularly updated, providing tions, and tracing tools. The subscription It can take a lot of time and effort to find you with the latest code available for both comes with ticket-based engineering sup- a good set of packages, libraries, and utilities Linux kernel and platform packages. port staffed with embedded engineers that properly support application-level The reference distribution is optimized with years of experience. development. Certainly there are some to achieve two distinctive goals: With a LinuxLink subscription, you get starting points available in the open-source 1. Setup the platform and application help throughout the entire development community, but they all bear the same chal- development environment for the tar- cycle (Figure 3). This means that you can lenge – they are not easy to customize for get Virtex platform on your host get your job done much faster and in a speed, functionality, and the footprint machine more controlled environment. required by the final deployment image. 2. Get the Linux kernel up and running The process of building a Linux plat- Conclusion in matter of minutes on a Xilinx refer- form based on open-source information LinuxLink is designed to support you as ence board (such as the ML405) and components is not a problem for a very you assemble a custom Linux platform. knowledgeable engineer, although it still After installation of the reference distri- With prebuilt, ready-to-run Linux compo- bears a high risk of failure. bution, you can develop, in parallel, both nents and support, you need not spend value-added applications and the software valuable time building out a commoditized How LinuxLink Accelerates Development platform on which that application will open-source platform. You can instead LinuxLink is a subscription-based system execute (blocks 3 and 4 in Figure 3). focus on the development of value-added that provides embedded engineers with You can easily assemble the Linux plat- applications, along with integration, test- online access to hundreds of Linux compo- form with pre-built, ready-to-use packages ing, and optimization. You can then intro- nents optimized for the Virtex platform, all like “alsa” sound libraries or video codecs. duce products to the market much faster, at of which have been tested to work with You can immediately download these higher quality and lower cost. each other. ready-to-use packages to the host develop- LinuxLink supports Virtex-4 and The subscription provides you with ment system with the command line or Virtex-5 FPGAs with the latest Linux ker- access to preassembled reference distribu- Eclipse-based TimeStorm tools, where they nels. With a LinuxLink subscription, you tions optimized for different applications can be aggregated into a complete and can choose from the latest continuously that include the following: functional root filesystem. These filesys- updated Linux components for your Xilinx • EDK studio project files for specific tems are ready for deployment in a variety FPGA, including the newest Linux kernel hardware configurations of media, including on-board NAND flash and other Linux component versions. or Compact Flash cards. To find out more about how you can use • Binary Linux kernel Application development is well sup- the software, tools, and support provided ported by the same command line and with a LinuxLink subscription to accelerate • Cross-toolchain with platform libraries graphical development tools used for plat- your next embedded Linux project, visit • Development-ready root filesystem form development. The set of architec- www.timesys.com. You can browse the con- ture-specific cross-toolchains provided tent of the LinuxLink repository to see • Sources with a LinuxLink subscription come pre- everything available to subscribers.

Second Quarter 2008 Xcell Journal 63 Xilinx Productivity Advantage. THE ADVANTAGE IS ALL YOURS!

The Xilinx Productivity Advantage (XPA) program provides all your FPGA design needs up-front where and when you need them. This one-stop, single-vendor solution allows you to create customized bundles of software and services to meet your speciﬁc needs. The Xilinx Productivity Advantage program gives you the competitive advantage by:

• Delivering cost savings in reduced speed grades

• Decreasing your time to market

• Decreasing the gap between consumable and consumed technology

Design Faster and More Cost-Effectively Each XPA solution bundle includes your user licenses for ISE™, the industry’s leading design tools and development system for FPGAs. Pre-veriﬁed and pre-optimized IP cores and reference designs. Customized development and evaluation boards. And training credits toward the most comprehensive portfolio of expert Education Services. The Xilinx Productivity Advantage program: The Advantage is All Yours! To learn more about Xilinx Services and the Xilinx Productivity Advantage, we invite you to visit: www.xilinx.com/xpa

©2008 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners. PLATFORMS, PROCESSORS & TOOLS QuicklyQuickly BuildBuild DistributedDistributed SystemsSystems Mobius generates both hardware and software for XPS-based projects.

by Per Ljung President Codetronix LLC [email protected]

Many applications use a distributed archi- Mobius tecture, including multi-FPGA systems and FPGA-accelerators for microprocessors. Creating master-slave or peer-to-peer Mobius Mobius Mobius TLM Simulate Hardware Generate Software Generate distributed systems is complex and requires a design methodology that supports effi- C cient partitioning, as well as an efficient HDL implementation for communication and synchronization. HDL Hardware Netlist Software Software Simulate Synthesis Simulate Compile Simulate In this article, I’ll present the Mobius tools and work flow for distributed systems. Mobius is a high-level language that can Bitstream Elf generate both efficient hardware (Verilog, VHDL) and software (C with scheduler) from a single source. Mobius uses handshaking for multi-threaded synchronization and communication, resulting in a target- independent, latency-insensitive handshak- Target ing processes. Mobius is suitable for both control- and data-dominated applications. Figure 1 shows how Mobius is a front-end Figure 1 – Mobius is a front-end to Xilinx Platform Studio tools. to the Xilinx® Platform Studio tool.

Second Quarter 2008 Xcell Journal 65 PLATFORMS, PROCESSORS & TOOLS

The micro-architecture is user-specified Benchmarks show that quality of results (nodes) connected by handshaking chan- using a WYSIWYG syntax; for example, for Mobius-generated software and hard- nels (arcs). Some of the handshaking prim- which tasks are run sequentially or in parallel. ware are excellent. For example, signed itives are shown in Figure 2, and can be Parallel tasks synchronize and communicate fixed-point FIR filters easily achieve 500 implemented in either hardware or soft- using message passing over unidirectional MHz on Virtex™-4 and Virtex™-5 device ware as a tiny finite state machine (FSM) point-to-point blocking channels. Because targets, and the commstime benchmark reacting to the signals on any connected both the reader and writer wait for the other running on a MicroBlaze processor shows handshake channels. Because all control to be ready, the channels provide a robust that Mobius is 66 times faster than a Posix and dataflow is local, there are no global bridge between hardware and software sub- pthread implementation. FSMs or wires. systems. You can compile a Mobius source Consider the Mobius-generated hand- to hardware, software or any arbitrary com- Handshaking Primitives and Channels shaking graph for the Euclid algorithm to bination, allowing for exceptional flexibility The Mobius compiler decomposes the compute the greatest common denomina- when partitioning. high-level Mobius source into a handshak- tor (GCD) in Figure 3. A control channel An ESL tool, Mobius generates software ing graph of handshaking primitives is depicted as an arc terminated in a solid for the MicroBlaze embedded processor, cir- cuits for FPGA logic, and bridges between the software and hardware domains. A Xilinx transfer repeat function timer Platform Studio project is automatically gen- if-else const procedure muxN erated with Mobius-generated software and hardware, which further increases productivi- do var cast demuxN ty. Using Mobius, you can design and implement a complex co-design within an hour. stop channel unary op seqN

Mobius Language and Tools skip array binary op parN Codetronix designed the Mobius language, a tiny high-level multi-threaded language, Figure 2 – Target-independent handshaking primitives to be simple to learn and to enable efficient implementations. A uniform programming model allows for target-independent specifi- 5606 cation and subsequent compilation to tar- gcd get-specific implementations. 13517: std_rep The programming model uses the com- 5607 municating sequential processes (CSP) methodology with blocking message passing. 13516: std_seq3

The Mobius language supports sequential or 5609 5610 parallel statements, channels, timers, events, 13512: std_do 5608 and arbitration, in addition to parameterized signed/unsigned integers and fixed- and 5615 13515: std_transfer floating-point (arithmetic, relational, tran- 13510: std_ifelse 13491: std_seq2 5594 scendental) operators. As a result, the Mobius language offers many of the built-in capabil- 5620 5617 5613 5612 5611 c ities of an RTOS. The CSP methodology allows compile- 5616 13509: std_transfer 13503: std_transfer 5633 13490: std_transfer 13487: std_transfer time semantic checking of parallel con- 5622 5619 5628 5625 5627 5624 5596 5600 structs (such as the illegal simultaneous read 13497: > 13506: - 13500: - 13494: != 13532: var_mux2 13522: var_mux2 b a and write of a variable) for faster develop- 5635 5637 5636 5630 5632 5631 5634 5629 ment. The Mobius high-speed transaction- level simulator provides quick bit-accurate 13550: var_demux4 13539: var_demux5 5590 5588 simulations, enabling functional verifica- 5589 5587 tion with a test bench written in Mobius. 13478: y 13477: x The test bench is also compiled by Mobius, enabling quick verification of the generated Figure 3 – Handshake graph of GCD() hardware or software.

66 Xcell Journal Second Quarter 2008 PLATFORMS, PROCESSORS & TOOLS

cuit, while the test bench runs in C on a software applications and hardware IP to Hardware MicroBlaze processor. The Mobius an existing Xilinx Platform Studio project. compiler identifies that the reader and As shown in Figure 5, xpscustom adds autostart writer of each channel are in different Mobius-generated hardware IP, bridges, domains and emits C for the software and wire connections. On the software side, domain, HDL for the hardware xpscustom automatically adds simple IP domain, and FSL bridges to connect the drivers, a software-only application, and a gcd hardware and software. co-design application. This results in a The Xilinx fast simplex link (FSL) is a push-button flow where you only have to peer-to-peer, unidirectional, point-to- compile the software and generate the bit- point synchronization and communica- stream for a distributed implementation. tion protocol using queues, which maps easily to Mobius channels. Figure 4 shows Conclusion how the software writes two values to two Mobius allows you to rapidly develop high- FSL bridges and then reads a value from a quality hardware and software solutions.

bridge_fsl_rchan bridge_fsl_rchan third FSL bridge that is blocking until the You can map Mobius-generated HDL and bridge_fsl_wchan output is available. Simultaneously, the C to arbitrary voltage, clock, physical, or hardware reads two values from its two co-design domains and then connect these input channels and computes a result that domains using synchronous channels, asyn- Software is written to its single output channel. An chronous channels, or bridges to build dis- autostart() module generates a req pulse tributed embedded systems on any Xilinx to start the Mobius-generated hardware FPGA. Mobius also compiles your test on a system reset. bench to hardware and software, allowing functional verification in both software and Figure 4 – Mobius-generated hardware and software connected by bridges Automatic XPS Projects hardware implementations. Assembling a complex hardware and Users benefit from significantly higher software design in Xilinx Platform Studio design productivity, letting them generate dot and a hollow dot. The solid dot can be challenging. To enable users to and explore alternative microarchitectures denotes an active port that initiates the rapidly generate correct XPS projects, to optimize their system. For more infor- handshake, and the hollow dot denotes a Codetronix provides a utility called mation about using Mobius in your next passive port that waits for the handshake. A xpscustom that adds Mobius-generated design, visit www.codetronix.com. push or pull channel has arrow heads signi- fying the direction of data flow, where the arrow head may be hollow or solid, signify- ing an active or passive protocol.

Bridges between Hardware and Software Channels are transport layer-independent and can use wires, SERDES, or packets as appropriate. You can insert bridges on any channel and allow full-featured support of legacy interconnects, including OCP-IP, Wishbone, PLB, FSL, PCI, Ethernet, or RocketIO™ transceivers. Similarly, channels allow domain crossing – between software and hardware, for instance – between different clock domains or between different voltage domains. As a result, implementing distributed and hardware/software systems using handshaking is a straightforward process. Figure 5 – Hardware and software integrated into Consider a simple example where the a Xilinx Platform Studio project by xpscustom GCD is implemented as a hardware cir-

Second Quarter 2008 Xcell Journal 67 PLATFORMS, PROCESSORS & TOOLS DebuggingDebugging SystemsSystems withwith FPGAFPGA EmbeddedEmbedded ProcessorsProcessors F-Sight improves debugging efficiency through by Jorge Carrillo easy-to-use, state-of-the-art features. Staff Software Engineer easy-to-use, state-of-the-art features. Xilinx, Inc. [email protected]

Koji Imanishi Department Manager Computex Co., Ltd. [email protected]

Raj Nagarajan Senior Software Engineer Xilinx, Inc. [email protected]

Nobuhiro Nishiguchi Superintending Engineer Computex Co., Ltd. [email protected]

Ayako Suzuki Senior Engineer Computex Co., Ltd. [email protected]

More and more FPGA designs now include embedded processors (like PowerPC and Xilinx® MicroBlaze™ processors) to deal with control tasks that are easier to describe in a software language like C rather than in a hardware description language like VHDL or Verilog. When designing embedded systems, you probably spend the vast majority of design time in the debugging phase. It is important to reduce the time associated with finding problems and solving them. When you do face design problems, it is important to use the right tools for the job, including a tool that allows you to identify problems quickly.

68 Xcell Journal Second Quarter 2008 PLATFORMS, PROCESSORS & TOOLS

Computex F-Sight is an integrated ML400 series, ML500 series, or overflow. If you suspect this problem is debugger that provides both hardware and Spartan™-3E/3A/3AN FPGA board that occurring, you can set a trigger to start or software debugging capabilities. On one does not include a Mictor connector, you stop trace data collection. By setting the side, it allows full software debugging of can still employ the processor trace feature trigger condition to allow a comparison processors inside the FGPA. On the other in F-Sight by using the corresponding between the stack pointer and the upper side, it allows monitoring of FPGA hard- Computex F-Sight adapter. Figure 1 stack limit, the program will break right ware signals. In this article, we will describe shows F-Sight connected to a Spartan-3 when the condition occurs. You can then how F-Sight can help improve your debug- board using the F-Sight adapter. easily identify the stack overflow and the ging efficiency. location where it happened. Using Processor Trace In certain cases with real-time systems, Enabling the Debugger A processor trace monitors program execu- stopping a processor for debugging may Xilinx has enabled the features available in tion without stopping the processor. This not be an option, as stopping might change the Computex debugger for use on FPGA allows you to analyze program flow over the program behavior. Other times the internal embedded processors. For long periods to identify problems in the problem might occur very rarely, which MicroBlaze processors, you can use the code, all without changing the processor might require you to monitor the program MicroBlaze debugging module (MDM) to execution state. Computex F-Sight pro- execution over a long time. Using F-Sight, control and debug the processor execution. vides processor trace capabilities that can you can set complex trigger conditions and You can also use the Xilinx MicroBlaze trace prove very useful in different situations. collect trace data, which can be post-ana- core (XMTC) to monitor the processor pro- Imagine a program that is constantly lyzed to debug the problem. gram execution in a non-intrusive manner. generating exceptions. Exceptions may Probing Internal Signals When debugging FPGAs, it is common to begin by simulating the design. The simulator is not capable of finding faults with the specifications, although it is able to find errors in the design. Also, it often happens that designs that pass all tests when simulated do not eventually work once implemented on the FPGA. When this happens, you are forced to debug the problem on the actual target system by using a logic analyzer. The problem arises when you try to bring the signals out of the FPGA so that the logic analyzer can monitor their waveform Figure 1 – F-Sight connected to a Spartan-3 board patterns. As in the majority of cases when designing large-scale embedded systems, it Because of pin limitations in FGPAs, it is happen anywhere in the program; the chal- will take a considerable amount of time to important to reduce the amount of signals lenge is to find out why and where the go through the FPGA synthesis and imple- that are brought out to pins. The XMTC exceptions are being generated. For this, mentation tools even for minor changes provides encoded instruction and data traces you can set a breakpoint either at the start (such as the changes needed to route signals with pin requirements only 10% of what of the exception being thrown or at the out to pins). Plus, you risk running into unencoded signals would require. exception vector so that the program stops timing problems caused by the different To enable the debugger to collect a when it reaches the breakpoint. When the placement and routing. The actual time to trace, all you need to do is connect both program stops, you can look at the execu- run the implementation tools depends on MDM and XMTC cores to the tion history collected by F-Sight. It will the scale of circuitry and the host computer MicroBlaze processor’s debugging and show which instructions were executed performance, but it’s possible that you will trace interfaces, respectively, and bring before going into the exception handler. only be able to debug a few times in a day. out the encoded trace signals to FPGA Stack overflow is also a common prob- Fortunately, Computex F-sight has a pins so that F-Sight can collect the data. lem in embedded systems. All of a sudden very useful feature that allows you to mod- After FPGA implementation, connect the a program starts executing in a location ify the design to bring internal FPGA sig- F-Sight debugger to your board’s Mictor that just does not look right. The stack is nals out to pins without going through the connector. If you are using a Xilinx-made probably getting corrupted because of the synthesis and implementation tools. The

Second Quarter 2008 Xcell Journal 69 PLATFORMS, PROCESSORS & TOOLS

Debugging Flash Memory The FPGA’s internal memory is commonly used to store the embedded processor program. However, this memory’s capacity is often not sufficient if the size of the program is too big. One alternative is to use external flash memory and store the user program in it. Although some debuggers do not have the capability to write to flash memory, F- Sight allows you to fully debug programs that are located in flash memory just as you debug programs located in internal memory. For example, you can download the user program, apply patches to a part of the memory, or even set software breakpoints in the flash memory. In F-Sight, you can select from among Figure 2 – F-Sight probing more than 1,000 available types of flash memory options. Even if the flash memo- feature is called “probing.” Simply select the hardware and software is known. For this ry you are using is not available from the the internal FPGA signals in the display reason, Computex has implemented a coop- options in the list, you can manually and showing HDL source (Figure 2). F-Sight erative debugging feature that allows you to easily add the entry using the graphical will automatically take care of the rest, synchronize the hardware (analyzer) and user interface. adding the appropriate routing to the test software (trace) histories in F-Sight. With pins on your behalf. It does this by utiliz- this feature, you can check the wave pattern Conclusion ing the FPGA Editor included in Xilinx and program behavior at the time the event Debugging systems with FPGA embed- ISE™ software tools. With this feature, occurred on the same time axis. The program ded processors should not be a time- the time required for logic synthesis and execution history and the source code view consuming task. When you encounter place and route – time that you could have will follow as you scroll through the wave problems, you need to have the right tools used for debugging – is minimized, allow- pattern display in the analyzer window that will allow you to efficiently deal with ing you to spend more time monitoring (Figure 3). The cooperative debugging fea- the problem, saving you time to focus on waveform patterns. ture is powerful in that it quickly identifies the development part. F-Sight certainly causes by allowing cooperative debugging proves to be a tool that will improve your Cooperative Debugging from both the hardware and software sides. debugging efficiency. When the system is not functioning properly, the only thing you can do is investi- gate the cause of the problem based on actually occurring events. In some cases, event tracking is easier if you do it from the hardware; in other cases, it is easier if you do it from the software. For example, in the case of hardware, if the signal indicating abnormality is asserted you can set the signal as a trigger. In the case of software, if the exception handler is called out, you can set a breakpoint to the exception handling routine and run the user program. The process of event occurrence will be captured into the tracing buffer in F-Sight. However, the issue here is that even if the history was captured, indentifying causes will take time unless the correlation between Figure 3 – F-Sight cooperative debugging

70 Xcell Journal Second Quarter 2008

PLATFORMS, PROCESSORS & TOOLS A Complete Debugging Solution for Xilinx Embedded Processors Employ the industry-proven TRACE32 debugger to cope with the challenges of debugging advanced embedded processor systems.

by Jorge Carrillo Staff Software Engineer Xilinx, Inc. [email protected]

Raj Nagarajan Senior Software Engineer Xilinx, Inc. [email protected]

Oliver Oppitz System Design Engineer Lauterbach [email protected]

The best thing about FPGAs is their flexibility, which inspires designers to create a myriad of different designs. However, sup- block on a Virtex™-5 LX50T device. The the debugger internal address translation port for debugging the design is often con- peculiar thing about this design was that mechanism to remap access from unread- sidered last – if at all – so the debugger each MicroBlaze processor had only an I- able target memory to virtual memory. often has to adapt to the system. side interface to the block RAM block for This made it possible to diagnose the pro- The good news is that a debugger exists instruction fetch. The second memory port gram even at the assembly level. from a company that has been around for of the block RAM was connected to a PCI But there was another challenge: for step- nearly 30 years in the embedded arena, Express interface, used to remotely change ping through the program, especially over gaining experience with all of the problems the application code and boot the processor high-level language lines and conditional you can imagine and some that you do not at run time. The challenge was how to branches, one normally uses software break- even want to hear about. In this article, debug in a memory area that the debugger points. With no access to the block RAM we’ll show you a few examples of big and could neither read nor write, because the from the MicroBlaze processor, this was out small features that Lauterbach’s TRACE32 MicroBlaze processor itself could not per- of the question. The solution came in the debugger offers – features that might save form load/store operations in this area. form of a “map.break” command, used to your day or even your project. We solved the problem using force the debugger to exclusively use hard- TRACE32’s internal “virtual memory,” a ware breakpoints for the given address Flexibility in Debugging for a Flexible Platform simulated memory inside the debugger range. Other flavors of the map command Speaking of flexible designs, a really inter- that has an indefinite address space (64 allow you to specify the data width of mem- esting customer system comes to mind. It bits) and for which memory is allocated on ories, change the endianness, or completely used two Xilinx® MicroBlaze™ processor demand. We loaded the target program deny the debugger from accessing an address cores with an internal block RAM memory into this virtual memory and configured range with critical peripheral registers.

72 Xcell Journal Second Quarter 2008 PLATFORMS, PROCESSORS & TOOLS

Small but Useful Features Flexible IDE and will work with any design created with Lauterbach’s experience with JTAG TRACE32 offers a powerful graphical user the Xilinx Embedded Development Kit debuggers and emulators shows in both interface (GUI), but its command-line ori- (EDK). For PowerPCs, dedicated debug- the features of the TRACE32 debugger as gins still show in two highly useful features: ging connectors are also supported. well as in many small and unexpected the debugger command line at the bottom In the context of multi-core systems, details. Given that Lauterbach develops of the screen and the fact that practically the synchronized starting and stopping of all software with its own tools, this is not every function of the GUI is also accessible cores becomes an issue. For targets sup- surprising. And daily problems often through commands – and thus from porting this in hardware, like in multi- inspire the most useful features. scripts. This automates all of your routine MicroBlaze processor configurations, the Did you ever want to disable the tasks, including target configuration, laying debugger uses hardware features to annoying timer interrupt handler during out windows, and arranging them on mul- achieve cycle-accurate synchronization; a debugging session or invert a condition- tiple virtual screens. Best of all, the win- otherwise the synchronization is done in al branch without restarting? Ever wanted dows do not have a docking behavior like software. The integrated scripting lan- to patch a loop to see if the core executes many IDEs but can be freely positioned guage is multicore-aware, allowing control correctly from external memory? The and resized, even overlapping each other. of all GUIs from a master script for con- built-in assembler does just that. There are also couplings with various IDEs, necting debuggers to their respective Another typical scenario during so you can invoke TRACE32 directly from cores, resetting them, and downloading debugging: How often did you just step your Eclipse environment, for example. and starting application programs. one line too far and had to start all over? The register-restore feature will undo the last steps. TRACE32 displays memory by permanently rereading the memory 10 times per second, even while the processor is stopped. Why does it do this? Perhaps a second MicroBlaze processor in your system, which you kept running, is causing some data corruption and you would like to also monitor this. Perhaps your sometimes unstable memory now causes tell- tale flickering on the screen. Or perhaps your JTAG interface is really not so stable at 20 MHz. These are things you would certainly like to know. On the other hand, TRACE32 guarantees that it will only access memories when required. Having touched on the topic of peripherals, we should also mention the peripheral register files. These specify the Figure 1 – Lauterbach TRACE32 debugging and trace cable connected to Xilinx ML507 board location, width, and even bit-wise encoding of memory-mapped registers, grouping them into a tree of related registers. In this Attaching to Multi-Core Targets Real-Time Program Flow and Data Trace way the peripheral registers are easily Another interesting feature is Lauterbach’s The main feature of the real-time trace is accessible for inspection and modification: intuitive approach to debugging multiple recording the program flow, defined as you can disable your DMA controller with cores in the target. Just start an instance of each and every instruction the processor a click instead of digging in the target the GUI for each core and have them share executed as well as data transactions. For manuals for the correct bit. The debugger a debugging cable; this also works for het- MicroBlaze processors, this is done through comes with specifications for standard erogeneous systems with PowerPC and the Xilinx MicroBlaze trace core (XMTC) peripherals, but you can modify the files MicroBlaze cores or for another system that comes with Xilinx Platform Studio. for your purposes using a simple text edi- from the 50-plus processor architectures XMTC includes a trace encoder, which tor. For the Xilinx tool chain, an add-on is supported by TRACE32 (Figure 1). contains an input interface that attaches to in the works to generate these files on the TRACE32 attaches to the same JTAG a MicroBlaze processor trace port compris- fly as part of the Xilinx build process. connector used by the Xilinx platform cable ing approximately 200 unencoded signals.

Second Quarter 2008 Xcell Journal 73 PLATFORMS, PROCESSORS & TOOLS

Xilinx Global Trade Show Calendar

Xilinx participates in numerous tradetrade showsshows andand eventsevents through-throughout the year. This is a perfect opportunity to meet our silicon and software experts, ask questions, see demonstrations of new products, and hear other customer success stories. For more information and the most current schedule, visit www.xilinx.com/events/.. Figure 2 – Lauterbach TRACE32 IDE showing trace, code coverage, and function call chart views April 14-17, 2008 NAB 2008 Operating System Support Las Vegas, NV On the MicroBlaze processor, TRACE32 comes with a so-called kernel awareness April 15-17, 2008 module for µClinux and Linux. For Embedded Systems Conference PowerPCs, many more operating systems are Silicon Valley Figure 3A – Lauterbach Mictor MicroBlaze trace supported, including QNX, VxWorks, and San Jose, CA adapter for Xilinx Spartan-3E FPGA board Nucleus PLUS. The extensions make the debugger aware of the kernel-related data April 15-17, 2008 structures in the target. They allow debug- Special Event Effects ging on the process level using process-spe- Symposium cific breakpoints and program control. Long Beach, CA Other features include full MMU support; real-time, non-intrusive display of Linux sys- April 21-22, 2008 tem resources like loaded kernel modules or China International Medical mounted file systems; statistical evaluation Electronics Technology Figure 3B – Lauterbach Mictor MicroBlaze and graphical display of task run times; and Shenzhen, China trace adapter for Xilinx MLx boards task-related evaluation of function run times. May 14, 2008 It also includes an output interface, which Conclusion Linley Tech Seminar: provides an encoded trace in 21 signals. The Lauterbach TRACE32 provides a com- High-Speed Interconnects The trace hardware provides up to 512 prehensive debugging solution for PowerPC San Jose, CA MB of external high-speed trace memory, and MicroBlaze processors on all Xilinx which is used instead of the scarce on- device families. Future Xilinx-related addi- May 14-16, 2008 chip memory resources for storing trace tions will be to enhance the debugging Embedded Systems Expo 2008 information. Trace functionality is also cable so that the Xilinx ChipScope™ ana- Tokyo, Japan supported on PowerPC architectures. lyzer can use it for target accesses jointly More advanced features include statistical with the debugger and downloading FPGA analysis of function and task run times, bitstreams through the debugger itself. R variable access, and code coverage analysis For more information, please visit (Figure 2, Figure 3A and 3B). www.lauterbach.com.

74 Xcell Journal Second Quarter 2008 PLATFORMS, PROCESSORS & TOOLS ReduceReduce FPGAFPGA DesignDesign TimeTime withwith PICOPICO ExpressExpress FPGAFPGA Synfora’s PICO Express accelerates by Fernando Martinez algorithm-to-design completion on an FPGA. Field Application Engineer Synfora [email protected]

The Xilinx® Spartan™ and Virtex™ product families offer designers a wide range of choices for embedded design. This array of products allows designers to create custom accelerators for embedded applications as well as entire systems on a single FPGA. The most vexing problem facing today’s FPGA designers is shifting away from part capacity and toward design turn- around time. Depending on the complexity of the target application, going from a written specification or reference C model to a working FPGA design can take anywhere from weeks to months. This increases the pressure on designers and makes the goal of first to market harder to reach.

Advanced Algorithmic Synthesis The only way to accelerate design time on an FPGA is to automate the creation of the RTL. Although several attempts have been made to go from C to RTL, these solutions have fallen short of customer expectations because of language transformations, application complexity, and size limitations. These concerns, which have limited the widespread adoption of high-level synthesis in the FPGA market, are addressed by Synfora’s PICO platform, a complete development environment capable of taking sequential algorithmic ANSI-C and creating high-quality parallel hardware on par with RTL created by hand.

Second Quarter 2008 Xcell Journal 75 PLATFORMS, PROCESSORS & TOOLS

PICO is also fully integrated with FPGA back-end flows from both Xilinx and Synplicity. You can either use your own synthesis scripts or scripts provided by PICO to generate an FPGA programming bitstream.

PICO Express FPGA, which is a key the synthesizable C model is completely creates an on-the-fly YUV representation component in the PICO platform, pro- sequential without any extra need for a user of this video. The conversion to YUV has vides a complete development environ- declaration of parallelism. The advanced the benefit of acting as a filtering opera- ment capable of taking sequential compiler built into PICO automatically tion, which reduces noise in the original algorithmic ANSI-C and creating high- analyzes user applications to extract data RGB video stream. quality parallel hardware that is on par with dependencies and derive parallelism from Figure 1 shows a summarized pseudo RTL created by hand. sequential code. code for the color converter; note that the The design methodology in PICO The driver code comprises a test bench type of C code accepted by PICO Express Express FPGAs allows design teams to move used to verify the correct operation of the is no different from the C code accepted to a higher level of abstraction without sac- function targeted for hardware synthesis. In by any ANSI-C compatible software com- rificing design performance. Another aspect the case of this example, the driver code piler. This allows for code development of the PICO design flow is the ability to reads input image files, passes the data to using open-source tools such as gcc. retarget RTL generation across Xilinx prod- the hardware procedure, uct families in terms of part, speed grade, and gathers the results. package, and clock frequency without hav- The results of the hard- unsigned char rgb_to_yuv(unsigned int rgb) { ... ing to modify the C code. ware function are com- resulty = ((66* Rf)> 8) + ((129* Gf)>8) + ((25* Bf)>8) + 4096; PICO is also fully integrated with FPGA pared against the expected resulty = (result + 128) > 8; if(resulty>255) resulty=255; back-end flows from both Xilinx and output created during test ... Synplicity. You can either use your own syn- vector selection. } thesis scripts or scripts provided by PICO to During the several generate an FPGA programming bitstream. steps needed to go from Besides integration into standard FPGA C to RTL, PICO simu- Figure 1 – RGB to YUV converter back-end flows and the ability to target lates each transformation Spartan and Virtex parts, PICO gives you to ensure correct hardware by design. The After color conversion, the new video the ability to carry out what-if analysis. simulations during hardware design are stream is passed through a Sobel edge Under design space exploration, PICO also useful in estimating design perform- detector. Sobel edge detection is a classic allows side-by-side comparisons of multi- ance and resource utilization before RTL filter used in image processing to reduce ple designs targeted at different through- simulation and back-end place and route. the information in both still images and puts, clock frequencies, part families, The second stage is the creation of the video to a series of edges. The fundamen- package options, and speed grades. synthesizable C model. The synthesizable tals of Sobel edge detection are based on: In this article, I’ll show how PICO C model is the software procedure that will Express can quickly take an ANSI-C func- be implemented as a hardware block. This the following convolution equation tional model and create a fully functional procedure code can encompass functional- FPGA design. The application chosen for ity as simple as a single accelerator block to This equation shows the use of a 3 x 3 this demonstration combines real-time functionality as complex as a complete window to calculate the strength of an color space conversion from RGB to YUV H.264 encoder/decoder. You only need to edge at the center of the window. and a Sobel edge detector. Although the C let PICO know the name of the top-level Depending on the chosen threshold, pixels model for this application is part independ- procedure in the synthesizable model. will either have a value of 1 or 0 after the ent, Synfora has implemented a working PICO Express automatically transforms all windowing function. Figure 2 shows the system on the Spartan-3E video display kit. sub-procedures into hardware. code used for the edge detector. Like the code in Figure 1, the code in Application Development on PICO Express Color Conversion and Edge Detection Figure 2 is completely compliant with Application development on PICO Express The color conversion and Sobel edge ANSI-C and can be compiled on any C is divided into three stages. The first two detection application is divided into two compiler. Figure 2 shows the use of stages are the driver and synthesizable C main components. Color conversion arrays for storing both horizontal and model creation. As the example will show, takes the RGB from the video source and vertical windows.

76 Xcell Journal Second Quarter 2008 PLATFORMS, PROCESSORS & TOOLS

trol flow in this example. For the board unsigned char sobel( implementation, the bitstream is mapped unsigned int window[3][3]) Topics Relevant { onto a Spartan-3E device. The base C code ... used to create the hardware is independent y00 = rgb_to_y(window[0][0]); to You Delivered y01 = rgb_to_y(window[0][1]); of the source video resolution. Like in most y02 = rgb_to_y(window[0][2]); video processing systems, the PICO- in One Monthly y10 = rgb_to_y(window[1][0]); y12 = rgb_to_y(window[1][2]); designed hardware will use the first image y20 = rgb_to_y(window[2][0]); frame to calculate the image resolution Newsletter! y21 = rgb_to_y(window[2][1]); y22 = rgb_to_y(window[2][2]); from the video source. In this case, the video is of HD quality weight_horizontal = weight_calc( Thanks for your feedback! y00, y01, y02, y20, y21, y22); at 720p resolution. On the FPGA, the system runs at 133 MHz, which is sufficient We have now created a new weight_vertical = weight_calc( y00, y10, y20, y02, y12, y22); to comply with the DVI standard for 720p Xilinx newsletter that will HD video. In terms of synthesis time, this provide you all of the information total_weight = weight_horizontal + weight_vertical; return total_weight >= THRESHOLD; application takes less than five minutes in you need in one location. } PICO to generate RTL and less than 10 This newsletter will replace the minutes to generate a bitstream from multiple topic newsletters you Xilinx Synthesis Technology (XST). Figure 2 – Sobel edge detector have currently been receiving. Conclusion It's simple! Choose only the PICO will analyze arrays in your code PICO Express for FPGA provides the nec- topics you want to receive to determine the optimal mapping to min- essary tools and methodology for you to and your customized newsletter imize resource utilization and maximize quickly go from algorithmic code to fully performance. Depending on the size of an functional designs on an FPGA. The ANSI- will be delivered to your array and how it is used in the design appli- C based input accepted by PICO presents inbox every month. cation, PICO will determine the optimal you with a familiar coding language and storage medium for the array as LUTs, helps reduce the learning curve associated block RAMs, or external memories. It is with high-level synthesis tools. Also, PICO http://newsletter.xilinx.com the advanced memory analysis capabilities provides the capability to quickly compare of PICO that allow for a simple coding price/performance trade-offs that allow you model for memories. Arrays are a common to choose the right part without having to storage structure with an easily understood modify your application code. usage model in the C domain. Therefore, For additional information on how designers can specify both simple and com- PICO Express can accelerate the develop- plex memory architectures with ease. ment of FPGA-based embedded systems, Figure 3 summarizes the data and con- visit www.synfora.com.

Original Video Frame

RGB to YUV Sobel Edge Fixed-Point Conversion Detector

Output Video

YUV to RGB Output Video Conversion

Figure 3 – Color conversion and Sobel edge detector

Second Quarter 2008 Xcell Journal 77 PLATFORMS, PROCESSORS & TOOLS DesignDesign EmbeddedEmbedded SystemsSystems withwith XilinxXilinx andand SynplicitySynplicity You cancan useuse thethe integratedintegrated XilinxXilinx EDKEDK andand Synplicity’sSynplicity’s SynplifySynplify ProPro oror Synplify Premier software development flow to synthesize and debug.

by Angela Sutton Sr. Product Marketing Manager Flow When to Use Details Synplicity, Inc. •Allows the addition of non-peripheral [email protected] ISE software/EDK hard- Most processor-based ware development flow embedded subsystems custom logic to the system and designs with non- The Xilinx® Embedded Development Kit •Allows the addition of HDL files that (.ise project file) peripheral custom logic contain non-peripheral custom logic (EDK) is available to designers wishing to directly to the project implement embedded designs using IBM PowerPC hard processor cores, Xilinx •You can integrate processor subsystems as either top-level modules or sub-modules MicroBlaze™ embedded processor cores, and Xilinx-supplied implementations of buses, block RAMs, and peripherals such as Standalone EDK hard- PowerPC or MicroBlaze •The flow includes system top-level hardware USB and PCIe. ware development flow processor-based synthesis, mapping, and place and route If you generated an embedded proces- (.xmp project file) embedded system or processes in the background so that it is not sor subsystem with EDK and wish to syn- necessary to work with Xilinx ISE software sub-modules to generate the configuration bitstream file. thesize and debug your complete system for implementing on a Xilinx FPGA, you •Compiles and executes multiple software can do so using the Synplify Pro and/or applications, generates libraries for the code, and merges software with hardware Synplify Premier (referred to in this arti- files for downloading to a target board. cle as Synplify Pro/Premier) integrated embedded design flows from Synplicity, •Each hardware component is treated as an independent core (a separate XST project Inc. These Synplify Pro/Premier/EDK file is generated for each core). flows allow you to perform optimization and debug operations on an entire •During synthesis, XST automatically runs on each core separately to generate netlists embedded FPGA system, including (NGC files) through the Xilinx Xflow. Xilinx embedded cores. An integrated Synplify Pro/Premier/EDK •If the core is encrypted (MicroBlaze processor cores, for example), XST is always flow allows designers to synthesize an invoked to provide an NGC netlist. entire design top-down as one design entity, handled as a single Synplify Pro/Premier design project. Table 1 – Choosing which Synplify Pro/Premier/EDK flow to use

78 Xcell Journal Second Quarter 2008 PLATFORMS, PROCESSORS & TOOLS

Hardware Development Flow In addition to automating the flow, this Hardware Compilation integrated Synplify Pro/Premier/EDK flow Processor, Peripherals, offers the following advantages: and Bus Specification Import Project into Synplicity • Better quality of results (QoR): Synplify Hardware Configuration Pro/Premier considers the design in its Add / Edit HDL Code entirety when performing timing-driven Automatic Hardware and area optimizations. Moreover, the Platform Generation tools can time through the core and may Run Synthesis

potentially optimize across EDK core Hardware Compilation boundaries, which improves design performance QoR. Run Place and Route Bitstream • Visibility and debugging capabilities during implementation. You only need to run Import Bitmap Back to XPS synthesis once to generate the final design netlist. Debugging and optional floorplan- Figure 1 – Synplify Pro/Premier/EDK flow. Embedded cores are read into the ning can be performed on the entire Synplify Pro/Premier tool, where the design is fully synthesized. The post-place and design. You can also use Synplify route-generated bitmap is returned to Xilinx Platform Studio. Pro/Premier’s HDL analyst, timing analyst, physical analyst, island timing analyst, Synplify Pro, Synplify Premier, and add-on Identify RTL debugging tool Xilinx Platform Studio ISE Software to debug and analyze the entire design. .ise or .xmp Let’s describe this flow in more detail. Xilinx project 2. Read into 1. Generate embedded cores Synplify design project Overview of Xilinx EDK Flow .prj Synplify project The Synplify Pro or Synplify Premier tool from Synplicity is required to perform syn- 3. Synthesize, place and route thesis, as a part of the overall embedded 4. Read bitmap file into Xilinx Platform Studio .bit FPGA hardware creation process, after the _bd.bmm EDK cores have been included in the design. Bit file You can use the two available Xilinx EDK Figure 2 – The four steps involved in the EDK Synplify Pro/Premier flow. embedded hardware development flows that exist today to create your embedded design 1. Generate embedded cores using XPS or The .ise or .xmp file generated will refer and to generate the embedded hardware in ISE software. Create an EDK project to a variety of design files that represent the HDL format. that Synplify Pro/Premier can read to core. These files are shown in Table 2. The two flows are: generate a core netlist using one of the two flows for XPS users: 2. Read the generated files into Synplify • Xilinx stand-alone EDK hardware devel- Pro/Premier design project. The files opment flow for those using Xilinx • The Xilinx stand-alone EDK hard- that Synplify Pro and Synplify Premier Platform Studio (XPS) ware (generates an .xmp file) automatically read are listed in Table 3. • Xilinx ISE™ software/EDK hardware • Xilinx ISE software/EDK hardware The Synplify Pro/Premier tools read the development flow for those using Xilinx development flow (generates an .ise NGC netlist and EDK-generated plat- ISE software file) for ISE software users form (as specified by the .ise/.xmp project files); they are automatically Table 1 includes additional details about In the XPS GUI: included in the design project. The which of these two flows to use. Figure 1 • Select Project > Project Options > Synplify Pro/Premier software reads illustrates the overall design flow steps for Hierarchy and Flow many different files that are part of the both flows. Xilinx project file, including RTL, • Select Implement Design in ISE netlists, and constraints (V/VHD, software (Export to Project Integrated Synplify Pro/Premier/EDK Flow EDN, NGC, NGO, SDC, UCF, OPT). Navigator Flow: Depreciated) The Synplify Pro/Premier/EDK flow com- It adds these definition and constraints prises four steps, as shown in Figure 2. • Select Hardware > Generate Netlist files to the synthesis project it creates Here are some details about each step. to generate the netlist before running synthesis.

Second Quarter 2008 Xcell Journal 79 PLATFORMS, PROCESSORS & TOOLS

Table 4 shows the output files created • Click OK Files and Directories Created by EDK after synthesis, including a bitmap file. .xmp Xilinx microprocessor • Run Device Configuration > project file 4. Read bitmap back into XPS. Update Bitstream. You can now .mhs Microprocessor hardware • Import the bitmap file back into the download the bitstream into the specification file XPS environment. The .bit FPGA. .mss Microprocessor software and _bd.bmm files original- specification file ly found in the synplify/rev_1/par_1 Encrypted Cores directory should be placed in the pcores/ User IP cores In some cases, the cores are encrypted by “implementation” directory within Xilinx (for example, Xilinx MicroBlaze hdl HDL wrappers the EDK project directory. In the (encrypted cores) processors). You can merge these cores XPS GUI, select Project > Import into your design with Synplify etc/ UCF Xilinx constraint files from ProjNav. Pro/Premier. The core may appear as a data/ Place and route option files • Select the BIT and BMM files black box in the tool. synthesis/ Synthesis files from the PAR directory The PAO file generated by XPS will point to an encrypted IP core definition implementation/ Top-level place and route • BIT file synplify/rev_1/par_1/ results, block .ngc netlist files .bit file. Synplify Pro/Premier software allows you to specify the appearance of the core • BMM file synplify/rev_1/par_1/ Table 2 – The embedded core files as a black box (in which case the core generated by EDK software edkBmmFile_bd.bmm wrapper file will be present and copied into the Synplify Pro/Premier folder). If Note that any custom HDL logic will Input Files Read by Synplify Pro/Premier you specify that the core should not be be included in the .ise project file and .ise Project file in ISE software treated as a black box – in other words, a read automatically by the Synplify project directory, if created predefined netlist exists for the core – the Pro/Premier tool if you used the tool will seek to locate a definition file .xpm and .mhs Project file in EDK project Xilinx EDK/ISE hardware develop- directory, if created named implementation/.ngc. If it ment flow. However, you must add finds the file, it adds it to the Synplify Design specification files these files manually to the Synplify .mpd, .pao, Project .prj file. If it does not find the HDL (.v, .vhd), in EDK project directory, Pro/Premier project if you are using implementation/.ngc file, it issues .edn, .ngo, .ngc if available, or in EDK the Xilinx stand-alone EDK hardware hardware library in a location an error message. development flow. specified through the EDK In Synplify Pro/Premier software ver- installation path sion 9.0.2 and above, a new secure .ngc 3. Invoke and run Synplify Pro/Premier, flow exists that allows you to easily include then place and route. Table 3 – These EDK files are read by encrypted cores in the synthesis flow. It Synplify Pro and Synplify Premier • In the Synplify Pro/Premier tool, also allows the Synplify Pro/Premier tool open the Synplify Pro/Premier to additionally optimize inside cores, design (.prj) project, which now Output Files Created by Synplify Pro/Premier resulting in better overall QoR. includes the .ise/.xmp embedded .prj Project file cores. The cores will now be part Conclusion .sdc Constraint file of the Synplify Pro/Premier An increasing percentage of FPGA designs design. .ucf Xilinx user constraint file include cores. For designers generating microprocessor and peripheral design cores • If the EDK cores are a subsystem OPT file and core Xilinx Xflow OPT file and wrapper file core wrapper file generated, using the Xilinx Embedded Development in a larger design, simply instanti- if the core is a black box to Kit (EDK), the Synplify Pro and Synplify ate the subsystem into the top- the Synplify directory Premier software tools offer a fully inte- level HDL design. (for example, if the core was encrypted) grated flow that allows you to synthesize • Run synthesis and then complete and implement an FPGA system. place and route using the normal .bmm Bitmap file – in These tools offer better QoR and a Synplify Pro/Premier ISE software Synplify Pro/Premier place and route directory more convenient flow, as well as full flow. You can run Xilinx place and design debugging and analysis during the route from the Synplify Table 4 – Output files generated by Synplify Pro synthesis and placement phases. For more Pro/Premier interface to generate a and Synplify Premier, ready for Xilinx information, contact Synplicity customer top-level hardware bitmap file. Platform Studio to read support at [email protected].

80 Xcell Journal Second Quarter 2008 TECHNICAL LITERATURE Embedded Processing Application Notes and Reference Systems Popular application notes and reference systems available from Xilinx.

Xilinx maintains and updates a large data- the PLBv46 PCI core is operating correctly. (XMD) and the GNU software debugger base of application notes on numerous top- Several software projects illustrate how to (GDB) to debug software defects. ics as well as reference systems to assist you configure the PLBv46 PCI core(s), set up in your next design. interrupts, scan configuration registers, and Ethernet PHY Register Access with GPIO set up and use DMA operations. http://www.xilinx.com/support/documentation/ PowerPC 440 System Simulation application_notes/xapp1042.pdf http://www.xilinx.com/support/documentation/ PCI Bus Performance Measurements application_notes/xapp1003.pdf Using the Vmetro Bus Analyzer The XPS Ethernet MAC lite peripheral does not http://www.xilinx.com/support/documentation/ provide any mechanism to access Ethernet PHY This reference system demonstrates the application_notes/xapp998.pdf registers. These registers are used to configure functionality of the PowerPC 440 proces- auto negotiation parameters and to obtain PHY sor block on the Virtex™-5 FXT FPGA. This application note illustrates how to meas- status. This application note provides a PowerPC This system includes common peripherals ure performance using the Vmetro Vanguard 405 reference system and a MicroBlaze processor like Ethernet, DDR2, DMA, interrupt PCI bus analyzer. These measurements use a system where the PHY serial management inter- controller, and RS232. system in which the ML410 evaluation plat- face signals (MDC, MDIO) are connected to an Several stand-alone software applications form, the ML555 evaluation platform, and XPS GPIO peripheral. can verify functionality of the peripherals Vmetro Vanguard-PCI (VG-PCI) boards and the PowerPC 440 processor block. communicate over the PCI bus. The test Setup of a MicroBlaze Processor You can learn how to set up the simula- method is provided so that similar tests can be Design for Off-Chip Trace tion environment for the system, execute done using different PCI bus transactions, http://www.xilinx.com/support/documentation/ the simulation, and add PLB v4.6 periph- different boards, or other Xilinx PCI cores. application_notes/xapp1029.pdf erals. The Xilinx ML507 Rev A board ver- ifies the system in hardware on the Virtex-5 Accessing Spartan-3AN This application note describes how to modify FXT FX70TFF1136-1FPGA. In-System Flash Using XPS SPI an existing MicroBlaze processor design to sup- http://www.xilinx.com/support/documentation/ port the trace features in MicroBlaze processor PLBv46 PCI Using the ML555 Embedded application_notes/xapp1034.pdf version 7 and above. It is a detailed tutorial on Development Platform / ML410 Embedded how to add and configure the trace core in a The application note demonstrates how to Development Platform / Avnet Spartan-3 MicroBlaze processor design. The application access the in-system flash in a Spartan™- FPGA Evaluation Board note does not target any specific development 3AN FPGA after the FPGA is configured. http://www.xilinx.com/support/documentation/ board, but I/O constraints for some boards are The software applications use the serial application_notes/xapp999.pdf provided to work with Lauterbach and peripheral interface (XPS SPI) core in a Computex trace tools. http://www.xilinx.com/support/documentation/ MicroBlaze processor-based reference system. application_notes/xapp1001.pdf VxWorks 6.x on the ML403 Embedded Introduction to Software Debugging http://www.xilinx.com/support/documentation/ Development Platform on Xilinx PowerPC 405 Embedded Platforms / application_notes/xapp1038.pdf http://www.xilinx.com/support/documentation/ MicroBlaze Processor Embedded Platforms application_notes/xapp947.pdf These application notes describe how to http://www.xilinx.com/support/documentation/ build reference systems for the processor application_notes/xapp1036.pdf This guide shows the steps required to build and local bus peripheral component intercon- configure a ML403 embedded development http://www.xilinx.com/support/documentation/ nect (PLBv46 PCI) core using either a platform to boot and run the VxWorks RTOS. A application_notes/xapp1037.pdf PowerPC 405 or MicroBlaze™ processor- VxWorks bootloader is created, programmed based embedded system. These application notes offer an introduc- into flash, and used to boot the design. The con- A set of files containing Xilinx tion to software debugging of Xilinx cepts presented here can be scaled to any Microprocessor Debugger (XMD) com- PowerPC 405 or MicroBlaze embedded PowerPC-enabled development platform. This mands are provided for writing to the con- processor platforms by discussing the use of popular application note is now available for figuration space headers and to verify that the Xilinx Microprocessor Debugger EDK / ISE™ software 10.1.

Second Quarter 2008 Xcell Journal 81 GLOBAL SERVICES Embedded Processing QuickStart! Get your design off to the right start. by Jannis McReynolds by using the Embedded Development Kit How You Benefit Sr. Manager, Services Marketing (EDK). This course provides hands-on labs • Shortened ramp-up times by learning Xilinx, Inc. regarding the development, debugging, and essential design and debugging techniques [email protected] simulation of the embedded system. Labs provide you with a choice of targeting either • Empowered hardware teams through coach- ing on software-based optimized systems More than ever, designers are being asked to PowerPC or MicroBlaze processor systems. reduce system costs, reduce power con- Note that QuickStart! is available for • Maximum performance gained from sumption, and meet aggressive design projects beyond embedded processing. understanding advanced and proven FPGA schedules. With QuickStart!, Xilinx helps Please see Table 1 for details about our addi- design techniques tional offerings. you meet these challenges with an unprece- • Minimized design risks acquired by using dented level of on-site support and training. the finest expert advice QuickStart! is available for a wide range Three Days of Consulting: Deliverables of technologies, including embedded sys- • Configuration of the Xilinx Titanium Dedicated Engineering tems, PlanAhead™ software floorplan- design environment Titanium Dedicated Engineering provides a ning, timing optimization, multi-gigabit senior engineer on-site similar to QuickStart! transceivers, and PCIe integration. • Set up and customization of EDK software However, Titanium allows for longer engage- QuickStart! provides you with the accel- ments (more than one week) and does not erated ramp-up time you need to go to • Design architecture and implementation include a training class. For more information, market faster. QuickStart! ensures that you consultation and guidance visit www.xilinx.com/titanium. will have a more knowledgeable team – one that is capable of executing better designs, • System partitioning consultation Take the Next Step lowering overall costs, and enhancing your • Advice on using of advanced features Get your next Xilinx design off to the right competitive edge. and capabilities of the Xilinx MicroBlaze start with QuickStart! To learn more, visit processor and PowerPC processor www.xilinx.com/quickstart. What is QuickStart!? QuickStart! supplies you with an engineer at your site for one week. The Xilinx expert Type Course Specific Features will train and empower your team to com- Embedded Embedded Systems Development Design environment configuration plete your project on time and on budget, EDK software customization while ensuring you make the best use of Design architecture consultation System partitioning guidance your Xilinx device. MicroBlaze / Power PC processor optimization QuickStart deliverables include: PlanAhead Software Designing with PlanAhead Software PlanAhead software design environment configuration • A customized, two-day, on-site, Floorplanning implementation consultation technology-specific course Synthesis and project tips Design and performance-improving analysis • Three days of consulting from a senior applications engineer Timing Optimization Designing for Performance Proper constraining techniques Best-known HDL coding methods • Expert advice on design architecture Timing optimization reviews and implementation optimization Debugging techniques demonstrations MGT Designing with MGTs MGT ports and attributes education Two-Day Customized Course: Advanced features utilization Embedded Systems Development Configuration methods Customers engaged in an embedded design Simulation and implementation flows will receive the Xilinx Embedded Systems PCIe Integration Designing with MGTs Migrating Virtex-II/Virtex-4 software designs Development course. Advanced timing closure This course gives you a better under- Evaluating design requirements standing of how to develop a PowerPC and MicroBlaze™ processor embedded system Table 1 – QuickStart engagement types

82 Xcell Journal Second Quarter 2008

GET UP TO 50% LOWER COST

Xilinx Spartan™-3 Generation devices are the only low-cost FPGA family to cut your total cost by half… and we can prove it! We offer the industry’s widest range of devices, most comprehensive IP library (8x the nearest competitor!), a huge selection of development boards and we’ve minimized the need for external components.

Visit us at www.xilinx.com/spartan to download our free ISE™ WebPACK™ design tools and start your Spartan-3 design today. You too can experience the lowest total cost. Period.

www.xilinx.com/spartan

The World’s Most Widely Adopted Low-Cost FPGAs

PN 2074 ©2008 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.