Issue 1 March 2005 Embeddedmagazine EMBEDDED SOLUTIONS FOR PROGRAMMABLE PROCESSING DESIGNS

INSIDE Accelerated System Performance with APU-Enhanced Processing

The New Virtex-4 FPGA Family with Immersed PowerPC

Embedded Processing with Virtex and Spartan FPGAs

High-Bandwidth TCP/IP PowerPC Applications

Scalable Software-Defined Radio

Emulate 8051 Microprocessor in PicoBlaze IP Core

Optimize MicroBlaze Processors for Consumer Electronics Products

Xilinx Embedded Processing with Optos

R

EMBEDDED MAGAZINE ISSUE 1, MARCH 2005

CONTENTS

Introducing the New Virtex-4 FPGA Family ...... 6

Accelerated System Performance with APU-Enhanced Processing ...... 10

Sign of the Times ...... 14

Xilinx Teams with Optos ...... 18

What Customers and Partners are Saying about Xilinx Embedded Processor Solutions ...... 22

Considerations for High-Bandwidth TCP/IP PowerPC Applications ...... 24

Configure and Build the Embedded Nucleus PLUS RTOS Using Xilinx EDK ...... 27

MicroBlaze and PowerPC Cores as Hardware Test Generators ...... 30

Nohau Shortens Debugging Time for MicroBlaze and Virtex-II Pro PowerPC Users ...... 34

A Scalable Software-Defined Radio Development System ...... 37

Nucleus for Xilinx FPGAs – A New Platform for Design ...... 40

Emulate 8051 Microprocessor in PicoBlaze IP Core ...... 44

Optimize MicroBlaze Processors for Consumer Electronics Products ...... 48

Use an RTOS on Your Next MicroBlaze-Based Project ...... 50

UltraController-II Reference Design ...... 53 Welcome to the inaugural edition of the new Xilinx Embedded Magazine and the Embedded Systems Conference 2005. Embeddedmagazine In this, our first edition of Embedded Magazine, we have assembled a host of articles representing a wide range of embedded processing applications. During the last few years, Xilinx®, our partners, and EDITOR IN CHIEF Carlis Collins W [email protected] our customers have developed and shared a vision to build and assemble all of the elements required 408-879-4519 for a complete and robust range of embedded processing solutions for FPGA technologies. Included in this magazine and at the Embedded Systems Conference 2005 are examples and MANAGING EDITOR Forrest Couch [email protected] demonstrations of state-of-the-art commercial applications, real-time operating systems, multi- 408-879-5270 processor debugging environments, testing of complex hardware modules, and high-speed Internet communication protocols. ASSISTANT MANAGING EDITOR Charmaine Cooper Hussain Last year, we made significant strides in the embedded processing arena. In July, we announced the formation of the Embedded Processing Division, reinforcing our commitment to the increasingly XCELL ONLINE EDITOR Tom Pyles [email protected] diverse and evolving embedded systems market. This division brings talent and technology togeth- 720-652-3883 er in an organization that intends to accelerate development of an even wider range of embedded system solutions, optimizing the full capabilities of our silicon architectures at multiple perform- ADVERTISING SALES Dan Teie ance and price points. 1-800-493-5551 In September, Xilinx launched our breakthrough Virtex™-4 family of products, featuring embed- ded processing solutions with PicoBlaze™ and MicroBlaze™ soft-processor cores and PowerPC™ ART DIRECTOR Scott Blair (available on all Virtex-4 FX devices). The Virtex-4 FX device includes the new PowerPC Auxiliary Processor Unit (APU) controller to easily connect the CPU to the FPGA fabric, enabling the implementation of acceleration hardware for virtually any application. Once only the domain of www.xilinx.com/xcell/embedded high-budget ASIC and ASSP design teams, the Virtex-4 FPGA’s architectural ability to combine application-specific hardware acceleration with the high-performance PowerPC and MicroBlaze processor shatters the traditional barriers of cost, time to market, and risk. In February 2005, our Xilinx Platform Studio (XPS) embedded tool suite received top honors at the inaugural International Engineering Consortium (IEC) DesignVision Awards. Judged by inde- pendent industry experts, XPS was selected in the FPGA/PLD Design Tools category for its inno- vation, uniqueness, market impact, customer benefits, and value to society. We have an exciting lineup of events at the conference, showcasing both Xilinx and our partners’ latest solutions. The Xilinx booth (#1525) will feature our award-winning Xilinx Platform Studio, as well as the latest Virtex-4 FX solution; our flexible PicoBlaze and MicroBlaze soft processors with Xilinx, Inc. Spartan™-3 devices; and our high-performance DSP solutions. 2100 Logic Drive San Jose, CA 95124-3400 Phone: 408-559-7778 I hope you find the conference and our first edition of Embedded Magazine informative and inspir- FAX: 408-879-4780 ing. We invite you to unlock the power of Xilinx programmability. The advantages will change the way embedded systems are designed. © 2005 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and otherdesignated brands included herein are trademarks of Xilinx, Inc. PowerPC is a trade- mark of IBM, Inc. All other trademarks are the propert y of their respective owners.

The articles, information, and other materials included in this issue are provided solely for the convenience of our readers. Xilinx makes no warranties, express, implied, Mark Aaldering statutory, or otherwise, and accepts no liability with respect to any such articles, information, or other materials or Vice President their use, and any use thereof is solely at the risk of the user. Any person or entity using such information in any Embedded Processing way releases and waives any claim it might have against & IP Divisions Xilinx for any loss, damage, or expense caused thereby. Theatre Presentations in Xilinx Booth #1525 Hands-On Workshops Switching Fabrics at Layers 1 and 2 Network Systems Design Seminar: NSD-707 A Complete Spectrum of Programmable Processing Solutions Implementing a Virtex-4 FX PowerPC System with Tuesday, March 8: 1:30 p.m., 4:30 p.m., 7 p.m. Floating-Point Co-Processor Using Platform Studio Wednesday, March 9: 8:30 a.m.–10:00 a.m. Wednesday, March 9: 12 p.m., 3 p.m., 5:30 p.m. Instructor: Glenn Steiner Presenter: Hamish Fallside, Xilinx, Sr. Manager, Thursday, March 10: 10:30 a.m., 1:30 p.m. Following a brief presentation, this workshop will provide Networking Systems Solutions Boot a Live Embedded System in Just 15 Minutes hands-on experience implementing a PowerPC processor sys- Tuesday, March 8: 2 p.m., 5 p.m., 7:30 p.m. tem demonstrating code acceleration through an FPU co- Hardware/Software Codesign for Platform FPGAs processor. You will create the design utilizing the award-win- Wednesday, March 9: 12:30 p.m., 3:30 p.m., 6 p.m. Embedded Training Program: ETP-423 Thursday, March 10: 11 a.m. ning Platform Studio tool suite and target actual hardware. The application will demonstrate code acceleration via the Thursday, March 10: 11:15 a.m.–12:45 p.m. Breakthrough Performance with Virtex-4 FPU with graphical display of the output. Presenter: Don Davis, Xilinx, Sr. Manager, High-Level Tools Tuesday, March 8: 2:30 p.m., 5:30 p.m. Location: Room 256 Tuesday, March 8: Wednesday, March 9: 10:30 a.m., 1 p.m., 4 p.m., 6:30 p.m. Software-Defined Radio with 10:00 a.m.–11:30 a.m. Thursday, March 10: 12 p.m. Reconfigurable Hardware and Software 2:00 p.m.–3:30 p.m. Eclipse-Based HW/SW Debug 4:00 p.m.–5:30 p.m. Embedded Training Program: ETP-442 Tuesday, March 8: 3 p.m., 6 p.m. Thursday, March 10: 2:00 p.m.–3:30 p.m. Wednesday, March 9: 11 a.m., 1:30 p.m., 4:30 p.m. Accelerating an IDCT Software Application Using the Presenter: Peter Ryser, Xilinx, Thursday, March 10: 9:30 a.m., 12:30 p.m. Configurable MicroBlaze 32-bit RISC Processor Embedded SW Systems Engineer Manager Breakthrough DSP Performance at the Lowest Cost Instructor: Derek Palmer Tuesday, March 8: 3:30 p.m., 6:30 p.m. Utilizing the new Fast Simplex Link (FSL) port on the Xilinx Other Xilinx-Related Technical Papers Wednesday, March 9: 11:30 a.m., 2 p.m., 5 p.m. MicroBlaze soft-core processor, we can add dedicated low-laten- Designing with Embedded Processor Cores in FPGAs Thursday, March 10: 10 a.m., 1 p.m. cy co-processing engines tuned specifically for any application Easy Paths to Silicon Design Seminar: EPD-524 Xilinx Distributor Presentations at hand. We will demonstrate a 10X performance improvement Tuesday, March 8: 4 p.m. – Presenter: Insight/Memec of an inverse discrete cosine transform (IDCT) algorithm, uti- Monday, March 7: 11:00 a.m.–12:30 p.m. Wednesday, March 9: 2:30 p.m. – Presenter: Avnet lizing a Spartan-3 platform for hands-on experience. Presenter: Avnet Thursday, March 10: 11:30 a.m. – Presenter: Nu Horizons Location: Room 256 Wednesday, March 9: Designing with IP in FPGAs 10:00 a.m.–11:30 a.m. Easy Paths to Silicon Design Seminar: EPD-544 2:00 p.m.–3:30 p.m. Product Demonstrations in Xilinx Booth #1525 4:00 p.m.–5:30 p.m. Monday, March 7: 1:30 p.m.–3:00 p.m. Intelligent Tools Accelerate Development Presenter: Avnet We will use the Xilinx Platform Studio tool suite to create an embedded High-Performance DSP on an FPGA Made Easy Implementing DSPs in FPGAs hardware/software/IP system and boot a real-time . Instructor: Sabine Lam We will show how Platform Studio accelerates embedded processing Easy Paths to Silicon Design Seminar: EPD-604 development for programmable systems through the IDE, as well as In this DSP workshop, designers will learn how to implement Tuesday, March 8: 8:30 a.m.–10:00 a.m. high-performance DSP designs using the new Xilinx Virtex-4 powerful design wizards, BSP, and application code generators. Presenter: Avnet FPGAs. By leveraging tools such as Xilinx System Generator Low Cost and Flexible Processing for DSP, IP cores, and hardware, even designers new to FPGAs Designing with Soft-Processor Cores in FPGAs, Part 1 Come see the new Spartan-3 SP305 board implementing complex embed- will realize the capability and ease-of-use of Xilinx DSP Embedded Training Program: ETP-309 ded motor control using the MicroBlaze soft 32-bit processor, a floating design solutions. point unit, PID control loops, PWM, DAC, and ADC to control both a brush- Wednesday, March 9: 8:30 a.m.–10:00 a.m. less DC and a three-phase AC induction motor. We will also show a fre- Location: Room 256 Thursday, March 10: Presenter: Avnet quency generator using the PicoBlaze 8-bit processor, capable of generat- 9:00 a.m.–10:30 a.m. ing a frequency range up to 640 MHz incremented at 1 Hz. 11:00 a.m.–12:30 p.m. Designing with Soft-Processor Cores in FPGAs, Part 2 1:00 p.m.–2:30 p.m. Embedded Training Program: ETP-329 High-Performance Processing Wednesday, March 9: 11:00 a.m.–12:30 p.m. Using a Xilinx ML310 Embedded Development Platform board with multi- Presenter: Avnet OS boot (Linux, VxWorks, and QNX), along with the Xilinx Platform Studio (XPS) tool suite, we will implement a dual-processor system with Technical Conference Papers by Xilinx FPGA Design for Firmware Engineers, Part 1 inter-processor communication. The first processor will be used as a com- Virtex-4 FX Platform FPGA Embedded Training Program: ETP-343 pute engine utilizing the FPU. The second processor will be used as a host Extends PowerPC Instructions processor managing data flow and I/O. Wednesday, March 9: 2:00 p.m.–3:30 p.m. Microprocessor Summit Presenter: Avnet Accelerate Programmable Embedded Platform Performance Session MPS-900 Accelerate system performance with the Virtex-4 FX family of devices FPGA Design for Firmware Engineers, Part 2 Argent Hotel, San Francisco using the Auxiliary Processor Unit (APU) controller. We will introduce Embedded Training Program: ETP-363 the Xilinx ML403 FX Embedded Development Platform board and Monday, March 7: 9:00 a.m.–10:30 a.m. Wednesday, March 9: 3:45 p.m.–5:15 p.m. demonstrate algorithm code acceleration with APU implementation Presenter: Dan Isaacs, Xilinx, APD Marketing and graphical display of output. We will utilize the Gigabit System Presenter: Avnet Reference Design (GSRD) with TCP/IP acceleration and Linux eOS. Approaches in FPGA Design FPGA Embedded Processors: Easy Paths to Silicon Design Seminar: EPD-664 Revealing True System Performance High-Performance DSP Implementation and Embedded Tuesday, March 8: 4:00 p.m.–5:30 p.m. Development Kit Integration Using System Generator Embedded Training Program: ETP-367 This demo shows how you can use System Generator for DSP 6.3 to Presenter: Tim Tripp, Xilinx, Wednesday, March 9: 3:45 p.m.–5:15 p.m. build DSP microprocessor-based control circuits for high-performance Advanced System Tools Product Marketing Presenter: Memec DSP systems. System Generator for DSP and Xilinx DSP IP cores provide unrivaled DSP design solutions for FPGA-based DSP systems. Viewfrom the top Introducing the New Virtex-4 FPGA Family

The latest FPGAs from Xilinx set new records in capacity, by Erich Goetting Vice President & capability, performance, power efficiency, and value. General Manager, Advanced Products Division Xilinx, Inc. [email protected]

Welcome to the Xilinx® Virtex-4™ edition of the Xcell Journal. We’ve created this spe- cial issue to show you the new Virtex-4 FPGA family, and how its innovations enable the creation of next-generation sys- tems that do more than ever thought possi- ble only a few years ago. In this article, I’ll take you behind the scenes for a guided tour of some of the new technologies, as well as a bit of the inspira- tion and rationale behind them. With more than 100 innovations, the Virtex-4 family represents a new milestone in the evolution of FPGA technology. After conducting extensive interviews with lead- ing design engineers worldwide, we knew that they wanted the following things in an advanced next-generation FPGA family: • Higher performance • Higher logic density • Lower power • Lower cost • More advanced capabilities

6 Embedded magazine March 2005 VIEW FROM THE TOP

At the heart of the Virtex-4 FPGA is our next-generation 90 nm triple-oxide 10-layer copper CMOS process technology.

It’s relatively easy to deliver on one or two That in turn has enabled us to offer Virtex-4 material, which is copper rather than alu- of these items – our challenge was to deliver devices in three unique platforms, each with minum (the traditional material). More lay- all of them at the same time. We did this a different mix of on-chip resources: ers provide more routing in less space and through a combination of innovative process shorter connection distances. Copper • The LX platform, optimized for logic and circuit design, process development, the reduces resistance compared to aluminum, applications ASMBL architectural approach, and the use and thus speeds signal interconnect and of advanced embedded functions. • The SX platform, optimized for high- reduces on-chip power-distribution IR Development work on the Virtex-4 fam- end DSP applications drop. As clock rates go up and voltages go ily (code-named “Whitney” after the high- • The FX platform, optimized for down, these considerations have become est mountain in the continental United embedded processing and high-speed increasingly important, and have driven the States) began more than two years ago. It serial applications industry-wide shift to copper interconnect. represents the creativity and dedication of The Virtex-4 logic fabric was complete- hundreds of engineers, spanning integrated ly re-engineered to fully take advantage of circuit design and layout, software and IP A Look Inside the Virtex-4 FPGA the 90 nm triple-oxide CMOS process, development, process development, testing At the heart of the Virtex-4 FPGA is our resulting in the highest performance fabric and characterization, systems and applica- next-generation 90 nm triple-oxide 10- ever, with system clock rates in excess of tions engineering, technical documenta- layer copper CMOS process technology. 500 MHz (at three LUT levels). At the tion, and product marketing. While that’s quite a lot of adjectives, same time, static power was cut in half One of the most remarkable develop- every one of them is incredibly impor- compared to 130 nm Virtex-II Pro™ ments embodied in the new Virtex-4 FPGA tant. The first, 90 nm, refers to the devices, as was dynamic power. family is the ASMBL architecture, which rep- “drawn” gate length of the smallest tran- Thus, while some industry pundits were resents a fundamentally new way of con- sistors. As transistors get smaller, they get proclaiming that the future of deep submi- structing the FPGA floor plan and its faster, use less dynamic power, and enable cron CMOS devices was getting hotter and interconnect to the package. First of all, higher complexity at lower price points. hotter, with chip temperatures destined to ASMBL enables I/O pins, clock pins, and Chip designers think in terms of “transis- reach that of rocket nozzles and the surface power and ground pins to be located any- tor budgets,” which are now in the billion of the sun, the Virtex-4 design’s creative where on the silicon chip, not just along the transistor range. approach has turned that conventional wis- periphery as with previous approaches. This dom on its head, resulting in overall power in turn allows power and ground pins to be Triple-Oxide 90 nm CMOS Technology reductions of 50% compared to our previ- brought directly into the center of the silicon Triple-oxide technology refers to the num- ous 130 nm generation. In many applica- die, thereby significantly reducing on-chip IR ber of transistor oxide thicknesses available tions, such as DSP functions, power levels drops that can occur with the largest FPGAs in the process. More oxide thicknesses are reduced even more – as much as 90%. running at the highest frequencies. allow more tuning of performance and No wonder design engineers say that Clock input pins are also located in the power in the device circuitry, and enable Virtex-4 FPGAs are cool – they literally are. center of the die, which reduces clock latency. Virtex-4 devices to deliver industry-leading This is because clock networks need to have performance while dramatically lowering High-Performance Clocking equal delay to all endpoints (that is, mini- power consumption. Clocks were rated as one of the most mum skew), and thus the clock must emanate One of our key inputs from many engi- important and critical FPGA resources in from the center. In periphery-connected clock neers was that performance and power were our surveys of design engineers. Quantity, input pins, the signal first traverses from the very important constraints in their systems quality, connectivity, frequency, duty cycle, edge of the die to the center, and is then dis- designs, and that they needed both high jitter, and skew all made a big difference. tributed to all regions. The Virtex-4 ASMBL performance and low power. With a dual- To take clocking to the next level in design eliminates this traversal completely, oxide 90 nm process, we would have had to Virtex-4 devices, all global clock resources and thus directly reduces the clock network choose performance or power. This wasn’t were made fully differential, thereby reduc- propagation delay. good enough. By employing a triple-oxide ing skew, jitter, and duty-cycle distortion. In addition to its electrical advantages, 90 nm process, we achieved high perform- This marks the first implementation of dif- ASMBL provides another significant benefit ance and low power. ferential clocking in a programmable logic in that it allows a more flexible – and thus The 10-layer copper refers to the num- device. Not only that, but the number of more precise – allocation of on-chip resources. ber of metal interconnect layers and their global clocks was increased to 32, for every

March 2005 Embedded magazine 7 VIEW FROM THE TOP

device, and internal connectivity options The Virtex-4 LX, FX, and SX platforms increased as well. But because the propaga- enhanced to allow any region to use any 8 include the breakthrough XtremeDSP tion speed of signals is a constant, dictated clocks simultaneously. technology. With the new SX platform we by the speed of light, we entered the realm did something completely new – we dra- in which a signal on one end of a wire was 500 MHz Synchronous Memories and FIFOs matically increased the ratio of DSP units no longer the same as the signal on the On-chip synchronous block RAM was to logic cells. Given the highly integrated other end of the same wire. This is what enhanced to run at 500 MHz. Built-in sup- nature of XtremeDSP slices, they need only transmission lines are all about, and their port for first-in first-out (FIFO) memories small amounts of logic fabric to implement appearance during the last few years has was included directly in the block RAM most common DSP functions, and thus driven a sea change in all aspects of signal unit, enabling the same 500 MHz opera- increasing the ratio provides a significant interconnect and I/O design. tion for FIFOs (approximately a 2X increase in DSP compute power per unit To make sure that these signal “waves” speedup over fabric-based FIFOs), while silicon area. In fact, SX devices provide a don’t start “splashing” uncontrollably, trans- eliminating the need for any additional 10X performance increase per unit cost mission lines need to be driven, built, and logic cells or complex FIFO designs. over previous solutions. received using proper signal integrity If you’re designing systems requiring Power is dramatically reduced as well, approaches, the most critical of which is ter- ECC (error checking and correcting) with more than a 10X reduction for multi- mination. Traditionally implemented with memory, Virtex-4 devices have built-in ply/add functions from previous FPGA discrete resistors on the PCB, termination ECC support, with single-bit correct and solutions. The Virtex-4 SX55 contains 512 layouts can become exceedingly difficult double-bit detect. ECC is common in XtremeDSP slices, providing an aggregate around high-density pinouts like those used infrastructure equipment in networking, DSP compute performance of 256 in FPGAs. This often dictates more PCB telecom, storage, servers, instrumentation, GigaMAC/s, making it one of the most layers and thus more system cost. and aerospace applications, and provides powerful DSP devices ever manufactured. Virtex-4 FPGAs include our third- the highest levels of data integrity. Like the The state-of-the-art XtremeDSP slice generation of XCITE™ integrated digi- integrated FIFO support, the integrated employs new “silicon algorithms” devel- tally controlled termination technology. ECC eliminates the cost and delay of oped by a company called Arithmatica™. Offering a precisely controlled source fabric-based solutions. Many different architectures exist for impedance at the output drive pin, it is Speaking of on-chip memory, Virtex-4 implementing multiplication, and the designed to enable the driving of trans- devices continue to offer SelectRAM™ Arithmetica architecture is truly a break- mission lines without external compo- memory, whereby each LUT is trans- through. We are excited to see it available nents, with maximum speed and signal formed into a 16 x 1 RAM, ideally suited for the first time to FPGA users. For more integrity, and with straightforward PCB for building high-speed register files and information, visit Arithmatica’s website at layout and layer stack-ups. local buffers. www.arithmatica.com. Likewise, on inputs, XCITE offers par- At the other end of the spectrum, inter- allel termination for single-ended inputs faces to external memory devices such as The Evolution of Advanced I/O Technology and true differential termination for differ- DDR, DDR2, QDR-II, and RLDRAM-II I/O continues to be a critical success factor ential inputs. Termination occurs on the are dramatically enhanced through our new for today’s systems designers. During the end of the transmission line at the die, not ChipSync™ technology, which offers mem- last decade, we have seen four major on the way there on the PCB, offering max- ory interface speeds at rates limited only by changes in I/O. First was the shift away imum signal integrity. Many customers the speed of the external memory devices. from 5V, the result of the need to scale volt- report that the XCITE technology has The new Virtex-4 ML461 Advanced ages as we scaled the transistor. This in turn saved them many PCB layers, increased Memory Development System contains led to the plethora of I/O standards that we PCB packing density, and saved them sub- fully functional and hardware-proven refer- are all familiar with today: SSTL, HSTL, stantial dollars in their bill of materials. ence designs for all of today’s most popular LVDS, and LVCMOS 1.5. The Virtex-4 memory technologies. If you plan to use SelectIO™ resource continues to lead the Source-Synchronous Interfaces external memory, I highly recommend that industry, supporting virtually every I/O The third major change was the shift from you check this out. standard in use today on every pin. system-synchronous to source-synchronous interfaces. Traditional system-synchronous DSP Performance of 256 GigaMAC/s XCITE On-Chip Termination interfaces work by distributing a single In the DSP domain, we incorporated some The second major change was the transi- clock to all transmitters and receivers in of the world’s fastest multiply accumulate tion from lumped loads to transmission the system, and transmitting data between (MAC) technology. The XtremeDSP™ line loads – again the direct result of source and destination within a single slice can perform an 18 x 18 signed multi- Moore’s Law. As transistors got faster and clock cycle. This makes the data rate ply and 48-bit accumulate every 2 ns. clock rates increased, I/O edge rates inversely proportional to the sum of clock-

8 Embedded magazine March 2005 VIEW FROM THE TOP

to-out, transmission line delay, and input such as those used in fiber-optic links in the The number of CPUs in one device is setup time. SONET/SDH world and the Ethernet limited only by your imagination, and of Typically, system synchronous interfaces links like 100BASE-T. course by the available logic cells. The top out at speeds in the range of 100 MHz. A key breakthrough occurred in the late high-performance PowerPC CPU runs at To go faster, source-synchronous interfaces 1990s, in which high-speed serial transceivers clock rates up to 450 MHz and delivers transmit a clock along with the data, and the (which traditionally had been designed using up to 702 DMIPS each. The first receiver uses this clock to capture the data. complex process technology such as GaAs PowerPC processor available by any Using this technique, along with double- [Gallium-Arsenide]) were for the first time manufacturer on 90 nm, the PowerPC data-rate transmissions, enables parallel I/O created using advanced design techniques processor is incredibly power-efficient, data rates in excess of 1 Gbps. using standard CMOS. Once implemented using only 0.29 mW/DMIPS. This The challenge of source-synchronous in CMOS, these transceivers had lower cost makes it among the lowest power micro- interfaces is that each interface generates a and much lower power, and could even be processors available from any manufac- new clock domain at the receiver. On top integrated into complex CMOS chips. turer worldwide. of this, to operate at high speeds, the pre- Virtually overnight, gigabit serial tech- New Auxiliary Processor Unit (APU) cise alignment of clock and data at the nology changed from a rare, expensive, and technology connects the CPU to the receiver is paramount. To address this new power-hungry technology to a common, FPGA fabric, enabling implementation of world of source-synchronous interfaces, low-cost, and very power-efficient technol- acceleration hardware for virtually any Virtex-4 devices include the breakthrough ogy. This has been the economic and tech- application. Once only the domain of ChipSync technology. ChipSync units lie nical impetus behind the industry’s “Serial high-budget ASIC and ASSP design teams, between the SelectIO technology and the Tsunami,” in which interface after inter- the Virtex-4 FPGA’s architectural ability to core FPGA fabric, are available on every face has shifted from parallel to gigabit combine application-specific hardware I/O pin on the device, and serve to trans- serial links. Two common examples are vis- acceleration with high-performance RISC mit and receive high-speed source-syn- ible in today’s computer architectures, with CPUs shatters traditional barriers of cost, chronous data and clocks, achieving speeds the shift from parallel PCI to 2.5 Gbps time-to-market, and risk. of 1 Gbps per pin pair. serial PCI-Express™, and the shift from During the next few years, I expect to On the receiver, precise digital delay lines the parallel ATA drive interface to the see more and more instances of application- work internally to align data signals to each Serial ATA interface. specific acceleration, as it truly offers the other, and then to align these to the received There are more than a dozen multi- ability to deliver very high performance at clock. The captured data is synchronized gigabit serial interfaces in widespread use low cost and low power. A recent research and transferred to the selected FPGA core today, with more being introduced every program completed within Xilinx Research clock domain. year. The Virtex-4 FX family provides our Labs, led by Dr. Kees Vissers, demonstrated To operate at maximum data rates, the third-generation RocketIO™ multi-giga- a 20-fold speedup for anencryption/decryp- transmit and receive units include parallel- bit serial transceiver technology. Spanning tion (IPSEC) application over the base to-serial and serial-to-parallel conversion speeds from 622 Mbps to more than 10 PowerPC processor. Using only 135 mW, units, respectively. Using ChipSync technol- Gbps, each Virtex-4 RocketIO transceiver it outperforms a 3.2 GHz Pentium™-4, ogy is virtually automatic for most designs, is programmable and can implement a while at the same time reducing power by as it is utilized automatically in the various myriad of speeds and serial standards. 99%. That, in my opinion, is what state-of- Xilinx IP cores and reference designs. Link-layer IP is available for such stan- the-art embedded processing is all about. Networking interfaces such as SPI-4.2 dards as PCI Express, Serial-ATA, and HyperTransport™, and memory inter- FibreChannel, Gigabit Ethernet, and Conclusion faces such as DDR, DDR2 SDRAM, and Aurora, to name a few. I hope that you’ve enjoyed reading a bit QDR II SRAM, all employ the Virtex-4 In addition, Virtex-4 FX devices each about the Virtex-4 Platform FPGA and ChipSync technology. And if you’re design- include multiple embedded tri-mode (or the factors that drove its design. From ing your own source-synchronous interface, 10/100/1000) Ethernet MACs, making the breakthrough ASMBL architecture the ChipSync wizard gives you complete implementation of compliant Ethernet and the triple-oxide 90 nm CMOS control and an easy-to-use GUI that lets you devices simpler and faster than ever. process technology, to the world’s most dial in exactly what you want to build. capable embedded processing and multi- Application-Specific Embedded Processing gigabit serial solutions, Virtex-4 devices Multi-Gigabit Serial Interfaces Virtex-4 embedded processing solutions offer an unparalleled set of enabling The fourth major change in I/O has been include full support for both MicroBlaze™ technologies for your next-generation the rapid adoption of high-speed serial 32-bit soft CPUs on all devices, and systems designs. I look forward to seeing interfaces. For years, serial interfaces were embedded PowerPC™ 32-bit RISC CPUs the creativity of the world’s designers in limited to long-distance communications, on all Virtex-4 FX devices. tomorrow’s products.

March 2005 Embedded magazine 9 Accelerated System Performance with APU-Enhanced Processing The Auxiliary Processor Unit (APU) controller is a key embedded processing feature in the Virtex-4 FX family.

by Ahmad Ansari FCM; the primary capabilities are shown in emulation or ignored completely. This fine- Senior Staff Systems Architect Figure 1. This provides a more efficient inte- tuning optimizes FPGA resources while Xilinx, Inc. gration between an application-specific accelerating the most critical calculations [email protected] function and the processor pipeline than is with dedicated logic. Peter Ryser possible using a memory-mapped coproces- The APU controller also decodes high- performance load and store instructions Manager, Systems Engineering sor and shared bus implementation. Xilinx, Inc. The instructions supported by the APU between the processor data cache or system [email protected] are classified into three main categories: memory and the FPGA fabric. A single instruction transfers up to 16 bytes of data – • User-defined instructions (UDI) Dan Isaacs four times greater than a load or store Director, APD Embedded Marketing • PowerPC floating-point instructions instruction for one of the general purpose Xilinx, Inc. registers (GPR) in the processor itself. Thus, [email protected] • APU load/store instructions this capability creates a low-latency and high- The UDIs are programmed into the bandwidth data path to and from the FCM. The APU controller provides a flexible controller either dynamically through the high-bandwidth interface between the re- PowerPC 405 device control register APU Controller Operation configurable logic in the FPGA fabric and (DCR) or statically when the FPGA is con- Figure 2 identifies the key modules of the the pipeline of the integrated IBM™ figured through its bitstream. The APU APU controller and the 405 CPU in rela- PowerPC™ 405 CPU. Fabric co-processor controller allows you to optimize your sys- tion to the FCM soft coprocessor module modules (FCM) implemented in the FPGA tem architecture by decoding instructions implemented in FPGA logic. To explain fabric are connected to the embedded either internally or in the FCM. the operation of the APU controller and PowerPC processor through the APU con- The floating-point unit (FPU) is an the processor interactions related to the troller interface to enable user-defined con- example of an FCM. The PowerPC float- execution units in soft logic, we can trace figurable hardware accelerators. These ing-point instruction set is decoded in the the step-by-step sequence of events that hardware accelerator functions operate as APU controller, whereas the computation- occur when an instruction is fetched from extensions to the PowerPC 405, thereby al functionality is implemented in the cache or memory. offloading the CPU from demanding com- FPGA fabric. To support FPUs with dif- Once the instruction reaches the decode putational tasks. ferent complexities, the APU controller stage, it is simultaneously presented to both allows you to select subgroups of the the CPU and APU decode blocks. If the APU Instructions PowerPC floating-point instructions. instruction is detected as a CPU instruc- The APU controller allows you to extend the These instructions are executed in the tion, the CPU will continue to execute the native PowerPC 405 instruction set with cus- FCM while other subgroups of instructions instruction as it would normally. tom instructions that are executed by the soft are either computed through software FPU Otherwise, within the same cycle, the CPU

10 Embedded magazine March 2005 implement synchronization semantics to • Extends PPC 405 Instruction Set pace the software execution with the hard- – Floating Point Support PLB ware FCM latency. (with soft auxiliary processor) Non-autonomous instruction types are – User-Defined Instructions further divided into blocking and non- • Offloads CPU-Intensive Operations blocking. If blocking, asynchronous excep- Soft – Matrix Calculations APU tions or interrupts are blocked until the PowerPC Auxiliary Controller FCM instruction completes. Otherwise, if • Video Processing APU FPGA Processor – Floating-Point Mathematics I/F I/F non-blocking, the exception or interrupt is taken and the FCM is flushed. • 3D Data Processing Processor Block • Direct Interface to HW Accelerators Software Description OCM – High Bandwidth FPGA Fabric Software engineers can access the FCM – Low Latency from within assembler or code. On one side, Xilinx has enabled the GCC compiler Figure 1 – APU expanded processing capabilities (which is contained in the Embedded Development Kit) to generate code that uses an FCM floating-point unit to calcu- Intructions from Processor Block Cache or Memory late floating-point operations. Furthermore, Fetch Stage assembler mnemonics are available for Decode Decode Decode Stage UDIs and the pre-defined load/store Instruction Control Registers Soft Coprocessor instructions, enabling you to place hard- Decode APU Decode Module ware-accelerated functions into the regular Optional Decode Control Pipeline program flow. For the ultimate level of flex- Control Operands EXE Stage ibility, you can define your own instructions Execution Instruction Units designed specifically for the hardware func- Exec. Units Buffers and Operands Result Synchronization Result tionality of the FCM. You can easily use the pre-defined Register File load/store instructions through high-level WB Stage APU 405 Controller C macros. For example, in an application Core Load WB Stage Load Data where the FCM is used to convert pixel data into the frequency domain, 8 pixels of Figure 2 – APU controller processing operative block diagram 16 bits are transferred from main memory to an FCM register with a simple program: will look for a response from the APU con- at the proper execution time send the data unsigned short pixel_row[8]; // 8 pixels, troller. If the APU controller recognizes the back to the processor. The APU controller each pixel has a size of 16 bits instruction, it will provide the necessary knows in advance, based on instruction information back to the CPU. type, if or when it will get the result. lqfcm(0, pixel_row); // transfer a row of If the APU controller does not respond pixels to FCM register zero within that same cycle, an invalid instruc- Autonomous and The quadword load operation main- tion exception will be generated by the Non-Autonomous Instructions tains cache coherency as the data is moved CPU. If the instruction is a valid and rec- Two major categories of instructions exist: through the cache, if caching is enabled for ognized instruction, the necessary operands autonomous and non-autonomous. For the corresponding address space. are fetched from the processor and passed autonomous instructions, the CPU contin- The FCM operation on the pixel data to the FCM for processing. ues issuing instructions and does not stall can start on an explicit command; for Because the PowerPC processor and the while the FCM is operating on an instruc- example, a UDI. However, for many appli- FCM reside in two separate clock domains, tion. This overlap of execution allows you cations the operation starts immediately synchronization modules of the APU con- to achieve high performance through tech- after the FCM hardware detects the com- troller manage the clock frequency differ- niques such as software pipelining. pletion of the load instruction. ence. This allows the FCM to operate at a On the other hand, during the syn- The latter approach has many advantages: slower frequency than the processor. In this chronized execution, the CPU pipeline instance, the APU controller would receive stalls while the FCM is operating on an • Simple software – A load operation the resultant data from the coprocessor and instruction. This feature allows you to moves the data from the memory to

March 2005 Embedded magazine 11 the FCM and starts the operation. A To increase the readability of the code, inverse discrete cosine transform (2D- subsequent store instruction retrieves you can redefine the user-defined instruc- IDCT) typically used as one of the pro- the result of the operation and stores it tion with regular C preprocessor constructs. cessing blocks in MPEG-2 video back to main memory. Instead of using the udi0fcm() macro, you decompression, again compared to emu- can redefine it to a more comprehensible lating the 2D-IDCT function in software. • High data transfer rates – Quadword complex_add() macro with #define com- The two use cases are different in that load and store operations take just a few plex_add(r, a, b) udi0fcm(r, a, b) and change the FPU implements a set of registers in the cycles to complete. A single operation the listing to call complex_add(2, 1, 0) FPGA fabric upon which the FPU instruc- moves 16 bytes within that timeframe. instead of udi0fcm(2, 1, 0). tions operate. The 2D-IDCT only requires • Low latency – FCM load operations Therefore, system architects can partition load and store operations, while the func- are simple to use. The processor com- their tasks into hardware- and software- tionality of the operation on the data pletes the operation in a single cycle. executed pieces that are efficiently and pre- stream is fixed. In either case the operations The principle of the RISC architecture cisely interfaced to one another through the are complex enough to justify offloading uses a number of simple instructions on use of the APU controller. This partitioning into the FPGA fabric. data stored in general-purpose registers can be done statically during the initial sys- Thus, the combination of using the (GPR) to compute complex operations. tem configuration or dynamically during APU and FPGA hardware acceleration User-defined instructions fall into this cat- the program execution. Using the direct clearly provides a significant performance egory but take the concept a step further in processor/FPGA coupling presented by the advantage over software emulation – or the that the system architect defines the com- APU controller and its high throughput conventional method involving the proces- plexity of the operation on data stored in interfaces, hardware/software synchroniza- sor and processor local bus architecture GPRs and FCM registers (FCR). Again, tion is greatly simplified and performance with a soft co-processing function. from a software point of view, the engineer significantly improved. codes user-defined instructions through C FIR Filter Accelerating System Performance macros. GCC recognizes mnemonics such The implementation of floating-point as udi0fcm as a user-defined operation of The following examples showcase key calculations in hardware yields an the general form: advantages the APU provides based on two improvement by a factor of 20 over soft- different scenarios. The first scenario is ware emulation. Connecting the FPU as udi0fcm,, essentially a benchmarking comparison of a an FCM to the APU controller provides finite impulse response (FIR) filter using a performance improvement because the The target of the operation is either a soft FPU core, implemented as an FCM latency to access the floating-point regis- GPR or an FCR. The operands are either attached directly to the APU controller (as ters is reduced and dedicated load and GPRs, FCRs, immediate values, or a com- compared to software emulation used to store instructions move the operands and bination. As you can see, the semantics are calculate the filter function). The second results between the FPU registers and the not defined by the instruction and depend scenario implements a two-dimensional system memory. on your intentions and the implementation in the FCM. Spatial Redundancy: This code sequence demonstrates the Pixel Decoding Using the IDCT use of a user-defined instruction as an example of a complex add operation: Pixel Amplitude Values Blocks 223 191 159 128 98 72 39 16 223 191 159 128 98 72 39 16 struct complex { 223 191 159 128 98 72 39 16 APU Function: 223 191 159 128 98 72 39 16 int r, i; // 32 bit integer for real ¥ Decompresses YUV 223 191 159 128 98 72 39 16 and imaginary parts encoded pixel data RGB Pixel DCT Values 223 191 159 128 98 72 39 16 43.8 .40 0 -4.1 0 -1.1 0 0 223 191 159 128 98 72 39 16 for output display 223 191 159 128 98 72 39 16 }; 0 0 000000 APU ¥ Utilize FPGA Resources 0 0 000000 PowerPC Processor Controller (soft logic) 0 0 000000 APU FPGA complex a, b, r; Ð Less overhead logic I/F Interface 0 0 000000 XtremeDSP XtremeDSP Ð Fast data transfer Processor Block ldfcm(0, &a); // load complex number a 0 0 000000 XtremeDSP

0 0 000000 OCM FPGA Fabric into FCM register 0 0 0 000000 ldfcm(1, &b); // load complex number b into FCM register 1 MPEG Decode Flow udi0fcm(2, 1, 0);// udi0fcm computes r = a + b, where r is stored in FCM register 2 stdfcm(&r, 2); // store complex result from FCM register 2 to variable r Figure 3 – Utilizing APU to decode pixel data for display output

12 Embedded magazine March 2005 specific algorithm is desired to achieve Example: Video Application – MPEG De-Compression Algorithm optimal performance, the combination of the APU controller and FPGA hardware PLB acceleration provides a definitive per- • Leverages Integrated Features – PowerPC, APU, XtremeDSP Blocks formance advantage over software emula- tion or the conventional method of • HW Acceleration Over Software attaching coprocessors to the processor Auxiliary – Lower Latency and High Bandwidth APU PowerPC Processor memory bus. Controller APU FPGA (soft logic) Generating the accelerated functions • Effecient HW/SW Design Partitioning I/F Interface – Optimized Implementation XtremeDSP called by user-defined instructions is easily XtremeDSP Processor Block performed through GUI-based wizards. XtremeDSP • Significant Performance Increase This functionality will be included in sub- OCM FPGA Fabric sequent releases of the powerful Embedded Development Kit or Platform Studio. If you are more comfortable working Over 20X Performance Improvement Compared to Software Emulation at the source code or assembly level, the Figure 4 – Accelerated system performance with APU APU controller allows you to define your own instructions written specifically for 2D-IDCT By comparison, if you connect the 2D- the hardware functionality of the FCM, The 2D-IDCT transforms a block of 8 x 8 IDCT hardware block to the processor local or you can easily use the pre-defined data points from the frequency domain into bus, as it is done conventionally, the system load/store instructions through high-level pixel information. A high-level diagram performance will be reduced. This increased C macros. depicting the pixel decode by the APU con- latency is mainly caused by the bus arbitra- The APU controller provides a close troller, along with advantages, is shown in tion overhead and the large number of 32- coupling between the PowerPC processor Figure 3. In this example, each data point bit load and store instructions. This is and the FPGA fabric. This opens up an has a resolution of 12 bits and is represented illustrated schematically in Figure 5. entire range of applications that can imme- as a 16-bit integer value. The data structure diately benefit customers by achieving is defined where each row of 8 pixels con- Conclusion increases in system performance that were sumes 16 bytes. This is an ideal size that The low-latency and high-bandwidth fab- previously unattainable. allows optimal use of the FCM load and ric coprocessor module interface of the For additional details on the APU con- store instructions described earlier. In other APU controller enables you to accelerate troller in Virtex-4-FX devices, including words, eight FCM quadword load instruc- algorithms through the use of dedicated detailed descriptions and timing waveforms, tions are needed to load a data block into the hardware. Where operations are complex refer to the Virtex-4 PowerPC 405 Processor 2D-IDCT hardware. Eight FCM quadword enough to justify the offloading into the Block Reference Guide at www.xilinx.com/ store instructions are sufficient to copy the FPGA fabric, or when acceleration of a bvdocs/userguides/ug018.pdf. pixel data back into the system memory. The calculation of the 2D-IDCT in the FCM starts immediately after the first load, Inverse Two-Dimensional IDCT Algorithm and the pixel data is available shortly after the last load operation. As shown in Figure Software EmulationProcessor w/Soft IDCT APU Accelerated 4, the 2D-IDCT makes uses of the new XtremeDSP™ slices in the Virtex-4 archi- Power Power PC PC Processor Block tecture that offer multiply-and-accumulate Processor Processor Local Bus Power APU Soft functionality. Controller Local Bus Soft PC IDCT IDCT FPGA A software-only implementation of a Interface 2D-IDCT takes 11 multiplies and 29 addi- Memory Memory tions together with a number of 32-bit load Memory and store operations, while the hardware- FPGA Fabric FPGA Fabric FPGA Fabric accelerated version takes 8 load and 8 store Software Only Accelerated w/Soft Core APU w/XtremeDSP Slices >200 Lines Code Multiple Load/Store Operations Single Instruction Execution operations. The reduced number of opera- Several 100 Instructions per IDCT Leverages APU and Soft Logic tions results in a speed-up of 20X in favor a 2D-IDCT FCM attached through the APU controller. Figure 5 – Comparison of implementation models for 2D-IDCT

March 2005 Embedded magazine 13 SignSign ofof thethe TimesTimes Xilinx makes high-tech outdoor advertising in Times Square possible.

by Jason Daughenbaugh Sr. Design Engineer Advanced Electronic Designs, Inc. [email protected]

New York City’s Times Square is known as the “Crossroads of the World.” Approximately 1.5 million people pass through the intersection of Broadway and 42nd Street every day, and millions more see the area daily on television broadcasts. No better place for outdoor advertising exists. As a result, dazzling signs have become a Times Square trademark. Every advertiser wants to have the best advertising medium possible, so new signs must use the latest technology. Times Square tenants rely on MultiMedia, which manufactures the majority of the spectacu- lar signs in Times Square. When MultiMedia asked our company, Advanced Electronic Designs, Inc. (AED), to design an LED sign for JPMorgan Chase™ in Times Square, we needed a huge amount of signal processing, data distribution, and interfac- ing. We also needed to design the sign very quickly. We met this challenge by utilizing the advantages of Xilinx® components. We used Virtex-II™ XC2V1000 FPGAs for video processing, and for control and distribution we chose low-cost Spartan-3™ XC3S200 FPGAs. To configure the FPGAs, we chose the Platform Flash XCF00 config- uration PROM family. And for final distri- bution of the data on the 3800 LED blocks, we used XC9572XL PLDs.

14 Embedded magazine March 2005 The Design An LED sign is like a large computer mon- itor; video data goes in and is displayed on the sign. The sign comprises red, green, and blue LEDs that turn on and off (pulse- width modulation) to generate more than four trillion colors. What made this particular design a challenge was the scale, both in terms of physical size as well as the amount of data and the transfer rates involved. The sign is 135 feet long and 26 feet tall. With nearly two million pixels, it is the highest defini- tion LED display in the world. This is ten times the resolution of the average televi- sion screen and twice the resolution of top- of-the-line HDTV sets. After considering our options, design- ing with Xilinx programmable logic was the obvious choice. The high-perform- ance, low-cost FPGAs are well suited for all three main components of this design: video processing, data distribution, and sign control.

Video Processing The video processor accepts a variety of video inputs. It captures these video streams as 36 bit RGB (12 bits per color). It then crops and places these inputs onto a master Figure 1 – The world’s highest resolution LED display is based on Xilinx devices. sign image for display. Color-space conver- sion adjusts image characteristics such as allowing great distances between the con- color temperature and balance. troller and the sign itself. Additional processing cor- We were able to use off-the-shelf switches rects for individual LED to distribute the data within the sign and put differences. We also use pro- inexpensive 10/100 Ethernet ports on the prietary image processing individual distribution boards. The avail- algorithms to operate the ability of Ethernet protocol analyzers, LEDs efficiently while main- such as the open-source project taining optimal image quality. Ethereal, allowed us to easily analyze and debug the system. Data Distribution Video data starts in a control Sign Control room and ends at the LEDs. The Figure 2 – The sign is built In advertising, time is money; thus it is cru- out of 3,800 display blocks. first step is the video processor, cial to monitor the sign at all times. The which is located in the control control system monitors temperatures room. The video processor breaks the variety of control and status functions. throughout the sign to ensure that adequate images into manageable chunks to send to Not wanting to re-invent the wheel, cooling is present. Voltages are monitored to the many modules of the sign so that each we chose Ethernet as our data distribu- detect malfunctioning power supplies. The LED displays the data for the correspon- tion medium. Our video processor has control system maintains error and resend ding pixel. More than 3 Gbps of video data multiple Gigabit Ethernet ports that counts to detect faulty data links. It also pro- alone is required to operate the LEDs. In interface to the sign. Gigabit Ethernet vides an interface to upgrade the FPGAs addition to video data, we also transfer a can be transferred over fiber-optic cable, remotely for enhancements and bug fixes.

March 2005 Embedded magazine 15 The reconfigurable nature of Xilinx FPGAs allows us to provide feature upgrades and bug fixes to the customer via e-mail...

The Benefits of Xilinx Devices • This design required a large variety of them easy to use for transferring data Xilinx devices include a large number of signaling standards. The flexible Xilinx between clock domains or sharing data. features that are ideal for our sign project: I/O blocks allowed us to connect direct- ly to a large number of different inter- • While very powerful and convenient, • The reconfigurable nature of Xilinx faces. Voltages ranged from standard the PLDs and Spartan-3 FPGAs are devices is necessary for a project like 3.3V CMOS down to 1.5V HSTL. also very inexpensive. When combined this. Without FPGAs, the only alterna- We required single-ended and differen- with the development advantages, the tive would have been an ASIC. But an tial interfaces. In some cases we could low device price makes Xilinx devices ASIC was not feasible for this project have used external driver and receiver unbeatable when developing high-per- for several reasons. parts, but that would have added com- formance embedded systems. First, this project had a very tight sched- plexity and cost to the product. PicoBlaze Processors ule. An ASIC could not have been com- Other high-speed I/O interfaces, such Device hardware capabilities are essential pleted in the time allotted. Second, the as to the DDR 333 memory, would not for any design, but development tools and volumes of the components in this sign have been possible without direct tricks are also very important. The favorite are not of sufficient volume to hide the FPGA support. The digitally controlled toy in our Xilinx bag-of-tricks is the NREs of an ASIC. Third, an ASIC lacks impedance (DCI) modes were necessary PicoBlaze™ processor. We could not have the development opportunities of an on the high-speed single-ended traces. completed the project in the time allowed FPGA. To me, as an engineer, this rea- • With the high data rates involved and without extensive use of the PicoBlaze son is the most important. No matter the many data interfaces, we had a large processor. The sign contains an impressive how much simulation you perform, number of clock domains. The quanti- count of more than 1,000 of these embed- there can always be unexpected bugs. In ty of global clock nets available and the ded processors, with nine different designs. an ASIC, these bugs are expensive; in an ability of the digital clock managers PicoBlaze processors provide efficient FPGA, they can be fixed easily. (DCMs) to synthesize clock frequencies logic resource utilization by time-multiplex- Another FPGA advantage is that it can made this easy. We also used the phase- ing logic circuits. Many functions, especial- meet future needs through feature shift ability of the DCM to adjust sam- ly control functions, do not need to be upgrades; an ASIC cannot. The recon- ple times on various interfaces. figurable nature of Xilinx FPGAs • Block RAM is my favorite resource in 10 XC2V1000 Virtex-II FPGA allows us to provide feature upgrades an FPGA. Without block RAM, there 323 XC3S200 Spartan-3 FPGA and bug fixes to the customer via e- are two memory options. The first mail, making it easy for them to apply option is the logic slices, using flip- 333 XCF00 Platform Flash PROM to the sign. Through an Ethernet inter- flops or distributed RAM, but this is 3,800 XC9572XL 72 macrocell PLD face, the FPGA reprograms the expensive and slow for anything more Platform Flash configuration PROM Table 1 – This sign includes nearly than 16- to 32-bit addresses. The sec- 4,500 Xilinx devices. and automatically reboots. ond option is external memory, such as SDRAM. SDRAM storage is gener- • Video processing requires a large num- 8 Gbps video processing ally in the range of tens to hundreds of ber of multiply operations. The video megabytes, leaving a huge size gap 18 Billion 16-bit multiply processor must perform color-space between these two memory options. operations per second conversion and apply calibration coeffi- cients in real time. It would require a Block RAM bridges this size gap. It can 16 DDR333 SDRAM banks large portion of FPGA logic resources to be used for a limitless number of 6 Gigabit Ethernet MACs build multipliers. Instead, this can be things, from FIFOs for processing done very efficiently by utilizing the engines to loadable tables for data con- 333 Fast Ethernet MACs embedded multipliers. Building versions. The flexible port-widths of >1000 PicoBlaze Processors pipelined processing structures with the block RAM allow you to use them embedded multipliers allowed us to eas- individually or in efficient combina- Table 2 – Xilinx devices achieve ily meet the processing requirements. tions. The dual-port capability makes impressive specifications.

16 Embedded magazine March 2005 especially fast, or don’t happen very often. the PicoBlaze processor is a quick and easy processor. Because it is so quick and easy One example of this would be a serial way to define many functions. to write programs for PicoBlaze proces- transfer to read a temperature sensor. For The PicoBlaze processor is also a great sors, it is very straightforward to write this application, the sensor only needs to tool for accelerating the testing and debug- programs to test the various logic circuits be read every ten seconds. It would be a ging process. The PicoBlaze program code attached to the processor. We can test each waste to have a state machine for tempera- is stored in block RAM. To make a change function individually, greatly simplifying ture sensor reading that ran once every ten to the program we only need to change the and accelerating any debugging that seconds but only took a few milliseconds block RAM contents. It is possible to do becomes necessary. to complete. The logic would be unused this without re-implementing the FPGA, A key application of the PicoBlaze 99.9% of the time. saving a lot of time. processor in this project is the Ethernet These types of functions can be effi- Our favorite method of PicoBlaze controller. As mentioned earlier, we select- ciently combined into a single PicoBlaze processor development, which is slightly ed Ethernet to distribute data throughout processor, which in the previous example unique, is to use a PC serial port and a sim- the sign. At each Ethernet connection, we have an Ethernet physical layer transceiver (PHY) device connected directly to an FPGA. We developed a very simple and tiny media access controller (MAC) mod- ule, which we use inside the FPGA to con- nect the PHY to an instantiation of the PicoBlaze processor. This Ethernet unit is small, requiring less than a quarter of the logic resources in the XC3S200 FPGA. It handles the basic Ethernet layers and protocols, including ARP (address resolution proto- col). It also supports the IP (Internet protocol) layer with ICMP (Internet control message protocol), UDP (user datagram protocol), and DHCP (dynam- ic host configuration protocol). With this Ethernet controller, we can plug an FPGA into our network and it negotiates an IP address. Then we can transfer files and data to and from it.

Conclusion Xilinx devices made the challenge of devel- Figure 3 – The video data distribution board is based on an XC3S200 FPGA. It also includes SRAM, oping the world’s highest definition LED a 10/100 Ethernet port, a status display, and numerous connections to display blocks. display achievable. These devices are a per- fect fit for a complex design because of can not only read the temperature every ten ple PC application to download the pro- their flexible nature and powerful feature seconds but perform other similar tasks in gram code into the block RAM of a set. Valuable design components such as the meantime. configured FPGA. We have developed an the PicoBlaze processor further increase The PicoBlaze processor also provides a interface board that connects to the FPGA their ease of use and thus their value. quick and easy way to develop control and has the serial port, as well as several The reconfigurable and flexible nature functions. The alternative would be to seven-segment displays to which the of the devices allowed us to ship the sign build a custom state machine for each PicoBlaze processor can write for debug- with all first-revision circuit boards, function. The PicoBlaze processor is a pro- ging. We also allow the selection of different enabling us to develop a very complex sys- grammable state machine, meaning that processors so that we can work on multiple tem in very little time. the state machine is already built; one just processors through the same interface. For more information about MultiMedia has to program it. It has an intuitive and This interface is not only useful for LED signs, visit www.multimediaLED.com. powerful instruction set and a large code- debugging PicoBlaze programs, but also For more information about the engineering space of 1,024 instructions. Programming for debugging the logic connected to the provided by AED, visit www.aedmt.com.

March 2005 Embedded magazine 17 XilinxXilinx TeamsTeams withwith OptosOptos Xilinx Design Services assists in the development of a new eye care technology.

by Barrie Mullins cal features to take their system from con- Optos Integrated Solutions Manager, Xilinx Design Services cept to reality. The system is used by a Optos was founded to develop a tech- Xilinx, Inc. qualified operator to take a scan of the nology that will provide ophthalmolo- [email protected] patient’s eye while they look at a target pre- gists and optometrists with improved sented on an LCD panel. This image cap- image capture and analysis capability for Optos PLC, a revolutionary eye care tech- ture uses Optos’ patented technology in the early detection and prevention of eye nology company, selected Xilinx Virtex-II wide-field visual imaging and transfers the disease. Optos’ patented technology in Pro™ FPGAs to be at the heart of their data to a microprocessor. The processor is wide-field visual imaging is unique. wide-field visual imaging system. To lower used for storage of patient data and pro- Their business strategy is to make their the risk, reduce the learning curve, and cessing by the operator. visual imaging system available to the meet their time-to-market requirements for The challenge for XDS was to design, greatest number of practitioners, thus this new technology, Optos engaged Xilinx develop, test, simulate, integrate, and allowing an overall improvement in eye Design Services (XDS) to help them plan, bring up the hardware and software on a care and early disease detection for a far design, implement, and deliver their com- single platform to enable a complete con- greater number of people. plex solution. nectivity solution addressing all layers of The Optos headquarters in The Optos Panoramic System compris- the Optos application. The solution also Dunfermline, United Kingdom, is the base es two IBM™ PowerPC™ 405 micro- had to be scaleable for future features and for research and development of the processors and multiple proprietary upgrades. XDS was also required to meet Panoramic diagnostic ophthalmic appara- interfaces, as well as industry standard budget constraints. tus, the technology behind Optomap interfaces. The design utilizes the latest Retinal Images. technology, tools, and software provided by The Teams Xilinx and gave XDS the opportunity to The Xilinx Design Services team worked Xilinx Design Services apply its system, hardware, and embedded with the Optos engineering team to set the Optos contracted XDS to design and deliv- software knowledge, along with its project system requirements and to agree on a sched- er the Virtex-II Pro 2VP20 design. By doing management expertise, to meet the time- ule and deliverables for the project. Over the so, they reduced the risk and learning curve to-market requirements. period of the project, the XDS project man- for this new technology, and they ensured ager worked with the Optos project manage- they would meet their time-to-market goal. The Design Challenge ment team to ensure that milestones were XDS gathered a team of experienced co- For Optos, the challenge was to find a tech- delivered and that information flowed design engineers with in-depth knowledge nology that provided them with the techni- smoothly between the two teams. of project management, embedded software

18 Embedded magazine March 2005 design, FPGA design, co-design tools, IP IIC & SPI ZBT LCD Flash Peripherals RAM cores, and platform FPGAs. The final delivery to Optos includ- ed the source code for the embedded test software, HDL designs, design scripts, HDL test vectors, simulation System Configuration and Control Block test software, build scripts, and associ- ated testbenches. The delivery specifi- Camera 1 PCI Interface to cations also included all associated Transmit the Low Resolution PCI North/South Low Res Camera project documentation: project plan, Camera 2 Camera Data Bridge Interface to and High Speed Interfaces x86 Environment design specifications, testbench strate- MGT Data Running Linux gy, software specifications, memory map, and verification plan. ZBT RAM DSP Processing Block High Speed Data Management and The Panoramic System Data Gathering Interface System Configuration Figure 1 shows the basic Panoramic sys- & Control High-Speed Data tem design, which includes: Logic FPGA Peripherals • System control processor (SCP), Host x86 System described as System Configuration Virtex-II Pro 2VP20 DDR 3 MGT and Control in the diagram legend SDRAM Peripherals • Camera and digital signal processing Figure 1 – Block diagram of Virtex-II Pro design in the Panoramic system (DSP) components, described as DSP and Camera Logic • High-speed multi-gigabit serial GPIO SPI IIC UART DCR 16550 Interrupt Core Core Core Core Core transceiver (MGT) data and PCI, described as High-Speed Data Logic. OPB OPB Arbiter PowerPC Each section interacts with the others, DCR PLB2OPB OLB2PPB (Config.) but is not 100 percent interdependent. Bridge Bridge In Figure 2 you can see a more PLB PLB Arbiter detailed diagram of the system that XDS designed. The outer color blocks in EMC EMC IPIF Core PLB (Proprietary Core Core Display Figure 2 show the division of the logic Arbiter (Flash) (ZBT) Interface) in the system. The interface between the two main blocks shown in Figure 2 was

256 Control & Status through the use of device control regis- R Registers R ters (DCRs) accessed and controlled by e e Proprietary Low g g PowerPC Resolution both the SCP and associated hardware. Interface for 2 (DSP) Cameras OPB System Configuration & Control Data Sniffer PLB2OPB High-Speed Data Logic PLB Bridge Power PC 405 OPB2PCI PLB Core Processor Local Bus (PLB)

On-Chip Peripheral Bus (OPB) EMC Core Device Control Registers (DCR) (ZBT) DDR Proprietary High Speed MGT SDRAM Interface with FSM Core & Control Logic DCR Registers IP Cores MGT 1 MGT 2 MGT 3 Multi-Gigabit Tranceivers (MGT)

Proprietary Logic

Data Flow

Figure 2 – Detailed view of Virtex-II Pro design in the Panoramic system

March 2005 Embedded magazine 19 System Control Processor a larger data packet, which is then stored in PowerPC instructions and act accordingly. The main function of the SCP is system DDR SDRAM for transmission over PCI. Two software applications were used to configuration management, coordination, The PCI section is used to transfer all test the system in simulation. The first ran status reporting, and control of the timing the data stored in the DDR SDRAM to on the testbench and the second ran on the of events based on inputs. It starts image the x86 host processor environment. system under test (SUT). capture, and controls LCD display and XDS designed a number of proprietary The function of the SUT application is other logic through registers in the system. hardware blocks to control the flow of to interact with the peripheral cores and The main technical blocks of the SCP data from DDR SDRAM to the OPB- to drive the inputs and outputs of the perform the following functions: PCI core and over to the host x86 mem- SUT. The application in the testbench • Communicate with the Intel™ x86 ory space. Configuration and status of the interacts with the SUT in the following operating system environment over PCI core is carried out by the SCP manner: PCI and RS-232 connections through DCR registers. • Data received from the SUT is stored in memory. • Configure the LCD panel and camera Tools • Protocol data is decoded, stored, and interfaces via IIC During this project, XDS used several soft- replied to, if appropriate. • Display images on the LCD panel ware tools to create the Panoramic system, to simulate the design, and to manage the The application for the SUT allows • Communicate with peripheral subsys- hardware design engi- tems via SPI neers to control many Function Tool Used Vendor • Boot from internal Virtex-II Pro block different features and RAM and load application from exter- VHDL Design, Compile, Test ISE 5.2i Xilinx to run a suite of tests in nal flash memory to ZBT RAM before any order required. The running it Software Design, Compile, Test EDK 3.2 Xilinx XDS team also designed tests to verify Logic and Software Simulation MTI 5.6d Mentor Graphics • Schedule and control events in the the pure hardware application software via interrupts and Version Management CVS Open Source interfaces without GPIO interaction with the • Use DCR registers to gather status and Table 1 – Software used to develop, simulate, and manage applications. control operations in other sections. the development environment of the Panoramic system All functionality was tested at unit, function- DSP and Camera Logic al, and system level. The The camera interfaces are required to sup- development environment to ensure trace- tests were all documented in the test and ver- port 30 frames per second (fps) of 640 x ability and quality releases. The tools used ification plan, which was approved by Optos 480 pixels, requiring bandwidths greater are listed in Table 1. before the system design phase began. than 9 Mbps for each camera. The data is stored locally in ZBT RAM so that a DSP Test and Verification Conclusion algorithm running on the second When designing a project of this size, test With Xilinx Design Services, you get a PowerPC 405 microprocessor can be and verification are of paramount impor- highly skilled and experienced team that applied to the data. While the data is being tance to the successful completion of the will help get you to market faster by bridg- received from the cameras, it is also copied project. The XDS team focused a large ing the learning curve, delivering quality to DDR SDRAM for transmission over number of resources to ensure the design on-time designs, and working in close the PCI to the host x86 memory space. met Optos’ requirements. contact with your development team. The team developed a full testbench that In a letter to the engineers at Xilinx High-Speed Data Logic would fit around the design (shown in Figure Design Services, David Cairns, technical RocketIO™ MGTs are used to receive 2). This testbench was designed to simulate director at Optos, wrote, “I have been high-speed data from three peripheral every interface in the design. It consisted of a impressed with Xilinx Design Services. boards, each transmitting data from a single PowerPC microprocessor with peripheral Their commitment to quality and high MGT on a Virtex-II Pro 2VP7. The MGT cores to test protocols and communication standards has eased transition into pro- data transfers are initiated by an external interfaces. Both the design and the testbench duction. XDS went the extra mile, and we system triggered by the operator. This event used the swift model for the PowerPC micro- could not have achieved our design goals is synchronized using interrupts to the SCP. processor to run the software during simula- without their help.” The data from the three MGTs is formatted tion. A swift model is a cycle-accurate For more information about Xilinx in the Virtex-II Pro device and used to form simulation model that can interpret Design Services, visit www.xilinx.com/xds.

20 Embedded magazine March 2005 Coming Soon to a location near you!

April through June

Learn how the latest Xilinx technology can help you design cost effective solutions faster. Gain hands-on experience to speed up your next development cycle.

What Are You Waiting For? Register now for the event nearest you. Visit www.memec.com/xfest-2005

ASIA CANADA EUROPE JAPAN UNITED STATES

Copyright 2004 Memec, LLC. All rights reserved. Logos are owned by their proprietors and used by Memec with permission. All company and product names may be trademarks of their respective companies. (MG0084-04) 12.20.04 What Customers and Partners are Saying about Xilinx Embedded Processor Solutions Real breakthroughs. Real stories.

PowerPC “Leveraging a high degree of IP design re-use MicroBlaze “We designed our latest Nova identity4 developed with our first Virtex-II Pro-based “The MicroBlaze™ solution has proven to broadcast production video switcher with system, we can rapidly migrate our high-per- be far more than a companion micro- what we consider a truly disruptive tech- formance storage networking architecture to processor to our FPGA compute engines. It nology utilizing only a single Virtex™-II the Virtex-4 FX family of devices. By utiliz- has become quite an integral part in shap- Pro FPGA. This approach reduced our part ing many of the FX features, we are able to ing the way we are solving problems and is count 10-fold over prior systems while pro- increase both functionality and performance helping reduce our engineering costs. viding a per-channel cost at a fraction of while reducing system cost. “Currently, we use the MicroBlaze proces- our competition. “Key elements include the enhanced sor in our 68-billion-color LED display sys- “We are very excited about the new Virtex- PowerPC™ processor running Linux OS to tem to manage calibration and setup 4 FX family of devices. The higher per- manage and control our high throughput routines for the video processor module. formance processor and integrated APU multi-channel switching matrix and the The ability to easily modify these functions will allow us to rapidly create hardware integrated dual tri-mode Ethernet MAC and without completely changing the design modules for accelerating specific software RocketIO transceivers for high-speed data allows us to offer a cutting-edge product.” functions. By leveraging these capabilities transfer. The APU controller can also with the integrated EMAC cores and offload some tasks via custom-defined Xilinx Customer enhanced RocketIO™ transceivers avail- instructions, resulting in further computing Ricardo Ramos, President and CEO, able in Virtex-4 FX devices, our next-gener- and performance headroom.” Interativa Paineis Eletronicos ation systems will further widen the Xilinx Customer distance between us and our competition.” Sandy Helton, Chief Technology Officer, Xilinx Customer SAN Valley Systems Roger Smith, Chief Engineer, Echolab

22 Embedded magazine March 2005 “Incorporating the MicroBlaze processor PicoBlaze “The PicoBlaze processor made our transi- inside a single FPGA allowed us to use a tion towards embedding a software-based “The favorite toy in our Xilinx bag-of- low pin-count device and reduce PCB application into our well-known FPGA envi- tricks is the PicoBlaze™ processor. We complexity and cost. But the most signifi- ronment seamless and surprisingly simple.” could not have completed the sign project cant impact that the MicroBlaze processor in the time allowed without extensive use Xilinx Customer has offered is the ability to almost com- of the PicoBlaze processor. Moti Cohen, Hardware Design Engineer, pletely change a design without needing TeraSync to change the physical design whatsoever. “The sign contains an impressive count of As requirements change, we no longer more than 1,000 of these embedded need to remap functionality into different processors, with nine different designs. peripherals, as is often the case when mov- PicoBlaze processors provide efficient logic Xilinx Platform Studio ing to a larger microcontroller. We can resource utilization by time-multiplexing add, delete, or move peripherals; the logic circuits. The PicoBlaze processor also Regarding Xilinx Platform Studio winning Embedded Development Kit (EDK) provides a quick and easy way to develop the DesignVision Award: makes this effortless.” control functions. The alternative would be “It is inspiring to see Platform Studio rec- to build a custom state machine for each Xilinx Partner ognized in this way. The platform function. The PicoBlaze processor is a pro- Erik Widding, President, approach to next-generation embedded grammable state machine, meaning that Birger Engineering systems will become the only viable solu- the state machine is already built; one just tion to meet the economic and technolog- has to program it.” ical challenges of the future. Xilinx Xilinx Partner continues to prove its undeniable leader- ship in developing world-class technolo- “The storage module was a nice applica- Jason Daughenbaugh, Sr. Design Engineer, gies that make FPGAs a compelling system tion for the MicroBlaze embedded proces- Advanced Electronic Designs (AED) platform for an expanding range of sor. We wrote routines in C for the embedded processing applications.” MicroBlaze core for asynchronous transfer, including the packet handler and interpre- Jerry Worchel, Principal Analyst, tation. The routines that needed accelera- “We do four channels of PID + feedforward In-Stat/MDR tion were re-coded in RTL and moved to control + profile generation + host inter- hardware implementations on the same face, all with 32-bit position resolution and Virtex-II device. The storage elements 32-bit breakpoints at a sample rate of 10 needed to take data streams in, process KHz, with all four axes in motion. Not bad Regarding Xilinx Platform Studio 6.3i them, and put the data out to disk. We for an 8-bit microcontroller!” product release: could accomplish that 100% with the Xilinx Customer “The platform approach will not only MicroBlaze core, just not at speed. Peter Wallace, Partner, Mesa Electronics find a permanent home in the FPGA “The MicroBlaze processor allowed us to world, but in the ASIC and ASSP worlds update the program without reprogram- as well. With this announcement, Xilinx ming the device, drop in a test program in continues to prove its undeniable leader- C, set breakpoints, and functionally debug “We needed the control that a CPU pro- ship in this market.” by examining memory, variables, and regis- vides without the overhead of extra periph- Max Baron, Principal Analyst, ters. This helped us debug the functionali- erals and external hardware. The PicoBlaze In-Stat/MDR ty very quickly. We also were able to do processor was easy to implement and performance analysis and profiling that worked very well in our custom protocol helped us decide what modules needed application.” acceleration.” Xilinx Customer Xilinx Partner Scott Orangio, Senior Hardware Engineer, Derek Stark, Senior Software Engineer, Polycom Inc. Nallatech

March 2005 Embedded magazine 23 ConsiderationsConsiderations forfor High-BandwidthHigh-Bandwidth TCP/IPTCP/IP PowerPCPowerPC ApplicationsApplications The Xilinx Gigabit System Reference Design maximizes TCP/IP performance.

by Chris Borrelli Embedded Networking Manager Xilinx, Inc. [email protected]

The TCP/IP protocol suite is the de facto worldwide standard for communications over the Internet and almost all intranets. Interconnecting embedded devices is becoming standard practice even in device classes that were previously stand-alone entities. By its very definition, an embedded archi- tecture has constrained resources, which is often at odds with rising application require- ments. Achieving wire-speed TCP/IP per- formance continues to be a significant engineering challenge, even for high-pow- ered Intel™ Pentium™-class PCs. In this article, we’ll discusses the per-byte and per-packet overheads limiting TCP/IP performance and present the techniques uti- lized in the Xilinx Gigabit System Reference Design (GSRD) to maximize TCP/IP over Gigabit Ethernet performance in embedded PowerPC™-based applications.

24 Embedded magazine March 2005 GSRD Overview DDR SDRAM The GSRD terminates IP-based transport pro- tocols such as TCP or UDP. It incorporates the embedded PowerPC and RocketIO™ blocks FPGA Fabric Multi Port Memory Controller of the Virtex-II Pro™ device family, and is Port 0 Port 1 Port 2 Port 3 delivered as an Embedded Development Kit Communication (EDK) reference system. PLB Interface PLB Interface DMA Controller Port Port The reference system as described in RX0 TX0 RX1 TX1 Xilinx Application Note XAPP536 lever- ISPLB ages a multi-port DDR SDRAM memory PPC405 controller to allocate memory bandwidth DSPLB RX TX RX TX Checksum Checksum between the PowerPC processor local bus Offload Offload

(PLB) interfaces and two data ports. Each User Defined LocalLink data port is attached to a direct memory DCR Bus LocalLink Peripheral GMAC Peripheral access (DMA) controller, allowing hard- DCR20PB DCR20PB ware peripherals high-bandwidth access to memory. Dual GPIO UART Lite A MontaVista™ Linux™ port is available LEDs & XCVR User Defined Fiber Optics for applications requiring an embedded oper- Interface Pushbuttons ating system, while a commercial standalone DB9 *Can be customizable for other applications TCP/IP stack from Treck™ is also available to satisfy applications with the highest bandwidth Figure 1 – GSRD system block diagram requirements.

System Architecture Memory bandwidth is an important consid- CSUM Insert eration for high-performance network- CSUM CSUM attached applications. Typically, external GEN FIFO DDR memory is shared between the proces- sor and one or more high-bandwidth periph- erals such as Gigabit Ethernet. TX_DATA[31:0] TX TX_REM[3:0] LocalLink Data FIFO The four-port multi-port memory con- TX_SOF_N troller (MPMC) efficiently divides the avail- TX_SOP_N TX_EOP_N TX Peripheral TX_EOF_N able memory bandwidth between the TXP TX_SRC_RDY_N TXN PowerPC’s instruction/data PLB interfaces TX_DST_RDY_N and a communications direct memory access controller (CDMAC). The CDMAC RX_DATA[31:0] Xilinx LogiCORE RX_REM[3:0] 1-Gigabit provides two bi-directional channels of RX_SOF_N Ethernet MAC DMA that connect to peripherals through a RX_SOP_N RX_EOP_N RX Peripheral Xilinx standard LocalLink streaming inter- RX_EOF_N RXP RX_SRC_RDY_N face. The CDMAC implements data re- RXN RX_DST_RDY_N alignment to support arbitrary alignment of packet buffers in memory. A block diagram DCR_READ DCR_WRITE of the system is shown in Figure 1. CSUM CSUM DCR_WR_DBUS[0:31] R fa FIFO GEN The LocalLink Gigabit Ethernet MAC DCR_ABUS[0:9] (LLGMAC) peripheral incorporates the

UNH-tested Xilinx LogiCORE™ 1-Gigabit DCR_ACK Peripheral DCR_RD_DBUS[0:31] Ethernet MAC to provide a 1 Gbps 1000- Registers RX FIFO BASE-X Ethernet interface to the reference RX Client Interface system. The LLGMAC implements checksum offload on both the transmit and receive paths for optimal TCP performance. Figure 2 is a Figure 2 – LocalLink Gigabit Ethernet MAC peripheral block diagram simplified block diagram of the peripheral.

March 2005 Embedded magazine 25 TCP/IP Per-Byte Overhead Checksum offload is a feature of the Conclusion Per-byte overhead occurs when the processor LocalLink Gigabit Ethernet (LLGMAC) The Xilinx GSRD is an EDK-based refer- touches payload data. The two most com- peripheral. It allows the TCP payload check- ence system geared toward high-performance mon operations of this type are buffer copies sum to be calculated in FPGA fabric as bridging between TCP/IP-based protocols and TCP checksum calculation. Buffer Ethernet frames are transferred between and user data interfaces like high-resolution copies represent a significant overhead for main memory and the peripheral’s hardware image capture or Fibre Channel. The com- two reasons: FIFOs. GSRD removes the need for costly ponents of GSRD contain features to address buffer copies and processor checksum opera- the per-byte and per-packet overheads of a 1. Most of the copies are unnecessary. tions, leaving the PowerPC 405 to process TCP/IP system. 2. The processor is not an efficient data only protocol headers. Table 1 details the GSRD TCP transmit mover. performance with varying levels of optimiza- TCP/IP Per-Packet Overhead tion for Linux and standalone Treck stacks. TCP checksum calculation is also expen- Per-packet overhead is associated with opera- Future releases of GSRD will explore fur- sive, as it is calculated over each payload tions surrounding the transmission or recep- ther opportunities for TCP acceleration data byte. tion of packets. Packet interrupts, hardware using the FPGA fabric to offload functions Embedded TCP/IP-enabled applications interfacing, and header processing are exam- such as TCP segmentation. such as medical imaging require near wire- ples of per-packet overheads. The GSRD Verilog™ source code is speed TCP bandwidth to reliably transfer Interrupt overhead represents a consider- available as part of Xilinx Application image data over a Gigabit Ethernet network. able burden on the processor and memory Note XAPP536. It leverages the MPMC The data is generated from a high-resolution subsystem, especially when small packets are and CDMAC detailed in Xilinx image source, not the processor. transferred. Interrupt moderation (coalesc- Application Note XAPP535 to allocate In this case, introducing a zero-copy soft- ing) is a technique used in GSRD to alleviate memory bandwidth between the processor ware API and offloading the checksum cal- some of this pressure by amortizing the inter- and the LocalLink Gigabit Ethernet MAC culation into FPGA fabric completely rupt overhead across multiple packets. The peripheral. The MPMC and CDMAC can removes the per-byte overheads. “Zero-copy” DMA engine waits until there are n frames be leveraged for PowerPC-based embedded is a term that describes a TCP software inter- to process before interrupting the processor, applications where high-bandwidth access face where no buffer copies occur. Linux and where n is a software-tunable value. to DDR SDRAM memory is required. other operating systems have introduced Transferring larger sized packets (jumbo For more information about XAPP536 software interfaces like sendfile() that serve frames of 9,000 bytes) has a similar effect and XAPP535, visit www.xilinx.com/gsrd/. this purpose, and commercial standalone by reducing the number of frames trans- TCP/IP stack vendors like Treck offer similar mitted, and therefore the number of inter- zero-copy features. These software features rupts generated. This amortizes the Associated Links: allow the removal of buffer copies between per-packet overhead over a larger data pay- Xilinx XAPP536, “Gigabit System the user application and the TCP/IP stack or load. GSRD supports the use of Ethernet Reference Design” operating system. jumbo frames. http://direct.xilinx.com/bvdocs/ The data re-alignment and the checksum The components of GSRD use the appnotes/xapp536.pdf offload features of GSRD provide the hard- device control register (DCR) bus for con- ware support necessary for zero-copy func- trol and status. This provides a clean inter- Xilinx XAPP535, “High Performance tionality. The data re-alignment feature is a face to software without interfering with the Multi Port Memory Controller” flexibility of the CDMAC that allows soft- high-bandwidth data ports. The per-packet http://direct.xilinx.com/bvdocs/ ware buffers to be located at any byte offset. features of GSRD help make efficient use of appnotes/xapp535.pdf This removes the need for the processor to the processor and improve system-level copy unaligned buffers. TCP/IP performance. Treck, Inc. (www.treck.com) MontaVista Software (www.mvista.com) TCP/IP Stack Ethernet Frame Size Optimization TCP Transmit Bandwidth “End-System Optimizations for High- MontaVista Linux 9000 bytes (jumbo) None 270 Mbps Speed TCP” (www.cs.duke.edu/ari/ MontaVista Linux 9000 bytes (jumbo) Zero-copy, checksum offload 540 Mbps publications/end-system.pdf) Treck, Inc 9000 bytes (jumbo) Zero-copy 490 Mbps “Use sendfile to optimize data transfer” Treck, Inc 9000 bytes (jumbo) Zero-copy, checksum offload 780 Mbps (http://builder.com.com/ 5100-6372-1044112.html) Table 1 – TCP transmit benchmark results

26 Embedded magazine March 2005 by Gordon Cameron Business Development Manager Accelerated Technology, Mentor Graphics Configure and Build the [email protected]

Accelerated Technology’s Nucleus™ PLUS real-time operating system Embedded Nucleus PLUS (RTOS) is already available for both the Xilinx® MicroBlaze™ 32-bit soft processor core and the IBM™ PowerPC™ 405 core integrated into RTOS Using Xilinx EDK Virtex-II Pro™ devices. This determin- istic, fast, small footprint RTOS is ideal for “hard” real-time applications. Nucleus PLUS RTOS for MicroBlaze and PowerPC 405 processors With the release of the Xilinx Platform Studio EDK 6.3i, configura- isis nownow automaticallyautomatically configurableconfigurable usingusing XPSXPS MLDMLD technology.technology. tion of this leading royalty-free RTOS on your newly designed system is as easy as selecting from a pull-down menu. Instead of spending hours mod- ifying your target software to work with your new hardware configuration, you can configure the target software auto- matically in minutes, without the error- prone possibilities of configuring by hand. This is especially valuable during the earlier design phases when the hard- ware may be changing frequently. This process was enabled by one of the underlying technologies of Xilinx Platform Studio EDK, called micro- processor library definition, or MLD.

MLD Technology The Xilinx Platform Studio EDK devel- opment system is based on a data-driven code base that makes it extensible and open. MLD is one example of this underlying capability. It was created specifically to allow you to easily create and modify kernel configurations and associated board support packages (BSPs) for partner-supported RTOSs like Nucleus PLUS and its extensive middleware offering. MLD has two required file types: the data definition file (.MLD) and data generation file (.Tcl). The .MLD con- tains the Nucleus user-customization parameters, while the .Tcl file is a Tcl script that defines a set of Nucleus- specific procedures for building the final software system (see Figure 1).

March 2005 Embedded magazine 27 Accelerated Technology supplies these Once you have installed these, you files and the associated installer as an eval- can use Xilinx Platform Studio EDK OS uation disk, included in the latest release of with Nucleus now visible in the RTOS HW Design Xilinx EDK 6.3i. Accelerated Technology pull-down selection menu. See Figure 3a Selection has also established a website to support for the PPC405 and Figure 3b for the and distribute this evaluation. This site MicroBlaze processor. contains updates, evaluations, reference MLD XPS designs, and documentation for all of the Evaluating Nucleus PLUS in EDK Files Accelerated Technology Xilinx offerings The Accelerated Technology Nucleus and will be updated regularly with new PLUS evaluation software provided in middleware implementations that you can the EDK Platform Studio 6.3i shipment HW RTOS add to the automatic configuration of your includes a limited version (LV) of application. The website is located at Nucleus PLUS. This is a fully function- Netlist BSP www.acceleratedtechnology.com/xilinx/. al version of the RTOS compiled into a To get up and running quickly with library format (rather than the normal your first Nucleus-based system, the instal- source code distribution) with the single To To lation also includes a sample pre-built ref- restriction that it will stop working after ISE RTOS erence design with a compiled Nucleus 60 minutes, facilitating evaluation of its PLUS demonstration. The pre-built refer- full functionality. When you purchase a ence designs currently support the full license of Nucleus PLUS from Figure 1 – Outline of an MLD-enabled system design Memec™ design-based DS-KIT- Accelerated Technology, you receive the 2VP7FG456 and DS-KIT-V2MB1000 full source code and, obviously, the 60- FPGAs. This is the fastest method to minute run time restriction is lifted. Installing MLD files in XPS employ for a sample Nucleus-based, MLD- The LV version of Nucleus PLUS is The installation CD of Nucleus PLUS enabled Xilinx system. configured to execute from the off-chip installs the RTOS, associated drivers, and Use of Xilinx’s Base System Builder is SRAM or SDRAM module. Once you the two Nucleus-configured MLD files also well documented inside the applica- have a full license to the RTOS, you can that enable you to use MLD technology tion notes accompanying the installation. configure it to run from any memory in within the Xilinx Platform Studio EDK. With the Base System Builder, you can your system. The default install path for the MLD files build a variety of system core configura- Nucleus PLUS is a scalable RTOS – is the \nucleus\bsp sub-directory, located in tions to work with the Nucleus PLUS only the software you use in your design \edk_user_repository. RTOS (see Figure 2). is included in the downloaded code. If you have received your EDK 6.3i This may be contrasted with other larg- update recently or have purchased a seat, er, more static systems, which consume please check the contents for this evalua- far more system resources. In some cir- tion. After running the Nucleus PLUS cumstances, the whole RTOS and appli- evaluation disk installer, the necessary files cation can fit in the on-chip memory, will be placed into the Xilinx EDK 6.3i thus achieving high performance and and the support of Nucleus PLUS will be low power consumption. Even with larg- automatically added. er applications, which may utilize exten- The elements of Nucleus PLUS modi- sive middleware, the efficient use of the fied by the data generation file (.Tcl) for relatively small amount of on-chip mem- specific hardware configuration are: ory means that the size of the kernel footprint is an important consideration. • The number and type of devices used You can configure other components by the hardware designer of the Nucleus system by hand to work • Memory map information in this environment, such as network- ing, web server, graphics, file manage- • Locations of memory-mapped device ment, USB, WiFi, and CAN bus. registers Future releases of Nucleus will move • Timer configuration these products into full integration with Figure 2 – The hardware design is complete Xilinx Platform Studio EDK and MLD and ready to configure the software. • Interrupt controller configuration technology.

28 Embedded magazine March 2005 Nucleus PLUS and Xilinx Devices As we have said, the process of creating a working BSP for Nucleus PLUS begins with configuring the hardware platform using Xilinx Base System Builder or the supplied sample reference designs. Then you can go to the Project > Software Platform Settings menu item and select the operating system you want to use from the list. Choosing the Nucleus option (Figure 3 a/b) will make available the spe- cific software settings for the RTOS under Figure 3a – After installing Nucleus PLUS Figure 3b – Nucleus appears as an option each of the tabs on the Software Platform in EDK, Nucleus appears as an option in the in the drop-down menu choosing which drop-down menu choosing which operating operating system to use with the MicroBlaze Settings menu (Figure 4 shows the user system to use with the PowerPC 405 processor. soft-core processor. enabling the cache on the PowerPC). Note that in the LV version most of the software options are disabled, but can be changed in the full version. • If code footprint or performance is Once you are satisfied with the soft- important, then consider the highly ware settings, you can use the Generate optimizing Microtec compiler for Netlist and Generate Bitstream com- PowerPC Virtex-II Pro devices. This mands and download the hardware con- ensures that the code that is shipped is figuration onto the FPGA using Xilinx the same as the code that is debugged – XPS or ISE tools. a goal not achieved by many compilers. You can now execute the Tools > Generate Libraries and BSP commands • Application debugging often needs to configure Nucleus PLUS. The appli- RTOS awareness, advanced break- cation software can be linked with the points, and debugging of fully opti- Figure 4 – Enabling the cache in the RTOS RTOS. Now you are ready to switch over mized code. These features are available configuration parameters to the GDB debugger and download the on PowerPC Virtex-II Pro devices with combined RTOS and application image the industry-standard XRAY debugger. These tools, when combined with the to the FPGA. Nucleus PLUS RTOS, are ideal for helping • To bring software development forward you maximize the functionality and efficien- in time so that it can be started before Advanced Software Tools cy of your designs. the hardware is complete, software Up to this point, we have bypassed many teams can use our advanced prototyp- aspects of application software design, Conclusion ing products Nucleus SIM or Nucleus assuming that you have code ready to The latest EDK-configurable Nucleus SIMdx. These tools allow the develop- compile and link and download to the PLUS RTOS brings a new dimension to ment of the complete application soft- FPGA. In fact, as systems become ever systems incorporating high-performance ware in a host-based environment. more complex, both hardware and soft- embedded processors from Xilinx. Its ware designers require advanced state-of- • UML enables software teams to raise small size means that it can use available the-art tools to help them complete their their level of abstraction and produce on-chip memory to minimize power dissi- projects within budget and on time. models of their software. Nucleus pation and deliver increased performance, The Xilinx EDK-configurable version BridgePoint enables full code genera- while its wealth of middleware makes it of Nucleus PLUS uses the standard GNU tion by using the xtUML subset of ideal for products targeted at the network- suite of tools supplied with the Xilinx UML 2.0. ing, telecommunications, data, communi- EDK package. This is more than adequate cation, and consumer markets. for many projects for getting systems up • You can verify software/hardware inter- Making this solution easy to configure and running, but advanced application action in the Mentor Graphics® within Xilinx EDK allows you to development often needs more. Seamless® co-verification environment, easily exploit the benefits of this powerful Accelerated Technology can provide a which allows combined hardware and product. For more information, visit complete range of tools that encompass all software simulation for PowerPC Virtex- www.acceleratedtechnology.com or www. phases of the software design process. II Pro devices. mentor.com.

March 2005 Embedded magazine 29 MicroBlazeMicroBlaze andand PowerPCPowerPC CoresCores asas HardwareHardware TestTest GeneratorsGenerators Combining FPGA embedded processors with C-to-RTL compilation can accelerate the testing of complex hardware modules.

by David Pellerin – to create high-performance, mixed soft- real-world conditions of the entire applica- CTO ware/hardware test benches that closely tion), a unit test allows you to focus on poten- Impulse Accelerated Technologies model real-world conditions. tial trouble spots for a given component, such [email protected] Key to this approach are high-performance as boundary conditions and corner cases, that standardized interfaces between test software might be difficult or impossible to test from a Milan Saini (C-language test benches) running on the system-level perspective. Such unit testing Technical Marketing Manager embedded processor and other components improves the quality and robustness of the Xilinx, Inc. (including the hardware under test) imple- application as a whole. [email protected] mented in the FPGA fabric. These interfaces take advantage of communication channels Unit Testing Regardless of whether you are using a available in the target platform. A comprehensive hardware/software testing processor core in your FPGA design, For example, the MicroBlaze soft-core strategy includes many types of tests, includ- using a Xilinx® MicroBlaze™ or IBM™ processor has access to a high-speed serial ing the previously-described unit tests, for all PowerPC™ embedded processor can interface called the Fast Simplex Link, or FSL. critical modules in an application. accelerate unit testing and debugging of The FSL is an on-chip interconnect feature Traditionally, system designers and FPGA many types of FPGA-based application that provides a high-performance data chan- application developers have used HDL simu- components. nel between the MicroBlaze processor and the lators for this purpose. C code running on an embedded proces- surrounding FPGA fabric. Using simulators, the FPGA designer cre- sor can act as an in-system software/hardware Similarly, the PowerPC hard processor ates test benches that will exercise specific test bench, providing test inputs to the core, as implemented in Virtex-II Pro™ modules by providing stimulus (test vectors FPGA, validating the results, and obtaining and Virtex-4™ FPGAs, provides high-per- or their equivalents) and verifying the result- performance numbers. In this role, the formance communication channels through ing outputs. For algorithms that process large embedded processor acts as a vehicle for in- the processor local bus (PLB) and on-chip quantities of data, such testing methods can system FPGA verification and as a comple- memory (OCM) interfaces, as illustrated in result in very long simulation times, or may ment to hardware simulation. Figure 1. not adequately emulate real-world condi- By extending this approach to include Using these Xilinx-provided interfaces to tions. Adding an in-system prototype test not only C compilation to the embedded define an in-system unit test allows you to environment bolsters simulation-based verifi- processor but C-to-hardware compilation quickly verify critical components of a larger cation and inserts more complex real-world as well, it is possible – with minimal effort application. Unlike system tests (which model testing scenarios.

30 Embedded magazine March 2005 Unit testing is most effective when it CoDeveloper generates FPGA hard- hardware modules – to be described and focuses on unexpected or boundary condi- ware from the C-language software interconnected using buffered communica- tions that might be difficult to generate processes and automatically generates tion channels called streams. when testing at the system level. For exam- software-to-hardware and hardware-to- Impulse C also supports communica- ple, in an image processing application that software interfaces. You can optimize tion through signals and shared memories, performs multiple convolutions in these generated interfaces for the which are useful for testing hardware sequence, you may want to focus your MicroBlaze processor and its FSL inter- processes that must access external or static efforts on one specific filter by testing pixel face or the PowerPC and its PLB inter- data such as coefficients. combinations that are outside the scope of face. Other approaches to data what the filter would normally encounter in movement, including shared memories, Data Throughput and Processor Selection a typical image. are also supported. When evaluating processors for in-system testing, you must first consider the fact that the MicroBlaze processor or any other soft Impulse C Application processor requires a certain amount of area H/W in the target FPGA device. If you are only S/W Process H/W Process Process using the MicroBlaze processor as a test H/W generator for a relatively small element of Process H/W your complete application, this added Process resource usage may be of no concern. If, FPGA however, the unit under test already pushes Hardware OPB, FSL or Other Interface the limits in the FPGA, you may want to MBlaze, PPC Resources or Other target a bigger device during the testing Processor phase or consider the PowerPC core pro- H/W H/W Process Process vided in the Virtex-II Pro and Virtex-4 S/W Process platforms as an alternative. H/W H/W Process Process Synthesis time can also be a factor. Depending on the synthesis tool you use, Virtex or Spartan FPGA adding a MicroBlaze core to your complete Figure 1 – Hardware and software test components are mapped to the FPGA target and application may add substantially to the communicate across the OPB or FSL bus to the MicroBlaze or Power PC processor. time required to synthesize and map the application to the FPGA, which can be a It may be impossible to test all permu- Desktop Simulation and Modeling Using C factor if you are performing iterative com- tations from the system perspective, so the Using C language for hardware unit testing pile, test, and debug operations. unit test lets you build a suite to test spe- lets you create software/hardware models Again, the PowerPC core, being a hard cific areas of interest or test only the (for the purpose of algorithm debugging) core that does not require synthesis, has an boundary/corner cases. Performing these in software, using Microsoft™ Visual advantage over the MicroBlaze core when tests with actual hardware (which may for Studio™, GCC/GBD, or similar C devel- design iteration times are a concern. The testing purposes be running at slower than opment and debugging environments. For 16 KB of data cache and 16 KB of instruc- usual clock rates) obtains real, quantifiable the purpose of desktop simulation, the tions cache available in the PowerPC 405 performance numbers for specific applica- complete application – the unit under test, processor also makes it possible to run tion components. the producer and consumer test functions, small test programs entirely within cache Introducing C-to-RTL compilation into and any other needed test bench elements memory, thereby increasing the perform- the testing strategy can be an effective way to – is described using C, compiled under a ance of the test application. increase testing productivity. For example, standard desktop compiler, and executed. If a high test data rate (the throughput to quickly generate mixed software/hard- Although you can do this using from the processor to the FPGA) is your ware test routines that run on the both the SystemC, the complexity of SystemC primary concern, using the MicroBlaze embedded processor and in dedicated hard- libraries (in particular their support for data- core with the FSL bus or the PowerPC with ware, you can use tools such as CoDeveloper flow abstractions through channels) makes its on-chip-memory (OCM) interface will (available from Impulse Accelerated the process of creating such test benches provide the highest possible performance Technologies) to create prototype hardware somewhat complex. CoDeveloper’s Impulse for streaming data between software and and custom test generation hardware that C libraries take a simpler approach, provid- hardware components. operates within the FPGA to generate sam- ing a set of functions that allow multiple C By using CoDeveloper and the Impulse ple inputs and validate test outputs. processes – representing parallel software or C libraries, you can make use of multiple

March 2005 Embedded magazine 31 pipelined processes, for example) makes it If tests require very specific timing, possible to generate hardware directly using an embedded processor to create from C language at orders of magnitude test data will most likely result in data faster than the equivalent algorithm as rates that are only a fraction of what is implemented in software on the embed- needed to obtain timing closure. In fact, if ded microprocessor. This creates hard- the test routine is implemented as a state ware test generators that generate outputs machine on the processor, the speed at at a high rate. which the state machine can be made to Figure 2 – A software-based unit test operate will be slower than the clock fre- operating on the embedded processor com- Does C-Based Testing Eliminate quency of the test logic in hardware. municates with the hardware unit under the Need for HDL Simulators? Hence, for most cases, the hardware por- test through data streams. C-based test methods such as those tion would need to slow down so the CPU described in this article are a useful addi- can keep pace – providing test stimulus streaming software/hardware interfaces tion to a designer’s bag of tricks, but they and measuring expected responses. using a common set of stream read and are certainly not replacements for a com- Alternatively, you can create a buffered write functions. These stream read and prehensive hardware simulation. HDL interface – a software-to-hardware bridge write functions provide an abstract pro- simulation can be an effective way to deter- – to manage the test data using a stream- gramming model for streaming communi- mine cycle counts and explore hardware ing programming model. cations. Figure 2 shows how the Impulse C interface issues. HDL simulators can also Given the performance differences library functions support streams-based help alleviate the typically long between a processor-based test bench and communication on the software side of a compile/synthesize/map times required the potential performance of an all-hard- typical streaming interface. before testing a given hardware module in- ware system, it should be clear that soft- system. Hardware simulators provide much ware-based testing of such applications Moving Test Generators to Hardware more visibility into a design under test, and cannot replace true hardware simulation, in To maximize the performance of test gen- allow single-stepping and other methods to which you can observe, using post-route eration software routines, you can migrate be used to zero-in on errors. simulation models, the design running at critical test functions such as stimulus gen- any simulated clock speed. erators into hardware. Rather than re- implementing such functions in Conclusion Impulse C Visual Studio, VHDL or Verilog™, automated C- Impulse In-system testing using embedded proces- Platform Design Files CodeWarrior, to-RTL compilation quickly generates Libraries GCC, etc. sors is an excellent complement to simula- hardware representing test producer tion-based testing methods, allowing you or consumer functions. These func- to test hardware elements at lower clock tions interact with the unit under test, rates efficiently using actual hardware Generate Generate Generate using FIFO or other interfaces to Hardware Software RTL interfaces and potentially more accurate Interfaces Interfaces implement data streams and supply real-world input stimulus. This helps to other test inputs. augment simulation, because even at The CoDeveloper C-to-RTL com- HDL HDL Software Files Files Libraries reduced clock rates the hardware under piler analyzes C processes (individual test will operate substantially faster than is functions that communicate via possible in RTL simulation. streams, signals, and shared memories) By combining this approach with C- and generates synthesizable HDL to-hardware compilation tools, you can compatible with Xilinx Platform model large parts of the system (includ- Studio (EDK), Xilinx ISE, and third- ing the hardware test bench) in C lan- party synthesis tools including guage. The system can then be Synplicity® (Figure 3). The generated iteratively ported to hand-coded HDL RTL is automatically parallelized at or optimized from the level of C code to the level of inner code loops to reduce Figure 3 – Hardware and software C code is create increasingly high-performance process latencies and increase data compiled and debugged in a standard IDE; system components and their accompa- rates for output data streams. the interfaces are automatically generated and nying unit- and system-level tests. Automated compilation capability the generated HDL synthesized in Xilinx tools For more information, visit with the ability to express system- before downloading into the Virtex-II, www.impulsec.com, e-mail info@ level parallelism (creating multiple Virtex-II Pro, or Virtex-4 device. impulsec.com, or call (425) 576-4066.

32 Embedded magazine March 2005 X-ray vision for your designs Agilent 16900 Series logic analysis system with FPGA dynamic probe

Now you can see inside your FPGA designs in a way that will save days of development time.

The FPGA dynamic probe, when combined with an Agilent 16900 Series logic analysis system, allows you to access different groups of signals to debug inside your FPGA— without requiring design changes. You’ll increase visibility • Increased visibility with FPGA dynamic probe into internal FPGA activity by gaining access up to 64 ® • Intuitive Windows XP Pro user interface internal signals with each debug pin. • Accurate and reliable probing with soft touch connectorless probes • 16900 Series logic analysis system prices starting at $21,000 You’ll also be able to speed up system analysis with the 16900’s hosted power mode—which enables you and your team to remotely access and operate the 16900 over the network from your fastest PCs. Get a quick quote and/or FREE CD-ROM with video demos showing how you can The intuitive user interface makes the 16900 easy to get up reduce your development time. and running. The touch-screen or mouse makes it simple to U.S. 1-800-829-4444, Ad# 7909 use, with prices to fit your budget. Optional soft touch Canada 1-877-894-4414, Ad# 7910 connectorless probing solutions provide unprecedented www.agilent.com/find/new16900 reliability, convenience and the smallest probing footprint www.agilent.com/find/new16903quickquote available. Contact Agilent Direct today to learn more.

©Agilent Technologies, Inc. 2004 Windows is a U.S. registered trademark of Microsoft Corporation NohauNohau ShortensShortens DebuggingDebugging TimeTime forfor MicroBlazeMicroBlaze andand Virtex-IIVirtex-II ProPro PowerPCPowerPC UsersUsers Nohau tools provide multiprocessor debug environments for embedded systems, resulting in increased design modification and debug efficiency.

by Darrell Wilburn sion. Furthermore, adding logic paths to Nohau Corporation has developed a President provide for external probing is greatly compact on-chip development system that I.Q. Services, Inc.* intrusive, which may create new place and enables you to efficiently address these [email protected] or [email protected] route problems as well as timing differ- debug issues. The Nohau solution includes ences in the final design. compact on-chip debug IP called Platform FPGAs can implement a com- Simulation can still help you overcome DebugTraceBlaze that is minimized for pletely configurable system-on-chip by the simpler roadblocks, but for real-time size, connects directly to the on-chip containing one or more microprocessors in or intermittent problems, observing in real peripheral bus (OPB), and utilizes on-chip a tightly coupled fabric. This delivers very time through in-circuit methods quickly block RAM for trace storage. flexible hardware and software, which can becomes a necessity. On-board instrumen- The debug facilities are implemented change continuously throughout the tation circuits can provide visibility to all two ways: through hardware or software. design and debug cycle. system signals as well as executing pro- The software-based solution uses a small A powerful set of software debug tools grams. Xilinx® program called XMD-STUB that that can properly support sophisticated The challenges of verification and resides in the first 1K block of memory. FPGAs is critical for successful project debug steps are: The hardware solution uses programmable completion. Debugging and verifying a • Instrumentation to provide correlated logic in the hardware and is transparent to design from external pins is problematic at hardware and software measurements the software. You may choose the solution best. Reliably measuring 200 to 300 MHz that is best for you. • Needing a broad range of engineering signals (like Fast Simplex Links) over a 3- Personally, I prefer the software solu- skills foot logic cable to an external trace facility tion because it has less impact on the hard- is very difficult – and sometimes impossi- • Extreme flexibility with ever-changing ware and is more flexible for ble – to make with sub-nanosecond preci- needs for both customization. Also, the cost of 1K of

34 Embedded magazine March 2005 A second path for code development is I-OPB shown in Figure 3. BlazeGen generates UART GPIO small pre-tested code snippets that fit D-OPB entirely in one on-board block RAM to MicroBlaze provide you with a solid starting place for initial power-up check-out. These snippets 32-bit Processor Nohau Debug Ext. Memory IP 10/100 EMAC are treated just like user code for input to Controller

ILMB the GNU compiler. Dual Port DLMB with TRACE Block RAM You can enter and compile C/C++ code Boot Program from inside XPS or from an external editor and compiler using its own make files. For large programs, I recommend using an JTAG RJ45 external GNU make facility. The output 256K X32 PHY SRAM from the compile process is an .elf file that contains all code and symbolic information to be loaded directly by Seehau. Figure 1 – Nohau Debug IP with trace, shown in a simple five-chip Internet-aware system As shown in Figure 3, the classic edit/compile/debug loop familiar to memory is usually insignificant in systems .MHS, .MSS, .UCF, and project options embedded system engineers centers around that often have 1 MB or more. A block files, which are generated by the platform the Seehau debugger. Additionally, a hard- diagram for a typical small system is shown builder or user-generated text files. To add ware edit/compile/debug loop is now in Figure 1, illustrating placement of the the Nohau DebugTraceBlaze IP to a proj- included that loops back through new Nohau DebugTraceBlaze module. ect, you first build it with BSB or BlazeGen builds in XPS. Please note that the Nohau solution and add DebugTraceBlaze IP with a pass requires no external signal pins; all access is through BlazeGen. Debugging with Seehau through the JTAG port. Furthermore, it The output of the XPS build is a .bit file Seehau provides an intuitive source-level does not impact timing because it only that contains the bitstream required to pro- debugger that can be made aware of logic interfaces through the OPB bus. The gram the target FPGA with your system. signals in the fabric; RTOS state and vari- resource utilization in the FPGA for the The Nohau Seehau debugger is a conven- ables; correlation of hardware signals to code Nohau IP is very small. Actual require- ient and easy-to-use GUI interface that execution; and Ethernet performance char- ments are shown in Figure 2. allows fast and easy updating of system acteristics in Internet-aware applications. hardware and software as well as test and Seehau is a full-featured source or assembly check-out of software execution. Seehau debugger with an integral real-time trace Debug with no trace: Debug with trace: loads the bit file and programs the FPGA facility. It supports either PowerPC™ hard- in just a few seconds. core or MicroBlaze™ soft-core processors. Slices = 84 Slices = 425

LUTs = 86 LUTs = 642 BlazeGen or User Source Flops = 148 Flops = 489 GNU GCC Mults = 0 Mults = 0 C Source BRAMs = 0 BRAMs = 4 User Code Download.bit

Figure 2 – Nohau resource usage with BlazeGen and without trace XPS Seehau Start or Debug Done Retargeting (Source Level BSB Debugger) Design/ Debug Flow The design flow with Nohau tools present Software Edit is illustrated in Figure 3. You may build an Compile Loop initial system from scratch or use a plat- MHS MSS form generator like Nohau BlazeGen or UCF Hardware Edit Compile Loop Xilinx Platform Studio (XPS) Base System XMD Builder (BSB). Simply specify your system with the Figure 3 – Development flow with Seehau source-level debugger in place

March 2005 Embedded magazine 35 You can look back in time from associated with the PowerPC archi- any execution to follow the path tecture. Figure 6 shows a source- backward, or you can use the Seehau level debug screen of a typical event configuration system to specify PowerPC debug session. pre- and post-triggering, complex Seehau has also been expanded to breakpoints, triggers on register reads include support for multiple processors and writes, and triggers on data from in the same fabric. The processors are the fabric. Figure 4 shows a typical run independently. Figure 7 shows a set source-level debug display with of screens for a two-processor system. processor registers, memory data, The Nohau GUI provides a sim- program data in source form, and ple, easy-to-use interface that assigns trace and breakpoint status. a complete set of control and status Nohau tools are sold as a system, Figure 4 – Full source with breakpoint windows to each processor. All which includes the Seehau debugger, an at context switch in µC/OS-II Seehau windows are available for interface pod to the appropriate JTAG both processors, and show the name connector, and the IP DebugTraceBlaze given to each processor in the top configured with trace memory. As a sys- banner. In the case shown in Figure tem, it may be ordered as EMUL- 7, the execution sites are named MICROBLAZE-PC. MB1 and MB2. The Nohau EMUL-MICROB- When you select a command from LAZE-PC provides a 512-frame or a pull-down list by clicking on it, the 2K-frame deep trace with a trigger, command is directed to the processor post-trigger count, and break control. Figure 5 – Trace 40-bit mixed-mode display assigned to the window in focus. You Probe pins may be either 8 or 40 bits in binary logic signals from FPGA fabric control the set of windows open for wide. It will display data connected to each processor through a pull-down it as specified in the XPS .MHS file. menu. The choice of open windows is Figure 5 illustrates a trace display in controlled by your selection of new mixed mode with C source and assem- windows to view. The result is an easy- bly source intermixed. On this single to-use, intuitive user interface. display, you can correlate the frame at capture time, the execution address, Conclusion the opcode of the instruction execut- Getting the right tool set and develop- ed, the disassembled MicroBlaze ment environment set up for a new instruction, the C source line that gen- FPGA project is critical to the success erated that instruction, and 40 bits of of the product development cycle. A data from any logic in the system. highly productive development and Data from logic can include signals debug environment based around from your own logic design. Figure 6 – PowerPC source-level debug with trace Nohau tools supports these new mul- tiprocessor systems, with an extension Multiprocessor System Support of the same powerful debug and test Recently, Nohau completed a joint tools the company has offered for the project with Xilinx, expanding the last 20 years. Seehau system to include support For more information, please visit for the hard-core PowerPC proces- www.nohau.com, www.iq-service.com, sor found in Virtex-II Pro™ or e-mail [email protected] or devices. As Seehau is a robust, [email protected]. source-level debugger, the user interface and source-level feature set are nearly identical. The only major * I.Q. Services is under contract to sup- changes from an embedded system port and market platform FPGA tools engineer’s point of view are the for Nohau Corporation and performs processor-level language on disas- custom start-up engineering for plat- Figure 7 – Multi-core display with two MicroBlaze processors sembled screens and the register set form FPGA embedded system designs.

36 Embedded magazine March 2005 AA ScalableScalable Software-DefinedSoftware-Defined RadioRadio DevelopmentDevelopment SystemSystem Sundance enters the SDR fray by Flemming Christensen Managing Director with a Xilinx-based platform. Sundance [email protected]

There has been a strong push in the past few years to replace analog radio systems with digital radio systems. The Department of Defense Joint Tactical Radio System program shifted the empha- sis on the development of software-defined radio (SDR) to the forefront of research and development efforts in the defense, civilian, and commercial fields. Although it has existed for many years, SDR technology continues to evolve through newly funded ventures. The “holy grail” of SDR is its promise to solve incom- patible wireless network issues by imple- menting radio functionalities as software modules running on generic hardware plat- forms. Future SDR platforms will comprise hardware and software technologies that enable reconfigurable system architectures for wireless networks and user terminals. Although new SDR-based systems are purported to be highly reconfigurable and reprogrammable right now, the truth is that SDR hardware platforms are still in their early development stages. Many issues must still be resolved, including reconfig- urable signal processing algorithms, hard- ware and software co-design methodologies, and dynamically reconfig- urable hardware. Overall, the main key issues for SDR embedded system platforms are flexibility, expandability, scalability, reconfigurability, and reprogrammability.

March 2005 Embedded magazine 37 many configurations download software. The eight RocketIO™ JTAG COMPORTS JTAG of multi-channel soft- transceivers on the Virtex-II Pro device JTAG ware programmable enable high-speed data transfer to additional RS485 and hardware-config- SMT148 or Virtex-II Pro add-on modules. urable digital radio. Downloads can leverage a powerful I/O Add-on Modules Add-on Modules The SMT148 has architecture that includes the popular at its heart a powerful FireWire, USB interfaces, LVDS interfaces, embedded system and JTAG for debugging and downloads. controller that lever- Data flows into the FPGA and is managed ages the Xilinx Virtex- by a Sundance program written for the II Pro FPGA with its embedded PowerPC before processing. Add-on Modules Add-on Modules embedded PowerPC Figure 3 is the C code comprising the data 405 processor (Figure flow and RocketIO PowerPC program.

USB 2). As an embedded RS485 LVDS RS232 system controller, the Scalable, Reconfigurable Embedded Processors USB IEEE1394 USB IEEE1394 role of the Virtex-II Scalability is addressed through four add- Figure 1 – The SMT148 Pro device is to man- on module sites, and you can partly resolve age the reconfigura- the requirements of dynamic reconfigura- Meeting the Challenge tion of the add-on modules, especially tion by adding additional Xilinx-based SDR is characterized by a decoupling when downloading. Reconfiguration means FPGA modules. All add-on modules com- between the heterogeneous execution plat- switching between modes or updating a municate through the Virtex-II Pro device, form, based on hardware functions togeth- hardware/software component. which also manages two 32-bit microcon- er with DSP and MCU processors, and the In the global functioning of the trollers that enable communications with applications through abstraction layers. SMT148, you can download many kinds of most widely used standards. One of the approaches for meeting the software (high-level applications, protocol With the RocketIO transceivers con- complexity demands of third-generation stacks, low-level signal processing algo- nected to differential pair connectors, you systems is to use multi-core systems where rithms) and employ several methods to can connect FPGA systems directly a DSP and a microcontroller work togeth- er with an FPGA-based hardware co- processor. With its embedded IBM™ JTAG PowerPC™, the Xilinx® Virtex-II Pro™ FPGA is rapidly becoming a solution that embeds and tightly couples reconfigurable TIM Site 1 TIM Site 2 TIM Site 3 TIM Site 4 logic and a processor in the same device. External Sundance, a developer of advanced sys- ComPorts tem architectures for high-performance sig- Global Buses nal processing applications, has focused on designing FPGA-based development plat- UART CPLD USB FPGA forms that address an SDR OEM’s wish FireWire Virtex-II Pro JTAG Configuration Chain Xilinx Interface list. Our challenge was to design a system XC2VP7, 20, 30 JTAG that would provide the scalability SDR sys- FFB56 Power In tems require. Power Management Power Out

The Embedded System Controller 100 MHz The SMT148 (Figure 1) is one of many Crystal development systems Sundance has RSL launched recently. Aimed specifically at SHB 32 LED DAC Circuitry ADC Circuitry LVDS 20 Gbits RS485 400 Mbytes “Display” 8 channels 8 channels fullduplex SDR, the SMT148 is a fully configurable 16-bit 200kSps 14-bit 400kSps and expandable waveform development environment that meets the many require- ch0 ch7 ch0 ch7 ments of SDR developers. This entry-level, stand-alone system enables radio designers to investigate and experiment with the Figure 2 – SMT148 systems architecture

38 Embedded magazine March 2005 through simple cable connections support- When we designed the SMT148, we topology neatly harmonized to the Xilinx ing more than 2 Gbps data rates. understood that one of the many chal- architecture. The SMT148 leverages the Xilinx lenges OEMs would face in developing Fully interconnected and configurable Virtex-II Pro block RAM configuration to JTRS-compliant platforms was the avail- through their communication ports, these generate FIFOs for the RocketIO trans- ability of interchangeable and networked add-on sites are also connected to the ceivers, add-on modules communication processing nodes. The availability of pro- embedded Virtex-II Pro FPGA. The avail- ports, and a high-speed bus, as well as the cessing nodes aims to meet the expandabil- ability of a network of add-on sites removes embedded PowerPC code. The single high- ity and scalability requirements in complex the main expandability restrictions often speed bus allows parallel data transfer to waveform applications. associated with other platforms, and offers and from a wide range of high-speed The SMT148 meets this availability chal- the OEMs a highly compact design and ADC/DAC modules. lenge with a network of daughter sites that development tool. More importantly, this scalable

TIM1 TIM2 TIM3 TIM4 system architecture makes the SMT148 a perfect development plat- CP0 CP1 CP2 CP3 CP4 CP5 CP6 CP7 RESET TIMs form with which to resolve the many issues related to the implementation of multiple radio functionalities in a ComPort ComPort ComPort ComPort ComPort ComPort ComPort ComPort Global IF 0 IF 1 IF 2 IF 3 IF 4 IF 5 IF 6 IF 7 Bus IF1 GB1 single environment. These can be addressed as multiple software mod- Global Bus IF2 GB2 ules running on Sundance’s reconfig- Watchdog Timer 1 R µc urable hardware platform. µc1 Watchdog E Global GB3 IF1 Timer 2 Bus IF3 Watchdog S Switch Fabric Timer 3 E IP Cores Watchdog T Global GB4 Timer 4 Bus IF4 µc µc2 Sundance takes advantage of the IF2 McBSP en/dis 1 high-performance DSP acceleration McBSP en/dis 2 capabilities and flexible connectivity McBSP en/dis 3 that the Virtex-II Pro FPGA provides McBSP en/dis 4 by supporting developers with a fam- ily of software tools and IP cores. DAC IF ADC IF LED IF RSL IF SHB IF RS485 IF McBSP IF The SMT148 I/O flexibility enables you to rapidly investigate and experiment with features of the DAC ADC LED RSL SHB RS485 To/From LVDS Drivers/Receivers Virtex-II Pro FPGA as well as those of developed IP cores from Sundance. Figure 3 – Interconnections diagram of the digital modules inside the SMT148 These include multi-tap complex fil- ters, Viterbi decoders, encoders, a Data rates on this port are in excess of you can use for additional resources such as complete transmitter, QAM mapper, multi- 100 MHz (400 Mbps), and are useful for signal processors, reconfigurable computing phase pulse shaping filter, multi-phase cas- transferring sampled 16-bit I and Q. modules, and Sundance’s large family of add- caded integrator comb (CIC) interpolation Processing data streams can take place on modules. These add-on modules include filter with a fixed interpolation rate, multi- either in the embedded PowerPC in the a variety of embedded system options such as phase numerical-controlled oscillator Virtex-II Pro device or throughout an reconfigurable modules with tightly coupled (NCO), and multi-phase digital mixer. array of other add-on Virtex-II Pro FPGA- Virtex-II Pro FPGAs and DSPs, digital and based modules with embedded PowerPC. analog converters, data conversions, trans- Conclusion The FPGA on the SMT148 carrier ceivers, and I/Os of all types. Developing, testing, and implementing card is connected to many different SDR IP cores is simplified with the devices and therefore has many internal Designed for Developers Sundance SMT148 platform. You can interfaces that allow it to exchange data or Powered by an external supply, the now focus on developing additional IPs commands with the external world. All SMT148 platform has an impressive topol- without worrying about peripheral pro- interfaces are reset at power on when ogy that accepts input signals from various cessing or I/O devices, as these are simply applying a manual reset. Figure 4 shows sources through a network of multi-pin off-the-shelf add-on IP blocks. the interconnections between the digital connectors. High-speed I/O channels sup- For more information, please visit modules inside the FPGA. port the additional nodes in a network www.sundance.com.

March 2005 Embedded magazine 39 NucleusNucleus forfor XilinxXilinx FPGAsFPGAs –– AA NewNew PlatformPlatform forfor EmbeddedEmbedded SystemSystem DesignDesign The Nucleus real-time operating system is now available for Xilinx FPGAs with MicroBlaze and PowerPC 405 cores.

by Chang Ning Sun Technical Marketing Engineer Mentor Graphics Corporation [email protected]

Embedded system development traditionally requires a hardware design cycle to create a prototype; a software implementation cycle can only begin once that proto- type is available. Co-design and co-verification tools such as the Seamless™ co-verification envi- ronment (CVE) from Mentor Graphics® provide great flexibility and early system integration, but these tools have mostly been used for large-scale systems such as ASIC designs. Ordinary embed- ded system designs have not yet benefited from hardware/software co-design and co-verification.

40 Embedded magazine March 2005 In recent years, innovations in FPGA technology have shifted the use of these (written in C, C++, POSIX or Java™) devices from supporting logic to forming a central part of the embedded system. With their high logic density and high performance, modern FPGAs allow you to implement almost an entire embedded hardware system in a single FPGA device. Xilinx® Spartan-II™, Spartan-3™, and Virtex-II™ Platform FPGAs (com- bined with the MicroBlaze™ soft proces- sor core) and Virtex-II Pro™ Platform FPGAs (combined with the PowerPC™ 405 hard processor core) offer new hard- ware platforms and facilitate new method- ologies in embedded system design. The Nucleus™ real-time operating system (RTOS) from Accelerated Technology (Mentor Graphics’ Embedded Systems Division) fully supports Xilinx Platform FPGAs. Together with our FPGA design flow and Seamless CVE tools, the Nucleus RTOS is a complete hard- ware/software co-design and co-verification environment.

The Nucleus RTOS Figure 1 - Nucleus system architecture Very few embedded software developers start their development on bare hardware; most choose a commercial embedded systems. It has a very small footprint; the ker- more difficult. These standards are often RTOS like Nucleus as the base environ- nel itself can be as small as 15-20 kB. implemented as middleware components ment and develop their applications on top All Nucleus kernel functions are provid- and provided together with the RTOS. of the operating system. ed as libraries, so only the required kernel The Nucleus RTOS has an extensive Generally, an RTOS provides a multi- functionality is linked with the application library of middleware support (Table 1), tasking kernel and middleware compo- code in the final image. Nucleus PLUS which covers every vertical market of the nents such as a TCP/IP stack, file system, provides complete multi-tasking kernel embedded industry. Nucleus NET utilizes or USB stack. Application developers functions, including task management, a zero-copy mechanism for transferring build their application software by using inter-task communication, inter-task syn- data between user memory space and the the system services provided by the RTOS chronization, memory management TCP/IP stack, which significantly kernel and the middleware components. A (MMU), and timer management. Recently, improves data transmission efficiency and number of commercial RTOSs are avail- we added a dynamic download library reduces dynamic memory allocation. able: some have hard real-time kernels, (DDL) into Nucleus PLUS to allow Nucleus software fully supports the some have extensive middleware support, dynamic downloading and allocation of POSIX standard for software portability. and some have good development tools. system modules. POSIX interface libraries are available for The Nucleus RTOS has all of these fea- all Nucleus middleware components. tures from a single vendor. Nucleus Middleware Figure 1 shows the Nucleus system The middleware available for an RTOS has Nucleus Device Drivers architecture. become an important factor when selecting The Nucleus RTOS has an extensive list of a commercial RTOS for a particular applica- off-the-shelf device drivers. When you select Nucleus PLUS tion. As embedded applications become a new hardware IP or peripheral to use in Nucleus PLUS, a fully preemptive and scala- increasingly more complicated, application your design, the driver source code may ble kernel suitable for hard real-time applica- developers find implementing industry stan- already be available. If not, a driver template tions, is designed specifically for embedded dards like TCP/IP, WiFi, USB, and MPEG lets you easily develop a functional driver.

March 2005 Embedded magazine 41 Nucleus software has proven to be a robust RTOS and has been widely used in every vertical market of the embedded industry...

Mentor Graphics will continue to add compiler tools such as GNU. telecom, defense/aerospace, automotive, device drivers to the Nucleus library for For system-on-chip (SoC) users, inte- and telematics. new hardware IP blocks for Xilinx FPGAs gration of the Nucleus RTOS and XRAY as well as for third parties, such as its own debugger with other products in the Nucleus for Xilinx FPGAs IP Division. Mentor Graphics lineup (such as Seamless The scalable and configurable nature of CVE for hardware/software co-verifica- Nucleus products, combined with the flexi- Nucleus Development Tools tion) offer a complete hardware/software bility and versatility of FPGAs, provides The Nucleus RTOS has the most inte- co-design and co-verification platform. great potential for hardware/software co- grated development environment (IDE) design, co-verification, and early integra- in the industry. Combined with our Royalty-Free with Source Code tion of your software and hardware projects. Microtec compiler tools, the code|lab The royalty-free with source code busi- embedded development environment ness model of the Nucleus RTOS, com- Nucleus for MicroBlaze (EDE) and XRAY debuggers provide a bined with the high scalability and small The MicroBlaze soft processor core features complete software design flow, from footprint architecture, provides great a Harvard-style RISC architecture with 32- compilation to debugging. value to customers developing large-vol- bit instruction and data buses. A Our C/C++ source-level debuggers ume products such as cell phones, PDAs, MicroBlaze soft core can be programmed support both run-mode and freeze mode and cameras. A single up-front license fee into Spartan-II, Spartan-3, or Virtex-II debugging with complete Nucleus kernel covers all production rights for a single Platform FPGAs. awareness. The XRAY debugger supports product using Nucleus software, with no Nucleus software fully supports the both homogeneous and heterogeneous additional royalty fees and no need to Xilinx MicroBlaze soft processor core with multi-core debugging, and a wide range of track shipment numbers. GNU compiler tool and GDB debugger. processors and DSPs. Nucleus software has proven to be a In addition to the Microtec compil- robust RTOS and has been widely used in Nucleus for Virtex-II Pro ers, the EDE, IDE, and debuggers are every vertical market of the embedded FPGAs/PowerPC 405 also integrated with many third-party industry, including consumer electronics, Many embedded system developers are already familiar with the PowerPC 405 processor. Xilinx Virtex-II Pro FPGAs pro- Nucleus Product Function vide fully configurable hardware platforms Nucleus NET Complete embedded TCP/IP stack with the PowerPC 405 hard processor core. Nucleus Extended Protocols Telnet, FTP, TFTP, Embedded Shell Nucleus software fully supports Virtex-II Pro FPGAs with the PowerPC Nucleus SNMP version 1,2,3 Network management protocols 405 hard core, providing a Microtec Nucleus RMON (1-9 groups) Network remote monitoring protocols PowerPC compiler and XRAY kernel- aware debugger, as well as GNU compil- Nucleus Residential Gateway NET, PPP, PPPoE, DHCP server er tools and GDB debugger. Nucleus WebServ Embedded HTTPD web server Nucleus EMAIL POP3 and SMTP FPGA Reference Design for Nucleus Software Nucleus FILE MS DOS compatible file system To give embedded system developers a Nucleus GRAFIX Graphics-rendering engine with windowing toolkit quick start with Nucleus software for Nucleus USB Complete USB stack for USB specifications 1.1, 2.0, and OTG Xilinx FPGAs, we worked with Xilinx and Memec™ Insight to develop reference Nucleus 802.11 STA (WiFi) 802.11b and 802.11g protocol stack designs for running Nucleus software. Each Nucleus IPv6 IP version 6 protocol stack reference design is a sample FPGA hard- ware configuration based on a Memec CEE-J Java VM for Nucleus CLDC, MIDP, Embedded Java, and Personal Java Insight FPGA development board. You can build and implement all of the Table 1 - Primary Nucleus middleware components reference designs into the FPGA device by

42 Embedded magazine March 2005 using Xilinx EDK and ISE software. For • Developing Nucleus device drivers and serves as a hardware abstraction layer, each reference design, we provide a Limited Nucleus applications which will significantly simplify the Version (LV) of our Nucleus software, • Programming the FPGA device Nucleus device driver porting over including the Nucleus PLUS kernel and FPGA-based hardware platforms. the relevant middleware (such as Nucleus • Downloading Nucleus into the FPGA NET) to help you evaluate Nucleus soft- hardware board Nucleus Software Debugging ware with the FPGA. Once you have reached this point, you Both MicroBlaze soft-core and PowerPC The LV version of Nucleus software can start evaluating your Nucleus software 405 hard-core designs provide a standard functions exactly the same as a full version, and begin debugging your application. JTAG debug interface. So after imple- but is provided as binary only and menting the hardware design in runs for a limited time. Currently, the FPGA, you can take full we have FPGA reference designs for advantage of our JTAG-based the following Memec Insight code|lab EDE or XRAY debugger boards: to facilitate application debug- ging. Both code|lab and XRAY • Spartan-IIE LC MicroBlaze have complete Nucleus kernel development board awareness, which can help you • Spartan-3 LC MicroBlaze analyze complex real-time behav- development board iors in a multi-tasking system. • Spartan-3 MB MicroBlaze development board Co-Design and Co-Verification As we mentioned earlier, Nucleus • Virtex-II MB MicroBlaze software provides a viable platform development board for hardware/software co-design • Virtex-II Pro PowerPC and co-verification. As one of the development board Figure 2 - Xilinx EDK-generated source code directories only EDA companies with an embedded software focus, Mentor We have also created a website Graphics’ Nucleus RTOS and (www.acceleratedtechnology.com/xilinx) to Xilinx EDK and ISE software provide a XRAY debugger are fully integrated with support Nucleus software for Xilinx FPGAs, complete environment for configuring and our Seamless hardware/software co-verifi- where you can download FPGA reference implementing an FPGA-based hardware cation software platform. designs and Nucleus software updates. platform. At the same time, EDK also gen- The combination of the Nucleus RTOS erates software libraries and C header files and Seamless co-verification tool provides FPGA Design Flow for the hardware design. These libraries you with one of the most comprehensive The FPGA reference designs for the provide basic processor boot-up code and hardware/software co-design and co-verifi- Nucleus RTOS can be used as your starting low-level hardware IP device driver func- cation tools possible. point to build a custom FPGA-based hard- tion code. These low-level functions can ware platform. All of the reference designs be treated as a “BIOS” layer for Nucleus. Conclusion are provided as Xilinx Platform Studio Figure 2 shows the source code directo- Embedded systems are becoming increasing- (XPS) projects, so you can directly open ries generated by EDK in an XPS project. ly complicated. To meet the challenge, inte- these projects in XPS and edit, build, and You will find one subdirectory for each piece grated solutions including hardware/software implement the design with Xilinx EDK of hardware IP, such as “emac_v1_00_d” for co-design and co-verification are required. and ISE software. Typically, building a Ethernet and “uartlite_v1_00_b” for UART. FPGAs with configurable processor cores Nucleus system using an FPGA and con- Each subdirectory contains the driver code bring a new, innovative approach to mod- figurable cores involves the following: for that hardware IP. These driver functions ern embedded system designs and open the • Configuring a hardware platform with are included with the hardware IP from the door to hardware/software co-design and FPGA, processor core, hardware IP, vendor and have been integrated into the co-verification. and memory EDK installation. Nucleus software for FPGAs can sig- Nucleus software provides an OS nificantly accelerate your hardware/soft- • Building the design with EDK/ISE soft- adaptation layer that is an interface layer ware development cycle and improve ware to generate an FPGA bitstream between the EDK-generated “BIOS” your hardware/software integration quali- • Building software libraries with EDK function layer and the Nucleus high-level ty. For more information, please visit to generate low-level device driver code device drivers. This OS adaptation layer www.acceleratedtechnology.com.

March 2005 Embedded magazine 43 EmulateEmulate 80518051 MicroprocessorMicroprocessor inin PicoBlazePicoBlaze IPIP CoreCore Put the functions of a legacy microprocessor into a Xilinx FPGA.FPGA.

by Lance Roman President Roman-Jones, Inc. [email protected]

Brad Fayette Senior Software Engineer Roman-Jones, Inc. [email protected]

44 Embedded magazine March 2005 How do you put a one-dollar Intel™ 8051 • Smaller Size – Traditional 8051 type core. Hook up block RAM or microprocessor into an FPGA without implementations range from 1,100 off-chip program ROM, and you’re using 10 dollars’ worth of FPGA fabric? to 1,600 slices of FPGA logic. The ready to go. The answer is emulation. Using soft- PicoBlaze 8051 processor requires • Low Cost – The PB8051 is $495 ware emulation, Roman-Jones Inc. has just 76 slices. Emulation hardware with an easy Xilinx SignOnce IP developed a new type of 8051 processor requires 77 slices. Add another 158 license. core built on a Xilinx 8-bit, soft-core slices for two timers and a four-mode PicoBlaze™ (PB) processor. This “new” serial port, and you have a total of a Architecture of Emulated 8051 PB8051 is more than 70% smaller than 311 slices. This is a reduction of As shown in Figure 1, the architecture of an competing soft-core implementations – more than two-thirds of the FPGA emulated processor has several elements. without sacrificing any of the performance fabric of competing products. Each element is designed independently, of this legacy part. The PB8051 is a Xilinx • Faster – At 1.3 million instructions but together, they act as a whole. AllianceCORE™ microprocessor built per second (MIPS), the PB8051 is through emulation. faster than a legacy 8051 (1 MIPS) PicoBlaze Platform running at 12 MHz. Compare this to The PicoBlaze host processor is the heart The Legacy of the 8051 5 MIPS with a 40 MHz Dallas version of the emulated system and defines the The Intel 8051 family of microprocessors or 8 MIPS with a traditional FPGA architecture of the PB8051 emulated – probably one of the most popular archi- processor core. It comprises: tectures around – is still the core of many embedded applications. This processor This “new” PB8051 • PicoBlaze Code ROM – Block ROM just refuses to retire. Many designers are contains the software code to emulate using legacy code from previous projects, is more than 70% 8051 instructions. while others are actually writing new code. • Internal Address/Bus – The PicoBlaze The 8051 architecture was designed for smaller than peripheral bus is a 256-byte address ASIC fabric. It is not efficient in an space accessed by PicoBlaze I/O FPGA, resulting in excess logic usage with instructions. It allows the PicoBlaze marginal performance. competing soft-core processor to interface with RAM, FPGA microprocessor integration is a emulation peripherals, timers, and the solution for older 8051 products undergo- implementations – serial port. This bus is internal to the ing redesign to eliminate obsolesce, lower core and thus hidden from the user. costs, decrease component count, and without sacrificing increase overall performance. • Emulation Peripherals – Not all emu- The new FPGA-embedded PB8051 any of the performance lation tasks can be done in software designs allow you to take advantage of and still meet the performance existing in-house software tools and your requirements for your particular own architecture familiarity to quickly of this legacy part. application. Specialized hardware implement a finished design. The inte- assists the PicoBlaze code to perform grated PB8051 can be customized on the tasks that are time-consuming, on a critical execution path, or both. FPGA to exact requirements. implementation – which takes up more than three times as much FPGA A good example is the parity bit in the Processor Emulation fabric as the PB8051. The PicoBlaze PSW (program status word) register that Programmers have used microprocessor processor itself runs at a remarkable 40 reflects 8051 accumulator parity. This emulation for many years as a software MIPS using an 80 MHz clock. function must be performed for every development vehicle. It allows program- 8051 instruction executed, and would mers to write and test code on a develop- • Software Friendly – You can write C take several PicoBlaze instruction cycles ment platform before testing on target code or assembler code with your to perform. It is, therefore, done in hardware. present software development tools to hardware. This same concept can be practical generate programs. You can also run when the target microprocessor architec- legacy objects out of a 27C512 • Instruction Decode ROM – This is a ture does not lend itself to efficient imple- EPROM. key emulation peripheral that is used mentation and use of FPGA resources. • Easy Hardware – You can use VHDL to decode 8051 instructions to set The features of our PB8051 emulated or Verilog™ hardware description PicoBlaze routine locations and emu- processor include: languages to instantiate the 8051- lation parameters.

March 2005 Embedded magazine 45 1K x 16 256 x 8 Block ROM Instruction Emulation Decode ROM Program Code

Interrupt RST_8051 PicoBlaze 8051 WR Data Emulation Peripherals RD 256 x 8 PSEN Block RAM INSTR_FETCH EXT_BUS_START Address Decode, Data, Signals EXT_BUS_HOLD

SERIAL0_PRE12 Address P1_IN[7:0] Serial Port Address SERIAL2_PRE32 Decode P1_OUT[7:0] Address Decode, Data, Signals P3_IN[7:0] P3_OUT[7:0]

TIME_PRE Timer EXT_DATA_IN[7:0] EXT_DATA_OUT[7:0]

EXT_ADDRESS[15:0] CLK Internal RST Address/Data ROM_ADDRESS[15:0] Bus ROM_DATA[7:0]

Figure 1 – PB8051 block diagram

• Address Decode – A significant in very tight PicoBlaze assembler code, opti- • Scan Interrupts – This function deter- amount of the emulation peripheral mized for speed and efficiency. mines if an interrupt is pending, and if hardware is dedicated to simple The emulation program is divided into so, services it. An interrupt window is address decode of the PicoBlaze several segments: generated at the end of every emulated peripheral bus. This address space is instruction when required. for the PicoBlaze processor only and is • Instruction Fetch – An 8051 bus cycle insulated from the 8051 application. is simulated to fetch the next instruc- In addition to PicoBlaze code, Java™ tion from 8051 program memory (64 software utilities process symbols taken • Block RAM – 256 bytes are available Kb size), which may be on-chip block from PicoBlaze listings (.LOG files) into a to 8051 internal RAM and some 8051 RAM or off-chip EPROM (such as the ROM table. These .LOG files are used to registers. You can access this block 27C256). decode 8051 opcodes. RAM via 8051 instructions. This program also produces the .COE • Instruction Decode – Fetched instruc- • Serial Port and Timer – The actual files used by the Xilinx CORE tions are decoded to determine the 8051 timer and multi-mode serial port Generator™ system to create PicoBlaze addressing mode and operation. were best done in hardware instead of code ROM. All of this is transparent to trying to implement these functions in • Fetch Operands – Depending upon designers integrating with the PB8051. software. Clock prescaling inputs are the addressing mode, additional provided so that these functions can operands are fetched for the instruc- User Back-End Interface run at a clock rate independent of the tion from program memory, internal What gives the PB8051 its “hardware emulated system. RAM, or external RAM. flavor” is the user back-end interface, where you interface your logic design PicoBlaze Emulation Software • Instruction Execution – An instruction with the emulated 8051 processor. The A 1K x 16 block ROM holds the PicoBlaze performs the desired operation and back-end interface is part of the emula- code that performs the actual emulation. The updates emulated register contents, tion peripherals, controlled by the emulation program is carefully constructed including affected 8051 PSW flags. PicoBlaze platform.

46 Embedded magazine March 2005 What gives the PB8051 its “hardware flavor” is the user back-end interface, where you interface your logic design with the emulated 8051 processor.

Just as the 8051 processor family has tional multiplexer circuitry is not needed. Test and Debug Considerations many derivatives to define port, function- If your entire design, including the 8051 We recommend you use design tools to ality, features, and pinout, the back-end program, resides on the FPGA, simply set quickly and easily test and debug your design. interface serves the same function. The the EXT_BUS_HOLD to “low” to take full The most useful tool will be an HDL simula- PB8051 has a back-end interface that advantage of running at clock speed. If you tor, such as Modeltech or Aldec programs. resembles the generic 8031, the ROM-less elect to use an off-chip EPROM or have Most problems and bugs can be solved at the version of the 8051. slow peripherals, wait states can be inserted behavioral level. For interactive debugging, Roman-Jones Inc. customizes back-end by asserting EXT_BUS_HOLD at “high.” the Xilinx ChipScope™ integrated logic ana- interfaces to meet your exact 8051 needs, One of our reference designs illustrates wait lyzer has proved to be the tool of choice. At such as removing an unused serial port or state generation. the current time, no source code debugger adding an I2C port to emulate the 80C652 tools are available for the PB8051. derivative. Xilinx Implementation Considerations There are few implementation considera- Designer’s Learning Curve Designing with the PB8051 tions other than the need to place the Designers should have some experience in Incorporating the PB8051 into the rest of PB8051 design netlist into your project 8051 hardware/software and FPGA design your design is easy, because it comes with ref- directory and instantiate it as a compo- before attempting to consolidate the two. erence designs and examples of Xilinx inte- nent in your VHDL or Verilog design; The PB8051 core is designed for ease of grated software design (ISE) projects. ISE software will do the rest. ISE schemat- use and integration. ic capture is also supported. Your 8051 hardware and software Hardware Considerations expertise should include hardware under- Figure 1 illustrates the signal names avail- Simulation Considerations standing of the part and experience in writ- able on the user back-end interface. A We’ve included a behavior simulation ing 8051 code using software development VHDL or Verilog template provides the model of the PB8051 for adding (along tools. The basic design flow using the exact signal names – many of which are with the rest of your design) to your favorite PB8051 is identical to the normal pack- already familiar to 8051 designers. A few simulator. Modeltech and Aldec™ simula- aged processor flow. new types of signals exist, including: tors have been tested for correct operation. Integrating the PB8051 onto the Xilinx Post place-and-route or timing simulations part is much the same as instantiating a core • Pre-scales used by the timer and serial follow a conventional design flow. using the Xilinx CORE Generator tool. If port to set counting and baud rates. You will enjoy watching your 8051 you are familiar with VHDL or Verilog lan- • One-clock-wide read/write strobes to instruction execution flow go by on the guage, and have a couple of Xilinx designs read and write external memory space. simulation waveforms. This makes behav- under your belt, you’re good to go. There is also a program store enable ioral debugging straightforward, fast, and (PSEN) strobe for the 8051 code easy. We’ve provided a reference test Conclusion memory. bench with example waveform files. Integrating microprocessors onto FPGAs • Bus start and hold signals used to through emulation, as illustrated with the insert wait states for external memory 8051 Software Considerations PB8051, is a viable alternative to a full hard- cycles that cannot be completed in one To generate your 8051 programs the way ware functional design. The advantage of system clock cycle. you always have, use your favorite C com- FPGA fabric savings correlates to reduced piler or assembler and linker (if necessary) to parts cost. The PB8051 is instantiated as a compo- produce the same Intel hex format file that Integrating the PB8051 processor nent into your top-level design. You deter- you would use to burn an EPROM. You can into your design yields lower component mine if the 8051 program resides in even use an existing hex file, because the count, easier board debugging, less on-chip block RAM or off-chip EPROM. PB8051 looks like a regular 8051 as far as noise, and optimum performance Peripherals can be hooked up to either the software code is concerned. An Intel hex to of the peripheral/8051 micro-interface, P1/P3 port lines or to the external address .COE utility is included for those designs because both are in an FPGA. For more and data buses. For convenience, the that put the 8051 software into on-chip information about the PB8051 microcon- address and data lines for program and data block RAM. On-chip program storage pro- troller, visit www.roman-jones.com/rj2/ memory spaces are separated, so conven- vides maximum speed performance. PB8051Microcontroller.htm.

March 2005 Embedded magazine 47 OptimizeOptimize MicroBlazeMicroBlaze ProcessorsProcessors forfor ConsumerConsumer ElectronicsElectronics ProductsProducts An RTOS can be critical for getting the most out of the MicroBlaze processor.

by John Carbone VP, Marketing Express Logic, Inc. [email protected]

A fast processor is essential for today’s demanding electronic products. Efficient application software is equally essential, as it enables the processor to keep up with the real-world demands of networking and consumer products. A Good Fit Efficiency But what about the real-time operating Let’s look at some of the reasons why the The ThreadX RTOS can perform a full system (RTOS) that handles your system ThreadX RTOS is a good fit with the context switch between active threads in a interrupts and schedules your application MicroBlaze processor. mere 1.2 µs on a 100 MHz Xilinx Virtex-II software’s multiple threads? Pro™ FPGA. This is particularly critical in If it’s not optimized for the processor Size applications with high interrupt incidences, you’re using, and isn’t efficient in its han- The ThreadX RTOS is only about 6 KB - such as network packet processing. dling of external events, you’ll end up with 12 KB in total size for a complete multi- The ability of the ThreadX RTOS to only half a processor to do useful work. tasking system. This includes basic handle interrupts efficiently means that An efficient RTOS should do more than services, queues, event flags, semaphores, your system can handle higher rates of just handle multiple threads and service and memory allocation services. external events such as TCP/IP packet interrupts. It must also have a small foot- Its small size means that more of your arrivals, delivering greater throughput. print and be able to deliver the full power of memory resources will be preserved for the processor used by your application soft- use by the application and not absorbed Speed ware. Express Logic’s ThreadX® real-time by the RTOS. This will result in smaller The ThreadX RTOS responds to inter- operating system is just such an RTOS. memory requirements, lower costs, and rupts and schedules required processing in The ThreadX RTOS has now been opti- lower power consumption. Table 1 shows less than 1 microsecond. The single-level mized to support the Xilinx MicroBlaze™ code sizes for various optional compo- interrupt processing architecture of the soft processor, and is available off-the-shelf nents of the ThreadX RTOS on the ThreadX RTOS eliminates excess over- for your next development project. MicroBlaze processor. head found in many other RTOSs.

48 Embedded magazine March 2005 This keeps the ThreadX RTOS from enables shorter development times and Efficient interrupt processing delivers gobbling up valuable processor cycles with faster time to market. the best performance from MicroBlaze housekeeping tasks, thus leaving more processors, giving your product a boost processor bandwidth for your application. Fast Interrupt Handling over the competition. Thus, you can configure a less expensive The ThreadX RTOS is optimized for fast processor or add more features to your interrupt handling, saving only scratch reg- Full Source Code application without increasing processor isters at the beginning of an interrupt serv- By providing full source code, you have speed and cost. ice routine (ISR), enabling use of all complete visibility into all ThreadX servic- Table 2 shows execution times for vari- services from the ISR, and using the system es – no more mysteries are hidden within ous ThreadX services measured according stack in the ISR. the RTOS “black box.” to the following definitions: This full view of source code enables ThreadX Instruction Area Size (in bytes) you to avoid delays in development • Immediate Response (IR): Time Core Services (required) 4,804 caused by an incomplete understanding required to process the request imme- of RTOS services. With the ThreadX diately, with no thread suspension or Queue Services 3,200 RTOS, full source code is there any time thread resumption. Event Flag Services 1,720 you need it. • Thread Suspend (TS): Time required Semaphore Services 932 to process the request when the calling Conclusion thread is suspended because the Mutex Services 2,540 Express Logic’s ThreadX RTOS is well resource is unavailable. Block Memory Services 1,144 matched for networking and consumer applications based on MicroBlaze proces- • Thread Resumed (TR): Time required Byte Memory Services 1,876 sors. It’s been optimized to deliver the to process the request when a previous- processor’s inherent performance directly ly suspended thread (of the same or Table 1 – Code size of various ThreadX services through to the application. lower priority) is resumed because of What’s next? Express the request. ThreadX Service Call Execution Times Logic is working to port Service IR TS TR TRCS • Thread Resumed and Context NetX™, its TCP/IP net- Switched (TRCS): Time required to tx_thread_suspend 1.0 µs 2.0 µs work stack, to MicroBlaze. process the request when a previously tx_thread_resume 1.0 µs 2.8 µs With so many of today’s suspended thread with a higher priority consumer electronics devices is resumed because of the request. As the tx_thread_relinquish 0.5 µs 1.9 µs accessing the Internet, effi- resumed thread has a higher priority, a tx_queue_send 0.9 µs 2.4 µs 1.6 µs 3.4 µs cient TCP/IP software is context switch to the resumed thread is increasingly critical. The tx_queue_receive 0.9 µs 2.3 µs 2.5 µs 4.2 µs also performed from within the request. NetX TCP/IP stack will pro- tx_semaphore_get 0.3 µs 2.3 µs vide further support for • Context Switch (CS): Time required MicroBlaze in consumer to save the current thread’s context, tx_semaphore_put 0.3 µs 1.3 µs 3.0 µs electronics products, and find the highest priority ready thread, tx_mutex_get 0.4 µs 2.4 µs will enable developers to and restore its context. tx_mutex_put 1.0 µs 2.0 µs 3.6 µs concentrate on their applica- • Interrupt Latency Range (ILR): Time tx_event_flags_set 0.6 µs 1.7 µs 3.4 µs tion code, getting their duration of disabled interrupts. product to market faster. tx_event_flags_get 0.4 µs 2.5 µs To develop a successful, Ease of Use tx_block_allocate 0.4 µs 2.3 µs high-performance consumer, All services are available as function calls, networking, or other elec- tx_block_release 0.4 µs 1.4 µs 3.1 µs and full source code is provided so you can tronic product around see exactly what the ThreadX RTOS is tx_byte_allocate 1.4 µs 2.9 µs MicroBlaze processors, use an doing at any point. tx_byte_release 0.9 µs 3.0 µs 4.7 µs efficient RTOS like ThreadX Many other commercial RTOS prod- by Express Logic. For further ucts are massive collections of unintelligi- Context Switch (CS) 1.2 µs information on the ThreadX ble system calls with non-intuitive names. RTOS and other embedded Interrupt Latency 0.0 µs-0.8 µs The ThreadX RTOS uses service call Range (ILR) software products from names that are orderly, simple in struc- Express Logic, please visit ture, and ultimately easier to use. This Table 2 – Execution times of various ThreadX services www.expresslogic.com.

March 2005 Embedded magazine 49 UseUse anan RTOSRTOS onon YourYour NextNext MicroBlaze-BasedMicroBlaze-Based ProductProduct AnAn RTOSRTOS calledcalled µC/OS-IIµC/OS-II hashas beenbeen portedported toto thethe MicroBlazeMicroBlaze soft-coresoft-core processor.processor.

by Jean J. Labrosse President Micrium, Inc. [email protected]

Designing software for a microprocessor- based application can be a challenging undertaking. To reduce the risk and com- plexity of such a project, it’s always impor- tant to break the problem into small pieces, or tasks. Each task is therefore responsible for a certain aspect of the application. Keyboard scanning, operator interfaces, display, protocol stacks, reading sensors, performing control functions, updating outputs, and data logging are all examples of tasks that a microprocessor can perform. In many embedded applications, these tasks are simply executed sequentially by the microprocessor in a giant infinite loop. Unfortunately, these types of sys- tems do not provide the level of respon- siveness required by a growing number of real-time applications because the code executes in sequence. Thus, all tasks have virtually the same priority.

50 Embedded magazine March 2005 For this and other reasons, consider the partitions. In all, µC/OS-II provides more • The starting address of the task use of a real-time operating system (RTOS) than 65 functions you can call from your (MyTask()) for your next Xilinx MicroBlaze™ proces- application. µC/OS-II can manage as many • The task priority based on the impor- sor-based product. This article introduces as 63 application tasks that could result in tance you give to the task you to a low-cost, high-performance, and quite complex systems. high-quality RTOS from Micrium called µC/OS-II is used in hundreds of prod- • The amount of stack space needed by µC/OS-II. ucts from companies worldwide. µC/OS-II the task. is also very popular among colleges and With µC/OS-II, you provide this infor- What is µC/OS-II? universities, and is the foundation for mation by calling a function to “create a many courses about real-time systems. µC/OS-II is a highly portable, ROM-able, task” (OSTaskCreate()). µC/OS-II scalable, preemptive RTOS. The source maintains a list of all tasks that have code for µC/OS-II contains about Code (ROM) Data (RAM) been created and organizes them by 5,500 lines of clean ANSI C code. It is priority. The task with the highest pri- included on a CD that accompanies a Minimum Size 5.5 Kb 1.25 Kb ority gets to execute its code on the (excluding task stacks) book fully describing the inner work- MicroBlaze processor. You determine ings of µC/OS-II. The book is called 9 Kb how much stack space each task gets MicroC/OS-II, The Real-Time Kernel Maximum Size 24 Kb (excluding task stacks) based on the amount of space needed (ISBN 1-5782-0103-9) and was writ- by the task for local variables, func- ten by the author (Figure 2). The book Table 1 – Memory footprint for µC/OS-II tion calls, and ISRs. also contains more than 50 pages of on the MicroBlaze processor In the task body, within the infi- RTOS basics. nite loop, you need to include calls to RTOS Basics with µC/OS-II Although the source code for µC/OS-II RTOS functions so that your task waits for is provided with the book, µC/OS-II is not When you use an RTOS, very little has to some event. One task can ask the RTOS to freeware, nor is it open source. In fact, you change in the way you design your prod- “run me every 100 milliseconds.” Another need to license µC/OS-II to use it in actu- uct; you still need to break your application task can say, “I want to wait for a character al products. into tasks. An RTOS is software that man- to be received on a serial port.” Yet anoth- Micrium’s application note AN-1013 ages and prioritizes these tasks based on er task can say, “I want to wait for an ana- provides the complete details about the your input. The RTOS takes the most log-to-digital conversion to complete.” porting of µC/OS-II to the MicroBlaze important task that is ready to run and exe- Events are typically generated by your soft-core processor. This application note is cutes it on the microprocessor. application code, either from ISRs or available from the Micrium website at With most RTOSs, a task is an infinite issued by other tasks. In other words, an www.micrium.com. loop, as shown here: ISR or another task can send a message to µC/OS-II was designed specifically for void MyTask (void) a task to wake that task up. With µC/OS- embedded applications and thus has a very { II, a task waits for an event using a “pend” small footprint. In fact, the footprint for while (1) { call. An event is signaled using a “post” call, µC/OS-II can be scaled based on which // Do something (Your code) as shown in the pseudo-code below: µC/OS-II services you need in your applica- // Wait for an event, either: tion. On the MicroBlaze processor, µC/OS- // Time to expire void EthernetRxISR(void) II can be scaled (as shown in Table 1) and can // A signal from an ISR { easily fit into Xilinx FPGA block RAM. // A message from another task Get buffer to store received packet; µC/OS-II is a fully preemptive real-time // Other Copy NIC packet to buffer; kernel. This means that µC/OS-II always } Call OSQPost() to send packet to task attempts to run the highest-priority task that } (this signals the task); is ready to run. Interrupts can suspend the } execution of a task and, if a higher-priority You’re probably wondering how it’s pos- sible to have multiple infinite loops on a task is awakened as a result of the interrupt, void EthernetRxTask (void) single processor, as well as how the CPU that task will run as soon as the interrupt { goes from one infinite loop to the next. service routines (ISR) are completed. while (1) { The RTOS takes care of that for you by µC/OS-II provides a number of system Wait for packet by calling OSQPend(); suspending and resuming tasks based on services, such as task and time manage- Process the packet received; events. To have the RTOS manage these ment, semaphores, mutual exclusion sema- } tasks, you need to provide at least the fol- phores, event flags, message mailboxes, } message queues, and fixed-sized memory lowing information to an RTOS:

March 2005 Embedded magazine 51 any application. Every feature, function, and line of code of µC/OS-II has been EthernetRxIS() (3) OSQPost() examined and tested to demonstrate its OSQPend() (5) (4) (6) safety and robustness for safety-critical systems where human life is on the line. EthernetRxTask() (2) µC/OS-II also follows most of the (7) Motor Industry Software Reliability

Low-Priority Task (1) (8) Association (MISRA) C Coding Standards. These standards were created by MISRA to improve the reliability and predictability of C programs in critical Figure 1 – Execution profile of an ISR and µC/OS-II automotive systems. Your application may not be a safety- The execution profile for these two µC/OS-II then locates the next most critical application, but it’s reassuring to functions is shown in Figure 1. important task that’s ready to run (the know that the µC/OS-II has been sub- Let’s assume that a low-priority task is “Low-Priority Task” in Figure 1) jected to the rigors of such environments. currently executing (1). When an Ethernet and switches to that task by loading packet is received on the Network Interface the MicroBlaze registers with the Card (NIC), an interrupt is generated. The context saved by the ISR (7). This MicroBlaze processor suspends execution is called a context switch. µC/OS-II of the current task (the low-priority task) takes less than 1 microsecond at and vectors to the ISR handler (2). 150 MHz on the MicroBlaze With µC/OS-II, the ISR starts by saving processor to complete this process. all of the MicroBlaze registers, taking a The interrupted task is resumed mere 300 nanoseconds at 150 MHz. The exactly as if it was never inter- ISR handler then needs to determine the rupted (8). source of the interrupt and execute the The general rule for an appli- appropriate function (in this case, cation using an RTOS is to put EthernetRxISR()) (3). most of your code into tasks and This ISR obtains a buffer large enough very little code into ISRs. In fact, to hold the packet received by the NIC. It you always want to keep your copies the contents of the packet received ISRs as short as possible. With into the buffer and sends the address of this an RTOS, your system is easily buffer to the task responsible for handling expanded by simply adding received packets (EthernetRxTask() in our more tasks. Adding lower-pri- example) using µC/OS-II’s OSQPost() (4). ority tasks to your application When this operation is completed, the will not affect the responsive- ISR invokes µC/OS-II (calls a function) ness of higher-priority tasks. and µC/OS-II determines that a higher- priority task needs to execute. µC/OS-II Safety-Critical Systems Certified Figure 2 – MicroC/OS-II book cover then resumes the more important task and µC/OS-II has been certified by the doesn’t return to the interrupted task (5). Federal Aviation Administration (FAA) With µC/OS-II on the MicroBlaze proces- for use in commercial aircraft by meeting Conclusion sor, this takes about 750 nanoseconds at the demanding requirements of the An RTOS is a very useful piece of software 150 MHz. RTCA DO-178B standard (Level A) for that provides multitasking capabilities to When the packet is processed, software used in avionics equipment. To your application and ensures that the most EthernetRxTask() calls OSQPend() (6). meet the requirements for this standard, important task that is ready to run executes µC/OS-II notices that there are no more it must be possible to demonstrate on the CPU. You can actually experiment packets to process and suspends execution through documentation and testing that with µC/OS-II in your embedded product of this task by saving the contents of all the software is both robust and safe. by obtaining a copy of the MicroC/OS-II MicroBlaze registers onto that task’s stack. This is particularly important for an book. You only need to purchase a license Note that when the task is suspended, it operating system, as it demonstrates that to use µC/OS-II for the production phase doesn’t consume any processing time. it has the proven quality to be usable in of your product.

52 Embedded magazine March 2005 UltraController™-ll Ultra-Fast, Ultra-Small PowerPC-Based Microcontroller Reference Design Embedded applications present a variety of design challenges, requiring both hardware and software to accomplish real-time processing. This is because non- performance critical functions and complex state machines are handled more efficiently in software, while timing-critical, parallel structures are handled more efficiently in hardware. The Virtex-II Pro and Virtex-4 FX FPGAs meet these requirements PowerPC Made Easy by providing fast FPGA fabric and The UltraController-II, an enhanced version of the popular UltraController, is a minimal footprint embedded processing engine based on the PowerPC immersed PowerPC™ processors on 405 processor core in the Virtex-II Pro and Virtex-4 FX FPGAs. With exciting the same device. new features, the UltraController-II further simplifies PowerPC based The UltraController-II, an FPGA designs. enhanced version of the popular Ease-of-Use for Hardware and Software Designers – The UltraController-II UltraController, provides the easiest reference design is a pre-verified processing engine with 32 bits of user-defined general purpose input and output as well as interrupt handling capability. way to use the embedded PowerPC It enables FPGA logic centric designers to partition design functions between processor in the Virtex-II Pro and FPGA logic and software while leveraging the embedded PowerPC core in Virtex-4 FX FPGAs. This minimal the Virtex-II Pro and Virtex-4 FX FPGAs, using the award winning Xilinx Platform Studio design flow. footprint embedded processing engine maximizes performance Maximum Performance – Processing power is determined by program execution speed. The UltraController-II can be clocked at the maximum and minimizes FPGA resources PowerPC clock frequency (up to 400 MHz in Virtex-II Pro and 450 MHz in by running code strictly from the Virtex-4 FX FPGA) which far exceeds any soft core processor implementation. integrated PowerPC caches. System It also delivers up to 702 DMIPs with a low 0.45 mW/MHz power consumption in Virtex-4 FX. designers can easily incorporate the UltraController-II design by Minimum FPGA Resource – Code now executes faster by using the 16 KB instruction and 16 KB data-side cache memory. By utilizing cache memory using the award winning Xilinx to hold instructions and data no additional BRAM resources are required. As Platform Studio tool suite. a result, the UltraController-II occupies only 10 logic cells _ 1/5 the logic size of the original UltraController!

March 2005 Embedded magazine 53 Key Features ISE Flow XPS Flow

• Utilizes the PowerPC in the Virtex-II Pro Step 1a: Implement HW Step 1b: Create and Update and Virtex-4 FX (Map, Place & Route) SW Application

• Occupies only 10 logic cells Step 2: Generate Bitstream • Runs at up to 400 MHz in Virtex-II Pro (Bitgen) and 450 MHz in Virtex-4 FX Step 3: Configure FPGA Step 4: Modify & Compile • Code and data execute from the integrated (iMPACT) C code in XPS 16KB instruction and 16 KB data caches

JTAG OPB OPB CNTL EMC GPI0

PLB20PB • 32-bit input and 32-bit output ports OPB/ Bridge OPB/ PPC405 Arbiter Arbite PLB20PB Bridge

BRAM BRAM OPB Block Block UART L/F • External interrupt support

• Timer functionality support (FIT, PIT, Watchdog) Integrated Design Flow • Easy to integrate using the award winning Xilinx Platform Studio (XPS) tool flow Design Flow The UltraController-II development leverages the proven • Fully observable with ChipScope Pro hardware flow of ISE and simplified software flow within the • Verilog and VHDL source code provided Xilinx Platform Studio (XPS) and Embedded Development Kit (EDK) to provide rapid prototyping and debug abilities. The sim- • "C" source code examples provided plified software flow enables a direct code-to-download option with zero hardware impact, yet maintains the standard ISE development tool flow for full synthesis, simulation and PAR capabilities. UltraController Features Comparison

Feature UltraController UltraController-II UltraController-II Take the Next Step in Virtex-II Pro in Virtex-II Pro in Virtex-4 FX Visit www.xilinx.com/ultracontroller to download the Performance Max fmax 200 MHz 400 MHz 450 MHz free UltraController II reference design. Contents include:

Max DMIPS 220 DMIPs 624 DMIPs 702 DMIPs • Complete VHDL/Verilog Source Code

DMIPS/MHz 1.1 1.56 1.56 • IP and software included

Logic Cells 49 Logic Cells 10 Logic Cells 10 Logic Cells • Collateral Included

Power mW/MHz 0.9 0.9 0.45 – QuickStart Tutorial – HDL Sources & Examples UC 4x4: 4 BRAM Usage 0 0 UC 8x8: 8 – “C” Source Code Examples The UltraController II application note (XAPP575) is available at www.xilinx.com/bvdocsappnotes/xapp575.pdf The UltraController application note (XAPP672) is available at www.xilinx.com/bvdocsappnotes/xapp672.pdf

Corporate Headquarters Europe Headquarters Japan Asia Pacific Distributed by Xilinx, Inc. Xilinx, Ltd. Xilinx, K. K. Xilinx, Asia Pacific Pte. Ltd. 2100 Logic Drive Citywest Business Campus Shinjuku Square Tower 18F No. 3 Changi Business Park Vista, #04-01 San Jose, CA 95124 Saggart, 6-22-1 Nishi-Shinjuku Singapore 486051 Tel: 408-559-7778 Co. Dublin Shinjuku-ku, Tokyo Tel: (65) 6544-8999 Fax: 408-559-7114 Ireland 163-1118, Japan Fax: (65) 6789-8886 Web: www.xilinx.com Tel: +353-1-464-0311 Tel: 81-3-5321-7711 RCB no: 20-0312557-M Fax: +353-1-464-0324 Fax: 81-3-5321-7765 Web: www.xilinx.com Web: www.xilinx.com Web: www.xilinx.co.jp

The Programmable Logic CompanySM

© 2005 Xilinx Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the proiperty of their respective owners. Printed in U.S.A. PN 0010751-1

54 Embedded magazine March 2005 Support Across the Board.™

XLerated Solutions Seminar from Avnet Electronics Marketing Showcases the Latest Innovations from Xilinx®

Locations Dates

Minneapolis, MN March 22 Avnet Electronics Marketing introduces a new FREE, one-day seminar that will San Jose, CA March 22 detail how to utilize Xilinx FPGA technologies in the areas of programmable logic, Chicago, IL March 23 connectivity, embedded processing, and design tools. It will include presentations, Ottawa, ON March 24 case studies, and live demonstrations to help you assess how these pioneering Toronto, ON March 25 offerings can be leveraged in your company’s product plans. Boston, MA March 30

Rochester, NY March 31

Seminar attendees will qualify for reduced pricing on special bundles of training Atlanta, GA April 4

classes, software tools, development hardware and the Embedded Systems Huntsville, AL April 5

Starter kit. Free Avnet Design Services evaluation kits (choice of Virtex or Los Angeles, CA April 5

Spartan-based) will also be given away. Irvine, CA April 6

Baltimore, MD April 6

Seminar Topics San Diego, CA April 7

• The Virtex-4 FX family of FPGAs, which delivers breakthrough performance Philadelphia, PA April 7 at the lowest cost and offer a compelling alternative to ASICs and ASSPs. Seattle, WA April 7 • The Spartan-3 FPGA family, which delivers an unmatched combination of Phoenix, AZ April 8 low-cost and full-feature capability. Salt Lake City, UT April 18 • Detailed coverage of the capabilities of these two families Denver, CO April 19 using the Integrated Software Environment (ISE) and Embedded Raleigh, NC April 19 Development Kit (EDK). Orlanda, FL April 20 Ft. Lauderdale, FL April 21 Registration: em.avnet.com/xleratedsolutions Austin, TX April 27 Dallas, TX April 28

Enabling success from the center of technology™ © Avnet, Inc. 2005. All rights reserved. AVNET is a registered trademark of Avnet, Inc. Virtex-4, Spartan-3, MicroBlaze and Platform Studio are registered trademarks of Xilinx. 1.800.408.8353 PowerPC is a trademark of International Business Machines Corporation. www.em.avnet.com PN 0010862