SU 3 EODQATR20 CL ORA XILINX, INC. XCELL JOURNAL ISSUE 53, SECOND QUARTER 2005

Issue 53 Second Quarter 2005 XcellXcelljournaljournal THETHE AUTHORITATIVEAUTHORITATIVE JOURNALJOURNAL FORFOR PROGRAMMABLEPROGRAMMABLE LOGICLOGIC USERSUSERS

PluggingPlugging intointo High-VolumeHigh-Volume ConsumerConsumer ProductsProducts

HIGH VOLUME Spartan-3E: A New Era Multimedia for Automotive DSP Algorithms

DESIGN TOOLS New ISE 7.1i Software Control Your Designs

SERIAL I/O Extend Your Reach

SUBSCRIBE

NOW R SEE PAGE 3 Xilinx is the only FPGA supplier in the world to have achieved high-volume 90nm production, resulting in the lowest-cost FPGAs in the industry. Our leadership in 90nm products gives you all the performance and features you need, at the lowest price points ever.

Both our SpartanTM and VirtexTM product lines__ the world's most widely adopted FPGAs__ are shipping on our optimized 90nm process now. Contact your Xilinx rep today and let's ramp up for success together.

The Programmable Logic CompanySM www.xilinx.com/spartan3 Pb-free devices available now

©2004 Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124. Europe +44-870-7350-600; Japan +81-3-5321-7711; Asia Pacific +852-2-424-5200; Xilinx is a registered trademark, Spartan and Virtex are trademarks, and The Programmable Logic Company is a service mark of Xilinx, Inc. LETTER FROM THE EDITOR What’s Both High and Low, and Used Everywhere...?

This issue of the Xcell Journal focuses on high-volume, low-cost consumer applications and the design tools used to implement them. Business Viewpoints Our Business Viewpoints column offers an independent business perspective by Rich Wawrzyniak, Senior Analyst, ASICs Services for Semico Research Corporation. His article,

EDITOR IN CHIEF Carlis Collins “Changing the Systems Landscape with Low-Cost FPGAs,” discusses how advanced processing [email protected] technology is driving down the costs of FPGAs and changing the ASSP/ASIC crossover point. 408-879-4519 T High-Volume Solutions MANAGING EDITOR Forrest Couch Approximately one year ago Xilinx® engineers were asked to develop the lowest cost, highest density [email protected] 408-879-5270 FPGA family on the market. This section features articles using the new Spartan™-3E devices in several high-volume applications. Their low cost makes these devices successful in applications ASSISTANT MANAGING EDITOR Charmaine Cooper Hussain previously reserved for ASIC and gate-array technologies.

XCELL ONLINE EDITOR Tom Pyles Industry Expert [email protected] In “Extend Your Reach,” well-known signal integrity expert Howard Johnson discusses the 720-652-3883 Virtex™-4 RocketIO™ transceiver, which incorporates three forms of equalization to stretch the

ADVERTISING SALES Dan Teie reach and performance of serial links. 1-800-493-5551 Design Tools ART DIRECTOR Scott Blair Software support for the Spartan-3E family is available with the newly announced ISE™ 7.1i tools. This section has articles from Xilinx and some of its partners describing the use of their design tools to implement a complete solution. Board Room The Board Room section describes some of the Spartan-3 hardware development boards and other design solutions to help you determine which platform is best for your application and design task. Power Play The new Power Play section features technical descriptions of our partners’ power solutions. In Steve Sharp’s lead article, “Conquering the Three Challenges of Power Consumption,” he writes,

Xilinx, Inc. “To conquer the key challenges of power consumption, it takes a combination of good product 2100 Logic Drive San Jose, CA 95124-3400 design, proper device technology, and tools that let you take control of system power management.” Phone: 408-559-7778 FAX: 408-879-4780 And... It’s Time to Re-Subscribe www.xilinx.com/xcell/ Periodically, we must clean our mailing database. As of January 1, 2005, you must re-subscribe to continue receiving the Xcell Journal FREE. If you subscribed after January 1, 2005, you do not © 2005 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands included have to re-subscribe. If you subscribed before this date, please visit our website at herein are trademarks of Xilinx, Inc. PowerPC is a trade- mark of IBM, Inc. All other trademarks are the property www.xilinx.com/xcell/subscribe/ and take a minute to renew your FREE subscription and ensure of their respective owners. uninterrupted delivery.

The articles, information, and other materials included in this issue are provided solely for the convenience of our readers. Xilinx makes no warranties, express, implied, statutory, or otherwise, and accepts no liability with respect to any such articles, information, or other materials or their use, and any use thereof is solely at the risk of the user. Any person or entity using such information in any way releases and waives any claim it might have against Xilinx for any loss, damage, or expense caused thereby. Forrest Couch Managing Editor TABLE OF CONTENTS

INDUSTRY EXPERT DESIGN TOOLS

63

This section comprises articles from Xilinx and its partners describing the use of the newly announced ISE 7.1i software, 66 as well as other design tools. This article, by well-known signal integrity expert Howard POWER PLAY Johnson, discusses the Virtex-4 RocketIO transceiver, which incorporates three forms of equalization to stretch the reach and performance of serial links. 104 High-Volume Solutions The latest FPGAs from Xilinx set new records in capacity, capability, performance, power efficiency, and value. 13 The new Power Play section features technical descriptions of our partners’ power solutions.

BOARD ROOM

112 This section describes some of the Spartan-3 hardware development boards and other design solutions. SECOND QUARTER 2005, ISSUE 53 Xcell journal

INDUSTRY EXPERT Extend Your Reach -- Using RocketIO Transceivers...... 6

BUSINESS VIEWPOINTS Changing the Systems Landscape with Low-Cost FPGAs ...... 10

HIGH-VOLUME SOLUTIONS Leading the High-Volume Programmable Revolution ...... 13 Spartan-3E FPGAs Introduce a New Era in Low-Cost Programmable Logic...... 14 Implementing New Configuration Options for the Spartan-3E Family...... 16 Encoding High-Resolution Ogg/Theora Video with Reconfigurable FPGAs...... 19 Implementing DSP Algorithms Using Spartan-3 FPGAs ...... 22 Designing a Spartan-3 FPGA DDR Memory Interface...... 26 Signal Processing Capability with the Nu Horizons Spartan-3 Development Board...... 28 CoolRunner-II CPLDs Offer New Features...... 31 Reducing Bill-of-Materials Cost Using Logic Consolidator ...... 34 Using CoolRunner-II Silicon Features to Reduce Cost...... 36 Using Spartan-3/3E Features to Area-Optimize Your Design...... 39 Low-Cost EasyPath FPGAs Offer Promise to ASSP Companies...... 42 Applications of the Spartan Family in Flat-Panel Displays...... 46 Managing Signal Quality ...... 49 Meeting Timing and Reducing Area with the Synplify Pro Tool ...... 52 The Changing Face of Automotive ECU Design...... 54 A Multimedia Platform for Automotive and Consumer Markets...... 57 A Low-Cost PCI Express Solution...... 60

DESIGN TOOLS Design Leadership From Xilinx -- Introduction...... 63 Design Performance Leaps Forward with ISE 7.1i Software...... 64 Debugging with Combined Oscilloscope and Logic Analyzer Measurements...... 67 Real-Time Debug That Dominates ...... 70 Integrating MATLAB Algorithms into FPGA Designs...... 73 Complex FPGAs Require Equivalence Checking...... 76 Complete FPGA and CPLD Power Analysis ...... 80 Increase Performance and Lower Cost Through System Design...... 83 Control Your Designs with the PlanAhead Hierarchical Design and Analysis Tool ...... 86 Building High-Performance Measurement and Control Systems with FPGAs...... 90

GENERAL The Hydra Project (Chess Program)...... 94 Accelerating Time to Market with Embedded Processing QuickStart!...... 98 Bioinformatics Algorithms on Nallatech-Configurable Multi-FPGA Systems...... 100

POWER PLAY...... 104 Conquering the Three Challenges of Power Consumption...... 104

BOARD ROOM ...... 112

REFERENCE...... 118 INDUSTRY EXPERT ExtendExtend YourYour ReachReach RocketIO transceivers included in the Virtex-4 FPGA family incorporate highly flexible equalization circuits that significantly extend the range and performance of high-speed serial links. by Howard Johnson, Ph.D. President Signal Consulting, Inc. [email protected] Mike Degerstrom Senior Staff Signal Integrity Design Engineer Xilinx, Inc. [email protected]

Every multigigabit backplane, trace, and cable distorts the signals passing through it. This degradation may be slight or dev- astating, depending on the conductor geometry, materials, length, and type of connectors used. Because they spend their lives working with sine waves, communications engineers like to characterize this distortion in the frequency domain. Figure 1 shows the channel gain, also called the frequency response, of a perfectly terminated typical 50 ohm stripline (or 100 ohm differential stripline). This stripline acts like a low-pass filter, attenuating high-frequency sine waves more than lower frequency waves. Figure 2 illustrates the degradation inherent to a digital signal passing through 20 inches (.5 meters) of FR-4 stripline. The dielectric and skin-effect losses in the trace reduce the amplitude of the incident pulse and disperse its rising and falling edges. We like to call the received pulse, much small- er than normal, a “runt pulse.” In a binary communication system, any runt pulse that fails to cross the receiver threshold by a suf- ficient margin causes a bit error.

6 Xcell Journal Second Quarter 2005 INDUSTRY EXPERT

For the purposes of this discussion, 0 Runt-Pulse Degradation three things degrade the amplitude of 0.1m On the left side of Figure 4 is a sine the runt pulse in a high-speed serial –5 wave with a period of two baud. To link: losses in the traces or cables, reflec- the extent that the runt-pulse pattern tions due to connectors and other signal –10 (101) looks somewhat like this sine transitions, and the limited bandwidth 0.5m wave, you should be able to infer the of the driver and receiver. Channel Gain, dB –15 1m runt-pulse amplitude from a fre- A classic test of dispersion appears in 1.5m –20 quency-domain plot of channel Figure 3. This particular waveform – 108 109 1010 attenuation. Let’s try it. adjusted so that the long flat portions of Frequency, Hz In Figure 4, the data waveform the test signal represent the worst-case, FR-4 stripline, t =1/2 oz. Cu; has a baud rate of 2.5 Gbps. One half w µ Z Ω; longest runs of ones or zeros available in =152 m (6 mil); 0=50 no connectors this frequency (the equivalent sine your data code – displays the runt-pulse Figure 1 – The effective channel gain associated with wave frequency) equals 1.25 GHz. amplitude. In the absence of reflections, a long PCB trace depends on the trace width, dielectric According to Figure 5, the half-meter crosstalk, or other noise, this single materials, length, and type of connectors used. curve gives you 4.5 dB of attenuation waveform (as measured at the receiver) at 1.25 GHz. The same curve also represents a worst-case test of channel shows 1.5 dB of attenuation at dispersion. Longer traces introduce pro- 20 in. (Ωm) 1/10th this frequency, corresponding gressively more dispersion, eventually roughly to the lowest frequency of causing receiver failure at (in this exam- 50W interest in an 8B10B coded data

– – ple) a length of 1.5 meters. Z V 100W transmission system. The difference INPUT DIFF = 100W DIFF + One measure of signal quality at the ( ñ ) + between these two numbers (-3 dB) receiver is voltage margin. This number (+) approximates the ratio of runt-pulse equals the minimum distance (in volts) amplitude to low-frequency signal between the signal amplitude and the amplitude at the receiver. With only receiver threshold at the instant sam- -3dB degradation, the system satisfies Figure 2 – Long traces reduce the amplitude of the pling occurs. In a system with zero input pulse and disperse its rising and falling edges. our 70% frequency-domain criterion reflections, crosstalk, or other noise you for solid link performance – precisely could theoretically operate with a very explaining why time-domain wave- 0.1m small voltage margin and still expect the Signal amplitude at receiver forms look so good at a half-meter. 0.5m system to operate perfectly. 1.0m Looking closely at Figure 4, the In a practical system, however, you 1.5m actual runt-pulse amplitude in the must maintain a healthy noise margin suf- Receiver Threshold time domain is 85%, not quite as ficient to soak up the maximum ampli- bad as the -3dB predicted by our tude of all reflections, crosstalk, and other quick frequency-domain approxi- noise in the system, while still keeping the mation. This discrepancy arises part- received signal sufficiently above the Time (1-bit per tick - 400 ps) ly from the harmonic construction threshold to account for the limited band- Figure 3 – This test waveform displays the worst-case of a square wave, where the funda- width and noise inherent to the receiver. runt-pulse amplitude. mental amplitude exceeds the Following the example in Figure 4, a amplitude of the square wave signal runt-pulse amplitude equal to 85% of from which it is extracted, and part- the nominal low-frequency signal Sine wave at half baud rate ly from the natural fuzziness inher- amplitude exceeds the receiver threshold Signal amplitude ent to any quick rule-of-thumb at receiver by only 35%, instead of the nominal 0.5m trace translation between the time and

50%. A smaller runt pulse with ampli- 100% frequency domains. The simple fre- tude 75% of the normal size would 35% quency-domain criteria conserva- Receiver 85% reduce the voltage margin by half – a Threshold tively estimates these factors. huge hit to your noise budget, but still If your data code permits longer workable. For generic binary communi- runs of zeros or ones than 8B10B cation using no equalization, we would 400 ps/div coding, then you must use a corre- like to see the runt pulse arrive with Figure 4 – A runt-pulse amplitude equal to 85% of the nominal spondingly lower frequency as your amplitude never smaller than 70% of low-frequency signal amplitude reduces the voltage margin above “lowest frequency of interest.” In the the low-frequency pulse amplitude. the threshold to only 35%, instead of the nominal 50%. time domain, you will see the

Second Quarter 2005 Xcell Journal 7 INDUSTRY EXPERT

0 Figure 6 illustrates a not two but three delayed terms, called ñ– 3 dB 0.1 m simple binary waveform the pre-cursor, cursor, and post-cursor. – 5 x[n] and the related first- This architecture gives you the capacity

f f B /20 B /2 difference waveform to realize both first and second differ- – 10 8B/10B x[n]-x[n-1]. If you are ences by adjusting the coefficients associ- 0.50.5m m

Channel gain, dB –15 11m m familiar with calculus, ated with these three terms. 1.51.5m m you can think of the first- Programmable 5-bit multiplying DACs –20 10 8 10 9 10 10 difference waveform as a control the three coefficients. The first Frequency, Hz kind of derivative opera- and third amplitudes are always inverted Stripline trace: t =1/2 oz. Cu tion. On every edge, the with respect to the main center term, a w µ =152 m (6 mil) difference waveform cre- trick that is accomplished by using the Z0=50Ω ates a big kick. The trans- NOT-Q outputs of the first and third Figure 5 – The difference between high-frequency and low-frequency mit pre-emphasis circuit flip-flops. As an example, Figure 7 plots channel gain in this 2.5 Gbps system equals 3 dB. adds together a certain the frequency response corresponding to proportion of the main the particular coefficient set [-0.056, received signal creep closer to the floor (or signal and the first difference waveform 0.716, -0.228]. ceiling) of its maximum range before the to superimpose the big kick at the begin- Over the critical range from DC to runt pulse occurs, making it even more ning of every transition. As viewed by the 1.25 GHz, the pre-emphasis response rises difficult for the worst-case runt pulse to receiver, each kick boosts the amplitude smoothly – just the opposite of the plum- cross the threshold. of the runt pulses without enlarging low- meting curves drawn in Figure 5. The As a rule of thumb, we look at the dif- frequency portions of your signal, which response peaks at 1.25 GHz. If you clock ference between the channel attenuation are already too big. this pre-emphasis circuit at a higher data at the highest frequency of operation (the The first-difference idea helps you see rate, the peak shifts correspondingly high- 101010 pattern) and the lowest frequen- how pre-emphasis works, but that is not er, always appearing just where you want it cy of operation (determined by your data how it is built. The actual circuit sums at a frequency equal to half the data rate. coding run length) to quickly estimate Figure 8 overlays the pre- the degree of runt-pulse amplitude degra- emphasis response with the dation at the receiver. This simple fre- channel response at 1 meter, quency-domain method only crudely showing a composite result Binary waveform x[n] estimates link performance. It cannot (the equalized channel) that First difference substitute for rigorous time-domain sim- x([n]-x[n-1]) appears much flatter than ulation, but it can greatly improve your Sum either curve alone. In very understanding of link behavior. simplistic terms, a flatter com- A channel with less than 1 dB of runt- a*x[n]+b*(x[n]-x[n-1]) posite channel response pulse degradation works great with just should make a better-looking about any ordinary CMOS logic family, Figure 6 – The transmit pre-emphasis circuit creates signal in the time domain. assuming that you solve the clock skew a big kick at the beginning of every transition. The time-domain benefits problem either with low-skew clock dis- of pre-emphasis appear in tribution or by using a clock recovery Figure 9. At shorter distances unit at the receiver. A channel with as Equalized the signal appears over-equal- D Q DQ DQ Output much as 3 dB degradation requires noth- to Line ized. The overshoot at each ing more sophisticated than a good dif- Q Q Q SΣ transition works fine in a bina- 0.716 ferential architecture with tightly placed .056 ry system, assuming that the well-controlled receiver thresholds. A Rises smoothly receiver has ample headroom channel with 6 dB of degradation 0 to avoid saturation with the requires equalization. maximum-sized signal. At 1 – 5 meter, the signal looks quite Transmit Pre-Emphasis Function, dB 1.25 GHz Equalizer Transfer nice, with very little runt- The Xilinx® Virtex™-4 RocketIO™ – 10 pulse degradation visible and 10 8 10 10 transceiver incorporates three forms of Frequency, Hz (if you look closely) very little equalization that extend your reach on jitter. The 1.5 meter waveform deeply degraded channels. The first is Figure 7 – Over the critical range from DC to 1.25 GHz, the pre-emphasis response rises smoothly. now just meets the 70% criteria transmit pre-emphasis. for runt-pulse success.

8 Xcell Journal Second Quarter 2005 INDUSTRY EXPERT

Compared to a simple differential Decision-Feedback Equalizer architecture, the pre-emphasis Channel As a last defense against the slings and circuit has at least doubled the arrows of uncertain channel perform- length of channel over which you Equalizer ance, the RocketIO transceiver includes a 5 dB/div may safely operate. Equalizer Channel manually adjustable six-tap decision- feedback equalizer (DFE). This device is Linear Receive Equalizer integrated into the slicer circuit at the Frequency band of interest In addition to the pre-emphasis receiver. The DFE is particularly useful circuit, the RocketIO transceiver with poor-quality legacy channels not also incorporates a sophisticated initially designed to handle high serial 6-zero, 9-pole receive-based linear data rates. It has the remarkable property 108 109 1010 equalizer. This circuit precedes of accentuating the incoming signal Frequency, Hz the data slicer. It comprises three without exacerbating crosstalk. cascaded stages of active analog Those of you familiar with signal Figure 8 – Composing the pre-emphasis circuit equalization that may be individ- with the channel produces an overall response much processing will recognize that a DFE ually enabled, turning on zero, flatter than either curve alone. inserts poles into the equalization one, two, or all three stages in network, while a TX pre-emphasis succession. circuit creates zeros. (A very accessi- Figure 10 presents the set of 0.1m ble book about digital equalization, 0.50.5m m four possible frequency-response 11.0m m including DFE circuits, is John A.C. curves attainable with this receiv- 1.5m Bingham’s “The Theory and Practice er-equalization architecture. Each Receiver Threshold of Modem Design.”)

section of the equalizer is tuned Working together, the DFE, TX-pre- Far End of Line of End Far

to approximate the channel at Amplitude Singal emphasis, and RX linear equalizer pro- response of a typical PCB channel vide an incredibly rich array of possible Time (1-bit per tick - 400 ps) with an attenuation of about 3 dB adjustments. at 2.5 GHz. With all stages on, Figure 9 – A pre-emphasis circuit at least doubles the you get a little more than 9 dB of length of channel over which you may safely operate. Conclusion boost at 2.5 GHz. Because the For any channel with as much as 6 dB of response keeps rising all the way runt-pulse degradation, a simple pre- to 5 GHz, this equalizer is useful emphasis adjustment easily doubles the 16 for data rates up to and beyond 3 length at which your link operates. 10 Gbps. If you anticipate more than 6 dB of When setting up the equalizer, 12 runt-pulse degradation, we strongly 2 first select the number of sections suggest that you simulate your system of the RX linear equalizer that dB 8 in detail before making the final equal- best match your overall channel 1 TypicalTypica lFR-4 FR- izer adjustments. Contact your local response. Then fine-tune the Materialmaterial Curve cu rve Xilinx customer support office or visit 4 overall pulse response using the the Xilinx website to obtain the neces- 0 5-bit programmable coefficients sary RocketIO models and associated in the transmit pre-emphasis cir- 0 10 781010 9 1010 design kits for modeling your channel. cuit to obtain the lowest ISI, the Frequency, Hz The modeling effort is well worth it, as lowest jitter, or a combination of equalization can substantially extend both. After building the circuit, the reach of your circuits. Figure 10 – The linear equalizer in the receiver may be set to a clock phase adjustment internal one of four distinct response curves preprogrammed to match to the receiver helps you map the response of various lengths of FR-4 PCB trace. Howard Johnson, PhD, is the author of High- Speed Digital Design and High-Speed Signal out bit error rate (BER) bathtub Propagation. He frequently conducts technical curves, so you can corroborate workshops for digital engineers at Oxford the correctness of your equalizer standards, meeting exact transmitted University and other sites worldwide. For more settings. information, visit www.sigcon.com, or -mail signal specifications and at the same [email protected]. The flexibility provided by these two time adding receiver-based equalization Figures 1, 3, 4, and 9 are adapted with permission from forms of equalization lets you interoper- to keep your system working at the peak Johnson and Graham, High-Speed Signal Propagation: ate with an amazing array of serial-link of performance. Advanced Black Magic, Prentice-Hall, 2003.

Second Quarter 2005 Xcell Journal 9 BUSINESS VIEWPOINTS Changing the Systems Landscape The advent of low-cost FPGAs with Low-Cost FPGAs signals the dawn of a new era.

by Richard Wawrzyniak Why is this important, and how does it shortest amount of time to meet a moving Sr. Market Analyst: ASIC and SoC change the systems landscape? Lower price market window but do not have the engi- Semico Research Corp. points for a given performance level allow neering or financial resources necessary to [email protected] system designers and system architects to commission a new system-on-chip design. have an end product with increased per- Figure 2 shows the relationship between We are living in very exciting times today, formance and a lower BOM. For systems design cycle times, gate count, and product with new, compelling consumer applica- manufacturers, this leads to increased prof- life cycles. Designers are under pressure to hit tions springing up seemingly out of itability, a lower overall system cost, or both. a market window that is shrinking, while at nowhere. These exciting times can be At face value, this situation seems the same time their designs are becoming applied specifically to FPGAs, as we con- rather straightforward: lower device costs more complex. These trends are at odds with tinue to see growth and widespread accept- equal lower system costs – everyone is a product life cycles, which are also shrinking. ance of the concept of programmable logic winner. However, the real answer to this The net effect of these trends is to make by the marketplace at large. question goes much deeper than lower the design environment more challenging The wind in the sails of the FPGA indus- costs. To gain a better perspective on this for ASIC designers. Even before a design is try is our old friend Moore’s Law, which issue, we must explore some of the back- complete, the market at which it is aimed states that semiconductor devices will see a ground of the FPGA market and the ASIC could have changed – necessitating changes doubling of available transistors every 18 to industry as a whole. in the design itself. This quickly becomes a 24 months. The increase in transistor budg- no-win situation for all concerned. ets available to designers and the movement A Difficult Design Environment Furthermore, as the cost to create a to ever-tighter process geometries translates In today’s market environment, there are a complex ASIC solution increases, so too directly into devices with rising performance series of trends adversely influencing the must the size of the market towards which levels and smaller die areas. ASIC design community. One of the the solution is targeted. If the design costs A good example of this is the recent intro- trends most often discussed is the rise in for a complex ASIC are in the $30 million duction of FPGAs produced using 90 nm NRE, which are costs associated with the range, the size of its target market must be process geometries. Smaller geometries have design of ASICs. Figure 1 shows how these many times larger than the design costs just allowed FPGAs with more logic elements costs have increased by process geometry. to recover the initial investment in the sili- and embedded memory, while at the same NRE costs have risen in large part con. Admittedly, there are not many appli- time decreasing the die area needed for the because of the rising device complexity. As cations today that can command such high solution. A part with a given performance the possible gate counts have increased at unit volumes to guarantee the recovery of level using 130 nm process geometries can each process node, so too has the task of this initial investment. now be produced with a higher performance assembling these gates into a design that An additional detrimental effect is the level in a smaller die area at 90 nm. Die area meets the needs of the market. decline in ASIC design starts because of and yield have a great impact on device costs, With its low-to-none NRE costs, pro- rising design costs, decreasing product so FPGA manufacturers can offer these new grammable logic is a great aid to designers life cycles, and shrinking market win- 90 nm families at decreasing price points. who need to prove out their designs in the dows. Several of these trends and ASIC

10 Xcell Journal Second Quarter 2005 BUSINESS VIEWPOINTS

design start numbers for various point, many of the gains seen ASIC product types (including Increase in SoC NRE Costs through the use of program-

FPGAs and PLDs by end appli- 120 $45 mable logic were given up; cation) are detailed in an $40 designers were back on a more 100 upcoming report from Semico $35 traditional ASIC path, with all titled, “ASIC Design Starts: An 80 $30 the previously mentioned Industry in Transition.” $25 problems and pitfalls. 60

M Gates $20 As shown in Figure 3, low- M$ Dollars Low-Cost FPGAs 40 $15 cost FPGAs have pushed out the Change the Equation $10 point at which a standard cell or 20 Now that we’ve reviewed current $5 SoC solution must be employed trends in the ASIC market, we can 0 $0 to arrive at the absolute lowest now look at how low-cost FPGAs 350nm 250nm 180nm 130nm 90nm 65nm cost. This chart also takes into can change the landscape to one Gate Count NRE Charges consideration the total cost to more favorable for designers. Figure 1 - NRE costs by process geometry Source: Semico Research Corp. create the ASIC solution (NRE Designers can now apply the charges and mask set costs, for traditional strengths of program- example), as compared to the mable logic to their system solu- Ever-Tightening Circle: Design Time vs. Market Window same cost to create the silicon tions – namely the low cost of 80 14 solution using programmable entry, off-the-shelf availability, 70 12 logic. The benefits are numer- short design cycles, reprogram- 60 ous, particularly the ability to mability itself, and a known and 10 use a silicon solution that fits 50 working architecture. Plus, they 8 into market windows and is at can use FPGAs as test vehicles to 40 the right price point. M Gates

6 (Months) examine and prove out different 30 types of semiconductor intellec- 4 Conclusion Shrinking Product Life Cycle tual property (SIP) with an ease 20 As FPGA vendors continue to that is impossible in standard cell 10 2 use smaller process geometries or SoC markets. ASIC designers 0 0 when they become available, the can test out several different types 6 7 8 9 10 11 12 13 14 15 16 17 18 ASIC industry will see a further of SIP using FPGAs in the same Months extension of FPGAs into appli- time it took just to try out one Design Complexity Design Time Product Life Cycle cations where they can deliver a type of SIP using the traditional solid solution at the necessary standard cell or SoC route. This Figure 2 - Comparison of design and market trends Source: Semico Research Corp. price points. in itself is a great aid to ASIC Designs and end-system solu- designers in arriving at the best tions that might not have made it possible silicon solution. into production because of a pro- However, the road for hibitively long design cycle now FPGAs does not end here; this is have a much better chance of suc- merely the start of the story in ceeding, whereas before they Point terms of aiding ASIC designers. might have been cancelled because they couldn’t make that Previously, when FPGA silicon New Crossover Point Traditional Crossover was more costly, ASIC designers ever-important market window. could successfully prototype The introduction of low- with FPGAs and enter into lim- cost, full-featured FPGAs has ited production with their changed the systems landscape FPGA solution. But at some for the better. Today, ASIC point, usually between 30,000 5000 1000025000 50000 100000 250000 500000 1000000 2000000 designers or system architects and 100,000 units (depending Unit Volumes can use a programmable logic on the price sensitivity of the solution with the knowledge Higher-Cost FPGA Low-Cost FPGA Standard Cell end application), the design that a FPGA can be used farther would change over to a full- into the production cycle than Figure 3 - Deeper volume penetration by FPGAs Source: Semico Research Corp. blown ASIC solution. At this was previously possible.

Second Quarter 2005 Xcell Journal 11 X-ray vision for your designs Agilent 16900 Series logic analysis system with FPGA dynamic probe

Now you can see inside your FPGA designs in a way that will save days of development time.

The FPGA dynamic probe, when combined with an Agilent 16900 Series logic analysis system, allows you to access different groups of signals to debug inside your FPGA— without requiring design changes. You’ll increase visibility • Increased visibility with FPGA dynamic probe into internal FPGA activity by gaining access up to 64 ® • Intuitive Windows XP Pro user interface internal signals with each debug pin. • Accurate and reliable probing with soft touch connectorless probes • 16900 Series logic analysis system prices starting at $21,000 You’ll also be able to speed up system analysis with the 16900’s hosted power mode—which enables you and your team to remotely access and operate the 16900 over the network from your fastest PCs. Get a quick quote and/or FREE CD-ROM with video demos showing how you can The intuitive user interface makes the 16900 easy to get up reduce your development time. and running. The touch-screen or mouse makes it simple to U.S. 1-800-829-4444, Ad# 7909 use, with prices to fit your budget. Optional soft touch Canada 1-877-894-4414, Ad# 7910 connectorless probing solutions provide unprecedented www.agilent.com/find/new16900 reliability, convenience and the smallest probing footprint www.agilent.com/find/new16903quickquote available. Contact Agilent Direct today to learn more.

©Agilent Technologies, Inc. 2004 Windows is a U.S. registered trademark of Microsoft Corporation HIGH-VOLUME SOLUTIONS LeadingLeading thethe High-VolumeHigh-Volume ProgrammableProgrammable RevolutionRevolution Xilinx pioneered – and leads – the adoption of PLDs in low-cost systems.

by Sandeep Vij Vice President, Worldwide Marketing Xilinx, Inc. Table of Contents [email protected] Spartan-3E FPGAs Introduce a New Era in Low-Cost Programmable Logic...... 14 Implementing New Configuration Options for the Spartan-3E Family...... 16 Xilinx® was the first FPGA company to recognize the market potential for FPGAs in high-volume, low-cost systems. In Encoding High-Resolution Ogg/Theora Video with Reconfigurable FPGAs ...... 19 1998, Xilinx released the very first FPGA designed for these Implementing DSP Algorithms Using Spartan-3 FPGAs...... 22 systems – the Spartan™ device family. Today, the Spartan series of FPGAs are broadly adopted in high-volume con- Designing a Spartan-3 FPGA DDR Memory Interface ...... 26 sumer applications, with more than 100 million units Signal Processing Capability with the shipped and more than $1 billion in cumulative revenue. Nu Horizons Spartan-3 Development Board...... 28 With the latest release of the Spartan-3E family, Xilinx CoolRunner-II CPLDs Offer New Features...... 31 has reduced the cost of the Spartan device 30X since its introduction, while offering more platform features that Reducing Bill-of-Materials Cost Using Logic Consolidator...... 34 enable higher levels of integration. Using CoolRunner-II Silicon Features to Reduce Cost ...... 36 Xilinx recognizes that an important part of high-volume Using Spartan-3/3E Features to Area-Optimize Your Design ...... 39 PLDs is the non-volatile CPLD solution. The Xilinx portfo- lio of CPLD products is one of the most comprehensive in Low-Cost EasyPath FPGAs Offer Promise to ASSP Companies ...... 42 the industry – ranging from lowest cost XC9500/XL devices Applications of the Spartan Family in Flat-Panel Displays...... 46 to the ultra-low-power CoolRunner™-II device family. This Managing Signal Quality...... 49 has allowed us to grow our market share in the CPLD mar- ket for 16 consecutive annualized quarters, propelling Xilinx Meeting Timing and Reducing Area with the Synplify Pro Tool...... 52 to the number-two position in the CPLD market. The Changing Face of Automotive ECU Design ...... 54 Xilinx conceived of and developed the market for high- A Multimedia Platform for Automotive and Consumer Markets ...... 57 volume programmable logic products. In this issue of the Xcell Journal, you’ll find articles about products and appli- A Low-Cost PCI Express Solution ...... 60 cations within this market – with many contributions from satisfied customers.

Second Quarter 2005 Xcell Journal 13 HIGH-VOLUME SOLUTIONS Spartan-3ESpartan-3E FPGAsFPGAs IntroduceIntroduce aa NewNew EraEra inin Low-CostLow-Cost ProgrammableProgrammable LogicLogic The Spartan-3E XC3S100E FPGA is the first 100,000-gate device available for under $2.

by Richard Terrill Senior Manager, High-Volume Products Marketing large I/O requirements. The Spartan-3E The Spartan-3E Family Xilinx, Inc. family is optimized for gate-centric designs, The Spartan-3E family is our newest [email protected] and is well-suited for designs that require a low-cost FPGA family, and further relative higher gate-to-I/O ratio. The older reduces the price points for low-cost First introduced in 1998, the Spartan™ Spartan-II/IIE and Spartan-XL families FPGAs to unprecedented levels. family was the world’s first FPGA series tai- remain suitable candidates for legacy designs Through 90 nm process technology, lored for low-cost applications. With the or for systems with higher core voltages. 300 mm wafers, and application-driven introduction of the Spartan-3E family, Spartan-3 FPGAs have found remarkable architecture choices, Xilinx has extended Xilinx® now has seven families of Spartan success in the production of systems that FPGAs into volumes and applications FPGAs in production – with more than typically would have used an ASIC or gate previously reserved for mask-pro- 100 million units shipped to date. The array. For example, many flat-panel display grammed ASICs. Spartan-3E devices Spartan series features the world’s most systems employ Spartan-3 devices to man- offer one of the lowest costs-per-logic accepted low-cost FPGA architecture and age the display driver and control functions. (CPL) of any FPGA. is familiar to thousands of engineers. The ability to modify the design after layout Spartan-3E FPGAs have been archi- Moore’s law allows Xilinx to offer ever- and adapt the system to changing market tected for digital consumer applications, lower prices in the Spartan series of FPGAs. conditions makes FPGAs highly desirable. and all high-volume/low-cost applica- The Spartan-3E family is our third Spartan Spartan-3E devices extend the reach of tions will benefit from its advanced fea- family in 90 nm, and gives us the lowest FPGAs into production volumes by fur- tures and capabilities. possible manufacturing costs. These low ther reducing costs, while preserving the The five-member family ranges from costs make programmable logic successful low NRE and high flexibility of program- the 100,000-gate XC3S100E through in high-volume and low-cost production mable logic. Many mainstream applica- the 1.6 million-gate XC3S1600E, adding applications, an area previously reserved for tions using Spartan-3E FPGAs will have features such as 64/66 PCI, mini LVDS, ASIC and gate-array technologies. an ASIC crossover point of 250,000 units and faster embedded multipliers for low- The Spartan-3 family (introduced in – meaning that the total cost favors cost DSP, all to better serve low-cost 2003) is optimized for I/O-centric designs, Spartan-3E devices over ASICs for the first applications. Table 1 highlights the key and is ideally suited for systems that have quarter-million units of production. features of the Spartan-3E family.

14 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS

Spartan-3E FPGA Family • PCI 64/66 and PCI-X support

XC3S100E XC3S250E XC3S500E XC3S1200E XC3S1600E • DDR 333 memory interface System Gates 100K 250K 500K 1,200K 1,600K Logic Cells 2,160 5,508 10,476 19,512 33,192 • Mini-LVDS, RSDS Block RAM Bits 72K 216K 360K 504K 648K • DCM clock frequency input 15K 38K 73K 136K 231K Distributed RAM Bits range expanded down to 5 MHz DCMs 2 4 4 8 8 Multipliers 4 12 20 28 36 (ideal for video) I/O Standards 18 18 18 18 18 • 325 MHz multipliers aligned with 108 172 232 304 376 Max Single-Ended I/O block RAM for low-cost DSP Max Differential I/O Pairs 40 68 92 124 156 Package and I/O Offerings One of the most significant new fea- XC3S100E XC3S250E XC3S500E XC3S1200E XC3S1600E tures available in Spartan-3E devices is sup- VQ100 14 x 14 mm 66 66 port for low-cost serial commodity Flash CP132 8 x 8 mm 92 92 TQ144 20 x 20 mm 108 108 configuration memory. With Spartan-3E PQ208 28 x 28 mm 158 158 FPGAs you can use generic, low-cost serial FT256 17 x17 mm 172 190 190 EPROMs or byte-wide Flash devices avail- FG320 19 x 19 mm 232 250 250 able from multiple vendors to configure FG400 21 x 21 mm 304 304 the device. You can use the configuration 376 FG484 23 x 23 mm memory for other system functions, and Table 1 – Spartan-3E family key features matrix even reprogram it under the control of the Spartan-3E FPGA, adding tremendous suited for I/O-intensive FPGA designs. flexibility for system designers. Together, Spartan-3 and Spartan-3E FPGAs will continue to meet customer needs in Conclusion low-cost system design. With the arrival of the Spartan-3E FPGA family, there are now even more reasons to New Features in Spartan-3E Devices consider programmable logic in the produc- Besides the optimized inline I/O ring to tion of low-cost systems. With prices start- Figure 1 – Spartan-3 staggered I/O lower costs, Spartan-3E FPGAs also have a ing under $2 and a substantially lower cost compared to Spartan-3E inline I/O host of new features that include: configuration solution, Spartan-3E devices • Support for low-cost commodity will serve as the production solution The Gate-Centric FPGA Flash memory (SPI/BPI) configuration for increasing numbers of low-cost, high- Spartan-3E devices feature an optimized memory volume, and consumer applications. inline I/O ring for the absolute lowest cost. Figure 1 shows the inline I/O ring of 80000 Spartan-3 75000 Spartan-3E FPGAs versus the staggered I/O Spartan-3E pad approach used by Spartan-3 devices. 70000 Inline I/Os are more efficient for smaller 65000 densities and allow us to add more logic for 60000 a given I/O count, as shown in Figure 2. 55000 To further reduce die size, we refined the 50000 I/O layout, removed some less-common I/O 45000 standards, and resized the output buffers. 40000 Logic Cells Even a small area reduction in each I/O adds 35000 up to a relatively large savings because it is 30000 repeated for each I/O pad. Many of the deci- 25000 20000 sions in the Spartan-3E architecture were 15000 driven by direct feedback from our high-vol- 10000 ume and low-cost customers. 5000

Spartan-3E FPGAs have a lower CPL 0 compared to Spartan-3 devices and are the 0 200 400 600 700 800 lowest cost FPGAs for gate-centric designs. I/O Correspondingly, Spartan-3 devices are well Figure 2 – I/O versus logic curves for Spartan-3 and Spartan-3E devices

Second Quarter 2005 Xcell Journal 15 HIGH-VOLUME SOLUTIONS Implementing New Configuration Options for the Spartan-3E Family The new Spartan-3E family supports serial (SPI) and parallel flash memory for configuration.

by Steve Knapp Sr. Applications Engineering Manager Xilinx, Inc. [email protected] Ward Williams Sr. Strategic Marketing Manager Xilinx, Inc. [email protected] Kirk Owyang Sr. Solutions Marketing Manager Xilinx, Inc. [email protected]

The new Xilinx® Spartan™-3E FPGA family reduces your total system cost in many ways, including new low-cost configuration memory options. You can now choose the configuration memory solution that best suits your specific application requirements. For configura- tion memory, you can choose between industry-standard, commodity serial peripheral interface (SPI) or parallel NOR flash PROMs, competitively priced Xilinx Platform Flash, or other low-cost memories with a microcontroller.

16 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS

Nearly all of the Spartan-3E configuration accessible, byte-addressable, read/write memo- What is SPI? pins become available as user I/Os after con- ry. If the application requires additional space, The serial peripheral interface (SPI) figuration. Consequently, you can leverage any upgrade to the next larger PROM. Most SPI interface is a four-wire synchronous remaining space in the configuration memory and parallel flash PROMs are available in a interface (Figure 1). SPI was originally for application data, such as code for an common footprint across multiple densities. pioneered as a serial communications embedded MicroBlaze™ processor, serial Xilinx Platform Flash still provides an interface between CPUs, MCUs, numbers, or Ethernet MAC IDs. SPI and par- excellent solution for stand-alone two-chip peripherals, and other devices support- allel flash PROMs additionally offer random- programmable logic solutions (FPGA + ded- ing this protocol. Now SPI is popular in icated configuration PROM), the embedded processing and consumer with competitive cost-per- SPI Physical Interface electronics markets. Many modern megabit pricing. Platform Flash microcontrollers use this interface, as Master Slave also excels with features such as well as a wide variety of third-party MOSI SI JTAG in-system programmabil- peripherals. ity and patented compression MISO SO technology. SPI Flash Suppliers SS# SS SPI flash PROMs are popu- Although SPI is a standard four-wire SCLK C lar in high-volume consumer interface, the various available SPI flash • SPI is a four-wire synchronous serial interface electronics applications, where PROMs use different command proto- • SPI Master device communicates to one or more Slaves they store system parameters or cols. Spartan-3E FPGAs support up to • SPI Master controls all timing via the SCLK clock signal • SPI Master selects a Slave using an active-Low select signal (SS#) code for an embedded proces- four different command protocols. • All connected SPI devices share a common serial data input, sor. SPI flash PROMs also offer Currently, SPI flash PROMS from five output, and clock signal a low-cost pin-saving configu- vendors (Atmel, NexFlash, Programmable ration solution for Spartan-3E Figure 1 – SPI physical interface Microelectronics Corporation [PMC], FPGAs. SPI flash memory is Silicon Storage Technology [SST], and multi-sourced and multiple ST Microelectronics) are tested and sup- SPI Serial Flash Vendor Tested SPI Flash Family densities are available within ported, as shown in Table 1. Devices from the same package footprint. other SPI suppliers are being tested and Atmel AT45DBxxx Parallel NOR flash is the pre- will be supported in the near future. NexFlash NX25Pxx ferred solution for FPGA appli- cations with an embedded Memory Requirements Programmable Microelectronics Pm25LVxxx processor such as the Corporation (PMC) A single Spartan-3E FPGA requires MicroBlaze soft-core processor. roughly 600K to 6M bits for configura- Silicon Storage Technology (SST) SST25LFxxxA At power-on, the FPGA config- tion, a small amount of memory relative SST25VFxxxA ures from one end of the parallel to the large SPI capacities available flash. After configuration, the today. Table 2 lists the amount of con- SST25VFxx * MicroBlaze processor either exe- figuration memory required by each ST Microelectronics M25Pxx cutes directly from the opposite Spartan-3E device type, the minimum

* SST25VFxx supports only the READ (0X03) SPI read command; all others support the FAST_Read (0X0B) command. end of memory or shadows the required SPI device size, and the extra code to external SDRAM. memory that is available for other pur- Table 1 – SPI flash memory families supported by Spartan-3E FPGAs poses. Note that it is not necessary to have a dedicated configuration memory device for each FPGA. Minimum SPI Unused Memory Multiple FPGAs can share a single Spartan-3E Device Configuration Bits Device Capacity After Configuration SPI flash PROM in a configuration daisy chain. XC3S100E 581K 1 Mb 443K

XC3S250E 1,352K 2 Mb 696K Interfacing to SPI XC3S500E 2,267K 4 Mb 1,829K Figure 2 shows the typical connections between a Spartan-3E FPGA and SPI XC3S1200E 3,832K 4 Mb 264K flash memory, as well as the FPGA I/O XC3S1600E 5,958K 8 Mb 2,234K pins used to control the configuration procedure. When the FPGA is in SPI Table 2 – Required SPI device sizes and estimated cost flash configuration mode, three dual-pur-

Second Quarter 2005 Xcell Journal 17 HIGH-VOLUME SOLUTIONS

• Microprocessors, microcontrollers Spartan-3E SPI Flash Interface (MCUs), and digital signal processors 3.3V Spartan-3E SPI Flash (DSPs) FPGA (*) Memory • Many application-specific standard '0' M2 MOSI (MOSI) D SPI Flash (MISO) Configuration Mode '0' M1 DIN Q products (ASSPs) '1' MO CSO_B (SS#) S CCLK (SCLK) C After configuration, the FPGA user appli- SPI Flash ? VS2 Vendor Select ? VS1 cation can remain as the SPI bus master and ? VS0 communicate with the attached SPI flash

Optional JTAG TDI PROM and any attached SPI peripherals. All TDO Programming of the attached SPI peripherals share com- Interface TMS * Pull-up registor on SS_B only TCK required if HSWAP_EN=1 mon serial-input, serial-output, and clock sig- nals. Each SPI peripheral has a separate select Figure 2 – Interface between Spartan-3E FPGA and SPI flash memory input. The SPI flash PROM used for config- uration is selected through the FPGA’s CSO_B pin. Each additional SPI peripheral Example: Reusing SPI Interface is selected by a separate user I/O pin.

MOSI D SPI flash Parallel NOR Flash MISO Q Memory SCLK S Spartan-3E FPGAs also provide a parallel CSO_B C configuration interface, primarily

MicroBlaze designed to configure the FPGA from

DDR SDRAM User-I/O SI SPI Peripherals industry-standard parallel NOR flash. The Spartan-3E SO • A/D Converter FPGA’s asynchronous memory interface is C • D/A Converter SS flexible and can connect to a variety of • FPGA configures from SPI flash memory memory devices such as EEPROM, OTP • After configuration, the FPGA copies MicroBlaze code from EPROM, masked ROM, NVRAM, or SPI flash to external DDR SDRAM even asynchronous SRAM. • SPI flash available for non-volatile data or parameter storage • Additional SPI peripherals require a single select pin per device The memory requirements are fairly simple. During configuration, the FPGA provides up to 24 address lines and Figure 3 – Connecting SPI flash memory to multiple devices receives byte-wide data. The FPGA has four control lines driving the memory’s pose pins, called VS[2:0], define the type of ing the FPGA’s MOSI, DIN, and CCLK chip-select, output-enable, write-enable, attached SPI flash and operate as follows: pins as user I/O. and an optional byte-enable for high- If large enough, the SPI flash PROM can density flash PROMs with a x8/x16 • These pins are only activated in SPI con- also contain non-volatile application data, mode control (BYTE#). The memory figuration mode (M[2:0]=“001”) such as MicroBlaze processor code or data must be 3.3V, 2.5V, or 1.8V and must • The VS pins are sampled when INIT_B such as serial numbers and Ethernet MAC have a read access time of 200 ns or faster. goes high IDs. Figure 3 shows an example of using SPI Conclusion • The VS pins are reusable as user I/O flash memory for multiple purposes. In addition to low device costs, the new low- after FPGA configuration completes cost Spartan-3E FPGA family lowers overall (DONE goes high) Third-Party Peripherals In addition to flash memory, many periph- system cost in many ways, including new Then, the FPGA issues the command erals utilize the same SPI. These include: support for inexpensive SPI and parallel sequence appropriate for the selected SPI flash. flash memory for configuration. Currently, • Memories (EEPROM, EPROM) a number of flash memory families are sup- After Configuration • Analog-to-digital converters (ADCs) ported and more memory vendors will After configuration, all of the pins connect- announce their support for Spartan-3E ed to the SPI flash PROM are available as • Digital-to-analog converters (DACs) devices in the coming months. Please check user I/O pins. If not using the SPI flash • Thermal management the Spartan-3E pages on the Xilinx website PROM after configuration, drive the FPGA’s for a current list of supported SPI flash, CSO_B pin high to disable the PROM, free- • Display drivers www.xilinx.com/spartan3e/.

18 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS EncodingEncoding High-ResolutionHigh-Resolution Ogg/TheoraOgg/Theora VideoVideo withwith ReconfigurableReconfigurable FPGAsFPGAs Once the traditional application area of custom ASICs, modern FPGAs can now handle high-performance video encoding.

by Andrey Filippov President Elphel, Inc. [email protected]

Much of the Spring 2003 issue of the Xcell Journal in which my article about Spartan™-IIE-based Elphel Model 313 cameras appeared (“How to Use Free Software in FPGA Embedded Designs”) was dedicated to the Xilinx® Spartan-3 FPGA. I immediately started to think about using these devices in our new gen- eration of Elphel network cameras, but it wasn’t until last year that I was finally able to start working with them. One of the factors that slowed my com- pany’s adoption of this new technology was the fact that at first I could not find appro- priate software that could handle the devices selected, as it is essential that our end users can modify our products without expensive software development tools. When I visited the Xilinx website in Summer 2004 and found that the current version of the free downloadable WebPACK™ software could handle the XC3S1000 – the largest device available in a small FT256 package – I knew it was the right time to switch to the Spartan-3 device.

Second Quarter 2005 Xcell Journal 19 HIGH-VOLUME SOLUTIONS

connected directly to the processor I/O CPU/Compressor pins, so I could not use the software that DDR SDRAM Board (333) CMOS Image 16M x 16 comes with Xilinx configuration hardware. Sensor IEEE802.3af The JTAG instruction register is six bits Sensor Board Programmable Compliant wide, not five as it was in the Spartan-IIE (304/314/317/318) Clock Generator Power (3 PLLs) Supply devices with which I was familiar. After some trial and error, I figured that out and Xilinx 10/100 BaseT LAN found that the same code could run at 125 Spartan-3 Transceiver 1000K Gates MHz (instead of 90 MHz in the previous FPGA model) and used just 36% (not 98% as Axis ETRAX100LX 32-bit 100 MHz before) of available slices – plenty of room GNU/Linus Processor for more challenging tasks. Of course, I had some challenging tasks in mind, as motion JPEG is not a really good option for high-resolution/high- SDRAM FLASH 8M x 32 8M x 16 frame-rate cameras because the amount of JTAG Port data to be transferred or stored is quite huge. It is a waste of network bandwidth or hard disk space when recording such video Figure 1 – Camera system block diagram streams, as fixed-view cameras in most cases have very little difference between The Camera Hardware ble to change the pin functions (including consecutive frames. Something like The new Model 333 camera (Figure 1) polarity) to match the particular sensor MPEG-2 could make a difference; that was uses the same Linux-optimized CPU boards. A similar solution allowed the earli- the standard I was planning to implement (ETRAX100LX by Axis Communications) er Model 313 camera to support different in the camera. as the earlier Model 313, but with types of sensors (most became available But as soon as I got some books on increased system memory – 32 MB of after the board design). It even works in our MPEG-2 and started combing through SDRAM and 16 MB of Flash. The second 11-megapixel Model 323 cameras without online resources, I found another funda- major upgrade is the use of 32 MB of DDR any PCB modifications. mental difference between MPEG and SDRAM as a dedicated frame buffer that JPEG – not just that it can use the similar- works in tandem with the FPGA, supple- Selecting the Video Encoding Technique ity between consecutive frames. Contrary menting its processing power with high After the prototype camera was ready, it to JPEG, MPEG-2 requires you to pay capacity and I/O bandwidth. took just a couple of weeks to modify the licensing fees for using the encoders based The Spartan-3 DDR I/O functionality code developed for the Spartan-IIE-based on this standard. The fee is small compared made it possible to increase the memory bandwidth without increasing board size – the complete system still fits on a 1.5 x 3.5- inch four-layer board (see Figure 2). The actual board area is even smaller, as the new one is designed to fit the sealed RJ45 con- nectors for outdoor applications. For the camera circuit design, the goals include combining high computational per- formance with small size (that also simpli- fies preserving high-speed signal integrity on the PCB) and providing the flexibility for the reconfigurable FPGA on the system Figure 2 – Camera system board level. For the latter, I decided to split the camera circuitry into two boards: one main camera and to implement motion JPEG to the cost of the hardware, but it still board and a second containing just a sensor compression. Half of that time was spent could be a hassle and does not provide free- with minimal related components. On the trying to figure out how to configure the dom for implementation. main board the FPGA I/O pins go directly new FPGA with the generated bitstream. It did not take long to find a perfect to the inter-board connector, so it is possi- In the camera, JTAG pins of the device are alternative – Theora, based on the VP3

20 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS

codec developed by On2 Technologies data mapping both serve the same goal: management, DDR I/O functions, an (www.on2.com) and released as open-source optimizing memory bandwidth that other- increased number of global clock networks software for royalty-free use and modifica- wise would be a system bottleneck. for the DDR SDRAM controller, and tions (see www.theora.org/svn.html). The rest of the code that currently uses large block RAM modules for the various Theora is an advanced video codec that two-thirds of the general FPGA resources tables and buffers in the camera. competes with MPEG-4 and other similar (slices) and 20 of 24 block RAM modules The free video encoder (Theora) and low-bit-rate video compression schemes. It includes video compression modules, a completely open implementation of the is now supported by the Xiph.org sensor, and system interfaces. camera (all software and Verilog code is Foundation along with Ogg, the transport A detailed description of the camera provided under the GNU General Public layer used with Theora to deliver the video code is available, together with the source License) makes the second most important content. The bitstream format is stable code, at Sourceforge (https://sourceforge.net/ function of Elphel products possible. You enough and supported by multiple players projects/elphel). can use these cameras not only as finished running on different operating systems. products but also as universal development Like JPEG and MPEG, it uses a two- Conclusion platforms – demonstrating the power and dimensional 8 x 8 DCT. High-performance reconfigurable FPGAs flexibility of the Spartan-3 family. It is pos- made it possible to build a fast high-resolu- sible to add your own code, rerun the tools FPGA Implementation tion low-bit-rate network camera capable (both for the FPGA code and the C-lan- The code for the Elphel Model 333 camera of running 30 fps at a resolution of 1280 x guage camera software), and immediately FPGA is written in Verilog HDL (Figure 1024 pixels (12 fps at a resolution of 2048 try the new camera with advanced image 3). It is designed around the 8-channel x 1536). Many of the new features of the processing implemented. SDRAM controller that uses the Spartan-3 Spartan-3 devices proved to be very useful For more information, visit DDR capabilities. The structure of the in this design: embedded multipliers for www.elphel.com, https://sourceforge.net/ memory accesses and specially organized DSP functions, advanced digital clock projects/elphel/, and www.theora.org.

Sensor Interface Xilinx Spartan-3 1000K Gates FPGA CMOS Gamma Correction SDRAM Controller Image Sensor I/O FPN Correction Sensor Synchronization Overlay Application 0 - Data from Sensor 1 - FPN Correction/Overlay 2.- 20 x 20 Pixel Tiles to Compressor 3 - PIO SDRAM Access DDR SRAM 4 - Reference Frame Write 5 - Reference Frame Read Compressor Stage 1 6 - Compressed Tokens Write 7 - Compressed Tokens Read Bypass Buffer - Compressor Stage 2 8 x 8 + + 8 x 8 Quan- Dequan- Forward tizer tizer Inverse EOB Runs Huffman Bitstream DCT DCT Bayer to Extractor Encoder Packager YCbCr 4:2:0 Coefficient Converter DC Predictor Encoder (pretokens) System Interface

DMA Buffer/ Controller Bus CPU Interface Status Data

Command Decode Tables Write

Clock JTAG Programming Oscil- Management Interface lators

Figure 3 – Block diagram of the FPGA code

Second Quarter 2005 Xcell Journal 21 HIGH-VOLUME SOLUTIONS ImplementingImplementing DSPDSP AlgorithmsAlgorithms UsingUsing Spartan-3Spartan-3 FPGAsFPGAs This article presents two case studies of FPGA implementations for commonly used imageimage processingprocessing algorithmsalgorithms –– featurefeature extractionextraction andand digitaldigital imageimage warping.warping.

Digital image warping is a branch of uniquely matched and tracked over a by Paolo Giacon image processing that deals with tech- sequence of images, whereas a point Graduate Student niques of geometric spatial transforma- along an edge can be matched with any Università di Verona, Italy number of other points on the edge in a [email protected] tions. Warping images is an important stage in many applications of image analy- second image. Saul Saggin sis, as well as some common applications Undergraduate Student of computer vision, such as view synthesis, The Feature Extraction Algorithm Università di Verona, Italy image mosaicing, and video stabilization The algorithm employed to select good [email protected] in a real-time system. features is inspired by Tomasi and Giovanni Tommasi In this article, we’ll present an FPGA Kanade’s method, with the Benedetti and Undergraduate Student implementation of these algorithms. Perona approximation, considering the Università di Verona, Italy eigenvalues α and β of the image gradient [email protected] Feature Extraction Theory covariance matrix. The gradient covari- Matteo Busti In many computer vision tasks we are ance matrix is given by: Graduate Student interested in finding significant feature Università di Verona, Italy points – or more exactly, the corners. 2 [email protected] These points are important because if we Ix Ix Iy H = 2 measure the displacement between fea- Ix Iy Ix Computer vision is a branch of artificial tures in a sequence of images seen by the intelligence that focuses on equipping com- camera, we can recover information both where I and I denote the image gradi- puters with the functions typical of human on the structure of the environment and x y ents in the x and y directions. vision. In this discipline, feature tracking is on the motion of the viewer. Hence we can classify the structure one of the most important pre-processing Figure 1 shows a set of feature points around each pixel observing the eigenval- tasks for several applications, including extracted from an image captured by a ues of H: structure from motion, image registration, camera. Corner points usually show a and camera motion retrieval. The feature significant change of the gradient values No structure : α ≈ β ≈ 0 extraction phase is critical because of its along the two directions (x and y). These Edge : α ≈ 0, β >> 0 computationally intensive nature. points are of interest because they can be Corner : α >> 0, β >> 0

22 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS

is a corner point. The minimum eigenvalue compute for each pixel three coeffi- is computed using an approximation to cients used by the characteristic poly- avoid the square root operation that is nomial. To store and read the gradient expensive for hardware implementations. values, we use a buffer (implemented The corner detection algorithm could using a Spartan-3 block RAM). be summarized as follows: 2. Calculation of the characteristic poly- The image gradient is computed by nomial value. This value is important mean of convolution of the input image to sort the features related to the spe- with a predefined mask. The size and the cific pixel. We implemented the mul- values of this mask depend on the image res- tiplications used for the characteristic olution. A typical size of the mask is 7 x 7. Figure 1 – Feature points extracted polynomial calculus employing the from an image captured by a camera • For each pixel (i, j) loop: embedded multipliers on Spartan-3 devices. N Using the Benedetti and Perona approx- aI= ()k 2 3. Feature sorting. We store computed imation, we can choose the corners without ij, ∑ x feature values in block RAM and sort computing the eigenvalues. k them step by step by using successive We have realized an algorithm that, comparisons. compared to the original method, doesn’t N kk 4. Enforce minimum distance. This is require any floating-point operations. bIij, = xI y done to keep a minimum distance Although this algorithm can be imple- ∑ k between features; otherwise we get mented either in hardware or software, by clusters of features heaped around the implementing it in FPGA technology we N most important ones. This is imple- can achieve real-time performance. cI= ()k 2 ij, ∑ y mented using block RAMs, building a Input: k non-detect area around each most • 8-bit gray-level image of known size important feature where other features where N is the number of pixels in the will not be selected. (up to 512 x 512 pixels) k k patch and Ix and Iy are the components of • The expected number of feature the gradient at pixel k inside the patch. Spartan-3 Theoretical Performance points (wf) 2 • Pi,j = (a – t)(c – t) – b The algorithm is developed for gray-level Output: images at different resolutions, up to 512 x where t is a fixed integer parameter. 512 at 100 frames per second. • List of selected features (FL). The type • If (P > 0) and (a > t), then we retain The resources estimated by Xilinx of the output is a 3 x N matrix whose: i,j i,j pixels (i,j) System Generator are: – First row contains the degrees of con- • Discard any pixel that is not a local fidence for each feature in the list • 1,576 slices maximum of Pi,j – Second row contains the x-coordinates • 15 block RAMs • End loop of the feature points • 224 LUTs • Sort, in decreasing order, the feature list – Third row contains the y-coordinates FL based on the degree of confidence • 11 embedded multipliers of the feature points values and take only the first wf items. The embedded multipliers and extensive Semantic of the Algorithm memory resources of the Spartan-3 fabric Implementation In order to determine if a pixel (i, j) is a fea- allow for an efficient logic implementation. With its high-speed embedded multipliers, ture point (corner), we followed Tomasi the Xilinx® Spartan™-3 architecture and Kanade’s method. Applications of Feature Extraction meets the cost/performance characteristics First, we calculate the gradient of the Feature extraction is used in the front end required by many computer vision systems image. Hence the 2 x 2 symmetric matrix for any system employed to solve practical that could take advantage of this algorithm. G = [a b; b c] is computed, whose entries control problems, such as autonomous The implementation is divided into derive from the gradient values in a patch navigation and systems that could rely on four fundamental tasks: around the pixel (i, j). vision to make decisions and provide con- If the minimum eigenvalue of G is 1. Data acquisition. Take in two gradient trol. Typical applications include active greater than a threshold, then the pixel (i, j) values along the x and y axis and video surveillance, robotic arms motion,

Second Quarter 2005 Xcell Journal 23 HIGH-VOLUME SOLUTIONS

measurement of points and distances, and served. Six parameters are required to lation. We implemented the first as a autonomous guided vehicles. define an affine transformation. matrix-vector multiplication (2), with four multipliers and four adders. The second is Image Warping Theory Image Warping Algorithms an approximation of the real result of the Digital image warping deals with tech- There are two common ways to warp an interpolation: we weighted the four pixel niques of geometric spatial transformations. image: values approximating the results of the The pixels in an image are spatially rep- • Forward mapping transformation with two bits after the bina- resented by a couple of Cartesian coordi- • Backward mapping ry point. Instead of performing the calcula- nates (x, y). To apply a geometric spatial Using forward mapping, the source image tions given by the formula, we used a LUT transformation to the image, it is conven- is scanned line by line and the pixels are to obtain the pixel final value, since we ient to switch to homogeneous coordi- copied to the resulting image, in the position divided possible results of the interpolation nates, which allow us to express the given by the result of the linear system shown into a set of discrete values. transformation by a single matrix opera- in equation (2). This technique is subject to tion. Usually this is done by adding a third several problems, the most important being Spartan-3 Theoretical Performance coordinate with value 1 (x, y, 1). the presence of holes in the final image in the We designed the algorithm using System In general, such transformation is repre- case of significant modification of the image Generator for DSP, targeting a Spartan-3 sented by a non-singular 3 x 3 matrix H and (such as rotation or a scaling by a factor device. We generated the HDL code and applied through a matrix-vector multiplica- greater than 1) (Figure 2). synthesized it with ISE™ design software, tion to the pixel homogeneous coordinates: The backward mapping approach gives obtaining a resource utilization of:

x' • 744 slices (1,107 LUTs ) H1,1 H1,2 H1,3 x H1,1 x + H1,2 y + H1,3 x' w' • 164 SRL16 H H H y = H x + H y + H = y' = y' x' w', y' w' (1) 2,1 2,2 2,3 • 2,1 2,2 2,3 w' () H3,1 H3,2 H3,3 1 H3,1 x + H3,2 y + H3,3 w' 1 • 4 embedded multipliers The design can process up to 46 fps The matrix H, called homography or better results. Using the inverse transforma- (frames per second) with 512 x 512 images. collineation, is defined up to a scale factor tion A-1, we scan the final image pixel by Theoretical results show a boundary of 360+ (it has 8 degrees of freedom). The transfor- pixel and transform the coordinates. The fps in a Spartan-3-based system. mation is linear in projective (or homoge- result is a pair of non-integer coordinates in neous) coordinates, but non-linear in the source image. Using a bilinear interpola- Applications of Image Warping Cartesian coordinates. tion of the four pixel values identified in the Image warping is typically used in many The formula implies that to obtain source image, we can find a value for the common computer vision applications, such Cartesian coordinates of the resulting pixel final image pixel (see Figure 3). as view synthesis, video stabilization, and we have to perform a division, an operation This technique avoids the problem of image mosaicing. quite onerous in terms of time and area holes in the final image, so we adopted it Image mosaicing deals with the composi- consumption on an FPGA. For this reason, as our solution for the hardware imple- tion of sequence (or collection) of images we considered a class of spatial transforma- mentation. after aligning all of them respective to a com- tions called “affine transformations” that is mon reference frame. These geometrical a particular specialization of homography. Implementation transformations can be seen as simple rela- This allows us to avoid the division and Software implementations of this algorithm tions between coordinate systems. obtain good observational results: are well-known and widely used in applica- By applying the appropriate transforma- tions through a warping operation and merging the overlapping regions of a A A A x A x + A y + A x' 1,1 1,2 1,3 1,1 1,2 1,3 warped image, we can construct a single A A A y = A x + A y + A = y' x', y' (2) 2,1 2,2 2,3 • 2,1 2,2 2,3 ( ) panoramic image covering the entire visible 0 0 1 1 1 1 area of the scene. Image mosaicing provides a powerful way to create detailed three- Affine transformations include several tions where a personal computer or work- dimensional models and scenes for virtual planar transformation classes as rotation, station is required. A hardware reality scenarios based on real imagery. It is translation, scaling, and all possible combi- implementation requires further work to employed in flight simulators, interactive nations of these. We can summarize the achieve efficiency constraints on an FPGA. multi-player games, and medical image sys- affine transformation as every planar trans- Essentially, the process can be divided tems to construct true scenic panoramas or formation where the parallelism is pre- in two parts: transformation and interpo- limited virtual environments.

24 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS

Forward Mapping Image l Image lw Conclusion The challenge is to design efficient, effec- 1234567891011 x 1234567891011 x' tive, and reliable vision modules with the 1 1 highest possible reliability. 2 2 3 3 Ultimodule, a Xilinx XPERTS partner, 4 H 4 and the VIPS Laboratory at the 5 5 Università di Verona have defined a foun- 6 lw(1,2) = I(1,1) 6 lw(3,2) = I(2,1) 7 7 dation platform for computer vision 8 8 using Ultimodule’s system-on-module 9 9 family. The platform provides a stereovi-

y y' sion system for real-time extraction of INPUT: source image I three-dimensional data and a real-time For every y from 1 to height (I) For every x from 1 to width(I) image-processing engine implementing Calculate x', u = round(x') most of the algorithms required when an Calculate y', v = round(y') If 1<= u <= wodth(lw) and 1<= v <= height(lw) application relies on vision to make deci- Copy I(x,y) to lw(u,v) sions and provide control.

OUTPUT: warped image lw The platform supports applications that require high performance and robust Figure 2 – Forward mapping with a scaling factor greater than one vision analysis, both in qualitative and computational terms (real-time), includ- ing active video surveillance, robotic arm motion and control, autonomous vehicle navigation, test and measurement, and hazard detection. The platform provides Backward Mapping Image l Image lw modules with all required system control logic, memory, and processing hardware, 1234567891011 x 1234567891011 x' together with the application software. 1 1 Interconnecting modules allow fast devel- 2 2 3 3 opment of a complex architecture. -1 4 H 4 The platform leverages Xilinx Spartan- 5 5 3 devices, which are an optimal choice for 6 6 7 7 image processing IP cores because of their 8 8 flexibility, high performance, and DSP- 9 9 oriented targeting. The Spartan-3 family

y y' provides a valid, programmable alternative

lw(4,5) = interpolation of { l(3,2), l(4,2), l(3,3), l(4,3)} to ASICs. This characteristic, coupled with lw(4,7) = interpolation of { l(3,4), l(4,4), l(3,5), l(4,5)} its low cost structure, adds considerable value when time to market is crucial. INPUT: source image I For every v from 1 to height (Iw) For more information about feature For every u from 1 to width(Iw) extraction, you can e-mail the authors at Calculate x Calculate y [email protected] or saul. Calculate lw(u,v)as bilinear interpolation the four pixel values: [email protected]. For more informa-

I( x , y ), I(x , y ), I( x , y ), I( x , y ) such as tion about image warping, you can e-mail I w ()(u,v = 1 Ð h )() 1 Ð k ¥¥¥p00 + h (1 Ð k ) p10 + (1 Ð h ) ¥¥¥¥k p01+ h k p11 [email protected] or giovanni. [email protected]. h = x Ð x ,k = y Ð y where: { p00 = I() x , y , p10 = I() x , y , p01 = I() x , y , p11 = I() x , y OUTPUT: warped image lw We are grateful for the support from our advisor, Professor Murino, in the Vision, Image Processing, and Sound (VIPS) Laboratory in the Dipartimento di Informatica at the Università di Verona, and contributions from Figure 3 – Backward mapping with a scaling factor greater than one Marco Monguzzi, Roberto Marzotto, and Alessandro Negrente.

Second Quarter 2005 Xcell Journal 25 HIGH-VOLUME SOLUTIONS DesigningDesigning aa Spartan-3Spartan-3 FPGAFPGA DDRDDR MemoryMemory InterfaceInterface Xilinx provides many tools to implement customized DDR memory interfaces.

by Rufino Olay DDR SDRAM is an evolutionary of storage elements. The storage-element Marketing Manager, Spartan Solutions extension of “single-data-rate” SDRAM pair on either the output path or the three- Xilinx, Inc. and provides the benefits of higher speed, state path can be used together with a spe- [email protected] reduced power, and higher density com- cial multiplexer to produce DDR ponents. Data is clocked into or out of the transmission. This is accomplished by tak- Karthikeyan Palanisamy device on both the rising and falling edges ing data synchronized to the clock signal’s Staff Engineer, Memory Applications Group of the clock. Control signals, however, still rising edge and converting it to bits syn- Xilinx, Inc. change only on the rising clock edge. chronized on both the rising and falling [email protected] DDR memory is used in a wide range edge. The combination of two registers and of systems and platforms and is the com- a multiplexer is referred to as double-data- Memory speed is a crucial component of puting memory of choice. You can use rate D-type flip-flop (FDDR). system performance. Currently, the most Xilinx® Spartan™-3 devices to implement common form of memory used is synchro- a custom DDR memory controller on Memory Controllers Made Fast and Easy nous dynamic random access memory your board. Xilinx has created many tools to get design- (SDRAM). ers quickly through the process of building The late 1990s saw major jumps in Interfacing Spartan-3 and testing memory controllers for Spartan SDRAM memory speeds and technology Devices with DDR SDRAMs devices. These tools include reference because systems required faster perform- Spartan-3 platform FPGAs offer an ideal designs and application notes, the Memory ance and larger data storage capabilities. connectivity solution for low-cost systems, Interface Generator (MIG), and more By 2002, double-data-rate (DDR) providing the system-level building blocks recently, a hardware test platform. SDRAM became the standard to meet necessary to successfully interface to the Xilinx application note XAPP454, this ever-growing demand, with latest generation of DDR memories. “DDR2 SDRAM Memory Interface for DDR266 (initially), DDR333, and Included in all Spartan-3 FPGA Spartan-3 FPGAs,” describes the use of a recently DDR400 speeds. input/output blocks (IOB) are three pairs Spartan-3 FPGA as a memory controller,

26 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS

using MIG. The results in Table 1 show DQS that the implementation would use 17% of

DQ the slices, leaving more than 80% of the device free for data-processing functions. Internally or Externally Delayed DQS Testing Out Your Designs to Capture DQ The last sequence in a design is the verifi-

Phase-Shifted cation and debug in actual hardware. DCM Output After using MIG 007 to create your cus- to Capture DQ tomized memory controller, you can implement your design on the Spartan-3 Memory Development Kit, HW-S3- Figure 1 – Read operation timing diagram SL361, as shown in Figure 3. The $995 kit is based on a Spartan-3 1.5M-gate FPGA (the XC3S1500) and includes with particular focus on interfacing to a delayed strobe can be centered in the data additional features such as: Micron MT46v32M16TG-6T DDR window for data capture. • 64 MB of DDR SDRAM Micron SDRAM. This and other application notes To maximize resources within the FPGA, MT5VDDT1672HG-335, with an illustrate the theory of operations, key chal- you can explore design techniques such as additional 128 MB DDR SDRAM lenges, and implementations of a Spartan- using the LUTs as RAMs for data capture – DIMM for future expansion 3 FPGA-based memory controller. while at the same time minimizing the use DDR memories use non-free-running of global clock buffers (BUFGs) and digital • Two-line LCD strobes and edge-aligned read data clock managers (DCMs) – as explained in • 166 MHz oscillator (Figure 1). For 333 Mbps data speeds, the the Xilinx application notes. Results are memory strobe must be used for higher given with respect to the maximum data • Rotary switches margins. Using local clocking resources, a width per FPGA side for either right and left • Universal power supply 85V-240V, or top and bottom 50-60 MHz implementations. Implementation chal- lenges such as these are mitigated with the new Memory Interface Generator. Xilinx created the Memory Interface Generator (MIG 007) to take the guesswork out of designing your own controller. To cre- ate the interface, the tool requires you to Figure 3 – Spartan-3 memory input data including development board (HW-S3-SL361) Figure 2 – Using the MIG 007 to automatically create a DDR memory controller FPGA device, frequen- cy, data width, and Conclusion banks to use. The inter- With the popularity of DDR memory active GUI (Figure 2) increasing in system designs, it is only nat- Feature Utilization Percent Used generates the RTL, ural that designers use Spartan-3 FPGAs as Number of Slices 2,277 out of 13,312 17% EDIF, SDC, UCF, and memory controllers. Implementing the Number of DCMs 1 out of 4 25% related document files. controller need not be difficult. Number of External IOBs 147 out of 487 30% As an example, we For more information about the applica- created a DDR 64-bit tion notes, GUI, and development board, Table 1 – Device utilization for a DDR interface for a Spartan please visit www.xilinx.com/products/ 64-bit interface in an XC3S1500 FPGA XC3S1500-5FG676 design_resources/mem_corner/index.htm.

Second Quarter 2005 Xcell Journal 27 HIGH-VOLUME SOLUTIONS

by Zulfiqar Ali Zamindar Field Application Engineer Nu Horizons Electronics Corp. SignalSignal ProcessingProcessing [email protected] To meet their system design goals, design- ers today must prototype a new idea and CapabilityCapability withwith thethe integrate its product features into the low- est cost silicon, complete with versatile functionality. Specifically, two of the biggest challenges are to get the design cor- NuHorizonsNuHorizons Spartan-3Spartan-3 rect in the first place and to fix a problem rapidly. Immediate design solutions are essential for meeting time-to-market pres- sures, keeping up with changing industry DevelopmentDevelopment BoardBoard standards, and prototyping quickly. Even after the design is completed, a need may exist for field upgradeability if a bug is found or a new functionality is available. The The Spartan-3 platform provides a low-cost FPGA-based Xilinx® Spartan™ FPGA has been a great solution to perform digital signal processing requirements. low-cost programmable platform for low- density control logic and system interfaces. However, when it comes to signal processing, designers traditionally purchase a fixed algo- rithm standard product for high-speed mul- tiply and accumulate (MAC) functions. This requirement demands additional design resources, verification time, system components, and more board space. With the explosion of the wireless communication market and prolifera- tion of low-cost 90 nm processes, Spartan-3 devices – with plenty of logic, memory, and as many as 104 18 x 18 embedded hard multipliers – are the ideal solution for many signal processing needs. The challenge for today’s digital pro- cessing systems is their large memory requirements and the very fast MACs need- ed for rapid mathematical operations. Every multimedia system contains an external DSP processor and memory component that reduces system performance and increases component costs. Once you have uncovered a solution that integrates a paral- lel or semi-parallel system, you can then focus on improving the overall DSP design to eliminate performance bottlenecks. As a result of digital processing chal- lenges, many companies have focused their efforts on developing the system-on-chip (SOC) concept by adding feature sets to

28 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS

bring additional functionality to a single Xilinx System Generator block is used to ple, you can create a fast Fourier transform piece of silicon. Customized ASICs have select various implementation options such (FFT) or FIR core with the easy-to-use GUI become very costly solutions in today’s com- as FPGA device, package, speed, system in System Generator for DSP, customize the petitive landscape. Traditional DSP proces- clock, synthesis options, and HDL. The core as per your application, and run the sors are capable of carrying out high-speed need to write HDL is eliminated, as the Xilinx ISE™ tool in the background to MAC operations, but have bandwidth limi- tool creates the proper language for you; build your signal processor system. This tations. FPGA technology has made tremen- however, it can write HDL if you prefer. flow can synthesize, place, route, and gener- dous progress in recent years by increasing a ate hardware configuration files. large number of intellectual properties to The Simulink environment allows you to reduce the cost of silicon development in var- verify the functionality of each block or sub- ious markets. This is accomplished by opti- system created with scopes and graphs to mizing architectures, using leading process view images or observe data. technologies, and adding IP cores. Digital filters are among the most signif- Some of the typical applications for dig- icant components in digital signal process- ital signal processing are digital cameras, ing applications. The function of a filter is to phones, 3G wireless, video conferencing eliminate undesirable parts of the signal systems, and high-definition digital televi- (random noise) or to extract signals in a par- sions. Having a signal processing capability ticular frequency range. Basic FIR filters are inside an FPGA is the perfect design inno- used extensively in video broadcasting and Figure 1 – MAC FIR filter block diagram vation – the stepping stone to system-on- in System Generator wireless communications. A mathematical chip in an FPGA without the high cost of expression of a basic FIR filter is: complex customized chip development. Y (n) = SUM h (k) * x (n-k); k=0 to k =N-1

System Generator for DSP It consists of an input sample, output Xilinx expanded its features in the Spartan-3 sample, and coefficients. Imagine “x” is a FPGA by adding embedded multipliers in continuous stream of input signal and “y” is the architecture. This technological inno- a resulting filtered stream of output signal. vation is similar to embedded block mem- The “n” and “k” in the equation correspond ory, clock management, and multiple to a particular instant in time, so to compute standards for high-speed I/O circuits, all “y (n)” at time “n,” a group of input samples standard characteristics of the Xilinx at “n” different points in time are required, or Spartan-3 and Virtex™-II Pro families. numerically x (n), x (n-1), x (n-2) ... x (n-k). Figure 2 – Simulation view of FIR filter Time to market remains critical for A group of “n” input samples are multiplied companies developing both system hard- by “n” different coefficients and summed ware and software. With Xilinx System together to form a result “y (n).” Generator (SysGen), you can simultane- This design example implements a 43- ously create behavioral-level hardware tap FIR filter with a MAC engine and a blocks and simulate the entire system with dual-port RAM used for data and coeffi- just a few tool clicks. The design environ- cient storage. The filter is a low-pass filter ment allows you to create block-based sys- with a cut-off frequency of 6 KHz. The tems like digital QAM modulators for sampling frequency is 44.1 KHz. software-defined radio, finite impulse Figure 1 represents the model of the fil- response (FIR) filters, image processing ter. The model has a coefficient width of functions, mathematical operators, A/D 12, a coefficient binary point value of 12, a and delta-sigma D/A conversion, and all- Figure 3 – Implementation summary data width value of 10, a data binary point in-one silicon for widely used applications. value of 8, and a sampling frequency of Using System Generator within the 44.1 KHz. Scope shots of the filtered out- Simulink design environment from The Simulink also enables you to integrate put are shown in Figure 2. An implementa- MathWorks, you have unrestricted access blocks from many different libraries. tion summary displaying the use of RAM to many blocks. You can select both Xilinx Commonly used DSP block sets include and multiplier resources is shown in Figure and third-party blocks, drag and drop to math functions, signal management, a vari- 3. This design achieved ~ 125 MHz per- the Simulink work space, connect, and ety of filters, transforms, encoders, decoders, formance in a -4 speed grade of the simulate a system within minutes. The and linear feedback shift registers. For exam- Spartan-3 device.

Second Quarter 2005 Xcell Journal 29 HIGH-VOLUME SOLUTIONS

The NuHorizons Spartan-3 Board semi-parallel customized DSP system-on- Xilinx, its distributors, and third-party chip design within days. companies offer several boards for proto- The Spartan-3 board from Nu typing or emulating a DSP-based system. A Horizons is a perfect solution for proto- low-cost prototyping platform from Nu typing signal processing using Spartan-3 Horizons Electronics Corp. is the Spartan- FPGAs. The board has all of the inter- 3 development board (HW-AFX-SP3- faces necessary to create designs for wire- 2000-DB) (Figure 4). The board comprises less and digital imaging applications if these elements (Figure 5): you want to:

• Xilinx XC3S2000-4FG676 Figure 4 – Spartan-3 board • Integrate your logic and signal pro- Spartan-3 device cessing capability on a single chip application engineers cover topics such as • XCF08 Flash PROM for configuration • Prototype a configurable DSP system- memory controllers, embedded processors, on-chip • 4 x 24-character LCD display hardware-in-the-loop with digital signal • Graphical LCD interface processing, and system monitor design • Reduce the cost of conventional serially • 64 Mb SDRAM (2-1 Mb x 16 x 4) using ADC and DAC on the board. ADC implemented external signal processors and DAC are very powerful attributes of • 32 Mb Flash 2 Mb x 16 • Improve system performance our low-cost board, and two of the many • 50 MHz clock oscillator competitive board features. The Spartan-3 Nu Horizons is also in the process of • PLL clock multiplier platform can be expanded with the add-on releasing a Virtex-4 platform board for cus- ADC board from Linear Technology. tomers requiring higher density logic, more • CAN 2.0B transceiver memory, and hard MACs running up to • RS232 interface Conclusion 500 MSPS for high-performance interfaces • PS2 interface With fast multipliers and lower cost FPGAs, to video and imaging applications. engineers now have the ideal solution to For all of the designs and related • Audio CODEC their signal and image processing require- documentation for the Spartan-3 board, • Two-channel A/D and D/A converter ments. With tools like System Generator, visit the Nu Horizons website at • 10/100 Ethernet MAC anyone can implement a powerful parallel or www.nuhorizons.com/sp3/. • 10/100 Ethernet PHY

5V Reg 3.3V Reg 3.3V Analog High-Speed A/D - D/A 1.2V - 3.3V 1.2V Reg 12V / -12V • Flash memory interface LT1963A LT1941 LT1963A Board Connections LT1963A LT1941 LT1941 • SDRAM memory interface Audio CODEC 10/100 • LED/push buttons ST W5093 Ethernet Phy ICS1893BF • JTAG configuration header for User I/O programming Agilent E5404A 10/100 Probe Ethernet I/F • 16-bit LVDS I/F with clock and control SMSC91C111 JTAG I/F • Test point headers for debugging Cable Test LEDs II & IV Push This fully loaded Spartan-3 develop- Buttons 2 Channel Xilinx Spartan-3 ment board is priced at $449, which 16-Bit D/A XC3S1500 2-RS232 LTC1654 I/F includes all of the features necessary to pro- or ST323 XC3S2000 totype a DSP-based system design. The PS2 I/F CAN I/F board comes with a user’s manual, power STL9616 supply, documents, and design files. The 2 Channel 16-Bit option of getting the board bundled with 16-Bit D/A LVDS I/F ISE Foundation™ software, System LTC1865L with Control Generator, and ChipScope™ Pro software PLL System 32 Mb Clock Flash Generation x2 is also available at an additional cost. ICS511 & ST M29W320DB 2 - 1M x 16 x 4 ICS84021 LCD Display Besides DSP designs, the Spartan-3 SDRAM - 16 MB 2x24 Character platform is a great tool to implement many ISSI42S16400 other Xilinx reference designs. Several designs written by Nu Horizons’ field Figure 5 – Spartan-3 XC3S1500 /2000 board block diagram

30 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS CoolRunner-IICoolRunner-II CPLDsCPLDs OfferOffer NewNew FeaturesFeatures

by Steve Prokosch New updates to CoolRunner-II 32- and High-Volume Product Solutions Marketing Manager Xilinx, Inc. 64-macrocell CPLDs are now available. [email protected] Since joining the CPLD market in 1992, Xilinx® has gone from being the new kid on the block to the second largest CPLD supplier, and is closing the gap on the number-one position in both volume and sales dollars. Growth has come from listen- ing to customer input and creating innova- tive ideas to satisfy customer needs. Continued improvement is an ongoing goal. Toward this end, Xilinx has extended the features of its 32- and 64-macrocell CPLD devices by augmenting a feature, cur- rently only available on higher density parts, that will help ease design difficulties and reduce total system costs. In conjunction with added features, Xilinx is also offering new packages to decrease the cost-per-I/O for small footprint packages. These new packages will help Xilinx continue to pene- trate low-cost, battery-operated devices. Xilinx is achieving CPLD market share growth through both traditional applica- tions such as computing, data processing, networking, and telecommunications, and non-traditional applications such as con- sumer products (set top boxes and large screen TVs – both plasma and LCD) and key handheld markets (PDAs, handset, and other battery operated products). With new market success and continued tradi- tional market use, CPLDs will continue to show growth in system logic solutions.

Second Quarter 2005 Xcell Journal 31 HIGH-VOLUME SOLUTIONS

New Banking for CoolRunner-II CPLDs been continually cost reduced and are With the addition of I/O banks on Xilinx has re-introduced its CoolRunner-II™ now on par with some discrete logic CoolRunner-II 32- and 64-macrocell 32- and 64-macrocell devices with an devices. If you are using multiple discrete CPLDs, voltage and standard I/O transla- added feature known as I/O banking. I/O logic devices, the benefits of a CPLD tion can occur in very small, cost-com- banking is useful in systems that use dif- become even more appealing. petitive CPLDs. When using a ferent supply voltage levels on the same This trend opens up a whole new level CoolRunner-II CPLD, other benefits also printed circuit board. Usually, a mismatch of integration at a low price point. For include the following: of voltage levels occurs between a system the first time in history, CPLDs can com- • Security processor and the devices with which it pete with discrete solutions, add func- communicates. This communication may tionality, and cost less. For instance, a • Input hysteresis be as simple as a serial to parallel conver- typical voltage translation device usually • Optional output control sion, or as integral as a processor interface costs from $0.50 to $3.50, depending on to a display. volume, package type, the number of • Low power operation Another example is the connection of a I/Os to be used, and the process technol- processor to an external memory card such ogy of the device. CPLD devices are • Multiple package options as CF (compact flash) or SD (secure digi- below $1.00 for the smallest density and with varying I/O tal). Because of competition and constant lowest cost package. But you get a whole • IEEE1149.1 Boundary Scan test price pressures, high-profile devices such lot more within that single CPLD than as processors must continue to use any discrete function logic device. • Adjustable from 1.5V to 3.3V advanced process technologies. These process technologies continue to lower CoolRunner-II Family of CPLDs voltage swings because of the physical Features XC2C32A XC2C64A XC2C128 XC2C256 XC2C384 XC2C512 properties of wafer geometries. Yet standards organizations in markets Macrocells 32 64 128 256 384 512 like memory cards, wireless communica- FToggle (MHz) 500 500 450 450 450 450 tions, and bus architectures go through FSYSTEM (MHz) 323 263 263 256 217 217 specification changes more slowly. To bridge Max I/O 33 64 100 184 240 270 the gap, voltage and I/O translation is a nec- I/O Banks 2 2 2 2 4 4 essary component in today’s architectures. Care must be taken to match input and out- LVCMOS, LVTTL (1.5, 1.8, 2.5, 3.3) Yes Yes Yes Yes Yes Yes put voltage-switching thresholds for that HSTL, SSTL No No Yes Yes Yes Yes particular voltage standard. DualEDGE Yes Yes Yes Yes Yes Yes Designers face the challenge of pasting DataGATE, CoolCLOCK No No Yes Yes Yes Yes together many logic level standards on a variety of part types. What device can do Standby Power (µW) 28.8 30.6 34.2 37.8 41.4 45.0 the job and yet leave room for design Advanced Security Yes Yes Yes Yes Yes Yes upgrades or changes? Discrete devices can Packages (Size, Type) Maximum User I/O usually address this mismatch, but may add VQ44 (10 x 10 mm, leaded) 33 33 cost, power consumption, printed circuit board layers, and area to a design. They PC44 (16.5 x 16.5 mm, leaded) 33 33 may have limited functionality – and you CP56 (6 x 6 mm, chip-scale) 33 45 may have to buy additional parts to com- QFG32 (5 x 5 mm, Pb-free) 21 pletely solve the problem. The more parts QFG48 (7 x 7 mm, Pb-free) 37 you use, the higher the chance of picking a VQ100 (14 x 14 mm, leaded) 64 80 80 part that may be discontinued, and the more purchase orders and stocking require- CP132 (8 x 8 mm, chip-scale) 100 106 ments you generate. Plus, if you need to TQ144 (20 x 20 mm, leaded) 100 118 118 assemble more parts, the board assembly PQ208 (28 x 28 mm, leaded) 173 173 173 cost will probably increase. But if a simple FT256 (17 x 17 mm, BGA) 184 212 212 AND gate is all you need, a discrete logic device is probably the right choice. FG324 (23 x 23 mm, BGA) 240 270 Let’s look at the case where you need Table 1 – CoolRunner-II family overview more than just a simple gate. CPLDs have

32 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS

electrical characteristics. With their small size, bulk capacitance and inductance are lower than TQFP packages.

Conclusion Many product choices are available when considering voltage translation, but CPLDs are the best choice for many rea- sons. You get additional logic capability, more package choices, lower power, free JTAG testability, I/O options, advanced security, and cost savings. Current users will benefit from the same pin-out, as well Figure 1 – MLF package comparison as the new I/O banking feature. Novice users can find out how to use With these features, CPLDs offer more regular 0.5 mm pin spacing thin quad flat these new CPLDs at www.xilinx.com/cpld/. than voltage translation – and can do a lot pack (TQFP) packages. This makes pin Don’t forget that the design software is more. Because discrete level shifters per- alignment and solder reflow very easy com- always free and easy to use. Download your form a specific task, you may be stuck with pared to BGA packages. copy of ISE™ WebPACK™ software at a specific I/O choice. With Xilinx CPLDs, For debug, MLF packages have external www.xilinx.com/products/design_resources/ you get choices on I/O, and you can use pins that can be probed just as easily as a thin design_tool/index.htm to start your design them for other functions. Thus, they are a quad flat pack package. These features make today. With these design tools and world- better choice when you need level transla- it a great addition to the CoolRunner-II class support, Xilinx CPLDs make a great tion versatility and some logic, extra I/Os, family of CPLDs for consumer products. choice for logic solutions and replacement or other system integration functions. See These MLF packages also offer better of discrete logic devices. Table 1 for features and package types.

Additional Low-Cost Packages With the migration of CPLDs into non- traditional markets, package form factor and cost have become growing concerns. The Xilinx CPLD group has listened to customer needs and found a better small form-factor package. This package is a low- cost alternative to more expensive chip- scale BGA packages with similar I/O counts. The MLF (micro lead frame) pack- age, shown in Figure 1, is small and packs a lot of I/O. Xilinx has introduced two new MLF packages for the CoolRunner-II family of CPLDs: a 32-pin QFG (quad flat no lead, Pb-free) package with 21 I/O for the 32- macrocell device, and a 48-pin QFG pack- age with 37 I/O for the 64-macrocell device. These additions bridge the gap between more expensive chip-scale BGA and low-cost thin quad flat pack packages. MLF packages, also known as QFN (quad flat no lead), offer small sizes and high I/O counts, but unlike their BGA chip-scale cousins, are easy to assemble and easy to probe. For assembly, they are like

Second Quarter 2005 Xcell Journal 33 HIGH-VOLUME SOLUTIONS ReducingReducing Bill-of-MaterialsBill-of-Materials CostCost UsingUsing LogicLogic ConsolidatorConsolidator Logic Consolidator helps analyze the total cost of discrete logic on the board and the potential savings to be gained by using a Xilinx CPLD.

by Monita Chan However, it also came at a cost premium specify your own unit pricing for accurate Product Marketing Manager, High Volume over discrete devices. comparison. Xilinx Hong Kong Improvements in manufacturing Logic Consolidator makes life easy for [email protected] processes and packaging technologies in designers as well as managers. For purchas- Ajay Panicker the last 10 years have significantly lowered ing managers, the easy-to-use Excel tool Field Applications Engineer the cost of programmable logic devices. enables them to accurately list the number CG-CoreEl, India Today, CPLD unit costs have been driven of near-obsolete discrete logic devices that [email protected] down to a level such that they can replace can be replaced by the most optimal Xilinx discrete logic devices, yet maintain advan- CPLD – and the cost savings justifying System enhancements and cost reduction tages such as higher performance, design such an upgrade. For designers, it saves pre- are becoming major trends for upgrading flexibility, re-programmability, and relia- cious design time that is usually spent esti- existing system designs. To be competitive, bility (Figure 1). mating the right PLD to replace a range of companies need to enhance their existing discrete logic devices. feature sets and lower their total cost, pro- Logic Consolidator In this article, we’ll present a real-life viding more for less. To help customers quickly and accurately case study where Logic Consolidator Traditional systems have always used estimate their costs and area savings by helped a customer analyze the real cost discrete logic devices for implementing the using a single CPLD instead of multiple benefits provided by Xilinx CPLDs over simple glue and control logic that every discrete devices, Xilinx® offers a tool called discrete logic devices, saving thousands of design requires. The main argument in Logic Consolidator. This is a Microsoft dollars in the process. favor of discrete logic was that it was cheap Excel-based tool that estimates the number and easily available. Yet discrete logic was of discrete logic devices that can be inte- Customer Case Study not the most optimal choice from a board grated into a single CPLD. It compares The Logic Consolidator tool was instru- area, power consumption, or programma- Xilinx XC9500XL and CoolRunner™-II mental in convincing an initially skeptical bility standpoint. 10,000-unit pricing to the 10,000-unit customer to integrate all of the discrete logic The introduction of programmable average resale price of 7,400 discrete logic on their board to a single Xilinx CPLD. logic devices provided programmability, products in the lowest cost plastic package. Wipro Technologies, one of the biggest flexibility, and higher levels of integration. Also, based on particular quotes, you can hardware design services companies in India

34 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS

increased reliability through reduced power consumption and lower EMI, and it was apparent to Wipro that the Xilinx Die Cost Test Cost Package Cost CPLD was the obvious choice. After using Logic Consolidator for this card, Wipro was so impressed with its effi- ciency and usefulness that they decided to Die Cost 10 times smaller die size through use it for all of the other cards in their tele- semiconductor process technology com product upgrade project, and are planning to use it in future projects as well.

Conclusion Logic Consolidator is a powerful tool CPLD Discrete CPLD Discrete that allows customers to analyze the costs 1985 2003 of different alternatives for themselves. Because it is built using Microsoft Excel, Figure 1 – The costs of a CPLD have come down dramatically because of lower die sizes. it is easy to learn and use. To get a copy of Logic Consolidator, contact your (and a Xilinx XPERTS partner), was working potential savings are approximately more Xilinx sales representative. And for more on a project aimed at upgrading an existing than 70% in board real estate, 60% or information about or to download Logic telecom product, mainly to cut manufactur- more in board routing, and a 50% reduc- Consolidator, visit www.xilinx.com/ ing costs and include some enhanced fea- tion in board design time. products/cpldsolutions/logic_tool.htm. A tures. During the initial stage, CG-CoreEl’s Add to this the benefits of programma- new version now includes more logic Xilinx sales team introduced Logic bility, reduced board complexity, and device choices. Consolidator to the customer and helped them identify the appropriate CPLD for replacing multiple discrete devices on board. The tool is very intuitive. It saves a lot of time in finding the right CPLD to replace all of the discrete logic. Figure 2 shows a screenshot of the Logic Consolidator tool as used by the customer. In this case, Wipro was able to identify 73 discrete logic devices that could be replaced by a single Xilinx CPLD. They could choose either a XC9500XL device or a CoolRunner-II device, depend- ing on the supply voltage available. Both options offered an immediate bill-of-mate- rials cost savings – $0.74 with the XC95288XL device and $1.81 with the XC2C256 device, with about 40% more room available in the CPLD to add more logic for any feature enhancement. When Wipro engineers actually imple- mented the design, because of optimiza- tions performed by the Xilinx ISE™ software tool, the design fitted in the next smallest Xilinx device, the XC95144XL- 10TQ144C, with 144 macrocells With a listed price of $6.45 for the XC95144XL device, it provided further cost savings for Figure 2 – Logic Consolidator helped Wipro Technologies understand the cost benefits the customer. On the PCB front, the of using a CPLD in place of multiple discrete logic devices.

Second Quarter 2005 Xcell Journal 35 HIGH-VOLUME SOLUTIONS UsingUsing CoolRunner-IICoolRunner-II SiliconSilicon FeaturesFeatures toto ReduceReduce CostCost CoolRunner-II CPLDs offer features that lower total PCB costs through component count reduction, smaller board space, and lower manufacturing costs.

by Steve Prokosch Product Solutions Marketing Manger Xilinx, Inc. [email protected]

Xilinx® introduced the CoolRunner™-II family of CPLDs in 2001, continuing to lower power consumption through process technology improvements. This is what you would expect from a technology-driven company. But differences in this family exist that have never been employed in a re- programmable logic device. These features deliver even lower power, higher speed, and most importantly, integrated functions that reduce system cost. The CoolRunner-II design team looked at methods to add functionality while pre- serving the low-cost aspect of the current CPLD market. To this end, clock doublers, clock dividers, input hysteresis, and I/O banking were designed in from a cost-per- feature standpoint. This led to extremely creative ways to design features in the smallest amount of silicon. If the feature did not reduce system cost enough to merit the die size increase, that feature was not designed into the device. With this balance of features and cost, CoolRunner-II CPLDs offer a unique advantage when compared to other devices.

36 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS

Cost-Reducing Features sources while again preserving valuable I/O benefit reduced power. This power reduction Above and beyond the normal process tech- pins. The neat thing about the is subtle, but nonetheless an advantage that nology shrinks that every silicon component CoolRunner-II clock divider circuit is that other CPLDs do not include. goes through, CoolRunner-II CPLDs have it actually improves duty cycle. For So how does it reduce cost? Because the very special features that help most designers instance, if you have a 60/40 duty cycle on hysteresis is programmable on a pin-by-pin lower the total components in their design. an incoming clock source, the internal basis, it eliminates the need for external These component reductions can be very divider circuitry actually performs duty- Schmitt trigger devices. Although these com- simple or very complex, depending on how cycle correction. So internal to the CPLD, ponents may be cheap, the extra insertion cost creative you are as a designer. They range this will have a 50/50 duty cycle. This may could prove expensive if added after the initial from clocking features to integration of sim- help circumvent clock skew problems and board fabrication. Better to be safe than sorry ple logic functions to voltage and I/O-level again eliminate multiple oscillator inputs. when it comes to reliability. differences. The principal idea behind these The clock divider is not simply a divide One other cost that is commonly over- features is to integrate functions and simpli- by two; it has eight different divide-by set- looked is PCB routing and the associated fy the designer’s task. The first and most tings (see Figure 2). So if you use the clock costs with extra layers of circuit board mate- basic is clock generation. divider in conjunction with DualEDGE rial. Fewer layers is always less expensive. For most designs, a single clock source flip-flops, you can generate multiple com- may not generate all of the necessary I/O Banking – Voltage and I/O frequencies needed in the design. Also, Standard Translation if the PCB is larger than a hand-held With today’s vast selection of device, clock skew may come into play. DualEDGE components, the chances of pick- The design team considered some of Register ing parts that all have the same these clocking design problems and cre- voltage input levels are slim. ated solutions internal to the CPLD. Consider the time-to-market Figure 1 – Xilinx CoolRunner-II clock doubler demand on today’s design engi- Clock Doubler and Divider neers. If you cannot reuse some Two features included with portion of the previous design, CoolRunner-II devices are DualEDGE chances of making the current and (in 128-macrocell and higher product release cycle are reduced. devices) a clock divider. DualEDGE For example, a design that uses flip-flops offer a performance boost for legacy ASSP devices may not line Divider sequential operations, while the clock up with the same voltage or I/O divider can reduce power consumption. structure of the new processor you But that’s just the tip of the iceberg. just specified. Let’s start with a simple clock-dou- Figure 2 – Xilinx CoolRunner-II clock divider With the simple addition of a bling scheme where we can generate a CPLD, the voltage problem goes frequency double that of the incoming away – and you get extra logic for all clock. This yields two clock sources from a binations of clock frequencies. And for of those new features that marketing dreamed single clock input. So if you happen to those troublesome odd clock frequency up for the next generation of products. But need a faster clock to improve a serial data problems, just use an integer divide-by that that’s not all, because you still need to consid- flow, DualEDGE flip-flops are a perfect fit. yields an odd number (for example, 6, 10, er if the I/O standards match. CoolRunner-II It also works well for improved pulse width 14) to get even more clock choices. CPLDs do that for you as well. modulation or higher resolution of What does this have to do with cost sav- timers/counters. Thus, you need not use Input Pin Features ings? If we look at the cost of discrete devices another clock source for portions of your Because of the low power nature of versus low-cost Xilinx CPLDs, you can design that need to run faster. CoolRunner-II CPLDs, the design team achieve the same function for less money and Another key point is the fact that you believed that there may be portable applica- get a whole lot more. Xilinx has a free down- can preserve a clock input pin. So with tions that go into noisy environments. This loadable tool called Logic Consolidator to DualEDGE flip-flops, you get a free clock led to the idea of integrating hysteresis on show you just how much you can save (recent- without using an input pin on the CPLD. input pins. This single feature, available on ly updated with even more parts to compare). Figure 1 shows a simple diagram of what all CoolRunner-II CPLDs, adds little to sil- You can download it at www.xilinx.com/ DualEDGE flip-flops look like. icon area but brings many advantages. With products/cpldsolutions/logic_tool.htm. There are The clock divider feature (Figure 1) is input hysteresis, the problem of noisy envi- even associated documents to explain how we another way to eliminate extra clock ronments was eliminated – and as a side did the comparison.

Second Quarter 2005 Xcell Journal 37 HIGH-VOLUME SOLUTIONS

That’s still not all, because there is figure the Xilinx FPGA devices on board. and still get the great low-power opera- more to cost than the silicon. In fact, if Once that is complete, reprogram the tion of CoolRunner-II CPLDs. you look at the cost of discrete devices CPLD to perform some system function and the associated stocking, shipping, that requires power management, inte- Conclusion assembly, and routing costs, in grating discrete logic functions, or imple- Perhaps you will be able to apply some of most cases a single integrated chip costs menting fixes to standard products. For these cost-saving ideas to your next less. Also note the CoolRunner-II more information regarding OTF, see design. If you need more information on CPLD’s added reliability, versatility, and http://www.xilinx.com/bvdocs/appnotes/ any of these topics, there is a wealth of low power. xapp388.pdf. information from the Xilinx CPLD web- If you’re designing a small-form-factor site at www.xilinx.com/cpld/index.htm. Other Cost-Reducing Ideas device, consider using a new package If you need something more substan- Still other features on CoolRunner-II option for Xilinx CoolRunner-IIA tial on these features, an excellent applica- CPLDs may be overlooked. OTF (on- CPLDs. The package is commonly tion note exists at http://www.xilinx.com/ the-fly) reprogramming is one of those referred to as micro lead frame or quad bvdocs/appnotes/xapp378.pdf. features that can make your CPLD flat no-lead and is extremely small, yet With these innovative features and design do double duty for no extra cost. priced similarly to standard thin-quad flat your imagination, design problems will Here’s one good example of using OTF: package devices. This package is an seem more manageable and hopefully on board power-up, use a CPLD to con- attractive way to use minimal board space make your design more successful.

Increase your embedded software performance by using Nucleus PLUS on your processor-based Xilinx designs. By integrating with Xilinx’s MLD technology, Nucleus PLUS is configured to your system design automatically, making software integration with your FPGA design easy. The compact design of Nucleus PLUS can run from on-chip memory to help minimize power dissipation and deliver increased performance. This, combined with a wealth of middleware, including the Nucleus NET Nucleus PLUS is readily TCP/IP solution supporting Xilinx Ethernet IP, makes it ideal for products targeted at available for: the networking, telecommunications, datacoms and consumer markets. • MicroBlaze Discover how Nucleus software can make your embedded development for Xilinx FPGAs • PowerPC 405 easy. Go to our dedicated Web page for more articles and free downloads at • Spartan-3 and Virtex-4 www.acceleratedtechnology.com/xilinx.

Nucleus EDGE, our new embedded development environment based on Eclipse, offers a complete tools solution for developing designs with Xilinx PowerPC cores. Nucleus EDGE provides you with a Builder, Editor and Project Manager, utilizing optimizing compilers, comprehensive embedded debuggers and RTOS profiling tools.

For more information on the Nucleus complete development solution from UML through to embedded RTOS, go to www.acceleratedtechnology.com.

Nucleus. Embedded made easy.

Accelerated Technology, A Mentor Graphics Division [email protected] • www.AcceleratedTechnology.com

©2005 Mentor Graphics Corporation. All rights reserved. Mentor Graphics is a trademark of Mentor Graphics Corporation.

38 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS Using Spartan-3/3E Features to Area-Optimize Your Design Spartan-3/3E devices include features that were typically the domain of only high-end FPGAs – distributed memory, embedded multipliers, and others that can yield significant optimization.

by Suhel Dhanani rolled out even more logic and I/O at a Features That Optimize Area Senior Marketing Manager, Spartan Products lower price. With the release of the The Spartan-3/3E architecture is very Xilinx, Inc. Spartan-3 series we did the same, but also memory-intensive. Not only does this [email protected] added embedded features that let you architecture have embedded 18 Kb Ken Chapman integrate even more functionality in your block RAMs, but half the look-up tables Staff Engineer, General Products Division favorite Spartan device. Table 1 shows the (LUT) can be configured as memory Xilinx, Inc. advantages of these features. (Figure 1). [email protected] Spartan-3/3E Features To Reduce Area Utilization The Xilinx® Spartan™-3 series is the first low-cost FPGA that does not compromise Embedded Multipliers Area-efficient implementation of various DSP functions on features – it offers fast embedded mul- Up to Eight Digital Clock Clock management functions such as clock multiplication/ tipliers, look-up tables that can be opti- Managers (DCM) division, phase alignment can be integrated within the FPGA mally configured to form shift registers, distributed memory, and large amounts of Distributed RAM Small FIFOs, LIFO buffers, scratch-pad memory, register banks embedded block RAMs. All of these features let you design for the 18 Kb Block RAM Large FIFOs, buffers, cache memory, program storage lowest possible area, lowering the size and for processors thereby the cost of the FPGA in your design. LUT Configured as 16-bit FIFOs, delay lines, state machines, logic-register, With more than 100 million units Shift Register (SRL16E) tradeoff in regular intensive designs shipped, Spartan devices have quickly become the world’s most accepted low- Differential Signaling Support – Lower number of pins, reduced power consumption, cost FPGA architecture, familiar to thou- LVDS, RSDS reduced EMI, high noise immunity sands of engineers. With every generation of the Spartan architecture, Xilinx has Table 1 – Features included in the Spartan-3/3E family of devices

Second Quarter 2005 Xcell Journal 39 HIGH-VOLUME SOLUTIONS

CLB

16 Registers Port PORT_ID[7:0] 8-bit Address READ_STROBE pp Control 16 x 4-bit sF s7 WRITE_STROBE sE s6 sD s5 OUT_PORT[7:0] IN_PORT[7:0] sC s4 sB s3 2 ‘SliceM’ 2 ‘SliceL’ sA s2 ALU s9 s1 s8 s0 ADD/SUB Shift/Rotate kk AND/OR/XOR 32 x 2-bit Scratch Pad PARITY Memory 18-bit instruction word 64-Bytes 8-bit data path ss ZERO & Interrupt CARRY Shadow 8-bit port address Flags Flags 10-bit program address INTERRUPT Interrupt Constants INTERRUPT_ACK Control Program 64 x 1-bit Program Counter ADDRESS[9:0] Flow INSTRUCTION[17:0] aaa Control Program Operational ROM/RAM Control & Instruction Program 1024 Words RESET Decoding aaa / pp / ss / kk Counter 32 x 10 Bits CLK Figure 1 -The LUTs in two slices of Stack the CLB are configurable as 16 x 4, 32 x 2, or a 64 x 1 memory. Implemented Using Distributed Memory

The advantage of having distributed mem- Implemented Using 18 KB Block RAM ory is that you can very efficiently implement any design that requires a large number of Figure 2 – The 8-bit PicoBlaze microcontroller takes only 192 logic cells smaller size memory structures without run- by making effective use of Spartan-3/3E memory resources. ning out of registers or block RAMs, offering the potential for unrivaled data bandwidth. A The FIFO is a common component of also only need one up/down counter good example of this is the 8-bit PicoBlaze™ many system designs. The normal way to instead of two up counters, which again is microcontroller, which takes only 192 logic construct a FIFO is by using a memory half the space. As a final bonus, the cells (LC) within the Spartan-3/3E architec- (Figure 5). Because of the need to write at counter status tells you precisely how ture. As shown in Figure 2, this is possible only the same time as reading, dual-port mem- many words are stored in the FIFO, with because the register stack, scratchpad memory, ory is required. Two address counters are the most significant bit naturally provid- program counter stack, and program ROM are also needed to point at the write address ing a most useful “half-full flag.” constructed using the FPGA memory and the read address. Additional compara- The embedded multipliers available in resources. tor logic is then required to determine the the Spartan-3 fabric are useful in a variety A 4-input LUT in a slice-M can also be con- state of the FIFO: full, empty, or half-full. of DSP applications. Each multiplier can figured in the SRL16E mode. This provides a In Figure 6, a FIFO can be imple- replace as many as 450 logic cells that 16-bit shift register with clock enable in addi- mented using one SRL16E for each bit of would normally be required to imple- tion to the dedicated flip-flop, as shown in the data path width (Din). The SRL16E ment a multiply function. Figure 3. gives a dual-port operation but in half the Many implementations of finite impulse When writing to this shift register, the new space of the real dual-port memory. We response (FIR) filters – base stations, digital data is always placed in location “0” and all other data moves along by one location. SRL16E However, data can be read from any location LUT4 using the address inputs A[3:0]. D I3 CE The SRL16E provides a very cost-efficient I2 A3 way to implement a delay line. For example, O I1 A2 Q D Q Figure 4 shows how one LUT can be used to I0 A1 generate a 5-cycle delay for the data. Simply A0 apply the data into the SRL16E data input, INIT-1234 hard-code the address pins at “0100,” and read the data out from that address. Synthesis tools INIT-1234 can implement this technique automatically for Figure 3 – Half of the LUTs in the Spartan-3 FPGA are configurable as a 16-bit shift register. anything up to a 16-cycle delay in each LUT.

40 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS

video systems, wireless LANs, xDSL, and Conclusion cable modems – use multiply-accumulate Spartan-3/3E features were designed to save Synthesis Tools Support functions. You can implement high-per- costs by allowing the design to be imple- SRL16E Mode formance FIR filters that use multiple mented using the smallest (and thereby the The following code describes the full MACs with minimal area penalty within the lowest cost) silicon. Spartan devices utilize functionality of an SRL16E – and yet Spartan-3 fabric. Such MAC-intensive func- the world’s most advanced manufacturing it will be automatically reduced to an tions would ordinarily take significant logic process and incorporate features that allow SRL16E by the synthesis tools. The resources in competing FPGAs that lack you to integrate even more functionality in entire code is thus implemented using embedded multipliers. the smallest FPGA. one look-up table:

Vcc storage: process (clk) SRL16E begin D IN D if clk’event and clk=’1’ then CE —enable only applies to the input path A3 if input_enable = ‘1’ then A2 store0_data <= data_in; Q D Q A1 store1_data <= store0_data; store2_data <= store1_data; A0 CLK store3_data <= store2_data; store4_data <= store3_data; GND store5_data <= store4_data; CE store6_data <= store5_data; CE CE CE CE CE CE CE CE store7_data <= store6_data; D D Q D Q D Q D Q D Q D Q D Q D Q store8_data <= store7_data; store9_data <= store8_data; A[3:0] store10_data <= store9_data; 0000 01000100 store11_data <= store10_data; '0100' store12_data <= store11_data; Q store13_data <= store12_data; Figure 4 – LUT configured to generate a 5-cycle delay store14_data <= store13_data; store15_data <= store14_data; end if; Din —Select only applies to the output path D O Writeite case output_select is WE when “0000” => data_out <= store0_data; when “0001” => data_out <= store1_data;

WRITE A when “0010” => data_out <= store2_data; Pointer when “0011” => data_out <= store3_data; when “0100” => data_out <= store4_data; Read Dout when “0101” => data_out <= store5_data; READ AO Pointer when “0110” => data_out <= store6_data; when “0111” => data_out <= store7_data; when “1000” => data_out <= store8_data; Figure 5 – Traditional FIFO requiring dual-port RAM and two counters when “1001” => data_out <= store9_data; when “1010” => data_out <= store10_data; when “1011” => data_out <= store11_data; Writer

CE CE CE CE CE CE CE CE CE CE CE CE CE CE CE CE when “1100” => data_out <= store12_data; Din D Q D Q D Q D Q D Q D Q D Q D Q D Q D Q D Q D Q D Q D Q D Q D Q when “1101” => data_out <= store13_data; when “1110” => data_out <= store14_data; +1 0000 1111 Read when “1111” => data_out <= store15_data; -1 Dout when others => data_out <= ’X’; end case; end if; Figure 6 – SRL16E mode with a single counter can implement a FIFO. end process; The data path would normally be several LUTs wide.

Second Quarter 2005 Xcell Journal 41 HIGH-VOLUME SOLUTIONS Low-CostLow-Cost EasyPathEasyPath FPGAsFPGAs OfferOffer PromisePromise toto ASSPASSP CompaniesCompanies EasyPath FPGAs offer quick time to market and flexibility at prices below those offered by structured ASICs.

by Gokul Krishnan Senior Marketing Manager Xilinx, Inc. [email protected]

As the semiconductor industry continues its march towards smaller process geome- tries, it has to deal with exponentially increasing mask set costs. Customers who want to a 90 nm ASIC must pay upwards of a million dollars in NRE. In addition, they must incur significant engi- neering and design expenses, increasing the overall cost of product development to prohibitive levels. ASSP companies have come under increasing pressure to weigh more carefully the risk/reward tradeoffs for introducing new products. ASSP vendors are being forced to develop only those “blockbuster” products that will allow a reasonable return on their large up-front investments. Recent innovations such as Xilinx® Spartan™-3 FPGAs and EasyPath™ tech- nology, when used in combination with a wide IP portfolio, allow ASSP vendors to address many more opportunities in a cost- effective way while assuming minimal risks. In this article, we will examine how Spartan-3 FPGAs – together with their EasyPath counterparts – can alleviate some of the challenges faced by consumer elec- tronics ASSP vendors.

42 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS

Customer-Specific FPGAs ket/financial risk element. Structured EasyPath Customer-specific FPGAs like Selection Criteria ASICs FPGAs You can now have multiple those in the EasyPath program variations of a particular are identical to standard FPGA Time to Prototype Samples 4-8 Weeks 0 Weeks design (or indeed, multiple offerings, but use patented test- Total Time to Volume Production 12-15 Weeks 8 Weeks designs) catering to different ing techniques to significantly Vendor NRE/Mast Costs $100K-200K $75K market segments – and go to improve yield. Once your Design Cost for Conversion $250K-$300K $0 market with all of them at design is frozen, Xilinx develops the same time. This is Additional Cost of Tools $100K-$200K $0 custom test patterns that specif- for Conversion because the NRE charges ically test only the resources Unit Costs Low Low are minuscule compared to used by that particular design. Risk High Low ASICs. You no longer need You reap the benefits of these Flexibility to Make Changes to spend valuable engineer- Inflexible Flexible consequently higher yields in In-System ing resources to convert an the form of lower costs. Design Conversion from Additional Conversion FPGA design into an ASIC With EasyPath FPGAs, you Prototype to Production Engineering Free RTL, nor spend multiple can realize a 30-80% reduction *Xilinx Market Analysis weeks in verification in unit prices (compared to and simulation to benefit equivalent standard FPGAs) Table 1 – EasyPath FPGAs offer significant advantages in time to market from low-cost, high-volume when you move to high vol- and total cost of ownership when compared to structured ASICs. solutions. ume. Because customer-specif- ic FPGAs are the same piece of silicon as International Business Strategies’ Fourth Time to Market standard FPGAs – except cheaper – they Quarter 2003 report estimates the product In today’s world of complex design cycles, offer a one-for-one match of all the fea- development cost of a 0.13µ ASIC to be short product life cycles, and quickly tures offered in a standard FPGA. As a greater than $10 million, including the cost changing market needs, time to market is result, almost no involvement is required of design, verification, and prototyping. of paramount importance. Unfortunately, from customer engineering teams; the This implies that an ASSP vendor has to the complexity of deep sub-micron conversion process from design comple- sell at least 250,000 units of a $40 ASP designs increases the time required to tion to production is completely seamless. device (see Figure 1) just to recoup the develop and debug a design, while Furthermore, the lead times from the development cost. Clearly, if other sales increasing the risk of re-spins. time a design is frozen to volume production and marketing costs are included and a rea- Because ASIC re-spins can set back a in customer-specific FPGAs can be as short sonable ROI is expected, the lifetime vol- schedule at least three months, ASSP vendors as 8-10 weeks – an advantage of almost 12 ume potential of a device must be take a tremendous risk of being late to mar- weeks when compared to ASIC methodolo- significantly higher. ket. Some studies have shown (as in the IBS gies. Table 1 shows a comparison of Customer-specific FPGAs, on the 2003 report) that just a three-month slip in EasyPath FPGAs versus structured ASIC other hand, significantly reduce the mar- the schedule can cause a product revenue methodologies, illustrating why EasyPath FPGAs are a better overall solution.

1,400,000 30,000,000

Addressing ASSP Market Needs $40 ASP 1,200,000 25,000,000 Historically, ASSP vendors have used $50 ASP FPGAs as prototyping vehicles, moving to 1,000,000 $80 ASP 20,000,000 Design Cost custom ASICs in high volume to take 800,000 advantage of their low unit costs. This was 15,000,000 a reasonable strategy at 0.18µ and larger (units) 600,000 10,000,000 process nodes because lower NRE costs 400,000 allowed companies to take the 200,000 5,000,000 Required Life Cycle Volume

financial/market risk and go to market Deverlopment Cost per Design ($) with multiple products. 0 0 As the demands of system integration, 0.25 0.18 0.13 0.09 performance, and power have pushed Process Geometry (µ) ASSP vendors down to 0.13µ and 90 nm, NRE costs and design complexity have Figure 1 – The increasing cost of developing ASSPs as process nodes shrink means that companies are forced to go after high-volume “blockbuster” products to get a reasonable return on investment. increased significantly. A study in

Second Quarter 2005 Xcell Journal 43 HIGH-VOLUME SOLUTIONS

40.0% like standard cell or structured ASIC. However, recent developments by Xilinx in 35.0% customer-specific FPGAs now preserve 30.0% some of the flexibility of the FPGA while also maintaining a low-cost approach. 25.0% Fast Xilinx Spartan-3 and Virtex™-4 20.0% Med EasyPath FPGAs enable you to prototype Slow two different designs in a standard FPGA 15.0% and then move to an EasyPath device that

Revenue Reduction 10.0% supports both those designs simultaneous- ly in production. For example, you can use 5.0% one bitstream to perform system diagnos- 0.0% tics on the entire system and the other to 3 Months 6 Months 9 Months 12 Months load the second application-specific bit- Project Delay stream. Alternatively, you can choose to implement designs for two different prod- Figure 2 – The impact of time to market on product revenue in fast-, medium-, and slow-moving market segments (IBS 2003). ucts in a single device and take advantage of a common part number for inventory reduction of about 15% (see Figure 2). In the mover advantages. ASSP vendors must management purposes. extreme case of seasonal products (widely either gamble that their final specifications In addition, with Spartan-3 and prevalent in the consumer electronics indus- will get adopted in a standard; implement a Virtex-4 EasyPath FPGAs, you can try), schedule slippage can mean that an superset of all possible future combina- change LUTs and I/Os even after these ASIC product will miss customer demand tions; or delay the design phase, running devices have been deployed in the field altogether. the risk of being late to market. These (see Figure 3). As a result, you can now fix Customer-specific FPGAs allow ASSP options are far from optimal. minor bugs or tweak certain parameters to vendors to postpone spending decisions One of the major advantages of FPGAs further optimize your designs. until market uncertainties are removed. over ASICs is the flexibility to make design For instance, a line card in a router might Once a design is completed, you no longer changes in case of a specification change or need to have the drive strength (and slew need to plan three to four months in design error. Traditionally, designers have rate) adjusted a notch or two depending on advance and hope that the silicon you get at had to forgo this advantage as they move what load it encounters. You can implement the end of the day works to your specifica- from FPGAs to an inflexible custom solution a range of drive strengths that are available tions. With one of the highest levels of reli- ability and shortest lead times in the industry, customer-specific FPGAs require ASSP vendors to incur little to no risk in the conversion process. Furthermore, with just Look Up eight to ten weeks to production, ASSP ven- Digital Clock Manager Tables dors can go to market with confidence and hit that prime market window.

Technical Challenges A common challenge that many ASSP ven- dors have to deal with is the changing land- scape of industry standards. This is Flexible particularly true in wired telecom and wire- I/Os less applications, where the lack of a coordi- nated effort to specify various telecommunications protocols and interface standards can result in incompatibilities between different vendors’ products. Ratifying standards is a time-consuming Figure 3 – EasyPath FPGAs allow ASSP customers to fix minor design bugs by changing process and vendors often have to freeze LUTS and I/Os even after they are in high-volume production. their designs much earlier to get first-

44 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS

for certain I/Os. This new level of customer- while at the same time keeping the core area formance. As a result – and contrary to specific flexibility offers a unique blend of to about 1% in an XC3S5000. This provides industry trends – static power consumption FPGA-like features at ASIC-like prices. you with low-cost processing power while in the Xilinx 90 nm Virtex-4 family is about Another important issue is the availabil- leaving enough room for other peripherals. 50% lower than in 0.13µ FPGA devices. ity of qualified, reliable IP. ASSP vendors For example, at a unit cost of $12.95 (the On the performance front, the use of can take advantage of the vast portfolio of 50K unit price) for an EasyPath XC3S1500 specialized embedded hard macros for IP offered by FPGA vendors during the device, you can prototype with confidence processor cores and DSP slices allows Xilinx prototyping phase. When the design is using Spartan-3 standard FPGAs and con- to provide operating frequencies on par with ready for production, you can use the same vert to EasyPath FPGAs for high volume. ASICs. Even on the cost front, the move to IP without any additional charges or quali- smaller process nodes and the development fication effort. ASIC solutions, on the Conclusion of clever techniques to improve yields has other hand, require ASSP vendors to either Power, performance, and cost have ranked allowed FPGA vendors to shift the crossover maintain their own portfolio of state-of- high among the reasons why ASSP ven- point between FPGAs and ASICs to higher the-art IP or incur large expenditures in dors have historically migrated to ASIC and higher volumes. acquiring/qualifying standard IP blocks. solutions over FPGAs. But recent ASSP vendors are being squeezed from advances in process technologies have sig- both ends – by market uncertainties and risk Case Study: Consumer Electronics nificantly mitigated these concerns to a on one side and rising product development Companies developing consumer electron- point where solutions like EasyPath costs on the other. EasyPath FPGAs offer a ic products increasingly find themselves FPGAs have become viable high-volume way out. They eliminate the risk of a new having to incorporate more and more ASSP alternatives. product introduction by significantly reduc- functionality while maintaining lower The use of triple-oxide technology, for ing the total cost of development and the time costs. In no other area are these trends example, has allowed Xilinx to selectively to production. ASSP vendors can now address more evident than in home networking, vary gate-oxide thicknesses to optimize spe- many more markets and segments that might which involves the aggregation and distri- cific regions for low power or high per- otherwise have been uneconomical. bution of content within a home through wireless modems, cable, xDSL, and satel- lite set-top boxes, among others. In these applications, the ability of chipset vendors to integrate various stan- dards and encrypt/decrypt data streams is critical. Xilinx Spartan-3 FPGAs (and their EasyPath analogs) support multiple I/O standards such as Ethernet, IEEE-1394, USB2.0, and PCI, allowing you to develop designs that are either feature-rich or seg- mented by end usage without any addi- tional effort. The Spartan-3 family also offers an extensive library of encryption cores such as AES, DES, and TDES for applications where security is important. In addition, Spartan-3 EasyPath FPGAs support various SSTL and HSTL I/Os for interfacing to different high-speed memo- ries such as SRAM and DRAM, as well as reduced swing differential signaling (RSDS), with applications in LCD TVs and other display products where power consumption is critical. Finally, for applications that require com- putational capability like image/video pro- cessing, the Spartan-3 EasyPath FPGA’s MicroBlaze™ soft-processor cores bring the MIPS cost down below $0.02 per DMIP,

Second Quarter 2005 Xcell Journal 45 HIGH-VOLUME SOLUTIONS ApplicationsApplications ofof thethe SpartanSpartan FamilyFamily inin Flat-PanelFlat-Panel DisplaysDisplays

A full feature set and low prices make Spartan-3 devices attractive for use within the flat-panel display market.

by Danny Mok Product Marketing Manager (High Volume) Xilinx Hong Kong [email protected]

With its combination of low cost and full feature set, the Xilinx® Spartan™-3 family is uniquely positioned to enable various digital consumer systems. One market segment where Spartan-3 devices are being widely accepted is the flat-panel display (FPD) market. Flat-panel displays are the fastest grow- ing segment of a new breed of consumer tel- evisions. There are three types of flat-panel displays on the market: LCD, plasma, and projection (DLP). All three types are poised to dramatically increase unit shipments – see Figure 1 from iSuppli Research.

46 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS

Examples of such functions include 40,000 gamma correction, image scaling, edge Projection PDP LCD 35,000 detection, sharpness, contrast, and frame buffer. Given the range of features that 30,000 Spartan-3 devices provide, you can imple-

25,000 ment many of these functions with minimal silicon area. Some of the intellectual prop- 20,000 erty cores utilized by our customers include: 15,000 • Color Space Converter

10,000 • JPEG Codec

5,000 • Discrete Cosine Transfer

Ð 2002 units 2003 units 2004 units 2005 units 2006 units 2007 units • MPEG Video Encoder/Decoder

Television Systems Market Tracker Q4 2003 iSuppli Corporation Figure 2 shows the various functions in a typical FPD that you can implement in a Figure 1 – Most industry experts agree on the explosive growth projections for flat-panel displays. Spartan FPGA.

The Spartan-3 Feature Set before it is fed into the main image pro- Interfacing Spartan-3 FPGAs offer various features that cessing engine. This pre-processing can Many times the FPGA is used for interfacing are very useful for FPD system designers. be anything from discrete cosine transfer, to the screen or to the external memory. These include embedded multipliers that decryption, or interlacing/de-interlacing. Spartan-3 support for low-swing differential enable very efficient DSP implementation; Normally this is how an FPGA is gener- I/O standards (such as reduced swing differ- shift-register functionality features that ally used – in conjunction with the cus- ential signaling [RSDS]), as well as other enable high performance and reduce tomer’s image-processing ASIC or ASSP. single-ended standards such as SSTL/HSTL, resource utilization for implementation of One common use is to assist the ASSP in allows it to be used in such applications. pipelined or multi-channel functions; large image scaling and de-interlacing. The FPGA can also be used to implement memory resources; and built-in support for the timing control unit (TCON) to control popular differential I/O standards used in Core Processing the horizontal and vertical pixel displays for- the display market. Table 1 lists the various Although core image processing is gener- mat. TCON is more or less a vertical and Spartan-3 features and their use in DSP ally done using standard ASSPs or cus- horizontal display counter. The large amount implementation. tom ASICs, FPGAs are sometimes used of flip-flops in the Spartan-3 architecture A Spartan-3 device can be used in FPD for custom processing. Xilinx, along and a built-in carry chain allows for an effi- designs in various ways. Let’s look at some with its AllianceCORE™ and XPERTS cient implementation of this function. typical FPGA usage within the FPD system. partners, provides a range of intellectual Also, with abundant differential I/O property cores that can ease the design of channels, Spartan-3 FPGAs have an efficient Front-End Pre-Processing these functions. Visit the Xilinx IP architecture for designs that are more I/O- The digital RGB signal will typically Center (www.xilinx.com/ipcenter/) for intensive. Figure 3 shows how even a small require some kind of pre-processing more details. Spartan-3E device has enough differential I/O resources to drive a large LCD/plasma screen. Spartan-3 Silicon Features Value in FPD Applications In many designs customers require an Embedded Multipliers Efficient implementation of MAC FIR and other DSP Functions interface to high-speed DDR external mem- Shift Register Logic Efficient implementation of multi-channel functions ory. Xilinx provides reference designs and application notes that allow you to build an Video Line Buffers, Cache Tag Memory, Scratch-Pad Memory for Block RAM and Distributed RAM efficient memory controller within the DSP co-efficients, Packet Buffers, FIFOs Spartan-3 fabric. RSDS support without termination resistors, other special Built-In Support for RSDS design considerations Conclusion With some of the industry’s lowest price points and full-featured capabilities, a Table 1 – Spartan-3 features allow for an area-efficient design in FPD systems. Spartan-3/3E FPGAs is ideal for various low-

Second Quarter 2005 Xcell Journal 47 HIGH-VOLUME SOLUTIONS

Digital Frame Image Buffer Let Xilinx help CODEC Processing Memory you get your

Input System Display message out to PHY MAC Control Driver thousands of

uC programmable logic users Output logic users worldwide.

That’s right ... by advertising Figure 2 – A typical block diagram of functions found within a display application youryour productproduct oror serviceservice inin thethe Xilinx Xcell Journal, you’llyou’ll reachreach more than 70,000 engineers,engineers, 12 RSDS Channels – 4 Channels Each for RGB 4 LVDS Channels designers, and engineering XC3S100E - T144 RSDS Clock 5252 DIFF DIFF I/O I/O Channels Channels LVDS Clock managers worldwide. 12 RSDS Channels – 4 Channels Each for RGB 4 LVDS Channels –TCON RSDS Clock mplemented Within the Device) The Xilinx Xcell Journal isis anan LVDS Clock LCD Display award-winningaward-winning publication,publication, dedicated specifically toto helpinghelping programmableprogrammable

LVTTL Control Signals logiclogic usersusers –– andand itit works.works.

LVTTL Control Signals We offer affordable advertising ratesrates andand aa varietyvariety ofof advertisementadvertisement sizessizes toto meetmeet anyany budget!budget!

Figure 3 – The large number of differential I/Os allows even small Call today : and inexpensive Spartan-3 devices to drive large displays. (800)(800) 493-5551493-5551 or e-mail us at [email protected] cost digital consumer systems. In flat-panel • Timing control and panel RSDS [email protected] display systems, the Spartan-3/3E architec- drivers JoinJoin thethe otherother leadersleaders ture is even more useful because it allows inin ourour industryindustry andand advertiseadvertise high-performance DSP to be implemented in • Image compression/decompression inin thethe XcellXcell Journal!Journal! a fraction of the area consumed by competing The programmable nature of an FPGAs. It also has built-in support for low- FPGA solution reduces the inherent swing display I/O standards such as RSDS. development risk in a new system design. Spartan-3 FPGAs have been used to imple- With its host of other features such as ment these functions successfully within flat- multiple I/O banks, on-chip digital panel display systems: clock managers, and extensive block and

• SD/HD color space conversion distributed memory, Spartan-3 devices R can also efficiently implement many • 4:4:4 to 4:2:2 downsampling control/glue logic functions, effectively • Digital RGB-to-USB card reader reducing the size, complexity, and cost of function your system.

48 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS ManagingManaging SignalSignal QualityQuality Making trade-offs between drive strength, termination strategy, and signal quality.

by Bill Hargin Impedance Mismatches • Timing-critical (address, data, and sig- Product Manager, HyperLynx Reflections can occur at any impedance nals that can have non-ideal edges but Mentor Graphics discontinuity, including variations in must align with timing requirements) [email protected] board stackup, trace-width variations, ball- • Signals with driver edge rates faster grid-array breakouts, stubs, vias, loads, than 5 ns Driven by advances in lithography, IC connector transitions, or large power-plane switching speeds continue their progres- discontinuities. A quick look at the effect of fast driver sively faster march. At the same time, esca- The significance of a reflection is impact- edge rates is instructive. Figure 2 shows the lating clock speeds result in much less ed by several factors, including the imped- effect of increasing driver edges on the forgiving timing margins. ance difference, the length over which the same 5 inch trace. The 10 ns and 5.0 ns The techniques for managing high- impedance difference occurs relative to the drivers produce clean receiver waveforms. speed effects can be broken into three overall length of the transmission path, and The faster 2.5 ns and 1.0 ns drivers, how- broad categories, sometimes referred to as your technology’s tolerance for noise. ever, produce reflections and ringing on the “the three Ts”: Some reflections – if significant yellow and red receiver waveforms. enough – cannot be resolved using the • Technology – selecting driver technology The Second “T”: Topology, three Ts. For this reason, careful pre-lay- fast enough to meet your functional Signal Integrity, and Timing out impedance planning with a tool like needs (but as slow as possible) Signal-integrity problems tend to disappear HyperLynx’s Stackup Planning (shown in when nets are short relative to how fast • Topology – selecting topologies that Figure 1) is a critical part of a proactive they are driven, as reflections settle much meet timing requirements while mini- impedance-control process. mizing the impact of signal reflections more quickly. From the fastest 1.0 ns wave- form in Figure 2, the reflections eventually • Termination – managing signal reflec- The First “T”: Triage by Technology smooth out at a half-inch trace length. tions using passive components Several strategies exist for dealing with non- Although academically instructive, a ideal routing. The first is to know which nets Sounds easy, right? The problem is health-conscious engineer certainly would can accommodate poor routing and which that there are thousands of such choices not want to specify more than a few care- cannot. A “technology triage” strategy works to be made when designing a PCB, and fully planned high-speed nets with a half- well, dividing nets into those that are: you must balance these against timing inch maximum length requirement. requirements and electromagnetic com- • Signal integrity-critical (clocks, strobes, Sometimes, departures from “good prac- pliance (EMC). and signals that require clean edges) tice” routing can actually be a key to resolv-

Second Quarter 2005 Xcell Journal 49 HIGH-VOLUME SOLUTIONS

ing signal-integrity problems. parallel DC terminating resistors Consider the case of a clock with pulled to opposing power supplies to multiple receivers, each of which is achieve a specific DC bias for skew-sensitive (the clock must arrive Thevenin termination. at each receiver at close to the same Here are some general termina- time). In this case, a daisy-chain tion guidelines: route may not be ideal because it • Source termination is useful in delivers the signal to each receiver point-to-point/one-directional serially, inherently creating skew. connections Here, a superior scheme may be a “star” pattern, in which each receiver • Far-end termination is useful in (or small subsets of receivers) has its multipoint connections own routing branch. Each receiver • Distributed termination can be can be placed at approximately the Figure 1 – Careful pre-layout impedance planning using the HyperLynx Stackup Planning tool helps helpful if you have a plug-in sys- same delay length from the driver, eliminate signal reflections. tem with variable configuration and each receiver is considerably more isolated from other receivers Each of these techniques has than on a daisy chain. advantages and disadvantages. Underscoring the interrelationship Parallel DC termination is the sim- and trade-offs between the three Ts, plest, both from the standpoint of however, star routing introduces sev- component count (only one, built eral new problems. Multiple branches into Spartan-3 FPGAs) and the present the driver IC with a low choice of value (equal to the line impedance, requiring it to dynamical- impedance). However, it burns the ly sink and source significant current. most power and may unacceptably In practice, you may need to use a load the driver IC. AC termination stronger driver technology for this requires an additional component topology example, such as a Xilinx® (more expensive, extra board space) Spartan™-3 LVCMOS33_F_24mA and more engineering work (finding driver instead of a LVCMOS33_F_ the optimal capacitor value), but 8mA, as shown in Figure 3. Figure 2 – The effect of driver edge rates on a 5 inch trace. The reduces power consumption. 10 ns and 5 ns edges look fine, with reflections and ringing Series termination creates a volt- effects appearing on the receiver waveforms for the faster 2.5 ns age plateau that persists until a The Third “T”: Termination and 1 ns drivers. For sub-nanosecond driver switching speeds, As a general rule, any line with an receiver waveforms get progressively worse. reflection is received back from the edge rate faster than 5 ns on nets run- end of the line, so series terminators ning longer than an inch should be do not work properly from a timing considered a termination candidate. standpoint unless the receiver ICs Although reducing costs (see sidebar, are clustered near the end of the net, “Conserve Board Space with Spartan- as shown in Figure 4. 3 FPGAs”) is important, the associat- There are several ways to termi- ed signal quality benefits are critical – nate at a junction or star connec- impacting whether the product works tion. One way is to have a series at all. Let’s review some termination termination at every driver. This has strategies for various trace topologies the advantage of reducing settling and design requirements. time at the receiver while consum- ing a minimum amount of power. Termination Types Several conditions must be met for a The classic methods of terminating single-series termination strategy to digital transmission lines are well be effective. Each branch must be close to the same length; otherwise, known. You can terminate the Figure 3 – The HyperLynx oscilloscope and transmission-line source, the far end, or both; you can schematic show a driver that is too weak for the parallel, low reflections coming back from each employ “distributed” terminations at impedance star topology. It would be better in this case to use a branch are not in sync and end up several locations; or you can use two stronger 24 mA buffer. “bleeding” from branch to branch.

50 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS

For a discussion regarding termination component place- Using Spartan-3 FPGAs ment, see the “Using Spartan-3 to Optimize Termination FPGAs to Optimize Termination Component Placement” sidebar. Component Placement Ideally, terminators on a PCB are locat- Conclusion ed at precisely the position they’re Spartan-3 devices – with built-in designed to terminate: exactly at the terminators for both single- last receiver IC for parallel termination ended and differential signaling or exactly at the driver for source ter- and support for LVTTL, LVC- mination. Unfortunately, in the real MOS, SSTL, HSTL, GTL, world of dense PCBs, ideal placement LVDS, and RSDS – are a key is often impossible. Therefore, it’s com- Figure 4 – Series terminators do not work properly from a enabler for hardware engineers mon to find terminators that are locat- timing standpoint unless the receiver ICs are clustered near seeking high-speed technology at the end of the net, because series termination works by creating ed a “stub” length away from their ideal a voltage “plateau” that persists until a reflection is received a reasonable price. But the positions. But how long can the stub back from the end of the line. increased capabilities of modern be before the terminator fails? devices force today’s engineers to Series termination will start to fail shoulder the burden of proactive- when the stub is longer than about Each branch must also be the same imped- ly resolving signal integrity, timing, and 10% of the driver switching time. We ance (or close), or it will be impossible to EMC problems. can simulate this with a series resistor choose an effective single resistor value. If With the money you’ll save by manufac- placed within this guideline and com- branches are longer than three-quarters of turing with Spartan-3 FPGAs, consider pare the result to a series terminator an inch, it makes sense to make their paral- adding signal integrity analysis software to placed an inch away. lel impedances equal to the inbound line your toolbox. Several features are important in For parallel termination, if the ter- impedance from the driver. You can also good analysis software, including the ability to minator is located upstream from the terminate at the junction itself by changing recommend termination strategies and run last receiver on the net, the stub length the trace impedance or by using parallel “what-if” simulations early in the design cycle. can be as long as about 15% of the DC termination – both dampening reflec- Critical for post-layout analysis are both driver switching time. tions quickly, and attenuating the signal. “interactive” and whole-board “batch” simu- Fortunately, with Spartan-3 devices, The most appropriate choice depends lation, where violations are flagged and rec- these terminators are located on the on network topology and signal direction. ommendations are made for an entire PCB. FPGA itself, removing the need for For nets that have complex routing pat- Armed with the latest Xilinx technology, placement gyrations and allowing you terns, it may be difficult to find a termina- the “three Ts,” and the appropriate simulation to focus on using the “three Ts” dis- tion scheme that works even in theory. This software, you will not only be ready for today’s cussed in this article to address signal is where a “what-if” simulation tool like technology, but ready for what’s coming down quality, timing, and EMC. HyperLynx can be an indispensable ally in the road as well. For more information, visit comparing alternatives. www.mentor.com/hyperlynx/.

Conserve Board Space with Spartan-3 FPGAs

Spartan-3 FPGAs have built-in termination resistors, allowing To get an idea of the benefit, consider a 4 x 6 in, 6-layer PCB hardware designers to save significant board space that would have with three Spartan-3 FPGAs on it – including a combined routable previously been allocated to surface-mount resistors. area of 77 square inches. With previous FPGA technologies, exter- Traditional external terminations require two traces for every nal terminating resistors would consume about 10,000 square mils chip-to-chip connection: a trace for each resistor lead. On-chip ter- of board space (40 x 250 mils) each. A board of 4,000 nets (1,333 mination using Spartan-3 devices cuts this in half, requiring less with termination) would result in a 13.33 square-inch board space routing area and possibly reduced layer count. savings and a 17% reduction in manufacturing costs.

Second Quarter 2005 Xcell Journal 51 HIGH-VOLUME SOLUTIONS Meeting Timing and Reducing Area How to meet timing with the Synplify Pro Tool while still keeping the area to a minimum.

by Steve Elzinga we’ll outline how to design with perform- mizations will be affected. The Synplify Senior Product Applications Engineer ance and area in mind, along with a few solution has to work on the same critical Xilinx, Inc. tricks-of-the-trade for reducing area. areas as reported by place and route. If [email protected] You should know roughly how many there are excessive routing delays, apply Steve Pereira levels of logic the requirements will toler- the -route constraint to either the clock or the path (see the online documenta- Senior Technical Marketing Manager ate before failing timing. Keep this in Synplicity, Inc. mind while coding counters, state tion for more information). [email protected] machines, decoding, and data path logic. • Do not use the global frequency box When the logic is nearing the first draft, on the front panel of the Synplify tool Being the first to deliver a product in a synthesize and place and route; this will unless you need to constrain combina- given market greatly increases its chance of give you a glimpse at the problems ahead. torial paths. Specify all the clocks in success. Effective use of chip resources If you meet timing, then read further to the constraints editor (SCOPE). allows for less expensive parts and a less the area-saving section. Ensure that unrelated clocks are put expensive overall solution. If you don’t meet timing, here is a into different clock groups. Synplify Pro software has many timing checklist you can go through: • If you have clocks that are related and closure features that increase your chances of • First and foremost, use the Synplify in the same clock group, make sure the using slower speed-grade parts without com- or Synplify Pro products. Some design- periods have a common multiple. If promising on quality. The Synplify Pro tool ers are often unaware of the impact of the clocks are slightly different, this offers a perfect fit for the high-volume, low- a quality synthesis tool. If you don’t can cause very small clock-to-clock cost Spartan™ series of Xilinx® FPGAs. have Synplify or Synplify Pro and are setup delays, making timing impossi- not meeting timing, or you want to ble. This can again have an effect on Identifying Critical Areas reduce area, you can download a full- logic optimizations. For example, if Usually, only a few sections or modules in featured evaluation copy from the you have two clocks, Clk1 20 Mhz and a design fail timing. It is often a good idea Synplicity website for free. Clk2 23 Mhz, change to Clk1 20 Mhz to recognize these structures while coding and Clk2 25 Mhz. your design. During coding, the designer • Is Synplify software reporting positive usually knows what device the design is tar- slack in the log file (.srr)? If the Synplify • Are the I/O constraints correct? geted for but not the speed grade. Bumping product’s estimates are incorrect (usually Excessive input or output delay can up to the next speed grade can add unbud- because of excessive routing delays wreak havoc with synthesis. If the criti- geted costs to the project. In this article, caused by congestion), the logic opti- cal path includes an I/O and one level of

52 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS

We hope these guidelines and hints will help you quickly achieve the performance required for your FPGAs while minimizing the logic resources...

logic, there is not much that synthesis How to Reduce Area • Clock tree management. Clock man- tools can do about it. Turn on I/O regis- Here are few ways to reduce area for agement is extremely design-depend- ter packing with the syn_useioff switch. Spartan-3 FPGAs. These techniques will ent, so it is difficult to offer generic have an impact on the timing. advice. The idea is to pack each clock • Ensure that pipelining is enabled. Try a quadrant of the chip with as much run with re-timing on. We have found • Use as many of the dedicated resources logic as possible. that this gives a ~ 5% performance as possible. Spartan-3 devices have improvement for Spartan-3 devices. It is plenty of resources that you can tap • The Synplify tool will build feedback also a very good idea to surgically apply into to reduce LUT usage. Here are logic for registers without resets. This the retiming attribute to registers on the some suggestions: extra MUX can add to the LUT critical path with the global switch off. • Try to keep black boxes/Coregen to a count. Reset every register in the This can increase performance while minimum. Wherever possible replace design if possible. keeping the area increase to a minimum. these components with generic code. • If certain paths/modules are clocked by • Add all timing exceptions: false paths Synplify software is capable of optimizing the critical clock but are not critical and multi-cycle paths. Just adding these around black boxes, but not the boxes themselves, either supply a multicycle constraints can make a huge difference themselves. Significant effort has gone path or give them another clock line off in performance. into dedicated resource management for the DCM. The perfect solution is to each Xilinx architecture. With generic • Does the critical path start or end inside have as little as possible high-frequency/ code, the Synplify product can remove a black box or Coregen .edn/.ngc/.ngo? critical logic on the chip. redundant logic, merge identical logic, Is it possible to re-code in generic code? and pack into the dedicated resources. • Experiment with different state If not, add the .edn file to the Synplify machine implementations. Try a run • Synplify software may pack logic into project. The Synplify tool will optimize with the Synplify Pro product’s FSM registers for performance reasons. If this the logic around these cores. If you Explorer. This is a timing-driven state is undesirable, either loosen the timing have .ngc files, use ngc2edf.exe to con- machine engine that will increase run- constraint or force the logic into vert the core to .edn and add to the time but can often find a better FSM RAMs/ROMs/multipliers. Please see Synplify project. solution for your particular design. the Synplify online documentation for • Provide I/O types. Remember to more information. • Turn on resource sharing. Resource specify the speed and type for I/Os. • There are times when the Synplify tool sharing tells the tool to share logic such Synplify software needs this timing as adders and multipliers whenever pos- information to optimize the will not pack logic into the dedicated resources. Common occurrences are: sible. This may have an impact on tim- driving/driven logic. ing, but usually results in fewer – RAMs; either the address bus or the • Are there gated clocks in the design? resources used. data output must be registers Turn on the gated-clock converter in the Synplify Pro tool. This engine will push – Reset or preset is synchronous/ Conclusion the clock from the register input pin asynchronous, causing failure Spartan-3 devices are increasingly being onto the clock line while maintaining the to map used in higher volume applications. But along with higher volumes comes the same functionality. This will dramatically – Enable is synchronous/asynchro- requirement that part cost is as low as pos- increase the performance of this path. nous, causing failure to map (our sible. We hope these guidelines and hints • Turn off resource sharing. online help has examples) will help you quickly achieve the perform- Are you now meeting timing? If not, – Timing constraints are too tight; ance required for your FPGAs while mini- you may want to try Amplify FPGA phys- logic is mapped to registers for tim- mizing the logic resources – and therefore ical synthesis, which generates detailed ing reasons cost – of your next design project. placement and performs physical opti- – ROMs are less than half-populated; For more information, please contact your mizations for additional performance and use the syn_romstyle attribute local Synplicity sales office, which you can timing predictability. to force find on our website, www.synplicity.com.

Second Quarter 2005 Xcell Journal 53 HIGH-VOLUME SOLUTIONS The Changing Face of Automotive ECU Design Why are automotive electronic control unit designers moving away from microcontrollers and microprocessors and looking at programmable logic devices (PLDs) instead?

by Karen Parnell Automotive Product Marketing Manager Xilinx, Inc. [email protected]

Microcontrollers and microprocessors have long been the natural choice to implement control functions in instru- ment clusters, door modules, and other control-based electronic control units (ECUs). Design experience and legacy code are two of the many reasons to use such devices time and time again. The tendency is to build the software topology in logically partitioned function blocks that meet the ECU specification. But only when these function blocks are compiled and targeted at a single device will we see the real-time issues associated with sequential code. Even though these functions were built discretely, they will run sequentially based on a pre-defined pri- oritization structure. You may not even see these timing issues immediately. Any interrupt collision issues will not be caught in simulation because the tendency is to only test what you know should work – and it is almost impossible to test for indeterminate states or real-time fault conditions. Any function that runs off of an interrupt can only really be tested after the system is built up and

54 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS

The CPLD can interface to the microcontroller through a simple serial bus and provide low-cost inputs and outputs to the devices that need to be controlled. run in real time. At this point we can catch Board-Level Functions design with, but offer extremely low power more of the errors, but only when the As designs become more complex, more consumption, thus eliminating the need to product hits the road will every conceivable peripheral functions are controlled by 8-bit compromise power or performance. The interrupt state be tried and tested. We have microcontrollers – but they may not have low-cost XC9500XL devices are ideal if you to hope that these are not major failures sufficient I/O. There are a number of solu- require 5V-compatible I/O. and that they can be fixed with a software tions to this issue, including stepping up to a Figure 1 shows a typical example of patch at the next scheduled vehicle service. more costly 16-bit microcontroller. Another where a CPLD can be used alongside a is to partition the design over a low-cost 8-bit microcontroller, providing extra I/O to Function Interdependency microcontroller and low-cost CPLD. connect to multiple indicator LEDs and With programmable logic, you can build Here’s a good example: a system has mul- also provide I/O to control multiple up a system in the same way as with a tiple ADC inputs used as data inputs to help motors. The CPLD in this example also microprocessor or microcontroller – by control and position multiple stepper provides simple PWM functions and ADC implementing discrete functional blocks. But unlike micro code, these functions are totally independent: instantiated in hard- M ADC ware (as opposed to software), they will run TTL to LEDs in real time and are not susceptible to erro- PWM 4 neous errors or interrupt states. Microprocessor Each function will operate on its own M Serial PWM and is guaranteed to run as described every Bus time, under every condition. Another ben- efit is that you can simulate every state Xilinx CPLD ADC before building any hardware, so the risks of operations or tasks not functioning as required – or happening out of sequence – Figure 1 – I/O expansion example are dramatically reduced. For example, when building an instru- ment cluster, you may need to display a motors (such as in an instrument cluster) interfacing to aid with motor control, “tell tale” based on the driver pushing a plus other I/O for LED control and switch which can enable you to use a lower cost, switch; respond to a CAN packet dictating inputs. The low-cost microcontroller is ideal basic microcontroller without PWMs. the position of the tachometer; and handle for the sequential signal processing but has an error message from the braking system limited I/O to interface to off-chip devices. Interfacing and Bridging that must be displayed urgently to alert the The CPLD can interface to the micro- With the increasing number of bus proto- driver – all at the same time. controller through a simple serial bus and cols supported by ASSP and microcon- In software-based systems, these tasks provide low-cost inputs and outputs to the troller vendors, interface translation is a run sequentially. You must give priority to devices that need to be controlled. growing problem that requires an inexpen- the error message, but which task runs Although this approach may increase the sive and easy solution. Offering the lowest when and in what order depends on when board component count, the combined cost cost method for translating one bus proto- the interrupt hits the micro. of an 8-bit microcontroller and CPLD is col into another, CPLD devices can sup- In a PLD-based system, all of these lower than that of a 16-bit microprocessor. port interface bridging applications, functions are handled in parallel, displayed This approach also allows the board to be including voltage-level shifting (3.3V in to as soon as the message hits the device. The adapted to suit many customer configura- 1.8V out), bus translation applications parallel processing available in FPGAs and tions through the CPLD, providing the abil- (translating a proprietary language to an CPLDs offers a great advantage in applica- ity to interface to many different devices industry-standard language), serial-to- tions where instant, uninterrupted signals under control. The Xilinx® CoolRunner™- parallel and parallel-to-serial bus conver- must be processed reliably. II family of CPLDs are not only easy to sions, and DDR memory interfacing.

Second Quarter 2005 Xcell Journal 55 HIGH-VOLUME SOLUTIONS

3.3V SRAM

Address

Microprocessor Data DDR 1.5V 2.5V µ Control RAM P Flash Xilinx CPLD Xilinx CPLD Memory

1.8V Figure 2 – DDR memory interfacing and level shifting using CPLDs I/O

Figure 2 shows how CPLDs can be used to interface to high-speed memory and also Xilinx Automotive Devices shows an example of a CPLD level-shifting between different voltage levels. Designing electronics systems for automobiles has always been challenging. With CoolRunner-II CPLDs are ideal for automotive electronics comprising as much as 90% of the functional innovation in DDR memory interfacing, as they include new vehicles, it has become even more important to choose the right devices to meet DualEDGE-triggered registers, a global temperature, quality, and technological requirements. clock divider, and voltage-referenced I/O The Xilinx Automotive (XA) range of FPGAs, CPLDs, and configuration devices standards including SSTL_2. These fea- not only offer the design flexibility and time-to-market advantage associated with tures provide the capability to interface programmable logic devices, but also meet the extended temperature requirements between a microprocessor and high-speed and quality standards required in today’s automotive advanced electronics systems. memory devices such as DDR SDRAM. To make implementing this function easi- The XA device range is offered in both the extended temperature Q-grade (-40º C to +125º C) and industrial I-grade (-40º C to +85º C) and is qualified to the industry- er, Xilinx offers a reference design and recognized AEC-Q100 standard (for more information, visit www.aecouncil.com). application note (XAPP384, “Interfacing The XA range includes the lowest power CPLD devices available (CoolRunner-II to DDR SDRAM with CoolRunner-II CPLDs), the lowest cost FPGA (Spartan™-3 devices), and Platform Flash configura- CPLDs”) for downloading at the Xilinx tion memory devices (Table 1). website (www.xilinx.com/literature/). This reference design is particularly useful if For more information, visit www.xilinx.com/automotive/. your chosen microprocessor does not sup- port HSTL or SSTL memory interfaces. Standards Block RAM FPGA/CPLD Packages Density Range I/O Supported Multipliers (Kbits)

Conclusion VQG100, TQG144, 50K-1500K Automotive design has always been a chal- Spartan-3 PQG208,FTG256, 487 23 32 576 system gates lenging environment when faced with FGG456, FGG676 small-form-factor ECU specifications, TQ144, PQ208, 50K-300K Spartan-IIE 329 19 — 64 wide temperature ranges, high-quality and FT256, FG456 system gates reliability requirements, and low cost CPG56, CPG132, points. Today’s designs also need to be flex- 32-256 CoolRunner-II VQG44, VQG100, 118 5 — — ible, upgradeable, and easy to test fully. macrocells TQG144 Programmable logic is a viable alterna- tive to microprocessors and microcon- CS144, VQ44, 36-144 XC9500XL VQ64, TQ100, 117 — —— trollers because they offer true design macrocells interdependency and parallel processing. TQ144 CPLDs can be used to great effect where Flash PROM V0G20 1-4 Mb — ——— microcontrollers have run out of I/O or processing power, or simply for memory Table 1 – The Xilinx Automotive family of CPLDs, FPGAs, and configuration memory devices interfacing or voltage-level shifting.

56 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS AA Multimedia PlatformPlatform

Xilinx FPGAs and Xylon IP forfor AutomotiveAutomotive andand cores shorten design cycles and lower production costs ConsumerConsumer MarketsMarkets forfor multimediamultimedia applications.applications.

by Davor Kovacec CEO Xylon d.o.o. [email protected]

Market leadership in the fast-moving elec- tronics market continuously demands more innovative and cost-effective prod- ucts. With an ever-shrinking time-to-mar- ket window, design tools and prefabricates are important elements of success. But the right development platform can make all the difference between success and failure, harmonizing the contradictory requirements of being standard and highly configurable at the same time. Combining Xilinx® FPGAs with the Xylon logicBRICKS IP cores library, Xylon’s feature-rich Multimedia FPGA Platform is ideal for addressing the time-to- market and flexibility needs of the high- volume customer base. You can quickly turn system designs running on this gener- ic FPGA development platform into spe- cialized products. Such a design approach retains a large portion of design reuse through different hardware (IP cores) and software modules. You can reuse these same modules in many system designs for differ- ent applications.

Second Quarter 2005 Xcell Journal 57 HIGH-VOLUME SOLUTIONS

Multimedia FPGA Platform Touch The basic functional blocks of the Panel Control Keyboard Platform IP Cores Multimedia FPGA Platform are output to User IP Cores displays, with inputs provided from video, I2C GPIO FPGA human machine interface, and communi- Control ASSPs cation interfaces. These functional blocks DISPLAY logiCVC support a variety of different displays, Video Control video input types, input devices, or com- DISPLAY logiCVC munication interfaces. Video Control The Multimedia FPGA Platform is an FPGA design based on hardware modules Drivers logiCAN MicroBlaze logiUART Video Connectors CAN/Lin PPC, ext CPU UltiMEM in the form of IP cores. You can add addi- Camera UMA tional features using third-party IP cores or UltiWIN Memory SDRAM CVBS VCR Versatile Video Controller (DDR) by designing a customer-specific circuit. Decoder Input The main components of the Multimedia DVD User IP UltiWIN FPGA Platform are: OPB CoreConnect (MPEG Versatile Video FPGA IP Decoder) Input • logiCVC – a compact video Interconnection BUS FLASH logiBITBLK SDRAM User IP Graphic DDR controller for display driving (MOST Control) Accelerator FLASH • logiBITBLK – for 2D graphics acceleration Figure 1 – Multimedia FPGA Platform: automotive back seat entertainment • logiCAN/logiUART – for communication logiCVC for multiple display driving and require 54 MBps each. The Xilinx • UltiWIN – for frame grabbing instantiating more than one UltiWIN for MicroBlaze™ soft-processor core running (video input) simultaneous video input streaming. at 100 MHz requires 400 MBps, while the A DVD, VCR, or camera CVBS output logiBITBLK, processing 16 bpp images • UltiMEM – a multi-port SDRAM/ is used as the first video source, while a and running at 50 MHz, requires 100 DDRAM memory controller MOST network is used as the second. The MBps. This gives a total bandwidth The Spartan™-3 hardware resources, CVBS analog signal is converted to an requirement of: ITU656 digital signal by a composite video dedicated multipliers, digital clock man- 2 x 16 MBps + 2 x 54 MBps + 400 MBps signal decoder, an ASSP component. The agers (DCMs), and block RAMs ensure + 100 MBps = 640 MBps that IP cores operate at higher speeds and MOST MAC and MPEG decoder are cus- consume less area of the FPGA. The video tomer-added IP cores. The I2C and GPIO The overall memory bandwidth for a input scaling circuits take advantage of the IP cores are used for the keyboard and touch 32-bit DDRAM running at 100 MHz is dedicated 18 x 18 multipliers, while the controller interface. The UltiMEM IP core is 800 MBps. This leaves 160 MBps of spare high-capacity block RAMs used in all configured for unified memory architecture memory bandwidth that can be used for Xylon IP cores contribute to more efficient and six access ports. The 32-bit DDRAM efficient memory arbitration and access to sharing of memory bandwidth and better assures 800 MBps bandwidth for display memory without stalls. overall system performance. refresh, video streaming, graphics accelera- In addition, the DCMs provide finer tion, and CPU program execution. The System Gate Count clock generation and eliminate the need for logiCAN and logiUART-local interconnect Table 1 shows the size of each IP core and costly clock-generation ASSPs, while digi- network (LIN) IP cores are used for network the total number of slices required for the tal termination and DDR support enable connections with in-car body electronics. backseat entertainment FPGA design. better and lower cost support for SDRAM Every single IP is compact with a low gate or DDRAM memories. Memory Bandwidth Requirements count, resulting in a lower cost solution. Figure 1 shows the Multimedia FPGA The selection of memory bandwidth is The complete backseat entertainment Platform block schematic configured for an important for system performance. The six system is a complex multimedia applica- automotive backseat entertainment appli- IP cores share common memory, as shown tion that fits into a Spartan-3 XC3S1000 cation. This example is significant because in Figure 1. Two logiCVC driving displays device, leaving approximately 2,000 slices it has the capability to show a live video of 400 x 234 resolution and 16 bpp color for extra FPGA circuits. A separate display stream on any display screen. We accom- depth require 16 MBps each. Two controller application driving a single dis- plished this by instantiating more than one UltiWIN processing CVBS video sources play, including 2D graphics acceleration

58 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS

Conclusion IP Core Description FPGA Slices The value of the Multimedia FPGA MicroBlaze Xilinx 32-bit RISC CPU 530 Platform from Xylon is in the high number logiCVC Xylon Compact Video Controller 2 x 615(1) of different displays tested; the broad feature set; and the ability to configure for perform- (2) UltiMEM Ultimodule Multiport SDRAM/DDRAM Control 410 ance, price, low development, and produc- logiBITBLK Xylon 2D Graphic Accelerator 765 tion cost, all with virtually no obsolescence. Xilinx FPGAs, and in particular the GPIO Xilinx General I/O 40 Spartan-3 family with its low cost-per-gate INT Control Xilinx OPB-INT Control 40 and rich architecture, provided an excellent UltiWIN Ultimodule Versatile Video Input 2 x 960(1,2) base on which to build. Combining Spartan-3 devices with Xylon logicBRICKS logiCAN Xylon CAN 2.0B Control 430 IP cores provided the ideal combination of I2C Xilinx OPB-I2C Control 200 feature set, form factor, performance, and time-to-market capabilities for the develop- logiUART Xylon FIFO UART Control (LIN Bus) 225 ment of the Multimedia FPGA Platform. Total 5,876 For more information about the Multimedia FPGA Platform and Note: 1. Back seat entertainment applications require two video inputs and logicBRICKS Xylon IP library, please con- Table 1 – IP core size video outputs, such as two logiCVC and two UltiWIN. 2. Xylon is an Ultimodule Inc. technology partner. Xylon has an open license tact us at [email protected], or visit to tailor UltiMEM and UltiWIN IP source code to customer needs. www.logicbricks.com. and a UMA memory architecture, fits into CAN IP for communication with dash- a smaller Spartan-3 XC3S200 device. board or other human interfaces. Multimedia FPGA Platform Features We reduced costs further by using the Xylon is in the process of designing an MicroBlaze core, which eliminates the need automotive reference board, which provides • Displays supported: wide-resolution for a separate external CPU. The IP core the ability to prototype or evaluate the range 128 x 64 to 1280 x 1024 integation and system configuration for the Multimedia FPGA Platform by connecting pixels; wide range of color depths MicroBlaze core is provided by the Xilinx displays, input devices, vehicle communica- 1- to 24-bit bpp, TFT/STN and Embedded Development Kit tion buses, and video signal sources. other flat panel displays; 8- to 32- (EDK). bit memory interface, overlays, multiple display support End Applications • Acceleration: ROP2 operation, The Multimedia FPGA Platform block fill, transparency, color is primarily aimed at the auto- expansion, pattern operations motive market, including appli- • Memory interface: flexible multi- cations such as navigation, port memory controller, unified infotainment, backseat enter- memory architectures for cost- tainment, and driving assistance. sensitive applications, priority, Example infotainment appli- round-robin and mixed-port arbi- cations include such automotive tration policy, configurable data displays as VCR videos, game width for access port and config- consoles, rear-parking cameras, urable memory bus and navigation systems. Figure 2 shows a backseat entertainment Figure 2 – Backseat entertainment prototype • Video input: YUV-RGB and de- prototype system displaying a interlace, picture position, picture contemporary game console on one display Other multimedia applications are cropping, picture scaling using and a DVD movie on another display. equally suitable, such as consumer, med- multi-pass bi-linear interpolation, Driving assistance is an another example ical, and measurement instrumentation or multiple video input source support of an automotive application using multi- factory automation. • Communication: CAN controller media platform IPs. The system comprises Today, automotive OEM and first-tier and UART improved for LIN bus synchronized stereo video camera inputs, a manufacturers have used Xilinx FPGAs and DSP algorithm for image processing, and Xylon IPs to develop infotainment systems.

Second Quarter 2005 Xcell Journal 59 HIGH-VOLUME SOLUTIONS AA Low-CostLow-Cost PCIPCI ExpressExpress SolutionSolution Spartan FPGAs are ideal for next-generation PCI applications and systems.

by Abhijit Athavale PCI 64/66. At the high end, a 32-lane PCI (called a PIPE interface – a white paper Product Marketing Manager Express implementation supports a total of about this is available from Intel). Xilinx, Inc. 80 Gbps, providing more than enough In the two-chip solution, the transport [email protected] bandwidth to support the vast majority of layer resides in a dedicated PHY chip, and the next-generation applications. logic and transport layers reside in a Spartan PCI has been the most widely used bus FPGA. A broad range of PHY devices are standard in the PC, server, and embedded Implementation Details available from manufacturers such as Genesys markets for the past decade. Because PCI PCI Express is a three-layer specification: Logic, Philips Semiconductor, and Texas is limited by its shared, central arbitration- physical (PHY), logical, and transport, all Instruments. PHY pricing will be less than based architecture and system-synchro- defining separate functionalities. Also includ- $10 for high volumes (250,000 units per nous clocking scheme, current and ed in the specification are advanced features year). (See the sidebar, “PHY Vendors,” for next-generation processors are outstrip- for hardware error recovery and system contact information.) Xilinx has collaborated ping its ability to keep up. power management. (For more information with Phillips Semiconductor and delivered PCI’s emerging replacement is PCI about PCI Express, visit www.pcisig.com.) this solution to our customers. Express, a new connectivity standard that Since 2000, Xilinx® has offered a line of To implement the interface, Xilinx and preserves the flexibility and familiarity of PCI 32- and 64-bit solutions for several of our IP partners (including Eureka, PCI while dramatically increasing band- Spartan™ series FPGAs. The most logical GDA, and Northwest Logic) provide PIPE width and performance. The controlling successor is a PCI Express solution using IP cores for Spartan-3 and Spartan-3E body for the PCI specification, the PCI an external PHY chip paired with a devices. A single-lane PCI Express controller SIG, has ratified PCI Express as the next- Spartan-3 or Spartan-3E device. The PCI requires approximately 500,000 gates (50% generation PCI. PCI Express-based prod- Express specification defines an interface of a Spartan XC3S1000) for the logical and ucts are now becoming available; to hook a PHY chip up to a separate device transport layer core, leaving the rest of the shipments are expected to achieve high vol- that houses the logical and transport layers FPGA available for the user application (see ume as early as 2006. Figure 1 shows the adoption forecast for PCI Express. PCI Express uses serial I/O technology Embedded Mainstream Apps to create point-to-point connections and is Adopter Protocol reverse-compatible to PCI, preserving ATCA Bridges Bridges Backplanes many original PCI advantages. It scales Lindenhurst Backplanes Peripheral ships Peripheral from a single lane (1x) to a 32 lane (32x) Bus Grantsdale Server architecture, offering a bandwidth of 2.5 Early Server ships Chipsets Gen2 Gbps per lane. PCI 32/33 has a bandwidth Adopter Early of 1 Gbps, while PCI 64/66 has a band- PC Graphics, Compliance Adopter Chipsets width of 4 Gbps. Workshops PC Graphics, The 1x PCI Express implementation PC Chipsets matches up very well with PCI 32/33, the most commonly used PCI interface across 2004 2005 2006 2007 all markets. A two-lane implementation (5 Gbps) is an incremental improvement over Figure 1 – PCI Express adoption forecast

60 Xcell Journal Second Quarter 2005 HIGH-VOLUME SOLUTIONS

PCLK ing your next-generation designs, you

TxDetectRx/ should consider the PCI Express option Loopback PCI Express I/F IP Core TX+, TX- PowerDown from Xilinx. We encourage you to find PhyStatus out how Spartan-3 and Spartan-3E Reset# FIFO FPGAs will help you meet your current TxData 8 or 16 External and future design requirements. More TxDataK 1 or 2 PHY information about Spartan-3 and RxPolarity Application TxCompliance Spartan-3E FPGAs, PCI Express IP, and Logical Layer TxElecldle Transport Layer compatible PHY devices is available at RxElecldle RX+, RX- FIFO www.xilinx.com/pciexpress/. RxData 8 or 16 RxDataK 1 or 2 User Logic RxValid CLK RxStatus 2 PCI Express IP Genesys Logic PCI Express IP cores are available from multiple ven- Philips Semiconductor PIPE Interface Pins (SSTL2) Texas Instruments dors including Xilinx and our partners. One such core Others from Northwest Logic is featured below. Northwest Logic’s PCI Express Core is specifically Figure 2 – PIPE interface between a Spartan FPGA and an external PHY designed for low-cost Spartan-3 FPGAs. A Spartan-3- based PCI Express design uses the Spartan-3 device with a low-cost physical interface for a PCI Express 40 (PIPE)-compatible PHY chip. The PHY chip implements External PLD the low-level PCI Express physical layer, while the 30 External DLLs, device takes care of the upper-level data link and Memories, transaction layers. Controllers, and XC3S1000 Another version of the PCI Express Core uses the 20 Translators XC3ES1200 > 50% Logic > 50% Logic internal MGTs in Virtex-II Pro and Virtex-4 FX FPGAs to PCIe IP Core PCIe IP Core provide a fully integrated PCI Express solution. Component Cost ($) 10 1x PCI Express to PCI Bridge Northwest Logic’s PCI Express Core is one of the 1x PCIe PHY 1x PCIe PHY smallest PCI Express cores available, enabling you to target the smallest and consequently lowest cost Solution ~$40 Solution ~$20* Solution ~$17* FPGA. The core is provided with a comprehensive ver- *High-volume pricing ification suite and expert support to ensure rapidly Figure 3 – Single-lane PCI Express implementation options developed and validated designs. Also available is a PCI Express Development the “PCI Express Core IP” sidebar for details Conclusion Board for quickly prototyping a complete PCI Express on Northwest Logic’s product and www. In addition to reducing total costs, the System. A demo GUI, drivers, and PCI Express FPGA xilinx.com/pciexpress/ for details on PCI Spartan FPGA + PHY option gives you reference design are also included. Express IP from our other IP partners.) substantial flexibility to build “PCI For more information (including pricing and core Figure 2 shows the implementation of a Express-to-anything” bridges and inte- size for a particular FPGA family), visit the Northwest PIPE interface using a Spartan FPGA and grate other circuit elements. As most sys- Logic website at www.nwlogic.com. external PHY. tems have a range of bandwidth Figure 3 illustrates a range of options to requirements, preserving flexibility is implement a single-lane PCI Express inter- important so that you can add lanes with- face. The cost of a standard-product out dramatically changing the layout. PHY Vendors option is fairly high (>$40), making it ten- Spartan-3 and Spartan-3E FPGAs are Genesys Logic uous for high-volume/low-cost applica- available in a wide range of densities, and www.genesysamerica.com tions. The Spartan options drop that cost preserve migration up and down in over- substantially, and add the flexibility of pro- all bandwidth. And because FPGAs are Philips Semiconductor grammable logic to integrate and imple- fully reprogrammable post-deployment, www.semiconductors.philips.com ment other system capabilities. In 250K they eliminate the risks associated with quantities (reasonable for typical consumer first-generation ASSPs and ASICs. Texas Instruments applications), the Spartan-3E version will If you are currently using PCI for your www.ti.com/pciexpress/ cost approximately $17. interconnect standard and are architect-

Second Quarter 2005 Xcell Journal 61 TUPQPWFSUIFXBMM FOHJOFFSJOH mOBMMZ BMHPSJUINEFWFMPQFST ETQTZTUFNBSDIJUFDUT BOEIBSEXBSFEFTJHOFSTBSFBMMPOUIFTBNFQBHF

#OMPANY PARTNERSHIPS DONT USUALLY GENERATE A LOT OF INTEREST BUT A PARTNERSHIP THAT YIELDS TEAM BUILDING TECHNOLOGY DOES &OREXAMPLE THEINDUSTRYSlRST$30DESIGNmOWFROMAMIXED-!4,!"3IMULINKENVIRONMENTTOAVERIlED&0'! )TSBASED ONATIGHTINTEGRATIONBETWEEN!CCEL#HIP$303YNTHESISAND8ILINX3YSTEM'ENERATORFOR$30THATACCELERATESDEVELOPMENTOF SIGNALPROCESSINGSYSTEMS 7ORKMORECOLLABORATIVELYANDGETDESIGNSTOMARKETFASTERBYELIMINATINGTHENEEDTORECAPTURE IN($, 2EADILYINCORPORATE!CCEL7AREAND8ILINX)0CORESANDVERIFYTHEENTIRESYSTEMUSINGHARDWARE IN THE LOOPSIMULATION FURTHERSPEEDINGDEVELOPMENT(OWSTHATFORTEAMWORK 3TOP YOUR /VER THE 7ALL ENGINEERING$OWNLOAD FREE WHITEPAPERS AT ACCELCHIPCOMPAPERS OR CALL    DESIGN TOOLS Design Leadership from Xilinx The shortest path from design start to project finish is through Xilinx 7.1i design tools.

by Bruce Talley Vice President, Design Software Division Xilinx, Inc. [email protected]

It’s a good time to be an FPGA or CPLD designer. The high cost of ASIC-based projects and the high risk Table of Contents of structured ASIC designs are enabling more advanced features with every new generation of silicon. But without Design Performance Leaps Forward with ISE 7.1i Software...... 64 easy-to-use, productive design tools, all of that power is still just silicon waiting to be unleashed. Debugging with Combined Oscilloscope and Logic Analyzer Measurements ...... 67 ® Xilinx ISE™ software has long been known to Real-Time Debug That Dominates...... 70 deliver some of the fastest logic in programmable design, but the ISE design suite also delivers ease-of-use at the Integrating MATLAB Algorithms into FPGA Designs...... 73 lowest cost possible. With the addition of advanced options like PlanAhead™ design and analysis software Complex FPGAs Require Equivalence Checking ...... 76 and ChipScope™ Pro real-time FPGA debug, the com- Complete FPGA and CPLD Power Analysis...... 80 plete ISE system delivers solutions to design bottlenecks and gets you to project completion faster. Increase Performance and Lower Cost Through System Design...... 83 In this issue of the Xcell Journal, we present a series of Control Your Designs with the PlanAhead Hierarchical Design and Analysis Tool...... 86 articles about design tools and design applications to help you explore the new offerings behind the our 7.1i design Building High-Performance Measurement and Control Systems with FPGAs ...... 90 tools release, from a general overview of ISE 7.1i software to detailed third-party explorations of logic analyzer setup and debug for FPGAs.

Second Quarter 2005 Xcell Journal 63 DESIGN TOOLS DesignDesign PerformancePerformance LeapsLeaps ForwardForward withwith ISEISE 7.1i7.1i SoftwareSoftware Programmable design has never been easier or more cost-effective.

by Lee Hansen Sr. Product Marketing Manager Xilinx, Inc. [email protected] Mike Weir North American Direct Marketing Manager Xilinx, Inc. [email protected]

Some of the fastest tools in programmable logic have gotten even faster. Xilinx® ISE™ 7.1i software includes new tech- nology, new device support, and new speed files that get you through the design flow faster, with the highest device per- formance possible. ISE 7.1i software includes the new Virtex™-4 -12 speed grade, making the industry’s leading-edge Virtex-4 FPGA- based designs even faster. ISE 7.1i software, in combination with Virtex-4 devices, can deliver up to 70% faster design perform- ance than the nearest competing solution. More high-performance technology is packed directly into ISE software than any other design offering, including core capa- bilities like “timing-driven mapping,” glob- al optimization, design re-timing, and FPGA physical synthesis. Together, this ProActive Timing Closure technology leads to higher overall design performance, with timing closure reached faster and less time spent in the overall design flow.

64 Xcell Journal Second Quarter 2005 DESIGN TOOLS

In addition to the advanced spectrum of ISE implementation tools, this collection of new technology slashes design times, getting you through even advanced, high-density design flows faster and cutting project time and costs.

PlanAhead Extends Performance Advantages ware: ISE Simulator and ModelSim Xilinx designers a 50% faster HDL simulation If you need even more design performance, Edition-III. These state-of-the-art HDL environment and 20 times more design the PlanAhead™ hierarchical floorplanner simulation tools give you more perform- capacity than the previous version. from Xilinx can help. PlanAhead software is ance and design capacity than previous ver- For larger designs, you can order the a separately purchased tool option to ISE sions of ISE software, and include new MXE-III full version, which provides five software that is ideal for high-density Virtex- integration into the design flow. times the design capacity and 30% faster 4 or Spartan™-3E designs. ISE Simulator is a new, full-featured performance than the MXE-III starter. PlanAhead implements a block-based HDL simulator integrated within ISE soft- MXE-III enables you to simulate devices hierarchical design approach that can ana- ware. It offers you the ability to simulate with up to 2 million system gates (depend- lyze, detect, and correct potential implemen- directly from the ISE Project Navigator ing on coding). Built on the industry-lead- tation problems earlier in the design cycle, process window; test benches and stimulus ing ModelSim 6.0a HDL simulation leading to greater consistency, faster place can be generated automatically, and graph- environment, MXE-III has enhanced GUIs and route cycles, fewer design iterations, ing is direct in the ISE Project Navigator that give designers easy access to capabili- tighter utilization control, and quicker incre- display window, as shown in Figure 1. ISE ties and faster debugging. With the MXE- mental design changes. Customer-based III full version, you now have benchmarks confirm that PlanAhead soft- the option to license your soft- ware can as much as double your out-of-the- ware with a USB dongle as box ISE design performance. well as the traditional methods You can also lock placement results for of parallel and Ethernet ID. individual blocks that already meet timing so that subsequent place and route iterations Technology Viewer only affect the blocks that need to change, ISE 7.1i software now thus delivering faster reimplementation includes the new Technology compile times. PlanAhead wraps area Viewer, which lets you view groups, incremental design, and modular your post-synthesis HDL- design into a single ASIC-strength floor- based design at the block level planner. See the article “Control Your (in a schematic-like display) to Designs with the PlanAhead Hierarchical get an early picture of how and Design Analysis Tool” in this issue of Figure 1 – ISE Simulator graphing window your design will be represent- Xcell for more detailed information. ed during the implementation Simulator supports VHDL and Verilog and flow. Technology Viewer is built on the Ease-of-Use Slashes Design Time functional and timing simulation. ISE same graphic interface as RTL Viewer, so ISE 7.1i software includes a host of new tools Simulator is also licensed through the exist- there are no new tools or menus to learn. and capabilities that you can’t find anywhere ing Xilinx software registration ID process. Full hierarchy is represented, so you can else, including Technology Viewer, Message Two different versions of ISE Simulator easily push down into the design as far as Filtering, Design Summary, ISE Simulator, are available. ISE Simulator Lite comes necessary, highlighting and identifying crit- and ModelSim Xilinx Edition-III. In addition included with ISE BaseX and ISE ical elements in your design by pin, net, or to the advanced spectrum of ISE implementa- Foundation™ software at no charge. The instance name. tion tools, this collection of new technology full ISE Simulator is a design option that ISE software gives you more ways to slashes design times, getting you through even you can purchase as an upgrade to your view the progress of HDL-based design advanced, high-density design flows faster and existing ISE Foundation seat or purchase as implementation as you move through the cutting project time and costs. a part of a new ISE Foundation seat. programmable design flow. RTL Viewer Xilinx and Mentor Graphics recently displays pre-synthesis design elaboration, New, More Powerful announced an extended OEM agreement while Technology Viewer shows the post- HDL Simulation Choices that has led to the release of the new synthesis block-level design. FPGA Editor Two new advanced HDL simulation offer- ModelSim Xilinx Edition-III HDL simula- lets you see and edit post-place and route ings are now available with ISE 7.1i soft- tor. The free MXE-III starter now offers FPGA results.

Second Quarter 2005 Xcell Journal 65 DESIGN TOOLS

Display Summary and Message Filtering reducing price points for FPGAs; they are nance customers. ESD gives you the ISE Project Navigator includes a new now supported in ISE 7.1i software. option of getting your update ISE copy Design Summary view that takes the Spartan-3E devices extend the reach of by the traditional box shipment or most commonly sought-after design FPGAs into production volumes by downloading from our secure delivery information and places it in one easy-to- reducing costs, while preserving the low system. ISE 7.1i software became avail- use automated display, eliminating the NRE and high flexibility of programma- able via download just six days after its need to search through multiple tool ble logic. Mainstream applications using product announcement, so if you need reports and outputs to find exactly what Spartan-3E FPGAs and ISE 7.1i design your product shipment, you no longer you need. Upon starting a new project, tools will see lower ASIC crossover points have to wait for the box to arrive. Design Summary shows you the key core and lower total project costs. ESD also offers a cleaner delivery and information about your project, such as Virtex-4 FPGAs are now supported in registration system. Because your registra- project location and target device. As you more configurations of ISE 7.1i software tion data is obtained and confirmed upon proceed through the design flow, Design as well. ISE WebPACK software adds login, no product ID step is required. Summary adds and updates more infor- Virtex-4 LX15 and LX25 support, while Product downloads and registration keys mation, like logic utilization, perform- ISE BaseX now includes Virtex-4 LX15, are available together online in a speedy ance summary, and design constraint LX25, SX25, and FX12 devices. You can and secure environment. Your ISE update results (Figure 2). Design Summary also choose the Xilinx device and ISE config- can now be on your system and running contains a list of links that take you to uration that best fits your budget and faster than ever before. Xilinx is one of the more detailed implementation reports so design demands. only PLD vendors offering all of their that you can easily jump to logic design tools through more information. And Design electronic download. Summary is right up front in ISE software, easily accessible Conclusion as a selectable tab in the Project ISE software sets the pro- Navigator display window. grammable design bar even Message Filtering is a new higher with the release of ISE option that works with 7.1i design tools. With faster design reporting tools to cap- performance, new design ture the information that is tools, advanced HDL simu- exactly right for your design lation, expanded device and project. Message Filtering lets support, you select the report informa- and electronic download, we tion that you deem non-critical believe no competitor offers to your design and suppress it you more, whether you’re from future reports. Message making the next handheld Filtering lets you get more CPLD-based PDA or high- Figure 2 – Design summary in ISE Project Navigator streamlined and pertinent speed serial, high-density reports and quickly see the FPGA-based blade server. data necessary to your project, making ISE 7.1i software also expands Linux From logic development with ISE debug and verification quicker and easier. operating system support. ISE Foundation software, ISE BaseX, or ISE Foundation software is available in both WebPACK software, embedded design Lower Project Costs 32-bit and 64-bit Red Hat Enterprise 3 with Platform Studio, system design with ISE 7.1i software also offers new cost- versions; both CDs are included in prod- System Generator for DSP, real-time ver- savings options including new Spartan- uct update shipments. ISE WebPACK ification with ChipScope™ Pro analyzer, 3E device support; Virtex-4 device software also now runs on Linux Red Hat or floorplanning with PlanAhead design support in ISE WebPACK™ software Enterprise 3, letting you use free, down- tools, Xilinx and ISE software have your and ISE BaseX; expanded Linux operat- loadable ISE software on the platform project covered. ing system support; and new electronic that best fits your needs. All ISE configurations, ChipScope software delivery. Pro analyzer, PlanAhead software, Electronic Software Delivery MXE-III, and ISE Simulator are avail- Expanded Device Support Xilinx is also now offering ISE able for purchase from the Xilinx online Spartan-3E devices are the newest low- Foundation software and ISE BaseX via store, Xilinx distributors, or by calling cost FPGA offerings from Xilinx, further electronic download to all in-mainte- 1-800-888-FPGA (3742).

66 Xcell Journal Second Quarter 2005 DESIGN TOOLS DebuggingDebugging withwith CombinedCombined OscilloscopeOscilloscope andand LogicLogic AnalyzerAnalyzer MeasurementsMeasurements With an automated logic analyzer to scope cross-trigger, de-skew, and scope waveform input, one plus one can be greater than two.

by Brad Frieden But it turns out that these tools are analyzer, with the possibility of seeing some Logic Applications Specialist most powerful when used together. You improper states. Another approach is to Agilent Technologies can significantly reduce debug challenges probe onto the serial channel with an oscil- [email protected] by taking advantage of tools that automate loscope in an attempt to find some improp- the integration of scopes and logic analyz- er signals, particularly with respect to signal With higher speeds in today’s digital ers. In this article, we’ll describe these tools integrity. The best approach is to use both designs, it’s not uncommon to encounter and how you can more quickly understand the oscilloscope and logic analyzer to better challenges during board turn on that relate digital design problems by capturing the understand the exact target condition under to timing or signal integrity. Most of the analog characteristics of key digital signals which the error is occurring. work is distinguishing whether it’s a logic simultaneously with a functional view. problem or a signal integrity problem. When Working Independently Spending too much time sorting out such An Example Situation When approaching the problem with a things can reduce time to market and cre- Consider a packet communication system logic analyzer, a number of signals are of ate even more time pressures when what where something is going wrong at boot particular interest. We will certainly want you’d really like to do is design. up. This system (shown in Figure 1) to see what is happening with the trans- Many designers find a real-time oscillo- should be passing packet data across a seri- mit state machine driving packet data scope and logic analyzer invaluable when it al channel to a monitor, but the data is not generation, along with the data and relat- comes time to turn on, validate, and debug being received. We can take a variety of ed transaction IDs . We would probably the hardware in their system. But they are approaches to figure out what exactly is also like to see what is happening over on often used independently – the oscilloscope causing the problem. the receive side of the design, where pack- for signal integrity or timing issues and the One approach is to view the condition ets should be collected. Signals of interest logic analyzer for problems with the logic. of state machines in the design using a logic there include the state machine, data

Second Quarter 2005 Xcell Journal 67 DESIGN TOOLS

transmitter and receiver earlier in Figure 1. External Data IN Because this link is differential and running Monitor at 400 Mbps, we used a 1.5 GHz differential SERIAL PACKETS "Serial to Monitor" 8b/10b "Serial from Monitor" active probe and accessories to let us plug in State Machine Encoder State Machine to the 0.1 inch connectors on the board. SERIAL ACKs TID Ack State TID State Out ID As we attach to this link and get a view of the eye diagram on the data, it looks fine. No Master State glitches, closing eye, or noise. Trying again 8 8 11 5 Machine 5 3 8 from boot up, we see a flurry of activity for a second or so where there appear to be unex- 16 RAM pected glitches or narrow pulses in infinite persistence mode. Fortunately we can trigger Bank Bank Bank Bank Micro 0 3 2 1 on a narrow pulse, and set up the scope to do AckID Blaze TID DATA Master µP that for a more detailed view (Figure 2). The State In & State State TID resultant capture upon boot up is a single- Out . Out Agilent shot view of a narrow pulse. This is highly Xilinx MUX FPGA Trace interesting and reveals a definite problem in Core 2 the system, but we’re still left scratching our head as to what’s going on. JTAG to Logic Analyzer (16 Pins for Debug) External RAM How About a Little Cooperation? What if we cross-trigger the logic analyzer Figure 1 – Probe points of interest brought out through the Agilent trace core when we catch this narrow pulse on the oscilloscope? We could see what condition received, and acknowledge IDs, which get measurement bank 1, things are hung up the logic is at when this situation presents sent back to the transmitter to let it know there too, at 02H, the receive “idle” state. itself on the serial channel. that data was received and to keep the It turns out this is a relatively easy thing data coming. Smarter Logic Analyzer Triggering to do because of application software in the This design is implemented with a Now that we know the condition of the logic analyzer, enabled by the E5850A Xilinx® Virtex™-II FPGA, so we have the transmit state machine once it hung up, we time-correlation fixture created to auto- option to use a measurement core inside can redefine the trigger point to be not that mate the process. By time-correlating the the FPGA to move our probe points value, which should allow us to view state logic analyzer and oscilloscope and pulling around to various locations, collecting the machine activity leading up to this hung the oscilloscope capture into the logic ana- data on the Agilent logic analyzer by using condition. The acquisition reveals what lyzer, we should be able to get a very help- the logic analyzer-based FPGA Dynamic looks like a normal state machine cycle just ful view to track down the problem. Probe application. We use a measurement before the hang up. Similarly, core with four signal banks and 16 FPGA by switching probe points pins for visibility (Figure 1). with the FPGA Dynamic Because the data is not being received, Probe over to the receive side a reasonable starting point is to probe and defining our trigger as these points of interest, set up the logic not “idle,” there is also a nor- analyzer to trigger on anything, and just mal state machine progres- see what’s happening. The answer is not sion. So we might try next to much. In fact, simply looking at the activ- come at this problem with ity indicators before even taking a trace, it the oscilloscope. becomes obvious that the transmit state machine is locked up. Acquiring data The Oscilloscope Insight through bank 0 shows that the state A point of particular interest machine is stuck at 01H, the “idle” state, for an oscilloscope measure- and transaction IDs (TIDs) that should be ment is to view activity on the counting are stuck at 0DH. serial channel and look for If we change our probe points to signals anything unusual. We saw the Figure 2 – Scope trigger on a serial channel glitch and single-shot capture right out of reset of interest over to the receive side through serial channel between the

68 Xcell Journal Second Quarter 2005 DESIGN TOOLS

Time-Correlating the Instruments and it points to the possibility Time-correlation is accomplished by dou- of invalid data being input to ble-probing a common reference signal the state machine, somehow with both the logic analyzer and oscillo- related to this narrow pulse scope. Common BNC cables are placed on the serial channel. between one instrument’s trigger-out con- Time-correlated measure- nector and the other instrument’s trigger ments have helped show that input. The instruments are also connected these are interrelated events. together via the LAN, which allows the The next step would be to logic analyzer to control the oscilloscope isolate which one causes the and pull out its data. other – correcting the logic A calibration routine that runs on the causing a serial channel nar- logic analyzer makes repetitive measure- row pulse, or fixing the nar- Figure 3 – Scope trigger on glitch, cross-trigger to ments in both the “logic analyzer triggers row pulse causing incorrect logic analyzer, and capture of internal receive state scope” direction and the “scope triggers state machine logic. machine, initially looking okay logic analyzer” direction, to determine the delay in the cross-trigger for each direction. Cross-Triggering in the Once these delays are known, they can be Other Direction applied to the oscilloscope data after We’ve determined that the importing to the logic analyzer, time-align- first time we have a narrow ing the logic analyzer and oscilloscope data. pulse, it appears that it coin- cides with problems in our A Better Look state machines. But does a Now we’re going to trigger the oscillo- narrow pulse always exist in scope on the narrow pulse occurring on the serial channel when the the serial channel, cross-trigger the logic state machines lock up? One analyzer, and bring both captured scope approach to determine the and logic analyzer data together on one answer is to have the logic screen. The goal is to see exactly what analyzer cross-trigger the Figure 4 – Zoom-out reveals relationship of glitch state machine activity is occurring inside oscilloscope and have the to receive state machine stalling after 12 packets the FPGA when the narrow pulse hap- logic analyzer trigger on the pens on the serial channel. lock up. This cross-trigger is set up through the Fortunately, to switch the direction of row pulses in the vicinity of the lock up. time-correlation application. All we have to the cross-trigger simply requires checking a We’ve made an important step to narrow set up is the scope trigger on the narrow box in the user interface; the application down the possible causes of the problem. pulse; the rest is taken care of. already has determined the cross-trigger Rebooting the system, we observe the delay from the calibration performed earlier. Conclusion capture on the logic analyzer, which for- Because we know that the receive state With complex interactions between asyn- tunately is very revealing. The trigger machine locks up on a value of 02H (idle) chronous digital circuitry and the fast point for the oscilloscope where the nar- immediately following 10H (TID), we can data rates that are susceptible to signal- row glitch occurs is visible, and we can set up the logic analyzer to trigger when it integrity issues, it makes sense to pair up also see time-correlated to this event the sees TID, followed by idle for greater than your logic analyzer and real-time oscillo- transmit states of the state machine. That one clock cycle. scope. With the ability to automate the actually looks okay. Doing this, importing the related scope time-correlation of each instrument’s As we switch over to the “receive” state capture from the serial channel, and mak- capture to the other and import scope machine with the FPGA Dynamic Probe, ing multiple runs, we see that indeed there waveforms into the logic analyzer, it the state machine also looks okay, at least is a direct relation between the first narrow becomes much easier to track down elu- initially (Figure 3). But with a change of pulse and the state machines locking. We sive cause-and-effect problems, whether time/division for a “macro” view, we see can search through the serial stream cap- the source is logic or signal integrity. that the state machine stalls 12 packets after ture on the oscilloscope and find the loca- For information about logic analyzers the serial channel glitch. The transaction tion of narrow pulses with the E2681A and oscilloscopes from Agilent, visit IDs and acknowledge IDs increment for a EZJIT jitter analysis software package, val- www.agilent.com/find/logic/ and www. time, but then stop. That’s not acceptable, idating that there are always multiple nar- agilent.com/find/scopes/.

Second Quarter 2005 Xcell Journal 69 DESIGN TOOLS Real-TimeReal-Time DebugDebug ThatThat DominatesDominates Debug your design with the ChipScope Pro system.

by Lee Hansen Sr. Product Marketing Manager Xilinx, Inc. [email protected]

Brent Przybus Sr. Product Marketing Manager Xilinx, Inc. [email protected]

Verification continues to be one of the most time-consuming and time-critical phases of the design flow. The fixed nature of an ASIC or structured ASIC makes identifying and finding problems on the board prohibitively expensive and mas- sively time-consuming. HDL verification and timing analysis tools offer a good first line of defense in predicting design behavior, but we believe no design tool offering beats the advantage that the Xilinx® ChipScope™ Pro system brings to FPGA real-time verification.

Faster FPGA Verification The ChipScope Pro system, available as a separately purchased option to ISE™ soft- ware, allows logic and embedded software designers to debug their FPGAs in real time. You can find design problems quickly while the chip is running on the board, interact- ing with the rest of the system. Then, leveraging the FPGA’s re-programmability, you can implement design changes quick- ly and send them back to the device on board in a matter of minutes through the FPGA programming cable.

70 Xcell Journal Second Quarter 2005 DESIGN TOOLS

probe points and then quickly reprogram your FPGA and debug a whole new set of signals in a matter of minutes.

Integrated Logic Analysis You can analyze captured signals through the ChipScope Pro software logic analyzer (Figure 1). The ChipScope Pro logic ana- lyzer is an advanced display and debug tool that makes logic and bus analysis easy. The ChipScope Pro logic analyzer supports multiple window views and bus plotting in either data-versus-time or data-versus-data formats. Capture mode lets you compare data captured after multiple trigger events; sig- nal filtering lets you ignore data that is not critical to your analysis, saving you mem- ory and time. Using the listing viewer, you can import bus token files and view instructions in the order they occur. Figure 1 – ChipScope Pro logic analzyer To facilitate processor system debug environments that use software debuggers Debug Through Software Cores You can quickly and easily configure in addition to ChipScope Pro tools, you The ChipScope Pro package of tools and insert these low-profile cores into your can share the JTAG connection to the includes software debug cores, Core FPGA logic during the design capture FPGA with the ChipScope Pro analyzer. Inserter, CORE Generator™ software, and phase using CORE Generator software, or In addition to providing data capture the integrated logic analyzer. The insert cores directly into the design netlist capabilities, the ChipScope Pro system ChipScope Pro system is a separate option using Core Inserter. These cores are then also includes the Virtual I/O console, the that plugs in directly to ISE design tools synthesized and instantiated into your interface to one of the industry’s first real- and comes in a 60-day evaluation version. FPGA design, allowing you to view any time virtual input/output cores. Through Four software debug cores are included internal signal within the FPGA. the Virtual I/O console, you can control with ChipScope Pro tools: For engineers designing embedded virtual inputs and pulse trains and view processor systems using the Virtex™-II output activity. The ChipScope Pro system • ILA – for accessing and capturing logic Pro or Virtex-4 FX FPGA families, the now runs on Windows, Solaris, and Linux signals up to 315 MHz – 50% faster ChipScope Pro system enables debug of Red Hat Enterprise 3. than the previous ChipScope Pro embedded processor buses, including the release and among the fastest debug IBM CoreConnect processor local bus or New Capabilities cores available on-chip peripheral bus supporting the ChipScope Pro 7.1i software offers new • IBA – for embedded processor bus sig- IBM PowerPC™ 405 processor. You can capabilities that take real-time debug to the nal capture, protocol detection, debug- also view and debug embedded processor next level. Leading the way is the new ging and verifying control, address, signals for the MicroBlaze™ soft-proces- remote debug capability. ChipScope Pro and data buses sor core, supporting all leading Xilinx tools can now run in server/client mode over a TCP/IP connection. You can sit in your • VIO – for setting virtual inputs like FPGA device families. office while debugging a board next door in external switches, mimicking output Signals are captured at or near operating the lab or on the other side of the world. You devices like LEDs, or for simulating system speed and then brought out through can share a single board/system in the lab external logic the programming interface, freeing up pins for your design, not gobbling them up for with other debug engineers on your team, or • ATC2 – the advanced Agilent debug. The ChipScope Pro real-time debug allow help desk personnel to debug a prob- Technologies Trace Core 2, for linking tool is one of the only tools that allow you to lem remotely at a customer site. the ChipScope Pro system to your change probe points without having to re- The ChipScope Pro system supports the Agilent logic analyzer and FPGA synthesize and re-route your design. Using newest low-cost FPGA offering from Dynamic Probe the ISE FPGA Editor, you can change signal Xilinx, the Spartan™-3E family.

Second Quarter 2005 Xcell Journal 71 DESIGN TOOLS

ATC2 software debug core has also been enhanced for automatic setup, allowing the logic analyzer to automatically find which ATC2/FPGA pins are connected to which logic analyzer pod signals, making setup faster and easier.

ChipScope Pro cores have also been per- and Agilent delivers deeper trace memory, Dynamic Probe, and ChipScope Pro soft- formance-enhanced. System debug can now faster clock speeds, and more trigger ware in a target system. occur at clock speeds greater than 315 MHz, options, all using even fewer pins on the ATC2 can support 2, 4, 8, 16, 32, or one of the fastest verification cores available. FPGA. The advanced technology contained 64 banks, with each bank supporting from Core Inserter can now be used across within the ATC2 core and FPGA Dynamic 4 to 128 signal widths. Using optional 2X multiple netlists from one or more Probe isn’t available in other FPGA or ASIC TDM in state debug mode, signal counts sources, allowing system integrators to real-time verification solutions. can be doubled, offering as many as 8,192 debug entire designs rather than one sec- The FPGA Dynamic Probe measures signals possible for debug in one system. tion at a time. ChipScope Pro software new groups of internal FPGA signals in And you can change input banks on the also supports the newest Xilinx USB plat- seconds without forcing you to recompile fly using the JTAG programming inter- form programming cable. your design and with no impact on design face. ATC2 also operates in either state timing. It achieves wider internal visibility mode, for functional debug in one time Linking to Agilent Logic Analyzers over a fixed number of pins with 64 inter- domain with the widest signal capture; or The ChipScope Pro system also links inter- nal probe points for every pin, helping to timing mode, for making measurements nal FPGA debug to Agilent Technologies’ conserve FPGA resources for your design. across multiple time domains. bench-top logic analyzers using the included It eliminates error-prone and time-con- The ATC2 software debug core has also ChipScope Pro ATC2 core. This core syn- suming tasks with features like automated been enhanced for automatic setup, allow- chronizes the ChipScope Pro system to signal/bus labeling from FPGA design to ing the logic analyzer to automatically find Agilent’s FPGA Dynamic Probe software, an logic analyzer, and mapping FPGA pins which ATC2/FPGA pins are connected to optionally purchased plug-in to your Agilent from board layout to logic analysis chan- which logic analyzer pod signals, making 1680, 1690, or 16900 logic analyzer. nels. Figure 2 shows an example setup setup faster and easier. ATC2 can auto- This unique partnership between Xilinx using an Agilent logic analyzer, FPGA matically find the ideal sampling point of each ATC2/FPGA pin in both phase and voltage offset. ATC2 also offers an FPGA Dynamic Probe SW increased number of multiplexer bank Application Supported by 1680/1690/16900 inputs from 32 to 64; and you can now start up your ATC2 debug with a known bank selected. Probe MUX Outputs of ATC2 Core and PCB Signals Xilinx Conclusion Parallel PC Board Cable Nothing in FPGA or ASIC verification approaches what ChipScope Pro tools offer: real-time verification at or near sys- tem operating speeds (with a minimum FPGA impact on design space and I/O pins) and Insert ATC2 Core With the ability to capture any signal inside the Dynamic Control of ATC2 ChipScope Pro Through Your Regular FPGA while it is running on the board, Xilinx Cable interacting with the system. Add to that the advantage of FPGA reprogrammabili- Works with Xilinx Virtex-4, Spartan-3, JTAG ty, and you can identify problems and Virtex-II, and Virtex-II Pro FPGAs View and change your design in a matter of minutes Connect Aquire Analyze or hours, not days or weeks or months using other offerings. To order your ChipScope Pro 7.1i copy today, contact Figure 2 – ATC2 core, FPGA Dynamic Probe, and ChipScope Pro system your Xilinx sales representative.

72 Xcell Journal Second Quarter 2005 DESIGN TOOLS IntegratingIntegrating MATLABMATLAB AlgorithmsAlgorithms TheThe integrationintegration ofof AccelChipAccelChip DSPDSP SynthesisSynthesis and Xilinx System Generator for DSP provides intointo FPGAFPGA DesignsDesigns a seamless path from MATLAB/Simulink to verified FPGA-based DSP systems.

by Eric Cigan DSP Algorithms in FPGAs” in the guage is uniquely adept with vector- and Product Marketing Manager Winter 2004 issue of the Xcell Journal.) array-based waveform data at the core of AccelChip Inc. Many DSP algorithm developers have algorithms in applications such as wire- [email protected] found that the MATLAB language best less communications, radar/infrared Narinder Lall meets their preferred development style. tracking, and image processing. Senior DSP Marketing Manager With more than 1,000 built-in functions, as The AccelChip DSP Synthesis tool was Xilinx, Inc. well as toolbox extensions for signal process- developed specifically for algorithm devel- [email protected] ing, communications, and wavelet process- opers and DSP architects who have ing, MATLAB offers a rich environment for embraced a language-based flow. With There are two kinds of people in electronic algorithm development and debugging. AccelChip, you begin with MATLAB M- design: those who think in terms of words In addition to the IP functions pro- files to perform stimulus creation, algo- and those who think in terms of pictures. vided in MATLAB, the MATLAB lan- rithm evaluation, and post-processing. This dichotomy is most appar- You can load the M-files into ent in the world of DSP. Some the AccelChip tool; they designers prefer to develop become the “golden source” for algorithms in language form, a design flow that ultimately while others choose to draw out produces optimized implemen- block diagrams showing data tations in Xilinx FPGAs. flows. Until now, different sets of tools were required for each Unifying Words and Pictures method – but why should you In the past, design teams have to choose? looked on System Generator Xilinx® System Generator and AccelChip DSP Synthesis for DSP is well established as a as mutually exclusive design productive tool for creating tool options, but wished for DSP designs using graphical access to the best aspects of methods. With a visual pro- each tool. In response, gramming environment that AccelChip and the DSP divi- leverages The MathWorks sion at Xilinx collaborated in an Simulink tool and its prede- effort to combine AccelChip’s fined Xilinx block set of DSP tools with System Generator. functions, System Generator The result allows you to meets the needs of both sys- mix language-based algorithm tem architects (to integrate design in MATLAB and graph- design components) and hard- ical block-oriented design in ware designers (to optimize Simulink in a novel unified implementations). (For more electronic-system-level (ESL) Figure 1 – System Generator/AccelChip interface details, see “Implementing design environment (Figure 1).

Second Quarter 2005 Xcell Journal 73 DESIGN TOOLS

AccelChip’s DSP Synthesis tool aug- statistics of the measured signal and the fixed-point design effects like saturation ments System Generator by providing a predicted behavior. The last nine lines of and rounding. AccelChip aids in this seamless integration path for algorithm the algorithm are the code that predicts process by propagating bit growth through- developers, enabling the rapid creation of and corrects the estimate. out the design and letting you use direc- IP blocks, directly from M-files, that This algorithm illustrates the flexibility tives to set constraints on bit width. This enhance the Xilinx block set in System and conciseness of the MATLAB language. algorithmic design exploration allows you Generator. In addition, AccelChip has Common operators such as addition and to attain the ideal quantization that mini- optional AccelWare toolkits that comple- subtraction operate on variables like the mizes bit widths while managing overflows ment System Generator with additional IP two-dimensional arrays A or P_cap without or underflows, allowing early trade-offs of cores optimized for Xilinx cores. AccelWare having to write loops, as you would in lan- silicon area versus performance metrics. toolkits include mathematical building guages like C. Multiplication of True signals blocks, signal processing, communications, two-dimensional arrays is auto- 0.5 and advanced math to implement linear matically performed as matrix algebra functions. multiplication without any special 0

annotation. MATLAB operators -0.5 Kalman Filter Design Example such as matrix transposition allow 0 50 100 150 200 250 300 350 Signals with additive noise To illustrate this approach, let’s take an the MATLAB code to be compact 1 advanced algorithm written in MATLAB, and easily readable. And complex 0 use AccelChip to synthesize the design, and operations like matrix inversion -1 then integrate it into a System Generator are completed using MATLAB’s 0 50 100 150 200 250 300 350 model. Our example is a Kalman filter – a extensive linear algebra capabili- Filtered signals 1 recursive, adaptive filter well-suited to ties. Although such an algorithm combining multiple noisy signals into a could be constructed as a block 0 clearer signal (for details on the topic, see diagram, doing so would obscure -1 Arthur Gelb’s book, “Applied Optimal the algorithm structure so readily 0 50 100 150 200 250 300 350 Estimation”). apparent in MATLAB. Figure 3 – The Kalman filter constructs estimates of the Kalman filters embed a mathematical With AccelChip, a first step in three signals based on input signals corrupted by noise. model of an object – such as a commercial synthesizing a complete algorithm aircraft being tracked by ground-based is to generate any major cores that radar – and use the model to predict future are referenced – in this case, the matrix Once you have attained suitable quanti- behavior. Kalman filters then use measured inverse indicated by the function call zation, the next step is to generate RTL for signals (such as the signature of the aircraft inv(P_cap_est+R). But you can implement a your target Xilinx device. At this point, you returned to the radar receiver) to periodi- matrix inverse in many ways; the choice of can use the AccelChip GUI to set constraints cally correct the prediction. which method to use depends on the size, on the design using the following design structure, and values of the matrix. directives to achieve further optimizations: Using the matrix inverse IP core from function [S] = simple_kalman(A) • Rolling/unrolling of FOR loops DIM = size(A,2); the AccelWare toolkit, you can choose persistent p P_cap if isempty(P_cap) from micro-architectures designed for dif- • Expansion of vector and matrix addi- P_cap = [8 0 0; 0 8 0; 0 0 8]; p = ones(DIM,1)/2; ferent applications. These micro-architec- tions and multiplications end; I = eye(DIM); tures can be optimized for speed, area, • RAM/ROM memory mapping of R = [128 0 0;0 128 0; 0 0 128]; power, or noise. In this case, the most suit- % estimate step: vectors and two-dimensional arrays %p_est = p; able approach is to use the AccelWare QR P_cap_est = P_cap+I; matrix inverse core. • Pipeline insertion % correction step: K = P_cap_est * inv(P_cap_est+R); • Shift-register mapping p = p + K * ( A' - p ); Synthesizing RTL with AccelChip P_cap = (I - K)*P_cap_est; S = p'; With the MATLAB M-file loaded into Using these directives constitutes AccelChip, the next step is to simulate the hardware-based design exploration, Figure 2 – Kalman filter example M-file floating-point design to establish a baseline. allowing the design team to further You would then use AccelChip to convert improve quality of results. In synthesizing the design to fixed-point math, verifying it the RTL, AccelChip evaluates the entire Figure 2 shows the MATLAB M-file in MATLAB as shown in Figure 3. design and schedules the entire algo- describing the Kalman filter. The algorithm AccelChip offers an array of tools to help rithm, performing necessary boundary defines matrices R and I that describe the you trim bits from the design and verify optimization in the process.

74 Xcell Journal Second Quarter 2005 DESIGN TOOLS

dbl fpt

Sine Wave1 Gateway In a_in_3 s_3 fpt dbl

Gateway Out Band-Limited Scope3 White Noise fpt dbl a_in_2 s_2 Gateway Out1 dbl fpt

Sine Wave Gateway In1 fpt dbl Gateway Out2 Scope2 a_in_1 s_1 Scope Band-Limited White Noise1 kalman_filter_wrapper s-model_kalman_filter_wrapper dbl fpt

Sine Wave2 Gateway In3

Scope4 dbl fpt [ac_reset] Band-Limited Gateway In2 White Noise2 Step Goto Sy stem Generator Figure 4 – AccelChip exports a cycle-accurate System Generator block that supports both simulation and RTL code generation.

Throughout this flow, AccelChip main- block for the Kalman filter is available in applied in System Generator to verify the tains a uniform verification environment the Simulink library browser. You need correctness of the design. through a self-checking test bench; the only select the Kalman filter block and drag Once verified in the System Generator input/output vectors that were generated it into the destination model to incorporate environment, you can verify the when verifying the fixed-point MATLAB the AccelChip-generated Kalman filter into AccelChip-generated block using System design are used to verify the generated a System Generator design. Generator’s supported methods – includ- RTL. The RTL verification step also gives Figure 4 shows the resulting diagram for ing HDL co-simulation and hardware-in- AccelChip the information necessary to the Kalman filter. In the center of the the-loop – to accelerate hardware-level compute the throughput and latency of the System Generator diagram is the Kalman simulation 10 to 100 times. Kalman filter. This is essential information filter, and around it are the gateway blocks to assess whether the design meets specifi- needed for a System Generator design. Conclusion cations and is critical for achieving cycle- Once the AccelChip-generated block is The integration of AccelChip with accurate simulation. included in the System Generator design, Xilinx System Generator is the first solu- you can perform a complete, system-level tion to unite MATLAB-based algorith- Exporting from AccelChip to System Generator simulation of cycle-accurate, bit-true mod- mic synthesis favored by algorithm Although RTL verification is a key step in els to verify that the system meets specifica- developers with the graphical design flow the design flow, designers want to see algo- tions. You can use AccelChip-generated used by system architects and hardware rithms running in hardware. System blocks for System Generator in conjunc- designers. It uses the rich MATLAB lan- Generator’s hardware-in-the-loop co-simu- tion with the Xilinx block set. Once this guage and its companion toolboxes to lation interfaces make this a push-button system-level verification step is completed, create System Generator IP blocks of flow, allowing you to bring the full power the next step in the System Generator flow complex DSP algorithms. of MATLAB and Simulink analysis func- is to move on to design implementation. By using these tools together, design tions to hardware verification. The “Generate” step in System Generator teams can employ the most productive Now that you have run RTL verifica- compiles the design into hardware. means of modeling hardware for imple- tion in AccelChip, you are ready to export mentation, fully involving algorithm the AccelChip design to System Generator Verification Options developers in the FPGA design process by going to the “Export” pull-down menu All design files generated by AccelChip, and completing higher quality designs in the AccelChip GUI and selecting including exported System Generator files, more quickly. “System Generator.” AccelChip then gen- are verified back to the original “golden” For more information on AccelChip erates a cycle-accurate System Generator source MATLAB M-file. AccelChip’s veri- and its interface to Xilinx System block that supports both simulation and fication approach is based on the genera- Generator, visit www.accelchip.com. For RTL code generation. tion of a test bench from the MATLAB more information on System Generator, At this point, you transition the design source – this test bench is applied at the visit www.xilinx.com/products/design_ flow to System Generator, where a new RTL level within AccelChip and can be resources/dsp_central/grouping/index.htm.

Second Quarter 2005 Xcell Journal 75 DESIGN TOOLS Complex FPGAs Require Equivalence Checking Synopsys’ Formality equivalence checker proves identical functionality.

by Lonn Fiance Director, Strategic Alliances Synopsys [email protected] Robert Vallelunga Product Manager Synopsys [email protected]

With the latest process technologies, com- plex high-performance systems – once the exclusive domain of ASIC and custom chips – can now be created with FPGAs. Although this benefits both consumers and companies, it also creates significant new design and ver- ification challenges that must be addressed to meet time-to-market requirements. As a designer, you use system-level analysis and simulation to evaluate differ- ent architectural alternatives. The result of this architectural exploration is the RTL (register transfer language) specifica- tion, which forms the definition of your device’s functionality. Depending on the product require- ments, you can realize this functionality in a number of forms: using Xilinx® devices as FPGAs designed into the final product or prototypes for verification of complex ASICs.

76 Xcell Journal Second Quarter 2005 DESIGN TOOLS

The RTL description defines the func- Register tionality, so maintaining equivalent behavior Compare Point for all implementations under all circum- stances is critical. As design sizes have increased, the ability to prove equivalence by Logic exhaustive simulation has disappeared. Cone ASIC designers recognized this issue Port several years ago and turned to equivalence checking (EC) to maintain functionality. Black Box When you design one of today’s high-com- plexity FPGA devices, whether used in the final product or as an ASIC prototype, you Figure 1 – Logic cone with compare point are clearly facing these same challenges. As a result, EC is quickly becoming a manda- 2. Match: using a variety of tech- 3. Verification: using mathematical tech- tory verification methodology in FPGA niques, the equivalence checker niques, each set of matched compare design flows. This methodology becomes maps corresponding compare points points is proven to be functionally even more important if you have selected between the implementation and equivalent or non-equivalent. the EasyPath™ low-cost FPGA solution reference designs (Figure 2). The 4. Debug: when non-equivalent logic for your production needs. fastest of these techniques is name- cones have been identified, graphical In this article, we’ll provide an based matching, but differences debug techniques are available to isolate overview of EC technology and the bene- between the RTL and gate-level rep- the logic causing functional deviations. fits it brings to FPGA design. Using the resentations can lead to highly dis- Synopsys Formality equivalence checker similar names, making advanced Static verification has two primary bene- in your FPGA design flow enables you to matching algorithms necessary. fits: it is orders of magnitude faster than quickly verify the implementation of your Registers may have been duplicated, dynamic verification and provides 100% device, giving you the freedom to focus merged, or otherwise optimized verification coverage. EC can prove that dif- your efforts on other design tasks. away such that the logic cones ferent versions of a design are equivalent (or between design versions may be not) in a matter of minutes, rather than the Introduction to Equivalence Checking quite different, making the match- hours or days required for dynamic simula- EC is a formal, static verification technology ing process far from simple. tion. Equivalence checking also eliminates that uses mathematical techniques to deter- mine if two versions of the same design, at different stages of development, are func- tionally equivalent. The power of this tech- Reference Design Implementation Design nique is its ability to compare between: Logic Cones Logic Cones • Two RTL versions • An RTL description and a gate-level netlist • Two gate-level netlists EC flows consist primarily of four stages: 1. Read: the EC tool reads RTL descriptions or netlists for the refer- ence and implementation designs and segments the logic in each design into smaller components called logic cones (Figure 1). Logic cones are simply small groups of logic bordered by registers, ports, and black boxes. The end points for each logic cone are known as Figure 2 – Matched (or mapped) compare points between design versions compare points.

Second Quarter 2005 Xcell Journal 77 DESIGN TOOLS

Today’s very large programmable devices may contain hundreds of thousands of logic cells, making equivalency checking a mandatory component of the modern competitive design flow. the need to create extensive sets of test pat- post-synthesis netlist, and second by prov- Block-level design elements like digital terns in an attempt to demonstrate equiva- ing the equivalence between the post-syn- clock managers (DCMs) and block RAMs, lence by automatically identifying compare thesis netlist to the post place and route inherent features of the Xilinx architecture, points and then applying sophisticated ver- (PAR) netlist. You can perform a similar pose interesting challenges in FPGA ification techniques. Dynamic simulation check between the reference RTL and the designs. When they are instantiated – as is remains an important part of your verifica- ASIC implementation netlist(s) to ensure common in FPGA prototyping flows – tion strategy, but its primary use is to ensure that your final ASIC matches the func- they are treated as “black boxes” for syn- proper RTL behavior. tionality proven in the FPGA prototype, thesis and EC purposes. The contents of a as shown in Figure 3. black box are not verified, but the signals at Equivalence Checking in FPGA Flows The benefits of EC also apply when you the periphery of the “black-boxed” element Because the value of EC is well understood intend to use the FPGA as the final imple- are proven to be equivalent. by ASIC designers, its initial application to mentation of your design. The ability to RAMs can also be inferred from the FPGA prototyping is an obvious choice. In quickly and exhaustively prove the corre- RTL. These inferred RAMs can be the prototyping context, you want to prove spondence between your reference RTL, directly verified by Formality as long as that your FPGA prototype implements the the post-synthesis netlist, and the post- the inferred RAM does not become RTL functionality that will be used to cre- PAR netlist significantly reduces simula- excessively large. Large memories can be ate an expensive ASIC. tion time or multiple programming fully verified using memory-specific ver- The FPGA verification usually hap- iterations, neither of which can conclusive- ification tools. pens in two phases: first by proving the ly prove the equivalence between various equivalence of the reference RTL to the versions of your design. Using Formality in a Xilinx Design Flow Formality, the proven equivalency check- ing tool from Synopsys, allows you to veri- fy designs 10-100X faster than Reference simulation-based techniques – and identify RTL and correct logic errors 5-10X faster – FPGA ASIC because of its graphical schematic debug Implementation Implementation capabilities. This performance advantage enables you to run Formality at every stage Synthesis Synthesis of the implementation flow and catch DC FPGA or Other * * DC or Other errors when they are first introduced, great- Auto Auto Setup Setup ly reducing the cost of correcting them. ✔ Functional verification using Formality is the clear winner in terms of performance Formality Physical P & R and error isolation. Today’s very large pro- Design grammable devices may contain hundreds * Auto of thousands of logic cells, making equiva- Setup *Automated setup lency checking a mandatory component of supported in DC, DC FPGA, and the modern competitive design flow. Xilinx tools Synopsys and Xilinx have created the Design Formality EC flow for Xilinx FPGA Integrity designs (shown in Figure 4). Design dis- Assured crepancies can result from the numerous transformations that take place during synthesis and in ISE™ software (NGDbuild, MAP, and PAR). These Figure 3 – Verifiying an ASIC implementation against an FPGA prototyped implementation transformations may change the design to

78 Xcell Journal Second Quarter 2005 DESIGN TOOLS

versions and, after validating the informa- Start tion, uses it to more quickly complete the equivalence check. Linking multiple tools RTL through the setup file helps you achieve the ✔ Formality fastest possible time to results and eliminate Compatible Libraries errors that might be introduced in a manu- DC-FPGA Auto Setup Formality al setup process. RTL vs. Post-Synth Complex FPGA devices are increasingly (Step 1 of 2) simprims being designed using modular, or hierarchi- EDIF Verilog cal, flows. Formality provides an ideal way to verify each individual module and ensure Xilinx ISE unisims that they have been correctly stitched NGDbuild together at the upper levels of your design. Coregen Formality is a mandatory step in the (verilog) EasyPath methodology. EasyPath devices map Auto Setup are not reprogrammable, so ensuring that the device is 100% equivalent to the refer- par ✔ ence RTL is essential. Formality’s static analysis techniques prove functional equiv- NGDAnno netgen Formality alence, minimizing the risk of implement- Post-Synth vs. Post-P&R (Step 2 of 2) ing incorrect functionality to help you reach volume production sooner. FPGA Conclusion Today’s FPGAs have achieved the same End level of functional complexity as some ASICs. The ability to realize very complex Figure 4 – Formality/DC FPGA/Xilinx verification flow functionality in FPGAs has created tremendous challenges in the FPGA verifi- improve timing, reduce area or power, bet- the logic cone itself. For example, if you opti- cation process. Formality helps you: ter match the architectural features of the mize away a register during synthesis, you • Achieve 100% coverage Xilinx device, or meet design rules. remove the end point of a logic cone, forcing Transformations such as combinatorial that logic cone to be expanded until another • Reduce runtimes 10-100X versus reductions, sequential optimizations end point is found. The effect is that the traditional dynamic verification (retiming), FSM re-encoding, register matching of compare points may no longer • Quickly isolate discrepancies between merging or duplication, and place and be possible; even if a match is made, the logic the RTL and implementation route optimizations increase the risk of between corresponding cones is no longer unintended functional changes. equivalent and verification will fail. • Reduce or eliminate the time spent EC tools consider two versions of a re-simulating after a design change design independently, without knowing Tight Linkage Between • Verify that changes did not uninten- how they were created. Therefore, Formality Tools Improves Usability tionally impact other functionality has no awareness of the transformations that In the past, EC tools required extensive tool occurred during design implementation. setup to understand these logic optimiza- Synopsys and Xilinx have worked togeth- These transformations include changes to tion transformations. Fortunately, Synopsys er to create a proven FPGA static verifica- the logic, in net and instance names, in hier- Design Compiler FPGA and Xilinx PAR tion solution centered on Formality, Design archy, and to the number or meaning of tools write out an automated setup file for Compiler FPGA, and Xilinx ISE implemen- state elements. These changes may impact Formality. This file contains information tation tools. Formality provides a fast, thor- the names of compare points, preventing the telling Formality which optimizations were ough functional verification methodology use of efficient name-based matching tech- performed in particular areas of the design, for proving equivalence between multiple niques and increasing the use of advanced minimizing the manual setup information representations of your design. For more but slower matching techniques. that you might otherwise need to provide. information on Formality or DC FPGA, Additionally, any changes to a bounding Formality uses this setup information to visit www.synopsys.com or contact your element of a logic cone results in a change to understand the differences between the two Synopsys sales representative.

Second Quarter 2005 Xcell Journal 79 DESIGN TOOLS

by Lee Hansen Sr. Product Marketing Manager Xilinx, Inc. Complete FPGA and [email protected] Tony Thomas Technical Marketing Engineer Xilinx, Inc. CPLD Power Analysis [email protected] Device power consumption has become one The Power Central website contains everything you need of the leading design issues facing FPGA and CPLD engineers today. Device consumption to accurately predict FPGA or CPLD power consumption. can affect everything from cost and longevi- ty of the device in your project to system per- formance and battery life in hand-held applications. Xilinx® design tools have expanded over the last few releases to offer you more ways to generate accurate estimates of your device power consumption.

The Power Essentials There are two main components to FPGA chip power consumption: • Dynamic power is largely determined by the switching power of the core and the switching speed of the I/O. Dynamic power is largely affected by capacitive load, supply voltage, and switching frequency. • Quiescent power is dominated largely by transistor leakage current and by DC current from a few specialized FPGA circuits. The drive to lower FPGA costs drives transistor size down, which tends to raise quiescent power consumption. Design demands also continue to force design per- formance faster, leading to higher switch- ing rates and higher dynamic power consumption. Xilinx Virtex™-4 FPGAs can deliver as much as 70% more performance than the nearest competing FPGA offering. Normally, this trend might raise dynamic power consumption. Designs are also get- ting denser year to year; again, in normal circumstances this tends to raise static and dynamic power. However, the Virtex-4 family offers dramatic static and dynamic power reduction through architecture, design, and process, offering as much as

80 Xcell Journal Second Quarter 2005 DESIGN TOOLS

1/10th the static power of other 90 nm • System integration phase – measuring Version 4.1 of Xilinx web-based Power FPGAs. You can also obtain a 20X power power using benchtop instrumentation Tools was released in March 2005. This reduction using the available Virtex-4 (board-level measurements). new release is designed to help customers embedded functions. quickly and easily estimate the power How has Xilinx been able to reduce Xilinx Power Tools utilize “activity consumption of their target Xilinx device, power in its Virtex-4 FPGA? Triple-oxide is rate” to reach accurate power estimates. and in particular includes important new the Virtex-4 process innovation that has Clocks and other input signals have an Virtex-4 LX, SX, and FX family enhance- reduced leakage current, whereas two-oxide absolute frequency. Synchronous logic ments (Figure 1). thickness was used for Virtex-II Pro devices signals use a percentage and all past families. However, Virtex-4 “activity rate” relative to FPGAs add a third middle-thickness-oxide the associated clock. An transistor, thereby reducing static power. activity rate of 100% The Virtex-4 architecture has also indicates that a net is enabled dynamic power-per-CLB to be expected to change state reduced by 50% at comparable frequen- on every clock cycle. cies. Even at a 50% higher operating fre- Activity rates can be set quency, Virtex-4 devices reduce dynamic globally (12% of the nets power by 20%. switch each clock cycle), An overall increase in power demand also on groups of signals, or emphasizes average junction temperature individual signals. and thermal power consumption. Exceeding “Activity rate” allows you Figure 1 – Web Power Tools the allowable junction temperature leads to to adjust clock frequency reduced system performance, forces reduced in XPower and see the • Power results are now affected by device utilization, and reduces device relia- effect on power consumption to your changes in ambient temperature – one bility. The more information you have about design results. The frequency of logic ele- of the only power tools able to consid- junction temperature and how your design ments displayed by XPower is a function er ambient temperature will affect it, the more informed your deci- of their output signal(s). sion will be on package types and thermal • Iccintq and Iccauxq have been updated design considerations, as well as where you Power Tools with new, more accurate figures can make design changes to help reduce The Xilinx Power Central website for potential thermal problems. FPGA and CPLD power analysis • The Virtex-4 XtremeDSP™ block Power consumption is design-depend- (www.xilinx.com/power/) contains links to power has been updated with new ent and affected by output loading, sys- Xilinx web-based power tools, information measurements at low, medium, and tem performance (switching frequency), on XPower, ISE-integrated power analysis high toggle rates design density (number of interconnects), software, white papers, design examples, • Latest thermal data updates for all sup- design activity (the percentage of inter- and links to power-centric partner prod- ported devices connects switching), logic block and ucts and software. interconnect structure, and power supply In addition, new updates that affect both voltage levels. You can perform power cal- Web-Based Power Tools Virtex-4 FPGAs and other Xilinx device culations at three distinct phases of the Web-based power estimation is the quick- families include: design cycle: est and easiest way to get an idea of device power consumption early in the design • VccInt can now be varied for Virtex-4, • Concept phase – calculating a rough flow. Xilinx offers a complete set of web- Virtex-II, Virtex-II Pro, and Spartan-3 estimate of power based on estimates based power tools on Power Central. A new devices of logic capacity and frequency of version is released every quarter, so infor- operation. Supported using Xilinx Web • More accurate power estimation for all mation is current, and no installation or Power Tools. open-drain outputs downloading is required – just an Internet • Detailed design phase – calculating connection and a web browser. You can • New and improved online help updat- power more accurately based on specify design parameters and save and load ed with all features available for detailed information about how the design settings, eliminating the need to re- Virtex-4 FPGAs and other families; all design is implemented in the FPGA. enter design parameters with iterative use. web spreadsheets now have clickable Supported through XPower and includ- Just an estimate of design behavior and a header links that take you directly to ed with all copies of ISE™ software. target device will get you started. online help

Second Quarter 2005 Xcell Journal 81 DESIGN TOOLS

provide exact resource use; makes XPower results more accurate the XPower cross-references first time, and eliminates costly and time- routing information with consuming learning curves (Figure 3). characterized capacitance Power Central also contains reference data. Physical resources are information to help you understand and then characterized for reduce power across your entire PLD proj- capacitance. Design charac- ect. Application notes and user guides terization is continuous and with specific device and design examples ongoing for newer devices go through everything from power distri- to provide the most accurate bution considerations to minimizing results. XPower uses net power in hand-held CPLD applications. toggle rates as well as output There is also key information from power loading. XPower then com- supply partners like Intersil and Texas Figure 2 – XPower putes power and junction Instruments on third-party power-based temperature, and can display individual net products and offerings. power data as well. Available to CPLD and FPGA users, the Conclusion XPower design wizard allows for easy data The Xilinx Power Central website delivers entry to XPower and is an ideal tool for everything you need to analyze your beginners. The wizard helps you enter all FPGA or CPLD-based design, with tools the data required by XPower to get a good like XPower in ISE software, Web Power power estimate, including design files, sim- Tools supporting the leading Xilinx FPGA ulation data, operating environment, out- and CPLD product families, and complete put loading, and activity rates. The wizard partner and reference data.

Figure 3 – XPower new design wizard

XPower – Integrated, Design-Specific Power Analysis XPower, a free part of all Xilinx ISE design tool configurations, allows you to get a much more detailed estimate of your design-based power requirements. XPower estimates device power based on a mapped or placed and routed design. XPower calcu- lates an estimate of power with an average design suite error of less than 10% for mature in-production FPGA and CPLDs. It considers device data along with your design files and reports estimated device power consumption at a high level of accuracy, cus- tomized to your specific design information. XPower is integrated directly into ISE software and gives hierarchical and detailed net power displays, detailed summary reports, and a power wizard that makes it easy for new users to run XPower. XPower can accept simulated design activity data and runs in both GUI and batch mode (Figure 2). XPower considers each net and logic ele- ment in the design. The ISE design files

82 Xcell Journal Second Quarter 2005 DESIGN TOOLS IncreaseIncrease PerformancePerformance andand LowerLower CostCost ThroughThrough SystemSystem DesignDesign Poseidon’s breakthrough ESL tools shorten the system-level design cycle by using Xilinx embedded processors.

By Farzad Zarrinfar the development of the architecture. Also, perform architectural “what-if” analysis VP of Worldwide Sales & Marketing architectures must be verified early in the and accelerate time to market. The tool Poseidon Systems design cycle to maximize the benefits of suite was created specifically for processor- [email protected] processor-based designs. You cannot wait based systems that require efficient robust Bill Salefski until RTL development to discover that architectures with the need to optimize VP of Engineering your architecture does not support the sys- performance, power, and cost. Poseidon Systems tem requirements. This “redesign loop” Triton comprises two main tools: [email protected] can easily delay the overall design cycle. • Triton Tuner – a system and software Newer and better alternatives exist to Stephen Simon analysis tool Director of Sales & Business Development partition, analyze design bottlenecks in, Poseidon Systems implement, and verify system architectures • Triton Builder – a hardware/software [email protected] for Xilinx FPGAs with embedded hard or partitioning tool soft processors. Electronic system-level Triton Tuner is a simulation and analy- The introduction of embedded processors (ESL) tools and predefined cores can help sis environment based on SystemC. for FPGAs (like the immersed PowerPC™ you make critical architectural decisions, Simulation is performed at the transaction and Xilinx® MicroBlaze™ soft-processor simplify your design tasks, and reduce level from models of both the processor core) has greatly expanded the benefits of your overall design cycle. In this article, and surrounding buses and peripherals. platform FPGAs. These advantages include we’ll show how Poseidon’s new ESL tools Using Tuner, you co-simulate the hardware lower FPGA power consumption, lower can help you realize these benefits. with the application software. device cost, and more scalable perform- During simulation, the tool collects ance, particularly when balancing your Triton Tool Suite both hardware and software performance application’s execution between what will Poseidon’s Triton Tool suite is a system data. The environment then provides run in the fabric of the FPGA versus the design and acceleration environment that tools to visualize and analyze the data, embedded processor(s). enables you to quickly develop, analyze, reducing the effort required to identify Designing efficient processor-based sys- and optimize system architectures. With inefficiencies in the design. Figure 1 tem architectures and optimizing overall these ESL tools, the abstraction level of the shows a bus activity graph – a visualiza- system performance for a specific applica- design is above RTL – you can quickly tion tool to help identify bus congestion tion is challenging. For example, effectively address system issues without having to and bottlenecks in the system design. partitioning the hardware and software solve the detailed issues surrounding RTL Triton Tuner also links hardware and early in the design is difficult, yet critical to implementation. Thus, you can quickly software events, providing a key tool in

Second Quarter 2005 Xcell Journal 83 DESIGN TOOLS

determining causal relationships To create a complete accelera- between hardware and software sys- tor peripheral, you need more tems. This feature greatly simplifies than just a C synthesis tool. the task of evaluating and optimiz- Poseidon Builder includes not ing system performance. Tuner also just a C-to-RTL synthesizer that provides tools to optimize the mem- creates the compute core, but it ory structure. This can greatly also generates the communica- reduce the need for costly caches tion and control hardware. The and other high-speed memories. Poseidon tool creates a complete Triton Builder is an automated accelerator peripheral, the block partitioning tool, which simplifies diagram of which is shown in the task of accelerating the per- Figure 3. In addition to the com- formance of processor-based algo- pute core, the accelerator rithms. With Triton Builder, you includes a multi-channel DMA can easily offload compute-inten- controller, bus interface, local sive loops and functions from soft- Figure 1 – Bus activity graph memories, and other logic to cre- ware to hardware. The tool ate a plug-and-play peripheral. automatically creates a DMA (direct memory access)-based hardware hardware accelerator. This acceleration hard- Interfacing to Xilinx Tools accelerator to perform the offloaded task. ware is connected directly onto the proces- Triton Tools are extremely flexible, allow- All hardware and source code modifica- sor bus and implemented on the FPGA ing you to use either Triton Tuner or Triton tions are also created to greatly simplify fabric. The block diagram of a typical accel- Builder independently or together as an the repartitioning process. erated architecture is shown in Figure 2. integrated suite. These tools were devel- With the builder tool, you can control If while executing the application code oped to enhance the productivity in devel- a number of key parameters in which to the processor hits a section of code that has oping processor-based designs and vastly optimize the solution to your system been moved to hardware, control is passed increase their capability. requirements. Together, these tools provide to the accelerator through the inserted The Triton tools link seamlessly to a system design and optimization environ- driver, which then performs the accelerated EDK tools. A typical system design flow ment that unobtrusively plugs into a function. The accelerator runs independ- is usually an iterative process where you Xilinx EDK design flow, enabling you to ently from the processor, freeing it to per- would analyze the system performance, quickly make dramatic improvements in form other tasks. When the accelerated task determine inefficiencies, modify the sys- the performance and power consumption is completed, the results are passed back to tem, and check the resultant perform- of your application. the application program. You can imple- ance. When the resultant architecture Triton Builder generates the RTL for ment multiple independent accelerators performs to the desired level, the system the complete accelerator, the driver within the same system. is then transferred back into the Xilinx required to invoke the accelerator, and inserts the driver into the proper place in the original application. The RTL can be generated in either Verilog or VHDL. Applications in ANSI C Thread 3 = Poseidon-Generated Accelerator and Driver This integral generation process ensures Thread 1 Thread 2 that necessary hardware and software Driver Driver components are exactly matched for trou- Off-Chip ble-free design. The tool also generates an Memory RTL test bench and SystemC model for verifying the new hardware. MicroBlaze Cache Accelerator Accelerator I/O or 1 2 Memory PowerPC Design Architecture of FPGA-Based Accelerator Controller The builder tools create a design architec- ture ideally matched to FPGA design. Using Triton Builder, you can quickly create OPB or PLB Bus a peripheral using the C code that runs on the processor. The tool moves the computa- Figure 2 - Accelerated system architecture tionally intensive portions of the code into a

84 Xcell Journal Second Quarter 2005 DESIGN TOOLS

simulation environment, you can develop tion, and cost, as well as shorter design Local Compute robust efficient processor architectures. cycles. For more information, visit our Memory Core With Triton Builder, you can add sophisti- website at www.poseidon-systems.com or cated hardware acceleration to your proces- contact the sales department at (925) sor-based systems. 292-1670. To be qualified for a free tool DMA Controller The Triton tool suite enables design evaluation of the Poseidon Builder and architects to achieve higher system Tuner and a free white paper, e-mail throughput, reduced power consump- [email protected]. Bus Interface

ANSI C Application Code and OPB or PLB Bus Architecture Description

Poseidon Triton Tool Suite Figure 3 - Accelerator block diagram tool chain. The Triton tools accelerate the System Simulation HW/SW Performance Partitioning process of identifying problem areas and Analysis Acceleraton establish an integrated flow that allows you to move between the tools to develop the optimal architecture. A typical flow (shown in Figure 4) Accelerator RTL Modified C-Code comprises these steps: Accelerator Model • The designer develops the target archi- RTL Test Bench

tecture using selected Xilinx tools Co-Design: Hardware <> Software Partitioning

• The architecture description is read Embedded Hardware Embedded Processor from the microprocessor hardware Software Design (XPS) Subsystem Design specification file (.mhs) from EDK, and the ANSI C application source App SW Code IP Design HW code is read into the Triton tools Co-Verification BFM Entry IP RTOS Instruction

BSP Set Simulator • Triton Tuner profiles the ANSI C code, OnChip Debug ISE Implementation reveals bottlenecks in the code or archi- Compiler Debugger tecture, and eliminates inefficiencies • Triton Builder partitions compute- intensive algorithms in ANSI C into hardware and generates a hardware Custom Memory APU PowerPC Auxillary Controller Controller accelerator Processor

CoreConnect OPD • Triton Tuner verifies that the new Custom Bridge

Logic CoreConnect PLB system performs to the desired level GPIO • RTL, test bench, modified C code, driver, and architecture are exported

back into the Xilinx environment UART

Custom FSL (optional) GPIO Logic MicroBlaze

Conclusion Other CoreConnect OPD Poseidon’s Triton Tool suite enables you to Memory Controller rapidly and predictability analyze, opti- mize, and accelerate processor-based archi- tectures. With Triton Tuner’s SystemC Figure 4 - Integrated Poseidon and Xilinx tool flow

Second Quarter 2005 Xcell Journal 85 DESIGN TOOLS

by Salil Raje Software Development Director ControlControl YourYour DesignsDesigns withwith Xilinx, Inc. [email protected] David Knol Senior Staff Engineer thethe PlanAheadPlanAhead HierarchicalHierarchical Xilinx, Inc. [email protected] Brian Jackson DesignDesign andand AnalysisAnalysis ToolTool Product Marketing Manager Xilinx, Inc. [email protected] You can improve performance and reduce Mark Goosman Product Marketing Manager Xilinx, Inc. design closure times through quick analysis, [email protected]

feedback,feedback, andand aa block-basedblock-based flow.flow. If you are pushing the limits of perform- ance and density in an FPGA, then you are also likely facing problems that threaten your design deadlines. Long place and route runtimes and repeatability of results are two common sources of anxiety. Also, traditional FPGA tools provide a flat, push-button implementation approach – a flow that provides few clues about how to improve your design goals and even fewer opportunities to make the necessary changes. If you can relate to these challenges, Xilinx® offers a new tool that will put you in control of your design. The PlanAhead™ hierarchical design and analysis tool allows you to employ ASIC design methodologies for complex FPGAs. PlanAhead software easily integrates into your existing flow between synthesis and place and route, as shown in Figure 1. You can analyze, detect, and correct many potential problems before place and route. And with its block-based incremental capa- bilities, you can divide and conquer your design one block at a time.

Analysis PlanAhead design tools take a synthesized EDIF netlist as input and one or more UCF files for constraints. The powerful and easy-to-use graphical environment displays the Xilinx device that you have targeted (see Figure 2). At this point you have several tools available to explore your design space.

86 Xcell Journal Second Quarter 2005 DESIGN TOOLS

Timing Analysis TimeAhead also allows you to analyze Statistics TimeAhead is a flexible timing analyzer and identify performance bottlenecks in You can generate a detailed statistical report built into PlanAhead software. It allows small sections of the design. You can use for any block in your design that includes you to estimate route delays before run- this feedback to improve results through re- information on clocks, resource utilization, ning place and route. Using the synthesis or better floorplanning. RPMs, carry chains, clock regions, and the PlanAhead block-based approach, the number of internal versus external nets. This accuracy of timing estimates will improve Device Selection could help you narrow down problems in as blocks in the design are implemented It is easy to explore different devices your design to specific blocks or modules through place and route. (within compatible families) for a netlist that you could then re-code or re-synthesize because PlanAhead tools can open mul- with different options. tiple device views (floorplans) on the same netlist. In combination with Schematic Viewer TimeAhead, this enables you to decide A comprehensive schematic viewer can help RTL on the speed grade that matches your you navigate timing paths, trace through target performance and the optimal size cones of logic, or determine the floor-

Synthesis of your Xilinx part. It can also help you planned block connectivity. Some simple choose your pin package. net and pin tracing capabilities within the schematic viewer can save you hours wad- Multiple Floorplans ing through VHDL/Verilog files. PlanAhead: Analysis & Hierarchical PlanAhead software has the ability to Floorplanning open multiple device views on the same Connectivity Display netlist and create a different floorplan in PlanAhead software bundles all nets con- each of the device views. You can analyze necting any two floorplanned blocks into a ISE P&R each floorplan for utilization and per- single line, with thickness and color reflect- formance characteristics and ultimately ing the number of such nets. This bundled run through place and route. This gives connectivity shows the flow of data through Figure 1 – PlanAhead software fits you a powerful what-if mechanism for the design and is extremely useful in arriv- into your existing design flow design space exploration. ing at the “right” floorplan.

Constraints Editing You can view, edit, add, and delete any of your constraints and see the effect of timing constraint changes by re-running TimeAhead. This eliminates the need to re- run place and route – a lengthy process – just to verify constraint changes. If you decide that certain paths are not important, you can false path them, decide that certain clocks need to be tightened, or add a MAX_DELAY constraint on a long meandering net.

Design Rule Checks (DRCs) PlanAhead software sports a comprehensive set of design-rule checks to flag potential problems before launching a time-consum- ing place and route run: • SSO limit violations • I/O bank rule violations • Clock region rules • Carry chain height Figure 2 – The PlanAhead analysis environment • DSP48 internal register optimization

Second Quarter 2005 Xcell Journal 87 DESIGN TOOLS

Performance-Based Floorplanning PBlocks can group logic together to improve performance on your design. Corralling critical paths into smaller regions on the device can potentially improve performance through reduced route delays. You can use any of PlanAhead’s analysis features to drive PBlock creation and floorplan genera- tion. Once defined, a floorplan can absorb minor to moderate changes to the netlist to help keep your place and route results predictable.

Block Analysis If design analysis reveals that your criti- cal paths are contained by a few hierar- chical instances, you might consider the following technique. Create a PBlock with these instances and export the PBlock to place and route. PlanAhead Figure 3 – Analyzing placement with TimeAhead and the schematic viewer software will also export timing con- straints relevant to those instances. Your place and route results serve as a Placement dered how to break up your design into timing sanity check – if the trial PBlock You can import the placement of your smaller, more manageable blocks. Imagine does not meet timing, you are guaran- block or entire design and make changes to a flow that lets you focus your optimization teed that the logic will not meet timing the placement locations. PlanAhead design efforts on a critical block, implement it, in the global context. It is easy to pin- tools give you very easy-to-use cross-prob- and verify its performance before tackling point and address the potential show- ing functionality between the netlist tree the rest of the design. You may have also stoppers in your design. display, schematic, placement, timing wished for a flow that enables you to divide paths, and user floorplanned blocks. up a design between team members. Bottom-Up Flow Figure 3 shows the post-place and route PlanAhead software was built from Once a PBlock has run through place environment. The PlanAhead analysis fea- scratch to be your hierarchical design tool. and route and its results look promising, tures work together to make it easy for you At its core lies the PBlock (or Physical you can use the placement results for the to comprehend the effect of physical Block): a physical design entity that con- implementation on performance. tains one or more logical (RTL or structur- al EDIF) hierarchical instances. Metric Maps Simply create a PBlock, assign a region A B With metric maps (introduced in the latest for it, export its netlist and relevant con- PlanAhead release), you can detect prob- straints for place and route, import its lem areas such as a cluster of negative slack placement results, and verify its timing. C instances within a certain hierarchical PBlock technology enables several key module on the placement surface. methodologies. Congested areas or utilization issues are also detectable. Better, faster, and smarter Flexible Block Partitioning visualization puts you firmly in control of PBlocks can comprise any number of hier- the FPGA and leads to faster turnaround archical instances or logic elements (LUTs, times in fixing problems. FFs). PBlocks need not adhere to your original logical hierarchy and can be Block-Based Design dynamically redefined whenever necessary If you’ve ever struggled to get a giant FPGA throughout your flow. Let’s explore the Figure 4 – Block-based team design out the door on time, you might have won- advantages of this technology.

88 Xcell Journal Second Quarter 2005 DESIGN TOOLS

PBlock in a powerful team-based design Incremental Design flow. For example, the design manager Xilinx ISE™ software’s incremental guided could divide up the logic into PBlocks implementation flow in place and route Xilinx Events and assign one or more to each team requires you to floorplan your design. This member (Figure 4). Team members focus guiding flow will re-use your previous imple- and Tradeshows on achieving satisfactory results for each mentation results in conjunction with the Xilinx participates in numerous trade of their assigned PBlocks and return floorplan to help preserve results in the new shows and events throughout the year. This is a perfect opportunity to meet our those results to the manager. implementation. You can use PlanAhead soft- silicon and software experts, ask questions, The manager then stitches the design ware to easily generate the necessary floorplan see demonstrations of new products, back together by importing each PBlock for guided incremental flow. and hear other customer success stories. result into PlanAhead software as it With larger FPGA design sizes, more users For more information and the current arrives. Because PlanAhead design tools are complaining about unpredictability of schedule, visit www.xilinx.com/events/. do not lock down routing, the design results and long place and route runtimes. manager will need to launch a final route PlanAhead block-based flows put you in con- Worldwide Events Schedule run to complete the design. trol of your larger designs and their increasing North America complexities. IP Re-Use Table 1 offers several examples of cus- April 21 AdvancedTCA Seminar San Jose, CA PlanAhead software has the ability to tomers who have had marked results after export any IP module, which can then adopting the PlanAhead methodology. April 21- May 19 Low-Cost Joint Seminar with Synplicity be run through place and route. Once various locations you are satisfied with performance, you Conclusion May 22-25 Wind River 2005 Worldwide can save the module’s placement in your Even though FPGAs continue to grow in size User Conference team’s IP library. Another team member and sophistication, performance requirements Orlando, FL can import the IP and its placement into and time-to-market pressures remain status June 7-9 SUPERCOMM 2005 an entirely new design. quo for FPGA designers. Existing design Chicago, IL PlanAhead design tools also allow IPs flows and methodologies are struggling to July 11-15 NSREC 2005 to be moved around on the device – the keep pace, but PlanAhead software provides a Seattle, WA relative placement of all logic elements revolutionary boost to your current flow. within the IP is maintained and the net Intuitive and powerful, its design analy- Europe, Middle East, Africa (EMEA) route delays should see minimal varia- sis and block-based hierarchical design capa- tion. The time spent to meet perform- bilities put you in control. To learn more May 11 Mentor Graphics EDA Tech Forum Dresden, Germany ance on your IP modules need not be about PlanAhead design tools, visit repeated for each design using this IP. www.xilinx.com/planahead/. May 11-12 The MathWorks Model-Based Design Conference Warwickshire, UK May 25 TI Developer Conference Device Before After Milan, Italy Customer A XC2V1000 Inconsistent Results: Consistently Exceeding May 30 TI Developer Conference 25 MHz to 60 MHz 63 MHz Tel Aviv, Israel June 2 JTAG Technologies DFT Customer B XC2V8000 9 Hours Place and 3 Hours Place and Boundary-scan Seminar Route Runtime: Route Runtime: Barcelona, Spain 147 MHz 172 MHz June 8-9 The MathWorks Model-Based Design Conference Customer C XC2VP70 54 MHz 102 MHz Munich, Germany June 13-14 The MathWorks Model-Based Customer D XC2V6000 Design Time 3 Weeks: Design Time 3 Days: Design Conference 160 MHz 178 MHz Paris, France June 21 Mentor Graphics EDA TechForum Customer E XC2V4000 120 MHz 170 MHz Reading, UK

Customer F XC2V4-LX60 93 MHz 109 MHz June 23 Mentor Graphics EDA TechForum Paris, France

Table 1 – Customer designs demonstrate a boost in performance and productivity with PlanAhead software.

Second Quarter 2005 Xcell Journal 89 DESIGN TOOLS

by Mike Trimborn LabVIEW Embedded Product Manager National Instruments Building High-Performance [email protected]

For more than 27 years, National Instruments Measurement and Control (NI) has leveraged industry-standard comput- ers, the Internet, and cutting-edge technology to help engineers and scientists build test, measurement, and control solutions. We are a Systems with FPGAs pioneer and leader in virtual instrumentation – a concept that has changed the way engi- neers and scientists approach measurement National Instruments has customized off-the-shelf and control. Virtual instrumentation increases pro- FPGA hardware using graphical development tools. ductivity and lowers costs through easy-to- integrate software, such as the NI LabVIEW graphical development environ- ment; and modular hardware, such as PCI and PXI modules for data acquisition, instrument control, and machine vision. NI and Xilinx® have collaborated to take virtual instrumentation one step fur- ther by bringing the flexibility and per- formance of FPGAs to measurement and control I/O.

NI LabVIEW for Virtual Instrumentation Virtual instrumentation merges software tools with modular hardware to bring you flexible, programmable, and configurable instrumentation. These systems provide you with user-defined I/O functionality, speed, and resolution not available from traditional “box” instrumentation. When implemented with high-end mod- ular instrumentation and LabVIEW graphi- cal programming, virtual instrumentation costs less and performs better than other commercial off-the-shelf (COTS) solutions. The NI LabVIEW FPGA module extends the capabilities of LabVIEW soft- ware to target FPGAs on NI-Reconfigurable I/O (RIO) hardware. LabVIEW FPGA and RIO hardware devices bring custom timing, triggering, synchronization, and advanced counter I/O to instrumentation end users (Figure 1). Within the constraints of a COTS hard- ware platform, LabVIEW provides a high- level application programming interface (API) for software-to-hardware and hard- ware-to-hardware interconnections so that

90 Xcell Journal Second Quarter 2005 DESIGN TOOLS

among measurement and control users who need flexible architectures for custom I/O, timing, and control logic. Chip-to-chip interconnects require knowledge of low- level timing and communication protocols between components. Furthermore, soft- ware and hardware systems often require the design of low-level drivers and APIs to inter- Figure 1 – LabVIEW block diagrams download directly to Xilinx FPGAs on NI CompactRIO. face a custom FPGA board to the rest of a measurement and control system. you don’t need to generate your own cus- Xilinx FPGAs for Virtual Instrumentation This requirement forces you away from tomized API. With NI LabVIEW and RIO Several measurement and control instrumen- the GUI approach used for most other hardware, you can rapidly define custom tation vendors already use FPGAs because of tasks. Such a cumbersome custom devel- control logic and I/O lines in a LabVIEW their performance, fast development time opment can require weeks or months and program – called a virtual instrument (VI) (when compared to other processing tech- cost in excess of $100,000. With the com- – without prior knowledge of low-level nologies), and vendor reconfigurability. bined usage of LabVIEW and Virtex-II development tools. Simply draw your hard- However, until recently, this functionality FPGAs, you can successfully avoid these ware design in LabVIEW and the tool has been out of reach for many instrumenta- delays and extra costs. chain synthesizes the gates directly into the tion users. Most of the growth experienced Another consideration is that typical NI Xilinx FPGA (Figure 2). by Xilinx FPGAs in the measurement and data-acquisition products address a broad range of built-in functionality for timing, triggering, and advanced I/O handling Opportunity Challenges Goal (synchronization and event counting). Custom Measurement HDL Time Ease Development However, in some applications the timing and + + Using COTS Hardware model or built-in functionality breaks Control Hardware Integration Money and LabVIEW down. In these cases, providing hardware with a user-programmable Virtex-II FPGA Figure 2 – Using NI LabVIEW saves time and money by eliminating becomes advantageous because you can HDL coding complexity and providing an easy-to-use GUI interface. define the device to serve the specific needs of your system and application. The intuitive graphical nature of control industry has been among instrumen- With NI RIO hardware, you also save LabVIEW also makes it a unique tool for tation vendors rather than their customers. time and money. Rather than building prototyping Xilinx FPGA designs. You can NI has taken the next logical step – multiple boards for different applications, iteratively change your design blocks in bringing FPGA technology to instrumenta- you can reduce costs by using only one LabVIEW and see these changes quickly tion users – by using Xilinx Virtex™-II device and reprogram it for different appli- converted into actual FPGA behavior. FPGAs on NI RIO hardware to merge cation needs. Additionally, you can integrate previously LabVIEW’s rapid development time with Additionally, because code execution is developed VHDL or Verilog code within the performance and flexibility of Virtex-II implemented within the Virtex-II device, the LabVIEW block diagram using an devices (Figure 3). the hardware responds faster than software, HDL input node. Using the HDL input An obstacle LabVIEW overcomes is that which is very beneficial in applications node, you can develop low-level algorithms HDL software knowledge, while common requiring high-speed control and tight trig- in HDL and represent that algorithm as a among hardware designers, is not as popular gering and synchronization. When imple- single function block within the LabVIEW environment. This function block can be reused throughout the application or saved for other designs. LabVIEW also provides a thin driver- level API to interface RIO hardware to the rest of a measurement and control system, simplifying driver development. Therefore, you can easily integrate existing instrumen- tation, motion, vision, or other COTS hardware into one complete system. Figure 3 – FPGA hardware architecture on NI-reconfigurable I/O hardware

Second Quarter 2005 Xcell Journal 91 DESIGN TOOLS

menting control loops within Virtex-II on any other processing engine. We then Automotive Application devices, digital loops can run up to 20 MHz provide a small set of functions to access the NI alliance partner GÖPEL Electronic, a and analog loops up to 125 kHz. These LabVIEW FPGA data from a LabVIEW leading German PCB and electronic rates are not achievable in implementations host VI (Figure 4). device testing company, created a portable using only system-level software. To speed up development and take measuring unit for onboard vehicle analy- advantage of Xilinx compiler tools, we decid- sis and diagnostics. They used LabVIEW Implementation of LabVIEW FPGA ed to generate an intermediate netlist repre- FPGA and CompactRIO, the latest mem- One of our key objectives during the imple- sentation from the LabVIEW block diagram ber of the RIO hardware family, to create mentation of LabVIEW FPGA was to pre- in a hardware description language, in our a portable, rugged, and extremely versatile serve the same semantics, level of case VHDL. This allows us to represent handheld data recorder for use in test abstraction, and ease of use between the some control logic at higher levels of abstrac- drives in laboratories, environmental hardware implementation using LabVIEW tion, which speeds up the development of chambers, wind tunnels, and on proving FPGA and the desktop version of our code generators and makes debugging grounds. The system, called CARLOS (in- LabVIEW running on Windows. This much easier. And because LabVIEW is a very CAR LOgging System), is also ideal for allows us to leverage the development skills generic programming environment for the endurance and long-term tests, as well as of our existing LabVIEW developers and host, we developed the module generators vehicle calibration and diagnostics. migrate code between targets. Furthermore, themselves in LabVIEW, enabing us to With CARLOS, you can easily and sin- because LabVIEW has always been a paral- increase our productivity even further. gle-handedly navigate all device functions lel language and was originally designed Once we had met the objectives of cre- with a jog wheel, buttons, and soft keys. with programming hardware in mind, its ating identical semantics between Several project functions embedded in parallelism is an ideal match for program- LabVIEW implemented on a FPGA and CARLOS help you manage various meas- ming parallel FPGA hardware. LabVIEW implemented on other process- uring units simultaneously. Because the sys- To implement a LabVIEW block dia- ing engines, we turned our attention to tem is based on LabVIEW and Virtex-II gram onto the FPGA, we generate a set of enhancing the language constructs to take FPGA technology, you can reconfigure its state machines (in addition to the function- further advantage of the fact that we are functionalities far beyond its factory set- ality described in the blocks of the diagram) running native on hardware. We optimized tings to meet a wide variety of testing needs. that behave as an “enable chain” to control our timing with single-cycle timed loops, as the flow of data. We also make data from the strict use of the enable chain had forced Conclusion the LabVIEW FPGA VI available to exter- us to have a cycle delay for each function. Xilinx and NI have collaborated to help nal programmers very intuitively by repre- Using the single-cycle timed loop, we can measurement and control engineers build senting this data as controls and indicators execute complete loops within one FPGA high-performance, custom I/O devices on the front panel. You can treat the VI clock cycle. Customers can also integrate leveraging flexible off-the-shelf hardware running on the FPGA just as if it were a VI legacy VHDL or Verilog code. powered by Xilinx FPGAs at a considerable cost and time savings. While LabVIEW can be used to define FPGAs quickly, the use of Xilinx Virtex-II FPGAs provides reconfig- urable customization for many measure- ment and control platforms from NI. Within hours, you can turn an off-the- shelf I/O board into a control device that exactly meets your needs without ever knowing VHDL or other low-level hard- ware design tools. By simply changing the block diagram of the LabVIEW FPGA code, you can recreate hardware with a completely different personality and func- tionality. This rapidly implemented, reconfigurable I/O approach is unprece- dented in the measurement and control industry and continues to expand the capabilities of virtual instrumentation. For more information about LabVIEW Figure 4 – LabVIEW FPGA block diagram and front panel with corresponding LabVIEW host block diagram FPGA, please visit www.ni.com/fpga/.

92 Xcell Journal Second Quarter 2005 Prototype Your PCIe ASIC HERE

Prove your design with high speed FPGA hardware emulation plugged directly into your PCIe system. Here are 4.5 million gates to emulate your ASIC and kill the RTL bugs before you cut masks. This board will let you test your software and increase your chances that the first spin will be the last. The DN6000K10PCIe is packed with the features you need: •1,4 and 8-lane versions •Six VirtexII-Pro FPGAs (-2vp100s, the big ones) •10 DDR (64Mx16) and 4 SSRAMs (2Mx36) external to the FPGAs •Expansion capability to customize your application •Synplicity Certify® models for quick and easy partitioning Like all our products, this new PCI Express bus board will help you get your ASIC to market on time and in budget. Call The Dini Group today-- PCIe is already here.

1010 Pearl Street, Suite 6 • La Jolla, CA 92037 • (858) 454-3419 • Email: [email protected] TheThe HydraHydra ProjectProject Hydra, currently the strongest chess program in the world, is a cutting-edge application that combines cluster computing with the fine-grain parallel FPGA world.

by Chrilly Donninger Programmer Nimzo Werkstatt OEG [email protected] Ulf Lorenz Researcher Paderborn University [email protected]

The International Computer Games Association (ICGA) regularly organizes computer chess world championships. For quite a long time, large mainframe computers won these championships. Since 1992, however, only PC pro- grams have been world chess champi- ons. They have dominated the world, increasing their playing strength by about 30 ELO per year. (ELO is a sta- tistical measure of 100 points differ- ence corresponding to a 64% winning chance. A certain number of ELO points determines levels of expertise: Beginner = ~1,000 ELO, International Master = ~2,400, Grandmaster = ~2,500, World Champion = ~2,830.) Today, the computer chess commu- nity is highly developed, with special machine rooms using virtual reality and closed and open tournament rooms. Anybody can play against grandmasters or strong machines through the Internet.

94 Xcell Journal Second Quarter 2005 Hydra important feature of FPGAs, because near the top of the complete game tree for The Hydra Project is internationally driv- the evolution of a chess program never examination; usually we select it with the en, financed by PAL Computer Systems in ends, and the dynamic progress in help of a maximum depth parameter. We Abu Dhabi, United Arab Emirates. The computer chess enforces short develop- then assign heuristic values (such as one core team comprises programmer Chrilly ment cycles. Therefore, flexibility is at side has a queen more, so that side will Donninger in Austria; researcher Ulf least as important as speed. probably win) to the artificial leaves of the Lorenz from the University of Paderborn in pre-selected partial tree. We propagate • We can use a lot of fine-grain Germany; chess grandmaster Christopher these values up to the root of the tree as if parallelism. Lutz in Germany; and project manager they were true ones (Figure 1). The key observation over the last 40 Muhammad Nasir Ali in Abu Dhabi. Technical Description FPGAs from Xilinx® are provided on PCI years of computer chess data is that the game The key feature that enables computer cards from AlphaData in the United tree acts as an error filter. The larger the tree chess programs to play as strong as – or Kingdom. The compute cluster is built by that we can examine and the more sophisti- stronger – than the best human players is Megware in Germany, supported by the cated its shape, the better its error filter their search algorithm. The programs per- Paderborn Center for Parallel Computing. property. Therefore, what we need is speed. form a forecast: given a certain position, Taiwan is involved as well. what can I do, what can my opponent do The goal of the Hydra Project is literal- next, and what can I do thereafter? Top ly one of world dominance in the comput- Modern programs use some variant of er chess world: a final, widely accepted the Alphabeta algorithm to examine the victory over human players. Indeed, we are resulting game tree. This algorithm is opti- convinced that in 2005, a computer will be mal in the sense that in most cases it will the strongest chess entity, a world first. examine only O(bt/2) many leaves, instead of Four programs stand out as serious con- bt many leaves, assuming a game tree depth tenders for the crown: of t and a uniform branching of b. With the • Shredder, by Stefan Meyer-Kahlen, the help of upper and lower bounds, the algo- Figure 1 – Game tree search (in the blue part) dominating program over the last rithm uses information that it collects dur- leads to an approximation procedure. decade ing the search process to keep the remaining The Hardware Architecture • Fritz, by Frans Morsch, the most well- search tree small. This makes it a sequential Hydra uses the ChessBase/Fritz GUI run- known program procedure that is difficult to parallelize, and naïve approaches waste resources. ning on a Windows XP PC. It connects to • Junior, by Amir Ban and Shay Although the Alphabeta algorithm is the Internet using ssh to our Linux cluster, Bushinsky, the current Computer efficient, we cannot compute true values which itself comprises eight dual PC server Chess World Champion for all positions in games like chess. The nodes able to handle two PCI buses simul- • Our program, Hydra, in our opinion game tree is simply far too large. Therefore, taneously. Each PCI bus contains one FPGA the strongest program at the moment we use tree search as an approximation pro- accelerator card. One message passing inter- cedure. First, we select a partial tree rooted face (MPI) process is mapped onto each of These four programs scored more than 95% of the points against their opponents Four Dual PCs with Myrinet in the 2003 World Championship. Interconection Network Computational speed, as well as sophis- ticated chess knowledge, are the two most CPU 1 CPU 2 important features of chess programs. Memory Hub FPGAs play an important role in Hydra by Controller harnessing the demands on speed and pro- gram sophistication. Additionally, FPGAs PCI RAM Controller provide these benefits: • Implementing more knowledge FPGA 1 FPGA 2 requires additional space, but nearly PCI Bus Internet no additional time. • FPGA code can be debugged and changed like software without long ASIC development cycles. This is an Figure 2 – A cluster of dual PCs supplied with two FPGA cards; each is connected to a GUI via the Internet.

Second Quarter 2005 Xcell Journal 95 the processors; one of the FPGAs is associat- we can solve them with the help of a move generator consists of two 8 x 8 chess ed with it as well. A Myrinet network inter- Configware coprocessor, benefiting from the boards, as shown in Figure 3. The connects the server nodes (Figure 2). fine-grain parallelism inside the application. GenAggressor and GenVictim modules We have a complete chess program on-chip, instantiate 64 square instances each. Both The Software Architecture consisting of modules for the search, the determine to which neighbor square The software is partitioned into two: the evaluation, generating moves, and executing incoming signals must be forwarded. distributed search algorithm running on or taking back moves. At present, we use 67 The square instances will send piece the Pentium nodes of the cluster and the block RAMs, 9,879 slices, 5,308 TBUFs, signals (if there is a piece on that square), soft co-processor on the Xilinx FPGAs. The basic idea behind our paralleliza- GenVictim (To-Square) GenAggressor (From-Square) tion is to decompose the search tree in order to search parts of it in parallel and to balance the load dynamically with the help of the work-stealing concept.

First, a special processor, P0, gets the search problem and starts performing the forecast algorithm as if it would act sequen- tially. At the same time, the other proces- sors send requests for work to other 64 Square 64 Square randomly chosen processors. When Pi (a processor that is already supplied with 64 x16 Input Bits Comparator 64 x16 Input Bits Comparator Tree Tree work) catches such a request, it checks whether or not there are unexplored parts 1. Occupied squares send a signal in GenVictim 1. Winner square generates signal of a super piece 2. Free squares forward these signals 2. Free squares forward the signals of its search tree ready for evaluation. These 3. All squares receiving a signal are potential to-squares 3. Squares occupied by own pieces are potential from-squares unexplored parts are all rooted at the right 4. Comparator tree selects most attractive to-square 4. Comparator tree selects most attractive from-square (taking moves). siblings of the nodes of Pi’s search stack. Pi sends back either a message that it cannot Figure 3 – The gen modules form the move generator. perform work, or it sends a work packet (a chess position with bounds) to the request- ing processor Pj. Thus, Pi becomes a master 534 flip-flops, and 18,403 LUTs. An upper respectively, forwarding the signals of far- itself, and Pj starts a sequential search on its bound for the number of cycles per search reaching pieces to neighbor square own. The processors can be master and node is nine cycles. We estimate that a pure instances. Additionally, each square can worker at the same time. software solution would require at least output the signal “victim found.” Then we The relationship dynamically changes 6,000 Pentium cycles. The longest path con- know that this square is a “victim” (a to- during computation. When Pj has finished sists of 51 logic levels, and the design runs at square of a legal move). The collection of its work (possibly with the help of other 30 MHz on a Virtex™-I 1000. We have all “victim found” signals is input to an processors), it sends an answer message to just ported the design to a Virtex arbiter (a comparator tree) that selects the

Pi. The master/worker relationship between XC2VP70-5 so that we can now run the most attractive not-yet-examined victim. Pi and Pj is released, and Pj becomes idle. It program with 50 MHz. The GenAggressor module takes the again starts sending requests for work into In software, a move generator is usually arbiter’s output as input and sends the sig- the network. When processor Pi finds out implemented as a quad-loop: one loop over nal of a super-piece (a combination of all that it has sent a wrong αβ-window to one all piece types; an inner loop over pieces of possible pieces). For example, if a rook- of its workers (Pj), it makes a window mes- that type; a second inner loop for all direc- move signal hits a rook of our own, we sage follow to Pj. Pj stops its search, corrects tions where that piece can move; and the will find an “aggressor” (a from-square of the window, and starts its old search from most internal loop for the squares to which a legal move). Thus, many legal moves are the beginning. If the message contained a the piece can move under consideration of generated in parallel. “cutoff,” which indicates superfluous work, the current direction. This is quite a These moves must be sorted and we have

Pj just stops its work. We achieve speed-ups sequential procedure, especially when we to mask those already examined. The moves of 12 on the 16 cluster entities, which consider that piece-taking moves should be are sorted with the help of another com- means that Hydra now examines 36 million promoted to the front of the move list. parator tree and the winner is determined nodes per second. In a fine-grain parallel design, however, within six levels of the tree. Sorting criteria At a certain level of branching, the we have a fast, small move generator, which is based on the value of attacked pieces and remaining problems are small enough that works very differently. In principle, the whether or not a move is a killer move.

96 Xcell Journal Second Quarter 2005 Figure 4 shows the Finite State Machine look for a move in the state FS_START. Therefore, in future versions of Hydra, we for the recursive Alphabeta algorithm. On Otherwise we enter the quiescence search, plan to switch to newer generations of the left side of the figure you can see the which starts with the evaluation inspection. Xilinx FPGAs, increase the number of states of the normal search, including the In the quiescence search we only consider processors further, and fine-tune the evalu- nullmove heuristic. The right side shows capture- and check-evasion moves. ation function. the states of the quiescence search. The The search stack (not shown) is realized For more information, visit www. main difficulty is controlling the timing, as by six blocks of dual-port RAM, organized hydrachess.com and www.chessbase.com. some of the sub-logic (the evaluation and as 16-bit wide RAMs. Thus, we can write the move generator) may need two or three two 16-bit words into the RAM, or one 32- cycles instead of one. We enter the search at bit word at one point of time. A depth vari- FS_INIT. If there is anything to do, and if able controlled by the search FSM controls History of Modern nullmove is not applicable, we come to the the data flow. Various tables capture differ- Computer Chess start of the full search. ent local variables of the recursive search. After possibly increasing the search • 1940-1970: Programmers attempt to depth (not shown in Figure 4), we enter the Conclusion mimic human chess style, but resulting states FS_VICTIM and thereafter We are quite optimistic that Hydra already programsprograms areare weakweak FS_AGGR, which give us the next legal plays better chess than anybody else. move as described previously. Reaching the Nevertheless, we must now show this in a • 1970s: Chess 4.5 is the first “strong” state FS_DOWN corresponds to a recur- series of matches. program,program, emphasizingemphasizing treetree search.search. ItIt isis thethe firstfirst sive call of the Alphabeta algorithm with a At the same time, we want to maintain programprogram toto winwin aa tournamenttournament againstagainst humanshumans 1-point window (α,α + 1). If the search the distance between Hydra and other (the(the MinnesotaMinnesota OpenOpen 1977)1977) remaining depth is greater than zero, we computer players and even increase it. • 1983: Belle becomes National Master with 2,100 ELO

Eval • 1988: Hitec wins for the first time against <= α, or depth Eval FS_INIT finished No > 0? aa GrandmasterGrandmaster Yes nullmove • 1988: IBM’sIBM’s DeepDeep ThoughtThought playsplays onon applicable? Yes Grandmaster level No

FS_START FS_EVAL • 1992: TheThe ChessMachine,ChessMachine, aa conventionalconventional PC program by Ed Schröder, becomes

eval <= α? World Champion. FS_VICTIM QS_VICTIM No Yes • 1997: IBM’sIBM’s DeepDeep BlueBlue beatsbeats KasparovKasparov inin side new victim? new victim? capture found? to move in a six-game match check? a six-game match No No Yes Yes No Yes Yes No • 2003: Hydra’s predecessor wins a human QS_AGGR • 2003: futility cuttoff? futility cutoff Grandmaster Tournament with 9 out of 11 Yes at horizon? No agressor points,points, reachingreaching 2,7682,768 ELOELO FS_AGGR found? Yes

No No • FebruaryFebuary 2004:2004:HydraHydra achieves achieves #1 #1 rank rank agressor found? victim inin InternationalInternational PaderbornPaderborn ComputerComputer ChessChess available? Championship Yes Yes do move do move No Hydra reaches 2,920 ELO FS_DOWN QS_RETURN • April 2004: Hydra reaches 2,920 ELO on ChessBase chess server undo move on ChessBase chess server Yes depth >= 2? cutoff? QS_CUTOFF • August 2004: Hydra scores 5.5:2.5 against Shredder and 3.5:0.5 against No No against Shredder and 3.5:0.5 against No aa 2,6502,650 ELOELO GrandmasterGrandmaster depth <= 0? depth <=0? depth <=0? • October 2004: Hydra is crowned Machine No No World Team Champion against the human team,team, performingperforming 2,9502,950 ELOELO againagain

Figure 4 – Simplified flow chart for the 56-state FSM that operates the Alphabeta algorithm

Second Quarter 2005 Xcell Journal 97 AccelerateAccelerate TimeTime toto MarketMarket withwith EmbeddedEmbedded ProcessingProcessing QuickStart!QuickStart! Embedded Processing QuickStart! will give you the essential training to start and complete an embedded FPGA design.

by Jonathan Trotter plete an embedded system. In the amount Embedded Processing QuickStart! includes: Titanium Business Development Manager of time it takes for learning and training, • Configuration of the Xilinx design Xilinx, Inc. the competition could release their prod- [email protected] uct first. A productive and effective design environment team will enable you to ship to your cus- • An instructor-led embedded systems Co-designing across processor and logic tomers before your competitors do. development course domains can present a significant challenge The Embedded Processing QuickStart! for both embedded software engineers and solution delivers individualized service that • Design architecture/implementation logic designers. To take advantage of the includes a QuickStart! application engi- consultation and guidance embedded capabilities within the hardware, neer at your site for a week. This Xilinx • Guidance with system partitioning designers should learn how to properly uti- expert will train your team on creating lize the Xilinx® Embedded Development embedded systems and teach them how to • Initial design techniques to enable Kit (EDK) and how to apply the best optimize supporting FPGA features. The faster and more effective debug and methodology to implement its functions. assistance provided will help your team verification With time, any design team can over- exceed expectations and achieve desired come these hurdles and successfully com- design results. • A comprehensive training plan

98 Xcell Journal Second Quarter 2005 Benefits ductive, proficient designing machine is transferring what they learned in the If you had to choose between five individ- simply time. Time yields experience and course to a practical design environment. ual devices or five individual functions on familiarity. If your team has experience and Most engineering teams will have quite a one device, you would probably choose the is comfortable designing with a new tech- few lingering questions after such an in- latter. The more dynamic design elements nology, they will finish faster. When they depth course. With the trainer on-site for on one device, the more flexibility and encounter obstacles, they will know what several days afterwards, these questions can options you have. One or all embedded steps are needed to proceed. be answered. processing solutions – such as the Ultra Another major feature of Embedded Controller, PicoBlaze™ soft-core proces- Embedded Processing QuickStart! Features Processing QuickStart! is assistance with sor, MicroBlaze™ soft-core processor, and With Embedded Processing QuickStart!, system partitioning. Some applications are the PowerPC™ 405 – are located or can be Xilinx provides one week of on-site assis- better suited for a processor and others are implemented on one powerful device. tance by a trained Xilinx applications engi- better suited for FPGA fabric. To obtain These system components were once neer. Regardless of your team’s competence optimum performance, your design team separate, discrete, and alone on the board. or skill level, the engineer will set up and will be able to determine which functions In the future, there will be a wave of pro- customize EDK software for you. to implement to logic and which func- grammable embedded platforms. Why not Included during that week of on-site tions should be executed by the CPU. reunite the lonely CPU with some logic, support is a two-day training course in This is critical for any embedded FPGA block RAM, DSP, and connectivity? embedded system development. The course design. With proper hardware/software Instead of all of that functionality linked partitioning, your design can achieve to a whole board, it can now be self-con- superior performance. tained in a single device. The Xilinx The on-site engineer management embedded solution does not just help you leaves behind a training plan tailored for a single application; our solution will specifically to the needs of your design increase the system performance of all of team, which can include system parti- your applications. tioning, CPU programming, C pro- A CPU is better equipped to handle gramming, debugging, and verification. some functions than traditional FPGAs. A This plan helps prevent schedule slips good example of this is arithmetic func- later in the project by ensuring that your tions – CPUs are designed for this purpose. team is skilled in the required disci- So why not let the processor do what it plines. It also helps maintain a more does best and leave the FPGA to concen- effective and highly motivated team that trate on the tasks at which it excels? With is properly equipped and confident to the Xilinx embedded solution, you can do handle any obstacle. just that. By adding a processing solution to your design, you can create a more Conclusion advanced system and provide more features Software engineers must design across to your customers. logic domains and hardware engineers must learn how to design in the software Challenges domain. The effect of these challenges Customers are often reluctant to design lengthens the development schedule and with new technology because the time used covers the benefits of implementing an 8- can affect time to market. for training is time not used for designing. bit PicoBlaze or a 32-bit MicroBlaze soft- The Embedded Processing QuickStart! The cost of using our flexible, scalable, processor core within the Spartan™-3 solution offers an unprecedented level of powerful processing solution is time. And architecture, and highlights the advantages on-site design support and training for while you’re learning the features of our of the embedded IBM PowerPC core with- the critical initial design phase of your platform FPGA, your competitors are com- in Virtex™-II Pro and Virtex-4 FPGAs. project. This service will not only show ing up to speed as well. If time to market is With the Embedded Development Kit your team where to begin; it will also critical for you, the risk of your competi- (EDK), you can create any of the combina- empower them to complete the project tors successfully completing their projects tions that fit your design specifications by on time and on budget. before yours can be catastrophic. applying the techniques taught in the train- For more information about The difference between a good design ing course. Embedded Processing QuickStart!, con- and a great design is time. The difference After the two-day course, the engineer tact your Xilinx representative, or go to between a novice design team and a pro- remains on site to assist the design team in www.xilinx.com/epq/.

Second Quarter 2005 Xcell Journal 99 ImplementingImplementing BioinformaticsBioinformatics AlgorithmsAlgorithms onon Nallatech-ConfigurableNallatech-Configurable Multi-FPGAMulti-FPGA SystemsSystems A computational acceleration of 200X is impressive, but the future of bioinformatics requires more.

by Keith Regester has led to the emergence of a new area of grammability of FPGAs enables algorithm- Systems Applications Engineer research: bioinformatics, where powerful specific computing architectures to be Nallatech [email protected] computational techniques are used to implemented using the same hardware store, analyze, simulate, and predict bio- resource across a range of algorithms. In Jong-Ho Byun logical information. this article, we’ll describe the design flow, Research Assistant Although raw computational power as test results, and system optimization for University of North Carolina at Charlotte predicted by “Moore’s Law” has led to the implementing a Smith-Waterman algo- [email protected] number of transistors that can be integrat- rithm using a scaleable Nallatech FPGA Arindam Mukherjee ed doubling every 18 months, the genom- computing architecture. Assistant Professor ic data at GenBank (the NIH genetic University of North Carolina at Charlotte sequence database, an annotated collection Sequence Matching Overview [email protected] of all publicly available DNA sequences) is The objective of bioinformatic sequence Arun Ravindran doubling every six months. Proteomic and matching is to identify similarities between Assistant Professor cellular imaging data is expected to grow subsequences of strings as far as possible. University of North Carolina at Charlotte even faster. Post-genomic-era bioinformat- The dissimilarity of two sequences of [email protected] ics will require high-performance comput- nucleotides or amino acids can be defined ing power of the order of several hundreds as the minimum change required in one to In the past decade there has been an explo- of teraflops or more. get the other one, among all possible align- sive growth of biological data, including In recent years, FPGAs have emerged as ments between them – this is the edit dis- genome projects, proteomics, protein high-performance computing accelerators tance of the pair of sequences. structure determination, cellular regulato- capable of implementing fine-grained, Note that it depends on the choice of ry mechanisms, and the rapid expansion in massively parallelized versions of computa- scoring scheme. For instance, a possible digitization of patient biological data. This tionally intensive algorithms. The repro- scoring scheme can assign 0 values to

100 Xcell Journal Second Quarter 2005 matches of nucleotides at certain positions networking software, greatly increasing Merging Algorithms and Networks in the sequences, 1 to gaps (insertions or the available computing capability. Let’s The design flow consists of two interde- deletions) at those positions, and 2 to mis- briefly describe the different FPGA net- pendent tasks: matches (substitutions) at those positions. work components. The Smith-Waterman algorithm com- • Capture of the Smith-Waterman putes the local alignments or similarities High-Capacity Motherboard – BenNUEY algorithm in VHDL between substrings of two sequences A and The BenNUEY motherboard features a • Creation of the FPGA network B of lengths m and n, respectively, using a Xilinx® Virtex™-II FPGA and module sites dynamic programming approach. Dynamic for as many as six additional FPGAs. The The user VHDL design maps the Smith- programming is a strategy of building a PCI/control and low-level drivers abstract Waterman algorithm to a systolic array solution gradually using simple recurrences. the PCI interfacing, resulting in a simplified structure. As shown in Figure 1, the array The key observation for the alignment design process for designs/applications. structure comprises computational units problem is that the similarity between known as processing elements (PEs) that sequences A[1..n] and B[1..m] can be com- Virtex-II Expansion Module – BenBlue-II have local interconnections between them. puted by taking the maximum of the three The BenBlue-II DIME-II module provides a Each element of the query sequence is hard- following values: substantial logic resource ideal for imple- coded into a PE to achieve an area-efficient menting applications that have a large num- realization of approximately four slices. • The similarity of A[1..n -1] and ber of processing elements. Through support Note that the length of the maximum B[1..m -1] plus the score of substitut- for as many as two on-board Xilinx length of the query sequence corresponds ing (sub) A[n] for B[m] XC2V8000 FPGAs, the BenBlue-II can pro- to the maximum number of PEs that can • The similarity of A[1..n] and B[1..m- vide more than 200,000 logic cells on a sin- fit inside a given Virtex-II/Pro device. A C- 1] plus the score of deleting (del) A[n] gle module. A Virtex-II Pro version of this based routine instantiates the VHDL code of a PE multiple times to create the array of • The similarity of A[1..n] and B[1..m - module is also available, providing two interconnected PEs. Additional VHDL 1] plus the score of inserting (ins) B[m] XC2VP100 devices. code is then inserted to interface with the To solve the problem with this recur- Multi-FPGA Management – DIMEtalk FPGA network. rence, the algorithm builds an (n +1) x (m To manage the large silicon resource pool DIMEtalk networks are defined early in +1) matrix M, where each M[i, j] repre- provided by the hardware, the DIMEtalk the application development phase using sents the similarity between sequences tool accelerates the design flow for creating the supplied network design software tool, A[1..i] and B[1..j]. The first row and the a reconfigurable data network by providing DTdesign. This allows the structure of the first column represent alignments of one a communications channel between network to be outlined across the FPGAs sequence with spaces. M[0, 0] represents FPGAs and the host user environment. in a system. the alignment of two empty sequences, and Having defined the network, Smith- is set to zero. All other entries are comput- FUSE Tcl/Tk Control and C++ APIs Waterman blocks of design (VHDL or ed with the following formula: FUSE is a reconfigurable operating system VHDL-wrapped source) are imported using the custom component insertion M[i, j] = min {M[i -1, j -1] + sub (A[i], B[j]); that allows flexible and scalable control of wizard and interconnected to the network M[i -1, j] + del (A[i]); M[i, j -1] + ins (B[j])} the FPGA network directly from applica- tions using the C++ development API, in the DTdesign workspace. Because there are (m+1) x (n+1) posi- which is complemented by a Tcl/Tk toolset FPGA I/O ports for the whole design can tions to compute and each takes a con- for scripting base control. be mapped to the device pins using the high- stant amount of work, this algorithm has a time complexity of O (n2). Clearly, it also has quadratic space complexity because it needs to keep the entire matrix C in memory. GCAGTTGCA... o Data In u In T Out In T Out ... In T Out FPGA Architectures and Networking Tools n The FPGA network hardware comprises In D Out In D Out ... In D Out t e Nallatech FPGA computing cards con- InINIT Out InINIT Out ... InINIT Out r taining between one and seven FPGAs per card. Several such cards can be plugged into the PCI slots of a desktop server and Figure 1 – Systolic array of Smith-Waterman processing elements networked using Nallatech’s DIMEtalk

Second Quarter 2005 Xcell Journal 101 level drag-and-drop device editor. BenNUEY motherboard with a DTdesign then automatically gener- Virtex-II XC2V3000-4 FPGA ates VHDL code, user constraint and an attached BenBLUE-II files, and a VHDL test bench and daughterboard with two Virtex- configuration script files. II XC2V6000-4 FPGAs. DIMEtalk components include: The query sequences are • Routers that direct data around hard-coded into the FPGAs, the network and interconnect while the individual database all other component types sequence files are loaded in the main memory. The database • Bridges that move data sequences are first written between different physical across the DIMEtalk network FPGA devices into the 16K read-FIFOs of • Nodes serving as the user the individual FPGAs. Note interface to the network that that a single XC2V6000 device is capable of as many as can be connected to user Figure 2 – Screen capture of network components and FPGA designs, through node Smith-Waterman algorithm using DIMEtalk 1.26 trillion cell updates per interfaces such as block memo- second. At the end of the pro- ries (BRAMs), queues cessing, the final edit distance (FIFOs), and memory maps is written into a write-FIFO. Table 1 gives the runtimes • Edges serving as the user inter- for the different GenBank data- face to the host computer bases in the case of a single Note that the sizes of memory XC2V6000 FPGA, and two elements, such as BRAMs and XC2V6000 FPGAs in the net- FIFOs, depend on the application work. The databases used were requirements and are easily inter- the GBUNA (3 MB, 1,276 changeable in the DIMEtalk GUI sequences in one database file), environment. Figure 2 shows a GBPRI20 (187 MB, 2,036 block diagram of components sequences with each sequence within the FPGA. in a separate database file) and The entire network, including GBROD10 (245 MB, 1,200 the Smith-Waterman application, sequences with each sequence is then synthesized using the Xilinx Figure 3 – Screen capture of a multi-FPGA in a separate database file), all network designed using DIMEtalk ISE™ 6.2i flow, and the place and from the GenBank database. route timing netlist is generated. from the GenBank database are formatted The results are compared Functional and timing simulations verify using a C-routine for input to the FPGA with those obtained using a SunFire 280R the operation of the design. Thereafter, the network. The formatting consists of coding (two UltraSPARC III processors clocking at bitstreams for the entire multiple FPGA the individual nucleotide bases using 4 bits 1.05 GHz, with 8 MB L2 cache and 8 GB network are generated within the ((A – ux00, C – ux01, G – ux10, T – ux11), memory, running Solaris 9). Each DIMEtalk environment and placed in cor- where the bit denoted by “u” is undefined XC2V6000 FPGA holds PEs correspon- responding work directories. Figure 3 and the one denoted by “x” is set to 1 for ding to a 5,000-nucleotide-bases-long shows a block diagram of a multi-FPGA the starting nucleotide base of a sequence query sequence, besides a 10% overhead for network using five XC2V6000 devices as (set to 0 otherwise). The sequences are then the interface logic. We achieved an overall designed in DIMEtalk. packed into groups of eight to fully use the FPGA utilization of more than 80%. The bit files are then downloaded to the 32-bit DIMEtalk networks. The length of designated FPGAs using FUSE C++ API an individual sequence could be well over Systems Optimization calls. In runtime, dedicated Tcl/Tk soft- 200,000 nucleotide bases long. The time required to compare a query ware API functions provide runtime host sequence against a database sequence for connectivity. This enables direct communi- Test Results the individual FPGA configuration is cal- cation from the host system to all nodes The initial test utilizes three of the five culated as: within the network. FPGAs from the network. The resulting T t = t X X (N X D + Q ) The database sequences downloaded multi-FPGA network comprises a db clk N | | | |

102 Xcell Journal Second Quarter 2005 resource partitioners are currently being Computation Platform Time (GBUNA) Time (GBPRI20) Time (GBROD10) developed at UNC Charlotte for optimal Single XC2V6000 120 seconds 294 seconds 216 seconds resource allocation for a given computa- tional problem and an FPGA network con- Network of two XC2V6000 90 seconds 220 seconds 175 seconds figuration. Figure 5 shows a system-level view of mapping an enormous sequence- SunFire 280R 265 seconds 11 hours, 24 minutes 7 hours, 45 minutes matching application to a multi-device FPGA computing platform. Table 1 – Performance of the FPGA network for GenBank databases Conclusion Looking forward, using Nallatech’s scalable FPGA networks will help realize high-per- formance computing capability for bioin- formatics applications at a low cost. A practical system will be evaluated on: • Scaleability to execute diverse complex algorithms • Ease of implementation and integration • Price/performance demands Our prototype implementation of the Figure 4 – The BenNUEY-PCI-4E motherboard configured with three BenDATA-II modules Smith-Waterman sequence alignment algo- rithm provides a computational speed Where |D| is the number of elements in BenNUEY-PCI-4E motherboard has 4 GB improvement of more than two orders of a given database sequence D; Ethernet ports and eight Xilinx magnitude (200X) while running on a |Q| is the length of the query sequence RocketIO™ ports, allowing direct connec- three-FPGA network. Q (equal to the number of processing ele- tion to storage area networks (SANs), net- To find out more about Nallatech’s scal- ments); T is the number of nucleotide worked memory devices, and interlacing of able FPGA computing architectures and sequences in the database; N is the number MGTs between boards. networking tools, visit www.nallatech.com. of FPGAs in the network; and tclk is the These Nallatech boards, which became To discuss how Nallatech can support your clock period in the FPGA network. available in early 2005, would be ideally FPGA acceleration and systems develop- A truly parallel FPGA network would suited for real-world bioinformatics appli- ment needs, please contact Nallatech at have a processing time given by: cations (Figure 4). [email protected] or call 1 (877) 44- Automatically evolving computational T NALLA. t = t X X ( D + Q ) db clk N | | | | while a serial FPGA network would have a processing time given by: Partial Gene Core 0 Partial Gene Core 1 Partial Gene Core N AAGTTC...... TGC AAGG...... TGCAA GGGG...... TGCTT

tdb = tclk X T X (|D |+ |Q | ) ATGGTC.....TGC

In tasks where the processing is ATC...... TGC ATC...... TGC I/O-limited (|D| > |Q|), the FPGA network acts as a serial system, where- FPGA1 FPGA2 FPGA3 Memory1 FPGA4 FPGA5 FPGA6 FPGA7 as in tasks where the processing is computation-limited (|Q| > |D|), the FPGA Bus network behaves as a parallel system.

Host Can Be One or More Stacks of FPGA Boards Note that the use of large on-board Computer Memory4 FPGA12 Memory3 FPGA11 FPGA10 Memory2 FPGA9 FPGA8 PCI memories relieves the I/O bottleneck. Bus For example, the Nallatech BenDATA-II Module and BenNUEY- PCI-4E motherboard adds 1.5 GB of Figure 5 – Resource allocation in a multi-FPGA network on-board memory. In addition, the

Second Quarter 2005 Xcell Journal 103 POWER PLAY ConqueringConquering thethe ThreeThree ChallengesChallenges ofof PowerPower ConsumptionConsumption Why is power such an issue?

by Steve Sharp which they must design. This is not lim- would rather design in more features to Sr. Manager, Corporate Solutions Marketing ited to any single type of system, but differentiate the product than to use a Xilinx, Inc. does affect most designers. Large systems larger power supply. [email protected] with many boards or modules, as well as In portable consumer products, the portable and consumer products, all face overwhelming goal is to extend battery life As chip technology progresses to 90 nm power budgeting issues. for as long as possible. For these products, and below, power becomes a burning In large systems, power budgeting is longer battery life – both in active and issue in system design. At this node, leak- typically done for the total system, as well standby modes – is a significant competi- age plays a more major role in total as distributed power regulation on a per- tive advantage. power; smaller interconnect geometries board or per-module basis. With multiple With all of these challenges, it’s no won- with new dielectric materials affect power supplies now on every board, it is der that power issues are sounding the dynamic power as well. not a simple task to increase the power alarm bells for system designers today. According to Jordan Selburn of market budget for one board without affecting the iSuppli’s Selburn continues, “On the cus- research firm iSupply, “Leakage current – entire system’s power distribution plan. tomer side, chip designers can consider essentially insignificant at the 0.35 micron In line-powered consumer products, architectural approaches such as parallel node and earlier – has become a major issue the goal is usually to use the smallest and processing at reduced clock speeds to as transistors become increasingly leakier. least-expensive power supply possible to reduce dynamic power, or gated clocks that Studies have shown that at the 90 nm keep costs under control. Exceeding the essentially turn off entire sections of the node, leakage power can equal dynamic capabilities of a particular model power chip when they are not needed. Despite power consumption and even exceed it at supply by only a few percent can necessi- these techniques, power consumption the 65 nm node.” tate the use of a larger, more expensive remains a serious issue for a large portion of Another factor facing system designers supply, and this might be unacceptable in the core silicon market, an issue that is is the tighter power budgets around light of total system cost. Designers becoming worse, not better, with time.”

104 Xcell Journal Second Quarter 2005 POWER PLAY

System Design Challenges • Reliability. As power goes up, so does This is enabled with industry-leading There are three key areas of power usage and the threat of brown-out and latch-up technologies such as 90 nm triple-oxide control challenging system designers today: from high power-on surge. In addition, technology, high-performance embedded IP, static power, dynamic power, and in-rush higher failures-in-time (FIT) rates will and power-saving configuration circuitry. power. Each presents different issues and be expected due to higher device oper- Xilinx also provides comprehensive requires different methods to calculate and ating temperatures. tools for power system design: Virtex-4 manage power. datasheet and user guide; a web-based • Cost. As mentioned previously, higher Static power is the power consumed by a power estimator; and XPower, included in power equals higher cost in the system device when it is in its quiescent condition ISE™ software. because of larger, more expensive with no input signals being exercised. It is Virtex-4 devices handle the three types power supplies and thermal manage- also referred to as steady-state or standby of power usage and control in the follow- ment components such as fans and power. In today’s 90 nm technology devices, ing ways: head sinks. leakage currents in the transistors are the • Static power. As process geometries biggest contributors to static power. This is • End-customer operating expenses. shrink to 90 nm and lower, the usually the key parameter of concern to Higher power also impacts end users industry expects higher leakage and designers of portable equipment because of in the form of higher power bills higher static power when channel its effect on battery life, especially for (which can be significant for large length decreases. Working with fab devices that spend large amounts of time in systems) and shorter battery life for partner United Microelectronics a standby condition waiting for input from portable products. the outside world. Dynamic power is the power consumed during normal operation. It is also referred to Higher power levels in a chip can limit as operating power. Dynamic power is depen- dant on operating signal frequency, intercon- nect capacitance, and operating voltage. device and end-system performance by Because the voltage dependency is a square function, the reduction in voltage when mov- forcing a lower system clock rate to stay ing to 90 nm devices has substantially reduced operating power in many devices. However, within the system power budget. for large, high-performance systems with high operating frequencies, dynamic power is still a significant component of total system power. In-rush power is the power required at How Xilinx Helps Manage System Power Corp., Xilinx solved this problem by device power-up. It is also referred to as Virtex-4 FPGAs using triple-oxide technology in the power-up or start-up power, or power-on With a significant reduction in power con- Virtex-4 90 nm process, which surge power (or current). Some devices sumption over that of the competition, the reduces leakage current significantly. require many times more power to begin new Virtex™-4 platform FPGAs offer sig- Two-oxide thicknesses are widely used operation than they do during normal nificant benefits for system design, includ- in the industry today, with a thin operation, thereby placing demands on ing reduced thermal concerns, easier oxide in the core and thicker oxide in system power supplies. In a consumer sys- power-supply design, lower cost power the I/O area. Virtex-4 devices add a tem with very tightly controlled power supply, and higher system reliability. third medium-thick oxide transistor supply size and cost, ensuring that in-rush Virtex-4 FPGAs dramatically reduce power used for certain functions in the power is not more than normal operating consumption when compared to other FPGA. The result is 50% lower static power is a key design goal. FPGAs in all three key power areas: power than that of Virtex-II Pro Higher power levels can affect both man- FPGAs. Other FPGA vendors have ufacturers and end-customers alike, in four • As much as 73 percent lower static gone the other way when migrating key areas: power with the industry’s first triple- to a 90 nm process, with static power oxide technology • Performance. Higher power levels in a increasing more than 2X compared to • As much as 86 percent lower dynamic chip can limit device and end-system 130 nm devices. power enabled by embedded IP blocks performance by forcing a lower system • Dynamic power. New and existing clock rate to stay within the system • Negligible in-rush current with unique Virtex-4 embedded functions lower power budget. power-saving configuration circuitry dynamic power by 5 to 20x compared

Second Quarter 2005 Xcell Journal 105 POWER PLAY

to Virtex-II Pro FPGAs. This results in and delivers low dynamic power as good as XPower as much as 86% lower dynamic power any other device available today. XPower is the first power-analysis software than that of other 90 nm FPGAs. In addition to these advantages in the available for programmable logic design, basic circuit design and process technology, allowing the analysis of total device power, Note these specific examples: CoolRunner-II devices also offer power- power-per-net, routed, partially routed, or – PowerPC™ – as much as 86% management features unique to the CPLD un-routed designs. power reduction industry, including a DataGate feature to – Block RAM – as much as 82% reduce effective logic usage in the device Power Management Hardware power reduction and clock management and input hysteresis Tools for managing power are not just of features to reduce internal operating fre- the software variety. Power-management – DSP – as much as 23% reduction quencies and dynamic power. chips from National Semiconductor, with XtremeDSP™ slice Intersil, Texas Instruments, and Linear – Ethernet MAC – as much as 83% Spartan-3 FPGAs Technology are available to make the job power reduction Our customers have told us that in today’s of supplying the multiple supply voltages cost-conscious consumer products, being needed by today’s FPGAs easier, and – Logic – although Virtex-4 devices forced to put in a bigger supply just to sup- they can be valuable companions to consume similar dynamic power-per- ply a high power-on or in-rush current is Virtex-4 or Spartan-3 devices. Their logic cell when compared to other not a viable option for their system designs. individual capabilities are highlighted FPGAs, the embedded IP blocks Attention to detail when designing the in this issue of the Xcell Journal in the often allow fewer general-purpose Spartan-3 configuration logic has yielded following pages. logic cells to be used. For example, devices where the maximum quiescent when building a source-synchronous power alone is guaranteed to be sufficient Conclusion I/O (SSIO) interface, the new to power up the device. Spartan-3 devices To conquer the key challenges of power ChipSync™ block reduces the num- have no in-rush current or power specifica- consumption, it takes a combination of ber of logic cells used. tion. When using these low-cost devices, good product design, proper device tech- • In-rush power. Other high-performance you can focus on the product features and nology, and tools that let you take control 90 nm FPGAs have exhibited levels of design without worrying about increased of system power management. in-rush power more than four times system cost because of high in-rush power Xilinx is an industry leader in power that of Virtex-4 FPGAs. In Virtex-4 requirements. management and now offers many advan- devices, by spending considerable time tages within its programmable solutions: designing very power-efficient configu- Power Management Tools • Virtex-4 FPGAs consume 1 to 5W less ration logic, Xilinx has been able to Web Power Tools are pre-implementation power than competing 90 nm FPGAs. keep in-rush power within 15-20% of tools that estimate a design’s power con- the static power requirements and sumption based on the expected utilization • Spartan-3 FPGAs are the one of the below typical operating power. This of device resources, operating frequencies, only low-cost FPGAs in the industry removes the need to use a larger power and toggle rates. to eliminate power-on surge. In these supply just to address in-rush current. Once you have implemented your devices, the maximum quiescient design in the Xilinx software tools, you can power alone is sufficient to guarantee CoolRunner-II CPLDs use XPower to accurately estimate the power device power-up. When Xilinx designed the CoolRunner™-II consumption. Actual power consumption • CoolRunner-II CPLDs are the family of low-power CPLDs, our goal was to must be determined in-circuit under the world’s lowest power CPLDs, ideal deliver one of the industry’s lowest power appropriate operating conditions. for even the most power-critical levels for a programmable logic device. portable applications. These devices have standby current require- Web Power Tools ments of less than 20 µA, making them ideal The intuitive interface guides you through • Xilinx offers a comprehensive suite for battery-powered portable devices. Other the steps of the data-entry process and of power management tools, from CPLDs claiming to be low power have ensures the most accurate estimates possible. Web Power Tools to the XPower standby power 100 to 1000x higher, affect- The equations and values used by Web Power analysis tool integrated into the ing battery life so significantly that they are Tools are based on device characterizations ISE environment. unsuitable for portable applications. for the family. Web Power Tools are available You can find comprehensive information The static RealDigital technology used for the Virtex-4, Virtex-II Pro, Virtex-II, on power consumption and solving key in the logic of CoolRunner-II devices does Virtex/Virtex-E, and Spartan-3 FPGA fami- power challenges with Xilinx devices, tools, away with power-hungry sense amplifiers lies, as well as CoolRunner-II CPLDs. and solutions at www.xilinx.com/power/.

106 Xcell Journal Second Quarter 2005 POWER PLAY The Intersil ISL6521 Simplifies Power Solutions for Xilinx FPGAs

Features • Provides one regulated voltage-switching regulator capable of 20A and three linear regulators capable of 120 mA or up to 3A with an external transistor • Externally resistor-adjustable outputs • Simple single-loop control design / voltage-mode PWM control • Fast PWM converter transient response / high-bandwidth error amplifier / full 0% to 100% duty ratio • Excellent output voltage regulation / all outputs: ±2% over temperature • Overcurrent fault monitors / switching regulator does not require extra current sensing element, using instead the MOSFET’s Rds(on) • Small converter size / 300 kHz constant frequency operation / small external component count • Commercial and industrial temperature range support • Pb-free available (RoHS compliant) Applications • FPGA and PowerPC™-based boards • General purpose, low-voltage power supplies

High integration with full features enables space and cost savings.

The Intersil ISL6521 quad regulator IC solution includes one switch- switching regulator and linear voltage reference provide ±2% of stat- ing regulator for loads as high as 20A and three linear regulators that ic regulation over line, load, and temperature ranges. All outputs are each drive 120 mA or 3A with external transistors. High integration user-adjustable by means of an external resistor divider. All linear brings these four regulators together with full protection, power-on controllers can supply up to 120 mA with no external pass devices. reset, and softstart features in one space- and cost-saving IC. Employing bipolar NPNs for the pass transistors, the linear reg- Applications for this new IC range broadly, providing power solu- ulators can achieve output currents of 3A or higher with proper tions for FPGAs, ASICs, POL, embedded systems, and I/O across device selection. The ISL6521 monitors all output voltages. The medical, industrial, computing, and telecom. PWM controller’s adjustable overcurrent function monitors the out- The ISL6521 combines multiple switchers and/or linears in a sin- put current by using the voltage drop across the upper MOSFET’s gle 16-lead SOIC package, delivering the integration and flexibility rDS(ON). The linear regulator outputs are monitored through the FPGA-based designers need. This single IC solution increases avail- FB pins for undervoltage events. able board space while reducing costs and the number of required external components. For a copy of “Power Management Application Guide for The ISL6521 PWM controller is intended to regulate the low- Xilinx® FPGAs,” visit www.intersil.com/Xilinx/index.asp. voltage supply that requires the greatest amount of current (usually the core voltage for the FPGA, ASIC, or processor) with a synchro- nous rectified buck converter. The linears are intended to regulate other system voltages, such as I/O and memory circuits. Both the

Second Quarter 2005 Xcell Journal 107 POWER PLAY

Spartan-3 FPGA Power Management Solutions

Linear Technology DC/DC converters power Spartan-3 devices.

Linear Technology offers a broad range of regulators to simplify the trollers that offer high efficiency and on-chip power supply tracking. design and selection of DC/DC converters to power the Xilinx® Linear’s free SwitcherCAD™ software allows quick and simple sim- Spartan™-3 family of FPGAs. Our converters range from low ulation of power supply designs. To download SwitcherCAD soft- dropout linear regulators to sophisticated dual-output switching con- ware, please visit www.nuhorizons.com/linear.

Single Output Regulators for Spartan-3 FPGAs Core Voltage: 1.2V 200 mA 500 mA 2A-2.5A 3A-5A 25A 1.8V LTC3025 Linear LT3021 Linear LTC3026 Linear LT3150 Linear LTC3713 Controller LTC3713 Controller LTC3713 Controller LTC3414 Switcher LTC3801 Controller 2.5V to 5V LTC3405A Switcher LTC3406 Switcher LTC3412 Switcher LTC3832 Controller LTC3801 Controller LTC1773 Controller LTC1778 Controller LT1976 Switcher LT1976 Switcher LTC1771 Controller 12V LT1616 Switcher LT3431 Switcher LTC1778 Controller LTC1778 Controller LT1767 Switcher LTC1778 Controller LTC1778 Controller LT1976 Switcher LT1976 Switcher LT3431 Switcher 24V LT1934 Switcher LT3431 Switcher LTC1778 Controller LTC1778 Controller LT1767 Switcher LTC1778 Controller LTC1778 Controller

www.nuhorizons.com/linear www.linear.com

108 Xcell Journal Second Quarter 2005 POWER PLAY

Versatile FPGA Power Solutions from National Semiconductor

LM2743/44 Low-Voltage Synchronous Buck Controllers LM2734/36 1A SOT-23 Buck Regulators

LM2734 Typical application diagram LM2743 Typical application diagram (3V to 18V) Input power 1V to 16V 5V D1 VIN VIN BOOST 3V to 6V C2 L1 (Down to 0.8V) (Bias) VCC C1 µ HG * µ 10 H VOUT = O.6V to 0.85 VIN 10 µF 0.01 F V ON SW OUT ISEN 3.3V/1A SD Up to 25A loads LM2743 LG ON LM2734 D2 C3 OFF R1 10 µF EN PGOOD OFF 31.6 kΩ Output power good FB SS/TRK FB GND R2 EAO GND 10 kΩ

• Highly efficient 2A to 25A solution • Complete, easy-to-use switcher solution has the smallest foot- • Input voltage from 1V to 16V print and highest power density in the industry • Adjustable output voltage as low as 0.6V • Choice of switching frequencies allows designers to trade-off efficiency against solution size and EMI • Power-good flag and output enable • Current mode control improves phase margin, line reg- • 1.5% reference accuracy over temperature ulation, and rejection of transients • Current limit without sense resistor • Internal softstart circuitry, cycle-by-cycle, thermal shutdown, and over-voltage protection • Programmable softstart The LM2734 and LM2736 are monolithic, high-frequency (550 • Switching frequency from 50 kHz to 1 MHz kHz and 1.6 MHz) PWM step-down DC/DC converters in tiny six- • Available in small TSSOP-14 packaging pin thin SOT-23 packaging. They provide local DC/DC conversion The LM2743 and LM2744 are high-speed synchronous buck reg- for Xilinx FPGAs with currents up to 1A. ulator controllers that drive external MOSFETs to supply as much Both regulators need no external compensation and are supported ® as 25A of current. They can provide simple down conversion to by WEBENCH , National’s online design tool. The ability to drive output voltages as low as 0.6V. Although the control sections of the up to 1A loads with an internal 300 mΩ NMOS switch using state- ICs are rated for 3 to 6V, the driver sections are designed to accept of-the-art 0.5 µm BiCMOS technology results in the best power den- input supply rails as high as 16V. The use of adaptive non-overlap- sity available. The world-class control circuitry supports exceptionally ping MOSFET gate drivers helps avoid potential shoot-through high frequency conversion over the entire 3V to 20V input operating problems while maintaining high efficiency. range, down to the minimum output voltage of 0.8V. Even though A wide range of switching frequencies from 50 kHz to 1 MHz the operating frequencies are very high, efficiencies as high as 90% are gives Xilinx® FPGA and system power supply designers the flexibili- easy to achieve. Additionally, a current-mode control loop provides ty to make better trade-offs between component size, cost, and effi- fast transient response and accurate regulation in the smallest possible ciency. A versatile softstart and tracking pin allows ratiometric, PCB area. An evaluation board is available. coincidental, or offset tracking, which are critical for FPGA systems. An evaluation board is available. Feature LM2734 LM2736 Input Range 3.0V to 20V 3.0V to 18V For free samples, evaluation boards, or more information, visit power.national.com or call 1-800-272-9959. Output Load 1A 750 mA Output Range 0.8V to 18V 1.25V to 16V Internal References 0.8V, 2% 1.25V, 2% Operating Frequency 550 kHz/1.6 MHz/3 MHz

Second Quarter 2005 Xcell Journal 109 POWER PLAY TI Power Solutions

0.033 0.033 TPS75003

5 V_INPUT IN1 3 A IS1 IN2 BUCK1 SW1 Q2 100 µF IN3 FB1 15 µH EN1 3 A IS2 SS1 BUCK2 SW2 VCCINT EN2 FB2 D2 1.2 V @ 2 A SS2 OUT3 100 µF EN3 300 mA FB3 LDO VCCAUX SS3 AGND DGND DGND DGND Q1 5 µH 1.5 nF 1.5 nF 10 nF VCCO 3.3 V @ 2 A D1 61.9 k 100 µF

36.5 k

VCCAUX 2.5 V @ 300 mA 61.9 k 10 µF

15.4 k

Highly integrated triple supply powers Spartan-3 core, I/O, and VCCAUX rails.

Features The TPS75003 power management IC for Xilinx® Spartan™-II, • Two 95% efficient, 3A buck controllers and one 300 mA LDO Spartan-IIE, and Spartan-3 FPGAs integrates multiple func- tions to significantly reduce the number of external components • Adjustable output voltages: required – and simply your designs. Combining increased – From 1.2V for bucks design flexibility with cost-effective voltage conversion, the – From 1.0V for LDO device includes programmable softstart for in-rush current con- trol and independent enables for sequencing the three channels. • Input voltage range of 2.2V to 6.5V The TPS75003 meets all Xilinx startup profile requirements, • Independent softstart for all three power supplies including monotonic ramp and minimum ramp times. • LDO stable with small ceramic output capacitor • Independent enable for each supply for flexible sequencing • 4.5 mm x 3.5 mm x 0.9 mm 20-pin QFN package For more information about the complete line of TI • 1 ku price: $1.90 power management solutions for Xilinx FPGAs, including a library of reference designs, schematics, and BOMs, Applications visit www.ti.com/xilinxfpga. For questions, samples, or an • DSL modems evaluation module, e-mail [email protected]. • Set-top boxes • Plasma TV display panels • DVD players

110 Xcell Journal Second Quarter 2005

THE BOARD ROOM Spartan-3E Evaluation Kit from Avnet

Features

• Xilinx® XC3S100E TQ144 FPGA • Cypress CY7C68013 USB 2.0 controller • ST 4 MB SPI serial flash • TI TPS75003 triple voltage regulator • Windows USB download utility

Now shipping, Avnet design kits for Spartan FPGAs get you started fast.

Avnet is now shipping the newest member of its Spartan™ series Spartan-3E Design Kit design kits, the Spartan-3E Evaluation Kit, featuring the XC3S100E FPGA. This design kit, offered at $69, is an affordable and robust Featured Device Avnet Part Number Price tool for becoming familiar with the new features of the Spartan-3E XC3S100E ADS-XLX-SP3E-EVL100 $69 family. It features a Cypress USB controller that allows you to pro- Spartan-3 Design Kits gram the FPGA and/or the on-board flash device through the USB Featured Devices Avnet Part Number Price port on a PC. In addition to the new Spartan-3E Evaluation Kit, Avnet offers XC3S400 ADS-XLX-SP3-EVL400 $399 a complete portfolio of other Spartan series design kits. These kits XC3S1500 ADS-XLX-SP3-EVL1500 $499 deliver a variety of low-cost, feature-rich platforms to develop and XC3S1500 ADS-XLX-SP3-DEV1500 $749 test designs targeted at Spartan-3 and Spartan-3L devices. The kits XC3S2000 ADS-XLX-SP3-DEV2000 $849 are available as either “evaluation” or “development” kits, depend- ing on the resale price and mix of on-board peripherals. Please refer Spartan-3L Design Kit to the table for a complete list of part numbers and prices – all kits Featured Device Avnet Part Number Price listed are available and shipping today. XC3S1000L ADS-XLX-SP3-EVL1000L $599 In addition to the Spartan series design kits described here, Avnet offers a comprehensive portfolio of other design kits that sup- port additional Xilinx FPGA families, such as Virtex™-4 and Virtex-II Pro/Virtex-II devices. It also offers a variety of kits target- For a complete listing of the Xilinx design kits offered ed at specific technology domains, such as connectivity, embedded by Avnet, please visit www.em.avnet.com/xlxboardlisting. processing, and DSP.

112 Xcell Journal Second Quarter 2005 THE BOARD ROOM

TS101 TED Evaluation Board Using Spartan-3 FPGAs

Features

• Xilinx® XC3S400-4PQ208 • Xilinx XCF02SVO20C • Seven-segment LEDs (4) • LED • Push switch • DIP switch • Four channels, 10-bit analog-to-digital converter • Four channels, 10-bit digital-to-analog converter • 32 Mb flash ROM (2M x 16 bits) • 64 Mb SDRAM (4M x 16 bits) • Interfaces: – Parallel port – Serial port – JTAG – Expansion connector (20 pin [12 signals])

The TS101 is an ideal platform for evaluating Xilinx Spartan-3 FPGAs.

The Circuit Design Learning Tool “TD-BD-TS101” is a toolkit for efficiently learning how to design circuits using HDL. The toolkit is Part Number TD-BD-TS101 based on a Xilinx Spartan™-3 (XC3S400-4PQ208) mounted on Deliverables TD-BD-TS101 Board board as the main unit. Sample RTL Source Code The main unit carries an RS-232C connector for communica- tion with an external device. By using the toolkit in combination User’s Manual with your PC, you can easily check the board’s I/O operations to Information for FPGA Mapping (UCF File) learn circuit designing. This product comprises the main unit, AC adapter, documents, AC Adapter and CD-ROMs. On board, four channels of AD/DA converter, 32 Configuration Cable bits of flash ROM, and 64 Mb of SDRAM extend the use of the evaluation board for several kids of circuit design. You can program your designed circuit (test circuit) into on- board PROM through the parallel cable from your PC. After the data is written into PROM, re-supply the power or push the con- For more information, visit the figuration button. Transfer test circuit data from the PROM to the HiTech Global Distribution, LLC website at FPGA (for configuring the FPGA) and execute the test circuit. Also, www.hitechglobal.com/ted/ts101.htm. you can directly configure the bit file to the FPGA.

Second Quarter 2005 Xcell Journal 113 THE BOARD ROOM

Nu Horizons Spartan-3 2000 Development Platform

Features • Based on the Xilinx® Spartan™-3 FPGA • Development boards available: – XC3S1500-4 FG676 – XC3S2000-4 FG676 • ISSI 64 Mb SDRAM (IS42S16400) • Two 1M x 16 x 4 SDRAM – 16 MB • Footprint for 1M x 32 SSRAM provided • ST 32 Mb flash (M29W320DB) • ST high-speed CAN 2.0B PHY interface (L9616) • 2 x 24-character LCD interface • Sharp LCD panel connector I/F • SMSC 10/100 Ethernet MAC and PHY (LAN91C111) • ICS 10/100 Ethernet PHY (ICS1893BF) • Linear Technology two-channel A/D and D/A converters • LTC1654 D/A 14-bit • LTC1865L A/D 16-bit • ST audio CODEC (STW5093) • PS2 port interface The ideal platform to evaluate Spartan-3 1500 and 2000 FPGAs. • Intersil 2 – RS232 serial ports (ICL3237) • 16-bit LVDS I/F with control signals The Nu Horizons Electronics Corp. Xilinx Spartan-3 1500/2000 • Parallel Cable III and IV JTAG configuration support development platforms are complete solutions for evaluating Xilinx • ICS PLL system clock generator (ICS511, ICS8422) Spartan-3 XC3S1500 and XC3S2000 FPGAs. Each board is a com- plete development environment, allowing you to evaluate multiple system designs on a single platform. Additional plug-in evaluation modules available: For industrial and automotive applications, the system board • Linear Technology high-speed A/D converters includes the ST Microelectronics L9616 CAN 2.0B physical layer • 10/12/14-bit 10 to 135 Msps ADCs device, which when coupled with the Bosch-certified CAN core • Intersil high-speed D/A interface (available from Nu Horizons) will get you well on your way to devel- • 8/10/12/14-bit 130 to 260 Msps DACs oping applications with the MicroBlaze soft-processor core. The external Ethernet controller allows you to create and evaluate CAN- Applications to-Ethernet gateways. • MicroBlaze™ soft-processor development With the addition of plug-in evaluation modules from Linear • DSP system development Technology, you can utilize analog-to-digital converters with • Industrial systems development 10/12/14-bit resolution and sampling rates from 10 to 135 Msps. • Data communications/telecommunications Plug-in evaluation modules from Intersil allow you to utilize digital- • Universal prototyping platform to-analog converters with 8/10/12/14-bit resolution and through- put rates of 130 to 260 MHz – for use in applications such as medical imaging, 3G-4G base stations, and spectral analysis. The Spartan-3 family from Xilinx has all of the features necessary For all related documentation, visit www.nuhorizons.com/sp3/. for today’s most demanding high-performance applications. Nu And to evaluate the Spartan-3 2000 FPGA online today Horizons’ Spartan-3 development platforms provide the exact hard- through our zero-cost TechOnLine VirtuaLab, visit ware to develop and test to these requirements and meet your time- www.nuhorizons.com/VirtuaLab/. to-market expectations.

114 Xcell Journal Second Quarter 2005 THE BOARD ROOM Memec Spartan-3E Board Solutions

Features

• XC3S100E-4TQ144 FPGA • IDT 512K x 8 SRAM • Atmel 2M x 8 flash memory • Atmel 16 Mb serial data flash memory • System Ace™ interface/user I/O header • XCF01S Platform Flash ISP PROM • Silicon Labs USB-UART interface • Texas Instruments TPS75003 regulator • Low cost

Atmel Seven-Segment Serial Data Flash LED Display

The Memec Spartan-3E LC PlatformFlash User LEDs

Development Kit offers flexible Parallel IV IDT SRAM JTAG Port (512K x 8) Spartan-3E configuration prototyping XC3S100E-4TQ144 System ACE Atmel Flash (2M x 8) in a low-cost evaluation platform. Header/GPIO 50 MHz Clock USB-UART Port The Memec Spartan™-3E LC Development Kit for the User Switches ® Xilinx Spartan-3E FPGA family provides a versatile proto- Power Supply type platform for evaluating the numerous configuration 1.2V / 2.5V / 3.3V options available in Spartan-3E devices. In addition, the demo board provides the basic interface and support func- tions required by a low-cost platform for general-purpose prototype use. The kit bundles a Spartan-3E-based demo board with a For more information or to order power supply, user guide, and reference designs. Xilinx your Memec Spartan-3E LC Development Kit, ISE™ WebPACK™ software or BaseX software, along with visit www.memec.com/xilinxkits, a low-cost JTAG download cable, are available as kit options. or call (888) 488-4133 (in the U.S.) or (858) 314-8910 (outside the U.S.).

Second Quarter 2005 Xcell Journal 115 THE BOARD ROOM Xilinx FPGA Starter Kits

R

The Spartan-3E FPGA Starter Kit – $149 (available in Q3CY05) Complete solution includes: • Xilinx Spartan-3E 500,000-gate platform FPGA XC3S500E-4FG320 • 32 Mb parallel flash • 8 Mb SPI flash • 32 MB of DDR SDRAM • Board interfaces • Ethernet 10/100 PHY • USB 2.0 PHY + controller • 3-bit, eight-color VGA display port • Nine-pin RS-232 serial port The Spartan-3 FPGA Starter Kit – $99 • PS/2-style mouse/keyboard port Complete solution includes: • Three 40-pin expansion connection ports • Xilinx® Spartan™-3 200,000-gate platform FPGA XC3S200-4FT256C Additional features • Xilinx 2 Mb Platform Flash configuration PROM – XCF02S • Two-line LCD • 1 MB of fast asynchronous SRAM (512K x 16 or 256K x 32) • Four slide switches • 3-bit, eight-color VGA display port • Eight individual LED outputs • Nine-pin RS-232 serial port • Two momentary contact push-button switches • PS/2-style mouse/keyboard port • 90 MHz crystal clock oscillator • Four-character, seven-segment LED display • Universal power supply 100-240V AC, 50/60 Hz • Eight slide switches • JTAG cable • Eight individual LED outputs • Spartan-3/3E resource CD • Four momentary contact push-button switches • ISE Foundation evaluation CD • 50 MHz crystal clock oscillator • EDK evaluation CD • Three 40-pin expansion connection ports Part numbers • Universal power supply 100-240V AC, 50/60 Hz • DO-SPAR3E-DK • JTAG cable • DO-SPAR3E-DK-J (Japanese version) • Spartan-3 resource CD • ISE™ Foundation™ evaluation CD The Spartan-3E FPGA Starter Kit includes a full-featured devel- opment board based on the Spartan-3E platform FPGA family. • EDK evaluation CD Part numbers • DO-SPAR3-DK • DO-SPAR3-DK-J (Japanese version) For more information about the Spartan-3 FPGA Starter Kit, The Spartan-3 FPGA Starter Kit gives you instant access to the visit www.xilinx.com/s3boards/. For more information complete platform capabilities of the Xilinx Spartan-3 family. The about the Spartan-3E FPGA Starter Kit, kit brings high-volume designs to reality quicker, at a lower cost, and visit www.xilinx.com/s3eboards/. on schedule.

116 Xcell Journal Second Quarter 2005 ©Memec (MG047-05) 04.01.05 All companyandproductnames maybetrademarks oftheirrespective companies. LogosareownedbytheirproprietorsandusedMemec withpermission. All rightsreserved. Copyright 2005Memec,LLC. The UltimateProgrammable System Memec Mini-Modules: W W hat if... hat if... Spartan tiny footprintthesizeofyourthumb. the necessaryfunctionsneededforanembeddedprocessor systemontoa Memec Mini-Modulesare completesystemsonamodule, whichpackall Visit www.memec.com/xilinx-minimodules ™ -3 Mini-ModuleFeatures: 76 UserI/O SRAM andFlashMemory 10/100 EthernetMAC andPHY 400K GateSpartan-3FPGA Supports MicroBlazeProcessor Complete System-On-A-Module Small Footprint (30mmx65.5mm) The jobhasalready beendoneforyou? Xilinx FPGAontoyourcircuit board? It becameeasiertointegrateyourfavorite Virtex ™ -4 MiniModuleFeatures: 76 UserI/O DDR andFlashMemory 10/100/1000 EthernetPort Based onthe Virtex-4 FX12FPGA Supports PowerPC Processor Complete System-On-A-Module Small Footprint (30mmx65.5mm) 118 Xilinx Virtex-4™ FPGAs www.xilinx.com/devices/ Xcell Journal Xcell Journal Product Selection Matrix

Virtex-4 LX (Logic) Virtex-4 SX (Signal Processing) Virtex-4 FX (Embedded Processing & Serial Connectivity)

XC4VLX15 XC4VLX25 XC4VLX40 XC4VLX60 XC4VLX80 XC4VLX100 XC4VLX160 XC4VLX200 XC4VSX25 XC4VSX35 XC4VSX55 XC4VFX12 XC4VFX20 XC4VFX40 XC4VFX60 XC4VFX100 XC4VFX140

CLB Array (Row x Column) 64 x 24 96 x 28 128 x 36 128 x 52 160 x 56 192 x 64 192 x 88 192 x 116 64 x 40 96 x 40 128 x 48 64 x 24 64 x 36 96 x 52 128 x 52 160 x 68 192 x 84

Slices 6,144 10,752 18,432 26,624 35,840 49,152 67,584 89,088 10,240 15,360 24,576 5,472 8,544 18,624 25,280 42,176 63,168 CLB Resources Logic Cells 13,824 24,192 41,472 59,904 80,640 110,592 152,064 200,448 23,040 34,560 55,296 12,312 19,224 41,904 56,880 94,896 142,128

CLB Flip Flops 12,288 21,504 36,864 53,248 71,680 98,304 135,168 178,176 20,480 30,720 49,152 10,944 17,088 37,248 50,560 84,352 126,336

Max. Distributed RAM Bits 98,304 172,032 294,912 425,984 573,440 786,432 1,081,344 1,425,408 163,840 245,760 393,216 87,552 136,704 297,984 404,480 674,816 1,010,688 Memory Resources Block RAM/FIFO w/ECC (18 kbits each) 48 72 96 160 200 240 288 336 128 192 320 36 68 144 232 376 552 Total Block RAM (kbits) 864 1,296 1,728 2,880 3,600 4,320 5,184 6,048 2,304 3,456 5,760 648 1,224 2,592 4,176 6,768 9,936

Clock Digital Clock Managers (DCM) 4 8 8 8 12 12 12 12 48 8 4 4 8121220 Resources Phase-matched Clock Dividers (PMCD) 0444888804 4 004888

Max Select I/O™ 320 448 640 640 768 960 960 960 320 448 640 320 320 448 576 768 896

Total I/O Banks 91113131517171791113 9 9 11 13 15 17

I/O Resources Digitally Controlled Impedence Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Max Differential I/O Pairs 160 224 320 320 384 480 480 480 160 224 320 160 160 224 288 384 448

I/O Standards LDT-25, LVDS-25, LVDSEXT-25, BLVDS-25, ULVDS-25, LVPECL-25, LVCMOS25, LVCMOS18, LVCMOS15, PCI33, LVTTL, LVCMOS33, PCI-X, PCI66, GTL, GTL+, HSTL I (1.5V,1.8V), HSTL II (1.5V,1.8V), HSTL III (1.5V,1.8V), HSTL IV (1.5V,1.8V), SSTL2I, SSTL2II, SSTL18 I, SSTL18 II

DSP Resources XtremeDSP ™ Slices 32 48 64 64 80 96 96 96 128 192 512 32 32 48 128 160 192

™ Embedded PowerPC Processor Blocks ——————— ——— — 112222 Hard IP 10/100/1000 Ethernet MAC Blocks ——————— ——— — 224444 Resources RocketIO ™ Serial Transceivers ——————— ——— — 0 8 12 16 20 24

Speed Commercial (slowest to fastest) -10, -11, -12 -10, -11, -12 -10, -11, -12 -10, -11, -12 -10, -11, -12 -10, -11, -12 -10, -11, -12 -10, -11 -10, -11, -12 -10, -11, -12 -10, -11, -12 -10, -11, -12 -10, -11, -12 -10, -11, -12 -10, -11, -12 -10, -11, -12 -10, -11 Grades Industrial (slowest to fastest) -10, -11 -10, -11 -10, -11 -10, -11 -10, -11 -10, -11 -10, -11 -10 -10, -11 -10, -11 -10, -11 -10, -11 -10, -11 -10, -11 -10, -11 -10, -11 -10

Configuration Memory Bits 4,765,568 7,819,904 12,259,712 17,717,632 23,291,008 30,711,680 40,347,008 51,367,808 9,147,648 13,700,288 22,745,216 4,765,568 7,242,624 13,550,720 21,002,880 33,065,408 47,856,896

EasyPath™ Cost Reduction Solutions 4 — XCE4VLX25 XCE4VLX40 XCE4VLX60 XCE4VLX80 XCE4VLX100 XCE4VLX160 XCE4VLX200 XCE4VSX25 XCE4VSX35 XCE4VSX55 — XCE4VFX20 XCE4VFX40 XCE4VFX60 XCE4VFX100 XCE4VFX140

Package 1 Area MGT 2 Pins 4VLX15 4VLX25 4VLX40 4VLX60 4VLX80 4VLX100 4VLX160 4VLX200 4VSX25 4VSX35 4VSX55 4VFX12 4VFX20 4VFX40 4VFX60 4VFX100 4VFX140

SF363 17 x 17 mm — 240 240 240 240

FF668 27 x 27 mm — 448 320 448 448 448 320 448 320

FF1148 35 x 35 mm — 768 640 640 768 768 768 640

FF1513 40 x 40 mm — 960 960 960 960

FF672 27 x 27 mm 12 352 320 (8) 3 352 (12) 3 352 (12) 3

FF1152 35 x 35 mm 20 576 448 (12) 3 576 (16) 3 576 (20) 3

FF1517 40 x 40 mm 24 768 768 (20) 3 768 (24) 3 Second Quarter 2005 Second Quarter FF1760 42.5 x 42.5 mm 24 896 896 (24) 3

Notes: 1. SFA Packages (SF): flip-chip fine-pitch BGA (0.80 mm ball spacing). FFA Packages (FF): flip-chip fine-pitch BGA (1.00 mm ball spacing). All Virtex-4 LX and Virtex-4 SX devices available in the same package are footprint-compatible. Pb-free solutions are available. For more information about Pb-free solutions, visit www.xilinx.com/pbfree. 2. MGT: RocketIO Multi-Gigabit Transceivers. 3. Number of available RocketIO Muti-Gigabit Transceivers. 4. EasyPath solutions provide conversion-free path for volume production. Important: Verify all data in this document with the device data sheets found at http://www.xilinx.com/partinfo/databook.htm Second Quarter 2005 Second Quarter Product Selection Matrix Package Options and UserI/O1

CLB Resources Memory ResourcesDSP CLK Resources I/O Features Speed PROM Spartan-3E (1.2V) Spartan-3 (1.2V) XC3S100E XC3S250E XC3S500E XC3S1200E XC3S1600E XC3S50 XC3S200 XC3S400 XC3S1000 XC3S1500 XC3S2000 XC3S4000 XC3S5000

PinsArea2 I/O’s 108 172 232 304 376 I/O’s 124 173 264 391 487 565 712 784

PQFP Packages (PQ) – wire-bond plastic QFP (0.5 mm lead spacing)

208 30.6 x 30.6 mm 158 158 124 141 141

VQFP Packages (VQ) – very thin QFP (0.5 mm lead spacing)

100 16.0 x 16.0 mm 66 66 63 63 Block RAM (bits) Dedicated Multipliers (min/max) DCM Frequency # DCMs Impedance Digitally Controlled I/O Pairs Number of Differential Maximum I/O I/O Standards Speed Grades Commercial (slowest to fastest) Industrial Speed Grades (slowest to fastest) Memory (Bits) Configuration EasyPath System Gates (see note 1) (Row x Col) CLB Array Number of Slices Equivalent Logic Cells CLB Flip-Flops Distributed RAM Bits Max. # Block RAM TQFP Packages (TQ) – thin QFP (0.5 mm lead spacing) 144 22.0 x 22.0 mm Spartan-3E Family – 1.2 Volt 108 108 97 97 97 Chip Scale Packages (CP) – wire-bond chip-scale BGA (0.5 mm ball spacing) XC3S100E 100K 16 x 22 2,160 1920 4 72K 4 5/326 2 NO 108 0.6M 960 15K 40 Single-ended -4 -5 -4 132 8 x 8 mm 92 92 89 XC3S250E 250K 26 x 342448 5,508 489638K 12 216K 12 5/326 4 NO68 172LVTTL, LVCMOS3.3/2.5/1.8/ -4 -5 -4 1.4M 1.5/1.2, PCI 3.3V – 32/64-bit FGA Packages (FT) – wire-bond fine-pitch thin BGA (1.0 mm ball spacing) XC3S500E 500K 34 x 464656 10,476 931273K 20 360K 20 5/326 4 NO92 232-4 -5 -4 2.3M 33/66MHz, PCI-X 100MHz, 172 190 190 SSTL I 1.8/2.5, HSTL I 1.8, XC3S1200E 1200K 46 x 608672 19,512 17344136K 28 504K 28 5/326 8 NO124 304-4 -5 -4 3.8M 256 17 x 17 mm 173 173 173 HSTL III 1.8 FGA Packages (FG) – wire-bond fine-pitch BGA (1.0 mm ball spacing) XC3S1600E 1600K 58 x 7614752 33,192 29504 231K 36 648K 36 5/326 8 NO156 376 Differential -4 -5 -4 5.9M 232 250 250 LVDS2.5, Bus LVDS2.5, 320 19 x 19 mm 221 221 221 304 304 mini-LVDS, RSDS, LVPECL 400 21 x 21 mm Spartan-3 and Spartan-3L Families – 1.2 Volt (see note 2) 456 23 x 23 mm 264 333 333 333 XC3S50 50K 16 x 12 768 1,728 1,536 12K 4 72K 4 24/280 2 YES 56 124 Single-ended -4 -5 -4 .4M 484 23 x 23 mm 376 LVTTL, LVCMOS3.3/2.5/1.8/ XC3S200 200K 24 x 20 1,920 4,320 3,840 30K 12 216K 1224/280 4 YES 76 173 -4 -5 -4 1.0M 1.5/1.2, PCI 3.3V – 32/64-bit 676 27 x 27 mm 391 487 489 489 XC3S400 400K 32 x 28 3,584 8,064 7,168 56K 16 288K 1624/280 4 YES 116 26433MHz, SSTL2 Class I & II, -4 -5 -4 1.7M SSTL18 Class I, HSTL Class I, 900 31 x 31 mm 565 633 633 XC3S1000 XC3S1000L 1000K 48 x 40 7,680 17,280 15,360 120K 24 432K 2424/280 4 YES 175 391III, HSTL1.8 Class I, II & III, -4 -5 -4 3.2M 1156 35 x 35 mm 712 784 GTL, GTL+ XC3S1500 XC3S1500L 1500K 64 x 52 13,312 29,952 26,624 208K 32 576K 3224/280 4 YES 221 487 -4 -5 -4 5.2M ✔ Differential XC3S2000 2000K 80 x 64 20,480 46,080 40,960 320K 40 720K 4024/280 4 YES 270 565 -4 -5 -4 7.7M ✔ Notes: 1. Numbers in table indicate maximum number of user I/Os. LVDS2.5, Bus LVDS2.5, 2. Area dimensions for lead-frame products are inclusive of the leads. XC3S4000 XC3S4000L 4000K 96 x 72 27,648 62,208 55,296 432K 96 1,728K 9624/280 4 YES 312 712Ultra LVDS2.5, LVDS_ext2.5, -4 -5 -4 11.3M ✔ RSDS, LDT2.5, LVPECL 24/280 ✔ XC3S5000 5000K 104 x 80 33,280 74,880 66,560 520K 104 1,872K 104 4 YES 344 784 -4 -5 -4 13.3M Pb-free solutions are available. For more information about Pb-free solutions visit www.xilinx.com/pbfree.

Note: 1. System Gates include 20-30% of CLBs used as RAMs. 2. Spartan-3L devices offer reduced quiescent power consumption. Package offerings may vary slightly from those offered in the Spartan-3 family. See Package Selection Matrix for details.

Important: Verify all data in this document with the device data sheets found at http://www.xilinx.com/partinfo/databook.htm

For the latest information and product specifications on all Xilinx products, please visit the following links:

FPGA and CPLD Devices Packaging Development Reference Boards Global Services www.xilinx.com/devices/ www.xilinx.com/packaging/ www.xilinx.com/board_search/ www.xilinx.com/support/gsd/

Configuration and Storage Systems Software IP Reference Xcell Journal www.xilinx.com/configsolns/ www.xilinx.com/ise/ www.xilinx.com/ipcenter/ 119 Xilinx Spartan™-3 FPGAs www.xilinx.com/devices/ 120 Xilinx CPLD www.xilinx.com/devices/ Xcell Journal Xcell Journal

Product Selection Matrix – CoolRunner ™ Series Package Options and User I/O

I/O Features Speed Clocking

CoolRunner-II CoolRunner XPLA3

1 XCR3032XL XCR3064XL XCR3128XL XCR3256XL XCR3384XL XC2C512 XCR3512XL Pins Area XC2C32A XC2C64A XC2C128 XC2C256 XC2C384 XC2C512

QFN Packages (QFG) – quad flat no-lead (0.5 mm lead spacing)

32 5 x 5 mm 21

48 7 x 7 mm 37

System Gates Macrocells per Macrocell Terms Product Compatible Input Voltage Compatible Output Voltage Maximum I/O I/O Banking Pin-to-pin Logic Delay (ns) Min. Speed Grades Commercial (fastest to slowest) Industrial Speed Grades (fastest to slowest) IQ Speed Grade Global Clocks Clocks per Term Product Function Block PLCC Packages (PC) – wire-bond plastic chip carrier (1.27 mm lead spacing) CoolRunner-II Family – 1.8 Volt 44 17.5 x 17.5 mm 33 33 36 36 XC2C32A 750 32 40 1.5/1.8/2.5/3.3 1.5/1.8/2.5/3.3 33 2 3.8 -4 -6 -6 -6 3 17 PQFP Packages (PQ) – wire-bond plastic QFP (0.5 mm lead spacing) 208 30.6 x 30.6 mm 173 173 173 164 172 180 XC2C64A 1,500 64 40 1.5/1.8/2.5/3.3 1.5/1.8/2.5/3.3 64 2 4.6 -5 -7 -7-7 3 17 VQFP Packages (VQ) – very thin QFP (0.5 mm lead spacing) XC2C128 3,000 128 40 1.5/1.8/2.5/3.3 1.5/1.8/2.5/3.3 100 2 5.7 -6 -7 -7-7 3 17 44 12.0 x 12.0 mm 33 3336 36

XC2C256 6,000 256 40 1.5/1.8/2.5/3.3 1.5/1.8/2.5/3.3 184 2 5.7 -6 -7 -7-7 3 17 100 16.0 x 16.0 mm 64 80 80 68 84

XC2C384 9,000 384 40 1.5/1.8/2.5/3.3 1.5/1.8/2.5/3.3 240 4 7.1 -7 -10 -10-10 3 17 TQFP Packages (TQ) – thin QFP (0.5 mm lead spacing)

XC2C512 12,000 512 40 1.5/1.8/2.5/3.3 1.5/1.8/2.5/3.3 270 4 7.1 -7 -10 -10-10 3 17 144 22.0 x 22.0 mm 100 118 118 108 120 118* CoolRunner XPLA3 Family – 3.3 Volt Chip Scale Packages (CP) – wire-bond chip-scale BGA (0.5 mm ball spacing) 56 6 x 6 mm 33 45 48 XCR3032XL 750 32 48 3.3/5 3.3 36 5 -5 -7 -10 -7 -10 -10 4 16 132 8 x 8 mm 100 106 XCR3064XL 1,500 64 48 3.3/5 3.3 68 6 -6 -7 -10 -7 -10-10 4 16 Chip Scale Packages (CS) – wire-bond chip-scale BGA (0.8 mm ball spacing)

XCR3128XL 3,000 128 48 3.3/5 3.3 108 6 -6 -7 -10 -7 -10-10 4 16 48 7 x 7 mm 36 40 XCR3256XL 6,000 256 48 3.3/5 3.3 164 7.5 -7 -10 -12 -10 -12-12 4 16 144 12 x 12 mm 108 280 16 x 16 mm 164 XCR3384XL 9,000 384 48 3.3/5 3.3 220 7.5 -7 -10 -12 -10 -12-12 4 16 FGA Packages (FT) – wire-bond fine-pitch thin BGA (1.0 mm ball spacing) XCR3512XL 12,000 512 48 3.3/5 3.3 260 7.5 -7 -10 -12 -10 -12-12 4 16 256 17 x 17 mm 184 212 212 164 212 212

FBGA Packages (FG) – wire-bond fine-line BGA (1.0 mm ball spacing)

324 23 x 23 mm 240 270 220 260

* JTAG pins and port enable are not pin compatible in this package for this member of the family. Note 1: Area dimensions for lead-frame products are inclusive of the leads. Second Quarter 2005 Second Quarter Second Quarter 2005 Second Quarter Product Selection Matrix – 9500 Series Package Options and User I/O

I/O Features Speed Clocking

XC9500XV XC9500XL XC9500

1 XC9536 XC9572 XC95108 XC95144 XC95216 XC95288 Pins Area XC9536XV XC9572XV XC95144XV XC95288XV XC9536XL XC9572XL XC95144XL XC95288XL

PLCC Packages (PC) – wire-bond plastic chip carrier (1.27 mm lead spacing)

44 17.5 x 17.5 mm 34 34 34 34 34 34

84 30.2 x 30.2 mm 69 69 Commercial Speed Grades Commercial (fastest to slowest) Industrial Speed Grades (fastest to slowest) IQ Speed Grade Global Clocks Clocks per Term Product Function Block System Gates Macrocells per Macrocell Terms Product Compatible Input Voltage Compatible Output Voltage Maximum I/O I/O Banking Logic Delay (ns) Pin-to-pin Min. PQFP Packages (PQ) – wire-bond plastic QFP (0.5 mm lead spacing) XC9500XV Family – 2.5 Volt 100 23.3 x 17.2 mm 72 81 81 160 31.2 x 31.2 mm 108 133 133 XC9536XV 800 36 90 2.5/3.3 1.8/2.5/3.3 36 1 5 -5 -7 -7 NA 3 18 208 30.6 x 30.6 mm 168 168 166 168 XC9572XV 1,600 72 90 2.5/3.3 1.8/2.5/3.3 721 5-5 -7 -7 NA 3 18 VQFP Packages (VQ) – very thin TQFP (0.5 mm lead spacing) XC95144XV 3,200 144 90 2.5/3.3 1.8/2.5/3.3 1172 5-5 -7 -7 NA 3 18 44 12.0 x 12.0 mm 34 3434 34 34 XC95288XV 6,400 288 90 2.5/3.3 1.8/2.5/3.3 1924 6 -6 -7 -10 -7 -10 NA 3 18 64 12.0 x 12.0 mm 36 52

XC9500XL Family – 3.3 Volt TQFP Packages (TQ) – thin QFP (0.5 mm lead spacing) XC9536XL 800 36 90 2.5/3.3/5 2.5/3.3 36 5 -5 -7 -10 -7 -10 -10 3 18 100 16.0 x 16.0 mm 72 8172 81 72 81 81 144 22.0 x 22.0 mm 117 117 117 117 XC9572XL 1,600 72 90 2.5/3.3/5 2.5/3.3 72 5-5 -7 -10 -7 -10-10 3 18 Chip Scale Packages (CS) – wire-bond chip-scale BGA (0.8 mm ball spacing) XC95144XL 3,200 144 90 2.5/3.3/5 2.5/3.3 117 5-5 -7 -10 -7 -10NA 3 18 48 7 x 7 mm 3638 36 38 34 XC95288XL 6,400 288 90 2.5/3.3/5 2.5/3.3 192 6 -6 -7 -10 -7 -10 NA 3 18 144 12 x 12 mm 117 117 XC9500 Family – 5 Volt 280 16 x 16 mm 192 192 BGA Packages (BG) – wire-bond standard BGA (1.27 mm ball spacing) XC9536 800 36 90 5 5 36 10 -5 -6 -10 -15 -7 -10 -15 -15 3 18 256 27 x 27 mm 192 XC9572 1,600 72 90 5 5 72 10-7 -10 -15 -10 -15-15 3 18 352 35.0 x 35.0 mm 166 192 XC95108 2,400 108 90 5 5 108 10-7 -10 -15 -20 -7 -10 -15 -20NA 3 18 FBGA Packages (FG) – wire-bond Fine-line BGA (1.0 mm ball spacing) XC95144 3,200 144 90 5 5 133 10 -7 -10 -15 -10 -15 NA 3 18 256 17 x 17 mm 192 192 XC95216 4,800 216 90 55166 10 -10 -15 -20 -10 -15 -20 NA 3 18 Note 1: Area dimensions for lead-frame products are inclusive of the leads. XC95288 6,400 288 90 55192 10 -10 -15 -20 -15 -20 NA 3 18

Pb-free solutions are available. For more information about Pb-free solutions visit www.xilinx.com/pbfree Xcell Journal 121 Xilinx CPLD www.xilinx.com/devices/

PN 0010864