PROGRAMMABLE SYSTEMS SOLUTIONS

Issue 61, Third Quarter 2007

Xcell Journal

SOLUTIONS FOR A PROGRAMMABLE WORLD

Inside Look: Connectivity Illuminated

INSIDE

PCI Express and FPGAs

Reducing CPU Load for Applications

A High-Speed Serial Connectivity Solution with Aurora IP

Xilinx FPGAs Adapt to Ever-Changing Broadcast Video Landscape

Making the Most of MOST Control Messaging


www.xilinx.com/xcell/ Support Across The Board.™

Design Kits Fuel Feature-Rich Applications

Avnet Electronics Marketing designs, manufactures, sells and supports a wide variety of hardware evaluation, development and reference design kits for developers looking to get a quick start on a new project.

With a focus on embedded processing, communications and networking applications, this growing set of modular hardware kits allows users to evaluate, experiment, benchmark, prototype, test and even deploy complete designs for field trial.

By providing a stable hardware platform that enhances system development, design kits from Avnet Electronics Marketing help original equipment manufacturers (OEMs) bring differentiated products to market quickly and in the most cost-efficient way possible.

Build your own system by mixing and matching: processors, FPGAs, memory, networking, audio, video, mass storage interfaces, and high-speed serial interfaces. Available add-ons: software, firmware, drivers, and third-party development tools.

For a complete listing of available boards, visit www.em.avnet.com/drc

Enabling success from the center of technology™

1 800 332 8638 em.avnet.com

© Avnet, Inc. 2007. All rights reserved. AVNET is a registered trademark of Avnet, Inc.

Innovation Leadership above all…

Xilinx brings your biggest ideas to reality:

Virtex™-5 FPGAs — Highest Performance. With multiple platforms optimized for logic, serial connectivity, DSP, and embedded processing, Virtex-5 FPGAs lead the industry in performance and density.

Spartan™-3 Generation FPGAs — Lowest Cost. A unique balance of features and price for high-volume applications. Multiple platforms allow you to choose the lowest cost device to fit your specific needs.

CoolRunner™-II CPLDs — Lowest Power. Unbeatable for low-power and handheld applications, CoolRunner-II CPLDs deliver more for the money than any other competitive device.

ISE™ Software — Ease-of-Design. With SmartCompile™ technology, users can achieve up to 6X faster runtimes, while preserving timing and implementation. Highest performance in the fastest time — the #1 choice of designers worldwide.

Get started quickly with easy-to-use kits — order online at www.xilinx.com/getstarted. Visit our website today, and find out why Xilinx products are world renowned for leadership . . . above all.

www.xilinx.com

At the Heart of Innovation

©2007 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

LETTER FROM THE PUBLISHER

Can You Hear Me Now?

We’ve all experienced it with our mobile phones: poor voice quality, dropped calls, not enough “bars” to connect. Although inconvenient, the problem usually goes away by moving around or waiting a few minutes before trying your call again. But if you’re a system designer, connectivity problems launch a multi-dimensional conversation that’s quite a bit more complicated to solve.

Let’s start with your target market. Is your product a solution for the aerospace/defense, automotive, broadcast, consumer, server/storage, industrial/scientific/medical, wired, or wireless market? What are the design challenges in that market, such as signal integrity, power management, area, and performance? In addition to these decisions, you must select the proper protocol for the unique components of your system hierarchy, such as the ubiquitous PCI Express at one end and Gigabit Ethernet at the other.

Connectivity, in its myriad of forms, is an increasingly important factor in successful system design. Whether designing chip-to-chip, board-to-board, or box-to-box, Xilinx has a connectivity solution that helps you differentiate your product within your target market, bring it to market as early as possible and at the lowest cost. With our connectivity hard blocks, you get the benefits of an ASSP as well as the flexibility of an FPGA.

This issue of Xcell Journal focuses on connectivity and brings light to some of the key issues, as well as presenting implementation examples. Additionally, Xilinx offers a variety of resources to help with your connectivity design challenges. Here are a few connectivity resources that may help.

Connectivity Central (www.xilinx.com/products/design_resources/conn_central/)
Xilinx provides end-to-end connectivity solutions supporting serial and parallel protocols. Design for success with Virtex™ and Spartan™ series FPGAs, IP cores, tools, and development kits.

Virtex-5 LXT FPGA Development Kit for PCI Express (www.xilinx.com/xlnx/xebiz/designResources/ip_product_details.jsp?key=HW-V5-ML555-G)
The Virtex-5 LXT FPGA Development Kit for PCI Express supports PCIe/PCI-X/PCI. This complete development kit passed PCI-SIG compliance for PCI Express v1.1 and enables you to rapidly create and evaluate designs using PCI Express, PCI-X, and PCI interfaces.

Spartan-3 PCI Express Starter Kit (www.xilinx.com/onlinestore/spartan_boards.htm)
This complete development board solution gives you instant access to the capabilities of the Spartan-3 family and the Xilinx® PCI Express core.

IP Evaluation (www.xilinx.com/ipcenter/ipevaluation/index.htm)
Take advantage of Xilinx evaluation IP to “try before you buy.” Visit the Xilinx IP Evaluation Lounge and download the evaluation libraries using the links under the “Related Products” tab on the right side of the page. The IP on this site is covered by the Xilinx LogiCORE™ Evaluation License Agreement.

On-Site Help (www.xilinx.com/xlnx/xebiz/designResources/ip_product_details.jsp?key=xgs_ss_titanium)
For help with your connectivity designs, Titanium Technical Service provides a dedicated application engineer, either on-site or at Xilinx, who can help with system architecture optimization, tool coaching, and back-end optimization.

Forrest Couch
Publisher

Xcell Journal
PUBLISHER Forrest Couch, [email protected], 408-879-5270
EDITOR Charmaine Cooper Hussain
ART DIRECTOR Scott Blair
DESIGN/PRODUCTION Teie, Gelwicks & Associates, 1-800-493-5551
ADVERTISING SALES Dan Teie, 1-800-493-5551
TECHNICAL COORDINATOR Alex Goldhammer
INTERNATIONAL Dickson Seow, Asia Pacific, [email protected]; Andrea Barnard, Europe/Middle East/Africa, [email protected]; Yumi Homura, Japan, [email protected]
SUBSCRIPTIONS All Inquiries, www.xcellpublications.com
REPRINT ORDERS 1-800-493-5551
Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124-3400, Phone: 408-559-7778, FAX: 408-879-4780
www.xilinx.com/xcell/

© 2007 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners. The articles, information, and other materials included in this issue are provided solely for the convenience of our readers. Xilinx makes no warranties, express, implied, statutory, or otherwise, and accepts no liability with respect to any such articles, information, or other materials or their use, and any use thereof is solely at the risk of the user. Any person or entity using such information in any way releases and waives any claim it might have against Xilinx for any loss, damage, or expense caused thereby.

ON THE COVER

CONNECTIVITY — page 20
Reducing CPU Load for Ethernet Applications
A TOE makes 10 Gigabit Ethernet possible.

CONNECTIVITY — page 31
Xilinx FPGAs Adapt to Ever-Changing Broadcast Video Landscape
New digital broadcast standards and applications are addressed by advanced silicon, software, and free reference designs available with Xilinx FPGAs.

CONNECTIVITY — page 26
A High-Speed Serial Connectivity Solution with Aurora IP
Aurora is a highly scalable protocol for applications requiring point-to-point connectivity.

INTELLECTUAL PROPERTY — page 53
Making the Most of MOST Control Messaging
This case study presents the design of Xilinx LogiCORE MOST NIC control message processing.

Page 14
PCI Express and FPGAs
Why FPGAs are the best platform for building PCI Express endpoint devices.

THIRD QUARTER 2007, ISSUE 61 — Xcell Journal

VIEWPOINTS Letter from the Publisher...... 5 Selecting the Right Interconnect ...... 8

FEATURES Connectivity Scaling Chip-to-Chip Interconnect Made Simple...... 11 PCI Express and FPGAs ...... 14 Virtex-5 FPGA Techniques for High-Performance Data Converters...... 18 Reducing CPU Load for Ethernet Applications ...... 20 Automated MGT Serial Link Tuning Ensures Design Margins ...... 23 A High-Speed Serial Connectivity Solution with Aurora IP...... 26 Xilinx FPGAs Adapt to Ever-Changing Broadcast Video Landscape ...... 31 Serial RapidIO Connectivity Enhances DSP Co-Processing ...... 36 The NXP/PLDA Programmable PCI Express Solution ...... 42

Memory Interface Create Memory Interface Designs Faster with Xilinx Solutions ...... 47

Intellectual Property Driving Home Multimedia ...... 50 Making the Most of MOST Control Messaging ...... 53 Leveraging HyperTransport on Xilinx FPGAs ...... 56

GENERAL FPGA-Based Simulation for Rapid Prototyping...... 60

RESOURCES Featured Connectivity Application Notes ...... 62 Connectivity Boards and Kits ...... 63 The Connectivity Curriculum Path ...... 64

XPECTATIONS Developing Technical Leaders through Knowledge Communities ...... 66

VIEWPOINT

Selecting the Right Interconnect
Because interconnect characteristics vary, define your requirements clearly before selecting the appropriate interconnect.

by Jag Bolaria Sr. Analyst The Linley Group [email protected]

Interconnects have evolved from parallel to serial and increased in complexity to enable communications with greater efficiency and less congestion or hot spots. In addition to connecting endpoints, modern interconnects define comprehensive protocols for moving data efficiently across a network of endpoints. Thus, the networks and endpoints that must be interconnected often drive the requirements for an interconnect. These requirements include data rate, latency, lossy or lossless links, scalability, and redundancy. These requirements then drive the selection of the appropriate interconnects for a specific network.

The existing ecosystem for an interconnect technology is another important factor in selection. A good ecosystem can help reduce development cost and time to market. In this article, I’ll look at some of the leading interconnects and position those for specific applications or market segments. The Linley Group’s report on High-Speed Interconnects, available at www.linleygroup.com, provides more details on various interconnects and the leading products for each.

PCIe and Ethernet
Our research shows that the leading interconnects are driven by large-volume platforms. The economies of scale from larger volume platforms ensure low-cost building blocks and broad availability. Additionally, large deployments lead to field-proven technologies that can be applied in other platforms with minimal risk.

Two of the largest platforms are PCs and networking equipment. The PC platform drives PCI Express (PCIe) and Ethernet, while networking equipment drives only Ethernet. But because these interconnects were developed for specific applications, they are not a natural fit in many other markets. The semiconductor industry and system vendors are evolving these interconnects to meet requirements for new applications.

For example, PCIe scalability has evolved to support greater data rates and more lanes. With IOV (I/O virtualization), PCIe is evolving to support virtualization, which enables its deployment in storage systems and blade servers.

With IEEE 802.3ar and BCN (backward congestion notification), Ethernet enhancements include better flow control, congestion management, and attempts to address its inherently lossy nature. These enhancements will strengthen Ethernet applicability for storage systems, data centers, and backplanes.

Although Ethernet and PCIe are now suitable for more applications, they still fall short in meeting the technical and business requirements for all systems. Blade servers, for example, use a combination of Ethernet and Fibre Channel (FC). Although OEMs may want to consolidate these fabrics, end users have a large investment in FC and want support for that now and in the future.

PCIe and Ethernet also fall short in meeting the scalability, latency, and lossless requirements of high-performance computing (HPC) applications. HPC uses a specialized interconnect such as InfiniBand, which provides better latency and scalability. In this case, OEMs will need flexible interconnect solutions to enable common platforms and thus service different user requirements.


Both dominant and specialized interconnects will continue to evolve to support greater data rates, reduced latency, and better scalability.

Everyone Else
Endpoints and specific system requirements often drive the development of specialized interconnects. RapidIO is one such example. Steered by system and chip vendors, RapidIO has evolved to address the unique requirements of the wireless infrastructure. It enables distributed computing on line cards and networking/wireless infrastructure systems better than most competing interconnects. RapidIO is also integrated on DSPs from Texas Instruments and PowerPC CPUs from Freescale.

Because base stations use farms of DSPs, it is an easy decision to use RapidIO as an interconnect in these applications. Over time, we expect RapidIO to expand to other platforms that perform digital signal processing on multiple data streams.

Examples of other specialized interconnects include XFI, SFI, XAUI, SPAUI, Interlaken, SPI-S, and KR. These interconnects were developed to address the very specific low-level requirements of each application. Although addressing all interconnects is beyond the scope of this article, let’s look at a few to highlight the problems each solves and its impact in systems.

XFI and SFI are used to connect optical modules at 10 Gbps. At these data rates, the major challenge is signal conditioning, including electronic dispersion compensation for the fiber and equalization for the board traces and connectors. These requirements drive specialized components designed specifically for the characteristics of the channel through which the signal travels.

Because data at these rates may be channelized – that is, include multiple streams on a single physical link – it becomes important to add traffic management. Specifications such as Interlaken, SPI-S, and SPAUI address high data rates as well as traffic management. Because no single standard exists, we believe system designers need to design in solutions that provide flexibility in meeting current and future requirements.

The combination of 10 Gbps rates on the network and multiport line cards drives the need for greater bandwidth and therefore greater data rates over the backplane. The IEEE 802.3ap addresses this with its 10GBase-KR specification, which defines 10-Gbps serial links. In addition to equalization and pre-emphasis, it may be necessary to include forward error correction for acceptable performance over a couple of connectors and up to 40-inch traces common in backplanes. Additionally, these systems may need compatibility to older line cards, driving the need for backplane operation at 1 Gbps or 3.125 Gbps. Again, a flexible solution is critical to meet the system requirements.

Conclusion
There are many different applications for interconnects and many interconnect choices for system designers. We expect PCIe and Ethernet to be the dominant interconnects. These will be used in servers, networking, storage systems, wireless networks, and many other systems. There is, however, no one interconnect (or two) that can meet the requirements of all systems. Therefore, the industry has developed and will continue to support specialized interconnects for different applications.

Both dominant and specialized interconnects will continue to evolve to support greater data rates, reduced latency, and better scalability. Additionally, systems will need to support legacy cards.

We recommend that system designers select the best interconnect and design in flexibility to cover different interconnects and the evolving changes in each. FPGAs play a critical role in offering system designers this flexibility and supporting the broad interconnect landscape.

Xilinx Events
Xilinx participates in numerous trade shows and events throughout the year. This is a perfect opportunity to meet our silicon and software experts, ask questions, see demonstrations of new products and technologies, and hear other customers’ success stories with Xilinx products. For more information and the most up-to-date schedule, visit www.xilinx.com/events.

North America
July 23-27 – Nuclear and Space Radiation Effects Conference 2007, Honolulu, HI
August 7-9 – NI Week 2007, Austin, TX
September 18-20 – Intel Developer Forum, San Francisco, CA
October 29-31 – Military Communications Conference 2007, Orlando, FL
November 5-9 – Software Defined Radio Forum 2007, Denver, CO

Europe, Middle East, and Africa
August 27-29 – International Conference on Field Programmable Logic and Applications, Amsterdam, Netherlands
September 6-11 – IBC, Amsterdam, Netherlands

Asia Pacific
August 7-10 – Agilent Digital Measurement Forum, Taiwan

Logic analyzers up to 50% off

A special limited-time offer from Agilent.

Now you can see inside your FPGA designs in a way that will save weeks of development time.

The FPGA dynamic probe, when combined with an Agilent Windows®-based logic analyzer, allows you to access different groups of signals inside your FPGA for debug—without requiring design changes. You’ll increase visibility into internal FPGA activity by gaining access up to 128 internal signals with each debug pin.

Our 16800 Series logic analyzers offer you unprecedented price-performance in a portable family, with up to 204 channels, 32 M memory depth and a pattern generator available. And now for a limited time, you can receive up to 50% off our newest 16901A modular logic analyzer mainframe when you purchase eligible measurement modules. Offer valid February 1, 2007 through August 15, 2007. Reference promotion 5.564.

Agilent portable and modular logic analyzers
• Increased visibility with FPGA dynamic probe
• Customized protocol analysis with Agilent's exclusive packet viewer software
• Low-cost embedded PCI Express packet analysis
• Pricing starts at $9,450

www.agilent.com/find/logic-offer

© Agilent Technologies, Inc. 2006. Windows is a U.S. registered trademark of Microsoft Corporation.

CONNECTIVITY

Scaling Chip-to-Chip Interconnect Made Simple
Sarance’s high-performance Interlaken IP cores connect devices at up to 50 Gbps.

by Farhad Shafai
Vice President, R&D
Sarance Technologies Inc.
[email protected]

Kelvin Spencer
Senior Design Engineer
Sarance Technologies Inc.
[email protected]

As the world gets connected, the demand for bandwidth continues to increase. The interconnect technology for communication systems must not only connect devices today, but also provide a roadmap for the future. Traditional solutions, such as XAUI or SPI-4.2, cannot scale beyond 10 Gbps. SPI-4.2 uses a low-speed parallel bus, which requires a large number of pins to transfer 10 Gbps of data. XAUI does not have any provisions for channelizing the packet stream, making it unsuitable for applications that differentiate between packets. Several attempts have been made to build on XAUI and SPI-4.2; all derivatives, however, suffer from the inherent limitations of the solutions on which they are based, and are therefore not optimal.

Interlaken is a new chip-to-chip, channelized packet interface protocol developed by Cisco Systems and Cortina Systems. It is based on SERDES technology and provides a framework for an efficient and robust chip-to-chip packet interface that easily scales from 10 Gbps to 50 Gbps and beyond. Several applications for Interlaken are shown in Figure 1. Simply put, you can use Interlaken as the connectivity solution for all devices in a typical network communication line card. The devices include front-end aggregation ICs or FPGAs, network processors, and traffic managers. Additionally, you can use translation FPGAs to bridge between legacy and modern devices that have Interlaken interfaces.

Sarance Technologies has developed a suite of Interlaken IP cores (IIPC) targeted at Xilinx® Virtex™-5 FPGAs. The IIPC family of cores is a highly optimized implementation of Interlaken that takes advantage of the advanced features of the Virtex-5 device. The IIPC abstracts all of the details of Interlaken and provides a very simple and straightforward interface. All members of the IIPC family use the same user-side protocol and programming interface, greatly simplifying performance scaling: using a 10-Gbps IIPC core is no different than using a 50-Gbps IIPC.

Figure 1 – Typical applications (Interlaken links between port aggregation FPGAs/ICs, NPUs, and traffic managers, with a translation FPGA bridging legacy SPI-4.2 devices)


Interlaken Basics
Interlaken is a narrow, high-speed channelized chip-to-chip interface. For simplicity’s sake, we will not discuss the finer details of the protocol. At a high level, the basic concepts of the protocol include:

• Support for 256 logic channels
• Data scrambling and 64B/67B data encoding to ensure proper DC balance
• Segmentation of data into bursts delineated by control words
• CRC24 protection for each data burst
• CRC32 protection for each lane
• Protocol independence from the number of SERDES lanes and SERDES rate
• Support for in-band and out-of-band flow-control mechanisms
• Lane diagnostics and lane decommissioning

A typical chip-to-chip implementation is shown in Figure 2. The packet data is striped across any number of high-speed serial lanes by the transmitting device and then reassembled by the receiving device.

Figure 2 – A typical chip-to-chip implementation (only the unidirectional link is shown): the Interlaken transmitter in Device A stripes the incoming packet stream across lanes 0 through n, and the Interlaken receiver in Device B reassembles the outgoing packet stream from the n lanes.

The protocol is independent from the number of SERDES lanes and the SERDES rate, which makes the performance proportional to the number of SERDES lanes. As an example, consider a 10-Gbps system. Using four multi-gigabit transceivers (MGTs) running at 3.125 Gbps, we can build an interface with a total raw bandwidth of 12.5 Gbps that has enough headroom for protocol overhead and can transmit 10 Gbps of real payload. Scaling the interface up to 20G simply requires doubling the number of MGTs to eight; the bandwidth scales accordingly.

Sarance’s IIPC Family
The IIPC is a highly optimized implementation of Interlaken and, remaining true to the intent of Interlaken, is architected to offer the same flexibility and scalability offered by the protocol. IIPC can stripe and reassemble data across any number of SERDES lanes running at any SERDES rate and has support for as many as 256 channels. It is fully compliant with revision 1.1 of the Interlaken specification document and is hardware-proven to be interoperable with several ASIC implementations of Interlaken.

Table 1 lists the device utilization and implementation details for three different IIPC cores targeted to Virtex-5 LXT FPGAs.

Bandwidth    MGT Lanes   MGT Rate     Logic LUTs   Block RAMs
12.5 Gbps    4           3.125 Gbps   9,000        5
25 Gbps      8           3.125 Gbps   18,000       5
50 Gbps      16          3.125 Gbps   32,500       5

Table 1 – Configuration and Virtex-5 LXT resource utilization

Figure 3 shows the block diagram of the IIPC. At the time of this writing, IIPC is capable of supporting as much as 50 Gbps of raw bandwidth. The IIPC is divided into two major functional partitions:

• Lane logic that is replicated for each SERDES lane
• Striping and reassembly logic and the user-side interface

Figure 3 – Block diagram (SERDES is on the left; the user-side interface is on the right). Each lane block contains CRC32, scrambler/descrambler, and gearbox logic; the shared striping, reassembly, CRC24, and protocol-management logic presents the user-side signals (clkout, dataout, mtyout, chanout, enaout/sopout/eopout/errout, rdyout, plus the corresponding clkin/datain/mtyin/chanin/enain/sopin/eopin/errin inputs) and a configuration and status bus.

Lane-logic circuitry is the dominant portion of the overall utilization of the IIPC, since it is replicated for each lane. Each lane has a gear box, CRC32, and a descrambler/scrambler module. All of these functions have traditionally been very expensive in FPGA technology. Our implementation of these functions, however, takes full advantage of the Virtex-5 device’s six-input LUTs in a way that makes the circuits very compact and efficient.

The striping and reassembly logic performs the required MUXing to transfer data between the user-side interface and the lane circuitry and handles link-level functions. Although the striping and reassembly logic is relatively smaller in terms of area, it has to process the total bandwidth of the interface and therefore is the most timing-critical part. Specifically, the CRC24 function has to potentially process 50 Gbps of data. Again, we have made the most of the FPGA’s hardware features to come up with a very efficient and high-performance implementation of the CRC24 function.

User-Side Interface
To help ease the integration of the IIPC with other logic in the FPGA, we have implemented a very simple and straightforward user-side interface. The bus protocol for transferring packet data to and from the IIPC is similar to the familiar SPI-like bus protocols that are commonly used in the industry. The configuration interface comprises a set of configuration input signals and another set of status output signals that can be easily connected to any processor interface. The status signals monitor the status of the link and identify possible configuration or transmission errors.

One key feature of our user-side interface is that it can be set to be identical for the entire IIPC family. This feature allows you to implement the same user-side logic in all of your designs, independent of the configuration or bandwidth of the Interlaken interface. You can build a 10G design today knowing that your configuration software and FPGA architecture do not change when the design is scaled to 20G, 40G, and beyond. Even if you decide to change the SERDES rate or number of lanes, the user-side interface is still not affected. IIPC provides you with a solution for today – and for the future – in a single, highly optimized package.

Ease of Use
Using the IIPC is as simple as powering up the FPGA, setting the configuration registers, resetting the core, and waiting for the core to signal that it is ready. The IIPC will automatically communicate with the other device and, when link integrity is established, will set a status signal. All you have to do is monitor the status signal and start sending packets when it is asserted.

The IIPC handles all of the details of Interlaken, including automatic word and lane alignment and automatic scrambler/descrambler synchronization. In addition, the IIPC performs full protocol checking and error handling. It recovers from all error conditions (any number of bit errors are properly detected and appropriately handled) and will never violate the user-side protocol.

Conclusion
Interlaken is the future of chip-to-chip packet interfaces. It combines the benefits of the latest SERDES technology and a simple yet robust protocol layer to define a flexible and scalable interconnect technology. Sarance Technologies’s IIPC is an optimized implementation of the Interlaken revision 1.1 specification targeted for the Virtex-5 FPGA.

Our core is hardware-proven to interoperate with several ASIC implementations of Interlaken and can support up to 50 Gbps of raw bandwidth. Our roadmap has us increasing the bandwidth to 120 Gbps in the very near future. For the latest updates and information about our interactive demonstration platform, e-mail [email protected].
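As a quick aside on the lane-scaling arithmetic from the Interlaken Basics section above, the short Python sketch below (illustrative only, not part of the IIPC deliverables) multiplies lane count by MGT rate and applies the 64B/67B line coding; the remaining gap between the coded rate and the roughly 10-Gbps payload figure quoted in the example is burst and control-word overhead, which depends on configuration.

```python
# Illustrative arithmetic only; numbers follow the example in the article.
MGT_RATE_GBPS = 3.125
ENCODING = 64 / 67          # Interlaken 64B/67B line coding

for lanes in (4, 8, 16):
    raw = lanes * MGT_RATE_GBPS
    after_coding = raw * ENCODING
    print(f"{lanes:2d} lanes: {raw:5.2f} Gbps raw, "
          f"about {after_coding:5.2f} Gbps after 64B/67B coding")
```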

by Alex Goldhammer Technical Marketing Manager, Platform Solutions Xilinx, Inc. [email protected]

PCI Express and FPGAs
Why FPGAs are the best platform for building PCI Express endpoint devices.

PCI Express is a high-speed serial I/O interconnect scheme that employs a clock data recovery (CDR) technique. The PCI Express Gen1 specification defines a line rate of 2.5 Gbps per lane, allowing you to build applications that have a throughput of 2 Gbps (after 8B/10B encoding) for a single-lane (x1) link to 64 Gbps for 32 lanes. This allows a significant reduction in pin count while maintaining or improving throughput. It also reduces the size of the PCB, the number of traces and layers, and simplifies layout and design. Fewer pins also translate to reduced noise and electromagnetic interference (EMI). CDR eliminates the clock-to-data skew problem prevalent in wide parallel buses, making interconnect implementations easier.

The PCI Express interconnect architecture is primarily specified for PC-based (desktop/laptop) systems. But just like PCI, PCI Express is also quickly moving into other system types, such as embedded systems. It defines three types of devices: root complex, switch, and endpoint (Figure 1). The CPU, system memory, and graphics controller connect to a root complex, which is roughly equivalent to a PCI host. Because of PCI Express’ point-to-point nature, switch devices are necessary to expand the number of system functions. PCI Express switch devices connect a root complex device on the upstream side to endpoints on the downstream side.

Endpoint functionality is similar to a PCI/PCI-X device. Some of the most common endpoint devices are Ethernet controllers or storage HBAs (host-bus adapters). Because FPGAs are most frequently used for data processing and bridging functions, the largest target function for FPGAs is endpoints. FPGA implementations are ideally suited for video, medical imaging, industrial, test and measurement, data acquisition, and storage applications.

The PCI Express specification maintained by the PCI-SIG mandates that every PCI Express device use three distinct protocol layers: physical, data-link, and transaction.
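As a rough sanity check of the Gen1 numbers quoted above (not taken from the article itself), the short Python sketch below derives per-direction throughput from the 2.5-Gbps line rate and the 80% payload fraction left by 8B/10B encoding.

```python
# Gen1 PCI Express: 2.5 Gbps line rate per lane, 8B/10B leaves 8 of every
# 10 line bits for data, so x1 carries 2 Gbps and x32 carries 64 Gbps.
GEN1_LINE_RATE_GBPS = 2.5
ENCODING_EFFICIENCY = 8 / 10

for lanes in (1, 2, 4, 8, 16, 32):
    throughput = GEN1_LINE_RATE_GBPS * ENCODING_EFFICIENCY * lanes
    print(f"x{lanes:<2} link: {throughput:5.1f} Gbps per direction")
```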


You can build a PCI Express endpoint using a single- or two-chip solution. For example, you can use a low-cost FPGA such as a Xilinx® Spartan™-3 device for building the data-link and transaction layers with a commercially available discrete PCI Express PHY (Figure 2). This option is best suited for x1 lane applications such as bus controllers, data-acquisition cards, and performance-boosting PCI 32/33 devices. Or you can use a single-chip solution such as Virtex™-5 LXT or SXT FPGAs, which have an integrated PCI Express PHY. This option is best for communications or high-definition audio/video endpoint devices (Figure 3) that require the higher performance of x4 (8-Gbps throughput) or x8 (16-Gbps throughput) links.

Figure 1 – PCI Express topology (CPU, memory, and x16 graphics attach to the root complex; switches fan out to x1, x4, and x8 endpoints and a PCI bridge)

Before selecting a technology for implementing a PCI Express design, you must carefully consider the choice of IP, link efficiency, compliance testing, and availability of resources for the application. In this article, I’ll review the factors for building single-chip x4- and x8-lane PCI Express FPGA designs with the latest FPGA technology.

Figure 2 – Spartan-3 FPGA-based data-acquisition card (sensors feed the FPGA over a proprietary LVDS interface; a control processor and a discrete PHY with a PIPE interface connect the card to a x1 PCI Express connector on the PC motherboard)

Choice of IP
As a designer, you can choose to build your own soft IP or buy IP from either a third party or an FPGA vendor. The challenge of building your own IP is that not only do you have to create the design from scratch, you also have to worry about verification, validation, compliance, and hardware evaluation. IP purchased from a third party or FPGA vendor will have gone through all of the rigors of compliance testing and hardware evaluation, making it a lower-risk choice. When working with a commercially available, proven, compliant PCI Express interface, you can focus on the most value-added part of the design: the user application.

The challenge of using soft IP is the availability of resources for the application. As the PCI Express MAC, data-link, and transaction layers in soft IP cores are implemented using programmable fabric, you must pay special attention to the number of remaining block RAMs, look-up tables, and fabric resources.

Figure 3 – Virtex-5 LXT FPGA-based video application (an HD-SDI source, CPU, and DDR2 memory surround the FPGA, which connects over x8 PCI Express)


Another option is to use an FPGA with the latest technology. The Virtex-5 LXT and SXT have an integrated x8-lane PCI Express controller implemented in dedicated gates (Figure 4). This type of implementation is very advantageous, as it requires a minimal number of FPGA logic resources because the design is implemented in hard silicon. For example, in the Virtex-5 LXT FPGA, an x8-lane soft IP core can consume up to 10,000 logic cells, while the hard implementation will need about 500 logic cells, mostly for interfacing. This resource savings sometimes allows you to choose a smaller device, which is generally cheaper. Integrated implementations also typically have higher performance, wider data paths, and are software-configurable.

Another challenge with soft IP implementations is the number of features. Typically, such cores only implement the minimum features required by the specification to meet performance or compliance goals. Hard IP, on the other hand, can support a comprehensive feature list based on customer demand and full compliance (Table 1). There are no major performance or resource-related issues.

Figure 4 – Virtex-5 LXT FPGA PCI Express endpoint block diagram (the integrated block’s transaction, data-link, and physical layers plus the configuration and capabilities module sit between the fabric interface and the RocketIO GTP transceivers)

Performance
Lane Width   Interface Data Width   Interface Speed    Bandwidth (each direction)
x1           64                     62.5/125/250 MHz   2 Gbps
x2           64                     62.5/125/250 MHz   4 Gbps
x4           64                     125/250 MHz        8 Gbps
x8           64                     250 MHz            16 Gbps

PCI Express Specification v1.1 Compliance
Requirement                             Support
Clock Tolerance (300 ppm)               Y
Spread-Spectrum Clocking                Y
Electrical Idle Generate and Detect     Y
Hot Plug                                Y
De-Emphasis                             Y
Jitter Specifications                   Y
CRC                                     Y
Automatic Retry                         Y
QOS                                     2 VC/round robin, weighted round robin, or strict priority
MPS                                     128-4096 bytes
BARs                                    Configurable 6 x 32 bit or 3 x 64 bit for memory or I/O
Required Power Management States        Y

Table 1 – Virtex-5 LXT FPGA PCI Express capabilities

Latency
Although the latency of a PCI Express controller will not have a huge impact on overall system latency, it does affect the performance of the interface. Using a narrower data path helps latency. For PCI Express, latency is the number of cycles it takes to transmit a packet and receive that packet across the physical, logical, and transaction layers. A typical x8-lane PCI Express endpoint will have a latency of 20-25 cycles. At 250 MHz, that translates into 80-100 ns. If the interface is implemented with a 128-bit data path to make timing easier (such as 125 MHz), the latency doubles to 160-200 ns. Both the soft and hard IP implementations in the latest Virtex-5 LXT and SXT devices implement a 64-bit data path at 250 MHz for x8 implementations.
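The latency figures above convert between cycles and nanoseconds in a straightforward way; the Python sketch below simply makes that arithmetic explicit, using the article's typical cycle counts as assumed inputs.

```python
# Back-of-the-envelope conversion of endpoint latency from cycles to ns.
for cycles, clk_mhz, datapath in ((20, 250, "64-bit"), (25, 250, "64-bit"),
                                  (20, 125, "128-bit"), (25, 125, "128-bit")):
    ns = cycles * 1000 / clk_mhz
    print(f"{cycles} cycles at {clk_mhz} MHz ({datapath} path): {ns:.0f} ns")
```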


Link Efficiency
Link efficiency is a function of latency, user application design, payload size, and overhead. As payload size (commonly referred to as maximum payload size) increases, the effective link efficiency also increases. This is caused by the fact that the packet overhead is fixed; if the payload is large, the efficiency goes up. Normally, a payload of 256 bytes can get you a theoretical efficiency of 93% (256 payload bytes + 12 header bytes + 8 framing bytes). Although PCI Express allows packet sizes up to 4 KB, most systems will not see improved performance with payload sizes larger than 256 or 512 bytes. A x4 or x8 PCI Express implementation in the Virtex-5 LXT FPGA will have a link efficiency of 88-89% because of link protocol overhead (ACK/NAK, re-transmitted packets) and flow control protocol (credit reporting).

Using FPGAs for implementation gives you better control over link efficiency because it allows you to choose the receive buffer size that corresponds to the endpoint implementation. If both link partners do not implement the data path in a similar way, the internal latencies on both will be different. For example, if link partner #1 uses the 64-bit, 250-MHz implementation with a latency of 80 ns and link partner #2 uses the 128-bit, 125-MHz implementation with a latency of 160 ns, the combined latency for the link will be 240 ns. Now, if link partner #1’s receive buffer was designed for a latency of 160 ns – expecting that the link partner would also be a 64-bit, 250-MHz implementation – then the link efficiency would go down. With an ASIC implementation, it would be impossible to change the size of the receive buffer, and the loss of efficiency would be real and permanent.
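The theoretical efficiency figure above follows directly from the fixed per-packet overhead; the short Python sketch below (illustrative, using the article's 12 header bytes and 8 framing bytes) shows how efficiency grows with maximum payload size. Link-layer ACK/NAK traffic and flow-control credits reduce the realized figure further, to the 88-89% quoted for the Virtex-5 LXT implementation.

```python
# Theoretical TLP efficiency versus maximum payload size (MPS).
OVERHEAD_BYTES = 12 + 8      # header + framing bytes per packet

for mps in (64, 128, 256, 512, 1024, 4096):
    efficiency = mps / (mps + OVERHEAD_BYTES)
    print(f"MPS {mps:4d} bytes: {efficiency:.1%} theoretical efficiency")
```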

User application design will also have an impact on link efficiency. The user application must be designed so that it drains the receive buffer of the PCI Express interface regularly and keeps the transmit buffer full all the time. If the user application does not use packets received right away (or does not respond to transmit requests immediately), the overall link efficiency will be affected regardless of the performance of the interface.

When designing with some processors, you will need to implement a DMA controller if the processors cannot perform bursts longer than 1 DWORD. This translates into poor link utilization and efficiency. Most embedded CPUs can transmit bursts longer than 1 DWORD, so the link efficiency for such designs can be effectively managed with a good FIFO design.
For example, if link partner #1 uses the 64- TAKE THE NEXT STEP (Digital Edition: www.xcellpublications.com/subscribe/) bit, 250-MHz implementation with a laten- cy of 80 ns and link partner #2 uses the • Buy the Development Kit for PCI Express: 128-bit, 125-MHz implementation with a • Virtex-5 FPGA edition latency of 160 ns, the combined latency for • Spartan-3 FPGA edition the link will be 240 ns. Now, if link partner #1’s receive buffer was designed for a latency • Download the Protocol Pack for PCI Express of 160 ns – expecting that the link partner • Watch our live demo from the Embedded Systems Conference would also be a 64-bit, 250-MHz imple- • Register for a PCIe class mentation – then the link efficiency would

Third Quarter 2007 Xcell Journal 17 CONNECTIVITY Virtex-5 FPGA Techniques for High-Performance Data Converters You can harness the DSP resources of Virtex-5 devices to interface to the analog world.

by Luc Langlois ADC provides 12 bits at 500 MSPS, with High-Performance ADC Interface Global Technical Marketing Manager, DSP 64.5 dB full-scale (dBFS) of signal-to- High-performance ADC sampling rates Avnet EM noise ratio (SNR) to 500 MHz. often exceed the minimum rate necessary [email protected] Fast sampling rates offer several bene- to avoid aliasing, or Nyquist rate, defined fits, including the ability to digitize wide- as twice the highest frequency component The incessant demand for higher band- band signals, reduced complexity of in the analog input signal. The highly over- widths and resolutions in communication, anti-alias filters, and lower noise power sampled digital signal entering the FPGA video, and instrumentation systems has spectral density. The result is improved need not maintain a fast sampling rate propelled the development of high-per- SNR in the system. Your challenge is to throughout the signal processing chain; it formance mixed-signal data converters in implement the high-speed interface can be decimated with negligible distortion recent years. This poses a challenge to sys- between the data converter and FPGA in the digital domain by a high-quality dec- tem designers seeking to preserve the while preserving the SNR throughout the imation filter. This offers the benefits of a exceptional signal-to-noise specifications of signal processing chain in the FPGA. slower system clock in subsequent process- these devices in the signal processing chain. Before the digital ADC data is cap- ing stages for easier timing closure and Xilinx® Virtex™-5 FPGAs provide exten- tured in the FPGA, you must take careful lower power consumption. sive resources for high-performance mixed- precautions to minimize jitter on the Xilinx Virtex-5 and Spartan™-3A DSP signal systems, supported by efficient data converter sampling clock. Jitter FPGAs provide the ideal resources to development tools spanning all phases of degrades SNR depending on the signal implement high-performance decimation design, from system-level exploration to bandwidth of interest. For example, pre- filters for fast ADCs using a technique final implementation. serving 74 dB of SNR - or approximate- known as polyphase decomposition. A ly 12 effective number of bits (ENOB) - polyphase decimation filter performs a Key Specifications of Data Converters for signal bandwidths extending to 100 sampling rate change by allocating the DSP A typical mixed-signal processing chain MHz requires a maximum 300 fs (femto- workload among a set of D sub-filters, starts at the analog-to-digital converter seconds) of clock jitter. Modern ADCs where D = decimation rate. Each subfilter (ADC). Modern high-performance ADCs provide clever interfaces that simplify is only required to sustain a throughput of provide sampling rates extending into the distribution of clean low-jitter clocks on fs/D, a fraction of the fast incoming sam- hundreds of megasamples per second the board. Let’s examine how key features pling rate fs from the ADC. (MSPS) for 12- and 14-bit devices. For of Virtex-5 FPGAs are used to imple- As the decimation filter is often the example, the Texas Instruments ADS5463 ment these interfaces. first stage of digital processing, it calls for

18 Xcell Journal Third Quarter 2007 CONNECTIVITY

the highest performance resources closest to the FPGA pins. A Virtex-5 FPGA’s I/O Block 2X Polyphase Decimator input/output block contains an IDDR Sub-Phase 0 XtremeDSP (input double-data-rate register) driven DATA IDDR directly from the FPGA input buffer. IODELAY ADC Several differential signal standards are Sub-Phase 1 XtremeDSP supported, including LVDS, which can DRY sustain in excess of 1-Gbps data rates while providing excellent board-level Low-Jitter Sampling Clock noise immunity. The IDDR is used to de-multiplex the Figure 1 - High-performance ADC interface fast incoming digital signal from the ADC into two single-data-rate data streams, each at one-half the ADC sampling rate. This is the ideal format to feed a 2x polyphase dec- 2X Polyphase Interpolator I/O Block imation filter. Using the Virtex-5 DSP48E, Sub-Phase 0 XtremeDSP ODDR each subfilter can sustain 550 MSPS for a IODELAY DAC maximum 1.1-GSPS ADC sampling rate. Sub-Phase 1 XtremeDSP Similarly, the Spartan-3A DSP can sustain 500 MSPS ADC sampling rates. With the benefits of faster ADC sam- Virtex-5 FPGA Low-Jitter pling rates come the challenges of smaller Sampling Clock data-valid windows to latch the data into the FPGA. Furthermore, the wider the Figure 2 - DAC: polyphase interpolation + ODDR + LVDS ADC data-word precision, the more daunting the layout task and the higher potential for skew across individual signals the Virtex-5 interface to a Texas to situate the polyphase interpolator multi- of the data bus, resulting in corrupted Instruments (TI) DAC5682Z 16-bit dual- plexer as close as possible to the LVDS out- data. Virtex-5 FPGAs provide a robust DAC with 1-GSPS sampling rate and put buffers driving the DAC5682Z. solution called IODELAY, a programma- LVDS inputs. Virtex-5 FPGAs provide a dedicated ble delay element contained in every I/O In a practical system, the 1-GSPS sam- resource within the I/O block that is ideal- block. IODELAY can individually time- pling rate need only be deployed at the ly suited to this purpose: the ODDR (out- shift each signal of the data bus to accu- final output stage to the DAC, while inter- put double-data-rate) registers. The ODDR rately position the data-valid window at mediate stages in the FPGA signal process- routes directly to fast LVDS differential the optimal transition of the half-rate data- ing chain can work at a slower sampling output buffers able to sustain output rates ready signal (DRY). rate commensurate with the signal band- of 1 GSPS (and beyond) while maintaining Figure 1 illustrates the unique features in width. This allows a slower system clock in signal integrity on the PCB. Virtex-5 devices that serve to implement a the intermediate processing stages, with the high-performance ADC interface. To mini- benefits of easier timing closure and lower Conclusion mize sampling jitter, the ADC forwards the power consumption. In this article, I’ve presented DSP and source-synchronous DRY signal along with As is the case for the ADC, polyphase fil- interface techniques for mixed-signal sys- the data, while the clean, low-jitter sam- ters are efficient DSP structures to realize tems using Xilinx Virtex-5 FPGAs. You can pling clock routes directly to the ADC sampling rate changes at the DAC end of the optimize system performance by preserving without passing through the FPGA. signal chain. 
To attain a 1-GSPS output sam- the outstanding SNR specifications of pling rate to the TI DAC5682Z, a 2x modern high-performance data converters High-Performance DAC Interface polyphase interpolation filter uses two sub- using key features of Virtex-5 devices. For equal data-word precision, digital-to- filters, each with a throughput of 500 MSPS. The techniques described in this article analog converters (DACs) typically offer These rates are within the performance spec- will be featured in Speedway 2007 DSP higher sampling rates than ADCs, resulting ifications of the Virtex-5 DSP48E slice. sessions, in collaboration with major in significant design challenges at the DAC A multiplexer is required to combine the Avnet data converter suppliers Texas extremity of the signal chain. Several fea- outputs of the sub-filters to attain the fast Instruments, Analog Devices, and tures of the Virtex-5 architecture can help output rate from the polyphase. For a 1- National Semiconductor. For details, visit surmount the task. For example, consider GSPS output sampling rate, it is advisable http://em.avnet.com/xilinxspeedway.

CONNECTIVITY

Reducing CPU Load for Ethernet Applications
A TOE makes 10 Gigabit Ethernet possible.

by Andreas Magnussen
CTO
IPBlaze
[email protected]

Ethernet is playing an increasingly important role today, as it is used anywhere to connect everything. Not surprisingly, bandwidth requirements are increasing in the backbone as well as in end systems. The implication of this increasing bandwidth is an increased traffic processing load in end systems. Today, most end systems use one or more CPUs with an OS and a network stack to implement network interface functions. For many applications, the increasing traffic load leads to performance issues in the network stack implementation. As these performance issues are seen already at 1 Gigabit Ethernet (GbE), implementing 10 GbE using a software stack is not a viable solution.

To solve these problems, IPBlaze has developed a unique and highly configurable TCP/IP offload engine (TOE). The TOE processes the TCP/IP stack at wire speed in hardware instead of using a host CPU and thus reduces its processing burden.

IPBlaze has implemented the TOE for various end systems. Our measurement results show that using an IPBlaze TOE reduces latencies and CPU utilization considerably. In this article, I’ll give an overview of the implementation options available today for Ethernet applications and show where the IPBlaze TOE can give you the upper edge in terms of performance. Figure 1 shows performance examples with and without the IPBlaze TOE.

Ethernet Configurations
Table 1 shows the five different implementation options available today.

A CPU plus a simple network interface card (NIC) is a very general solution used in most PCs and servers today. A new CPU generation always provides higher performance and thus also increases network performance. However, the increase in network processing performance is significantly lower than the increase in CPU power. The same problem exists with embedded CPUs – the performance issues arise at 10 to 100 times lower data rates.

A high-performance CPU and simple NIC is a flexible solution with a well-known API (socket), and it is easy to implement applications in common PC/server environments. One of the downsides is the difficulty in scaling efficiently to 10 GbE because the load distribution between CPU cores and process intercommunication creates a bottleneck. Power dissipation (heat) is also a limiting factor in many systems.

Some NIC acceleration ASICs on the market perform protocol offload for specific applications such as storage. A high-performance CPU and NIC acceleration ASIC is a good solution if the ASIC supports all of the features needed (iSCSI, TCP offload). The functionality and bandwidth is fixed, however, which makes it very hard to add functionality or adapt to protocol changes without a huge performance impact.

FPGAs are so powerful today that it is possible to implement an accelerated 10
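A rough packet-rate budget illustrates why a pure software stack struggles at 10 GbE; the Python sketch below (illustrative only, and assuming a 3-GHz core for the cycle budget) uses the two packet sizes shown in Figure 1.

```python
# Packets per second at 10 GbE and the CPU cycles available per packet.
LINK_GBPS = 10
CPU_HZ = 3e9           # assumed core clock, for illustration only

for pkt_bytes in (1500, 250):
    pps = LINK_GBPS * 1e9 / (pkt_bytes * 8)
    cycles_per_pkt = CPU_HZ / pps
    print(f"{pkt_bytes}-byte packets: {pps/1e6:5.2f} Mpps, "
          f"~{cycles_per_pkt:.0f} CPU cycles per packet")
```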


Application Examples
The IPBlaze TOE is a general protocol offload engine including TCP offload. The IPBlaze TOE product lines include a number of TOEs with different functionality targeting different applications (Figure 2).

One example is an FPGA working as an intelligent high-performance TOE NIC at n-times-1 GbE or n-times-10 GbE connected to a PC through the PCIe interface. The NIC can be customized with add-on functions such as advanced switch flow control. You can also easily add protocol processing functions such as RDMA and iSCSI for storage applications.

Another example would be embedded video applications where Ethernet is used as a backplane. The TOE can transfer the images or video streams without the need for a CPU.

Figure 1 – The CPU power needed to process network data and throughput, tested at 1 GbE links, using the netperf performance measurements program. (The chart compares CPU load and wire-rate TCP throughput for the IPBlaze TOE and a standard NIC on AMD Athlon 64 3700+, Intel Dual-Core E6600, and Intel M 1.6 systems with 1500-byte and 250-byte packets.)

Third Quarter 2007 Xcell Journal 21 CONNECTIVITY

The TOE core can be instantiated The host interface is essential for system The flexibility in an FPGA solution along with your own IP and configured behavior and performance and represents a allows easy upgrade of the FPGA code to to operate without the need for host significant part of the system complexity. support new protocol features. CPU support. The host CPU has register access to The IPBlaze TOE core is designed to The IPBlaze TOE core can support as TOE registers while the TOE has access to target a variety of FPGA families depend- many as 1,000 concurrent TCP connec- main memory. The main memory contains ing on the performance and functionality tions in on-chip memory. When a higher the send and receive data buffers, queue requirements. The IPBlaze TOE core has number of connections are required, a structures, and TCP connection state been implemented in Virtex-II, Virtex-4, connection cache will allow you to access information if connection caching is and Virtex-5 devices. Here are some example numbers from a 10 GbE TOE implemented in Virtex-5 CPU + Accelerated NIC devices: High-Performance CPU • Core clock: 125 MHz Applications in SW FPGA SOC • TOE latency: 560 ns Virtex-4 or Virtex-5 FPGAs • Throughput: 10G line rate Virtex-5 FPGA (10G bidirectional) Applications in HW PCIe e.g., Video CPU • Packet processing rate: 12 million packets per second IPBlaze TOE IPBlaze TOE Xilinx and IPBlaze provide TCP offload n*1 GbE/10 GbE MAC n*1 GbE/10 GbE MAC solutions that can be implemented as is or customized for functionality, size, speed, or target application. IPBlaze also offers a n*1 GbE/10 GbE PHY n*1 GbE/10 GbE PHY comprehensive set of high-performance TOE solutions for different market seg- Network Interface Network Interface ments. Current TOE solutions include:

Figure 2 – Block diagram for a high-speed TOE solution • IPBlaze General TOE for NIC for a PC/server and embedded system applications. • IPBlaze Security TOE for high-per- the most recently used connection states enabled. An event system is used for fast formance (2-times-10 GbE) bump-in- on-chip. The state information for non- message signaling between the host and the-wire network monitoring. The cached TCP connections is stored in host the TOE. The event system includes a security TOE supports more than memory or in dedicated external memory. number of optimizations for high 1 million concurrent TCP connections. For applications where the TOE does throughput and low latency. not terminate connections but monitors • IPBlaze Embedded TOE for SoC all connections on a backbone, an external OS-Compliant Socket API applications that require high-speed dedicated high-speed memory is used to Applications software interfaces to the network communication (1 GbE and support a very high number (1 million and TOE through a fully standards-compliant 10 GbE). above) of concurrent TCP connections. socket API, with significantly higher per- The IPBlaze TOE core uses three well- formance than a standard software socket Conclusion defined interfaces: implementation. The socket API is imple- With standards-compliant TOE cores, mented at the CPU kernel level. The CPU IPBlaze has created a technology platform • Xilinx 1 GbE and 10 GbE MAC net- load is reduced by 5-10 times and latency is for the 10 GbE TOE market, working with work interfaces for either 1 GbE or 10 improved by 4-5 times. industry leaders in high-performance com- GbE support Hardware-based acceleration should be puting solutions. Programmable solutions enable system architects to add functionali- • A high-speed RAM option for dedicat- able to scale well. The evolution of per- ty as needed. Integrating multiple IP cores ed external memory for TCP connec- formance in future Xilinx FPGA genera- into a single FPGA can reduce board costs tion state information tions provides a good upgrade path. Furthermore, continuous improvements in and time to market. • Xilinx PCIe type or memory host the IPBlaze TOE technology also provide To learn more about protocol offload processor interface increased performance. solutions, visit www.ipblaze.com.

22 Xcell Journal Third Quarter 2007 CONNECTIVITY Automated MGT Serial Link Tuning Ensures Design Margins You can now streamline the serial link tuning process using Agilent’s Serial Link Optimizer tool with Xilinx IBERT measurement cores.

by Brad Frieden In this article, I’ll show you how to Additionally, because the channel is band- Applications Development Engineer optimize a Virtex-4 MGT high-speed width-limited, you can apply a frequency- Agilent Technologies serial link through this process and dis- domain technique called equalization at the [email protected] cuss the results. receiver to compensate for channel frequency roll off. A peaked frequency response yields a When implementing high-speed serial links Challenges of Signal Degradation more flat response when combined with the in FPGAs, you must consider and address At 3.125- and 6-Gbps rates and rise channel roll off. The effects of equalization at the effects of transmission line signal times of 130 ps or shorter, it is no won- the receiver input are quite drastic, but this is integrity. Used together, transmitter pre- der that most applications end up with only visible inside the chip at the actual emphasis and receiver equalization extend significant signal integrity effects from receiver input (post-equalization). the rate at which high-speed serial data can the physical channel that distort the sig- be transferred by opening up the eye dia- nal at the receiver input. Distortion can Measuring Link Performance gram, despite physical channel limitations. come from multiple reflections caused It is important to be able to verify the per- An internal bit error ratio tester by impedance discontinuities, but a formance of a link and to optimize it (IBERT) measurement core from Xilinx more fundamental effect – especially in through the combined adjustment of can view serial signals at an internal receiv- FR4 dielectric PC boards – slows down transmitter pre-emphasis and receiver er point. Used in conjunction with the the edge speeds. Frequency-dependent equalization. Unfortunately, taking a meas- Agilent Serial Link Optimizer, you can skin effect causes a “slow tail” to the urement on the pins of the FPGA where have both a graphical view of the BER pulse. To compensate for this, you can the receiver is located yields a very distort- across the unit interval and automatically apply a time-domain technique called ed signal, since you are observing the signal adjust pre-emphasis and equalization set- pre-emphasis to the transmit pulse, with with pre-emphasis applied but without tings to optimize the channel. significant improvement at the receiver. corresponding equalization.

Third Quarter 2007 Xcell Journal 23 CONNECTIVITY

Figure 1 is an example of such a meas- Transmitter Receiver with Equalizer urement made with an Agilent digital com- munications analyzer on a 6-Gbps channel implemented with ASIC technology. Backplane Notice that the eye diagram actually appears completely closed, even though Embedded BERT good signals are present at the receiver Agilent DCA at Pins input inside the chip.

IBERT Measurement Core Fortunately, you can observe what is going on at the FPGA’s receiver input by using an IBERT measurement core that is part of the Xilinx® ChipScope™ Pro Serial I/O toolkit. The normal FPGA design is temporarily replaced with one that creates stimulus at the transmitter and measures BER at the receiver. These Figure 1 – DCA measurement on FPGA serial I/O pins stimulus/response core pairs are placed in versus on-chip measurement at the receiver input the design using a core generation process. Now it is possible to observe BER at the receiver inside the chip. Using Software Application that basic measurement capability, you on Windows PC can apply a variety of pre-emphasis and equalization combinations to optimize PC Board the channel response.

Basic BERT Measurements PC Board FPGA Parallel or To understand link performance, you must USB have the ability to measure bit error rate. IBERT To do this, a new tool called the Agilent FPGA Serial Link Optimizer takes control of Insert Core with Xilinx Cable IBERT core stimulus and response in the FPGA Design Software IBERT serial link to create such a measurement. A measurement system as shown in Figure 2 allows for this kind of test. Here, two JTAG Virtex™-4 FPGAs comprise the link: one implements the transmitter, the other the receiver. Both FPGAs are under JTAG con- Figure 2 – Serial Link Optimizer block diagram trol from the Serial Link Optimizer soft- ware, and measurements are taken at the I obtained a measurement on a serial with the topology of the ML405 boards, receiver input point inside the FPGA to link that comprises a Xilinx Virtex-4 dictate the physical channel. measure BER. XC4VFX20 multi-gigabit transceiver Steps to set up this measurement include Some of the selectable measurement (MGT) 113A transmitter on one Xilinx (assuming that the IBERT cores are already attributes include: ML405 board, with a connection through created and loaded into the FPGAs): a SATA cable over to an MGT 113A • Loopback mode (internal, 1. Start tool and configure USB JTAG receiver in a second Virtex-4 XC4VFX20 external, or none) for TX and verify connection FPGA on a second ML405 board. I made • Test pattern type a USB JTAG connection to the first 2. Configure parallel JTAG for RX and • Dwell time at each point across FPGA and IBERT core associated with verify connection the unit interval the transmitter, and a parallel JTAG con- nection to the second IBERT core associ- 3. Select MGT 113A transmitter on • Manual injection of errors ated with the receiver. These cores, along USB cable

24 Xcell Journal Third Quarter 2007 CONNECTIVITY

4. Select MGT 113A receiver on Automated Link Tuning design into each FPGA to determine the opti- parallel cable Let’s extend this measurement process yet mal combination of pre-emphasis and equal- again by automatically trying combinations ization. Now that I have determined the 5. Select loopback (external) of pre-emphasis and equalization settings optimal combination, the Serial Link 6. Select test pattern type (PRBS7) until the error-free zone in the unit interval Optimizer outputs the corresponding pre- is maximized, resulting in the best link per- emphasis and equalization settings for the 7. Setup TX and RX line rates (reference formance for speed and margin. The Serial MGTs. The MGT parameters are represented clock 150 MHz, line rate 6 Gbps) Link Optimizer does just that. Link config- in VHDL, Verilog, and UCF formats and are 8. Select BERT tab and press “Run” uration options include an internal loop- output when clicking the “Export Setting” back test inside the I/O ring of a single button. You can cut and paste these parame- I made a measurement on the link with FPGA; transmit and receive from one ters back into the source design, thus incor- zero errors after 1E+12 bits (~ 3-min meas- FPGA; and a transmit/receive pair test porating the optimal link settings when the urement). This measurement of BER on between two different FPGAs. It is also pos- final design is programmed into the FPGAs. the channel occurred at the receiver input sible to inject signals from an external For example, the Verilog settings for the internal to the chip and required no exter- BERT instrument and monitor signals at first transceiver are: nal measurement hardware. It forms the the receiver inside the FPGA. // Verilog basis for additional capability in the Serial In our example, let’s use the same serial //— Rocket IO MGT Preemphasis and Equalization — Link Optimizer tool. channel between two Virtex-4 FPGAs with defparam MGTx.RXAFEEQ = 9’b111; the MGT 113A/B TX/RX pair. On the defparam MGTx.RXSELDACFIX = 5’b1111; A Graphical View of BER Serial Link Optimizer interface, select the defparam MGTx.RXSELDACTRAN = 5’b1111; The next step toward understanding link “Tuning” tab and Run button as shown in performance is to have the ability to graph Figure 3. The error-free zone with default Required Components BER as a function of To take advantage of these measurements, the unit interval. To do you will need: this, the Agilent Serial Link Optimizer takes 1. A PC with Windows XP installed (SP2 control of IBERT core or higher) stimulus and response 2. Xilinx programming cable(s), parallel in the serial link to cre- and/or USB ate such a graph. The same measure- 3. Xilinx ChipScope Pro Serial I/O ment system is used as Toolkit (for IBERT core) with the basic BER 4. Agilent E5910A Serial Link Optimizer test, but now measure- software ments are taken at 32 discrete steps across the Conclusion unit interval to show Through automated adjustments of pre- where error-free per- Figure 3 – BER plot before and after automatic adjustment of pre- emphasis and equalization, while simultane- formance is achieved. emphasis and equalization to tune the serial link ously monitoring BER at the receiver input The resulting graph is inside the FPGA, it is now possible to auto- shown in the upper BER plot in Figure 3. FPGA settings was 0.06 unit interval. After matically optimize an MGT-based serial link. 
The Virtex-4 MGT macro has the ability attempting 141 combinations of pre- At 3.125-Gbps rates, such an optimiza- to adjust the sampling position across emphasis and equalization, the error-free tion is helpful to get good margins. But at these 32 discrete points; the Serial Link zone increased to 0.281 (28%) unit inter- 6-Gbps rates, optimization becomes crucial Optimizer uses the IBERT core in con- val, also shown in Figure 3. This means sig- to achieve link performance that ensures junction with sampling position control nificantly better margins in the link design reliable data transfer. This process can save to make the measurements. The link per- at the 6-Gbps speed. significant time in reaching your desired formance with default MGT pre-empha- design margins and can likely achieve better sis and equalization settings has zero Exporting Link Design Constraints results than what is possible through tradi- errors for 0.06 (6%) of the unit interval. These new pre-emphasis and equalization tional manual tuning. This is quite narrow, indicating very little settings must now be incorporated into a real For more information, visit www.agilent. margin. The system could benefit from design that ultimately uses the serial link. Up com/find/serial_io or www.agilent.com/ proper link tuning. until this point, I placed a temporary test find/xilinxfpga.

Third Quarter 2007 Xcell Journal 25 CONNECTIVITY A High-Speed Serial Connectivity Solution with Aurora IP Aurora is a highly scalable protocol for applications requiring point-to-point connectivity.

by Mrinal J. Sarmah Although it offers real benefits, serial • MGTs with point-to-point connec- Hardware Design Engineer I/O has some negative attributes. Serial tions make switched fabric architec- Xilinx, Inc. interfaces require high-bandwidth manage- tures possible ment inside the chip, special initialization [email protected] • The elastic buffers available in MGTs and monitoring, bonding of lanes in an provide high skew tolerance for aggregated channel of multiple lanes, elas- Hemanth Puttashamaiah channel-bonded lanes Hardware Design Engineer tic buffers for data alignment, and de-skew- Xilinx, Inc. ing. Also, flow control is complex and you MGTs are designed for configurable [email protected] must maintain the correct balance between support of multiple protocols; hence high-level features and total chip area. their control is fairly complex. When With advances in communication technol- designing or integrating designs with a ogy, you can achieve gigahertz data-transfer Multi-Gigabit Transceivers high-speed connectivity solution using rates in serial links without having to make Following the industry-wide migration MGTs, you will have to consider MGT trade-offs in data integrity. The prolifera- from parallel to serial interfaces, Xilinx initialization, alignment, channel bond- tion of serial connectivity can be attributed introduced multi-gigabit transceivers ing, idle sequence generation, link man- to its advantages over parallel communica- (MGTs) to meet bandwidth requirements agement, data delineation, clock skew, tion, including: as high as 6.5 Gbps. clock compensation, error detection, and The common functional blocks in these data striping and destriping. Configuring • Improved system scalability transceivers are an 8B/10B encoder/decoder, transceivers for a particular application is • More flexible, thinner cabling transmit buffer, SERDES, receive buffer, challenging, as you are expected to tune loss-of-sync finite state machine (FSM), more than 200 attributes. • Increased throughput on the line with comma detection, and channel bonding minimal additional resources logic. These transceivers have built-in clock The Aurora Solution • More deterministic fault isolation data recovery (CDR) circuits that can per- The Xilinx® Aurora protocol and its asso- form at gigahertz rates. Built-in phase- ciated designs address these challenges by • Predictable and reliable signaling locked loops (PLL) generate the fabric and managing the MGT’s control interface. schemes transceiver clocks. Aurora is free, small, scalable, and cus- • Topologies that promise to scale to Transceivers have several advantages: tomizable. With low overhead, Aurora is a the needs of the end user protocol-agnostic, lightweight, link-layer • With their self-synchronous timing protocol that can be implemented in any • Exceptional bandwidth per pin with models, they reduce the number of silicon device/technology. very high skew immunity traces on the boards and eliminate With Aurora, you can connect one or clock-to-data skew • Reduced system costs because of smaller more MGTs to form a communication form factors, fewer PCB traces and • Multiple MGTs can achieve higher channel. The Aurora protocol defines the layers, and lower pin/wire count bandwidths structure of data packets and procedures for

26 Xcell Journal Third Quarter 2007 CONNECTIVITY

a 2- or 4-byte interface should be High-Speed Serial Transceivers Each Aurora Lane Uses One MGT MGT = Multi-Gigabit Transceiver based on the throughput require- ments and latency involved. A 4-

Aurora Channel Partners Aurora Full-Duplex Channel byte design has more latency than a 2-byte design, but offers better throughput and consumes less resources than an equivalent

Lane 1 2-byte design. M M A G G A Aurora channels can function User User Local U T T U Local as uni-directional (simplex) or User Interface R R Interface User Application Link Link Application Interface O Channel O Interface bi-directional (full-duplex). Full- R M Lane N M R duplex module initialization A G G A T T data comes from the channel Interconnect partner through a reverse path, (Ex. Backplane) User Frames and User Frames and whereas in simplex the initializa- Flow Control Messages Aurora Frames Flow Control Messages tion happens through four side- band signals. Aurora is available Figure 1 – Connectivity scenario using Aurora in simplex TX only, RX only, or both. “Both” operation is like full duplex, except that the transmit and receive sections communicate independently. You can use Aurora simplex status status status in token-ring fashion. Figure 2 shows the RX TX RX TX ring structure of Aurora-S (simplex).

Data Flow in Aurora FPGA FPGA Data is transferred between Aurora channel partners as frames. Data flow primarily com- Figure 2 – Simplex token ring structure prises the transfer of protocol data units (PDUs) between the user application and flow control, data striping, error handling, user interface is not part of the Aurora the Aurora interface, as well as the transfer of and initialization to validate MGT links. protocol specification. channel PDUs between channel partners. Aurora shrink-wraps MGTs by providing The Aurora protocol engine converts a transparent interface to them, allowing the generic arbitrary-length data from the Soon after power-up, when the core upper layers of proprietary or industry-stan- Xilinx LocalLink user interface to Aurora comes out of all reset states, the trans- dard protocols such as Ethernet and TCP/IP protocol-defined frames and transfers it mitter sends initialization sequences. If to ride on top of it and provide easily access. across channel partners comprising one or the link is fine and the link partner rec- This easy-to-use pre-defined protocol more high-speed serial links. The number ognizes these patterns, it sends acknowl- needs very little time to integrate with of links between channel partners is config- edgement patterns. After receiving existing user designs. Being lightweight, urable and device-dependent. sufficient acknowledgement patterns, the Aurora does not have an addressing scheme Most Xilinx IPs are developed based on transmitter transits to a link-up state, and cannot support switching. It does not the legacy LocalLink interface. Any user indicating that the individual transceiver define correction within data payloads. interface designed for other LocalLink- connection between channel partners is Aurora is defined for physical and data- based IPs can be directly plugged into established. Once the link is up, the link layers in the Open Systems Aurora. For more information on the Aurora protocol engine moves to a chan- Interconnection (OSI) model and can easi- LocalLink interface, visit www.xilinx.com/ nel verification stage for a single-lane ly integrate into existing networks. products/design_resources/conn_central/ channel, or a channel-bonding stage locallink_member/sp006.pdf. (preceeding a verification stage) for a A Typical Connectivity Scenario You can think of Aurora as a bridge multi-lane channel. When channel verifi- Figure 1 is an overview of a typical Aurora between the LocalLink interface and the cation is complete, Aurora sends a chan- application. The Aurora interface trans- MGT. nel-up signal, which is followed by actual fers data to and from user application Aurora’s LocalLink interface is customiz- data transmission. The link initialization functions through the user interface. The able for 2- or 4-byte data. Your selection of process is shown in Figure 3.

Third Quarter 2007 Xcell Journal 27 CONNECTIVITY

Frame Types

As stated previously, data is sent over Aurora Start channels as frames. There are five frame types, listed in the order of their priority: • Clock compensation (CC) Align SERDES Operation • Initialization sequences • Native flow control (NFC) Polarity Corrected Lane Channel • User flow control (UFC) Align Polarity Verification Bonding • Channel PDUs • Idles Partner Align SERDES Handshake Aurora allows channel partners to use Multi-Lane separate reference clocks by inserting CC sequences at regular intervals. It can Figure 3 – Data flow in Aurora accommodate differential clock rates up to 200 parts per million (ppm) between the transmitter and the receiver. N Octets An Aurora frame sends data in a quanti- User PDU ty of two bytes called a symbol. In case the number of bytes in the PDU is odd, it Padding appends one extra byte called a pad. On the N Octets One Octet receive side, the pad is removed by the receive logic and data is presented to the User PDU PAD LocalLink interface. Link-Layer Encapsulation The transmit LocalLink frame is encap- sulated in the Aurora frame, as shown in Two Octets N Octets One Octet Two Octets

Figure 4. Aurora encapsulates frames with SCPUser PDU PAD ECP start-channel PDU (SCP) and end-channel PDU (ECP) symbols. On the receive side, 8B/10B Encoding the reverse process occurs. The transceivers Two Symbols N Symbols One Symbol Two Symbols are responsible for encoding/decoding the SCP User PDU PAD ECP frame into 8B/10B characters, serialization, and de-serialization. Figure 4 – Frame encapsulation in transmit Aurora Flow Control Aurora supports an optional flow control mechanism to prevent data loss caused by FIFO to come out of its overflow state. transmitted; only after competition of the different source and sink rates between While sending NFC requests, the TX transfer is the NFC request processed. channel partners (Figure 5). NFC FSM should eliminate any round- trip delay. Ideally, NFC requests are sent User Flow Control Native Flow Control before receive FIFO overflows to account UFC is used to implement user-defined NFC is meant for data-link layer-rate con- for this delay. You can program the NFC flow control at any layer. The number of trol. NFC operation is governed by two pause value from 0 to 256; the maximum UFC messages can be 2 to 16 bytes in a NFC FSMs: RX and TX. pause value is infinite. An NFC pause value frame. UFC messages are generated and The RX NFC FSM monitors the state of is non-cumulative and new NFC request interpreted by the user application. the RX FIFO. When there is a risk of over- values override the old. flow, it generates NFC PDUs, requesting that There are two NFC request types: imme- Applications the channel partner pause transmission of diate mode and completion mode. In imme- Aurora is a simple, scalable open protocol user PDUs for a specific time duration. The diate mode, an NFC request is processed for serial I/O. It can be implemented in TX NFC state machine responds by waiting while low-priority requests are halted. In FPGAs or ASICs. Aurora can be applied during the requested time, allowing the RX completion mode, the current frame is where an inexpensive, high-performance

28 Xcell Journal Third Quarter 2007 CONNECTIVITY

The latency for a single-lane Channel Partner AAurora Channel Channel Partner B User Flow User Flow 2-byte design is 10 cycles. Control Control Messages Messages Transceivers insert some inher- D User User PDUs M E User PDUs Rx ent latency. Thus, the overall Native Flow Control U M Native Flow Control FIFO X U Idle Sequences Idle Sequences latency is approximately 29 X Rx Native Tx Native FIFO cycles for an Aurora design on a Flow Control Flow Control Status State Machine State Machine RocketIO™ GTP transceiver. You have the freedom to FIFO Tx Native Rx Native Status Flow Control Flow Control choose reference clocks from a State Machine Idle Sequences Idle Sequences State Machine D range of values. The overall Native Flow Control Native Flow Control User E M M U throughput depends on the User PDUs Rx User PDUs FIFO U X X reference clock value and the User Flow User Flow Control Control line rate chosen. Messages Messages Aurora: CORE Generator IP Figure 5 – NFC operation in Aurora Aurora is released as a part of CORE Generator™ software, with about 10 configurable parameters. You Resource Utilization can configure an Aurora core by selecting 4000 streaming/framing interface, simplex/full- 3500 duplex data flow, single/multiple MGTs, 3000 reference clock value, line rate, MGT loca- 2500 No. of FFs tion(s) based on number of MGTs selected, 2000 No. of LUTs and reference clock source. The supported Resource 1500 No. of Slices line rate can range from 0.1 Gbps to 6.5 1000 Gbps depending on the Virtex device 500 0 selected. The core comes with simulation 12345678 scripts supporting MTI, NC-SIM, and No. of Lanes VCS simulators and build scripts to ease design synthesis and bit generation. Figure 6 – Performance curves for various Aurora flavors Conclusion link layer with little resource consumption tions on top. Aurora simplex operation Aurora is easy to use. The IP is efficient, is required, saving many hours of work for offers the most flexibility and the least scalable, and versatile enough to transport users who were considering developing resource use where users require high legacy protocols and implement proprietary their own MGT protocols. throughput in one direction. Simplex is protocols. Aurora is also backward-compat- Aurora has a variety of applications, ideal for non-data communication like ible: an Aurora design in a Virtex-5 device including: streaming video. can talk to an Aurora design in a Virtex-4 • Video device, making the IP independent of Performance Statistics underlying MGT technologies. Xilinx IP • Medical Aurora is designed to use minimal FPGA and software tools allow you to take advan- • Backplane resources. A single-lane 2-byte framing tage of Aurora’s improved feature set. One design consumes approximately 383 look- future enhancement currently in the • Bridging up tables and 374 flip-flops. The curves in research phase is the inclusion of a packet- • Chip-to-chip and board-to-board com- Figure 6 show the resource utilization sta- retry feature to provide link reliability. munication tistics for different lane configurations on a For more information, visit www. • Splitting functionality among several Virtex™-5 LXT device. xilinx.com/aurora. FPGAs TAKE THE NEXT STEP (Digital Edition: www.xcellpublications.com/subscribe/) • Simplex communication • Download the Aurora protocol specification and bus functional model Aurora provides more throughput for • Order your free Aurora LogiCORE design fewer resources. 
Plus, it can adapt to cus- • Read about Aurora application examples tomer needs, allowing additional func-

Third Quarter 2007 Xcell Journal 29 30% faster than last year’s model…

Here are 6 of the new, faster, bigger, Virtex-5 FPGAs on a 12 Million ASIC Gate Board that offers unmatched performance to ASIC Prototypers, IP Designers, and FPGA Developers. The V5 65nm process, with 6 input LUT and advanced interconnect, enables 30% faster clock speeds in your application. The Dini DN9000k10PCI captures this performance on an easy to use board with these handy features:

• 33/66 MHz PCI bus or stand-alone operation • 6 - DDR2 SODIMM Sockets • 7 - Global Clock Networks • 3 - 400pin FCI-MEG Connectors for daughter cards • Easy configuration via CompactFlash, USB or PCI

All necessary operating software, including reference designs and Synplicity Certify™ models to simplify partitioning, is supplied with the board. If your need is speed — visit The Dini Group web site 1010 Pearl Street, Suite 6 www.dinigroup.com for complete details on La Jolla, CA 92037 the fastest FPGA board ever. (858) 454-3419 [email protected] CONNECTIVITY Xilinx FPGAs Adapt to Ever-Changing Broadcast Video Landscape New digital broadcast standards and applications are addressed by advanced silicon, software, and free reference designs available with Xilinx FPGAs.

data rates could be supported by a single Virtex™-II Pro MGT and saw an oppor- tunity to integrate multiple ASSP chips into a single Xilinx FPGA. The integration would vastly reduce the cost of these inter- faces, especially in video switcher and mas- ter controller designs where there are multiple video streams. Snow authored the first application notes for these two interface standards using FPGAs. These application notes included free reference designs written in Verilog and VHDL source code. The code and documentation allowed broadcast sys- by Tim Hemken tem engineers to easily implement fully fea- Marketing Director specific standard product (ASSP) chip mak- tured SDI and HD-SDI receivers and Xilinx, Inc. ers. The data rate for SDI is nominally 270 transmitters and other associated func- [email protected] Mbps, adequate for standard definition tele- tions, such as video test pattern generators, vision (SDTV) resolutions. in Virtex-II Pro FPGAs. Over the next year, The serial digital interface (SDI) protocol In 2002, Xilinx announced the immedi- Virtex-II Pro FPGAs began to appear in for transporting uncompressed standard ate availability of Virtex™-II Pro FPGAs. broadcast equipment throughout the definition video evolved when broadcast The feature set of this device included world, consolidating multiple ASSP chips. studios wanted to convert from analog multi-gigabit transceivers (MGTs) capable HDTV’s acceptance among the masses audio and video to digital audio and video of operating at bit rates as fast as 3.125 continues to grow. This amazing technolo- without replacing the enormous coaxial Gbps. About the same time, the broadcast gy brings viewers closer to reality and, as a transmission cable infrastructure. Today, studios were starting their adoption of the result, the amount of HD content is the ever-increasing screen resolutions and new high-definition television (HDTV) increasing. Today, HD-capable receivers are associated data rates have spawned new standards with greater screen resolutions achieving mass-market price points and serial data communication formats with and higher data-rate requirements. consumers are buying HDTVs at accelerat- the same goal of coax reuse. SMPTE authored a standard known as ing rates. The multi-rate SDI and HD-SDI The first such standard for SDI, authored SMPTE 292M, supporting the serial trans- reference designs are increasingly impor- by the Society of Motion Picture and mission of uncompressed HDTV video tant for distributing digital audio and video Television Engineers (SMPTE), is known as content at a nominal rate of 1.5 Gbps, content inside the broadcast studio. SMPTE 259M, commercialized in 1989. At known as HD-SDI. Today, the SMPTE organization is not its introduction, the primary chips used for Xilinx® Senior Staff Applications standing still. They consistently publish the interface were provided by application- Engineer John Snow realized that the two new standards to handle video formats that

Third Quarter 2007 Xcell Journal 31 CONNECTIVITY

CCD Cameras Memory Ultra high-speed, high-sensitivity broad- cast cameras capable of capturing clear, Light smooth, slow-motion video – even in lim- ited lighting – are extremely useful for Memory I/F CCD Bridge recording events such as professional Driver Logic Cable baseball games played at night. These CDS Equalizer CDS HD-SDI cameras can capture fast-moving phe- A/D DSP

HD-SDI nomena that cannot be perceived clearly Shutter with the naked eye, such as the footage of Virtex-5 FPGA a ball’s impact with a bat. Showing these events in slow-motion video improves the Figure 1 – CCD camera viewer experience significantly. The CCD (charge-coupled device) camera shown in Figure 1 uses an FPGA to perform signal synthesis processing and Memory External Processor color processing, as well as interfacing to the CCD driver. The CCD driver in turn drives the CCD, mechanical shutter con- trol, and trigger control. The incoming Embedded Bridge Processor Logic video signal is converted to digital format by the analog-to-digital converter and then Video stored in off-chip memory. When the data Cable Pre- H.264 EMAC Optics Equalizer HD-SDI Processing transfer for an entire frame is complete, the data from the memory is synthesized by the FPGA and sent over the network Virtex-5 FPGA using HD-SDI. The processing time Figure 2 – Video over IP required from trigger to HD-SDI output is one second or less. The FPGA also con- trols the memory and the ADC. require higher bandwidth. Two of the latest digital interfaces running at 3 Gbps. standards are known as dual-link HD-SDI Even within a single SMPTE standard Video over IP (SMPTE 372M) and 3G-SDI (SMPTE there are opportunities for FPGA “future Some video production centers are starting 424M and SMPTE 425M), both provid- proofing.” 3G-SDI is actually defined by to use Ethernet to transmit crystal-clear ing 3 Gbps of total bandwidth. two standards: SMPTE 424M and HD streams across the network. Images are The dual-link HD-SDI standard uses SMPTE 425M. SMPTE 424M defines the pre- and post-processed to enhance picture two HD-SDI rate links combined together physical and electrical characteristics of the quality in real time with low latency and to facilitate the transfer of richer color serial interface itself. SMPTE 425M then transported over the network using (more pixel color data) or faster update defines how to map various video formats various encoding and decoding standards rates (1080 lines at a 60-Hz progressive to the interface. Even though it was only (codecs). The data must be compressed, as frame rate as opposed to a 30-Hz frame published in 2006, SMPTE is already at the stream size and rate are very high. For rate). The two coax cables forming a dual- work modifying SMPTE 425M to accom- example, a transmission of 1920 x 1080 link HD-SDI interface can be replaced modate additional video formats. pixels at 30 fps requires a data rate of 1.5 with a single coax cable using 3G-SDI. The SMPTE organization has already Gbps uncompressed. Add to that multiple An example of an application requiring defined interfaces that will run at 10 Gbps channels and the rate goes even higher. the increased data rates is driven by the cin- and is contemplating interfaces that will Application-optimized FPGAs with ema business as it migrates to digital data. run even faster. Designing these interfaces embedded DSP blocks, on- and off-chip Digital cinema standards use 36 bits of data with Xilinx FPGAs eliminates the risk that memories, abundant logic to build bridg- per video sample, compared to the 20 bits a newly designed piece of video equipment ing functions, and Ethernet and HD-SDI per sample typically used by HDTV for- will be outdated by quickly evolving stan- connectivity are the ideal solutions to build mats. The increased number of bits per dards before it ships. such systems. 
Figure 2 shows a block dia- video sample coupled with higher screen Let’s look at a few emerging applications gram of a video over IP system. The FPGA resolutions results in market demand for that use HD-SDI for video connectivity. reads the data presented over an HD-SDI

32 Xcell Journal Third Quarter 2007 CONNECTIVITY

motion-compensated prediction process. Memory External Processor AVC doubles the accuracy of motion pre- diction, uses smaller block sizes to allow objects to be tracked more accurately, and

Picture Test has many more reference frames to search R MPEG Quality for a good motion-predictive match. Thus, Cable Driver CODEC Rating Display Equalizer HD-SDI a real-time high-definition AVC video

Broadcast R encoder can deliver broadcast-image quality Driver SMPTE Bridge Display at half the bandwidth of MPEG-2. Test-Pattern Logic Generator FPGAs perform the computationally

Local "Test" HDD intensive motion estimation task, as shown Video Sources Disk for Comparison Controller in Figure 4. Motion estimation is per- Virtex-5 FPGA formed using the repeated sum of absolute difference calculations. The data compar- Figure 3 – HDTV picture quality monitor isons are very repetitive and many of the calculations are reused. CPU-based imple-

Memory External mentations tend to struggle to feed the Processor arithmetic logic units from cache; FPGA designs can be customized to retain all of the values in a custom register pipeline.

Motion Cable Video Intra- Estimation Equalizer Scaling Estimation Conclusion HD-SDI The newest devices, such as Xilinx Virtex-5

Bridge FPGAs, have enormous amounts of logic EMAC Optics Logic and offer ASIC-like levels of performance. FPGAs pack in all of the features that Virtex-5 FPGA broadcast equipment designers want:

Figure 4 – Real-time HD AVC • Embedded low-power 3.2-Gbps trans- ceivers capable of supporting several link and then processes it. A codec, such as defined by SMPTE RP 198. Subjective standards such as SDI, HD-SDI, dual- H.264, is used to compress the data. The testing is performed by comparing the link HD-SDI, 3G-SDI, DVB-ASI, AES data is then converted into Ethernet packets broadcast feed with local test video sources. digital audio, Ethernet, and PCI Express with the appropriate header information for The FPGA collects the data from different • High-speed DSP blocks decoding at the receiving end and finally sources, pre-processes it, and then sends it sent over Ethernet links using a MAC. to an external processor for analysis. • Embedded processors • Ethernet MACs HDTV Picture Quality Monitor Real-Time HD AVC • PCI Express cores Previously, consumers could access high- Advanced video coding (AVC) is a video quality video and multi-channel audio compression technique used to deliver • Many video IP cores through DVDs only. As HD broadcasts video at half the required bit rates. AVC By designing video connectivity applica- become commonplace, comparisons with was first used with standard-definition tions in Xilinx silicon products, broadcast DVDs are natural. As a result, viewers are videos, but is even more compelling for equipment manufacturers can lower costs, more aware of picture quality, particularly HD service providers. AVC achieves its differentiate their products from the com- for HD. Picture quality is likely to become most significant gains over MPEG-2 petition, and lower the inherent risk caused a major differentiator between service through substantial improvements to the by changing standards. providers. Picture quality monitors for objective and subjective testing of picture TAKE THE NEXT STEP (Digital Edition: www.xcellpublications.com/subscribe/) quality in conjunction with perceptible • Learn more about Xilinx solutions for video and audio interfacing, real-time error measurements are now required. HD video processing, video and audio codecs, forward error correction Figure 3 shows a picture quality moni- (FEC) and modulation, and other broadcast system functions. tor implemented in a Xilinx Virtex-5 • Read audio, video, and image processing application notes. FPGA. Objective testing is performed • Purchase the Xilinx serial digital video board for HD-SDI, SDI, and DVB-ASI. using commonly used test pattern formats

Third Quarter 2007 Xcell Journal 33 HAPS™ – High-performance Tr ps a ASIC Prototyping System™ a k H c 2007 a high performance, high capacity FPGA platform o e m l for ASIC prototyping and emulation composed of p a t i b multi-FPGA boards and standard or custom-made daughter boards

HapsTrak™ HAPS-50 a set of rules for pinout and mechanical characteristics, which guarantees compatibility with previous and future Tr ps a a k generation HAPS motherboards H c 2005 o e and daughter boards m l p a t i b

HAPS-30

Tr ps a a k

H c 2004 o e m l p a t i b

HAPS-20

Tr ps a a k

H c 2003 o e m l p a t i b

www.synplicity.com, [email protected] Synplicity, Inc. 600 West California Avenue, Sunnyvale, CA 94086 Phone: (U.S.) +1 408 215-6000

Copyright © 2007 Synplicity, Inc. All rights reserved. Specifications subject to change without notice. HAPS-10 Synplicity, the Synplicity logo, and "Simply Better Results" are registered trademarks of Synplicity, Inc. HAPS, "High-performance ASIC Prototyping System", and HapsTrak are trademarks of Synplicity, Inc. Virtex-II, Virtex-II Pro, Virtex-4, and Virtex-5 are registered trademarks of Xilinx Inc. FREE on-line tutorials with Demos On Demand

A series of compelling, highly technical product demonstrations, presented by Xilinx experts, is now available on-line. These comprehensive videos provide excellent, step-by-step tutorials and quick refreshers on a wide array of key topics. The videos are segmented into short chapters to respect your time and make for easy viewing.

Ready for viewing, anytime you are Offering live demonstrations of powerful tools, the videos enable you to achieve complex design require- ments and save time. A complete on-line archive is easily accessible at your fingertips. Also, a free DVD containing all the video demos is available at www.xilinx.com/dod. Order yours today!

©2007 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners. CONNECTIVITY Serial RapidIO Connectivity Enhances DSP Co-Processing The Virtex-5 LXT FPGA-based SRIO IP solution significantly enhances connectivity by providing a true bi-directional data flow.

by Navneet Rao System designers must create architec- between compute platforms/boxes in Technical Marketing Manager, Platform Solutions tures that will alleviate the exorbitant wired or wireless infrastructures and the Xilinx, Inc. demands of triple-play scenarios while individual computing resources in these [email protected] meeting such requirements as: platforms/boxes.

Today, the demand for high-speed commu- • High performance Connectivity Between Compute Platforms nication and super-fast computing is spiral- • Low latency Standards-based connectivity is common ing. Wired and wireless communication today. Parallel connectivity standards • Lower system costs (NRE included) standards are being deployed everywhere (PCI, PCI-X, EMIF) may be able to and the data-processing infrastructure is • Scalable, extensible architectures meet current demands, but will come up scaling everyday. The pervasive means of short when scalability and extensibility • Integrating off-the-shelf (OTS) is Ethernet through are involved. With the advent of packet- components LAN, WAN, and MAN networks. The based processing, clearly the trend is pervasive means of wireless communication • Distributed processing toward high-speed serial connectivity is through cell phones, enabled by infra- • Support for multiple standards (Figure 1). structures using DSP. What was primarily a and protocols The desktop and networking industries means of voice connectivity – the phone – have adopted standards like PCI Express now caters to the ever-increasing demands What emerges from these challenges (PCIe) and Gigabit Ethernet/XAUI. for voice, video, and data. are two primary themes: connectivity Data-processing systems in wireless infra-

36 Xcell Journal Third Quarter 2007 CONNECTIVITY

structures, however, have slightly different > 1 GHz interconnect requirements: Serial I/O CDR Device Device Point-to-Point • Low pin counts Switched Switch Fabric Parellel I/O • Backplane chip-chip connectivity < 1 GHz Source- Device Device Synchronous Device Device • Bandwidth and speed scalability Point-to-Point SRIO, PCIe Switched • DMA and message passing Switch Fabric Parellel I/O < 133 MHz Shared Clock Device Device • Support for complex scalable topologies Device Device Shared Bus RapidIO, POS PHY L4...

• Multicast System Performance Device BridgeDevice Device < 33 MHz PCI-X 133, EMIF • High reliability Device Device Device Device • Time-of-day synchronization Device Device Device Device PCI 32/33 Time • Quality of service (QoS) The Serial RapidIO (SRIO) protocol Figure 1 – Trend toward serial connectivity standard can easily meet and exceed most of these requirements. As such, SRIO has become the dominant interconnect for data-plane connectivity in wireless infra- CPU SRIO networks are built around two structure equipment. Endpoint DRAM Endpoint SRIO networks are built around two basic blocks: - Endpoints basic blocks – endpoints and switches Port Port - Switches (Figure 2). Endpoints source and sink 2 2 Endpoint Port Switch Port Switch Port Endpoint Endpoints source and sink packets packets, while switches pass packets 3 3 1 between ports without interpreting them. Switches pass packets between Port Port ports without interpreting them SRIO is specified in a three-layer archi- 0 0 ROM tectural hierarchy (Figure 3): All devices support maintenance transactions for access to • The physical layer specification Endpoint Endpoint configuration registers describes device-level interface specifics such as packet transport mechanisms, flow control, electrical characteristics, Figure 2 – SRIO network building blocks and low-level error management. • The transport layer specification pro- vides the necessary route information for Upper-Layer Upper-Layer a packet to move from endpoint to end- Protocols Protocols point. Switches operate at the transport Session layer by using device-based routing. Layer Logical Layer • The logical layer specification defines Transport Layered Layer Layered the overall protocol and packet for- OSI SRIO mats. All packets are 256 payload Network Transport Architecture Layer Layer Architecture bytes or less. The transactions use load/store/DMA operations targeted Data- Link Physical to a 34-/50-/66-bit address space. Layer Layer The transactions include: Physical Layer • NREAD – read operation (data returned is the response) • NWRITE – write operation, Figure 3 – Layered SRIO architecture no response

Third Quarter 2007 Xcell Journal 37 CONNECTIVITY

• NWRITE_R – robust write, with response from the target endpoint Logical Specifications I/O System Message Global Shared Flow Data • SWRITE – streaming write Passing Memory Control Streaming • ATOMIC – atomic read-modify-write • MAINTENANCE – system discov- Transport Specifications Common ery, exploration, initialization, config- Transport uration, and maintenance operations

SRIO – An Advantage Scenario Physical Specifications 1x/4x Serial Higher Speed A four-lane SRIO link running at 3.125 PHYs Gbps can deliver 10 Gbps throughput with full data integrity. Because SRIO is similar to microprocessor buses – memory and Ancillary Specifications Interoperability Error System Multicast device addressing instead of the software Management Bring Up management of LAN protocols – packet processing is implemented in hardware. Figure 4 – SRIO specification This means significantly lower I/O process- ing overhead, lower latency, and increased system bandwidth. But unlike • Ability to support high reliability: throughput can easily be accomplished by interfaces, SRIO has inter- adopting SRIO as the computing intercon- • Lossless protocol faces and scalable bandwidth based on nect. Furthermore, a broad ecosystem of 3.125-Gbps links. • Automatic retraining and device vendors makes the selection of OTS parts synchronization and components easy. Computing Resources in Platforms • System-level error management SRIO is a packet-based protocol that Today’s applications demand greater pro- supports: cessing resources. Hardware-based imple- • Ablity to support communications • Data movement using packet-based mentations are gaining traction. data plane: operations (read, write, message) Compression/decompression algorithms, • Multicast firewall applications like anti-virus and • I/O non-coherent functions and cache intrusion detection, and security applica- • Traffic-managed (lossy) operation coherence functions tions requiring encryption engines like AES, • Link, class, and stream-based • Efficient interworking and protocol Triple DES, and Skipjack have been target- flow control encapsulation through support for data ed for hardware implementations after being • Protocol interworking streaming and segmentation and initially implemented in software. This reassembly functions demands a massive parallel ecosystem of • Higher transaction concurrency shared bandwidth and processing power. • A traffic-management framework by • Modular and extendable Shared or distributed processing harnessing enabling millions of streams, support for through systems using CPUs, NPUs, • Broad ecosystem support 256 traffic classes, and lossy operations FPGAs, or ASICs is required. The SRIO protocol was designed to • Flow control to support multiple trans- Considering all of these application-spe- support all of the disparate requirements action request flows, provision for QoS cific requirements for building a future- driven by compute devices in wireless proof system, the requirements for • Priorities support to alleviate problems infrastructures. computing resources include: like bandwidth allocation, transaction The SRIO specification (Figure 4) ordering, and deadlock avoidance • Multiple hosts – distributed processing defines a packet-based layered architecture to support multiple domains or market seg- • Topology support for standard (trees • Direct peer-to-peer communications ments for system architects to design next- and meshes) and arbitrary hardware • Multiple heterogeneous OSs generation computing platforms. 
Features (daisy-chain) topologies through system discovery, configuration, and bring-up, • Complex topologies: like architectural independence, ability to deploy scalable systems with carrier-grade including support for multiple hosts • Discovery mechanisms reliability, advanced traffic management, • Error management and classification • Redundant paths (failover) and provisioning for high performance and (recoverable, notification, and fatal)

38 Xcell Journal Third Quarter 2007 CONNECTIVITY

Xilinx IP Solutions for SRIO The complete Xilinx endpoint IP because fewer LUTs are required, and the Xilinx endpoint IP solutions for SRIO are LogiCORE solution for SRIO has been device has a highly optimized symmetric designed to RapidIO specification v1.3. exhaustively tested, hardware validated, routing architecture. The complete Xilinx endpoint IP solution and is undergoing interoperability test- for SRIO comprises the following compo- ing with leading SRIO device vendors. Memory nents (Figure 5): The LogiCORE IP is delivered through Virtex-5 memory solutions include LUT the Xilinx CORE Generator™ software RAMs, block RAMs, and memory con- • The Xilinx endpoint IP for SRIO is a GUI tool that enables user customiza- trollers for interfacing to large memories. soft LogiCORE™ solution. It sup- tion for baud rates, endpoint configura- The block RAM structure includes pre- ports fully compliant maximum-pay- tion, and support for extended features engineered FIFO logic – embedded error load operations for both sourcing and like flow control, re-transmit suppres- checking and correction (ECC) logic that receiving user data through target and sion, doorbell, and messaging. This can be used for external memories. In addi- initiator interfaces on the logical (I/O) enables you to create a flexible, scalable, tion, Xilinx provides comprehensive design and transport layers. and customized SRIO endpoint IP opti- resources to instantiate memory controller • The buffer layer reference design is mized for your application. blocks in a system design through the provided as source code to perform Memory Interface Generator (MIG) tool. automatic packet re-prioritization Virtex-5 FPGA Computing Resources This enables you to leverage hardware-ver- and queuing. The Xilinx endpoint IP for SRIO ensures ified solutions and focus your efforts on that high-speed connectivity is established other crucial sections of your designs. • The SRIO physical layer IP imple- between the link partner using the SRIO ments link training and initialization, protocol. The IP consumes <20% of the Parallel and Serial I/Os discovery and management, and error available logic resources in the smallest SelectIO™ technology is capable of virtual- and retry recovery mechanisms. Virtex™-5 device, thereby ensuring that ly any parallel source-synchronous interface Additionally, the high-speed trans- the user design has access to the most the customer needs in the design. Using the ceivers are instantiated in the physical logic/memory/I/Os for targeting a system SelectIO interface, you can easily create layer IP to support one- and four-lane application. Let’s review the resources in industry-standard interfaces for more than SRIO bus links at line rates of 1.25 Virtex-5 devices. 40 different electrical standards or propri- Gbps, 2.5 Gbps, and 3.125 Gbps. etary interfaces. The SelectIO interface • The register manager reference design Logic Blocks offers maximum rates at 700 Mbps single- enables the SRIO host device to con- The Virtex-5 logic architecture, with six- ended and 1.25 Gbps differential. figure and maintain endpoint device input look-up tables (LUTs) based on All Virtex-5 LXT FPGAs have a GTP configuration, link status, control, and the 65-nm process, offers the highest transceiver capable of running at speeds time-out mechanisms. In addition, FPGA capacity. Along with improved from 100 Mbps to 3.2 Gbps. 
The GTP ports are provided on the register man- carry logic, this provides a 30% perform- transceiver is also one of the industry’s low- ager for the user design to probe the ance benefit over previous devices. Power est power MGTs, with less than 100 mW status of the endpoint device. consumption is significantly lowered per transceiver. The design flow process for high-speed serial design is alleviated by the introduction of proven design techniques and methodologies to simplify the design. Logical (I/O) In addition, new design tools (RocketIO™ transceiver wizard and User and Physical RapidIO Transport Layer IBERT) and new silicon capabilities (TX

Logic Buffer Fabric Layer and RX equalization and built-in pseu- do-random bit sequence (PRBS) genera- tor and checker) allow you to exploit the features and benefits of migrating archi- tectures from parallel I/O standards to more than 30 serial standards and Register Manager emerging serial technologies.

DSP Blocks Figure 5 – Xilinx endpoint IP architecture for SRIO Each DSP48E slice can provide 550-MHz performance, enabling you to create appli-

Third Quarter 2007 Xcell Journal 39 CONNECTIVITY

cations that require single-precision floating-point capability like multimedia, video and imaging applications, and digital communications. This provides expanded functionality compared to previous devices, as well as a power advantage, reducing dynamic power consumption by more than 40%. The number of DSP48E slices has also been increased in Virtex-5 FPGAs, which optimizes the ratio of these blocks relative to the available logic resources and memory.

Integrated I/O Blocks
All Virtex-5 LXT FPGA devices have one endpoint block for implementing a PCIe function. This hard IP endpoint block scales easily from x1 to x2, x4, or x8 links through simple reconfiguration. The block has also passed stringent PCI-SIG compliance and interoperability tests for x1, x4, and x8 links, considerably easing user adoption of PCIe.

All Virtex-5 LXT FPGA devices also have four tri-mode Ethernet media access controllers (TEMACs) capable of 10-/100-/1,000-Mbps speeds. The block provides dedicated Ethernet functionality, which together with Virtex-5 LXT RocketIO transceivers and SelectIO technology enables you to connect to a wide variety of network devices.

Using the two integrated I/O blocks for PCIe and Ethernet, you can create a range of customized packet processing and network products that provide a significant reduction in utilization and power consumption. Using these varied resources available in Xilinx FPGAs, you can easily create and deploy intelligent solutions.

Let's look at some system design examples using SRIO and DSP technologies.

SRIO Embedded System Application
Consider an embedded system built around x86 architecture-based CPUs. The CPU architecture has been highly optimized and can easily cater to applications requiring "number crunching." You can easily implement algorithms in hardware and software that use CPU resources to perform functions like e-mail, database management, and word processing that do not require extensive multiplication. Performance is measured in millions or billions of instructions/operations per second, and efficiency is measured in terms of the time or cycles required to complete a specific operation.

High-performance applications requiring a wide range of fixed- and floating-point operations take longer to process the data. Examples include signal filtering, fast Fourier transforms, vector multiplication and searching, image/video analysis and format conversion, and simple number-crunching algorithms. High-end signal processing architectures implemented in DSPs can easily perform these tasks and optimize such operations. The performance of these DSPs is measured in multiply-accumulates per second.

You can easily design embedded systems that use CPUs and DSPs to take advantage of both processing techniques. Figure 6 shows an example system using FPGAs, CPUs, and DSP architectures.

Figure 6 – CPU-based scalable, high-performance embedded system

In high-end DSPs, the primary data interconnect is SRIO; in x86 CPUs, the primary data interconnect is PCIe. As shown in Figure 6, FPGAs can easily be deployed for scaling the DSP application or for bridging across disparate data interconnect standards like PCIe and SRIO. In the system depicted in Figure 6, the PCIe system is hosted by the root complex chip set, while the SRIO system is hosted by a DSP. The 32-/64-bit PCIe address space (base address) can be intelligently mapped to the 34-/66-bit SRIO address space (base address). The PCIe application communicates with the root complex through memory or I/O reads and writes. These transactions can be easily mapped to SRIO space through NReads/NWrites/SWrites. Designing such bridge functions is easy in Xilinx FPGAs because the back-end interfaces for the Xilinx PCIe and SRIO endpoint functional blocks are similar. The "packet-queue" block can then perform the crossover from PCIe to SRIO, or vice versa, to establish packet flow across either protocol domain.
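The address-window translation at the heart of such a packet-queue bridge can be sketched in a few lines of C. This is only an illustrative model of the base-and-offset arithmetic: the window base addresses, sizes, and the function name below are assumptions made for this sketch, not part of the Xilinx PCIe or SRIO cores.

#include <stdint.h>

/* Hypothetical bridge window: one PCIe BAR region is mapped 1:1 onto a
 * region of the 34-bit SRIO address space (names and values are examples). */
#define PCIE_WINDOW_BASE  0x80000000ULL    /* base address seen by the root complex */
#define PCIE_WINDOW_SIZE  0x00100000ULL    /* 1-MB bridged window                   */
#define SRIO_WINDOW_BASE  0x000200000ULL   /* corresponding SRIO base address       */

/* Translate a PCIe memory-transaction address into an SRIO NWrite/NRead
 * target address. Returns 0 and sets *srio_addr on success, -1 on a miss. */
static int pcie_to_srio(uint64_t pcie_addr, uint64_t *srio_addr)
{
    if (pcie_addr < PCIE_WINDOW_BASE ||
        pcie_addr >= PCIE_WINDOW_BASE + PCIE_WINDOW_SIZE)
        return -1;                          /* outside the bridged window */
    *srio_addr = SRIO_WINDOW_BASE + (pcie_addr - PCIE_WINDOW_BASE);
    return 0;
}

In hardware, the packet-queue logic would perform the same base-and-offset arithmetic on each transaction header before handing the packet to the SRIO endpoint interface.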


SRIO DSP System Application
In applications where DSP processing is the primary architectural requirement, the system architecture can be designed as depicted in Figure 7. Virtex-5 FPGA-based DSP processing can act as an intelligent co-processing solution alongside the other DSP devices in the system. The complete DSP system solution can be scaled easily if SRIO is used as the data interconnect. Such solutions can be future-proofed, provide extensibility, and can be supported across multiple form factors. In DSP-intensive applications, fast number crunching or data processing can be accomplished by offloading that processing to the x86 architectures. The Virtex-5 FPGA can easily be used to connect the PCIe subsystem and the SRIO architecture to enable efficient function offload.

Figure 7 – DSP-intensive farms

SRIO Baseband System Application
Existing 3G networks are maturing at a rapid pace, and OEMs are deploying new form factors to alleviate specific capacity and coverage problems. To solve such unique challenges and address market trends, FPGA-based DSP architectures using SRIO as the data-plane standard make perfect sense. In addition, legacy DSP systems can be quickly provisioned and targeted to new, fast, low-power FPGA DSP architectures to gain scalability advantages.

As shown in the system depicted in Figure 8, you can design Virtex-5 FPGAs to meet existing demands of line-rate processing of antenna traffic and also provide connectivity to other system resources through SRIO. Migrating existing legacy DSP applications, which have inherently slow parallel connectivity, is easy because of the SRIO endpoint functions that can be targeted to Virtex-5 FPGAs.

Figure 8 – Scalable baseband uplink/downlink card

Conclusion
SRIO is appearing in a wide array of new applications, largely centered around DSPs in wired and wireless systems. The key advantages of implementing SRIO architectures in Xilinx devices are:

• Availability of a complete SRIO endpoint solution
• Flexibility and scalability to produce different classes of products with the same hardware and software architecture
• Low power with new GTP transceivers and 65-nm technology
• Easy configurability through the CORE Generator software GUI tool
• Proven hardware interoperability with leading industry vendors supporting SRIO connectivity on their devices
• Lower overall system cost by achieving system integration through use of integrated I/O blocks like PCIe and TEMAC

In addition, Virtex-5 FPGAs have DSP resources that can meet the requirements of existing legacy DSP systems in terms of power, performance, and bandwidth. Additional benefits accrue in terms of system integration – through the availability of functional blocks like Ethernet MACs, endpoint blocks for PCIe, processor IP blocks, memory elements, and controllers. Also, you can achieve significant overall system cost savings through the exhaustive list of IP cores that support multiple source aggregation in the FPGA.

TAKE THE NEXT STEP (Digital Edition: www.xcellpublications.com/subscribe/)
• Learn more about RapidIO
• Evaluate the RapidIO physical layer core
• Evaluate the RapidIO logical I/O and transport layer core
• Sign up for a connectivity training course
• Download the Virtex-5 RocketIO GTP Transceiver User Guide

CONNECTIVITY

The NXP/PLDA Programmable PCI Express Solution
Based on the Spartan-3 FPGA, NXP and PLDA's PHY/controller solution delivers affordability and ease of integration.

by Ho Wai Wong-Lam
Marketing Manager
NXP Semiconductors
[email protected]

Martin Gallezot
Marketing and Sales Director
PLDA
[email protected]

Developers today are increasingly squeezed between providing ever-greater throughput for data-hungry applications and shrinking product lifecycles. Take frame grabber products, for example, which are image-processing computer boards that capture, process, and store image data for a wide range of scientific, medical, and industrial applications. Advances in image sensing continue to put greater demand on the interconnect bandwidth as the number of pixels, frame rates, and number of cameras served by a single frame grabber increase. Some applications already require throughput exceeding 1 Gbps.

To meet the demands of today's development environment, companies are adopting PCI Express, the next-generation I/O interconnect technology of choice, and increasingly relying on FPGA-based interoperable, proven solutions that reduce integration risk and decrease a product's time to market.

PCI Express is designed to replace the PCI and PCI-X interconnect technologies, providing scalable bandwidth from 250 MBps at x1 to 8 GBps at x32. The v2.0 specification further increases bandwidth by doubling the one-lane data rate from 2.5 Gbps in the v1.1 specification to 5 Gbps. Besides increased throughput, PCI Express offers a smaller connector than PCI/PCI-X, which lowers costs and eases PCB routing. And to ease the transition from PCI/PCI-X, PCI Express is backward-compatible with PCI with regard to software.

Interoperable solutions such as the NXP PHY/PLDA IP controller offer the advantages of PCI Express coupled with the security of a PCI-SIG-proven solution, allowing you to concentrate on application-specific functionality.


NXP's PX1012A
NXP Semiconductors (previously Philips Semiconductors) offers the PX1012A single-lane 2.5-Gbps PCI Express PHY device, which serves as a companion chip to FPGAs or digital ASICs. The PX1012A is ideal for use with PLDA's XpressLite PCI Express IP core.

The PX1012A is optimized for use with low-cost FPGAs. It is available in a very small package, delivers superior transmit and receive performance, and is compliant with PCI Express specifications v1.0a and v1.1. The PX1012A is designed to serve PCI Express applications in all kinds of form factors, from space-constrained and low-power ExpressCard modules to desktop add-on cards to PXIe test equipment rack add-on modules. The NXP PCI Express PHY PX1012A has the following key features:

• Compliant with the PCI Express base specification v1.0a and v1.1; the NXP PHY passed the first official U.S. PCI-SIG v1.1 electrical compliance tests in December 2006.

• The Xilinx® Spartan™-3 FPGA communicates with the PX1012A PHY using the source-synchronous 250-MHz PXPIPE standard, based on SSTL_2 I/O. The use of source-synchronous clocking eases PCB layout by making the data transactions between the FPGA and PHY devices more robust than the original PIPE standard, which uses only one clock for both transmit and receive data.

• A small 9 x 9-mm BGA package with two signal rings (so that only one inner-ring signal needs to escape between the balls). In fact, the NXP PHY can be laid out with just two PCB signal layers, which has been proven in actual hardware designs. Example reference schematics are available on request.

• Low power dissipation in normal L0 mode (typically <300 mW including I/O). For a further reduction in power dissipation, removing the termination resistors on the PHY/FPGA interface and optimizing the PCB layout can reduce PX1012A power dissipation to <150 mW.

• Both commercial temperature grade (0 to 70°C) and industrial grade (-40 to 85°C) parts are available; this is a unique industrial-temperature-grade PCI Express PHY, making the NXP PHY suitable for various industrial applications.

PLDA's XpressLite IP Core
PLDA's XpressLite PCI Express IP core is a complete PCIe digital controller optimized for Spartan-3 FPGAs. It includes all three layers of the PCIe specification (physical, data link, and transaction layer), plus an additional application layer called the EZ DMA interface.

PLDA's EZ DMA interface is suitable for designers who have little or no experience with the PCI Express protocol, or for experienced designers looking for a robust yet simple PCI Express interface. The EZ DMA interface provides designers with a target path, which includes a simple address/data bus, and a master path, which comprises multiple DMA engines that handle the transfer of data to the host system memory. The EZ DMA interface can also be used together with the built-in PCI Express controller in Virtex™-5 LXT devices.

The XpressLite controller is an RTL-level IP core fully compliant with the PCI Express protocol. As required, the configuration space (located within the transaction layer) implements all configuration registers and associated functions. The configuration space also generates all messages (PME#, INT, error, power slot limit), MSI requests, and completion packets from configuration requests, which flow in the direction of the root complex.

The XpressLite is fully configurable through a graphical wizard, making it simple to customize such parameters as maximum payload size, configuration space registers, buffer sizes, and number of DMA channels. The wizard generates a wrapper that instantiates the core top level, connects ports, and assigns parameters according to the specified options. Unused core features are not synthesized.

The XpressLite is PCIe specification 1.1-compliant. It has a small footprint and minimal memory utilization. A typical implementation requires about 8,000 LUTs and seven block RAMs in a Spartan-3 device. These figures include the EZ multi-DMA application layer configured with two DMA channels. The PCI Express features supported by the XpressLite controller are summarized in Table 1.

Core Type: Legacy or Native Endpoint
Maximum Payload: Up to 2 KB
Backend Data Path: 64 bit
Virtual Channels: One
BARs, Expansion ROM: User-defined, set by the XpressLite wizard
PCI ID: User-defined, set by the XpressLite wizard
Legacy Power Management: Minimal or full, set by the XpressLite wizard
Message Signaled Interrupt (MSI): Message count 1 to 32, set by the XpressLite wizard; 64-bit address: yes; per-vector masking: no
EZ Multi-DMA Interface: DMA channels up to eight, set by the XpressLite wizard; one to eight simultaneous outstanding requests, set by the XpressLite wizard; maximum DMA transfer size up to 4 GB

Table 1 – PCI Express features supported by the XpressLite controller
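The EZ DMA master path is described above only at block-diagram level, so the fragment below is a purely invented illustration of how an application might hand a buffer to one of the DMA engines. The register layout, offsets, bit fields, and function name are assumptions for this sketch and are not PLDA's actual programming interface.

#include <stdint.h>

/* Hypothetical EZ DMA channel registers as seen from the user design.
 * All offsets and bit definitions here are illustrative assumptions only. */
struct ezdma_chan_regs {
    volatile uint32_t host_addr_lo;   /* PCIe address of the host buffer, low 32 bits   */
    volatile uint32_t host_addr_hi;   /* high 32 bits (64-bit addressing)               */
    volatile uint32_t length;         /* transfer length in bytes                       */
    volatile uint32_t control;        /* bit 0 = start, bit 1 = direction (1 = to host) */
};

#define EZDMA_START     (1u << 0)
#define EZDMA_TO_HOST   (1u << 1)

/* Queue a card-to-host (DMA write) transfer on one channel. */
static void ezdma_write_to_host(struct ezdma_chan_regs *ch,
                                uint64_t host_addr, uint32_t len)
{
    ch->host_addr_lo = (uint32_t)host_addr;
    ch->host_addr_hi = (uint32_t)(host_addr >> 32);
    ch->length       = len;
    ch->control      = EZDMA_START | EZDMA_TO_HOST;  /* completion signaled by interrupt */
}

A real driver would also track completion interrupts and multiple outstanding requests, which Table 1 indicates the core supports.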


PLDA's core package also includes a complete test bench with an RTL-level PCI Express bus functional model (BFM), transactor, monitor, and checker.

Fully Compliant and Interoperable Solution
The NXP/PLDA joint solution successfully completed PCI Express compliance tests administered at the PCI-SIG Compliance Workshop #48 in December 2005 and at later workshops in 2006 (see the transmitter eye pattern results in Figure 1).

Figure 1 – PLDA XpressLite SP3 board transmitter eye pattern with v1.1 PCI Express template (top: non-transition eye; bottom: transition eye)

NXP, PLDA, and mutual customers have run extensive and successful system tests in a multitude of PC systems, including:

• ASUS A8NE (x16, two x1 slots, and one x4 slot)
• ASUS P5GP motherboard (Intel 915G chipset)
• ASUS P5LD2 DELUXE (Intel i945 chipset)
• Dell Dimension 4700, x1 and x16 slots (Intel 915G chipset)
• Dell Dimension 8400 (Intel 925G chipset)
• Dell Precision 370
• Dell Precision 470, x8 and x16 slots (Intel Turnwater E7525 chipset)
• HP XW4200
• HP XW4300
• MSI (Intel i915 chipset)
• Serverworks GC-SL
• Shuttle ATI Express 200 chipset
• Supermicro X6DA8G (x16 and x4 slots)

In addition to the standard PCI-SIG compliance tests, NXP and PLDA also used a variety of system tests developed internally and from third-party test equipment vendors:

• PCI scan diagnostic utility
• Transmitter electrical compliance tests v1.0a and v1.1
• PCI-SIG PCI-ECV v1.2
• Agilent protocol test card (PTC) test
• Agilent receiver bit error rate (BER) test
• NXP receiver performance test
• PLDA throughput measurement test

High Throughput
Theoretical maximum throughput is a function of payload size and PCI Express protocol overhead. Actual throughput results depend on factors such as software driver efficiency, PCI Express IP core efficiency, the user's application design, transmitter jitter, and receiver BER performance. Throughput might be further compromised by link-layer protocol overhead such as ACK/NAK (acknowledge/non-acknowledge) packets, re-transmitted packets, and the flow control protocol (credit reporting).

Theoretical Maximum Throughput
The line speed of one-lane PCI Express is 2.5 Gbps. However, 8B/10B encoding overhead reduces the maximum PXPIPE data throughput (2.5 Gbps divided by 10 bits per byte = 250 MBps). Given this overhead cost, throughput typically increases with payload size – up to a point. For example, for a payload of 128 bytes, the theoretical efficiency is 86% (a 128-byte payload plus a 12-byte header and 8 bytes of framing) and the maximum theoretical throughput is 216 MBps. Although the PCI Express specification allows a maximum payload size of as much as 4 KB, most existing applications only implement a maximum payload size of 128 or 256 bytes. Throughput measurements for the NXP/PLDA solution are listed in Table 2.

Table 2 – Throughput measurement results
PCIe PHY: PX1012A
IP Core: PLDA
Computer Platform: ASUS A8NE, Supermicro X6DA8G
Payload: 128 bytes
Throughput, DMA Read (Card -> PC): 200 MBps
Throughput, DMA Write (PC -> Card): 175 MBps
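The efficiency arithmetic above is easy to reproduce. The short C program below is only an illustrative sketch (not part of any NXP or PLDA deliverable) that computes the theoretical x1 throughput for a given payload size, using the 12-byte header and 8-byte framing overhead cited in the text.

#include <stdio.h>

/* Theoretical one-lane PCI Express 1.1 throughput after 8B/10B coding:
 * 2.5 Gbps / 10 bits per byte = 250 MBps of raw symbol bandwidth.       */
#define LANE_RAW_MBPS   250.0
#define TLP_HEADER_B    12      /* transaction-layer header              */
#define TLP_FRAMING_B   8       /* sequence number, framing, LCRC        */

static double pcie_x1_throughput_mbps(unsigned payload_bytes)
{
    double efficiency = (double)payload_bytes /
                        (payload_bytes + TLP_HEADER_B + TLP_FRAMING_B);
    return LANE_RAW_MBPS * efficiency;
}

int main(void)
{
    /* Prints roughly 216 MBps for 128-byte payloads, matching the text. */
    printf("128B payload: %.0f MBps\n", pcie_x1_throughput_mbps(128));
    printf("256B payload: %.0f MBps\n", pcie_x1_throughput_mbps(256));
    return 0;
}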
PCI Express Receiver Performance
NXP and PLDA have performed extensive proprietary system tests in many PC systems to ensure that there are no recoverable receiver errors for extended periods of time (many hours). The test results yield a BER of 1x10⁻¹².

PCI Express specifications v1.0a and v1.1 require 0.6 UI (unit interval) of receiver jitter tolerance, but the specifications do not precisely indicate how the jitter components are composed. NXP performed receiver BER tests using an Agilent BER tester:

• Agilent J-BERT N4903A
• TJ = total jitter; RJ = random jitter; PJ = periodic jitter; DDJ = data-dependent jitter; UI = unit interval; ISI = inter-symbol interference; BER = bit error rate
• 0.60 UI TJ = 0.25 UI RJ + 0.25 UI PJ (at 15 MHz) + 0.1 UI DDJ
• 0.1 UI DDJ = ISI module, which simulates 9 inches of PCB trace, equivalent to about 0.1 UI of jitter and significant amplitude degradation.

NXP obtained excellent receiver performance results with the Agilent bit error tester:

• The PX1012A achieved < 1x10⁻¹² BER with an 800-mVdiff p-p input signal, which is the minimum transmit output level allowed by the PCI Express specification. The 800-mVdiff p-p signal goes from the pattern generator through the BERT ISI module before feeding into the PHY receiver.

• Without the ISI module, the PX1012A can achieve < 1x10⁻¹² BER with a 400-mVdiff p-p input signal.



Figure 2 – The PLDA XpressLite SP3 is based on a Xilinx Spartan-3 FPGA and includes NXP’s PCIe PX1012A PHY and PLDA’s XpressLite IP controller.

PLDA XpressLite SP3 Design Kit
The PLDA XpressLite SP3 Design Kit (see Figure 2) is based on a Spartan-3 FPGA (XC3S2000 device) and integrates the NXP PX1012A. The design kit includes a Protocore (board-only) license of the PLDA XpressLite IP core.

The Protocore license is a full-featured RTL-level license that is valid for an unlimited duration when used with the design kit. It also includes a software development kit, with a library of C source code functions and a driver. With the Protocore license, you can freely simulate and synthesize your own design connected to the XpressLite IP core and reprogram the FPGA device accordingly.

The complete design has gone through extensive validation testing, is hardware-proven, and is readily available for customer prototyping.

The XpressLite SP3 design kit provides a low-cost solution for prototyping your design with PCI Express. It can also be used as part of your final product if you want to save the effort of designing your own PCIe board. A 400-hole matrix (2.54-mm step) for prototyping and probing purposes is also provided.

Conclusion
A prototyping board like PLDA's XpressLite SP3 Design Kit, which combines the power of the Xilinx Spartan-3 FPGA with a proven NXP PHY/PLDA IP controller PCI Express solution, responds to today's development challenge: producing applications that support ever greater data throughput in an ever-shrinking product lifecycle. For more information, visit www.standardics.nxp.com/products/pcie/phys/ and www.plda.com/products/board_pcie_sp3.php.

NXP also provides a joint solution with the Xilinx PCI Express PIPE core for Spartan-3 FPGAs. You can buy this solution from Xilinx, which includes an x1 PCI Express add-in card and an evaluation version of the PIPE core.

SystemBIST™ enables FPGA configuration that is less filling for your PCB area, less filling for your BOM budget, and less filling for your prototype schedule. All the things you want less of, and more of the things you do want for your PCB – like embedded JTAG tests and CPLD reconfiguration.

Typical FPGA configuration devices blindly "throw bits" at your FPGAs at power-up. SystemBIST is different – so different it has three US patents granted and more pending. SystemBIST's associated software tools enable you to develop a complex power-up FPGA strategy and validate it. Using an interactive GUI, you determine what SystemBIST does in the event of a failure, what to program into the FPGA when that daughterboard is missing, or which FPGA bitstreams should be locked from further updates. You can easily add PCB 1149.1/JTAG tests to lower your downstream production costs and enable in-the-field self-test. Some capabilities:

• User-defined FPGA configuration/CPLD re-configuration
• Run Anytime-Anywhere embedded JTAG tests
• Add new FPGA designs to your products in the field
• "Failsafe" configuration – in-the-field FPGA updates without risk
• Small memory footprint offers lowest cost-per-bit FPGA configuration
• Smaller PCB real estate, lower parts cost compared to other methods
• Industry-proven software tools enable you to get it right before you embed
• FLASH memory locking and fast re-programming
• New: At-speed DDR and RocketIO™ MGT tests for V4/V2

If your design team is using PROMs, CPLD and FLASH, or a CPU and in-house software to configure FPGAs, please visit our website at http://www.intellitech.com/xcell.asp to learn more.

Copyright © 2006 Intellitech Corp. All rights reserved. SystemBIST™ is a trademark of Intellitech Corporation. RocketIO™ is a registered trademark of Xilinx Corporation.

MEMORY INTERFACE

Create Memory Interface Designs Faster with Xilinx Solutions
Xilinx simplifies memory interface design with hardware-verified reference designs, easy-to-use software tools, and complete development kits.

by Adrian Cosoroaba
Marketing Manager
Xilinx, Inc.
[email protected]

Xilinx® FPGAs provide I/O blocks and logic resources that make interface design easier. Nonetheless, designers must still configure, verify, implement, and properly connect these I/O blocks – along with extra logic – to the rest of the system in the source RTL code, carefully simulated and verified in hardware.

In this article, I'll review the performance requirements, design challenges, and Xilinx solutions for memory interface design, from low-cost implementations with Spartan™-3 Generation FPGAs to the highest bandwidth interfaces using Virtex™-5 FPGAs.

Performance Requirements and Xilinx Solutions
In the late 1990s, memory interfaces evolved from single-data-rate SDRAMs to double-data-rate (DDR) SDRAMs, with today's DDR2 SDRAMs running at 667 Mbps per pin or higher.

Applications can generally be classified in two categories:

• Low-cost applications, where the cost of the device is most important

• High-performance applications, where getting the highest bandwidth is paramount

DDR SDRAMs and low-end DDR2 SDRAMs running below 400 Mbps per pin are adequate to meet the memory bandwidth requirements of most low-cost systems. For these applications, Xilinx offers Spartan-3 Generation FPGAs: Spartan-3, 3E, 3A, and 3AN devices. For applications that push the limits of memory interface bandwidth, like 667-Mbps-per-pin DDR2 SDRAMs, Xilinx offers Virtex-5 FPGAs.

Bandwidth is a function of both the data rate per pin and the width of the data bus. Both Spartan-3 Generation and Virtex-5 FPGAs offer distinct options, spanning a range from smaller low-cost systems with data bus widths of less than 72 bits to buses as wide as 576 bits for the larger Virtex-5 packages (Figure 1).

Figure 1 – Xilinx FPGAs and memory interface bandwidth

Wider buses at these speeds make chip-to-chip interfaces all the more challenging, requiring larger packages and better power-to-signal and ground-to-signal ratios. Virtex-5 FPGAs are built with advanced SparseChevron packaging that provides superior signal-to-power and ground-pin ratios. Every I/O pin is surrounded by sufficient power and ground pins and planes to ensure proper shielding for minimum crosstalk noise caused by simultaneously switching outputs (SSO).

Memory Interfaces with Spartan-3 FPGAs
For low-cost applications where 400-Mbps bit rates per pin are sufficient, Spartan-3 Generation FPGAs coupled with Xilinx software tools provide an easy-to-implement, economical solution.


Three fundamental building blocks comprise a memory interface and controller for an FPGA-based design: the read and write data interface, the memory controller state machine, and the user interface that bridges the memory interface design to the rest of the FPGA design. Implemented in the fabric, these blocks are clocked by the output of the digital clock manager (DCM) that, in the Spartan-3 Generation implementation, also drives the look-up table (LUT) delay calibration monitor (a block of logic that ensures proper timing for read data capture).

In the Spartan-3 Generation implementation, read data capture is implemented using the LUTs in the configurable logic blocks (CLBs). During a read transaction, the DDR2 SDRAM device sends the read data strobe (DQS) and associated data to the FPGA edge-aligned with the read data (DQ). Capturing the DQ is a challenging task in source-synchronous interfaces because data changes on every edge of the non-free-running DQS strobe. The read data capture implementation uses a tap-delay mechanism based on LUTs.

Write data commands and timings are generated and controlled through the write data interface. The write data interface uses input/output block (IOB) flip-flops and the DCM's 90-, 180-, and 270-degree outputs to transmit the DQS strobe properly aligned to the command and data bits.

The implementation of a DDR2 SDRAM memory interface has been fully verified in hardware. The design was implemented on the Spartan-3A starter kit board using a 16-bit-wide DDR2 SDRAM memory device and the XC3S700A-FG484 device. The reference design uses only a small portion of the Spartan-3A FPGA's available resources: 13% of the IOBs, 9% of the logic slices, 16% of the global buffer (BUFG) multiplexers (MUXs), and one of the eight DCMs.

You can easily customize Spartan-3 Generation memory interface designs to fit your application using the Memory Interface Generator (MIG) software tool.

Memory Interfaces with Virtex-5 FPGAs
With higher data rates, interface timing requirements become more challenging. The trend toward higher data rates presents a serious problem to designers in that the data valid window – that period within the data period during which DQ can be reliably obtained – is shrinking faster than the data period itself. This is because the various uncertainties associated with system and device performance parameters, which impinge on the size of the data valid window, do not scale down at the same rate as the data period.

This trend is readily apparent when comparing the data valid windows of DDR SDRAMs running at 400 Mbps and DDR2 memory technology, which runs at 667 Mbps. The DDR device, with a 2.5-ns data period, has a data valid window of 0.7 ns, while the DDR2 device, with a 1.5-ns period, has a mere 0.14 ns.

Virtex-5 FPGAs address this challenge with dedicated delay and clocking resources in the I/O blocks – called ChipSync™ technology. The ChipSync block built into every I/O contains a string of delay elements (also known as tap delays), called IODELAY, with a resolution of 75 ps.

The architecture of the implementation is based on several building blocks. The user interface that bridges the memory controller and physical-layer interface to the rest of the FPGA design uses a FIFO architecture. The FIFOs hold the command, address, write data, and read data. The main controller block controls the read, write, and refresh operations. Two other logic blocks execute the clock-to-data centering for read operations: the initialization controller and the calibration logic (Figure 2).

Figure 2 – Virtex-5 FPGA memory interface architecture

The physical-layer interface for address, control, and data is implemented in the IOBs. The DQ capture uses the memory strobe to capture the corresponding DQ and register it with a delayed version of the strobe. This data is then synchronized to the system clock domain in a second stage of flip-flops. The input serializer/deserializer feature in the I/O block is used for read capture – the first pair of flip-flops transfers the data from the delayed strobe to the system clock domain. The technique involves the use of 75-ps tap-delay (IODELAY) elements that are varied during a calibration routine implemented by the calibration logic. This routine is performed during system initialization to set the optimal phase between strobe, data, and the system clock to maximize timing margins.

There are other aspects of the design, such as the overall controller state machine logic generation and the user interface. To make the complete design easier for the FPGA designer, Xilinx developed the Memory Interface Generator.
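The calibration routine itself is implemented in FPGA logic, but its search strategy is easy to illustrate in software. The C model below is only a sketch of one plausible approach under stated assumptions: sweep the 75-ps IODELAY taps, find the range over which read data is captured correctly, and park the delay at its center. The helper functions and the fixed tap count are assumptions for this sketch, not the MIG implementation.

#include <stdint.h>

#define NUM_TAPS  64       /* assumed number of usable IODELAY taps        */
#define TAP_PS    75       /* Virtex-5 IODELAY resolution in picoseconds   */

/* Platform hooks assumed to exist: set the strobe delay, then run a
 * read-back test pattern, returning 1 if every sample was captured. */
extern void set_idelay_taps(unsigned taps);
extern int  read_pattern_ok(void);

/* Returns the tap setting at the center of the largest passing window,
 * or -1 if no tap setting captured data reliably. */
int calibrate_read_capture(void)
{
    int best_start = -1, best_len = 0;
    int run_start = -1;

    for (int tap = 0; tap <= NUM_TAPS; tap++) {
        int pass = 0;
        if (tap < NUM_TAPS) {
            set_idelay_taps(tap);
            pass = read_pattern_ok();
        }
        if (pass && run_start < 0)
            run_start = tap;                    /* window opens  */
        if (!pass && run_start >= 0) {          /* window closes */
            int len = tap - run_start;
            if (len > best_len) { best_len = len; best_start = run_start; }
            run_start = -1;
        }
    }
    if (best_len == 0)
        return -1;
    int center = best_start + best_len / 2;     /* center of the data valid window */
    set_idelay_taps(center);
    return center;
}

With 75-ps taps, a run of eight consecutive passing taps, for example, corresponds to roughly 0.6 ns of usable margin.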


Design and Integration with MIG
Integrating all of the building blocks, including the memory controller state machine, is essential for the completeness of your designs. Controller state machines vary with the memory architecture and system parameters. State machine code can also be complicated and a function of many variables, such as architecture, data bus width, depth, access algorithms, and data-to-strobe ratios.

The complete design can be generated with the MIG, a software tool freely available from Xilinx as part of the ISE™ software CORE Generator™ suite of reference designs and IP. The MIG design flow is very similar to the traditional FPGA design flow.

The benefit of the MIG for designers is that there is no need to generate RTL code from scratch for the physical-layer interface or the memory controller. You can use the MIG's GUI to set system and memory parameters (Figure 3). For example, after selecting the FPGA device, package, and speed grade, you can select the memory architecture and pick the actual memory device or dual in-line memory module (DIMM). The same GUI provides a selection of bus widths and clock frequencies. Other options provide control of the clocking method, CAS latency, burst length, and pin assignments.

Figure 3 – MIG GUI

The MIG tool can generate in a matter of minutes the RTL and UCF files, which are the HDL code and constraints files, respectively. These files are generated using a library of hardware-verified reference designs, with modifications based on user inputs. You have complete flexibility to further modify the RTL code. Unlike other solutions that offer "black-box" implementations, the code is not encrypted, providing complete flexibility to change and further customize a design. The output files are categorized in modules that apply to different building blocks of the design: user interface, physical layer, or controller state machine. You may wish to customize the state machine that controls the bank access algorithm, for example. After the optional code change, you can perform additional simulations to verify the functionality of the overall design.

The MIG also generates a synthesizable test bench with memory checker capability. The test bench is a design example used in the functional simulation and the hardware verification of the Xilinx base design. The final stage of the design is to import the MIG files into the ISE project, merge them with the rest of your FPGA design files, conduct synthesis and place and route, run additional timing simulations if needed, and finally perform verification in hardware. The MIG software also generates a batch file with the appropriate synthesis, map, and place and route options to help you optimally generate the final bit file.

Development Boards and Kits
Hardware verification of reference designs is an important final step to ensure a robust and reliable solution. Xilinx has verified the memory interface designs for both Spartan-3 Generation and Virtex-5 FPGAs. Table 1 shows the memory interfaces supported for each of the development boards. The development boards range from low-cost Spartan-3 Generation FPGA implementations to the high-performance solutions offered by the Virtex-5 FPGA family.

Table 1 – Development boards and kits for memory interfaces
Spartan-3E: Starter Kit – DDR
Spartan-3A: Development Kit – DDR2
Spartan-3AN: Starter Kit – DDR2
Virtex-5: ML-561 – DDR, DDR2, QDR-II, RLDRAM II

Conclusion
You can complete your memory interface designs faster with a wide spectrum of Xilinx FPGA devices, dedicated software tools like the MIG, and development boards that fit your application needs, from low cost to high bandwidth. For more information and details on these memory interface solutions, visit www.xilinx.com/memory.

TAKE THE NEXT STEP
• Download the Memory Interface Generator to generate your reference designs for Virtex-5, Virtex-4, and Spartan-3 devices, including HDL code and pin placements
• See a demo on Memory Interface Solutions with Xilinx FPGAs
• View an on-demand webcast on Low-Cost to High-Performance Memory Interface Design Made Easy with Xilinx FPGAs
• Contact Xilinx

INTELLECTUAL PROPERTY

Driving Home Multimedia
Infotainment hits the road with affordable in-car networking.

by Carl Rohrer
Senior Design Engineer
Xilinx, Inc.
[email protected]

Per square foot, you probably have more multimedia in your car than you do in your house. There are the two LCD screens for the kids in the backseat, controlled by a DVD player or a video game console. An audio system is powered by the latest MP3 player. There's the navigation system and even broadcast television in some luxury vehicles. Plus, you've probably got more speakers in your car than in a high-end surround sound system. It's no wonder there are so many distracted drivers on the road. What you need is a single, simple control interface; what manufacturers need is a sophisticated network.

MOST (Media Oriented Systems Transport) is a network standard on the rise among automakers and automotive suppliers. It provides one single interface for managing all of your multimedia devices. Its strength resides in the ability to handle multiple streams of data intended for different targets without losing congruity. On-time data: this is something even your home network cannot guarantee.

In this article, I'll explore the MOST network and demonstrate the flexibility of the Xilinx® MOST solution.

Under the Hood
The MOST network runs on optical fiber, typically in a ring topology. The clock and serialized data are bi-phase encoded, requiring just a single piece of fiber to be routed. MOST offers up to 25 Mbps of aggregate bandwidth, far surpassing classical automobile networks. In other words, you could play 15 distinct audio streams at the same time.

Each multimedia device is represented by a node in the ring. A typical MOST network will have between three and 10 nodes. There is one timing master that drives the system clock and generates frames, or 64-byte sequences of data. The remaining nodes all act as slaves. One node acts as the user control interface, or MMI (man machine interface). Often, this is also the timing master. Figure 1 illustrates a basic MOST ring.

The main payload comprises 60 bytes within the 64-byte frame. This payload is made up of synchronous and asynchronous fields. The synchronous field is used for streaming contiguous data; audio and video fall under this category. The asynchronous field is used for sporadic data transfers in applications such as Internet access, navigation data transfer, and phone book synchronization. This channel could also be used for firmware upgrades of the control units.

A node may send or receive data during its assigned time slots. A time slot – one synchronous byte within the payload – is dynamically allocated between a requesting node and the timing master. Typically, one node will transmit data onto a time slot, while any number of other nodes will collect the data from that time slot.

The boundary between the synchronous and asynchronous fields is dynamically controlled by the timing master. In any given frame, the synchronous field may be from 4 to 60 bytes, leaving the remainder of the 60 bytes to the asynchronous field.

The remaining four bytes of the frame are used for header, trailer, and control information. The header contains a preamble for frame alignment. The trailer, among other things, carries a parity check. The control field is used for network-related messaging. These messages may be low level, like the allocation and de-allocation of time slots, or they can be high-level application messages sent from the operator, such as play next track, volume control, or repeat play.
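As a concrete illustration of the 64-byte frame just described, the C structure below models one frame with its 60-byte payload and the 4 bytes of header, control, and trailer information. This is a didactic sketch only: the exact byte positions and the helper function are assumptions, not the bit-exact MOST wire format, and the synchronous/asynchronous boundary is treated here as a value signaled separately.

#include <stdint.h>
#include <assert.h>

#define MOST_PAYLOAD_BYTES  60   /* synchronous + asynchronous fields            */
#define MOST_SYNC_MIN        4   /* synchronous field may span 4..60 bytes       */

/* Didactic view of one MOST frame (field placement is illustrative). */
struct most_frame {
    uint8_t preamble;                     /* header: frame alignment               */
    uint8_t payload[MOST_PAYLOAD_BYTES];  /* synchronous bytes first, async after  */
    uint8_t control[2];                   /* 2 control bytes; 16 consecutive frames
                                             carry one complete control message    */
    uint8_t trailer;                      /* parity check, among other things      */
};                                        /* 1 + 60 + 2 + 1 = 64 bytes             */

/* Copy one audio byte into an allocated synchronous time slot.
 * 'boundary' is the current synchronous field length (4..60). */
static void put_sync_byte(struct most_frame *f, unsigned boundary,
                          unsigned slot, uint8_t value)
{
    assert(boundary >= MOST_SYNC_MIN && boundary <= MOST_PAYLOAD_BYTES);
    assert(slot < boundary);
    f->payload[slot] = value;
}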


Getting More Out of MOST
Instead of having an external MOST controller chip connected to a microcontroller or DSP, you can integrate all of your components into one single FPGA. Fewer external components and reduced PCB space translate into cost savings for developers.

Xilinx offers a fully parameterizable MOST Network Interface Controller (NIC) IP core. You can customize the core to be a timing master or reduce logic with a slave-only configuration. The core is controlled by a full suite of registers accessible through an on-chip peripheral bus (OPB) interface. The OPB interface works seamlessly with the Xilinx MicroBlaze™ 32-bit RISC processor core included in Xilinx Platform Studio.

Figure 1 – An example MOST automotive ring. The ring comprises a timing master (MMI node, A), an MP3 player node (B), and an amplifier node (C) driving the left and right speakers; the 64-byte frame carries synchronous and asynchronous fields separated by a dynamic boundary, and the control field of 16 consecutive frames makes one control message. Sequence of events for playing an audio file (AM = application message): the user presses "play" on the MMI; 1. A > B: AM "Allocate time slots for MP3 audio"; 2. B > A: allocation control message requesting four time slots (the master allocates time slots 0, 5, 7, and 8); 3. B > A: AM "Time slots 0, 5, 7, and 8 have been allocated"; 4. A > B: AM "Begin transmission of MP3 audio on allocated time slots"; 5. A > C: AM "Connect to time slots 0, 5, 7, and 8." Stereo audio can then be heard from both speakers.

A full set of low-level driver files is already available in C source code. The drivers provide a series of functions for accessing the register space, handling interrupts, and streaming data to the core. Mocean Laboratories AB provides MOST network services for the IP core for a complete network stack, leaving you only to write your desired application.

Unique to the Xilinx MOST NIC is a streaming port interface that allows data to be preprocessed in real time. It is ideal for bolting on data filters or encryption/decryption modules. This LocalLink interface, a Xilinx standard, will significantly reduce traffic on the processor and processor bus by offloading dedicated procedures. It's versatile, too: you can read or write data in either the receive or transmit direction. Best of all, if you don't use it, the Xilinx implementation tools will remove the unnecessary logic, allowing you to conserve resources and fit your design into a smaller device.

Synchronous data can be transmitted and received on either the streaming port or the OPB interface. Regardless of the method you choose and how many time slots you have assigned, the core formats data into 32-bit words for these user interfaces. Through register definitions, the core accumulates received time-slot data that is deposited through one of 16 logical channels. The reverse is true in the transmit direction. The use of these logical channels allows for 16 distinct streams of data in each direction.

The Xilinx MOST NIC core is flexible. Consider the MOST ring in Figure 1 again, which illustrates how you might design each node with the Xilinx MOST NIC. You could configure the core as a timing master to operate as the MMI. As the timing master, the core will send and receive control messages that regulate ring operation. This node will also send application messages, also through the control field, on behalf of the user. You can also use the driver files and Mocean's network services on top of MicroBlaze for event scheduling.

You can turn the MP3 player into a high-end audio feed by adding a noise-filter bolt-on to eliminate the artifacts of audio compression. The payload data can run from the codec through the noise filter directly into the streaming port, completely avoiding the OPB bus. Once again, you can use the MicroBlaze embedded processor for interrupt handling and event scheduling. Figure 2 depicts the block diagram of such a design.

Figure 2 – Theoretical MP3 node
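The driver capabilities are described above without listing the API, so the fragment below is only a hypothetical sketch of how an application running on MicroBlaze might drain received time-slot data from the 16 logical channels over the OPB interface. None of these register names, offsets, or functions come from the actual Xilinx driver files; they are invented for illustration.

#include <stdint.h>

/* Hypothetical register map of the MOST NIC OPB interface (illustrative only). */
#define MOST_NIC_BASE        0x80000000u
#define RX_STATUS_REG        (MOST_NIC_BASE + 0x00)    /* bit n set: channel n has data */
#define RX_CHANNEL_FIFO(n)   (MOST_NIC_BASE + 0x10 + 4u * (n))

#define NUM_LOGICAL_CHANNELS 16

static inline uint32_t reg_read(uint32_t addr)
{
    return *(volatile uint32_t *)addr;
}

/* Drain every logical channel that currently holds received time-slot data.
 * The core packs time-slot bytes into 32-bit words, so one read returns up
 * to four synchronous bytes for that channel. */
static void drain_rx_channels(void (*consume)(unsigned chan, uint32_t word))
{
    uint32_t pending = reg_read(RX_STATUS_REG);
    for (unsigned ch = 0; ch < NUM_LOGICAL_CHANNELS; ch++) {
        if (pending & (1u << ch))
            consume(ch, reg_read(RX_CHANNEL_FIFO(ch)));
    }
}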


For the amplifier, imagine a minimal design that simply receives data and forwards it to the speakers. Instead of using an embedded processor, as in the MP3 node, you can implement a smaller user design capable of full network negotiation and data collection. Such a compact design may be placed in a smaller device for further cost savings.

Conclusion
If you drive a high-end European automobile, you may already have a MOST network. It has gained acceptance as one of the de facto standards for automotive networking among European-based OEMs. Those of us driving more affordable cars will not have too long to wait, however. With the advent of competition, this once private standard is becoming more affordable for cost-conscious automakers.

As demand grows for larger amounts of data – from audio to video, telematics, and navigation-based applications – the MOST network technology has plans to grow as well. The next-generation standard, MOST 50, has already been defined and offers twice as much bandwidth as the original. At the time of this writing, the MOST Cooperation is planning a third-generation network expected to reach data rates of 150 Mbps and beyond. These updates will eventually increase the available application bandwidth by more than an order of magnitude and are expected to support both copper and optical physical media.

The Xilinx MOST NIC is available today in CORE Generator™ software. At six block RAMs and about 2,600 slices, it will fit in a mid-sized Spartan™-3E device with room to spare for an embedded processor, peripherals, buffers, and your own custom circuitry.

TAKE THE NEXT STEP (Digital Edition: www.xcellpublications.com/subscribe/)
• Read the MOST NIC data sheet for more details
• Contact your Xilinx FAE to test-drive the MOST NIC core

INTELLECTUAL PROPERTY

Making the Most of MOST Control Messaging
This case study presents the design of Xilinx LogiCORE MOST NIC control message processing.

by Stacey Secatch
Staff Design Engineer
Xilinx, Inc.
[email protected]

At only 2,600 slices, the Xilinx® MOST (Media Oriented Systems Transport) network interface controller (NIC) occupies roughly half of a Spartan™-3E 500 device, which leaves plenty of room for the microprocessor and other peripherals. This full-featured embedded core provides easy access to data and network control as a slave or master node.

In this article, I'll present a control message processing design case study and show how the MOST NIC is able to autonomously handle messages without any host software intervention. With careful planning, the minimal-resource PicoBlaze™ configurable state machine can handle control messages correctly and efficiently.

Control Messages on a MOST Network
MOST is an automotive infotainment-oriented network standard with a ring topology. One of its strengths is that it supports multiple simultaneous streams of media intended for several nodes in the ring. However, like any network, more than just data is sent between nodes; control communication is also required. A simple MOST ring network is shown in Figure 1, with one network master and several slaves. Six different control message types are available. Let's focus on the three that are most commonly used during operation of a MOST network.

Normal Messages
Normal messages are distinct from other types in that they are intended for high-level application use in the host processor. All other messages are used for ring operation and debugging. For example, as part of the boot sequence of the ring, the master could send a normal message to determine the identification of slave B. Normal messages may be combined in the application layer when there is more information than one message can hold.

Channel Allocation and Deallocation
Allocation messages are sent to the master node to gain access to the portion of the frame set aside for media data. The requestor indicates how many byte-wide channels are desired and the responder (master) confirms both the status of the request (granted or denied) and which channels have been granted. Deallocation messages are similar to allocations, where the requestor indicates which channels to deallocate and the master responds with deallocation granted.
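The allocation handshake is described here only in prose, and the exact MOST message encoding is not given, so the C types below are purely illustrative structures an application might use to represent an allocation request and the master's response. The field names and widths are assumptions for this sketch.

#include <stdint.h>
#include <stdbool.h>

#define MOST_SYNC_CHANNELS 60   /* at most 60 byte-wide channels per frame */

/* Illustrative allocation request: how many byte-wide channels are wanted. */
struct alloc_request {
    uint8_t requestor_addr;     /* node asking for bandwidth           */
    uint8_t channels_wanted;    /* e.g., 4 for a 16-bit stereo stream  */
};

/* Illustrative master response: grant/deny plus the granted channel list. */
struct alloc_response {
    bool    granted;
    uint8_t channel_count;
    uint8_t channels[MOST_SYNC_CHANNELS];   /* e.g., {0, 5, 7, 8}, as in the
                                               Figure 1 example of the
                                               preceding article            */
};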


Choosing the Hardware/Software Partition
Embedded design involves choosing the best boundary for the hardware/software interface in terms of functionality and resources. Choosing to implement too much in hardware is costly because of the additional resources required to implement the design. Choosing to implement too much in software can result in an overburdened microprocessor and stalled logic, possibly even resulting in a non-functional design if the processor drops events.

For the MOST LogiCORE™ solution, we found a clear boundary for dividing up control message processing.

Delay Time Requirements
Control messages are broken across multiple frames. As messages travel around the ring, specific handshaking is required between the communicating nodes, with a turnaround time of roughly one of those frames (about 23 µs). Message-response generation usually requires calculations. For example, allocation responses depend on determining whether enough free channels are available.

Because this service requirement is so tight, it makes sense to place control message handling within the core as hardware, leaving the external processor available to run the network stack and media-oriented user application.

Items to Leave to Software
Although nearly all control message types can be handled completely by the LogiCORE system, normal control messages must be passed on to the application. Therefore, the MOST NIC LogiCORE solution provides interrupts to signal that a normal message has been received and is available for processing. And because all messages are initiated by the application, there are configuration registers for the application to indicate that a message is available for sending, and interrupts to alert the application that the message sequence is complete.

Selecting the Implementation
The next step after deciding to process control messages in hardware was to choose the most efficient design possible. We could have decided to implement this in traditional fashion as a custom state machine. Preliminary estimates placed the size of the state machine at more than 1,000 slices, plus a block RAM to contain all of the message buffers. Note that control messages only use 2 bytes out of a 64-byte frame, yet processing just these two bytes would have accounted for more than a third of the core.

Instead, we chose to incorporate the PicoBlaze configurable state machine as the backbone for control channel processing and minimize core resources. In our implementation, this includes two block RAMs to hold the instruction memory, approximately 100 slices for the PicoBlaze component, and 150 slices for peripherals, as well as the original block RAM for message buffering.

Choosing a PicoBlaze controller offers you similar opportunities for resource minimization. You will probably also see the same benefits that we did, namely a shortened design cycle and easier code maintenance thanks to the simplicity of coding message processing procedurally.

Organizing Data Storage
Determining what kind of memory to use for variables when designing an RTL state machine is rather straightforward. In general, you use RAM for large buffers and stick everything else in a register. When using a PicoBlaze controller, you need to do a bit more up-front planning to minimize resources while still providing access to all of your data.
Slave C Slave A State Machine Variables The 64 “free” scratch pad registers are ideal for storing variables that you will never use outside of the PicoBlaze con- troller. For example, the accumulators to store intermediate checksum results for Slave B MOST control messages are only needed by the state machine.

Larger Buffers Block RAM is an ideal location for buffer- ing. Although the PicoBlaze controller only provides direct access to 256 locations, it is Figure 1 – A sample MOST ring probably more efficient for you to use one block RAM than to instantiate distributed

54 Xcell Journal Third Quarter 2007 INTELLECTUAL PROPERTY

are even received. Therefore, for a given PicoBlaze 8-bit frame, transmit events are always State Machine e Data Block RAM – Host processed before receive events. Controller Message Interface Strob Buffers and Large Amounts • Consider using a busy-wait loop. If you Scratch Pad – of Data State Machine Only Node have a very long event that cannot be Variables d Data Memory-Mapped Configuration/ Registers – Status processed in the time frame required by Strob Continuously ... Available ... other events, you might consider plac- Registers – Data Interrupt Temporary ing the long event in the main process- Computation Values Address ing wait loop. For example, preparing a MOST control message for transmit can take multiple Figure 2 – Memory organization for Xilinx MOST NIC control message processing frames to complete. We could have set up multiple levels of interrupt servicing memory for all of your buffers. And avoidance and race conditions. You will to ensure that we could still process because a PicoBlaze configurable state need to plan carefully, however, to ensure frame data, but this type of task fits very machine executes sequentially, you will that your event scheduling is correct and well into a busy-wait loop, which auto- never need to access more than one buffer that that no queued events get dropped. matically gives it the lowest priority. at a time, which makes grouping together Here are some guidelines that might help in If you do choose to process data in the all of your memory easy. As a bonus, Xilinx designing your PicoBlaze application: main loop, some additional care is block RAM automatically converts data needed to ensure that you do not cor- • Prioritize your event processing. If you widths, seamlessly allowing 32-bit access rupt your main loop when servicing have events that are not important, from the host side and 8-bit access from the interrupts. At the start of interrupt serv- choose to execute those last. For exam- PicoBlaze controller side. icing, you should copy all registers that ple, the value of buffer occupancy might get overwritten, and restore those counts is changed by a MOST host Shared Variables registers as interrupt servicing com- buffer read. However, this update may Some configuration and status variables pletes. There are also PicoBlaze com- be deferred for several frames. Because will need to be continually available. You mands for disabling interrupts when responding to a message must com- can map these variables into registers in the executing critical portions of your code. plete in one frame, the core services PicoBlaze controller external memory message responds first. space for your state machine to access, just Conclusion like the memory in the MOST LogiCORE • Preserve event ordering. If your pro- The Xilinx MOST NIC has been shown to control processing module (Figure 2). In cessing requires that one event com- efficiently handle control message process- the LogiCORE system, the PicoBlaze con- plete before another, you will want to ing. It is ideal for automotive infotainment troller external memory port is mapped to execute the code for the first event, systems, as it provides similar efficiency in a shared block RAM, with the exception even if the second event may have a streaming data. With upcoming device that the end of the memory range is mem- tighter timing requirement. For exam- support for the Spartan-3A DSP family, ory-mapped registers. 
ple, control bytes must be sent out this will allow even more efficient multime- As an example, the currently pro- before the bytes in the current frame dia processing for your design. grammed addresses of the MOST node are mapped into a read-only “register” (a mux TAKE THE NEXT STEP (Digital Edition: www.xcellpublications.com/subscribe/) input) to determine if a received message is meant for this node. Status information • Get more information about embedded design using the Xilinx MOST about the current occupancy of the buffers NIC LogiCORE solution in the MOST NIC lounge. is placed into a write-only register for • The MOST NIC User Guide, available after generating a LogiCORE access by the external host. implementation, provides more information about all MOST control message formats for the MOST link layer. Ordering Your Processing Complex state machines are generally built • Download the PicoBlaze microcontroller reference design, which includes from smaller interlocking state machines supporting files such as the assembler and extensive documentation such that operate in parallel. PicoBlaze config- as the User Guide, with many coding examples and documentation for urable state machines execute sequentially, the complete instruction set. providing advantages in terms of conflict

INTELLECTUAL PROPERTY

Leveraging HyperTransport on Xilinx FPGAs
An open-source HyperTransport IP core allows FPGAs to directly connect to AMD Opteron processors.

by David Slogsnat
Research Associate
University of Mannheim
[email protected]

Alexander Giese
Research Associate
University of Mannheim
[email protected]

Ulrich Bruening
Professor
University of Mannheim
[email protected]

FPGA-based devices are used widely in standard computing systems, either as reprogrammable coprocessors or as prototyping devices. Usually, they are connected to the system using peripheral buses like PCI Express.

AMD's Opteron processors offer a unique opportunity to improve how devices are connected to the system. Rather than using a proprietary front-side bus, there are three HyperTransport (HT) interfaces per processor. The adoption of the HyperTransport Expansion Connector (HTX) specification and mainboards equipped with this connector now make it possible to directly connect devices to an Opteron processor.

The Computer Architecture Group at the University of Mannheim researches networks and network interface controllers (NICs) for high-performance computing. For the construction of high-speed prototypes, we sought to implement an efficient HyperTransport IP on an FPGA. Implemented on a Virtex-4 FX device, the IP is capable of running in HT400 mode, with a link clock of 400 MHz. Of course, not only NICs benefit from a direct HyperTransport connection. Any device with high bandwidth or low latency requirements, like FPGA coprocessors, will benefit from the increased performance.

Advantages of HyperTransport
Both HyperTransport and its closest competitor, PCI Express, are very capable of delivering bandwidth. For example, the 3.6-Gbps bi-directional bandwidth of our HT400 core is also possible using a PCI Express x8 device.

Besides bandwidth, there is a second important criterion to specify the performance of an interconnect: latency. Compared to PCI Express, HyperTransport adds significantly less latency to the transfer of data between a device and a host system comprising processors and memory.

One reason for this is the protocol itself, which does not require steps like 8B/10B coding and high-speed serialization. (A detailed latency analysis of both protocols


can be found on www.hypertransport.org.) The main reduction in latency, however, stems from the fact that the HTX slot is directly connected to Opteron processors instead of having to go through one or more I/O bridges, as depicted in Figure 1. This avoids time-consuming protocol conversions and usually saves one chip-to-chip hop.

Figure 1 – Block diagram of an Opteron dual-core processor in a system with HTX and PCI-X slots

Overview of HyperTransport
HyperTransport is a packet-based communication protocol for data transfer. There are three versions: HT 1.05 was developed in 2001 and updated to HT 2.0 in 2004. In April 2006, HT 3.0 was defined as the next successor. Current Opteron processors adhere to the HT 2.0b specification; no HT 3.0 devices or systems are currently available. Therefore, this article focuses on the implementation of an HT 2.0b device.

A HyperTransport link comprises two sets of unidirectional signals. Each set can be divided into three signal types: CAD (command, address, and data), CTL (control), and CLK (clock). The CAD lines are used to transport command and data packets, while the CTL line distinguishes between command and data packets on the CAD lines. The HyperTransport protocol supports CAD buses with a width of 2, 4, 8, 16, or 32 bits, as depicted in Table 1. The width of the CAD bus is usually known as the width of the HyperTransport link. If more than eight CAD lines are used per link and direction, every group of eight signals has its own CLK signal, and each group is transmitted synchronously with its source-associated CLK signal. The data transferred on the CAD bus is 32-bit aligned, independent of the bus width; all transferred packets are at least the size of one doubleword (32 bits). HyperTransport allows clock frequencies from 200 MHz to 1.4 GHz in HT 2.0 and up to 2.6 GHz in HT 3.0. On the CAD bus, double-data-rate signaling is used. Current Opteron processors use 16-bit link widths and frequencies up to 1 GHz.

In Opteron systems, all devices start at power-up with 200-MHz, 8-bit-wide links. The BIOS checks the capabilities of all devices by accessing each device's HyperTransport register space and sets new values for frequency and width for every link according to the capabilities of the two devices that share the link. After that, it forces a re-initialization of all HyperTransport devices to establish the new parameters.

All transfers in HyperTransport are packet-based. To decouple the response from the request, packets are transferred in a split-phase transaction. A transfer always starts with one of three kinds of control packets: information, request, or response. Information packets are used for flow control and synchronization. Request packets are sent to write data to a receiver and are also used to initiate reads. Response packets contain the answer to a corresponding request.

Control packets have a size of 4 or 8 bytes or, if they use 64-bit addresses instead of 40-bit addresses, an extended format with a size of 12 bytes. If a transfer contains payload data, the next data packet sent on the link belongs to that control packet. A data packet can have a maximum size of up to 64 bytes.

Sending other control packets during a stream of data packets is allowed at every 32-bit boundary, but only if the inserted control packet would not itself be followed by data; otherwise it would not be possible to determine which control packet the data belongs to. This mechanism makes it possible to send urgent control packets with high priority. Flow control on the link is imposed by a credit-based protocol.
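To make the packet and flow-control rules above concrete, the C fragment below sketches a decoded control packet and a credit-based send check. This is a simplified illustration under stated assumptions: the struct is not a bit-accurate rendering of the HyperTransport packet formats, and the credit bookkeeping is reduced to two counters per virtual channel.

#include <stdint.h>
#include <stdbool.h>

/* Decoded view of one HyperTransport control packet, per the rules above:
 * 4- or 8-byte base format, 12-byte extended format for 64-bit addresses,
 * and an optional data packet of up to 64 bytes that follows it.          */
enum ht_kind { HT_INFO, HT_REQUEST, HT_RESPONSE };

struct ht_ctl_packet {
    enum ht_kind kind;
    uint8_t      length_bytes;   /* 4, 8, or 12 (extended addressing)    */
    bool         has_data;       /* a data packet (<= 64 bytes) follows  */
    uint8_t      data_len;       /* 4..64, in 32-bit-aligned steps       */
};

/* Credit-based flow control: the sender keeps a command-credit counter and a
 * data-credit counter per virtual channel; the receiver returns credits in
 * information packets as its queue entries free up.                        */
struct ht_credits { unsigned cmd; unsigned data; };

static bool ht_can_send(const struct ht_credits *c, const struct ht_ctl_packet *p)
{
    return c->cmd > 0 && (!p->has_data || c->data > 0);
}

static void ht_consume(struct ht_credits *c, const struct ht_ctl_packet *p)
{
    c->cmd--;
    if (p->has_data)
        c->data--;
}

One such pair of counters per virtual channel maps naturally onto the three per-direction queues of the application interface described next.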

Opteron processors adhere to the HT L1 I L1 D L1 I L1 D and receive links, as specified in the 2.0b specification; no HT 3.0 devices or Cache Cache Cache Cache HyperTransport protocol. On the other systems are currently available. Therefore, side, there is the application interface, this article focuses on the implementation L2 Cache L2 Cache which allows FPGA designs to access the of an HT 2.0b device. HyperTransport core. This interface con- A HyperTransport link comprises two System Request Queue sists of three queues in each direction, one sets of unidirectional signals. Each set can for every virtual channel. Applications can Crossbar be distinguished into three signal types: access these using a valid-stop synchroniza- Memory HyperTransport CAD (command, address, and data), CTL tion mechanism. Control and attached (control), and CLK (clock signals). The data packets can be delivered simultaneous- CAD lines are used to transport com- Memory Controller ly over the 160-bit-wide interface (96 bit to mand and data packets, while the CTL handle extended con- line distinguishes between command and trol packets, 64 bit for data packets on the CAD lines. The data packets). HTX I/O HyperTransport protocol supports CAD Bridge PCI-X Of course, our aim buses with a width of 2, 4, 8, 16, or 32 was to build a core bit, as depicted in Table 1. The width of that is as fast as possi- the CAD bus is usually known as the ble. One limiting fac- width of the HyperTransport link. If more Figure 1 – Block diagram of an Opteron dual-core tor is the speed of the processor in a system with HTX and PCI-X slots than eight CAD lines are used per link serial I/Os. In the and direction, every group of eight signals Xilinx® Virtex™-4 FX has its own CLK signal. These groups of response from the request, the packets are devices we used, the speed is limited to signals are synchronously transmitted transferred in a split phase transaction. A 400-MHz DDR, thus HT400. Xilinx with the source-associated CLK signal. transfer always starts with one of three SERDES blocks can parallelize/serialize The data transferred on the CAD bus is kinds of control packets: information, the link by a factor of four so that the 32-bit aligned independent of the bus request, or response. Information packets clock frequency of the core is 200 MHz. width. All transferred packets are at least are used for flow control and synchroniza- The SERDES blocks are also controlled the size of one doubleword (32 bit). tion. Request packets are sent to write data by a bitslip module to generate proper HyperTransport allows clock frequencies to a receiver and are also used to initiate 32-bit boundary alignment. from 200 MHz to 1.4 GHz in HT 2.0 and reads. Response packets contain the answer Our second challenge was to process up to 2.6 GHz in HT 3.0. On the CAD to a corresponding request. this data stream with the lowest number of bus, double-data-rate signaling is used. Control packets have a size of 4 or 8 pipeline stages and reasonable resource Current Opteron processors use 16-bit link bytes or, if they use addresses of 64 bits requirements. The decode unit, responsi- widths and frequencies up to 1 GHz. In instead of 40-bit addresses, the extended ble for decoding the incoming packet Opteron systems, all devices start at power format has a size of 12 bytes. If a transfer stream and putting packets into the corre- up of the system with 200-MHz and 8-bit- contains payload data, the next data packet sponding application interface queues or
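To make the packet rules above concrete, here is a small C sketch of my own (not code from the Mannheim core) that returns the size of a control packet under the 4/8/12-byte rules and checks that an attached payload respects the doubleword alignment and 64-byte limit described earlier.

#include <stdbool.h>
#include <stdio.h>

/* Control-packet size per the rules in the text: 4 or 8 bytes normally,
 * or 12 bytes for the extended format with 64-bit addresses. */
static unsigned ht_ctrl_packet_bytes(bool carries_address, bool extended_64bit_addr)
{
    if (!carries_address)
        return 4;                          /* e.g., information packets */
    return extended_64bit_addr ? 12 : 8;   /* 40-bit vs. 64-bit address */
}

/* Data packets are sent in 32-bit (doubleword) units and may carry at
 * most 64 bytes, so a legal payload is a non-zero multiple of 4 bytes. */
static bool ht_payload_is_legal(unsigned bytes)
{
    return bytes > 0 && bytes <= 64 && (bytes % 4) == 0;
}

int main(void)
{
    printf("request with 40-bit address: %u bytes\n", ht_ctrl_packet_bytes(true, false));
    printf("request with 64-bit address: %u bytes\n", ht_ctrl_packet_bytes(true, true));
    printf("64-byte payload legal: %d\n", ht_payload_is_legal(64));
    printf("65-byte payload legal: %d\n", ht_payload_is_legal(65));
    return 0;
}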

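The credit-based flow control mentioned above works by having the receiver advertise buffer space per virtual channel; the sender decrements a counter for every packet launched and stalls that channel when no credits remain. The C sketch below shows only the counting idea; the function names and the per-channel credit counts are assumptions made for this illustration, not values from the HyperTransport specification or the core.

#include <stdbool.h>
#include <stdio.h>

enum { VC_POSTED, VC_NONPOSTED, VC_RESPONSE, VC_COUNT };   /* three virtual channels */

static unsigned credits[VC_COUNT];

/* Receiver advertises its initial buffer depth per virtual channel. */
static void credits_init(unsigned posted, unsigned nonposted, unsigned response)
{
    credits[VC_POSTED]    = posted;
    credits[VC_NONPOSTED] = nonposted;
    credits[VC_RESPONSE]  = response;
}

/* Sender side: a packet may only be launched if a credit is available. */
static bool try_send(int vc)
{
    if (credits[vc] == 0)
        return false;              /* stall this channel; others may proceed */
    credits[vc]--;
    return true;
}

/* Receiver side: freeing a buffer entry returns one credit to the sender
 * (on the real link this travels in an information packet). */
static void credit_return(int vc)
{
    credits[vc]++;
}

int main(void)
{
    credits_init(4, 2, 2);                       /* assumed buffer depths */
    printf("posted send accepted: %d\n", try_send(VC_POSTED));
    credit_return(VC_POSTED);
    return 0;
}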

The decode unit, responsible for decoding the incoming packet stream and putting packets into the corresponding application interface queues or HyperTransport core units, is the most complex unit in the design. There may be two incoming 32-bit control packets at the same time, so there are two decode sub-units in parallel. On the other hand, they may be parts of one large 64-bit or 96-bit control packet.

Our third challenge was to reach an internal clock frequency of 200 MHz. We have been using the core with only 100 MHz internally, and thus in HT200, for more than half a year. The HT400 version still requires some optimizations to run reliably at speed.

Results
Table 1 shows the resource requirements of the HT400 core (which we did not yet optimize for resource utilization) in comparison with an HT200 version that supports only 8-bit-wide links. The pure internal hardware latency of the core, shown in Table 2, is very low: 11 clock cycles for the input path from the HyperTransport link to the application interface and seven clock cycles outbound.

You can download the HT200 core from our website, as well as up-to-date information about the core and its performance. The HT400 core is still in the verification phase and will be available soon.

Table 1 – Resource requirements in a Virtex-4 XC4VFX100 FPGA

Resource          8-Bit Link, HT200     16-Bit Link, HT400
Logic Slices      2,699   (6.5%)        6,700   (15.8%)
FIFO16/RAMB16s    30      (7.5%)        30      (7.5%)
DCM_ADVs          3       (25%)         4       (33.4%)
ISERDESs          10      (1%)          19      (2%)
OSERDESs          9       (1%)          19      (2%)

Table 2 – Hardware latency of the HyperTransport core

Direction    Clock Cycles    Delay @ HT200    Delay @ HT400
In           11              55 ns            27.5 ns
Out          7               35 ns            17.5 ns

Implementing a Low-Latency NIC
As a university research group, our main focus is in high-performance computing. We used the HyperTransport core to build up a prototype NIC for the EXTOLL network. We developed this network to research and implement new methods in high-performance networks and network interfaces. Figure 2 is a block diagram of such a NIC. The current prototype is implemented using the HT200 8-bit core with an internal clock frequency of 100 MHz. Nevertheless, the Netpipe (www.scl.ameslab.gov/netpipe/) benchmark shows a latency of <1.5 µs on the software API level for sending small messages, which is an excellent result.

The hardware of our prototyping platform comprises the Iwill DK8-HTX dual Opteron mainboard and the HTX board. The Xilinx Virtex-4 FPGA-based board that we developed in 2006 can be plugged directly into the HTX slot of the Iwill mainboard. This is also the only environment in which we have verified both the core and the board in depth thus far.

[Figure 2 – Prototype NIC device]

Conclusion
The HT IP core successfully exploits the potential of the FPGA in terms of bandwidth, latency, and resource utilization. Offering an HT400 connection and thus 3.6 GB/s of bi-directional bandwidth, the HyperTransport core can be used for more than just prototyping – its performance is good enough to serve as a production coprocessor as well.

The HyperTransport core can also be mapped to an ASIC implementation to reach higher link speeds. For FPGAs, faster links may be possible with faster FPGA families such as the Virtex-5 device family. We are also working on a coherent HyperTransport core with HT400 speed that allows devices to participate in the cache-coherence protocol of Opteron processors. In addition to the functionality of a peripheral device, this will, for example, allow a device or coprocessor to act like an Opteron system memory controller or a coherent cache. In contrast to the non-coherent HyperTransport, the use of the coherent version requires a license from AMD.

For more information about the open-source HyperTransport core or the HTX board, visit our Center of Excellence for HyperTransport Technology: www.ra.informatik.uni-mannheim.de/coeht.
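The <1.5-µs figure quoted for the NIC is a software-level, small-message latency of the kind Netpipe reports: half of an averaged ping-pong round trip. The C sketch below shows only that measurement idea; msg_send() and msg_recv() are hypothetical placeholders standing in for whatever messaging API sits on top of the NIC, not the EXTOLL or Netpipe interface.

#include <stdio.h>
#include <time.h>

/* Stubs so the sketch compiles stand-alone; a real harness would post a
 * blocking send/receive through the NIC's messaging API (names assumed). */
static void msg_send(const void *buf, int len) { (void)buf; (void)len; }
static void msg_recv(void *buf, int len)       { (void)buf; (void)len; }

/* One-way small-message latency estimated as half of the averaged
 * ping-pong round-trip time. */
static double ping_pong_latency_us(int len, int iterations)
{
    char buf[64] = {0};                    /* small-message buffer */
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iterations; i++) {
        msg_send(buf, len);                /* message bounces off the peer */
        msg_recv(buf, len);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                (t1.tv_nsec - t0.tv_nsec) / 1e3;
    return us / iterations / 2.0;          /* round trip -> one way */
}

int main(void)
{
    printf("estimated one-way latency: %.3f us\n", ping_pong_latency_us(8, 1000));
    return 0;
}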


FPGA-Based Simulation for Rapid Prototyping
You can run an HDL simulator along with an FPGA through USB using iNCITE.

by Ando Ki
R&D Director
Dynalith Systems
[email protected]

Designing a hardware block does not simply mean RTL coding and simulation; the process usually includes FPGA-based prototyping. To do this requires a huge effort – including the preparation of a PCB board that implements your design and its surrounding functions in hardware – even though you probably only want to see the functionality of your design and not its surrounding components.

One possible solution around this has actually existed for some time: conventional emulation. This hardware-assisted acceleration technique connects an HDL simulator to programmable devices, allowing a design to run on real hardware while its surrounding functions are simulated on top of software.

In this article, I'll describe Dynalith Systems's iNCITE USB-connected FPGA board, supporting Xilinx® Spartan™-3 FPGAs. The iNCITE board provides a USB-based communication channel between software and an FPGA. Using industry-standard HDL simulator software, you can run your design in the FPGA under software control.

iNCITE
iNCITE is an FPGA-mounted PCB board featuring various on-board memories, a host computer interface using USB 2.0, and a board-to-board connector (see Figure 1). For application-specific peripherals, the iNCITE-AVREM application board incorporates audio, video, RS-232C, Ethernet, MMC, and PS/2. With both boards, you can easily implement a complete system-on-chip (SOC) or embedded system.

[Figure 1 – Functional block diagram of iNCITE and iNCITE-AVREM]

Design Flow for FPGA-Based Simulation
As shown in Figure 2, there are four design steps. In the first step, the design under test (DUT) and its test bench are developed using a pure software simulator, such as the free ModelSim-XE HDL simulator. When a synthesizable RTL version of the DUT is ready, the DUT is synthesized using an industry-standard FPGA synthesizer such as XST (Xilinx Synthesis Technology), a proprietary synthesis engine in Xilinx ISE™ software.

The synthesis result EDIF is processed by iNSPIRE-Lite, which is the GUI-based integrated design environment for iNCITE (Xilinx place and route is used internally). During this step, two files are generated: an emulation information file (EIF) and a proxy module. The EIF contains the necessary data and bitstream to configure the FPGA. The proxy module handles communication between the simulator and the FPGA through USB.

The test bench runs on top of the HDL simulator while the DUT runs in the FPGA. During simulation with iNCITE, the EIF is automatically downloaded to the FPGA through the USB channel at the start of simulation.

The channel between iNCITE and software is a generic one; thus, you can use any programming language including C, C++, SystemC, and MATLAB/Simulink. Essentially, you can build a virtual system around the actual FPGA without preparing a PCB board, since the DUT is implemented in the FPGA and its surrounding blocks are modeled using your preferred language, including HDL and high-level languages.

[Figure 2 – iNCITE design flow]

Rapid Prototyping
Functional verification utilizing hardware has three major categories: logic simulation acceleration, emulation, and prototyping. Acceleration is based on the same idea described previously: some parts of the design run in the hardware while other parts are simulated on top of the software.

Emulation uses special hardware in the context of a real environment, where our design runs in programmable devices connected to a target hardware board.

Prototyping is a customized emulation system encompassing all parts of a system, including user design. It is usually implemented using an FPGA and PCB board. In other words, prototyping requires that you design the system before final production. Although the first two categories normally use off-the-shelf products, prototyping is time-consuming and costly, as it requires PCB design and debugging.

To assist beginners or those designing small- to medium-scale projects, the iNCITE application board can serve as a ready-to-prototype system (Figure 3). With this system, you can easily implement a complete SOC or embedded system. For the example shown in Figure 3, we built and mapped a complete OpenRISC-based SOC on the FPGA using iNCITE, while other peripherals and memories were incorporated using iNCITE-AVREM.

[Figure 3 – Prototyping example of OpenRISC-based SOC]

Conclusion
iNCITE, iNCITE-AVREM, and iNSPIRE-Lite provide an ideal design and verification environment, allowing you to run your design through an FPGA board. You can work in the same test bench from RTL design to FPGA-based gate-level verification without having to prototype a PCB. These tools are also ideal for teaching, seminars, and small- and medium-sized designs, as the environment runs the gamut from pure simulation to FPGA prototyping through logic synthesis and FPGA place and route.

For more information about iNCITE, visit www.dynalith.com/incite.php. For more information about Dynalith Systems, visit www.dynalith.com.
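Because the host-side channel is language-agnostic, a test bench can drive the DUT from ordinary software. The C sketch below illustrates the shape of one proxy-style transaction; channel_write(), channel_read(), and the 32-bit frame layout are placeholders invented for this illustration, not the iNCITE or iNSPIRE-Lite API.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical byte-stream channel to the board, stubbed so the sketch
 * compiles stand-alone; a real harness would go through the USB driver. */
static void channel_write(const uint8_t *buf, size_t len) { (void)buf; (void)len; }
static void channel_read(uint8_t *buf, size_t len)        { memset(buf, 0, len); }

/* One co-simulation transaction: apply a stimulus word to the DUT's inputs
 * and read back the word its outputs produced. */
static uint32_t dut_step(uint32_t stimulus)
{
    uint8_t frame[4], reply[4];

    frame[0] = stimulus & 0xff;             /* little-endian stimulus word */
    frame[1] = (stimulus >> 8)  & 0xff;
    frame[2] = (stimulus >> 16) & 0xff;
    frame[3] = (stimulus >> 24) & 0xff;

    channel_write(frame, sizeof frame);
    channel_read(reply, sizeof reply);

    return (uint32_t)reply[0]        |
           ((uint32_t)reply[1] << 8) |
           ((uint32_t)reply[2] << 16)|
           ((uint32_t)reply[3] << 24);
}

int main(void)
{
    printf("DUT response: 0x%08x\n", dut_step(0xdeadbeefu));
    return 0;
}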

APPLICATION NOTES

Featured Connectivity Application Notes
Create a differentiated connectivity solution with product descriptions and associated design files.

Embedded Serial ATA Storage System
www.xilinx.com/bvdocs/appnotes/xapp716.pdf
This application note describes the design and implementation of an embedded Serial Advanced Technology Attachment (SATA) storage system on a Xilinx® Virtex™-4 platform. SATA is the successor of the prevalent Parallel Advanced Technology Attachment (PATA) interface. SATA overcomes many limitations of PATA and offers a maximum bandwidth of 150 MBps. Combining SATA with a high-speed network interface such as Gigabit Ethernet (1000 Base-X) creates an ideal solution for many high-performance storage applications such as network attached storage (NAS), storage area network (SAN), and redundant array of independent disks (RAID).
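A quick calculation shows why the SATA and Gigabit Ethernet pairing is well balanced; the C sketch below is illustrative arithmetic only, not material from the application note.

#include <stdio.h>

int main(void)
{
    /* First-generation SATA: 1.5-Gb/s line rate with 8B/10B coding. */
    double sata_mbytes = 1.5e9 * 8.0 / 10.0 / 8.0 / 1e6;   /* 150 MB/s */

    /* Gigabit Ethernet payload ceiling, ignoring framing overhead. */
    double gige_mbytes = 1.0e9 / 8.0 / 1e6;                 /* 125 MB/s */

    printf("SATA (1.5 Gb/s, 8B/10B): %.0f MB/s\n", sata_mbytes);
    printf("Gigabit Ethernet:        %.0f MB/s\n", gige_mbytes);
    return 0;
}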

Audio/Video Connectivity Solutions for the Broadcast Industry
www.xilinx.com/bvdocs/appnotes/xapp514.pdf
This comprehensive reference design describes how to use Xilinx FPGAs to implement serial digital video interfaces commonly used in the professional video broadcast industry. The serial video interfaces described in this document are:
• SD-SDI (SMPTE-259M), used to transport uncompressed standard-definition digital video
• HD-SDI (SMPTE-292M), used to transport uncompressed high-definition digital video
• DVB-ASI, used to transport compressed digital video
• AES, used to transport digital audio

Virtex-5 Embedded Tri-Mode Ethernet MAC Hardware Demonstration Platform
www.xilinx.com/bvdocs/appnotes/xapp957.pdf
This application note describes a system using the Virtex-5 embedded tri-mode Ethernet MAC (Ethernet MAC) wrapper core on an ML505 development board. The system provides an example of how to integrate the embedded tri-mode Ethernet MAC and embedded tri-mode Ethernet MAC wrapper, using a hardware design to target the development board and a PC-based GUI to control the demonstration platform.

10 Gigabit Ethernet Hardware Demonstration Platform
www.xilinx.com/bvdocs/appnotes/xapp955.pdf
This application note describes the functionality of the LogiCORE™ 10 Gigabit Ethernet and XAUI cores in Xilinx FPGA hardware. Development board requirements, setup instructions, MAC core-specific design components, and a description of the GUI used to control the demonstration platform are included.

THE BOARD ROOM

Connectivity Boards and Kits
Xilinx, together with distributors and partners, offers a complete line of development and expansion boards to help you test and develop designs using Xilinx® devices. The following boards are specific to connectivity applications.

Virtex-5 LXT ML555 FPGA Development Kit for PCI Express and PCI
www.xilinx.com/xlnx/xebiz/designResources/ip_product_details.jsp?key=HW-V5-ML555-G
The Virtex-5 LXT FPGA Development Kit for PCI Express supports PCIe/PCI-X/PCI. This complete development kit passed PCI-SIG compliance for PCI Express specification v1.1 and enables you to rapidly create and evaluate designs using PCI Express, PCI-X, and PCI interfaces.

Virtex-5 LXT ML505 Evaluation Platform
www.xilinx.com/xlnx/xebiz/designResources/ip_product_details.jsp?key=HW-V5-ML505-UNI-G
Create high-speed serial designs using Virtex™-5 RocketIO™ GTP transceivers:
• Provides a feature-rich general-purpose evaluation and development platform
• Includes on-board memory and industry-standard connectivity interfaces
• Delivers a versatile development platform for embedded applications

Virtex-5 SXT ML506 Evaluation Platform
www.xilinx.com/xlnx/xebiz/designResources/ip_product_details.jsp?key=HW-V5-ML506-UNI-G
The ML506 is a feature-rich DSP general-purpose evaluation and development platform. Though economically priced, the ML506 offers you the ability to create DSP-based and high-speed serial designs using the Virtex-5 DSP48E slices and RocketIO GTP transceivers. A variety of on-board memories and industry-standard connectivity interfaces add to the ML506's ability to serve as a versatile development platform for embedded applications.

Virtex-5 LXT ML52x RocketIO Characterization Platforms
www.xilinx.com/xlnx/xebiz/designResources/ip_product_details.jsp?key=HW-V5-ML52X-UNI-G
Xilinx ML52x platforms are ideal for characterization and evaluation of Virtex-5 LXT RocketIO GTP transceivers. Each RocketIO GTP transceiver is accessible through four SMA connectors. ML52x platforms are available with XC5VLX50T-FF665 (ML521), XC5VLX110T-FF1136 (ML523), and XC5VLX330T-FF1738 FPGA devices.

EDUCATION

The Connectivity Curriculum Path
Xilinx Education Services is ready to help you use PCI Express, Gigabit Ethernet, or RapidIO protocols in your next design.

by Craig Willert
Global Training Solutions Communications Manager
Xilinx, Inc.
[email protected]

By learning advanced programmable logic design techniques and methodologies, you can take full advantage of today's advanced FPGA capabilities, including high-speed serial I/O. Equipped with this knowledge, you can better innovate when developing products for your market, reduce R&D costs through improved process efficiency, and lower production costs by using smaller devices in a slower speed grade.

The Xilinx® Connectivity Curriculum Path offers a variety of classes to help you quickly understand how to build the most optimal system using one of the many high-speed serial I/O protocols that Xilinx supports. The curriculum path also includes courses on the use of Xilinx RocketIO™ GTP serial transceivers, signal integrity, leading-edge programmable logic technology, and Xilinx design flows.

Designing with Multi-Gigabit Serial I/O
The "Designing with Multi-Gigabit Serial I/O" course will teach you how to employ RocketIO transceivers in your Virtex™-5 LXT FPGA designs.
After completing this comprehensive training, you will have the skills to:
• Describe and utilize the ports and attributes of the RocketIO multi-gigabit transceiver in the Virtex-5 LXT FPGA
• Effectively use such features as:
• Comma detection, CRC, clock correction, and channel bonding
• 8B/10B encoding/decoding, programmable termination, and pre-emphasis
• GTP primitive instantiation in a design using the GTP wizard
• Reference material for board design issues
• Power supply, oscillators, and trace design

Designing with Ethernet MAC Controllers
The "Designing with Ethernet MAC Controllers" course teaches the basics of the Ethernet standard, protocol, and OSI model and related Xilinx solutions.
This course is ideal for:
• Engineers using Xilinx Ethernet connectivity solutions
After completing this training, you will have the necessary skills to:
• Use various Ethernet cores alone or as a peripheral in processor-based designs
• Select the appropriate core for a specific design
• Develop software to drive the core and achieve desired functionality

Designing a LogiCORE PCI Express System
The "Designing a LogiCORE™ PCI Express System" course focuses on the key PCI Express protocol subjects and targets hard and soft PCI Express cores in the Virtex-5 FPGA.
This course is ideal for:
• Hardware designers creating applications using Xilinx IP cores for PCI Express
• Software engineers creating APIs, GUIs, and driver software development
• System architects leveraging Xilinx performance, latency, and bandwidth
After completing this training, you will have the necessary skills to:
• Use Xilinx PCI Express cores in your own design environments
• Select the appropriate PCIe solution for a specific application
• Identify how PCI Express specifications apply to Xilinx PCI Express cores

Conclusion
Xilinx programs provide targeted, high-quality education products and services that are designed by experts in programmable logic design and delivered by Xilinx-qualified trainers. We offer classes at all expertise levels and create an engaging learning environment by blending lectures, hands-on labs, interactive discussions, tips, and best practices.
Xilinx delivers training when and where you want it by leveraging our global network of authorized training providers (ATPs) and online learning systems.

TAKE THE NEXT STEP (Digital Edition: www.xcellpublications.com/subscribe/)
• Register today for any of these courses.
• View the full Connectivity Curriculum Path.
• Contact your Xilinx sales representative.


XPECTATIONS

Developing Technical Leaders through Knowledge Communities
Knowledge communities will better prepare students to become team players and problem solvers.

by Ivo Bolsens
CTO
Xilinx, Inc.
[email protected]

Today's college freshmen will enter the workforce in the 2010-2011 time frame. By then, chip manufacturing will be at the 32-nm technology node, allowing manufacturers to build single chips that contain tens of billions of transistors. FPGAs will be the leading platforms leveraging this manufacturing technology, capable of implementing programmable systems that will comprise nearly 100 million programmable logic gates. As such, the FPGA will become the heart of many electronic products.

The main building blocks that will make up a system in an FPGA will be soft processors running complex embedded software; highly parallel or pipelined arithmetic functions executing real-time signal-processing algorithms; packet-processing engines securing embedded connectivity; and complex hierarchies of memories handling efficient data transfer and storage mechanisms. These programmable systems will be adaptable and partially reconfigurable in the field.

Triple play (the convergence of data, voice, and video over a peer-to-peer network) […] and complexity of this task by far exceeds the capacity and domain of a single professor.

Knowledge Communities
For this reason, industry and academia have come together to create "knowledge communities" around specific technology domains. Universities seek to create a wide association of people with common technology interests who wish to continue learning, enhancing the skills of their members and preserving lessons learned. The industry wants to be an active partner, educating those in academia in global system issues and creating a realistic agenda through intensive dialogue between industry leaders and visionary academics.

As these knowledge communities take hold, students will be better prepared to become the team players and problem solvers for future industry projects. Gradually, they will ascend the "knowledge value chain," from reproducing textbook content to applying judgment calls in situational problems.

Xilinx is constantly deepening its partnership with academia and is committed to jointly developing new ideas and opportunities and solving key challenges.

Triple play will permit FPGA platforms to become the core infrastructure; however, there are challenges to solve. First is the further digitization and processing of signals with growing resolutions, requiring an exponential growth in DSP compute capability. Second is the secure and reliable transport of this data with increasing bandwidth over wireless and wired infrastructures. And third is the need for high-performance computing to extract information from large amounts of raw data.

For example, the RAMP research community (http://ramp.eecs.berkeley.edu) is exploring a programming environment based on the BEE2 FPGA platform. The NetFPGA platform (http://yuba.stanford.edu/nf2/) provides an open platform for research on high-bandwidth networking applications. And the WARP project (http://warp.rice.edu) has built a repository to support complex wireless DSP application designs.

These activities are part of the Xilinx university outreach program, supporting continuing education and research based on our latest technology. The Xilinx University Program (XUP, www.xilinx.com/univ/) not only allows easy access to Xilinx best-in-class tools but also provides free training, high-quality support, and curriculum development assistance. XUP donations take many forms, from design software, FPGA devices, and IP cores to reference designs and student competitions.

As a business that is continually looking globally for new ideas and technologies, Xilinx has established relationships with a large number of universities around the world to allow their engineering students to receive hands-on experience with our technology. These students will enter the workforce with the necessary skills to be quickly productive in the fast-paced world of semiconductors.


Low-Power Transceivers
Ultimate Connectivity . . .

Reduce serial I/O power, cost and complexity with the world’s first 65nm FPGAs.

With a unique combination of up to 24 low-power transceivers, and built-in PCIe® and Ethernet MAC blocks, Virtex™-5 FPGAs get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers everything you need to simplify high-speed serial design, including protocol packs and development kits.

Complete PCI Express® Development Kit

Lowest-power, most area-efficient serial I/O solution

RocketIO™ GTP transceivers deliver up to 3.2 Gbps connectivity at less than 100 mW to help you beat your power budget. The embedded PCI Express endpoint block saves 60% in power and 90% in area compared to other solutions. Virtex-5 chips and boards are on the PCI-SIG® Integrators List. Plus our UNH-tested Ethernet MAC blocks make connectivity easier than ever.

Visit our website to get the chips, boards, kits, and documentation you need to start designing today.

www.xilinx.com/virtex5
Delivering Benefits of 65nm FPGAs Since May 2006!

The Ultimate System Integration Platform
Optimized for logic, serial connectivity and DSP.

©2007 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners. PN 0011035