All Programmable Abstractions: Programming Your Way SECOND QUARTER 2015, ISSUE 91

Home , Lola (computing), Modula

ISSUE 91, SECOND QUARTER 2015

SOLUTIONS FOR A PROGRAMMABLE WORLD

Optimizing an OpenCL All Programmable Abstractions: Application for Video Watermarking in FPGAs

Programming Your Way Coming to Grips with the Frequency Domain

Enabling a JPEG 2000 Network for Professional Video

What’s New in Vivado Design Suite Update 2015.1?

Oberon System Implemented on a Low-Cost FPGA Board 30 www.xilinx.com/xcell PRESENTED BY AVNET

Accelerate Growth Technology

Unrivaled Technical Training for FPGA, SoC, DSP & Embedded System Designers

Don’t miss the online technical training program featuring 12 technical courses, exhibits and demos, and a one-of-a-kind community forum

For the first time ever, Avnet and Xilinx are excited to bring you X-fest On-Demand! Don’t miss this opportunity to learn about recent innovations in technology and product development methodology from Xilinx and to experience a deep dive with Avnet technical experts into the key technology domains where Xilinx All Programmable products deliver unique, high-impact (and cost-savings) capabilities.

TECHNICAL TRACKS ON:

Design Essentials Techniques & Applications SoCs & IoT

Visit the new site at www.xfest2014.com Integrated Hardware and Software Prototyping Solution

HAPS and ProtoCompiler accelerate software development, HW/SW integration and system validation from individual IP blocks to processor subsystems to complete SoCs.

Integrated ProtoCompiler design automation software speeds prototype bring-up by 3X Enhanced HapsTrak I/O connector technology and high-speed time-domain multiplexing deliver the highest system performance Automated debug captures seconds of trace data for superior debug visibility Scalable architecture supports up to 288 million ASIC gates to match your design size

To learn more visit: www.synopsys.com/HAPS LETTER FROM THE PUBLISHER

Setting the Stage for Improved Xcell journal

PUBLISHER Mike Santarini [email protected] Design Team Productivity 408-626-5981

EDITOR Jacqueline Damian elcome to the spring 2015 issue of Xcell Journal. Among the many great articles in these pages, you’ll find three relating to a strategy Xilinx has dubbed ART DIRECTOR Scott Blair W “All Programmable Abstractions.” The term refers to a new breed of high-level DESIGN/PRODUCTION Teie, Gelwicks & Associates design-entry environments from Xilinx and Alliance members that enable the use of familiar 1-800-493-5551 software-programming models in FPGA design. These development environments make it easier for design teams to become productive and even enable those who have never ADVERTISING SALES Judy Gelwicks 1-800-493-5551 programmed a Xilinx® All Programmable FPGA or SoC to use those devices without assis- [email protected] tance from a hardware engineer.

INTERNATIONAL Melissa Zhang, Asia Pacific In the cover story, I describe the evolution of Xilinx’s development environments to better [email protected] enable high-level design entry. This All Programmable Abstractions strategy started with Xilinx’s release of the Vivado high-level synthesis (HLS) tool in 2012, as well as third-party Christelle Moraga, Europe/ Middle East/Africa design environments from Alliance members such as The MathWorks and National [email protected] Instruments. The initiative has grown over the years, most recently with Xilinx’s launch of three new development environments—SDNet™, SDAccel™ and SDSoC™— in our new SDx Tomoko Suto, Japan [email protected] development environment line. With these new development environments, entire design teams can be more productive REPRINT ORDERS 1-800-493-5551 and can create well-rounded systems upfront, at the architectural level. This saves time at the back end and gives teams a major head start on software development. But perhaps more impressively, these environments will also enable new users to add Xilinx FPGAs and SoCs to their systems arsenals and create new applications and innovations without assistance from FPGA experts. According to recent industry estimates, software engineers outnumber hardware engineers 10 to 1 worldwide. Thus, the SDx environments will enable a much wider audience than ever before to produce innovations based on our All Programmable devices. This issue also contains two practical examples of the SDAccel development environment in action. You’ll find those articles on pages 38 and 42. Elsewhere in the issue, we are excited to showcase a contribution from a living legend in the computer-engineering field, Niklaus Wirth, on a high-level language he developed and ported to a Spartan®-3 FPGA platform. For those of you who don’t recognize the name, Professor Wirth invented the Pascal language and several successors in an effort to turn programmers www.xilinx.com/xcell/ into system inventors. Professor Wirth, now retired, revised and updated his book Project Oberon to help aca- Xilinx, Inc. demics teach system programming to the next generation of computer science professionals. 2100 Logic Drive San Jose, CA 95124-3400 The processor he targeted for that lesson in his first version of the book is no longer in pro- Phone: 408-559-7778 FAX: 408-879-4780 duction. Failing to find any suitable commercial alternatives, Wirth designed his own using the www.xilinx.com/xcell/ low-cost, and thus priced-right-for-students, Spartan-3 development board from Digilent. His article describes his design experience using the Spartan-3 board to revamp his Oberon © 2015 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands included programming language in hopes of inspiring a new generation of innovators. herein are trademarks of Xilinx, Inc. All other trade- Finally, in a great how-to article on page 54, Xilinx’s Daniel Michek walks readers through marks are the property of their respective owners. use of the Vivado™ System Generator for DSP tool to create optimized hardware platforms. The articles, information, and other materials included I hope you enjoy the issue. in this issue are provided solely for the convenience of our readers. Xilinx makes no warranties, express, implied, statutory, or otherwise, and accepts no liability with respect to any such articles, information, or other materials or their use, and any use thereof is solely at the risk of the user. Any person or entity using such information in any way releases and waives any claim it might have against Xilinx for any loss, damage, or expense caused thereby. Mike Santarini Publisher Synplify Premier Achieve Best Quality of Results and Maximum Performance for Your FPGA Design

FPGAs keep getting bigger, but your schedule is not. There is no time to waste on numerous design iterations and long tool runtimes. Use the hierarchical and incremental techniques available in Synopsys’ Synplify® software to bring in your schedule and meet aggressive performance and area goals.

u Unmatched quality of results u Fastest turnaround time u Fast and easy debug u Results preservation from one run to the next

To learn more about how Synopsys FPGA design tools accelerate time to market, visit www.synopsys.com/fpga CONTENTS

VIEWPOINTS XCELLENCE BY DESIGN APPLICATION FEATURES Letter from the Publisher 18 Xcellence in Broadcasting Setting the Stage for Improved Enabling a JPEG 2000 Network for Design Team Productivity… 4 Professional Video… 14 Xcellence in Scientific Applications White Rabbit: When Every Nanosecond Counts… 18 Xcellence in Wireless Communications Evaluating the Linearity of RF-DAC Multiband Transmitters… 26 Xperiment Oberon System Implemented on a Low-Cost FPGA Board… 30 30 Xcellence in Data Centers Removing the Barrier for FPGA-Based OpenCL Data Center Servers… 38 Xcellence in Data Centers Optimizing an OpenCL Application for Video Watermarking in FPGAs… 42 14

Cover Story 8 All Programmable Abstractions: Programming Your Way SECOND QUARTER 2015, ISSUE 91

THE XILINX XPERIENCE FEATURES Xplanation: FPGA 101 Coming to Grips with the Frequency Domain… 48 48 Xplanation: FPGA 101 Marrying SoC Platform Designs with System Generator for DSP… 54 Xplanation: FPGA 101 Rethinking Digital Downconversion in Fast, Wideband ADCs… 58

42 58

XTRA READING Xtra, Xtra The latest Xilinx tool updates and patches, as of April 2015… 64 Xpedite Latest and greatest from the Xilinx Alliance Program partners… 66 Xclamations! Share your wit and wisdom by supplying a caption for our wild and wacky artwork… 68

Excellence in Magazine & Journal Writing Excellence in Magazine & Journal Design 2010, 2011 2010, 2011, 2012 COVER STORY

All Programmable Abstractions: Programming Your Way Xilinx’s new SDx software-defined environments complement Vivado IPI, HLS and popular system-level design tools.

8 Xcell Journal Second Quarter 2015 COVER STORY

marketing group at Xilinx. “For the first time, those engineers will be able to de- by Mike Santarini rive the unique benefits of All Program- Publisher, Xcell Journal mable devices for customized accelera- Xilinx, Inc. All Programmable Abstractions: [email protected] tion, 10x to 100x performance per watt and any-to-any connectivity with the required security and safety for the next generation of smart systems. Xilinx is Programming Your Way enabling these next-generation systems to be more connected, software defined and virtualized. They must also support software-based analytics and enable more computing in the cloud, often driven by pervasive video and embedded To enable new levels of design team provision. This requires SDx software-de- Tductivity and expand the reach of its All fined programming environments and Programmable FPGAs, SoCs and 3D ICs heterogeneous multiprocessing with to a much larger user base of software new UltraScale FPGAs and MPSoCs.” engineers, Xilinx® Inc. recently unveiled The launch of the new SDAccel and two new additions to its SDx™ devel- SDSoC environments follows closely opment environment family. The new behind the spring 2014 release of the SDAccel™ development environment SDNet™ development environment, enables data center equipment program- detailed in the cover story of Xcell mers—with no FPGA experience—to Journal issue 87. program Xilinx FPGAs for data center While the new SDx environments and cloud computing infrastructure enable software engineers and sys- equipment using OpenCL™, C or C++. tem architects to program the FPGA The resulting FPGA-based equipment portion of Xilinx devices, the environ- will offer far superior performance per ments will also enable design teams watt (performance/watt) than GPU- and with hardware engineering resources CPU-based equipment. Xilinx has also to be more productive and quickly con- unveiled the SDSoC™ development verge on an optimized system to im- environment, which enables software prove time-to-market. With a working developers—again, with no FPGA expe- system design in hand, hardware engi- rience—to create systems in C or C++ neers can focus on optimizing FPGA targeting Zynq®-7000 All Programmable layout and performance for system SoC and UltraScale+™ MPSoC plat- efficiencies, while software engineers forms from Xilinx and third-party plat- further refine application code. form developers. The SDx environments are the lat- ALL PROGRAMMABLE est deliverables from Xilinx’s All Pro- ABSTRACIONS grammable Abstractions campaign. When Xilinx began planning its 7 series The initiative is designed to enable All Programmable device family of FP- software engineers and system archi- GAs, 3D ICs and Zynq-7000 All Program- tects to easily program Xilinx devices mable SoCs back in 2008 under new with development environments tai- CEO Moshe Gavrielov, it became evi- lored to their design needs. dent that the rich capabilities of each of “The combination of SDNet, SDAccel the members of the 7 series and future and SDSoC will provide familiar CPU-, product lines would enable customers GPU- and ASSP-like programming en- to place Xilinx’s devices at the heart of vironments to system and software en- their newest, most innovative products. gineers,” said Steve Glaser, senior vice These All Programmable devices president of the corporate strategy and were far more sophisticated than the

Second Quarter 2015 Xcell Journal 9 COVER STORY

glue-logic FPGAs of Xilinx’s earlier sophisticated enough, the environments HLS technology integrated into its ISE® years and enabled system functional- would essentially “democratize” All Pro- Design Suite and Vivado® Design Suite ity and end-product differentiation not grammable FPGA and Zynq SoC design, tool flows. Xilinx selected the AutoESL achievable with any other architecture. enabling individual architects and soft- technology after an exhaustive study by To maximize the value of these new ware engineers who have no experience Berkeley Design Automation demon- devices and jump ahead of the com- designing with FPGAs to program these strated that AutoESL’s HLS tool was petition, management knew it was im- devices without help from hardware de- the easiest to use and offered the best perative for Xilinx to develop tools and signers. Worldwide, software engineers quality of results of all the HLS tools methodologies that would enable sys- outnumber hardware engineers 10 to 1. available from the EDA industry. Its tem architects and even embedded-soft- Thus, Xilinx and Alliance Program part- origins in the EDA world also meant ware developers—not just FPGA ex- ners providing such development envi- that the use model for the tool was tar- perts—to program Xilinx’s newest ronments (or the hardware platforms geting ASSP and system-on-chip (SoC) devices. Further, the company would supporting them) would expand both system architects and well-rounded de- have to develop design environments their user base and their revenue. sign teams who had experience in C and for software engineers targeting key That strategy, along with a subse- C++ programming along with a working growth markets, and tailor those envi- quent development effort to enable soft- knowledge of hardware-description lan- ronments to the tools and flows these ware engineers and system architects guages (Verilog and VHDL) and chip de- designers were accustomed to using. It to program Xilinx devices with environ- sign requirements. would also be imperative to strengthen ments tailored to their design needs, is With Vivado HLS, such broadly alliances with companies like The Math- what Xilinx calls All Programmable Ab- skilled architects and design teams Works and National Instruments, which stractions (Figure 1). can create algorithms in C and C++ already have established environments while using Vivado HLS to compile and that enable nontraditional-FPGA users VIVADO HLS AND IPI: FIRST STEPS convert those algorithms into an RTL to leverage the power efficiency, perfor- UP IN DESIGN ABSTRACTION intellectual-property (IP) block. Then, mance and flexibility of Xilinx All Pro- The first serious step up in design ab- an FPGA designer can take that block grammable devices. straction began in 2011 with Xilinx’s and others in RTL that they’ve created For established customers, providing acquisition of privately held high-lev- or licensed, and assemble the IP into design environments for every member el-synthesis (HLS) tool vendor Au- a design using Xilinx’s IP Integrator of the design team would guarantee ef- toESL. The merger was followed in (IPI) tool. Next, the FPGA designer ficiency and shorten time-to-market. If 2012 by Xilinx’s public release of the takes the assembled design through the Vivado flow, performing HDL simulation, timing and power optimization, placement and routing. Ultimately, the designer generates a netlist/bit file and configures the hardware of the targeted All Programmable device. If the design includes a processor (either the Zynq SoC or a soft MPU core), the configured device is then ready for embedded-software engineers to program. To help these embedded-software engineers with the programming chore, Xilinx provides an Eclipse-based integrated design environment called the Xilinx Software Development Kit (SDK), which includes editors, compilers, debuggers, drivers and libraries targeting Zynq SoCs or FPGAs with Xil- inx’s 32-bit MicroBlaze™ soft core embedded in them. Introduced more than a decade ago, the SDK has evolved dra- Figure 1 – With the new SDx design environments and those from Alliance members, Xilinx is enabling more innovators to leverage Xilinx All Programmable devices to add matically as Xilinx’s integration of pro- greater value to their next-generation products. cessors on its devices has transitioned

10 Xcell Journal Second Quarter 2015 COVER STORY

from hardened DSP slices and soft-core edge to further tweak designs for ad- Software demonstrated an optical-in- MCUs and MPUs (8, 16 and 32 bits) to a ditional algorithm performance gains. spection system it developed with the hardened 32-bit PowerPC® in Virtex®-4 This combination of MathWorks and VisualApplets environment. The demo and Virtex®-5 FPGAs; hardened 32-bit Xilinx technologies has helped cus- showed a system performance speed- ARM® processors in the Zynq SoC; and tomer companies produce thousands up of 10x by using the VisualApplets 64-bit ARM processors in the upcoming of innovative products. environment to move image-process- Zynq UltraScale+ MPSoC. More recently, other companies ing tasks from the Zynq SoC’s process- have begun contributing to the Xilinx ing system to the device’s FPGA logic. ALLIANCE MEMBERS BRING environment as well. Xilinx recently XILINX VALUE TO MORE USERS welcomed two European companies, DEMOCRATIZING FPGA AND For well over a decade, Xilinx has TOPIC Embedded Systems and Silicon SOC DESIGN WITH SDX had very close partnerships with Software, to its Alliance Program, spe- With the SDx software-defined devel- National Instruments and The Math- cifically for their high-level develop- opment environments, Xilinx is build- Works. Both of these companies of- ment environments focused on medical ing a series of higher-level design en- fer unique high-level development and industrial markets. try environments targeting software environments tailored to the specific TOPIC Embedded Systems (Eind- developers and system architects in audiences they serve. hoven, the Netherlands) has a unique key markets. Both SDAccel and SD- National Instruments (Austin, Tex- development environment called DYP- SoC run the entire Vivado flow auto- as) offers hardware development plat- LO, which was featured in Xcell Journal matically under the hood, and require forms fanatically embraced by control issue 89. Targeting the Zynq SoC and, no direct use of Vivado tools and no and test system innovators. Xilinx’s soon, the Zynq MPSoC, the environment hand-off to a hardware engineer. For FPGAs and Zynq SoCs power the NI takes a singular approach to system-lev- its part, SDNet does not access Viva- RIO platforms. National Instruments’ el design, enabling system architects to do HLS and has a unique two-step use LabVIEW development environment is create a system representation of their model targeting line card architects. a user-friendly graphics-based program design in C or C++ and run it on the In the first step, line card architects that runs the Vivado Design Suite under Zynq SoC’s dual-core ARM Cortex™-A9 use an intuitive, C-like high-level lan- the hood so that National Instruments’ processing system. When users find guage instead of arcane microcode to customers need not know any of the sections of the design that are running design the requirements and create a details of FPGA design. Some perhaps too slowly in software, they can drag specification for a network line card. don’t even know a Xilinx device is at and drop the sections to a window that The SDNet development environment the heart of the RIO products. They converts the C into FPGA logic (running generates an RTL version of the de- can simply program their systems in Vivado HLS under the hood) and places sign based on that specification. The the LabVIEW environment and let NI’s it in the Zynq SoC’s programmable log- flow then requires a hardware engi- hardware speed the performance of de- ic. They swap code snippets back and neer to implement the RTL into the signs they are developing. forth until they converge on an optimal targeted FPGA. In the case of The MathWorks system performance. TOPIC has seen In the second step, SDNet also al- (Natick, Mass.), more than a decade ago first use in medical-equipment develop- lows network companies to test and the company added FPGA support to its ment and is starting engagements with update protocols in the high-level lan- MATLAB®, Simulink®, HDL Coder and manufacturers of industrial equipment. guage and upgrade the functionality Embedded Coder with Xilinx’s ISE and Silicon Software (Mannheim, Ger- of the cards—even after deployment Vivado tools running under the hood many) is the newest addition to the in the field—without hardware design and completely automated. As a result, quartet of Alliance members enabling intervention. The flow enables com- the users—who are mainly mathemati- software engineers to leverage Xilinx panies to quickly create and update cian algorithm developers—could de- All Programmable devices. The com- cards, flexibility that’s ideal for soft- velop algorithms and speed algorithm pany’s VisualApplets graphical im- ware-defined networks. performance exponentially, running the age-processing design environment The two new environments, SDAccel algorithms succinctly on FPGA fabric. allows system architects and software and SDSoc, take the SDx philosophy Over a decade ago, Xilinx added an engineers targeting Zynq SoC platforms into new application areas. FPGA architecture-level tool called to build new innovations in industri- System Generator to its ISE develop- al machine vision applications—and SDACCEL OPTIMIZES DATA ment environment and, more recent- do so without assistance from a hard- CENTER PERFORMANCE/WATT ly, the Vivado Design Suite, in order to ware engineer. In Xilinx’s booth at the According to a March 2014 article in the enable teams who had FPGA knowl- SPS Drives trade show in 2013, Silicon Data Center Journal, huge data centers

Second Quarter 2015 Xcell Journal 11 COVER STORY

that form the heart of companies like crete FPGA paired with a discrete CPU gram FPGA-based platforms without re- Google, Facebook, Amazon and Linked- raises power per card minimally but quiring design intervention from hardware In “consume upwards of 3 percent of improves performance dramatically, engineers. Tom Feist, senior director of the world’s electric power, while pro- yielding a significant improvement in design methodology marketing at Xilinx, ducing 200 million metric tons of CO2.” performance/watt. Others believe that said the SDAccel development environ- That enormous power consumption performance/watt can be further im- ment for OpenCL, C and C++ enables data costs data centers more than $60 bil- proved with a chip that combines an x86 center programmers to build equipment lion a year in electricity fees. The power processor core interconnected to FPGA with up to 25x better performance/watt consumption cuts deeply into the bot- logic on a single SoC. Some think that a than CPU- and GPU-based systems. tom-line profitability of even the biggest seemingly even lower-power and equal- Feist said that in the SDAccel flow dot-com companies but also has an un- ly high-performance solution would be (see Figure 2, the SDAccel develop- told cost on the environment. to integrate FPGA logic with ARM 64-bit ment environment demo), which tar- As more companies look to leverage processor IP on a single SoC. gets x86 CPUs paired with acceleration cloud computing and analysis via Big The main deterrent to using FPGAs cards running Xilinx’s 20nm Kintex® Data; and as video and streaming media in the data center has been programming. UltraScale FPGAs, software devel- become more pervasive worldwide; and Data center developers are accustomed opers identify applications that need as more people join wireless networks to programming x86-based architectures acceleration, code and optimize the and upgrade toward 5G, there is a re- and typically come from a pure soft- kernels in OpenCL, and then compile lentless, exponentially increasing de- ware-programming background. A first and execute the applications for the mand for more data centers with even step designed to help developers move CPU. They then cycle to estimate ker- better performance. The current trajec- CPU programs to faster GPUs was the in- nels and debug them to come up with tory, according to the Data Center Jour- dustry’s open development of the OpenCL cycle-accurate models. Next, they use nal article, will bring data center traffic language. Over the last two years, OpenCL SDAccel to compile the code and im- to 7.7 zettabytes annually by 2017. That has evolved even further to enable cus- plement it automatically in the FPGA means that data center power consump- tomers to target FPGAs. This development (possible because Vivado runs under tion will rise astronomically if left un- is opening up new possibilities for future the hood). They can then run the appli- checked. Today the main culprit in this data center equipment architectures and cation and validate the performance of power consumption is the Intel x86 pro- even ubiquitous networks. the applications accelerated with the cessors that form the bedrock of most By launching the SDAccel environment, card vs. performance when running data center equipment. MPUs today Xilinx has bridged that programming gap solely on the CPU. “They can run iter- provide good, but not optimal, perfor- and paved the way to enabling data center ations in this cycle until they find the mance and high power consumption. engineers to use OpenCL, C or C++ to pro- optimal balance between performance Software engineers, in abundant supply worldwide, find these MPUs the easiest devices to program. To solve the performance portion of the data center problem, many companies have been creating equipment with graphics processing units (GPUs) or CPU systems accelerated by GPU cards. GPUs have performance that’s far superior to that of CPUs in data center applications but unfortunately, far worse power consumption. Together, the performance is extremely fast but the power consumption is abysmal. To get the best of both worlds, many companies are turning to an FPGA-centric approach in which they pair FPGAs with other processors to maximize data center equipment performance. Figure 2 – In this SDAccel development environment demo, engineer Henry Styles shows how to use the SDAccel development environment for acceleration using a standard x86 64-bit A number of data center equipment workstation containing an Alpha Data ADM-PCIE-7V3 accelerator. vendors have demonstrated that a dis-

12 Xcell Journal Second Quarter 2015 COVER STORY

system engineers with the SDSoC environment for our Zynq SoCs,” said Ni (see Figure 3, the SDSoC development environment demo). The SDSoC environment leverages a macro compiler that takes the C/C++ code users have designated to accelerate in the Zynq SoC’s logic. By running the Vivado Design Suite under the hood, the environment automatically turns the code into an IP block and configures that block into the device’s logic, automatically generating a driver in software. SDSoC provides board support packages (BSPs) for Zynq All Pro- grammable SoC-based development boards including the ZC702 and Figure 3 – In this SDSoC development environment demo, principal engineer Jim Hwang uses SDSoC to create a simple image-processing pipeline to detect motion and insert motion edges ZC706, as well as third-party and mar- into a live HD 1080p video stream running at 60 frames per second. ket-specific platforms, such as the ZedBoard, MicroZed, ZYBO, and video and imaging development kits. and power and achieve up to 25x per- platforms and ASICs without requiring “We’ll be adding more BSPs to SD- formance/watt improvement over CPU a hardware engineer to do anything,” SoC in the coming months, especially and GPU implementations,” said Feist. said Nick Ni, SDSoC product manager. as more third-party platform compa- In this issue of Xcell Journal, Feist “With the SDSoC development environ- nies develop systems with the Zynq and colleagues contribute a detailed ment, they can program Zynq SoC and SoC,” said Ni. “The SDSoC not only ex- overview of the SDAccel environment MPSoC platforms in the same manner as pands the user base for Xilinx but also and a second article showing SDAccel they have with ASSPs. But what’s unique for companies developing platforms in action (see pages 38 and 42). is that with SDSoC, they can now create leveraging the Zynq SoC.” complete system designs in an Eclipse Ni is quick to note that while the main SDSOC MULTIPLIES IDE environment using C or C++ target- goal of the SDSoC environment is to en- EMBEDDED-SYSTEM INNOVATORS ing the Zynq SoC and Zynq UltraScale+ able the vast number of embedded-soft- While SDNet enables line card develop- MPSoC platforms.” ware designers to now create entire ers to quickly create next-generation net- Ni said that in using the SDSoC en- systems with Xilinx Zynq SoCs, users works with a unique, “softly defined” ap- vironment, embedded-software devel- with traditional FPGA backgrounds and proach, and while SDAccel allows data opers create their designs in C or C++ design teams can also greatly benefit center equipment vendors to achieve the and test to see what portions aren’t from using the environment. best performance/watt of their next-gen- running optimally on the Zynq SoC’s “It enables design teams to quickly eration data centers, the new SDSoC de- processing system. They highlight the design a system architecture in C and velopment environment may well have suspect code and command the SDSoC C++ and then try different configura- the broadest impact among Xilinx’s user environment to automatically partition tions to get the optimal performance base. That’s because SDSoC targets the that code into the Zynq SoC’s program- they require.,” Ni said. “If they do have vast numbers of embedded-system de- mable logic to speed up the system per- FPGA designers on their team, they can sign teams and specifically, their soft- formance. Ni said the SDSoC environ- have them use Vivado tools to optimize ware engineers designing for the majori- ment can move a software function to the blocks and logic layout further.” ty of markets Xilinx serves with its Zynq FPGA logic with the click of a button. For additional news about Alliance SoCs. The SDSoC environment enables And it doesn’t require a hardware en- member support for the SDSoC envi- users to now configure the logic—not gineer to do it. The compiler in SDSoC ronment, see the Xpedite section in this just program the processor of embedded will generate the entire Vivado project issue (page 66). For further informa- systems running Zynq SoC-based hard- as well as the bootable software image tion on Xilinx’s All Programmable Ab- ware platforms—in C and C++. for the targeted hardware platform. stractions, visit http://www.xilinx.com/ “Software developers are accustomed “We are essentially enabling em- products/design-tools/all-programma- to programming motherboards, ASSP bedded-software engineers to become ble-abstrac tions.html.

Second Quarter 2015 Xcell Journal 13 XCELLENCE IN BROADCAST Enabling a JPEG 2000 Network for Professional Video

by Jean-Marie Cloquet Manager, Image Processing Division Barco Silex [email protected]

14 Xcell Journal Second Quarter 2015 XCELLENCE IN BROADCAST

A new reference LOOKING FOR SUPERIOR ment JPEG 2000 encoders and decod- VIDEO COMPRESSION ers in their video gear. However, for JPEG 2000 supersedes the older JPEG the transport between locations, they design from Xilinx standard and offers many advantages still had a choice among a wide range and Barco Silex over its predecessor or other popular for- of implementation options, including Enabling a JPEG 2000 mats such as MPEG. By 2004, JPEG 2000 proprietary protocols. The drawback had become the de facto standard format for video service providers was that offers JPEG 2000 for image compression in digital cinema they had to lock into the products of video transport through the Hollywood-backed Digital one or a few vendors, instead of set- Network for Cinema Initiatives (DCI) specification. ting up the best-matching, cost-effi- The possibility of a visually lossless com- cient infrastructure. over Internet pression makes JPEG 2000 ideal for security, archiving and medical applications. A RECOMMENDATION TO Protocol networks. The broadcasting industry also took STANDARDIZE VIDEO TRANSPORT Professional Video notice. Broadcasting and video service So there was a clear demand from the companies have huge amounts of live service providers for a standardized ecause of its superior quality, JPEG video that has to be transported to post- transport to ensure better interoperabil- 2000 has emerged as the standard production and streaming facilities with- ity between existing and future equip- B of choice for the compression of in their so-called contribution networks ment. What they needed was a transport high-quality video, including the transport (Figure 2), without delay or loss of visual that could be best organized over IP net- of video in the contributing networks of quality. Of particular interest for the pro- works, which were becoming the prevail- television broadcasters. As a result, sup- fessional-video industry, therefore, is the ing network architecture, with standard- pliers of video equipment have started possibility of a visually lossless compres- ized equipment ready for high-throughput adding JPEG 2000 encoders and decoders sion—that is, a compression scheme that data transport. Starting in 2007, the Soci- to a variety of transport solutions, support- retains the image quality and still allows ety of Motion Picture and Television En- ing various interfaces and sometimes even efficient storage and transport. gineers (SMPTE) published a standard using proprietary protocols. In addition, the other innovations in for video transport over IP, which has This trend, however, has locked JPEG 2000 also meant a step forward been expanded since. SMPTE 2022 in- video service providers into the prod- for the broadcasting industry. Each cludes, among others, IP protocols for ucts of one or a few vendors. A solu- frame in the video stream is compressed constant-bit-rate video signals in MPEG-2 tion to this dilemma arrived in April individually as a still frame, in contrast transport streams (SMPTE 2022 1&2 for 2013 with the publication of the Video to MPEG formats, which compress compressed video and SMPTE 2022 5&6 Services Forum’s (VSF) TR-01 recom- frames in groups. This single-frame for uncompressed video). mendation for transport of video over compression technique results in a low Taking these specifications as its ba- Internet Protocol (IP) networks, a latency but also makes for easy per- sis, the Video Services Forum in 2013 specification for interoperable equip- frame post-processing and editing. A published its VSF TR-01 document, ment. Xilinx and Barco Silex, a certi- JPEG 2000 stream may also be partial- a technical recommendation titled fied member of the Xilinx® Alliance ly decompressed and viewed, allowing “Transport of JPEG 2000 Broadcast Pro- Program, quickly joined forces to sup- different applications and viewing expe- file Video in MPEG-2 TS over IP.” The port the interoperability effort. riences from the same stream. VSF is an international association com- Barco Silex has now completed a ref- Another big plus is the resilience posed of service providers, users and erence implementation of the VSF TR- against transmission errors in the stream. manufacturers dedicated to interopera- 01 recommendation. The multichannel If transmission errors cannot be correct- bility, quality metrics and education for video-over-IP with JPEG 2000 solution ed using forward error correction (FEC), video networking technologies. was recently made public on the Xilinx the errors will have a smaller visual im- Any device that adheres to VSF TR- website. It is based on intellectual-prop- pact after decoding compared with other 01 will take its input from an SDI (se- erty cores from Xilinx and Barco Silex, codecs. Finally, JPEG 2000 preserves the rial digital interface) signal, the legacy and is ready to be customized and inte- image quality even after multiple encod- standard for uncompressed point-to- grated by broadcast equipment OEMs. ing/decoding processes, which is of cap- point video transport in the broadcast Honoring this effort, the National Acad- ital importance in contribution networks industry. The device will extract the ac- emy of Television Arts and Sciences with various stages of video management. tive video, audio and ancillary data (for awarded Barco Silex a 2014 Technology Picking up on this interest, equip- example, captions) and compress the & Engineering Emmy Award (Figure 1). ment suppliers soon started to imple- video in JPEG 2000 format.

Second Quarter 2015 Xcell Journal 15 XCELLENCE IN BROADCAST

The resulting stream is multiplexed In this framework, the partners have can be multiplexed with uncompressed into an MPEG-2 transport stream to- now completed a reference design, video channels on a 10-Gbit link using gether with the audio and ancillary composed of a four-channel transmit- the 10GEMAC and 10G PCS/PMA cores. data. This stream is again encapsu- ter-and-receiver platform (Figure 3). On the receiver platform, the Eth- lated according to SMPTE2022 in a The transmitter is able to take up to ernet datagrams of the uncompressed Real-time Transport Protocol (RTP) four SDI high-definition (HD) streams streams are collected at the 10GEMAC. stream and transmitted over IP to a re- (1080p30), optionally compress them The SMPTE 2022-5/6 video-over-IP receiving device. The receiver will de-en- with JPEG 2000 and send them over ceiver core filters the datagrams, de-encapsulate the RTP/IP stream, demulti- 1-Gbps (with compression) or 10-Gbps capsulates and demultiplexes them into plex the MPEG-2 transport stream, (uncompressed) Ethernet according to individual streams, and outputs the SDI decode the JPEG 2000 and place the the VSF TR-01 standard. The receiver video through the SMPTE SDI cores. The video, audio and ancillary data onto platform, conversely, can receive the IP Ethernet datagrams of the compressed the output SDI signal. stream, de-encapsulate and decompress streams are collected at the 10GEMAC, it, and put it on up to four SDI HD links. de-encapsulated by the SMPTE 2022-1/2 IMPLEMENTING AN FPGA-BASED In the transmitter platform, Xilinx video-over-IP receiver core and by the REFERENCE SOLUTION SMPTE SDI cores receive the incom- TS Engine, and fed to the JPEG 2000 de- In September 2012, even before the VSF ing SDI video streams. On the un- coder. Its output video is converted to recommendation was published, Xilinx compressed path, these SDI streams SDI and sent to the SMPTE SDI cores. and Barco Silex announced a partner- are multiplexed and encapsulated For each of the four channels, the ship to develop video-over-IP solutions. into fixed-sized datagrams by Xilinx’s uncompressed or compressed path can The goal was to offer a comprehensive SMPTE 2022-5/6 video-over-IP transmit- be chosen independently of what hap- platform of hardware-validated intellec- ter core and sent out through the Xilinx pens on the other channels. tual-property cores, reference designs 10-Gigabit Ethernet MAC (10GEMAC) and system integration services. In this and 10G PCS/PMA cores. LAYING THE BASIS FOR effort, Barco Silex took on the role of On the compressed path, the SDI INTEROPERABLE SOLUTIONS system integrator, matching cores from streams first go to the JPEG 2000 en- The companies implemented the refer- Xilinx (SMPTE 2022, SMPTE SDI, Eth- coder for compression. Next, they are ence design in two platforms, one us- ernet MACs) with its own high-perfor- encapsulated into MPEG-2 transport ing the Zynq®-7000 All Programmable mance JPEG 2000 and DDR3 memory stream packets according to VSF TR-01 SoC and the other using the Kintex®-7 controller cores. The goal was to enable by the dedicated TS Engine core imple- FPGA. But the blocks that are used can OEMs of broadcast equipment to accel- mented by Barco Silex. Last, the SMPTE be integrated in solutions that address erate their product development and to 2022-1/2 video-over-IP transmitter core the complete range of OEM system re- add the latest video-over-IP capabilities packs the streams into fixed-size data- quirements, from low-cost, high-volume to their existing products and those cur- grams and sends them out through the applications to the most demanding rently in development. 1G TEMAC. Alternatively, the streams high-performance applications. The intellectual-property cores that were used, such as Xilinx’s SMPTE 2022 and Ether- net MAC LogiCORE™ blocks, are available for the full range of Xilinx FPGA systems, up to the UltraScale™ level. For encoding and decoding, the reference design includes the Barco Si- lex JPEG 2000 encoder and decoder IP cores. These are silicon-proven, widely adopted single-FPGA solutions for high-performance, simultaneous multichannel 720p30/60, 1080i, 1080p30/60 and 2K/4K/8K JPEG 2000 encoding and decoding. These cores also support the widest available spectrum of JPEG 2000 options Figure 1 – The Barco Silex video team responsible for the reference design with their 2014 in existence on the market. Essential in Technology & Engineering Emmy Award for Standardization and Productization of JPEG 2000 Interoperability. From left, they are Luc Ploumhans, Sake Buwalda, François Marsin, stitching multiple video streams together Jean-François Marbehant, Jean-Marie Cloquet and Vincent Cousin. into a smooth, high-data-rate system is

16 Xcell Journal Quarter 2015 XCELLENCE IN BROADCAST

lowing a sharper view and a larger video display. Using the four input channels of the reference design in quad-SDI mode (4K carried over four SDI cables), it is now also possible to take as input a 4K signal and send it over the IP network. This makes the reference design ready for video resolutions up to 4K.

FPGAS ADVANCE THE VIDEO INDUSTRY The goal of the collaboration between Xilinx and video specialist Barco Silex was to leverage the power and flexibility of FPGA-based platforms in the profes- Figure 2 – The contributing networks of broadcasting companies sional-video market. By combining the JPEG 2000 cores of Barco and the transport cores of Xilinx, OEMs may produce Barco Silex’s DDR3 memory controller. Links, Macnica, Nevion and Xilinx) pro- and update standardized broadcast This highly customizable controller is vided technology and equipment that was equipment quickly, making their prod- optimized to achieve high bandwidth by interconnected to show live transmission ucts future proof in the process. reordering accesses and mixing them to of 720p30 and 1080i60 HD content being This reference design arrives at a time the different banks of the SDRAM. compressed in real time using JPEG 2000 when the use of IP networks in the vid- The companies showed a first gener- encoders and decoders. eo industry is beginning to take off. The ation of this reference design in a public A few months later, Barco Silex ability of OEMs to capture a share of that interoperability demonstration during demonstrated that the reference design new market will depend on how fast they the annual VidTrans conference held in could also handle 4K and ultra-high-defi- can get products out. With reprogram- February 2014 in Arlington, Va. During nition (UHD) signals. As one of the main mable solutions based on Xilinx FPGAs, this test organized by the VSF, 10 compa- standards proposed for next-generation they can launch products even while nies (Artel, Barco Silex, Ericsson, Evertz, video distribution, 4K video carries four standards are still evolving. Imagine Communications, IntoPIX, Media times as many pixels as 1080p video, al-

OZ745

SDI SMPTE2022 SDI source 5/6 Enc SFP+ SDI source VideoEnc 10GEMAC 10G TX PCS/PMA SDI Rx SDI SMPTE2022 JPEG2000 JP2K TS SDI source 2 TS Enc 1/2 Videocin Enc Enc SDI source BA317

OZ745 VideoDec VideoOut JP2K TS SMPTE2022 Monitor 2 JPEG2000 TS Dec 1/2 SDI Dec Dec Monitor 10GEMAC 10G RX PCS/PMA SDI Rx SDI SMPTE2022 5/6 Monitor buffer level Dec

PIXCO Clock Monitor Recovery BA317

Figure 3 – Schema of the reference design with transmitter and receiver platform

Second Quarter 2015 Xcell Journal 17 XCELLENCE IN SCIENTIFIC APPLICATIONS White Rabbit: When Every Nanosecond Counts

by Dr. Javier Díaz Chief Executive Officer Seven Solutions SL [email protected]

Rafael Rodríguez-Gómez Chief Technical Officer Seven Solutions SL [email protected]

Dr. Eduardo Ros Chief Operating Officer Seven Solutions SL [email protected]

18 Xcell Journal Second Quarter 2015 XCELLENCE IN SCIENTIFIC APPLICATIONS

A highly accurate White Rabbit: Ethernet-based timing solution is making its way When Every to market in products The latest advances in telecommunica- built around Xilinx tions and informatics are pushing industrial time-transfer requirements significantly closerT to those of scientific applications. FPGA and SoC devices. Nanosecond Counts For instance, upcoming 100G Ethernet networks and 5G mobile telecom require timing accuracy in the range of a few nanoseconds, while smart grids for electrical-power distribution require submi- crosecond accuracy. Time-stamping for high-frequency trading (or stock trading in general) requires reliable mechanisms to distribute time from certified authorities to business centers. Finally, positioning services based on GNSS technologies, such as GPS or Galileo, benefit from high-accuracy synchronization mechanisms. An Ethernet-based technology called White Rabbit, born at CERN, the European Organization for Nuclear Research, prom- ises to meet the precise timing needs of these and other applications. Named after the time-obsessed hare in Alice in Wonder- land, White Rabbit is based on, and is compatible with, standard mechanisms such as PTPv2 (IEEE-1588v2) and Synchro- nous Ethernet, but is properly modified to achieve subnanosecond accuracy. White Rabbit inherently performs self-calibration over long-distance links and is capable of distributing time to a very large number of devices with very small degradation. Our startup, Seven Solutions SL, has developed the White Rabbit technology since its genesis in 2009 and brings WR product to market using Xilinx® All Programmable solutions. Our latest offering is the ZEN (Zynq® Embedded Node) board, a timing board intended to hold a high-precision reference clock that provides precise tim-

Second Quarter 2015 Xcell Journal 19 XCELLENCE IN SCIENTIFIC APPLICATIONS

The price may make a solution based on highly accurate clocks, such as chip-scale atomic clocks, too costly for massive utilization.

ing information to other nodes while tions make this approach invalid. message and annotating it in each node. synchronizing itself in the framework of Yet, the target price may make a Having these three elements—frequen- a White Rabbit network. solution based on highly accurate cy, phase (PPS) and time—we can say clocks (like chip-scale atomic clocks, that the network is synchronized. A BRIEF HISTORY OF TIME or CSACs) too costly for massive uti- Current industrial solutions provide Physicists have always understood the lization. In those cases, an alternative these properties in different ways. For importance of time and over the years approach is to distribute clock infor- instance, GPS devices provide a refer- have devised a wide variety of methods mation from a reference clock (highly ence frequency (10 to 50 MHz), a PPS of measuring it. From simple techniques stable and typically expensive) to all signal and a serial code to provide the based on scanning the sky (sun clocks, the other elements in the network that time (typically based on the NMEA pro- star scanning) to complex mechanisms need to be accurately synchronized. tocol). This approach is widely used that rely on the properties of the sub- The question is, how can we do that? in many systems that require accurate atomic world (atomic clocks), scientists synchronization, since different instru- have intensively worked toward devel- TIME TRANSFER TECHNOLOGIES mentation can be easily connected to oping accurate clocks. Existing clocks When you need to distribute time, there different GPS receivers. But it uses a would neither gain nor lose 1 second are many possible approaches. Note significant amount of low-level signals. in about 300 million years, and this ac- that distributing frequency (which in- In power grid applications, these values curacy is key in many applications, for volves sending the oscillator signal are provided based on a simple protocol instance for maintaining national me- through a wire) is not the same as dis- called IRIG-B, which provides time and trology labs’ time scales. tributing phase (when events trigger at PPS information. In the past, the IRIG-B However, these extremely accu- exactly the same instant in all the ele- approach has been good enough to syn- rate clocks are expensive, very fragile ments of a network). chronize the power grid. However, now- and take up significant physical space. We can solve the first problem (fre- adays it can’t handle the “smart grid,” Therefore, they are not well suited in quency distribution) by using, for exam- which is becoming ever-more complex many real-world scenarios. In fact, most ple, a coaxial cable or an optical fiber and contains new energy-monitoring ap- applications rely on electronics that in- to transmit the clock oscillation. In the plications that demand higher accuracy. clude cheap clocks (crystal oscillators). second scenario (phase distribution), With the quasi-omnipresence of For a few bucks we can choose among we can encode a pulse on the wire that packet networks, previous mecha- a large set of oscillators with very differ- transmits each second and use this nisms existing on switching networks ent specifications. pulse as a reference to know when a have evolved and adapted to packet Oscillators are accurate enough for new second starts, a technique typically networks. SDH/SONET technology is simple applications. But in many other called a pulse-per-second (PPS) signal. little by little transforming in solutions application fields requiring synchro- In addition, there is also a third based on the Precision Time Protocol nous communication or a global no- problem. We may also need to provide (PTPv2, or IEEE-1588v2) plus Syn- tion of time to operate simultaneously the time—not just to have everything chronous Ethernet (SyncE). PTPv2 (distributed instrumentation), these ticking at the same time or having the is an industrial evolution of Network “free-running clocks,” not tied to one same reference about when to start the Time Protocol (NTP), the protocol the another, cannot not be used. Although counting (phase), but also to ensure we Internet uses to synchronize the com- designers can partially solve the prob- have the same time in all the devices. puters throughout the network. PTPv2 lem by installing better oscillators, Therefore, time value can be distribut- relies on hardware time-stamping this is not always technically feasible. ed by propagation of the time informa- mechanisms that significantly improve Individual clocks are still unsynchro- tion from a central time server and then the accuracy of time synchronization. nized and even small frequency devia- measuring the propagation time of this The second mechanism, SyncE, makes

20 Xcell Journal Quarter 2015 XCELLENCE IN SCIENTIFIC APPLICATIONS

it possible to encode the clock signal in Scientific facilities are probably the when used for time distribution it is rec- the data carrier. In this way, transparently most demanding infrastructures requir- ommended that critical infrastructures to the user, we are able to distribute time ing highly accurate time distribution. use terrestrial alternatives (based on information and frequency to all the de- From CERN’s LHC accelerator to large optical fibers) as complementary redun- vices. Pairing PTPv2 with SyncE enables radio astronomy distributed facilities dant mechanisms. us to use packet networks for telecom- such as CTA, SKA or KM3NeT, all of munications seamlessly, and this com- them require ultra-accurate time and WHITE RABBIT SOLUTION bination is currently the most popular frequency distribution. White Rabbit (http://www.whiterabbitso- solution for industrial time distribution But next-generation IT and commu- lution.com/) is an extension of Ethernet on telecommunications, power grid and nications applications will also require for precise timing. It was conceived at automation applications. Note that some highly accurate time transfer that is CERN in 2009 as an open, collaborative key problems related to phase propaga- impossible to achieve with the cur- software/hardware initiative, and the tion and system scalability are critical rent standard approaches. In the GPS technology industry eagerly began par- and remain unsolved. world, to take one example, the mea- ticipating in its development. Sources surement of satellite signal-propagation are available at the Open Hardware Re- SCIENTIFIC APPLICATIONS time is equivalent to the measurement pository (OHWR, http://www.ohwr.org), AND BEYOND of distances, so positioning and time encouraging development by different Many applications require that refer- accuracy measures are closely related. companies and research institutions. ence clock source information be prop- In general, GNSS is vulnerable to the From the very beginning, Sev- agated to different destination points. problems of jamming or spoofing. Thus, en Solutions (www.sevensols.com),

Figure 1 – The White Rabbit applications profile

Second Quarter 2015 Xcell Journal 21 XCELLENCE IN SCIENTIFIC APPLICATIONS

based in Granada, Spain, has collab- The new ZEN board from Seven slave) with its own phase. The de- orated in the design of White Rabbit Solutions shows how the key elements viation should be equal to the prop- (WR) products including not only the of White Rabbit come together in a agation time of the signal through electronics but also the firmware and product (Figure 2). Based on Xilinx’s the fiber (properly measured using gateware. The company also provides Zynq-7000 All Programmable SoC, the PTP). Having this information, the customization and turnkey solutions ZEN board includes the White Rab- master can determine the phase dif- based on this technology. bit Core (WRC), along with a Gigabit ference between its own clock and As an extension of Ethernet, White Ethernet MAC implementation that’s the one from the slave, and request Rabbit technology is being evaluated for capable of providing a high-accuracy that the slave shift its phase to exact- possible inclusion in the next Precision clock. Synchronization mechanisms ly the same value as the master. This Time Protocol standard (IEEE-1588v3) implemented in the WRC include the process is done digitally by imple- in the framework of a high-accuracy following elements: menting a digital DMTD in the FPGA

profile. Standardization would facilitate gateware. • Frequency synchronization (syn- WR’s integration with a wide range of thonization): This is obtained by us- • T ime synchronization: This is a conse- diverse technologies in the future, as ing SyncE, which encodes the clock quence of using the PTPv2 protocol, shown in Figure 1. signal in the data carrier. To guar- which measures the link propagation antee that all nodes use the same times and provides the global notion INSIGHTS INTO THE frequency, we employ a mechanism of time. White Rabbit also takes into WHITE RABBIT TECHNOLOGY based on a local oscillator disci- account the asymmetries in the prop- White Rabbit incorporates a number of plined to the external clock that is agation time due to the utilization of mechanisms designed to optimize its recovered from the optical link. different wavelengths in a bidirec- timing accuracy within the framework tional optical fiber for each commu- of an extension of Ethernet (thus keep- • Phase synchronization: The physical nication transfer (back and forward ing the Ethernet communications struc- clock of a node is retransmitted to in the loop), thus improving the ac- ture). WR integrates PTP, Synchronous the master element and vice versa curacy of the standard PTP proto- Ethernet and digital dual-mixer time dif- so that the master can compare the col. Since the frequency and phase ference (DMTD) phase tracking. phase of this signal (coming from the have been previously synchronized, we can guarantee the global notion of time in all the devices within the White Rabbit network. All those operations are implemented in the WRC, partially using the proper FPGA gateware and partially using an embedded soft core. The White Rabbit products include the proper oscillators, PLL and timing electronics required to perform these various clock operations. As a case study, let’s take a more detailed look at the ZEN board. This board holds the Dual WRC (D-WRC), a revision of the original WRC developed by Seven Solutions in our new line of Xilinx 7 series products. The D-WRC is able to synchronize two White Rabbit nodes or to serve as an intermediate link in a daisychain network. In addition, the ZEN board includes high-precision, low-jitter and temperature-com- pensated clock resources controlled by the D-WRC. Furthermore, the Zynq SoC’s du- Figure 2 – White Rabbit gateware elements based on a Xilinx Zynq SoC device al-core ARM® Cortex™-A9 processor,

22 Xcell Journal Quarter 2015 XCELLENCE IN SCIENTIFIC APPLICATIONS

Figure 3 – The White Rabbit LEN (above) and ZEN (right) boards. Powered by Artix FPGA and Zynq SoC devices respectively, they come with proper cases for industrial utilization.

running under the Linux OS, facilitates • An FMC connector makes it pos- Rabbit Starting Kit consisting of a cou- the development of user applications. sible to plug in one of the mezzanine ple of Spartan®-6-based boards called Having Linux onboard allows the utili- boards developed in the framework SPEC, one of which can be configured zation of new and interesting features of the WR project or any other in- as a master and the other as a slave. The such as Web services for configuration, dustrial board existing in the mar- idea was to encourage users to perform SNMP support for status monitoring ket. These FMC cards enhance the several early-evaluation experiments. and remote firmware load and update. possibilities of the ZEN and allow The most complex element of this The ZEN board is intended to func- many product configurations. technology is the switch. We devel- tion as a high-precision time provider. oped (in collaboration with CERN, • Memory resources include SD, It offers a large amount of possibili- GSI and other partners) the 18-port DDR3 and flash. ties facilitated by diverse connections White Rabbit Switch, designing the and expansions: • Two UART-USB connectors are main board in the MicroTCA form fac- included for management and de- tor. The core element was a Virtex®-6 • IRIG-B I/O is the time of day used bugging in the D-WRC and Cortex (LX240T) FPGA. We paired this device by the ZEN board. It is able to work processors. with an external processor (ARM926E) as master or slave. running an embedded Linux OS to per- In brief, the ZEN board offers the form the high-level operations such as • Two 10/100/1000 Ethernet ports con- end user a node capable of reaching system updates, file management, etc. nected to the ARM processors can subnanosecond synchronization and The switch uses 18 GTX links for SFPs be used for diverse network appli- of working in daisychain schemes, and 40 GPIOs for general-purpose tasks cations and protocols (NTP, sNTP, while also providing the best of the (LEDS, SFP detection, etc.). This is a PTPv2, management and so on). Zynq SoC and the new level of sys- significantly complex product capable tem-design capabilities it affords. • Two SFP cages are provided for of distributing timing and also of han- plugging in WR-compliant links. dling data packets while using standard WHITE RABBIT EQUIPMENT telecommunication tools. • SMA connectors enable the ZEN The White Rabbit technology began its Recently, Seven Solutions ported the board to synchronize itself with life in the Open Hardware community White Rabbit core into the Xilinx Artix® more-precise clocks (for example, (Open Hardware Repository, OHWR) FPGA family in the LEN board (Figure a GPS source or highly stable os- promoted by CERN. To accelerate the 3), enabling more cost-effective and cilators) to provide a diverse set of learning curve with this technology, energy-efficient solutions than existing clocks synchronized by WR. Seven Solutions developed a White

Second Quarter 2015 Xcell Journal 23 XCELLENCE IN SCIENTIFIC APPLICATIONS

OHWR devices. Furthermore, we have dio-astronomy facilities. White Rabbit (due to environmental conditions and just developed a White Rabbit offering is already being used in several particle also accidental or malevolent jam- based on the Zynq SoC devices. The accelerators (at CERN and GSI, among ming and spoofing). GPS should not WR-ZEN node, previously described, other institutions) and is also under be used for safety-critical infrastruc- represents a complete and versatile sys- consideration at different international tures (“U.S. Air Force Chief Warns tem-on-a-chip approach, in which the scientific initiatives such as KM3NeT, against Over-Reliance on GPS,” Inside node and the computer are all integrat- HISCORE and others. Thus, the WR ap- GNSS News, Jan. 20, 2010). And in this ed in the same board. This solution will proach has been validated in highly de- sense, White Rabbit represents an al- facilitate maintenance while reducing manding applications where accurate ternative that allows accurate timing cost and improving system flexibility. timing and frequency transference over and frequency transfer over terrestri- Current industrial products de- distributed instrumentation on large fa- al optical fibers. Standard telecommu- veloped by Seven Solutions provide cilities are critical. nication networks can be deployed, standard interfaces for management, In 2014, White Rabbit was also making this approach cost-effective. configuration and monitoring, taking tested over long-distance links of Beyond the functional features of advantage of all the benefits of White 125 km by VSL in the Netherlands White Rabbit, Seven Solutions is devel- Rabbit technology but with enhanced and 1,000 km by MIKES in Finland. oping new industrial products that tar- features, support and documentation. Accurate timing is demanded in a get critical applications requiring high wide range of applications beyond the availability. Redundant power sources, WHITE RABBIT APPLICATIONS scientific fields. The smart power grid hot-swappable fans, holdover oscilla- The first destination for White Rabbit requires accurate and reliable timing, tors and other techniques will allow technology was scientific applications. while high-frequency trading also relies their deployment in facilities that can- More recently, this technology has been on accurate and certified timing. not afford system failure or long repair integrated in several facilities and re- Many of these application domains processes. search projects in the framework of currently depend on GPS timing sig- Figure 4 shows how to use WR-LEN high-energy physics and distributed ra- nals, which are inherently vulnerable as a distributed mechanism to provide

Figure 4 – A White Rabbit network providing accurate timing to different nodes. Daisychain configuration is allowed through WR-LEN nodes.

24 Xcell Journal Quarter 2015 XCELLENCE IN SCIENTIFIC APPLICATIONS

Figure 5 – Network configuration toward safety-critical systems based on the ZEN time provider

timing information in a simple and nization requirements for different end to solve the well-known phase-synchro- cost-effective way. The system is able application fields including smart grids, nization problem. to distribute timing in a GPS-like man- telecommunications and high-frequen- These are just some examples of fu- ner. It provides timing with the IRIG-B cy trading. WR solves problems like the ture applications in which White Rabbit output format, which suits power phase synchronization and at the same may have a significant impact. Moreover, grid applications. In addition, PTPv2 time is able to synchronize clocks with thanks to the potential standardization is also an available interface, a valid subnanosecond accuracy for tens of thou- of White Rabbit within the IEEE-1588v3 option since it can be integrated with sands of devices distributed along large profile (which is currently under consid- PTP networks used on modern power distances (hundreds of kilometers). As a eration), it will be easy to adapt to multi- grid facilities. result, White Rabbit allows ultrahigh-ac- ple vendors. And we think that the most Figure 5 shows a network configu- curacy time transfer and, simultaneously, challenging applications are still to arise. ration based on a ZEN-powered time full data transfer without penalty. Seven Solutions is providing the first provider. In this case, much more pow- These features, along with White industrial products based on the White erful features are present. Redundant Rabbit’s scalability, will allow the devel- Rabbit technology. Our turnkey solu- power supplies, redundant network opment of a world-scale ground-based tions allow an easy integration into user topologies and holdover CSAC oscilla- synchronization mechanism that can applications based on standard tele- tors on critical elements are possible. be used as back-end technology for communication interfaces and can be Moreover, FMC expansion ports make GPS solutions (based on ground station configured/read just by using standard it possible to add mezzanine cards to antennas) and could open the door to software such as Web services or SNMP. further develop additional sensing and novel applications such as autonomous Next, we will incorporate features for control applications. automobiles or indoor navigation. The distributing RF generation (without re- upcoming 100G telecom networks will quiring sending the reference frequen- FUTURE DEVELOPMENTS benefit from mechanisms for accurate cy) or triggering event acquisition with White Rabbit is a promising technology quality-of-service evaluation, while 5G high accuracy as required for many tele- that is able to solve incoming synchro- wireless technologies can rely on WR communication applications.

Second Quarter 2015 Xcell Journal 25 XCELLENCE IN WIRELESS COMMUNICATIONS Evaluating the Linearity of RF-DAC Multiband Transmitters

Researchers at Bell Labs show how to create a flexible platform for rapid evaluation of RF DACs using Xilinx FPGAs, cores and MATLAB.

by Lei Guan Member of Technical Staff Bell Laboratories, Alcatel Lucent Ireland [email protected]

26 Xcell Journal Second Quarter 2015 XCELLENCE IN WIRELESS COMMUNICATIONS

he wireless commu- Here at Bell Labs Ireland, we have cre- sequences. For this purpose, we de- nications industry ated a flexible software-and-hardware signed two testing strategies: a continu- has entered into a platform to rapidly evaluate the RF DACs ous-wave (CW) signals test (xDDS) and new, all-in-one era, that are potential candidates for next-gen- a wideband signals test (xRAM). with every network eration wireless systems. The three key el- Multitone CW testing has long been operator seeking ements of this R&D project are a high-per- the preferred choice of RF engineers more-compact and formance Xilinx® FPGA, Xilinx intellectual for characterizing the nonlinearity of multiband infra- property (IP) and MATLAB®. RF components. Keeping the same test- structure solutions. The emerging RF- A few words here before starting this ing philosophy, we created a tunable Tclass data converters—namely, RF DACs engineering story. In the design, we tried four-tone logic core based on a direct and RF ADCs—architecturally make it to minimize the FPGA resource usage digital synthesizer (DDS), which is ac- possible to create compact multiband while keeping the system as flexible as tually using a pair of two-tone signals to transceivers. But the nonlinearities inher- possible, so we were only interested in im- stimulate the RF DAC at two separate ent in these new devices can be a stum- plementing the necessary functions. For frequency bands. By tuning the four bling block. setting up the whole testing system, we tones independently, we can evaluate For instance, nonlinearity of the RF picked the latest Analog Devices RF-DAC the linearity performance of the RF devices has two faces in the frequency evaluation boards (AD9129 and AD9739a) DAC—that is, the location and the pow- domain: in-band and out of band. In-band and the Xilinx ML605 evaluation board. er of the intermodulation spurs in the nonlinearity refers to the unwanted fre- The ML605 board comes with a Virtex®-6 frequency domain. quency terms within the TX band, while XC6VLX240T-1FFG1156 FPGA device, CW signal testing is an inherently nar- out-of-band nonlinearity is the undesired which contains fast-switching I/Os (up to rowband operation. To further evaluate frequency terms out of the TX band. 710 MHz) and serdes units (up to 5 Gbps) the RF DAC regarding wideband perfor- For system engineers who are pro- for interfacing the RF DACs. mance, we need to drive it with concur- totyping multiband transmitters us- Now, let’s take a closer look at how rent multiband, multimode signals, such ing RF DACs, it is critical to ensure to use Xilinx FPGAs, IP and MATLAB as dual-mode UMTS and LTE signals at this key component meets the lin- to create this simple but powerful 2.1 GHz and 2.6 GHz, respectively. For that earity requirement of the standards. testing platform. purpose, we created an on-chip BRAM ar- Therefore, at the early prototyping ray-based data storage core that has two stage, a flexible testing platform is SYSTEM-LEVEL REQUIREMENT subgroups in which to store the respective fundamentally required to properly AND DESIGN dual-band user data for repeated testing. evaluate the nonlinear performance The key purpose of this evaluation plat- Figure 1 illustrates the simplified de- of the RF DACs regarding the multi- form is to stimulate the RF DAC with sign diagram at the system level. As you band applications. various user-customized testing-data can see, we are using a straightforward

Figure 1 – Simplified block diagram of the platform at the system level

Second Quarter 2015 Xcell Journal 27 XCELLENCE IN WIRELESS COMMUNICATIONS

design strategy here, building up the generated and encapsulated into frames We combined multiple tones via adders platform as simply as possible and mod- in MATLAB, as shown in Figure 3. and then pipelined these tones to the next ularizing it with upgrading capability. Basically, there are two types of data stage. Since the output of the DDS core frames in the system. The frame with was in two’s complement binary format, a HW DESIGN: XILINX FPGA CORE header “FF01” (cRAM frame) was used format-conversion unit may be needed if The FPGA portion of Figure 1 outlines for transmitting phase-incremental val- the RF DAC requires another data format, the implemented logic units the system ues for DDSes and system control mes- such as offset binary. fundamentally required. They include sages. The other frame, with the head- In general, high-performance on-chip a clock distribution unit, a state ma- er “FF10” or “FF11” (dRAM frame), is BRAMs are always the first choice for chine-based system control unit and a used for delivering the user-customized creating small to medium-size user stor- DDS core-based multitone generation data. States “S1x” process only the data age. For example, in this platform, we unit, along with two units built around with the header “FF01” for updating the utilized Xilinx’s Block Memory Generator Block RAM: a small BRAM-based control phase-incremental values and executing core to build up two independent data message storage unit (cRAM core) and a the control instructions. States “S2x” RAMs for two frequency bands. Each of BRAM array-based user data storage unit and “S3x” are utilized for receiving and them is 16 bits in width and 192k in depth. (dRAM core). Also included are a UART storing the user-customized data for two For communicating between the PC serial interface toward the PC and a high- bands, respectively. The busy signal is and the FPGA, we created a UART serial interface unit and configured it at a relatively low speed, 921.6 kbps, equivalent to 115.2 kbytes per second. It takes around 0.16 milliseconds and 3.33 seconds to transfer cRAM frames (18 bytes) and dRAM frames (~384k bytes), respectively. Device vendors usually provide an example design of the high-speed data interface toward their chip in VHDL or Ver- ilog format. It is not very difficult for an experienced FPGA engineer to reuse or customize the reference design. For example, considering our system’s AD9739a and AD9129 RF DACs, a reference design of the parallel LVDS interface is available from Analog Devices. By the way, if there is no available example design from chip vendors, Xilinx has several very handy cores for the high-speed interfaces, such Figure 2 – Detailed design chart of the key state machine as CPRI and JESD204B.

speed data interface toward the RF DAC. used for continuously latching the data SW DESIGN: MATLAB DSP The clock is the life pulse of the in until seeing the last stop bit at the end FUNCTIONS AND GUI FPGA. In order to ensure that mul- of the data sequences. The control mes- We chose MATLAB as the software host, tiple clocks are properly distributed sages, for example invoking single/multi- simply because it has many advantag- across FPGA banks, we chose Xilinx’s ple DDS or the user data sequences, are es in terms of digital signal processing clock-management core, which pro- stored in the last two bytes of the cRAM (DSP) capability. What’s more, MATLAB vides an easy, interactive way of defin- frame. They will be executed at the rising also provides a handy tool called GUIDE ing and specifying clocks. edge of the cRAM_rd_done signal. for laying out a graphical user interface A compact instruction core built We then instantiated four indepen- (GUI). So now, what do we need from around a state machine serves as the dent tone-generation units using Xilinx MATLAB for this project? system control unit. As shown in Fig- DDS cores, which we configured as the We actually need a user interface as- ure 2, at the initial state (S0), a header phase-incremental mode. The phase-in- sociated with lower-level DSP functions detector unit is active, monitoring and cremental values for specific frequencies and data flow control functions. The re- filtering the incoming data byte from were generated in MATLAB and down- quired DSP functions are a phase-incre- the UART receiver. The data bytes were loaded to the FPGA via the cRAM frame. mental values calculator, baseband data

28 Xcell Journal Second Quarter 2015 XCELLENCE IN WIRELESS COMMUNICATIONS

In xRAM mode, we generated the baseband data sequences, normalized them to full scale (signed 16-bit) and up- converted them to the required frequency in MATLAB. After downloading the processed data to the dRAMs via UART, we can invoke the wideband signal testing by pressing the start button. Don’t forget to configure the UART serial interface in MATLAB with the same protocol parameters used on the FPGA side. Finally, we used a signal generator, the R&S SMU200A, to provide the sampling clock to logically “turn on” the RF Figure 3 – Illustration of the data frame encapsulation DAC. And we connected the output of the RF DAC to a spectrum analyzer for evaluating the linearity performance of sequence generator and digital upcon- phase-incremental value for the fre- the RF DAC in the frequency domain. verter. The control functions are a data quency tone fc with sampling rate frame encapsulator, UART interface fs by means of a simple equation, QUICK EVALUATION controller and system status indicators. phase_incr = fc*2nbits/fs, At the early-prototyping stage, evaluation Figure 4 illustrates the GUI that we where nbits represents the number of the key RF components regarding linear- created for the platform. The key pa- of binary bits used by DDS for synthe- ity performance can be a critical problem, rameter of the RF DAC—namely, the sizing the frequencies. By pressing the but with our hardware/software platform sampling rate—should be defined first, “start” button, the generated phase-in- you can manage this evaluation quick- and then you can select either xDDS cremental values will be converted to ly without compromising performance. mode or xRAM mode for stimulating the fixed-point format, encapsulated Then you can add in your RF power ampli- the device. Next, at each subpanel, into a 2-byte data frame with differ- fier and use the proposed platform to eval- we can customize the parameters on ent headers and control messages as uate the linearity of the cascaded system. the go to invoke the corresponding shown in Figure 3, and then sent to the After identifying the nonlinearities, you MATLAB signal-processing functions. cRAM unit via UART and executed in can implement some digital-predistortion In xDDS mode, you can calculate the the FPGA. algorithms to eliminate the unwanted nonlinearities of the cascaded system. Properly using Xilinx IP cores in your FPGA design will significantly reduce the development cycle and increase the robustness of your digital system. Look- ing forward, we expect to upgrade the data interface module in the platform to the JESD204B standard to support the higher data rates required by multiple synchronized RF DACs. Meanwhile, we are migrating the FPGA host from Xilinx’s ML605 to the Zynq®-7000 All Programmable SoC ZC706 evaluation kit. The Zynq SoC design will be a good choice for creating a standalone solution that does not need any external DSP and control functions on a separate PC. For more information regarding the platform and digital predistortion, contact the author at lei.guan@alca- Figure 4 – Snapshot of the graphical user interface tel-lucent.com.

Second Quarter 2015 Xcell Journal 29 XPERIMENT Oberon System Implemented on a Low-Cost FPGA Board

by Niklaus Wirth Professor (retired) Swiss Federal Institute of Technology (ETH) Zurich, Switzerland [email protected]

30 Xcell Journal Second Quarter 2015 XPERIMENT

n 1988, Jürg Gutknecht and I completed and published the programming language A Xilinx Spartan-3 board Oberon [1,2] as the successor to two other languages, Pascal and Modula-2, which becomes the basis for I had developed earlier in my career. We Ioriginally designed the Oberon language to be revamping the author’s more streamlined and efficient than Modula-2 so that it could better help academics teach system Oberon programming programming to computer science students. To advance this endeavor, in 1990 we developed the language and compiler for Oberon operating system (OS) as a modern implementation for workstations that use windows use in software education. and have word-processing abilities. We then published a book that details both the Oberon compiler and and the operating system of the same name. The book, entitled Project Oberon, includes thorough instructions and source code. A few years ago, my friend Paul Reed suggested that I revise and reprint the book because of its value for teaching system design, and because it serves as a good starting point to help would-be innovators build dependable systems from scratch. There was a big obstacle, however. The compiler I originally developed targeted a processor that has essentially disappeared. Thus, my solution was to rewrite a compiler for a modern processor. But after doing quite a bit of research, I couldn’t find a processor that satisfied my criteria for clarity, regularity and simplicity. So I designed my own. I was able to bring this idea to fruition because of the modern FPGA, as it

Xcell Journal is honored to pub- lish this article by industry legend Niklaus Wirth, who invented Pascal and several derivative programming languages and has pioneered some classic approaches to computer and software engineering. A recipient of the ACM Turing Award and of the IEEE Computer Pioneer Award, Professor Wirth has retired from teaching but continues to help ed- ucators develop and inspire the innovators of tomorrow.

Second Quarter 2015 Xcell Journal 31 XPERIMENT

Choosing a Xilinx FPGA allowed me to update the system while keeping the design as close as possible to the original version from 1990.

allowed me to design the hardware as The book and the source code for 32 bits; and a control unit with instruc- well as the system software. What’s the entire system are available on pro- tion register, IR and program counter more, choosing a Xilinx® FPGA al- jectoberon.com [3,4,5]. Also available at PC. The processor is represented by the lowed me to update the system while the same site is a single file called S3RI- Verilog module RISC5. keeping the design as close as possible SCinstall.zip. This file contains instruc- The processor features 20 instruc- to the original version from 1990. tions, an SD-card filesystem image and tions: four for moving, shifting and ro- The new processor is called RISC, FPGA configuration bit files (in the form tating; four for logic operations; four and it was implemented on the low- of PROM files for the Spartan-3 board’s for integer arithmetic; four for float- cost Digilent Spartan®-3 development Platform Flash), together with construc- ing-point arithmetic; two for memory board, hosting a 1-Mbyte static RAM tion details for the SD-card/mouse inter- access; and two for branching. (SRAM) memory. The only system face hardware. RISC5 is imported by RISC5Top, the hardware additions I made were to environment, which contains the interfac- add an interface for a mouse and an THE RISC PROCESSOR es to various (memory-mapped) devices SD card to replace the hard-disk drive The processor consists of an arithme- and the SRAM (256M x 32 bit). The entire in the older system. tic-logic unit; an array of 16 registers of system (Figure 1) consists of the follow-

RISC5TOP

RISC5

Multiplier Divider FPAdder FPMultiplier FPDivider

VID SPI PS2 Mouse RS232

Figure 1 – Diagram of the system and the Verilog modules it contains

32 Xcell Journal First Quarter 2015 XPERIMENT

where the left button serves to set a car- RISC5Top environment 194 et, marking a text position, and the right RISC5 processor 201 button serves to select a text stretch. Multiplier integer arithmetic 47 The module “Kernel” contains the Divider 24 disk-store management and the garbage FPAdder floating-point arithmetic 98 collector. I ensured that the viewers FPMultiplier 33 were tiled and do not overlap. The stan- FPDivider 35 dard layout shows two vertical tracks SPI SD card and transmitter/receiver 25 with any number of viewers. You can VID 1024 x 768 video controller 73 enlarge them, make them smaller or PS2 keyboard 25 move them by simply dragging their title Mouse mouse 95 bars. Figure 2 shows the user interface RS232T RS232 transmitter 23 running on the monitor, along with the RS232R RS232 receiver 25 Spartan-3 board, keyboard and mouse. The system when loaded occupies 112,640 bytes in module space (21 per- ing Verilog modules (line counts shown): THE OBERON OPERATING SYSTEM cent) and 16,128 bytes of the heap (3 I memory-mapped the black-and- The operating system software consists percent). It consists of the following white VGA display so that it occupies of a core that includes a memory allo- modules (line counts shown), as depict- 1024 x 768 x 1 bit per pixel = 98,304 cator with a garbage collector and a file ed in Figure 3: bytes, which is essentially 10 percent of system, along with the loader, a text sys- the available main memory of 1 Mbyte. tem, a viewer system and a text editor. Kernel 271 (inner core) The SD card replaces the 80 Mbytes in The module called “Oberon” is the FileDir 352 the original system. It is accessed over central task dispatcher while “System” Files 505 a standard SPI interface, which accepts is the basic command module. An ac- Modules (loader) 226 and serializes bytes or 32-bit words. The tion is evoked by clicking the middle Viewers 216 (outer core) keyboard and the mouse are connected button on the text “M.P” in any viewer Texts 532 by standard, serial PS-2 interfaces. Fur- on the display, where P is the name of a Oberon 411 thermore, there is a serial, asynchro- procedure declared in module M. If M is MenuViewers 208 nous RS-232 line and a general-purpose, not present, it is automatically loaded. TextFrames 874 8-bit parallel I/O interface. Module Most text-editing commands, howev- System 420 RISC5Top also provides a counter, in- er, are evoked by simple mouse clicks, Edit 233 cremented every millisecond. It is remarkable that system initialization at power-on or reset takes only about 2 seconds. This includes a garbage-collecting scan of the file directory.

THE OBERON COMPILER The compiler, hosted on the system itself, uses the simple top-down recursive-descent parsing method. You can activate the compiler on a module’s selected source text by using the command ORP.Compile @. The parser inputs symbols from the scanner delivering identifiers, numbers and special symbols (like BE- GIN, END, + and so on). This scheme has proven to be useful and elegant in many applications. It is described in detail in my book Compiler Con- Figure 2 – Monitor showing user interface, with Spartan-3 board at right struction [6,7].

First Quarter 2015 Xcell Journal 33 XPERIMENT

The parser calls procedures in the ported variables. The base addresses are ORP parser 968 code generator module. They direct- loaded on demand from a system-glob- ORG code generator 1120 ly append instructions in a code array. al module table, whose address is held ORB base def 435 Forward-branch instructions are pro- in register R12. R15 is used for return ORS scanner 311 vided with jump addresses at the end of addresses (link), as determined by the a module’s compilation (fixups), when RISC architecture. Thus, R0 - R11 are The compiler occupies 115,912 all branch destinations are known. available for expression evaluation and bytes (22 percent) of the module All variable addresses are relative to a for passing procedure parameters. space and 17,508 bytes (4 percent) of base register. This is R14 (stack pointer) The entire compiler consists of four the heap (before any compilation). Its for local variables (set at procedure en- relatively small and efficient modules source code is about 65 kbytes long. try at run-time), or R13 for global and im- (line counts shown): Compilation of the compiler itself

The Lola HDL and its Translation to Verilog

The hardware-description language To drive the development of Lola-2, ethz.ch/personal/wirth/Lola/Lola2.pdf). (HDL) called Lola was defined in 1990 we had the ambition to reformulate in For the sake of brevity, we show as a means for teaching the basics of Lola all modules of the RISC5 proces- here only a single example of a Lola hardware design. This was the period sor. This has now been achieved. text (Figure 1). The unit of source when textual definitions started to re- text is called a module. Its heading place circuit diagrams, and when the THE LOLA LANGUAGE specifies the name and its input and first FPGAs became available, although Lola is a small, terse language in the output parameters with their names had not yet reached mainstream design. style of Oberon (see http://www.inf. and types. The heading is followed by Lola was implemented by a compiler that generated bit files to be loaded onto FPGAs. Bit-file formats were disclosed MODULE Counter0 (IN CLK50M, rstIn: BIT; by Algotronix Inc. and Concurrent Logic IN swi: BYTE; OUT leds: BYTE); Inc. Both featured cells of rather simple structures, which seemed to be optimal TYPE IBUFG := MODULE (IN I: BIT; OUT O: BIT) ^; for automatic placement and routing. VAR clk, tick0, tick1: BIT; In the wake of my project to reim- plement Oberon on an FPGA, the idea clkInBuf: IBUFG; now popped up also to revive Lola. Be- REG (clk) rst: BIT; cause the cells of Xilinx’s FPGAs feature cnt0: [16] BIT; (*half milliseconds*) a rather complex structure, we did not cnt1: [10] BIT; (*half seconds*) venture into the effort of implementing cnt2: BYTE; placement and routing, quite apart from the fact that Xilinx refuses to disclose its bit-file format for proprietary reasons. BEGIN leds := swi.7 -> swi : swi.0 -> cnt1[9:2] : cnt2; The obvious path was to build a tick0 := (cnt0 = 49999); Lola compiler that does not generate tick1 := tick0 & (cnt1 = 499); proprietary bit files, but translates into rst := ~rstIn; a language for which Xilinx provides a cnt0 := ~rst -> 0 : tick0 -> 0 : cnt0 + 1; synthesis tool. We chose Verilog. This cnt1 := ~rst -> 0 : tick1 -> 0 : cnt1 + tick0; solution implies a rather extravagant cnt2 := ~rst -> 0 : cnt2 + tick1; detour: First, a Lola module has to be parsed, then translated, then parsed clkInBuf (CLK50M, clk) again. With all these steps, we ensured END Counter0. that the Lola compiler had proper error reporting and type-consistency Figure 1 — Lola text showing a counter counting milliseconds and seconds, checking capabilities. displaying the latter on the board’s LEDs

34 Xcell Journal Second Quarter 2015 XPERIMENT

takes only a few seconds on the 25- operations must be restricted to driver portant advantage is the presence of MHz RISC processor [8]. modules accessing device interfaces, and static RAM on this board, which makes The compiler always generates checks they are easily recognized by SYSTEM interfacing straightforward (even for for array indices and for references in their import lists. The entire system is byte selection). Unfortunately, newer through pointers with value NIL. It caus- programmed in Oberon itself, and no use boards all feature dynamic RAM, which es a trap in case of violation. This tech- of assembler code is required. is, albeit larger, very much more compli- nique ensures a high degree of security I chose the Digilent Spartan-3 devel- cated to interface, requiring circuits for against errors and corruption. In fact, sys- opment board because of its low cost refresh and initialization (calibration). tem integrity can be violated only by the and simplicity, which makes it suitable This circuitry alone can be as complex use of operations in the pseudo-module for educational institutions to acquire as the entire processor with static RAM. SYSTEM, namely PUT and COPY. These sets of the kits for classrooms. An im- Even if a controller is provided on-chip,

a section containing declarations of fined by one and only one expression LSS scanner 159 local objects, such as variables and (combinational circuit). Multiple as- LSB base 52 registers. Then follows the section signments do not make sense. We can LSC compiler/parser 503 defining the values of variables and simply imagine every HDL program LSV Verilog generator 215 registers through assignments. BYTE to be enclosed in a big repeat-forev- denotes an array of 8 bits. er clause, because the assignments to registers and variables are repeated in THE LOLA COMPILER Instructions on translating from Lola every clock cycle. The compiler uses the simple top-down to Verilog can be found at http://www. It was the ingenious idea of John recursive-descent parsing method. It is inf.ethz.ch/personal/wirth/Lola/Lola- von Neumann to introduce a proces- activated on the selected Lola source Compiler.pdf. sor architecture with a sequencer. text by the command The sequencer contains an instruc- LSC.Compile tion register, according to which @. The parser inputs symbols from the THE DIFFERENCE BETWEEN scanner delivering identifiers, num- SOFTWARE AND HARDWARE (in every cycle) certain circuits are bers and special symbols (like BEGIN, ‘PROGRAMS’ selected and others ignored, thus END, +, etc.). This scheme has proved Many efforts have been undertaken cleverly reusing the different parts to be useful and elegant in many ap- in the past to make HDLs look like (of the ALU). Now, cycles or steps plications. It is described in the book “ordinary” programming languages have become inherently sequential, Compiler Construction (Part 1 and (PLs). We also note that HDLs typi- and it is possible to reassign values Part 2). cally have a counterpart in the set of to the same variables, as the program Instead of generating Verilog texts PLs, adopting the respective style. counter relates them to certain plac- directly on the fly, statements are Thus, Verilog has its ancestor in C, es in the program and in the instruc- present in the parser which, as a VHDL in Ada and Lola in Oberon. We tion sequence. It is this idea of the side effect of parsing, generate a tree consider it important to recognize sequencer that makes it possible to representing the input text in a way the fundamental differences between execute enormous programs by rela- appropriate for further processing. the two classes, in particular in the tively simple circuits. This structure has the advantage that presence of syntactic similarities or In summary, Lola-2 is an HDL in various, different outputs may easily even identities. What are these fun- the style of the PL Oberon. The com- be generated by calling on different damental differences? piler presented here translates Lola translators. One of them is a transla- In order to simplify an explanation, modules into Verilog modules. Lola’s tor to Verilog. The command is we restrict our analysis to synchro- advantages lie in the language’s sim- LSV. ple and regular structure, and in the List outputfile.v. Another nous circuits—that is, circuits in command may translate to VHDL, or which all registers tick with the same compiler’s emphasis on type checking simply list the tree. Yet another might clock. It is indeed a healthy design par- and improved error diagnostics. The generate a netlist for further process- adigm to stick to synchronous circuits entire set of modules for the RISC pro- ing by a placer and router. in general, if possible. Then, quite ob- cessor have been expressed in Lola Thus, the entire compiler consists viously, all elements of a circuit oper- (see http://www.inf.ethz.ch/personal/ of at least four relatively small and ef- ate concurrently, literally at the same wirth/Lola/index.html). ficient modules (line counts shown): time. Every variable and register is de- — Niklaus Wirth

Second Quarter 2015 Xcell Journal 35 XPERIMENT

this counteracts our principle of laying ence and technology, students are paradigm. Here the student was asked everything open for inspection. exposed to many master examples of to write programs right from the start, design before they are asked to exper- before having read any examples. CLOSING THOUGHTS iment with their own attempts. Pro- The reason for this dire fact was More than 40 years ago, C.A.R. Hoare gramming and software design stood that there existed almost no litera- RISC5TOP remarked that in all branches of sci- in marked contrast to this sensible ture with sizable master examples. I therefore decided to remedy the situation somewhat, and I wrote the book on Algorithms and Data Structures in 1975. Subsequently (with J. Gut- RS232 knecht), having been charged with the task of teaching a course on oper- System ating systems, I designed the system Oberon Oberon (1986-88). Since then, the teaching of programming has not markedly improved, TextFrame whereas systems have dramatically increased in size and complexity. Although the open-source endeavor is to be welcomed, it has not really Input changed the situation, since most pro- MenuView grams have been constructed “to run,” but not really for human consumption, for being understood. I continue to make the daring pro- Oberon posal that all programs should be designed not just for computers, but for human reading. They should be pub- lishable. This is a task much harder Viewers Texts Input than creating executable programs, even correct and efficient ones. It implies that no part must be specified in assembler code. The result of ignoring this “hu- Display Modules Fonts man factor” is that in many places new applications are not carefully designed, but rather debugged into existence, sometimes with dismal Files consequences. An important ingredi- ent to achieve understandability is to adhere to simplicity and regularity, and to renounce unnecessary embel- lishments, to avoid bells and whistles FileDir and to distinguish between conven- tional and convenient. The small size of this system is wit- ness to how much can be achieved Kernel with how little. The Oberon system’s dimensions are ridiculously small RISC5 compared with most modern operating systems, although it includes a Figure 3 – The system and its modules file system, a text editor and a viewer (windows) management. The side ef- Multiplier Divider FPAdder FPMultiplier FPDivider 36 Xcell Journal Second Quarter 2015

VID SPI PS2 Nouse RS232 XPERIMENT

fect is that it rests on just a few sim- ACKNOWLEDGEMENT 3. www.inf.ethz.ch/personal/wirth/Pro- ple rules, and that therefore it is easy I gratefully acknowledge the invalu- jectOberon/index.html to learn how to use it. able contributions of Paul Reed. He Finally, another advantage of this suggested that I re-edit the book Proj- 4. www.inf.ethz.ch/personal/wirth/ terseness is that one can safely build ect Oberon and also suggested reim- Oberon/PIO.pdf upon this basic system without being plementing the entire system on an 5. www.inf.ethz.ch/personal/wirth/ afraid of the existence of unknown FPGA. Paul was an essential source of Oberon/Oberon07.Report.pdf features, such as back doors. This is encouragement. He thought of replac- 6. http://www.inf.ethz.ch/personal/ an essential property in view of the ing the disk with an SD card, and he wirth/CompilerConstruction/Compil- increasing danger of attacks on the contributed the SPI, the PS-2 and the erConstruction1.pdf integrity of a system, indispensable VID interfaces in Verilog. for safety-critical applications. Nota- 7. http://www.inf.ethz.ch/personal/ bly, also the hardware of our system References wirth/CompilerConstruction/Compil- erConstruction2.pdf is free of such hidden parts. After all, 1. http://www.inf.ethz.ch/personal/ no guarantee can be given for any sys- wirth/Oberon/Oberon07.Report.pdf 8. http://www.inf.ethz.ch/personal/ tem that is built upon a base that is so 2. http://www.inf.ethz.ch/personal/ wirth/ProjectOberon/PO.Applica- large that nobody can understand it in wirth/Oberon/PIO.pdf tions.pdf (Ch. 12) its entirety.

FPGA-Based Prototyping for Any Design Size? Any Design Stage? Among Multiple Locations? That’s Genius! Realize the Genius of Your Design with S2C’s Prodigy Prototyping Platform

Download our white paper at: http://www.s2cinc.com/resource-library/white-papers

Second Quarter 2015 Xcell Journal 37 XCELLENCE IN DATA CENTERS

Removing the Barrier for FPGA-Based OpenCL Data Center Servers

Xilinx’s SDAccel environment delivers a CPU-like development and run-time experience on FPGAs, easing the data center design burden.

38 Xcell Journal Second Quarter 2015 XCELLENCE IN DATA CENTERS

ata centers today are the ternet to learn on its own. The research backbone of the modern is representative of a new generation by Devadas Varma economy, from the server of computer science that is exploiting Senior Engineering Director rooms that power small to the availability of huge clusters of com- SDAccel and Vivado High-Level Synthesis midsize organizations to puters in giant data centers. Potential Xilinx, Inc. [email protected] Dthe enterprise data centers that support applications include improvements to U.S. corporations and provide access to image search, speech recognition and Tom Feist cloud computing services. According to machine language translation. Howev- Senior Director the Natural Resources Defense Council, er, leveraging CPUs alone is not a pow- Design Methodology Marketing data centers are one of the largest and er-efficient approach to data center de- Xilinx, Inc. fastest-growing consumers of electricity sign. For higher speed and lower power, [email protected] in the United States. In 2013, U.S. data alternative solutions are required. centers consumed an estimated 91 billion Baidu, China’s largest search-engine kilowatt-hours of electricity—enough to specialist, turned to deep-neural-net- power all the households in New York work processing to solve problems in City twice over—and are on track to speech recognition, image search and reach 140 billion kWh by 2020 [1]. Clearly, natural-language processing. The com- lowering power is essential for the scal- pany quickly determined that when ing of data centers to improve reliability neural back-propagation algorithms are and lower operating costs. used in online prediction, FPGA solu- Data center servers vary, depending tions scale far more easily than CPUs on the server application. Many serv- and GPUs while also reducing power [2]. ers run for long periods without inter- The new generation of 28nm and ruption, making hardware reliability 20nm high-integration FPGA families, and durability extremely important. Al- such as Xilinx’s 7 series and Ultra- though servers can be built from com- Scale™ devices, are changing the dy- modity computer parts, mission-critical namics for integration of FPGAs into enterprise servers often use specialized host cards and line cards in data center hardware for application acceleration, servers. Performance per watt can easi- including graphics processing units ly exceed 20x that of an equivalent CPU (GPUs) and digital signal processors or GPU, while offering up to 50x to 75x (DSPs). Now, many companies are latency improvements in some applica- looking to add field-programmable gate tions over traditional CPUs. arrays (FPGAs) for their highly parallel Teams with limited or no FPGA hard- architecture and relatively low power ware resources, however, have found consumption. Xilinx®’s new SDAccel™ the transition to FPGAs challenging development environment removes due to the RTL (VHDL or Verilog) de- programming as a gating issue to FPGA velopment expertise needed to take full utilization in this application by pro- advantage of these devices. To resolve viding developers with a familiar CPU/ this issue, Xilinx has looked to the Open GPU-like environment. Computing Language, or OpenCL™, for a way to ease the programming burden. IMPROVING PERFORMANCE/WATT Public clouds such as Amazon Web OPENCL CODE PORTABILITY services, Google Compute, Microsoft OpenCL was developed by Apple Inc. Azure, Facebook and China’s Baidu and promoted by Khronos Group have huge repositories of pictures and [3] precisely to aid the integration of require very fast image recognition. In CPUs, GPUs, FPGAs and DSP blocks one implementation, Google scientists in heterogeneous designs. To enhance created one of the largest neural net- the OpenCL framework for writing works for machine learning by connect- programs that execute across hetero- ing 16,000 computer processors into an geneous platforms, leading CPU, GPU entity that they turned loose on the In- and FPGA vendors, including Xilinx,

Second Quarter 2015 Xcell Journal 39 XCELLENCE IN DATA CENTERS

The SDAccel compiler delivers a 10x performance improvement over CPUs at 1/10 the power consumption of a GPU.

are contributing to development of guage (based on C99) for program- mization in data center applications. both the language and its APIs. ming and an application program- The result is SDAccel™, an OpenCL The growing acceptance of Open- ming interface (API) to control the development environment for appli- CL by CPU, GPU and FPGA vendors, platform and execute programs on cation acceleration. server OEMs and data center managers the target device. OpenCL provides The new Xilinx SDAccel environ- alike is an indication that all parties rec- parallel computing using task-based ment (Figure 1) provides data cen- ognize one overarching fact: C-based and data-based parallelism. ter application developers with the compilers for single-processor archi- complete FPGA-based hardware and tectures can only offer small reductions XILINX SDACCEL DEVELOPMENT OpenCL software. SDAccel includes a in overall power dissipation within the ENVIRONMENT FOR OPENCL fast, architecturally optimizing compil- server rack, even as processors turn to Xilinx has worked for nearly a decade er that makes efficient use of on-chip sub-20nm process technologies and add on the development of domain-specif- FPGA resources along with a familiar special power-saving states. ic specification environments. Con- software-development flow based on OpenCL is a framework for writing cerns about data center performance an Eclipse integrated design environ- programs that execute across het- from both data center managers and ment (IDE) for code development, pro- erogeneous platforms consisting of server/switch OEMs helped drive one filing and debugging. This IDE provides CPUs, GPUs, DSPs, FPGAs and other such vertical development toward a a CPU/GPU-like work environment. processors. OpenCL includes a lan- unified environment for design opti- Moreover, SDAccel leverages Xilinx’s dynamically reconfigurable technology to enable accelerator kernels optimized for different applications to be swapped in and out on the fly. The applications can have multiple kernels swapped in and out of the FPGA during run-time without disrupting the interface between the server CPU and the FPGA for nonstop application acceleration. SDAccel’s architecturally optimizing compiler allows software developers to optimize and compile streaming, low-latency and custom datapath applications. The SDAccel compiler supports source code using any combination of C, C++ and Open- CL and targets high-performance Xil- inx FPGAs. The SDAccel compiler delivers as much as a 10x performance improvement over high-end CPUs and one-tenth the power consumption of a Figure 1 – The SDAccel environment includes an architecturally optimizing compiler, libraries, GPU, while maintaining code compat- a debugger and a profiler to provide a CPU/GPU-like programming experience. ibility and a traditional software-pro-

40 Xcell Journal Second Quarter 2015 XCELLENCE IN DATA CENTERS

Intel i7, 8GB RAM, OpenCV 2.4.8

100 Nvidia GTX 750Ti GPU, Cuda 6.0

Xilinx 7 Series, Current (100MHz), AuvizCV Beta

Tesia K-20, Cuba 6.0, OpenCV 2.4.8

10 Xilinx 7 Series, Projected (200MHz) Latency for 1080x1920 (ms)

1 Bilateral lter Harris corner detector

Figure 2 – Video-processing algorithms written in OpenCL for CPU, GPU and FPGA architectures run faster on FPGAs. (Benchmarks performed by Auviz using the AuvizCV library)

gramming model for easy application usage of these application-specific migration and cost savings. memory banks represent some of the SDAccel is the only FPGA-based de- architecturally aware capabilities of velopment environment that includes a the SDAccel compiler. wide variety of FPGA-optimized libraries for application acceleration. This SOFTWARE WORKFLOW library includes OpenCL built-ins, arbi- FPGAs have long held the promise trary-precision data types (fixed-point), of increased algorithm performance floating-point, math.h, video, signal-pro- at a lower power envelope than CPU cessing and linear algebra functions. and GPU implementations. Until On real-world computation work- now, that promise has been gated by loads such as video processing with the programming paradigm required complex nested datapaths, it is clear to effectively use FPGAs. SDAccel that the inherent flexibility of the FPGA eliminates this barrier by support- fabric has performance and power ing a software workflow with in-sys- advantages when compared with the tem, on-the-fly reconfigurability that fixed architectures of CPUs and GPUs. maximizes hardware acceleration As shown by the benchmarks seen in ROI in the data center. SDAccel Figure 2, the FPGA solution compiled represents a unique and complete by SDAccel outperforms the CPU im- FPGA-based solution that far ex- plementation of the same code and ceeds the capabilities and ease of offers performance competitive with use of competing point tools. For GPU implementations. more information, visit http://www. Both the bilateral filter and the xilinx.com/products/design-tools/ Harris corner detector were coded us- sdx/sdaccel.html. ing the standard OpenCL design paradigm in which data between kernels is REFERENCES transferred using device global memory. The FPGA implementation gen- 1. http://www .nrdc.org/energy/data-center-efficiency-assessment.asp erated by SDAccel optimizes memory access by creating on-chip memory 2. http://www.pcworld.com/article/2464260/ banks that are used for high-band- microsoft-baidu-find-speedier-search-results-through-specialized-chips.html width memory transfers and low-latency computations. The creation and 3. https://www.khronos.org/opencl

Second Quarter 2015 Xcell Journal 41 XCELLENCE IN DATA CENTERS Optimizing an OpenCL Application for Video Watermarking in FPGAs Xilinx’s SDAccel development environment offers optimization techniques for memory-bound problems.

by Jasmina Vasiljevic Researcher University of Toronto [email protected]

Fernando Martinez Vallina, PhD Software Development Manager Xilinx, Inc. [email protected]

42 Xcell Journal Second Quarter 2015 XCELLENCE IN DATA CENTERS

ideo streaming and VIDEO WATERMARKING tains only the contour of the logo. The downloading ac- WITH LOGO INSERTION pixels in the mask are either white or count for the ma- The main function of the video-wa- black. White pixels in the mask indicate jority of consum- termarking algorithm is to overlay a the logo insertion location, while black er Internet traffic logo at a specific location on a video pixels indicate that the original pixel re- and are a driving stream. The logo used for the water- mains untouched. Figure 1 shows an ex- force behind cloud mark can be either active or passive. ample of the operation of the video-wa- computing.V The continually growing An active logo is typically represented termarking algorithm. demand for this type of content is push- by a short, repeating video clip, while a ing video-processing applications out passive logo is a still image. TARGET SYSTEM AND of specialized systems and into the data The most common technique among INITIAL IMPLEMENTATION center. This shift in the deployment broadcasting companies that brand The system on which we executed the paradigm allows for the rapid scaling their video streams is to use a compa- application is shown in Figure 2. It of computation nodes to accommodate ny logo as a passive watermark, so that is composed of an Alpha Data ADM- the needs of the different compute-in- was the aim of our example design. PCIE-7V3 card communicating with tensive stages of video content prepara- The application inserts a passive logo an x86 processor over a PCIe® link. tion and distribution, such as transcod- on a pixel-by-pixel level of granularity In this system, the host processor re- ing and watermarking. based on the operations of the follow- trieves the input video stream from We recently used Xilinx®’s SDAccel™ ing equation: disk and transfers it to the device

development environment to compile and optimize a video-watermarking ap- out_y[x][y] = (255-mask[x][y]) * in_y[x][y] + mask[x][y] * logo_y[x][y] plication written in OpenCL™ for an FPGA accelerator card. Video content out_cr[x][y] = (255-mask[x][y]) * in_cr[x][y] + mask[x][y] * logo_cr[x][y] providers use watermarking to brand out_cb[x][y] = (255-mask[x][y]) * in_cb[x][y] + mask[x][y] * logo_cb[x][y] and protect their content. Our goal was to design a watermarking application that would process high-definition The input and output frames are global memory. The device global (HD) video at a 1080p resolution with two-dimensional arrays in which pix- memory is the memory on the FPGA a target throughput of 30 frames per els are expressed using the YCbCr col- card that is directly accessible from second (fps) running on an Alpha Data or space. In this color space, each pix- the FPGA. In addition to placing the ADM-PCIE-7V3 card. el is represented in three components: video frames in device global memo- The SDAccel development environ- Y is the luma component, Cb is the ry, the logo and mask are transferred ment enables designers to take applica- chroma blue-difference component from the host to the accelerator card tions captured in OpenCL and compile and Cr is the chroma red-difference and placed in on-chip memory to them to an FPGA without requiring component. Each component is rep- take advantage of the low latency of knowledge of the underlying FPGA im- resented by an 8-bit value, resulting in BRAM memories. Since this applica- plementation tools. The video-water- a total of 24 bits per pixel. tion uses a passive logo, only the data marking application serves as a perfect The logo is a two-dimensional image for the still image and placement lo- way to introduce the main optimization containing the content to be inserted. cation needs to be stored in the on- techniques available in SDAccel. The mask is also an image, but it con- chip memories.

Logo Mask Input video Output video

Figure 1 – The video-watermarking algorithm in action

Second Quarter 2015 Xcell Journal 43 XCELLENCE IN DATA CENTERS

Figure 2 – System overview for the video-watermarking application

Once data has been set up, the host of the time is spent accessing mem- fore, for this application, vectorizing processor sends a start signal to the ory to read and write video frames. the code is as simple as changing the watermarking kernel in the FPGA fab- Therefore, we focused on memory data type of all arrays to char20, as ric. This signal triggers the kernel to do bandwidth when optimizing our ex- shown in Figure 5, which results in a three things: start fetching input video ample design. throughput of 12 fps. frames from device global memory, insert the logo at the location defined OPTIMIZING MEMORY ACCESS OPTIMIZING MEMORY by the mask and place the processed USING VECTORIZATION ACCESS USING BURSTS frames back in device global memory One of the advantages of the FPGA Although vectorization significantly to be fetched by the processor. fabric over other software-program- improves the performance of the ap- The coordination of data transfer mable fabrics is the flexibility and plication, it was not enough to reach and computation for every frame in configuration of interconnect bus- the goal of 30 fps. The application the video stream is achieved by means es to memory. SDAccel creates cus- remains memory bound, because the of the code in Figure 3. tom-size datapaths and architectures kernel is issuing memory transfers of This code, which runs on the host to memory based on the application only 20 pixels at a time. To reduce the processor, is responsible for sending a kernels. A higher memory bandwidth impact of memory constraints on the video frame to the FPGA accelerator from the kernel can be inferred by application, we had to modify the ker- card, launching the accelerator and modifying the code to consume mul- nel code to generate burst read/write then retrieving the processed frame tiple pixels at a time, a procedure re- operations to memory for a data set from the FPGA accelerator card. ferred to as vectorization. larger than 20 pixels. The modified The first implementation of the wa- The level of vectorization that is ap- kernel code is shown in Figure 6. termarking algorithm for the FPGA is propriate depends on the application The first modification to the kernel shown in Figure 4. This is a functionally and the FPGA accelerator card being code is to define on-chip storage in correct implementation of the applica- used. In the case of the Alpha Data the kernel to store a block of pixels tion but does not have any performance card, the interface to device global at a time. The on-chip memories are optimization or consideration for the memory has a width of 512 bits, which defined by arrays declared in the capabilities of the FPGA fabric. As is, matches the maximum AXI intercon- kernel code. To start a burst trans- this code compiles in SDAccel and can nect width available to a kernel in SD- action to memory, the code instan- be run on the Alpha Data card for a Accel. Given this boundary of 512 bits, tiates a memcpy command to move maximum throughput of 0.5 fps. the application is modified to process a block of data from DDR to BRAM As you can see in the code of Fig- 20 pixels at a time (24 bits/pixel x 20 storage inside of the kernel. Based ure 4, the watermarking algorithm is pixels = 504 bits). SDAccel offers full on the size of on-chip memory re- not a compute-intensive design. Most support for vector data types. There- sources and the amount of data to

44 Xcell Journal Second Quarter 2015 XCELLENCE IN DATA CENTERS

be processed, a video frame is divid- the watermarking algorithm on the of operations repeats for 20 times ed into 20 blocks of 1920 x 54 pixels, block of data and places the results until all blocks in a given frame have as shown in Figure 7. back into kernel arrays. The results been processed. As a result of this Once the memcpy operations have of the block processing are trans- modification to the kernel code, the placed the data for a block into the ferred back to DDR memory using system performance is 38 fps, which kernel arrays, the algorithm executes memcpy operations. This sequence exceeds the original goal of 30 fps.

Host Code for (i=0; i

FRAME clWaitForEvents(1, &kernelEVent); COMPUTE // Read the output frame back to CPU err = clEnqueueReadBuffer( commands, d_frout, CL_TRUE, 0, sizeof(int) * LENGTH_FROUT, h_frout, 0, NULL, &readEvent); clWritForEvents(1, &readEvent); FRAME

RECEIVE }

Figure 3 – Code to coordinate each frame’s data transfer and computation

for (y=0; y

i = y*FRAME_WIDTH/20 + x; out_y[i] = (255-mask[i]) * in_y[i] + mask[i] * logo_y[i]; out_cr[i] = (255-mask[i]) * in_cr[i] + mask[i] * logo_cr[i]; out_cb[i] = (255-mask[i]) * in_cb[i] + mask[i] * logo_cb[i]; } }

Figure 4 – Initial implementation of the watermarking kernel

Second Quarter 2015 Xcell Journal 45 XCELLENCE IN DATA CENTERS

20 input pixels

for (y=0; y

i = y*FRAME_WIDTH/20 + x;

out_y[i] = (255-mask[i]) * in_y[i] + mask[i] * logo_y[i]; kernel out_cr[i] = (255-mask[i]) * in_cr[i] + mask[i] * logo_cr[i]; out_cb[i] = (255-mask[i]) * in_cb[i] + mask[i] * logo_cb[i]; } } 0 1 ...... 20 20 output pixels

Figure 5 – Kernel code after vectorization

char20 l_in_y[BLOCK_WIDHT*BLOCK_HEIGHT/20]; char20 l_in_cr[BLOCK_WIDHT*BLOCK_HEIGHT/20]; On-chip memory storage for the input block char20 l_in_cb[BLOCK_WIDHT*BLOCK_HEIGHT/20];

char20 l_out_y[BLOCK_WIDHT*BLOCK_HEIGHT/20]; char20 l_out_cr[BLOCK_WIDHT*BLOCK_HEIGHT/20]; On-chip memory storage for the ioutput block char20 l_out_cb[BLOCK_WIDHT*BLOCK_HEIGHT/20]; Loop through 20 blocks for (block_id=0; block_id

for (y=0; y

i = y*BLOCK_WIDTH/20 + x;

l_out_y[i] = (255-mask[i]) * l_in_y[i] + mask[i] * logo_y[i]; l_out_cr[i] = (255-mask[i]) * l_in_cr[i] + mask[i] * logo_cr[i]; l_out_cb[i] = (255-mask[i]) * l_in_cb[i] + mask[i] * logo_cb[i]; }

memcpy(out_y, l_out_y + block_id*BLOCK_WIDHT*BLOCK_HEIGHT/20, BLOCK_WIDHT*BLOCK_HEIGHT*3); memcpy(out_cr, l_out_cr + block_id*BLOCK_WIDHT*BLOCK_HEIGHT/20, BLOCK_WIDHT*BLOCK_HEIGHT*3); memcpy(out_cb, l_out_cb + block_id*BLOCK_WIDHT*BLOCK_HEIGHT/20, BLOCK_WIDHT*BLOCK_HEIGHT*3);

PHASE 3 PHASE 2} PHASE 1

Figure 6 – Kernel code optimized for burst data transfers

46 Xcell Journal Second Quarter 2015 XCELLENCE IN DATA CENTERS

BROADLY APPLICABLE The optimizations necessary when creating applications like this one using SDAccel are software optimizations. Thus, these optimizations are similar to the ones required to extract performance from other processing fabrics, such as GPUs. As a result of using SD- Accel, the details of getting the PCIe link to work, drivers, IP placement and interconnect became a non-issue, allowing us as designers to focus solely on the target application. The optimizations we made in our watermarking application are applicable to all designs compiled using SDAccel. In fact, video watermarking provides a great “how to” introduction Figure 7 – Video frame partitioning into blocks to the optimization methods Xilinx has made available in SDAccel.

TigerFMC

200 Gbps bandwidth The highest VITA 57 compliant optical bandwidth for FPGA carrier

www.techway.eu

Second Quarter 2015 Xcell Journal 47 XPLANATION: FPGA 101 Coming to Grips with the Frequency Domain

by Adam P. Taylor Chief Engineer e2v [email protected]

48 Xcell Journal Second Quarter 2015 XPLANANTION: FPGA 101

The ability to work within the frequency domain is a necessity in a number of applications. Here’s how the frequency domain factors into FPGA designs.

For many engineers, working in the frequency domain does not come as naturally as working within the time domain, probably because the frequency Fdomain is associated with complicated mathemat- ics. However, to unlock the true potential of Xilinx® FPGA-based solutions, you need to feel comfort- able working within both of these domains. The good news is that it’s not as daunting as you might initially think to master the ins and outs of the frequency domain. Custom modules that you design yourself or IP modules available on the marketplace will help you transform to and from the frequency domain with ease. Methods also exist that make it possible to implement high- speed processing within the frequency domain.

TIME OR FREQUENCY DOMAIN? As engineers, we can examine and manipulate signals in either the time domain, where we analyze signals over time, or the frequency domain, where the analysis takes place with respect to frequency. Knowing when to do which is one of the core reasons an engineer is required on a project. Typically in electronic systems, the signal in question is a changing voltage, current or frequency that has been output by a sensor or generated by another part of the system. Within the time domain, you can measure a signal’s amplitude, frequency and period, along with more interesting parameters such as the rise and fall times of the signal. To observe a time domain signal in a laboratory environment, it is common to use an oscilloscope or logic analyzer. However, there are parameters of a signal that are present within the frequency domain. You must analyze them there in order to access the information contained within. Here, you can identify the frequency components of the signal, their

Second Quarter 2015 Xcell Journal 49 XPLANANTION: FPGA 101

Depending upon the type of signal—repetitive or nonrepetitive, discrete or nondiscrete—there are a number of methods you can use to convert between time and frequency domains.

amplitudes and the phase of each fre- demand that the signal be separated from series, Fourier transforms and Z trans- quency. Working within the frequency a noise source are best analyzed within forms. Within electronic signal process- domain also makes it much simpler to the frequency domain. ing and FPGA applications in particular, manipulate signals due to the ease with Working within the time domain you will most often be interested in one which convolution can be performed. requires little post-processing on the transform: the discrete Fourier transform Convolution is a mathematical way of quantized digital signal as the sampling (DFT), which is a subset of the Fourier combining two signals to form a third. takes place within the time domain. By transform. Engineers use the DFT to an- As with the time domain analysis, if you contrast, working within the frequency alyze signals that are periodic and dis- wish within the laboratory environment domain first requires the application crete—that is, they consist of a number to observe a frequency domain signal, of a transform to the quantized data to of n bit samples evenly spaced at a sam- you can use a spectrum analyzer. convert from the time domain. Similar- pling frequency that in many applications For some applications, you will wish ly, to output the post-processed data is supplied by an ADC within the system. to work within the time domain, for ex- from the frequency domain, you will At its simplest, what the DFT does is ample in systems that monitor the voltage need to perform the inverse transform to decompose the input signal into two or temperature of a larger system. While again back to the time domain. output signals that represent the sine noise may be an issue, taking an average and cosine components of that signal. of a number of samples will in many cas- HOW DO WE GET THERE? Thus, for a time domain sequence of es be sufficient. However, for other ap- Depending upon the type of signal—re- N samples, the DFT will return two plications it is preferable to work within petitive or nonrepetitive, discrete or non- groups of N/2+1 cosine and sine wave the frequency domain. For example, sig- discrete—there are a number of methods samples, respectively referred to as the nal-processing applications that require you can use to convert between time and real and imaginary components (Fig- the filtering of one signal from another or frequency domains, including Fourier ure 1). The real and imaginary sample

0 1 2 3 4 5 6 7 Time Domain

Discrete Fourier Transform

0 1 2 3 Real Result 0 1 2 3 Imaginary Result

Figure 1 – N bits in time domain to n/2 real and imaginary bits in the frequency domain

50 Xcell Journal Second Quarter 2015 XPLANANTION: FPGA 101

width will also be n/2 for an input sig- WHERE DO WE USE The FFT differs slightly from the nal width of n bits. THESE TRANSFORMS? DFT algorithms in that it calculates the The algorithm to calculate the DFT From telecommunications to image complex DFT—that is, it expects real is pretty straightforward, as seen in the processing, radar and sonar, it is hard to and imaginary time domain signals and equation below: think of a more powerful and adaptable produces results in the frequency do- analysis technique to implement within main which are n bits wide as opposed an FPGA than the Fourier transform. In- to n/2. This means that when you wish deed, the DFT forms the foundation for to calculate a real DFT, you must first one of the most commonly used FPGA- set the imaginary part to zero and then based applications: It is the basis for move the time domain signal into the generating the coefficients of the finite real part. If you wish to implement an where x[i] is the time domain signal, i input response (FIR) filter (see Xcell FFT within your Xilinx FPGA, you have ranges from 0 to N-1 and k ranges from Journal issue 78, “Ins and Outs of Digi- two options. You can write an FFT from 0 to N/2. The algorithm is called the cor- tal Filter Design and Implementation”). scratch using the HDL of your choice or relation method and what it does is to However, its use is not just limited to you can use the FFT IP provided within multiply the input signal with the sine filtering. The DFT and IDFT are used in the Vivado® Design Suite IP Catalog or or cosine wave for that iteration to de- telecommunications processing to per- another source. Unless there are press- termine its amplitude. form channelization and recombination ing reasons not to use the IP, the reduced Of course, you will wish at some point of the telecommunication channels. In development time resulting from using in your application to transform back spectral-monitoring applications, they the Xilinx core should drive its selection. from the frequency domain into the time are used to determine what frequencies The basic approach of the FFT is to domain. For this purpose, you can use are present within the monitored band- decompose the time domain signal into the synthesis equation, which combines width, while in image processing the a number of single-point time domain the real and imaginary waveforms to DFT and IDFT are used to handle con- signals. This process is often called bit re-create a time domain signal as such: volution of images with a filter kernel reversal as the samples are reordered. to perform, for example, image pattern The number of stages that it takes to recognition. All of these applications create these single-point time domain are typically implemented using a more signals is calculated by Log2 N, where efficient algorithm to calculate the DFT N is the number of bits, if a bit reversal However, ReX and ImX are scaled ver- than the one shown above. algorithm is not used as a short-cut. sions of the cosine and sine waves. There- All told, the ability to understand These single-point time domain sig- fore, you will need to scale them. Hence, and implement a DFT within your nals are then used to calculate the fre- ImX[k] or ReX[k] is divided by N/2 to de- FPGA is a skill that every FPGA devel- quency spectra for each of these points. termine the values for ReX and ImX in all oper should have. This operation is pretty straightforward, cases except when ReX[0] and ReX[N/2]. as the frequency spectrum is equal to In this case, they are divided by N. This FPGA-BASED IMPLEMENTATION the single-point time domain. is, for obvious reasons, called the inverse The implementation of the DFT and It is in the recombination of these discrete Fourier transform, or IDFT. IDFT as described above is often done single frequency points that the FFT al- Having explored the algorithms used as a nested loop, each loop performing gorithm gets complicated. You must re- for determining the DFT and IDFT, it N calculations. As such, the time it takes combine these spectra points one stage may be helpful to know what you can to implement the DFT calculation is at a time, which is the opposite of the use them for. time domain decomposition. Thus, it will You can use tools such as Octave, MAT- DFTtime = N * N * Kd ft again take Log2 N stages to re-create the LAB® and even Excel to perform DFT cal- where Kdft is the processing time for spectra, and this is where the famous culations upon captured data, and many each iteration to be performed. Clearly, FFT butterfly comes into play. lab tools, such as oscilloscopes, are capa- this can become quite time-consuming When compared with the DFT execu- ble of performing DFT upon request. to implement. For that reason, DFTs tion time, the FFT takes However, it is worth pointing out within FPGAs are normally implement- that both the DFT and IDFT above are ed using an algorithm called the fast FFTtime = K f ft * N Log2 N referred to as real DFT and real IDFT in Fourier transform. This FFT has often which results in significant improvements that the input is a real number and not been called the the most important algo- in execution time to calculate a DFT. complex. Why you need to know this rithm of our lifetime, as it has had such When implementing an FFT with- will become apparent. an enabling impact on many industries. in an FPGA, you must also take into

Second Quarter 2015 Xcell Journal 51 XPLANANTION: FPGA 101

1st NYQUIST 2nd NYQUIST 3rd NYQUIST 4th NYQUIST ZONE ZONE ZONE ZONE

f IIa I

0.5fs fs 1.5fs 2fs

Figure 2 – Nyquist zones and aliasing

x[0] / x[4] / ... X'[0.m2] X''[0.m ] X[0] / X[1] / ... pipelined Q point FFT 2

x[1] / x[5] / ... X'[1.m ] X''[1.m ] X[128] / X[129] / ... pipelined Q point FFT 2 2

x[2] / x[6] / ... X'[2.m ] X''[2.m ] X[256] / X[257] / ... pipelined Q point FFT 2 2

x[3] / x[7] / ... X'[3.m ]X''[3.m ] X[384] / X[385] / ... pipelined Q point FFT 2 2 parallel P point FFT

x[0] / x[4] / ... X[0] / x[1] / ... X'[n1.0] X''[n .0] X[0] / X[4] / ... 1 pipelined Q point FFT

x[1] / x[5] / ... x[128] / x[129] / ... X'[n .1] X''[n .1] X[1] / X[5] / ... 1 1 pipelined Q point FFT

x[2] / x[6] / ... x[256] / x[257] / ... X'[n .2] X''[n1.2] X[2] / X[6] / ... 1 pipelined Q point FFT reordering

x[3] / x[7] / ... X'[n .3] X''[n .3] X[3] / X[7] / ... x[384] / x[385] / ... parallel P point FFT 1 1 pipelined Q point FFT

Figure 3 – Split and combined FFT structures

52 Xcell Journal Second Quarter 2015 XPLANANTION: FPGA 101

account the size of the FFT. The size harmonic content, you can use the Enclustra will determine the noise floor below algorithm below: Everything FPGA. which you cannot see signals of po- Fharm = N × Ffund tential interest. The FFT size will IF (Fharm = Odd Nyquist Zone) MARS ZX2 also determine the spacing of the fre- 1. Floc = Fharm Mod Ffund Zynq-7020 SoC Module quency bins. Use the equation below Else Xilinx Zynq-7010/7020 SoC FPGA to determine the FFT’s size: Up to 1 GB DDR3L SDRAM Floc = Ffund-(Fharm Mod Ffund) 64 MB quad SPI flash USB 2.0 End Gigabit Ethernet Up to 85,120 LUT4-eq 108 user I/Os where N is the integer for the harmon- 3.3 V single supply ic of interest. 67.6 × 30 mm SO-DIMM from $127 Continuing our example further, where n is the number of quantized bits with a sample rate of 2500 MHz and a 2. MERCURY ZX5 within the time domain and FFTSize is fundamental of 1807 MHz, there will Zynq™-7015/30 SoC Module be a harmonic component at 693 MHz Xilinx® Zynq-7015/30 SoC the FFT size. For FPGA-based implemen- 1 GB DDR3L SDRAM tation, this is normally a power of two— within the first Nyquist zone that we 64 MB quad SPI flash PCIe® 2.0 ×4 endpoint for example, 256, 512, 1,024, etc. The fre- can further process within our FFT. 4 × 6.25/6.6 Gbps MGT USB 2.0 Device quency bins will be evenly spaced at Having grasped the basics about the Gigabit Ethernet Up to 125,000 LUT4-eq frequency spectrum, the next crucial 178 user I/Os factor to consider is the way you inter- 5-15 V single supply 56 × 54 mm face these ADC and DAC devices to the FPGA. It is not possible for the data 3. MERCURY ZX1 from the ADC to be received at FS/2 Zynq-7030/35/45 SoC Module For a very simple example, a sampling where in the example above, the sam- Xilinx Zynq-7030/35/45 SoC 1 GB DDR3L SDRAM frequency (FS) of 100 MHz with an FFT pling frequency is 2.5 Gbps. For this 64 MB quad SPI flash PCIe 2.0 ×8 endpoint1 size of 128 would have a frequency res- reason, high-performance data con- 8 × 6.6/10.3125 Gbps MGT2 verters use multiplexed digital inputs USB 2.0 Device olution of 0.39 Hz. This of course means Gigabit & Dual Fast Ethernet frequencies within 0.39 Hz of each other and outputs that operate at a lower Up to 350,000 LUT4-eq 174 user I/Os cannot be distinguished. data rate with respect to the convert- 5-15 V single supply 64 × 54 mm er’s sample rate, typically FS/4 or FS/2. 1, 2: Zynq-7030 has 4 MGTs/PCIe lanes. HIGHER-SPEED SAMPLING Having received the data from the Many applications for FFTs within FPGA in a number of data streams, 4. FPGA MANAGER IP Solution FPGAs and higher-performance sys- the next question is how you can pro- tems operate at very high frequencies. cess the data internally within the C/C++ High-frequency operation can present FPGA if you wish to perform a DFT. C#/.NET MATLAB® its own implementation challenges. One common method used for a num- LabVIEW™ At high frequencies, the Nyquist ber of applications, including telecommunication processors and radio sample rate (sampling at least two sam- USB 3.0 ples per cycle) simply cannot be main- astronomy, is to use combined or split PCIe® Gen2 Gigabit Ethernet tained. Therefore, a different approach FFT structures, as shown in Figure 3. is needed. An example would be using While this application is more com- an ADC to sample a 3-GHz full-pow- plicated than a straightforward FFT, Streaming, FPGA er-bandwidth analog input with a 2.5- such an approach makes it possible to made simple. GHz sample rate. Using Nyquist-rate achieve the higher-speed processing. criteria, signals above 1.25 GHz will As you can see, working within the One tool for all FPGA communications. Stream data from FPGA to host over USB 3.0, be aliased back into the first Nyquist frequency domain is not as difficult PCIe, or Gigabit Ethernet – all with one simple API. zone to be of use. These aliased images as you may initially think, especial- are harmonic components of the fun- ly when there are IP modules to help damental signal and thus contain the in transforming to and from the fre- Design Center · FPGA Modules same information as the non-aliased quency domain. Moreover, a number Base Boards · IP Cores signal, as shown in Figure 2. of methods are available that will We speak FPGA. To determine the resultant fre- enable you to implement high-speed quency location of the harmonic or processing.

Second Quarter 2015 Xcell Journal 53 XPLANATION: FPGA 101 Marrying SoC Platform Designs with System Generator

by Daniel E. Michek Senior Manager, System-Level Product Marketing Xilinx, Inc. for DSP [email protected]

The Vivado System Generator tool easily connects into platform designs to leverage board interfaces and processing systems.

54 Xcell Journal Second Quarter 2015 XPLANANTION: FPGA 101

PGA applications are ever evolv- Vivado IP Integrator tool. Let’s take a CREATING THE DATA ing, and the FPGA design flows closer look at how design automation PATH AS IMPORTABLE IP F are evolving along with them. using System Generator is enabling Our ultimate goal is to make the data No longer do we use FPGAs as simple high-performance designs to take ad- path accessible to the All Programma- glue logic or even as the heart of a sig- vantage of platform connectivity. ble platform framework. If we wanted nal-processing chain that marries in- to start from scratch, we could create tellectual property (IP) to proprietary BUILDING AN the data path with standardized inter- back-end interfaces. Instead, FPGAs ALL PROGRAMMABLE faces. By quickly marking a gateway are transitioning into programmable PLATFORM FRAMEWORK port as an AXI4-Lite interface as shown systems-on-a-chip, a combination of We start our new design flow by defin- in Figure 2, or simply naming our ports hardware designed to act as processor ing the platform framework in which to match a standard connection like peripherals alongside high-level soft- we want to house the data path. The AXI4-Stream on our Simulink® dia- ware running on powerful APUs. This Vivado tool suite is board aware and gram, System Generator will take care is an architecture we refer to as Xilinx® we will take advantage of available of adding the extra logic to our design All Programmable SoCs. board automation to build our new and collecting the common signals into In order to take advantage of this platform design. interfaces as it packages the design for new flow, we need to shift the design A platform design or platform frame- the Vivado IP Catalog. methodology from top-down RTL, as work, as shown in Figure 1, is the basic But a new method allows us to use in the earliest days of the FPGA, to a collection of processor- and board-level the platform framework to tailor-fit a bottom-up flow centered around IP interfaces, along with the logic combining plug-in that marries to the All Program- development and standardized connec- them. We use the platform framework as mable design. We use automation to tions such as ARM®’s Advanced eXten- the basis for our system-level design, the determine which interfaces exist with- sible Interface (AXI). As the interfaces shell, and this leaves us with the room in a platform design, which interfaces evolve from custom to common, we for our data path. Block and connectiv- associate to the board and which inter- reduce the effort expended on verifying ity automation will link the processing faces create a plug-in for the DSP data interactions between the data path and systems through IP peripherals to the path. Since the goal is to convert the the platform design. board-level interfaces. DSP data paths or data path into IP that connects to the Xilinx’s System Generator for DSP software accelerators packaged in the IP platform framework, we do not need has evolved too. This tool, which is part Catalog can then take advantage of Xil- to focus on board-level interfaces but of the Vivado® Design Suite, integrates inx’s Designer Assistance automation to instead, on standardized AXI interfac- the new bottom-up design methodolo- easily tie into our platform framework of es. Each interface not associated at gy by incorporating DSP data paths into processors and, by extension, interfaces the board level converts into System platform designs constructed with the to external devices. Generator gateways. While acting as

Figure 1 – Example of a platform framework connecting a processing system to board-level interfaces

Second Quarter 2015 Xcell Journal 55 XPLANANTION: FPGA 101

simple signals in the System Generator As one example, AXI4-Lite interfaces tool suite. A trivial copy-and-paste gives world, these gateways produce AXI inter- create independent read and write signals us more direct registers available to the faces to connect to the platform design that share a commonly addressable regis- processor at an address offset through when we export it to the IP Catalog. ter interface when exported to the Vivado the same interface. At the same time, we

Figure 2 – Automating gateways into AXI4-Lite and AXI4-Stream interfaces

Figure 3 – Platform-based system that connects DSP data path to the platform framework

56 Xcell Journal Second Quarter 2015 XPLANANTION: FPGA 101

automatically generate software driver The tailoring automation acts as a APIs to read or write to those registers. starting point to create IP that interfaces All Programmable If an AXI4-Stream interface is avail- to the platform framework. But System FPGA and SoC modules able from the platform design, Sys- Generator is flexible and allows us to add tem Generator will add appropriately or delete portions of, or even complete, 4 form factor matched gateways to the model. AXI4- AXI interfaces. The end result is a con- x Stream interfaces are extremely flexi- version of the data path into IP for reuse 5 ble and contain many signals. ACLK is among multiple system-level designs. the clock source associated with this After adding our logic, our last step interface, but this signal is directly im- is to export the data path we built plied as the abstracted system clock with System Generator for DSP back for this portion of the data path. TVAL- into the Vivado IP Catalog. This action ID is the signal that denotes when allows easy connection of the inter- the interface is valid. Other signals faces, whether working with RTL or are optional. System Generator will within IP Integrator. Additionally, we add those that exist in the originating generate drivers for use in the SDK rugged for harsh environments stream interface to our model, but we and attach models for simulation with extended device life cycle can delete or add signals to match our golden test vector data to the IP. And internal requirements. because we have prior knowledge of Available SoMs: In the model shown in Figure 2, we the platform framework when we cre- see that our data path only cares about ated our DSP data path, we automate TDATA (the data sent across the inter- the incorporation of the model and face) and TVALID on the S_AXIS inter- the platform design, as shown in the face. To remove unnecessary signals, we completed system in Figure 3. comment out the unused gateways for Platform Features this model, since default values will drive REDUCING SIMULATION RISK • 4×5 cm compatible footprint the signal connections in IP Integrator. Generating a complete system-on- The AXI4-Lite and AXI4-Stream sig- chip that incorporates hardware ac- • up to 8 Gbit DDR3 SDRAM nals easily simulate and verify with the celerators, DSP data paths or custom • 256 Mbit SPI Flash bottom-up methodology. The AXI4- logic is a challenge. Confirming that • Gigabit Ethernet Lite interfaces model as simple gate- the data path will work as expected • USB option ways that we source and sink from the by simulating in a bottom-up fashion myriad of Simulink blocks, a simple ab- introduces a large risk, and ensuring straction of passing data from one side that we maintain platform interface of the gateway to the other. Likewise, bandwidth to support the data path is the AXI4-Stream interfaces are merely equally daunting. a collection of signals that follow sim- By utilizing standardized interfaces, Design Services ple handshake rules for passing data we develop IP that reduces simulation • Module customization from one IP core to another. risk. That’s because the interactions at • Carrier board customization Our simulation modeling is only the interface level abstract away, so we • Custom project development challenged by the optional ports we may focus on verifying the internal data use on our interface. If we accept data path. And finally, by leveraging smart au- on every cycle and process it though tomation for board, block and connec- our data path without interruption, we tivity, we generate the platform-based will not need the TREADY handshake system that meets our needs and inte- signal. The simplified model sends grates our custom data path. each element of a vector through the More information about System Gen- TDATA gateway from Simulink’s Sig- erator for DSP is available at www.xil- nal to Workspace token. When a full inx.com/products/design-toolsvivado/ handshake is desired, we model it integration/sysgen.html. If you have difference by design with an AXI4-Stream FIFO to buffer any questions or comments, call the au- our data as shown on the M_AXIS in- thor, Daniel Michek, at (858) 207-5213, www.trenz-electronic.de terface in Figure 2. or e-mail [email protected].

Second Quarter 2015 Xcell Journal 57 XPLANATION: FPGA 101 Rethinking Digital Downconversion in Fast, Wideband ADCs

by Ian Beavers Contributing Technical Expert Analog Devices [email protected]

58 Xcell Journal Second Quarter 2015 XPLANANTION: FPGA 101

High-performance GSPS ADCs are bringing the DDC function onboard in a design solution based on Xilinx FPGAs.

Wideband gigasample-per-second (GSPS) Wanalog-to-digital converters offer many performance benefits to high-speed acquisition systems. These ADCs provide a wide frequency spectrum of visibility across high sample rates and input bandwidths. Howev- er, while some applications need a wideband front end, others require the ability to filter and tune to a narrower band of spectrum. It can be inherently inefficient for an ADC to sample, process and burn the power to transmit a wideband spectrum, when only a narrow band is required in the application. An unnecessary system burden is created when the data link consumes a large bank of high-speed transceivers within a Xilinx® FPGA, only to then decimate and filter the wideband data in subsequent processing. The Xilinx FPGA transceiver resources can instead be better allocated to receive the lower bandwidth of interest and channelize the data from multiple ADCs. Additional filtering can be done within the FPGA’s polyphase filter bank channelizer for frequency-division multiplexed (FDM) applications. High-performance GSPS ADCs are now bringing the digital downconversion (DDC) function further up in the signal chain to reside within the ADC in a design solution based on Xilinx FPGAs. This approach offers several new design options to a high- speed system architect. However, because this function is relatively new to the ADC, there are design-related questions that engineers may have about the operation of the DDC blocks within GSPS ADCs. Let’s clear up some of the more-common questions so that designers can begin using this new technique with more confidence.

Second Quarter 2015 Xcell Journal 59 XPLANANTION: FPGA 101

To get the full performance benefit of DDCs, the design must also contain a filter-and-mixer component as a companion to the decimation.

WHAT IS DECIMATION? pass filter. Without frequency transla- moves the out-of-band noise from the In the simplest definition, decimation is tion and digital filtering, decimation will narrowly defined bandwidth that is set the method of observing only a periodic merely fold the harmonics of the funda- by the decimation ratio. The typical dig- subportion of the ADC output samples, mental and other spurious signals on top ital filter implementation for a DDC is while ignoring the rest. The result is to of one another in the frequency domain. a finite impulse response (FIR) filter. effectively reduce the sample rate of the This filter is a function only of the past ADC by downsampling. For example, WHAT IS THE ROLE OF THE DDC? inputs, since there is no feedback. The a decimate-by-M mode in an ADC out- Since decimation by itself does not pre- passband of the filter should match the puts only the first of Mth samples, while vent the folding of out-of-band signals, effective frequency spectrum width of discarding all the other samples in be- how does the DDC make this happen? the converter after the decimation. tween. This method continues to repeat To get the full performance benefit for each multiple of M. of DDCs, the design must also contain HOW WIDE SHOULD Sample decimation alone will only a filter-and-mixer component that’s THE DDC FILTERS BE? effectively reduce the sample rate of the used as a companion to the decimation The decimation ratios for DDCs are typ- ADC and correspondingly act as a low- function. Digital filtering effectively re- ically based on integer factors that are

Real Input to ADC

Signal of interest “image” Signal of interest

0ffs/8 fs/4 3fs/8 /2 -fs/2 -3fs/8 -fs/4 -fs/8 s

After Frequency Translation within ADC NCO “tunes” to signal of interest Signal of interest after Decimation frequency translation “image” Filter

0ffs/8 fs/4 3fs/8 /2 -fs/2 -3fs/8 -fs/4 -fs/8 s

Figure 1 – Frequency translation using a low-pass filter and NCO effectively achieves a bandpass filter at the frequency of interest. Frequency planning ensures that unwanted harmonics, spurs and images fall out of band.

60 Xcell Journal Second Quarter 2015 XPLANANTION: FPGA 101

ADC

DDC

ADC DDC

DDC ADC

DDC ADC ADC GTP Artix-7 GTP Artix-7 TRx FPGA TRx FPGA x16 DDC x16 ADC ADC

DDC ADC

DDC

ADC

DDC

ADC

Figure 2 – The use of DDCs with a decimation factor of 8 allows the same 16 GTP 6.6-Gbps transceivers of the Xilinx Artix-7 to support eight ADCs with decimated I/Q data on two lanes of JESD204B each vs. only two ADCs that each output full bandwidth over eight lanes.

powers of 2 (2, 4, 8, 16, etc.). However, IS THE FREQUENCY FIXED? by the number of bits used in the digital the decimation factor could actually be The next question is whether the DDC tuning word that allows the band of in- any ratio based on the DDC architec- filters are fixed in frequency, or if they terest to be mixed. The tuning word has ture, including fractional decimation. can be tuned and centered on a particu- the tuning range and resolution to place In the case of fractional decimation, an lar band of interest. the filter where it is needed. A typical interpolation computation block is typ- We have discussed the decimation NCO tuning word may be up to 48 bits ically needed ahead of decimation to and filtering stages of DDCs. But this is of resolution across two Nyquist bands achieve a rational fraction ratio. only valuable if the desired frequency of the sampled frequency, which is ade- Ideally, the digital filter would precise- is within the filter passband from DC. quate for most applications. ly match the decimation frequency band- If that is not the case, then we need a The NCO is accompanied by a mixer. width and filter everything outside that way to tune the filter to a different part Operating much like an analog quadra- band. However, a practical effective filter of the frequency spectrum to observe ture mixer, this device performs the width will not exactly match to the full the signal of interest. The narrow band- downconversion of real and complex bandwidth of the decimation ratio. The width can be tuned within the first or input signals by using the NCO frequen- filter width will therefore be some per- second Nyquist zone by a numerically cy as a local oscillator. centage of the decimation frequency, such controlled oscillator (NCO). An NCO of- The filter follows the frequency trans- as 85 percent or 90 percent. For example, fers a method to tune and mix the filter lation stage. After the carrier band of the usable bandwidth of a decimation fac- band to a different portion of the wide- interest is tuned down to DC, the filter tor-of-8 filter may be practically the sample band spectrum (Figure 1). effectively lowers the sample rate while rate divided by 10, or fs/10. The DDC fil- A digital tuning word provides a frac- providing sufficient alias rejection from tering stage must provide a low passband tional divider of the sample rate with a unwanted adjacent carriers around the ripple and a high stopband alias rejection. frequency placement resolution defined tuned bandwidth of interest.

Second Quarter 2015 Xcell Journal 61 XPLANANTION: FPGA 101

The use of a single decimate-by-8 DDC allows the same Xilinx Artix-7 FPGA system to support four times more ADCs.

When mixing a real input signal a board to using a single FPGA, the bandwidth. The SNR calculation of the down to baseband, 6 dB of signal loss number of high-speed serial trans- ADC must then include a correction is introduced due to the filtering of the ceivers supported can limit the num- factor for this filtering that accounts negative image. The NCO introduces ber of ADCs without the use of DDCs. for the processing gain of the filtered an additional small insertion loss. The For a case where only a narrow noise. Using a perfect digital filter, for total loss of a real input signal mixed bandwidth is observed in this system, every power-of-two reduction in band- down to baseband is typically slight- decimation within the ADCs helps width, the processing gain due to the ly more than 6 dB. The NCO allows remove this limitation. The use of a filtered noise will increase by +3 dB: the input spectrum to be tuned to dc, single decimate-by-8 DDC allows the Ideal SNR (with processing gain) = where it can be effectively filtered by same Xilinx Artix®-7 FPGA system to 6.02*N + 1.76 dB + 10log10(fs/(2*BW)) the subsequent filter blocks to prevent support four times more ADCs by re- aliasing. The DDC may also contain an ducing the output bandwidth of the A distinct advantage of using DDCs independently controlled digital gain ADCs to just two output data lanes. is the ability to have the harmonics of stage. A gain stage allows the system For this particular case, as many as the fundamental signal fall outside the to enable +6 dB or more of gain to eight ADCs using DDCs could now be band of interest. With proper frequen- center the dynamic range of the signal designed with the same existing 16 cy planning, digital filtering will pre- within the full scale of the output bits. GTP transceivers in the Artix-7 FPGA vent the harmonics from being seen (Figure 2). This allows more efficient within the narrow DDC bandwidth and INTERPROCESSOR INTERRUPTS use of Xilinx FPGA resources as a therefore will increase the SFDR per- The decimation of the ADC samples multichannel digital receiver for a set formance of the system. removes the need to send unwanted of FDM channels. In the systems where only a narrow information downstream in the sig- bandwidth is needed, a DDC provides nal chain to eventually get discarded DO DDC FILTERS ADC processing gain by filtering out the anyway. Therefore, since this data is AFFECT SNR AND SFDR? wideband noise. This increases the sig- filtered out, it reduces the output data The next question to examine is how nal-to-noise ratio seen within the band- bandwidth needed on the back end of the analog performance of signal-to- width of interest. An additional benefit the ADC. This amount of reduction noise ratio (SNR) and spurious-free dy- is that, with proper frequency planning, is offset by the increase in data from namic range (SFDR) change when the the typically dominant second- and both the I/Q data output. For example, DDC filters are on vs. when they are off. third-order harmonics of the fundamen- a decimate-by-16 filter with both I and Since the wideband noise of the tal fall outside the tuned bandwidth of Q data would reduce the wideband converter is filtered out and only a nar- interest and are digitally filtered. This output data by a factor of 8. row spectrum is observed, we should increases the SFDR of the system. This minimized data rate reduc- expect the signal power relative to the Sampling theorems dictate that es the complexity of system layout observed noise to be higher. The dy- harmonics or other higher-order sys- by lowering the number of output namic range of the ADC will be better tem spurs can fold back around the JESD204B lanes from the ADC. The within the passband of the filter. The end of each Nyquist band. This is also reduction in ADC output bandwidth improved SNR by use of the DDC is true for DDCs, which have the poten- can allow the design of a compact inherently an advantage of decimating tial for unwanted second- or third-or- system that otherwise may not be and filtering the wideband spectrum. der harmonics to fold back into the achievable. For example, in a case Digital filtering by the DDC is used passband and decrease the SFDR. where system power and size limit to filter the noise outside of a smaller Therefore, to navigate around such

62 Xcell Journal Second Quarter 2015 XPLANANTION: FPGA 101

sampling problems, you should im- end will now allow multiple uses for and removes the burden on the system plement your system frequency plan the DDC to either observe multiple FPGA transceiver count and decimation for the DDC passband filter width bands simultaneously or even sweep blocks, which can be reassigned to other and NCO tuning position. the band of interest with the NCO to processing activities such as channelizing find a changing input signal. multiple ADCs for FDM systems. ARE EXTERNAL High-speed ADCs now have the FILTERS REQUIRED? CAN AN ADC processing power to bring the DDC System ADCs using internal DDCs PROVIDE MULTIPLE DDCS? function up the signal chain. For can use additional analog filters, as The final question for engineers con- those systems that do not need to would normally be done without DDC templating internal digital downcon- use the full bandwidth of a wideband filtering. For wideband applications, version with an FPGA is whether an Nyquist-rate ADC, the DDC operation the DDCs provide some relaxation in ADC provides just one DDC. The an- filters the unwanted data and noise. the filtering that’s needed at the front swer is no; in fact, multiple bands can This can improve the SNR and SFDR end of the ADC. be observed. of the signal acquisition. The lower The digital filtering within the For multiple DDCs within an ADC, bandwidth reduces the data inter- DDC will do some of the work and each can have its own NCO that tunes to face burden to the transceivers of an relax what otherwise would require separate bands across the Nyquist zone. FPGA, like Artix-7, and allows for the a strict front-end anti-alias analog This scheme makes it possible to observe design of more-complex signal-acqui- filter. However, a wideband front multiple frequency bands simultaneously sition systems.

Debugging Xilinx's ZynqTM -7000 family with ARM® CoreSightTM

► RTOS support, including Linux kernel and process debugging

► SMP/AMP multicore Cortex®- A9 MPCoreTMs debugging

► Up to 4 GByte realtime trace including PTM/ITM

► Profiling, performance and statistical analysis of ZynqTM's multicore Cortex®-A9 MPCoreTM

Second Quarter 2015 Xcell Journal 63 XTRA, XTRA What’s New in the Vivado 2015.1 Release? Xilinx is continually improving its products, IP and design tools as it strives to help designers work more effectively. Here, we report on the most current updates to Xilinx design tools including the Vivado® Design Suite, a revolu- tionary system- and IP-centric design environment built from the ground up to accelerate the design of Xilinx® All Programmable devices. For more information about the Vivado Design Suite, please visit www.xilinx.com/vivado. Product updates offer significant enhancements and new features to the Xilinx design tools. Keeping your installa- tion up to date is an easy way to ensure the best results for your design. The Vivado Design Suite 2015.1 is available from the Xilinx Download Center at www.xilinx.com/download.

VIVADO DESIGN SUITE 2015.1 RELEASE HIGHLIGHTS The latest release of the Vivado Design Suite includes the new Vivado Lab Edition; interactive clock domain crossing (CDC) analysis; accelerated simulation flows; advanced system performance analysis in the Xilinx Software Development Kit (SDK); and new devices including the XCVU440.

Vivado Lab Edition Design Suite provides unparalleled Device Support The new Vivado Lab Edition is a no- timing analysis and debug functional- New Devices cost, lightweight programming and ities, accelerating time-to-market. debug edition of the Vivado Design The following UltraScale™ devices Suite. It includes the Vivado Device Vivado Simulator and are introduced in this release: Programmer, Vivado Logic and Seri- Third-Party Flows • Virtex® UltraScale devices: al I/O Analyzer, as well as memory Advancements in the simulation XCVU125, XCVU190, XCVU440 debug tools. Lab Edition is intended flows reduce the LogiCORE™ IP General Access for use in laboratory environments compile times by more than half. ® where the full-featured Vivado De- Overall simulation performance is • Kintex UltraScale devices: sign Suite is not required. As such, 20 percent faster than in previous XCKU035, XCKU060, XCKU115 it is 75 percent smaller than the releases. Simulation flows are ful- • Virtex UltraScale devices: XCVU065 complete Vivado Design Edition, ly integrated with those of Xilinx which considerably reduces lab Alliance Program members Aldec, Early Access (Contact your local setup time and system memory re- Cadence Design Systems, Mentor Xilinx sales representative) quirements. For design teams that Graphics and Synopsys. • Virtex UltraScale devices: XCVU160 require remote debug or program- Licensing ming over Ethernet, the Vivado Xilinx SDK Advanced System Design Suite 2015.1 also provides Vivado Design Suite 2015.1 features up- a standalone hardware server that Performance Analysis dates to licensing. With Vivado License occupies less than 1 percent of the Xilinx has extended the SDK with the Borrow, clients may borrow one floating complete Vivado Design Edition. ability to analyze the performance and seat and lock to machine for non-network the bandwidth of a Zynq®-7000 All Pro- use of the Vivado tools for a specific pe- grammable SoC design, including key riod (for activation licenses only). Also Interactive Clock Domain performance metrics for the process- included is virtual-machine support for Crossing Analysis ing system (PS) as well as bandwidth activation-based licenses. This release The interactive CDC capability en- analysis between the PS, the program- introduces one-step activation licensing. ables debug of CDC issues earlier mable logic (PL) and external memo- If you are using Vivado License Manager in the design, reducing expensive ries. System-modeling designs using for client (node-locked) licenses, and are in-system debug cycles. Combined AXI traffic generators are provided connected to the Internet, Vivado License with interactive timing analysis and for the ZC702 and ZC706 evaluation Manager downloads and installs activa- cross-probing features, the Vivado boards. tion licenses automatically.

64 Xcell Journal Second Quarter 2015 VIVADO DESIGN SUITE: System Generator for DSP New Suite of ‘Any to Any’ Video DESIGN EDITION UPDATES Release updates include advanced Connectivity Solutions support for hardware co-simulation The release includes HDMI 1.4/2.0; Dis- Partial Reconfiguration and burst mode, which accelerates sim- playPort 1.2 with support for HDCP; Tandem Configuration ulation to increase performance by and the new 6G/12G versions of SDI. The Partial Reconfiguration Control- 100x. Improved timing analysis allows These Xilinx-developed and support- ler IP is now available for any user of cross-probing to quickly identify failing ed interfaces will allow developers to partial reconfiguration (PR) in 7 series, paths. A new capability parses an SoC adhere to the latest industry standards Zynq SoC or UltraScale devices. This platform design from Vivado IP Inte- for 4K/Ultra-HD systems. IP is the heart of a PR system, fetching grator to tailor a complementary set of from memory and delivering to the con- gateways for easy and accurate IP de- The video connectivity solutions are figuration port partial bitstreams when velopment. Enhanced support for mul- available from Xilinx in the Vivado hardware or software trigger events oc- cur. The latest release has expanded sup- tiple AXI4-Lite interfaces enables in- Design Suite 2015.1 release and will port for UltraScale devices, and supports dependent register alignment to clock be supported on Artix-7, Kintex-7, Vir- implementation for KU115, VU125 and domains. Finally, the tool now supports tex-7, Virtex-7X and Kintex UltraScale VU190 devices, as well as the previously MATLAB® 2015A. FPGAs, and on Zynq-7000 All Program- supported KU040, KU060 and VU095 de- mable SoCs. The video-processing vices. Tandem PROM and Tandem PCIe® See the Vivado Design Suite 2015.1 Re- suite is now available from Omnitek are available for the same UltraScale de- lease Notes for more information. and targets Kintex-7 and Kintex Ultra- vices that now have PR support (KU115, Scale FPGAs and Zynq SoCs. To learn VU125 and VU190). more, visit http://omnitek.tv/sites/default/files/OSVP.pdf. For more information, see PG193. XILINX INTELLECTUAL PROPERTY (IP) UPDATES Vivado IP Integrator Release updates include a bottom-up Xilinx has launched its next generation LEARN MORE synthesis flow option for faster design it- of video-over-Internet Protocol connec- erations. Each piece of IP is synthesized tivity and professional video solutions, QuickTake Video Tutorials by itself and only changed if the IP needs enabling “any media over any network,” Vivado Design Suite QuickTake video tu- to be synthesized again. A new layout for the broadcast and pro A/V markets. torials are how-to videos that take a look optimized for IP Integrator is available inside the features of the Vivado Design in the Vivado IDE. The result is up to a Video-over-Internet Suite and UltraFast™ Design Methodolo- 50 percent reduction in project flow time Protocol Connectivity gy. New topics include: including design generation and valida- Xilinx defines and deploys video over tion, along with improvements to revi- • What’s New in Vivado 2015.1 sion control ease of use. The release also Internet protocols for contribution and • Simulating with Cadence IES offers support for saving a design in the distribution networks with the provision validated state so that validation does of the SMPTE ST 2022-1,2,7 and SMPTE in Vivado not need to be rerun during generation. ST 2022-5,6,7 cores and reference de- • Designing with UltraScale Memory IP The search in the “Add IP…” window has signs. ST 2022-1,2,7 and ST 2022-5,6,7 been enhanced and there is now quick cores are now available from Xilinx. • Using Board Automation with access to IP details. To learn more, visit http://www.xilinx. IP Integrator com/products/intellectual-property/ef- See the Vivado Design Suite 2015.1 See all Quick Take Videos at www.xil- di-smpte2022-12.html and http://www. inx.com/training/vivado. Release Notes for more information. xilinx.com/products/intellectual-property/ef-di-smpte2022-56.html. VIVADO DESIGN SUITE: Training For instructor-led training on the Vivado SYSTEM EDITION UPDATES The video-over-IP FEC Engine l is avail- Design Suite, UltraFast Design Method- able in the Vivado Design Suite 2015.1 ology and more, visit www.xilinx.com/ Vivado High-Level Synthesis release. Xilinx provides free system-lev- training. The System Edition boasts new synthesiz- el reference designs for all cores with able C++ library functions with a special application notes at http://www.xilinx. Download Vivado Design Suite 2015.1 focus on software-defined radio applica- com/esp/broadcast/refdes_listing.htm. today at http://www.xilinx.com/down- tions: numerically controlled oscillator load. (NCO), QAM modulator and demodula- tor. See UG902 for more information.

Second Quarter 2015 Xcell Journal 65 XPEDITE Latest and Greatest from the Xilinx Alliance Program Partners Xpedite highlights the latest technology updates from the Xilinx Alliance Program ecosystem.

he Xilinx® Alliance Program is a worldwide ecosystem of qualified companies that collaborate with Xilinx to further the development of All Programmable technologies. Xilinx has built this ecosystem, leveraging open platforms and T standards, to meet customer needs and is committed to its long-term success. Alliance members—including IP providers, EDA vendors, embedded software providers, system integrators and hardware suppliers—help accelerate your design productivity while minimizing risk. Here are some highlights.

DAVE’S BORA the software domain to the FPGA fabric nology innovations in image process- NOW SUPPORTS on top of an existing implementation. ing and computer vision. To provide SDSOC DEVELOPMENT DAVE demonstrated the BORA sys- reusable IP that smoothly integrates ENVIRONMENT tem in February at Embedded World into Xilinx All Programmable SoCs 2015 in Nuremberg, Germany. The and MPSoCs and implements evolv- DAVE Embedded Systems (Porcia, Ita- demonstration consisted of IP (name- ing video-processing, object-detec- ly) has collaborated with Xilinx to make ly, an LCD controller) that was devel- tions and video-analytics algorithms, sure its BORA module supports Xilinx’s oped with classic tools and IP generat- Xylon’s designers have become ex- SDSoC development environment. The ed by the SDSoC design environment. perts with the latest Xilinx design BORA module is built around Xilinx’s tools and technologies. Xylon has Zynq®-7000 All Programmable SoC. The For more information, please visit been working closely with Xilinx as SDSoC support will allow BORA users to http://www.dave.eu. Xilinx developed its innovative SD- develop their software algorithms quick- SoC development environment as an ly and easily. Hardware-accelerated func- alternative to manual RTL coding. tions within BORA are implemented in XYLON USES SDSOC Using the new development envi- programmable logic but can be invoked FOR MICROZED-BASED ronment for only a couple of weeks, transparently from software applications VISION PLATFORM Xylon was able to develop a board running on the Zynq SoC’s dual-core support package (BSP) for a MicroZed ARM® Cortex™-A9 processing system. Xylon (Zagreb, Croatia) develops board-based vision platform. And by Accelerated functions—written in C, logicBRICKS IP cores that help cus- using the legacy logicBRICKS IP as the C++ or SystemC—can be moved from tomers stay at the forefront of tech- C-callable RTL IP, Xylon designed a re-

66 Xcell Journal Second Quarter 2015 al-time facial-features tracking system ing of the design between software DORNERWORKS TIPS that the company demonstrated at Em- and HDL, the implementation of the XEN HYPERVISOR FOR bedded World 2015. data-processing algorithms in HDL ZYNQ ULTRASCALE+ code and the integration of this HDL MPSOC DESIGNS For more information, please visit code in the main system design. The http://www.logicbricks.com/. Xilinx SDSoC development environ- Xilinx has selected DornerWorks to ment addresses these challenges by provide an open-source Xen hyper- providing an integrated development visor solution to its customers along IVEIA PUTS SDSOC environment within which users can with comprehensive customer sup- THROUGH ITS PACES implement their entire design in C/ port and engineering services. The ON EDGE-DETECTION C++ code, select which parts of the Xen hypervisor is a well-established ALGORITHM design they would like to have imple- virtual-machine monitor for running mented in programmable logic and multiple operating systems simul- Leveraging Xilinx’s SDSoC develop- then let the tool generate the soft- taneously, making Xen an obvious ment environment, iVeia (Annapo- ware and HDL portions of the design choice for upcoming products based lis, Md.) demonstrated at Embedded and bind everything together to cre- on the Zynq UltraScale+™ MPSoC World one of its pure C/C++ reference ate the final system design. targeting next-generation cloud, data designs for processing high-definition Analog Devices (Norwood, Mass.) center, and wireless and wired net- video. Using iVeia’s Atlas-I-Z7e system- has started to adopt the SDSoC de- work computing. Xen provides the on-a-module and video development velopment environment to create virtualization of applications and of kit, the company ran a canny edge-de- reference designs for its SDR FM- guest operating systems running on tection algorithm in real time on a GigE COMS2/3/4/5 platforms based on the the Zynq MPSoC’s quad-core ARM Vision high-definition camera. AD9361/AD9364 agile RF transceiv- Cortex-A53 processor. Two implementations of the iden- ers. The first reference design to be “DornerWorks brings proven, solid tical C/C++ code were demonstrated released shows how to implement a technology expertise on Xen hyper- running side-by-side, one compiled direct-digital-synthesis (DDS) IP core visor solutions and has strong ser- using the standard C/C++ compiler, in the SDSoC development environ- vice capabilities for both embedded the other using the SDSoC development and integrate the generated IP processing designs and FPGA log- ment environment’s full-system op- into the Analog Devices FMCOMMS ic designs,” said Hugh Durdan, vice timizing compiler. The performance HDL platform and Linux environment. president of portfolio and solutions of the SDSoC development environ- In this way, the output of the DDS can marketing at Xilinx. “Our partnership ment’s implementation was nearly be transmitted and received back with DornerWorks on the Xen hypervi- two orders of magnitude faster than over the air by the FMCOMMS card sor solution for the Zynq UltraScale+ the standard implementation. Using and displayed in the Analog Devices MPSoC will enable our customers to the SDSoC environment to profile and IIO Scope Linux application. adopt a virtualized, secure OS plat- partition the design, functions identi- The DDS IP is implemented en- form quickly and accelerate product fied as repetitive and compute-inten- tirely in C code following the Xilinx development cycles.” sive were targeted for the program- high-level synthesis (HLS). The DDS Development with a hypervisor not mable logic while the more-complex function is called in a C main loop only requires the Xen kernel, but also statistical algorithms remained on the to generate the bindings for the data a configured dom0 (the privileged processing system. flow between the IP and the rest of system domain) and configured guest For more information, please visit the design. Based on this code, the domains with their own guest operat- SDSoC development environment is http://www.iveia.com/. ing systems. DornerWorks will pro- able to synthesize the DDS IP, inte- vide sample distributions of complete grate it into the Analog Devices HDL packages and documentation on how ADI ADOPTS SDSOC platform and generate all the neces- to spin your own customized systems. TO CREATE RADIO sary files in order to be able to run To complete the hypervisor eco- REFERENCE DESIGNS the Analog Devices Linux distribu- system, DornerWorks also offers a tion on the Zynq-7000 All Program- range of services to its customers, in- Three of the main challenges of any mable SoC. Analog Devices demon- cluding FPGA design services. SoC-based software-defined radio strated the DDS HLS IP with the Zynq For details, visit http://dorner- (SDR) platform are the partition- SoC SDR kit at Embedded World. works.com/services/XilinxXen.

Seond Quarter 2015 Xcell Journal 67 XCLAMATIONS! Xpress Yourself in Our Caption Contest

BARRY MEAKER, design engineer at Boeing Company (Seattle), won a shiny new Digilent Zynq Zybo board with this caption for the smiley-face cartoon in Issue 90 of Xcell Journal: DANIEL GUIDERA

ven robots have their limits. The mechanical man clanking his way “Ever since Bob announced across this factory floor looks like he’s had enough—driven berserk, he’s retiring in a month, Eperhaps, by faulty programming. How do you interpret his plight? he seems different.” Let us know by writing an engineering- or technology-related caption for our cartoon featuring this modern-day tin man. The image might inspire a caption like “George watched the new janitorial robot walk off the job in a huff, but couldn’t fix the Roomba AI program fast enough to stop him.” Congratulations as well to Send your entries to [email protected]. Include your name, job title, our two runners-up: company affiliation and location, and indicate that you have read the con- “It’s a Logic High.” test rules at www.xilinx.com/xcellcontest. After due deliberation, we will — Aaron Algiere, print the submissions we like the best in the next issue of Xcell Journal. electrical engineer, Harris Corp., ® The winner will receive a Digilent Zynq Zybo board, featuring the Xilinx Melbourne, Fla. Zynq®-7000 All Programmable SoC (http://www.xilinx.com/products/ boards-and-kits/1-4AZFTE.htm). Two runners-up will gain notoriety, fame “It’s obviously unstable. and have their captions, names and affiliations featured in the next issue. Too much positive feedback.” The contest begins at 12:01 a.m. Pacific Time on April 17, 2015. All entries must be received by the sponsor by 5 p.m. PT on July 1, 2015. — Glenn Dixon, engineer, L-3 Communications, Salt Lake City

NO PURCHASE NECESSARY. You must be 18 or older and a resident of the fifty United States, the District of Columbia, or Canada (excluding Quebec) to enter. Entries must be entirely original. Contest begins on April 17, 2015. Entries must be received by 5:00 pm Pacific Time (PT) July 1, 2015. Official rules are available online at www.xilinx.com/xcellcontest. Sponsored by Xilinx, Inc. 2100 Logic Drive, San Jose, CA 95124.

68 Xcell Journal First Quarter 2015 Need Wireless Connectivity in your Next SoC Design?

Adding wireless to your next Zynq®-7000 All Programmable SoC design just got easier with Avnet’s new WiLink® 8 Adaptor and example Linux reference design for MicroZed™.

Order your kit today at: www.zedboard.org/product/wilink-8-adaptor

© Avnet, Inc. 2015. All rights reserved. AVNET is a registered trademark of Avnet, Inc. Xilinx and Zynq are trademarks or registered trademarks of Xilinx, Inc. n Adam Taylor’s MicroZed Chronicles Part 77 – Introducing the Zynq SoC’s Ethernet n NGCodec demos HEVC H.265 Real-Time Encoder running on Kintex-7 FPGA at NAB 2015 n New 27-page White Paper covers the basics of design for functional safety using FPGAs and the Zynq SoC n Zynq-based ONetSwitch open-source SDN Kickstarter project completes successful funding campaign

n Sumitomo Electric’s 4x25G QSFP28 LR4 optical module and Virtex UltraScale FPGA drive 10km of fiber at OFC 2015