Xilinx Speeds Customer Medical Innovations to Market 8 FOURTH QUARTER 2015, ISSUE 93

Home , MicroBlaze, Unicore

ISSUE 93, FOURTH QUARTER 2015

SOLUTIONS FOR A PROGRAMMABLE WORLD

FPGA-Based Fuzzy Xilinx Speeds Customer Controller Manages Sugarcane Extraction

Medical Innovations to Market Zynq MPSoC Gets Xen Hypervisor Support

Making XDC Timing Constraints Work for You

The latest Xilinx tool updates and patches, as of October 2015

Xilinx FPGAs Advance Autonomous Monitoring of Crowds 14 www.xilinx.com/xcell Lifecycle Technology

Design it or Buy it?

Shorten your development cycle with Avnet’s SoC Modules

Quick time-to-market demands are forcing you to rethink how you design, build and deploy your products. Sometimes it’s faster, less costly and lower risk to incorporate an off-the-shelf solution instead of designing from the beginning. Avnet’s system-on module and motherboard solutions for the Xilinx Zynq®-7000 All Programmable SoC can reduce development times by more than four months, allowing you to focus your efforts on adding differentiating features and unique capabilities.

Find out which Zynq SOM is right for you http://zedboard.org/content/design-it-or-buy-it

facebook.com/avnet twitter.com/avnet youtube.com/avnet

LETTER FROM THE PUBLISHER

Zynq UltraScale+ Says ‘Hello World’ Congratulations to Xilinx and especially to the Zynq® UltraScale+™ MPSoC design team for PUBLISHER Mike Santarini shipping the first Zynq MPSoC XCZU9EG one quarter ahead of schedule. This accomplish- [email protected] 408-626-5981 ment makes it a “three peat”—the third time in a row Xilinx has beat the competition to market by being the first programmable-logic company to ship devices on a leading process EDITOR Jacqueline Damian node. Xilinx was the first to 28 nanometers with the 7 series, first to 20nm with the UltraScale™ family and now first to 16/14nm FinFET with the UltraScale+. ART DIRECTOR Scott Blair The news broke on Sept. 30, when Xilinx announced it had sent the first sampling ship- ments of the new device to two customers. The design went from tapeout to shipment in two DESIGN/PRODUCTION Teie, Gelwicks & Associates and half months, which is a true testament to the engineering skills of Xilinx®’s Zynq MPSoC 1-800-493-5551 design and quality teams as well as the good folks at foundry TSMC. Xilinx received the first silicon in late September for characterization and test, and within ADVERTISING SALES Judy Gelwicks 1-800-493-5551 hours was able to boot an upstream Linux kernel on the new device (see video demo). [email protected]

INTERNATIONAL Melissa Zhang, Asia Pacific [email protected]

Christelle Moraga, Europe/ Middle East/Africa [email protected]

Tomoko Suto, Japan [email protected]

REPRINT ORDERS 1-800-493-5551

The video demonstrates an “upstream” Linux kernel booting on the new Zynq UltraScale+ MPSoC, the first member of the Xilinx UltraScale+ portfolio to ship.

We highlighted the architectural features as well as the performance-per-watt advantages of this exciting new device in the cover story of Xcell Journal issue 90. Implemented in TSMC’s www.xilinx.com/xcell/ 16nm FinFET+ process technology, the Zynq MPSoC features on a single device a quad-core 64-bit ARM® Cortex™-A53 application processor, a 32-bit ARM Cortex-R5 real-time processer Xilinx, Inc. and an ARM Mali-400MP graphics processor, along with 16nm FPGA logic (with UltraRAM), 2100 Logic Drive a host of peripherals, security and reliability features, and an innovative power control tech- San Jose, CA 95124-3400 Phone: 408-559-7778 nology. The new Zynq UltraScale+ MPSoC gives users what they need to create systems with FAX: 408-879-4780 a 5x performance/watt advantage over systems designed with the 28nm Zynq SoC. www.xilinx.com/xcell/ In my tenure here as the publisher of Xcell Publications, it’s been truly remarkable to read about the amazing innovations our readers have been able to create with our 28nm Zynq-7000 © 2015 Xilinx, Inc. All rights reserved. XILINX, All Programmable SoC. So, I’ll be eagerly waiting to see what you will do with the new Zynq the Xilinx Logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trade- MPSoC, with its many multiprocessing features. As the device becomes more broadly available marks are the property of their respective owners. over the next few quarters, I hope you will share your Zynq MPSoC design experiences with

The articles, information, and other materials included your peers by contributing articles to Xcell Journal and our new magazine for software devel- in this issue are provided solely for the convenience of opers, Xcell Software Journal. I look forward to reading about your remarkable designs. our readers. Xilinx makes no warranties, express, implied, statutory, or otherwise, and accepts no liability with respect to any such articles, information, or other materials or their use, and any use thereof is solely at the risk of the user. Any person or entity using such information in any way releases and waives any claim it might have against Xili nx for any loss, damage, or Mike Santarini expense caused thereby. Publisher

CONTENTS

VIEWPOINTS XCELLENCE BY DESIGN APPLICATION FEATURES Letter from the Publisher Xcellence in Vision Systems Zynq UltraScale+ Says ‘Hello World’… 4 Xilinx FPGAs Advance Autonomous Monitoring of Crowds… 14

Xcellence in Test & Measurement 14 Test New Memory Technology Chips Using the Zynq SoC… 22 22 Xcellence in Test & Measurement Unleashing High-Performance USB Devices with Artix-7 FPGA… 26 Xcellence in Industrial 30 FPGA-Based Fuzzy Controller Manages Sugarcane Extraction… 30

Cover Story Xilinx Speeds Customer Medical Innovations to Market 8 FOURTH QUARTER 2015, ISSUE 93

THE XILINX XPERIENCE FEATURES Xplanation: FPGA 101 44 Zynq MPSoC Gets Xen Hypervisor Support… 36

Xplanation: FPGA 101 Vivado IPI Opens FPGA Shareable Resources for Aurora Designs… 44

Xplanation: FPGA 101 Making XDC Timing Constraints Work for You… 52

Tools of Xcellence Toward Easier Software Development for Asymmetric Multiprocessing Systems… 58

36 52

XTRA READING Xtra, Xtra The latest Xilinx tool updates and patches, as of October 2015… 66 Xclamations! Share your wit and wisdom by supplying a caption for our wild and wacky artwork… 68 COVER STORY Xilinx Speeds Customer Medical Innovations

by Mike Santarini Publisher, Xcell Journal, Xilinx, Inc. to Market [email protected]

8 Xcell Journal Fourth Quarter 2015 COVER STORY

lectronics affects our lives in so many ways, but one area All Programmable platforms, where it is literally having a life-or-death impact is in the medical-certified IP and software healthcare industry. People are living longer thanks in large stacks enable Xilinx customers part to continual advances in healthcare fueled by rapid im- to bring life-saving equipment provements in medical electronics. Customers have Ecreated these innovations with Xilinx® All Program- mable devices over the last three decades. to market sooner. Today, Xilinx FPGAs (and increasingly, systems-on-chips like the Zynq®-7000 All Program- mable SoC) are at the heart of a growing number of medical systems. End products run the gamut from surgical robots, patient monitors, ventila- tors, medical imaging (CT, MRI and ultrasound) and X-ray machines to defibrillators, endoscopy machines, infusion pumps and analyzers. FPGAs are also central to the high-performance computing systems that enable researchers to perform genomic sequencing at lightning speed and those that enable scientists and pharmaceutical companies to more quickly develop and refine medicines to treat the symptoms of diseases. The reason FPGAs have assumed such a key role in the medical market is simple. FPGAs—and more recently, Zynq SoCs—enable medical equipment developers to lower the risk of failures and speed their equipment through the regulatory process. With All Programmable devices, designers can take the functions of their systems that must be highly reliable and implement them in the logic of the Xilinx devices while running other, less-critical functions in software (Figure 1). “Medical technologies are evolving rapidly,” said Kamran Khan, senior product marketing engineer for Xilinx’s medical group. “They are getting much smarter, more connected, more integrated, more compact and less invasive for faster patient recovery times. And, more important, they are better at what they do.” Demand for medical devices with advanced features is growing as the population worldwide increases and as people live longer, thanks in part to the rapidly evolving healthcare advances of the last half century.

LIVING LONGER, HEALTHIER LIVES The biggest growth driver of the healthcare industry and thus the medical equipment market is the expected increase in the world’s population. Today, the global population stands at roughly 7.3 billion; by 2050, it is expected to grow to 9.7 billion. Where-

Fourth Quarter 2015 Xcell Journal 9 COVER STORY

as today, people 65 and older represent “The medical electronics industry used to have one machine that spe- roughly 23 percent of the world’s popu- used to be massively consolidated, and cialized in one particular task. Each lation, by 2050 that number is expected so companies like Siemens and General of those machines was quite large, to be 32 percent, or 3.1 billion people. Electric could develop new generations and if many different machines were Those senior citizens are expected to be of their systems at their own pace,” said needed, they occupied quite a bit of the segment of the population most like- Khan. “Nowadays, X-ray and ultrasound space in a hospital room.” ly in need of regular medical care. have become a commodity. Today, peo- What’s more, Khan said, the systems “People are living longer than be- ple realize that it is not very difficult to wouldn’t necessarily communicate or fore, because our quality of life and get into these mature, low-risk applica- be compatible with one another, which understanding of medicine are bet- tions, even with the regulatory burden. could cause other complications. So to- ter,” said Khan. “Because people are Now we are seeing a boom in medical day there is a demand to have a single living longer, we need better health- equipment startups, and it’s a global piece of equipment perform multiple care to support them in their older boom, with a large number of new com- tasks. Similarly, users prefer a smaller age, and we need to be able to supply panies from China and South America form factor so that the equipment takes this with less-invasive techniques that entering the market. Some countries, up less space and is easier to move from have fewer side effects.” like China and Brazil, favor domestic room to room. It’s even better if the equip- Population growth along with the ag- products, so there are new companies ment is battery powered so that it can be ing of the populace represents a great emerging to serve these new markets.” used in areas with no electricity or in challenge for the world’s medical com- Khan said this new competition ambulances. Simultaneously, there is a munity as well as governments, which is of course increasing both time-to- growing need for each system to be able are increasingly turning to state-regulated market pressure and pricing pressure to interact with other pieces of equip- healthcare. This population growth also for all players. At the same time, it is ment in real time to, for example, change serves as a grand opportunity for inno- also driving greater value and innova- dosages of a medicine in response to a vation in medical electronics and other tion into the medical world faster, for patient’s changing vital signs. medical fields. As such, the medical mar- the benefit of us all. “Patient monitors, for example, ket is expected to grow to $212 billion by “The market is demanding more used to be quite simplistic, but not any- 2019, with the semiconductor spend ex- integration of features and more por- more,” said Khan. “They used to take pected to reach $6 billion, said Khan. tability,” said Khan. “Medical facilities in a few channels of analog biotelem-

Network Single Chip Network TI OMAP or Solution with Freescale i.MX6d 7010 Motor Motor Control Processor Zynq SoC ControlMotor UnitControl RTOS Unit Higher Potential RTOS RTOS MC Unit Analytics for Failure on isolated core MC Peripherals MC Peripherals Dual Core A9 HMI Higher reliablity Small FPGA for MC with real-time Fabric Sensory input analytics Video ViVideodeo Unit Analytics Singlee CoCorere ARARMM Multiple Motor Control Single chip = Devices, Video lower BOM HMI Extra BOM FPGA Video Application Memory Speciﬁc LPDDR2 Older Memory, Lower power, Memory Features Sensors Slower Faster memory Sensors LPDDR3

Traditional Clinical Designs Single-Chip Clinical Designs

Figure 1 – Zynq SoC platforms enable medical equipment companies to quickly create innovative, optimized systems and bring them to market.

10 Xcell Journal Fourth Quarter 2015 COVER STORY

‘Nowadays, medical equipment must be more integrated and smarter. Systems must be able to communicate and work in concert.’ etry, apply some very simple process- A changing regulatory environment Khan said that while the various reg- ing to it and display the information is often cited as the single biggest chal- ulatory bodies worldwide scrutinize both on a monitor. It was quite simple, so lenge medical equipment companies hardware and software, they tend to look a lot of this was done with embedded face today. Companies can’t legally sell hardest at software, since failures in soft- processors from companies like TI their products until their systems clear ware are believed to be more likely to or Freescale. But nowadays, medical relatively strict regulatory guidelines cause the system to fail into an unknown equipment must be more integrated and testing. Medical equipment failure state. “Most people have encountered and smarter. Systems must be able to can lead to medical liability lawsuits. software bugs on their PCs or mobile de- communicate and work in concert.” The amount of regulation depends vices and not so commonly on the actual For example, he said, a patient monitor largely on the type of equipment a com- hardware, so software typically comes must be able to talk to the ventilator and pany wants to bring to market and scales under greater regulatory scrutiny in gen- infusion pump. If a patient’s vital signs be- accordingly. Less-critical equipment that eral in medical devices,” Khan said. gin to spike, both systems must respond doesn’t touch the patient’s body typically “A lot of medical devices coming to appropriately, with the ventilation pump requires a regulatory cycle that’s roughly market typically require embedded pro- adjusting the oxygen mix and the infu- half a year, while equipment performing cessors,” he went on. “They may not sion pump adjusting the medicine dosage. more-critical functions that do touch require a lot of compute horsepower “They also have to communicate to the a patient’s body requires roughly two but they have some amount of embed- hospital’s main network to keep staff ap- years’ worth of approval cycles before ded processing and thus software. The prised of emergencies,” Khan said. More- release to the market. challenge is, how do I show to regulato- over, the information must be kept as part “Even in established markets like ry bodies and to customers that the de- of the patient’s long-term medical record. patient monitoring or ultrasound, they vice is safe and won’t hurt patients? For Khan said that the healthcare indus- still require a regulatory cycle of a year example, if it’s an infusion pump that’s try is also embracing the Internet of to a year and half, typically because delivering my daily medicine, how do Things, cloud computing and the mod- there is so much documentation and I know that it is delivering the right ern network infrastructure. All of these testing required,” said Khan. “They amount of medicine on time and that it developments are making it possible to have to create a technical file that sup- won’t stop in the middle of the night?” monitor the health of outpatients and reports their product, submit the docu- And as devices become more com- cord health patterns remotely, as well as mentation to the regulator which, after plex to perform multiple tasks, the being able to react in real time to emer- a long review cycle, passes or fails the code also becomes more complex and gency situations. There is a burgeoning product. The goal is to pass on the first much larger. “‘Reliable’ software is still industry to monitor patient health in real try, because if they don’t pass, the sec- something that isn’t well understood,” time much in the same way as home se- ond review is even more detailed and said Khan. “Software is very complex; curity companies monitor property. can involve an even longer cycle. It’s it’s hard to debug and understand the like being audited by the IRS.” risk of it. This high potential for failure THE EVER-GROWING Khan said the real goal is to under- results in high risk, even for mundane REGULATORY CHALLENGE stand and quantify the risk a medical medical systems that aren’t life-critical.” In the midst of this industry growth and system poses. “The perception is that This is one of the main reasons why drive toward smarter, more connected medical devices can’t fail, but regulato- FPGAs and more recently, Zynq SoCs and more integrated medical systems, ry bodies know that everything fails,” have become so popular with medical vendors are faced with ensuring that said Khan. “What they want to do is have equipment manufacturers. With these their equipment conforms to increas- OEMs understand and lower the failure All Programmable devices, compa- ingly more stringent safety and reliabil- risk, at the same time knowing all pos- nies can lower the risk of failures and ity regulations from a growing number sible cases in which a device could fail speed up the regulatory process. They of regulatory bodies worldwide. and what will happen if it does.” can essentially take functions of their

Fourth Quarter 2015 Xcell Journal 11 COVER STORY

system that must be reliable and imple- For the last 10 of the 30 years that “With our Zynq SoC portfolio, we are ment them in the logic of the Xilinx de- Xilinx has been serving the medical able to reduce risk and speed medical in- vices while placing other, less-critical market, FPGAs have been rapidly dis- novations to market,” said Khan. “We are functions in software. placing ASICs and ASSPs in medical able to take in the design’s software ele- gear. Medical equipment is sold in rel- ments with our new SDSoC tools, imple- XILINX DEVICES SPEED atively low volumes worldwide, and ment them in programmable-logic fabric MEDICAL INNOVATION so the cost, coupled with the strict and instead of in software and then add in lay- Khan said that with its decades of expe- time-consuming testing-and-regulatory ers of redundancy to provide more layers rience serving the medical electronics process, makes ASICs and ASSPs pro- of reliability in these systems.” sector, Xilinx has developed a compre- hibitive. As a result, a vast majority of For example, said Khan, if a company hensive medical toolbox consisting first medical equipment today employs Xil- is designing an infusion pump, part of the and foremost of Xilinx’s All Program- inx devices in some capacity. system will be controlling the motors so mable FPGAs and SoCs. The toolbox Starting in the late 1980s and early as to deliver the medicine in exact quan- also contains certified design tools and 1990s, said Khan, customers began us- tities at the exact time specified, with the methodologies to ensure quality, reli- ing Xilinx’s smaller FPGAs as sensor metrics staying exactly where a doctor set ability and redundancy, along with tried- interfaces in medical equipment. But them. Meanwhile, another part of the in- and-true silicon IP and software stacks over time, companies started adding fusion pump is biotelemetry—monitoring from Xilinx and members of the Xilinx more-critical functions to the devices the patient and ensuring he or she is OK. Alliance Program. The company’s new as FPGAs began displacing ASICs and “Using our isolation design flow, cus- SDSoC™ development environment ASSPs. In the newest equipment, Xilinx tomers can partition their system into will enable medical customers to even devices are playing a pivotal role at the critical and noncritical functions, imple- more quickly create optimized systems heart of these systems, especially with ment critical functions in logic and create with critical functions implemented in the Zynq SoC and the recently shipped physical barriers between critical func- the Zynq SoC’s logic while less-critical Zynq UltraScale+™ MPSoC, which has tions in the system,” said Khan. “They ® functions run on the Zynq SoC’s ARM additional safety and security features can build in extra safety measures so that processing system (Figure 2). beyond those offered in the Zynq SoC. if a failure condition may occur, it shuts Network TI OMAP or Single Chip Network Freescale i.MX6d Solution with Motor ~$20 7010 for <$20 Motor Control Processor Zynq SoC Complete Medical-Focused Toolbox ControlMotor UnitControl RTOS Unit Higher Potential RTOS RTOS Reduce Risk, and Lower Development and Certification Times MC Unit Analytics for Failure on isolated core MC Peripherals MC Peripherals Dual Core A9 Lower Regulatory Risk Medical Certified Longest Life Every RTOS HMI Small FPGA for Higher reliablity MC with real-time Sensory input Fabric Video ViVideodeo ~$10 analytics Unit Analytics Singlee CCoreore ARARMM Motor Control Multiple Single chip = Video Devices, lower BOM HMI Extra BOM FPGA Video Lower Risk = Precertified Tools Devices Exceeding Application Memory Specific LPDDR2 Older Memory, Lower power, Lower Regulatory and Devices Medical Product Life Memory Features Sensors Slower Faster memory Sensors Effort LPDDR3

Traditional Clinical Designs Single Chip Clinical Designs Isolation Design Flow ISO 61508 12 to 15 Years+ Compreshensive Medical Isolation Verification ISO 62304 Device Availability RTOS and OS Options Flow ISO 9001

Figure 2 – Xilinx has proven design tools that lower risk and help speed designs through the regulatory process.

12 Xcell Journal Fourth Quarter 2015 COVER STORY

product from scratch and that it needs to take a scalable approach to its product lines,” said Khan. “Platform-based design and cost-sizing are huge, for example, in the medical imaging space, where companies offer a portable version, low end, midrange and high end in a product line. By designing one platform based on the Zynq SoC for the high end, they can use the same hardware at each level and scale the functions to suit the needs of each end market by reducing features.” A platform approach built around the Zynq SoC has many advantages over a platform composed of multiple discrete Figure 3 – TOPIC Embedded Products’ Dyplo IP helps speed medical product design and development. parts, said Khan. “Medical devices are typically on the market from 10 to 15 down in a safe and predictable manner. tion and ISO 13485 medical device design years, much longer than consumer prod- Furthermore, they can show regulatory standards, and international standards in- ucts, which typically have a two- to three- bodies that they are building their design cluding ICE 61508 (functional safety) and year lifespan,” he said. “Medical devices in trusted Xilinx fabric. And then, using ICE 62304 (RTOS integration). The result are typically in design for three years, can reports from our IDT tool, they are also is to speed customer end-product designs go through regulatory approval for an- able to show regulators the signal paths, through the regulatory processes. other one to three years and then have to predictable outcomes and fail-safes.” “Customers have to get certification be in the market for 10 more years. But With Xilinx’s new SDx development on their end products, rather than the most embedded processors today have a environments (SDAccel™ for C, C++ individual devices, but what we do is en- lifespan of about five years and then they and OpenCL™ design entry into FPGAs; sure our devices, tools and IP conform are end-of-lifed for a newer version of the and SDSoC for C/C++ design entry in to these standards,” said Khan. “For ex- device. That’s because most are designed Zynq SoCs), medical equipment com- ample, ICE 62304 is a standard for RTOS mainly for the consumer market. But panies can now develop the prototype integration. Alliance member QNX’s in the medical device industry, if a chip of their systems in C; determine what RTOS has already been precertified for needs to be changed for a newer one be- functions, both critical and noncritical, 62304, so using it on the Zynq SoC can cause the older version isn’t available any would best be suited to running in hard- cut six months off the certification pro- longer, the product needs to go through ware and software; and then use the cess. Likewise, alliance member TOPIC the regulatory process again.” isolation design flow to implement the Embedded Products offers remarkable Khan said the Zynq SoC and MPSoC hardware functions in greater detail and IP to further speed prototyping and families give customer designs the per- add layers of redundancy for further re- design. Their IP, design flow and SOM formance advantages of an embedded liability. “Using a system-level methodol- board are precertified to be compliant processor or multiple processors plus ogy can cut months out of a design cycle with the ISO 13485 quality-management the flexibility, product differentiation and at the back end,” said Khan. standard. This enables customers to cut safety of programmability. On top of all Khan noted that with Xilinx’s rich his- even more time off the regulatory pro- that is I/O flexibility to accommodate a tory serving the military and space com- cess” (see video, Figure 3). vast range protocols, sensors and video munity, Xilinx’s commercial devices used configurations. “Integrating multiple sys- in medical applications already exceed A PLATFORM PLAY tem functions in the Zynq SoC and MPSoC any requirements for radiation tolerance. With the increased regulatory burden families saves space, lowers BOM cost Xilinx also diligently ensures that its de- and mounting time-to-market pres- and lowers power drastically compared vices, its Vivado® Design Suite and IP (its sures, many medical companies today with multichip platforms and speeds med- own and cores from alliance members are employing platform design business ical innovations to market,” said Khan. serving the medical device market) ad- strategies built around the Zynq SoC. For more information on Xilinx in med- here to strict quality and safety standards. “The market is quickly coming to ical applications, visit http://www.xilinx. Among them are the ISO 60601 3rd edi- the realization that it can’t build each com/applications/medical.html.

Fourth Quarter 2015 Xcell Journal 13 XCELLENCE IN VISION SYSTEMS Xilinx FPGAs Advance Autonomous Monitoring of Crowds

by Muhammad Bilal MSc Candidate Center for Advanced Studies in Engineering Islamabad, Pakistan [email protected]

Dr. Shoab. A. Khan Professor Center for Advanced Studies in Engineering (www.case.edu.pk) Islamabad, Pakistan [email protected]

14 Xcell Journal Fourth Quarter 2015 X CELLENCE IN VISION SYSTEMS

rowd surveillance and monitoring have become A Spartan-6-based real-time an area of great importance in current times. Govern- motion-classification system ments and security depart- ments have started looking opens new possibilities Cfor more advanced ways to intelligently monitor human crowds across public plac- for autonomous monitoring es to detect any unusual activity before it is too late to react. However, there are and surveillance of crowds. still some barriers to cross before achieving this goal effectively. For example, if it is desired to monitor all possible crowd activities across a whole city at once, 24 hours a day, then it is certainly not possible through all-manual monitoring only, especially when there are thousands of CCTV cameras installed. The solution to this problem lies in developing new, intelligent cameras or vision systems that could autonomously monitor the activities of human crowds through advanced video analytics techniques, and therefore could immediately report any unusual event to a central control station. The design of such an intelligent camera/vision system would require not only the typical imaging sensors and optics, but also a high-performance video processor to perform video analytics. The reason for having such a powerful video processor onboard is the high processing requirements of sophisticated video analytics techniques, most of which use computationally intensive video-processing algorithms. FPGAs are ideally suited for such performance-hungry applications. And thanks to the UltraFast™ design methodology enabled by high-level synthesis (HLS) in Xilinx®’s Vivado® Design Suite, it’s now possible to create an optimal and high-performance design for FPGAs with great ease. Further, the fusion of an embedded processor such as the Xilinx MicroBlaze™ inside the FPGA’s reconfigurable logic means that applications with a complex control flow can now be easily ported to FPGAs. Keeping this theme in view, we designed a prototype for a human crowd motion-classification and monitoring

Fourth Quarter 2015 Xcell Journal 15 XCELLENCE IN VISION SYSTEMS

Using the sum of weighted absolute differences (SWAD), we computed more than 900 motion vectors across the image.

system using Vivado HLS, Xilinx’s (or placing) interest points in the hu- the other patch’s processing, making Embedded Development Kit (EDK) man crowd scenes and tracking them this approach ideal for parallel imple- and software-based EDA tools from over time to collect motion statistics. mentation on FPGAs. the ISE® Design Suite. Our design These motion statistics are then pro- After computing motion vectors methodology was based on what we jected to some precomputed motion across the entire image, the algorithm think of as a software-controlled, models to predict any unusual activity computes their statistical properties. hardware-accelerated architecture. [1]. Further modifications include the These properties include average mo- We targeted our design for the low- clustering of interest points and track- tion vector length, number of motion cost Xilinx Spartan®-6 LX45 FPGA. ing of these clusters instead of individ- vectors, dominant direction of motion The overall system design, which we ual interest points [2]. and similar metrics. completed in a short span of time, has Our algorithm for crowd-motion We also compute a 360-degree his- demonstrated promising results in classification is based on the same con- togram of the direction of motion vec- terms of real-time performance, low cept, except we preferred using a tem- tors and further analyze its properties, cost and a great flexibility of design. plate-matching scheme for performing such as standard deviation, mean and motion estimation instead of the for- coefficient of variation. These statisti- SYSTEM DESIGN mer approaches like the Kanade-Lucas- cal properties are then projected to a We accomplished the overall system de- Tomasi (KLT) feature tracker. This tem- precomputed motion model to classify sign in two phases. In the first phase, we plate-matching scheme proved much the current motion in one of several cat- developed a human crowd motion-clas- better for motion estimation in cases of egories. We account for these statistical sification algorithm. After the verifi- low or varying contrast at the cost of a properties across multiple frames to as- cation of this algorithm, our next step few more computations. certain the classification results. was to implement it in an FPGA. In this In order to perform motion estima- The precomputed motion model second phase of development, our main tion using this scheme, we divided our is built in the form of a weighted de- focus was on the architectural design video frame into a grid of smaller rect- cision tree classifier that takes into aspects of an FPGA-based real-time vid- angular patches, and performed motion account these statistical properties eo-processing application. Tasks includ- computation on the current and previ- to classify the motion under obser- ed developing a real-time video pipeline, ous images in each patch using a meth- vation. For example, if the motion developing hardware accelerators and, od based on the sum of weighted abso- is observed to be high and there is a finally, integrating them and implement- lute differences (SWAD). Each patch sudden change in momentum in the ing algorithmic control and data flow to in turn contributed one motion vector scene while the direction of motion complete the system design. that depicts the extent and direction of is random or out of the image plane, Let’s take a walk through each of motion in between two frames at that then it will be classified as a possible these development stages, starting particular position. The result is compu- panic condition. The algorithm de- with a brief description of the algo- tation of more than 900 motion vectors velopment was done using Microsoft rithm design followed by a detailed across the whole image. The steps in- Visual C++ with the OpenCV library. look at its subsequent implementation volved in computation of these motion A complete demonstration of this al- on the FPGA platform. vectors are shown in Figure 1. gorithm can be found at the Web links We also utilized a weighted Gauss- provided at the end of this article. ALGORITHM DESIGN ian kernel to achieve robustness Various algorithms have been pro- against occlusion and zero contrast ar- FPGA IMPLEMENTATION posed in the literature regarding crowd eas in the image. Furthermore, the pro- The next phase in our system design surveillance and monitoring. Most of cessing of one patch for computation was the FPGA implementation of our these algorithms start with detecting of a motion vector is independent of algorithm. Such an implementation

16 Xcell Journal Fourth Quarter 2015 XCELLENCE IN VISION SYSTEMS

comes with its own design challenges, Keeping these design aspects and oth- care of necessary video input/output such as the fact that video input/output er architectural considerations in view, and frame buffering. Then, in the next and frame buffering are now a part of we divided our overall FPGA-based im- part, we developed algorithm-specific FPGA-based design. Also, limited re- plementation in three parts. In the first hardware accelerators. Finally, in the sources and available performance may part, we developed a generic real-time third phase of design, we integrated require necessary design optimizations. video pipeline into the FPGA to take them together and implemented algorithmic control and data flow. This completed our overall FPGA-based system design. It’s worth taking a closer look at each stage of the process.

REAL-TIME VIDEO PIPELINE A real-time video pipeline is the most important building block in developing any video-processing application for FPGA platforms. Such a pipeline hides the complex memory management related to video input/output and frame buffering from the user, and provides a simple interface for accessing video frame data for processing. Although there are several advanced and commercially licensed video pipe- lines [3] available in this regard, we opt- ed to build a customized video pipeline for this purpose. We built this pipeline around the Xilinx EDK, with custom video capture/display ports for handling video input/output data. The pipeline is easily configurable for other Xilinx FPGA families as well. The video capture port decodes the incoming video stream data from the video ADC and buffers it locally. It is then forwarded to the main memory for constructing a video frame. Similarly, the video display port encodes the video frame data present in a local buffer and forwards it to the video DAC for display purposes. The video input and output ports are connected to the main peripheral bus of a MicroBlaze host processor, which handles this video data traffic to and from the main memory. The video ports are capable of gener- ating interrupts to inform the MicroBlaze processor that new data is available at the video input port or required by the video output port. The video ports carry out a “ping-pong” buffer-management scheme such that if the MicroBlaze is Figure 1 – Steps in computing motion vectors, starting with image capture (top) unable to immediately service a video

Fourth Quarter 2015 Xcell Journal 17 XCELLENCE IN VISION SYSTEMS

port, then buffer overflow or underrun frames to support double or triple complete the construction of the vid- doesn’t happen. Figure 2 shows the in- frame-buffering schemes. A user ap- eo-processing pipeline in our design. terconnection between the video ports plication running in MicroBlaze can Figure 4 shows an actual video frame and the MicroBlaze processor. easily acquire a video capture frame captured, processed and then dis- The video ports are designed to de- or can submit a video display frame played using this video pipeline on our tect and produce a video line number, using Video Frame Queue API func- FPGA. It also shows a picture-in-pic- field ID (in case of interlaced video) tions. Figure 3 shows these functions ture feature with a zoomed-out view and other control information in the in their respective levels of hierarchy. of computed motion vectors. video input/output stream. This infor- There are several benefits of using mation is passed to the MicroBlaze a MicroBlaze as a host processor to in- VIVADO HLS-BASED processor through the video ports’ interconnect various building blocks in HARDWARE ACCELERATOR terrupt service routines (ISRs) when the system. For example, we can eas- In the crowd motion-classification al- enough video data is buffered by the ily interface with a wide range of ex- gorithm we have described, the most video input port or required by the vid- ternal memories (SRAM, SDRAM, etc.) time-consuming and computationally eo display port. These service routines with MicroBlaze for loading or storing intensive task was to compute motion in turn perform the video data transfer the video frame data from video ports. vectors. The other system task—that between video port local memories Similarly, we can use DMA controllers is, doing classification—did not in- and main memory via DMA. in the EDK for transporting video data volve pixel-level processing and was In addition to video port ISRs, a between video ports and main memory. pretty simple and easy to accomplish. set of higher-level video frame queue- We also interfaced custom hardware ac- Keeping this aspect of the design in management functions that we named celerators in the same fashion with the mind, we built a hardware accelera- the Video Frame Queue API operates MicroBlaze processor. tor for computing motion vectors. We in between these ISRs and a user-lev- These Video Frame Queue API designed, tested and later synthesized el application. This API maintains a functions together with video port the accelerator in RTL using Xilinx queue of multiple capture and display ISRs and video input/output ports Vivado HLS in the C/C++ language.

MicroBlaze Processor Video Capture Port Local Memory

Capture Port Video Capture BT-656 Stream Interface CM Camera DMA Controller MicroBlaze Peripheral Bus (PLB or AXI) Decoder Controller Video ADC

Interrupt Controller Video Display Port Local Memory

Display Port Video Display BT-656 Stream Interface CM Display DMA Controller Encoder Controller Video DAC

Processor Core

ilmb dlmb

32K DPR

A B FPGA Note: CM Donate ‘Control’ and ‘Memory’ interfaces respectively.

Figure 2 – Video ports and their interconnection

18 Xcell Journal Fourth Quarter 2015 XCELLENCE IN VISION SYSTEMS

One of the key features of Vivado-gen- array) into memory interfaces and auto- ware accelerator. These handshake erated RTL code is that it is optimal to a generates required addresses by analyzing signals include “start,” “busy,” “idle” great extent. Vivado HLS synthesizes array the code. It also analyzes precomputable and “done” flags. These flags are rout- accesses (such as pixel data stored in an offsets and constants to perform so-called ed to the MicroBlaze processor through “strided” memory accesses very fast. The GPIOs to perform handshaking. Figure strided memory accesses originate from 5 shows the interconnection between accessing data from multiple rows of an this hardware accelerator, the eight User Application image (such as in 2D convolution). Block RAMs and the MicroBlaze pro- (Algorithm Control & Data Flow) The key considerations in designing cessor’s main peripheral bus. a Vivado-based accelerator were par- The Block RAMs—named SA1, TA1 allelizing the computation of motion to SA4, TA4 in Figure 5—are each 16 Video Frame Queue API vectors and maximizing the data read kbytes in size. Each pair of SA1, TA1 Grab_Capture_Frame() from the main memory. For this pur- to SA4, TA4 holds enough data for the Submit_Display_Frame() pose we used eight Block RAMs to load computation of one complete row of Push_Frame_Queue() and store video frame data in parallel. motion vectors. Thus, upon comple- Pop_Frame_Queue() Frame_Queue_Rest() The hardware accelerator core is capa- tion of its run, the hardware acceler- ble of computing four motion vectors in ator outputs four rows of motion vec- parallel and for this purpose it utilizes tors written back in the same Block Video Ports Low-Level ISRs all eight of these Block RAMs. The data RAM memories. These computed mo- Video_Capture_ISR() pump to these Block RAMs from the tion vectors are then read back by the Video_Display_ISR() main memory is controlled by the Mi- MicroBlaze processor, which copies croBlaze through DMA. the result in its main memory as a grid The hardware accelerator generated of motion vectors. (Figure 4 shows an Figure 3 – Video port ISRs and Video Frame by Vivado HLS comes with some auto- actual frame overlaid with a grid of Queue API functions generated handshake signals that are motion vectors computed through this necessary to start and stop the hard- hardware accelerator.) The hardware accelerator operates at 200 MHz and all the processing needed to compute motion vectors across the whole image completes in less than 10 milliseconds, including all data transfers to and from memory.

ALGORITHM CONTROL AND DATA FLOW With the video pipeline and hardware accelerator in place, the last step in completing the system was to integrate these two elements with the Mi- croBlaze host processor and to implement algorithm control and data flow at a user-level application in C/C++ using Xilinx’s Software Development Kit (SDK). Implementing algorithm control and data flow in the Xilinx SDK brings a great deal of flexibility in design. That’s because you can design and integrate new hardware accelerators in the same fashion, and modify necessary control and data flow to incorporate new hardware accelerators. The result is a kind of software-con- Figure 4 – An actual FPGA-processed frame with motion vector grid overlaid at bottom right trolled, hardware-accelerated design

Fourth Quarter 2015 Xcell Journal 19 XCELLENCE IN VISION SYSTEMS

MicroBlaze Processor 4-Core SAD Computation Hardware Acceleration

Hardware Search Area 16K DPR Accelerator DMA BRAM Controller Controller M

16K DPR

SA1 M Template Area M SWAD BRAM

1 Engine-1 Controller

TA A B M

SA-X, TA-X A B Chip-Select GPIO 16K DPR MicroBlaze PLB or AXI Peripheral Bus Controller Interrupt Controller 16K DPR

SA2 M SWAD

2 Engine-2

TA A B

A B

16K DPR

SA3 M SWAD

3 Engine-3

TA A B M Processor Core A B

16K DPR

ilmb dlmb 16K DPR M SA4 32K DPR SWAD

4 Engine-4

TA A B M

A B A B Hardware Accelerator Control Handshake GPIO Handshake Controller

Note: M Denotes the memory port. FPGA

Figure 5 – Vivado HLS-based hardware accelerator and its interconnections that is as flexible as an all-software imple- ware. The reason for doing so is that implementation in comparison with an mentation and as high in performance as these steps do not involve any pix- earlier desktop PC-based implemen- an all-hardware implementation. el-level processing and add very little tation for accuracy of results. The two The control and data flow of our processing overhead. When the classi- results were found to be identical. We crowd motion-classification algorithm fication results are computed, their reused various test videos from the Uni- starts by capturing a video frame sults and motion vectors are displayed versity of Minnesota database (http:// through the Video Frame Queue API on the processed frame using on-screen mha.cs.umn.edu/proj_recognition.sht- functions. When a frame is acquired, display functions. These OSD functions ml) and from www.gettyimages.com the user application transfers the cur- are also implemented in C/C++ in the for testing our system. rent and previous video frame data to Xilinx SDK. the hardware accelerator and gets the With all these building blocks (re- IMPLEMENTATION RESULTS motion vectors computed. al-time video pipeline, hardware accel- The overall design used only 30 percent At this point, the system computes erator and algorithm control/data flow) of slice LUTs, 60 percent of BRAM and the motion vectors’ statistical proper- in place, our overall system design was 12 percent of DSP48E multiplier re- ties and classification results in soft- completed. We tested our FPGA-based sources on our Spartan-6-LX45 FPGA.

20 Xcell Journal Fourth Quarter 2015 XCELLENCE IN VISION SYSTEMS

Figure 6 shows the hardware setup (top) a detailed demonstration of this system http://www.dailymotion.com/vid- and actual system output. The hardware on the following Web links: eo/x23icxj_real-time-motion-vec- setup consists of a Digilent Atlys Spar- tors-computation-on-fpga_news http://www.dailymotion.com/video/ tan-6 FPGA board and a custom video x2av1wo_fpga-based-real-time-hu- http://www.dailymotion.com/video/ interface card to provide video input/ man-crowd-motion-classification-de- x28sq1c_crowd-motion-classifica- output functionality to the FPGA using mo_school tion-using-motion-vectors-statisti- a video ADC and DAC. You can watch cal-features_school

GREAT FUTURE POTENTIAL FPGAs are the ideal platform for performance-hungry applications like real-time video processing. Develop- ing such an application requires a few architectural considerations to fully leverage the performance of the chosen FPGAs. Furthermore, utilizing advanced tools like the EDK and Vivado HLS, it’s possible to achieve an overall system design with much more efficiency and in a much reduced development time than in the past. Thus, there is great potential in implementing performance-critical applications over FPGAs using these tools, as we have demonstrated in this project. With a working platform in place, we hope to extend this work to address more technical problems, such as automatic traffic monitoring, automatic patient observation in hospitals and many more applications.

REFERENCES 1. Ramin Mehran, Mubarak Shah, “Abnor- mal Crowd Behavior Detection Using Social Force Model,” IEEE International Conference on Computer Vision and Pat- tern Recognition (CVPR), Miami, 2009

2. Duan-Yu Chen, Po-Chung Huang, “Mo- tion-based unusual-event detection in human crowds,” Journal of Visual Com- putation and Image Representation, Vol. 22 Issue 2 (2011), pages 178–186

3. OmniTek OSVP: http://omnitek.tv/sites/ default/files/OSVP.pdf

ACKNOWLEDGEMENTS The author wishes to thank his co-author, Professor Shoab A. Khan, for his great ideas, and to acknowledge Dr. Darshika G. Parera and Dr. Umair Ahsun for outstand- Figure 6 – Hardware setup (top) and an actual FPGA-processed frame classifying a scene as one of panic ing inspiration.

Fourth Quarter 2015 Xcell Journal 21 XCELLENCE IN TEST & MEASUREMENT

Test New Memory Technology Chips Using the Zynq SoC

A platform based on Xilinx’s ZC706 Evaluation Kit proved fast and flexible enough for MRAM testing at Qualcomm.

22 Xcell Journal Fourth Quarter 2015 XCELLENCE IN TEST & MEASUREMENT

he electronics industry is heavily invested in the develop- by Botao Lee ment of new memory technol- Senior Staff Engineer ogies such as PRAM, MRAM Qualcomm Technologies, Inc. and RRAM. The performance [email protected] of new memory technology test chips is improving rapidly, Baodong Liu but work still needs to be done Staff Engineer before these devices can go full-scale to compete Qualcomm Technologies, Inc. Twith or replace conventional memories. [email protected] Generally speaking, when a test chip for a new Wah Hsu memory technology becomes available, basic tests Senior Staff Engineer have already been carried out to check for man- Qualcomm Technologies, Inc. ufacture-related problems such as stuck-at faults, [email protected] transition faults and address-decoding faults. But another type of testing is necessary as well in the Bill Lu form of performance-related tests that will dis- Staff Engineer close how fast the chip can be reliably accessed, as Qualcomm Technologies, Inc. well as how much the chip access speed impacts [email protected] the performance of the whole computing system. To successfully carry out the planned performance tests, the test environment must be able to generate configurable digital waveforms to access the chip. It must also be able to construct an entire computing environment to measure the impact of chip access speed. There are many ways to create or purchase a test environment to satisfy these needs. But our team at Qualcomm decided to make our own environment based on Xilinx®’s Zynq®- 7000 All Programmable SoC ZC706 Evaluation Kit.

INS AND OUTS OF MEMORIES Conventional memory technologies like DRAM, SRAM and flash store ones and zeros using an electrical charge in each memory cell. DRAM is widely used in PCs and mobile computing devices to run programs and to store temporary data. SRAM is commonly used as cache memory and register files in microprocessors. It is also frequently found in embedded systems when power consumption is a big concern. Unlike DRAM or SRAM, flash memory offers persistent storage after power is re- moved from the system. Flash memory runs more slowly than the others, and might wear out with excessively high numbers of programming cycles. In comparison to conventional charge-based memory technologies, new memory technologies are based on other physical properties of their storage elements. As an example, a memory element of magnetoresistive RAM (MRAM) is formed from two ferromagnetic plates separated by a thin layer of in- sulator. Each plate can hold a magnetization. One of them is permanent, the other can be changed by

Fourth Quarter 2015 Xcell Journal 23 XCELLENCE IN TEST & MEASUREMENT

Software runs on the Zynq SoC’s ARM A9 processors, while the memory controller core is created using the programmable logic.

an external field to store data. The stored • The programmable logic (PL) portion HARDWARE AND SYSTEM data is read by measuring the electrical of the Zynq SoC provides the ability ARCHITECTURE resistance of the element. MRAM is sim- to construct parameterizable memo- The hardware architecture of our chip ilar in speed to SRAM and similar in den- ry controller cores. This is essential test environment is illustrated in Fig- sity to DRAM. Compared with flash mem- to meet the requirement that the test ure 1. Software runs on the Zynq SoC’s ory, MRAM runs much faster and suffers chip access speed can be varied. ARM A9 processors, while the memo- no degradation from programming. ry controller core is created using the • The Zynq SoC’s processing system programmable logic. We established a (PS), which consists of two ARM® REQUIREMENT ANALYSIS DMA channel between the PS and the A9 cores, provides the ability to When devising a scheme for evaluating controller core to move large blocks of modify test chip access speed our MRAM test chip, we settled on a data between them easily. The memory through software. Zynq SoC approach because of the fol- test chip resides on the FMC daughter- lowing considerations: • The PS also makes it possible to card, and it talks with the memory con- construct a complete computing troller core through the FMC interface. • The FPGA Mezzanine Card (FMC) system. This is essential to meet The system architecture is illustrat- interface on the ZC706 board provides the requirement that the test sys- ed in Figure 2. The three layers on the high-speed signaling capability to and tem measure the impact of chip bottom are hardware layers and the from the memory test chip through an access speed on a full computing three layers on the top are software FMC daughtercard. environment. layers. We selected Linux as the oper-

Xilinx ZC706 Evaluation Board

XC7Z045 SoC

ZYNQ PS

DMA

ZYNQ PL FMC Daughtercard

Memory FMC Controller Test Chip

Figure 1 – Hardware architecture of the test environment

24 Xcell Journal Fourth Quarter 2015 XCELLENCE IN TEST & MEASUREMENT

Application Test Configuration All Programmable Performance Profiling FPGA and SoC modules OS Linaro Linux 4 form x factor Device Driver Memory Controller 5 Device Driver

Core 2x ARM A9 Memory Controller

Device XC7Z045 SoC Memory Test Chip

Board ZC706 Evaluation Board FMC Daughtercard rugged for harsh environments extended device life cycle Figure 2 – Test environment’s system architecture Available SoMs: ating system because it is open source, to understand both hardware and soft- so the source code can be tweaked if ware reasonably well. needed. Although no tweaking was Another thing we liked was the done in the current stage of develop- Vivado® Design Suite tool chain. The ment, it might be necessary to take ad- Vivado environment intuitively shows Platform Features vantage of some unique properties of the design blocks, automatically as- • 4×5 cm compatible footprint new memory chips down the road. signs register addresses and checks The software we wrote at the appli- for errors before exporting hardware • up to 8 Gbit DDR3 SDRAM cation layer fell into two categories. information to the software devel- • 256 Mbit SPI Flash One category was for configuring the opment process. The Vivado Design • Gigabit Ethernet memory controller core, and the oth- Suite also provides in-system sig- • USB option er one involved profiling the perfor- nal-level debugging ability, which is a mance of the memory chip and the must-have to pinpoint the root cause performance of the whole system. of any RTL issue. The final thing we want to mention EASY MIGRATION OF here is the Linux OS. Our software at HARDWARE AND SOFTWARE the application level is heavily GUI Design Services With help from the local Xilinx FAE, based. The popularity of the Linux OS • Module customization we brought up the test environment allowed us to leverage our previous ex- • Carrier board customization within a month. Most of our effort was perience on Linux GUI development so • Custom project development spent on designing and implementing that we could get the test programs up the interface between software and and running quickly. hardware layers. This is actually one of the reasons that we like the Zynq QUICK AND COST-EFFECTIVE SoC: It contains both microprocessors Using the Zynq-7000 All Programma- and programmable logic in one device, ble SoC ZC706 Evaluation Kit, our which makes migrating functions be- team quickly constructed a complete tween hardware and software fairly computing environment for testing easy. In our design, we fine-tuned the new memory technology chips at software/hardware partition a couple minimal cost. We expect to one day difference by design of times and eventually settled on the use the same design methodology to one we liked. To comfortably work on build similar systems for other pur- www.trenz-electronic.de a Zynq SoC-based system, one needs poses as well.

Fourth Quarter 2015 Xcell Journal 25 XCELLENCE IN TEST & MEASUREMENT Unleashing High-Performance USB Devices with Artix-7 FPGA

The low-power Xilinx FPGA family puts bus-powered USB device designs within easy reach.

26 Xcell Journal Fourth Quarter 2015 XCELLENCE IN TEST & MEASUREMENT

by Tom Myers Senior Hardware Engineer Anritsu Company [email protected]

With several billion ports in the marketplace, Universal WSerial Bus (USB) is the go-to interface for subgigabit connections between hosts and peripheral devices. However, due to the USB specification’s stringent inrush and steady-state operating-current limitations, FPGAs were often overlooked for bus-powered device applications in favor of lower-performance, less flexible microcontroller solutions. With the arrival of the Artix®-7, the latest addition to the Xilinx® low-power device portfolio, this is no longer the case. Paying careful attention to system-level power conversion efficiency and sequencing, and using the power estimation and optimization tools available in the Vivado® Design Suite, designers can overcome these challenging limits to provide a tightly integrated, tailored bus-powered device with high performance. Let’s take a look at how to build a USB 2.0 high- speed bus-powered device around an Artix-7 Micro- Blaze™-based platform. We successfully used this approach at Anritsu in the development of a recent microwave power measurement product. The new design’s USB 2.0 High-Speed interface provides a sub- stantial increase in measurement throughput compared with the USB Full-Speed microcontroller-based solution used in the previous-generation product. In- creased measurement throughput reduces test time in a manufacturing production test application. The result is a cost savings for our customers.

SYSTEM DESIGN In our project at Anritsu, we knew the major obsta- cle we had to overcome was going to be the 500-milli- amp (5-volt nominal) steady-state current draw limit. Therefore, our approach to system design centered

Fourth Quarter 2015 Xcell Journal 27 XCELLENCE IN TEST & MEASUREMENT

Lowering the device’s die temperature reduces leakage power consumption. Strategies include minimizing device die size and selecting the largest package possible. around the power budget. We tabulat- estimated data flow profile. We also the various peripherals using Vivado’s ed typical and maximum current draws evaluated various programmable de- IP Integrator tool. We quickly obtained from datasheet numbers on a power vices and other solutions with tools a synthesizable target and refined the budget spreadsheet. such as Xilinx Power Estimator with power consumption numbers with A large part of the power budget assumptions as to the functionality, Vivado power reports. was driven by the minimum off-chip clock speed and toggling rates. Since MIG doesn’t provide a native memory requirement of 200 Mbytes. We identified a few candidate de- AXI connection to LPDDR2 devices, The best fit for this requirement was vices and refined the power, size and we developed this link later in-house. found to be a standard 4-Gbit LPDDR2 I/O estimates by building out a subset Until our shim was available, we used device. We generated current-draw of the complete system with a Micro- the MIG-generated LPDDR2 example estimates for this device with the de- Blaze, the memory controller (using design for these preliminary power es- tailed methodology provided by ven- Memory Interface Generator, or MIG) timate and sizing builds. Figure 1 shows dor application notes, applying the and adding in the interface blocks for the resulting system architecture. 5V to 1,8V Reg

Artix-7 USB Bus-Powered Device Architecture

LPDDR2 4Gb 128Mx32

X32@ 200 MHz

LPDDR2 Memory Onterface

AXI shim

PLL, Core Logic 100 MHz osc at 100 MHz AXI4 ULPI USB ULPI USB USB Interconnect DMA Transceiver connector

System USB Clock Reset Controller 32 bit data Core path @ AXI4 Lite 100 MHz Interconnect MicroBlaze 5V to 1.8V 5V to 1.2V Interrupt AXI4 Lite Reg Reg controller Peripherals

5V to 1.0V Reg

XADC Cache Temp/ Voltage Monitor

FPGA SPI Xilinx JTAG Controller XC7A100TFGG484-2L config/debug

SPI FLASH Config/Boot JTAG 256Mb/1.8V connector

Figure 1 – System architecture for our Artix-7-based design

28 Xcell Journal Fourth Quarter 2015 XCELLENCE IN TEST & MEASUREMENT

Configuration Step 1: FPGA configuration starts here – read Flash Memory critical switch word Addr 0 Critical Component 1: Switch Word QuickBoot Header Configuration Step 3A: Warm Boot If critical switch word is Jump Sequence OFF, then ignore warm boot sequence.

Configuration Step 2A: Component 2: If critical switch word is ON, Configuration Step 3B: then execute warm boot Golden Synchronize and load and jump to update bitstream. Bitstream golden bitstream.

Configuration Step 2B: Component 3: Synchronize and load Update update bitstream. Bitstream

Figure 2 – QuickBoot ﬂash memory components and conﬁguration method

As described in the “Vivado De- EXPECT THE UNEXPECTED the update process, an updated “work- sign Suite User Guide: Xilinx Power Although various mechanisms are ing” image is loaded in memory after Analysis and Optimization” (UG907), provided to gracefully shut down and the golden image. If the update process lowering the device’s die temperature remove USB devices, in reality many fails, or the working image is somehow reduces leakage power consumption. users will abruptly unplug the device corrupted, the FPGA automatically Strategies we used included minimiz- without warning. This can be a prob- detects the error and falls back to the ing the device die size and selecting the lem if the firmware update process is golden image. The XAPP1081 Quick- largest physical device package possi- not sufficiently robust, leading to non- Boot method extends this procedure ble, given the application’s tight board responsive, “bricked” devices, unhap- with improved configuration time and real estate constraints. py customers and expensive device golden-image update features. We minimized conversion losses and returns for firmware recovery. Anrit- Based on the success of this project, regulator circuitry costs by reducing the su competes on the basis of reliabil- we are looking ahead to how the next number of voltage rails. After locking ity and speed for high-volume man- generation of devices from Xilinx might down the device power requirements, ufacturing test. Therefore, our main enable additional functionality for An- we designed the voltage-conversion requirements included fast boot time ritsu products. For example, a large circuits to step down from the nominal and fast firmware update time. amount of the power budget is con- USB 5-V bus voltage to these rails. We solved this problem by imple- sumed by the off-chip SDRAM intercon- Up to this point, we have been focus- menting the QuickBoot golden-image nect. We look forward to investigating ing on steady-state current draw. How- firmware update architecture and pro- how we might use the newer 16-nano- ever, one must also keep in mind inrush cesses described in Xilinx application meter UltraScale+ lineup’s UltraRAM to current draw. One way to minimize in- note XAPP1081 and summarized in reduce or eliminate this load and per- rush current is to select regulators with Figure 2. The traditional 7 series fall- haps put the ARM7-enabled Zynq®-7000 soft-start capability and sequence them. back multiboot solution provides a boot All Programmable SoC product line in You must carefully balance the FPGA’s process that maintains a known-good reach for our application. sequencing and ramp time require- “golden” image including bitstream in For further information, contact ments with the USB requirements. the configuration flash memory. During [email protected].

Fourth Quarter 2015 Xcell Journal 29 XCELLENCE IN INDUSTRIAL

FPGA-Based Fuzzy Controller Manages Sugarcane Extraction

by Deepali Vyas Master’s Candidate Mody University of Science and Technology Lakshmangarh, Rajasthan, India [email protected]

Yogesh Misra Research Scholar Mewar University Chittorgarh, Rajasthan, India [email protected]

H. R. Kamath Director Malwa Institute of Technology Indore, Madhya Pradesh, India [email protected]

30 Xcell Journal Fourth Quarter 2015 X CELLENCE IN INDUSTRIAL

A three-input fuzzy controller implemented in a Xilinx Virtex-6 FPGA maintains the cane level during the sugar-manufacturing process.

ugar is an important food ingre- is sent to the boiler, where it is burnt as dient that’s widely used in day-to- fuel, while the extracted juice is sent for day life. More than half of the total clarification and then to the pan section, Sglobal supply of raw sugar is pro- where sugar is made from the juice. duced from sugarcane. India is the second The supply of cane for processing is largest sugar manufacturer in the world very uneven, and this uneven supply of after Brazil, with 60 million cane farmers cane during juice extraction adversely af- and their dependents involved in cane cul- fects the efficiency of the sugar mill and tivation, making for a $12 billion business. can cause mill breakdown, stoppage and Because the extraction of cane juice jamming of the equipment. For optimum is a nonlinear process, our team looked juice extraction, it’s necessary to main- to fuzzy logic for a way to improve the tain the cane level in the Donnelly chute flow. Analysis by researchers at the at a desired height. Mody Institute of Science and Technolo- We expected that fuzzy logic would gy (MITS) reveals that the performance nullify the uneven supply of cane and of our fuzzy-based controller, designed maintain the cane level at the desired and implemented in a Xilinx® FPGA, has height by varying the speed of the rake proven better than that of a conventional carrier better than conventional control- controller. With the specific parameters lers. This is why we made an attempt to for crushing 2,500 tons of cane per day, introduce the concept of fuzzy logic to a flow rate of 26.6 kg/second is required. the sugar world. Before looking into the details of how Our first step in 2014 was to design a we implemented our three-input fuzzy two-input fuzzy controller [1] that pre- controller, it’s helpful to understand the cisely monitored the variation in two basics of sugar manufacturing. parameters: weight of cane on the rake carrier and height of cane in the Don- HOW CANE IS EXTRACTED nelly chute.The controller aims at main- A schematic of the cane juice extraction taining a constant level in the chute so process is shown in Figure 1. The cane as to maintain the needed flow rate of billets that the sugar mill receives from 26.6 kg/s. When we compared the re- cane growers are weighed and dumped sults with those of the conventional con- in the cane yard. From there, a crane lifts troller, it was clear that the two-input them to the cane carrier. The cane carrier fuzzy controller performed much bet- moves continuously and is responsible ter. Since the cane is crushed between for bringing the cane to the factory floor the rolls, we decided as an experiment for sugar production. to introduce a third parameter—roll The cane first passes through two sets speed—on the same algorithm. The ad- of rotating knives. Cane knives cut the dition of this third parameter revealed cane into pieces, while shredder knives that roll speed is an equally important prepare the fiber. A rake carrier feeds variable as the other two. these small fibers, which measure rough- So, later in 2014 we introduced ly 1 to 2 cm, to a Donnelly chute. Cane roll speed as our third parameter [2]. juice is extracted by crushing the fibers We redesigned the algorithm with in two or three rolls of the mill. This pro- this additional parameter and imple- cess is repeated through sets of five or six mented it using MATLAB®. When the mills. Residual cane known as bagasse software implementation of our new

Fourth Quarter 2015 Xcell Journal 31 XCELLENCE IN INDUSTRIAL

three-input controller was completed chute, we added height sensors to the the Sugeno method of design is much [3], the next step was to implement the design. A tachogenerator sensor mea- simpler. Therefore, we adopted the Suge- algorithm and develop an entire fuzzy sures the rotational speed of rolls. no method of implementation. system using a Xilinx FPGA. FPGAs The output of the load cell, height sen- The first and foremost step, fuzzifica- are reprogrammable silicon chips that sor and tachogenerator is in microvolts. tion, includes conversion of crisp values provide real-time hardware imple- In order to use these metrics in the next to fuzzy values, which are then represent- mentation of electronic circuits. They steps of the process, we had to amplify ed by membership functions. Crisp val- are highly reliable, cost-effective and the values to a measurable level, name- ues are those that belong to a particular provide a medium to check the per- ly, from microvolts to millivolts. We ac- set; fuzzy values lie in a particular range formance of a circuit before fabrica- complished this amplification using a and are not confined to a particular set. tion. Thus, the Xilinx Virtex®-6 FPGA signal-conditioning system on PSpice. The three input variables are weight, turned out to be a perfect solution for Next, we converted the results to a digital height and roll speed. A triangular mem- our hardware implementation. value using an analog-to-digital converter bership function is used to represent (ADC) that we connected in series with these input variables. The universe of dis- HARDWARE DESIGN the conditioning system. In this way the course of input parameter “WEIGHT” is Figure 2 shows the algorithm for a amplified input is fed to the controller. in the range of 500 kg to 1,000 kg, and it is three-input fuzzy controller. The con- fuzzified into 11 triangular linguistic vari- trol philosophy remains the same as in FIVE-STEP PROCESS ables (LV). The universe of discourse of the two-input version, modified accord- The VHDL implementation of a fuzzy input parameter “HEIGHT” is in the range ing to the three inputs and implemented controller using Xilinx hardware is divid- of 0 to 180 cm and it is fuzzified into seven on MATLAB. The control philosophy ed into five steps: fuzzification of inputs, triangular LVs. The universe of discourse is such that it governs weight, height rule evaluation, implication, aggregations of input parameter “ROLLSPEED” is in and the three conditions of roll speed: and defuzzification. the range of 12 cm/s to 16.6 cm/s and it is that is,when the speed is Roll Low (RL; There are two methods of designing a fuzzified into three triangular LVs. 12 cm/s), Roll Medium (RM; 14.3 cm/s fuzzy-logic controller, Mamdani and Su- The fuzzification in terms of VHDL (RM) and Roll High (RR; 16.6 cm/s). geno. The Mamdani method is difficult code is done as follows. We defined each After designing the controller on and very complex. According to the re- membership function by three points and MATLAB, our next step was to design search, the Mamdani method requires the two slopes, as shown in Figure 3. The the hardware required to measure the centroid of a two-dimensional shape by upward slope (Slope 1) and downward input parameters. Load cell measures integrating across a continuously varying slope (Slope 2) can be evaluated using the amount of cane on the rake carri- function. Hence, this method is not com- the following formula: er. To measure the level of cane in the putationally efficient. On the other hand, S1= (y2-y1/Point2-x1) S2= (y2-y1/x2- Point 2) The degree-of-membership (DOM) function (µ) forms the next step of fuzzi- Cane Carrier fication. Our algorithm divides a member- C Rake Carrier ship function into four segments, namely H Cane Knives U Segment-1(µ=0), Segment-2{(Input - point T 1)* slope 1}, Segment-3{(Input-point 2)* E Shredder slope 2} and Segment-4 (µ=0). The value Knives of DOM is calculated as follows: • If input value < Point 1 (Segment 1), then DOM =0. • If input value ≤ Point 2 and ≥ Point 1 (Segment 2), then DOM = (Input Val- ue –Point 1) * Slope 1. • If input value ≤ Point 3 and ≥ Point 1 Juice for (Segment 3), then DOM = FF- (Input Bagasse Clariﬁcation Value –Point 2) * Slope 2. for Boiler • If input value ≥ Point 3 (Segment 4),

Figure 1 – Schematic of the cane juice extraction process then DOM = 0.

32 Xcell Journal Fourth Quarter 2015 XCELLENCE IN INDUSTRIAL

Roll Speed

RL RM RR

NO NO NO

‘HEIGHT’ = EL ‘HEIGHT’ = VL ‘HEIGHT’ = L

YES YES YES

Feed Rate = +42% of Flow Rate Feed Rate = +31% of Flow Rate Feed Rate = +20% of Flow Rate

For ‘WEIGHT’ = SL, UL, EL, VL, l, For ‘WEIGHT’ = SL, UL, EL, VL, l, For ‘WEIGHT’ = SL, UL, EL, VL, l, JR, H, VH, EH, UH and SH JR, H, VH, EH, UH and SH JR, H, VH, EH, UH and SH

NO NO NO NO ‘HEIGHT’ = VH ‘HEIGHT’ = H ‘HEIGHT’ = JR

YES YES YES

Feed Rate = +31% of Flow Rate Feed Rate = +20% of Flow Rate Feed Rate = Flow Rate

For ‘WEIGHT’ = SL, UL, EL, VL, l, For ‘WEIGHT’ = SL, UL, EL, VL, l, For ‘WEIGHT’ = SL, UL, EL, VL, l, JR, H, VH, EH, UH and SH JR, H, VH, EH, UH and SH JR, H, VH, EH, UH and SH

YES Feed Rate = +20% of Flow Rate ‘HEIGHT’ = EH For ‘WEIGHT’ = SL, UL, EL, VL, l, JR, H, VH, EH, UH and SH

Figure 2 – Development algorithm for a three-input fuzzy controller

Point 2 Y2 1

Segment 4 Segment 2 Segment 3 Segment 1

1.5

Slope 1 Slope 2

Y1 0 Point 1 Point 3 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 X1 X2

Figure 3 – Membership function, de ned by three points and two slopes

Fourth Quarter 2015 Xcell Journal 33 XCELLENCE IN INDUSTRIAL

DIFFERENT DEGREES We also found that the consequent of the weighted-average method. In this OF MEMBERSHIP many rules is the same. All such rules method, we multiply the fuzzy output The next step is the designing of rules so having the same consequent are collect- obtained from aggregation with its as to determine the action to be taken in ed and the maximum of these values is corresponding singleton value, then response to the different degree-of-mem- calculated using a maximum function. At divide the sum of these values by the bership functions. A fuzzy rule is formed the next step, we collected all rules hav- sum of all the fuzzy output obtained using a simple If-Then condition where ing the same consequents. Different max- from the rule evaluation—that is, the we have a consequence to every anteced- imum functions are encoded to evaluate values obtained after aggregation. ent. The Fuzzy Logic Toolbox in MATLAB a maximum value for the entire LV. provides different operators to combine After the output for each rule has been VIRTEX-6 IMPLEMENTATION multiple antecedents. We selected the identified, the last step is to combine all the After implementing the above steps, we AND operator to combine three anteced- output into a single value—in other words, successfully designed a three-input fuzzy ents, since it represents a minimum oper- these values are to be converted to crisp controller using a triangular membership ation among multiple antecedents. For a values. This is done using defuzzification. function and centroid defuzzification. three-input controller, a total of 231 rules Defuzzification forms the last and The program code is available with the are generated. We designed a rule matrix an important step in designing a fuzzy authors. We simulated our three-input for these rules. A minimum function finds system. Defuzzified values generate a fuzzy controller using the Fuzzy Toolbox the minimum of three values—that is, single crisp value, which is the speed of MATLAB version 7.11.0.584 (R2020b) the minimum of DOM among three input of the rake motor. The Sugeno meth- and implemented it on a Xilinx Virtex-6 variables is calculated. od of defuzzification that we used is FPGA with Xilinx’s ISE® Design Suite

Parameters Cane Cane Motor Carrier Cane in Feed Data for Cane Cane level weight speed speed carrier rate next level (MATLAB) (cm) (kg) (rpm) (cm/s) (kg/cm) (kg/s) sampling *(VHDL) ** (cm) Time Roll kg cm (sec) speed (cm/s)

0 15.4 90.0 750 47.0 24.6 0.938 23.1 -16.0 -6.4 83.6 85.7 10 15.8 83.6 729 52.0 27.2 0.911 24.8 -5.0 -2.0 81.6 84.3 20 15.0 81.6 792 50.0 26.2 0.990 25.9 +19.0 +7.6 89.2 90.4 30 16.2 89.2 908 42.0 22.0 1.135 25.0 -9.0 -3.6 85.6 86.5 40 16.6 85.6 965 44.0 23.0 1.206 27.7 +11.0 +3.9 89.5 90.5 50 13.4 89.5 720 49.0 25.7 0.900 23.1 +16.0 +6.4 95.9 95.9 60 13.8 95.9 760 39.0 20.4 0.950 19.4 -27.0 -9.6 86.3 86.3 70 13.4 86.3 790 44.0 23.0 0.988 22.7 +12.0 +4.8 91.1 91.3 80 15.4 91.1 820 46.0 24.1 1.025 24.7 0.0 0.0 91.1 93.4 90 16.2 91.1 555 73.0 38.2 0.694 26.5 -4.0 -1.6 89.5 93.4 100 13.0 89.5 609 51.0 26.7 0.761 20.3 -5.0 -2.0 87.5 92.3 110 14.3 87.5 578 62.0 32.5 0.723 23.5 +6.0 +2.4 89.9 90.2 120 14.6 89.9 598 57.0 29.8 0.748 22.3 -11.0 -4.4 85.5 87.0 130 12.3 85.5 700 44.0 23.0 0.875 20.1 +4.0 +1.6 87.1 88.8 140 12.6 87.1 679 48.0 25.1 0.849 21.3 +11.0 +4.4 91.5 91.7 150 15.4 91.5 800 46.0 24.1 1.000 24.1 -6.0 -2.4 89.1 91.3 160 12.0 89.1 845 32.0 16.8 1.056 17.7 -15.0 -6.0 83.1 84.2 170 14.3 83.1 835 45.0 23.6 1.044 24.6 +17.0 +6.1 89.2 90.3 180 14.6 89.2 874 42.0 22.0 1.093 24.0 +6.0 +2.4 91.6 92.1 190 15.0 91.6 900 41.0 21.5 1.125 24.2 +2.0 +0.8 92.4 92.1 200 15.4 92.4 924 40.0 20.9 1.155 24.1 -6.0 -2.4 90.0 91.4

* Cane level of FPGA-implemented system after each sampling ** Cane level of MATLAB-implemented system after each sampling

Table 1 – Cane level at a 90-cm roll speed varies during each sample.

34 Xcell Journal Fourth Quarter 2015 XCELLENCE IN INDUSTRIAL

14.5 using VHDL. The sampling period is case when the weight of cane in the rake found that the number of LUT blocks re- 10 seconds and total simulation duration carrier is 750 kg, the height of the cane quired exceeded the capacity of the tar- is 200 seconds. level in the Donnelly chute is 90 cm and get device. This is why we implemented We have examined in total 756 differ- the roll speed is 16.6 cm/s. Under these the design on a Virtex-6 instead. But due ent conditions of input parameters under conditions, the expected rake motor to a lack of resources, we couldn’t per- six different cases, but here we have fo- speed is 54.2 rpm (MATLAB). The de- form real-time implementation in the lab. cused on the case when the cane level fuzzified simulated results are 36H, or 54 At the next step we are looking for- and cane weight on the carrier at the ini- rpm, which matches with the MATLAB ward to linking up with the government tial stage of simulation are 90 cm and 750 results and verifies the design. of India’s National Sugar Institute to de- kg respectively. The roll speed varies at After simulation, we synthesized the velop the whole system and verify the the time of each sampling. The simulation design to generate the technology sche- results in a real-world environment. We result is given in Table 1. matic and the approximate device utiliza- have already delivered a presentation to The steps of hardware implementation report. We found that our design used the institute and received a positive re- tion included VHDL modeling, simula- more than 78 percent of the Virtex-6’s slice sponse. It is our belief that the concept tion, synthesis and FPGA implementation lookup tables (LUTs), 93 percent of occu- of fuzzy logic stands poised to change in our lab on the MITS campus. We used pied slices, 1 percent of slice registers and the future of the sugar industry. a mixed type of modeling for designing 1 percent of LUT flip-flops. the VHDL model of our three-input fuzzy We then compared the VHDL results REFERENCES controller, which includes behavioral with results of a conventional controller 1. Y. Misra and H.R. Kamath, “Design Algo- and structural modeling. The design is (see Table 2). The comparison proves that rithm of Conventional and Fuzzy Controller simulated on Xilinx’s ISim simulator. The the fuzzy-logic system is more efficient for Maintaining the Cane Level During Sug- waveform that ISim generated verified than a conventional controller. ar Making Process,” International Journal the functionality of the controller. Figure The lab at MITS provides a Spartan®-6 of Intelligent Systems and Applications, 4 shows the simulated waveform for the FPGA for research use. However, we ISSN: 2074-9058,Vol.7 No.1, December 2014

2. Y. Misra and H.R. Kamath, “Implementa- tion and Performance Analysis of a Three Inputs Conventional Controller to Maintain the Cane Level During Cane Crushing in FPGA using VHDL,” International Jour- nal of Engineering Research & Technolo- gy (IJERT), ISSN: 2278-0181, Vol. 3 Issue 9, September 2014

3. Y. Misra and H.R. Kamath, “Analysis and design of a three input fuzzy system for maintaining the cane level during sugar manufacturing,” Journal of Automation and Control (accepted), ISSN: 2372-3041 Figure 4 – Simulated waveform for the case when weight = 750 kg, height = 90 cm and roll rate = 16.6 cm/s

Three-input Three-input Three-input conventional fuzzy fuzzy controller controller controller (Xilinx VHDL) (MATLAB)

Percentage of time cane is in between 45.8 94.7 88.0 85 cm-95 cm (% Time)

Lowest level of cane in chute (cm) 61.7 84.2 81.6 Highest level of cane in chute (cm) 103.5 95.9 95.9

Table 2 – Comparison of results

Fourth Quarter 2015 Xcell Journal 35 XPLANATION: FPGA 101 Zynq MPSoC Gets Xen Hypervisor Support

by Steven H. VanderLeest Chief Operating Officer DornerWorks, Ltd. [email protected]

36 Xcell Journal Fourth Quarter 2015 XPLANANTION: FPGA 101

The Xen open-source hypervisor is a full-featuredT virtualization technology tra- ditionally used in cloud computing and more recently finding its way into embedded systems. DornerWorks is providing Xen support on the new Zynq® UltraScale+ MPSoC device, bringing multiple benefits to Xilinx users. Not only does the Xen Zynq Xilinx’s latest Zynq hypervisor enable fast software integration and heightened system safety and security, device supercharges it also brings the power of enterprise-style cloud computing to the embedded world. the Xen hypervisor, The rigorous design compartmentaliza- tion that the hypervisor provides makes but support is key to for rapid integration of new software (including entire operating systems) on the choosing this open-source same computing device. At the same time, the isolation reduces or even eliminates virtualization approach. unexpected interference between independent software functionality. Moreover, isolation greatly enhances the level of system safety and security by reducing unexpected interaction between functions and presenting a smaller attack surface exposed to threats, thus making it easier to prove safety or security properties. The arrival of enterprise-style cloud computing in the embedded world offers many of the same advantages, such as deployment of legacy software on new hardware with few (if any) changes to the software. Let’s briefly review what hypervisors are before getting into the specifics of Xen Zynq, the open-source Xen hypervisor on the Zynq MPSoC.

Fourth Quarter 2015 Xcell Journal 37 XPLANANTION: FPGA 101

WHAT IS A HYPERVISOR? space are a more recent development, advanced driver assistance systems A hypervisor is the foundational soft- with adoption occurring as processors (ADAS), instrument cluster, navigation ware layer enabling virtualization. Just appeared with sufficient performance systems, Internet connectivity and, as an operating system (OS) manages and acceptable power consumption. eventually, real-time controls. simultaneously running applications, Use cases for embedded hypervisors When considering virtualization solu- each contained within a process with have a common theme: consolidation of tions, it is important to evaluate the VMM access to machine resources managed multiple complex functions into a single characteristic regarding negligible impact by the OS, so the hypervisor manages computing platform while maintaining on performance. The hypervisor controls simultaneously running operating sys- some separation between them. For aero- all hardware resources (CPU, memory, tems, each contained within a virtual space applications, a hypervisor is often I/O) and thus may impact the perfor- machine with access to machine reused to support integrated modular avi- mance of any of them. For the CPU, one sources managed by the hypervisor. onics, where the software formerly exe- important metric is the time it takes to Virtualization is an idea dating back cuting on federated (independent) avion- switch a core from running one virtual to the 1960s. Popek and Goldberg for- ics hardware is consolidated into a single machine to another. This is sometimes malized the idea of a virtual machine computing platform. The functions might called the context switch time, but may monitor (VMM) in 1974 with three de- include flight control, navigation, flight also be called the partition or domain fining characteristics: management systems, collision avoid- switch time to differentiate it from a sim- ance and more. The FAA requires that ilar concept of operating systems switch- • The VMM provides programs with combined software functions that were ing between processes. An associated a runtime (virtual) environment formerly running on separated hardware metric is the jitter, which is a measure of essentially the same as the original cannot affect one another. This isolation how much this switch time varies, thus (physical) machine. is accomplished via rigorous partitioning affecting determinism and predictability. • The VMM has a negligible impact defined by standards such as DO-248C. Real-time designers are also interested on performance. While the FAA is concerned with the in measuring the minimum time slice that safety of commercial flights when con- can be scheduled, which constrains the • The VMM manages the system solidating functionality, military avionics maximum frequency of the CPU sched- resources. has a parallel need to provide separation ule, or put another way, the maximum A hypervisor is a VMM focusing almost in order to support security. Approaches number of virtual machines that can be exclusively on the basic machine manage- that support multiple classification lev- executed in a given period. When measur- ment tasks. That means some commonly els on one system with strict separation ing impact on memory, the footprint of expected services like file systems, graph- use an architecture called Multiple Inde- the hypervisor kernel consists of a con- ical user interfaces and network protocol pendent Levels of Security (MILS). stant base portion along with an incre- stacks are not implemented at this level, For health care applications, the in- mental portion for each added guest (vir- but rather are delegated to a higher layer, dustry is considering similar consoli- tual machine). The cumulative footprint such as a guest operating system running dation using hypervisors for high-end then constrains the maximum number of in one of the virtualized machines the hy- medical devices such as MRI scanners, virtual machines possible. For I/O, band- pervisor is hosting. robotic (or robotically assisted) sur- width and latency are the key measure- A hypervisor running natively on gery devices and CT scan machines, all ments to make for each device of interest, hardware, as described above, is consid- of which currently incorporate multi- although you could also make estimates ered a Type 1 hypervisor. By contrast, a ple independent processing systems. based on some generic metrics such as Type 2 hypervisor is not the lowest layer The combined functions might include the overall interrupt latency or the raw of software, but is hosted on an operat- graphical user interfaces for physicians, communication bandwidth. ing system. This type is typically used image processing, real-time motor con- Many hypervisors support two ap- to allow one OS to run on another, such trol, patient information databases and proaches to I/O: exclusive and shared. as when a Mac user runs Windows on system management functions. Exclusive I/O typically incurs a some- their MacBook using Parallels or when For automotive applications, hyper- what lower overhead, where the hyper- a Windows user boots up Linux within a visors are an attractive way of combin- visor provides a virtual machine with virtual machine using VirtualBox. ing the dozens of separate microproces- direct and sole access to a particular There are also important differences sors and microcontrollers embedded I/O device, often referred to as a “pass- between enterprise and embedded hy- in a car. Virtually all automotive OEMs through” device. Shared I/O entails a pervisors. Cloud computing and big data are considering the move to hypervisors somewhat higher overhead, because the are typical enterprise use cases for hy- to combine functions such as infotain- hypervisor must impose mechanisms to pervisors. Hypervisors in the embedded ment, driver and passenger controls, enforce a sharing scheme.

38 Xcell Journal Fourth Quarter 2015 XPLANANTION: FPGA 101

ASPECTS OF OPEN SOURCE MAPPING XEN TO THE NEW ZYNQ utilize the virtualized CPU, memory The term “open source” is used to The new Zynq UltraScale+ MPSoC and I/O, while Xen manages how the describe software that is free, as in from Xilinx offers a powerful platform virtualized resources are mapped to speech, but not necessarily free as in for running the Xen hypervisor. This physical resources. beer. Open-source software provides device provides a quad-core ARM® In Xen, each virtual machine is the freedom to modify and share Cortex™-A53 with hardware virtual- called a domain. In order to keep the source code under carefully devel- ization extensions and 64-bit capability hypervisor kernel as small as possible, oped licensing to guarantee that free- in the ARMv8 instruction set. Powerful Xen gives one domain special privileg- dom is preserved. Some of the most hardware requires a rich ecosystem of es. This system domain is called dom0. commonly recognized open-source software in order to take full advan- It starts up other guest domains (each license agreements are the GNU tage of its features and performance. called a domU), configures the sched- General Public License (the active While developing the new Zynq MP- ule and memory mappings that the ker- versions are GPLv2 and GPLv3), the SoC, Xilinx surveyed key customers in nel enforces, and manages I/O access GNU Lesser General Public License, a variety of industries, including aero- permissions. To provide a little more the Apache license and the BSD li- space and defense, health care, tele- detail, let’s consider several views of the cense (in several variations). communications and automotive. The hypervisor environment: the boot se- Open source is not necessarily free message: Most customers expected to quence, ARM exception levels, running of charge. Businesses built around have hypervisor options for the new schedule and resource management. open-source products typically use a device, and half of them desired an Starting from power-on, the boot different kind of revenue model than open-source hypervisor. Xilinx chose sequence on the new Zynq MPSoC can traditional software vendors, instead Xen as the open-source hypervisor and be configured in a variety of ways, in- selling product support, accessories chose DornerWorks to offer the sup- cluding variations on which processor (like printed user manuals), training port service for the new Xen Zynq. (Cortex-A53 or Cortex-R5) starts first. or custom design services. Red Hat The Xen hypervisor hosts guest Most use cases will keep the two pro- is one of the best-known examples, operating systems within virtual ma- cessors quite independent, so the stan- building a billion-dollar business chines, providing them with a virtual- dard Xen Zynq hypervisor distribution around the open-source Linux oper- ized view of the underlying machine. will run only on the Cortex-A53. Figure ating system. The guest OS and its applications then 1 illustrates a typical boot sequence. If

First-Stage Boot Loader U-Boot Xen (FSBL) Kernel

dom0

Guest 1

Time Guest 2

Figure 1 – Typical boot sequence shows stages until the guest OS is running.

Fourth Quarter 2015 Xcell Journal 39 XPLANANTION: FPGA 101

Hypercalls are analogous to system calls that allow an application to invoke an operating system service, but in this case, they invoke hypervisor services. By default, dom0 can make any hypervisor call. the Cortex-R5 were used to host an in- calls that allow an application to invoke least privilege at EL0. When changing to dependent, nonvirtualized secure OS, an operating system service, but in this an exception level with lower privilege, this would typically boot first, from a case, they invoke hypervisor services. By the virtualized machine registers must simple first-stage boot loader (FSBL). default, dom0 can make any hypervisor be the same width or narrower. That Once the R5 had booted, it would then call, while a domU is restricted to certain means you can have a 64-bit hypervisor initiate the A53, starting with its own ones. However, developers can use the and a 32-bit guest, but not vice versa. FSBL. A second-stage loader, such as Xen module XSM-FLASK to implement Xen Zynq uses the AArch64 execution U-Boot, would typically be used to pro- finer-grained control of hypercall access. mode of the ARMv8 architecture, and vide broader booting functionality, per- The processor hardware enforces thus supports 64-bit or 32-bit guests. haps including some integrity checking privileges within categories defined by The privileged domain, dom0, es- of the hypervisor kernel image. the ARM exception-levels model. The tablishes the schedule, thus determin- At this stage, the Xen hypervisor Cortex-A53 uses the ARMv8 architec- ing when domains run and on which kernel is invoked. The kernel initiation ture, which defines four exception lev- core or cores. The hypervisor kernel includes checking for a valid dom0. In els, as illustrated in Figure 2, with the then executes the configured sched- turn, dom0 checks for valid images for highest privileges for the bottom level ule. To achieve certain types of deter- the guest domains and then initiates and in the diagram, and decreasing as one minism, you might configure a sched- schedules them on one or more cores. In moves upward. Complete access priv- ule where a guest domain has sole most cases, dom0 continues to run in or- ileges are granted at exception level access to the machine during its time der to monitor the system, provide man- EL3, which is used for the ARM Trust- slot. Figure 3 provides an example, agement of shared resources and handle Zone monitor. Hypervisors run at EL2 to where Guest 1 runs on several cores certain system faults. The hypervisor provide virtualization of guest domains. (along with dom0) in a single time slot, kernel runs during each domain context Within each hosted virtual machine, the while Guests 2 and 3 do not need this switch and is also invoked through hyper- hosted operating system runs at EL1. restriction, so they can be scheduled calls. Hypercalls are analogous to system Finally, user applications run with the in a more mix-and-match load-balanc- ing scheme during other time slots. The hypervisor manages all resources of the machine. The CPU cores are Trusted Trusted managed primarily by time-sharing, as EL0 App 1 App 2 App 3 App 4 App 5 App 6 discussed above. The hypervisor uses hardware timers to enforce the sched- Cortex-R5 EL1 Guest OS 1 Guest OS 2 Secure World OS ule. Memory is shared not by dividing time, but by dividing space, allocat- ing a portion of the memory to each guest domain. The hypervisor uses the EL2 Hypervisor hardware memory management unit Cortex-A53 (MMU) to enforce the memory layout. Management of I/O varies widely, de- EL3 TrustZone Monitor pending on the type of device. Some I/O devices are mapped directly to the Cortex-A53, while others must be configured to connect through the FPGA Figure 2 – This diagram of ARM Exception Levels shows the hypervisor mapped to EL2. programmable fabric.

40 Xcell Journal Fourth Quarter 2015 XPLANANTION: FPGA 101

Guest access to I/O devices is con- pass data back and forth to dom0. The ded systems that balance demanding figured and managed by dom0, with ap- bottom half of the driver inside dom0 needs (such as high bandwidth, low propriate hypercalls to the Xen kernel then performs the actual I/O operations latency, low power, high reliability) to establish memory mappings to the to the device. and that connect to a wide variety devices. Dom0 can grant a guest domain of system devices in an embedded access to specific I/O devices as needed, CREATING SUPPORT FOR XEN ZYNQ environment. Xilinx selected Dorner- or it may manage shared I/O itself, act- As Xilinx sought customer feedback Works because of our expertise with ing as the gateway to enforce a sharing about the anticipated next-generation the Xen hypervisor, our embedded- mechanism. Interdomain communica- Zynq SoC device, it heard that many engineering design experience and tion in Xen (including I/O) typically uses customers expected strong hypervi- our role as a premier member of the Xen event channels for notifications sor support and half of them wanted Xilinx Alliance Program, providing and shared memory for passing data. an open-source option. This support additional options for customers that Shared I/O device drivers in Xen use a had to be more than a simple help- also seek support for the FPGA design split-driver model, where the top half desk-style service. Rather, the support portions of their systems. in the guest domains provides the API options would need to be more exten- DornerWorks collaborated with Xil- to the guest OS and the functionality to sive, to provide help designing embed- inx on finishing the port of Xen to the

Core 0 dom0 dom0 dom0

OS App OS App OS App OS App OS App OS App Hypervisor Hypervisor Hypervisor

Core 1 Guest 1 Guest 2 Guest 3

OS1 App1 OS1 App2 OS2 App3 OS2 App4 OS3 App5 OS2 App6 Hypervisor Hypervisor Hypervisor

Core 2 Guest 1 Guest 3 Guest 3

OS1 App1 OS1 App2 OS3 App5 OS3 App6 OS3 App5 OS3 App6 Hypervisor Hypervisor Hypervisor

Core 3 Guest 1 Guest 3 Guest 2

OS1 App1 OS1 App2 OS3 App5 OS3 App6 OS2 App3 OS2 App4 Hypervisor Hypervisor Hypervisor

Time

Figure 3 – Multicore scheduling places Guest 1 in an exclusive time slot and mixes Guests 2 and 3.

Fourth Quarter 2015 Xcell Journal 41 XPLANANTION: FPGA 101

new Zynq MPSoC and then confirmed and-test server. On a periodic basis, the Xilinx uncovers, for internally detect- correctness by verification and valida- server queries the repository of source ed issues and for customer-identified tion testing. Our testing needed to cover code. If it detects any changes, the serv- issues (from the community or from not only the Xen hypervisor kernel run- er performs an incremental build on the paid subscribers). In order to sustain ning correctly on the hardware, but also dependent portions of the build image. our work on Xen, we also offer paid the dom0 privileged domain (running It then loads the images necessary for subscription and custom design sup- Linux) and guest domains with a variety each test onto each device of our tar- port services, which provide the busi- of supported guest operating systems. get farm and invokes the test script. In ness-critical support by contract that We named this package of software the some test cases, external stimuli are many customers demand in order to Xen Zynq Distribution. applied to the targets. The test server reduce their business risk and ensure We faced the additional challenge gathers results, collates them and pres- timely response to their needs. You can of testing before we had the actual ents a summary dashboard that pro- find more details about the support op- hardware. Our stand-in model of the vides a view of the overall health of the tions at http:// http://xen.world. hardware was the QEMU open-source test suite or pinpoints where there are machine emulator software running on issues to resolve. TEST-DRIVE XEN YOURSELF an x86 developer system for individual DornerWorks has also developed While you are waiting for the new Zynq debugging and testing or on our team’s the infrastructure to provide compre- MPSoC to ship early next year, you build server for continuous integration hensive support for Xilinx customers can already start investigating Xen. testing. Additionally, we developed using the Xen hypervisor on the new Xen runs on an ordinary x86 PC, ei- against an emulation board named Zynq MPSoC. The base level of support ther natively as a Type-1 hypervisor Remus (not to be confused with the is driven by open-source community or hosted inside VirtualBox on top of Xen live migration tool of the same activism, allowing users to compare Windows for development purposes. name), which uses six Xilinx Virtex®-7 notes and share information. Dorner- To try out embedded Xen, you’ll need FPGAs to emulate the Zynq MPSoC. Works will host forums and gather emulated or actual ARM hardware. Figure 4 shows our continuous-inte- issues from the community. We use Choose an ARM processor that has the gration approach, centered on a build- Jira as our tracking tool for issues that virtualization extensions—ideally the

I/O Stimulus

Repository

Guest Linux Tests Code Under Test Build & OS Target Farm Test Scripts Stimulus Deﬁnition Test Server Build FSBL U-Boot Conﬁg Zynq Build Image QEMU Load Image MPSoC Run Tests Parsed Results

Developer Build and Debug Test Measurements Future REMUS Environment Expansion

Dashboard

Figure 4 – Continuous integration automates build and test of Xen Zynq.

42 Xcell Journal Fourth Quarter 2015 XPLANANTION: FPGA 101

Cortex-A53, but others, such as the DornerWorks has published the Xen Enclustra Cortex-A15, can also provide a fairly Zynq Distribution for the new Zynq MP- Everything FPGA. representative environment. Figure 5 SoC, ready for download from our web- depicts the workflow for building a site: http://dornerworks.com/services/ 1. MARS ZX2 complete hypervisor-based system for XilinxXen. Simply add guest OS imag- Zynq-7020 SoC Module an embedded target. You can find Xen es and you have your own embedded, Xilinx Zynq-7010/7020 SoC FPGA Up to 1 GB DDR3L SDRAM at http://www.xenproject.org/, along virtualized system. 64 MB quad SPI ash with information on building a Linux With Xen on the new Zynq MPSoC, USB 2.0 Gigabit Ethernet image to serve as dom0 and building a you’ve got cloud computing in the palm Up to 85,120 LUT4-eq 108 user I/Os variety of guest OS images. of your hand. 3.3 V single supply 67.6 × 30 mm SO-DIMM from $127

2. MERCURY ZX5 Developer PC Zynq™-7015/30 SoC Module

Xilinx® Zynq-7015/30 SoC 1 GB DDR3L SDRAM Xilinx Linux, 64 MB quad SPI ash Buildroot source PCIe® 2.0 ×4 endpoint Xen source ARM FSBL, 4 × 6.25/6.6 Gbps MGT U-Boot source USB 2.0 Device Gigabit Ethernet Up to 125,000 LUT4-eq 178 user I/Os Pull 5-15 V single supply Pull Pull 56 × 54 mm

64-bit ARM GCC Petalinux 3. MERCURY ZX1 cross-compiler Buildroot Zynq-7030/35/45 SoC Module

Build Xilinx Zynq-7030/35/45 SoC Compile Generate 1 GB DDR3L SDRAM 64 MB quad SPI ash PCIe 2.0 ×8 endpoint1 8 × 6.6/10.3125 Gbps MGT2 ARM FSBL, USB 2.0 Device Gigabit & Dual Fast Ethernet Xen image U-Boot, dom0 image Up to 350,000 LUT4-eq 174 user I/Os Xilinx Linux image 5-15 V single supply 64 × 54 mm Program 1, 2: Zynq-7030 has 4 MGTs/PCIe lanes.

4. FPGA MANAGER IP Solution

On Chip Memory QEMU C/C++ Passed as Virtual C#/.NET Load SATA Drive MATLAB® LabVIEW™

U-Boot TFTP USB 3.0 Load PCIe® Gen2 Gigabit Ethernet

Xen

Load Streaming, FPGA made simple.

dom0 One tool for all FPGA communications. Stream data from FPGA to host over USB 3.0, Load PCIe, or Gigabit Ethernet – all with one simple API.

domU Design Center · FPGA Modules Base Boards · IP Cores Target We speak FPGA.

Figure 5 – The Xen development work ﬂow

Fourth Quarter 2015 Xcell Journal 43 XPLANATION: FPGA 101 Vivado IPI Opens FPGA Shareable Resources for Aurora Designs

by K Krishna Deepak Senior Design Engineer Xilinx, Inc. [email protected]

Dinesh Kumar Senior Engineering Manager Xilinx, Inc. [email protected]

Jayaram PVSS Senior Engineering Manager Xilinx, Inc. [email protected]

Ketan Mehta Senior IP Product Manager Xilinx, Inc. [email protected]

44 Xcell Journal Fourth Quarter 2015 XPLANANTION: FPGA 101

Xilinx’s IP Integrator tool will help you improve design-entry productivity and resource optimization in multicore Aurora designs.

One of the major challenges that custom- Oers encounter when using multiple instances of intellectual property (IP) in a big design that must fit into a single FPGA is how to share the resources effectively across the system. The shared-logic feature of Xilinx®’s Aurora serial communications core provides users with shared resources across multiple instances. The IP Integrator tool within the Vivado® Design Suite is the key to making the most out of these shared resources. The electronics industry is rapidly shift- ing toward high-speed serial connectivity solutions while moving away from parallel communication standards. Industry-standard serial protocols have fixed line rates and defined lane widths, sometimes un- derutilizing the capabilities of gigabit serial transceivers. Aurora, a high-speed serial communication protocol from Xilinx, has been very popular in the industry and is typically used in applications where competing industry protocols are either too complex or too resource-intensive to implement. By delivering a low-cost, high-data-rate, scalable IP solution, Aurora provides a flexible means to build a high-speed serial data channel. High-performance systems and applications that need to be scaled, both in line rate and channel width, are looking to Aurora as a solution. Aurora is also establishing a presence in ASIC designs as well as in systems built of multiple FPGAs

Fourth Quarter 2015 Xcell Journal 45 XPLANANTION: FPGA 101

IPI visualizes cores as top-level blocks. Connections across standard interface ports are now more intuitive, intelligent and in some cases automatic. with backplanes transporting gigabits The Vivado IP Integrator (IPI) is a key AURORA’S SHAREABLE of data. Aurora’s simple framing struc- tool for resource optimization in complex RESOURCES AT A GLANCE ture coupled with protocol-extending multicore systems. In this regard, IPI will Figure 1 is the representative block flow control capability can be used to help you make the best use of the shareable diagram of the Aurora 64b66b core. encapsulate data from existing proto- resources in the Aurora 64b66b and Auro- Highlighted are the clocking resourc- cols. Its electrical requirements are ra 8b10b cores, especially the “shared-log- es such as mixed-mode clock manager compatible with commodity equip- ic” feature. For convenience, let’s focus on (MMCM), BUFG and IBUFDS, along ment. Xilinx delivers Aurora 64b66b the Aurora 64b66b IP, with the understand- with gigabit transceiver (GT) resourc- and Aurora 8b10b cores as part of the ing that similar techniques are applicable es such as GT common and GT chan- Vivado Design Suite’s IP Catalog. to the Aurora 8b10b core as well. nel, illustrated as GT1 and GT2 for a

IBUFDS_GT

GT Common gt_ref_clk

drp_clk drp_clk 64 64 64

Scrambler Scrambler Tx Aurora Protocol Engine

user_clk txusr_clk2 BUFG txusr_clk User IF TXMMCM GT1 GT2 BUFG tx_out_clk X

64 64 32 32

Rx Aurora Protocol Descrambler RxEngine Aurora Protocol rxrec_clk Engine FIFO FIFO X CBCC BUFGCE CBCC

IBUFDS init_clk rxusr_clk TxRx Startup rxusr_clk2 FSM

Figure1 – Shareable resources, highlighted in orange, in the Aurora 64b66b core

46 Xcell Journal Fourth Quarter 2015 XPLANANTION: FPGA 101

two-lane design based on a Xilinx 7 The IPI tool visualizes cores as ra doesn’t limit itself to a single-quad series device. top-level blocks, and connections sharing. The shared-logic definition for For a typical16-lane Aurora 64b66b across standard interface ports are the Aurora core can be extended to any core, the clocking and GT resource now more intuitive, intelligent and in number of supported lanes. requirements, as used in a Kintex®-7 some cases automatic. Appropriate Here are some examples that show- FPGA KC705 Evaluation Kit, are tabu- design rule checks are built into the case applications based on Aurora’s lated in Table 1. tool and around the IP to ensure the shared-logic feature. Clocking and GT resources in an wrong connections are highlighted FPGA are specific to the device and so the designer will spot them at the MULTIPLE SINGLE-LANE DESIGNS the package selected. Often, multiple time of design entry itself. Top-lev- Multiple single-lane designs in a single IP cores will demand resources for el wrapper files and inference of ap- FPGA differ from multilane designs in use at the system level. Hence, it be- propriate pin-level I/O requirements that they require channel bonding. In- comes imperative to optimize the uti- are automatic, making the tool pro- tuitively, it seems clear that resources lization of these precious resources ductive for system designers. If you needed for multiple single-lane designs to reduce the system cost as well as have designed custom sub-blocks, linearly add up at the system level. Let the power consumption. you could consider packaging your us consider different scenarios and examine how the shared-logic feature helps in each case. We will start with a design that 16-Lane Aurora Design has four single lanes. You could build Availability Used by this kind of design straightaway by Resource in Device Aurora instantiating four single-lane Aurora cores. If we actually ran through the MMCME2_ADV 10 1 implementation, each Aurora design would have one instance of GT com- IBUFDS_GTE2 8 2 mon; therefore, the placement and re- GTE2_COMMON 4 4 source utilization of this design would be spread across four GT quads. This GTXE2_CHANNEL 16 16 might not always be a feasible solution because it is resource-intensive. Table 1 – Clocking and GT resource utilization on a Kintex-7 FPGA KC705 Evaluation Kit For a better placement and optimized solution in terms of power and resources, the four GTs selected should AURORA RESOURCE SHARING design by following Xilinx application be from the same GT quad. As part of the shared-logic feature sup- note 1168, “Packaging Custom AXI IP Without the shared-logic feature, ported across multiple GT-based Xilinx for Vivado IP Integrator” (XAPP1168), handcrafting the generated design to cores, the Aurora core can be config- and use the sub-blocks in IPI. suit this requirement is a focused ef- ured either as “Shared logic in core Not only does the shared-logic fea- fort. To use the shared-logic feature (Master)” or “Shared logic in example ture of Aurora provide users with shared effectively, you will need to generate design (Slave).” A combination of the resources across multiple instances, it one Aurora core in master mode and two configurations makes it possible also allows an out-of-the-box experi- the other three Aurora cores in slave to share the clocking and GT resources ence for utilizing the GT channels in the mode, as shown in Figure 2. Addition- across master and slave when instanti- same GT quad without the pain of edit- al system-level considerations need ated at the system level. ing the GT common, PLL, clocking and to be taken into account, such as re- For applications in which the related modules. The only constraint is setting the cores because the master shared-logic feature is to be used, that the line rates of the “shared” cores core controls the clocking to the slave handcrafting the connections across should be the same (harmonics are also cores. This configuration and resource multiple pieces of IP could create er- allowed if you can take a penalty on the optimization are possible out of the rors and will increase the overall de- clocking resources). box only if the Aurora cores are con- sign-entry time. Tool-assisted design A typical shared-logic design would figured with the same line rate. Ta- entry is the way to solve this problem, include one master and one or more ble 2 quantifies the benefits of using and Xilinx’s IP Integrator has elegant- slave instances within a quad. Unlike the shared-logic feature for four sin- ly addressed it. most other communication IP, Auro- gle-lane designs in a system.

Fourth Quarter 2015 Xcell Journal 47 XPLANANTION: FPGA 101

cholesky_alt_top_0_if aurora_64b66b_slave1 S_AXI USER_DATA_S_AXIS_TX S_AXI_0 M_AXIS_0 CORE_CONTROL AP_FIFO_IAR_0 AP_CTRL gt_rxcdrovrden_in AP_FIFO_OARG_0 interupt loopback[2:0] ap_oscalar_0_din[31:0] mmcm_not_locked power_down USER_DATA_M_AXIS_RX AXI4-Stream Accelerator Adapter GT_SERIAL_RX CORE_STATUS aurora_64b66b_master refclk1_in GT_SERIAL_TX user_clk tx_out_clk USER_DATA_M_AXIS_RX sync_clk link_reset_out CORE_STATUS aurora_64b66b_slave2 reset_pb sys_reset_out channel_up pma_init USER_DATA_S_AXIS_TX gt_pll_lock drp_clk_in hard_err CORE_CONTROL init_clk gt_rxcdrovrden_in lane_up[0:0] USER_DATA_S_AXIS_TX gt_qpllclk_quad1_in loopback[2:0] mmom_not_locked_out GT_DIFF_REFCLK1 GT_DIFF_REFCLK1 gt_qpllrefclk_quad1_in mmcm_not_locked soft_err INIT_DIFF_CLK INIT_DIFF_CLK power_down USER_DATA_M_AXIS_RX GT_SERIAL_TX CORE_CONTROL Aurora 64B66B GT_SERIAL_RX CORE_STATUS tx_out_clk GT_SERIAL_RX refclk1_in GT_SERIAL_TX link_reset_out reset_pb user_clk tx_out_clk user_clk_out pma_init sync_clk link_reset_out sync_clk_out drp_clk_in reset_pb sys_reset_out gt_qpllclk_quad1_out aurora_64b66b_slave3 pma_init gt_qpllrefclk_quad1_out drp_clk_in init_clk_out USER_DATA_S_AXIS_TX init_clk sys_reset_out CORE_CONTROL gt_qpllclk_quad1_in gt_reset_out gt_rxcdrovrden_in gt_qpllrefclk_quad1_in gt_refclk1_out loopback[2:0] mmcm_not_locked Aurora 64B66B Aurora 64B66B power_down USER_DATA_M_AXIS_RX GT_SERIAL_RX CORE_STATUS refclk1_in GT_SERIAL_TX user_clk tx_out_clk sync_clk link_reset_out reset_pb sys_reset_out pma_init drp_clk_in init_clk gt_qpllclk_quad1_in gt_qpllrefclk_quad1_in

Aurora 64B66B

Figure 2 – Shared-logic design using one master Aurora core (left) and three slaves

DESIGNS OCCUPYING serve up to maximum of 12 GT channels clocking resources as possible. You 12 GT CHANNELS if it is chosen in a middle quad. can save on clocking resources if For a 7 series FPGA, the GT require- Let us consider the use case you try to extend the one-master- ment based on north-south clocking is where the requirement is to have 12 plus-three-slave configurations as that a single reference clock source can single-lane designs utilizing as few shown in Figure 2. But if this 1+3 configuration is extended for the three quads, the design will need a total of six differential-clocking resources. However, more savings Design with Four Single Lanes are possible if you select two of Without With the master designs such that they Resource Shared Logic Shared Logic accept a single-ended INIT_CLK and GT reference clock. This way, MMCME2_ADV 4 1 potentially we could reduce the differential-clock input require- IBUFDS_GTE2 4 1 ment from six to two for this sys- GTE2_COMMON 4 1 tem, saving IBUFDS/IBUFDS_GTE2 resource requirements (see Table GTXE2_CHANNEL 4 4 3). IBUFDS_GTE2 resource reduction in the design actually would Table 2 – Resource usage beneﬁts with shared logic for designs with four single lanes also mean a reduction in external

48 Xcell Journal Fourth Quarter 2015 XPLANANTION: FPGA 101

12-Single-Lane Designs With Shared Logic With Shared Logic (using single-ended Resource Without Shared Logic (default) master input clocks)

MMCME2_ADV 12 3 3

IBUFDS_GTE2 12 3 1

GTE2_COMMON 12 3 3 GTXE2_CHANNEL 12 12 12

Table 3 – Resource beneﬁt with shared-logic feature for designs with 12 single lanes clocking resources as well as design Moving up in size to large (16 and sources, which could be a daunting pinouts. Similar optimization can be over) single-lane Aurora designs, the task from the point of view of board imagined for the MMCM as well. need for shared logic becomes even design. By using the techniques men- more acute. Sometimes the require- tioned in the 12-single-lane design use THREE-BY-FOUR-LANE DESIGNS ment could be as large as having 48 case, you could reduce the differen- If there is a requirement of having single-lane independent duplex links. tial clocks and MMCM requirements three four-lane designs, without the The number of allowable Aurora sin- of the overall system (see Table 5). shared-logic feature you might end up gle-lane links is limited only by the creating three four-lane Aurora cores number of GT resources available in a ASYMMETRIC LANES AND in master mode and then handcrafting chosen device. In such use cases, it is OTHER CUSTOM OPTIMIZATIONS the generated design for optimal uti- difficult to realize this system design In applications like video projectors, lization of clocking resources. What without making effective use of the mainstream data will flow in one di- if you could achieve the same result shared-logic feature. rection with high throughput while a out of the box? You can do just that by This design would spread over 12 back channel with lower throughput customizing one master core and two quads; hence, there could be a require- is used to transmit auxiliary or con- slave cores as shown in Figure 3. ment of 2*12 differential-clocking re- trol information. In such applications,

aurora_64b66b_slave1aurora_64b66b__slave1

aurora_64b66b_master1aurora_64b66b__master1 USER_DATA_S_AXIS_TX CORE_CONTROL USER_DATA_M_AXIS_RX GT Refclk1 GTXQ1 GT Refclk2 None GT_SERIAL_RX CORE_STATUS GT Refclk1 GTXQ1 GT Refclk2 None QPLL_CONTROL_IN GT_SERIAL_TX gt_qplllock_in QPLL_CONTROL_OUT gt_qplllock_quad1_in gt_qplllock_out qt_qpllrefclklost_in auaurora_64b66b_rora_64b66b_slave2_slave2 gt_qplllock_quad1_out USER_DATA_M_AXIS_RX qt_qpllrefclklost_quad1_in gt_qplllock_quad2_out CORE_STATUS USER_DATA_S_AXIS_TX refclk1_in gt_qpllrefclklost_out GT_SERIAL_TX CORE_CONTROL user_clk GT Refclk1 GTXQ1 GT Refclk2 None USER_DATA_S_AXIS_TX gt_qpllrefclklost_quad1_out tx_out_clk GT_SERIAL_RX sync_clk GT_DIFF_REFCLK1 GT_DIFF_REFCLK1 gt_qpllrefclklost_quad2_out link_reset_out QPLL_CONTROL_IN reset_pb INIT_DIFF_CLK INIT_DIFF_CLK tx_out_clk sys_reset_out gt_qplllock_in pma_init CORE_CONTROL link_reset_out gt_qplllock_quad1_in drp_clk_in X X GT_SERIAL_RX user_clk_out GTXQ2 qt_qpllrefclklost_in init_clk X X USER_DATA_M_AXIS_RX reset_pb sync_clk_out qt_qpllrefclklost_quad1_in gt_qpllclk_quad1_in X X CORE_STATUS pma_init gt_qpllclk_quad1_out GTXQ1 refclk1_in gt_qpllrefclk_quad1_in 4 2 GT_SERIAL_TX drp_clk_in gt_qpllrefclk_quad1_out user_clk gt_qpllclk_quad2_in 2 3 tx_out_clk gt_qpllclk_quad2_out GTXQ0 sync_clk gt_qpllrefclk_quad2_in X 1 link_reset_out 3 4 gt_qpllrefclk_quad2_out reset_pb GTXQ2 sys_reset_out X X gt_qpllclk_quad3_out Aurora 64B66B pma_init X X gt_qpllrefclk_quad3_out drp_clk_in X X GTXQ1 GTXQ2 X 2 init_clk_out init_clk 3 4 X X sys_reset_out gt_qpllclk_quad2_in 1 2 GTXQ0 GTXQ1 1 X gt_reset_out gt_qpllrefclk_quad2_in X X gt_refclk1_out gt_qpllclk_quad3_in X X GTXQ0 gt_qpllrefclk_quad3_in X X Aurora 64B66B Aurora 64B66B

Figure 3 – Single-master, two-slave conﬁguration for a four-lane Aurora design over three consecutive quads

Fourth Quarter 2015 Xcell Journal 49 XPLANANTION: FPGA 101

System designers implementing multiple Aurora cores need to be aware of the device-specific clocking requirements and limitations. Conserving clocking resources will make the system more cost-effective. having a full-blown duplex link would ber of lanes in one direction of data TX and RX lanes. To have a different mean lesser usage of bandwidth and flow (higher throughput) could be number of lanes in the two directions, essentially result in lower ROI of higher than that of the other direction you need to generate two Aurora sim- the system design. An ideal solution (lower throughput). plex cores for each direction. One for this kind of problem could be, as With the current available data flow way of building these kinds of asym- shown in Figure 4, to have an asym- modes (simplex/duplex) in the Aurora metric-lane designs on 7 series FPGAs metric link width with optimized GT cores, it is possible only to configure is discussed in Xilinx application note resource utilization, where the num- the cores with an equal number of 1227, “Asymmetric Lane Design with Aurora 64B/66B IP Core” (XAPP1227). Another useful design strategy is Optimized Lane Selection for 3 x 4-Single-Lane Designs BUFG resource optimization. Often, system designers implementing mul- GTQ2 Master1_3 Master1_4 tiple Aurora cores that operate at the same or different line rates need to be Slave2_3 Slave2_4 aware of the device-specific clocking GTQ1 Slave2_1 Slave2_2 requirements and limitations. Imple- menting numerous Aurora links de- Slave1_4 Master1_2 mands generation of clocks for the GTQ0 Slave1_3 Slave1_2 respective links. Conserving clocking Master1_1 Slave1_1 resources will make the system more cost-effective. If the system design Table 4 – Optimized lane selection for three four-lane designs has multiple blocks and if there is a clocking resource (BUFG) crunch, you could consider replacing the BUFG with BUFR/BUFH. It is recom- mended, though, that you drive both TX path user clocks of the GT cores #M lanes with the same buffer type. Tx Rx The 7 series Aurora core requires an additional dynamic reconfiguration port (DRP) clock input that otherwise N != M would need to use one BUFG. If Au- Board1/Chip1 Board2/Chip2 rora’s free-running clock frequency is #N lanes chosen in the allowable range of the Rx Tx DRP clock, then the available output free-running clock from Aurora could be reused and connected back to the DRP clock. As a result, you could save on the number of BUFGs in the generated designs. When selecting the line rates across Figure 4 – Asymmetric data transfer across links made possible with Aurora multiple Aurora designs, keep in mind

50 Xcell Journal Fourth Quarter 2015 XPLANANTION: FPGA 101

that you can share clocking resources if line rates are integral multiples for 48-Single-Lane Designs easier clock derivation and sharing Without With across the links. If the shared-logic Resource Shared Logic Shared Logic feature is to be extended to harmonic line rates, then by designing in few ex- MMCME2_ADV 48 4 tra clock dividers, you could generate IBUFDS_GTE2 48 4 the required input frequencies for the slave Aurora cores. GTE2_COMMON 48 12

FUTURE POSSIBILITIES GTXE2_CHANNEL 48 48 Aurora’s flexibility opens up possibilities of creating a variety of system con- Table 5 – Resource beneﬁt with shared-logic feature for 48-single-lane designs figurations and applications. Aided by a powerful tool like Xilinx’s Vivado IP Xilinx UltraScale™ architecture, there To evaluate the Aurora cores, check Integrator, design-entry productivity are devices with many more GT chan- out the IP Catalog, IPI and Aurora prod- and system-level resource sharing are nels with enhanced GT line rate support uct Web page at http://www.xilinx. speeding up innovation in the All Pro- and hence the possibilities and effective com/products/design_resources/conn_ grammable application space. With the resource utilization are even greater. central/grouping/aurora.htm.

Support for Xilinx Zynq® Ultrascale+™ and Zynq-7000 families

► Concurrent debugging and tracing of ARM Cortex-A53/-R5, -A9 and MicroBlaze™ soft processor core

► RTOS support, including Linux kernel and process debugging

► Run time analysis of functions and tasks, code coverage, system trace

Fourth Quarter 2015 Xcell Journal 51 XPLANATION: FPGA 101 Making XDC Timing Constraints Work for You by Adam Taylor Chief Engineer e2v [email protected]

52 Xcell Journal Fourth Quarter 2015 XPLANANTION: FPGA 101

ompleting the RTL design is one part of getting your FPGA design production-ready. Timing and placement The next challenge is to ensure the design Cmeets its timing and performance require- constraints are a crucial ments in the silicon. To do this, you will often need to define both timing and placement constraints. factor in achieving your Let’s take a look at how to create and use both of these types of constraints when designing systems design requirements. around Xilinx® FPGAs and SoCs. Here’s a primer on how TIMING CONSTRAINTS At their most basic level, timing constraints define the to use them. operating frequency of your system’s clock or clocks. However, more advanced constraints establish the relationships between clock paths. Engineers include these types of constraints to determine if it will be necessary to analyze the path or—if there is no valid timing relationship between those clock paths—discount it. By default, Xilinx’s Vivado® Design Suite will analyze all relationships. However, not all clocks within a design will have a timing relationship that can be accurately analyzed. One example is clocks that are asynchronous, since it’s not possible to accurately determine their phase, as shown in Figure 1. You can manage the relationships between clock paths using a constraints file and declaring clock groups. When a clock group is declared, the Vivado tools perform no timing analysis in either direction between the clocks defined within it. To aid in the generation of timing constraints, the Vivado tools define clocks as being within one of three categories: synchronous, asynchronous or unexpandable. • Synchronous clocks have a predictable timing/phase relationship. This is normally the case for a primary clock and its generated clocks, as they share a common root and will have a common period. • Asynchronous clocks have no predictable timing/ phase relationship between them. This is normally the case for different primary (and their generated) clocks. Asynchronous clocks will have different roots. • Two clocks are unexpandable if, over 1,000 cycles, a common period cannot be determined. If this is the case, then the worst-case setup relationship over these 1,000 cycles will be used. However, there is no guarantee it is the worst case in reality. To determine which type of clock you are dealing with, use the clock report that Vivado produces. This report will aid you in identifying asynchronous and unexpandable clocks.

Fourth Quarter 2015 Xcell Journal 53 XPLANANTION: FPGA 101

Declaring a multicycle path results in a more appropriate and less restrictive timing analysis, allowing the timing engine to focus upon other, more critical paths.

With these clocks identified, you can BUFGCTL. Therefore, the clocks cannot TIMING EXCEPTIONS now use the “set clock group” constraint be present upon the clock tree at the same You must also focus upon what hap- to disable timing analysis between time. As such, we do not want Vivado to pens within a defined clock group them. The Vivado suite uses Xilinx De- analyze the relationship between these when you have an exception. But what sign Constraints (XDC), which are con- clocks as they are mutually exclusive. is an exception? straints based upon Synopsys Design Finally, the –asynchronous constraint is One common example of a timing Constraints (SDC), a widely used Tcl- used to define asynchronous clock paths. exception would be a result being cap- based constraints format. With the XDC The final aspect of establishing the tured only every other clock cycle. An- constraints, you can use the command timing relationship is to take into account other would be transferring data from below to define a clock group: the nonideal relationship of the clocks, in a slow to a faster clock (or vice versa) particular jitter. You need to consider jit- where both clocks are synchronous. set_clock_groups –name – ter in two forms: input and system jitter. In fact, both of these situations are ex- logically_exclusive –physi- Input jitter is present upon the primary amples of a timing exception common- cally_exclusive –asynchro- clock inputs and is the difference be- ly referred to as a multicycle path, as nous –group tween when the transition occurs against shown in Figure 2. The –name is the name given to the when it should have occurred under ide- Declaring a multicycle path for these group. The –group option is the place al conditions. System jitter results from paths results in a more appropriate and where you can define the members of noise existing within the design. less restrictive timing analysis, allowing the group—that is, the clocks that have You can use the set_input_jitter the timing engine to focus upon other, no timing relationship. The logically and constraint to define the jitter for each more critical paths. The upshot is to in- physically exclusive options are used primary input clock. Meanwhile, the crease the quality of results. when you have multiple clock sources system jitter is set for the whole design Within your XDC file, you can de- from which to select in order to drive a (that’s all the clocks) using the set_sys- clare a multicycle path using the fol- clock tree, including BUFGMUX and tem_jitter constraint. lowing XDC command:

OSC 1 CLK1 Domain

CLK1 Domain

OSC 2 CLK2 Domain

Figure 1 – Domains CLK1 and CLK2 are asynchronous to each other.

54 Xcell Journal Fourth Quarter 2015 XPLANANTION: FPGA 101

Propagation Time > 1 CLK period

Timing analysis between registers uses 1 CLK period by default.

Figure 2 – The multicycle path is one example of a timing exception. set_multicycle_path path_ PHYSICAL CONSTRAINTS The most commonly used constraints multiplier [-setup|-hold] The most commonly used physical relate to I/O placement and configuration [-start|-end][-from ] [-to ] pins and the definition of parameters volves the use of both placement con- [-through ] associated with the I/O pins, for in- straints to locate the physical pin and I/O stance standard drive strength. How- constraints to configure the I/O properties When you declare a multicycle path, ever, there are other types of physical such as the I/O standard, slew rate, etc. you are in effect multiplying the re- constraints, including placement, rout- Modern FPGAs support a number of quirements for the setup and hold (or ing, I/O and configuration constraints. single and differential I/O standards. These both) analysis by the path_mutiplier. Placement constraints make it possible are defined via the I/O constraints. Howev- For instance, in the first example above, to define the locations of cells, while er, you must take care to ensure you are where the output occurs every two routing constraints allow you to define following I/O banking rules, which depend clock cycles, the path_multiplier would the routing of signals. I/O constraints upon the final pin placement. be two in the case of the setup timing. let you define the location of I/Os and But what are I/O banking rules? The Since the multicycle path can be ap- their parameters. Finally, configura- user I/Os within an FPGA are grouped plied to either setup or hold, you then tion constraints offer a way to define together into a number of banks con- have the choice of where to apply it. your configuration methods. sisting of a number of I/Os. These banks When you declare a setup multiplier, it is As always, there are a few con- have independent voltage supplies, en- often best practice to also declare a hold straints that sit outside of these groups. abling the support of the wide range of time multiplier using the equation below. The Vivado Design Suite has three such I/O standards. On the Zynq®-7000 All hold cycles = setup multi- constraints, and they are predominantly Programmable SoC (and other 7 series plier – 1 – hold multiplier used on the netlist. devices), I/O banks are further classified as belonging to one of two overall What this means for the simple exam- • DONT_TOUCH – This constraint is groups—high performance and high ple we have been following is that the hold used to prevent optimization, and range. These categories further con- multiplier is defined by this equation: as such it can be of great use when strain their performance and require hold multiplier = setup implementing a safety-critical or that the engineer use the correct class multiplier – 1 When using a high-reliability system. for the correct interface. The high-performance (HP) class is common clock. • MARK_DEBUG – This constraint is To demonstrate the importance of optimized for higher data rates. As such, used to preserve an RTL net such that multicycle paths, I have created a sim- it uses lower operating voltages and it can be used for debugging later. ple example that you can downloaded does not support LVCMOS 3v3 and 2v5. here. Within the XDC file there is one • CLOCK_DEDICATED_ROUTE – The other class, high range (HR), is opti- example which contains two multicycle This constraint identifies a route mized to handle wider I/O standards not paths declared, both setup and hold. for the clock routing. supported by HP. HR therefore supports

Fourth Quarter 2015 Xcell Journal 55 XPLANANTION: FPGA 101

traditional 3v3 and 2v5 interfacing. Fig- system as a whole, you will end up with nector. In that case, you can use the ure 3 demonstrates these banks. a number of constraints like the ones be- I/O constraints to implement a pull- Once you have determined which low for the I/Os in the design. up or pull-down resistor to prevent banks to use for which signal, you the FPGA input signal from floating, set_property PACKAGE_PIN still have the ability to change the which can cause system issues. G17 [get_ports {dout}] signal drive strength and slew rate. Of course, you can also use phys- set_property IOSTAN- These metrics will be of great interest ical constraints to improve the tim- DARD LVCMOS33 [get_ports to your hardware design team as they ing of your design by implementing {dout}] strive to ensure that the signal integ- the final output flip-flop within the set_property SLEW SLOW rity upon the board is optimal. These I/O block itself. Doing so reduces [get_ports {dout}] selections will also affect the timing the clock-to-output timing. You can of the board design. As such, you may do the same thing on input signals as set_property DRIVE 4 [get_ opt to use a signal integrity tool. well, which will allow the design to ports {dout}] SI tools require an IBIS model. You meet the pin-to-pin setup-and-hold can extract an IBIS model of your de- With the HP I/O banks, you can timing requirements. sign from the Vivado tools when you also use the digitally controlled im- have the implemented design open us- pedance to terminate the I/O correct- PHYSICAL CONSTRAINTS ing the File->Export->Export IBIS mod- ly and increase the SI of the system START WITH PLACEMENT el option. You can then use this file to without the need for external termi- You may wish to constrain the place- close the system-level SI issues and tim- nation schemes. You must also con- ment for a number of reasons—perhaps ing analysis of the final PCB layout. sider the effects of the I/O if there to help achieve timing or maybe to pro- Once the design team is happy with is no signal driving it, for instance vide isolation between sections of the the SI performance and timing of the if it is attached to an external con- design. In this regard, three types of constraints will be important: • BEL – The basic element of logic allows a netlist element to be placed within a slice. Bank 18 HP • LOC – Location places an element from 50 I/0 the netlist to a location within a device. Bank 17 • PBlock – You can use the physical (or HP “P”) block to constrain logic blocks to 50 I/0 a region of the FPGA.

Bank 16 Thus, while a LOC allows you to HP define a slice or other location within 50 I/0 the device, a BEL constraint lets you target at a finer granularity the flip- Bank 15 HP flop to use within the slice. PBlocks 50 I/0 can be used to group logic together when segmenting large areas of the Bank 14 Bank 34 design. Another use for PBlocks is to HP HP define logical regions when you wish 50 I/0 50 I/0 to perform partial reconfiguration. In some instances, you will wish to Bank 13 Bank 33 HP HP group together smaller logic functions 50 I/0 50 I/0 to ensure the timing is optimal. While it’s possible to do so using PBlocks, it Bank 12 Bank 32 is more common in this scenario to use HP HP 50 I/0 50 I/0 relatively placed macros. Relatively placed macros (RPMs) allow design elements such as DSPs, flip- Figure 3 – High-performance (left) and high-range I/O banks on a Xilinx 7 series device. flops, LUTs and RAMs to be grouped to-

56 Xcell Journal Fourth Quarter 2015 XPLANANTION: FPGA 101

• RLOC allows the assignment of relative locations to the SET.

SIGNAL ip_sync : STD_LOGIC_VECTOR(1 DOWNTO 0) :=(OTHERS =>’0’); The RLOC constraints use the defi- SIGNEL shr_reg : STD_LOGIC_VECTOR(31 DOWNTO 0) :=(OTHERS =>’0’); nition RLOC = XmYm, where the X and Y relate to the coordinates of the FPGA ATTRIBUTE RLOC : STRING; array. When you define an RLOC, you ATTRIBUTE HU_SET : STRING; ATTRIBUTE HU_SET OF ip_sync : SIGNAL IS “ip_sync”; can make this either a relative or an ATTRIBUTE HU_SET of shr_reg : SIGNAL IS “shr_reg”; absolute coordinate, depending upon whether you add in the RPM_GRID attribute. Including this attribute makes the definition absolute and not relative. Figure 4 – Constraints within the source code As these constraints are defined within the HDL as in Figure 4, it is of- gether in the placement. Unlike PBlocks, interconnection lengths to enable bet- ten necessary to run a place-and-route RPMs do not constrain the location of ter timing performance. iteration initially, before adding the these elements to a specific area of the To co-locate design elements, you can constraints to the HDL file, so as to cor- device (unless that’s what you want). use three types of constraints, which can rectly define the placement. Instead, RPMs group these elements to- be defined with the HDL source files. In short, understanding timing and gether when they are placed. • U_SET makes it possible to define an placement constraints and learning Placing design elements close to- RPM set of cells regardless of hierarchy. how to correctly use them are the gether makes it possible to achieve keys to obtaining the best quality of two goals. It improves resource effi- • HU_SET allows definition of an RPM results in your Xilinx-based program- ciency and it allows you to fine-tune set of cells with hierarchy. mable logic design.

TigerFMC

200 Gbps bandwidth The highest VITA 57 compliant optical bandwidth for FPGA carrier

www.techway.eu

Fourth Quarter 2015 Xcell Journal 57 TOOLS OF XCELLENCE Toward Easier Software Development for Asymmetric Multiprocessing Systems

by Arvind Raghuraman Staff Engineer Mentor Graphics Corp. [email protected]

58 Xcell Journal Fourth Quarter 2015 TOOLS OF XCELLENCE

The Mentor Embedded Multicore Framework eases SoC system design by hiding the complexities of managing heterogeneous hardware and software environments.

Heterogeneous multiprocessing is becoming increasingly important to embedded applications today. System-on-chip (SoC) architectures such Has Xilinx®’s Zynq® UltraScale+™ MPSoC provide a powerful heterogeneous multiprocessing infrastructure consisting of quad ARM® Cortex®-A53 cores and dual ARM Cortex-R5 cores. In addition to the core compute infrastructure, the SoC contains a rich collection of hardened peripheral IP and FPGA fabric, enabling flexible design paradigms for system developers to create high-performance multiprocessing systems. Various software development paradigms are available that enable developers to leverage the multiprocessing capabilities offered by SoCs such as the Zynq MPSoC. Symmetric multiprocessing (SMP) operating systems provide the infrastructure required to balance application workloads symmetrically or asymmetrically across multiple homogeneous cores present in a multiprocessing system. However, to leverage the compute bandwidth provided by the heterogeneous processors present in the system, asymmetric multiprocessing (AMP) software architectures are needed. AMP architectures typically entail a combination of dissimilar software environments such as Linux, a real-time operating system (RTOS) or bare-metal software running on dissimilar processing cores present in the SoC—all working in concert to achieve the design goals of the end application. Typical designs involve a software context on a master core bringing up a remote software context on a remote core in a demand-driven manner to offload computation. The master, remote processors and their associated software contexts (that is, their OS environments) could

Fourth Quarter 2015 Xcell Journal 59 TOOLS OF XCELLENCE

Mentor chose the remoteproc and rpmsg API present in the Linux 3.4.x kernel and newer. be homogeneous or heterogeneous ing the user with a simplified applica- However, the Linux-provided infra- in nature. Effectively dealing with the tion-level interface. structure had some limitations. For complexities of managing the life cycles Let’s take a closer look at how you starters, Linux rpmsg implicitly assumed of several operating systems on pos- can use this new development frame- that Linux would always be the master sibly dissimilar processors, while also work to manage heterogeneous compu- operating system, and it did not support providing an enabling interprocessor tation in AMP systems. Linux as a remote OS in an AMP config- communication (IPC) infrastructure for uration. Furthermore, the remoteproc offloading compute workload, demands COMPATIBILITY AND ORIGINS and rpmsg APIs were available from the new and improved software capabilities Compliance to open standards and Linux kernel space only—there were no and methods. adoption by the Linux community were equivalent APIs or libraries usable with The Mentor Embedded Multicore important criteria when selecting an ap- other OSes and runtimes. Framework from Mentor Graphics is a propriate API for the Mentor Embedded The Mentor Embedded Multicore software framework that provides two Multicore Framework. Mentor chose Framework is a standalone library writ- key capabilities for AMP system de- the remoteproc and rpmsg API present ten in the C language. It provides a clean- velopers: the remoteproc component in the Linux 3.4.x kernel and newer. room implementation of the remoteproc and API for life cycle management of The Linux remoteproc and rpmsg infra- and rpmsg functionality usable with remote processors and their associat- structure was originally conceived and RTOS or bare-metal software environ- ed software contexts; and the rpmsg committed to the Linux kernel by Tex- ments, with API-level compatibility and component and API for IPC between as Instruments. The infrastructure al- functional symmetry to its Linux coun- OS contexts in the AMP environment. lowed Linux OS on a master processor terpart. Figure 1a shows a software The framework hides the complexities to manage the life cycle and communi- stack diagram of the Mentor Embedded of managing heterogeneous hardware cations with a remote software context Multicore Framework and its usage in and software environments, provid- on a remote processor. RTOS or bare-metal environments. As

a. b.

Application Application

Char device Mentor Embedded Multicore Framework Linux Kernel

rpmsg remoteproc remoteproc rpmsg user device RTOS VirtIO platform driver driver Or Bare-Metal rpmsg Porting Layer HIL Environment remoteproc VirtIO

Hardware Hardware

Figure 1 – Mentor Embedded Multicore Framework in RTOS and bare-metal environments (a), and remoteproc and rpmsg in the Linux kernel (b)

60 Xcell Journal Fourth Quarter 2015 TOOLS OF XCELLENCE

Master

Remote

remoteproc remoteproc remoteproc

M RTOS M RTOS M RTOS M RTOS rpmsg E or E or rpmsg Linux E or rpmsg E or Linux M Bare M Bare M Bare M Bare F metal F metal F metal F metal remoteproc & remoteproc & rpmsg rpmsg

Remote Remote Remote Master Core Master Core Master Core Core Core Core

Configuration a Configuration b Configuration c

Figure 2 – AMP conﬁgurations that the Mentor Multicore Framework supports shown, the framework’s well-abstract- The unsupervised AMP (uAMP) ar- guest OS context for unsupervised man- ed porting layer consists of a hardware chitecture is useful in applications that agement of heterogeneous compute re- interface layer and an OS abstraction do not require a strong separation be- sources; and from within the hypervisor (environment) layer, allowing users to tween the participating OS contexts. In for supervised management of hetero- easily port the framework to other pro- this architecture, the participating oper- geneous compute resources, allowing cessors and operating systems. ating systems run natively on the proces- the hypervisor to supervise interactions Figure 1b shows the remoteproc and sors present in the system. As shown in between the guest operating systems rpmsg infrastructure present in the Li- Figure 3a, the Mentor Embedded Multi- and remote contexts involved. nux kernel. The remoteproc and rpmsg core Framework provides a simple and In general, the Mentor Embedded kernel-space drivers provide services effective infrastructure in which a mas- Multicore Framework is well-suited for to the remoteproc platform driver and ter software context on a master (boot) applications requiring demand-driven off- rpmsg user device driver. The remote- processor can manage the life cycle and load of compute functions to specialized proc platform driver allows for remote offload computation to other compute cores present on a multiprocessing chip. life cycle management, and the rpmsg resources present in the SoC. In the case of power-constrained devices, user device driver exposes IPC services A supervised asymmetric multipro- the framework enables on-demand bring- to user-space applications. cessing (sAMP) architecture is best for up and shutdown of compute resources, In addition to enabling RTOS and applications that require isolation of allowing for optimal power usage. bare-metal environments to interoper- software contexts and virtualization of The framework also provides an easy ate with Linux remoteproc/rpmsg in- system resources. In sAMP, the partic- path for consolidation of legacy sin- frastructure in AMP architectures, the ipating guest operating systems run in gle-core embedded systems onto powerful Mentor Embedded Multicore Frame- guest virtual machines that are managed and more capable multiprocessing SoCs. work provides work flows and runtime and scheduled by a hypervisor (aka vir- With very little effort, the framework al- infrastructure to package and boot Li- tual machine monitor). The hypervisor lows for migration of legacy software nux as a remote OS in AMP configura- provides isolation and virtualization originally developed for unicore silicon tions. Figure 2 shows the various AMP services for the virtual machines. The to easily interoperate with enhanced sys- configurations the framework supports. Mentor Embedded Multicore Frame- tem functionality developed on newer and work enables sAMP architectures to more powerful multiprocessing chips. USE CASES AND APPLICATIONS manage computation on heterogeneous Lastly, the framework facilitates im- The Mentor Embedded Multicore Frame- compute resources present in the SoC. plementation of fault-tolerant system ar- work is well suited for both unsuper- As illustrated in Figure 3b, the frame- chitectures. For example, the framework vised and supervised AMP architectures. work can be used in two ways: from the can enable an RTOS context (master)

Fourth Quarter 2015 Xcell Journal 61 TOOLS OF XCELLENCE

a. b.

IPC between OS contexts using rpmsg IPC between OS contexts using rpmsg

remoteproc remoteproc

Linux Linux Guest 0 Guest 1 MEMF MEMF remoteproc remoteproc Mentor rpmsg rpmsg Bare Bare RTOS Embedded RTOS metal metal Linux (SMP) MEMF MEMF MEMF MEMF Hypervisor remoteproc Cortex- Cortex- Cortex- Cortex- Cortex- A53Cortex- R5 R5 Cortex- R5 R5 Cortex- Core A53 Core- Core A53 Cortex- Core Core A53 Core Core A53 Core Core

Figure 3 – Mentor Embedded Multicore Framework use cases, including the uAMP (a) and sAMP (b) architectures that’s handling critical system function- the memory layout. You should assign conflicts. These changes will typically ality to manage a Linux context handling memory regions for each participating require modifications to the participat- noncritical system functions. Upon fail- OS runtime, and assign shared-mem- ing OS kernel, BSP sources or both. ure of the Linux-based subsystem, the ory regions for IPC between the OS The next step is to perform system RTOS can simply reboot the failed sub- instances. Once the memory layout is partitioning. You must partition system system without causing any adverse ef- finalized, you will need to update the resources such as memory and non- fects to critical system functions. platform-specific configuration data shared I/O devices between the partic- provided to the framework to reflect ipating operating systems so that each SYSTEM-LEVEL CONSIDERATIONS the chosen memory architecture. OS has visibility and can access only Mentor Embedded Multicore Frame- Off-the-shelf operating systems gen- the resources that are assigned to it. work APIs provide the required soft- erally assume they own the entire SoC, You can accomplish this task by modi- ware infrastructure to manage compu- and are not readily suited for operation fying the platform-specific data (device tation in AMP systems. However, those in unsupervised AMP environments, and memory definitions) provided to designing AMP systems must take cer- where cooperative usage of shared re- the participating OSes. For example, tain system-level considerations into sources and mutually exclusive usage modify memory and device definitions account before developing application of nonshared resources are key require- in a Linux device tree source (DTS) file software using these APIs. ments. Each participating OS in an AMP for the Linux OS; in a platform defini- During the initial design phase, you system needs to be modified so that tion file for the Nucleus RTOS; and per- will be determining the AMP topology. shared resources are used in a cooper- haps in the platform-specific headers in The framework can be used in a star to- ative fashion. For example, the remote bare-metal environments. pology—a single master managing mul- OS should not reset and reinitialize a tiple remotes—or in a chain topology, shared global interrupt controller that LIFE CYCLE MANAGEMENT with chained master and remote nodes. might already be in use within the mas- USING REMOTEPROC Once you have chosen a suitable ter context; nor can shared clock trees Once you have made the system-level topology, the next step is to determine or peripherals be modified to cause design decisions and modifications to

62 Xcell Journal Fourth Quarter 2015 TOOLS OF XCELLENCE

The framework provides work flows that make it possible to package Linux, RTOS or bare-metal software images along with required bootstrapping firmware to produce remote firmware images in ELF format. the participating operating systems, Once remoteproc is initialized, you • An rpmsg channel is a bidirectional you are now ready to start using the can use the remoteproc_boot API to communication channel between Mentor Embedded Multicore Frame- boot the remote processor with the the master and the remote processor work from application software. The associated software context. On in- (aka rpmsg device). framework provides work flows that vocation, the firmware image is lo- • An rpmsg endpoint is a logical ab- make it possible to package Linux, cated to execute in place in memory, straction that can be present on ei- RTOS or bare-metal-based software and the remote processor is released ther side of an rpmsg channel. images along with required bootstrap- from reset to execute the image. The ping firmware to produce remote firm- remoteproc_shutdown and remote- • The endpoints provide the infrastruc- ware images in ELF format. proc_deinit APIs allow applications to ture for sending targeted messages A remote firmware ELF image con- shut down the remote processor and between master and remote contexts. tains a special section called the re- de-initialize resources, respectively. • When an endpoint is created, the user source table. The resource table is a (The pseudo-code block in Figure 5 provides a unique endpoint index or static data structure with predefined shows an example of remoteproc API allows the rpmsg component to as- bindings where users can specify re- usage from the master context.) sign an index for the endpoint. In ad- sources required by the remote firm- In the remote context, the boot dition, the user provides an applica- ware. Some key definitions supplied and shutdown APIs are irrelevant. tion-defined callback to be associated in resource tables include memory re- To initialize and de-initialize the re- with the endpoint being created. quired by the remote firmware and IPC moteproc component, you must use capabilities supported by remote firm- remoteproc_resource_init, remote- • When a message is received target- ware. The remoteproc component in proc_resource_deinit APIs. For infor- ed to a given endpoint index, rpmsg the master software context will use the mation on how remoteproc is used invokes the associated receive call- resource table definitions to allocate re- from the Linux context, please refer back with a reference to the data sources and establish communications to Linux kernel documentation. payload received. with the remote context. • Users can create any number of end- The framework master initializes the RPMSG AND INTERPROCESSOR points on either side of an rpmsg remote processor context using the re- COMMUNICATION channel. moteproc_init API. On invocation, the Once the remote firmware is up and remoteproc master fetches the remote running on the remote processor, you • Messages that are not explicitly tar- firmware image, decodes it, obtains can use the rpmsg APIs for interproces- geted to a destination endpoint index the resource table and parses it to dis- sor communication between the master reach a default endpoint associated cover the remote firmware’s resource and remote software contexts. The key with the rpmsg channel. requirements. Based on resource ta- abstractions and concepts to be under- • The rpmsg component notifies user ble definitions, remoteproc carves out stood when using rpmsg are as follows: applications of events such as chan- the physical memory required for the nel creation and deletion using user remote firmware and performs cer- • From the perspective of the master, provided callbacks registered during tain initialization functions specific to an rpmsg device represents a remote initialization. rpmsg/VirtIO-based IPC. processor.

Fourth Quarter 2015 Xcell Journal 63 TOOLS OF XCELLENCE

Endpoint rpmsg channel Application logic producing data Data producer

1 Remote 1 2

Master

1 2 Remote 2

endpoint

Figure 4 – rpmsg channel and endpoint abstractions

Figure 4 illustrates rpmsg channel ly. When the remote context invokes communications with its counterpart. and endpoint abstractions and their remoteproc_resource_deinit, the mas- The rpmsg driver instantiates an rpmsg usage. The rpmsg component works in ter application is notified of the event VirtIO device and uses the VirtQueue in- concert with remoteproc to establish by the rpmsg channel deleted callback, terface to push and consume data w.r.t. and manage the rpmsg communication allowing for graceful termination of the its communicating counterpart. channel between master and remote rpmsg-based communication link. The contexts. Once remoteproc on master master can choose to asynchronously TOOLS FOR DEVELOPMENT brings up the remote context, rpmsg on shut down the remote processor using OF AMP SYSTEMS the remote context sends a name ser- the remoteproc_shutdown API in sit- Development of AMP application soft- vice announcement. Upon receiving the uations where the remote context be- ware presents a unique set of challeng- name service announcement, the master comes unresponsive. The pseudo-code es. System developers typically find registers the announced rpmsg device segment in Figure 5 shows usage of themselves having to simultaneously and establishes an rpmsg channel. Once rpmsg APIs in concert with remoteproc debug different OS environments de- the channel is established, the rpmsg APIs in a master context. ployed on dissimilar processors on het- channel-created callback is invoked on The rpmsg component uses VirtIO as erogeneous SoCs. Having a unified de- both sides, notifying the master and re- a shared-memory-based transport ab- bugging environment with awareness of mote applications of channel creation. straction. VirtIO has its roots as an I/O the operating systems involved will not At this point, the master and remote virtualization standard used for guest-to- only enhance the debug experience, but context can transmit data to each other host communications in lguest, KVM and improve productivity. Mentor Embed- using the rpmsg_sendxx and rpmsg_ the Mentor Embedded Hypervisors. The ded Sourcery CodeBench tools provide trysendxx APIs for blocking and non- rpmsg driver uses services provided by a unified IDE with OS awareness for all blocking transmit requests respective- the VirtIO layer for shared-memory-based supported OS environments (including

64 Xcell Journal Fourth Quarter 2015 TOOLS OF XCELLENCE

Mentor Embedded Linux and Nucle- debugging Linux kernel space, Nucle- While developing AMP systems, soft- us RTOS). Sourcery CodeBench also us RTOS and bare-metal contexts; and ware profiling is a valuable tool to gain supports a multitude of debug options, GDB-based debug for Linux user space insight into how various applications which include JTAG-based debug for and Nucleus RTOS-based applications. deployed on heterogeneous OSes interact with each other during runtime. Each OS instance typically uses an independent clock reference, and any profiling data collected within a given /* rpmsg channel created callback - invoked on channel creation */ OS context will be based on a time base void rpmsg_channel_created(struct rpmsg_channel *rp_chnl) { that is local to the OS. Mentor Embed- .. /* Use RTOS provided primitives (ex., semaphores) to ded Sourcery Analyzer host-based tools release the application context blocked on rpmsg and Mentor’s operating systems contain channel creation */ built-in algorithms that enable users to } graphically visualize and analyze trace /* rpmsg channel deletion callback - invoked on channel deletion */ data collected from disparate OS sourc- void rpmsg_channel_deleted(struct rpmsg_channel *rp_chnl) { es in a unified time reference. This ca- .. /* Use RTOS provided primitives (ex., semaphores) to pability allows users to gain interesting release the application context blocked on rpmsg insights into complex interactions and channel deletion */ hard-to-find timing issues typically en- } countered in developing AMP software. /* rpmsg receive callback - invoked when data received on default endpoint */ AN OPEN-SOURCE void rpmsg_rx_cb(struct rpmsg_channel *rp_chnl, void *data, int len, void * priv, unsigned long src) { RUNTIME COMPONENT .. The Mentor Embedded Multicore Frame- /* Copy received data to application buffer and use work is tightly integrated with Mentor’s de- RTOS provided primitives (ex., semaphores) to release the application IPC data processing context */ velopment tools and operating systems. It } supports a diverse set of ARM-based SoCs and platforms. Using the framework with /* Initialize remote context */ int Initialize_Remote_Context(..) { Mentor tools and operating systems frees ... users from having to design their AMP /* Initialize remote context */ system from scratch—that is, perform the remoteproc_init(remote_fw_info, rpmsg_channel_created, rpmsg_channel_deleted, tasks discussed under the system-level rpmsg_rx_cb, &proc); considerations section above. Users can get started with AMP application devel- /* Boot remote context */ remoteproc_boot(proc); opment with one of the reference config- ... urations and later customize the system } configuration to fit their needs. /* Send data to remote context after rpmsg channel creation */ For AMP system design, there is a int Send_Data_To_Remote(..) { clear need for a standards-based soft- .. ware framework that enables develop- rpmsg_send(app_rp_chnl, user_buff, sizeof(user_buff)); .. ment of RTOS or bare-metal software } that can interoperate with interfaces adopted by the open-source Linux /* Finalize remote context after rpmsg channel deletion */ int Finalize_Remote_Context(..) { community. To address this need and .. to promote industry adoption, Mentor /* Shut down and finalize the remote processor */ Graphics and Xilinx have jointly open- remoteproc_shutdown(proc); remoteproc_deinit(proc); sourced the Mentor Embedded Multi- .. core Framework’s runtime component } under the OpenAMP open-source project, with platform support for the the Zynq-7000 All Programmable SoC. This Figure 5 – Pseudo code that illustrates the usage of key remoteproc project is currently jointly maintained and rpmsg APIs from the master context by Mentor Graphics and Xilinx.

Fourth Quarter 2015 Xcell Journal 65 XTRA, XTRA What’s New in the Vivado 2015.3 Release? Xilinx is continually improving its products, IP and design tools as it strives to help designers work more effectively. Here, we report on the most current updates to Xilinx design tools including the Vivado® Design Suite, a revolution- ary system- and IP-centric design environment built from the ground up to accelerate the design of Xilinx® All Programmable devices. For more information about the Vivado Design Suite, please visit www.xilinx.com/vivado.

Product updates offer significant enhancements and new features to the Xilinx design tools. Keeping your installation up to date is an easy way to ensure the best results for your design.

The Vivado Design Suite 2015.3 is available from the Xilinx Download Center at www.xilinx.com/download.

Also new is this release is the Partial VIVADO DESIGN SUITE 2015.3 Reconfiguration Decoupler IP. This new core makes it easy for designers to iso- RELEASE HIGHLIGHTS late reconfigurable partitions from the The Vivado Design Suite 2015.3 release introduces new market-tai- static design during reconfiguration. lored plug-and-play IP subsystems. These new subsystems, in combination with enhancements to Vivado IP Integrator (IPI) and See the Xilinx IP page for more details: high-level synthesis (HLS) for C/C++ and SystemC-based design, http://www.xilinx.com/products/intel- significantly decrease design creation and integration efforts by ab- lectual-property/pr-decoupler.html. stracting time-consuming RTL development. Vivado IP Integrator To speed design integration, the new Also, the Vivado 2015.3 implementation has new features including release offers easy access to IP exam- pipeline reporting, automated pipeline insertion and retiming, and ple designs from IP Integrator (IPI), enhanced cross-probing, as well as core engine enhancements. along with enhanced configurable example designs. Among these example Device Support New General Access designs is an option to configure the MicroBlaze™ processor. New Devices Supported Devices The following devices are production The following UltraScale™ devices Vivado Simulator are introduced in this release: ready (in -1 and -2 speed grades) The latest simulator version offers up • Kintex® UltraScale devices: • Kintex UltraScale devices: to a 3x improvement in elaboration runtime performance. XCKU095, XCKU025, XCKU095, XCKU025, XCKU085 XCKU085 • Virtex® UltraScale devices: Vivado Debug and Reporting XCVU095, XCVU080 Xilinx has added a host of new reports to speed timing closure and debug. The re- port_design_analysis command reports a list of paths that are critical at both the VIVADO DESIGN SUITE: VU065 and VU080, bringing the total current design stage and the prior stage. DESIGN EDITION UPDATES number of UltraScale devices supported This provides a way to check on which to 12. critical paths the tools will focus on at Partial Reconfiguration and each stage. Another new command, re- Tandem Configuration Partial-bit-file generation is now en- port_pipeline_analysis, evaluates poten- The new release of the Vivado Design abled for the KU060, KU095, KU115 and tial design performance improvements Suite features expanded support for VU095 production silicon, bringing the by hypothetically adding latency (pipe- UltraScale devices, including place-and- total number of devices enabled for bit- line stages) to the design and reporting route support for the KU085, KU095, streams up to six. the new, resulting Fmax.

66 Xcell Journal Fourth Quarter 2015 VIVADO DESIGN SUITE: latency of their algorithms to make swift SYSTEM EDITION UPDATES adjustments. Also new to this release SOLUTIONS FOR A are improvements in System Generator’s PROGAMMABLE WORLD Xcell Publications hardware-based co-simulation, improv- Vivado High-Level Synthesis ing verification runtime by 45x. To accelerate development of C-based IP, Vivado HLS can now launch the Vivado waveform viewer after running XILINX INTELLECTUAL This a C/RTL co-simulation to visualize the simulation waveforms. Just click on PROPERTY (IP) UPDATES the Open Wave Viewer toolbar button. Also new is support for half-precision Xilinx’s new LogiCORE™ IP subsystems year’s floating-point through the hls_half.h are highly configurable, market-tailored header file. The half-precision feature building blocks that integrate up to 80 in- allows for smaller and faster designs dividual IP cores, software drivers, design best while in many cases retaining suffi- examples and testbenches. Available with cient numerical precision. the Vivado Design Suite 2015.3 release are new IP subsystems for Ethernet, PCIe®, release. See the Vivado Design Suite User video processing, image sensor process- Guide: High-Level Synthesis (UG902) ing and OTN development. These IP sub- for more information. systems are based on industry standards such as the AMBA® AXI-4 interconnect System Generator for DSP protocol, IEEE P1735 encryption and IP- New for System Generator is support for XACT to enable interoperability with Xil- MATLAB® 2015B, including tighter in- inx and Alliance member IP and to accel- tegration that allows the HDL Coder to erate integration. automate the generation of a combined model containing high-level RTL and tar- The highly configurable Video Processing get-optimized IP. Subsystem supports a 4K2K video pipe. Comprehensive video support includes For better usability, Xilinx has enhanced VDMA, deinterlacer, chroma resampler the System Generator’s blockset for dig- and scaler. The subsystem can also easily ital upconverters and digital downcon- source and sync DisplayPort, HDMI and verters (DUC/DDC), greatly simplifying MIPI interfaces by leveraging the auto- this blockset for wireless algorithm de- matically generated AXI interfaces and velopment. Enhancements to improve Vivado IPI. The new video IP subsystem The definitive resource for software developers speeding verification and compile runtime have replaces Xillinx’s VIPP cores. been added to the new blocks, all of C/C++ & OpenCL code with Xilinx SDx IDEs & devices which are configured with seven or LEARN MORE fewer parameters. The digital FIR filter See the Vivado Design Suite 2015.3 Re- The Award-winning Xilinx Publication Group is publishing a brand new trade journal block tightly integrates with the Filter lease Notes for more information. specifically for the programmable FPGA Design and Analysis Tool from Math- software industry, focusing on users of Works to build area-efficient filters, in- Xilinx SDx™ development environments and QuickTake Video Tutorials high-level entry methods for programming cluding fixed-fractional interpolation Vivado Design Suite QuickTake tutorials Xilinx All Programmable devices. or decimation types. The sine-wave and are how-to videos that take a look inside complex-product blocks greatly simplify This is where you come in. the features of the Vivado Design Suite Xcell Software Journal is now accepting modulator design for frequency conver- and UltraFast™ Design Methodology. reservations for advertising opportunities sion at high sample rates. The requan- in this new, beautifully designed and written tize block enables quick manipulation of resource. Don’t miss this great opportunity See all QuickTake Videos here: www.xil- to get your product or service into the minds data types to maximize dynamic range at inx.com/training/vivado. of those who matter most. Call or write any point in the data path. today for your free advertising packet! Training System Generator for DSP also features For advertising inquiries (including For instructor-led training on the Viva- calendar and advertising rate card), new interactive cross-probing that ac- do Design Suite, UltraFast Design contact [email protected] celerates design exploration and pro- Methodology and more, visit www.xil- or call: 408-842-2627. vides iterative design closure. With the inx.com/training. cross-probing of timing analysis, algorithm developers can quickly identify Download Vivado Design Suite 2015.3 their critical paths and single out bot- today at http://www.xilinx.com/down- tlenecks that may affect throughput and load.

Fourth Quarter 2015 Xcell Journal 67 XCLAMATIONS! Xpress Yourself in Our Caption Contest

TRAVIS ROTHLISBERGER, director of device development at Cerevast Medical (Redmond, Wash.), won a shiny new Digilent Zynq Zybo board with this caption for the dartboard cartoon in Issue 92 of Xcell Journal: DANIEL GUIDERA

very designer knows that debugging is one of the trickiest steps in “To Ed’s delight, nobody questioned the flow, but at this lab, things have gotten out of hand. Xercise your him when he said that his Efunny bone as you help our beleaguered cartoon engineer swat some Monte Carlo analysis would take of the most recalcitrant bugs we’ve ever seen. We invite readers to submit an several months to complete.” engineering- or technology-related caption for this cartoon showing a worst- case debugging situation. The image might inspire a caption like “Henry Congratulations as well to longed for an automated way to debug his latest design, because doing it by our two runners-up: hand just wasn’t working.” Send your entries to [email protected]. Include your name, job title, “Keep debugging, I’m finalizing the company affiliation and location, and indicate that you have read the con- feature set and pricing right now!” test rules at www.xilinx.com/xcellcontest. After due deliberation, we will — Lee Courtney, CTO/VP-Product, print the submissions we like the best in the next issue of Xcell Journal. Qurasense (Menlo Park, Calif.) The winner will receive a Digilent Zynq Zybo board, featuring the Xilinx® ® Zynq -7000 All Programmable SoC (http://www.xilinx.com/products/ “I figure this is a better way to meet boards-and-kits/1-4AZFTE.htm). Two runners-up will gain notoriety, fame targets than what goes on upstairs!” and have their captions, names and affiliations featured in the next issue. – Larry Standage, principal The contest begins at 12:01 a.m. Pacific Time on Oct. 15, 2015. All entries applications engineer, must be received by the sponsor by 5 p.m. PT on Jan. 6, 2016. Microchip Technology Inc. (Chandler, Ariz.)

NO PURCHASE NECESSARY. You must be 18 or older and a resident of the fifty United States, the District of Columbia, or Canada (excluding Quebec) to enter. Entries must be entirely original. Contest begins on Oct. 15, 2015. Entries must be received by 5:00 pm Pacific Time (PT) Jan. 6, 2016. Official rules are available online at www.xilinx.com/xcellcontest. Sponsored by Xilinx, Inc. 2100 Logic Drive, San Jose, CA 95124.

68 Xcell Journal Fourth Quarter 2015 Get on Target

Let Xcell Publications help you get your message out to thousands of programmable logic users.

Hit your marketing target by advertising your product or service in the Xilinx Xcell Journal, you’ll reach thousands of engineers, designers, and engineering managers worldwide!

The Xilinx Xcell Journal is an award-winning publication, dedicated specifically to helping programmable logic users – and it works.

We offer affordable advertising rates and a variety of advertisement sizes to meet any budg et!

Call today : (800) 493-5551 or e-mail us at [email protected]

Join the other leaders in our industry and advertise in the Xcell Journal!

www.xilinx.com/xcell/ n $55 Zynq-based Wireless Snickerdoodle single-board computer with WiFi, Bluetooth launched today on CrowdSupply n Tiny 100x62mm, Zynq-based Avnet PicoZed SDR implements 2x2 MIMO, 70MHz to 6GHz radio using ADI RF Agile Transceiver n ARTY—the $99 Artix-7 FPGA Dev Board/Eval Kit with Arduino I/O and $3K worth of Vivado software. Wait, What???? n Lift-off! 16nm Zynq UltraScale+ MPSoC ships to customers. From tapeout to “Hello World” in 2.5 months. n Zynq-based, $179 Skreens Nexus on Kickstarter allows you to safely cross the (HDMI video) streams