Cray XD1™ FPGA Development Private S–6400–14 © 2006 Cray Inc. All Rights Reserved. Unpublished Private Information. This unpublished work is protected by trade secret, copyright and other laws. Except as permitted by contract or express written permission of Cray Inc., no part of this work or its content may be used, reproduced or disclosed in any form.

U.S. GOVERNMENT RESTRICTED RIGHTS NOTICE The Computer Software is delivered as "Commercial Computer Software" as defined in DFARS 48 CFR 252.227-7014. All Computer Software and Computer Software Documentation acquired by or for the U.S. Government is provided with Restricted Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7014, as applicable. Technical Data acquired by or for the U.S. Government, if any, is provided with Limited Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable.

Autotasking, Cray, Cray Channels, Cray Y-MP, GigaRing, LibSci, UNICOS and UNICOS/mk are federally registered trademarks and Active Manager, CCI, CCMT, CF77, CF90, CFT, CFT2, CFT77, ConCurrent Maintenance Tools, COS, Cray Ada, Cray Animation Theater, Cray APP, Cray Apprentice2, Cray C++ Compiling System, Cray C90, Cray C90D, Cray CF90, Cray EL, Cray Fortran Compiler, Cray J90, Cray J90se, Cray J916, Cray J932, Cray MTA, Cray MTA-2, Cray MTX, Cray NQS, Cray Research, Cray SeaStar, Cray S-MP, Cray SHMEM, Cray SSD-T90, Cray SuperCluster, Cray SV1, Cray SV1ex, Cray SX-5, Cray SX-6, Cray T3D, Cray T3D MC, Cray T3D MCA, Cray T3D SC, Cray T3E, Cray T90, Cray T916, Cray T932, Cray UNICOS, Cray X1, Cray X1E, Cray XD1, Cray X-MP, Cray XMS, Cray XT3, Cray XT4, Cray Y-MP EL, Cray-1, Cray-2, Cray-3, CrayDoc, CrayLink, Cray-MP, CrayPacs, Cray/REELlibrarian, CraySoft, CrayTutor, CRInform, CRI/TurboKiva, CSIM, CVT, Delivering the power..., Dgauss, Docview, EMDS, Gigaring, HEXAR, HSX, IOS, ISP/Superlink, Libsci, MPP Apprentice, ND Series Network Disk Array, Network Queuing Environment, Network Queuing Tools, OLNET, RapidArray, RQS, SEGLDR, SMARTE, SSD, SUPERLINK, System Maintenance and Remote Testing Environment, Trusted UNICOS, TurboKiva, UNICOS MAX, UNICOS/lc, and UNICOS/mp are trademarks of Cray Inc.

AMD and Opteron are trademarks of Advanced Micro Devices, Inc. Celoxica is a trademark of Celoxica Limited. ChipScope, CORE Generator, ISE, SelectRAM+, System Generator for DSP, Virtex, Xilinx, XST, and XtremeDSP are trademarks of Xilinx, Inc. Cygwin is a trademark of Red Hat, Inc. GNU is a trademark of The Free Software Foundation. IBM and PowerPC are trademarks of International Business Machines Corporation. Linux is a trademark of Linus Torvalds. MATLAB and Simulink are trademarks of The MathWorks, Inc. ModelSim is a trademark of Mentor Graphics Corporation. Riviera is a trademark of Aldec, Inc. Specman is a trademark of Verisity Design, Inc. Synopsys is a trademark of Synopsys, Inc. Synplicity is a trademark of Synplicity, Inc. Verilog is a trademark of Gateway Design Automation Corporation. Windows is a trademark of Microsoft Corporation. All other trademarks are the property of their respective owners.

New Features

Cray XD1™ FPGA Development S–6400–14

This manual contains the following changes from the previous version:

• Incorporates chapters on developing software applications, including the API details. These were moved (and restructured) from Cray XD1 Programming (S–2433) to this manual.
• Some existing chapters are restructured and renamed.
• Describes the interrupts feature, including how the FPGA generates an interrupt and how the application program enables interrupt processing with the new fpga_int_wait(3) function.
• Documents the clock frequency requirement for all designs.
• Describes the new command-line utilities for reading and writing values in the FPGA: fpga_read(1) and fpga_write(1).

Record of Revision

Version Description

1.4 May 2006 Supports Cray XD1 release 1.4.

1.3.1 October 2005 Supports Cray XD1 release 1.3.1 (1.3 general availability).

1.3 July 2005 Supports Cray XD1 release 1.3 (limited availability).

1.2 April 2005 Supports Cray XD1 releases 1.2 and 1.2.1.

1.1 October 2004 Supports Cray XD1 release 1.1.

1.0 August 2004 Supports Cray XD1 release 1.0.

S–6400–14 Cray Private i

Contents

Page

Preface xiii Accessing Product Documentation ...... xiii Conventions ...... xiv Reader Comments ...... xv Cray User Group ...... xv Cray XD1 Support ...... xvi

Part I: Introduction

Introduction [1] 3 About This Manual ...... 3 Who Should Read this Manual ...... 3 Scope of this Manual ...... 3 How this Manual is Organized ...... 3 Related Publications ...... 4 Cray XD1 Publications ...... 4 Third-party Publications ...... 5 About FPGA Development ...... 6 What is an FPGA AAP? ...... 6 Advantages and Disadvantages of FPGAs ...... 6 Target Applications ...... 8 Characteristics of Target Applications ...... 8 Sample Application ...... 9 The Development Process ...... 10


Part II: The Cray XD1 System

Cray XD1 Architecture [2] 15 Cray XD1 High-level Physical Layout ...... 15 Chassis ...... 15 Compute Blade ...... 17 Expansion Module ...... 18 RapidArray Interconnect ...... 19

Expansion Module [3] 21 Expansion Module Variants ...... 21 FPGA AAP ...... 21 Configurable Logic Blocks ...... 22 Block SelectRAM+ Memory Modules ...... 22 Embedded Multiplier Blocks ...... 22 XtremeDSP Slices ...... 22 Digital Clock Manager Blocks ...... 23 Embedded IBM PowerPC 405 RISC Processor Blocks ...... 23 Programmability ...... 23 Comparison of Xilinx FPGAs ...... 23 RapidArray Processor ...... 24 QDR II SRAM ...... 24 Programmable Clock ...... 24 Communication Paths ...... 24

Part III: Developing FPGA Logic

Quick Start [4] 29 Reference Design Overview ...... 29 Command-line Manipulation of FPGAs ...... 30 Overview ...... 30 Copying the Design Directory ...... 30

Converting FPGA Binary Files ...... 31 Downloading Files to the FPGA ...... 32 Accessing the FPGA ...... 32 Erasing the FPGA ...... 34 Running the C Reference Programs ...... 34

Hardware Development Flows [5] 37 Basic HDL Development Flow ...... 37 Overview of the HDL Development Process ...... 37 Design Entry ...... 38 Synthesis ...... 39 Simulation ...... 39 Implementation ...... 40 Development Flow Using Software-Oriented Languages ...... 40 Design Entry ...... 41 Synthesis ...... 42 Simulation ...... 42 Implementation ...... 42 Third-Party Vendors and Tools ...... 42 Celoxica, Inc...... 43 DSPlogic, Inc...... 43 Impulse Accelerated Technologies, Inc...... 43 Mitrionics Inc...... 43

General Design Considerations [6] 45 Clock Frequency ...... 45 FPGA Memory Resources ...... 45 Fabric Bandwidth and Data Flow ...... 47 SMP Processor-initiated Fabric Transactions ...... 48 FPGA-initiated Fabric Transactions ...... 48 Limitations ...... 48


Using the Design Template [7] 49 Overview ...... 49 Contents of the Design Template ...... 52 Design Files ...... 52 Directory Structure ...... 53 Working with the Design Template ...... 56 Copying the Design Template ...... 56 Procedure 1: To copy the design template ...... 56 Tools ...... 56 Required Customizations ...... 57 Command Line Execution and Makefile Targets ...... 58 Example 1: Using makefile targets ...... 59 Using the Xilinx ISE GUI ...... 59 Simulating the Design ...... 60

Interfaces [8] 63 RapidArray Transport Interface ...... 63 QDR II SRAM Interface ...... 65 Interfacing User Logic to Other Cray XD1 Resources ...... 67 Interfacing Considerations ...... 67 Disabling Unused Core Interfaces ...... 68 Interaction with the SMP Software Application ...... 68 Memory Map ...... 69 SMP-initiated RT Requests ...... 70 I/O Mapped Accesses ...... 71 Example 2: I/O mapped writes to the AAP ...... 71 API Function Accesses ...... 72 Example 3: Accessing the AAP with fpga_wrt_appif_val ...... 72 FPGA-initiated RT Requests ...... 73 Example 4: Initializing the AAP to access the SMP memory ...... 74 FPGA-initiated Interrupt Requests ...... 74 Example 5: Generating an interrupt request (VHDL notation) ...... 74

Simulation and Debugging [9] 77 Simulation Models ...... 77 Cray XD1 Core Simulation Models ...... 77 RapidArray Fabric Behavioral Model ...... 77 Fabric Model Inputs ...... 77 Example 6: Format of the fabric.in file...... 79 FPGA Transfer Region ...... 80 Using the JTAG Interface Card ...... 81 The JTAG Interface Card ...... 81 Mapping JTAG Interface Ports to FPGAs ...... 81 Viewing JTAG Interface Port Connections ...... 82 Connecting a JTAG Interface Port to an FPGA ...... 83 Example 7: Connecting a JTAG interface port to an FPGA ...... 83 Restoring the Default JTAG Interface Port Connections ...... 83 Connecting a Workstation to a JTAG Interface Port ...... 83 Reading and Writing Data Values ...... 85 Reading Data from the FPGA ...... 86 Example 8: Using the fpga_read command ...... 86 Writing Data to the FPGA ...... 87 Example 9: Using the fpga_write command ...... 87

Part IV: Developing Accelerated Software Applications

Managing FPGA Logic from the Command Line [10] 91 Converting a Raw Logic File to Loadable Form ...... 91 Procedure 2: To convert a raw logic file to loadable form ...... 91 Loading FPGA Logic into the Device ...... 92 Resetting an FPGA ...... 93 Releasing an FPGA from Reset State ...... 93


Querying the Status of an FPGA ...... 93 Erasing an FPGA ...... 94

Using an FPGA in an Application Program [11] 95 Role of the Software Programmer in FPGA Logic Development ...... 95 Overview of Using an FPGA in an Application Program ...... 95 Typical Application Workflow...... 95 Understanding Address Spaces on a Node ...... 96 Data Transfer Methods ...... 96 Using Interrupts ...... 97 Summary of API Functions ...... 97 Compiling and Linking ...... 98 FPGA API Library Functions ...... 99 Opening an FPGA: fpga_open ...... 99 Loading FPGA Logic into the Device: fpga_load ...... 100 Resetting an FPGA: fpga_reset ...... 101 Releasing an FPGA from Reset State: fpga_start ...... 102 Mapping FPGA Locations to the Application Address Space: fpga_memmap ...... 103 Synchronizing Accesses to FPGA Locations: fpga_mem_sync ...... 105 Writing and Reading Individual FPGA Locations ...... 107 Writing to an FPGA Location: fpga_wrt_appif_val ...... 108 Reading from an FPGA Location: fpga_rd_appif_val ...... 109 Accessing Application Memory from an FPGA: fpga_register_ftrmem and fpga_dereg_ftrmem ...... 110 Enabling and Waiting for Interrupts: fpga_int_wait ...... 112 Checking the Status of an FPGA: fpga_status ...... 114 Checking the Programming State of an FPGA: fpga_is_loaded ...... 115 Erasing an FPGA: fpga_unload ...... 116 Closing an FPGA: fpga_close ...... 117


Sample Application: Using the Mersenne Twister Accelerator [12] 119 Understanding the Application ...... 119 Algorithm ...... 119 High-level Design of Application and FPGA Logic ...... 119 Some Design Details ...... 120 Walkthrough ...... 122 Building and Running the Application ...... 123 Procedure 3: To build and run the application ...... 123

Part V: Appendixes

Appendix A Troubleshooting 129 Node Hangs After Accessing /proc/ufp/regs ...... 129 Cause ...... 129 Discussion ...... 129 File /dev/ufp0 Does Not Exist (Interaction with the FPGA AAP Fails) ...... 129 Cause ...... 129 Discussion ...... 129 Procedure 4: To load the FPGA driver ...... 130

Appendix B Program Listing: mta_test.c 131

Glossary 143

Index 147

Tables Table 1. Related Cray XD1 publications ...... 5 Table 2. VHDL references ...... 6 Table 3. Results of random number generation ...... 10 Table 4. Current FPGA AAP variants ...... 21 Table 5. Resources in Xilinx FPGAs ...... 24 Table 6. Available memory devices ...... 46


Table 7. ufpapps subdirectory descriptions ...... 51 Table 8. Subdirectories of vhdl_template/src/design_name ...... 55 Table 9. Recommended software packages ...... 57 Table 10. Design template customizations ...... 57 Table 11. Makefile targets ...... 58 Table 12. Commands supported in the fabric.in file...... 78 Table 13. Arguments of the fabric.in commands ...... 78 Table 14. JTAG interface card default connections ...... 82 Table 15. API summary ...... 97 Table 16. fpga_open(3) arguments and return value ...... 100 Table 17. fpga_load(3) arguments and return value ...... 101 Table 18. fpga_reset(3) arguments and return value ...... 102 Table 19. fpga_start(3) arguments and return value ...... 103 Table 20. fpga_memmap(3) arguments and return value ...... 104 Table 21. fpga_mem_sync(3) arguments and return value ...... 107 Table 22. fpga_wrt_appif_val(3) arguments and return value ...... 108 Table 23. fpga_rd_appif_val(3) arguments and return value ...... 110 Table 24. fpga_register_ftrmem(3) and fpga_dereg_ftrmem(3) arguments and return value ...... 112 Table 25. fpga_int_wait(3) arguments and return value ...... 113 Table 26. fpga_status(3) arguments and return value ...... 115 Table 27. fpga_is_loaded(3) arguments and return value ...... 116 Table 28. fpga_unload(3) arguments and return value ...... 117 Table 29. fpga_close(3) arguments and return value ...... 118 Table 30. MTA registers ...... 121

Figures Figure 1. FPGA executes multiple codes simultaneously ...... 7 Figure 2. Random number example ...... 9 Figure 3. Developing and using an FPGA application ...... 11 Figure 4. Cray XD1 chassis, physical view ...... 16

Figure 5. Cray XD1 chassis, logical view ...... 17 Figure 6. Compute blade ...... 18 Figure 7. Expansion module, physical view ...... 19 Figure 8. RapidArray components in a Cray XD1 chassis ...... 20 Figure 9. Expansion module, logical view ...... 25 Figure 10. HDL development flow...... 38 Figure 11. Development flows with high-level tools ...... 41 Figure 12. FPGA memory hierarchy ...... 46 Figure 13. Design directories in /opt/ufpapps ...... 50 Figure 14. FPGA organization ...... 53 Figure 15. Subdirectories of vhdl_template/src ...... 54 Figure 16. HDL test bench organization ...... 60 Figure 17. RT interface ...... 64 Figure 18. QDR II IP Core interface ...... 66 Figure 19. Physical components of a node and related address spaces ...... 70 Figure 20. JTAG interface card ...... 81 Figure 21. Ports on the JTAG interface card ...... 82 Figure 22. JTAG I/O adapter ...... 84 Figure 23. Accessing the JTAG interface card ...... 85


Preface

The information in this preface is common to Cray documentation provided with this software release.

Accessing Product Documentation

With each software release, Cray provides books and man pages, and in some cases, third-party documentation. These documents are provided in the following ways:

CrayDoc
    The Cray documentation delivery system that allows you to quickly access and search Cray books, man pages, and in some cases, third-party documentation. Access this HTML and PDF documentation via CrayDoc at the following locations:
    • The local network location defined by your system administrator
    • The CrayDoc public website: docs.cray.com

Man pages
    Access man pages by entering the man command followed by the name of the man page. For more information about man pages, see the man(1) man page by entering:

% man man

Third-party documentation
    Access third-party documentation not provided through CrayDoc according to the information provided with the product.


Conventions

These conventions are used throughout Cray documentation:

Convention    Meaning

command       This fixed-space font denotes literal items, such as file names, pathnames, man page names, command names, and programming language elements.

variable      Italic typeface indicates an element that you will replace with a specific value. For instance, you may replace filename with the name datafile in your program. It also denotes a word or concept being defined.

user input    This bold, fixed-space font denotes literal items that the user enters in interactive sessions. Output is shown in nonbold, fixed-space font.

[ ]           Brackets enclose optional portions of a syntax representation for a command, library routine, system call, and so on.

...           Ellipses indicate that a preceding element can be repeated.

name(N)       Denotes man pages that provide system and programming reference information. Each man page is referred to by its name followed by a section number in parentheses. Enter:

% man man

to see the meaning of each section number for your particular system.


Reader Comments

Contact us with any comments that will help us to improve the accuracy and usability of this document. Be sure to include the title and number of the document with your comments. We value your comments and will respond to them promptly. Contact us in any of the following ways:

E-mail: [email protected]

Telephone (inside U.S., Canada): 1–800–950–2729 (Cray Customer Support Center)
Telephone (outside U.S., Canada): +1–715–726–4993 (Cray Customer Support Center)

Mail:
Customer Documentation
Cray Inc.
1340 Mendota Heights Road
Mendota Heights, MN 55120–1128
USA

Cray User Group

The Cray User Group (CUG) is an independent, volunteer-organized international corporation of member organizations that own or use Cray Inc. computer systems. CUG facilitates information exchange among users of Cray systems through technical papers, platform-specific e-mail lists, workshops, and conferences. CUG memberships are by site and include a significant percentage of Cray computer installations worldwide. For more information, contact your Cray site analyst or visit the CUG website at www.cug.org.


Cray XD1 Support

Obtain support for the Cray XD1 product in either of the following ways:

Telephone: 1–888–279–2729 (Cray XD1 Customer Support Center)
Through the CRInform website: http://crinform.cray.com/xd/

Note: Use the contact information provided here if you have a support agreement with Cray. If, however, you have a support agreement with a third-party organization that is a Cray channel partner, contact that organization instead: do not contact Cray directly.

Part I: Introduction

Introduction [1]

This chapter describes this manual and provides a high-level introduction to the field-programmable gate array (FPGA) application acceleration processor and the process of developing applications for it.

1.1 About This Manual

This section identifies the intended audience for this manual, and describes its scope and organization. It also lists related publications.

1.1.1 Who Should Read this Manual

This manual addresses two distinct audiences:

• Experienced hardware or FPGA designers who are proficient in logic design with hardware description languages (HDLs)
• Software application programmers who are proficient in the C language

1.1.2 Scope of this Manual

This manual is a guide to the FPGA application acceleration processor, an optional component in a Cray XD1 system. It describes how this processor fits into the Cray XD1 hardware architecture, and provides guidelines on designing, developing, and integrating FPGA binaries for use in the system. It also describes in general how to use the resulting FPGA logic in an application program and describes in detail the FPGA application programming interface (API).

This document has two companion documents that provide in-depth logic design details:

• Design of Cray XD1 RapidArray Transport Core (S–6411)
• Design of Cray XD1 QDR II SRAM Core (S–6412)

1.1.3 How this Manual is Organized

This manual consists of the following parts (besides this introduction):

• Part II, The Cray XD1 System
  This part describes the Cray XD1 architecture with an emphasis on how the FPGA application acceleration processor (AAP) fits into it, and it describes the interfaces on the expansion modules.
• Part III, Developing FPGA Logic
  This part starts with a high-level guide to managing an existing FPGA logic file from the command line. It then covers the main aspects of developing an FPGA logic file for the Cray XD1 system: the standard HDL development workflow, general design considerations, specific considerations for the Cray XD1 system (including using the design template and understanding the interaction between an application program and the FPGA logic), and simulation and debugging.
• Part IV, Developing Accelerated Software Applications
  This part includes information primarily for the software application programmer: a detailed description of managing an FPGA logic file from the command line, general concepts for application programs, a detailed description of the API, and discussion of a sample application program.
• Part V, Appendixes
  The appendixes include a brief troubleshooting guide and the complete listing of the sample application program.

1.1.4 Related Publications

This section lists Cray and third-party publications that supplement the contents of this manual.

1.1.4.1 Cray XD1 Publications

Refer to the publications in Table 1, page 5 for related information about the Cray XD1 system.


Table 1. Related Cray XD1 publications

Publication title
    Brief description

Cray XD1 Release Description (S–2453)
    Identifies the main new features and enhancements in a particular release of the product. Includes information about the hardware, embedded software, and Linux-based software of the system.

Cray XD1 System Overview (S–2429)
    Overview of the Cray XD1 computer and a description of its hardware and software components.

Cray XD1 Programming (S–2433)
    Identifies the development tools and libraries on the Cray XD1 system.

Design of Cray XD1 RapidArray Transport Core (S–6411)
    Companion to Cray XD1 FPGA Development (S–6400). Provides in-depth design details for the Cray RapidArray Transport core.

Design of Cray XD1 QDR II SRAM Core (S–6412)
    Companion to Cray XD1 FPGA Development (S–6400). Provides in-depth design details for the Cray core that provides the interface to the QDR II SRAM.

Cray XD1 Release Notes (S–2455)
    Information about resolved issues and known issues for a particular release of the product.

1.1.4.2 Third-party Publications

The Cray HDL template and reference designs are written in VHDL. Table 2, page 6 lists publications that are useful references for designers who work in VHDL.


Table 2. VHDL references

Title                            Author, Publisher
VHDL for Programmable Logic      Kevin Skahill, Cypress Semiconductor
VHDL for Logic Synthesis         Andrew Rushton, John Wiley & Sons Ltd.
The Designer’s Guide to VHDL     Peter J. Ashenden, Morgan Kaufmann Publishers

1.2 About FPGA Development

This section briefly describes the FPGA AAP, its advantages and disadvantages, the types of application that can benefit from using it, and the overall steps in developing an FPGA-accelerated application.

1.2.1 What is an FPGA AAP?

The FPGA application acceleration processors in a Cray XD1 system are Xilinx field-programmable gate arrays. This manual refers to them as FPGA AAPs or, where the context is clear, simply as FPGAs or AAPs. (The Cray XD1 system also uses FPGAs for other purposes.)

An FPGA AAP is an optional component of the optional expansion module that may be added to each compute blade in a Cray XD1 chassis. (For more information on these and other components of the Cray XD1 system, see Cray XD1 System Overview (S–2429).)

You can use the FPGA AAPs as coprocessors to implement selected algorithms or parts of algorithms in hardware. FPGA AAPs can significantly improve the performance of some applications. Typically, they achieve this in one or both of the following ways:

• Performing the same processing step on multiple data elements in parallel
• Pipelining multiple sequential processing steps that are performed on each element of a large data set
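The two acceleration patterns above can be pictured from the software side. The following C fragment is only an illustrative sketch, not code from the Cray XD1 software: the stage functions and process_all are hypothetical names chosen for this example. It shows a loop whose iterations are independent (so several elements could be handled at once) and whose body is a fixed sequence of steps (so the steps could form a hardware pipeline).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-element processing steps. Each step depends only on
 * its own input value, never on another array element. */
static uint32_t stage_scale(uint32_t x)  { return x * 3u; }
static uint32_t stage_offset(uint32_t x) { return x + 7u; }
static uint32_t stage_mask(uint32_t x)   { return x & 0xFFFFu; }

/* Apply the three steps, in order, to every element of in[]. */
void process_all(const uint32_t *in, uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* No iteration reads a result produced by another iteration,
         * so the same steps could run on many elements in parallel. */
        uint32_t a = stage_scale(in[i]);
        uint32_t b = stage_offset(a);
        /* The fixed scale -> offset -> mask sequence is the kind of
         * step chain that could be laid out as pipeline stages. */
        out[i] = stage_mask(b);
    }
}
```

On a microprocessor this loop runs one element and one step at a time. A loop with this shape is a reasonable first candidate to examine when deciding what to move into FPGA logic.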

1.2.2 Advantages and Disadvantages of FPGAs

The power of a field-programmable gate array (FPGA) as an application acceleration processor is its ability to perform well in areas that a microprocessor does not, rather than its ability to compete directly with a standard microprocessor. The two main advantages an FPGA has over a microprocessor are as follows:

• FPGAs have a flexible architecture. You can customize and optimize the logic in the FPGA to perform only the required tasks.
• FPGAs can exploit parallelism. General-purpose microprocessors have limited support for instruction or data parallelism. At most, they can perform a handful of operations in parallel, and this is difficult to control from a high-level programming language. FPGAs can perform thousands of operations in parallel (see footnote 1).

Figure 1, page 7 illustrates that the 64-bit compute processor is limited to linear execution and fixed word length, whereas the FPGA can manipulate multiple variable-length data items concurrently by using fine-grained parallelism.

[Figure: a compute processor executes a "do for each array element ... end" loop over a data set, while the application acceleration FPGA applies fine-grained parallelism for a 100x potential speedup.]

Figure 1. FPGA executes multiple codes simultaneously

An FPGA also has disadvantages. The key disadvantage is speed. A microprocessor has a fixed set of highly optimized functional blocks that can run at a very high clock rate. The generic logic building blocks and interconnect of an FPGA cannot support the same clock speeds. As a general rule, an FPGA supports a maximum clock rate that is approximately one-tenth that of a processor like the Opteron.

Footnote 1: The achievable parallelism depends on the size and complexity of the operation, the size of the FPGA, and the opportunities for parallelism in the algorithm.

1.2.3 Target Applications

This section specifies the characteristics of target applications in general and describes one sample application.

1.2.3.1 Characteristics of Target Applications

The following characteristics of applications make them candidates for acceleration by FPGAs:

• The processor cannot reach its maximum performance due to loop overhead, cache thrashing, inefficient instructions, and so on.
• The processor cannot exploit data or instruction parallelism that is inherent in the function.

Some examples of real-world applications that exhibit these characteristics are as follows:

• Searching (generic data streams as well as specialized searches for genetic applications)
• Sorting
• Digital signal processing
• Image processing and recognition
• Graphics acceleration
• Encryption and decryption
• Coding and decoding
• Error correction
• Random number generation
• Some mathematical algorithms
• Bit manipulation


1.2.3.2 Sample Application

The Mersenne Twister random number algorithm can efficiently generate a large number of pseudorandom numbers for use in computations such as Monte Carlo analysis. The FPGA uses this algorithm to generate uniformly distributed integers in a sequence that does not repeat for 2^19937 – 1 values. The FPGA delivers slightly more than three times the performance of the fastest available Opteron processor (at the time of writing). In addition, while the Opteron processor uses its full capacity to execute the algorithm, the FPGA uses less than 20% of its capacity.
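For readers unfamiliar with the algorithm, the following is a minimal C sketch of the publicly documented 32-bit Mersenne Twister (MT19937). It is not the Cray reference program or the FPGA logic. The names mt_seed and mt_next are chosen for this example only, and the single global state is a simplification.

```c
#include <stdint.h>

#define MT_N 624   /* number of 32-bit words of state */
#define MT_M 397   /* offset used by the recurrence   */

static uint32_t mt_state[MT_N];
static int mt_index = MT_N;   /* MT_N means "state must be regenerated" */

/* Initialize the state from a 32-bit seed (standard initializer). */
void mt_seed(uint32_t seed)
{
    mt_state[0] = seed;
    for (int i = 1; i < MT_N; i++) {
        uint32_t prev = mt_state[i - 1];
        mt_state[i] = 1812433253u * (prev ^ (prev >> 30)) + (uint32_t)i;
    }
    mt_index = MT_N;
}

/* Regenerate all MT_N state words (the "twist" step). */
static void mt_twist(void)
{
    for (int i = 0; i < MT_N; i++) {
        /* Combine the top bit of word i with the low 31 bits of word i+1. */
        uint32_t y = (mt_state[i] & 0x80000000u)
                   | (mt_state[(i + 1) % MT_N] & 0x7fffffffu);
        uint32_t next = mt_state[(i + MT_M) % MT_N] ^ (y >> 1);
        if (y & 1u)
            next ^= 0x9908b0dfu;
        mt_state[i] = next;
    }
    mt_index = 0;
}

/* Return the next 32-bit pseudorandom value. Call mt_seed() first. */
uint32_t mt_next(void)
{
    if (mt_index >= MT_N)
        mt_twist();

    uint32_t y = mt_state[mt_index++];

    /* Tempering: improves the distribution of the output bits. */
    y ^= y >> 11;
    y ^= (y << 7) & 0x9d2c5680u;
    y ^= (y << 15) & 0xefc60000u;
    y ^= y >> 18;
    return y;
}
```

A caller seeds once with mt_seed and then calls mt_next repeatedly. With the conventional default seed 5489, the first value returned is 3499211612, which matches the published reference implementation. The software version produces one value per call, in sequence.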

The performance of the FPGA is limited only by the speed at which numbers can be written into SMP memory, not by the FPGA logic. Figure 2, page 9 illustrates the path that the FPGA uses to access the compute processor memory. The RAP is the RapidArray processor.

[Figure: the Mersenne Twister RNG in the FPGA sends pseudo-random numbers through the RAP to the processor.]

Figure 2. Random number example

Because the design consumes only a small amount of the FPGA, a great deal of capacity remains to provide post-processing functions for the random numbers. For example, the logic can convert the integer values to floating point values, or it can convert the uniform distribution to a normal distribution in an additional pipeline stage. An Opteron processor must perform these functions serially, which further degrades its performance. Table 3, page 10 compares the random-number-generation capabilities of an Opteron processor and an FPGA.
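As a software-side illustration of the two post-processing steps mentioned above, the following C sketch converts a 32-bit integer to a floating-point value in [0, 1) and approximates a standard normal value. Both functions are hypothetical names for this example. The normal conversion uses the simple sum-of-twelve-uniforms approximation, which is an assumption made for brevity; the source does not say which method the FPGA logic would use.

```c
#include <stdint.h>

/* Map a 32-bit unsigned integer to a double in the range [0, 1). */
double u32_to_unit(uint32_t x)
{
    return (double)x / 4294967296.0;   /* divide by 2^32 */
}

/* Approximate a standard normal value (mean 0, variance 1) from twelve
 * uniform 32-bit inputs. The sum of twelve uniform [0, 1) values has
 * mean 6 and variance 1, so subtracting 6 centers the result.
 * This is an approximation; its range is limited to [-6, 6). */
double approx_normal(const uint32_t u[12])
{
    double sum = 0.0;
    for (int i = 0; i < 12; i++)
        sum += u32_to_unit(u[i]);
    return sum - 6.0;
}
```

In software, each conversion adds work to every generated number. In FPGA logic, a conversion of this kind could occupy its own pipeline stage after the generator.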


Table 3. Results of random number generation

                                      2.2 GHz Opteron Processor    FPGA (XC2VP50-7 @ 200 MHz)
Source                                Original C code              VHDL code
Speed (32-bit integers per second)    ~101 million                 ~319 million
Capacity used                         100%                         < 20% (includes RT core)

Chapter 12, page 119 describes this application in more detail.
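The figures in Table 3 can be related to the clock-rate rule of thumb given earlier. The following C sketch does the arithmetic. The constants are the approximate values from the table, and the function names are chosen for this example only.

```c
/* Approximate figures taken from Table 3. */
#define OPTERON_RATE   101.0e6   /* 32-bit integers per second */
#define FPGA_RATE      319.0e6   /* 32-bit integers per second */
#define OPTERON_CLOCK  2.2e9     /* Hz */
#define FPGA_CLOCK     200.0e6   /* Hz */

/* Overall throughput ratio of the FPGA to the Opteron. */
double fpga_speedup(void)
{
    return FPGA_RATE / OPTERON_RATE;
}

/* Ratio of the FPGA clock rate to the Opteron clock rate. */
double clock_ratio(void)
{
    return FPGA_CLOCK / OPTERON_CLOCK;
}

/* Average number of integers produced per clock cycle. */
double per_cycle(double rate, double clock_hz)
{
    return rate / clock_hz;
}
```

With these values, the speedup is about 3.16 while the FPGA clock runs at about 0.09 of the Opteron clock. The FPGA averages roughly 1.6 integers per clock cycle, compared with roughly 0.05 for the Opteron, which is where the parallel hardware recovers the clock-rate disadvantage.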

1.2.4 The Development Process

An FPGA-accelerated application requires two streams of development: the FPGA logic (hardware) and the software application. Figure 3, page 11 summarizes at a high level the overall steps in these two streams. Chapter 5, page 37 describes the hardware development process in more detail. This manual assumes that you are already familiar with the software development process.

The two development streams use different methods and tools and are typically performed by different people. However, the hardware and software developers must cooperate closely to design suitable methods by which the FPGA and the software application interact. Much of this manual is devoted to the various aspects of this interaction.

The logic that is programmed into an FPGA exists first as a binary file that encodes the required hardware design. The FPGA utility program, fcu, converts the raw FPGA binary file that the FPGA development tools produce to a form that the system can use.

The FPGA application programming interface (API) library provides the functions that let the software application use the FPGA once the converted FPGA logic file exists. You can perform certain control operations, such as loading the logic into the device, putting the logic into reset or taking it out of reset, and unloading the device, either from the command line or with an API function call in the application program. Figure 3, page 11 illustrates both methods of loading the logic. The API provides additional functions that support the interaction between the software application and the FPGA logic, primarily for various forms of data transfer and for handling interrupts. Chapter 11, page 95 describes the API in detail.


[Figure: two development streams that meet at the FPGA.
Hardware stream: design FPGA logic; hardware description language files; synthesize, place, and route; raw logic file (file.bin); convert with fcu --convert; converted logic file (default: file.ufp); load logic with fcu --load (optional) or with fpga_load() (optional); FPGA.
Software stream: code application program; application source files and ufplib.h; compile; application object files and libufp.a; link; application executable; execute application.
Also shown: create header file with fcu --build, producing the header file (default: ufphdr).]

Figure 3. Developing and using an FPGA application

Part II: The Cray XD1 System

Cray XD1 Architecture [2]

This chapter describes the Cray XD1 system from a hardware perspective. It covers the high-level physical layout of a Cray XD1 system, as well as the RapidArray interconnect. This chapter focuses on the environment of the application acceleration processor (AAP), an optional component in a Cray XD1 system. The AAP is a field-programmable gate array (FPGA).

2.1 Cray XD1 High-level Physical Layout

The basic architectural unit of the Cray XD1 system is the Cray XD1 chassis. A large number of chassis may be connected together to form a Cray XD1 system.

2.1.1 Chassis

A Cray XD1 chassis may contain a maximum of 31 commodity and specialized processors that reside on a maximum of 6 compute blades. Figure 4, page 16 shows a physical view of the Cray XD1 chassis.

[Figure 4 labels: PCI-X I/O expansion module, power supply, compute blades, fabric expansion, PCI-X expansion, disk blades, fan assembly, main board]

Figure 4. Cray XD1 chassis, physical view

In a fully populated chassis, these processors are distributed as follows:

• Twelve 64-bit AMD Opteron processors configured as six two-way symmetric multiprocessors (SMPs) that run Linux.
• Twelve RapidArray processors: processors, designed by Cray, that process most of the communications within a chassis.
• Six application acceleration processors: optional FPGAs that act as coprocessors to the Opteron processors.
• A management processor that runs software to monitor and manage the chassis hardware.

Figure 5, page 17 illustrates the logical distribution of these processors.
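The distribution above accounts for the 31-processor maximum stated in Section 2.1.1. The following Python sketch only restates that arithmetic; the dictionary and function names are illustrative and are not part of any Cray software.

```python
# Processor counts for a fully populated chassis, as listed in the text.
FULL_CHASSIS = {
    "Opteron processors": 12,
    "RapidArray processors": 12,
    "application acceleration processors (FPGAs)": 6,
    "management processor": 1,
}


def total_processors(counts):
    """Sum the per-type processor counts."""
    return sum(counts.values())


if __name__ == "__main__":
    for name, count in FULL_CHASSIS.items():
        print(f"{count:2d}  {name}")
    print(f"{total_processors(FULL_CHASSIS):2d}  total")
```

The total matches the stated maximum of 31 processors on a maximum of 6 compute blades.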

[Figure 5 labels: management processor, Opteron processors, RapidArray processors (RAPs), application acceleration processor (FPGA)]

Figure 5. Cray XD1 chassis, logical view

A Cray XD1 computer system can consist of one to hundreds of Cray XD1 chassis that interconnect in various configurations. A maximum of 13 chassis reside in a seven-foot cabinet.

2.1.2 Compute Blade

A compute blade is a subassembly of the Cray XD1 chassis; refer to Figure 4, page 16 and to Figure 6, page 18. Each chassis contains one to six compute blades. Each compute blade includes the following components:

• Two 64-bit Opteron processors, configured as a two-way SMP
• 1 to 8 GB of double data rate synchronous dynamic random access memory (DDR SDRAM) per compute processor, providing a maximum of 16 GB of DDR SDRAM per SMP
• A RapidArray processor, which provides two 2-GBps RapidArray links to the switch fabric
• A connector for an expansion module (not shown in Figure 6, page 18)

[Figure 6 labels: Opteron processors, RapidArray processor, air baffle, DIMMs, connector]

Figure 6. Compute blade

2.1.3 Expansion Module

The expansion module is an optional subassembly that attaches to a compute blade; Figure 7, page 19 shows a physical view of the module. Each expansion module contains the following components:

• An additional RapidArray processor, which provides two additional RapidArray links per compute blade and the interface to the AAP
• An optional FPGA AAP
• Four QDR II SRAMs for the FPGA
• A programmable clock source for the FPGA

When all 6 expansion modules are fully populated, a chassis contains 12 RapidArray processors, 24 internal RapidArray links, and 6 application acceleration processors.

[Figure 7 labels: RapidArray processor, connector, application acceleration processor]

Figure 7. Expansion module, physical view

For a more detailed description of the expansion module, see Chapter 3, page 21.

2.2 RapidArray Interconnect

Processors and memory within a chassis and between chassis are linked by a high-speed switch fabric called the RapidArray interconnect. Regardless of the size of a Cray XD1 system, the high-bandwidth, low-latency RapidArray interconnect is its central organizing construct. This interconnect enables the system to avoid PCI-X bus bottlenecks and shared-resource contention.

Each chassis has either one or two RapidArray switch fabrics, each of which consists of RapidArray links and a 24-port internal switch; see Figure 8, page 20. The RapidArray switch is a 96-GB-per-second (maximum per chassis) nonblocking, embedded crossbar switch that connects the RapidArray processors (RAPs). The compute blades connect to the internal RapidArray switch at 4 GBps: 2 GBps each for simultaneous transmit and receive operations.

[Figure 8 labels: Cray XD1 chassis with twelve Opteron processors and twelve RapidArray processors (RAPs); RAP connects to Opteron at 3.2 GB/s; RA link: 2 GB/s; main 24-port switch; expansion 24-port switch]

Figure 8. RapidArray components in a Cray XD1 chassis

To make use of the optional additional RapidArray components on the expansion modules, a chassis must also include a fabric expansion card; refer to Figure 4, page 16. The fabric expansion card provides the chassis with a second RapidArray switch. With this card and six expansion modules, a chassis has 24 external RapidArray interfaces; without it, only 12. The optional expansion modules and fabric expansion card enable the compute blades to connect to the internal RapidArray switch at 8 GBps: 4 GBps each for simultaneous transmit and receive operations.
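The bandwidth figures in this section follow from the link counts. The Python sketch below reproduces them under two stated assumptions: that each 2-GBps RapidArray link is split evenly between transmit and receive, and that the 96-GBps chassis maximum corresponds to two 24-port switches with every port running at 2 GBps. Neither assumption is stated explicitly in this manual, and the function names are illustrative.

```python
LINK_GBPS = 2        # one RapidArray link, both directions combined (assumed split 1 + 1)
LINKS_PER_RAP = 2    # each RapidArray processor provides two links
SWITCH_PORTS = 24    # ports on one internal RapidArray switch


def blade_bandwidth_gbps(raps_on_blade):
    """Return (total, per_direction) GB/s between one compute blade and the fabric."""
    total = raps_on_blade * LINKS_PER_RAP * LINK_GBPS
    return total, total // 2


def chassis_links(blades, raps_per_blade):
    """Number of internal RapidArray links in a chassis."""
    return blades * raps_per_blade * LINKS_PER_RAP


def switch_bandwidth_gbps(switches):
    """Aggregate switch bandwidth, assuming every port runs at LINK_GBPS."""
    return switches * SWITCH_PORTS * LINK_GBPS


if __name__ == "__main__":
    print("blade, no expansion module:", blade_bandwidth_gbps(1))
    print("blade, with expansion module:", blade_bandwidth_gbps(2))
    print("links in a full chassis:", chassis_links(6, 2))
    print("two-switch chassis, GB/s:", switch_bandwidth_gbps(2))
```

With one RAP per blade the sketch gives 4 GBps (2 GBps each way); with the expansion module it gives 8 GBps (4 GBps each way), matching the text.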

Expansion Module [3]

This chapter provides additional information about the expansion module—its variants, its components, and the communication paths between components. In particular, it describes the capabilities of the FPGA application acceleration processor (AAP).

3.1 Expansion Module Variants

Table 4, page 21 lists the variants of the expansion module that the Cray XD1 system currently supports. You need to know the relevant Cray part number when you prepare a loadable logic file; for details, see Section 10.1, page 91.

Table 4. Current FPGA AAP variants

Cray part no.    Cray 87 no.    Xilinx model    Module memory
90-0003-05       87-0003-09     XC2VP50-7       4 x 1M x 36
90-0003-08 (1)   87-0003-11     XC2VP50-7       4 x 1M x 36
90-0118-01       87-0020-05     XC4VLX160-10    4 x 1M x 36
90-0119-01       87-0020-08     XC4VSX55-10     4 x 1M x 36

Note: Cray 87 numbers are listed to allow translation between the Cray 87 number reported by some parts of the Cray XD1 software and the Cray part number that is used in most places. Use the Cray part number whenever possible because the software will not accept the Cray 87 numbers in future releases.
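The translation that the note describes amounts to a table lookup. The following Python sketch encodes the pairs from Table 4; the function name to_part_number is hypothetical and the sketch is not part of the Cray XD1 software.

```python
# Cray 87 number -> Cray part number, transcribed from Table 4.
PART_NUMBER_BY_87 = {
    "87-0003-09": "90-0003-05",
    "87-0003-11": "90-0003-08",  # educational version
    "87-0020-05": "90-0118-01",
    "87-0020-08": "90-0119-01",
}


def to_part_number(number):
    """Return the Cray part number for a Cray 87 number or a known part number."""
    number = number.strip()
    if number in PART_NUMBER_BY_87:
        return PART_NUMBER_BY_87[number]
    if number in PART_NUMBER_BY_87.values():
        return number  # already a Cray part number
    raise ValueError(f"not a number listed in Table 4: {number!r}")


if __name__ == "__main__":
    print(to_part_number("87-0003-09"))
```

Passing a value that is already a listed Cray part number returns it unchanged, so the helper can be applied to either form.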

3.2 FPGA AAP

The latest members of the Xilinx Virtex family of field-programmable gate arrays (FPGAs) are arguably the most technically sophisticated silicon and software product in the programmable logic industry. Users can program the FPGA to execute computationally intensive and repetitive algorithms. The Cray XD1 supports the Virtex-II Pro and two types of Virtex-4 devices. These FPGAs are optimized for high-density and high-performance system designs.

(1) Educational version (footnote to part number 90-0003-08 in Table 4)


The following subsections describe the types of functionality that these devices implement and briefly describe how the devices are programmed. The final subsection compares the capabilities of the various Xilinx FPGAs. The manufacturer of the FPGA, Xilinx, Inc., also provides thousands of pages of documentation on its website at http://www.xilinx.com.

3.2.1 Configurable Logic Blocks

Configurable logic blocks (CLBs) provide functional elements for combinatorial and synchronous logic including distributed memory elements and shift register capability.

3.2.2 Block SelectRAM+ Memory Modules

Block RAM modules provide large 18-Kb storage elements of dual-port RAM, programmable from 16K x 1 bit to 512 x 36 bits, in various depth and width configurations. Each port is fully synchronous and independent and offers three read-during-write modes. Block RAM can cascade to implement large embedded-storage blocks. The supported memory configurations for dual-port and single-port modes are: 16K x 1 bit, 8K x 2 bits, 4K x 4 bits, 2K x 9 bits, 1K x 18 bits, and 512 x 36 bits. Additionally, the Virtex-4 block RAM contains built-in FIFO support.
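Multiplying depth by width for each supported configuration shows how much of the 18-Kb block each one addresses. The Python sketch below does only that arithmetic. The remark that the 9-, 18-, and 36-bit widths reach the full 18 Kb because they include one extra bit per byte (commonly used for parity) comes from general Xilinx documentation, not from this manual, and should be treated as an assumption here.

```python
# (depth, width in bits) for each supported block RAM configuration.
CONFIGS = [
    (16 * 1024, 1),
    (8 * 1024, 2),
    (4 * 1024, 4),
    (2 * 1024, 9),
    (1024, 18),
    (512, 36),
]


def capacity_bits(depth, width):
    """Addressable bits for one configuration."""
    return depth * width


if __name__ == "__main__":
    for depth, width in CONFIGS:
        bits = capacity_bits(depth, width)
        print(f"{depth:6d} x {width:2d} = {bits} bits ({bits // 1024} Kb)")
```

The three narrow configurations address 16 Kb each, and the three wider ones address the full 18 Kb.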

3.2.3 Embedded Multiplier Blocks

In the Virtex-II Pro, a multiplier block is associated with each SelectRAM+ memory block. The multiplier block is a dedicated 18 x 18-bit twos-complement signed multiplier. It is optimized for operations based on the block SelectRAM+ content on one port. The 18 x 18 multiplier can be used independently of the block SelectRAM+ resource. Read, multiply, and accumulate operations and DSP filter structures are extremely efficient. Both the SelectRAM+ memory and the multiplier resource connect to four switch matrices to access the general routing resources.

3.2.4 XtremeDSP Slices

The Virtex-4 XtremeDSP slices contain a dedicated 18 x 18-bit twos-complement signed multiplier, adder logic, and a 48-bit accumulator. Each multiplier or accumulator can be used independently. These blocks are designed to implement extremely efficient and high-speed DSP applications.


3.2.5 Digital Clock Manager Blocks

Digital clock manager (DCM) blocks provide self-calibrating, fully digital solutions for clock distribution delay compensation, clock multiplication and division, and coarse- and fine-grained clock phase shifting. The DCM and global clock multiplexer buffers provide a complete solution for designing high-speed clock schemes.

3.2.6 Embedded IBM PowerPC 405 RISC Processor Blocks

The Virtex-II Pro FPGA contains two embedded PowerPC processors. The PPC405 RISC CPU can execute instructions at a sustained rate of one instruction per cycle. On-chip instruction and data caches reduce design complexity and improve system throughput. The embedded PowerPC 405 RISC processor blocks run at a maximum clock rate of 400 MHz.

3.2.7 Programmability

All programmable elements, including the routing resources, are controlled by values stored in static memory cells. These values are loaded into the memory cells during configuration and can be reloaded an unlimited number of times to change the functions of the programmable elements.

3.2.8 Comparison of Xilinx FPGAs

Table 5, page 24 lists the programmable elements that are available in each of the Xilinx FPGA models that the Cray XD1 system supports.


Table 5. Resources in Xilinx FPGAs

Family          Device   Logic Cells   Block RAMs   DCMs   Embedded Multipliers   XtremeDSP Slices   PowerPCs
Virtex-4        LX160    152,064       288          12     0                      96                 0
Virtex-4        SX55     55,296        320          8      0                      512                0
Virtex-II Pro   VP50     53,136        232          8      232                    0                  2
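The resource counts in Table 5 can be held as data and queried when you choose a target device. The Python sketch below is illustrative only: the dictionary keys and function names are hypothetical, and the total block RAM figure assumes that every block RAM counted in Table 5 is the 18-Kb block described in Section 3.2.2.

```python
# Resource counts transcribed from Table 5.
RESOURCES = {
    "LX160": {"family": "Virtex-4", "logic_cells": 152064, "block_rams": 288,
              "dcms": 12, "multipliers": 0, "dsp_slices": 96, "powerpcs": 0},
    "SX55": {"family": "Virtex-4", "logic_cells": 55296, "block_rams": 320,
             "dcms": 8, "multipliers": 0, "dsp_slices": 512, "powerpcs": 0},
    "VP50": {"family": "Virtex-II Pro", "logic_cells": 53136, "block_rams": 232,
             "dcms": 8, "multipliers": 232, "dsp_slices": 0, "powerpcs": 2},
}

BLOCK_RAM_KBITS = 18  # assumed size of every block RAM in the table


def block_ram_kbits(device):
    """Total block RAM in Kb for one device."""
    return RESOURCES[device]["block_rams"] * BLOCK_RAM_KBITS


def devices_with(resource):
    """Devices that have a nonzero count of the named resource."""
    return sorted(name for name, row in RESOURCES.items() if row[resource] > 0)


if __name__ == "__main__":
    for name in RESOURCES:
        print(name, block_ram_kbits(name), "Kb of block RAM")
    print("devices with XtremeDSP slices:", devices_with("dsp_slices"))
```

A query such as devices_with("powerpcs") shows at a glance that only the VP50 offers embedded PowerPC processors.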

3.3 RapidArray Processor

The RapidArray processor (RAP) provides the interface for the FPGA to connect to the local Opteron processors as well as the RapidArray fabric. The speed and performance of this interface match that of the HyperTransport links, but the link protocol is simplified to reduce logic requirements in the FPGA.

3.4 QDR II SRAM

The quad-data-rate (QDR) static random access memory (SRAM) provides local high-speed storage for the FPGA. Each of the four QDR II SRAM circuits has its own fully independent control. In addition, QDR II SRAM is dual-port SRAM, which enables simultaneous read and write access.

3.5 Programmable Clock

The programmable clock enables the user to set the speed of the FPGA for each design. Section 10.1, page 91 describes how to do so.

3.6 Communication Paths

Figure 9, page 25 shows a logical view of the components of the expansion module and their interconnections.

[Figure 9 labels: HyperTransport to SMP, RAP, accelerator FPGA, four QDR II RAMs, RapidArray; the link rates shown are 3.2 GB/s and 2 GB/s]

Figure 9. Expansion module, logical view

Part III: Developing FPGA Logic

Quick Start [4]

This chapter describes the Cray FPGA reference designs and how to use them along with the fcu(1) utility program to explore the FPGA subsystem. Many of the techniques described are useful during initial integration between a completed FPGA design and the software that uses it.

4.1 Reference Design Overview

The following reference designs are in subdirectories of /opt/ufpapps if the ufpapps software package is installed (this is the case if the node is in a partition with the "full" software installation option):

• Hello: A simple design that demonstrates the basic operation of interfacing the FPGA to the Opteron processor. This design is an excellent place to start for both software and FPGA designers.
• DMA: The Direct Memory Access (DMA) design is an extension of the Hello design. It demonstrates a DMA engine block that transfers data to and from Opteron DRAM and provides a FIFO interface to the user logic.
• Mince: The Minimal Compute Engine (Mince) design demonstrates interactions between the Opteron processor and FPGA that are more complicated. Mince is a diagnostic tool that tests the FPGA memory and bus interfaces.
• MTA: The Mersenne Twister Accelerator (MTA) design demonstrates a practical application, an FPGA implementation of the Mersenne Twister pseudorandom number generator.

Documentation for each reference design and its sample application programs is in the doc subdirectory (for example, /opt/ufpapps/mince/doc) of the relevant reference design directory.


4.2 Command-line Manipulation of FPGAs

This section introduces the process that you use to manipulate the FPGA from a Linux shell. It uses examples to illustrate the main actions of the fcu command. For a description of all the capabilities of this command, see Chapter 10, page 91 or the fcu(1) man page. This tutorial focuses on the Mince design because it is a good diagnostic tool. You can use it to verify that the FPGA subsystem is working correctly. You can use a similar procedure with any of the reference designs. For simplicity, this section assumes that you log in to a node that has an FPGA. All the procedures and commands act on the local FPGA. For an example of running an application as a job in the workload management (WLM) system, see Section 12.2, page 123.

4.2.1 Overview

After you simulate an FPGA application design and compile it into binary form, you can load it onto the FPGA AAP. The Cray XD1 system provides some Linux command-line tools and a software API to facilitate this process. Detailed descriptions of the main command-line utility and the FPGA software API are available in Part IV, Developing Accelerated Software Applications.

4.2.2 Copying the Design Directory

The reference designs are in /opt/ufpapps. This procedure modifies some of these files. Therefore, you must first copy them to a user-writable location such as $HOME or /tmp.

1. Log in to Linux on the Cray XD1 system.
2. Copy the design directory structure:

> cp -rP /opt/ufpapps $HOME

3. Make the design user-writable:

> chmod -R u+w $HOME/ufpapps

4. Go to the mince design directory:

> cd $HOME/ufpapps/mince

5. Go to the location of the compiled binary files:

> cd bin

4.2.3 Converting FPGA Binary Files

The final output from the FPGA design process is an FPGA binary file that contains the configuration bit stream for a specific Xilinx device. Before you can load an FPGA binary on the FPGA AAP, you must convert it to a Cray proprietary format by prepending a header. The header contains information about both the hardware for which the design is targeted and the clock rate at which to run the design.

You can perform this task with the FPGA control utility (the fcu(1) command). It is a general-purpose command that enables a designer to program and interact with the FPGA. For a list of options, refer to the fcu(1) man page or type fcu --help.

To convert the FPGA binary, the fcu(1) command prepends information from a text file to the Xilinx binary file design.bin and creates the Cray proprietary file design.bin.ufp. If no header file exists, you can first use the fcu command to create it.

The following example creates a new header file and converts mince50.bin to mince50.bin.ufp. This example assumes that the FPGA AAP is an XC2VP50 Xilinx chip.

1. Log in to Linux on the Cray XD1 system.
2. Determine the type of application accelerator on the target node:

> lsnode -v | grep -E -i '^hardware|^app'
Hardware ID: 269.1
App Accelerator: 87-0003-09
Hardware ID: 269.2
App Accelerator: 87-0003-09
Hardware ID: 269.3
App Accelerator: 87-0003-09
Hardware ID: 269.4
App Accelerator: 87-0003-09
Hardware ID: 269.5
App Accelerator: 87-0003-09
Hardware ID: 269.6
App Accelerator: 87-0003-09

3. If necessary, use Table 4, page 21 to determine the appropriate 90-level part number. For example, 87-0003-09 matches part number 90-0003-05.


4. Create a header file for the target application accelerator:

> fcu -b
Enter Cray Part Number [10 characters]
90-0003-05
Enter FPGA clock frequency in Integer MHz [Range : 63 to 199, inclusive]
180

The fcu command creates a text file called ufphdr. The contents of the file are as follows:

Cray Part Number : 90-0003-05;
FPGA Frequency MHz : 180;

5. Convert the Xilinx binary file to the Cray format:

> fcu -c mince50.bin ufphdr
header size 59
input fpga load size 2377668

In this example, the fcu command creates a file called mince50.bin.ufp, which is ready for download.
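The constraints that the fcu -b prompts state (a 10-character Cray part number and an integer clock frequency from 63 to 199 MHz, inclusive) can be checked before you run the conversion. The following Python sketch is not part of the Cray software. Its function names are hypothetical, and it assumes that the header file uses the "name : value;" layout shown in step 4.

```python
def parse_ufphdr(text):
    """Split header text of the form 'name : value;' into a dictionary."""
    fields = {}
    for entry in text.split(";"):
        if ":" not in entry:
            continue  # skip trailing whitespace after the last ';'
        key, value = entry.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


def validate_header(fields):
    """Check the limits quoted by the fcu -b prompts; return (part, mhz)."""
    part = fields["Cray Part Number"]
    mhz = int(fields["FPGA Frequency MHz"])
    if len(part) != 10:
        raise ValueError(f"part number must be 10 characters: {part!r}")
    if not 63 <= mhz <= 199:
        raise ValueError(f"clock frequency out of range 63-199 MHz: {mhz}")
    return part, mhz


if __name__ == "__main__":
    sample = "Cray Part Number : 90-0003-05;\nFPGA Frequency MHz : 180;\n"
    print(validate_header(parse_ufphdr(sample)))
```

A check of this kind catches a mistyped part number or an out-of-range frequency before the header is prepended to the binary.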

4.2.4 Downloading Files to the FPGA

You can load an FPGA design.bin.ufp file through either the software API function fpga_load(3) or the fcu(1) utility from the command line. To load a file onto the FPGA from the command line, use the fcu command as in the following example:

> fcu -l mince50.bin.ufp
file size 2381764
setting device location /dev/ufp0
programming device 2381764 bytes
opening device /dev/ufp0
programmed FPGA 2381764 bytes
closing device file descriptor 4

The FPGA is now programmed with the mince50.bin.ufp logic.
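The sizes reported in the two sample transcripts give a quick consistency check: the converted file should be larger than the raw binary by the size of the prepended header region. The Python sketch below only subtracts the two reported values. The observation that the difference is 4096 bytes while the reported header size is 59 suggests that the header region is padded, but that is an inference from these sample numbers and not a documented property of the format.

```python
RAW_SIZE = 2377668        # "input fpga load size" reported by fcu -c
CONVERTED_SIZE = 2381764  # "file size" reported by fcu -l
HEADER_TEXT_SIZE = 59     # "header size" reported by fcu -c


def header_overhead(raw_size, converted_size):
    """Bytes added to the raw binary by the conversion step."""
    return converted_size - raw_size


if __name__ == "__main__":
    overhead = header_overhead(RAW_SIZE, CONVERTED_SIZE)
    print("bytes added by conversion:", overhead)
    print("bytes beyond the header text:", overhead - HEADER_TEXT_SIZE)
```

Comparing the sizes in this way is a simple sanity check that the file you load is the one you just converted.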

4.2.5 Accessing the FPGA

In general, access to the FPGA occurs through its API. The API provides functions for reading and writing values to the FPGA, and for mapping the QDR SRAM of the FPGA into the virtual memory space of the processor.

As a convenience, the /proc/ufp/regs file in the Linux process information pseudo-filesystem exists to show the contents of some configuration registers that are not directly user-accessible and the first 64 bytes of the QDR SRAM register space. This can be useful for displaying design-dependent configuration and status registers from the command line during initial debugging. Bit definitions for the Host Interface registers are in Design of Cray XD1 RapidArray Transport Core (S–6411). An example of the virtual file contents follows:

> cat /proc/ufp/regs
LPC Bus Kernel Virtual Address ffffff0040df8000

FPGA AAP Config and Status Space Offset 0x200
CME Ctrl Reg : 0x05
CME Stat Reg : 0x03
CME Intr Status Reg : 0x07
Host Interface Offset 0x300
RAT Base Num Reg 0 : 0x01
RAT Base Num Reg 1 : 0x00
RAT Config Num Reg : 0x01
RAT Version Num Reg : 0x00
Test Reg 0 : 0x00
Test Reg 1 : 0x00
RAT Compile Num Reg : 0x04
Clock Config Reg 0 : 0xB4
User Reset : 0x01
Clock Status Reg 0 : 0x00
Clock Frequency MHz : 0xB3
Host Latch Register : 0x00

FPGA AAP Application Space Starting Kernel Virtual Address ffffff0040dfa000

Application Interface Register space kernel virtual address ffffff0044dfa000
Offset Range : Value
0x0000-0x0008 : 0x0000002B00080100
0x0008-0x0010 : 0x0000000000000000
0x0010-0x0018 : 0x0000000000000000
0x0018-0x0020 : 0x0000000000000000
0x0020-0x0028 : 0x0000000000000000
0x0028-0x0030 : 0x0000000000000000
0x0030-0x0038 : 0x0000000000000000
0x0038-0x0040 : 0x0000000F0000000E

The first two tables of registers are of little general interest, except the Clock Config Reg 0 and Clock Frequency MHz registers of the Host Interface. The Clock Config Reg 0 register shows the requested programmable clock frequency (a hexadecimal value in MHz) at which the FPGA application will run. The Clock Frequency MHz register shows the measured clock frequency with an accuracy of +/- 1 MHz. In the example above, 0xB4 is the requested 180 MHz and 0xB3 is a measured 179 MHz.

The Application Interface registers are the first 64 bytes of the FPGA register space. Their content depends on the design of the application logic. These values are refreshed every time you read the file.
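Comparing the requested and measured clock values can be automated by parsing the register listing. The Python sketch below is a minimal illustration and not a Cray tool. It assumes that the file prints one "name : 0xNN" pair per line, and the function names are hypothetical. The sample text is an excerpt of the listing shown earlier.

```python
import re

# One "name : 0xHEX" pair per line (assumed layout of /proc/ufp/regs).
REGISTER = re.compile(r"^[ \t]*(.+?)[ \t]*:[ \t]*0x([0-9A-Fa-f]+)[ \t]*$", re.MULTILINE)

SAMPLE = """\
Host Interface Offset 0x300
Clock Config Reg 0 : 0xB4
User Reset : 0x01
Clock Status Reg 0 : 0x00
Clock Frequency MHz : 0xB3
"""


def parse_regs(text):
    """Map each register name to its integer value."""
    return {name: int(value, 16) for name, value in REGISTER.findall(text)}


def clock_within_tolerance(regs, tolerance_mhz=1):
    """True if the measured clock is within tolerance of the requested clock."""
    requested = regs["Clock Config Reg 0"]
    measured = regs["Clock Frequency MHz"]
    return abs(requested - measured) <= tolerance_mhz


if __name__ == "__main__":
    regs = parse_regs(SAMPLE)
    print("requested MHz:", regs["Clock Config Reg 0"])
    print("measured MHz:", regs["Clock Frequency MHz"])
    print("within tolerance:", clock_within_tolerance(regs))
```

On a node with an FPGA, the same functions could be applied to the text read from /proc/ufp/regs instead of the sample string.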

4.2.6 Erasing the FPGA

After you program an FPGA, it remains operational until you reset or unload it. Resetting the FPGA leaves the device configured but holds the user logic in reset. If you take the FPGA out of reset, it resumes normal operation. Unloading the FPGA clears all configuration in the device and returns it to an unprogrammed state.

Note: If you plan to continue to the next section, do not reset or unload the FPGA at this time.

To reset the FPGA:

> fcu --reset

To unload the FPGA:

> fcu --unload

4.3 Running the C Reference Programs

After you load an FPGA, it is ready for software to use it. Each reference design includes one or more programs that use the FPGA logic. Compiled versions of these programs are located with the design binary files in the bin directory of the design. The Mince design includes berttest, qdrtest, and the test.sh script, which calls berttest and qdrtest.

The following examples show calling the berttest program with two slightly different sets of parameter values. Both calls use the Mince FPGA to run a bit error rate test (BERT) on the QDR SRAM and on a segment of the SMP DRAM (to test the communication channel between the FPGA and the SMP memory).

> ./berttest -v -r -w 0 -t 60 -d rand -a incr
Test Parameters -
Data Pattern : rand
Address Pattern : incr
Wait States : 0
QDR RAM Banks : Bank0 Bank1 Bank2 Bank3
RT BERT : Enabled
Execution Time : 60 s
Test PASSED.
> ./berttest -v -r -w 0 -t 60 -d incr -a rand
Test Parameters -
Data Pattern : incr
Address Pattern : rand
Wait States : 0
QDR RAM Banks : Bank0 Bank1 Bank2 Bank3
RT BERT : Enabled
Execution Time : 60 s
Test PASSED.

The shell script test.sh calls both berttest and qdrtest numerous times with different parameters. This script is a useful tool for performing a sanity check on the AAP hardware.
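A wrapper in the spirit of test.sh can build the berttest argument lists and check each run for the pass message. The Python sketch below is illustrative only and does not reproduce test.sh. The argument order is copied from the two sample calls, and the mapping of -d, -a, -w, and -t to the data pattern, address pattern, wait states, and execution time is inferred from the sample output rather than taken from berttest documentation. The function names are hypothetical.

```python
def berttest_args(data_pattern, address_pattern, seconds=60, wait_states=0):
    """Build an argument list that mirrors the sample berttest calls."""
    return [
        "./berttest", "-v", "-r",
        "-w", str(wait_states),
        "-t", str(seconds),
        "-d", data_pattern,
        "-a", address_pattern,
    ]


def passed(output):
    """True if the captured program output reports a passing test."""
    return "Test PASSED." in output


if __name__ == "__main__":
    # Print the two command lines from the examples above.
    for data, address in (("rand", "incr"), ("incr", "rand")):
        print(" ".join(berttest_args(data, address)))
```

On a node with the Mince design loaded, each argument list could be handed to subprocess.run with capture_output=True and text=True, and the captured stdout passed to passed().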

Hardware Development Flows [5]

This chapter describes the basic development flow that begins with design entry with a hardware description language (HDL) and ends with an FPGA binary. It also describes alternative approaches to design entry with software-oriented language tools. A designer has a wide variety of Xilinx and third-party tools to choose from for FPGA development. Cray does not supply these tools, but this chapter identifies some representative ones.

5.1 Basic HDL Development Flow

This section identifies the steps of the basic development process and gives a brief description of each step.

5.1.1 Overview of the HDL Development Process

Cray Inc. uses standard processes and tools to develop designs for the FPGA application acceleration processor. The overall development process has the following four basic steps:

1. Design entry: Creates the source code that makes up the design.
2. Synthesis: Translates the source code into a low-level netlist of logic elements that are suitable for implementation in an FPGA.
3. Simulation: Simulates the design source code or the synthesized design netlist.
4. Implementation: Determines the physical placement and routing for the netlist logic on the target FPGA.

Figure 10, page 38 illustrates the standard development flow.

[Figure 10 labels: HDL design, models and test bench, synthesis, simulation, implementation, Cray cores, binary file, metadata]

Figure 10. HDL development flow

5.1.2 Design Entry

HDLs can specify a hardware design in terms of familiar programming constructs such as conditional statements, loops, and function calls. VHDL and Verilog are perhaps the two most common languages in current use. Because they are oriented specifically to hardware design, HDLs provide a flexible and powerful way to generate efficient logic. However, this orientation makes them suitable only for hardware designers.

The Xilinx ISE tools provide a text editor for generating design source code. The text editor includes support for highlighting both VHDL and Verilog syntax. It also includes templates for common logic functions. Several other free text editors support design entry; for example, GNU Emacs (or XEmacs). Like the Xilinx ISE software, Emacs supports both a VHDL mode and a Verilog mode that provide syntax highlighting and design templates.

In addition to designing and coding logic by hand, you can use the Xilinx CORE Generator tool in ISE to create custom logic blocks from a family of parameterizable cores. A wide variety of common cores is available, including simple mathematical functions and DSP functions.

5.1.3 Synthesis

The synthesis process translates the source HDL code into gate-level elements, such as AND gates, OR gates, and flip-flops. The Cray reference designs use the Xilinx Synthesis Technology (XST) tool (part of some Xilinx ISE packages) for synthesis. Synthesis tools are also available from several other vendors; for example: Synplicity, Mentor Graphics, and Synopsys. The output of the synthesis process is a gate-level netlist. If you use the Xilinx XST package, the netlist is a Xilinx proprietary NGC file (for example, top.ngc).

5.1.4 Simulation

You perform the simulation step on a design test bench. A test bench is an HDL structure that includes the design itself along with models that manipulate the design inputs and monitor the outputs. Simulators compile HDL code into simulation models, which you can use to verify the design. If an FPGA design is fully implemented, you can create a back-annotated simulation model. This annotated model contains the actual delays from the placed design. The reference designs work with either the Aldec Riviera simulator or the Mentor Graphics ModelSim simulator. However, most standard simulators should work because the designs, test bench, and models are all VHDL source files. You can also use more powerful test bench and simulation tools such as the Verisity Specman simulator.


5.1.5 Implementation

Design implementation is the process of translating, mapping, placing, routing, and generating an FPGA binary file for your design. The implementation process uses the proprietary Xilinx translate, map, and place-and-route tools, which are part of the Xilinx ISE tool set, to assign the logic that you create during design entry and synthesis to specific physical resources of the target device. Implementation also combines existing cores that your design references (such as the Cray RT core and QDR II SRAM core) with the user application block. A set of user-determined timing and placement constraints guides the Xilinx process. These constraints are in the user constraint file (UCF).

5.2 Development Flow Using Software-Oriented Languages

This section describes how the basic logic development flow changes when you use tools that are based on software-oriented languages (typically C-like languages) to specify the hardware design. Such tools let developers focus on algorithms rather than on the low-level details of hardware design. The available high-level tools have varying degrees of integration with the overall development flow. Some simply generate VHDL code and you then use the tools of the basic development flow to complete the remaining steps. Some other tools let you control all the steps from within their development environment. Figure 11, page 41 shows some variations of the development flow when you use a software-oriented language tool. It highlights vendors that have adapted their software-oriented development kits for the Cray XD1 system. Section 5.2.1, page 41 and Section 5.2.5, page 42 briefly describe the approaches of these vendors.

[Figure 11 flow diagram. Design entry uses either a C-like language followed by C synthesis, or MATLAB/Simulink followed by Xilinx System Generator for DSP. These tools emit VHDL, Verilog, or EDIF. Synthesis (Mentor Graphics, Synopsys, Synplicity, Xilinx, and others) produces a gate-level EDIF file, and Xilinx place and route produces the binary file for the FPGA. Vendors shown: Impulse C, Mitrion, Celoxica, DSPlogic, The MathWorks, and others.]

Figure 11. Development flows with high-level tools

5.2.1 Design Entry

The use of software-oriented language tools most affects the design entry step. Various vendors provide not only a high-level language, but also a complete development environment that may integrate with the tools of the basic development flow. Usually, the language has a familiar syntax (typically a C-like syntax) and has extensions that allow you to express additional concepts such as parallelism that are useful for FPGA logic design. Some tools let you design the whole application together and then partition it into software and hardware components for separate processing.

The Xilinx System Generator for DSP tool uses a somewhat different approach for designing digital signal processing (DSP) blocks. It integrates with the MATLAB and Simulink packages from The MathWorks to let you design the algorithmic block of your application in the MATLAB GUI environment. However, it does not provide full integration with the development flow in the Cray XD1 environment; you still perform the other steps manually as in the basic HDL flow. (DSPlogic uses the same underlying tools but provides a fully integrated environment in which you can perform all the development steps; see Section 5.2.5.2, page 43.)

5.2.2 Synthesis

The synthesis step is the same as in the basic development flow after the high-level language is translated into an HDL. Some tools provide a synthesis step directly from the high-level language.

5.2.3 Simulation

Most software-oriented FPGA development environments provide tools for simulating the design from the source code. The tools for simulation and debugging are familiar to many users because of their origins as standard software development tools. Typically, these tools provide multiple views of the design such as registers (whose values you can set as well as view), source code, and logic blocks. As you simulate running the design, you can observe the changing values of registers and relate the activity of source code and logic blocks. These multiple views can aid you, for example, in improving the parallelism of the design.

5.2.4 Implementation

The implementation step is the same as in the basic development flow. Xilinx tools translate the netlist, map, place and route, and generate an FPGA binary file.

5.2.5 Third-Party Vendors and Tools

Several vendors of software-oriented tools for logic design now offer development kits that integrate (in various ways) with the Cray XD1 system. This section briefly introduces these vendors (in alphabetical order) and provides references to the vendors' websites.


5.2.5.1 Celoxica, Inc.

Celoxica provides a C-based hardware design language, Handel-C, and a version of the DK4.0 Design Suite that is customized for the Cray XD1 system. DK4.0 Design Suite unifies system verification, hardware/software codesign, and Handel-C synthesis in a GUI-based development environment. The customization includes a Platform Support Library (PSL) for the Cray XD1 system, which includes Handel-C interfaces to the Cray RT core and QDR II SRAM core. It also includes integration with the standard Cray development environment so that you can complete all steps of the development flow from the DK4.0 environment. For more information, contact Celoxica (http://www.celoxica.com).

5.2.5.2 DSPlogic, Inc.

DSPlogic provides the Rapid Reconfigurable Computing (Rapid RC) Development Kit, which integrates with MATLAB and Simulink from The Mathworks and with Xilinx tools. It lets you design, verify, and build the FPGA logic for DSP applications entirely within the MATLAB/Simulink environment on a Windows platform for the Cray XD1 system. The product also includes the Reconfigurable Computing I/O (RCIO) library, which provides a portable application programming interface for communications between the software application that runs on Cray XD1 host processors and the attached FPGAs. For more information, contact DSPlogic (http://dsplogic.com).

5.2.5.3 Impulse Accelerated Technologies, Inc.

Impulse provides a C-based development kit, Impulse C, that lets you program your application in standard C with the aid of a library of functions for describing parallel processes, partitioning the application into software and hardware parts, and simulating and instrumenting the application. The Impulse programming model is stream-based communication between processes. The kit provides a compiler that generates HDL code for synthesis from the hardware parts of the application. For more information, contact Impulse (http://impulsec.com).

5.2.5.4 Mitrionics Inc.

Mitrionics provides a C-based hardware design language, Mitrion-C, and an abstract machine, the Mitrion processor, that let you develop portable high-level code for FPGA applications. The package includes a Cray XD1 instantiation of the Mitrion processor. This approach supports easy conversions to new generations of hardware. The Mitrion architecture uses a data-driven representation of the program, which the tools map onto programmable logic. For more information, contact Mitrionics (http://mitrion.com).

General Design Considerations [6]

This chapter describes the required clock speed for FPGA application acceleration processor (AAP) designs, the various types of memory that the FPGA AAP can access, issues concerning use of the RapidArray interconnect, and limitations in the use of the AAP.

6.1 Clock Frequency

You can set the Cray XD1 AAP programmable clock to any integer frequency between 66 and 199 MHz. This range is determined by the capabilities of the Xilinx digital clock modules (DCMs). This clock is provided to the AAP module SRAMs and to the QDR II SRAM core and drives the default clock in the user logic (user_clk). You can generate other internal clocks at different frequencies with the DCMs. In this case, the user logic is completely responsible for the DCM functions and for managing signal transitions between clock domains. For details about how to specify the frequency, see Section 10.1, page 91.
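The frequency constraint above can be checked before a build is attempted. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes that the 66 and 199 MHz bounds are inclusive.

```python
# Illustrative sketch; names are hypothetical and not part of any Cray tool.
# Assumption: the 66-199 MHz range quoted in the text is inclusive.
AAP_CLK_MIN_MHZ = 66
AAP_CLK_MAX_MHZ = 199


def is_valid_aap_clock(freq_mhz):
    """Return True if freq_mhz is an integer number of MHz within the range."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(freq_mhz, bool) or not isinstance(freq_mhz, int):
        return False
    return AAP_CLK_MIN_MHZ <= freq_mhz <= AAP_CLK_MAX_MHZ


def clock_period_ns(freq_mhz):
    """Return the user_clk period in nanoseconds for a valid frequency."""
    if not is_valid_aap_clock(freq_mhz):
        raise ValueError("frequency must be an integer from 66 to 199 MHz")
    return 1000.0 / freq_mhz
```

The period is the figure that the timing specifications of a design ultimately have to meet, so it is convenient to derive it from the same validated value.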

6.2 FPGA Memory Resources

For many applications, access to memory resources can limit performance. The FPGA has access to four types of memory; this helps to maximize the performance of memory-intensive applications. The first two types of memory are the distributed and block RAM available within the Xilinx FPGA. These small, high-speed memories are excellent for exploiting parallelism in algorithms because many of them are distributed across the chip. The third type of memory is the four external banks of the second generation Quad Data Rate (QDR II) SRAM (refer to Figure 12, page 46). These memories provide high data transfer rates with low latency and no restrictions on burst size. In addition, they have independent read and write interfaces, so data can be simultaneously written and read from different addresses every clock cycle. The last type of memory is the DDR SDRAM banks of the SMPs. The FPGA has high-speed burst access through the RapidArray fabric to the large pools of memory that belong to each compute processor. Figure 12, page 46 shows the memory hierarchy (slowest to fastest speed) that is available to the FPGA and the relative size of each type of memory.


SMP SDRAM (Gigabytes)

Module QDR II RAM (Megabytes)

FPGA Block RAM (Kilobytes)

FPGA Distributed RAM (Kilobits)

Figure 12. FPGA memory hierarchy

The types of memory have different characteristics that make them suitable for different tasks. Table 6, page 46 lists those characteristics and suggests some typical uses.

Table 6. Available memory devices

RAM type: Opteron DDR SDRAM
    Characteristics: Large pool of memory capable of high-speed burst access.
    Typical uses:
    • Buffer large sets of data for processing as well as storage of the results

RAM type: QDR II RAM
    Characteristics: Medium-size pool of memory capable of high-speed random access. Supports simultaneous read and write accesses. Four banks allow independent parallel access.
    Typical uses:
    • Store segments of a larger data set that is in Opteron SDRAM
    • Large lookup tables

RAM type: FPGA Block RAM
    Characteristics: Very-high-speed full dual-port RAM. Between 136 and 232 18-Kb RAMs per chip. Enables a great deal of parallelism.
    Typical uses:
    • FIFOs
    • Shift registers
    • Lookup tables

RAM type: FPGA Distributed RAM
    Characteristics: Very-high-speed pseudo dual-port RAM. 30,000 to 50,000 16x1 RAMs. Enables massive parallelism when memory requirements are small.
    Typical uses:
    • Small FIFOs
    • Shift registers
    • Small lookup tables

6.3 Fabric Bandwidth and Data Flow

The RapidArray Transport bandwidth available to the FPGA depends on a number of factors. The theoretical maximum available bandwidth is approximately 8/9*1.6 Gigabytes per second (approximately 1.422 GBps) in each direction. This maximum bandwidth occurs only with efficient burst transactions between the Opteron and FPGA. The maximum write data bandwidth is decreased when the read interface is in operation, since read requests must be transmitted on the write bus.

Another important consideration is the expansion fabric. The expansion fabric and the FPGA share a common path to the Opterons and so FPGA bandwidth can be affected by Opteron traffic that runs through the expansion fabric.

While the data path between the Opteron and FPGA is important, it is not the only path to consider. Transferring data into the Opteron memory from disk or a network location is much more likely to be the limiting factor in overall application performance.
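The bandwidth figure quoted in this section can be reproduced with a short calculation. The Python sketch below is a worked example under stated assumptions, not a measurement: it takes 1 GB as 10^9 bytes, and the helper names are hypothetical.

```python
# Worked arithmetic for the theoretical maximum quoted in the text.
# Assumptions: 1 GB = 1e9 bytes; helper names are hypothetical.
from fractions import Fraction

RAW_LINK_GBPS = Fraction(16, 10)  # 1.6 GBps in each direction
EFFICIENCY = Fraction(8, 9)       # the 8/9 factor given in the text


def max_fabric_bandwidth_gbps():
    """Theoretical maximum bandwidth per direction, in GB per second."""
    return float(RAW_LINK_GBPS * EFFICIENCY)


def min_transfer_time_s(num_bytes):
    """Lower bound on the time to move num_bytes in one direction."""
    return num_bytes / (max_fabric_bandwidth_gbps() * 1e9)
```

Because the result is a lower bound, real transfers take longer whenever bursts are inefficient, the read interface is active, or expansion fabric traffic shares the path.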

6.4 SMP Processor-initiated Fabric Transactions

Writes made from the SMP processor to the FPGA are posted. This means that once the write is made, the processor moves on to its next operation. It does not wait for a response from the FPGA. This greatly improves the overall performance of transferring data to the AAP. Read requests, however, require the processor to wait for a response from the FPGA (in other words, the requested data). There is no mechanism for the processor to issue burst read requests to the FPGA or to have multiple outstanding read requests. As a result, the SMP processor can write to the FPGA much more efficiently than it can read. This is generally true for most types of processor interfaces, even, for example, DRAM interfaces. It is this performance gap between read and write transactions that gives rise to the popularity of write-only architectures for achieving the highest performance. In a write-only architecture, data is written from the SMP processor to the FPGA, and then the AAP writes it back to the SMP processor (rather than having the SMP processor read it back).
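The gap between posted writes and blocking reads can be illustrated with a deliberately simple cost model. The Python sketch below is hypothetical throughout: the function names and any timing values passed to them are placeholders chosen for illustration, not measured Cray XD1 figures.

```python
# Toy cost model; all names and timing values are hypothetical placeholders.

def posted_write_time_ns(n_words, issue_ns):
    """Posted writes: the processor pays only the time to issue each write."""
    return n_words * issue_ns


def blocking_read_time_ns(n_words, round_trip_ns):
    """Reads: one request outstanding at a time, so each read pays the
    full round trip to the FPGA before the next one can start."""
    return n_words * round_trip_ns
```

With an assumed issue time of 5 ns and an assumed round trip of 500 ns, reading 1000 words costs 100 times as much as writing them, which is the kind of gap that motivates the write-only architecture described above.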

6.5 FPGA-initiated Fabric Transactions

The FPGA has complete control of its fabric interface and so it has capabilities not available to the application that runs on the SMP processor. The FPGA can burst read and burst write to and from the fabric (for example, to and from the SMP memory). In addition to the ability to post writes, the FPGA can also have multiple read requests outstanding. These facts can make DMA access from the FPGA very efficient. In addition, the FPGA can initiate an interrupt request on the SMP.

6.6 Limitations

The current software release has an important restriction. The Opteron can burst write to the FPGA memory space, but cannot burst read from it. This issue can be addressed by programming the FPGA to write to the Opteron memory window, since the FPGA is capable of bursting in both directions.

Using the Design Template [7]

This chapter describes the Cray XD1 FPGA application acceleration processor development environment. It describes in detail the design template that Cray provides, and it identifies many of the third-party tools that you can choose.

7.1 Overview

User logic for the FPGA AAP must be able to communicate with the surrounding system. Cray IP cores provide this ability. Each core provides an interface to a specific part of the system. You need to design the FPGA user logic to interact with these cores. The documents listed in Section 1.1.4.1, page 4 provide full descriptions of the Cray IP cores and their interfaces. Cray includes a package (ufpapps) in the Cray XD1 software distribution to assist you in integrating user logic into the Cray XD1 system. This package includes the Cray HDL cores, reference designs, and a design template. The ufpapps RPM package creates a directory structure under /opt/ufpapps. Figure 13, page 50 shows the design-related directories in /opt/ufpapps. This directory includes some additional subdirectories that contain utility programs; these are not shown in the figure, but Table 7, page 51 lists them and other chapters describe the utilities.


/opt/ufpapps
    ref_design (reference designs: dma, hello, mince, mta)
        bin                  Precompiled FPGA and Linux binaries
        doc                  Documentation of the design hardware and software
        src                  Source files, including the HDL and C application code
            design_name      HDL design that follows the design template structure
    libdev (libraries for each FPGA family: libxc2vp, libxc4v)
        core (core library folders that are linked into designs: qdr2_core, rt_core)
            hdl              HDL source generated from netlist
            par
                dev_model    Pre-synthesized logic for each supported device: xc2vp50, xc4vlx160, xc4vsx55
            simlib
                riviera      Precompiled simulation libraries (Aldec)
    vhdl_template            Design template

Figure 13. Design directories in /opt/ufpapps


Each reference design is based on the design template and includes the Cray cores by reference in order to interface with the rest of the system. The hello design is a simple example and a good place to start becoming familiar with the HDL structures and the Cray cores. Table 7, page 51 describes the ufpapps directory structure.

Table 7. ufpapps subdirectory descriptions

Directory: bin
    Contains the fcu program binary.

Directory: ref_design (dma, hello, mince, and mta)
    Subdirectory bin: Contains precompiled FPGA binaries and C application code that can be used to quickly start interacting with the AAP.
    Subdirectory doc: Contains detailed documentation on the reference design HDL and C application code.
    Subdirectory src: Contains the C application source code and Makefile as well as a directory structure containing the HDL sources.

Directory: libdev (libxc2vp and libxc4v)
    Subdirectory core/hdl: Contains the synthesized netlists for the Cray core (qdr2_core or rt_core) for the particular device family. You can use this netlist to simulate the core, but it is slower than the simulation libraries.
    Subdirectory core/par/dev_model: Contains presynthesized logic that designs that use core require for different target device models (xc2vp50, xc4vlx160, or xc4vsx55). Cores are in NGC format because it can contain logic placement information. This helps to ensure consistent performance when you incorporate the cores in user designs.
    Subdirectory core/simlib/riviera: Contains precompiled simulation libraries for Riviera.

Directory: utils
    Contains the source and executable files for the fpga_read and fpga_write utilities.

Directory: vhdl_template
    Contains the HDL design template; for details, see Section 7.2, page 52.


7.2 Contents of the Design Template

The design template, /opt/ufpapps/vhdl_template, is a tree of directories and files that you can use to structure your own application and development process. The template includes design files that provide a framework within which you develop your application logic and a directory structure and other files necessary to build your application logic for the Cray XD1 system. This design template offers the following features:

• Complete definition of all the FPGA external interfaces and pins
• Complete framework of makefiles for compiling, simulating, and building the design
• Separation of Cray HDL code and structures from user files
• Separation of Cray cores and simulation models from user files
• Straightforward integration into new Cray Inc. software releases

7.2.1 Design Files

The design template includes a top-level VHDL file that instantiates all the components shown in top.vhd of Figure 14, page 53. The top-level design files of the template are in VHDL. However, by using the ability of VHDL to support "black box" logic components, you can also use other HDLs to develop designs for the AAP. The top-level VHDL file forms a wrapper that combines several logic components. Two of the components are the required logic cores for connecting to the RapidArray processor and to the QDR II SRAMs. A third component generates the user-programmable clock. The fourth component is the user application component that contains all design-specific logic. You can code the user application component in VHDL or Verilog or any other language that can generate a Xilinx netlist. The only requirement is that the user logic component provides the required interface signals to the rest of the design.


[Figure: top.vhd contains four components: prog_clk_gen, rt_core (connected to the RapidArray fabric), user_app, and qdr2_core (connected to the QDR SRAM).]

Figure 14. FPGA organization

7.2.2 Directory Structure

The design template also provides a common directory structure and build environment that supports development with the graphical Xilinx ISE tools and a simple Makefile-based automated development flow. The same structure works on both the Windows [1] and Linux operating systems. The vhdl_template directory contains bin and src directories at the top level, analogous to the structure of the ref_design directories as shown in Figure 13, page 50. The remainder of this section describes the structure within the src directory. Figure 15, page 54 shows the subdirectories of the vhdl_template/src directory. The structure is simple, flexible, and easy to maintain. However, depending on the tool set that you choose for development, you may wish to modify the structure as required. For example, if a third-party synthesis tool is used, a separate syn directory may be useful to contain vendor-specific setup and control files. In the template, the design_name directory exists under a particular example name, 80-0037_vhdl. After you copy the template, change this name to a design name of your choice.

[1] To use the makefile environment on a Windows system, you can install the free Cygwin software package (see www.cygwin.com). It includes the GNU make package.


design_name
    hdl                  Top-level HDL source files
        cray             Common HDL source code required by all designs
        user_app         All user source code (user_app block and below)
    hdl_tb               HDL test bench and models
    par                  Synthesis and implementation files
        dev_model        UCF and other PAR files for each supported device: xc2vp50, xc4vlx160, xc4vsx55
    sim                  Simulation setup and stimulus files
    simlib               Simulation libraries
        modelsim
        riviera
    timing

Figure 15. Subdirectories of vhdl_template/src

Table 8, page 55 describes these design directories.


Table 8. Subdirectories of vhdl_template/src/design_name

Directory: hdl
    Contains the top.vhd file which fully defines interfaces to the user application and Cray cores.
    Subdirectory user_app: Contains the source code for the FPGA user logic design.
    Subdirectory cray: Contains Cray VHDL code that is required in all designs.

Directory: hdl_tb
    Contains a VHDL test bench for the design, source models for the QDR II SRAMs, and a simple behavioral model for the RapidArray processor.

Directory: par
    Subdirectory dev_model (xc2vp50, xc4vlx160, or xc4vsx55): Multiple directories that contain the synthesis and implementation files for the different FPGA models.

Directory: sim
    Contains Modelsim scripts that are useful in the simulation of the design.
    Subdirectories tc_01, tc_02, ...: Multiple directories that contain test case input and output files; each directory contains a separate test case that you can run individually or as part of a series for regression testing.

Directory: simlib
    Contains subdirectories for various simulation tools.
    Subdirectory modelsim: Contains simulation libraries for the Modelsim simulator.
    Subdirectory riviera: Contains simulation libraries for the Riviera simulator.

Directory: timing
    Contains diagrams that illustrate the behavior of some design blocks; you can display these diagrams with a free viewer (see www.timingdesigner.com).


7.3 Working with the Design Template

This section details how to copy the design template and use it in each step of the design process. You can work from either the command line or the GUI environment that the third-party tools provide. This process requires software packages that are not part of the Cray XD1 software distribution.

7.3.1 Copying the Design Template

To begin developing an FPGA logic design, first prepare a copy of the design template.

Procedure 1: To copy the design template

1. Copy the entire contents of the vhdl_template directory structure to your working location:

> cp -r /opt/ufpapps/vhdl_template myapp

where myapp is the path to the development directory for your application.

2. Move to the src directory of the design:

> cd myapp/src

3. Rename the design directory:

> mv 80-0037_vhdl mydesign

where mydesign is an appropriate name for your design.

7.3.2 Tools

The design template works with specific tools. It can be used with other tools, but some modification will likely be necessary. Table 9, page 57 lists tools that you can use with the template with little or no modification.


Table 9. Recommended software packages

Type               Software package                              In Cray XD1 software distribution?
Shell              Bash                                          Yes
Editor             Emacs, vi                                     Yes
Build management   Make                                          Yes
HDL Compilation    Aldec Riviera, Mentor Modelsim, Xilinx ISE    No
HDL Simulation     Aldec Riviera, Mentor Modelsim                No
Synthesis          Xilinx ISE                                    No
Place and Route    Xilinx ISE                                    No

7.3.3 Required Customizations

Table 10, page 57 identifies the files in the design_name directory that likely require customization.

Table 10. Design template customizations

File: makefile_vars
    Customization: Update makefile variables to refer to your tools and paths.

File: hdl/user_app/XXX
    Customization: Add user logic files to this directory.

File: hdl/user_app/Makefile
    Customization: Update the SOURCE variable to contain all user logic HDL source files.

File: par/dev_model/top.npl
    Customization: Add the appropriate HDL files (update by hand or use the GUI).

File: par/dev_model/top.prj
    Customization: Add the appropriate HDL files (update by hand or use the GUI).

File: par/dev_model/top.ucf
    Customization: Adjust the timing specifications to the desired operational frequency.

Files: sim/tc_XX/fabric.in, sim/tc_XX/test.do
    Customization: Create a directory for each test case and update the fabric.in and test.do files in each directory as required.

7.3.4 Command Line Execution and Makefile Targets

The template includes a structure of makefiles that enables you to initiate all steps of the design process from the command line. These makefiles refer to the makefile_vars file at the top level of the design directory structure. This file centralizes the setting of some variables that all the makefiles use. Table 10, page 57 indicates that you need to customize the values of these variables for your configuration. Comments in the file describe the variables. Typically, you need to customize at least the following variables: SIMULATOR, TOOL_ROOT, and UFPAPPS. After you customize the design template, you can use the make command with the targets described in Table 11, page 58.

Table 11. Makefile targets

Target: sim_setup
    Simulation setup. Creates the simulation libraries and directories that are prerequisites for simulation. Make this target before the first time simulation is run and every time that you add new HDL files to or remove them from the design.

Target: sim
    Batch mode simulation. Compiles required files and starts simulation in batch mode, which runs all test cases and reports error and warning assertions in the output. Test cases must be in the sim directory in subdirectories labeled tc_01, tc_02, and so on.

Target: sim test=DIR
    Interactive mode simulation. Compiles the required HDL files and starts the simulator GUI with the test case files as stimulus. Test cases must be in the sim directory in directories labeled tc_01, tc_02, and so on.

Target: dev_model (xc2vp50, xc4vlx160, or xc4vsx55)
    Produce an FPGA binary for a particular target device. Synthesizes the HDL code, runs the logic mapper, and then the logic place and route tools.

Target: clean
    Clean up temporary files. Removes temporary files from the design directories.

Target: clean_all
    Clean up temporary and intermediate files. Removes all temporary and intermediate synthesis files from the design directories.

To initiate one of these steps:

1. Move to the directory that holds the HDL design files (for example, design_name in Figure 13, page 50).

2. Run the make command:

> make target

where target is one of the entries in Table 11, page 58.

Example 1: Using makefile targets

This example sets up the simulation files and then starts the simulator GUI for interactive verification.

> cd /path/to/mydesign/src
> make sim_setup
...
> make sim test=tc_01

7.3.5 Using the Xilinx ISE GUI


You can run the commands that produce an FPGA binary from either the Linux shell or from the Xilinx ISE GUI. The GUI requires a project file to run. The project file is in the par subdirectory of the target model FPGA (for example, par50). Its name is top.npl. Describing how to use the third-party tools is beyond the scope of this document. For information on using the Xilinx ISE GUI, refer to the Xilinx documentation that is provided with the tool or on the Xilinx website (http://www.xilinx.com).

7.3.6 Simulating the Design

You can simulate a design by instantiating it and connecting it to simulation models that provide stimulus. The design template includes a test bench in the hdl_tb directory; its name is top_tb.vhd. Figure 16, page 60 shows the structure of the test bench.

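[Figure: hdl_tb/top_tb.vhd instantiates hdl/top.vhd (prog_clk_gen, rt_core, user_app, and qdr2_core) together with two models: hdl_tb/fabric_model.vhd, which connects to rt_core, reads sim/tc_XX/fabric.in, and writes sim/tc_XX/fabric.out; and hdl_tb/cy7c1314v18.vhd, which connects to qdr2_core.]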

Figure 16. HDL test bench organization

Stimulus for the fabric model changes depending on which test case you run. The stimulus files for a test case and the output from the last time the test case ran are in the tc_XX (for example, tc_01) subdirectory of the sim directory. For the fabric model, these input and output files are called fabric.in and fabric.out respectively.


The test.do file in the tc_XX subdirectory contains a list of simulator commands to execute when the test case runs. The simulator run time is an important parameter to verify if you run the test case in batch mode. For further details on using the simulation models, see Chapter 9, page 77.


Interfaces [8]

This chapter describes the interfaces that the Cray IP cores—the RapidArray Transport (RT) core and the QDR II SRAM core—provide and identifies design considerations for interfacing FPGA user logic to other Cray XD1 resources and to the software application.

8.1 RapidArray Transport Interface

The Cray RT core block provides the RapidArray fabric interface to an FPGA design; see Figure 17, page 64. This core must always be present. The interface can both initiate and process responses for read and write transactions across the fabric. To facilitate this, there are two interfaces to the core: fabric request and user request. The Fabric Request Interface issues read and write requests to the user logic when it receives packets from the fabric (which originate from a node). The User Request Interface accepts read and write requests from the user logic that are for a device on the fabric (SMP processor memory). The RT interface has the following characteristics:

• 64-bit interface at a maximum speed of 200 MHz
• 3.2 GBps interface (1.6 GBps simultaneous transmit and receive)
• Posted writes
• Multiple outstanding read requests
• Data bursts up to 64 bytes per request
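The RT interface characteristics listed above are mutually consistent, as a short calculation shows. The Python sketch below is a worked example with hypothetical helper names; it assumes 1 GB is 10^9 bytes and one 64-bit transfer per clock cycle in each direction.

```python
# Worked arithmetic for the RT interface figures quoted in the text.
# Assumptions: 1 GB = 1e9 bytes; one 64-bit word per clock per direction.
# Helper names are hypothetical.
RT_BUS_WIDTH_BITS = 64
RT_MAX_CLOCK_MHZ = 200
RT_MAX_BURST_BYTES = 64


def rt_bandwidth_per_direction_gbps(clock_mhz=RT_MAX_CLOCK_MHZ):
    """Raw bandwidth of one direction of the RT interface, in GBps."""
    bytes_per_cycle = RT_BUS_WIDTH_BITS // 8
    return bytes_per_cycle * clock_mhz * 1e6 / 1e9


def words_per_max_burst():
    """Number of 64-bit words carried by one maximum-size burst."""
    return RT_MAX_BURST_BYTES // (RT_BUS_WIDTH_BITS // 8)


def requests_needed(num_bytes):
    """Minimum number of requests to move num_bytes using full bursts."""
    return (num_bytes + RT_MAX_BURST_BYTES - 1) // RT_MAX_BURST_BYTES
```

At 200 MHz the per-direction figure is 1.6 GBps, and doubling it for simultaneous transmit and receive gives the 3.2 GBps total. A maximum burst carries eight 64-bit words, so a 4 KB block needs at least 64 requests.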


[Figure: the RT core sits between the fabric (1.6 GBps transmit to fabric, 1.6 GBps receive from fabric) and the user logic. Signals shown on the user logic side:]

Common signals: user_clk, user_enable, user_reset_n, rt_ready

Fabric Request Interface
    Request signals: freq_addr(39:3), freq_size(3:0), freq_mask(7:0), freq_rw_n, freq_ts, freq_srctag(4:0), freq_data(63:0), freq_valid, freq_enable
    Response signals: uresp_ts, uresp_srctag(4:0), uresp_size(3:0), uresp_data(63:0), uresp_full

User Request Interface
    Request signals: ureq_addr(39:3), ureq_size(2:0), ureq_mask(7:0), ureq_rw_n, ureq_byte_req, ureq_ts, ureq_data(63:0), ureq_full, ureq_notag, ureq_srctag(4:0)
    Response signals: fresp_enable, fresp_valid, fresp_ts, fresp_size(2:0), fresp_srctag(4:0), fresp_data(63:0)

Figure 17. RT interface


8.2 QDR II SRAM Interface

The Cray QDR II core block controls data flow between the FPGA and the module's QDR II RAMs (refer to Figure 18, page 66). Each QDR II SRAM has its own QDR interface. On the SRAM (right) side of the circuit, the interface has separate read and write buses that function independently and can simultaneously transfer 36 bits (32 bits of data and 4 bits of parity) at a rate of up to 1.6 GB per second in each direction. On the user (left) side of the circuit, the interface has separate read and write buses that function independently and can simultaneously transfer 72 bits (64 bits of data and 8 bits of parity) at a rate of one 64-bit word per clock period in each direction. For design details, see Design of Cray XD1 QDR II SRAM Core (S–6412). If your design does not use the QDR II SRAM, you can omit this core.
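The two sides of the QDR II interface carry the same data rate, which the following Python sketch works through. It is an illustration under stated assumptions: 1 GB is taken as 10^9 bytes, the SRAM side is assumed to transfer data on both clock edges (as is usual for QDR II devices), the core is assumed to run at the user clock, and the helper names are hypothetical.

```python
# Worked arithmetic for the QDR II figures quoted in the text.
# Assumptions: 1 GB = 1e9 bytes; SRAM-side data moves on both clock edges;
# the core runs at the user clock. Helper names are hypothetical.

def sram_side_gbps(clock_mhz):
    """Data rate of one SRAM-side bus: 32 data bits, two transfers per clock."""
    return 4 * 2 * clock_mhz * 1e6 / 1e9


def user_side_gbps(clock_mhz):
    """Data rate of one user-side bus: one 64-bit word per clock."""
    return 8 * clock_mhz * 1e6 / 1e9


def aggregate_user_gbps(clock_mhz, banks=4, directions=2):
    """Peak data rate across all banks with reads and writes in parallel."""
    return user_side_gbps(clock_mhz) * banks * directions
```

At 200 MHz both sides come to 1.6 GBps per bus. With four banks each reading and writing every cycle, the peak aggregate at the highest programmable clock of 199 MHz would be about 12.7 GBps, excluding parity bits.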


[Figure: the QDR II SRAM IP core presents four QDR interfaces (Qdr 1 to Qdr 4) to the user logic and an external bus interface to the QDR II SRAM chips. Signals shown on the user logic side:]

Common signals: reset_n, qdr_clk0, qdr_clk90, locked_dcm, ram_rdy

Qdr 1 interface: dw_1[71:0], aw_1[19:0], bw_n_1[7:0], w_n_1, ar_1[19:0], r_n_1, dr_1[71:0]
Qdr 2 interface: dw_2[71:0], aw_2[19:0], bw_n_2[7:0], w_n_2, ar_2[19:0], r_n_2, dr_2[71:0]
Qdr 3 interface: dw_3[71:0], aw_3[19:0], bw_n_3[7:0], w_n_3, ar_3[19:0], r_n_3, dr_3[71:0]
Qdr 4 interface: dw_4[71:0], aw_4[19:0], bw_n_4[7:0], w_n_4, ar_4[19:0], r_n_4, dr_4[71:0]

Figure 18. QDR II IP Core interface


8.3 Interfacing User Logic to Other Cray XD1 Resources

The FPGA user logic interfaces to the Cray XD1 system through the RT core and the QDR II SRAM core. The application design determines which cores or portions of cores you need.

8.3.1 Interfacing Considerations

Often, the purpose of the FPGA AAP is to accelerate a software algorithm. The challenge is to write part of the algorithm in HDL code and then interface the user logic to the system so that a program on an SMP can use it. The interface between the application and the FPGA user logic consists of data flows and the control structures that manage them.

When you identify a suitable application or algorithm, examine the data flows and resources that it requires. There are several ways to move data between the SMP and the FPGA user logic. To decide where to store data and how to move it between the SMP and the FPGA logic, keep the general design considerations from Chapter 6, page 45 in mind. Consider the following questions:

• How will the SMP acquire the data (for example, from a disk or network)?
• How fast must the application process data?
• What is the structure of the data?
• How will the software move the data to the FPGA and in what form?
• Where will the FPGA store intermediate and final results?
• How will the FPGA return results to the SMP?

For a description of the methods of moving data between the SMP software application and the FPGA user logic, see Section 8.4, page 68.

After you identify the data flow and data structures, consider the control mechanisms and structures that you need. Control mechanisms are dependent on the data flow but typically include configuration registers and ring buffers. For example, you can use configuration registers to tell the FPGA logic the locations of data structures in the SMP memory and when to start processing.

When you know the data and control requirements, you can create a memory map. This map defines where in the FPGA address space to locate the data and control resources; applications on the SMP also use this map. The mapping of resources is flexible; you can tailor it to each design.


At the minimum, user logic must use the Fabric Request Interface of the RT core to respond to the SMP software application. The user logic that connects to the interface compares the request addresses to the memory map and takes appropriate action. Some examples of requests are accessing an internal register or block RAM and translating and forwarding a request to an interface of the QDR core. The hello reference design demonstrates some basic examples of how to handle these accesses. Accesses to certain resources may require arbitration. This occurs if two portions of user logic can access the resource or if both the application on the SMP and part of the user logic can access the resource. The user logic is responsible for arbitration whenever it is required.
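The address comparison described above can be modeled in software before it is written in HDL. The Python sketch below is a behavioral illustration only: the region names, base addresses, and sizes are hypothetical and do not come from the hello reference design or any Cray memory map.

```python
# Behavioral model of decoding a request address against a memory map.
# All region names, bases, and sizes are hypothetical examples.
from typing import NamedTuple, Optional, Tuple


class Region(NamedTuple):
    name: str
    base: int  # byte offset within the FPGA address region
    size: int  # size in bytes


MEMORY_MAP = [
    Region("config_regs", 0x0000_0000, 0x1000),
    Region("block_ram", 0x0001_0000, 0x1_0000),
    Region("qdr_bank_1", 0x0100_0000, 0x40_0000),
]


def decode(addr: int) -> Optional[Tuple[str, int]]:
    """Return (region name, offset within region), or None if unmapped."""
    for region in MEMORY_MAP:
        if region.base <= addr < region.base + region.size:
            return region.name, addr - region.base
    return None
```

Because the same map is used by the application on the SMP, keeping it in one shared description helps the software offsets and the decode logic stay in step.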

8.3.2 Disabling Unused Core Interfaces

If the user logic does not require the use of some of the interfaces on a core or an entire core that is instantiated in top.vhd, force the unused signals to an unchanging inactive state from the user_app/user_app.vhd file. This causes the HDL compiler to optimize out the unnecessary logic. The hello reference design includes some examples.

8.4 Interaction with the SMP Software Application

This section explains the relationship between API calls from a software application on the SMP and the RT core bus transactions that are delivered to the user logic in the FPGA. For further details of the C API function calls and the RT core bus transactions, see Chapter 11, page 95 and Design of Cray XD1 RapidArray Transport Core (S–6411), respectively. In the Cray XD1 architecture, the FPGA AAP is a RapidArray fabric device. All accesses to or from the FPGA happen across the RapidArray fabric. Software applications on the SMP access the FPGA by calling the API, which initiates appropriate transactions on the RapidArray fabric. These transactions travel through the RapidArray fabric and are delivered to the RT core in the AAP. It is the responsibility of the user logic in the AAP to process the transactions and respond appropriately. User logic in the FPGA can also initiate certain transactions. It can access the SMP memory through the RT core. The user logic in the FPGA sends a bus transaction to the RT core, which forwards it through the RapidArray fabric to hardware on the SMP, where it becomes a read or write transaction to the SMP DRAM. The FPGA logic can also use an RT transaction to request an interrupt on the SMP.


8.4.1 Memory Map

Because the RAP effectively connects the FPGA to the HyperTransport link of the local SMP, the FPGA is accessible via a region of the HyperTransport I/O address space. Specifically, the FPGA occupies a 128 MB address region of the link. Any HyperTransport read and write requests issued by the SMP to this region are directed to the RT interface of the FPGA which passes them on to the user logic. From a logical perspective, the FPGA appears similar to a PCI device in a legacy system, but, from a performance perspective, it appears as if the FPGA connects directly to the SMP. Figure 19, page 70 illustrates this concept.

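[Figure 19: block diagram of a node, showing the SMP with its DRAM DIMMs, the HyperTransport link, the RapidArray processors (RAP-1 and RAP-2), the RapidArray Transport (RT), and the expansion module with the FPGA application acceleration processor and its four QDR SRAMs. Beside it are the related address spaces: the DRAM space, the I/O space (labeled regions include 512 MB for RAP-1, 512 MB for RAP-2, and 128 MB for the AAP), and the application-dependent memory map.]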

Figure 19. Physical components of a node and related address spaces

8.4.2 SMP-initiated RT Requests

The software application can initiate transactions to the FPGA in two ways. It can either map the FPGA into the address space of the SMP and then make ordinary memory references or it can call specific read and write functions of the API.


Note: The important difference between these two access methods is that the I/O mapped accesses allow write combining while the read and write function calls do not.

8.4.2.1 I/O Mapped Accesses

I/O mapped accesses take advantage of a feature called write combining. Write combining greatly improves the performance of write accesses from the SMP to the FPGA by combining multiple write accesses into a single HyperTransport packet. The processor hardware does this by identifying writes to sequential address regions and packing them into a single burst write transaction. One important side effect of write combining is that writes may not occur in the same order in which they were issued. For example, if writes are made to addresses 1, 2, 4, 3 in that order, the write combining feature re-orders the transactions into a sequential burst of 1, 2, 3, 4. This type of access is often useful when the program accesses general-purpose memory because the order of the writes is often not important. In the cases where order is important, the application program can either call the fpga_mem_sync(3) function which enforces order by inserting a memory fence or call the fpga_rd_appif_val(3) and fpga_wrt_appif_val(3) functions.
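The reordering effect described above can be illustrated with a toy model in C. This is only a sketch: the type qw_write and the function combine_writes are hypothetical names introduced for illustration, and sorting a small buffer is an assumption that stands in for whatever the processor hardware actually does.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One buffered write: a target address (in arbitrary units) and a value. */
typedef struct {
    uint64_t addr;
    uint64_t value;
} qw_write;

/* Order two buffered writes by ascending address. */
static int cmp_addr(const void *a, const void *b)
{
    uint64_t x = ((const qw_write *)a)->addr;
    uint64_t y = ((const qw_write *)b)->addr;
    return (x > y) - (x < y);
}

/*
 * Toy model of write combining (an assumption, not the real hardware):
 * sort the buffered writes by address, which discards the issue order,
 * then report whether they form one contiguous burst.
 * Returns 1 if the sorted writes are contiguous, 0 otherwise.
 */
int combine_writes(qw_write *w, size_t n)
{
    qsort(w, n, sizeof w[0], cmp_addr);
    for (size_t i = 1; i < n; i++) {
        if (w[i].addr != w[i - 1].addr + 1)
            return 0;
    }
    return 1;
}
```

With writes issued to addresses 1, 2, 4, 3, the model leaves the buffer ordered 1, 2, 3, 4, so logic that depends on seeing the write to address 4 before the write to address 3 would observe the wrong order.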

8.4.2.1.1 Software API Details of I/O-Mapped Accesses

Memory mapping the FPGA and then reading and writing to a location in the mapped memory requires the software application to make the function calls shown in Example 2, page 71. For simplicity, the code and function calls listed below are not complete. For complete information on the functions and their use, see Cray XD1 Programming (S–2433).

Example 2: I/O mapped writes to the AAP

    u_64 *fpga_mem;
    int i;              /* signed, so that the countdown loop terminates */

    my_fpga = fpga_open(args);
    ...
    if (!fpga_is_loaded(args)) {
        rtn = fpga_load(args);
    }
    ...
    fpga_mem = fpga_memmap(args);
    ...
    for (i = 7; i >= 0; i--) {
        fpga_mem[i] = i;
    }
    ...

In Example 2, page 71, the program maps the FPGA and then writes a sequence of 64-bit values to locations that are offset from the base of the I/O mapped region. The program writes the value 7 to quadword offset 7, the value 6 to quadword offset 6, and so on down to offset 0.

8.4.2.1.2 Hardware API Details of I/O-Mapped Accesses

The accesses shown in Example 2, page 71 are memory mapped. Therefore, the processor reorders and combines them into a burst access to the FPGA. The user logic on the FPGA receives a write request from the Fabric Request Interface of the RT core. The request is a single burst write of eight quadwords to address 0 with data values 0, 1, 2, 3, 4, 5, 6, and 7. Because the interface addresses quadwords (8 bytes each), the burst covers byte addresses 0x0, 0x8, 0x10, 0x18, 0x20, 0x28, 0x30, and 0x38. For more information on the RT core interface, refer to Design of Cray XD1 RapidArray Transport Core (S–6411).
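The relationship between quadword offsets and byte addresses can be sketched in C as follows. The function name qw_to_byte_addr is hypothetical and is introduced only for illustration; the only fact it relies on is that a quadword is 8 bytes.

```c
#include <stdint.h>

/*
 * Convert a quadword offset to the corresponding byte address.
 * A quadword is 8 bytes, so the conversion is a left shift by 3.
 */
uint64_t qw_to_byte_addr(uint64_t qw_offset)
{
    return qw_offset << 3;
}
```

For the eight quadwords of the burst, offsets 0 through 7 give byte addresses 0x0 through 0x38 in steps of 8.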

8.4.2.2 API Function Accesses

The API functions fpga_wrt_appif_val(3) and fpga_rd_appif_val(3) do not allow write combining and enforce the order of reads and writes. Use these functions when you access control and status information because they ensure that all transactions complete in order. These functions are not as efficient as memory-mapped accesses because they enforce transaction order. Therefore, they are not suitable for large data transfers.

8.4.2.2.1 Software API Details of API Function Accesses

Example 3, page 72 shows how an application can use the fpga_wrt_appif_val(3) function. For simplicity, the code and function calls in the example are not complete. For complete information on the functions and their use, see Cray XD1 Programming (S–2433).

Example 3: Accessing the AAP with fpga_wrt_appif_val

    u_64 fpga_mem;
    int i;              /* signed, so that the countdown loop terminates */

    my_fpga = fpga_open(args);
    ...
    if (!fpga_is_loaded(args)) {
        rtn = fpga_load(args);
    }
    ...
    for (i = 7; i >= 0; i--) {
        fpga_wrt_appif_val(my_fpga, i, i, TYPE_VALUE, &err);
    }

In Example 3, page 72, the program loads the FPGA and then writes a sequence of 64-bit values to the FPGA address space at quadword offsets 7 down to 0 (addresses 0x38 down to 0x0).

8.4.2.2.2 Hardware API Details of API Function Accesses

The program in Example 3, page 72 uses the fpga_wrt_appif_val function; therefore, the order of the write transactions is preserved. The user logic on the FPGA receives eight single quadword write requests from the RT core Fabric Request Interface, in order. The first is a quadword write to offset 0x38 with a value of 7. For more information on the RT core interface, refer to Design of Cray XD1 RapidArray Transport Core (S–6411).

8.4.3 FPGA-initiated RT Requests

The FPGA can access the local memory of the SMP. To do so, the FPGA user logic requires initialization information from the application program on the SMP. At a minimum, this initialization includes a pointer to the shared memory buffer in the SMP DRAM. The software must write this pointer to the FPGA with the fpga_wrt_appif_val(3) function. The type parameter of the function must have the value 1, which translates the address to the type required by the FPGA.

User logic in the FPGA can use the RT core User Request Interface to generate read and write requests to the SMP. Writes are posted and can be burst transactions of up to eight quadwords. Unlike the SMP, the AAP user logic has full control over the ordering and all other aspects of the write transactions through the RT core. Also unlike the SMP, the AAP can issue burst read requests of up to 64 bytes and issue multiple outstanding read requests. By issuing multiple burst read requests, the FPGA can achieve full utilization of the HyperTransport link bandwidth when it reads from the SMP memory. Example 4, page 74 shows how a program creates an FPGA transfer region and then writes a pointer to that region to a register location, SMP_MEM_PTR_REG, in the FPGA user logic. For more information, see Design of Cray XD1 RapidArray Transport Core (S–6411).


Example 4: Initializing the AAP to access the SMP memory

    void *ftr_mem;

    my_fpga = fpga_open(args);
    ...
    if (!fpga_is_loaded(args)) {
        rtn = fpga_load(args);
    }
    ...
    stat = posix_memalign(&ftr_mem, getpagesize(), size);
    ...
    stat = fpga_register_ftrmem(my_fpga, ftr_mem, size, &err);
    ...
    stat = fpga_wrt_appif_val(my_fpga, SMP_MEM_PTR_REG, ftr_mem, TYPE_ADDR, &err);
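The allocation step of Example 4 can be tried on its own, without the FPGA API. The following is a minimal sketch that uses only standard POSIX calls; the helper name alloc_page_aligned is hypothetical, and the sketch uses sysconf(_SC_PAGESIZE) in place of getpagesize(), which is an assumption made for portability rather than something the example requires.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Allocate a buffer of `size` bytes aligned to the system page size,
 * as Example 4 does before registering the buffer with the FPGA API.
 * Returns NULL on failure. The caller releases the buffer with free().
 */
void *alloc_page_aligned(size_t size)
{
    void *buf = NULL;
    long page = sysconf(_SC_PAGESIZE);

    if (page <= 0)
        return NULL;
    /* posix_memalign returns 0 on success and an error number otherwise. */
    if (posix_memalign(&buf, (size_t)page, size) != 0)
        return NULL;
    return buf;
}
```

Note that posix_memalign reports failure through its return value, not through errno, so the stat value in Example 4 should be compared against 0.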

8.4.4 FPGA-initiated Interrupt Requests

An FPGA-initiated RT request of particular interest is the message that generates an interrupt request to the SMP. Applications can use this capability to coordinate the activities of the FPGA logic and the application process on the SMP. To generate an interrupt request, the FPGA logic performs a single-byte write to the following address on the RT core user request interface:

0xFDF8000000

For more information about the RT core interface, see Design of Cray XD1 RapidArray Transport Core (S–6411). The VHDL code fragments in Example 5, page 74 illustrate the required method.

Example 5: Generating an interrupt request (VHDL notation)

Declare a constant for the required address:

constant c_aap_int_addr : std_logic_vector(63 downto 0) := x"000000FDF8000000";

To generate an interrupt request to the SMP, hold the following values for a single clock cycle:

    ureq_ts       <= '1';
    ureq_rw_n     <= '0';
    ureq_size     <= (others => '0');
    ureq_byte_req <= '1';
    ureq_mask     <= x"01";
    ureq_addr     <= c_aap_int_addr(39 downto 3);

The value of the ureq_data signal does not matter. The application process in the SMP uses the fpga_int_wait(3) function from the software API library to enable the SMP to process interrupt requests from the FPGA. For details, see Section 11.4.9, page 112.
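The address slice used in Example 5 can be checked with a few lines of C. This is a sketch for illustration only: the macro AAP_INT_ADDR and the function ureq_addr_bits are hypothetical names, and the function merely mirrors the VHDL slice (39 downto 3), which drops the three low-order bits and keeps the next 37 bits.

```c
#include <stdint.h>

/* Interrupt request address from Example 5, as a 64-bit constant. */
#define AAP_INT_ADDR 0x000000FDF8000000ULL

/*
 * Mirror of the VHDL slice c_aap_int_addr(39 downto 3):
 * discard bits 2..0 and keep the 37 bits that follow.
 */
uint64_t ureq_addr_bits(uint64_t byte_addr)
{
    return (byte_addr >> 3) & ((1ULL << 37) - 1);
}
```

For the interrupt address, the slice evaluates to 0x1FBF000000, which is the byte address divided by 8.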

Simulation and Debugging [9]

This chapter describes some of the methods available to assist you in debugging user logic on the FPGA application acceleration processor. Cray provides both simulation models and a hardware interface to the Xilinx JTAG port for the purpose of debugging and verification. In addition, two utility programs that use the application programming interface (API) let you read and write data values in the FPGA address space.

9.1 Simulation Models

Simulation models for all the Cray XD1 cores as well as a simple behavioral model for the RapidArray fabric are included in the Cray XD1 software distribution. These models are in the design template simulation test bench.

9.1.1 Cray XD1 Core Simulation Models

Simulation models are present for each of the Cray cores. The models come in the following formats:

• An encrypted Riviera model
• A VHDL netlist file

Encrypted models should perform faster during simulation, but you can use them only with the specified simulation software package. You can use VHDL netlists with any simulator.

9.1.2 RapidArray Fabric Behavioral Model

The behavioral model for the RapidArray fabric is a simple VHDL model that you can use to simulate read and write requests to and from the SMP SDRAM and processors. Typically, designers modify the fabric model inputs or the fabric model itself to properly verify the user logic. Examples of how to use the RapidArray fabric model are in each reference design.

9.1.2.1 Fabric Model Inputs

The fabric model inputs are read from the fabric.in file. Requests from the fabric model appear at the fabric request interface of the RT core, where the user logic must handle them. The data returned by read requests can be compared against expected values; an assertion is raised when the expected and actual values differ. The fabric.in file may contain several commands. Table 12, page 78 lists the fabric commands; the fabric.in file also documents these.

Table 12. Commands supported in the fabric.in file

Command name      Command format
Initialize Link   I
Print             P text_to_print
Delay             D delay_value
Read              R address expected_data byte_mask byte_request size
Write             W address write_data byte_mask byte_request size
Burst             B data byte_mask

Table 13, page 78 describes the fabric.in command arguments.

Table 13. Arguments of the fabric.in commands

Argument        Description
text_to_print   A text value. The text to print to the console.
delay_value     An integer value greater than 1. The number of user clock cycles to delay.
address         A 40-bit value. The hex address to access.
expected_data   A 64-bit value. The data read is compared to this value and an assertion is raised if the values differ.
byte_mask       An 8-bit value. Controls which bytes of the data to access. The least significant (furthest right) bit, bit 0, controls the least significant (furthest right) byte, byte 0. A value of logic 1 enables the byte.
byte_request    A single-bit value. When the byte request is a logic 1, the byte mask is used to determine which bytes to access. When byte request is logic 0, a quadword request is made.
size            A single-byte value. The size of the request in doublewords. A value of 0x0 indicates a single doubleword transfer, while a value of 0xF indicates a 16 doubleword transfer.
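The effect of the byte_mask argument on a 64-bit value can be sketched in C. The function apply_byte_mask is a hypothetical helper introduced for illustration; it assumes conventional byte-enable semantics (enabled bytes take the new data, disabled bytes keep the old data), which the user logic in a particular design may or may not implement.

```c
#include <stdint.h>

/*
 * Merge new_qw into old_qw under control of an 8-bit byte mask.
 * Bit 0 of the mask controls byte 0 (the least significant byte);
 * a 1 bit means the byte is taken from new_qw.
 */
uint64_t apply_byte_mask(uint64_t old_qw, uint64_t new_qw, uint8_t byte_mask)
{
    uint64_t bits = 0;

    for (int b = 0; b < 8; b++) {
        if (byte_mask & (1u << b))
            bits |= 0xFFULL << (8 * b);   /* enable all 8 bits of byte b */
    }
    return (old_qw & ~bits) | (new_qw & bits);
}
```

A mask of FF replaces the whole quadword, while a mask of 0F replaces only the lower four bytes.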

Some additional conventions that these input files use are as follows:

• All numeric values are in hexadecimal notation.
• The # character introduces a comment.

Example 6, page 79 shows a fabric.in file. The example performs the following actions:

• Delay for 100 user clock cycles.
• Print a string that describes the read which follows.
• Burst read that starts at address 0x0004000000 and is 16 doublewords long. A read command must begin with a read line and then follow with enough burst lines to equal the size of the access. The first read line and each subsequent burst line include the expected value from each successive location.
• Print a string that describes the write which follows.
• Burst write that starts at address 0x0004000018 and is 10 doublewords long. A write command must begin with a write line and then follow with enough burst lines to equal the size of the access. Each line of a write command includes a data value and a byte mask, which form the value that is written to that address.
• Print a string that describes the read that verifies the write.
• Burst read that starts at address 0x0004000000 and is 16 doublewords long.

Example 6: Format of the fabric.in file

    ##########
    # Insert a delay to allow the DLLs some time to lock
    D 100

    ##########
    # Test access to the internal register space. The internal
    # registers are mapped by the RT Client into the register space
    # (i.e. $00_0400_0000 to $00_07FF_FFFF, or the top 64 Mbytes);
    ####
    P Fabric Model: Testing a burst read of the registers
    ####
    R 0004000000 0000000300360100 FF 0 F # Read base, revision, etc
    B 0000000000000000 FF # Read the App. Config
    B 0000000000000000 FF # Read the App. Latch
    B AAAAAAAAAAAAAAAA FF # Read the user register 1
    B BBBBBBBBBBBBBBBB FF # Read the user register 2
    B CCCCCCCCCCCCCCCC FF # Read the user register 3
    B DDDDDDDDDDDDDDDD FF # Read the user register 4
    B EEEEEEEEEEEEEEEE FF # Read the user register 5
    ##########
    P Fabric Model: Testing a burst write to the registers
    ####
    W 0004000018 0123456789ABCDEF FF 0 9 # Write user register 1
    B 55AA55AA55AA55AA FF # Write user register 2
    B FEDCBA9876543210 FF # Write the user register 3
    B FFFFFFFFFFFFFFFF FF # Write the user register 4
    B AA55AA55AA55AA55 FF # Write the user register 5
    ##########
    P Fabric Model: Reading back the registers
    ####
    R 0004000000 0000000300360100 FF 0 F # Read base,revision,etc
    B 0000000000000000 FF # Read the App. Config
    B 0000000000000000 FF # Read the App. Latch
    B 0123456789ABCDEF FF # Read the user register 1
    B 55AA55AA55AA55AA FF # Read the user register 2
    B FEDCBA9876543210 FF # Read the user register 3
    B FFFFFFFFFFFFFFFF FF # Read the user register 4
    B AA55AA55AA55AA55 FF # Read the user register 5
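The relationship between the size field and the number of burst (B) lines in Example 6 can be sketched in C. The function names below are hypothetical. The sketch assumes, based on the example itself, that each R, W, or B line carries one 64-bit value (two doublewords), so that a size of F (16 doublewords) needs one R line plus seven B lines.

```c
/*
 * Number of doublewords requested by a size field:
 * 0x0 means one doubleword and 0xF means sixteen.
 */
unsigned request_doublewords(unsigned size_field)
{
    return (size_field & 0xFu) + 1u;
}

/*
 * Number of B lines that follow the initial R or W line, assuming
 * each line carries one quadword (two doublewords). The first
 * quadword is on the R or W line itself.
 */
unsigned burst_lines(unsigned size_field)
{
    unsigned quadwords = (request_doublewords(size_field) + 1u) / 2u;
    return quadwords - 1u;
}
```

Applied to Example 6, the reads with size F are followed by seven B lines and the write with size 9 (10 doublewords) is followed by four B lines.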

9.1.2.2 FPGA Transfer Region

The FPGA transfer region (FTR) is a portion of the SMP processor DRAM that is shared with the FPGA—see Example 4, page 74. The RapidArray fabric model simulates this region using the ram signal, which is of type qw_array for quadword array. If you need initial values in the FTR, modify the fabric model ram signal accordingly.


9.2 Using the JTAG Interface Card

In addition to design verification through simulation, it can be necessary to verify and debug the design on the target hardware. Many types of problems are difficult and time-consuming to anticipate, recreate, and test in a simulation environment. To facilitate target debugging, the Cray XD1 system provides access to the JTAG port of the FPGA.

9.2.1 The JTAG Interface Card

Access to a JTAG port occurs through an optional JTAG interface card (see Figure 20, page 81), which plugs into one of the high-speed I/O slots in the back of the Cray XD1 chassis.

Figure 20. JTAG interface card

Note: The Cray XD1 product warranty requires that only Cray service personnel, Cray-authorized service providers, or Cray-trained customers perform hardware maintenance. See the http://crinform.cray.com/xd/ website for information on Cray XD1 training programs and service support. If you install a JTAG interface card in the field, read and follow the field replacement procedure that comes with the card.

9.2.2 Mapping JTAG Interface Ports to FPGAs

The three I/O slots on the back of the Cray XD1 chassis are numbered 1 through 3 from left to right as seen from the back of the chassis. Each JTAG interface card has two ports that are numbered 1 and 2 from top to bottom. Figure 21, page 82 shows the configuration. Note that this figure shows three JTAG interface cards installed; there may be fewer in your system. Typically, a PCI-X expansion card occupies I/O slot 1.

[Figure 21: rear of the chassis showing I/O slots 1, 2, and 3, each holding a JTAG interface card with Port 1 above Port 2.]

Figure 21. Ports on the JTAG interface card

Each port can connect to one of the six possible FPGAs in a chassis. Table 14, page 82 lists the default mapping of JTAG ports to FPGAs.

Table 14. JTAG interface card default connections

          Slot 1    Slot 2    Slot 3
Port 1    Node 1    Node 3    Node 5
Port 2    Node 2    Node 4    Node 6
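The default mapping in Table 14 follows a simple pattern, which the following C sketch expresses as a formula. The function name default_jtag_node is hypothetical and is not part of any Cray tool; it only reproduces the table.

```c
/*
 * Default node (FPGA) for a JTAG interface port, per Table 14.
 * Slots are numbered 1..3 and ports 1..2.
 * Returns 0 if the slot or port number is out of range.
 */
int default_jtag_node(int slot, int port)
{
    if (slot < 1 || slot > 3 || port < 1 || port > 2)
        return 0;
    return 2 * (slot - 1) + port;
}
```

For example, port 1 of slot 2 maps to node 3 by default, so connecting that port to node 5 requires an explicit mapping.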

If the default mapping is appropriate, you can proceed to connect your workstation to the relevant port with no further configuration task. Otherwise, you can use a Cray XD1 command to manage the connections of JTAG interface ports to FPGAs. You can view the current mapping of ports to FPGAs in a chassis, map any port to any FPGA in the chassis, or restore the default mapping of ports to FPGAs in a chassis.

9.2.2.1 Viewing JTAG Interface Port Connections

Use the following command to view the current connections of JTAG interface ports to FPGAs:

> xd1_jtag_route --chassis=chassis-addr --show

where chassis-addr is the address of the chassis in one of the following forms:

• IP address: aaa.bbb.ccc.ddd
  This address appears on the first line of the LCD on the front of the chassis.
• Host name: chassischassis-id
  where chassis-id is the hardware ID (serial number) of the chassis. An example of the chassis host name is chassis371.

The command lists the current connections and the state (enabled or disabled) and status of each.

9.2.2.2 Connecting a JTAG Interface Port to an FPGA

Use the following command to map a JTAG interface port to a particular FPGA in a particular chassis:

> xd1_jtag_route --chassis=chassis-addr --slot=slot-num --port=port-num --aap=node-num

where chassis-addr is as described in Section 9.2.2.1, page 82, slot-num is the number of the slot that the JTAG interface card occupies, port-num is the number of the relevant port on the card, and node-num is the ordinal of the target node (and hence the FPGA) in the chassis.

Example 7: Connecting a JTAG interface port to an FPGA

To connect the upper port on a JTAG interface card in I/O slot 2 of chassis 371 to the FPGA on node 5 of that chassis:

> xd1_jtag_route --chassis=chassis371 --slot=2 --port=1 --aap=5

9.2.2.3 Restoring the Default JTAG Interface Port Connections

Use the following command to restore the default mapping of JTAG interface ports to FPGAs:

> xd1_jtag_route --chassis=chassis-addr --default

where chassis-addr is as described in Section 9.2.2.1, page 82.

The command resets the mapping of all JTAG interface ports to the default shown in Table 14, page 82.

9.2.3 Connecting a Workstation to a JTAG Interface Port

The JTAG port enables you to use tools such as the ChipScope Pro debugger from Xilinx with the FPGA. ChipScope provides logic analyzer-like functionality: it enables you to monitor and capture logic events as they occur in the FPGA. For more information on ChipScope, refer to the Xilinx website at http://www.xilinx.com.

Currently, the ChipScope Analyzer software runs only on a PC running Windows. It connects to the target device through the parallel port of the PC. Cray provides an RJ-45 to DB25 adapter that enables you to connect a CAT 5 cable between a port on the JTAG interface card and the PC parallel port. Figure 22, page 84 illustrates the adapter.

[Figure 22: rear view and side view of the adapter, with positions labeled 1 through 8.]

Figure 22. JTAG I/O adapter

The JTAG adapter plugs into the parallel port of a PC, and a standard CAT5 (straight-through) networking cable connects the adapter to the JTAG interface card in a Cray XD1 system; see Figure 23, page 85.


Caution: The CAT 5 connector and cable are identical to Ethernet equipment. Do not accidentally plug networking equipment into the JTAG interface card. Damage to the networking equipment and/or the JTAG interface card may occur.

[Figure 23: a straight-through CAT 5 cable runs from the JTAG interface card to the RJ-45 to DB25 adapter, which connects to the PC.]

Figure 23. Accessing the JTAG interface card

9.3 Reading and Writing Data Values

As a convenience, Cray provides two utility programs, fpga_read(1) and fpga_write(1), that let you access any location in the FPGA address space from the command line. These may be helpful particularly after you have an application logic file and before you have a working software application. These utilities are based on the FPGA software API. Cray provides the source code for them in the following location:

/opt/ufpapps/utils/src


The executable files for these utilities are in the following directory:

/opt/ufpapps/utils/bin

This section assumes that you add this path to your PATH variable. It also assumes that you load your logic file into the FPGA before you execute these commands.

9.3.1 Reading Data from the FPGA

With the fpga_read(1) command, you can read and display a block of data from the FPGA address space. You must specify at least the offset in the address space (in bytes as a hexadecimal value), and you can specify the word size (in bytes) and the number of words that you want. The basic syntax of the command is as follows:

fpga_read [--number=num_words] [--size=word_size] offset

If you omit the --number option, the command reads a single word. If you omit the --size option, the command uses a word size of 8 bytes; other valid word sizes include 1 and 4. Specify the offset argument as a hexadecimal value. For a full description of these and other options, see the fpga_read(1) man page. The command displays the data as hexadecimal values.

Example 8: Using the fpga_read command

To read 8 64-bit words from location 0x4000000 in the local FPGA:

> fpga_read --number=8 0x4000000
0x4000000: 0x0000000900330100
0x4000008: 0x0000000000000000
0x4000010: 0x0000000000000004
0x4000018: 0x0000000000000401
0x4000020: 0x0000000081A00000
0x4000028: 0x0000000000000000
0x4000030: 0x0000000000000000
0x4000038: 0x0000000000000000

To read 8 32-bit words from location 0x4000000 in the local FPGA:

> fpga_read --number=8 --size=4 0x4000000
0x4000000: 0x00330100
0x4000004: 0x00000009
0x4000008: 0x00000000
0x400000C: 0x00000000
0x4000010: 0x00000004
0x4000014: 0x00000000
0x4000018: 0x00000401
0x400001C: 0x00000000
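The two listings in Example 8 show the same data at different word sizes: the lower 32 bits of each 64-bit word appear at the lower address. The following C sketch reproduces that relationship. The function names are hypothetical, and the ordering is inferred from the example output rather than from a statement in the text.

```c
#include <stdint.h>

/* 32-bit word at the same address as the 64-bit word (lower half). */
uint32_t word_at_offset0(uint64_t qw)
{
    return (uint32_t)(qw & 0xFFFFFFFFULL);
}

/* 32-bit word four bytes above the 64-bit word's address (upper half). */
uint32_t word_at_offset4(uint64_t qw)
{
    return (uint32_t)(qw >> 32);
}
```

For the first word in the example, 0x0000000900330100, the lower half 0x00330100 appears at 0x4000000 and the upper half 0x00000009 at 0x4000004.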

9.3.2 Writing Data to the FPGA

With the fpga_write(1) command, you can write a data value into any location in the FPGA address space. You must specify at least the offset in the address space (in bytes) and the data value that you want to write at that location. You can also specify the size (in bytes) of the data word. The basic syntax of the command is as follows:

fpga_write [--size=word_size] offset value

Specify both the offset and value arguments as hexadecimal values. If you omit the --size option, the command uses a word size of 8 bytes; other valid word sizes include 1 and 4. For a full description of these and other options, see the fpga_write(1) man page.

Example 9: Using the fpga_write command

To read, write, and verify a 64-bit word in location 0x4000028 in the local FPGA:

> fpga_read 0x4000028
0x4000028: 0x0000000000000000
> fpga_write 0x4000028 0x0123456789ABCDEF
> fpga_read 0x4000028
0x4000028: 0x0123456789ABCDEF

To write only the lower 32 bits of the same location:

> fpga_write --size=4 0x4000028 0x55aa55aa
> fpga_read 0x4000028
0x4000028: 0x0123456755AA55AA

Part IV: Developing Accelerated Software Applications

Managing FPGA Logic from the Command Line [10]

You can perform some FPGA tasks from the Linux command line with the FPGA control utility, fcu(1). Alternatively, you can perform the same tasks and more with the functions of the FPGA application programming interface (API); for details, see Section 11.4, page 99. Therefore, you can choose whether the application performs these tasks in a shell script or in the compiled C program. This chapter describes all the capabilities of the fcu(1) command, including converting a raw logic file to loadable form (which you can do only with this command). For an example that applies commands from this chapter to the sample application program, see Section 12.2, page 123.

10.1 Converting a Raw Logic File to Loadable Form

The output of the FPGA logic development process is a raw logic file (often called simply the binary file). Usually, the name of this file has a .bin suffix. It must have been compiled for a particular variant of the FPGA—each combination of FPGA size (number of gates) and speed grade requires a different binary file. The model of FPGA in the Cray XD1 system requires a slightly different file format than the development tools produce—it requires reversal of the bits in each byte. In addition, the FPGA API and the Cray XD1 Linux driver for the device require extra information in a header section of the file. The header identifies the target expansion module variant and the desired clock speed for your design. Therefore, you must transform the raw logic file to reverse the bits and embed the extra information before you can use it. The multipurpose FPGA control utility, fcu(1), performs these two operations of the conversion.
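The bit reversal mentioned above can be illustrated in C. This sketch is not the fcu(1) implementation and does not produce a loadable file (it writes no header); the function names are hypothetical and the code shows only what reversing the bits in each byte means.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the order of the 8 bits in one byte (bit 0 <-> bit 7, etc.). */
uint8_t reverse_bits8(uint8_t b)
{
    b = (uint8_t)(((b & 0xF0u) >> 4) | ((b & 0x0Fu) << 4)); /* swap nibbles */
    b = (uint8_t)(((b & 0xCCu) >> 2) | ((b & 0x33u) << 2)); /* swap pairs   */
    b = (uint8_t)(((b & 0xAAu) >> 1) | ((b & 0x55u) << 1)); /* swap bits    */
    return b;
}

/* Reverse the bits of every byte in a buffer; byte order is unchanged. */
void reverse_bits_in_place(uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = reverse_bits8(buf[i]);
}
```

Applying the transform twice restores the original data, which is a convenient sanity check when comparing a raw file with a converted one.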

Procedure 2: To convert a raw logic file to loadable form

1. Identify the Cray part number of the target FPGA. It is a string of the form 90-nnnn-nn that specifies the particular variant of FPGA that is present. You can identify the part numbers of the FPGAs in a system by using the Active Manager lsnode --verbose command.

   Note: In previous releases, the FPGA part number that the command requires was of the form 87-nnnn-nn. In the current release, the fcu(1) command still accepts a part number of this form, but will not in future releases.

2. Create a header file (which you will later merge with the logic file):

> fcu --build [headerfile] [--partnum part-number] [--clock clock-freq]

where headerfile is the output file name, part-number is the Cray part number of the FPGA, and clock-freq is the clock frequency at which to run the FPGA, in megahertz. If you omit headerfile, the output file name is ufphdr. The logic designer can provide the clock frequency; it must be in the range 63 through 199. If you do not specify the part number or the clock frequency in the command line, the command prompts you for the information. The fcu program creates the header file.

3. Merge the header file and the raw logic file:

> fcu --convert rawfile headerfile [loadfile]

where rawfile is the name of the input raw logic file (typically, design.bin), headerfile is the name of a header file that was previously created by the --build option of fcu (as in the preceding step), and loadfile is the name of the output file. If you omit loadfile, the output file name is rawfile.ufp.

The fcu program prepends the header file to the logic file and transforms the logic file as necessary. The output file is ready to load into the FPGA.

10.2 Loading FPGA Logic into the Device

After you convert the raw logic file into loadable form, you can load it into the FPGA device in preparation for running your application. To load FPGA logic into the device from the command line, run the following command:

> fcu --load loadfile

where loadfile is the path name of the converted FPGA logic file. The fcu program loads the specified file into the local FPGA. The application logic is reset, then released from reset.


10.3 Resetting an FPGA

You can explicitly reset the application logic within an FPGA from the command line when necessary. To reset an FPGA from the command line, run the following command:

> fcu --reset

The fcu program resets the local FPGA by asserting the user_reset_n signal that the RT Core outputs. Any application logic that is connected to this signal is reset.

Note: Exercise care when you reset the FPGA application logic. In a typical design, the reset logic overrides most other logic functions in a device. If you reset the application logic while the FPGA is actively performing a task, the output may be meaningless.

10.4 Releasing an FPGA from Reset State

After you reset an FPGA, you must explicitly release it from reset. To do so from the command line, run the following command:

> fcu --exec

The fcu program takes the local FPGA out of reset by de-asserting the user_reset_n signal that the RT Core outputs. Any application logic that is connected to this signal is released from reset.

10.5 Querying the Status of an FPGA

You can display information about the status of the FPGA at the command line. To do so, run the following command:

> fcu --status

The program displays a numeric status code that depends on the state of the device. This is the value of the host latch register in the RT Core as a decimal integer. For details, see Design of Cray XD1 RapidArray Transport Core (S–6411). A value of 255 indicates that the FPGA is not programmed.


10.6 Erasing an FPGA

If security is an issue at your site, you can use the fcu command to completely clear the programming of the FPGA after you finish using it.

Note: You do not need to erase the FPGA explicitly before you load another logic file because the load operation also initially erases the device.

To erase an FPGA from the command line, run the following command:

> fcu --unload

Using an FPGA in an Application Program [11]

This chapter describes how to use the optional Cray XD1 field-programmable gate array (FPGA) application acceleration processors (AAPs) in an application program.

Note: In this release, the FPGA application programming interface is available only as a C library.

11.1 Role of the Software Programmer in FPGA Logic Development

Development of the raw FPGA logic file is a complex process that requires specialized knowledge and tools. As an application programmer, you typically collaborate with a hardware designer or other specialist in this task. Your role in the process is mainly to identify the parts of the application that are suitable for implementation in the FPGA and to communicate precise specifications to the logic developer. You also need to work closely with the logic developer to design the protocol for communication between the software and the FPGA logic.

11.2 Overview of Using an FPGA in an Application Program

This section describes some concepts that you need to understand to develop software applications that use the FPGA.

11.2.1 Typical Application Workflow

An application program uses an FPGA much as it uses other devices: the program opens the device to get a file descriptor, uses the descriptor to interact with the device, and finally closes the device. Your program begins to interact with the FPGA by loading the converted logic file into the device (unless you plan to do this from the command line prior to the start of the application; see Section 10.2, page 92). Then the program can set up the data and execute the logic as often as necessary. Your program can transfer data in either direction by setting up the appropriate memory access. In addition, functions are available to read and write individual values in the FPGA address space. For example, you can read and write application-defined registers to implement control and status operations. In addition to writing and reading data values, the FPGA application logic can generate an interrupt request at the SMP to coordinate the activities of the two processors. Finally, a program can reset the FPGA (to halt execution of the logic) and close the file descriptor.
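The workflow above can be sketched as a single C function. This is a minimal illustration, not code from the Cray distribution. The call shapes follow the samples in Section 11.4. The HAVE_UFPLIB macro, the stub bodies, and the function name run_fpga_session are hypothetical and exist only so that the sketch compiles on a machine without the Cray library.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>

#ifdef HAVE_UFPLIB               /* hypothetical macro: define it on a Cray XD1 */
#include "ufplib.h"
#else
/* Stand-in stubs (assumptions): they only mimic the call shapes used in this
 * chapter so that the sketch compiles without the Cray library. */
typedef enum { NOERR = 0 } err_e;
static int fpga_open(const char *path, int flags, err_e *err)
{ (void)path; (void)flags; *err = NOERR; return 3; }
static int fpga_load(int fd, const char *file, err_e *err)
{ (void)fd; (void)file; *err = NOERR; return 0; }
static int fpga_reset(int fd, err_e *err)
{ (void)fd; *err = NOERR; return 0; }
static int fpga_close(int fd, err_e *err)
{ (void)fd; *err = NOERR; return 0; }
#endif

/* Open the device, load the logic, do the work, reset, and close.
 * Returns 0 on success and -1 if any step reports failure. */
int run_fpga_session(const char *dev_path, const char *logic_file)
{
    err_e err;
    int fd;
    int rc = 0;

    assert(dev_path != NULL && logic_file != NULL);

    fd = fpga_open(dev_path, O_RDWR | O_SYNC, &err);
    if (fd < 0)
        return -1;

    if (fpga_load(fd, logic_file, &err) < 0) {
        rc = -1;
    } else {
        /* ... set up data, let the logic run, collect results ... */
        if (fpga_reset(fd, &err) < 0)    /* halt execution of the logic */
            rc = -1;
    }

    /* Close the descriptor even if an earlier step failed. */
    if (fpga_close(fd, &err) < 0)
        rc = -1;
    return rc;
}
```

A caller might invoke it as run_fpga_session("/dev/ufp0", "my_logic.ufp"). On a real system you would build with the hypothetical HAVE_UFPLIB macro defined so that the declarations come from ufplib.h instead of the stubs.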

11.2.2 Understanding Address Spaces on a Node

To understand the methods of communication between an application and the FPGA, you need to know about the relationships among the address spaces of the node components—the Opteron symmetric multiprocessor (SMP), the RapidArray processor (RAP) on the optional expansion module, and the FPGA application acceleration processor. Figure 19, page 70 illustrates these relationships. The physical interpretation of locations in the FPGA address space depends entirely on the design of the application logic in the FPGA. The logic may map the Quad Data Rate (QDR) II SRAM (which is attached to the FPGA) and the internal resources of the FPGA (such as registers and RAMs) to arbitrary addresses in its address space. For example, low addresses (starting at an offset of 0) could map to locations in the QDR II SRAM, and addresses starting at offset 0x4000000 (64 MB) could map to internal resources in the FPGA. Note: Cray Inc. supplies an FPGA logic core—the QDR II SRAM core—that enables application logic in the FPGA to use the SRAM.
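To make the example mapping concrete, the following sketch classifies an offset under the layout mentioned above (QDR II SRAM from offset 0, internal resources from 0x4000000). The layout is only the example from this section; a real design may map resources differently, and the names used here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Example boundary from the text: internal resources start at 64 MB.
 * This is an assumption about one possible design, not a fixed rule. */
#define EXAMPLE_INTERNAL_BASE UINT64_C(0x4000000)

typedef enum { REGION_QDR_SRAM, REGION_INTERNAL } fpga_region_t;

/* Report which resource an FPGA address-space offset refers to. */
fpga_region_t classify_offset(uint64_t offset)
{
    return (offset < EXAMPLE_INTERNAL_BASE) ? REGION_QDR_SRAM
                                            : REGION_INTERNAL;
}

/* Convert an FPGA address-space offset to an offset within its region. */
uint64_t offset_within_region(uint64_t offset)
{
    return (offset < EXAMPLE_INTERNAL_BASE) ? offset
                                            : offset - EXAMPLE_INTERNAL_BASE;
}
```

Under this example layout, offset 0x4000010 refers to byte 0x10 of the internal resources, and any offset below 0x4000000 refers to the SRAM.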

11.2.3 Data Transfer Methods

The protocol that you design for the interaction between the application process (which runs on an Opteron processor) and the FPGA application logic can use any or all of the following methods to transfer data:

• The application process can map a region of the FPGA address space into its own address space and access a location in the region using a normal memory reference (pointer). It can set or use the contents of any location within the mapped region. For details, see Section 11.4.5, page 103. If the sequence of accesses is important, the application must explicitly request synchronization of accesses at appropriate points; for details, see Section 11.4.6, page 105.

• The application process can allocate a block of memory within its own address space and register it for direct access by the FPGA logic. This block is called an FPGA transfer region (FTR). The FPGA can set or use the contents of any location in the registered region. For details, see Section 11.4.8, page 110.

• The application process can write and read individual 64-bit values at locations in the FPGA address space. Each such operation requires a function call. These calls guarantee the order of access.

11.2.4 Using Interrupts

The protocol that you design for the interaction between the application process (which runs on an Opteron processor) and the FPGA application logic can use interrupts to coordinate the activities of the two processors. The FPGA logic design can include generation of an interrupt request to the Opteron processor, for example to signal that a block of output results is available. At the appropriate time (for example, after it fills up all input buffers) the application process can wait for the interrupt to occur. This method of coordination uses fewer CPU cycles than alternative methods such as polling a register. For details, see Section 11.4.9, page 112.

11.2.5 Summary of API Functions

Table 15, page 97 lists all the C functions in the FPGA API library.

Table 15. API summary

API function name      Description
fpga_open              Opens an FPGA device
fpga_load              Loads a converted binary logic file into an FPGA
fpga_is_loaded         Queries the programming state of an FPGA
fpga_reset             Places the FPGA user logic into reset
fpga_start             Releases the FPGA user logic from reset
fpga_memmap            Maps a region of the FPGA address space into the application address space
fpga_mem_sync          Forces completion of all outstanding transactions to mapped FPGA memory
fpga_register_ftrmem   Registers a region of application memory for direct access by the FPGA
fpga_dereg_ftrmem      Deregisters an FPGA transfer region
fpga_set_ftrmem        Sets up a region of memory in the application process for direct access by the FPGA. Note: This function is deprecated. Cray recommends that you use the fpga_register_ftrmem function instead.
fpga_rd_appif_val      Reads a value from the FPGA address space and guarantees access order
fpga_wrt_appif_val     Writes a value to the FPGA address space and guarantees access order
fpga_int_wait          Waits for the FPGA to issue an interrupt request
fpga_status            Gets the status of an FPGA device
fpga_unload            Clears the programming of an FPGA
fpga_close             Closes an FPGA device

11.3 Compiling and Linking

The C header file, /usr/local/include/ufplib.h, provides the declarations of the functions and data types that are defined in the FPGA API library. Include it in any C source file that uses the library. Include the following option (illustrated with gcc syntax) when you compile your program:

-D_REENTRANT

The FPGA API object library path is /usr/local/lib64/libufp.a. Include the following options when you link your program:

-L/usr/local/lib64 -lufp -lpthread -lrapl


11.4 FPGA API Library Functions

This section describes in detail all the functions in the FPGA API library. In the code samples in this section, bold and italic text highlights only the elements that directly declare or use the function under discussion and its arguments. Bold text represents elements that you type exactly as shown, and italic text represents variables that you name. All the other code in plain fixed-width font is arbitrary sample code that you replace according to the needs and coding style of your application.

11.4.1 Opening an FPGA: fpga_open

The first operation in using an FPGA in an application program is to open the device with the fpga_open(3) function. This operation interacts only with the Linux kernel to prepare for the communication to follow—it does not physically affect the FPGA. The result is a file descriptor that your application uses in all other function calls that affect the device. To open an FPGA, use statements like the following examples in your program:

#include <fcntl.h>
#include "ufplib.h"

int fpga_fd;
const char *fpga_path;
int flags;                 /* For example: O_RDWR | O_SYNC */
err_e err;

/* ... */
/* Set values of function arguments. */
/* ... */
fpga_fd = fpga_open(fpga_path, flags, &err);
if (fpga_fd < 0) {
    /* Handle error. */
}

Table 16, page 100 describes the variables in the sample code.

Table 16. fpga_open(3) arguments and return value

Variable    Input or output   Description
fpga_path   Input             The absolute path of the FPGA character device file. Typically: /dev/ufp0.
flags       Input             Controls the type of access to the device. Specify it as the bitwise OR of the appropriate masks that the open system call recognizes. For details, see the open man page.
err         Output            The error code that is set upon function return: either NOERR or another constant that is specified by the err_e enumerated type in ufplib.h.
fpga_fd     Output            The file descriptor of the FPGA. Used in all other function calls of the FPGA API.

11.4.2 Loading FPGA Logic into the Device: fpga_load

If you do not load the converted logic file into the FPGA before you execute your program, you must use the fpga_load function to load the logic file after the program opens the device. This programs the device with the specific logic for your application. When the load operation is complete, the system resets the application logic and releases it from reset. To load FPGA logic into the device, use statements like the following examples in your program:

#include "ufplib.h"

int fpga_fd;
const char *loadfile = "my_logic.ufp";   /* for example */
err_e err;
int num_bytes;

/* ... */

num_bytes = fpga_load(fpga_fd, loadfile, &err);
if (num_bytes < 0) {
    /* Handle error. */
}

Table 17, page 101 describes the variables in the sample code.

Table 17. fpga_load(3) arguments and return value

Variable    Input or output   Description
fpga_fd     Input             The file descriptor of the FPGA that fpga_open returned.
loadfile    Input             The path of the converted FPGA logic file that the fcu command created.
err         Output            The error code that is set upon function return: either NOERR or another constant that is specified by the err_e enumerated type in ufplib.h.
num_bytes   Output            The value returned by fpga_load: either the number of bytes written to the device or -1 on failure.

11.4.3 Resetting an FPGA: fpga_reset

If the application program loads the logic file, it does not need to reset the logic initially because the load operation does so automatically. However, if the user or another application loaded the logic file before this application executes, consider resetting the logic to put it into a known initial state. The fpga_reset(3) function places the application logic of a previously loaded FPGA into a reset state. It does this by asserting the user_reset_n signal that the RT Core outputs. Any application logic that is connected to this signal is reset. To reset an FPGA, use statements like the following examples in your program:

#include "ufplib.h"

int fpga_fd;
err_e err;
int status;

/* ... */
status = fpga_reset(fpga_fd, &err);
if (status < 0) {
    /* Handle error. */
}

Table 18, page 102 describes the variables in the sample code.

Table 18. fpga_reset(3) arguments and return value

Variable   Input or output   Description
fpga_fd    Input             The file descriptor of the FPGA that fpga_open returned.
err        Output            The error code that is set upon function return: either NOERR or another constant that is specified by the err_e enumerated type in ufplib.h.
status     Output            The status value that fpga_reset returns: either 0 on success or -1 on failure.

11.4.4 Releasing an FPGA from Reset State: fpga_start

If the application program loads the logic file, it does not need to release the logic from reset initially because the load operation does so automatically. However, if the application explicitly resets the application logic, it must take the logic out of the reset state to start execution of the logic. The fpga_start(3) function does this by de-asserting the user_reset_n signal that the RT Core outputs. Any application logic that is connected to this signal is released from reset. Use the fpga_reset(3) and fpga_start(3) functions together to perform a reset cycle on the FPGA. You can also use the fpga_start(3) function alone to ensure that the application logic is not in the reset state. To release an FPGA from reset state, use statements like the following examples in your program:

#include "ufplib.h"

int fpga_fd;
err_e err;
int status;

/* ... */
status = fpga_start(fpga_fd, &err);
if (status < 0) {
    /* Handle error. */
}

Table 19, page 103 describes the variables in the sample code.

Table 19. fpga_start(3) arguments and return value

Variable   Input or output   Description
fpga_fd    Input             The file descriptor of the FPGA that fpga_open returned.
err        Output            The error code that is set upon function return: either NOERR or another constant that is specified by the err_e enumerated type in ufplib.h.
status     Output            The status value that fpga_start returns: either 0 on success or -1 on failure.

11.4.5 Mapping FPGA Locations to the Application Address Space: fpga_memmap

The fpga_memmap(3) function maps a region of the FPGA address space to the application address space. The application can then use normal memory references (pointers) to read or write values in the mapped region. For a description of the FPGA address space, see Section 11.2.2, page 96. This function uses the write-combining feature of the Opteron processor, which can lead to out-of-sequence transactions between the Opteron and the FPGA. The API lets you synchronize the transactions to ensure the proper sequence when necessary; see Section 11.4.6, page 105. Instead of (or in addition to) mapping a whole region of the FPGA address space, the application can also use function calls to access individual locations in the FPGA address space; see Section 11.4.7, page 107.


To map FPGA locations to the application address space, use statements like the following examples in your program:

#include <sys/types.h>
#include "ufplib.h"
#define X_OFFSET 0x100
#define Y_OFFSET 0x200

int fpga_fd, prot, flags;
size_t len;
off_t offset;
err_e err;
void *fpga_base;
long x, y;

/* ... */
/* Set values of function arguments. */
/* ... */
fpga_base = fpga_memmap(fpga_fd, len, prot, flags, offset, &err);
if (fpga_base == NULL) {
    /* Handle error. */
}

/* Use the pointer to write or read values in the FPGA. */
/* fpga_base is a void pointer: convert it to a byte pointer before adding */
/* a byte offset, and to a typed pointer before dereferencing it. */
/* Initialize x. */
/* ... */
*(volatile long *)((char *)fpga_base + X_OFFSET) = x;
/* ... */
y = *(volatile long *)((char *)fpga_base + Y_OFFSET);

Table 20, page 104 describes the variables in the sample code.

Table 20. fpga_memmap(3) arguments and return value

Variable    Input or output   Description
fpga_fd     Input             The file descriptor of the FPGA that fpga_open returned.
len         Input             The number of bytes to be mapped.
prot        Input             A bit array that specifies the desired memory protection. It is either PROT_NONE or the bitwise OR of one or more of the other PROT_* flags specified by the Linux mmap function. These constants are defined in sys/mman.h, which is included by ufplib.h.
flags       Input             A bit array that specifies mapping options like those of the Linux mmap function. It is the bitwise OR of either MAP_SHARED or MAP_PRIVATE and zero or more of the other flags described in the mmap man page. These constants are defined in sys/mman.h, which is included by ufplib.h.
offset      Input             The byte offset in the FPGA address space at which the mapped region begins.
err         Output            The error code that is set upon function return: either NOERR or another constant that is specified by the err_e enumerated type in ufplib.h.
fpga_base   Output            A pointer to the mapped area in the application address space. NULL indicates failure.
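Because fpga_memmap returns an untyped pointer, it can help to wrap accesses to the mapped region in small typed helpers. The following is a sketch under stated assumptions: the helper names are hypothetical, accesses are 64 bits wide and aligned on 8-byte boundaries, and the base pointer can be any byte-addressable region (in an application it would be the value that fpga_memmap returned). The helpers do not order accesses; if order matters, see Section 11.4.6.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compute a typed, volatile pointer to a 64-bit location at a byte offset
 * from the base of a mapped region. */
static volatile uint64_t *reg_ptr(void *base, size_t byte_offset)
{
    assert(base != NULL);
    assert(byte_offset % sizeof(uint64_t) == 0);   /* assumed alignment */
    return (volatile uint64_t *)((char *)base + byte_offset);
}

/* Write a 64-bit value at a byte offset within the mapped region. */
void reg_write64(void *base, size_t byte_offset, uint64_t value)
{
    *reg_ptr(base, byte_offset) = value;
}

/* Read a 64-bit value from a byte offset within the mapped region. */
uint64_t reg_read64(void *base, size_t byte_offset)
{
    return *reg_ptr(base, byte_offset);
}
```

For example, reg_write64(fpga_base, 0x100, x) stores x at the location that the sample above calls X_OFFSET. The volatile qualifier keeps the compiler from caching or eliding the access; it does not by itself prevent the reordering that write combining can cause.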

11.4.6 Synchronizing Accesses to FPGA Locations: fpga_mem_sync

The fpga_memmap(3) function (see Section 11.4.5, page 103) uses the write-combining feature of the Opteron processor to communicate with an FPGA. While this can be more efficient, it can also lead to an out-of-sequence execution of accesses to locations in the FPGA address space. The fpga_mem_sync(3) function executes a memory barrier to flush out all transactions to the region that is mapped by the fpga_memmap(3) function. This function ensures that the Opteron processor completes all previous accesses before it performs any subsequent accesses. Use this function if the order of accesses in the application is important. To synchronize accesses to FPGA locations, use statements like the following examples in your program:


#include <sys/types.h>
#include "ufplib.h"
#define X_OFFSET 0x100
#define Y_OFFSET 0x200

int fpga_fd;
err_e err;
int status;
int prot, flags;
size_t len;
off_t offset;
void *fpga_base;
long x, y;

/* ... */
/* Set values of fpga_memmap arguments. */
/* ... */
fpga_base = fpga_memmap(fpga_fd, len, prot, flags, offset, &err);
if (fpga_base == NULL) {
    /* Handle error. */
}

/* Some FPGA accesses (cast the void pointer before using it) */
*(volatile long *)((char *)fpga_base + X_OFFSET) = x;
/* ... */

/* Synchronize */
status = fpga_mem_sync(fpga_fd, &err);
if (status < 0) {
    /* Handle error. */
}

/* Some more FPGA accesses */
y = *(volatile long *)((char *)fpga_base + Y_OFFSET);
/* ... */

Table 21, page 107 describes the variables in the sample code.

Table 21. fpga_mem_sync(3) arguments and return value

Variable   Input or output   Description
fpga_fd    Input             The file descriptor of the FPGA that fpga_open returned.
err        Output            The error code that is set upon function return: either NOERR or another constant that is specified by the err_e enumerated type in ufplib.h.
status     Output            The status value that fpga_mem_sync returns: either 0 on success or -1 on failure.

11.4.7 Writing and Reading Individual FPGA Locations

In addition to mapping a region of the FPGA address space and accessing it with ordinary memory references, an application can also write and read any individual 64-bit value in the FPGA address space by using the fpga_wrt_appif_val(3) and fpga_rd_appif_val(3) functions. These functions guarantee that the order of access is the order in which the application calls these functions. You do not need to use fpga_memmap(3) or fpga_mem_sync(3) in conjunction with these functions. For a description of the FPGA address space, see Section 11.2.2, page 96. Note: The meaning of the offset parameter of fpga_rd_appif_val and fpga_wrt_appif_val changed in release 1.2 of the Cray XD1 system. Previously, these functions assumed a particular mapping of FPGA resources in the FPGA address space such that a zero value of the offset parameter accessed the first FPGA register at a fixed location of 0x4000000 in the FPGA address space. Now, the parameter is generalized to allow access to both the internal resources of the FPGA and the attached QDR II SRAM, and the offset value can access any location in the whole FPGA address space. If you have an application program that calls these functions with the old meaning of offset, you need to change your source code to work with release 1.2 and later. Set the value of this parameter to be the address in the FPGA address space of the resource you want to access. This value depends entirely on the design of the FPGA application logic.
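If you are porting code written for the old meaning of offset, a small conversion helper can keep the change in one place. The sketch below rests on two assumptions that you should verify against your own design: that the old offsets were byte offsets relative to the first FPGA register, and that your FPGA logic still places its registers at 0x4000000. The names are hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Fixed register location that the pre-1.2 functions assumed. */
#define LEGACY_REGISTER_BASE 0x4000000UL

/* Convert a pre-1.2 offset (relative to the first FPGA register) into an
 * offset in the whole FPGA address space, as release 1.2 and later expect.
 * Valid only if the logic still maps its registers at LEGACY_REGISTER_BASE. */
unsigned long legacy_to_absolute_offset(unsigned long legacy_offset)
{
    assert(legacy_offset <= ULONG_MAX - LEGACY_REGISTER_BASE);
    return LEGACY_REGISTER_BASE + legacy_offset;
}
```

With this helper, a call that previously passed an offset of 0x8 would pass legacy_to_absolute_offset(0x8), which is 0x4000008.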


11.4.7.1 Writing to an FPGA Location: fpga_wrt_appif_val

Use the fpga_wrt_appif_val(3) function to write a 64-bit value into a specified location in the FPGA’s address space. This function can provide special treatment for a data value that is a virtual address in an FPGA transfer region—a memory region in the application’s address space that was registered by the fpga_register_ftrmem(3) function for direct access by the FPGA (see Section 11.4.8, page 110). For such a data value, the function first transforms it to a physical address that the FPGA logic can use. To write to an FPGA location, use statements like the following examples in your program:

#include "ufplib.h"

int fpga_fd;
unsigned long val;
unsigned long offset;
unsigned long type;
err_e err;
int status;

/* ... */
/* Set values of function arguments. */
/* ... */
status = fpga_wrt_appif_val(fpga_fd, val, offset, type, &err);
if (status < 0) {
    /* Handle error. */
}

Table 22, page 108 describes the variables in the sample code.

Table 22. fpga_wrt_appif_val(3) arguments and return value

Variable   Input or output   Description
fpga_fd    Input             The file descriptor of the FPGA that fpga_open returned.
val        Input             The value to write.
offset     Input             The byte offset of the location in the FPGA address space.
type       Input             One of:
                             • 0: The value is written as is.
                             • 1: The value is a user-space virtual memory address in an FPGA transfer region (see Section 11.4.8, page 110). The fpga_wrt_appif_val function transforms this virtual address to a physical address before writing it.
err        Output            The error code that is set upon function return: either NOERR or another constant that is specified by the err_e enumerated type in ufplib.h.
status     Output            The status value that fpga_wrt_appif_val returns: either 0 on success or -1 on failure.

11.4.7.2 Reading from an FPGA Location: fpga_rd_appif_val

Use the fpga_rd_appif_val(3) function to read a 64-bit value from a specified location in the FPGA’s address space. To read from an FPGA location, use statements like the following examples in your program:

#include "ufplib.h"

int fpga_fd;
unsigned long val;
unsigned long offset;
err_e err;
int status;

/* ... */
/* Set values of function arguments. */
/* ... */
status = fpga_rd_appif_val(fpga_fd, &val, offset, &err);
if (status < 0) {
    /* Handle error. */
}

Table 23, page 110 describes the variables in the sample code.

Table 23. fpga_rd_appif_val(3) arguments and return value

Variable   Input or output   Description
fpga_fd    Input             The file descriptor of the FPGA that fpga_open returned.
val        Output            The value that was read.
offset     Input             The byte offset of the location in the FPGA address space.
err        Output            The error code that is set upon function return: either NOERR or another constant that is specified by the err_e enumerated type in ufplib.h.
status     Output            The status value that fpga_rd_appif_val returns: either 0 on success or -1 on failure.

11.4.8 Accessing Application Memory from an FPGA: fpga_register_ftrmem and fpga_dereg_ftrmem

The API also supports access by the FPGA logic to a region of application memory, the FPGA transfer region (FTR) that is described in Section 11.2.3, page 96. Your program allocates a memory block, then uses the fpga_register_ftrmem(3) function to register the block as an FTR. This sets up the memory block for the FPGA to access it directly.

Note: The fpga_set_ftrmem(3) function that you used in previous releases to allocate and register an FTR is now deprecated. Use the fpga_register_ftrmem(3) function instead.

The memory block that you register as an FTR must be aligned on a memory page boundary, and its size must be a multiple of the memory page size (see footnote 1). The minimum size that you can register is the equivalent of one memory page, and the maximum size is 1 GB. (The older fpga_set_ftrmem(3) function can allocate a maximum of only 2 MB.)

The fpga_register_ftrmem(3) function does not automatically provide the address of this region to the FPGA application logic. The way that the application communicates this information depends on the protocol that you establish between the application and the FPGA logic. One way to communicate the address is to establish an FPGA register for that purpose and use the fpga_wrt_appif_val(3) function to write the value to the register. For more details, see Section 11.4.7, page 107.

When the application program finishes using the FTR, it should deregister the FTR by calling the fpga_dereg_ftrmem(3) function. This frees the memory mapping capability of the processor so that other processes can use the capability. Deregister the FTR regardless of how you registered it: with the fpga_register_ftrmem(3) function or with the older (and now deprecated) fpga_set_ftrmem(3) function.

To access application memory from an FPGA, use statements like the following examples in your program:

#define _XOPEN_SOURCE 600
#include <stdlib.h>
#include <unistd.h>
#include "ufplib.h"

int fpga_fd;
unsigned int size = getpagesize() * 10000;
err_e err;
void *ftr_mem;
int status;

/* Allocate a page-aligned block whose size is a multiple of the page size. */
status = posix_memalign(&ftr_mem, getpagesize(), size);
if (status != 0) {
    /* Handle error. */
}
/* ... */
/* Set other values of function arguments. */
/* ... */
status = fpga_register_ftrmem(fpga_fd, ftr_mem, size, &err);
if (status < 0) {
    /* Handle error. */
}
/* Communicate the FTR base address to the FPGA */
/* ... */
/* Perform the main work of the application */
/* ... */
status = fpga_dereg_ftrmem(fpga_fd, ftr_mem, &err);
if (status < 0) {
    /* Handle error. */
}

1 You can use the Linux getpagesize function to discover the memory page size and the Linux posix_memalign function to allocate a suitably aligned block of memory.

Table 24, page 112 describes the variables in the sample code.

Table 24. fpga_register_ftrmem(3) and fpga_dereg_ftrmem(3) arguments and return value

Variable   Input or output   Description
fpga_fd    Input             The file descriptor of the FPGA that fpga_open returned.
ftr_mem    Input             A pointer to a memory block in the application address space to use as the FTR.
size       Input             The size of the FTR in bytes. It must be a multiple of the memory page size. The maximum size is 1 GB.
err        Output            The error code that is set upon function return: either NOERR or another constant that is specified by the err_e enumerated type in ufplib.h.
status     Output            The status value that fpga_register_ftrmem or fpga_dereg_ftrmem returns: either 0 on success or -1 on failure.
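The size and alignment rules for an FTR can be checked before registration. The following sketch rounds a requested size up to a whole number of pages, rejects sizes above the 1 GB limit, and allocates a page-aligned block. The helper names are hypothetical, and the sketch uses sysconf(_SC_PAGESIZE) rather than getpagesize to obtain the page size. It does not call fpga_register_ftrmem; you would pass the returned pointer and size to that function.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L     /* for posix_memalign */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define FTR_MAX_BYTES ((size_t)1 << 30)   /* 1 GB limit stated in the text */

/* Round a request up to a multiple of page_size.
 * Returns 0 if the request is zero or the result would exceed 1 GB. */
size_t ftr_round_size(size_t requested, size_t page_size)
{
    size_t rounded;

    assert(page_size > 0);
    if (requested == 0 || requested > FTR_MAX_BYTES)
        return 0;
    rounded = ((requested + page_size - 1) / page_size) * page_size;
    return (rounded > FTR_MAX_BYTES) ? 0 : rounded;
}

/* Allocate a page-aligned block suitable for registration as an FTR.
 * On success, stores the rounded size in *actual and returns the block;
 * on failure, returns NULL. Release the block with free(). */
void *ftr_alloc(size_t requested, size_t *actual)
{
    long page = sysconf(_SC_PAGESIZE);
    void *mem = NULL;
    size_t size;

    assert(actual != NULL);
    if (page <= 0)
        return NULL;
    size = ftr_round_size(requested, (size_t)page);
    if (size == 0)
        return NULL;
    if (posix_memalign(&mem, (size_t)page, size) != 0)
        return NULL;
    *actual = size;
    return mem;
}
```

Deregister the region with fpga_dereg_ftrmem before you free the block, so that the FPGA never holds a reference to memory that the allocator has reclaimed.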

11.4.9 Enabling and Waiting for Interrupts: fpga_int_wait

The fpga_int_wait(3) function enables the SMP to process interrupt requests from the FPGA. This is relevant only if the design of the FPGA application logic includes generation of interrupt requests. This function can optionally return immediately, return after a specified time interval, or wait indefinitely for the interrupt to occur. If the function returns before the interrupt occurs, the SMP only acknowledges the interrupt when it occurs, and the interrupt has no effect on the application process.

To enable an interrupt from the FPGA, use statements like the following examples in your program:

#include "ufplib.h"

#define FOREVER -1

int fpga_fd;
err_e err;
long time_out;
long result;

/* ... */
/* Perform actions required before waiting (for example, fill input buffers). */
/* ... */

/* Set values of function arguments. */
time_out = FOREVER;   /* for example */
/* ... */
result = fpga_int_wait(fpga_fd, time_out, &err);
if (result < 0) {
    /* Handle error. */
}
/* Continue with next actions (for example, read results from output buffers). */
/* ... */

Table 25, page 113 describes the variables in the sample code.

Table 25. fpga_int_wait(3) arguments and return value

Variable   Input or output   Description
fpga_fd    Input             The file descriptor of the FPGA that fpga_open returned.
time_out   Input             Specifies the function behavior as follows:
                             < 0   Returns when an interrupt occurs.
                             0     Returns immediately.
                             > 0   Returns when an interrupt occurs or the specified number of milliseconds elapses, whichever comes first.
err        Output            The error code that is set upon function return: either NOERR or another constant that is specified by the err_e enumerated type in ufplib.h.
result     Output            The value that fpga_int_wait returns: a value greater than or equal to 0 on success or -1 on failure. The success value depends on the time_out argument as follows:
                             • If time_out < 0, the function returns the number of milliseconds that the function waited until the interrupt occurred.
                             • If time_out is 0, the function returns 0.
                             • If time_out > 0, the function returns the number of milliseconds left to wait when the interrupt occurred.
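Because the meaning of the return value depends on time_out, it can be convenient to normalize it to elapsed time. The following sketch applies the rules in Table 25; the function name is hypothetical. The table does not say what the function returns when a positive time_out expires without an interrupt, so this sketch does not try to distinguish that case.

```c
#include <assert.h>

/* Convert an fpga_int_wait result to the number of milliseconds spent
 * waiting, following Table 25. Returns -1 if the result reports failure
 * or is inconsistent with the time_out that was passed. */
long int_wait_elapsed_ms(long time_out, long result)
{
    if (result < 0)
        return -1;                 /* fpga_int_wait reported failure */
    if (time_out < 0)
        return result;             /* result is already the time waited */
    if (time_out == 0)
        return 0;                  /* the call returned immediately */
    if (result > time_out)
        return -1;                 /* cannot have more time left than given */
    return time_out - result;      /* result is the time that was left */
}
```

For example, with time_out set to 1000 and a return value of 400, the interrupt arrived about 600 milliseconds after the call began.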

11.4.10 Checking the Status of an FPGA: fpga_status

The fpga_status(3) function returns a status value from a previously opened FPGA. This value is the value of the host latch register in the RapidArray Transport Core. It is an integer in the range 0 to 255. For information about the meanings of this status value, see Design of Cray XD1 RapidArray Transport Core (S–6411).


To check the status of an FPGA, use statements like the following examples in your program:

#include "ufplib.h"

int fpga_fd;
err_e err;
int dev_status;

/* ... */
dev_status = fpga_status(fpga_fd, &err);
if (dev_status < 0) {
    /* Handle error. */
}
/* Check the returned value against expected values */

Table 26, page 115 describes the variables in the sample code.

Table 26. fpga_status(3) arguments and return value

Variable     Input or output   Description
fpga_fd      Input             The file descriptor of the FPGA that fpga_open returned.
err          Output            The error code that is set upon function return: either NOERR or another constant that is specified by the err_e enumerated type in ufplib.h.
dev_status   Output            The FPGA status value that fpga_status returns: an integer from 0 to 255 or -1 on failure.

11.4.11 Checking the Programming State of an FPGA: fpga_is_loaded

The fpga_is_loaded(3) function queries the programming state of a previously opened FPGA. The return value indicates whether a logic file is currently loaded into the FPGA. To check the programming state of an FPGA, use statements like the following examples in your program:

#include "ufplib.h"

int fpga_fd;
err_e err;
int loaded;

/* ... */
loaded = fpga_is_loaded(fpga_fd, &err);
if (loaded) {
    /* do something */
} else {
    /* do something else */
}

Table 27, page 116 describes the variables in the sample code.

Table 27. fpga_is_loaded(3) arguments and return value

Variable   Input or output   Description
fpga_fd    Input             The file descriptor of the FPGA that fpga_open returned.
err        Output            The error code that is set upon function return: either NOERR or another constant that is specified by the err_e enumerated type in ufplib.h.
loaded     Output            The result that fpga_is_loaded returns. A nonzero value means that the FPGA is loaded and zero means it is not.

11.4.12 Erasing an FPGA: fpga_unload

The fpga_unload(3) function clears the programming of an FPGA. The function erases the logic that was previously loaded with the fpga_load(3) function or the fcu(1) command. This function is provided for extra security only. You do not need to call it before you load your logic file because the fpga_load(3) function has the same effect. You need this function only if you want to completely clear the FPGA after you finish using it. In this case, call it just before you call the fpga_close(3) function.


To erase an FPGA, use statements like the following examples in your program:

#include "ufplib.h"

int fpga_fd;
err_e err;
int status;

/* ... */
status = fpga_unload(fpga_fd, &err);
if (status < 0) {
    /* Handle error. */
}

Table 28, page 117 describes the variables in the sample code.

Table 28. fpga_unload(3) arguments and return value

Variable   Input or output   Description
fpga_fd    Input             The file descriptor of the FPGA that fpga_open returned.
err        Output            The error code that is set upon function return: either NOERR or another constant that is specified by the err_e enumerated type in ufplib.h.
status     Output            The status value that fpga_unload returns: either 0 on success or -1 on failure.

11.4.13 Closing an FPGA: fpga_close

Use the fpga_close(3) function to close an FPGA that was opened previously. This clears any associations of the application process to the FPGA memory that the fpga_memmap(3) function established (see Section 11.4.5, page 103). To close an FPGA, use statements like the following examples in your program:

#include "ufplib.h"

int fpga_fd;
err_e err;
int status;

/* ... */
status = fpga_close(fpga_fd, &err);
if (status < 0) {
    /* Handle error. */
}

Table 29, page 118 describes the variables in the sample code.

Table 29. fpga_close(3) arguments and return value

Variable   Input or output   Description
fpga_fd    Input             The file descriptor of the FPGA that fpga_open returned.
err        Output            The error code that is set upon function return: either NOERR or another constant that is specified by the err_e enumerated type in ufplib.h.
status     Output            The status value that fpga_close returns: either 0 on success or -1 on failure.

Sample Application: Using the Mersenne Twister Accelerator [12]

Our sample application is a program that generates pseudorandom numbers by using the Mersenne Twister algorithm. Cray has implemented the twisting function of the algorithm in FPGA logic, which is called the Mersenne Twister Accelerator (MTA). The Cray XD1 software distribution includes the MTA logic file and a sample C program (mta_test.c) that uses it. This chapter describes the application code and provides a procedure for building and executing it.

12.1 Understanding the Application

This section provides some background information and commentary to help you understand the sample program. A full listing of the program is in Appendix B, page 131. In the listing, each line of code is numbered. In this section, references such as “(line n)” refer to the numbered source code line in the listing.

12.1.1 Algorithm

The Mersenne Twister algorithm is an efficient algorithm for generating pseudorandom numbers with excellent statistical properties. It has become popular as a source of pseudorandom numbers in Monte-Carlo simulations. The algorithm is a form of linear feedback shift register with an extremely long period of 2^19937 - 1. A paper by Matsumoto and Nishimura (footnote 1) describes the algorithm in detail.
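As a worked check of the exponent in the period, the following uses the generator's standard parameters, which this manual does not restate and which are therefore an assumption here: word size w = 32, state length n = 624 words, and r = 31 low-order bits of one word that do not contribute to the state.

```latex
\[
  nw - r = 624 \times 32 - 31 = 19968 - 31 = 19937,
  \qquad
  \text{period} = 2^{\,nw - r} - 1 = 2^{19937} - 1 .
\]
```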

12.1.2 High-level Design of Application and FPGA Logic

The MTA logic starts with an initial seed value of 19,968 bits (624 x 32) that is generated by the application program and implemented as a state array of 624 pseudorandom 32-bit numbers. The MTA progressively transforms the array into new pseudorandom numbers according to the algorithm. It exploits opportunities for both parallel and pipelined computations. For a description of the MTA logic design, see the following document, which is also in the Cray XD1 software distribution:

/usr/local/ufpapps/mta/doc/PNR-DD-0022-MtaFPGA.pdf

1 Makoto Matsumoto and Takuji Nishimura, “Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator,” ACM Transactions on Modeling and Computer Simulation, vol. 8, no. 1, pp. 3-30, 1998.
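The seed computation that the application performs can be sketched in isolation. The following is a minimal sketch modeled on the seeding loop in the program listing (Appendix B); the names mt_seed_state, MT_N, MT_DEFAULT_SEED, and MT_MULTIPLIER are illustrative and are not part of the distributed code. It relies on uint32_t arithmetic wrapping modulo 2^32 rather than on an explicit mask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MT_N            624           /* words in the state array */
#define MT_DEFAULT_SEED 4357u         /* default seed used by mta_test.c */
#define MT_MULTIPLIER   1812433253u   /* multiplier used by mta_test.c */

/* Fill state[0..MT_N-1] from a 32-bit seed; a seed of 0 selects the default. */
static void mt_seed_state(uint32_t state[MT_N], uint32_t seed)
{
    assert(state != NULL);
    if (seed == 0)
        seed = MT_DEFAULT_SEED;
    state[0] = seed;
    for (uint32_t i = 1; i < MT_N; i++) {
        uint32_t prev = state[i - 1];
        /* Unsigned 32-bit arithmetic wraps, so no explicit mask is needed. */
        state[i] = MT_MULTIPLIER * (prev ^ (prev >> 30)) + i;
    }
}
```

Calling mt_seed_state(state, 0) on a uint32_t state[MT_N] array fills it from the default seed; the resulting 624 words are what the application would then write to the FPGA registers.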

The application writes the initial state array into the FPGA registers. It also writes the addresses of the output buffers and an output format selector into the registers. The application sets up the output buffers as FPGA transfer regions, and the MTA logic outputs the results to the FTRs using a double-buffering scheme. The application extracts results from a full buffer while the MTA logic fills the empty buffer.

12.1.3 Some Design Details

The user specifies the type (format) and number of numbers to generate as command-line arguments. For details on the use of the mta_test command, refer to an appendix of the MTA design document.

The sample program assumes that the user previously loaded the MTA logic into the FPGA from the command line. For details on this procedure, see Section 10.2, page 92. This design choice is efficient for running the application multiple times (for example, for measuring performance). In a real application, you may prefer to have your program load the logic via the fpga_load(3) function.

The MTA design illustrates one way of establishing a protocol for communication between the application and the FPGA logic. The MTA initializes itself when it starts, then holds itself in that state. The MTA design includes a register that allows the application to start and stop the computation of pseudorandom numbers.
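The start/stop protocol comes down to composing one register value. The sketch below is based on the bit masks in the program listing (Appendix B), where bit 0 is the initialization key and bits 1 and 2 select the output format; the helper name mta_app_cfg and the enum are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bit masks copied from mta_test.c (Appendix B). */
#define MTA_INIT_KEY 0x1u   /* bit 0: 1 holds the MTA in its initial state */
#define MTA_FORMAT   0x6u   /* bits 1 and 2: output format selector */

enum mta_format { MTA_INTEGER = 0, MTA_FLOAT = 1, MTA_DOUBLE = 2 };

/* Hypothetical helper: compose the application configuration register value.
 * hold != 0 keeps the MTA in its initial state; hold == 0 lets it compute. */
static uint64_t mta_app_cfg(int hold, enum mta_format fmt)
{
    assert(fmt == MTA_INTEGER || fmt == MTA_FLOAT || fmt == MTA_DOUBLE);
    uint64_t val = hold ? MTA_INIT_KEY : 0u;
    val |= ((uint64_t)fmt << 1) & MTA_FORMAT;
    return val;
}
```

Under these assumptions, mta_app_cfg(0, MTA_DOUBLE) yields the value that starts computation with double output, and mta_app_cfg(1, MTA_INTEGER) yields 1, which matches the value that the sample program writes to stop the computation.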

The MTA also defines registers to specify the output format, to configure the output buffers, to specify the output buffer addresses, and to hold the Mersenne Twister state array. The MTA registers begin at an offset of 64 MB (0x4000000) in the FPGA address space. Table 30, page 121 shows the additional offsets of the registers of interest from that base address. Application programs must use the total offset value (for example 0x4000008) in calls to the fpga_wrt_appif_val(3) function. (In this design, the application does not read any registers.)


Table 30. MTA registers

Offset (hexadecimal)   Offset (decimal)   Description
0x4000000 +            64 MB +
0x8                    8                  Application configuration register. Bit 0 is the flag that holds the MTA in its initial state or releases it to start computation.
0x18                   24                 Buffer configuration register. Holds the following items:
                                          • Bits 0 to 8 specify the buffer size in units of 4 KB pages.
                                          • Bits 32 to 47 specify the buffer status polling interval in units of user clock cycles.
0x20                   32                 Buffer 0 base pointer register. Holds the base address of the first output buffer.
0x28                   40                 Buffer 1 base pointer register. Holds the base address of the second output buffer.
0x1000 and up          4096 and up        Mersenne Twister state array.
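The offset arithmetic and the buffer configuration layout in Table 30 can be sketched as follows. The macro and function names are hypothetical; only the numeric offsets and bit positions come from the table. The value that results would be passed to fpga_wrt_appif_val(3), which is not called here.

```c
#include <assert.h>
#include <stdint.h>

#define MTA_REG_BASE  0x4000000ull  /* registers start 64 MB into the FPGA address space */

/* Additional offsets from Table 30. */
#define MTA_APP_CFG   0x08ull
#define MTA_BUFF_CFG  0x18ull
#define MTA_BUFF0_PTR 0x20ull
#define MTA_BUFF1_PTR 0x28ull
#define MTA_STATE     0x1000ull

/* Total offset of a register from the start of the FPGA address space. */
static uint64_t mta_reg(uint64_t offset)
{
    return MTA_REG_BASE + offset;
}

/* Hypothetical helper: pack the buffer configuration register.
 * size_field occupies bits 0 to 8; poll_cycles occupies bits 32 to 47. */
static uint64_t mta_buff_cfg(uint32_t size_field, uint32_t poll_cycles)
{
    assert(size_field <= 0x1FFu);
    assert(poll_cycles <= 0xFFFFu);
    return ((uint64_t)poll_cycles << 32) | size_field;
}
```

For example, mta_reg(MTA_APP_CFG) gives the total offset 0x4000008 that the text mentions, and mta_buff_cfg(255, 255) reproduces the value that the sample program writes to the buffer configuration register.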

The FPGA typically generates results faster than an Opteron processor can fetch them. The double-buffering scheme that is used in this sample application for FPGA output uses a handshake mechanism to make the operation efficient. After filling one buffer, the MTA continues to generate results in the next buffer while the application extracts those in the previous buffer. The block size (eight 64-bit numbers) that is used in the sample program to extract the results provides optimum performance.

The first 64 bytes of each output buffer are reserved for status information, of which the first 8 bytes are a buffer status flag (0 for empty or 1 for full) that triggers buffer switching. After the sample program creates the output buffers, it zeroes them, which also marks them as empty initially.

When the MTA logic is computing, it polls the status of the next output buffer until the status flag shows that the buffer is empty. It then computes numbers and transfers them to the output buffer until the buffer is full. It updates the buffer status to full, then polls the status of the next buffer.

The sample program only illustrates how to use the MTA logic; it does not use the pseudorandom numbers that are generated. However, it does use some microsecond-precision timing functions (also in the software distribution) that enable you to measure the rate of production of the numbers.
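The handshake described above can be modeled entirely in host memory. The following is a simplified sketch, not the sample program's code: producer_fill is a software stand-in for the MTA logic, the buffers are tiny (64 words rather than 1 MB), and reader_next_block returns NULL instead of spinning when the buffer is not yet full. All names are hypothetical. A real reader would declare the buffers volatile because the FPGA writes them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUF_WORDS   64   /* small buffer for illustration; mta_test.c uses 1 MB */
#define HDR_WORDS   8    /* first 64 bytes are reserved for status */
#define BLOCK_WORDS 8    /* one block = eight 64-bit numbers */
#define BUF_EMPTY   0u   /* ready to be written by the producer */
#define BUF_FULL    1u   /* ready to be read by the application */

struct reader {
    uint64_t *buf[2];    /* the two output buffers */
    int cur;             /* index of the buffer currently being read */
    size_t pos;          /* next word to read in that buffer */
};

static void reader_init(struct reader *r, uint64_t *b0, uint64_t *b1)
{
    assert(r != NULL && b0 != NULL && b1 != NULL);
    r->buf[0] = b0;
    r->buf[1] = b1;
    r->cur = 0;
    r->pos = HDR_WORDS;
}

/* Stand-in for the FPGA side: fill buffer b only if its flag says empty. */
static int producer_fill(uint64_t *b, uint64_t first_value)
{
    assert(b != NULL);
    if (b[0] != BUF_EMPTY)
        return 0;                      /* still being read; keep polling */
    for (size_t i = HDR_WORDS; i < BUF_WORDS; i++)
        b[i] = first_value + (i - HDR_WORDS);
    b[0] = BUF_FULL;                   /* hand the buffer to the reader */
    return 1;
}

/* Return the next block of eight words, or NULL if the buffer is not full. */
static const uint64_t *reader_next_block(struct reader *r)
{
    assert(r != NULL);
    uint64_t *b = r->buf[r->cur];
    if (b[0] != BUF_FULL)
        return NULL;
    const uint64_t *block = b + r->pos;
    r->pos += BLOCK_WORDS;
    if (r->pos == BUF_WORDS) {
        /* Last block taken: mark the buffer empty and switch, as mt_get does.
         * The caller should consume the returned block before the producer
         * refills this buffer. */
        b[0] = BUF_EMPTY;
        r->cur ^= 1;
        r->pos = HDR_WORDS;
    }
    return block;
}
```

With these sizes each buffer holds seven blocks, so the seventh successful call to reader_next_block releases the buffer and moves the reader to the other one.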

12.1.4 Walkthrough

The mta_test program structure includes four major functions and three wrapper functions that cast the output as different data types:

• parse_opt: Parses the command-line options and arguments (line 129 to line 157).
• mt_1999_set: Initializes the state array, sets up and configures the output buffers, and starts the computation (line 162 to line 264).
• mt_get: Fetches blocks of pseudorandom numbers from the output buffers for use in the main program (line 271 to line 311). In addition, see the functions mt_get_int, mt_get_float, and mt_get_double at line 317 to line 330.
• main: The main program (line 336 to line 461).

The main program first parses the command-line arguments (line 357). It uses the argp_parse function from the GNU C Library, which in turn calls the parse_opt function in this program. Then it opens the FPGA device (line 368) and resets and starts the logic to ensure that it is in a known state (line 377 and line 378).

The main program calls mt_1999_set (line 381), which performs the rest of the initialization:

• Computes the initial contents of the Mersenne Twister state array, which is the seed value of the algorithm (line 169 to line 179).
• Allocates the output buffers (line 184 and line 190), registers them as FTRs (line 197 and line 205), and zeroes them (line 214 and line 215).
• Writes the buffer addresses (line 220 to line 225), buffer configuration (line 229 to line 232), and initial state array (line 235 to line 240) to the registers.
• Sets the application configuration register to start the computation (line 244 to line 263).

The main program is now ready to receive the pseudorandom numbers that the FPGA generates. It calls the appropriate mt_get wrapper function for every set of eight 64-bit numbers that it wants (line 384 to line 430). If the current buffer is empty (ready to write), the mt_get function polls the buffer flag until it changes to full (ready to read). On each call, mt_get returns the next block of eight 64-bit numbers (line 295, line 296, and line 310). After it takes the last eight numbers in the buffer, mt_get marks the buffer as empty (line 300 or line 306) and switches to the next buffer.

Finally, the program sets the MTA application configuration register to stop the computation (line 439), deregisters the FTRs (line 442 and line 447), frees the buffers (line 454 and line 455), and closes the device (line 458).
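The state-array write in the walkthrough (line 235 to line 240) packs two 32-bit state words into each 64-bit register write. The following is a minimal sketch of that packing, separated from the FPGA call; the function names are hypothetical, and the byte offset that mt_state_offset returns is relative to the start of the state array registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MT_N 624   /* 32-bit words in the state array */

/* Pack pairs of 32-bit state words into 64-bit values: the even-indexed
 * word goes in the low half, the odd-indexed word in the high half. */
static void mt_pack_state(const uint32_t state[MT_N], uint64_t out[MT_N / 2])
{
    assert(state != NULL && out != NULL);
    for (size_t i = 0; i < MT_N / 2; i++)
        out[i] = ((uint64_t)state[2 * i + 1] << 32) | state[2 * i];
}

/* Byte offset of packed value i from the start of the state array registers. */
static uint64_t mt_state_offset(size_t i)
{
    assert(i < MT_N / 2);
    return (uint64_t)i * 8;
}
```

The 624 state words therefore need 312 register writes, the last of which lands 2488 bytes past the start of the state array.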

12.2 Building and Running the Application

Earlier chapters described both the command-line tools and the API that enable you to use an FPGA on the Cray XD1 system. The previous section described the source code of the sample Mersenne Twister application, which uses many of the API functions. This section illustrates the commands to compile the sample application, convert the appropriate logic file, load the FPGA, and run the sample application. This procedure is one way to verify that an FPGA and the communication path to it work correctly.

This procedure makes the following assumptions:

• You have an account on the Cray XD1 system.
• The nodes of the partition that allows you to log in do not have FPGAs. Therefore, you must submit jobs to another partition.
• You have access to a job execution partition that has at least one node with an FPGA.

Procedure 3: To build and run the application

1. Log in to Linux on the Cray XD1 system.

2. Copy the sample program and related files to your home directory:

> cp -r /opt/ufpapps ~

3. Examine the contents of ~/ufpapps. Although both the source and executable mta_test programs are present, you will rebuild the executable in this example; a suitable Makefile is already present. Similarly, although both the raw and loadable MTA logic files are present, you will rebuild the loadable file in this example.


4. Build the executable application program:

> cd ~/ufpapps/mta/src
> make mta_test

The new mta_test executable file that results is in ~/ufpapps/mta/bin.

5. Identify the FPGA variant that is present in the system:

> lsnode -v | less

You may want to redirect the lsnode(1) output to a file rather than read it on the screen. The output includes about 34 lines for each node in the system, so it will be long, especially if you have multiple chassis in the system. The information for each node begins with lines such as the following examples:

Hardware ID: 65439.4
Partition: compute

Look for a line similar to the following example in the middle of the information for each node:

App Accelerator: 90-0003-05

This is the Cray part number for an FPGA. If an FPGA is not present on the node, the part number is listed as --. Normally, all the FPGAs in a system have the same part number. Take note of the part number and which partition or partitions have FPGAs.

6. Convert the appropriate FPGA logic file:

> cd ~/ufpapps/mta/bin
> fcu --build --partnum 90-0003-05 --clock 180
creating header file ufphdr....
> fcu --convert mta50.bin ufphdr
header size 59
input fpga load size 2377668

The raw MTA logic file is provided for several variants of the FPGA. The file named mta50.bin corresponds to the part number 90-0003-05. The output of the last command is the loadable logic file mta50.bin.ufp.


7. Create a job script that will load the logic file and run the application to generate 500 million pseudorandom numbers; do this in a separate directory:

> mkdir ~/test
> cd ~/test
> cat > mta_job
fcu --load ~/ufpapps/mta/bin/mta50.bin.ufp
~/ufpapps/mta/bin/mta_test 500000000
Ctrl+D
> chmod +x mta_job

8. Submit the job to a partition that has an FPGA; use the appropriate command for the configured workload management (WLM) system. For example, to submit the job to the compute partition and request the use of an FPGA with PBS Pro:

> qsub -q compute -l nodes=1:fpga mta_job

Take note of the job ID that the command returns (it may be embedded in a string with other information, depending on the WLM system); for example: 1798.

9. Monitor the progress of the job by using the WLM system. Once the WLM system launches this sample job, it takes only a few seconds to execute.

10. After the job is complete, examine the output file; its name follows the conventions of the WLM system. For example, here are the contents of mta_job.o1798:

file size 2381764
setting device location /dev/ufp0
programming device 2381764 bytes
opening device /dev/ufp0
programmed FPGA 2381764 bytes
closing device file descriptor 4

Generating 500000000 32 bit pseudo-random numbers with FPGA.
Calibrating timer ... cpuHz is 2200000000

Last number : 0x25E3795A
Elapsed time : 1559268 microseconds
Rate : 320663285.593 32 bit integers/second

The first part of the output shows that the fcu --load command was successful. The second part is the output from the application. It shows that the program generated 500,000,000 pseudorandom numbers (32-bit) in 1,559,268 microseconds, which is an average rate of 320,663,285 numbers per second. Any rate that exceeds 250,000,000 numbers per second indicates that the FPGA and the infrastructure that supports it are working properly.
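The rate in the sample output follows directly from the count and the elapsed time. The sketch below reproduces that arithmetic and the 250,000,000 numbers-per-second threshold quoted above; the function names are hypothetical and the code is independent of the timing functions in the software distribution.

```c
#include <assert.h>

/* Numbers per second, given a count and an elapsed time in microseconds. */
static double numbers_per_second(unsigned long count, long elapsed_us)
{
    assert(elapsed_us > 0);
    return (double)count * 1e6 / (double)elapsed_us;
}

/* Threshold quoted in the text for a correctly working FPGA path. */
static int rate_is_healthy(double rate)
{
    return rate > 250e6;
}
```

For the sample run, numbers_per_second(500000000UL, 1559268L) evaluates to roughly 320.66 million numbers per second, which passes the rate_is_healthy check.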

Part V: Appendixes

Troubleshooting [A]

This appendix describes some issues that may occur with the FPGA application acceleration processor and how to resolve them.

A.1 Node Hangs After Accessing /proc/ufp/regs

A.1.1 Cause

The FPGA AAP did not respond to an SMP processor read request.

A.1.2 Discussion

If an FPGA AAP is programmed but the user logic is in a reset state (via the fcu --reset command or the fpga_reset(3) function call), the FPGA AAP cannot respond to SMP processor read requests. If a request is made, the SMP processor experiences a bus timeout and reboots. Use the reset functionality with care. If possible, restrict its use to debugging. If you erase the device (via the fcu --unload command or the fpga_unload(3) function call), the bus transaction is handled correctly. This behavior may change in a future release.

A.2 File /dev/ufp0 Does Not Exist (Interaction with the FPGA AAP Fails)

A.2.1 Cause

The FPGA AAP kernel module is not properly installed.

A.2.2 Discussion

The Opteron processors access the FPGAs through a standard Linux device driver. The Cray XD1 Active Manager software automatically detects the expansion module on a compute blade and loads the driver. If the Active Manager software is not running or fails to install the driver, you can load the module manually in the same way as any other Linux device driver. The following procedure describes how to configure the kernel to load the driver automatically when it is required.

Note: This procedure is required only if Active Manager is not in use. If Active Manager is running, these entries will already exist.

Procedure 4: To load the FPGA driver

1. Log in to the relevant node of the Cray XD1 system as root.

2. Create the ufp device:

# mknod /dev/ufp0 c 62 0

3. Add the following three lines to the /etc/modprobe.conf.local file:

alias char-major-62 ufp
alias ufp0 ufp
options ufp cm_part_num=module-number

where module-number is the Cray part number of the target module in quotation marks (for example, "90-0003-05"). Refer to Table 4, page 21 for a list of available modules.

After the operating system loads the driver, the FPGA special file /dev/ufp0 appears. The FPGA device driver will load when it is required. Generally this occurs when a user application or the fcu(1) utility loads the FPGA. You can force the driver to load by running the fcu -r command. After the driver is loaded, the virtual files /proc/ufp/fpga and /proc/ufp/regs appear on the SMP.
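A program can confirm that the device node from the procedure is in place before it tries to open the FPGA. The following is a minimal sketch that assumes Linux with the GNU C Library (for the major() macro in sys/sysmacros.h); the function name is hypothetical and the check is not part of the Cray API.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Returns 1 if path is a character device with the given major number,
 * 0 if it exists but is something else, and -1 if stat(2) fails. */
static int is_char_device_with_major(const char *path, unsigned int want_major)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;                /* for example, the node does not exist */
    if (!S_ISCHR(st.st_mode))
        return 0;                 /* exists, but is not a character device */
    return major(st.st_rdev) == want_major ? 1 : 0;
}
```

On a node that is configured as described, is_char_device_with_major("/dev/ufp0", 62) would be expected to return 1; a return value of -1 suggests that the mknod step has not been performed.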

Program Listing: mta_test.c [B]

  1 /*
  2  * This program demonstrates random number generation on the Cray XD1
  3  * system with an FPGA co-processor.
  4  * Usage : mta_test
  5  * Generates a pseudo-random sequence of length, using the
  6  * default seed 4357
  7  * Mersenne-Twister algorithm is used to generate the numbers.
  8  * The FPGA must be programmed with the official Mersenne-Twister
  9  * logic distributed by Cray along with this program.
 10  * Timing code is provided in file timer.c.
 11  */
 12
 13 #include
 14 #define __USE_XOPEN
 15 #define __USE_XOPEN2K
 16 #define _XOPEN_SOURCE 600
 17 #include
 18 #include
 19 #include
 20 #include
 21 #include
 22 #include
 23 #include
 24 #include
 25
 26 #include
 27 #include "timer.h"
 28
 29 #define N 624 /* Period parameters */
 30 #define INT32_MASK 0xFFFFFFFF
 31 #define DEFAULT_SEED 4357UL
 32 #define MULTIPLIER 1812433253UL /* Don Knuth, Vol 2 */
 33
 34 /* values specific to Cray FPGA logic */
 35 #define TYPE_VAL 0UL
 36 #define TYPE_ADDR 1UL
 37 #define READ_NUMS 8 /* every call of mt_get fetches us this many 64-bit random numbers */
 38 #define BUFF_SIZE (1 * 1024 * 1024)
 39 #define BUFF_RD_RDY 1UL /* buffer is ready to be read by the Opteron */
 40 #define BUFF_WRT_READY 0UL /* buffer is ready to be written into by the FPGA */
 41 #define QUAD_WRDS_PER_BUF (BUFF_SIZE/8)
 42
 43 /* Define the address offsets for the MTA FPGA Registers */
 44 #define REG_OFFSET (64 * 1024 *1024) /* Registers start at 64M */
 45 #define APP_ID_REG (REG_OFFSET + 0x00UL)
 46 #define APP_CFG_REG (REG_OFFSET + 0x08UL)
 47 #define APP_LATCH_REG (REG_OFFSET + 0x10UL)
 48 #define BUFF_CFG_REG (REG_OFFSET + 0x18UL)
 49 #define BUFF0_PTR_REG (REG_OFFSET + 0x20UL)
 50 #define BUFF1_PTR_REG (REG_OFFSET + 0x28UL)
 51 #define MTA_RAM_ARRAY (REG_OFFSET + 0x1000UL)
 52
 53 /* Convenient bit masks for the MTA registers. */
 54 #define MTA_INTEGER 0x0UL
 55 #define MTA_FLOAT 0x1UL
 56 #define MTA_DOUBLE 0x2UL
 57 #define MTA_INIT_KEY 0x1UL
 58 #define MTA_FORMAT 0x6UL
 59
 60 typedef unsigned long u_64;
 61 typedef struct {
 62     unsigned long mt[N];
 63     int mti;
 64 } mt_state_t;
 65
 66
 67 int fp_id;
 68 volatile u_64 * buf0_ptr;
 69 volatile u_64 * buf1_ptr;
 70
 71 /**************************************************************************/
 72 /* The following code relates to the command line parsing and can pretty */
 73 /* much be ignored. */
 74 /**************************************************************************/
 75 static struct argp_option options[] = {
 76     {"verbose", 'v', 0, 0, "Produce verbose output"},
 77     {"format", 'f', "STRING", 0, "Output format ('int, 'float' or 'double')."},
 78     {0}
 79 };
 80 static error_t parse_opt (int key, char *arg, struct argp_state *state);
 81
 82 struct arguments
 83 {
 84     char *args[1];
 85     int verbose; /* The -v flag */
 86     char *format; /* Argument for -f */
 87 };
 88
 89 const char *argp_program_version = "mta_test 1.1";
 90 const char *argp_program_bug_address = "";
 91 static char args_doc[] = "number";
 92 static char doc[] = "mta_test -- A program that generates random numbers using the MTA FPGA.";
 93 static struct argp argp = {options, parse_opt, args_doc, doc};
 94
 95 int print_err (err_e e)
 96 {
 97     switch (e) {
 98     case NOERR:
 99         printf("Success.\n");
100         break;
101     case FILEOPRERR:
102         printf("File operation system call failed.\n");
103         break;
104     case INVALIDOP:
105         printf("Invalid API operation requested.\n");
106         break;
107     case INVALIDVAL:
108         printf("Invalid value passed to the API call.\n");
109         break;
110     case INVALIDARGS:
111         printf("Invalid argument passed to the API call.\n");
112         break;
113     case INVALIDINP:
114         printf("Invalid input given to the API call.\n");
115         break;
116     case DEVOPRERR:
117         printf("FPGA device operation error.\n");
118         break;
119     case UNKNOWNERR:
120         printf("Unknown error.\n");
121         break;
122     default:
123         break;
124     }
125     return 0;
126 }
127
128 /* Provide a function to parse the parameters. */
129 static error_t parse_opt (int key, char *arg, struct argp_state *state)
130 {
131     struct arguments *arguments = state->input;
132
133     switch (key) {
134     case 'v':
135         arguments->verbose = 1;
136         break;
137     case 'f':
138         arguments->format = arg;
139         break;
140     case ARGP_KEY_ARG: // one argument accepted
141         if (state->arg_num >= 1) {
142             argp_usage(state);
143         }
144         arguments->args[state->arg_num] = arg;
145         break;
146     case ARGP_KEY_END:
147         if (state->arg_num < 1) {
148             /* Not enough arguments. */
149             argp_usage (state);
150         }
151         break;
152     default:
153         return ARGP_ERR_UNKNOWN;
154     }
155
156     return 0;
157 }
158
159 /**************************************************************************/
160 /* Initialize the FPGA */
161 /**************************************************************************/
162 static void
163 mt_1999_set (void *vstate, unsigned long int s, struct arguments *arguments)
164 {
165     mt_state_t *state = (mt_state_t *) vstate;
166     int i, page_size;
167     err_e e;
168     unsigned long val;
169     if (s == 0) s = DEFAULT_SEED; /* the default seed is 4357 */
170
171     state->mt[0]= s & INT32_MASK;
172
173     for (i = 1; i < N; i++)
174     {
175         state->mt[i] = (MULTIPLIER*(state->mt[i-1]^(state->mt[i-1] >> 30)) + i);
176         state->mt[i] &= INT32_MASK;
177     }
178
179     state->mti = i;
180
181     /* set up the FTR memory */
182     /* Allocate two buffers of 1MB each */
183     page_size = getpagesize();
184     i = posix_memalign((void *) &buf0_ptr, page_size, BUFF_SIZE);
185     if (i != 0) {
186         printf("Unable to allocate memory for buffer 0.\n");
187         exit(1);
188     }
189
190     i = posix_memalign((void *) &buf1_ptr, page_size, BUFF_SIZE);
191     if (i != 0) {
192         printf("Unable to allocate memory for buffer 1.\n");
193         exit(1);
194     }
195
196     /* Register the buffers with the FPGA device driver */
197     fpga_register_ftrmem(fp_id, (void *) buf0_ptr, BUFF_SIZE, &e);
198     if (e != NOERR) {
199         printf("Unable to register buffer 0 with the FPGA device driver.\n");
200         print_err(e);
201         free((void *) buf0_ptr);
202         free((void *) buf1_ptr);
203         exit(1);
204     }
205     fpga_register_ftrmem(fp_id, (void *) buf1_ptr, BUFF_SIZE, &e);
206     if (e != NOERR) {
207         printf("Unable to register buffer 1 with the FPGA device driver.\n");
208         print_err(e);
209         free((void *) buf0_ptr);
210         free((void *) buf1_ptr);
211         exit(1);
212     }
213
214     bzero ((void *) buf0_ptr, BUFF_SIZE);
215     bzero ((void *) buf1_ptr, BUFF_SIZE);
216
217     /* FTR memory is split into two buffers for */
218     /* data transfer between FPGA and the SMP */
219     /* Program the buffer addresses. */
220     fpga_wrt_appif_val (fp_id,
221                         (u_64) buf0_ptr,
222                         BUFF0_PTR_REG, TYPE_ADDR, &e);
223     fpga_wrt_appif_val (fp_id,
224                         (u_64) buf1_ptr,
225                         BUFF1_PTR_REG, TYPE_ADDR, &e);
226
227     /* set up the config register */
228     /* the value contains polling frequency and number of pages per buffer */
229     val = (1UL << 8) - 1; /* polling value set at 255 */
230     val <<= 32;
231     val |= ((1UL << 8) - 1); /* each buffer has 256 pages */
232     fpga_wrt_appif_val (fp_id, val, BUFF_CFG_REG, TYPE_VAL, &e);
233
234     /* write the state array */
235     for (i = 0; i < N/2; i++) {
236         fpga_wrt_appif_val (fp_id,
237                             (((0UL | state->mt [2*i+1]) << 32) |
238                              state->mt [2*i]),
239                             MTA_RAM_ARRAY + (unsigned long) i*8, TYPE_VAL, &e);
240     }
241
242     /* Generate the configuration register value. */
243     /* Start by turning the init key off. */
244     val = 0;
245     val |= (MTA_INIT_KEY & 0UL);
246
247     /* Set the output format bits according to the command line parameter. */
248     switch(arguments->format[0]) {
249     case 'i': /* integer */
250         val |= (MTA_FORMAT & (MTA_INTEGER<<1));
251         break;
252     case 'f': /* float */
253         val |= (MTA_FORMAT & (MTA_FLOAT<<1));
254         break;
255     case 'd': /* double */
256         val |= (MTA_FORMAT & (MTA_DOUBLE<<1));
257         break;
258     default:
259         break;
260     }
261
262     /* Write the format and start value to the FPGA */
263     fpga_wrt_appif_val (fp_id, val, APP_CFG_REG, TYPE_VAL, &e);
264 }
265
266 /**************************************************************************/
267 /* Fetches random numbers written into the buffer space by the FPGA. */
268 /* Returns a pointer to a block_ptr of 64-bit numbers. */
269 /**************************************************************************/
270
271 static inline volatile u_64 * mt_get (void)
272 {
273     static volatile u_64 * block_ptr = 0 ;
274     static volatile u_64 * curr_ptr = 0;
275     static int n = 1;
276     volatile u_64 * buf_ptr_arr [2];
277
278     buf_ptr_arr [0] = buf0_ptr;
279     buf_ptr_arr [1] = buf1_ptr;
280
281     if (n) {
282         curr_ptr = buf0_ptr;
283         n=0;
284     }
285
286     if ((curr_ptr == buf_ptr_arr [0])
287         || (curr_ptr == buf_ptr_arr [1])) {
288         while (*curr_ptr != BUFF_RD_RDY) {
289             /* Wait ... */
290         }
291         /* first 64 bytes are unused except for the read-write flags */
292         curr_ptr += 8;
293     }
294
295     block_ptr = curr_ptr;
296     curr_ptr += READ_NUMS;
297
298     if ((curr_ptr == buf_ptr_arr [0] + QUAD_WRDS_PER_BUF)) {
299         /* done reading this buffer */
300         *(curr_ptr - QUAD_WRDS_PER_BUF) = BUFF_WRT_READY;
301         curr_ptr = buf_ptr_arr[1];
302     }
303
304     if (curr_ptr == (buf_ptr_arr [1] + QUAD_WRDS_PER_BUF)) {
305         /* done reading this buffer */
306         *(curr_ptr - QUAD_WRDS_PER_BUF) = BUFF_WRT_READY;
307         curr_ptr = buf_ptr_arr [0];
308     }
309
310     return block_ptr;
311 }
312
313 /**************************************************************************/
314 /* Cast the pointer type returned by mt_get(). */
315 /**************************************************************************/
316
317 unsigned int *mt_get_int(void)
318 {
319     return (unsigned int *) mt_get();
320 }
321
322 float *mt_get_float(void)
323 {
324     return (float *) mt_get();
325 }
326
327 double *mt_get_double(void)
328 {
329     return (double *) mt_get();
330 }
331
332 /**************************************************************************/
333 /* Main body */
334 /**************************************************************************/
335
336 int main (int argc, char *argv [])
337 {
338     struct arguments arguments;
339     u_64 i=0,j=0,n=0;
340     mt_state_t state;
341     err_e e;
342     long begin=0, end=0, diff=0;
343     double rate;
344     volatile unsigned int *int_array = NULL;
345     volatile float *float_array = NULL;
346     volatile double *double_array = NULL;
347     unsigned int k = 0;
348     float f = 0.0;
349     double d = 0.0;
350
351     /* Set argument defaults */
352     arguments.verbose = 0;
353     arguments.format = "int";
354
355     /* Parse any command line options and arguments. Store them in */
356     /* the arguments structure. */
357     argp_parse (&argp, argc, argv, 0, 0, &arguments);
358
359     /* Check that the length of the test is reasonable */
360     n = strtol (arguments.args[0], NULL, 10);
361     if (n < 2) {
362         printf ("The minumum number of comparisons is 2 ... exiting.\n");
363         return(1);
364     }
365
366     /* Open the FPGA device */
367     /* We install the FPGA as /dev/ufp0 */
368     fp_id = fpga_open ("/dev/ufp0", O_RDWR|O_SYNC, &e);
369
370     /* If FPGA open failed exit. */
371     if (e != NOERR) {
372         printf ("Failed to open FPGA device. Exiting.\n");
373         return(1);
374     }
375
376     /* Reset, then restart. This puts the FPGA in a known state. */
377     fpga_reset (fp_id, &e);
378     fpga_start (fp_id, &e);
379
380     /* Initialize the FPGA. */
381     mt_1999_set (&state, 0UL, &arguments);
382
383     /* Fetch the required set of numbers based on command line arguments. */
384     switch(arguments.format[0]) {
385     case 'i': /* integer */
386         printf ("\nGenerating %ld 32 bit integers with FPGA.\n", n);
387         printf ("Calibrating timer ... ");
388         fflush(stdout);
389         begin = ustime ();
390         for (i = 0; i < n; i += READ_NUMS*2) {
391             int_array = mt_get_int();
392             for (j = 0; j < READ_NUMS*2; j++) {
393                 k = int_array[j];
394             }
395         }
396         end = ustime ();
397         printf ("\n Last number : 0x%08X\n", k);
398         break;
399     case 'f': /* float */
400         printf ("\nGenerating %ld 32 bit floats with FPGA.\n", n);
401         printf ("Calibrating timer ... ");
402         fflush(stdout);
403         begin = ustime ();
404         for (i = 0; i < n; i += READ_NUMS*2) {
405             float_array = mt_get_float();
406             for (j = 0; j < READ_NUMS*2; j++) {
407                 f = float_array[j];
408             }
409         }
410         end = ustime ();
411         printf ("\n Last number : %1.10f\n", f);
412         break;
413     case 'd': /* double */
414         printf ("\nGenerating %ld 64 bit doubles with FPGA.\n", n);
415         printf ("Calibrating timer ... ");
416         fflush(stdout);
417         begin = ustime ();
418         for (i = 0; i < n; i += READ_NUMS) {
419             double_array = mt_get_double();
420             for (j = 0; j < READ_NUMS; j++) {
421                 d = double_array[j];
422             }
423         }
424         end = ustime ();
425         printf ("\n Last number : %1.20f\n", d);
426         break;
427     default:
428         printf ("\nUnknown format requested.\n");
429         break;
430     }
431
432     /* Calculate the time taken to generate the numbers. */
433     diff = end - begin;
434     rate = (double) n /(double) diff;
435     printf (" Elapsed time : %ld microseconds\n", diff);
436     printf (" Rate : %4.3lf million numbers/second\n\n", rate);
437
438     /* stop the random number generation. */
439     fpga_wrt_appif_val (fp_id, 1UL, APP_CFG_REG, TYPE_VAL, &e);
440
441     /* deregister the buffers */
442     fpga_dereg_ftrmem(fp_id, (void *)buf0_ptr, &e);
443     if (e != NOERR) {
444         printf("Unable to deregister buffer 0 with the FPGA device driver\n");
445         print_err(e);
446     }
447     fpga_dereg_ftrmem(fp_id, (void *)buf1_ptr, &e);
448     if (e != NOERR) {
449         printf("Unable to deregister buffer 1 with the FPGA device driver\n");
450         print_err(e);
451     }
452
453     /* Free the buffers */
454     free((void *) buf0_ptr);
455     free((void *) buf1_ptr);
456
457     /* close the device */
458     fpga_close (fp_id, &e);
459
460     return 0;
461 }

Glossary

Active Manager
The software that monitors and manages all aspects of the Cray XD1 system. Its user interfaces provide administrators and end users with a single point of control for the system.

application acceleration processor (AAP)
See FPGA application acceleration processor.

chassis ID
The permanent numeric identifier of a chassis, unique to each Cray XD1 chassis. A chassis ID has a maximum of six decimal digits.

compute blade
One of six circuit boards in a Cray XD1 chassis; contains Opteron processors configured as an SMP, DIMMs, and a RapidArray processor. A compute blade may also have an expansion module.

Cray XD1 system
A stand-alone Cray XD1 chassis or multiple chassis that communicate over both the supervisory network and the RapidArray interconnect.

crossbar switch
A communications switch that provides direct connection between any pair of ports.

doubleword
Four bytes (32 bits).

expansion module
Optional Cray XD1 hardware that connects to each compute blade; if they are present, a chassis has six expansion modules. The expansion modules provide a node with a second RapidArray processor, two additional RapidArray links, and an optional application acceleration processor.


fabric
The collection of fabric components that interconnect in the same switching plane. A Cray XD1 system has one or two independently wired, parallel RapidArray fabrics: the main fabric and the optional expansion fabric. These fabrics are also known as fabric X and fabric Y, respectively.

fabric expansion card
Optional hardware in a Cray XD1 chassis that adds a second RapidArray fabric to the system: provides a second internal 24-port RapidArray switch, 12 additional internal links, and 12 additional external ports for chassis interconnection. The fabric expansion card connects to the main board.

field-programmable gate array (FPGA)
An integrated circuit that consists of arrays of AND and OR gates (typically thousands) that can be programmed to perform complex functions. The Cray XD1 system has optional FPGAs available for use as application acceleration processors.

FPGA application acceleration processor
An FPGA that users can program to accelerate computationally intensive and repetitive algorithms; acts as a co-processor to the Opteron processor. This is an optional component on the expansion module.

interconnect
See RapidArray interconnect.

job
A computing task that runs on one processor or multiple processors concurrently. The workload management (WLM) system assigns the requested resources and launches the job.

JTAG interface card
Optional hardware that provides a communication path to the JTAG port of the FPGA application acceleration processor. This card connects to one of the high-speed I/O slots on the main board of a Cray XD1 chassis. With this card, you can connect a workstation that runs appropriate debugging tools to the FPGA.


link See RapidArray link.

management processor The processor on the main board of a Cray XD1 chassis that runs the Active Manager hardware supervisory subsystem.

node An instance of the Linux operating system and the hardware components that it controls. The hardware components in a Cray XD1 node include an SMP and its associated memory, one or two RapidArray processors (depending on configuration) and, optionally, an FPGA application acceleration processor.

partition A logical group of nodes with the same operating system version and configuration; may reside in more than one Cray XD1 chassis. Partitions enable an organization to dedicate a set of nodes to perform a particular function (run a type of job, host a system-wide service, or serve a particular user group). Users treat the set of nodes in a partition as a single, homogeneous computing resource. Administrators specify the attributes of a partition.

PCI-X expansion card Hardware in the Cray XD1 chassis that provides four PCI-X slots for Gigabit Ethernet and Fibre Channel cards. This card also provides connectors for three disk blades. It connects to one of the three high-speed I/O slots on the main board.

quadword Eight bytes (64 bits).

RapidArray interconnect The high-speed network that interconnects the nodes in a Cray XD1 chassis, and connects all nodes in a Cray XD1 system via cables and optional external RapidArray switches. The RapidArray interconnect consists of a main and an optional expansion fabric, each with its own set of fabric components. The configuration of the RapidArray interconnect in a multichassis system is called the physical topology.


RapidArray link The physical communication path between two RapidArray ports. Each link can carry two gigabytes per second.

RapidArray processor (RAP) The special-purpose processor on a Cray XD1 compute blade that is responsible for most communication functions within the system. The RapidArray processor interfaces an Opteron processor to the RapidArray fabric.

RapidArray switch A full-crossbar nonblocking switch in the RapidArray fabric. The base configuration includes one 24-port RapidArray switch in each Cray XD1 chassis. The optional fabric expansion card adds a second RapidArray switch. Equivalent external RapidArray switches are available for implementing fat tree (switched) topologies.

switch fabric See fabric.

symmetric multiprocessor (SMP) In a Cray XD1 system, an SMP is formed from two single- or dual-core Opteron processors and their associated memory. One compute blade holds one SMP. Each chassis contains six compute blades and therefore contains six SMPs. See also node.

ufp User FPGA processor. A combining form that occurs in file and directory names; for example, in libufp.a. It is a synonym for the FPGA application acceleration processor.

workload management (WLM) system Software that schedules jobs for execution in a system of networked nodes.
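The per-chassis figures quoted in the glossary entries (six compute blades, one SMP of two single- or dual-core Opteron processors per blade, one 24-port RapidArray switch per fabric, and the doubleword and quadword sizes) can be combined into a quick sizing calculation. The following is a minimal sketch in Python and is not part of the Cray XD1 software. The names chassis_totals and quadwords_needed are hypothetical, and the sketch assumes that every chassis is fully populated and identically configured.

```python
"""Back-of-envelope sizing from the figures quoted in the glossary."""

# Figures taken from the glossary entries.
BLADES_PER_CHASSIS = 6   # compute blade: one of six circuit boards
PROCESSORS_PER_SMP = 2   # SMP: two Opteron processors, one SMP per blade
SWITCH_PORTS = 24        # RapidArray switch: 24 ports per switch
DOUBLEWORD_BYTES = 4     # doubleword: 32 bits
QUADWORD_BYTES = 8       # quadword: 64 bits


def chassis_totals(chassis=1, cores_per_processor=1, expansion_fabric=False):
    """Return component counts for a system of `chassis` chassis.

    Hypothetical helper: assumes all chassis are fully populated and
    that the fabric expansion card, if present, is in every chassis.
    """
    if chassis < 1:
        raise ValueError("a system has at least one chassis")
    if cores_per_processor not in (1, 2):
        raise ValueError("Opteron processors are single- or dual-core")

    smps = chassis * BLADES_PER_CHASSIS          # one SMP (one node) per blade
    processors = smps * PROCESSORS_PER_SMP
    fabrics = 2 if expansion_fabric else 1       # main fabric, plus optional expansion
    switches = chassis * fabrics                 # one internal switch per fabric per chassis
    return {
        "smps": smps,
        "processors": processors,
        "cores": processors * cores_per_processor,
        "internal_switches": switches,
        "internal_switch_ports": switches * SWITCH_PORTS,
    }


def quadwords_needed(num_bytes):
    """Number of 8-byte quadwords needed to hold num_bytes, rounded up."""
    if num_bytes < 0:
        raise ValueError("byte count must be non-negative")
    return -(-num_bytes // QUADWORD_BYTES)


if __name__ == "__main__":
    print(chassis_totals(chassis=2, cores_per_processor=2, expansion_fabric=True))
    print(quadwords_needed(17))
```

For example, two chassis with dual-core processors and the fabric expansion card come to 12 SMPs, 24 processors, 48 cores, and four internal switches with 96 ports in total; a 17-byte buffer occupies three quadwords.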
