PCI Express High Performance Reference Design
AN-456-1.4 Application Note

The PCI Express High-Performance Reference Design highlights the performance of the Altera® Stratix® V Hard IP for PCI Express and IP Compiler for PCI Express™ MegaCore® functions. The design includes a high-performance chaining direct memory access (DMA) engine that transfers data between the internal memory of an Arria® II GX, Cyclone® IV GX, Stratix® IV GX, or Stratix V FPGA and the system memory. The reference design includes a Windows XP-based software application that sets up the DMA transfers and measures and displays the performance achieved for the transfers. This reference design enables you to evaluate the performance of the PCI Express protocol in an Arria II GX, Cyclone IV GX, Stratix IV GX, or Stratix V device.

Altera offers the PCI Express MegaCore function in both hard IP and soft IP implementations. The hard IP implementation is available as a root port or an endpoint. Depending on the device used, the hard IP implementation is compliant with PCI Express Base Specification 1.1, 2.0, or 3.0. The soft IP implementation is available only as an endpoint and is compliant with PCI Express Base Specification 1.0a or 1.1.

The remainder of this application note includes a tutorial on calculating the throughput of the PCI Express MegaCore function and instructions for running the chaining DMA design example. The chaining DMA in this reference design is the chaining DMA example generated by the PCI Express Compiler. This example is explained in detail in the Stratix V Hard IP for PCI Express User Guide for Stratix V devices and in the PCI Express Compiler User Guide for earlier devices.

This application note includes the following sections:
■ "Understanding Throughput in PCI Express"
■ "Deliverables Included with the Reference Design"
■ "Reference Design Functional Description"
■ "Design Walkthrough"
■ "Performance Benchmarking Results"

Understanding Throughput in PCI Express

The throughput in a PCI Express system depends on several factors, including protocol overhead, payload size, completion latency, and flow control update latency. The throughput also depends on the characteristics of the devices that form the link. This section discusses the factors that you must consider when analyzing throughput. The examples assume a ×1 link operating at 2.5 Gbps (Gen1); the same analysis applies to a Gen2 link running at 5.0 Gbps.
Protocol Overhead

PCI Express uses 8b/10b encoding, in which every byte of data is converted into a 10-bit code, resulting in a 25% overhead. The effective data rate is therefore reduced to 2 Gbps, or 250 MBps, per lane.

An active link also transmits Data Link Layer Packets (DLLPs) and Physical Layer Packets (PLPs). The PLPs are four bytes (one dword) in size and consist of SKP ordered sets. The DLLPs are two dwords in size and consist of the ACK/NAK and flow control DLLPs. The ACK and flow control update DLLPs are transmitted in the opposite direction from the Transaction Layer Packets (TLPs). When the link is transmitting and receiving high-bandwidth traffic, the DLLP activity can be significant, on the order of one DLLP for every TLP. The DLLPs and PLPs reduce the effective bandwidth available for TLPs. Figure 1 shows the format of a TLP.

Figure 1. TLP Format: Start (1 byte), Sequence ID (2 bytes), TLP Header (3-4 DW), Data Payload (0-1024 DW), optional ECRC (1 DW), LCRC (1 DW), End (1 byte)

The overhead associated with a single TLP is five or six dwords when the optional ECRC is not included, and up to seven dwords with ECRC. The overhead includes the Start and End framing symbols, the Sequence ID, a TLP header that is three or four dwords long, and the link cyclic redundancy check (LCRC). The TLP header size depends on the TLP type and can change from one TLP to another. The rest of the TLP contains 0-1024 dwords of data payload.

Throughput for Posted Writes

The theoretical maximum throughput is calculated using the following formula:

Throughput (%) = payload size / (payload size + overhead) × 100

Figure 2 shows the maximum throughput possible with different TLP header sizes, ignoring any DLLPs and PLPs. For a 256-byte maximum payload size and a three-dword TLP header (a five-dword, or 20-byte, overhead), the maximum possible throughput is 256/(256 + 20), or approximately 93%.

Figure 2. Maximum Throughput for Memory Writes: theoretical maximum throughput (%) for memory writes on a ×1 link versus payload size (16 bytes to 4096 bytes), plotted for a 3 DW header, a 4 DW header, and a 4 DW header with ECRC.

The maximum TLP payload size is controlled by the device control register (bits 7:5) in the PCI Express configuration space. The MegaCore function Maximum payload size parameter sets the read-only Maximum Payload Size Supported field of the device capabilities register (bits 2:0) and optimizes the MegaCore function for this payload size. You can configure the MegaCore function for a maximum payload size, which the system can then reduce based on the maximum payload size it supports. Because this parameter affects resource utilization, do not set it higher than the maximum payload size supported by the system in which the MegaCore function is used.

PCI Express uses flow control: a TLP is not transmitted unless the receiver has enough free buffer space to accept it. A device needs sufficient header and data credits before sending a TLP.
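As a concrete illustration of this credit check, the following minimal Python sketch (not part of the reference design; all names are hypothetical) models the transmit-side decision for a posted write, assuming the standard PCI Express credit granularity of one header credit per TLP and one data credit per 16 bytes (four dwords) of payload.

# Minimal sketch of the transmit-side flow control check for a posted write.
# Assumes standard PCI Express credit granularity: 1 header credit per TLP,
# 1 data credit per 16 bytes (4 DW) of payload. All names are hypothetical.

def credits_required(payload_bytes):
    """Return (header_credits, data_credits) needed to send one posted write TLP."""
    header_credits = 1
    data_credits = -(-payload_bytes // 16)  # ceiling division: 16 bytes per data credit
    return header_credits, data_credits

def can_transmit(payload_bytes, free_header_credits, free_data_credits):
    """True if the receiver has advertised enough free credits to accept this TLP."""
    hdr, data = credits_required(payload_bytes)
    return free_header_credits >= hdr and free_data_credits >= data

# A 256-byte posted write needs 1 header credit and 16 data credits.
print(credits_required(256))      # (1, 16)
print(can_transmit(256, 1, 16))   # True
print(can_transmit(256, 1, 15))   # False - wait for an FC Update DLLP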
When the application logic in the completer accepts the TLP, it frees up the Rx buffer space in the completer's transaction layer. The completer then sends a flow control update (FC Update DLLP) that returns the credits consumed by the originating TLP. When a device uses up all of its initial credits, the link bandwidth is limited by how fast it receives credit updates. Flow control update latency depends on the maximum payload size and on the latencies of the transmitting and receiving devices.

For more information about the flow control update loop and the associated latencies, refer to the Flow Control chapter of the Stratix V Hard IP for PCI Express User Guide for Stratix V devices or the PCI Express Compiler User Guide for earlier devices.

Throughput for Reads

PCI Express uses split transactions for reads. A requester first sends a memory read request. The completer sends an ACK DLLP to acknowledge the memory read request and subsequently returns the completion data, which can be split into multiple completion packets. Read throughput is somewhat lower than write throughput because the data for a read may be returned in multiple completion packets rather than in a single packet, and each packet carries its own overhead. The following example illustrates this point.

Assuming a read request for 512 bytes and a completion packet size of 256 bytes, the maximum possible throughput is calculated as follows:

Number of completion packets = 512 / 256 = 2
Overhead for two 3-dword TLP headers with no ECRC = 2 × 20 bytes = 40 bytes
Maximum throughput = 512 / (512 + 40), or approximately 93%

These calculations do not take into account any DLLPs and PLPs.

The read completion boundary (RCB) parameter specified by the PCI Express Base Specification determines the naturally aligned address boundaries on which a read request may be serviced with multiple completions. For a root complex, the RCB is either 64 bytes or 128 bytes. For all other PCI Express devices, the RCB is 128 bytes. A non-aligned read request may experience a further throughput reduction.

Read throughput depends on the round-trip delay between the time when the application logic issues a read request and the time when all of the completion data has been returned. To maximize throughput, the application must issue enough read requests and process enough read completions, or offer enough non-posted header credits, to cover this delay.

Figure 3 shows timing diagrams for memory read requests (MRd) and completions (CplD). The timing diagram in the top part of the figure shows the requester waiting for a completion before making a subsequent read request, resulting in lower throughput. The timing diagram in the bottom part shows the requester issuing enough memory read requests to hide the completion delay for all but the first read, thus maintaining higher throughput.
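The write and read calculations above are easy to script. The following Python sketch (illustrative only; the function names are mine) reproduces the theoretical maximum throughput for posted writes shown in Figure 2 and the 512-byte read example, using the per-TLP overhead discussed earlier (20 bytes for a 3 DW header, 24 bytes for a 4 DW header, 28 bytes with ECRC) and ignoring DLLP and PLP traffic.

# Theoretical maximum TLP efficiency, ignoring DLLP and PLP traffic.
# Per-TLP overhead: framing + sequence ID (1 DW), header (3 or 4 DW),
# optional ECRC (1 DW), LCRC (1 DW).

def write_throughput(payload_bytes, header_dw=3, ecrc=False):
    """Posted-write efficiency: payload / (payload + per-TLP overhead)."""
    overhead_bytes = 4 * (1 + header_dw + (1 if ecrc else 0) + 1)
    return payload_bytes / (payload_bytes + overhead_bytes)

def read_throughput(read_bytes, completion_bytes, header_dw=3, ecrc=False):
    """Read efficiency: the completion may be split into packets, each with its own overhead."""
    packets = -(-read_bytes // completion_bytes)  # ceiling division
    overhead_bytes = packets * 4 * (1 + header_dw + (1 if ecrc else 0) + 1)
    return read_bytes / (read_bytes + overhead_bytes)

print(f"{write_throughput(256):.1%}")      # 256-byte write, 3 DW header: ~92.8%
print(f"{read_throughput(512, 256):.1%}")  # 512-byte read as two 256-byte completions: ~92.8%

# Sweep the payload sizes plotted in Figure 2.
for size in (16, 32, 64, 128, 256, 512, 1024, 2048, 4096):
    print(size,
          f"{write_throughput(size, header_dw=3):.1%}",
          f"{write_throughput(size, header_dw=4):.1%}",
          f"{write_throughput(size, header_dw=4, ecrc=True):.1%}")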
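To put the point of Figure 3 in rough numbers, the number of read requests that must remain outstanding is set by the bandwidth-delay product of the link. The sketch below is a back-of-the-envelope estimate under stated assumptions, a ×1 Gen1 link with roughly 250 MBps of bandwidth, 512-byte read requests, and a range of hypothetical round-trip latencies; it is not a measurement from the reference design.

import math

# Back-of-the-envelope estimate: to keep completion data flowing continuously,
# roughly (round-trip latency x link bandwidth) bytes of reads must be in flight.

def outstanding_reads(round_trip_s, bandwidth_bytes_per_s, read_request_bytes):
    in_flight_bytes = round_trip_s * bandwidth_bytes_per_s
    return max(1, math.ceil(in_flight_bytes / read_request_bytes))

# Hypothetical round-trip latencies on a x1 Gen1 link (~250 MBps), 512-byte reads.
for rt_us in (1, 2, 5, 10):
    print(f"{rt_us} us round trip -> "
          f"{outstanding_reads(rt_us * 1e-6, 250e6, 512)} outstanding read request(s)")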