Analysis and Optimisation of Communication Links for Signal Processing Applications

ANDREAS ÖDLING

Degree Project in Electronic and Computer Systems, second level, 30 credits
School of Information and Communication Technology (ICT)
KTH Royal Institute of Technology
Supervisor: Johnny Öberg
Examiner: Ingo Sander
Stockholm, November 12, 2012

TRITA-ICT-EX-2012:287

Abstract

There are lots of communication links and standards currently being employed to build systems. Many of these methods are standardised, but far from all of them. The trick is to select the communication method that best suits your needs. There is also currently a trend that things have to be cheaper and have a shorter time to market, which leads to more systems being built from Commercial Off-The-Shelf (COTS) commodity components.

As one part of this work, Gigabit Ethernet is evaluated as a COTS solution for building large, high-end systems. The computers used run Windows, and the protocols used over Ethernet are both TCP and UDP. In this work an attempt is also made to evaluate one of the non-standard protocols: the Link Port protocol for the TigerSHARC 20X series, which is a narrow, double-data-rate protocol able to provide multi-gigabit-per-second performance.

The studies have shown many interesting things, e.g. that using a standard desktop computer and network card, the theoretical throughput of TCP over Ethernet can almost be met, reaching well over 900 Mbps. UDP performance, on the other hand, gives birth to a series of new questions about how to achieve good performance in a Windows environment, since it is consistently outperformed by the TCP connections.

For the Link Port assessment, a custom-built IP block is made that is able to support the protocol at full speed, using a Xilinx Virtex 6 FPGA. The IP block is verified through simulation against a model of the Link Port protocol. It is also shown that the transmitter of the IP block is able to send successfully to the receiver IP block. The created IP block is evaluated against some competing multi-gigabit protocols for comparison; it is a rather small IP block, capable of handling all transactions on the bus as long as data is provided by its host.

Referat

At present there are many different kinds of communication links, both standardised and not. In addition, demands for shorter time to market have in many cases led to more and more systems being built from ready-made components that are connected together into complete systems. As a part of this, well-proven techniques that are known to work are often used.

As one part of this work, the performance of Gigabit Ethernet will be evaluated for ordinary personal computers running Windows, using the TCP and UDP protocols. The computers are equipped with low-cost standard network cards, and the investigation aims to find out whether these cards and computers can be used to build systems with high performance. In addition, a non-standardised protocol, the Link Port protocol for the TigerSHARC 20X series, which supports several Gbps, will be evaluated for performance.

The study of TCP and UDP led to very interesting results. Among other things, the study has shown that TCP communication between two personal computers can come within just a few Mbps of the theoretical maximum, and communication speeds well over 900 Mbps have been measured for TCP. UDP, in turn, raised more questions than it answered, and it consistently had worse performance than the TCP tests. This suggests that, when writing programs for ordinary personal computers, there is nothing to gain from using UDP; rather the opposite.

For the study of Link Ports, an IP block was created which can send and receive data at the highest rate specified in the protocol description: four gigabits per second. The block was verified through simulation and by letting the transmitter send data which the receiver successfully received. Finally, the Link Port was compared against other protocols with similar characteristics, and the comparison presents the created IP block as a good alternative to the other protocols, much due to its simplicity.

Contents

Abstract
Referat
Contents
List of Figures
List of Tables
Listings
Definitions

I Prelude

1 Introduction
1.1 Purpose
1.2 Goals
1.3 Motivations for This Work
1.4 Limitations for This Work
1.5 Layout for the Report

2 Background and Related Work
2.1 History of Radar Systems
2.1.1 Radar Construction Basics
2.1.2 A Probable Future
2.2 An Example System
2.2.1 Conceptual Radar System
2.2.2 Data Transfers in the Conceptual Radar System
2.3 A Background to Physical Signalling
2.4 Multi-Gigabit Transceivers
2.5 The Link Port Protocol
2.5.1 Some Link Port Characteristics
2.5.2 Previous Work on Link Ports
2.6 Communication Protocols
2.7 Previous Work on Protocol Comparison
2.7.1 TCP and UDP Performance Over Ethernet
2.7.2 RapidIO Analysis
2.7.3 PCI Express Evaluation
2.7.4 USB Experiments
2.7.5 Infiniband Studies
2.7.6 Intel Thunderbolt
2.8 Data Acquisition Networks
2.8.1 Common Features for DAQ Networks

II Contributions

3 Methods
3.1 Link Port
3.2 Gigabit Ethernet
3.2.1 Setup for the Experiment
3.3 Other High-Speed Protocols

4 Ethernet On Windows Computers
4.1 Hardware and Software Setup
4.1.1 Offloading Checksum Calculations
4.1.2 Increasing the Transfer Buffers
4.1.3 Increasing the Receiver Buffers
4.1.4 Increasing the Frame Size
4.1.5 Control the Interrupt Rate
4.2 Evaluating the Performance
4.2.1 The Measurement Environment
4.3 TCP Specifics
4.3.1 TCP and IP Checksum Offloading
4.3.2 Effects from Interrupt Moderation
4.3.3 Changing the Ethernet Frame Size
4.3.4 Variable Buffer Size
4.4 TCP Evaluation and Summary
4.5 UDP Specifics
4.5.1 Interrupt Moderation Effects
4.5.2 Buffer Size Exploration
4.5.3 Does Frame Size Affect UDP Performance?
4.6 Analysis of UDP Performance
4.7 Summary of Ethernet Performance
4.8 Which Settings to Choose

5 Creating a Link Port IP Block
5.1 Link Port Implementation Idea
5.1.1 Key Coding Considerations
5.2 Link Port Transmitter
5.2.1 Transmitter Clocking
5.2.2 Transmitter State Machine
5.2.3 Transmitter LVDS Outputs
5.2.4 The Data Path and Memory Design
5.2.5 Controlling the Transmitter
5.2.6 Checksum Calculator
5.2.7 The Implementation of Block Complete
5.3 Link Port Receiver
5.3.1 Receiver Finite State Machine
5.3.2 Controlling the Receiver
5.3.3 The Deserialisation of Incoming Data
5.3.4 Receiver LVDS Inputs
5.3.5 Getting the Receiver Through Timing
5.4 Testing and Verification
5.5 IP Block Restrictions
5.6 IP Block Metrics
5.7 Link Port Implementation Time
5.8 This Link Port Implementation Contributions
5.9 Comments and Analysis of the Link Port IP Block

6 Comparison of Communication Techniques
6.1 Hard facts
6.2 Making a Choice

7 Goal Follow Up and Conclusions

8 Future Work

Bibliography

III Appendices

A Abbreviations

B A Selection of Used Xilinx Primitives

C Selection of Needed Constraints

D The OSI Model
D.1 Physical Layer
D.2 Data Link Layer
D.3 Network Layer
D.4 Transport Layer
D.5 Session Layer
D.6 Presentation Layer
D.7 Application Layer

E PCI Express
E.1 Associated Overhead

F Gigabit Ethernet
F.1 Real-Time Ethernet
F.2 Efficiency of Gigabit Ethernet

G TCP/IP Protocol Suite
G.1 The Internet Protocol Version 4
G.1.1 Efficiency of the Internet Protocol Datagrams
G.2 The User Datagram Protocol
G.3 The Transmission Control Protocol
G.3.1 Socket Buffer Size
G.3.2 Different TCP Implementations
G.3.3 TCP Offload Engine
G.3.4 RDMA-Enhanced TCP Decoding
G.3.5 TCP Efficiency Over Ethernet

H Link Port for TS20X-Series
H.1 Performance of Link Ports
H.2 Uses of Link Ports

I RapidIO
I.1 The Logical Layer
I.2 Transaction Layer
I.3 Physical Layers
I.3.1 Serial RapidIO
I.3.2 Parallel RapidIO

J USB

K Infiniband

L 8B/10B Encoding

M Case Study: The ATLAS TDAQ System
M.1 The Communication Protocols in ATLAS
M.2 The Physical Interconnects and Software of ATLAS TDAQ

List of Figures

2.1 Radar PPI
2.2 Example of partitioned radar system
2.3 Example data processing flow
2.4 Example Radar System
2.5 Differential Signalling
2.6 Multi-Gigabit Transceiver placements
2.7 Link Port Back-to-back Transmissions
2.8 Link Port Checksum Transmission
2.9 Link Port Start and Stop of Transmission

4.1 Flowchart description of Ethernet measurement
4.2 TCP Checksum Offloading Effects
4.3 More Checksum Offloading Examples
4.4 Interrupt Moderation Effects On TCP Performance
4.5 Throughput of 4088 B Jumbo Frames
4.6 Throughput of 9018 B Jumbo Frames
4.7 TCP Throughput With Variable Sender Buffer Size
4.8 TCP Throughput With Variable Receive Buffer Size
4.9 UDP Performance With Varying Interrupt Moderation
4.10 Packet Loss With Different Interrupt Moderation Settings
4.11 UDP Throughput With Variable Buffer Size
4.12 Packet Loss For UDP with Variable Buffer Size
4.13 UDP Throughput at Different Frame Sizes
4.14 Comparing Received and Sent Bytes per Second for UDP
4.15 Linear Approximation of Measured UDP Throughput

5.1 Original Link Port Receiver
5.2 Original Link Port Transmitter
5.3 Link Port Transmitter Block Diagram
5.4 Transmitter Clocking Relationships
5.5 Transmitter FSM Chart
5.6 Link Port Transmitter Enable Schematic
5.7 Output Clock of Link Port Transmitter
5.8 Writable Control Registers
5.9 Readable Status Register
5.10 Link Port Receiver Block Schematic
5.11 Receiver FSM Flowchart
5.12 Link Port receiver timing start
5.13 Link Port receiver timing end
5.14 Link Port receiver timing with CoreGenerator
5.15 Link Port receiver first schematic
5.16 Receiver LVDS Inputs
5.17 Input Clocking of Link Port Receiver
5.18 Input Logic With Clocking Net Shown
5.19 Link Port Receiver Clock Crossing

D.1 OSI reference model

E.1 PCI Express packet

F.1 An overview of the layers in Gigabit Ethernet
F.2 Ethernet MAC Frame
F.3 Theoretical Ethernet Throughput

G.1 An IPv4 Packet Header with the optional options field following it
G.2 IP over Ethernet maximum throughput
G.3 UDP Packet Outline
G.4 UDP over Ethernet theoretical throughput
G.5 TCP Packet Outline
G.6 TCP over Ethernet theoretical throughput

I.1 RapidIO to OSI Mapping
I.2 The layout of a serial RapidIO packet of arbitrary size. The pink is the logical layer, the light gray is the transport layer and the blue is the physical layer. All sizes are in bits unless otherwise specified.

M.1 The concept of the original ATLAS network. It is split in two separate sub-networks, where one computes application one and the other computes application two.
M.2 ATLAS Split Network

List of Tables

1 Definitions
2 Special Text Decorations
2.1 Link Port Input/Outputs
4.1 Computer Setup in Ethernet Test
5.1 Resource usage for IP blocks
5.2 Time Spent On IP Block Creation
6.1 IP Block Resources Comparison
B.1 Summary of Xilinx Primitives

Listings

4.1 Ethernet Setup Message
4.2 Client Program Pseudocode
C.1 Multicycle Checksum Constraints
C.2 Link Port Input Constraints

Definitions

Byte: Eight bits, equal to an Octet
Half Word: Two Octets (16 bits)
Octet: Eight bits, equal to a Byte
Packet: A unit of transmission of a protocol
Quad Word: Four Words (128 bits)
Quartet: Four bits
Word: Four Octets (32 bits)

Table 1. A table of some common definitions that will be used throughout this report.

Some definitions will be used throughout the report; they are specified in Table 1. There are also some special text decorations in use, which are specified in Table 2.

PRIMITIVES: Primitives are written with capital letters in a typewriter font.
Signal: Signals are written in bold letters.
1 and 0: Logical one and zero are written in typewriter font as 0 and 1.

Table 2. A table summarising the font decorations of special words.


Part I

Prelude


Chapter 1

Introduction

The industrial revolution created something humans had never dreamt of before: standards for components. This has arguably been a great improvement, and it made the technical revolution of the last century possible. However, it has not applied to everything. In the computer world a lot of things are standardised, especially in the personal computer domain, but on the industrial side there are more non-standardised solutions. This lack of standards increases time to market, because custom communication solutions have to be created and thoroughly tested before shipping the product.

In the light of harder competition, and the need to sell products that are already developed rather than concepts that have to be developed after purchase, the use of pretested and verified techniques is inevitable. In many areas commodity standards are already in use, e.g. for processors, memory modules etc.; in communications, however, this transition is still taking place.

For these reasons a comprehensive overview of the communication standards available for large-scale embedded systems is needed, and that is why this work was initiated. The main target application is embedded radar systems of varying sizes, with applications in civil security as well as in the military field.

1.1 Purpose

The purpose of this work is to evaluate the effect of different communication links in embedded systems for radar applications. The type of communication is divided into three categories:

• Inter-chip communication. The communication between chips on the same printed circuit board (PCB). Examples are RocketIO, which are Multi-Gigabit Transceivers (MGTs) on Xilinx FPGAs, and Link Port, which is a Low Voltage Differential Signalling (LVDS) communication protocol by Analog Devices.


• Inter-board communication. The communication between different PCBs in the same system. Examples are RocketIO, which is a Multi-Gigabit Transceiver (MGT) on Xilinx FPGAs [1], and Link Port, which is a Low Voltage Differential Signalling (LVDS) communication protocol.

• System-to-host communication. The communication between the host and the rest of the system. Examples are Gigabit Ethernet (GbE), USB 3.0, Thunderbolt, Infiniband (IB) and possibly some other techniques.

Some of these techniques will be studied separately, since they all pose different demands on the communication in terms of reliability, throughput and latency. Some of them will only be compared theoretically, while others are tested or simulated in order to measure their performance.

1.2 Goals

The goals of this work are to:

• Create a VHDL implementation of a Link Port [2] for a TigerSHARC (TS20X) processor, to provide a communication interface between an FPGA and a DSP. This model should be verified for functionality through simulation and, in that simulation, tested for maximum throughput, latency, area and power.

• Investigate how the transfer speed of the TCP and UDP protocols over GbE between two units is affected by altering the maximum payload of the Ethernet frame (jumbo frames), as well as the buffer sizes and the interrupt settings of the network cards. From the results, draw conclusions on how to best utilise GbE in embedded system design.

• Collect research results regarding some high-speed protocols supported by the multi-gigabit transceivers (MGTs) inside a Xilinx FPGA, as well as USB 3.0 and Thunderbolt. Compare the protocols to show what the benefits and drawbacks of each protocol are, and give recommendations on when to use which protocol.

• Examine the latest research results to try to predict the future standards and trends of digital communication within embedded systems.

1.3 Motivations for This Work

The Gigabit Ethernet part of this work will focus on the TCP/IP protocols over Ethernet. In contrast to most previous work, which has been done in Linux environments, this study will look into the TCP/IP protocols in a Windows environment and how to optimise the networking performance. Furthermore, this work tries to specify

how a certain traffic pattern associated with radar signal processing will affect the performance of such interconnected machines, instead of optimising for an arbitrary traffic pattern.

For the Link Port implementation, this work will contribute an FPGA interface that communicates at gigabit speeds with DSPs in a cluster. Furthermore, if this implementation is successful, it could also be used for lower-end FPGA-to-FPGA communication, bringing this rather low-speed communication to FPGAs without any gigabit transceivers. This work will then be compared to other, standardised communication techniques in order to examine which method is the most beneficial.

The studies concerning other protocols will be beneficial when selecting which communication standard or standards to implement in which link when constructing a high-end communications network for radar data processing. Since every single technique has different characteristics and is optimised for different traffic patterns, several different communication techniques may be chosen in order to best serve the traffic pattern of the selected application. In this part, the evaluation of future communication techniques will also be included to some extent, since the future standards are the most recent research trends.

1.4 Limitations for This Work

The aim of this work is to examine how different techniques and protocols are best utilised and to give some guidelines on which to choose when implementing a system. However, it is beyond the scope of this work to set up environments to actually test all of these protocols. The aim is to take a theoretical approach and evolve it into guidelines for how to select the most appropriate communication protocol. Two techniques will be studied in more depth: the Link Port, and TCP and UDP over GbE. The Link Port protocol will be examined by creating an IP block, in order to simulate and measure its characteristics. The study of TCP and UDP over GbE will consist of evaluating the achievable throughput over GbE lines while using COTS components.

1.5 Layout for the Report

In this chapter the topic is introduced and the purpose of this work is explained. It covers some details about radar systems which may be superfluous; however, a concept system is introduced and looked at. Chapter 2 attempts to lay some groundwork for readers on this subject of communications, with a focus on radar and embedded systems. It also tries to summarise some of the contemporary research made in these areas. By reading chapter 3, readers will get an explanation of which methods have been used in this project to reach the goals.


Chapter 4 presents how Ethernet was tested, as well as the results, the conclusions made and an analysis of the reached results. Chapter 5 explains the creation of the Link Port IP blocks and their evaluation; all metrics are presented and all parts of the IP blocks are specified. Chapter 6 compares the different techniques that have been evaluated, not only the two which were tested but also some which have only been studied in theory. In chapter 7 some final conclusions are drawn regarding the work that has been carried out. Finally, chapter 8 suggests improvements and future work which has to be done in order to straighten out some of the question marks raised in this work. As an aid to the reader, all (or most) of the abbreviations used are listed in Appendix A, and all of the Xilinx primitives used to describe the FPGA part are listed in Appendix B. The appendices also present some in-depth material in certain areas for readers who wish to have a deeper understanding of the subject, even though that content is in no way necessary for the results.

Chapter 2

Background and Related Work

In this chapter, the focus is to introduce some concepts of radar and computer communication. It starts with some history on the subject of radar, and then moves on to an example radar system setup, discussing the system and its data flow. It then describes which techniques can be used and what is currently used. Finally, the chapter finishes with a walk-through of some comparisons of common protocols in systems with similar specifications as the radar systems.

2.1 History of Radar Systems

The development of radar began when Heinrich Hertz in the late 19th century verified a prediction of Maxwell's electromagnetic field theory [3]. When doing so, he used an apparatus with functionality resembling pulsed radar. This work was later continued and built upon by Hülsmeyer, who created a radar that he wanted to mount on ships in order to monitor other ships and thus avoid collisions at sea.

During the Second World War, a lot of radar development was carried out [3]. All participating forces developed their own radars, including the forces from America, Great Britain, Germany, the Soviet Union, Italy, Japan, France and the Netherlands. The radars they developed were both land-based and ship-borne; some with long range and others with shorter range, but their main task was to search the airspace. After the Second World War, the Moving Target Indicator (MTI) was invented to find moving objects when analysing the radar echo. In order to find the moving objects, the Doppler effect was exploited. Further on in radar development, radars have travelled into space for surveillance of our planet and the exploration of the universe [3].

Another application, first theorised in 1951 but long used only sparsely to its full extent, is SAR (Synthetic Aperture Radar). This technology has been difficult to realise in real time due to the large amounts of data that need to be processed continuously. But thanks to the ever-shrinking size and increasing performance

of microcontrollers and integrated circuits, more and more SAR systems are seen today. For example, modern aircraft carry SAR systems in order to map the surrounding terrain [4].

2.1.1 Radar Construction Basics

Historically, when building a radar system, all components were custom-built; but in recent years, when prices have started to fall and cost savings are a reality for developers, many radar systems are made from COTS (Commercial Off-The-Shelf) components [3]. Still, however, the front end with antenna, transmitter and receiver is created specifically for each kind of radar. The changes have mostly been further back in the data processing line, in the signal processing and detection parts (see Figure 2.4).

The signal processing in a radar system mostly operates on the I and Q components of the received signal, i.e. its real and imaginary components [3]. The objective of the processing is to remove clutter as well as unwanted noise and jamming signals. For removing clutter from the incoming signal, the signal processing applies different filters to extract data from it, e.g. MTI (Moving Target Indicator) and MTD (Moving Target Detection) to detect non-stationary objects in range. Here, the trend has been to move from very specialised hardware to COTS components.
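As an illustration of the kind of filtering involved (a textbook example, not a result from this work), the simplest MTI filter is a single delay-line canceller operating on the complex I/Q samples from consecutive pulses at a fixed range bin:

\[
  x[n] = I[n] + jQ[n], \qquad y[n] = x[n] - x[n-1]
\]

An echo from a stationary object has the same phase from pulse to pulse, so it cancels in the subtraction, while an echo from a moving object is Doppler-shifted, its phase rotates between pulses, and it passes through the filter.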

Figure 2.1. An example of a PPI radar image, common in surveillance radars. Used with permission of Christian Wolff © at www.radartutorial.eu

After the filters have been applied to the signal, often several detections are recorded, and they need additional filtering to understand how many real objects have been detected and their positions [3]. When the number and locations of targets are determined, the data may be displayed on a monitor in the shape of a plan position indicator (PPI, see Figure 2.1), which is a common display

used in surveillance radar applications, which indicates the underlying terrain. On this, the objects detected by the different filters are displayed, together with some additional information, such as an aircraft's calling code if it is a friendly aircraft in a military system. This is of course only one example of what a radar can do, since there are several other uses as well. In some applications a monitor showing data is unnecessary, e.g. in traffic cameras measuring speed. In such an application it is sufficient to determine the speed of moving objects in order to decide whether to photograph them or not.

2.1.2 A Probable Future

The era of custom-built components is coming to an end for most radar systems [3]. Instead, new systems have to be cheap enough and have a short time to market. To enable this, the use of COTS components is crucial, and thus the future designers of such systems need to be able to choose the correct components to build these high-end systems. In those systems, data transfers will be of crucial importance, since large quantities of data need to be transferred quickly. This will be the main focus of this report: the data links connecting the COTS components.

2.2 An Example System

For purposes of discussion further on in this report, an example of a radar system will be shown. The system is a pipelined system, with different components doing their specific tasks all the time. The problem in such a system is that data has to be transferred between the different stages of the pipeline, and in the case of a radar system, the amounts of data are large. To address this problem there is a substantial need for high-throughput, low-latency links.

2.2.1 Conceptual Radar System

A very basic diagram of a radar system is shown in Figure 2.2, where the basic macro-components and their interconnects are visible. A brief summary of its operation is that an operator watches the radar screen with a PPI image on it; the operator is also able to set some parameters of the system in order to control what output the system gives and its responsiveness. The transmitter sends a signal to the antenna, which transmits that signal, and the receiver then receives the echo of that radar signal. The received echo is then sent into the signal processing of the radar system.

When data arrives at the signal processing, it is often raw data in large amounts. Signal processing and detection may be looked at as one step, since both involve computations on the raw radar data in an often sequential manner.

Figure 2.2. An example radar system partitioning where the major parts are shown: Transmitter, Duplexer, Antenna, Receiver, Signal processing, Detector, Video and User, connected by signal and control paths.

The data that arrives has to be moved down the signal processing and detection system in a timely manner, and for radar applications there are often hard real-time deadlines that need to be met. This is basically because new data continuously arrives from the next scan, and all data needs processing in order to create the radar images, target indication, target tracking, ground maps etc. [4]. These are techniques that require the movement of very large quantities of data [5], all with hard real-time requirements, even though the transfer characteristics of different kinds of algorithms may differ a lot [5].

Since this work targets the data transfers from the point where data enters the front end of the signal processing until it exits the back end, a deeper understanding of these data transfers will be sought in the next subsection.

2.2.2 Data Transfers in the Conceptual Radar System

Figure 2.3. Example processing flow where the input data passes through five filters on its way towards the output. It is visible that filters 3 and 4 may run in parallel, but apart from that the processing needs to be done sequentially.

In order to extract all the valuable information from the received radar data the data needs to be processed. The processing consists of a number of filters, such as FIR filters, FFTs (Fast Fourier Transforms) and other digital filters with

specific purposes [5]. Since the processing is in many ways a series of sequential calculations [5] (see Figure 2.3), where each calculation in most cases needs to be completed prior to the start of the next, there are some ways to cope with this. The two most obvious solutions are to either pipeline the processing, so that the later filters do not have to wait for data except in the start-up phase, or to use processors powerful enough to complete all processing between the arrival times of two consecutive datasets. The pipelining solution provides higher latency of the operation, probably with the benefit of less strict timing restrictions on each filter. If the latency is low enough when pipelining, it is a feasible solution. If it is not, then a more powerful processing solution must be created.

Some exploration into this subject is presented in Bueno et al. [6] and in Bueno, Conger and George [5], where they try to implement a space-based radar system for Synthetic Aperture Radar (SAR) and Ground Moving Target Indicator (GMTI). In order to do so, a network of processing cards is implemented, where the network should be fast enough to finish the processing of the data between two data arrivals. They also experiment with a pipelined solution where new data is fed into the system while computing results from the old data, in order to improve performance. Their system has high demands on throughput inside the network for efficient partitioning of data. The problem with transferring dataset N+1 into the system while calculating set N is that the interconnection network may be congested with data, thus lowering the performance of the N:th calculation. However, they found that this pipelining was beneficial for the performance of the GMTI algorithm, while it was more difficult when implementing SAR.

In solving the problem with data transfers when implementing the radar system ([5] and [6]), a Parallel RapidIO solution was chosen (see Appendix I for an overview of RapidIO). By interconnecting the units with Parallel RapidIO, an FPGA-supported industrial standard was chosen, which may deliver data rates over serial lines from one up to five gigabits per second [7]. The benefit of using an industrial interconnect with an open standard such as RapidIO is that there is no single-vendor dependency: as long as the components support the standard, they may be connected to each other.

An alternative approach to having a network calculate every sample in between data releases is to pipeline the events. This is done in [8], where a hardware SAR processor is built which pipelines the calculations into several parts, naturally improving throughput. The downside of pipelining is the increased latency associated with it, since everything cannot proceed at full speed. However, if implemented in a smart manner, the penalties of pipelining might be very small or in some cases even negligible, and pipelining may increase throughput without increasing latency so much that it violates the timing deadlines.

In Figure 2.4, an example of a pipelined signal processing and detection system in a radar is shown, consisting of four stages. In the first stage, data is read into the system from a number of Analog-to-Digital Converters (ADCs).


Figure 2.4. A layout of an example radar signal processing system. The letters i, M, N are arbitrarily chosen to make a scalable system with correct characteristics.

The data is then passed on to front ends, which initiate the calculations and probably try to reduce the dataset and remove unnecessary data before moving it down the line. In the back end, which is the third pipeline stage, the data is further processed in order to extract the wanted information, e.g. moving targets in an MTI or GMTI. This data is finally sent to some video processing for visualisation, and possibly to some data storage for off-line processing at a later stage.

If comparing Figure 2.3 with Figure 2.4, we may map the filters in Figure 2.3 to the stages in Figure 2.4: filter 1 would go into the ADC step, filter 2 into the front-end signal processor, filters 3 and 4 into the back-end signal processor, whereas filter 5 is implemented in the video processing processor. By partitioning the system like this, often reducing the amount of data between steps, an effective pipeline is created. However, between the components there are lots of data transfers that need to take place, and they often need to do so in very short time. This puts requirements on the systems to have high throughput in order to provide sufficient data transfer capability. There are several techniques for transferring data, but the question is which components are to be used, and how.
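The latency/throughput trade-off of such a pipeline can be made concrete with a simplified model (an illustration only, not a model taken from the cited works). For a pipeline with stages k = 1, ..., K, where stage k takes T_k seconds per dataset and new datasets arrive every T_in seconds:

\[
  \text{latency} = \sum_{k=1}^{K} T_k, \qquad
  \text{throughput} = \frac{1}{\max_k T_k}
\]

The pipeline keeps up with the input as long as \(\max_k T_k \le T_{in}\), even though each result is delayed by the full sum of the stage times plus the inter-stage transfer times; this is exactly why the links between the stages must be fast.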

2.3 A Background to Physical Signalling

First, an introduction to different physical signalling techniques will be presented. In the field of computer communications, as in all other computer fields, there is a desire to increase the speed and throughput of data.

Figure 2.5. The idea of differential signalling is to take one input and then create both that signal and its complement and send both. By doing this, noise immunity improves and emitted noise is lowered; furthermore, the voltage swing of the difference between V+ and V- is double their individual swings.

In the past, systems were often made of chips interconnected with multi-drop buses. The solution when more data had to be sent was then to either increase the bus width or the bus frequency, thus increasing the total throughput of the system. Recently the trend has changed. The problems associated with multi-drop buses are, amongst others, skew and increased pin count. This becomes a problem when frequency is increased along with the width, causing serious routing problems on the boards. These problems have driven the trend towards high-speed serial communications. The benefit of serial links is that little or no skew occurs, depending on the layout of the serial bus. If the clock is embedded into the data stream, the data transfer may happen over only a single line; since it is only one line, skew is zero. The serial link also has the advantage of using fewer pins, giving a smaller footprint in the I/O of a design.

However, many of the high-speed links use differential signalling for their communication, which doubles the number of pins compared to single-ended signals. This might seem like a problem, but in most cases it is not, since differential signalling has other large benefits. The idea is to use two pins instead of one, where one of the pins (p) carries the positive voltage, i.e. it is high for 1 and low for 0. In addition to this pin, its negative complement is put on the n-pin, which has high voltage for 0 and low voltage for logical 1 [9]. By using these differential signals and subtracting them from each other, the difference V+ - V- has double the swing of V+ alone. This means that the voltage swing on each line only needs to be half of what it would have had to be with single-ended signalling; thus the rise and fall times of the differential pins are lowered, enabling higher frequencies [9]. To increase noise immunity, the differential pairs should be routed tightly together, since the electrical field is emitted between the conductors. This way they emit less noise, and they are almost identically affected by external noise. The noise immunity comes from the fact that both lines are affected by the same noise, which cancels in the voltage subtraction, leaving the resulting difference the same.


The only standardised differential signalling technique is Low-Voltage Differential Signalling (LVDS) [9, 10]. This is a very energy-efficient way of signalling, with raw data rates up to 3.125 Gbps. Optional encoding may be included in order to provide good signal integrity; if such encoding is present, the payload rate will be lower than the raw signalling speed. This technique is used in several serial communication standards, even though not all of them use the standardised LVDS signalling. One example is Serial RapidIO [7], which uses LVDS signalling in order to increase data integrity.

The second alternative for differential signalling is Emitter-Coupled Logic, ECL. ECL is the oldest of the differential techniques and is today widely used in different military applications, mainly due to its ability to work in all temperature ranges [9]. The main drawback of ECL is that it operates at negative voltages. This is a problem since it is not common to supply negative voltages to chips, and designers tend to use mostly positive voltages.

The last physical technique is Current-Mode Logic, CML. CML is a kind of ECL, but with some differing characteristics. The biggest difference is the transistor circuitry, which causes CML to have a higher common-mode output voltage [9]. This structure makes CML the fastest choice when creating a differential link, with transfer speeds exceeding the LVDS standard's. However, CML links are restricted in length due to their high transfer speed, and may almost exclusively be used for chip-to-chip communication on a single board. Furthermore, CML is far more power-consuming than LVDS at a given bit rate.

2.4 Multi-Gigabit Transceivers

To enable high bit rates between devices, the shift has been from wide parallel buses to multi-lane independent serial lines with point-to-point connections, as discussed in the previous section. In using these serial connections, circuits need to be able to transfer single bits at several gigabits per second. This requires specially built chips, which could be external to the processing element (PE), as in Figure 2.6 a), or an integral part of an FPGA or embedded processor of some sort, as in Figure 2.6 b). These components are sometimes referred to as Multi-Gigabit Transceivers, or MGTs, and are used to generate very high-speed signals.

Figure 2.6 b) shows a typical layout of an FPGA with embedded processing elements and integrated MGTs. This is the implementation used by both Altera and Xilinx, the two major FPGA producers. By integrating these MGTs into different IP cores supplied with the FPGAs, the system developer has most of the common high-speed serial communication protocols readily available. Some of the protocols supported by both Altera and Xilinx high-end FPGAs are PCI Express, Serial RapidIO, XAUI (a part of the 10GbE standard) and SATA, but many more are supported [11, 12].

The ability to use these standards ensures high transfer speeds between chips, boards and chassis. Their availability inside FPGAs makes it a lot easier for developers to use these high-speed interconnect technologies, compared to having to add an external card which handles the serial transfer, since everything is handled on-chip and thus may be thoroughly tested and simulated inside the FPGA development environment.


Figure 2.6. Different placements of MGTs, either as a separate chip, as in a), or as an integral part of e.g. an FPGA, as in b). In both cases the processing element (PE) connects to the MGT over a wide, slower bus, while the MGT drives a narrow multi-gigabit serial line.

2.5 The Link Port Protocol

First out of the studied protocols is the Link Port protocol, since it is a key part of the project. The Link Port protocol exists in several versions, but the one looked at here is the protocol for the TigerSHARC TS20x series, specified in [13]. The Link Port is specified as a differential data bus and an associated source-synchronous clock, with two additional control signals: an acknowledgement and a block complete. The idea of the Link Port protocol is for the TigerSHARC DSP to be able to interface to other components through a multi-gigabit-per-second serial interface.

2.5.1 Some Link Port Characteristics

The Link Port protocol exists in many versions, for different processor families from Analog Devices. The protocol of interest here is the one targeting the TigerSHARC 20x processor series. The Link Port is a DDR (double data rate) protocol, which means that two data items are presented each clock cycle, one on each clock edge. The protocol targets point-to-point connections only, meaning that for each link there can be only one sender and one receiver. Several sender and receiver circuits can coexist on the same device, however, enabling the creation of processing clusters with many devices.

The Link Ports have a specific set of ports, listed in Table 2.1. Of these four ports, the two most crucial and fast-switching are differential, while the two control signals (Ack and n_BCMP) are normal single-ended signals.

The start-up of a Link Port transmission is done by the transmitter setting n_BCMP to logic 1 (deasserting it).


Port Name   Width (Data/Physical)   Description
Data        4/8                     Four differential data pairs. Outputs from the sender.
Clk         1/2                     Differential clock pair clocking the data. Outputs from the sender.
Ack         1/1                     Acknowledgement sent by the receiver, indicating that it may receive data. Output from the receiver.
n_BCMP      1/1                     Block Complete, used to signal the last quad-word of a transmission and to set up the link after reset. Output from the sender.

Table 2.1. The inputs and outputs of the Link Port. Their respective transfer origin is specified in the table; naturally, each signal is an input at the side which does not drive it.

Then the receiver knows that the transmitter is present, and it may indicate the possibility to receive data by asserting Ack. Data may then be transmitted until Ack is deasserted again.

Due to the inner workings of the TigerSHARC processor, whose data bus is 128 bits wide, that is also the transmission unit size of the Link Port. This means that data is sent in chunks of 128 bits, with a checksum option available which sends an additional 16 bits for increased data integrity. The checksum itself is one byte long and is sent after the data; after the checksum byte, a dummy byte is also sent before the transfer of the next data begins. An example of the end of a transmission with checksum enabled is presented in Figure 2.8.

The Link Port protocol specifies a discontinuous clock for clocking the data. The first input data is to be clocked in at the first rising edge of the input clock, and the last received data arrives at the last falling edge of the clock. Also, the clock output is low when no data transmission is currently happening. The start and end of a transmission are shown in Figure 2.9, where we see that the clock is driven low when no transaction takes place. When transmitting more than a single quad-word there is no need for a gap between the words, and the next transmission starts on the rising edge following the last falling edge of the previous clock, see Figure 2.7.


Figure 2.7. The transmission of two back-to-back quad-words has no gap between them.

The clock signal of the Link Port protocol may be clocked at up to 500 MHz, and the data arrives at DDR on a 4-bit-wide bus. This means that it can receive 1 byte per clock cycle, or 500 MB per second. The unit of transfer is either 128 or 144 bits, depending on whether the transmission has the checksum enabled or not.



Figure 2.8. The end of a transmission with the checksum option enabled.


Figure 2.9. The start and end of a transmission, showing the discontinuous clock at both times.

This means that transfers of small quantities of data are rather inefficient, but if lots of data are sent, the throughput is either the full 500 MB/s or, if the checksum is enabled,

\[
  \frac{128}{144} \cdot 500\,\text{MB/s} \approx 444\,\text{MB/s}.
\]

For some more information on the Link Port protocol, see Appendix H, or read on to chapter 5 where details will be explained along the way.
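To summarise the framing described above, the following C sketch models how one quad-word would appear as 4-bit nibbles on the data pins. It is a behavioural illustration only: the nibble ordering and the checksum algorithm are assumptions (the real definitions are in the Analog Devices specification [13]); a simple byte-wise sum is used here as a stand-in.

    #include <stdint.h>
    #include <stdio.h>

    /* Behavioural sketch of Link Port framing (illustration only).
     * A quad-word is 128 bits = 16 bytes = 32 nibbles on the 4-bit DDR bus.
     * With the checksum option, one checksum byte plus one dummy byte follow,
     * i.e. 144 bits in total per transmission unit.
     * ASSUMPTIONS: nibble order (low nibble first) and the checksum function
     * (byte-wise sum modulo 256) are placeholders, not taken from the spec. */

    #define QW_BYTES 16

    static uint8_t checksum(const uint8_t *qw)    /* placeholder algorithm */
    {
        unsigned sum = 0;
        for (int i = 0; i < QW_BYTES; i++)
            sum += qw[i];
        return (uint8_t)(sum & 0xFF);
    }

    /* Emit the nibble stream for one quad-word; out must hold 36 nibbles.
     * Two nibbles are transferred per clock cycle (DDR). */
    static int serialise_qw(const uint8_t *qw, int with_csum, uint8_t *out)
    {
        int n = 0;
        for (int i = 0; i < QW_BYTES; i++) {      /* 32 data nibbles */
            out[n++] = qw[i] & 0x0F;              /* assumed: low nibble first */
            out[n++] = qw[i] >> 4;
        }
        if (with_csum) {
            uint8_t c = checksum(qw);
            out[n++] = c & 0x0F; out[n++] = c >> 4;   /* checksum byte */
            out[n++] = 0;        out[n++] = 0;        /* dummy byte */
        }
        return n;  /* 32 nibbles (16 clocks) or 36 nibbles (18 clocks) */
    }

    int main(void)
    {
        uint8_t qw[QW_BYTES] = {0}, nibbles[36];
        int n = serialise_qw(qw, 1, nibbles);
        /* At 500 MHz DDR this takes n/2 clock cycles per quad-word. */
        printf("%d nibbles, %d clock cycles\n", n, n / 2);
        return 0;
    }

With these numbers the efficiency figure above follows directly: 32 of the 36 nibbles carry payload, i.e. 128/144 of the raw 500 MB/s.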

2.5.2 Previous Work on Link Ports

In the area of Link Ports, comparative studies are less common than for standard protocols. However, this does not mean that the performance of Link Ports has not been tested. The Link Port is a point-to-point technology which connects two units (DSPs) for inter-chip communication. The fact that they are associated with DSPs means that Link Ports are often used in computationally intensive applications, e.g. radar and image processing. The use of Link Ports in the literature often concerns the creation of real-time processing systems. In both [14] and [15], Link Ports are used to create pipelined radar processing systems. The first work uses a pipelined version of the radar system, where several DSPs have different tasks and data is passed down through the pipeline. The other approach is more of a brute-force one, where the nodes in a cluster of DSPs are connected to an FPGA and communicate through the Link Port protocol.

A third work which uses Link Ports is presented in [16]. This design uses Link Ports to interconnect DSP clusters with each other and with FPGAs, as well as to interconnect several FPGAs. The article addresses some design problems when creating FPGA Link Ports, such as receiver clocking and input design. The biggest design challenge in their Link Port design was the receiver input clocking. In their Virtex 5 FPGA, they used a global clock buffer to gain an equal clock delay to all the input clocking components. They also used a number of primitives similar to ISERDES to deserialise the incoming data.


2.6 Communication Protocols

After summarising the most common differential signalling techniques and how they are implemented in FPGA solutions, a summary of different communication protocols will now be presented. Their common characteristic is, in some sense, that all of the protocols specify everything from a physical layer upwards to a layer where application data may be transferred.

RapidIO One common feature of all the protocols that will be studied is that they are almost exclusively serial in their nature, with one exception. The RapidIO link [7] has both a parallel and a serial physical interface. RapidIO is a fairly new communication protocol which targets embedded high-end systems with requirements on high transfer speeds [9] and high connectivity. Since it provides both serial and parallel interfaces, and a switch should be able to handle both parallel and serial modes [7], high interoperability is possible to achieve. The serial RapidIO links operate at effective speeds ranging from one to five Gbps. On top of that, there is an overhead of 12 to 20 bytes for payloads of up to 256 bytes, giving a maximum effective throughput of 95.5% of the link speed. RapidIO is further described in Appendix I.
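The 95.5% figure follows directly from the numbers above, at the maximum payload with the minimum overhead:

\[
  \eta_{\max} = \frac{256}{256 + 12} \approx 95.5\%,
  \qquad
  \eta_{\min} = \frac{256}{256 + 20} \approx 92.8\%
\]

where the second figure, with the full 20-byte overhead, matches the maximum theoretical utilisation of just over 92% quoted for the SRIO measurements in section 2.7.2.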

PCI Express Another technology which is of interest to this study is PCI Express (PCIe) [17], which is explored in depth in Appendix E. PCI Express is the evolution of the legacy PCI bus from a parallel, multi-drop bus to a serial, point-to-point topology. PCI Express has the disadvantage that it has to be backwards compatible with PCI, to maintain operability with older operating systems. However, if implemented in a completely new system, it has potential as an embedded interconnect with low overhead and low latency. Given the correct conditions, it may achieve a 99.5% efficiency over its links (Appendix E).

Ethernet One of the most common interconnect technologies today is Ethernet (see Appendix F); almost every new PC is sold with an Ethernet interface. Ethernet is an old interconnect network which has seen many improvements since it was first introduced. As of now, Ethernet has evolved from a half-duplex, sub-10 Mbps system to a full-duplex 100 Gbps system. This makes Ethernet one of the most popular networks in existence today, and it is the interconnect technology in several high-performance computers [18]. The main advantages of Ethernet are its low price/performance ratio and the number of people with knowledge about it.

TCP/IP Ethernet, however, is only standardised up to Layer 2 in the OSI model (Appendix D), and above that other protocols are commonly implemented in order to ensure reliability. The most well-known protocol framework is the TCP/IP suite. In this suite several protocols are fitted, including TCP, UDP and IP, all

explained in depth in Appendix G. Together these serve as the backbone of the most well-known network of all, the Internet. There are a lot of different protocols in the suite, but the most well-known are TCP, which is a reliable end-to-end protocol that guarantees delivery; UDP, which is a connectionless protocol with very little overhead; and IP, which takes care of the routing of both TCP and UDP packets. TCP guarantees delivery of packets in sequential order [19], which might be very beneficial. The TCP protocol has little overhead in bytes; however, there might be a larger processing overhead in providing its guarantees.
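As a minimal illustration of the difference in programming model between the two transport protocols (a generic Winsock sketch, not the measurement program used later in this work; the address and port are placeholders):

    /* Minimal Winsock sketch: sends one buffer to 192.0.2.1:5000 over TCP,
     * then one datagram over UDP. Error handling is reduced to a bare
     * minimum. Link with ws2_32.lib. */
    #include <winsock2.h>
    #include <stdio.h>

    int main(void)
    {
        WSADATA wsa;
        char buf[1460] = {0};                 /* one MSS-sized payload */
        struct sockaddr_in dst = {0};

        if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0) return 1;
        dst.sin_family = AF_INET;
        dst.sin_port   = htons(5000);         /* hypothetical test port */
        dst.sin_addr.s_addr = inet_addr("192.0.2.1");

        /* TCP: connection set-up, then a reliable, ordered byte stream. */
        SOCKET t = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        if (connect(t, (struct sockaddr *)&dst, sizeof dst) == 0)
            send(t, buf, sizeof buf, 0);
        closesocket(t);

        /* UDP: no connection, no delivery guarantee, one datagram per call. */
        SOCKET u = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
        sendto(u, buf, sizeof buf, 0, (struct sockaddr *)&dst, sizeof dst);
        closesocket(u);

        WSACleanup();
        return 0;
    }

The point of the sketch is the asymmetry: the UDP path has no handshake and no retransmission machinery, which is why its per-byte overhead is lower even though, as chapter 4 will show, that does not automatically translate into higher throughput on Windows.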

Infiniband The other popular interconnect for high-performance computers, besides Ethernet, is Infiniband [18]. Infiniband is less mainstream than Ethernet, but delivers higher performance in aspects such as latency. Furthermore, it has several different transmission techniques, in some sense like the TCP/IP suite. The bandwidth of Infiniband is scalable, and it is possible to scale up by increasing the number of parallel lanes. A more in-depth explanation exists in Appendix K.

USB A very common and mainstream interconnect is USB [20, 21]. It has been released in three specifications, each of which has increased the bandwidth by a great margin. The current USB 3.0 standard specifies a full-duplex, 5 Gbps connection. However, USB is not as easy to interpret in terms of communication speed as several other protocols. It has different transaction types, the ability to reserve bandwidth (up to 80% for USB 3.0) and so forth. This makes the link usable for real-time traffic, but not to a full extent. Furthermore, the specified rate is the raw bit speed, meaning that penalties for encoding need to be accounted for. For a more comprehensive summary of USB, please refer to Appendix J.
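As an example of such an encoding penalty: USB 3.0 uses 8b/10b encoding (see Appendix L), sending 10 line bits for every 8 data bits, so the usable bit rate before any protocol overhead is

\[
  5\,\text{Gbps} \times \frac{8}{10} = 4\,\text{Gbps} = 500\,\text{MB/s}.
\]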

Thunderbolt A more recent technique is Thunderbolt [22], released by Intel, which is a technology that encapsulates both PCI Express and DisplayPort communication in an external cable [22]. The specification for this standard is only available under non-disclosure agreements. The technology supports a bidirectional, full-duplex, 10 Gbps channel for inter-chassis communication. The link itself works with both isochronous data transfers and burst transfers, although the amount of bandwidth which may be reserved is not clear from the source.

2.7 Previous Work on Protocol Comparison

A lot of work has been done to evaluate these different communication methods, and below a summary of that work is attempted. The authors in [23, 24] state that there are three main backbone architectures for embedded systems: Ethernet with TCP or some other upper-level protocol, PCI Express, and RapidIO. These three are considered by them to be the backbone architectures best suited for embedded systems. In addition to those three, Infiniband, USB and the newly developed Thunderbolt will be reviewed.


2.7.1 TCP and UDP Performance Over Ethernet

In [25], one of the few comparisons between Linux and Windows TCP performance is carried out. Furthermore, they investigate how the performance varies with different NICs, different internal bus widths and payloads (MTUs). They show that the performance when a card is installed directly out of the box is often far below its optimum configuration. However, they find it easier to improve performance in a Linux environment than in Windows. Some factors they find can improve performance are an increased MTU, increased socket buffers and a reduced interrupt rate. However, the conclusion is not that increasing everything gives the best benefits; instead, it is that tuning a NIC correctly improves performance the most.

In [26], they also investigate how buffer sizes and MTUs affect the performance over long transmission lines with TCP/IP over 10GbE. They show that increased buffer sizes and MTU sizes increase the performance of the communication when transmitting over long distances.

In [27], the objective is to compare Fedora Linux with Windows XP and Server 2003. The experiment examines the operating systems' ability to forward packets by trying to send as many packets as possible through a PC. This work, however, does not examine how to improve performance, but is only a measurement of how fast the operating systems are at forwarding packets in user and kernel space.

In [28], an attempt is made to monitor the different delays in the Windows and Linux UDP stacks. The study indicates, as the previous studies have also suggested, that the processing time in Linux is shorter than in Windows. However, their tests were conducted using the minimal Ethernet packet size, and thus the overhead would be maximal. In contrast, this study will look at larger packets. The article also discusses some performance enhancements and their impact on real-time behaviour. The setting that limits real-time performance the most in the Windows case is interrupt moderation, i.e. waiting for more packets before issuing an interrupt, thus reducing the number of interrupts. Tests showed that if this setting was configured improperly, the system showed very poor performance in terms of latency.

In addition to these works, a lot of work concerning Ethernet performance was done when building the ATLAS detector at CERN [29]. A thorough examination is available in Appendix M, but a brief summary will be presented here. The decision made when building the data acquisition for ATLAS was to use Ethernet as the backbone communication methodology, to do the real-time filtering of data from approximately 60 TB/s when captured down to 300 MB/s when stored to disk [30]. The initial thought was to go with Fast Ethernet, at 100 Mbps, but as technology evolved the chosen technology became a combination of Gigabit and 10-Gigabit Ethernet [31]. As a communications protocol for ATLAS, TCP/IP was considered, but was later abandoned due to its non-real-time effects [32]. Since the application layer already had timeouts which were much more predictable in ensuring real-time behaviour, the use of TCP was looked at as a performance risk rather than a benefit.


This was mainly due to over-occupation of the buffers, polluting the network with acknowledgements and potentially sending unwanted data. However, for non-real-time data, TCP is looked at as a great option, since with it the application does not itself have to guarantee delivery, as it would have to with raw Ethernet or UDP packets.
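Of the tuning knobs identified above, the socket buffers are the one that can be set from application code; the following hedged Winsock sketch shows how (the 1 MB value is an arbitrary example, and settings such as MTU and interrupt moderation live in the NIC driver, not in the socket API):

    /* Sketch: enlarging the per-socket buffers discussed above (Winsock).
     * The 1 MB figure is an arbitrary example; optimal sizes are exactly
     * what experiments like [25, 26] try to determine. Link with ws2_32.lib. */
    #include <winsock2.h>
    #include <stdio.h>

    static void set_buffers(SOCKET s, int bytes)
    {
        /* SO_SNDBUF / SO_RCVBUF resize the kernel send/receive buffers. */
        setsockopt(s, SOL_SOCKET, SO_SNDBUF, (const char *)&bytes, sizeof bytes);
        setsockopt(s, SOL_SOCKET, SO_RCVBUF, (const char *)&bytes, sizeof bytes);
    }

    int main(void)
    {
        WSADATA wsa;
        if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0) return 1;

        SOCKET s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        set_buffers(s, 1 << 20);              /* example: 1 MB each way */

        int got, len = sizeof got;
        getsockopt(s, SOL_SOCKET, SO_RCVBUF, (char *)&got, &len);
        printf("receive buffer is now %d bytes\n", got);

        closesocket(s);
        WSACleanup();
        return 0;
    }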

2.7.2 RapidIO Analysis

RapidIO is a rather new standard with both a serial and a parallel interface; a more thorough explanation may be found in Appendix I. The idea is to present a high-speed interconnect for embedded systems with low overhead and high throughput. In contrast to PCI Express, it does not need to be compatible with old PCI buses, which removes a lot of inherited design constraints [9]. Unlike PCI Express, the main area of use for RapidIO is as an embedded system's interconnect.

Some implementations using RapidIO have been analysed in the research community. One of the most interesting for this thesis work was made at the University of Florida [5, 6], where a distributed signal processing system for radar applications was implemented with parallel RapidIO. In their work they conclude that RapidIO is able to meet the communication demands of their real-time processing.

In [24], claims are made that RapidIO is the best of the most popular embedded interconnect architectures (Ethernet, RapidIO, PCI Express) in the sense that it combines the strengths of the other two in a single solution; for example, only RapidIO offers both unreliable and reliable transactions. This is exemplified in an experiment where a highly interconnected system with very high bandwidth is created in the form of a dual star.

In [33], an application is built on Serial RapidIO (SRIO), but without any performance metrics measured beyond the statement that the system is scalable. In [34], however, measurements are made of the efficiency of SRIO, with results of up to 86% of the link utilised for payload data. It is not entirely clear which settings they used when creating their packets, or whether they used the maximum payload, but it is clear that this is not very far from the maximum theoretical utilisation of just over 92% (see Appendix I and [7] part 6).

Another interesting performance evaluation of RapidIO, the parallel kind, is made in [35]. Here the latency and saturation link utilisation are measured. They conclude that with a single switch between two end nodes, a 64-bit read may be completed in fewer than 100 ns, showing that latency-sensitive data may be sent over a RapidIO link.

2.7.3 PCI Express Evaluation

In [36], a data acquisition system is built into a PC using a PCI Express interface. The study involves measuring the maximum throughput of the link, which in most cases was at 75% of the theoretical maximum, even though their calculations estimated that, with their transfer characteristics, the theoretical maximum would be at 85% (see Appendix E for more details on why 85%). However, their study shows that for 99.998% of the time, throughput was over 50%, and in all cases it exceeded 45%.

In [37], a study is presented where a COTS PCIe-enabled motherboard is used to speed up calculations in a parallel benchmark. They also show differences between PCIe and PCI-X, where PCIe has several advantages in terms of lower latency and higher bandwidth.

2.7.4 USB Experiments

A pseudo-real-time USB application is created in [38] for a Windows 7 environment. This work only uses a full-speed (12 Mbps) USB connection, but it does achieve timely behaviour in the sense that one read is performed exactly every 10 ms, which is their deadline, thus demonstrating the possibility of using USB for real-time applications. In [39], an FPGA implementation of a USB device is created with a slower and a faster mode, both operating in high-speed USB. Their transfer limitations lie in the underlying architecture on the FPGA, where the transfer rates are approximately 100 and 400 Mbps. These rates are achieved and exceeded in testing, indicating that their solution is able to utilise high-speed USB almost to its full extent.

2.7.5 Infiniband Studies

Infiniband (IB) is of high interest to high-energy physics [40] in terms of data acquisition, very similar to the work conducted here. The appealing factors of Infiniband are its high throughput and low latency. It is also very important in the field of High Performance Computing (HPC), where high throughput and low latency are equally critical [41].

In [42], an assessment of Internet Protocol (IP) performance over Infiniband is performed, and in that evaluation they find that IB is a competitor to 10GbE, since it delivers very high throughput, in this particular study up to 11 Gbps. Adding the results in [43], which show that the latency of IB is lower than Ethernet latency, the conclusion has to be that IB is a strong competitor to Ethernet. In [42], the two modes of IB, connected and unreliable, are also compared. They find that by using connected mode, and thus not needing a TCP layer on top, the system works faster than the usual IP defragmentation algorithms do. Hence, a speedup is gained when IB performs the fragmentation/defragmentation instead of the UDP stack.

Further research in [40] indicates that the choice between Infiniband and 10G Ethernet depends on the expected packet size, since Ethernet outperforms Infiniband for small packets, and vice versa for big packets. However, they point out that these tests were only conducted for a point-to-point case and might not be valid for another setup with several machines over another type of network.


2.7.6 Intel Thunderbolt

There is little research to find on Thunderbolt. Some work has been done, though, e.g. an attempt to interconnect several PCs running Windows Server 2008 R2 and have them communicate with each other over Thunderbolt [44], or Light Peak as it was called prior to public release. Their results suggest that Thunderbolt may be used to create data clusters in the future, since their prototype managed to achieve well over 50% utilisation of the buses.

2.8 Data Acquisition Networks

Since the aim here is to evaluate performance in radar signal processing hardware, a characteristic of which is large amounts of data transfers, this subchapter will look into systems with similar requirements, called Data Acquisition systems (DAQs). There are lots of similarities, since these systems are created to transfer data at very high rates from their input to the storage or post-processing at the back.

According to [45], DAQ systems may be divided into three categories: PC-based, embedded and FPGA systems. PC-based systems are those which use a PC to visualise the data in some way. Either connected to an internal bus or to an external connector, these are standard PCs with an extension that captures data. For transmission of data, some techniques are USB, FireWire, RS232, Ethernet and so on; an internal alternative is to connect to a PCI or PCI Express bus. Embedded DAQ systems are found in cars, aeroplanes, medical equipment and several other applications [45]. These are often fast, high-performance systems, but they have a fixed architecture and are not hardware-reconfigurable after they are built. This is in contrast to the FPGA solution, which may be hardware-reconfigured after the system has been deployed [45].

Following this definition, the work conducted in this thesis focuses mainly on a hybrid between FPGA and PC-based systems. The theoretical setup, shown earlier in Figure 2.2, has a PC at the end where the user gets to see the output data. However, the data passes through some non-PC components prior to visualisation, e.g. COTS components or custom-created components, which may very well be implemented in an FPGA.

In general, many systems are in some sense PC-based; they only differ in the amount of processing done in the PC. Two examples which use a PCI-bus-based card for capture and then do the processing on the CPU are [46, 47]. This is done in order to use commodity computers for the processing. Another common approach is to place high-end computers in the middle, between the data collection and the visualisation PC. This is done, for example, in the ATLAS experiment (see Appendix M for a case study) as well as in other experiments such as the Daya Bay Neutrino DAQ [48, 49] and the KM3NeT Detector [50, 51]. All of these use computational clusters of different sizes, but they all use specialised computers to process the data and filter out the important parts of it. They also all employ an Ethernet backbone in the cluster, and VME buses for the in-chassis communication.

2.8.1 Common Features for DAQ Networks

Even though all data acquisition systems are different and created to fulfil their own purpose, they have a lot of features in common. One such feature is that several of them use the VME bus for communication within a chassis, often in the front end [48, 52, 53, 54, 55, 56]. It is used for the communication between cards within the same chassis, and it is a proven, open, standard bus.

The feature that most systems have in common is that they utilise Ethernet to some extent. Some use it for the communication between the front end and the back end [48, 52, 54, 57]. In the Deep Sea Neutrino Telescope [51], Ethernet is used to connect the detectors to the processing system, which is located on land. Ethernet may also be used only to transmit the control signals to the DAQ system [53]. Finally, Ethernet may be used to carry almost every signal in the entire communication of a DAQ system (Appendix M).

A technique that is also present in some of the DAQs is the use of FPGAs. For example, the KATRIN DAQ [58] uses FPGAs for the level-triggering of the acquired signal. In the T2K experiment [57], FPGAs are used both for controlling the front-end ASIC collector circuitry and in the back end for calculations.

Finally, the most common feature of all the described DAQ systems is the need for a high sustained data rate with true real-time behaviour. The amounts of generated data vary, ranging from a few megabytes [52, 54], via around 0.1 terabyte transmitted over several kilometres from KM3NeT [51], to the enormous amount of around 60 terabytes from ATLAS [30]; this data arrives every second and has to be processed in time so that the systems may continue processing.

Part II

Contributions


Chapter 3

Methods

There are three parts to this thesis work. The first is the investigation of an LVDS-communicating Link Port; the second is the Gigabit Ethernet investigation; and the last is the investigation of other potentially interesting protocols. The methodology for each of them is described below.

3.1 Link Port

To investigate the Link Port for the TigerSHARC processor, a VHDL model will be created from the specifications of the Link Port protocol [2]. The model will be simulated in a test bench to verify that it functions as specified. Since the Link Port interface is targeted at communication between an FPGA and the TigerSHARC, a back-end bus connection will be built for the Link Port model as well. The choice of bus connection will be based on the probable bus architecture in a possible future use of this module.

The model will also be used to check timings and to see what the effective bandwidth of the Link Port is under some real-case traffic patterns from a typical radar system. This will be compared to the theoretical values from a mathematical model that will be created.

To ensure the accuracy of the model when implemented as an IP block, the model will be constrained with the necessary specifications of the TigerSHARC processor [2]. This will ensure that the design meets its timing every time. The model will be placed and routed on a probable end-application FPGA to ensure functionality.

During the implementation of the Link Port IP block, the implementation time for different tasks in the design will be monitored and reported, enabling efficiency comparisons between VHDL and other implementation languages.

Finally, the VHDL design will be evaluated for common FPGA design parameters, namely area, power, bandwidth and latency. This is done in order to be able to compare this implementation with other implementations.

3.2 Gigabit Ethernet

For the investigation of the Gigabit Ethernet link, there will be two parts. Firstly, research articles will be reviewed in an attempt to determine which factors affect the effective bandwidth the most. From this, experiments will be designed to either support or refute the theory derived from the research results.

The experiments will involve interconnecting two or more machines running the Windows operating system, to evaluate how their Ethernet efficiency varies when the parameters found in the literature are altered. Based on the test results, the optimal settings for a typical case of an Ethernet link on a Windows network will be presented.

3.2.1 Setup for the Experiment

In order to collect a sufficient amount of data for the evaluation of the Ethernet link, a configurable testing program was implemented. The program allows many parameters to be set while conducting a sweep over different transmission sizes. The collected data will be analysed with respect to bandwidth and latency, and the effects on bandwidth and latency will be related to the corresponding driver parameters, in order to show which parameter values should be chosen.

The effects will later be taken into consideration when comparing embedded-systems communication, where Ethernet will be considered as a commodity option for inter-entity communication.

3.3 Other High-Speed Protocols

For the rest of the protocols, recent research results will be collected and reviewed. These results will hopefully shed some light on which kind of data link is suitable for which application. The links will be evaluated with respect to latency, throughput and reliability. If time permits, the most promising protocols will also be modelled mathematically and evaluated through calculations. The results from this study will form the basis of the recommendations for embedded radar system bus technology.

Chapter 4

Ethernet On Windows Computers

As mentioned earlier, the use of Ethernet is widespread, and it is rather cheap compared to many other technologies. The knowledge and support are also extensive, since the technology has been around for several decades. In many high-performance systems, Ethernet is one of the main communication media. For embedded applications there are advantages to using Ethernet, especially due to the number of IP blocks available for direct use inside a system.

Previous studies of Ethernet efficiency have observed quite unstable measurements, which may cause system performance to suffer. Here, the Ethernet communication between two PCs will be used to evaluate the performance achievable when communicating with an Ethernet-enhanced embedded system. This could be a method of interconnecting the back end of a system, e.g. as shown in section 2.2.

To test the Ethernet connections, two standard PCs will be used to measure the performance in terms of maximum throughput. The reason is to test how COTS components may be used in a modern high-performance system in order to achieve performance while at the same time cutting costs. Even though there is software available for testing Ethernet connectivity, the idea has been to develop a test program that can test the connections between computers for many types of protocols, with the same program and without changing the program interface.

The measurements will look at the performance obtainable from a computer running Windows XP, which may be regarded as the back end of a system, responsible for post-processing, data recording or visualisation. This link may be crucial in the system design, as the data must arrive at a minimum rate. The test system will be built from two PCs running Windows XP, connected through a cross-over cable. The tests will be implemented in C#, using the Common Language Runtime environment in Windows.


4.1 Hardware and Software Setup

The two computers used in the experiment were of different types; their specifications are found in Table 4.1. The computer called Computer 1 is a high-end computer, whereas Computer 2 is a typical desktop computer.

                              Computer 1                 Computer 2
Processor                     Intel Xeon E5620           Intel Pentium 4
Clock Frequency               2.40 GHz                   2.60 GHz
Number of Cores               4                          1
RAM (GB)                      12.0                       3.0
Network Card                  Intel CT Desktop Adapter   Intel PRO/1000 MT
Network Card Driver Version   10.3.42.0                  8.10.3.0
Operating System              Windows XP SP3             Windows XP SP3

Table 4.1. The specifications of the computers used for Ethernet testing.

Both computers were equipped with the .NET 2.0 environment for running the programs; this was the natural choice, since the programming environment used was Visual Studio 2005. The program created for the measurements was rather straightforward, with the sole purpose of measuring the transmission time of a certain number of packets with varying payload sizes. These experiments were repeated while altering device driver settings. The settings primarily looked at are:

• Checksum Offloading

• Interrupt Moderation

• Ethernet Frame Size

• Receive and Transmit Buffer Sizes

All of these parameters were altered and examined for both the TCP and UDP protocols. The following subsections explain the effect of each alteration and the expected outcome.

4.1.1 Offloading Checksum Calculations

The testing of the checksum offloading capabilities of the network cards was primarily conducted to see whether offloading has any effect, or whether the processor can cope with the TCP calculations on its own. Calculating the checksums could take considerable time on a weaker processor, and hence offloading could be a performance-enhancing operation, probably increasing performance at least for the weaker computer.
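To make concrete what work is moved from the CPU to the NIC, the sketch below computes the 16-bit one's-complement Internet checksum (RFC 1071) that IP, TCP and UDP use. This is illustrative code only, not part of the test program.

using System;

static class InternetChecksum
{
    // Computes the 16-bit one's-complement checksum (RFC 1071) used by
    // IP, TCP and UDP. With checksum offloading enabled, the NIC performs
    // this summation instead of the CPU.
    public static ushort Compute(byte[] data)
    {
        uint sum = 0;

        // Sum the data as big-endian 16-bit words.
        for (int i = 0; i + 1 < data.Length; i += 2)
            sum += (uint)((data[i] << 8) | data[i + 1]);

        // An odd trailing byte is treated as if padded with a zero byte.
        if (data.Length % 2 == 1)
            sum += (uint)(data[data.Length - 1] << 8);

        // Fold the carries back into the low 16 bits.
        while ((sum >> 16) != 0)
            sum = (sum & 0xFFFF) + (sum >> 16);

        return (ushort)~sum;
    }
}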


4.1.2 Increasing the Transfer Buffers

By increasing the transfer buffers, there is more space for storing data on the transmitter side. This means that the transmitter may have more outstanding transmissions at any given time on a TCP connection. For a UDP connection, an increased buffer may allow the application to do more consecutive writes into the buffer before filling it up; but since no confirmations are awaited, the increased buffer size will probably not have much effect.

4.1.3 Increasing the Receiver Buffers

When the receiver buffer is increased, the receiver may accept more data before having to throw any away. If sufficiently dimensioned, the receiver buffers will allow enough data to be stored so that none has to be discarded. If data is thrown away on a TCP link, it will cause many retransmissions; on a UDP link, it will result in many data losses.
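In the .NET socket API, the corresponding per-socket buffer sizes can be set directly, as in the minimal sketch below. The sizes shown are example values; note also that these are the operating system's socket buffers, whereas the driver's receive and transmit descriptor counts are configured in the NIC driver settings. Whether the thesis test program used exactly this mechanism is an assumption.

using System.Net.Sockets;

static class BufferSetup
{
    static void Configure(Socket socket)
    {
        // Space for outstanding (unacknowledged) data on the sender side.
        socket.SendBufferSize = 4 * 1024 * 1024;   // e.g. 4 MB transmit buffer

        // Space for received data the application has not yet read.
        socket.ReceiveBufferSize = 64 * 1024;      // e.g. 64 kB receive buffer
    }
}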

4.1.4 Increasing the Ethernet Frame Size

When transmitting large quantities of data, the data will not fit into a single Ethernet frame. Instead, it is fragmented into several consecutive frames which contain the data. Since every frame has an Ethernet header, an IP header and also a TCP or UDP header, the effective transfer ratio of a frame is smaller for small frames than for large frames. For a typical Ethernet frame, the maximum payload when transferring TCP is 1460 octets, and 1472 when transferring UDP. For such a frame, the number of overhead octets is 78 on a TCP link and 66 on a UDP link, which in the best case results in a link overhead of 78/1538 ≈ 5.1% for TCP or 66/1538 ≈ 4.3% for UDP (a more in-depth explanation is found in Appendix G).

By increasing the Ethernet frame size, the overhead remains the same in terms of octets, but since the number of payload octets increases, the throughput of the link also increases. The network cards on the computers in this experiment support Ethernet frame sizes of 1518, 4088 and 9014 octets. With the larger sizes, the frames carry a higher fraction of payload data and the link is therefore more efficient, which can be expected to increase performance in terms of bandwidth.
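The overhead argument can be made concrete with a short calculation. The sketch below reproduces the best-case TCP figures for the frame sizes the NICs support, under the frame layout assumed in this section (14 B Ethernet header, 4 B FCS, 8 B preamble, 12 B inter-frame gap, 20 B IP header, 20 B TCP header); the exact payload of the largest jumbo frame depends on whether the reported frame size includes the FCS.

using System;

static class FrameEfficiency
{
    static void Main()
    {
        const int overhead = 78;                 // all non-payload octets per TCP frame
        int[] frameSizes = { 1518, 4088, 9014 }; // sizes supported by the NICs

        foreach (int frame in frameSizes)
        {
            int payload = frame - 18 - 40;       // strip Ethernet, IP and TCP headers
            double wire = payload + overhead;    // octets actually on the wire
            Console.WriteLine("{0} B frame: payload {1} B, overhead {2:P1}",
                              frame, payload, overhead / wire);
        }
        // The 1518 B frame gives 1460 B payload and 78/1538 ≈ 5.1 % overhead,
        // matching the figure above; larger frames shrink the overhead further.
    }
}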

4.1.5 Controlling the Interrupt Rate

Interrupt moderation is a feature that evens out the number of interrupts per second in order to save CPU time and cycles. The network cards in this experiment have a variable interrupt moderation setting with the options Off, Minimal, Low, Medium, High, Extreme and Adaptive. This setting is expected to affect performance, since it controls how often the CPU has to do a context switch to handle the network traffic. The probable outcome is that the most favourable amount of moderation depends on other factors, such as frame and packet size.
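To see why moderation matters, consider the interrupt load without it. The following is a rough, illustrative estimate (not a measurement from this thesis), assuming one interrupt per received frame at line rate with full standard frames:

using System;

static class InterruptRate
{
    static void Main()
    {
        const double lineRateBps = 1e9;   // Gigabit Ethernet
        const int wireFrameBytes = 1538;  // full frame incl. preamble and inter-frame gap

        // Without moderation, one interrupt per received frame is a
        // reasonable worst-case assumption.
        double framesPerSecond = lineRateBps / 8 / wireFrameBytes;
        Console.WriteLine("~{0:F0} interrupts/s at line rate", framesPerSecond);
        // Roughly 81 000 interrupts per second, each costing a context switch.
    }
}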

4.2 Evaluating the Performance

To evaluate the performance of the Ethernet connection, a communication model with a server and several clients was created. The idea is that the server scans the network for listening clients and then sets up several clients to send data to each other. The program is very scalable, and the server has complete control of the transmission paths between all connected clients. The clients may transmit using any specified and supported protocol, and each client may be set up as a data source or a sink.

4.2.1 The Measurement Environment

The measurements take place in a custom-built environment with the sole purpose of examining the throughput of bulk transfers over Ethernet using standard protocols. These conditions were chosen since radar data is often transmitted in large batches that need to be processed. The environment measures such transfers while varying the size of the transfer, i.e. the payload of each packet. That way, an optimal setting might be found and used in future systems. For this purpose, a server-client application was created, which is explained in more detail below.

The Server Functionality

The server application is the program which sets up all communication. It may set up an arbitrary number of connections between clients in a network, either by itself, using a predefined algorithm for deciding the connections, or with the user specifying every parameter of every connection. The program flow can be seen in Figure 4.1 and is explained briefly here.

The server starts by broadcasting a message telling all clients that they should report to the server. The server then produces a list of all active clients on the network and waits for the user to select how the connections should be made. Once the connections are decided, the server issues a TCP packet to every client with an outgoing or incoming connection, telling it how to set up the connection. These packets consist of a predefined number of setup messages, which are specified in Listing 4.1. Each setup message defines either the receiving or the transmitting side of a connection, and the client is then required to create that connection. Once the setup messages have been sent, the server waits for the clients to finish; it is then able to send a new set of setup messages to set up a new connection and use it for another measurement.
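As an illustration of the discovery step, the sketch below shows one way the broadcast could be done with .NET's UdpClient; the port number and message contents are hypothetical and not taken from the thesis code.

using System.Net;
using System.Net.Sockets;
using System.Text;

static class Discovery
{
    const int DiscoveryPort = 9050; // hypothetical port

    // Server side: broadcast a "report in" message to all clients on the LAN.
    static void FindClients()
    {
        using (UdpClient udp = new UdpClient())
        {
            udp.EnableBroadcast = true;
            byte[] msg = Encoding.ASCII.GetBytes("REPORT");
            udp.Send(msg, msg.Length, new IPEndPoint(IPAddress.Broadcast, DiscoveryPort));
        }
    }

    // Client side: wait for the broadcast, then notify the server of presence.
    static void AwaitServer()
    {
        using (UdpClient udp = new UdpClient(DiscoveryPort))
        {
            IPEndPoint server = new IPEndPoint(IPAddress.Any, 0);
            udp.Receive(ref server);               // blocks until the broadcast arrives
            byte[] reply = Encoding.ASCII.GetBytes("PRESENT");
            udp.Send(reply, reply.Length, server); // tell the server we are present
        }
    }
}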


[Flowchart, Figure 4.1: parallel client and server program flows. Client: start → wait for broadcast → notify server that client is present → wait for connection details → create connections from setup messages → start measurement → create performance report. Server: start → find clients with broadcast message → wait for replies → enter connection setup details → send setup messages → send a start message.]

Figure 4.1. A flowchart showing the communication flow when setting up and starting a measurement of Ethernet performance.

The Client Functionality

Even though the server is responsible for the setup, it is the clients that conduct the actual measurements. The client program flow goes through the steps specified in Figure 4.1 and is explained briefly here. The client starts by listening for broadcasts from the server, and on reception it notifies the server that it is present. The client then waits for the setup data from the server; when it receives the setup data, it creates senders and receivers, which basically are threads with the purpose of sending or receiving data.


public struct SetupMessage
{
    public const int MESSAGE_LENGTH = 44;

    public UInt32 ProtocolType;       // Type: 00 TxUDP, 01 RxUDP, 10 TxTCP, 11 RxTCP
    public UInt32 minPackSize;        // Minimum packet payload size of the run
    public UInt32 maxPackSize;        // Maximum packet payload size of the run
    public UInt32 packIncrement;      // Difference in payload size between two packets
    public UInt32 sourceIP;           // Source (local) IP address
    public UInt32 SourcePort;         // Source (local) port number
    public UInt32 EndpointIP;         // Endpoint (remote) IP address
    public UInt32 EndpointPort;       // Endpoint (remote) port number
    public IPAddress SourceIPAddress; // The local IP address again
    public Int32 Iterations;          // Number of packets to send of each payload size
}

Listing 4.1. A setup message from the server to a client, telling the client how to set up its endpoints in order to communicate correctly.

The threads wait for a start signal sent by the server, and after that the client starts both its sender threads and receiver threads. The threads run and measure the data transfers for the different payload sizes; when finished, the client generates a report containing the transferred bytes, the time taken, the lost bytes and the CPU cycles spent in the program. These reports are later used to generate graphs and plots from which conclusions are drawn.

The Measurement Idea

The idea of the measurement is to conduct one measurement for every specified payload size, report the measured throughput for that payload, and then initiate a new transfer with the next specified payload size (see the pseudocode in Listing 4.2). For this to be as efficient as possible, the algorithm synchronises through dedicated TCP ports. This way a transmitter can tell when it has finished transmitting, and it can wait until the receiver has finished receiving before starting the next transmission. Thus both receiver and transmitter know that every packet has been sent in that transmission and that they are ready to conduct another one. By making sure that each transfer takes place in an isolated environment, where no other packet sizes are being sent, it is ensured that each packet size is measured individually, unaffected by adjacent measurements. This is important in order to determine what happens at the exact packet sizes.
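A minimal sketch of the synchronisation handshake described above (the one-byte messages and their meaning are hypothetical):

using System.IO;
using System.Net.Sockets;

static class SyncChannel
{
    // A dedicated TCP connection is used only for synchronisation. After
    // each payload size, the transmitter signals that it is done and blocks
    // until the receiver confirms that it is ready for the next run.
    static void SyncAfterRun(NetworkStream syncStream)
    {
        syncStream.WriteByte(1);          // transmitter: finished this run
        int ack = syncStream.ReadByte();  // block until the receiver answers
        if (ack == -1)
            throw new IOException("sync connection closed");
    }
}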

4.3 TCP Specifics

When measuring TCP performance, the test suite uses sockets that connect to each other and then utilises the ability to read and write arbitrarily sized byte arrays to and from a socket. The tests were made repeatedly for different sizes of the Ethernet payload, as well as with the other parameters varied.


...
// INITIALIZATION
ListenForSetupMessage();        // Wait for a server connection
ConnectToServer();              // Connect to the server
WaitForConnectionDetails();     // Receive connection details from the server
SetupConnections();             // Set up the necessary connections

// Do all measurements
for (int i = 0; i < Number_Of_Different_Payloads; i++)
{
    MeasureThroughput();        // Measure the throughput of the link at the specified payload
    PrintReport();              // Save a report for the current payload
    SynchroniseBeforeNextRun(); // Synchronise receivers and transmitters
}
...

Listing 4.2. A pseudocode example presenting the workings of the measurement client. The methods are named after what they should do and do not necessarily correspond to any actual method.

To ensure that packets were not paired up in transmission, Nagle's algorithm was turned off; this algorithm otherwise packs small packets together to improve efficiency [59]. When testing the TCP performance, the measurements were conducted on a series of different configurations. However, at least initially, every enhancement that the network cards could provide was turned off, in order to be certain that no clever mechanisms could interfere with the raw performance. The standard buffer size was set to 64 kilobytes for both the receive and the transmit buffer. Unless otherwise stated in a figure or the text, the interrupt moderation setting was off and the checksum calculations were enabled; any deviations from this standard case are noted in the text and figures.
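Disabling Nagle's algorithm is a single socket option in .NET; a minimal sketch:

using System.Net.Sockets;

static class TcpSenderSetup
{
    static Socket CreateSocket()
    {
        Socket socket = new Socket(AddressFamily.InterNetwork,
                                   SocketType.Stream, ProtocolType.Tcp);

        // Disable Nagle's algorithm so that small writes are sent immediately
        // instead of being coalesced into larger segments.
        socket.NoDelay = true;
        return socket;
    }
}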

4.3.1 TCP and IP Checksum Offloading

When testing the TCP and IP checksum offloading mechanisms, all other enhancements were turned off and the frame size was set to the standard Ethernet size of 1518 bytes. The test suite ran for packet sizes up to a completely full Ethernet frame, with a payload of 1460 bytes. The effects of the offloading are presented in Figure 4.2. The effects are quite hard to determine from this single measurement: there seems to be a better worst-case throughput when not using offloading, but the offloaded transfer produces a more even graph and does not flatten out just above 500 Mbps, as the transfer without offloading does. Since this measurement raised more questions, more comparative measurements were made.

The next step was to introduce some interrupt moderation and study whether the interrupt moderation level in some way controlled the performance under offloading. Two settings, low and high, were selected as examples. The results from the run are shown in Figure 4.3. Looking at those results, it is evident that the effects of checksum offloading are small. This is especially visible for low interrupt moderation, where the throughput is almost unaffected by the offloading. In the case of high interrupt moderation the difference is bigger, but still not very big. It is, however, evident that interrupt moderation plays a big role in the capacity for high-speed transfers, and its effects will be examined next.

[Plot: TCP throughput (Mbps) versus payload (B), with checksum offloading ON and OFF.]

Figure 4.2. Difference between checksum offloaded and not offloaded TCP transfers. The transfers are not very different, so checksum offloading does not appear to have any major effect on throughput.

4.3.2 Effects from Interrupt Moderation

When the effect of offloading was studied, the effects of interrupt moderation were hinted to be large; they are therefore interesting to study thoroughly. This led the study to test more cases and settings on the network interface card regarding interrupt moderation.

The effects of interrupt moderation are studied with a constant buffer size of 64 kilobytes and with the offloading capabilities of the network card in use. The reason for using the offloading capabilities is that offloading relieves the CPU, which results in more processor time available for the TCP calculations. Furthermore, the previous measurements showed that the offloaded transmissions were at least as stable as, and with greater or equal throughput than, the transmissions without offloading. The interrupt moderation setting was altered while the rest of the parameters were held constant. The investigated payload size varied between 10 and 1460 bytes, i.e. up to the maximum standard payload size of a single Ethernet frame. The experiments were conducted with a single flow from Computer 1 to Computer 2.


[Plot: TCP throughput (Mbps) versus payload (B) for combinations of offloading and interrupt moderation; curves: High/On, High/Off, Low/Off, Low/On.]

Figure 4.3. TCP offloading comparison with both low and high interrupt moderation. In the legend, Low or High indicates the interrupt moderation, and On or Off indicates whether offloading is turned on. For low interrupt moderation there is no benefit from offloading, whilst for high moderation, offloading increases the performance of the transmissions.

The results from the interrupt moderation changes may be seen in Figure 4.4, and it is beyond doubt that this setting affects performance. Tuning this setting results in performance varying from very close to the theoretical maximum (adaptive, low and medium) to really poor performance for large packets (minimal and off).

The high and extreme interrupt moderation settings also show less than optimal performance, but this is expected. The expectation is based on the fact that the number of interrupts is limited, as is the buffering capacity of the NIC; thus the buffers might fill up before the interrupt is issued, which caps the maximum throughput. The lack of performance for the settings off and minimal is somewhat harder to explain. The probable cause is that the CPU gets too many interrupts and thus has to do too much context switching between the main application and the interrupt service routine. The reason why both the low and medium settings work well may be that they issue a good number of interrupts compared to the number of bytes received. The fact that adaptive outperforms them both for large packets might be due to a setting that enables the NIC to send interrupts more optimally, with the most beneficial amount of data for the processor to process between interrupts.


[Plot: TCP throughput (Mbps) versus payload (B) with offloading ON, for interrupt moderation settings Off, Minimal, Low, Medium, High, Extreme and Adaptive, compared with the theoretical maximum.]

Figure 4.4. The effects of interrupt moderation on Ethernet performance. The performance benefit of tuning the interrupt moderation setting correctly is evident from these plots.

4.3.3 Changing the Ethernet Frame Size

The next investigation concerned changing the Ethernet frame size, also known as using jumbo frames (see Appendix F). In addition to the standard 1518 byte frame, the network cards supported a 4088 byte and a 9014 byte frame. The benefit of bigger frames comes from enabling a larger payload in each frame and thus having less overhead. The different frame sizes were also tested with several different interrupt moderation settings, to evaluate the effect of both settings together.

The first setting tested was the 4088 byte frame size, which gives a total payload of up to 4030 bytes in a single frame. The run was made with several interrupt moderation settings, and the results may be seen in Figure 4.5. In the graph there are some interesting things to note. Most interesting is that when the payload is in the 4030 B area, the measured throughput is very close to its theoretical value when the interrupt moderation setting is low. This is very interesting, since it indicates that these high throughputs may actually be achieved. Furthermore, the throughput achieved with minimal interrupt moderation is rather unpredictable for small payload sizes. However, this stabilises when approaching 2000 B payload, after which it constantly achieves very high throughput, almost as high as with low interrupt moderation. A few final remarks about the results in Figure 4.5 follow.


[Plot: TCP throughput (Mbps) versus payload (B) for 4088 B jumbo frames, for interrupt moderation settings OFF, Minimal, Low, Medium, High, Extreme and Adaptive, compared with the theoretical maximum.]

Figure 4.5. The effects of using a 4088 B frame when transferring data. Note how close the low interrupt moderation setting comes to the theoretical value when sending full 4088 B frames (equal to a payload of 4030 B).

Firstly, the higher interrupt moderation settings have a peak throughput at around 2 kB payload and then stabilise at a certain throughput regardless of payload size. Secondly, the adaptive interrupt moderation seems to work less efficiently with jumbo frames and large payloads, as both low and minimal interrupt moderation are better for large packets.

The other jumbo frame size is 9018 B, which enables payloads of up to 8960 B; almost a full ten kilobytes may thus be sent in a single frame. The tests run for this size were identical to those for the smaller jumbo frame, and the measured results are shown in Figure 4.6. Compared to the smaller jumbo frame, the transmissions with large jumbo frames are less predictable and less stable, and the measured throughputs are lower for almost all interrupt moderation settings. However, this is the only case where the transfers without interrupt moderation turn out well: close to the 8960 B point, the unmoderated setting actually provides the highest throughput.

4.3.4 Variable Buffer Size

Another part was to look at how the buffer size limits the throughput. For this measurement, the interrupt moderation was set to low and checksum offloading was turned on. The test was conducted by increasing the payload size in steps of 50 bytes and recording the throughput at each payload. The buffer sizes were increased by a factor of four for each iteration, moving through the set {2^14, 2^16, 2^18, 2^20, 2^22, 2^24} bytes.


[Plot: TCP throughput (Mbps) versus payload (B) for 9014 B jumbo frames, for interrupt moderation settings OFF, Minimal, Low, Medium, High, Extreme and Adaptive, compared with the theoretical maximum.]

Figure 4.6. This graph evaluates the effects of using a 9018 B jumbo frame when transferring data.

The iterations first moved through the sender buffer sizes, and then through the receiver buffer sizes, while keeping the other at a constant value.

The first measurements were made while cycling through different transmission buffer sizes and keeping the receive buffer at a constant 2^16 bytes. The results from that measurement are shown in Figure 4.7. What may be seen from the graph is that a larger transmission buffer enables transmission with a higher throughput at smaller packet sizes, but that for large packets the benefits of the larger buffer seem to fade away. This is true for buffer sizes above 64 kB (2^16 bytes), which is the smallest buffer size that reaches the maximum throughput (the highest achieved, around 900 Mbps). The buffer sizes smaller than 64 kB reach an upper limit in throughput and cannot send more data. The upper limit could very well come from the need to receive an ACK for every sent message: each message has to remain in the buffer until its ACK arrives, so the buffer fills up and cannot store any more messages until some messages are confirmed.

Moving on to the variable receiver buffer size: for this experiment, the receive buffer is cycled through the set of sizes specified earlier while the transmit buffer is kept constant at 2^22 bytes. That transmit buffer size was chosen as one of the top performers from the previous run. The results from this measurement are presented in Figure 4.8. The graph makes it clear that the size of the receive buffer is important. The small buffers outperform the larger ones, as opposed to the transmit buffers in Figure 4.7. The buffer with the highest throughput for larger packets is the 64 kB buffer, but the 4 kB buffer has a steeper rising edge and is better for small packets.


[Plot: TCP throughput (Mbps) versus payload size (B) for sender buffer sizes 4k, 16k, 64k, 256k, 1M, 4M and 16M.]

Figure 4.7. The transmit buffer size is seen to affect the throughput. The tendency seems to be that a larger buffer size gives greater throughput in the transmission.

One peculiar thing is the peaks around 250-300 B payload. Above that limit, all the bigger buffers suddenly decrease in throughput and stay quite low until reaching 1460 B payload, which is the full MTU (maximum transfer unit) of a TCP packet on a standard Ethernet connection. It is therefore interesting to see that the throughput increases again when packets of the full MTU size are transmitted.

4.4 TCP Evaluation and Summary

An evaluation of the results obtained for TCP performance is presented here. The evaluation of the Ethernet performance has been very interesting in mapping which factors affect the performance the most. Looking at the offloading (subsection 4.3.1), the conclusion has to be that offloading is of little or no interest to the performance, since the performance difference is a lot bigger when changing the interrupt setting.

The interrupt setting, on the other hand, is very interesting. By changing the interrupt moderation from off to low, great benefits are gained in terms of performance, and especially stability. The curves obtained when measuring with different interrupt moderation settings are so diverse that the interrupt setting must be considered crucial for the performance of TCP. An example can be picked from Figure 4.4, where some settings achieve almost theoretical throughput for payloads over one kilobyte, while the others achieve throughputs in the range 100-500 Mbps. In most of those cases, selecting either adaptive, low or medium interrupt moderation yields an improvement of over 100%.


[Plot: TCP throughput (Mbps) versus payload size (B) for receiver buffer sizes 4k, 16k, 64k, 256k, 1M, 4M and 16M.]

Figure 4.8. The receive buffer size is varied and the throughput measured for different payloads. The small buffers clearly outperform the larger ones for medium-sized payloads. However, when reaching 1460 B payload, the throughput for all of them rises again.

For the buffer sizes, the conclusion is that a small buffer in the receiver is for some reason desirable; with a small receive buffer, performance seems to be a lot better than with a large one. For the transmit buffer, the conclusion must be: the larger the buffer, the better. This is logical, since a larger buffer allows more transactions to take place simultaneously, as all data has to be stored in the buffer until its acknowledgement comes back from the receiver. Having a large buffer minimises the risk that the buffer fills up before transactions are completed.

4.5 UDP Specifics

When evaluating the performance of UDP, the same test suite as for TCP was used, except for the evaluation of offloading. That test was excluded due to the lack of UDP offloading support on one of the NICs.

UDP is a protocol in the TCP/IP protocol suite; for more general information, see Appendix G. UDP is an unreliable protocol and requires a lot less processing. That it is unreliable means that there is no guarantee that packets reach their destination: lost packets are never retransmitted, as they would have been if TCP had been used. Therefore the number of lost packets is added as one of the measured parameters. Another remark is that the throughputs presented from the UDP measurements are those received at the sink, i.e. the number of bytes that actually arrive per second. In addition to lost packets, the same parameters are examined for UDP as for TCP.
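As an illustration of how the sink-side throughput could be measured, the sketch below counts the bytes that actually arrive during a fixed interval; the port and duration are example values, and the real test program may differ.

using System.Diagnostics;
using System.Net;
using System.Net.Sockets;

static class UdpSink
{
    // Returns the throughput seen at the sink in Mbps: only bytes that
    // actually arrive are counted, so losses simply never show up here.
    static double MeasureThroughputMbps(int port, int seconds)
    {
        using (UdpClient udp = new UdpClient(port))
        {
            IPEndPoint remote = new IPEndPoint(IPAddress.Any, 0);
            long received = 0;
            Stopwatch clock = Stopwatch.StartNew();

            while (clock.Elapsed.TotalSeconds < seconds)
                received += udp.Receive(ref remote).Length;  // blocking receive

            return received * 8.0 / clock.Elapsed.TotalSeconds / 1e6;
        }
    }
}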

4.5.1 Interrupt Moderation Effects

The initial measurements were done with the different interrupt moderation settings of the network card. The setting was varied over several of the available options on both sender and receiver. The initial sweep spanned payloads from 10 to 1500 bytes.

[Plot: UDP throughput (Mbps) versus payload (B) for interrupt moderation settings Adaptive, Off, Minimal, Low, Medium and Extreme.]

Figure 4.9. The throughput measured when varying the data payload and the interrupt moderation setting. It shows that the correct interrupt moderation setting may have a huge effect on the performance achieved from the connection.

The results from the measurements with different interrupt moderation settings are visualised in Figure 4.9. When examining the graph, some interesting facts appear. There seems to be a linear increase in throughput as the payload increases towards 1020 bytes, for the settings off, minimal, low and medium. Zooming in on this limit shows that the drop occurs at exactly 1024 bytes of payload, equal to 2^10. The drop is not related to packet losses, as will be seen in the next subsection, but has some other explanation; this measurement only shows that it happens, not why.

The adaptive setting does not outperform the others; instead it is very unstable and unpredictable.


[Plot: UDP lost data (Mbps) versus payload (B) for interrupt moderation settings Adaptive, Off, Minimal, Low, Medium and Extreme.]

Figure 4.10. This graph shows how many bits per second are lost. The losses rise as the packet payload increases, for all settings but extreme moderation. Above 1024 bytes of payload the losses suddenly go to zero, which is the same point at which the throughput suddenly drops.

Nor is the extreme interrupt moderation linear and nice in its behaviour; instead it shows a sudden decrease in performance around 650 bytes of payload, after which it hardly recovers before reaching the 1024 byte marker, where it drops like all the others.

As stated earlier, the throughput of the UDP traffic drops dramatically when passing 1024 bytes of payload. After that point, all the interrupt moderation settings show a linear increase, but with different slopes and different starting points. Here the highest throughputs are achieved with no interrupt moderation, followed by adaptive, then minimal and low, then medium and extreme. In this region, a low interrupt moderation setting, or none at all, seems to be beneficial.

How Packet Losses Vary With Interrupt Moderation

One interesting thing is that the interrupt moderation settings affect the achieved throughput at the receiver differently; the percentage of lost packets is also interesting. In Figure 4.10 the losses are plotted as the number of bits per second that are lost. Comparing this to Figure 4.9, the two graphs appear to be roughly each other's opposites. Looking at the packet losses, it is evident that a higher interrupt moderation is beneficial for low packet losses, with medium being the best. Another interesting note is that above 1024 bytes of payload the packet losses are non-existent, i.e. the UDP transmissions lose no packets. This might have the same cause as the dramatic drop in UDP throughput at the same limit.

Another thing worth mentioning is that the main part of the dropped packets are dropped in the OS stack or in the network driver. This was measured using the network card's performance counters, which show that no packets are discarded by the card itself; hence, the losses must occur in the OS.

4.5.2 Buffer Size Exploration

To explore the effect of a variable buffer size on the UDP communication, the low interrupt moderation setting is used. In this experiment, the buffer size varies from 2^12 to 2^26 bytes in increases by a factor of four, i.e. the exponent grows by two for each measurement.

[Plot: UDP throughput (Mbps) versus payload size (B) for buffer sizes 4k, 16k, 64k, 256k, 1M, 4M, 16M and 64M.]

Figure 4.11. Here the buffer size varies along with the payload size. The throughput looks best for the larger buffer sizes.

In Figure 4.11 the results from the measurements are shown. Increasing the buffer size seems to have a positive impact on performance, since the largest measured buffer size produces the highest-throughput communication. The peculiar thing in this measurement is, however, the sudden drops occurring at 750 bytes of payload; the drop at 1024 bytes is still there, regardless of the buffer size.

Looking at the area around 750 bytes of payload, the measured throughput suddenly drops for most buffer sizes. Figure 4.12, which shows the percentage of lost packets at each payload size, reveals a sudden, dramatic increase in losses at this point. Quite fascinatingly, the smallest buffer size does not experience this, and the largest buffer size has no huge spike either. For some of the others, though, the increase is several hundred percent, from 5% to 50% with the 16 MB buffer. However, even when looking at the packet losses, the largest 64 MB buffer seems like the best choice.

[Plot: UDP lost bytes (%) versus packet size (B) for buffer sizes 4k, 16k, 64k, 256k, 1M, 4M, 16M and 64M.]

Figure 4.12. The packet loss as a percentage of the transmitted bytes. The graph shows an interesting sudden increase in packet loss at approximately 750 bytes of payload for most of the buffer sizes.

4.5.3 Does Frame Size Affect UDP Performance?

The final exploration is whether the Ethernet frame size has any effect on the performance of a UDP transmission. To test this, a constant 64 MB buffer size and medium interrupt moderation were used. The frame size was varied between the three settings supported by the Ethernet NIC, namely 1518, 4088 and 9014 bytes. The measurements were once again conducted by varying the payload for each frame size setting.

The results are presented in Figure 4.13, and they show that all of the settings are very similar in terms of throughput, especially for small payloads. All of the settings have a performance drop when exceeding 1024 bytes of payload. As the payload increases, the different frame sizes diverge slightly, and performance gets less stable for very large payloads. The main observation from the graph, however, is that there is not much difference, regardless of frame size.

4.6 Analysis of UDP Performance

When looking at the measurements of UDP performance made in the previous section, one thing is the most striking: the drop in throughput between 1024 and 1025 bytes of payload in the UDP packet.


[Plot: UDP throughput (Mbps) versus payload (B) for 9014 B, 4088 B and standard frames.]

Figure 4.13. The throughput of UDP traffic when varying frame size and payload. The throughput does not vary much with the different frame sizes, especially for small payloads.

If we look at a single case, for example the medium interrupt moderation in Figure 4.9, the drop in throughput goes from over 650 to below 50 Mbps, more than a 90% drop. Other measurements show that this is not due to excessive packet loss, since the packet loss drops to zero percent for payloads above 1024 bytes. This means that the limitation does not lie in the line itself, but in some implementation-specific parameter in either the UDP stack or the NIC device driver.

When trying to find what causes this limit, one might suspect some limitation in the buffers; but since the buffer size has been varied and the drop in performance happens regardless of buffer size (Figure 4.11), the buffer size cannot be the limiting factor. It has to be something else. Since the performance behaves linearly both before and after the drop, there seems to be some limiting factor that determines how many bytes per second may be sent. For simplicity, the first investigation looks at the performance for payloads above 1025 bytes.

The computer's performance monitor gives an indication of what might be the limiting factor in this transfer: during a transfer, the number of interrupts sent by the NIC every second is almost constant. A test run at payload size 1025 revealed that the NIC issues approximately 2400 interrupts per second and sends approximately 2400 packets per second. We know that each packet consists of 1025 bytes of payload plus 66 B of overhead.


[Plot: UDP sent and received data (Mbps) versus payload (B) for sender and receiver with interrupt moderation Off, Medium and Extreme; solid lines show sent data, dotted lines received data.]

Figure 4.14. The graph shows the difference between sent and received data. The sent data is indicated by the solid lines, while the dotted lines are the received data. The gap between a dotted and a solid line indicates how many packets are lost in transmission.

The measured number of bytes sent each second is around 2.6 million, or around 20 Mbps. Dividing 2.6 million by the size of a packet (1091 B) gives approximately 2380 packets per second, roughly 2400, which is equal to the number of interrupts. It might therefore be that the NIC driver, or the OS UDP stack, only transmits one packet per interrupt.

Extending this reasoning to payloads smaller than 1025 bytes, the hypothesis would be quite simple to verify, since the interrupt rate in that range is also very constant. The combined graphs in Figure 4.9 and Figure 4.10 show similar growth for most interrupt moderation settings if the losses are added to the received data. The fact that the extreme interrupt moderation has a lower ceiling in those cases might have something to do with its limitation of the number of interrupts. To visualise the combined graphs and show that the number of bytes sent per second is identical, look at Figure 4.14. There one may see that the off and medium interrupt settings result in the same number of sent bytes per second. However, with the setting off, many more packets are dropped before reaching the destination than with medium. This could very well indicate that the setting affects the weaker receiving computer to a greater extent than the sending computer. With rather high certainty, however, one may say that the UDP stack probably allows a maximum of one packet to be sent per interrupt.
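The packet-rate check above can be written out explicitly, using the reported figures:

\[
\frac{2.6 \times 10^{6}\ \text{B/s}}{(1025 + 66)\ \text{B/packet}} \approx 2383\ \text{packets/s} \approx 2400\ \text{interrupts/s},
\]

which is what suggests a limit of one packet per interrupt.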


4.7 Summary of Ethernet Performance

Summarising what has come forth in this investigation, there have been some expected results as well as some unexpected ones. One rather fascinating thing is the high level of throughput actually achieved in the TCP-over-Ethernet testing. By comparison, [25] needed really big packet sizes in order to achieve any real performance, and their best throughputs were achieved in Linux, not Windows. However, their computers ran older versions of Windows, namely Windows 2000, so some performance may have been gained since then. Nevertheless, the results here show that it is nowadays possible to achieve a high-throughput link using Windows computers, if the parameters are tuned correctly. The best achieved throughput was a mere 5 Mbps below the theoretical value, when sending TCP with low interrupt moderation and 4088 byte jumbo frames with payloads around 4030 bytes.

As for the UDP traffic, a very interesting phenomenon was stumbled upon: the decrease in performance when sending packets with a payload greater than 1024 bytes. This limit seems to stem from the underlying system only sending one packet per interrupt above that size. Why that happens is, however, beyond the scope of this project and will have to be examined in the future.


Figure 4.15. Curve fitting for the UDP throughput. The approximation is made in Matlab from the measured throughput of the UDP transmission with interrupt moderation setting medium.

The UDP traffic also seems to be limited on the receiver side in the low span of packet payloads, i.e. up to 1024 bytes, by the receiver's ability not to lose packets. This may be controlled by choosing the interrupt moderation optimally. Finally, the UDP throughput follows a rather linear pattern; it looks like two linear equations in different intervals. By using Matlab to do curve


fitting, linear approximations were made for the two intervals. The resulting linear fit is presented in Figure 4.15 and it is clear that the linear approximation is rather good. The approximation made is the one stated in Equation 4.1, where Θ is the achieved throughput in Mbps and p is the payload size of the UDP packet in bytes. As seen in Figure 4.15, this approximation works rather well for the UDP throughput. It is, however, a special case for the interrupt moderation setting medium; for other interrupt moderation settings, only the coefficient of p and the constant term differ.

\[
\Theta(p) =
\begin{cases}
0.67 \cdot p + 19 & p \le 1024 \\
0.023 \cdot p + 0.074 & p > 1024
\end{cases}
\tag{4.1}
\]
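To make the size of the predicted drop concrete, Equation 4.1 can be evaluated just below and just above the 1024-byte boundary:

\[
\Theta(1024) = 0.67 \cdot 1024 + 19 \approx 705\ \text{Mbps}, \qquad
\Theta(1025) = 0.023 \cdot 1025 + 0.074 \approx 23.6\ \text{Mbps},
\]

i.e. the fitted model loses over 96% of its throughput for a single additional byte of payload, in line with the measured drop from over 650 to below 50 Mbps.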

4.8 Which Settings to Choose

The final output of the Ethernet evaluation is a set of guidelines and recommendations on how to set up the communications in an Ethernet-based system which uses commodity computers as components.

The first thing worth mentioning is that it was quite easy to set up Ethernet communication using very cheap components and still achieve high bandwidth. The other thing this study suggests is that UDP communication between such computers is a bad deal: the maximum measured throughput was only around 700 Mbps, while TCP communication could achieve well over 900 Mbps. Furthermore, the TCP protocol also guarantees delivery and ordering of the received data, making the coding of such a system a lot easier, since the layers above are relieved of that duty.

Secondly, the adaptive interrupt moderation setting might be handy to use and often achieves good throughput. However, it seems to be optimised for smaller packets, as seen when testing TCP performance with jumbo frames; especially with 9018-byte frames the adaptive setting loses many of its benefits. So when using large packets and jumbo frames, manual tuning of the interrupt moderation might be necessary to achieve good throughput.

A third and final remark concerns the buffer sizes. For TCP, the most beneficial setup seems to be a large transmit and a small receive buffer. The reason is probably that the large send buffer increases the number of transactions currently taking place. For UDP, the trick seems to be to increase the buffer sizes as much as possible, which yields low packet losses and thereby high throughputs.

Chapter 5

Creating a Link Port IP Block

As a part of the investigation of different transfer protocols in embedded signal processing systems, an IP block for Link Port communication has been implemented. The Link Port is a communication protocol supplied by Analog Devices [2], primarily used for chip-to-chip communication between TigerSHARC DSP processors. In order to interface it to an FPGA, a special IP block is needed in the FPGA that supports the Link Port protocol. The idea is to implement a reusable IP block which may transfer data at full speed according to the specifications. This means that it should be able to transmit and receive data at DDR with a frequency of 500 MHz. The Link Port is not naturally a bi-directional transfer medium, since the transmitter and receiver of a single Link Port entity in a TigerSHARC may send to one destination and receive from another. There is therefore no need to combine a receiver with a transmitter and, to enable designers to save area by instantiating only the number of transmitters or receivers they actually use, one transmitter block and one receiver block will be created.

5.1 Link Port Implementation Idea

The investigation of the Link Port protocol involves creating IP blocks which should be able to transmit data according to the Link Port protocol specifications [2]. The work started with drawing a sketch outline to work from, and then trying to implement all the pieces in the outline. To do this, a transmitter and a receiver block idea were first drawn up, as shown in Figure 5.1 and Figure 5.2. In the figures, the basic blocks are present and the data path width is specified, as well as the outputs specified by the protocol specification [2]. Apart from those, the interconnect widths are not specified, since the specifics about them were unknown at the time and would be decided when implementing the parts of the design. As may be seen in Figure 5.1 and Figure 5.2, the IP blocks contain some control registers to enable the host to control the IP block.



Figure 5.1. The original block diagram for the Link Port receiver at the beginning of the design phase. As in Figure 5.2, not many interconnect bit widths are specified. Not much is decided about the inner workings of the parts; only their existence is noted.

However, all of the actual communication will be controlled by the controller logic of the IP block and hence not require any user intervention. By ensuring that these controllers work according to the specifications [13], we also ensure that the communication always happens according to specification. The blocks also contain a FIFO to provide some buffering, and control and status registers that may be read to follow what is happening in the block. Finally, input and output logic is present to produce the signals required by the protocol.

5.1.1 Key Coding Considerations

When implementing the blocks for the Link Port, a lot of consideration has been put into coding the design to be as synthesisable as possible from the beginning, in order to minimise the time spent getting the design through synthesis. To ensure this, most of the primitives have been of type std_logic_vector, and for arithmetic the unsigned type has been used. Both these types are synthesisable by the Xilinx synthesiser.
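As a minimal illustration of this coding style (the entity and signal names are invented for the example, not taken from the actual design), arithmetic is done on unsigned signals while the ports stay std_logic_vector:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;  -- synthesisable unsigned arithmetic

entity style_example is
  port (
    clk     : in  std_logic;
    data_in : in  std_logic_vector(15 downto 0);
    count   : out std_logic_vector(3 downto 0)
  );
end entity;

architecture rtl of style_example is
  signal cnt : unsigned(3 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      cnt <= cnt + 1;              -- arithmetic on the unsigned type
    end if;
  end process;
  count <= std_logic_vector(cnt);  -- plain vectors on the ports
end architecture;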

5.2 Link Port Transmitter

The Link Port transmitter module is not just a connection to the physical lines; it also contains other parts that make usage easier for the user. It has buffering and a state machine that handles the entire communication. The data bus for the Link Port transceivers of the TigerSHARC processor is 128 bits wide, and the smallest transfer unit is a 128-bit quad-word. This motivated the design choice of a native 128-bit-wide bus, and of keeping the internal



Figure 5.2. This was the original block diagram for the Link Port transmitter at the beginning of the project. Only the known widths are specified, i.e. those given by the protocol specification for the ports that interconnect entities. No internal widths are given except the 128-bit data path native to the Link Port protocol.

data width at 128 bits. This internal data path runs at only half the rate of the output logic and is therefore easier to clock than if it had run at the full transfer frequency. This design choice was made for ease of interfacing the rest of the design to the Link Port outputs, and will be described in further detail later. The main worker in the Link Port transmitter module is the Finite State Machine (FSM), which makes sure that the correct data is always sent at the correct time. The FSM takes care of reading data out of the internal FIFO buffer, moving it through a 128-to-16 serialiser, handling the optional transmission of a checksum byte, and enabling and disabling the transmitter clock at the correct times. In Figure 5.3 the final block diagram of the transmitter IP block is shown. It shows the FSM, which controls the data path and how transmissions are made on the line. The block also contains a readable and writable register file, which reports status, controls the use of the checksum and starts up the IP block. The FIFO is 128 bits wide and feeds data to the serialiser. The serialiser is implemented as a shift register which shifts out 16 new bits on each rising edge. The data then arrives at the output logic, which further serialises the 16 bits into four 4-bit groups on the output lines. The FSM controls the discontinuous clock with an enable signal when it is time to transmit. Each of the parts is explained below in greater detail, starting with the clocking of the block.
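A minimal sketch of the 128-to-16 serialiser stage could look as follows; the entity name and the load handshake are assumptions, since the report only states that it is a shift register emitting 16 new bits per rising edge:

library ieee;
use ieee.std_logic_1164.all;

entity serialiser_128_to_16 is
  port (
    clk_250   : in  std_logic;
    load      : in  std_logic;                       -- from the FSM
    quad_word : in  std_logic_vector(127 downto 0);  -- from the FIFO
    half_word : out std_logic_vector(15 downto 0)    -- to the output logic
  );
end entity;

architecture rtl of serialiser_128_to_16 is
  signal shreg : std_logic_vector(127 downto 0) := (others => '0');
begin
  process (clk_250)
  begin
    if rising_edge(clk_250) then
      if load = '1' then
        shreg <= quad_word;                       -- load a new quad-word
      else
        shreg <= shreg(111 downto 0) & x"0000";   -- shift out 16 bits
      end if;
    end if;
  end process;
  half_word <= shreg(127 downto 112);  -- current half-word at the top
end architecture;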



Figure 5.3. The block diagram for the transmitter IP block. One may see the different blocks, interconnected with the FSM as the controller logic that controls the data flow. The input controller handles Block Complete as well as newly written quad-words. The three clocks (Clk 250, Clk 500 and Clk 500 shift) are all generated by an MMCM to get them properly aligned.

5.2.1 Transmitter Clocking


Figure 5.4. Timing diagram showing the relationship between the different clocks involved in the transfer and the output data. Data is clocked with Clk500 on the output and the clock with which the data is received is ClkShifted, which has a 90° phase shift from the data clock (Clk500).

The first part to be examined is the clocking of the IP block. The block has three input clocks, all generated from an MMCM for correct alignment. The relationship between the clocks is shown in Figure 5.4. One of the most important things to notice is that Clk250 and Clk500 are aligned on the rising edge; this is needed to ensure that data is clocked correctly through the outputs. Also, ClkShifted is shifted 90° after Clk500 in order to adhere to the Link Port specification. As for the internals of the IP block, almost everything runs in the 250 MHz domain, as seen in Figure 5.3. This is done for simplicity, since only the outputs need to be clocked at the higher speed.
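The report does not state the MMCM configuration or the reference clock frequency, but a sketch of how the three aligned clocks could be generated on a Virtex-6, assuming a 100 MHz reference, is shown below. In a real design the outputs would pass through global clock buffers before use.

library ieee;
use ieee.std_logic_1164.all;
library unisim;
use unisim.vcomponents.all;

entity lp_clocking is
  port (
    clk_ref   : in  std_logic;   -- assumed 100 MHz reference
    clk_250   : out std_logic;
    clk_500   : out std_logic;
    clk_500_s : out std_logic    -- 90 degrees after clk_500
  );
end entity;

architecture rtl of lp_clocking is
  signal fb : std_logic;
begin
  mmcm : MMCM_BASE
    generic map (
      CLKIN1_PERIOD    => 10.0,  -- 100 MHz in
      CLKFBOUT_MULT_F  => 10.0,  -- VCO at 1000 MHz
      CLKOUT0_DIVIDE_F => 4.0,   -- 250 MHz
      CLKOUT1_DIVIDE   => 2,     -- 500 MHz
      CLKOUT2_DIVIDE   => 2,     -- 500 MHz ...
      CLKOUT2_PHASE    => 90.0   -- ... shifted 90 degrees
    )
    port map (
      CLKIN1   => clk_ref,
      CLKFBIN  => fb,
      CLKFBOUT => fb,
      CLKOUT0  => clk_250,
      CLKOUT1  => clk_500,
      CLKOUT2  => clk_500_s,
      LOCKED   => open,
      RST      => '0',
      PWRDWN   => '0'
    );
end architecture;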



Figure 5.5. The transmitter FSM state chart showing all states of the FSM and how they are connected to each other. Note that in the Last Send state the Checksum Enabled option has the highest priority before the other edges are checked. If a reset signal is asserted the FSM restarts in the Uninitialised state. Ack is the input acknowledgement signal from the connected receiver, and data available indicates whether there is data available in the internal FIFO buffer.

5.2.2 Transmitter State Machine

The state machine of the transmitter is its backbone and is described in Figure 5.5. In the figure we see all the states and the connections between them. Every state is described in detail below, and a VHDL sketch of the state machine follows the state descriptions.

Uninitialised is the state where the transmitter starts. It will leave this state as soon as possible after receiving a start command from the host indicating that the transmitter should start.

Idle is the state that the transmitter is in while waiting for data to send. When data arrives, the transmitter advances to the next state.

Ready To Send indicates that the transmitter has data to send but the receiver is not ready to receive yet, indicated by Ack being low. As soon as Ack goes high,

the transmitter advances to the next state and starts to transmit.

Sending is where the transmitter transmits the main part of the data. It remains here for the transmission of the first seven half-words, leaving for Last Send when the last word should be sent.

Last Send is a special send state where decisions are made. It transmits the last half-word of the output data, then checks whether the checksum byte should be sent, whether the acknowledgement signal is high and whether data is available, and decides which state to move to based on that.

Send Checksum is the state where the checksum and dummy bytes are transmitted. In this implementation, the dummy byte is always equal to the last sent data byte due to implementation specifics.
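Condensing the state descriptions above into VHDL gives a sketch like the following. Signal names are illustrative, and the transitions out of Send Checksum are assumed to mirror those of Last Send:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity lp_tx_fsm is
  port (
    clk_250          : in  std_logic;
    reset            : in  std_logic;
    host_ready       : in  std_logic;   -- from the control register
    checksum_enabled : in  std_logic;   -- from the control register
    data_available   : in  std_logic;   -- from the internal FIFO
    ack              : in  std_logic;   -- from the connected receiver
    transmitting     : out std_logic    -- high while data is on the line
  );
end entity;

architecture rtl of lp_tx_fsm is
  type state_t is (UNINITIALISED, IDLE, READY_TO_SEND,
                   SENDING, LAST_SEND, SEND_CHECKSUM);
  signal state    : state_t := UNINITIALISED;
  signal hw_count : unsigned(2 downto 0) := (others => '0');  -- half-words sent
begin
  transmitting <= '1' when state = SENDING or state = LAST_SEND
                        or state = SEND_CHECKSUM else '0';

  process (clk_250)
  begin
    if rising_edge(clk_250) then
      if reset = '1' then
        state <= UNINITIALISED;
      else
        case state is
          when UNINITIALISED =>            -- wait for the host start command
            if host_ready = '1' then
              state <= IDLE;
            end if;
          when IDLE =>                     -- wait for data to send
            if data_available = '1' then
              state <= READY_TO_SEND;
            end if;
          when READY_TO_SEND =>            -- wait for the receiver's Ack
            if ack = '1' then
              state    <= SENDING;
              hw_count <= (others => '0');
            end if;
          when SENDING =>                  -- the first seven half-words
            hw_count <= hw_count + 1;
            if hw_count = 6 then
              state <= LAST_SEND;
            end if;
          when LAST_SEND =>                -- eighth half-word, then decide
            if checksum_enabled = '1' then -- checksum has highest priority
              state <= SEND_CHECKSUM;
            elsif data_available = '1' and ack = '1' then
              state    <= SENDING;
              hw_count <= (others => '0');
            elsif data_available = '1' then
              state <= READY_TO_SEND;
            else
              state <= IDLE;
            end if;
          when SEND_CHECKSUM =>            -- checksum and dummy byte
            if data_available = '1' and ack = '1' then
              state    <= SENDING;
              hw_count <= (others => '0');
            elsif data_available = '1' then
              state <= READY_TO_SEND;
            else
              state <= IDLE;
            end if;
        end case;
      end if;
    end if;
  end process;
end architecture;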

5.2.3 Transmitter LVDS Outputs

Since the Link Port is an LVDS port with a double data rate, special care has been taken when designing the output logic. Help was obtained from the Xilinx CoreGenerator Wizard, a tool in which a specified setup of an output port may be created from a graphical wizard where serialisation, data rate and so forth are specified. The upside of using this wizard to create the outputs was that the tool itself created all the necessary primitives and connected them to each other. The downside is that it might be non-optimal for a specific solution, since it may introduce unnecessary logic into the design. However, it was used to generate an output, and after generation the logic was thoroughly tested in simulation to ensure that its functionality was correct. The created output block has a 16-bit input which is serialised onto four differential pin pairs, which are the Link Port data lines. These pairs supply data at 500 MHz DDR, i.e. one new bit every nanosecond per data line. The block needs two input clocks: the 500 MHz clock for the data, but also a clock at half that speed, i.e. 250 MHz, which is the rate at which new 16-bit words arrive at the inputs of the block. To realise the serialisation the block uses the OSERDES1 primitive, a serialiser present in the output block of the FPGA. The idea of the OSERDES1 is to present new output data on both the rising and the falling edge of the input clock, which in this case is the 500 MHz clock. Since the data path is four bits wide, four OSERDES1 are used in parallel to obtain the correct data path width. In addition to the Link Port data output lines, the source-synchronous clock also has to be supplied. Compared to the data, the clock edge should be 90° phase shifted relative to the data switching edge. The relationship between the clocks and the output data is shown in Figure 5.4. The ClkShifted is always present.
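As a rough illustration of what the four parallel OSERDES1 blocks accomplish, the following behavioural simulation model shows the 16-bit-to-four-line DDR mapping. It is not the primitive-based logic the design actually uses, and the low-nibble-first ordering is an assumption:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity ddr_out_model is
  port (
    clk_500   : in  std_logic;
    half_word : in  std_logic_vector(15 downto 0);
    data_out  : out std_logic_vector(3 downto 0)
  );
end entity;

architecture sim of ddr_out_model is
  signal idx : unsigned(1 downto 0) := "00";
begin
  -- advance one nibble on every clock edge (DDR); simulation only
  process (clk_500)
  begin
    idx <= idx + 1;
  end process;
  data_out <= half_word( 3 downto  0) when idx = 0 else
              half_word( 7 downto  4) when idx = 1 else
              half_word(11 downto  8) when idx = 2 else
              half_word(15 downto 12);
end architecture;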

56 5.2. LINK PORT TRANSMITTER

However, the output source-synchronous clock should only be enabled when actual data transfer is happening, which requires the use of an enable signal. The enable signal Enable originates from the FSM, since the FSM is the part which controls the transmission. As it happens, the previously used OSERDES1 block has an enable input. Since the characteristics of the output clock relative to the output data are identical to those between ClkShifted and DataOut in Figure 5.4, and DataOut is generated through the OSERDES1, the output clock could also be generated through the same kind of block. However, since the Enable signal comes from the Clk250 domain and should end up enabling something in the ClkShifted domain, it has to cross domains. This is done as shown in Figure 5.6, by crossing through two registers in the ClkShifted domain. This is also needed to create the same delay as the data path from the FIFO, through the serialiser and out to the data outputs. Since the clock only shifts between 0 and 1, the data inputs are tied to those values, as seen in Figure 5.6.


Figure 5.6. The schematic layout of the enable-signal clock domain crossing in the transmitter. Notice how the Enable signal crosses from the Clk250 domain (CLK_250) to the ClkShifted domain (CLK_500_SHIFTED) through a register clocked on the falling edge of ClkShifted, to give the signal as long a time as possible to propagate.


Figure 5.7. Timing diagram of the clock domain crossing of the enable signal for the output clock in Figure 5.6, from which all signal names are taken. Q1 is the data between the registers, EN is the enable of the OSERDES1 (the rightmost component) and ClkOut is the Q port of the OSERDES1.

The design was, however, not elementary. Since the enabling of the clock is decided on a rising edge of Clk250 and it should reach a register in the ClkShifted domain, the signal would have had only 0.5 ns to travel to that register. To overcome this, the signal is clocked on the falling edge of ClkShifted, and

hence the enable signal has a full 1.5 ns to get from the Clk250 to the ClkShifted domain; the enable signal then has 1 ns to enable the output before the shifted clock goes high again. This makes the timing requirements easier to meet. The design may be seen in Figure 5.6 and its timing is shown in Figure 5.7.
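A sketch of this crossing in VHDL could look as follows; the entity and signal names are illustrative, and the output is assumed to drive the enable input of the output-clock OSERDES1:

library ieee;
use ieee.std_logic_1164.all;

entity enable_cdc is
  port (
    clk_shifted : in  std_logic;  -- the 90-degree shifted 500 MHz clock
    enable_250  : in  std_logic;  -- Enable, registered in the Clk250 domain
    enable_out  : out std_logic   -- to the enable input of the OSERDES1
  );
end entity;

architecture rtl of enable_cdc is
  signal en_fall : std_logic := '0';
  signal en_rise : std_logic := '0';
begin
  process (clk_shifted)  -- falling-edge stage: 1.5 ns for the crossing
  begin
    if falling_edge(clk_shifted) then
      en_fall <= enable_250;
    end if;
  end process;

  process (clk_shifted)  -- rising-edge stage: matches the data-path delay
  begin
    if rising_edge(clk_shifted) then
      en_rise <= en_fall;
    end if;
  end process;

  enable_out <= en_rise;
end architecture;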

5.2.4 The Data Path and Memory Design

The data travels along its own path, controlled by the FSM, on its way to the output logic. The design consists of a memory FIFO buffer which is used to clock data in from the IP block connections. The FIFO is adjustable to different sizes, but for this design it was created as an 8-entry-deep FIFO, where each entry is a 128-bit quad-word. The FIFO is made up of three separate counters and four 32-bit-wide memory banks. 32-bit-wide memory banks were chosen due to their wide availability inside FPGAs, and because they map more easily onto built-in memory blocks. The three counters in the FIFO are an input address counter, an output address counter and a full/almost-full counter. The full/almost-full counter is used by the logic to control whether more data may be written into it. The almost-full signal is not used on the transmitter side, but on the receiver side, where the receiver must act in advance to stop the transmission in order not to lose any data.
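A simplified, behavioural sketch of such a FIFO is given below. It uses a single 128-bit-wide memory array instead of the four 32-bit banks of the real design, and the almost-full threshold is an assumption:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity lp_fifo is
  port (
    clk         : in  std_logic;
    wr_en       : in  std_logic;
    wr_data     : in  std_logic_vector(127 downto 0);
    rd_en       : in  std_logic;
    rd_data     : out std_logic_vector(127 downto 0);
    full        : out std_logic;
    almost_full : out std_logic;  -- used on the receiver side to drop Ack
    empty       : out std_logic
  );
end entity;

architecture rtl of lp_fifo is
  type mem_t is array (0 to 7) of std_logic_vector(127 downto 0);
  signal mem            : mem_t;
  signal wr_ptr, rd_ptr : unsigned(2 downto 0) := (others => '0');
  signal fill           : unsigned(3 downto 0) := (others => '0');  -- 0..8
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if wr_en = '1' and fill /= 8 then
        mem(to_integer(wr_ptr)) <= wr_data;
        wr_ptr <= wr_ptr + 1;
      end if;
      if rd_en = '1' and fill /= 0 then
        rd_ptr <= rd_ptr + 1;
      end if;
      -- keep the fill counter consistent with the two pointers
      if (wr_en = '1' and fill /= 8) and not (rd_en = '1' and fill /= 0) then
        fill <= fill + 1;
      elsif (rd_en = '1' and fill /= 0) and not (wr_en = '1' and fill /= 8) then
        fill <= fill - 1;
      end if;
    end if;
  end process;

  rd_data     <= mem(to_integer(rd_ptr));
  full        <= '1' when fill = 8 else '0';
  almost_full <= '1' when fill >= 6 else '0';  -- threshold is an assumption
  empty       <= '1' when fill = 0 else '0';
end architecture;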

5.2.5 Controlling the Transmitter

Some setup is needed in order to get data transmissions up and running. There are status and control registers with which the IP block is controlled. The status register tells whether the transmitter has buffer space or is full, and whether any errors have occurred. The writable register is used for setting parameters such as the transfer width and whether to enable the checksum, and finally for enabling the transmit block when it should be used.

Writeable Control Register

Bits:   31 - 3    2              1                  0
        Unused    Transfer Size  Checksum Enabled   Host Ready

Figure 5.8. The layout of the writable control register. Only three things need to be controlled in the design. The most important is to start the transmitter by writing a 1 to bit position 0. Simultaneously, a 0 or 1 is written at position 1, depending on whether the checksum is wanted or not.

To control the transmitter there is a writeable register for setting the preferences desired in the current application; it is also used to start the Link Port transmitter. The writeable register is laid out as in Figure 5.8 and the different bits control the behaviour of the Link Port transmitter as described below; a set of VHDL constants matching this layout is sketched after the field descriptions.


Transfer size is a one-bit field that sets the width of the data path for the transmissions. However, in this release only the four-bit-wide interface is implemented. Possible values: 0 | 1-bit-wide Link Port (not yet implemented). 1 | 4-bit-wide Link Port output.

Checksum enabled sets if the optional checksum calculation is enabled or not. Possible values: 0 | Checksum disabled. 1 | Checksum enabled.

Host ready is the indication from the host that it is ready to use the Link Port transmitter block. When writing a 1 to the IP block, the FSM starts the necessary setup sequence for enabling the Link Port link with the remote receiver. Possible values: 0 | Not ready. 1 | Ready to transfer data.
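The bit positions of Figure 5.8 can be collected in a small VHDL package; the package and constant names are invented for the example:

library ieee;
use ieee.std_logic_1164.all;

package lp_ctrl_pkg is
  -- Bit positions in the writable control register of Figure 5.8
  constant CTRL_HOST_READY : natural := 0;  -- write '1' to start the block
  constant CTRL_CHKSUM_EN  : natural := 1;  -- '1' enables the checksum byte
  constant CTRL_XFER_SIZE  : natural := 2;  -- '1' = 4-bit-wide Link Port
end package;

-- Example host write: start with checksum enabled and a 4-bit port:
--   ctrl_reg <= (CTRL_XFER_SIZE => '1', CTRL_CHKSUM_EN => '1',
--                CTRL_HOST_READY => '1', others => '0');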

Readable Status Register

There is also a readable status register telling the current status of the transmitter IP block. It provides several pieces of information about the current state of the transmitter. The bit layout of the readable register may be seen in Figure 5.9, and each bit position and its meaning is explained below.

Bits:   31 - 6   5              4               3               2              1           0
        Unused   FIFO Overflow  FIFO Underflow  Checksum Error  Timeout Error  FIFO Empty  FIFO Full

Figure 5.9. The layout of the readable status register. The register is used to determine if more data could be written into the IP block, or if there is available data on output. In addition it may tell if too much data has been written into the FIFO by indicating FIFO overflow.

FIFO overflow is an indication of whether the FIFO has had an overflow or not. Should not be able to happen in a receiver due to internal logic. However in a transmitter there is a possibility that the host writes to the link port module without checking if it is possible to write or not and thus creating an overflow. Possible values: 0 | No overflow. 1 | Overflow detected.

FIFO underflow is an indication of whether the FIFO has had an underflow or not. Should not be able to happen in a transmitter due to internal logic. However

in a receiver there is a possibility that the host reads from the link port module without checking if it is possible to read or not, thus creating an underflow. Possible values: 0 | No underflow. 1 | Underflow detected.

Checksum error indicates if a checksum error has happened in a transmission and the value received is not to be considered correct. Possible values: 0 | No checksum error. 1 | Checksum error detected.

Timeout error indicates that a timeout error has happened in a transmission. This could be due to lost bits in a transmission, which makes the receiver come out of sync. Possible values: 0 | No timeout error. 1 | Timeout error detected.

FIFO empty is an indication that there is no data in the FIFO and thus no data may be read. This however guarantees that data may be written into the FIFO. Possible values: 0 | FIFO not empty. 1 | FIFO empty.

FIFO full indicates that the FIFO is full and thus cannot receive any more data writes into it. If a write is done anyway it will result in a FIFO overflow error. Possible values: 0 | FIFO not full. 1 | FIFO full.

5.2.6 Checksum Calculator

To increase the integrity of the data there is a checksum calculator which may produce a checksum to send. The checksum is specified in the TigerSHARC specification [2] and is calculated as in Equation 5.1, where LSB denotes the least significant byte of its argument and B_i is the i:th byte of the transferred quad-word.

\[
\mathrm{CHECKSUM} = \mathrm{LSB}\left(\sum_{i=0}^{15} B_i\right)
\tag{5.1}
\]

When implementing the checksum calculator, a large adder tree is created in order to sum all the parts of the quad-word. To create this in the VHDL code, an easy approach was chosen: the adder tree was created as one big adder tree without any internal registers, and the path from its inputs to its outputs was specified as a multi-cycle path in the constraints file. This is possible since the

checksum may be calculated during the entire time that the quad-word is being sent, i.e. several clock cycles. By doing this the logic needed is decreased, since the timing is far more relaxed.
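A sketch of such an unregistered adder tree is shown below; in the real design the input-to-output path is additionally declared as a multi-cycle path in the constraints:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity checksum_calc is
  port (
    quad_word : in  std_logic_vector(127 downto 0);
    checksum  : out std_logic_vector(7 downto 0)
  );
end entity;

architecture rtl of checksum_calc is
begin
  process (quad_word)
    variable sum : unsigned(11 downto 0);  -- wide enough for 16 byte-adds
  begin
    sum := (others => '0');
    for i in 0 to 15 loop
      sum := sum + unsigned(quad_word(8*i + 7 downto 8*i));
    end loop;
    checksum <= std_logic_vector(sum(7 downto 0));  -- LSB of the sum
  end process;
end architecture;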

5.2.7 The Implementation of Block Complete

In many applications it is beneficial if the number of bytes to be sent can be unknown to the receiver, or at least variable. To enable such transmissions, the Link Port protocol uses the n_BCMP signal. For the receiver side it might be very important to know when a block is complete, and thus the last word has to be markable. The way to do that in this Link Port implementation is to write to a specific port on the IP block, notifying it that the next quad-word is the last in its block. By writing to that port, a value is stored in memory alongside the next quad-word written to the block. Later, when the FSM reads that quad-word out of the FIFO, it also signals block complete by setting n_BCMP low, so that the receiver knows that the block is now complete.

5.3 Link Port Receiver

The other end of the Link Port link is the receiver. The receiver could be thought of as merely the transmitter reversed, but that is not entirely true. The receiver has a somewhat harder design due to two main factors: different clock domains and a discontinuous clock. Before dealing with these problems, however, a description of the Link Port receiver functionality is presented.


Figure 5.10. The block layout of the Link Port receiver. All blocks are presented with their interconnects and the width of the data paths and the input and output ports. The critical clock domain crossing takes place inside the Receiver Logic block.

The Link Port receiver is in many respects similar to the transmitter. It contains an FSM which is responsible for keeping the data transfers running continuously, it has LVDS inputs, and it contains a FIFO buffer where received words

are buffered before being read out by the host. The block layout of the design can be seen in Figure 5.10, which also shows where the data crosses clock domains, from the input logic to the rest of the data path. Each of the building blocks is presented below, followed by a description of the technical difficulties with the clock domain crossing and the discontinuous clock.

5.3.1 Receiver Finite State Machine

The heart of the receiver is its Finite State Machine (FSM). The FSM needs to be fast in order to meet the requirements of the communication protocol; the required timings are set externally and have to be followed. One of the hardest requirements is the receiver's ability to handle back-to-back transmissions, i.e. transmissions with no idle time in between, which requires that when one quad-word has been received the receiver is immediately ready to receive another. Therefore, the receiver FSM has been created with as few states as possible. The four states are called OFF, RECEIVING, LAST_RECEIVE and CHECKSUM, and their respective meanings are explained below.


Figure 5.11. The Link Port receiver state diagram. It shows the states and the transitions between them.

OFF is the reset and error state. The receiver tries to stay in this state for as short a time as possible. It leaves when reset is de-asserted and it receives the indication

from the connected transmitter that the link is active, i.e. the de-assertion of n_BlockComplete. The receiver re-enters this state only when reset is asserted.

RECEIVING is the state where the receiver spends most of its time. This state indicates that the receiver is waiting to receive a full quad-word. It counts the bytes that have arrived, and when one reception remains it moves to the LAST_RECEIVE state.

LAST_RECEIVE is the state responsible for resetting the counter before the next receive. It is also responsible for the next transition into either CHECKSUM or straight back to RECEIVING depending on the user settings.

CHECKSUM is the state which reads the checksum byte out of the stream. It should then compare the received checksum byte to the byte calculated in the receiver. This is not yet implemented; however, the functionality needed to support checksum transmissions is working. This means that the receiver may connect to a checksum-enabled transmitter and receive data correctly, but it simply discards the checksum and dummy bytes.

5.3.2 Controlling the Receiver

Controlling the receiver is identical to controlling the transmitter circuitry, as explained in subsection 5.2.5, with one exception: bit number 6 in the readable register is used in the receiver to indicate whether the current output word is the last word in a block. So, prior to reading out the next word, bit position 6 in the readable register can be checked to see whether it is the last quad-word in a block.

5.3.3 The Deserialisation of Incoming Data

When data arrives at the receiver it is serialised and needs to be deserialised correctly in order to be read out. The deserialisation takes place in two steps, which occur in two separate blocks. The first step occurs in the input logic block and deserialises the 4-bit DDR stream into a 16-bit-wide SDR stream. The 16-bit SDR stream from the input logic arrives at a 16-to-128 deserialiser, which takes eight half-words and combines them into a quad-word that is then stored in the FIFO buffer for the host to read. The FIFO which the data is stored into is very similar to the one in the transmitter IP block (see page 58). The FIFO on the receiver side, however, uses the AlmostFull functionality of the internal counters. This is used in order to know when to de-assert the Ack signal to the transmitter. By de-asserting the signal, the receiver informs the transmitter that it is not ready to receive any more data, which in turn forces the transmitter to pause the transmission.


5.3.4 Receiver LVDS Inputs

The creation of the LVDS input logic was one of the most complex parts of the design. There were several difficulties, e.g. the discontinuous source-synchronous clock and the very high clock speed and data rate. The input data stream is a 500 MHz DDR stream, which means that data has to be captured every nanosecond. By examining the specification of the Link Port protocol, the beginning and end of a transmission may be drawn. In Figure 5.12 the beginning of a transmission may be seen: the data starts on the first rising edge of the clock. In Figure 5.13 the end of a transmission is shown: the last reception of data happens on the last falling edge. This shows the role of the discontinuous clock, which causes many problems when implementing the Link Port receiver.


Figure 5.12. Timing diagram for the receiving side of the Link Port block. It shows the start of a reception of a quad-word. Note that the transmission starts on the first rising edge of the clock.


Figure 5.13. Timing diagram for the receiving side of the Link Port block. It shows the end of a reception of a quad-word. After the last falling clock edge, there is no guarantee that another clock pulse will come, so the last data must be clocked in on the last falling edge.

In the creation of the receiver, several implementations were evaluated before reaching one that met the requirements and was good enough to use in the design. Below, some of the attempted implementations are presented and their respective flaws pointed out.

The First Receiver Implementation

The first attempt used the Core Generator from Xilinx to generate the wanted input logic. Since this approach worked well for the creation of the transmitter, the idea was that it should work equally well for the receiver. However, due to the higher complexity of the receiver, this was not the case. Since data starts to be collected at the first rising edge of the input clock, the receiver must already be primed and ready when the first rising edge arrives. The correct timing is shown in Figure 5.12; however, when using the generated core the input lost the first input bits, as well as cut off the input at the end. This was due to internal delays in the ISERDES1 block inside the FPGA, which made the primitive hard to work with.


The resulting timing of a circuit created in Core Generator is presented in Figure 5.14. When studying the resulting timing and input/output, the flaw in the design is visible, since there is a delay of possibly undefined length in the design.


Figure 5.14. Timing diagram for the input of the Link Port receiver when using CoreGenerator. DataIn is the data arriving at the inputs and DataOut is the data that goes out to the host.

The generated design is presented in Figure 5.15. It is clear that the generated schematic uses ISERDES1 primitives from the Virtex-6 FPGA family for the four input ports, and that it divides the input clock down to half speed for clocking out data. With this design, however, there were two problems. The first was that the first quartet never actually appeared on the output of the circuit. The second was that the last data does not exit the module on the last falling clock edge; instead, the timing is as shown in Figure 5.14. The problem is that there is no knowledge about the arrival of the last quartet at the receiver, since its output relies on the next clock pulse, i.e. the start of the next transmission.

Another Attempted Design

When the attempt to use only CoreGenerator failed, a pure VHDL solution was chosen as the next step. This attempt used fast FIFOs as receiver logic. The design contained two FIFO buffers, one for the rising edge of the clock and one for the falling edge. The idea was to capture sixteen bits at a time while constantly reading the FIFOs out with the output clock. The internal logic notifies the receiver FSM when there is data available, and the FSM then reads it out and decides how to process it, i.e. whether it is data or checksum bytes. The output data width was chosen to be 16 bits for a simple reason, namely the desire to support checksum calculations. Since the frame length differs by 16 bits between checksum-enabled and checksum-disabled transmissions, a 128-bit-wide datapath was inconvenient. By using only 16 bits, the design could support both checksum-enabled and checksum-disabled transmissions without the input logic being aware of what it reads. This block is intended to be fed with a 500 MHz clock on both the input and output clock ports. The input clock should be supplied from outside the chip as a differential signal. The output clock is the clock that outputs data from the block to the FSM. These two clocks have no knowledge of how they are correlated to each other; the important thing is that they share the same frequency. The output data is clocked out through flip-flops driven by the output clock.



Figure 5.15. The schematic for the generated receiver logic.

This design was, however, not able to meet the timing requirements, which caused the move to the next implementation.

Close to Goal

When the deserialiser circuitry (ISERDES1) on the FPGA, which was initially attempted, turned out to cause problems, and the dual-FIFO solution was inadequate, a combination of standard VHDL and Xilinx primitives was considered. The primitive Xilinx parts were used for the differential-to-single-ended conversion, and the rest was handled by standard VHDL constructs. The solution this time was to write a VHDL implementation without the generator, introducing some Xilinx primitives in the VHDL to deal with certain aspects of the signalling. For example, the differential inputs are connected to IBUFDS primitives, which are input buffers that take a differential input and convert it to a single-ended output. These were in turn connected to a number of registers used to deserialise the inputs by clocking them in; on every second falling edge a data-available signal is set high to ensure that the FSM logic registers the data. The schematic of the input design is available in Figure 5.16, where it may be seen how the data is clocked into the registers in a particular order to deserialise the data.



Figure 5.16. The final layout of the receiver LVDS inputs. Everything is clocked with the source synchronous clock sent out by the transmitter. Data is made single ended by the IBUFDS primitive, located furthest out to the left.


Figure 5.17. Timing diagram showing how the signals in Figure 5.16 behave and are clocked.

To follow the data and clocking in Figure 5.16, we can look at the timing diagram in Figure 5.17, which shows the start of a transmission and the reception of the first two 16-bit output words. The data is clocked into R1 at the first rising clock edge. At the first falling edge, data is clocked into register R3. At the second rising edge, the enable signal goes high, the new data is stored in R1 and the data from R1 is clocked into R2. At the second falling edge, enable is high and the data from R1, R2 and R3 is clocked into OutReg, along with the current input data at DataIn. This data is later read by the FSM in order to get the received data to the host. The data is indicated by the DAV signal, which is not shown in the schematic but in the timing diagram. The DAV and OutReg are then connected to the rest of the design through double registers clocked with the receiver's

local 500 MHz clock, to be processed in the receiver. This design was indeed the final design in terms of VHDL code, but it was not yet sufficient to get through timing. The next section explains what else needed to be done in order to meet the timing requirements; however, this was the design that came closest to meeting them.
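A VHDL sketch of this input stage, reconstructed from Figure 5.16 and the timing diagram in Figure 5.17, could look as follows. The exact nibble ordering and the behaviour of the enable counter are assumptions based on those figures:

library ieee;
use ieee.std_logic_1164.all;

entity lp_input_deser is
  port (
    clk_in   : in  std_logic;                      -- source-synchronous clock
    data_in  : in  std_logic_vector(3 downto 0);   -- after the IBUFDS buffers
    data_out : out std_logic_vector(15 downto 0);  -- to the 16-to-128 stage
    dav      : out std_logic                       -- data available to the FSM
  );
end entity;

architecture rtl of lp_input_deser is
  signal r1, r2, r3 : std_logic_vector(3 downto 0);
  signal en_cnt     : std_logic := '0';  -- toggles on every falling edge
begin
  process (clk_in)                        -- rising-edge registers
  begin
    if rising_edge(clk_in) then
      r1 <= data_in;
      if en_cnt = '1' then
        r2 <= r1;
      end if;
    end if;
  end process;

  process (clk_in)                        -- falling-edge registers
  begin
    if falling_edge(clk_in) then
      r3     <= data_in;
      en_cnt <= not en_cnt;
      if en_cnt = '1' then                -- every second falling edge
        data_out <= data_in & r1 & r3 & r2;  -- assumed nibble ordering
        dav      <= '1';
      else
        dav      <= '0';
      end if;
    end if;
  end process;
end architecture;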

5.3.5 Getting the Receiver Through Timing

The timing issues came largely from the discontinuous clock in combination with the high frequency. With a DDR signal and a frequency of half a gigahertz, there is not much time to play with. The timing problem came, however, largely from the tool's willingness to implement the input clock through global clocking resources, i.e. an IBUFGDS. Since these are located in the middle of the fabric and are intended for low-skew distribution throughout the chip, they are not ideal for a small network that needs very low latency from pin to register. In a design where the IBUFGDS clocks the data at both ends of a path (a register-to-register delay) this is not an issue, but since the data in the Link Port receiver is source synchronous with respect to the input clock, such a delay is not acceptable. To deal with this, the FPGA has other clock networks than the global ones. The most interesting here are the I/O clock networks and the regional clock networks. The I/O networks may, however, only clock input registers and input logic, and referring to Figure 5.16 it is clear that more than just an input register needs to be clocked. This leaves the regional clocking networks, which have a limited size in order to provide their high speed. This is not a problem, since the logic clocked by the input clock is rather limited in size: only 39 flip-flops and one inverter, far below the size that a regional clocking buffer is able to clock. The resulting layout is presented in Figure 5.18. In addition to forcing the tool to use a regional clocking buffer, care had to be taken to place the input logic close to its input pins as well. The exact location relative to the pins is, however, not very important, as long as the routes have equal distances to travel, so that all signals arrive at their registers at the same time, with as little skew as possible relative to the input clock. To achieve this, manual placement of the input pins was used: the pins were placed as differential pairs, as close to each other as possible, to minimise the placement problems. In addition to the above, there is a clocking issue since there are both a 500 MHz and a 250 MHz clock in the design. They are, however, aligned on the rising edge (not the input clock, but the internal 500 MHz clock; compare with clocks Clk250 and Clk500 in Figure 5.4). This can cause trouble when signals cross from one domain to the other, but to ensure timing correctness, dedicated clocking blocks (MMCMs) are used and correct constraints are placed on the design where signals cross between domains. That, however, is not true for the input clock. This clock is totally unrelated to the other clocks in the design and thus it is



Figure 5.18. Here we see how the clock is routed through a standard differential input pair (IBUFDS) and in through a regional clock buffer (BUFR). This design choice is made to minimise the input delay of the clock.

impossible to constrain the signals that cross between those domains. To fix this, the design has double flip-flops at each crossing to avoid metastability, as shown in Figure 5.19. This design practice has been suggested in [60] and should thus be a reliable way to cross between two clock domains; it minimises the risk of metastable states.
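The double flip-flop crossing is a standard construct; a minimal sketch with illustrative names is given below:

library ieee;
use ieee.std_logic_1164.all;

entity sync_2ff is
  port (
    clk_dst  : in  std_logic;   -- destination-domain clock
    async_in : in  std_logic;   -- signal from the unrelated clock domain
    sync_out : out std_logic
  );
end entity;

architecture rtl of sync_2ff is
  signal meta, stable : std_logic := '0';
begin
  process (clk_dst)
  begin
    if rising_edge(clk_dst) then
      meta   <= async_in;   -- first stage may go metastable
      stable <= meta;       -- second stage gives it a cycle to settle
    end if;
  end process;
  sync_out <= stable;
end architecture;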

Figure 5.19. Here we see the clock domain crossing for the Link Port receiver circuitry. This design has been chosen to avoid metastability.


5.4 Testing and Verification

To test and verify the functionality of the IP blocks, they were both simulated as stand-alone IP blocks with an outside test bench which simulated the Link Port behaviour. This was the preferred method when developing the blocks, since it makes it easy to see what is going wrong and to fix it; therefore it was the primary simulation method during development. However, to ensure that the transmitter and receiver, which both seemed to work well with their independent test benches, also worked in a sharper environment, they were put into the same test bench and connected to each other. The only connections made were from the Link Port outputs of Figure 5.3 to the Link Port inputs of Figure 5.10 and vice versa. By doing so, we ensured that the test bench did not interfere with the signals belonging to the actual Link Port protocol; the test bench could only affect the IP blocks through their respective backside connections. Tests were then run both with and without checksum-enabled transmissions, both filling up the buffers and transmitting them empty, in order to ensure that the IP blocks work according to the Link Port specification. This testing verified that the two IP blocks could indeed connect to each other, and that they handle the ACK and n_BCMP signals as intended. These simulations also made it possible to measure the latency and throughput of the IP blocks. The throughput is an elementary calculation, since the IP blocks conform to the Link Port standard, which gives a throughput of 500 MB/s [2], see Appendix H. The calculation of the latency was a bit trickier, but the values could be read out from the simulation. Starting with the receiver, the measured latency lies between 51 and 55 ns, depending on the relationship between the input clock and the local clocks. Since data is clocked out with a clock that has a 4 ns period, the start of reception, relative to the 4 ns period, may differ by up to 4 ns. Moving on to the transmitter, its latency is measured from the write of a word to the time the last bit has left the output, given that everything is started up. The latency of the transmitter is deterministic, since all its clock relationships are fixed. The latency of the transmitter is similar in size, 56.5 ns when starting cold.
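The 500 MB/s figure follows directly from the link parameters:

\[
4\ \text{data lines} \times 500\ \text{MHz} \times 2\ \text{bits per line and cycle (DDR)} = 4\ \text{Gbps} = 500\ \text{MB/s}.
\]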

5.5 IP Block Restrictions

For the IP block there are some restrictions which apply in order for it to work as specified. One restriction is that the device has to support a 500 MHz frequency in its regional clocking nets, which restricts the choice of FPGA to those which support that. It also restricts the energy efficiency of the FPGA, since it has to run in a non-energy-save mode. The design targeted

a Xilinx Virtex 6 FPGA, and in that family the design is applicable to every model supporting speed grade -3. When the new Virtex 7 arrives, the design will fit into any Virtex 7 with speed grade -2 or -3, of which all devices are able to support -2 according to the preliminary data sheet [61].

5.6 IP Block Metrics

The final implementation of the Link Port IP blocks was placed and routed using Xilinx ISE 13.4 and the previously specified FPGA. The necessary constraints were specified, and from that placement the statistics in Table 5.1 show that the receiver implementation is slightly larger than the transmitter implementation. One cause of this could be the increased complexity on the receiver side, with the clock domain crossing and the discontinuous clock.

                     Transmitter  Receiver
Slice Registers      584          605
Slice LUTs           284          289
Slices               154          193
Dynamic Power (mW)   184          181
Latency (ns)         56.5         51.0-55.0
Bandwidth (MB/s)     500          500

Table 5.1. The resource usage of the two IP blocks in tabular form. The table shows that the receiver is a bit larger than the transmitter; however, it also contained more technical challenges.

In Table 5.1 we also see the power consumption of the Link Port blocks. Worth mentioning is that approximately 160 mW of the estimated power is dissipated in the clocking resources, the clock buffers and the MMCM. This implies that very little power is actually used in the logic parts of the IP blocks.

5.7 Link Port Implementation Time

As a part of this work, the time taken has been recorded, in order to enable future work in this area and possibly to compare the time taken to implement this in VHDL with another implementation method, such as another language. The work has been divided into several categories and the time spent on each kind of work has been recorded. The categories are Protocol Understanding, VHDL Coding, Coding for Synthesis, Protocol Debugging and Solving Timing Problems. The total time of creation for these two IP blocks was 222 hours. The hours were spent as specified in Table 5.2 and are reported as approximate hours.


Category                  Hours (approximate)
Protocol Understanding    26
VHDL Coding               65
Coding for Synthesis      47
Protocol Debugging        43
Solving Timing Problems   41

Table 5.2. This table specifies the number of hours spent on creating the IP blocks for the Link Port transmitter and receiver.

5.8 Contributions of This Link Port Implementation

One of the main contributions of this Link Port implementation is the use of the native 128-bit-wide data path. In the IP blocks available from the two main FPGA manufacturers, Altera [62] and Xilinx [63, 64], as well as in Qiang et al. [16], the data path has often been a common 32-bit-wide bus. By using a 128-bit-wide bus, more efficient data transfers may be achieved, since fewer transfers are necessary and the transfer frequency may be lower. Further, this implementation can operate at full speed (500 MB/s), which neither of the Xilinx-supplied implementations [63, 64] is able to do. The possibility of reusing the design in another type of FPGA is also very good: only the device-specific primitives (ISERDES1 and OSERDES1) need to be replaced by their equivalent components in the new architecture; the rest of the code is portable to any architecture. The frequency is also easy to scale, since the block works at any transfer frequency as long as the input clocks are aligned as specified in Figure 5.4.

5.9 Comments and Analysis of the Link Port IP Block

The work on the Link Port IP blocks has resulted in two separate working IP blocks. These blocks may be used together or independently. The blocks also support other link frequencies than 500 MHz, although this is not discussed in this report. The important thing when configuring for another frequency is to keep the clock relationships as specified in this report (see Figure 5.4). By keeping those relationships, the design works at any frequency up to 500 MHz. To ensure correctness of the block, the correct constraints have to be set. That includes positioning the pins, inputs or outputs, as close together as possible to minimise the skew between them. Examples of the needed constraints can be found in Listings C.1 and C.2 in Appendix C. These specify the setup and hold times needed, as well as the multi-cycle paths of the design. If the design passes Place and Route using these constraints, it should also work in a real application. The implementation time of the Link Port block was approximately 220 hours,

from the study of the protocol specifications until the entire protocol was implemented and verified in VHDL. This is not an extreme amount of time, but compared to using preexisting IP blocks for other protocols, it is a lot of effort just for the communication part of a system. However, talking to this specific processor cannot be achieved in any other way, and thus the need for the IP block is large. When comparing to other implementations of Link Port IP blocks, e.g. from Altera [62] and Xilinx [64], this implementation succeeds in reaching 500 MHz. That is unattainable in the Xilinx implementation, which has a maximum speed of 450 MHz; it is, however, implemented in a previous Virtex version (Virtex 4), which could be the reason. The size of that module is 220 or 150 slices, depending on implementation style. The smaller one uses the equivalent of ISERDES1, which was considered hard or impossible to use by this work. That means the 220-slice implementation is the most comparable, and in that respect this work is slightly smaller. Comparing sizes with Altera devices is harder, since they report sizes in Logic Elements (LE) instead of slices. However, their design may reach full 500 MHz operation in a Stratix device [62]. The sizes reported are 301 LE for the receiver and 222 LE for the transmitter, a ratio rather similar to the one achieved in this work.


Chapter 6

Comparison of Communication Techniques

Now that different techniques have been studied in practice, it is a good time to compare them with some of the other transmission techniques mentioned earlier. This comparison will look at their usage and, since they are regulated by specifications, only a theoretical throughput study will be carried out. Furthermore, available IP blocks will be evaluated to compare which techniques are most efficient per used chip area and resources.

6.1 Hard Facts

The first thing to do is to summarise some of the protocols that were introduced and examined in chapter 2. The summary looks at IP blocks with respect to size on the chip as well as transfer speed.

Protocol Type               LUTs  FFs   LUTRAM  BRAM  BUFG  GTX  Gbps (effective)
Ethernet (Tri-speed) [65]   360   430   0       2     4     0    0.975
PCI Express x1 (v.2) [66]   375   425   0       4-8   0     1    3.85
PCI Express x2 (v.2) [66]   525   525   0       4-8   0     2    7.70
SRIO x1 (3.125 Gbps) [67]   5800  5250  0       2     2     1    2.39
Link Port Rx                289   670   88      0     3     0    4.0
Link Port Tx                284   563   88      0     4     0    4.0

Table 6.1. A comparison of how much space an IP block of each listed protocol takes up in an FPGA. All IP blocks except the Link Port ones are commercially available through the Xilinx webpage.

In Table 6.1 there is a selection of some common protocol transceiver IP blocks available on the market, alongside the IP blocks created in this work. Comparing these IP blocks gives an overview of the sizes of different transfer protocol transceivers implemented in an FPGA and how much area they consume.

By cross-referencing this with the throughput of each of the protocols, a comparison of which protocol is suitable where is obtained. Firstly, with the PCI Express core, the maximum payload that may be carried is 512 bytes [66], as opposed to the protocol maximum of 4096 bytes. The link speed of a v.2 PCI Express lane is 5 Gbps. Consulting Appendix E, we see that the link uses 80% of the bits sent, due to the overhead of the encoding. Furthermore, it uses at least 20 bytes of overhead per packet. This gives a maximum usable throughput of 5 Gbps · 0.80 · 512/(512 + 20) ≈ 3.85 Gbps. For the two-lane case the throughput grows linearly to 7.70 Gbps. The PCI Express core also supports many other configurations, using more or fewer resources at different transfer speeds. The Ethernet core [65] supports 10/100/1000 Mbps, but for throughput comparisons 1000 Mbps is selected. With the associated Ethernet overhead (Appendix F), the maximum throughput of a Gigabit Ethernet link is 975 Mbps. The SRIO block (Appendix I) differs from the others in that it actually supports the full set of RapidIO layers up to the top [67], which is one reason why it is so large in terms of area. The large area is the price paid for a functional IP block that covers all layers of the stack.
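Appendix F is not reproduced here, but the 975 Mbps figure quoted above is consistent with the standard per-frame overhead of Gigabit Ethernet (8 bytes preamble, 14 bytes header, 4 bytes FCS and 12 bytes inter-frame gap around a 1500-byte payload):

\[
1000\ \text{Mbps} \cdot \frac{1500}{1500 + 8 + 14 + 4 + 12} \approx 975\ \text{Mbps}.
\]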

6.2 Making a Choice

When it comes to choosing which technique should be used, many factors play in. Looking only at Table 6.1, one might think that SRIO should not be used, due to its size. That is not entirely fair, since the IP block actually contains logic to handle all transmission layers of SRIO, giving the user a great advantage compared to doing all packet handling and creation themselves. Furthermore, as stated in Appendix I, the efficiency of SRIO is rather high: the usable part of the link may be as high as 95.5% of the link speed. The link speeds are comparable with those of PCI Express, which might be the greatest competitor when selecting a scalable multi-gigabit technique. PCI Express is also able to support really high utilisation if configured correctly. The maximum achievable data throughput is 99.5%, but that requires that all units can handle a four-kilobyte payload. If one unit in a network cannot handle the large payload, all units will decrease theirs to the highest payload supported by all [68]. This means that the system designer needs to know that all connected components support sufficiently large payloads for the network to gain from having a large payload. One example is the IP block presented in the previous section, which only supports payloads up to 512 bytes, i.e. only 96.2% of the transferred bits. This is in fact very comparable with the SRIO link. Looking at the Link Port, it is evident that it is no fancy protocol in any way. It is strictly a point-to-point protocol, in contrast to PCI Express and SRIO, which both support addressing. However, it has some benefits that the others lack: its simplicity gives it a very small silicon footprint. In Table 6.1 the Link Port and SRIO are the only ones with buffering of words, even though the buffers

76 6.2. MAKING A CHOICE in the SRIO is larger than the eight quad-word buffer in the Link Port . Even so, it supports full 4 Gbps in its operation without using any MGTs. The price that has to be paid is the number of pins used, four pairs for data, one for clock and two pins for control signals. Ethernet on the other hand might not fit very well into this comparison. Gigabit speed is not that impressive. However, it could very well be used as a mean of communication over longer distances. For that purpose, neither of the other ones are especially good. They all have timing constraints that are rather hard and are not perfectly suited for such transportation. For long connections, the Link Port is an especially bad choice, since it has very delicate timing considerations to fulfil in order to work properly. This is largely due to the fact that the clock is separate from the data, unlike SRIO and PCIe. However, using Link Ports as a short communication method is no bad idea and result in low silicon footprint without the need for MGTs and is still able to achieve gigabit per second speeds.
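As a back-of-the-envelope check on the quoted Link Port numbers, the raw bandwidth follows directly from the bus geometry. The sketch below assumes a 500 MHz bit clock, which is the value that yields the stated 4 Gbps; the clock rate is an assumption here, not a figure from this chapter:

# Raw Link Port bandwidth: 4 data pairs sampled on both clock edges (DDR).
data_width_bits = 4
ddr_factor = 2            # two data beats per clock period
clock_hz = 500e6          # assumed nominal bit clock

raw_bps = data_width_bits * ddr_factor * clock_hz
print(raw_bps / 1e9)      # 4.0 Gbps, matching the quoted link speed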


Chapter 7

Goal Follow Up and Conclusions

The work set out to fulfil some goals, as seen in section 1.2, and this is a summary of how that went. The first goal was to implement a Link Port IP block for use in FPGAs, and that goal is fulfilled. The work resulted in a model which works in simulation and is able to operate at the full Link Port link speed of 4 Gbps. The model has been constrained in order to guarantee that it meets its requirements.

The second goal was to evaluate UDP and TCP over GbE and see what affected performance. That evaluation worked out well, and several factors that affect performance were revealed. It was shown that with correct settings, a throughput was achieved that was within 1% of the theoretical throughput. The UDP communication, however, raised more questions than it answered. It is clear that the settings affect performance a lot, but there seem to be some operating-system or device-driver implementation details that limit UDP performance and make it linear in, and dependent on, the number of interrupts. The GbE work resulted not only in the measurements and characterisation, but also in testing software with which interconnected computers can test the Ethernet connectivity between them. In addition to being a functional testing platform already, it has the potential to be extended with more protocols besides TCP and UDP. Although only single-stream connections have been tested so far, it could easily be extended to test multiple concurrent connections, since the support for multiple connections already exists; the only thing needed is an easier user interface.

The third goal was to collect research results and compare protocols that could be used in system design. This work was carried out on a very theoretical level and was mostly a collection of existing research results and some theoretical throughput calculations. The recommendations that exist are found in chapter 6 and focus mostly on embedded systems and backbone connections. The feeling was that there was a bit too little time to give the protocol comparison the attention needed to complete it.

In terms of the fourth goal, about future standards, there is not too much to say. The feeling is that the future will contain many protocols, but the thing they will have in common is that they will be serial multi-gigabit-per-second protocols that use differential signalling, at least in the near future. In the longer run, the differential electrical signals will probably be replaced by optical signals, which create less noise and do not cross-couple to each other.

Chapter 8

Future Work

There are a lot of unanswered questions which could be interesting to look at in the future. One of them is to look at other protocols that time did not allow this work to cover. The idea was to have time to study inter-system connections with protocols that are intended for longer transmissions. The only protocol that was thoroughly studied was GbE, with TCP and UDP over it. It would be interesting to have the possibility to test other protocols over GbE as well, but even more interesting to get an evaluation of Infiniband. From the work on UDP and TCP there also arose questions that need to be investigated further. What is the reason for the unexpected behaviour of UDP when going from a 1024-byte payload to 1025 bytes? Furthermore, it would be interesting to see if a more ideal transmitter could be found, so that the effects of the receiver could be mapped better at sizes above 1025 bytes. For the TCP connections, there was no good answer as to why the throughput dropped at large receive buffer sizes, and that is something that will need further investigation. Another interesting thing is to look into the possibility of using raw Ethernet frames instead of TCP or UDP packets and see how this would affect performance. For the shorter communications, hardware validation of the IP block is required. It is confirmed that it works in theory by simulating the generated netlist, but it is not confirmed in hardware. A comparison with e.g. an implemented SRIO link would also be an interesting thing to do, looking into whether that link actually can perform as well as it is supposed to according to specifications. Finally, some more studies of larger-area networks have to be done. Looking at e.g. Infiniband more in depth and comparing it with GbE would be a good thing. Also, testing the new 10GbE would be interesting, since it puts a lot more pressure on the CPU.


Bibliography

[1] Xilinx, “Virtex-6 FPGA family.” Available at http://www.xilinx.com/products/silicon-devices/fpga/virtex-6/index.htm, March 2012.

[2] Analog Devices, ADSP-TS201S: TigerSHARC Embedded Processor Data Sheet. Analog Devices, C ed., Dec. 2006.

[3] A. Farina, “INTRODUCTION TO RADAR SIGNAL & DATA PROCESSING: THE OPPORTUNITY,” Nov. 2003. RTO-EN-SET-063.

[4] SAAB, “PS-05/A AIRBORNE MULTI-MODE RADAR.” Available at http://www.saabgroup.com/Global/Documents%20and%20Images/Air/Sensor%20Systems/PS%2005_A/saab_PS-05%20A%204pg%20Screen%20PDF.pdf, 2011.

[5] D. Bueno, C. Conger, and A. D. George, “Optimizing rapidIO architectures for onboard processing,” ACM Trans. Embed. Comput. Syst., vol. 9, pp. 18:1–18:30, Mar. 2010.

[6] D. Bueno, C. Conger, A. D. George, I. Troxel, and A. Leko, “RapidIO for radar processing in advanced space systems,” ACM Trans. Embed. Comput. Syst., vol. 7, pp. 1:1–1:38, Dec. 2007.

[7] RapidIO Trade Association, “RapidIO specification 2.2,” May 2011.

[8] H. Jian-Xi, W. Jing-Hong, and H. Shun-Ji, “The hardware implementation of real-time SAR signal processor,” in Radar Conference, 2000. The Record of the IEEE 2000 International, pp. 205–209, IEEE, 2000.

[9] T. Granberg, Handbook of digital techniques for high-speed design: design examples, signaling and memory technologies, fiber optics, modeling and simulation to ensure signal integrity. Upper Saddle River, NJ: Prentice Hall PTR, 2004.

[10] National Semiconductor Corp, LVDS Owner’s Manual 4th Edition - Low Voltage Differential Signaling. National Semiconductor, 2008.

[11] Altera, “Altera’s 28-nm, Power-Efficient transceivers,” Nov. 2011. Available at http://www.altera.com/literature/po/ss-28nm-transceivers.pdf.


[12] Xilinx, “7 series FPGAs GTX transceivers: User guide,” Nov. 2011. Available at http://www.altera.com/literature/po/ss-28nm-transceivers.pdf.

[13] Analog Devices, “Link ports,” in ADSP-TS201 TigerSHARC Processor Hardware Reference, Analog Devices, 1.1 ed., Dec. 2004.

[14] J. Wang, W. Wu, W. Zhang, P. Lei, and W. Li, “Parallel realization of high resolution radar on multi-DSP system,” in Radar Conference, 2009 IET International, pp. 1–4, 2009.

[15] Z. Fang and J. Xia, “A miniature implementation of air-born SAR real-time processing,” in Asia-Pacific Conference on Synthetic Aperture Radar 2009, pp. 939–942, IEEE, Oct. 2009.

[16] W. Qiang, G. Qing, L. Xuwen, and J. Kebin, “Hardware design of image information processor based on ADSP-TS201 DSPs,” in International workshop on Imaging Systems and Techniques 2009, pp. 155–158, IEEE, May 2009.

[17] PCISIG, “PCI express base 3.0 FAQ.” Available at http://www.pcisig.com/specifications/pciexpress/resources/PCIe_3.0_External_FAQ_Nereus_9.20.pdf, March 2012.

[18] TOP500.org, “Home | TOP500 supercomputing sites.” Available at http://top500.org/, March 2012.

[19] J. Postel, “Transmission control protocol,” Sept. 1981. RFC793 at http://tools.ietf.org/html/rfc793.

[20] USB-IF, Universal Serial Bus Specification. USB.org, 2 ed., Apr. 2000.

[21] USB-IF, Universal Serial Bus 3.0 Specification, vol. 1. USB.org, June 2011.

[22] “Thunderbolt technology: The transformational PC I/O,” 2012. Available at http://www.intel.com/content/www/us/en/architecture-and-technology/thunderbolt/thunderbolt-technology-brief.html.

[23] B. Wood, “Backplane tutorial: RapidIO, PCIe and ethernet.” Available at http://www.eetimes.com/design/signal-processing-dsp/4017736/Backplane-tutorial-RapidIO-PCIe-and-Ethernet, Jan. 2009.

[24] Y. Zhang, Y. Wang, and P. Zhang, “A High-Performance scalable computing system on the RapidIO interconnect architecture,” in 2010 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), pp. 288–292, IEEE, Oct. 2010.

[25] P. Gray and A. Betz, “Performance evaluation of copper-based gigabit ethernet interfaces,” in International Conference on Local Computer Networks 2002, pp. 679–690, IEEE Comput. Soc, November 2002.


[26] Y. Wu, S. Kumar, and S. Park, “Measurement and performance issues of transport protocols over 10Gbps high-speed optical networks,” Computer Networks, vol. 54, pp. 475–488, Feb. 2010.

[27] K. Salah and M. Hamawi, “Comparative packet-forwarding measurement of three popular operating systems,” J. Netw. Comput. Appl., vol. 32, pp. 1039–1048, Sept. 2009.

[28] G. Prytz and S. Johannessen, “Real-time performance measurements using UDP on windows and linux,” in International Conference on Emerging Technologies and Factory Automation 2005, vol. 2, pp. 8 pp.–932, IEEE, Sept. 2005.

[29] CERN, “ATLAS - the technical challenges,” June 2011. Available at http://www.atlas.ch/atlas_brochures_pdf/tech_brochure-11.pdf.

[30] S. Stancu, M. Ciobotaru, C. Meirosu, L. Leahu, and B. Martin, “NETWORKS FOR ATLAS TRIGGER AND DATA ACQUISITION,” Feb. 2006. Available at http://indico.cern.ch/getFile.py/access?contribId=289&sessionId=6&resId=1&materialId=paper&confId=048.

[31] S. Stancu, M. Ciobotaru, and K. Korcyl, “ATLAS TDAQ DataFlow network architecture analysis and upgrade proposal,” IEEE Transactions on Nuclear Science, vol. 53, pp. 826–833, June 2006.

[32] R. E. Hughes-Jones, “The use of TCP/IP for real-time messages in ATLAS Trigger/DAQ.” Available at https://edms.cern.ch/file/393752/1/DC-062.pdf, 2003.

[33] X. Zhang, M. Gao, and G. Liu, “A scalable heterogeneous multi-processor signal processing system based on the RapidIO interconnect,” in Intelligent Information Technology Application Workshops, 2008. IITAW’08. International Symposium on, pp. 761–764, 2008.

[34] J. Zhang, H.-b. Su, Q.-z. Wu, and J. Zhang, “Research and implement of serial RapidIO based on Mul-DSP,” in International Conference on Computational Intelligence and Software Engineering, 2009, pp. 1–4, IEEE, Dec. 2009.

[35] J. Adams, C. Katsinis, W. Rosen, D. Hecht, V. Adams, H. V. Narravula, S. Sukhtankar, and R. Lachenmaier, “Simulation experiments of a high-performance RapidIO-based processing architecture,” in IEEE International Symposium on Network Computing and Applications, 2001. NCA 2001, pp. 336–339, IEEE, 2001.

[36] J. Santos, M. Zilker, L. Guimarais, W. Treutterer, C. Amador, and M. Manso, “COTS-Based High-Data-Throughput acquisition system for a Real-Time reflectometry diagnostic,” IEEE Transactions on Nuclear Science, vol. 58, pp. 1751–1758, Aug. 2011.


[37] A. Nishida, “Building cost effective high performance computing environment via PCI express,” in 2006 International Conference on Parallel Processing Workshops, 2006. ICPP 2006 Workshops, pp. 8 pp.–526, IEEE, 2006.

[38] Y. Watanabe, A. Yamada, M. Nitta, and K. Kato, “Pseudo-real-time control of a USB I/O device under windows 7,” in 2010 International Conference on Control Automation and Systems (ICCAS), pp. 975–980, IEEE, Oct. 2010.

[39] F. A. Jolfaei, N. Mohammadizadeh, M. S. Sadri, and F. FaniSani, “High speed USB 2.0 interface for FPGA based embedded systems,” in 4th International Conference on Embedded and Multimedia Computing, 2009. EM-Com 2009, pp. 1–6, IEEE, Dec. 2009.

[40] D. Bortolotti, A. Carbone, D. Galli, I. Lax, U. Marconi, G. Peco, S. Perazzini, V. M. Vagnoni, and M. Zangoli, “Comparison of UDP transmission performance between IP-Over-InfiniBand and 10-Gigabit ethernet,” IEEE Transactions on Nuclear Science, vol. 58, pp. 1606–1612, Aug. 2011.

[41] B. Bogdanski, F. O. Sem-Jacobsen, S. A. Reinemo, T. Skeie, L. Holen, and L. P. Huse, “Achieving predictable high performance in imbalanced fat trees,” in 2010 IEEE 16th International Conference on Parallel and Distributed Systems (ICPADS), pp. 381–388, IEEE, Dec. 2010.

[42] D. Bortolotti, A. Carbone, D. Galli, I. Lax, U. Marconi, G. Peco, S. Perazzini, V. Vagnoni, and M. Zangoli, “High rate packet transmission via IP-over-InfiniBand using commodity hardware,” in Real Time Conference (RT), 2010 17th IEEE-NPSS, pp. 1–6, IEEE, May 2010.

[43] H. Zhang, W. Huang, J. Han, J. He, and L. Zhang, “A performance study of java communication stacks over InfiniBand and giga-bit ethernet,” in International Conference on Network and 2007, pp. 602–607, IEEE, Sept. 2007.

[44] S. Addagatla, M. Shaw, S. Sinha, P. Chandra, A. S. Varde, and M. Grinkrug, “Direct network prototype leveraging light peak technology,” in IEEE Symposium on High Performance Interconnects 2010, pp. 109–112, IEEE, Aug. 2010.

[45] M. Abdallah and O. Elkeelany, “A survey on data acquisition systems DAQ,” in International Conference on Computing, Engineering and Information 2009, pp. 240–243, IEEE, Apr. 2009.

[46] W. Lixin, S. Wei, and L. Chao, “Implementation of high speed real time data acquisition and transfer system,” in Industrial Electronics and Applications, 2009. ICIEA 2009. 4th IEEE Conference on, pp. 382–386, IEEE, May 2009.


[47] H. Ning, W. Hua, X. Jianping, J. Changlong, and J. Huibo, “An implementation of sustained real-time radar data recording system on FPGA,” in Communications, Circuits and Systems and West Sino Expositions, IEEE 2002 International Conference on, vol. 2, pp. 1526–1529, IEEE, 2002.

[48] F. Li, X. Ji, X. Li, and K. Zhu, “DAQ architecture design of daya bay reactor neutrino experiment,” IEEE Transactions on Nuclear Science, vol. 58, pp. 1723–1727, Aug. 2011.

[49] X. Ji, F. Li, M. Ye, Y. An, K. Zhu, and Y. Wang, “Research and design of DAQ system for daya bay reactor neutrino experiment,” in Nuclear Science Symposium Conference Record, 2008. NSS ’08. IEEE, pp. 2119–2121, IEEE, Oct. 2008.

[50] C. Bigongiari, “Km3NeT, a deep sea challenge for neutrino astronomy,” in International Conference on Sensor Technologies and Applications, 2007. SensorComm 2007, pp. 248–253, IEEE, Oct. 2007.

[51] S. Anvar, “Data acquisition architecture studies for the KM3NeT deep sea neutrino telescope,” in IEEE Nuclear Science Symposium Conference Record, 2008. NSS ’08, pp. 3558–3561, IEEE, Oct. 2008.

[52] K. Krygier and K. Merle, “MECDAS-a distributed data acquisition system for experiments at MAMI,” IEEE Transactions on Nuclear Science, vol. 41, pp. 86–88, Feb. 1994.

[53] A. Belias, G. Crone, E. Falk Harris, C. Howcroft, S. Madani, T. Nicholls, G. Pearce, D. Reyna, N. Tagg, and M. Thomson, “The MINOS data acquisition system,” in Nuclear Science Symposium Conference Record 2003, pp. 1663–1667 Vol.3, IEEE, 2003.

[54] Y. Sugaya, J. Ahn, H. Akimune, Y. Asano, W. Chang, S. Date, M. Fujiwara, K. Hicks, T. Hotta, K. Imai, T. Ishikawa, T. Iwata, H. Kawai, Z. Kim, Y. Kishimoto, N. Kumagai, S. Makino, N. Matsukoka, T. Matsumura, T. Mibe, S. Minami, M. Miyabe, Y. Miyachi, T. Nakano, M. Nomachi, Y. Ohashi, T. Ooba, C. Rangacharylu, A. Sakaguchi, T. Sasaki, D. Seki, H. Shimizu, M. Sumihama, H. Toki, T. Toyama, H. Toyokawa, A. Wakai, C. Wang, W. Wang, T. Yonehara, T. Yorita, and M. Yosoi, “DAQ system for LEPS experiment,” IEEE Transactions on Nuclear Science, vol. 48, pp. 1282–1285, Aug. 2001.

[55] D. Gold and K. Anantha, “Software development and real-time target systems on a common backplane,” in Signals, Systems and Computers, 1991. 1991 Conference Record of the Twenty-Fifth Asilomar Conference on, pp. 69–73, IEEE Comput. Soc. Press, 1991.


[56] G. Avolio, “DAQ system at the 2002 ATLAS muon test beam,” IEEE Transactions on Nuclear Science, vol. 51, pp. 2081–2085, Oct. 2004.

[57] M. Thorpe, C. Angelsen, G. Barr, C. Metelko, T. Nicholls, G. Pearce, and N. West, “The T2K near detector data acquisition systems,” IEEE Transactions on Nuclear Science, vol. 58, pp. 1800–1806, Aug. 2011.

[58] D. G. Phillips, T. Bergmann, T. J. Corona, F. Frankle, M. A. Howe, M. Kleifges, A. Kopmann, M. Leber, A. Menshikov, D. Tcherniakhovski, B. VanDevender, B. Wall, J. F. Wilkerson, and S. Wustling, “Characterization of an FPGA-based DAQ system in the KATRIN experiment,” in Nuclear Science Symposium Conference Record (NSS/MIC), 2010 IEEE, pp. 1399–1403, IEEE, Oct. 2010.

[59] L. L. Peterson, Computer Networks ISE: A Systems Approach. Morgan Kaufmann, 2007.

[60] Cadence Design Systems, “Clock domain crossing, closing the loop on clock domain functional implementation problems,” technical paper, Cadence Design Systems, Dec 2004.

[61] Xilinx, Virtex-7 FPGAs Data Sheet: DC and Switching Characteristics, Feb 2012.

[62] Altera, “Analog devices link-port reference design,” application note, Altera, February 2005. AN332.

[63] N. Sawyer, “Interfacing virtex-ii series fpgas with analog devices tigersharc ts20x dsps via lvds link ports,” Application Note XAPP635, Xilinx, February 2005.

[64] M. Defossez, “Virtex-4 interface to an analog devices adsp-ts20xx link port,” Application Note XAPP727, Xilinx, January 2006.

[65] Xilinx, Virtex-6 FPGA Embedded Tri-Mode Ethernet MAC Wrapper v1.4. Xilinx, 1.4 ed., April 2010. DS710.

[66] Xilinx, LogiCORE IP Virtex-6 FPGA Integrated Block v2.5 for PCI Express. Xilinx, 2.5 ed., January 2012. DS800.

[67] Xilinx, LogiCORE IP Serial RapidIO v5.6. Xilinx, 5.6 ed., March 2011. DS696.

[68] A. Goldhammer and J. J. Ayer, “Understanding performance of PCI express systems.” Available at http://www.xilinx.com/support/documentation/white_papers/wp350.pdf, Sept. 2008.

[69] Xilinx, Virtex-6 Libraries Guide for HDL Designs. Xilinx, 12.3 ed., September 2010.


[70] H. Zimmermann, “OSI reference Model–The ISO model of architecture for open systems interconnection,” IEEE Transactions on Communications, vol. 28, pp. 425–432, Apr. 1980.

[71] J. Day and H. Zimmermann, “The OSI reference model,” Proceedings of the IEEE, vol. 71, no. 12, pp. 1334–1340, 1983.

[72] P. Norton, Peter Norton’s complete guide to networking. Indianapolis, Ind.; Hemel Hempstead: Sams; Prentice Hall, 1999.

[73] M. G. Naugle, Network protocols. New York: McGraw-Hill, 1999.

[74] A. H. Wilen, J. P. Schade, and R. Thornburg, Introduction to PCI Express : a hardware and software developer’s guide. Hillsboro, Or.: Intel Press, 2003.

[75] National Instruments, “PCI express an overview of the PCI express standard.” Available at http://zone.ni.com/devzone/cda/tut/p/id/3767, Aug. 2009.

[76] GE Intelligent Platforms, “PCI express Peer-to-Peer interconnect.” Available at http://defense.ge-ip.com/library/detail/12854, 2011.

[77] C. E. Spurgeon, Ethernet: the definitive guide. Beijing [etc.]: O’Reilly, 2000.

[78] Institute of Electrical and Electronics Engineers and IEEE-SA Standards Board, IEEE standard for information technology - telecommunications and information exchange between systems - local and metropolitan area networks - specific requirements. Part 3, Amendment 4, Carrier sense multiple access with collision detection (CSMA/CD) access method and physical layer specifications. Media access control parameters, physical layers, and management parameters for 40 Gb/s and 100 Gb/s operation. New York: Institute of Electrical and Electronics Engineers, 2010.

[79] C. Callegari, S. Giordano, M. Pagano, and T. Pepe, “Behavior analysis of TCP linux variants,” Computer Networks, vol. 56, pp. 462–476, Jan. 2012.

[80] Institute of Electrical and Electronics Engineers, “IEEE standard for information Technology–Telecommunications and information exchange between Systems–Local and metropolitan area Networks–Specific requirements part 3: Carrier sense multiple access with collision detection (CSMA/CD) access method and physical layer specifications - section three,” 2008.

[81] Institute of Electrical and Electronics Engineers, “IEEE standard for information Technology–Telecommunications and information exchange between Systems–Local and metropolitan area Networks–Specific requirements part 3: Carrier sense multiple access with collision detection (CSMA/CD) access method and physical layer specifications - section one,” 2008.


[82] Institute of Electrical and Electronics Engineers, “IEEE standard for information technology: Telecommunications and information exchange between systems Local and metropolitan area networks Specific requirements Part 2: Logical link control,” 1998. ISO/IEC 8802-2:1998.

[83] T. Steinbach, F. Korf, and T. Schmidt, “Comparing time-triggered ethernet with FlexRay: an evaluation of competing approaches to real-time for in-vehicle networks,” in Factory Communication Systems (WFCS), 2010 8th IEEE International Workshop on, pp. 199–202, 2010.

[84] K. Muller, T. Steinbach, F. Korf, and T. C. Schmidt, “A real-time ethernet prototype platform for automotive applications,” in 2011 IEEE International Conference on Consumer Electronics - Berlin (ICCE-Berlin), pp. 221–225, IEEE, Sept. 2011.

[85] Y. Takayanagi and T. Akima, “Latest trend of industrial Real-Time ethernet for the SICE-ICASE international joint conference 2006 (SICE-ICCAS 2006),” in SICE-ICASE, 2006. International Joint Conference, pp. 165–169, IEEE, Oct. 2006.

[86] L. Seno and C. Zunino, “A simulation approach to a Real-Time ethernet protocol: EtherCAT,” in Emerging Technologies and Factory Automation, 2008. ETFA 2008. IEEE International Conference on, pp. 440–443, 2008.

[87] W. Wolf, Computers as components: principles of embedded computing system design. Amsterdam; Boston: Elsevier/Morgan Kaufmann, 2008.

[88] T. Skeie, S. Johannessen, and O. Holmeide, “Timeliness of real-time IP communication in switched networks,” Industrial Informatics, IEEE Transactions on, vol. 2, no. 1, pp. 25–39, 2006.

[89] C. L. Liu and J. W. Layland, “Scheduling algorithms for multiprogramming in a Hard-Real-Time environment,” J. ACM, vol. 20, pp. 46–61, Jan. 1973.

[90] J. Postel, “Internet protocol,” Sept. 1981. RFC791 at http://tools.ietf.org/html/rfc791.

[91] J. Postel, “Assigned numbers,” Sept. 1981. RFC790 at http://tools.ietf.org/html/rfc790.

[92] J. Postel, “User datagram protocol,” Aug. 1980. RFC768 at http://tools.ietf.org/html/rfc768.

[93] C. J. Kale and T. J. Socolofsky, “TCP/IP tutorial,” Jan. 1991. RFC1180 at http://tools.ietf.org/html/rfc1180.

[94] R. Guillier, S. Soudan, and P. V. Primet, “TCP variants and transfer time predictability in very high speed networks,” in High-Speed Networks Workshop, 2007, pp. 6–10, IEEE, May 2007.


[95] J. Hurwitz and W. chun Feng, “Initial end-to-end performance evaluation of 10-Gigabit ethernet,” in 11th Symposium on High Performance Interconnects, 2003. Proceedings, pp. 116–121, IEEE, Aug. 2003.

[96] T. Uchida, “Hardware-Based TCP processor for gigabit ethernet,” IEEE Transactions on Nuclear Science, vol. 55, pp. 1631–1637, June 2008.

[97] N. Alachiotis, S. A. Berger, and A. Stamatakis, “Efficient PC-FPGA communication over gigabit ethernet,” in 2010 IEEE 10th International Conference on Computer and Information Technology (CIT), pp. 1727–1734, IEEE, July 2010.

[98] D. Dalessandro and P. Wyckoff, “A performance analysis of the ammasso RDMA enabled ethernet adapter and its iWARP API,” in Cluster Computing, 2005. IEEE International, pp. 1–7, IEEE, Sept. 2005.

[99] M. J. Rashti and A. Afsahi, “10-Gigabit iWARP ethernet: Comparative performance analysis with InfiniBand and Myrinet-10G,” in Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, pp. 1–8, IEEE, Mar. 2007.

[100] A. D. George and C. T. Cole, “Comparative performance analysis of RDMA-Enhanced ethernet.” Available at http://www.cercs.gatech.edu/ hpidc2005/presentations/CaseyReardon.pdf, 2005.

[101] Q. Wang, D. Lv, and F. Zhou, “Reconfigurable RDMA communication framework of MULTI-DSP,” Journal of Electronics (China), vol. 26, pp. 380–386, May 2009.

[102] S. Audityan, “Implementing the RapidIO interconnect specification part i – understanding the RapidIO interconnect specification.” Available at www.analogzone.com/iot_1117.pdf, 2004.

[103] W. J. Dally and B. Towles, Principles and practices of interconnection networks. Amsterdam; San Francisco: Morgan Kaufmann Publishers, 2004.

[104] J. Axelson, USB complete the developer’s guide. Madison, Wis.: Lakeview Research LLC, 4 ed., 2009.

[105] USB Implementers Forum, “USB.org - getting a vendor ID.” Available at http://www.usb.org/developers/vendor/, March 2012.

[106] “InfiniBand trade association: Home.” Available at http://www.infinibandta.org/index.php, March 2012.

[107] InfiniBand Trade Association, “InfiniBandTM architecture specification volume 1 release 1.2.1,” Nov. 2007.


[108] InfiniBand Trade Association, “InfiniBandTM architecture specification volume 2 release 1.2.1,” Oct. 2006.

[109] P. A. Franaszek and A. X. Widmer, “United states patent: 4486739 - byte oriented DC balanced (0,4) 8B/10B partitioned block transmission code,” Dec. 1984.

[110] “The ATLAS experiment.” Available at http://www.atlas.ch/index.html, March 2012.

[111] H. Beck, R. Dobinson, K. Korcyl, and M. LeVine, ATLAS TDAQ: A Network-based Architecture. Available at https://edms.cern.ch/file/391592/2.2/DC-059.pdf, Feb. 2003.

[112] M. Ciobotaru, S. Stancu, M. LeVine, and B. Martin, “GETB, a gigabit ethernet application platform: its use in the ATLAS TDAQ network,” in Real Time Conference, 2005. 14th IEEE-NPSS, p. 6 pp., IEEE, 2005.

[113] C. Meirosu, B. Martin, A. Topurov, and A. Al-Shabibi, “Planning for predictable network performance in the ATLAS TDAQ.” Available at http://indico.cern.ch/getFile.py/access?contribId=41&sessionId=2&resId=0&materialId=paper&confId=048.

[114] C. Haeberli, A. dos Anjos, H. Beck, A. Bogaerts, D. Botterill, S. Gadomski, P. Golonka, R. Hauser, M. LeVine, R. Mommsen, V. Reale, S. Stancu, J. Schlereth, P. Werner, F. Wickens, and H. Zobernig, “ATLAS TDAQ DataCollection software,” IEEE Transactions on Nuclear Science, vol. 51, pp. 585–590, June 2004.

[115] P. Golonka, “Linux network performance study for the ATLAS data flow system - draft 0.50.” Available at https://edms.cern.ch/file/368844/0.50/LinuxNetPerf.pdf, March 2003.

[116] R. Jones, S. Kolos, L. Mapelli, and Y. Ryabov, “Applications of CORBA in the atlas prototype DAQ,” in Real Time Conference 1999, pp. 469–474, IEEE, 1999.

[117] Object Management Group, Inc, “CORBA FAQ.” Available at http://www.omg.org/gettingstarted/corbafaq.htm, March 2012.

[118] J. Vermeulen, M. Abolins, and Alexandrav, “ATLAS DataFlow: the read-out subsystem, results from trigger and data-acquisition system testbed studies and from modeling,” in Real Time Conference 2005. 14th IEEE-NPSS, p. 5 pp., IEEE, 2005.

Part III

Appendices


Appendix A

Abbreviations

B       Byte or Octet, 8 bits
BRAM    Block RAM
CML     Current Mode Logic
CORBA   Common Object Request Broker Architecture
COTS    Component Off The Shelf or Commodity Off The Shelf
DAV     Data AVailable
ECL     Emitter Coupled Logic
EMI     Electro-Magnetic Interference
EMS     Electro-Magnetic Susceptibility
FF      Flip-Flops
FPGA    Field Programmable Gate Array
FSM     Finite State Machine
GbE     Gigabit Ethernet (1000BASE-T, -X, etc.)
IB      Infiniband
IP      Internet Protocol or Intellectual Property
LUT     Look Up Table
LVDS    Low-Voltage Differential Signalling
MAC     Media Access Control
MGT     Multi Gigabit Transceiver
NIC     Network Interface Card
OoO     Out-of-order
PCB     Printed Circuit Board
PCIe    PCI Express
RAM     Random Access Memory
RDMA    Remote Direct Memory Access
RNIC    RDMA-enhanced NIC
RTT     Round Trip Time
SerDes  Serialiser / Deserialiser


Appendix B

A Selection of Used Xilinx Primitives

Primitive   Usage
BUFG        A clock buffer for clocking the global clock nets. Used for high-fanout clocks.
BUFIO       A clock buffer for clocking input and/or output registers.
BUFR        A regional clocking buffer used to clock a regional clocking net. The regional net is larger than the net the BUFIO may drive.
IBUFDS      A differential clock input buffer. Turns the input differential signal into a single-ended signal inside the FPGA fabric.
IBUFGDS     A differential signal input buffer. Turns the input differential signal into a single-ended signal inside the FPGA fabric and is able to drive a BUFG or MMCM.
ISERDES1    A built-in deserialiser which takes a serial input and turns it into a parallel output.
MMCM        Used for synthesising different clock frequencies inside the FPGA fabric.
OSERDES1    A serialiser circuit that takes wide data and produces a serial output at either SDR or DDR.

Table B.1. A list of used Xilinx primitives. The definition of what the primitives do is cited from [69].


Appendix C

Selection of Needed Constraints

INST "LinkPort_ChecksumWideMulticycle_checksum_int[7]_dff_15_7" TNM = ChksumOutputReg;
INST "LinkPort_ChecksumWideMulticycle_checksum_int[7]_dff_15_0" TNM = ChksumOutputReg;
INST "LinkPort_ChecksumWideMulticycle_checksum_int[7]_dff_15_1" TNM = ChksumOutputReg;
INST "LinkPort_ChecksumWideMulticycle_checksum_int[7]_dff_15_2" TNM = ChksumOutputReg;
INST "LinkPort_ChecksumWideMulticycle_checksum_int[7]_dff_15_3" TNM = ChksumOutputReg;
INST "LinkPort_ChecksumWideMulticycle_checksum_int[7]_dff_15_4" TNM = ChksumOutputReg;
INST "LinkPort_ChecksumWideMulticycle_checksum_int[7]_dff_15_5" TNM = ChksumOutputReg;
INST "LinkPort_ChecksumWideMulticycle_checksum_int[7]_dff_15_6" TNM = ChksumOutputReg;
TIMESPEC TS_multi_chksum = FROM "FFS" TO "ChksumOutputReg" 20 ns;

Listing C.1. Constraints specifying that the checksum should be treated as a multicycle path

INST "DataIn_n<0>" TNM = DataIn;
INST "DataIn_n<1>" TNM = DataIn;
INST "DataIn_n<2>" TNM = DataIn;
INST "DataIn_n<3>" TNM = DataIn;
INST "DataIn_p<0>" TNM = DataIn;
INST "DataIn_p<1>" TNM = DataIn;
INST "DataIn_p<2>" TNM = DataIn;
INST "DataIn_p<3>" TNM = DataIn;
TIMEGRP "DataIn" OFFSET = IN 0.5 ns VALID 1 ns BEFORE "ClkIn_p" RISING;
TIMEGRP "DataIn" OFFSET = IN 0.5 ns VALID 1 ns BEFORE "ClkIn_p" FALLING;

Listing C.2. Constraints specifying the setup and hold times of the Link Port receiver


Appendix D

The OSI Model

In the years around 1980, a lot of work was done to standardise communication. This was done in order to conform the industry to a standard before the market grew big and out of hand. That is why a general communications model, called the OSI reference model, was introduced [70, 71]. This model is a seven-layer model where layer N may only talk to layers N+1 and N-1. The N:th layer puts requests to the (N-1):th layer and, after the response, provides its own response to layer N+1. As seen in Figure D.1, the layers communicate only with the ones just above or just below, but when communicating with other devices, the N:th layer of device 1 communicates logically directly with the N:th layer of device 2, hence making the lower layers transparent. The ability to make lower layers transparent improves portability, since the underlying network topology and applications never need to be considered when implementing a higher-layer protocol. Below follows a short description of each of the layers, with help from [70, 71].
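To make the layering concrete, the following sketch (illustrative only, not from the thesis) shows how each layer wraps the data from the layer above with its own header on the way down, and strips it on the way up, so that peer layers see each other's data directly:

# Illustrative only: three of the seven layers, each adding its own header.
layers = ["transport", "network", "data link"]

def send_down(payload: bytes) -> bytes:
    for layer in layers:                    # traverse the stack downwards
        payload = f"[{layer}]".encode() + payload
    return payload                          # what goes onto the medium

def receive_up(frame: bytes) -> bytes:
    for layer in reversed(layers):          # traverse upwards on the peer
        header = f"[{layer}]".encode()
        assert frame.startswith(header)     # peer layers must agree
        frame = frame[len(header):]
    return frame

wire = send_down(b"application data")
print(wire)                # b'[data link][network][transport]application data'
print(receive_up(wire))    # b'application data'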

D.1 Physical Layer

The physical layer is the lowest layer of the model. In this layer, all actual data communication takes place; it specifies how the physical transaction takes place, be it by electrical, optical, mechanical or any other physical means of transferring data. However, it only specifies how to put signals onto, and read them out from, the transmission medium, e.g. a cable. The medium itself is not included, except for some performance metrics for how it should behave [72].

D.2 Data Link Layer

On the data link layer, there is specific support for different physical layers. Different physical media require different control, and this is supported in the data link layer. It also provides means for error correction on the physical layer through a data integrity check [73]. Data units used on this layer are called


[Figure: the seven OSI layers (Application, Presentation, Session, Transport, Network, Data Link, Physical), shown side by side for two communicating devices]

Figure D.1. A sketch of the OSI reference model. The dashed lines show how the layers communicate logically directly on the same level; the actual transaction, however, takes the way of the dotted line, down through all lower layers and then up again on the receiver side to the equally high layer.

frames, and each frame that is sent is acknowledged to its sender [72]. Finally, in this layer, physical addressing is taken care of.

D.3 Network Layer

The network layer is independent of how data is transferred and takes care of the aspects required by the higher layers of the network. The network layer also takes care of routing data in the most economic way [73]. One of the downsides of the network layer is the lack of a native error correction mechanism; hence it is important that the data integrity algorithm in the data link layer is reliable [72].

D.4 Transport Layer

The transport layer is concerned only with end-to-end transfer. This layer takes care of providing communication between endpoints and tries to optimise network usage. In this layer, packets that arrive out of order (OoO) are rearranged into the sequence in which they were originally sent [73, 72]. Furthermore, the transport layer provides more error detection capabilities and may re-request packets which are dropped during transmission in lower layers.


D.5 Session Layer

On the session layer, two entities are connected through either simplex, half-duplex or full-duplex communication. It also provides tokens to determine what kind of data structures have been sent, and it controls the communication between entities. In this layer, there is also support for synchronisation points to provide error recovery strategies. It also ensures that requests are completed in order [72].

D.6 Presentation Layer

The presentation layer is used by the application layer to understand the data that has been transmitted, since the underlying architectures may have different ways of representing the data [72]. The layer therefore performs some encoding and decoding. This is also where any encryption and decryption in a network transaction takes place [73]. The layer is independent of the underlying protocols, which eases the development of application layer protocols.

D.7 Application Layer

The application layer is the layer closest to the user, and it is in some sense the API to which a user program issues a request in order to traverse down and up again through the layers of the OSI model [72].


Appendix E

PCI Express

PCI Express (PCIe) is a serial protocol where two devices communicate in a point-to-point fashion in full-duplex mode [68]. Unlike a traditional bus, PCI Express emulates a multidrop bus by using switches [9]. Furthermore, it utilises a serial communication strategy on differential lines instead of the traditional parallel single-ended ones. On the top layers, it is completely transparent to operating systems, which are able to communicate with it as with a traditional PCI device. This means that for most applications, PCI Express is only new up to the third layer of the OSI model (Appendix D); the layers above remain unaffected [9]. Currently there are three major versions of PCIe, namely PCI Express 1.0, 2.0 and 3.0. These are largely similar to each other, with doubled effective transfer speed for each generation, from 2 Gbps in version 1.0 to 8 Gbps in version 3.0 [17]. In this section, most of the focus is on the first version of PCIe. As with most other high-speed serial communication, PCI Express embeds its clock into the data signal using 8B/10B encoding [9], which adds a 25% overhead to the physical transactions; put the other way around, only 80% of the bits sent are actual payload (see Appendix L). This encoding also ensures that the data has a bounded running disparity and a guaranteed minimum switching frequency for any input pattern. However, the inefficiency of 8B/10B encoding has been mitigated somewhat in version 3.0, where the encoding has changed to 128B/130B [17]. The encoding is only visible to the physical layer of PCI Express, as is the partitioning of bytes in a multi-lane PCI Express port. There may be 1, 2, 4, 8, 16 or 32 parallel lanes of serial transmission, and if more than one lane exists, data is partitioned equally between them. However, all this negotiation is transparent to all but the physical layer. Each transmission in the physical layer is made in frames. Each frame starts with a start-of-frame (SOF) character and ends with an end-of-frame character [74]. These characters add only two bytes of extra overhead, as seen in Figure E.1, but enable the receiver to know exactly when a frame starts and when it ends. The SOF character furthermore differs depending on which layer the message originates from [74].


[Figure: PCI Express physical layer packet layout, bits 0-31: Start Of Frame; Sequence Number; TLP Header (3-4 words); Data (0-1024 words); Optional CRC; LCRC; End Of Frame. The sequence number and LCRC belong to the data link layer; the TLP header, data and optional CRC belong to the transaction layer.]

Figure E.1. A PCI Express physical layer packet, showing which layer each part belongs to. The only parts which are specific to the physical layer are the Start Of Frame (SOF) and the End Of Frame (EOF).

At the data link layer, the concept of data integrity is maintained. To each packet, a CRC number is attached to ensure that there has been no data corruption on the way [9, 74], seen in Figure E.1 as the LCRC. It also adds a 16-bit sequence number to ensure in-order delivery [74]. If data has been corrupted or received out of order, the link layer will automatically retry sending the packet, hence making sure that no higher-level protocol needs to intervene to ensure data integrity. Furthermore, the data link layer implements a credit-based flow-control algorithm, meaning that it always keeps track of how many empty spaces there are in the receiver's buffer. However, in order not to send too many credits back, a PCIe device usually sends one credit update per several packets [68]. This policy of aggregating several items before sending back information is also utilised for the acknowledgement returns, which are usually only sent after some number of packets.

The transaction layer creates the packets (TLPs) for the PCIe communication [9]. This is the point where software packets traverse down into the communication. Each packet is equipped with a header which specifies where the packet is bound, its length and some status information, and it may end with another CRC number for improved integrity [68]. The information about the packet is contained in the TLP header (Figure E.1), and the end-to-end CRC is actually an optional field [68].

On top of the transaction layer there is the software layer. This layer gives the user a connection to the PCIe bus, and when creating the PCIe standard, care was taken to make sure that PCIe would work on all pre-existing operating systems without modification, implying that PCIe works using only a PCI-style operating system driver [9, 75]. Furthermore, the enhanced features of PCIe become available if a PCIe driver is created [75]. Although mostly used for host-to-device communication, PCI Express may be used for peer-to-peer communication as well [76]. Two host devices may be directly connected and use PCIe for high-speed communication, or more units may be connected together through switches, creating a PCIe network [76]. One of the main advantages of PCIe compared to other communication techniques is its small need for silicon area [76].
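The credit-based flow control described above can be illustrated with a toy simulation (illustrative only; the buffer size, batch size and drain rate are made-up values, whereas real devices negotiate them at link initialisation):

# Toy model of PCIe-style credit-based flow control with batched updates.
BUFFER_SLOTS = 8          # assumed receive buffer size
CREDIT_BATCH = 4          # assumed batching of credit updates

credits = BUFFER_SLOTS    # transmitter's view of free receiver slots
rx_queue = 0              # packets waiting in the receiver's buffer
freed = 0                 # drained slots not yet reported back
sent = stalled = 0

for cycle in range(64):
    if credits > 0:       # one credit is consumed per packet sent
        credits -= 1
        rx_queue += 1
        sent += 1
    else:
        stalled += 1      # transmitter must wait for a credit update
    if rx_queue > 0 and cycle % 2 == 0:    # receiver drains at half rate
        rx_queue -= 1
        freed += 1
    if freed >= CREDIT_BATCH:
        credits += freed  # one batched update instead of many small ones
        freed = 0

print(f"sent={sent}, stalled cycles={stalled}")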

E.1 Associated Overhead

With PCI Express, as with almost every other protocol, there is an associated overhead. This overhead is mainly due to routing information, packet bit integrity and other control information. Furthermore, since PCIe utilises both credit-based flow control and acknowledgements, even more overhead is imposed [68]. The first thing to look at is the packet size on a PCIe network. The packet size and layout may be seen in Figure E.1. As may be seen, the maximum overhead in a packet is 28 bytes, given that the CRC code in the transaction layer and 64-bit addressing are used [68]; the minimum overhead, with no transaction-layer CRC and 32-bit addressing, is 20 bytes. This, combined with a payload of 0-4096 bytes, makes the maximum achievable link utilisation 4096/(4096 + 20) = 99.5% data bits. This is a lot and makes PCIe a very good choice for a high-efficiency network or communication link. Furthermore, each device has a maximum allowed payload [68], and when the system is initially set up, the device with the smallest allowed payload dictates the maximum payload size in the entire PCIe network. This may impact performance, since limiting the payload reduces the efficiency of the transfers. For example, if the maximum payload is 128 bytes, the maximum theoretical utilisation is limited to 128/(128 + 20) = 86.5%, which is significantly less than when the payload is four kilobytes. This makes the design of the network crucial in this sense: the designer needs to ensure that all devices can handle the payload size calculated for. Summarising, we see that the overhead associated with PCIe is not exceptional but actually rather modest, given that the entire network is set up correctly. E.g., it may be seen [68] that in real-world examples, a write over PCIe may achieve around 85% of the theoretical throughput of a link, while a read has a lower utilisation due to the latency of the memory controller and not due to the PCIe protocol.
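The payload dependence is easy to tabulate. A small sketch, using the minimum 20-byte per-packet overhead from the text:

# Link utilisation as a function of the negotiated maximum payload.
OVERHEAD = 20             # minimum per-packet overhead, in bytes

for payload in (128, 256, 512, 1024, 2048, 4096):
    utilisation = payload / (payload + OVERHEAD)
    print(f"{payload:5d} B payload -> {utilisation:.1%}")
# 128 B -> 86.5%, 512 B -> 96.2%, 4096 B -> 99.5%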


Appendix F

Gigabit Ethernet

The widespread infrastructure already built around Ethernet is a significant factor when discussing it. The enormous amount of equipment already sold to the market, and the rather well-built infrastructure that supports it, make Ethernet a very cost-effective competitor when it comes to high-speed interconnects. Ethernet has evolved from being a half-duplex 10 Mbps medium [77] when the first Ethernet standard was released, to now supporting up to 100 Gbps in full duplex over fiber-optic cable [78]. This is quite remarkable progress: the effective two-way bandwidth has increased about 20,000 times. However, the most recent technology is not mainstream yet; today the mainstream consumer standard is 1 Gbps, or Gigabit Ethernet (GbE). This makes it a gigabit transfer link at a very low cost in COTS components, which is why the focus here is on GbE.

However, GbE is only a standard up to the second layer of the OSI reference model (Appendix D). Above that, there are several different implementations of protocols which deliver packets over GbE; the most well known is the TCP/IP protocol suite, described in Appendix G. It is probably the most common protocol suite today, since it is used for most internet applications [79]. Another way of communicating over Ethernet is to send raw LLC frames, which is fast since it introduces little overhead, but unreliable due to the lack of a delivery guarantee.

Gigabit Ethernet is specified in section three of the IEEE 802.3 standard [80], which specifies several operating modes, over copper as well as over single- and multi-mode fiber optics. The different techniques differ slightly in the physical layer but are identical from the data link layer and upwards. As shown in Figure F.1, the Logical Link Control (LLC) and Media Access Control (MAC) are identical and independent of the underlying physical layer [79], conforming to the idea of layering for interoperability.

The bottom part of the data link layer, the MAC layer, specifies the fundamental frame and packet in which all transactions over Ethernet take place. The packet is shown in Figure F.2, where it may also be seen which parts of it build the frame. As specified in section one of the IEEE 802.3 standard [81], these frames


[Figure: the Gigabit Ethernet layer stack: Higher Layers; Logical Link Control; Media Access Control; Reconciliation; GMII; Physical Medium Attachment; Physical Medium Dependent; MDI; Medium, grouped into the network, data link and physical layers]

Figure F.1. An overview of the layers in Gigabit Ethernet

[Figure: Ethernet packet layout: Preamble (7), SFD (1), Destination Address (6), Source Address (6), Length/Type (2), Payload (46-1500), Frame Check Sequence (4), Inter-Frame Gap (12)]

Figure F.2. The Ethernet MAC frame with its surrounding physical layer preamble, Start-Of-Frame Delimiter and Inter-Frame Gap. The frame is marked in the image and is the actual Ethernet frame. The data outside of the frame is there to keep the link working according to its specifications.

have a size of between 64 bytes minimum and 1518 bytes maximum, given that they are basic frames. There are also Envelope frames, with a maximum length of 2000 bytes, which are intended for higher-layer protocols that need extra information [81]; however, according to the standard, these are not recommended for other purposes. In the frame, there are three fields: the length/type field, the payload and the frame check sequence [81]. The length/type field is a two-byte field that has two different interpretations depending on the value in the field. If the value is

greater than or equal to 1536, the field is interpreted as a type field and indicates which kind of upper-layer traffic the packet carries. If the value is smaller, it is interpreted as a length field and indicates the number of bytes in the payload part of the frame. The payload is, as it sounds, the place where the data from upper layers is placed in order to be sent. The size of the payload field in standard frames is between 46 and 1500 bytes; if the actual payload is less than 46 bytes, it is padded to 46 bytes [81]. The last part is the frame check sequence, a 32-bit CRC number calculated by a polynomial as stated in 3.2.9 of [81] to increase signal integrity. The CRC check covers the bits from the first bit of the destination address to the last bit of the payload. In the packet there are additional fields outside the frame. The first is the seven-byte preamble, which is used for synchronisation of the Ethernet interfaces [72]. Following the preamble is the SFD; this one-byte field signals to the receiver that the packet starts after this byte. It is followed by the destination and source MAC addresses of 48 bits each, which are the physical layer addresses of the network cards. After the addresses the frame is sent, followed by an optional extension. The extension is used in the half-duplex mode of GbE to ensure that all packets can be detected by the collision-detection algorithm CSMA/CD; it ensures that all transmissions on GbE are at least 4096 bits, or 512 bytes, long, so that collisions will be detected. Furthermore, Ethernet requires an inter-packet gap (IPG) of at least 96 bits [81], i.e. two separate transmissions over Ethernet need at least 96 bit times in between, for both sender and receiver to be allowed to recover from the last transmission. At the top of the hierarchy we have the Logical Link Control (LLC), whose frame layout is specified in [82].

F.1 Real-Time Ethernet

The area of real-time communication is an ever-growing field, and there have been several technologies to provide the means to extract data in a real-time fashion. Due to the low cost and high availability of high-performance Ethernet systems, even real-time systems are evolving towards being Ethernet-based, as in [83], where a real-time Ethernet solution is evaluated as a replacement for a standard vehicle transfer medium in cars. That study suggests that real-time Ethernet is a viable competitor to earlier real-time communication standards.

There are several different kinds of real-time implementations of Ethernet protocols, for example TTEthernet, AFDX, Profinet and EtherCAT; [84, 85] give an overview of which protocols might be suited for which application. These protocols realise their real-time behaviour in different ways: some are token-based, where only one host is the sender, while others limit device bandwidth or predefine a schedule for when the Ethernet connection is available to the sender. These approaches all have the same goal, to avoid collisions [84], which seems to lead to a predictable network that fulfils the demands of a hard real-time system [84]. There are also other implementations that use the regular TCP stack [86], but these suffer from the poor real-time performance of TCP, with long delays and non-guaranteed delivery times. This may be improved by designing the network carefully, but it still cannot guarantee hard real-time demands. By utilising other protocols with more real-time behaviour, a more reliable communication can be created, better suited for real-time use. By using a scheduler, as stated before, a lot of real-time performance may be achieved [84, 86]. The scheduling algorithm could be e.g. Earliest Deadline First (EDF) or Rate Monotonic Scheduling (RMS) [87, 88]; Liu and Layland proved in 1973 [89] that feasible schedules can be constructed for both EDF and RMS.

F.2 Efficiency of Gigabit Ethernet

The maximum theoretical throughput of GbE depends on the total overhead associated with GbE. As seen earlier in this appendix, several parts add to the overhead of a transfer. The theoretical throughput is the payload divided by the sum of payload and overhead, as seen in Equation F.1, where Θ is the throughput.

Θ = payload / (payload + overhead)    (F.1)

The overhead associated with a GbE frame is firstly the length/type field of two bytes and the frame check sequence of four bytes, giving a frame overhead of six bytes. Additionally, in the packet, there is an 8-byte start-up sequence (preamble and SFD), 12 bytes of addresses and, between packets, the inter-packet gap of 12 bytes, making the total overhead 38 bytes. To make things even worse, the minimum transmission size is 512 bytes on a network which allows collisions, which worsens the conditions for small packets, making GbE a less good transfer technique on collision-allowing networks.

In Figure F.3 we may see that it is rather inefficient to send small packets over Gigabit Ethernet, but that initially the throughput rises rather quickly. However, somewhere around 1000 bytes of payload the increase starts to level off, and payload increases need to be quite large to increase the effective throughput. It is also visible that as the payload approaches 10 kB, the effective throughput is almost the full gigabit per second.
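Equation F.1 is straightforward to evaluate with the 38-byte overhead and the 46-byte minimum payload derived above; this sketch reproduces the 975 Mbps figure quoted earlier for a 1500-byte payload:

# Theoretical GbE throughput for a given payload per frame (Eq. F.1).
# Overhead: preamble+SFD 8, addresses 12, length/type 2, FCS 4, IPG 12.
OVERHEAD = 38
MIN_PAYLOAD = 46          # smaller payloads are padded up to 46 bytes
LINE_RATE_MBPS = 1000

def gbe_throughput_mbps(payload):
    on_wire = max(payload, MIN_PAYLOAD) + OVERHEAD
    return LINE_RATE_MBPS * payload / on_wire

print(gbe_throughput_mbps(46))    # ~548 Mbps: small frames are costly
print(gbe_throughput_mbps(1500))  # ~975 Mbps: the figure quoted above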


[Figure: theoretical Ethernet throughput (Mbps) versus payload size (B), on a logarithmic x-axis, for frame sizes 1518 B, 4088 B, 9014 B and unlimited]

Figure F.3. Effective throughput of gigabit Ethernet networks with varying payload on logarithmic x-axis. The theoretical limit is plotted for several different frame sizes, i.e. different payloads in a single packet, and it is seen that it may increase performance substantially to use a larger frame size.


Appendix G

TCP/IP Protocol Suite

The TCP/IP protocol suite is one of the most widespread protocol suites for communication over Ethernet, since it is where the protocols used for Internet access reside. Here, a brief description of a selection of them will be presented.

G.1 The Internet Protocol Version 4

The Internet Protocol (IP) is a host-to-host protocol [90] which delivers datagrams from one host to another. The protocol is the underlying architecture which usually serves TCP or UDP applications. It is probably the most common protocol, since it is used for messages over the Internet, the most famous network in the world. IP handles two things: addressing and fragmentation [90]. The addressing scheme is called IP addresses, and each address is unique on the network. The address is used to route the datagrams correctly so that they reach their destination in the network. An address is always 32 bits long, but addresses are differentiated into several classes. Class A addresses are for big networks, with the first bit set to zero, the next seven bits being the network number, and the final 24 bits specifying the local address on the network. Class B has 16 bits for the network number (always starting with 10) and 16 bits for the local address, and Class C has 24 bits for the network number (always starting with 110) and the last eight bits specifying the local address on the network [90]. The other thing that IP handles is fragmentation, which is the ability to take larger packets and split them into smaller ones for transmission over networks with smaller payload sizes [90]. Datagrams may be marked not to be fragmented, but that comes with the risk of being discarded if a data link too narrow to hold the entire packet has to be traversed. If a packet is fragmented, a status flag is set in the IP header. Then, for each fragment, an offset is specified in the header of that fragment. That way, no two fragments can be misinterpreted, since no two fragments may have the same offset. The fields of the IPv4 header [90], as may be seen in Figure G.1, are used in


[Figure: IPv4 header layout, bits 0-31: Version, Header Length, Type Of Service, Total Length; Identification, Flags, Fragment Offset; Time To Live, Protocol, Header Checksum; Source Address; Destination Address; Options, padded to a 32-bit boundary; followed by the data]

Figure G.1. An IPv4 Packet Header with optional options field following it

order to route the message correctly and to interpret it correctly. The first field, the version, tells which version of the IP protocol is used. This is followed by a four-bit field with the length (in 32-bit words) of the header. The header is always at least five words, and may span up to 15 (the maximum value of a 4-bit number) at most, in which case ten words of options follow the address fields. The Type of Service field is an 8-bit field which indicates some quality-of-service parameters that might be used for switching and routing of messages [90]. For example, a message may specify that it needs low delay, and in that case the routing may be done in an attempt to reduce delay. The following field specifies the total length of the IP datagram, which may be at most 64 kB of data. The IP specification [90] says that the total length may be as much as 65,535 bytes, but only requires that a host be able to accept 576 bytes of data. The next word contains fields for reassembly of fragmented packets [90]. The Identification field holds a 16-bit identifier for the datagram. The flags tell whether the packet is fragmented, and whether this packet is the last fragment in the series. The final field is a 13-bit field specifying the offset of this packet with respect to the first packet in a fragmentation. The offset is specified in number of double-words, i.e. 64 bits times the value in the offset field is the actual offset in bits. The third word of the header starts with the time-to-live, which is a mechanism to remove packets that cannot be delivered in a network, after a specified time or a specified maximum number of hops [90]. The next field is the Protocol, which specifies what upper-level protocol is used above IP; some examples are TCP, which has the number six (6), and UDP, which has the number seventeen (17) [91]. This word ends with a 16-bit checksum to ensure the data integrity of the IP header [90]. It is important to note that this checksum only ensures the integrity of the header, and not of the rest of the datagram. The final two words are the source and the destination address, which are composed as stated above. They are

These addresses are unique and are used when making routing decisions. After the address fields, optional Option words may follow if the header length is greater than five. The options may be used to select a security class, or to tell routers which routing algorithm to apply [90]. One final point about the Internet Protocol is that it is media independent: even though it is sometimes treated as a synonym for Ethernet, there is nothing in either the Ethernet or the IP standard that ties them to each other.
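As an illustration of the header layout described above, the following Python sketch parses the fixed 20-byte IPv4 header and verifies its checksum. It is a minimal illustration, not part of any implementation in this work; the field positions follow Figure G.1.

import struct

def ip_checksum(header: bytes) -> int:
    """One's-complement sum of 16-bit words, as used for the IPv4 header."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def parse_ipv4(header: bytes) -> dict:
    """Parse the fixed part of an IPv4 header (the first 20 bytes)."""
    (ver_ihl, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", header[:20])
    return {
        "version": ver_ihl >> 4,
        "header_words": ver_ihl & 0x0F,                # header length in 32-bit words
        "type_of_service": tos,
        "total_length": total_len,                     # bytes, header plus data
        "more_fragments": bool(flags_frag & 0x2000),   # "more fragments" flag
        "fragment_offset": (flags_frag & 0x1FFF) * 8,  # field counts 8-byte units
        "time_to_live": ttl,
        "protocol": proto,                             # 6 = TCP, 17 = UDP
        "source": ".".join(str(b) for b in src),
        "destination": ".".join(str(b) for b in dst),
        "checksum_ok": ip_checksum(header[:20]) == 0,  # sums to zero when intact
    }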

G.1.1 Efficiency of the Internet Protocol Datagrams

[Figure G.2: a plot of theoretical IP throughput (Mbps, 0-1000) versus payload size (B, logarithmic scale from 10^0 to 10^5), for Ethernet frame sizes 1518 B, 4088 B and 9014 B, and for an unlimited frame size.]

Figure G.2. Maximum theoretical throughput when sending standard IP datagrams over an Ethernet link with varying IP payload. The graph shows several existing Ethernet frame sizes and their IP throughput, compared to an "unlimited" packet, i.e. a packet of arbitrary size.

The IP datagram has the potential to be a very efficient protocol, as the description above suggests. The header needs only five 32-bit words, or a total of 20 bytes. Compared with the large maximum payload of 64 kB, the overhead of an IP datagram is negligible: less than 0.05%. In practice, however, IP is typically sent over an Ethernet network with a maximum transfer size of 1500 bytes in the standard case, or 9000 bytes in a common Jumbo frame case.

The overhead is then still small, but it grows to around 1.3% in the standard case and 0.2% in the Jumbo frame case. Very little overhead is thus added to the transfer when large Ethernet frames are used; the total overhead of the Ethernet standard itself dominates over the extra overhead induced by IPv4. Examining Figure G.2, where the Ethernet payload size is on the x-axis and the resulting IP throughput on the y-axis, it can be seen that at least 20 B of Ethernet payload is needed just to hold the IP header, which is why the graph starts there. A comparison between IP datagram efficiency and raw Ethernet frame efficiency (seen in Figure 12) shows the increased overhead of IP datagrams: they reach 90% utilisation at about 170 B of Ethernet payload, whereas raw Ethernet reaches 90% utilisation at just over 120 B. This is of course natural, since IP datagrams provide additional features compared to raw Ethernet.
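The overhead figures above are easy to verify. The small Python sketch below assumes the standard per-frame Ethernet cost of 38 bytes (8 B preamble, 14 B header, 4 B FCS and 12 B inter-frame gap) and a 20-byte IPv4 header, and computes the fraction of the wire rate left for IP payload; it is a back-of-the-envelope calculation in the spirit of Figure G.2, not an exact reproduction of it.

LINK_OVERHEAD = 38   # bytes per Ethernet frame: preamble 8 + header 14 + FCS 4 + IFG 12
IP_HEADER = 20       # bytes, minimum IPv4 header

def ip_payload_fraction(mtu: int) -> float:
    """Fraction of the wire rate left for IP payload at a given MTU."""
    return (mtu - IP_HEADER) / (mtu + LINK_OVERHEAD)

for mtu in (1500, 9000):
    print(mtu, round(ip_payload_fraction(mtu), 3))   # 0.962 and 0.994

# The IP header alone costs 20/1500 = 1.3 % of a standard frame's payload
# and 20/9000 = 0.2 % of a jumbo frame's payload, as stated above.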

G.2 The User Datagram Protocol

The User Datagram Protocol (UDP) is one of the transport-layer protocols in the TCP/IP protocol family [73]. It is designed to send datagrams between applications running on different computers, but it guarantees neither freedom from duplicates, nor data arrival, nor the integrity of the data payload [92]. Its specification [92] even states that it is unreliable, and that TCP is the choice if ordered, reliable delivery is required. However, this simplicity is also an advantage in small embedded systems that lack the processing power to handle a complete TCP stack, since UDP is not as computationally intensive as TCP.

Word 1: Source Port (16 bits) | Destination Port (16 bits)
Word 2: Length (16 bits) | Checksum (16 bits)

Data

Figure G.3. A UDP packet. The Length field gives the size of the whole datagram in bytes, including the eight-byte header.

Looking at Figure G.3, we see that the overhead of a UDP datagram is only 64 bits, or 8 bytes. Thus UDP does not add significant overhead compared to a plain IP datagram. The difference lies in the two fields named Source Port and Destination Port, which specify which logical port incoming data should be delivered to [73]; this makes it possible to distinguish several applications running on the same host, so that each application only receives its own messages. The difference in effective throughput between plain IP and UDP/IP datagrams can be seen by comparing Figure G.2 and Figure G.4. The addition of eight bytes of overhead makes it possible to steer traffic into different ports at very little extra cost from a throughput perspective.

This provides the functionality to direct traffic to specific applications, and thus allows several streams of data to share one single physical link while still being kept apart.
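To make the port mechanism concrete, the following minimal Python sketch sends one UDP datagram between two sockets on the same machine; the host, port number and payload size are arbitrary example values.

import socket

# Receiver: bind a datagram socket to a local port and wait for one packet.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 50000))

# Sender: no connection setup; every sendto() is one self-contained datagram.
# 1472 B payload + 8 B UDP header + 20 B IP header fills a 1500 B Ethernet MTU.
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"x" * 1472, ("127.0.0.1", 50000))

data, sender = rx.recvfrom(2048)
print(len(data), "bytes from", sender)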

[Figure G.4: a plot of theoretical UDP throughput (Mbps, 0-1000) versus payload size (B, logarithmic scale from 10^0 to 10^5), for Ethernet frame sizes 1518 B, 4088 B and 9014 B, and for an unlimited frame size.]

Figure G.4. The theoretical maximum transfer throughput of UDP packets over a Gigabit Ethernet link. As we see there are different achievable throughputs depending on the actual Ethernet frame size.

G.3 The Transmission Control Protocol

Originally intended for the military [19], the Transmission Control Protocol (TCP) is probably the most widespread protocol for connecting equipment together in networks today. It is designed as a reliable, connection-oriented protocol with guaranteed delivery once a connection is established. It maps to the fourth layer of the OSI model (see Appendix D) and is designed to work with several underlying architectures. TCP is built on top of lower-level protocols, most often IP, and relies on their ability to address hosts and to provide fragmentation support [19]. TCP is meant to be a communication method between two processes, and interfaces with the higher layers of the OSI model to provide this inter-process communication. By selecting TCP as the communication protocol, users get a reliable protocol which ensures delivery, reordering of packets that arrive out of order, suppression of duplicates, and retransmission of damaged packets.

This is ensured by a handshake mechanism in which the receiver transmits an acknowledgement each time a message is accepted [19]. If no acknowledgement has arrived before a timeout expires, the packet is retransmitted. However, in order to provide all these guarantees, TCP is a more costly protocol than UDP in terms of CPU load and bandwidth [93]; for example, each acknowledgement sent back consumes bandwidth, which may impair network performance. One difference with respect to UDP is that TCP is connection oriented, i.e. it needs to set up a connection before sending data [73], which ensures that data sent later will also be received. The round-trip time (RTT) affects the maximum throughput of TCP implementations, and a large RTT makes throughput drop [94].
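As a rough illustration of why a large RTT hurts: a sender can have at most one receive window of unacknowledged data in flight per round trip, so a back-of-the-envelope upper bound on the throughput is

    throughput <= window size / RTT.

With the classic 64 kB maximum window, an RTT of 1 ms allows at most about 64 kB / 1 ms, roughly 524 Mbps, while an RTT of 10 ms caps the rate at roughly 52 Mbps, far below the Gigabit Ethernet line rate. (The numbers are purely illustrative; TCP window scaling raises this limit in practice.)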

G.3.1 Socket Buffer Size

The default socket buffer size in Linux has not always been adequate for handling TCP traffic over high-speed links [26, 95]. It has been shown that increasing the buffer size also increases the average throughput of TCP, since fewer or no packets are dropped due to insufficient buffering. This applies to both the sending and the receiving buffers, since the protocol needs to keep data in memory for at least one round-trip time, the time it takes for data to go from the sender to the receiver and back again, while waiting for acknowledgements. The required buffer size is thus related to the round-trip time: roughly, the buffer should hold at least the product of the link bandwidth and the round-trip time, and hence, when sending messages over long distances, the window may need to be quite large to sustain maximum throughput [26].
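The following Python sketch shows one way of enlarging the socket buffers from an application, sizing them to the bandwidth-delay product; the link rate and round-trip time are assumed example values, and the operating system may clamp the requested size.

import socket

LINK_BPS = 1_000_000_000                 # assumed link rate: 1 Gbps
RTT_SECONDS = 0.05                       # assumed round-trip time: 50 ms
bdp = int(LINK_BPS / 8 * RTT_SECONDS)    # bandwidth-delay product: ~6.25 MB here

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp)

# The kernel may clamp the request, so read back what was actually granted.
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))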

G.3.2 Different TCP Implementations

There are several different implementations of TCP [26]; most of them have evolved as attempts to reduce latency, increase throughput, or improve some other characteristic in a specific application [79], and consequently they all perform differently in different situations. Wu et al. [26] evaluate how different TCP variants behave when transmitting over a widespread Wide Area Network, while Callegari et al. [79] compare the TCP implementations available in a modern Linux operating system. Both studies conclude that for long-distance transmissions, an implementation called Scalable TCP offers the best solution. In [79], the results show that an implementation called CUBIC is the best overall performer, but that for both wired in-house networks and wireless networks, the standard TCP implementation is not far behind. It is also visible that standard TCP suffers less from moderate packet losses, in terms of maximum throughput, than it suffers from large round-trip times. Some TCP implementations are also compared in [94]. That study shows that many factors limit throughput, one of them being the amount of reverse traffic.

The study shows that when transmitting unidirectionally over a switched backbone network, performance is better than when transmitting bidirectionally, despite the full-duplex nature of GbE and 10GbE.

G.3.3 TCP Offload Engine

A TCP Offload Engine (TOE) is a co-processor on a NIC that relieves the CPU of the TCP processing, making overall processing more efficient since the CPU is free to spend more time on non-networking work. One approach for embedded systems is to use an IP (Intellectual Property) core which does the TCP encoding/decoding. Such a solution is presented in [96], where an IP core achieves almost full TCP bandwidth in full-duplex mode. However, this core delivers a minimal TCP implementation: it supports only one channel, no jumbo frames, and nothing else that is not essential for TCP. A more limited approach is taken in [97], where an IP core handling UDP/IP packets is created for communication between a PC and an FPGA, or for UDP communication between FPGAs. Restricting the scope to UDP does, however, reduce the resource usage of the core substantially. Furthermore, the transfer rate between the FPGA and the PC is not limited by the UDP/IP core, but by the PC, at approximately 90% of the GbE link rate.

G.3.4 RDMA-Enhanced TCP Decoding

Since TCP decoding takes up a lot of processing time in modern high-speed networking, several enhancements have been tried in order to relieve the CPUs of the actual TCP decoding. One technique is RDMA (Remote DMA)-enhanced network interface cards, which perform the entire TCP decoding and even transfer the data directly into application memory [98, 99, 100]. This approach has already been taken in Infiniband (see Appendix K), and the idea has now moved to TCP/IP over Ethernet as well [98]. RDMA over TCP/IP is known as iWARP [98] and has great commercial prospects, mainly due to the widespread use of Ethernet and the fact that Ethernet NICs are decreasing dramatically in cost. The greatest benefit of using RDMA is to relieve the main CPUs of handling the TCP/IP stack [98, 99]. There are several solutions for relieving the CPU: one is TCP offload engines (TOEs, described in the previous section), another is RDMA. RDMA is more advanced, since it provides an API with commands that put received data straight into application memory without any OS intervention, thus creating a zero-copy protocol. The difficulty with zero-copy is that it must support out-of-order arrival of packets as well as dropped packets. Attempts have been made to let regular socket programming benefit from RDMA, and possibly no source-code changes will be needed, only a recompilation against new libraries, to fully exploit the benefits of an RNIC [98]. However, the efficiency of this approach remains to be proven, since it is hard to foresee how it will work out.


Nevertheless, testing shows that RNICs, with their RDMA capability, have an edge over setups where no DMA is used [98, 99, 100]. The experiments show that using RDMA may increase performance by as much as 30% [98], and it is clearly stated in [100] that it frees the CPU for other tasks, since the TCP/IP performance increases even more when running computationally intensive benchmarks than when running less intense ones.

G.3.5 TCP Efficiency Over Ethernet

Just like the UDP and IP protocols travelling across Ethernet, the TCP protocol introduces an overhead. The overhead of a TCP packet is that of the underlying protocols plus its own header, drawn in Figure G.5. The TCP header is somewhat larger than that of UDP (20 vs 8 bytes), but the main difference is TCP's built-in reliability. The reliability comes at the price of considerable computation, and of acknowledgements being sent back to the sender. However, since the protocol does not mandate any particular hardware, these computations do not show up in the protocol efficiency figures. The theoretical throughput achievable by TCP over an Ethernet link is shown in Figure G.6.

Word 1: Source Port (16 bits) | Destination Port (16 bits)
Word 2: Sequence Number (32 bits)
Word 3: Acknowledgement Number (32 bits)
Word 4: Data Offset (4 bits) | Reserved (6 bits) | Control Bits (6 bits) | Window (16 bits)
Word 5: Checksum (16 bits) | Urgent Pointer (16 bits)
Words 6- (optional): Options, padded to a 32-bit boundary (the header is usually 20 bytes)

Data

Figure G.5. A TCP packet outline. Every part of the packet is shown; among the most important fields are the Source and Destination ports, indicating which flow is concerned, and the sequence and acknowledgement numbers, which are used to guarantee in-order, reliable delivery [19].


[Figure G.6: a plot of theoretical TCP throughput (Mbps, 0-1000) versus payload size (B, logarithmic scale from 10^0 to 10^5), for Ethernet frame sizes 1518 B, 4088 B and 9014 B, and for an unlimited frame size.]

Figure G.6. The theoretical maximum transfer throughput of TCP packets over a Gigabit Ethernet link. As we see there are different achievable throughputs depending on the actual Ethernet frame size.


Appendix H

Link Port for TS20X-Series

The TS20X series is a DSP series from Analog Devices which goes under the name TigerSHARC. These processors have an LVDS [101] interface called the Link Port. The Link Port is a full-duplex LVDS channel which may have a bus width of one or four lanes. Since the internal data bus width of the DSP is 128 bits, the data sent in one transaction is also 128 bits [13] (a quad-word). Unlike RapidIO or PCI Express, Link Ports do not embed the clock signal in the data; it is sent as a separate LVDS signal alongside the data. Additional signals carried alongside the data and clock are an acknowledgement signal and a block-completion signal [13]. One feature of the Link Ports on the TigerSHARCs is their DMA engines: there is one DMA at the receive side and one at the transmit side of each link port, and these DMAs may interface with internal or external memory as well as with other link-port buffers. Another potential benefit of the Link Port is its completion signal [13]. This signal enables the Link Port to accept arbitrarily sized packets, so that packet sizes need not be predefined. This is in contrast to many other protocols, e.g. Ethernet, PCIe or RapidIO, where a packet header informs the receiver of the size of the packet.

H.1 Performance of Link Ports

The Link Port is, as stated earlier, an LVDS-signalled bus protocol. The inclusion of a dedicated clock signal enables the receiver to sample the inputs on every clock edge, i.e. it is a Double Data Rate (DDR) protocol [13]. Together with the maximum bus width, this means that the maximum data transfer rate per direction is approximately 4 lanes * 2 transfers per clock * the bus frequency. According to the data sheet [2], this frequency may be up to 500 MHz, which gives a per-direction transfer rate of 4 Gbps, or 0.5 GB/s. Since, according to the Link Port specification [13], there is no need to send any control characters, nor any maximum transfer length, a useful bandwidth of 0.5 GB/s can be maintained for as long as necessary, which is an uncommon feature among protocols.

However, this also implies that any bit errors that occur during a transfer go unnoticed. There are means to improve reliability by sending a checksum byte after every completed quad-word; when the checksum option is turned on, the sender also sends a dummy byte after the checksum byte [13].
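The rates above, and the cost of the optional checksum, can be summarised in a few lines. The sketch below follows the reading of [13] given here, i.e. that one checksum byte plus one dummy byte accompany every 16-byte quad-word when the checksum option is on; it is an illustration, not a measured figure.

def link_port_rate(clock_hz: float, lanes: int = 4, checksum: bool = False) -> float:
    """Payload rate in bits per second for one direction of a Link Port."""
    raw = lanes * 2 * clock_hz          # bits/s: DDR doubles the clock rate
    if checksum:
        raw *= 16 / 18                  # 16 payload bytes out of every 18 sent
    return raw

print(link_port_rate(500e6) / 1e9)                 # 4.0 Gbps, i.e. 0.5 GB/s
print(link_port_rate(500e6, checksum=True) / 1e9)  # ~3.56 Gbps with checksum on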

H.2 Uses of Link Ports

In [101] a cluster of four TigerSHARC processors is used to speed up and parallelise a computation task. All processors within the cluster are interconnected, enabling all of them to communicate simultaneously with each other. The idea was to have one DSP receive all incoming data and split the task amongst all processors, and finally to assemble the result in another DSP, which is then responsible for sending the calculated result further down the communication line. To do this, an upper-level protocol was used: a zero-copy protocol that utilised predetermined memory areas for each of the link ports, and flags set in memory to indicate the status of a transmission [101]. This way, data could be sent using the DMA engine in the link port, and all transmission could be done without the intervention of the DSP core. There are several other uses as well, such as a radar application described in [15], where link ports are used for DSP communication in order to create real-time SAR radar images. Another radar application that utilises link ports is presented in [14], where a pulsed Doppler radar had its backbone interconnected with link-port links.

Appendix I

RapidIO

RapidIO is classified as an intra-system interconnect [9] with a strong focus on high-performance embedded computing. RapidIO may be implemented with a low footprint on the silicon and is rather transparent to software, which, in addition to it being an open standard, makes it worth considering when designing embedded systems. There are two types of RapidIO, serial and parallel, but both share the same base characteristics. One of the most important goals is to keep transaction overhead low [9] in order to utilise the communication links well. Furthermore, the main focus has been on an in-chassis interconnect, meaning short distances and very high throughput. This holds to different degrees for serial and parallel RapidIO: parallel is more suited for short distances, while serial suits somewhat longer connections [9]. RapidIO is a layered architecture with three layers: the Logical Layer, the Transaction Layer and the Physical Layer. These do not map directly onto the OSI layers; instead, as seen in Figure I.1, the top RapidIO layer maps to the fourth and fifth OSI layers, the middle RapidIO layer maps to the third OSI layer, and the lowest RapidIO layer maps to the first and second OSI layers [6].

I.1 The Logical Layer

RapidIO has three operating modes: message passing, globally shared memory and an I/O-oriented mode [102]. The I/O-oriented mode sends data as simple read/write operations between RapidIO endpoints. A single write may carry any data up to 256 bytes, which is the maximum packet payload. In this mode there are only six types of messages: Read, Write, Write-with-response, Streaming write, Atomic and Maintenance [7]. Of these six, three generate responses from the receiver: Read, Write-with-response and Atomic. The Atomic operation is a read-modify-write operation: it atomically reads a memory location (byte, half-word or word), returns the read value, and then writes the new data sent with the Atomic command back to that location.


RapidIO layer        OSI layer(s)
Logical Layer        Session layer, Transport layer
Transaction Layer    Network layer
Physical Layer       Data Link layer, Physical layer

Figure I.1. How the layers of the RapidIO model map onto the OSI model, as described by the RapidIO specification [6]. The RapidIO layers are larger than the OSI layers, containing more functionality. The idea is to have a small, common Transaction Layer and different Logical and Physical layers communicating through it.

The Atomic operation may perform several different modifications of the data at the specified location: set all bits, clear all bits, increment or decrement the value by one, and three conditional write operations [7]. The other operations are [7]:

• Read operations request data from a specified address. The packet has a total size of between 52 and 84 bits, depending on the size of the memory address space (34, 50 or 66 bits). Each Read is answered by a Response packet that contains 20 bits of overhead plus the number of double-words specified in the Read request.

• Write operations send data, from a single byte up to a total of 64 double-words, to a specified address and expect no answer. The operation takes no responsibility for system-wide cache coherency. The packet has a 52, 68 or 84 bit overhead, depending on the size of the address space, in addition to the one to 64 double-words of data.

• Streaming Write operations write a stream of aligned double-words from the sender to the receiver. This is a special write operation which induces less overhead into the transaction: only 32, 48 or 64 bits, in contrast to the regular write operation. The price is that the transmission must be double-word aligned and cannot carry anything that is not an integer number of double-words, in contrast to the normal write transaction. This transaction type also has the maximum payload of 256 bytes.


• Write-with-response operations are identical to ordinary write operations except for a response packet that the receiver sends upon completion. This makes them less efficient, yet more reliable, than ordinary writes. The response is not necessarily big, however; only 20 bits are needed.

The message-passing operating mode supports mailbox-like communication in which RapidIO endpoints send messages to each other [7]. A message may span from one byte up to 16 consecutive packets, for a total of 4 kB. When a message consists of multiple packets, the standard allows the packets to arrive out of order [7] while still guaranteeing that the message is delivered in correct order. When using maximum-sized packets, the standard allows up to four parallel mailboxes, which could be used by e.g. different applications at the receiving endpoint. As with every network, there are some considerations to take into account when constructing a RapidIO network. The routing algorithm is not specified, so deadlock is possible [7]; this may be avoided by using a deadlock-free routing algorithm, e.g. dimension-ordered routing [7, 103]. There are two types of request packet formats in the message-passing scheme, Doorbell and Data Message, and both are briefly explained below [7]:

• Doorbell messages are a means of sending short messages between endpoints. Doorbell messages have their own queue, which they do not share with the regular mailboxes used by Data Messages. The total length of a Doorbell message is 36 bits, with a 16-bit field carrying the message itself. This, in addition to the 20-bit response on reception, means that the overhead of Doorbells is significant. They may, however, serve well as e.g. control signals or for passing small amounts of data.

• " Data Message messages are the messages that goes into mailboxes at the receiver. It is a sender-initiated communication where the sender sends any- thing from a single byte, up to 16 full-length packets. These packets are further guaranteed to arrive in-order which alleviates upper-level sorting of data. Even though bringing some extra features to the transmission, the ac- tual overhead is very small, only 20 bits needed per packet for the entire transmission. This, in a 16-packet transmission is 320 bits, or 40 bytes, for a theoretical maximum payload of 16*256B = 4kB, which is lower than one percent overhead.

I.2 Transaction Layer

The transaction layer is a common layer for all logical specifications of RapidIO as well as all physical specifications. This layer adds a header with source and

destination addresses, so that the physical layer can send the packets to their intended destination. The idea with one common transaction layer is to provide a single middle layer between several physical and logical layers [7], giving them all a common way to communicate with each other. The transaction layer adds three fields with a total of 18 or 34 bits: a two-bit field which states whether 8- or 16-bit addressing is used, followed by either two 8-bit or two 16-bit address fields that specify the source and destination addresses of the packet. The added overhead is thus smaller in a system with at most 256 endpoints than in a system with more than 256 endpoints.

I.3 Physical Layers

Because it is intended to be versatile, the RapidIO specification allows for several present and future physical specifications. Here an overview of two of them is given.

I.3.1 Serial RapidIO

Serial RapidIO (SRIO) is defined in Part 6 of [7] as an alternative to the original parallel RapidIO. By using LVDS signalling and 8B/10B encoding (see Appendix L), it is a high-speed serial interface with high transfer speed and greater possible link length than the parallel interface. Each SRIO link is full duplex, providing the same bandwidth upstream and downstream. The signalling frequency of an SRIO link is 1.25, 2.5, 3.125, 5 or 6.25 Gbps; this is the raw signalling rate including the encoding overhead, making the data transmission rate 1, 2, 2.5, 4 or 5 Gbps depending on the operating speed [7]. To further increase bandwidth, SRIO allows up to 16 links to work in parallel for increased throughput, still with the benefits of serial communication. In contrast to parallel RapidIO, there is no additional clock line; each signal carries its own clock, making the individual lines less sensitive to skew and jitter with respect to each other. Instead, the hardware at the receiver side must align the data arriving on the different serial lines. The serial protocol encapsulates the upper transport-layer RapidIO packet in a physical SRIO packet, which adds fields totalling 26 or 42 bits, depending on the total transmission length: if a packet is smaller than 80 bytes, the protocol adds a 16-bit CRC, but if the packet exceeds 80 bytes, a 32-bit CRC is used instead to ensure data integrity. An example of what can be done with SRIO is presented in [33], where researchers create an embedded system with very high throughput, utilising SRIO as the high-speed communication channel; by using an SRIO switch, they gain very high flexibility and extensibility of the system.


[Figure I.2 here. Recoverable fields (widths in bits unless noted): ackID (6), Virtual Channel (1), Critical Flow (1), Priority (2), Rest of Transport Type (2), Destination ID (8 or 16), Source ID (8 or 16), Logical Type (4), Logical Header and Data (4-76 bytes), Data, and two CRC16 fields (16 each).]

Figure I.2. The layout of a serial RapidIO packet of arbitrary size. The pink is the Logical layer, the light gray is the transport layer and the blue is the physical layer. All sizes are in bits unless otherwise specified.

In Figure I.2 the layout of a serial RapidIO packet is drawn. The maximum size of a packet is 276 bytes [7]: 256 bytes of payload, 10 bytes of Logical Layer overhead, four bytes of transport-layer overhead, five bytes of physical overhead and one byte of overhead shared between the layers [7]. Even in this case the total overhead is modest: 20 bytes of overhead in a 276-byte packet gives a link efficiency of 92.7%. In the best case, the IDs are only 8 bits and the message is a streaming write in a 32-bit address space, making the total logical-layer overhead 4 bytes. With the help of Figure I.2, this gives an overhead of 5 bytes physical, 2 bytes transactional, one byte mixed and four bytes logical, for a total of 12 bytes of overhead on a 256-byte payload: a total message size of 268 bytes and an efficiency of 95.5%, which is the maximum utilisation of an SRIO link. One might argue that acknowledgements need to be sent as well, but since the link is full duplex, and assuming data flows in one direction, acknowledgements can be sent as needed on the return channel. Combining these results with the data transfer rates, the maximum data transfer rate over a single SRIO link is 95.5% of a 5 Gbps link, i.e. 4.78 Gbps, or almost 600 MB/s. Even the slowest SRIO link, which operates at one fifth of this speed, still exceeds 100 MB/s of maximum throughput. This may not be achievable in a real-world scenario, though, unless careful system design enables it.
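The efficiency arithmetic above can be reproduced with a few lines of Python; the overhead constants are the byte counts quoted from [7] in this section.

def srio_efficiency(payload, logical, transport, physical, shared=1):
    """Payload fraction of a serial RapidIO packet (all sizes in bytes)."""
    return payload / (payload + logical + transport + physical + shared)

worst = srio_efficiency(256, logical=10, transport=4, physical=5)  # 256/276
best = srio_efficiency(256, logical=4, transport=2, physical=5)    # 256/268
print(f"{worst:.2%}  {best:.2%}")               # ~92.7 %  ~95.5 %
print(f"{best * 5:.2f} Gbps on a 5 Gbps link")  # ~4.78 Gbps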

I.3.2 Parallel RapidIO

Parallel RapidIO was the originally released specification, and is defined in Part 4 of [7]. It uses an eight- or sixteen-bit-wide bus where data is sent on differential LVDS lines alongside a clock and a frame signal.

When the bus grows from 8 to 16 bits, the clock signal is duplicated, so that no more than eight lines share the same clock signal. The specification also defines how flow control should be implemented, and requires the nodes to be able to handle several kinds of communication errors. The upper layers of the architecture are the same as for the serial implementation, and much is similar in terms of packet layout, size and CRCs; the flow-control algorithms for deadlock avoidance are also the same [7]. One main thing to keep in mind when implementing a RapidIO system is the flow control, which is exemplified in [7]: there are different modes which prove better or worse for different applications. The speed and throughput of a parallel RapidIO link are first of all determined by the clock frequency. The specification [7] gives examples from 500 MHz to 1 GHz, while [9] also mentions lower frequencies such as 250 MHz. Since parallel RapidIO samples data on both the rising and the falling clock edge, the raw data rate is twice the clock frequency times the number of lanes; for an 8-bit-wide link at 500 MHz, the raw throughput is 8 Gbps, or 1 GB/s, in each direction. This scales up to a 1 GHz, 16-bit-wide bus with a throughput of 32 Gbps under unidirectional traffic.

Appendix J

USB

USB was the peripheral connection that won most consumers over after the serial (RS-232) and parallel printer ports [104]. It is nowadays a common connector found on almost any PC. The USB standard aims to be a flexible standard supporting numerous peripherals of different characteristics, and it provides several connection possibilities at a wide variety of speeds. The original specification supported line rates of 1.5 and 12 Mbps. This was extended to 480 Mbps (High Speed) in the USB 2.0 specification, and in the latest specification, USB 3.0, the supported line rate is 5 Gbps (SuperSpeed). These are raw bit rates including encoding; in the SuperSpeed case the encoding is 8B/10B (see Appendix L), which gives an effective rate of 4 Gbps, while the lower speeds use an encoding with less overhead [104]. The main focus of this USB description is on High Speed (480 Mbps) and SuperSpeed (5 Gbps), since the aim of this work is to transport large quantities of data.

Transmissions on a USB bus take place between a host and a device [104]. The host initiates all transactions by informing the device that it is its turn to respond. The device must then be ready, since time on the bus is limited and a transaction must be initiated immediately; after completing a transfer, the device must prepare to be ready for the next transfer when its turn comes again. Transfers on the USB bus are organised in time slots called frames or microframes. On high- and super-speed buses, the time slot is 125 µs, and transfers are scheduled within these slots [104]. At the beginning of every microframe, the USB host sends a Start-Of-Frame to all attached units to help them synchronise.

USB allows four kinds of transfers with different characteristics. The first is the Control transfer, which is supported at every speed and is used to initialise connections and, in some cases, to transfer data [104]; for high- and super-speed devices, 20% of the available bandwidth is reserved for this purpose. The second type is the Bulk transfer, which has low overhead but no reserved bandwidth [104]. The low overhead makes Bulk potentially the fastest transfer method over USB, but achieving that bandwidth requires a link with low utilisation from other applications: a bulk transfer will be served eventually, but with no guarantee as to when.

The third and fourth transfer types are Interrupt and Isochronous transfers, which together may be granted the remaining 80% of the transfer time [104]. Interrupt transfers are the solution for devices which send data at random times, whereas isochronous transfers are for constant bandwidth usage. A device may send up to three interrupt transfers per microframe, each with a payload of 1024 B. This equals 1024 B three times per 125 µs, or a theoretical maximum of just above 24 MB/s, which is also the maximum transfer speed for a high-speed USB endpoint in isochronous mode. To deliver more transfer speed, super-speed USB isochronous transfers allow up to 48 transfers per microframe, each with a payload of 1024 B [104]. This gives 16 times the high-speed transfer rate, or around 393 MB/s maximum. However, isochronous transfers have a downside: the lack of error correction.

Traditionally, USB has not allowed peer-to-peer, or host-to-host, communication. This changed slightly in the new USB 3.0 standard, where a host-to-host cable was introduced which enables two hosts, e.g. two PCs, to communicate directly with each other, much like a cross-over LAN cable. Long distances are, however, not feasible with USB, since the cable length is initially limited to 9 ft [104]. This may be extended by the use of hubs, but even then the maximum is 98 ft in USB 2.0 and 49 ft in the more recent USB 3.0. One final note on licensing: to create USB hardware, the manufacturer needs a Vendor ID, which can be obtained from the USB-IF (Implementers Forum) at a cost of $2000 for non-members; for members the fee is waived, but the annual membership fee of $4000 must still be paid [105].
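The isochronous figures above follow directly from the microframe arithmetic, as the small sketch below shows.

MICROFRAME = 125e-6                 # seconds per microframe
PACKET = 1024                       # bytes per isochronous packet

high_speed = 3 * PACKET / MICROFRAME    # 3 packets per microframe
super_speed = 48 * PACKET / MICROFRAME  # 48 packets per microframe
print(high_speed / 1e6, "MB/s")         # 24.576 -> "just above 24 MB/s"
print(super_speed / 1e6, "MB/s")        # 393.216 -> "around 393 MB/s"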

Appendix K

Infiniband

InfiniBand (IB) is a specification maintained by the InfiniBand Trade Association, a consortium with members such as HP, Intel and IBM [106]. IB is an architecture aiming at being a system area network, intended to interconnect processing nodes, peripherals and other equipment into a complete system [107]. The architecture of IB is a switched-fabric network, where everything is interconnected with switches. It allows redundancy in order to provide high quality of service with very low downtime. IB has native support for IPv6 headers [107], which is an advantage now that the Internet is moving to IPv6. IB signals may traverse a x1, x4, x8 or x12 channel, where the number denotes the number of lines; when signalled over copper, they are transmitted using differential signalling [108]. Regardless of whether the signal is sent over copper or optical fibre, it is encoded using 8B/10B encoding. The signalling may be Single, Double or Quad Data Rate [108], carrying 250, 500 or 1000 MB/s of data per lane respectively. This gives a Quad Data Rate, 12-lane IB channel a maximum of 12 GB/s in each direction. The packet structure of an IB signal differs depending on whether it is a local or a global packet. Since this study deals primarily with local communication, a discussion of global packets is out of scope; the only thing worth mentioning is that they contain either an IPv6 header or a global IB header, which has the same structure [107]. Henceforth, only messages sent on the local subnet are considered. The main focus here is on achieving high throughput and high reliability. InfiniBand provides means for this by offering reliable communication with error-detection capabilities, and it also offers RDMA capabilities [107]. This is an advantage, since it removes transport responsibilities from the CPU and moves data straight into memory.


Appendix L

8B/10B Encoding

Herein, a brief summary of the 8B/10B encoding is given, together with the reasons why several protocols use it. In order to obtain a signal which has almost no low-frequency content and is DC free, the encoding technique called 8B/10B encoding was constructed [109]. The purpose of the encoding is to limit the run length of the code, i.e. the number of consecutive ones or zeros. With a bounded run length, it is guaranteed that the signal switches between ones and zeros regularly. The code is built by combining one 3B/4B and one 5B/6B encoding into a total 8B/10B code [109]. The encoded signal also contains enough transitions between low and high for a clock to be recovered from it, thus relieving the transmitter of sending a clock signal alongside the encoded data [7]. The encoding introduces a 25% overhead compared to sending only the data, but it enables some attractive features: the signal can be AC-coupled, the clock travels with the data, and special control characters can be used in the data stream. There is plenty of room for such control characters, since encoding 8 bits into 10 bits uses only 25% of the code space; the remaining code words can be used to maintain disparity (keeping the signal DC-balanced) and to send control characters. Some of the communication techniques that utilise 8B/10B encoding are Gigabit Ethernet in its 1000BASE-X implementation (not the one running over four copper pairs), Serial RapidIO, and PCI Express, which may be read about in Appendix I and Appendix E.
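The run-length property is easy to state in code. The Python helper below measures the longest run of identical bits in a bit string; for a valid 8B/10B stream this never exceeds five. The example input is an arbitrary bit pattern, not an actual 8B/10B codeword.

from itertools import groupby

def max_run_length(bits: str) -> int:
    """Length of the longest run of identical symbols in a bit string."""
    return max(len(list(run)) for _, run in groupby(bits))

print(max_run_length("1100000101"))   # -> 5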


Appendix M

Case Study: The ATLAS TDAQ System

ATLAS is the name of a detector at the Large Hadron Collider at CERN in Switzerland [110]. The detector is used to study sub-atomic physics and particles. The amount of data generated by the detector is around 60 TB/s [30], a very large amount of data. This data is filtered, through several computation and selection stages, down to a total of 300 MB/s that is stored on disk and considered valuable data for research.

At the beginning of the DAQ-system project, the technology of choice was Fast Ethernet [111] for most connections, with GbE switches at the centre of the network to reduce cabling. Redundancy was built into this system, which consisted of four central switches, two for each type of application (see Figure M.1). This way, one central switch could go down without a total interruption of data gathering. Later, as GbE became cheaper and a commodity, an upgrade of the system was planned [31]. The choice of standard link technology was changed from Fast Ethernet to Gigabit Ethernet due to the increased availability, the performance-over-cost ratio, and the belief that Ethernet products would not become obsolete in the near future [112]. However, due to strict real-time performance requirements, the components used had to be of very high quality to meet the demands of the project. This forced the creation of a testbed in which to evaluate components and figure out which ones were best suited for the task; several switches were investigated for their performance before the selection was made.

The final implementation of the DAQ system uses different kinds of switches depending on how critical they are [30]. The full-scale system comprises over 3000 high-end processing computers in a well-thought-out topology with approximately 200 Ethernet switches [113]. The main concern when creating the network was reliability, due to the enormous number of packets sent every second. To achieve faster switching, Layer 2 switches were used in the high-performance nets [30], whereas Layer 3 switches with static IP switching were used in the less demanding parts of the network. Initially, the ATLAS TDAQ system was planned to use only GbE as the signalling Ethernet standard.


Figure M.1. The concept of the original ATLAS network. It is split into two separate sub-networks, where one computes application one and the other computes application two.

This had to be reconsidered due to the massive amount of data (over 90 Gbps) [31] and the fact that six units had to share one Gigabit Ethernet connection. Each cabinet held 30 units, which meant that the bandwidth out of a cabinet needed to be 5 Gbps. To achieve this bandwidth, one cannot simply add five GbE cables, because that would create rings in the network, which is not allowed. Three solutions are described [31]: trunking, using VLANs, and upgrading to 10GbE.

Trunking uses the 802.3ad standard [31] and is slightly unreliable since the standard does not define how the load is to be balanced, only that frames must arrive in the correct order. When using this, the authors of [31] experienced a very high packet-loss rate of approximately 50%, i.e. every second packet was lost. The second solution they tested [31] was using VLANs, which in practice means that one physical switch is split into several logical switches. The experiments showed that the loss rate was low and the throughput good. However, since there is no load balancing between the uplinks, this gives a less than optimal solution if the load between the logical switches is unbalanced, with some uplinks saturated while others are close to idle. The third option was to use 10GbE uplinks. This option has no real impact on cost, since the price per byte sent is close to the same [31] for a GbE card and a 10GbE card. Furthermore, standard off-the-shelf switches nowadays often contain a few 10GbE ports. A third reason was that two cabinets could be connected together, giving full utilisation of the 10 Gbps uplink instead of just the 5 Gbps that each cabinet provides.

Figure M.2. A later ATLAS DAQ network. It is split into two VLANs so that data does not go from data generation through the right half and end up in the left half. This is done to shorten the path travelled by packets.

Their choice fell on the 10GbE links, because they had fewer drawbacks than any of the other techniques. In all the dimensioning of the network, the Ethernet links were dimensioned for an average utilisation of 60% [31], due to the reduced risk of packet losses [111]. When upgrading the network [31], another approach to the central network was also chosen, replacing the four central switches specified in [111]. This time only two central switches were used (see Figure M.2), with each central switch serving both types of applications [31]. This way, one of the central switches can still go down without a total disruption of operation. This approach was chosen after tests indicated that there was no difference in placing all applications of the same type behind the same switch; they then went for the solution with the greatest redundancy, which was splitting the network into two separate parts. The network needed to be divided in two to prevent loops, and furthermore to stop packets from travelling unnecessarily long paths, e.g. Data generation -> Central switch 1 -> Control network -> Central switch 2 -> Application 1 or 2 in VLAN 2. The Control Network seen in the figure is a network that needs to work on both VLANs in order to send commands to all application-1 and application-2 type computers.

The Ethernet technology used in the ATLAS system evolved from being only Gigabit Ethernet [112] to also including 10 Gigabit Ethernet links between essential parts of the network, especially between the central nodes, for redundancy. One problem with redundancy in Ethernet is the lack of support for it in the standard specification. However, this may be solved by using VLANs [30] and letting each separate connection path traverse a different VLAN. This can further be exploited with the MST protocol, which without VLANs would allow only one active path per target address, with all other paths kept dormant. As the physical medium, the 1000BASE-T standard over copper wires was used. This limits the cable length to 100 m [80]; to cover longer distances, additional switching was inserted between nodes so that each hop stays below 100 m [30].

M.1 The Communication Protocols in ATLAS

The choice of communication protocols in the ATLAS project was crucial for performance, and several different means of communication are therefore used throughout the system. One thing to keep in mind is that ATLAS already runs on an Ethernet infrastructure, and hence the most commonly used protocols for Ethernet networks were evaluated.

The TCP/IP protocol suite is the backbone of the Internet, containing the well-known protocols TCP/IP and UDP/IP. Both of these were evaluated for use in the ATLAS TDAQ [32, 114, 115], along with raw Ethernet frames. TCP/IP has the benefit of guaranteeing the delivery of each packet, in the correct order. However, the acknowledgement mechanism may be ill-suited to a real-time system such as the ATLAS TDAQ. This is due to the long timeout before a packet is retransmitted, which is above 35 ms [32] and, according to [115], as high as 100 ms. This makes the system rather unresponsive, especially since the application-layer protocol of the system is a request-respond protocol [32] which already has a shorter timeout [115] than the TCP/IP timeout. The second drawback of TCP/IP is its use of acknowledgements to confirm reception. These ACKs may piggyback on other packets [32], but if no other outgoing data is leaving, a specific packet has to be created to deliver the ACK. Thus, when sending responses, an ACK may piggyback on the response data, but if it cannot, a separate ACK packet has to be generated and sent; furthermore, another ACK packet must be sent back to acknowledge that the data was received correctly. These ACKs add up to many packets that take up space on the Ethernet links without carrying payload [32, 114].

Another option is to use UDP/IP packets. These packets are not guaranteed to reach their destination, as TCP packets are. However, as the application-layer protocol already guarantees that packets arrive, duplicating this feature in the transport layer is considered unnecessary [32]. Furthermore, the studies show that only one in 10^9 UDP packets is lost in transmission [114], making the handshake


ACK protocol even more superfluous.

Finally, some details about the use of raw Ethernet packets, or frames. Transmission of data over Ethernet is always carried out in frames. The difficulty is that frames are harder to send from an application, since there is no higher-level encapsulation [115]; Linux does, however, provide means to do this [115] (a minimal example is sketched at the end of this section). The great benefit is the very low overhead, since no higher-level protocol is used. It also removes the parts of the IP header that serve no function within a single network, where MAC addresses are sufficient on the LAN and no IP addresses are needed. This makes it possible to send useful data in the entire MTU of the frame. On Ethernet this is limited to 1518 bytes, but with Jumbo frames the MTU may be extended [115]; a common Jumbo frame size is 9000 bytes, although this is vendor specific since no standard exists. It can be seen in [115] that the CPU load rises much more slowly when using raw Ethernet frames than with higher-level protocols, indicating that higher speeds might be achieved.

Furthermore, some experiments [116] were done during the development of ATLAS using a framework for inter-process communication called CORBA, the Common Object Request Broker Architecture [117]. One reason why CORBA was evaluated is that it is very language independent: it runs under several operating systems, and code can be written in many different languages [116]. The study found that the communication time increased linearly as devices were added to the network, and that using CORBA imposed a latency around 30% higher than using plain TCP.

Finally, a word about the recommended data transmission for the ATLAS DAQ. The high-bandwidth data links doing real-time data processing were recommended to use either raw Ethernet frames or UDP, since the application-layer protocol provides the service guarantee [32, 114]. For control signals and other infrequent messages, however, TCP is considered a good choice, since its ACKs do not pollute links that carry so little traffic anyhow.
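As referred to above, Linux can send raw Ethernet frames directly from an application through AF_PACKET sockets. The following minimal Python sketch illustrates the mechanism; it is Linux-specific, requires root privileges, and the interface name, MAC addresses and EtherType are example values only (0x88B5 is an EtherType set aside for local experiments).

import socket

IFACE = "eth0"                             # example interface name
DST = bytes.fromhex("ffffffffffff")        # broadcast, for illustration
SRC = bytes.fromhex("001122334455")        # hypothetical source MAC
ETHERTYPE = (0x88B5).to_bytes(2, "big")    # local experimental EtherType

# SOCK_RAW on AF_PACKET lets the application build the whole frame itself.
s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW)
s.bind((IFACE, 0))
payload = b"\x00" * 46                     # pad to the 60-byte minimum frame
s.send(DST + SRC + ETHERTYPE + payload)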

M.2 The Physical Interconnects and Software of ATLAS TDAQ

The TDAQ design uses commodity products to a large extent, and very little is custom made [114]. Only the first detector stages are custom-made hardware; after that, commodity computers do the rest of the processing. From the sensors to the receiving computers, optical fibres carry the data at rates around 160 MB/s (1280 Mbps) [118]. Each receiving computer is equipped with custom-designed cards to receive this data, and each computer can simultaneously take in up to twelve optical channels, for a total input of 1920 MB/s (15 Gbps). The data is then filtered so that only the most interesting objects are sent further down the network.

Each of these receiving computers is also equipped with GbE ports, so that it can send the data down the network for further processing and, finally, storage.
