DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2018

Next Generation SDN Switches Using Programming Protocol-Independent Packet Processors

TIJO VARGHESE THAZHONE

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


MASTER THESIS REPORT

Next Generation SDN Switches Using Programming Protocol-Independent Packet Processors

Author: Tijo Varghese Thazhone
Company supervisor: Magnus Svevar (Infinera)

Examiner: Dr. Zhonghai Lu
Academic supervisor: Yuan Yao

A thesis submitted in fulfillment of the requirements for the degree of Master of Science in the

School of Electrical Engineering and Computer Science
Stockholm, Sweden

November 30, 2018

Abstract

Over recent years, Software Defined Networking has enabled operators to control the network and realize new networking topologies. With increasing network traffic and protocol formats that aim at managing the traffic efficiently, the capabilities offered by Software Defined Networking alone are currently limited by the underlying fixed hardware infrastructure. The inflexibility involved in redesigning the hardware forces the bottom-up approach defined by switch vendors in describing the network and limits the capabilities offered to operators for further innovation.

To meet the demands of ensuring a higher degree of flexibility to design and test, and to guarantee a faster time to market, the concept of Softly Defined Networks was introduced. The idea, in addition to offering the conventional advantages of Software Defined Networking, is based upon implementing a re-programmable data-plane. Field-Programmable Gate Arrays offer a higher degree of flexibility and the capability to handle such designs. Programming Protocol-independent Packet Processors (P4) is a high-level language continuously evolving to define data-planes for various networking devices. The aim of P4 is for network operators to customize the underlying hardware with minimum constraints and ease, independent of the target. Therefore, the three major goals while defining such a language revolved around reconfigurability of the hardware after being deployed, protocol independence to permit customization without constraints, and target independence for users to be less concerned with the underlying hardware. Recent advances in P4, with added support in terms of compatible targets and compilers, have made P4 a viable opportunity to realize re-programmable hardware.

This work contributes towards exploring the ease of incorporating the capabilities of P4 in realizing a flexible data-plane. To achieve this and study its characteristics, a supporting two-lane hardware pipeline is proposed that is capable of accommodating P4 upon a Kintex 7 FPGA. Primarily, a custom P4 module is defined that is capable of performing L2 operations upon a double-tagged Ethernet frame using an appropriate architecture model. Subsequently, to integrate the P4 description on hardware, the proposed supporting pipeline is implemented at a line rate of 10 Gbps using the essential building blocks that help in observing the desired processing. Using a test setup, the design is further tested for the expected data-plane activity based upon the populated match-action rules. In terms of resource utilization, the overall design consumes less than 15% of the available resources and incurs an average latency of 5.71 µs. In addition to the ease of customization compared to conventional fixed data-plane descriptions, it is vital to analyze the cost inherited while adopting P4. The final design is therefore studied in terms of resource utilization and latency by increasing the complexity of the P4 definition with regard to the number of headers, tables and write operations (H-T-W) for the adopted compiler. In the case of eight headers, tables and write operations (8H-8T-8W), there is an average latency of 8.01 µs, and the P4 description alone demands 51536 LUTs, 77789 FFs and 118.5 BRAMs in terms of resource utilization. Finally, the report discusses the extent to which the proposed top-down approach is implemented and is capable of redefining the network as we know it.

Abstrakt

In recent years, Software Defined Networking has made it possible for operators to control the network and implement new network topologies. With increasing network traffic and new protocols that aim to manage the traffic efficiently, the capabilities offered by Software Defined Networking are currently limited by the underlying fixed hardware architecture. The inflexible hardware forces the "bottom-up" approach defined by switch vendors when it comes to describing the network, and limits the possibilities offered to operators to control and innovate in their networks.

To meet the demands of creating a higher degree of flexibility for design and test, and of guaranteeing a faster time to market, the concept of Softly Defined Networks was introduced. The idea, beyond offering the conventional advantages of Software Defined Networking, builds upon implementing a re-programmable data-plane. Field-Programmable Gate Arrays offer a higher degree of flexibility and the ability to handle such designs. Programming Protocol-independent Packet Processors (P4) is a high-level language that is continuously evolving to define the data-plane of various network devices. The goal of P4 is for network operators to easily customize the underlying hardware with minimal constraints, independently of the hardware vendor. The three main goals when defining such a language concerned reconfigurability of the hardware after deployment, protocol independence to enable customization without constraints, and vendor independence so that users would be less concerned with the underlying hardware. Recent advances in P4 in terms of support for compatible hardware and compilers have made P4 a feasible candidate for realizing re-programmable hardware.

This work contributes towards exploring how easily the capabilities of P4 can be integrated to realize a flexible data-plane. To achieve this and study its characteristics, a hardware implementation of L2 processing in two lanes with P4 on a Kintex 7 FPGA is proposed. First, a custom P4 module is defined that can perform L2 operations on a double-tagged Ethernet frame using a suitable architecture model. The P4 description is then implemented in hardware on the proposed architecture model at a rate of 10 Gbps, using the building blocks required to observe the behaviour. Using a test setup, the design is tested to see whether it meets the expected data-plane activity based upon the configured matching rules. In terms of resource utilization, the design consumes less than 15% of the available resources and achieves an average latency of 5.71 µs. Besides the ease of implementation compared to a conventional fixed description of the data-plane, it is important to analyze the cost of adopting P4. The final design is therefore studied with respect to resource utilization and latency by increasing the complexity of the P4 definition in terms of the number of headers, tables and write operations (H-T-W) for the adopted compiler. In the case of eight headers, tables and write operations (8H-8T-8W), the average latency is 8.01 µs and the P4 description alone requires 51536 LUTs, 77789 FFs and 118.5 BRAMs in terms of resource utilization. Finally, the report discusses how the proposed top-down approach is implemented and how it can redefine the network as we know it.

Acknowledgements

"Tell me and I forget, teach me and I may remember, involve me and I learn."

-Benjamin Franklin

The contents of this report would be incomplete without acknowledging the constant guidance and support I received throughout the thesis work. First and foremost, I am grateful to God for granting me the patience and capability necessary to see the thesis through.

The opportunity to work on this topic in collaboration with Infinera was an absolute pleasure. It would have been impossible to state the findings mentioned in this report without the guidance, support and resources granted by Infinera. The open work culture and supportive colleagues helped me make new friends and enjoy my work. I would like to express my gratitude to my industrial supervisor Magnus Svevar for helping me with all the necessary resources and support to better understand the topic. Hannah Dysenius helped manage the project work and made sure there was progress in a systematic fashion. I am truly thankful to them both for their patience and understanding with regard to all the unmet deadlines. Kenth Erikson, Dr. Jue Shen and Stefanos Kyri, with their years of experience and knowledge in the field, offered valuable guidance during various stages of the project work. It was absolutely an honor to have been a part of Infinera and to learn more about the topic.

This thesis work was undertaken to fulfill the requirements for the degree of Master of Science at KTH Royal Institute of Technology, Stockholm, with Dr. Zhonghai Lu as the examiner and Yuan Yao as the academic supervisor. I especially thank Dr. Zhonghai Lu for constantly reviewing the status of my thesis work and helping me refine this report.

Finally, I would like to thank my family for their incessant love and encouragement that has helped me throughout my life.


Contents

Abstract

Abstrakt

Acknowledgements

1 Introduction
  1.1 Background and Motivation
  1.2 Problem Statement
  1.3 Purpose
  1.4 Goals
  1.5 Benefits, Ethics and Sustainability
  1.6 Research Methodology
  1.7 Delimitation
  1.8 Outline

2 Theoretical framework
  2.1 Software Defined Network to Softly Defined Network
  2.2 P4: Programming Protocol-independent Packet Processors
    2.2.1 Architecture model
    2.2.2 P4 description
      2.2.2.1 Meta-data bus
      2.2.2.2 Parsing of the packet
      2.2.2.3 Match-Action tables
      2.2.2.4 Deparser
    2.2.3 Benefits of using P4
    2.2.4 P4 compiler and tools
  2.3 Related work
  2.4 Miscellaneous
    2.4.1 FPGA platform
    2.4.2 Simulation and testing

3 P4-enabled switch: The proposed design
  3.1 Building blocks
    3.1.1 P4 description
      3.1.1.1 The Architecture model
      3.1.1.2 P4 data-plane description
    3.1.2 UPI master
    3.1.3 UPI-AXI4 Lite translator
    3.1.4 10/25G Ethernet Subsystem
    3.1.5 Tuple Controller
    3.1.6 AXI4-Stream switch

4 Results
  4.1 Simulation of building blocks
  4.2 System integration and packet flow
  4.3 The final design
  4.4 Observing the desired packet processing
  4.5 Analysis

5 Conclusion and Future work
  5.1 Conclusion
  5.2 Future work

A Steps to incorporate P4

B Intermediate JSON file

Bibliography

List of Figures

1.1 OpenFlow based Software Defined Networking framework

2.1 Softly Defined Networks
2.2 The overall network framework
2.3 Traditional switch vs a P4-defined switch
2.4 Programming a target using P4
2.5 An architecture model
2.6 Dataflow topology
2.7 Sections within a typical P4 program
2.8 An abstract parser state machine
2.9 Parser state machine
2.10 Action code and data
2.11 Match-action unit
2.12 P4-SDNet compilation flow
2.13 Board level block diagram
2.14 T-BERD/MTS-5800 hand-held network tester

3.1 Overview of the design
3.2 XilinxSwitch layout
3.3 Ethernet frame
3.4 Parser graph to extract stacked VLAN tags
3.5 P4 defined module
3.6 On-board bus architecture
3.7 UPI bus operation
3.8 Bus architecture incorporating UPI-AXI4 Lite translator
3.9 UPI-AXI4 Lite translator
3.10 Flow diagram for UPI-AXI4 Lite translator
3.11 PCS-Only Core Variant
3.12 Normal 64 Bit Frame Transfer
3.13 10G Ethernet Subsystem
3.14 Tuple Controller
3.15 AXI4-Stream switch

4.1 UPI-AXI4 Lite write operation
4.2 UPI-AXI4 Lite read operation
4.3 Tuple Controller simulation results
4.4 Self-looped test setup
4.5 Packet stream without P4 defined module
4.6 The final block design
4.7 Test setup to observe the desired P4 defined processing
4.8 Lane 1 with P4 defined packet processing
4.9 Lane 0 displaying rerouted packets from lane 1
4.10 LUT variations w.r.t the number of headers, tables and write operations

4.11 Flip-Flop variations w.r.t the number of headers, tables and write operations
4.12 BRAM variations w.r.t the number of headers, tables and write operations
4.13 Average latency w.r.t the number of headers, tables and write operations

List of Tables

1.1 The seven layers of the OSI Model
1.2 OpenFlow standards and the defined header fields

2.1 Table restrictions based upon the match kind
2.2 Capabilities of 7 Series FPGA

3.1 Partial representation of a populated table

4.1 Entries within the populated table
4.2 Resource utilization for various P4 descriptions
4.3 Latency readings for different P4 descriptions


List of Abbreviations

TCP/IP  Transmission Control Protocol/Internet Protocol
SDN     Software Defined Networking
OSI     Open Systems Interconnection
ISO     International Organization for Standardization
FE      Forwarding Element
IDS     Intrusion Detection System
VXLAN   Virtual eXtensible Local Area Network
POF     Protocol-Oblivious Forwarding
ASIC    Application-Specific Integrated Circuit
NPU     Network Processing Unit
FPGA    Field-Programmable Gate Array
PSA     Portable Switch Architecture
RTL     Register-Transfer Level
P4      Programming Protocol-independent Packet Processors
MAT     Match-Action Tables
HDL     Hardware Description Language
SoC     System on Chip
API     Application Programming Interface
GPCM    General-Purpose Chip-select Machine bus
UPI     UltraPath Interconnect
SFP+    Small Form-factor Pluggable transceiver
MGT     Multi-Gigabit Transceiver
IEEE    Institute of Electrical and Electronics Engineers
SVLAN   Service Virtual Local Area Network
PCP     Priority Code Point
DEI     Drop Eligible Indicator
VID     VLAN Identifier
CVLAN   Customer Virtual Local Area Network
AXI     Advanced eXtensible Interface
PMD     Physical Medium Dependent
PMA     Physical Medium Attachment
PCS     Physical Coding Sublayer
MDI     Media Dependent Interface
MAC     Media Access Control
MII     Media Independent Interface
PLL     Phase-Locked Loop
LUT     LookUp Table
FF      Flip-Flop
BRAM    Block Random-Access Memory


Dedicated to my Dad and Mom.


Chapter 1

Introduction

After the advent of electronic computers in the mid-twentieth century, early concepts of wide-area networking were explored in several computer science laboratories. One of the earliest recorded descriptions of the social interactions that networking could enable was a series of memos written in August 1962 by J.C.R. Licklider of the Massachusetts Institute of Technology, discussing his "Galactic Network" concept [1]. That concept of a globally interconnected set of computers, through which users could transfer information, has become the paramount vision of today's modern Internet.

In July 1961, Leonard Kleinrock at the Massachusetts Institute of Technology published the first paper on packet switching theory, titled "Information Flow in Large Communication Nets"; he subsequently addressed the feasibility of communication using packets instead of circuits in his first book, titled "Communication Nets: Stochastic Message Flow and Delay". One of the most important hurdles at the time was enabling multiple users to interact at the same instant. This posed the need for a well-connected network that allowed multiple nodes to communicate simultaneously using the same resources. In 1965, MIT researcher Lawrence G. Roberts worked with Thomas Merrill to create the first ever wide-area network, using a low-speed dial-up telephone line to connect the TX-2 computer in Massachusetts to the Q-32 in California. This experiment helped conclude that time-shared use of resources could enable computers to work well together, running programs and retrieving data on a remote machine. At the time, conventional circuit-switched telephone systems were totally unusable for the application, and Kleinrock's concept of packet switching was the best practical approach [1].

Today, after more than five decades, the Internet has become a global network of multiple computer clusters interconnected using a communication protocol suite such as the Transmission Control Protocol/Internet Protocol (TCP/IP). With the surge in Internet users over recent years, many applications are hosted over the Internet with a focus on various services such as email, e-commerce, social networking and video streaming. Ensuring the desired user experience amid the increasing need to manage packet traffic demanded continuous innovation to improve the conventional architecture that supports the Internet. Lately, optic fibers have enabled the transmission of packets at extremely high speeds, but to exploit this opportunity to the maximum it is necessary to ensure flexible and fast processing of the transmitted packets within the various network packet processing elements that handle network traffic management. This demand to compensate for the lack of flexibility, and for the complexity of feeding refined traffic management rules within the conventional networking paradigm, motivated the need for Software Defined Networking (SDN) solutions. SDN enables network engineers to deploy flexible packet processors within the network that are capable of conducting the desired packet processing based upon the engineered traffic rules, and further provides a global view, or abstraction, of the entire network architecture.

According to the widely adopted Open Systems Interconnection (OSI) reference model, developed in 1983 through the combined efforts of the International Organization for Standardization (ISO) and the Telecommunications Standardization Sector of the International Telecommunication Union (ITU-TS), the network is partitioned into a vertical set of seven layers. The primary goal of the OSI model is to permit nodes to push packets into a physical network and ensure they travel to the destination independently [2]. Each layer is concerned with a specific set of functionalities and enhances the services offered by the immediately lower layer, as described in table 1.1 [3].

No.  Layer         Functionality
1    Physical      Interface with the physical medium to transmit an unstructured bit stream.
2    Data Link     Transmission of frames over single network connections.
3    Network       Reliable communication over one or more subnetworks.
4    Transport     Reliable and transparent transfer of data between end points.
5    Session       Management of sessions between end points.
6    Presentation  Encoding (data presentation) during transfer.
7    Application   Provision of services to the end user by processing of information.

TABLE 1.1: The seven layers of the OSI Model derived from [3].

The physical, data link and network layers are responsible for the communication between the two systems situated at the transmitting and receiving ends. The physical layer manages the transmission of bits between nodes over a medium. It deals with interfacing the node with the transmission hardware, physical connector characteristics and voltage levels for the encoding of binary values. The data link layer ensures the reliable transmission of data between adjacent nodes, enhancing the reliability of the bit transmission within a single link. If the link between two end nodes is indirect, the transmission has to pass through multiple links, and the reliability of such a transmission is handled by higher layers within the OSI model. The network layer is responsible for the routing and forwarding of packets. Routing determines the path a packet must traverse to reach its destination, while forwarding deals with the passing of packets between subnetworks. This layer also ensures that data units are segmented so as to be acceptable to the data link layer [3]. This research focuses on proposing a highly flexible, software defined packet switching technique between optical links.

1.1 Background and Motivation

Over recent years, the number of user applications running over the Internet has increased drastically. For the existing network, this meant a remarkable rise in packet traffic and an increasing necessity for sophisticated protocols supporting finer traffic engineering rules that ensure efficient traffic management and faster networking for the successful interaction of user applications communicating over the Internet. Network switches that managed the network were equipped with functionalities such as access control, tunneling, and overlay formats [4]. In addition, the recent inclusion of SDN capabilities on these network switches accelerated the design of newer protocols with run-time configurable traffic engineering rules by separating the control-plane from the data-plane. This segregation enables the control-plane to have an overall view of multiple packet processing data-planes and also makes the implementation of intricate traffic engineering rules using SDN more hierarchical and meaningful.

Figure 1.1 depicts the various layers involved within an OpenFlow based SDN framework. The bottom layer comprises the physical infrastructure, which is the cluster of interconnected Forwarding Elements (FE) that eventually implements the desired data-plane with adequate routing algorithms. The second tier can be described as the network control layer; it is decoupled from the underlying infrastructure and behaves as a middleman by enabling the topmost network application layer to centrally control the network infrastructure and realize much more efficient traffic management. The network application layer, which directs the traffic through the controller, consists of SDN applications that perform functions such as network monitoring, intrusion detection systems (IDS), network virtualization and flow balancing.

FIGURE 1.1: OpenFlow based Software Defined Networking framework.

Today, packet switching devices can be programmed using various technologies in which the control-plane defines the forwarding functionality of the underlying data-plane. However, with the increase in complexity of application specific engineering rules, these rules need to match over more packet header fields. Therefore, in addition to the increase in space needed to store the new rules, the space for the key packet header fields also increases. For example, in the case of OpenFlow, the abstraction of a single table of rules that could match packets on a dozen header fields, like the MAC addresses, IP addresses, protocol and TCP port numbers, was relatively simple. Over the past five years, the combination of matching fields has grown significantly, in terms of many more header fields and multiple stages of rule tables, as shown in table 1.2 [5].

Version  Date      Header fields
OF 1.0   Dec 2009  12 fields (Ethernet, TCP/IPv4)
OF 1.1   Feb 2011  15 fields (MPLS, inter-table metadata)
OF 1.2   Dec 2011  36 fields (ARP, ICMP, IPv6, etc.)
OF 1.3   Jun 2012  40 fields
OF 1.4   Oct 2013  41 fields

TABLE 1.2: OpenFlow (OF) standards and defined header fields [5].

Such forwarding devices are generally implemented on a packet switching chip with dedicated hardware. A variation in the protocol header fields as frequent as in the example described above requires an expensive redesign of the hardware that could take a few years to implement extensively. The time spent in the design and standardization of the Virtual eXtensible Local Area Network (VXLAN) in the past is proof of this delay [4, 6]. Second, the functionality of a switch is currently defined by the device vendor and not by the network operator who deploys these devices and has a better understanding of the network. Recent trends point to the need to transition towards a "top-down" view commanded by the network operators, instead of the traditional "bottom-up" view dictated by the switch vendor.

Therefore, to meet the demands of a market that requires an upgrade, it is necessary to research the best techniques that ensure the scalability of network switches in terms of varying matching fields, faster lookup times and increasing rule aggregation. There are various packet processing languages under development, each with pros and cons, that deal with instilling the above-mentioned characteristics on various customized hardware. For example, Protocol-Oblivious Forwarding (POF) handles packet headers as tuples in terms of offset and length. This results in a programming model that resembles an assembly language, in which the burden of parsing is dealt with by the programmer [4, 7]. packetC, on the other hand, is a domain specific language that enables access to the packet payload. It focuses on more flexibility and lower performance while programming Network Processing Units (NPU), Field-Programmable Gate Arrays (FPGA) and software switches [4, 8]. PX targets FPGA platforms by converting a high level declaration to a Register-Transfer Level (RTL) description of the target substrate in a Hardware Description Language (HDL) [4, 9]. Therefore, PX restricts itself to FPGA platforms.

This research shall consider Programming Protocol-Independent Packet Processors (P4) as the programming language to design the next generation SDN switch, taking into account the cons of the previously mentioned high-level languages. The three main traits of P4 that are of great value to this research are reconfigurability, protocol independence and target independence [5, 10]. Hence, P4 has been utilized to define the data-plane of the SDN switch, with the user capable of defining, primarily, the functionalities of the parser, the Match-Action Tables (MAT) and the deparser.
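To give a flavor of the reconfigurability explored in the chapters that follow, the sketch below shows how little code it takes to declare support for a new encapsulation header in P4. The field layout follows the VXLAN format standardized in RFC 7348; the type name vxlan_h is chosen here purely for illustration and does not appear in the thesis design.

// Declaring the VXLAN header (RFC 7348) as a P4 header type. On a
// fixed-function switching chip, the equivalent change meant a redesign
// of the silicon.
header vxlan_h {
    bit<8>  flags;      // only the I (valid VNI) flag is defined
    bit<24> reserved;
    bit<24> vni;        // VXLAN Network Identifier
    bit<8>  reserved2;
}

Once such a header is declared, only a parser state and the relevant table keys need to be updated for the device to process the new format.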

1.2 Problem Statement

By segregating the control-plane and the data-plane, modern switches are capable of re-engineering the traffic rules with relative ease, to a certain extent. SDN gives operators programmable control over the network switches, as compared to the traditional way of deploying vendor-manufactured black-box switches with fixed functions and less optimized routing techniques. With the constant evolution in protocol formats, it has become necessary to ensure that the data-plane is not only highly flexible but also easily reconfigurable, in addition to permitting the control-plane to decide on the rules.

As discussed previously, P4 is a high-level approach that is optimized for efficiently describing packet forwarding by allowing the designer to customize the parser, the match-action tables and the deparser. The problem statement for this research project therefore primarily focuses upon how to design the next generation of high-speed SDN switches that are easily reconfigurable with minimum constraints and complement the conventional networking infrastructure by exploring the capabilities of P4. Secondly, to what extent is the "top-down" approach truly achievable in defining the network? Lastly, what is the cost of proposing such a design in terms of parameters such as resource utilization and latency in the case of sophisticated data-plane descriptions? A hardware framework that interfaces with the external physical optical links and communicates with the analyzed P4 defined module shall be the by-product of this research.

1.3 Purpose

The aim of this thesis report is to propose a viable solution in response to the above-mentioned problem statement by designing a version of the network switch that addresses the required characteristics. The various stages of the switch's hardware architecture that are crucial to accept, process and eject packets shall be discussed in detail, with a primary focus on the development and integration of the module defined using P4, consisting of a custom-made parser, lookup tables and deparser. The experiment shall showcase the ease of hardware reconfigurability for future innovations in protocol formats and the ability of the control-plane to describe traffic rules using Application Programming Interfaces (API). The report shall be a guide to future researchers who desire to explore further within this domain.

1.4 Goals

The vision is a faster network that is controlled by numerous SDN applications, with an underlying infrastructure that is easily reprogrammable by network operators to accommodate the growing number of protocol formats capable of enforcing finer traffic rules using highly flexible switches supplied by vendors. The goal of this thesis is to explore the practicality of designing such packet switching hardware on a Kintex 7 FPGA that is compatible with the P4 definition of a data-plane, and subsequently analyzing the impact on parameters such as resource utilization and latency. Such a design shall demonstrate the impact of the top-down approach in redefining the network.

1.5 Benefits, Ethics and Sustainability

The results of this thesis shall cater to the design of a network switch commanded by network operators, adhering to the top-down approach. The data-plane is entirely defined over an FPGA and shall be easily reconfigurable to test new protocol formats or algorithms. This would ensure that the packet switching techniques commanded by the control layer optimize the path a packet travels, from source to destination. Keeping in mind the continuous upgrading of the network architecture, the proposed hardware pipeline shall be scalable to higher transfer rates with minimum effort, in turn shrinking the heap of obsolete technology that threatens the sustainability of our world.

As far as ethics is concerned, the thesis report shall grant credit to previous research that served as a reference in proposing such a design. Citations shall be responsibly stated to appreciate all the previous efforts that complemented the outcome of this research. Data and figures conceived elsewhere shall be reproduced, where necessary for the better understanding of the reader, with proper references to the source of information. To adhere to the confidentiality terms and conditions stated by Infinera, some of the contents are described in an abstracted manner. Nevertheless, the work presented in this report shall be comprehensible to the reader.

1.6 Research Methodology

To explore the options of proposing a viable switch design, the research shall primarily consider the application of a combination of qualitative and quantitative research methods, collectively known as the "triangulation" approach, to study the phenomenon. Quantitative research methods shall incorporate experiments and continuous testing during the various stages of development and integration of the new components that constitute the overall design. Qualitative research methods shall help analyse the final design and study the implications in terms of resource utilization and latency for the adopted standards, and could be a guide in developing the next generation of SDN switches. The reason for mixing these two methodologies is to get a complete view of the research area and to let the results complement each other [11]. Therefore, the final hardware prototype shall go through a phase of design, implementation and testing for various cases before a conclusion is articulated.

At the beginning of the thesis, a brief study shall be conducted to determine the features of the existing proprietary board upon which the hardware is to be designed. Subsequently, it is necessary to meticulously research the modules that must be developed and integrated to build a switch that communicates with the rest of the network. Using the concepts of empirical research, knowledge must be derived from proofs of experimentation and test predictions. Therefore, to stay in accordance with the research guidelines, the next stage is to develop, integrate and test each feature one after the other, to understand the corresponding functionality and implications through actual experience [11]. This research method shall guarantee a reliable design while promising the validity of the results declared.

Once a prototype capable of accepting packets is developed, the entire flow can be tested by self-looping the transceivers. The next phase is to use the same empirical techniques to further study the hardware design after integrating the packet processor module defined in P4. In addition to the default actions in the case of a total mismatch, it is possible to program the lookup tables from the control-plane. The ease of defining new rules and reconfiguring the FPGA shall be quantitatively discussed. The above-mentioned procedure shall be initially tested on a single lane and subsequently scaled to a two-lane configuration, using arbitration techniques to ensure the sharing of resources. However, the idea is to operate on multiple lanes in the future.

The research and development adopted in the thesis work benefits from adhering to the basic principles of an agile workflow in a minimalistic fashion. The thesis kicked off by understanding the impact of the outcome of this research and charting a rough plan that could help track the progress of the tasks. Instead of a straightforward waterfall style, the work involves iterative development with minor documentation and continuous testing, using software tools to simulate the hardware functionality and eventually observing the expected behaviour on an FPGA using test instruments. However, it is worth mentioning that learning more about the subject demanded that the initial plan be modified to ensure a refined outcome.

As previously discussed, the various stages of development can be classified under two broad categories, focusing on the development of a P4 compatible hardware pipeline and on defining a suitable algorithm in P4. The second category shall shape the primary feature of the network switch and must be studied to understand the limitations of this design. Qualitative results shall finally be gathered to evaluate the characteristics of the proposed next generation of SDN switches designed by the aforementioned techniques.

1.7 Delimitation

Designing modern network switches to be compatible with the "top-down" approach exposes the network architecture to attacks. The segregation of the control-plane, with its overall command over multiple data-planes, makes the control-plane a sweet spot for attackers to redefine the traffic rules and create havoc remotely [12]. It is vital that vendors design such technologies keeping in mind the security risks involved. The proposed design must be improved further to mitigate such risks.

Currently, the research studies the phenomenon and the ease of reconfigurability at a scaled down transfer rate. Since the application demands a faster version, the design must be modified to meet the high-speed requirements. However, the proposed design has taken this factor into account and ensures minimum effort in making this modification.

In addition to the previously mentioned delimitations, the design needs amelioration to plug in modified P4 modules without the operator bothering about the remaining components of the hardware pipeline that constitute the switch. Currently, the number of P4 architecture models that can be adopted is limited by the compiler used; in the future, however, modifications made to the P4 module's architecture must be readily recognizable and accepted by the remaining components of the network switch hardware pipeline.

1.8 Outline

This section is a descriptive guide to the structure adopted while writing this report. Readers can get a glimpse of what to expect from the chapters that follow.

Chapter 2, titled "Theoretical framework", discusses SDN in greater depth and the change in trends over recent years. OpenFlow and the drawbacks that limit its SDN capabilities are briefly mentioned to state the significance of P4 in describing Softly Defined Networks. Subsequently, the various characteristics of P4 and the basic structure of a P4 program are discussed to better comprehend the chapters that follow. Finally, this chapter looks back into previous quality work and technologies that influenced the direction of this research. Some miscellaneous information with regard to the hardware platform and test instruments used for this project is included within this chapter.

Chapter 3, "P4-enabled switch: The proposed design", discusses the adopted P4 architecture model and the custom P4 described data-plane. The supporting hardware framework, upon which the cost of defining a P4 defined switch is studied, is described in terms of the building blocks used for the design. The algorithm, configuration and signals associated with each building block are discussed in detail.

Chapter 4, "Results", discusses the goals achieved within this thesis work. First, the simulation and integration of the necessary building blocks are discussed. After the successful integration, the final block design is depicted in this chapter and tested for the desired behavior using a test setup. Subsequently, to study the cost incurred, various P4 descriptions are analyzed in terms of resource utilization and latency with increasing complexity.

Chapter 5, "Conclusion and future work", describes the conclusions that can be drawn from the observations discussed in Chapter 4. Future work to improve the findings articulated in this report is also discussed in this chapter.

Chapter 2

Theoretical framework

2.1 Software Defined Network to Softly Defined Network

In order to offer remedies to ever-evolving operator demands, system vendors realized that innovation must reduce development cycles from years to months, with the added advantage of taking into account the flexibility of realizing new networking topologies. Software Defined Networking was a promising approach to tackle these demands, and the conceptualization of such a technique was driven by the constant reworking of standards and by over-the-top services that could dynamically define the network. Initially, the traditional belief that software is easy to redesign while hardware is expensive and hard to redesign induced the assumption that "there is relatively dumb switching hardware for high-speed packet forwarding, and relatively intelligent software running on processors for lesser-speed packet forwarding and networking control" [13].

In addition to the limitations briefly mentioned in Chapter 1, OpenFlow is a clear example of the above stated assumption. OpenFlow is one of the pioneering attempts at SDN and is sometimes wrongly considered the best approach to achieve it. In OpenFlow 1.0, the accessed model contained a single lookup table for matching certain predefined packet fields and only allowed simple actions. Subsequent versions of OpenFlow operated on models with multiple sequential lookup tables, extended sets of predefined packet fields and more actions. Further upgrades even improved the language interface and aimed towards a higher level of abstraction. Nevertheless, OpenFlow offered a very restricted view of the underlying forwarding architecture and did not fully tap into the degree of programmability, limiting the end user to work with predefined protocol formats [13].

Over the course of time, it became obvious that packet forwarding solutions relying upon simple switching hardware alone, with complex functionalities handled by software, had limitations in terms of the flexibility to deliver the requisite performance. IT architects saw that the underlying hardware had to evolve by crossing these limitations for future developments, working towards the goal of implementing a set of dynamic virtual services under software control. This questioned the role of fixed function ASICs and highlighted the importance of moving towards more flexible Network Processors (NPUs).

FPGAs and SoCs have an even higher degree of flexibility, and their capability to handle complex functionalities, in addition to being commanded by software, is an opportunity to replace the conventional dumb hardware model. Every necessary functionality can be handled by these all-programmable devices while supporting the requisite line rates and packet processing rates associated with the next generation of networking platforms. FPGA technologies have the capability to blur the roles of hardware and software by offering the scope for defining 'soft hardware'. This terminology highlights the capabilities of highly flexible and easily programmable hardware, manufactured by companies such as Xilinx, which are calling their next generation of programmable networking platforms, going beyond SDN, 'Softly Defined Networking' devices [14].

Therefore, in comparison to conventional SDN devices that function upon a fixed data-plane implementation, Softly Defined Networks have a software defined data-plane implemented on re-programmable hardware, in addition to the common software defined control-plane. Various development environments have recently focused on providing high level definition capabilities for end users to fully customize the underlying data-plane and easily program it through the control-plane with the necessary APIs. Figure 2.1 has been derived from the guide of a promising development environment offered by Xilinx and is a comprehensive pictorial description of the above-mentioned comparison.

FIGURE 2.1: Software Defined Networks to Softly Defined Networks.

With all these improvements aimed at implementing SDN in the most flexible manner, it is important to practically prove the delivery of high performance with the added advantage of programmability in a high-level manner. To mention some standards, improvements must go beyond the well-known NetFPGA research vehicle, which supported line rates of 4x1G initially and 4x10G in its later versions, together with a hardware design programming experience [13].

To summarize, compared to the traditional networking approach, SDN differs by primarily separating the data-plane that forwards the traffic from the control-plane that commands the rules on which decisions are made. Secondly, SDN offers an interface between the separated control and data planes. Thirdly, the control-plane logic is migrated to a logically centralized controller that offers a global view of the underlying network resources, enabling applications to command and optimize policies. These changes, together with a much more flexible underlying soft hardware, radically increase the pace of network innovation to improve performance, scalability, cost, flexibility, and ease of management [15].

Figure 2.2, derived from [15], is a more descriptive version of the previously illustrated figure 1.1 and depicts the desired overall network framework, with a segregated, centralized control-plane that commands the underlying softly defined data-planes. The SDN applications that form the application layer communicate with and orchestrate the underlying infrastructure layer through the centralized control layer services to realize finer traffic engineering rules. The physical switches that constitute the lowest layer of the network are studied in this report.

FIGURE 2.2: The overall network framework.

2.2 P4: Programming Protocol-independent Packet Processors

P4 is a high-level language for end users to define the various sections of a data-plane, determining the fashion in which packets are processed in a programmable forwarding element, spanning from software switches through FPGAs and NPUs to reconfigurable hardware switches [4, 10]. While P4 was initially proposed by Pat Bosshart et al. in a paper also titled "Programming Protocol-independent Packet Processors", it aimed at programming switches. However, the potential of this high-level language has broadened the spectrum of networking devices that can be efficiently designed using P4.

Compared to OpenFlow, which gives restricted access to only customize a limited number of flow tables, P4 intends to design the overall data-plane functionality of the networking device. Therefore, P4 exposes a wide range of data-plane parameters for customization to the network programmer without imposing the restrictions of OpenFlow, making innovation more agile, which in turn helps reduce the development cycle. Many devices implement both the control plane and the data plane. However, apart from the data plane, P4 is capable of only partially defining the interface by which the control-plane and the data-plane communicate.

In the case of a network switch, there are primarily two differences between a traditional switch and a P4 defined switch [10]:

• Data-plane:

– In a traditional switch, the vendor defines the data-plane functionality, which is not reconfigurable.
– In a P4 defined switch, the data-plane functionality is not fixed in advance. Initially, the hardware has no knowledge of the protocols desired by the operator or of the header fields to extract; these are configured during initialization based upon the P4 program description. This gives the end user a wider degree of customization compared to only modifying the routing tables.

• Routing tables and entries:

– In a traditional switch, the control-plane controls the data-plane by managing entries in a fixed number of routing tables, configuring specialized objects and processing control packets.
– In a P4 defined switch, the control-plane communicates with the data-plane in a similar fashion to the traditional switch, but the set of routing tables and configurable objects in the data-plane is not fixed; they are defined by the P4 program description. The P4 compiler generates APIs that facilitate the interface between the control-plane and the data-plane, providing access to the data-plane objects.

Figure 2.3 is a more descriptive version of the softly defined network switch shown in figure 2.1 and is adopted from the "P4-16 Language Specification" guide. It clearly illustrates the differences between a traditional and a P4-defined switch.

FIGURE 2.3: Traditional switch vs P4-defined switch [10].

While proposing P4 as a high-level language aimed at defining the data-plane for network devices, the authors had three main goals that would eventually promise a higher degree of flexibility [5]. These goals were:

• Reconfigurability - Users must be capable of easily modifying the switch behavior even after it is deployed in the field. In addition to catering towards an efficient means of testing new ideas, this ensures shorter development cycles and a faster time to market.

• Protocol independence - The P4 definition should not confine the design to any particular protocol format. The user must be able to define the functioning of the packet parser that extracts the required header fields and the set of match-action tables that perform actions based on the extracted header fields.


• Target independence - As with other high-level programming languages, a network developer's efforts should not depend on the underlying hardware that recognizes the P4 description. While defining the data-plane for the target device in P4, users must be able to work with a target-independent description and leave the necessary target-dependent translation to the compiler.

Before discussing the various programmable components that constitute the data-plane of a network switch, it is crucial to understand the typical tool workflow that enables target programming using P4. To begin with, a hardware or software implementation framework, a P4 architecture model definition and a target specific P4 compiler are required. Compiling a user-defined P4 description adhering to the architecture model definition produces two outputs:

• A data-plane configuration that implements the behaviour described by the input P4 program. This is loaded onto the overall switch hardware pipeline/framework that interfaces with the rest of the network. This is when the switch becomes aware of the various networking protocols, the header fields to parse and the match-action tables to allocate.

• A run-time API that helps the control-plane manage the data-plane objects. This caters to the successful interaction between the two layers and ensures the vital stage in which the control-plane configures the match-action tables with the desired traffic engineering rules.

The overall workflow in programming a target using P4 currently requires the target, a compatible architecture model and a P4 compiler to be provided by the manufacturer. The user-defined P4 description is subsequently compiled to generate the above-mentioned outputs, which are then loaded onto the target. This workflow is illustrated in figure 2.4, adopted from [10].

FIGURE 2.4: Work-flow in programming a target using P4 [10].

Apart from the prerequisite compiler and target, P4-16 incorporates a new capability to enable P4 on a diversity of devices. The P4 architecture model defines the necessary P4-programmable blocks and the data-plane interfaces that carry the signals required by the user-defined P4 program. Architectures insulate programmers from the underlying target framework details and provide an overview of the requisite framework that needs to accommodate the P4 definition. Hardware providers are responsible for defining the compatible architecture models and for implementing the necessary compiler back-end to map the architecture model and the user-defined P4 descriptions to the respective target-specific configuration [10].
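To make this concrete, the fragment below sketches what an architecture model declaration looks like in P4-16: the manufacturer fixes the signatures of the programmable blocks and packages them into a pipeline, while the user later supplies the block bodies. This is a generic illustration; the names PipeParser, PipeMatchAction, PipeDeparser and SimplePipe are invented here and are not the Xilinx or PSA definitions.

#include <core.p4>

// Block signatures fixed by the target vendor; H and M stand for the
// user's header and metadata struct types.
parser PipeParser<H, M>(packet_in pkt, out H hdr, inout M meta);
control PipeMatchAction<H, M>(inout H hdr, inout M meta);
control PipeDeparser<H>(packet_out pkt, in H hdr);

// The package wires the three programmable blocks into one pipeline.
package SimplePipe<H, M>(PipeParser<H, M> p,
                         PipeMatchAction<H, M> ma,
                         PipeDeparser<H> d);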

The vital abstractions allowed within the P4 language are header types, parsers, tables, actions, match-action units, control flow, extern objects, user-defined metadata and intrinsic metadata. These are explained in detail within the P4-16 Language Specification document provided by the P4 Language Consortium [10].

2.2.1 Architecture model

A single pipeline forwarding architecture that efficiently generalizes, and closely relates to, the architecture model adopted for this thesis work primarily involves three programmable blocks: the parser, the match-action unit and the deparser. As discussed previously, this architecture model is defined by the target manufacturer to be compatible with the target-specific compiler. In the future, P4 compilers shall share a common front-end that understands multiple architecture models.

Within this model, packets initially pass through a programmable parser that permits new headers to be defined, instead of the fixed parser assumed by OpenFlow. This block is characterized using a parse graph that defines the various states involved during parsing. Next, the extracted fields and the metadata are processed through a set of match-action tables arranged by the compiler in an optimized mixture of series and parallel configurations, as compared to the series configuration permitted in OpenFlow. The architecture model assumes that the actions defined are a set of protocol-independent primitives supported by the switch. Architecture models allow the use of a common language like P4 to express how packets are processed in various forwarding devices, such as Ethernet switches and routers, built upon different technologies such as fixed-function ASICs, NPUs, reconfigurable switches, software switches and FPGAs.

FIGURE 2.5: Architecture model and programmable blocks derived from [16].

Figure 2.5 depicts the described switch architecture in a more comprehensive manner and is derived from [10, 15, 16]. The colored blocks are the programmable sections, apart from the meta-data bus, that are defined within a P4 description. As shown, the custom P4 definition that describes the parse graph formulates the parser and deparser functionality, as they are complementary to each other. The match-action unit, comprising the ingress and egress pipelines, is defined using the control program, the desired table configuration and the set of permitted actions. Initially, the incoming packet is parsed at the parser to extract the desired header fields defined by the developer. Subsequently, these header fields are looked up as key fields within the defined match-action tables, which shall be populated by the control-plane with the desired actions to be performed against a specific matched key field. At the output, the packet is finally restructured with the desired modifications by the deparser. Refer to [10] to study an example architecture model defined in P4. A standard Portable Switch Architecture (PSA) is currently being defined by the P4-16 architecture working group, with six programmable blocks, two fixed blocks and functions to support its capabilities [17]. Figure 2.5 can be considered a simplified version of the PSA model and is useful for multiple applications.
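To illustrate how a user fills in the match-action unit of such a pipeline, the sketch below defines a minimal ingress control against the hypothetical SimplePipe model declared above. The headers_t struct is the one from the Xilinx example in section 2.2.2; the metadata fields, table name and actions are invented for illustration and are not those of the thesis design.

// User-defined metadata carried on the meta-data bus (illustrative).
struct meta_t {
    bit<9> egress_port;
    bit<1> drop;
}

control Ingress(inout headers_t hdr, inout meta_t meta) {
    // Actions that the control-plane may bind to matched keys.
    action set_egress(bit<9> port) { meta.egress_port = port; }
    action drop_pkt() { meta.drop = 1; }

    // A match-action table keyed on the destination MAC address
    // extracted by the parser; rules are installed at run time
    // through the compiler-generated API.
    table l2_forward {
        key = { hdr.ethernet.dst : exact; }
        actions = { set_egress; drop_pkt; }
        default_action = drop_pkt();  // taken on a total mismatch
    }

    apply {
        if (hdr.ethernet.isValid()) {
            l2_forward.apply();
        }
    }
}

// The user's blocks are finally handed to the package, for example:
// SimplePipe(Parser(), Ingress(), Deparser()) main;

How such a table maps onto target resources, BRAM-backed lookup memories on an FPGA for instance, is precisely the cost dimension analyzed later in this report.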

To incorporate P4 into the proposed design and test for the desired results, Xilinx's P4-SDNet compiler shall be utilized. The architecture models currently supported by this compiler are:

• XilinxSwitch
• XilinxStreamSwitch
• XilinxEngineOnly

XilinxSwitch closely resembles a simplified PSA and has been adopted for the purposes of this thesis. XilinxStreamSwitch is an experimental feature; it is similar to the XilinxSwitch architecture model and differs only in terms of the deparser definition. XilinxEngineOnly is an architecture model used by developers to define stand-alone SDNet engines without any other interfaced engines, as in the case of XilinxSwitch and XilinxStreamSwitch. A detailed description of each of these models, with their source code, is available for further study within the "P4-SDNet User Guide" [18].

However, the adopted XilinxSwitch architecture model is described in detail within Chapter 3 for the reader's comprehension.

2.2.2 P4 description

As briefly discussed, the blocks of the architecture that constitute the anatomy of a basic pipeline and that are available to the programmer for customization are:

• Parser
• Match-Action tables
• Deparser
• Meta-data bus

The simplified data-flow topology for the above blocks is depicted in figure 2.6. The blue arrow depicts packet transmission and the green arrow depicts tuple transmission.

FIGURE 2.6: Programmable blocks and the data-flow topology.

However, apart from the meta-data bus, each component is optional within a P4 definition. Figure 2.7 depicts the various sections within a typical P4 program. The data declaration is the first section of the P4 program; it defines the data types of the header fields and of the data passed within the data-plane pipeline using the meta-data bus. The second section includes the parser definition, which describes the various states that constitute the parser graph and is involved in the header extraction. In the third section, the control block declaration includes the list of match tables and actions utilized for the packet processing. Finally, the last section defines the deparser, which stitches the packet back together with the desired modifications. Each of these components, along with sample P4 definitions, is discussed in the following subsections.

FIGURE 2.7: Sections within a typical P4 program.

2.2.2.1 Meta-data bus

The description of this component is done within the data declaration section of the P4 program and defines the structure of the intermediate results, such as the header formats, that traverse between the parser, match-action tables and deparser. These definitions are similar to a struct in C programming, or to a very wide array of registers that hold the intermediate values coming out of the partly processed packets.

FIGURE 2.6: Programmable blocks and the data-flow topology.

Therefore, these values, also known as meta-data, carry information within the entire pipeline. The code segment below, taken from [18], is example code provided by Xilinx and depicts how various header fields are defined within the data declaration section of a P4 program. These defined fields shall be extracted by the parser for further packet processing. For example, header ethernet_h is a group of a 48-bit destination address, a 48-bit source address and a 16-bit type field.

typedef bit<48> MacAddress;
typedef bit<32> IPv4Address;
typedef bit<128> IPv6Address;

header ethernet_h {
    MacAddress dst;
    MacAddress src;
    bit<16> type;
}

header ipv4_h {
    bit<4>  version;
    bit<4>  ihl;
    bit<8>  tos;
    bit<16> len;
    bit<16> id;
    bit<3>  flags;
    bit<13> frag;
    bit<8>  ttl;
    bit<8>  proto;
    bit<16> chksum;
    IPv4Address src;
    IPv4Address dst;
}

header ipv6_h {
    bit<4>  version;
    bit<8>  tc;
    bit<20> fl;
    bit<16> plen;
    bit<8>  nh;
    bit<8>  hl;
    IPv6Address src;
    IPv6Address dst;
}

header tcp_h {
    bit<16> sport;
    bit<16> dport;
    bit<32> seq;
    bit<32> ack;
    bit<4>  dataofs;
    bit<4>  reserved;
    bit<8>  flags;
    bit<16> window;
    bit<16> chksum;
    bit<16> urgptr;
}

header udp_h {
    bit<16> sport;
    bit<16> dport;
    bit<16> len;
    bit<16> chksum;
}

struct headers_t {
    ethernet_h ethernet;
    ipv4_h ipv4;
    ipv6_h ipv6;
    tcp_h tcp;
    udp_h udp;
}

FIGURE 2.7: Sections within a typical P4 program.

2.2.2.2 Parsing of the packet

Parsing is one of the initial operations performed on the packet, and its output of extracted headers is vital to the overall functioning of the SDN-based device. As already mentioned, to ensure support for ever-evolving network protocols and the increase in multi-gigabit transfer rates, this section of the packet processor must be fast and reconfigurable. Therefore, the parser generated through the high-level and configurable P4 approach must ensure low latency and high-speed packet streaming.

The P4 definition of a parser is basically responsible for identifying the purpose of the first N bits of an incoming packet and structuring them as a series of extracted fields with an associated label [19]. This set of extracted fields is known as the "parsed representation" of a packet. To achieve this, the P4 description of a parser can be viewed as a state machine with one start state named 'start' and two final states named 'accept' and 'reject'. Each state is responsible for the extraction of the defined fields and for deciding upon the path to traverse over the state machine. The final state 'accept' indicates the successful parsing of a packet and 'reject' indicates unsuccessful parsing. Figure 2.8 is an abstract illustration of a parser state machine that separates the final states from the P4-programmable states [10].

FIGURE 2.8: An abstract parser state machine [10].

To further explain an actual P4 parser definition, figure 2.9 depicts the parser state machine for the P4 code that is adopted from [18] and given below.

FIGURE 2.9: Parser state machine for the sample parser code.

parser Parser(packet_in pkt, out headers_t hdr) {
    state start {
        pkt.extract(hdr.ethernet);
        transition select(hdr.ethernet.type) {
            0x8200  : parse_ipv4;
            0x83FD  : parse_ipv6;
            default : accept;
        }
    }
    state parse_ipv4 {
        pkt.extract(hdr.ipv4);
        transition select(hdr.ipv4.proto) {
            8       : parse_tcp;
            10      : parse_udp;
            default : accept;
        }
    }
    state parse_ipv6 {
        pkt.extract(hdr.ipv6);
        transition select(hdr.ipv6.nh) {
            8       : parse_tcp;
            10      : parse_udp;
            default : accept;
        }
    }
    state parse_tcp {
        pkt.extract(hdr.tcp);
        transition accept;
    }
    state parse_udp {
        pkt.extract(hdr.udp);
        transition accept;
    }
}

2.2.2.3 Match-Action tables

The metadata extracted while parsing the packets is the key to classifying and manipulating the packets within the next control block stage. The body of the control block primarily consists of a variety of action definitions, which contain instructions to manipulate the metadata, and of tables, which map the extracted fields to be matched to their respective actions. These match-action units must be invoked to perform any form of data transformation, thereby being an inevitable part of a packet processing device. In addition to the metadata incoming from the parser, there could be other metadata coming externally along with the packet, if defined within the P4 architecture model.

Actions are responsible for the modification operations on the metadata being processed. Actions can be compared to function calls in other high-level languages, with the parameter values being written by the control-plane or supplied by the data-plane. If sufficient actions are defined within the P4 program, this allows the control-plane to command the manipulations to be made on the metadata, or in other words to define the traffic rules dynamically. Figure 2.10, derived from [10], illustrates an action definition within a P4 program. As shown in the figure, parameters traverse to the action code both from the data-plane and from the control-plane, as defined in P4. To make action definitions easier, there is a set of predefined primitive actions that can be used to compose a complex compound action, in addition to describing a custom action.

FIGURE 2.10: Action code, data and parameters [10].
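To make this split concrete, the minimal sketch below uses hypothetical action and field names: new_dst is action data bound by the control-plane when it installs a table entry, while the header being rewritten arrives from the data-plane.

// 'new_dst' is supplied at run-time by the control-plane as action data;
// 'hdr' is the parsed representation coming from the data-plane.
action rewrite_dst(bit<48> new_dst) {
    hdr.ethernet.dst = new_dst;
}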

Now that the necessary actions are explicitly defined, to implement different switching protocols it is crucial to ensure that these actions are performed in an orderly fashion based upon certain conditions applied on the meta-data fields. Therefore, to perform certain actions based upon matches, a lookup table must be implemented listing the various key fields to be matched and the corresponding predefined actions. To successfully process a packet using this match-action table the following steps must be executed:

• Construction of a key field to be matched upon.

• The match step: Key lookup within the lookup table to decide upon the actions to be executed.

• The action step: Based upon the matched key field, an action is executed.

Figure 2.11, adopted from [10], describes the cumulative functioning of a match-action table. As shown in this figure, the tool partially generates APIs that ensure the table contents can be manipulated asynchronously by the target control-plane. This feature, although inherited from the recent iterations of OpenFlow, is crucial to guarantee the promised protocol independence and reconfigurability to the end-user.

FIGURE 2.11: Match-action unit [10].

Currently there are four kinds of match declaration types defined within the P4 library. It is possible to use only these match kinds, and P4 programmers cannot define new ones [10]. The permitted match kinds are:

• Exact match

• Ternary match

• Longest prefix match

• Direct match

The keywords used to indicate the corresponding match types are 'exact', 'ternary', 'lpm' and 'direct'. The key fields to be matched can be defined using multiple match kind types in P4; however, the Xilinx tool used at the time of this thesis work allows only one match type per table. The sample code below describes a control block consisting of a match-action unit. As mentioned previously, P4 allows multiple match-action tables, which are optimized by the compiler to function in serial or in parallel. Refer to the P4 specification guide for more details regarding the various match types. Table 2.1 is adopted from [18] and states the restrictions imposed on tables of the various match kinds.

control Forward(inout headers_t hdr, inout switch_metadata_t ctrl) {
    action forwardPacket(switch_port_t value) {
        ctrl.egress_port = value;
    }

    action dropPacket() {
        ctrl.egress_port = 0xF;
    }

    table forwardIPv4 {
        key = { hdr.ipv4.dst : ternary; }
        actions = { forwardPacket; dropPacket; }
        size = 63;
        default_action = dropPacket;
    }

    table forwardIPv6 {
        key = { hdr.ipv6.dst : exact; }
        actions = { forwardPacket; dropPacket; }
        size = 64;
        default_action = dropPacket;
    }

    apply {
        if (hdr.ipv4.isValid())
            forwardIPv4.apply();
        else if (hdr.ipv6.isValid())
            forwardIPv6.apply();
        else
            dropPacket();
    }
}

Match Kind   Key Size (bits)   Element Size (bits)   Depth
exact        12-384            1-256                 1-512K
ternary      1-800             1-400                 1-4K
lpm          8-512             1-512                 7-64K
direct       1-16              1-512                 2-64K

TABLE 2.1: Table restrictions based upon the match kind [18].
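For completeness, the sketch below shows how a longest-prefix-match table would be declared; the table name is hypothetical, the actions are the ones from the sample code above, and the only change is the 'lpm' match kind on the key:

table forwardIPv4_lpm {
    key = { hdr.ipv4.dst : lpm; }
    actions = { forwardPacket; dropPacket; }
    size = 64;
    default_action = dropPacket;
}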

2.2.2.4 Deparser

The deparser is the final section to be defined within a P4 definition. The purpose of this component is to rearrange all the appropriately modified fields with the corresponding payload and finally stitch together a packet with the desired manipulations. The initial parsed representation of a packet undergoes major changes while being processed, such as modifications of header values and the omission or addition of header fields. Therefore, deparsing is essential to ensure an accurate restructuring of the desired output packet [19]. In simpler words, a deparser functions as the opposite of a parser.

To perform the rearranging, a deparser requires the modified meta-data and the corresponding input packet as passed parameters. The sample code given below, taken from [18], explains how to define a deparser in P4. Emitting a header or meta-data pastes it at the corresponding location within the output packet. While emitting a stack of values, all the elements of the stack are emitted in the order of increasing indexes [10].

control Deparser(in headers_t hdr, packet_out pkt) {
    apply {
        pkt.emit(hdr.ethernet);
        pkt.emit(hdr.ipv4);
        pkt.emit(hdr.ipv6);
        pkt.emit(hdr.tcp);
        pkt.emit(hdr.udp);
    }
}

Finally, following the deparser declaration, the desired packaged architecture model is declared with the above-mentioned programmable blocks, as shown in the Xilinx sample code [18] given below.

XilinxSwitch(Parser(), Forward(), Deparser()) main;

2.2.3 Benefits of using P4

To recapitulate, listed below are some of the major advantages of P4 described within the P4-16 Language Specification guide [10].

• Flexibility: As a high-level language, P4 gives programmers a higher degree of capability to customize the data-plane.

• Expressiveness: By defining general-purpose operations and table look-ups, P4 enables programmers to express complex packet processing algorithms.

• Resource mapping and management: The abstract data-plane description is compiled to map the defined fields to hardware resources and efficiently man- age allocation and scheduling.

• Component libraries: Hardware-specific functions can be wrapped into portable high-level P4 constructs.

• Decoupling hardware and software evolution: Low-level architectural details of the hardware can be further abstracted from high-level software processing.

2.2.4 P4 compiler and tools

Depending on the type of target used for the design, there is a variety of P4-supporting tools, comprising compilers and interpreters still under development, that focus upon converting high-level P4 code into an optimized target executable file. To achieve the goal of target independence, it is necessary to develop compilers irrespective of the target. The adopted approach to ensure this is to identify and segregate the compilation of P4 functions that are target-dependent and target-independent respectively. Therefore, a compiler can be broadly split into two sections: a frontend, which handles the target-independent compilation, and a backend, which handles the target-dependent compilation. When implementing a compiler for a new target, the effort is drastically reduced by focusing only on modifying the backend to generate a target-compatible executable file [16, 20].

For this thesis project, the switch is designed upon a Kintex FPGA manufactured by Xilinx, Inc. Therefore, to generate an appropriate executable, Xilinx's P4-SDNet tool framework, which is still under development, is utilized to ensure a compatible P4 compilation. The tool consists of a compiler with a target-independent frontend, p4c, which compiles the target-independent functionality into an intermediate representation. Subsequently, the backend converts the intermediate representation into target-dependent SDNet code. p4c supports both P4-14 and P4-16, and p4c-sdnet converts the P4 description into an appropriate SDNet description of a data-plane. This description consists of engines that primarily communicate through the data-flow of packets and tuples to implement a larger system behavior. SDNet [21] is a development environment provided by Xilinx which is capable of accepting this description and targeting the hardware sold by Xilinx. Together, the P4 files are compiled to classical Verilog files that can subsequently be used to generate the desired IP cores. Figure 2.12, derived from [22], illustrates the functioning of the P4-SDNet tool and lists all the generated output files of SDNet. As depicted, some of the essential output files generated are the Verilog output files with a top-level wrapper, testbenches, APIs for the control-plane to populate traffic rules, etc.

The "SDNet Installation and Getting Started"[23] contains information regarding the

FIGURE 2.12: P4-SDNet compilation flow [22]. general installation procedures of this tool. The essential instructions necessary to compile a P4 program and the output files generated are discussed in [18, 21]; how- ever, we shall discuss regarding some of the inevitable commands that helped obtain the desired results in appendix A.

2.3 Related work

After the introduction of P4 by Pat Bosshart et al. [5], there has been substantial progress in terms of enhancing the language capabilities to define packet processors and also in developing efficient tools that compile custom P4 descriptions. Leading companies and universities such as Barefoot Networks, Stanford University, Princeton University, Google, Microsoft Research, Xilinx Inc., etc. are collaborating to define the best methodology to implement the top-down approach irrespective of the target.

Anirudh Sivaraman et al. [4] have explored the use of P4 in defining the forwarding plane of a data-center switch. Various essential capabilities of P4 at the time are discussed in terms of implementing a data-center switch. To ease the task of prototyping, a software switch was considered as the target for the P4 programs. In addition to mentioning the pros of utilizing a high-level language, the paper proposes certain improvements to the P4 version explored at the time, in terms of the modularity of P4 code, explicit visibility of the flow of information from one table to another, parallel execution semantics, architecture-language separation, and new primitives such as cloning or digest generation.

Fabien Geyer et al. [24] have discussed the use of P4 to meet the need for higher flexibility in defining the custom network protocols generally used within the aeronautical industry. This work explores the capabilities of P4 in addressing various application requirements. Finally, a performance-based analysis is conducted upon different platforms: a software-based back-end using Intel DPDK, a hardware network accelerator based on an NPU, and an FPGA-based platform. The article states that since P4 is still under development, with incompatibility issues between its different versions, it is not yet suitable for aeronautical applications with long lifetimes. The possibility of formal analysis and the simple cost model nevertheless make P4 promising for future purposes.

Apart from hardware-based analyses, "PISCES: A Programmable, Protocol-Independent Software Switch" [25] discusses the use of a domain-specific language such as P4 in describing the behavior of a protocol-independent software switch such as PISCES. The proposed implementation is benchmarked in terms of overall performance upon increasing the complexity of both the parser and the actions. Finally, the results are compared against the conventional Open vSwitch.

2.4 Miscellaneous

This section introduces the miscellaneous prerequisite theory, such as details about the FPGA-based framework, high-speed transceivers and testing instruments, that was necessary to successfully design and test the proposed next-generation networking switch.

2.4.1 FPGA platform

While working with P4 on Xilinx devices, it is necessary to make sure that the device utilized is supported by the P4-SDNet tool. To meet the requirements of this thesis work, the proposed design shall be developed on a custom board equipped with a Kintex (xcku060-ffva1156-2-e) FPGA. The latest Kintex-7 FPGA family belongs to the Xilinx 7 Series and promises the best performance per price, with a twofold improvement compared to the previous generations [26]. The general features of the latest Kintex-7 FPGA are shown in table 2.2, which is derived from the "7 Series FPGAs Data Sheet" [26]. Since the proprietary platform is custom-made, some of the listed features might differ.

Feature               Kintex-7
Logic cells           478K
Block RAM             34 Mb
DSP Slices            1,920
DSP Performance       2,845 GMAC/s
MicroBlaze CPU        438 DMIPs
Transceivers          32
Transceiver Speed     12.5 Gb/s
Serial Bandwidth      800 Gb/s
PCIe Interface        x8 Gen2
Memory Interface      1,866 Mb/s
I/O Pins              500
I/O Voltage           1.2V-3.3V
Package Options       Bare-Die Flip-Chip and High-Performance Flip-Chip

TABLE 2.2: Capabilities of 7 Series FPGA [26].

The board utilized for this research consists of an MPC8321 microprocessor that uses an asynchronous interface to the FPGA, the General-Purpose Chip-Select Machine (GPCM) bus. The Kintex FPGA constitutes the traffic FPGA and shall contain the hardware pipeline to be designed that understands the P4-defined data-plane. Currently, the point-to-point interconnect within the FPGA uses Intel's UltraPath Interconnect (UPI). An abstract block diagram of the board from an FPGA designer's point of view can be summarized as shown in Figure 2.13.

For the purposes of this thesis work, the traffic FPGA shall be customized to accommodate and function in accordance with the defined P4 logic. Small form-factor pluggable transceivers (SFP+) are used to interface the board with the external optical interfaces, as shown in figure 2.13. It is important to mention that Xilinx provides power-efficient transceivers that enable high-speed optical interfacing with the board. The maximum line rate offered by GTH transceivers in the current 7-series is up to 12.5Gb/s. The reference clocking comes from an external clocking device, the AD9554-i. These transceivers are highly configurable, and Vivado offers a GT wizard which provides an easy means to instantiate these transceivers with the desired settings and connections. A brief overview of Multi-Gigabit Transceivers (MGT) can be found in [27], and [28] illustrates the intricate details of the GTH transceiver within a Kintex FPGA.

2.4.2 Simulation and testing

While designing the various vital components of the hardware pipeline that is capable of accommodating P4, it is best practice to test after each stage, both at the component level and at an integrated system level, for the desired behavior. Xilinx's Vivado has been used for component-level development using SystemVerilog and for conducting further simulations to investigate the expected behavior.

Once the desired behavior has been achieved, it is crucial to integrate the components and build the design within the FPGA for system-level simulations and testing. To perform testing at the aforementioned system level, the easiest approach was to utilize the T-BERD/MTS-5800 hand-held network tester. This enables the designer to easily generate the desired packets and efficiently analyze the packet processing performed. Figure 2.14 is an image of the device used for testing. For more details regarding the capabilities of the test instrument, refer to [29].

FIGURE 2.13: Board level block diagram.

FIGURE 2.14: T-BERD/MTS-5800 hand-held network tester [29].


Chapter 3

P4-enabled switch: The proposed design

To further explore the advantages of defining the data-plane using P4 and to study the resource- and latency-related constraints, it is necessary to configure a pipeline on an FPGA in such a way that it communicates with the external network links and also incorporates all the properties and adequate signals to interface a P4-defined data-plane. This shall be achieved by designing a two-lane 10G network switch complemented with the current capabilities of P4 on the proprietary FPGA platform discussed briefly in section 2.4.1. This chapter gives an insight into the most crucial stage of the thesis: the overall design of the FPGA-based solution that accommodates the P4-definable switch characteristics. The requisite theoretical background has been discussed in Chapter 2; however, vital component-specific theory and the related configuration are discussed where relevant.

Figure 3.1 gives a simplified overview of the design to be implemented on the Kintex FPGA enabled board. As emphasized in the figure, the P4 switch definition is the central and most important component, whose behavior determines the network packet processing. The rest of the design revolves around this module and focuses upon the necessary external interfacing of the board and the efficient run-time configuration of the various components, which also caters to the effective customization of traffic rules using control signals.

Currently, P4 descriptions compiled by P4-SDNet are capable of extracting and modifying header and tuple fields based upon the parser, match-action table and deparser definitions. In order for the packets to be processed accurately and rerouted based upon certain tuple field modifications, it is necessary to ensure an accurate P4 description and that the chosen architecture model sits well within the proposed hardware design.

Also, to justify the necessity of each component that constitutes the block design, it is vital to list some of the major hurdles that needed addressing during the implementation of this design from the viewpoint of an FPGA designer:

• the translation of existing bus operation to integrate the proposed design;

• GTH transceiver configuration and clocking to interface with the physical links;

• synchronization of the input tuple fields with respect to the packet streams to avoid stalling at the P4 module;

• appropriate clocking configuration for line, lookup and control signals;

FIGURE 3.1: Overview of the design.

• clock domain synchronizations;

• address mapping of components to enable the processing system to use control signals as a means of run-time configuration;

• enabling switching of packets between the two lanes as part of the engineered actions, accompanied by the necessary arbitration and congestion control to ensure minimal packet loss.

Figure 3.1 broadly displays how the various components harmonize to ensure that the aforementioned hurdles are dealt with successfully and the desired switch characteristics are attained, with simplified signal interconnection information for better understanding. For example, with regard to this design and the idea of integrating P4 into an existing infrastructure, the custom UPI-AXI4 Lite translator block, to be discussed subsequently, acts as a bus translator and handles the essential bus operation translation in addition to the 16-to-32-bit conversion between two different clock domains. The on-board high-speed transceivers that are used to communicate through the external optical links must be clocked accurately for the desired line rate and are selected while configuring the MAC IP. In addition to developing the P4 module with adequate functionality, the desired packet processing can only be attained by synchronizing the incoming packet and tuple signals using a tuple controller. It is also necessary to supply acceptable clocking to capture the incoming packet streams, execute table lookups using match fields and register changes made by the control signals. Finally, the desired switching of packets between the Tx lanes as a result of the packet processing is made physically possible by using a stream switch, which determines the route by behaving as a crossbar and conducting adequate arbitration at times of collision.

Each of the essential components is described in more detail with regard to its purpose, configuration and essential signals in the subsequent sections. This information is crucial for reproducing the findings proposed within this article. Finally, Chapter 4 illustrates the overall implemented block design, including the intricate details that ensure successful packet processing.

3.1 Building blocks

3.1.1 P4 description

This section discusses the architecture model adopted and the desired custom data-plane definition in P4 that is used within the design.

3.1.1.1 The Architecture model

Before elaborating on the configuration of the various components constituting the hardware that communicates with the P4-SDNet generated hardware description of the data-plane, it is necessary to know the P4 architecture model adopted for this design and to understand the expectations placed on the rest of the supporting components. As discussed in Chapter 2, there are currently three architecture models supported by the Xilinx P4-SDNet tool to successfully configure various Xilinx technologies. For this design the XilinxSwitch architecture model has been adopted, as it closely resembles the desired PSA model described within the P4 specification guide [10]. This model allows the programmer to customize three major containers, namely the parser, the pipeline and the deparser, as shown in the figure below.

Figure 3.2 is adopted from the "P4-SDNet User Guide" [18] and depicts the internal signals that are passed within a typical XilinxSwitch architecture layout.

FIGURE 3.2: XilinxSwitch layout [18].

As shown, the XilinxParser is the first container and is responsible for extracting and passing on the user-defined header fields from the incoming packets that are essential for the subsequent Xilinx containers. Once the desired header fields have been extracted, they are supplied to the XilinxPipeline control block along with additional tuple signals that are necessary for the packet processing. Control signals generated from the control-plane are used to configure the lookup tables within this container, which are responsible for matching the specified fields and executing predefined actions. This facilitates packet processing based upon the various software-defined traffic engineering rules. More detailed information regarding the various match types and actions is given in Chapter 2. After the desired packet fields have been modified, the third container, a XilinxDeparser control block, handles the stitching of the header fields with the rest of the original packet in the desired order.

The XilinxSwitch architecture description source code for each of these containers has been derived from the "P4-SDNet User Guide" and is given below for understanding the adopted model.

parser XilinxParser(packet_in pkt, out H headers);
control XilinxPipeline(inout H headers, inout C control);
control XilinxDeparser(in H headers, packet_out pkt);
package XilinxSwitch(XilinxParser prsr,
                     XilinxPipeline pipe,
                     XilinxDeparser dprsr);

The first container, XilinxParser, accepts an input packet stream of type packet_in and delivers the header fields of a user-defined type. This allows the programmer to define, with few restrictions, the requisite header field composition that needs to be extracted from the incoming packet. Secondly, the source code describes the parameters passed to the XilinxPipeline, within which the match-action units are defined. For this purpose, the previously extracted header fields are accepted and modified within this container. A tuple signal of user-defined type named 'control' (not related to the control signals that configure the lookup tables) is accepted by this block along with every incoming packet, as shown in figure 3.2. The final container, as discussed, accepts the modified header fields of user-defined type and emits the packet of type packet_out. XilinxSwitch is the final package that assembles each of these containers to define an architecture model.

3.1.1.2 P4 data-plane description

The custom P4 description of the desired data-plane and the final packaged block are discussed in detail within this section. Certain commands necessary for a successful compilation that attains the desired outcome are introduced here; however, a detailed discussion regarding compilation, configuration and packaging is provided in appendix A. This is necessary information to support the findings mentioned within this thesis report. At the time of this thesis work, Xilinx SDNet 2018.1 was used for compilation purposes and the resulting output was packaged using Xilinx Vivado 2017.3. The various steps involved prior to utilizing the P4 description with the P4-SDNet tool, which is still under development, have been listed in the appendix.

This module is designed to experiment with L2-level packet processing and study the capabilities offered by P4. We shall primarily be working with packet streams adhering to the IEEE 802.3 Ethernet frame structure incorporating multiple 802.1Q tags. This allows multiple VLAN tags within a frame, as desired when implementing various metro Ethernet network topologies. The structure of the considered Ethernet frames is shown in figure 3.3. The custom P4 description working upon the mentioned frame type is given below and performs certain modify operations using multiple exact-match key fields. The tables used within this code are independent and are executed in parallel; however, this is not the best practice, and developers should use parameters between tables to avoid the undesirable effects of parallel semantics.

FIGURE 3.3: Ethernet frame: "Insertion of 802.1ad DoubleTag in Ethernet-II frame" by Luca Ghio is licensed under CC BY-SA 4.0/ cropped and modified depiction of TPID and Inter Frame Gap.

#include <xilinx.p4>

#define VLAN_NO 2

header ethernet_h {
    bit<48> dst;
    bit<48> src;
    bit<16> type;   // type could be svlan type, cvlan type or ethertype
}

header vlan_h {
    bit<3>  pcp;    // Priority Code Point
    bit<1>  dei;    // Drop eligible indicator
    bit<12> vlanid; // VID
    bit<16> type;   // next tag type
}

struct headers_t {
    ethernet_h dst_src_type;
    vlan_h[VLAN_NO] vlan;
}

@Xilinx_MaxPacketRegion(1526*8) // in bits
parser Parser(packet_in pkt, out headers_t hdr) {
    state start {
        pkt.extract(hdr.dst_src_type);
        transition select(hdr.dst_src_type.type) {
            0x88A8  : parse_vlan;
            0x8100  : parse_vlan;
            default : accept;
        }
    }

    state parse_vlan {
        pkt.extract(hdr.vlan.next);
        transition select(hdr.vlan.last.type) {
            0x8100  : parse_vlan;
            default : accept;
        }
    }
}

control Pipeline(inout headers_t hdr, inout switch_metadata_t ctrl) {

    action modify_vlan0(bit<12> svlan_value) {
        hdr.vlan[0].vlanid = svlan_value;
    }

    action modify_vlan1(bit<12> cvlan_value) {
        hdr.vlan[1].vlanid = cvlan_value;
    }

    action modify_pcp0(bit<4> svlan_pcp) {
        // Since hex values are passed from the .tbl files, the value is
        // reduced to 3 bits while being assigned
        hdr.vlan[0].pcp = svlan_pcp[2:0];
    }

    action modify_pcp1(bit<4> cvlan_pcp) {
        hdr.vlan[1].pcp = cvlan_pcp[2:0];
    }

    action modify_dei0(bit<4> svlan_dei) {
        hdr.vlan[0].dei = svlan_dei[0:0];
    }

    action modify_dei1(bit<4> cvlan_dei) {
        hdr.vlan[1].dei = cvlan_dei[0:0];
    }

    action modify_egress(bit<4> egress_value) {
        ctrl.egress_port = egress_value;
    }

    action dropPacket() {
        ctrl.egress_port = 0x2;
    }

    table handle_vlan0 {
        // The first hex digit of the key contains the pcp (3-bit)
        // and dei (1-bit) values.
        key = {
            hdr.vlan[0].pcp    : exact;
            hdr.vlan[0].dei    : exact;
            hdr.vlan[0].vlanid : exact;
        }
        actions = {
            modify_vlan0;
            modify_egress;
            modify_pcp0;
            modify_dei0;
            dropPacket;
            NoAction;
        }
        size = 64;
        default_action = NoAction;
    }

    table handle_vlan1 {
        key = {
            hdr.vlan[1].pcp    : exact;
            hdr.vlan[1].dei    : exact;
            hdr.vlan[1].vlanid : exact;
        }
        actions = {
            modify_vlan1;
            modify_pcp1;
            modify_dei1;
            NoAction;
        }
        size = 64;
        default_action = NoAction;
    }

    apply {
        if (hdr.vlan[0].isValid()) {
            handle_vlan0.apply();
        }
        if (hdr.vlan[1].isValid()) {
            handle_vlan1.apply();
        }
    }
}

@Xilinx_MaxPacketRegion(1526*8) // in bits
control Deparser(in headers_t hdr, packet_out pkt) {
    apply {
        pkt.emit(hdr.dst_src_type);
        pkt.emit(hdr.vlan);
    }
}

XilinxSwitch(Parser(), Pipeline(), Deparser()) main;

As discussed previously, the initially included xilinx.p4 defines the architecture model chosen for this design. To achieve the desired modifications to an incoming packet, it is vital to define the meta-data bus within the data declaration section of the P4 code, as described in Chapter 2. For this purpose, adhering to the 802.3 frame format, two header types known as ethernet_h and vlan_h have been defined. Finally, as in most high-level languages, a structure headers_t is defined to represent the sequence of headers. In the case of double tagging, vlan_h shall be extracted multiple times to attain the desired header fields for modification.
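As a side note, the bound of 1526*8 bits passed to @Xilinx_MaxPacketRegion above follows from the maximum double-tagged frame size; this arithmetic is ours, assuming the standard 1518-byte maximum untagged Ethernet frame:

\[
1518\ \text{bytes} + 2 \times 4\ \text{bytes (802.1Q tags)} = 1526\ \text{bytes} = 1526 \times 8\ \text{bits}.
\]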

Following the data declaration section comes the parser definition. Figure 3.4 illustrates the optimized parser graph from which the P4 description has been defined.

FIGURE 3.4: Parser graph to extract stacked VLAN tags.

Initially, at the start state, the parser extracts the 48-bit destination, 48-bit source and 16-bit type fields. As shown in figure 3.3, a decision is made based upon the extracted type value. In the case of a double-tagged frame, an extracted type value of 0x88A8 indicates the presence of a Service-Virtual Local Area Network (SVLAN) tag. Therefore, the next state parse_vlan is invoked, and this state extracts the corresponding 3-bit Priority Code Point (PCP), 1-bit Drop eligible indicator (DEI), 12-bit VLAN identifier (VID) and the next 16-bit type field. If this next extracted type field is 0x8100, this indicates the presence of a Customer-Virtual Local Area Network (CVLAN) tag and invokes the same parse_vlan state again to extract the respective PCP, DEI, VID and EtherType fields. To cover all three permutations of the frame structure, parsing is terminated by invoking the accept state in the absence of the requisite tag type that indicates the presence of further VLAN tags.

The control block includes two table definitions, handle_vlan0 and handle_vlan1, that describe the lookup tables with the preferred key match fields and the ordered list of permitted actions. Each of these tables conducts VLAN tag modifications and is intended to execute only on packets carrying the required tag. The key fields on which a match is verified are the PCP, DEI and VID of each tag within the frame. Some of the defined actions for this experiment are to modify the PCP, DEI and VID fields, to drop the packet, to decide upon the exit lane by modifying the tuple data, and to pass the packet on without any modifications. However, it is important to note that for the module to function as desired it is vital to populate the lookup tables with the desired match-action combinations at run-time. Table 3.1 is a mock representation of the tables that are to be populated for this design. The illustrated format is derived from the P4 table definition and information obtained from the json output. It is mandatory to follow the specified format while creating .tbl files for simulation, as shown in appendix A. Refer to [21] for more information with regard to the various table formats for each match type. Appendix F of the P4-16 Language Specification [10] lists restrictions on compile-time and run-time calls.

Key (hex)          Value (hex)
PCP+DEI   VID      Action              vlan value   egress value   vlan pcp   vlan dei
0         064      1 (modify_vlan)     032          0              0          0
4         064      2 (modify_egress)   000          0              0          0
1         065      3 (modify_pcp)      000          0              5          0
7         065      2 (modify_egress)   000          0              0          0
F         066      1 (modify_vlan)     166          0              0          0

TABLE 3.1: Partial representation of a populated table.
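Reading the key column: per the code comment above, the leading hex digit packs the PCP and DEI bits ahead of the VID, i.e.

\[
\text{key}[3{:}0] = \{\text{pcp}[2{:}0],\ \text{dei}\},
\]

so a leading digit of 0xF corresponds to pcp = 0b111 = 7 with dei = 1, while 0x4 corresponds to pcp = 0b010 = 2 with dei = 0.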

The deparser definition describes the final restructuring of the packet using the ex- tracted and subsequently modified header fields.

As a result of the adopted architecture model and the interface signals defined within it, the final custom P4-generated IP block, packaged from the output of the Xilinx P4-SDNet tool, is shown in figure 3.5. The various procedures involved, from compilation of the P4 code to packaging the generated HDL files into the desired IP block, have been documented in the appendix to compensate for the missing information within the reference manuals [18] and [21]. Some of the essential commands for a successful compilation were:

$ p4c-sdnet -o

$ sdnet -busType axi -busWidth 64 -lineClock 156.25 -controlClock 100 -lookupClock 100 -ingressSync -workDir

$ p4c-sdnet --sdnet_info

The first command compiles the custom P4 code into an intermediate SDNet file. The second command then ensures the Xilinx SDNet compiler generates the desired folder consisting of the HDL and C++ files. Essential design specific information such as the bus type, datapath width and clocking constraints are supplied with this command. The final command is to generate the supplementary json information regarding the P4 code that is useful in attaining signal and table details. The output SDNet and JSON files for this code can be found in appendix B. This logic shall be implemented individually on both the lanes to minimize risks of traffic congestion.

The various signals that constitute the interface of the final packaged block, and their expectations, are discussed below.

packet_in_packet_in constitutes the 64-bit wide input AXI4 stream interface that accepts packets indirectly from the axis_rx port of the Ethernet subsystem, through a set of FIFOs responsible for Rx-Tx clock synchronization and for deterministic packet flow by enabling packet mode. This interface is assigned to the clk_line signal, currently operating at 156.25MHz.

FIGURE 3.5: P4 defined module.

control_handle_vlan0_S_AXI and control_handle_vlan1_S_AXI are the 32-bit wide AXI4-Lite interface signals allowing the control signals from the processing system to populate the memory-mapped handle_vlan0 and handle_vlan1 tables defined within the control Pipeline block of the P4 program. clk_control is the clock associated with this interface and is operated at 100MHz for this design. This interface is crucial for a central unit to define the traffic engineering rules at run-time.

tuple_in_ctrl_DATA[7:0] is an 8-bit wide tuple signal that indicates the 4-bit ingress and the 4-bit egress port for the packets that enter this logic IP block, operating on clk_line for this design. This interface exists as the result of the switch_metadata_t type parameter defined within the control Pipeline block of the P4 program. The definition of the switch_metadata_t datatype, extracted from the xilinx.p4 architecture model described by Xilinx, is given below.

typedef bit<4> switch_port_t;

struct switch_metadata_t {
    switch_port_t ingress_port;
    switch_port_t egress_port;
}

To ensure the successful processing of the packet headers, it is vital that the supporting hardware pipeline supplies the accurate tuples for each packet at the first cycle (or depending upon tuple_in_ctrl_VALID) of every new packet that enters the block. The tuple_controller block, explained later, is designed to ensure this essential detail. For this design, the data comes from a slice of the Ethernet subsystem's user_reg, as shown in the final block design (figure 4.6). The ingress port (in range 0 to 15) indicates the lane the packet arrives through, and the egress port (in range 0 to 15) is modified based upon the lookup tables. The next component, the AXI Stream Switch, receives this tuple information and uses it to reroute the packet into the desired lane based upon the modified egress port value.

tuple_in_ctrl_VALID[0:0] must be asserted for one cycle of each new input packet. In other words, this signal must be asserted for a single word of the packet, preferably the first word. This signal indicates the presence of valid data at the tuple_in_ctrl_DATA port. Since this block expects a valid tuple for each new packet, if this signal is not asserted and deasserted as expected, the block shall stall. To detect the start of a new packet, the tuple_controller block recognizes the assertion of tvalid after the latest tlast assertion over the packet-carrying AXI Stream interface. The SystemVerilog logic that handles the tuple_in_ctrl_VALID behavior is given in section 3.1.5. Chapter 4 illustrates the logic blocks and their interfacing that ensure the expected behavior of this signal.

enable_processing[0:0], as its name suggests, enables the processing for this block. For initial testing purposes, this signal is set to a constant '1'. However, the purpose of this signal is to disable undesirable packet processing at instants where some sort of fatal error has already occurred prior to the current block. For ease of managing this signal, it is currently controlled by the processing system by configuring the user register offered by the Ethernet subsystem.

clk_line, clk_control and clk_lookup are the clock signals at which the various internal engines operate. To ensure a 10Gbps data rate for this design, clk_line is set at 156.25MHz. clk_control is set at 100MHz, similar to the other AXI4 Lite control interfaces. clk_lookup is set at 100MHz. The SDNet Packet Processor User Guide [21] provides more information required to configure these clock signals.

clk_line_rst, clk_control_rst and clk_lookup_rst are the corresponding reset signals responsible for resetting the various internal engines.

packet_out_packet_out is the output stream of processed packets based upon the traffic engineering rules applied. This AXI4 stream output interface feeds these packets to the AXI stream switch, which subsequently decides upon the exit route to take based on the modified egress port within tuple_out_ctrl_DATA.

tuple_out_ctrl_DATA[7:0] is the output tuple data field that contains the aforementioned ingress and egress port information. The egress port value determines the route taken by a packet.

tuple_out_ctrl_VALID[0:0] behaves similarly to tuple_in_ctrl_VALID; it indicates the presence of a valid tuple_out_ctrl_DATA and is active for one clock cycle of every processed packet.

internal_rst_done[0:0] is asserted as and when the internal reset is done, indicating that the internal engines are ready [21].

3.1.2 UPI master

As mentioned in section 2.4.1, the on-board CPU communicates with the FPGA using a device-specific GPCM bus operation. However, the slave modules implemented within the Kintex FPGA interact with a master using the UPI bus operation. In this way, the UPI bus has been adopted as a standard across different devices.

To adhere to this established standard, it is vital to perform appropriate synchronizations between the GPCM and UPI bus operations. The UPI master synchronizes all incoming signals with double flip-flops, and the MPC8321 synchronizes all the incoming signals from the FPGA.

Therefore, the overall on-board bus architecture can be pictorially represented as shown in figure 3.6. The UPI bus consists of a 16-bit address and data. It also implements 32 individual enables that operate as chip selects internally within the FPGA. Each of these enables can be configured to map onto any size and location in the memory address space. The logic that behaves as the UPI master shown above is confidential; however, figure 3.7 illustrates the desired UPI bus operation. This operation complies with Wishbone rev B.3, and compatible slaves may be connected without additional logic. However, to integrate the proposed design there is a need to incorporate the AXI4 Lite bus operation, as discussed in the next section.

FIGURE 3.6: On-board bus architecture.

FIGURE 3.7: UPI bus operation.

3.1.3 UPI-AXI4 Lite translator

As described, control signals similar to the control_handle_vlan0_S_AXI and control_handle_vlan1_S_AXI signals discussed previously originate from the processing system and are vital for configuring the various components within this design, such as the Ethernet subsystem and the P4 defined module. The Kintex FPGA that is to be modified to accommodate the P4 defined module currently operates on an internal synchronous on-chip UPI bus that connects the various existing UPI slaves to the UPI master, as already discussed.

Currently, the existing UPI slaves are incapable of understanding the device-specific bus operation. Therefore, there must be a UPI master that deals with the translation to the UPI bus operation within the FPGA, as discussed in the previous section. This way, different devices can communicate with the FPGA with minimum modification by keeping the UPI bus as a standard. For this design, most of the slave components utilize another standard, the AXI4 Lite bus operation, to accept control signals. To integrate these new AXI4 Lite slaves along with the already developed proprietary UPI slave components, it is necessary to allow the UPI master to communicate with the AXI4 Lite slaves seamlessly.

The UPI-AXI4 Lite translator ensures the successful integration of AXI4 Lite slaves with the existing bus architecture previously described in figure 3.6. The newly modified bus operation and segregation, as compared to figure 3.6, is illustrated further in figure 3.8. As shown in the figure, the GPCM bus operation is converted to the UPI bus operation, which is eventually translated into an AXI4 Lite bus operation. In this way, it is finally possible to communicate with both UPI-compatible slaves and AXI-compatible slaves.

FIGURE 3.8: Bus architecture incorporating UPI-AXI4 Lite translator.

The algorithm defined for designing this component may vary based upon the board used and the bus operations it is required to synchronize. Based upon the developer's preference, it is optional to maintain or replace the intermediate UPI bus operation. Since the purpose of this design is to integrate P4 capabilities into existing designs, the UPI bus operation is maintained.

To successfully interface with the P4 defined data-plane and other components, it is necessary to understand the UPI bus operation and the AXI4 Lite bus operation. Figure 3.7 depicts the UPI bus operation; AXI4-Lite IPIF v2.0 [30] describes the desired AXI4-Lite interface.

Some of the essential characteristics considered during the design of this component, from a hardware developer's perspective, are:

• Translation of the 16-bit UPI bus operation at the slave interface of the packaged module to be compatible with the 32-bit AXI4-Lite compliant master interfaces. To accomplish this, alongside other signal synchronizations, two incoming 16-bit UPI data and address values are collected and placed on the 32-bit AXI4 Lite data and address signals for write operations, and vice versa for read operations.

• The proposed design expects control signals operated in a 100MHz clock domain. Therefore, this translation logic demands a clock domain synchronization from the current 66MHz UPI bus operation to a 100MHz AXI4-Lite bus operation while guarding against metastability.

Figure 3.9 depicts the final packaged block that is utilized within the proposed design.

FIGURE 3.9: UPI-AXI4 Lite translator.

Some of the signals present at the interface of this block and their expectations are as follows. All the UPI signals are interfaced with the UPI Master block and all the M_AXI signals are interfaced with an AXI slave block.

Upi_aclk is the clock signal on which the UPI slave interface operates. Currently, it is set at 66MHz.

rst is an active-high reset signal that triggers the resets for the underlying AXI slaves.

UpiEn is the signal necessary to enable the UPI-AXI4 Lite translator block.

UpiAddr[15:0] is responsible for conducting the functionality of this block. UpiAddr[2:0] determines the state in which the block shall operate and execute the desired write or read operation.

UpiWr indicates the presence of a valid signal on UpiWData.

UpiWData[15:0] is the incoming signal through which both the address and data are received. According to the implemented logic, UpiAddr[2:0] indicates whether the least or most significant half of the 32-bit AXI4-Lite address or data is present.

UpiAck is an acknowledgement signal for the UPI master block.

UpiRData[15:0] is the response signal through which the UPI master reads information.

FIGURE 3.10: Flow diagram for UPI-AXI4 Lite translator.

M_AXI_ACLK and M_AXI_RSTN are the AXI4-Lite clock and reset signals. For this design M_AXI_ACLK is set at 100MHz.

M_AXI is the set of all signals that constitute the desired 32 bit address and data AXI4-Lite interface that is used to interact with the AXI slaves.

The algorithm that addresses these characteristics and forms the backbone of the required logic is shown in figure 3.10. Each AXI state comprises the set of AXI interface operations that are called after the prerequisite UPI processes are executed. To avoid repetition, this logic is discussed in detail later in section 4.1, where the simulation results for a write and a read operation are illustrated.

3.1.4 10/25G Ethernet Subsystem

In order for the P4 defined data-plane to successfully process packets, it is vital for the overall design to receive and transmit packets using the on-board high-speed transceivers. Therefore, this section sheds light on the means of configuring the physical and data link layers of the network switch to be designed, which ensure external interfacing. To achieve this, the design proposes the customization of the 10/25G High Speed Ethernet Subsystem v2.3 IP core. This component incorporates the desired Ethernet Media Access Controller (MAC) with a Physical Coding Sublayer (PCS) as defined by the 25G Ethernet Consortium.

Some of the supported features mentioned within the user guide [31] for the 10G configuration of this logic block are:

• Complete MAC and PCS functions.

• Base-KR mode based on IEEE 802.3 Clause 49.

• Pause Processing.

• Optional 64-bit or 32-bit AXI4-Stream user interfaces.

• Optional Standalone MAC with 64-bit AXI4-Stream interface and XGMII pin out.

• Optional Clause 73 Auto-negotiation.

• Optional Clause 72.6.10 Link Training.

• Optional Clause 74 FEC - shortened cyclic code (2112, 2080).

• PCS only version with XGMII Interface.

• Optional AXI4-Lite control and status interface.

• Statistics and diagnostics.

• 66-bit SerDes interface.

• Custom preamble and adjustable Inter Frame Gap.

Apart from the accurate configuration that meets the design requirements, it is vital to understand the various signals required by the IP core and how to interface them. According to the IEEE 802.3 standard definition, a 10 Gigabit physical layer constitutes a combination of the Physical Medium Dependent (PMD), Physical Medium Attachment (PMA) and Physical Coding Sublayer (PCS). The network interconnect medium connects to the physical layer through the Media Dependent Interface (MDI) and further connects to the Media Access Control (MAC) within the data link layer through the Media Independent Interface (MII) [32]. Figure 3.11 is adopted from the 10G/25G High Speed Ethernet Subsystem product guide [31] and depicts the PCS-only variant of the core that is configured for this design.

FIGURE 3.11: PCS-Only Core Variant [31].

Figure 3.12 clearly illustrates the desired normal 64-bit frame transfer over the XGMII. The dark region depicts the inter-frame gap that exists in the case of an intermittent frame transfer. In the case of a continuous transfer, each frame is sent back to back.

FIGURE 3.12: Normal 64 Bit Frame Transfer [31].

For this design, we shall utilize two Ethernet MAC+PCS/PMA 64-bit cores. The number of bits transmitted per second defines the data rate over the interface. Thus, if 1 bit takes 0.1 nanoseconds to commute through the interconnect medium, the number of bits sent in 1 second is 10 x 10^9 (10Gbps). Data throughput rates are generally calculated in terms of data bits, excluding non-data bits such as control bits, source address, destination address and other overhead [32]. Hence, for the desired data throughput, the physical layer gross data rate is increased: 10 Gigabit Ethernet throughput demands that the line rate be set at 10.3125 Gbps. The PCS/PMA option is configured to be 'BASE-R', in which 'BASE' signifies that the modulation type is baseband and 'R' refers to the PCS scRambled coding (64B/66B). The data path interface is set to be an AXI stream interface, and subsequently the MAC is configured to include the optional data path interface FIFO. The IEEE PTP 1588v2 operating mode helps configure the timestamping option of the 10/25G Ethernet Subsystem when the MAC layer is included. The two selectable modes are 'one step' and 'two step'. The two-step mode is selected; this causes the two-step ports to be populated and the core ensures the two-step timestamping functionality, in contrast to the one-step mode, in which case both the one-step and two-step ports are populated with both core functionalities available [31]. The 1588 SYS Clock period is set at 4000ps.
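As a quick sanity check on the rates quoted above (this derivation is ours, not from [31]): 64B/66B coding carries every 64 payload bits in a 66-bit block, and the chosen 64-bit datapath clock matches the MAC rate:

\[
10\ \text{Gb/s} \times \frac{66}{64} = 10.3125\ \text{Gb/s}, \qquad
64\ \text{bits} \times 156.25\ \text{MHz} = 10\ \text{Gb/s}.
\]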

Furthermore, within the GT selection and configuration, the GT core is included within the core, with the GT RefClk set at 322.265625MHz and the Dynamic Reconfiguration Port (DRP) clock manually set at 100MHz. The transceiver type is selected to be GTH, and an available quad with the desired two lanes that suit the requirements of this design is chosen. Lastly, under the shared logic tab it is declared that shared logic, such as the transceiver quad PLL, transceiver differential reference clock buffer, reset logic and other clock buffers, is included in the core.

Figure 3.13 depicts the final block implemented within the design and illustrates the various signals involved. [31] lists each of these signals and describes their purpose. However, some of the signals that need mentioning are listed below.

gt_rx and gt_tx constitute the MDI interface that connects to the medium, as described above.

axis_tx and axis_rx are the MII stream interfaces through which the packet stream flows; they are clocked at tx_clk and rx_clk for channels 0 and 1 respectively.

txoutclksel[2:0] and rxoutclksel[2:0] are set to 3'b101 to select the appropriate GTH transceiver reference clock.

pm_tick is needed to read the statistics counters and is tied to '0'.

tx_preamble_in gives the provision to add a custom preamble in place of the standard preamble. These signals are tied to a constant '0'.

FIGURE 3.13: 10G Ethernet Subsystem.

To ensure unhindered packet transmission through the FPGA, it is necessary to configure the Ethernet subsystem accurately. Considering the ease of generating multiple custom packet formats and analyzing the received and transmitted packets for discrepancies, the test instrument mentioned in Chapter 2 was utilized to study the desired behavior whilst making the necessary configuration changes. At first, the cores were temporarily self-looped, without any packet processing, to test the functionality of the Ethernet subsystem; in other words, for each lane the axis_rx port was tied to the corresponding axis_tx port. This setup enabled testing of the device under various traffic conditions and checking for error bits or frames, which in turn promised a functional interface for the data-plane of the P4-compatible switch to be designed.

Once the Ethernet subsystem is successfully configured, the subsequent stages primarily involve integrating the P4 defined module and an AXI stream switch into the design with the help of additional supporting logic.

3.1.5 Tuple Controller

As discussed in section 3.1.1.2, the tuple data must be fed into the P4 defined module for one clock cycle for every new incoming packet. This supporting block ensures that the tuple signals are fed into the P4 defined module for exactly one clock cycle. In the absence of this logic, packet processing stalls, because the engines expect this signal in order to process the recently arrived packet.

The resulting logic block with the necessary signals is shown in figure 3.14. Each of these signals is explained subsequently.

FIGURE 3.14: Tuple Controller.

S_axis_Tlast and S_axis_Tvalid are the signals extracted from the packet-mode FIFO that feeds the incoming packets to the P4 module. The logic described above uses these two inputs to decide whether a new packet has arrived.

Tuple_clk and Tuple_rstn are the clock and reset signals at which the S_axis signals operate. Tuple_clk is the same as the packet stream clock, which is set at 156.25 MHz.
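This frequency is simply the line rate divided by the 64-bit width of the stream datapath:

\[ f_{stream} = \frac{10 \times 10^{9}\,\mathrm{bit/s}}{64\,\mathrm{bit}} = 156.25\,\mathrm{MHz}. \]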

Tuple_in_Data[7:0] is the tuple data that needs to be transferred to the P4 module. For this design it is an 8-bit value, taken from the user_reg of the Ethernet subsystem, that indicates the input and output channel.

Tuple_Data[7:0] is the desired data that is passed as tuple to the P4 defined mod- ule.

Tuple_valid indicates the presence of valid data on Tuple_Data. As mentioned earlier, it is asserted for only one clock cycle.

The logic that drives the tuple_in_ctrl_VALID[0:0] and tuple_in_ctrl_DATA[7:0] signals supplied to the P4 defined module is sketched below in Verilog, using conditional statements evaluated by this block on every clock cycle. The logic identifies the first rising transition of the packet stream's tvalid after the previous packet's tlast in order to determine the start of a new packet. Subsequently, the required tuple valid and data signals are supplied for exactly one clock cycle, as desired. This block is crucial because the stream's tvalid signal alone can be discontinuous and cannot accurately indicate the start of a packet.

always @(posedge Tuple_clk) begin
    if (!Tuple_rstn) begin                  // synchronous active-low reset
        old_tlast_flag <= 1'b0;
        Tuple_Valid    <= 1'b0;
        Tuple_Data     <= 8'd0;
    end else if (S_axis_Tlast) begin
        old_tlast_flag <= 1'b1;             // record the end of the previous packet
        Tuple_Valid    <= 1'b0;
        Tuple_Data     <= 8'd0;
    end else if (old_tlast_flag && S_axis_Tvalid) begin
        old_tlast_flag <= 1'b0;             // reset the flag to await the next S_axis_Tlast
        Tuple_Valid    <= 1'b1;             // high for exactly one clock cycle
        Tuple_Data     <= Tuple_in_Data;    // tuple data valid for the same single cycle
    end else begin
        Tuple_Valid    <= 1'b0;             // pull the outputs low after one clock cycle
        Tuple_Data     <= 8'd0;
    end
end

3.1.6 AXI4-Stream switch

After the packets and their corresponding tuples have been processed in accordance with the populated traffic engineering rules, it is necessary to switch each packet onto the desired output channel. It is also necessary to arbitrate so as to avoid collisions between packet transmissions over the same resource. For this purpose, the design utilizes the AXI4-Stream Switch IP. For this design, the block contains two slave and two master AXI4-Stream interfaces. The TDATA width is set at 8 bytes, as it was for the previous packet stream interfaces, and TKEEP and TLAST are enabled. The TDEST width is set at 8 bits and indirectly receives the tuple_out_ctrl_DATA signal from the P4 defined logic; this signal determines the output master stream on which the packet traverses. Data flow properties must be defined to guarantee efficient arbitration. For this design, the arbitration follows the True Round-Robin algorithm and arbitrates upon a TLAST transfer or upon one low TVALID cycle. The IP block with the required interfaces is shown in figure 3.15.

FIGURE 3.15: AXI4-Stream switch.

aclk and aresetn are the clock and active-low reset signals that synchronously operate each of the slave and master interfaces. As mentioned previously, for this design the stream interfaces are clocked at 156.25 MHz.

S00_AXIS and S01_AXIS are the AXI4-Stream slave interfaces that accept the processed packets arriving on both lanes. Even though both streams operate at the same frequency, they theoretically belong to different clock domains. To be processed by the stream switch, this has been temporarily rectified to one clock domain by deploying additional AXI4-Stream Clock Converters that switch clock domains at both the input and the output of the stream switch for one of the lanes. Among the most important ports within these interfaces are s00_axis_tdest and s01_axis_tdest, which hold the output tuple data indicating the input lane and the desired output lane for each packet. In this design, if the tdest value associated with a packet is b'00000000, the packet is assigned to lane 0 (M00_AXIS), and if it is b'00000001, the packet is assigned to lane 1 (M01_AXIS).

M00_AXIS and M01_AXIS are the corresponding AXI4-Stream master interfaces, indirectly connected to the axis_tx port of the Ethernet subsystem, which eventually transmits the packets through gt_tx. Each master has an inbuilt FIFO to cater for the arbitration between the two slaves and to control the packet queue. To rectify the clock domains as mentioned above, an AXI4-Stream Clock Converter is included on the previously rectified lane between the stream switch and the Ethernet subsystem.

Apart from the above-mentioned building blocks, there are additional minor components such as AXI4-Stream Data FIFOs (ordinary and packet-mode enabled), AXI4-Stream Clock Converters and reset synchronizers that were necessary to address some of the hurdles mentioned at the beginning of this chapter.

Chapter 4

Results

4.1 Simulation of building blocks

As discussed in section 2.4.2, it is best practice to test the desired behavior both at the component level and at the system level. Here, we discuss some of the testing and simulations conducted during the design of the proposed solution. One of the first components integrated into this design, ensuring accurate interaction between the CPU and the logic built on the FPGA, was the UPI-AXI4-Lite translator discussed in section 3.1.3. This logic was essential to ensure that control signals from the control plane could efficiently program the data-plane described in P4. The simulation results obtained using Vivado for the expected write and read operations are illustrated in figures 4.1 and 4.2.

Figure 4.1 illustrates a write operation executed by the processor. As defined by the algorithm, when UpiAddr[2:0] equals 2, the least significant 16 bits of M_AXI_AWADDR are populated with the 16-bit value passed through UpiWData, and when UpiAddr[2:0] equals 3, the most significant 16 bits of M_AXI_AWADDR are populated with the 16-bit value passed through UpiWData. Similarly, when UpiAddr[2:0] equals 6, the least significant 16 bits of M_AXI_WDATA are populated, and when UpiAddr[2:0] equals 7, the most significant 16 bits of M_AXI_WDATA are populated. Subsequently, to trigger the AXI4-Lite handshake signals, UpiAddr[2:0] is set to 1 with the passed UpiWData value, ensuring the transition from the AXI idle state to the AXI write state mentioned in figure 3.10. Figure 4.1 depicts some of these handshake signals, such as M_AXI_AWVALID, M_AXI_AWREADY, M_AXI_WVALID and M_AXI_WREADY. This write operation therefore writes the 32-bit hexadecimal value ba987654 to the 32-bit hexadecimal address 00000004.

Figure 4.2 verifies the desired read activity through the UPI-AXI4-Lite translator. Analogously to the write operation, we read back the previously written value that underwent the UPI to AXI4-Lite transformation. When UpiAddr[2:0] equals 2, the least significant 16 bits of M_AXI_ARADDR are populated with the 16-bit value passed through UpiWData, and when UpiAddr[2:0] equals 3, the most significant 16 bits of M_AXI_ARADDR are populated. Next, to trigger the necessary AXI4-Lite handshake signals, UpiAddr[2:0] is set to 1 with the passed UpiWData value, ensuring the transition from the AXI idle state to the AXI read state mentioned in figure 3.10. Figure 4.2 depicts some of these handshake signals, such as M_AXI_ARVALID, M_AXI_ARREADY, M_AXI_RVALID and M_AXI_RREADY. The expected value is seen on M_AXI_RDATA and later passed on to UpiRData by manipulating UpiAddr[2:0]. Thus, the read operation requests the value stored at the 32-bit hexadecimal address 00000004 and receives the 32-bit hexadecimal value ba987654.
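To summarize the access sequence, the sketch below models the CPU side in C. upi_write() and the trigger payload OP_WRITE are hypothetical illustrations, not part of the thesis sources; only the sub-address map (2, 3, 6, 7, then 1) is taken from the description above.

#include <stdint.h>

/* Hypothetical platform primitive that drives one 16-bit UPI transaction. */
extern void upi_write(uint16_t upi_addr, uint16_t upi_wdata);

/* Assumed trigger payload: the text only states that sub-address 1 carries
 * "the passed UpiWData value" that starts the AXI write. */
#define OP_WRITE 0x0001u

/* Issue one 32-bit AXI4-Lite write through the 16-bit UPI, following the
 * sub-address map simulated in figure 4.1. */
static void axi_write32(uint32_t addr, uint32_t value)
{
    upi_write(2, (uint16_t)(addr  & 0xFFFFu)); /* M_AXI_AWADDR[15:0]  */
    upi_write(3, (uint16_t)(addr  >> 16));     /* M_AXI_AWADDR[31:16] */
    upi_write(6, (uint16_t)(value & 0xFFFFu)); /* M_AXI_WDATA[15:0]   */
    upi_write(7, (uint16_t)(value >> 16));     /* M_AXI_WDATA[31:16]  */
    upi_write(1, OP_WRITE);                    /* idle -> AXI write state */
}

Calling axi_write32(0x00000004, 0xBA987654) would reproduce the transaction simulated in figure 4.1; the read sequence mirrors it, loading M_AXI_ARADDR through the same sub-addresses 2 and 3.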

FIGURE 4.1: UPI-AXI4-Lite write operation.

FIGURE 4.2: UPI-AXI4-Lite read operation.

For the P4 defined module to function as desired, it is also necessary to supply the tuple data for exactly the one clock cycle that the P4 module under test expects. As discussed in section 3.1.5, the Tuple Controller accomplishes this using custom logic to detect the start of every new incoming packet. Figure 4.3 depicts the simulation results for the logic described in Chapter 3: Tuple_Valid is high for one clock cycle for every new packet, and during this period the tuple signals are passed to the P4 defined module.

FIGURE 4.3: Tuple Controller simulation results.

4.2 System integration and packet flow

Before integrating the P4 defined module, it is necessary to ensure packet flow through the self-looped Ethernet subsystem. This is achieved using the hand-held network tester illustrated in figure 2.14, interfaced with the device under test using the configuration depicted in figure 4.4. Successfully configuring the 10G Ethernet subsystem and self-looping the streaming packets through a set of FIFOs yields the result shown in figure 4.5: ten packet streams are transmitted by the instrument and successfully received without any modification.

While integrating further components into the design to ensure traffic management, clock synchronization, reset synchronization, etc., it is vital to observe this undisturbed stream flow as a preliminary test, so that any component that hinders the stream is detected early.

FIGURE 4.4: Self-looped test setup ensuring stream flow.

FIGURE 4.5: Packet stream without P4 defined module.

4.3 The final design

This section presents the outcomes of this thesis work. Once the architecture model upon which the P4 description of the data-plane is defined has been selected, it is necessary, as stated, to design a supporting hardware pipeline that incorporates the building blocks discussed in Chapter 3 in order to study the advantages of defining the next generation of SDN switches using P4. The overall proposed hardware design is depicted in figure 4.6.

Apart from the UPI master, every component and its interconnections are depicted in figure 4.6; this is essential for reproducing the findings described in this article. In addition to the building blocks mentioned in Chapter 3, there are other minor components, such as the AXI4-Stream FIFO, the AXI4-Stream Clock Converter, reset synchronizers and register slices, that are crucial and handle major design issues such as clock synchronization and traffic management.

The P4 definition under test is packaged within a block known as "test_block" in this design. This block accepts streams of packets from the 10/25G Ethernet subsystem through a set of FIFOs. A packet-mode enabled FIFO is inserted to ensure a packet is forwarded to the test block only after an entire packet has accumulated. The configuration of the FIFOs needs to be optimized based upon the maximum frame size expected to be processed within this design. The two control signals originate from the UPI-AXI4-Lite translator (UPI_AXI_Bridge_16_32_v1_0) and are distributed by the AXI interconnect as discussed. Tuple_Controller_lane0 and Tuple_Controller_lane1 are responsible for the tuple signals supplied to the test block as shown. However, for ease of customization, the 8-bit tuple data value is currently sliced from a user register of the Ethernet subsystem. For example, the highlighted interconnection shows the connection between user_reg0_0 and the Slice blocks that extract fields such as Tuple_in_data for the tuple controller and enable_processing for the test block. The processed packet and tuple output signals are further combined into a single AXI4 stream using a FIFO at the output of the test_block. A set of clock converters has been deployed on lane 0 for the AXI4-Stream Switch to function synchronously. The streams are switched, using arbitration when needed, and fed into separate FIFOs that subsequently transmit the customized packets.

To incorporate new definitions of the test_block into the design, it is crucial to follow the steps mentioned in appendix A. Further, to attain the desired results at the output, it is necessary to populate the tables accurately at the assigned addresses with the expected field format.

FIGURE 4.6: The final block design (Vivado block diagram of the complete two-lane pipeline, comprising the 10G/25G Ethernet Subsystem, the per-lane Tuple Controllers and test_blocks, the AXI4-Stream Data FIFOs and Clock Converters, the AXI4-Stream Switch, the AXI Interconnect, the UPI-AXI4-Lite translator, the Slice blocks and the reset synchronizers).


4.4 Observing the desired packet processing

Once the aforementioned state of the design has been attained, it is necessary, in order to ensure the expected packet processing, to populate the lookup tables with the desired rules at runtime from the control-plane. The test setup that helped observe the desired packet processing is illustrated in figure 4.7. As described in section 3.1.6, the AXI stream switch determines the final lane a processed packet takes based upon the tuple values.

FIGURE 4.7: Test setup to observe the desired P4 defined processing.

Figures 4.8 and 4.9 depict the outcome of populating the tables using the rules listed in the update.tbl file mentioned in the appendix. A graphical representation of the entire table used for testing is given in table 4.1 for better comprehension. As desired, packet streams 1 to 10 are processed according to the listed actions. The procedure to compile, package and integrate the logic block, along with the means of populating the tables, is discussed in appendix A.

Key (hex)          Value (hex)
PCP+DEI   VID      Action              vlan value   egress value   vlan pcp   vlan dei
0         064      1 (modify_vlan)     032          0              0          0
4         064      2 (modify_egress)   000          0              0          0
1         065      3 (modify_pcp)      000          0              5          0
7         065      2 (modify_egress)   000          0              0          0
0         066      1 (modify_vlan)     166          0              0          0
D         066      2 (modify_egress)   000          0              0          0
8         067      3 (modify_pcp)      000          0              2          0
A         068      5 (dropPacket)      000          0              0          0
3         069      1 (modify_vlan)     101          0              0          0
2         069      3 (modify_pcp)      000          0              6          0

TABLE 4.1: Entries within the populated table.

FIGURE 4.8: Lane 1 with P4 defined packet processing.

For this experiment, the P4 module on lane 1 is tested while the module on lane 0 is kept idle by performing no actions. Figure 4.8 depicts lane 1, in which streams undergo modification based upon a key combining the PCP, DEI and VID fields. For example, stream 1 undergoes a VID modification from 100 to 50. Similarly, stream 3 is rerouted to lane 0 and stream 8 is dropped. Figure 4.9 depicts the packet streams rerouted from lane 1: as shown, streams 2, 4 and 6 are rerouted to lane 0, which proves the rerouting based upon tuple field modification. As discussed, stream 8 is missing, reflecting the drop action. All the discussed modifications were performed only on SVLAN tags, using table handle_vlan0.

FIGURE 4.9: Lane 0 displaying rerouted packets from lane 1.

4.5 Analysis

Since P4 is a high-level language that aims to replace the less flexible conventional methods of building a network switch custom-tailored to a specific protocol, it is necessary to quantify the additional cost incurred by adopting this approach. The parameters analyzed depend on the approach and the optimization techniques adopted by the compiler in use. As most tools are still under development and competing to provide the best solutions on the market, the reported results are continuously evolving.

Table 4.2 lists the resource utilization for the various P4 descriptions incorporated into the design. The custom P4 description mentioned in section 3.1.1.2 is initially analyzed in terms of the number of LUTs, FFs and BRAMs. Subsequently, the P4 description was made progressively more sophisticated by increasing the number of headers (H), tables (T) and write (W) operations from 1 to 8.

P4 block                LUT      FF       BRAM
Custom P4 description   21875    32401    56
1H-1T-1W                9412     10059    16.5
2H-2T-2W                14910    16512    27.5
3H-3T-3W                19563    24122    43
4H-4T-4W                25297    32868    56.5
5H-5T-5W                31393    42600    72
6H-6T-6W                37783    53355    84.5
7H-7T-7W                44470    64987    105
8H-8T-8W                51536    77789    118.5

TABLE 4.2: Resource utilization for various P4 descriptions.

FIGURE 4.10: LUT variations w.r.t the number of headers, tables and write operations.

Figure 4.10 depicts the LUT trend for the P4 descriptions listed in table 4.2. As the number of headers, tables and write operations increases, the total number of LUTs utilized also increases fairly linearly. Figure 4.11 similarly illustrates the trend in FFs consumed by the various P4 blocks, and figure 4.12 is the corresponding graph for the BRAMs utilized.

Both of these plots show a rising trend as the H-T-W configuration grows.

FIGURE 4.11: Flip-Flop variations w.r.t the number of headers, tables and write operations.

FIGURE 4.12: BRAM variations w.r.t the number of headers, tables and write operations.

The overall resource utilization of the entire 2x2 design is less than 15 percent of the resources available on the board. Figures 4.10, 4.11 and 4.12 show that as the complexity of a P4 description increases, the accompanying cost in terms of resource utilization is deterministic in nature. This indicates that P4 and its supporting compilers are promising candidates to replace conventional HDL implementations. With newer FPGAs with fewer resource constraints tailored for networking purposes coming onto the market, and with optimized tool chains to support P4, the design of complex switches using P4 is possible.

In addition to the resource consumption, it is crucial to discuss the latency incurred while adopting P4. This parameter largely depends upon the optimization techniques adopted by the tool that supports the developer's efforts. Table 4.3 lists the latency for the various P4 designs discussed previously.

Condition                  Average (us)   Current (us)   Min. (us)   Max. (us)
Stream w/o P4 processing   2.93           2.93           2.91        2.95
Custom P4 description      5.71           5.71           5.70        5.74
1H-1T-1W                   4.91           4.91           4.88        4.92
2H-2T-2W                   5.33           5.33           5.31        5.35
3H-3T-3W                   5.76           5.76           5.74        5.78
4H-4T-4W                   6.22           6.22           6.19        6.24
5H-5T-5W                   6.66           6.66           6.24        6.68
6H-6T-6W                   7.10           7.10           7.07        7.11
7H-7T-7W                   7.54           7.54           7.51        7.56
8H-8T-8W                   8.01           8.01           7.99        8.03

TABLE 4.3: Latency readings for different P4 descriptions.

Firstly, the packet flow latency imposed by the supporting hardware pipeline alone, without P4, was measured. This value can be considered the minimum latency within the design, on top of which the P4 block latency is added. The latencies mentioned for the subsequent conditions, incorporating P4 defined blocks with different header, table and write operation configurations, therefore include the latency of the first condition. Figure 4.13 is the trend chart for the findings in table 4.3. The average latency varies approximately linearly with the H-T-W configuration. With larger configurations the latencies are expected to increase; this depends upon the optimization techniques adopted by the tool in designing the parser, match-action unit and deparser. However, in the case of larger latencies, the supporting hardware framework must be customized to buffer packets awaiting processing, so as to avoid undesirable packet drops.
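As a rough summary of table 4.3 (a simple two-point fit over the endpoints, not a measured model), the average latency for n headers, tables and write operations behaves approximately as

\[ \bar{t}(n) \approx 4.47\,\mu\mathrm{s} + 0.443\,\mu\mathrm{s} \cdot n, \]

which reproduces the measured 4.91 us at n = 1 and 8.01 us at n = 8.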

FIGURE 4.13: Average latency w.r.t the number of headers, tables and write operations.


Chapter 5

Conclusion and Future work

5.1 Conclusion

This article describes the means to successfully incorporate P4 capabilities while modeling the next generation of SDN switches with a programmable data-plane. The overall design caters to the XilinxSwitch architecture model, which consists of three programmable blocks: the parser, the pipeline and the deparser. Currently, the design operates at a 10 Gbps line rate and is scalable to higher rates with minimal effort. Both lanes in this design can incorporate different P4 descriptions and process packets separately. Apart from the desired packet processing, P4 is currently incapable of describing packet scheduling and relies on external logic such as the AXI Stream switch implemented within this design. Based upon the modified tuple fields, the packets can exit through either of the two lanes; however, replication of packets is not incorporated within this design. Once the desired supporting hardware pipeline is in place, such as the proposed design, P4 definitely offers a higher degree of flexibility in describing the data-plane at a higher level compared to fixed VHDL/SystemVerilog implementations. Conventional HDL implementations are specific to a protocol format and might be more optimized in terms of cost compared to P4, depending on the compiler utilized. However, this article does not qualitatively compare the two, mainly because of the ambiguity regarding the algorithms adopted by the P4-SDNet tool while implementing each programmable block of the data-plane. The user's ability to optimize the logic is currently limited in the case of P4-SDNet; however, the XilinxEngineOnly architecture model allows the user to describe individual components separately. Nevertheless, P4 promises more flexibility by adhering to its three main goals of protocol independence, target independence and reconfigurability.

After the successful design of a hardware pipeline that incorporates P4 and its capabilities as a high-level language, it is safe to conclude that P4 is a promising approach to implementing SDN with a top-down approach. It allows customization of the underlying hardware, previously a bottleneck to further innovation, as easily as modifying P4 code within a few hours. FPGAs offer a higher degree of flexibility compared to other targets such as ASICs and, with the added abilities of P4, create more room to model, design and test new networking protocol formats with fewer constraints. New descriptions can be implemented, deployed and tested in the field within a few hours, provided the underlying hardware framework is equipped to interface with the adopted architecture model. Suitable P4-compatible platforms developed by switch vendors will give network operators a higher degree of freedom in defining the network. The linear trends obtained for resource utilization and latency demonstrate a largely deterministic outcome. However, with increasing latencies, cut-through processing would not be possible at higher rates.

5.2 Future work

P4 as a language is still evolving, with changes that make older versions incompatible. Currently, it supports parallel semantics, and it is the programmer's responsibility to avoid unexpected behavior due to clashing actions. Although P4 permits multiple match types within the same table, P4-SDNet is currently incapable of handling this feature; therefore, each table is only permitted a single match type. This is a shortcoming that needs to be addressed, as many applications rely on it. Currently, different tools support different architecture models defined by the manufacturer. However, the recent adoption of a standardized PSA model into P4_16 that caters to various applications is beneficial and helps minimize differences between implementations. In the future, this should truly allow a standard P4 description to be compatible with all targets. On the other hand, it is also necessary to give developers the freedom to program the architecture model to address specific requirements. To implement advanced protocols, additional features are required for faster development, such as packet scheduling, replication and methods for time-triggered protocols.

For future improvements, it is necessary to further optimize the hardware pipeline to accommodate larger frame sizes. Currently, the pipeline is designed to function with the XilinxSwitch architecture model and will require modifications to accommodate other models. Subsequently, the design must be tested by deploying it as a physical switch within an existing infrastructure. The need for optimizations using the XilinxEngineOnly architecture model should be studied as requirements dictate. Currently, for P4-SDNet the lookupClock is not flexible; its impact on performance must be studied for future iterations of the tool. P4FPGA is another tool that compiles P4 and is compatible with FPGA targets from different manufacturers. It would be interesting to compare results between P4-SDNet and P4FPGA to decide upon the best approach for incorporating P4 in the design of the next generation of SDN switches.

Appendix A

Steps to incorporate P4

This section discusses the various stages involved in incorporating P4 into the proposed design. Consider the file containing the P4 code discussed in section 3.1.1.2 to be named L2_show.p4. The discussed steps were run on a Linux-based system running Xilinx SDNet 2018.1 and Xilinx Vivado 2017.3.

To install the P4-SDNet tool, refer to the SDNet Installation and Getting Started (UG1018) guide provided by Xilinx. After executing the necessary installation commands, run the following command from the directory that contains L2_show.p4 to compile the P4_16 source code using the Linux console.

$ p4c-sdnet L2_show.p4 -o L2_show.sdnet

The intermediate .sdnet file obtained is used to generate the desired output files. To attain a more descriptive JSON description, useful while populating the lookup tables, the command below must be run. The generated .json file is given in appendix B.

$ p4c-sdnet L2_show.p4 --sdnet_info L2_show.json

Once the intermediate .sdnet file has been generated, the command line options available to compile it are listed in the SDNet Packet Processor User Guide (UG1012). For the requirements of this thesis, the following command has been used.

$ sdnet L2_show.sdnet -busType axi -busWidth 64 -lineClock 156.25 -controlClock 100 -lookupClock 100 -ingressSync -workDir test1

SDNet offers the choice of an appropriate bus type from the options lbus and axi. For the purposes of this thesis, bus type axi has been adopted with a datapath width of 64 bits; the width must be a power of 2 and within the range 16 to 1024. The option parameters -lineClock, -controlClock and -lookupClock declare the line rate clock, control clock and lookup clock frequencies. These are set at 156.25 MHz, 100 MHz and 100 MHz respectively for this design and are available for customization. It is important that the lookup clock frequency is set to at least the once-per-packet rate. -workDir declares the directory within which the requisite output files are stored.
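As a sanity check of this once-per-packet requirement, assuming worst-case minimum-size 64-byte Ethernet frames with the standard 8-byte preamble and 12-byte interframe gap, the packet arrival rate at 10 Gbps is

\[ \frac{10 \times 10^{9}\,\mathrm{bit/s}}{(64 + 8 + 12) \times 8\,\mathrm{bit}} \approx 14.88\,\mathrm{Mpps}, \]

which the 100 MHz lookup clock comfortably exceeds.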

On successful compilation, the following is displayed within the console.

Xilinx SDNet Compiler version 2018.1.1, build 2258648

Compilation successful

The next steps involve packaging the various generated SDNet engines into a single HDL block. Access the folder XilinxSwitch within the generated folder test1 and open the file XilinxSwitch_vivado_packager.tcl. Due to certain bugs in the script, make the following changes within the Tcl script.

• Insert -import_files in line 8 as shown below.

ipx::package_project -root_dir XilinxSwitch_vivado/XilinxSwitch/XilinxSwitch.srcs/sources_1/imports/XilinxSwitch -import_files -vendor xilinx.com -library user -taxonomy /UserIP

• Delete the final close_project command. This is necessary in order to make manual modifications before packaging the IP.

Next, XilinxSwitch_vivado_packager.tcl is sourced within the Vivado Tcl console. Within the GUI that opens, under Ports and Interfaces, associate each control interface with clk_control and package the IP. Subsequently, within the main Vivado project that contains the supporting hardware pipeline, manually add the repository that houses the newly packaged IP to the IP catalog. Import the IP into the design and make the necessary connections.

After building the design upon an FPGA, it is essential to populate the lookup tables using the appropriate format. To populate the tables handle_vlan0 and handle_vlan1 described in section 3.1.1.2, the file update.tbl has been used. Its contents, which are also used for the testing in section 4.4, are provided below for reference; the format used within this file is explained in that same section. To mimic the control-plane activity, Tcl scripts were used to populate the tables through a test program.

0064 1032000
4064 2000000
1065 3000050
7065 2000000
0066 1166000
D066 2000000
8067 3000020
A068 5000000
3069 1101000
2069 3000060
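For the reader's convenience, the sketch below (in C, a purely illustrative helper that is not part of the thesis toolchain) shows how each 4-digit key and 7-digit value above can be assembled from the fields of table 4.1:

#include <stdint.h>
#include <stdio.h>

/* Pack one update.tbl entry according to the format of table 4.1:
 * key   = PCP (3 bits) | DEI (1 bit) | VID (12 bits)                  -> 4 hex digits
 * value = action | vlan value (3 digits) | egress | new pcp | new dei -> 7 hex digits */
static void print_entry(unsigned pcp, unsigned dei, unsigned vid,
                        unsigned action, unsigned vlan_value,
                        unsigned egress, unsigned new_pcp, unsigned new_dei)
{
    unsigned key = ((pcp & 0x7u) << 13) | ((dei & 0x1u) << 12) | (vid & 0xFFFu);
    unsigned val = (action << 24) | ((vlan_value & 0xFFFu) << 12)
                 | ((egress & 0xFu) << 8) | ((new_pcp & 0xFu) << 4) | (new_dei & 0xFu);
    printf("%04X %07X\n", key, val);
}

int main(void)
{
    /* First rule of table 4.1: key 0064 -> modify_vlan(0x032) */
    print_entry(0, 0, 0x064, 1, 0x032, 0, 0, 0); /* prints "0064 1032000" */
    return 0;
}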

Appendix B

Intermediate JSON file

This section provides the intermediate JSON file obtained during compilation, for the reader's reference. The L2_show.json file is given below.

{ "Parser":{ "px_io_tuples":{ "1":{ "px_name":"hdr", "p4_name":"hdr", "px_type_name":"hdr_t", "direction":"out" } }, "px_engines":[ { "px_name":"Parser", "px_type_name":"Parser_t" } ], "px_system_connections":[] }, "Pipeline":{ "px_io_tuples":{ "0":{ "px_name":"hdr", "p4_name":"hdr", "px_type_name":"hdr_t_0", "direction":"inout" }, "1":{ "px_name":"ctrl", "p4_name":"ctrl", "px_type_name":"ctrl_t", "direction":"inout" } }, "px_engines":[ { "px_name":"Pipeline_lvl", "px_type_name":"Pipeline_lvl_t" }, { "px_name":"Pipeline_lvl_0", "px_type_name":"Pipeline_lvl_0_t" }, 72 Appendix B. Intermediate JSON file

{ "px_name":"handle_vlan0", "px_type_name":"handle_vlan0_t" }, { "px_name":"Pipeline_lvl_1", "px_type_name":"Pipeline_lvl_1_t" }, { "px_name":"handle_vlan1", "px_type_name":"handle_vlan1_t" }, { "px_name":"Pipeline_lvl_2", "px_type_name":"Pipeline_lvl_2_t" } ], "px_system_connections":["Pipeline_lvl_0.Pipeline_fl= Pipeline_lvl.Pipeline_fl","Pipeline_lvl_0.ctrl= Pipeline_lvl.ctrl","Pipeline_lvl_0.hdr= Pipeline_lvl.hdr", "Pipeline_lvl_0.local_state= Pipeline_lvl.local_state", "Pipeline_lvl_1.Pipeline_fl= Pipeline_lvl_0.Pipeline_fl", "Pipeline_lvl_1.ctrl= Pipeline_lvl_0.ctrl","Pipeline_lvl_1.hdr= Pipeline_lvl_0.hdr","Pipeline_lvl_1.local_state= Pipeline_lvl_0.local_state","handle_vlan0.request= Pipeline_lvl_0.handle_vlan0_req","Pipeline_lvl_1.handle_vlan0_resp = handle_vlan0.response","Pipeline_lvl_2.Pipeline_fl= Pipeline_lvl_1.Pipeline_fl","Pipeline_lvl_2.ctrl= Pipeline_lvl_1.ctrl","Pipeline_lvl_2.handle_vlan0_resp= Pipeline_lvl_1.handle_vlan0_resp","Pipeline_lvl_2.hdr= Pipeline_lvl_1.hdr","Pipeline_lvl_2.local_state= Pipeline_lvl_1.local_state","handle_vlan1.request= Pipeline_lvl_1.handle_vlan1_req","Pipeline_lvl_2.handle_vlan1_resp = handle_vlan1.response"], "px_lookups":[ { "px_name":"handle_vlan1", "p4_name":"handle_vlan1", "px_class":"LookupEngine", "px_type_name":"handle_vlan1_t", "match_type":"EM", "action_ids":{ "Pipeline.modify_vlan1" : 1, "Pipeline.modify_pcp1" : 2, "Pipeline.modify_dei1" : 3, ".NoAction":4 }, "response_fields":[ { "px_name":"hit", "type":"bits", "size":1 }, { "px_name":"action_run", "type":"bits", "size":3 Appendix B. Intermediate JSON file 73

}, { "px_name":"modify_vlan1_0", "type":"struct", "fields":[ { "px_name":"cvlan_value", "type":"bits", "size" : 12 } ], "p4_action":"Pipeline.modify_vlan1" }, { "px_name":"modify_pcp1_0", "type":"struct", "fields":[ { "px_name":"cvlan_pcp", "type":"bits", "size":4 } ], "p4_action":"Pipeline.modify_pcp1" }, { "px_name":"modify_dei1_0", "type":"struct", "fields":[ { "px_name":"cvlan_dei", "type":"bits", "size":4 } ], "p4_action":"Pipeline.modify_dei1" } ], "request_fields":[ { "px_name":"lookup_request_key", "p4_name":"hdr.vlan[1].pcp", "type":"bits", "size":3 }, { "px_name":"lookup_request_key_0", "p4_name":"hdr.vlan[1].dei", "type":"bits", "size":1 }, { "px_name":"lookup_request_key_1", "p4_name":"hdr.vlan[1].vlanid", "type":"bits", "size" : 12 }

], "annotations":{ "name":["Pipeline.handle_vlan1"], "Xilinx_ExternallyConnected":["0"], "Xilinx_LookupEngineType":["EM"] } }, { "px_name":"handle_vlan0", "p4_name":"handle_vlan0", "px_class":"LookupEngine", "px_type_name":"handle_vlan0_t", "match_type":"EM", "action_ids":{ "Pipeline.modify_vlan0" : 1, "Pipeline.modify_egress" : 2, "Pipeline.modify_pcp0" : 3, "Pipeline.modify_dei0" : 4, "Pipeline.dropPacket" : 5, ".NoAction":6 }, "response_fields":[ { "px_name":"hit", "type":"bits", "size":1 }, { "px_name":"action_run", "type":"bits", "size":3 }, { "px_name":"modify_vlan0_0", "type":"struct", "fields":[ { "px_name":"svlan_value", "type":"bits", "size" : 12 } ], "p4_action":"Pipeline.modify_vlan0" }, { "px_name":"modify_egress_0", "type":"struct", "fields":[ { "px_name":"egress_value", "type":"bits", "size":4 } ], "p4_action":"Pipeline.modify_egress" }, { Appendix B. Intermediate JSON file 75

"px_name":"modify_pcp0_0", "type":"struct", "fields":[ { "px_name":"svlan_pcp", "type":"bits", "size":4 } ], "p4_action":"Pipeline.modify_pcp0" }, { "px_name":"modify_dei0_0", "type":"struct", "fields":[ { "px_name":"svlan_dei", "type":"bits", "size":4 } ], "p4_action":"Pipeline.modify_dei0" } ], "request_fields":[ { "px_name":"lookup_request_key_2", "p4_name":"hdr.vlan[0].pcp", "type":"bits", "size":3 }, { "px_name":"lookup_request_key_3", "p4_name":"hdr.vlan[0].dei", "type":"bits", "size":1 }, { "px_name":"lookup_request_key_4", "p4_name":"hdr.vlan[0].vlanid", "type":"bits", "size" : 12 } ], "annotations":{ "name":["Pipeline.handle_vlan0"], "Xilinx_ExternallyConnected":["0"], "Xilinx_LookupEngineType":["EM"] } } ], "px_user_engines":[] }, "Deparser":{ "px_io_tuples":{ "0":{ "px_name":"hdr", 76 Appendix B. Intermediate JSON file

"p4_name":"hdr", "px_type_name":"hdr_t_1", "direction":"in" } }, "px_engines":[ { "px_name":"Deparser", "px_type_name":"Deparser_t" } ], "px_system_connections":[] } } 77

Bibliography

[1] Barry M. Leiner et al. "A Brief History of the Internet". In: SIGCOMM Comput. Commun. Rev. 39.5 (Oct. 2009), p. 3. ISSN: 0146-4833. DOI: 10.1145/1629607.1629613. URL: http://doi.acm.org/10.1145/1629607.1629613.

[2] Andrew S. Tanenbaum and David J. Wetherall. Computer Networks. 5th. Upper Saddle River, NJ, USA: Prentice Hall Press, 2010. ISBN: 0132126958, 9780132126953.

[3] Dimitrios N. Serpanos and Vasilis Theoharakis. Enterprise Networking: Multilayer Switching and Applications. 1331 E. Chocolate Avenue, Hershey, PA: Idea Group Publishing, 2002. ISBN: 1930708173, 9781930708174.

[4] Anirudh Sivaraman et al. "DC.P4: Programming the Forwarding Plane of a Data-center Switch". In: SOSR (2015), pp. 1–8. DOI: 10.1145/2774993.2775007. URL: http://doi.acm.org/10.1145/2774993.2775007.

[5] Pat Bosshart et al. "P4: Programming Protocol-independent Packet Processors". In: SIGCOMM Comput. Commun. Rev. 44.3 (July 2014), pp. 1–8. ISSN: 0146-4833. DOI: 10.1145/2656877.2656890. URL: http://doi.acm.org/10.1145/2656877.2656890.

[6] Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks. URL: https://tools.ietf.org/html/rfc7348.

[7] Haoyu Song. "Protocol-oblivious forwarding: unleash the power of SDN through a future-proof forwarding plane". In: HotSDN (2013).

[8] R. Duncan and P. Jungck. "packetC Language for High Performance Packet Processing". In: 2009 11th IEEE International Conference on High Performance Computing and Communications (2009), pp. 450–457. DOI: 10.1109/HPCC.2009.89.

[9] G. Brebner and W. Jiang. "High-Speed Packet Processing using Reconfigurable Computing". In: IEEE Micro 34.1 (2014), pp. 8–18. ISSN: 0272-1732. DOI: 10.1109/MM.2014.19.

[10] The P4 Language Consortium. P4 Language and Related Specifications. URL: https://p4.org/specs/.

[11] A. Håkansson. "Portal of Research Methods and Methodologies for Research Projects and Degree Projects." In: FECS (2013), pp. 67–73. URL: http://kth.diva-portal.org/smash/record.jsf?pid=diva2%3A677684&dswid=8266.

[12] Kacy Zurkus. SDN solves a lot of network problems, but security isn't one of them. 2017. URL: https://www.csoonline.com/article/3179637/security/sdn-solves-a-lot-of-network-problems-but-security-isnt-one-of-them.html.

[13] G. Brebner. "Softly Defined Networking". In: ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS) (2012), pp. 1–1.

[14] Software Defined Specification Environment for Networking (SDNet). Xilinx Inc. Mar. 2014. URL: https://www.xilinx.com/publications/prod_mktg/sdnet/backgrounder.pdf.

[15] C. Dixon et al. "Software defined networking to support the software defined environment". In: IBM Journal of Research and Development 58.2/3 (2014), 3:1–3:14. ISSN: 0018-8646. DOI: 10.1147/JRD.2014.2300365.

[16] Henning Stubbe. "P4 Compiler & Interpreter: A Survey". In: Chair of Network Architectures and Services, Department of Computer Science, Technische Universität München (2017). DOI: 10.2313/net-2017-05-1_07. URL: https://www.net.in.tum.de/fileadmin/TUM/NET/NET-2017-05-1/NET-2017-05-1_07.pdf.

[17] P4_16 Portable Switch Architecture (PSA) (working draft). The P4.org Architecture Working Group. Oct. 19. URL: https://p4.org/p4-spec/docs/PSA.pdf.

[18] P4-SDNet User Guide. UG1252. v2018.1. Xilinx Inc. Apr. 2018. URL: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2018_1/ug1252-p4-sdnet.pdf.

[19] David Hancock and Jacobus van der Merwe. "HyPer4: Using P4 to Virtualize the Programmable Data Plane". In: Proceedings of the 12th International on Conference on Emerging Networking EXperiments and Technologies. CoNEXT '16 (2016), pp. 35–49. DOI: 10.1145/2999572.2999607. URL: http://doi.acm.org/10.1145/2999572.2999607.

[20] Sándor Laki et al. "High Speed Packet Forwarding Compiled from Protocol Independent Data Plane Specifications". In: Proceedings of the 2016 ACM SIGCOMM Conference. SIGCOMM '16 (2016), pp. 629–630. DOI: 10.1145/2934872.2959080. URL: http://doi.acm.org/10.1145/2934872.2959080.

[21] SDNet Packet Processor User Guide. UG1012. v2017.1. Xilinx Inc. June 2017. URL: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2017_1/UG1012-sdnet-packet-processor.pdf.

[22] P4-NetFPGA Lecture 3. P4.org. URL: https://cs344-stanford.github.io/lectures/Lecture-3-P4-NetFPGA.pdf.

[23] SDNet Compiler Installation, Release Notes, and Getting Started Guide. UG1018. v2017.1.1. Xilinx Inc. July 2017. URL: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2017_1/ug1018-sdnet-installation.pdf.

[24] Fabien Geyer and Max Winkel. "Towards Embedded Packet Processing Devices for Rapid Prototyping of Avionic Applications". In: 9th European Congress on Embedded Real Time Software and Systems (Jan. 2018). URL: https://hal.archives-ouvertes.fr/hal-01711011.

[25] Muhammad Shahbaz et al. "PISCES: A Programmable, Protocol-Independent Software Switch". In: Proceedings of the 2016 ACM SIGCOMM Conference. SIGCOMM '16 (2016), pp. 525–538. DOI: 10.1145/2934872.2934886. URL: http://doi.acm.org/10.1145/2934872.2934886.

[26] 7 Series FPGAs Data Sheet: Overview. DS180. v2.6. Xilinx Inc. Feb. 2018. URL: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2017_1/ug1018-sdnet-installation.pdf.

[27] Vamsi Krishna. Designing with Xilinx® FPGAs. Ed. by Sanjay Churiwala. 1st. Springer International Publishing, 2017. Chap. 4. ISBN: 978-3-319-82581-6.

[28] 7 Series FPGAs GTX/GTH Transceivers User Guide. UG476. v1.12.1. Xilinx Inc. Aug. 2018. URL: https://www.xilinx.com/support/documentation/user_guides/ug476_7Series_Transceivers.pdf.

[29] T-BERD/MTS-5800 Handheld Network Tester. 2018. URL: https://www.viavisolutions.com/en-us/products/t-berd-mts-5800-handheld-network-tester.

[30] LogiCORE IP AXI4-Lite IPIF v2.0 Product Guide for Vivado Design Suite. PG155. Xilinx Inc. Dec. 2018. URL: https://www.xilinx.com/support/documentation/ip_documentation/axi_lite_ipif/v2_0/pg155-axi-lite-ipif.pdf.

[31] 10G/25G High Speed Ethernet Subsystem v2.3 Product Guide. PG210. Xilinx Inc. Dec. 2017. URL: https://www.xilinx.com/support/documentation/ip_documentation/xxv_ethernet/v2_3/pg210-25g-ethernet.pdf.

[32] Understanding the Ethernet Nomenclature – Data Rates, Interconnect Mediums and Physical Layer. URL: https://www.synopsys.com/designware-ip/technical-bulletin/ethernet-dwtb-q117.html.
