
Automatic Identification and Utilization of Custom Instructions for Network Processors

Dissertation approved by the Faculty of Electrical Engineering and Information Technology of RWTH Aachen University (Rheinisch-Westfälische Technische Hochschule Aachen) for the award of the academic degree of Doktor der Ingenieurwissenschaften (Doctor of Engineering)

Presented by Diplom-Informatiker Hanno Scharwächter from Leverkusen, North Rhine-Westphalia

Reviewers: Universitätsprofessor Dr. rer. nat. Rainer Leupers and Universitätsprofessor Dr.-Ing. Gerd Ascheid

Date of the oral examination: May 21, 2015

This dissertation is available online on the website of the university library.

Summary

In 1965, Gordon E. Moore described a groundbreaking trend for chip development: he predicted that the number of circuit components per unit area of a processor would double every two years. This trend has held true to this day and has since enabled dramatic changes in processor design. Particularly in the field of protocol processing (TCP/IP), the development of new processors has been driven by enormous growth rates in Internet users and by innovative applications such as Voice-over-IP and Internet telephony (which have long since become standard). Out of this situation arose the vision of a network processor (NPU) that combines the efficient processing of high data volumes with the flexibility to develop new algorithms. Balancing these conflicting requirements leads to time-consuming design cycles, so that automated design methods and corresponding tools are indispensable for designing an NPU system with acceptable effort. Compiler-in-the-loop architecture exploration is nowadays regarded as the right way to solve this problem. In this context, automated compiler generation and instruction set extension (ISE) play important roles. Both techniques allow processor designers to conveniently adapt the instruction set architecture (ISA) of an NPU to the requirements of network applications and, moreover, to provide users with a corresponding programming model through a compiler. This thesis presents case studies on architecture exploration and compiler optimization in connection with ISE for NPUs. Motivated by the insights gained, a framework for automated compiler/architecture co-exploration is developed. The advantage of this framework lies in the ability to derive an optimized ISA and a matching compiler by analyzing C applications, thereby supporting the design of programmable processor architectures for these applications. At the same time, the architecture exploration flow is accelerated, since time-consuming application analyses and compiler retargeting are effectively supported.


The framework is based on two tools: an analysis of C applications for identifying instructions for an ISE, and a code generator that produces an algorithm which, embedded in a compiler, enables the automatic utilization of these instructions by the compiler. The analysis takes complete applications into account and is not restricted to individual sections of the applications. Thanks to its polynomial runtime complexity, this approach can also handle complex applications or several applications at once. The result of this procedure is a description of the instructions in a form that the code generator can process. Based on this description, the code generator produces a code-selection algorithm. The code selection has linear runtime complexity and can capture arbitrarily complex instructions, in particular instructions that compute several results simultaneously. The entire framework is embedded in an industry-proven architecture exploration methodology, which allows the iterative development of a processor based on a model written in an architecture description language. During exploration, relevant tools such as assembler, linker, or simulator can be generated from the model. The result is an improved design flow in which the designer obtains an optimized set of instructions and a matching compiler within a single design cycle. In subsequent cycles, the designer integrates the instructions into the processor model and evaluates the changes on the processed applications with the help of the new compiler and the generated simulator.

Abstract

In 1965, Gordon E. Moore described a groundbreaking trend for the design of processors: he predicted a doubling of the number of circuit components fabricated on a single chip every two years. This trend has proven true and has enabled spectacular rates of progress in semiconductor technology since then. Particularly in the field of protocol processing, system design has been driven by the continuously growing number of Internet users and traffic accompanying the rise of new network protocols like Voice-over-IP, VPN or Internet TV (which have long since become standards). This development has led to the rise of a new type of Application-Specific Instruction Set Processor (ASIP), called Network Processing Unit (NPU), especially designed to support high-level programmability of network protocols and efficient processing at the same time. Designing such systems under increasing time-to-market pressure imposes clear requirements for systematic and, moreover, automated design methodologies for building NPUs, so that the time and effort to design a system containing both hardware and software remain acceptable. Compiler-in-the-Loop (CiL) architecture exploration is widely accepted as being the right track for fast development of ASIPs like NPUs. In this context, automatic application-specific Instruction Set Extension (ISE) and code generation by a compiler have received huge attention in the past. Together, both techniques enable processor designers to quickly adapt a processor's Instruction Set Architecture (ISA) to the needs of a certain set of applications and to provide an appropriate high-level programming model. This manuscript presents a detailed analysis of architecture exploration for NPUs. It develops a scalable framework for automatic compiler/architecture co-exploration, targeting the domain of network applications. First, the framework is based on a novel code-selection technique for the compiler and an appropriate code-generator for this technique. The code-generator takes a description of the ISA as input and produces a set of C-functions allowing for comfortable implementation of the aforementioned code-selection. The code-selection algorithm features linear runtime complexity and, in contrast to traditional approaches of this type, is capable of handling arbitrarily complex instructions; especially inherently parallel instructions computing multiple results at the same time.


Second, the framework is based on a novel methodology for the automatic identification of possible ISEs. It identifies new instructions through the analysis of complete applications and is not restricted to selected hotspots of these applications. Its polynomial runtime complexity enables the analysis of (a set of) complex real-world applications. It results in a grammar description of the most efficient hardware instructions, which can in turn be processed by the aforementioned code-generator. By embedding this tool flow in an industry-proven architecture exploration framework, a methodology for simultaneous compiler/architecture co-exploration is derived, which allows for iterative development of a processor model written in an Architecture Description Language (ADL). During architecture exploration, relevant tools like assembler, linker and simulator can be generated based on the processor model. Thereby an improved design flow is created that enables designers to retrieve an optimized ISA and an appropriate compiler in a single iteration. In subsequent iterations, the designer integrates the instructions into the processor model and evaluates the effects on the target applications with the help of the retargeted compiler and the generated simulator.

Acknowledgments

First and foremost I would like to thank my thesis supervisor Professor Rainer Leupers for providing me with the opportunity to work in his group as well as for his important advice and constant encouragement throughout the course of my research. He always left me a lot of freedom and contributed much to an enjoyable and productive working atmosphere. I am also thankful to Professors Gerd Ascheid and Heinrich Meyr. I want to thank all of them for the lessons they gave me on the importance of details for the success of a scientific or engineering project. It has been a distinct privilege for me to work with them.

There were a number of people in my everyday circle of colleagues who have enriched my personal life in various ways and who were also true friends. I am particularly indebted to my colleagues Manuel Hohenauer, Kingshuk Karuri, Jiangjiang Ceng and Stefan Kraemer, who worked together with me from the first minute. Without their contributions, their support and the inspiring atmosphere within our team, this work would not have been possible. I thank all of you. Special thanks goes also to David Kammler, who was my partner in many publications. His expertise in hardware generation was an invaluable asset for my work. I am particularly indebted to Jonghee M. Youn, Matthias Stiefelhagen and Robin Stemmer for their cooperation and support during my studies. I hope the best for you wherever you are. I am also very grateful to Manuel Hohenauer, Jeronimo Castrillon and Nina Hildebrand, who were patient and brave enough to carefully proofread this thesis. Their constructive feedback and comments at various stages have been significantly useful in shaping this thesis up to completion.

At last, I would like to thank the people I care about most in the world, my mother and my wife Jenny. I would like to thank Jenny for the many sacrifices she has made to support me in undertaking my doctoral studies. Her true affection and dedication were the most valuable support and gave me the power to thrive in hard times. Finally, my biggest thanks goes to my mother Inge, without whom I would not be sitting here in front of my computer typing these acknowledgements. I owe her much of what I have become. I dedicate this work to her to honor her love, patience and support throughout my whole life.


Contents

1 Introduction 1

2 An Overview of Network Processors 7
2.1 History 8
2.2 Network Applications 9
2.2.1 Internet Protocol Version 6 10
2.2.2 Multimedia Networking 12
2.3 Architecture Survey 15
2.3.1 Design Attributes 16
2.3.2 Parallel Processing 16
2.3.3 Hardware Accelerators 17
2.3.4 Memory 18
2.4 Industry Products 18
2.5 Programming Network Processors 27
2.5.1 Overview 28
2.5.2 Programming Language/Compilers 30
2.5.3 MP-SoC Programming 31
2.6 Concluding Remarks 34

3 Compilation and Instruction Set Extensions 37
3.1 Compilation 37
3.1.1 Frontend 38
3.1.2 Intermediate Representation 39
3.1.3 Backend 45
3.2 Instruction Set Extensions 50
3.2.1 Problem Statement 51


3.2.2 Procedure Description 52
3.3 Concluding Remarks 54

4 Case study: Compiler-Agnostic Architecture Exploration 57
4.1 System Overview 59
4.1.1 Synopsys Processor Designer 60
4.2 Target Application 63
4.3 Exploration Methodology 66
4.3.1 Application Profiling 67
4.3.2 Architecture Exploration 67
4.3.3 Architecture Implementation 72
4.3.4 Experimental Results 73
4.4 Concluding Remarks 74

5 Case study: Compiler-Driven Instruction Set Extension 77
5.1 Driver Architecture: Infineon Convergate 78
5.2 A Low Overhead Calling Convention for Network Processors 80
5.3 Optimized Selection of Calling Conventions 82
5.3.1 System Overview 82
5.3.2 Candidate Selection 83
5.3.3 Example 85
5.3.4 Algorithm Complexity 86
5.4 Experimental Results 87
5.5 Concluding Remarks 88

6 Automatic Compiler-Driven Utilization of Custom Instructions 91
6.1 Related Work 93
6.2 System Overview 94
6.3 Code-Selection Algorithm 95
6.3.1 Candidate Enumeration 96
6.3.2 Candidate Set Selection 98
6.3.3 Pre-Cover Phase 101
6.3.4 Complexity Analysis 103
6.4 Experimental Results 103
6.4.1 Hardware Synthesis 104
6.5 Concluding Remarks 107

7 Automatic Compiler-Driven Identification of Custom Instructions 109
7.1 Related Work 110
7.2 System Overview 111

7.3 Methodology 112
7.3.1 Subgraph Enumeration 113
7.3.2 Isomorphism Detection 118
7.3.3 Covering 118
7.4 Experimental Results 122
7.5 Concluding Remarks 126

8 Future Work 127

9 Conclusion 129

A The IRISC Architecture 133
A.1 Architecture Survey 133
A.2 Instruction Set Architecture 134
A.2.1 Conditional Execution and Compare-Instructions 134
A.2.2 Arithmetic Instructions 135
A.2.3 Memory-Access Instructions 136
A.2.4 Load Immediate Instructions 137
A.2.5 Branch-Instructions 137
A.3 Instruction Set Extensions 137
A.3.1 Encryption Processing 138
A.3.2 Protocol Processing 139
A.3.3 Image Processing 139

B Programming Interface of CBurg 141
B.1 C-functions for Code-Selector Implementation 141
B.2 Code-Selector Specification within CBurg 142
B.2.1 Definitions 143
B.2.2 Declarations 144
B.2.3 Rules 145
B.2.4 Programs 145

C Partitioned Boolean Quadratic Programming 147

Glossary 149

List of Figures 155

List of Tables 159

Bibliography 161

Chapter 1 Introduction

Presently, embedded systems are one of the mainsprings of product innovation in industrial sectors like consumer electronics, automotive or medical engineering. About 80% of the total production in these sectors contains embedded components. The global market value of embedded systems is estimated at 135 billion euros, growing annually by 9% [8]. At the moment, Germany (along with the USA and Japan) is leading in the field of engineering application-specific electronic semiconductors, which are applied in embedded systems. However, the pressure of competition is rapidly growing, particularly from Asia [8]. In contrast to classic Information Technology (IT), the basic conditions in the construction phase of embedded systems are entirely different: First of all, the resources of processors and memories are significantly smaller compared to the classic IT domain; low code size and high execution performance are therefore of highest priority. Secondly, many embedded systems have to provide a maximum level of security, reliability and integrity, particularly if they control critical functionality in cars, planes or machines. In order to meet these requirements, the design complexity of processors used in this domain comes with very high engineering costs, which, as a result of Moore's Law [208], grow exponentially. To control and amortize these non-recurring engineering costs, efficient design methodologies (for a short time-to-market) and a growing substitution of Application-Specific Integrated Circuits (ASIC) by programmable processors (for a long time-in-market) are necessary. Due to these conditions, the demand for early system evaluation rises. Rapid prototyping, with early system-level evaluation and retargetable code-generation, is considered to be an effective instrument for conquering the continuously increasing complexity in the embedded systems domain. As a result, new design methodologies for Application-Specific Instruction Set Processors (ASIP) [140], such as Network Processing Units (NPU), have emerged in recent years, which rely on the principle of iterative application simulation/profiling and architecture refinement of virtual processor prototypes [151] (Figure 1.1). Based on profiling results, bottlenecks are identified, the instruction set is fine-tuned and Custom Instructions (CI) are

added to gradually improve the architecture's efficiency. In this scenario, automatic Instruction Set Extension (ISE) and code-generation through a compiler adopt an important position.

Figure 1.1 – Overview of Compiler-in-the-Loop architecture exploration.

On the one hand, compilers represent the interface between the Instruction Set Architecture (ISA) of an ASIP and high-level programming languages like C/C++, eliminating the need for error-prone and time-consuming assembly programming. This is important since the amount of software in the embedded domain is projected to double every two years [279]. Indeed, Design Space Exploration (DSE) without the Compiler-in-the-Loop (CiL) can be meaningless. For example, a modification of the ISA or the addition of a custom instruction is of no use unless a compiler is capable of producing code to exploit such architectural features. Furthermore, a smart compiler can obviate the necessity of implementing costly architectural features. On the other hand, a compiler may place certain requirements on the architecture, e.g. the presence of instructions to load 32-bit immediates into a register. However, a compiler capable of utilizing advantages of microarchitectural features is critically needed in order to effectively explore application-specific design spaces. Moreover, the automatic identification of profitable ISEs in accordance with execution performance and hardware efficiency enables processor designers to quickly adapt a given processor template to one or more applications. It then allows for an efficient, yet flexible processor architecture. Consequently, iteratively refining a given virtual prototype during architecture exploration is improved drastically. In fact, this is especially important for Multi-Processor System-on-Chip (MP-SoC) architectures like NPUs that employ a set of dedicated cores with customized ISAs. During the DSE for such systems, multiple cores have to be developed simultaneously. Within this scenario, manual iterative architecture exploration for each core may become prohibitively slow.

Network Processing Units: One of embedded systems' major tasks within modern consumer electronic products is Internet access. In this context, Internet access is not simply limited to browsing websites: new access protocols using the infrastructure of the World Wide Web (WWW) have appeared that enable applications like Voice-over-IP (VoIP) and/or Internet TV. Such innovations, in combination with mobile Internet processing via embedded systems, have boosted product innovations by extending and integrating multimedia/Internet access into products. While the industry has settled on the Internet Protocol (IP) as the de facto layer-3 protocol, IP, as a protocol suite, is still in flux. The packet format is constantly being fine-tuned with both vendor-specific and standards-based implementations [97]. Additions such as Differentiated Services, Internet Protocol Security (IPSec), encapsulation/tunneling, and Layer 2 Tunneling Protocol (L2TP) tunneling-over-IP represent only some recent developments. These options require additional functionality of routers and switches [97]. Moreover, the increasing number of mobile devices with wireless Internet access like Personal Digital Assistants (PDA), smart phones, tablets and laptops, as well as Virtual Private Networks (VPN), have made security one of the most important features of contemporary Internet traffic. There is also the anticipated development of IPv6, which implies a fundamental change to the packet header structure. The speed at which IPv6 will become widely adopted is unclear; however, protocols will continue to evolve. Beyond the IP packet format itself, there are protocols used for IP that are also in a continuous state of evolution. This increasing network "intelligence" and the fact that Internet traffic is also steadily growing (Figure 1.2) put tight constraints on hardware development for efficient Internet/network protocol processing [97]. This development has triggered a demand for programmable high-performance network equipment, allowing Internet packets to be processed at wire speed and new protocols/applications to be implemented without the necessity to change any hardware components. In comparison, system vendors with ASIC-based equipment have had difficulties extending the overall packet processing functionality, and using completely programmable Central Processing Units (CPU) prevented products from being competitive from a performance point of view. This was the rise of NPUs: cost-efficient, programmable system solutions for evolving applications that allow packet processing at high data rates. NPUs exhibit a wide range of architectures for performing similar tasks; from simple Reduced Instruction Set Computer (RISC) cores with dedicated peripherals, in pipelined and/or parallel organization, to heterogeneous MP-SoCs based on complex multi-threaded cores with customized ISAs [247]. Programming such concurrent systems still remains an art form. Due to the lack of good programming models, programmers are not only required to partition and balance the load of the application manually amongst multiple Processing Elements (PE); it is also necessary to implement each task of the application in low-level assembly language and to run all tasks simultaneously (each on a different PE) in order to get reliable performance estimates of the tasks' collaborations. A large group of contemporary programming solutions for NPUs is built on Domain-Specific Languages (DSL), which enable programmers to directly map

blocks of high-level code to parts of an NPU architecture. Nevertheless, this still has its limitations. Legacy code, which mostly exists in C, becomes useless, and the expressiveness of such DSLs is limited compared to classical Turing-complete high-level languages like C/C++. In fact, the increasing complexity of looming new network protocols and the need to support legacy protocol implementations require sophisticated compiler support to exploit special-purpose ISAs for efficient application development based on NPU architectures. Within this scenario, effective DSE for programmable NPUs is a central prerequisite for handling the increasing complexity and variety of network protocols and applications. Since architecture exploration for an NPU naturally includes the design of multiple PEs, iterative architecture exploration becomes excessively slow for large numbers of heterogeneous PEs. Therefore, automatic ISE tailored to the identification of reusable CIs, in combination with retargetable code-generation, is vitally required for simultaneous fast-paced development of multiple programmable PEs. Thus, a CiL architecture exploration methodology that considers simultaneous ISA and compiler design of PEs is key to a high-level programmable, yet efficient, NPU architecture.

Figure 1.2 – Number of Internet users in 2009 [9].

Contribution: This thesis addresses the issue of automatic identification and utilization of custom instructions with special focus on network applications and related architectures. While automatic ISE has been thoroughly investigated for the embedded systems domain in general, the automatic utilization of identified CIs by the code-selector of a compiler has not received much consideration. In addition, network applications place different constraints on the analysis. Contrary to other embedded applications, core network applications like IPv6 or Ethernet processing feature a high amount of I/O interrupts and little computation intensity, which explains why the "one and only" hotspot is not easily identifiable by profiling. An effective methodology for architecture exploration of NPUs should therefore be adapted to this problem. This thesis presents a detailed analysis of architecture exploration/ISE for NPUs. It develops a scalable framework for automatic compiler/architecture co-exploration, targeting the domain of network applications. The tool flow is very effective as it runs in polynomial time, allowing even large applications to be processed. The research contribution contained herein is twofold: On the one hand, a novel code-selection technique is introduced that is integrated in a code-generator generator tool bearing similarities with tools like Iburg [115] and Olive [256]. The tool produces C-code to enable comfortable implementation of a graph-based code-selection algorithm. Its retargetable design makes it easily adaptable to different compiler frameworks. The generated code-selection algorithm runs in linear time and allows for matching arbitrarily complex hardware instructions. Thus, it is particularly applicable for so-called Multi-Output Instructions (MOI), which are considered to provide the most speedup benefit in the context of ISE [126]. On the other hand, this thesis introduces a novel tool for automatic ISE. Due to its polynomial runtime, the tool scales to large (sets of) applications consisting of several thousand lines of C-code. The tool implements an ISE methodology that allows for recurrence-aware CI identification. Contrary to existing approaches like [184], not only selected basic blocks are analyzed, but entire applications. Identified CIs are evaluated regarding their overall contribution to the application's execution profile, while intersections of identified CIs are considered. Finally, a set with the most beneficial CI patterns is selected and a code-selector description is generated for the extended ISA, in order to automatically retarget the compiler's code-selector. The proposed framework is seamlessly integrated into an industry-proven ADL-based architecture exploration methodology, creating an improved design flow that allows for simultaneous processor/compiler co-exploration. Through its file-based I/O format, the framework is completely independent of any special compiler or processor framework. Thus, it can easily be integrated within every conceivable processor/compiler development system.
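To give a flavor of how a burg-style code-selector operates, the following C sketch shows the classic two-pass structure (bottom-up labeling, top-down reduction) that generated matchers of this family implement. All names and data structures here are hypothetical illustrations in the spirit of Iburg-like tools, not the actual CBurg interface, which is documented in Appendix B.

#include <stdio.h>

/* Hypothetical IR node of an expression tree; the field names are
 * illustrative and do not reflect the actual CBurg data structures. */
typedef struct Node {
    int          op;      /* operator id, e.g. 1 = ADD, 2 = MUL     */
    struct Node *kids[2]; /* child nodes, NULL for leaves           */
    int          state;   /* matcher state set by the labeling pass */
} Node;

/* Bottom-up labeling: annotate every node with the cheapest rule
 * that can cover it. A generated matcher replaces this body by a
 * table lookup keyed on (op, states of the kids).                  */
static void cs_label(Node *n) {
    if (!n) return;
    cs_label(n->kids[0]);
    cs_label(n->kids[1]);
    n->state = n->op;  /* placeholder for the generated state function */
}

/* Top-down reduction: walk the chosen cover and emit instructions. */
static void cs_reduce(const Node *n) {
    if (!n) return;
    cs_reduce(n->kids[0]);
    cs_reduce(n->kids[1]);
    printf("emit rule for op %d (state %d)\n", n->op, n->state);
}

int main(void) {
    /* Expression tree for (a + b) * c; op 0 denotes a register leaf. */
    Node a = {0, {NULL, NULL}, 0}, b = a, c = a;
    Node add = {1, {&a, &b}, 0};
    Node mul = {2, {&add, &c}, 0};
    cs_label(&mul);
    cs_reduce(&mul);
    return 0;
}

The linear runtime claimed above follows directly from this structure: each IR node is visited a constant number of times in each of the two passes.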

Thesis Organization: The outline of the thesis can be roughly structured into four parts: background, motivation, framework description, as well as outlook and conclusion. Results demonstrating the success of the present work are shown within the chapters to which they belong. To provide the relevant background in related work and compiler knowledge, Chapter 2 first provides a brief report on network protocol processing. This includes both processor architectures and programming paradigms. Next, Chapter 3 explains related techniques for compilation and ISE. The motivation section consists of two approaches towards the design of sophisticated hardware support for protocol processing. Chapter 4 introduces the ADL-based architecture exploration methodology, which is used and improved throughout this thesis. It presents a case study of the corresponding architecture exploration flow by the design of an application-specific coprocessor. While compiler-related issues are entirely neglected within this methodology, Chapter 5 presents another case study targeting the contrary approach. The chapter continues with a description of an NPU-specific compiler optimization, suggesting certain ISEs for the underlying architecture. Based on the conclusions of these studies, a framework for automatic code-selector generation and compiler-driven ISE is proposed in Chapter 6 and Chapter 7. The explanations start with the introduction of a technique for automatic code-selector generation, easily integrable into arbitrary compiler frameworks. The hereby generated heuristic algorithm works on entire basic blocks and is able to match arbitrarily complex instruction patterns; particularly patterns with a fan-out larger than one. Subsequently, a methodology for automatic computation of ISEs is described. The presented methodology is capable of analyzing complete applications and takes recurrences of instruction patterns into account. The technique also does not rely on any specific compiler system and produces a code-selector description that includes the identified CIs. The thesis concludes with an outlook on future work in Chapter 8 and an overall conclusion concerning the presented research in Chapter 9.

Chapter 2 An Overview of Network Processors

Many semiconductor manufacturers like LSI, EZChip and Intel have begun to sell a new type of ASIP over the past years: the Network Processor (NP) or Network Processing Unit (NPU). NPUs are programmable chips (like general-purpose microprocessors), optimized for the packet processing required in network devices. Network devices are a growing class of embedded systems and include traditional Internet equipment like routers, switches, and firewalls, newer devices like VoIP bridges, VPN gateways, and Quality-of-Service (QoS) enforcers, as well as web-specific devices like caching engines, load balancers, and Secure Sockets Layer (SSL) accelerators. Network equipment can be divided into two categories: core/metro equipment (high-speed routers and switches etc.) and access equipment (VoIP bridges, VPN gateways etc.) [269]. Core and metro equipment constitute the "backbone" of the Internet and are therefore at the leading edge in terms of data rates [268]. They include high-speed routers as well as interfaces to other networks, which reside at the edges of carrier networks. Access equipment, in contrast, utilizes the "metro" of the Internet to support sophisticated communication functionality. In accordance with the underlying equipment, network applications/protocols are also categorized into metro (IPv6, Ethernet) and access (VPN, IP-TV) applications/protocols. The NPU trend goes back to the days of the Internet boom in the late 1990s. It was launched with all the hype surrounding anything related to the Internet as "the new technology on the block". As may be expected with a new technology, promoters promised a new revolution in sight, which resulted in tens of startup companies dedicated to this area. Several applications were envisioned at different layers of the network architecture. As time passed, not all high expectations were realized and the bubble burst along with that of the Internet. The demand for increased processing speed (a result of communication speed surpassing processing speed) and for adaptability (a result of converging voice and data networks), coupled with the prospect of a whole new set of emerging services, added to the need for a new paradigm in network devices. A high level of programmability was sought to support new services and protocols at a very high performance level. Additionally, short time-to-market and longer product lifetime were important factors driving the concept

of the NPU. In the end, NPUs did significantly improve network "product development", but did not revolutionize the network architecture itself. Overall, it remains a promising technology with significant potential to shape the future network architecture. The remainder of this chapter discusses the history (Section 2.1), applications (Section 2.2), architectures (Section 2.3) and programming issues (Section 2.5) of NPUs in detail.

2.1 History

Over the past 20 years, the engineering of network systems has changed dramatically. Their architectures can be divided into three main generations [91]:

• The first generation dates back to the 1980s when standard processors were used for network applications, like a minicomputer for routing.

• By the mid-1990s, speed and complexity of systems had increased to such an extent that designers added special hardware blocks to relieve the load on the CPU.

• The third generation systems employed specialized hardware in ASICs and even attempted to use multiple ASICs for higher performance systems. Because the protocols consolidated around Ethernet and IP at that time, little flexibility was needed and fully hardware-fixed solutions were satisfactory.

On the other hand, with the introduction of optical fibers in transport networks, the serial Time-Division Multiplexing (TDM) Synchronous Optical Network/Synchronous Digital Hierarchy (SONET/SDH) transmission speed grew exponentially and reached a rate of 40 Gb/s by 2000 [75]. Although the speed increase was not expected to go beyond 100 Gb/s because of the limits of transceivers, this had already put pressure on the network device designers. To make the situation more challenging, the deployment of Wavelength Division Multiplexing (WDM) transmission technology brought radical changes by increasing the transmission capacity of fiber links to 1.6 Tb/s and above [75]. As can be noted in the transition above, network devices were upgraded to higher speeds by increasing the share of hardwired components with each new generation. Towards the end of the 1990s, the Internet boom period, the convergence of voice and data networks became more imminent. As a result, the industry needed to develop a new and wider range of protocols and services, e.g. multimedia services. The pace at which new services and their further upgrades were introduced accelerated, shortening the product cycle and requiring faster time-to-market. More complex services were expected to become the norm, for example moving routers beyond just store-and-forward machines and increasing the required processing power by several orders of magnitude. Bringing back programmability, the hallmark of the first generation, without forfeiting performance, the hallmark of the second generation, seemed to be the best solution. The result was a new type of hardware known as the NPU. NPUs need to deliver the speed of an ASIC combined with the intelligence and flexibility of programmable microprocessors. Key to the NPU's success is an architecture that enables the implementation of high-level applications in high-speed networking environments [107]. When designing the new hardware, several design choices had to be made. The most important tasks were to optimize packet processing through either full hardware support or acceleration; to choose type, technology and speed of the memory interface and I/O interconnect; and to select software tools and programming languages. The variety of choices led to much trial and error by designers at multiple startup companies, including Clearwater, Cognigine, Brecis, Lexra, Alchemy, SiByte, Maker etc. [247]. Today, only a few of these companies remain in business. In 2005, the NPU market had only $174 million in revenue, although an earlier estimation from 2001 had predicted a billion-dollar revenue for 2003 [75]. At that time, a list of 30 companies developing or selling NPUs could be found. The biggest players were AMCC, Intel, Agere, Hifn, Wintegra and EZchip, listed according to market share [2]. As of 2007, the only companies that were shipping NPUs in sizable volumes were Cisco Systems, Marvell, Freescale, Cavium Networks and AMCC [15]. Sales of embedded processors rebounded strongly in 2010, renewing a linear growth trend interrupted by the 2009 downturn, according to the newest market-share report by The Linley Group [12]. Intel and Freescale claimed the top rankings; no other company came close to their size. Freescale alone was as big as all the smaller suppliers combined.

2.2 Network Applications


Figure 2.1 – Typical structure of packet processing applications [97].

A networking device can be broken down into four overall functions: PHY (physical) layer processing, host processing, packet processing and switching. Figure 2.1 depicts the primary elements of a networking device, including the five functions that make up packet processing. Packet processing includes parsing the header, classification of the packet so as to assign a packet to a

QoS class, determination of the next hop (forwarding), evaluation of Service Level Agreements (SLA) (i.e. policing), queuing and finally link scheduling. Whether all tasks are required and how complex they may become depends on the service that the network node wants to provide. By parsing the header of an incoming packet, information about the packet is made available for later processing, such as the length of the packet, the destination address and the protocol type. A subsequent filter stage with a small set of rules decides, based on the extracted header information, whether the packet is allowed to pass further processing stages or whether it should be dropped immediately. In case of admission, a classification stage uses the extracted header fields to associate the packet with its context information, like the corresponding QoS class (a traffic flow identifier) and the reserved rate. A forwarding stage, which can be combined with the classifier, uses the destination address of the packet to determine whether the packet is passed directly onto an outgoing link, or to further internal processing tasks. Further internal processing may be required at the network edge (end systems), where higher-layer protocol processing also takes place. In case of forwarding the packet to an outgoing link, a policer uses the context information assigned by the classifier to identify the corresponding traffic flow by evaluating the traffic profile. A traffic profile may specify properties like the rate of incoming traffic. A profile is typically part of a SLA between a customer and an Internet Service Provider (ISP). A SLA states that as long as traffic complies with a certain profile, the ISP will ensure a certain QoS, e.g. in terms of delay and loss. Thus, the policer marks a packet as conforming or as non-conforming to a flow's profile. Non-conforming packets may be immediately dropped. Before the packet can finally be transmitted through the outgoing link, it must be queued until the link scheduler chooses the packet for transmission. The policy by which the scheduler chooses packets depends on the header and context information. Software functions of networking systems can generally be categorized as residing in the data plane or the control plane. Data plane functions are applied to packets moving through the system and therefore face real-time performance constraints. Control plane functions, however, are very broad. They include management functions required for internal coordination among components of a system, and between the system and its peripherals. Control plane software furthermore handles a number of other, less time-consuming operations dealing with traffic passing through the system.
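As a rough illustration of how these data-plane stages compose on the fast path, the following C sketch chains parse, filter, classify and forward steps for a single packet. The struct layout, the toy rules, and the header offsets (loosely following the IPv4 layout) are simplified placeholders, not code for any particular NPU.

#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Simplified view of the header fields extracted by the parse stage. */
typedef struct {
    uint16_t length;     /* total packet length     */
    uint8_t  protocol;   /* transport protocol type */
    uint32_t dst_addr;   /* destination address     */
} ParsedHeader;

typedef struct {
    const uint8_t *data;
    size_t         len;
    ParsedHeader   hdr;
    int            qos_class;   /* assigned by the classification stage */
} Packet;

static bool parse(Packet *p) {
    if (p->len < 20) return false;               /* too short for a header */
    p->hdr.length   = (uint16_t)((p->data[2] << 8) | p->data[3]);
    p->hdr.protocol = p->data[9];
    p->hdr.dst_addr = ((uint32_t)p->data[16] << 24) |
                      ((uint32_t)p->data[17] << 16) |
                      ((uint32_t)p->data[18] << 8)  |
                       (uint32_t)p->data[19];
    return true;
}

static bool filter(const Packet *p) {            /* toy drop rule */
    return p->hdr.protocol != 0;
}

static void classify(Packet *p) {
    /* Map the flow to a QoS class; real classifiers consult rule tables. */
    p->qos_class = (p->hdr.protocol == 6) ? 1 : 0;   /* e.g. TCP vs. rest */
}

static int forward(const Packet *p) {
    /* Next-hop determination; a real stage performs a routing lookup. */
    return (int)(p->hdr.dst_addr & 0x3u);            /* toy: 4 output ports */
}

/* Data-plane fast path: parse -> filter -> classify -> forward -> queue. */
int process_packet(Packet *p) {
    if (!parse(p) || !filter(p)) return -1;          /* drop */
    classify(p);
    return forward(p);    /* output port; queuing and scheduling follow */
}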

2.2.1 Internet Protocol Version 6

IP is designed for use in interconnected systems of packet-switched computer communication networks. It provides facilities to transmit blocks of data, identified by fixed-length addresses, from sources to destinations. The protocol is specifically limited in scope to provide the functions necessary to deliver a datagram from source to destination, and there are no mechanisms for other services commonly found in host-to-host protocols. The mainspring for the development of improved IP functionality were the results of the Address Lifetime Expectation working group launched by the Internet Engineering Task Force (IETF). According to its computations, the 32-bit address space of IPv4 would last no longer than 2005. Besides this acute address scarcity, IPv4 suffers from further constraints: it is insufficiently suited for new technologies like WebTV, Video-on-Demand or electronic commerce. To overcome these limitations, the designers of IPv6 introduced a set of improvements. The most important ones are:

• extended address space

• augmented routing functionality

• simplified IP-header information

• Quality-of-Service

• support for authentication and security

Larger Address Space

By far the most prominent modification of IP is its augmented address space. The extension of the address length from 32 to 128 bits results in an astronomic number of addresses. Since the exact number of addresses cannot be easily grasped by ordinary mortals, clever mathematical wizards created the comparison that it suffices to assign each grain of sand in the Sahara its own IP address. Thanks to the enlarged address space, workarounds like Network Address Translation (NAT) no longer have to be used. This allows full, unconstrained IP connectivity for today's IP-based machines as well as mobile devices like smart phones, tablets or smart watches; they all will benefit from full IP access through the General Packet Radio Service (GPRS) and the Universal Mobile Telecommunications System (UMTS).
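To put the enlarged address space into numbers (a back-of-the-envelope calculation, not taken from the original text), in LaTeX notation:

\[
  2^{32} \approx 4.3 \times 10^{9} \quad \text{(IPv4 addresses)}, \qquad
  2^{128} \approx 3.4 \times 10^{38} \quad \text{(IPv6 addresses)},
\]

so the address space grows by a factor of \(2^{96} \approx 7.9 \times 10^{28}\).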

Security

Security was another requirement for the successor of today's IP version. As a result, IPv6 protocol stacks are required to include IPSec [170, 255]. IPSec is probably the most transparent way to provide security to Internet traffic. In order to achieve the security objectives, IPSec provides dedicated services at the IP layer that enable a system to select security protocols, determine the algorithms to use, and put in place any required cryptographic keys. This set of services provides access control, connectionless integrity, data origin authentication, rejection of replayed packets (a form of partial sequence integrity), confidentiality (encryption) and limited traffic flow confidentiality. Because these services are implemented at the IP layer, they can be used by any higher-layer protocol, e.g. Transmission Control Protocol (TCP), User Datagram Protocol (UDP), VPN etc.

Except for application-level protocols like SSL or Secure Shell (SSH), all IP traffic between two nodes can be handled without adjusting any applications. The benefit of this is that all applications on a machine can benefit from encryption and authentication, and that policies can be set on a per-host (or even per-network) basis, not per application/service. One common use of IPSec implementations is to provide VPN services. A VPN is a virtual network, built on top of existing physical networks, which can provide a secure communication mechanism for data and IP information transmitted between networks. Since a VPN can be used over existing networks, it can facilitate the secure transfer of sensitive data across public networks. This is often less expensive than building dedicated private telecommunications lines between organizations or branch offices. VPNs can also provide flexible solutions, such as securing communications between remote telecommuters and the organization's servers, regardless of where the telecommuters are located. A VPN can even be established within a single network to protect particularly sensitive communications from other parties on the same network. An introduction to IPSec with a roadmap to the documentation can be found in [229]; the core protocol is described in [234].

2.2.2 Multimedia Networking

The past years have witnessed an explosive growth in the development and deployment of end-to-end networked applications, which transmit and receive audio and video content over the Internet. New multimedia networking applications, such as entertainment video, IP telephony, Internet radio, multimedia WWW sites, tele-conferencing, interactive games, virtual worlds or distance learning, seem to be announced daily. The service requirements of these applications differ significantly from those of traditional data-oriented applications such as web text/images, e-mail or the File Transfer Protocol (FTP). In particular, multimedia applications are sensitive to end-to-end delay, but tolerant to occasional loss of data. These fundamentally different service requirements suggest that a network architecture designed primarily for data communication may not be well suited for supporting multimedia applications. Therefore, new service architectures have been designed for transmitting multimedia data over the Internet.

Available Services

In this section, the most common end-to-end service classes provided by the contemporary Internet are introduced. Integrated Services [67] are well suited for reliable real-time communication and offer a connection-oriented distinction among flows. Differentiated Services [61] define a relative priority scheme that distinguishes a fixed number of service classes, which represent aggregates of flows. If the network does not support any differentiation of traffic, flows will be forwarded by best-effort. The interested reader may refer to [176] for more extensive explanations on available services and Internet technology in general.

Best-effort service does not guarantee or define any bounds, reliable service or QoS at all. All packets are handled in the order they arrive in the system, as long as there are sufficient resources available for the process. The system does its best to forward all incoming traffic; flows are not distinguished and therefore not protected against each other. Service is never denied, but potentially deteriorates with higher load for all participants. Despite its flaws, best-effort is still a suitable solution and can be found in a majority of contemporary Internet routers. All incoming packets are stored within First-In-First-Out (FIFO) queues and served in First-Come-First-Serve (FCFS) order. No admission or schedulability tests are performed. Congestion may be avoided and/or resolved by the queue manager or by overprovisioning of network resources.
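As a concrete sketch of the FCFS discipline just described, the following minimal drop-tail FIFO in C enqueues packets in arrival order and serves them first-come-first-serve; real queue managers add congestion handling (e.g. active queue management) and per-port instances, which are omitted here. The capacity and types are illustrative assumptions.

#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 64   /* illustrative capacity */

typedef struct {
    void  *slots[QUEUE_CAP];   /* packet pointers in arrival order */
    size_t head, tail, count;
} Fifo;

static bool fifo_enqueue(Fifo *q, void *pkt) {
    if (q->count == QUEUE_CAP) return false;     /* drop-tail on overflow */
    q->slots[q->tail] = pkt;
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->count++;
    return true;
}

static void *fifo_dequeue(Fifo *q) {
    if (q->count == 0) return NULL;
    void *pkt = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return pkt;                                  /* served in FCFS order */
}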

Integrated Services (IntServ) is a framework developed within the IETF to provide individualized QoS guarantees to distinct applications. It is characterized by resource reservation, i.e. each traffic flow has to set up a path through the network and reserve resources at each network node. That is, routers are expected to maintain per-flow state information. The Resource Reservation Protocol (RSVP) [68] is usually applied as a signaling protocol for this purpose. Traffic is policed at the edge of the IntServ network and may be reshaped to a defined profile within this network.

The Differentiated Services (DiffServ) working group [61] is developing an architecture for providing scalable and flexible service differentiation, that is, the ability to handle different "classes" of traffic flows in different ways within the Internet. The need for scalability arises from the fact that hundreds of thousands of simultaneous end-to-end traffic flows may be present in the backbone of the Internet. Service levels in DiffServ are based on relative priorities with different sensitivities to delay and loss, but without quantitative guarantees. DiffServ does not require signaling to take place for each traffic flow. Dynamic SLAs may be negotiated by using an enhanced version of RSVP. As opposed to IntServ/RSVP, the resource reservation is then initiated from the source and not from the destination node.

Access Networks

The topology of the Internet, i.e. the structure of the interconnection among the various pieces of the Internet, is loosely hierarchical. Roughly speaking, from bottom to top, the hierarchy consists of end systems connected to local ISPs through access networks. An access network may be a so-called Local Area Network (LAN) within a company or university, a dial-up telephone line with a modem, or a high-speed cable-based access network. Various ISPs have evolved in the Internet in recent years, offering services like video-on-demand, voice telephony, radio streams or news. Due to the large range of available services and their continuous evolution, ISP customers' access links to the Internet have to handle different protocols of different services. Access links nowadays are no longer restricted to only a single service. Therefore, customers typically establish several concurrent connections at the same time. Moreover, the customer's traffic has to be delivered by the underlying Internet

infrastructure independently of the ISPs' contents. Since contemporary access technologies like Digital Subscriber Lines (DSL) provide access to the Internet at several megabits per second, the distinction of QoS has become more and more important. For example, enterprises have several SLAs for telephony, Internet access and reliable interconnections of VPNs. Also, private customers consume movies, Internet content and telephony services from different ISPs. The SLA between the customer and the ISP of the access link is responsible for a reliable distribution of traffic from/to the different content ISPs. Customers may also have SLAs with content ISPs themselves. Alternatively, the access link and different contents/services may be supplied by the same ISP. In any case, customers use several virtual line-like traffic classes at the network access point. Traffic classes may further be divided into subclasses according to the type of application or its origin. The resulting network structure is sketched in Figure 2.2. This objective shift from the provision of raw bandwidth to service-oriented access networks is detailed in [209].

Figure 2.2 – Network structure with a single access link and multiple content networks of ISPs.

2.3 Architecture Survey

As the problems associated with continuously increasing network intelligence and traffic have become more acute and apparent, a number of companies have sought to develop programmable devices optimized for processing packets in the data plane. An NPU is an application-specific programmable processor optimized for packet processing through three types of architecture characteristics [154]:

• application-specific ISA

• hardware accelerators

• multiple programmable cores

Application-specific ISA: The first type of characteristic occurs in the basic ISA defining the hardware functions of a microprocessor. Most NPU vendors use modified versions of standard RISC ISAs, although they differ in terms of the degree of modification. The instructions added to these processors are designed to speed up operations that appear in time-critical portions of application code. Bit-manipulation instructions, specialized data-structure searching and addressing instructions, as well as Cyclic Redundancy Code (CRC) calculation instructions are examples of ISEs. Naturally, the greater the difference to the standard ISA, the more difficult it is for software developers to write packet processing code in industry-standard, high-level languages like C/C++.
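To make the benefit of such an ISE concrete, the following C sketch contrasts a plain bitwise CRC-32 with the same loop using a custom CRC instruction. The intrinsic name and the guarding macro are purely hypothetical placeholders for whatever a given NPU toolchain provides.

#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-32 (reflected form, polynomial 0xEDB88320) in plain C;
 * the inner 8-iteration loop is what a CRC custom instruction can
 * collapse into a single operation.                                 */
uint32_t crc32_sw(const uint8_t *buf, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int)(crc & 1u));
    }
    return ~crc;
}

#ifdef NPU_HAS_CRC_CI            /* hypothetical feature macro */
uint32_t crc32_ci(const uint8_t *buf, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = __npu_crc32_byte(crc, buf[i]);   /* hypothetical intrinsic */
    return ~crc;
}
#endif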

Hardware Accelerators: The second type of characteristic involves the addition of hardwired function blocks designed to accelerate the performance of those functions that are common across packet processing applications. For example, Intel's IXP2855 [163] features dedicated hardware support for symmetric encryption, as this is one of the most time-consuming tasks within packet processing. In some cases, where the use of these task-optimized function blocks is more prominent and function-specific, they increase performance and/or reduce costs by incorporating functions that would normally be external to the NPU. Such hardware accelerators can be regarded as another (very expensive) type of hardware instruction and therefore fit into the first type of architecture characteristic discussed earlier.

Multiple Programmable Cores: The third and most prominent NPU characteristic involves developing architectures that exploit parallelism and pipelining. Since different packet flows are independent, it is possible to route them to different on-chip processor cores. This allows for parallel operations across packets distributed between multiple processor cores. Concerning this issue, Intel's IXP architecture [157, 162, 163] can be referenced, too. All of these processors apply a set of dedicated cores in parallel organization to enable efficient packet processing in the data plane. An example of a pipelined organization of cores is EZChip's architecture [110]. Here, a set of cores is applied, each of which is dedicated to a special task of packet processing.

2.3.1 Design Attributes

[248] provides a very detailed examination of available network processing approaches and their characteristic hardware features. The following sections discuss some of the prevalent architectural issues in the design of high-performance NPUs. Section 2.3.2 analyzes the parallel processing exploited by NPUs and identifies three levels of parallelism which NPUs have taken advantage of in order to meet increasing line speed requirements: PE level, instruction level, and word/bit level. The second strategy, implementing common functions in hardware instead of having a slower implementation using a standard Arithmetic Logic Unit (ALU), is described in Section 2.3.3. Finally, the effect of different memory architectures is discussed in Section 2.3.4.

2.3.2 Parallel Processing

In most NPU applications, different tasks related to one flow or similar tasks related to multiple flows can be processed concurrently. By replicating functional units and exploiting the parallelism, NPUs can achieve higher performance.

Processing Element Level Parallelism

The design trend of employing multiple PEs to take advantage of data-flow parallelism has spawned two prevalent configurations:

Pipelined: Each processor is designed for a particular packet processing task. In the pipelined approach, inter-PE communication bears many similarities to data flow processing: once a PE has finished processing a packet, it forwards the packet to the next downstream element. Companies offering such architectures are, for example, EZChip [110], Vitesse [262] and Xelerated [273], all described in Section 2.4. Such architectures are generally easier to program, as communication among programs running on different PEs is restricted by the pipeline model. However, meeting the required timing constraints for smooth communication complicates matters.

Symmetric: Each PE is capable of performing similar functionality. In contrast to the pipelined organization, NPUs with symmetric PEs are usually programmed to perform similar tasks. Numerous coprocessors are typically added to accelerate computation-intensive parts of network processing. To further control access to the many shared resources, arbitration units are often required. Intel's IXP [157, 162, 163] and Cisco's [87] architectures are prominent representatives of this biggest group of architectures (cf. Section 2.4). While these architectures provide higher flexibility, appropriate programming is difficult.

Instruction Level Parallelism

While exploiting Instruction Level Parallelism (ILP) is a well-proven way to accelerate the processing speed of Digital Signal Processors (DSP), not many NPU designers take account of the multi-issue architecture design principle [248]. This is likely based on the observation that most networking applications do not feature enough ILP to warrant promising effects for the utilization of such architectures. However, some architects have chosen to implement processors that issue multiple instructions per PE. Within this design, two main strategies exist in order to determine the available parallelism: at compile time (e.g. Very Long Instruction Word (VLIW)) or at run time (e.g. superscalar). While superscalar architectures have been successful in exploiting parallelism in general-purpose architectures (e.g. Pentium), VLIW architectures have been effectively used in domains like signal processing, where compilers are able to extract enough parallelism. VLIW architectures are often preferred because they usually come along with lower power consumption. The success of VLIW architectures in networking will largely depend on their target applications. The LSI Routing Processor and Cisco's PXF [86] are the only architectures that belong to the class of VLIW architectures. In fact, Cognigine also features multiple-issue PEs (4-way), yet the architecture provides a runtime-configurable ISA [90].

2.3.3 Hardware Accelerators

The major concern in using special-purpose hardware for NPUs is the granularity of the implemented function. There is a trade-off between the applicability of the hardware and the speedup obtained. The type of special-purpose hardware used can be broadly divided into two categories: coprocessors and Special Functional Units (SFU).

Coprocessors

Coprocessors are basically employed for complex tasks, like the computation of checksums etc. Naturally, they feature an internal state and have direct access to memories and buses. Due to this complexity, coprocessors are typically shared among multiple PEs and are often accessed via a memory map, special hardware instructions or bus transactions. Most NPUs have integrated coprocessors for common networking tasks; many have more than one coprocessor. Operations that are well defined, cumbersome to execute within an ISA and furthermore prohibitively expensive to implement as an SFU are ideally suited for coprocessor implementation. The functions of coprocessors vary from algorithm-dependent operations to entire kernels of network processing. Mostly lookup and queue management functions are executed on integrated coprocessors, but checksum computation, pattern matching as well as encryption/authentication are popular candidates, too.

Special Functional Units

Many NPUs apply SFUs for operations like bit manipulation. Computing bit-level accesses based on an implementation for a standard ISA is very circuitous and error prone, yet such accesses are very easy to implement in hardware. For example, the protocol engines of Infineon’s Convergate [155, 156] architecture (PP32 Network Processor) offer bit-level access for all instructions of their ISAs, such that for every operation either complete register contents or certain bit areas of registers are used as input [212]. Another example is the Intel IXP 2850 [163], which provides SFUs for symmetric encryption and authentication of IPSec processing.
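As an illustration of how circuitous such bit-level accesses are in software, the following C sketch (the function names are made up) extracts and re-inserts a bit area of a 32-bit register purely with shifts and masks; a bit-level SFU like that of the PP32 collapses each of these sequences into a single instruction.

    #include <stdint.h>

    /* Extract the bit area [off, off+len) from a 32-bit register value.
     * Assumes 0 < len < 32. On a standard RISC ISA this compiles to a
     * shift/mask sequence of several instructions. */
    static inline uint32_t extract_bits(uint32_t word, unsigned off, unsigned len)
    {
        return (word >> off) & ((1u << len) - 1u);
    }

    /* Re-insert a field: a read-modify-write with several more ALU operations. */
    static inline uint32_t insert_bits(uint32_t word, unsigned off, unsigned len,
                                       uint32_t field)
    {
        uint32_t mask = ((1u << len) - 1u) << off;
        return (word & ~mask) | ((field << off) & mask);
    }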

2.3.4 Memory

The major memory-related design issues concerning NPUs are multithreading and task-specific memories. Since the stalls associated with memory accesses are well known to waste many valuable cycles, hiding memory access latency is key to efficiently using the hardware of a NPU. This is commonly approached through multithreading, by which a PE’s hardware can be efficiently multiplexed. Multithreading allows for continuous utilization of hardware units by switching the processing context in case of memory wait cycles. Without this dedicated hardware multithreading support, the necessary storing and reloading of the entire process state — for example, triggered by an Operating System (OS) — would dominate computation time. As a result, many NPUs (LSI, AMCC, Intel, and Vitesse) contain separate register banks for different hardware threads to support low-overhead context switches. Along with multithreading, memory management is also handled by the Intel IXP2800 [163]: Static Random Access Memory (SRAM) Last-In First-Out (LIFO) queues are used as free lists, thus obviating the need for a separate OS service routine. Finally, examples of task-specific memories are: Xelerated Packet Devices (c.f. Section 2.4) has an internal Content Addressable Memory (CAM) for classification, and on the Vitesse IQ2200 (c.f. Section 2.4), the Smart Buffer Module manages packets from the time they are processed until they are sent to an output port.
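A minimal sketch of the free-list idea is shown below; the buffer count is made up, and plain C array accesses stand in for the IXP2800’s atomic SRAM push/pop operations.

    #include <stddef.h>

    /* Illustrative LIFO free list for packet buffer management: freed
     * buffers are pushed, allocation pops the most recently freed one,
     * so no OS allocator or service routine is involved. */
    #define NUM_BUFS 1024

    static int free_stack[NUM_BUFS];  /* indices of free packet buffers */
    static int top = -1;              /* -1: free list is empty         */

    static void buf_free(int idx) { free_stack[++top] = idx; }
    static int  buf_alloc(void)   { return (top < 0) ? -1 : free_stack[top--]; }

    static void init_free_list(void)
    {
        for (int i = 0; i < NUM_BUFS; i++)
            buf_free(i);
    }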

2.4 Industry Products

The next paragraphs are based on information from [220, 247, 248], several white papers as well as information from the WWW. Table 2.1 lists the NPU architecture approaches that are introduced in this section. The table additionally includes features pertinent to this thesis. The majority of NPUs applies symmetric PEs with customized ISAs. These ISAs are mostly based on a typical RISC architecture that has been extended by dedicated CIs to support packet processing most efficiently. The table includes five columns designated as: Vendor, Type of Parallelism, Processing Elements, Hardware Accelerators and Field of Application. Each row of Table 2.1 briefly outlines the technological approach of a certain vendor while referring to the flagship of its product line. Where no information was available, it is marked by n.a.

Industrial NPUs

Vendor       Type of Parallelism          Processing Elements                 Hardware Accelerators     Field of Application
AMCC         symmetric PEs                three PEs / 24 HW threads           NISC ISA + coprocessors   metro/access
Broadcom     none                         single RISC                         CIs/SFUs                  metro core
Cavium       symmetric PEs                16 RISC PEs                         CIs/SFUs + coprocessors   metro/access
Cisco        n.a.                         40 PEs                              n.a.                      metro
EZchip       pipelined                    three PEs                           fully customized ISA      metro/access
Freescale    symmetric PEs                four RISC PEs                       CIs/SFUs                  metro
Intel        symmetric PEs                three RISC PEs / eight HW threads   CIs/SFUs + coprocessors   access
LSI          pipelined (partially VLIW)   three PEs                           fully customized ISA      metro/access
Mindspeed    symmetric PEs                1-2 RISC PEs                        CIs/SFUs                  metro/access
PMC-Sierra   superscalar                  n.a.                                CIs/SFUs                  access
Vitesse      symmetric PEs                four RISC PEs                       CIs/SFUs                  metro/access
Xelerated    pipelined                    200 PEs                             PISC ISA                  metro

Table 2.1 – Overview of industrial NPUs.

Applied Micro Circuits Corporation: Applied Micro Circuits Corporation (AMCC) is a global leader in network and embedded Power Architecture processing, optical transport and storage solutions. Its corresponding product portfolio for NPUs revolves around the nP architecture series [46] (Figure 2.3). This is best described as a scalable processor infrastructure whose underlying technology is designated as Network-optimized Instruction Set Computing (NISC).

The infrastructure allows for arbitrary combinations of nP cores [54] depending on the desired link speed. The nP cores implement the NISC ISA as well as packet buffers, coprocessors and interconnects, all managed by a host CPU. The latest product release of the AMCC NPUs is the nP3750 [18], which contains three embedded nP cores, each of which supports 24 hardware threads. Furthermore, the cores are surrounded by multiple on-chip coprocessors for sophisticated packet processing tasks like classification, metering, gathering statistics, context searching and so on. AMCC also further optimized the architecture by eliminating redundant general-purpose RISC instructions unnecessary for protocol processing.

Figure 2.3 – Block diagram of AMCC 3700 architecture [46].

Broadcom: Airforce BCM4704 and BCM4703 [73] (Figure 2.4) are dual-band wireless NPUs. At the heart of the processors, customized MIPS32 cores are integrated. Airforce BCM94704AGR is a wireless NPU, capable of wire-speed Ethernet routing/bridging, and VPN termination on an integrated IPSec security acceleration engine [72]. For advanced security, the BCM94704AGR integrates an on-chip IPSec acceleration engine that supports a broad range of industry-standard security features, such as symmetric-key encryption and authentication algorithms including the latest 256-bit AES, DES, 3DES, SHA-1, MD5, HMAC-SHA-1 and HMAC-MD5.

Cavium: The OCTEON product family [19, 20, 21] comprises several product lines, separated by performance, feature and cost requirements. With the OCTEON product family, Cavium offers multi-core processors based on customized 64-bit MIPS64 architectures (cnMIPS). Depending on the concrete architecture, up to 48 cnMIPS cores are applied. Additional hardware accelerators include security acceleration for AES, RSA, Elliptic Curve Cryptography (ECC) and SNOW 3G, TCP/IP packet processing, packet classification and QoS [20].

Figure 2.4 – Block diagram of BCM4703 architecture [73].

The OCTEON processors are optimized for use in high-end core and edge routers, metro Ethernet, enterprise security gateways and appliances [13]. For all product lines, Cavium offers a software development kit, which is based on C/C++ and provides application-specific libraries.

Cisco: Cisco Systems is an American multinational corporation headquartered in San Jose that designs, manufactures, and sells networking equipment. In 2008, Cisco introduced the Cisco QuantumFlow Processor [87]. It consolidates 40 customized multi-threaded 32-bit RISC cores called Packet Processing Engines (PPE) in a non-pipelined parallel array with a centralized shared memory. Each PPE can access hardware features like acceleration of network address and prefix lookups, hash lookups and, particularly, an off-chip cryptographic engine. Cisco claims that the unique software architecture of the Cisco QuantumFlow Processor will allow Cisco to evolve this NPU over time and use the same software across generations of hardware. The Cisco QuantumFlow Processor uses a software architecture based on a full ANSI-C development environment implemented in a true parallel processing environment [87].

EZchip Technologies: EZchip is a NPU-focused company with strong ties to IBM. It develops highly integrated products that eliminate the demand for multiple chips on switching cards. EZchip currently offers four main NPU families: the NP-2 [111], the NP-3 [112], the NP-4 [109] and the NP-5 [17]. Unfortunately, EZchip does not disclose product details of its NP-5 architecture. All of EZchip’s NPUs are based around its Task Optimized Processing Core Technology (TOPcore) [110], which integrates many high-speed processors in a single core and delivers high-performance task-based processing. There are four types of Task Optimized Processors (TOP) that can be used in a TOPcore:

TOPparse: handles header field extraction and packet classification

TOPsearch: handles routing table lookup

TOPresolve: handles buffer management and packet forwarding

TOPmodify: handles any changes needed to be performed on the packet

In addition to the TOPs’ on-chip memory, external memory and on-chip queuing mechanisms can also be used by the various TOPs. The parallel nature of each pipeline stage is purely for performance and scalability reasons. Since identical TOP types execute exactly the same code, the application programmer for the NP-2/3/4/5 does not need to be aware of the underlying parallel nature of the TOPcore. An application programmer only has to program the four different types of TOPs, and an integrated hardware scheduler dynamically assigns packets to available TOPs at runtime. The most important property of the NP-3 is the pipelined structure in which the packet processors are organized; this influences the design of the rest of the architecture. The memory available to the TOPs is restricted, such that it can only be accessed by the TOPs in a single stage of the pipeline. The overall organization of the TOPs as a pipeline, although their actual function is not fixed at manufacture time, leads to a slightly restrictive architecture for the application programmer. In addition to the aforementioned high-speed NPUs, EZchip offers a set of access NPUs, too: the NPA-1, NPA-2 and NPA-3 [108]. They feature the same programmable processing architecture, integrated traffic management and software compatibility with EZchip’s higher-speed NPUs.

Freescale Semiconductor: Motorola developed the C-Port [121] series of NPUs under the Freescale brand. Freescale Semiconductor Inc., formerly a Motorola company, became a publicly traded company in July 2004 after more than 50 years as part of Motorola Inc. At that time, the product series consisted of the C-3e [122], the C-5e [124] and the C-10e [139]. The architectures of this product line were centered around clusters of packet processors called Channel Processors (CP), which are at the heart of the prevailing products, too. These CPs process the majority of packets in-band. There are two Serial Data Processors (SDP) in each CP, one for ingress and one for egress processing. The CP ISA is a subset of the MIPS ISA [207]. The cores can therefore be programmed in C/C++ using standard publicly available tools (e.g. GNU GCC [120]). The SDPs must be programmed using specialized microcode; Freescale supplies microcode modules for a range of applications. The executive processor is generally used as a control processor for the CPs; it can be custom programmed using C/C++ or configured to run an OS. The CPs are extremely programmable and thus flexible processors. The current product line of NPUs offers a vast field of System-on-Chip (SoC) architectures [24]. The flagship (PowerQUICC III) applies two CPs for network processing within a so-called QUICC Engine [123]. In the future, PowerQUICC will cease development in favor of the software-compatible QorIQ platform featuring all PowerPC e500 based processors, from single-core through multi-core, up to 32 cores [23]. In 2011, Freescale announced the development of a 12-core processor quadrupling the performance of its NPU series [11].

Figure 2.5 – Block diagram of Intel IXP 2855 architecture [163].

Intel: The Internet Exchange Processor (IXP) family [164] has the reputation of a reference architecture for network processing and is probably the most prominent. The former Intel IXP2xxx series [162] applied a single embedded RISC processor — an Intel XScale — and eight multi-threaded 32-bit RISC processors called “microengines”. It provided a fast bus for communication between the microengines, MACtrl ports and memory. The microengines supported four hardware threads with zero-overhead context switch and allowed for programming either in assembly or in Intel C. In normal operation, the Intel IXP2400 used the microengines to support the data plane and the more general XScale to support the control plane. The last product from the 2xxx series was introduced in 2005, featuring higher speed and 16 microengines. The IXP2800 [158] was targeted for high-performance, scalable network edge and core applications up to OC-192 (10 Gbps). The IXP2855 was another variant of the IXP2800 and included specialized cryptography engines for DES, AES and SHA algorithms. Tight coupling of the cryptography elements with microengine elements allowed the developer to take full advantage of the parallelism and execute security processing as pipeline stages within the multi-threaded IXP2855 architecture [113, 163] (Figure 2.5). IXP4xx [160, 161, 165] is the prevailing product line of Intel concerning NPU architectures. With this new product line, Intel clearly steps towards the direction of access and multimedia protocols. Each processor combines a high-performance Intel XScale processor with two to three additional Network Processor Engines (NPE), running instruction streams in parallel, to achieve wire-speed packet processing performance. The NPEs complement the Intel XScale processor for many computationally intensive data plane operations. The extensive hardware capabilities of the NPEs are under the control of micro-coded algorithms that are accessed via APIs released as a software library with the processor. Customer applications configure and interact with the NPEs through the high-performance API layer running on the Intel XScale processor.

Figure 2.6 – Block diagram of LSI APP 650 architecture [196].

LSI Logic Corporation: LSI Logic is a semiconductor vendor from California that has taken over Agere Systems, a spinoff from Lucent Technologies. Agere focused on developing semiconductors for communication applications. Its original business was based on its PayloadPlus [35] series of NPUs. It is still a market leader in NPU-based products, in a broad range of communications and computer equipment. The original PayloadPlus NPU was based on a three-chip set comprising the Fast Pattern Processor (FPP) [34], the Routing Switch Processor (RSP) [33] and the Agere System Interface (ASI). The FPP acts as a classifier for incoming packets. The RSP uses the information from the FPP to determine which direction the packet takes through the switching fabric and ultimately which network it is to be sent to. The ASI handles exceptions from the FPP and RSP; it maintains state information and manages the interface to the host CPU over a Peripheral Component Interconnect (PCI) bus. More recent versions of the PayloadPlus combine these three components on a single die. There are several families of the Agere PayloadPlus: the APP100 [192], the APP300 [194], the APP530/APP550 [197] and the APP650 [196] (Figure 2.6), as well as a series of communication coprocessors [193, 195]. The APP100 series of NPUs are coprocessors, and the APP300 series are primarily targeted at access networks.

The LSI PayloadPlus has a unique architecture, which differs markedly from other NPUs, in that it is not based on a fundamentally RISC design, but rather on LSI’s patented Pattern Matching Optimization design. LSI claims that this design allows them to achieve the performance of ASIC-based designs, while maintaining the flexibility of RISC-based NPUs. They also claim that their architecture has less overhead than RISC, while requiring fewer clock cycles and processing more data per clock cycle. One of the main advantages of the PayloadPlus over other NPUs is programmability. The PayloadPlus contains an architecture consisting of pipelines, threading and contexts similar to other NPUs. However, unlike other NPUs, the PayloadPlus succeeds in hiding a lot of the complexity of this parallelism from the application programmer. It does this by offering a high-level programming language called Functional Programming Language and a scripting language called Agere Scripting Language. The PayloadPlus is clearly one of the most restrictive although powerful NPU architectures. One of the most interesting features of the FPP processor is its ability to reconfigure its operation at runtime. Its three-chip design results in a highly optimized, but also domain-specific and restrictive architecture; it inevitably leads to a distinctly proprietary design, with proprietary interconnects and per-processor memory. As of May 2010, LSI announced the development of the Axxia Communication Processor Family [198]. It is an asymmetric multicore architecture providing up to 20 Gbit/s throughput. The Axxia architecture uses a so-called Virtual Pipeline technology, which is a message-passing technique for intra-processor communication between the acceleration engines, CPU complex and SoC subsystem components. In addition, the Axxia family includes a comprehensive Eclipse-based software development environment.

Mindspeed: Mindspeed’s Traffic Stream Processor (TSP) architecture family [206] is based on programmable customized processors called Octave. These cores are basically 32-bit RISC engines whose ISA comprises customized hardware instructions for communications processing. The TSP family consists of four processors: the M27480 [202], M27481 [203], M27482 [204] and M27483 [205]. The former two are targeted at the access segment of the NPU market, while the latter two represent core NPUs.

PMC-Sierra: Sierra Semiconductor was originally founded in 1984 in San Jose, California, and went public in 1991. In October 2010, it took over Wintegra’s product line of NPUs. This product portfolio is focused exclusively on access processor design. Wintegra originally produced the WinPath family with two versions: WinPath1 [26] and WinPath2 [27]. WinPath2 extends the architectural concepts of WinPath1 through six data processing engines (WinGines) and hardware accelerators. PMC-Sierra extended the WinPath family by the WinPath2 Lite [28], the WinPath3 [29] and the WinPath3 SuperLite [30] architectures.

WinPath3 – as the current flagship – integrates control plane and enhanced data plane processing components. Control plane processing is based on two high-performance MIPS 34K multi-threaded processors running at 650 MHz, while up to 12 RISC cores are applied for data plane processing. In addition, data plane processing uses renewed hardware accelerators (compared to WinPath2), which have been added to off-load common processing tasks.

Figure 2.7 – Block diagram of Vitesse IQ2200 architecture [262].

Vitesse: Vitesse is a worldwide provider of Integrated Circuits (ICs) for a wide range of products, optical modules, communications ICs and NPUs. It is an established company with a 20-year history of designing, developing and marketing a diverse range of semiconductor solutions. Its flagship product is the IQ family of NPUs which, when paired with its framers, PHY and MACtrl chips, offers a complete NPU solution. The IQ2200 [262, 263] (Figure 2.7) is the most notable of the IQ range of NPUs; it is an OC-48 (2.5 Gbps) NPU capable of packet processing between OSI layers four and seven [99]. It consists of four 32-bit RISC CPU PPEs called FACETs, which carry out the majority of the packet processing. They have significant additional instructions to facilitate data movement between themselves and other IQ2200 elements. The FACET CPUs also feature SFUs for a range of functions useful for packet processing, and for developing applications targeting packet processing. The packet processors are extremely flexible and unrestricted RISC-based processors, which can be assigned to any task. In this way they are similar to AMCC’s nP packet processors.

Xelerated: Xelerated Packet Devices is a leading fabless semiconductor company, which has successfully combined the efficiency of ASIC-based designs with the programmability of NPUs. Rather than attempting to start with a flexible RISC-based design and optimizing it for packet processing, they have started with an ASIC-based design and successfully introduced increased programmability to the circuit. This approach results in a high-efficiency and high-performance design with programmability approaching that of traditional NPUs. Xelerated Packet Devices designs can still be classified as NPUs, because the resulting products more closely resemble traditional NPUs than traditional ASIC-based products. The leading designs and the most successful commercial NPU-based products from Xelerated Packet Devices are the X10q [274] and the X11 [275]. The internal architecture of the X10q/X11 NPU shows an ASIC-like core surrounded by programmable coprocessors and hardware assist units. More specifically, the programmable pipeline is made up of a linear array of 200 processors [277], forming one single packet processing pipeline through which every packet flows. These packet processors, called Packet Instruction Set Computer (PISC), are fundamentally data flow processors that process packets as they flow through them. There are three models in the X10q range, the X10q-e, X10q-m and X10q-w, which are optimized for enterprise Ethernet and OC-48 (2.5 Gbps), advanced Ethernet applications, and SONET applications, respectively. The X11 family comprises four models: X11-s200, X11-d200, X11-d240 and X11-d240t. In addition to the PISC processors, the X10q/X11 also contain coprocessors including a hash engine, a meter engine and others. In 2010, Xelerated announced the release of the HX family of NPUs [10]. The HX processors are based on the same architecture as the X11 family, yet operate at 100 Gbit/s [276].

2.5 Programming Network Processors

The majority of contemporary NPU architectures belongs to the class of heterogeneous MP-SoCs featuring a set of dedicated Intellectual Property (InP) cores tailored to suit the software requirements of modern protocols and network applications. The characteristics of MP-SoC architectures render software development a difficult task. Typically, applications are specified by developers as sequential programs using high-level languages such as C/C++ or Matlab. Unfortunately, the sequential nature of such specifications does not reveal the available concurrency of the underlying application, because only a single thread of control is considered. Also, memory is global and all data resides in the same memory source. As a consequence, system designers need application specifications – resembling the composition of concurrent tasks – with a well-defined mechanism for inter-task communication and synchronization, such that the programming of multiprocessor systems can be accomplished in a systematic and automated way. Nowadays, to program a NPU system, designers have to partition an application into concurrent tasks starting from a sequential program (delivered by application developers) as a reference specification, assign these tasks to different processors and finally write specific program code for each programmable core. Partitioning of an application into tasks and their architecture-specific implementation consumes a lot of time and effort, because the system designers have to study the application in order to reveal possibly available task- and/or data-level parallelism. They need a deep understanding of each core’s architecture and ISA to produce effective code for the assigned task(s). Moreover, an explicit synchronization of data communication between the application tasks is needed, as illustrated by the sketch below. This information is not available in the sequential program and has to be specified by the designers explicitly. Therefore, an approach and tool support is needed for application partitioning and code generation (i.e. architecture-specific assembly code for each processor of an MP-SoC) to allow effective, systematic and automated programming of MP-SoCs.
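As a minimal illustration of what such manual partitioning entails, the following sketch splits a sequential classify-then-forward loop into two POSIX threads that stand in for two PEs. All names are made up, and a real NPU mapping would use the platform’s hardware FIFOs and threads rather than pthreads.

    #include <pthread.h>
    #include <stdio.h>

    /* Illustrative only: the one-slot buffer and the explicit
     * mutex/condition-variable synchronization are exactly the kind of
     * information that is absent from the original sequential program. */
    static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
    static int slot, full = 0, done = 0;

    static void *classify(void *arg)          /* task mapped to PE 0 */
    {
        (void)arg;
        for (int pkt = 0; pkt < 8; pkt++) {
            pthread_mutex_lock(&m);
            while (full) pthread_cond_wait(&cv, &m);
            slot = pkt * 2;                   /* stand-in for classification work */
            full = 1;
            pthread_cond_signal(&cv);
            pthread_mutex_unlock(&m);
        }
        pthread_mutex_lock(&m);
        done = 1;                             /* signal end of the packet stream */
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&m);
        return NULL;
    }

    static void *forward(void *arg)           /* task mapped to PE 1 */
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&m);
            while (!full && !done) pthread_cond_wait(&cv, &m);
            if (!full) { pthread_mutex_unlock(&m); return NULL; }
            printf("forwarding packet %d\n", slot);
            full = 0;
            pthread_cond_signal(&cv);
            pthread_mutex_unlock(&m);
        }
    }

    int main(void)
    {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, classify, NULL);
        pthread_create(&t1, NULL, forward, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        return 0;
    }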

2.5.1 Overview

Due to the performance sensitivity of network applications, it is an undeniable advantage to use a low-level assembly language approach for implementation, as this leaves complete control of application execution to the programmer. Despite the control advantages of low-level code, high-level programming models are strongly required to support short time–to–market through enhanced architecture explorations and long time–in–market through comfortable programming of NPUs. As presented in [249], there are many different solutions (i.e. libraries of application-specific functionality, programming languages, runtime systems) to the programming problem of NPUs, which is best described as a gap between architecture details and the high-level abstraction of a language.

Library of application-specific functionality approaches export a collection of manually designed blocks to the application developer, who stitches these together to implement the application. The advantage of this methodology is a better mapping to the underlying hardware, as the components are manually implemented and consequently optimized towards certain PEs of the underlying architecture. In addition, these components implement an abstraction which is natural for a software developer of network protocols, as components are often similar to application model primitives [249]. Obviously, the disadvantage of such an approach lies in the restricted expressiveness of such Domain Specific Languages (DSL). Only those applications whose functionality can be expressed by an appropriate collaboration of already existing language components can be comfortably implemented. Otherwise, manual implementation of new components is carried out. Furthermore, mapping components to certain PEs requires software developers to know the architecture in detail. If new components have to be implemented from scratch, the microarchitectural constraints are again of high importance to guarantee (nearly) optimal performance results for the execution of these components. This implies low-level assembler programming of the components, which has already been identified as time-consuming and error prone, and consequently as prohibitively difficult.

Programming language approaches utilize a high-level programming language like C/C++ that can be compiled to the target architecture. Since it has long been established that RISC processors benefit most from compiler optimizations [150], modern optimizing compilers are mostly targeted to RISC architectures. With this methodology, a compiler needs to be written only once for an underlying target architecture, and all compiler optimizations are available for every implementation of arbitrary applications. The principal difficulty with this solution is the need to compile for heterogeneous architectures comprising multiple PEs with customized ISAs, special-purpose hardware, numerous task-specific memories and various buses. In addition, the programming abstraction required to effectively create a compiler for such an architecture would likely force the programming language to include many architectural concepts that would be unnatural for the application programmer. Examples of this alternative include the numerous projects that have altered the C programming language by exposing architectural features [159]. Another class of approaches uses refinement from formal Models of Computation (MoC) to implement applications. MoCs define formal semantics for communication and concurrency. Because they require applications to be implemented in a MoC, these approaches are able to prove properties of the application (such as the maximum queue size required or a static schedule that satisfies timing constraints). Like DSLs, MoCs suffer from restricted expressiveness.

Runtime systems represent another solution to the implementation gap between network architectures and programming models. They originate largely from the problem of investigating the deployment of multiple coexisting execution environments through OS support and an active networking encapsulation protocol. Active networks [41, 179, 254] allow an individual user, or groups of users, to inject customized programs into the nodes of the network. “Active” architectures massively increase the complexity and customization of computation performed within the network, e.g. that which is interposed between the communicating end points. Runtime systems introduce dynamic operations (e.g. thread scheduling) that enable additional freedom for the development of applications. This can also be used to provide software developers with an abstraction of the underlying target architecture (a view of infinite resources). While runtime systems are necessary for general-purpose computing, for many data-plane applications they represent a significant processing overhead, unacceptable given the high performance sensitivity of network protocols/applications. Nevertheless, motivated by earlier work on extensible OSs [57, 105, 210], researchers have defined extensible router architectures that support runtime customization of router functionality. These architectures accommodate changes to the router’s control plane, as required by programmable networks [59, 179], and the router’s data plane, as allowed by active networks [41, 267]. Consequently, some ASIP architectures have included hardware constructs to subsume simple runtime system tasks like thread scheduling (on the Intel IXP1200) and inter-process communication (ring buffers on the Intel IXP2800).

2.5.2 Programming Languages/Compilers

From a programmer’s point of view, one of the most tedious tasks in packet processing is header field extraction. Such fields are usually represented as packets of consecutive bits within a word. Depending on the sizes and offsets, a field might be aligned at the start or the end of a word, it might reside inside a word or it could even straddle a word’s boundary (see the sketch below). Extracting a field therefore in general requires a different sequence of high-level language commands for each case. Consequently, most NPUs provide hardware instructions for bit packet addressing in order to process the packet stream data more effectively. [266] targets this area; its major contribution is the development of a bit-true DFA, which led to several further publications. The work is based on the Infineon PP32 NPU, described in Chapter 5, which has also been the target architecture that needed to handle bit-level access in previous approaches [264, 265] via CKFs. A different approach to bit packet addressing is presented in [142]. This publication introduces a program representation that enables reasoning about bit packet entities in registers. Additionally, it introduces a global analysis algorithm for constructing this program representation. It examines bit operations in expressions and establishes explicit relationships among different bit packets. This bit packet analysis has been applied to a compiler for the ARM architecture explained in [187]. Finally, a speculative register allocation algorithm for bit packets was developed, which takes place after the regular register allocation pass [188]. The algorithm uses profiling information and the described bit-level analysis to exploit hardware instructions handling bit packets. Effective utilization of registers is in general a very interesting research area for NPUs, since it usually comes along with reduced memory accesses. NPUs often contain different register files to allow for effective context switches between different tasks. However, such an architecture design requires sophisticated register allocation passes in the compiler. Concerning the Intel IXP1200 architecture [157], register bank access and resulting conflicts have been examined in [282, 283, 285]. The Intel IXP NPU has a dyadic register file comprising two banks, of which only one bank can be connected to the ALU at any time. [282] presents three different approaches to register allocation in combination with bank assignment. For this, a new structure called the register conflict graph is introduced, which captures the dual-bank constraints. While this approach targets intra-thread register allocation, [283] goes a step further by targeting inter-thread register allocation. The designed compiler aims to distribute available registers to different threads according to their needs. [284] proposes a complete compiler solution to automatically insert the explicit context switch (ctx) instructions provided on the NPU, such that the execution of threads is better manipulated at runtime to meet real-time constraints. Other approaches concentrated on the Intel IXP architectures [157, 162, 163] as a target, too. [131] presents a compiler that works with NOVA, a new programming language, which can be compiled to a FORTRAN-like runtime model. Yet, in order to allow for small code size compilation, several language features like stacks, recursive types etc. have been left out, thus restricting the expressiveness of the language.
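To make the bit packet problem concrete, the following illustrative C sketch extracts a header field that may straddle a 32-bit word boundary; the function name and the big-endian bit numbering (bit 0 is the MSB of word 0) are assumptions for this example only.

    #include <stdint.h>

    /* Extract 'len' bits (1..32) starting at absolute bit offset 'pos' from
     * a packet stored as 32-bit words. Depending on pos and len, the field
     * lies inside one word or straddles a word boundary; each case needs a
     * different shift/mask sequence, which is exactly what dedicated
     * bit-packet instructions collapse into a single operation. */
    static uint32_t get_field(const uint32_t *pkt, unsigned pos, unsigned len)
    {
        unsigned word = pos / 32, bit = pos % 32;
        uint32_t mask = (len == 32) ? 0xFFFFFFFFu : ((1u << len) - 1u);

        if (bit + len <= 32) {                       /* inside one word  */
            return (pkt[word] >> (32 - bit - len)) & mask;
        } else {                                     /* straddles words  */
            unsigned hi = 32 - bit;                  /* bits in 1st word */
            unsigned lo = len - hi;                  /* bits in 2nd word */
            return ((pkt[word] << lo) | (pkt[word + 1] >> (32 - lo))) & mask;
        }
    }

    /* E.g. an 8-bit field at bit offset 28 straddles words 0 and 1, while
     * the 13-bit IPv4 fragment offset at bit 51 lies inside word 1. */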
Based on the LSI PayloadPlus architecture (see Section 2.4), compiler-controlled data partitioning for the clustered VLIW engines of the RSP is presented in [78]. The engines of the RSP are programmed via C-NP, a version of C specifically targeted at the LSI NPUs. C-NP is restricted to assignment statements and if/switch-statements, and is augmented to allow data placement of variables and direct byte addressing of the register file. However, the programmer is responsible for the proper data layout of application parameters. The major contribution of [78] is the removal of the need for manual partitioning of code for the engines of the RSP by a greedy code-generation approach. While the preceding publications are mostly target-specific, a number of publications have looked into retargetable compiler support for NPUs. A hybrid approach of target-specific and retargetable compilation is described in [191]. The basis of this work is the Cognigine architecture [90]. For this architecture, the Cognigine C compiler has been implemented by retargeting the SGI Pro64 compiler, originally designed for the MIPS R10000 processor. Since this approach targets the VISC architecture, it is limited to instructions with at most two results and four operands. [172] reports on a compiler for the commercial NPU Paion PPII based on the Zephyr compiler infrastructure [45]. Several architectural challenges posed by the architecture are presented, together with compiler technologies exploiting these features, such as a virtual data path, compiler intrinsics and interprocedural register allocation. Similar to this, [230] also presents an evaluation of different compiler optimizations for NPUs. Here, the focus is on multiple-issue architectures that exploit static scheduling like VLIW processors. Shangri-La [82] is a compiler for the DSL Baker [136], which incorporates various optimizations for NPUs, specifically the Intel IXP series processors. It features process transformation, stack reduction, memory access consolidation, and other techniques. Baker is a platform-independent language designed for the development of network applications. It bears many similarities to Click [249], particularly in regard to its modeling of communication channels.

2.5.3 MP-SoC Programming

In the literature, a variety of different tool flows have been developed to program MP-SoCs. One category are the classical compiler-based approaches, e.g., MAPS [183] and Compaan [251]. A conventional sequential language, for instance C, C++, or Matlab, is used as the initial application specification from which the compiler automatically extracts parallelism. To ease the job of compilers, explicit Application Programming Interfaces (API), for instance the Message Passing Interface (MPI) [16], Open Multi-Processing (OpenMP) [22], the Task Transaction Level (TTL) interface [258] or the Portable Operating System Interface (POSIX) [14], are often used to identify data-independent blocks of code within an application. The major problem of compiler-based approaches is that the level of abstraction of the underlying hardware exposed to application programmers is often too low, thereby lacking a uniform and scalable manner to specify concurrency of computation and communication of an application. The consequence is that system-level verification and software synthesis of a target system are often difficult. An alternative are MoC-based approaches. By restricting an application to a certain MoC, the semantics of computation and concurrency can be mathematically defined. As a result, quantitative analysis pertaining to, for instance, schedulability tests and worst-case behavior can be tackled in a reasonable manner. Furthermore, software synthesis can be applied, i.e., automatically generating implementations whose behavior is consistent with the abstract model behavior. Despite this, it is quite unlikely that a general revolutionary change of programming paradigm will occur in the near future, and as such the C programming language will continue to dominate among embedded software developers (with parallel extensions, e.g., pthreads [14], OpenMP [22]) [79]. However, DSLs have been gaining momentum, especially if they build on top of well-known languages and if there is a clear road to migrate legacy code [79]. Amongst other MoCs, the Kahn Process Network (KPN) [167] and its derivatives are widely used because of their simple communication and synchronization mechanisms as well as coarse-grain parallelism. Besides this, many other tool flows, for instance Ptolemy [25], Metropolis [55], Koski [168], and Artemis/SESAME/DAEDALUS [214, 221], have been described in the literature. The most well-known subclass of KPN is Synchronous Data Flow (SDF) [180, 181], which enables static analysis of the specified application at compile time. Tool flows based on SDF are, for instance, SHIM [104], PeaCE [HKL+07], and StreamIt [TKA02]. Furthermore, Ptolemy [25] supports heterogeneous modeling and Metropolis [BWH+03] defines a meta-model that can be refined into a specific MoC.

Distributed Operation Layer (DOL) is a design flow for Model-Driven Development (MDD) [1] of multiprocessor streaming applications, which has been developed at ETH Zurich in the context of a European research project called Scalable Software/Hardware Architecture Platform for Embedded Systems (SHAPES) [219]. The programming model of DOL can be described as a concrete instance of a dataflow process network MoC [180], which is a subclass of KPN [167]. Basically, a dataflow process network expresses an application as a set of parallel autonomous processes – called actors – that communicate exclusively via point-to-point FIFO queues. The actors’ task is to map a set of input streams to a set of output streams. In this context, the developers describe computation by implementing individual, sequential actors that manipulate data streams, and coordination by the connection of actors using FIFO queues. Contrary to the original model introduced in [180], the process network model used in DOL applies finite-capacity channels for the FIFO queues, which raises the question of deadlock prevention. However, according to the experience of the DOL designers, such deadlocks (caused by finite communication queues) are quite unlikely [146].

To allow for reusing existing legacy code, actors are implemented in C++. For communication, actors can access ports that serve as an interface to the FIFO queues. The Extensible Markup Language (XML) is used for describing the topology of a dataflow process network, i.e. the instantiation of actors and their connection by the FIFO queues.
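The following C++ sketch merely mimics the flavor of such an actor; the Port interface and all names are hypothetical stand-ins for DOL’s actual port API, and the FIFO topology would be described separately in the XML file.

    #include <cstdint>

    // Stand-in for a FIFO port: read blocks on an empty queue,
    // write blocks on a full one (finite-capacity channels).
    template <typename T> struct Port {
        virtual T    read()            = 0;
        virtual void write(const T &v) = 0;
        virtual ~Port() = default;
    };

    // A sequential actor: checksums fixed-size packets from its input
    // stream and forwards only the valid ones. Only computation lives
    // here; coordination (which queue connects to which port) does not.
    struct ChecksumActor {
        Port<uint32_t> *in, *out;

        void fire() {                       // one activation of the actor
            uint32_t words[4], sum = 0;
            for (auto &w : words) { w = in->read(); sum += w; }
            if (sum == 0)                   // stand-in validity check
                for (auto w : words) out->write(w);
        }
    };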

MP-SoC Application Programming Studio (MAPS) aims to reduce the gap between the growing software requirements of contemporary embedded systems and current software productivity. [183] claims to provide solutions to several problems a designer is faced with, like partitioning of legacy code and parallel programming through KPNs. The MAPS flow receives applications written in C syntax and processes these within several phases. Initially, applications are parsed into an IR (analysis phase), while the application code is instrumented to obtain runtime traces. These traces are used to provide dynamic information, which is annotated to the control and dataflow edges of the IR to steer partitioning [81]. A subsequent semi-automatic partitioning phase allows for the identification of independent code blocks. Each application is further analyzed during a mapping and scheduling phase producing scheduling configurations for each of them. A multi-application analysis phase utilizes these configurations to analyze different application scenarios in accordance with a so-called Application Concurrency Graph (ACG). Once the user is satisfied with a configuration for a given multi-application scenario, he can proceed to the code generation phase to evaluate the results.

HOPES is a parallel programming framework based on a novel programming model, called Common Intermediate Code (CIC) [177]. In a CIC, the functionality and parallelism of application tasks are specified independently of the target architecture and the design constraints. Information on the target architecture and the design constraints is separately described in an XML file, called Architecture Information File. Based on this information, the programmer maps the tasks to the processing components, manually or automatically. Then, the CIC translator automatically translates the task codes in the CIC model into the final parallel code following the partitioning decision.

DAEDALUS is a tool-suite and methodology [213, 215] targeting the automated design, programming, and implementation of MP-SoCs. Starting from a functional specification of an application (written in C), several specifications are derived (the latter two are obtained from Design Space Exploration (DSE) through the SESAME tool [221]): • a MoC-based application specification capturing the application’s parallel form in terms of a KPN. This specification can either be created manually or generated by a tool called PNGen [260].

• a platform specification describing the topology of an appropriate MP-SoC

• a mapping specification defining a mapping between code blocks (defined in the application specification) and platform elements (defined in the platform specification)

These specifications serve as input for the ESPAM tool [214], which delivers – amongst others – a hardware description of an MP-SoC (synthesizable VHDL code) and appropriate high-level language code to realize the application on the processor cores of the MP-SoC.

2.6 Concluding Remarks

As Moore’s Law [208] (published in 1965 and refined in 1975) predicted, the number of circuit components fabricated on a single silicon chip doubles every two years. Since then, the spectacular rate of progress in semiconductor technology has made dramatic advances possible and has led to the emergence of the embedded (electronic) SoC concept, which in turn has significantly altered almost all areas of human endeavor. In particular, embedded systems (like NPUs) have become one of the major forces for product innovations of modern consumer and industrial devices, from automobiles to satellites, from washing machines to high-definition TVs, from cellular phones to complete base stations. Through the years, the increasingly demanding complexity of applications (like network applications/protocols) has significantly expanded the scope and the complexity of these SoCs, i.e. with every new generation of technology, more resources are available to implement more and more sophisticated and diverse system features. Currently, for modern NPU systems in the realm of high-throughput multimedia, the complexity of network applications has reached a point where the performance requirements can no longer be supported by NPUs based on a single processing component. Thus, the emerging embedded SoC platforms are increasingly becoming MP-SoCs, encompassing a variety of hardware and software components. The continuously increasing requirements for efficiency and performance imply that in such a MP-SoC different application tasks have to be executed by different types of PEs, optimized for the execution of specific networking tasks. Concerning task execution, it is common knowledge that higher performance is achieved by a dedicated (customized and optimized) programmable core (i.e. ASIP), because it works more efficiently than a general-purpose processor. Evidently, the highest efficiency and performance, while retaining high-level programmability, are achieved by MP-SoCs consisting of only dedicated PEs, featuring SFUs, either implemented as coprocessors or hardware instructions, tailored to the requirements of targeted network applications. Therefore, most contemporary NPUs are heterogeneous in nature, i.e. a constellation of programmable ASIPs and dedicated ASICs delivering high flexibility and high performance at the same time. The long design cycles and the increasing time-to-market pressure impose clear requirements for systematic and, moreover, automated design methodologies for building heterogeneous NPUs, so that the time and effort to design a system containing both hardware and software remain acceptable. Although embedded systems have been designed for decades, the systematic design of MP-SoC systems with well-defined methodologies, automation tools and programming models has gained attention primarily in the last years [186].

In recent years, a lot of attention has been given to the building of MP-SoCs. However, insufficient attention has been given to the development of concepts, methodologies, and tools for efficient programming of such systems, so that programming still remains a major difficulty and challenge [200]. Today, system designers experience difficulties in programming MP-SoCs, because the way an application is specified by an application developer, typically as a sequential program, does not match the way multiprocessor systems operate, i.e. multiprocessor systems contain PEs that run in parallel. Moreover, sophisticated compiler support is critically needed for the heterogeneous core architectures applied on a MP-SoC to allow for comfortable implementation of application tasks while considering the varieties of special-purpose hardware features.

Chapter 3

Compilation and Instruction Set Extensions

Compilation and ISE are closely related topics, which can be regarded in combination. Therefore, this chapter gives a brief overview of both research areas to provide the background knowledge necessary to understand the contribution of this thesis. The chapter begins by describing the common structures and concepts of compilers in Section 3.1. Subsequently, the commonly accepted concepts of automatic ISE are explained in Section 3.2. Finally, the presented methods are summarized and set into relation in Section 3.3.

3.1 Compilation

Compilers are programs that transfer abstract system/algorithm descriptions into equivalent machine/processor-specific descriptions. The most prominent compiler-related languages for high-level specification of functional system behavior are C and C++. Symbolic processor-specific programming languages, however, are usually summarized under the term assembly. As programs written in assembly typically consist of symbolic calls to functional hardware units of a processor, compilers fill the gap between high-level and machine-specific programming. The process of compilation hides the complexity of analyzing the syntax and semantics of a certain high-level program as well as the complexity of transferring it into equivalent and efficient assembly code. In order to manage this, compilers follow a common structure (Figure 3.1), which is widely accepted as the right track to solve the problem [39, 44, 211, 228]. The relevant components of a compiler are the Frontend (Section 3.1.1), the Intermediate Representation (Section 3.1.2) and the Backend (Section 3.1.3). In the remaining sections, this common structure of compilers is explained, together with the techniques applied in each single phase.


Figure 3.1 – Typical compiler structure.

3.1.1 Frontend

The frontend of a compiler — which finally produces an Intermediate Representation (IR) as its result — is responsible for the analysis of the input program code with respect to syntax and semantics. It comprises three main phases, called lexical, syntax and semantic analysis, that perform the analysis and generate the IR.

Lexical Analysis (Scanner) is the process of reading the language’s tokens. Herein, characters of the underlying language’s alphabet are scanned, and keywords, names of identifiers etc. are recognized via regular expressions. Given a regular expression for every token of the programming language, a deterministic Finite State Machine (FSM) can be constructed, such that all keywords are recognized and the corresponding tokens are forwarded to the syntax analysis/parser. Furthermore, for prominent representatives of programming languages like C/C++, off-the-shelf scanner generators like GNU Flex [135] are available, which take a set of regular expressions as input and produce program code specifying a FSM to scan the language.
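As an illustration, a minimal Flex specification might look as follows; the token codes are made up, whereas in a real frontend they would come from the parser’s generated header.

    %{
    /* Illustrative token codes; normally provided by the parser. */
    enum { T_IF = 256, T_IDENT, T_NUMBER };
    %}
    %option noyywrap

    DIGIT   [0-9]
    LETTER  [A-Za-z_]

    %%
    "if"                          { return T_IF;     }
    {LETTER}({LETTER}|{DIGIT})*   { return T_IDENT;  }
    {DIGIT}+                      { return T_NUMBER; }
    [ \t\n]+                      { /* skip whitespace */ }
    .                             { return yytext[0]; }
    %%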

Syntax Analysis (Parser) is the process of identifying valid sequences of tokens with respect to the language’s grammar, which is — due to parsing issues — typically a context-free grammar. This phase typically builds a parse tree, which replaces the linear sequence of tokens with a tree structure built in accordance with the production rules of the grammar, which define the language’s syntax. The parse tree is often analyzed, augmented, and transformed by later phases in the compiler. In a parse tree, interior nodes represent nonterminals, whereas the leaf nodes are terminals. The edges of a parse tree represent derivations from nonterminals with respect to production rules of the underlying grammar.

Again, given a context-free grammar, parser generators like GNU Bison [134] are available for C/C++ that generate high-level program code tailored to the implementation of a parser.
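A minimal Bison grammar for arithmetic expressions might look as follows; the semantic actions are omitted, whereas a real frontend would build parse tree nodes in them.

    %{
    /* Declarations expected by the generated parser. */
    int yylex(void);
    void yyerror(const char *msg);
    %}
    %token NUMBER
    %left '+' '-'
    %left '*' '/'

    %%
    expr : expr '+' expr
         | expr '-' expr
         | expr '*' expr
         | expr '/' expr
         | '(' expr ')'
         | NUMBER
         ;
    %%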

Semantic Analysis is the process of evaluating semantic issues like type checking and definite assignment (e.g. requiring all local variables to be initialized before use), which have not been taken into account by the parser. This can be achieved by enhancing the parse tree with attributes [173]. At each nonterminal and terminal x, a set of attributes A(x) is annotated, which enables the encoding of semantic information such as the symbol’s type or scope. Attributes are generally categorized into two groups, inherited and synthesized attributes, depending on the information flow used to compute the attributes’ content with respect to the parse tree. Similar to the generators Flex and Bison for the preceding compiler phases, generators like OX [58] exist that allow for the automatic creation of C code implementing integrated semantic analyses based on an extended context-free grammar.

3.1.2 Intermediate Representation

The IR is the final result of a compiler’s frontend and reflects the compiler’s view of the program code. It separates the frontend from the backend and thereby allows arbitrary combinations of front- and backends in case of a common IR. The field of application of an IR is by no means restricted to a compiler. In fact, graph-based IRs play an important role for ISE as well, since application analysis is typically performed on an IR. Hence, this section is not only relevant for compilation, but for ISE as well, and therefore relevant components and algorithms pertinent to both compilation and ISE are explained in this section. Several types of different IRs have evolved in the past, such that there is no single canonical one. The most prominent and important IR format is based upon tree and/or graph structures (Figure 3.2).

3.1. Definition (Directed Graph). A directed graph is a graph $G = (V, E)$ consisting of a finite set of vertices $V$ and ordered pairs of vertices $E \subseteq V \times V$ called edges, such that

$$\forall (v_i, v_j), (u_i, u_j) \in E : (v_i, v_j) = (u_i, u_j) \Leftrightarrow v_i = u_i \wedge v_j = u_j.$$

Directed Acyclic Graphs (DAG) and trees are very similar data structures. Both are directed graphs in the first place, containing no directed cycles. However, trees additionally satisfy the condition that every vertex $v \in V$ has only one successor: $\forall (v_i, v_j) \in E : \neg\exists (v_i, v_k) \in E \wedge v_k \neq v_j$. Typically, an IR represents program code in terms of expressions and statements. The statements denote trees of expressions (Figure 3.2 (b)) and are in turn organized in basic blocks. Basic blocks are identified through IR nodes modifying the control flow of the program, like goto.

3.2. Definition (Basic Block). A basic block $B = \langle s_1, \dots, s_n \rangle$ is a maximal sequence of IR statements, for which the following conditions are true: $B$ can only be entered at statement $s_1$ and left at $s_n$. Statement $s_1$ is called the leader of the basic block. It can either be a function entry point, a jump destination, or a statement that follows immediately after a jump or a return.

Figure 3.2 – Examples of Data Flow Graph (a) and Data Flow Trees (b).

Basic blocks are an important structure: if the first statement of a basic block is executed, all following statements are consequently executed as well. This observation is the basis for code-selection and ISE alike, since these techniques attempt to find mappings of IR nodes to hardware instructions. In case multiple IR nodes are mapped to the same hardware instruction, it is mandatory that the selected IR nodes are either executed completely or not at all during program execution. Code-selection and ISE both rely on data dependencies between statements.

3.3. Definition (Data Dependency). A statement $s_j$ of a basic block $B = \langle s_1, \dots, s_n \rangle$ is said to be data dependent on a statement $s_i$, with $i < j$, iff $s_j$ uses a value that is defined by $s_i$.

A dependency analysis in its simplest form evaluates the data dependencies inside a single basic block and is called local Data Flow Analysis (DFA). During DFA, a data flow equation is created for each statement, such that the resulting system of equations provides information on the data dependencies for all statements of the basic block. The DFA results in a structure called Data Flow Graph (Figure 3.2 (a)).
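As a small example, consider the following C basic block; the comments list the data dependencies that a local DFA would compute.

    int example(int a, int b, int c)
    {
        int t = a + b;   /* s1 */
        int u = t * c;   /* s2: data dependent on s1 (reads t)          */
        int v = t - u;   /* s3: data dependent on s1 (t) and s2 (u)     */
        return v;        /* in the DFG, the '+' vertex computing t
                            emanates two edges (to '*' and '-'), i.e.
                            t is a common subexpression                 */
    }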

Data Flow Graph

The vertices of such trees and graphs represent operators/operands and the edges represent data dependencies. These structures are consequently referred to as Data Flow Tree (DFT) and Data Flow Graph (DFG), where DFGs are the more important structure, since DFTs usually derive their structure from a corresponding DFG.

3.4. Definition (Data Flow Graph). A DFG for a basic block $B$ is a DAG $G_B = (V_B, E_B)$, where each leaf node $v \in V_B$ represents either an input operand (constant, variable) or an output (variable) operand, and each interior node represents an operation. Edges $(v_i, v_j) \in E_B \subseteq V_B \times V_B$ indicate that a value defined by $v_i$ is used by $v_j$, i.e. $v_j$ is data dependent on $v_i$.

Vertices of a DFG from which more than one edge emanates are called Common Subexpressions (CSE). If a DFG does not contain any CSE, it is called a DFT. DFTs are usually constructed by splitting a DFG at its CSEs and inserting copies of the respective CSE into each resulting DFT at the appropriate positions, such that each vertex in a tree has at most one successor. In practice, a compiler performs a DFA not at the level of a basic block, but for complete procedures. For this purpose (and others as well), a Control Flow Graph (CFG) is computed.

Control Flow Graph

While DFGs mirror program behavior within a basic block, the CFG reflects the global control flow of a function or procedure. The CFG thus provides a different view than DFGs and consequently extends the compiler’s perspective on the source code.

3.5. Definition (Control Flow Graph). A CFG of a function F is a directed graph G_F = (V_F, E_F). Each node v ∈ V_F represents a basic block, and E_F contains an edge (v, v′) ∈ V_F × V_F iff v′ might be directly executed after v. The set of successors of a basic block B is given by succ_B = {v ∈ V_F | (B, v) ∈ E_F} and the set of predecessors of a basic block B is given by pred_B = {v ∈ V_F | (v, B) ∈ E_F}.

The obvious edges are those resulting from jumps to explicit labels, e.g. by the last statement s_n of a basic block. If s_n is a conditional jump or a conditional return, an additional fallthrough edge to the successor basic block is created. Blocks without any outgoing edges have a return statement at the end. If the CFG contains unconnected basic blocks, there is so-called unreachable code, which can be eliminated by dead code elimination without changing the semantics of the program code.

Dominators

The notion of dominators is widely applied in the context of compilation and ISE. Especially for ISE (pertinent to this thesis), the enumeration of dominators plays an important role, because subgraph enumeration is realized by enumerating multiple-vertex dominators. For compilation, probably the most prominent application of dominators is loop analysis. Loops are of high interest, since they usually represent execution hotspots of an application and therefore offer substantial optimization potential.

3.6. Definition (Dominator). Given a rooted graph¹ G = (V, E, r), a vertex v ∈ V dominates a vertex w ∈ V in G (v ⪯_G w), iff every path ⟨r, ..., w⟩ emanating at the graph’s root r and leading to w includes v as well. Accordingly, a vertex v ∈ V post-dominates a vertex w ∈ V, iff every path ⟨w, ..., r⟩ emanating at w and leading to the graph’s root r includes v as well. The vertex v is called a dominator of w, and Dom(w) designates the set of all dominators of w.

¹A graph G is called a rooted graph if one vertex r has been designated the root, in which case the edges have a natural orientation, towards or away from the root r. If all paths lead towards r, it is sometimes called the sink of G.

The binary dominance relation ⪯_G is reflexive (a ⪯_G a), transitive (a ⪯_G b ∧ b ⪯_G c ⇒ a ⪯_G c) and antisymmetric (a ⪯_G b ∧ b ⪯_G a ⇒ a = b). Furthermore, if the underlying flow graph G is obvious from the context, ⪯ is used instead of ⪯_G. While the dominance relation captures every node that dominates a certain other node, it is often useful to know the immediate dominator of a certain node.

3.7. Definition (Immediate Dominator). Given a rooted graph G = (V, E, r), a vertex v ∈ V immediately dominates a vertex w ∈ V (v idom w), iff every other dominator of w also dominates v:

v idom w ⇔ Dom(w) − {w} = Dom(v).

The vertex v is called the immediate dominator of w (v = IDom(w)).

Intuitively, the immediate dominator of a node w is the node that is closest to w and dominates it. The immediate dominance relation forms a tree of nodes — called the dominator tree — whose root is the entry node, whose edges represent immediate dominance between nodes and whose paths display all dominance relationships. In the dominator tree, each node is a child of its immediate dominator. The analysis of dominators has been extensively studied in the literature [148, 149, 182, 227, 253]. In general, the set of dominators can be represented as

Dom(r) = {r}

Dom(v) = {v} ∪ ⋂_{w ∈ pred(v)} Dom(w)    (3.1)

However, solving these equations as a forward dataflow problem [149] results in quadratic runtime. Nevertheless, the algorithm described in [182]² is capable of computing the dominators of a flow graph in O(n · log(n)) time. It is one of the best known and most widely used algorithms for fast dominator computation. By traversing the vertices of an underlying flow graph G = (V, E, r) in depth-first order, the algorithm constructs a spanning tree T and an enumeration dfnum(v) of all vertices v ∈ V (Figure 3.3). The tree features several attributes that are helpful for the computation of dominators. For all vertices v ≠ r and their according path P⟨r,v⟩ in the spanning tree T, the following holds:

• ∀w ∈ P⟨r,v⟩ with w ≠ v : dfnum(w) ≤ dfnum(v)

• obviously, every dominator of v lies on P⟨r,v⟩, such that ∀d ∈ Dom(v) : d ⪯ v ⇒ dfnum(d) ≤ dfnum(v)

²A modified version of this algorithm is applied for the subgraph enumeration described in Chapter 7.

• for the computation of IDom(v), only predecessors of v have to be regarded

• if dfnum(w) ≤ dfnum(v), w and v have at least one common ancestor³ in the depth-first tree T

Based on the spanning tree and its implied enumeration, a value called semidominator is computed for each vertex v ≠ r (Figure 3.3). The semidominator of a node v can be described as the minimal⁴ predecessor of v in T that originates a path to v whose inner nodes lie beyond v’s search path P⟨r,v⟩.

3.8. Definition (Semidominator). A semidominator is defined as

sdom(v) = min{ w | ∃⟨w = v_0, v_1, ..., v_n = v⟩ : dfnum(v_i) ≥ dfnum(v), ∀ 1 ≤ i ≤ n − 1 }.


Figure 3.3 – Example of depth-first enumeration of semidominators in a flow graph: Solid red edges represent spanning tree edges; black edges are nontree edges; numbers and letters in parentheses designate the depth-first number and the semidominator of the corresponding vertex.

³Node a is an ancestor of node b if a = b or there is a path from a to b in T. Furthermore, a is a proper ancestor of b if a is an ancestor of b and a ≠ b.
⁴Minimal in terms of the depth-first enumeration.

In order to compute the semidominators of the flow graph’s vertices, every vertex v and its predecessors w are evaluated according to the following rules:

• If w ∈ P⟨r,v⟩ of T, such that dfnum(w) ≤ dfnum(v), w is a candidate for sdom(v).

• If w ∉ P⟨r,v⟩ of T (i.e. dfnum(w) ≥ dfnum(v)), the semidominators of w and of its ancestors u with dfnum(u) ≤ dfnum(w) are candidates for sdom(v).

Afterwards, the candidate featuring the minimal depth-first number is selected. Let u be the node on the path P⟨sdom(v),v⟩ whose semidominator sdom(u) features the minimal depth-first number. Then

IDom(v) = sdom(v), if sdom(u) = sdom(v)
IDom(v) = IDom(u), if sdom(u) ≠ sdom(v)

Finally, the algorithm explicitly sets IDom(v) for each v, processing the nodes in depth-first order. The asymptotic complexity of this methodology has been further reduced to linearity as described in [42]; these improvements, however, did not translate into reduced runtime in practice. Interestingly, by turning the problem of dominator identification back into a forward data flow problem, [93] presents an algorithm that features significantly faster runtimes than [182], even for flow graphs with more than 400 nodes. In the preceding explanations, only single-vertex dominators have been considered. However, the notion of dominator can be generalized to sets of vertices, which collectively dominate a given vertex [141].

3.9. Definition (Generalized Dominator). Given a rooted graph G = (V, E, r), a set of vertices U ⊆ V dominates a vertex v ∈ V (U ⪯ v), iff the following two conditions are met:

1. all paths from the root r of G to vertex v contain at least one vertex w ∈ U;

2. for each vertex w ∈ U, there is at least one path from the root r of G to vertex v that contains w, but no other vertex of U.

The computation of generalized dominators in general requires exponential runtime [141]. In the algorithm described in [141], generalized dominators are computed, similar to Equation 3.1, by taking the intersections of the dominator sets of (immediate) predecessors. The algorithm is based upon the observation that, if a vertex v is dominated by another vertex u, then u must also dominate all predecessors of v. Accordingly, all single-vertex dominators are computed first. Generalized dominators of a certain vertex are then determined by considering combinations of dominators of its predecessors. In order to verify whether a set of vertices U ⊆ V with cardinality |U| dominates a vertex v, it is ensured that no subset W ⊂ U dominates v. This procedure requires the computation of all dominator sets of cardinality less than |U| in advance, such that dominator sets are computed by successively increasing the cardinality of the sets.
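For single-vertex dominators, Equation 3.1 can also be solved directly by fixed-point iteration, in the spirit of the data flow formulation advocated in [93]. A minimal C sketch, assuming a CFG with at most 64 nodes (so that one bitset word per dominator set suffices) and vertex 0 as root:

    #include <stdint.h>

    #define MAXN 64   /* assumed: at most 64 CFG nodes, one bitset word each */

    /* preds[v] lists the predecessors of v, npred[v] their count; on      */
    /* return, bit w of dom[v] is set iff w dominates v (vertex 0 is r).   */
    void compute_dominators(int n, const int preds[][MAXN],
                            const int *npred, uint64_t *dom)
    {
        uint64_t all = (n == 64) ? ~0ULL : ((1ULL << n) - 1);
        dom[0] = 1ULL;                          /* Dom(r) = {r}             */
        for (int v = 1; v < n; v++)
            dom[v] = all;                       /* start from the full set  */
        for (int changed = 1; changed; ) {      /* iterate to a fixed point */
            changed = 0;
            for (int v = 1; v < n; v++) {
                uint64_t d = all;
                for (int k = 0; k < npred[v]; k++)
                    d &= dom[preds[v][k]];      /* intersect over pred(v)   */
                d |= 1ULL << v;                 /* add v itself (Eq. 3.1)   */
                if (d != dom[v]) { dom[v] = d; changed = 1; }
            }
        }
    }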

3.1.3 Backend

In the backend of a compiler, the IR is translated into equivalent assembly code for the underlying target processor. This process comprises three main phases: code-selection, register-allocation and instruction-scheduling. Each of these phases has to deal with an NP-complete problem [130], i.e. all known exact algorithms require superpolynomial runtime, which is why compiler backend phases are typically solved by heuristic⁵ algorithms. The phases of a compiler backend are interdependent, which means that decisions of one phase influence the other phases as well. While this works fine for regular architectures, code quality for irregular architectures may deviate from the optimum [286], which is known as the phase coupling problem.

Code-Selection is basically a pattern matching problem. The constituents of the compiler’s IR are analyzed and compared against the available instruction patterns, such that an appropriate set of instruction patterns can be composed that covers the complete IR. Amongst several different approaches, code-selection via tree parsing and dynamic programming [36, 37, 246] has evolved into the most widely accepted one, as it leads to the optimal solution in linear time for typical ISAs. This approach works on DFTs, which can be obtained from a DFG as described in the previous Section 3.1.2. It has been applied to a variety of machine models including stack machines, multi-register machines, infinite register machines as well as superscalar machines [66]. Tree parsing is twofold: First, IR patterns matching hardware instructions have to be identified and evaluated with respect to their execution efficiency. Second, the patterns have to be combined in such a way that every part of the IR is assigned to a suitable hardware instruction of the target processor and the overall execution costs are minimized. The basic idea is to describe the ISA of a processor in terms of a context free tree grammar.

3.10. Definition (Context free tree grammar). A context free tree grammar G is a quintuple G = (T, N, P, S, w), where T denotes a finite ranked alphabet of terminals equal to the set of IR operators⁶, N a finite alphabet of nonterminals equal to the set of storage classes, and P a set of production rules n → a, with n ∈ N and a ∈ A_T(N), where A_T(N) designates the associated term algebra⁷. S ∈ N is the start symbol and w : P → R is a cost metric for the production rules.

In the context of tree pattern matching, the set T represents the set of IR nodes and N can be regarded as a set of temporaries or storage locations (e.g. registers or memory) used to transfer intermediate results between operations.

⁵A heuristic is a technique designed for solving a problem more quickly when classic methods are too slow, or for finding an approximate solution when classic methods fail to find any exact solution. By trading optimality, completeness, accuracy, and/or precision for speed, a heuristic can quickly produce a solution that is good enough for the problem at hand, as opposed to finding all exact solutions in a prohibitively long time. [31]
⁶A ranked alphabet consists of symbols having an arity, e.g. PLUS(Reg, Reg) is of arity two.
⁷A term algebra is, in this context, the set of all trees composed of symbols in T ∪ N according to their arity, where nonterminals are considered nullary.

Terminals = {PLUS, SHIFT, LOAD}
Nonterminals = {Reg, Imm, ε}
Start symbol = Reg
Rules:
1: Reg → PLUS(Reg, Reg), costs = 1
2: Reg → PLUS(Reg, Imm), costs = 1
3: Reg → SHIFT(Reg, Reg), costs = 1
4: Reg → SHIFT(Reg, Imm), costs = 1
5: Reg → Imm, costs = 1
6: Imm → Const, costs = 0
7: Reg → LOAD(Reg), costs = 1
8: ε → Assign(Reg, Reg), costs = 1

Figure 3.4 – Example Tree Grammar: Rules consist (from left to right) of a rule number, the nonterminal representing the result, a terminal designating the operator, operand nonterminals given in parentheses, as well as the associated costs.

The cost metric describes the costs emerging from the execution of the corresponding hardware instruction, e.g. with regard to performance, code size or power consumption. The target code is generated through the reduction of the DFT by recurrent application of production rules p ∈ P, i.e. a subtree S can be replaced by a nonterminal n ∈ N if the rule n → S is in P. As a typical example of a tree grammar rule, consider the rule for a register-to-register ADD instruction: Reg → PLUS(Reg, Reg) {costs} = {actions} with Reg ∈ N and PLUS ∈ T. If the DFT contains a subtree matching a pattern whose root is labeled with “PLUS” and whose children are labeled with “Reg”, it can be replaced by Reg⁸. Each rule is associated with a cost and an action section. The latter usually contains the code to emit the corresponding assembly instructions, but might also contain code to produce another, lowered form of IR. It might happen that multiple production rules can be applied to cover a single subtree. In general, a covering is regarded as optimal if the sum over all involved costs is minimal. This can be accomplished by dynamic programming. A tree pattern matcher traverses the DFT T twice: First, T is traversed bottom-up from the leaves to the root, while each visited node v ∈ T is labeled⁹ with a comma-separated list of

• the set of nonterminals it can be reduced to (this also includes those nonterminals which might be produced by a sequence of production rules),

⁸It should be noted here that both children might be the result of other tree grammar rules which have been applied before.
⁹This phase is sometimes referred to as labeling or the labeler.

• for each nonterminal n ∈ N the most beneficial rule p ∈ P producing n (if available) and

• the total costs (i.e. the costs covering the subtree rooted at v).

When the root node of T is reached, the rule that produces the start nonterminal with minimal costs is at hand. After the first bottom-up traversal, the nodes of the DFT are revisited a second time, top-down from the root to the leaves. In this pass, the pattern matcher exploits the fact that the rule annotated at a node v determines the nonterminals the subtrees rooted at v have to be reduced to. Hence, starting at the root of T, the pattern matcher determines which nonterminal must be selected at the next lower level of T. Thereby, for each nonterminal the corresponding rule p is obtained and its action section is executed. Figure 3.5 illustrates this process on the basis of the tree grammar specification given in Figure 3.4.


Figure 3.5 – IR tree with annotated grammar rules. For each node and each nonterminal, the rules providing minimal accumulated costs are annotated. The annotations consist of the produced nonterminal, the rule number in accordance with Figure 3.4 and the corresponding costs.
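The bottom-up labeling pass can be made concrete for the grammar of Figure 3.4. The following C sketch is a simplified illustration only: the node layout is hypothetical, and the second, top-down reduction pass that executes the action sections is omitted.

    #include <limits.h>

    enum op { CONST, PLUS, SHIFT, LOAD };
    enum nt { REG, IMM, NNT };            /* nonterminals of Figure 3.4 */

    typedef struct node {
        enum op op;
        struct node *kid[2];              /* operands, NULL if absent   */
        int cost[NNT];                    /* minimal cost per nonterminal */
        int rule[NNT];                    /* rule number achieving it   */
    } node;

    static void set_rule(node *n, enum nt x, int c, int r)
    {
        if (c < n->cost[x]) { n->cost[x] = c; n->rule[x] = r; }
    }

    /* Bottom-up labeling pass for the tree grammar of Figure 3.4. */
    void label(node *n)
    {
        for (int i = 0; i < 2; i++)
            if (n->kid[i]) label(n->kid[i]);
        n->cost[REG] = n->cost[IMM] = INT_MAX;
        switch (n->op) {
        case CONST:
            set_rule(n, IMM, 0, 6);                        /* rule 6       */
            set_rule(n, REG, n->cost[IMM] + 1, 5);         /* chain rule 5 */
            break;
        case PLUS:
        case SHIFT: {
            int r = (n->op == PLUS) ? 1 : 3;               /* rules 1/3    */
            set_rule(n, REG,
                     n->kid[0]->cost[REG] + n->kid[1]->cost[REG] + 1, r);
            if (n->kid[1]->cost[IMM] != INT_MAX)           /* rules 2/4    */
                set_rule(n, REG,
                         n->kid[0]->cost[REG] + n->kid[1]->cost[IMM] + 1,
                         r + 1);
            break;
        }
        case LOAD:
            set_rule(n, REG, n->kid[0]->cost[REG] + 1, 7); /* rule 7       */
            break;
        }
    }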

Due to the wide acceptance of tree parsing, it has been further developed to yield code-generator generators [80, 133]. A number of code-generator generators are nowadays available, which consume a tree grammar and produce C-code implementing a code-selector. The most prominent representatives of such tools are BEG [143], BURG [77, 117], IBurg [115, 116], Lburg (the code-selector of the lcc compiler) [76], OLIVE (used in the SPAM compiler project) [250] and finally Twig [36, 256]. Tree parsing works well for regular architectures (like general-purpose or typical RISC/Complex Instruction Set Computer (CISC) processors). For irregular architectures or those featuring special CIs (like Single Instruction Multiple Data (SIMD) or MOIs), it may lead to suboptimal results. Under such constraints, DFG-based code-selection is necessary [40]. Since this problem is NP-complete in general [38, 74, 225], most approaches feature either heuristic methodologies or exponential runtime, making them impracticable for real-world applications. However, for some ISAs it is possible to perform optimal DFG-based code-selection in polynomial time [106]. An alternative method of code-selection, which is better suited for linear (e.g. 3-address code), as opposed to tree-like, IRs is to incorporate code-selection into peephole optimization [94, 98, 118, 119, 171]. Within peephole optimization [201], pattern matching transformations are performed over a small window of operations, the “peephole”. This window may be either a physical window, where the operations considered are only those scheduled next to each other in the current operation list, or a logical window, where the operations considered are just those that are data or control related to the operations currently being scanned. When performing peephole-based code-selection, the peephole optimizer simply converts a window of IR operations into target-specific hardware instructions. If a logical window is being used, this technique can be considered a heuristic method for DFG-based code-selection.

Register-Allocation determines a mapping of variables to physical storage over the period of program execution. Register-allocation can be classified into local [84, 125] and global [85, 125] register-allocation. The former is restricted to the scope of a single basic block, whereas the latter works on the scope of a complete function (CFG) and consequently takes control flow into account. Global register-allocation has gained wide acceptance amongst compiler engineers, since its extended scope leads to more efficient results than local register-allocation. Typically, during global register-allocation, instruction operands are assigned to so-called virtual registers, whose life ranges are analyzed via DFA.

3.11. Definition (life range). A virtual register r is live at a program point p, if a path exists in the CFG from p to a use of r on which r is not defined. Otherwise r is dead at p.

Besides other approaches like [218, 222, 226], global register-allocation via graph coloring [69, 125, 235] has become the most frequently used method in compilation. The method is based on the construction of an interference graph.

3.12. Definition (interference graph). A graph G =(V,E) is called interference graph, iff all v ∈ V are virtual registers and all edges (v,w) ∈ E imply that v and w have intersecting life ranges.

The vertices of the interference graph are colored, such that no two adjacent vertices feature the same color. The number of colors equals the number of physical registers.

3.13. Definition (graph coloring). A graph G = (V, E) is called k-colorable, iff a function f : V → {1 ... k} exists, such that ∀(x, y) ∈ E : f(x) ≠ f(y).

Since graph coloring is an NP-complete problem, no optimal solution can be obtained in acceptable time and therefore heuristic approaches are used. The bedrock of this idea is the observation that vertices v ∈ V with degree deg(v) < k (where deg(v) equals the number of edges emanating from v) can be eliminated from the graph without destroying its colorability, i.e. let G′ = (V′, E′) be the graph constructed from G by removing v and its emanating edges from V and E, respectively. Then it holds:

G k-colorable ⇒ G′ k-colorable.

Consequently, the interference graph is decomposed step by step; vertices that cannot be removed this way because deg(v) ≥ k are marked and eventually spilled, i.e. stored in memory. Another pass recomposes the graph while assigning colors to the nodes.
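The decomposition step can be sketched in a few lines of C. The following simplified Chaitin-style simplify phase assumes an interference-matrix representation; K and MAXV are illustrative constants:

    #define MAXV 128   /* assumed maximum number of virtual registers */
    #define K    16    /* assumed number of physical registers        */

    /* Simplify phase: repeatedly remove vertices with deg(v) < K;    */
    /* whatever remains is marked as a spill candidate. Returns the   */
    /* number of vertices pushed onto the coloring stack.             */
    int simplify(int n, const char adj[MAXV][MAXV], int *spill, int *stack)
    {
        int deg[MAXV], removed[MAXV] = {0}, top = 0;

        for (int v = 0; v < n; v++) {
            deg[v] = 0;
            for (int w = 0; w < n; w++)
                deg[v] += adj[v][w];
            spill[v] = 0;
        }
        for (;;) {
            int v = -1;
            for (int u = 0; u < n; u++)
                if (!removed[u] && deg[u] < K) { v = u; break; }
            if (v < 0) break;               /* no trivially colorable node */
            stack[top++] = v;               /* color later, in reverse order */
            removed[v] = 1;
            for (int w = 0; w < n; w++)     /* G' remains k-colorable      */
                if (adj[v][w] && !removed[w]) deg[w]--;
        }
        for (int u = 0; u < n; u++)
            if (!removed[u]) spill[u] = 1;  /* deg >= K: potential spill   */
        return top;
    }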

Instruction-Scheduling determines the temporal execution order of hardware instructions under given resource constraints to exploit as much of the existing ILP as possible. This is necessary, since most contemporary processors feature ILP either in the form of pipelining or in terms of VLIW architectures. The available ILP is generally restricted by data dependencies among instructions, such that the temporal execution order of instructions is not freely selectable. These dependencies can be classified into true dependences (read after write), antidependences (write after read) and output dependences (write after write). As with register-allocation, scheduling can be classified into local and global approaches. Local schedulers’ scopes are restricted to single basic blocks, while global schedulers work at the level of complete functions, i.e. optimizing across control flow. One example of global scheduling is trace scheduling [166]. The underlying idea of this approach is based on execution frequencies of basic blocks, which have to be obtained by profiling. According to these execution frequencies, a trace is a cycle-free path in the CFG that is handled as “one” basic block. In contrast to global scheduling, local scheduling has gained wide acceptance; it is simply referred to as scheduling in the remainder of this paragraph. The objective of scheduling is the identification of an order of instructions consuming the minimal number of cycles, such that

• every instruction has to be executed at some point in time,

• dependencies are not violated and

• only available resources are utilized.

Due to the absence of efficient optimal solutions for this problem, list scheduling [178] — a fast heuristic — has evolved into the state-of-the-art in instruction-scheduling. List scheduling is built upon a dependency graph, which sets the instructions in relation to each other on the basis of consumed cycles and data dependencies. List scheduling is an iterative method featuring a worst-case complexity of O(|V|²), which is dominated by the construction of the dependency graph.

3.14. Definition (Dependency Graph). A dependency graph is an edge-weighted DAG G = (V, E, delay), where each vertex v ∈ V represents a schedulable instruction. An edge e = (v, w) ∈ E indicates a dependency between vertices v and w; it is weighted with the minimum number of delay cycles, given by delay(e), after which instruction w can be started following v.

At every point in time, list scheduling keeps a so-called ready set, which contains exactly those vertices of the dependency graph whose direct predecessors have already been scheduled. Several heuristics have been designed in the past to select a vertex from the ready set. Typically, those vertices of the ready set that are part of the critical path are selected next for scheduling.

3.15. Definition (critical path). Let G = (V, E) be a dependency graph. Each longest path P in G is called a critical path. The length of P, l_c = Σ_{e ∈ P} delay(e), is given by the sum over all edge weights along P and is called the critical length.

List scheduling successively selects vertices from the ready set according to the critical path, removes them from the dependency graph and inserts them into a partial schedule. Subsequently, the ready set is updated by recomputing the critical path, and the algorithm proceeds with the next instruction taken from the ready set. However, list scheduling, although very widespread, has its limitations. In the case of antidependencies, list scheduling is not able to handle negative latencies efficiently, which is necessary to fill delay slots. A solution to this are backtracking schedulers [233]. Backtracking allows the retraction of previously taken scheduling decisions. Furthermore, the available ILP in control-flow-dominated program code is usually very low, since basic block sizes are small on average. In particular, loops feature strong control flow and usually represent hotspots of program execution at the same time. Therefore, loops are primary candidates for instruction-scheduling. Software pipelining [199], implemented through iterative modulo scheduling [53], is a prominent solution for scheduling within loop bodies.
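A minimal list scheduler can be sketched as follows. The C fragment below ignores resource constraints and delay cycles for brevity and assumes precomputed priorities, e.g. critical-path lengths, in prio[]:

    #define MAXI 256   /* assumed maximum number of instructions */

    /* Minimal list scheduler: repeatedly pick a ready instruction     */
    /* (all predecessors scheduled) with the highest priority.         */
    void list_schedule(int n, const char adj[MAXI][MAXI],
                       const int *prio, int *order)
    {
        int npred[MAXI] = {0}, done[MAXI] = {0};
        for (int v = 0; v < n; v++)
            for (int w = 0; w < n; w++)
                if (adj[v][w]) npred[w]++;
        for (int k = 0; k < n; k++) {
            int best = -1;
            for (int v = 0; v < n; v++)      /* scan the ready set      */
                if (!done[v] && npred[v] == 0 &&
                    (best < 0 || prio[v] > prio[best]))
                    best = v;
            order[k] = best;                 /* append to the schedule  */
            done[best] = 1;
            for (int w = 0; w < n; w++)      /* successors may become   */
                if (adj[best][w]) npred[w]--; /* ready                  */
        }
    }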

Code Emission, as the last phase, emits assembly code (typically into an assembly file) according to the previously computed information. Although this is not a big issue for single-slot machines, it might be difficult for VLIW architectures, which typically impose constraints on the composition of instruction words. Finally, the produced assembly file can be fed into assembler and linker in order to produce a valid executable file.

3.2 Instruction Set Extensions

ISE is the process of identifying optimized hardware instructions for the efficient processing of a given application or set of applications. Automating this process is an important step for the design-automation of embedded processors, both for extensible processors [43, 48, 137, 278] and for the iterative ADL-based development of ASIPs. Two basic categories of approaches exist to solve ISE: complete and partial customization [126]. While the former develops an entire processor ISA from scratch, the latter focuses on a small number of selected special instructions providing a high benefit in terms of speedup for a certain application. Due to the high diversity of the problem, an exhaustive illustration of every dimension of ISE is beyond the scope of this thesis. Interested readers may refer to [126], which provides a detailed survey of the ISE problem. In the remainder of this section, only common principles of partial ISE are tackled.

3.2.1 Problem Statement

The ISE problem is a well known topic requiring diverse engineering and graph theory concepts. Particularly the latter is the dominant approach and is widely regarded as the right track. Starting from high-level code, applications are thus transformed into directed graphs, and new CIs are described as subgraphs featuring certain properties. Irrespective of the type of customization, complete or partial, two related levels of granularity exist: fine-grained and coarse-grained. The former works at the operation level, implementing small clusters of operations in hardware [49, 50, 51, 127, 128, 129, 145, 259], while the latter identifies critical loops and procedures within the target application and displaces them from software to hardware as a whole [52, 132, 231, 272]. The main differences between these two approaches concern speedup and flexibility: Although a coarse-grained approach may produce a large speedup, its flexibility is limited, i.e. given that the analysis of CIs is based on a single application and its hotspots, it is quite unlikely that the same CIs will reappear on the critical path of other applications as well. ISE’s goal of identifying a set of operations within an application (or a set of applications) that should be implemented in hardware, while other operations are left for software execution, can be described as a hardware/software codesign or partitioning problem that balances the presence of hardware and software at design time. Operations implemented in hardware are incorporated into the processor’s architecture either as new instructions in the form of SFUs integrated into the processor, or they are implemented as peripheral devices. The interface between these system parts usually takes the form of special-purpose instructions embedded in the instruction stream. Hardware components generally feature a more or less tight coupling with the processor core, which involves different synchronization costs. Hence, it might become necessary to implement an appropriate communication and synchronization scheme in the processor architecture, too. In general, the implementation of clusters of operations in hardware as new CIs, whatever nature they have, will benefit the overall performance only if the time the hardware platform takes to evaluate them is less than the time required to compute the same operations in software. As a result, compilation and the initialization of resources have to be considered as well.

3.2.2 Procedure Description

Typically, application source code is preprocessed by a compiler frontend to transfer the application(s) into an appropriate IR that is easier to analyze. The ISE procedure pertinent to this thesis (cf. Chapter 7) can be roughly structured into three steps: subgraph enumeration, detection of isomorphic subgraphs and graph covering.

Subgraph enumeration computes every possible CI (regardless of its applicability) for a given IR representation of an application. CIs in general encapsulate the computation of frequently executed subsets of the IR. Since compilers most often apply DFG structures as IR format, CIs consequently represent arbitrary subgraphs of these DFGs, which additionally have to be convex.

Figure 3.6 – Examples of subgraphs.

3.16. Definition (subgraph). A graph S = (V′, E′) is said to be a subgraph of another graph G = (V, E), iff its node set V′ is a subset of that of G and its adjacency relation is a subset of that of G restricted to this subset: V′ ⊆ V ∧ E′ ⊆ E.

3.17. Definition (convex subgraph). Given a DFG G = (V, E), a subgraph S is called convex, iff no path exists from a vertex v ∈ S to another vertex w ∈ S that contains a vertex u ∉ S.
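Convexity can be tested via plain reachability: a candidate subgraph S is non-convex exactly if some vertex outside S is reachable from S and can itself reach S again. A minimal C sketch over an adjacency-matrix DFG (sizes and representation assumed for illustration only):

    #define MAXD 128   /* assumed maximum DFG size */

    static void dfs(int v, int n, char adj[MAXD][MAXD], char *seen)
    {
        seen[v] = 1;
        for (int w = 0; w < n; w++)
            if (adj[v][w] && !seen[w]) dfs(w, n, adj, seen);
    }

    /* in_s[v] = 1 iff vertex v belongs to the candidate subgraph S. */
    int is_convex(int n, char adj[MAXD][MAXD], const char *in_s)
    {
        static char radj[MAXD][MAXD];             /* reversed edges       */
        char from_s[MAXD] = {0}, to_s[MAXD] = {0};
        for (int v = 0; v < n; v++)
            for (int w = 0; w < n; w++)
                radj[w][v] = adj[v][w];
        for (int v = 0; v < n; v++)               /* reachability from/to S */
            if (in_s[v]) { dfs(v, n, adj, from_s); dfs(v, n, radj, to_s); }
        for (int u = 0; u < n; u++)               /* external vertex on a   */
            if (!in_s[u] && from_s[u] && to_s[u]) /* path between S-nodes?  */
                return 0;                         /* leaves and re-enters S */
        return 1;
    }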

Such subgraphs can be classified according to their number of input and output operands (single/multiple input/output operands) as well as according to their connectivity, i.e. whether or not a subgraph comprises disconnected patterns. Furthermore, the target architecture may impose several additional constraints on the subgraphs that can be considered valid. First of all, the maximum number of input and output operands (N_in, N_out) may be limited due to the usually limited encoding space or the number of read/write ports of the register file. Secondly, some vertices may be forbidden as they represent undesirable operations for CIs, e.g. loads and stores if the planned functional unit is not going to have memory ports. In addition, other vertices are implicitly considered invalid, because their contents are computed outside the current basic block. Thus, given a DFG G, the maximum numbers of input and output operands N_in and N_out and the set of forbidden vertices F, the objective is to find all convex subgraphs S = (V′, E′) under the constraints

|I(S)| ≤ N_in ∧ |O(S)| ≤ N_out ∧ V′ ∩ F = ∅,

where I(S) and O(S) denote the set of input and output nodes of the subgraph S, respectively. The enumerated subgraphs are furthermore very often evaluated by a merit function, which typically reflects, for each pattern, the number of saved clock cycles under the assumption that the equivalent hardware instruction is executed within a single clock cycle of the underlying processor architecture. The result of this phase is a set of subgraphs of a certain DFG, representing all possible ISE instances ranked by a merit function metric. Several methodologies have been developed in the past that tackle subgraph enumeration from different angles. First of all, a limited number of allowed input and output operands is the basis of several efficient approaches towards subgraph enumeration. Because the exhaustive enumeration of arbitrary subgraphs features exponential runtime complexity [223], earlier approaches concentrated on Multiple-Input-Single-Output (MISO) subgraphs [49, 92], which can be enumerated in linear time [224]. Furthermore, many approaches are restricted to connected subgraphs only [49, 56, 64, 89, 92, 224, 280], although including multiple disconnected components in a subgraph increases the potential to exploit parallelism at the level of IR operations, which is particularly attractive for single-issue architectures [50, 70, 129, 184, 223, 281].

Isomorphic Subgraph Detection, or graph matching in general, is the process of finding a correspondence between the vertices and edges of two (sub)graphs that satisfies certain constraints, such that equivalent substructures of the two (sub)graphs are matched together.

3.18. Definition (graph isomorphism). A graph isomorphism is a bijective graph homomorphism between two graphs G_α = (V_α, E_α) and G_β = (V_β, E_β), i.e. a bijection

f : V_α → V_β with ∀v, w ∈ V_α : (v, w) ∈ E_α ⇔ (f(v), f(w)) ∈ E_β.    (3.2)

Probably the most prominent method for isomorphism detection is the approach described by Ullmann [257]. The underlying idea of [257] is to describe the graphs G_α = (V_α, E_α) and G_β = (V_β, E_β) as adjacency matrices A and B, respectively. In addition, a |V_α| × |V_β| matrix M′ with m_ij ∈ {0, 1} is constructed, such that every row contains exactly one 1 and every column contains not more than one 1. The algorithm’s objective now is to construct a matrix C = M′(M′B)^T, such that

∀i, 1 ≤ i ≤ |V_α|, ∀j, 1 ≤ j ≤ |V_α| : (a_ij = 1) ⇒ (c_ij = 1)

holds, in which case an isomorphism is found. The algorithm iteratively refines the matrix M′, starting from a matrix M⁰ with

m⁰_ij = 1, if deg(v_j) ≥ deg(v_i), v_j ∈ V_α ∧ v_i ∈ V_β
m⁰_ij = 0, otherwise

and systematically changing the elements of M^i in each iteration, such that all possible matrices M′ in accordance with Equation 3.2 are generated and evaluated. The algorithm features a runtime complexity between Θ(n³) in the best and Θ(n! · n²) in the worst case. Graph isomorphism has spawned a wealth of literature in the past, which is not in the scope of this thesis. The interested reader may refer to [4], which provides an enumeration of the existing literature. The final result of this phase is a partition of the set of subgraphs of a certain DFG into equivalence classes in accordance with the isomorphism information, i.e. all elements of an equivalence class are isomorphic to each other.

Graph Covering finally completes ISE, based on the results of the preceding phases, by selecting the most beneficial set of subgraphs to be implemented in an architecture. The benefit of a CI can herein be computed as the number of saved cycles compared to an implementation with primitive operations. Covering has gained wide attention in the past. For simple tree-shaped patterns [36], optimal results can be obtained in linear time, as already described in Section 3.1.3. However, this is mostly too restrictive, as CIs are usually represented by Multiple-Input-Multiple-Output (MIMO) patterns. Such patterns are not matchable within a single DFT. Therefore, graph-based covering methodologies have to be applied, which naturally feature exponential runtime complexity (if optimal) due to the NP-completeness of the problem.

3.3 Concluding Remarks

Although being orthogonal in general, automatic ISE and the compilation of high-level languages share an essential commonness: both processes identify a mapping from a given program representation to hardware instructions of a processor’s ISA. It is exactly this commonness which motivates a combined treatment of compilation and ISE. Typical approaches to ISE are restricted to only a small number of basic blocks of an application, which have been identified in advance as hotspots by some profiler. Based on these basic blocks, instruction patterns are identified under the premise of maximizing the number of operations contained inside each pattern. Such patterns naturally bear a high degree of complexity and are therefore not easily applicable for compilation. Especially if the IR patterns of identified hardware instructions feature a fan-out larger than one, DFT-based pattern matching algorithms are not capable of handling them. Complex hardware instructions are therefore usually ignored by the code-selection phase and instead handled as Compiler Known Functions (CKF) or intrinsics. Basically, CKFs make assembly instructions accessible within high-level code, where the compiler expands a CKF call like a macro. This procedure implies a manual modification of the given applications, which is time-consuming and error-prone, and furthermore restricts the utilization of hardware instructions to a small number of selected hotspots. To overcome this problem, ISE has to identify small reusable instructions whose effectiveness is based on a high number of occurrences instead of a high number of contained operations, while the code-selection phase of a compiler has to incorporate a graph-based pattern matching algorithm in order to handle arbitrary instruction patterns, including those with a fan-out larger than one.

Chapter 4 Case study: Compiler-Agnostic Architecture Exploration

[Figure: stacked bar chart of clock cycles over packet sizes from 64 to 1536 bytes; for each packet size, the left column represents a full software implementation and the right column a full hardware implementation. Tasks: Normal Rx Ops, SPD Look Up, SA Look Up, Driver Overhead, Encryption + Authentication, Interrupt Servicing, SA Update, Normal Routing, Reformatting, Normal Tx Ops.]

Figure 4.1 – Break-up of tasks in typical VPN traffic.

Integrating security warranties into the IP stack inevitably influences the overall IP processing performance (cf. Chapter 2.2.1). Figure 4.1¹ shows the break-up of VPN-related implementation tasks and their execution times in correlation to packet sizes. The columns alternately represent implementations of VPN (via IPSec) in full software and full hardware, starting with a software implementation

¹Based on a presentation slide of the “Stay Smart” road show by Motorola in March 2004.

processing an incoming packet of 64 bytes. The figure identifies data encryption as the most computation-intensive task in IPSec (especially for large IP packets). For the design of application-specific hardware, it is therefore one of the most promising candidates to increase the overall packet processing performance through dedicated SFUs. Nonetheless, encryption algorithms are subject to continuous change. They are regularly cracked or replaced by newer ones, which is why reuse opportunities of SFUs with respect to newer algorithms have to be considered. The implementation of such algorithms in hardware (i.e. as a separate ASIC) indeed offers the best performance, yet it forfeits reusability with respect to different algorithms. For this reason, a solution based on a programmable core is preferable. This chapter showcases the development of a programmable coprocessor for efficient IPSec encryption. The case study aims at illustrating the methodology of iterative architecture exploration using the tool suite of the Synopsys Processor Designer. Through the design of a programmable coprocessor featuring a customized ISA for the symmetric-key block cipher algorithm Blowfish, a representative example of the efficiency of customized ISE in the domain of protocol processing is given. A coprocessor design provides the loosest coupling (e.g. via shared memory) towards different (main) processor architectures and hence increases the reusability of encryption-specific CIs as well. The Blowfish algorithm is representative of a vast spectrum of block cipher algorithms due to its simple and common structure. Block cipher algorithms are widely used for encrypting communication channels as found in the Internet. This case study deliberately omits the development of an optimized compiler for the automatic utilization of encryption-specific CIs in order to stress the need for one (Figure 4.2).

Figure 4.2 – Overview of compiler-agnostic architecture exploration.

In fact, the encryption-specific CIs are manually utilized through CKFs, which implies the manual modification of the targeted applications. The remainder of this chapter is organized as follows: First, Section 4.1 surveys the applied architecture exploration framework and its methodology. The following Section 4.2 gives an illustration of the target application, focusing on the encryption functionality. This is followed by a detailed presentation of the successive refinement flow for the joint processor/coprocessor optimizations in Section 4.3, as well as the obtained results. Section 4.4 concludes the chapter.

4.1 System Overview

In order to design an efficient NPU, like any other ASIP, DSE (Figure 4.3) at the processor architecture level needs to be performed [147, 152]. It is usually an iterative process beginning with an initial architectural prototype and software implementations of appropriate target applications. The applications are executed and profiled on this prototype to detect performance bottlenecks. Based on the profiling results, the designer refines the basic architecture step by step (e.g. by adding CIs or by fine-tuning the architecture) until it is sufficiently tailored to the targeted set of applications.

[Figure: exploration/implementation loop around an ADL description derived from the specification. On the exploration side, a generator produces the C-compiler, assembler, linker and simulator & profiler, yielding profile information and application performance; on the implementation side, an HDL description feeds gate level synthesis, yielding clock speed, chip area and power consumption. Both evaluation results feed back into the specification.]

Figure 4.3 – Tool based processor architecture exploration loop.

This iterative exploration approach requires very flexible retargetable software development tools (C-compiler, assembler, co-simulator/debugger etc.) that can be quickly adapted to varying target processor/coprocessor configurations, and a methodology for efficient MP-SoC exploration at the system level. Retargetable tools permit the exploration of many alternative design points in the exploration space within a short time, i.e. without the need for a tedious complete tool redesign. Such development tools are usually derived from a processor model given in a dedicated specification language.

4.1.1 Synopsys Processor Designer

The architecture exploration framework applied in this thesis builds on the Synopsys Processor Designer, a tool platform for embedded processor design available from Synopsys Inc. [96]. The Synopsys tool suite revolves around the LISA 2.0 ADL. An earlier version of this tool suite and the ADL itself have been described in detail in [152]. Amongst others, it allows for the automatic generation of efficient ASIP software development tools like instruction set simulator [217], debugger, profiler, assembler, and linker, and it provides capabilities for VHDL and Verilog generation for hardware synthesis [240]. A retargetable C-compiler² [153] is seamlessly integrated into this tool chain and uses the same single “golden reference” LISA model to drive retargeting. A methodology for system level processor/communication co-exploration for multi-processor systems [270] is integrated into the LISA tool chain, too. Such an integrated ADL-driven approach to ASIP design is considered most efficient, since it avoids model inconsistencies and the need to use various special-purpose description languages.

Architecture Description Language: LISA

A LISA model can be roughly structured into the processor’s ISA and a description of its resources like memories, register file or pipeline. Resources are modeled in a separate section inside the processor model. All resources declared therein are global to all instructions in the model. The hardware instructions of the ISA are composed of microarchitectural Operations that model the ISA of the processor in a distributed manner, i.e. Operations represent inherent processing steps of certain instructions. If the processor model contains a pipeline, each Operation is assigned to an appropriate pipeline stage. Typically, a hardware instruction is described via several aspects like assembly syntax, coding and microarchitectural behavior. These aspects are specified inside the Operations of an instruction within appropriate sections. Additionally, Operations contain an Activation section for the purpose of triggering subsequent Operations. Since multiple instructions can share common parts of the pipeline execution (e.g. instruction fetch), Operations form a rooted tree in which predecessors are shared by their successors. Figure 4.4 exemplifies a simplified version of the LISA Operations tree for the IRISC architecture (cf. Appendix A). All instructions share one common part of the Operations tree, consisting of Pre-Fetch (PFE), Fetch (FE) and Decode (DC). In the DC stage of the pipeline, the Operations tree branches into arithmetic, branch and load/store Operations. Each of these paths branches again at the Execute (EX) stage in correspondence to the appropriate instructions.

²Based on the CoSy compiler development system from ACE [32].

[Figure: LISA Operations tree of the IRISC architecture, rooted at the Root Operation and spanning the pipeline stages PFE, FE, DC, EX and WB; the tree branches into arithmetic (add, mul, shift), branch (jmp, call) and memory (load, store) Operations, each Operation carrying Syntax, Coding, Behavior and Activation sections.]

Figure 4.4 – Example of operation tree in LISA.

Finally, in the Writeback (WB) stage, the Operations tree consolidates. To implement new instructions, the tree can only be extended by new Operations in appropriate pipeline stages.

Software Development Tools

Based on the LISA description of a processor, the Synopsys Processor Designer provides the generation of a software development tool suite, which covers the entire range from assembly source code processing to simulation, plus a Graphical User Interface (GUI) for debugging.

Assembler: The generated assembler translates text files composed of symbolic instruction names into object code for the present processor prototype. Additionally, the LISA-generated assembler features a set of directives for the comfortable handling of data initialization and a reasonable separation of programs into sections.

Linker: Since large programs typically consist of multiple, separately assembled modules, LISA offers the generation of a linker to combine these modules into a single executable object file. The user has to provide a linker command file that keeps a detailed model of both the target memory environment and an assignment table of the module sections to their respective target memory areas.

Simulator and GUI: The simulator generated from the LISA model uses the technique of compiled simulation. The simulator comes with a generic GUI to visualize the internal states of the simulation process. C source code and disassembly of the simulated application are displayed, as well as all registers, pipeline registers and memories of the underlying architecture prototype.

Moreover, the Synopsys Processor Designer supports a methodology for MP-SoC communication/processor co-exploration. This methodology allows for the seamless integration and evaluation of multiple LISA processor models with SystemC-based communication platform models.

Compiler Designer

The Synopsys Processor Designer provides the semi-automatic generation of an appropriate compiler for a given processor model written in LISA. Relevant architecture information (e.g. available registers) is read from the model and automatically considered during compiler generation. However, a major part of the information is not retrievable from the processor model itself (e.g. the calling convention) and has to be specified by the user. Therefore, a GUI is provided that can be used to complete the required information. The GUI is structured into several dialogs, which guide the user through the process of configuration. Each dialog handles a different aspect of the compiler specification: register allocation, layout of data types, available nonterminals, calling conventions, instruction scheduling and pattern matching. The outcome of the Compiler Designer is a set of Code Generator Description (CGD) files, which describe the entire backend of a compiler. CGD is a proprietary file format of Associated Compiler Experts (ACE) from Amsterdam, the vendor of a compiler framework called CoSy [32] that is used as the backend of the Compiler Designer.

CoSy Compiler Framework CoSy features a modular structure consisting of loosely coupled engines that operate on the CoSy Common Medium Intermediate Representation (CCMIR). CoSy already comes with a large set of standard engines, each of which captures a single process-component of compilation, but it is additionally open to new implementations of engines. Furthermore, CoSy offers numerous configuration options both at the IR and at the engine level. For this purpose, CoSy provides the Full Structured Definition Language (fSDL) and the Engine Description Language (EDL). The fSDL serves as an extensible specification language for the elements of the CCMIR in a distributed fashion, i.e. the CCMIR is a collection of fSDL-defined elements. Every engine contains an fSDL specification of all elements it wishes to access during its runtime. Based on this view, CoSy creates for each engine the Data Manipulation and Control Package (DMCP). This is a set of C functions and C data types that can be used to access the CCMIR inside the engine’s source code. The EDL is used to describe the order of engine execution during the process of compilation. One of the most important engines of CoSy is the Backend Generator (BEG). BEG consumes CGD files as input and automatically produces a set of algorithms for code-selection, register-allocation and instruction-scheduling that are used by other engines to implement a backend for a given processor architecture.

[Figure, two panels: (a) the Compiler-in-the-Loop methodology, in which the Processor Designer and Compiler Designer derive simulator, linker, assembler and C-compiler from a LISA 2.0 description and CGD files, iterating until the design criteria are met; (b) the CoSy framework, in which a C frontend, CCMIR optimization engines and CGD-generated architecture-specific backend engines (instruction selector, scheduler, register allocator, emitter) are configured via fSDL and EDL.]
Figure 4.5 – CiL architecture exploration methodology of the Synopsys Processor Designer.

Figure 4.5(a) illustrates the CiL architecture exploration methodology provided by the Synopsys Processor Designer. In this context, the CoSy compiler framework (Figure 4.5(b)) adopts an important role as a compiler generator backend. Since the Compiler Designer generates “only” the CGD files, CoSy has to provide a global compiler structure into which the CGD backend is embedded.

Generation of Hardware Description

The Synopsys Processor Designer also supports the generation of an appropriate hardware description for a given LISA processor model. The process is hereby neither limited to the LISA ADL nor to a specific Hardware Description Language (HDL). The generation process is built around an IR that captures the explicit information of an arbitrary ADL model and, enhanced by implicit information, transfers it to HDL code. Based on the IR, several architecture- and ADL-independent optimizations can be performed to obtain an efficient architecture design. The applied IR format bears many similarities to real HDL code. It is composed of units, processes and signals, which are at the same time major elements of typical HDLs like Verilog or VHDL. During HDL code generation, IR components are directly mapped to adequate elements of HDL code, e.g. processes of the IR are mapped to processes in VHDL, always blocks in Verilog or SC_METHODs in RTL-SystemC.

4.2 Target Application

IPSec (as part of IPv6) uses both symmetric and asymmetric forms of cryptography. While symmetric cryptography applies the same key for encryption and decryption, asymmetric cryptography uses separate keys for these operations. Symmetric cryptography is generally more efficient and requires less processing power than asymmetric cryptography, which is why it is typically used to encrypt the bulk of the data being sent over a communication channel (e.g. a VPN). One problem with symmetric cryptography is the key exchange process; keys must be exchanged out-of-band to ensure confidentiality. Famous representatives of algorithms that implement symmetric cryptography are, for example, DES, Triple DES (3DES), AES, Blowfish, RC4, the International Data Encryption Algorithm (IDEA), and the Hash Message Authentication Code (HMAC) versions of Message Digest 5 (MD5) as well as the Secure Hash Algorithm (SHA-1). Asymmetric cryptography (also known as public key cryptography) applies two separate keys to exchange data. One key is used to encrypt or digitally sign the data, and the other key is used to decrypt the data or verify the digital signature. These keys are often referred to as public/private key combinations. If an individual’s public key (which can be shared with others) is applied to encrypt data, then only that same individual’s private key (which is known only to the individual) can be applied to decrypt the data. If an individual’s private key is used to digitally sign data, then only that same individual’s public key can be used to verify the digital signature. Common algorithms that implement asymmetric cryptography include RSA [232], the Digital Signature Algorithm (DSA), and Elliptic Curve DSA (ECDSA). Although there are numerous ways in which IPSec can be implemented, most implementations use both symmetric and asymmetric cryptography. Asymmetric cryptography is used to authenticate the identities of both parties, whereas symmetric encryption is used for protecting the actual data because of its relative efficiency.

[Figures: the Blowfish encryption network, transforming a 64-bit plain text block over 16 rounds of XORs with the subkeys P1 ... P16 and the round function F, followed by a final XOR with P17 and P18 (Figure 4.6); and the round function F, splitting a 32-bit half into four 8-bit S-box indices whose lookups are combined by additions and XORs (Figure 4.7).]

Figure 4.6 – Structure of the Blowfish encryption algorithm. Figure 4.7 – Data encryption function F of Blowfish.

The encryption and authentication algorithms used for IPSec are at the heart of the system. They are directly responsible for the system’s security. IPSec generally calls for block cipher algorithms which support the Cipher Block Chaining (CBC) mode [241], i.e. the encryption of a certain block of data is affected by the encryption of the preceding blocks. The target application of this study is the publicly available network stack implementation developed by Microsoft Research [101], known as MSR IPv6. To enhance IPv6 performance, the common path through this protocol, including IPSec encryption, has been identified and extracted. Based on this information, an IPv6 testbench including the Blowfish encryption algorithm was developed. Blowfish is a symmetric block cipher with 64-bit block size and variable length keys (up to 448 bits) [241]. It has gained wide acceptance in a number of applications, and no attacks are known against it. This cipher was specifically designed for 32-bit machines and is significantly faster than DES. One of the proposed candidates for the AES, called Twofish [242], is based on Blowfish. Like most block cipher algorithms, Blowfish is a so-called Feistel network [114, 144], which takes a block of size n, divides it into halves of size n/2 and executes an iterative block cipher of the form

L_i = R_{i−1}
R_i = L_{i−1} ⊕ F(R_{i−1}, K_i)

where K_i is the subkey of the i-th round; L and R are the left and right halves, respectively, of size n/2; and F is an arbitrary round function. Feistel networks guarantee the reversibility of the encryption function. Since L_{i−1} is xor-ed with the output of F, the following holds:

R_i ⊕ F(R_{i−1}, K_i) = L_{i−1} ⊕ F(R_{i−1}, K_i) ⊕ F(R_{i−1}, K_i) = L_{i−1}

The same concepts can be found in algorithms like DES or Twofish as well. Blowfish supports all common encryption modes like CBC, Electronic Code Book (ECB) and Output Feedback 64 (OFB64) and is therefore a good candidate for IPSec encryption. Two main parts constitute the Blowfish encryption algorithm (Figure 4.6): key expansion and data encryption.

Key expansion converts a given key (up to 448 bits) into several 32-bit subkeys totaling 4168 bytes, which have to be generated in advance. On the lowest level, the algorithm relies on just the two basic encryption techniques, confusion and diffusion [241].

• Confusion masks relationships between plain text and cipher text by substituting blocks of plain text with blocks of cipher text.

• Diffusion distributes redundancies of the plain text over the cipher text by permuting blocks of cipher text.

Confusion and diffusion depend strongly on the set of subkeys. 18 subkeys constitute a permutation array (P-array), denoted P1, P2, ..., P18, used for confusion. Diffusion is controlled by four substitution arrays (S-Boxes) of 256 entries each, denoted

S1,0, S1,1, ..., S1,255
S2,0, S2,1, ..., S2,255
S3,0, S3,1, ..., S3,255
S4,0, S4,1, ..., S4,255

Data encryption is basically a very simple function (Figure 4.7) executed 16 times. Each round consists of a key-dependent permutation as well as a key- and data-dependent substitution, which together constitute the basic encryption techniques. The operations used are additions and XORs, plus four memory accesses per round. The exact encryption procedure works as follows:

Divide x into two 32-bit halves xL and xR
For i = 1 to 16:
    xL = xL ⊕ Pi
    xR = F(xL) ⊕ xR
    Exchange xL and xR
Exchange xL and xR (reverts the last exchange)
xR = xR ⊕ P17
xL = xL ⊕ P18
Concatenate xL and xR

The function F itself can be described as follows: divide xL into four 8-bit quarters a, b, c and d; then

F(xL) = ((S1,a + S2,b mod 2^32) ⊕ S3,c) + S4,d mod 2^32

where Si,j designates entry j of S-Box i, for i ∈ {1, ..., 4} and j ∈ {0, ..., 255}. Decryption works exactly the same way, with the only difference that P1, P2, ..., P18 are used in reverse order.
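For illustration, the round function can be written in C as the following minimal sketch (the type alias and the flattened S-Box layout are assumptions chosen for illustration; the thesis’ actual implementation is the one shown in Figure 4.12):

    typedef unsigned int uint32;

    /* S points to the four concatenated 256-entry S-Boxes:
     * S[0..255] = S-Box 1, S[256..511] = S-Box 2,
     * S[512..767] = S-Box 3, S[768..1023] = S-Box 4. */
    static uint32 F(const uint32 *S, uint32 xL)
    {
        uint32 a = (xL >> 24) & 0xff;  /* four 8-bit quarters of xL */
        uint32 b = (xL >> 16) & 0xff;
        uint32 c = (xL >>  8) & 0xff;
        uint32 d =  xL        & 0xff;

        /* ((S1,a + S2,b) XOR S3,c) + S4,d; unsigned 32-bit arithmetic
         * wraps around, providing the mod 2^32 for free. */
        return ((S[a] + S[256 + b]) ^ S[512 + c]) + S[768 + d];
    }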

4.3 Exploration Methodology

In the phase of tailoring an architecture to an application domain, LISA permits refining profiled application kernel functionality down to a cycle-accurate abstraction of a processor model. This process is usually iterative and is repeated until a best fit between selected architecture and target application is obtained. Every change to the architecture specification requires an entirely new set of software development tools. Such changes, if carried out manually, result in a long, tedious and extremely error-prone exploration process. The automatic tool generation mechanism of LISA enables processor designers to speed up this process considerably. The design methodology is composed of three phases: the application profiling (Section 4.3.1), architecture exploration (Section 4.3.2) and architecture implementation (Section 4.3.3) phase.

4.3.1 Application Profiling

Application profiling identifies and selects algorithmic kernels that are candidates for hardware acceleration. Such kernels typically constitute the performance-critical path of a target application. They can easily be identified on the basis of high-level language execution statistics, which are obtained by simulating an instrumented version of the target application with the Synopsys Profiler. For this case study, a C-compiler for a MIPS32 4K architecture has been generated by applying the Synopsys Compiler Generator to the related LISA model. The compiled target application has been profiled to obtain a general idea about bottlenecks and possible hardware accelerations. The outcome is a purely functional specification of reasonable processor instructions to be implemented. As expected, it has turned out that most of the execution time is spent in the encryption algorithm. Specifically, on average 80% of the computation time is spent on the above mentioned function F (Figure 4.7), owing to its iterative execution.

4.3.2 Architecture Exploration

During the architecture exploration phase, software development tools (i.e. C-compiler, assembler, linker, and cycle-accurate simulator) are required to profile and evaluate different architectural alternatives against the target application. To implement programmable hardware support for encryption functionality, a coprocessor design is favored, as it guarantees a loose coupling to the main processor and therefore high reusability with respect to microarchitectural constraints. A shared memory serves as communication interface between coprocessor and main processor. Presuming a common clock for coprocessor and main processor requires equal clock speeds of both processors to prevent mutual deceleration.

Figure 4.8 – Simulation setup for processor/coprocessor collaboration: (a) processor/coprocessor connection, (b) processor/coprocessor communication.

For both processors (Figure 4.8(a)), cycle-accurate simulators automatically generated from the LISA models are applied. The processor models contain two bus instances, one for program memory requests and one for data accesses. For high simulation performance, the memory modules local to one processor are modeled inside the respective LISA model, e.g. the kernel segment Read Only Memory (ROM) (kseg_rom) and the user segment ROM (useg_rom) of the MIPS32 main processor. Only the memory requests to the shared memory are directed to the SystemC world outside the LISA simulators. The platform communication is modeled efficiently using the Transaction Level Modeling (TLM) paradigm [252] supported by SystemC 2.0. A LISA-SystemC port is applied to translate the LISA memory API requests of the processors into the respective TLM requests for the abstract SystemC bus model. The SystemC bus model performs the bus arbitration and forwards the processor requests to the shared memory. This memory is used for communication between the processors, i.e. to exchange parameters and results between the encryption procedures (running on the coprocessor) and the original IPv6 protocol stack (running on the MIPS main processor). To access the coprocessor on C-code level, the generated MIPS C-compiler is extended by dedicated CKFs, one each for the encryption and the decryption procedure. These CKFs have the same signature as the original C-functions of Blowfish; hence, on C-code level no difference is visible, although the internal implementation has changed. As Figure 4.8(b) depicts, the CKFs push their parameters into the shared memory block and wait for the signal to be reactivated. This signal is set by the coprocessor at the end of the computation, and the result is popped from the shared memory block back to the appropriate local memory on the MIPS for further processing. The coprocessor likewise waits for an activation signal to start its computations; the necessary parameters are taken from the shared memory, the relevant computation is performed and the result is written back to the shared memory. Once the simulation environment has been set up, coprocessor instructions that partially cover the behavior of F (Figure 4.7) have to be developed. Since each coprocessor instruction has to execute within one cycle of the MIPS processor (due to the presumed common clock), the first design decision for the coprocessor is to start from a RISC architecture. This initial LISA model template revolves around a 4-stage pipeline with FE, DC, EX and WB stages. In the architecture exploration loops discussed below, the coprocessor core is successively refined in order to reach a certain degree of efficiency. After performing a standalone simulation of the target application Blowfish on the MIPS processor (Exploration 1) to obtain reference simulation results, the coprocessor is developed within two exploration phases: Exploration 2 and Exploration 3.
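To illustrate the push/wait/pop protocol described above, the following hedged C sketch shows what such a CKF wrapper might look like; the shared-memory layout, base address and all field names are hypothetical and only mirror the described mechanism:

    #include <stdint.h>

    /* Hypothetical layout of the 1 kB shared memory block used for
     * parameter/result exchange between MIPS and coprocessor. */
    typedef struct {
        volatile uint32_t xl, xr;   /* the 64-bit block to be encrypted   */
        volatile uint32_t done;     /* set by the coprocessor when ready  */
        volatile uint32_t start;    /* set by the MIPS to start the job   */
    } shared_block_t;

    #define SHARED_MEM ((shared_block_t *)0xA0000000u)  /* assumed base */

    /* CKF with the same signature as the original Blowfish C function:
     * push the parameters, signal the coprocessor, wait, pop the result. */
    void BF_encrypt(uint32_t *xl, uint32_t *xr)
    {
        SHARED_MEM->xl   = *xl;         /* push parameters           */
        SHARED_MEM->xr   = *xr;
        SHARED_MEM->done = 0;
        SHARED_MEM->start = 1;          /* activation signal         */
        while (!SHARED_MEM->done)       /* wait for the coprocessor  */
            ;
        *xl = SHARED_MEM->xl;           /* pop results back          */
        *xr = SHARED_MEM->xr;
    }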

Exploration 2: Implementing the instructions starts with an educated guess (Figure 4.9). Here, the function F (Figure 4.7) is structured into four independent parts, each of which can be executed in a single EX stage. Figure 4.12(a) shows the function F in C-code, where each paragraph, designated with a number, represents the functionality of a corresponding hardware instruction.

Figure 4.9 – First approach.

Figure 4.10 – Second approach.

F receives four parameters LL, R, S, P and has three local auxiliary variables u, v and t of type unsigned long.

• LL and R are 32-bit values, representing the halves of the current block to be encrypted.

• S represents the address of the array that contains the values for the S-Boxes.

• P contains the appropriate value of the key array P for each round.

Each of these instructions (designated as 1, 2, 3 and 4 in Figure 4.12(a)) takes an 8-bit quarter of the input of F, reads the corresponding S-Box value (S[0], S[256], S[512], S[768]) from memory (cf. Section 4.2) and performs either an XOR or an ADD operation on this value. Calling these four instructions in sequence yields a first approach to supporting Blowfish by dedicated hardware instructions. However, memory accesses and additions consume considerable computation time, and the developed instructions therefore most likely will not meet the cycle-length constraint given by the MIPS architecture. This is confirmed by executing the automatic hardware synthesis and the design compiler for the coprocessor model (for simplicity, results are presented en bloc in Section 4.3.4). Furthermore, due to the deep functionality of each hardware instruction, their reusability with respect to other block cipher algorithms is still very limited.

Exploration 3: Refining the first approach (Figure 4.10), the core S-Box access is separated from the remaining operations and implemented as a dedicated hardware unit. This unit is placed into the EX stage of the coprocessor, so that execution in parallel to other operations is enabled (Figure 4.11). As a consequence, the memory latencies related to S-Box accesses are completely hidden inside the execution of the encryption instructions and do not affect system performance. Additionally, the encryption instructions are modified, primarily with respect to the number of additions in each of them. As a result, four instructions are developed, one for each S-Box. Every instruction covers a single S-Box access by calculating the address and pushing the result to a Special Purpose Register (SPR).


Figure 4.11 – Parallel S-Box access in the EX stage of the coprocessor.

This SPR is read by the hardware unit responsible for the pure S-Box memory access in the next clock cycle. Furthermore, an add-xor (AX) and an add-xor-xor (AXX) instruction are created to process the results of the S-Box accesses. This yields the functionality of the instructions performing the computation of F as depicted in Figure 4.12(b). Each paragraph presents the functionality encapsulated by a corresponding hardware instruction; the functionality of the newly added unit is presented in the right column. The first four paragraphs compute a memory address that is used to read a certain S-Box (S[0], S[256], S[512], S[768]) value. These addresses are stored in the newly added local auxiliary variables a, b, c and d. The individual S-Box contents addressed by a, b, c and d are stored in the variables t1, t2, t3 and t4, respectively. These variables are used in Paragraphs 5 and 6 of Figure 4.12(b) to perform further computations on the S-Box values. The assembly syntax of the new instructions is shown in Figure 4.13, which depicts the core of the encryption procedure. This portion of code is executed once per round of the Blowfish algorithm. After the last execution, the algorithm writes back the result of its computations and returns to a wait state until the next event occurs. The parameter designations in Figure 4.13 are chosen according to their semantic meaning; in the real assembly program, they are replaced by register designations. The registers for parameters R and S of the S_BOX instructions are only read, whereas the parameter registers for u and v are read and written. The same holds for parameter T inside the instructions AX and AXX. Although the depicted assembly instructions of the coprocessor use bypass registers for fast parameter transfer amongst each other, they also write their results redundantly back to normal General Purpose Registers (GPR).

#define BF_M 0x3fc
#define BF_0 22L
#define BF_1 14L
#define BF_2 6L
#define BF_3 2L
#define ulong unsigned long
#define uchar unsigned char

(a) Instructions without parallel memory access:

/* 1 */ u = R >> BF_0; u &= BF_M;
        t  = *(ulong*)((uchar*)&S[  0] + u);
/* 2 */ v = R >> BF_1; v &= BF_M;
        t += *(ulong*)((uchar*)&S[256] + v);
/* 3 */ u = R >> BF_2; u &= BF_M;
        t ^= *(ulong*)((uchar*)&S[512] + u);
/* 4 */ v = R << BF_3; v &= BF_M;
        t += *(ulong*)((uchar*)&S[768] + v);
        LL ^= P; LL ^= t;

(b) Instructions with parallel memory access (right column: parallel S-Box unit):

/* 1 */ u = R >> BF_0; u &= BF_M; a = &S[  0] + u;    t1 = *(ulong*)((uchar*)a);
/* 2 */ v = R >> BF_1; v &= BF_M; b = &S[256] + v;    t2 = *(ulong*)((uchar*)b);
/* 3 */ u = R >> BF_2; u &= BF_M; c = &S[512] + u;    t3 = *(ulong*)((uchar*)c);
/* 4 */ v = R << BF_3; v &= BF_M; d = &S[768] + v;    t4 = *(ulong*)((uchar*)d);
/* 5 */ T = t1 + t2; T ^= t3;
/* 6 */ T += t4; LL ^= P; LL ^= T;

Figure 4.12 – Functionality of the implemented hardware instructions presented as C-code.

For example, registers for the result parameters t1 to t4 in Figure 4.13 are actually not used by the application. Nevertheless, this WB functionality has been added to increase the reusability of these instructions. They are thus also adaptable to other combinations of S-Box accesses; e.g., it is possible to emulate a pure ADD instruction by just setting the third parameter of AX to NULL. By designing four instructions, each of which performs one S-Box access without any additional memory latency, and two further instructions for the utilization of the obtained S-Box values, it is possible to conform the coprocessor’s clock cycle length to that of the MIPS processor. The development of a proper bypass mechanism for the mentioned S-Box instructions enables the invocation of all six instructions in a direct sequence without any delay slots or stall cycles. Additional redundant output registers (which are not used in this case study) augment the flexibility necessary to apply these instructions in arbitrary orders. Therefore, other combinations of S-Box accesses and further utilization of their values are feasible. Although no experimental confirmation is at hand, it is anticipated that arbitrary algorithms based on Feistel networks could be implemented on the coprocessor, too.

4.3.3 Architecture Implementation

At the last stage of the design flow, the architecture implementation phase, the ADL architecture model is used to generate an architecture description on Register Transfer Level (RTL). The RTL-Processor-Synthesis is triggered to generate synthesizable HDL code for the complete architecture. Key numbers on hardware costs and performance parameters (e.g. design area, timing) are derived by running the generated HDL processor model through the standard gate-level synthesis flow. On this level of detail, the designer can use the obtained parameters to further optimize the architecture implementation. The architecture implementation in this case study consists of three phases: Synthesis 1, Synthesis 2, and Synthesis 3.

Synthesis 1: First results from the architecture defined in Exploration 3 (cf. Table 4.1) underline the potential for area improvements in the pipeline and the GPR file. In order to reduce chip size, two further optimizations are applied to the coprocessor.

Synthesis 2: In the first implementation iteration, unnecessary and redundant functionality is removed (e.g. MUL, ADD and SHIFT operations); for example, simple ADD instructions can be emulated by executing an AX instruction. Furthermore, the architecture is equipped with increment and decrement operations. These operations can be used for processing loop counters, instead of using 32-bit adders.

Synthesis 3: In the second implementation iteration, the number of GPRs is reduced from 15 to 9, and the number of ports of the GPR file is reduced as well. The remaining coprocessor architecture consists only of general purpose instructions for memory access, register copy, increment, decrement and XOR. Along with these, six dedicated instructions for symmetric encryption as well as nine GPRs and three SPRs to hold the S-Box values are implemented. Due to the simple and common structure of Blowfish, it is anticipated that this processor architecture is sufficient to implement encryption, decryption and key generation functions for symmetric block cipher algorithms based on Feistel networks in general.

1 t1 = S_BOX1 (u, R, S)
2 t2 = S_BOX2 (v, R, S)
3 t3 = S_BOX3 (u, S)
4 t4 = S_BOX4 (v, S)
5 T  = AX(t1, t2, t3)
6 LL = AXX(T, t4, P)

Figure 4.13 – Assembly code for the encryption procedure running on the coprocessor.

4.3.4 Experimental Results

The architecture parameters considered when making design decisions in the exploration phase are the number of executed clock cycles and the application code size. During the implementation phase, chip area and timing are taken into account. In Tables 4.1 and 4.2, the processed iterations are numbered from Exploration 1 to Exploration 3 in the exploration phase, and from Synthesis 1 to Synthesis 3 in the implementation phase.

simulation results    Exploration 1           Exploration 2                Exploration 3
                      (standalone             (first coprocessor           (second coprocessor
                      simulation)             approach, Figure 4.9)        approach, Figure 4.10)
code size (bytes)     531                     235 (-55.74%)                267 (-49.72%)
number of cycles      917844                  117546 (-87.19%)             176319 (-80.79%)

Table 4.1 – Simulation results in the architecture exploration phase.

architecture part          Synthesis 1                 Synthesis 2               Synthesis 3
                           (extended by                (eliminated redundant     (with reduced
                           encryption instructions)    instructions)             register ports)
total (kGates)             31.4                        25.8 (-17.83%)            22.2 (-29.30%)
pipeline (kGates)          21.1                        15.0 (-28.90%)            14.9 (-29.38%)
register file (kGates)     10.1                        10.5 (+3.96%)             7.1 (-29.70%)

Table 4.2 – Area consumption in the architecture implementation phase.

As Table 4.1 shows, employing the coprocessor developed in Exploration 3 results in an overall speed-up of the Blowfish encryption algorithm by a factor of five. The values are obtained from the execution of a complete encryption and decryption procedure in CBC mode on a 40-byte data stream; for the subkey generation performed in advance, a 16-byte key has been used. Although the number of necessary instructions in Exploration 2 is smaller than the corresponding number in Exploration 3, the timing constraint given by the MIPS could not be met by the model of Exploration 2, whereas the final model of the exploration phases provides a timing equivalent to the MIPS processor (200 MHz). Table 4.2 confirms the statements from Section 4.3.3. The initial synthesis results in a core with an area consumption of 31.4 kGates; it has been feasible to reduce this to 22.2 kGates. In the first implementation loop, the area consumption of the pipeline is reduced from 21.1 kGates to 15.0 kGates; in the second implementation loop, the area of the register file is decreased by 3.4 kGates to 7.1 kGates. All three architectures reach the required timing of 5.2 ns (200 MHz). Synthesis results have been obtained with a 0.18µm Complementary Metal Oxide Semiconductor (CMOS) library and the Synopsys Design Compiler Version 2003.06-SP1 (1.2 V, 25°C).

4.4 Concluding Remarks

This chapter has illustrated the typical iterative refinement flow of ASIP architectures using the Synopsys Processor Designer. This particular case study has used an IPv6 protocol stack implementation developed by Microsoft Research, applying the Blowfish block cipher algorithm for the IPSec encryption. A coprocessor has been developed that supports efficient implementation of symmetric block cipher algorithms by providing an application-specific ISA. This approach has led to efficient performance results compared to pure GPP execution, while requiring only moderate hardware effort for the coprocessor. However, in general such approaches ignore any issues concerning usability by a compiler. First, due to missing off-the-shelf compiler support for complex instructions, new hardware instructions often have to be applied as CKFs. This implies manual modification of source code, probably leading to significant overhead for large and/or unknown applications. Second, developing customized compiler optimizations for automatic utilization of the developed instructions is not considered during architecture exploration. To provide guarantees on the usability of developed hardware instructions, designers are required to

• thoroughly select a (set of) representative application(s) for a certain application domain,

• carefully analyze the common structure of representative applications,

• verify that identified application hotspots belong to the common structure of targeted applications.

In contrast to the presented approach to ISA design, the development of an application-specific ISA or ISA extension should involve the utilization of the developed instructions by a compiler, to ensure high-level programmability of the developed processor architecture. It is supposed that small, and therefore more reusable, instruction patterns can also lead to high speedups through their utilization by a compiler. This is particularly the case if the architecture design targets a set of multiple applications. This, indeed, implies the simultaneous design of compiler optimizations and hardware instructions during architecture exploration, as presented in the next chapter.

Chapter 5

Case study: Compiler-Driven Instruction Set Extension

Compiler-agnostic instruction set development faces multiple limitations on the applicability of the created instructions. In particular, if compilers are involved in the tool chain of the processor architecture, it is insufficient to simply create instructions for a single hotspot and presume that compiler designers will be able to utilize the instructions through sophisticated optimizations. This chapter presents a different approach towards ISE. Contrary to Chapter 4, ISE is regarded from the viewpoint of a compiler, particularly from a compiler optimization designed for network protocol processing. For this study, relevant network applications have been examined in advance with a focus on promising targets for code optimization. It has turned out that memory accesses in particular are among the most frequently occurring operations within network applications, and a significant part of these memory accesses is induced by function calls. On the one hand, functions provide an appropriate technique to reduce code size and to structure the program code of complex modern applications by encapsulating recurring portions of code. On the other hand, saving and loading the functions’ states, realized by several memory accesses, is the main cause of the overhead introduced by function calls. Function inlining is a well-known technique used in many compilers for GPPs, which replaces function calls with copies of the related function’s body; in this way, the function is turned into a high-level macro. Since the overhead associated with function calls (parameter passing, call and return instructions, saving and restoring register contents) is eliminated, function inlining tends to increase performance. However, function inlining also drastically increases code size and is therefore not always applicable for embedded system processors like NPUs, due to their very limited program memory. In Chapter 2, multithreading was presented as a key feature to hide memory access latencies and therefore to use the hardware of an NPU efficiently. It enables the architecture to process other streams while one thread is waiting for a memory access (or a different interrupt). Without

hardware support, the cost of switching between different contexts (threads, processes etc.) would dominate computation time. Thus, NPUs support multiple hardware threads and register files to avoid storing and reloading the entire state of the machine during a context switch. As a consequence, a “low overhead” calling convention is presented in this chapter, which utilizes free register files of hardware threads for function calls. The bedrock of this idea is that during code execution not all hardware threads may be utilized at the same time by the tasks of a given application. Thus, the available free resources can be used by the compiler to optimize the code for the present tasks. Naturally, this technique requires compile-time knowledge of the processor’s task load. For NPUs this information is usually at hand, since the use of OSs and dynamic task creation are uncommon in network processing. The technique exploits separate register files for function calls and thus eliminates the necessity of storing and reloading register contents in order to save the caller’s state. However, due to the limited number of available register files, it is not possible to execute every function call with a new register file. Therefore, appropriate candidates have to be selected to maximize the benefit of this technique. To demonstrate feasibility and performance gains, the proposed technique is integrated into a generated C compiler [153] based on the LISA ADL model of the Infineon PP32 network processor¹ [212]. In addition, for the implementation of this optimization, the LISA model of this industry-proven NPU is extended by new hardware instructions. The remainder of this chapter is organized as follows: Sections 5.1, 5.2 and 5.3 represent the core of this study. Although the presented compiler optimization was developed before the architecture was changed, the ISE of the underlying architecture is illustrated first, since explanations of the compiler optimization involve the added CIs. Therefore, the ISE of the driver architecture, Infineon’s PP32 (which is applied as a PE in Infineon’s Convergate architecture), is described in Section 5.1, with a subsequent introduction to calling conventions in general and a proposal of a “low overhead” calling convention for NPUs in Section 5.2. Finally, Section 5.3 gives an insight into the realization of the proposed calling convention inside the compiler; this includes an overview of the system and an explanation of a heuristic algorithm for the selection of appropriate functions. In the last sections, the obtained results (Section 5.4) and conclusions (Section 5.5) are presented.

5.1 Driver Architecture: Infineon Convergate

Infineon’s Convergate architecture belongs to the field of access NPUs, as it is primarily targeted at Asynchronous Transfer Mode (ATM) or Ethernet-based xDSL traffic on IP-DSLAM line cards.

¹ Unfortunately, this model does not allow for hardware generation, and Infineon does not disclose enough details of the PP32 architecture to enable a re-implementation. Therefore, no results on timing and area consumption are available for the described architecture modifications.

The Convergate family comprises two models, the Convergate-C [155] and the Convergate-D [156], each of which revolves around four 32-bit protocol processors (PP32) [212] containing local data and program memories. The processor array is further surrounded by several coprocessors for computation-intensive tasks like classification and table look-up, as well as fast on-chip memories. The PP32 (Figure 5.1) is a RISC-based core that provides a customized ISA for efficient IP packet processing and is programmable in C. The Informatik Centrum Dortmund (ICD) [6] has developed a compiler for the PP32 [264, 265]. The PP32 features several special instructions (e.g. bit-level instructions) and four Hardware Contexts (HC) for fast task switches in multi-threaded applications, each comprising a separate register file. Every register file contains 16 GPRs and several SPRs (e.g. the Program Counter (PC)), such that each task keeps its own PC and task identification as well as the PC and task identification of the program to be resumed after its termination (oldPC, oldTask).


Figure 5.1 – Block diagram of the PP32 architecture.

The simulation model of the PP32 data path comprises a four-stage pipeline and I/O ports, and supports instructions for

• data transfer (e.g. LDW, STW),

• branch and thread control (e.g. BRREG, RET), and

• logic and arithmetic (e.g. ADD, SUB).

It also supports predicated execution determined by certain flags. To implement a low overhead calling convention, three dedicated hardware instructions that provide the base functionality have been used extensively:

RUNREG: The run-operation RUNREG starts a new task with a given task number and a given 32-bit branch address. It assigns the branch address to the PC and switches to the register file indexed by the task number. At the same time, the old PC and task number are stored in additional SPRs that can later be used by STOP to return to the old task.

MVR2R: The move-operation MVR2R receives four parameters in registers: source register (RS) and task number (src_taskno), as well as destination register (RD) and task number (des_taskno). The operation transfers the value in RS of register file src_taskno into the register RD of register file des_taskno.

STOP: The STOP-operation does not receive any parameters. It reads the return PC and task number from certain SPRs and performs a jump back to the old task.
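A behavioral C sketch of these three operations, as one might model them in an instruction-set simulator, is given below; the register-file layout and all identifiers are illustrative assumptions, not the actual LISA model:

    #include <stdint.h>

    #define NUM_TASKS 4    /* four hardware contexts on the PP32 */
    #define NUM_GPRS  16   /* GPRs per register file */

    typedef struct {
        uint32_t r[NUM_GPRS];     /* general purpose registers      */
        uint32_t pc, old_pc;      /* current PC and saved return PC */
        uint32_t task, old_task;  /* task id and saved task id      */
    } context_t;

    static context_t ctx[NUM_TASKS];
    static int       cur;         /* index of the active context */

    /* RUNREG: start task 'taskno' at 'branch_addr'; the old PC and
     * task number are saved in the new context's SPRs for STOP. */
    static void runreg(uint32_t branch_addr, int taskno)
    {
        ctx[taskno].old_pc   = ctx[cur].pc;
        ctx[taskno].old_task = (uint32_t)cur;
        ctx[taskno].pc       = branch_addr;
        cur = taskno;             /* switch register file */
    }

    /* MVR2R: copy RS of register file src_taskno into RD of des_taskno */
    static void mvr2r(int rs, int src_taskno, int rd, int des_taskno)
    {
        ctx[des_taskno].r[rd] = ctx[src_taskno].r[rs];
    }

    /* STOP: read return PC and task number from the SPRs, jump back */
    static void stop(void)
    {
        uint32_t ret_pc   = ctx[cur].old_pc;
        int      ret_task = (int)ctx[cur].old_task;
        cur = ret_task;
        ctx[cur].pc = ret_pc;
    }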

The first two instructions, RUNREG and MVR2R, have been added to the ISA of the processor architecture in order to enable comfortable handling of hardware threads. In its original version, the Infineon PP32’s ISA did not provide any facilities to move register values between different hardware threads, which is a necessary prerequisite for comfortably implementing the proposed calling convention. Furthermore, existing instructions like RUN and RUNX are limited to jump addresses of at most 13 bits, which makes it very difficult to apply them to arbitrary context switches, where jump addresses may exceed a 13-bit length. The hardware overhead of the added instructions is negligible, since none of them involves arithmetic computations or memory accesses.

5.2 A Low Overhead Calling Convention for Network Processors

The calling convention [39] is a contract between two functions, the caller and the callee, specifying the procedure of switching from the caller to the callee and back. Whereas the caller is responsible for passing the callee’s function arguments in registers, the callee has to preserve the caller’s state, to guarantee the validity of the scopes of local variables. This is accomplished by extending each function with a prologue and an epilogue. These two code paragraphs enframe the body of every function and together perform the callee’s part of a calling convention: the prologue saves and the epilogue reloads the caller’s state, represented by the actual register values before the caller’s call of the callee. Figure 5.2 gives an example of a traditional calling convention.

Caller:

    Function Call:
        for (all parameters in register) do
            either load or move value into according register;
        end do
        LDI   _Callee >>16;
        ADDI  _Callee&0xffff0000;
        BRREG _Callee;

Callee:

    Prologue:
        STW caller's FP;
        MOVE caller's SP into callee's FP;
        for (all registers to be saved) do
            STW register on stack;
        end for
        SUBI SP SP frame-size;

    Epilogue:
        for (all registers to be loaded) do
            LDW register from stack;
        end for
        MOVE callee's FP into caller's SP;
        LDW caller's FP;
        RET to caller program code;

Figure 5.2 – Traditional calling convention.

After the caller has placed the necessary function arguments in the according registers and executed a call instruction (Figure 5.2), first the caller’s Frame Pointer (FP) is saved and afterwards the caller’s Stack Pointer (SP) becomes the callee’s FP in the callee’s prologue. Subsequently, all register values that represent the caller’s state have to be saved onto the stack, and the new SP for the callee is computed. The SP and FP of a function are special register values that denote the base addresses used for accessing local variables and parameters. In the epilogue of Figure 5.2, first all register values that represent the caller’s state are reloaded from memory; after this, the callee’s FP is moved into the caller’s SP and the caller’s FP is also reloaded from memory.

Caller:

    Function Call:
        for (all parameters in register) do
            either load or move value into according register;
        end do
        LDI    _Callee >>16;
        ADDI   _Callee&0xffff0000;
        RUNREG _Callee TaskNo;

Callee:

    Prologue:
        MVR2R caller's SP into callee's FP;
        SUBI SP FP frame-size;
        if ( function has return value ) then
            STW index of caller's register file;
        end if
        for (all parameters in register) do
            MVR2R parameter into new equivalent register;
        end for

    Epilogue:
        if ( function has return value ) then
            LDW index of caller's register file;
            MVR2R return value into caller's return register;
        end if
        STOP;

Figure 5.3 – Low overhead calling convention.

Figure 5.3 shows the exact procedure of the calling convention using a separate HC for the callee. Here, the caller uses RUNREG to invoke the callee.

In the callee’s prologue of Figure 5.3, the caller’s SP is moved into the callee’s FP and the new SP is computed by subtracting the current frame size from the FP. In case of a return value, the index of the caller’s register file has to be saved for later use in the epilogue. In a final step, all parameters residing in registers are moved to the new HC. In the epilogue of Figure 5.3, just the return value (if necessary) is moved to the caller’s register file, and the callee’s task (function) is stopped. Consequently, no load/store instructions have to be executed in order to save and reload the caller’s state. The trade-off between the two calling conventions lies in the quantitative relation between the number of parameters and the number of memory accesses necessary to save and reload the caller’s state: if the number of parameters is smaller than the number of necessary memory accesses, the low overhead calling convention introduces less overhead than the traditional one; vice versa, if more parameters have to be transferred between the register files than memory accesses are necessary to save and reload the caller’s state, the traditional calling convention is more advantageous.
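A small worked comparison makes the trade-off concrete; the register counts below are hypothetical and chosen only for illustration:

    /* Hypothetical callee: P = 2 register parameters, R = 6 registers
     * that would have to be preserved across the call.
     *
     * Traditional convention:   6 STW (prologue) + 6 LDW (epilogue)
     *                           = 12 memory accesses per call.
     * Low overhead convention:  2 MVR2R for the parameters
     *                           + 1 MVR2R for the return value
     *                           = 3 register moves, no memory access.
     *
     * Since P < R here, running this callee in a separate hardware
     * context pays off; a callee with many parameters and few live
     * caller registers would favor the traditional convention. */
    static int weigh(int header_len, int payload_len)
    {
        return header_len + payload_len;
    }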

5.3 Optimized Selection of Calling Conventions

Due to the limited number of available HCs (four in the case of the Infineon PP32 NPU), not every function can be executed in a separate HC. As a consequence, an appropriate set of candidate functions has to be identified by the compiler, such that the benefit of using separate HCs for function calls is maximized. The problem of identifying this set can be formulated as a covering problem on an application’s call graph, i.e. each node of the call graph has to be covered by an appropriate HC. Since graph covering is known to be NP-complete, solving this with an optimal algorithm might result in exponential runtime, making it impracticable for large applications. Hence, a heuristic solution is favored. In order to evaluate every function’s suitability for the previously described calling conventions (Section 5.2), a metric has been established to sort functions and to decide which convention is most applicable to each function. Based on this metric, the compiler is able to select, for each path in the call graph of the source application, the best candidates worth being executed in a separate HC.

5.3.1 System Overview

Figure 5.4 presents a complete system overview of the compiler framework that has been used to implement the low overhead calling convention. The simulation model of the Infineon PP32 (Section 5.1) has been developed with the Synopsys Processor Designer. The algorithm for candidate selection is implemented as a single engine that has been inserted into the backend of the PP32 compiler. Since the algorithm needs special register information, the engine is executed after the register allocator. Furthermore, runtime information is required to determine the number of dynamic calls² for each function.

² The number of dynamic calls of a function is strongly coupled to the type of input. Hence, a profiling-based optimization is no option in general. However, through the characteristics of network protocols, it is anticipated that the number of function calls depends only on the number of packets, since typically the same set of functions is executed for packets of the same QoS class. Therefore, dynamic calls can be determined for a representative set of packets. This is sufficient for a case study, but needs more in-depth evaluation in general.


Figure 5.4 – System overview of integrating candidate selection in the compiler.

To provide this information, the source application is profiled in advance by the freely available GNU profiler gprof. Gprof stores the obtained runtime information in files, such that the number of dynamic function calls can later be accessed by the PP32 compiler.

5.3.2 Candidate Selection

For a given application C source code, the technique requires the following input data:

• A static call-graph G = (V, E, r). G is a rooted directed graph, where each node vi ∈ V represents a function fi and each edge (vi, vj) ∈ E depicts a call dependency from vi to vj. In order to avoid infinite loops, recursive functions and functions within a call cycle have to be excluded from the algorithm. Top-level functions (i.e. the main function) and functions not called anywhere in the source code are not considered as candidates either. Furthermore, functions without a body, i.e. standard library functions like printf, are not taken into account.

• The number P (f) of parameters residing in registers for each function f.

• The number R(f) of registers to be saved and reloaded by the traditional calling convention for each function f.

• The number N of HCs available in the target architecture.

• The number D(f) of dynamic calls for each function f. This information is obtained by profiling.

For candidate selection, all paths pi = ⟨r, ..., vn⟩ of G have to be considered, starting at the root r, which corresponds to the main function in a C program. As one HC is always occupied by main, due to its liveness throughout program execution, N − 1 nodes (instead of N) have to be identified within every path pi, such that applying the low overhead calling convention to these nodes leads to the highest gains in code quality.

Let Q denote the set of all subsets q = {f1, ..., fN−1} of length N − 1 of a given path pi; that is, each fi in q corresponds to a particular function along a call-graph path. For a subset q, the benefit B(q) is defined as

B(q) = ∑_{f ∈ q} (R(f) − P(f)) · D(f)

B(q) measures the cost savings as the difference between the number of registers to be stored and reloaded (traditional calling convention) and the number of register parameters (low overhead calling convention), scaled by the number of dynamic calls of function f. The best selection of candidates obviously corresponds to determining the optimal subset q∗ ∈ Q, such that B(q∗) is maximal among all q ∈ Q. The heuristic algorithm recursively traverses the call-graph G in depth-first order, starting at the root, and identifies possible function candidates for execution within separate HCs. The recursion terminates at the leaf nodes, i.e. functions that do not contain any function calls. The strategy applied by depth-first traversal is, as its name implies, to traverse “deeper” into the call-graph whenever possible: unexplored edges of the most recently discovered vertex v are explored first. When all of v’s edges have been explored, the traversal “backtracks” to explore edges leaving the vertex from which v was discovered. This process continues until all vertices reachable from the original source vertex have been visited. If any undiscovered vertices remain, one of them is selected as a new source and the traversal is repeated from that source, until all vertices have been traversed. Figure 5.5 presents the pseudo code of the heuristic candidate selection. An essential part of the algorithm is the sorted_cands input, an array that keeps up to N−1 function nodes in ascending order of their benefits B. In case of an overflow, the node with the lowest benefit B is excluded from the array and the remaining nodes are reordered by their benefits. Using the sorted_cands array, the candidate selection takes place in two phases for each node. First, the node’s benefit is computed and, if positive, the node is inserted into sorted_cands. While traversing deeper, the node’s adjacency list is examined; consequently, sorted_cands keeps for each node the N−1 best previously selected candidates. Second, once the node’s adjacency list has been entirely examined and the node is still an element of sorted_cands, it is removed from the array and finally annotated in the call-graph G as being executed in a separate HC.

The algorithm results in an annotated call-graph G⁺ = (V⁺, E), where V⁺ designates the set of nodes V including a subset of marked nodes that represent the selected candidates for separate HCs. Since the IR of CoSy is a graph, the call-graph is a subset of the IR. Therefore, the information about selected candidates is available to succeeding compiler phases like the code emitter, which produces the assembly code for the calling conventions as described in Section 5.2.

algorithm depth_first_traversal
input:  Graph G = (V, E), Node v ∈ V, sorted_cands[1 ... N−1];
output: Annotated graph G⁺ = (V⁺, E)
begin
01 f = v;
02 if (f is not recursive) then
03     if (B(f) > 0) then
04         Sort f into sorted_cands;
05         Assign f to a register file;
06     end if
07 end if
08 for (all callees of f) do
09     depth_first_traversal(G, callee, &sorted_cands);
10 end for
11 if (f in sorted_cands) then
12     delete f from sorted_cands;
13     delete assignment of register file;
14     annotate selection of f in G;
15 end if
end

Figure 5.5 – Pseudo code of the candidate-selection algorithm.
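For illustration, the heuristic can be implemented along the following lines in C. This is a minimal sketch: the call-graph representation, all field names and the omitted register-file bookkeeping (lines 05/13 of Figure 5.5) are assumptions, not the actual compiler engine:

    #include <stddef.h>

    #define MAX_CANDS 3   /* N - 1 candidates for N = 4 hardware contexts */

    typedef struct Node {
        const char   *name;
        int           benefit;    /* B(f) = (R(f) - P(f)) * D(f), precomputed */
        int           recursive;  /* recursive functions are never selected  */
        int           selected;   /* annotation: execute in a separate HC    */
        struct Node **callees;    /* adjacency list of the call-graph        */
        size_t        n_callees;
    } Node;

    /* sorted_cands keeps up to MAX_CANDS nodes in ascending order of
     * benefit; on overflow the node with the lowest benefit is dropped. */
    typedef struct {
        Node  *slot[MAX_CANDS];
        size_t used;
    } SortedCands;

    static void cands_insert(SortedCands *sc, Node *f)
    {
        if (sc->used == MAX_CANDS) {          /* overflow: drop weakest */
            if (f->benefit <= sc->slot[0]->benefit)
                return;
            sc->used--;
            for (size_t i = 0; i < sc->used; i++)
                sc->slot[i] = sc->slot[i + 1];
        }
        size_t i = sc->used++;                /* insertion-sort step */
        while (i > 0 && sc->slot[i - 1]->benefit > f->benefit) {
            sc->slot[i] = sc->slot[i - 1];
            i--;
        }
        sc->slot[i] = f;
    }

    static int cands_remove(SortedCands *sc, Node *f)
    {
        for (size_t i = 0; i < sc->used; i++) {
            if (sc->slot[i] == f) {
                for (size_t j = i + 1; j < sc->used; j++)
                    sc->slot[j - 1] = sc->slot[j];
                sc->used--;
                return 1;
            }
        }
        return 0;   /* f was displaced by a better candidate meanwhile */
    }

    /* depth-first traversal as in Figure 5.5 */
    static void depth_first_traversal(Node *f, SortedCands *sc)
    {
        if (!f->recursive && f->benefit > 0)  /* lines 02-07 */
            cands_insert(sc, f);
        for (size_t i = 0; i < f->n_callees; i++)
            depth_first_traversal(f->callees[i], sc);  /* lines 08-10 */
        if (cands_remove(sc, f))              /* lines 11-15 */
            f->selected = 1;                  /* annotate in G */
    }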

5.3.3 Example

Consider the static call-graph of Figure 5.6, with four function nodes and the corresponding benefit B(f) annotated for each node f in the graph. The corresponding traversal of the graph is presented on the right-hand side of Figure 5.6, where the nodes represent the respective states of the sorted_cands array (Figure 5.5). The number of available HCs N is 3; hence the number of candidates is 2.

The algorithm starts at the root main, takes the first available callee F1, inserts it into sorted_cands, proceeds to F2 and inserts this node into sorted_cands as well. Arriving at node F3, both slots of sorted_cands are occupied with predecessors of F3, such that the “weakest” node F2 is excluded while sorting F3 into sorted_cands. Since F3 has an empty adjacency list, no further callees have to be visited on this path. The algorithm backtracks to a predecessor with a non-empty adjacency list and eliminates the passed nodes from sorted_cands. These functions are at the same time selected for execution in a separate HC. The corresponding nodes in the graph traversal are marked dark to emphasize the final selection of a function by the algorithm.

(Benefits in Figure 5.6: B(F1) = 3, B(F2) = 2, B(F3) = 5, B(F4) = 3.)

Figure 5.6 – Example call-graph and traversal with corresponding states of sorted_cands.

5.3.4 Algorithm Complexity

A depth-first traversal has been chosen due to its ability to compute an optimized solution in a short runtime: lines 1–7 and lines 11–15 of Figure 5.5 take time O(V), excluding the time to execute the recursive calls for the adjacency list of the actual vertex v in lines 8–10. The loop in lines 8–10 is executed |Adj[v]| times, because it is called for every callee of the actual node v. Consequently, ∑_{v∈V} |Adj[v]| = O(E) and therefore the total cost of the algorithm is O(V + E). Since the MVR2R instruction (Section 5.1) receives all of its operands in registers, the corresponding runtime HC can be determined dynamically for each function call. Therefore, no global dependencies between function calls inside different paths of the call-graph exist.

5.4 Experimental Results

To evaluate the proposed technique, application studies for several typical network applications provided by Infineon have been performed. The benchmark suite comprises an IPv6 router, an Ethernet router and a test program for the signal ports. As presented in Table 5.1, the benchmarks contain between 1180 and 2223 lines of code; the number of functions and the portion of selected functions are also shown in Table 5.1. In its main part, Table 5.1 presents simulation results for the network applications. All results have been obtained from the same compiler, once with candidate selection enabled and once with it disabled. The results represent the relative speedup of the optimization, given in percent. As with function inlining, the optimization relies strongly on the application’s partitioning into functions; consequently, the two optimizations are not orthogonal and cannot be evaluated independently. Thus, function inlining has been switched off at all times, to obtain more reliable results when examining a larger set of functions. The obtained results are therefore relative values based on the available set of functions.

Results of Benchmarks

                    IPv6 Router    Ethernet Router    Port Access    Average
lines of code       2075           1180               2223           –
functions           31             28                 29             –
sel. functions      26             25                 21             –
speedup (1)         +13.1%         +9.7%              +16.6%         +13.1%
speedup (2)         +17.5%         +13.0%             +21.5%         +17.3%
speedup (3)         +20.7%         +15.5%             +25.0%         +20.4%
speedup (5)         +24.9%         +18.8%             +29.3%         +24.3%
speedup (10)        +30.2%         +23.1%             +34.6%         +29.3%
code size           −2.7%          −2.2%              −2.0%          −2.3%
LOAD                −36.6%         −33.8%             −41.3%         −33.9%
STORE               −43.1%         −36.0%             −42.8%         −40.6%

Table 5.1 – Overview of experimental results for low overhead calling convention.

The values have been obtained for different configurations of the memory’s wait cycles. Assuming that, apart from ideal memories, every memory produces at least a single wait cycle per access, the memory model has been configured for wait cycles between 1 and 10, which are given in parentheses for each row. Even for an extremely fast memory (1 wait cycle), a significant speedup

(13.1% on average) has been measured. Naturally, the speedup grows with more realistic wait cycle counts (e.g. up to 29.3% for 10 wait cycles). A secondary optimization effect is an average code size reduction of 2.2%, due to the lower number of instructions needed for context switching. Hence, compared to the related interprocedural optimization of function inlining, the speedup does not have to be paid for with increased code size. The last part of Table 5.1 highlights the relative reduction of dynamic memory accesses for all benchmarks: LOAD instructions have been reduced by 33.9% and STORE instructions by 40.6% on average. These results will most likely also affect the power consumption of an NPU, because memory accesses are usually among the most power-consuming hardware instructions.

5.5 Concluding Remarks

Contrary to Chapter 4, this case study has presented an ISE for an industry-proven NPU from the viewpoint of a compiler optimization. Due to the utilization of the hardware instructions by a compiler, the instructions have been automatically reusable for arbitrary applications and multi-threaded architectures. Furthermore, a very good benefit has been obtained by investing little hardware effort, thus proving the effectiveness of such approaches. The presented methodology has started with an intensive analysis of the targeted applications regarding common characteristics in terms of repetitive operations. The analyzed applications have not featured a small number of unambiguous hotspots, but rather a set of functions that are executed with similar frequency, depending on the processed IP packet. Based on these profiling results, memory I/Os and address arithmetic have been identified as the dominant operations inside network applications. The knowledge of the characteristic operations inside the applications has then been applied to develop a compiler optimization tailored to computations based on these characteristic operations. At the same time, an ISE has been developed in order to enable the implementation of the compiler optimization. This distinct ISE approach has utilized the fact that many NPUs are equipped with hardware multithreading support by means of different HCs. The added hardware instructions have featured only a negligibly small hardware overhead, since no arithmetic computation or memory access is involved. This ISE has enabled the implementation of a novel compiler optimization, which exploits HCs that are not fully utilized by the tasks of an application. It attempts to reduce the overhead of high-level function calls, which largely results from memory accesses in the prologue and epilogue of each function. The technique has been implemented in a C compiler for the Infineon PP32 NPU and has been successfully tested on different typical NPU applications. As a result of this case study it is concluded that the proposed code optimization is very effective, as it leads to a significant speedup of the executables. As a secondary effect, it also results in a small code size reduction. Although no experimental confirmation is present, it is anticipated that a significant saving in power consumption results as well, due to the large reduction of memory accesses via LOAD and STORE instructions. The presented case study underlines the effectiveness of compiler-aware ISE. Due to compiler utilization, the developed CIs are automatically available for arbitrary applications and not limited to a small number of hotspots of a single application, as in Chapter 4. ISE in combination with appropriate compiler techniques allows for the development of reusable CIs, such that the architecture design affects a broad range of applications; a longer time-in-market is the consequence. Therefore, it is supposed that a combination of architecture exploration (cf. Chapter 4) and compiler-driven ISE, in an automated fashion, is a promising approach for an effective design methodology for programmable processor architectures. In general, application-specific ISE requires sophisticated compiler support, since customized ISAs often contain instructions too complex to be utilized by traditional compiler techniques. Yet, a fully automatic utilization of instructions is a prerequisite for the programmability of a given architecture.

Chapter 6

Automatic Compiler-Driven Utilization of Custom Instructions

Figure 6.1 – Survey of iterative CiL architecture exploration flow: (a) natural ADL-based CiL, (b) advanced CiL.

As concluded from the previous chapters, the extremely heterogeneous landscape of NPU architectures requires flexible programming tools to explore many architecture alternatives within a short time (Chapter 2). Such architectures are usually developed in an iterative manner, during which the architecture is incrementally refined (Chapter 4). Obviously, compilers have to be easily retargetable in order to support arbitrary types of ISAs during the processor development. In this context, many design platforms including retargetable compilers have been developed in the past [140, 147, 152, 174]. Figure 6.1(a) summarizes the iterative design flow already presented in Chapter 4. Design flow iterations usually comprise the compilation, simulation and profiling of C/C++ applications for a certain virtual architecture prototype. Based on profiling results, bottlenecks are identified, the instruction pipeline is fine-tuned and CIs are added to stepwise improve the architecture’s efficiency. New instructions are declared to the compiler in order to evaluate their benefit for the targeted applications during the next compilation-simulation cycle.


Whereas simple instructions can be declared within the tree grammar of the code-selector, inherently parallel instructions, referred to as MOIs, are typically implemented as CKFs or inline assembly, as they cannot be targeted by the compiler’s code-selector. This implies the manual modification of application and compiler, respectively, which may lead to high overhead for large applications. Since usually only the bottlenecks themselves are implemented through MOIs, further utilization of the developed MOIs remains unexplored, and reusability towards different applications cannot be ensured. MOIs naturally comprise several operations executed in parallel, like ADD, MUL or MAC, and produce multiple results at the same time. It is exactly this property that distinguishes MOIs from other forms of instructions: whereas simple instructions (Figure 6.2(a)) and chained instructions (Figure 6.2(c)) can be represented as tree patterns in the IR, MOIs always have a fan-out larger than one (Figure 6.2(b)).


Figure 6.2 – Examples of IR-patterns for instructions.
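To make the fan-out property concrete, consider the following illustrative C fragment (not taken from the thesis benchmarks): a combined divide/remainder MOI could cover both operations at once, producing two results, which no single tree pattern can express:

    /* Quotient and remainder share the operands x and y. A DIVMOD-style
     * MOI computes both in one instruction; its IR-pattern has a fan-out
     * of two, so a tree-parsing code-selector must instead cover q and r
     * with two separate instructions. */
    void divmod(unsigned x, unsigned y, unsigned *q, unsigned *r)
    {
        *q = x / y;
        *r = x % y;
    }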

In the area of DSPs, MOIs are already a natural way to increase code performance. Prominent examples are instructions that support accessing different memory banks at the same time. For example, in the Sony pDSP processor, an instruction such as PLDXY r1, @a, r2, @b can load variables a and b from memory into registers r1 and r2 simultaneously [282]. Such instructions can access memory faster by performing loads and stores in parallel on partitioned memory banks using parallel data and address buses. Designers of embedded DSPs prefer such techniques over more complicated hardware mechanisms. By encapsulating parallel operations in hardware instructions, impressive speedups can also be obtained for protocol processing without forfeiting too much flexibility for the implementation of applications [238]. However, the increasing complexity of network protocols makes it prohibitively difficult to implement CIs in assembly language. Sophisticated compiler support is therefore strongly demanded in order to guarantee fast application development and, consequently, consumer acceptance of a processor. This poses the problem of handling MOIs in a compiler, which is currently not possible, since typical compilers rely on tree parsing [36] during the code-selection phase. If the target processor architecture contains inherently parallel instructions, the code produced by tree parsing may strongly deviate from the optimum, because the scope of tree parsing is restricted to DFTs.

Consequently, instructions comprising functionality that exceeds the scope of one DFT cannot be matched by tree parsing, as their associated IR-patterns feature a fan-out larger than one. In order to overcome this, the scope has to be extended at least to the size of a basic block, which is then represented as a DFG. To support a maximum of different compilation frameworks and to overcome the compiler-related limitations within architecture exploration, this chapter describes a graph-based code-selection algorithm incorporated into a retargetable code-generator generator called Cburg. Existing code-generator generators (such as Burg [117], Iburg [115] and Olive [256]) produce code-selector descriptions in C based on a tree grammar. Since this tree grammar only comprises tree patterns, the produced code-selectors lack the capability to exploit MOIs. Cburg's code-selection algorithm, however, is capable of matching arbitrarily complex instructions, particularly MOIs, which cannot be handled by current off-the-shelf compiler techniques. The tool bears many similarities to Olive [256], which takes a configuration file as input and produces a set of data structures and code-selection functions for a certain target ISA. In contrast to Olive, Cburg's code-selection algorithm works on DFGs rather than on DFTs, as described in Section 6.3. The input file contains the description of the target ISA in terms of IR-patterns and a set of functions necessary to access the compiler's IR. The IR-patterns represent the grammar that is used to identify patterns in the compiler's IR and map them to adequate assembly language or lowered IR. Rules inside this grammar take the form of both simple, tree-shaped Rules (Section 6.2, Equation 6.1) and complex, graph-shaped Rules (Section 6.2, Equation 6.2). The generated data structures and functions provide the complete methodology based on the described algorithm. Compiler designers can use these to comfortably implement a code-selection algorithm for arbitrary target machines with MOIs. Thereby, it is possible to declare every kind of hardware instruction to the compiler and thus render the manual modification of the source application and/or compiler unnecessary (Figure 6.1(b)). The remainder of this chapter is organized as follows: related work on code-selection for embedded processors is discussed in Section 6.1, and a general system overview of the applied compiler is given in Section 6.2. Section 6.3 explains the developed heuristic algorithm to exploit MOIs during the code-selection phase. Subsequently, experimental results are presented in Section 6.4, and Section 6.5 concludes the chapter.

6.1 Related Work

Due to the limitations of tree parsing, several contributions have been published in the past dealing with a generalization of tree-based code-selection. In [106], optimal graph-based code-selection is described for regular data path architectures without ILP. ASIPs, however, mostly feature irregular ISAs comprising parallel instructions. Especially in the domain of signal processing, where it is most natural to have MOIs, further approaches have been developed for irregular architectures.

[47] presents a solution to effectively break up the DFG into expression trees while taking irregular register architectures into account. It is shown in [185] that tree parsing leads to suboptimal results in the presence of MOIs. To overcome this, a new technique is proposed, based on splitting instructions into register transfer primitives and recombining these primitives in an optimal manner using ILProg. [189] solves the same problem by augmenting a binate covering formulation. Unfortunately, [189] does not provide results on the applicability of the algorithm. Furthermore, both approaches [185, 189] feature exponential runtime in the worst case, depending on the size of the considered DFGs, which might be disadvantageous for large benchmarks. [190] proposes a graph-based instruction selection methodology that features a flexible pattern representation style, including data-flow, control-flow and mixed data/control-flow information. In [103], a code-selection methodology is described for complete functions in order to take control flow into account. A Static Single Assignment (SSA) representation is used as IR, and pattern matching is solved numerically. As code-selection was tested for a VLIW-processor, ILP in terms of MOIs is not an issue for this approach. Another approach to graph-based code-selection is presented in [88]. Herein, code-selection algorithms for hardware accelerators based on unate covering are targeted. As only singly connected IR-patterns are considered, MOIs are not in the scope of [88]; in contrast, MOIs can also comprise multiple independent patterns on IR-level. A recent approach to graph-based instruction selection has been published in [175]. It is based on a linear-time instruction selection algorithm called NOLTIS. [175] claims that the algorithm produces optimal results in 99% of the cases. However, NOLTIS only considers normal tree patterns for instruction selection, which makes it difficult to apply to application-specific ISAs containing arbitrarily complex instructions.

6.2 System Overview

Figure 6.3 provides an overview of the tool flow based on Cburg. The presented code-generator generator Cburg extends the existing concept of Olive. Olive is based on Iburg and Twig and implements a tree-based code-selection algorithm applying dynamic programming and tree-pattern matching. Cburg consequently consumes an input file that contains a grammar description of the underlying ISA and related C-functions implementing support for the code-selector's actions. This input file features a fixed structure that revolves around four separate parts: definitions, declarations, rules and programs. Definitions and programs both comprise C-functions providing certain functionality of the code-selector actions (c.f. Appendix B). The core of the input file is contained in the sections called declarations and rules. Within the declarations-section, terminal and nonterminal symbols that are applied inside the rules are declared in advance.

Figure 6.3 – System overview of code-selection integration in the compiler.

For every hardware instruction, rules exist in the tree grammar (c.f. Section 3.1.3) of the form:

$$NT \rightarrow \underbrace{opcode(op_1, \ldots, op_n)\;\{costs\} = \{actions\}}_{\text{Simple Rule}} \qquad (6.1)$$

Since this representation is not sufficient for MOIs, Cburg's grammar extends this concept in the way that rules can also have the form:

$$NT_1, NT_2, \ldots \rightarrow \underbrace{opcode_1(op_1, \ldots)}_{\text{Split Rule}},\; \underbrace{opcode_2(op_1, \ldots)}_{\text{Split Rule}},\; \ldots \qquad (6.2)$$

In Rule-Specification (6.2), multiple nonterminals NT_1, NT_2, ... are produced by a so-called Complex Rule. Such a Complex Rule is composed of several rules of the form presented in (6.1), called Split Rules. For simplicity, the costs- and action-sections are not shown in Rule-Specification (6.2). Cburg finally emits a set of C-functions that allow for a comfortable implementation of a graph-based code-selection algorithm. A detailed explanation of CBurg's grammar specification is provided in Appendix B.
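To make the two rule formats concrete, the following fragment sketches how Rules (6.1) and (6.2) might be written in a Cburg input grammar. It is a minimal sketch modeled on the snippet visible in Figure 6.3; the precise syntax is defined in Appendix B, and the nonterminal names, cost values and the action function lircreate are illustrative only.

/* Simple Rule (6.1): one nonterminal, one tree pattern, costs, actions */
reg -> Plus(reg, reg)
    { cost = 1; }
  = { /* emit an ADD or lower the IR accordingly */ };

/* Complex Rule (6.2): two Split Rules whose patterns are executed by a
   single MOI; both nonterminals are produced simultaneously */
reg, reg -> Mul(reg, imm), Plus(reg, reg)
    { saved_cost = 2; cost = 1; }
  = { lircreate(psc, "complex_instruction"); };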

6.3 Code-Selection Algorithm

Before explaining the algorithm, the terminology used throughout the remainder of this chapter is defined.

• Rule: represents an instruction pattern in the tree grammar

• Simple Rule: represents a typical tree pattern such as ADD, MUL or MAC

• Complex Rule: consists of several Simple Rules, producing multiple nonterminals

• Split Rule: is a Simple Rule that is a part of a Complex Rule

As shown in Figure 6.4, the presented algorithm consists of four major phases: Split Rule Extraction, Candidate Enumeration, Candidate Set Selection and Pre-Cover.


Figure 6.4 – Structure of code-selection algorithm.

First, all Complex Rules in the tree grammar are analyzed and divided into their Split Rules (Section 6.3.1). These Split Rules are used to find Candidate-nodes in the IR, representing operations that are part of some MOI. The labeler (Section 6.3.1) annotates Simple Rules and all Split Rules at each Candidate-node where they match. After annotating the Rules, a Split Rule Map (SRM) is created containing Split Rules and related IR-nodes. Using this map, Candidate-node Sets (CS) are identified for every Complex Rule, such that a clear picture exists about all possible covering solutions. During the subsequent CS-selection phase (Section 6.3.2), the data flow of the different CSs is examined in order to eliminate invalid CSs. Furthermore, the remaining CSs are evaluated by a new cost metric, which has been introduced to compare the costs of Simple and Split Rules. In case of overlapping CSs, additionally the most valuable CSs have to be selected. This decision is mapped to the problem of finding the Maximum Weighted Independent Set (MWIS) of a graph. Finally, the resulting CSs for all Complex Rules are selected and checked as to whether every CS can really be covered by its associated Complex Rule (Section 6.3.3). At the end, a normal cover algorithm can process the output of the presented heuristic, emitting valid assembly code.

6.3.1 Candidate Enumeration

In contrast to normal tree-parsers, which annotate at each node for every nonterminal only those Rules with minimal costs (c.f. Section 3.1.3), the developed labeler additionally annotates all matched Split Rules at each node, regardless of their costs and nonterminals. For this, Split Rules are extracted from each Complex Rule, and Rule numbers are assigned to them, acting as identifiers. Figures 6.5(a) and 6.5(b) give an example of this procedure.

Figure 6.5 – Example ISA grammar plus annotated IR snippet during Candidate-node enumeration: (a) example ISA grammar with Simple Rules 1–5 and Complex Rules 10 (Split Rules 11, 12) and 20 (Split Rules 21, 22); (b) example Rule-annotation of an IR snippet with nodes numbered 1–13.

The grammar in Figure 6.5(a) features Simple Rules (numbered in black) and Complex Rules (numbered in bright) as well as the corresponding Split Rules (numbered in gray). This grammar is applied to the IR snippet presented in Figure 6.5(b), in which the appropriate Rule numbers are annotated at the according IR nodes (marked by bright circles). At nodes 7 and 8, both Split Rules and Simple Rules are annotated, each of which produces the same nonterminal. For the sake of simplicity, costs and produced nonterminals are not presented within this figure.

After the labeling phase, an SRM is created that stores node-to-Split-Rule relations. Succeeding phases can use this information to determine CSs for every Split Rule; e.g., node 7 is a Candidate-node for Split Rule 11.
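A possible data structure for such an SRM is sketched below in C; the concrete layout inside Cburg is not given in this chapter, so the names and types are hypothetical.

#include <stdlib.h>

/* One recorded match of a Split Rule at an IR node. */
typedef struct {
    int rule_id;     /* Split Rule number, e.g. 11     */
    int complex_id;  /* owning Complex Rule, e.g. 10   */
    int node_id;     /* matched Candidate-node, e.g. 7 */
} SrmEntry;

typedef struct {
    SrmEntry *e;
    int count, cap;
} SplitRuleMap;

/* Record one Split-Rule match found by the labeler (no error handling). */
static void srm_add(SplitRuleMap *m, int rule, int cplx, int node) {
    if (m->count == m->cap) {
        m->cap = m->cap ? 2 * m->cap : 16;
        m->e = realloc(m->e, m->cap * sizeof *m->e);
    }
    m->e[m->count++] = (SrmEntry){ rule, cplx, node };
}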

6.3.2 Candidate Set Selection

During the CS-selection phase, the best CSs for all MOIs are identified and evaluated. For this purpose, the information of the SRM is used. A CS of a MOI contains all valid combinations of nodes for this instruction. A valid node combination designates a set of Candidate-nodes, each of which is matched by a different Split Rule of the same Complex Rule. Since Candidate-nodes can also be complete tree patterns like Multiply-Accumulate (MAC), only the Root-nodes are stored inside a CS; Root-nodes are those nodes which have no successor node inside their patterns. For the example given in Figure 6.5(b), the only CS for Complex Rule 10 is {7, 8}. The number of CSs is further reduced by checking data dependencies in the DFG between Candidate-nodes and eliminating CSs containing dependent nodes. The remaining CSs are evaluated in order to maximize the benefit of code-selection. The CS-evaluation takes place in relation to the other Rules available for a certain node. The basis for this evaluation is obviously the cost metric of the Rules. Unfortunately, the typical cost evaluation of tree patterns is insufficient for the evaluation of MOIs: traditional cost metrics only consider fixed costs of Rules, like the number of emitted instructions. Applying Complex Rules, however, affects several statement trees and therefore causes different costs in different statement trees at the same time, depending strongly on the ongoing matching situation. Such costs can be described as dynamic or opportunity costs, which are orthogonal to the fixed costs of Simple Rules.

Cost Computation for Complex Rules

If an IR-node can be matched by a Simple Rule (rule_smpl) and a Split Rule (rule_split) that reduce to the same nonterminal, Cburg compares the costs of the Simple Rule with the costs of the Split Rule to determine the best covering solution for this node. The cost computation of a Complex Rule (rule_cplx) consists of two parts: Saved Costs and Duplication Costs.

Saved Costs C_saved(CS) designates the difference of fixed costs between a covering solution with Simple Rules and a solution with a Complex Rule for a certain CS:

$$C_{saved}(CS) = \Big(\sum_{node \in CS} C_{fix}(rule_{smpl})\Big) - C_{fix}(rule_{cplx})$$

where C_fix describes the fixed costs of arbitrary Rules (simple/complex) producing the same nonterminal.
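As a worked instance, consider the CS {7, 8} of Figure 6.5(b) covered by Complex Rule 10, and assume unit fixed costs for the matched Simple Rules as well as for the Complex Rule; these cost values are an assumption, but they are consistent with the result C_saved = 1 shown in Figure 6.7(a):

$$C_{saved}(\{7, 8\}) = \big(\underbrace{1}_{node\ 7} + \underbrace{1}_{node\ 8}\big) - \underbrace{1}_{rule_{10}} = 1.$$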

Duplication Costs C_dup(CS) are produced by the appearance of CSEs inside a node pattern in the IR that can be covered by a Split Rule. Split Rule Patterns (SRP) can express arbitrary tree patterns like MAC or other chained instructions, consisting of several sub-nodes. The set of sub-nodes can be separated into the Root Node (R) and the Child Nodes (K). In contrast to the Root Node, whose result is at the same time the result of the SRP, every Child Node has exactly one successor inside the pattern of the Split Rule. If a CSE is a Child Node of an SRP, its result is no longer available for successor nodes outside the SRP, since chained instructions compute only one result. Therefore, the node has to be duplicated in order to compute the result and maintain the validity of the produced assembly code. Accordingly, it is defined:

$$K(CS) = \{node \mid node \in SRP \wedge node \neq R\}$$
$$CSE(CS) = \{node \in K(CS) \mid node \text{ is a CSE}\}$$

$$C_{dup}(CS) = \sum_{node \in CSE(CS)} \; \sum_{fanout(node)} C_{fix}(rule_{smpl})$$

Here, fanout denotes the number of outgoing edges of a node ∈ K(CS) leaving the SRP. For each of these edges, C_fix(rule_smpl) represents the costs of the according Simple Rule that produces the required nonterminal for the successor node.

Opportunity Costs C_opp(CS) of a CS denote the overall cost difference between the application of a Complex Rule and an alternative covering of the nodes in the CS by a set of Simple Rules:

$$C_{opp}(CS) = C_{saved}(CS) - C_{dup}(CS).$$

Split Rule Costs: Finally, the costs of a Split Rule are calculated from the Complex Rule's fixed costs, averaged over the Split Rules of its CS, plus the opportunity costs:

$$C(rule_{split}) = \frac{C_{fix}(rule_{cplx})}{|CS|} + C_{opp}(CS)$$

where C_fix(rule_cplx) are the fixed costs of the Complex Rule and |CS| is the number of Split Rules in the CS.
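The following C sketch restates these cost terms; the averaging of the Complex Rule's fixed costs over the |CS| Split Rules follows the reconstruction above, and all function and parameter names are illustrative.

/* C_dup(CS): for every CSE child node of an SRP, the Simple Rule costs
   needed to recompute its result, once per outgoing edge leaving the SRP. */
double c_dup(int n_cse, const int fanout[], const double fix_simple[]) {
    double dup = 0.0;
    for (int i = 0; i < n_cse; i++)
        dup += fanout[i] * fix_simple[i];
    return dup;
}

/* C_opp(CS) = C_saved(CS) - C_dup(CS) */
double c_opp(double saved, double dup) { return saved - dup; }

/* C(rule_split): averaged fixed costs of the Complex Rule plus the
   opportunity costs of its CS. */
double c_split(double fix_complex, int cs_size, double opp) {
    return fix_complex / cs_size + opp;
}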

Example: The cost computation shall be clarified by means of an example. Consider the ISA grammar presented in Figure 6.5(a). This grammar is extended by a new Complex Rule, shown in Figure 6.6(a). The Rule consists of two SRPs: one multiply and one MAC. Both SRPs are executed in a single pipeline cycle. For the already presented IR snippet of Figure 6.5(b), this results in newly annotated Split Rules, as shown in Figure 6.6(b). Three CSs for the Complex Rules 10, 20 and 30 exist: {7, 8}, {4, 8} and {7, 5}. The CS {4, 8} is not applicable, since both nodes are data-dependent on each other. For the remaining two, the cost computation is illustrated in the following figures. Figure 6.7(a) explains the cost computation for CS {7, 8}. Both involved SRPs 11 and 12 are highlighted bright. Since both are executed within the same pipeline cycle, the application of Complex Rule 10 results in a benefit of one cycle compared to a covering based exclusively on Simple Rules for nodes 7 and 8. Next to it, the cost computation for CS {7, 5} is described in Figure 6.7(b), which includes the computation of Duplication Costs as well. Herein, the involved SRPs are marked gray. In Figure 6.7(b), node 5 is labeled with SRP 32.

Figure 6.6 – Example of ISA grammar extension and labeled IR graph: (a) additional Complex Rule 30, consisting of SRPs 31 and 32; (b) IR snippet with annotated Rules.

Since SRP 32 describes a MAC, two cycles can be saved compared to a separate covering of the multiplication (node 5) and the addition (node 8) by Simple Rules 4 and 5, respectively. Yet, this benefit is reduced because node 8 is a CSE, which has to be copied, as its result is not available if it is covered by SRP 32. The copied patterns are highlighted orange in Figure 6.7(b). Due to this, both alternatives, Complex Rule 10 and Complex Rule 30, result in the same benefit.

Figure 6.7 – Example of cost computation during code-selection: (a) for CS {7, 8}: C_opp = C_saved − C_dup = 1 − 0 = 1; (b) for CS {7, 5}: C_opp = C_saved − C_dup = 2 − 1 = 1.

Benefit Optimization

After the CS-validation, the CSs cannot be further reduced, and the remaining nodes have to be covered by the according Complex Rules. Nevertheless, intersections among CSs may occur. This is the case when a node is a candidate for several MOIs and consequently is listed in several CSs. Such a node is called a Shared Node. Since nodes can only be covered by one Rule, the covering decision for Shared Nodes has to ensure that the benefit in terms of costs is maximized. The problem of finding an optimal solution for the covering of Shared Nodes can be reduced to the problem of finding a MWIS in a weighted undirected graph G = (V, E, W) without loops and multiple edges, where W is a unary operation V → R assigning each vertex v ∈ V a vertex weight W(v). An independent set in a graph is a collection of vertices that are mutually non-adjacent. The problem of finding an independent set of maximum cardinality is one of the fundamental combinatorial problems and known to be NP-complete, even when all nodes have uniform weights [130]. Due to this, a heuristic called GWMIN2 is applied [236]. Generally, GWMIN2 belongs to the class of minimum-degree greedy algorithms that construct an independent set by selecting some vertex of minimum degree, removing it and its neighbors from the graph, and iterating on the remaining graph until it is empty. Such algorithms run in linear time in the number of vertices and edges. GWMIN2 selects in each iteration a vertex v ∈ V such that

$$\frac{W(v)}{\sum_{w \in N_G^+(v)} W(w)}$$

is maximized. In [236] it is proven that the resulting independent set has at least a weight of

$$\sum_{v \in V} \frac{W(v)^2}{\sum_{u \in N_G^+(v)} W(u)}.$$

The notation N_G(v) designates the neighborhood of a vertex v in G, and N_G^+(v) the set {v} ∪ N_G(v).

Application of MWISP for Benefit Maximization: For benefit maximization in the presence of overlapping MOIs, a graph G = (V, E, W) is constructed. In G, every vertex v ∈ V represents a Complex Rule, and the associated weight W(v) is equal to its benefit. Basically, the benefit of a MOI is computed as the negated sum over the costs of all comprised Split Rules:

$$(-1) \cdot \sum C_{rule_{split}}(CS).$$

Between two vertices of G, an edge exists if and only if the associated MOIs have one or more Candidate-IR-nodes in common. The algorithm now simply selects the non-adjacent vertices with the highest weight (benefit) in a greedy manner and eliminates them, including their edges, from the graph G.
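As an illustration, the following C sketch shows how a GWMIN2-style greedy selection over such a conflict graph might be implemented; the adjacency-matrix layout and all names are assumptions, not Cburg's actual data structures.

#define MAXV 64

typedef struct {
    int    n;                /* number of vertices (Complex Rule candidates) */
    int    adj[MAXV][MAXV];  /* adj[u][v] = 1 iff the MOIs share IR nodes    */
    double w[MAXV];          /* vertex weight = benefit of the Complex Rule  */
} ConflictGraph;

/* GWMIN2-style heuristic: repeatedly pick the vertex v maximizing
   W(v) / sum of W over its closed neighborhood N+(v), then remove v
   and its neighbors. Returns the number of selected vertices. */
static int gwmin2(const ConflictGraph *g, int selected[]) {
    int alive[MAXV], cnt = 0;
    for (int v = 0; v < g->n; v++) alive[v] = 1;
    for (;;) {
        int best = -1; double best_score = -1.0;
        for (int v = 0; v < g->n; v++) {
            if (!alive[v]) continue;
            double denom = g->w[v];              /* N+(v) contains v itself */
            for (int u = 0; u < g->n; u++)
                if (alive[u] && g->adj[v][u]) denom += g->w[u];
            double score = g->w[v] / denom;
            if (score > best_score) { best_score = score; best = v; }
        }
        if (best < 0) break;                     /* graph is empty */
        selected[cnt++] = best;                  /* take v into the MWIS */
        alive[best] = 0;
        for (int u = 0; u < g->n; u++)           /* remove N(v) as well */
            if (g->adj[best][u]) alive[u] = 0;
    }
    return cnt;
}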

6.3.3 Pre-Cover Phase

In the last phase of the algorithm, the node selection has to be evaluated and pre-covered before the original cover phase starts, since Split Rules do not necessarily offer minimal costs for every producible nonterminal at a Candidate-node. Consequently, due to different nonterminal requirements of subsequent IR-nodes, a Candidate-node might not be covered by a Split Rule, although the Split Rule has minimal costs regarding its produced nonterminal. In this case, it must be ensured that all other nodes of the same CS are also not covered by their Split Rules. This is achieved by pre-covering the IR. During this, the cover phase of a tree pattern matcher is simulated, and in case a Candidate-node is not covered by a Split Rule, all nodes of the according CS are re-matched by Simple Rules. Figure 6.8(a) shows an example of such a situation. It presents a set of IR-trees, which are already labeled, together with a selected set of Candidate-nodes, which are marked by their associated Split Rules 11 and 12 of Figure 6.5(b).

Figure 6.8 – States of IR during pre-covering: (a) Pre-Cover Part 1, (b) Pre-Cover Part 2, (c) Pre-Cover Part 3. Each IR node is annotated with its matched Rules and costs (e.g. Mem → =(Addr, Reg), 2; Reg → +(Reg, Reg), 1; Reg → *(Reg, Reg), 1; Addr → +(Reg, Reg), 0).

In Figure 6.8(b), Split Rule 12 is not used for covering, since the succeeding Rule consumes a nonterminal that is produced more cheaply by Simple Rule 2. However, Split Rule 11 is used for covering at the same time. To solve this antagonism, the Split Rules are eliminated and all Candidate-nodes are re-matched by Simple Rules, as presented in Figure 6.8(c). Now, the IR-trees can be traversed by the cover phase and the assembly code is emitted.

6.3.4 Complexity Analysis

The code-selection algorithm introduced in Section 6.3 relies on a depth-first traversal of a DFG = (V, E) in O(V + E) time in the labeling phase. Additional cost computations in order to maximize the benefit of code-selection are applied at each node. These computations can be split into two parts: the computation of C_saved and C_opp. Whereas the former can be computed in time linear in the number of Candidate-nodes for all MOIs, the latter computation takes

$$O\Big(\sum_{CS \subseteq V,\; V \in DFG} \; \sum_{CSE \subset CS} |Adj(CSE(CS))|\Big)$$

time. In the worst case, this might evolve into an exponential runtime, if every node v ∈ V is a CSE and at the same time covered by a Split Rule, which indeed is quite unlikely. Furthermore, in case of overlapping MOI patterns, the MWISP is solved by a greedy heuristic that runs in linear time, depending on the number of overlapping MOIs. The final pre-cover phase visits all nodes in every CS once. Thereby, its complexity can be expressed as

$$O\Big(\sum_{CS \subseteq V,\; V \in DFG} |CS|\Big),$$

which also equals a linear runtime. Overall, the worst-case runtime is exponential in the number of CSEs covered by Split Rules, but linear in the size of the considered DFGs.

6.4 Experimental Results

In order to evaluate the quality of the proposed code-selection methodology to facilitate a more efficient ISA design, Cburg has been integrated into the Little C Compiler (LCC) [76]. The MIPS architecture [207] has been used as the target. Based on the MIPS ISA, new MOIs have been developed. The nomenclature of each MOI reflects, through the concatenation of instruction names, the parallel operations inside the MOI. For example, an instruction “lwlw” describes the simultaneous execution of two “lw” operations, “lw” being a simple load in the MIPS ISA. The benchmark suite comprises typical symmetric encryption algorithms, namely 3DES and AES [100], as well as an IP stack that comprises an IPv6-layer, including authentication and encryption, and an Ethernet layer. Finally, an Adaptive Differential Pulse Code Modulation (ADPCM) DSP-application taken from the DSPStone benchmark suite [271] has been examined. To develop new MOIs, all benchmarks have been profiled with a fine-grained profiler [169] to identify execution hotspots and promising candidate instructions. Several MOIs have been developed, giving special attention to the symmetric encryption algorithms, i.e. AES and 3DES.

Figure 6.9 – Speedup (relative cycle-count) of the individual MOIs and MOI combinations (srlsll, srlsrl, sllsll, lwlw, lwsll, lwxor, xorsll, xorsrl and their sets) for 3DES, AES, IP Stack and ADPCM.

Since symmetric encryption is one of the major bottlenecks of IPv6 processing [237, 238], it was expected that such MOIs would also affect the overall performance of the protocol stack. The profiling results have shown that shift (sll/srl), xor and load (lw) operations are the most frequently used operations in the encryption algorithms. Consequently, the developed MOIs are based on these instructions. First, all MOIs have been applied separately; later, three and more MOIs have been combined. Figures 6.9 and 6.10 provide an overview of the obtained experimental results. The best results have been achieved with the MOIs “lwsll” (+16.96% speedup / −13.99% code size) for 3DES and “lwxor” (+12.83% speedup / −9.96% code size) for AES, respectively. Overall performance improvements of +24.07% (3DES), +21.76% (AES) and +17.21% (IP stack) were possible. Obviously, the MOIs did not lead to notable improvements for the ADPCM benchmark, since its operator usage differs significantly from that of encryption and protocol processing.
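For illustration only (this fragment is not taken from the benchmarks), a C source pattern from which the code-selector could derive such MOIs might look as follows; whether, e.g., lwlw or lwxor applies depends on the rules declared in the grammar:

#include <stdint.h>

/* Two independent table lookups feeding xor operations, as they occur in
   DES/AES-style rounds: the two loads are candidates for "lwlw" and the
   two xors for the xor-based MOIs. */
uint32_t round_step(const uint32_t *sbox1, const uint32_t *sbox2,
                    uint32_t x, uint32_t state)
{
    uint32_t t1 = sbox1[x & 0xff];          /* lw  */
    uint32_t t2 = sbox2[(x >> 8) & 0xff];   /* lw  */
    return state ^ t1 ^ t2;                 /* xor, xor */
}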

6.4.1 Hardware Synthesis

An isolated consideration of performance acceleration and code size reduction is insufficient¹ to evaluate the benefit of identified CI sets. In general, it is a trivial relation that additional area results in additional performance acceleration of a processor.

¹This section presents results that have been obtained during an internship. The model of the MIPS architecture used for evaluating code-selection was not available for the study. The IRISC processor was just being developed at that time, and retargeting a compiler, including CBurg's code-selection algorithm, was not finished. Nevertheless, to evaluate the instructions used for the code-selection results in terms of hardware effort, the IRISC has been extended by these instructions to obtain numbers on area consumption.

Figure 6.10 – Code size (relative change) of the individual MOIs and MOI combinations for 3DES, AES, IP Stack and ADPCM.

For the benefit of the above-mentioned CI sets, the relative performance acceleration compared to the individual area consumption of a certain CI set is much more significant, i.e. whether a CI set features a reasonable compromise between area consumption and performance gain. To evaluate the area consumption and other related effects of appropriate hardware implementations, the identified CI sets have been integrated into an ADL-based virtual prototype of a single-issue RISC architecture called IRISC (c.f. Appendix A). Figure 6.11 shows the relative area growth of the IRISC architecture, determined by different CI set implementations. The results prove an acceptable hardware effort for each ISE. However, this evaluation suffers from two main constraints:

1. Several sets contain “lwlw” operations in order to perform multiple parallel memory accesses in one pipeline cycle. The presented results for those ISEs consider only the processor's area and do not comprise numbers on the required memory area. In fact, to read multiple values from memory, at least a second read-write port has to be incorporated. This will probably imply an additional area growth of around 30%, which makes an implementation of these CIs very unattractive.

2. The IRISC architecture has been implemented without bypassing. As a consequence, data-dependent instructions cannot be executed in subsequent processor cycles. For the described CIs, this may lead to suboptimal results, as the insertion of No-Operations (NOPs) eliminates the gained performance speedup.


Figure 6.11 – Additional area consumption of developed instructions.

However, such scheduling issues are not exclusively determined by CIs. In fact, they depend on several parameters concerning the compiled application itself. For example, control-flow oriented applications feature a huge number of small basic blocks and therefore imply less scheduling freedom for instructions compared to data-flow oriented applications. In addition, operands are fetched inside the EX stage of the IRISC's pipeline, which is the stage preceding the WB stage. Therefore, no more than one NOP has to be inserted. In summary, bypassing is a negligible feature for network applications, since network protocols naturally consist of a set of functions which are executed for every processed packet of the same QoS class. Consequently, they are not control-flow dominated.

Figure 6.12 finally presents the trade-off between additional area consumption and provided performance speedup for each CI set. For this, the average speedup numbers obtained from Figure 6.9 have been divided by the average area consumption computed from Figure 6.11. The CI set consisting of “lwsll” and “lwxor” provides the best speedup per kilo-Gate (kGate). In contrast to multiple-arithmetic instructions like “sllsrl”, only one ALU is necessary; the arithmetic operations are executed in the EX stage during a memory access. This is, interestingly, the same microarchitectural principle that led to a successful design of CIs to accelerate the Blowfish algorithm, as described in Chapter 4.


Figure 6.12 – Speedup per area unit (kGate) of developed instructions.

6.5 Concluding Remarks

The presented code-generator generator Cburg extends the state-of-the-art in retargetable code-selection by an efficient heuristic for graph-based code-selection, which is scalable enough to be applied to real-world applications, since its average runtime behavior is linear. First of all, it offers a high-level programming model for customized ISAs through a compiler. This offers at the same time the opportunity to consider existing legacy code of network protocols during the development of source code for the underlying architecture. Through the concept of code-generator generation, CBurg allows for easy compiler adaptation to newly developed ISAs during architecture exploration cycles. Instead of handcoding error-prone inline assembly or CKFs, system designers can model all developed instructions in one grammar file, which is fed into the compiler's code-selector. As a consequence, compiled applications automatically comprise all CIs without any manual modification. Thereby, the benefit of newly added instructions can be evaluated faster, regardless of their I/O-structure (i.e. number of inputs/outputs). Furthermore, the evaluation is not restricted to isolated code fragments. Instead, the whole application is examined at once, which leads to much more accurate results regarding usability and achieved code quality. Shorter design cycles and time-to-market periods are the consequence. From the viewpoint of ISE, developed CIs are automatically available for the implementation of arbitrary applications, thus leading to a much wider efficiency factor. This, in particular, enables the development of reusable CIs, resulting in a longer time-in-market of the underlying architecture. However, in this context, it is still necessary to identify these CIs manually, which requires detailed knowledge of the targeted applications. Indeed, identification of reusable CIs for a set of applications without appropriate tool support is prohibitively difficult, especially in the domain of protocol processing, where typical applications/protocols can consist of several thousand lines of C-code. Additionally, the continuous evolution of the domain of network applications complicates these matters further. Sophisticated tool support for automatic compiler-driven identification of (reusable) CIs is therefore highly demanded.

Chapter 7

Automatic Compiler-Driven Identification of Custom Instructions

As concluded from Chapter 2, CI-extended ISAs are an industry-proven way to improve the speed and throughput of PPEs. Chapter 4 has additionally underlined the high speedup potential of application-specific hardware instructions, yet revealed a lack of (re)usability of hardware instructions due to missing dedicated compiler support. On the other hand, Chapter 5 presented the efficiency of compiler-driven ISEs. Consequently, creating application-specific ISAs is ideally coupled with the design of related compiler optimizations. Such a compilation-driven ISE, as a major principle for efficient architecture design of ASIPs, naturally includes the analysis of a set of representative applications regarding characteristic and promising operations. Dedicated hardware instructions and compiler optimizations are subsequently tailored to support these operations efficiently. Automating this process has gained wide acceptance, as it enables processor designers to quickly adapt a processor template to the needs of a certain set of applications. The majority of prior approaches described in the literature favor the analysis of only a small number of basic blocks, which have been identified as an application's hotspot. Based on these few basic blocks, maximal subgraphs under given constraints are identified in order to maximize the overall benefit of a hardware implementation. From a compiler perspective, this is undesirable, since the reusability of such complex CIs towards different applications is typically very low. Furthermore, implementing complex CIs often leads to implementation problems in the pipeline, i.e. an extended critical path. Especially for NPUs, such approaches are disadvantageous, since pure network protocol stacks do not feature the “one and only” hotspot, but rather a stack of functions that are executed with equal frequency according to the QoS class of the processed packet. In fact, this has also been emphasized in Figure 4.1 of Chapter 4, which shows the task break-up of a VPN-protocol. Encryption and authentication are obviously the major bottlenecks in this protocol stack, as they represent the only computation-intensive part of the protocol.

Since encryption and authentication are performed only at the network's edge, applications running on metro equipment like routers do not utilize them. In this chapter, the approach of compiler-driven ISE is formalized and a tool to automatically explore small, reusable CIs is presented. Based on the identified CI candidates, a CBurg code-selector description is generated, which is applied to automatically retarget a compiler backend for the newly extended ISA containing the identified CIs. Experimental results are presented that provide reliable feedback about speedup and usability from a compiler's perspective. The remainder of this chapter is organized as follows: Section 7.1 surveys the literature and discusses previous approaches. Section 7.2 gives an overview of the proposed tool flow. The methodology of CI identification is described in Section 7.3, and experimental results are presented in Section 7.4. Finally, the chapter concludes with Section 7.5.

7.1 Related Work

ISE has spawned a wealth of literature in the past [60, 62, 63, 64, 65, 71, 83, 88, 129, 184, 223, 281]. Closest to our approach are [62, 63, 65, 138] and [89]. In [62, 63, 65], the ISE identification is integrated into a GCC-based compiler that is capable of emitting a simulator description for the SimpleScalar simulator. Automatic code generation utilizing the new instructions is not described in detail. The ISE methodology presented in [62] focuses on the examination of only a small number of basic blocks, identifying maximal subgraphs. The mentioned approaches do not consider recurrences of subgraphs in different basic blocks, which explains why reusability is not a topic of these approaches. In addition, [62] reports negative effects on experimental results due to limiting the search scope to a single basic block at a time. In contrast to [62], [63] and [65] do take recurrences of CI patterns into account. The major differentiator to these works is the approach of selecting the most profitable set of subgraphs. [65] as well as [63] consider only non-overlapping subgraphs as valid, i.e. the finally selected subgraphs feature mutually disjoint node sets. While this constraint is perfectly true for a single basic block, when considering multiple basic blocks, overlapping subgraphs can coexist in different basic blocks without determining each other's performance. [138] and [89] present approaches that also address code generation for the identified CIs. The former is built upon the Xtensa processor template provided by Tensilica. The identified CIs are automatically inserted into the processor model and into the code-selector description of the compiler; by limiting ISE to small, low-latency extended instructions, they can be applied very effectively. Unfortunately, only loops are analyzed regarding connected subgraphs. In [89], a heuristic methodology to enumerate subgraphs is described, which allows for fast DSE. However, as with [138], the considered subgraphs are limited to connected subgraphs, which is counterproductive for compiler-aware enumeration, since connected subgraphs tend to be too complex to recur frequently. Besides these approaches, further work has been published recently, which shall be mentioned here [71, 184, 261]. [71] presents an efficient ISE algorithm that serializes register file access. As in many other approaches, only a small number of basic blocks is considered, such that recurrence of patterns is neglected. Additionally, the ISE is evaluated by a speedup model based on the assumption that the merit of a subgraph is proportional to its size. However, if the merit function takes execution frequencies into account, as in the case of recurrence-aware ISE, this does not hold. In [261], an ILProg formulation of the same problem is presented, which shows similar results to [71], yet with exponential runtime. [184] describes an ILProg approach to generate a single most profitable CI. In both approaches [71, 184], all subgraphs are enumerated in an implicit manner and evaluated by the ILProg-solver. Nevertheless, these approaches also belong to the class of compiler-agnostic ISE methodologies, focusing only on a small number of basic blocks at a time.

7.2 System Overview

The developed ISE methodology is integrated into the CoSy compiler framework (Figure 7.1). This compiler is targeted towards the LISA-prototype of the IRISC core (c.f. Appendix A).


Figure 7.1 – System overview of ISE integration in the compiler.

Amongst others, CoSy provides a profiler engine, which can be seamlessly integrated into a CoSy-based compiler. The engine annotates execution frequencies of basic blocks in the IR of a CoSy-compiler. In the backend of the compiler, the code-selector engine is developed by applying the code-generator generator CBurg [239]. For this purpose, a code-selector description file is created that comprises a grammar for the ISA of the mentioned RISC core. Subsequently, CBurg is used to generate a set of C-functions for the implementation of the code-selector engine. Accordingly, the code-selector of the compiler is implemented by employing these C-functions, such that re-generating and re-linking the functions will immediately affect the code-selector's behavior. Right after pre-scheduling, a dump engine is inserted that emits the scheduler's DFGs into Graph Description Language (gdl) files [3], which serve as input for the ISE identification. The ISE identification methodology is implemented as a standalone process. It receives a set of DFGs in gdl-format and results in an extension of CBurg's ISA grammar description by new rules for the identified CIs. Based on this new input grammar, CBurg can automatically generate according C-functions that affect the code-selector. Additionally, the ISE identification consumes a configuration file containing several parameters that can be used to adapt the ISE identification to the needs of a given application and processor architecture.
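The gdl format [3] is a plain-text graph description. As a rough sketch, a dumped DFG for a statement such as d = a + b might look like the following; the node names and the chosen attributes are illustrative, not the dump engine's exact output.

graph: {
  title: "bb42_dfg"                       // one graph per basic block
  node: { title: "n1" label: "a"  }       // IR operands become nodes
  node: { title: "n2" label: "b"  }
  node: { title: "n3" label: "+"  }       // IR operations become nodes
  node: { title: "n4" label: "=d" }
  edge: { sourcename: "n1" targetname: "n3" }   // data-flow edges
  edge: { sourcename: "n2" targetname: "n3" }
  edge: { sourcename: "n3" targetname: "n4" }
}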

7.3 Methodology

The CI-identification methodology as presented in Figure 7.2 receives the following inputs:

• a set of graphs G = {G_1, ..., G_n}, representing the DFGs according to the application's basic blocks

• the execution frequency f_i for each basic block G_i, obtained by profiling

as well as a configuration file, containing

• a set F of forbidden nodes, like jumps etc.

• a maximum number of inputs/outputs (N_in/N_out) for considered subgraphs

• a gain function G(S_i) for subgraphs, representing the expected number of cycles saved by executing S_i in a single cycle

• a maximum “distance” D for disconnected subgraphs

• a lower bound N_f for the execution frequencies of considered DFGs¹

These inputs are processed within three phases: subgraph enumeration (Section 7.3.1), isomorphism detection (Section 7.3.2) and covering (Section 7.3.3). The initial phase, subgraph enumeration, computes all available subgraphs inside each DFG that are amenable for hardware implementation.

¹The purpose of this parameter is to prune the process by eliminating DFGs with low execution frequencies from the set of considered basic blocks. Obviously, this can also be used to concentrate on a single basic block only.

Figure 7.2 – ISE methodology flow: the DFGs with their execution frequencies f_i and the configuration parameters (number of inputs N_in, number of outputs N_out, gain function G(S), undesired nodes F, maximum distance D, number of CIs to emit) enter subgraph enumeration; its subgraphs S_i with merits M(S_i) are grouped into equivalence classes U_i by subgraph isomorphism detection, and CI covering finally emits the chosen CIs.

The second phase, isomorphism detection, checks for isomorphic equivalence between the subgraphs of different basic blocks, thus classifying the graphs into a set of equivalence classes. Finally, the covering phase selects the most beneficial set of subgraphs based on subgraph intersection, data dependencies and hardware implementation gain.
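The concrete syntax of the configuration file is not specified in this chapter; the following key-value sketch merely illustrates the parameters listed above, and all keys and values are hypothetical.

# ISE identification configuration (hypothetical syntax)
forbidden_nodes = jump, call, store   # set F of forbidden nodes
max_inputs      = 4                   # N_in
max_outputs     = 2                   # N_out
max_distance    = 2                   # maximum distance D for disconnected subgraphs
min_frequency   = 1000                # lower bound N_f on execution frequencies
gain            = node_count          # selects the gain function G(S)
num_cis         = 8                   # number of CIs to emit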

7.3.1 Subgraph Enumeration

The introduced tool incorporates a subgraph enumeration phase built on a methodology described in [64]. Herein, an algorithm is introduced that is capable of enumerating all possible convex subgraphs under given constraints in polynomial runtime. These constraints comprise primarily the number of allowed inputs (N_in) and outputs (N_out) for arbitrary subgraphs S. Additionally, some operations like memory accesses are excluded from subgraph enumeration in advance. The manuscript makes use of important observations, which are given here without proof:

7.1. Theorem. If S ⊆ G identifies a convex subgraph, then for every output o of S the set of vertices Io(S) that are inputs of o is a generalized dominator.

7.1. Definition (B(V, w)). The vertices in between a set of vertices V and a single vertex w are designated by B(V, w); more exactly, these are the vertices contained in at least one path between an arbitrary v ∈ V and the vertex w. While the starting vertex of each path is not included in the set B(V, w), the final vertex w is.

7.2. Theorem. Any convex subgraph S is uniquely identified by its set of input (I(S)) and output (O(S)) vertices, i.e. two convex subgraphs are equal iff they share the same sets of inputs and outputs.

7.3. Theorem. Given two sets of vertices I and O, if for every vertex o_j ∈ O there is a set of vertices I_j ⊆ I such that I_j ≼ o_j, then

$$S = \bigcup_{o_j \in O} B(I_j, o_j) \setminus I_j$$

is a convex subgraph with I(S) ⊆ I.

With these underlying observations, the subgraph enumeration employs a well-known algorithm to enumerate generalized dominator sets of a given cardinality k [102].

Enumeration of Multiple-Vertex Dominators

The proposed algorithm [102] computes the set of all possible multiple-vertex dominators M(G) of size k for a given flow graph G = (V, E, r) in polynomial time O(n^k). For each seed set {v_1, ..., v_{k−1}} ∈ V^{k−1}, an edge-reduced graph G' = (V, E', r) is constructed, such that

$$E' = E - \Big\{(u, w) \;\Big|\; u \in \bigcup_{i=1}^{k-1} Dom_G(v_i) \vee w \in \bigcup_{i=1}^{k-1} Dom_G(v_i)\Big\}.$$

In order to compute single-vertex dominators, the well-known algorithm described in [182] is applied (c.f. Section 3.1.2). However, the multiple-vertex computation does not rely on a special algorithm to compute single-vertex dominators. Finally, let R_G(v) be the set of all nodes reachable from v inside a graph G², i.e. R_G(v) = {w | ∃P⟨v,w⟩ ∈ G}; the set M(G) can then be computed as

$$M(G) = \Big\{(u, v_1, \ldots, v_{k-1}) \;\Big|\; \forall u \Big(\big(\exists w \in \bigcup_{i=1}^{k-1} R_G(v_i) - \bigcup_{i=1}^{k-1} Dom_G(v_i) : u \preceq_{G'} w\big) \wedge \neg\exists v_i : u \preceq_G v_i\Big)\Big\}.$$

Figure 7.3 gives a working example for the computation of a multiple-vertex dominator set of cardinality k = 2.

Enumeration of Subgraphs

Within [64], subgraph enumeration takes place by incrementally constructing seed sets for each admissible output (an output is not admissible if it is postdominated by another vertex) and invoking the algorithm for multiple-vertex enumeration in order to find dominator sets for these outputs (Figure 7.4). Thereby, the algorithm picks an arbitrary output (pick_output) and successively traverses all its ancestors in the graph, while examining at each node whether the union of it and the previously visited nodes builds a generalized dominator for the regarded output. Subgraphs featuring multiple outputs are identified by incrementally adding outputs to the set of outputs, while analyzing their inputs as described. In addition, identified subgraphs are finally checked for outputs besides O(S) (check_cut). Since multiple-vertex dominator identification does not scale well for large graphs and k > 2, as reported in [102], several pruning strategies have been developed in [64] in order to reduce the overall set of possible multiple-vertex dominators:

²This is a deviation from the algorithm description given in [102]. There, the set I_G(v) of nodes through which a vertex v can be reached is used instead of the set R_G(v).

Figure 7.3 – Computation of a multiple-vertex dominator set of cardinality 2. Figure 7.3(a) presents the flow graph G. The examined seed set {B} ∈ V¹ is marked by a red circle. B dominates four vertices, Dom_G(B) = {B, D, F, I}, and the reachable nodes are R_G(B) = {D, F, I, L}. Figure 7.3(b) shows the resulting edge-reduced graph G', i.e. all edges containing nodes dominated by B are eliminated. In the graph G', only vertex L is left that is reachable from B. Possible dominator sets for L are highlighted in yellow. In addition, Figures 7.3(c) and (d) present the according dominator trees for G and G', respectively.

• For the enumeration of multiple-output subgraphs, internal outputs (i.e. nodes having successors inside and outside of a subgraph) are considered, regarding the maximal number of allowed outputs.

• The algorithm can be set up to enumerate only connected subgraphs.

• Invalid node sets, e.g. subgraphs containing forbidden nodes, can be excluded immediately from dominator computation.

• No postdominance between inputs is allowed.

The resulting improvements have been reported as being “quite dramatic”, such that the enumeration becomes practicable even for graphs with more than 1000 nodes. For the presented tool flow, this methodology has been extended by a further pruning strategy, which considers “distances” of disconnected subgraphs during enumeration. The purpose of this technique is to avoid high register pressure caused by CIs. Register pressure in this context emerges from covering disconnected subgraph patterns by a single CI, such that the related operations are executed within the same hardware cycle.

algorithm: polynomial-time subgraph enumeration
input: a flow graph G = (V, E)
output: the set of subgraphs

check_cut(I, O, S, Nin, Nout)
    if O(S) = O ∧ S ∩ F = ∅ then
        S is a valid subgraph
    if Nout > 0 then
        pick_output(I, O, S, Nin, Nout)

pick_inputs(I, o, O, S, Nin, Nout)
    // the next line invokes multiple-vertex dominator enumeration
    for each node w such that I ∪ {w} ≼ O do
        I' = I ∪ {w}
        S' = S ∪ B({w}, o)
        check_cut(I', O, S', Nin − 1, Nout)
    if Nin > 1 then
        // add a node to the seed set
        for each ancestor i of o do
            I' = I ∪ {i}
            S' = S ∪ B({i}, o)
            pick_inputs(I', o, O, S', Nin − 1, Nout)

pick_output(I, O, S, Nin, Nout)
    for each admissible output o do
        O' = O ∪ {o}
        S' = S ∪ B(I, o)
        if I ≼ o then
            check_cut(I, O', S', Nin, Nout − 1)
        else if Nin > 0 then
            pick_inputs(I, o, O', S', Nin, Nout − 1)

poly_enum()
    pick_output(∅, ∅, ∅, Nin, Nout)

Figure 7.4 – Pseudo code for subgraph enumeration [64].

The results of such a CI have to be kept in registers until they are used. In case of large distances between the corresponding subgraph patterns, this can take several pipeline cycles and, in the worst case, cause spill code. In each DFG, a distance layer D(v) is assigned to every non-leaf vertex v: a numeric identifier that reflects the number of edges of the longest path between an input node and the regarded node v. Given a DFG = (V, E), the distance layers D(v) can be computed for all non-leaf nodes v ∈ V in O(|I| · (|V| − |I|)) time, where I is the set of input nodes of the DFG.
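A minimal C sketch of this computation is given below; it assumes the DFG nodes are already in topological order and uses a simple array layout (all names and bounds are hypothetical).

#define MAXN 64   /* max. DFG nodes (hypothetical bound) */
#define MAXP 8    /* max. predecessors per node (hypothetical bound) */

/* Compute distance layers D(v): the length of the longest path from an
   input (leaf) node to v. Nodes 0..n-1 must be topologically ordered,
   so every predecessor is processed before its successors. Inputs get
   layer 0; the first operator layer is then 1, as in Figure 7.5. */
void distance_layers(int n, const int pred[][MAXP],
                     const int npred[], int D[]) {
    for (int v = 0; v < n; v++) {
        D[v] = 0;                              /* input nodes stay at 0 */
        for (int i = 0; i < npred[v]; i++) {
            int u = pred[v][i];                /* u comes before v      */
            if (D[u] + 1 > D[v]) D[v] = D[u] + 1;
        }
    }
}

/* The distance of two result nodes v1, v2 is then abs(D[v1] - D[v2]). */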

7.2. Definition (distance of vertices). Given a DFG = (V, E), the distance dist(v_1, v_2) of two vertices v_1 and v_2 is defined as the absolute difference of the according distance layers:

$$dist(v_1, v_2) = |D(v_1) - D(v_2)|.$$

The distance of two disconnected subgraphs dist(S_1, S_2) can be defined as the distance between their corresponding result nodes (result nodes have no successor within the subgraph). This approach can be further generalized to multiple disconnected subgraphs by computing the average distance. However, through the concept of the configuration file, it is up to the user to specify a gain function that takes the distances into account, such that arbitrary functions over the distances of disconnected subgraphs are possible. The distances of subgraphs involved in an instruction pattern typically affect the according gain function G(S) in the way that longer distances between the disconnected subgraphs result in a lower gain for the overall pattern. Furthermore, it is possible to specify a maximum distance inside the configuration file, such that disconnected subgraphs whose distance exceeds this limit are not considered for enumeration.

Figure 7.5 – Example for the computation of distances: a DFG structured into three distance layers, with the result nodes 1, 2 and 3 of the disconnected subgraphs S1 and S2 highlighted.

Example for Distance Computation: Figure 7.5 illustrates a small example for the computation of distances for disconnected subgraphs. The figure shows a DFG that is structured into three distance layers. All vertices within the same layer feature the same D(v), i.e. 1, 2 or 3. The DFG contains three designated nodes 1, 2 and 3, shown in bright circles, which are result nodes of certain subgraphs. The result nodes represent operations that are elements of the larger disconnected subgraphs S1 and S2, respectively, presented as bright ellipses. While nodes 1 and 2 both feature the same distance layer 1, node 3 belongs to distance layer 2.

Consequently, the subgraphs of S1 feature a distance of |D(3) − D(1)| = 1, while the subgraphs of S2 only show a distance of |D(1) − D(2)| = 0.

Finally, the output of subgraph enumeration is a set of subgraphs S = {S_1, ..., S_n}, annotated with their specific merits M(S_i).

7.3.2 Isomorphism Detection

The subgraph enumeration is followed by a subgraph isomorphism detection algorithm based on the vf2 library. The vf2 library provides a framework [5] that enables the implementation of an isomorphism detection equivalent to the concepts described in [95]. [95] states that the algorithm is tailored to deal with large graphs without making particular assumptions about the nature of the graphs. The runtime complexity of this detection method is specified as Θ(n²) in the best and Θ(n! · n) in the worst case, implying superior runtime behavior compared to existing approaches described in the literature, particularly the isomorphism detection described in [257] (c.f. Section 3.2). The algorithm of [95] is based on the model of a State Space Representation (SSR) [216], i.e. the process of comparing two subgraphs G_1 = (V_1, E_1), G_2 = (V_2, E_2) is separated into states s, each of which is associated with a partial matching solution M ⊆ V_1 × V_2, which in turn represents a bijective function preserving the structure of the two graphs. According to this definition, a state transition implies the extension of M by a new pair of nodes (n, m) ∈ V_1 × V_2. In [95], a set of feasibility rules is defined and applied in order to guarantee the consistency of each state transition. As a result, the isomorphism detection classifies the subgraphs {S_1, ..., S_n} into several equivalence classes U = {U_1, ..., U_k} according to isomorphic relations between subgraphs.
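For intuition, the following C fragment gives a much-simplified conceptual sketch of the SSR idea: a partial mapping is extended pair by pair, and every extension is checked against feasibility rules. It is not the vf2 API, it omits vf2's candidate-pair ordering and pruning, and all names are illustrative.

#define MAXN 32

/* A DFG with an adjacency matrix and an operator code per node. */
typedef struct { int n; int adj[MAXN][MAXN]; int op[MAXN]; } Dfg;

/* Feasibility: node n of g1 may be mapped to node m of g2 if the operators
   agree and all edges to/from already-mapped nodes correspond. */
static int feasible(const Dfg *g1, const Dfg *g2,
                    const int map[], int n, int m) {
    if (g1->op[n] != g2->op[m]) return 0;
    for (int i = 0; i < n; i++) {                /* nodes 0..n-1 are mapped */
        if (g1->adj[i][n] != g2->adj[map[i]][m]) return 0;
        if (g1->adj[n][i] != g2->adj[m][map[i]]) return 0;
    }
    return 1;
}

/* One SSR state = (map, used, depth); a state transition adds one pair. */
static int match(const Dfg *g1, const Dfg *g2,
                 int map[], int used[], int depth) {
    if (depth == g1->n) return 1;                /* complete mapping found */
    for (int m = 0; m < g2->n; m++) {
        if (used[m] || !feasible(g1, g2, map, depth, m)) continue;
        map[depth] = m; used[m] = 1;             /* extend the state */
        if (match(g1, g2, map, used, depth + 1)) return 1;
        used[m] = 0;                             /* backtrack */
    }
    return 0;
}

/* Usage: require g1->n == g2->n, clear used[], then call match(..., 0). */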

7.3.3 Covering

The last phase of ISE is covering. The problem of covering is to select a set of subgraphs for hardware implementation, A = {S_l, ..., S_k}, such that the overall merit $\sum_{S_i \in A} M(S_i)$ is maximal. For certain nodes in the IR, multiple covering options naturally exist. This is typically caused by node-intersections of subgraph patterns. In addition, the presence of overlapping isomorphic and/or disconnected subgraphs complicates matters further. Figure 7.6 illustrates IR snippets featuring overlapping disconnected subgraphs in (a) and overlapping isomorphic subgraphs in (b). In both cases, the subgraphs S_1 and S_2 are designated as overlapping, S_1 ∩ S_2 ≠ ∅.

7.3. Definition (Subgraph-Overlap). Given a DFG = (V, E). Let Si = (Vi, Ei) and Sj = (Vj, Ej) be convex subgraphs of the DFG. Si and Sj are called overlapping (Si ∩ Sj ≠ ∅), iff one of the following conditions holds:

1. Vi ∩ Vj ≠ ∅, or

2. ∃ v1, v2 ∈ Vi, u1, u2 ∈ Vj : (v1, u1) ∈ E ∧ (u2, v2) ∈ E

Accordingly, two equivalence classes of subgraphs Ui and Uj are defined as overlapping (Ui ∩ Uj ≠ ∅), iff

∃ Si ∈ Ui, Sj ∈ Uj : Si ∩ Sj ≠ ∅.
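A direct implementation of the overlap test of Definition 7.3 could look as follows (a sketch, assuming subgraphs are given as bit sets over the DFG's node ids and edges as an adjacency matrix):

#include <stdbool.h>

#define MAXV 64

typedef struct { bool in[MAXV]; } NodeSet;

bool overlaps(const NodeSet *Si, const NodeSet *Sj,
              const bool adj[MAXV][MAXV], int nvertices)
{
    /* condition 1: Vi and Vj share a node */
    for (int v = 0; v < nvertices; v++)
        if (Si->in[v] && Sj->in[v]) return true;

    /* condition 2: an edge from Si into Sj and an edge from Sj back into Si */
    bool i_to_j = false, j_to_i = false;
    for (int u = 0; u < nvertices; u++)
        for (int w = 0; w < nvertices; w++) {
            if (adj[u][w] && Si->in[u] && Sj->in[w]) i_to_j = true;
            if (adj[u][w] && Sj->in[u] && Si->in[w]) j_to_i = true;
        }
    return i_to_j && j_to_i;
}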

(a) Overlapping Disconnected Subgraphs (b) Overlapping Isomorphic Subgraphs

Figure 7.6 – Example of overlapping isomorphic/disconnected subgraphs.

Non-Isomorphic Overlapping Subgraphs

Typically, the problem of covering a set of DFGs with the most beneficial set of subgraphs is solved only for non-overlapping subgraphs, i.e. Si ∩ Sj = ∅. From this point of view, the problem can be formulated as an MWIS problem, i.e. a node-weighted graph is constructed such that every vertex represents a subgraph with an annotated merit and every edge denotes an interference between two subgraphs. While this formulation is correct for single DFGs, it may lead to suboptimal results in the case of recurrence-aware covering, since two subgraphs overlapping in one DFG can also occur in other DFGs, where little interference exists. Figure 7.7 gives an example of the problem that is targeted within this section. Three DFGs are shown, each containing subgraphs whose node sets mutually overlap. The subgraphs are labeled according to the equivalence class (U1 – U3) they belong to, such that isomorphic subgraphs feature the same label. Covering under the constraint of non-overlapping subgraphs would select only one subgraph pattern for the three DFGs, while the optimal solution obviously consists of two subgraphs.

In order to compute the most beneficial covering of IR nodes by subgraphs Sj, the following parameters have to be involved for all classes Ui ∈ {U1, U2, U3} (a minimal sketch of these quantities follows the list):

• the execution frequencies fj of all subgraphs Sj ∈ Ui

• the gain G(Ui) returning the number of saved cycles by executing a graph Sj ∈ Ui in a single cycle


Figure 7.7 – Example of DFGs containing overlapping subgraphs.

• the merit function M(Ui) = Σ_{∀Sj ∈ Ui} fj · G(Sj)

• the number of mutual overlaps for each pair of subgraph classes |Ui ∩ Uk| = Σ_{∀Sl ∈ Ui, ∀Sm ∈ Uk} |Sl ∩ Sm|

• a cost function C(Ui ∩ Uj) = |Ui ∩ Uj| · (G(Ui) + G(Uj)) accumulating the sum G(Ui) + G(Uj) for each overlapping occurrence of Ui and Uj
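The following sketch computes these quantities; the container types and field names are assumptions for illustration only:

#define MAX_SG 16

typedef struct { int f; int gain; } Subgraph;                          /* fj and G(Sj) */
typedef struct { Subgraph sg[MAX_SG]; int count; int gain; } EqClass;  /* Ui and G(Ui) */

/* M(Ui) = sum over all Sj in Ui of fj * G(Sj) */
int merit(const EqClass *U)
{
    int m = 0;
    for (int j = 0; j < U->count; j++)
        m += U->sg[j].f * U->sg[j].gain;
    return m;
}

/* C(Ui ∩ Uj) = |Ui ∩ Uj| * (G(Ui) + G(Uj)),
   with the number of mutual overlaps counted beforehand */
int overlap_cost(const EqClass *Ui, const EqClass *Uj, int num_overlaps)
{
    return num_overlaps * (Ui->gain + Uj->gain);
}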

The situation can be modeled by a graph G = (V, E, W_V, W_E), whose nodes represent subgraphs and whose edges model intersections of subgraphs. In addition, the graph's nodes and edges are weighted, where the node weights W_V reflect the merit of the according subgraph and the edge weights W_E represent the costs of intersection C(Ui ∩ Uj) between two adjacent nodes Ui and Uj. Figure 7.8 presents the situation of Figure 7.7 as such a graph.


Figure 7.8 – Graph-based presentation of the covering situation in accordance to Figure 7.7.

The problem of covering can be described as finding a cut through the graph, such that the node weights are maximized and the edge weights are minimized. The most general formulation of this problem is given as follows:

A = arg max_U Σ_{∀Uj ∈ U} ( M(Uj) − Σ_{∀Ul ∈ U} C(Uj ∩ Ul) ).   (7.1)

To identify the set A, the problem has been mapped to the Partitioned Boolean Quadratic Problem (PBQP), which is a quadratic optimization problem stemming from operations research. It is one of the most fundamental combinatorial problems and NP-complete in general (c.f. Appendix C). However, for a certain subclass of PBQP an efficient solver [7, 245] exists that computes the optimal solution in linear time and applies heuristics in order to compute a solution for general PBQPs in cubic runtime.

This solver has been applied to different tasks of a compiler backend such as code selection [103], register allocation [243] and address mode selection [244], yielding good results. The PBQP is formally defined over an n-tuple of boolean decision vectors X = ⟨~x1, ..., ~xn⟩ as follows:

min f(X) = Σ_{1≤i<j≤n} ~xi^T · Cij · ~xj + Σ_{1≤i≤n} ~ci^T · ~xi.   (7.2)

For the covering problem, every decision vector ~xi degenerates to a boolean variable

xi = 1 if Ui is selected, xi = 0 if Ui is not selected,

indicating whether Ui has been selected or not. Furthermore, the node weights M(Ui) are represented by scalars ci and the edge weights C(Ui ∩ Uj) by scalars cij, such that the overall form of this distinct PBQP is given as follows:

min f(X) = Σ_{i<j} xi · cij · xj + Σ_i (−1) · ci · xi.   (7.3)

Example: Consider the situation described in Figures 7.7 and 7.8: Let G(U1) = 10, G(U2) = 15 and G(U3) = 20. Furthermore, let every DFG be executed exactly once, such that the resulting execution frequency is fi = 2 for every Ui ∈ {U1, U2, U3}. Consequently, the subgraphs' merits equal M(U1) = 20, M(U2) = 30 and M(U3) = 40. Since every pair of subgraphs overlaps exactly once, the costs of each overlap can be computed as the sum of the gains G of the involved subgraphs. The resulting function f(X) in accordance with Equation 7.3 is finally given by

f(X) = x1 · 25 · x2 + x2 · 35 · x3 + x1 · 30 · x3 − 20 · x1 − 30 · x2 − 40 · x3 (7.4)

Selecting, for example, only subgraph U1 results in

f(X) = 1 · 25 · 0 + 0 · 35 · 0 + 1 · 30 · 0 − 20 · 1 − 30 · 0 − 40 · 0 = −20 (7.5)

Selecting subgraphs U1 and U2 results in:

f(X) = 1 · 25 · 1 + 1 · 35 · 0 + 1 · 30 · 0 − 20 · 1 − 30 · 1 − 40 · 0 = −25 (7.6)
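For a problem of this tiny size, Equation 7.3 can simply be evaluated for all selections; the following self-contained program reproduces the values of (7.5) and (7.6) and reports the minimum:

#include <limits.h>
#include <stdio.h>

int main(void)
{
    const int c[3]     = { 20, 30, 40 };        /* ci  = M(Ui)      */
    const int cc[3][3] = { { 0, 25, 30 },       /* cij = C(Ui ∩ Uj) */
                           { 0,  0, 35 },
                           { 0,  0,  0 } };
    int best = INT_MAX, bx[3] = { 0, 0, 0 };

    for (int sel = 0; sel < 8; sel++) {         /* all 2^3 selections */
        int x[3] = { sel & 1, (sel >> 1) & 1, (sel >> 2) & 1 };
        int f = 0;
        for (int i = 0; i < 3; i++) {
            for (int j = i + 1; j < 3; j++)
                f += x[i] * cc[i][j] * x[j];    /* overlap costs     */
            f -= c[i] * x[i];                   /* merits (negated)  */
        }
        if (f < best) { best = f; bx[0] = x[0]; bx[1] = x[1]; bx[2] = x[2]; }
    }
    printf("min f(X) = %d at x = (%d, %d, %d)\n", best, bx[0], bx[1], bx[2]);
    return 0;
}

Under these example numbers the minimum is f(X) = −40 for x = (0, 0, 1), i.e. selecting only U3; for realistically sized problems this enumeration is replaced by the solver of [7, 245].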

Isomorphic Overlapping Subgraphs

In order to handle the aforesaid case of overlapping isomorphic subgraphs Sl ∩ Sk ≠ ∅ with Sl, Sk ∈ Ui (as shown in Figure 7.6(a)), new equivalence classes Ui1 = Ui − {Sl} and Ui2 = Ui − {Sk} have to be generated. The merits of these classes are correspondingly computed as M(Ui1) = M(Ui) − fl · G(Sl) and M(Ui2) = M(Ui) − fk · G(Sk), respectively. Finally, the problem can be solved as described above.

7.4 Experimental Results

In order to evaluate the quality of the proposed ISE methodology to facilitate a more efficient ISA design, the tool flow has been applied to analyze several applications. For this purpose, multiple architecture explorations have been executed starting from a RISC processor template described in Appendix A.

Architecture Exploration Flow

Each architecture exploration consists of two phases: In the first phase, the architecture-specific software tools like assembler, linker and simulator are generated. In addition, the config-file for the CI identification is prepared and the presented tool flow is started. This results in a set of CI proposals ranked according to their number of occurrences. At the same time, the code-selector description file of CBurg is extended by Rules for the best-ranked CIs (the number of Rules to be generated is specified in the config-file). Subsequently, the action-sections of these Rules are implemented by the compiler designer and the compiler is produced. The second phase comprises the microarchitectural implementation of the identified CIs by the processor designer. Hereby, each CI is split into operations that are distributed over the five pipeline stages. During this process, similarities among different CIs are identified and used to build common hardware units. Finally, an RTL implementation of the extended processor is generated in an HDL. This RTL implementation is then synthesized to gate level using the Synopsys Design Compiler. The result of the gate-level synthesis allows for estimations of the area consumption and the maximum frequency of the processor design.

Evaluation

The evaluated applications can be roughly categorized into symmetric encryption, image processing and IP processing. For each of these categories, an appropriate processor has been developed, featuring a customized ISA tailored to the operations of the corresponding applications. Additionally, multiple experimental setups have been executed in order to evaluate the effectiveness of the developed techniques for subgraph enumeration and covering.

Tables 7.1 and 7.2 provide a survey of the obtained results. The tables are structured into seven parts: application characteristics, ISE characteristics, binary characteristics, binary characteristics (considering overlapping subgraphs), binary characteristics (excluding overlapping subgraphs), architecture characteristics and overall speedup. The first part, application characteristics, presents data on the C-implementations of the benchmarks. The ISE characteristics illustrate the runtime behavior of the tool flow with and without consideration of the "distances" of distributed patterns. The binary characteristics contain simulation results, reflecting the efficiency of the identified ISAs for the employed applications. First, simulation results without any CIs are presented as a reference. Second, simulation results are shown for two ISE runs: one that considers overlapping subgraphs during covering and one that does not. Subsequently, hardware numbers are presented to characterize the developed processor architectures. Here, only those ISAs have been taken into account that result from analyses considering overlapping subgraphs. Finally, the overall speedups based on the cycle lengths of the underlying processors are presented for the applications.

To examine the effectiveness of the presented pruning technique (c.f. Section 7.3.1), runtime values of the ISE runs are presented in Tables 7.1 and 7.2 both with and without consideration of "distances" (ISE characteristics). All ISE analyses have been performed on a Pentium DualCore D 920 featuring 3.2 GHz and 8 GB RAM. The ISE has been developed and tested under a 64-bit Gentoo Linux system using a GCC compiler in version 4.1 with -O9. This setup still holds great potential for runtime optimization in terms of parallel execution on a multi-processor system. Tables 7.1 and 7.2 show a significant speedup for the pruning technique. While the majority of the runs without consideration of "distances" did not terminate (n.t.) within one week, running the ISE with pruning usually led to results in only a few hours, particularly for very large applications.

In order to explore the effectiveness of the presented covering technique (c.f. Section 7.3.3), all ISA analyses have been applied twice: with and without consideration of overlapping subgraphs. Tables 7.1 and 7.2 (binary characteristics without overlapping graphs) show simulation results of ISA analyses that exclude overlapping subgraphs from CI identification. This approach typically resulted in a smaller number of subgraphs for each application. Consequently, fewer patterns were applied during code selection in the compiler, resulting in less speedup. Naturally, the larger the application, the higher the chance that a certain subgraph overlaps with some other subgraph, such that one of them is excluded during the cover phase. Nonetheless, these values strongly depend on the regarded application and its implementation.
Using the presented methodology and tool flow, the ISA for symmetric encryption has been identified on the basis of the representative encryption algorithms 3DES and AES. To prove the reusability of the developed compiler/architecture design, it has additionally been applied to execute the Blowfish encryption algorithm. Similarly, the ISA for three different image processing algorithms used within jpeg compression and decompression has been derived from img fdct and img idct and has afterwards been utilized for img ycb. No manual analysis or modification of the code was necessary to use the existing architectures and compilers for Blowfish and img ycb, because the patterns of the CIs are detected automatically by the retargeted compiler. Therefore, speedups can also be shown for those applications for which no CI identification was performed, i.e. Blowfish and img ycb. As another test case, a protocol stack application from the IP processing domain is used. It consists of an IPv6 layer, IPSec authentication and encryption as well as an Ethernet layer. All benchmarks have been compiled and profiled to annotate the execution frequencies to the basic blocks. CI identification has been configured according to the coding space provided by the

32-bit RISC architecture of the underlying processor template: Nin = 4, Nout = 2. Additionally, only one memory access and one multiplication/division per instruction were allowed in order to keep the area overhead low.
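These settings can be summarized as follows (an illustrative representation only; the struct and field names are assumptions, not the actual format of the config-file):

/* Illustrative representation of the CI identification settings. */
struct ise_config {
    int n_in, n_out;        /* coding-space limits Nin / Nout       */
    int max_mem_access;     /* memory accesses allowed per CI       */
    int max_muldiv;         /* multiplications/divisions per CI     */
};
static const struct ise_config cfg = { 4, 2, 1, 1 };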

Experimental Results (Part 1)

name                                            3DES         AES          Blowfish     IP Stack
application      lines of C-code                654          923          1569         4424
characteristics  number of DFGs                 76           94           n.a.         426
                 largest DFG (nodes)            1231         928          n.a.         836
                 distance D                     3            3            n.a.         2
ise              runtime without dist.          n.t.         n.t.         n.a.         n.t.
characteristics  runtime with dist.             3.38h        4.02h        n.a.         9.15h
binary           code size without CIs (bytes)  18,856       13,976       7,032        33,544
characteristics  cycles without CIs             14,205,064   7,742,393    19,615,079   4,284,973
binary           code size with CIs (bytes)     17,478       13,893       6,904        33,125
characteristics  rel. code size                 −7.41 %      −1.60 %      −1.93 %      −1.35 %
without          cycles with CIs                12,975,308   7,221,412    17,661,395   3,623,563
overl. graphs    rel. cycle count               −8.76 %      −6.83 %      −10.07 %     −16.54 %
binary           code size with CIs (bytes)     16,559       12,456       6,330        31,616
characteristics  rel. code size                 −12.23 %     −10.34 %     −9.97 %      −5.75 %
with             cycles with CIs                10,540,157   5,520,326    15,025,150   2,578,373
overl. graphs    rel. cycle count               −26.80 %     −28.70 %     −23.40 %     −40.80 %
architecture     core area (kGates)             29.0                                   29.9
characteristics  max. freq. (MHz)               575                                    568
                 number of CIs                  7                                      9
overall speedup                                 +22.71 %     +25.07 %     +20.32 %     +36.58 %

Table 7.1 – Overview of experimental results for compiler/architecture co-exploration, Part 1.

Identified Instructions

The identified CIs typically comprise multiple parallel and/or chained operations (c.f. Appendix A.3). Since many of these CIs can be described as inherently parallel instructions producing multiple results, a graph-based code-selection is mandatory. In the presented case studies, the identified CIs contain two to four operations out of the group of shifts, additions, logical operations and memory read accesses. For example, the ISE for the IP stack produces, amongst others, hardware instructions like a chained xor-and-xor, a parallel add-xor and several instructions consisting of parallel shift operations. In contrast, the ISE for image processing involves instructions like chained diff-add-mul or shift-diff.

Experimental Results (Part 2)

name                                            img fdct     img idct     img ycb
application      lines of C-code                231          278          295
characteristics  number of DFGs                 57           74           n.a.
                 largest DFG (nodes)            167          132          n.a.
                 distance D                     8            8            n.a.
ise              runtime without dist.          5.75h        12.34h       n.a.
characteristics  runtime with dist.             2.99h        2.63h        n.a.
binary           code size without CIs (bytes)  1,904        1,640        840
characteristics  cycles without CIs             319,020      18,880       7,045
binary           code size with CIs (bytes)     1,728        1,524        825
characteristics  rel. code size                 −10.35 %     −0.81 %      −0.78 %
without          cycles with CIs                281,410      17,745       6,728
overl. graphs    rel. cycle count               −12.29 %     −7.12 %      −5.50 %
binary           code size with CIs (bytes)     1,536        1,480        776
characteristics  rel. code size                 −19.33 %     −9.76 %      −7.62 %
with             cycles with CIs                243,169      17,116       6,597
overl. graphs    rel. cycle count               −23.80 %     −10.40 %     −7.40 %
architecture     core area (kGates)             27.0
characteristics  max. freq. (MHz)               529
                 number of CIs                  6
overall speedup                                 +22.86 %     +8.26 %      +5.24 %

Table 7.2 – Overview of experimental results for compiler/architecture co-exploration, Part 2.

Figure 7.9 illustrates assembly code examples for some representative CIs. Their nomenclature reflects the implemented operations. Here, a single underbar indicates an operation-chain and a double underbar a MOI. This can additionally be inferred from the number of results of the presented instructions. For example, the second instruction (sll__sll) implements two parallel left-shifts. It receives two operands in registers, which are shifted, and two immediates that indicate the shift amounts. It writes the two result registers (r2, r3) in the same cycle, containing the shifted values of the input registers (here likewise r2 and r3). However, the presented ISE methodology is restricted neither to these numbers nor to these types of operations. The utilized set of operations rather results from the target applications as well as the ISE configuration.

md5.s:          r7 = xor_and_xor( r6, r5, r4, r6 )
aes.s:          (r2, r3) = sll__sll( r2, 24, r3, 8 )
img_fdct8x8.s:  r0 = diff_mul( r0, r12, r15 )

Figure 7.9 – Example assembler code snippets from different benchmarks.

7.5 Concluding Remarks

This chapter has presented a tool providing a new scalable methodology to evaluate applications regarding promising operations for ISE in a recurrence-aware manner. It finalizes the construction of a retargetable tool flow for automatic compiler-aware CI identification and utilization. The described methodology is very effective as it runs in polynomial time on average. A polynomial-time subgraph enumeration has been applied and further improved, such that the tool is capable of handling large applications that consist of several hundred basic blocks including DFGs of more than 1000 nodes.

Instead of examining only hotspots of a single application, the presented ISE methodology considers all basic blocks of a set of applications and additionally generates a code-selector description to automatically target the new instructions by a compiler. Through their utilization by a compiler, identified CIs are automatically available for arbitrary C-applications, such that the reusability of CIs is strongly exploited. Interestingly, the identified CIs (c.f. Section 7.4) show strong similarities to those identified manually during the evaluation of CBurg (c.f. Section 6.4). These CIs typically feature a simple structure combined with a high frequency of occurrence and execution. It is exactly this property that separates them from those CIs developed manually during the architecture exploration for efficient IPSec encryption (c.f. Chapter 4). Although their execution frequency is very high, the latter occur only in the F-function of the symmetric encryption algorithm Blowfish. They feature a complex internal operation structure that also includes a sophisticated memory access strategy. Here, an intelligent microarchitectural operator coupling enables reusability in the context of Feistel-Networks. However, the reusability of these CIs can only be stated due to detailed knowledge of the common structure of Feistel-Networks. Contrary to this, the reusability of the CIs described in Chapters 6 and 7 towards arbitrary applications is easily proven by their utilization through a compiler.

Chapter 8 Future Work

As suggested by [268, 269], a steady rise in demand for application-specific hardware will trigger research in innovative technologies in the upcoming years. Although approaches to architecture design for NPUs vary significantly (c.f. Chapter 2), the multi- or many-core design principle is prevalent in this area. Vendors like Intel, Freescale and Cisco are prominent examples for this statement. The reason for this lies in the nature of network applications and protocols, which are well known for exhibiting a high degree of data and task level parallelism. Therefore, NPU designers typically utilize several dedicated cores with application-specific ISAs on a single die. Finding an optimal mapping between customized PEs and data-independent tasks is obviously a key factor in achieving optimal latency and throughput. For chip designers, this problem is twofold: It implies the identification of data-independent code segments of an application that are worth being encapsulated as standalone tasks. Furthermore, it implies the development of appropriate PEs with customized ISAs, each of which is tailored to the processing requirements of its respective tasks. Consequently, DSE for MP-SoCs such as NPUs will comprise the simultaneous development of multiple application-specific PEs. In turn, this requires the automatic generation of software tools, such as a linker, assembler, simulator and compiler for each PE. Particularly in this scenario, traditional iterative architecture exploration will probably become too time-consuming as typically multiple alternative mappings exist between tasks and PEs. To complicate matters further, the developed ISAs of the PEs will also affect the evaluation of each mapping, since different ISAs will result in different numbers for throughput and latencies. In this context, the presented framework represents an important step towards overcoming the described problem and developing an architecture exploration framework for MP-SoC architectures like NPUs. It reduces the necessity of iteratively refining a given virtual prototype in correspondence to the needs of a given application (or set of applications). In contrast, by automatically computing the most beneficial ISA and a respective compiler, the rapid development of multiple PEs is enabled.



Figure 8.1 – Sketch of a potential architecture exploration flow for MP-SoCs.

In the future, the presented framework can be combined with approaches for application parallelization [81] and MP-SoC simulation to establish an integrated design flow for multiprocessor architectures (Figure 8.1). Herein, applications written in a high-level language like C/C++ are automatically analyzed and partitioned into independent tasks. For each of these tasks, an appropriate processor ISA and a corresponding compiler are automatically derived. Through the microarchitectural ADL implementation of PEs featuring the developed ISAs, tools like linker, assembler as well as simulator and profiler can be generated. Finally, numbers on overall system performance, like latency and throughput, can be obtained by cooperative simulation of the PEs, which in turn can be used to refine the task partitioning.

Chapter 9 Conclusion

This thesis has tackled the problem of compiler-driven CI identification and utilization. This is a particularly important problem for the development of NPUs. Such efficient, yet flexible architectures are primary representatives of MP-SoCs. NPUs typically use a set of dedicated cores with customized ISAs, tailored to specific packet processing tasks. The need for compiler-driven architecture exploration for NPUs originates from several developments:

• existing legacy code of network protocols prohibits domain-specific programming approaches and consequently requires compiler technologies

• effectiveness of sophisticated high-level language compilation is strongly related to the ISA of an underlying architecture

• the continuous emergence of new Internet protocols (particularly access protocols) requires new innovative hardware technologies for NPUs

To simultaneously develop an architecture's ISA and an appropriate compiler, intensive studies have been presented and performed to derive the necessary technologies for automating this process. The following process steps have been identified as being relevant for such an automation:

• analysis of given applications en bloc regarding common characteristic operations and hotspots

• identification of promising IR-patterns amenable to implementation as hardware instructions

• implementation of corresponding compiler-optimizations to utilize the developed hardware instructions


Moreover, this thesis has presented the development of a retargetable tool flow for the simultaneous exploration of an architecture's ISA and an appropriate compiler (Figure 9.1). The framework is seamlessly integrated into an industry-proven architecture exploration design flow, which results in a methodology and tool support for simultaneous compiler/architecture co-exploration. It is the first framework tackling both recurrence-aware ISE and the full utilization of identified CIs by a compiler. To achieve this objective, the framework revolves around two novel developments in the areas of automatic ISE and code-selection on DFGs.

For the compilation of complex CIs such as MOIs, a code-generator generator called CBurg has been developed. The tool extends the existing concept of well-known code-generators like Olive and Iburg through a heuristic approach for graph-based code-selection. As it runs in linear time on average, the heuristic is very effective. Furthermore, the concept of a code-generator allows for a quick adaption of the compiler's code-selector to new ISAs during architecture exploration. Additionally, the tool does not impose any requirements on a compiler. It provides a simple interface to be implemented by the compiler engineer and produces a set of C-functions for comfortably implementing a graph-based code-selector. Because all hardware instructions can be fed into the compiler's code-selector and hence utilized automatically, the code-generator generator enables much faster design cycles during architecture exploration. This eliminates error-prone and time-consuming manual modification of C source code. Additionally, a high-level programming model that enables the utilization of existing legacy code can be provided for arbitrary customized ISAs.


Figure 9.1 – Framework for compiler/architecture co-exploration.

A new tool for the automatic recurrence-aware identification of promising CIs inside an application en bloc finalizes the present thesis. Herein, a covering technique is applied that is capable of processing overlapping subgraphs in polynomial runtime. The methodology is available for arbitrary compilers as it consumes solely DFGs in gdl-format. Furthermore, the tool emits Rules for each identified CI into the configuration file of CBurg, such that the compiler automatically adapts to the newly extended ISA. All in all, the iterative refinement of a given virtual prototype during architecture exploration improves drastically through the application of the presented framework. This is, in fact, especially important for MP-SoCs like NPUs, which apply a set of dedicated cores with customized ISAs. During the DSE of such systems, multiple cores must be developed simultaneously. Within this scenario, iterative architecture exploration for each core becomes prohibitively slow. Therefore, this framework presents a first step towards efficient architecture exploration of NPUs and MP-SoCs in general.

Appendix A The IRISC Architecture

For the experimental results of Chapters 6 and 7, a RISC template architecture called IRISC has been applied.

A.1 Architecture Survey

Figure A.1 – Survey of IRISC architecture.


As Figure A.1 presents, this architecture revolves around a 5-stage pipeline with conditional instruction execution. The execute stage contains a single ALU with a multiplier, and the register file consists of sixteen 32-bit general purpose registers. The processor's timing equals 2.98 ns ≡ 346 MHz and the area is 25.5 kGates without memory. All hardware syntheses presented in this thesis have been performed using the Synopsys Design Compiler Version 2007.03-SP5 (ultra high effort, no design flattening, 130 nm standard library, 1.2 V, 25 °C).

A.2 Instruction Set Architecture

This section illustrates the available hardware instructions of the IRISC architecture. Herein, registers are designated by a capital R concatenated with an index (e.g. R1, R2, R3). In fact, R1, R2 and R3 will be used as placeholders for arbitrary registers. Immediate constants are denoted by imm concatenated with their bitwidth (e.g. imm12, imm16).

A.2.1 Conditional Execution and Compare-Instructions

Every instruction can be combined with a prefix that determines a condition for its execution. This prefix consists of the keyword if and an expression, which is evaluated. The expressions work on two operands. While the first operand is always a register (R1), the second operand is either a register (R2) or a 12-bit immediate value (imm12). The result is stored in a register (R1), which can typically be applied in the conditional prefix of an instruction.

Conditional prefix if( R1 ) The conditional prefix can be applied to every instruction of the IRISC to determine its execution. If a register, e.g. R1, contains zero, the corresponding instruction is not executed. The following expressions are available for conditional execution.
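As a short illustration, combining a compare-instruction with the prefix (using the assembly syntax described in this section; the concrete register numbers are arbitrary):

R4 = (R2 == R3)       ;; R4 is set to one iff R2 equals R3
if( R4 ) R1 = R1 + 1  ;; executed only if R4 contains a non-zero value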

Equal

R1 = (R2 == R3/imm12) If content of register R2 equals content/value of operand R3/imm12, register R1 is set to one.

Not Equal

R1 = (R2 != R3/imm12) If content of register R2 does not equal content/value of operand R3/imm12, register R1 is set to one.

Greater or Equal

R1 = (R2 >= R3/imm12) If content of register R2 is greater than or equal to content/value of operand R3/imm12, register R1 is set to one.

Greater Than

R1 = (R2 > R3/imm12) If content of register R2 is greater than content/value of operand R3/imm12, register R1 is set to one.

Less or Equal

R1 = (R2 <= R3/imm12) If content of register R2 is less than or equal to content/value of operand R3/imm12, register R1 is set to one.

Less Than

R1 = (R2 < R3/imm12) If content of register R2 is less than content/value of operand R3/imm12, register R1 is set to one.

A.2.2 Arithmetic Instructions

The described arithmetic instructions operate on two operands and one result (R1). While the result and the first operand are always stored in registers (R1, R2), the second operand is either stored in a register (R3) or is a 12-bit immediate value (imm12).

Addition

R1 = R2 + R3/imm12 Stores the sum of operands R2 and R3/imm12 in register R1.

Subtraction

R1 = R2 - R3/imm12 Stores the difference of operands R2 and R3/imm12 in register R1.

Multiplication

R1 = R2 * R3/imm12 Stores the product of operands R2 and R3/imm12 in register R1.

Bitwise AND

R1 = R2 & R3/imm12 Computes a bitwise AND of operands R2 and R3/imm12 and stores the result in R1.

Bitwise OR

R1 = R2 | R3/imm12 Computes a bitwise OR of operands R2 and R3/imm12 and stores the result in R1.

Bitwise XOR

R1 = R2 ^ R3/imm12 Computes a bitwise XOR of operands R2 and R3/imm12 and stores the result in R1.

Left-Shift

R1 = R2 << R3/imm12 Left-shifts the content of register R2 and stores the result in register R1. The number of bit positions to shift is given in operand R3/imm12.

Right-Shift

R1 = R2 >> R3/imm12 Right-shifts the content of register R2 and stores the result in register R1. The number of bit positions to shift is given in operand R3/imm12.

A.2.3 Memory-Access Instructions

Memory accesses are computed by an address-offset scheme. The base address is stored in a register (R2) and optionally an 8-bit offset can be provided, which is added to the base address. Additionally, memory-access instructions support a post-increment mode, i.e. the memory address is incremented by one after the memory access. Contrary to the normal address computation, the sum of base address and offset is hereby stored in register R2.

Load

R1 = dmem[R2 + offset8 ] The content of data memory at address R2 + offset8 is assigned to register R1.

Store dmem[R1 + offset8 ] = R2 The content of register R2 is assigned to data memory at address R1 + offset8.

A.2.4 Load Immediate Instructions

Load Upper Immediate

R1 |= imm16 Loads a 16-bit immediate value (imm16) into the upper 16 bits of register R1.

Load Lower Immediate

R1 =| imm16 Loads a 16-bit immediate value (imm16) into the lower 16 bits of register R1.

A.2.5 Branch-Instructions

Call call R1 @R2 The program counter is set to the address stored in register R1. The return address is stored in register R2.

Jump jmp R1 The program counter is set to the value stored in register R1.

A.3 Instruction Set Extensions

In the context of ISE (c.f. Chapter 7), the IRISC architecture has been extended by several sets of CIs, tailoring the IRISC to domains like encryption, protocol processing and image compression. The instructions presented in this section are either chained instructions or MOIs with two result registers (R1, R2). The nomenclature of these instructions reflects the inherent operations by a concatenation of appropriate operation designations through underbars. A single underbar represents an operation-chain while two underbars represent a parallel execution of operations. Due to the limited coding space (32 bits) of the IRISC architecture, the set of presented MOIs is restricted to MOIs consisting of two operations and results.

A.3.1 Encryption Processing

Parallel Double Addition

(R1, R2) = add__add( R3, R4, R5, R6 ) Performs additions of register contents R3 and R4 as well as R5 and R6, respectively. It stores the results of the additions in registers R1 and R2.

Parallel Double Left-Shift

(R1, R2) = sll__sll( R3, imm5, R4, imm5 ) Performs left-shifts of register contents R3 and R4 by the imm5-values given right next to the registers. It stores the results of the shifts in registers R1 and R2.

Parallel Double Right-Shift

(R1, R2) = srl__srl( R3, imm5, R4, imm5 ) Performs right-shifts of register contents R3 and R4 by the imm5-values given right next to the registers. It stores the results of the shifts in registers R1 and R2.

Parallel Left-Right-Shift

(R1, R2) = sll__srl( R3, imm5, R4, imm5 ) Performs a left-shift and a right-shift of register contents R3 and R4, respectively, by the imm5-values given right next to the registers. It stores the results of the shifts in registers R1 and R2.

Chained XOR-Load

(R1, R2) = xor__lw( R3, R4, R5, R6 ) Performs an XOR of register contents R3 and R4 and reads the content of data memory at address R5 + R6.

Chained ADD-Load

(R1, R2) = add__lw( R3, R4, R5, R6 ) Performs an addition of register contents R3 and R4 and reads the content of data memory at address R5 + R6.

Chained AND-XOR

R1 = and_xor( R2, R3, R4 ) Performs an AND-operation on register contents R2 and R3 with a subsequent XOR on register content R4.

A.3.2 Protocol Processing

The ISE for protocol processing includes all instructions of encryption processing plus the following two:

Chained ADD-XOR

R1 = add_xor( R2, R3, R4 ) Performs an addition of register contents R2 and R3 with a subsequent XOR on register content R4.

Chained AND-XOR-AND

R1 = and_xor_and( R2, R3, R4, R5 ) Performs an AND-operation on register contents R2 and R3 with a subsequent XOR on register content R4 and a second AND-operation on register content R5.

A.3.3 Image Processing

Chained Subtraction-Multiplication

R1 = sub_mul( R2, R3, R4 ) Performs a subtraction of register contents R2 and R3 and subsequently a multiplication by register content R4.

Chained Addition-Multiplication

R1 = add_mul( R2, R3, R4 ) Performs an addition of register contents R2 and R3 and subsequently a multiplication by register content R4.

Chained Addition-Subtraction-Multiplication

R1 = add_sub_mul( R2, R3, R4, R5 ) Performs an addition of register contents R2 and R3, subsequently a subtraction by register content R4 and finally a multiplication by register content R5.

Chained Right-Shift-Subtraction

R1 = srl_sub( R2, imm5, R4 ) Performs a right-shift of register content R2 by a 5-bit immediate value imm5 and subsequently a subtraction of register content R4.

Chained Right-Shift-Addition

R1 = srl_add( R2, imm5, R4 ) Performs a right-shift of register content R2 by a 5-bit immediate value imm5 and subsequently an addition of register content R4.

Parallel Left-Right-Shift

(R1, R2) = sll__srl( R3, imm5, R4, imm5 ) Performs a left-shift and a right-shift of register contents R3 and R4, respectively, by the imm5-values given right next to the registers. It stores the results of the shifts in registers R1 and R2.

Parallel Double Left-Shift

(R1, R2) = sll__sll( R3, imm5, R4, imm5 ) Performs left-shifts of register contents R3 and R4 by the imm5-values given right next to the registers. It stores the results of the shifts in registers R1 and R2.

Appendix B Programming Interface of CBurg

B.1 C-functions for Code-Selector Implementation

For implementing the presented code-selection algorithm (Section 6.3), CBurg exports a set of C-functions. These functions can be assigned to the different phases of the algorithm (Figure B.1). In the remainder of this section, the purposes and signatures of CBurg’s exported functions are illustrated.

(Figure: survey of the exported functions burm_label, label_depth, find_split_pattern, complex_rule, burm_update_rule, burm_pre_cover, burm_complex_all_covered and burm_exec, assigned to the phases Pattern Enumeration, Set Selection, Pre Cover and Cover.)

Figure B.1 – Survey of code selection functions

burm_label

struct burm_state* burm_label( NODEPTR b )
Performs labeling of an IR-node pointed to by NODEPTR (c.f. Section B.2.1). Rule-annotations and cost computations are stored in the structure burm_state.

label_depth

burm_DEPTH label_depth( NODEPTR t, int level )
Computes the depth of each node in the current IR graph. This is used to compute the distance between nodes in a Candidate-node Set (CS). Nodes with shorter distances are preferred to avoid high register pressure in the register allocation phase. The depth of a node is returned as a structure burm_DEPTH.


find_split_pattern

void find_split_pattern( NODEPTR p, burm_DEPTH d, int order_of_tree, burm_MAPS maps )
Searches for Split-Rule annotations at IR-nodes pointed to by NODEPTR and assigns them to CSs called Maps. Hereby, the distance d between Split-Rule-annotated IR-nodes is taken into account to avoid high register pressure.

complex_rule

SET_TABLES complex_rule( burm_MAPS maps, SET_TABLES sts )
From the overall set of all possible Candidate-nodes called Maps, the most beneficial CSs are selected and stored in sts. For internal reasons, the most beneficial set of Candidate-nodes is returned both as a parameter (sts) and as a return value at the same time.

burm_update_rule

struct burm_state* burm_update_rule( NODEPTR u, SET_TABLES sts )
Updates Rule-annotations at nodes pointed to by NODEPTR according to the selected CSs stored in sts. The function returns a modified burm_state structure as result.

burm_pre_cover

void burm_pre_cover( NODEPTR p, int goalnt, SET_TABLES sts )
Performs the pre-covering check for IR-nodes pointed to by NODEPTR and an according nonterminal goalnt. This evaluation is applied for all selected CSs stored in sts.

burm_complex_all_covered

int burm_complex_all_covered( SET_TABLES sts )
Recovers Simple Rules in case of uncovered Split Rules for a CS in sts. The function returns 1 in case of success and 0 otherwise.

burm_exec

void burm_exec( struct burm_state *state, int nterm, ... )
Performs covering on IR-nodes based on the information stored in the burm_state structure.
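A sketch of how a compiler backend might drive these functions for one DFG, following the phases of Figure B.1; the allocation of maps and sts, the goal nonterminal (here: 1) and the overall control flow are placeholders and assumptions, not CBurg's prescribed usage:

void select_code(NODEPTR root, int order_of_tree,
                 burm_MAPS maps, SET_TABLES sts)
{
    /* pattern enumeration: annotate rules/costs and compute node depths */
    struct burm_state *state = burm_label(root);
    burm_DEPTH depth = label_depth(root, 0);

    /* set selection: build candidate-node sets and pick the best ones */
    find_split_pattern(root, depth, order_of_tree, maps);
    sts = complex_rule(maps, sts);

    /* pre-covering: update annotations and verify Split-Rule coverage;
       uncovered Split Rules fall back to Simple Rules */
    state = burm_update_rule(root, sts);
    burm_pre_cover(root, 1, sts);
    (void)burm_complex_all_covered(sts);

    /* covering: execute the action code of the selected rules */
    burm_exec(state, 1, root);
}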

B.2 Code-Selector Specification within CBurg

The code-generator generator CBurg is programmed with the help of a configuration file. This file satisfies a fixed structure consisting of four separate sections denoted as: definitions, declarations, rules and programs. As presented in Figure B.2(a), the sections are separated by special separators like %{, %} and %%.

(a) Structure of specification-file:

%{
definitions
%}
declarations
%%
rules
%%
programs

(b) Example for declarations:

1 %start stmt
2 %term mirPlus = 57
3 %term mirMult = 49
4 %term mirDiff
5 %%
6 %declare stmt
7 %declare reg
8 %declare imm32

Figure B.2 – Code-selector specification within CBurg

B.2.1 Definitions

Relevant macros that are applied to access the compiler's IR are defined within this section. The following macros have to be implemented to ensure a proper cooperation between CBurg's exported functions and the compiler they are applied to.

NODEPTR

Defines a pointer to an IR-node.

NULL

Defines a non-existing IR-node, i.e. an empty NODEPTR.

GET_KIDS(p, kids)

Assigns the children of p to kids which is a vector of type NODEPTR.

OP_LABEL(p, op)

Assigns the label of IR-node p to operator op which is typically a string.

SET_STATE(p, s)

Assigns the state structure s to IR-node p.

STATE_LABEL(p, s)

Assigns the state of IR-node p to state s.

GET_MEMBER_NODE(r, n)

Assigns the NODEPTR of the nth-node of Complex Rule r to n.

DEP_CHECK(p1, p2)

Evaluates dependencies between nodes p1 and p2. The macro either returns 1 or 0.

IS_CSE(p)

Returns 1, if p is a CSE in the current DFG.

FAN_OUT(p)

Returns the number of emanating edges for NODEPTR p.
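A possible implementation of these macros for a hypothetical binary-node IR is sketched below; the struct layout and field names are assumptions for illustration, not part of CBurg's interface:

/* Hypothetical IR node; CBurg only sees it through the macros. */
typedef struct ir_node *NODEPTR;

struct ir_node {
    int             op;        /* operator label                  */
    struct ir_node *kids[2];   /* children (binary nodes assumed) */
    void           *state;     /* CBurg state structure           */
    int             fan_out;   /* number of emanating edges       */
    int             is_cse;    /* node is a common subexpression  */
};

#define GET_KIDS(p, kids_) ((kids_)[0] = (p)->kids[0], (kids_)[1] = (p)->kids[1])
#define OP_LABEL(p, op_)   ((op_) = (p)->op)
#define SET_STATE(p, s)    ((p)->state = (s))
#define STATE_LABEL(p, s)  ((s) = (p)->state)
#define FAN_OUT(p)         ((p)->fan_out)
#define IS_CSE(p)          ((p)->is_cse)

The remaining macros (GET_MEMBER_NODE, DEP_CHECK) would be implemented analogously on the compiler's rule and dependence structures.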

B.2.2 Declarations

The purpose of this section is to declare all used terminal and nonterminal symbols of the ISA grammar. Terminal declarations (as shown in lines 2 – 4 of Figure B.2(b)) start with the keyword %term followed by the name of the symbol and optionally a numeric identifier: %term name [= id]. Since terminals are typically identified via such numeric identifiers inside a compiler, the identifier adopts the role of a major interface between the grammar of CBurg and the IR of a compiler. In case of an omitted identifier assignment for the terminal symbols (as shown in line 4 of Figure B.2(b)), CBurg automatically assigns identifiers to the terminal symbols in ascending order starting with 1. Nonterminal declarations start with the keyword %declare. Such declarations also involve a C-type which is presented in between angle brackets right after the keyword: %declare<type> name. Such types are applied as return values for dedicated functions to cover IR-nodes. It is possible to specify arbitrary C-types inside the angle brackets. In line 8 of Figure B.2(b), a special CoSy type is specified as the type for a nonterminal. Furthermore, a start nonterminal is specified in line 1 of Figure B.2(b). The start symbol is designated as %start name. It denotes the top level of the IR. In Figure B.2(b) it is stmt, which denotes a statement.

B.2.3 Rules

The basic form of a Rule specification inside a tree grammar is

nonterm : opcode(op1,...,opk) {cost} = {action};

where nonterm represents the resulting nonterminal and the part opcode(op1,...,opk), denoted tree, is an instruction pattern comprising an opcode as well as operands in parentheses. Subsequently, two sections of C-code follow, in which developers can specify the cost computation and the Rule-specific actions, respectively. Complex Rules, however, consist of several Simple Rules and therefore have the form:

NTs : Trees Costs = Actions;

where NTs represents an n-ary sequence of result nonterminal names nonterm1 ... nontermn and Trees represents an n-ary sequence of arbitrary tree patterns tree1 ... treen. Costs and Actions are (n+1)-ary sequences of C-code sections ({costs0}{costs1} ... {costsn} and {action0}{action1} ... {actionn}) for the cost evaluation and for the action of the Rule, respectively. Each of these sequences consists of one common section ({costs0} and {action0}, respectively) and one section for every tree in the Rule specification. The cost codes are used in burm_label() to calculate the cost of each matched Rule at a node, and the action codes are executed inside burm_exec when the Rules are covered. The colon after the result nonterminals and the semicolon after the last curly brace at the end are CBurg punctuation. Figure B.3 presents an example specification of a Complex Rule based on the MIPS instruction set. The Rule combines a left-shift with a logical AND.

Costs

The cost part computes the cost of the Rule when the tree pattern of the Rule matches. The cost part can be used as a predicate: it can return a zero cost to accept a match or it can return an infinite cost to force a mismatch. The cost part of a Rule is either a C expression or C code which is evaluated or executed.
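For instance, a Rule could reject a match by assigning a prohibitively high cost (a hypothetical sketch in the cost syntax of Figure B.3; the helper shift_is_legal() and the concrete "infinite" constant are illustrative):

reg: LSHU4(reg, rc5)
{ $cost[0].cost = shift_is_legal(_t->node) ? 1 : 100000; /* "infinite": forces a mismatch */ }
= { /* action code */ };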

B.2.4 Programs

Arbitrary C-functionality can be defined inside this section. Typically, C-functions are defined here that are applied inside the action-sections of some Rules.

reg reg: LSHU4(reg,rc5) BANDI(reg,rc)
{$cost[0].complex = 1; /* common cost code */}
{$cost[0].own_cost = 1;
 $cost[0].cost = $cost[2].cost + $cost[3].cost + $cost[0].own_cost;}
{$cost[0].own_cost = 1;
 $cost[0].cost = $cost[2].cost + $cost[3].cost + $cost[0].own_cost;}
= { /* common action code */
    if(check_all_emitted(_s->node->x.members))
    {
        $inmember[1];
        print("\tslland $%s, $%d, ",
              _t->node->syms[2]->x.name, getregnum(_t->node->kids[0]));
        $member_action[1,3]();
        print(", ");
        $inmember[2];
        print("$%s, $%d, ",
              _t->node->syms[2]->x.name, getregnum(_t->node->kids[0]));
        $member_action[2,3]();
        print(" ;; complex instruction\n");
    }
}
{ /* action code for LSHU4 */ }
{ /* action code for BANDI */ };

Figure B.3 – Example specification of a Complex Rule

Appendix C Partitioned Boolean Quadratic Programming

The PBQP originally stems from quadratic assignment problems as they appear in the field of operations research. The PBQP can be described as a cost function over boolean decision vectors ~x, i.e. the domain Di of a decision vector ~xi is the set of all vectors featuring a single one-element: Di = {~x | ~x^T · ~1 = 1}. The main purpose of PBQP is to select decision vectors in such a way that the cost function is minimized. This cost function represents the sum over all vector-matrix-vector dot products between the decision vectors plus the sum of vector-vector dot products. Hereby, cost matrices Cij, specifying costs between decision vectors ~xi and ~xj, and cost vectors ~ci, specifying costs for a decision vector ~xi, are applied.

C.1. Definition (Partitioned Boolean Quadratic Problem). A PBQP is defined over an n-tuple of boolean decision vectors X = ⟨~x1, ..., ~xn⟩ as follows:

min f(X) = Σ_{1≤i<j≤n} ~xi^T · Cij · ~xj + Σ_{1≤i≤n} ~ci^T · ~xi   (C.1)

subject to

~xi^T · ~1 = 1 for all 1 ≤ i ≤ n.   (C.2)

The domain of the parameter X of the objective function f(X) presented in Equation C.1 is the cross product of the decision vector domains: DX = D1 × ... × Dn. Note that the decision vectors may have different lengths. In addition, the sizes of the vectors ~ci and matrices Cij have to match the lengths of the decision vectors, such that the products in Equation C.1 are defined. Since the PBQP can bear multiple solutions, the minimum min f(X) is just one representative of the solution space. Furthermore, the final solution of Equation C.1 can be described via the indices of the single one-element of each decision vector, i.e. S = ⟨s1,...,sn⟩, where si equals the index of the one-element of decision vector ~xi and the range of each solution element is 1 ≤ si ≤ |~xi|. In Equation C.1, the term ~xi^T · Cij · ~xj selects exactly one element of the matrix Cij due to

Constraint C.2. Let si, sj designate the indices of the one-elements of the according decision vectors ~xi and ~xj, respectively; then the term ~xi^T · Cij · ~xj yields Cij(si, sj), i.e. the matrix element Cij(si, sj) contributes to the objective function if the elements si and sj are selected within the vectors ~xi and ~xj, such that the matrix Cij specifies the costs for combinations of decision vectors. Figure C.1 visualizes the cost matrix in the way that adjacent decision vectors are connected by an edge and the costs Cij are the weights of the edges.

Figure C.1 – The cost matrix shown as transition costs. Each matrix element connects two elements of adjacent decision vectors. In this example, the matrix element Cij(2, 3) contributes to the objective function since it is selected by the vectors ~xi = (0, 1, 0) and ~xj = (0, 0, 1).
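As a short worked illustration of this selection mechanism:

~xi^T · Cij · ~xj = (0, 1, 0) · Cij · (0, 0, 1)^T = Cij(2, 3),

since the left vector selects the second row of Cij and the right vector selects the third column.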

Similar to the cost matrices, the cost vectors ~ci contribute to the cost function, but are selected only by a single decision vector. The term ~ci^T · ~xi selects exactly one element of the cost vector ~ci, because only a single element of ~xi equals one. Furthermore, let si be the index of ~xi's one-element; then the term ~ci^T · ~xi yields ~ci(si). The difficulty of the overall minimization problem stems from the fact that the contributing products cannot be treated locally; rather, the decision vectors turn it into a global problem which is NP-hard to solve. Nevertheless, for sparse problems an optimal solution can be found.

Glossary

3DES Triple DES
ACE Associated Compiler Experts
ACG Application Concurrency Graph
ACL Access Control List
ADL Architecture Description Language
ADPCM Adaptive Differential Pulse Code Modulation
AES Advanced Encryption Standard
ALU Arithmetic Logical Unit
AMCC Applied Micro Circuit Corporation
API Application Programming Interface
ASI Agere System Interface
ASIC Application-Specific Integrated Circuit
ASIP Application-Specific Instruction Set Processor
ATM Asynchronous Transfer Mode
BEG Backend Generator
CAM Content Addressable Memory
CBC Cipher Block Chaining
CCMIR CoSy Common Medium Intermediate Representation
CFG Control Flow Graph
CGD Code Generator Description
CI Custom Instruction
CIC Common Intermediate Code
CiL Compiler-in-the-Loop
CISC Complex Instruction Set Computer
CKF Compiler-Known Function
CMOS Complementary Metal Oxide Semiconductor
CP Channel Processor
CPU Central Processing Unit
CRC Cyclic Redundancy Code
CS Candidate-node Set
CSE Common Subexpression
DAG Directed Acyclic Graph
DDR Double Data Rate
DE Decode
DES Data Encryption Standard
DFA Data Flow Analysis
DFG Data Flow Graph
DFT Data Flow Tree
DMCP Data Manipulation and Control Package
DMIPS Dhrystone MIPS
DOL Distributed Operation Layer
DSA Digital Signature Algorithm
DSE Design Space Exploration
DSL Digital Subscriber Line
DSL Domain-Specific Language
DSLAM Digital Subscriber Line Access Multiplexer
DSP Digital Signal Processor
ECB Electronic Code Book
ECC Elliptic Curve Cryptography
ECDSA Elliptic Curve DSA
EDA Electronic Design Automation
EDL Engine Description Language
EX Execute
FCFS First-Come First-Serve
FE Fetch
FIFO First-In First-Out
FP Frame Pointer
FPP Fast Pattern Processor
FPU Floating-Point Unit
fSDL Full Structured Definition Language
FSM Finite State Machine
FTP File Transfer Protocol
GDL Graph Description Language
GPR General Purpose Register
GPRS General Packet Radio Service
GUI Graphical User Interface
HC Hardware Context
HDL Hardware Description Language
HFA Hyper Finite Automata
HMAC Hash-based Message Authentication Code
I Input
ICD Informatik Centrum Dortmund
IDEA International Data Encryption Algorithm
IETF Internet Engineering Task Force
ILP Instruction Level Parallelism
ILProg Integer Linear Programming
InP Intellectual Property
IP Internet Protocol
IP-DSLAM Internet Protocol - Digital Subscriber Line Access Multiplexer
IPSec Internet Protocol Security
ISA Instruction Set Architecture
ISE Instruction Set Extension
ISP Internet Service Provider
IT Information Technology
IXP Internet Exchange Processor
K Child Node
kGate kilo Gate
KPN Kahn Process Network
L2TP Layer 2 Tunneling Protocol
LAN Local Area Network
LCC Little C Compiler
LIFO Last-In First-Out
LTE Long Term Evolution
MAC Multiply-Accumulate
MACtrl Media Access Control
MAPS MP-SoC Application Programming Studio
MD5 Message Digest 5
MDD Model-Driven Development
MIMO Multiple-Input Multiple-Output
MIPS Million Instructions Per Second
MoC Model of Computation
MOI Multi-Output Instruction
MP-SoC Multiprocessor System-on-Chip
MPI Message Passing Interface
MWISP Maximum Weighted Independent Set Problem
NAT Network Address Translation
NISC Network-optimized Instruction-Set Computing
NOP No-Operation
NPE Network Processor Engine
NPU Network Processing Unit
O Output
OFB64 Output Feedback 64
OpenMP Open Multi-Processing
OS Operating System
PBQP Partitioned Boolean Quadratic Problem
PC Program Counter
PCI Peripheral Component Interconnect
PDA Personal Digital Assistant
PE Processing Element
PFE Pre-Fetch
PISC Packet Instruction Set Computer
POSIX Portable Operating System Interface
PPE Packet Processing Engine
QoS Quality-of-Service
R Root Node
RAID Redundant Array of Independent Discs
RISC Reduced Instruction Set Computer
ROM Read Only Memory
RSP Routing Switch Processor
RSVP Resource Reservation Protocol
RTL Register Transfer Level
SDF Synchronous Data Flow
SDH Synchronous Digital Hierarchy
SDP Serial Data Processor
SDR Single Data Rate
SDRAM Synchronous Dynamic Random Access Memory
SFU Special Functional Unit
SHA Secure Hash Algorithm
SHAPES Scalable Software/Hardware Platform for Embedded Systems
SIMD Single Instruction Multiple Data
SLA Service Level Agreement
SoC System-on-Chip
SONET Synchronous Optical Network
SP Stack Pointer
SPI Serial Peripheral Interface
SPR Special Purpose Register
SRAM Static Random Access Memory
SRM Split Rule Map
SRP Split Rule Pattern
SSA Static Single Assignment
SSH Secure Shell
SSL Secure Socket Layer
SSR State Space Representation
TCP Transport Control Protocol
TDM Time Division Multiplexing
TLM Transaction Level Modeling
TOP Task Optimized Processor
TOPCore Task Optimized Processing Core
TSP Traffic Stream Processor
TTL Task Transaction Level
UDP User Datagram Protocol
UMTS Universal Mobile Telecommunications System
VHDL Very High Speed Integrated Circuit Hardware Description Language
VLIW Very Long Instruction Word
VoIP Voice-over-IP
VPN Virtual Private Network
WB Writeback
WDM Wavelength Division Multiplexing
WiMAX Worldwide Interoperability for Microwave Access
WRED Weighted Random Early Detection
WWW World Wide Web
XML Extensible Markup Language

1.1 Survey of Compiler–in–the–loop architecture exploration...... 2 1.2 NumberofInternetusers2009[9]...... 4

2.1 Typical structure of packet processing applications [97]...... 9 2.2 Network structure with a single access link and multiple content network of ISPs. 14 2.3 Block diagram of AMCC 3700 architecture [46]...... 20 2.4 Block diagram of BMC 4703 architecture [73]...... 21 2.5 Block diagram of Intel IXP 2855 architecture [163]...... 23 2.6 Block diagram of LSI APP 650 architecture [196]...... 24 2.7 Block diagram of Vitesse IQ2200 architecture [262]...... 26

3.1 Typicalcompilerstructure...... 38 3.2 Examples of Data Flow Graph (a) and Data Flow Trees (b)...... 40 3.3 Example of depth-first enumeration of semi-dominators in a flow graph: Solid red edges represent spanning tree edges; black edges are nontree edges; numbers and letters in parentheses designate depth-first number and semidominator of an ac- cordingvertex...... 43 3.4 Example Tree Grammar: Rules consists (from left to right) of a rule number, the nonterminal representing the result, a terminal designating the operator, operand nonterminals given in parentheses as well as the associated costs...... 46 3.5 IR tree with annotated grammar rules. For each node and each nonterminal, those rules providing minimal accumulated costs are annotated. The annotations consists of the produced nonterminal, the rule number in accordance to Figure 3.4 and the appropriatecosts...... 47 3.6 Examplesofsubgraphs...... 52

155 156 List of Figures

4.1 Break-up of tasks in typical VPN traffic...... 57 4.2 Overview of compiler-agnostic architecture exploration...... 58 4.3 Tool based processor architecture exploration loop...... 59 4.4 ExampleofoperationtreeinLISA...... 61 4.5 CiL architecture exploration methodology of the Synopsys Processor Designer. . . 63 4.6 Structure of Blowfish encryption algorithm...... 64 4.7 Data encryption function F ofBlowfish...... 64 4.8 Simulation setup for processor/coprocessor collaboration...... 67 4.9 Firstapproach...... 69 4.10Secondapproach...... 69 4.11 Parallel S-Box access in the EX stage of the coprocessor...... 70 4.12 Functionality of implemented hardware instruction presented as C-code...... 71 4.13 Assembly code for encryption procedure running on the coprocessor...... 72

5.1 Block diagram of the PP32 architecture
5.2 Traditional calling convention
5.3 Low overhead calling convention
5.4 System overview of integrating candidate-selection in the compiler
5.5 Pseudo code of the candidate-selection algorithm
5.6 Example call-graph and traversal with corresponding states of sorted cands

6.1 Survey of iterative CiL architecture exploration flow
6.2 Examples of IR-patterns for instructions
6.3 System overview of code-selection integration in the compiler
6.4 Structure of code-selection algorithm
6.5 Example ISA grammar plus annotated IR snippet during Candidate-node enumeration. The grammar in Figure 6.5(a) features Simple Rules (numbered in black) and Complex Rules (numbered in bright) as well as according Split Rules (numbered in gray). This grammar is applied to the IR snippet presented in Figure 6.5(b). Herein, appropriate rule numbers are annotated at the according IR nodes (marked by bright circles). At nodes 7 and 8, Split Rules and Simple Rules are annotated, each of which produces the same nonterminal. For the sake of simplicity, costs and produced nonterminals are not presented within this figure.
6.6 Example of ISA grammar extension and labeled IR graph
6.7 Example of cost computation during code-selection
6.8 States of IR during pre-covering
6.9 Speedup (relative cycle-count)
6.10 Code size
6.11 Additional area consumption of developed instructions

6.12 Speedup per area unit (kGate) of developed instructions

7.1 System overview of ISE integration in the compiler
7.2 ISE methodology flow
7.3 Computation of a multiple-vertex dominator set of cardinality 2. Figure 7.3(a) presents the flow graph G. The examined seed set {B} ∈ V1 is marked by a red circle. B dominates four vertices, Dom_G(B) = {B, D, F, I}, and the reachable nodes are R_G(B) = {D, F, I, L}. Figure 7.3(b) shows the resulting edge-reduced graph G′, i.e., all edges containing nodes dominated by B are eliminated. In the graph G′, only vertex L is left that is reachable from B. Possible dominator sets for L are highlighted in yellow. In addition, Figures 7.3(c) and (d) present the according dominator trees for G and G′, respectively.
7.4 Pseudo code for subgraph enumeration [64]
7.5 Example for computation of distances
7.6 Example of overlapping isomorphic/disconnected subgraphs
7.7 Example of DFGs containing overlapping subgraphs
7.8 Graph-based presentation of the covering situation in accordance with Figure 7.7
7.9 Example assembler code snippets from different benchmarks

8.1 Sketch of a potential architecture exploration flow for MP-SoCs

9.1 Framework for compiler/architecture co-exploration

A.1 Survey of IRISC architecture

B.1 Survey of code selection functions
B.2 Code-selector specification within CBurg
B.3 Example specification of a Complex Rule

C.1 The cost matrix shown as transition costs. Each matrix element connects two elements of adjacent decision vectors. In this example, the matrix element C_ij(2, 3) contributes to the objective function since it is selected by the vectors x_i = (0, 1, 0) and x_j = (0, 0, 1).

List of Tables

2.1 Overview of industrial NPUs

4.1 Simulation results in the architecture exploration phase
4.2 Area consumption in the architecture implementation phase

5.1 Overview of experimental results for low overhead calling convention

7.1 Overview of experimental results for compiler/architecture co-exploration, Part 1
7.2 Overview of experimental results for compiler/architecture co-exploration, Part 2


Bibliography

[1] The Pragmatics of Model-Driven Development. IEEE Software, 20(5):19–25, Sep.–Oct. 2003.

[2] The Linley Group Newsletter. The Linley Wire, 5, April 2005. http://www.linleygroup.com/npu/Newsletter/wire050420.html.

[3] aiSee Graph Description Language (GDL), 2008. http://www.aisee.com/gdl/nutshell.

[4] Bibliography on Subgraph Isomorphism, 2008. http://liinwww.ira.uka.de/bibliography/Theory/subgraph-iso.html.

[5] Graph Matching Library: VFLib, 2008. http://amalfi.dis.unina.it/graph/db/vflib-2.0.

[6] Informatik Centrum Dortmund, 2008. http://www.icd.de/es/.

[7] Partitioned Boolean Quadratic Programming (PBQP) Solver, 2008. http://www.it.usyd.edu.au/~scholz/pbqp.html.

[8] Studie zur Bedeutung des Sektors Embedded-Systeme in Deutschland, 2008. http://www.bitkom.org/de/themen/54926_55506.aspx.

[9] Internet World Stats, 2009. http://www.internetworldstats.com/stats.htm.

[10] Xelerated Samples 100 Gbit/s Wirespeed Network Processor, 2010. http://www.xelerated.com/en/Xelerated-Samples-100-Gbit-Wirespeed-Network-Processor.aspx.


[11] Freescale plans 12 core 24 thread network processor, 2011. http://www.electronicsweekly.com/Articles/21/06/2011/51300/Freescale-plans-12-core-24-thread-network-processor.htm.

[12] The Linley Group Newsletter. The Linley Wire, July 2011. http://www.linleygroup.com/newsletters/newsletter_detail.php?num=4727&year=2011&tag=1.

[13] Cavium PACE Ecosystem Partners Announce Broad Support for New 100Gbps OCTEON III MIPS64 Processor Family, 2012. http://www.cavium.com/newsevents_Cavium_OCTEON-III_PACE_Ecosystem_Partners.html.

[14] Information Technology Portable Operating System Interface (POSIX), 2012. IEEE Std 1003.1-2004.

[15] List of Defunct Network Processor Companies, 2012. http://en.wikipedia.org/wiki/List_of_defunct_network_processor_companies.

[16] Message Passing Interface, 2012. http://www.mpi-forum.org.

[17] NP-5, 2012. http://www.ezchip.com/pr_110524.htm.

[18] nP3750 5-Gbps Network Processor with Integrated Traffic Manager, 2012. Product Brief.

[19] OCTEON II CN68XX Multi-Core MIPS64 Processors, 2012. http://www.cavium.com/OCTEON-II_CN68XX.html.

[20] OCTEON III CN7XXX Multi-Core MIPS64 Processors, 2012. http://www.cavium.com/OCTEON-III_CN7XXX.html.

[21] OCTEON Multi-Core Processor Family, 2012. http://www.cavium.com/OCTEON_MIPS64.html.

[22] OpenMP: API Specification for Parallel Programming, 2012. http://openmp.org.

[23] PowerQUICC, 2012. http://en.wikipedia.org/wiki/PowerQUICC.

[24] PowerQUICC Communications Processors, 2012. http://www.freescale.com/webapp/sps/site/homepage.jsp?code=POWERQUICC_HOME&tid=prodlib.

[25] Ptolemy Project Home Page, 2012. http://ptolemy.eecs.berkeley.edu.

[26] WinPath1 - First Generation Access Network Processor, 2012. http://pmcs.com/products/processors/network_processors/wp1/.

[27] WinPath2 - Second Generation Access Network Processor, 2012. http://pmcs.com/products/processors/network_processors/wp2/.

[28] WinPath2 Lite - Second Generation Access Network Processor, 2012. http://pmcs.com/products/processors/network_processors/wp2l/.

[29] WinPath3 - Third Generation Access Network Processor, 2012. http://pmcs.com/products/optical_networking/ftth_pon/network_processors/wp3/.

[30] WinPath3 SuperLite - Third Generation Access Network Processor, 2012. http://pmcs.com/products/optical_networking/ftth_pon/network_processors/wp3sl/.

[31] Heuristic, 2013. http://en.wikipedia.org/wiki/Heuristic_(computer_science).

[32] ACE – Associated Compiler Experts bv. The COSY Compiler Development System, 2009. http://www.ace.nl.

[33] Agere. PayloadPlus Routing Switch Processor, April 2000. Preliminary Product Brief.

[34] Agere Systems. Fast Pattern Processor (FPP) Product Brief, April 2001. White Paper.

[35] Agere Systems. The Challenge for Next Generation Network Processors, April 2001. White Paper.

[36] A. V. Aho, M. Ganapathi, and S. W. K. Tjiang. Code Generation Using Tree Pattern Matching and Dynamic Programming. ACM Transactions on Programming Languages and Systems, 11(4):491–516, Oct. 1989.

[37] A. V. Aho and S. C. Johnson. Optimal Code Generation for Expression Trees. Journal of the ACM (JACM), 23(3):488–501, July 1976.

[38] A. V. Aho and S. C. Johnson. Code Generation for Expressions with Common Subexpressions. Journal of the ACM (JACM), 24(1):146–160, Jan. 1977.

[39] A. V. Aho, R. Sethi, and J. Ullman. Compilers, Principles, Techniques and Tools. Addison-Wesley, Jan. 1986. ISBN 0-2011-0088-6.

[40] A. V. Aho and J. Ullman. Optimization of Straight Line Programs. SIAM Journal of Computing, 1(1):1–19, March 1972.

[41] D. S. Alexander, W. A. Arbaugh, M. W. Hicks, P. Kakkar, A. D. Keromytis, J. T. Moore, C. A. Gunter, S. M. Nettles, and J. M. Smith. The SwitchWare Active Network Architecture. IEEE Network, 12(3):29–36, May/June 1998.

[42] S. Alstrup, D. Harel, P. W. Lauridsen, and M. Thorup. Dominators in Linear Time. SIAM Journal on Computing, 28(6):2117–2132, Dec. 1999.

[43] Altera. Nios Embedded Processor System Development, 2008. http://www.altera.com/products/ip/processors/nios.

[44] A. Appel. Modern Compiler Implementation in C. Cambridge University Press, Jan. 1998. ISBN 0-5215-8390-X.

[45] A. Appel, J. Davidson, and N. Ramsey. The Zephyr Compiler Infrastructure. Internal Report, University of Virginia, 1998. http://www.RCS.virginia.edu/zephyr.

[46] Applied Micro Circuits Corporation. nP3700, 5 Gbps Network Processor with Integrated Traffic Manager, 2005. Product Brief.

[47] G. Araujo, S. Malik, and M. Lee. Using Register Transfer Paths in Code Generation for Heterogeneous Memory Register Architectures. In Proc. of the Design Automation Conference (DAC), pages 591–596, June 1996.

[48] J. M. Arnold. S5: The Architecture and Development Flow of a Software Configurable Processor. In Proc. of the Conf. on Field-Programmable Technology (FPT), Dec. 2005.

[49] M. Arnold and H. Corporaal. Designing Domain-Specific Processors. In Proc. of the Conference on Hardware/Software Codesign (CODES-ISSS), pages 61–66, April 2001.

[50] K. Atasu, G. Duendar, and C. Oezturan. An Integer Linear Programming Approach for Identifying Instruction Set Extensions. In Proc. of the Conference on Hardware/Software Codesign (CODES-ISSS), pages 172–177, Sep. 2005.

[51] K. Atasu, L. Pozzi, and P. Ienne. Automatic Application-specific Instruction-Set Extensions under Microarchitectural Constraints. In Proc. of the Design Automation Conference (DAC), pages 256–261, June 2003.

[52] P. M. Athanas and H. F. Silverman. Processor Reconfiguration through Instruction-Set Metamorphosis. IEEE Computer, 26(3):11–18, March 1993.

[53] B. Rau. Iterative Modulo Scheduling: An Algorithm for Software Pipelining Loops. In Proc. of the Symposium on Microarchitecture, pages 63–74, Feb. 1994.

[54] F. Baker. Requirements for IP Version 4 Routers. Technical report, http://www.ietf.org/rfc/rfc1812.txt. Updated by RFC 2644.

[55] F. Balarin, Y. Watanabe, H. Hsieh, L. Lavagno, and C. Passerone. Metropolis: An Integrated Electronic System Design Environment. Computer, 36(4):45–52, April 2003.

[56] M. Baleani, F. Gennari, Y. Jiang, Y. Patel, R. K. Brayton, and A. Sangiovanni-Vincentelli. HW/SW Partitioning and Code Generation of Embedded Control Applications on a Reconfigurable Architecture Platform. In Proc. of the Conference on Hardware/Software Codesign (CODES-ISSS), pages 151–156, May 2002.

[57] B. Bershad, S. Savage, P. Pardyak, E. G. Sirer, D. Becker, M. Fiuczynski, C. Chambers, and S. Eggers. Extensibility, Safety and Performance in the SPIN Operating System. In ACM Symposium on Operating Systems Principles, pages 267–283, Dec. 1995.

[58] K. Bischoff. Design, Implementation, Use, and Evaluation of Ox: An Attribute-Grammar Compiling System based on Yacc, Lex, and C. Technical report, Department of Computer Science, Iowa State University, 1992.

[59] J. Biswas, A. A. Lazar, J.-F. Huard, K. Lim, S. Mahjoub, L.-F. Pau, M. Suzuki, S. Torstensson, W. Wang, and S. Weinstein. The IEEE P1520 Standards Initiative for Programmable Network Interfaces. IEEE Communications Magazine, 36:64–70, Oct. 1996.

[60] P. Biswas, N. Dutt, P. Ienne, and L. Pozzi. Automatic Identification of Application-Specific Functional Units with Architecturally Visible Storage. In Proc. of the Conference on Design, Automation & Test in Europe (DATE), pages 1–6, Mar. 2006.

[61] S. Blake, D. L. Black, M. A. Carlson, E. Davies, Z. Wang, and W. Weiss. RFC2475: An Architecture for Differentiated Services. Technical report, Internet Engineering Task Force (IETF), Dec. 1998.

[62] P. Bonzini and L. Pozzi. Code Transformation Strategies for Extensible Embedded Processors. In Proc. of the Conf. on Compilers, Architectures and Synthesis for Embedded Systems (CASES), pages 242–252, Oct. 2006.

[63] P. Bonzini and L. Pozzi. A Retargetable Framework for Automated Discovery of Custom Instructions. In Proc. of the Conference on Application Specific Systems, Architectures, and Processors (ASAP), pages 334–341, July 2007.

[64] P. Bonzini and L. Pozzi. Polynomial-time Subgraph Enumeration for Automated Instruction Set Extension. In Proc. of the Conference on Design, Automation & Test in Europe (DATE), pages 1331–1336, April 2007.

[65] P. Bonzini and L. Pozzi. Recurrence-Aware Instruction Set Selection for Extensible Processors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 16(10):1259–1267, Oct. 2008.

[66] P. Bose. Optimal Code Generation for Expressions on Super Scalar Machines. In ACM Fall Joint Computer Conference, pages 372–379, 1986.

[67] B. Braden, D. Clark, and S. Shenker. RFC1633: Integrated Services in the Internet Architecture: An Overview. Technical report, Internet Engineering Task Force (IETF), June 1994.

[68] R. Braden, L. Zhang, S. Berson, and S. Jamin. RFC2205: Resource Reservation Protocol (RSVP) – Version 1 Functional Specification. Technical report, Internet Engineering Task Force (IETF), April 1998.

[69] P. Briggs, K. Cooper, and L. Torczon. Improvements to Graph Coloring Register Allocation. ACM Transactions on Programming Languages and Systems, 16(3):428–455, May 1994.

[70] P. Brisk, A. Kaplan, R. Kastner, and M. Sarrafzadeh. Instruction Generation and Regularity Extraction for Reconfigurable Processors. In Proc. of the Conf. on Compilers, Architectures and Synthesis for Embedded Systems (CASES), pages 262–269, Oct. 2002.

[71] P. Brisk, A. K. Verma, and P. Ienne. Rethinking Custom ISE Identification: A New Processor-Agnostic Method. In Proc. of the Conf. on Compilers, Architectures and Synthesis for Embedded Systems (CASES), pages 125–134, Sep. 2007.

[72] Broadcom. BCM4704AGR, 2008. Product Brief.

[73] Broadcom. BCM4704/BCM4703, 2008. Product Brief.

[74] J. Bruno and R. Sethi. Code Generation for a One-Register Machine. Journal of the ACM (JACM), 23(3):502–510, July 1976.

[75] W. Bux, W. Denzel, T. Engbersen, A. Herkersdorf, and R. P. Luijten. Technologies and Building Blocks for Fast Packet Forwarding. IEEE Communications Magazine, 39(1):70–77, Jan. 2001.

[76] C. Fraser and D. Hanson. A Retargetable C Compiler: Design and Implementation. Benjamin/Cummings Publishing Co., 1994. ISBN: 0805316701.

[77] C. Fraser, R. Henry and T. Proebsting. BURG — Fast Optimal Instruction Selection and Tree Parsing. ACM SIGPLAN Notices, 27(4):68–76, Apr. 1992.

[78] S. Carr and P. Sweany. Automatic Data Partitioning for the Agere Payload Plus Network Processor. In Proc. of the Conf. on Compilers, Architectures and Synthesis for Embedded Systems (CASES), pages 238–247, Sep. 2004.

[79] J. Castrillon, R. Leupers, and G. Ascheid. MAPS: Mapping Concurrent Dataflow Applications to Heterogeneous MPSoCs. IEEE Transactions on Industrial Informatics, 2011.

[80] R. G. Cattel. Automatic Derivation of Code Generators from Machine Descriptions. ACM Transactions on Programming Languages and Systems, 2(2):173–190, April 1980.

[81] J. Ceng, J. Castrillon, W. Sheng, H. Scharwaechter, R. Leupers, G. Ascheid, H. Meyr, T. Isshiki, and H. Kunieda. MAPS: An Integrated Framework for MPSoC Application Parallelization. In Proc. of the Design Automation Conference (DAC), pages 754–759, June 2008.

[82] M. K. Chen, X. F. Li, R. Lian, J. H. Lin, L. Liu, T. Liu, and R. Ju. Shangri-La: Achieving High Performance from Compiled Network Applications while Enabling Ease of Programming. In Proc. of the Conf. on Programming Language Design and Implementation (PLDI), pages 224–236, June 2005.

[83] X. Chen, D. L. Maskell, and Y. Sun. Fast Identification of Custom Instructions for Extensible Processors. IEEE Transactions on Computer-Aided Design, 26(2):359–368, Feb. 2007.

[84] F. Chow and J. Hennessy. Register Allocation by Priority-Based Coloring. ACM Letters on Programming Languages and Systems, 19(6):222–232, June 1984.

[85] F. Chow and J. Hennessy. The Priority-Based Coloring Approach to Register Allocation. ACM Transactions on Programming Languages and Systems, 12(4):501–536, Oct. 1990.

[86] Cisco Systems. Parallel eXpress Forwarding in the Cisco 10000 Edge Service Router, Dec. 1999. White Paper.

[87] Cisco Systems Inc. Cisco QuantumFlow Processor: Cisco's Next Generation Network Processor, 2008. White Paper.

[88] N. Clark, A. Hormati, and S. Mahlke. Scalable Subgraph Mapping for Acyclic Computation Accelerators. In Proc. of the Conf. on Compilers, Architectures and Synthesis for Embedded Systems (CASES), pages 147–157, Oct. 2006.

[89] N. Clark, H. Zhong, and S. Mahlke. Processor Acceleration through Automated Instruction Set Customisation. In Proc. of the Symposium on Microarchitecture, page 129, Dec. 2003.

[90] Cognigine Corp. A Monolithic Packet Processing Architecture. Presentation at Network Processor Forum, June 2001.

[91] D. Comer. Network Processors: Programmable Technology for building Network Systems. The Internet Protocol Journal, 7(4), 2001.

[92] J. Cong, Y. Fan, G. Han, and Z. Zhang. Application-Specific Instruction Generation for Configurable Processors. In Proc. of the Symposium on Field Programmable Gate Arrays (FPGA), pages 183–189, Feb. 2004.

[93] K. Cooper, T. J. Harvey, and K. Kennedy. A Simple, Fast Dominance Algorithm. Software Practice and Experience, 4:1–10, 2001.

[94] K. D. Cooper and L. Torczon. Engineering a Compiler. Morgan Kaufmann, 2004. ISBN-10: 155860698X.

[95] L. P. Cordella, P. Foggia, C. Sansone, and M. Vento. A (Sub)Graph Isomorphism Algorithm for Matching Large Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(10):1367–1372, Oct. 2004.

[96] CoWare Inc. CoWare Processor Designer, 2009. http://www.coware.com.

[97] P. Crowley, M. A. Franklin, H. Hadimioglu, and P. Z. Onufryk. Network Processor Design: Issues and Practices. Morgan Kaufmann, 2003. ISBN-10: 0121981576.

[98] J. W. Davidson and C. W. Fraser. Code Selection through Object Code Optimization. ACM Transactions on Programming Languages and Systems, 6(4):505–526, Oct. 1984.

[99] J. D. Day and H. Zimmermann. The OSI Reference Model. Proceedings of the IEEE, 71(12), Dec. 1983.

[100] C. Devine. Small Cryptographic Library, 2007. http://xyssl.org.

[101] R. Draves, B. Zill, and A. Mankin. Implementing IPv6 for Windows NT. In Windows NT Symposium Seattle, Aug. 1998.

[102] E. Dubrova, M. Teslenko, and A. Martinelli. On Relation between Non-Disjoint Decomposition and Multiple-Vertex Dominators. In Proc. of the Symposium on Circuits and Systems, pages 493–496, May 2004.

[103] E. Eckstein, O. Koenig, and B. Scholz. Code Instruction Selection Based on SSA Graphs. In Proc. of the Workshop on Software and Compilers for Embedded Systems (SCOPES), pages 49–65, Oct. 2003.

[104] S. A. Edwards and O. Tardieu. SHIM: A Deterministic Model for Heterogeneous Embedded Systems. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 14(8):854–867, Aug. 2006.

[105] D. R. Engler, M. F. Kaashoek, and J. O'Toole Jr. Exokernel: An Operating System Architecture for Application-level Resource Management. In Symposium on Operating Systems Principles, pages 251–266, Dec. 1995.

[106] M. A. Ertl. Optimal Code Selection in DAGs. In Proc. of the Symposium on Principles of Programming Languages (POPL '99), pages 242–249, 1999.

[107] EZchip. 7-Layer Packet Processing: A Performance Analysis. White Paper.

[108] EZChip. NPA-1/NPA-2/NPA-3 Access Network Processors with Integrated Traffic Management, 2010. Product Brief.

[109] EZChip. NP-4 100-Gigabit Network Processor for Carrier Ethernet Applications, 2011. Product Brief.

[110] EZchip Technologies. EZchip’s Task Optimized Processing Core Technology, 2008. http://www.ezchip.com/technologies.htm.

[111] EZchip Technologies. NP-2 20 Gigabit 7-Layer Network Processor with Integrated Traffic Management, 2008. Product Brief.

[112] EZchip Technologies. NP-3 30 Gigabit 7-Layer Network Processor with Integrated Traffic Management, 2008. Product Brief.

[113] W. Feghali, B. Burres, G. Wolrich, and D. Carrigan. Security: Adding Protection to the Network via the Network Processor. Intel Technology Journal, 6(3), Aug. 2002.

[114] H. Feistel. U.S. Patent 3,798,360, Mar. 1974.

[115] C. W. Fraser, D. R. Hanson, and T. A. Proebsting. Engineering Efficient Code Generators Using Tree Matching and Dynamic Programming. Technical Report TR-386-92, 1992.

[116] C. W. Fraser, D. R. Hanson, and T. A. Proebsting. Engineering a Simple, Efficient Code-Generator Generator. ACM Letters on Programming Languages and Systems (LOPLAS), 1(3), Sept. 1992.

[117] C. W. Fraser, R. R. Henry, and T. A. Proebsting. BURG–Fast Optimal Instruction Selection and Tree Parsing. Technical Report CS-TR-1991-1066, 1991.

[118] C. W. Fraser and A. L. Wendt. Integrating Code Generation and Optimization. ACM SIGPLAN Notices, 21(7):242–248, 1986.

[119] C. W. Fraser and A. L. Wendt. Automatic Generation of Fast Optimizing Code Generators. In Proc. of the Conf. on Programming Language Design and Implementation (PLDI), pages 79–84, June 1988.

[120] Free Software Foundation. GNU Compiler Collection Homepage, 2008. http://gcc.gnu.org.

[121] Freescale Inc. C-Port Network Processor, 2003. Family Brochure.

[122] Freescale Inc. C-3e Network Processor, 2008. Silicon Revision B0.

[123] Freescale Inc. MPC8568E/MPC8567E PowerQUICC III Integrated Processor Hardware Specifications, 2010. Datasheet.

[124] Freescale Semiconductor. C-5e Network Processor, 2008. Silicon Revision B0.

[125] G. Chaitin. Register Allocation and Spilling via Graph Coloring. ACM SIGPLAN Notices, 17(6):98–105, June 1982.

[126] C. Galuzzi and K. Bertels. The Instruction-Set Extension Problem: A Survey. In Workshop on Applied Reconfigurable Computing (ARC), pages 209–220. Springer Verlag Heidelberg, 2008.

[127] C. Galuzzi, K. Bertels, and S. Vassiliadis. A Linear Complexity Algorithm for the Generation of Multiple Input Single Output Instructions of Variable Size. In Proc. of the Intern. Conf. on Embedded Computer Systems: Architectures, Modeling and Simulation, pages 283–293, 2007.

[128] C. Galuzzi, K. Bertels, and S. Vassiliadis. The Spiral Search: A Linear Complexity Algo- rithm for the Generation of Convex MIMO Instruction-Set Extensions. In Proc. of the Conf. on Field-Programmable Technology, pages 337–340, Dec. 2007.

[129] C. Galuzzi, E. M. Panainte, Y. Yankova, K. Bertels, and S. Vassiliadis. Automatic Selection of Application-Specific Instruction-Set Extensions. In Proc. of the Conference on Hardware/Software Codesign (CODES-ISSS), pages 256–261, Oct. 2006.

[130] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, 1990. ISBN: 0716710455.

[131] L. George and M. Blume. Taming the IXP Network Processor. In Proc. of the Conf. on Programming Language Design and Implementation (PLDI), pages 26–37, June 2003.

[132] W. Geurts, F. Catthoor, S. Vernalde, and H. De Man. Accelerator Data-Path Synthesis for High-Throughput Signal Processing Applications. Kluwer Academic Publishers, 1997.

[133] R. S. Glanville and S. L. Graham. A New Method for Compiler Code Generation. In Proc. of the Symposium on Principles of Programming Languages, pages 231–254, 1978.

[134] GNU – Free Software Foundation. Bison - GNU Project, 2008. http://www.gnu.org/software/bison/bison.html.

[135] GNU – Free Software Foundation. Flex – GNU Project, 2008. http://www.gnu.org/software/flex/flex.html.

[136] S. D. Goglin, D. Hooper, A. Kumar, and R. Yavatkar. Advanced Software Framework, Tools, and Languages for the IXP Family. Intel Technology Journal, 7(4), Nov. 2004.

[137] R. Gonzales. Xtensa: A Configurable and Extensible Processor. IEEE Micro, 20(2):60–70, Mar. 2000.

[138] D. Goodwin and D. Petkov. Automatic Generation of Application Specific Processors. In Proc. of the Conf. on Compilers, Architectures and Synthesis for Embedded Systems (CASES), pages 137–147, Nov. 2003.

[139] W. Goralski. SONET. McGraw-Hill, May 2000. ISBN: 0072125705.

[140] M. Gries and K. Keutzer. Building ASIPs: The Mescal Methodology. Springer, June 2005. ISBN: 0-387-26057-9.

[141] R. Gupta. Generalized Dominators and Postdominators. In Proc of the Symposium on Principles of Programming Languages (POPL), pages 246–257, Jan. 1992.

[142] R. Gupta, E. Mehofer, and Y. Zhang. A Representation for Bit Section based Analysis and Optimization. In Proc. of the Conf. on Compiler Construction (CC), pages 62–77, April 2002.

[143] H. Emmelmann, F. Schröer, and R. Landwehr. BEG – A Generator for Efficient Back Ends. Proc. of the Conf. on Programming Language Design and Implementation (PLDI), 24(7):227–237, Jul. 1989.

[144] H. Feistel. Cryptography and Computer Privacy. Scientific American, 228(5):15–23, 1973.

[145] H. G. Choi. Synthesis of Application Specific Instructions for Embedded DSP Software. IEEE Transactions on Computers, 48(6), June 1999.

[146] W. C. F. Haid. Design and Performance Analysis of Multiprocessor Streaming Applications. Shaker Verlag Aachen, 2010. Dissertation.

[147] A. Halambi, P. Grun, V. Ganesh, A. Khare, N. Dutt, and A. Nicolau. EXPRESSION: A Language for Architecture Exploration through Compiler/Simulator Retargetability. In Proc. of the Conference on Design, Automation & Test in Europe (DATE), Mar. 1999.

[148] D. Harel. A Linear Time Algorithm for Finding Dominators in Flow Graphs and Related Problems. In ACM Symposium on Theory of Computing, pages 185–194, May 1991.

[149] M. S. Hecht and J. D. Ullman. A Simple Algorithm for Global Data Flow Analysis Problems. SIAM Journal of Computing, 4:519–532, 1975.

[150] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 2003. ISBN-10: 1558605967.

[151] A. Hoffmann, T. Kogel, A. Nohl, G. Braun, O. Schliebusch, O. Wahlen, A. Wieferink, and H. Meyr. A Novel Methodology for the Design of Application Specific Instruction Set Processors (ASIP) Using a Machine Description Language. IEEE Transactions on Computer-Aided Design, 20(11):1338–1354, Nov. 2001.

[152] A. Hoffmann, H. Meyr, and R. Leupers. Architecture Exploration for Embedded Processors with LISA. Kluwer Academic Publishers, Jan. 2003. ISBN 1-4020-7338-0.

[153] M. Hohenauer, H. Scharwaechter, K. Karuri, O. Wahlen, T. Kogel, R. Leupers, G. Ascheid, and H. Meyr. A Methodology and Tool Suite for C Compiler Generation from ADL Models. In Proc. of the Conference on Design, Automation & Test in Europe (DATE), Mar. 2004.

[154] D. Husak. Network Processors: A Definition and Comparison. White Paper.

[155] Infineon Technologies. Convergate-C - ATM to Ethernet Communication Convergence Pro- cessor, 2008. Product Brief.

[156] Infineon Technologies. Convergate-D - Access Network Processor, 2008. Product Brief.

[157] Intel Corporation. Intel IXP1200 Network Processor, 2002. Product Brief.

[158] Intel Corporation. Intel IXP2800 Network Processor, 2002. Product Brief.

[159] Intel Corporation. Intel Microengine C Compiler Language: Support Reference Manual, March 2002.

[160] Intel Corporation. Intel IXP45X and Intel IXP46X Product Line of Network Processors, 2006. Datasheet.

[161] Intel Corporation. Intel IXP43X Product Line of Network Processors, 2007. Product Brief.

[162] Intel Corporation. Intel IXP2400 Network Processor, 2008. Product Brief.

[163] Intel Corporation. Intel IXP2855 Network Processor, 2008. Product Brief.

[164] Intel Corporation. Internet Exchange Architecture – Programmable Network Processors for Today’s Modular Networks, 2008. White Paper.

[165] Intel Corporation. Intel IXP42X Product Line of Network Processors and IXC1100 Control Plane Processor, 2010. Datasheet.

[166] J. Fisher. Trace Scheduling: A Technique for Global Microcode Compaction. IEEE Transactions on Computers, C-30(7):478–490, Jul. 1981.

[167] G. Kahn. The Semantics of a Simple Language for Parallel Programming. In Proc. of the IFIP Congress, pages 471–475, Aug. 1974.

[168] T. Kangas, P. Kukkala, H. Orsila, E. Salminen, M. Haennikaeinen, T. Haemaelaeinen, J. Riihimaeki, and K. Kuusilinna. UML-Based Multiprocessor SoC Design Framework. ACM Transactions on Embedded Computing Systems (TECS), 5(2), May 2006.

[169] K. Karuri, M. A. Al Faruque, S. Kraemer, R. Leupers, G. Ascheid, and H. Meyr. Fine- grained Application Source Code Profiling for ASIP Design. In Proc. of the Design Au- tomation Conference (DAC), pages 329–334, 2005.

[170] S. Kent and R. Atkinson. Security Architecture for the Internet Protocol. Technical Report RFC 2401, http://www.rfc-editor.org/rfc/rfc2401.txt, Nov. 1998.

[171] R. R. Kessler. Peep: An Architectural Description Driven Peephole Optimizer. In Proc. of the Symposium on Compiler Construction, June 1984.

[172] J. Kim, S. Jung, Y. Paek, and G.-R. Uh. Experience with a Retargetable Compiler for a Commercial Network Processor. In Proc. of the Conf. on Compilers, Architectures and Synthesis for Embedded Systems (CASES), pages 178–187, Oct. 2002.

[173] D. Knuth. Semantics of Context-Free Languages. Theory of Computing Systems, 2(2), June 1968.

[174] S. Kobayashi, Y. Takeuchi, A. Kitajima, and M. Imai. Compiler Generation in PEAS- III: an ASIP Development System. In Workshop on Software and Compilers for Embedded Processors (SCOPES), 2001.

[175] D. R. Koes and S. C. Goldstein. Near Optimal Instruction Selection on DAGs. In International Symposium on Code Generation and Optimization (CGO), April 2008.

[176] J. F. Kurose and K. W. Ross. Computer Networking: A Top-Down Approach Featuring the Internet. Addison Wesley Longman, 2001. ISBN 0-201-47711-4.

[177] S. Kwon, Y. Kim, W.-C. Jeun, S. Ha, and Y. Paek. A Retargetable Parallel-Programming Framework for MPSoC. IEEE Transactions on Design Automation for Electronic Systems, 13(3), July 2008.

[178] D. Landskov, S. Davidson, B. Shriver, and P. Mallett. Local Microcode Compaction Techniques. ACM Computing Surveys, 12(3):261–294, 1980.

[179] T. Lavian, R. Jaeger, and J. K. Hollingsworth. Open Programmable Architecture for Java-enabled Network Devices. In Proc. of the Symposium on High Performance Interconnects, pages 265–277, Aug. 1999.

[180] E. A. Lee. Dataflow Process Networks. In Proc. of the IEEE, volume 83, pages 773–801, May 1995.

[181] E. A. Lee and D. G. Messerschmitt. Synchronous Data Flow. In Proceedings of the IEEE, pages 1235–1245, Sept. 1987.

[182] T. Lengauer and R. E. Tarjan. A Fast Algorithm for Finding Dominators in a Flowgraph. ACM Transactions on Programming Languages and Systems, 1(1):121–141, 1979.

[183] R. Leupers and J. Castrillon. MPSoC Programming using the MAPS Compiler. In Proc. of the Asia South Pacific Design Automation Conference (ASPDAC), pages 897–902, Jan. 2010.

[184] R. Leupers, K. Karuri, S. Kraemer, and M. Pandey. A Design Flow for Configurable Embedded Processors Based on Optimized Instruction Set Synthesis. In Proc. of the Conference on Design, Automation & Test in Europe (DATE), pages 581–586, March 2006.

[185] R. Leupers and P. Marwedel. Instruction Selection for Embedded DSPs with Complex Instructions. In Proc. of the European Conference on Design Automation (EDAC), pages 200–205, Sept. 1996.

[186] R. Leupers, L. Thiele, X. Nie, B. Kienhuis, M. Weiss, and T. Isshiki. Cool MPSoC Programming. In Proc. of the Conference on Design, Automation & Test in Europe (DATE), pages 1488–1493, March 2010.

[187] B. Li and R. Gupta. Bit Section Instruction Set Extension of ARM for Embedded Applications. In Proc. of the Conf. on Compilers, Architectures and Synthesis for Embedded Systems (CASES), pages 69–78, Oct. 2002.

[188] B. Li, Y. Zhang, and R. Gupta. Speculative Subword Register Allocation in Embedded Processors. Languages and Compilers for High Performance Computing (LCPC), 3602/2005:56–71, 2005.

[189] S. Liao, S. Devadas, K. Keutzer, and S. Tjiang. Instruction Selection Using Binate Covering for Code Size Optimization. In Proc. of the Conf. on Computer Aided Design (ICCAD), pages 393–399, 1995.

[190] C. Liem, T. May, and P. Paulin. Instruction-Set Matching and Selection for DSP and ASIP Code Generation. In Proc. of the European Design and Test Conference (ED & TC), pages 31–37, 1994.

[191] L. Liu, F. Chow, T. Kong, and R. Roy. Variable Instruction Set Architecture and its Compiler Support. IEEE Transactions on Computers, 52(7):881–895, July 2003.

[192] LSI Corporation. APP100 – AAL2 SAR Co-Processor, 2008. Product Brief.

[193] LSI Corporation. APP200 – Family of Advanced Communication Processors, 2008. Product Brief.

[194] LSI Corporation. APP300 – Access Network Processors, 2008. Product Brief.

[195] LSI Corporation. APP3300 – Family of Advanced Communication Processors, 2008. Product Brief.

[196] LSI Corporation. APP650 – Advanced PayloadPlus Network Processor, 2008. Product Brief.

[197] LSI Corporation. ATM SAR and Traffic Manager Processor: APP550TM and APP530TM, 2008. Product Brief.

[198] LSI Corporation. Axxia Communication Processor, 2010. Product Brief.

[199] M. Lam. Software Pipelining: An Effective Scheduling Technique for VLIW Machines. Proc. of the Conf. on Programming Language Design and Implementation (PLDI), 23(7):318–328, Jun. 1988.

[200] G. Martin. Overview of the MPSoC Design Challenge. In Proc. of the Design Automation Conference (DAC), pages 274–279, July 2006.

[201] W. M. McKeeman. Peephole Optimization. Communications of the ACM, 8(7):443–444, June 1965.

[202] Mindspeed. M27480, 2008. Product Brief.

[203] Mindspeed. M27481, 2008. Product Brief.

[204] Mindspeed. M27481, 2008. Product Brief.

[205] Mindspeed. M27483, 2008. Product Brief.

[206] Mindspeed. TSP3 Traffic Stream Processor Family, 2008. Product Brief.

[207] MIPS Technologies. MIPS 4Kc Processor Core Datasheet, June 2000.

[208] G. E. Moore. Cramming More Components onto Integrated Circuits. Electronics, 38(8):114–117, April 1965.

[209] A. Moridera, K. Murano, and Y. Mochida. The Network Paradigm of the 21st century and its Key Technologies. IEEE Communications Magazine, 38(11):94–98, Nov. 2000.

[210] D. Mosberger and L. L. Peterson. Making Paths Explicit in the Scout Operating System. In USENIX Symposium on Operating System Design and Implementation (OSDI), pages 153–167, Oct. 1996.

[211] S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers Inc., Aug. 1997.

[212] X. Nie, L. Gazsi, F. Engel, and G. Fettweis. A New Network Processor Architecture for High-Speed Communications. In Proc. of the IEEE Workshop on Signal Processing Systems (SIPS), pages 548–557, Oct. 1999.

[213] H. Nikolov. System-Level Design Methodology for Streaming Multi-Processor Embedded Systems, 2009. PhD thesis, Universiteit Leiden.

[214] H. Nikolov, T. Stefanov, and E. Deprettere. Systematic and Automated Multiprocessor System Design, Programming, and Implementation. IEEE Transactions on Computer-aided Design of Integrated Circuits, 27(3):542–555, March 2008.

[215] H. Nikolov, M. Thompson, T. Stefanov, A. Pimentel, S. Polstra, R. Bose, C. Zissulescu, and E. Deprettere. Daedalus: Toward Composable Multimedia MP-SoC Design.

[216] N. J. Nilsson. Principles of Artificial Intelligence. Springer-Verlag, 1982. ISBN-13: 9783540113409.

[217] A. Nohl, G. Braun, O. Schliebusch, R. Leupers, and H. Meyr. A Universal Technique for Fast and Flexible Instruction-Set Architecture Simulation. In Proc. of the Design Automation Conference (DAC), pages 22–27, Jun. 2002.

[218] P. Anklam, Cutler, Heinen, and MacLaren. Engineering a Compiler: VAX-11 Code Generation and Optimization. Butterworth-Heinemann, Newton, MA, USA, 1982.

[219] P. S. Paolucci, A. A. Jerraya, R. Leupers, L. Thiele, and P. Vicini. SHAPES: A Tiled Scalable Software Hardware Architecture Platform for Embedded Systems. In Proc. of the Conference on Hardware/Software Codesign (CODES-ISSS), pages 167–172, Oct. 2006.

[220] P. Paulin, F. Karim, and P. Bromley. Network Processors: A Perspective on Market Requirements, Processor Architectures and Embedded SW Tools. In Proc. of the Conference on Design, Automation & Test in Europe (DATE), pages 420–429, Mar. 2001.

[221] A. D. Pimentel, C. Erbas, and S. Polstra. A Systematic Approach to Exploring Embedded System Architectures at Multiple Abstraction Levels. IEEE Transactions on Computers, 55(2):99–112, Feb. 2006.

[222] M. Poletto and V. Sarkar. Linear Scan Register Allocation. ACM Transactions on Programming Languages and Systems, 21(5):895–913, Sept. 1999.

[223] L. Pozzi, K. Atasu, and P. Ienne. Exact and Approximate Algorithms for the Extension of Embedded Processor Instruction Sets. IEEE Transactions on Computer-Aided Design, 25(7):1209–1229, July 2006.

[224] L. Pozzi, M. Vuletic, and P. Ienne. Automatic Topology-Based Identification of Instruction-Set Extensions for Embedded Processors. In Proc. of the Conference on Design, Automation & Test in Europe (DATE), March 2002.

[225] T. Proebsting. Least-Cost Instruction Selection in DAGs is NP-Complete, 2008. http://research.microsoft.com/~toddpro/papers/proof.htm.

[226] T. Proebsting and C. Fischer. Probabilistic Register Allocation. In Proc. of the Conf. on Programming Language Design and Implementation (PLDI), pages 300–310, July 1992.

[227] P. W. Purdom and E. F. Moore. Immediate Predominators in a Directed Graph. Commu- nications of the ACM, 15(8):777–778, 1972.

[228] R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures. Morgan Kaufmann Publishers Inc., Oct. 2001. ISBN 1-5586-0286-0.

[229] R. Thayer and N. Doraswamy. RFC2411: IP Security. Technical report, Internet Engineering Task Force (IETF), Nov. 1998.

[230] P. H. Rao and S. K. Nandy. Evaluating Compiler Support for Complexity-Effective Network Processing. In Proc. of the Workshop on Complexity-Effective Design, Nov. 2003.

[231] R. Razdan, K. S. Brace, and M. D. Smith. PRISC Software Acceleration Techniques. In Proc. of the Conf. on Computer Design: VLSI in Computers & Processors, pages 145–149, 1994.

[232] R. Rivest, A. Shamir, and L. Adleman. A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Communications of the ACM, 21(2):120–126, Feb. 1978.

[233] S. Abraham, W. Meleis, and I. Baev. Efficient Backtracking Instruction Schedulers. In International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 301–308, May 2000.

[234] S. Kent and R. Atkinson. RFC2401: Security Architecture for the Internet Protocol. Tech- nical report, Internet Engineering Task Force (IETF), Nov. 1998.

[235] S. Lavrov. Store Economy in Closed Operator Schemes. J. Comput. Math. Math. Phys., 1(4):687–701, 1961.

[236] S. Sakai, M. Togasaki, and K. Yamazaki. A Note on Greedy Algorithms for the Maximum Weighted Independent Set Problem. Discrete Applied Mathematics, 126:313–322, 2003.

[237] H. Scharwaechter, D. Kammler, A. Wieferink, M. Hohenauer, J. Zeng, K. Karuri, R. Leupers, G. Ascheid, and H. Meyr. ASIP Architecture Exploration for Efficient IPSec Encryption: A Case Study. In Proc. of the Workshop on Software and Compilers for Embedded Systems (SCOPES), Sep. 2004.

[238] H. Scharwaechter, D. Kammler, A. Wieferink, M. Hohenauer, J. Zeng, K. Karuri, R. Leupers, G. Ascheid, and H. Meyr. ASIP Architecture Exploration for Efficient IPSec Encryption: A Case Study. ACM Transactions on Embedded Computing Systems (TECS), 6(2), May 2007.

[239] H. Scharwaechter, J. M. Youn, R. Leupers, Y. Paek, G. Ascheid, and H. Meyr. A Code Generator Generator for Multi-Output Instructions. In Proc. of the Conference on Hardware/Software Codesign (CODES-ISSS), Sep. 2007.

[240] O. Schliebusch, M. Steinert, G. Braun, A. Nohl, R. Leupers, G. Ascheid, and H. Meyr. RTL Processor Synthesis for Architecture Exploration and Implementation. In Proc. of the Conference on Design, Automation & Test in Europe (DATE), Mar. 2004.

[241] B. Schneier. Applied Cryptography. John Wiley & Sons, Jun. 1996. ISBN 0-471-11709-9.

[242] B. Schneier, J. Kelsey, D. Whiting, D. Wagner, and C. Hall. Twofish: A 128-Bit Block Cipher. Jun. 1998.

[243] B. Scholz and E. Eckstein. Register Allocation for Irregular Architectures. In Proc. of the Workshop on Software and Compilers for Embedded Systems (SCOPES), pages 129–148, June 2002.

[244] B. Scholz and E. Eckstein. Address Mode Selection. In Proc. of the Symposium on Code Generation and Optimization (CGO), pages 337–346, March 2003.

[245] B. Scholz and E. Eckstein. Partitioned Boolean Quadratic Programming (PBQP). Technical report, University of Sydney: School of Information Technologies, Jan. 2006.

[246] R. Sethi and J. D. Ullman. The Generation of Optimal Code for Arithmetic Expressions. Journal of the ACM (JACM), 17(4):715–728, Oct. 1970.

[247] N. Shah. Understanding Network Processors. Technical Report 1.0, University of California, Berkeley, Sep. 2001.

[248] N. Shah and K. Keutzer. Network Processors: Origin of Species. In Proc. of the Int. Symposium on Computer and Information Science, 2002.

[249] N. Shah, W. Plishker, and K. Keutzer. NP-Click: A Programming Model for the IXP1200. In Proc. of the Symposium on High Performance Computer Architectures (HPCA), Feb. 2003.

[250] SPAM Research Group. SPAM Compiler User's Manual, Sep. 1997. http://www.ee.princeton.edu/spam.

[251] T. Stefanov. System Design Using Kahn Process Networks: The Compaan/Laura Approach. In Proc. of the Conference on Design, Automation & Test in Europe (DATE), pages 340–345, Feb. 2004.

[252] T. Grötker, S. Liao, G. Martin, and S. Swan. System Design with SystemC. Kluwer Academic Publishers, 2002. ISBN-10: 1402070721.

[253] R. E. Tarjan. Finding Dominators in Directed Graphs. SIAM Journal of Computing, 3(1):62–89, 1974.

[254] D. L. Tennenhouse and D. J. Wetherall. Towards an Active Network Architecture. Computer Communication Review, 26(2):5–17, Jan. 1996.

[255] R. Thayer, N. Doraswamy, and R. Glenn. IP Security Document Road Map. Technical Report RFC 2411, http://www.rfc-editor.org/rfc/rfc2411.txt, Nov. 1998.

[256] S. W. K. Tjiang. An Olive Twig. Technical report, Synopsys Inc., 1993.

[257] J. R. Ullmann. An Algorithm for Subgraph Isomorphism. Journal of the ACM (JACM), 23(1):31–42, Jan. 1976.

[258] P. van der Wolf. Design and Programming of Embedded Multiprocessors: An Interface-Centric Approach. In Proc. of the Conference on Hardware/Software Codesign (CODES-ISSS), pages 206–217, Sept. 2004.

[259] S. Vassiliadis, K. Bertels, and C. Galuzzi. A Linear Complexity Algorithm for the Automatic Generation of Convex Multiple Input Multiple Output Instructions. In Workshop on Applied Reconfigurable Computing (ARC), pages 130–141, 2007.

[260] S. Verdoolaege, H. Nikolov, and T. Stefanov. pn: A Tool for Improved Derivation of Process Networks. EURASIP Journal on Embedded Systems, Jan. 2007.

[261] A. K. Verma, P. Brisk, and P. Ienne. Fast, Quasi-Optimal, and Pipelined Instruction-Set Extensions. In Proc. of the Conference on Design, Automation & Test in Europe (DATE), pages 334–339, Jan. 2008.

[262] Vitesse Semiconductor Corporation. IQ2200 VSC2232 Network Processor, 2001. Preliminary Data Sheet.

[263] Vitesse Semiconductor Corporation. IQ2200, 2002. Data Sheet.

[264] J. Wagner and R. Leupers. C Compiler Design for a Network Processor. IEEE Trans. on Computer-Aided Design (TCAD), 20(11):1302–1308, Nov. 2001.

[265] J. Wagner and R. Leupers. C Compiler Design for an Industrial Network Processor. In Proc. of the ACM SIGPLAN Workshop on Languages, Compilers and Tools for Embedded Systems, pages 155–164, 2001.

[266] J. Wagner and R. Leupers. Advanced Code Generation for Network Processors with Bit Packet Addressing. In International Symposium on High Performance Computer Architectures (HPCA), Feb. 2002.

[267] D. Wetherall, J. Guttag, and D. Tennenhouse. ANTS: A Toolkit for Building and Dynamically Deploying Network Protocols. In Proc. of the Conf. on Open Architectures and Network Programming (OPENARCH), pages 117–129, April 1998.

[268] B. Wheeler and L. Gwennap. A Guide to Metro Network Processors. The Linley Group, 8th edition, 2006.

[269] B. Wheeler and L. Gwennap. A Guide to Network Processors. The Linley Group, 9th edition, 2008.

[270] A. Wieferink, T. Kogel, R. Leupers, G. Ascheid, and H. Meyr. A System Level Processor/Communication Co-Exploration Methodology for Multi-Processor System-on-Chip Platforms. In Proc. of the Conference on Design, Automation & Test in Europe (DATE), Mar. 2004.

[271] M. Willems and V. Živojnović. DSP-Compiler: Product Quality for Control-Dominated Applications? In Proc. of the Conf. on Signal Processing Applications and Technology (ICSPAT), Oct. 1996.

[272] M. J. Wirthlin and B. L. Hutchings. DISC: A Dynamic Instruction Set Computer. In Symp. on FPGAs for Custom Computing Machines, pages 99–107, April 1995.

[273] Xelerated. Matching Architecture to Application, 2008. http://www.xelerated.com/templates/page.aspx?page_id=180.

[274] Xelerated. Xelerator X10q Network Processor, 2008. Product Brief.

[275] Xelerated. Xelerator X11 Network Processor, 2008. Product Brief.

[276] Xelerated. HX Family of Network Processors, 2012. http://www.xelerated.com/en/hx/.

[277] Xelerated Packet Devices Corp. The world's first 40 Gbps (OC-768) Network Processor. Presentation at Network Processor Forum, June 2001.

[278] Xilinx. MicroBlaze Soft Processor Core, 2008. http://www.xilinx.com/prs_rls/embedded/0560microblaze.htm.

[279] F. Yang. ESP: A 10-Year Retrospective. In Proc. of the Embedded Systems Programming Conference, 1999.

[280] P. Yu and T. Mitra. Scalable Custom Instructions Identification for Instruction Set Extensible Processors. In Proc. of the Conf. on Compilers, Architectures and Synthesis for Embedded Systems (CASES), pages 69–78, Sept. 2004.

[281] P. Yu and T. Mitra. Disjoint Pattern Enumeration for Custom Instruction Identification. In Proc. of the Conf. on Field Programmable Logic and Applications (FPL), pages 273–278, Aug. 2007.

[282] X. Zhuang and S. Pande. Resolving Register Bank Conflicts for a Network Processor. In Proc. of the Conf. on Parallel Architectures and Compilation Techniques (PACT), 2003.

[283] X. Zhuang and S. Pande. Balancing Register Allocation Across Threads for a Multithreaded Network Processor. In Proc. of the Conf. on Programming Language Design and Implemen- tation (PLDI), pages 289–300, June 2004.

[284] X. Zhuang and S. Pande. Compiler-Supported Thread Management for Multithreaded Net- work Processors. ACM Transactions on Embedded Computing Systems (TECS), 10(4), Nov. 2011.

[285] X. Zhuang, S. Pande, and J. S. Greenland Jr. A Framework for Parallelizing Load/Stores on Embedded Processors. In Proc. of the Conf. on Parallel Architectures and Compilation Techniques (PACT), page 68, 2002.

[286] V. Živojnović, H. Schraut, M. Willems, and R. Schoenen. DSPs, GPPs and Multimedia Applications - An Evaluation Using DSPstone. In Proc. of the Conf. on Signal Processing Applications and Technology (ICSPAT), Oct. 1995.