EFFICIENT HARDWARE FOR LOW LATENCY APPLICATIONS

Inaugural dissertation for obtaining the academic degree of a Doctor of Natural Sciences at the Universität Mannheim

submitted by

Christian Harald Leber (Diplom-Informatiker der Technischen Informatik)

from Ludwigshafen

Mannheim, 2012
Dean: Professor Dr. Heinz Jürgen Müller, Universität Mannheim
Referee: Professor Dr. Ulrich Brüning, Universität Heidelberg
Co-referee: Professor Dr. Holger Fröning, Universität Heidelberg

Date of the oral examination: 20 August 2012

For my parents

Abstract

The design and development of application specific hardware structures has a high degree of complexity. Logic resources are nowadays often not the limit anymore; instead, the development time and especially the time it takes for testing are significant factors that limit the possibility to implement functionality in hardware. In the following, techniques are presented that help to improve the capabilities to design complex hardware.

The first part presents a generator (RFS) which allows defining control and status structures for hardware designs using an abstract high level language. These structures are usually called register files. Special features are presented for the efficient implementation of counters and error reporting facilities. Native support for the partitioning of such register files is an important trait of RFS.

A novel method to inform host systems very efficiently about changes in the register files is presented in the second part. It makes use of a programmable hardware unit. The microcode is auto generated based on output from RFS. The general principle is to push changed information from the register file to the host systems instead of relying on the standard pull approach.

In the third part a highly efficient, fully pipelined address translation mechanism for remote memory access in HPC interconnection networks is presented. It is capable of performing up to one translation per cycle and provides efficient interfaces for software to control the content of its n-way set associative translation cache. The mechanism is centered on a pipeline with a new concept to resolve dependency problems.

The last part of this thesis addresses the problem of sending TCP messages for a low latency trading application using a hybrid TCP stack implementation that consists of hardware and software components. This separation allows low latency while keeping the complexity under control. Furthermore, a simulation environment for the TCP stack is presented, which is used to test the stack within a hardware simulator. This hardware-software co-simulation accelerates the testing and debugging of complex hardware designs. It allows employing standard components like an operating system kernel instead of implementing the functionality again within the simulation environment.

Zusammenfassung

The design and development of application specific hardware structures involve a high degree of complexity. The amount of logic resources is often no longer the limit; instead, the development time and especially the time needed for testing are today a significant factor that limits the possibilities to implement logic in hardware. In the following, techniques are presented that improve the capabilities to develop complex hardware.

The first part presents a generator which makes it possible to describe control and status structures for hardware designs by means of an abstract high level language. These are usually called register files. Special functions for the efficient implementation of counters and facilities for the indication of errors are presented. A central element of this development is the ability to directly support the partitioning of register files.

A novel method to inform the host system very efficiently about changes in said register file is presented in the second part. For this purpose a hardware unit programmable by microcode is used. The microcode is generated automatically based on the output of the register file generator. The general principle is to push changed information from the register file to the host system instead of relying on the standard approach of the host system fetching the data.

In the third part a highly efficient mechanism for the translation of addresses for access to remote memory in HPC interconnection networks is shown. The mechanism is able to process up to one translation per cycle and provides efficient interfaces so that software can control the content of the n-way set associative translation cache. This mechanism uses a pipeline with a new solution to resolve dependencies.

The last part of this dissertation deals with the problem of sending TCP messages with the lowest possible latency for high frequency trading at stock exchanges. For this purpose a hybrid TCP implementation consisting of hardware and software components is presented. This separation enables low latency and allows keeping the complexity under control. Furthermore, a hardware-software co-simulation for said TCP implementation is presented, which serves to test it within a hardware simulator together with standard components like an operating system kernel, instead of re-implementing the functionality within the simulator.

Contents

1 Introduction
  1.1 Problems, Goals and Solutions
  1.2 Environment of the Hardware Structures
    1.2.1 HyperTransport on Chip Protocol
    1.2.2 Interface to the Host System
    1.2.3 Hardware
  1.3 Application Specific Hardware Structures
    1.3.1 EXTOLL
    1.3.2 Trading Accelerator

2 Register File Generation
  2.1 Introduction
    2.1.1 Register File
    2.1.2 Auto Generation
    2.1.3 Hierarchical Decomposition
  2.2 State of the Art and Motivation
    2.2.1 Denali Blueprint
    2.2.2 Others
    2.2.3 Motivation
  2.3 Design Decisions
    2.3.1 Features
    2.3.2 Input Method and Data Format
    2.3.3 Output Data Formats
    2.3.4 HW/SW Interface Consistency
    2.3.5 Interface

  2.4 XML Specification and Features
    2.4.1 Basic XML Example of a RF
    2.4.2 Hardware Register - hwreg
    2.4.3 64 Bit Register - reg64
    2.4.4 Aligner
    2.4.5 Ramblock
    2.4.6 Regroot / Regroot Instance
    2.4.7 Multi Element Instantiation
    2.4.8 Placeholder
    2.4.9 Trigger Events
    2.4.10 Annotated XML
  2.5 Implementation of RFS
  2.6 Supporting Infrastructure
    2.6.1 HyperTransport on Chip to RF Converter
    2.6.2 Debug Port
    2.6.3 OVM Components
    2.6.4 RF Driver Generator
  2.7 Application of RFS
    2.7.1 Shift Register Read and Write Logic
    2.7.2 EXTOLL RF
  2.8 Optimization Efforts
    2.8.1 Step 1
    2.8.2 Step 2
    2.8.3 Step 3
    2.8.4 Step 4
    2.8.5 Results

3 System Notification Queue
  3.1 Introduction
  3.2 Motivation
    3.2.1 Performance Counters
  3.3 Register File
    3.3.1 Trigger Dump Groups
  3.4 Functionality
  3.5 Architecture
    3.5.1 Control Logic

  3.6 Hardware Environment and Components
    3.6.1 HyperTransport on Chip to RF Converter
    3.6.2 Trigger Arbiter and Control
    3.6.3 Microcode Engine
    3.6.4 Register File Handler
    3.6.5 HT Send Unit
    3.6.6 SNQ Register File
  3.7 Microcode Engine Instruction Set
    3.7.1 JUMP Arbiter
    3.7.2 JUMP
    3.7.3 LOAD Prepare Store
    3.7.4 Write to SNQ Mailbox
    3.7.5 Send RAW HToC Packet
    3.7.6 Write RF
  3.8 Microcode Generation
    3.8.1 Input Files
    3.8.2 Assembly Instructions
    3.8.3 SNQC Implementation
  3.9 The Integration and Application of the SNQ
    3.9.1
    3.9.2 Counters
    3.9.3 Remote Notifications
    3.9.4 Resource Usage

4 Fast Virtual Address Translation
  4.1 Introduction
  4.2 State of the Art
    4.2.1 ATU1
  4.3 Address Translation
    4.3.1 Translation Table
    4.3.2 Functionality
  4.4 Motivation
    4.4.1 Technology

  4.5 Interface Architecture
    4.5.1 Network Logical Address
    4.5.2 Virtual Process ID
    4.5.3 Request and Response
    4.5.4 Interface to Requesting Units
    4.5.5 Fence Interface
    4.5.6 Global Address Table Format and Lookup
    4.5.7 Translation
    4.5.8 Configuration and Command Interface
    4.5.9 Preload
  4.6 Design
    4.6.1 TLB
  4.7 Micro Architecture and Implementation
    4.7.1 Inputs and HT Inport
    4.7.2 Dependencies and Flow Control
    4.7.3 Command Expander
    4.7.4 Execute
    4.7.5 Pipeline Instruction Format
    4.7.6 HT Outport
  4.8 Software Interface
    4.8.1 Statistics Counters
  4.9 Results
    4.9.1 Latency
    4.9.2 Resource Requirements

5 Trading Architecture and Simulation Environment
  5.1 Introduction
    5.1.1 High Frequency Trading
    5.1.2 Testing and Simulation Methods
  5.2 TCP
    5.2.1 Implementation
  5.3 Simulation Environment and Testing
    5.3.1 TAP device
    5.3.2 BFMS
  5.4 Benchmark

6 Conclusion

A Representation Methods

B Acronyms

C List of Figures

D List of Tables

References

1 Introduction

The complexity of recent hardware designs is increasing, and therefore ways are required to handle this complexity efficiently in terms of development time and quality. "Moore's Law" [29] is still a driving factor in the semiconductor sector, which consists not only of industry companies but also of academic institutions that have to search for the coming problems and their solutions. The "Law" was updated in 1975 [39] and states that the amount of components per chip doubles every two years. The straight line in figure 1-1 shows the development of the transistor count for CPUs, and as can be seen, technology is keeping up with this prediction. The growing amount of logic naturally raises expectations: people want to implement more and more complex functionality in hardware.

Figure 1-1: Transistor Counts 1971-2011 (from [93])


FPGAs are of course growing at the same rate. Implementations on big high-end FPGAs are the primary focus of this work, because they offer the possibility to implement big designs in hardware. Of course, the possibility of ASIC implementations is also considered, because these offer higher clock frequencies and lower costs if produced in high volume. ASICs also allow a fully custom configuration of I/O resources. FPGAs provide a wide range of I/O capabilities for different applications, but this wide range does not fit all applications. Table 1-1 shows the date of the press announcement for the last three high end FPGA product lines. The "maximum LE" (Logic Element) number is the marketing size of the biggest FPGA in each product line, while the "LUT 6" column is the actual number of look-up tables in the biggest FPGA. These numbers provide only three data points, but the exponential tendency can be recognized. Please note that the announcement date has only a weak correlation to the date of actual availability. The announcement date was chosen only because it is possible to find these dates; there is no information published about the actual availability of production FPGAs.

Announcement     FPGA      maximum LE   LUT 6
May 2006 [137]   Virtex 5  330K         207360
Feb. 2009 [138]  Virtex 6  760K         474240
June 2010 [139]  Virtex 7  2000K        1221600

Table 1-1: Xilinx Virtex 5-7 Announcements
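As a rough plausibility check of this exponential tendency (the arithmetic is mine, not from the thesis): the LUT 6 count grows from 207360 to 1221600 in the roughly four years between the Virtex 5 and the Virtex 7 announcement, which corresponds to a doubling time of about 1.6 years, slightly faster than the two years stated by Moore's law:

$\frac{1221600}{207360} \approx 5.9, \qquad T_2 \approx 4.1\,\text{years} \cdot \frac{\ln 2}{\ln 5.9} \approx 1.6\,\text{years}$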

The problem is, in the end, that ASICs and as a result also FPGAs are growing with Moore's law, but there is a growing gap [84] between the possibilities to develop correctly working logic and the available amount of logic resources. Hardware development capabilities do not scale with the growth of the available logic resources. Therefore, methods are required to reduce the development time and error probability. A very good way to reach this target is the development of tools that make use of a higher abstraction level. A higher abstraction level also eases reusability and accelerates the design process. In this work different methods and development tools are described that simplify the development of several state of the art research projects.


1.1 Problems, Goals and Solutions

The questions of design abstraction and acceleration, the implementation of pipelined micro architectures and the simulation of complex application specific hardware structures are discussed in four different contexts:
• register file generator (RFS)
• system notification queue (SNQ)
• address translation unit (ATU2)
• trading accelerator

Modern and complex hardware designs have a large amount of settings and status information, which are accessible in a so called register file. Efficient methods are required to define such structures on an abstract level. A generator will be presented that allows describing these structures in a high level description language. This generator supports structures that ease the development of hardware designs and provides a multitude of new features.

The complex hardware structures are often used in combination with host systems, and these host systems have to be informed about status information changes fast and efficiently. Thus, a novel programmable hardware unit has been designed for this task. This unit is controlled by microcode, which is automatically generated based on the high level description of the register file.

HPC interconnects have to provide, for performance reasons, remote memory access (RMA) facilities. Furthermore, it is advantageous in terms of latency if RMA accesses can be issued from user level processes without the help of the operating system kernel. Therefore, a fast address translation mechanism is designed for the EXTOLL interconnection network. The mechanism also provides a security mechanism to protect the user level processes from each other.

A low latency hardware accelerator for stock trading applications requires a method to send orders to the stock exchange with a very low latency. Thus, an innovative approach for a hybrid hardware-software TCP stack is presented. It implements the possibility to send TCP packets with a very low overhead from user-level software. The system test of this complex hardware system is accelerated with a novel hardware-software co-simulation test environment that allows connecting a hardware simulator efficiently with software components.


1.2 Environment of the Hardware Structures

To prepare the introduction of the following two application specific hardware structures, it is necessary to introduce the environment these hardware designs will be based on. All designs assume that the whole system is composed of a host system that is connected to the hardware structure with a packet based protocol. This hardware structure can either be implemented in an FPGA or an ASIC. Figure 1-2 depicts such a setup. In the following, ASICs and FPGAs are considered to be equivalent, because for the design discussions it is usually of no concern whether the actual physical device is reprogrammable. Therefore, the following text will only use the term FPGA.

The hardware structure is composed of multiple functional units (FU), which are interconnected with a crossbar. The RF is also a FU, because it is connected to the crossbar to be accessible from the host system. It contains structures that can be used to control the other FUs. Automatic methods to generate such a structure will be presented in chapter 2. FUs use the HyperTransport [30] on Chip (HToC) protocol to communicate with each other, and the protocol is also translated by a bridge to the interface of either a PCIe or HT core. Therefore, the resulting design principle divides all hardware modules into two classes, the platform specific ones and the platform independent ones. Each different PCB with FPGA can be considered as its own platform. Even when the same FPGA is used there are often big differences; for example, one board might use PCIe and another one HT.

In hardware designs, a "top_file" denotes the hardware description language (HDL) file that contains all the instances that are to be included in the hardware device. The actual top_file contains, besides the core_top_file, also the platform specific parts, which consist of at least the following:
• PCIe or HT core
• bridge between PCIe/HT core and HToC
• clocking and reset infrastructure
• board specific interfaces like I2C

All the modules that are contained in the core_top_file are as a result independent from the respective PCIe or HT core that is used. Furthermore, it would easily be possible to support another interface by writing a bridge for it. For example, AMBA [128] may be a candidate that could be supported in the future, because it is the protocol that can be used to interface with ARM cores. These cores can also be found on new FPGAs [129].


In this work elements of two designs will be described. Both designs are used with PCIe and HT on different FPGA boards and they share the same general structures.

Figure 1-2: Generic Design Environment (host system with CPU and main memory connected via a PCIe/HT core and bridge to the HTAX crossbar; FUs and the CSR RF inside the top_file/core_top_file)

1.2.1 HyperTransport on Chip Protocol

This protocol is used by the FUs as described previously. The protocol is an enhanced HyperTransport (HT) protocol that solves the following key problems in the HT protocol:
• only 32 source tags (SrcTag)
• a maximum payload size of 64 bytes

The differences between HT and HToC are relatively small. Therefore the translation between these two protocols is of limited complexity, as long as the limitations of such a translation are accepted. Especially the payload size has to be considered: the bridge that is currently used to interface the HToC protocol with HT does not support splitting up packets. HT packets have a maximum payload size of 64 byte, but in the case of HToC 4 kB are possible. Therefore it is necessary that FUs are aware of the maximum transmission unit (MTU) of their communication destination. In figure 1-3 the general layout of the header is shown. Not all fields are described here, but the most important ones are Cmd, SrcTag, Count and Addr; these fields have the same functionality as in HT. However, to overcome the packet size and source tag number restrictions, extensions for Count and SrcTag are provided by ext_count and ext_srcTag.


The Cmd describes the type of the packet. The encoding is the same as in the case of HT; the following are the most important ones:
• posted write (6'b101101): double word write
• read request (6'b011101): double word read
• read response (6'b110000)

The read request and response are matched with the help of the srcTag. In the case of double word writes or reads the Count field has the number of double words, but the counting starts at zero. Therefore, zero means a packet size of one double word.
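To illustrate the zero-based counting, the following minimal Verilog sketch (module and signal names are mine, not from the actual EXTOLL sources) computes the payload size in double words from the Count field and its ext_count extension; ten count bits allow up to 1024 double words, which matches the 4 kB maximum HToC payload mentioned above:

module htoc_count_decode (
    input  wire [3:0]  count,      // Mask/Count[3:0] from the header
    input  wire [5:0]  ext_count,  // ext_count[9:4] extension field
    output wire [10:0] dwords      // payload size in double words
);
    // counting starts at zero, so the size is {ext_count, count} + 1
    assign dwords = {1'b0, ext_count, count} + 11'd1;
endmodule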

Byte 0:  SeqId[3:2], Cmd[5:0]
Byte 1:  PassWD, SeqId[1:0], UnitID[4:0]
Byte 2:  Mask/Count[1:0], Compat, SrcTag[4:0]
Byte 3:  Addr[7:2], Mask/Count[3:2]
Byte 4:  Addr[15:8]
Byte 5:  Addr[23:16]
Byte 6:  Addr[31:24]
Byte 7:  Addr[39:32]
Byte 8:  Addr[47:40]
Byte 9:  Addr[55:48]
Byte 10: Addr[63:56]
Byte 11: ext_srcTag[7:5], ERR, STMP/CFG, EXT, TH, DATA
Byte 12: ext_count[9:4], TPH_hint[1:0]
Byte 13: packetTargetPort[7:0]
Byte 14: responseTargetPort[7:0]
Byte 15: contextID[7:0]

Figure 1-3: HToC Packet (fields listed from most to least significant bit within each byte)

This protocol is meant to be used with the HTAX [38] on chip network. In the developments of the CAG this refers to a crossbar switch that is used in different research projects [48]. For multi-stage HTAX crossbars the HToC protocol contains the packetTargetPort and responseTargetPort fields. The HToC protocol can be used either 64 bit or 128 bit wide; an example of a packet with 16 byte of data and how it is transferred over a 64/128 bit wide interface is shown in figure 1-4.


Figure 1-4: HToC 64 and 128 Bit Wide (the same packet of two header words and three data words needs five 64 bit transfers or three 128 bit transfers, with the upper half of the last 128 bit transfer unused)

Virtual Channels (VC) are used on the HTAX to transport the different packet types. Therefore the HTAX is configured with three VCs:
• VC 0: posted writes
• VC 1: read requests
• VC 2: read responses

1.2.2 Interface to the Host System

The HTAX crossbar has to be connected to the host system so that the functional units (FU) are able to communicate with it. Three types of packets have to be supported in both directions:
• posted writes
• non-posted reads
• responses

Implementations exist that allow connecting the HTAX with a bridge to HT and to PCIe. The first PCIe core that was supported with such a bridge [81] was the PCIe core from Lattice [126]. With the rising interest in PCIe hardware a layered bridging module was developed [114], which is separated into a general bridging module and a vendor specific part. This separation reduces the development time that is required to support different PCIe cores.


As a result it is possible, as of this writing, to connect designs that make use of the HTAX crossbar and the HToC protocol to the following host interfaces:
• PCIe core
• CAG HT1/HT2 core
• CAG HT3 core
• Lattice PCIe core
• Xilinx PCIe core

1.2.3 Hardware

For the EXTOLL project, special FPGA boards have been developed that provide enough high density connectors, because there are no commercially available FPGA boards providing them. These connectors are required to build EXTOLL interconnection networks. HyperTransport is the main development platform for the EXTOLL project; therefore first a board for the HTX slot was designed and built. This board is shown in figure 1-5 and named Ventoux. The board makes use of a Xilinx Virtex6 240LX (xc6vlx240t) FPGA and provides six network links with 4x6.25 Gbit/s transfer rate each.

Figure 1-5: Ventoux FPGA Board (HTX board with six network links, each up to 4x6.25 Gbps)

Because of the wider availability of systems with PCIe slots, a PCIe board was also designed. This board has the name Galibier and is based on the design of Ventoux, although two of the network links had to be removed because their serial I/O is required for the PCIe connector. It is depicted in figure 1-6. Both boards have a flash memory that is used to store the bit file. The bit file is the configuration for the logic of the FPGA.


Figure 1-6: Galibier FPGA Board (8x PCIe board with four network links, each up to 4x6.25 Gbps)

1.3 Application Specific Hardware Structures

The descriptions and discussions of the different issues and approaches in the context of this thesis require the introduction of two research projects. The main research project of the Computer Architecture Group (CAG) [40] at the University of Heidelberg is EXTOLL, an effort to build an HPC interconnect. A second hardware design that will be used as a research object in this thesis accelerates trading applications at stock exchanges, especially in terms of latency.

1.3.1 EXTOLL

For HPC networks not only the bandwidth but also the latency is important. Therefore the ATOLL (Atomic Low Latency) [42] network was developed with the target of providing low latency communication with off the shelf components. It was produced in a 180 nm ASIC process and had four links that allow building 2D tori. EXTOLL R1 is the successor of the ATOLL project; the name is an abbreviation for extended ATOLL. It was targeted for FPGAs and was implemented on HTX boards [41] with Xilinx Virtex 4 FPGAs. Two clusters were established to evaluate the network [90]. New requirements in terms of bandwidth, in combination with the growing amount of available resources in modern high end FPGAs, made a redesign of many parts of EXTOLL R1 necessary. EXTOLL R2 does not only target FPGAs but also an ASIC implementation, and is therefore designed to be parameterized for the respective technologies.


EXTOLL R2 has all features of R1, but with improved functionality and performance. An example of such a unit will be shown in chapter 4. In this thesis, EXTOLL refers to EXTOLL R2; the older version is explicitly called EXTOLL R1 when this differentiation is necessary. In the following, a very short introduction to HPC networks will be given to prepare the ground for further explanations.

Network Topology

Network topologies can be divided into direct and indirect topologies. Figure 1-7 shows five current networks and how they can be divided into the direct and indirect category. The presentation methods for design space diagrams in this work are described in appendix A. Ethernet and Infiniband are well known indirect networks. The "K Computer" [133] is, as of November 2011, the fastest computer of the top500 list [134] and makes use of the TOFU interconnect, which can be used to build direct networks.

Figure 1-7: HPC Network Topologies (direct: TOFU, Bluegene, EXTOLL; indirect: 10G Ethernet, Infiniband)

In direct networks the nodes are connected with each other. Indirect networks, on the other hand, make use of one or more central switches to connect the nodes, as depicted in figure 1-8.

Figure 1-8: Direct and Indirect Networks

The distinguishing feature of direct networks is that each node usually contains its own switch. The advantages of direct networks are:
• scalability
• cost


Direct networks have an advantage in terms of scalability because the number of switches grows with the number of nodes in the network. HPC networks are often based on special ASICs, and in a direct network only a single type of chip is required in all nodes. If an indirect topology is used, then NIC chips and switch chips are required. Thus the development and mask costs are reduced when fewer types of chips are used.

Direct networks are characterized by two important numbers: node degree and diameter. The degree is the number of links from one node to other nodes, and the diameter is the maximum number of links that have to be traversed between two nodes. Most direct networks are n-D tori because this topology has several advantages. The regular structure allows a simple routing implementation, and dimension order routing allows deadlock free routing in n-D tori with only two VCs [72]. Furthermore, tori provide a good relation between degree, number of nodes and diameter. Kautz graphs as used in the SiCortex systems [140] are even better in this regard, but they rely on unidirectional links for the small degree and require a link setup that is by far not as regular as for a torus. Figure 1-9 shows a 2D torus with 16 nodes, with a degree of four and also a diameter of four. A mesh would not have the wrap-around links and would therefore have a diameter of six.

Figure 1-9: 2D Torus
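The quoted numbers follow from the standard diameter formulas for a k-ary n-dimensional torus and mesh (a worked check added here for clarity, with n = 2 dimensions and k = 4 nodes per dimension):

$d_{torus} = n \lfloor k/2 \rfloor = 2 \cdot 2 = 4, \qquad d_{mesh} = n(k-1) = 2 \cdot 3 = 6$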

More details about interconnection networks can be found in [100]. The EXTOLL network architecture is flexible: EXTOLL uses table based routing and is not limited to specific topologies. Therefore, every network topology that can be built with six links is possible. For instance, a 6D hypercube can be built with EXTOLL. However, for a high number of nodes the best fitting topology is a 3D torus.


Network Interface Controller Architecture

The architecture of the EXTOLL Network Interface Controller (NIC) follows the design principle as outlined earlier. It consists of the hardware independent core_top_file that is instantiated in a hardware dependent top_file. Figure 1-10 shows an overview of the design, which can be divided into three main functional parts:
• host interface
• network interface
• network

Figure 1-10: EXTOLL NIC (host interface: PCIe/HT core and bridge; network interface: VELO, RMA, SMFU, ATU, SNQ and CSR RF attached to the HTAX crossbar; network: XBAR with network ports and six link ports)

The host interface provides the connection between the internal HTAX crossbar that makes use of the HToC protocol and the host system, while the network is responsible for the transport and the routing of the network packets. Consequently the network interface translates between the host system side and the packets in the network. The ATU, SNQ and CSR RF units are described in this thesis.

VELO

This is a unit specialized for small messages; the name is an abbreviation for Virtualized Engine for Low Overhead [89].


VELO makes use of user level access to the hardware to avoid using the operating system kernel for communication. By avoiding the kernel it is possible to reduce the latency, because no context switch is necessary. More details about VELO are presented in [114]. For bigger data transfers the Remote Memory Access (RMA) unit should be used. The exact message size at which RMA allows higher data rates than VELO depends on the technology on which EXTOLL is implemented. The lower latency of VELO as compared to the RMA is mostly reached by the usage of PIO instead of DMA, because a full round trip between device and main memory can be saved.

RMA

The RMA unit [25] allows direct access to remote memory and is designated to be used for larger data transfers within the EXTOLL network architecture. This unit provides one-sided communication capabilities for the network, where only the process on one side is actively involved; on the other side the task is handled by the RMA unit in the NIC. It supports put and get operations. The get operation allows reading from the main memory of another node. In figure 1-11, node0 sends a get request to node1 for data in the memory of node1, and the RMA unit of node1 returns this data to node0.

Figure 1-11: RMA Get

The put operation writes directly into the main memory of the destination node as depicted in figure 1-12.

Figure 1-12: RMA Put

However, neither put nor get operations can just operate on arbitrary memory areas, because this would be a security problem. Access to memory is secured by the Address Translation Unit (ATU). Only memory that is registered with this mechanism can be used for RMA operations.


The get and the put operations and their interaction with the address translation unit will be detailed in chapter 4.3. Get and put operations are by default without completion information, and therefore the software that makes use of the RMA does not know when an operation is finished. Thus, RMA offers the possibility to activate a feature that will cause a notification after an operation has been finished. This notification is written into a queue of the corresponding process that caused the operation. It is optionally also possible to trigger an interrupt.

Figure 1-13: RMA: Three Units (requester, completer and responder between the EXTOLL HTAX crossbar and the network ports)

The RMA consists of three units as depicted in figure 1-13. The requester sends data (put) and data requests (get) to the network, while the completer writes received data into the main memory. The task of the responder is to answer data requests by sending the requested data.

1.3.2 Trading Accelerator

The second application specific hardware structure is a trading accelerator. Trading at stock exchanges is today done mainly automatically, based on different algorithms. A variant of automated trading is called high frequency trading (HFT), with the following significant property: it has to be fast. HFT is highly competitive; only the fastest trader gets the trade and can therefore earn money, so application specific hardware structures have a place in this field. The profit from each trade does not have to be big, because the trades are frequent and numerous. In the pursuit of lower latency for such an application, special hardware was designed. A standard setup for a trading system is depicted in figure 1-14: the stock exchange sends the market updates in UDP multicast streams to the NIC of the trading server, and this NIC forwards the received data to the software that runs on the server. Based on the market information the software decides whether the price is an opportunity.


Figure 1-14: Trading Setup (stock exchange sends UDP market updates to the trading server's NIC; the software sends orders back over TCP)

An opportunity exists if it is monetarily worthwhile to buy or sell based on the current market price. The decision is based on different strategies, but this is not an issue this work is concerned with. Based on this decision an order is sent over TCP/IP. This hardware design tries to lower the latency of the NIC and of the communication between the software and the NIC.


2 Register File Generation

2.1 Introduction

The introduction will start with an explanation of what a register file is and why such a register file should be automatically generated, and will continue with outlining the problems that have to be solved for the efficient implementation of register files.

2.1.1 Register File

Complex hardware designs have many configuration settings and control options that can or have to be adjusted in order to be able to use the hardware. All these settings have to be stored in the hardware, and there have to be ways to change them from outside of the hardware. Furthermore, methods are also required to read out these settings and further status information that may be provided by the hardware. Such information is generally stored in registers, which are called CSRs, for Control and Status Registers. A CSR can typically be accessed with the help of a host interface. Figure 2-1 shows a CSR, which is also directly connected to logic.

Figure 2-1: Control and Status Register


Concerning the naming convention, there are two slightly different areas that make use of the term register file:
• CPUs
• HW devices

Register files are sets of registers that can be addressed for read and write operations. The term register file is often used for one of the basic building blocks of CPUs: operands for instructions are read from the register file and the results are written to the register file. Figure 2-2 shows the schematic of a 4x4 register file that is able to store 16 bits overall. It has only one write and one read port; register files in CPUs usually have multiple read and write ports. The register file is organized in a regular way, packed as densely as possible. Thus, there is no direct connection to discrete logic like in the case of a CSR.

Figure 2-2: Register File (4x4 array of register file cells (RC) with a write address decoder, a read address decoder, word lines and bit lines; separate write enable and read enable)

The bit lines are connected to all register file cells (RC), but the outputs of the RCs are only active when the corresponding word line is active.
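The behavior of this structure can be captured in a few lines of Verilog. The following behavioral model is my own sketch of the register file in figure 2-2, with the cell-level decoders abstracted away and, as in the figure, only one write and one read port:

module rf_4x4 (
    input  wire       clk,
    input  wire [1:0] write_addr,   // selects the word line for writing
    input  wire       write_enable,
    input  wire [3:0] in,           // input word
    input  wire [1:0] read_addr,    // selects the word line for reading
    input  wire       read_enable,
    output wire [3:0] out           // bit lines
);
    reg [3:0] rc [0:3]; // 4x4 register file cells, one word per row

    always @(posedge clk)
        if (write_enable)
            rc[write_addr] <= in;

    // only the selected word drives the bit lines
    assign out = read_enable ? rc[read_addr] : 4'b0;
endmodule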


To be able to reach high performance it is important that register files can perform multiple reads and writes at the same time; therefore, research is conducted in that area. The journal article [103] presents the implementation of an eight port register file and shows the type of structures that are employed for these. Register files in the CPU context cannot be used to control hardware directly via control signals, because there is no way to connect control or status wires to the internal storage structure. Such structures can also be used in place of small SRAM IP blocks in ASICs.

In the context of this work a register file (RF) describes a collection of multiple CSRs and similar structures that will be introduced in this chapter. Most hardware devices provide such a facility, for example the 82559ER Ethernet chip from Intel [44], which can be considered a generic example. There is usually only a single read/write interface that uses addresses, as shown in figure 2-3. The control information from the RF is provided by means of output registers of the RF module. Status information is taken directly from inputs.

Figure 2-3: RF CSR Functionality

The following example code will demonstrate this by implementing the CSR from figure 2-1.

module example_rf (
    input  wire        res_n,
    input  wire        clk,
    input  wire [2:0]  address,
    output reg  [63:0] read_data,
    input  wire        read_en,
    input  wire        write_en,
    input  wire [63:0] write_data,
    output reg  [63:0] control,
    input  wire [63:0] status
);
    always @(posedge clk or negedge res_n) begin
        if (!res_n)
            control <= 64'b0;
        else begin
            if ((address[2:0] == 3'h0) && write_en) // (b)
                control <= write_data;
            case ({read_en, address}) // (a)
                {1'b1, 3'h0}: read_data <= control;
                {1'b1, 3'h1}: read_data <= status;
            endcase
        end
    end
endmodule

The example_rf HDL [83] module shows the simplest complete implementation of a RF that is possible. It has one control register control and one status register status. Both registers can be read due to the case statement (a), but only control can be written (b). To implement this, a very simple read/write interface is implemented with address, read_data, read_en, write_en and write_data. In the usual case a RF appears as a linear address space, as shown in figure 2-4, with many registers which can be read and/or written by the software. In most cases the RF is mapped into the address space of the host system. Many hardware development projects of a certain size are combinations of special hardware, like an FPGA board, and a standard off the shelf computer system. Examples for such systems are the two research projects EXTOLL and the trading accelerator, which were introduced earlier.

0x00  CSR register 0
0x08  CSR register 1
0x10  CSR register 2
0x18  CSR register 3

Figure 2-4: RF Address Map

The hardware designs as described in chapter 1.3 have the RF as central element.

2.1.2 Auto Generation

Often the hardware development workflow for generating a RF and the other necessary parts consists of the following manual steps:
1. Writing the specification
2. Implementation of the registers in a hardware description language (HDL)
3. Writing the documentation
4. Using the documentation to write the verification and simulation environment
5. Using the documentation to write drivers and software for the hardware

All these steps can be done manually and are in fact often done manually, because the code to read and write registers is very simple, as shown in the code example. Wilson Snyder demonstrates in [22] methods to organize Verilog HDL code for this purpose, but suggests in the course of the paper to automate the task.


The disadvantage of doing this manually can be big, because of the probability of errors. Obviously, every single change in any of these manually written parts has to be propagated into all the others. The resulting problem is depicted in figure 2-5; blocks with dashed lines are not directly related to the RF, but important for the explanation. The HDL is used to generate a bit file for the FPGA, and the FPGA is used together with the generated bit file to test the overall design. It is not necessary to be a doomsayer to foresee that at some point a developer will modify the HDL and also the software, but of course neither the specification nor the documentation will be updated.

Figure 2-5: RF Workflow (specification, HDL, documentation, verification/simulation, software, bit file, FPGA and testing; often only the HDL and the software are patched to proceed quickly during bring-up)

Every change should start at the level of the specification. Therefore, as much as possible should be automatically generated from the specification to avoid human errors and save development time. The target has to be the generation of all code and documentation from a single source with a high abstraction level. To point out a further reason why manual writing of the HDL is not a good idea, an example should be given: the automatically generated code of the EXTOLL RF in the ASIC version consists, as of May 2012, of 52607 lines of Verilog. Even under the assumption that the generated code is verbose and hand written code would only be half the size, it is a lot. Thus, automatic generation is desirable to reduce the probability of errors. The EXTOLL RF will also be used in this chapter as a benchmark for the design and implementation decisions.

The typical environment as introduced in chapter 1.2 consists of a standard computer system with an FPGA on a PCI express (PCIe) [43] board. To be able to make use of the lower latency properties of the HyperTransport (HT) [63] protocol, it is also possible to use the HTX interface [41].
In the Computer Architecture Group [40] this is an often chosen technology. For the software there is no difference between PCIe and HT, despite the lower latency of HT. However, it is also possible to access the RF by other means. An example for this is the debug port as described in chapter 2.6.2. Furthermore, it is also possible in the EXTOLL design to access remote RFs with the help of the RMA units, which were introduced in chapter 1.3.1. Figure 2-6 shows a typical setup consisting of a computer system running software, which is able to perform reads and writes on the RF that is placed in an FPGA or an ASIC.

Figure 2-6: RF Environment

At first the problem of automatically generating a RF for such an environment seems to be simple, because it is not very complex to generate the logic that allows writing registers from the software and reading them out again. In the end there are issues that complicate this task:
• hardware control fields within CSRs
• hardware structures different from CSRs are required
• timing constraints
• counters

CSRs are the most important elements in a RF, but there are cases where other structures can be used more efficiently. When no concurrent access to information is required, SRAMs can be used to store and retrieve information. An example for such information is the routing table in a network switch: given that each incoming port has its own table, only one lookup has to be done concurrently, because there can be only one new packet at a time. In the Xilinx Virtex 6 FPGAs each block RAM is 36 kbit in size and can be used, for example, 64 bit wide and 512 entries deep. This amount of storage in flip-flops would consume a big portion of the FPGA: storing 64*512 bits requires 32768 flip-flops, and the Virtex6 xc6vlx240t FPGA that was used for development has 301440, so 10.8% of the FPGA would be used. In contrast, the FPGA has 416 block RAMs [10]. Therefore, efficient ways to integrate block RAMs into the RF structure are required to be able to make use of this resource. Counters are another issue, and can require a lot of space in FPGAs when implemented in standard logic. The usage of special components is discussed and described in chapter 2.4.2.
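To make the alternative concrete, the following Verilog sketch (signal names are mine; the actual RFS ramblock element is the subject of chapter 2.4.5) describes a 64 bit wide, 512 entry memory. Because of the registered read, synthesis tools map it to a single block RAM instead of 32768 flip-flops:

module rf_ramblock (
    input  wire        clk,
    input  wire [8:0]  addr,       // 512 entries
    input  wire        write_en,
    input  wire [63:0] write_data,
    output reg  [63:0] read_data
);
    reg [63:0] mem [0:511]; // fits one 36 kbit block RAM

    always @(posedge clk) begin
        if (write_en)
            mem[addr] <= write_data;
        read_data <= mem[addr]; // synchronous read enables BRAM mapping
    end
endmodule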

2.1.3 Hierarchical Decomposition

Reaching timing closure is a very big concern in all hardware development projects. Thus, the RF also has to be considered, and methods have to be found so that the RF does not negatively impact the maximum frequency. To demonstrate the problems that have to be solved when big designs are implemented, an experiment was conducted for a recent FPGA. Figure 2-7 shows a Xilinx Virtex 7 (xc7vx485t in speed grade 1) [71] with a simple test design. The research group already had an engineering sample of this particular FPGA; therefore, its performance characteristics were interesting. The test design consists of four registers of the type FDPE (D Flip-Flop with Clock Enable and Asynchronous Preset) [102] named r1, r2, r3 and r4, which are assigned in ascending order. For this experiment the FDPEs had to be instantiated explicitly, because when normal Verilog registers were used, the synthesis detected a shift register and instantiated a single special component instead, rendering the test design useless. The FDPEs were placed in three corners of the FPGA so that the horizontal, vertical and diagonal routing delay could be observed. Although the red lines in the figure are straight, this does not mean that the real routing in the FPGA will follow them; the lines indicate only the linear distance in the FPGA. In the end only four of the 607200 registers are occupied in this evaluation. Given that the FPGA is basically empty, it can be safely assumed that these delays are the lower limit.

Figure 2-7: FPGA Delay Evaluation (observed delays between the corner registers: 6.969 ns, 7.844 ns and 11.776 ns on the diagonal)

The result of this experiment is that on modern FPGAs it is not possible to cross the whole chip without intermediate registers if decent frequencies are targeted. In this example the maximum frequency for logic that uses r2 and r3 without registers in between would be 89.5 MHz. With a target frequency of 200 MHz it is obvious that it is a severe problem to use only a single RF in such a big FPGA. In figure 2-8 this is illustrated: there is a single RF in the middle of the chip and blocks of logic in the corners of the FPGA. The RF was placed in the middle because this way the distance to the rest of the FPGA is the shortest. It is only an example to visualize the problem; a real design would of course use more than just the corners of the whole FPGA for logic. Assuming the Virtex 7 FPGA, the routing delay to the corners is about half of the diagonal delay of 11.776 ns and therefore 5.9 ns to the corners of the FPGA. This delay of 5.9 ns is more than the 5 ns which would be required to reach timing closure at 200 MHz.
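For orientation, the underlying relation between path delay and achievable clock frequency is the usual static timing budget (added here for clarity; register clock-to-out and setup times are part of the budget alongside the routing delay):

$f_{max} = \frac{1}{t_{clk \to q} + t_{route} + t_{setup}}, \qquad \frac{1}{89.5\,\text{MHz}} \approx 11.2\,\text{ns}$

so the r2 to r3 path has to fit into one clock period of about 11.2 ns at 89.5 MHz.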


Figure 2-8: Delay Visualization with RF in the Middle (RF at the center connected to the host interface, logic blocks in the four corners, 5.9 ns routing delay to each corner)

Additionally, the Virtex 7 xc7vx485t FPGA, which was used for this evaluation, is one of the smaller chips of the Virtex 7 series. The biggest Virtex 7 FPGAs will be four times bigger and will therefore have even higher routing delays. The result is that it is impossible to build a design that is controlled by a single RF in a naive way, when the design is supposed to run at an adequate clock frequency and uses a big part of the FPGA. Closing the gap between the size of current hardware devices and the too high routing delay is therefore mandatory. There are the following options:
• pipelining the data paths
• multi cycle path constraints
• hierarchical decomposition of the RF

Adding extra registers into the path between the RF and the logic is one possibility to relax the timing by pipelining. It requires extra resources and increases latency. Multi cycle path constraints override the normal timing constraints that exist within a clock domain. By using them it is possible to allow the place and route tools to use more time to propagate signals, but multi cycle paths are not without issues. For example, timing closure could be reached if all connections from and to the register file were constrained to 15 ns. Assume that there is a two bit wide data path between the RF and the logic and the clock has a cycle time of 5 ns. It can happen that the delay ends up different on data[0] and data[1], as depicted in figure 2-9. In this case data[0] would arrive a full cycle before data[1], because the delays are in this example 7 ns and 12 ns and therefore differ by a full cycle. Thus, if the data output of the RF was 2'b00 and is set to 2'b11 by the RF, then the logic would see 2'b01 for one cycle before it could sense the correct 2'b11, as shown in the timing diagram. As a result, it is only possible to use multi cycle constraints under certain conditions.
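One way to establish such a condition is an enable that is delayed by as many register stages as the multi cycle constraint spans, so that the consuming logic samples the control inputs only after their maximum delay; this scheme is described after figure 2-9. The following Verilog sketch uses illustrative names and a made-up stage count, not values from the actual designs:

module mc_enable_guard #(
    parameter STAGES = 3    // must match the multi cycle constraint, >= 2
)(
    input  wire clk,
    input  wire res_n,
    input  wire enable_in,  // pulsed when the control value was updated
    output wire enable_out  // asserted when the inputs are safe to use
);
    reg [STAGES-1:0] shift;

    always @(posedge clk or negedge res_n)
        if (!res_n) shift <= {STAGES{1'b0}};
        else        shift <= {shift[STAGES-2:0], enable_in};

    assign enable_out = shift[STAGES-1];
endmodule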


Figure 2-9: Multi Cycle (timing diagram: with path delays of 7 ns on data[0] and 12 ns on data[1], the logic sees the wrong intermediate value 2'b01 between 2'b00 and 2'b11)

It has to be ensured by design that intermediate values are not used. This can be done by using extra control signals, which take care that the value is not used too early. One application for multi cycle constraints could be a unit with several control inputs and a single enable signal. The unit would use the control inputs only when it is enabled. Then the hardware design only has to ensure that the enable signal arrives after the maximum delay of the control signals. The enable signal could be pipelined with the right number of stages for this purpose, as sketched in the code above.

The third possibility is hierarchical decomposition by dividing the RF into different parts. These parts can be placed within the logic that makes use of the RF. By placing these parts of the RF closer to the logic, the routing delay is less of an issue. This solution is depicted in figure 2-10. Timing on the connection between the main RF and the sub-RFs can be relaxed by pipelining. This will be detailed in chapter 2.8.

The following are the minimum requirements for a RF generator:
• input language with high abstraction level
• logical naming
• efficient generated code
• human readable generated code
• automatic documentation generation


Figure 2-10: Hierarchical Decomposition (main RF connected to the host interface, sub-RFs placed within the logic blocks)

The input language is required to have an abstraction level that is as high as possible, but it should allow controlling details of the generated RFs. Furthermore, the input language should be easy to use. Developers should not have to be concerned about physical address handling; only logical naming must be used. On the other side, the possibility to place RF components at certain addresses has to be available, because it is required in some cases. The HDL code should be generated in a way that efficient hardware can be synthesized from it. It should be efficient in terms of area and timing. Moreover, the resulting code should be human readable so that it can be understood easily for the purpose of debugging. Documentation should be generated automatically, so that it is easy to get an overview of the generated RF. Software developers should have no problems extracting the necessary information to control the hardware.

The following chapter describes the problems and design choices that are faced in the design phase of a tool to generate RFs which can be used in different complex hardware designs. The first section will give a short overview of the range of available products. Furthermore, the reasons that led to the development of a new RF generation tool will be justified. This tool is named RFS.

2.2 State of the Art and Motivation

The generation of a register file is neither a new nor a very special problem. There are multiple products available that try to solve it. In the following, the currently available products are described and the reasons are explored why a new tool was developed.

2.2.1 Denali Blueprint

This tool [31] can be used to generate RFs. Different semiconductor vendors are using it, as can be seen in the 2008 press releases from Atheros [45] and Cypress [53].


Denali Blueprint uses SystemRDL [46] as its primary input format. SystemRDL is a special language for register files; RDL stands for Register Definition Language. An example of this language is given below. Recently SystemRDL evolved into an industry standard. Blueprint also supports Spirit IP-XACT XML [95] as input language by using an extra spirit2rdl converter program. Blueprint is able to generate various output files that can be used for RFs. The following list names the most important ones:
• Spirit IP-XACT XML (a description can be found in chapter 2.3.2)
• C code for
• RTL code for synthesis
• documentation

The following example is presented to give an impression of SystemRDL. It demonstrates the relatively complex and extensive syntax.

reg test_reg1 {
    name = "TestRegister1";
    desc = "Description of TestRegister1";
    regwidth = 64;
    field {
        name = "TestField2";
        desc = "Description of TestField2";
        hw = w;
        sw = rw;
        fieldwidth = 32;
        reset = 32'h0;
        resetsignal = res_n;
    } test0;
    field {
        name = "TestField3";
        desc = "Description of TestField3";
        hw = w;
        sw = rw;
        fieldwidth = 32;
        reset = 32'h0;
        resetsignal = res_n;
    } test1;
};

reg test_reg2 {
    name = "TestRegister2";
    ...
};

test_reg1 reg0 @0x00; // address map
test_reg2 reg1 @0x08; // address map
...


The example shows a fragment from a RF: a register test_reg1 is defined that contains two 32 bit fields for testing purposes. It is, for example, possible to configure whether the software has read-write access to the register. A second register is named test_reg2. Both registers are placed into the address map at the end of the example, at the addresses 0x00 and 0x08. As can be seen in this example, the definition of RF components is separated from the actual placement in the address map. It is possible to use defined regs multiple times, but this is of course only helpful when exactly the same reg is required again. The disadvantage is that each register has to be created in two places.

Denali Blueprint was used in different research projects of the Computer Architecture Group before RFS was developed; especially EXTOLL R1 made extensive use of it. With these projects the limits of Blueprint were explored. The most obvious limit was that it was not possible to reach timing closure for EXTOLL R1 with all required registers in the RF, because of two reasons:
• logic complexity
• placement constraints

In the read path for the software, the different inputs have to be multiplexed to the output, and the more inputs a multiplexer has, the slower it will be. Furthermore, when there are more registers in a single RF, this implies that they will be used as control and status registers for more logic. Consequently, this logic will be distributed over a larger area of the FPGA/ASIC. The result will be a higher routing delay from the registers to the multiplexer and more timing problems.

There is no native support for the hierarchical partitioning of register files in Blueprint, but partitioning was necessary in EXTOLL R1, because otherwise it would not have been possible to reach timing closure. Therefore, Perl [80] scripts were developed to split up the RF into different sub-RFs. Blueprint was used to generate a RF from each of these sub-RFs, and afterwards the script combined the different generated RFs into one big RF. Unfortunately this process proved to be error prone. The generated C header files that are used to access registers had issues that resulted in erroneous addressing. For this reason manual checks and fixes of these files became necessary. Although it is possible to change parts of the code generation of Denali with Perl, this approach was not followed further, because significant efforts would have had to be invested into learning how to do this.


One of the biggest issues of Blueprint is the SystemRDL description language, because it is complex and not intuitive.

2.2.2 Others

There are multiple other solutions, and the most prominent ones are mentioned in the following. Duolog Technologies Socrates Bitwise [34] is a product similar to Denali Blueprint, but it also provides a GUI. Vregs is a free tool that is available from [23]. It uses HTML files, which can be produced with graphical editors like Microsoft Word or Adobe FrameMaker, as input language. No native support for hierarchical RF designs is implemented. Agnisys IDesignSpec [109] provides plug-ins for multiple versions of Microsoft Word, Microsoft Excel, OpenOffice and Adobe FrameMaker. RFs can be specified interactively with these plug-ins, and consistency checks can be performed directly. PDTi SpectraReg [108] is a product that is used in a web browser. It can be licensed in an online version as well as in a version that can be installed on site. The product provides features similar to Denali, though there are no details on the website. For completeness, there are at least two further products available: CSRCompiler [111] and GenSys Registers [110], but they have no outstanding features.

2.2.3 Motivation
In prior projects Denali Blueprint was used, but it has weaknesses that motivated the search for alternatives. Denali provides free licenses for academic use, but it is unclear whether this situation will persist, and furthermore non-academic projects will require a license. The licensing fees for Blueprint are high because it has to be licensed in a bundle with other tools. Native support for hierarchical RFs is necessary for timing reasons, but with Blueprint this was not directly possible. In EXTOLL R1 a lot of resources were used up by the counters implemented by Blueprint, because they did not make use of special FPGA components. Consequently, it was desirable to find an alternative, and I decided to develop a better one.


The most important argument that motivated the development of RFS was the prospect that this step would allow the implementation of new features; saving licensing costs was only a minor one.

2.3 Design Decisions
In a first step a small prototype of a RF generator was developed to explore the complexity of the undertaking. This first prototype generator was used for a design with low requirements regarding the RF: a high-speed flash controller that needs a register file with only eleven CSRs and does not require any special features. The result of this experiment was that such a development is feasible. Obviously, in the course of the development of RFS new requirements arose and the complexity grew. A general principle of this code generator is that the output code is supposed to be as human readable as possible, to allow code reviews and debugging, because the possibility of errors can never be ruled out. All operations in the RF are designed to be 64 bit wide, because that is the natural size for current high performance processors. Some important design decisions are explained in the following.

2.3.1 Features
RFS has to support several features to be usable for the generation of the RFs for projects that are as complex as the EXTOLL project:
• different types of registers
• different access rights
• special purpose registers
• memory type registers in SRAM
• easy replication of structures
• hierarchical organization
• capture registers

Different types of CSRs are the main feature, and it must be possible to adjust certain properties of these registers. It has to be possible to configure whether a register is readable, writeable or both. If a register can be limited in this regard, then logic can be saved. It also has to be possible to make use of special hardware components in FPGAs to implement counters, in order to save resources.


SRAMs can be used to replace multiple registers of the same type in a RF. This allows saving resources when the hardware application can make use of an SRAM interface. The flexible interface makes it easy to connect other units to the RF. An error logging unit that can be connected to an external SRAM interface will be demonstrated in chapter 2.4.5. The possibility to replicate a set of registers and SRAMs is very important in complex RFs. With such a feature the description of the RF is shorter and therefore provides a better overview. The main reason for such a feature is that it makes it possible to provide an array of C structs [94] in the C header files that describes a repeated set of registers and SRAMs. As a result they can be accessed in an efficient way from the software. Furthermore, this allows parameterizing the number of instantiations for this particular block of registers and SRAMs. It has to be possible to instantiate sub-RFs due to the strong requirement to be able to compose the RF hierarchically. Even before development was started, an important desired feature was the possibility to set watch points for certain registers, so that the software can be informed in an efficient way about changed registers. This feature consists of auto generated logic in the RF and an additional unit that will be introduced in the next chapter.

2.3.2 Input Method and Data Format
A way to provide the high level abstract input for RFS had to be found; the first design decision was between GUI and no GUI. Developing a GUI for design entry would require a lot of effort, because there are so many different options available that a GUI would have to provide them all. Furthermore, it is not even clear whether a GUI would save the user any time. In a good text format it is easy to copy, paste and make small modifications to get the required results; in a GUI this can take much longer. Accordingly, the decision was made to use a text based format as primary entry format. This does of course not prevent the development of a GUI that outputs this format. The text format has to fulfill multiple requirements. It should be simple to understand, and users should not have to spend a lot of time getting acquainted with the description language. Furthermore, it should not require verbose descriptions; only what RFS can not derive itself should have to be defined. Finally, it should be readable, so that it is easy to understand and modify an existing description. First, existing description languages that can be used to describe RFs are described, and reasons are given why they are not suitable. There are two existing description languages, and both are “industry standards”. Using an existing format would have the advantage that there are already tools that could be used.


SystemRDL It is used by Denali Blueprint. This format [46] is not XML based and has a complex syntax. Writing a parser for this format would have caused a lot of development work. An example is shown in chapter 2.2.1. Spirit IP-XACT XML Spirit IP-XACT XML is a more recent “industry standard”. At first it seems like a viable possibility. The standard is based on XML. It has to be mentioned that the IP-XACT standard [95] can describe more than RFs; it can also be used to describe the connections between different hardware blocks [115]. One of the advantages of an industry standard is that it allows using other tools on the same definition, for example a GUI or a generator for verification components. The following example from [96] should give an impression of how the specification of a single register looks in IP-XACT.

    <spirit:register>
        <spirit:name>reg1</spirit:name>
        <spirit:description>Description of reg1</spirit:description>
        <spirit:addressOffset>0x0</spirit:addressOffset>
        <spirit:size>32</spirit:size>
        <spirit:access>read-write</spirit:access>
        <spirit:reset>
            <spirit:value>0x00000000</spirit:value>
            <spirit:mask>0xffffffff</spirit:mask>
        </spirit:reset>
    </spirit:register>

The existing verification components were especially interesting, because with OVM_RGM [101] there was already a software package that can make use of IP-XACT specifications to create components for the verification environment. The problem is that OVM_RGM supported only IP-XACT version 1.4, but one of the desired features of the description language was the possibility to instantiate multiple register objects at once, like with the repeat blocks that will be shown later. This feature is not supported in IP-XACT 1.4, because the first version that defines it is version 1.5. Thus, it was decided not to use IP-XACT, because a decision would have had to be made between version 1.4 and 1.5, and in both cases compromises would have been necessary. Furthermore, the frequent use of the word “spirit” in the definitions is a nuisance. Since this decision was made, IP-XACT has been standardized by the IEEE [95], and version 1.5 is no longer the most recent; over time there have been multiple different versions of the standard [113].


Own Description Language Because neither SystemRDL nor IP-XACT seemed to fit, I decided to create my own description language. The first decision to be made was which base format should be used:
• JavaScript Object Notation (JSON) [32]: an open standard for human readable data representation that can serve as an alternative to XML. Parsers for JSON are available for numerous programming languages [33].
• Completely self defined: requires a lot of effort with no apparent advantage.
• XML [4]: very good support available due to the existence of different libraries that are able to handle XML.

For reasons of simplicity it was decided that the data format should be XML based, because existing XML parsing infrastructure can be used. Additionally, it requires the smallest learning effort for the users of the tool, because most of them are already acquainted with the XML format.

2.3.3 Output Data Formats
To be able to generate hardware, documentation and other required files, RFS has to generate multiple output files. First a decision had to be made which output files were required and which formats should be used, given there is a choice. The most important output format is of course one that allows implementing the actual hardware, and documentation is also a minimum requirement. Hardware Description Language The foremost purpose of RFS is to generate hardware; an HDL has to be used for this purpose. The two obvious choices are Verilog and VHDL [91]. These two languages can be used to synthesize hardware, at least as long as only the synthesizable subsets are used. Both support features that can not be synthesized to hardware by standard synthesis software; text output for debugging is such a feature. Of course, there are also other HDLs that can be used to generate hardware. In the following the most important possibilities are given and their fitness for the problem is discussed:
• SystemVerilog
• SystemC
• MyHDL
• VHDL
• Verilog


SystemVerilog [54] is a language that can be used as HDL and verification language. It is close to Verilog. A very simple subset can be synthesized. This subset would already offer significant advantages. For instance, it is possible to define packed structs instead of only wires as in Verilog. This allows connecting modules with less code. Thus, many code lines and possibilities for errors can be avoided. Unfortunately SystemVerilog can not be used, because Xilinx XST does not support it. Xilinx XST is the synthesis software for Xilinx FPGAs and it is required, because these are the primary target platform. SystemC [65] is a library for C++ that provides an event driven simulation kernel. It is not possible to directly synthesize SystemC with normal tools. Special software exists that claims to be able to synthesize it or produce synthesizable RTL from SystemC; Catapult C [66] [67] is such a software. MyHDL [47] is a relatively young development, with the idea to use Python as an HDL. It can be used for the hardware description and as verification language. Standard Electronic Design Automation (EDA) tools are not able to use MyHDL directly; tools to translate MyHDL to Verilog or VHDL have to be used. Therefore, in the end only Verilog and VHDL were left over. The language of choice in the Computer Architecture Group is Verilog, and therefore it was chosen as output language. The possibility to also output VHDL was not implemented because there was no direct need for it. To use RFS in a VHDL environment there are multiple possibilities:
• VHDL output can be added to RFS, which is an obvious possibility.
• Software that converts Verilog to VHDL can be used. For example, the two companies SynaptiCAD [55] and X-Tek Corporation [56] offer products for that purpose.
• Designs that are a mixture of VHDL and Verilog code are supported in some EDA tools; Xilinx XST [70] is one of these.

Documentation Human readable documentation is very important for the usability of the generated RFs. Documentation is required by the software developers that make use of the hardware. The most important generated information is the address of all elements in the RF, so that they can also be addressed manually and not only by using the C header files or the sysfs file system. The target was to get two different types of output. First, HTML is a requirement so that the documentation can be viewed easily in a browser; this output should also reflect the hierarchical structure

of the RF. Second, a PDF output or a way to generate a single PDF file that contains all information is required. As a result, this feature can be used to generate the register specification documentation for the hardware. Implementing a direct output of these formats from RFS would of course be possible, but for the purpose of documentation generation there are different special intermediate formats and tools available that can save a lot of development effort:
• Doxygen
• DocBook
• RST

Doxygen [57] is a tool that reads specially formatted comments from source code and uses this information to generate documentation. Numerous languages are supported directly; more languages are supported with the help of filters, and such a filter exists for Verilog. However, no documentation for the Verilog code is required here, but only a documentation of the RF components. DocBook [58] [59] is an XML language that is used for technical documentation. ReStructuredText (RST) [106] is a plaintext format with a very simple markup syntax. RST files can be written in a way that even the plaintext files are easy to read. To give two examples:
• Section captions are marked by underlining them with “=” or similar signs. The order in which these signs are used in the RST files assigns a heading level to them.
• A bullet list entry can be created by putting a “*” in front of the line.

Relatively little effort is required to generate RST files. The RST output files can be processed by Sphinx [92] to generate either HTML documentation or LaTeX files. These LaTeX files can be used with pdfTeX [107] to generate a PDF file. Thus, the RST format was chosen because of its simplicity and because it allows generating HTML and PDF output. C Header Files An important functionality of a RF is that the register file can be used from software. The layout has to be known by the software, and therefore header files containing ANSI C structs are generated. The RF is mapped into the address space of user level processes, and the C structs can be used as an overlay to access the registers in an efficient and convenient manner from C and C++ [73].
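To illustrate the principle, the following sketch shows what such a generated header could look like for a small RF with one 64 bit register and a four-entry ramblock; all names and the padding scheme are hypothetical, only the overlay idea itself is taken from the text.

    #include <stdint.h>

    /* sketch of a generated overlay struct (all names are hypothetical) */
    typedef struct {
        volatile uint64_t testreg;  /* offset 0x00 */
        uint64_t _pad[3];           /* 0x08 - 0x18, empty due to alignment */
        volatile uint64_t white[4]; /* 0x20 - 0x38, the ramblock */
    } example_rf_t;

    /* usage, assuming base points to the memory mapped RF:
       example_rf_t *rf = (example_rf_t *)base;
       uint64_t value = rf->testreg; */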


Annotated XML This is a single XML file written by RFS that contains the complete structure of the RF, including the addresses of all elements. The purpose of this file is to provide an output that can be used by other tools. Figure 2-11 shows that rgm and rfdrvgen make use of this output file; both tools are described in more detail in chapter 2.6. By default RFS generates this file with the name of the top level XML file extended by “anot”, so the annotated XML for extoll_rf.xml is named extoll_rf.anot.xml. The annotated XML does not contain all fields from the original XML files; only the information that is relevant for the software interface is contained. Attributes in the XML file that are generated by RFS are prefixed with an underscore. Miscellaneous Files rfs_v.h: The Verilog header file is generated by RFS. It contains certain parameters of the RF. These allow describing the connections between different parts of the register file in a generic way, because the widths of all connections are defined in this file. Thus, no manual changes are required in a top level when only a sub-RF is modified. This file contains all necessary address path, read path, write path and trigger path widths. rfs_seconds.h: This file contains the time when RFS was called to generate the RF as a define:

    #ifndef RFS_SECONDS
    #define RFS_SECONDS 1326977713
    #endif

The value is the UNIX standard time [60]; it is 32 bit in size and counts the number of seconds since 1st Jan. 1970. The application of this value will be detailed in section 2.3.4. snq_te.txt, snq_tdg.txt: These files and their application will be explained in section 2.4.9 and the next chapter. Flow Overview There are multiple different output files; some of them can be used directly, others have to be processed further. An overview is shown in figure 2-11. In this overview, boxes with round edges represent files or file formats; the rectangular boxes describe programs that process or generate these files.


Figure 2-11: Output Files (RFS reads the XML files and generates C headers, annotated XML, RST, Verilog and miscellaneous files; the RST is processed by sphinx into HTML and LaTeX, the latter by pdflatex into a PDF; the annotated XML is consumed by rgm, which produces a UVC, and by rfdrvgen, which produces a sysfs driver)

2.3.4 HW/SW Interface Consistency
The main purpose of the RF is to allow the software to communicate with the hardware. It can make use of different software interfaces for this purpose, all of which rely on addresses. When the XML specification is changed, this can cause changes in the address setup of the RF. The usual workflow for bringing up a design in an FPGA is shown in figure 2-12. It illustrates the steps that have to be performed to be able to conduct a test inside the FPGA after changing either the XML specification of the RF or the other Verilog code. As can be seen in figure 2-12, the C header files are generated depending on the XML input files, and they have to match the generated hardware. Using the wrong header files means that probably the wrong registers are used for read or write operations; in the best case this directly leads to an error. But the result could also be a problem that is hard to find, or a hardware failure. The latter could happen if, for example, the flash memory of the FPGA board is accidentally erased; then the board will not work anymore and will require repair. There are multiple methods that can be used to enforce that the hardware and the software match each other:
• interface version
• creation time
• hash value
• include XML


Figure 2-12: Build and Test Flow (RFS reads the XML files and generates Verilog code and C header files; the generated and other Verilog code is synthesized by Xilinx ISE into a bitfile, which usually takes between 1 and 24 hours; the C header files and C/C++ source code are compiled into the test software; programming of the bitfile and the test setup are followed by testing)

Interface version: A register carries a version number of the RF interface. Before the software makes use of the hardware, it reads that register and compares its content with the expected value. This works in theory, of course, but experience has shown that this approach does not work in practice because of the human factor: very often it was forgotten to change the number. Then the choice has to be made whether a new bit file should be built with a correct version number, or whether the risk is accepted that the bit file is used with the wrong software. For testing in the laboratory this is not a problem, but if the bit file and software are sent to 3rd parties, it is more problematic. Using the version number of the subversion [120] repository is also not a solution, because it only changes when an update or commit was done. Creation time: To avoid the problem that a human has to change a version number, the UNIX standard time [60] can be used. This is a 32 bit value that contains the number of seconds since 1st Jan. 1970. The problem that two RFs may have the same creation time is not considered, because it is a very unlikely error condition. This creation time is also stored in the rfs_seconds.h header file, so that it can be used by the software to check whether the hardware and the software match each other.
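As a sketch, such a software side check could look as follows; the register pointer is hypothetical, only the RFS_SECONDS define comes from the generated rfs_seconds.h.

    #include <stdint.h>
    #include <stdio.h>
    #include "rfs_seconds.h" /* defines RFS_SECONDS */

    /* rf_creation_time points to the 32 bit register that was
       reset to $seconds (hypothetical name and location) */
    int rf_check_consistency(volatile uint32_t *rf_creation_time)
    {
        uint32_t hw = *rf_creation_time;
        if (hw != RFS_SECONDS) {
            fprintf(stderr, "RF mismatch: hw=%u sw=%u\n",
                    hw, (uint32_t)RFS_SECONDS);
            return -1; /* the software should refuse to continue */
        }
        return 0;
    }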


Hash value: Another possibility to find out if the hardware matches the software is to generate a hash value for the RF and store it in the hardware, so that the software can compare it when starting. Examples for adequate hash algorithms are CRC32 [64] and MD5 [61]. The simplest implementation would be to generate the hash by using standard UNIX shell tools, like tar to put all XML files into a single file and md5sum on the resulting file. In the end this would have no advantage over using the creation time, because the creation time only changes when RFS is used to regenerate the RF, and this is usually only done when there are changes in the XML files. To have an advantage over the creation time it would be necessary to create the hash only from the information that describes the parts of the RF that are relevant for the software interface, because this would allow upgrading the hardware without software changes as long as the interface stays the same. The problem is that it would be relatively complex to either create such a hash directly in RFS with exactly the relevant information, or to write another type of XML file with this information from which the hash value is generated. Include XML: Another theoretical possibility would be to store the complete annotated XML in hardware; this way all RF related information would be stored inside the hardware and could be read from there. The problem with this approach is the size of the annotated XML file: in the case of EXTOLL it is 180 kB uncompressed, and still 16 kB compressed with the very good LZMA [62] algorithm. This would require four 36 kbit block RAMs. The Virtex6 xc6vlx240t FPGA that was used for development has 416 block RAMs [10]. Therefore, the decision was made to use the creation time, because it is the easiest method to implement, and the advantage of the hash solution is not worth the additional effort. Moreover, storing the annotated XML in the hardware is a waste of resources; it also has to be noted that this may require special ROM IP blocks in the case of an ASIC implementation. When the user is certain that the software interface fits the hardware despite a differing creation time, it is always possible to override the check on the software side.

2.3.5 Interface
In the terminology of RFS the RF has a software and a hardware view; the software view is accessible by a bus, and the hardware view is accessible in different ways, depending on the generated RF. The interface to the RF is very simple, because there are not many reasonable choices to define it in a different way. It is also very similar to the interface provided by RFs generated with Denali Blueprint. Therefore, it was very simple to replace a RF generated

by Blueprint in an existing project with a RF that was generated by RFS. The primary platform for RFS are 64 bit systems; therefore the granularity for addressing the RF is also 64 bit, and all accesses to the RF have to happen in the form of 64 bit quadwords. However, although the RFs are generally intended for 64 bit systems, this is only semantics: the actual width of all data paths is limited by the widest CSR or SRAM. Thus, only the data path width that is actually used costs resources. The interface is depicted in figure 2-13.

Figure 2-13: Software and Hardware View on the RF (software view: address [n bit], write_data [m bit], write_en, read_en, read_data [p bit], access_complete and invalid_address; hardware view: connections from and to the hardware modules)

• address: The read and write address; the width n depends on the RF. When a read or write is performed, this bus has to provide a stable address until the operation is completed with an access_complete signal.
• write_data: This bus exists only when there is at least one writeable register in the RF, which is the usual case. The width m is between 1 and 64, depending on the biggest writeable element in the RF. This bus also has to stay stable until the operation is completed.
• write_en / read_en: This signal has to be pulsed for a single clock cycle; if it were not pulsed, the RF would need internal state to prevent certain events from happening twice.
• read_data: Returns the read data and exists only when there is at least one readable element in the RF. The width is designated by the maximum width of the elements in the RF and is between 1 and 64 bit.
• access_complete: When the read or write is finished, this signal is asserted for one clock cycle; in the next clock cycle the next read or write can be performed.
• invalid_address: When a read or write is performed on an address that either does not exist in the RF or is not readable or writeable, this signal is asserted for one clock cycle.
A sketch of the resulting module interface is shown below.
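This is a minimal sketch in Verilog, assuming example widths; the port names follow the list above, while the module name and parameters are illustrative.

    module example_rf #(
        parameter ADDR_WIDTH = 6,  // n, depends on the RF
        parameter DATA_WIDTH = 64  // m and p, up to 64 bit
    ) (
        input  wire                  clk,
        input  wire                  res_n,
        // software view
        input  wire [ADDR_WIDTH-1:0] address,
        input  wire [DATA_WIDTH-1:0] write_data,
        input  wire                  write_en,
        input  wire                  read_en,
        output wire [DATA_WIDTH-1:0] read_data,
        output wire                  access_complete,
        output wire                  invalid_address
        // hardware view: one set of ports per register, e.g. red, red_next
    );
        // body generated by RFS
    endmodule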


The read access is shown in figure 2-14. As can be seen, the address bus has to be set and the read_en signal has to be pulsed. In this example the access_complete signal is asserted only a single cycle later, but this can also take much longer. The time depends on the number of hierarchy levels that have to be crossed to perform the read; each level adds at least two cycles.

Figure 2-14: Read Access (waveform: the address changes from 0x00 to 0x20, read_en is pulsed, read_data changes from 0x0000000000000000 to 0x0000000000000001, and access_complete is asserted one cycle later)

To perform a write to the RF, the address and the write_data bus have to be asserted; pulsing write_en for one clock cycle will cause the actual write. This is depicted in the waveform in figure 2-15. The operation in this example takes three clock cycles, but it can also take longer depending on the RF.

Figure 2-15: Write Access (waveform: the address changes from 0x00 to 0x20, write_en is pulsed with write_data 0x1234567812345678, and access_complete is asserted after the write)

2.4 XML Specification and Features
After the decision was made to use the Extensible Markup Language (XML) [4] as base format, a schema had to be developed that is pleasant for the developers who define the layout of RFs. Furthermore, it should be as easy as possible to parse. As explained earlier, Spirit IP-XACT XML did not meet these requirements.


To aid the understanding of the rest of this section it is crucial to explain how the different parts of an XML representation are named. Figure 2-16 shows a very simple part of an XML file and names its different parts. Most importantly, a start tag and the matching end tag together are called an element; an empty tag forms an element by itself. Attributes can be placed either in a start tag or in an empty tag.

Figure 2-16: XML Legend (a start tag and the matching end tag together form an element; an empty tag alone also forms an element; attributes reside in start tags or empty tags)

In contrast to IP-XACT XML, options like the width of a CSR are represented as attributes; the advantage is that this requires less code and is therefore easier to read and write. Generated RFs can have a hierarchical layout as depicted in figure 2-17, and therefore the XML files are also organized in such a way. Level 0 is optional and reserved for the RF_wrapper. The highest level in the actual RF is 1; this is the level of the top RF. Each sub-RF level below gets a bigger number. Of course, it is also possible to define RFs with only one level, but bigger ones have to be organized in a hierarchical fashion.

Figure 2-17: Hierarchy Levels (level 0: the optional RF_wrapper; level 1: top_rf; level 2: sub_rf1 and sub_rf2; shown once with and once without a wrapper module)

On the right-hand side of figure 2-17 there is a RF that does not make use of a wrapper module; it can be used directly with the RF interface as described before. On the left-hand side a wrapper module RF_wrapper is employed; such a module contains code to translate between the RF interface and another protocol. To give an example, the EXTOLL design uses an on-chip network (HTAX) that transports the HToC protocol.


This brings up the question why this wrapper module has to be generated and is not static for each project. The reason is that the RF has to be connected with many different wires to the top module, and obviously this should not have to be done manually, but rather automatically. A schematic diagram of such a setup is shown in figure 2-18. It seems as if the HT to RF converter could also be placed in the top_module, which would make the RF_wrapper useless. On a logical level this is correct, but in the physical implementation on an FPGA it is often necessary to specify placement constraints, and then it is easier when the RF and the converter reside in the same module.

Figure 2-18: Placement RF_Wrapper (the top module contains the RF_wrapper, which holds the HT_to_RF converter and the register file, as well as the XBAR, the HT core connected to and from the host system, and the modules A and B)

2.4.1 Basic XML Example of a RF
To illustrate the XML specification, a very simple example will be explained, which already shows a few of its properties. Based on this example the descriptions of the other elements that can be used in an RF will be easier to follow. Every RF must have an XML file where the hierarchy starts. This file does not itself contain any definitions for RF instances. Instances are XML elements that will result in implementations in the HDL output. In this example the file is called example.xml.
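A minimal sketch of such an example.xml is shown below; the concrete attribute names are illustrative assumptions, only the elements themselves (regfile, doc, vendorsettings, rrinst) are taken from the following description.

    <regfile>
        <doc name="example" brief="a very simple example RF"/>
        <vendorsettings>
            <!-- copied unprocessed into the annotated XML -->
        </vendorsettings>
        <rrinst name="top" file="top.xml"/>
    </regfile>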


As can be seen in the example, there is a vendorsettings element; it will be copied together with all its sub elements to the annotated XML, and RFS will not process it in any way. The purpose of this is to stick as closely as possible to the principle of having only a single source of specification: the vendorsettings information can be used to provide information to the software that makes use of the annotated XML file. A concrete example is the management software for EXTOLL. It uses the annotated XML files to find the addresses of RF elements and the information about the hardware from the vendorsettings, for example the number of link ports. The actual definition of the RF starts with the regfile element. It has to contain two mandatory sub elements. First, the doc element contains information for the generation of the documentation. The second element is either a wrapper or an rrinst element. Rrinst is short for regroot instantiation. The wrapper is optional; it can be placed around the rrinst, depending on the project requirements the RF is built for. The rrinst element requires two attributes: the name of the instance in the RF and the file where the actual regroot is defined. The following XML code is the top.xml file of this example and presents a regroot:
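(A minimal sketch, assuming plausible attribute names; the two contained elements and their settings follow the description of the example in this chapter.)

    <regroot>
        <reg64 name="testreg">
            <hwreg width="64" sw="rw" hw="rw"/>
        </reg64>
        <ramblock name="white" addrsize="2" sw="wo" hw="ro"/>
    </regroot>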

As can be seen in this example, the regroot resides in its own XML file and itself contains other elements. Each regroot has to be in its own XML file. The rrinst element is used to instantiate a regroot within the hierarchy of a RF. A regroot is the XML representation of a single RF within the RF hierarchy, and it will result in one HDL module. The file attribute of the rrinst element can of course also point to other directories, so it is easily possible to put together RFs with definitions from different sub projects. However, it is not possible to have multiple XML files with the same name in different directories, because all XML files must have unique names. The reason for this is that the names of the XML files are used as names for the C header files and Verilog modules, which are generated for each XML file, or to be more exact, for each regroot. Allowing multiple XML files with the same name would add complexity, but would have no application that is worth the effort. An early version of RFS also supported the definition of regroot hierarchies in a single XML file, but the complexity of supporting both methods was too high and not worth the effort. The regroot in the top.xml file contains two elements, a simple 64 bit register and an SRAM.


Details and the numerous options of these elements will be explained in the following subsections. The generation of the RF in this example is done by simply calling RFS in the following way, because there are no command line options other than the filename:

    rfs example.xml

The result of this example is the layout shown in figure 2-19. RFS starts filling the address space from 0x0 and places the testreg at this position. When a base address other than 0x0 is required, it is possible to use an aligner element for this purpose; it will be introduced later in this chapter. The ramblock white, on the other hand, will start at 0x20, because all elements are aligned by RFS to at least their size. When an element is aligned, this means that it starts at an address that is a whole-number multiple of its size.

Figure 2-19: Example Address Space (0x00: testreg; 0x08 - 0x18: empty due to the alignment of the ramblock; 0x20: white_0x0, 0x28: white_0x1, 0x30: white_0x2, 0x38: white_0x3; the CPU, and therefore also the software, can use this linear address space to access all four memory locations in the ramblock)

The ramblock has an addrsize of two. It contains four quadwords of data. Software can only write to it ("wo") and the hardware can only read from the RAM ("ro"). This example presented a few elements of a RF and showed how RFs are composed in a hierarchical fashion. All elements in RFS exist within a hierarchy. Figure 2-20 shows all important elements. The figure also shows how often an element can be used inside the hierarchy; n denotes an arbitrary amount. In the case of the reg64 the number of sub elements is limited by the available space: the reg64 is 64 bit wide, and therefore the sum of the widths of all sub elements has to be smaller than or equal to 64. RFS aligns all elements at least to their size, rounded up to the next power of two. Each element also occupies this amount of address space; by doing so the address for a sub element does not have to be calculated, and it is enough to select the right parts of the address. Thus, addressing requires fewer resources. Regroots can not be defined recursively, because this would lead to undefined behavior.


Figure 2-20: Hierarchy of RFS Elements (a regfile contains an optional wrapper and an rrinst; the rrinst refers to an XML file containing a regroot; a regroot may contain n aligner, placeholder, rrinst, repeat, ramblock and reg64 elements; a repeat may again contain n aligner, placeholder, reg64 and ramblock elements; a reg64 contains hwreg, rreinit and reserved elements; rrinst elements refer to further XML files with regroots, continuing the hierarchy)

In the following, all possible elements and their features will be explained in more detail, and their importance in the context of complex hardware structures will be presented. Some of these features are essential; others are for convenience or save development effort.

2.4.2 Hardware Register - hwreg
An hwreg is the hardware representation of a register and will instantiate the actual register in the hardware. At first it seems that it would be enough to have registers that are writeable and readable by software and hardware, but in the end more complex and different solutions are required to be able to design efficient hardware. The hwreg can be considered a synonym for the field in SystemRDL or IP-XACT, but the term hwreg fits better, because it denotes an actual hardware register, in contrast to the reg64, which is only a shell. An hwreg is a part of a reg64, as depicted in figure 2-21.


Figure 2-21: hwreg within reg64 (the hwreg occupies a part of the 64 bit wide reg64)

This is a typical example for an hwreg:
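(A sketch, assuming plausible attribute syntax; the name, width and access values follow the description below.)

    <hwreg name="red" width="64" sw="rw" hw="rw"/>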

It represents a 64 bit wide register in hardware with the name red. The width of an hwreg is between 1 and 64 bits. Each hwreg except the first one in a reg64 is required to have a name. When the first hwreg has no name, it will carry only the name of the reg64; otherwise, the inputs and outputs for the hwreg at the RF are named by concatenating the reg64 name with the hwreg name, separated by an underscore. Figure 2-22 shows a schematic of the hwreg red that results from the definition above. The reset path was omitted for clarity. It also has to be mentioned that this is only a logical view and the most likely implementation; a synthesis tool may choose another way to implement the functionality. The sw and hw attributes control the possibilities to read and write the register. In this example the software and hardware both have read and write access. As can be seen in figure 2-22, red_next and write_data are assigned to the actual flip-flops (FF) with a multiplexer (MUX) at (1); red_next is assigned to the FFs whenever the address does not match or write_en is not set. The multiplexer in this schematic switches to the lower write_data input when the select input is 1'b1. On the side of the hardware view the read path is very simple, because the FFs just have to be connected to the outside of the RF module. The software read path is implemented by using a multiplexer that is controlled by an address decoder. It makes sense to reduce these access possibilities as far as possible to save hardware resources. Both attributes can have the following values:
• "": no access at all
• "ro": read only
• "wo": write only
• "rw": read write


Figure 2-22: hwreg red (logical schematic: a MUX (1) selects between red_next and write_data as the next value of the red flip-flops; the address decoder, together with write_en, controls this MUX, and it also drives the software read multiplexer that produces read_data and access_complete)

Table 2-1 shows all possible combinations of the sw and hw attributes; supported ones are marked with Y, unsupported ones with N. Only combinations that make sense and were already required in a design are implemented. A and B appear to make no sense at first, but A is very useful to provide hardware specific parameters for the software. An example for such an application is storing the RF creation time as described in chapter 2.3.4. B is used to provide a location where the software can perform tests.

              hw=""    hw="ro"   hw="wo"   hw="rw"
    sw=""     N        N         N         N
    sw="ro"   Y (A)    Y         Y         Y
    sw="wo"   N        Y         N         Y
    sw="rw"   Y (B)    Y         Y         Y

Table 2-1: sw/hw Access hwreg

All cases where the software can neither read nor write also make no sense, because the hwreg would functionally be no more than a register, and there would be no reason to place it into the RF. When the hw access attribute is set to allow writing, the value from the $name_next input is applied every clock cycle to the internal register.


This is not always the desired behavior. For example, an hwreg might be used to store the header of the last packet received by some module for debugging purposes. The header may be the first quadword of each packet, detected via a start of packet (SOP) signal. To have the header of the last packet in the RF it would be necessary to add an extra register to the hardware that is only updated in the case of a SOP. Therefore, the hw_wen attribute was added to avoid the necessity of adding another register in such cases. Setting this attribute to "1" will cause the addition of an extra $name_hw_wen input to the RF. Only when this input is set to "1" is the value of the $name_next input applied to the internal register. This feature saves resources, because no extra register is required to implement the example from the last paragraph. In an ASIC all registers have an undefined value after powering up the chip, therefore a reset signal is required. This reset signal should cause the hardware to set initial values to all registers. In an FPGA things are a bit different. It is of course also an ASIC, technology wise, but all registers will be set by the bit file that is loaded into the FPGA, so in theory all registers have a predefined value. This predefined value is zero by default. Consequently, it is not necessary to have a reset path to assign a value of zero to registers. Nevertheless, when the design is for example a PCIe card, the hardware should reset all registers if the reset line of the board is asserted. Therefore, all RF registers should have a reset value. It is set with the reset attribute. The value has to be given in Verilog syntax, and the width of the value has to be correct. RFS performs no check on the width, because of the complexity of implementing this check correctly for all cases. However, if the reset value is given incorrectly, a warning will be emitted by the linter or synthesis software. Special values exist to simplify setting the reset value and to avoid errors:
• $zero
• $ones
• $seconds

Most likely a register will be initialized to zero; therefore the special value $zero exists, which creates a reset value in the Verilog format with the correct width. The same applies to $ones, but with all bits set to one. The special value $seconds can only be used if the register is 32 bit wide; it will use the creation time in seconds since 1st Jan. 1970 as reset value. This special value is required to ensure the consistency of the hardware and software interface as described in chapter 2.3.4. The value is acquired and stored by RFS when it starts, to guarantee that the value stored in the rfs_seconds.h file is the same as the one set by $seconds.


When no reset attribute is present, RFS assumes the default value $zero. In the special case that there is a design decision not to reset a register, the attribute has to be given empty. This way there will be no reset value for the register:

    reset=""

The software controls the hardware with the help of CSRs, which are changed by the software. To be able to react to changes of these registers by the software, there are two possibilities for the hardware:
• comparison
• an output signal that indicates a write by the software

Comparing the content of a register to sense a change has two disadvantages. First, it requires an extra register with the same size as the register that has to be compared, and additionally a comparator. Second, with this method it is only possible to detect changes, but not that the register was written at all. There are cases where this information is required; an example for such an application can be seen in the interrupt mechanism in chapter 3.6.5. For this reason the sw_written attribute was introduced. When it is set to "1", an output is added to the hardware, and this output is 1'b1 for a single clock cycle when the register was written by the software. This is independent from the content of the register, because the write itself triggers the sw_written pulse, not a comparison with the former value. Denali Blueprint did not offer this possibility. Hardware modules sometimes depend on the content of an hwreg, but also make use of the sw_written feature. Such a module would only get the content of the hwreg after a write by the software, although it could already start working with the default value in the RF. To solve this problem the hardware module could read the content of the hwreg directly after the reset, but this increases the complexity of the module slightly. Therefore, a sw_written="2" variation was implemented; in this special case the sw_written signal will also be asserted one cycle after the reset of the RF, for the duration of one cycle. Often enable bits in registers are used to control the activation and deactivation of hardware features. Such a register is shown in figure 2-23. To enable, for example, feature 0 it is necessary to read out the complete 64 bit register, set the corresponding bit to 1'b1 and write the whole 64 bit register back to the hardware. This is a read-modify-write operation. When there are different threads responsible for different features, it is necessary to prevent the threads from doing the read-modify-write operation at the same time. This can be done from the software, for example with a mutex.


Figure 2-23: Enable Bits (a 64 bit register in which bit 0 is the feature 0 enable, bit 1 the feature 1 enable, bit 2 the feature 2 enable, and so on)

Using a mutex is of course possible and no problem when all threads are in the same process context, but there are also cases where the threads may be in different processes. In this case using a mutex gets more complex, and the standard pthread mutex [121] can not be used. Therefore the sw_write_xor attribute exists. When this feature is activated, writes from the software are applied to the register with XOR. The thread then only has to care about the enable bits it is responsible for and can invert exactly these bits in the hardware. Thus, this feature provides an atomic method to set or unset bits; a minimal software usage sketch is shown at the end of this subsection. Register bits that are set by the software can be set back to the reset value by the hardware with a special input at the RF. When hw_clr="1" is set, a clear ($name_clr) wire is added to the interface of the RF to set the register back. This feature is only usable when the register is neither writeable by hardware nor a counter. For the implementation of an HT configuration space [30], a special feature is required. The specification describes that certain registers should return to their initial value when the register is written: “Each field is defined as readable and writeable by software (R/W), readable only (R/O), or readable and cleared by writing a 1 (R/C).” For this reason the attribute sw_write_clr was added to the hwreg. When this feature is activated, any write by the software will set the register to the reset value and not to the value that was written. This is independent from what is written to the register; any write will have this effect. Obviously, this implementation is inaccurate to a certain degree, because it does not require that a 1 is written as demanded by the specification, but there are two reasons why this was not done. First, it would add complexity, and it posed no problem when it was tested in the system. Second, the specification does not make clear what writing a 1 means; this could be a single 1 in the corresponding field, or a 1 anywhere in the register that contains the field. RFS will allow the usage of sw_write_clr only when the hwreg is writeable by software.
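To make the sw_write_xor semantics concrete, the following minimal sketch assumes a hypothetical 64 bit enable register feature_enable in a memory mapped overlay struct rf:

    /* thread responsible for feature 2: toggle only bit 2; with
       sw_write_xor="1" the hardware applies the written value with XOR,
       so the bits of the other features stay untouched and no
       read-modify-write is necessary */
    rf->feature_enable = (uint64_t)1 << 2;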


Settable Only Registers - sticky Assertions and event indicators are useful to find out about errors in designs or to find out if a specific condition was reached. Characteristic for such an event is that logic detects the condition and sets a register to one. An example for such a condition is that a FIFO was full and data was still inserted into the FIFO at least once. In this case the full signal would be used together with the shift_in of the FIFO to set the flag for a FIFO error. Such information can give the developer at least an idea where things are going wrong in the design. The registers therefore have to be set to one by the hardware; then the software will read them out and maybe set the register back to zero. At first glance this is very easy to implement, as shown in figure 2-24.

Figure 2-24: hwreg Assert Logic (the assertion logic ORs (A) the n assert bits with the current register value and feeds the result back into the RF via assert_next)

The problem is that there is a possible read-modify-write hazard in the time between the reading of the register and the writing of the register. All new assertion information happening in the meantime will be lost and therefore not be seen by the software, because there is most likely more than a single assertion bit in the reg64. Therefore, a method is required to avoid the problem. The first part is that the OR gates (A) in figure 2-24 are moved into the RF; otherwise, the following solutions could not be implemented or would have to be implemented manually. This is done by setting the sticky attribute to "1". Next, a mechanism is required to be able to read out all information from the register and set the bits back to zero without losing information because of a read-modify-write hazard. There are two possibilities:
• writing with XOR
• clear on read

Earlier the sw_write_xor attribute was introduced. Using the XOR operation allows setting back only specific bits from 1 to 0; therefore, the hazard can be avoided by using this feature, and no information is lost.


Even faster is the possibility to clear a register when a read is performed, by using the sw_read_clr attribute. Thereby, only a single transaction is required to read out the event information and reset the registers. However, there are disadvantages associated with this approach. Some systems implement 64 bit reads by reading 32 bits two times; when a reg64 in RFS is wider than 32 bit, the first read will clear the remaining bits. Fortunately, this is only a problem with old AMD Opterons of the K8 generation and probably other old hardware. Therefore, sw_read_clr has to be used with caution. Furthermore, it is very unusual that reading has an effect and can effectively cause the loss of data. Thus, it has to be ensured that the software developers who make use of the hardware are aware of this. The following XML code will instantiate a hardware register with this property:
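(A sketch, assuming plausible attribute syntax consistent with the earlier hwreg examples.)

    <reg64 name="red">
        <hwreg width="64" sw="ro" hw="wo" sticky="1" sw_read_clr="1"/>
    </reg64>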

The reset value was omitted in this case because every hwreg resets to zero by default if no other reset value is supplied. This XML definition of a register results in the hardware that is depicted as a schematic representation in figure 2-25. Here it also becomes obvious why the sticky functionality has to be integrated into the RF: otherwise the OR gate (1) could not be placed between the MUX and the red register.

Figure 2-25: hwreg with sticky and sw_read_clr (as in figure 2-22, but the OR gate (1) sits inside the RF between the MUX and the red register; on a software read the MUX selects 64'b0, so the register is cleared while new assert bits can still be ORed in)


Counters Counters are very important in complex designs because there are many possible events. Counting these events can provide valuable information about the hardware behavior. The counter values can be used for different purposes; the most important ones are statistics and debugging. As of January 2012, EXTOLL uses 257 counters of different sizes for analysis and debugging purposes in the FPGA version. An example for using counters to gain statistical information is that two counters can be used to find out the average packet size in a network. One of the counters has to count the number of packets; the other one can be used to count the bytes that are contained in all the packets. So the average packet size can be calculated for a time span. Debugging is often only possible with the help of counters. For example, in hardware that processes packets, it may happen that packets are lost due to errors. When the problem happens after millions of packets, it is practically impossible to simulate. With the help of counters at different positions in the design it is at least possible to find out in which part of the hardware the packet is lost. This can provide information for the further debugging of the issue. A counter itself is very easy to implement in Verilog, but there are very good reasons to provide special support in RFS:
• Less code has to be written manually in each module; thus sources are less cluttered and less error prone.
• Special hardware functionality can be used more easily to implement the counters.
• Special ways to reset counters can be implemented; these allow saving resources and have functional advantages.
Counters are usually only increased by the hardware, and the actual counter value therefore does not have to be readable or writeable by hardware. For special applications it is also possible to use the hw attribute to allow this. The EXTOLL design uses a counter in the RF to provide a time stamp counter (TSC). To be able to synchronize it with the help of hardware, it has to be possible to write it from hardware. Therefore, it is defined in the following way:
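(A sketch, assuming plausible attribute syntax; the width is illustrative, while the hw and hw_wen settings follow the explanation below.)

    <reg64 name="tsc">
        <hwreg width="48" sw="ro" hw="wo" counter="1" hw_wen="1"/>
    </reg64>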


As can be seen, in the hwreg element the hw_wen attribute is also set; this is required when a counter is used with the hw attribute set to "wo" or "rw". If the input value were assigned to the counter in each cycle, the counter value would be overwritten in every cycle and it would not work as a counter. Modern FPGAs have special digital signal processing (DSP) components for calculations. In many designs these are not used extensively, therefore they can be claimed for RF purposes. Xilinx FPGAs have DSP slices since the Virtex 4 generation in the form of the DSP48 slices [85]. These components have a multitude of features, to cite the user guide [105]: “The DSP48 slices support many independent functions, including multiplier, multiplier-accumulator (MACC), multiplier followed by an adder, three-input adder, barrel shifter, wide bus multiplexer, magnitude comparator, or wide counter.” The more recent Virtex 6 has DSP48E1 [86] slices that have even more features. Altera offers similar DSP resources in the Stratix IV devices [87]. All these DSP slices contain at least one multiplier and one adder, but for the RF only a small part of the functionality is used: no more than an adder and the output register are required. For this purpose a wrapper module exists that implements a counter that is up to 48 bit wide. The interface is specified in the following way:

    module counter48 #(
        parameter DATASIZE = 16, // width of the counter, must be <=48 bits!
        parameter LOADABLE = 1   // whether the counter can be loaded at runtime
    ) (
        input  wire                clk,
        input  wire                res_n,
        input  wire                increment,
        input  wire [DATASIZE-1:0] load,
        input  wire                load_enable,
        output wire [DATASIZE-1:0] value
    );

Implementations of this module were done for Xilinx, Altera and in generic Verilog for simulations and ASICs. To have concrete numbers about the resource savings achieved by using the DSP slices, a small test design was created. This evaluation design contains sixteen 32 bit wide counters in a register file. The repeat element in this example will be detailed in chapter 2.4.7. It is defined in the following way:
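(A sketch, assuming plausible attribute syntax; the repeat element with sixteen instances and the extra init register match the description in the text.)

    <regroot>
        <reg64 name="init">
            <rreinit/>
        </reg64>
        <repeat name="counters" loop="16">
            <reg64 name="cnt">
                <hwreg width="32" sw="ro" counter="1" rreinit="1"/>
            </reg64>
        </repeat>
    </regroot>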


The extra “init” register was added because it will be needed later to demonstrate the rreinit feature. To be able to compare the different resource requirements, a synthesis run was first performed with a counter48 module that does not make use of DSP slices, and then with one that does. The comparison, shown in table 2-2, was performed for the Xilinx Virtex 4, because it was the FPGA used most in the research group when the development of RFS started. It can be seen clearly that not only logic in the form of “Slice LUTs” is saved, but also “Slice Registers”, because the output registers of the DSP slices are used for this purpose.

    Resources                    no DSP    DSP
    Slice Registers (FFs)        612       64
    Slice LUTs (4 input LUTs)    1819      282
    DSP48                        0         16

Table 2-2: Resource Usage on Virtex 4: Counters

Over the year 2011 all developments shifted to Virtex 6 FPGAs. Thus, the same evaluation was also done for the Virtex 6. The results are shown in table 2-3. In this direct comparison between Virtex 4 and Virtex 6 it also becomes apparent how much more efficient the LUTs with 6 inputs are compared to the ones with 4 inputs when a counter is implemented.

    Resources                    no DSP    DSP
    Slice Registers (FFs)        639       79
    Slice LUTs (6 input LUTs)    777       214
    DSP48E1                      0         16

Table 2-3: Resource Usage on Virtex 6: Counters

There are multiple methods that can be used to increment counter values, which save development effort. Accordingly, the counter attribute can have different values; in the standard case it is "1". A waveform diagram of the standard case is shown in figure 2-26: the counter is incremented by one in every cycle as long as the countup input is one.


Figure 2-26: Counter Type 1 (waveform: counter_value increments by one in every cycle in which countup is asserted)

There are cases where it can be helpful to have other ways to increment counters, because the standard way with a countup signal requires extra code. A common pattern is that a counter has to be increased only when a specific condition in a big case construct is met. In the following example, the counter should only be increased under one specific condition. There are two possibilities to implement this with the standard type 1 counter. In the first case the countup signal has to be set to 1'b1 in the case when it should be increased and to 1'b0 in all other cases; this can result in many lines of code and the possibility of forgetting cases.

    casex({valid,state})
        {1'b1,4'b0001}: begin
            if(condition)
                cnt_countup <= 1'b1; // <= this condition
            else
                cnt_countup <= 1'b0;
        end
        {1'b1,4'b0010}: begin
            cnt_countup <= 1'b0;
        end
        ... // more possibilities, not shown
    endcase

The next possibility is to place the control for the countup signal in an extra block.

    if(valid && (state==4'b0001) && condition)
        cnt_countup <= 1'b1;
    else
        cnt_countup <= 1'b0;

Consequently, another location in the code has to be maintained and kept in sync with the rest of the code. The alternative presented here is to use edge detection, because this way a single bit register can be used that is inverted whenever the counter should be increased. This feature can be used by setting the counter attribute to "2".

casex({valid,state})
    {1'b1,4'b0001}: begin
        if(condition)
            cnt_edge <= ~cnt_edge;
    end
    {1'b1,4'b0010}: begin
        ...
    end
    ... // more possibilities
endcase

Rather than using many lines of code to set a register either to 1'b0 or 1'b1, this method achieves the same with only a single line of code. The behavior of counter type 2 is shown in the waveform in figure 2-27.

[Waveform: clk, edge, counter_value 0x00 0x01 0x02 0x03 0x04]

Figure 2-27: Counter Type 2

In addition, there is also a way to count up only on rising edges. This is useful, for example, when packets have to be counted that are accompanied by a valid signal. With counter type 3 the valid signal can be connected directly to the RF. The waveform of this counter is depicted in figure 2-28. Counting packets this way does of course only work when they are not transferred back to back; in that case there has to be another delimiter between the packets that can be used for counting. Falling edges can be counted by inverting the signal.

[Waveform: clk, redge, counter_value 0x00 0x01 0x02]

Figure 2-28: Counter Type 3

The advantages of the counters that make use of edge detection come at a price, because the edge detection requires logic and flip-flops. Therefore, the amount of resources required for counter types 2 and 3 was evaluated. Table 2-4 shows the resource requirements of the different counter types for a RF with sixteen 32 bit counters. In the case of type 2 or 3, each of the counters requires two extra registers and one extra LUT compared to the type 1 implementation. The reason for this is that one register is required to store the value of the edge input in the last cycle and the other register controls the increment signal to the counter48. However, the extra register provides the advantage that the timing is relaxed, because there is an extra register stage between the input of the RF and the actual counter48.

Resources          type 1    type 2    type 3
Slice Registers    79        111       111
Slice LUTs         214       230       230
DSP48E1            16        16        16

Table 2-4: Resource Usage Comparison of the Counter Types

Reset of Counters - rreinit

In big designs there are often multiple counters in the same regroot, and some of these counters usually relate to each other. For example, one counter may count the number of packets and another one the amount of data transferred; together they provide the information about the average size of a packet. At some point the counters have to be set back to zero, because a new experiment will be performed or because one of the counters is overflowing. To be able to calculate the average size accurately, it is important that the counters are reset at the same time.
When the counters are written one by one from software, it can happen that the actual writes do not happen directly behind each other. Two possible reasons for this are that the thread was scheduled away by the OS or that there is a lot of traffic on the PCIe interface. Consequently, it cannot be guaranteed that the counters are set to zero at the same time. Therefore, it is desirable to have a method to reset multiple counters at once. This feature requires an additional reg64 element in the regroot that contains an rreinit element. As a result, it is possible to use the rreinit="1" attribute for counters. When this attribute is used, it is no longer possible for software to write the counter directly, therefore the sw attribute has to be set to “ro”. Furthermore, it is also not possible to write the counters from the hardware view. Thus, the counter example from before has to be written in the XML in the following way:
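A minimal sketch of how this could look; the attribute spellings are illustrative:

<reg64 name="init">
  <rreinit/>  <!-- a single software write here resets all counters with rreinit="1" -->
</reg64>
<repeat name="cnt" loop="16">
  <reg64 name="counter">
    <hwreg name="value" width="32" counter="1" rreinit="1" sw="ro"/>
  </reg64>
</repeat>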


Rather than writing all registers that carry the rreinit attribute, it is now sufficient to perform a single write to the reg64 with the rreinit element, and all counters with the attribute will be set to zero. Other initial values were not considered, because there is no good reason for a reset value different from zero in the case of a counter. This way the counters do not have to be writeable by software, and therefore no write path to the counter registers or DSP slices is required. The assumption was that this would save resources; therefore the example RF with sixteen 32 bit counters from the last subsection was used again to find out how big the possible savings are.

Resources          no rreinit    rreinit
Slice Registers    79            64
Slice LUTs         214           199
DSP48E1            16            16

Table 2-5: Resource Usage Comparison for the rreinit Feature

Table 2-5 shows the results. In the “no rreinit” column there are exactly the same numbers as in the last comparison in table 2-3; for the “rreinit” column the rreinit attribute was set to 1 and additionally the sw access rights were set to read only. The resource saving is small, but it is nevertheless a helpful measure, because no routing resources are required to write the counters this way. Routing resource usage can, at least with the Xilinx FPGA tools, not be accounted for easily, but routing is a scarce resource in congested regions of the FPGA. Counters can also not be placed freely by the FPGA tools, because the DSP slices are at fixed positions within the FPGA fabric. Thus, saving routing resources can be crucial to reach timing closure.

2.4.3 64 bit Register - reg64

The 64 bit register is an element that appears as a single 64 bit register from the software view; this means that it can and has to be used as a 64 bit value by the software. In the following it is called reg64 to avoid confusion. This is also the name of the XML element. The reg64 is the software view on up to 64 bits of hardware registers, because it can be addressed from the software view interface. The reg64 has by itself no representation in hardware. The actual hardware elements are represented by sub elements in the XML definition; the reg64 can not be instantiated without sub elements. Each reg64 can contain:
• hwreg
• reserved
• rreinit

The hwreg is the most complex element and was explained in the previous subsection. The rreinit element allows a special way to control counters: instantiating it in a reg64 will create a software writeable “register” that triggers an rreinit event in the regroot of this instance. It was also explained in the previous section. Software acquires the content of registers by reading the 64 bit wide reg64 that can contain multiple hwregs. As a result, the software will use masking and shifting to extract the information from the different hwregs. To be able to keep the order and place of bit fields within a reg64 constant, it can be very useful to have the ability to reserve bits for future uses. For this reason reserved fields can be used as placeholders, which have no direct resource requirements. When the software reads from the register, these bits will be zero. This way the software that makes use of this register can be written to anticipate bigger sizes of certain registers. The size of the reserved field is given by the width attribute. An example is a reg64 that contains a register which stores the size of a packet; in an early phase of the development eight bits might be enough, but the expectation could be that in the future more bits will be required.

64 bit Register Example

The following example demonstrates the results of instantiating the different elements that can be used in a reg64 and how they are combined in the RFS XML specification language:
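The original listing is not reproduced here; the following sketch is reconstructed from the field layout in figure 2-29 and the interface in figure 2-30, so the sw/hw access attributes in particular are assumptions:

<reg64 name="example">
  <hwreg name="test1" width="16" sw="rw" hw="ro" desc="description"/> <!-- output example_test1 -->
  <hwreg name="cnt1"  width="16" counter="2"/>                        <!-- input example_cnt1_edge -->
  <reserved width="16"/>                                              <!-- bits [47:32] -->
  <hwreg name="test2" width="16" sw="ro" hw="rw"/>                    <!-- example_test2 / example_test2_next -->
</reg64>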

In figure 2-29 the different fields in the reg64 are depicted.


[Bit layout: bits 63:48 test2, 47:32 reserved, 31:16 cnt1, 15:0 test1]

Figure 2-29: Example reg64

RFS will also generate an entry in the C header files; the header contains the reg64 in the form of a uint64_t. To help the software developer extract the content of the separate hwregs, masks are provided in the comments. Bit fields were not used, because it is faster when the quadword, which contains multiple hwregs, is read at once and split into its parts afterwards. Using bit fields would cause the compiler to generate code that reads from the RF for every bit field, because the volatile keyword must be used for RFs.

/* [15:0] test1 mask=0xffff (desc=description) */
/* [31:16] cnt1 mask=0xffff0000 */
/* [63:48] test2 mask=0xffff000000000000 */
volatile uint64_t example;

This example will result in multiple inputs and outputs of the RF that can be used to connect logic. Their number and type depend on the attributes of the hwregs. Figure 2-30 depicts the interface that will be generated for the example.

[Block diagram: example_test1 (16), example_cnt1_edge, example_test2 (16), example_test2_next (16); only the inputs and outputs relevant to this example are shown]

Figure 2-30: Example reg64 Block Diagram

2.4.4 Aligner

There are multiple reasons why a certain alignment may be required in a register file; they may be due to either technical or organizational requirements. When RFS is used to model a RF that implements a certain specification, it may be required that some registers are at specific addresses. The aligner is used to align the elements that follow it to an address that fulfills a certain alignment condition. There are two mutually exclusive methods. The first is the absolute attribute: it will put the next element in the RF at the given or a higher address, depending on the alignment requirements of the next element.
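In the XML this could look as follows (a sketch; absolute is the attribute described above):

<aligner absolute="0x1008"/>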


This aligner element will put the next element at address 0x1008; though, the next element may start at a higher address if it itself requires a certain alignment. The second possibility is using the to alignment; the parameter denotes the number of least significant bits (LSB) that have to be zero.
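A sketch of the second variant:

<aligner to="6"/>  <!-- the six least significant address bits have to be zero -->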

With this aligner element the start address of the next element will be a multiple of 2^6 = 64; thus it will be aligned to a 64 byte boundary or a cache line. There are two reasons why such an aligner can or has to be used:
• organization
• security / safety

The absolute aligner is often used to organize big RFs and to assign different start addresses to the sub-RFs of different modules. This is technically not necessary, because RFS can handle the addressing automatically, but it is a design decision of the development team.
When a hardware unit can be used by different user level processes, it has to be guaranteed that hostile user level processes are not able to read or write parameters that they are not supposed to access. The obvious solution is to use the OS kernel as a proxy between hardware and software. So every time a user level process wants to read or write a value, it has to inform the OS kernel about this wish. Then the kernel will decide, based on an access control list (ACL), if it will perform the access. When the application is latency sensitive, this is out of the question because it will be relatively slow. In x86_64 computer systems the page size is 4 kB; therefore, this is the granularity the MMU can govern. Thus, the aligner can be used to put RF components at 4 kB boundaries, so that the OS kernel can map them into the virtual address (VA) space of user level processes. The following example demonstrates how it is possible to create an address setup that allows utilizing the MMU for access control purposes.
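A sketch of such a setup, reconstructed from figure 2-31; the hwreg contents of the registers are omitted, and to="12" forces 4 kB page alignment:

<aligner to="12"/>   <!-- start a new 4 kB page -->
<reg64 name="regA"> ... </reg64>
<aligner to="12"/>   <!-- start the next 4 kB page -->
<reg64 name="regB"> ... </reg64>
<reg64 name="regC"> ... </reg64>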


This example shows the definition of the 64 bit registers regA, regB and regC. These registers are encased by aligner elements; the resulting address setup is shown in figure 2-31. The OS kernel can map the 4 kB address range with regA into the address space of one user level process and the address range that contains regB and regC into the address space of another process. These user level processes will only be able to access exactly the registers in the 4 kB page they are supposed to access and no other elements of the RF.

[Address setup: regA at 0xC000 (page mapped into the 4 kB address space of process0, rest of the page empty); regB at 0xD000 and regC at 0xD008 (page up to 0xDFF8 mapped into the 4 kB address space of process1, rest of the page empty)]

Figure 2-31: Address Setup with aligner

2.4.5 Ramblock

Some values inside a RF can be stored in SRAMs, depending on how the hardware has to be able to access the values. When an SRAM is used, the hardware view also has to use an SRAM interface to read or write the data. Whether this is a disadvantage or an advantage depends on the type of logic. The XML element is called ramblock. The ramblock can be used from the software exactly like the corresponding amount of registers. The main advantage of using a ramblock is the lower resource usage, especially in FPGAs.
The ramblock element can contain field sub elements, which are used to describe the content or purpose of the different bits in the ramblock. RFS does not use this information to generate code, but puts the information into the documentation and the annotated XML. Furthermore, it compares the sum of the widths of all fields with the width of the ramblock as a small sanity check.


To give an example for a ramblock:
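A sketch matching the interface shown in figure 2-32; the field names and the sw/hw values are assumptions (note that the field widths sum up to the ramwidth of 33, satisfying the sanity check mentioned above):

<ramblock name="exaram" ramwidth="33" addrsize="8" sw="rw" hw="rw">
  <field name="data"  width="32" desc="payload"/>
  <field name="valid" width="1"  desc="entry is valid"/>
</ramblock>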

Each ramblock is required to have a name. The ramwidth can be between 1 and 64 bits. Ramblocks of course also have a size, which is described by the addrsize attribute; this attribute can also be 0 for a special application that will be demonstrated in this subsection. The example ramblock has 256 entries. There are two different types of ramblocks that can be implemented with RFS: they can either be internal, which is the default, or external by setting the external attribute to "1". When a ramblock is internal, the SRAM will be instantiated automatically within the RF; when the ramblock is external, an SRAM interface will be provided by the RF and it is up to the developer to connect it. The example from above instantiates an SRAM within the RF module and provides a read/write interface that can be used by the hardware. All the interface signals that will be provided for the hardware are depicted in figure 2-32.

[Block diagram: SRAM with exaram_addr (8), exaram_ren, exaram_rdata (33), exaram_wen, exaram_wdata (33); only the inputs and outputs relevant to this example are shown]

Figure 2-32: Ramblock Interface Example

Ramblocks support the same possibilities to control the access to the SRAMs as hwregs; therefore the sw and hw attributes can be used. Not all combinations of sw/hw attributes are supported; only combinations that make sense and were already required in a design are implemented. For example, a ramblock that is write only for software and hardware makes no sense. Additionally, there is also the possibility to use an external clock to drive the hardware view interface of the RAM; for this the ext_clk attribute has to be set to "1".


Depending on the sw/hw combination a different SRAM module will be instantiated. There are four different possibilities:
• ram_1w1r_1c (1 write port, 1 read port, 1 clock)
• ram_1w1r_2c (1 write port, 1 read port, 2 clocks)
• ram_2rw_1c (2 read/write ports, 1 clock)
• ram_2rw_2c (2 read/write ports, 2 clocks)

When the external clock is used, a variant with two clocks has to be used, of course. Table 2-6 lists all possible combinations for the sw and hw attributes when no external clock is used. Only few combinations are supported; the reason for this is that most of them make no sense. For example, a ramblock that can not be accessed by software (sw="") does not serve any purpose from the perspective of the RF. The remaining combinations that may have useful applications were not implemented, because there was no demand.

hw="" hw="ro" hw="wo" hw="rw" sw="" NNNN sw="ro" N N N 2rw_1c sw="wo" N1w1r_1cN N sw="rw" N 2rw_1c N 2rw_1c

Table 2-6: sw/hw Access Ramblock

The module interface of the first SRAM module ram_1w1r_1c should serve as example, because it is the shortest one:

module ram_1w1r_1c #(
    parameter DATASIZE= 78, // Memory data word width
    parameter ADDRSIZE= 9,  // Number of address bits
    parameter PIPELINED= 0
) (
    input  wire                clk,
    input  wire                wen,   // write enable
    input  wire [DATASIZE-1:0] wdata, // write data
    input  wire [ADDRSIZE-1:0] waddr, // write address
    input  wire                ren,   // read enable
    input  wire [ADDRSIZE-1:0] raddr, // read address
    output wire [DATASIZE-1:0] rdata, // read data
    output wire                sec,   // single error corrected
    output wire                ded    // double error detected
);


This SRAM module has one write and one read port; the former consists of wen, wdata and waddr, the latter consists of ren, raddr and rdata. Power can be saved when the enable signals are not asserted all the time. SRAM blocks in ASICs and FPGAs have a clock to output time that is relatively high in comparison to the clock to output time of flip-flops. Therefore, they should usually be pipelined for timing reasons. The clock to output time of the Xilinx Virtex 6 in speed grade 2 is 1.79 ns [9] [119]. The pipelined attribute of the ramblock can be used to control whether the SRAM will be pipelined or not; the default is "1", but in some cases it can be fast enough to use no pipelining, thus saving hardware resources (flip-flops) and one cycle of latency.
Earlier, an application of the aligner was shown where the alignment was used together with the MMU of the host system to allow user level processes to access specific entries in the RF. It is possible to reach a similar effect by using ramblocks instead of reg64. If multiple processes should be able to access one or a few entries of a ramblock within a RF, then it is possible to use the addr_shift attribute. The addr_shift attribute allows shifting the address that is used by the software view to the right. For example, by setting this attribute to "9", each RAM entry will be in its own 4 kB page. This way the operating system (OS) kernel is able to map single entries of the ramblock to different processes in a safe way, because the page size on AMD64 systems is 4 kB. Figure 2-33 depicts this.
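A sketch of such a ramblock; the name is hypothetical, while the sizes match figure 2-33 (256 entries of 64 bits, each appearing in its own 4 kB page of the software view):

<ramblock name="mailbox" ramwidth="64" addrsize="8" addr_shift="9" sw="rw" hw="rw"/>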

[Diagram: as seen by software the ramblock spans 256 * 4 kB = 1 MB (entries 0x0 to 0xFF, each in its own page); the actual SRAM occupies only 256 * 8 B = 4 kB of physical addresses]

Figure 2-33: addr_shift of 9


External Ramblock

In some cases it is not possible to use the normal ramblock as described previously, with a direct instantiation within the RF. That is the most convenient way, but it is often either not efficient or impossible. The problem is that true dual-port SRAMs are “expensive”, in FPGAs as well as in ASICs. A “true dual-port” SRAM has two ports that both provide the possibility to read and write the memory. For EXTOLL, true dual-port SRAMs should not be used on either of the two targets. In Xilinx Virtex6 FPGAs the BRAM memories [9] have numerous configuration options, but essentially they are only 36 bits wide in true dual-port mode, while in simple dual-port mode they can be used 72 bits wide. In the simple dual-port mode there is one read and one write port; this can be used to implement the ram_1w1r module. The ASIC implementation shares a similar problem, because SRAMs with true dual-port mode are slower. Consequently, only one read and one write port are available for EXTOLL, which is supposed to work on FPGAs and ASICs.
Thus, arbitration has to be performed on these ports when software and hardware both should be able to read or write. There were two possibilities where arbitration could take place: either within the RF or outside of it. The design decision was to leave the arbitration to the developer of the hardware logic, because this way the arbitration can be done in a way that fits the logic better. Additionally, the RF interface can be delayed because there is an access_complete signal, so the logic can prioritize according to its requirements. External ramblocks can be instantiated by setting the external attribute to "1".
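A sketch matching the interface in figure 2-34; the name and the sw value are assumptions:

<ramblock name="ext" ramwidth="33" addrsize="8" external="1" sw="rw"/>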

With this XML definition no SRAM will be instantiated in the RF, but an interface to the outside will be provided; the interface protocol is the same that is used to connect RFs. A detailed description was given in chapter 2.3.5. When a ramblock is external, the hw attribute does not matter, because RFS has no control over this part. Depending on the sw attribute the interface will have a read enable (ren) and/or a write enable (wen). A possible setup with an external ramblock interface and an arbiter is shown in figure 2-34. This external interface opens up the possibility to connect hardware units that provide the interface but are no RAMs. It allows accessing special units from software with very low overhead in terms of development effort and hardware resources.


[Block diagram: the RF connects via ext_addr (8), ext_ren, ext_rdata (33), ext_wen, ext_wdata (33) and ext_access_complete to an arbiter, which multiplexes between the hardware logic and the SRAM; only the inputs and outputs relevant to this example are shown]

Figure 2-34: External Ramblock and Arbiter

One obvious application is to use the external ramblock interface to connect third-party RFs, which can be found in some IP blocks, because the interface can be translated easily for this purpose. Due to the flexibility of the interface it allows implementing special features in the hardware with very small development effort. In a complex hardware unit it may, for example, be necessary to log more than one condition that causes errors. A data structure is required to store multiple of these conditions; the easiest way is to use a FIFO for this purpose, because a FIFO provides a simple interface to store information. A FIFO as depicted in figure 2-35 is a data structure that fulfills the task described by its name: data can be stored in a first in, first out fashion. The shift_in triggers the storage of the data connected to d_in (data input). The empty output indicates whether the FIFO is empty; as soon as the empty output is 1'b0, valid data is available at d_out (data output). Entries can be removed from the FIFO by asserting the shift_out signal. With a FIFO, wires that are relevant to debug the possible error condition can be connected to the data input of the FIFO, and the error condition itself can trigger the shift_in of the FIFO. The FIFO could be full at some point; it is a design decision what should happen in that case: either the error logging could stop, or old entries could be removed by using the full signal for shifting out old entries.
The RF interface that allows this implementation is instantiated in the RF with the following XML element. This example also shows the usage of an addrsize of "0". By setting the addrsize to "0", RFS will not provide an addr bus on the interface, because it would have no use in this case.
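A sketch of this element, matching the signal names in figure 2-35; the sw value is an assumption, and the 64 bit read data carry the 63 FIFO bits plus the empty flag:

<ramblock name="elf" ramwidth="64" addrsize="0" external="1" sw="ro"/>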


[Block diagram: the error reporting logic feeds d_in[62:0] of a FIFO, with shift_in triggered by the error condition; the RF reads d_out[62:0] together with the empty flag via elf_ren, elf_rdata (64) and elf_access_complete, driving shift_out; full can be used to discard old entries]

Figure 2-35: Error Logging FIFO

A block diagram of this mechanism is shown in figure 2-35. It also shows the limitation of this approach: the width of the FIFO can only be 63 bits, because the 64th bit of the interface is used to connect the empty signal. Thus, the software reading out the FIFO is able to see the empty signal and can read from the memory address as long as the 64th bit is 1'b1.
Each external ramblock uses by default extra output registers for all interface signals, because this has positive effects on the timing. However, this also requires a lot of resources. Thus, the possibility was added to share registers between multiple external ramblocks; this is done by setting the shared_bus attribute to "1". The default for this attribute is "0". By doing so, the wdata and addr registers of all external ramblocks that have the shared_bus attribute set are shared. It is not possible to share the control signals. The wdata and addr outputs may of course have different sizes; RFS takes care that the shared registers are big enough and that the interface on the outside of the RF module is the same as without the shared_bus attribute.

2.4.6 Regroot / Regroot Instance

Each regroot has to be stored in its own file. Consequently, the resulting C header files and Verilog files can be named exactly like the XML file that contains the regroot. A regroot can contain the following elements:
• rrinst
• reg64
• aligner
• ramblock
• repeat
• placeholder

The regroot element itself does not have any attributes. In the example in chapter 2.4.1 it was already shown that regroots have to be instantiated with the rrinst element.


The task of the regroot instance (rrinst) element is to instantiate regroots; rrinst is the short form for regroot instance. It is possible to instantiate regroots multiple times. The rrinst has two required attributes:
• name
• file

Each instance of a regroot must have a name, so that it can be addressed and put into the hierarchy. It can be named like the XML file that contains the regroot, but there is no direct relation; it can be named in any way. The file attribute points to the XML file with the regroot; the filename can be either relative or a full path. Relative paths are resolved from the location of the XML file that contains the rrinst element. The optional external attribute is the most interesting one: when set to "1", the regroot will not be instantiated within the RF where the rrinst element is used, but an interface will be provided to connect it externally, as sketched below. This is depicted in figure 2-36. The external attribute can be combined with the shared_bus attribute, which was introduced in the last subsection; for external regroots it has the same functionality as for external ramblocks. There are multiple advantages associated with using a regroot externally:
• less wiring
• hardware modules are more portable
• place and route advantages
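A sketch of an external regroot instantiation; the file name is hypothetical:

<rrinst name="sub_RF1" file="sub_rf1.xml" external="1" shared_bus="1"/>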

Verilog modules are connected by wires. Wiring is done in the Verilog file that contains the modules, and it can be extensive in big designs. In figure 2-36 on the left side, the six lines between unit1 and sub_RF1 as well as between unit2 and sub_RF2 symbolize the big amount of wiring that is often necessary. All these wires have to be routed through the top file. On the right side of the figure, sub_RF1 and sub_RF2 are used as external instances and can therefore be integrated directly into the units. These sub-RFs are connected to their parent RF with the standard RF interface as described earlier. Therefore, lots of tedious and error prone wiring is not needed at the top level. All wiring for the hwregs and ramblocks can stay within the actual module where the elements are required, so lots of work and code can be saved.


[Figure: on the left, a topfile with a central registerfile wired to sub_RF1 in unit 1 and sub_RF2 in unit 2 via many HW control and status lines; on the right, the sub-RFs are instantiated inside the units and connected with the generic RF interface, preserving the SW view on the sub_RFs]

Figure 2-36: Comparison with and without External Regroots

When external regroots were introduced in the EXTOLL design, the size of the top module was reduced from 3337 lines of Verilog to 2109 lines. Additionally, in many cases the Verilog topfile can stay unchanged if changes are made in the RF of a module, because the changes are contained within the module that instantiates the external sub-RF. The connection between the RF and its sub-RFs is parameterized; therefore, even when the widths of the data or address paths change, no Verilog source changes are required. Moreover, when the RF is inside the unit, the XML file can be maintained together with the unit. Consequently, units are more portable between different designs, because the unit itself and the XML definition of the RF are maintained together. Furthermore, the module and its RF can be changed without causing any effects on the code of the complete design. Placing the RF components within the modules also improves the results that can be achieved with the place and route tools.
For big ASIC designs partitioning is mandatory, because the EDA tools can only handle logic up to a certain size. The components of the planned EXTOLL ASIC and a possible partitioning scheme are depicted in figure 2-37. The hierarchical partitioning of the RF allows crossing the partitions with a relatively low number of wires. A part of the partitioning can be seen in the RF structure as shown in figure 2-45 on page 88; especially the big nw (network) sub tree can be seen there.


[Figure: three partitions — the host interface partition with the PCIe/HT host bridge, HTAX crossbar and bridge_rf; the network interface partition with VELO, RMA, SMFU, SNQ, ATU and the system core; the network partition with the XBAR, the LP link ports and nw_rf; the connection between bridge_rf and nw_rf consists of the only wires that cross the partition boundary for the RF]

Figure 2-37: EXTOLL Partitions

The different partitions have to be connected, and therefore the physical placements in the GDSII [122] have to be aligned so that the data paths that cross the boundaries of partitions are connected. Figure 2-38 illustrates the issue. Therefore, it is useful to have in each partition a single sub-RF that contains all further sub-RFs of the respective partition, because this reduces the number of connections between the partitions.

[Figure: two adjacent partitions A and B; wire 1 and wire 2 have to line up at the partition boundary so that the paths connect]

Figure 2-38: Physical Placement Illustration

In summary, the external regroots allow the hierarchical decomposition as described in the introduction of this chapter and depicted in figure 2-10.


2.4.7 Multi Element Instantiation

Often designs have modules that are replicated multiple times, and all these modules have to be controlled. For example, EXTOLL has, depending on the underlying hardware, between four and seven link ports. Another often seen design pattern is that one unit requires multiple registers with exactly the same properties. The obvious solution is to replicate the entries multiple times and slightly change their names, for example by adding a number. This approach has two disadvantages. First, the input XML is bigger and therefore more complex to maintain; this is only a minor disadvantage. Second, there will be no regular structure that the software can take advantage of. The alternative to replicating the entries in the XML specification is to use a repeat block, because it provides the possibility to instantiate certain RF elements multiple times. A repeat block can contain the following sub elements:
• reg64
• aligner
• placeholder
• ramblock

The following example shows a repeat block that repeats the inner elements two times. The ramblock in this example is very small, because otherwise figure 2-39, which shows this repeat block on the left, would be too big.
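The original listing is not reproduced here; the following sketch is reconstructed from the element names in figure 2-39 and the C struct below (the reg64 contents are omitted; the green ramblock holds two 64 bit entries, i.e. addrsize="1"):

<repeat name="rep" loop="2">
  <reg64 name="blue"> ... </reg64>
  <ramblock name="green" ramwidth="64" addrsize="1"/>
  <reg64 name="red"> ... </reg64>
</repeat>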

The most important attribute of the repeat block is loop, because it decides how often the elements in the repeat block will be instantiated. This XML code will instantiate in the Verilog four reg64 named rep_0_blue, rep_0_red, rep_1_blue and rep_1_red as well as two ramblocks named rep_0_green and rep_1_green. In the C header files, however, the advantage of the repeat block becomes apparent.


[Figure: on the left the repeat version — rep_0_blue, rep_0_green, rep_0_red (A) and rep_1_blue, rep_1_green, rep_1_red (B) have the same layout; on the right the cut and paste version — blue_0, green_0, red_0 (C) and blue_1, green_1, red_1 (D) have different layouts]

Figure 2-39: Repeat Block Alignment

struct rep {
    volatile uint64_t blue;
    char pad1[8];
    volatile uint64_t green[2];
    volatile uint64_t red;
    char pad2[8];
};
...
struct rep rep[2];
...

The repeat block appears in the C header files as an array of structs, and therefore it is possible to use an index to access the different elements in the struct. This is the big difference from instantiating elements multiple times by cut and paste. RFS aligns the elements within the repeat block in a regular way, so that it is possible to access the elements with an array of structs. In figure 2-39 the two blocks of elements are marked with (A) and (B) together with their alignment. Elements are instantiated by cut and paste in the following way:
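A sketch of the cut and paste variant with the element names from figure 2-39:

<reg64 name="blue_0"> ... </reg64>
<ramblock name="green_0" ramwidth="64" addrsize="1"/>
<reg64 name="red_0"> ... </reg64>
<reg64 name="blue_1"> ... </reg64>
<ramblock name="green_1" ramwidth="64" addrsize="1"/>
<reg64 name="red_1"> ... </reg64>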


As a result, the address setup will look like the right side of figure 2-39; as can clearly be seen, (C) and (D) do not have the same layout, because of the alignment that is necessary for the ramblock. Therefore, it is not possible to create structs in the C header files without the alignment that is done by the repeat block.
RF XML files are in some cases used in different designs that have slightly different requirements, and therefore the loop attribute may depend on the design. Consequently, the following elements in the RF will start at different addresses. This is not always appreciated. There are different possibilities to solve this issue:
• the loop parameter is the same in all designs
• aligner
• special functionality

Keeping the loop parameter the same in all designs, so that the same RF components are added to the Verilog code, will of course work, but it wastes resources. Though it depends on the exact circumstances, the synthesis may be able to optimize all surplus components away; nevertheless, it is not a good idea to rely on optimization. Another possibility is to use an aligner to align the next element, but this requires knowledge of the actual address setup. Of course this information can be extracted easily from the generated documentation, but it is contrary to the requirement that the developer should not have to be concerned about addressing. Therefore, special functionality was required to solve this problem. The maxloop attribute reserves space for the number of iterations given by the attribute. Thus, the start address of the next element in the RF stays constant, independent of the actual loop attribute. In the C header files the respective amount of padding will be added. The value of the maxloop attribute has to be bigger than or equal to the value of the loop attribute, of course.
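A sketch of a repeat block using maxloop; the numbers correspond to the four to seven link ports mentioned above, the name is illustrative:

<repeat name="lp" loop="4" maxloop="7">
  ...
</repeat>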

2.4.8 Placeholder

Hardware designs are often used on different hardware platforms, for example FPGAs and ASICs, but from a software perspective it is desirable to have the same address setup in all implementations. However, in some cases different implementations may require different RF elements. Thus, the placeholder element was added, which can be used to even out some of the differences between the implementations and keep the address setup consistent. EXTOLL is going to be implemented for different platforms, and these cause different requirements for the RF. Therefore, some implementations have more or other elements in the RF, but from a software perspective the RF interface should be the same everywhere.


For this purpose the placeholder element was added to RFS; it allows exactly what the name suggests. For example, one difference between the FPGA and the ASIC version of EXTOLL is that the ASIC version actually has to make use of the sec (single error corrected) and ded (double error detected) outputs of the RAMs; the interface of such a RAM module (ram_1w1r_1c) can be found in chapter 2.4.5. Therefore, the ASIC version should also provide the information about these errors in the respective RF of the module, but the FPGA version should not have the reg64, to avoid wasting resources. With the help of the placeholder the problem can be solved in the XML definition:

#ifdef ASIC
...
#else
...
#endif

The placeholder will occupy the same amount of address space as one reg64, and therefore the address setup of the following RF elements will be the same in the FPGA and the ASIC version. Besides the num_reg64 attribute it is also possible to use the addrsize attribute to reserve the space for a ramblock of the given size. When the placeholder is used with the addrsize attribute, the alignment is exactly as in the case of a ramblock of the respective size. In the case of the num_reg64 attribute, however, there is no alignment.
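A sketch of how the two branches could be filled in; the register and hwreg names are hypothetical:

#ifdef ASIC
<reg64 name="ram_errors">
  <hwreg name="sec" width="1" hw="wo" sw="ro"/>  <!-- single error corrected -->
  <hwreg name="ded" width="1" hw="wo" sw="ro"/>  <!-- double error detected -->
</reg64>
#else
<placeholder num_reg64="1"/>
#endif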

2.4.9 Trigger Events

In big hardware and software designs it is very useful to have a method to give notice of events in the hardware. These events can be received either by hardware or software. Details about handling such events in software and examples for this will be given in the next chapter about the System Notification Queue (SNQ). It describes a special unit that will not only send an interrupt to the software to inform about an event, but also directly send information that is related to this event into a queue that is stored in the main memory. All information that is interesting for the software or that can trigger events has to be accessible in the RF, especially when the software is also supposed to be able to access it. Accordingly, an event is nothing else than the change of a RF value. Therefore, the RF is the right place to detect these changes and trigger further actions.


Furthermore, the XML specification of the RF is the right place to define which changes should trigger an event, because the code that has to be generated to detect the changes is best placed in the RF. The event is called trigger event (TE); each TE has a name for identification. Only elements of the type hwreg can be used to generate TEs. The te attribute has to be used to mark hwregs as TE sources:
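A sketch of a TE source; the name, width and access attributes are illustrative:

<hwreg name="status" width="16" hw="wo" sw="ro" te="eventA"/>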

The hardware will emit eventA if this register is changed by the hardware. RFS generates Verilog code that keeps a shadow register and compares it with the content of the register every cycle. When the hwreg is implemented either as a counter or with the hw_wen attribute, the respective special inputs are used to detect the change. This saves the resources for the shadow register and the comparator. As a result, when the hw_wen attribute is set and the hw_wen input for the hwreg is asserted, an event will be emitted even if the content of the register does not change.
RFs can be instantiated from the same XML file in different places within the RF hierarchy; thus they will also have the same te attributes. These sub-RFs may be within different modules that should not share the same events. The first and obvious solution would be to copy the XML file to another filename and change the te attribute, so that it takes the place within the hierarchy into account. This solution would result in two slightly different XML files that both have to be maintained. Therefore, another solution was found; the te attribute supports two special strings:
• %n complete hierarchy
• %i iteration number

The %n will be replaced by the complete hierarchy name of the RF where the te attribute is used; the %i string is for repeat blocks and will be replaced by the current iteration number. A te attribute can also be responsible for multiple events, by assigning multiple event names, separated by commas, to the te attribute. All TEs are written to an output file that is named snq_te.txt. Each line of this file describes one TE.


The number of the TE corresponds to the wire in the trigger bus that will be provided by the top level RF. In figure 2-40 a RF is shown that makes use of TEs; the corresponding snq_te.txt file would have the following content:

0 A
1 B
2 C
3 D

Trigger signals are pulsed and therefore asserted only for a single cycle. To reduce the manual effort required to interconnect the modules, it was decided to use a single data path to transport the trigger signals between the different hierarchy levels of the RF. This is depicted in figure 2-40.

[Figure: a trigger tree — sub0_rf emits A,B; sub2_rf emits A,C,D; sub3_rf emits A,C; sub1_rf combines A,B,C; the top_rf combines A,B,C,D from its sub-RFs]

Figure 2-40: Trigger Tree

This figure also shows that the trigger signals are propagated from the sub-RFs to the top_rf and that on this path events with the same name are combined into a single trigger wire.

2.4.10 Annotated XML

The annotated XML is required by rgm and the register file driver generator (rfdrvgen); it was already introduced briefly in chapter 2.3.3. RFS adds attributes, which contain addressing information, to the elements in this file. All added attributes are prefixed with an underscore. The _absoluteAddress attribute is added to the following elements when they are outside of a repeat block:
• regroot
• repeat
• ramblock
• reg64


Elements that are within a repeat block do not have an absolute address, because they are used multiple times. Therefore, the repeat block has an _iterSize attribute that describes the size of one complete iteration of the repeat block including alignment. All elements within the repeat block have an _offset attribute that describes the offset within the repeat block.

2.5 Implementation of RFS

The implementation of RFS requires the possibility to parse and write out XML files; these requirements can be fulfilled by quite a lot of programming languages and environments. In the end C++ [73] in combination with QT [68] was chosen. The reason for this is the good prior experience with the XML functionality of the QT library. For parsing of XML in C++ a viable alternative would have existed in the form of libxml2 [69], but QT also provides other useful functionality that makes C++ development easier. Execution speed and memory requirements were not important considerations in the course of the implementation; the priorities were implementation time and correctness. The task of RFS is so simple for a modern CPU that the execution time is only limited by the speed of the storage device the inputs are read from and the outputs are written to. For example, the generation of the EXTOLL RF takes 0.2 seconds. In the end it does not matter if the RF generation takes 0.1 or 30 seconds, considering the hours that will be required to generate a bit file in the case the code is employed in an FPGA, or the weeks for the ASIC implementation.
In the development process of EXTOLL R2 it became apparent that multiple different RFs will be required for different implementation targets. In the case of EXTOLL R2 there are at least the following development targets:
• ASIC
• Xilinx Virtex 6 FPGA with HyperTransport and 6 link ports
• Xilinx Virtex 6 FPGA with HyperTransport and 4 link ports
• Xilinx Virtex 6 FPGA with PCIe and 4 link ports
• a small FPGA implementation to reduce the turnaround time

These different targets have different requirements regarding the RF: for example, the number of link ports differs, and they may or may not contain an HT core.


Therefore, a method had to be found that allows generating target specific RFs from a single set of XML files. For example, the following properties can change depending on the configuration:
• width of registers
• number of instantiations
• registers or whole sub-RFs have to be included or excluded

Theoretically it would be possible to change RFS so that it could do this depending on a configuration file; however, it would be a lot of effort. Fortunately, the C preprocessor (CPP) can do exactly what is needed, because it can replace text and include or exclude parts with the help of #ifdef/#endif blocks. The only minor problem is that CPP is not able to understand Verilog defines; therefore a Perl script was developed that converts the Verilog style defines to C defines and combines all defines into a single file. This allows using the same Verilog defines, which are used in the Verilog code, to control which elements are included in the RF. By using only a single configuration source for both, a possible source of errors is eliminated. The workflow is depicted in figure 2-41.
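As a sketch of this mechanism (the define name is hypothetical): the Perl script converts a Verilog style define such as

`define WITH_HT_CORE

into its ANSI C counterpart

#define WITH_HT_CORE

so that CPP can evaluate it when filtering the generic XML input:

#ifdef WITH_HT_CORE
<rrinst name="ht_rf" file="ht_rf.xml"/>
#endif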

[Figure: Verilog style header files are converted by a Perl script into an ANSI C style definitions.h; together with ANSI C style header files and the generic XML files it is fed into the C preprocessor, which produces the target specific XML files that are processed by RFS]

Figure 2-41: RFS Workflow with CPP

2.6 Supporting Infrastructure

The generated RFs are by themselves only of limited usability; they require infrastructure so that they can be used efficiently.


2.6.1 HyperTransport on Chip to RF Converter

In the designs developed at the CAG, an on chip network is used to interconnect the different functional units with the host systems. The on chip network is used with the HToC protocol. Therefore, a unit is required that translates this protocol to the interface of the RF; this is done by the HT to RF converter. In figure 2-42 the relevant components are shown: the converter translates between the HT protocol and the RF interface, and also provides a RF interface for special extra units. The converter arbitrates the access to the actual RF.

[Figure: the system bridge connects via the HTAX interface to the HT to RF converter inside the RF_wrapper; the converter drives the RF interface of the RF and provides a second RF interface for a special unit]

Figure 2-42: HT to RF Converter

The HT to RF converter also implements a time-out mechanism for the case that the RF does not answer with the access_complete signal in time. In the generated RF itself this can not happen, even when wrong addresses are used, but when an external RAM interface is used, external logic has to handle the read and write access. This logic could have bugs; the time-out then prevents the host system from being affected by such a bug. Otherwise, it is possible that a read that does not return will trigger a machine check exception (MCE). When a write times out, the converter just does nothing, because it has no way to report errors back; but when a read times out, all bytes in the read response are set to 0xFF and an error bit is set in the HToC packet header.

2.6.2 Debug Port

The main development target of the EXTOLL project is the development of an ASIC. In such a chip there are many configuration options that can be accessed through the RF. The problem is that these can only be accessed by the host system if the interface to the host system is working.


Not only the PCIe but also the HT interface to the host system has numerous configuration options, and it is, while the ASIC is under development, not entirely clear if the defaults will be correct. Therefore, a way is required to load parameters into the RF before the connection to the host system is working. Furthermore, it may also be necessary to access the RF for debugging purposes as long as the host interface is not functional. Consequently, the debug port was developed and presented in [37]. It provides two possibilities to modify or access the RF without the interface to the host system:
• SPI
• I2C

It is able to read data from an SPI flash memory [123] with an SPI master controller. The SPI flash memory can contain address and data tuples that will be used to write entries in the RF. Additionally, it also provides an I2C bus slave that can be used to read and write RF entries. The debug port uses the “special unit” interface of the HT to RF converter to access the RF, as depicted in figure 2-43.

[Figure: inside the FPGA/ASIC the system bridge connects via HTAX to the HT to RF converter in the RF_wrapper; the debug port is attached to the converter's second RF interface and is reachable from the host system via SPI and I2C]

Figure 2-43: Debug Port

2.6.3 OVM Components

The verification environment has to be able to read and write the RF, exactly like the software that runs on the host system. Therefore, the verification environment should be able to use a naming scheme instead of addresses, because the addresses can change with any change in the RF.


For this purpose rgm, the register and memory package, can be used. This tool uses the annotated XML file and generates OVM components.

2.6.4 RF Driver Generator

The sysfs [77] file system is a virtual file system of the Linux kernel that provides system information and settings. It does not use any storage device; all files and directories are generated on the fly. These files can be handled like normal files: it is possible to read and write them. On a standard Linux system the sysfs file system is mounted under /sys. For debugging purposes it is very useful to be able to read and write registers without the need to modify, for instance, a C program each time. The RF driver generator (rfdrvgen) uses the annotated XML file to generate a kernel module that provides a directory within this /sys hierarchy. This directory contains a file for each reg64 and ramblock, so that these can be read and written. The following example demonstrates how cat and echo can be used on the shell to read and write registers.

root@computer:/sys/../net_rf# cat tcp_rf_target_mac
addr: cd7e7a211b00; valid: 1;
root@computer:/sys/../net_rf# echo 0xAABBCCDDEE > tcp_rf_target_mac
root@computer:/sys/../net_rf# cat tcp_rf_target_mac
addr: aabbccddee; valid: 0;

The complete path to the net_rf directory was omitted for brevity. Additionally, the driver generator also produces C, TCL and Python code to access the RF.

2.7 Application of RFS

Multiple designs were already done with the help of RFS. In the following, two of these designs will be presented, and specialties and results regarding their RFs will be outlined.

2.7.1 Shift Register Read and Write Logic

Some hardware designs use shift register chains instead of a register file as described in this chapter. There are multiple reasons for using them in a design, but on the other side there are of course also disadvantages. The main advantage is the low usage of resources: shift registers usually consist of flip-flop chains, therefore no big multiplexers are required to read out the different registers directly. This is the reason why shift registers are often used in custom analog designs, because it is relatively easy to implement them efficiently at such a low level.


Obviously, shift registers have the big disadvantage that they are very slow, because they handle only a single bit per cycle, and furthermore in long shift register chains many bits have to be read or written. So changing a single bit can take hundreds of cycles. Hardware designs that make use of shift registers have a chain of flip-flops to get data in and out of the design, but usually the content of these flip-flops does not have a direct effect on the logic before an update is done. Figure 2-44 visualizes such a setup. Writing of data is done bit by bit over the bitin input. Before the data in this chain has an effect on the design, an update signal has to be asserted that takes over the data into the actual design. Reading data works similarly: first a scan signal has to be used to transport the data from the actual logic into the chain, then it can be read bit by bit by sampling the bitout.

[Figure: the logic flip-flops are fed from a read/write chain of flip-flops; bitin shifts data into the chain, update transfers the chain content into the logic, scan captures the logic state into the chain, and bitout is sampled to read the data; all flip-flops share clk]

Figure 2-44: Shift Register Read/Write Chain

An example for such a design is the Spadic [88], a read out chip for physics experiments; the project website has more details and publications [97]. The features of RFS allow developing glue logic with only a very small overhead to read and write such shift registers. The RF for the interface consists of two elements: first, a control register, which is used to tell the glue logic whether a read or a write is to be performed and the size of the respective chain; additionally, an external ramblock, which is used to get the data to and from the chain.


Using an external ramblock to write the data to the unit provides two crucial advantages. First, this allows making use of the access_complete signaling, and therefore it is no problem when the writing or reading takes multiple cycles. The ramblock is 16 bit wide, so every read or write to it will put 16 bits into the shift register or read out this amount of data. Second, it allows providing consecutive addresses. Subsequently, this allows implementing higher level units that perform reads and writes bigger than the size of a single register. Such writes carry a start address, the size and the actual data payload. Writing is done in the following way from the software interface of the register file:
1. write the control register with the right mode, chain and length information
2. write the amount of bits given in the length information to the ramblock
Reading works very similarly:
1. write the control register with the right mode, chain and length information
2. read the amount of bits given in the length information from the ramblock
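A sketch of the two RF elements; the register and field names as well as the widths of mode and chain are assumptions, while the 16 bit external ramblock follows the description above:

<regroot>
  <reg64 name="control">
    <hwreg name="mode"   width="1"  sw="rw"/>  <!-- read or write -->
    <hwreg name="chain"  width="4"  sw="rw"/>  <!-- selects the shift register chain -->
    <hwreg name="length" width="16" sw="rw"/>  <!-- number of bits to shift -->
  </reg64>
  <ramblock name="data" ramwidth="16" addrsize="8" external="1" sw="rw"/>
</regroot>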

2.7.2 EXTOLL RF

To control all the features of EXTOLL, a relatively big register file is required. This RF consists of many sub-RFs and makes use of most RFS features. The RFs for EXTOLL are the biggest RFs currently generated with RFS. There are multiple different implementations of EXTOLL; table 2-7 shows two of them. The 64 bit Xilinx Virtex 6 FPGA version is the current main development platform, and the ASIC version is at the moment the biggest one.

Resource type                                               64 bit FPGA    ASIC
Number of regroots                                          52             73
Number of reg64                                             494            1036
Number of hwreg elements (without counters)                 1010           3219
Amount of bits in these hwregs                              11769          26055
Number of counters                                          257            677
Amount of bits in these counters                            6292           18440
Number of ramblocks                                         56             71
Amount of bits that can be addressed within these           601152         2818752

Table 2-7: Statistics for Two EXTOLL RFs


Figure 2-45 shows a few of the 73 regroots of the ASIC version in their hierarchical order. The FPGA version is very similar, but has for example fewer crossbar (XBAR) ports.

[Figure: extoll_rf at the root with pcie_rf, rma_rf, nw_rf, velo_rf and further sub-RFs; below nw_rf are xbar_rf, barrier_rf, lp_top_rf and lp_phy_rf; xbar_rf contains 11x xbar_inport_rf and 11x xbar_outport_rf, lp_top_rf contains 7x lp_rf]

Figure 2-45: RF Hierarchy of EXTOLL

The numbers show that in the FPGA version the average counter is 24.5 bits wide. These 257 counters are implemented in DSP slices. The screenshot from Xilinx PlanAhead in figure 2-46 shows the EXTOLL implementation for the Virtex 6 xc6vlx240t. Rectangles that are black, red or blue are used: the red ones are RF components, the blue ones delay elements and the black ones the remaining logic. It can be observed that the RF components are distributed in between the other logic. Unfortunately, it is hard to see in the screenshot, but basically the complete FPGA is used; only the upper left and right corners are empty.

2.8 Optimization Efforts

In the course of the development of RFS it became apparent that the big amount of RF components has a relevant resource impact, especially in the case of FPGAs and big designs. For this reason, ways had to be found to reduce these resource requirements. In the following, multiple techniques are explained and their impact on a big design is evaluated. The underlying technology is a Xilinx Virtex6 FPGA and the design is EXTOLL. The crossbar of the EXTOLL design is the object of this optimization effort, because it contains a lot of RF components, and a big part of the resources used by the crossbar is actually occupied by them. Namely, the crossbar contains one top RF, 16 sub-RFs and, for timing reasons, two stages of delay elements between the top RF and the sub-RFs. Delay elements are modules that pipeline the interface path with the help of registers to relax the timing, but of course they occupy flip-flops for this purpose.


Figure 2-46: EXTOLL PlanAhead Screenshot

The interfaces between a RF and its sub-RFs allow this, because read and write operations are initiated by the read_en or write_en control signals and the completion is indicated by the access_complete signal. Thus, each of these signals can be delayed without a problem when the accompanying data buses are delayed in the same way.


Figure 2-47 shows such a delay element; the widths in this figure are taken from the delay elements inside the crossbar. It is also possible to use multiple delay elements after each other. In the case of the EXTOLL crossbar two delay stages have been used, because other- wise it was not possible to reach timing closure.

[Figure: a delay element between the RF and a sub-RF; read_en, write_en, addr (9) and write_data (36) are registered towards the sub-RF, read_data (36) and access_complete are registered towards the RF]

Figure 2-47: Delay Element

Delay elements are not instantiated automatically within the respective RFs, because this would reduce their effectiveness in solving timing problems. Figure 2-48 shows a typical usage scenario for a delay element: it sits between two placement bounding boxes, which may be far away from each other, and one or more delay stages are used to relax the timing on the path between these two boxes. The Xilinx FPGA tools would not be able to do this if the delay element were part of either the top RF or the sub_RF, because then the delay registers would be within one of the bounding boxes.

Figure 2-48: Placement Delay Module

Figure 2-49 depicts only the RF components inside the EXTOLL crossbar. There is one RF on the top level of the crossbar. The crossbar is an 8x8 design and consists of eight incoming (inport) and eight outgoing ports (outport); each of these ports has its own RF. These RFs are used for status information, statistics and settings. In the inports the RF is also used to initialize the routing tables. The crossbar together with 17 RFs and 32 delay modules requires a lot of resources. This configuration is the baseline for the optimizations that are going to be performed to reduce the resource requirements. The resource requirements for this baseline and the following optimization steps are shown in table 2-8 on page 94.


Figure 2-49: Crossbar RF Structure

The low number of flip-flops (FF) used is misleading, because it looks as if using them were no problem: there are twice as many FFs as LUTs in the Xilinx Virtex 6 (301440 versus 150720). However, due to the architecture of the FPGA, the usage of a FF often means that the logic resources inside the slice cannot be used. In Virtex6 FPGAs the SLICEs [112] contain the components that can be seen in the annotated screenshot from Xilinx PlanAhead [104] in figure 2-50. PlanAhead is a tool that can be used for floor planning and design analysis.

Figure 2-50: Xilinx Virtex 6 SLICEL


In figure 2-51 a typical section of the crossbar in the PlanAhead device view is shown. The red squares are FFs, and these belong mainly to delay elements. The other components of these slices are completely unused, because the delay elements do not need the logic resources. Consequently, the logic resources in these slices are lost.

Figure 2-51: Slices Close-up View

Therefore, it is important to reduce the usage of flip-flops. One possibility, which will be detailed in chapter 2.8.3, is to use multi-cycle data paths and pipelined, synchronized control signals for the handshake.

2.8.1 Step 1

The baseline RF was created with an earlier version of RFS that generated all read and write paths 64 bit wide. In a first optimization step RFS was improved so that the widths of the different wires between RFs and their sub-RFs were limited to the actually required width. RFs can contain elements with different widths, and the software accessible parts do not have to be 64 bit wide; they can also be smaller. This optimization was expected to save resources, because fewer registers and fewer routing resources are required. In the end no resources were saved, as can be seen in table 2-8; the reason is that the Xilinx tools were already able to detect which bits are required and optimize the hardware accordingly.

2.8.2 Step 2

The analysis of the RFs that are instantiated in the crossbar has shown that the RFs in the crossbar inports contain three reg64s and three ramblocks. All elements in the RF besides one reg64 are only 36 bit wide. This reg64 contains two 32 bit counters and is 64 bit wide, and therefore the read and write paths to this sub-RF have to be 64 bit wide. Consequently, the read path also has to produce a 64 bit wide output with a 64 bit wide multiplexer.


This motivated the idea to split this reg64 into two, because doing so limits the width of the read_data and write_data buses to 36 bit. The same holds, of course, for the delay elements. Resources were indeed saved, as can be seen in column “step 2” of table 2-8.

2.8.3 Step 3

There are two main possibilities to relax the timing in a design: pipelining and multi-cycle paths. Pipelining is already done with the delay elements. A multi-cycle path ensures that after a certain time, for example two clock cycles, the signal is valid, but there is no way to ensure that different signals with the same property arrive at the same time. Therefore, if the read_en were also constrained as a multi-cycle path, it could happen that this signal has a delay of 4 ns and the address one of 8 ns. At 200 MHz the read_en would then arrive after one cycle and the address one cycle later. The result would be that the read in the sub-RF is performed on an invalid address, because the multi-cycle constraint only enforces the end time of the signal propagation. The solution to this problem is a mixture of pipelining and multi-cycle paths. Multi-cycle paths are used on the addr, write_data and read_data data paths, while the control signals are pipelined. This is depicted in figure 2-52.

Figure 2-52: Delay Element for Multi Cycle
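Under the same naming assumptions as the delay element sketch above, the multi-cycle variant could look as follows: only the handshake signals pass through registers, while addr, write_data and read_data are wired through and must additionally be covered by multi-cycle path constraints in the tool flow:

    // Sketch of the multi-cycle variant: only the handshake signals are
    // pipelined; addr, write_data and read_data pass through and must be
    // constrained as multi-cycle paths. Names are assumptions.
    module rf_delay_element_mc (
        input  wire        clk,
        input  wire        read_en_in,
        input  wire        write_en_in,
        input  wire [8:0]  addr_in,
        input  wire [35:0] write_data_in,
        output wire [35:0] read_data_out,
        output reg         access_complete_out,
        output reg         read_en_out,
        output reg         write_en_out,
        output wire [8:0]  addr_out,
        output wire [35:0] write_data_out,
        input  wire [35:0] read_data_in,
        input  wire        access_complete_in
    );
        // Control is delayed by one cycle; by the time read_en/write_en
        // reach the sub-RF, the combinational data paths have had a full
        // extra cycle to settle, which is what the constraint permits.
        always @(posedge clk) begin
            read_en_out         <= read_en_in;
            write_en_out        <= write_en_in;
            access_complete_out <= access_complete_in;
        end

        assign addr_out       = addr_in;
        assign write_data_out = write_data_in;
        assign read_data_out  = read_data_in;
    endmodule

Compared to the fully registered stage, such a module keeps only a handful of flip-flops per stage, which is the source of the savings discussed below.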

The promising results can be found in column “step 3” of table 2-8.

2.8.4 Step 4

For timing reasons every sub-RF was connected to its own write_data and address registers; this way the placer can place these registers closer to the corresponding sub-RFs. The disadvantage of this approach is obviously the resource usage of these registers, but these resources can be saved thanks to the multi-cycle paths that were introduced in step 3. For this reason, RFS implements the “shared_bus” feature also for regroots.


2.8.5 Results

Considerable amounts of resources can be saved with the right optimizations; especially the usage of multi-cycle paths and the shared_bus feature contributes to the savings. The results of the different steps are shown in table 2-8. It is important to note that these numbers cover not only the RF components of the crossbar, but the whole crossbar.

                                    baseline  step 1  step 2  step 3  step 4
  FD_LD (flip-flops)                   22827   22828   21235   18883   14624
  saved in comparison to last step         -      0%    7.0%   11.1%   22.6%
  LUT                                  25789   25880   25496   25496   22975
  saved in comparison to last step         -   -0.4%    1.5%      0%    9.9%
  MUXFX                                   82      76      75      75      76
  CARRY                                 1544    1544    1544    1544     880
  BMEM                                   128     128     128     128     128
  MULT                                    32      32      32      32      32
  DMEM                                   496     496     496     496     496
  OTHERS                                 456     456     456     456     456

Table 2-8: Crossbar Resource Usage

Chapter 3: System Notification Queue

3.1 Introduction

In complex hardware designs, error conditions and other events can occur that require attention by the controlling system software. The software needs not only the information that something interesting has happened, but also what has happened, with a close temporal correlation to the event. A concrete example is an error on a network link. The software quickly has to decide whether it has to reroute data connections over other links before the link failure affects the whole network due to full buffers, or whether it is possible to fix the link. Fixing the link could be done by resetting it, reducing the number of lanes in use, or reducing the data rate, because this reduces the error probability. For this decision it needs all available information about the link as quickly as possible. In the classic scheme all status information is kept in registers and the hardware will issue an interrupt when certain registers change. Thus, the system software will become active and start reading out multiple registers to find out which issue or event caused the interrupt. This approach has two main problems:
• Resources: Reading out multiple registers uses bandwidth and CPU time, because reads are blocking.
• Latency: The time from when the hardware issues the interrupt until the system starts reading out data is long, and therefore it is possible that the status registers will by then have another value. Furthermore, there is also a time gap between the reads of the different status registers.
Measurements have shown a read latency of 256 ns when reading the RF from an HTX Ventoux board, which makes use of the HyperTransport interface.


In the following chapter a solution is proposed that aims to solve these problems in an efficient manner in conjunction with RFS. The system consists of hardware and software components. The hardware part is called System Notification Queue (SNQ). The configuration of this system is partly generated automatically, based on the XML input for the RF. The general principle is depicted in figure 3-1. Logic causes a change of a hardware register (hwreg) in the RF, and as a result the information about the change is written by the hardware to the main memory of the host system, so that the software can access this information.

Figure 3-1: Overview

The first design that will make use of the SNQ is EXTOLL; therefore, it is used as the platform to explain the design decisions.

3.2 Motivation

The research into methods to report hardware states efficiently to the software was motivated by the development of the EXTOLL network, because it can take advantage of such functionality. EXTOLL is a complex design with many components that can experience failure or require attention by the operating system (OS) of the host system. There are a few typical events that require attention:
• cable is detected or unplugged
• data transfer on link has errors
• packet arrival
• single and double bit errors in SRAMs
• unexpected error conditions


It is a typical event for a NIC that a cable is plugged in or unplugged, and in this case the software has to be informed. Every time this happens the management software of EXTOLL may have to change entries in routing tables. VELO and RMA packets usually do not cause interrupts on arrival, but this can optionally be enabled. In that case the OS has to either handle the packet or wake up the process waiting for it. The problem is that the kernel needs to know which one of the up to 256 processes should be woken up. Possibly, multiple processes have to be woken up. Therefore, an efficient way is required to get this information from the hardware to the OS kernel. Classical interrupt schemes are based on the method that after an interrupt, registers are read from the RF of the device. The amount of data that can be transported by a single read is only 64 bit, but VELO and RMA have 256 receive queues. Therefore, it would either be necessary to read four times for each unit, or a more elaborate scheme to encode the IDs of the relevant queues in registers would have to be developed. In a design as complex as EXTOLL, hardware bugs can occur, especially in the phase of the FPGA bring up. Then unexpected error conditions have to be observed and reported with as much information as possible, so that the problem can be analyzed.

3.2.1 Performance Counters

Complex hardware designs often contain multiple performance counters that can be used to analyze certain characteristics of the design. These counters can be used to analyze performance or to debug problems. For example, there may be two counters: one counts the number of packets on a link and the other one the number of bytes. With these counters the average size per packet can be calculated. To get a precise number it is beneficial to read out the different registers with minimal time in between, and that is exactly what the SNQ is good for. The EXTOLL design has hundreds of counters, as can be seen in table 2-7 on page 87.

3.3 Register File

For all of these points it is important that the overall system, consisting of the hardware and a host system, has the following properties:
• the information related to the issue has to be stored as quickly as possible
• the system has to be informed quickly about the issue; for this reason an interrupt has to be issued


All of the events that could make it necessary to inform the system are accompanied by the change of a register in the register file; for example, there is a register that indicates whether a link is up or down. Therefore, the register file is the optimal place in the design to implement the required infrastructure to sense when a register has changed. This part of the overall architecture was already covered in chapter 2.4.9. There are different possibilities in the design space, as depicted in figure 3-2, to capture the data from the registers for the software.

Figure 3-2: Design Space: RF Read Method

The first possibility is the use of snapshot registers. Extra registers could be added to the RF automatically by RFS for each register that is interesting. When the trigger event (TE) occurs, a copy of the register is made to the snapshot register. The SNQ would read out the snapshot register at some point. Freezing of the software interface is a variation of the snapshot register. The software view of the registers could be decoupled from the hardware view in the case of a TE. This would save resources in comparison, because no extra resources would be required to read out the snapshot registers from the software view. However, both of these solutions require extra resources and development effort for the RF, and they still cannot assure that the snapshot or freeze takes place at exactly the right moment, because the snapshot would happen at least one cycle after the event that causes it. The reason for this is that the code that generates the event should have a registered output for timing reasons. Furthermore, the time difference could be much bigger because of the RF hierarchy. Therefore, in the end “best effort” was chosen. After a TE the SNQ will try to read out the information as quickly as possible. If this is not good enough, then it is up to the developer of the respective hardware module to write custom Verilog HDL that provides a reliable mechanism for extracting state information from exactly the point in time when the event happens.


3.3.1 Trigger Dump Groups

It has been established in chapter 2.4.9 that the RF is the source of events, which can trigger further actions by the SNQ. These TEs are defined in the XML specification of the RF and will be listed in a snq_te.txt file that is generated by RFS. Consequently, the question is which registers will be stored to the main memory when a specific event is triggered. It makes sense to define this also in the XML definition of the RF. Information that is sent to the main memory is organized in trigger dump groups (TDGs). In contrast to the TEs, the TDGs refer to reg64s instead of hwregs, because reg64s can be addressed from the software view. To add a reg64 to a TDG, the tdg attribute has to be added.
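The corresponding XML listing is not reproduced in this extract; the following sketch merely illustrates the idea for the two registers exa1 and exa2 used in the example below, where everything except the reg64 elements, the register names and the tdg attribute values is an assumed detail of the RFS input format:

    <reg64 name="exa1" tdg="examplegroup">
      ...
    </reg64>
    <reg64 name="exa2" tdg="examplegroup+opt1,othergroup">
      ...
    </reg64>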

Based on the tdg attributes RFS will generate a snq_tdg.txt file that contains all TDGs and the corresponding addresses. In this example the register exa1 is at the address 0x100 and exa2 at 0x108. The snq_tdg.txt would have the following content:

examplegroup 0x100 0x108+opt1
othergroup 0x108

One reg64 can also be in multiple TDGs by listing them in the tdg attribute separated by commas, as shown in the example of exa2. Moreover, it is possible to provide options for specific reg64 and TDG combinations with "+", which will be transferred to the snq_tdg.txt. An application for such an option will be shown in the next section. Both TEs and TDGs have names, and with the help of these names they are matched to each other. There has to be a TE for each TDG, but a TE does not require the definition of a TDG. TDGs support the special %n and %i syntax exactly like the TEs, as defined in chapter 2.4.9. The SNQ compiler (SNQC), which will be detailed in a later section, uses the snq_te.txt and snq_tdg.txt to generate the microcode for the SNQ.


3.4 Functionality

The main responsibility of the SNQ is to react on changes in the RF, read out one or more registers (reg64) and write them to the main memory of the host system, as outlined in the motivation section. However, it quickly became clear that more has to be done for an efficient solution, and of course more interesting features can be implemented to improve the overall architecture. These features are:
• interrupts
• reset of counters
• write back support for error bits
• flexibility
• possibility to trigger from software
• remote functionality

It is not enough that the SNQ can write data to the main memory, because then the software would have to poll the memory to become aware of messages from the SNQ; the SNQ also has to send interrupts. Big designs like EXTOLL have many counters, and when they are read out it can make sense to reset them directly to their initial value instead of waiting for the software running on the host system to do that. How this has to be done depends on the configuration of the RF. It is either possible to directly write zero to the counter, or the special rreinit feature (chapter 2.4.2) can be used to reset multiple counters at once. Therefore, the SNQ has to be able to write to the RF in two ways: either by writing zero to a counter directly after reading it, or by writing to the address of the rreinit element at a later time. The first case is supported with a TDG option; attaching "+W0" to the TDG will overwrite the reg64 with zero after reading it out. The SNQ also provides functionality to write rreinit registers. In the ASIC version of EXTOLL there are many SRAMs, which all feature outputs to indicate SECs (single error corrected) and DEDs (double error detected). Furthermore, there are of course also other error conditions. All these error bits are available in the RF. Therefore, the SNQ can be used to react and forward the error bits to the host system. However, the error bits also have to be reset so that they can indicate the condition again. Especially in the case of SECs this is interesting, because thanks to the correction such an error has no direct negative impact, but when it occurs often it is a clear indication that the SRAM is defective. As outlined in chapter 2.4.2, it makes sense to define the registers for error bits or assertions in the following way.
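The referenced listing is likewise missing from this extract; a minimal sketch of such a definition is shown below, where only the register name ecc_errors, the 16 error bits and the sw_write_xor behavior come from the text, and the exact element structure and attribute syntax are assumptions:

    <reg64 name="ecc_errors">
      <hwreg name="sec" width="16" sw_write_xor="true"/>
      <hwreg name="ded" width="16" sw_write_xor="true"/>
    </reg64>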


Thus when the "ecc_errors" register, which contains the error bits for 16 SRAMs, was read the corresponding bits can be set back by writing back exactly the same value due to the sw_write_xor as introduced in chapter 2.4.2. This is achieved with the help of the "+WB" options, the SNQ will write back the register as soon at it did read it. It has also been established that the SNQ will perform certain actions based on events that are triggered by changes in the RF, but especially in an ASIC it can be interesting to be able to change what actions are performed. Therefore, the SNQ has to be programmable to pro- vide the necessary flexibility. Using the Ventoux board, it takes around 256 ns for the software to read a single value from the RF. However, due to jitter on the host system this value can increase significantly. When values like multiple counters have a correlation then these big gaps are negative. The SNQ can be used to keep the time between the single reads lower and reduce the jitter in the read- out process. Therefore a way to activate the operation of the SNQ from software is of advantage. EXTOLL network devices can also be used without a host node, then it would be good if the SNQ could be used. In the end it is even more important to have a way to push status information instead of reading it register by register, because the management software will run on a remote host and will therefore have a much higher access latency compared to locally running management software.

3.5 Architecture

The first question that has to be answered for the SNQ is the design of the architecture. A method to integrate it with the rest of the EXTOLL design has to be found. Accordingly, the interfaces have to be defined and a decision regarding the place inside the EXTOLL hierarchy has to be made. The place in the hierarchy also influences the interfacing to a certain degree. From a high level view the functionality of the SNQ is: a register in the RF changes and the SNQ finds out about this change. As a result the OS of the host system is informed not only about the fact that a register has changed, but also about its content.


The method by which the SNQ finds out about changed registers is already defined. In short, the RF provides a bus with one signal for each possible event; the signals are pulsed for a single cycle in case of such a change. These signals do not necessarily refer to the change of a specific register; they inform about the change of any register that is part of the corresponding TE. The question is how the SNQ is going to acquire the actual content of registers in the RF and inform the software. This brings up the question of how the process is controlled. First it has to be explained what types of data have to be transferred; in order to operate, all sources and destinations have to be known. The general flow consists of five separate steps, which are also labeled in figure 3-3.

Figure 3-3: Schematic Overview

• (1) trigger signal • (2) RF read request • (3) RF read response • (4) write to main memory • (5) trigger interrupt in the host system

A change in a register inside the RF causes a TE; this information is transported (1) over a data path to the SNQ. Based on the event a read to the RF is issued (2), and the RF will answer this read request (3). Of course, multiple reads can happen based on one TE. The SNQ will write the information from the RF to a ring buffer in the main memory (4) of the host system, which will be detailed in chapter 3.6.5. Finally, an interrupt to the host system has to be issued (5) so that the OS kernel is able to take actions based on the information that it received from the SNQ.


Hence, the SNQ has to be connected to the RF and the HTAX crossbar [38] so that it can not only read from the RF, but is also able to write to the main memory. The overall design of EXTOLL should be kept in mind; it is centered on the central HTAX crossbar, which makes use of the HToC protocol. Furthermore, the possibility to trigger interrupts in the host system is required. One consideration is also the latency of RF reads done by the SNQ: because these reads are triggered by a condition change, it is desirable to be able to read the value before it changes again. In figure 3-4 three possible places for the SNQ are presented.

Figure 3-4: Possible Locations for the SNQ

One possibility (A) would be to connect the SNQ with its own HTAX port directly to the HTAX crossbar. The clear advantage is that this would be a clean design with clean interfaces, because other modules also use their own HTAX interface. This would provide the SNQ not only with read and write access to the RF, but also to the main memory. Consequently, no changes would be required to any other unit; the HTAX crossbar is generated by a script, and changing the number of ports is virtually no effort. However, the biggest problem with this approach is that the resource requirements of the HTAX crossbar grow, according to [48], quadratically with the number of ports. Possibility (B) would involve an extra read path in the RF. This would have the big advantage that the interface does not have to be shared, and the time between the change of a register and the moment it is read out could be smaller compared to (A). Unfortunately, this is not an option, because it would require a lot of resources to implement a complete second read path within the RF; a limited read path, which can only access a subset of the registers, would limit the flexibility.


Option (C) would make use of the HT to RF converter to interface the SNQ with the RF and the HTAX. The converter already features an interface for the debug port, and this interface can be used for the SNQ as well. Furthermore, it already has access to the HTAX. Therefore, it can be enhanced to implement a feature that allows the SNQ to access the HTAX crossbar. These three possibilities are summarized in figure 3-5.

Figure 3-5: Design Space: SNQ Connection Possibilities

Ultimately, (C) was chosen, because it provides the best balance between RF access, HTAX access and resource requirements.

3.5.1 Control Logic

There are two main alternatives for the implementation of a control logic that is able to fulfill the previously outlined tasks. The design space is shown in figure 3-6.

Figure 3-6: Design Space: Type of SNQ Control


A FSM, which can be designed and generated with the FSMdesigner [142], is not an option, because there is no way to encode the necessary RF addresses into the FSM. Furthermore, a FSM that is implemented in Verilog HDL would not allow changing the functionality of the SNQ at runtime and would thereby provide no flexibility. Thus, either an already existing general purpose CPU can be used or a specially tailored microcode engine can be implemented. The following features are desirable in both cases:
• technology agnostic
• high frequency
The implementation has to be technology agnostic, because it has to work on different FPGAs and on an ASIC. Furthermore, it is important that the control logic can run at the same relatively high frequency as the rest of the EXTOLL design. If this were not the case, it would be necessary to use an extra clock domain, and this has several disadvantages:
• Slow: A safe way to cross a clock domain is to use an asynchronous FIFO, but this approach has a latency of six cycles.
• Resources: Extra resources are required, because asynchronous FIFOs need a certain amount of registers and logic.
• Complexity: The added complexity due to the crossing of clock domains increases the probability of implementation errors.
• Extra clock domain: Clock domains are a limited resource in FPGAs and complex to implement in ASICs.

A lower frequency for the control logic would have the advantage of a lower power usage. However, the control logic will be inactive most of the time. Hence, clock gating [155] should be able to reduce the power usage significantly in the ASIC implementation. Clock gating deactivates the clock for idle parts of a design to save power. In short, the control logic should be able to run at the same clock frequency as the rest of the design.

General Purpose Processors

General purpose processors for FPGAs are often called soft cores, and multiple different ones exist, as can be seen in an overview like the one presented in [8]. The following architectures were reviewed:
• Xilinx MicroBlaze
• LatticeMico32
• Hephaestus


The MicroBlaze [16] [24] is a 32 bit soft core, a fully functional processor that can even be used to run Linux. This soft core is also widely used in academic research; for example, in [18] it was used to conduct research that combines general purpose soft cores with specific logic to improve performance and power efficiency. It is easy to obtain information and examples to aid development, because MicroBlaze is used in a lot of places. Furthermore, there is wide software support. MicroBlaze is, for example, an officially supported GNU Compiler Collection (GCC) target architecture [20]. LatticeMico32 [118] is also a CPU with a 32 bit data and instruction pipeline. According to [143] it is used in different system-on-chip (SOC) projects. It is also supported by GCC. In [74] the Hephaestus architecture is described, an architecture that makes it possible to configure custom application specific processors that can use multiple execution units. Such a complex architecture goes beyond the requirements of the SNQ. The biggest problem with MicroBlaze is that its license only allows the usage on Xilinx FPGAs. On this platform there are no license costs [17]. However, although Xilinx is the first target platform of EXTOLL, the design is designated to be implemented on other FPGAs and especially on an ASIC. Being able to use MicroBlaze on these target technologies would require that Xilinx provides a license. No attempt was made to obtain such a license, because success was unlikely. The Nios II [144] from Altera shares the same problem. The LatticeMico32 does not share the licensing problem and could be used. However, in the end the decision was to not use a general purpose processor, because the rest of the RF is 64 bit wide, and this width of data has to be transported from the RF to the main memory. This does not prevent the use of the LatticeMico32, but to be able to handle the task efficiently, either the core would have to be modified or an extra unit that handles the actual data would have to be connected to the core over the slow Wishbone interface. Modifying the soft core could prevent binary compatibility with the standard version of the core, and the advantage of having a C compiler would be lost. Attaching an extra unit for the data handling would introduce extra latency and complexity.

Microcode Engine

A microcode engine is a hardware unit that allows implementing programmable control logic.


Figure 3-7 shows a very simple variant of such an engine. It reads instructions from a program memory (or store), and a part of the instruction word is directly used to control logic. Another part of the instruction word is used in the execute logic to control the further flow.

Figure 3-7: Simple Microcode Engine

One main variation is whether a microcode engine uses vertical or horizontal encoding. Both variants are depicted in figure 3-8. The vertical variant can require multiple microcode words to produce one complete control vector for the hardware; depending on the decoder output, the microcode is used for different parts of the hardware. A horizontal version, on the other hand, has the complete control vector in a single microcode word. Extensive details about microprogramming can be found in [156].

Figure 3-8: Vertical and Horizontal Instruction Format

For the microcode engine of the SNQ a horizontal approach has been chosen, because a 64 bit wide program memory is big enough to contain one instruction and all necessary operands in one entry. Thus, the implementation uses a fixed size instruction format, and no further memory reads are required to look up addresses or other information. This allows fast execution, which was one of the reasons to design the SNQ, and it also reduces complexity.


The program memory is 64 bit wide and will be implemented as SRAM. In Xilinx Virtex6 FPGAs there are Block RAMs available [9] that are 72 bit wide and have 512 entries; this size is sufficient for most SNQ programs. Other high-performance FPGAs provide similar hardware components. In the case of an ASIC, 64 bit wide SRAMs are of course also available. Therefore the decision was to use a 64 bit wide instruction format. The disadvantage of the horizontal instruction format is of course that a part of the program memory is wasted, because not all instructions need all operands. More details about the microcode engine of the SNQ will be given in chapter 3.6.3.

3.6 Hardware Environment and Components

As a result of the discussion in the last section, the SNQ finds its place in the overall architecture as shown in figure 3-9.

Figure 3-9: SNQ Environment Overview

The SNQ performs reads and writes on the RF through the RF arbiter (A), which allows the SNQ and the debug port to access the RF. The sending of HToC packets to the HTAX crossbar is done over an interface (B) to the HT to RF converter. The snq_rf within the SNQ is used to control the SNQ and to read out status information; this RF is connected (C) to the main RF. Reacting on TEs is the main task of the SNQ, and the trigger bus (D) transports this information from the main RF to the SNQ.


The architecture of the SNQ has been chosen to be as simple as possible. This reduces the resource requirements and the probability of bugs during the development. Requirements had to be kept low not only in terms of hardware resources, but also in terms of development time. In figure 3-10 the most important components of the SNQ are shown.

Figure 3-10: SNQ Control Flow

The arbiter decides which event should be processed next. Based on this decision the microcode engine (ME) will start executing the corresponding program. The program controls the RF handler and the HT send unit. In a typical operation the RF handler will read registers from the RF and write the data into the prepare store (PS). The PS consists of seven 64 bit registers, PS0 to PS6. As soon as the RF handler is done, the microcode engine will instruct the HT send unit to prepare an HToC packet and forward it to the HT to RF converter.

3.6.1 HyperTransport on Chip to RF Converter

The converter unit was already introduced in chapter 2.6.1, but it has been extended for the SNQ. The extensions are shown in figure 3-11. There is an RF arbiter that allows not only the debug port, but also the SNQ to access the RF interface of the HT to RF converter. Furthermore, an extra interface was added to the HT to RF converter that allows sending packets to the HTAX crossbar. This is a FIFO interface: from the SNQ to the converter there is a data path that is 64 or 128 bits wide, depending on the width of the HTAX crossbar, plus an empty signal. The SNQ stores complete HToC packets, consisting of header and payload, into the FIFO.


Figure 3-11: Extended HT to RF Converter

As soon as the empty signal indicates that the FIFO is no longer empty, the converter starts reading the data. The converter interprets the header of the packet to send it to the correct destination port and VC on the HTAX crossbar. Moreover, it also reads the packet size from the header and shifts out the corresponding amount of data from the FIFO in the SNQ.

3.6.2 Trigger Arbiter and Control

As explained before, the trigger bus, which carries the information about the TEs, is provided by the RF to the SNQ. The bits of this bus are pulsed for a single cycle, and an arbitrary number of bits in the trigger bus can be set at the same time. Therefore, the arbiter unit has to arbitrate between the different possible TEs and store the remaining events. Moreover, the arbiter also has to provide methods to control which TEs are allowed to cause the execution of microcode in the SNQ, and a way to trigger the execution of microcode from the host system. Each bit on the trigger bus is accompanied by control registers, as depicted in figure 3-12. These registers can be used to control the behavior of each trigger. There has to be a trigger control (TC) in the snq_rf for each event that is provided by the RF, but there can be more TCs than bits on the trigger bus. The right side of figure 3-12 shows a small example. There is a small naming problem, because the events that are emitted from the RF on the trigger bus are called TEs in the XML of the RF, but the events within the SNQ are also called TEs. The only difference is that the TEs within the SNQ are filtered by the trigger control logic, and extra TEs, which can only be executed by software, can also exist within the SNQ.


Figure 3-12: Trigger Control Logic

(In the example of the figure, m = 2 events, eventA and eventB, come from the RF, while n = 4 TCs exist, so that the SNQ sees the TEs eventA, eventB, TE-2 and TE-3.)

TEs have names that are defined in the XML of the RF, as introduced in chapter 2.4.9. When there are more TCs than events specified in the XML, these extra TEs are named TE-$number. The extra TEs can be used to start other SNQ programs from the host system with the trigger_now functionality, which will be introduced in this section. The other TC control bits do not matter for these extra TEs, because there is no trigger bus input bit for them, and it is therefore assumed to be always 1'b0. Each TC consists of multiple control bits:
• trigger_enable
• trigger_now
• trigger_once_enable
• trigger_once

Figure 3-13 shows the control logic for a single TE. By default there are 32 TCs, and the BTS arbiter in the figure thus performs arbitration between their outputs (E). Trigger_enable enables the TE in general: each time there is an event from the RF, it will be used. When the trigger_enable, which is shown at (A) in figure 3-13, is not set, there will be no event at all for the TE. It is not always the right behavior to have a TE each time there is an event from the RF, because the result could be a tremendous amount of writes to the main memory. Some TEs should only happen once and have to be activated again by software. For this reason there is the trigger_once_enable: as soon as it is set to 1'b1, the events from the RF no longer trigger directly, due to the "and" gates at (B) in figure 3-13. Instead, the trigger_once bit of that particular TE has to be written. It allows exactly one TE, and then it has to be activated again. The register once_outstanding (C) stores the information that a “once” event is outstanding; it is set back to 1'b0 when such an event has occurred.



Figure 3-13: Schematic of the Trigger Control Logic (for clarity, the figure shows only a single trigger bit)

In case the software intends to directly trigger a TE, it can write the corresponding trigger_now bit. This triggers each time the bit is written, provided the trigger_enable is set. The outstanding (D) register stores the information about outstanding triggers that still have to be handled by the SNQ. The decision which trigger signal should be selected next for execution is made with the help of an efficient binary search tree (BTS) arbiter. This arbiter is created with the help of a generator script [114]. In summary, to use a TE the corresponding trigger_enable bit has to be set. If the TE should happen only one time, then the corresponding trigger_once_enable and trigger_once bits have to be enabled additionally.
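To make this description concrete, the behavior of a single TC can be sketched in Verilog as follows; the register names are taken from the text and figure 3-13 where possible, but the module itself and the exact gating are simplified assumptions rather than the actual implementation:

    // Behavioral sketch of one trigger control bit (cf. figure 3-13).
    // trigger_enable and trigger_once_enable come from the snq_rf;
    // once_written/now_written pulse when software writes the
    // trigger_once/trigger_now bits.
    module trigger_control_bit (
        input  wire clk,
        input  wire rst,
        input  wire rf_event,            // pulse from the trigger bus
        input  wire trigger_enable,
        input  wire trigger_once_enable,
        input  wire once_written,        // software wrote trigger_once
        input  wire now_written,         // software wrote trigger_now
        input  wire taken,               // arbiter selected this TE
        output wire to_arbiter
    );
        reg once_outstanding;  // a "once" event is still allowed
        reg outstanding;       // an event waits to be handled

        // An RF event triggers directly when once-mode is off; in
        // once-mode it triggers only while once_outstanding is set.
        // trigger_now fires whenever trigger_enable is set.
        wire event_ok = rf_event &
                        (~trigger_once_enable | once_outstanding);
        wire fire     = trigger_enable & (event_ok | now_written);

        always @(posedge clk) begin
            if (rst) begin
                once_outstanding <= 1'b0;
                outstanding      <= 1'b0;
            end else begin
                if (once_written)
                    once_outstanding <= 1'b1;
                else if (rf_event & trigger_once_enable & once_outstanding)
                    once_outstanding <= 1'b0;  // consumed by one event

                if (fire)       outstanding <= 1'b1;
                else if (taken) outstanding <= 1'b0;
            end
        end

        assign to_arbiter = outstanding;
    endmodule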


3.6.3 Microcode Engine

The task of the microcode engine (ME) is to control the RF handler and the HT send unit with small programs that are started based on the arbitration decision of the arbiter.

Figure 3-14: Microcode Engine

This simple design was chosen because the number of events is expected to be relatively small. It uses register pipelining to relax the timing, as shown in figure 3-14, but it does not use the pipelining to execute multiple instructions in consecutive cycles. That would not improve the performance significantly, because most commands require one or more reads from the RF, and reads are, with at least five cycles, slow. Most reads in the EXTOLL design take nine cycles, due to the RF hierarchy. In figure 3-14 it can also be seen that the output of the program memory is not directly used to drive the RF handler or HT send unit, because these units are driven by output registers of the execute logic. Therefore, it is possible to disable the output of the program memory SRAM while the ME waits for the completion from the RF handler or HT send unit, to save power. After a cold reset the content of the SRAM is unknown; therefore by default the ME is disabled until the microcode load has been performed by writing the program code and setting the enable bit. This is done with the help of the SNQ RF. As soon as the ME is enabled it will start executing code. By default it starts with the last instruction in the program memory. It would have been logical to start executing at address 0x0, but that is the jump destination for the first TE; therefore the last address was chosen. By using the number of the TE (1) directly to jump into the respective code for the event, neither a calculation has to be made nor is an extra jump table necessary.


The instructions that control the RF handler or the HT send unit produce output on the data paths (2) to the respective unit. As soon as these units have completed their task, they use the done (3) signal to inform the ME. The ME will then load the next instruction from the program memory. The program flow is described in chapter 3.7.1 on page 123. At some point it may be necessary to replace the microcode at runtime. This is possible by disabling the engine, replacing the microcode, setting the program counter and enabling the engine again. The program counter can be read through the RF. It is also possible to set it while the ME is disabled. The best method is to disable all trigger events and wait until the ME is at the start instruction; then the code can be replaced without the danger of any hazard.

3.6.4 Register File Handler

The RF handler in figure 3-15 is responsible for reading from the RF and writing to the prepare store (PS); furthermore, it can write to the RF.

Figure 3-15: Register File Handler

Therefore, it provides an interface for the ME. The interface uses a cmd_valid signal to indicate that all signals are valid, and the ME keeps them static until the RF handler indicates with the done signal that the task has been completed. The RF handler can be used in a read and a write mode. The read mode is activated by setting the read input to 1'b1. In both modes the address and count inputs describe the RF address and the number of additional repetitions that should be performed. Thus, it is possible to repeat a read or write up to eight times.


In the read mode the dest_address input selects the 64 bit destination register in the PS. The RF address and the PS destination address are increased for the additional reads that can be initiated by setting the count input to a value unequal to zero. The RF address is not increased when the FIFO mode is active. The 64 bit write_data register contains the value of the last read. This register can be used either for write-back or, in another activation of the RF handler in write mode, its content can be written to an arbitrary RF address. For optimization purposes there is the all_reads_zero signal. It is valid when the done signal is set, and it is 1'b1 when all reads of the last command returned zero. The write_value input is used in the read mode for the write-back operation; in the write mode it defines which value will be written. The usage of these is detailed in the subsections that describe the LOAD_PS (chapter 3.7.3) and WRITE_RF (chapter 3.7.6) instructions.
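As an illustration of this handshake, the following sketch shows a driver that issues one burst read of four consecutive reg64s into PS0 to PS3; the interface signal names follow figure 3-15 and the text, while the module, the example address and the enclosing control logic are assumptions:

    // Sketch: issue one burst read and hold the command stable until
    // the RF handler raises 'done'. The example RF address is made up.
    module rf_handler_read_example (
        input  wire        clk,
        input  wire        rst,
        input  wire        start,        // begin the burst read
        input  wire        done,         // completion from the RF handler
        output reg         cmd_valid,
        output reg         read,
        output reg  [23:0] address,
        output reg  [2:0]  dest_address,
        output reg  [2:0]  count,
        output reg         fifo_mode
    );
        always @(posedge clk) begin
            if (rst) begin
                cmd_valid <= 1'b0;
            end else if (start && !cmd_valid) begin
                cmd_valid    <= 1'b1;
                read         <= 1'b1;        // read mode
                address      <= 24'h000100;  // first RF address (example)
                dest_address <= 3'd0;        // fill the PS from PS0 upwards
                count        <= 3'd3;        // 3 extra reads -> 4 in total
                fifo_mode    <= 1'b0;        // increment RF address per read
            end else if (done) begin
                cmd_valid <= 1'b0;           // release after completion
            end
            // while cmd_valid is set, all command signals stay static
        end
    endmodule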

3.6.5 HT Send Unit

The HT send unit has to perform writes to the SNQ mailbox (SNQM) and to issue interrupts by sending HToC packets to the HT to RF converter. Furthermore, the unit in figure 3-16 also provides a special RAW mode that allows sending HToC packets with completely user-defined content. This feature can be used, for example, to send data over the EXTOLL network.

Figure 3-16: HT Send Unit


The size of the packet that should be sent has to be provided by the execute unit and is stored in the microcode instructions, because the HT send unit has no information about which parts of the PS are relevant or valid.

System Notification Queue Mailbox

The mailbox is, as shown in figure 3-17, a ring buffer stored in the main memory of the host system. The writes to the main memory are organized in notifications of 64 bytes each, because this is the cache line size on all current x86_64 CPUs and the maximum payload size of HyperTransport packets. The mailbox can have a size of up to 4 MB.

The layout of one 64 byte mailbox entry is:

  0x00  payload_0
  0x08  payload_1
  0x10  payload_2
  0x18  payload_3
  0x20  payload_4
  0x28  payload_5
  0x30  payload_6
  0x38  header (timestamp in bits [39:0], TE number in bits [47:40], PC, flipping bit)

Figure 3-17: SNQ Mailbox Layout

The header is stored in the last quadword to be sure that it is written last. This way it is ensured that the whole packet is in a consistent state after a new SNQ message has been detected via the header. If the header quadword were at the beginning, it could happen that the CPU reads the header, interprets it and reads the second quadword although it has not yet been written. This could happen if the write to the main memory is split up into multiple smaller ones. Although it is hard to imagine why a PCI express component should do this, the PCI express specification allows such behavior, though the order would be kept. PCI express components are PCI express cores, switches and root complexes such as host chipsets. The header contains:
• timestamp: 40 bits
• TE number: 8 bits
• program counter: 12 bits
• flipping bit

The 40 bit timestamp allows the software to get accurate timing regarding the occurrence of an event; the timestamp counter is a global clock, which the EXTOLL design, for example, provides. Furthermore, the header contains the TE number of the program that caused the sending of this notification and additionally the program counter (PC) of the exact instruction that caused it. This allows correlating exactly which data is contained in the notification message, because one TE can cause the sending of multiple notifications. Strictly speaking, the TE number is not necessary, because the information can be derived from the PC, but there was enough space for it in the header. There are usually two methods for the software to check for new entries in such a ring buffer: either the write pointer can be read from the RF of the device, or it can be checked whether the header is unequal to zero. The first approach has the disadvantage that it is necessary to read from the RF, which takes hundreds of ns; the second approach has the disadvantage that the software has to overwrite the header with zero to be able to detect new notifications after the wrap-around. Therefore, the SNQ implements the flipping bit mechanism, which was introduced in [99]. The header contains a single bit that is inverted every time the ring buffer wraps around. The software neither has to read the write pointer from the RF nor write to the memory; it just has to know whether it has to expect a 0 or a 1 at the position of the flipping bit when it checks for new notifications. When the software consumes a notification from the SNQM it has to write the read pointer in the RF to inform the hardware about the consumption of the notification, because the hardware uses the read pointer to ensure that it does not overflow the mailbox. However, not all SNQ notifications require seven quadwords of payload, but the header still has to be in the last quadword, because the software expects it there. To save traffic, the implementation in that case does not start writing from the first quadword of the 64 bytes. Figure 3-18 shows an example of such a notification with only three quadwords of payload. The VELO [114] uses the same approach.
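Assembling the header quadword can be sketched as follows; the timestamp in bits [39:0] and the TE number in bits [47:40] follow figure 3-17, whereas the exact positions of the PC field and the flipping bit are not recoverable from this extract and are therefore assumptions:

    // Sketch of the header quadword assembly (cf. figure 3-17). The PC
    // position [59:48] and the flipping bit at [63] are assumptions.
    module snqm_header_pack (
        input  wire [39:0] tsc,        // global timestamp counter
        input  wire [7:0]  te_number,  // TE that caused the notification
        input  wire [11:0] pc,         // PC of the sending instruction
        input  wire        flip,       // inverts on each wrap-around
        output wire [63:0] header
    );
        assign header = {flip, 3'b000, pc, te_number, tsc};
    endmodule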


  0x00  empty
  0x08  empty
  0x10  empty
  0x18  empty
  0x20  payload_0
  0x28  payload_1
  0x30  payload_2
  0x38  header

Figure 3-18: Partly Filled Mailbox Entry

Multiple notification queues were not considered, because there was no conclusive reason to pursue such a solution. Such a possibility would require more resources, because for each of these queues the base address and the read and write pointers would have to be kept. Furthermore, this would bring up the problem of how to decide which events have to be written to which notification queue. The distribution of the notifications to different threads can be performed by software on the host system.

Interrupts

The SNQ has to be able to use interrupts to get attention from the software on the host system after writing to the SNQM. Two problems had to be solved:
• sending interrupts in a portable and flexible way
• ordering

Interrupts are triggered in different ways, depending on the interface to the host system. In the case of an HT core, as used for the Ventoux board, the interrupt is issued by a posted write. In the case of PCIe cores it seems to be common that a custom message signaled interrupt (MSI) interface is provided. This approach is, for example, followed by the PCI express cores from Xilinx [19], Altera [145] and Lattice [146]. The SNQ has to be able to send interrupts in all environments. Therefore, an approach is used that is very similar to what is done in the case of the HT core. The RF contains three registers for this purpose: irq_0, irq_1 and irq_2. These registers are set by the system software and store the HToC header for a posted write in irq_0 and irq_1, and additionally 64 bit of payload in irq_2. The HT send unit assembles these three registers into an HToC packet and forwards it to the FIFO for the HT to RF converter.


Figure 3-19 shows the entities that are involved when the SNQ issues an interrupt. The SNQ generates an interrupt packet (1). When an HT core is used, nothing has to be done at (2) and (3); the interrupt is a normal posted write and the interrupt handler (4) will handle it in the end.

Figure 3-19: Components Involved in an Interrupt

However, in the case of PCIe the bridge (2) has to filter out interrupts that are sent by the SNQ and operate the extra interface (3) of the respective PCIe core. Now it is clear how interrupts are issued, but not under which circumstances. The obvious possibility is that whenever a notification is written and the interrupt is enabled, an interrupt will be issued; afterwards it has to be enabled again by the kernel driver. However, the problem is that interrupts can pass writes on the way to the host system, and then the interrupt handler runs and cannot find anything in the SNQM. The reason for this is depicted for the PCIe case in figure 3-20, where, from right to left, a unit sends packets over the HTAX to the bridge in the order write1, write2 and, as last packet, an interrupt. Because an extra MSI interrupt interface is used, the bridge has to decompose the in-order stream of packets and forward it over two different interfaces to the PCIe core. The PCIe core is suspected to contain a FIFO for data packets and a packet generator for interrupts internally, and both data streams are merged again. As a result, the order of the writes and the interrupt may change, as shown on the left side of figure 3-20.


Figure 3-20: PCIe Core: Data and Interrupt Path (the content of the PCIe core is usually a black box; the figure shows its suspected structure)

The Altera user guide [145] suggests to use the empty flag of a FIFO in the PCIe core or to wait a certain number of cycles to ensure that the interrupt is sent after a specific packet. Furthermore, this behavior was observed with the Lattice PCIe core during another research project. Consequently, it cannot be assumed that there is ordering between writes and interrupts. Thus, when the interrupt handler assumes that there has to be a notification when there was an interrupt and it waits some time, the question is how long it should wait. This depends on the size of the FIFOs inside the PCIe core, which might be responsible for the lost ordering, and on other factors like the speed of the PCIe interface. However, considering the example from figure 3-20, this does not really help in the end, because the interrupt handler will not know that it has to wait for write2 too. The issue can be solved by reading the SNQM write pointer from the RF, but a RF read takes hundreds of ns. Furthermore, there is another possible hazard, which is depicted in figure 3-21:
1. The SNQ writes two notifications (write1 and write2) and an interrupt; the notification handler handles the notifications and enables the interrupts again with a write to the RF.
2. The enable write and the write3 cross each other on the way.
3. The interrupt handler checks the SNQM before returning and cannot find a new entry.
4. write3 arrives in the main memory and is not handled.
5. Much later the interrupt handler will finally find write3 due to a new interrupt.

This problem can also be avoided with reads to the RF.


Figure 3-21: Interrupt Hazard

Therefore, a different solution is proposed, which informs the hardware for which SNQM entry a new interrupt should be created. With this method only writes to the register file are required, and costly reads can be avoided. The method is called “watermark” interrupts, because the term “level” is already occupied for interrupts. It was used before for a high speed flash device, which exhibited the ordering problem with the Lattice PCIe core. This type of interrupt is very simple and follows two rules (see the sketch below):
• Watermark interrupts are enabled by a write to the corresponding register.
• Watermark interrupts are not edge triggered by the sending of notifications, but only by the current value of the SNQM write pointer. An interrupt is issued as soon as the SNQM write pointer is greater than or equal to the watermark register.
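In hardware these two rules reduce to a level comparison plus a re-arming flag; a minimal sketch, with all module and signal names assumed, is:

    // Sketch of the watermark rule: an interrupt is pending whenever the
    // SNQM write pointer has reached the watermark. Writing the
    // watermark register re-arms the interrupt; names are assumptions.
    module watermark_irq (
        input  wire        clk,
        input  wire        rst,
        input  wire [15:0] snqm_write_pointer,
        input  wire [15:0] watermark,          // set by software via the RF
        input  wire        watermark_written,  // sw_written pulse from the RF
        input  wire        irq_sent,           // HT send unit issued the IRQ
        output reg         irq_pending
    );
        // Level condition, not edge triggered by notifications: the
        // current write pointer alone decides whether an interrupt is due.
        wire level = (snqm_write_pointer >= watermark);

        reg armed;
        always @(posedge clk) begin
            if (rst)                    armed <= 1'b0;
            else if (watermark_written) armed <= 1'b1;  // (re-)enable
            else if (irq_sent)          armed <= 1'b0;  // one IRQ per arming

            irq_pending <= armed & level;
        end
    endmodule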

This mechanism solves both problems that were introduced in figure 3-20 and figure 3-21. In the former case the interrupt handler would, after finding no new notification in the mailbox, activate interrupts again by writing the very same value to the watermark register. For this application the sw_written feature, introduced in chapter 2.4.2, has been added to RFS. The hardware would then directly issue a new interrupt, because write1 and write2 were already done.
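As a minimal sketch of the driver-side logic, the following C fragment shows how a handler could consume notifications and re-arm the watermark interrupt. The accessors (snqm_entry_valid, handle_notification, snq_write_watermark) are hypothetical names chosen for illustration, not the actual driver API.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical accessors -- the real driver API may differ. */
extern bool snqm_entry_valid(uint32_t index);    /* entry already in memory? */
extern void handle_notification(uint32_t index); /* consume one SNQM entry   */
extern void snq_write_watermark(uint32_t index); /* RF write, re-arms intr   */

static uint32_t next_entry; /* next SNQM index the driver expects */

void snq_interrupt_handler(void)
{
    /* Consume every notification that has already arrived in main memory. */
    while (snqm_entry_valid(next_entry))
        handle_notification(next_entry++);

    /* Re-arm with a single RF write: the hardware compares the SNQM write
     * pointer against this value (level, not edge), so a write that raced
     * past the previous interrupt immediately raises a new interrupt. */
    snq_write_watermark(next_entry);
}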


The crossing (2) in figure 3-21 is thus also solved, because at (1) the interrupt handler would write the "3" for write3 to the watermark register and the SNQ would send the interrupt. The simplest method is that the software always sets the watermark register to the next expected entry after reading the currently available notifications; this is the most useful method for the SNQ. It is also possible to use a bigger offset when the software, for example, expects a surge of notifications. By doing so, unnecessary interrupts can be avoided. However, this makes more sense for other types of devices where the software submits lots of commands and waits for lots of notifications that indicate the completion of the commands.

RAW Mode
In the raw mode the HT send unit forwards the content of the PS directly as HToC packet to the HT to RF converter. This mode allows sending arbitrary HToC packets from the SNQ. It can, for example, be used to send packets from the SNQ to the EXTOLL network by using the VELO unit. This will be detailed in chapter 3.9.3.

3.6.6 SNQ Register File
The SNQ itself is also controlled and configured by a RF. It contains the following main functions:
• microcode engine program and enable
• trigger control
• SNQ mailbox addresses
• interrupt control registers

3.7 Microcode Engine Instruction Set
The instructions for the SNQ were designed with simplicity and speed in mind. All instructions are 64 bit wide, as shown in figure 3-22, because this width gives enough room for all functionality.

Figure 3-22: General Instruction Format (4 bit OP code, 60 bit OP-code dependent part)


It is expected that in most cases only a single register has to be read from the RF and written to the SNQM; this functionality can be implemented with a single LOAD_PS (LOAD Prepare Store) instruction. The following instructions are supported by the ME and distinguished by their OP code:
• JUMP_ARBITER
• JUMP
• LOAD_PS
• WRITE_SNQM
• HT_RAW
• WRITE_RF

The instructions will be detailed in the following pages. All instructions except the JUMP_ARBITER instruction include an explicit jump_destination. The jump_destination is the address of the instruction that should be executed as soon as the current instruction is finished. Thus, the only instruction that performs branching is JUMP_ARBITER.
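As an illustration, the common part of an instruction word can be packed as in the following C helper; the bit positions (OP code in bits [3:0], jump_destination in bits [15:4]) are inferred from the format figures below and should be treated as assumptions, not as the authoritative encoding.

#include <stdint.h>

/* Assumed layout of the common fields: bits [3:0] hold the OP code,
 * bits [15:4] hold jump_destination[11:0]; the remaining 48 bits are
 * OP-code dependent (see figures 3-23 to 3-28). */
static inline uint64_t snq_instr(uint8_t opcode, uint16_t jump_dest,
                                 uint64_t op_specific)
{
    return  (uint64_t)(opcode & 0xF)
         | ((uint64_t)(jump_dest & 0xFFF) << 4)
         |  (op_specific << 16);
}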

3.7.1 JUMP Arbiter
The JUMP_ARBITER instruction in figure 3-23 waits for a TE from the arbiter; as soon as there is a valid trigger, it uses the trigger number and jumps to this address in the microcode. The trigger number is also stored to a register to produce the header of the notification. This is the only instruction that does not have a jump destination in the instruction word.

Figure 3-23: SNQ Instruction: JUMP_ARBITER (OP code 4'b0001; all other fields reserved)

Thus, the JUMP_ARBITER is the first instruction that should be executed by the microcode engine and the one where it waits for new events. Table 3-1 shows how the program flow works. The microcode starts at address 0xFF and waits for a valid TE from the arbiter. The first instruction for each of the 32 TEs in this example is aligned to the addresses from 0x0 to 0x1F.


TE-0 is the simplest example: the JUMP_ARBITER jumps to 0x0, the single instruction is executed, and then the program counter is set back to 0xFF. The case of TE-1 is more complex, because it consists of three instruction words. The first one is at 0x1, and the two others are placed in this example at 0x20 and 0x21, because these are the first free microcode locations after the code for the 32 TEs. Thus, in the case of TE-1 the ME will execute the code at the microcode locations 0x1, 0x20 and 0x21. Then it will return to 0xFF and wait for further events.

address   description          jump_destination
0x0       code for TE-0        0xFF
0x1       code for TE-1        0x20
0x2       code for TE-2        0xFF
...       ...                  ...
0x1F      code for TE-31       0xFF
0x20      more code for TE-1   0x21
0x21      more code for TE-1   0xFF
...       ...                  ...
0xFF      JUMP_ARBITER         -

Table 3-1: Program Flow
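The program flow of table 3-1 can be pictured as the following behavioral C model. This is a sketch of the dispatch logic only, not the actual RTL; wait_for_trigger and execute are placeholder functions.

#include <stdint.h>

#define JUMP_ARBITER_ADDR 0xFF

struct snq_instr {
    uint8_t  opcode;
    uint16_t jump_destination; /* next microcode address */
    /* ... OP-code dependent fields ... */
};

extern uint8_t wait_for_trigger(void);             /* blocks; returns TE 0..31 */
extern void    execute(const struct snq_instr *i); /* LOAD_PS, WRITE_SNQM, ... */

void me_model(const struct snq_instr mem[256])
{
    uint16_t pc = JUMP_ARBITER_ADDR;
    for (;;) {
        if (pc == JUMP_ARBITER_ADDR) {
            pc = wait_for_trigger();   /* TE number doubles as jump target */
        } else {
            execute(&mem[pc]);
            pc = mem[pc].jump_destination;
        }
    }
}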

3.7.2 JUMP
The unconditional jump instruction in figure 3-24 exists only for the case that there is a TE but no code is provided for it; this instruction can then be used to jump back to the JUMP_ARBITER instruction. Without it, the SNQ could get into an unknown state and stop working if the program memory were in an unknown state. The jump_destination is the address the microcode engine jumps to after the current instruction has finished. All following microcode instructions also have this operand.


Figure 3-24: SNQ Instruction: JUMP (OP code 4'b0100; jump_destination[11:0], all other fields reserved)

3.7.3 LOAD Prepare Store
The main task of this instruction is to read one or more 64 bit values from the RF and write them to the PS. It provides several operands, as illustrated in figure 3-25. For the read operation the following operands have to be set:
• dest_address: first destination address within the PS
• count: number of consecutive reads to be made
• src_address: 24 bit address within the RF; this is a quadword address, which means that it is actually a 27 bit address whose last 3 bits are always zero. Therefore, RFs with a size of up to 128 MB are supported.

Figure 3-25: SNQ Instruction: LOAD_PS (OP code 4'b0011; jump_destination[11:0], dest_address[2:0], count[2:0], src_address[23:0], write-back mode WB[1:0], FIFO mode FM, force interrupt FI, write SNQM WS, non-zero NZ, HT_size[2:0])

Additionally, it is possible to make use of the write-back mode, which will directly write to the RF after each read. The value that is written depends on the WB operand:
• 2'b00: no write-back
• 2'b01: write 0
• 2'b10: all bits set to 1
• 2'b11: write the same value back if the read is unequal to zero


The write-back modes allow clearing registers in an efficient way, as described in chapter 3.4. It is not possible to combine any write-back mode with the FIFO mode (FM), which allows performing multiple reads without increasing the src_address. This mode allows reading out error logging FIFOs that are connected to the RF, as introduced in chapter 2.4.5. By setting the write SNQM (WS) bit, a write to the mailbox is performed after the reads from the RF have completed. The HT_size has to be the complete size of the packet including the SNQ header; counting starts at zero to save coding space. Thus, an HT_size of 3'h7 will cause a notification that includes the header and all seven PS registers. The HT_size cannot always be derived from the count, because it is often necessary to perform multiple LOAD_PS instructions to load all necessary values into the PS when they are not at consecutive addresses. The last of these LOAD_PS instructions can then perform the SNQM write. Setting the force interrupt (FI) bit causes an interrupt independent of the watermark interrupt mechanism. The NZ bit is a special bit that can be used to prevent writes to the SNQM that have no interesting content: it suppresses the notification if all reads from the RF returned zero.

3.7.4 Write to SNQ Mailbox
The instruction in figure 3-26 offers a method to write to the SNQM without reading to the PS first. The HT_size is the number of quadwords the SNQ sends to the SNQM including the header, but counting starts at zero. Thus an HT_size of 3'd0 will write only the header into the SNQM. This is the main application for this instruction, because the LOAD_PS integrates the WRITE_SNQM functionality to save instruction words.

Figure 3-26: SNQ Instruction: WRITE_SNQM (OP code 4'b0110; jump_destination[11:0], force interrupt FI, HT_size[2:0])


3.7.5 Send RAW HToC Packet
This instruction, as depicted in figure 3-27, sends arbitrary HToC packets. The packet has to be prepared in the PS with the LOAD_PS instruction; the HT_RAW instruction can then be used to forward it over the HT to RF converter to the HTAX crossbar. The converter will read the destination port and virtual channel (VC) from the packet header.

Figure 3-27: SNQ Instruction: HT_RAW (OP code 4'b0010; jump_destination[11:0], append header AH, HT_size[2:0])

HT_size is the number of quadwords that should be sent. With the AH bit set, the standard notification header is appended at the end of the RAW packet by the HT send unit. This feature is very helpful for sending remote notifications over VELO, as detailed in chapter 3.9.3.

3.7.6 Write RF
To write to the RF, the instruction in figure 3-28 can be used.

Figure 3-28: SNQ Instruction: WRITE_RF (OP code 4'b0101; jump_destination[11:0], dest_address[23:0], write mode wm[1:0])

The value that is written to the dest_address in the RF depends on the write mode:
• 2'b01: write 0
• 2'b10: all bits set to 1
• 2'b11: write the value from the write_data register in the RF handler


The first possibility could be used, for example, to set a counter to zero, and the second one could be used to set enable bits. However, there are also cases where arbitrary values have to be written to the RF; the problem is that the instruction word is not big enough to embed a 64 bit value as operand. Consequently, two instructions have to be used in that case. First a LOAD_PS has to be executed that reads the arbitrary value from the RF. The value could either be stored in unused segments of the SNQ program memory or at other locations in the RF. The value will then be in the write_data register of the RF handler unit and can be written with the WRITE_RF instruction. The LOAD_PS instruction will of course also write to the PS, but this is not relevant in this case.

3.8 Microcode Generation
The microcode that runs in the SNQ is produced by a small compiler named SNQ Compiler (SNQC). At first it was unclear whether the microcode should be generated directly by RFS or by an extra tool. To have a logical separation, it was decided to pursue the microcode generation in an extra tool: the register file and the trigger bus are implemented in actual hardware and therefore cannot be changed easily, whereas the TDGs and extra code used in the SNQ can be changed later on and also at runtime.

3.8.1 Input Files
Three different inputs are used to compile the binary microcode output. Two of these files, namely snq_te.txt and snq_tdg.txt, are automatically generated by RFS. Automatically generated files should obviously not be changed by the user, because otherwise the modification would have to be redone every time the file is regenerated. For this reason a third file (extra.asm) exists, which can be used to overwrite and extend the auto generated code. The complete flow is shown in figure 3-29. All input files were designed and defined with simplicity in mind, so that they can be produced and parsed without unnecessarily big development effort. At the end of this section an example will be shown that illustrates how the different components work together to generate the microcode.

snq_te.txt
This file contains the trigger events that were generated by RFS; the format is very basic. Each line in the file represents a trigger and contains the number of the trigger and the name of the trigger.
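A hypothetical snq_te.txt with three triggers could therefore look as follows; velo_intr is a trigger name that appears later in this chapter, the other names are invented for illustration.

0 velo_intr
1 rma_intr
2 link_error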


Figure 3-29: Microcode Generation Flow (RFS generates snq_te.txt and snq_tdg.txt; together with extra.asm they are compiled by snqc into the microcode)

snq_tdg.txt
This file contains the information about which registers belong to a trigger event. Each line starts with the name of the trigger and is followed by a list of register addresses that must be stored to the SNQM.

... Each address can be followed by an option that is separated with a "+". The SNQC supports the following options:
• W0: write zero to the register after reading it
• W1: write 64 bit with all bits set to one to the register
• WB: write back exactly the same value that was read, if it is not zero
• WBNZ: like WB, but the notification is written only if at least one value that was read is unequal to zero
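Under these rules, a hypothetical snq_tdg.txt line for the invented link_error trigger could look like this; the register addresses and options are chosen purely for illustration.

link_error 0x0800 0x0808+W0 0x0810+WBNZ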

Extra Assembler Code - extra.asm
It is necessary to have a method to supply extra code to the SNQC in addition to the auto generated code. The extra assembler code allows extending and replacing the programs that are auto generated. There can be extra TEs in the SNQ that have no representation in the RF, and the extra code can be used to implement functionality for these. The SNQ instruction set offers more features than just storing registers, and the extra code is the way to provide these instructions to the SNQ. Furthermore, there can be TEs in the RF that have no TDGs defined; in that case it also has to be possible to implement functionality.


The code consists of blocks. Each code block starts with a colon and a label:

:<TE name> [option]

The TE name is either as given in the TE attributes in the XML for RFS or TE-<number>; the number is also the bit number in the TC registers of the SNQ. Code blocks without option can only be used when there are no TDG elements that match the TE. Otherwise, an option has to be supplied, which controls when this code block is executed relative to the code that is auto generated based on the TDGs:
• pre: before the auto generated code
• replace: completely replaces the auto generated code
• post: after the auto generated code

The end of a code block has to be marked with a single line that contains only two colons:

::

Comments using two slashes and blank lines are allowed in the code to improve readability. The assembly instructions are detailed in the next subsection.

3.8.2 Assembly Instructions
A small DSL (Domain Specific Language) was developed for the extra assembler code. The description uses the following syntax:

INSTRUCTION [optional options] [optional parameter]

The options begin with a "+".

LOAD
The LOAD instruction allows loading a value from an address in the RF and storing it to a location in the prepare store.

LOAD <src_address> <dest_address> [count]

The count value determines the number of consecutive RF addresses that are read from the RF and written to the PS.

LOADI
There are two reasons to load immediate values: either to write them to the RF or to put together HToC packets in the PS. The immediate values are stored in the program memory at locations that are not used by the microcode programs, and a LOAD microcode instruction is used to transport the data into the PS.

LOADI <dest_address> <immediate value 0> ... [immediate value 6]


By loading multiple immediate values in one assembly instruction it is possible to store up to seven immediate values at consecutive addresses in the program memory and to load them into the PS with a single LOAD microcode instruction.

WRITE_SNQM
This instruction allows writing count 64 bit registers from the PS to the SNQM.

WRITE_SNQM <count>

WRITE_HTOC_RAW
With this instruction it is possible to send count 64 bit registers from the PS as an HToC packet to the HT to RF converter.

WRITE_HTOC_RAW [option] <count>

The only supported option is "+AH"; it appends the standard notification header at the end of the packet.

WRITE
The WRITE assembler instruction can be used to write from the SNQ to the RF.

WRITE <address> <value>

The value can be one of the following:
• 0: write 0
• 1: write with all 64 bits set to 1
• 3: write the value that was read last by LOAD or LOADI

FIFO
The FIFO instruction is very similar to the LOAD instruction, but it does not increase the RF address.

FIFO <src_address> <dest_address> [count]

RAW
When a microcode instruction or an operand of a microcode instruction is not implemented as an assembler instruction, the RAW microcode instruction can be used to place an arbitrary instruction in the microcode. However, the jump_destination part is overwritten with the correct content.

RAW <instruction word>


3.8.3 SNQC Implementation
Due to the design of the input files, the actual process of generating the microcode in the SNQC can be divided into five parts:
1. Interpret snq_te.txt and create a data structure with all events.
2. Parse snq_tdg.txt and add the instructions to the corresponding events in the data structure.
3. Process extra.asm in the same way as snq_tdg.txt, but with extra markup that tells SNQC what should happen to code that is already supplied for an event by snq_tdg.txt.
4. Transform the data structure into code by resolving the addresses and bringing the code into the right order.
5. Generate a report so that the user level software that evaluates the content of the SNQM can be programmed.

SNQC optimizes register file reads by combining consecutive reads with increasing addresses into a single LOAD_PS instruction. Hence, the size of the program code can be reduced.
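A sketch of this optimization in C, assuming a simplified read-list representation; the real SNQC data structures are not shown in this work, and the limit of eight reads per LOAD_PS is an assumption derived from the 3 bit count field.

#include <stdint.h>
#include <stddef.h>

struct rf_read { uint32_t src_address; };          /* one RF quadword read */
struct load_ps { uint32_t src_address; unsigned count; };

/* Merge consecutive RF reads with increasing addresses into LOAD_PS
 * instructions with count > 1.  Returns the number of instructions. */
size_t coalesce(const struct rf_read *reads, size_t n, struct load_ps *out)
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        if (m > 0 &&
            reads[i].src_address ==
                out[m - 1].src_address + out[m - 1].count &&
            out[m - 1].count < 8)   /* count[2:0]: assume at most 8 reads */
            out[m - 1].count++;     /* extend the previous LOAD_PS        */
        else
            out[m++] = (struct load_ps){ reads[i].src_address, 1 };
    }
    return m;
}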

3.9 The Integration and Application of the SNQ
The SNQ, together with the SNQC, can be used to implement all the functionality that was defined in chapter 3.4. In the following it will be demonstrated how this can be achieved with the defined and implemented architecture. These examples may also clear up the remaining open questions regarding the design. The SNQ is a central element of the EXTOLL NIC, because it is not only used to read out status, performance and debug information, but also to implement the interrupt mechanism for the RMA and VELO units.

3.9.1 Interrupts
The necessary RF infrastructure for the RMA or VELO interrupts can be described in the XML definition in the following way:


The example uses the VELO, but for RMA everything is the same. It makes use of the repeat element of RFS to instantiate the contained reg64 multiple times. The result of the XML definition is that the RF will have four 64 bit wide inputs intr_0_n to intr_3_n; because of the te attribute, a velo_intr event is generated when any of these registers changes. The advantages of using the sticky attribute in combination with the sw_write_xor attribute are detailed in chapter 2.4.2. When a velo_intr event occurs, these four 64 bit registers are read by the SNQ, and if they are unequal to zero, they are directly written back to clear the set bits. Because of the WBNZ option in the tdg attribute, the SNQ stores the values of these four reg64s to the SNQM only if any of them is unequal to zero. This is useful because the arbiter unit in the SNQ is activated again as soon as the microcode engine starts executing the code for a TE. Therefore, it can happen that a new velo_intr TE is stored in the arbiter unit while the SNQ is just starting to read out the four registers; the bit that caused the new TE is then probably already cleared when the microcode starts reading out the four registers again. Hence, the WBNZ option prevents unnecessary notifications that contain only zeros under such circumstances. Thus, 256 bits are available in the four hwregs, which matches the number of VPIDs that are supported at the moment, and the VELO only has to set the corresponding bit when there is an interrupt for a queue. The rest of the interrupt handling is performed by the RF, the SNQ, the OS kernel and the device driver. Due to the efficient instruction format, only a single LOAD_PS microcode instruction is required for this operation.
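The driver-side evaluation of such a notification is then a simple bit scan over the four 64 bit words. A hypothetical sketch (the payload layout and wake_queue handler are invented for illustration):

#include <stdint.h>

extern void wake_queue(unsigned vpid);  /* hypothetical per-VPID handler */

/* The notification payload carries the four interrupt reg64s, one bit
 * per VPID (4 x 64 = 256 bits). */
void handle_velo_intr(const uint64_t payload[4])
{
    for (unsigned w = 0; w < 4; w++) {
        uint64_t bits = payload[w];
        while (bits) {
            unsigned bit = (unsigned)__builtin_ctzll(bits); /* lowest set bit */
            wake_queue(w * 64 + bit);                       /* VPID number    */
            bits &= bits - 1;                               /* clear that bit */
        }
    }
}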

3.9.2 Counters
There are two ways to reset counters. When they are directly writeable, the "+W0" option can be used in the TDG definition. The counter named cnt1 in the following example will be read out and set to zero when eventA occurs.

Another approach is necessary when the rreinit feature is used, because then the counters are not writeable. The following XML in the RF will generate sixteen 32 bit counters; when the TE eventB is emitted, the SNQ will read out all sixteen counters and write them to the main memory.


To reset these counters it is necessary to write the initcnt reg64. Under the assumption that its address is 0x1000, this can be done with the following code block in the extra.asm file that is supplied to the SNQC.

:eventB post
WRITE 0x1000 0
::

This WRITE instruction will be appended at the end of the auto generated code for eventB because of the "post" option. Thus, as soon as all sixteen reg64 are read and stored to the SNQM, the initcnt reg64 at address 0x1000 is written and all counters are reset to zero.

3.9.3 Remote Notifications
There are, as mentioned in chapter 3.4, usage scenarios for EXTOLL NICs where no host system is connected to an EXTOLL NIC. One example is using an EXTOLL NIC card as a switch component.

VELO
The VELO was introduced in chapter 1.3.1 and can be used to send small messages into a receive queue of another node. Sending of messages is done by writing a VELO header and the data to the VELO FU, which is connected to the HTAX crossbar. It is possible to generate the necessary HToC packet with a SNQ program. The following code demonstrates how such a program could be implemented for an eventC.

:eventC replace
LOADI 0 ...
LOAD ... 3
WRITE_HTOC_RAW +AH 4
::

The LOADI line puts the HToC and VELO headers in PS0 to PS2, and the LOAD line puts the content of some RF entry into PS3. The packet is then sent with the WRITE_HTOC_RAW assembly instruction to the VELO. The count has to be 4, because the +AH option attaches the SNQ header at the end. It is not necessary to attach the SNQ header, but it includes the time and the instruction word that triggered the sending of the packet; therefore it is useful for the software that evaluates the data on the other node. Figure 3-30 shows the resulting events of such an implementation step by step:
1. RF sends a TE to the SNQ
2. SNQ reads from the RF and puts an HToC packet together in the PS



Figure 3-30: Remote Notifications with VELO

3. SNQ sends the prepared HToC packet to the VELO (node 0)
4. VELO (node 0) sends a network packet over the EXTOLL interconnect
5. VELO (node 1) writes the received data to the main memory
6. VELO (node 1) sets the corresponding interrupt bits in the RF
7. RF sends a TE to the SNQ
8. SNQ reads the VELO interrupt registers from the RF
9. SNQ writes the VELO interrupt registers into the SNQM
10. SNQ issues an interrupt
11. Software reads the local SNQM to find out about the remote SNQ message in one of the VELO receive queues

3.9.4 Resource Usage
The resource requirements of the SNQ with an arbiter for 32 possible events and a program memory with 256 entries are shown in table 3-2. These numbers are for the SNQ within an EXTOLL design for Ventoux.

resource           amount
FFs (flip-flops)   1535
LUTs               2009
RAMB36             2
RAMB18             0

Table 3-2: SNQ Resource Requirements


Chapter 4
Fast Virtual Address Translation

4.1 Introduction
Interconnection networks for High Performance Computing (HPC) are built with the intention to provide fast data transfer with low latency between compute nodes. One option is to support remote memory reads and writes, or Remote Memory Access (RMA). To implement RMA as efficiently as possible, user level applications have to be able to perform these operations from the user level, without the involvement of the operating system. For safety and security reasons, user level applications should not be allowed to directly access physical memory. User level applications also cannot be allowed to directly provide physical addresses to the NIC, because they would be able to read and manipulate the memory of other processes or the operating system. Even when there are no malicious applications, it is still possible that programming errors occur. Such errors can be very hard to debug; for example, when a program overwrites by accident a part of an operating system data structure, the problem may surface only much later under certain conditions. Therefore, an infrastructure is required that avoids giving physical addresses into the hands of user level applications and allows separating the user level applications from each other based on a process ID. This chapter describes a hardware unit called ATU2, which implements an address translation service and security checks. ATU2 is in the end a Memory Management Unit (MMU) for a NIC device. The chapter starts with a short overview of the state of the art and continues with a detailed overview of the tasks that have to be fulfilled by the ATU2. Furthermore, the motivations for this development are detailed; it replaces the predecessor ATU1. Subsequently, the environment of this new design is introduced to prepare the explanation of the design principles and the reasons that lead to these approaches. Afterwards, implementation details and results are presented.


4.2 State of the Art
Address translation is of course not a new problem; for example, most modern CPUs have an address translation mechanism called MMU. The implementation of user level access to a NIC by means of pinned-down memory and an address translation mechanism is presented in [12]. In [13] it is demonstrated that the driver for an Ethernet NIC can be implemented in user level without significant performance impact. In [15] the influence of different TLB optimizations within CPUs is evaluated; part of the optimization is an increase of the page size from 4 to 8 kB. This approach is also used in ATU2. A detailed discussion of the current state of the art can be found in [6]; reiterating the different approaches that are used at the moment makes no sense here.

4.2.1 ATU1
The address translation unit developed for EXTOLL R1 is called ATU1 and was presented in depth in [6]. It is called ATU1 in the following to differentiate it from the new ATU2. EXTOLL R2 required a redesign because of new function, feature and performance requirements. Details on the new requirements and the methods that are used to fulfill them are given in chapter 4.4.

4.3 Address Translation
The task of the ATU2 is the translation of network logical addresses (NLA) into physical addresses (PA) within the main memory. In the context of EXTOLL the NLA plays the same role as a virtual address (VA) in the context of an operating system (OS) kernel. The NLA can be used to transfer data in the network without using PAs. The exact addressing scheme will be detailed in a later section. The RMA unit supports put and get operations as described in chapter 1.3.1. The flow of events in the case of a get operation is depicted in figure 4-1. This flow consists of the following ten simplified events:
1. active process on the CPU sends a get command to the requester
2. requester sends a data request to the responder
3. responder requests a translation from the ATU2
4. ATU2 sends a response to the responder with the PA


5. responder sends one or more read requests with the PA to the main memory (controller)
6. responder gets read response(s) with data
7. responder sends the requested data over the network to the completer
8. completer sends a translation request to the ATU2
9. ATU2 sends a response to the completer with the PA
10. completer writes the data to the PA in the main memory


Figure 4-1: Get Operation

Thus, each RMA get operation causes at least two translations: one in node 0 (source node) and one in node 1 (destination node). Every time an RMA operation crosses a 4 kB boundary in the main memory, an ATU2 translation request is performed. RMA operations can be up to 8 MB in size and can therefore cause a large number of translation requests (up to 2049). The figure is only a logical representation of the functionally involved units; of course more units are involved in the transfer of actual data. The OS assigns to each process that makes use of the RMA a virtual process ID (VPID), and the VPID is used to prevent memory accesses by unauthorized processes. It cannot be changed by user level software. The RMA requester determines the VPID by the address range the application is using to write commands to the RMA units in hardware. Thus, the operating system is responsible for mapping the right address space to the corresponding process.


One of the reasons why the VPID method was chosen over the protection key [11] is its lower overhead. In short, the protection key is, for example, a 64 bit key, and when the sender includes the right one, the packet is allowed to be received. The VPID is, with 8 bits, much smaller. Furthermore, the protection key would also have to be stored in the hardware on the receiving side, because otherwise a comparison would not be possible. The put operation is slightly simpler, but it also causes two translations by the ATU2. An overview of the different steps that constitute a put operation is depicted in figure 4-2:
1. process on the CPU sends a put command to the RMA requester
2. requester sends a translation request to ATU2
3. ATU2 sends a response to the requester with the PA
4. requester sends one or more read requests with the PA to the main memory to obtain the data
5. requester receives read response(s) with data
6. requester combines the put command with the data into a packet that is sent over the network to the completer
7. completer sends a translation request to ATU2
8. ATU2 sends a translation response with the PA to the completer
9. completer writes data to the main memory by using the PA


Figure 4-2: Put Operation

4.3.1 Translation Table
The information on how to translate NLAs to the corresponding PAs has to be available to the ATU2; therefore a translation table is required. This translation table is called Global Address Table (GAT).


This table has to be stored somewhere; figure 4-3 depicts the different design choices. It is assumed that ATU2 should be able to perform 2^28 different translations, because given a translation granularity of 4 kB this equals 1 TB of usable address space. Support for 1 TB of memory seems to be sufficient because, as of this writing, compute nodes with more than 256 GB are very expensive, and it therefore seems unlikely that within the next three years the amount of memory per node will exceed 1 TB. Furthermore, 1 TB of mappable memory is only the limit in the worst case of relatively small 4 kB pages; the ATU2 also supports bigger page sizes, and by making use of these, much more memory can be mapped. Modern CPUs like recent Opterons have a physical address space of 48 bit [3]. However, because models with support for bigger physical address spaces may be released in the near future, it is assumed in this work that 52 bit addresses should be supported by ATU2. The GAT format, which will be detailed in a later section, requires 58 bits per entry; rounded up to the next power of two, each translation entry requires 64 bits. Given 2^28 translation entries, this translates into a memory requirement of up to 8 * 2^28 bytes or 2 GB for the translation tables. Therefore, it is not possible to use the internal SRAM in the FPGA or ASIC to store the translation tables.

Figure 4-3: Design Space: Translation Table Location (internal SRAM offers low latency but not enough size; external memory can be extra memory on the NIC, with lower latency, or main memory, with better scalability and cost)

Thus, external memory has to be used. There are two possibilities to obtain the required space of 2 GB: either extra memory chips could be added to each EXTOLL card, or a part of the main memory could be used to store the tables. Physical extra memory has the advantage of lower latency. One reason for this is that read requests and responses do not have to cross a HyperTransport or PCI Express link, but the bigger advantage is the lack of contention. There are two possible sources of contention; the first one is in the EXTOLL design itself. Figure 4-4 depicts the problem where three units try to issue their read requests at the same time, but there is arbitration between these three. Not only does it take time until the read request can be issued, but the read requests of the other units have to be handled by the main memory controller as well, and this takes time.

Figure 4-4: Contention (the RMA requester, the RMA responder and ATU2 all issue read requests that are arbitrated on the HTAX before reaching the main memory)

The next possible source of contention is the activity of the host system itself, because it of course also uses the main memory, and additionally it may have a higher priority than the reads or writes from HT or PCIe devices. Despite these concerns, it was decided for reasons of portability and cost to use main memory. The EXTOLL design is supposed to be portable, so that it can be used on existing FPGA boards, which do not have enough onboard memory. Furthermore, the main memory solution has the advantage of scaling with the requirements: in most cases only a small fraction of the 2^28 possible translation entries will be used, and thereby also less than the theoretical maximum of 2 GB of translation tables will be required. If extra memory on the NIC PCB were used instead, it would have to be 2 GB in size to support enough translation entries under all circumstances. A fast translation look-aside buffer (TLB) mechanism with a high hit rate is required to make up for the disadvantage of using the main memory as storage space for the GAT. The TLB is a cache that stores translations to reduce the latency. It is sometimes also called Translation Buffer (TB), Directory Look Aside Table (DLAT) or Address Translation Cache (ATC).

4.3.2 Functionality
To fulfill all required tasks, ATU2 has to provide the following main operations:
• complete translation requests
• caching in the TLB


• remove entries from the TLB (flush)
• inform units about the invalidation of translations (fence)

Thus, the task of the ATU2 is to accept translation requests from the RMA and possibly other units. If a match for a request is found in the TLB, it is returned; otherwise the GAT has to be read to acquire the translation information. The most important data flows between the involved units are depicted in figure 4-5.


Figure 4-5: Data Flow

Not only the RMA requires translations; other units that can be connected to the EXTOLL NIC design with a special interface [48] may require translations as well. More information about this interface will be given later. The GAT is managed by a kernel driver. When a process requires memory that can be used by the RMA unit, the kernel will pin this memory and add an entry to the GAT. Pinning means that the kernel marks the corresponding physical pages so that they will not be paged out to the swap area or moved around within the memory. At some point it is necessary to delete entries in the translation table, because the memory is freed. This can happen for two reasons: either the application releases the memory because it is not needed anymore, or the application process ends. In the latter case all translation entries for the process have to be removed. The problem is that after a translation entry is removed from the GAT, there are still two places where the translation information could be cached:
• TLB
• RMA / requesting unit

Therefore, it has to be ensured that before memory is used again all translation entries that point to this piece of memory are invalidated. The process of removing entries from the TLB is called flushing.


Subsequently, when a translation is to be removed from the GAT, the following steps have to be performed, which are also depicted in figure 4-6:
1. software removes the entry from the GAT
2. software sends a FLUSH command for the entry and a FENCE command
3. ATU2 executes the FLUSH command and removes the TLB entry, if there is one
4. ATU2 sends a FENCE request to all units that make use of the translation service
5. fence done responses are sent to the ATU2
6. as soon as the ATU2 has received the fence done responses from all units, it issues a notification to the software

Only a single RMA/requesting unit is shown in figure 4-6, but of course there can be more than one.


Figure 4-6: Flush and Fence
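In driver terms, the sequence maps to a handful of RF writes followed by a wait. The following C sketch is hedged: the command encodings and function names are hypothetical, and the real flush/fence protocol may batch several entries.

#include <stdint.h>

extern uint64_t *gat;                          /* GAT in main memory      */
extern void atu2_flush(uint64_t nla);          /* RF write: FLUSH command */
extern void atu2_fence(void);                  /* RF write: FENCE command */
extern void wait_for_fence_notification(void); /* e.g. SNQ based wait     */

void unmap_translation(uint32_t gat_index, uint64_t nla)
{
    gat[gat_index] = 0;            /* 1. remove the entry from the GAT    */
    atu2_flush(nla);               /* 2. flush the (possible) TLB entry   */
    atu2_fence();                  /*    and fence all requesting units   */
    wait_for_fence_notification(); /* 6. only now may the memory be reused */
}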

4.4 Motivation
The prior design ATU1 was done for EXTOLL R1 and it is working. However, the new R2 development has new requirements, and better performance would of course also be beneficial. The following issues are addressed with the new architecture:
• latency
• throughput
• non-blocking flushing
• support for 128 bit wide HTAX
• associativity
• different page sizes
• avoidance of concurrent memory read requests


Every put/get RMA operation requires at least two translations by the ATU2; the ATU2 is therefore in the critical path, and its latency is a significant factor for the overall performance of EXTOLL. Consequently, the latency should be improved relative to ATU1. The new implementation is based on a pipeline and is able to perform up to one operation per cycle, which allows high throughput. However, the throughput is reduced when consecutive operations depend on each other, because of possible read-modify-write hazards. In the worst case ATU2 can perform one operation every five cycles. An operation is one of the following activities: translation request, read response or control command. The flushing of TLB entries has to be implemented in a non-blocking fashion, so that translations are still possible while the software is flushing entries from the TLB. Furthermore, the protocol to trigger flushing operations should be as simple as possible, and it should allow lock-less operation from multiple threads. For EXTOLL R2, support for 128 bit wide HTAX crossbars is required; the new development therefore has to take widths of 64 bit and 128 bit into account, and it has to be developed in a way that only little code differs between the two versions. One component to improve the hit rate of the TLB is associativity; therefore, support for it is important in ATU2. In [82] reasons are given for direct-mapped caches, but these reasons are founded in the amount of used resources, and resources are by far not such a limiting factor anymore in comparison to 1988, when the paper was published. This opens the way to the question of what type of associativity should be supported. Full and n-way set associativity are the two available possibilities, but full associativity is not an option, because it requires content addressable memory (CAM). A small CAM can of course be implemented in an FPGA, but it would either require a lot of resources, because it would have to make use of flip-flops and comparators instead of special CAM hardware structures, or it would be slow, because it would use linear search like the CAMs that can be generated with the Xilinx core generator [141]. ATU2 is mainly designed to serve the purposes of the RMA, but it is also designed to provide translation capabilities for an extra unit (EXU) that can be connected to the EXTOLL design. An example for an EXU could be a unit for special calculations or collective operations. The EXTOLL design provides an extension interface that is named excellerate and described in [48]. By using this interface, EXUs are able to communicate with the FUs of the EXTOLL design, the host system and the EXTOLL network.


The usual page size used by OS kernels on x86_64 systems is 4 kB, but 2 MB pages are also supported by the CPU architecture. Bigger page sizes allow higher performance for many workloads, as shown in [130] for the MIPS architecture; the reason is the reduced number of TLB misses. The same principle applies to translation units for HPC NICs. Therefore, it is of course very interesting to support different page sizes in the ATU2: it saves translation table entries and thereby can reduce the number of TLB misses. The 2 MB pages are also called "huge pages" and are used more and more in applications that require a lot of memory, because they not only preserve resources in the MMU, but also reduce the number of required page table entries. ATU2 also supports intermediate translation sizes of 8 kB and 64 kB, because often more than 4 kB, but less than 2 MB of memory is required in one piece. Recent CPUs also support page sizes of 1 GB; consequently, ATU2 should support this translation size as well. Another important issue that has to be addressed with ATU2 is the avoidance of unnecessary reads to the main memory. If a translation is not in the TLB, a memory read has to be done to acquire the translation; such a memory read takes, for example, 250 ns on the Ventoux board. This number was obtained in an otherwise idle system; it can take much longer under load. Within this time frame multiple translation requests can arrive at the ATU2 that require a translation for the same page, but the TLB will have no valid translation at this point in time. There are two reasons for such a situation. The RMA actually consists of three mostly independent units, and they can operate at the same time on the same addresses; therefore requests for the same page can happen while the table read is still outstanding. With the support for bigger pages this behavior becomes even more likely, because the bigger the pages, the more likely it is that multiple units operate on the same page at the same time. The other reason is that RDMA packets on the network have a maximum payload size of 240 bytes, but up to 8 MB puts or gets can be issued with a single RMA command, and the RMA completer will issue a translation request for each packet. Hence, multiple consecutive packets with 240 bytes will likely cause translation requests for the same page. RMA packets with 240 bytes of payload have, including header and framing, a size of 272 bytes on the link, and the links of the Ventoux HTX board currently operate at 1600 MB/s. Therefore, it is possible that a new packet arrives every 170 ns, which is faster than the read from the GAT in main memory. Improved EXTOLL implementations will have much faster links, but the main memory access will not get much faster. In the case of PCIe devices the access time will be even higher.


As a consequence, multiple outstanding requests for a page while the GAT read is outstanding are a common occurrence for every transfer that is bigger than 240 bytes. Therefore, it was a central design target to avoid such unnecessary reads, because positive effects on the overall performance are expected. For all these reasons a completely new architecture had to be designed that also takes the limitations of the different target architectures into account.

4.4.1 Technology
The first target platform for this new unit is the Ventoux HTX board with a Xilinx Virtex 6 FPGA, which is shown in chapter 1.2.3. Later on, it should also be possible to synthesize the unit for other FPGAs and ASICs. The following three technologies are targeted by this design:
• Xilinx Virtex 6 at 200 MHz; this is the first architecture that will be used for prototyping. The ATU2 is designed with a higher target frequency of 300 MHz, because this way there is enough timing margin and ATU2 will not cause timing problems when integrated into the complete EXTOLL design.
• 65 nm ASIC at 800 MHz; this seems to be a modest frequency for an ASIC, but the limitation is due to the available RAMs.
• Other high end FPGAs like Altera Stratix IV [49], Stratix V [50] and Xilinx Virtex 7 [71] are also important targets.

In all three cases, SRAMs with the common ground of one write and one read port are available. RAMs with different properties, like one write port and two read ports, have to be constructed from two native RAM blocks, or the data path width is reduced. For example, the Virtex 6 BRAMs are only 36 bit instead of 72 bit wide if they are used in true dual-port mode. This would have a significant impact on the resource usage of the TLB. In the case of an ASIC it should theoretically be no problem to construct such RAMs, but the available generator does not support dual-ported RAMs at the target frequency. However, even if true dual-port RAMs were available for one technology, that would not help, because it would require different implementations for the different target technologies. Therefore the assumption is that only RAMs with one read and one write port are available.


Thus, some interesting features are impossible to implement, for example answering two requests per cycle with the help of two read ports at the TLB RAMs. According to [51] this is a common feature for general purpose CPUs like the Intel Pentium or MIPS CPUs, because their caches support multiple accesses per cycle. Luckily, such a feature is not required for the ATU2, because the RMA requires far fewer translations. The reason is the access granularity of the RMA: get and put operations have a bigger granularity than the reads and writes usually done by CPUs. The smallest RMA packet that makes sense has a size of 64 bytes; for smaller sizes the VELO is the right unit. Transferring 64 bytes over a 64 bit wide HTAX takes, including the header, ten cycles. Therefore, even when the different RMA units are doing reads and writes at the same time, a new translation will be required only every five cycles, because the host interface will not be able to handle more packets to or from the RMA units.

4.5 Interface Architecture
Address translation is simple from a very high level point of view: a unit wants to translate a network address and therefore sends a request to the ATU2. Either the translation is cached in the TLB, or a read from the GAT in the main memory of the host system has to be done. If no translation is available, the result is an invalid translation. Therefore, interfaces have to be provided for the different units that may ask for translations. The design space allows different solutions, but the best one has to be determined before the internal structure of the ATU2 can be designed. The reason for this is that the design is interface driven; the type of interfaces and especially their flow control properties dictate the inner workings to a certain degree. On the other side, the feasibility of an implementation has to be taken into account: it makes no sense if the interface description requires functionality that is impossible to implement. Unnecessarily high development effort and resource usage are not acceptable either.

4.5.1 Network Logical Address
The size of the Network Logical Address (NLA) is 64 bits, and it is used to address memory from the RMA unit or an EXU. It has to be decoded before it can be translated, because the actual translation is based on the Network Logical Page Address (NLPA). The NLPA is the logical address of a memory area called Network Logical Page (NLP); this area can have the size of a single 4 kB page or it can be bigger. The base page size of 4 kB was chosen because the target platform of EXTOLL is x86_64 and this is one of the supported page sizes of this platform. Furthermore, the architecture supports 2 MB pages. In recent implementations like the Opteron 1000/2000 and 8000 series, 1 GB pages are also supported [78]. When the granularity of the data allows using such big page sizes, this can save translation overhead. According to [78] recent AMD Opteron CPUs have 512 TLB entries for 4 kB pages and 8 for 2 MB pages. ATU2 supports these three sizes. At first only 64 kB was supported as an additional NLP size. This extra size was chosen because it is relatively likely that 64 kB can be allocated consecutively in the physical memory, and several translation entries can be saved in comparison to entries for 4 kB pages. A similar method for the Linux virtual memory system, called page clustering, is described in [52]; it shows how the data structure that is usually used for a single page can be used for multiple pages. In terms of the encoding of the different data structures and development effort it would be no problem to support more sizes in ATU2, but timing and hardware resources are a limiting factor. Thus, the question came up how many resources are really required to support additional NLP sizes. Therefore, a version with support for four sizes (4 kB, 64 kB, 2 MB, 1 GB) and a version with additional support for 8 kB NLPs were compared. Support for 8 kB is especially interesting because two 4 kB pages can be combined to save GAT entries. The results, as shown in table 4-1, are post-synthesis results for a Xilinx Virtex 6: the amount of logic resources rises only by 1.3%. Therefore, support for 8 kB was added to ATU2.

Resource     four NLP sizes   with 8 kB added
Flip-Flops   2646             2646
LUTs         2558             2592

Table 4-1: Comparison Support for 4 and 5 NLP Sizes

Comparing post place and route results was not directly possible, because the difference between these two implementations is smaller than the differences between single runs for either implementation; Xilinx ISE uses heuristic methods for placement. Therefore, ten runs were made for each implementation and the average was calculated. The results are, with an increase of 1.4% in logic resources, comparable to the post-synthesis results. Therefore the following five NLP sizes are supported by the hardware:
• 4 kB
• 8 kB
• 64 kB


• 2 MB
• 1 GB

The start address of the NLP in physical memory has to be aligned to a multiple of its size, because otherwise the offset could not be handled without calculation, and the pages would not match the ones supported by the host CPU and OS. Furthermore, the offset would also have to be stored.

TLB Index
The TLB index is the address that is used to access the TLB RAMs. If a TLB supports only a single page size, the index within the TLB can be calculated very easily by selecting the least significant bits of the page base address and using them as index for the TLB; but when there are multiple different NLP sizes, this is not possible. For example, in the case of a 4 kB page the 12 LSBs of the address are not used for page addressing, because they are the offset within the page. Consequently, the next 8 bits could be used as index for a TLB with 256 entries, as depicted in figure 4-7. If exactly the same bits are used in the case of 2 MB pages, the TLB may end up holding 256 copies of the same entry, because the TLB index was derived from within the page offset of the 2 MB page.


Figure 4-7: TLB Index

It is also not an option to derive the TLB index from a range that fits a big page size: if this approach were followed in the case of the ATU2 with its NLP sizes of 4 kB and 1 GB, then 2^18 consecutive 4 kB pages would use the same TLB index.


In [131] this issue is discussed in detail for two different page sizes. The same design possibilities were of course also faced with ATU2, but due to the slightly different nature of address translation in the case of a NIC, another possibility arises: the size could be encoded within the NLA. Furthermore, it should be kept in mind that ATU2 uses five instead of only two different page sizes. The difference of the addressing in the case of RMA and ATU2, as compared to the one used in CPUs, is that it is possible to decide how it is done. Therefore, all options regarding the implementation of a TLB with different page sizes are available. First, the design choices, as presented in figure 4-8, should be considered for the case that the size is not encoded in the NLA.

Figure 4-8: Design Space: Page Size Encoding (page size not encoded in the NLA: parallel lookup, sequential lookup, or separate TLBs; versus page size encoded in the NLA)

For different page sizes it makes sense to use different bits of the address as TLB index, and these bits should definitely not be part of the offset within the page. Thus, different TLB entries have to be looked up to find out whether an entry is contained within the TLB or not. This could be done in parallel by using TLB RAMs with multiple read ports, but as explained earlier this is not an option, because the available technologies for this design do not support multi-port RAMs efficiently. Sequential access is of course possible, but it would increase the latency for a TLB hit from five to at least nine cycles: ATU2 performs a translation that hits with one read in five cycles, but would require four more cycles to perform a read for each of the five NLP sizes. Having a TLB set for each page size would be inefficient, because the distribution of page sizes will most likely be uneven; for example, the 1 GB pages will most likely be used very seldom. Consequently, the size of the NLP has to be encoded in the NLA. Encoding the size inside the NLA allows extracting the NLPA and using a part of the NLPA as TLB index.


NLP size   size field   reserved   NLPA     offset
4 kB       5'd0         19 bit     28 bit   12 bit
8 kB       5'd1         18 bit     28 bit   13 bit
64 kB      5'd4         15 bit     28 bit   16 bit
2 MB       5'd9         10 bit     28 bit   21 bit
1 GB       5'd18        1 bit      28 bit   30 bit

Figure 4-9: Five Possible NLP Sizes (layout of the 64 bit NLA for each size)

The size is encoded in the five high order bits of the NLA. The size of the NLP is calculated as two to the power of this field multiplied with the base page size of 4 kB. Thus, if the size field contains a value of nine, the NLP is 2^9 * 4 kB = 2 MB big. The size of 28 bit for the NLPA was chosen because this allows addressing 1 TB of memory for networking purposes even in the worst case of the small 4 kB page size. For a system that has and uses a lot of memory resources it is better to use bigger page sizes. Figure 4-9 shows the five different positions of the NLPA within the NLA, depending on the first five bits.
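A C sketch of the decoding step follows; the field positions are taken from figure 4-9, but the helper is illustrative, not the RTL, and assumes only the five defined size values are used.

#include <stdint.h>

struct nlp_decode {
    unsigned size_log4k; /* size field: NLP size = 4 kB << size_log4k */
    uint32_t nlpa;       /* 28 bit network logical page address       */
    uint64_t offset;     /* offset within the NLP                     */
};

struct nlp_decode decode_nla(uint64_t nla)
{
    struct nlp_decode d;
    d.size_log4k = (unsigned)(nla >> 59);            /* five high order bits */
    unsigned off_bits = 12 + d.size_log4k;           /* 4 kB page -> 12 bits */
    d.offset = nla & ((UINT64_C(1) << off_bits) - 1);
    d.nlpa   = (uint32_t)((nla >> off_bits) & 0x0FFFFFFF); /* 28 bits */
    return d;
}

/* With the size known, a TLB index can safely be taken from the NLPA,
 * e.g. the eight LSBs for a 256-entry direct lookup. */
static inline unsigned tlb_index(uint32_t nlpa) { return nlpa & 0xFF; }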

4.5.2 Virtual Process ID
In the EXTOLL architecture, user level processes are able to initiate RMA operations; this is performed by writing commands to the hardware. These commands make use of the NLA addressing for their operation. Thus, ATU2 will translate the NLA addresses to physical addresses. This prevents user level processes from reading or writing arbitrary memory. This could be done as well with an IOMMU [135], as available in modern chipsets. However, the EXTOLL design uses HT and PCIe, and in the case of HT there is no IOMMU available, because the NIC is directly connected to one of the CPUs instead of to a chipset.


Additionally, neither the NLA scheme nor an IOMMU prevents processes from accessing the memory of other processes that are also using the RMA unit and therefore also have valid translation entries. To prevent this, the concept of VPIDs was introduced, which provides process separation. Each VPID marks a group of processes that can communicate with each other; only processes with the same VPID can communicate over RMA. The VPID is assigned by the management software and stored in the hardware. User level processes are not able to forge the VPID when sending RMA commands to the hardware. The VPID is thus the security and safety mechanism for RMA operations. ATU2 supports eight bit wide VPIDs. How many VPIDs are required for RMA operation depends on the workload; the sensible maximum is one VPID per process. It is expected that EXTOLL will be used in systems with up to 256 processors. Therefore 256 VPIDs are enough, and this is also the number supported by the RMA unit. However, the GAT format and the request/response interface of ATU2 support 10 bit VPIDs for future extensions.

4.5.3 Request and Response The format of the requests and responses to and from the ATU2 is independent of how they are transported: either a direct interface is used, as in the case of the RMA, or an HTAX interface can be used. In both cases the same information has to be transferred between the requesting units and ATU2. The command format for a request is shown in figure 4-10. This translation request format supports up to 32 outstanding translations, because the source tag is five bits wide. Source tags are statically assigned to the different units in the overall EXTOLL design. The RMA consists of three separate units; each of these can use up to eight outstanding translations and would not be able to make use of more. For the EXU eight more tags are reserved. At the time of writing it is unclear of what type the EXU will be or whether such a unit will ever use the ATU2; therefore it seems appropriate to allow up to eight outstanding translation requests for this unit. Requesting units have to set the two LSBs of the source tag (source tag[1:0]) according to their type of unit:
• 2'b00: RMA requester
• 2'b01: RMA completer


• 2'b10: RMA responder
• 2'b11: EXU

Figure 4-10: Translation Request (five 16 bit words: word 0 carries NLA[22:12] and source tag[4:0]; word 1 NLA[38:23]; word 2 NLA[54:39]; word 3 VPID[6:0] and NLA[63:55]; word 4 VPID[9:7], rest reserved)

The remaining three bits of the source tag have to be used in ascending order, because ATU2 returns responses in the same order. Hence it is not necessary for the requesting units to implement reordering of the responses. The minimum size of an NLP is 4 kB, and therefore the last 12 bits of the NLA are not included in the request; the requesting unit has to store these 12 bits and combine them with the address given in the response. Consequently, only 52 bits of the NLA are included in the request. Furthermore, the VPID is included to check whether the GAT entry and the request match.

Response
The response, as shown in figure 4-11, contains the PA without the last 12 bits. The response also does not have to incorporate the VPID, because it is known by the sending unit. If the valid bit is not set, then no valid translation was found.

Figure 4-11: Translation Response (three 16 bit words: word 0 carries PA[19:12], status[2:0], and source tag[4:0]; word 1 PA[35:20]; word 2 PA[51:36]; status bits: [0] valid, [1] readable, [2] writeable)
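As a worked example of the framing, the following hedged C sketch packs a translation request into the five 16 bit words of figure 4-10. The helper is hypothetical and only mirrors the field positions given in the figure; it is not part of the EXTOLL code base.

#include <stdint.h>

/* Pack a translation request into the five 16-bit words of figure 4-10.
 * The low 12 NLA bits stay at the requester and are not transported. */
static void pack_translation_request(uint64_t nla, unsigned vpid,
                                     unsigned source_tag, uint16_t w[5])
{
    uint64_t n = nla >> 12;                       /* NLA[63:12] */
    w[0] = (uint16_t)(((n & 0x7FF) << 5) | (source_tag & 0x1F));
    w[1] = (uint16_t)(n >> 11);                   /* NLA[38:23] */
    w[2] = (uint16_t)(n >> 27);                   /* NLA[54:39] */
    w[3] = (uint16_t)(((vpid & 0x7F) << 9) | ((n >> 43) & 0x1FF));
    w[4] = (uint16_t)((vpid >> 7) & 0x7);         /* VPID[9:7], rest reserved */
}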


4.5.4 Interface to Requesting Units Units that use the translation service provided by the ATU2 have to be able to transport their translation requests and receive the responses. Therefore, interfaces between the different units have to be defined. Figure 4-12 shows the different design options. The cleanest design would be to use an additional virtual channel (VC) on the HTAX crossbar. This would be clean in terms of interfaces, because an existing interface could be reused. The advantage of this is not only of theoretical nature, because big designs have to be partitioned into different blocks to help the tool flow reach timing closure. Thus, it is desirable to have as few connections as possible between different modules to reduce dependencies.

Figure 4-12: Design Space: Interfaces of Requesting Units (dedicated interface — low latency — versus HTAX interface — clean, sufficient for the EXU with its most likely very light usage; over the HTAX either an extra VC — clean interface, but resources for the VC — or encapsulation in normal HToC writes — no extra VC, but overhead)

However, there are multiple disadvantages associated with this approach. First, the HTAX crossbar would get bigger and more complex, which would lead to higher resource usage and worse timing in the FPGA or ASIC. Second, the HTAX crossbar does not allow more than one outstanding request to a destination port at the same time. This means that ATU2 could either send a response or a memory read request to the main memory controller at one time, but not both. This could lead to many small stops due to congestion, as explained before. The biggest argument for the dedicated interface is latency: using the HTAX crossbar would incur at least one additional cycle of latency for each request or response. Therefore, it was decided that the main path between the RMA units and ATU2 should use a dedicated interface.


Figure 4-13: ATU2 Environment (ATU2 exchanges requests/responses and the fence request/response signals with the RMA over a dedicated interface; via the HTAX crossbar and the HT core it reaches the main memory controller for GAT reads and receives configuration accesses for its register file; the SMFU and the EXU — logic to connect an FPGA or another external chip to EXTOLL — are attached to the HTAX and the EXU uses ATSReq/ATSResp packets)

Figure 4-13 gives an overview of the ATU2 environment. It also shows an EXU; this unit is connected by the excellerate [48] interface. Therefore it cannot use a dedicated interface to the ATU2 and has to use the HTAX to access the address translation service (ATS) of the ATU2. A translation request from the EXU is called ATSReq and the response is called ATSResp. Although the RMA consists of three relatively independent units, there is only a single interface between the RMA units and the ATU2. The RMA design team chose to implement an ATU2 handler unit that channels the interaction between the ATU2 and the RMA units.

Requests and Responses over the HTAX
The EXU has to use the HTAX to submit requests to ATU2 and receive responses. To keep the implementation of the HTAX interface (ATSReq/ATSResp) simple, the approach chosen was to encapsulate the normal translation requests, together with a 2 bit packet type indicator, into normal posted HToC writes, as introduced in chapter 1.2.1. The disadvantage of this approach is that, due to the header, the packet is bigger than it would have to be with an extra VC. The framing format is shown in figure 4-14.


Figure 4-14: HToC ATSReq (64 bit quadwords over time: quadwords 0 and 1 carry the HToC header, quadword 2 the request, quadword 3 the packet type 2'b00)

On the other hand, this way the translations can be handled like normal writes by the routing methods that have to be implemented anyway, because the excellerate unit needs to be able to communicate with the host system and the other FUs of the EXTOLL design. Thereby, no resource consuming extra VC is required. Responses, as represented in figure 4-15, are framed in a similar way, but only one quadword of payload is required, because the response is shorter. The response is also encapsulated in a posted write; the destination address and destination HTAX port for the responses are taken from the RF of ATU2.

Figure 4-15: HToC ATSResp (quadwords 0 and 1: HToC header, quadword 2: response and packet type 2'b01)

Figure 4-14 and figure 4-15 show the encoding in the case of a 64 bit wide HTAX; in the case of a 128 bit wide HTAX the encoding works as described in chapter 1.2.1.

4.5.5 Fence Interface Furthermore, the fence functionality has to be implemented, and therefore there is a fence_request signal from the ATU2 to the RMA units, as depicted in figure 4-16. It is asserted for one cycle and orders the RMA units to perform a fence operation. As a result, the fence_done signal will be set as soon as all translations that arrived at the RMA units before the fence_request are discarded.

Figure 4-16: Fence Request (fence_request from ATU2 to the RMA, fence_done from the RMA back to ATU2)


The fence_request line will not be asserted again as long as there is an outstanding fence request, because ATU2 can only handle one fence command at a time.

Fence over HTAX
The EXU also has to complete the fence, because it could be caching translations as well. The fence request is sent in the same way as the translation response in the HTAX case. Figure 4-17 depicts the request.

Figure 4-17: Fence Request over HTAX (quadwords 0 and 1: HToC header, quadword 2: packet type 2'b10)

ATU2 will only send a single fence request until the fence done response has been received. The fence done is sent in the same way as translation requests, and again only two bits of the payload quadword are used, as can be seen in figure 4-18.

Figure 4-18: Fence Done over HTAX (quadwords 0 and 1: HToC header, quadword 2: packet type 2'b11)

4.5.6 Global Address Table Format and Lookup System software manages the translation tables in main memory; there has to be an entry for each possible translation. Each of these entries carries multiple fields. First, the physical address of the translation has to be stored, but the VPID is also required to check whether a request is legitimate. Additionally, the size has to be stored in the table, because otherwise user level software could use bigger translation sizes than it was allowed to, without ATU2 having any possibility to check it. Furthermore, two bits are used to indicate whether the entry is valid and what the read/write access permissions are. Storing all this information requires 58 bits, therefore each entry is 64 bit in size. This data structure is also shown at the bottom of figure 4-19.


The complete GAT consists of sub tables called Global Address Sub Tables (GASTs). Each GAST in memory is 2 MB in size and, because each GAT entry requires 64 bit, contains 2^18 entries. Thus, these tables with their 256k entries each can be used for the translation between NLPA and PA. The VM subsystems of OS kernels use smaller multi-level tables instead and also have to perform page table walks for the lookup. ATU2 uses a two level translation scheme, and thus no latency intensive page walks have to be performed. The first level lookup for one of the up to 1024 GASTs, which are required to support the full 28 bit of NLPA space, is done in a fast SRAM. There are certain usage scenarios that could take advantage of the possibility to lock entries in the TLB so that they are not evicted by other translations. There were two possible design choices to implement this functionality. Either an extra TLB set could be used that contains only such entries. This set would be read only for the hardware and therefore it would be no problem if the software controlled it directly. However, this would cause a fixed allocation of TLB resources between locked and normal TLB entries. Therefore, an extra status bit was added to the TLB instead, which can prevent the eviction of an entry from the TLB. The no-eviction bit in the GAT can be used to mark certain entries so that they are no longer evicted automatically from the TLB. It is up to the software to ensure that not too many entries have this flag set. The no-eviction bit opens up the following possibilities:
• performance improvement
• optimization
• debug / fix

The most important application of this bit is to lock especially important translation entries in the TLB, so that they are always available without delay. This is of course especially interesting for the big 2 MB or 1 GB page sizes, because they span bigger ranges of memory. ATU2 features a 4-way set associative TLB by default, but a direct mapped or a 2-way set associative TLB would save resources. The no-eviction bit can be used to lock complete TLB sets with entries that are of no use. Thus, it is possible to research the performance properties of TLBs with fewer sets without Verilog source code changes.


Furthermore, it would be possible to use the no-eviction bit as a fix of last resort if the TLB had a bug in the ASIC. Either the number of sets could be limited or the whole TLB could be deactivated by locking complete sets with useless no-eviction entries. Each GAT entry contains five fields of information:
• valid: 2 bits
• VPID: 10 bits
• PA: 40 bits
• no-eviction bit: 1 bit
• size of the NLP: 5 bits

There is no dedicated field that indicates the validity of the entry; one of the two valid bits indicates that the NLP is readable and the other one that it is writeable. The entry is valid if either of them is set. The possible values of the valid field are:
• 2'b00 invalid
• 2'b01 valid, read only
• 2'b10 valid, write only
• 2'b11 valid, read/write
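The following C sketch shows one possible packing of the 58 information bits into a 64 bit GAT entry. The positions of the valid, VPID, and PA fields follow figure 4-19; the exact positions of the size and no-eviction bits are assumptions for illustration.

#include <stdint.h>

/* Pack a GAT entry: valid at [1:0], VPID at [11:2], PA[51:12] at [51:12];
 * the no-eviction and size bit positions are assumed for this sketch. */
static uint64_t gat_entry_pack(unsigned valid, unsigned vpid, uint64_t pa,
                               unsigned no_evict, unsigned size)
{
    return  (uint64_t)(valid & 0x3)
          | ((uint64_t)(vpid & 0x3FF) << 2)
          | (((pa >> 12) & 0xFFFFFFFFFFULL) << 12)  /* PA[51:12] */
          | ((uint64_t)(no_evict & 0x1) << 58)      /* position assumed */
          | ((uint64_t)(size & 0x1F) << 59);        /* position assumed */
}

/* An entry is valid if it is readable or writeable. */
static int gat_entry_valid(uint64_t e) { return (e & 0x3) != 0; }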

GAST Lookup RAM
The hardware needs a table to find the corresponding GAST for an NLPA; for this purpose there is the GAST lookup RAM with 1024 entries. This table resides in SRAM within the ATU2 and therefore provides very low access times compared to external memory. It does not introduce any additional latency in ATU2, because the lookup is done concurrently to the TLB lookup. The 32 bit wide GAST lookup RAM can be found on the left side of figure 4-19. Each entry consists of:
• GAST table address: 31 bits (GAST tables are aligned to 2 MB, therefore the last 21 bits of the PA are zero and omitted.)
• valid: 1 bit

Lookup
The lookup of GAT entries, shown step by step in figure 4-19, begins with the NLA. Depending on the encoded size, the NLPA is extracted at the correct position (1) of the NLA. The 10 MSBs of the NLPA are used to look up the GAST base address in the GAST lookup table (2). This address is combined with the 18 LSBs of the NLPA to generate the address (3) that is used to read the GAT entry (4) from main memory.
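The address computation of steps (2) and (3) can be sketched in C as follows; gast_lookup_ram models the on-chip table, and all names are illustrative only.

#include <stdint.h>

/* Compute the main-memory address of a GAT entry for a 28-bit NLPA. */
extern uint32_t gast_lookup_ram[1024];   /* [31] valid, [30:0] base >> 21 */

static int gat_entry_address(uint32_t nlpa, uint64_t *addr)
{
    uint32_t e = gast_lookup_ram[(nlpa >> 18) & 0x3FF]; /* 10 MSBs of NLPA */
    if (!(e >> 31))
        return -1;                                      /* no GAST mapped */
    uint64_t gast_base = (uint64_t)(e & 0x7FFFFFFF) << 21;  /* 2 MB aligned */
    *addr = gast_base + ((uint64_t)(nlpa & 0x3FFFF) << 3);  /* 8 B entries */
    return 0;
}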


Figure 4-19: GAT Table Read (example with a 2 MB NLP, size field 5'd9: (1) the 28 bit NLPA is extracted from the NLA; the last 12 bits of the NLA are not part of the request but remain at the requester. (2) The 10 MSBs of the NLPA address the 1024-entry GAST lookup RAM, whose 32 bit entries hold a valid bit and a 31 bit GAST base address of a 2 MB aligned page. (3) The base address, the 18 LSBs of the NLPA, and 3'b0 appended as LSBs form the 52 bit main memory address of the GAT entry, which is then read (4). The GAT entry is laid out as: word 0 — PA[15:12], VPID[9:0], valid[1:0]; word 1 — PA[31:16]; word 2 — PA[47:32]; word 3 — size[4:0], reserved, PA[51:48]; plus the no-eviction bit. Valid bits: 00 invalid, 01 valid read only, 10 valid write only, 11 valid read/write. The offset within the NLP and the size are stored in the context RAM.)

Additionally, some information is stored (5) in the context RAM, because it will be required for the translation.


4.5.7 Translation After the GAT entry has been received from main memory, it is used to generate the PA for the translation response, as shown in figure 4-20. This is done by combining the PA from the GAT entry with the offset from the context RAM; how many bits are taken from each of these two sources depends on the size of the NLP (1). In the example an NLP size of 2 MB is assumed to stay consistent with figure 4-19. Additionally, the VPID and size have to be compared (2) to check whether the request was valid.

Figure 4-20: Translation (the GAT entry, received from the GAT table or the TLB, provides the PA; the offset within the NLP comes from the context RAM. (1) Both are combined — 31 PA bits plus 9 offset bits in the 2 MB example — into the 40 bit PA[51:12] sent in the response. (2) VPID and size are compared to validate the request.)

In the case of a TLB hit the translation process is very similar, except that the GAT information is read from the TLB and the context information is taken directly from the translation request.
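The combination step can be summarized by a small hedged C model. It works on full addresses and assumes the caller already holds the PA from the GAT entry or TLB and the offset from the context RAM; the low 12 + size bits of the PA are discarded, since they are supplied by the offset.

#include <stdint.h>

/* Combine the translated PA with the offset within the NLP (figure 4-20):
 * the low 12 + size bits come from the offset, the rest from the PA. */
static uint64_t translate(uint64_t pa, uint64_t offset_in_nlp, unsigned size)
{
    uint64_t mask = (1ULL << (12 + size)) - 1;  /* e.g. size 9 -> 2 MB page */
    return (pa & ~mask) | (offset_in_nlp & mask);
}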


4.5.8 Configuration and Command Interface Certain aspects of the ATU2 have to be controlled by software; for this reason an efficient command interface was designed. The commands serve the flush and fence operations, and flow control is supported with the help of a notification command. The command interface is controlled by the kernel, and therefore no protection mechanism is required, because the kernel can access all memory anyway. The commands are issued by writing them to the RF. A normal hwreg with the sw_written attribute is used, as described in chapter 2.4.2. This way the cmd output of the register can be connected to the data input of a FIFO, and cmd_written is used to shift the value into the FIFO, as depicted in figure 4-21.

Figure 4-21: Command FIFO (the 64 bit cmd output of the atu2_rf drives the din input of the FIFO; the pulsed cmd_written signal drives shift_in)

Each command is 64 bit in size, therefore only a single write is necessary to issue a command. Thus, the writing of commands is atomic. This mechanism also makes it possible for multiple kernel threads to write their commands independently without any synchronization. No flow control is implemented by the hardware. Hence, it is up to the software to ensure that never more than 512 commands are outstanding, because this is the size of the FIFO. Therefore, the software has to be informed when a certain number of commands has been completed, so that it can send more; otherwise it would not be possible to implement flow control in software. There are a few options for how the hardware can inform the software, which are shown in figure 4-22. The classic way for hardware is to issue interrupts, but the problem with interrupts is not only that they are directed at the OS kernel and not at a single kernel thread; additionally, there are multiple reasons why the EXTOLL could generate an interrupt. Therefore, the interrupt handler would have to read out hardware registers that indicate the reason, and this would introduce latency. The SNQ on the other hand could provide this information directly, but the information would still end up in an extra kernel thread that is responsible for the SNQ and would have to inform the thread that is waiting for a notification.


Figure 4-22: Design Space: Informing the Software (options: interrupt, SNQ, write to PA)

Therefore, the hardware provides another notification mechanism that uses a very simple method. Notifications can be triggered either by a notification command or by a fence command. Both commands include a PA within the main memory, and as soon as such a command is executed the memory location is overwritten with a 64 bit write with all bits set to one. It is feasible in many cases to implement busy waiting, because most command sequences will finish in only a few hundred ns. For user level applications it would be a bad design decision for security and safety reasons to make use of physical addresses, but fortunately this is not an issue, because only kernel threads use the interface. The size of the FIFO has to be known by the software of course; it can then write a certain number of commands to the hardware, followed by a fence or a notification command. As soon as the memory location has been overwritten by the hardware, the software can be certain that all commands that were written to the hardware are finished. In summary, the notifications can be used to implement a credit based flow control mechanism. There are different possibilities to implement this: different threads could statically share the global number of commands, or there could be an allocation and a deallocation mechanism.

Flush Commands
Efficient ways to remove entries from the TLB are required for two reasons. First, the time the driver or management software is occupied with flushing entries from the TLB should be as low as possible. Second, the other operations of the ATU2 should not be affected significantly by the flushing. Figure 4-23 shows the commands that are provided for efficiently removing entries from the TLB. When a mapping is removed from the GAT, so that the memory can be reused for other purposes, it has to be ensured that the mapping is no longer used by the hardware; therefore the Flush NLPA command was implemented for single entries. With a count value of 0 it removes the cached entry for a specific NLPA from the TLB. It is also possible to give a count value of up to 1023, with the result that the given NLPA and the following 1023 NLPA entries are removed from the TLB. The assumption is that the software will use NLPAs in ascending order, and therefore it will be able to take advantage of the count parameter and require fewer commands when entries have to be flushed.

Figure 4-23: Flush Commands (each command is one 64 bit word; the opcode sits in the four MSBs of word 3)

command       | opcode  | operands
Flush NLPA    | 4'b0001 | NLPA[27:0] (words 0-1), count[9:0] (words 1-2)
Flush VPID    | 4'b0010 | VPID[9:0] (word 0)
Flush TLB     | 4'b0101 | none
Flush context | 4'b1000 | none


While the Flush NLPA command removes single or several consecutive entries, the Flush VPID command is designed for mass removals. When a process is terminated and no process is running anymore for a specific VPID, all entries for this VPID can be removed from the TLB. The Flush VPID command allows doing this with a single command. The Flush TLB command deletes all entries from the TLB. It can be useful for debugging and testing. Furthermore, if only the host system is rebooted, but the EXTOLL device is not put into the reset state, then the command can be used to bring the TLB into the state it would have after a reset. This may sound to be of only limited use, because most node systems will reset all devices in the case of a reboot or a warm reset. Unfortunately, the PCI Express specification does not require this. An example of such a system is the IBM X3650 server, which does not activate the reset line. Furthermore, the reset line could be deactivated deliberately in node systems, because EXTOLL is a direct network and will stop working properly when single switches are down. Thus, when the EXTOLL device is not put into the reset state, node systems can be rebooted without affecting the network. Internally the ATU2 uses this very command after a cold reset to bring the TLB into a known state. The Flush Context command allows clearing the internal RAM that keeps the state of translation requests while the main memory reads are under way. This command is provided purely for debugging purposes, but it is also used internally after a cold reset to overwrite the complete context RAM with zeros before it is used.

Notification and Fence Commands
The group of commands that allow flow control and fences is shown in figure 4-24. As explained before, the Notification command provides the functionality of credits. The software provides a PA in the command, and as soon as this command is executed by the ATU2 the memory location is overwritten with ones. The software can be sure that all commands that were injected earlier into the command queue are completed when the memory location has been overwritten. After a flush command has been used to remove the respective entries from the TLB, it also has to be ensured that units that use the translation service of the ATU2 are no longer using the translation information. Therefore, a fence, as described earlier, has to be issued to these units. The Fence command includes a PA that is used to write a notification to main memory in exactly the same way as the Notification command.


Figure 4-24: Notification and Fence Commands (each command is one 64 bit word; the opcode sits in the four MSBs of word 3)

command      | opcode  | operands
Notification | 4'b0100 | PA[51:3] (words 0-3)
Fence        | 4'b0011 | PA[51:3] (words 0-3)
Fence unlock | 4'b0111 | none

In the case that one of the units does not respond to the fence, ATU2 will wait for the responses forever and thus will not be able to execute any new fences or notifications. If this situation occurs it is clearly an error of the non-responding unit, but there has to be a way to get the ATU2 out of this state. Therefore, a Fence unlock command was implemented. This command is hopefully never used, because its usage only makes sense if there is a bug in one of the units.
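A minimal sketch of the resulting credit based flow control in a driver is shown below, assuming a hypothetical command register pointer and virt_to_phys helper; the opcode placement follows figure 4-24, everything else is illustrative.

#include <stdint.h>

#define ATU2_FIFO_DEPTH 512

extern volatile uint64_t *atu2_cmd_reg;        /* hwreg with sw_written */
extern uint64_t virt_to_phys(volatile void *); /* hypothetical helper */

/* Write a batch of commands, append a notification carrying the PA of
 * `flag`, and busy-wait until the hardware overwrites it with all ones. */
static void atu2_submit_batch(const uint64_t *cmds, unsigned n)
{
    static volatile uint64_t flag;
    unsigned i;

    flag = 0;
    for (i = 0; i < n && i < ATU2_FIFO_DEPTH - 1; i++)
        *atu2_cmd_reg = cmds[i];               /* one atomic 64-bit write */

    /* notification command: opcode 4'b0100 plus PA[51:3] of the flag */
    *atu2_cmd_reg = (virt_to_phys(&flag) & ~0x7ULL) | ((uint64_t)0x4 << 60);
    while (flag != ~0ULL)
        ;                                      /* a few hundred ns typically */
}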


4.5.9 Preload While a translation is very fast with ATU2 when it is already cached in the TLB, it takes relatively long when the translation has to be read from main memory. Hence, it would be good if there were a way to preload new translations into the TLB before they are needed. As shown in the design space diagram in figure 4-25, there are two main approaches to load entries into the TLB: the first and most obvious one is to issue a pseudo translation request; the other one is an interface that allows writing the complete translation information directly into the TLB.

Figure 4-25: Design Space: Preload (pseudo request — simpler, allows user level access — versus writing the TLB — lower latency, better consistency; the TLB write can be a direct write or a preload command)

A pseudo translation request is a request that does not deliver a response; as a result, the translation information is fetched from main memory and stored into the TLB. The biggest advantage of pseudo requests would be that user level processes could be allowed to send them to ATU2, because they could not cause a security problem with them. In contrast, normal user level processes can of course not be allowed to write directly to the TLB. The pseudo request approach has two problems:
• source tags
• latency

ATU2 is designed to work with up to 32 source tags for requests/responses and to perform memory read requests with the same source tags when it is necessary to read translation information from main memory. When the design of ATU2 was started, 32 source tags were the limit for read requests to main memory. Changing this to support pseudo translation requests would require a lot of development effort.


The second problem is latency. Figure 4-26 shows a sequence of events: the host system sends a pseudo request, causing a memory read for the translation information. Furthermore, an RMA operation is started that causes a translation request to the ATU2, but this translation request arrives before the memory response arrives at the ATU2. This is of course still better than without the pseudo request, but it is not ideal.

Figure 4-26: Sequence Diagram: Pseudo Request (the host system sends a pseudo request to ATU2, which issues a memory read; an RMA operation then triggers a translation request that reaches ATU2 before the memory response — the preload arrives too late)

Thus, the approach chosen is to preload the TLB directly. Two paths are possible for this: either the TLBs could be mapped directly into the RF so that they can be written, or a command interface could be implemented. Directly writing the TLB is not easily possible, because the TLB also contains status information about translations that are waiting for read responses from main memory; avoiding possible hazards would incur complexity. Therefore, the decision was to use a command interface to allow preloading translation entries into the ATU2. The information required to preload one TLB entry consists of:
• NLPA: 28 bits
• NLPA size: 5 bits
• VPID: 10 bits
• PA: 40 bits
• read/write valid bits: 2 bits
• no-eviction bit: 1 bit


In sum, 86 bits have to be transported from the software to the ATU2 to preload a single TLB entry. This is obviously more than fits into one quadword, so the same interface that is used for the flush commands can not be used. Accordingly, one preload command requires two 64 bit writes, because 64 bit is the native size the EXTOLL RF operates on. Recent x86_64 CPUs provide the possibility to write 128 bits using the 128 bit wide SSE registers with MOVNTDQ [136]. However, the specification does not guarantee that the result will really be a single 128 bit write. Experiments have shown that on AMD Opterons this seems to be the case, but on Intel Xeon 5400 CPUs it was observed that a MOVNTDQ can actually result in two 64 bit writes in some cases. Moreover, the HT to RF converter does not support writes bigger than 64 bits, therefore MOVNTDQ can not be used even when EXTOLL is used in a system with AMD Opterons. Consequently, two 64 bit writes have to be done for a single preload. The problem is that it has to be possible to preload the TLB from multiple threads. However, writes from multiple threads can be intermixed when they arrive at the destination unit. Therefore, it is not possible to implement this as simply as it was done for the command interface in the last subsection. Figure 4-27 shows the problem with two threads: both try to transport their preload commands, consisting of two writes each, to the ATU2. In this example part0 from thread B arrives between the two parts sent by thread A.

Figure 4-27: Two Threads and Intermixed Writes (threads A and B write their two-part preload commands simultaneously; on the data channel (HT/PCIe) towards ATU2, part0 of thread B arrives between part0 and part1 of thread A)

Therefore, another approach is required to enable preloading from multiple threads; the fundamental design choices are depicted in figure 4-28. The simplest solution would of course be to use a mutex to prevent multiple threads from writing at the same time. However, such a software solution would have a performance impact, because a mutex would have to be acquired for every preload.


Figure 4-28: Design Space: Preload Interface (software — low development effort — versus hardware — performance; software: mutex; hardware: probability method — low resource usage, but not always successful — reorder in hardware, or write combining buffer — reliable and tested, but high resource usage and one block of RAM required)

Consequently, a hardware solution should be preferred. One obvious choice would be the write combining buffer (WCB) [114], which is also used for the VELO and RMA units. It has the big advantage that this unit already exists and is well tested, due to its usage in the other units of the EXTOLL design. However, it requires a lot of resources. According to [75], reads and writes to memory regions that are marked as "uncachable" will not be reordered. The memory region of the RF is such a region, as opposed to the write combining memory regions that are used for the RMA and VELO. Therefore not all features of the WCB are required. Furthermore, the preloading should use the RF infrastructure for reasons of simplicity. In the case of the ATU2 there is the additional advantage, compared to the VELO or RMA, that it can handle the preloads much faster than they can arrive from the host system, because the ATU2 only has to store the content of the preload command into the TLB. Therefore, a very simple WCB can be built that basically consists only of a single SRAM and very little logic. Thus, ordering is not a problem, and an address space of 512 entries of 64 bit can be used for the preloading. Each thread has to use its own area within that preload space of 512 entries. Each of these memory areas consists of two consecutive memory addresses named part0 and part1. Therefore it is possible for up to 256 threads to use the preloading of ATU2 at the same time. To submit commands the threads just have to write the preload data in two 64 bit writes to their respective memory area. This preload address space is purely virtual and can be generated as a ramblock in the XML definition of the ATU2 RF; a description of the ramblock element can be found in chapter 2.4.5.

The hardware operation is very simple. If a write arrives with the least significant bit (LSB) of the address being 1'b0, then this part0 is stored into an SRAM with 256 entries, as depicted in figure 4-29. If a write arrives with the LSB being 1'b1, then this new part1 is combined with the corresponding SRAM entry, containing part0, to form a complete preload command. Each software thread simply has to write its preloads to its own two 64 bit addresses.

Figure 4-29: Virtual Address Space for Preloading (a purely virtual address space with 512 entries of 64 bit provides one part0/part1 address pair per thread; the part0 values are backed by the SRAM with 256 entries)

Although the information necessary for a preload requires only 86 bits, the actual preload command has a size of 95 bits, because it also contains a command identifier and is consistent with the internal command format of ATU2. The 95 bits are split into 31 bits in part0 and 64 bits in part1, because this allows using an SRAM that is only 31 bits wide. However, preloads have a big advantage compared to other operations: because their purpose is only to reduce latency, it is functionally not a problem if some of them fail. Therefore, ATU2 implements a probability method for the submission of preload commands, which does not require the 256 entry SRAM, but works very similarly.


As soon as the hardware receives a write for any part0 it stores the written data and the address. Afterwards it waits for the matching part1 of the command, i.e. for a write with the corresponding address. All other writes except the expected one are discarded. When part1 has been received, both parts are combined and submitted to the execution pipeline of the ATU2. The expectation is that the probability method works very well under normal workloads; if this estimation turns out to be wrong, it is still possible to resort to software locking. As described before, preloading consists of writing two quadwords of data to the ATU2 RF; the necessary information has to be encoded into the two writes, part0 and part1. The encoding is shown in figure 4-30. Please note that the MSB of the NLPA is in part1.

Figure 4-30: Preload Command Format (part0, 31 bit: NLPA[26:0] and the 4 bit opcode 4'b1101; part1, 64 bit: PA[51:12] (40 bit), the writeable, readable, and non-evict bits, 5 reserved bits, size (5 bit), VPID (10 bit), and NLPA[27])
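A hedged C sketch of a preload submission under the probability method is shown below. The opcode and the split of the NLPA across the two parts follow the text and figure 4-30; the exact ordering of the fields within part1 is an assumption, as are the window pointer and the function name.

#include <stdint.h>

extern volatile uint64_t *preload_window;      /* 512-entry virtual space */

/* Pack the 95-bit preload command into part0 (31 bits used) and part1
 * (64 bits) and submit it with two 64-bit writes to the thread's pair of
 * addresses. Field positions in part1 are assumed for illustration. */
static void atu2_preload(unsigned thread, uint32_t nlpa, unsigned size,
                         unsigned vpid, uint64_t pa,
                         unsigned rd, unsigned wr, unsigned no_evict)
{
    uint64_t part0 = 0xDULL                              /* opcode 4'b1101 */
                   | ((uint64_t)(nlpa & 0x7FFFFFF) << 4);    /* NLPA[26:0] */
    uint64_t part1 = ((uint64_t)(nlpa >> 27) & 1)            /* NLPA[27]   */
                   | ((uint64_t)(vpid & 0x3FF) << 1)
                   | ((uint64_t)(size & 0x1F)  << 11)
                   | ((uint64_t)(no_evict & 1) << 16)
                   | ((uint64_t)(rd & 1)       << 17)
                   | ((uint64_t)(wr & 1)       << 18)
                   | (((pa >> 12) & 0xFFFFFFFFFFULL) << 24); /* PA[51:12] */

    preload_window[2 * thread]     = part0;   /* address LSB = 0 */
    preload_window[2 * thread + 1] = part1;   /* address LSB = 1, submits */
}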

The encoding of the preload command is also used internally in the pipeline of the ATU2 without further modifications. Therefore, it is possible to use the very same interface to submit any command that can be handled in the ATU2 pipeline. The instruction format of the pipeline is detailed in chapter 4.7.5 on page 187. Thus, this interface can also be used for debugging purposes. However, when it is used for this purpose, multiple threads should be prevented from writing at the same time, to avoid that instructions are discarded. The ATU2 RF provides counters for successfully submitted instructions and discarded parts. These counters can be used to evaluate the probability mechanism and, if the interface is used for debugging purposes, to ensure that all instructions were submitted.


4.6 Design In the previous section the interfaces of the ATU2 were described. These interfaces and their properties largely influence the design, because they impose certain requirements due to the load that they can cause. An important desired improvement compared to ATU1 is the avoidance of unnecessary reads to main memory. There are essentially two ways to handle subsequent requests for the same NLPA: either these requests are not treated specially and more memory reads are performed, or, the better way, which was chosen for ATU2, the extra reads are avoided. A trivial solution would be to stall the ATU2 after every request until the read response arrives, but considering the read latency this would not be a good solution, because in that time no other translation request could be handled. Taking the avoidance of unnecessary reads into account early in the development is a central element of the design. This feature is complex, because different pieces of information have to be stored and managed. Two things have to be considered. First, the information that a translation is under way has to be stored, because otherwise a read can not be avoided. Second, subsequent translation requests that are waiting for this translation to finish have to be stored in a data structure. The translation responses can then be generated as soon as the required information for the translation arrives in the ATU2. The only sensible place to store the information about outstanding GAT reads is the TLB. Therefore, a busy bit was added to the TLB, which indicates that a GAT read is under way for the particular NLPA. A separate data structure would be resource intensive: it would require a structure with 32 entries for NLPAs with main memory reads under way, which would have to be implemented as a CAM capable of one lookup per cycle, and a quick one at that, completing in one or two cycles. In the end, 32 registers and comparators of 28 bit each would be required, due to the lack of special CAM hardware structures. Thus, the TLB is the right place for this information in terms of resources and development effort. Accordingly, when a translation request arrives in the ATU2 and it is a miss, ATU2 tries to allocate a TLB entry for this NLPA, possibly evicting an old translation. The new entry will carry the NLPA and the busy bit, and the memory read request will be sent. If no TLB entry is available, then unnecessary reads can not be avoided. However, this should happen only very seldom, at least as long as the distribution of NLPAs over the TLB indices is not very unfortunate.


Storing Outstanding Requests
From a high level view ATU2 works as depicted in figure 4-31: the requests arrive in the pipeline and the entry is read out using the TLB index; then, based on this entry, either a response is generated or, if a GAT read has to be done, the TLB entry is marked busy. In the latter case this is a read-modify-write operation.

Figure 4-31: TLB Read-Modify-Write (requests flow through read TLB, evaluate, execute and modify, write TLB; responses leave after the execute stage)

As soon as another translation request arrives, a TLB lookup is done, and when it hits a translation that has the busy bit set, it is not necessary to perform another GAT read. However, it is not sufficient to only store the information about the outstanding requests. It is also necessary to evaluate this information again as soon as the GAT read is completed and the information needed to generate the translation response is available. Such a mechanism can be implemented with the following methods:
• reservation station
• data structures in TLB entries
• replay FIFO

CPUs use reservation stations [132] when an instruction waits for operands. The same principle could be applied to the ATU2, but given up to 32 outstanding translations it would be necessary to implement 31 reservation stations that can hold translation requests waiting for read responses from the GAT. Additionally, when multiple reservation stations complete in the same cycle, arbitration is required, because obviously it is not possible to send back multiple translation responses at the same time. Reservation stations do not fit the problem and would require too many resources. As explained before, the busy bit in the TLB indicates that a GAT read is in progress; therefore it would be possible to store the outstanding requests in the TLB. It was already mentioned that each translation request is stored in a context RAM. Therefore it is not necessary to store the complete requests within the TLB entries; it is sufficient to store pointers to the context RAM.


There are two fundamental possibilities to do this: either a bit field with one bit for each of the 32 possible entries in the context RAM, or a more complex data structure in which the TLB entry points to one entry within the context RAM and a linked list is implemented within the context RAM. So either storage space is wasted or complexity has to be accepted. However, there is also another possibility. When a translation has to wait because the TLB entry is busy, the whole translation request can be stored in a replay FIFO and issued again, as shown in figure 4-32. Consequently, requests rotate through the replay FIFO as long as the TLB entry is busy. As soon as the TLB entry is updated by the read response from the GAT, it is no longer busy and the request is handled.

Figure 4-32: Replay FIFO (the pipeline of figure 4-31 is extended by an arbitration stage in front of the TLB read; requests and responses that can not complete are written from the execute stage into the replay FIFO and fed back into the pipeline)

The replay FIFO additionally allows a very simple implementation of the reordering. As explained before, the requesting units expect to receive the translation responses in the same order as the translation requests: they use the source tags in ascending order, and therefore the responses have to arrive in the same order. However, ordering could be lost when two requests A and B directly follow each other, A is a TLB miss and B is a hit; then ATU2 has to delay the response for B until the response for A could be sent. To avoid complex extra hardware structures, the concept of the replay FIFO was extended to also store responses. Thus, if a response would be out of order, the execute unit can store the response in the replay FIFO for resubmission. Responses simply rotate through the pipeline until ordering is fulfilled again.
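The decision logic of the execute stage can be modeled in C-like pseudocode as follows; this is a simplified behavioral model of the description above, not the Verilog implementation, and all helper functions are placeholders.

typedef enum { REQUEST, RESPONSE } kind_t;

struct insn { kind_t kind; unsigned src_tag; /* ... */ };

extern int  tlb_busy(const struct insn *);   /* busy bit set for this entry? */
extern int  in_order(unsigned src_tag);      /* is this the next expected tag? */
extern void send_response(struct insn *);
extern void replay_push(struct insn *);

static void execute(struct insn *i)
{
    if (i->kind == REQUEST && tlb_busy(i)) {
        replay_push(i);                 /* GAT read still under way: rotate */
    } else if (i->kind == RESPONSE && !in_order(i->src_tag)) {
        replay_push(i);                 /* keep responses in tag order */
    } else if (i->kind == RESPONSE) {
        send_response(i);
    } else {
        /* hit: generate response; miss: allocate busy entry, issue GAT read */
    }
}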


4.6.1 TLB The TLB must contain all information that is necessary to translate requests without reading from the GAT. Therefore it stores the following information:
• NLA
• PA
• size
• VPID

Furthermore it is necessary that each entry contains the following status information:
• valid
• read / write bits
• no-eviction
• busy

It may happen that at some point in time all TLB entries for a specific TLB index are occupied; then entries have to be replaced. Thus, it is necessary to choose a suitable replacement strategy. The design space depicted in figure 4-33 contains the replacement strategies that were considered for ATU2.

Figure 4-33: Design Space: Replacement Strategy (LRU, TNRP, random; more status information enables better replacement decisions)

Least Recently Used (LRU) replaces the entry that has not been used for the longest time. Time of Next Reference Predictor (TNRP) is an example of a complex replacement algorithm and is presented in [76]. The algorithm tries to predict which page will be used next. For this purpose it stores the time of the last access and the number of page accesses between the last and the second to last access. The random replacement uses a fair round robin arbiter [114]. However, the replacement is "random" and not round robin, because there is no round robin state per TLB index; the arbiter runs continuously. Due to its fairness property it still ensures that no particular TLB set is replaced more often than the others.


LRU and TNRP both require significant amounts of status information and resources to evaluate this information. These algorithms are designed for the replacement of pages in operating systems or cache lines in CPU caches, where the managed resources are much bigger in relation to the required extra status information than in the case of the ATU2 TLB: a page has 4096 bytes and a CPU cache line has 64 bytes, but an entry in the ATU2 TLB requires less than 80 bits. In the case of TNRP a time and a counter would have to be stored, which is assumed to require 40 bit of space (a 32 bit timestamp and an eight bit access counter); this would add 50% storage overhead to the TLB and additionally add logic. LRU could be implemented with lower overhead. In summary, a 4-way set associative TLB with TNRP would require the same amount of storage as a 6-way TLB (or a 4-way TLB with 50% more entries) with the random strategy. If more resources are to be allocated for the TLB, it seems more sensible to invest them in more entries instead of using them to store status information for complex replacement strategies. Extra logic resources could be used to support more NLP sizes and thereby reduce the number of TLB entries that are required. Therefore, the decision in the end was to use the random strategy. It should be noted that there is a big difference to a virtual memory subsystem or a CPU cache: the software that manages the ATU2 allocates the NLPAs, and it can therefore try to allocate NLPAs in a way that optimizes TLB usage.

TLB Index Size
For the implementation of the ATU2 for the Ventoux board, which is the primary development platform, a decision had to be made regarding the size of a single TLB set. The amount of resources required inside the FPGA depends, due to the properties of the BRAMs, on how these hardware primitives can be used to implement the required amount of storage. The 36 kbit BRAMs [9] in the Virtex 6 FPGAs have a maximum width of 72 bit if no ECC correction is used and are then 512 entries deep. Other possible configurations are for instance:
• 36 bit * 1024
• 18 bit * 2048
• 9 bit * 4096

It is also possible for the FPGA tools to split the 36 kbit BRAMs into two 18 kbit BRAMs with half the width. Table 4-2 shows the required resources for different TLB index widths.


index width | TLB width | amount of data in bits | BRAMs required            | usage
8           | 78        | 19968                  | 1 x 36 kbit + 1 x 18 kbit | 36.1%
9           | 77        | 39424                  | 1 x 36 kbit + 1 x 18 kbit | 71.3%
10          | 76        | 77824                  | 2 x 36 kbit + 1 x 18 kbit | 84.4%
11          | 75        | 153600                 | 4 x 36 kbit + 1 x 18 kbit | 92.6%
12          | 74        | 303104                 | 8 x 36 kbit + 1 x 18 kbit | 96.7%

Table 4-2: Size Consideration for a Single TLB Set

The table shows that the TLB width depends on the index width, because the TLB entries only store the part of the NLPA that is not implicitly encoded by the address of the TLB entry. The most interesting number is the usage. It is obviously very bad for the 8 bit index width, because each BRAM has a depth of at least 512 entries, so half of the capacity is wasted from the start; thus, an index width of 8 is out of the question. In the end the decision was to use a 4-way set associative TLB with an index width of 9 bits, because a 4-way set associative TLB is expected to allow a high hit rate and an index width of 9 bits requires the lowest amount of resources. The TLB width is parameterized, so it can easily be changed as soon as the EXTOLL design is in a state where benchmark results can be generated for different TLB configurations on an FPGA test cluster.
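The usage column of table 4-2 can be reproduced with a few lines of C; the program simply divides the stored bits (2^index * width) by the raw capacity of the allocated BRAM primitives (36 kbit = 36864 bits, 18 kbit = 18432 bits).

#include <stdio.h>

int main(void)
{
    /* index width, TLB width, number of 36 kbit BRAMs (plus one 18 kbit) */
    const struct { int idx, width, n36; } cfg[] = {
        {8, 78, 1}, {9, 77, 1}, {10, 76, 2}, {11, 75, 4}, {12, 74, 8},
    };
    for (int i = 0; i < 5; i++) {
        long data = (1L << cfg[i].idx) * cfg[i].width;
        long cap  = cfg[i].n36 * 36864L + 18432L;
        printf("index %2d: %6ld bits, usage %.1f%%\n",
               cfg[i].idx, data, 100.0 * data / cap);
    }
    return 0;
}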

4.7 Micro Architecture and Implementation The most important functionality of the ATU2 is the generation of translation responses for requests from the RMA. In anticipation of a high hit rate, the pipeline is designed around and optimized for this common path to deliver the lowest possible latency and a high throughput. Figure 4-34 shows the most likely path taken when a request from the RMA hits. Fewer than five pipeline stages are not possible at the targeted 300 MHz on a Xilinx Virtex 6 FPGA. The task of the HT inport is to arbitrate between the different ways data can arrive at the ATU2. The RAM access takes two cycles, because the Xilinx SRAM blocks of the Virtex 6 have a high clock to output time of 1.79 ns [119]. The second RAM read cycle consists only of a register stage to relax the timing.


Figure 4-34: Optimized Five Cycle RMA/ATU2 Translation Path (a request from the RMA passes through the HT inport, TLB RAM read cycle 0, TLB RAM read cycle 1, execute cycle 0, and execute cycle 1 back to the RMA as response)

However, a hit is only the most likely case. When a memory read request is made to read from the GAT and a TLB entry is available, the entry is marked busy and the context RAM stores the information which TLB RAM has to be updated. The busy information is used to prevent unnecessary GAT reads. Searching a TLB entry and marking it busy is a read-modify-write operation on the TLB, and storing the information from the GAT read is a write operation. A memory read can take quite some time, depending on the system interface and the hardware in use. A range from 100 ns (estimation for a high performance ASIC implementation connected to HyperTransport 3) to 600 ns (FPGA design connected to PCI Express) is realistic for current technology. The read latency was measured in an AMD Opteron server system with an HTX adapter board using HT600 (HyperTransport at 600 MHz), and the result was 250 ns. As soon as the read request to the GAT returns with a response, and if the context RAM indicates that the TLB entry has to be updated, the information from the GAT is stored in the TLB. The Flush VPID command is also a read-modify-write operation, because it has to read out the TLB and overwrite the entries with a specific VPID. Therefore it makes sense to have all data flow through a single execution pipeline, because otherwise it would be necessary to arbitrate the different accesses to the TLB RAM. Consequently, a complete ATU2 translation with a TLB miss is described in the following with the help of figure 4-35, which outlines ATU2 with its important modules and their connections.
1. A translation request from the RMA arrives and the HT inport converts it to the internal pipeline format
2. The TLB RAMs are read out for comparison with the NLPA in the translation request
3. It is a TLB miss and execute1 initiates a GAT read by sending the corresponding command to the HT outport; the information from the request is stored in the context RAM; if execute0 finds an available TLB entry, then a TLB entry is reserved for this NLPA and marked busy
4. The HT outport arbitrates the HTAX crossbar and sends the GAT read
5. In the HT inport the read response is converted into the internal pipeline format and forwarded to the pipeline
6. Based on the source tag of the read response the context RAM is read out
7. Execute1 receives the context information and the information from the GAT and evaluates the information
8. Based on the received information the execute1 unit sends a translation response to the RMA and stores the translation information in the TLB

Figure 4-35: ATU2 Implementation Details (the HT inport multiplexes requests from the RMA, packets from the HTAX, and the command expander, preload, and replay FIFOs into the pipeline of delay0, delay1, execute0, and execute1; the four TLB RAMs (TLB 0-3) are read in parallel and written by the execute stages; the context RAM buffers request state; responses go back to the RMA, GAT reads leave through the HT outport to the HTAX; the atu2_rf feeds the command and preload FIFOs; the outputs of all modules are registered)


The instruction pipeline contains the delay stages to make up for the read latency of the TLB RAMs. Such a pipeline implementation naturally has several issues that have to be solved, namely the following:
• getting the different inputs into the pipeline
• preventing dependency problems
• flow control

These issues are addressed in the following two subsections.

4.7.1 Inputs and HT Inport The HT inport can be seen as a five way multiplexer: it has five different data sources and schedules them for the execution pipeline. Additionally, it accepts the incoming data from the HTAX crossbar. The different inputs have different priority requirements due to the different interfaces.

Figure 4-36: HT Inport Data Flow (a multiplexer selects between the RMA input, which has no flow control, the HTAX input, and the preload, replay, and command FIFOs, and forwards to the execution unit; a packet store holds the packet from the HTAX when an RMA request arrives in the same cycle)

The HT inport exists in a 64 and a 128 bit version to support both widths for different FPGA implementations and the ASIC. As depicted in figure 4-36, there are five sources of input in the HT inport:
• HTAX
• RMA interface
• command FIFO
• replay FIFO
• preload FIFO


The HTAX interface is equipped with a request/grant mechanism and therefore provides flow control, but stopping to grant should be avoided, because stalling this interface can also affect other units. In the worst case a new packet can arrive from the HTAX every three cycles. The difficulty in this unit stems from the fact that the input from the RMA unit provides no flow control: every request from the RMA has to be accepted, and none can be discarded or delayed. A FIFO would solve the problem, but it would require resources. The remaining three inputs from the command, replay and preload FIFOs naturally have flow control because they are FIFOs. Translation requests have a higher priority than commands, because, for example, flushes are less timing critical. For translation requests from the RMA priority is even imperative, because there is no flow control. Consequently, all new translation requests from the RMA are forwarded directly. HTAX packets have the second highest priority, so as not to stall the HTAX crossbar. For this purpose the HT inport is able to store a single packet from the HTAX. The packet store can hold the content of a single HToC packet; it is a single register that stores one 95 bit wide pipeline instruction. The HTAX crossbar uses a request/grant scheme and the ATU2 is always granting, except when there is already a packet in the store and at the same time a new translation request arrives from the RMA. Command, replay and preload are scheduled fairly when there is no new RMA request or input from the HTAX. It is not necessary to use more elaborate mechanisms, for example giving the replay FIFO a higher priority, because in a realistic usage scenario there are not many preloads or commands. Preloads can only be sent by the OS kernel and in the worst case only one every ten cycles, because two writes to the RF take at least ten cycles; furthermore, they will most likely be sent only when new memory is registered. A similar argument applies to the command interface, because it is used for flushing entries and fences, and the number of these is also limited. ATU2 can handle in the worst case one instruction every five cycles; worst case dependencies are outlined in the next subsection. In reality a performance close to one instruction per cycle is expected. At least under the assumption of "useful" usage scenarios this can not be the bottleneck, because either the operation throughput of the RMA or the interface from and to the host system will be the bottleneck instead.
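The priority scheme can be condensed into the following C-like pseudocode; all functions are placeholders for the actual arbitration logic, not names from the design.

struct insn;
extern int rma_request_pending(void);
extern struct insn *take_rma_request(void);
extern int htax_packet_available(void);       /* arriving or in the store */
extern struct insn *take_htax_packet(void);
extern struct insn *round_robin_from_fifos(void);

/* One scheduling decision of the HT inport per cycle. */
static struct insn *ht_inport_select(void)
{
    if (rma_request_pending())
        return take_rma_request();     /* no flow control: never delay */
    if (htax_packet_available())
        return take_htax_packet();     /* second: do not stall the HTAX */
    return round_robin_from_fifos();   /* command, replay, preload: fair */
}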


4.7.2 Dependencies and Flow Control As explained before, the TLB is used in a read-modify-write flow within a pipeline; in each cycle a read and a write operation can take place. Such a setup can cause consistency problems, because reading the TLB takes two cycles and the execution stages also require two cycles. When two instructions A and B directly follow each other and both read, modify and write the same TLB entry, then A and B will read the same data. However, if both also write to the same TLB entry, then the write caused by A will not matter, because it will be overwritten a cycle later by B. Thus, it has to be ensured for every pipeline instruction that the same TLB index was not used in the last four cycles. This problem is avoided with two methods:
• arbitration
• no exec flag

The pipeline format, which will be detailed in chapter 4.7.5, always has the TLB index at the same position. Therefore, it is possible to store and compare the TLB indices of the last four cycles without spending a lot of resources on differentiating between the possible instructions. Figure 4-37 shows a reduced schematic with the parts relevant for this explanation.

[Schematic: the HT inport (A) arbitrates between the HTAX, requests from the RMA and the command, replay and preload FIFOs; a store-and-compare of the TLB indices of the last four cycles (B) raises no_exec; the delay and execute stages (C) feed the HT outport, whose FIFO (D) reports almost_full; the replay FIFO feeds back into the HT inport]

Figure 4-37: Dependency and Flow Control


The HT inport (A) arbitrates between five different inputs. Three of these are FIFOs, and the HT inport only takes instructions from any of these FIFOs when their TLB index differs from those of the last four cycles. Thus, there will never be an instruction from these FIFOs in the pipeline that may have a dependency problem. However, the RMA request and the input from the HTAX do not come out of a FIFO, and at least in the RMA case there is no flow control. The obvious solution would have been to add two FIFOs to the HT inport that store the instruction if there is a possible dependency problem, but this would have a resource impact due to the FIFOs. Therefore another approach was chosen: extra logic (B) was added to the pipeline that performs the same comparison as the HT inport does for the three FIFOs, and if there is a dependency problem a no_exec signal is raised. The execute unit (C) will in this case simply do nothing with the instruction and forward it to the replay buffer. The advantage of this approach is that no extra FIFOs are required, and furthermore the pipeline does not have to be stalled: in the case of a dependency problem the instruction is simply forwarded to the replay FIFO. In the worst case all instructions target the same TLB index; then an instruction will be executed only every five cycles.

Flow Control
FIFOs require resources. Therefore, the FIFOs were kept as small as possible without causing problems. The FIFO (D) in the HT outport, as depicted in figure 4-37, is only eight entries deep, because in this case, too, the replay FIFO can be used: as soon as the FIFO is almost_full a stop signal is asserted. Based on this signal the execute unit stores the commands into the replay FIFO instead. This is no issue, because the execute unit will simply try to submit them to the HT outport again a few cycles later. Of course the size of the replay FIFO also had to be chosen correctly, which was done by taking the worst case into account: there can be at most 32 instructions circulating in the pipeline, so the replay FIFO has to be at least 32 entries deep.
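The store-and-compare logic (B) can be modeled with a small ring buffer over the last four TLB indices. The following C sketch is a minimal model of this idea; the names and the exact window handling are illustrative, not taken from the implementation.

#include <stdbool.h>
#include <stdint.h>

#define WINDOW 4   /* two cycles of TLB read plus two execute stages */

typedef struct {
    uint16_t idx[WINDOW];   /* TLB indices issued in the last four cycles */
    bool     valid[WINDOW];
    unsigned head;
} idx_history_t;

/* true if issuing tlb_idx now would collide with an instruction that
 * is still inside the read-modify-write window */
static bool has_dependency(const idx_history_t *h, uint16_t tlb_idx)
{
    for (unsigned i = 0; i < WINDOW; i++)
        if (h->valid[i] && h->idx[i] == tlb_idx)
            return true;
    return false;
}

/* called once per cycle with the index that entered the pipeline
 * (valid == false for bubbles and no_exec instructions) */
static void history_push(idx_history_t *h, uint16_t tlb_idx, bool valid)
{
    h->idx[h->head]   = tlb_idx;
    h->valid[h->head] = valid;
    h->head = (h->head + 1) % WINDOW;
}

For the three FIFO inputs the HT inport simply refrains from dequeuing while has_dependency() holds; for RMA and HTAX instructions the same comparison raises no_exec instead, and the execute unit forwards the instruction unchanged to the replay FIFO.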


4.7.3 Command Expander
The task of the command expander is to convert the commands, which are submitted by the software, into the internal pipeline format so that they can be executed. The format of the commands is defined in chapter 4.5.8; depending on the command and its operands, between one and 1024 pipeline instructions are generated for each command. Each generated instruction is put into a small FIFO, and the almost_full signal of this FIFO is used in the state machine that generates the instructions.
Furthermore, this unit also ensures that there is always only a single fence and the corresponding notification active. Otherwise the replay FIFO could overflow, because a fence notification has to wait and rotate until the fence is completed. Therefore, the command expander stops submitting instructions to the execution pipeline when there is already another fence or notification in the pipeline. The reason for this is flow control: otherwise many more notifications could be forwarded to the execution pipeline than can be handled in the HT outport, and they would then have to be stored in the replay FIFO, whose size is limited.
Additionally, this unit is also responsible for resetting the TLB and the context RAM after a cold reset. It does this by generating pipeline instructions.
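As an illustration, the expansion of a flush over a range of NLPAs could look like the following C sketch. The helpers (emit, make_flush_nlpa, output_fifo_almost_full, wait_one_cycle) are hypothetical stand-ins for the state machine and FIFO interfaces; the real command format is the one defined in chapter 4.5.8.

#include <stdint.h>
#include <stdbool.h>

extern bool     output_fifo_almost_full(void); /* almost_full of the small FIFO */
extern void     wait_one_cycle(void);
extern void     emit(uint64_t pipeline_instruction);
extern uint64_t make_flush_nlpa(uint32_t nlpa); /* encodes a Flush NLPA instruction */

/* expand one command covering count consecutive NLPAs
 * (between 1 and 1024 pipeline instructions per command) */
void expand_flush(uint32_t first_nlpa, uint32_t count)
{
    for (uint32_t i = 0; i < count; i++) {
        while (output_fifo_almost_full())
            wait_one_cycle();              /* honor the back pressure */
        emit(make_flush_nlpa(first_nlpa + i));
    }
}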

4.7.4 Execute
The task of the execute unit is to handle the pipeline instructions. For timing reasons this unit consists of two pipeline stages, as depicted in figure 4-38. In execute0 it is evaluated whether a request is a hit, and a MUX selects the output of the TLB RAM that hit; this reduces the logic complexity in execute1. For each TLB RAM there is a signal to indicate the following properties:
• NLPA matches
• VPID matches
• VPID and size match

Furthermore, a TLB RAM for replacement is chosen, if one is available. This is not always the case, because TLB entries with the no-eviction or busy bit set cannot be replaced. This information is transported on a one-hot bus named tlb_writable. When the execute1 unit needs to write to a TLB it can assign this bus directly to the write_enable bus of the TLB RAMs.


[Execute unit: execute0 evaluates the tlb_hit, nlpa_hit, vpid_hit, vpid_and_size_hit and tlb_writeable signals (4 bit each, one per TLB RAM) and a MUX selects among the four TLB RAMs (each 77 bit wide); execute1 drives write_enable, write_data and write_addr towards the TLB RAMs and outputs 95 bit instructions to the replay FIFO, 61 bit instructions to the HT outport, responses to the RMA and 64 bit context_data]

Figure 4-38: Execute Unit

Most instructions are handled in the execute1 unit. Here the responses and GAT reads are also constructed and forwarded to the respective receivers.

4.7.5 Pipeline Instruction Format
The ATU2 pipeline uses an instruction format that is consistent across all pipeline stages and also towards the HT outport. The format, as shown in figure 4-39, uses four bits to distinguish between the different instructions; the operands are stored in the rest of the instruction word and depend on the respective instruction.

[operands (91 bit) | opcode (4 bit), 95 bit total]

Figure 4-39: Generic Instruction Format

The maximum width of an instruction is 95 bit, but to save resources the actual data paths between the modules are only as wide as necessary: the path from the command expander is 53 bits wide and the path to the HT outport 61 bits. All other instruction paths, as depicted in figure 4-40, are 95 bit wide. Most instructions have a regular format. This is not only helpful for the development of the Verilog HDL, but it also saves resources, because the hardware structures are more regular and therefore the synthesis has more optimization opportunities.



Figure 4-40: Instruction Format Using Units

In the following the different instructions and their operands are shown. All of them have the NLPA at the same position; consequently, the portion of the NLPA that constitutes the TLB index is also at the same position. This is crucial for preventing read-modify-write dependency problems: if not all instructions had the TLB index at the same position, either directly or as part of the NLPA, the dependency analysis would be much more complex.

Flush NLPA
With the Flush NLPA instruction in figure 4-41 a specific NLPA entry can be removed from the TLB.

[reserved (63 bit) | NLPA (28 bit) | opcode 0001 (4 bit)]

Figure 4-41: Flush NLPA Instruction
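A small software model of the encoding illustrates why the dependency compare stays cheap. The sketch below assumes the layouts of figures 4-39 and 4-41, i.e. the opcode in bits [3:0] and the NLPA in bits [31:4], and models only the low 64 bits of the 95 bit instruction word; it is an illustration, not the generated hardware.

#include <stdint.h>

#define OPC_FLUSH_NLPA 0x1u                 /* opcode 0001 */

/* the opcode is always in the lowest four bits */
static inline unsigned instr_opcode(uint64_t lo) { return (unsigned)(lo & 0xFu); }

/* the NLPA occupies bits [31:4] in every instruction that carries one,
 * so the TLB index (the low bits of the NLPA) is always at the same
 * position, which is exactly what the store-and-compare logic uses */
static inline uint32_t instr_nlpa(uint64_t lo)
{
    return (uint32_t)((lo >> 4) & 0xFFFFFFFu);  /* 28 bit */
}

static inline uint64_t make_flush_nlpa(uint32_t nlpa)
{
    return ((uint64_t)(nlpa & 0xFFFFFFFu) << 4) | OPC_FLUSH_NLPA;
}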

Flush VPID
When not only a single NLPA but all entries for a specific VPID have to be removed from the TLB, the instruction in figure 4-42 is the right one. There can be multiple entries for a VPID at the TLB index position given in the instruction, and this instruction will invalidate all of them.

[reserved (53 bit) | VPID (10 bit) | reserved (18 bit) | TLB IDX (10 bit) | opcode 0010 (4 bit)]

Figure 4-42: Flush VPID Instruction


The architecture depicted in figure 4-38, with hit signals for the different cases, allows implementing this instruction in execute1 by assigning the vpid_hit signal to the write_enable of the TLB RAMs while assigning zero to write_data.

Flush IDX
The instruction depicted in figure 4-43 can be used to delete all entries from the TLB at the index position given in the operand. It is used for complete TLB flushes; in particular, this instruction is used after a cold reset to initialize the TLB RAMs.

[reserved (81 bit) | TLB IDX (10 bit) | opcode 0101 (4 bit)]

Figure 4-43: Flush IDX Instruction

Fence
This instruction sends a fence request over the HTAX; the instruction in figure 4-44 is used when the excellerate interface is activated. It is generated by the command expander. The execute unit itself does nothing with it besides forwarding it to the HT outport.

[reserved (91 bit) | opcode 0011 (4 bit)]

Figure 4-44: Fence Instruction

Notification
When handled by the HT outport, the notification causes a write to the PA location given in the instruction, as shown in figure 4-45. However, execute1 will only forward it to the HT outport if there is no outstanding fence.

[reserved (42 bit) | PA[51:3] (49 bit) | opcode 0100 (4 bit)]

Figure 4-45: Notification Instruction

The execute unit will store it in the replay FIFO when there is an outstanding fence.


Address Translation Service Request
The HT inport encodes incoming translation requests from the RMA or the HTAX into an instruction as depicted in figure 4-46. A flow diagram for the execution of this instruction can be found in figure 4-52.

[reserved (25 bit) | offset (18 bit) | src tag (5 bit) | NLPA size (5 bit) | VPID (10 bit) | NLPA (28 bit) | opcode 1000 (4 bit)]

Figure 4-46: ATS Request Instruction

Main Memory Read
When it is necessary to perform a main memory read to obtain GAT entries, the execute1 unit issues such an instruction to the HT outport or, if the FIFO of the HT outport is full, forwards it to the replay FIFO instead. Figure 4-47 depicts the necessary operands.

[reserved (37 bit) | src tag (5 bit) | memory_addr[51:3] (49 bit) | opcode 1001 (4 bit)]

Figure 4-47: Memory Read Instruction

Read Response from Main Memory
The HT inport receives the read responses from the main memory and forms an instruction from them, which it injects into the pipeline as depicted in figure 4-48.

[writeable/readable/non-evict flags (3 bit) | PA[51:12] (40 bit) | src tag (5 bit) | size (5 bit) | VPID (10 bit) | reserved (28 bit) | opcode 1010 (4 bit)]

Figure 4-48: Read Response Instruction

Based on the information from the main memory the TLB can be updated and translation responses can be generated.

Address Translation Service Response
When the execute unit was able to translate a request, it emits a reply as depicted in figure 4-49.


[reserved (42 bit) | src tag (5 bit) | PA[51:12] (40 bit) | reserved/writeable/readable/valid flags (4 bit) | opcode 1100 (4 bit)]

Figure 4-49: ATS Response Instruction

Preload Instruction
As described before, one of the features of the ATU2 is that software is able to preload entries into the TLB. The instruction in figure 4-50 will try to store the given translation information into the TLB to avoid the read latency of the GAT access.

[writeable/readable/non-evict flags (3 bit) | PA[51:12] (40 bit) | reserved (5 bit) | size (5 bit) | VPID (10 bit) | NLPA (28 bit) | opcode 1101 (4 bit)]

Figure 4-50: Preload Instruction

However, if there is no usable TLB entry, the information is discarded. This can happen when all entries at the particular index have either the busy or the no-eviction bit set.

Context Flush
This instruction, shown in figure 4-51, is used after reset to initialize the context RAM so that it contains a valid state. Under normal circumstances it should not be required otherwise, but it could be used to initialize the context RAM without a hardware cold reset.

[reserved (86 bit) | IDX (5 bit) | opcode 1110 (4 bit)]

Figure 4-51: Flush Context Instruction

The context RAM has 32 entries, and depending on the IDX the corresponding entry is overwritten with zeros. Thus, 32 of these instructions, covering all possible IDX values, have to be inserted into the execution pipeline to clear the context RAM completely.
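A sketch of this reset sequence in C; make_context_flush() is a hypothetical helper that encodes opcode 1110 with the given index, and inject() stands for inserting the instruction into the execution pipeline.

#include <stdint.h>

extern uint64_t make_context_flush(unsigned idx); /* encodes opcode 1110 */
extern void     inject(uint64_t pipeline_instruction);

/* clear the 32-entry context RAM after a cold reset */
void clear_context_ram(void)
{
    for (unsigned idx = 0; idx < 32; idx++)
        inject(make_context_flush(idx));
}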


Execution of Commands
Handling of requests is the central task of the ATU2; therefore, a flow diagram is presented for this particular instruction in figure 4-52.

[Flow diagram: on a TLB hit the ATS response is sent directly to the RMA, or to the HT outport, or into the replay FIFO if the HT outport FIFO is full; on a miss the request is put into the replay FIFO if there is already a busy entry for the NLPA or the src tag is out of order, otherwise a GAT read request is sent to the HT outport, a context RAM entry is generated and, if a TLB entry is available, it is updated with the busy bit set]

Figure 4-52: Request Flow Diagram

As can be seen in this diagram, the execution of this instruction itself causes, in some cases, the generation of new instructions.

4.7.6 HT Outport
The HT outport is responsible for sending outgoing packets over the HTAX crossbar. It receives instructions from the execute1 unit and converts them into packets in the HToC protocol to send them over the HTAX, either to the host system or to the EXU. The design was done with the lowest possible latency in mind; therefore it has a FIFO and a non-FIFO path. The non-FIFO path is used when the module is idle, i.e. when the interface to the HTAX is not in use. Thus, the latency is only a single clock cycle when the unit is idle; in all other cases the instruction is stored in the FIFO. The schematic representation in figure 4-53 shows the two paths a command can take. In this representation it looks as if each instruction were stored in the FIFO, but this is of course not the case; the instruction is only stored if the interface to the HTAX is already in use.
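The path selection amounts to a simple bypass around the FIFO. The following C sketch models one cycle of this decision; the helper functions are invented names for the corresponding hardware signals, and the FIFO is drained first so that ordering is preserved.

#include <stdbool.h>
#include <stdint.h>

extern bool     htax_idle(void);
extern bool     fifo_empty(void);
extern void     fifo_push(uint64_t instr);
extern uint64_t fifo_pop(void);
extern void     send_to_htax(uint64_t instr);

/* one cycle of the HT outport path selection */
void outport_cycle(bool have_instr, uint64_t instr)
{
    if (have_instr) {
        if (htax_idle() && fifo_empty())
            send_to_htax(instr);   /* non-FIFO path: a single cycle of latency */
        else
            fifo_push(instr);      /* interface busy: store for later */
    } else if (htax_idle() && !fifo_empty()) {
        send_to_htax(fifo_pop());
    }
}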


Figure 4-53: HT Outport

As depicted in figure 4-53, the execute1 unit sends commands to the HT outport; back pressure flow control is implemented using the almost_full signal of the FIFO. The threshold is chosen so that the execute1 unit has enough time to stop sending new commands. The HT outport supports the following commands in the pipeline format, as described before:
• fence
• notification
• main memory read
• ATSResp (Address Translation Service Response)

4.8 Software Interface
To control the ATU2 and to read out performance characteristics, several registers are made available through an RF interface. The setup of the ATU2 for operation is also done via the RF interface: before the ATU2 can perform translations, the software has to provide GASTs and write the GAST lookup RAM through the RF interface.

4.8.1 Statistics Counters
HPC networks like EXTOLL are very complex designs. Therefore, it is necessary to have methods to analyze different performance properties, because these can help to optimize the hardware and software to reach higher performance. The ATU2 RF provides counters and values to observe and analyze the address translation. In particular the following values are provided:
• number of translation requests
• number of memory reads


• for the preload interface, the number of successfully received commands and the number of discarded quadwords
• number of commands received overall
• number of NLPA flush commands
• number of fence commands
• number of notification commands
• number of flush VPID commands
• number of flush IDX commands

Furthermore, fill level indicators and a register that stores the maximum fill level were added for the replay and the preload FIFOs to allow an analysis of their usage patterns.

4.9 Results
To evaluate the performance properties of the ATU2, several values about the implementation were collected and put into perspective. The design is implemented in about 4000 lines of Verilog HDL overall, including the 64 bit and 128 bit versions of the HT inport and outport, but excluding the automatically generated ATU2 RF.

4.9.1 Latency
ATU2 was designed from the ground up to reduce the latency compared to ATU1. The following table gives several basic performance characteristics. The numbers for ATU1 are from [6].

Path                                          ATU2 (in cycles)          ATU1 (in cycles)
TLB hit:  ATS request from RMA to response    5                         9
TLB miss: ATS request to read                 6                         11
TLB miss: read response to ATS response       7                         9
TLB miss: complete                            13 + main memory access   20 + main memory access

Table 4-3: ATU2 Performance Characteristics


The latency for a complete translation with ATU2 was measured in-system and amounts to 63 cycles at 200 MHz, i.e. 63 x 5 ns = 315 ns. Since 13 of these 63 cycles (65 ns) are spent in the ATU2 itself, the main memory access accounts for the remaining 250 ns.

4.9.2 Resource Requirements
Xilinx Virtex-6 is the main development platform, as explained earlier; therefore the resource requirements are given for this architecture in table 4-4. An index size of 9 bit and a set associativity of four were used.

resource            amount
FFs (flip-flops)    3321
LUTs                3536
RAMB36              9
RAMB18              5

Table 4-4: ATU2 Resource Requirements

The design reached a maximum frequency after place and route of 304 MHz.


Chapter 5: Trading Architecture and Simulation Environment

5.1 Introduction
This chapter describes a new approach taken to develop a system for very fast and efficient trading at stock exchanges. For this application a low latency TCP/IP stack has been implemented in an FPGA. Special attention is given to the problem of testing the system and to methods that allow accomplishing this task efficiently, despite the tremendous complexity of the system.

The process of building a bit file for the FPGA from the HDL sources is time consuming, and the possibilities to debug the design in hardware are very limited. Thus, hardware designs should be tested as much as possible in simulation before they are tested in an FPGA. In the case of an ASIC it is of course out of the question to produce a chip that has not been sufficiently tested in simulation.

Testing is a hard problem for two reasons. First, testing a complex hardware design requires an equally complex testing environment, which takes a lot of development effort. Second, the amount and type of testing required to reach a state that is “good enough” for the application has to be determined, “good enough” meaning that the risk of a bug is either small enough or that a failure has no catastrophic effect.

In the following chapter a short overview of the testing of hardware designs is given, a TCP implementation is explained, and the approaches that were used to lift its quality to a production ready state through simulation are presented.

5.1.1 High Frequency Trading
For this application a hardware/software system has been developed that reduces the latency of trading as far as possible. The target of the development is to reduce the overall time the trading server requires from the arrival of an Ethernet packet carrying a market update until the order is on the cable. Optimizations are therefore possible on all levels of the network stack. The different processing steps in the trading server are depicted in figure 5-1.


In [98] and [99] we have shown methods to lower the latency of the reception of market updates and the decoding of FAST messages with the help of FPGAs. The FAST protocol [28] is an encoding format that is used to compress market update information. Promising results have motivated further research into the topic.

[Pipeline in the trading server: Ethernet → UDP → FAST decoding → trading software → TCP → Ethernet; market updates enter on one side, orders leave on the other; the overall latency spans all stages]

Figure 5-1: Processing Stages in the Trading Server

Optimizing the trading software has been out of scope. Therefore, the sending of orders should be improved with a TCP hardware unit.

5.1.2 Testing and Simulation Methods
Hardware designs have to be tested before it makes sense to use them. Test benches are a “classical” method of testing. In some cases these consist of only a single Verilog module that feeds stimulus to the hardware to be tested, which is called the Design Under Test (DUT), and checks the outputs. Figure 5-2 shows such a setup.


Figure 5-2: Generic Test Bench

Such a “classical” test bench can work very well for small units that do not have a lot of internal state.

A slight variation is the comparison with a reference model. Small arithmetic units are examples where this approach can work very well, if the input vector is small enough to test all combinations. For example, a special divider with a fixed divisor [27] was tested with this method. Writing the test that compares the results for all possible values with a reference model, as shown in figure 5-3, took only a few hours. The exhaustive test had a runtime of one night on a desktop system. Verilator [157], which synthesizes Verilog to C++, was used for this.


Figure 5-3: Test Bench with Reference Model

However, DUTs with complex interfaces and a lot of internal state are very hard to test with simple test benches. Therefore, verification environments are developed for complex designs. The idea behind these is to decouple the development of the tests from the actual stimulus and output of the DUT by means of a transaction model. The current “state of the art” is the Universal Verification Methodology (UVM) [147]. It promotes a development model that separates the verification environment into different reusable components called Universal Verification Components (UVC). In figure 5-4 a verification environment with two UVCs is shown. The virtual sequencer drives the UVCs. These translate the transactions into stimulus for the DUT and monitor the interface from the DUT. The scoreboard is informed about both directions via transactions. Thus, the scoreboard can contain an abstract high level model of the DUT and evaluate from the transactions whether the behavior is correct.


Figure 5-4: Verification Environment


Verification environments are usually combined with methods to collect coverage information. Coverage can be based either on source lines or on feature points in the DUT.

5.2 TCP
TCP [5] is usually used synonymously with TCP/IP, because TCP usually relies on the IP protocol. It is part of the protocol stack that is used in almost every device from cellular phone to supercomputer. Thus, a further introduction of the TCP protocol would be a superfluous exercise.

Implementing TCP in hardware is interesting because it allows reducing the latency. The implementation is built application specific, and therefore it is possible to accept certain limitations in comparison to a general implementation in an OS kernel like Linux [150]. One of the reasons why offloading TCP makes sense is that it allows reducing copy operations in the main memory for packet processing, as stated in [125].

In [124] an analysis of the latencies for sending and receiving with TCP can be found. The times in the breakdown are for a 2.13 GHz processor and a 1G Ethernet card. Only minimum sized packets are transmitted, and therefore the difference to more recent CPUs and 10G Ethernet cards should be small. In the end, this setup required 4159 ns to get the data from the application onto the cable.

TCP in hardware is not new; other implementations are presented in [1] and [26]. This new implementation is designed to allow sending with the lowest possible latency from user level software. A full TCP implementation is a very complex undertaking, but for the trading application the following limitations are acceptable:
• only outgoing connections
• no retransmission
• no slow start and congestion avoidance
• only sending has to be fast and low latency
• no standard socket API

The major simplification is that the implementation only has to support outgoing connections, and in the first implementation even only a single connection. This special application does not require retransmission capabilities, because the sending of orders happens in a very clean network of fast switches, which have virtually no load. Therefore, packet loss is so seldom that it is not necessary to implement retransmission.


Furthermore, it would make no sense to retransmit an order, because the retransmitted order would arrive too late at the order server, at least if the implementation uses a sensible retransmission timeout. Before a TCP stack retransmits a packet it should wait at least the usual time it takes the other side to send an ACK, otherwise many unnecessary retransmissions would be under way.

TCP uses IP to transport its packets over the network, and IP provides neither reliability nor flow control. If the data rate is too high in an IP network, buffers will fill up and packets will be dropped at some point. The TCP layer is only indirectly informed about dropped packets by the lack of acknowledgement (ACK) packets. Therefore, elaborate algorithms have been developed for TCP to probe the highest possible data rate between two nodes. Some of these methods are compared in [148]. However, such a mechanism is not required for this special application, because the network is very fast and only very little data is transferred over it.

Only the sending of packets has to be low latency. The return direction is only required for TCP ACK packets and for reports from the stock exchange about the status of orders, and it is not important to have the lowest possible latency for these reports. The software in the trading server can be adapted; therefore, it is not necessary to implement the standard UNIX socket API [79].

5.2.1 Implementation
There are different choices regarding how much of the low latency TCP stack has to be implemented in hardware and how much of it may be implemented in software. A pure software implementation would have the big advantage that developing software requires less effort than developing hardware. Adding the TCP header to the data and performing the other steps necessary for sending could be done quite fast in software; the finished Ethernet packet could then be transferred to the hardware. This would still have a lower latency than the standard solution in [124], because the software could be much simpler than a general TCP stack, and furthermore PIO could be used instead of DMA to write the packet to the hardware. The advantage of PIO from the host CPU compared to DMA by the FPGA is that it does not involve an extra round trip on the interface between the FPGA and the host system.

Adding the header in hardware has the advantage that less data has to be transferred to the hardware, and thus the transmission time can be reduced. Therefore, the packet framing is done by the hardware, because sending is the path that has to be optimized for latency. Receiving is not latency critical; consequently, this part can be done in software. This is also the argument against a hardware-only implementation: doing all the TCP processing in hardware would be unnecessary development effort.


The different design choices are also summarized in figure 5-5.

[Design space “type of implementation”: software only (+ less effort, - fewer optimization possibilities), hardware only (- complexity), hybrid (+ balanced based on requirements, - state in hardware and software)]

Figure 5-5: Design Space: Types of Possible Implementations

The implementation consists, in the end, of three hardware modules and a software library, as depicted in figure 5-6. These components have to interact to fulfill the following three tasks:
• setup
• sending
• receiving

The setup of the connection is done by the software handler; it sets values in the RF and prepares the receive queue in the main memory of the host system.


Figure 5-6: Hybrid TCP Stack Components

Sending is done by direct writes to the TCP send unit without any framing in software. Receiving involves multiple components: the TCP dump unit writes all received packets into a queue residing in the main memory, which is also called the mailbox, and the software handler reads the packets from the queue and sends ACKs if necessary. However, the software handler is not an extra thread; it runs in the context of the trading software. Thus, the trading software has to call the software handler regularly, because otherwise no ACKs can be sent. By polling the software handler the trading software receives new data from the TCP connection.
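The resulting usage pattern on the software side can be sketched as follows. All function names are invented for illustration; they do not correspond to the actual library interface.

#include <stddef.h>
#include <stdint.h>

extern int  tcp_handler_poll(void *buf, size_t len); /* drains the mailbox, sends ACKs */
extern void tcp_send(const void *data, size_t len);  /* PIO writes to the TCP send unit */
extern int  decide_order(const uint8_t *update, int n, uint8_t *order, size_t cap);

void trading_loop(void)
{
    uint8_t rxbuf[2048], order[64];
    for (;;) {
        /* must be called regularly: without it no ACKs are generated */
        int n = tcp_handler_poll(rxbuf, sizeof rxbuf);
        if (n > 0) {
            int m = decide_order(rxbuf, n, order, sizeof order);
            if (m > 0)
                tcp_send(order, (size_t)m);  /* low latency path in hardware */
        }
    }
}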

5.3 Simulation Environment and Testing
The sending and receiving with TCP also has to be tested. Therefore, a UVM verification environment was developed first; however, it did not prove feasible for testing, for several reasons. The problem with such a full verification environment is that a tremendous amount of development time is required, because multiple components that are given in the environment of the real hardware do not exist and have to be developed.

Figure 5-7 depicts the environment of the hardware device. When focusing on the TCP implementation, there are two big components that interact with the FPGA device: the order server and the host system. For the purpose of the simulation, the order server is a server system that accepts TCP connections, accepts data and sends data. The server system runs a standard Linux system. The software part of the new TCP stack is mainly responsible for receiving data from the TCP receive queue and sending ACKs.


Figure 5-7: Device Environment

Thus, for testing TCP in the verification environment these two components had to be rebuilt: the TCP/IP server and the software part of the hybrid TCP stack. However, the problem was that there were mismatches: neither the simulated TCP server nor the reimplementation of the host software behaved exactly like the real one. Thus, the system was not working correctly in hardware, and a lot of debugging had to be done in the FPGA. This is very time consuming and tedious, because the possibilities to debug a design in the FPGA are very limited. Furthermore, for every test a bit file has to be generated, and this can take multiple hours.

Due to the complexity of the design, and because of the difficulties with the verification environment, it was decided to resort to a Correct by Application (CbA) [21] approach with testing only. However, the ability to debug the system, if there is a problem, is mandatory. Therefore, the question had to be asked whether it is possible to connect the hardware simulation to the real environment. This means that the simulation can be used with exactly the same software as the hardware and with exactly the same TCP stack on the side of the order server. Thus, in the case of an error in hardware, the simulation can be used to reproduce and analyze the problem. Two extensions are required for the hardware simulation to make this possible:
• a connection to a network
• a method to interact with the software

The first requirement can be fulfilled with a TAP device, the latter with a new component called the Bus Functional Model Server (BFMS).

5.3.1 TAP Device
The TAP [127] device is a feature of the Linux kernel that allows simulating an Ethernet device at user level. Other operating systems offer similar capabilities. Often TUN and TAP devices are mentioned together. The difference between them is that TUN operates on layer 3 packets in the ISO 7-layer model [7], and is therefore on the IP packet level. However, the accelerator design will be connected to an Ethernet network and therefore also has to support the Address Resolution Protocol (ARP) [2]. Thus, a TAP device has to be used instead of TUN, because it handles packets on layer 2.

In figure 5-8 it is demonstrated how a TAP device can be used. The user level process (A) can open a TAP device and use normal read and write system calls to receive or send Ethernet packets. To the network layer it looks like packets from or to a real NIC. For the user level process (C) there is also no difference between a physical NIC and a TAP device.


TAP devices are usually used either for virtual private networks (VPN) or for virtual machines. By bridging between the device driver (B) and the TAP device, the VPN or the virtual machine is transparently connected to the Ethernet network. For example, OpenVPN uses TAP devices for bridging Ethernet networks [151] over secure connections.

[Inside the Linux kernel the network layer connects to a physical NIC through its device driver (B) and to a TAP device; a user level process (A) accesses the TAP device with read/write calls, and the TAP device acts like a real Ethernet NIC to the rest of the system; another user level process (C) uses the network as usual]

Figure 5-8: TAP Device

Consequently, it is also possible to use a TAP device to connect a hardware simulator to the network layer or to a physical network.
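Opening a TAP device from user level takes only a few system calls. The following sketch uses the standard Linux interface from linux/if_tun.h; error handling is reduced to the essentials.

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <net/if.h>
#include <linux/if_tun.h>

/* open a layer 2 TAP device with the given name (e.g. "tap0") and
 * return a file descriptor carrying whole Ethernet frames */
int tap_open(const char *name)
{
    int fd = open("/dev/net/tun", O_RDWR);
    if (fd < 0)
        return -1;

    struct ifreq ifr;
    memset(&ifr, 0, sizeof ifr);
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI;   /* layer 2, no extra packet info header */
    strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);

    if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
        close(fd);
        return -1;
    }
    return fd;   /* read() and write() now transport raw Ethernet frames */
}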

5.3.2 BFMS
The complete design as shown in figure 5-9 is controlled from software by reads and writes. Furthermore, it writes to the main memory of the host system, for example to fill the TCP receive mailbox. The UDP and FAST functionality use the same methods.


Figure 5-9: Overview Accelerator Architecture


There are two possibilities for how the control software could be connected to the simulator. The first is to connect a virtual machine to the simulator and run the control software inside this virtual machine; QEMU [149] could be used exactly as in [116]. The other possibility is to modify the control software slightly so that it can run as a normal user level process on the same system as the simulator. After evaluating the two possibilities it was decided to implement the latter, because QEMU had the problem that it was not possible to perform 64 bit writes and reads from the emulated processor on emulated devices, but the TCP send unit relies on these.

To allow connecting the DUT with different normal user level programs, it was decided to implement the Bus Functional Model Server (BFMS) as a server that accepts TCP/IP socket connections. These connections should not be mixed up with the TCP hardware implementation that is tested in the simulation; therefore the connections between BFMS and its clients are only called “socket” connections in the following. In [117] sockets are also used to connect system emulation and hardware simulation, but in that case only these two components are connected.

BFMS acts as a communication hub between the DUT and different clients. This offers flexibility, because different programs can be started and make use of the hardware exactly as when the real hardware is used. For instance, there is a “debuginfo” program that reads RF entries and evaluates them for debugging purposes. With BFMS it can be used on the DUT while tests are running.

The resulting setup is depicted in figure 5-10. The accelerator_core is exactly the same as shown in figure 5-9, but the Ethernet MACs and especially the HT core are not included, because including them would require development effort and would slow down the simulation. The Network HyperTransport (NHT) client connects to BFMS, which acts as root complex in the setup, and clients 1 and 2 can act similar to user level programs and perform reads and writes. Only a single NHT or DUT can be connected to BFMS, but up to 32 clients. The NHT client and the TAP interfaces in figure 5-10 are implemented in C, and the SystemVerilog interface code calls these C functions through the DPI-C [153] interface. The hardware simulator that was used for the development is Cadence NCSIM [154], but there are no fundamental obstacles to using the environment with another simulator.

Four types of communication have to be supported:
• a client reads from the RF of the DUT
• a client writes to the RF of the DUT
• a client writes to the TCP send unit
• the DUT writes to the main memory


[Setup inside the hardware simulator (ncsim): the accelerator_core is wrapped by SystemVerilog interface code for the two TAP devices and for the HTAX side; (A) marks the SystemVerilog wrapper that interprets HToC packets, (B) a TAP interface function; the NHT client with its memory model connects over a socket to BFMS, which also serves clients 1 and 2; the simulator writes waveforms]

Figure 5-10: BFMS Simulation Environment

The development target was to run the same control software on the real hardware and with the simulator. Therefore, all read/write accesses to the RF in the C code were replaced by macros. Depending on whether the BFMS define is set, either the RF is accessed directly or the bfms_client functions are used.

#ifndef BFMS
#define RF_READ64(A)     t->accel_rf->A
#define RF_WRITE64(A,B)  t->accel_rf->A=B
#else
#define RF_READ64(A)     bfms_client_read64(ADDRESS_OF(t->accel_rf->A))
#define RF_WRITE64(A,B)  bfms_client_write64(ADDRESS_OF(t->accel_rf->A),B)
#endif

The bfms_client provides these two and other functions in a single C file, which makes it possible to integrate them without a lot of effort. Writing to the TCP send unit is also done with the bfms_client_write64 function. However, the reception of TCP packets is implemented by DMA writes from the TCP dump unit to the main memory. Thus, the accelerator_core as shown in figure 5-10 will emit HToC packets. The SystemVerilog wrapper (A) interprets these packets and forwards them over the socket interface to the BFMS, which has an internal memory model. The memory model mimics the main memory of the host system.


The TCP software receives new packets by polling the mailbox, and there are two possibilities to implement this:
• polling the memory model in BFMS
• using bfms_client_mirror and polling normal memory

Clients can poll the memory model of BFMS by sending read requests to the BFMS. The disadvantage of this approach is that it differs from what the software does with real hardware, where the mailboxes are mapped into the address space of the user level processes: there, new packets can be detected and their content read transparently, without calls to any functions. Consequently, it would be preferable if the software could operate in the same way with BFMS. The bfms_client_mirror permits this. It runs as an extra thread in the user level process and connects to BFMS to register itself for a range of memory. As a result, BFMS will forward all writes from the DUT to the specified memory range also to the bfms_client_mirror thread, which writes the data into the address space of the software, exactly like DMA hardware. Receiving of UDP and FAST data works the same way. BFMS supports multiple bfms_client_mirror threads concurrently.

Figure 5-11 summarizes the data flow in the whole setup. The red (curved) arrows symbolize the logical data flow, but all actual data and requests are transported by BFMS over the socket connections.
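A sketch of such a mirror thread; the bfms_* helper functions and the address range are invented for illustration, the real functionality is part of the single bfms_client C file mentioned above.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

extern void bfms_register_range(uint64_t base, uint64_t size);
extern int  bfms_recv_write(uint64_t *addr, void *data, size_t *len); /* blocking */

/* runs as an extra pthread: BFMS forwards every DUT write to the
 * registered range, and the thread copies it into the local mailbox
 * memory, exactly like DMA hardware would */
void *mirror_thread(void *arg)
{
    uint8_t *mailbox = arg;                          /* locally allocated buffer */
    const uint64_t base = 0x100000, size = 0x10000;  /* illustrative range */
    bfms_register_range(base, size);

    for (;;) {
        uint64_t addr;
        uint8_t  data[512 * 8];                      /* max payload: 512 quadwords */
        size_t   len = sizeof data;
        if (bfms_recv_write(&addr, data, &len) < 0)
            break;                                   /* connection closed */
        if (addr >= base && addr + len <= base + size)
            memcpy(mailbox + (addr - base), data, len);
    }
    return 0;
}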


Figure 5-11: Data Flow


Using the simulation environment requires that the different components are started in the right order: first BFMS has to be started, then the hardware simulator, which connects to BFMS. The clients, which make use of the DUT, can be started as soon as the simulation is running.

Protocol
BFMS and its clients use a straightforward protocol for communication, consisting of messages that are sent over the socket connection. Each message consists, as depicted in figure 5-12, of a 64 bit header and between 0 and 512 quadwords of payload. The header consists of the following fields:
• 1 byte: magic value 0x2A
• 1 byte: message type
• 1 byte: virtual channel
• 2 bytes: size
• 1 byte: source tag
• 2 bytes: reserved
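The header maps naturally onto a C structure. The sketch below follows the field list above; the exact byte order on the wire is an implementation detail that is not specified here, and the packed attribute is GCC/Clang specific.

#include <stdint.h>

/* one quadword of header, followed by 0..512 quadwords of payload */
struct bfms_header {
    uint8_t  magic;        /* always 0x2A */
    uint8_t  type;         /* 0x02 HT, 0x03 flow control login, 0xF1..0xF3 logins */
    uint8_t  vc;           /* virtual channel */
    uint16_t size;         /* payload size in quadwords (0..512) */
    uint8_t  src_tag;      /* matches responses to requests */
    uint8_t  reserved[2];
} __attribute__((packed));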


Figure 5-12: BFMS Message

After a client establishes a socket connection it has to send a login message. Based on the message type, BFMS detects the type of client. There are the following message types:
• 0x02: HT message (not a login message)
• 0x03: flow control login
• 0xF1: DUT login
• 0xF2: bfms_client login
• 0xF3: bfms_client_mirror login

The HT message type supports a message size of up to 64 byte, although the message format is designed to support bigger sizes; the extra bits are reserved for future extensions.

209 Trading Architecture and Simulation Environment

Figure 5-13 shows an example communication in which the DUT and a bfms_client log in to BFMS and the bfms_client performs a read and a write on the DUT.

[Message sequence between the simulator (DUT via the NHT client), BFMS and a bfms_client: both send login messages ({0x2A, 0xF1, ...} and {0x2A, 0xF2, ...}); a read request {0x2A, 0x2, 0x0, 0x0008, 0x13, ...} is forwarded through BFMS to the DUT, whose read response {value} travels back to the client; a write {0x2A, 0x2, 0x1, 0x0008, ...} {value2} is forwarded likewise]

Figure 5-13: Communication between BFMS and Clients

Ethernet Flow Control
The trading accelerator as shown in figure 5-9 also contains units for receiving UDP packets and for decoding FAST messages. These should also be usable in conjunction with BFMS. The simulation is much slower than real hardware, and therefore flow control is a problem: it is very easy for the test software to send so many packets to the hardware simulator that a very big backlog results. Two solutions to this problem are possible: slowing down the stream and flow control.

Slowing the stream down has the advantage that it is very easy to implement. Of course this directly raises the question of how slow the stream has to be so that the simulation can follow. This question is not easy to answer, but experience has shown that the simulation is between 10000 and 100000 times slower than the FPGA. However, slowing down the stream of UDP packets too much wastes a lot of time.


Therefore, flow control was implemented. This mechanism also makes use of BFMS: the TAP interface function at (B) in figure 5-10 informs BFMS when the queue reaches a certain fill level, and the program that replays UDP streams requests this information from BFMS.

5.4 Benchmark
The trading accelerator is designed to receive information over UDP and to send out orders over TCP, as depicted in figure 5-14. To evaluate the performance of the new low latency TCP implementation, an experiment was performed in the FPGA.


Figure 5-14: Benchmark Overview

Figure 5-15 shows the different components in the design and the path the data flows along (curved red line). The test program running on the host system waits for a UDP packet. As soon as a UDP packet is received, a 64 byte TCP packet is sent; the TCP connection was initiated beforehand. The following stages were observed with the Xilinx ChipScope [158] FPGA logic analyzer:
• (1) UDP packet arrives at the MAC
• (2) UDP header is written to main memory
• (3) TCP data arrives in the TCP send unit
• (4) data starts leaving the MAC

Thus, the overall time spent in the logic and software is 1363 ns. However, up to this point only the latency from MAC to MAC was measured; for an application it is of course more interesting how long the path from cable to cable takes overall. There are more components between MAC and cable, as can be seen in figure 5-16. The hardware used for this development is the Ventoux board as introduced in chapter 1.2.3. The board is not perfectly suited for 10G Ethernet, because its high speed serial links (SERDES) support only up to 6.6 Gbit/s, but modern 10G Ethernet networks make use of SFP+ cabling, which requires 10.3125 Gbit/s. Therefore, a board with a XAUI transceiver [152] was developed to translate between SFP+ and XAUI with 4x3.125 Gbit/s.


[Data path through the accelerator (curved red line): (1) UDP packet at the receiving Ethernet MAC, (2) UDP dump write to the host main memory, (3) TCP data arriving in the TCP send unit, (4) data leaving the transmitting Ethernet MAC; the timing brackets in the figure mark 806 ns and, overall, 1363 ns]

Figure 5-15: Benchmark Data Path

In figure 5-16 the complete path (curved red line) is shown; in the end only the latency between (A) and (C) is of interest. The latency through the host system, on the path between (B) and (D), was already measured, but it is not possible to measure the latency between (A) and (C) directly, due to the lack of special equipment. However, it is possible to measure the time between (D) and (B), in the direction of the cables, by directly connecting the XAUI transceivers at (A) and (C) with an SFP+ cable. The result is 582.4 ns. This is the combined time of both directions, and it is not possible to find out how the latency differs between the send and the receive path, but in the end it does not matter, because both latencies are part of the overall latency.


[Physical interface: each Ethernet MAC with its PCS/PMA in the FPGA connects over 4x3.125 Gbit/s XAUI serdes to an external XAUI transceiver, which drives the 1x10.3125 Gbit/s SFP+ cable; (A) and (C) mark the transceiver/cable side, (B) and (D) the FPGA side of the complete path]

Figure 5-16: Benchmark Physical Interface

The result is that the lowest possible latency that can be achieved with this setup, cable to cable, is 1363 ns + 582 ns = 1945 ns. A real trading application would have a higher latency due to decoding, calculation and encoding, which would have to be done in software. It is not possible to measure this exactly, but it seems a reasonable assumption that half of the time is used for the TCP sending.


Chapter 6: Conclusion

The questions of design abstraction, implementation of pipelined micro architectures and simulation of complex application specific hardware structures have been discussed in this thesis based on four different projects. The individual contributions of this work are an efficient generator for control and status register files (RFs), a novel method to inform system software about state changes in the register file, a very efficient address translation unit and a new simulation environment for the co-simulation of complex hardware systems.

This work introduces a register file generator (RFS) which is used successfully in multiple projects. A high level description language has been designed for this purpose. It is based on XML and provides the means for an abstract description of RF structures. The EXTOLL interconnection network makes extensive use of it and implements various register files (RFs) depending on the target platform. Target platforms are different FPGA boards with HyperTransport and PCI Express interfaces and also an ASIC. Although these RFs are different, they share the same input files with only very few differences, because parameterization allows reusing the same XML files for different targets. Parameterization was implemented using the standard C preprocessor markup language. The RF for the EXTOLL ASIC consists of a hierarchy of 73 single register files with 3967 elements like registers or counters, which are generated automatically by the RF generator. Thus, the design and verification time was reduced significantly. The Verilog code of this RF is 52607 lines long. Furthermore, RFS also generates C header files, Verilog header files and documentation.

Hierarchical decomposition is one of the special features of RFS. It allows the separation of an RF into different sub-RFs. It is possible to integrate the corresponding sub-RFs directly into specific units of big designs by using special RF interfaces. This has the big advantage that the routing of the signals is simpler and therefore the timing is better. Additionally, the logical separation also reduces the wiring effort in the Verilog files.


One of the unique features is that RFS supports the automatic integration of counters in the RF for various applications. These counters are used in many hardware designs to analyze problems and performance properties. There are three types of counters, one of which uses edge detection to increase the value of the counter. This allows shorter and less error prone Verilog code when certain events have to be counted. Furthermore, the generator allows using special hardware structures on FPGAs, which enable resource preserving counter implementations.

Special mechanisms were also implemented which allow hazard free implementations of error reporting registers. Software can use the XOR operation to clear error bits; by doing so read-modify-write issues can be avoided.

The next contribution deals with the question of what can be done to inform a host system faster in terms of latency, and more efficiently in terms of CPU time spent, about changes in the previously introduced RF. For this purpose a novel system was implemented which consists of a hardware unit called system notification queue (SNQ) and software that uses special functionality of RFS to generate microcode automatically based on XML specifications. The automatically generated code can be extended with more functionality. The microcode of the SNQ can be exchanged at runtime and is able to perform reads and writes on the RF. Furthermore, it is able to send data packets not only to the main memory but also over an interconnection network.

This unit is used within the EXTOLL NIC to implement an interrupt mechanism for the VELO and the RMA unit; each manages up to 256 queues and each of these can cause interrupts. Without the SNQ the hardware would issue an interrupt and the software would then perform multiple reads on the RF of the EXTOLL NIC to find out which sources caused the interrupts. Reading 256 bits from an RF requires four reads, which takes about 1 microsecond; with the SNQ this information is pushed to the main memory, and the access latency to the main memory is much smaller.

The SNQ can be used to read out different RF values after a hardware event with a very low latency between the reads. This permits a close temporal correlation between the event and the different values from the RF. Furthermore, it is able to reset counters or error indicators in the RF after reading them out.

The next contribution of this work is an address translation unit (ATU2) with high throughput and low latency. This hardware unit implements a pipeline architecture that uses the concept of the replay FIFO to resolve read-modify-write hazards, flow control and ordering problems. The replay FIFO stores pipeline instructions which cannot be executed, for resubmission. Thus, instructions “rotate” through the pipeline until the conditions for their execution are fulfilled. This design provides efficient interfaces to remove entries from the translation cache and the possibility to preload entries from software.

Finally, a new method to implement the complex TCP protocol for a low latency trading application (HFT) was demonstrated. It is a hybrid implementation that is separated into a hardware and a software part. The application requires sending data with low latency, thus the sending part is implemented in hardware; receiving is less performance critical and was implemented in software. This separation keeps the effort for the whole implementation low.

This complex software/hardware combination was tested with a novel combination of hardware simulation and in-system test. A software component called Bus Functional Model Server (BFMS) was developed which serves as a communication hub between the different components of the simulation. The infrastructure allows using the software, with small modifications, together with the hardware simulation. This solves two problems. First, it is possible to do full system tests and debugging with the hardware design running in the simulator; thus, the very limited debug possibilities in FPGAs and the time consuming generation of FPGA bit files can be avoided. Second, by using the hardware-software co-simulation it is not necessary to implement parts of the software again in the hardware test bench or verification environment, because the real software can be used instead. The result is a fast system test environment that can be used to improve the design cycle time for complex hardware designs.


Appendix A: Representation Methods

Throughout this thesis multiple different representation methods are employed for descriptions. These include:
• design space diagrams
• representation of numbers and values
• hardware modules and their connections

Design Space Diagrams

For the purpose of visualizing the discussion of different design options, design space diagrams are employed throughout this work. The representation method used for these was inspired by [35] and [36]. Figure A-1 gives an overview of the naming convention for the different parts of such a diagram.

[Legend: a Question? or Topic node with Criterion 1 and Criterion 2; considered options (Option A, Option B) with positive or negative assessments; the selected option leads to a conclusion or a following question]

Figure A-1: Design Space Diagram Naming

In figure A-2 an example of such a design space diagram is shown. The first question to be considered is “design space diagram?”, with the two options no and yes: either a design space diagram is used or not. Two criteria are considered in this case, on one side the effort it takes to create a design space diagram and on the other side the overview that it provides for the reader. After the positive decision to draw design space diagrams, a following question deals with the orientation in which design space diagrams should be drawn. In this case the criteria are placed below the options because there is enough space.

[Example: the question “design space diagram?” with the criteria effort and overview and the options no and yes; the following question “orientation?” with the options top to bottom (+ more stages, - fewer options) and left to right (+ less space required, - unusual)]

Figure A-2: Design Space Diagram Example

Presentation of Numbers

The Verilog format as described in [83] is used to represent numbers and bit patterns when it helps to prevent misunderstandings. Values of XML attributes are given in the same representation as in the XML files: "1"

Hardware Modules
Hardware modules in this work are drawn as abstract boxes and only in as much detail as necessary to support the text. The terms hardware unit and module are often used synonymously in this work. In figure A-3 two typical hardware modules and their connections are shown. Not all inputs and outputs are shown; especially the clock and reset inputs are omitted. Furthermore, other signals which are not significant for the descriptions are also not shown. The “1” in the yellow circle is used to mark points of special interest; text that refers to these will use (1).



Figure A-3: Two Modules

It is assumed that all hardware modules have registered outputs; thus the output registers of modules are only shown if they are of relevance. Furthermore, hardware modules are connected by wires, as can be seen in figure A-3. In some cases the width is given, as for wire1; in most cases the width is omitted because it is not relevant or because the width is one. The wires between two modules are often called synonymously in this work: bus, data path or signal. Some of the figures are more complex and wires cross; wires are only connected when the intersection is marked with a point, as can be seen in the case of A and B in figure A-4. The wires A and C in the figure are not connected.

[Figure A-4 content: three crossing wires A, B and C; A and B are joined by a connection point, A and C are not.]

Figure A-4: Connection Between Wires
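To relate these drawing conventions to HDL code, the two modules of figure A-3 could be described in Verilog roughly as sketched below; the port names and the placeholder logic are purely illustrative and are not taken from any actual design in this work.

    // Minimal structural sketch of figure A-3 (illustrative only).
    module moduleA (
        input             clk,      // clock input, omitted in the figures
        input             wire2_i,  // 1 bit input, driven by wire2
        output     [63:0] wire1_o   // 64 bit output, drives wire1
    );
        // registered output, as assumed for all modules in this work
        reg [63:0] out_r;
        always @(posedge clk)
            out_r <= {63'd0, wire2_i};  // placeholder logic
        assign wire1_o = out_r;
    endmodule

    module moduleB (
        input  [63:0] wire1_i,
        output        wire2_o
    );
        assign wire2_o = ^wire1_i;      // placeholder logic: parity of wire1
    endmodule

    module top (input clk);
        wire [63:0] wire1;  // width given explicitly, as in figure A-3
        wire        wire2;  // width of one, therefore omitted in the figures

        moduleA a (.clk(clk), .wire2_i(wire2), .wire1_o(wire1));
        moduleB b (.wire1_i(wire1), .wire2_o(wire2));
    endmodule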


Appendix B: Acronyms

ACL Access Control List
ASIC Application Specific Integrated Circuit
ATC Address Translation Cache
ATS Address Translation Service
ATU Address Translation Unit
BFM Bus Functional Model
BFMS Bus Functional Model Server
BRAM Block RAM (e.g. Xilinx SRAM)
CAG Computer Architecture Group (University of Heidelberg)
CAM Content Addressable Memory
CbA Correct by Application
CPP C Pre-Processor
CPU Central Processing Unit
DED double error detected
DMA Direct Memory Access
DSL Domain Specific Language
DSP Digital Signal Processing
DUT Design Under Test
EDA Electronic Design Automation
EXTOLL EXTOLL R2 (Extended ATOLL)
EXU EXtra Unit (unit that can be attached to EXTOLL)
FIFO First In, First Out
FPGA Field Programmable Gate Array


FU Functional Unit
GAT Global Address Table
GAST Global Address Sub Table
HDL Hardware Description Language
HFT High Frequency Trading
HPC High Performance Computing
HT HyperTransport
HToC HyperTransport On Chip
HVL Hardware Verification Language
IOMMU I/O Memory Management Unit
IP Internet Protocol
kB kilobyte
kbit kilobit
LRU Least Recently Used
LSB Least Significant Bits
MCE Machine Check Exception
ME Microcode Engine
MMU Memory Management Unit
MSB Most Significant Bits
MSI Message Signaled Interrupt
MTU Maximum Transmission Unit
MUX Multiplexer
NHT Network HyperTransport
NIC Network Interface Controller
NLA Network Logical Address
NLP Network Logical Page
NLPA Network Logical Page Address
OS Operating System
PA Physical Address
PCI Peripheral Component Interconnect


PCIe PCI Express
PS Prepare Store
RAM Random Access Memory
RC Register file Cell
RDL Register Definition Language
RDMA Remote Direct Memory Access
RF Register File (plural: RFs)
RFS Register File System
RMA Remote Memory Access
RST reStructuredText
RTL Register Transfer Logic
SEC single error corrected
SNQ System Notification Queue
SNQC SNQ Compiler
SNQM SNQ Mailbox
SOC System on a Chip
SPADIC Self-Triggered Pulse Amplification and Digitization ASIC
SRAM Static Random Access Memory
TC Trigger Control
TCP Transmission Control Protocol
TDG Trigger Dump Group
TE Trigger Event
TLB Translation Lookaside Buffer
UVM Universal Verification Methodology
UVC Universal Verification Component
VA Virtual Address
VC Virtual Channel
VELO Virtualized Engine for Low Overhead
VHDL VHSIC Hardware Description Language
VHSIC Very High Speed Integrated Circuits


VPID Virtual Process Identifier
VPN Virtual Private Network
WCB Write Combining Buffer
XBAR Crossbar
XML Extensible Markup Language

Appendix C: List of Figures

Figure 1-1: Microprocessor Transistor Counts 1971-2011 (from [93])
Figure 1-2: Generic Design Environment
Figure 1-3: HToC Packet
Figure 1-4: HToC 64 and 128 Bit Wide
Figure 1-5: Ventoux FPGA Board
Figure 1-6: Galibier FPGA Board
Figure 1-7: HPC Network Topologies
Figure 1-8: Direct and Indirect Networks
Figure 1-9: 2D Torus
Figure 1-10: EXTOLL NIC
Figure 1-11: RMA Get
Figure 1-12: RMA Put
Figure 1-13: RMA: Three Units
Figure 1-14: Trading Setup
Figure 2-1: Control and Status Register
Figure 2-2: Register File
Figure 2-3: RF CSR Functionality
Figure 2-4: RF Address Map
Figure 2-5: RF Workflow
Figure 2-6: RF Environment
Figure 2-7: FPGA Delay Evaluation
Figure 2-8: Delay Visualization with RF in the Middle
Figure 2-9: Multi Cycle
Figure 2-10: Hierarchical Decomposition
Figure 2-11: Output Files
Figure 2-12: Build and Test Flow
Figure 2-13: Software and Hardware View on the RF
Figure 2-14: Read Access
Figure 2-15: Write Access
Figure 2-16: XML Legend
Figure 2-17: Hierarchy Levels
Figure 2-18: Placement RF_Wrapper
Figure 2-19: Example Address Space
Figure 2-20: Hierarchy of RFS Elements

Figure 2-21: hwreg within reg64
Figure 2-22: hwreg red
Figure 2-23: Enable Bits
Figure 2-24: hwreg Assert Logic
Figure 2-25: hwreg with sticky and sw_read_clr
Figure 2-26: Counter Type 1
Figure 2-27: Counter Type 2
Figure 2-28: Counter Type 3
Figure 2-29: Example reg64
Figure 2-30: Example reg64 Block Diagram
Figure 2-31: Address Setup with aligner
Figure 2-32: Ramblock Interface Example
Figure 2-33: addr_shift of 9
Figure 2-34: External Ramblock and Arbiter
Figure 2-35: Error Logging FIFO
Figure 2-36: Comparison with and without External Regroots
Figure 2-37: EXTOLL Partitions
Figure 2-38: Physical Placement Illustration
Figure 2-39: Repeat Block Alignment
Figure 2-40: Trigger Tree
Figure 2-41: RFS Workflow with CPP
Figure 2-42: HT to RF Converter
Figure 2-43: Debug Port
Figure 2-44: Shift Register Read/Write Chain
Figure 2-45: RF Hierarchy of EXTOLL
Figure 2-46: EXTOLL PlanAhead Screenshot
Figure 2-47: Delay Element
Figure 2-48: Placement Delay Module
Figure 2-49: Crossbar RF Structure
Figure 2-50: Xilinx Virtex 6 SLICEL
Figure 2-51: Slices Close up View
Figure 2-52: Delay Element for Multi Cycle
Figure 3-1: Overview
Figure 3-2: Design Space: RF Read Method
Figure 3-3: Schematic Overview
Figure 3-4: Possible Locations for the SNQ
Figure 3-5: Design Space: SNQ Connection Possibilities
Figure 3-6: Design Space: Type of SNQ Control
Figure 3-7: Simple Microcode Engine
Figure 3-8: Vertical and Horizontal Instruction Format
Figure 3-9: SNQ Environment Overview
Figure 3-10: SNQ Control Flow
Figure 3-11: Extended HT to RF Converter
Figure 3-12: Trigger Control Logic

Figure 3-13: Schematic of the Trigger Control Logic
Figure 3-14: Microcode Engine
Figure 3-15: Register File Handler
Figure 3-16: HT Send Unit
Figure 3-17: SNQ Mailbox Layout
Figure 3-18: Partly Filled Mailbox Entry
Figure 3-19: Components Involved in an Interrupt
Figure 3-20: PCIe Core: Data and Interrupt Path
Figure 3-21: Interrupt Hazard
Figure 3-22: General Instruction Format
Figure 3-23: SNQ Instruction: JUMP_ARBITER
Figure 3-24: SNQ Instruction: JUMP
Figure 3-25: SNQ Instruction: LOAD_PS
Figure 3-26: SNQ Instruction: WRITE_SNQM
Figure 3-27: SNQ Instruction: HT_RAW
Figure 3-28: SNQ Instruction: WRITE_RF
Figure 3-29: Microcode Generation Flow
Figure 3-30: Remote Notifications with VELO
Figure 4-1: Get Operation
Figure 4-2: Put Operation
Figure 4-3: Design Space: Translation Table Location
Figure 4-4: Contention
Figure 4-5: Data Flow
Figure 4-6: Flush and Fence
Figure 4-7: TLB Index
Figure 4-8: Design Space: Page Size Encoding
Figure 4-9: Five Possible NLP Sizes
Figure 4-10: Translation Request
Figure 4-11: Translation Response
Figure 4-12: Design Space: Interfaces of Requesting Units
Figure 4-13: ATU2 Environment
Figure 4-14: HToC ATSReq
Figure 4-15: HToC ATSResp
Figure 4-16: Fence Request
Figure 4-17: Fence Request over HTAX
Figure 4-18: Fence Done over HTAX
Figure 4-19: GAT Table Read
Figure 4-20: Translation
Figure 4-21: Command FIFO
Figure 4-22: Design Space: Informing the Software
Figure 4-23: Flush Commands
Figure 4-24: Notification and Fence Commands
Figure 4-25: Design Space: Preload
Figure 4-26: Sequence Diagram: Pseudo Request

Figure 4-27: Two Threads and Intermixed Writes
Figure 4-28: Design Space: Preload Interface
Figure 4-29: Virtual Address Space for Preloading
Figure 4-30: Preload Command Format
Figure 4-31: TLB Read-Modify-Write
Figure 4-32: Replay FIFO
Figure 4-33: Design Space: Replacement Strategy
Figure 4-34: Optimized Five Cycle RMA/ATU2 Translation Path
Figure 4-35: ATU2 Implementation Details
Figure 4-36: HT Inport Data Flow
Figure 4-37: Dependency and Flow Control
Figure 4-38: Execute Unit
Figure 4-39: Generic Instruction Format
Figure 4-40: Instruction Format Using Units
Figure 4-41: Flush NLPA Instruction
Figure 4-42: Flush VPID Instruction
Figure 4-43: Flush IDX Instruction
Figure 4-44: Fence Instruction
Figure 4-45: Notification Instruction
Figure 4-46: ATS Request Instruction
Figure 4-47: Memory Read Instruction
Figure 4-48: Read Response Instruction
Figure 4-49: ATS Response Instruction
Figure 4-50: Preload Instruction
Figure 4-51: Flush Context Instruction
Figure 4-52: Request Flow Diagram
Figure 4-53: HT Outport
Figure 5-1: Processing Stages in the Trading Server
Figure 5-2: Generic Test Bench
Figure 5-3: Test Bench with Reference Model
Figure 5-4: Verification Environment
Figure 5-5: Design Space: Types of Possible Implementations
Figure 5-6: Hybrid TCP Stack Components
Figure 5-7: Device Environment
Figure 5-8: TAP Device
Figure 5-9: Overview Accelerator Architecture
Figure 5-10: BFMS Simulation Environment
Figure 5-11: Data Flow
Figure 5-12: BFMS Message
Figure 5-13: Communication between BFMS and Clients
Figure 5-14: Benchmark Overview
Figure 5-15: Benchmark Data Path
Figure 5-16: Benchmark Physical Interface
Figure A-1: Design Space Diagram Naming

Figure A-2: Design Space Diagram Example
Figure A-3: Two Modules
Figure A-4: Connection Between Wires

Appendix D: List of Tables

Table 1-1: Xilinx Virtex 5-7 Announcements
Table 2-1: sw/hw Access hwreg
Table 2-2: Resource Usage on Virtex 4: Counters
Table 2-3: Resource Usage on Virtex 6: Counters
Table 2-4: Resource Usage Comparison for the rreinit Feature
Table 2-5: Resource Usage Comparison for the rreinit Feature
Table 2-6: sw/hw Access Ramblock
Table 2-7: Statistics for Two EXTOLL RFs
Table 2-8: Crossbar Resource Usage
Table 3-1: Program Flow
Table 3-2: SNQ Resource Requirements
Table 4-1: Comparison Support for 4 and 5 NLP Sizes
Table 4-2: Size Consideration for a Single TLB Set
Table 4-3: ATU2 Performance Characteristics
Table 4-4: ATU2 Resource Requirements

References


[1] David V. Schuehler and John W. Lockwood; A Modular System for FPGA-Based TCP Flow Processing in High-Speed Networks; Proceedings 14th International Conference, Field Programmable Logic and Applications (FPL), 2004, Leuven, Belgium.
[2] W. Richard Stevens, Gary R. Wright; TCP/IP Illustrated II: The Implementation; Addison-Wesley Longman, ISBN 978-0201633542, 1995, Amsterdam.
[3] AMD BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h Processors; http://support.amd.com/us/Processor_TechDocs/31116.pdf, retrieved Oct. 2011.
[4] Extensible Markup Language (XML) 1.0 (Fifth Edition); http://www.w3.org/TR/2008/PER-xml-20080205/, retrieved Oct. 2011.
[5] Floyd Wilder; A Guide to the TCP/IP Protocol Suite, Second Edition; Artech House telecommunications library, 1998.
[6] Mondrian Nüßle; Acceleration of the Hardware - Software Interface of a Communication Device for Parallel Systems; Dissertation, Ph.D. Thesis, University of Mannheim, 2008.
[7] Douglas E. Comer; Internetworking with TCP/IP, Vol. 1: Principles, Protocols, and Architectures, Third Edition; Prentice Hall, 1995, New Jersey, USA.
[8] Mikrocontroller.net website: FPGA Soft Core; http://www.mikrocontroller.net/articles/FPGA_Soft_Core, retrieved Nov. 2011.
[9] Xilinx Virtex-6 FPGA Memory Resources - User Guide - UG363 (v1.6) April 22, 2011; http://www.xilinx.com/support/documentation/user_guides/ug363.pdf, retrieved Nov. 2011.
[10] Xilinx Virtex-6 Family Overview - DS150 (v2.3) March 24, 2011; http://www.xilinx.com/support/documentation/data_sheets/ds150.pdf, retrieved Nov. 2011.
[11] Greg Buzzard, David Jacobson, Milon Mackey, Scott Marovich and John Wilkes; An implementation of the Hamlyn sender-managed interface architecture; OSDI '96, Proceedings of the second USENIX symposium on Operating systems design and implementation, pages 245-259, 1996.
[12] Matt Welsh, Anindya Basu, Thorsten von Eicken; Incorporating Memory Management into User-Level Network Interfaces; Proceedings of Hot Interconnects V, 1997.


[13] Ben Leslie, Peter Chubb, Nicholas Fitzroy-Dale, Stefan Götz, Charles Gray, Luke Macpherson, Daniel Potts, Yueting Shen, Kevin Elphinstone, Gernot Heiser; User-level Device Drivers: Achieved Performance; Journal of Computer Science and Technology, 2005.
[14] David Slogsnat, Alexander Giese, Mondrian Nuessle and Ulrich Bruening; An Open-Source HyperTransport Core; ACM Transactions on Reconfigurable Technology and Systems (TRETS), Volume 1, Issue 3, 2008.
[15] Todd M. Austin, Gurindar S. Sohi; High-bandwidth address translation for multiple-issue processors; 23rd Annual International Symposium on Computer Architecture (ISCA'96), 1996, Philadelphia, Pennsylvania, United States.
[16] MicroBlaze Soft Processor Core; http://www.xilinx.com/tools/microblaze.htm, retrieved Nov. 2011.
[17] MicroBlaze Soft Processor FAQs; http://www.xilinx.com/products/design_resources/proc_central/microblaze_faq.pdf, retrieved Nov. 2011.
[18] Roman Lysecky, Frank Vahid; Design and implementation of a MicroBlaze-based warp processor; ACM Transactions on Embedded Computing Systems (TECS), 2009, New York, NY, USA.
[19] Xilinx Virtex-6 FPGA Integrated Block for PCI Express User Guide, UG517 (v5.1) September 21, 2010; http://www.xilinx.com/support/documentation/user_guides/v6_pcie_ug517.pdf, retrieved Apr. 2012.
[20] Host/Target specific installation notes for GCC; http://gcc.gnu.org/install/specific.html, retrieved Nov. 2011.
[21] Amir Hekmatpour, Robert Devins, Dave Roberts, Michael Hale; An Integrated Methodology for SoC Design, Verification, and Application Development; Proceedings of The International Embedded Solution GSPx Conf., 2004.
[22] Wilson Snyder; 505 Registers or Bust; 2001 Boston SNUG, available from http://www.veripool.org/papers/vregs_snug.pdf, retrieved Nov. 2011.
[23] Vregs website; http://www.veripool.org/wiki/vregs, retrieved Nov. 2011.
[24] Daniel Mattsson, Marcus Christensson; Evaluation of synthesizable CPU cores; Master Thesis, Computer Science and Engineering Program, Chalmers University of Technology, Department of Computer Engineering, 2004, Gothenburg, Sweden.
[25] Mondrian Nüssle, Martin Scherer, Ulrich Brüning; A resource optimized remote-memory-access architecture for low-latency communication; The 38th International Conference on Parallel Processing (ICPP-2009), 2009, Vienna, Austria.
[26] Apostolos Dollas, Ioannis Ermis, Iosif Koidis, Ioannis Zisis, Christopher Kachris; An Open TCP/IP Core for Reconfigurable Logic; 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'05), pages 297-298, 2005.


[27] Henry S. Warren; Hacker's Delight; Addison-Wesley Professional, ISBN 978-0201914658, 2002.
[28] FIX Adapted for STreaming - FAST Protocol; http://fixprotocol.org/fast, retrieved May 2012.
[29] Gordon E. Moore; Cramming more components onto integrated circuits; Electronics, Volume 38, Number 8, 1965.
[30] HyperTransport I/O Link Specification Revision 3.10c; http://www.hypertransport.org/docs/twgdocs/HTC20051222-0046-0035.pdf, retrieved Nov. 2011.
[31] Denali Blueprint website; http://www.denali.com/en/products/blueprint.jsp, retrieved Nov. 2011.
[32] D. Crockford; The application/json Media Type for JavaScript Object Notation (JSON); IETF RFC 4627, retrieved Nov. 2011.
[33] JSON.org website; http://json.org/, retrieved Nov. 2011.
[34] Duolog Socrates, Bitwise website; http://www.duolog.com/products/bitwise/, retrieved Nov. 2011.
[35] Allan MacLean, Richard Young, Victoria Bellotti and Thomas Moran; Design Space Analysis: Bridging from Theory to Practice via Design Rationale; Proceedings of Esprit '91, pages 720-730, 1991, Brussels, Belgium.
[36] Allan MacLean, Richard M. Young, Victoria M.E. Bellotti & Thomas P. Moran; Questions, Options, and Criteria: Elements of Design Space Analysis; Journal of Human-Computer Interaction, Volume 6, Issue 3, pages 201-250, 1991.
[37] Markus Müller; Exploring the Testability Methodology and the Development of Test and Debug Functions for a Complex Network ASIC; Diploma Thesis presented to the Computer Engineering Department, University of Heidelberg, 2011.
[38] Heiner Litz, Holger Fröning, Ulrich Bruening; A Novel Framework for Flexible and High Performance Networks-on-Chip; Fourth Workshop on Interconnection Network Architectures: On-Chip, Multi-Chip (INA-OCMC) in conjunction with Hipeac, 2010, Pisa, Italy.
[39] Gordon E. Moore; Progress in Digital Integrated Electronics; Technical Digest, IEEE International Electron Devices Meeting 21, 1975.
[40] Computer Architecture Group website; http://ra.ziti.uni-heidelberg.de/index.php, retrieved Dec. 2011.
[41] Holger Fröning, Mondrian Nüssle, David Slogsnat, Heiner Litz, Ulrich Brüning; The HTX-Board: A Rapid Prototyping Station; 3rd annual FPGAworld Conference, 2006, Stockholm, Sweden.


[42] Ulrich Brüning, Holger Fröning, Patrick R. Schulz, Lars Rzymianowicz; ATOLL: Performance and Cost Optimization of a SAN Interconnect; IASTED Conference: Parallel and Distributed Computing and Systems (PDCS), 2002, Cambridge, USA.
[43] Ajay V. Bhatt; Creating a Third Generation I/O Interconnect; Desktop Architecture Labs, Intel Corporation, http://all-electronics.de/ai/resources/63bc6da9d93.pdf, retrieved Dec. 2011.
[44] 82559ER Fast Ethernet PCI Controller Datasheet; http://pdos.csail.mit.edu/6.828/2008/readings/82559ER_datasheet.pdf, retrieved Dec. 2011.
[45] Press release "Denali's Blueprint Employed by Atheros to Enhance SoC Design Productivity and Enable Rapid IP Reusability"; http://www.design-reuse.com/news/17902/esl-methodology.html, retrieved Dec. 2011.
[46] SystemRDL v1.0: A specification for a Register Description Language; Register Description Working Group, The SPIRIT Consortium; http://www.accellera.org/downloads/standards/SystemRDL_1.0.zip, retrieved Dec. 2011.
[47] Jan Decaluwe; MyHDL: a python-based hardware description language; Linux Journal, Issue 127, 2004, ISSN 1075-3583, http://www.linuxjournal.com/article/7542, retrieved Dec. 2011.
[48] Heiner Litz; Improving the Scalability of High Performance Computer Systems; Dissertation, Ph.D. Thesis, University of Mannheim, 2010.
[49] Altera website: Stratix IV; http://www.altera.com/devices/fpga/stratix-fpgas/stratix-iv/stxiv-index.jsp, retrieved Dec. 2011.
[50] Altera website: Stratix V; http://www.altera.com/devices/fpga/stratix-fpgas/stratix-v/stxv-index.jsp, retrieved Dec. 2011.
[51] Madhusudhan Talluri, Mark D. Hill; Surpassing the TLB performance of superpages with less operating system support; ASPLOS-VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems, ACM New York, 1994, NY, USA.
[52] William Lee Irwin III; A 2.5 Page Clustering Implementation; Proceedings of the Linux Symposium, 2003, Ottawa, Ontario, Canada.
[53] Press release: Cypress Incorporates Denali Blueprint for CSR Automation Management; http://www.soccentral.com/results.asp?EntryID=26099, retrieved Dec. 2011.
[54] SystemVerilog website; http://www.systemverilog.org/, retrieved Dec. 2011.
[55] SynaptiCAD website: V2V; http://www.syncad.com/verilog_vhdl_translator.htm, retrieved Dec. 2011.
[56] X-Tek Corporation website: X-HDL; http://www.x-tekcorp.com/xhdl.html, retrieved Dec. 2011.


[57] Doxygen website: Doxygen Documentation, Generate documentation from source code; http://www.stack.nl/~dimitri/doxygen/index.html, retrieved Jan. 2012.
[58] Docbook website; http://www.docbook.org/, retrieved Jan. 2012.
[59] Norman Walsh, Leonard Muellner; DocBook: the definitive guide; O'Reilly Media, ISBN 1-56592-580-7, 1999.
[60] W. Richard Stevens; Advanced Programming in the UNIX Environment; Addison Wesley, ISBN 0-201-56317-7, 1993.
[61] Bruce Schneier; Applied Cryptography: Protocols, Algorithms, and Source Code in C, Second Edition; Wiley, ISBN 978-0471117094, 1996.
[62] Igor Pavlov; LZMA SDK (Software Development Kit); http://www.7-zip.org/sdk.html, retrieved Jan. 2012.
[63] HyperTransport Consortium website; http://www.hypertransport.org/default.cfm?page=Technology, retrieved Jan. 2012.
[64] René J. Glaise; A two-step computation of cyclic redundancy code CRC-32 for ATM networks; IBM Journal of Research and Development, Volume 41, Issue 6, pages 705-710, 1997.
[65] T. Grötker, S. Liao, G. Martin, S. Swan; System Design with SystemC; Springer, ISBN 1-4020-7072-1, 2002.
[66] Catapult C website; http://www.mentor.com/esl/catapult/details, retrieved Jan. 2012.
[67] Grant Martin, Gary Smith; High-Level Synthesis: Past, Present, and Future; Design & Test of Computers, IEEE, 2009.
[68] QT website; http://qt.nokia.com/, retrieved Jan. 2012.
[69] libxml2 website: The XML C parser and toolkit of Gnome; http://xmlsoft.org/, retrieved Jan. 2012.
[70] Xilinx: ISE Design Suite Overview; http://www.xilinx.com/support/documentation/sw_manuals/xilinx13_3/ise_c_overview.htm, retrieved Jan. 2012.
[71] Xilinx 7 Series FPGAs Overview; http://www.xilinx.com/support/documentation/data_sheets/ds180_7Series_Overview.pdf, retrieved Jan. 2012.
[72] William J. Dally, Charles L. Seitz; Deadlock Free Message Routing in Multiprocessor Interconnection Networks; IEEE Transactions on Computers, Volume 36, Issue 5, pages 547-553, May 1987.
[73] Bjarne Stroustrup; The C++ Programming Language; Addison-Wesley Professional, ISBN 978-0201889543, 1997.
[74] Jorge Ortiz; Synthesis Techniques for Semi-Custom Dynamically Reconfigurable Superscalar Processors; Ph.D. Thesis, Electrical Engineering & Computer Science, University of Kansas, 2009.


[75] P. Leong, M. Leong, O. Cheung, T. Tung, C. Kwok, M. Wong, and K. Lee; Pilchard - a platform with memory slot interface; Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2001.
[76] Ben Juurlink; Approximating the optimal replacement algorithm; CF '04 Proceedings of the 1st conference on Computing frontiers, 2004.
[77] Patrick Mochel; The sysfs Filesystem; Proceedings of the 2005 Linux Symposium, 2005, Ottawa, Canada.
[78] J. Levesque, J. Larkin, M. Foster, J. Glenski, G. Geissler, S. Whalen, B. Waldecker, J. Carter, D. Skinner, H. He, H. Wasserman, J. Shalf, H. Shan, and E. Strohmaier; Understanding and Mitigating Multicore Performance Issues on the AMD Opteron Architecture; Lawrence Berkeley National Laboratory, 2007, http://escholarship.org/uc/item/9k38d8ms, retrieved Jan. 2012.
[79] Richard Stevens; UNIX Network Programming, Volume 1, Second Edition: Networking APIs: Sockets and XTI; Prentice Hall, ISBN 978-0134900124, 1998.
[80] Ellen Siever, Nathan Patwardhan, Stephen Spainhour; PERL in a Nutshell; O'Reilly & Associates, USA, ISBN 978-0596002411, 2002.
[81] Timo Reubold; Design, Implementation and Verification of a PCI Express to HyperTransport Protocol Bridge; Diploma Thesis presented to the Computer Engineering Department, University of Mannheim, 2008.
[82] Mark D. Hill; A Case for Direct-Mapped Caches; Computer, Volume 21, Issue 11, pages 25-40, 1988.
[83] Stuart Sutherland; Verilog HDL Quick Reference Guide; Sutherland HDL, Inc., 2001, http://www.sutherland-hdl.com/online_verilog_ref_guide/verilog_2001_ref_guide.pdf, retrieved Jan. 2012.
[84] Chris Sullivan, Alex Wilson, Stephen Chappell; Using C based logic synthesis to bridge the productivity gap; ASP-DAC '04 Proceedings of the Asia and South Pacific Design Automation Conference, 2004.
[85] LogiCORE IP DSP48 Macro v2.0; http://www.xilinx.com/support/documentation/ip_documentation/dsp48_macro_ds754.pdf, retrieved Jan. 2012.
[86] Virtex-6 FPGA DSP48E1 Slice, User Guide, UG369 (v1.3) February 14, 2011; http://www.xilinx.com/support/documentation/user_guides/ug369.pdf, retrieved Jan. 2012.
[87] Altera, Stratix IV Device Handbook, Volume 1; http://www.altera.com/literature/hb/stratix-iv/stratix4_handbook.pdf, retrieved Jan. 2012.
[88] Tim Armbruster, Peter Fischer and Ivan Peric; SPADIC - A Self-Triggered Pulse Amplification and Digitization ASIC; Nuclear Science Symposium Conference Record, 2010, Knoxville, USA.


[89] Heiner Litz, Holger Fröning, Mondrian Nüssle, Ulrich Brüning; VELO: A Novel Communication Engine for Ultra-low Latency Message Transfers; 37th International Conference on Parallel Processing (ICPP-08), September 8-12, 2008, Portland, Oregon, USA.
[90] Mondrian Nüssle, Benjamin Geib, Holger Fröning, Ulrich Brüning; An FPGA-based custom high performance interconnection network; 2009 International Conference on ReConFigurable Computing and FPGAs, December 9-11, 2009, Cancun, Mexico.
[91] Douglas Perry; VHDL: Programming By Example; McGraw-Hill Professional, 4th edition, ISBN 978-0071400701, 2002.
[92] Sphinx website; http://sphinx.pocoo.org/, retrieved Jan. 2012.
[93] Wikipedia: Transistor Count and Moore's Law - 2011; http://en.wikipedia.org/wiki/File:Transistor_Count_and_Moore%27s_Law_-_2011.svg, retrieved Jan. 2012.
[94] C99 Standard, ISO/IEC 9899; draft available from http://www.open-std.org/jtc1/sc22/WG14/www/docs/n1256.pdf, retrieved Jan. 2012.
[95] Spirit IP-XACT Standard, IEEE 1685-2009; IEEE Standards Association Corporate Advisory Group, ISBN 978-0-7381-6160-0, 2012, New York, USA.
[96] Getting Started with OVM, Tutorial 3 - The OVM Register Package; http://www.doulos.com/knowhow/sysverilog/ovm/tutorial_rgm_1/, retrieved Jan. 2012.
[97] The SPADIC Project website; http://spadic.uni-hd.de/, retrieved Jan. 2012.
[98] Heiner Litz, Christian Leber, Benjamin Geib; DSL Programmable Engine for High Frequency Trading Acceleration; 4th Workshop on High Performance Computational Finance at SC11 (WHPCF 2011), co-located with SC11, November 13th, 2011, Seattle, USA.
[99] Christian Leber, Benjamin Geib, Heiner Litz; High Frequency Trading Acceleration using FPGAs; 21st International Conference on Field Programmable Logic and Applications (FPL 2011), September 5-7, 2011, Chania, Greece.
[100] Jose Duato, Sudhakar Yalamanchili, Lionel Ni; Interconnection Networks; The Morgan Kaufmann Series in Computer Architecture and Design, ISBN 1558608524, 2002.
[101] Cadence website, Adam Sherer: Scalable OVM Register and Memory Package; http://www.cadence.com/Community/blogs/fv/archive/2009/02/05/scalable-ovm-register-and-memory-package.aspx, retrieved Jan. 2012.
[102] Xilinx Virtex-6 Libraries Guide for HDL Designs, UG623 (v12.3) September 21, 2010; http://www.xilinx.com/support/documentation/sw_manuals/xilinx12_3/virtex6_hdl.pdf, retrieved Feb. 2012.


[103] Wei Hwang, Rajiv V. Joshi and Walter H. Henkels; A 500-MHz, 32-Word 64-Bit, Eight-Port Self-Resetting CMOS Register File; IEEE Journal of Solid-State Circuits, Volume 34, Issue 1, 1999.
[104] Xilinx PlanAhead User Guide, UG632 (v13.3) October 19, 2011; http://www.xilinx.com/support/documentation/sw_manuals/xilinx13_3/PlanAhead_UserGuide.pdf, retrieved Feb. 2012.
[105] XtremeDSP for Virtex-4 FPGAs User Guide, UG073 (v2.7) May 15, 2008; http://www.xilinx.com/support/documentation/user_guides/ug073.pdf, retrieved Feb. 2012.
[106] reStructuredText - Markup Syntax and Parser Component of Docutils; http://docutils.sourceforge.net/rst.html, retrieved Feb. 2012.
[107] pdfTeX website; http://www.tug.org/applications/pdftex/, retrieved Feb. 2012.
[108] PDTi - Register management simplified website; http://www.productive-eda.com/register-management/, retrieved Feb. 2012.
[109] Agnisys - IDesignSpec; http://agnisys.com/idesignspec, retrieved Feb. 2012.
[110] GenSys Registers; http://www.atrenta.com/solutions/gensys-family/gensys-registers.htm, retrieved Feb. 2012.
[111] Semifore CSRCompiler; http://www.semifore.com/index.php?option=com_content&view=article&id=1&Itemid=10, retrieved Feb. 2012.
[112] Xilinx Virtex-6 FPGA Configurable Logic Block User Guide, UG364 (v1.2) February 3, 2012; http://www.xilinx.com/support/documentation/user_guides/ug364.pdf, retrieved Feb. 2012.
[113] Schemas of The SPIRIT Consortium; http://www.accellera.org/XMLSchema/SPIRIT, retrieved Feb. 2012.
[114] Benjamin Geib; Hardware Support for Efficient Packet Processing; Dissertation, Ph.D. Thesis, University of Mannheim, 2012.
[115] Wido Kruijtzer, Pieter van der Wolf, Erwin de Kock, Jan Stuyt, Wolfgang Ecker, Albrecht Mayer, Serge Hustin, Christophe Amerijckx, Serge de Paoli, Emmanuel Vaumorin; Industrial IP Integration Flows based on IP-XACT Standards; DATE '08 Proceedings of the conference on Design, automation and test in Europe, 2008.
[116] Màrius Montón, Antoni Portero, Marc Moreno, Borja Martínez, Jordi Carrabina; Mixed SW/SystemC SoC Emulation Framework; IEEE International Symposium on Industrial Electronics, 2007.
[117] Jing-Wun Lin, Chen-Chieh Wang, Chin-Yao Chang, Chung-Ho Chen, Kuen-Jong Lee, Yuan-Hua Chu, Jen-Chieh Yeh, Ying-Chuan Hsiao; Full System Simulation and Verification Framework; Proceedings of the 2009 Fifth International Conference on Information Assurance and Security, Volume 1, pages 165-168, 2009.


[118] Lattice Mico32 website; http://www.latticesemi.com/products/intellectualproperty/ipcores/mico32/, retrieved Feb. 2012.
[119] Xilinx Virtex-6 FPGA Data Sheet: DC and Switching Characteristics, DS152 (v3.4) January 12, 2012; http://www.xilinx.com/support/documentation/data_sheets/ds152.pdf, retrieved Feb. 2012.
[120] Subversion - an open-source version control system; http://subversion.apache.org/, retrieved Feb. 2012.
[121] Bradford Nichols, Dick Buttlar, Jacqueline Proulx Farrell; Pthreads Programming: A POSIX Standard for Better Multiprocessing (O'Reilly Nutshell); ISBN 1565921151, O'Reilly Media, 1996.
[122] Dan Clein; CMOS IC Layout: Concepts, Methodologies, and Tools; ISBN 0750671947, Newnes, 2000.
[123] Micron M25P64 datasheet; http://www.micron.com/~/media/Documents/Products/Data%20Sheet/NOR%20Flash/5980M25P64.ashx, retrieved Feb. 2012.
[124] Steen Larsen, Parthasarathy Sarangam, Ram Huggahalli and Siddharth Kulkarni; Architectural Breakdown of End-to-End Latency in a TCP/IP Network; International Journal of Parallel Programming, Springer Netherlands, 2009.
[125] Jeffrey C. Mogul; TCP offload is a dumb idea whose time has come; HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems, Volume 9, 2003.
[126] Lattice PCI Express Endpoint IP Core (x1, x4) website; http://www.latticesemi.com/products/intellectualproperty/ipcores/pciexpress.cfm, retrieved Mar. 2012.
[127] Maxim Krasnyansky; Universal TUN/TAP device driver; http://kernel.org/doc/Documentation/networking/tuntap.txt, retrieved Mar. 2012.
[128] AMBA Open Specifications website; http://www.arm.com/products/system-ip/amba/amba-open-specifications.php, retrieved Mar. 2012.
[129] Xilinx Zynq-7000 Extensible Processing Platform Block Diagram and Product Features; http://www.xilinx.com/technology/roadmap/zynq7000/features.htm, retrieved Mar. 2012.
[130] Narayanan Ganapathy and Curt Schimmel; General Purpose Operating System Support for Multiple Page Sizes; USENIX Annual Technical Conference (NO 98), 1998.
[131] Madhusudhan Talluri, Shing Kong, Mark D. Hill, David A. Patterson; Tradeoffs in supporting two page sizes; ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture, ACM New York, 1992, NY, USA.
[132] R.M. Tomasulo; An Efficient Algorithm for Exploiting Multiple Arithmetic Units; IBM Journal of Research and Development, Volume 11, Issue 1, pages 25-33, 1967.


[133] Mitsuo Yokokawa, Fumiyoshi Shoji, Atsuya Uno, Motoyoshi Kurokawa, Tadashi Watanabe; The K computer: Japanese next-generation supercomputer development project; ISLPED '11 Proceedings of the 17th IEEE/ACM international symposium on Low-power electronics and design, 2011.
[134] TOP500 Supercomputer Sites - Japan's K Computer Tops 10 Petaflop/s to Stay Atop TOP500 List; http://www.top500.org/lists/2011/11/press-release, retrieved Mar. 2012.
[135] Nadav Amit, Muli Ben-Yehuda, Ben-Ami Yassour; IOMMU: Strategies for Mitigating the IOTLB Bottleneck; WIOSCA '10: The Sixth Annual Workshop on the Interaction between Operating Systems and Computer Architecture, 2010, Saint Malo, France.
[136] Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2B: Instruction Set Reference, M-Z; http://download.intel.com/products/processor/manual/265667.pdf, retrieved April 2012.
[137] Xilinx Press Release #0657, Xilinx Unveils 65nm Virtex-5 Family - Industry's Highest Performance Platform FPGAs; http://www.xilinx.com/prs_rls/2006/silicon_vir/0657v5family.htm, retrieved April 2012.
[138] Xilinx press release: New Xilinx Virtex-6 FPGA Family Designed to Satisfy Insatiable Demand for Higher Bandwidth and Lower Power Systems; http://press.xilinx.com/phoenix.zhtml?c=212763&p=irol-newsArticle&ID=1250609&highlight=, retrieved April 2012.
[139] Xilinx 7 Series FPGAs Slash Power Consumption by 50% and Reach 2 Million Logic Cells on Industry's First Scalable Architecture; http://press.xilinx.com/phoenix.zhtml?c=212763&p=irol-newsarticle&ID=1440020, retrieved April 2012.
[140] Lawrence C. Stewart and David Gingold; A New Generation of Cluster Interconnect; http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.87.7574, retrieved April 2012.
[141] Xilinx Content-Addressable Memory v6.1 Data Sheet, DS253; http://www.xilinx.com/support/documentation/ip_documentation/cam_ds253.pdf, retrieved April 2012.
[142] Frank Lemke; Fsmdesigner4 - development of a tool for interactive design and hardware description language generation of finite state machines; Diploma thesis, University of Mannheim, 2006.
[143] W. W. Terpstra; The Case for Soft-CPUs in Accelerator Control Systems; Proceedings of the 13th International Conference on Accelerator and Large Experimental Physics Control Systems (ICALEPCS2011), 2011, Grenoble, France.
[144] Altera website, Nios II Processor: The World's Most Versatile Embedded Processor; http://www.altera.com/devices/processor/nios2/ni2-index.html, retrieved May 2012.


[145] Altera, IP Compiler for PCI Express User Guide, May 2011; http://www.altera.com/literature/ug/ug_pci_express.pdf, retrieved May 2012.
[146] Lattice, PCI Express 2.0 x1, x4 Endpoint IP Core User's Guide, February 2012; http://www.latticesemi.com/documents/ipug75.pdf, retrieved May 2012.
[147] UVM (Standard Universal Verification Methodology); http://www.accellera.org/downloads/standards/uvm, retrieved May 2012.
[148] Kevin Fall and Sally Floyd; Comparisons of Tahoe, Reno, and Sack TCP; Computer Communication Review, Volume 26, pages 5-21, 1995.
[149] QEMU open source processor emulator website; http://wiki.qemu.org/Main_Page, retrieved May 2012.
[150] The Linux Kernel Archives website; http://www.kernel.org/, retrieved May 2012.
[151] OpenVPN: Ethernet Bridging; http://openvpn.net/index.php/open-source/documentation/miscellaneous/76-ethernet-bridging.html, retrieved May 2012.
[152] Marvell Alaska 88X201x, Integrated 10 Gbps 802.3ae Compliant Ethernet Transceivers; http://www.marvell.com/transceivers/assets/Marvell-Alaska-X-88X201x-PHY.pdf, retrieved May 2012.
[153] SystemVerilog DPI; http://en.wikipedia.org/wiki/SystemVerilog_DPI, retrieved May 2012.
[154] Cadence - Incisive Design Team Simulator; http://www.cadence.com/products/ld/design_team_simulator/pages/default.aspx, retrieved May 2012.
[155] Frank Emnett and Mark Biegel; Power reduction through RTL clock gating; Synopsys Users Group (SNUG), 2000, San Jose, CA, USA.
[156] Mark Smotherman; A Brief History of Microprogramming; http://www.cs.clemson.edu/~mark/uprog.html, retrieved June 2012.
[157] Verilator website; http://www.veripool.org/wiki/verilator, retrieved June 2012.
[158] Xilinx ChipScope website; http://www.xilinx.com/tools/cspro.htm, retrieved June 2012.

