High-Performance Implementation of Algorithms on Reconfigurable Hardware
Doctoral Dissertation
Christos Gentsos, M.Sc.
Aristotle University of Thessaloniki
Faculty of Science
School of Physics
July, 2018

Copyright © 2018 Christos Gentsos, Aristotle University of Thessaloniki

This thesis must be used only under the normal conditions of scholarly fair dealing for purposes of research, criticism or review. In particular, no results or conclusions should be extracted from it, nor should it be copied or closely paraphrased in whole or in part without the written consent of the author. Proper written acknowledgement should be made for any assistance obtained from this thesis.

Seven-Member Examination Committee:
Νικολαΐδης Σπυρίδων (*, +), Professor, Aristotle University of Thessaloniki (AUTh)
Αναγνωστόπουλος Αντώνιος (+), Professor, AUTh
Θεοδωρίδης Γεώργιος (+), Assistant Professor, University of Patras
Σίσκος Στυλιανός, Professor, AUTh
Χατζόπουλος Αλκιβιάδης, Professor, AUTh
Κορδάς Κωνσταντίνος, Assistant Professor, AUTh
Σιώζιος Κωνσταντίνος, Assistant Professor, AUTh
(*) Supervisor
(+) Member of the three-member advisory committee

Dedicated to my wife Daphne, my parents Dimitrios and Eleni

Abstract

This work is concerned with the design of high-performance digital circuits on Field-Programmable Gate Array (FPGA) devices. These are generic devices, offering reconfigurable hardware units for digital circuits to be loaded on, and their applications range from the Automotive to the Aerospace sector. The work for this dissertation is two-fold and was motivated by practical problems in the domains of Molecular Diagnostics and High Energy Physics, which call for high-performance implementations of a number of algorithms that map very well to FPGAs. As such, it is arranged in two main parts, one for each application.
The first application calls for a novel implementation of the Canny edge detection algorithm for a real-time machine vision system that powers a microfluidic Lab-on-a-Chip demonstrator. The Canny edge detector is widely popular, having been designed with the objectives of minimizing the error rate and improving the localization of the identified edges. It consists of individual processing steps; two of them, Gaussian smoothing and the Sobel edge detector, are also widely used independently as image processing filters. The novel architecture incorporates various methods and well-researched approximations to optimize for performance, but its standout feature is the strong exploitation of the parallelism offered by modern FPGAs. The architecture is pipelined at both the cycle level and the block level. The simultaneous exploitation of parallelism and pipelining results in a very efficient design that computes four pixels per clock cycle while maintaining a very high operating frequency. At the same time, the memory requirements remain the same as those of a design that does not apply any pixel computation parallelism, while the number of memory read accesses is reduced. The performance achieved by this implementation ranges from 800 Mpixel/s to 1900 Mpixel/s, depending on the FPGA device used. As a specific example, this translates to a computation time of 1.5 ms for a 1.2 Mpixel grayscale image on a Spartan-6 FPGA. To the best of the author's knowledge, this implementation outperforms existing solutions; furthermore, the performance exceeds the system requirements, allowing even high-resolution images to be used in the real-time system.
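The throughput figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes that sustained throughput equals pixels per clock times clock frequency, ignoring pipeline fill latency and any idle cycles. The function names are hypothetical, and the clock frequencies it prints are implied by that assumption rather than stated in the text.

```python
PIXELS_PER_CLOCK = 4  # from the text: four pixels computed per clock cycle


def throughput_mpixel_s(clock_mhz: float) -> float:
    """Pixel throughput in Mpixel/s for a clock frequency given in MHz."""
    return PIXELS_PER_CLOCK * clock_mhz


def implied_clock_mhz(mpixel_s: float) -> float:
    """Clock frequency in MHz implied by a throughput in Mpixel/s (assumption)."""
    return mpixel_s / PIXELS_PER_CLOCK


def frame_time_ms(frame_mpixel: float, mpixel_s: float) -> float:
    """Time in ms to process one frame of the given size at the given rate."""
    return frame_mpixel / mpixel_s * 1000.0


if __name__ == "__main__":
    # Implied clock frequencies for the two ends of the quoted range.
    for rate in (800.0, 1900.0):
        print(f"{rate:.0f} Mpixel/s -> {implied_clock_mhz(rate):.0f} MHz")
    # A 1.2 Mpixel frame at the lower end of the range.
    print(f"{frame_time_ms(1.2, 800.0):.2f} ms per 1.2 Mpixel frame")
```

Under this assumption, the 1.5 ms quoted for a 1.2 Mpixel image corresponds to the lower end of the range, 800 Mpixel/s.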
Performance and resource utilization figures are presented for each component of the implementation, with differences between successive FPGA families briefly discussed. Finally, the integration of the machine vision implementation into an IP core, to be used as a drop-in subsystem in the full design, is also presented for completeness.

The second application involves the redesign of various algorithms used in the Fast TracKer (FTK) project. To perform real-time reconstruction of the trajectories of the particles resulting from collisions inside the ATLAS detector, out of the traces they leave on the silicon detector layers, a system comprising a few thousand FPGAs and custom Application-Specific Integrated Circuits (ASICs) has been realized. The ASICs implement massively parallel comparison operations to perform low-latency pattern matching, each one able to perform 64 G comparisons per second. The FPGAs handle a wide range of tasks, from complex data-moving operations that facilitate pattern matching to high-performance mathematical operations that manipulate the hit coordinates and eventually compute the track parameters. Novel implementations of key components of this system, namely the Data Organizer (DO), the Combiner, and the Track Fitter (TF), have been designed in order to cope with the higher data rates of future scheduled detector upgrades and to lift certain limitations of the existing implementations. The objective is to facilitate the construction of a system based on the principles of FTK for other, even more demanding applications. The DO functions as the bridge between the pattern matching step, performed in low resolution, and the generation of full-resolution hit combinations that form potential tracks to be used in the track fitting step. Each full-resolution hit is stored based on a low-resolution identifier, and can be retrieved based on it.
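The relationship between low-resolution pattern matching and full-resolution hit retrieval can be pictured with a small software model. This is a minimal sketch under stated assumptions, not the FTK implementation: the low-resolution identifier is assumed to be the hit coordinate with its low bits dropped (`COARSE_SHIFT` is a hypothetical parameter), a pattern is assumed to be one low-resolution identifier per layer, and the comparisons that the ASICs perform in parallel are done here in ordinary sequential loops.

```python
COARSE_SHIFT = 4  # hypothetical: drop 4 low bits to form the low-resolution ID


def coarse_id(hit: int) -> int:
    """Low-resolution identifier of a full-resolution hit (assumed mapping)."""
    return hit >> COARSE_SHIFT


def match_patterns(hits_per_layer, pattern_bank):
    """Return indices of patterns that have a matching hit on every layer."""
    coarse_seen = [{coarse_id(h) for h in layer} for layer in hits_per_layer]
    return [
        idx
        for idx, pattern in enumerate(pattern_bank)
        if all(cid in seen for cid, seen in zip(pattern, coarse_seen))
    ]


def hits_for_pattern(hits_per_layer, pattern):
    """Retrieve, per layer, the full-resolution hits lying inside a pattern."""
    return [
        [h for h in layer if coarse_id(h) == cid]
        for layer, cid in zip(hits_per_layer, pattern)
    ]


if __name__ == "__main__":
    hits = [[0x12, 0x35], [0x21, 0x2F], [0x40]]   # three layers of hits
    bank = [(1, 2, 4), (3, 2, 5), (3, 2, 4)]      # hypothetical pattern bank
    matched = match_patterns(hits, bank)
    print(matched)
    print(hits_for_pattern(hits, bank[matched[0]]))
```

In this model the second function plays the role of the retrieval step: given a matched pattern, it returns the full-resolution hits that share each layer's low-resolution identifier.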
The DO's operating principle is based on a novel, fast FPGA implementation of an instantly-erasable array of linked lists. Support for features of the AM ASICs, such as variable-size patterns and missing layers, introduces extra layers of complexity to the architecture. The final implementation supports an operating frequency upwards of 400 MHz, greatly surpassing the specification targets. Advanced design methods were introduced and applied to achieve that performance, such as the automated generation of Look-up Table (LUT) instantiation code, and the interleaving of reading loops with an initiation interval greater than one, so that one memory read port can serve more than one individual read channel. The next component, the Combiner, is given a set of hits for each detector layer; its function is to form all the track-forming combinations out of these hit sets. This design is simpler than the DO; nevertheless, it still outperforms previous designs, and it connects the DO to the final component of this track reconstruction chain, the TF. The latter component performs the track fitting operation by implementing fast scalar products with columns of pre-computed matrices. The goal was to design a flexible novel architecture, optimized to strike a balance between low latency and resource usage while maintaining an operating frequency that approaches the device limits. An architecture involving systolic arrays of registers, hardened Digital Signal Processing (DSP) blocks provided by modern FPGAs, and their dedicated interconnects was devised. By combining the principles of parallelism and pipelining, one full track can be processed per clock cycle; because physical layout considerations were also taken into account early in the design phase, these clock cycles are short, and an implementation that reaches a frequency of 600 MHz was obtained.
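The three stages described above can be illustrated with a compact software model. The sketch rests on several assumptions that are not taken from the thesis: "instant erase" is modeled with an epoch tag on each list head, so that stale lists are ignored without clearing memory; the Combiner is modeled as a plain Cartesian product of the per-layer hit sets; and the TF is modeled as one scalar product per pre-computed column. The class and function names (`ErasableListArray`, `combine`, `fit_track`) are hypothetical, and the hardware designs are pipelined circuits rather than sequential code.

```python
from itertools import product


class ErasableListArray:
    """Array of linked lists whose whole content can be dropped in one step.

    Assumption: erasure is modeled with a per-head epoch tag; the actual
    FPGA mechanism may differ.
    """

    def __init__(self, num_lists: int, capacity: int):
        self.head = [-1] * num_lists       # newest node of each list
        self.head_epoch = [0] * num_lists  # epoch in which each head was set
        self.value = [0] * capacity        # node payloads
        self.next = [-1] * capacity        # node links
        self.free = 0                      # next unused node slot
        self.epoch = 1                     # heads from older epochs are stale

    def erase(self) -> None:
        """Constant-time erase: bump the epoch and reset the write pointer."""
        self.epoch += 1
        self.free = 0

    def store(self, key: int, value: int) -> None:
        if self.free >= len(self.value):
            raise OverflowError("node storage full")
        stale = self.head_epoch[key] != self.epoch
        node = self.free
        self.free += 1
        self.value[node] = value
        self.next[node] = -1 if stale else self.head[key]
        self.head[key] = node
        self.head_epoch[key] = self.epoch

    def retrieve(self, key: int) -> list:
        if self.head_epoch[key] != self.epoch:
            return []                      # list was erased or never written
        out, node = [], self.head[key]
        while node != -1:
            out.append(self.value[node])
            node = self.next[node]
        return out[::-1]                   # return hits in insertion order


def combine(hits_per_layer):
    """All combinations that take exactly one hit from every layer."""
    return list(product(*hits_per_layer))


def fit_track(combination, columns):
    """One scalar product of the hit vector with each pre-computed column."""
    return [sum(c * x for c, x in zip(col, combination)) for col in columns]
```

A short usage example: after `store(1, 10)` and `store(1, 11)`, `retrieve(1)` returns both hits; after `erase()`, every list reads as empty without any per-entry clearing, which is the property the epoch tag is meant to illustrate.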
Furthermore, advanced methodologies were employed for the verification of these components, and a novel method was devised to use the same high-level testbench to verify correct operation both in the Register Transfer Level (RTL) simulation and in the actual implemented design while it is running on the FPGA. Finally, two demonstrators that make use of these implementations are presented: one ad-hoc demonstrator based on an evaluation board, and a research proposal that offers track reconstruction at the Level-1 track trigger for the 2025 HL-LHC CMS detector upgrade. The compromises and approximations made in the algorithms and their justification; the strategies and methodologies employed, or devised, in order to derive these implementations; and the area, power, and performance metrics of the resulting designs are described in detail over the subsequent chapters.

Summary

The work that follows concerns the design of high-performance digital circuits on Field-Programmable Gate Array (FPGA) devices. These are general-purpose devices that offer reconfigurable hardware units for the implementation of digital circuits, and they find application in a range of sectors, from the automotive industry to aerospace. The work presented in this dissertation concerns two main applications arising from the domains of Molecular Diagnostics and High Energy Physics. It was motivated by existing problems in these domains, which call for high-performance implementations of various algorithms. The nature of the applications makes FPGA devices an ideal platform for these implementations. As a result, the dissertation is organized in two main parts, one for each application.
The first application requires the development of a novel implementation of an edge detection algorithm, named Canny, for a real-time system that forms the basis of a demonstrator for a microfluidic Lab-on-a-Chip. The Canny edge detection algorithm is quite popular and was designed with the aim of minimizing errors and optimizing the localization